# TinyDigits Dataset

A curated subset of the sklearn digits dataset for rapid ML prototyping and educational demonstrations.

Following Karpathy's "~1000 samples" philosophy for educational datasets.
## Contents

- **Training**: 1000 samples (100 per digit, 0-9)
- **Test**: 200 samples (20 per digit, balanced)
- **Format**: 8×8 grayscale images, float32 normalized to [0, 1]
- **Size**: ~310 KB total (vs ~10 MB for MNIST, roughly 32× smaller)
## Files

```
datasets/tinydigits/
├── train.pkl   # {'images': (1000, 8, 8), 'labels': (1000,)}
└── test.pkl    # {'images': (200, 8, 8), 'labels': (200,)}
```
## Usage

```python
import pickle

# Load training data
with open('datasets/tinydigits/train.pkl', 'rb') as f:
    data = pickle.load(f)
train_images = data['images']  # (1000, 8, 8) float32 array
train_labels = data['labels']  # (1000,) integer labels

# Load test data
with open('datasets/tinydigits/test.pkl', 'rb') as f:
    data = pickle.load(f)
test_images = data['images']  # (200, 8, 8) float32 array
test_labels = data['labels']  # (200,) integer labels
```
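To make the file layout concrete, the dict-of-arrays format can be round-tripped end to end. This is a self-contained sketch, not the shipped data: plain nested lists stand in for the float32 NumPy arrays in the real `.pkl` files, and the temp path is illustrative.

```python
import os
import pickle
import tempfile

# Build a TinyDigits-style payload: a dict with 'images' and 'labels'.
# Nested lists stand in here for the (N, 8, 8) float32 arrays.
payload = {
    'images': [[[0.0] * 8 for _ in range(8)] for _ in range(10)],  # 10 blank 8x8 images
    'labels': list(range(10)),                                     # one label per image
}

path = os.path.join(tempfile.mkdtemp(), 'train.pkl')
with open(path, 'wb') as f:
    pickle.dump(payload, f)

# Round-trip: load it back exactly as the Usage snippet above does.
with open(path, 'rb') as f:
    data = pickle.load(f)

assert sorted(data.keys()) == ['images', 'labels']
assert len(data['images']) == len(data['labels']) == 10
```

Storing a single dict per split keeps images and labels paired in one file, so a loader can never mix up which labels belong to which split.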
## Purpose

**Educational Infrastructure**: Designed for teaching ML systems with real data at edge-device scale.

Following Andrej Karpathy's philosophy that roughly 1,000 samples is the sweet spot for an educational dataset:

- **Decent accuracy**: Achieves ~80% test accuracy with an MLP (vs <20% with only 150 samples)
- **Fast training**: <10 sec on CPU, an instant feedback loop
- **Balanced classes**: Exactly 100 samples per digit (0-9)
- **Offline-capable**: Ships with the repo, no downloads needed
- **USB-friendly**: 310 KB fits on any device, even a Raspberry Pi Zero
- **Real learning curve**: Model improves visibly across epochs
## Curation Process

Created from the sklearn digits dataset (8×8 MNIST-style digit images):

1. **Balanced Sampling**: 100 training samples per digit class (1000 total)
2. **Test Split**: 20 samples per digit (200 total) from the remaining examples
3. **Random Seeding**: Reproducible selection (seed=42)
4. **Normalization**: Pixels scaled to the [0, 1] range
5. **Shuffling**: Training and test sets randomly shuffled for fair evaluation
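The steps above can be sketched in plain Python. This is an illustrative reconstruction, not the actual `create_tinydigits.py`: synthetic labels stand in for the sklearn data, and `balanced_sample` is a hypothetical helper name.

```python
import random
from collections import Counter, defaultdict

def balanced_sample(labels, per_class=100, seed=42):
    """Pick `per_class` indices per label value, reproducibly (steps 1-3)."""
    rng = random.Random(seed)          # step 3: fixed seed for reproducibility
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for label in sorted(by_class):     # step 1: equal count from every class
        chosen.extend(rng.sample(by_class[label], per_class))
    rng.shuffle(chosen)                # step 5: shuffle the final selection
    return chosen

# Synthetic stand-in: ~180 examples of each digit 0-9, as in sklearn digits.
labels = [d for d in range(10) for _ in range(180)]
train_idx = balanced_sample(labels)

counts = Counter(labels[i] for i in train_idx)
assert len(train_idx) == 1000
assert all(counts[d] == 100 for d in range(10))

# Step 4: sklearn digit pixels are integers in [0, 16]; dividing by 16.0
# maps them onto [0, 1].
raw_pixel_row = [0, 4, 16, 9]
assert [p / 16.0 for p in raw_pixel_row] == [0.0, 0.25, 1.0, 0.5625]
```

Sampling per class before shuffling is what guarantees the exact 100-per-digit balance; a plain random split of 1000 samples would only be balanced in expectation.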
The sklearn digits dataset itself is derived from the UCI ML hand-written digits datasets.
## Why TinyDigits vs Full MNIST?

| Metric | MNIST | TinyDigits | Benefit |
|--------|-------|------------|---------|
| Samples | 60,000 | 1,000 | 60× fewer samples |
| File size | ~10 MB | ~310 KB | ~32× smaller |
| Train time | 5-10 min | <10 sec | 30-60× faster |
| Test accuracy (MLP) | ~92% | ~80% | Close enough for learning |
| Download | Network required | Ships with repo | Always available |
| Resolution | 28×28 (784 pixels) | 8×8 (64 pixels) | Faster forward pass |
| Edge deployment | Challenging | Straightforward | Works on a Raspberry Pi Zero |
## Educational Progression

TinyDigits serves as the first step in a scaffolded learning path:

1. **TinyDigits (8×8)** ← Start here: learn MLP/CNN basics with instant feedback
2. **Full MNIST (28×28)** ← Graduate to: the standard benchmark, longer training
3. **CIFAR-10 (32×32 RGB)** ← Advanced: color images, real-world complexity
## Citation

TinyDigits is curated from the sklearn digits dataset for educational use in TinyTorch.

**Original Source**:

- `sklearn.datasets.load_digits()`
- Derived from the UCI ML hand-written digits datasets
- License: BSD 3-Clause (sklearn)

**TinyTorch Curation**:

```bibtex
@misc{tinydigits2025,
  title={TinyDigits: Curated Educational Dataset for ML Systems Learning},
  author={TinyTorch Project},
  year={2025},
  note={Balanced subset of sklearn digits optimized for edge deployment}
}
```
## Generation

To regenerate this dataset from the original sklearn data:

```bash
python3 datasets/tinydigits/create_tinydigits.py
```

This ensures reproducibility and allows customization for specific educational needs.
## License

See [LICENSE](LICENSE) for details. TinyDigits inherits the BSD 3-Clause license from sklearn.