# TinyDigits Dataset

A curated subset of the sklearn digits dataset for rapid ML prototyping and educational demonstrations.

Following Karpathy's "~1000 samples" philosophy for educational datasets.
## Contents

- **Training**: 1000 samples (100 per digit, 0-9)
- **Test**: 200 samples (20 per digit, balanced)
- **Format**: 8×8 grayscale images, float32 normalized to [0, 1]
- **Size**: ~310 KB total (vs ~10 MB for MNIST, roughly 32× smaller)
## Files

```
datasets/tinydigits/
├── train.pkl   # {'images': (1000, 8, 8), 'labels': (1000,)}
└── test.pkl    # {'images': (200, 8, 8), 'labels': (200,)}
```
## Usage

```python
import pickle

# Load training data
with open('datasets/tinydigits/train.pkl', 'rb') as f:
    data = pickle.load(f)
train_images = data['images']  # (1000, 8, 8) float32 array
train_labels = data['labels']  # (1000,) integer labels

# Load test data
with open('datasets/tinydigits/test.pkl', 'rb') as f:
    data = pickle.load(f)
test_images = data['images']  # (200, 8, 8) float32 array
test_labels = data['labels']  # (200,) integer labels
```
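To make the file layout concrete, the dict-of-arrays format can be round-tripped end to end. This is a self-contained sketch, not the shipped data: plain nested lists stand in for the float32 NumPy arrays in the real `.pkl` files, and the temp path is illustrative.

```python
import os
import pickle
import tempfile

# Build a TinyDigits-style payload: a dict with 'images' and 'labels'.
# Nested lists stand in here for the (N, 8, 8) float32 arrays.
payload = {
    'images': [[[0.0] * 8 for _ in range(8)] for _ in range(10)],  # 10 blank 8x8 images
    'labels': list(range(10)),                                     # one label per image
}

path = os.path.join(tempfile.mkdtemp(), 'train.pkl')
with open(path, 'wb') as f:
    pickle.dump(payload, f)

# Round-trip: load it back exactly as the Usage snippet above does.
with open(path, 'rb') as f:
    data = pickle.load(f)

assert sorted(data.keys()) == ['images', 'labels']
assert len(data['images']) == len(data['labels']) == 10
```

Storing a single dict per split keeps images and labels paired in one file, so a loader can never mix up which labels belong to which split.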
## Purpose

**Educational Infrastructure**: Designed for teaching ML systems with real data at edge-device scale.

Following Andrej Karpathy's philosophy that roughly 1,000 samples is the sweet spot for an educational dataset:

- **Decent accuracy**: Achieves ~80% test accuracy with an MLP (vs <20% with only 150 samples)
- **Fast training**: <10 sec on CPU, an instant feedback loop
- **Balanced classes**: Exactly 100 samples per digit (0-9)
- **Offline-capable**: Ships with the repo, no downloads needed
- **USB-friendly**: 310 KB fits on any device, even a Raspberry Pi Zero
- **Real learning curve**: Model improves visibly across epochs
## Curation Process

Created from the sklearn digits dataset (8×8 MNIST-style digit images):

1. **Balanced Sampling**: 100 training samples per digit class (1000 total)
2. **Test Split**: 20 samples per digit (200 total) from the remaining examples
3. **Random Seeding**: Reproducible selection (seed=42)
4. **Normalization**: Pixels scaled to the [0, 1] range
5. **Shuffling**: Training and test sets randomly shuffled for fair evaluation
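The steps above can be sketched in plain Python. This is an illustrative reconstruction, not the actual `create_tinydigits.py`: synthetic labels stand in for the sklearn data, and `balanced_sample` is a hypothetical helper name.

```python
import random
from collections import Counter, defaultdict

def balanced_sample(labels, per_class=100, seed=42):
    """Pick `per_class` indices per label value, reproducibly (steps 1-3)."""
    rng = random.Random(seed)          # step 3: fixed seed for reproducibility
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for label in sorted(by_class):     # step 1: equal count from every class
        chosen.extend(rng.sample(by_class[label], per_class))
    rng.shuffle(chosen)                # step 5: shuffle the final selection
    return chosen

# Synthetic stand-in: ~180 examples of each digit 0-9, as in sklearn digits.
labels = [d for d in range(10) for _ in range(180)]
train_idx = balanced_sample(labels)

counts = Counter(labels[i] for i in train_idx)
assert len(train_idx) == 1000
assert all(counts[d] == 100 for d in range(10))

# Step 4: sklearn digit pixels are integers in [0, 16]; dividing by 16.0
# maps them onto [0, 1].
raw_pixel_row = [0, 4, 16, 9]
assert [p / 16.0 for p in raw_pixel_row] == [0.0, 0.25, 1.0, 0.5625]
```

Sampling per class before shuffling is what guarantees the exact 100-per-digit balance; a plain random split of 1000 samples would only be balanced in expectation.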
The sklearn digits dataset itself is derived from the UCI ML hand-written digits datasets.
## Why TinyDigits vs Full MNIST?

| Metric | MNIST | TinyDigits | Benefit |
|--------|-------|------------|---------|
| Samples | 60,000 | 1,000 | 60× fewer samples |
| File size | ~10 MB | ~310 KB | ~32× smaller |
| Train time | 5-10 min | <10 sec | 30-60× faster |
| Test accuracy (MLP) | ~92% | ~80% | Close enough for learning |
| Download | Network required | Ships with repo | Always available |
| Resolution | 28×28 (784 pixels) | 8×8 (64 pixels) | Faster forward pass |
| Edge deployment | Challenging | Straightforward | Works on a Raspberry Pi Zero |
## Educational Progression

TinyDigits serves as the first step in a scaffolded learning path:

1. **TinyDigits (8×8)** ← Start here: learn MLP/CNN basics with instant feedback
2. **Full MNIST (28×28)** ← Graduate to: the standard benchmark, longer training
3. **CIFAR-10 (32×32 RGB)** ← Advanced: color images, real-world complexity
## Citation

TinyDigits is curated from the sklearn digits dataset for educational use in TinyTorch.

**Original Source**:

- `sklearn.datasets.load_digits()`
- Derived from the UCI ML hand-written digits datasets
- License: BSD 3-Clause (sklearn)

**TinyTorch Curation**:

```bibtex
@misc{tinydigits2025,
  title={TinyDigits: Curated Educational Dataset for ML Systems Learning},
  author={TinyTorch Project},
  year={2025},
  note={Balanced subset of sklearn digits optimized for edge deployment}
}
```
## Generation

To regenerate this dataset from the original sklearn data:

```bash
python3 datasets/tinydigits/create_tinydigits.py
```

This ensures reproducibility and allows customization for specific educational needs.
## License

See [LICENSE](LICENSE) for details. TinyDigits inherits the BSD 3-Clause license from sklearn.