TinyTorch/datasets/tinydigits/README.md
Vijay Janapa Reddi f099730723 Increase TinyDigits to 1000 samples following Karpathy's philosophy
You were right - 150 samples was too small for decent accuracy.
Following Andrej Karpathy's "~1000 samples" educational dataset philosophy.

Results:
- Before (150 samples): 19% test accuracy (too small!)
- After (1000 samples): 79.5% test accuracy (decent!)

Changes:
- Increased training: 150 → 1000 samples (100 per digit class)
- Increased test: 47 → 200 samples (20 per digit class)
- Perfect class balance: 0.00 std deviation
- File size: 51 KB → 310 KB (still tiny for USB stick)
- Training time: ~3-5 sec → ~8-10 sec (still fast)

Updated:
- create_tinydigits.py: Load from sklearn, generate 1K samples
- train.pkl: 258 KB (1000 samples, perfectly balanced)
- test.pkl: 52 KB (200 samples, balanced)
- README.md: Updated all documentation with new sizes
- mlp_digits.py: Updated docstring to reflect 1K dataset

Dataset Philosophy:
"~1000 samples is the sweet spot for educational datasets"
- Small enough: Trains in seconds on CPU
- Large enough: Achieves decent accuracy (~80%)
- Balanced: Perfect stratification across all classes
- Reproducible: Fixed seed=42 for consistency

Still perfect for TinyTorch-on-a-stick vision:
- 310 KB fits on any USB drive
- Works on RasPi0
- No downloads needed
- Offline-first education
2025-11-10 17:20:54 -05:00

# TinyDigits Dataset
A curated subset of the sklearn digits dataset for rapid ML prototyping and educational demonstrations.
Following Karpathy's "~1000 samples" philosophy for educational datasets.
## Contents
- **Training**: 1000 samples (100 per digit, 0-9)
- **Test**: 200 samples (20 per digit, balanced)
- **Format**: 8×8 grayscale images, float32 normalized [0, 1]
- **Size**: ~310 KB total (vs 10 MB MNIST, ~32× smaller)
## Files
```
datasets/tinydigits/
├── train.pkl # {'images': (1000, 8, 8), 'labels': (1000,)}
└── test.pkl # {'images': (200, 8, 8), 'labels': (200,)}
```
## Usage
```python
import pickle

# Load training data
with open('datasets/tinydigits/train.pkl', 'rb') as f:
    data = pickle.load(f)
train_images = data['images']  # (1000, 8, 8)
train_labels = data['labels']  # (1000,)

# Load test data
with open('datasets/tinydigits/test.pkl', 'rb') as f:
    data = pickle.load(f)
test_images = data['images']  # (200, 8, 8)
test_labels = data['labels']  # (200,)
```
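After loading, it is easy to verify the stated format (8×8 float32 images in [0, 1], perfectly balanced labels). The `check_tinydigits` helper below is illustrative, not part of the dataset:

```python
import numpy as np

def check_tinydigits(images, labels, n_per_class):
    """Sanity-check shapes, normalization, and class balance."""
    # Images: (N, 8, 8) float32, pixels normalized to [0, 1]
    assert images.ndim == 3 and images.shape[1:] == (8, 8)
    assert images.dtype == np.float32
    assert 0.0 <= images.min() and images.max() <= 1.0
    # Labels: exactly n_per_class samples for each digit 0-9
    counts = np.bincount(labels, minlength=10)
    assert (counts == n_per_class).all(), counts
```

For example, `check_tinydigits(train_images, train_labels, 100)` should pass silently on the training split.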
## Purpose
**Educational Infrastructure**: Designed for teaching ML systems with real data at edge-device scale.
Following Andrej Karpathy's philosophy: "~1000 samples is the sweet spot for educational datasets."
- **Decent accuracy**: Achieves ~80% test accuracy on MLPs (vs <20% with 150 samples)
- **Fast training**: <10 sec on CPU, instant feedback loop
- **Balanced classes**: Perfect 100 samples per digit (0-9)
- **Offline-capable**: Ships with repo, no downloads needed
- **USB-friendly**: 310 KB fits on any device, even RasPi0
- **Real learning curve**: Model improves visibly across epochs
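To see the fast feedback loop in action, here is a minimal baseline on the same underlying sklearn digits data, using a 1000-sample stratified split. Note this uses sklearn's `MLPClassifier` as a stand-in, not the TinyTorch MLP that produces the ~80% figure above, so the exact accuracy will differ (sklearn's tuned solver typically scores higher):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Stand-in for the TinyDigits pickles: same sklearn digits source,
# normalized to [0, 1] and flattened to 64-dim vectors for the MLP.
digits = load_digits()
X = (digits.images / 16.0).astype(np.float32).reshape(len(digits.images), -1)
y = digits.target

X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=1000, stratify=y, random_state=42)
X_test, y_test = X_rest[:200], y_rest[:200]

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=42)
clf.fit(X_train, y_train)  # finishes in seconds on CPU
acc = clf.score(X_test, y_test)
```

The whole run, including training, completes in a few seconds on a laptop CPU.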
## Curation Process
Created from the sklearn digits dataset (8×8 downsampled MNIST):
1. **Balanced Sampling**: 100 training samples per digit class (1000 total)
2. **Test Split**: 20 samples per digit (200 total) from remaining examples
3. **Random Seeding**: Reproducible selection (seed=42)
4. **Normalization**: Pixels normalized to [0, 1] range
5. **Shuffled**: Training and test sets randomly shuffled for fair evaluation
The sklearn digits dataset itself is derived from the UCI ML hand-written digits datasets.
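The steps above can be sketched as follows. This is a minimal reconstruction, not the actual `create_tinydigits.py`, which may differ in detail; only the stated parameters (100/20 per class, seed=42, [0, 1] normalization) come from this README:

```python
import numpy as np
from sklearn.datasets import load_digits

def make_tinydigits(train_per_class=100, test_per_class=20, seed=42):
    """Balanced, normalized train/test split of the sklearn digits dataset."""
    digits = load_digits()
    images = (digits.images / 16.0).astype(np.float32)  # raw pixels are 0-16
    labels = digits.target
    rng = np.random.default_rng(seed)

    # Balanced sampling: fixed number of examples per digit class
    train_idx, test_idx = [], []
    for d in range(10):
        idx = rng.permutation(np.where(labels == d)[0])
        train_idx.extend(idx[:train_per_class])
        test_idx.extend(idx[train_per_class:train_per_class + test_per_class])

    # Shuffle so classes are interleaved in the final arrays
    train_idx = rng.permutation(train_idx)
    test_idx = rng.permutation(test_idx)
    train = {'images': images[train_idx], 'labels': labels[train_idx]}
    test = {'images': images[test_idx], 'labels': labels[test_idx]}
    return train, test
```

Pickling the two returned dicts yields files with the layout shown in the Files section.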
## Why TinyDigits vs Full MNIST?
| Metric | MNIST | TinyDigits | Benefit |
|--------|-------|------------|---------|
| Samples | 60,000 | 1,000 | 60× fewer samples |
| File size | 10 MB | 310 KB | 32× smaller |
| Train time | 5-10 min | <10 sec | 30-60× faster |
| Test accuracy (MLP) | ~92% | ~80% | Close enough for learning |
| Download | Network required | Ships with repo | Always available |
| Resolution | 28×28 (784 pixels) | 8×8 (64 pixels) | Faster forward pass |
| Edge deployment | Challenging | Perfect | Works on RasPi0 |
## Educational Progression
TinyDigits serves as the first step in a scaffolded learning path:
1. **TinyDigits (8×8)** ← Start here: Learn MLP/CNN basics with instant feedback
2. **Full MNIST (28×28)** ← Graduate to: Standard benchmark, longer training
3. **CIFAR-10 (32×32 RGB)** ← Advanced: Color images, real-world complexity
## Citation
TinyDigits is curated from the sklearn digits dataset for educational use in TinyTorch.
**Original Source**:
- sklearn.datasets.load_digits()
- Derived from UCI ML hand-written digits datasets
- License: BSD 3-Clause (sklearn)
**TinyTorch Curation**:
```bibtex
@misc{tinydigits2025,
  title={TinyDigits: Curated Educational Dataset for ML Systems Learning},
  author={TinyTorch Project},
  year={2025},
  note={Balanced subset of sklearn digits optimized for edge deployment}
}
```
## Generation
To regenerate this dataset from the original sklearn data:
```bash
python3 datasets/tinydigits/create_tinydigits.py
```
This ensures reproducibility and allows customization for specific educational needs.
## License
See [LICENSE](LICENSE) for details. TinyDigits inherits the BSD 3-Clause license from sklearn.