You were right - 150 samples was too small for decent accuracy.
This follows Andrej Karpathy's "~1000 samples" philosophy for educational datasets.
Results:
- Before (150 samples): 19% test accuracy (too small!)
- After (1000 samples): 79.5% test accuracy (decent!)
Changes:
- Increased training: 150 → 1000 samples (100 per digit class)
- Increased test: 47 → 200 samples (20 per digit class)
- Perfect class balance: 0.00 std deviation
- File size: 51 KB → 310 KB (still tiny for USB stick)
- Training time: ~3-5 sec → ~8-10 sec (still fast)
Updated:
- create_tinydigits.py: Load from sklearn, generate 1K samples
- train.pkl: 258 KB (1000 samples, perfectly balanced)
- test.pkl: 52 KB (200 samples, balanced)
- README.md: Updated all documentation with new sizes
- mlp_digits.py: Updated docstring to reflect 1K dataset
Dataset Philosophy:
"~1000 samples is the sweet spot for educational datasets"
- Small enough: Trains in seconds on CPU
- Large enough: Achieves decent accuracy (~80%)
- Balanced: Perfect stratification across all classes
- Reproducible: Fixed seed=42 for consistency
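The stratified sampling behind the "perfect class balance" can be sketched like this (illustrative only; `stratified_split` is a hypothetical helper, not the actual code in create_tinydigits.py):

```python
import numpy as np

def stratified_split(y, per_class_train, per_class_test, seed=42):
    """Return (train_idx, test_idx) with exactly per_class_* samples per class."""
    rng = np.random.default_rng(seed)  # fixed seed for reproducibility
    train_idx, test_idx = [], []
    for cls in np.unique(y):
        # shuffle the indices of this class, then carve off train/test slices
        idx = rng.permutation(np.flatnonzero(y == cls))
        train_idx.extend(idx[:per_class_train])
        test_idx.extend(idx[per_class_train:per_class_train + per_class_test])
    return np.array(train_idx), np.array(test_idx)
```

Applied to sklearn's 1,797-sample digits set with 100 train and 20 test samples per class, this yields exactly the 1,000/200 split reported above, with zero class-count variance.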
Still a perfect fit for the TinyTorch-on-a-stick vision:
- 310 KB fits on any USB drive
- Works on RasPi0
- No downloads needed
- Offline-first education
Replaces the sklearn-sourced digits_8x8.npz with a TinyTorch-branded dataset.
Changes:
- Created datasets/tinydigits/ (~51KB total):
  - train.pkl: 150 samples (15 per digit class 0-9)
  - test.pkl: 47 samples (balanced across digits)
  - README.md: Full curation documentation
  - LICENSE: BSD 3-Clause with sklearn attribution
  - create_tinydigits.py: Reproducible generation script
- Updated milestones to use TinyDigits:
  - mlp_digits.py: Now loads from datasets/tinydigits/
  - cnn_digits.py: Now loads from datasets/tinydigits/
- Removed old data:
  - datasets/tiny/ (67KB sklearn duplicate)
  - milestones/03_1986_mlp/data/ (67KB old location)
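Loading the pickles from the milestone scripts only takes a few lines. This sketch assumes a dict layout with "images" and "labels" keys, which is a guess at the actual pkl structure:

```python
import pickle

def load_tinydigits(split="train", root="datasets/tinydigits"):
    """Load one split of TinyDigits; key names are an assumption."""
    with open(f"{root}/{split}.pkl", "rb") as f:
        data = pickle.load(f)
    return data["images"], data["labels"]
```

Usage would then be `X_train, y_train = load_tinydigits("train")` at the top of mlp_digits.py or cnn_digits.py.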
Dataset Strategy:
TinyTorch now ships with only 2 curated datasets:
1. TinyDigits (51KB) - 8×8 digits for MLP/CNN milestones
2. TinyTalks (140KB) - Q&A pairs for transformer milestone
Total: 191KB shipped data (perfect for RasPi0 deployment)
Rationale:
- Self-contained: No downloads, works offline
- Citable: TinyTorch educational infrastructure for white paper
- Portable: Tiny footprint enables edge device deployment
- Fast: <5 sec training enables instant student feedback
Updated .gitignore to allow TinyTorch curated datasets while
still blocking downloaded large datasets.
Added the TinyTalks dataset:
- 301 Q&A pairs across 5 progressive difficulty levels
- 17.5 KB total size, optimized for 3-5 minute training
- Includes train/val/test splits (70/15/15)
- Professional documentation (README, DATASHEET, CHANGELOG, SUMMARY)
- Validation and statistics scripts
- Licensed under CC BY 4.0
Dataset designed specifically for TinyTorch Module 13 (Transformers) to provide
immediate learning feedback for students training their first transformer model.
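A 70/15/15 split like the one shipped with the dataset can be produced as follows (a sketch under assumptions; the real split script and its seed may differ):

```python
import random

def three_way_split(items, seed=42):
    """Shuffle deterministically, then cut into 70% train / 15% val / rest test."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(0.70 * n)
    n_val = int(0.15 * n)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]
```

For the 301 Q&A pairs this gives 210 train, 45 validation, and 46 test examples, with the rounding remainder going to the test split.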
Added integration tests for DataLoader:
- test_dataloader_integration.py in tests/integration/
- Training workflow integration
- Shuffle consistency across epochs
- Memory efficiency verification
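The shuffle-consistency check might look roughly like this; `FakeLoader` is a minimal stand-in for illustration, not TinyTorch's actual DataLoader:

```python
import random

class FakeLoader:
    """Minimal stand-in for a shuffling DataLoader (not the real class)."""
    def __init__(self, data, batch_size=2, seed=None):
        self.data = list(data)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def __iter__(self):
        order = self.data[:]
        self.rng.shuffle(order)  # a fresh shuffle every epoch
        for i in range(0, len(order), self.batch_size):
            yield order[i:i + self.batch_size]

def epoch_order(loader):
    """Record the order in which samples appear during one epoch."""
    return [x for batch in loader for x in batch]

loader = FakeLoader(range(100), seed=0)
# Two consecutive epochs should visit every sample, but in different orders.
assert epoch_order(loader) != epoch_order(loader)
```

The real integration test would assert the same two properties against the TinyTorch DataLoader: every sample appears exactly once per epoch, and the order changes between epochs.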
Updated Module 08:
- Added note about optional performance analysis
- Clarified that analysis functions can be run manually
- Clean flow: text → code → tests
Updated datasets/tiny/README.md:
- Minor formatting fixes
Module 08 is now complete and ready to export:
✅ Dataset abstraction
✅ TensorDataset implementation
✅ DataLoader with batching/shuffling
✅ ASCII visualizations for understanding
✅ Unit tests (in module)
✅ Integration tests (in tests/)
✅ Performance analysis tools (optional)
Next: Export with 'bin/tito export 08_dataloader'
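For reference, the abstractions in the checklist above boil down to something like the following minimal sketch (illustrative only; the real Module 08 classes have more features and error handling):

```python
import random

class TensorDataset:
    """Pair up parallel sequences (e.g. images and labels) by index."""
    def __init__(self, *tensors):
        assert all(len(t) == len(tensors[0]) for t in tensors)
        self.tensors = tensors

    def __len__(self):
        return len(self.tensors[0])

    def __getitem__(self, i):
        return tuple(t[i] for t in self.tensors)

class DataLoader:
    """Iterate over a dataset in (optionally shuffled) batches."""
    def __init__(self, dataset, batch_size=1, shuffle=False):
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle

    def __iter__(self):
        order = list(range(len(self.dataset)))
        if self.shuffle:
            random.shuffle(order)
        for i in range(0, len(order), self.batch_size):
            batch = [self.dataset[j] for j in order[i:i + self.batch_size]]
            # transpose list-of-samples into per-field batches
            yield tuple(list(col) for col in zip(*batch))
```

A training loop then reads `for images, labels in DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True): ...`.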
Created datasets/tiny/ for shipping small datasets with TinyTorch:
New Structure:
- datasets/tiny/digits_8x8.npz (67KB, 1,797 samples)
  - 8×8 handwritten digits from UCI/sklearn
  - Normalized to [0, 1], ready for immediate use
  - Perfect for DataLoader learning (Module 08)
- datasets/tiny/README.md
  - Full documentation and usage examples
  - Philosophy: tiny (learn) → full (practice) → custom (master)
- datasets/tiny/create_digits_8x8.py
  - Extraction script showing how the dataset was created
  - Reproducible from sklearn.datasets.load_digits()
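The extraction script likely amounts to a few lines like these (the .npz key names are assumptions; the normalization to [0, 1] matches the description above):

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()  # 1,797 8x8 images, pixel values 0-16
images = (digits.images / 16.0).astype(np.float32)  # scale 0-16 -> [0, 1]
np.savez_compressed("digits_8x8.npz", images=images, labels=digits.target)
```

`savez_compressed` is what keeps the shipped file in the tens-of-kilobytes range rather than megabytes.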
Updated .gitignore:
- Ignore datasets/* (downloaded large files)
- Allow datasets/tiny/ (shipped small files)
- Allow datasets/README.md and download scripts
- Selectively ignore .npz files (not in tiny/)
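The resulting allow-list might look roughly like this (an approximation; check the actual .gitignore for the exact rules):

```gitignore
# Ignore downloaded datasets, but ship the tiny curated ones
datasets/*
!datasets/tiny/
!datasets/README.md
!datasets/download_*.py

# Block .npz archives everywhere except datasets/tiny/
*.npz
!datasets/tiny/*.npz
```

Note the ordering matters: the `!` re-include patterns must come after the broad ignore patterns they override.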
Benefits:
✅ Zero download friction for Module 08
✅ Offline-friendly (planes, classrooms, slow networks)
✅ Real handwritten digits (not synthetic noise)
✅ Git-friendly size (67KB vs 10MB MNIST)
✅ Same shape/format students will use for CNNs
Progression:
- Module 08: Learn DataLoader with 8×8 digits
- Milestone 03: Train on full 28×28 MNIST
- Milestone 04: Scale to CIFAR-10
- Created download_mnist.py script to fetch the Fashion-MNIST dataset
- Added a README explaining the dataset format and download process
- Fashion-MNIST is used as an accessible alternative to the original MNIST
- The same IDX format allows seamless use with the existing examples
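The download loop can be sketched as below. The mirror URL and filenames follow the Fashion-MNIST project's published conventions (and match original MNIST's filenames, which is what makes the swap seamless), but treat the exact paths as assumptions rather than the script's literal contents:

```python
import urllib.request
from pathlib import Path

# Official Fashion-MNIST mirror; files share original MNIST's IDX names.
BASE = "http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/"
FILES = [
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
]

def download_all(dest="datasets/fashion_mnist"):
    """Fetch each archive, skipping any that already exist locally."""
    Path(dest).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        target = Path(dest) / name
        if not target.exists():
            urllib.request.urlretrieve(BASE + name, target)
```

Because the filenames and IDX layout match original MNIST exactly, any loader written for one dataset reads the other unchanged.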