2 Commits

Author SHA1 Message Date
Vijay Janapa Reddi
f099730723 Increase TinyDigits to 1000 samples following Karpathy's philosophy
You were right: 150 samples were too few for decent accuracy.
Following Andrej Karpathy's "~1000 samples" educational dataset philosophy.

Results:
- Before (150 samples): 19% test accuracy (too small!)
- After (1000 samples): 79.5% test accuracy (decent!)

Changes:
- Increased training: 150 → 1000 samples (100 per digit class)
- Increased test: 47 → 200 samples (20 per digit class)
- Perfect class balance: 0.00 standard deviation in per-class sample counts
- File size: 51 KB → 310 KB (still small enough for a USB stick)
- Training time: ~3-5 sec → ~8-10 sec (still fast)
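
The class-balance figure above can be checked by counting labels per class and taking the standard deviation of those counts. This is a minimal sketch, not code from create_tinydigits.py: `class_balance` is a hypothetical helper, and the labels are a synthetic stand-in for the real training labels (assumed to be integers 0-9).

```python
from collections import Counter
from statistics import pstdev

def class_balance(labels, num_classes=10):
    """Return (per-class counts, population std deviation of those counts)."""
    counts = Counter(labels)
    per_class = [counts.get(c, 0) for c in range(num_classes)]
    return per_class, pstdev(per_class)

# Synthetic stand-in for the 1000 training labels: 100 per digit class.
labels = [digit for digit in range(10) for _ in range(100)]
per_class, spread = class_balance(labels)
print(per_class, spread)
```

A spread of zero means every class has exactly the same number of samples; any nonzero value indicates imbalance.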

Updated:
- create_tinydigits.py: Load from sklearn, generate 1K samples
- train.pkl: 258 KB (1000 samples, perfectly balanced)
- test.pkl: 52 KB (200 samples, balanced)
- README.md: Updated all documentation with new sizes
- mlp_digits.py: Updated docstring to reflect 1K dataset
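
For reference, writing a split to a pickle file and checking its size on disk might look like the sketch below. The dict layout (`images`, `labels`) is an assumption for illustration only, since the actual structure of train.pkl and test.pkl is not described here; `save_and_measure` and `load_dataset` are hypothetical helpers, and the data is a zero-filled placeholder.

```python
import os
import pickle
import tempfile

def save_and_measure(dataset, path):
    """Pickle `dataset` to `path` and return the file size in KB."""
    with open(path, "wb") as f:
        pickle.dump(dataset, f, protocol=pickle.HIGHEST_PROTOCOL)
    return os.path.getsize(path) / 1024

def load_dataset(path):
    """Load a pickled dataset from local disk (no network access needed)."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Hypothetical layout: 200 flattened 8x8 images plus one label per image.
dataset = {
    "images": [[0.0] * 64 for _ in range(200)],
    "labels": [i % 10 for i in range(200)],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.pkl")
    size_kb = save_and_measure(dataset, path)
    restored = load_dataset(path)

print(f"{size_kb:.1f} KB, {len(restored['labels'])} samples")
```

The size printed by this sketch depends on the placeholder layout and will not match the figures listed above.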

Dataset Philosophy:
"~1000 samples is the sweet spot for educational datasets"
- Small enough: Trains in seconds on CPU
- Large enough: Achieves decent accuracy (~80%)
- Balanced: Perfect stratification across all classes
- Reproducible: Fixed seed=42 for consistency
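
The balanced, reproducible sampling described above can be sketched as a seeded per-class draw. This is an illustration under stated assumptions, not the code in create_tinydigits.py: `stratified_split` is a hypothetical helper, and the labels are synthetic rather than loaded from sklearn.

```python
import random
from collections import defaultdict

def stratified_split(labels, n_train, n_test, seed=42):
    """Pick n_train + n_test indices per class; return (train_idx, test_idx)."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        pool = by_class[label]
        if len(pool) < n_train + n_test:
            raise ValueError(f"class {label} has only {len(pool)} samples")
        chosen = rng.sample(pool, n_train + n_test)  # no repeats within a class
        train_idx.extend(chosen[:n_train])
        test_idx.extend(chosen[n_train:])
    return train_idx, test_idx

# Synthetic labels: 150 examples of each digit (a stand-in for the source data).
labels = [i % 10 for i in range(1500)]
train_idx, test_idx = stratified_split(labels, n_train=100, n_test=20)
print(len(train_idx), len(test_idx))
```

Because each class contributes the same number of indices and the train and test picks come from one draw without replacement, the two splits are balanced and disjoint, and rerunning with the same seed yields the same split.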

Still perfect for TinyTorch-on-a-stick vision:
- 310 KB fits on any USB drive
- Works on a Raspberry Pi Zero
- No downloads needed
- Offline-first education
2025-11-10 17:20:54 -05:00
Vijay Janapa Reddi
89c9e0dd7e Create TinyDigits educational dataset for self-contained TinyTorch
Replaces the sklearn-sourced digits_8x8.npz with a TinyTorch-branded dataset.

Changes:
- Created datasets/tinydigits/ (~51KB total)
  - train.pkl: 150 samples (15 per digit class 0-9)
  - test.pkl: 47 samples (approximately balanced across digits)
  - README.md: Full curation documentation
  - LICENSE: BSD 3-Clause with sklearn attribution
  - create_tinydigits.py: Reproducible generation script

- Updated milestones to use TinyDigits:
  - mlp_digits.py: Now loads from datasets/tinydigits/
  - cnn_digits.py: Now loads from datasets/tinydigits/

- Removed old data:
  - datasets/tiny/ (67KB sklearn duplicate)
  - milestones/03_1986_mlp/data/ (67KB old location)

Dataset Strategy:
TinyTorch now ships with only 2 curated datasets:
1. TinyDigits (51KB) - 8x8 digits for MLP/CNN milestones
2. TinyTalks (140KB) - Q&A pairs for transformer milestone

Total: 191KB shipped data (small enough for Raspberry Pi Zero deployment)

Rationale:
- Self-contained: No downloads, works offline
- Citable: Can be referenced as TinyTorch educational infrastructure in the white paper
- Portable: Tiny footprint enables edge device deployment
- Fast: <5 sec training enables instant student feedback

Updated .gitignore to allow TinyTorch curated datasets while
still blocking downloaded large datasets.
2025-11-10 16:59:43 -05:00