Mirror of https://github.com/harvard-edge/cs249r_book.git, synced 2026-03-09 07:15:51 -05:00
Add curated educational datasets for TinyTorch milestones:

TinyDigits (~310 KB):
- 1000 train + 200 test samples of 8x8 digit images
- Balanced: 100 samples per digit class (0-9)
- Used by Milestones 03 (MLP) and 04 (CNN)
- Created from sklearn digits, normalized to [0,1]

TinyTalks (~40 KB):
- 350 Q&A pairs across 5 difficulty levels
- Character-level conversational dataset
- Used by Milestone 05 (Transformer)
- Designed for fast training (3-5 min on laptop)

Both datasets follow Karpathy's ~1K samples philosophy:
- Small enough to ship with repo
- Large enough for meaningful learning
- Fast training with instant feedback
- Works offline, no downloads needed
TinyTorch Datasets
This directory contains datasets for TinyTorch milestone examples.
Directory Structure
datasets/
├── tinydigits/ ← 8×8 handwritten digits (ships with repo, ~310 KB)
├── tinytalks/ ← Q&A dataset for transformers (ships with repo, ~40 KB)
└── README.md ← This file
Shipped Datasets (No Download Required)
TinyDigits
- Used by: Milestones 03 & 04 (MLP and CNN examples)
- Contents: 1,000 training + 200 test samples
- Format: 8×8 grayscale images, pickled
- Size: ~310 KB
- Purpose: Fast iteration on real image classification
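As a rough illustration of working with a pickled image split, the sketch below loads one file and counts samples per class. The dictionary keys `images` and `labels`, and any file name you pass in, are assumptions made for this example; this README does not specify the pickle layout, so inspect the files in tinydigits/ before relying on it.

```python
import pickle
from pathlib import Path


def load_pickled_split(path):
    """Load one pickled split and return (images, labels).

    Assumes (hypothetically) that the pickle holds a dict with
    'images' and 'labels' entries of equal length.
    """
    with Path(path).open("rb") as f:
        data = pickle.load(f)
    images, labels = data["images"], data["labels"]
    if len(images) != len(labels):
        raise ValueError("images and labels differ in length")
    return images, labels


def class_counts(labels):
    """Count samples per class, e.g. to check that a split is balanced."""
    counts = {}
    for label in labels:
        key = int(label)
        counts[key] = counts.get(key, 0) + 1
    return counts
```

Unpickling can execute arbitrary code, so only load pickle files from a source you trust, such as the ones shipped in this repository.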
TinyTalks
- Used by: Milestone 05 (Transformer/GPT examples)
- Contents: 350 Q&A pairs across 5 difficulty levels
- Format: Plain text (Q: ... A: ... format)
- Size: ~40 KB
- Purpose: Character-level conversational AI training
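The "Q: ... A: ..." format can be turned into training pairs with a small parser. The sketch below assumes each question begins a line with `Q:` and its answer begins a later line with `A:`; that line layout, and the helper names `parse_qa_pairs` and `build_char_vocab`, are assumptions for illustration rather than part of TinyTorch.

```python
def parse_qa_pairs(text):
    """Parse 'Q: ... A: ...' text into a list of (question, answer) tuples.

    Assumes each question starts a line with 'Q:' and its answer starts a
    later line with 'A:'; other non-blank lines are treated as continuations.
    """
    pairs = []
    question, answer = None, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Q:"):
            # A new question closes the previous complete pair, if any.
            if question is not None and answer is not None:
                pairs.append((question, answer))
            question, answer = line[2:].strip(), None
        elif line.startswith("A:") and question is not None:
            answer = line[2:].strip()
        elif line and answer is not None:
            answer += " " + line
        elif line and question is not None:
            question += " " + line
    if question is not None and answer is not None:
        pairs.append((question, answer))
    return pairs


def build_char_vocab(text):
    """Map each distinct character to an integer id (character-level tokens)."""
    return {ch: i for i, ch in enumerate(sorted(set(text)))}
```

A character-level model would then encode each pair with the vocabulary, for example `[vocab[ch] for ch in question]`.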
Downloaded Datasets (On-Demand)
The milestones automatically download larger datasets when needed:
MNIST
- Used by: milestones/03_1986_mlp/02_rumelhart_mnist.py
- Downloads to: milestones/datasets/mnist/
- Contents: 60K training + 10K test samples
- Format: 28×28 grayscale images
- Size: ~10 MB compressed
- Auto-downloaded by: milestones/data_manager.py
CIFAR-10
- Used by: milestones/04_1998_cnn/02_lecun_cifar10.py
- Downloads to: milestones/datasets/cifar-10/
- Contents: 50K training + 10K test samples
- Format: 32×32 RGB images
- Size: ~170 MB compressed
- Auto-downloaded by: milestones/data_manager.py
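The download-on-demand behavior described above can be pictured as a cache check in front of a download step. The sketch below is not the code in milestones/data_manager.py; `ensure_dataset` and the caller-supplied `download_fn` are hypothetical names used only to show the pattern.

```python
from pathlib import Path


def ensure_dataset(name, root, download_fn):
    """Return the dataset directory, downloading only if it is missing or empty.

    download_fn(target_dir) is a caller-supplied callable (hypothetical here)
    that is expected to populate target_dir.
    """
    target = Path(root) / name
    if target.is_dir() and any(target.iterdir()):
        return target  # already cached, no network access needed
    target.mkdir(parents=True, exist_ok=True)
    download_fn(target)
    return target
```

Passing the downloader in as a callable keeps the cache check testable without network access; a second call for the same dataset returns the cached directory without invoking the downloader again.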
Design Philosophy
Shipped datasets follow Karpathy's "~1K samples" philosophy:
- Small enough to ship with repo
- Large enough for meaningful learning
- Fast training (seconds to minutes)
- Instant gratification for students
Downloaded datasets are full benchmarks:
- Standard ML benchmarks (MNIST, CIFAR-10)
- Larger, slower, more realistic
- Auto-downloaded only when needed
- Used for scaling demonstrations
Total Repository Size
- Shipped data: ~350 KB (tinydigits + tinytalks)
- USB-friendly: Entire repo fits on any device
- Offline-capable: Core milestones work without internet
- Git-friendly: No large binary files in version control
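One way to keep the size figures above honest is to measure the shipped data directly. The sketch below sums file sizes under a directory and lists any file above a threshold; the helper names and whatever threshold you choose are assumptions for illustration, not project policy.

```python
from pathlib import Path


def directory_size_bytes(root):
    """Sum the sizes of all regular files under root."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())


def files_over_limit(root, limit_bytes):
    """List files larger than limit_bytes, largest first."""
    big = [
        (p.stat().st_size, p)
        for p in Path(root).rglob("*")
        if p.is_file() and p.stat().st_size > limit_bytes
    ]
    return [p for _, p in sorted(big, key=lambda item: item[0], reverse=True)]
```

For example, `directory_size_bytes("datasets") / 1024` gives the shipped size in KiB, and `files_over_limit("datasets", 1_000_000)` would flag any file over roughly 1 MB before it is committed.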