TinyTorch Datasets
This directory contains datasets for TinyTorch milestone examples.
Directory Structure
datasets/
├── tinydigits/ ← 8×8 handwritten digits (ships with repo, ~310 KB)
├── tinytalks/ ← conversational Q&A text for transformers (~40 KB)
└── README.md ← This file
Shipped Datasets (No Download Required)
TinyDigits
- Used by: Milestones 03 & 04 (MLP and CNN examples)
- Contents: 1,000 training + 200 test samples
- Format: 8×8 grayscale images, pickled
- Size: ~310 KB
- Purpose: Fast iteration on real image classification
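As a rough illustration of working with a pickled image set like this one, the sketch below loads a pickle file and checks that every image is 8×8. The function name `load_pickled_digits` and the dictionary keys `images` and `labels` are assumptions made for this example, not the confirmed TinyDigits layout; inspect the files in `tinydigits/` for the real structure.

```python
import pickle


def load_pickled_digits(path):
    """Load a pickled dataset and validate its basic shape.

    Assumed (hypothetical) layout: a dict with an "images" sequence of
    8x8 grids and a "labels" sequence of the same length.
    """
    # Only unpickle files you trust: pickle can execute arbitrary code.
    with open(path, "rb") as f:
        data = pickle.load(f)

    images, labels = data["images"], data["labels"]
    if len(images) != len(labels):
        raise ValueError("images and labels differ in length")

    # len() works for nested lists and for NumPy arrays alike.
    for img in images:
        if len(img) != 8 or any(len(row) != 8 for row in img):
            raise ValueError("expected 8x8 images")
    return images, labels
```

Validating shapes at load time surfaces a wrong file or key name immediately, rather than as a confusing shape error deep inside a training loop.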
TinyTalks
- Used by: Transformer and language-modeling lessons (e.g. Milestone 05 materials and extensions)
- Contents: 350 Q&A pairs across 5 difficulty levels
- Format: Plain text (Q:/A: lines), character-level friendly
- Size: ~40 KB
- Purpose: Fast, offline conversational text for attention and GPT-style experiments
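To show what "Q:/A: lines, character-level friendly" can mean in practice, here is a minimal sketch that parses such text into pairs and builds a character vocabulary. It assumes each question is a line starting with `Q:` followed by a line starting with `A:`; the actual TinyTalks file may use a different layout, and `parse_qa` and `build_char_vocab` are hypothetical helpers, not part of TinyTorch.

```python
def parse_qa(text):
    """Parse 'Q:' / 'A:' lines into a list of (question, answer) pairs."""
    pairs = []
    question = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:") and question is not None:
            pairs.append((question, line[2:].strip()))
            question = None  # wait for the next question
    return pairs


def build_char_vocab(text):
    """Map each distinct character to an integer id (character-level tokens)."""
    return {ch: i for i, ch in enumerate(sorted(set(text)))}


# Made-up sample text, not taken from the TinyTalks dataset.
sample = "Q: What is 2+2?\nA: 4\n\nQ: Hi\nA: Hello!\n"
pairs = parse_qa(sample)
vocab = build_char_vocab(sample)
encoded = [vocab[ch] for ch in pairs[0][0]]  # first question as token ids
```

A character-level vocabulary stays small for a corpus of this size, which keeps the embedding table and output layer cheap for quick transformer experiments.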
Downloaded Datasets (On-Demand)
The milestones automatically download larger datasets when needed:
MNIST
- Used by: milestones/03_1986_mlp/02_rumelhart_mnist.py
- Downloads to: milestones/datasets/mnist/
- Contents: 60K training + 10K test samples
- Format: 28×28 grayscale images
- Size: ~10 MB compressed
- Auto-downloaded by: milestones/data_manager.py
CIFAR-10
- Used by: milestones/04_1998_cnn/02_lecun_cifar10.py
- Downloads to: milestones/datasets/cifar-10/
- Contents: 50K training + 10K test samples
- Format: 32×32 RGB images
- Size: ~170 MB compressed
- Auto-downloaded by: milestones/data_manager.py
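The on-demand behavior described above amounts to a "download only if missing" check. The sketch below illustrates that pattern in general terms; it is not the code in `milestones/data_manager.py`, and the function name `ensure_dataset`, its parameters, and the URL in the usage note are assumptions for illustration only.

```python
import urllib.request
from pathlib import Path


def ensure_dataset(name, url, root="milestones/datasets",
                   fetch=urllib.request.urlretrieve):
    """Return (archive_path, downloaded) for a dataset archive.

    Hypothetical helper: downloads the archive only when it is not
    already present under root/name.
    """
    target_dir = Path(root) / name
    archive = target_dir / url.rsplit("/", 1)[-1]
    if archive.exists():
        return archive, False  # already cached, no network access needed

    target_dir.mkdir(parents=True, exist_ok=True)
    # Download to a temporary name first so an interrupted transfer
    # never leaves a partial file that looks like a complete archive.
    partial = archive.with_name(archive.name + ".part")
    fetch(url, str(partial))
    partial.replace(archive)
    return archive, True
```

A call might look like `ensure_dataset("mnist", "https://example.invalid/mnist.tar.gz")`, where the URL is a placeholder. The `fetch` parameter exists so the function can be exercised with a stand-in downloader and no network connection.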
Design Philosophy
Shipped datasets follow Karpathy's "~1K samples" philosophy:
- Small enough to ship with repo
- Large enough for meaningful learning
- Fast training (seconds to minutes)
- Instant gratification for students
Downloaded datasets are full benchmarks:
- Standard ML benchmarks (MNIST, CIFAR-10)
- Larger, slower, more realistic
- Auto-downloaded only when needed
- Used for scaling demonstrations
Total Repository Size
- Shipped data: ~350 KB (TinyDigits + TinyTalks combined)
- USB-friendly: Entire repo fits on any device
- Offline-capable: Core milestones work without internet
- Git-friendly: No large binary files in version control
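The size figures above can be checked locally by summing file sizes under each dataset directory. This is a small sketch; `dir_size_bytes` is a hypothetical helper, and the directory paths in the usage note assume you run it from the `datasets/` directory.

```python
from pathlib import Path


def dir_size_bytes(root):
    """Total size in bytes of all regular files under root (recursive)."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())
```

For example, `dir_size_bytes("tinydigits") + dir_size_bytes("tinytalks")` gives the combined shipped-data footprint in bytes; divide by 1024 to compare against the KB figures listed above.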