Mirror of https://github.com/harvard-edge/cs249r_book.git (synced 2026-05-01 18:19:18 -05:00)
Add curated educational datasets for TinyTorch milestones:

TinyDigits (~310 KB):
- 1000 train + 200 test samples of 8x8 digit images
- Balanced: 100 samples per digit class (0-9)
- Used by Milestones 03 (MLP) and 04 (CNN)
- Created from sklearn digits, normalized to [0,1]

TinyTalks (~40 KB):
- 350 Q&A pairs across 5 difficulty levels
- Character-level conversational dataset
- Used by Milestone 05 (Transformer)
- Designed for fast training (3-5 min on laptop)

Both datasets follow Karpathy's ~1K samples philosophy:
- Small enough to ship with repo
- Large enough for meaningful learning
- Fast training with instant feedback
- Works offline, no downloads needed

# TinyTorch Datasets

This directory contains datasets for TinyTorch milestone examples.

## Directory Structure

```
datasets/
├── tinydigits/   ← 8×8 handwritten digits (ships with repo, ~310 KB)
├── tinytalks/    ← Q&A dataset for transformers (ships with repo, ~40 KB)
└── README.md     ← This file
```

## Shipped Datasets (No Download Required)

### TinyDigits
- **Used by:** Milestones 03 & 04 (MLP and CNN examples)
- **Contents:** 1,000 training + 200 test samples
- **Format:** 8×8 grayscale images, pickled
- **Size:** ~310 KB
- **Purpose:** Fast iteration on real image classification

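
As a rough illustration of how a pickled split of this shape might be read, the sketch below writes a synthetic stand-in to a temporary file and loads it back. The file name `train.pkl`, the dictionary keys `images` and `labels`, and the helpers `make_synthetic_split`, `load_split`, and `flatten` are assumptions made for this example, not the actual TinyDigits layout; check the files in `tinydigits/` for the real names.

```python
import pickle
import random
import tempfile
from pathlib import Path


def make_synthetic_split(n_samples, seed=0):
    """Build a stand-in split: 8x8 images with values in [0, 1), labels 0-9."""
    rng = random.Random(seed)
    images = [[[rng.random() for _ in range(8)] for _ in range(8)]
              for _ in range(n_samples)]
    labels = [i % 10 for i in range(n_samples)]  # evenly spread over 10 classes
    return {"images": images, "labels": labels}  # hypothetical keys


def load_split(path):
    """Load one pickled split and return (images, labels)."""
    with open(path, "rb") as f:
        data = pickle.load(f)
    return data["images"], data["labels"]


def flatten(image):
    """Turn one 8x8 image into a 64-value vector, as an MLP input would need."""
    return [pixel for row in image for pixel in row]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "train.pkl"  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(make_synthetic_split(1000), f)
    images, labels = load_split(path)

vectors = [flatten(img) for img in images]
print(len(vectors), len(vectors[0]), sorted(set(labels)))
```

Because `pickle.load` can execute arbitrary code, it is only appropriate for files from a trusted source, such as those shipped in this repository.
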
### TinyTalks
- **Used by:** Milestone 05 (Transformer/GPT examples)
- **Contents:** 350 Q&A pairs across 5 difficulty levels
- **Format:** Plain text (Q: ... A: ... format)
- **Size:** ~40 KB
- **Purpose:** Character-level conversational AI training

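
The README only states that the text uses a `Q: ... A: ...` format, so the following is a minimal sketch under an assumed layout: one `Q:` line followed by one `A:` line, with pairs separated by blank lines. The sample text and the helpers `parse_pairs` and `build_vocab` are invented for illustration and are not taken from the TinyTalks files. It shows how such text could be split into pairs and encoded at the character level.

```python
# Invented sample in the assumed layout; not real TinyTalks content.
SAMPLE = """\
Q: What is a tensor?
A: A multi-dimensional array of numbers.

Q: What does an optimizer do?
A: It updates parameters to reduce the loss.
"""


def parse_pairs(text):
    """Return a list of (question, answer) tuples from 'Q: ... A: ...' text."""
    pairs, question = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:") and question is not None:
            pairs.append((question, line[2:].strip()))
            question = None
    return pairs


def build_vocab(text):
    """Map each distinct character to an integer id (character-level tokens)."""
    chars = sorted(set(text))
    return {ch: i for i, ch in enumerate(chars)}


pairs = parse_pairs(SAMPLE)
vocab = build_vocab(SAMPLE)

# Encode the first question as ids, then decode it back to check the round trip.
encoded = [vocab[ch] for ch in pairs[0][0]]
id_to_char = {i: ch for ch, i in vocab.items()}
decoded = "".join(id_to_char[i] for i in encoded)
```

A character-level vocabulary stays very small (at most the number of distinct characters in the corpus), which suits a dataset of this size.
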
## Downloaded Datasets (On-Demand)

The milestones automatically download larger datasets when needed:

### MNIST
- **Used by:** `milestones/03_1986_mlp/02_rumelhart_mnist.py`
- **Downloads to:** `milestones/datasets/mnist/`
- **Contents:** 60K training + 10K test samples
- **Format:** 28×28 grayscale images
- **Size:** ~10 MB compressed
- **Auto-downloaded by:** `milestones/data_manager.py`

### CIFAR-10
- **Used by:** `milestones/04_1998_cnn/02_lecun_cifar10.py`
- **Downloads to:** `milestones/datasets/cifar-10/`
- **Contents:** 50K training + 10K test samples
- **Format:** 32×32 RGB images
- **Size:** ~170 MB compressed
- **Auto-downloaded by:** `milestones/data_manager.py`

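
The download-on-demand behaviour described above can be sketched as a cache check followed by a fetch. This is not the interface of `milestones/data_manager.py`, which is not shown here; `ensure_dataset` and `fake_fetch` are hypothetical names, and the fetch step is injected as a callable so the example runs without network access.

```python
import tempfile
from pathlib import Path


def ensure_dataset(name, root, fetch):
    """Return the dataset directory, calling fetch(target) only if it is missing or empty."""
    target = Path(root) / name
    if target.is_dir() and any(target.iterdir()):
        return target  # already cached, so no network access is needed
    target.mkdir(parents=True, exist_ok=True)
    fetch(target)
    return target


calls = []  # records which datasets were actually fetched


def fake_fetch(target_dir):
    """Stand-in for a real download: writes a small placeholder file."""
    calls.append(target_dir.name)
    (target_dir / "placeholder.bin").write_bytes(b"\x00" * 16)


with tempfile.TemporaryDirectory() as root:
    first = ensure_dataset("mnist", root, fake_fetch)   # triggers the fetch
    second = ensure_dataset("mnist", root, fake_fetch)  # served from the cache
    same_path = first == second

print(calls)
```

In a real setting the `fetch` callable might wrap something like `urllib.request.urlretrieve` plus archive extraction; keeping it separate from the cache check makes the offline path easy to test.
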
## Design Philosophy

**Shipped datasets** follow Karpathy's "~1K samples" philosophy:
- Small enough to ship with repo
- Large enough for meaningful learning
- Fast training (seconds to minutes)
- Instant gratification for students

**Downloaded datasets** are full benchmarks:
- Standard ML benchmarks (MNIST, CIFAR-10)
- Larger, slower, more realistic
- Auto-downloaded only when needed
- Used for scaling demonstrations

## Total Repository Size

- **Shipped data:** ~350 KB (tinydigits + tinytalks)
- **USB-friendly:** Entire repo fits on any device
- **Offline-capable:** Core milestones work without internet
- **Git-friendly:** No large binary files in version control

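
One way to keep the shipped data near the ~350 KB figure is to measure the directory periodically. The sketch below is an assumption-laden example rather than a script from this repository: `dir_size_bytes` is a hypothetical helper, the 512 KiB budget is an arbitrary illustrative limit, and the files are synthetic placeholders created in a temporary directory.

```python
import tempfile
from pathlib import Path


def dir_size_bytes(path):
    """Sum the sizes of all regular files below path."""
    return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "tinydigits").mkdir()
    (root / "tinytalks").mkdir()
    # Placeholder files standing in for the real dataset contents.
    (root / "tinydigits" / "a.bin").write_bytes(b"\x00" * 3000)
    (root / "tinytalks" / "b.txt").write_bytes(b"\x00" * 1000)
    total = dir_size_bytes(root)

budget = 512 * 1024  # hypothetical limit, in bytes
within_budget = total <= budget
print(total, within_budget)
```

Pointing `dir_size_bytes` at the real `datasets/` directory would give the actual shipped size to compare against the figure above.
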