cs249r_book/tinytorch/datasets/README.md
Vijay Janapa Reddi 712ccc0c27 feat(datasets): add tinydigits and tinytalks educational datasets
Add curated educational datasets for TinyTorch milestones:

TinyDigits (~310 KB):
- 1000 train + 200 test samples of 8x8 digit images
- Balanced: 100 samples per digit class (0-9)
- Used by Milestones 03 (MLP) and 04 (CNN)
- Created from sklearn digits, normalized to [0,1]

TinyTalks (~40 KB):
- 350 Q&A pairs across 5 difficulty levels
- Character-level conversational dataset
- Used by Milestone 05 (Transformer)
- Designed for fast training (3-5 min on laptop)

Both datasets follow Karpathy's ~1K samples philosophy:
- Small enough to ship with repo
- Large enough for meaningful learning
- Fast training with instant feedback
- Works offline, no downloads needed
2026-01-13 10:03:09 -05:00

# TinyTorch Datasets
This directory contains datasets for TinyTorch milestone examples.
## Directory Structure
```
datasets/
├── tinydigits/ ← 8×8 handwritten digits (ships with repo, ~310 KB)
├── tinytalks/ ← Q&A dataset for transformers (ships with repo, ~40 KB)
└── README.md ← This file
```
## Shipped Datasets (No Download Required)
### TinyDigits
- **Used by:** Milestones 03 & 04 (MLP and CNN examples)
- **Contents:** 1,000 training + 200 test samples
- **Format:** 8×8 grayscale images, pickled
- **Size:** ~310 KB
- **Purpose:** Fast iteration on real image classification
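A minimal sketch of loading a pickled image set like this one. The README only states that the data is pickled 8×8 grayscale images, so the dictionary layout (`images` and `labels` keys), the function name `load_pickled_digits`, and the demo file name are assumptions for illustration; check the actual files in `tinydigits/` before relying on them.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np


def load_pickled_digits(path):
    """Load a pickled dict of images and labels (ASSUMED layout, not confirmed).

    Only unpickle files you trust: pickle can execute code on load.
    """
    with open(path, "rb") as f:
        data = pickle.load(f)
    images = np.asarray(data["images"], dtype=np.float32)  # hypothetical key
    labels = np.asarray(data["labels"], dtype=np.int64)    # hypothetical key
    if images.shape[1:] != (8, 8):
        raise ValueError(f"expected 8x8 images, got {images.shape[1:]}")
    if len(images) != len(labels):
        raise ValueError("images and labels differ in length")
    return images, labels


# Demo with synthetic data so the sketch runs without the real dataset.
rng = np.random.default_rng(0)
demo = {"images": rng.random((20, 8, 8)), "labels": np.arange(20) % 10}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_digits.pkl"  # hypothetical file name
    with open(demo_path, "wb") as f:
        pickle.dump(demo, f)
    images, labels = load_pickled_digits(demo_path)

print(images.shape, labels[:5])
```

The shape check fails early if a file does not match the 8×8 format described above, which is usually easier to debug than a shape error deep inside a training loop.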
### TinyTalks
- **Used by:** Milestone 05 (Transformer/GPT examples)
- **Contents:** 350 Q&A pairs across 5 difficulty levels
- **Format:** Plain text (Q: ... A: ... format)
- **Size:** ~40 KB
- **Purpose:** Character-level conversational AI training
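As a rough illustration of how a `Q: ... A: ...` text file can feed character-level training, the sketch below parses pairs and builds a character vocabulary. It assumes one `Q:` line followed by one `A:` line per pair; the real TinyTalks layout may differ, and the names `parse_qa_pairs` and `build_char_vocab` are hypothetical helpers, not part of TinyTorch.

```python
def parse_qa_pairs(text):
    """Collect (question, answer) tuples from 'Q:' / 'A:' prefixed lines.

    ASSUMPTION: each question and each answer sits on a single line.
    """
    pairs = []
    question = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:") and question is not None:
            pairs.append((question, line[2:].strip()))
            question = None
    return pairs


def build_char_vocab(text):
    """Map each distinct character to an integer id and back."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos


# Made-up sample text, not taken from the dataset.
sample = "Q: What is 2+2?\nA: 4\n\nQ: What color is the sky?\nA: Blue\n"

pairs = parse_qa_pairs(sample)
stoi, itos = build_char_vocab(sample)

# Round-trip a short string through the vocabulary.
encoded = [stoi[ch] for ch in "Q: 4"]
decoded = "".join(itos[i] for i in encoded)
print(pairs)
print(decoded)
```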
## Downloaded Datasets (On-Demand)
The milestones automatically download larger datasets when needed:
### MNIST
- **Used by:** `milestones/03_1986_mlp/02_rumelhart_mnist.py`
- **Downloads to:** `milestones/datasets/mnist/`
- **Contents:** 60K training + 10K test samples
- **Format:** 28×28 grayscale images
- **Size:** ~10 MB compressed
- **Auto-downloaded by:** `milestones/data_manager.py`
### CIFAR-10
- **Used by:** `milestones/04_1998_cnn/02_lecun_cifar10.py`
- **Downloads to:** `milestones/datasets/cifar-10/`
- **Contents:** 50K training + 10K test samples
- **Format:** 32×32 RGB images
- **Size:** ~170 MB compressed
- **Auto-downloaded by:** `milestones/data_manager.py`
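The "download only when needed" behavior described for these datasets can be sketched as a check-then-fetch helper. This is not the code in `milestones/data_manager.py`; `ensure_dataset` and the injected `fetch` callable are hypothetical names, and the fetch step is faked so the example runs offline.

```python
import tempfile
from pathlib import Path


def ensure_dataset(target_dir, fetch):
    """Call fetch(target) only if target_dir is missing or empty.

    Returns True when a fetch happened, False when the data was already there.
    """
    target = Path(target_dir)
    if target.is_dir() and any(target.iterdir()):
        return False  # already present, skip the download
    target.mkdir(parents=True, exist_ok=True)
    fetch(target)
    return True


calls = []


def fake_fetch(target):
    """Stand-in for a real download: writes a small placeholder file."""
    calls.append(target)
    (target / "placeholder.bin").write_bytes(b"\x00" * 16)


with tempfile.TemporaryDirectory() as tmp:
    mnist_dir = Path(tmp) / "datasets" / "mnist"
    first = ensure_dataset(mnist_dir, fake_fetch)   # directory missing: fetches
    second = ensure_dataset(mnist_dir, fake_fetch)  # directory populated: skips

print(first, second, len(calls))
```

Passing the fetch step in as a callable keeps the presence check testable without network access; a real implementation would also want to verify checksums and handle partial downloads.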
## Design Philosophy
**Shipped datasets** follow Karpathy's "~1K samples" philosophy:
- Small enough to ship with repo
- Large enough for meaningful learning
- Fast training (seconds to minutes)
- Instant gratification for students
**Downloaded datasets** are full benchmarks:
- Standard ML benchmarks (MNIST, CIFAR-10)
- Larger, slower, more realistic
- Auto-downloaded only when needed
- Used for scaling demonstrations
## Total Repository Size
- **Shipped data:** ~350 KB (tinydigits + tinytalks)
- **USB-friendly:** Shipped data is small enough to carry on any USB drive
- **Offline-capable:** Core milestones work without internet
- **Git-friendly:** No large binary files in version control
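One way to check the shipped-data figure on a local checkout is to sum file sizes under the dataset directories. The helper name `dir_size_bytes` is hypothetical, and the demo below uses a temporary directory with made-up files rather than the real datasets.

```python
import tempfile
from pathlib import Path


def dir_size_bytes(root):
    """Sum the sizes of all regular files under root, recursively."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())


# Demo on a throwaway directory tree with known sizes.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "tinydigits").mkdir()
    (root / "tinydigits" / "a.bin").write_bytes(b"x" * 1000)
    (root / "tinytalks").mkdir()
    (root / "tinytalks" / "b.txt").write_bytes(b"y" * 24)
    total = dir_size_bytes(root)

print(f"{total / 1024:.1f} KB")
```

Run from the repository root, `dir_size_bytes("datasets") / 1024` would give the shipped size in KB, assuming the directory layout shown earlier in this file.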