cs249r_book/tinytorch/datasets/README.md

# TinyTorch Datasets

This directory contains datasets for TinyTorch milestone examples.

## Directory Structure

```
datasets/
├── tinydigits/     ← 8×8 handwritten digits (ships with repo, ~310KB)
├── tinytalks/      ← Q&A dataset for transformers (ships with repo, ~40KB)
└── README.md       ← This file
```

## Shipped Datasets (No Download Required)

### TinyDigits
- **Used by:** Milestones 03 & 04 (MLP and CNN examples)
- **Contents:** 1,000 training + 200 test samples
- **Format:** 8×8 grayscale images, pickled
- **Size:** ~310 KB
- **Purpose:** Fast iteration on real image classification

### TinyTalks
- **Used by:** Milestone 05 (Transformer/GPT examples)
- **Contents:** 350 Q&A pairs across 5 difficulty levels
- **Format:** Plain text (Q: ... A: ... format)
- **Size:** ~40 KB
- **Purpose:** Character-level conversational AI training

## Downloaded Datasets (On-Demand)

The milestones automatically download larger datasets when needed:

### MNIST
- **Used by:** `milestones/03_1986_mlp/02_rumelhart_mnist.py`
- **Downloads to:** `milestones/datasets/mnist/`
- **Contents:** 60K training + 10K test samples
- **Format:** 28×28 grayscale images
- **Size:** ~10 MB compressed
- **Auto-downloaded by:** `milestones/data_manager.py`

### CIFAR-10
- **Used by:** `milestones/04_1998_cnn/02_lecun_cifar10.py`
- **Downloads to:** `milestones/datasets/cifar-10/`
- **Contents:** 50K training + 10K test samples
- **Format:** 32×32 RGB images
- **Size:** ~170 MB compressed
- **Auto-downloaded by:** `milestones/data_manager.py`

## Design Philosophy

**Shipped datasets** follow Karpathy's "~1K samples" philosophy:
- Small enough to ship with repo
- Large enough for meaningful learning
- Fast training (seconds to minutes)
- Instant gratification for students

**Downloaded datasets** are full benchmarks:
- Standard ML benchmarks (MNIST, CIFAR-10)
- Larger, slower, more realistic
- Auto-downloaded only when needed
- Used for scaling demonstrations

## Total Repository Size

- **Shipped data:** ~350 KB (tinydigits + tinytalks)
- **USB-friendly:** Entire repo fits on any device
- **Offline-capable:** Core milestones work without internet
- **Git-friendly:** No large binary files in version control