TinyTorch Datasets

This directory contains datasets for TinyTorch milestone examples.

Directory Structure

datasets/
├── tinydigits/   ← 8×8 handwritten digits (ships with repo, ~310 KB)
├── tinytalks/    ← Conversational Q&A text for transformers (ships with repo, ~40 KB)
└── README.md     ← This file

Shipped Datasets (No Download Required)

TinyDigits

Used by: Milestones 03 & 04 (MLP and CNN examples)
Contents: 1,000 training + 200 test samples
Format: 8×8 grayscale images, pickled
Size: ~310 KB
Purpose: Fast iteration on real image classification
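
As a rough illustration of the pickled format, the sketch below loads one split and checks that each sample is an 8×8 grid. The function name `load_pickled_split`, the dictionary keys `images` and `labels`, and any file path passed in are assumptions for illustration; inspect the files in `tinydigits/` for the actual names and layout.

```python
import pickle
from pathlib import Path


def load_pickled_split(path, image_key="images", label_key="labels"):
    """Load one pickled split and return (images, labels).

    Assumes the pickle holds a dict with the given keys; the key names
    are hypothetical and are not confirmed by this README.
    """
    with Path(path).open("rb") as f:
        data = pickle.load(f)

    images, labels = data[image_key], data[label_key]

    # Every image needs exactly one label.
    if len(images) != len(labels):
        raise ValueError("images and labels differ in length")

    # Each sample should be an 8x8 grid of pixel values.
    first = images[0]
    if len(first) != 8 or len(first[0]) != 8:
        raise ValueError("expected 8x8 images")

    return images, labels
```

Only unpickle files you trust, such as the ones shipped in this repository, because loading a pickle can execute arbitrary code.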

TinyTalks

Used by: Transformer and language-model teaching examples (e.g., Milestone 05 materials and extensions)
Contents: 350 Q&A pairs across 5 difficulty levels
Format: Plain text (Q: / A: lines), character-level friendly
Size: ~40 KB
Purpose: Fast, offline conversational text for attention and GPT-style experiments
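
The Q: / A: line format lends itself to a very small parser. The sketch below assumes each question and each answer sits on a single line with its prefix; `parse_qa_pairs` and `build_char_vocab` are hypothetical helpers, not functions provided by TinyTorch.

```python
def parse_qa_pairs(text):
    """Parse 'Q:' / 'A:' lines into a list of (question, answer) tuples."""
    pairs = []
    question = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:") and question is not None:
            pairs.append((question, line[2:].strip()))
            question = None  # wait for the next question
    return pairs


def build_char_vocab(text):
    """Map each distinct character to an integer id (character-level tokens)."""
    return {ch: i for i, ch in enumerate(sorted(set(text)))}
```

A character-level vocabulary built this way stays small, which suits quick attention and GPT-style experiments on a ~40 KB corpus.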

Downloaded Datasets (On-Demand)

The milestones automatically download larger datasets when needed:

MNIST

Used by: milestones/03_1986_mlp/02_rumelhart_mnist.py
Downloads to: milestones/datasets/mnist/
Contents: 60K training + 10K test samples
Format: 28×28 grayscale images
Size: ~10 MB compressed
Auto-downloaded by: milestones/data_manager.py

CIFAR-10

Used by: milestones/04_1998_cnn/02_lecun_cifar10.py
Downloads to: milestones/datasets/cifar-10/
Contents: 50K training + 10K test samples
Format: 32×32 RGB images
Size: ~170 MB compressed
Auto-downloaded by: milestones/data_manager.py
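
The download-on-demand behavior described for MNIST and CIFAR-10 amounts to a cache check followed by a fetch. The sketch below is not the code in `milestones/data_manager.py`; the function `ensure_dataset`, its parameters, and any URL passed to it are hypothetical and only illustrate the pattern.

```python
from pathlib import Path
from urllib.request import urlretrieve


def ensure_dataset(target_dir, filename, url, fetch=urlretrieve):
    """Return the local path to `filename`, downloading it only if absent.

    `fetch` takes (url, destination_path); it defaults to urlretrieve and
    can be swapped for a stub in tests.
    """
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    local_path = target_dir / filename

    if not local_path.exists():
        # Download under a temporary name first so an interrupted transfer
        # never leaves a partial file that looks like a valid cache hit.
        tmp_path = local_path.parent / (local_path.name + ".part")
        fetch(url, str(tmp_path))
        tmp_path.replace(local_path)

    return local_path
```

Calling the function a second time with the same arguments returns the cached path without touching the network, which is what keeps repeated milestone runs fast.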

Design Philosophy

Shipped datasets follow Karpathy's "~1K samples" philosophy:

Small enough to ship with repo
Large enough for meaningful learning
Fast training (seconds to minutes)
Instant gratification for students

Downloaded datasets are full benchmarks:

Standard ML benchmarks (MNIST, CIFAR-10)
Larger, slower, more realistic
Auto-downloaded only when needed
Used for scaling demonstrations

Total Repository Size

Shipped data: ~350 KB (TinyDigits + TinyTalks combined)
USB-friendly: With only ~350 KB of shipped data, the repo stays small enough for any USB drive
Offline-capable: Core milestones work without internet
Git-friendly: No large binary files in version control
