mirror of https://github.com/harvard-edge/cs249r_book.git synced 2026-03-09 07:15:51 -05:00


Vijay Janapa Reddi 712ccc0c27 feat(datasets): add tinydigits and tinytalks educational datasets

Add curated educational datasets for TinyTorch milestones:

TinyDigits (~310 KB):
- 1000 train + 200 test samples of 8x8 digit images
- Balanced: 100 samples per digit class (0-9)
- Used by Milestones 03 (MLP) and 04 (CNN)
- Created from sklearn digits, normalized to [0,1]

TinyTalks (~40 KB):
- 350 Q&A pairs across 5 difficulty levels
- Character-level conversational dataset
- Used by Milestone 05 (Transformer)
- Designed for fast training (3-5 min on laptop)

Both datasets follow Karpathy's ~1K samples philosophy:
- Small enough to ship with repo
- Large enough for meaningful learning
- Fast training with instant feedback
- Works offline, no downloads needed

2026-01-13 10:03:09 -05:00


TinyTorch Datasets

This directory contains datasets for TinyTorch milestone examples.

Directory Structure

datasets/
├── tinydigits/     ← 8×8 handwritten digits (ships with repo, ~310KB)
├── tinytalks/      ← Q&A dataset for transformers (ships with repo, ~40KB)
└── README.md       ← This file

Shipped Datasets (No Download Required)

TinyDigits

Used by: Milestones 03 & 04 (MLP and CNN examples)
Contents: 1,000 training + 200 test samples
Format: 8×8 grayscale images, pickled
Size: ~310 KB
Purpose: Fast iteration on real image classification
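
The README does not describe the internal layout of the pickled files, so the following is only a sketch of how a split could be loaded and sanity-checked. The function name `load_digits_pickle`, the example path in the comment, and the `images` / `labels` dictionary keys are all hypothetical; adjust them to whatever the files in `tinydigits/` actually contain. The checks use the two properties stated for the dataset: 8×8 images, with values normalized to [0, 1].

```python
import pickle
from pathlib import Path


def load_digits_pickle(path):
    """Load one pickled split and return (images, labels).

    ASSUMPTION: the pickle holds a dict with 'images' and 'labels' keys.
    This layout is illustrative, not confirmed by the README.
    """
    with Path(path).open("rb") as f:
        data = pickle.load(f)  # only unpickle files you trust

    images, labels = data["images"], data["labels"]

    if len(images) != len(labels):
        raise ValueError("images and labels differ in length")

    for img in images:
        # Works for nested lists and for NumPy arrays alike.
        if len(img) != 8 or any(len(row) != 8 for row in img):
            raise ValueError("expected 8x8 images")
        if not all(0.0 <= v <= 1.0 for row in img for v in row):
            raise ValueError("expected pixel values in [0, 1]")

    return images, labels


# Hypothetical usage (path is an assumption):
# images, labels = load_digits_pickle("datasets/tinydigits/train.pkl")
```

Validating shape and value range at load time catches a mismatched or corrupted file before it reaches a training loop.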

TinyTalks

Used by: Milestone 05 (Transformer/GPT examples)
Contents: 350 Q&A pairs across 5 difficulty levels
Format: Plain text, each pair written as Q: ... A: ...
Size: ~40 KB
Purpose: Character-level conversational AI training
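
As a rough illustration of what character-level training on Q/A text involves, the sketch below parses `Q:` / `A:` lines into pairs and builds a character vocabulary. It assumes each question and each answer sits on its own line, which the README does not confirm. The function names and the sample text are made up for the example and are not taken from TinyTalks.

```python
def parse_qa_pairs(text):
    """Parse 'Q: ...' / 'A: ...' lines into (question, answer) tuples.

    ASSUMPTION: one question or answer per line; other lines are ignored.
    """
    pairs = []
    question = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:") and question is not None:
            pairs.append((question, line[2:].strip()))
            question = None
    return pairs


def build_char_vocab(text):
    """Map each distinct character to an integer id (sorted for determinism)."""
    return {ch: i for i, ch in enumerate(sorted(set(text)))}


# Illustrative sample, not real TinyTalks data.
sample = "Q: What is 2+2?\nA: 4\n\nQ: Hi\nA: Hello!\n"

pairs = parse_qa_pairs(sample)
vocab = build_char_vocab(sample)

# Encode a string as token ids, then decode it back.
id_to_char = {i: ch for ch, i in vocab.items()}
encoded = [vocab[ch] for ch in "Q: Hi"]
decoded = "".join(id_to_char[i] for i in encoded)
```

With a character-level vocabulary the tokenizer is just a lookup table, which keeps the vocabulary small and suits a dataset of only a few hundred pairs.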

Downloaded Datasets (On-Demand)

The milestones automatically download larger datasets when needed:

MNIST

Used by: milestones/03_1986_mlp/02_rumelhart_mnist.py
Downloads to: milestones/datasets/mnist/
Contents: 60K training + 10K test samples
Format: 28×28 grayscale images
Size: ~10 MB compressed
Auto-downloaded by: milestones/data_manager.py

CIFAR-10

Used by: milestones/04_1998_cnn/02_lecun_cifar10.py
Downloads to: milestones/datasets/cifar-10/
Contents: 50K training + 10K test samples
Format: 32×32 RGB images
Size: ~170 MB compressed
Auto-downloaded by: milestones/data_manager.py
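
The on-demand behavior described for MNIST and CIFAR-10 amounts to "download once, then reuse the local copy". The sketch below shows that pattern in general terms. It is not the interface of `milestones/data_manager.py`, which the README does not show; `ensure_dataset` and its `fetch` parameter are hypothetical names, and no real dataset URL is assumed.

```python
import urllib.request
from pathlib import Path


def ensure_dataset(url, dest, fetch=urllib.request.urlretrieve):
    """Download `url` to `dest` unless the file already exists; return the path.

    `fetch` is injectable so the logic can be exercised without network access.
    """
    dest = Path(dest)
    if dest.exists():
        return dest  # cached copy: no network access needed

    dest.parent.mkdir(parents=True, exist_ok=True)

    # Download under a temporary name, then rename, so an interrupted
    # download is never mistaken for a complete file.
    tmp = dest.with_name(dest.name + ".part")
    fetch(url, str(tmp))
    tmp.replace(dest)
    return dest
```

Checking for the file before fetching is what allows a milestone to run offline after its first successful download.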

Design Philosophy

Shipped datasets follow Karpathy's "~1K samples" philosophy:

- Small enough to ship with repo
- Large enough for meaningful learning
- Fast training (seconds to minutes)
- Instant gratification for students

Downloaded datasets are full benchmarks:

- Standard ML benchmarks (MNIST, CIFAR-10)
- Larger, slower, more realistic
- Auto-downloaded only when needed
- Used for scaling demonstrations

Total Repository Size

- Shipped data: ~350 KB (tinydigits ~310 KB + tinytalks ~40 KB)
- USB-friendly: Shipped datasets add only ~350 KB to the repo
- Offline-capable: Core milestones work without internet
- Git-friendly: No large binary files in version control
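
One way to check the shipped-data figure is to sum file sizes under each dataset directory. This is a minimal sketch; `dir_size_bytes` is a hypothetical helper, not part of TinyTorch.

```python
from pathlib import Path


def dir_size_bytes(root):
    """Return the total size in bytes of all regular files under `root`."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())
```

Run from the `datasets/` directory, `dir_size_bytes("tinydigits") + dir_size_bytes("tinytalks")` should come out near 350 KB if the sizes quoted above are current.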
