Vijay Janapa Reddi 712ccc0c27 feat(datasets): add tinydigits and tinytalks educational datasets
Add curated educational datasets for TinyTorch milestones:

TinyDigits (~310 KB):
- 1000 train + 200 test samples of 8x8 digit images
- Balanced: 100 samples per digit class (0-9)
- Used by Milestones 03 (MLP) and 04 (CNN)
- Created from sklearn digits, normalized to [0,1]

TinyTalks (~40 KB):
- 350 Q&A pairs across 5 difficulty levels
- Character-level conversational dataset
- Used by Milestone 05 (Transformer)
- Designed for fast training (3-5 min on laptop)

Both datasets follow Karpathy's ~1K samples philosophy:
- Small enough to ship with repo
- Large enough for meaningful learning
- Fast training with instant feedback
- Works offline, no downloads needed
2026-01-13 10:03:09 -05:00


TinyTorch Datasets

This directory contains datasets for TinyTorch milestone examples.

Directory Structure

datasets/
├── tinydigits/     ← 8×8 handwritten digits (ships with repo, ~310KB)
├── tinytalks/      ← Q&A dataset for transformers (ships with repo, ~40KB)
└── README.md       ← This file

Shipped Datasets (No Download Required)

TinyDigits

  • Used by: Milestones 03 & 04 (MLP and CNN examples)
  • Contents: 1,000 training + 200 test samples
  • Format: 8×8 grayscale images, pickled
  • Size: ~310 KB
  • Purpose: Fast iteration on real image classification
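Loading a pickled split might look like the sketch below. The file path and the dictionary keys (`images`, `labels`) are assumptions made for illustration; check the files in `tinydigits/` for the actual names and layout.

```python
import pickle
from pathlib import Path


def load_pickled_split(path):
    """Load one pickled dataset split and return (images, labels).

    Assumes (hypothetically) that the pickle holds a dict with 'images'
    and 'labels' entries of equal length; adjust the keys to the real layout.
    """
    with Path(path).open("rb") as f:
        data = pickle.load(f)
    images, labels = data["images"], data["labels"]
    if len(images) != len(labels):
        raise ValueError("images and labels differ in length")
    return images, labels
```

With a hypothetical path, usage would be `images, labels = load_pickled_split("datasets/tinydigits/train.pkl")`. Because `pickle.load` can execute arbitrary code, only load pickle files you trust, such as the ones shipped with the repo.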

TinyTalks

  • Used by: Milestone 05 (Transformer/GPT examples)
  • Contents: 350 Q&A pairs across 5 difficulty levels
  • Format: Plain text (Q: ... A: ... format)
  • Size: ~40 KB
  • Purpose: Character-level conversational AI training
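As a rough illustration of the format and the character-level use described above, the sketch below parses `Q:`/`A:` lines into pairs and builds a character vocabulary. It assumes each question and answer sits on a single line; the function names are hypothetical and not part of TinyTorch.

```python
def parse_qa_pairs(text):
    """Parse 'Q: ...' / 'A: ...' lines into (question, answer) tuples.

    Assumes each question and each answer occupies one line; lines with
    any other prefix (including blank lines) are ignored.
    """
    pairs = []
    question = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:") and question is not None:
            pairs.append((question, line[2:].strip()))
            question = None
    return pairs


def build_char_vocab(text):
    """Map each distinct character to an integer id, in sorted order."""
    return {ch: i for i, ch in enumerate(sorted(set(text)))}
```

A character-level model would then encode a string as `[vocab[ch] for ch in text]`, so the vocabulary stays small and no separate tokenizer is needed.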

Downloaded Datasets (On-Demand)

The milestone scripts download these larger datasets automatically the first time they are needed:

MNIST

  • Used by: milestones/03_1986_mlp/02_rumelhart_mnist.py
  • Downloads to: milestones/datasets/mnist/
  • Contents: 60K training + 10K test samples
  • Format: 28×28 grayscale images
  • Size: ~10 MB compressed
  • Auto-downloaded by: milestones/data_manager.py

CIFAR-10

  • Used by: milestones/04_1998_cnn/02_lecun_cifar10.py
  • Downloads to: milestones/datasets/cifar-10/
  • Contents: 50K training + 10K test samples
  • Format: 32×32 RGB images
  • Size: ~170 MB compressed
  • Auto-downloaded by: milestones/data_manager.py
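The download-on-demand behavior described above can be sketched as a "fetch only if missing" helper. This is not the actual interface of `milestones/data_manager.py`; the function name, its parameters, and the `.part` temporary-file convention are assumptions for illustration.

```python
from pathlib import Path
from urllib.request import urlretrieve


def ensure_downloaded(url, dest_dir, filename, fetch=urlretrieve):
    """Download url to dest_dir/filename unless the file already exists.

    Hypothetical helper: `fetch` is any callable taking (url, local_path),
    which defaults to urllib's urlretrieve. Returns the local Path.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / filename
    if not target.exists():
        # Download to a temporary name first so an interrupted transfer
        # never leaves a partial file that looks complete.
        tmp = target.with_name(target.name + ".part")
        fetch(url, str(tmp))
        tmp.replace(target)
    return target
```

Calling the helper a second time with the same arguments returns the cached path without touching the network, which is what keeps repeated milestone runs fast.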

Design Philosophy

Shipped datasets follow Karpathy's "~1K samples" philosophy:

  • Small enough to ship with repo
  • Large enough for meaningful learning
  • Fast training (seconds to minutes)
  • Immediate feedback for students

Downloaded datasets are full benchmarks:

  • Standard ML benchmarks (MNIST, CIFAR-10)
  • Larger, slower, more realistic
  • Auto-downloaded only when needed
  • Used for scaling demonstrations

Total Repository Size

  • Shipped data: ~350 KB (tinydigits + tinytalks)
  • USB-friendly: Shipped data adds only ~350 KB, so the repo copies easily to a USB drive or other small device
  • Offline-capable: Core milestones work without internet
  • Git-friendly: No large binary files in version control