Files
TinyTorch/datasets
Vijay Janapa Reddi 91ac8458cd Add validation tool: NBGrader config validator
- Add comprehensive NBGrader configuration validator
- Validates Jupytext headers, solution blocks, cell metadata
- Checks for duplicate grade IDs and proper schema version
- Provides detailed validation reports with severity levels
2025-11-11 19:04:58 -05:00
..

TinyTorch Datasets

This directory contains datasets for TinyTorch milestone examples.

Directory Structure

datasets/
├── tinydigits/     ← 8×8 handwritten digits (ships with repo, ~310KB)
├── tinytalks/      ← Q&A dataset for transformers (ships with repo, ~40KB)
└── README.md       ← This file

Shipped Datasets (No Download Required)

TinyDigits

  • Used by: Milestones 03 & 04 (MLP and CNN examples)
  • Contents: 1,000 training + 200 test samples
  • Format: 8×8 grayscale images, pickled
  • Size: ~310 KB
  • Purpose: Fast iteration on real image classification

TinyTalks

  • Used by: Milestone 05 (Transformer/GPT examples)
  • Contents: 350 Q&A pairs across 5 difficulty levels
  • Format: Plain text (Q: ... A: ... format)
  • Size: ~40 KB
  • Purpose: Character-level conversational AI training

Downloaded Datasets (On-Demand)

The milestones automatically download larger datasets when needed:

MNIST

  • Used by: milestones/03_1986_mlp/02_rumelhart_mnist.py
  • Downloads to: milestones/datasets/mnist/
  • Contents: 60K training + 10K test samples
  • Format: 28×28 grayscale images
  • Size: ~10 MB compressed
  • Auto-downloaded by: milestones/data_manager.py

CIFAR-10

  • Used by: milestones/04_1998_cnn/02_lecun_cifar10.py
  • Downloads to: milestones/datasets/cifar-10/
  • Contents: 50K training + 10K test samples
  • Format: 32×32 RGB images
  • Size: ~170 MB compressed
  • Auto-downloaded by: milestones/data_manager.py

Design Philosophy

Shipped datasets follow Karpathy's "~1K samples" philosophy:

  • Small enough to ship with repo
  • Large enough for meaningful learning
  • Fast training (seconds to minutes)
  • Instant gratification for students

Downloaded datasets are full benchmarks:

  • Standard ML benchmarks (MNIST, CIFAR-10)
  • Larger, slower, more realistic
  • Auto-downloaded only when needed
  • Used for scaling demonstrations

Total Repository Size

  • Shipped data: ~350 KB (tinydigits + tinytalks)
  • USB-friendly: Entire repo fits on any device
  • Offline-capable: Core milestones work without internet
  • Git-friendly: No large binary files in version control