

TinyTorch Datasets

Ship-with-Repo Datasets for Fast Learning

Small datasets for instant iteration + standard benchmarks for validation

Purpose: Understand TinyTorch's dataset strategy and where to find each dataset used in milestones.

Design Philosophy

TinyTorch uses a two-tier dataset approach:

Shipped Datasets

~350 KB total - Ships with repository

  • Small enough to fit in Git (~1K samples each)
  • Fast training (seconds to minutes)
  • Instant gratification for learners
  • Works offline - no download needed
  • Perfect for rapid iteration

Downloaded Datasets

~180 MB - Auto-downloaded when needed

  • Standard ML benchmarks (MNIST, CIFAR-10)
  • Larger scale (~60K samples)
  • Used for validation and scaling
  • Downloaded automatically by milestones
  • Cached locally for reuse

Philosophy: Following Andrej Karpathy's "~1K samples" approach—small datasets for learning, full benchmarks for validation.

Shipped Datasets (Included with TinyTorch)

TinyDigits - Handwritten Digit Recognition

Location: datasets/tinydigits/
Size: ~310 KB
Used by: Milestones 03 & 04 (MLP and CNN examples)

Contents:

  • 1,000 training samples
  • 200 test samples
  • 8×8 grayscale images (downsampled from MNIST)
  • 10 classes (digits 0-9)

Format: Python pickle file with NumPy arrays

Why 8×8?

  • Fast iteration: Trains in seconds
  • Memory-friendly: Small enough to debug
  • Conceptually complete: Same challenges as 28×28 MNIST
  • Git-friendly: Only 310 KB vs 10 MB for full MNIST
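
The 28×28 → 8×8 reduction mentioned above can be sketched as a NumPy block mean. This is an illustrative sketch only (the actual TinyDigits preprocessing script is not shown in this document): since 28 is not divisible by 8, the sketch crops the central 24×24 region first; the real pipeline may interpolate instead.

```python
import numpy as np

def downsample_28_to_8(img_28x28: np.ndarray) -> np.ndarray:
    """Reduce a 28x28 MNIST image to 8x8 via center crop + 3x3 block mean."""
    crop = img_28x28[2:26, 2:26]       # central 24x24 region (24 = 8 * 3)
    blocks = crop.reshape(8, 3, 8, 3)  # split into an 8x8 grid of 3x3 tiles
    return blocks.mean(axis=(1, 3))    # one averaged value per tile -> 8x8

img = np.arange(28 * 28, dtype=np.float64).reshape(28, 28)
small = downsample_28_to_8(img)
print(small.shape)  # (8, 8)
```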

Usage in milestones:

```python
# Automatically loaded by milestones
from datasets.tinydigits import load_tinydigits

X_train, y_train, X_test, y_test = load_tinydigits()
# X_train shape: (1000, 8, 8)
# y_train shape: (1000,)
```
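
For the MLP milestone, the (1000, 8, 8) images typically need flattening into (1000, 64) feature vectors plus scaling. A minimal sketch, using random stand-in data since the actual pixel range isn't stated in this document:

```python
import numpy as np

# Stand-in for load_tinydigits(); shapes match the documented (1000, 8, 8).
X_train = np.random.default_rng(0).integers(0, 256, size=(1000, 8, 8))

# Flatten each 8x8 image into a 64-dim vector and scale into [0, 1].
X_flat = X_train.reshape(len(X_train), -1).astype(np.float64)
X_flat /= X_flat.max()  # normalize by the observed max

print(X_flat.shape)  # (1000, 64)
```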

TinyTalks - Conversational Q&A Dataset

Location: datasets/tinytalks/
Size: ~40 KB
Used by: Milestone 05 (Transformer/GPT text generation)

Contents:

  • 350 Q&A pairs across 5 difficulty levels
  • Character-level text data
  • Topics: General knowledge, math, science, reasoning
  • Balanced difficulty distribution

Format: Plain text files with Q: / A: format

Why conversational format?

  • Engaging: Questions feel natural
  • Varied: Different answer lengths and complexity
  • Educational: Difficulty levels scaffold learning
  • Practical: Mirrors real chatbot use cases

Example:

```text
Q: What is the capital of France?
A: Paris

Q: If a train travels 120 km in 2 hours, what is its average speed?
A: 60 km/h
```
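
Character-level modeling of this Q:/A: format can be sketched with a simple vocabulary built from the corpus. This is illustrative only; TinyTorch's actual tokenizer lives in the milestone code.

```python
text = "Q: What is the capital of France?\nA: Paris\n"

# Build a character vocabulary and integer mappings from the corpus.
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    return [stoi[ch] for ch in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

ids = encode(text)
assert decode(ids) == text  # round-trips exactly
print(len(vocab), "characters in vocabulary")
```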

Usage in milestones:

```python
# Automatically loaded by transformer milestones
from datasets.tinytalks import load_tinytalks

dataset = load_tinytalks()
# Returns list of (question, answer) pairs
```

See detailed documentation: datasets/tinytalks/README.md

Downloaded Datasets (Auto-Downloaded On-Demand)

These standard benchmarks download automatically when you run relevant milestone scripts:

MNIST - Handwritten Digit Classification

Downloads to: milestones/datasets/mnist/
Size: ~10 MB (compressed)
Used by: milestones/03_1986_mlp/02_rumelhart_mnist.py

Contents:

  • 60,000 training samples
  • 10,000 test samples
  • 28×28 grayscale images
  • 10 classes (digits 0-9)

Auto-download: When you run the MNIST milestone script, it automatically:

  1. Checks if data exists locally
  2. Downloads if needed (~10 MB)
  3. Caches for future runs
  4. Loads data using your TinyTorch DataLoader
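
The check-download-cache flow above can be sketched as a small helper. The names here are hypothetical; the real logic lives in milestones/data_manager.py, which this document references but does not reproduce.

```python
import os
import urllib.request

def fetch_cached(url: str, dest_path: str) -> str:
    """Download `url` to `dest_path` once; later calls reuse the cached file."""
    if os.path.exists(dest_path):        # 1. already cached? skip the network
        return dest_path
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    tmp = dest_path + ".part"            # 2. download to a temp file first
    urllib.request.urlretrieve(url, tmp)
    os.replace(tmp, dest_path)           # 3. atomic rename commits the cache
    return dest_path

# Usage (hypothetical URL):
# path = fetch_cached("https://example.com/mnist.npz",
#                     "milestones/datasets/mnist/mnist.npz")
```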

Purpose: Validate that your framework achieves production-level results (95%+ accuracy target)

Milestone goal: Implement backpropagation and achieve 95%+ accuracy, matching Rumelhart's 1986 breakthrough.

CIFAR-10 - Natural Image Classification

Downloads to: milestones/datasets/cifar-10/
Size: ~170 MB (compressed)
Used by: milestones/04_1998_cnn/02_lecun_cifar10.py

Contents:

  • 50,000 training samples
  • 10,000 test samples
  • 32×32 RGB images
  • 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck)

Auto-download: Milestone script handles everything:

  1. Downloads from official source
  2. Verifies integrity
  3. Caches locally
  4. Preprocesses for your framework
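
Step 2, integrity verification, is usually a checksum comparison against a known digest. A minimal sketch (any expected hash would come from the official source; this document doesn't state CIFAR-10's real checksum):

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large archives never load fully into RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path: str, expected_hex: str) -> None:
    """Raise if the file's digest does not match the published checksum."""
    actual = sha256_of(path)
    if actual != expected_hex:
        raise IOError(f"checksum mismatch for {path}: {actual}")
```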

Purpose: Prove your CNN implementation works on real natural images (75%+ accuracy target)

Milestone goal: Build a LeNet-style CNN achieving 75%+ accuracy, demonstrating spatial intelligence on natural images.

Dataset Selection Rationale

Why These Specific Datasets?

TinyDigits (not full MNIST):

  • 100× faster training iterations
  • Ships with repo (no download)
  • Same conceptual challenges
  • Perfect for learning and debugging

TinyTalks (custom dataset):

  • Designed for educational progression
  • Scaffolded difficulty levels
  • Character-level tokenization friendly
  • Engaging conversational format

MNIST (when scaling up):

  • Industry standard benchmark
  • Validates your implementation
  • Comparable to published results
  • 95%+ accuracy is achievable milestone

CIFAR-10 (for CNN validation):

  • Natural images (harder than digits)
  • RGB channels (multi-dimensional)
  • Standard CNN benchmark
  • 75%+ with basic CNN proves it works

Accessing Datasets

For Students

You don't need to manually download anything!

```bash
# Just run milestone scripts
cd milestones/03_1986_mlp
python 01_rumelhart_tinydigits.py  # Uses shipped TinyDigits
python 02_rumelhart_mnist.py       # Auto-downloads MNIST if needed
```

The milestones handle all data loading automatically.

For Developers/Researchers

Direct dataset access:

```python
# Shipped datasets (always available)
from datasets.tinydigits import load_tinydigits
X_train, y_train, X_test, y_test = load_tinydigits()

from datasets.tinytalks import load_tinytalks
conversations = load_tinytalks()

# Downloaded datasets (through milestones)
# See milestones/data_manager.py for download utilities
```

Dataset Sizes Summary

| Dataset    | Size   | Samples   | Ships with repo | Purpose                |
|------------|--------|-----------|-----------------|------------------------|
| TinyDigits | 310 KB | 1,200     | Yes             | Fast MLP/CNN iteration |
| TinyTalks  | 40 KB  | 350 pairs | Yes             | Transformer learning   |
| MNIST      | 10 MB  | 70,000    | Downloads       | MLP validation         |
| CIFAR-10   | 170 MB | 60,000    | Downloads       | CNN validation         |

Total shipped: ~350 KB
Total with benchmarks: ~180 MB

Why Ship-with-Repo Matters

Traditional ML courses:

  • "Download MNIST (10 MB)"
  • "Download CIFAR-10 (170 MB)"
  • Wait for downloads before starting
  • Large files in Git (bad practice)

TinyTorch approach:

  • Clone repo → Immediately start learning
  • Train your first model in under a minute
  • Full benchmarks download only when scaling
  • Git repo stays small and fast

Educational benefit: Students see working models within minutes, not hours.

Frequently Asked Questions

Q: Why not use full MNIST from the start? A: TinyDigits trains 100× faster, enabling rapid iteration during learning. MNIST validates your complete implementation later.

Q: Can I use my own datasets? A: Absolutely! TinyTorch is a real framework—add your data loading code just like PyTorch.

Q: Why ship datasets in Git? A: 350 KB is negligible (smaller than many images), and it enables offline learning with instant iteration.

Q: Where does CIFAR-10 download from? A: Official sources via milestones/data_manager.py, with integrity verification.

Q: Can I skip the large downloads? A: Yes! You can work through most milestones using only shipped datasets. Downloaded datasets are for validation milestones.

See the Milestones section for how each dataset is used in historical achievements.

Dataset implementation details: See datasets/tinydigits/README.md and datasets/tinytalks/README.md for technical specifications.