# TinyTorch Datasets

**Ship-with-Repo Datasets for Fast Learning**

Small datasets for instant iteration + standard benchmarks for validation

**Purpose**: Understand TinyTorch's dataset strategy and where to find each dataset used in milestones.

## Design Philosophy

TinyTorch uses a two-tier dataset approach:

### Shipped Datasets

**~350 KB total** - Ships with repository

- Small enough to fit in Git (~1K samples each)
- Fast training (seconds to minutes)
- Instant gratification for learners
- Works offline - no download needed
- Perfect for rapid iteration

### Downloaded Datasets

**~180 MB** - Auto-downloaded when needed

- Standard ML benchmarks (MNIST, CIFAR-10)
- Larger scale (~60K samples)
- Used for validation and scaling
- Downloaded automatically by milestones
- Cached locally for reuse

Philosophy: Following Andrej Karpathy's "~1K samples" approach—small datasets for learning, full benchmarks for validation.


## Shipped Datasets (Included with TinyTorch)

### TinyDigits - Handwritten Digit Recognition

**Location**: `datasets/tinydigits/`
**Size**: ~310 KB
**Used by**: Milestones 03 & 04 (MLP and CNN examples)

**Contents**:

- 1,000 training samples
- 200 test samples
- 8×8 grayscale images (downsampled from MNIST)
- 10 classes (digits 0-9)

**Format**: Python pickle file with NumPy arrays

**Why 8×8?**

- Fast iteration: Trains in seconds
- Memory-friendly: Small enough to debug
- Conceptually complete: Same challenges as 28×28 MNIST
- Git-friendly: Only 310 KB vs 10 MB for full MNIST
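The exact downsampling procedure isn't documented on this page, but the idea can be sketched. One plausible way to reduce a 28×28 MNIST image to 8×8 is a center crop followed by block averaging (`downsample_28_to_8` is an illustrative name, not part of the TinyTorch API):

```python
import numpy as np

def downsample_28_to_8(img28: np.ndarray) -> np.ndarray:
    """Shrink a 28x28 image to 8x8: center-crop to 24x24, then average 3x3 blocks."""
    crop = img28[2:26, 2:26]                            # drop 2-pixel border -> 24x24
    return crop.reshape(8, 3, 8, 3).mean(axis=(1, 3))   # mean over each 3x3 block -> 8x8
```

Block averaging keeps the coarse stroke structure that makes digits recognizable while shrinking storage roughly 12×.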

Usage in milestones:

```python
# Automatically loaded by milestones
from datasets.tinydigits import load_tinydigits

X_train, y_train, X_test, y_test = load_tinydigits()
# X_train shape: (1000, 8, 8)
# y_train shape: (1000,)
```

### TinyTalks - Conversational Q&A Dataset

**Location**: `datasets/tinytalks/`
**Size**: ~40 KB
**Used by**: Milestone 05 (Transformer/GPT text generation)

**Contents**:

- 350 Q&A pairs across 5 difficulty levels
- Character-level text data
- Topics: General knowledge, math, science, reasoning
- Balanced difficulty distribution

**Format**: Plain text files with Q: / A: format

**Why conversational format?**

- Engaging: Questions feel natural
- Varied: Different answer lengths and complexity
- Educational: Difficulty levels scaffold learning
- Practical: Mirrors real chatbot use cases

Example:

```text
Q: What is the capital of France?
A: Paris

Q: If a train travels 120 km in 2 hours, what is its average speed?
A: 60 km/h
```

Usage in milestones:

```python
# Automatically loaded by transformer milestones
from datasets.tinytalks import load_tinytalks

dataset = load_tinytalks()
# Returns list of (question, answer) pairs
```
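Under the hood, loading this format amounts to scanning for `Q:` / `A:` prefixes. A minimal sketch of that logic (the real `load_tinytalks` may differ; `parse_qa` is a hypothetical helper):

```python
def parse_qa(text: str) -> list[tuple[str, str]]:
    """Parse 'Q: ...' / 'A: ...' formatted text into (question, answer) pairs."""
    pairs, question = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Q:"):
            question = line[2:].strip()          # remember the pending question
        elif line.startswith("A:") and question is not None:
            pairs.append((question, line[2:].strip()))
            question = None                      # reset for the next pair
    return pairs
```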

See detailed documentation: datasets/tinytalks/README.md


## Downloaded Datasets (Auto-Downloaded On-Demand)

These standard benchmarks download automatically when you run relevant milestone scripts:

### MNIST - Handwritten Digit Classification

**Downloads to**: `milestones/datasets/mnist/`
**Size**: ~10 MB (compressed)
**Used by**: `milestones/03_1986_mlp/02_rumelhart_mnist.py`

**Contents**:

- 60,000 training samples
- 10,000 test samples
- 28×28 grayscale images
- 10 classes (digits 0-9)

**Auto-download**: When you run the MNIST milestone script, it automatically:

1. Checks if data exists locally
2. Downloads if needed (~10 MB)
3. Caches for future runs
4. Loads data using your TinyTorch DataLoader
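The check-download-cache pattern in steps 1-3 can be sketched as a small helper (illustrative only; the actual logic lives in `milestones/data_manager.py`, whose API isn't shown on this page):

```python
import os
import urllib.request

def fetch_if_missing(url: str, dest: str) -> str:
    """Download a dataset file to dest only if it isn't already cached locally."""
    if not os.path.exists(dest):              # 1. check local cache
        parent = os.path.dirname(dest)
        if parent:
            os.makedirs(parent, exist_ok=True)
        urllib.request.urlretrieve(url, dest)  # 2. download once
    return dest                                # 3. cached path reused on later runs
```

Because the function is idempotent, milestone scripts can call it unconditionally: the first run pays the download cost, every later run is a no-op.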

**Purpose**: Validate that your framework achieves production-level results (95%+ accuracy target)

**Milestone goal**: Implement backpropagation and achieve 95%+ accuracy—matching Rumelhart's 1986 breakthrough.

### CIFAR-10 - Natural Image Classification

**Downloads to**: `milestones/datasets/cifar-10/`
**Size**: ~170 MB (compressed)
**Used by**: `milestones/04_1998_cnn/02_lecun_cifar10.py`

**Contents**:

- 50,000 training samples
- 10,000 test samples
- 32×32 RGB images
- 10 classes (airplane, car, bird, cat, deer, dog, frog, horse, ship, truck)

**Auto-download**: Milestone script handles everything:

1. Downloads from official source
2. Verifies integrity
3. Caches locally
4. Preprocesses for your framework
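Step 2, integrity verification, is typically done by hashing the downloaded archive and comparing it against a published digest. A sketch under that assumption (`verify_sha256` is a hypothetical helper; the actual check in `milestones/data_manager.py` may differ):

```python
import hashlib

def verify_sha256(path: str, expected_hex: str) -> bool:
    """Stream a downloaded archive through SHA-256 and compare to a known digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):  # 64 KiB chunks
            h.update(chunk)                               # avoids loading 170 MB at once
    return h.hexdigest() == expected_hex
```

A failed check means a corrupted or truncated download, so the safe response is to delete the file and re-download.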

**Purpose**: Prove your CNN implementation works on real natural images (75%+ accuracy target)

**Milestone goal**: Build a LeNet-style CNN achieving 75%+ accuracy—demonstrating spatial intelligence.


## Dataset Selection Rationale

### Why These Specific Datasets?

**TinyDigits** (not full MNIST):

- 100× faster training iterations
- Ships with repo (no download)
- Same conceptual challenges
- Perfect for learning and debugging

**TinyTalks** (custom dataset):

- Designed for educational progression
- Scaffolded difficulty levels
- Character-level tokenization friendly
- Engaging conversational format
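To see why character-level data is tokenization friendly: a complete character tokenizer is just a lookup table over the dataset's unique characters. A minimal sketch (illustrative, not TinyTorch's actual tokenizer):

```python
def build_char_codec(text: str):
    """Build encode/decode functions mapping characters <-> integer ids."""
    chars = sorted(set(text))                       # vocabulary = unique characters
    stoi = {c: i for i, c in enumerate(chars)}      # char -> id
    itos = {i: c for i, c in enumerate(chars)}      # id -> char
    encode = lambda s: [stoi[c] for c in s]
    decode = lambda ids: "".join(itos[i] for i in ids)
    return encode, decode
```

No subword merges or vocabulary files are needed, which keeps the Milestone 05 transformer focused on the model rather than the tokenizer.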

**MNIST** (when scaling up):

- Industry standard benchmark
- Validates your implementation
- Comparable to published results
- 95%+ accuracy is achievable milestone

**CIFAR-10** (for CNN validation):

- Natural images (harder than digits)
- RGB channels (multi-dimensional)
- Standard CNN benchmark
- 75%+ with basic CNN proves it works

## Accessing Datasets

### For Students

**You don't need to manually download anything!**

```bash
# Just run milestone scripts
cd milestones/03_1986_mlp
python 01_rumelhart_tinydigits.py  # Uses shipped TinyDigits
python 02_rumelhart_mnist.py       # Auto-downloads MNIST if needed
```

The milestones handle all data loading automatically.

### For Developers/Researchers

Direct dataset access:

```python
# Shipped datasets (always available)
from datasets.tinydigits import load_tinydigits
X_train, y_train, X_test, y_test = load_tinydigits()

from datasets.tinytalks import load_tinytalks
conversations = load_tinytalks()

# Downloaded datasets (through milestones)
# See milestones/data_manager.py for download utilities
```

## Dataset Sizes Summary

| Dataset | Size | Samples | Ships With Repo | Purpose |
|------------|--------|-----------|-----------------|------------------------|
| TinyDigits | 310 KB | 1,200 | Yes | Fast MLP/CNN iteration |
| TinyTalks | 40 KB | 350 pairs | Yes | Transformer learning |
| MNIST | 10 MB | 70,000 | Downloads | MLP validation |
| CIFAR-10 | 170 MB | 60,000 | Downloads | CNN validation |

**Total shipped**: ~350 KB
**Total with benchmarks**: ~180 MB


## Why Ship-with-Repo Matters

Traditional ML courses:

  • "Download MNIST (10 MB)"
  • "Download CIFAR-10 (170 MB)"
  • Wait for downloads before starting
  • Large files in Git (bad practice)

TinyTorch approach:

  • Clone repo → Immediately start learning
  • Train first model in under 1 minute
  • Full benchmarks download only when scaling
  • Git repo stays small and fast

Educational benefit: Students see working models within minutes, not hours.


## Frequently Asked Questions

Q: Why not use full MNIST from the start?
A: TinyDigits trains 100× faster, enabling rapid iteration during learning. MNIST validates your complete implementation later.

Q: Can I use my own datasets?
A: Absolutely! TinyTorch is a real framework—add your data loading code just like PyTorch.
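For instance, a custom dataset can be a small class exposing `__len__` and `__getitem__`, the same interface a PyTorch-style DataLoader consumes. A sketch under that assumption (`ArrayDataset` is an illustrative name, not a TinyTorch class):

```python
import numpy as np

class ArrayDataset:
    """Minimal PyTorch-style dataset wrapping NumPy feature/label arrays."""
    def __init__(self, X: np.ndarray, y: np.ndarray):
        assert len(X) == len(y), "features and labels must align"
        self.X, self.y = X, y

    def __len__(self):
        return len(self.X)          # number of samples

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]  # one (features, label) pair
```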

Q: Why ship datasets in Git?
A: 350 KB is negligible (smaller than many images), and it enables offline learning with instant iteration.

Q: Where does CIFAR-10 download from?
A: Official sources via milestones/data_manager.py, with integrity verification.

Q: Can I skip the large downloads?
A: Yes! You can work through most milestones using only shipped datasets. Downloaded datasets are for validation milestones.


**Dataset implementation details**: See `datasets/tinydigits/README.md` and `datasets/tinytalks/README.md` for technical specifications.