# TinyTorch Datasets

**Ship-with-Repo Datasets for Fast Learning**

Small datasets for instant iteration + standard benchmarks for validation

**Purpose**: Understand TinyTorch's dataset strategy and where to find each dataset used in milestones.

## Design Philosophy

TinyTorch uses a two-tier dataset approach:

### Shipped Datasets

**~350 KB total** - Ships with repository

- Small enough to fit in Git (~1K samples each)
- Fast training (seconds to minutes)
- Instant gratification for learners
- Works offline - no download needed
- Perfect for rapid iteration

### Downloaded Datasets

**~180 MB** - Auto-downloaded when needed

- Standard ML benchmarks (MNIST, CIFAR-10)
- Larger scale (~60K samples)
- Used for validation and scaling
- Downloaded automatically by milestones
- Cached locally for reuse

Philosophy: Following Andrej Karpathy's "~1K samples" approach—small datasets for learning, full benchmarks for validation.


## Shipped Datasets (Included with TinyTorch)

### TinyDigits - Handwritten Digit Recognition

**Location**: `datasets/tinydigits/`
**Size**: ~310 KB
**Used by**: Milestones 03 & 04 (MLP and CNN examples)

**Contents**:

- 1,000 training samples
- 200 test samples
- 8×8 grayscale images (downsampled from MNIST)
- 10 classes (digits 0-9)

**Format**: Python pickle file with NumPy arrays

**Why 8×8?**

- Fast iteration: Trains in seconds
- Memory-friendly: Small enough to debug
- Conceptually complete: Same challenges as 28×28 MNIST
- Git-friendly: Only 310 KB vs 10 MB for full MNIST
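The exact downsampling procedure isn't documented on this page, but the idea can be sketched. One plausible way to reduce a 28×28 MNIST image to 8×8 is a center crop followed by block averaging (`downsample_28_to_8` is an illustrative name, not part of the TinyTorch API):

```python
import numpy as np

def downsample_28_to_8(img28: np.ndarray) -> np.ndarray:
    """Shrink a 28x28 image to 8x8: center-crop to 24x24, then average 3x3 blocks."""
    crop = img28[2:26, 2:26]                            # drop 2-pixel border -> 24x24
    return crop.reshape(8, 3, 8, 3).mean(axis=(1, 3))   # mean over each 3x3 block -> 8x8
```

Block averaging keeps the coarse stroke structure that makes digits recognizable while shrinking storage roughly 12×.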

Usage in milestones:

```python
# Automatically loaded by milestones
from datasets.tinydigits import load_tinydigits

X_train, y_train, X_test, y_test = load_tinydigits()
# X_train shape: (1000, 8, 8)
# y_train shape: (1000,)
```

### TinyTalks - Conversational Q&A Dataset

**Location**: `datasets/tinytalks/`
**Size**: ~40 KB
**Used by**: Milestone 05 (Transformer/GPT text generation)

**Contents**:

- 350 Q&A pairs across 5 difficulty levels
- Character-level text data
- Topics: General knowledge, math, science, reasoning
- Balanced difficulty distribution

**Format**: Plain text files with Q: / A: format

**Why conversational format?**

- Engaging: Questions feel natural
- Varied: Different answer lengths and complexity
- Educational: Difficulty levels scaffold learning
- Practical: Mirrors real chatbot use cases

Example:

```text
Q: What is the capital of France?
A: Paris

Q: If a train travels 120 km in 2 hours, what is its average speed?
A: 60 km/h
```

Usage in milestones:

```python
# Automatically loaded by transformer milestones
from datasets.tinytalks import load_tinytalks

dataset = load_tinytalks()
# Returns list of (question, answer) pairs
```
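Under the hood, loading this format amounts to scanning for `Q:` / `A:` prefixes. A minimal sketch of that logic (the real `load_tinytalks` may differ; `parse_qa` is a hypothetical helper):

```python
def parse_qa(text: str) -> list[tuple[str, str]]:
    """Parse 'Q: ...' / 'A: ...' formatted text into (question, answer) pairs."""
    pairs, question = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Q:"):
            question = line[2:].strip()          # remember the pending question
        elif line.startswith("A:") and question is not None:
            pairs.append((question, line[2:].strip()))
            question = None                      # reset for the next pair
    return pairs
```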

See detailed documentation: datasets/tinytalks/README.md


## Downloaded Datasets (Auto-Downloaded On-Demand)

These standard benchmarks download automatically when you run relevant milestone scripts:

### MNIST - Handwritten Digit Classification

**Downloads to**: `milestones/datasets/mnist/`
**Size**: ~10 MB (compressed)
**Used by**: `milestones/03_1986_mlp/02_rumelhart_mnist.py`

**Contents**:

- 60,000 training samples
- 10,000 test samples
- 28×28 grayscale images
- 10 classes (digits 0-9)

**Auto-download**: When you run the MNIST milestone script, it automatically:

1. Checks if data exists locally
2. Downloads if needed (~10 MB)
3. Caches for future runs
4. Loads data using your TinyTorch DataLoader
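The check-download-cache pattern in steps 1-3 can be sketched as a small helper (illustrative only; the actual logic lives in `milestones/data_manager.py`, whose API isn't shown on this page):

```python
import os
import urllib.request

def fetch_if_missing(url: str, dest: str) -> str:
    """Download a dataset file to dest only if it isn't already cached locally."""
    if not os.path.exists(dest):              # 1. check local cache
        parent = os.path.dirname(dest)
        if parent:
            os.makedirs(parent, exist_ok=True)
        urllib.request.urlretrieve(url, dest)  # 2. download once
    return dest                                # 3. cached path reused on later runs
```

Because the function is idempotent, milestone scripts can call it unconditionally: the first run pays the download cost, every later run is a no-op.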

**Purpose**: Validate that your framework achieves production-level results (95%+ accuracy target)

**Milestone goal**: Implement backpropagation and achieve 95%+ accuracy—matching Rumelhart's 1986 breakthrough.

### CIFAR-10 - Natural Image Classification

**Downloads to**: `milestones/datasets/cifar-10/`
**Size**: ~170 MB (compressed)
**Used by**: `milestones/04_1998_cnn/02_lecun_cifar10.py`

**Contents**:

- 50,000 training samples
- 10,000 test samples
- 32×32 RGB images
- 10 classes (airplane, car, bird, cat, deer, dog, frog, horse, ship, truck)

**Auto-download**: Milestone script handles everything:

1. Downloads from official source
2. Verifies integrity
3. Caches locally
4. Preprocesses for your framework
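Step 2, integrity verification, is typically done by hashing the downloaded archive and comparing it against a published digest. A sketch under that assumption (`verify_sha256` is a hypothetical helper; the actual check in `milestones/data_manager.py` may differ):

```python
import hashlib

def verify_sha256(path: str, expected_hex: str) -> bool:
    """Stream a downloaded archive through SHA-256 and compare to a known digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):  # 64 KiB chunks
            h.update(chunk)                               # avoids loading 170 MB at once
    return h.hexdigest() == expected_hex
```

A failed check means a corrupted or truncated download, so the safe response is to delete the file and re-download.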

**Purpose**: Prove your CNN implementation works on real natural images (75%+ accuracy target)

**Milestone goal**: Build a LeNet-style CNN achieving 75%+ accuracy—demonstrating spatial intelligence.


## Dataset Selection Rationale

### Why These Specific Datasets?

**TinyDigits** (not full MNIST):

- 100× faster training iterations
- Ships with repo (no download)
- Same conceptual challenges
- Perfect for learning and debugging

**TinyTalks** (custom dataset):

- Designed for educational progression
- Scaffolded difficulty levels
- Character-level tokenization friendly
- Engaging conversational format
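To see why character-level data is tokenization friendly: a complete character tokenizer is just a lookup table over the dataset's unique characters. A minimal sketch (illustrative, not TinyTorch's actual tokenizer):

```python
def build_char_codec(text: str):
    """Build encode/decode functions mapping characters <-> integer ids."""
    chars = sorted(set(text))                       # vocabulary = unique characters
    stoi = {c: i for i, c in enumerate(chars)}      # char -> id
    itos = {i: c for i, c in enumerate(chars)}      # id -> char
    encode = lambda s: [stoi[c] for c in s]
    decode = lambda ids: "".join(itos[i] for i in ids)
    return encode, decode
```

No subword merges or vocabulary files are needed, which keeps the Milestone 05 transformer focused on the model rather than the tokenizer.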

**MNIST** (when scaling up):

- Industry standard benchmark
- Validates your implementation
- Comparable to published results
- 95%+ accuracy is achievable milestone

**CIFAR-10** (for CNN validation):

- Natural images (harder than digits)
- RGB channels (multi-dimensional)
- Standard CNN benchmark
- 75%+ with basic CNN proves it works

## Accessing Datasets

### For Students

**You don't need to manually download anything!**

```bash
# Just run milestone scripts
cd milestones/03_1986_mlp
python 01_rumelhart_tinydigits.py  # Uses shipped TinyDigits
python 02_rumelhart_mnist.py       # Auto-downloads MNIST if needed
```

The milestones handle all data loading automatically.

### For Developers/Researchers

Direct dataset access:

```python
# Shipped datasets (always available)
from datasets.tinydigits import load_tinydigits
X_train, y_train, X_test, y_test = load_tinydigits()

from datasets.tinytalks import load_tinytalks
conversations = load_tinytalks()

# Downloaded datasets (through milestones)
# See milestones/data_manager.py for download utilities
```

## Dataset Sizes Summary

| Dataset | Size | Samples | Ships With Repo | Purpose |
|------------|--------|-----------|-----------------|------------------------|
| TinyDigits | 310 KB | 1,200 | Yes | Fast MLP/CNN iteration |
| TinyTalks | 40 KB | 350 pairs | Yes | Transformer learning |
| MNIST | 10 MB | 70,000 | Downloads | MLP validation |
| CIFAR-10 | 170 MB | 60,000 | Downloads | CNN validation |

**Total shipped**: ~350 KB
**Total with benchmarks**: ~180 MB


## Why Ship-with-Repo Matters

Traditional ML courses:

  • "Download MNIST (10 MB)"
  • "Download CIFAR-10 (170 MB)"
  • Wait for downloads before starting
  • Large files in Git (bad practice)

TinyTorch approach:

  • Clone repo → Immediately start learning
  • Train first model in under 1 minute
  • Full benchmarks download only when scaling
  • Git repo stays small and fast

Educational benefit: Students see working models within minutes, not hours.


## Frequently Asked Questions

Q: Why not use full MNIST from the start?
A: TinyDigits trains 100× faster, enabling rapid iteration during learning. MNIST validates your complete implementation later.

Q: Can I use my own datasets?
A: Absolutely! TinyTorch is a real framework—add your data loading code just like PyTorch.
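For instance, a custom dataset can be a small class exposing `__len__` and `__getitem__`, the same interface a PyTorch-style DataLoader consumes. A sketch under that assumption (`ArrayDataset` is an illustrative name, not a TinyTorch class):

```python
import numpy as np

class ArrayDataset:
    """Minimal PyTorch-style dataset wrapping NumPy feature/label arrays."""
    def __init__(self, X: np.ndarray, y: np.ndarray):
        assert len(X) == len(y), "features and labels must align"
        self.X, self.y = X, y

    def __len__(self):
        return len(self.X)          # number of samples

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]  # one (features, label) pair
```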

Q: Why ship datasets in Git?
A: 350 KB is negligible (smaller than many images), and it enables offline learning with instant iteration.

Q: Where does CIFAR-10 download from?
A: Official sources via milestones/data_manager.py, with integrity verification.

Q: Can I skip the large downloads?
A: Yes! You can work through most milestones using only shipped datasets. Downloaded datasets are for validation milestones.


**Dataset implementation details**: See `datasets/tinydigits/README.md` and `datasets/tinytalks/README.md` for technical specifications.