# TinyTorch Datasets

**Ship-with-Repo Datasets for Fast Learning:** small datasets for instant iteration, plus standard benchmarks for validation.

**Purpose:** Understand TinyTorch's dataset strategy and where to find each dataset used in the milestones.
## Design Philosophy
TinyTorch uses a two-tier dataset approach:
### Shipped Datasets

**~350 KB total**, ships with the repository:
- Small enough to fit in Git (~1K samples each)
- Fast training (seconds to minutes)
- Instant gratification for learners
- Works offline - no download needed
- Perfect for rapid iteration
### Downloaded Datasets

**~180 MB**, auto-downloaded when needed:
- Standard ML benchmarks (MNIST, CIFAR-10)
- Larger scale (~60K samples)
- Used for validation and scaling
- Downloaded automatically by milestones
- Cached locally for reuse
**Philosophy:** Following Andrej Karpathy's "~1K samples" approach: small datasets for learning, full benchmarks for validation.
## Shipped Datasets (Included with TinyTorch)
### TinyDigits: Handwritten Digit Recognition

**Location:** `datasets/tinydigits/`
**Size:** ~310 KB
**Used by:** Milestones 03 & 04 (MLP and CNN examples)

**Contents:**
- 1,000 training samples
- 200 test samples
- 8×8 grayscale images (downsampled from MNIST)
- 10 classes (digits 0-9)
**Format:** Python pickle file with NumPy arrays

**Why 8×8?**
- Fast iteration: Trains in seconds
- Memory-friendly: Small enough to debug
- Conceptually complete: Same challenges as 28×28 MNIST
- Git-friendly: Only 310 KB vs 10 MB for full MNIST
**Usage in milestones:**

```python
# Automatically loaded by milestones
from datasets.tinydigits import load_tinydigits

X_train, y_train, X_test, y_test = load_tinydigits()
# X_train shape: (1000, 8, 8)
# y_train shape: (1000,)
```
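For an MLP, the 8×8 images above need to be flattened into 64-dimensional vectors before they reach a dense input layer. A minimal sketch using stand-in arrays with the documented shapes (`load_tinydigits` itself is only importable inside the repo):

```python
import numpy as np

# Stand-in arrays matching the documented shapes; the real arrays come
# from load_tinydigits(), which only exists inside the TinyTorch repo.
X_train = np.random.randint(0, 256, size=(1000, 8, 8)).astype(np.float32)

# An MLP takes flat vectors, so each 8x8 image becomes a 64-dim input
X_flat = X_train.reshape(X_train.shape[0], -1)
X_norm = X_flat / X_flat.max()  # scale pixel values into [0, 1]

print(X_flat.shape)  # (1000, 64)
```

The same reshape works unchanged on 28×28 MNIST, which is part of why TinyDigits transfers conceptually.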
### TinyTalks: Conversational Q&A Dataset

**Location:** `datasets/tinytalks/`
**Size:** ~40 KB
**Used by:** Milestone 05 (Transformer/GPT text generation)

**Contents:**
- 350 Q&A pairs across 5 difficulty levels
- Character-level text data
- Topics: General knowledge, math, science, reasoning
- Balanced difficulty distribution
**Format:** Plain text files in `Q:` / `A:` format

**Why conversational format?**
- Engaging: Questions feel natural
- Varied: Different answer lengths and complexity
- Educational: Difficulty levels scaffold learning
- Practical: Mirrors real chatbot use cases
**Example:**

```text
Q: What is the capital of France?
A: Paris

Q: If a train travels 120 km in 2 hours, what is its average speed?
A: 60 km/h
```
**Usage in milestones:**

```python
# Automatically loaded by transformer milestones
from datasets.tinytalks import load_tinytalks

dataset = load_tinytalks()
# Returns a list of (question, answer) pairs
```
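Character-level text like TinyTalks is typically mapped to integer ids before training. A minimal vocabulary-building sketch, using hand-written stand-in pairs rather than the real `load_tinytalks()` output:

```python
# Hand-written stand-in pairs; the real ones come from load_tinytalks()
pairs = [
    ("What is the capital of France?", "Paris"),
    ("What is 2 + 2?", "4"),
]

# Flatten the pairs into one training string in the dataset's Q:/A: format
text = "\n".join(f"Q: {q}\nA: {a}" for q, a in pairs)

# Build the character vocabulary and the two lookup tables
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> integer id
itos = {i: ch for ch, i in stoi.items()}      # integer id -> char

encoded = [stoi[ch] for ch in text]
decoded = "".join(itos[i] for i in encoded)
assert decoded == text  # encoding round-trips losslessly
```

Because the vocabulary is just the set of characters in the corpus, the 40 KB dataset yields a tiny vocabulary, which keeps the transformer's embedding table small.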
See the detailed documentation in `datasets/tinytalks/README.md`.
## Downloaded Datasets (Auto-Downloaded On-Demand)
These standard benchmarks download automatically when you run relevant milestone scripts:
### MNIST: Handwritten Digit Classification

**Downloads to:** `milestones/datasets/mnist/`
**Size:** ~10 MB (compressed)
**Used by:** `milestones/03_1986_mlp/02_rumelhart_mnist.py`

**Contents:**
- 60,000 training samples
- 10,000 test samples
- 28×28 grayscale images
- 10 classes (digits 0-9)
**Auto-download:** When you run the MNIST milestone script, it automatically:
- Checks if data exists locally
- Downloads if needed (~10 MB)
- Caches for future runs
- Loads data using your TinyTorch DataLoader
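The check/download/cache pattern above can be sketched in a few lines. This is an illustrative sketch, not the repo's actual code: `fetch_if_missing` is a hypothetical name, and the real utilities live in `milestones/data_manager.py`.

```python
from pathlib import Path
from urllib.request import urlretrieve

def fetch_if_missing(url: str, dest: Path) -> Path:
    """Download url to dest unless a cached copy already exists."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urlretrieve(url, dest)  # first run downloads; later runs hit the cache
    return dest
```

Because the destination path is checked first, re-running a milestone never re-downloads the ~10 MB archive.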
**Purpose:** Validate that your framework achieves production-level results (95%+ accuracy target).

**Milestone goal:** Implement backpropagation and reach 95%+ accuracy, matching Rumelhart's 1986 breakthrough.
### CIFAR-10: Natural Image Classification

**Downloads to:** `milestones/datasets/cifar-10/`
**Size:** ~170 MB (compressed)
**Used by:** `milestones/04_1998_cnn/02_lecun_cifar10.py`

**Contents:**
- 50,000 training samples
- 10,000 test samples
- 32×32 RGB images
- 10 classes (airplane, car, bird, cat, deer, dog, frog, horse, ship, truck)
**Auto-download:** The milestone script handles everything:
- Downloads from official source
- Verifies integrity
- Caches locally
- Preprocesses for your framework
**Purpose:** Prove your CNN implementation works on real natural images (75%+ accuracy target).

**Milestone goal:** Build a LeNet-style CNN achieving 75%+ accuracy, demonstrating spatial intelligence.
## Dataset Selection Rationale

### Why These Specific Datasets?

**TinyDigits** (not full MNIST):
- 100× faster training iterations
- Ships with repo (no download)
- Same conceptual challenges
- Perfect for learning and debugging
**TinyTalks** (custom dataset):
- Designed for educational progression
- Scaffolded difficulty levels
- Character-level tokenization friendly
- Engaging conversational format
**MNIST** (when scaling up):
- Industry standard benchmark
- Validates your implementation
- Comparable to published results
- 95%+ accuracy is an achievable milestone
**CIFAR-10** (for CNN validation):
- Natural images (harder than digits)
- RGB channels (multi-dimensional)
- Standard CNN benchmark
- 75%+ with basic CNN proves it works
## Accessing Datasets

### For Students

**You don't need to manually download anything!**
```shell
# Just run milestone scripts
cd milestones/03_1986_mlp
python 01_rumelhart_tinydigits.py  # Uses shipped TinyDigits
python 02_rumelhart_mnist.py       # Auto-downloads MNIST if needed
```
The milestones handle all data loading automatically.
### For Developers/Researchers

**Direct dataset access:**

```python
# Shipped datasets (always available)
from datasets.tinydigits import load_tinydigits
from datasets.tinytalks import load_tinytalks

X_train, y_train, X_test, y_test = load_tinydigits()
conversations = load_tinytalks()

# Downloaded datasets are fetched through milestones;
# see milestones/data_manager.py for the download utilities.
```
## Dataset Sizes Summary
| Dataset | Size | Samples | Ships With Repo | Purpose |
|---|---|---|---|---|
| TinyDigits | 310 KB | 1,200 | Yes | Fast MLP/CNN iteration |
| TinyTalks | 40 KB | 350 pairs | Yes | Transformer learning |
| MNIST | 10 MB | 70,000 | Downloads | MLP validation |
| CIFAR-10 | 170 MB | 60,000 | Downloads | CNN validation |
**Total shipped:** ~350 KB
**Total with benchmarks:** ~180 MB
## Why Ship-with-Repo Matters

**Traditional ML courses:**
- "Download MNIST (10 MB)"
- "Download CIFAR-10 (170 MB)"
- Wait for downloads before starting
- Large files in Git (bad practice)
**TinyTorch approach:**
- Clone repo → Immediately start learning
- Train first model in under 1 minute
- Full benchmarks download only when scaling
- Git repo stays small and fast
**Educational benefit:** Students see working models within minutes, not hours.
## Frequently Asked Questions
**Q: Why not use full MNIST from the start?**
A: TinyDigits trains 100× faster, enabling rapid iteration during learning. MNIST validates your complete implementation later.
**Q: Can I use my own datasets?**
A: Absolutely! TinyTorch is a real framework; add your own data loading code just as you would with PyTorch.
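As a sketch of what "your own data loading code" can look like, here is a minimal shuffled batching loop over arbitrary NumPy arrays. This is a hypothetical example, not TinyTorch's actual DataLoader API:

```python
import numpy as np

# Hypothetical custom data: any matching (N, ...) array pair will do
X = np.random.rand(100, 8, 8).astype(np.float32)
y = np.random.randint(0, 10, size=(100,))

def iter_batches(X, y, batch_size=32, shuffle=True, seed=0):
    """Yield (X_batch, y_batch) pairs, optionally in shuffled order."""
    idx = np.arange(len(X))
    if shuffle:
        np.random.default_rng(seed).shuffle(idx)
    for start in range(0, len(idx), batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], y[sel]

n_batches = sum(1 for _ in iter_batches(X, y))  # 100 samples -> 4 batches
```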
**Q: Why ship datasets in Git?**
A: 350 KB is negligible (smaller than many images), and it enables offline learning with instant iteration.
**Q: Where does CIFAR-10 download from?**
A: Official sources via `milestones/data_manager.py`, with integrity verification.
**Q: Can I skip the large downloads?**
A: Yes! You can work through most milestones using only shipped datasets. Downloaded datasets are for validation milestones.
## Related Documentation

- Milestones Guide: see how each dataset is used in historical achievements
- Student Workflow: learn the development cycle
- Quick Start: start building in 15 minutes
**Dataset implementation details:** See `datasets/tinydigits/README.md` and `datasets/tinytalks/README.md` for technical specifications.