# TinyTorch Datasets

**Ship-with-repo datasets for fast learning**: small datasets for instant iteration, plus standard benchmarks for validation.

**Purpose**: Understand TinyTorch's dataset strategy and where to find each dataset used in milestones.

## Design Philosophy

TinyTorch uses a two-tier dataset approach:

- **Shipped Datasets**: ~350 KB total, ships with the repository
- **Downloaded Datasets**: ~180 MB, auto-downloaded when needed

**Philosophy**: Following Andrej Karpathy's "~1K samples" approach: small datasets for learning, full benchmarks for validation.

---

## Shipped Datasets (Included with TinyTorch)

### TinyDigits - Handwritten Digit Recognition
**Location**: `datasets/tinydigits/`
**Size**: ~310 KB
**Used by**: Milestones 03 & 04 (MLP and CNN examples)

**Contents:**
- 1,000 training samples
- 200 test samples
- 8×8 grayscale images (downsampled from MNIST)
- 10 classes (digits 0-9)

**Format**: Python pickle file with NumPy arrays

**Why 8×8?**
- Fast iteration: Trains in seconds
- Memory-friendly: Small enough to debug
- Conceptually complete: Same challenges as 28×28 MNIST
- Git-friendly: Only 310 KB vs ~10 MB for full MNIST

**Usage in milestones:**

```python
# Automatically loaded by milestones
from datasets.tinydigits import load_tinydigits

X_train, y_train, X_test, y_test = load_tinydigits()
# X_train shape: (1000, 8, 8)
# y_train shape: (1000,)
```
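Because the loader returns plain NumPy arrays, preparing TinyDigits for an MLP takes only a couple of lines. Here is a minimal sketch, assuming the shapes shown above; the pixel value range isn't documented here, so the code derives the scaling factor from the data:

```python
import numpy as np

from datasets.tinydigits import load_tinydigits

X_train, y_train, X_test, y_test = load_tinydigits()

# Flatten each 8x8 image into a 64-dimensional vector for an MLP.
X_train_flat = X_train.reshape(len(X_train), -1).astype(np.float32)
X_test_flat = X_test.reshape(len(X_test), -1).astype(np.float32)

# Scale features to [0, 1] using the training-set maximum.
# (The exact pixel range isn't specified above, so we read it off the data.)
scale = X_train_flat.max()
X_train_flat /= scale
X_test_flat /= scale

print(X_train_flat.shape)  # (1000, 64)
```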
### TinyTalks - Conversational Q&A Dataset
**Location**: `datasets/tinytalks/`
**Size**: ~40 KB
**Used by**: Milestone 05 (Transformer/GPT text generation)

**Contents:**
- 350 Q&A pairs across 5 difficulty levels
- Character-level text data
- Topics: General knowledge, math, science, reasoning
- Balanced difficulty distribution

**Format**: Plain text files with Q: / A: format

**Why conversational format?**
- Engaging: Questions feel natural
- Varied: Different answer lengths and complexity
- Educational: Difficulty levels scaffold learning
- Practical: Mirrors real chatbot use cases

**Example:**

```
Q: What is the capital of France?
A: Paris

Q: If a train travels 120 km in 2 hours, what is its average speed?
A: 60 km/h
```

**Usage in milestones:**

```python
# Automatically loaded by transformer milestones
from datasets.tinytalks import load_tinytalks

dataset = load_tinytalks()
# Returns a list of (question, answer) pairs
```

See detailed documentation: `datasets/tinytalks/README.md`
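Because the data is character-level, a transformer milestone only needs a tiny vocabulary built from the corpus itself. The sketch below assumes `load_tinytalks` returns (question, answer) string pairs as described above; `stoi` and `itos` are illustrative names, not part of the dataset API:

```python
from datasets.tinytalks import load_tinytalks

pairs = load_tinytalks()

# Build a character-level vocabulary over the whole corpus.
text = "".join(q + a for q, a in pairs)
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> integer id
itos = {i: ch for ch, i in stoi.items()}      # integer id -> char

def encode(s):
    """Map a string to a list of integer token ids."""
    return [stoi[c] for c in s]

def decode(ids):
    """Map token ids back to a string."""
    return "".join(itos[i] for i in ids)

print(len(chars), "unique characters")
print(decode(encode(pairs[0][0])) == pairs[0][0])  # round-trips: True
```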
---

## Downloaded Datasets (Auto-Downloaded On-Demand)

These standard benchmarks download automatically when you run the relevant milestone scripts:

### MNIST - Handwritten Digit Classification
**Downloads to**: `milestones/datasets/mnist/`
**Size**: ~10 MB (compressed)
**Used by**: `milestones/03_1986_mlp/02_rumelhart_mnist.py`

**Contents:**
- 60,000 training samples
- 10,000 test samples
- 28×28 grayscale images
- 10 classes (digits 0-9)

**Auto-download**: When you run the MNIST milestone script, it automatically:
1. Checks if the data exists locally
2. Downloads it if needed (~10 MB)
3. Caches it for future runs
4. Loads the data using your TinyTorch DataLoader

**Purpose**: Validate that your framework achieves production-level results (95%+ accuracy target)

**Milestone goal**: Implement backpropagation and achieve 95%+ accuracy, matching Rumelhart's 1986 breakthrough.
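The check/download/cache flow is a standard pattern. Below is a hypothetical sketch of it; the real logic lives in `milestones/data_manager.py`, whose API isn't documented here, and the function name and URL handling are illustrative:

```python
import urllib.request
from pathlib import Path

def fetch_file(url: str, dest_dir: str = "milestones/datasets/mnist") -> Path:
    """Download a file once and reuse the cached copy on later runs."""
    dest = Path(dest_dir) / Path(url).name
    if dest.exists():                          # 1. check for a local copy
        return dest                            # cache hit: skip the download
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, dest)      # 2. download if needed
    return dest                                # 3. cached for future runs
```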
### CIFAR-10 - Natural Image Classification
**Downloads to**: `milestones/datasets/cifar-10/`
**Size**: ~170 MB (compressed)
**Used by**: `milestones/04_1998_cnn/02_lecun_cifar10.py`

**Contents:**
- 50,000 training samples
- 10,000 test samples
- 32×32 RGB images
- 10 classes (airplane, car, bird, cat, deer, dog, frog, horse, ship, truck)

**Auto-download**: The milestone script handles everything:
1. Downloads from the official source
2. Verifies integrity
3. Caches locally
4. Preprocesses the data for your framework

**Purpose**: Prove your CNN implementation works on real natural images (75%+ accuracy target)

**Milestone goal**: Build a LeNet-style CNN achieving 75%+ accuracy, demonstrating spatial intelligence.
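Step 2's integrity check typically means comparing a checksum of the downloaded archive against a pinned value. A minimal sketch, with a placeholder hash and an assumed archive filename (the real checksum and paths live in `milestones/data_manager.py`):

```python
import hashlib
from pathlib import Path

def verify_md5(path: Path, expected_md5: str) -> bool:
    """Compare a file's MD5 digest against a known-good value."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MB chunks
            h.update(chunk)
    return h.hexdigest() == expected_md5

# Placeholder values; the real checksum and filename are pinned in data_manager.py.
EXPECTED_MD5 = "0123456789abcdef0123456789abcdef"
archive = Path("milestones/datasets/cifar-10/cifar-10-python.tar.gz")
if archive.exists() and not verify_md5(archive, EXPECTED_MD5):
    archive.unlink()  # corrupt download: delete so the next run re-fetches it
```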
---

## Dataset Selection Rationale

### Why These Specific Datasets?

**TinyDigits (not full MNIST):**
- 100× faster training iterations
- Ships with the repo (no download)
- Same conceptual challenges
- Perfect for learning and debugging

**TinyTalks (custom dataset):**
- Designed for educational progression
- Scaffolded difficulty levels
- Friendly to character-level tokenization
- Engaging conversational format

**MNIST (when scaling up):**
- Industry-standard benchmark
- Validates your implementation
- Comparable to published results
- 95%+ accuracy is an achievable milestone

**CIFAR-10 (for CNN validation):**
- Natural images (harder than digits)
- RGB channels (multi-dimensional)
- Standard CNN benchmark
- 75%+ with a basic CNN proves it works

---

## Accessing Datasets

### For Students

**You don't need to manually download anything!**

```bash
# Just run milestone scripts
cd milestones/03_1986_mlp
python 01_rumelhart_tinydigits.py  # Uses shipped TinyDigits
python 02_rumelhart_mnist.py       # Auto-downloads MNIST if needed
```

The milestones handle all data loading automatically.

### For Developers/Researchers

**Direct dataset access:**

```python
# Shipped datasets (always available)
from datasets.tinydigits import load_tinydigits
X_train, y_train, X_test, y_test = load_tinydigits()

from datasets.tinytalks import load_tinytalks
conversations = load_tinytalks()

# Downloaded datasets (through milestones)
# See milestones/data_manager.py for download utilities
```

For hand-rolled training loops, see the minibatch sketch after the sizes table below.

---

## Dataset Sizes Summary

| Dataset | Size | Samples | Ships With Repo | Purpose |
|---------|------|---------|-----------------|---------|
| TinyDigits | 310 KB | 1,200 | Yes | Fast MLP/CNN iteration |
| TinyTalks | 40 KB | 350 pairs | Yes | Transformer learning |
| MNIST | 10 MB | 70,000 | Downloads | MLP validation |
| CIFAR-10 | 170 MB | 60,000 | Downloads | CNN validation |

**Total shipped**: ~350 KB
**Total with benchmarks**: ~180 MB
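For custom training loops over the shipped arrays, plain NumPy batching is sufficient. This sketch assumes only the `load_tinydigits` interface documented above; no TinyTorch-specific API is used:

```python
import numpy as np

from datasets.tinydigits import load_tinydigits

X_train, y_train, _, _ = load_tinydigits()

def iter_minibatches(X, y, batch_size=32, seed=0):
    """Yield shuffled (X, y) minibatches for one epoch."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

for xb, yb in iter_minibatches(X_train, y_train):
    pass  # forward pass, loss, and backward pass go here
```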
---

## Why Ship-with-Repo Matters

**Traditional ML courses:**
- "Download MNIST (10 MB)"
- "Download CIFAR-10 (170 MB)"
- Wait for downloads before starting
- Large files in Git (bad practice)

**TinyTorch approach:**
- Clone the repo → immediately start learning
- Train a first model in under 1 minute
- Full benchmarks download only when scaling up
- Git repo stays small and fast

**Educational benefit**: Students see working models within minutes, not hours.
---

## Frequently Asked Questions

**Q: Why not use full MNIST from the start?**
A: TinyDigits trains 100× faster, enabling rapid iteration while you learn. MNIST validates your complete implementation later.

**Q: Can I use my own datasets?**
A: Absolutely! TinyTorch is a real framework; add your own data loading code just as you would with PyTorch.

**Q: Why ship datasets in Git?**
A: 350 KB is negligible (smaller than many images), and it enables offline learning with instant iteration.

**Q: Where does CIFAR-10 download from?**
A: Official sources via `milestones/data_manager.py`, with integrity verification.

**Q: Can I skip the large downloads?**
A: Yes! You can work through most milestones using only the shipped datasets. The downloaded datasets are for validation milestones.

---

## Related Documentation

- [Milestones Guide](chapters/milestones.md) - See how each dataset is used in historical achievements
- [Student Workflow](student-workflow.md) - Learn the development cycle
- [Quick Start](quickstart-guide.md) - Start building in 15 minutes

**Dataset implementation details**: See `datasets/tinydigits/README.md` and `datasets/tinytalks/README.md` for technical specifications.