# TinyTorch Datasets
*Ship-with-repo datasets for fast learning: small datasets for instant iteration, standard benchmarks for validation.*
**Purpose**: Understand TinyTorch's dataset strategy and where to find each dataset used in milestones.
## Design Philosophy
TinyTorch uses a two-tier dataset approach:
### Shipped Datasets
**~350 KB total - ships with the repository**
- Small enough to fit in Git (~1K samples each)
- Fast training (seconds to minutes)
- Instant gratification for learners
- Works offline - no download needed
- Perfect for rapid iteration
### Downloaded Datasets
**~180 MB - auto-downloaded when needed**
- Standard ML benchmarks (MNIST, CIFAR-10)
- Larger scale (~60K samples)
- Used for validation and scaling
- Downloaded automatically by milestones
- Cached locally for reuse
**Philosophy**: Following Andrej Karpathy's "~1K samples" approach—small datasets for learning, full benchmarks for validation.
---
## Shipped Datasets (Included with TinyTorch)
### TinyDigits - Handwritten Digit Recognition
**Location**: `datasets/tinydigits/`
**Size**: ~310 KB
**Used by**: Milestones 03 & 04 (MLP and CNN examples)
**Contents:**
- 1,000 training samples
- 200 test samples
- 8×8 grayscale images (downsampled from MNIST)
- 10 classes (digits 0-9)
**Format**: Python pickle file with NumPy arrays
**Why 8×8?**
- Fast iteration: Trains in seconds
- Memory-friendly: Small enough to debug
- Conceptually complete: Same challenges as 28×28 MNIST
- Git-friendly: Only 310 KB vs 10 MB for full MNIST
**Usage in milestones:**
```python
# Automatically loaded by milestones
from datasets.tinydigits import load_tinydigits
X_train, y_train, X_test, y_test = load_tinydigits()
# X_train shape: (1000, 8, 8)
# y_train shape: (1000,)
```
### TinyTalks - Conversational Q&A Dataset
**Location**: `datasets/tinytalks/`
**Size**: ~40 KB
**Used by**: Milestone 05 (Transformer/GPT text generation)
**Contents:**
- 350 Q&A pairs across 5 difficulty levels
- Character-level text data
- Topics: General knowledge, math, science, reasoning
- Balanced difficulty distribution
**Format**: Plain text files with Q: / A: format
**Why conversational format?**
- Engaging: Questions feel natural
- Varied: Different answer lengths and complexity
- Educational: Difficulty levels scaffold learning
- Practical: Mirrors real chatbot use cases
**Example:**
```
Q: What is the capital of France?
A: Paris
Q: If a train travels 120 km in 2 hours, what is its average speed?
A: 60 km/h
```
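The Q:/A: layout above can be parsed with a few lines of plain Python. This is only a sketch of the idea (the actual loader lives in `datasets/tinytalks` and may differ):

```python
def parse_qa(text):
    """Parse alternating 'Q: ...' / 'A: ...' lines into (question, answer) pairs."""
    pairs, question = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:") and question is not None:
            pairs.append((question, line[2:].strip()))
            question = None
    return pairs

sample = "Q: What is the capital of France?\nA: Paris"
print(parse_qa(sample))  # [('What is the capital of France?', 'Paris')]
```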
**Usage in milestones:**
```python
# Automatically loaded by transformer milestones
from datasets.tinytalks import load_tinytalks
dataset = load_tinytalks()
# Returns list of (question, answer) pairs
```
See detailed documentation: `datasets/tinytalks/README.md`
---
## Downloaded Datasets (Auto-Downloaded On-Demand)
These standard benchmarks download automatically when you run relevant milestone scripts:
### MNIST - Handwritten Digit Classification
**Downloads to**: `milestones/datasets/mnist/`
**Size**: ~10 MB (compressed)
**Used by**: `milestones/03_1986_mlp/02_rumelhart_mnist.py`
**Contents:**
- 60,000 training samples
- 10,000 test samples
- 28×28 grayscale images
- 10 classes (digits 0-9)
**Auto-download**: When you run the MNIST milestone script, it automatically:
1. Checks if data exists locally
2. Downloads if needed (~10 MB)
3. Caches for future runs
4. Loads data using your TinyTorch DataLoader
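The check-then-download-then-cache pattern from the steps above can be sketched in a few lines. This is illustrative only; the real logic lives in `milestones/data_manager.py`, and the function name here is made up:

```python
import urllib.request
from pathlib import Path

def fetch_once(url: str, dest: Path) -> Path:
    """Download url to dest unless a cached copy already exists."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():                      # step 1: check if data exists locally
        urllib.request.urlretrieve(url, dest)  # step 2: download if needed
    return dest                                # steps 3-4: cached copy is reused on later runs
```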
**Purpose**: Validate that your framework achieves production-level results (95%+ accuracy target)
**Milestone goal**: Implement backpropagation and achieve 95%+ accuracy, matching Rumelhart's 1986 breakthrough.
### CIFAR-10 - Natural Image Classification
**Downloads to**: `milestones/datasets/cifar-10/`
**Size**: ~170 MB (compressed)
**Used by**: `milestones/04_1998_cnn/02_lecun_cifar10.py`
**Contents:**
- 50,000 training samples
- 10,000 test samples
- 32×32 RGB images
- 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck)
**Auto-download**: The milestone script handles everything:
1. Downloads from official source
2. Verifies integrity
3. Caches locally
4. Preprocesses for your framework
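Step 2 (integrity verification) usually amounts to comparing a checksum of the downloaded archive against a known digest. A sketch of the idea (the function name is illustrative, and a real expected digest would be hard-coded for CIFAR-10):

```python
import hashlib
from pathlib import Path

def verify_sha256(path: Path, expected: str) -> bool:
    """Return True if the file's SHA-256 digest matches the expected hex string."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # hash in 1 MB chunks
            h.update(chunk)
    return h.hexdigest() == expected
```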
**Purpose**: Prove your CNN implementation works on real natural images (75%+ accuracy target)
**Milestone goal**: Build LeNet-style CNN achieving 75%+ accuracy—demonstrating spatial intelligence.
---
## Dataset Selection Rationale
### Why These Specific Datasets?
**TinyDigits (not full MNIST):**
- 100× faster training iterations
- Ships with repo (no download)
- Same conceptual challenges
- Perfect for learning and debugging
**TinyTalks (custom dataset):**
- Designed for educational progression
- Scaffolded difficulty levels
- Character-level tokenization friendly
- Engaging conversational format
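"Character-level tokenization friendly" means the vocabulary is just the set of distinct characters in the corpus, so no subword tokenizer is needed. A minimal sketch (names here are illustrative, not TinyTorch's API):

```python
def build_char_vocab(text):
    """Map each distinct character in the corpus to an integer id."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos

stoi, itos = build_char_vocab("Q: Hi?\nA: Hello")
encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)
assert decode(encode("Hello")) == "Hello"  # round-trips losslessly
```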
**MNIST (when scaling up):**
- Industry standard benchmark
- Validates your implementation
- Comparable to published results
- 95%+ accuracy is achievable milestone
**CIFAR-10 (for CNN validation):**
- Natural images (harder than digits)
- RGB channels (multi-dimensional)
- Standard CNN benchmark
- 75%+ with basic CNN proves it works
---
## Accessing Datasets
### For Students
**You don't need to manually download anything!**
```bash
# Just run milestone scripts
cd milestones/03_1986_mlp
python 01_rumelhart_tinydigits.py # Uses shipped TinyDigits
python 02_rumelhart_mnist.py # Auto-downloads MNIST if needed
```
The milestones handle all data loading automatically.
### For Developers/Researchers
**Direct dataset access:**
```python
# Shipped datasets (always available)
from datasets.tinydigits import load_tinydigits
X_train, y_train, X_test, y_test = load_tinydigits()
from datasets.tinytalks import load_tinytalks
conversations = load_tinytalks()
# Downloaded datasets (through milestones)
# See milestones/data_manager.py for download utilities
```
---
## Dataset Sizes Summary
| Dataset | Size | Samples | Ships With Repo | Purpose |
|---------|------|---------|-----------------|---------|
| TinyDigits | 310 KB | 1,200 | Yes | Fast MLP/CNN iteration |
| TinyTalks | 40 KB | 350 pairs | Yes | Transformer learning |
| MNIST | 10 MB | 70,000 | Downloads | MLP validation |
| CIFAR-10 | 170 MB | 60,000 | Downloads | CNN validation |
**Total shipped**: ~350 KB
**Total with benchmarks**: ~180 MB
---
## Why Ship-with-Repo Matters
**Traditional ML courses:**
- "Download MNIST (10 MB)"
- "Download CIFAR-10 (170 MB)"
- Wait for downloads before starting
- Large files in Git (bad practice)
**TinyTorch approach:**
- Clone repo → Immediately start learning
- Train first model in under 1 minute
- Full benchmarks download only when scaling
- Git repo stays small and fast
**Educational benefit**: Students see working models within minutes, not hours.
---
## Frequently Asked Questions
**Q: Why not use full MNIST from the start?**
A: TinyDigits trains 100× faster, enabling rapid iteration during learning. MNIST validates your complete implementation later.
**Q: Can I use my own datasets?**
A: Absolutely! TinyTorch is a real framework—add your data loading code just like PyTorch.
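One option is to package your data in the same pickle-of-NumPy-arrays shape that TinyDigits uses. A sketch under that assumption (the key names and helper functions here are illustrative):

```python
import pickle
import numpy as np

def save_dataset(path, X_train, y_train, X_test, y_test):
    """Serialize arrays in a TinyDigits-style pickle (illustrative key names)."""
    with open(path, "wb") as f:
        pickle.dump({"X_train": X_train, "y_train": y_train,
                     "X_test": X_test, "y_test": y_test}, f)

def load_dataset(path):
    """Load the four arrays back in the same order load_tinydigits returns them."""
    with open(path, "rb") as f:
        d = pickle.load(f)
    return d["X_train"], d["y_train"], d["X_test"], d["y_test"]
```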
**Q: Why ship datasets in Git?**
A: 350 KB is negligible (smaller than many images), and it enables offline learning with instant iteration.
**Q: Where does CIFAR-10 download from?**
A: Official sources via `milestones/data_manager.py`, with integrity verification.
**Q: Can I skip the large downloads?**
A: Yes! You can work through most milestones using only shipped datasets. Downloaded datasets are for validation milestones.
---
## Related Documentation
- [Milestones Guide](chapters/milestones.md) - See how each dataset is used in historical achievements
- [Student Workflow](student-workflow.md) - Learn the development cycle
- [Quick Start](quickstart-guide.md) - Start building in 15 minutes
**Dataset implementation details**: See `datasets/tinydigits/README.md` and `datasets/tinytalks/README.md` for technical specifications.