# TinyTorch Datasets
<div style="background: #f8f9fa; padding: 2rem; border-radius: 0.5rem; margin: 2rem 0; text-align: center;">
<h2 style="margin: 0 0 1rem 0; color: #495057;">Ship-with-Repo Datasets for Fast Learning</h2>
<p style="margin: 0; font-size: 1.1rem; color: #6c757d;">Small datasets for instant iteration + standard benchmarks for validation</p>
</div>
**Purpose**: Understand TinyTorch's dataset strategy and where to find each dataset used in milestones.
## Design Philosophy
TinyTorch uses a two-tier dataset approach:
<div style="display: grid; grid-template-columns: 1fr 1fr; gap: 1.5rem; margin: 2rem 0;">
<div style="background: #e3f2fd; border: 1px solid #2196f3; padding: 1.5rem; border-radius: 0.5rem;">
<h3 style="margin: 0 0 1rem 0; color: #1976d2;">Shipped Datasets</h3>
<p style="margin: 0 0 1rem 0;"><strong>~350 KB total - Ships with repository</strong></p>
<ul style="margin: 0; font-size: 0.9rem;">
<li>Small enough to fit in Git (~1K samples each)</li>
<li>Fast training (seconds to minutes)</li>
<li>Instant gratification for learners</li>
<li>Works offline - no download needed</li>
<li>Perfect for rapid iteration</li>
</ul>
</div>
<div style="background: #f3e5f5; border: 1px solid #9c27b0; padding: 1.5rem; border-radius: 0.5rem;">
<h3 style="margin: 0 0 1rem 0; color: #7b1fa2;">Downloaded Datasets</h3>
<p style="margin: 0 0 1rem 0;"><strong>~180 MB - Auto-downloaded when needed</strong></p>
<ul style="margin: 0; font-size: 0.9rem;">
<li>Standard ML benchmarks (MNIST, CIFAR-10)</li>
<li>Larger scale (~60K samples)</li>
<li>Used for validation and scaling</li>
<li>Downloaded automatically by milestones</li>
<li>Cached locally for reuse</li>
</ul>
</div>
</div>
**Philosophy**: Following Andrej Karpathy's "~1K samples" approach—small datasets for learning, full benchmarks for validation.
---
## Shipped Datasets (Included with TinyTorch)
### TinyDigits - Handwritten Digit Recognition
<div style="background: #fff5f5; border-left: 4px solid #e74c3c; padding: 1.5rem; margin: 1.5rem 0;">
**Location**: `datasets/tinydigits/`
**Size**: ~310 KB
**Used by**: Milestones 03 & 04 (MLP and CNN examples)
**Contents:**
- 1,000 training samples
- 200 test samples
- 8×8 grayscale images (downsampled from MNIST)
- 10 classes (digits 0-9)
**Format**: Python pickle file with NumPy arrays
**Why 8×8?**
- Fast iteration: Trains in seconds
- Memory-friendly: Small enough to debug
- Conceptually complete: Same challenges as 28×28 MNIST
- Git-friendly: Only 310 KB vs 10 MB for full MNIST
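The 28×28 → 8×8 reduction mentioned above can be done in a few lines of NumPy. This is an illustrative sketch using center-crop plus average pooling, not necessarily the exact preprocessing used to build the shipped TinyDigits files:

```python
import numpy as np

def downsample_28_to_8(img28):
    """Downsample a 28x28 grayscale image to 8x8 by average pooling.

    Crop to the central 24x24 region (divisible by 8), then average
    each 3x3 block. Illustrative only; the shipped TinyDigits data
    may have been produced with a different method.
    """
    cropped = img28[2:26, 2:26]           # 24x24 center crop
    blocks = cropped.reshape(8, 3, 8, 3)  # split into 8x8 grid of 3x3 tiles
    return blocks.mean(axis=(1, 3))       # average each tile -> 8x8

img = np.arange(28 * 28, dtype=np.float64).reshape(28, 28)
small = downsample_28_to_8(img)
print(small.shape)  # (8, 8)
```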
**Usage in milestones:**
```python
# Automatically loaded by milestones
from datasets.tinydigits import load_tinydigits
X_train, y_train, X_test, y_test = load_tinydigits()
# X_train shape: (1000, 8, 8)
# y_train shape: (1000,)
```
</div>
### TinyTalks - Conversational Q&A Dataset
<div style="background: #f0fff4; border-left: 4px solid #22c55e; padding: 1.5rem; margin: 1.5rem 0;">
**Location**: `datasets/tinytalks/`
**Size**: ~40 KB
**Used by**: Milestone 05 (Transformer/GPT text generation)
**Contents:**
- 350 Q&A pairs across 5 difficulty levels
- Character-level text data
- Topics: General knowledge, math, science, reasoning
- Balanced difficulty distribution
**Format**: Plain text files with Q: / A: format
**Why conversational format?**
- Engaging: Questions feel natural
- Varied: Different answer lengths and complexity
- Educational: Difficulty levels scaffold learning
- Practical: Mirrors real chatbot use cases
**Example:**
```
Q: What is the capital of France?
A: Paris
Q: If a train travels 120 km in 2 hours, what is its average speed?
A: 60 km/h
```
**Usage in milestones:**
```python
# Automatically loaded by transformer milestones
from datasets.tinytalks import load_tinytalks
dataset = load_tinytalks()
# Returns list of (question, answer) pairs
```
See detailed documentation: `datasets/tinytalks/README.md`
</div>
---
## Downloaded Datasets (Auto-Downloaded On-Demand)
These standard benchmarks download automatically when you run relevant milestone scripts:
### MNIST - Handwritten Digit Classification
<div style="background: #fffbeb; border-left: 4px solid #f59e0b; padding: 1.5rem; margin: 1.5rem 0;">
**Downloads to**: `milestones/datasets/mnist/`
**Size**: ~10 MB (compressed)
**Used by**: `milestones/03_1986_mlp/02_rumelhart_mnist.py`
**Contents:**
- 60,000 training samples
- 10,000 test samples
- 28×28 grayscale images
- 10 classes (digits 0-9)
**Auto-download**: When you run the MNIST milestone script, it automatically:
1. Checks if data exists locally
2. Downloads if needed (~10 MB)
3. Caches for future runs
4. Loads data using your TinyTorch DataLoader
**Purpose**: Validate that your framework achieves production-level results (95%+ accuracy target)
**Milestone goal**: Implement backpropagation and achieve 95%+ accuracy, matching Rumelhart's 1986 breakthrough.
</div>
### CIFAR-10 - Natural Image Classification
<div style="background: #fdf2f8; border-left: 4px solid #ec4899; padding: 1.5rem; margin: 1.5rem 0;">
**Downloads to**: `milestones/datasets/cifar-10/`
**Size**: ~170 MB (compressed)
**Used by**: `milestones/04_1998_cnn/02_lecun_cifar10.py`
**Contents:**
- 50,000 training samples
- 10,000 test samples
- 32×32 RGB images
- 10 classes (airplane, car, bird, cat, deer, dog, frog, horse, ship, truck)
**Auto-download**: Milestone script handles everything:
1. Downloads from official source
2. Verifies integrity
3. Caches locally
4. Preprocesses for your framework
**Purpose**: Prove your CNN implementation works on real natural images (75%+ accuracy target)
**Milestone goal**: Build LeNet-style CNN achieving 75%+ accuracy—demonstrating spatial intelligence.
</div>
---
## Dataset Selection Rationale
### Why These Specific Datasets?
**TinyDigits (not full MNIST):**
- 100× faster training iterations
- Ships with repo (no download)
- Same conceptual challenges
- Perfect for learning and debugging
**TinyTalks (custom dataset):**
- Designed for educational progression
- Scaffolded difficulty levels
- Character-level tokenization friendly
- Engaging conversational format
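"Character-level tokenization friendly" means the vocabulary is just the set of characters in the corpus, so a complete tokenizer fits in a few lines. A minimal sketch (not TinyTorch's actual tokenizer):

```python
def build_char_tokenizer(text):
    """Build encode/decode functions over the characters of `text`."""
    vocab = sorted(set(text))                     # one token per character
    stoi = {ch: i for i, ch in enumerate(vocab)}  # char -> integer id
    itos = {i: ch for ch, i in stoi.items()}      # integer id -> char
    encode = lambda s: [stoi[ch] for ch in s]
    decode = lambda ids: "".join(itos[i] for i in ids)
    return encode, decode, len(vocab)

encode, decode, vocab_size = build_char_tokenizer("Q: Hi?\nA: Hello!")
print(decode(encode("Hello")))  # Hello
```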
**MNIST (when scaling up):**
- Industry standard benchmark
- Validates your implementation
- Comparable to published results
- 95%+ accuracy is achievable milestone
**CIFAR-10 (for CNN validation):**
- Natural images (harder than digits)
- RGB channels (multi-dimensional)
- Standard CNN benchmark
- 75%+ with basic CNN proves it works
---
## Accessing Datasets
### For Students
**You don't need to manually download anything!**
```bash
# Just run milestone scripts
cd milestones/03_1986_mlp
python 01_rumelhart_tinydigits.py # Uses shipped TinyDigits
python 02_rumelhart_mnist.py # Auto-downloads MNIST if needed
```
The milestones handle all data loading automatically.
### For Developers/Researchers
**Direct dataset access:**
```python
# Shipped datasets (always available)
from datasets.tinydigits import load_tinydigits
X_train, y_train, X_test, y_test = load_tinydigits()
from datasets.tinytalks import load_tinytalks
conversations = load_tinytalks()
# Downloaded datasets (through milestones)
# See milestones/data_manager.py for download utilities
```
---
## Dataset Sizes Summary
| Dataset | Size | Samples | Ships With Repo | Purpose |
|---------|------|---------|-----------------|---------|
| TinyDigits | 310 KB | 1,200 | Yes | Fast MLP/CNN iteration |
| TinyTalks | 40 KB | 350 pairs | Yes | Transformer learning |
| MNIST | 10 MB | 70,000 | Downloads | MLP validation |
| CIFAR-10 | 170 MB | 60,000 | Downloads | CNN validation |
**Total shipped**: ~350 KB
**Total with benchmarks**: ~180 MB
---
## Why Ship-with-Repo Matters
<div style="background: #e3f2fd; padding: 1.5rem; border-radius: 0.5rem; margin: 1.5rem 0;">
**Traditional ML courses:**
- "Download MNIST (10 MB)"
- "Download CIFAR-10 (170 MB)"
- Wait for downloads before starting
- Large files in Git (bad practice)
**TinyTorch approach:**
- Clone repo → Immediately start learning
- Train first model in under 1 minute
- Full benchmarks download only when scaling
- Git repo stays small and fast
**Educational benefit**: Students see working models within minutes, not hours.
</div>
---
## Frequently Asked Questions
**Q: Why not use full MNIST from the start?**
A: TinyDigits trains 100× faster, enabling rapid iteration during learning. MNIST validates your complete implementation later.
**Q: Can I use my own datasets?**
A: Absolutely! TinyTorch is a real framework—add your data loading code just like PyTorch.
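For example, a custom dataset can be any object exposing `__len__` and `__getitem__`, in the PyTorch map-style idiom. Whether TinyTorch's DataLoader expects exactly this interface is an assumption here; check the framework's DataLoader module for the actual contract:

```python
class MyDataset:
    """A minimal map-style dataset: __len__ plus __getitem__.

    Assumes a PyTorch-like contract; TinyTorch's actual DataLoader
    interface may differ slightly.
    """
    def __init__(self, samples, labels):
        assert len(samples) == len(labels), "one label per sample"
        self.samples = samples
        self.labels = labels

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # Return one (input, label) pair by index
        return self.samples[idx], self.labels[idx]

ds = MyDataset([[0.0, 1.0], [1.0, 0.0]], [0, 1])
print(len(ds), ds[1])  # 2 ([1.0, 0.0], 1)
```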
**Q: Why ship datasets in Git?**
A: 350 KB is negligible (smaller than many images), and it enables offline learning with instant iteration.
**Q: Where does CIFAR-10 download from?**
A: Official sources via `milestones/data_manager.py`, with integrity verification.
**Q: Can I skip the large downloads?**
A: Yes! You can work through most milestones using only shipped datasets. Downloaded datasets are for validation milestones.
---
## Related Documentation
- [Milestones Guide](chapters/milestones.md) - See how each dataset is used in historical achievements
- [Student Workflow](student-workflow.md) - Learn the development cycle
- [Quick Start](quickstart-guide.md) - Start building in 15 minutes
**Dataset implementation details**: See `datasets/tinydigits/README.md` and `datasets/tinytalks/README.md` for technical specifications.