# TinyTorch Datasets

<div style="background: #f8f9fa; padding: 2rem; border-radius: 0.5rem; margin: 2rem 0; text-align: center;">
<h2 style="margin: 0 0 1rem 0; color: #495057;">Ship-with-Repo Datasets for Fast Learning</h2>
<p style="margin: 0; font-size: 1.1rem; color: #6c757d;">Small datasets for instant iteration + standard benchmarks for validation</p>
</div>

**Purpose**: Understand TinyTorch's dataset strategy and where to find each dataset used in milestones.

## Design Philosophy

TinyTorch uses a two-tier dataset approach:

<div style="display: grid; grid-template-columns: 1fr 1fr; gap: 1.5rem; margin: 2rem 0;">

<div style="background: #e3f2fd; border: 1px solid #2196f3; padding: 1.5rem; border-radius: 0.5rem;">
<h3 style="margin: 0 0 1rem 0; color: #1976d2;">Shipped Datasets</h3>
<p style="margin: 0 0 1rem 0;"><strong>~350 KB total - Ships with repository</strong></p>
<ul style="margin: 0; font-size: 0.9rem;">
<li>Small enough to fit in Git (~1K samples each)</li>
<li>Fast training (seconds to minutes)</li>
<li>Instant gratification for learners</li>
<li>Works offline - no download needed</li>
<li>Perfect for rapid iteration</li>
</ul>
</div>

<div style="background: #f3e5f5; border: 1px solid #9c27b0; padding: 1.5rem; border-radius: 0.5rem;">
<h3 style="margin: 0 0 1rem 0; color: #7b1fa2;">Downloaded Datasets</h3>
<p style="margin: 0 0 1rem 0;"><strong>~180 MB - Auto-downloaded when needed</strong></p>
<ul style="margin: 0; font-size: 0.9rem;">
<li>Standard ML benchmarks (MNIST, CIFAR-10)</li>
<li>Larger scale (~60K samples)</li>
<li>Used for validation and scaling</li>
<li>Downloaded automatically by milestones</li>
<li>Cached locally for reuse</li>
</ul>
</div>

</div>

**Philosophy**: Following Andrej Karpathy's "~1K samples" approach: small datasets for learning, full benchmarks for validation.

---

## Shipped Datasets (Included with TinyTorch)

### TinyDigits - Handwritten Digit Recognition

<div style="background: #fff5f5; border-left: 4px solid #e74c3c; padding: 1.5rem; margin: 1.5rem 0;">

**Location**: `datasets/tinydigits/`
**Size**: ~310 KB
**Used by**: Milestones 03 & 04 (MLP and CNN examples)

**Contents:**
- 1,000 training samples
- 200 test samples
- 8×8 grayscale images (downsampled from MNIST)
- 10 classes (digits 0-9)

**Format**: Python pickle file with NumPy arrays

**Why 8×8?**
- Fast iteration: Trains in seconds
- Memory-friendly: Small enough to debug
- Conceptually complete: Same challenges as 28×28 MNIST
- Git-friendly: Only 310 KB vs 10 MB for full MNIST

**Usage in milestones:**
```python
# Automatically loaded by milestones
from datasets.tinydigits import load_tinydigits
X_train, y_train, X_test, y_test = load_tinydigits()
# X_train shape: (1000, 8, 8)
# y_train shape: (1000,)
```

</div>

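The shapes listed for TinyDigits suggest a small preprocessing step before the images reach an MLP. Here is a minimal sketch of that step. It uses synthetic arrays with the documented shapes in place of the real data, and the helper name `flatten_images` and the 0-16 pixel range are assumptions made only for illustration, not part of TinyTorch.

```python
import numpy as np

def flatten_images(images: np.ndarray) -> np.ndarray:
    """Reshape (N, H, W) images to (N, H*W) float32 rows scaled to [0, 1]."""
    flat = images.reshape(images.shape[0], -1).astype(np.float32)
    peak = flat.max()
    # Guard against an all-zero batch to avoid dividing by zero.
    return flat / peak if peak > 0 else flat

# Synthetic stand-in with the documented TinyDigits training shapes
# (assumption: the pixel value range here is arbitrary).
rng = np.random.default_rng(0)
X_train = rng.integers(0, 17, size=(1000, 8, 8))
y_train = rng.integers(0, 10, size=(1000,))

X_flat = flatten_images(X_train)
print(X_flat.shape)  # (1000, 64)
```

Each 8×8 image becomes a 64-value row, which is the input width a fully connected first layer would need.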
### TinyTalks - Conversational Q&A Dataset

<div style="background: #f0fff4; border-left: 4px solid #22c55e; padding: 1.5rem; margin: 1.5rem 0;">

**Location**: `datasets/tinytalks/`
**Size**: ~40 KB
**Used by**: Milestone 05 (Transformer/GPT text generation)

**Contents:**
- 350 Q&A pairs across 5 difficulty levels
- Character-level text data
- Topics: General knowledge, math, science, reasoning
- Balanced difficulty distribution

**Format**: Plain text files with Q: / A: format

**Why conversational format?**
- Engaging: Questions feel natural
- Varied: Different answer lengths and complexity
- Educational: Difficulty levels scaffold learning
- Practical: Mirrors real chatbot use cases

**Example:**
```
Q: What is the capital of France?
A: Paris

Q: If a train travels 120 km in 2 hours, what is its average speed?
A: 60 km/h
```

**Usage in milestones:**
```python
# Automatically loaded by transformer milestones
from datasets.tinytalks import load_tinytalks
dataset = load_tinytalks()
# Returns list of (question, answer) pairs
```

See detailed documentation: `datasets/tinytalks/README.md`

</div>

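To make the Q: / A: text format concrete, here is a rough sketch of how such a file could be turned into (question, answer) pairs. The function `parse_qa_pairs` is hypothetical and is not the loader that ships in `datasets/tinytalks/`; it assumes each question and each answer fits on one line.

```python
def parse_qa_pairs(text: str) -> list[tuple[str, str]]:
    """Parse 'Q: ...' / 'A: ...' lines into (question, answer) pairs."""
    pairs = []
    question = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:") and question is not None:
            pairs.append((question, line[2:].strip()))
            question = None  # wait for the next question
    return pairs

sample = """Q: What is the capital of France?
A: Paris

Q: If a train travels 120 km in 2 hours, what is its average speed?
A: 60 km/h
"""

pairs = parse_qa_pairs(sample)
print(pairs[0])  # ('What is the capital of France?', 'Paris')
```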
---

## Downloaded Datasets (Auto-Downloaded On-Demand)

These standard benchmarks download automatically when you run the relevant milestone scripts:

### MNIST - Handwritten Digit Classification

<div style="background: #fffbeb; border-left: 4px solid #f59e0b; padding: 1.5rem; margin: 1.5rem 0;">

**Downloads to**: `milestones/datasets/mnist/`
**Size**: ~10 MB (compressed)
**Used by**: `milestones/03_1986_mlp/02_rumelhart_mnist.py`

**Contents:**
- 60,000 training samples
- 10,000 test samples
- 28×28 grayscale images
- 10 classes (digits 0-9)

**Auto-download**: When you run the MNIST milestone script, it automatically:
1. Checks if data exists locally
2. Downloads if needed (~10 MB)
3. Caches for future runs
4. Loads data using your TinyTorch DataLoader

**Purpose**: Validate that your framework achieves production-level results (95%+ accuracy target)

**Milestone goal**: Implement backpropagation and achieve 95%+ accuracy, matching Rumelhart's 1986 breakthrough.

</div>

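The first three auto-download steps (check, download, cache) can be sketched roughly as follows. This is not the code in `milestones/data_manager.py`, which may work differently. The function `ensure_cached`, the injectable `fetch` argument, and the example URL are all assumptions for illustration. A fake fetcher stands in for the network so the sketch runs offline.

```python
import tempfile
from pathlib import Path
from typing import Callable, Optional
from urllib.request import urlretrieve

def ensure_cached(url: str, cache_dir: Path,
                  fetch: Optional[Callable[[str, Path], object]] = None) -> Path:
    """Return the local path for `url`, downloading only if it is missing."""
    fetch = fetch or (lambda u, dest: urlretrieve(u, str(dest)))
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / url.rsplit("/", 1)[-1]
    if not target.exists():   # step 1: check for a local copy
        fetch(url, target)    # step 2: download only when needed
    return target             # step 3: later calls reuse the cached file

# Offline demonstration: a fake fetcher records calls instead of downloading.
calls = []

def fake_fetch(url: str, dest: Path) -> None:
    calls.append(url)
    dest.write_bytes(b"placeholder")

with tempfile.TemporaryDirectory() as tmp:
    url = "https://example.com/mnist/train-images.gz"  # hypothetical URL
    first = ensure_cached(url, Path(tmp), fetch=fake_fetch)
    second = ensure_cached(url, Path(tmp), fetch=fake_fetch)
    print(len(calls))  # 1
```

The second call returns the same path without invoking the fetcher, which is the behavior the "Caches for future runs" step describes.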
### CIFAR-10 - Natural Image Classification

<div style="background: #fdf2f8; border-left: 4px solid #ec4899; padding: 1.5rem; margin: 1.5rem 0;">

**Downloads to**: `milestones/datasets/cifar-10/`
**Size**: ~170 MB (compressed)
**Used by**: `milestones/04_1998_cnn/02_lecun_cifar10.py`

**Contents:**
- 50,000 training samples
- 10,000 test samples
- 32×32 RGB images
- 10 classes (airplane, car, bird, cat, deer, dog, frog, horse, ship, truck)

**Auto-download**: Milestone script handles everything:
1. Downloads from official source
2. Verifies integrity
3. Caches locally
4. Preprocesses for your framework

**Purpose**: Prove your CNN implementation works on real natural images (75%+ accuracy target)

**Milestone goal**: Build a LeNet-style CNN that reaches 75%+ accuracy, demonstrating spatial intelligence.

</div>

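The "Verifies integrity" step is not specified further here. One common way to do it is to compare a file's SHA-256 digest against a known value, as in the sketch below. Both the choice of SHA-256 and the helper names `sha256_of` and `verify_file` are assumptions, and the demo uses a small temporary file rather than the real CIFAR-10 archive.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_file(path: Path, expected_hex: str) -> bool:
    """Return True when the file's digest matches the expected checksum."""
    return sha256_of(path) == expected_hex.lower()

# Demonstration on a temporary stand-in file (not the real archive).
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "archive.bin"
    archive.write_bytes(b"example payload")
    expected = hashlib.sha256(b"example payload").hexdigest()
    ok = verify_file(archive, expected)
    tampered = verify_file(archive, "0" * 64)
    print(ok, tampered)  # True False
```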
---

## Dataset Selection Rationale

### Why These Specific Datasets?

**TinyDigits (not full MNIST):**
- 100× faster training iterations
- Ships with repo (no download)
- Same conceptual challenges
- Perfect for learning and debugging

**TinyTalks (custom dataset):**
- Designed for educational progression
- Scaffolded difficulty levels
- Character-level tokenization friendly
- Engaging conversational format

**MNIST (when scaling up):**
- Industry standard benchmark
- Validates your implementation
- Comparable to published results
- 95%+ accuracy is an achievable milestone

**CIFAR-10 (for CNN validation):**
- Natural images (harder than digits)
- RGB channels (multi-dimensional)
- Standard CNN benchmark
- 75%+ with basic CNN proves it works

---

## Accessing Datasets

### For Students

**You don't need to manually download anything!**

```bash
# Just run milestone scripts
cd milestones/03_1986_mlp
python 01_rumelhart_tinydigits.py  # Uses shipped TinyDigits

python 02_rumelhart_mnist.py       # Auto-downloads MNIST if needed
```

The milestones handle all data loading automatically.

### For Developers/Researchers

**Direct dataset access:**

```python
# Shipped datasets (always available)
from datasets.tinydigits import load_tinydigits
X_train, y_train, X_test, y_test = load_tinydigits()

from datasets.tinytalks import load_tinytalks
conversations = load_tinytalks()

# Downloaded datasets (through milestones)
# See milestones/data_manager.py for download utilities
```

---

## Dataset Sizes Summary

| Dataset | Size | Samples | Ships With Repo | Purpose |
|---------|------|---------|-----------------|---------|
| TinyDigits | 310 KB | 1,200 | Yes | Fast MLP/CNN iteration |
| TinyTalks | 40 KB | 350 pairs | Yes | Transformer learning |
| MNIST | 10 MB | 70,000 | Downloads | MLP validation |
| CIFAR-10 | 170 MB | 60,000 | Downloads | CNN validation |

**Total shipped**: ~350 KB
**Total with benchmarks**: ~180 MB

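The two totals follow directly from the table. As a quick check, the short sketch below uses only the approximate sizes listed above; the variable names are illustrative.

```python
# Approximate sizes copied from the summary table.
shipped_kb = {"TinyDigits": 310, "TinyTalks": 40}
downloaded_mb = {"MNIST": 10, "CIFAR-10": 170}

total_shipped_kb = sum(shipped_kb.values())
total_downloaded_mb = sum(downloaded_mb.values())
# Convert the shipped total to MB before adding it to the benchmark total.
total_all_mb = total_downloaded_mb + total_shipped_kb / 1024

print(total_shipped_kb)        # 350
print(total_downloaded_mb)     # 180
print(round(total_all_mb, 1))  # 180.3
```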
---

## Why Ship-with-Repo Matters

<div style="background: #e3f2fd; padding: 1.5rem; border-radius: 0.5rem; margin: 1.5rem 0;">

**Traditional ML courses:**
- "Download MNIST (10 MB)"
- "Download CIFAR-10 (170 MB)"
- Wait for downloads before starting
- Large files in Git (bad practice)

**TinyTorch approach:**
- Clone repo → Immediately start learning
- Train first model in under 1 minute
- Full benchmarks download only when scaling
- Git repo stays small and fast

**Educational benefit**: Students see working models within minutes, not hours.

</div>

---

## Frequently Asked Questions

**Q: Why not use full MNIST from the start?**
A: TinyDigits trains 100× faster, enabling rapid iteration during learning. MNIST validates your complete implementation later.

**Q: Can I use my own datasets?**
A: Absolutely! TinyTorch is a real framework: add your own data-loading code, just as you would in PyTorch.

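As a rough illustration of bringing your own data, the sketch below wraps NumPy arrays in an indexable dataset and yields shuffled mini-batches. The names `ArrayDataset` and `iterate_batches` are hypothetical. TinyTorch's own Dataset and DataLoader interfaces may differ, so treat this as a pattern, not as the framework's API.

```python
import numpy as np

class ArrayDataset:
    """Hypothetical minimal dataset: indexable (features, label) pairs."""

    def __init__(self, features: np.ndarray, labels: np.ndarray):
        if len(features) != len(labels):
            raise ValueError("features and labels must have the same length")
        self.features = features
        self.labels = labels

    def __len__(self) -> int:
        return len(self.features)

    def __getitem__(self, index: int):
        return self.features[index], self.labels[index]

def iterate_batches(dataset: ArrayDataset, batch_size: int, seed: int = 0):
    """Yield shuffled (features, labels) batches as NumPy arrays."""
    order = np.random.default_rng(seed).permutation(len(dataset))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield dataset.features[idx], dataset.labels[idx]

# Ten placeholder 8x8 "images" with labels 0..9.
data = ArrayDataset(np.zeros((10, 8, 8)), np.arange(10))
batches = list(iterate_batches(data, batch_size=4))
print([len(labels) for _, labels in batches])  # [4, 4, 2]
```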
**Q: Why ship datasets in Git?**
A: 350 KB is negligible (smaller than many images), and it enables offline learning with instant iteration.

**Q: Where does CIFAR-10 download from?**
A: Official sources via `milestones/data_manager.py`, with integrity verification.

**Q: Can I skip the large downloads?**
A: Yes! You can work through most milestones using only shipped datasets. Downloaded datasets are for validation milestones.

---

## Related Documentation

- [Milestones Guide](chapters/milestones.md) - See how each dataset is used in historical achievements
- [Student Workflow](student-workflow.md) - Learn the development cycle
- [Quick Start](quickstart-guide.md) - Start building in 15 minutes

**Dataset implementation details**: See `datasets/tinydigits/README.md` and `datasets/tinytalks/README.md` for technical specifications.