mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-05-01 00:38:35 -05:00
- Add comprehensive NBGrader configuration validator - Validates Jupytext headers, solution blocks, cell metadata - Checks for duplicate grade IDs and proper schema version - Provides detailed validation reports with severity levels
70 lines
2.2 KiB
Markdown
70 lines
2.2 KiB
Markdown
# TinyTorch Datasets
|
||
|
||
This directory contains datasets for TinyTorch milestone examples.
|
||
|
||
## Directory Structure
|
||
|
||
```
|
||
datasets/
|
||
├── tinydigits/ ← 8×8 handwritten digits (ships with repo, ~310KB)
|
||
├── tinytalks/ ← Q&A dataset for transformers (ships with repo, ~40KB)
|
||
└── README.md ← This file
|
||
```
|
||
|
||
## Shipped Datasets (No Download Required)
|
||
|
||
### TinyDigits
|
||
- **Used by:** Milestones 03 & 04 (MLP and CNN examples)
|
||
- **Contents:** 1,000 training + 200 test samples
|
||
- **Format:** 8×8 grayscale images, pickled
|
||
- **Size:** ~310 KB
|
||
- **Purpose:** Fast iteration on real image classification
|
||
|
||
### TinyTalks
|
||
- **Used by:** Milestone 05 (Transformer/GPT examples)
|
||
- **Contents:** 350 Q&A pairs across 5 difficulty levels
|
||
- **Format:** Plain text (Q: ... A: ... format)
|
||
- **Size:** ~40 KB
|
||
- **Purpose:** Character-level conversational AI training
|
||
|
||
## Downloaded Datasets (On-Demand)
|
||
|
||
The milestones automatically download larger datasets when needed:
|
||
|
||
### MNIST
|
||
- **Used by:** `milestones/03_1986_mlp/02_rumelhart_mnist.py`
|
||
- **Downloads to:** `milestones/datasets/mnist/`
|
||
- **Contents:** 60K training + 10K test samples
|
||
- **Format:** 28×28 grayscale images
|
||
- **Size:** ~10 MB compressed
|
||
- **Auto-downloaded by:** `milestones/data_manager.py`
|
||
|
||
### CIFAR-10
|
||
- **Used by:** `milestones/04_1998_cnn/02_lecun_cifar10.py`
|
||
- **Downloads to:** `milestones/datasets/cifar-10/`
|
||
- **Contents:** 50K training + 10K test samples
|
||
- **Format:** 32×32 RGB images
|
||
- **Size:** ~170 MB compressed
|
||
- **Auto-downloaded by:** `milestones/data_manager.py`
|
||
|
||
## Design Philosophy
|
||
|
||
**Shipped datasets** follow Karpathy's "~1K samples" philosophy:
|
||
- Small enough to ship with repo
|
||
- Large enough for meaningful learning
|
||
- Fast training (seconds to minutes)
|
||
- Instant gratification for students
|
||
|
||
**Downloaded datasets** are full benchmarks:
|
||
- Standard ML benchmarks (MNIST, CIFAR-10)
|
||
- Larger, slower, more realistic
|
||
- Auto-downloaded only when needed
|
||
- Used for scaling demonstrations
|
||
|
||
## Total Repository Size
|
||
|
||
- **Shipped data:** ~350 KB (tinydigits + tinytalks)
|
||
- **USB-friendly:** Entire repo fits on any device
|
||
- **Offline-capable:** Core milestones work without internet
|
||
- **Git-friendly:** No large binary files in version control
|