Files
TinyTorch/datasets/README.md
Vijay Janapa Reddi f938ad8e19 Add validation tool: NBGrader config validator
- Add comprehensive NBGrader configuration validator
- Validates Jupytext headers, solution blocks, cell metadata
- Checks for duplicate grade IDs and proper schema version
- Provides detailed validation reports with severity levels
2025-11-11 19:04:58 -05:00

70 lines
2.2 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# TinyTorch Datasets
This directory contains datasets for TinyTorch milestone examples.
## Directory Structure
```
datasets/
├── tinydigits/ ← 8×8 handwritten digits (ships with repo, ~310KB)
├── tinytalks/ ← Q&A dataset for transformers (ships with repo, ~40KB)
└── README.md ← This file
```
## Shipped Datasets (No Download Required)
### TinyDigits
- **Used by:** Milestones 03 & 04 (MLP and CNN examples)
- **Contents:** 1,000 training + 200 test samples
- **Format:** 8×8 grayscale images, pickled
- **Size:** ~310 KB
- **Purpose:** Fast iteration on real image classification
### TinyTalks
- **Used by:** Milestone 05 (Transformer/GPT examples)
- **Contents:** 350 Q&A pairs across 5 difficulty levels
- **Format:** Plain text (Q: ... A: ... format)
- **Size:** ~40 KB
- **Purpose:** Character-level conversational AI training
## Downloaded Datasets (On-Demand)
The milestones automatically download larger datasets when needed:
### MNIST
- **Used by:** `milestones/03_1986_mlp/02_rumelhart_mnist.py`
- **Downloads to:** `milestones/datasets/mnist/`
- **Contents:** 60K training + 10K test samples
- **Format:** 28×28 grayscale images
- **Size:** ~10 MB compressed
- **Auto-downloaded by:** `milestones/data_manager.py`
### CIFAR-10
- **Used by:** `milestones/04_1998_cnn/02_lecun_cifar10.py`
- **Downloads to:** `milestones/datasets/cifar-10/`
- **Contents:** 50K training + 10K test samples
- **Format:** 32×32 RGB images
- **Size:** ~170 MB compressed
- **Auto-downloaded by:** `milestones/data_manager.py`
## Design Philosophy
**Shipped datasets** follow Karpathy's "~1K samples" philosophy:
- Small enough to ship with repo
- Large enough for meaningful learning
- Fast training (seconds to minutes)
- Instant gratification for students
**Downloaded datasets** are full benchmarks:
- Standard ML benchmarks (MNIST, CIFAR-10)
- Larger, slower, more realistic
- Auto-downloaded only when needed
- Used for scaling demonstrations
## Total Repository Size
- **Shipped data:** ~350 KB (tinydigits + tinytalks)
- **USB-friendly:** Entire repo fits on any device
- **Offline-capable:** Core milestones work without internet
- **Git-friendly:** No large binary files in version control