Add validation tool: NBGrader config validator

- Add comprehensive NBGrader configuration validator
- Validates Jupytext headers, solution blocks, cell metadata
- Checks for duplicate grade IDs and proper schema version
- Provides detailed validation reports with severity levels
2026-03-11 22:03:34 -05:00 · 2025-11-11 19:04:58 -05:00
parent 9a0924376e
commit 91ac8458cd
2 changed files with 535 additions and 47 deletions
--- a/datasets/README.md
+++ b/datasets/README.md
@@ -1,67 +1,69 @@
 # TinyTorch Datasets

-This directory contains datasets for TinyTorch examples and training.
+This directory contains datasets for TinyTorch milestone examples.

 ## Directory Structure

 ```
 datasets/
-├── tiny/           ← Tiny datasets shipped with repo (~100KB each)
-│   └── digits_8x8.npz (1,797 samples, 67KB)
-├── mnist/          ← Full MNIST (downloaded, gitignored)
-├── cifar10/        ← Full CIFAR-10 (downloaded, gitignored)
-└── download_*.py   ← Download scripts for large datasets
+├── tinydigits/     ← 8×8 handwritten digits (ships with repo, ~310KB)
+├── tinytalks/      ← Q&A dataset for transformers (ships with repo, ~40KB)
+└── README.md       ← This file
 ```

-## Quick Start
+## Shipped Datasets (No Download Required)

-**For learning (instant, offline):**
-```python
-# Use tiny shipped datasets
-import numpy as np
-data = np.load('datasets/tiny/digits_8x8.npz')
-```
+### TinyDigits
+- **Used by:** Milestones 03 & 04 (MLP and CNN examples)
+- **Contents:** 1,000 training + 200 test samples
+- **Format:** 8×8 grayscale images, pickled
+- **Size:** ~310 KB
+- **Purpose:** Fast iteration on real image classification

-**For serious training (download once):**
-```bash
-python datasets/download_mnist.py
-```
+### TinyTalks
+- **Used by:** Milestone 05 (Transformer/GPT examples)
+- **Contents:** 350 Q&A pairs across 5 difficulty levels
+- **Format:** Plain text (Q: ... A: ... format)
+- **Size:** ~40 KB
+- **Purpose:** Character-level conversational AI training

-## MNIST Dataset
+## Downloaded Datasets (On-Demand)

-The `mnist/` directory should contain the MNIST or Fashion-MNIST dataset files:
- `train-images-idx3-ubyte.gz` - Training images (60,000 samples)
- `train-labels-idx1-ubyte.gz` - Training labels
- `t10k-images-idx3-ubyte.gz` - Test images (10,000 samples)
- `t10k-labels-idx1-ubyte.gz` - Test labels
+The milestones automatically download larger datasets when needed:

-### Downloading the Dataset
+### MNIST
+- **Used by:** `milestones/03_1986_mlp/02_rumelhart_mnist.py`
+- **Downloads to:** `milestones/datasets/mnist/`
+- **Contents:** 60K training + 10K test samples
+- **Format:** 28×28 grayscale images
+- **Size:** ~10 MB compressed
+- **Auto-downloaded by:** `milestones/data_manager.py`

-Run the provided download script:
-```bash
-cd datasets
-python download_mnist.py
-```
+### CIFAR-10
+- **Used by:** `milestones/04_1998_cnn/02_lecun_cifar10.py`
+- **Downloads to:** `milestones/datasets/cifar-10/`
+- **Contents:** 50K training + 10K test samples
+- **Format:** 32×32 RGB images
+- **Size:** ~170 MB compressed
+- **Auto-downloaded by:** `milestones/data_manager.py`

-This will download Fashion-MNIST (which has the same format as MNIST but is more accessible).
+## Design Philosophy

-### Dataset Format
+**Shipped datasets** follow Karpathy's "~1K samples" philosophy:
+- Small enough to ship with repo
+- Large enough for meaningful learning
+- Fast training (seconds to minutes)
+- Instant gratification for students

-Both MNIST and Fashion-MNIST use the same IDX file format:
- Images: 28x28 grayscale pixels
- Labels: Integer values 0-9
- Gzipped for compression
+**Downloaded datasets** are full benchmarks:
+- Standard ML benchmarks (MNIST, CIFAR-10)
+- Larger, slower, more realistic
+- Auto-downloaded only when needed
+- Used for scaling demonstrations

-Fashion-MNIST classes:
- 0: T-shirt/top
- 1: Trouser
- 2: Pullover
- 3: Dress
- 4: Coat
- 5: Sandal
- 6: Shirt
- 7: Sneaker
- 8: Bag
- 9: Ankle boot
+## Total Repository Size

-The examples will work with either original MNIST digits or Fashion-MNIST items.
+- **Shipped data:** ~350 KB (tinydigits + tinytalks)
+- **USB-friendly:** Entire repo fits on any device
+- **Offline-capable:** Core milestones work without internet
+- **Git-friendly:** No large binary files in version control
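
Note on the TinyTalks entry in the updated README: it describes the data as plain text in a "Q: ... A: ..." format for character-level training, but does not specify the exact file layout. The sketch below shows one way such text might be parsed. It assumes each question sits on a line starting with `Q:` and is followed by an answer line starting with `A:`. The function name `parse_qa_pairs` and the sample string are hypothetical and are not taken from the repository.

```python
def parse_qa_pairs(text):
    """Parse text laid out as 'Q: ...' / 'A: ...' lines into (question, answer) tuples.

    Assumed layout (not confirmed by the README): one question line
    followed by one answer line, with optional blank lines between pairs.
    """
    pairs = []
    question = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("Q:"):
            # Remember the question until its answer line arrives.
            question = line[2:].strip()
        elif line.startswith("A:") and question is not None:
            pairs.append((question, line[2:].strip()))
            question = None
    return pairs


# Hypothetical sample in the assumed layout.
sample = "Q: What is 2 + 2?\nA: 4\n\nQ: What color is the sky?\nA: Blue\n"
pairs = parse_qa_pairs(sample)

# A character-level vocabulary is the sorted set of distinct characters.
vocab = sorted(set(sample))
char_to_id = {ch: i for i, ch in enumerate(vocab)}
```

If the real files use a different layout (for example, multi-line answers), the line-prefix checks would need to change accordingly.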