mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-05-10 16:03:04 -05:00
TinyTorch Datasets
This directory contains datasets for TinyTorch examples and training.
Directory Structure
datasets/
├── tiny/ ← Tiny datasets shipped with repo (~100KB each)
│ └── digits_8x8.npz (1,797 samples, 67KB)
├── mnist/ ← Full MNIST (downloaded, gitignored)
├── cifar10/ ← Full CIFAR-10 (downloaded, gitignored)
└── download_*.py ← Download scripts for large datasets
Quick Start
For learning (instant, offline):
# Use tiny shipped datasets
import numpy as np
data = np.load('datasets/tiny/digits_8x8.npz')
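The array names stored inside the shipped archive are not listed here, so it can help to inspect the file before indexing into it. The following is a minimal sketch rather than part of TinyTorch: the helper name `describe_npz` is hypothetical, and it relies only on NumPy's `np.load` and the `files` attribute of the returned archive.

```python
import numpy as np


def describe_npz(path):
    """Return {array_name: (shape, dtype)} for every array in an .npz archive.

    Hypothetical helper for illustration; it makes no assumption about
    which array names the archive contains.
    """
    with np.load(path) as archive:
        return {
            name: (archive[name].shape, str(archive[name].dtype))
            for name in archive.files
        }


# Example, using the path from the Quick Start:
#   for name, (shape, dtype) in describe_npz("datasets/tiny/digits_8x8.npz").items():
#       print(f"{name}: shape={shape}, dtype={dtype}")
```

Listing the contents first avoids guessing key names, which may differ between datasets.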
For serious training (download once):
python datasets/download_mnist.py
MNIST Dataset
The mnist/ directory should contain the MNIST or Fashion-MNIST dataset files:
- train-images-idx3-ubyte.gz - Training images (60,000 samples)
- train-labels-idx1-ubyte.gz - Training labels
- t10k-images-idx3-ubyte.gz - Test images (10,000 samples)
- t10k-labels-idx1-ubyte.gz - Test labels
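A quick way to confirm that the four files listed above are in place is to check for them by name. This is a sketch under assumptions: the helper `missing_mnist_files` is hypothetical, and the default directory `datasets/mnist` is taken from the directory structure shown earlier.

```python
from pathlib import Path

# File names as listed above for the mnist/ directory.
MNIST_FILES = (
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
)


def missing_mnist_files(mnist_dir="datasets/mnist"):
    """Return the expected file names that are not present in mnist_dir."""
    root = Path(mnist_dir)
    return [name for name in MNIST_FILES if not (root / name).is_file()]


# Example:
#   missing = missing_mnist_files()
#   if missing:
#       print("Missing files:", ", ".join(missing))
```

An empty list means all four files were found; anything else indicates the download has not been run or did not finish.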
Downloading the Dataset
Run the provided download script:
cd datasets
python download_mnist.py
Despite its name, the script downloads Fashion-MNIST, which uses the same file format as MNIST but is more accessible.
Dataset Format
Both MNIST and Fashion-MNIST use the same IDX file format:
- Images: 28x28 grayscale pixels
- Labels: Integer values 0-9
- Gzipped for compression
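The format described above can be read with a few lines of Python. This is a minimal sketch, not the loader TinyTorch itself uses: the function name `read_idx` is hypothetical, and it assumes the standard IDX header layout (two zero bytes, a one-byte type code where `0x08` means unsigned byte, a one-byte dimension count, then one big-endian 32-bit size per dimension).

```python
import gzip
import struct

import numpy as np


def read_idx(path):
    """Parse a gzipped IDX file of unsigned bytes into a NumPy array.

    Hypothetical helper; handles only the unsigned-byte type (code 0x08)
    used by the MNIST-style image and label files.
    """
    with gzip.open(path, "rb") as f:
        # Magic number: 2 zero bytes, 1 type-code byte, 1 dimension-count byte.
        zeros, dtype_code, ndim = struct.unpack(">HBB", f.read(4))
        if zeros != 0 or dtype_code != 0x08:
            raise ValueError(f"{path}: not an unsigned-byte IDX file")
        # One big-endian 32-bit unsigned integer per dimension.
        shape = struct.unpack(f">{ndim}I", f.read(4 * ndim))
        data = np.frombuffer(f.read(), dtype=np.uint8)
    return data.reshape(shape)


# Example (file names from the MNIST Dataset section):
#   images = read_idx("datasets/mnist/train-images-idx3-ubyte.gz")
#   labels = read_idx("datasets/mnist/train-labels-idx1-ubyte.gz")
```

Because the shape comes from the header, the same function covers both image files (three dimensions) and label files (one dimension).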
Fashion-MNIST classes:
- 0: T-shirt/top
- 1: Trouser
- 2: Pullover
- 3: Dress
- 4: Coat
- 5: Sandal
- 6: Shirt
- 7: Sneaker
- 8: Bag
- 9: Ankle boot
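When printing predictions, the class list above can be kept as a lookup table indexed by label. A small sketch; the names `FASHION_MNIST_CLASSES` and `label_to_name` are illustrative and not taken from the TinyTorch codebase.

```python
# Index i holds the name for label i, in the order listed above.
FASHION_MNIST_CLASSES = [
    "T-shirt/top",
    "Trouser",
    "Pullover",
    "Dress",
    "Coat",
    "Sandal",
    "Shirt",
    "Sneaker",
    "Bag",
    "Ankle boot",
]


def label_to_name(label):
    """Map an integer label (0-9) to its Fashion-MNIST class name."""
    index = int(label)
    if not 0 <= index < len(FASHION_MNIST_CLASSES):
        raise ValueError(f"label must be in 0-9, got {label!r}")
    return FASHION_MNIST_CLASSES[index]


# Example:
#   print(label_to_name(7))
```

The explicit range check prevents a negative label from silently wrapping around to the end of the list.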
The examples will work with either original MNIST digits or Fashion-MNIST items.