Vijay Janapa Reddi af12404076 Increase TinyDigits to 1000 samples following Karpathy's philosophy
You were right - 150 samples was too small for decent accuracy.
Following Andrej Karpathy's "~1000 samples" educational dataset philosophy.

Results:
- Before (150 samples): 19% test accuracy (too small!)
- After (1000 samples): 79.5% test accuracy (decent!)

Changes:
- Increased training: 150 → 1000 samples (100 per digit class)
- Increased test: 47 → 200 samples (20 per digit class)
- Perfect class balance: 0.00 std deviation
- File size: 51 KB → 310 KB (still tiny for USB stick)
- Training time: ~3-5 sec → ~8-10 sec (still fast)

Updated:
- create_tinydigits.py: Load from sklearn, generate 1K samples
- train.pkl: 258 KB (1000 samples, perfectly balanced)
- test.pkl: 52 KB (200 samples, balanced)
- README.md: Updated all documentation with new sizes
- mlp_digits.py: Updated docstring to reflect 1K dataset

Dataset Philosophy:
"~1000 samples is the sweet spot for educational datasets"
- Small enough: Trains in seconds on CPU
- Large enough: Achieves decent accuracy (~80%)
- Balanced: Perfect stratification across all classes
- Reproducible: Fixed seed=42 for consistency

Still perfect for TinyTorch-on-a-stick vision:
- 310 KB fits on any USB drive
- Works on RasPi0
- No downloads needed
- Offline-first education
2025-11-10 17:20:54 -05:00

TinyTorch Datasets

This directory contains datasets for TinyTorch examples and training.

Directory Structure

datasets/
├── tiny/           ← Tiny datasets shipped with repo (~100KB each)
│   └── digits_8x8.npz (1,797 samples, 67KB)
├── mnist/          ← Full MNIST (downloaded, gitignored)
├── cifar10/        ← Full CIFAR-10 (downloaded, gitignored)
└── download_*.py   ← Download scripts for large datasets

Quick Start

For learning (instant, offline):

# Use tiny shipped datasets
import numpy as np
data = np.load('datasets/tiny/digits_8x8.npz')
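
The array names stored inside the archive are not listed in this README, so it can help to inspect the file before using it. The sketch below relies only on NumPy's `NpzFile.files` attribute; the helper name `describe_npz` is hypothetical and not part of TinyTorch.

```python
import numpy as np

def describe_npz(path):
    """Return {array_name: (shape, dtype)} for every array in an .npz archive."""
    summary = {}
    # np.load returns a lazy NpzFile; the context manager closes the file handle.
    with np.load(path) as archive:
        for name in archive.files:
            array = archive[name]
            summary[name] = (array.shape, str(array.dtype))
    return summary

# Example call (assumes the shipped file is present at this path):
# for name, (shape, dtype) in describe_npz('datasets/tiny/digits_8x8.npz').items():
#     print(f"{name}: shape={shape}, dtype={dtype}")
```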

For serious training (download once):

python datasets/download_mnist.py

MNIST Dataset

The mnist/ directory should contain the MNIST or Fashion-MNIST dataset files:

  • train-images-idx3-ubyte.gz - Training images (60,000 samples)
  • train-labels-idx1-ubyte.gz - Training labels
  • t10k-images-idx3-ubyte.gz - Test images (10,000 samples)
  • t10k-labels-idx1-ubyte.gz - Test labels

Downloading the Dataset

Run the provided download script:

cd datasets
python download_mnist.py

The script downloads Fashion-MNIST, which uses the same file format as MNIST but is more accessible.

Dataset Format

Both MNIST and Fashion-MNIST use the same IDX file format:

  • Images: 28x28 grayscale, one unsigned byte per pixel (0-255)
  • Labels: integer values 0-9, one unsigned byte per label
  • Files are gzip-compressed
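
As a rough illustration of the format above, a gzipped IDX file can be read with the standard library plus NumPy. An IDX header is four bytes (two zero bytes, a data-type code, and the number of dimensions), followed by one 4-byte big-endian size per dimension, then the raw data. The function name `read_idx_gz` is hypothetical, and this sketch handles only the unsigned-byte type (code 0x08) that MNIST and Fashion-MNIST use; it is not necessarily how the TinyTorch examples load the data.

```python
import gzip
import struct

import numpy as np

def read_idx_gz(path):
    """Read a gzipped IDX file of unsigned bytes into a NumPy array."""
    with gzip.open(path, "rb") as f:
        # Header: two zero bytes, a data-type code, and the dimension count.
        zero, dtype_code, ndim = struct.unpack(">HBB", f.read(4))
        if zero != 0 or dtype_code != 0x08:
            raise ValueError(f"{path}: not an unsigned-byte IDX file")
        # Each dimension size is a 4-byte big-endian unsigned integer.
        shape = struct.unpack(">" + "I" * ndim, f.read(4 * ndim))
        data = np.frombuffer(f.read(), dtype=np.uint8)
    # reshape raises ValueError if the payload size does not match the header.
    return data.reshape(shape)
```

Assuming the files listed earlier are present in mnist/, `read_idx_gz('datasets/mnist/train-images-idx3-ubyte.gz')` should return an array of shape (60000, 28, 28), and the matching labels file an array of shape (60000,).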

Fashion-MNIST classes:

  • 0: T-shirt/top
  • 1: Trouser
  • 2: Pullover
  • 3: Dress
  • 4: Coat
  • 5: Sandal
  • 6: Shirt
  • 7: Sneaker
  • 8: Bag
  • 9: Ankle boot
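
For printing predictions in readable form, the class list above can be kept as a simple lookup table. This is a minimal sketch; the names `FASHION_MNIST_CLASSES` and `label_name` are hypothetical and are not taken from the TinyTorch code.

```python
# Index i holds the name of Fashion-MNIST class i, in the order listed above.
FASHION_MNIST_CLASSES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]

def label_name(label):
    """Map an integer label (0-9) to its Fashion-MNIST class name."""
    index = int(label)
    if not 0 <= index < len(FASHION_MNIST_CLASSES):
        raise ValueError(f"label must be in 0-9, got {label}")
    return FASHION_MNIST_CLASSES[index]
```

With original MNIST digits the label is already the digit itself, so no lookup is needed.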

The examples work with either the original MNIST digits or the Fashion-MNIST items.