
TinyDigits Dataset

A curated subset of the sklearn digits dataset for rapid ML prototyping and educational demonstrations.

Following Karpathy's "~1000 samples" philosophy for educational datasets.

Contents

  • Training: 1000 samples (100 per digit, 0-9)
  • Test: 200 samples (20 per digit, balanced)
  • Format: 8×8 grayscale images, float32, normalized to [0, 1]
  • Size: ~310 KB total (vs ~10 MB for MNIST, ~32× smaller)

Files

datasets/tinydigits/
├── train.pkl  # {'images': (1000, 8, 8), 'labels': (1000,)}
└── test.pkl   # {'images': (200, 8, 8), 'labels': (200,)}

Usage

import pickle

# Load training data
with open('datasets/tinydigits/train.pkl', 'rb') as f:
    data = pickle.load(f)
    train_images = data['images']  # (1000, 8, 8)
    train_labels = data['labels']  # (1000,)

# Load test data
with open('datasets/tinydigits/test.pkl', 'rb') as f:
    data = pickle.load(f)
    test_images = data['images']   # (200, 8, 8)
    test_labels = data['labels']   # (200,)
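
Once loaded, the arrays drop straight into any flattened-input classifier. The sketch below is illustrative, not TinyTorch's own training code: it uses sklearn's MLPClassifier for brevity, and rebuilds an equivalent balanced split directly from sklearn's digits data so it runs even without the .pkl files. Hidden layer size and iteration count are assumptions.

```python
# Hedged sketch: train a small MLP on a TinyDigits-style split.
# Rebuilds the balanced split from sklearn so no .pkl files are needed;
# hyperparameters here are illustrative, not TinyTorch's settings.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

digits = load_digits()
images = (digits.images / 16.0).astype(np.float32)  # pixels 0-16 -> [0, 1]
labels = digits.target

# 100 train + 20 test samples per digit, mirroring train.pkl / test.pkl
train_idx, test_idx = [], []
for d in range(10):
    idx = np.flatnonzero(labels == d)
    train_idx.extend(idx[:100])
    test_idx.extend(idx[100:120])

X_train, y_train = images[train_idx].reshape(-1, 64), labels[train_idx]
X_test, y_test = images[test_idx].reshape(-1, 64), labels[test_idx]

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=42)
clf.fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.2f}")
```

With the actual train.pkl/test.pkl, the loading code above replaces the split-building loop and the rest is unchanged.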

Purpose

Educational Infrastructure: Designed for teaching ML systems with real data at edge-device scale.

Following Andrej Karpathy's philosophy: "~1000 samples is the sweet spot for educational datasets."

  • Decent accuracy: Achieves ~80% test accuracy on MLPs (vs <20% with 150 samples)
  • Fast training: <10 sec on CPU, instant feedback loop
  • Balanced classes: Perfect 100 samples per digit (0-9)
  • Offline-capable: Ships with repo, no downloads needed
  • USB-friendly: 310 KB fits on any device, even RasPi0
  • Real learning curve: Model improves visibly across epochs

Curation Process

Created from the sklearn digits dataset (8×8 hand-written digit images; note these derive from the UCI hand-written digits data, not from downsampled MNIST):

  1. Balanced Sampling: 100 training samples per digit class (1000 total)
  2. Test Split: 20 samples per digit (200 total) from remaining examples
  3. Random Seeding: Reproducible selection (seed=42)
  4. Normalization: Pixels normalized to [0, 1] range
  5. Shuffled: Training and test sets randomly shuffled for fair evaluation
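
The five steps above can be sketched in a few lines. This is an illustrative reconstruction, not the actual create_tinydigits.py; output paths are simplified to the current directory.

```python
# Sketch of the curation steps (illustrative, not create_tinydigits.py):
# balanced sampling, seed=42, [0, 1] normalization, shuffle, pickle.
import pickle
import numpy as np
from sklearn.datasets import load_digits

rng = np.random.default_rng(42)                     # step 3: reproducible selection
digits = load_digits()
images = (digits.images / 16.0).astype(np.float32)  # step 4: pixels 0-16 -> [0, 1]
labels = digits.target

train_idx, test_idx = [], []
for d in range(10):                # steps 1-2: 100 train + 20 test per class
    idx = rng.permutation(np.flatnonzero(labels == d))
    train_idx.extend(idx[:100])
    test_idx.extend(idx[100:120])

splits = {"train.pkl": rng.permutation(train_idx),  # step 5: shuffle so
          "test.pkl": rng.permutation(test_idx)}    # classes interleave
for name, idx in splits.items():
    with open(name, "wb") as f:
        pickle.dump({"images": images[idx], "labels": labels[idx]}, f)
```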

The sklearn digits dataset itself is derived from the UCI ML hand-written digits datasets.

Why TinyDigits vs Full MNIST?

Metric               MNIST              TinyDigits       Benefit
Samples              60,000             1,000            60× fewer samples
File size            10 MB              310 KB           ~32× smaller
Train time           5-10 min           <10 sec          30-60× faster
Test accuracy (MLP)  ~92%               ~80%             Close enough for learning
Download             Network required   Ships with repo  Always available
Resolution           28×28 (784 px)     8×8 (64 px)      Faster forward pass
Edge deployment      Challenging        Perfect          Works on RasPi0

Educational Progression

TinyDigits serves as the first step in a scaffolded learning path:

  1. TinyDigits (8×8) ← Start here: Learn MLP/CNN basics with instant feedback
  2. Full MNIST (28×28) ← Graduate to: Standard benchmark, longer training
  3. CIFAR-10 (32×32 RGB) ← Advanced: Color images, real-world complexity

Citation

TinyDigits is curated from the sklearn digits dataset for educational use in TinyTorch.

Original Source:

  • sklearn.datasets.load_digits()
  • Derived from UCI ML hand-written digits datasets
  • License: BSD 3-Clause (sklearn)

TinyTorch Curation:

@misc{tinydigits2025,
  title={TinyDigits: Curated Educational Dataset for ML Systems Learning},
  author={TinyTorch Project},
  year={2025},
  note={Balanced subset of sklearn digits optimized for edge deployment}
}

Generation

To regenerate this dataset from the original sklearn data:

python3 datasets/tinydigits/create_tinydigits.py

This ensures reproducibility and allows customization for specific educational needs.

License

See LICENSE for details. TinyDigits inherits the BSD 3-Clause license from sklearn.