mirror of https://github.com/MLSysBook/TinyTorch.git synced 2026-04-29 06:47:52 -05:00

Vijay Janapa Reddi 89c9e0dd7e Create TinyDigits educational dataset for self-contained TinyTorch

Replaces sklearn-sourced digits_8x8.npz with TinyTorch-branded dataset.

Changes:
- Created datasets/tinydigits/ (~51KB total)
  - train.pkl: 150 samples (15 per digit class 0-9)
  - test.pkl: 47 samples (balanced across digits)
  - README.md: Full curation documentation
  - LICENSE: BSD 3-Clause with sklearn attribution
  - create_tinydigits.py: Reproducible generation script

- Updated milestones to use TinyDigits:
  - mlp_digits.py: Now loads from datasets/tinydigits/
  - cnn_digits.py: Now loads from datasets/tinydigits/

- Removed old data:
  - datasets/tiny/ (67KB sklearn duplicate)
  - milestones/03_1986_mlp/data/ (67KB old location)

Dataset Strategy:
TinyTorch now ships with only 2 curated datasets:
1. TinyDigits (51KB) - 8x8 digits for MLP/CNN milestones
2. TinyTalks (140KB) - Q&A pairs for transformer milestone

Total: 191KB shipped data (perfect for RasPi0 deployment)

Rationale:
- Self-contained: No downloads, works offline
- Citable: TinyTorch educational infrastructure for white paper
- Portable: Tiny footprint enables edge device deployment
- Fast: <5 sec training enables instant student feedback

Updated .gitignore to allow TinyTorch curated datasets while
still blocking downloaded large datasets.

2025-11-10 16:59:43 -05:00

TinyDigits Dataset

A curated subset of the sklearn digits dataset for rapid ML prototyping and educational demonstrations.

Training: 150 samples (15 per digit, 0-9)
Test: 47 samples (roughly balanced, 4-5 per digit)
Format: 8×8 grayscale images, float32, normalized to [0, 1]
Size: ~51 KB total (vs 67 KB original, 10 MB MNIST)

Files

datasets/tinydigits/
├── train.pkl  # {'images': (150, 8, 8), 'labels': (150,)}
└── test.pkl   # {'images': (47, 8, 8), 'labels': (47,)}

Usage

import pickle

# Load training data
with open('datasets/tinydigits/train.pkl', 'rb') as f:
    data = pickle.load(f)
    train_images = data['images']  # (150, 8, 8)
    train_labels = data['labels']  # (150,)

# Load test data
with open('datasets/tinydigits/test.pkl', 'rb') as f:
    data = pickle.load(f)
    test_images = data['images']   # (47, 8, 8)
    test_labels = data['labels']   # (47,)
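Because the documented split sizes and class balance are easy to check, a small loader can verify them right after unpickling. The following is a minimal sketch, not part of TinyTorch: the helper name `load_split` is hypothetical, and it assumes only the `{'images': ..., 'labels': ...}` dictionary layout described above.

```python
import pickle
from collections import Counter


def load_split(path):
    """Load one TinyDigits split and run basic consistency checks.

    Hypothetical helper: assumes the pickle holds a dict with
    'images' and 'labels' entries of equal length.
    """
    with open(path, "rb") as f:
        data = pickle.load(f)  # only unpickle files from a source you trust

    images, labels = data["images"], data["labels"]
    if len(images) != len(labels):
        raise ValueError(f"{path}: {len(images)} images but {len(labels)} labels")

    # Count how many samples each digit class contributes.
    per_class = Counter(int(label) for label in labels)
    return images, labels, per_class
```

Calling `load_split('datasets/tinydigits/train.pkl')` should, given the figures above, return 150 samples with a `per_class` count of 15 for each digit 0-9.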

Purpose

Educational Infrastructure: Designed for teaching ML systems with real data at edge-device scale.

- Fast iteration during development (<5 sec training)
- Instant "it works!" moment for students
- Offline-capable demos (no downloads)
- CI/CD friendly (lightweight tests)
- Deployable on a Raspberry Pi Zero (RasPi0): the tiny footprint helps make ML education accessible on low-cost hardware

Curation Process

Created from the sklearn digits dataset (8×8 grayscale images; a separate dataset from MNIST, not a downsampled copy of it):

1. Balanced Sampling: 15 training samples per digit class (150 total)
2. Test Split: 4-5 samples per digit (47 total) from remaining examples
3. Random Seeding: Reproducible selection (seed=42)
4. Shuffled: Training and test sets randomly shuffled for fair evaluation

The sklearn digits dataset itself is derived from the UCI ML hand-written digits datasets.
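The curation steps can be sketched in plain Python as follows. This is an illustration only: the actual `create_tinydigits.py` may work differently. The function name `balanced_split` is hypothetical, and the rule for spreading 47 test samples over 10 classes (7 classes get 5, 3 classes get 4, chosen at random) is an assumption that merely matches the "4-5 samples per digit" description.

```python
import random
from collections import defaultdict


def balanced_split(labels, train_per_class=15, test_total=47, seed=42):
    """Return (train_idx, test_idx), two lists of indices into `labels`."""
    rng = random.Random(seed)  # fixed seed makes the selection reproducible

    # Group sample indices by class, then shuffle within each class.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    classes = sorted(by_class)
    for label in classes:
        rng.shuffle(by_class[label])

    # Assumed rule: spread test_total as evenly as possible over the classes.
    base, extra = divmod(test_total, len(classes))
    extra_classes = set(rng.sample(classes, extra))

    train_idx, test_idx = [], []
    for label in classes:
        pool = by_class[label]
        n_test = base + (1 if label in extra_classes else 0)
        if len(pool) < train_per_class + n_test:
            raise ValueError(f"class {label} has too few samples")
        train_idx += pool[:train_per_class]
        # Test samples come from the examples left after the training pick.
        test_idx += pool[train_per_class:train_per_class + n_test]

    # Shuffle so neither split is ordered by class.
    rng.shuffle(train_idx)
    rng.shuffle(test_idx)
    return train_idx, test_idx
```

Drawing the test samples from the remainder of each class pool keeps the two splits disjoint, and passing the same seed reproduces the same indices.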

Why TinyDigits vs Full MNIST?

| Metric | MNIST | TinyDigits | Benefit |
|--------|-------|------------|---------|
| Samples | 60,000 | 150 | 400× fewer samples |
| File size | 10 MB | 51 KB | 200× smaller |
| Train time | 5-10 min | <5 sec | 60-120× faster |
| Download | Network required | Ships with repo | Always available |
| Resolution | 28×28 (784 pixels) | 8×8 (64 pixels) | Faster forward pass |
| Edge deployment | Challenging | Perfect | Works on RasPi0 |
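The ratios in the Benefit column follow directly from the other columns. The short check below reproduces the arithmetic; it assumes decimal units (10 MB = 10,000 KB) and takes the "<5 sec" bound as exactly 5 seconds, so the speedup figures are lower bounds.

```python
# Figures taken from the comparison table.
mnist_samples, tiny_samples = 60_000, 150
mnist_kb, tiny_kb = 10 * 1000, 51            # assumption: 1 MB = 1000 KB
mnist_pixels, tiny_pixels = 28 * 28, 8 * 8   # 784 vs 64 pixels per image
tiny_train_s = 5                             # assumption: "<5 sec" taken as 5 s

sample_ratio = mnist_samples / tiny_samples  # 400.0
size_ratio = mnist_kb / tiny_kb              # about 196, rounded to 200x in the table
pixel_ratio = mnist_pixels / tiny_pixels     # 12.25x fewer inputs per image
speedup_low = 5 * 60 / tiny_train_s          # 60.0  (5 min vs 5 s)
speedup_high = 10 * 60 / tiny_train_s        # 120.0 (10 min vs 5 s)

print(f"samples: {sample_ratio:.0f}x, size: {size_ratio:.0f}x, "
      f"pixels: {pixel_ratio:.2f}x, speedup: {speedup_low:.0f}-{speedup_high:.0f}x")
```

The pixel ratio is what stands behind the "Faster forward pass" entry: a first dense layer on 8×8 inputs sees 64 features instead of 784.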

Educational Progression

TinyDigits serves as the first step in a scaffolded learning path:

1. TinyDigits (8×8) ← Start here: Learn MLP/CNN basics with instant feedback
2. Full MNIST (28×28) ← Graduate to: Standard benchmark, longer training
3. CIFAR-10 (32×32 RGB) ← Advanced: Color images, real-world complexity

Citation

TinyDigits is curated from the sklearn digits dataset for educational use in TinyTorch.

Original Source:

sklearn.datasets.load_digits()
Derived from UCI ML hand-written digits datasets
License: BSD 3-Clause (sklearn)

TinyTorch Curation:

@misc{tinydigits2025,
  title={TinyDigits: Curated Educational Dataset for ML Systems Learning},
  author={TinyTorch Project},
  year={2025},
  note={Balanced subset of sklearn digits optimized for edge deployment}
}

Generation

To regenerate this dataset from the original sklearn data:

python3 datasets/tinydigits/create_tinydigits.py

This ensures reproducibility and allows customization for specific educational needs.
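One way to check that a regeneration run reproduced the shipped files is to compare checksums before and after running the script. The sketch below uses hypothetical helper names (`sha256_of`, `snapshot`) and assumes the two file names listed under Files. Byte-identical pickles are not guaranteed across different NumPy or Python versions, so a digest mismatch is a prompt to compare the array contents, not proof that the data changed.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def snapshot(directory, names=("train.pkl", "test.pkl")):
    """Map each dataset file name to its digest (hypothetical helper)."""
    return {name: sha256_of(Path(directory) / name) for name in names}
```

A typical use would be to call `snapshot('datasets/tinydigits')` once, rerun the generation script, call it again, and compare the two dictionaries for equality.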

License

See LICENSE for details. TinyDigits inherits the BSD 3-Clause license from sklearn.
