mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-04-28 09:18:02 -05:00
Replaces sklearn-sourced digits_8x8.npz with TinyTorch-branded dataset. Changes: - Created datasets/tinydigits/ (~51KB total) - train.pkl: 150 samples (15 per digit class 0-9) - test.pkl: 47 samples (balanced across digits) - README.md: Full curation documentation - LICENSE: BSD 3-Clause with sklearn attribution - create_tinydigits.py: Reproducible generation script - Updated milestones to use TinyDigits: - mlp_digits.py: Now loads from datasets/tinydigits/ - cnn_digits.py: Now loads from datasets/tinydigits/ - Removed old data: - datasets/tiny/ (67KB sklearn duplicate) - milestones/03_1986_mlp/data/ (67KB old location) Dataset Strategy: TinyTorch now ships with only 2 curated datasets: 1. TinyDigits (51KB) - 8x8 digits for MLP/CNN milestones 2. TinyTalks (140KB) - Q&A pairs for transformer milestone Total: 191KB shipped data (perfect for RasPi0 deployment) Rationale: - Self-contained: No downloads, works offline - Citable: TinyTorch educational infrastructure for white paper - Portable: Tiny footprint enables edge device deployment - Fast: <5 sec training enables instant student feedback Updated .gitignore to allow TinyTorch curated datasets while still blocking downloaded large datasets.
TinyTorch Datasets
This directory contains datasets for TinyTorch examples and training.
Directory Structure
datasets/
├── tiny/ ← Tiny datasets shipped with repo (~100KB each)
│ └── digits_8x8.npz (1,797 samples, 67KB)
├── mnist/ ← Full MNIST (downloaded, gitignored)
├── cifar10/ ← Full CIFAR-10 (downloaded, gitignored)
└── download_*.py ← Download scripts for large datasets
Quick Start
For learning (instant, offline):
# Use tiny shipped datasets
import numpy as np
data = np.load('datasets/tiny/digits_8x8.npz')
For serious training (download once):
python datasets/download_mnist.py
MNIST Dataset
The mnist/ directory should contain the MNIST or Fashion-MNIST dataset files:
train-images-idx3-ubyte.gz- Training images (60,000 samples)train-labels-idx1-ubyte.gz- Training labelst10k-images-idx3-ubyte.gz- Test images (10,000 samples)t10k-labels-idx1-ubyte.gz- Test labels
Downloading the Dataset
Run the provided download script:
cd datasets
python download_mnist.py
This will download Fashion-MNIST (which has the same format as MNIST but is more accessible).
Dataset Format
Both MNIST and Fashion-MNIST use the same IDX file format:
- Images: 28x28 grayscale pixels
- Labels: Integer values 0-9
- Gzipped for compression
Fashion-MNIST classes:
- 0: T-shirt/top
- 1: Trouser
- 2: Pullover
- 3: Dress
- 4: Coat
- 5: Sandal
- 6: Shirt
- 7: Sneaker
- 8: Bag
- 9: Ankle boot
The examples will work with either original MNIST digits or Fashion-MNIST items.