Vijay Janapa Reddi 84568f0bd5 Create TinyDigits educational dataset for self-contained TinyTorch
Replaces sklearn-sourced digits_8x8.npz with TinyTorch-branded dataset.

Changes:
- Created datasets/tinydigits/ (~51KB total)
  - train.pkl: 150 samples (15 per digit class 0-9)
  - test.pkl: 47 samples (balanced across digits)
  - README.md: Full curation documentation
  - LICENSE: BSD 3-Clause with sklearn attribution
  - create_tinydigits.py: Reproducible generation script

- Updated milestones to use TinyDigits:
  - mlp_digits.py: Now loads from datasets/tinydigits/
  - cnn_digits.py: Now loads from datasets/tinydigits/

- Removed old data:
  - datasets/tiny/ (67KB sklearn duplicate)
  - milestones/03_1986_mlp/data/ (67KB old location)

Dataset Strategy:
TinyTorch now ships with only 2 curated datasets:
1. TinyDigits (51KB) - 8x8 digits for MLP/CNN milestones
2. TinyTalks (140KB) - Q&A pairs for transformer milestone

Total: 191KB shipped data (perfect for RasPi0 deployment)

Rationale:
- Self-contained: No downloads, works offline
- Citable: TinyTorch educational infrastructure for white paper
- Portable: Tiny footprint enables edge device deployment
- Fast: <5 sec training enables instant student feedback

Updated .gitignore to allow TinyTorch curated datasets while
still blocking downloaded large datasets.
2025-11-10 16:59:43 -05:00

TinyTorch Datasets

This directory contains datasets for TinyTorch examples and training.

Directory Structure

datasets/
├── tiny/           ← Tiny datasets shipped with repo (~100KB each)
│   └── digits_8x8.npz (1,797 samples, 67KB)
├── mnist/          ← Full MNIST (downloaded, gitignored)
├── cifar10/        ← Full CIFAR-10 (downloaded, gitignored)
└── download_*.py   ← Download scripts for large datasets

Quick Start

For learning (instant, offline):

# Use tiny shipped datasets
import numpy as np
data = np.load('datasets/tiny/digits_8x8.npz')
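The array names stored inside the archive are not listed in this README, so a minimal sketch for inspecting them is shown here. `describe_npz` is a hypothetical helper, not part of TinyTorch; it relies only on NumPy's `np.load` and the `.files` attribute of the returned archive object.

```python
import numpy as np


def describe_npz(path):
    """Return {array_name: (shape, dtype_string)} for every array in an .npz file.

    Hypothetical helper for illustration; it makes no assumption about
    which array names the archive contains.
    """
    with np.load(path) as data:
        return {
            name: (data[name].shape, str(data[name].dtype))
            for name in data.files
        }
```

Calling `describe_npz('datasets/tiny/digits_8x8.npz')` from the repository root should report each stored array with its shape and dtype, which is a quick way to find the image and label arrays before indexing into `data`.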

For serious training (download once):

python datasets/download_mnist.py

MNIST Dataset

The mnist/ directory should contain the MNIST or Fashion-MNIST dataset files:

  • train-images-idx3-ubyte.gz - Training images (60,000 samples)
  • train-labels-idx1-ubyte.gz - Training labels
  • t10k-images-idx3-ubyte.gz - Test images (10,000 samples)
  • t10k-labels-idx1-ubyte.gz - Test labels
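To confirm that a download completed, the four file names above can be checked directly. This is a minimal sketch; `missing_mnist_files` is a hypothetical helper and is not taken from `download_mnist.py`.

```python
from pathlib import Path

# The four files listed above.
EXPECTED_FILES = [
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
]


def missing_mnist_files(directory):
    """Return the expected file names that are not present in `directory`."""
    root = Path(directory)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list from `missing_mnist_files('datasets/mnist')` means all four files are in place.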

Downloading the Dataset

Run the provided download script:

cd datasets
python download_mnist.py

This downloads Fashion-MNIST, which uses the same file format as MNIST and is easier to obtain.

Dataset Format

Both MNIST and Fashion-MNIST use the same IDX file format:

  • Images: 28x28 grayscale pixels, one unsigned byte (0-255) each
  • Labels: Integer values 0-9, one unsigned byte each
  • Files are gzipped for compression
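The IDX layout is simple enough to parse by hand: a 4-byte header (two zero bytes, a data-type code, and the number of dimensions), then one big-endian 32-bit size per dimension, then the raw data. The following is a minimal sketch of a reader for the unsigned-byte case used by these files; `read_idx` is a hypothetical name, not a TinyTorch function.

```python
import gzip
import struct

import numpy as np


def read_idx(path):
    """Read a gzipped IDX file of unsigned bytes into a NumPy array."""
    with gzip.open(path, "rb") as f:
        # Header: two zero bytes, a type code, and the number of dimensions.
        zero, dtype_code, ndim = struct.unpack(">HBB", f.read(4))
        if zero != 0 or dtype_code != 0x08:  # 0x08 means unsigned byte
            raise ValueError(f"unsupported IDX header in {path}")
        # Each dimension size is a big-endian 32-bit unsigned integer.
        shape = struct.unpack(f">{ndim}I", f.read(4 * ndim))
        data = np.frombuffer(f.read(), dtype=np.uint8)
    return data.reshape(shape)
```

For the image files this returns an array of shape `(N, 28, 28)`, and for the label files an array of shape `(N,)`. The result is read-only because it wraps the decompressed buffer; call `.copy()` or `.astype(...)` before modifying it.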

Fashion-MNIST classes:

  • 0: T-shirt/top
  • 1: Trouser
  • 2: Pullover
  • 3: Dress
  • 4: Coat
  • 5: Sandal
  • 6: Shirt
  • 7: Sneaker
  • 8: Bag
  • 9: Ankle boot
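When printing predictions it helps to map integer labels back to these names. A minimal sketch, with `FASHION_MNIST_CLASSES` and `label_to_name` as hypothetical names chosen for illustration:

```python
# Index i holds the name for label i, matching the list above.
FASHION_MNIST_CLASSES = [
    "T-shirt/top",
    "Trouser",
    "Pullover",
    "Dress",
    "Coat",
    "Sandal",
    "Shirt",
    "Sneaker",
    "Bag",
    "Ankle boot",
]


def label_to_name(label):
    """Return the Fashion-MNIST class name for an integer label in 0-9."""
    index = int(label)
    if not 0 <= index < len(FASHION_MNIST_CLASSES):
        raise ValueError(f"label out of range: {label}")
    return FASHION_MNIST_CLASSES[index]
```

The `int(...)` conversion lets the function accept NumPy integer scalars as well as plain Python integers.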

The examples work with either the original MNIST digits or the Fashion-MNIST items, because both datasets use the same file names and format.