mirror of https://github.com/MLSysBook/TinyTorch.git synced 2026-06-04 01:30:53 -05:00

Files

Vijay Janapa Reddi 92b70c3646 Add tiny datasets infrastructure with 8×8 digits

Created datasets/tiny/ for shipping small datasets with TinyTorch:

New Structure:
- datasets/tiny/digits_8x8.npz (67KB, 1,797 samples)
  - 8×8 handwritten digits from UCI/sklearn
  - Normalized to [0-1], ready for immediate use
  - Perfect for DataLoader learning (Module 08)

- datasets/tiny/README.md
  - Full documentation and usage examples
  - Philosophy: tiny (learn) → full (practice) → custom (master)

- datasets/tiny/create_digits_8x8.py
  - Extraction script showing how dataset was created
  - Reproducible from sklearn.datasets.load_digits()

Updated .gitignore:
- Ignore datasets/* (downloaded large files)
- Allow datasets/tiny/ (shipped small files)
- Allow datasets/README.md and download scripts
- Selectively ignore .npz files (not in tiny/)

Benefits:
✅ Zero download friction for Module 08
✅ Offline-friendly (planes, classrooms, slow networks)
✅ Real handwritten digits (not synthetic noise)
✅ Git-friendly size (67KB vs 10MB MNIST)
✅ Same shape/format students will use for CNNs

Progression:
- Module 08: Learn DataLoader with 8×8 digits
- Milestone 03: Train on full 28×28 MNIST
- Milestone 04: Scale to CIFAR-10

2025-09-30 15:05:34 -04:00

create_digits_8x8.py
digits_8x8.npz
README.md

README.md

Tiny Datasets for TinyTorch

Small, curated datasets that ship with TinyTorch - no downloads required!

These datasets are committed to the repository for instant, offline-friendly learning.

📊 Available Datasets

8×8 Handwritten Digits

File: digits_8x8.npz
Size: ~67 KB
Samples: 1,797 images
Shape: (8, 8) grayscale
Classes: 10 digits (0-9)
Source: UCI ML Repository via sklearn

Perfect for:

Learning DataLoader mechanics
Quick CNN testing
Offline development
Educational demos

Usage:

import numpy as np
from tinytorch import Tensor
from tinytorch.data.loader import TensorDataset, DataLoader

# Load the dataset
data = np.load('datasets/tiny/digits_8x8.npz')
images = Tensor(data['images'])
labels = Tensor(data['labels'])

# Create dataset and loader
dataset = TensorDataset(images, labels)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

# Iterate through batches
for batch_images, batch_labels in loader:
    print(f"Batch: {batch_images.shape}, Labels: {batch_labels.shape}")
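To make the batching step concrete, here is a NumPy-only sketch of what a loader does when it shuffles and slices a dataset. It is not TinyTorch's DataLoader implementation: `iterate_batches` is a hypothetical helper, the arrays are zero-filled placeholders, and the (1797, 8, 8) array shape and int64 label dtype are assumptions based on the sample count and image shape listed above. Whether TinyTorch keeps the final partial batch is also not stated here; this sketch keeps it.

```python
import numpy as np

def iterate_batches(images, labels, batch_size=32, shuffle=True, seed=0):
    """Yield (images, labels) mini-batches; a stand-in for a DataLoader."""
    n = len(images)
    order = np.arange(n)
    if shuffle:
        # Shuffle the indices, not the arrays, so images and labels stay paired.
        np.random.default_rng(seed).shuffle(order)
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        yield images[idx], labels[idx]

# Placeholder arrays with the assumed shapes of digits_8x8 (zeros, not real data).
images = np.zeros((1797, 8, 8), dtype=np.float32)
labels = np.zeros(1797, dtype=np.int64)

batches = list(iterate_batches(images, labels, batch_size=32))
print(len(batches))          # 57 batches: 56 full ones plus a remainder
print(batches[0][0].shape)   # (32, 8, 8)
print(batches[-1][0].shape)  # (5, 8, 8), since 1797 = 56 * 32 + 5
```

With 1,797 samples and a batch size of 32, the last batch holds only 5 images, which is worth remembering when a training loop assumes a fixed batch dimension.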

Visual Sample:

Digit "5":        Digit "3":        Digit "8":
░░████░░          ░█████░░          ░█████░░
░█░░░█░░          ░░░░░█░░          █░░░░░█░
░░░░█░░░          ░░███░░░          ░█████░░
░░░█░░░░          ░░░░░█░░          █░░░░░█░
░░█░░░░░          ░░████░░          ░█████░░

🎯 Philosophy

Why ship tiny datasets?

Zero friction - Students start learning immediately
Offline-first - Works in classrooms, planes, anywhere
Fast iteration - No wait times, instant feedback
Educational focus - Sized for learning, not production

Progression:

Tiny datasets (here) → Learn DataLoader mechanics
Downloaded datasets (../mnist/, ../cifar10/) → Real applications
Custom datasets → Production skills

📂 File Format

All datasets use NumPy's .npz format (a zip archive of named arrays, stored compressed):

data = np.load('dataset.npz')
images = data['images']  # Shape: (N, H, W) or (N, H, W, C)
labels = data['labels']  # Shape: (N,)
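A round trip through the format can be sketched as follows. The data is synthetic, the file name `example.npz` is hypothetical, and using `np.savez_compressed` to write the file is an assumption about how a compressed .npz would be produced, not a statement about how digits_8x8.npz was written.

```python
import os
import tempfile
import numpy as np

# Synthetic stand-in data following the (N, H, W) / (N,) convention above.
rng = np.random.default_rng(0)
images = rng.random((10, 8, 8), dtype=np.float32)
labels = np.arange(10, dtype=np.int64)

path = os.path.join(tempfile.mkdtemp(), "example.npz")  # hypothetical file name
np.savez_compressed(path, images=images, labels=labels)

# np.load returns a lazy archive; arrays are read when indexed by key.
with np.load(path) as data:
    print(data.files)  # names of the stored arrays
    loaded_images = data["images"]
    loaded_labels = data["labels"]

print(loaded_images.shape, loaded_images.dtype)
print(loaded_labels.shape, loaded_labels.dtype)
```

Listing `data.files` before indexing is a quick way to inspect an unfamiliar .npz file without loading every array.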

Benefits:

Fast loading
Compressed storage
NumPy-native (loads with np.load, no extra dependencies)
Easy inspection

🔧 Creating New Tiny Datasets

See create_digits_8x8.py for an example extraction script.

Guidelines:

Max size: ~100 KB per dataset
Format: .npz with images and labels keys
Normalize: Images in [0, 1] range
License: Verify that the data is public domain or openly licensed
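The mechanical guidelines above (size, keys, value range) can be checked with a small script before committing a file. This is only a sketch: `check_tiny_dataset` is a hypothetical helper that is not part of TinyTorch, the ~100 KB limit is read as 100 KiB, and the length comparison is an extra sanity check beyond the listed guidelines. The license still has to be verified by hand.

```python
import os
import tempfile
import numpy as np

MAX_BYTES = 100 * 1024  # the ~100 KB guideline, read here as 100 KiB (assumption)

def check_tiny_dataset(path):
    """Return a list of guideline violations for an .npz file (empty means OK)."""
    problems = []
    if os.path.getsize(path) > MAX_BYTES:
        problems.append("file is larger than ~100 KB")
    with np.load(path) as data:
        missing = [k for k in ("images", "labels") if k not in data.files]
        if missing:
            problems.append(f"missing keys: {missing}")
            return problems
        images, labels = data["images"], data["labels"]
    if images.min() < 0.0 or images.max() > 1.0:
        problems.append("images are not in the [0, 1] range")
    if len(images) != len(labels):  # extra check, not one of the guidelines
        problems.append("images and labels differ in length")
    return problems

# Demo on two synthetic files: one valid, one left unnormalized at 16.0.
tmp = tempfile.mkdtemp()
good = os.path.join(tmp, "good.npz")
bad = os.path.join(tmp, "bad.npz")
np.savez_compressed(good, images=np.zeros((20, 8, 8), np.float32),
                    labels=np.zeros(20, np.int64))
np.savez_compressed(bad, images=np.full((20, 8, 8), 16.0, np.float32),
                    labels=np.zeros(20, np.int64))

print(check_tiny_dataset(good))  # no violations
print(check_tiny_dataset(bad))   # reports the out-of-range images
```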

📚 Dataset Information

Digits 8×8 Credits

Original Source:

E. Alpaydin, C. Kaynak (1998)
UCI Machine Learning Repository
"Optical Recognition of Handwritten Digits"

Preprocessing:

Extracted via sklearn.datasets.load_digits()
Normalized from [0, 16] to [0, 1]
Saved as float32 for efficiency
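The preprocessing steps listed above might look roughly like this. It is a sketch, not the contents of create_digits_8x8.py: `normalize_digits` and `build_digits_npz` are hypothetical names, the use of `np.savez_compressed` and the int64 label dtype are assumptions, and `build_digits_npz` needs scikit-learn installed, so it is defined but not called here.

```python
import numpy as np

def normalize_digits(raw_images):
    """Scale pixel intensities from [0, 16] to float32 values in [0, 1]."""
    return (np.asarray(raw_images) / 16.0).astype(np.float32)

def build_digits_npz(out_path):
    """Hypothetical end-to-end step; requires scikit-learn."""
    from sklearn.datasets import load_digits
    digits = load_digits()
    np.savez_compressed(
        out_path,
        images=normalize_digits(digits.images),  # array of shape (1797, 8, 8)
        labels=digits.target.astype(np.int64),   # label dtype is an assumption
    )

# Check the scaling on a tiny hand-made example instead of the real data.
sample = np.array([[0, 8, 16]])
scaled = normalize_digits(sample)
print(scaled, scaled.dtype)  # values 0.0, 0.5, 1.0 stored as float32
```

Dividing by 16 rather than by the per-image maximum keeps intensities comparable across images, since every image shares the same original [0, 16] scale.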

License: Public domain

🚀 Next Steps

After mastering DataLoader with tiny datasets:

Module 08 → Build DataLoader with digits_8x8
Milestone 03 → Train MLP on full MNIST
Milestone 04 → Train CNN on CIFAR-10
Custom datasets → Apply to your own data

Tiny datasets teach the mechanics.
Real datasets teach the systems.
Custom datasets teach the engineering.
