mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-06-04 01:30:53 -05:00
Created datasets/tiny/ for shipping small datasets with TinyTorch:
New Structure:
- datasets/tiny/digits_8x8.npz (67KB, 1,797 samples)
  - 8×8 handwritten digits from UCI/sklearn
  - Normalized to [0-1], ready for immediate use
  - Perfect for DataLoader learning (Module 08)
- datasets/tiny/README.md
  - Full documentation and usage examples
  - Philosophy: tiny (learn) → full (practice) → custom (master)
- datasets/tiny/create_digits_8x8.py
  - Extraction script showing how dataset was created
  - Reproducible from sklearn.datasets.load_digits()
Updated .gitignore:
- Ignore datasets/* (downloaded large files)
- Allow datasets/tiny/ (shipped small files)
- Allow datasets/README.md and download scripts
- Selectively ignore .npz files (not in tiny/)
Benefits:
- ✅ Zero download friction for Module 08
- ✅ Offline-friendly (planes, classrooms, slow networks)
- ✅ Real handwritten digits (not synthetic noise)
- ✅ Git-friendly size (67KB vs 10MB MNIST)
- ✅ Same shape/format students will use for CNNs
Progression:
- Module 08: Learn DataLoader with 8×8 digits
- Milestone 03: Train on full 28×28 MNIST
- Milestone 04: Scale to CIFAR-10
Tiny Datasets for TinyTorch
Small, curated datasets that ship with TinyTorch - no downloads required!
These datasets are committed to the repository for instant, offline-friendly learning.
📊 Available Datasets
8×8 Handwritten Digits
File: digits_8x8.npz
Size: ~67 KB
Samples: 1,797 images
Shape: (8, 8) grayscale
Classes: 10 digits (0-9)
Source: UCI ML Repository via sklearn
Perfect for:
- Learning DataLoader mechanics
- Quick CNN testing
- Offline development
- Educational demos
Usage:
import numpy as np
from tinytorch import Tensor
from tinytorch.data.loader import TensorDataset, DataLoader
# Load the dataset
data = np.load('datasets/tiny/digits_8x8.npz')
images = Tensor(data['images'])
labels = Tensor(data['labels'])
# Create dataset and loader
dataset = TensorDataset(images, labels)
loader = DataLoader(dataset, batch_size=32, shuffle=True)
# Iterate through batches
for batch_images, batch_labels in loader:
    print(f"Batch: {batch_images.shape}, Labels: {batch_labels.shape}")
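To make the batching behaviour in the usage example concrete, here is a minimal NumPy-only sketch of what a shuffling, batching loader does. It is not TinyTorch's DataLoader: the `iterate_batches` helper, its `seed` parameter, and the synthetic arrays are assumptions made for illustration, and the arrays only mimic the shape of digits_8x8.npz.

```python
import numpy as np


def iterate_batches(images, labels, batch_size=32, shuffle=True, seed=0):
    """Yield (images, labels) mini-batches.

    Hypothetical stand-in for a DataLoader, not TinyTorch's implementation.
    """
    n = len(images)
    indices = np.arange(n)
    if shuffle:
        # Shuffle the indices, not the arrays, so images and labels stay aligned.
        np.random.default_rng(seed).shuffle(indices)
    for start in range(0, n, batch_size):
        batch_idx = indices[start:start + batch_size]
        yield images[batch_idx], labels[batch_idx]


# Synthetic stand-in with the same shape as the digits dataset (not the real data).
rng = np.random.default_rng(42)
images = rng.random((1797, 8, 8), dtype=np.float32)
labels = rng.integers(0, 10, size=1797)

batches = list(iterate_batches(images, labels, batch_size=32))
print(len(batches))          # 57 batches: 56 full ones plus one partial
print(batches[0][0].shape)   # (32, 8, 8)
print(batches[-1][0].shape)  # (5, 8, 8), since 1797 = 56 * 32 + 5
```

With 1,797 samples and a batch size of 32, the final batch holds only 5 samples, so training code should not assume every batch has the same leading dimension.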
Visual Sample:
Digit "5":    Digit "3":    Digit "8":
░░████░░      ░█████░       ░█████░░
░█░░░█░       ░░░░░█░       █░░░░░█░
░░░░█░░       ░░███░░       ░█████░░
░░░█░░░       ░░░░░█░       █░░░░░█░
░░█░░░░       ░░████░░      ░█████░░
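A similar block-character preview can be produced for any 8×8 image. The sketch below is one possible way to do it; the `render_ascii` helper and its 0.5 threshold are assumptions for illustration, and the sample pattern is hand-made rather than taken from the dataset.

```python
import numpy as np


def render_ascii(image, threshold=0.5):
    """Return a multi-line string: '█' where a pixel >= threshold, else '░'."""
    rows = []
    for row in np.asarray(image):
        rows.append("".join("█" if pixel >= threshold else "░" for pixel in row))
    return "\n".join(rows)


# Hand-made 8x8 pattern used only for illustration (not a dataset sample).
sample = np.zeros((8, 8), dtype=np.float32)
sample[1, 2:6] = 1.0  # top bar
sample[2:7, 2] = 1.0  # left stroke
sample[6, 2:6] = 1.0  # bottom bar

art = render_ascii(sample)
print(art)
```

To preview a real sample, pass a single image from the loaded array, for example `render_ascii(data['images'][0])`.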
🎯 Philosophy
Why ship tiny datasets?
- Zero friction - Students start learning immediately
- Offline-first - Works in classrooms, planes, anywhere
- Fast iteration - No wait times, instant feedback
- Educational focus - Sized for learning, not production
Progression:
- Tiny datasets (here) → Learn DataLoader mechanics
- Downloaded datasets (../mnist/, ../cifar10/) → Real applications
- Custom datasets → Production skills
📂 File Format
All datasets use NumPy's .npz format (compressed):
import numpy as np
data = np.load('dataset.npz')
images = data['images']  # Shape: (N, H, W) or (N, H, W, C)
labels = data['labels']  # Shape: (N,)
Benefits:
- Fast loading
- Compressed storage
- Python-native
- Easy inspection
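The format described above can be checked with a quick round trip. This sketch writes a small archive with `np.savez_compressed` and reads it back with `np.load`; the file name `example_tiny.npz` and the array contents are placeholders, not part of the repository.

```python
import os
import tempfile

import numpy as np

# Synthetic stand-in arrays following the (N, H, W) / (N,) layout.
images = np.linspace(0.0, 1.0, 4 * 8 * 8, dtype=np.float32).reshape(4, 8, 8)
labels = np.array([0, 1, 2, 3], dtype=np.int64)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_tiny.npz")  # hypothetical file name
    # Keyword names become the keys stored in the archive.
    np.savez_compressed(path, images=images, labels=labels)
    with np.load(path) as data:
        keys = sorted(data.files)
        loaded_images = data["images"]
        loaded_labels = data["labels"]

print(keys)
print(loaded_images.shape, loaded_images.dtype)
print(loaded_labels.shape, loaded_labels.dtype)
```

Note that `np.savez` writes an uncompressed archive, while `np.savez_compressed` applies compression; both are read the same way with `np.load`.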
🔧 Creating New Tiny Datasets
See create_digits_8x8.py for an example extraction script.
Guidelines:
- Max size: ~100 KB per dataset
- Format: .npz with images and labels keys
- Normalize: Images in [0, 1] range
- License: Verify public domain / open source
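The mechanical parts of these guidelines (size, keys, value range) could be checked automatically before committing a new file. The following is a sketch only: `check_tiny_dataset` is a hypothetical helper that does not exist in the repository, and it reads "~100 KB" as 100 KiB. The license guideline still needs a manual check.

```python
import os
import tempfile

import numpy as np

MAX_BYTES = 100 * 1024  # assumption: "~100 KB" interpreted as 100 KiB


def check_tiny_dataset(path, max_bytes=MAX_BYTES):
    """Return a list of guideline violations (an empty list means the file passes)."""
    problems = []
    if os.path.getsize(path) > max_bytes:
        problems.append("file larger than size limit")
    with np.load(path) as data:
        missing = {"images", "labels"} - set(data.files)
        if missing:
            problems.append(f"missing keys: {sorted(missing)}")
            return problems
        images, labels = data["images"], data["labels"]
    if images.min() < 0.0 or images.max() > 1.0:
        problems.append("images not in [0, 1]")
    if len(images) != len(labels):
        problems.append("images and labels differ in length")
    return problems


# Demonstration on two throwaway files: one normalized, one left in a [0, 16] range.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.npz")
    bad = os.path.join(tmp, "bad.npz")
    pixels = np.full((10, 8, 8), 0.5, dtype=np.float32)
    np.savez_compressed(good, images=pixels, labels=np.arange(10))
    np.savez_compressed(bad, images=pixels * 16, labels=np.arange(10))
    good_report = check_tiny_dataset(good)
    bad_report = check_tiny_dataset(bad)

print(good_report)  # []
print(bad_report)   # ['images not in [0, 1]']
```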
📚 Dataset Information
Digits 8×8 Credits
Original Source:
- E. Alpaydin, C. Kaynak (1998)
- UCI Machine Learning Repository
- "Optical Recognition of Handwritten Digits"
Preprocessing:
- Extracted via sklearn.datasets.load_digits()
- Normalized from [0-16] to [0-1]
- Saved as float32 for efficiency
License: Public domain
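The normalization step listed under Preprocessing amounts to dividing by the maximum pixel value and casting to float32. Here is a minimal sketch under that reading; `normalize_digits` is a hypothetical helper and may not match what create_digits_8x8.py actually does. The input is a tiny stand-in array so the example runs without scikit-learn.

```python
import numpy as np


def normalize_digits(raw_images, max_value=16.0):
    """Scale pixel intensities from [0, max_value] to float32 values in [0, 1]."""
    return (np.asarray(raw_images, dtype=np.float32) / max_value).astype(np.float32)


# Stand-in for the raw pixel array; with scikit-learn installed, the real input
# would come from sklearn.datasets.load_digits().images (values 0 to 16).
raw = np.array([[0, 4, 8], [12, 16, 16]], dtype=np.float64)

normalized = normalize_digits(raw)
print(normalized.dtype, normalized.min(), normalized.max())
```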
🚀 Next Steps
After mastering DataLoader with tiny datasets:
- Module 08 → Build DataLoader with digits_8x8
- Milestone 03 → Train MLP on full MNIST
- Milestone 04 → Train CNN on CIFAR-10
- Custom datasets → Apply to your own data
Tiny datasets teach the mechanics.
Real datasets teach the systems.
Custom datasets teach the engineering.