mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-04-28 13:12:33 -05:00
# Tiny Datasets for TinyTorch
Small, curated datasets that ship with TinyTorch - no downloads required!
These datasets are committed to the repository for instant, offline-friendly learning.
## 📊 Available Datasets

### 8×8 Handwritten Digits

- **File:** `digits_8x8.npz`
- **Size:** ~67 KB
- **Samples:** 1,797 images
- **Shape:** (8, 8) grayscale
- **Classes:** 10 digits (0-9)
- **Source:** UCI ML Repository via sklearn
Perfect for:
- Learning DataLoader mechanics
- Quick CNN testing
- Offline development
- Educational demos
**Usage:**

```python
import numpy as np
from tinytorch import Tensor
from tinytorch.data.loader import TensorDataset, DataLoader

# Load the dataset
data = np.load('datasets/tiny/digits_8x8.npz')
images = Tensor(data['images'])
labels = Tensor(data['labels'])

# Create dataset and loader
dataset = TensorDataset(images, labels)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

# Iterate through batches
for batch_images, batch_labels in loader:
    print(f"Batch: {batch_images.shape}, Labels: {batch_labels.shape}")
```
**Visual Sample:**

```
Digit "5":    Digit "3":    Digit "8":
░█████░░      ░█████░       ░█████░░
░█░░░█░       ░░░░░█░       █░░░░░█░
░░░░█░░       ░░███░░       ░█████░░
░░░█░░░       ░░░░░█░       █░░░░░█░
░░█░░░░       ░█████░       ░█████░░
```
## 🎯 Philosophy
Why ship tiny datasets?
- Zero friction - Students start learning immediately
- Offline-first - Works in classrooms, planes, anywhere
- Fast iteration - No wait times, instant feedback
- Educational focus - Sized for learning, not production
Progression:
- Tiny datasets (here) → Learn DataLoader mechanics
- Downloaded datasets (../mnist/, ../cifar10/) → Real applications
- Custom datasets → Production skills
## 📂 File Format

All datasets use NumPy's `.npz` format (compressed):

```python
data = np.load('dataset.npz')
images = data['images']  # Shape: (N, H, W) or (N, H, W, C)
labels = data['labels']  # Shape: (N,)
```
Benefits:
- Fast loading
- Compressed storage
- Python-native
- Easy inspection
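The round trip can be exercised with NumPy alone. In this sketch, the file name `demo.npz` and the synthetic arrays are illustrative stand-ins, not files that ship with TinyTorch:

```python
import numpy as np

# Tiny synthetic dataset in the same layout: (N, H, W) images, (N,) labels
images = np.random.rand(16, 8, 8).astype(np.float32)
labels = np.random.randint(0, 10, size=16)

# Save compressed; the keyword names become the keys inside the archive
np.savez_compressed('demo.npz', images=images, labels=labels)

# Load and inspect: the archive behaves like a dict of arrays
data = np.load('demo.npz')
print(sorted(data.keys()))    # ['images', 'labels']
print(data['images'].shape)   # (16, 8, 8)
print(data['labels'].shape)   # (16,)
```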
## 🔧 Creating New Tiny Datasets

See `create_digits_8x8.py` for an example extraction script.
Guidelines:
- Max size: ~100 KB per dataset
- Format: `.npz` with `images` and `labels` keys
- Normalize: Images in [0, 1] range
- License: Verify public domain / open source
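The guidelines above can be checked mechanically. This is a minimal sketch, not part of TinyTorch; the function name `check_tiny_dataset` and the file name `tiny_check.npz` are hypothetical:

```python
import os
import numpy as np

def check_tiny_dataset(path, max_bytes=100 * 1024):
    """Sketch of a guideline check: required keys, [0, 1] range, ~100 KB cap."""
    data = np.load(path)
    assert 'images' in data and 'labels' in data, "missing required keys"
    imgs = data['images']
    assert imgs.min() >= 0.0 and imgs.max() <= 1.0, "images not in [0, 1]"
    assert os.path.getsize(path) <= max_bytes, "dataset exceeds ~100 KB"
    return True

# Build a small compliant example and check it
images = (np.random.randint(0, 17, size=(32, 8, 8)) / 16.0).astype(np.float32)
labels = np.random.randint(0, 10, size=32)
np.savez_compressed('tiny_check.npz', images=images, labels=labels)
print(check_tiny_dataset('tiny_check.npz'))  # True
```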
## 📚 Dataset Information

### Digits 8×8 Credits
Original Source:
- E. Alpaydin, C. Kaynak (1998)
- UCI Machine Learning Repository
- "Optical Recognition of Handwritten Digits"
Preprocessing:
- Extracted via `sklearn.datasets.load_digits()`
- Normalized from [0, 16] to [0, 1]
- Saved as float32 for efficiency
License: Public domain
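The normalization step above can be sketched with synthetic pixel values standing in for `sklearn.datasets.load_digits()`, so the snippet runs without sklearn installed; the real extraction lives in `create_digits_8x8.py`:

```python
import numpy as np

# Synthetic stand-in for load_digits(): integer pixel values in [0, 16]
raw = np.random.randint(0, 17, size=(1797, 8, 8))

# Normalize [0, 16] -> [0, 1] and cast to float32 for compact storage
images = (raw / 16.0).astype(np.float32)

print(images.dtype)  # float32
print(images.min() >= 0.0 and images.max() <= 1.0)  # True
```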
## 🚀 Next Steps
After mastering DataLoader with tiny datasets:
- Module 08 → Build DataLoader with digits_8x8
- Milestone 03 → Train MLP on full MNIST
- Milestone 04 → Train CNN on CIFAR-10
- Custom datasets → Apply to your own data
Tiny datasets teach the mechanics.
Real datasets teach the systems.
Custom datasets teach the engineering.