mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-04-28 13:12:33 -05:00
# Tiny Datasets for TinyTorch
Small, curated datasets that ship with TinyTorch - no downloads required!
These datasets are committed to the repository for instant, offline-friendly learning.
## 📊 Available Datasets

### 8×8 Handwritten Digits

- **File:** `digits_8x8.npz`
- **Size:** ~67 KB
- **Samples:** 1,797 images
- **Shape:** (8, 8) grayscale
- **Classes:** 10 digits (0-9)
- **Source:** UCI ML Repository via sklearn
Perfect for:
- Learning DataLoader mechanics
- Quick CNN testing
- Offline development
- Educational demos
**Usage:**

```python
import numpy as np
from tinytorch import Tensor
from tinytorch.data.loader import TensorDataset, DataLoader

# Load the dataset
data = np.load('datasets/tiny/digits_8x8.npz')
images = Tensor(data['images'])
labels = Tensor(data['labels'])

# Create dataset and loader
dataset = TensorDataset(images, labels)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

# Iterate through batches
for batch_images, batch_labels in loader:
    print(f"Batch: {batch_images.shape}, Labels: {batch_labels.shape}")
```
**Visual Sample:**

```
Digit "5":    Digit "3":    Digit "8":
░█████░░      ░█████░       ░█████░░
░█░░░█░       ░░░░░█░       █░░░░░█░
░░░░█░░       ░░███░░       ░█████░░
░░░█░░░       ░░░░░█░       █░░░░░█░
░░█░░░░       ░█████░       ░█████░░
```
## 🎯 Philosophy
Why ship tiny datasets?
- Zero friction - Students start learning immediately
- Offline-first - Works in classrooms, planes, anywhere
- Fast iteration - No wait times, instant feedback
- Educational focus - Sized for learning, not production
Progression:
- Tiny datasets (here) → Learn DataLoader mechanics
- Downloaded datasets (../mnist/, ../cifar10/) → Real applications
- Custom datasets → Production skills
## 📂 File Format

All datasets use NumPy's `.npz` format (compressed):

```python
data = np.load('dataset.npz')
images = data['images']  # Shape: (N, H, W) or (N, H, W, C)
labels = data['labels']  # Shape: (N,)
```
Benefits:
- Fast loading
- Compressed storage
- Python-native
- Easy inspection
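The round trip can be exercised with NumPy alone. In this sketch, the file name `demo.npz` and the synthetic arrays are illustrative stand-ins, not files that ship with TinyTorch:

```python
import numpy as np

# Tiny synthetic dataset in the same layout: (N, H, W) images, (N,) labels
images = np.random.rand(16, 8, 8).astype(np.float32)
labels = np.random.randint(0, 10, size=16)

# Save compressed; the keyword names become the keys inside the archive
np.savez_compressed('demo.npz', images=images, labels=labels)

# Load and inspect: the archive behaves like a dict of arrays
data = np.load('demo.npz')
print(sorted(data.keys()))    # ['images', 'labels']
print(data['images'].shape)   # (16, 8, 8)
print(data['labels'].shape)   # (16,)
```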
## 🔧 Creating New Tiny Datasets

See `create_digits_8x8.py` for an example extraction script.
Guidelines:
- Max size: ~100 KB per dataset
- Format: `.npz` with `images` and `labels` keys
- Normalize: Images in [0, 1] range
- License: Verify public domain / open source
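The guidelines above can be checked mechanically. This is a minimal sketch, not part of TinyTorch; the function name `check_tiny_dataset` and the file name `tiny_check.npz` are hypothetical:

```python
import os
import numpy as np

def check_tiny_dataset(path, max_bytes=100 * 1024):
    """Sketch of a guideline check: required keys, [0, 1] range, ~100 KB cap."""
    data = np.load(path)
    assert 'images' in data and 'labels' in data, "missing required keys"
    imgs = data['images']
    assert imgs.min() >= 0.0 and imgs.max() <= 1.0, "images not in [0, 1]"
    assert os.path.getsize(path) <= max_bytes, "dataset exceeds ~100 KB"
    return True

# Build a small compliant example and check it
images = (np.random.randint(0, 17, size=(32, 8, 8)) / 16.0).astype(np.float32)
labels = np.random.randint(0, 10, size=32)
np.savez_compressed('tiny_check.npz', images=images, labels=labels)
print(check_tiny_dataset('tiny_check.npz'))  # True
```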
## 📚 Dataset Information

### Digits 8×8 Credits
Original Source:
- E. Alpaydin, C. Kaynak (1998)
- UCI Machine Learning Repository
- "Optical Recognition of Handwritten Digits"
Preprocessing:
- Extracted via `sklearn.datasets.load_digits()`
- Normalized from [0, 16] to [0, 1]
- Saved as float32 for efficiency
License: Public domain
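The normalization step above can be sketched with synthetic pixel values standing in for `sklearn.datasets.load_digits()`, so the snippet runs without sklearn installed; the real extraction lives in `create_digits_8x8.py`:

```python
import numpy as np

# Synthetic stand-in for load_digits(): integer pixel values in [0, 16]
raw = np.random.randint(0, 17, size=(1797, 8, 8))

# Normalize [0, 16] -> [0, 1] and cast to float32 for compact storage
images = (raw / 16.0).astype(np.float32)

print(images.dtype)  # float32
print(images.min() >= 0.0 and images.max() <= 1.0)  # True
```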
## 🚀 Next Steps
After mastering DataLoader with tiny datasets:
- Module 08 → Build DataLoader with digits_8x8
- Milestone 03 → Train MLP on full MNIST
- Milestone 04 → Train CNN on CIFAR-10
- Custom datasets → Apply to your own data
Tiny datasets teach the mechanics.
Real datasets teach the systems.
Custom datasets teach the engineering.