Vijay Janapa Reddi 22309fa39d Finalize Module 08 and add integration tests
2025-09-30 16:07:55 -04:00

Tiny Datasets for TinyTorch

Small, curated datasets that ship with TinyTorch - no downloads required!

These datasets are committed to the repository for instant, offline-friendly learning.


📊 Available Datasets

8×8 Handwritten Digits

File: digits_8x8.npz
Size: ~67 KB
Samples: 1,797 images
Shape: (8, 8) grayscale
Classes: 10 digits (0-9)
Source: UCI ML Repository via sklearn

Perfect for:

  • Learning DataLoader mechanics
  • Quick CNN testing
  • Offline development
  • Educational demos

Usage:

import numpy as np
from tinytorch import Tensor
from tinytorch.data.loader import TensorDataset, DataLoader

# Load the dataset
data = np.load('datasets/tiny/digits_8x8.npz')
images = Tensor(data['images'])
labels = Tensor(data['labels'])

# Create dataset and loader
dataset = TensorDataset(images, labels)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

# Iterate through batches
for batch_images, batch_labels in loader:
    print(f"Batch: {batch_images.shape}, Labels: {batch_labels.shape}")
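
The batch shapes that loop reports can be worked out by hand. The sketch below uses plain NumPy slicing on a zero-filled stand-in array with the digits shape. It does not call TinyTorch, and it assumes the final partial batch is kept rather than dropped, which this README does not state.

```python
import math
import numpy as np

# Stand-in for the real arrays: same shapes as digits_8x8.npz, but all zeros.
num_samples, batch_size = 1797, 32
images = np.zeros((num_samples, 8, 8), dtype=np.float32)
labels = np.zeros(num_samples, dtype=np.int64)

# Assumption: the last, smaller batch is kept rather than dropped.
num_batches = math.ceil(num_samples / batch_size)

batch_sizes = []
for start in range(0, num_samples, batch_size):
    batch_images = images[start:start + batch_size]
    batch_labels = labels[start:start + batch_size]
    assert len(batch_images) == len(batch_labels)
    batch_sizes.append(len(batch_images))

print(num_batches)      # 57
print(batch_sizes[0])   # 32
print(batch_sizes[-1])  # 5
```

Under that assumption, 56 full batches of 32 cover 1,792 samples and the remaining 5 form the last batch.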

Visual Sample:

Digit "5":        Digit "3":        Digit "8":
░█████░░         ░█████░          ░█████░░
░█░░░█░          ░░░░░█░          █░░░░░█░
░░░░█░░          ░░███░░          ░█████░░
░░░█░░░          ░░░░░█░          █░░░░░█░
░░█░░░░          ░█████░          ░█████░░

🎯 Philosophy

Why ship tiny datasets?

  1. Zero friction - Students start learning immediately
  2. Offline-first - Works in classrooms, planes, anywhere
  3. Fast iteration - No wait times, instant feedback
  4. Educational focus - Sized for learning, not production

Progression:

  • Tiny datasets (here) → Learn DataLoader mechanics
  • Downloaded datasets (../mnist/, ../cifar10/) → Real applications
  • Custom datasets → Production skills

📂 File Format

All datasets use NumPy's .npz format, a zip archive of arrays (compressed when written with np.savez_compressed):

import numpy as np

data = np.load('dataset.npz')
images = data['images']  # Shape: (N, H, W) or (N, H, W, C)
labels = data['labels']  # Shape: (N,)

Benefits:

  • Fast loading
  • Compressed storage
  • NumPy-native (no extra dependencies)
  • Easy inspection
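
As a rough illustration of the "easy inspection" point, the sketch below lists the keys, shapes, and dtypes stored in an archive. It writes a small stand-in file to a temporary directory so it runs without the real dataset; the file name example.npz and its zero-filled contents are placeholders, not part of TinyTorch.

```python
import os
import tempfile
import numpy as np

# Write a small stand-in archive so the example runs without the real file.
path = os.path.join(tempfile.mkdtemp(), "example.npz")
np.savez_compressed(
    path,
    images=np.zeros((10, 8, 8), dtype=np.float32),
    labels=np.arange(10, dtype=np.int64),
)

# np.load on an .npz returns a dict-like NpzFile; arrays are read on access.
with np.load(path) as data:
    summary = {key: (data[key].shape, str(data[key].dtype)) for key in data.files}

for key, (shape, dtype) in summary.items():
    print(f"{key}: shape={shape}, dtype={dtype}")
```

Pointing path at datasets/tiny/digits_8x8.npz instead should report the shapes of the shipped arrays.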

🔧 Creating New Tiny Datasets

See create_digits_8x8.py for an example extraction script.

Guidelines:

  • Max size: ~100 KB per dataset
  • Format: .npz with images and labels keys
  • Normalize: Images in [0, 1] range
  • License: Verify the data is public domain or openly licensed
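
The first three guidelines can be checked mechanically; the license still needs a human. The helper below is a hypothetical sketch, not part of TinyTorch: check_tiny_dataset is an illustrative name, and "~100 KB" is read here as 100 KiB, which is an assumption.

```python
import os
import tempfile
import numpy as np

MAX_BYTES = 100 * 1024  # "~100 KB" read as 100 KiB (assumption)

def check_tiny_dataset(path):
    """Return a list of guideline violations (empty if none). Hypothetical helper."""
    problems = []
    if os.path.getsize(path) > MAX_BYTES:
        problems.append("file larger than ~100 KB")
    with np.load(path) as data:
        missing = {"images", "labels"} - set(data.files)
        if missing:
            problems.append(f"missing keys: {sorted(missing)}")
            return problems
        images, labels = data["images"], data["labels"]
    if images.min() < 0.0 or images.max() > 1.0:
        problems.append("images outside [0, 1]")
    if len(images) != len(labels):
        problems.append("images and labels differ in length")
    return problems

# Usage with a synthetic dataset (random values, not real data).
rng = np.random.default_rng(0)
path = os.path.join(tempfile.mkdtemp(), "toy.npz")
np.savez_compressed(
    path,
    images=rng.random((100, 8, 8), dtype=np.float32),
    labels=rng.integers(0, 10, size=100),
)
problems = check_tiny_dataset(path)
print(problems)  # []
```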

📚 Dataset Information

Digits 8×8 Credits

Original Source:

  • E. Alpaydin, C. Kaynak (1998)
  • UCI Machine Learning Repository
  • "Optical Recognition of Handwritten Digits"

Preprocessing:

  • Extracted via sklearn.datasets.load_digits()
  • Normalized from [0, 16] to [0, 1]
  • Saved as float32 for efficiency
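
The normalization and cast amount to one division and one dtype conversion. The sketch below applies them to synthetic integer data in the 0 to 16 range instead of calling sklearn, so it runs with NumPy alone. It is not the contents of create_digits_8x8.py, and the float64 starting dtype is an assumption made for the size comparison.

```python
import numpy as np

# Synthetic stand-in for the raw digits: integer intensities from 0 to 16.
rng = np.random.default_rng(0)
raw = rng.integers(0, 17, size=(1797, 8, 8)).astype(np.float64)

# Scale [0, 16] to [0, 1], then cast to float32 (half the bytes of float64).
images = (raw / 16.0).astype(np.float32)

print(images.dtype)                                 # float32
print(images.min() >= 0.0 and images.max() <= 1.0)  # True
print(images.nbytes, raw.nbytes)                    # 460032 920064
```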

License: Public domain


🚀 Next Steps

After mastering DataLoader with tiny datasets:

  1. Module 08 → Build DataLoader with digits_8x8
  2. Milestone 03 → Train MLP on full MNIST
  3. Milestone 04 → Train CNN on CIFAR-10
  4. Custom datasets → Apply to your own data

Tiny datasets teach the mechanics.
Real datasets teach the systems.
Custom datasets teach the engineering.