TinyTorch/datasets/tiny
Vijay Janapa Reddi 92b70c3646 Add tiny datasets infrastructure with 8×8 digits
Created datasets/tiny/ for shipping small datasets with TinyTorch:

New Structure:
- datasets/tiny/digits_8x8.npz (67KB, 1,797 samples)
  - 8×8 handwritten digits from UCI/sklearn
  - Normalized to [0-1], ready for immediate use
  - Perfect for DataLoader learning (Module 08)

- datasets/tiny/README.md
  - Full documentation and usage examples
  - Philosophy: tiny (learn) → full (practice) → custom (master)

- datasets/tiny/create_digits_8x8.py
  - Extraction script showing how dataset was created
  - Reproducible from sklearn.datasets.load_digits()

Updated .gitignore:
- Ignore datasets/* (downloaded large files)
- Allow datasets/tiny/ (shipped small files)
- Allow datasets/README.md and download scripts
- Selectively ignore .npz files (not in tiny/)

Benefits:
- Zero download friction for Module 08
- Offline-friendly (planes, classrooms, slow networks)
- Real handwritten digits (not synthetic noise)
- Git-friendly size (67KB vs 10MB MNIST)
- Same shape/format students will use for CNNs

Progression:
- Module 08: Learn DataLoader with 8×8 digits
- Milestone 03: Train on full 28×28 MNIST
- Milestone 04: Scale to CIFAR-10
2025-09-30 15:05:34 -04:00

Tiny Datasets for TinyTorch

Small, curated datasets that ship with TinyTorch - no downloads required!

These datasets are committed to the repository for instant, offline-friendly learning.


📊 Available Datasets

8×8 Handwritten Digits

File: digits_8x8.npz
Size: ~67 KB
Samples: 1,797 images
Shape: (8, 8) grayscale per image; the full images array is (1797, 8, 8)
Classes: 10 digits (0-9)
Source: UCI ML Repository via sklearn

Perfect for:

  • Learning DataLoader mechanics
  • Quick CNN testing
  • Offline development
  • Educational demos

Usage:

import numpy as np
from tinytorch import Tensor
from tinytorch.data.loader import TensorDataset, DataLoader

# Load the dataset (path is relative to the repository root)
data = np.load('datasets/tiny/digits_8x8.npz')
images = Tensor(data['images'])
labels = Tensor(data['labels'])

# Create dataset and loader
dataset = TensorDataset(images, labels)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

# Iterate through batches
for batch_images, batch_labels in loader:
    print(f"Batch: {batch_images.shape}, Labels: {batch_labels.shape}")
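To make the batching arithmetic concrete, here is a minimal NumPy-only sketch of what a shuffling loader does. It is not TinyTorch's DataLoader: `iter_batches` is a hypothetical helper, the arrays are random stand-ins with the same shapes as digits_8x8.npz, and keeping the final partial batch is an assumption about the behavior you would want.

```python
import numpy as np

def iter_batches(images, labels, batch_size=32, shuffle=True, seed=0):
    """Yield (images, labels) mini-batches; the last batch may be smaller."""
    n = len(images)
    order = np.arange(n)
    if shuffle:
        # Shuffle indices rather than the data so images and labels stay paired.
        np.random.default_rng(seed).shuffle(order)
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        yield images[idx], labels[idx]

# Stand-in arrays shaped like digits_8x8.npz (random values, not real digits).
rng = np.random.default_rng(0)
images = rng.random((1797, 8, 8), dtype=np.float32)
labels = rng.integers(0, 10, size=1797)

batches = list(iter_batches(images, labels, batch_size=32))
print(len(batches))          # 57 batches: 56 full ones plus a remainder
print(batches[0][0].shape)   # (32, 8, 8)
print(batches[-1][0].shape)  # (5, 8, 8), since 1797 = 56 * 32 + 5
```

With 1,797 samples and a batch size of 32, one epoch is 57 iterations, which is small enough to print every batch while debugging.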

Visual Sample:

Digit "5":        Digit "3":        Digit "8":
░░████░░          ░█████░░          ░█████░░
░█░░░█░░          ░░░░░█░░          █░░░░░█░
░░░░█░░░          ░░███░░░          ░█████░░
░░░█░░░░          ░░░░░█░░          █░░░░░█░
░░█░░░░░          ░░████░░          ░█████░░
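A preview like the one above can be produced from any 8×8 array with a few lines of Python. This is a sketch only: `render` and its 0.5 threshold are hypothetical choices, and `demo` is a hand-made image standing in for one entry of data['images'].

```python
import numpy as np

def render(image, threshold=0.5):
    """Return a multi-line string: '█' where a pixel exceeds threshold, '░' elsewhere."""
    return "\n".join(
        "".join("█" if pixel > threshold else "░" for pixel in row)
        for row in image
    )

# Hypothetical hand-made 8×8 image (a vertical bar), not a sample from the dataset.
demo = np.zeros((8, 8), dtype=np.float32)
demo[1:7, 3:5] = 1.0
print(render(demo))
```

Because the shipped images are normalized to [0, 1], a single fixed threshold is enough for a rough terminal preview.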

🎯 Philosophy

Why ship tiny datasets?

  1. Zero friction - Students start learning immediately
  2. Offline-first - Works in classrooms, planes, anywhere
  3. Fast iteration - No wait times, instant feedback
  4. Educational focus - Sized for learning, not production

Progression:

  • Tiny datasets (here) → Learn DataLoader mechanics
  • Downloaded datasets (../mnist/, ../cifar10/) → Real applications
  • Custom datasets → Production skills

📂 File Format

All datasets use NumPy's .npz format (compressed):

data = np.load('dataset.npz')
images = data['images']  # Shape: (N, H, W) or (N, H, W, C)
labels = data['labels']  # Shape: (N,)

Benefits:

  • Fast loading
  • Compressed storage
  • NumPy-native (np.load reads it directly, no extra dependencies)
  • Easy inspection
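As an illustration of the "easy inspection" point, the sketch below lists every array in an archive with its shape and dtype. It writes a small stand-in file to a temporary directory so it runs without the repository; the file name and array contents are made up for the example.

```python
import os
import tempfile
import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in archive using the same two keys as the shipped datasets.
    path = os.path.join(tmp, "example.npz")
    np.savez_compressed(
        path,
        images=np.zeros((10, 8, 8), dtype=np.float32),
        labels=np.arange(10, dtype=np.int64),
    )

    # np.load returns a lazy NpzFile; .files lists the stored array names.
    with np.load(path) as data:
        summary = {key: (data[key].shape, str(data[key].dtype)) for key in data.files}

for key, (shape, dtype) in summary.items():
    print(f"{key}: shape={shape}, dtype={dtype}")
```

Pointing `path` at datasets/tiny/digits_8x8.npz instead should report the shapes and dtypes of the shipped arrays.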

🔧 Creating New Tiny Datasets

See create_digits_8x8.py for an example extraction script.

Guidelines:

  • Max size: ~100 KB per dataset
  • Format: .npz with images and labels keys
  • Normalize: Images in [0, 1] range
  • License: Verify the data is public domain or openly licensed before committing it
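The mechanical guidelines can be checked automatically before a file is committed. The following is a sketch under stated assumptions: `check_tiny_dataset` is a hypothetical helper that is not part of TinyTorch, and "~100 KB" is read as 100 KiB. The license still has to be verified by hand.

```python
import os
import tempfile
import numpy as np

MAX_BYTES = 100 * 1024  # the "~100 KB" guideline, read here as 100 KiB (an assumption)

def check_tiny_dataset(path, max_bytes=MAX_BYTES):
    """Return a list of guideline violations; an empty list means the file passes."""
    problems = []
    if os.path.getsize(path) > max_bytes:
        problems.append(f"file is larger than {max_bytes} bytes")
    with np.load(path) as data:
        missing = [key for key in ("images", "labels") if key not in data.files]
        if missing:
            return problems + [f"missing keys: {missing}"]
        images, labels = data["images"], data["labels"]
    if images.min() < 0.0 or images.max() > 1.0:
        problems.append("images are not in the [0, 1] range")
    if len(images) != len(labels):
        problems.append("images and labels differ in length")
    return problems

with tempfile.TemporaryDirectory() as tmp:
    # One file that follows the guidelines and one that breaks two of them.
    good = os.path.join(tmp, "good.npz")
    np.savez_compressed(good, images=np.full((4, 8, 8), 0.5, dtype=np.float32),
                        labels=np.arange(4))
    bad = os.path.join(tmp, "bad.npz")
    np.savez_compressed(bad, images=np.full((4, 8, 8), 16.0, dtype=np.float32),
                        labels=np.arange(3))
    good_report = check_tiny_dataset(good)
    bad_report = check_tiny_dataset(bad)

print(good_report)
print(bad_report)
```

Returning a list of messages rather than raising on the first failure lets a contributor see every problem in one run.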

📚 Dataset Information

Digits 8×8 Credits

Original Source:

  • E. Alpaydin, C. Kaynak (1998)
  • UCI Machine Learning Repository
  • "Optical Recognition of Handwritten Digits"

Preprocessing:

  • Extracted via sklearn.datasets.load_digits()
  • Normalized from [0, 16] to [0, 1]
  • Saved as float32 for efficiency
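The preprocessing steps above can be sketched as follows. This is not the contents of create_digits_8x8.py: `to_tiny_format` is a hypothetical helper, the use of np.savez_compressed is an assumption, and random integers stand in for the real data so the example runs without scikit-learn. With scikit-learn installed, the inputs would be `load_digits().images` and `load_digits().target`.

```python
import io
import numpy as np

def to_tiny_format(raw_images, raw_labels):
    """Scale 0..16 pixel intensities to [0, 1] and store the images as float32."""
    images = np.asarray(raw_images, dtype=np.float32) / 16.0
    labels = np.asarray(raw_labels)
    return images, labels

# Stand-in for sklearn's digits: integer intensities 0..16 and labels 0..9.
rng = np.random.default_rng(0)
raw_images = rng.integers(0, 17, size=(6, 8, 8)).astype(np.float64)
raw_labels = rng.integers(0, 10, size=6)

images, labels = to_tiny_format(raw_images, raw_labels)

# Write to an in-memory buffer here; a real script would pass a file path instead.
buffer = io.BytesIO()
np.savez_compressed(buffer, images=images, labels=labels)
buffer.seek(0)
restored = np.load(buffer)
print(restored["images"].dtype, restored["images"].shape)
```

Dividing by the known maximum of 16, rather than by the observed maximum of the array, keeps the scaling identical no matter which samples are included.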

License: Public domain


🚀 Next Steps

Suggested path, starting with the tiny datasets:

  1. Module 08 → Build DataLoader with digits_8x8
  2. Milestone 03 → Train MLP on full MNIST
  3. Milestone 04 → Train CNN on CIFAR-10
  4. Custom datasets → Apply to your own data

Tiny datasets teach the mechanics.
Real datasets teach the systems.
Custom datasets teach the engineering.