
# Tiny Datasets for TinyTorch

**Small, curated datasets that ship with TinyTorch** - no downloads required!

These datasets are committed to the repository for instant, offline-friendly learning.

---

## 📊 Available Datasets

### 8×8 Handwritten Digits

- **File:** `digits_8x8.npz`
- **Size:** ~67 KB
- **Samples:** 1,797 images
- **Shape:** (8, 8) grayscale
- **Classes:** 10 digits (0-9)
- **Source:** UCI ML Repository via sklearn

**Perfect for:**
- Learning DataLoader mechanics
- Quick CNN testing
- Offline development
- Educational demos

**Usage:**
```python
import numpy as np
from tinytorch import Tensor
from tinytorch.data.loader import TensorDataset, DataLoader

# Load the dataset
data = np.load('datasets/tiny/digits_8x8.npz')
images = Tensor(data['images'])
labels = Tensor(data['labels'])

# Create dataset and loader
dataset = TensorDataset(images, labels)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

# Iterate through batches
for batch_images, batch_labels in loader:
    print(f"Batch: {batch_images.shape}, Labels: {batch_labels.shape}")
```
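
To see what batching and shuffling amount to without any TinyTorch code, here is a minimal NumPy-only sketch. The helper `iterate_minibatches` is hypothetical and is not TinyTorch's `DataLoader`; the arrays are synthetic stand-ins shaped like `digits_8x8.npz`. Whether TinyTorch's loader keeps or drops the final partial batch is not stated here, so this sketch simply keeps it.

```python
import numpy as np

def iterate_minibatches(images, labels, batch_size=32, shuffle=True, seed=0):
    """Yield (images, labels) batches; a plain-NumPy stand-in for a data loader."""
    n = len(images)
    indices = np.arange(n)
    if shuffle:
        # Shuffle one index array so images and labels stay aligned.
        np.random.default_rng(seed).shuffle(indices)
    for start in range(0, n, batch_size):
        batch_idx = indices[start:start + batch_size]
        yield images[batch_idx], labels[batch_idx]

# Synthetic stand-in with the same shape as digits_8x8 (not the real data).
images = np.zeros((1797, 8, 8), dtype=np.float32)
labels = np.arange(1797) % 10

batches = list(iterate_minibatches(images, labels, batch_size=32))
print(len(batches))            # 57 batches: 56 full ones plus a remainder
print(batches[0][0].shape)     # (32, 8, 8)
print(batches[-1][0].shape)    # (5, 8, 8), since 1797 = 56 * 32 + 5
```

With 1,797 samples and a batch size of 32, the last batch holds only 5 samples, which is worth remembering when a model assumes a fixed batch dimension.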

**Visual Sample:**
```
Digit "5":        Digit "3":        Digit "8":
░█████░░         ░█████░          ░█████░░
░█░░░█░          ░░░░░█░          █░░░░░█░
░░░░█░░          ░░███░░          ░█████░░
░░░█░░░          ░░░░░█░          █░░░░░█░
░░█░░░░          ░█████░          ░█████░░
```
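
A picture like the one above can be produced from any 8×8 array with a few lines of Python. This is a rough sketch: `render_ascii` and its `threshold` parameter are hypothetical, and the `demo` pattern is hand-drawn for illustration rather than taken from `digits_8x8.npz`.

```python
import numpy as np

def render_ascii(image, threshold=0.5):
    """Return a multi-line string: '█' where pixel >= threshold, '░' elsewhere."""
    return "\n".join(
        "".join("█" if pixel >= threshold else "░" for pixel in row)
        for row in image
    )

# Hypothetical 8x8 pattern for illustration, not a sample from the dataset.
demo = np.zeros((8, 8), dtype=np.float32)
demo[1, 2:6] = 1.0   # top bar
demo[6, 2:6] = 1.0   # bottom bar
demo[1:7, 2] = 1.0   # left stroke
demo[1:7, 5] = 1.0   # right stroke

art = render_ascii(demo)
print(art)
```

To view a real sample, pass one image from the loaded `images` array instead of `demo`.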

---

## 🎯 Philosophy

**Why ship tiny datasets?**

1. **Zero friction** - Students start learning immediately
2. **Offline-first** - Works in classrooms, planes, anywhere
3. **Fast iteration** - No wait times, instant feedback
4. **Educational focus** - Sized for learning, not production

**Progression:**
- **Tiny datasets** (here) → Learn DataLoader mechanics
- **Downloaded datasets** (../mnist/, ../cifar10/) → Real applications
- **Custom datasets** → Production skills

---

## 📂 File Format

All datasets use NumPy's `.npz` format (compressed):

```python
import numpy as np

data = np.load('dataset.npz')
images = data['images']  # Shape: (N, H, W) or (N, H, W, C)
labels = data['labels']  # Shape: (N,)
```

**Benefits:**
- Fast loading
- Compressed storage
- Python-native
- Easy inspection
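
As an illustration of the "easy inspection" point, the sketch below lists every array in an `.npz` archive with its shape and dtype. It writes a small synthetic archive first so it runs without the real file; the name `example_tiny.npz` and its contents are made up for the example.

```python
import os
import tempfile
import numpy as np

# Write a small synthetic archive so the example runs without the real file.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example_tiny.npz")  # hypothetical file name
np.savez_compressed(
    path,
    images=np.zeros((10, 8, 8), dtype=np.float32),
    labels=np.arange(10, dtype=np.int64),
)

# np.load on an .npz returns a lazy, dict-like NpzFile; using it as a
# context manager closes the underlying file handle afterwards.
with np.load(path) as data:
    summary = {key: (data[key].shape, str(data[key].dtype)) for key in data.files}

for key, (shape, dtype) in summary.items():
    print(f"{key}: shape={shape}, dtype={dtype}")
```

Pointing `path` at `digits_8x8.npz` should show the `images` and `labels` keys described above.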

---

## 🔧 Creating New Tiny Datasets

See `create_digits_8x8.py` for an example extraction script.

**Guidelines:**
- Max size: ~100 KB per dataset
- Format: `.npz` with `images` and `labels` keys
- Normalize: Images in [0, 1] range
- License: Verify public domain / open source
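
One way to enforce these guidelines when saving is sketched below. The `save_tiny_dataset` helper and the `MAX_BYTES` constant are hypothetical and are not part of `create_digits_8x8.py`; the sketch treats the "~100 KB" guideline as a hard limit and uses random data in place of a real dataset. Licensing still has to be checked by hand.

```python
import os
import tempfile
import numpy as np

MAX_BYTES = 100 * 1024  # "~100 KB" guideline, treated here as a hard limit

def save_tiny_dataset(path, images, labels):
    """Validate against the guidelines, then write a compressed .npz."""
    images = np.asarray(images, dtype=np.float32)
    labels = np.asarray(labels)
    if len(images) != len(labels):
        raise ValueError("images and labels must have the same length")
    if images.min() < 0.0 or images.max() > 1.0:
        raise ValueError("images must be normalized to the [0, 1] range")
    np.savez_compressed(path, images=images, labels=labels)
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        os.remove(path)  # do not leave an oversized file behind
        raise ValueError(f"{size} bytes exceeds the {MAX_BYTES}-byte budget")
    return size

# Random stand-in data: 100 images of 8x8 pixels in [0, 1), labels in 0..9.
rng = np.random.default_rng(0)
out_path = os.path.join(tempfile.mkdtemp(), "example_tiny.npz")
size = save_tiny_dataset(out_path, rng.random((100, 8, 8)), rng.integers(0, 10, 100))
print(f"wrote {size} bytes")
```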

---

## 📚 Dataset Information

### Digits 8×8 Credits

**Original Source:**
- E. Alpaydin, C. Kaynak (1998)
- UCI Machine Learning Repository
- "Optical Recognition of Handwritten Digits"

**Preprocessing:**
- Extracted via `sklearn.datasets.load_digits()`
- Normalized from [0, 16] to [0, 1]
- Saved as float32 for efficiency
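
The normalization step can be sketched as follows. To avoid a scikit-learn dependency, the example uses a random stand-in for the raw integer pixels; the actual script presumably obtains them from `sklearn.datasets.load_digits()`, and its exact code may differ.

```python
import numpy as np

# Stand-in for the raw data: integer pixel intensities in [0, 16].
# (Assumption: the real values come from sklearn.datasets.load_digits().)
rng = np.random.default_rng(0)
raw = rng.integers(0, 17, size=(5, 8, 8))

# Divide by the maximum possible intensity, then store as float32.
normalized = (raw / 16.0).astype(np.float32)

print(normalized.dtype, normalized.min(), normalized.max())
```

Dividing by 16 is exact in floating point because 16 is a power of two, so multiplying back recovers the original integer intensities.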

**License:** Public domain

---

## 🚀 Next Steps

Tiny datasets are the starting point for this sequence:

1. **Module 08** → Build DataLoader with digits_8x8
2. **Milestone 03** → Train MLP on full MNIST
3. **Milestone 04** → Train CNN on CIFAR-10
4. **Custom datasets** → Apply to your own data

Tiny datasets teach the mechanics.
Real datasets teach the systems.
Custom datasets teach the engineering.