Add tiny datasets infrastructure with 8×8 digits

Created datasets/tiny/ for shipping small datasets with TinyTorch:

New Structure:
- datasets/tiny/digits_8x8.npz (67KB, 1,797 samples)
  - 8×8 handwritten digits from UCI/sklearn
  - Normalized to [0-1], ready for immediate use
  - Perfect for DataLoader learning (Module 08)

- datasets/tiny/README.md
  - Full documentation and usage examples
  - Philosophy: tiny (learn) → full (practice) → custom (master)

- datasets/tiny/create_digits_8x8.py
  - Extraction script showing how dataset was created
  - Reproducible from sklearn.datasets.load_digits()

Updated .gitignore:
- Ignore datasets/* (downloaded large files)
- Allow datasets/tiny/ (shipped small files)
- Allow datasets/README.md and download scripts
- Selectively ignore .npz files (not in tiny/)
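The four rules above could be expressed roughly as follows (a sketch of plausible gitignore patterns, not the actual committed file):

```gitignore
# Ignore downloaded large datasets by default
datasets/*

# ...but keep shipped tiny datasets, docs, and download scripts
!datasets/tiny/
!datasets/README.md
!datasets/download_*.py

# Ignore .npz archives everywhere except datasets/tiny/
*.npz
!datasets/tiny/*.npz
```

Order matters in gitignore: the last matching pattern wins, so the `!datasets/tiny/*.npz` negation must come after the blanket `*.npz` rule.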

Benefits:
- Zero download friction for Module 08
- Offline-friendly (planes, classrooms, slow networks)
- Real handwritten digits (not synthetic noise)
- Git-friendly size (67KB vs 10MB MNIST)
- Same shape/format students will use for CNNs

Progression:
- Module 08: Learn DataLoader with 8×8 digits
- Milestone 03: Train on full 28×28 MNIST
- Milestone 04: Scale to CIFAR-10
This commit is contained in:
Vijay Janapa Reddi
2025-09-30 15:05:34 -04:00
parent 422c45d7ce
commit d4cdd4506e
5 changed files with 222 additions and 2 deletions

datasets/README.md

@@ -2,6 +2,31 @@
This directory contains datasets for TinyTorch examples and training.
## Directory Structure
```
datasets/
├── tiny/                ← Tiny datasets shipped with repo (~100KB each)
│   └── digits_8x8.npz   (1,797 samples, 67KB)
├── mnist/               ← Full MNIST (downloaded, gitignored)
├── cifar10/             ← Full CIFAR-10 (downloaded, gitignored)
└── download_*.py        ← Download scripts for large datasets
```
## Quick Start
**For learning (instant, offline):**
```python
# Use tiny shipped datasets
import numpy as np
data = np.load('datasets/tiny/digits_8x8.npz')
```
**For serious training (download once):**
```bash
python datasets/download_mnist.py
```
## MNIST Dataset
The `mnist/` directory should contain the MNIST or Fashion-MNIST dataset files:

datasets/tiny/README.md

@@ -0,0 +1,133 @@
# Tiny Datasets for TinyTorch
**Small, curated datasets that ship with TinyTorch** - no downloads required!
These datasets are committed to the repository for instant, offline-friendly learning.
---
## 📊 Available Datasets
### 8×8 Handwritten Digits
**File:** `digits_8x8.npz`
**Size:** ~67 KB
**Samples:** 1,797 images
**Shape:** (8, 8) grayscale
**Classes:** 10 digits (0-9)
**Source:** UCI ML Repository via sklearn
**Perfect for:**
- Learning DataLoader mechanics
- Quick CNN testing
- Offline development
- Educational demos
**Usage:**
```python
import numpy as np
from tinytorch import Tensor
from tinytorch.data.loader import TensorDataset, DataLoader
# Load the dataset
data = np.load('datasets/tiny/digits_8x8.npz')
images = Tensor(data['images'])
labels = Tensor(data['labels'])
# Create dataset and loader
dataset = TensorDataset(images, labels)
loader = DataLoader(dataset, batch_size=32, shuffle=True)
# Iterate through batches
for batch_images, batch_labels in loader:
    print(f"Batch: {batch_images.shape}, Labels: {batch_labels.shape}")
```
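For readers without the TinyTorch loader at hand, the same shuffle-and-batch behavior can be sketched in plain NumPy. The `iterate_batches` helper below is a hypothetical stand-in, not TinyTorch API, and the arrays are fabricated in place of the real `.npz` file:

```python
import numpy as np

# Fabricated stand-in for digits_8x8.npz: 100 samples of 8×8 images
rng = np.random.default_rng(0)
images = rng.random((100, 8, 8), dtype=np.float32)
labels = rng.integers(0, 10, size=100)

def iterate_batches(images, labels, batch_size=32, shuffle=True, seed=0):
    """Yield (images, labels) mini-batches, mimicking a DataLoader."""
    idx = np.arange(len(images))
    if shuffle:
        np.random.default_rng(seed).shuffle(idx)
    for start in range(0, len(idx), batch_size):
        batch = idx[start:start + batch_size]
        yield images[batch], labels[batch]

batches = list(iterate_batches(images, labels, batch_size=32))
# 100 samples at batch_size=32 → batch sizes 32, 32, 32, 4
```

The final partial batch (4 samples) mirrors the default drop-last-off behavior most loaders use.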
**Visual Sample:**
```
Digit "5":   Digit "3":   Digit "8":
░░████░░     ░█████░░     ░█████░░
░█░░░█░░     ░░░░░█░░     █░░░░░█░
░░░░█░░░     ░░███░░░     ░█████░░
░░░█░░░░     ░░░░░█░░     █░░░░░█░
░░█░░░░░     ░░████░░     ░█████░░
```
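Block art like the above can be regenerated from any 8×8 image. A minimal sketch, where the `ascii_digit` helper is hypothetical and the 0.5 threshold is an assumption given images normalized to [0, 1]:

```python
import numpy as np

def ascii_digit(image, threshold=0.5):
    """Render a 2D image as block art: bright pixels become █, dark become ░."""
    return "\n".join(
        "".join("█" if px >= threshold else "░" for px in row)
        for row in image
    )

# Demo with a hand-made 8×8 "square" instead of the real dataset file
img = np.zeros((8, 8), dtype=np.float32)
img[1:7, 1:7] = 1.0
print(ascii_digit(img))
```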
---
## 🎯 Philosophy
**Why ship tiny datasets?**
1. **Zero friction** - Students start learning immediately
2. **Offline-first** - Works in classrooms, planes, anywhere
3. **Fast iteration** - No wait times, instant feedback
4. **Educational focus** - Sized for learning, not production
**Progression:**
- **Tiny datasets** (here) → Learn DataLoader mechanics
- **Downloaded datasets** (../mnist/, ../cifar10/) → Real applications
- **Custom datasets** → Production skills
---
## 📂 File Format
All datasets use NumPy's `.npz` format (compressed):
```python
data = np.load('dataset.npz')
images = data['images'] # Shape: (N, H, W) or (N, H, W, C)
labels = data['labels'] # Shape: (N,)
```
**Benefits:**
- Fast loading
- Compressed storage
- Python-native
- Easy inspection
---
## 🔧 Creating New Tiny Datasets
See `create_digits_8x8.py` for example extraction script.
**Guidelines:**
- Max size: ~100 KB per dataset
- Format: `.npz` with `images` and `labels` keys
- Normalize: Images in [0, 1] range
- License: Verify public domain / open source
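A sketch of a conforming dataset following the guidelines above, with random noise standing in for real images and an in-memory buffer standing in for a file on disk:

```python
import io
import numpy as np

# Fabricated "dataset": 200 samples of 8×8 noise, normalized to [0, 1]
rng = np.random.default_rng(0)
images = rng.random((200, 8, 8)).astype(np.float32)
labels = rng.integers(0, 10, size=200).astype(np.int64)

# Save with the required 'images' and 'labels' keys, compressed
buf = io.BytesIO()  # in-memory stand-in for e.g. my_dataset.npz
np.savez_compressed(buf, images=images, labels=labels)

# Check against the ~100 KB budget for tiny datasets
size_kb = buf.getbuffer().nbytes / 1024
assert size_kb < 100, f"too big for tiny/: {size_kb:.1f} KB"
```

Random noise compresses poorly, so this is close to a worst-case size estimate; a real dataset of the same shape would usually compress smaller.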
---
## 📚 Dataset Information
### Digits 8×8 Credits
**Original Source:**
- E. Alpaydin, C. Kaynak (1998)
- UCI Machine Learning Repository
- "Optical Recognition of Handwritten Digits"
**Preprocessing:**
- Extracted via `sklearn.datasets.load_digits()`
- Normalized from [0-16] to [0-1]
- Saved as float32 for efficiency
**License:** Public domain
---
## 🚀 Next Steps
After mastering DataLoader with tiny datasets:
1. **Module 08** → Build DataLoader with digits_8x8
2. **Milestone 03** → Train MLP on full MNIST
3. **Milestone 04** → Train CNN on CIFAR-10
4. **Custom datasets** → Apply to your own data
Tiny datasets teach the mechanics.
Real datasets teach the systems.
Custom datasets teach the engineering.

datasets/tiny/create_digits_8x8.py
@@ -0,0 +1,53 @@
#!/usr/bin/env python3
"""
Create 8x8 Digits Dataset
=========================
Extracts the 8×8 handwritten digits dataset from sklearn and saves it
as a compact .npz file for TinyTorch.
Source: UCI Machine Learning Repository
Used by: sklearn.datasets.load_digits()
Size: 1,797 samples, 8×8 grayscale images
License: Public domain
"""
import numpy as np
try:
    from sklearn.datasets import load_digits
except ImportError:
    print("❌ sklearn not installed. Install with: pip install scikit-learn")
    raise SystemExit(1)
print("📥 Loading 8×8 digits from sklearn...")
digits = load_digits()
print(f"✅ Loaded {len(digits.images)} digit images")
print(f" Shape: {digits.images.shape}")
print(f" Classes: {np.unique(digits.target)}")
# Normalize to [0, 1] range (original is 0-16)
images_normalized = digits.images.astype(np.float32) / 16.0
labels = digits.target.astype(np.int64)
# Save as compressed .npz
output_file = 'digits_8x8.npz'
np.savez_compressed(output_file,
                    images=images_normalized,
                    labels=labels)
# Check file size
import os
file_size_kb = os.path.getsize(output_file) / 1024
print(f"\n💾 Saved to {output_file}")
print(f" File size: {file_size_kb:.1f} KB")
print(f" Images shape: {images_normalized.shape}")
print(f" Labels shape: {labels.shape}")
print(f" Value range: [{images_normalized.min():.2f}, {images_normalized.max():.2f}]")
# Quick verification
print(f"\n✅ Dataset ready for TinyTorch!")
print(f" Total samples: {len(images_normalized)}")
print(f" Samples per class: ~{len(images_normalized) // 10}")
print(f" Perfect for DataLoader demos!")

datasets/tiny/digits_8x8.npz (binary file not shown)