TinyTorch/datasets/tiny/README.md
Vijay Janapa Reddi 97fece7b5f Finalize Module 08 and add integration tests
2025-09-30 16:07:55 -04:00

# Tiny Datasets for TinyTorch
**Small, curated datasets that ship with TinyTorch** - no downloads required!
These datasets are committed to the repository for instant, offline-friendly learning.
---
## 📊 Available Datasets
### 8×8 Handwritten Digits
**File:** `digits_8x8.npz`
**Size:** ~67 KB
**Samples:** 1,797 images
**Shape:** (8, 8) grayscale
**Classes:** 10 digits (0-9)
**Source:** UCI ML Repository via sklearn
**Perfect for:**
- Learning DataLoader mechanics
- Quick CNN testing
- Offline development
- Educational demos
**Usage:**
```python
import numpy as np
from tinytorch import Tensor
from tinytorch.data.loader import TensorDataset, DataLoader
# Load the dataset
data = np.load('datasets/tiny/digits_8x8.npz')
images = Tensor(data['images'])
labels = Tensor(data['labels'])
# Create dataset and loader
dataset = TensorDataset(images, labels)
loader = DataLoader(dataset, batch_size=32, shuffle=True)
# Iterate through batches
for batch_images, batch_labels in loader:
    print(f"Batch: {batch_images.shape}, Labels: {batch_labels.shape}")
```
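For intuition about what the loader does on each pass, here is a minimal NumPy-only sketch of shuffled mini-batching. It is not TinyTorch's `DataLoader` implementation: the helper name `iterate_minibatches`, its `seed` parameter, and the synthetic stand-in arrays are assumptions made for illustration.

```python
import numpy as np

def iterate_minibatches(images, labels, batch_size=32, shuffle=True, seed=0):
    """Yield (images, labels) batches. Hypothetical helper, not TinyTorch API."""
    n = len(images)
    indices = np.arange(n)
    if shuffle:
        # Shuffle the index array once, then slice it into batches
        rng = np.random.default_rng(seed)
        rng.shuffle(indices)
    for start in range(0, n, batch_size):
        batch_idx = indices[start:start + batch_size]
        yield images[batch_idx], labels[batch_idx]

# Synthetic stand-in with the same shape as the digits data (1,797 samples of 8x8)
images = np.random.default_rng(0).random((1797, 8, 8), dtype=np.float32)
labels = np.arange(1797) % 10

batches = list(iterate_minibatches(images, labels, batch_size=32))
print(len(batches), batches[0][0].shape, batches[-1][0].shape)
```

With 1,797 samples and a batch size of 32, this yields 57 batches, and the last one holds the 5 leftover samples.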
**Visual Sample:**
```
Digit "5":   Digit "3":   Digit "8":
░█████░░     ░█████░      ░█████░░
░█░░░█░      ░░░░░█░      █░░░░░█░
░░░░█░░      ░░███░░      ░█████░░
░░░█░░░      ░░░░░█░      █░░░░░█░
░░█░░░░      ░█████░      ░█████░░
```
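A picture like the one above can be produced from any 2-D array by thresholding each pixel. The sketch below is one way to do it: the function name `render_ascii`, the 0.5 threshold, and the hand-made example image are assumptions, not part of TinyTorch.

```python
import numpy as np

def render_ascii(image, threshold=0.5):
    """Render a 2-D array with values in [0, 1] as rows of block characters."""
    return "\n".join(
        "".join("█" if pixel >= threshold else "░" for pixel in row)
        for row in image
    )

# Hand-made 8x8 example (not taken from the dataset): a vertical bar
image = np.zeros((8, 8), dtype=np.float32)
image[:, 3:5] = 1.0
print(render_ascii(image))
```

Passing a single `(8, 8)` array from the `images` array in `digits_8x8.npz` should work the same way, since the stored images are normalized to [0, 1].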
---
## 🎯 Philosophy
**Why ship tiny datasets?**
1. **Zero friction** - Students start learning immediately
2. **Offline-first** - Works in classrooms, planes, anywhere
3. **Fast iteration** - No wait times, instant feedback
4. **Educational focus** - Sized for learning, not production
**Progression:**
- **Tiny datasets** (here) → Learn DataLoader mechanics
- **Downloaded datasets** (../mnist/, ../cifar10/) → Real applications
- **Custom datasets** → Production skills
---
## 📂 File Format
All datasets use NumPy's `.npz` format (compressed):
```python
import numpy as np

data = np.load('dataset.npz')
images = data['images'] # Shape: (N, H, W) or (N, H, W, C)
labels = data['labels'] # Shape: (N,)
```
**Benefits:**
- Fast loading
- Compressed storage
- NumPy-native (loads with `np.load`, no extra dependencies)
- Easy inspection
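To make the format concrete, here is a small round trip through a compressed `.npz` archive. It writes to an in-memory buffer rather than a file so it can run anywhere; the array contents are placeholders, not real dataset values.

```python
import io
import numpy as np

# Placeholder arrays using the `images` / `labels` key convention
images = np.zeros((4, 8, 8), dtype=np.float32)
labels = np.array([0, 1, 2, 3], dtype=np.int64)

# Write a compressed archive into memory instead of onto disk
buffer = io.BytesIO()
np.savez_compressed(buffer, images=images, labels=labels)
buffer.seek(0)

# np.load returns a lazy, dict-like NpzFile; arrays are read on access
with np.load(buffer) as data:
    print(data.files)  # names of the stored arrays
    loaded_images = data["images"]
    loaded_labels = data["labels"]

print(loaded_images.shape, loaded_labels.shape)
```

Inspecting `data.files` before indexing is a quick way to check which arrays an unfamiliar archive contains.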
---
## 🔧 Creating New Tiny Datasets
See `create_digits_8x8.py` for an example extraction script.
**Guidelines:**
- Max size: ~100 KB per dataset
- Format: `.npz` with `images` and `labels` keys
- Normalize: Images in [0, 1] range
- License: Verify public domain / open source
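The first three guidelines can be checked mechanically before committing a new dataset. The sketch below is one possible checker, not an existing TinyTorch tool: `check_tiny_dataset`, `make_archive`, and the reading of "~100 KB" as 100 KiB are all assumptions. The license guideline still needs a manual check.

```python
import io
import numpy as np

MAX_BYTES = 100 * 1024  # "~100 KB" interpreted as 100 KiB (assumption)

def check_tiny_dataset(raw_bytes):
    """Return a list of guideline violations for an in-memory .npz archive."""
    problems = []
    if len(raw_bytes) > MAX_BYTES:
        problems.append(f"archive is {len(raw_bytes)} bytes, over {MAX_BYTES}")
    with np.load(io.BytesIO(raw_bytes)) as data:
        missing = {"images", "labels"} - set(data.files)
        if missing:
            problems.append(f"missing keys: {sorted(missing)}")
            return problems
        images, labels = data["images"], data["labels"]
    if images.min() < 0.0 or images.max() > 1.0:
        problems.append("images are not in the [0, 1] range")
    if len(images) != len(labels):
        problems.append("images and labels differ in length")
    return problems

def make_archive(images, labels):
    """Hypothetical helper: build a compressed .npz archive in memory."""
    buffer = io.BytesIO()
    np.savez_compressed(buffer, images=images, labels=labels)
    return buffer.getvalue()

# One archive that follows the guidelines, one with un-normalized pixels
good = make_archive(np.full((10, 8, 8), 0.5, dtype=np.float32), np.arange(10))
bad = make_archive(np.full((10, 8, 8), 16.0, dtype=np.float32), np.arange(10))
print(check_tiny_dataset(good))
print(check_tiny_dataset(bad))
```

For a file on disk, the same function could be called with the bytes read from the `.npz` file.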
---
## 📚 Dataset Information
### Digits 8×8 Credits
**Original Source:**
- E. Alpaydin, C. Kaynak (1998)
- UCI Machine Learning Repository
- "Optical Recognition of Handwritten Digits"
**Preprocessing:**
- Extracted via `sklearn.datasets.load_digits()`
- Normalized from the [0, 16] range to [0, 1]
- Saved as float32 for efficiency
**License:** Public domain
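The normalization step listed under preprocessing amounts to a division by 16 followed by a cast. The sketch below shows it on a tiny hand-made array, as a stand-in for the raw pixel values; it is not the contents of `create_digits_8x8.py`, which may differ in detail.

```python
import numpy as np

# Hand-made stand-in for raw pixel intensities in the 0..16 range
raw = np.array([[0, 8, 16], [4, 12, 16]], dtype=np.float64)

# Scale into [0, 1] and store as float32 to halve the memory of float64
normalized = (raw / 16.0).astype(np.float32)
print(normalized.dtype, normalized.min(), normalized.max())
```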
---
## 🚀 Next Steps
The path from tiny datasets to real ones:
1. **Module 08** → Build DataLoader with digits_8x8
2. **Milestone 03** → Train MLP on full MNIST
3. **Milestone 04** → Train CNN on CIFAR-10
4. **Custom datasets** → Apply to your own data
Tiny datasets teach the mechanics.
Real datasets teach the systems.
Custom datasets teach the engineering.