mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-05-01 08:25:22 -05:00
# Tiny Datasets for TinyTorch

**Small, curated datasets that ship with TinyTorch** - no downloads required!

These datasets are committed to the repository for instant, offline-friendly learning.

---

## 📊 Available Datasets

### 8×8 Handwritten Digits

- **File:** `digits_8x8.npz`
- **Size:** ~67 KB
- **Samples:** 1,797 images
- **Shape:** (8, 8) grayscale
- **Classes:** 10 digits (0-9)
- **Source:** UCI ML Repository via scikit-learn

**Perfect for:**
- Learning DataLoader mechanics
- Quick CNN testing
- Offline development
- Educational demos

**Usage:**
```python
import numpy as np
from tinytorch import Tensor
from tinytorch.data.loader import TensorDataset, DataLoader

# Load the dataset
data = np.load('datasets/tiny/digits_8x8.npz')
images = Tensor(data['images'])
labels = Tensor(data['labels'])

# Create dataset and loader
dataset = TensorDataset(images, labels)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

# Iterate through batches
for batch_images, batch_labels in loader:
    print(f"Batch: {batch_images.shape}, Labels: {batch_labels.shape}")
```
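
As a rough illustration of what the loop above does, the sketch below reproduces shuffled batching in plain NumPy. It is not TinyTorch's `DataLoader` implementation: `iterate_batches` is a hypothetical helper, and random arrays stand in for the real file so the snippet runs without the repository.

```python
import numpy as np

def iterate_batches(images, labels, batch_size=32, shuffle=True, seed=0):
    """Yield (images, labels) batches; hypothetical helper, not a TinyTorch API."""
    indices = np.arange(len(images))
    if shuffle:
        # Shuffle the indices rather than the data so images and labels stay paired
        np.random.default_rng(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        batch = indices[start:start + batch_size]
        yield images[batch], labels[batch]

# Stand-in arrays with the same shapes as digits_8x8.npz (assumed, not the real data)
images = np.random.default_rng(0).random((1797, 8, 8), dtype=np.float32)
labels = np.arange(1797) % 10

batches = list(iterate_batches(images, labels))
print(len(batches))          # number of batches
print(batches[0][0].shape)   # a full batch
print(batches[-1][0].shape)  # the final, partial batch
```

With 1,797 samples and `batch_size=32`, this sketch yields 57 batches, the last holding the remaining 5 samples. Whether the real `DataLoader` keeps or drops a partial final batch is not stated here.
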

**Visual Sample:**
```
Digit "5":  Digit "3":  Digit "8":
░█████░░    ░█████░     ░█████░░
░█░░░█░     ░░░░░█░     █░░░░░█░
░░░░█░░     ░░███░░     ░█████░░
░░░█░░░     ░░░░░█░     █░░░░░█░
░░█░░░░     ░█████░     ░█████░░
```
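
A picture like the one above can be produced by thresholding pixel values. The following is a minimal sketch under assumptions: `render_ascii` and its `threshold` are hypothetical, and a hand-made pattern stands in for a real digit. How the sample above was actually drawn is not documented here.

```python
import numpy as np

def render_ascii(image, threshold=0.5):
    """Render a 2-D array in [0, 1] as rows of block characters; hypothetical helper."""
    return "\n".join(
        "".join("█" if pixel > threshold else "░" for pixel in row)
        for row in image
    )

# A hand-made 8x8 pattern (a vertical bar) standing in for a real digit image
image = np.zeros((8, 8), dtype=np.float32)
image[1:7, 3:5] = 1.0

print(render_ascii(image))
```
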

---

## 🎯 Philosophy

**Why ship tiny datasets?**

1. **Zero friction** - Students start learning immediately
2. **Offline-first** - Works in classrooms, planes, anywhere
3. **Fast iteration** - No wait times, instant feedback
4. **Educational focus** - Sized for learning, not production

**Progression:**
- **Tiny datasets** (here) → Learn DataLoader mechanics
- **Downloaded datasets** (../mnist/, ../cifar10/) → Real applications
- **Custom datasets** → Production skills

---

## 📂 File Format

All datasets use NumPy's `.npz` format (compressed):

```python
import numpy as np

data = np.load('dataset.npz')
images = data['images']  # Shape: (N, H, W) or (N, H, W, C)
labels = data['labels']  # Shape: (N,)
```
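
To see what a file in this layout contains, the sketch below writes a small stand-in archive and then lists its arrays. The file name, shapes, and dtypes are illustrative assumptions, not the contents of any shipped dataset.

```python
import os
import tempfile
import numpy as np

# Write a small stand-in file so the example does not depend on the repository layout
path = os.path.join(tempfile.mkdtemp(), "example.npz")
np.savez_compressed(
    path,
    images=np.zeros((4, 8, 8), dtype=np.float32),
    labels=np.arange(4, dtype=np.int64),
)

# np.load on an .npz returns a lazy, dict-like NpzFile; closing it releases the file handle
with np.load(path) as data:
    summary = {key: (data[key].shape, str(data[key].dtype)) for key in data.files}

for key, (shape, dtype) in summary.items():
    print(f"{key}: shape={shape}, dtype={dtype}")
```
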

**Benefits:**
- Fast loading
- Compressed storage
- NumPy-native (loads with `np.load`, no extra dependencies)
- Easy inspection (`data.files` lists the stored arrays)

---

## 🔧 Creating New Tiny Datasets

See `create_digits_8x8.py` for an example extraction script.

**Guidelines:**
- Max size: ~100 KB per dataset
- Format: `.npz` with `images` and `labels` keys
- Normalize: Images in [0, 1] range
- License: Verify the data is public domain or openly licensed

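
The mechanical guidelines above can be checked with a short script. This is a sketch, not a tool that ships with the repository: `check_tiny_dataset` is a hypothetical helper, "~100 KB" is read as 100 KiB, and the license still has to be verified by hand.

```python
import os
import tempfile
import numpy as np

MAX_BYTES = 100 * 1024  # "~100 KB" guideline, read here as 100 KiB (an assumption)

def check_tiny_dataset(path):
    """Return a list of guideline violations; hypothetical helper."""
    problems = []
    if os.path.getsize(path) > MAX_BYTES:
        problems.append("file is larger than ~100 KB")
    with np.load(path) as data:
        missing = {"images", "labels"} - set(data.files)
        if missing:
            problems.append(f"missing keys: {sorted(missing)}")
            return problems
        images, labels = data["images"], data["labels"]
    if images.min() < 0.0 or images.max() > 1.0:
        problems.append("images are not in the [0, 1] range")
    # Not a listed guideline, but implied by the (N, ...) / (N,) shapes in File Format
    if len(images) != len(labels):
        problems.append("images and labels differ in length")
    return problems

# Try it on a small synthetic file
path = os.path.join(tempfile.mkdtemp(), "toy.npz")
rng = np.random.default_rng(0)
np.savez_compressed(
    path,
    images=rng.random((100, 8, 8), dtype=np.float32),
    labels=rng.integers(0, 10, size=100),
)
print(check_tiny_dataset(path) or "all checks passed")
```
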

---

## 📚 Dataset Information

### Digits 8×8 Credits

**Original Source:**
- E. Alpaydin, C. Kaynak (1998)
- UCI Machine Learning Repository
- "Optical Recognition of Handwritten Digits"

**Preprocessing:**
- Extracted via `sklearn.datasets.load_digits()`
- Normalized from [0, 16] to [0, 1]
- Saved as float32 for efficiency

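
The normalization step listed above amounts to one division and a cast. The sketch below uses random integers in place of the real digit images so it runs without scikit-learn. It illustrates the arithmetic only and is not a copy of `create_digits_8x8.py`.

```python
import numpy as np

# Stand-in for the raw digit images: integer intensities in [0, 16] (assumed data,
# not the real output of sklearn.datasets.load_digits())
rng = np.random.default_rng(0)
raw = rng.integers(0, 17, size=(10, 8, 8))

# Divide by the maximum possible intensity to map [0, 16] onto [0, 1],
# then cast to float32, which takes half the space of float64
images = (raw / 16.0).astype(np.float32)

print(images.dtype, images.min(), images.max())
```
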

**License:** Public domain

---

## 🚀 Next Steps

After mastering DataLoader with tiny datasets:

1. **Module 08** → Build DataLoader with digits_8x8
2. **Milestone 03** → Train MLP on full MNIST
3. **Milestone 04** → Train CNN on CIFAR-10
4. **Custom datasets** → Apply to your own data

Tiny datasets teach the mechanics.
Real datasets teach the systems.
Custom datasets teach the engineering.