mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-06-05 21:35:51 -05:00
Created datasets/tiny/ for shipping small datasets with TinyTorch:

New Structure:
- datasets/tiny/digits_8x8.npz (67KB, 1,797 samples)
  - 8×8 handwritten digits from UCI/sklearn
  - Normalized to [0-1], ready for immediate use
  - Perfect for DataLoader learning (Module 08)
- datasets/tiny/README.md
  - Full documentation and usage examples
  - Philosophy: tiny (learn) → full (practice) → custom (master)
- datasets/tiny/create_digits_8x8.py
  - Extraction script showing how dataset was created
  - Reproducible from sklearn.datasets.load_digits()

Updated .gitignore:
- Ignore datasets/* (downloaded large files)
- Allow datasets/tiny/ (shipped small files)
- Allow datasets/README.md and download scripts
- Selectively ignore .npz files (not in tiny/)

Benefits:
- ✅ Zero download friction for Module 08
- ✅ Offline-friendly (planes, classrooms, slow networks)
- ✅ Real handwritten digits (not synthetic noise)
- ✅ Git-friendly size (67KB vs 10MB MNIST)
- ✅ Same shape/format students will use for CNNs

Progression:
- Module 08: Learn DataLoader with 8×8 digits
- Milestone 03: Train on full 28×28 MNIST
- Milestone 04: Scale to CIFAR-10

# TinyTorch Datasets

This directory contains datasets for TinyTorch examples and training.

## Directory Structure

```
datasets/
├── tiny/                 ← Tiny datasets shipped with repo (~100KB each)
│   └── digits_8x8.npz    (1,797 samples, 67KB)
├── mnist/                ← Full MNIST (downloaded, gitignored)
├── cifar10/              ← Full CIFAR-10 (downloaded, gitignored)
└── download_*.py         ← Download scripts for large datasets
```

## Quick Start

**For learning (instant, offline):**
```python
# Use tiny shipped datasets
import numpy as np
data = np.load('datasets/tiny/digits_8x8.npz')
```

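The names of the arrays inside the archive are not listed here, so a reasonable first step is to inspect it. The sketch below summarizes every array in an `.npz` file by shape and dtype. The helper `describe_npz` and the stand-in key names `images` and `labels` are illustrative assumptions, not part of TinyTorch; the real names are whatever `data.files` reports for `digits_8x8.npz`.

```python
import os
import tempfile

import numpy as np


def describe_npz(path):
    """Return {array_name: (shape, dtype)} for every array in an .npz file."""
    with np.load(path) as data:
        # data.files lists the names of the arrays stored in the archive
        return {
            name: (data[name].shape, str(data[name].dtype))
            for name in data.files
        }


# Stand-in archive so the sketch runs anywhere. The key names 'images' and
# 'labels' are assumptions, not necessarily those used in digits_8x8.npz.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_digits.npz")
np.savez(
    demo_path,
    images=np.zeros((5, 8, 8), dtype=np.float32),
    labels=np.arange(5, dtype=np.int64),
)

summary = describe_npz(demo_path)
for name, (shape, dtype) in summary.items():
    print(f"{name}: shape={shape}, dtype={dtype}")
```

To inspect the shipped dataset instead, call `describe_npz('datasets/tiny/digits_8x8.npz')` from the repository root.
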
**For serious training (download once):**
```bash
python datasets/download_mnist.py
```

## MNIST Dataset

The `mnist/` directory should contain the MNIST or Fashion-MNIST dataset files:
- `train-images-idx3-ubyte.gz` - Training images (60,000 samples)
- `train-labels-idx1-ubyte.gz` - Training labels
- `t10k-images-idx3-ubyte.gz` - Test images (10,000 samples)
- `t10k-labels-idx1-ubyte.gz` - Test labels

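Before training, it can help to confirm that all four files are in place. The following is a minimal sketch, assuming the file names listed above; `missing_mnist_files` is a hypothetical helper, not part of TinyTorch, and the demo runs against a temporary directory rather than the real `mnist/` folder.

```python
import os
import tempfile

# File names as listed above
EXPECTED_FILES = [
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
]


def missing_mnist_files(directory):
    """Return the expected file names that are not present in `directory`."""
    return [
        name
        for name in EXPECTED_FILES
        if not os.path.isfile(os.path.join(directory, name))
    ]


# Demo on a temporary directory that holds only one of the four files
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "train-labels-idx1-ubyte.gz"), "wb").close()

missing = missing_mnist_files(demo_dir)
print("Missing:", missing)
```

In practice the directory argument would be `datasets/mnist`; an empty result means every expected file is present.
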
### Downloading the Dataset

Run the provided download script:
```bash
cd datasets
python download_mnist.py
```

Despite its name, the script downloads Fashion-MNIST, which uses the same file format as MNIST but is more accessible.

### Dataset Format

Both MNIST and Fashion-MNIST use the same IDX file format:
- Images: 28x28 grayscale pixels
- Labels: Integer values 0-9
- Gzipped for compression

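As an illustration of the format described above, here is a minimal sketch of a reader for gzipped IDX files. It assumes the standard IDX layout: a 4-byte header (two zero bytes, a type code where `0x08` means unsigned byte, and the number of dimensions), followed by one big-endian 32-bit size per dimension, then the raw data. `read_idx` is a hypothetical helper, not a TinyTorch function, and the demo reads a small synthetic file rather than the real dataset.

```python
import gzip
import os
import struct
import tempfile

import numpy as np


def read_idx(path):
    """Read a gzipped IDX file of unsigned bytes into a NumPy array."""
    with gzip.open(path, "rb") as f:
        # Header: two zero bytes, a type code, and the number of dimensions
        zero, dtype_code, ndim = struct.unpack(">HBB", f.read(4))
        if zero != 0 or dtype_code != 0x08:
            raise ValueError("not an unsigned-byte IDX file")
        # One big-endian 32-bit size per dimension
        shape = struct.unpack(">" + "I" * ndim, f.read(4 * ndim))
        data = np.frombuffer(f.read(), dtype=np.uint8)
    return data.reshape(shape)


# Write a tiny synthetic image file (2 images of 28x28) to exercise the reader
demo_path = os.path.join(tempfile.mkdtemp(), "demo-images-idx3-ubyte.gz")
pixels = (np.arange(2 * 28 * 28) % 256).astype(np.uint8).reshape(2, 28, 28)
with gzip.open(demo_path, "wb") as f:
    f.write(struct.pack(">IIII", 2051, 2, 28, 28))  # 2051 = 0x00000803
    f.write(pixels.tobytes())

images = read_idx(demo_path)
print(images.shape, images.dtype)
```

The same function handles label files, which have a single dimension. Pixel values come back as `uint8` in the range 0-255, so dividing by 255.0 is one common way to scale them to [0, 1].
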
Fashion-MNIST classes:
- 0: T-shirt/top
- 1: Trouser
- 2: Pullover
- 3: Dress
- 4: Coat
- 5: Sandal
- 6: Shirt
- 7: Sneaker
- 8: Bag
- 9: Ankle boot

The examples will work with either original MNIST digits or Fashion-MNIST items.

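
When inspecting predictions on Fashion-MNIST, it is often handy to turn integer labels into readable names. The sketch below encodes the class table above as a list indexed by label; the names `FASHION_MNIST_CLASSES` and `label_names` are illustrative, not TinyTorch identifiers.

```python
# Index in this list = integer label, following the class table above
FASHION_MNIST_CLASSES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]


def label_names(labels):
    """Map an iterable of integer labels (0-9) to their class names."""
    return [FASHION_MNIST_CLASSES[int(label)] for label in labels]


names = label_names([9, 0, 7])
print(names)
```

For original MNIST digits no mapping is needed, since each label is the digit itself.
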