TinyTorch/datasets/README.md
Vijay Janapa Reddi 92b70c3646 Add tiny datasets infrastructure with 8×8 digits
Created datasets/tiny/ for shipping small datasets with TinyTorch:

New Structure:
- datasets/tiny/digits_8x8.npz (67KB, 1,797 samples)
  - 8×8 handwritten digits from UCI/sklearn
  - Normalized to [0-1], ready for immediate use
  - Perfect for DataLoader learning (Module 08)

- datasets/tiny/README.md
  - Full documentation and usage examples
  - Philosophy: tiny (learn) → full (practice) → custom (master)

- datasets/tiny/create_digits_8x8.py
  - Extraction script showing how dataset was created
  - Reproducible from sklearn.datasets.load_digits()

Updated .gitignore:
- Ignore datasets/* (downloaded large files)
- Allow datasets/tiny/ (shipped small files)
- Allow datasets/README.md and download scripts
- Selectively ignore .npz files (not in tiny/)

Benefits:
 Zero download friction for Module 08
 Offline-friendly (planes, classrooms, slow networks)
 Real handwritten digits (not synthetic noise)
 Git-friendly size (67KB vs 10MB MNIST)
 Same shape/format students will use for CNNs

Progression:
- Module 08: Learn DataLoader with 8×8 digits
- Milestone 03: Train on full 28×28 MNIST
- Milestone 04: Scale to CIFAR-10
2025-09-30 15:05:34 -04:00


# TinyTorch Datasets
This directory contains datasets for TinyTorch examples and training.
## Directory Structure
```
datasets/
├── tiny/               ← Tiny datasets shipped with repo (~100KB each)
│   └── digits_8x8.npz  (1,797 samples, 67KB)
├── mnist/              ← Full MNIST (downloaded, gitignored)
├── cifar10/            ← Full CIFAR-10 (downloaded, gitignored)
└── download_*.py       ← Download scripts for large datasets
```
## Quick Start
**For learning (instant, offline):**
```python
# Use tiny shipped datasets
import numpy as np
data = np.load('datasets/tiny/digits_8x8.npz')
```
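The array names stored in the archive are not listed here, so a reasonable first step is to inspect the file rather than assume its keys. The sketch below relies only on `np.load` and the `.files` attribute of the object it returns; `summarize_npz` is a hypothetical helper name, not part of TinyTorch.

```python
import os

import numpy as np


def summarize_npz(path):
    """Return {array_name: (shape, dtype_name)} for every array in an .npz archive."""
    # np.load on an .npz file returns a lazy archive object;
    # its .files attribute lists the names of the stored arrays.
    with np.load(path) as archive:
        return {
            name: (archive[name].shape, str(archive[name].dtype))
            for name in archive.files
        }


if __name__ == "__main__":
    # Path taken from the Quick Start snippet; run from the repository root.
    path = "datasets/tiny/digits_8x8.npz"
    if os.path.exists(path):
        for name, (shape, dtype) in summarize_npz(path).items():
            print(f"{name}: shape={shape}, dtype={dtype}")
```

Once the names are known, individual arrays can be read with `data[name]`.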
**For serious training (download once):**
```bash
python datasets/download_mnist.py
```
## MNIST Dataset
The `mnist/` directory should contain the MNIST or Fashion-MNIST dataset files:
- `train-images-idx3-ubyte.gz` - Training images (60,000 samples)
- `train-labels-idx1-ubyte.gz` - Training labels
- `t10k-images-idx3-ubyte.gz` - Test images (10,000 samples)
- `t10k-labels-idx1-ubyte.gz` - Test labels
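To confirm that the directory holds all four files, a small check along these lines may help. It is a sketch that assumes the files live in `datasets/mnist` relative to the repository root; `missing_mnist_files` is a hypothetical helper, not something TinyTorch provides.

```python
from pathlib import Path

# File names exactly as listed above.
MNIST_FILES = (
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
)


def missing_mnist_files(mnist_dir="datasets/mnist"):
    """Return the expected file names that are not present in mnist_dir."""
    root = Path(mnist_dir)  # assumed location; adjust if your layout differs
    return [name for name in MNIST_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_mnist_files()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All four MNIST files are present.")
```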
### Downloading the Dataset
Run the provided download script:
```bash
cd datasets
python download_mnist.py
```
The script downloads Fashion-MNIST, which uses the same file format as MNIST but is more accessible.
### Dataset Format
Both MNIST and Fashion-MNIST use the same IDX file format:
- Images: 28x28 grayscale pixels
- Labels: Integer values 0-9
- Gzipped for compression

Fashion-MNIST classes:
- 0: T-shirt/top
- 1: Trouser
- 2: Pullover
- 3: Dress
- 4: Coat
- 5: Sandal
- 6: Shirt
- 7: Sneaker
- 8: Bag
- 9: Ankle boot

The examples will work with either original MNIST digits or Fashion-MNIST items.
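
The IDX layout described under Dataset Format can be read with the standard library alone. The following is a minimal sketch, assuming the files follow the usual IDX header (two zero bytes, a type code, a dimension count, then big-endian 32-bit dimension sizes) and store unsigned bytes. `read_idx` and `FASHION_MNIST_CLASSES` are illustrative names, not TinyTorch APIs.

```python
import gzip
import struct

# Class names in label order, copied from the list above.
FASHION_MNIST_CLASSES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]


def read_idx(path):
    """Parse a gzipped IDX file and return (dims, data), with data as raw bytes."""
    with gzip.open(path, "rb") as f:
        # Header: two zero bytes, a type code (0x08 = unsigned byte),
        # and the number of dimensions.
        zero, type_code, ndim = struct.unpack(">HBB", f.read(4))
        if zero != 0 or type_code != 0x08:
            raise ValueError(f"unexpected IDX header in {path}")
        # Each dimension size is a big-endian 32-bit unsigned integer.
        dims = struct.unpack(f">{ndim}I", f.read(4 * ndim))
        data = f.read()

    # Sanity check: the payload must hold exactly one byte per element.
    expected = 1
    for size in dims:
        expected *= size
    if len(data) != expected:
        raise ValueError(f"expected {expected} bytes, found {len(data)}")
    return dims, data
```

For a label file, `dims` has one entry and `list(data)` gives the integer labels, which index into `FASHION_MNIST_CLASSES` when the files are Fashion-MNIST. For an image file, `dims` is `(count, 28, 28)`, and `np.frombuffer(data, dtype=np.uint8).reshape(dims)` turns the bytes into a NumPy array.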