Create TinyDigits educational dataset for self-contained TinyTorch

Replaces sklearn-sourced digits_8x8.npz with TinyTorch-branded dataset.

Changes:
- Created datasets/tinydigits/ (~51KB total)
  - train.pkl: 150 samples (15 per digit class 0-9)
  - test.pkl: 47 samples (balanced across digits)
  - README.md: Full curation documentation
  - LICENSE: BSD 3-Clause with sklearn attribution
  - create_tinydigits.py: Reproducible generation script

- Updated milestones to use TinyDigits:
  - mlp_digits.py: Now loads from datasets/tinydigits/
  - cnn_digits.py: Now loads from datasets/tinydigits/

- Removed old data:
  - datasets/tiny/ (67KB sklearn duplicate)
  - milestones/03_1986_mlp/data/ (67KB old location)

Dataset Strategy:
TinyTorch now ships with only 2 curated datasets:
1. TinyDigits (51KB) - 8x8 digits for MLP/CNN milestones
2. TinyTalks (140KB) - Q&A pairs for transformer milestone

Total: 191KB shipped data (perfect for RasPi0 deployment)

Rationale:
- Self-contained: No downloads, works offline
- Citable: TinyTorch educational infrastructure for white paper
- Portable: Tiny footprint enables edge device deployment
- Fast: <5 sec training enables instant student feedback

Updated .gitignore to allow TinyTorch curated datasets while
still blocking downloaded large datasets.
This commit is contained in:
Vijay Janapa Reddi
2025-11-10 16:59:43 -05:00
parent 1a0d9fe7b3
commit 89c9e0dd7e
13 changed files with 710 additions and 239 deletions

View File

@@ -0,0 +1,54 @@
BSD 3-Clause License
TinyDigits Dataset License
==========================
TinyDigits is a curated educational subset derived from the sklearn digits dataset.
Original Data Source:
---------------------
scikit-learn digits dataset (sklearn.datasets.load_digits)
- Derived from UCI ML hand-written digits datasets
- Copyright (c) 2007-2024 The scikit-learn developers
- License: BSD 3-Clause
TinyTorch Curation:
------------------
Copyright (c) 2025 TinyTorch Project
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Attribution
-----------
When using TinyDigits in research or educational materials, please cite:
1. The original sklearn digits dataset:
Pedregosa et al., "Scikit-learn: Machine Learning in Python",
JMLR 12, pp. 2825-2830, 2011.
2. TinyTorch's educational curation:
TinyTorch Project (2025). "TinyDigits: Curated Educational Dataset
for ML Systems Learning". Available at: https://github.com/VJHack/TinyTorch

View File

@@ -0,0 +1,109 @@
# TinyDigits Dataset
A curated subset of the sklearn digits dataset for rapid ML prototyping and educational demonstrations.
## Contents
- **Training**: 150 samples (15 per digit, 0-9)
- **Test**: 47 samples (balanced across digits)
- **Format**: 8×8 grayscale images, float32 normalized [0, 1]
- **Size**: ~51 KB total (vs 67 KB original, 10 MB MNIST)
## Files
```
datasets/tinydigits/
├── train.pkl # {'images': (150, 8, 8), 'labels': (150,)}
└── test.pkl # {'images': (47, 8, 8), 'labels': (47,)}
```
## Usage
```python
import pickle
# Load training data
with open('datasets/tinydigits/train.pkl', 'rb') as f:
data = pickle.load(f)
train_images = data['images'] # (150, 8, 8)
train_labels = data['labels'] # (150,)
# Load test data
with open('datasets/tinydigits/test.pkl', 'rb') as f:
data = pickle.load(f)
test_images = data['images'] # (47, 8, 8)
test_labels = data['labels'] # (47,)
```
## Purpose
**Educational Infrastructure**: Designed for teaching ML systems with real data at edge-device scale.
- Fast iteration during development (<5 sec training)
- Instant "it works!" moment for students
- Offline-capable demos (no downloads)
- CI/CD friendly (lightweight tests)
- **Deployable on RasPi0** - tiny footprint for democratizing ML education
## Curation Process
Created from the sklearn digits dataset (8×8 downsampled MNIST):
1. **Balanced Sampling**: 15 training samples per digit class (150 total)
2. **Test Split**: 4-5 samples per digit (47 total) from remaining examples
3. **Random Seeding**: Reproducible selection (seed=42)
4. **Shuffled**: Training and test sets randomly shuffled for fair evaluation
The sklearn digits dataset itself is derived from the UCI ML hand-written digits datasets.
## Why TinyDigits vs Full MNIST?
| Metric | MNIST | TinyDigits | Benefit |
|--------|-------|------------|---------|
| Samples | 60,000 | 150 | 400× fewer samples |
| File size | 10 MB | 51 KB | 200× smaller |
| Train time | 5-10 min | <5 sec | 60-120× faster |
| Download | Network required | Ships with repo | Always available |
| Resolution | 28×28 (784 pixels) | 8×8 (64 pixels) | Faster forward pass |
| Edge deployment | Challenging | Perfect | Works on RasPi0 |
## Educational Progression
TinyDigits serves as the first step in a scaffolded learning path:
1. **TinyDigits (8×8)** ← Start here: Learn MLP/CNN basics with instant feedback
2. **Full MNIST (28×28)** ← Graduate to: Standard benchmark, longer training
3. **CIFAR-10 (32×32 RGB)** ← Advanced: Color images, real-world complexity
## Citation
TinyDigits is curated from the sklearn digits dataset for educational use in TinyTorch.
**Original Source**:
- sklearn.datasets.load_digits()
- Derived from UCI ML hand-written digits datasets
- License: BSD 3-Clause (sklearn)
**TinyTorch Curation**:
```bibtex
@misc{tinydigits2025,
title={TinyDigits: Curated Educational Dataset for ML Systems Learning},
author={TinyTorch Project},
year={2025},
note={Balanced subset of sklearn digits optimized for edge deployment}
}
```
## Generation
To regenerate this dataset from the original sklearn data:
```bash
python3 datasets/tinydigits/create_tinydigits.py
```
This ensures reproducibility and allows customization for specific educational needs.
## License
See [LICENSE](LICENSE) for details. TinyDigits inherits the BSD 3-Clause license from sklearn.

View File

@@ -0,0 +1,109 @@
#!/usr/bin/env python3
"""
Create TinyDigits Dataset
=========================
Extracts a balanced, curated subset from sklearn's digits dataset (8x8 grayscale).
This creates a TinyTorch-branded educational dataset optimized for fast iteration.
Target sizes:
- Training: 150 samples (15 per digit class 0-9)
- Test: 47 samples (mix of clear and challenging examples)
"""
import numpy as np
import pickle
from pathlib import Path
def create_tinydigits():
"""Create TinyDigits train/test split from full digits dataset."""
# Load the full sklearn digits dataset (shipped with repo)
source_path = Path(__file__).parent.parent.parent / "milestones/03_1986_mlp/data/digits_8x8.npz"
data = np.load(source_path)
images = data['images'] # (1797, 8, 8)
labels = data['labels'] # (1797,)
print(f"📊 Source dataset: {images.shape[0]} samples")
print(f" Shape: {images.shape}, dtype: {images.dtype}")
print(f" Range: [{images.min():.3f}, {images.max():.3f}]")
# Set random seed for reproducibility
np.random.seed(42)
# Create balanced splits
train_images, train_labels = [], []
test_images, test_labels = [], []
# For each digit class (0-9)
for digit in range(10):
# Get all samples of this digit
digit_indices = np.where(labels == digit)[0]
digit_count = len(digit_indices)
# Shuffle indices
np.random.shuffle(digit_indices)
# Split: 15 for training, rest for test pool
train_count = 15
test_pool = digit_indices[train_count:]
# Training: First 15 samples
train_images.append(images[digit_indices[:train_count]])
train_labels.extend([digit] * train_count)
# Test: Select 4-5 samples from remaining (47 total across all digits)
test_count = 5 if digit < 7 else 4 # 7*5 + 3*4 = 47
test_indices = np.random.choice(test_pool, size=test_count, replace=False)
test_images.append(images[test_indices])
test_labels.extend([digit] * test_count)
print(f" Digit {digit}: {train_count} train, {test_count} test (from {digit_count} total)")
# Stack into arrays
train_images = np.vstack(train_images)
train_labels = np.array(train_labels, dtype=np.int64)
test_images = np.vstack(test_images)
test_labels = np.array(test_labels, dtype=np.int64)
# Shuffle both sets
train_shuffle = np.random.permutation(len(train_images))
train_images = train_images[train_shuffle]
train_labels = train_labels[train_shuffle]
test_shuffle = np.random.permutation(len(test_images))
test_images = test_images[test_shuffle]
test_labels = test_labels[test_shuffle]
print(f"\n✅ Created TinyDigits:")
print(f" Training: {train_images.shape} images, {train_labels.shape} labels")
print(f" Test: {test_images.shape} images, {test_labels.shape} labels")
# Save as pickle files
output_dir = Path(__file__).parent
train_data = {'images': train_images, 'labels': train_labels}
with open(output_dir / 'train.pkl', 'wb') as f:
pickle.dump(train_data, f)
print(f"\n💾 Saved: train.pkl")
test_data = {'images': test_images, 'labels': test_labels}
with open(output_dir / 'test.pkl', 'wb') as f:
pickle.dump(test_data, f)
print(f"💾 Saved: test.pkl")
# Calculate file sizes
train_size = (output_dir / 'train.pkl').stat().st_size / 1024
test_size = (output_dir / 'test.pkl').stat().st_size / 1024
total_size = train_size + test_size
print(f"\n📦 File sizes:")
print(f" train.pkl: {train_size:.1f} KB")
print(f" test.pkl: {test_size:.1f} KB")
print(f" Total: {total_size:.1f} KB")
print(f"\n🎯 TinyDigits created successfully!")
print(f" Perfect for TinyTorch on RasPi0 - only {total_size:.1f} KB!")
if __name__ == "__main__":
create_tinydigits()

BIN
datasets/tinydigits/test.pkl LFS Normal file

Binary file not shown.

BIN
datasets/tinydigits/train.pkl LFS Normal file

Binary file not shown.