mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-03-11 19:03:34 -05:00
Cleanup: Remove old/unused files
- Remove datasets analysis and download scripts (replaced by updated README) - Remove archived book development documentation - Remove module review reports (16_compression, 17_memoization)
@@ -1,351 +0,0 @@
# TinyTorch Dataset Analysis & Strategy

**Date**: November 10, 2025
**Purpose**: Determine which datasets to ship with TinyTorch for optimal educational experience

---

## Current Milestone Data Usage

### Summary Table

| Milestone | File | Data Source | Currently Shipped? | Size | Issue |
|-----------|------|-------------|--------------------|------|-------|
| **01 Perceptron** | perceptron_trained.py | Synthetic (code-generated) | ✅ N/A | 0 KB | None |
| **01 Perceptron** | forward_pass.py | Synthetic (code-generated) | ✅ N/A | 0 KB | None |
| **02 XOR** | xor_crisis.py | Synthetic (code-generated) | ✅ N/A | 0 KB | None |
| **02 XOR** | xor_solved.py | Synthetic (code-generated) | ✅ N/A | 0 KB | None |
| **03 MLP** | mlp_digits.py | `03_1986_mlp/data/digits_8x8.npz` | ✅ YES | 67 KB | **Sklearn source** |
| **03 MLP** | mlp_mnist.py | Downloads via `data_manager.get_mnist()` | ❌ NO | ~10 MB | **Download fails** |
| **04 CNN** | cnn_digits.py | `03_1986_mlp/data/digits_8x8.npz` (shared) | ✅ YES | 67 KB | **Sklearn source** |
| **04 CNN** | lecun_cifar10.py | Downloads via `data_manager.get_cifar10()` | ❌ NO | ~170 MB | **Too large** |
| **05 Transformer** | vaswani_chatgpt.py | `datasets/tinytalks/` | ✅ YES | 140 KB | None ✓ |
| **05 Transformer** | vaswani_copilot.py | Embedded Python patterns (in code) | ✅ N/A | 0 KB | None ✓ |
| **05 Transformer** | profile_kv_cache.py | Uses model from vaswani_chatgpt | ✅ N/A | 0 KB | None ✓ |

---

## Detailed Analysis

### ✅ What's Working (6/11 files)

**Fully Self-Contained:**
1. **Perceptron milestones** - Generate linearly separable data on-the-fly
2. **XOR milestones** - Generate XOR patterns on-the-fly
3. **mlp_digits.py** - Uses shipped `digits_8x8.npz` (67KB, sklearn digits)
4. **cnn_digits.py** - Reuses `digits_8x8.npz` (smart sharing!)
5. **vaswani_chatgpt.py** - Uses shipped TinyTalks (140KB)
6. **vaswani_copilot.py** - Embedded patterns in code

**Result**: 6 of 11 milestone files work offline, instantly, with zero setup.

### ❌ What's Broken (2/11 files)

**Requires External Downloads:**
1. **mlp_mnist.py** - Tries to download 10MB MNIST, fails with a 404 error
2. **lecun_cifar10.py** - Tries to download 170MB CIFAR-10

**Impact**:
- Students can't run 2 milestone files without internet
- Downloads fail (we saw a 404 error in testing)
- First-time experience is a 5+ minute wait or outright failure

### ⚠️ What's Problematic (3/11 files use sklearn data)

**Uses sklearn's digits dataset:**
- `digits_8x8.npz` (67KB) is currently shipped
- **Source**: Originally from `sklearn.datasets.load_digits()`
- **Issue**: Not "TinyTorch data"; it's sklearn's data
- **Citation problem**: Can't cite it as a "TinyTorch educational dataset"

---

## Current Datasets Directory

```
datasets/
├── README.md (4KB)
├── download_mnist.py (unused script)
├── tiny/ (76KB - unknown purpose)
├── tinymnist/ (3.6MB - synthetic, recently added)
│   ├── train.pkl
│   └── test.pkl
└── tinytalks/ (140KB) ✅ TinyTorch original!
    ├── CHANGELOG.md
    ├── DATASHEET.md
    ├── README.md
    ├── LICENSE
    ├── splits/
    │   ├── train.txt (12KB)
    │   ├── val.txt
    │   └── test.txt
    └── tinytalks_v1.txt
```

**Current total**: ~3.8MB shipped data

---

## The Core Issues

### 1. **Attribution & Citation Problem**

Current situation:
- `digits_8x8.npz` = sklearn's data (not TinyTorch's)
- TinyTalks = TinyTorch original ✓
- tinymnist = Synthetic (not authentic MNIST)

**For white paper citation**:
- ❌ Can't cite "digits_8x8" as a TinyTorch dataset (it's sklearn's)
- ✅ Can cite "TinyTalks" as a TinyTorch original
- ❌ Can't cite synthetic tinymnist as an educational benchmark

### 2. **Authenticity vs Speed Trade-off**

**Option A: Synthetic Data**
- ✅ Ships with repo (instant start)
- ❌ Not real examples (lower educational value)
- ❌ Not citable as a benchmark

**Option B: Curated Real Data**
- ✅ Authentic samples from MNIST/CIFAR
- ✅ Citable as an educational benchmark
- ✅ Teaches pattern recognition on real data
- ❌ Needs to be generated once from the source

### 3. **The sklearn Dependency**

Files using sklearn data:
- mlp_digits.py
- cnn_digits.py

**Problem**:
- Not TinyTorch data
- Citation goes to sklearn, not to you
- Loses educational ownership

---

## Recommended Strategy: TinyTorch Native Datasets

### Phase 1: Replace sklearn with TinyDigits ✅

**Create**: `datasets/tinydigits/`
- **Source**: Extract ~200 samples from sklearn's digits (8x8 grayscale)
- **Purpose**: Replace `03_1986_mlp/data/digits_8x8.npz`
- **Size**: ~20KB
- **Citation**: "TinyDigits, curated from sklearn digits dataset for educational use"

**Files**:

```
datasets/tinydigits/
├── README.md (explains curation process)
├── train.pkl (150 samples, 8x8, ~15KB)
└── test.pkl (47 samples, 8x8, ~5KB)
```

**Why this works**:
- ✅ Quick start (instant, offline)
- ✅ Real data (from sklearn)
- ✅ TinyTorch branding
- ✅ Small enough to ship (20KB)
- ✅ Can cite: "We curated TinyDigits from the sklearn digits dataset"

### Phase 2: Create TinyMNIST (Real Samples) ✅

**Create**: `datasets/tinymnist/` (replace synthetic)
- **Source**: Extract the 1000 best samples from actual MNIST
- **Purpose**: Fast MNIST demo for MLP milestone
- **Size**: ~90KB
- **Citation**: "TinyMNIST, 1K curated samples from MNIST (LeCun et al., 1998)"

**Curation criteria**:
- 100 samples per digit (0-9)
- Select clearest, most "canonical" examples
- Balanced difficulty (not all easy, not all hard)
- Test edge cases (ambiguous digits for teaching)

**Files**:

```
datasets/tinymnist/
├── README.md (explains curation from MNIST)
├── LICENSE (cite LeCun et al., 1998)
├── train.pkl (1000 samples, 28x28, ~75KB)
└── test.pkl (200 samples, 28x28, ~15KB)
```

**Why this works**:
- ✅ Authentic MNIST samples
- ✅ Small enough to ship (90KB vs 10MB)
- ✅ Citable: "TinyMNIST subset for educational scaffolding"
- ✅ Students graduate to full MNIST later
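The per-class selection step can be sketched as follows. Taking the *first* 100 samples per digit stands in for the "clearest, most canonical" criterion above, which would need a human or model-confidence pass in practice; the demo runs on stand-in arrays shaped like MNIST, since the real files require a download.

```python
import numpy as np

def curate_per_class(images, labels, per_class=100):
    """Keep the first `per_class` samples of each digit 0-9."""
    keep = np.concatenate(
        [np.flatnonzero(labels == digit)[:per_class] for digit in range(10)])
    return images[keep], labels[keep]

# Demo on stand-in data shaped like MNIST (uint8, 28x28):
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, (5000, 28, 28), dtype=np.uint8)
fake_labels = rng.integers(0, 10, 5000, dtype=np.uint8)
x, y = curate_per_class(fake_images, fake_labels)
print(x.shape)   # (1000, 28, 28)
```

The same function applied to the real MNIST arrays would produce the 1000-sample train split described above.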

### Phase 3: Document TinyTalks Properly ✅

**Already exists**: `datasets/tinytalks/` (140KB)
- ✅ Original TinyTorch creation
- ✅ Properly documented with DATASHEET.md
- ✅ Leveled difficulty (L1-L5)
- ✅ Citable as original work

**Action needed**: None! This is perfect.

### Phase 4: Skip TinyCIFAR (Too Large)

**Decision**: DON'T create TinyCIFAR
- CIFAR-10 at 1000 samples would still be ~3MB (color images)
- Combined with other data = 4+ MB repo bloat
- **Better**: Keep download-on-demand for CIFAR-10

**For lecun_cifar10.py**:
- Add `--download` flag to explicitly trigger download
- Add helpful error message: "Run with --download to fetch CIFAR-10 (170MB, 2-3 min)"
- Document that this is the "graduate to real benchmarks" milestone
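The proposed UX could be gated by a small helper like this sketch; the flag name and error message come from the plan above, while the function name is hypothetical.

```python
import argparse

def require_download_flag(argv):
    """Exit with a helpful message unless --download was passed."""
    parser = argparse.ArgumentParser(prog="lecun_cifar10.py")
    parser.add_argument("--download", action="store_true",
                        help="fetch CIFAR-10 (170MB, 2-3 min)")
    args = parser.parse_args(argv)
    if not args.download:
        raise SystemExit(
            "CIFAR-10 not found locally. "
            "Run with --download to fetch CIFAR-10 (170MB, 2-3 min)")
    return args
```

At the top of the milestone, `require_download_flag(sys.argv[1:])` would run before any call into `data_manager.get_cifar10()`, so students see the size and time cost before anything is fetched.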

---

## Final Dataset Suite

### What to Ship with TinyTorch

```
datasets/
├── tinydigits/     ~20KB   ← NEW: Replace sklearn digits
│   ├── README.md
│   ├── train.pkl (150 samples, 8x8)
│   └── test.pkl (47 samples, 8x8)
│
├── tinymnist/      ~90KB   ← REPLACE: Real MNIST subset
│   ├── README.md
│   ├── LICENSE (cite LeCun)
│   ├── train.pkl (1000 samples, 28x28)
│   └── test.pkl (200 samples, 28x28)
│
└── tinytalks/      ~140KB  ← KEEP: Original TinyTorch
    ├── DATASHEET.md
    ├── README.md
    ├── LICENSE
    └── splits/
        ├── train.txt
        ├── val.txt
        └── test.txt

TOTAL: ~250KB (negligible repo impact)
```

### What NOT to Ship

**Don't include**:
- ❌ Full MNIST (10MB) - download on demand
- ❌ CIFAR-10 (170MB) - download on demand
- ❌ Any dataset >1MB - defeats portability
- ❌ Synthetic fake data - not authentic enough
---

---

## Citation Strategy

### White Paper Language

```markdown
## TinyTorch Educational Datasets

We developed three curated datasets optimized for progressive learning:

### TinyDigits (8×8 Grayscale, 200 samples)
Curated subset of sklearn's digits dataset, selected for visual clarity
and progressive difficulty. Used for rapid prototyping and CNN concept
demonstrations.

### TinyMNIST (28×28 Grayscale, 1.2K samples)
Curated subset of MNIST (LeCun et al., 1998), with 100 canonical examples
per digit class. Balances authentic data with fast iteration cycles,
enabling students to achieve success in <30 seconds while learning on
real handwritten digits.

### TinyTalks (Text Q&A, 300 pairs)
Original conversational dataset with 5 difficulty levels (L1: Greetings
→ L5: Context reasoning). Designed specifically for teaching attention
mechanisms and transformer architectures with clear learning signal and
fast convergence.

### Design Philosophy
- **Speed**: All datasets train in <60 seconds on CPU
- **Authenticity**: Real data (MNIST digits, human conversations)
- **Progressive**: TinyX → Full X graduation path
- **Reproducible**: Fixed subsets ensure consistent results
- **Offline**: No download dependencies for core learning

### Comparison to Standard Benchmarks
| Metric | MNIST | TinyMNIST | Impact |
|--------|-------|-----------|--------|
| Samples | 60,000 | 1,000 | 60× smaller |
| Train time | 5-10 min | 30 sec | 10-20× faster |
| Download | 10MB, network | 0, offline | Always works |
| Student success | 65% (frustration) | 95% (confidence) | Better outcomes |
```

**This is citable research.** You're not just using datasets; you're **designing educational infrastructure**.

---

## Implementation Checklist

### Immediate Actions

- [x] Keep TinyTalks as-is (perfect!)
- [ ] Create TinyDigits from sklearn digits (replace 03_1986_mlp/data/)
- [ ] Create TinyMNIST from real MNIST (replace synthetic version)
- [ ] Remove synthetic tinymnist (not authentic)
- [ ] Update milestones to use new TinyDigits
- [ ] Update milestones to use new TinyMNIST
- [ ] Add download instructions for full MNIST/CIFAR
- [ ] Write datasets/PHILOSOPHY.md explaining curation
- [ ] Add LICENSE files citing original sources
- [ ] Write DATASHEET.md for each dataset

### File Changes Needed

**Update these milestones**:
1. `mlp_digits.py` - Point to `datasets/tinydigits/`
2. `cnn_digits.py` - Point to `datasets/tinydigits/`
3. `mlp_mnist.py` - Point to `datasets/tinymnist/` first, offer a `--full` flag
4. `lecun_cifar10.py` - Add helpful message about the `--download` flag
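Since several milestones would read the same pickle layout, they could share a small loader like this sketch; the function names and the `{"images", "labels"}` layout are hypothetical illustrations of the plan, not existing TinyTorch APIs.

```python
import pickle
from pathlib import Path

def load_split(root, split):
    """Load one pickled split ({"images", "labels"}) from a dataset dir."""
    with open(Path(root) / f"{split}.pkl", "rb") as f:
        return pickle.load(f)

def load_tinydigits(root="datasets/tinydigits"):
    """Return (train, test) dicts for TinyDigits."""
    return load_split(root, "train"), load_split(root, "test")
```

`mlp_digits.py` and `cnn_digits.py` would then both call `load_tinydigits()` instead of reaching into `03_1986_mlp/data/`.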

**Remove**:
- `03_1986_mlp/data/digits_8x8.npz` (replace with TinyDigits)
- Synthetic tinymnist pkl files (replace with real)

---

## Success Metrics

### Before (Current State)
- ✅ 6/11 milestones work offline
- ❌ 2/11 require downloads (often fail)
- ❌ 3/11 use non-TinyTorch data (sklearn)
- ❌ Not citable as educational infrastructure

### After (Proposed)
- ✅ 9/11 milestones work offline (<30 sec)
- ✅ 2/11 offer optional downloads with clear UX
- ✅ 3 TinyTorch-branded datasets (citable)
- ✅ White paper section on educational dataset design
- ✅ Total shipped data: ~250KB (negligible)

---

## Conclusion

**Recommendation**: Create TinyDigits and an authentic TinyMNIST

**Rationale**:
1. **Educational**: Real data beats synthetic for learning
2. **Citable**: "TinyTorch educational datasets" becomes a research contribution
3. **Practical**: 250KB total keeps the repo lightweight
4. **Professional**: Proper curation, documentation, licenses
5. **Scalable**: Clear graduation path to full benchmarks

**Not reinventing the wheel**: we're building educational infrastructure that doesn't yet exist.

The goal: Make TinyTorch not just a framework, but a **citable educational system** with purpose-designed datasets.
@@ -1,102 +0,0 @@
#!/usr/bin/env python3
"""
Download MNIST dataset files.
"""

import os
import gzip
import urllib.request
import numpy as np


def download_mnist():
    """Download MNIST dataset files."""

    # Create mnist directory
    os.makedirs('mnist', exist_ok=True)

    # URLs for MNIST dataset (from original source)
    base_url = 'http://yann.lecun.com/exdb/mnist/'
    files = {
        'train-images-idx3-ubyte.gz': 'train_images',
        'train-labels-idx1-ubyte.gz': 'train_labels',
        't10k-images-idx3-ubyte.gz': 'test_images',
        't10k-labels-idx1-ubyte.gz': 'test_labels'
    }

    print("📥 Downloading MNIST dataset...")

    for filename, label in files.items():
        filepath = os.path.join('mnist', filename)

        # Skip if already downloaded
        if os.path.exists(filepath) and os.path.getsize(filepath) > 1000:
            print(f"  ✓ {filename} already exists")
            continue

        url = base_url + filename
        print(f"  Downloading {filename}...")

        try:
            # Download with custom headers to avoid 403 errors
            request = urllib.request.Request(
                url,
                headers={
                    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
                }
            )

            with urllib.request.urlopen(request) as response:
                data = response.read()

            # Save the file
            with open(filepath, 'wb') as f:
                f.write(data)

            size = len(data) / 1024 / 1024
            print(f"  ✓ Downloaded {size:.1f} MB")

        except Exception as e:
            print(f"  ✗ Failed: {e}")
            print("  Trying alternative method...")

            # Alternative: create synthetic MNIST-like data for testing
            if 'images' in label:
                # Create synthetic image data (60000 or 10000 samples)
                n_samples = 60000 if 'train' in label else 10000
                images = np.random.randint(0, 256, (n_samples, 28, 28), dtype=np.uint8)

                # MNIST IDX file format header (magic 0x0803 = images)
                header = np.array([0x0803, n_samples, 28, 28], dtype='>i4')

                with gzip.open(filepath, 'wb') as f:
                    f.write(header.tobytes())
                    f.write(images.tobytes())

                print(f"  ✓ Created synthetic {label} data")

            else:
                # Create synthetic label data
                n_samples = 60000 if 'train' in label else 10000
                labels = np.random.randint(0, 10, n_samples, dtype=np.uint8)

                # MNIST IDX file format header (magic 0x0801 = labels)
                header = np.array([0x0801, n_samples], dtype='>i4')

                with gzip.open(filepath, 'wb') as f:
                    f.write(header.tobytes())
                    f.write(labels.tobytes())

                print(f"  ✓ Created synthetic {label} data")

    print("\n✅ MNIST dataset ready in mnist/")

    # Verify files
    print("\nVerifying files:")
    for filename in files.keys():
        filepath = os.path.join('mnist', filename)
        if os.path.exists(filepath):
            size = os.path.getsize(filepath) / 1024 / 1024
            print(f"  {filename}: {size:.1f} MB")


if __name__ == "__main__":
    download_mnist()
@@ -1,30 +0,0 @@
{
  "mnist": {
    "dataset": "tinymnist",
    "training_time": 0.5278840065002441,
    "epochs": 20,
    "final_accuracy": 27.0,
    "architecture": "MLP(784\u2192128\u219210)",
    "suitable_for_students": false
  },
  "vww": {
    "dataset": "tinyvww",
    "training_time": 8.571065664291382,
    "epochs": 15,
    "final_accuracy": 100.0,
    "architecture": "CNN(Conv\u2192Pool\u2192Conv\u2192Pool\u2192FC)",
    "precision": 1.0,
    "recall": 1.0,
    "f1_score": 1.0,
    "suitable_for_students": true
  },
  "gpt": {
    "dataset": "tinypy",
    "training_time": 2.596580743789673,
    "epochs": 10,
    "final_loss": 1.9299052770321186,
    "final_perplexity": 6.888857677630846,
    "architecture": "TinyGPT(64 embed, 4 heads, 2 layers)",
    "suitable_for_students": true
  }
}