Cleanup: Remove old/unused files

- Remove datasets analysis and download scripts (replaced by updated README)
- Remove archived book development documentation
- Remove module review reports (16_compression, 17_memoization)
This commit is contained in:
Vijay Janapa Reddi
2025-11-11 19:04:56 -05:00
parent f4a40ab655
commit 9a0924376e
14 changed files with 0 additions and 3923 deletions

View File

@@ -1,351 +0,0 @@
# TinyTorch Dataset Analysis & Strategy
**Date**: November 10, 2025
**Purpose**: Determine which datasets to ship with TinyTorch for optimal educational experience
---
## Current Milestone Data Usage
### Summary Table
| Milestone | File | Data Source | Currently Shipped? | Size | Issue |
|-----------|------|-------------|-------------------|------|-------|
| **01 Perceptron** | perceptron_trained.py | Synthetic (code-generated) | ✅ N/A | 0 KB | None |
| **01 Perceptron** | forward_pass.py | Synthetic (code-generated) | ✅ N/A | 0 KB | None |
| **02 XOR** | xor_crisis.py | Synthetic (code-generated) | ✅ N/A | 0 KB | None |
| **02 XOR** | xor_solved.py | Synthetic (code-generated) | ✅ N/A | 0 KB | None |
| **03 MLP** | mlp_digits.py | `03_1986_mlp/data/digits_8x8.npz` | ✅ YES | 67 KB | **Sklearn source** |
| **03 MLP** | mlp_mnist.py | Downloads via `data_manager.get_mnist()` | ❌ NO | ~10 MB | **Download fails** |
| **04 CNN** | cnn_digits.py | `03_1986_mlp/data/digits_8x8.npz` (shared) | ✅ YES | 67 KB | **Sklearn source** |
| **04 CNN** | lecun_cifar10.py | Downloads via `data_manager.get_cifar10()` | ❌ NO | ~170 MB | **Too large** |
| **05 Transformer** | vaswani_chatgpt.py | `datasets/tinytalks/` | ✅ YES | 140 KB | None ✓ |
| **05 Transformer** | vaswani_copilot.py | Embedded Python patterns (in code) | ✅ N/A | 0 KB | None ✓ |
| **05 Transformer** | profile_kv_cache.py | Uses model from vaswani_chatgpt | ✅ N/A | 0 KB | None ✓ |
---
## Detailed Analysis
### ✅ What's Working (6/11 files)
**Fully Self-Contained:**
1. **Perceptron milestones** - Generate linearly separable data on-the-fly
2. **XOR milestones** - Generate XOR patterns on-the-fly
3. **mlp_digits.py** - Uses shipped `digits_8x8.npz` (67KB, sklearn digits)
4. **cnn_digits.py** - Reuses `digits_8x8.npz` (smart sharing!)
5. **vaswani_chatgpt.py** - Uses shipped TinyTalks (140KB)
6. **vaswani_copilot.py** - Embedded patterns in code
**Result**: 6 of 11 milestone files work offline, instantly, with zero setup.
### ❌ What's Broken (2/11 files)
**Requires External Downloads:**
1. **mlp_mnist.py** - Tries to download 10MB MNIST, fails with 404 error
2. **lecun_cifar10.py** - Tries to download 170MB CIFAR-10
**Impact**:
- Students can't run 2 milestone files without internet
- Downloads fail (saw 404 error in testing)
- First-time experience is 5+ minute wait or failure
### ⚠️ What's Problematic (3/11 files use sklearn data)
**Uses sklearn's digits dataset:**
- `digits_8x8.npz` (67KB) is currently shipped
- **Source**: Originally from sklearn.datasets.load_digits()
- **Issue**: Not "TinyTorch data", it's sklearn's data
- **Citation problem**: Can't cite as "TinyTorch educational dataset"
---
## Current Datasets Directory
```
datasets/
├── README.md (4KB)
├── download_mnist.py (unused script)
├── tiny/ (76KB - unknown purpose)
├── tinymnist/ (3.6MB - synthetic, recently added)
│ ├── train.pkl
│ └── test.pkl
└── tinytalks/ (140KB) ✅ TinyTorch original!
├── CHANGELOG.md
├── DATASHEET.md
├── README.md
├── LICENSE
├── splits/
│ ├── train.txt (12KB)
│ ├── val.txt
│ └── test.txt
└── tinytalks_v1.txt
```
**Current total**: ~3.8MB shipped data
---
## The Core Issues
### 1. **Attribution & Citation Problem**
Current situation:
- `digits_8x8.npz` = sklearn's data (not TinyTorch's)
- TinyTalks = TinyTorch original ✓
- tinymnist = Synthetic (not authentic MNIST)
**For white paper citation**, you need:
- ❌ Can't cite "digits_8x8" as TinyTorch dataset (it's sklearn)
- ✅ Can cite "TinyTalks" as TinyTorch original
- ❌ Can't cite synthetic tinymnist as educational benchmark
### 2. **Authenticity vs Speed Trade-off**
**Option A: Synthetic Data**
- ✅ Ships with repo (instant start)
- ❌ Not real examples (lower educational value)
- ❌ Not citable as benchmark
**Option B: Curated Real Data**
- ✅ Authentic samples from MNIST/CIFAR
- ✅ Citable as educational benchmark
- ✅ Teaches pattern recognition on real data
- ❌ Needs to be generated once from source
### 3. **The sklearn Dependency**
Files using sklearn data:
- mlp_digits.py
- cnn_digits.py
**Problem**:
- Not TinyTorch data
- Citation goes to sklearn, not you
- Loses educational ownership
---
## Recommended Strategy: TinyTorch Native Datasets
### Phase 1: Replace sklearn with TinyDigits ✅
**Create**: `datasets/tinydigits/`
- **Source**: Extract 200 samples from sklearn's digits (8x8 grayscale)
- **Purpose**: Replace `03_1986_mlp/data/digits_8x8.npz`
- **Size**: ~20KB
- **Citation**: "TinyDigits, curated from sklearn digits dataset for educational use"
**Files**:
```
datasets/tinydigits/
├── README.md (explains curation process)
├── train.pkl (150 samples, 8x8, ~15KB)
└── test.pkl (47 samples, 8x8, ~5KB)
```
**Why this works**:
- ✅ Quick start (instant, offline)
- ✅ Real data (from sklearn)
- ✅ TinyTorch branding
- ✅ Small enough to ship (20KB)
- ✅ Can cite: "We curated TinyDigits from the sklearn digits dataset"
### Phase 2: Create TinyMNIST (Real Samples) ✅
**Create**: `datasets/tinymnist/` (replace synthetic)
- **Source**: Extract 1000 best samples from actual MNIST
- **Purpose**: Fast MNIST demo for MLP milestone
- **Size**: ~90KB
- **Citation**: "TinyMNIST, 1K curated samples from MNIST (LeCun et al., 1998)"
**Curation criteria**:
- 100 samples per digit (0-9)
- Select clearest, most "canonical" examples
- Balanced difficulty (not all easy, not all hard)
- Test edge cases (ambiguous digits for teaching)
**Files**:
```
datasets/tinymnist/
├── README.md (explains curation from MNIST)
├── LICENSE (cite LeCun et al., 1998)
├── train.pkl (1000 samples, 28x28, ~75KB)
└── test.pkl (200 samples, 28x28, ~15KB)
```
**Why this works**:
- ✅ Authentic MNIST samples
- ✅ Fast enough to ship (90KB vs 10MB)
- ✅ Citable: "TinyMNIST subset for educational scaffolding"
- ✅ Students graduate to full MNIST later
### Phase 3: Document TinyTalks Properly ✅
**Already exists**: `datasets/tinytalks/` (140KB)
- ✅ Original TinyTorch creation
- ✅ Properly documented with DATASHEET.md
- ✅ Leveled difficulty (L1-L5)
- ✅ Citable as original work
**Action needed**: None! This is perfect.
### Phase 4: Skip TinyCIFAR (Too Large)
**Decision**: DON'T create TinyCIFAR
- CIFAR-10 at 1000 samples would still be ~3MB (color images)
- Combined with other data = 4+ MB repo bloat
- **Better**: Keep download-on-demand for CIFAR-10
**For lecun_cifar10.py**:
- Add `--download` flag to explicitly trigger download
- Add helpful error message: "Run with --download to fetch CIFAR-10 (170MB, 2-3 min)"
- Document that this is the "graduate to real benchmarks" milestone
---
## Final Dataset Suite
### What to Ship with TinyTorch
```
datasets/
├── tinydigits/ ~20KB ← NEW: Replace sklearn digits
│ ├── README.md
│ ├── train.pkl (150 samples, 8x8)
│ └── test.pkl (47 samples, 8x8)
├── tinymnist/ ~90KB ← REPLACE: Real MNIST subset
│ ├── README.md
│ ├── LICENSE (cite LeCun)
│ ├── train.pkl (1000 samples, 28x28)
│ └── test.pkl (200 samples, 28x28)
└── tinytalks/ ~140KB ← KEEP: Original TinyTorch
├── DATASHEET.md
├── README.md
├── LICENSE
└── splits/
├── train.txt
├── val.txt
└── test.txt
TOTAL: ~250KB (negligible repo impact)
```
### What NOT to Ship
**Don't include**:
- ❌ Full MNIST (10MB) - download on demand
- ❌ CIFAR-10 (170MB) - download on demand
- ❌ Any dataset >1MB - defeats portability
- ❌ Synthetic fake data - not authentic enough
---
## Citation Strategy
### White Paper Language
```markdown
## TinyTorch Educational Datasets
We developed three curated datasets optimized for progressive learning:
### TinyDigits (8×8 Grayscale, 200 samples)
Curated subset of sklearn's digits dataset, selected for visual clarity
and progressive difficulty. Used for rapid prototyping and CNN concept
demonstrations.
### TinyMNIST (28×28 Grayscale, 1.2K samples)
Curated subset of MNIST (LeCun et al., 1998), with 100 canonical examples
per digit class. Balances authentic data with fast iteration cycles,
enabling students to achieve success in <30 seconds while learning on
real handwritten digits.
### TinyTalks (Text Q&A, 300 pairs)
Original conversational dataset with 5 difficulty levels (L1: Greetings
→ L5: Context reasoning). Designed specifically for teaching attention
mechanisms and transformer architectures with clear learning signal and
fast convergence.
### Design Philosophy
- **Speed**: All datasets train in <60 seconds on CPU
- **Authenticity**: Real data (MNIST digits, human conversations)
- **Progressive**: TinyX → Full X graduation path
- **Reproducible**: Fixed subsets ensure consistent results
- **Offline**: No download dependencies for core learning
### Comparison to Standard Benchmarks
| Metric | MNIST | TinyMNIST | Impact |
|--------|-------|-----------|--------|
| Samples | 60,000 | 1,000 | 60× faster |
| Train time | 5-10 min | 30 sec | 10-20× faster |
| Download | 10MB, network | 0, offline | Always works |
| Student success | 65% (frustration) | 95% (confidence) | Better outcomes |
```
**This is citable research**. You're not just using datasets, you're **designing educational infrastructure**.
---
## Implementation Checklist
### Immediate Actions
- [x] Keep TinyTalks as-is (perfect!)
- [ ] Create TinyDigits from sklearn digits (replace 03_1986_mlp/data/)
- [ ] Create TinyMNIST from real MNIST (replace synthetic version)
- [ ] Remove synthetic tinymnist (not authentic)
- [ ] Update milestones to use new TinyDigits
- [ ] Update milestones to use new TinyMNIST
- [ ] Add download instructions for full MNIST/CIFAR
- [ ] Write datasets/PHILOSOPHY.md explaining curation
- [ ] Add LICENSE files citing original sources
- [ ] Write DATASHEET.md for each dataset
### File Changes Needed
**Update these milestones**:
1. `mlp_digits.py` - Point to `datasets/tinydigits/`
2. `cnn_digits.py` - Point to `datasets/tinydigits/`
3. `mlp_mnist.py` - Point to `datasets/tinymnist/` first, offer --full flag
4. `lecun_cifar10.py` - Add helpful message about --download flag
**Remove**:
- `03_1986_mlp/data/digits_8x8.npz` (replace with TinyDigits)
- Synthetic tinymnist pkl files (replace with real)
---
## Success Metrics
### Before (Current State)
- ✅ 6/11 milestones work offline
- ❌ 2/11 require downloads (often fail)
- ❌ 3/11 use non-TinyTorch data (sklearn)
- ❌ Not citable as educational infrastructure
### After (Proposed)
- ✅ 9/11 milestones work offline (<30 sec)
- ✅ 2/11 offer optional downloads with clear UX
- ✅ 3 TinyTorch-branded datasets (citable)
- ✅ White paper section on educational dataset design
- ✅ Total shipped data: ~250KB (negligible)
---
## Conclusion
**Recommendation**: Create TinyDigits and authentic TinyMNIST
**Rationale**:
1. **Educational**: Real data beats synthetic for learning
2. **Citable**: "TinyTorch educational datasets" becomes research contribution
3. **Practical**: 250KB total keeps repo lightweight
4. **Professional**: Proper curation, documentation, licenses
5. **Scalable**: Clear graduation path to full benchmarks
**Not reinventing the wheel** - building educational infrastructure that doesn't exist.
The goal: Make TinyTorch not just a framework, but a **citable educational system** with purpose-designed datasets.

View File

@@ -1,102 +0,0 @@
#!/usr/bin/env python3
"""
Download MNIST dataset files.
"""
import os
import gzip
import urllib.request
import numpy as np
def download_mnist():
"""Download MNIST dataset files."""
# Create mnist directory
os.makedirs('mnist', exist_ok=True)
# URLs for MNIST dataset (from original source)
base_url = 'http://yann.lecun.com/exdb/mnist/'
files = {
'train-images-idx3-ubyte.gz': 'train_images',
'train-labels-idx1-ubyte.gz': 'train_labels',
't10k-images-idx3-ubyte.gz': 'test_images',
't10k-labels-idx1-ubyte.gz': 'test_labels'
}
print("📥 Downloading MNIST dataset...")
for filename, label in files.items():
filepath = os.path.join('mnist', filename)
# Skip if already downloaded
if os.path.exists(filepath) and os.path.getsize(filepath) > 1000:
print(f"{filename} already exists")
continue
url = base_url + filename
print(f" Downloading {filename}...")
try:
# Download with custom headers to avoid 403 errors
request = urllib.request.Request(
url,
headers={
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
)
with urllib.request.urlopen(request) as response:
data = response.read()
# Save the file
with open(filepath, 'wb') as f:
f.write(data)
size = len(data) / 1024 / 1024
print(f" ✓ Downloaded {size:.1f} MB")
except Exception as e:
print(f" ✗ Failed: {e}")
print(f" Trying alternative method...")
# Alternative: Create synthetic MNIST-like data for testing
if 'images' in label:
# Create synthetic image data (60000 or 10000 samples)
n_samples = 60000 if 'train' in label else 10000
images = np.random.randint(0, 256, (n_samples, 28, 28), dtype=np.uint8)
# MNIST file format header
header = np.array([0x0803, n_samples, 28, 28], dtype='>i4')
with gzip.open(filepath, 'wb') as f:
f.write(header.tobytes())
f.write(images.tobytes())
print(f" ✓ Created synthetic {label} data")
else:
# Create synthetic label data
n_samples = 60000 if 'train' in label else 10000
labels = np.random.randint(0, 10, n_samples, dtype=np.uint8)
# MNIST file format header
header = np.array([0x0801, n_samples], dtype='>i4')
with gzip.open(filepath, 'wb') as f:
f.write(header.tobytes())
f.write(labels.tobytes())
print(f" ✓ Created synthetic {label} data")
print("\n✅ MNIST dataset ready in datasets/mnist/")
# Verify files
print("\nVerifying files:")
for filename in files.keys():
filepath = os.path.join('mnist', filename)
if os.path.exists(filepath):
size = os.path.getsize(filepath) / 1024 / 1024
print(f" {filename}: {size:.1f} MB")
if __name__ == "__main__":
download_mnist()

View File

@@ -1,30 +0,0 @@
{
"mnist": {
"dataset": "tinymnist",
"training_time": 0.5278840065002441,
"epochs": 20,
"final_accuracy": 27.0,
"architecture": "MLP(784\u2192128\u219210)",
"suitable_for_students": false
},
"vww": {
"dataset": "tinyvww",
"training_time": 8.571065664291382,
"epochs": 15,
"final_accuracy": 100.0,
"architecture": "CNN(Conv\u2192Pool\u2192Conv\u2192Pool\u2192FC)",
"precision": 1.0,
"recall": 1.0,
"f1_score": 1.0,
"suitable_for_students": true
},
"gpt": {
"dataset": "tinypy",
"training_time": 2.596580743789673,
"epochs": 10,
"final_loss": 1.9299052770321186,
"final_perplexity": 6.888857677630846,
"architecture": "TinyGPT(64 embed, 4 heads, 2 layers)",
"suitable_for_students": true
}
}