mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-03-12 04:53:55 -05:00
Remove outdated milestone README files
Deleted 5 README/documentation files with stale information:

- 01_1957_perceptron/README.md
- 02_1969_xor/README.md
- 03_1986_mlp/README.md
- 04_1998_cnn/README.md
- 05_2017_transformer/PERFORMANCE_METRICS_DEMO.md

Issues with these files:

- Wrong file names (rosenblatt_perceptron.py, train_mlp.py, train_cnn.py)
- Old paths (examples/datasets/)
- Duplicate content (already in Python file docstrings)
- Could not be kept in sync with code

Documentation now lives exclusively in comprehensive Python docstrings at the top of each milestone file, ensuring it stays accurate and students see rich context when running files.
@@ -1,62 +0,0 @@
# 🧠 Perceptron (1957) - Rosenblatt

## What This Demonstrates

The first trainable neural network in history! Using YOUR TinyTorch implementations to recreate Rosenblatt's pioneering perceptron.

## Prerequisites

Complete these TinyTorch modules first:

- Module 02 (Tensor) - Data structures with gradients
- Module 03 (Activations) - Sigmoid activation
- Module 04 (Layers) - Linear layer

## 🚀 Quick Start

```bash
# Run the perceptron training
python rosenblatt_perceptron.py

# Test architecture only
python rosenblatt_perceptron.py --test-only

# Custom epochs
python rosenblatt_perceptron.py --epochs 200
```

## 📊 Dataset Information

### Synthetic Linearly Separable Data

- **Generated**: 1,000 points in 2D space
- **Classes**: Binary (0 or 1)
- **Property**: Linearly separable by design
- **No Download Required**: Data generated on-the-fly

### Why Synthetic Data?

The perceptron can only solve linearly separable problems. We generate data that's guaranteed to be separable to demonstrate the algorithm works when its assumptions are met.

## 🏗️ Architecture

```
Input (x1, x2) → Linear (2→1) → Sigmoid → Binary Output
```

Simple but revolutionary - this proved machines could learn!
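The architecture above is small enough to sketch in plain NumPy. This is a minimal illustration with hypothetical synthetic data (label 1 when x₁ + x₂ > 1), not the actual TinyTorch API:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linearly separable data: 1,000 points in the unit square,
# labeled 1 when x1 + x2 > 1 (separable by design, as described above)
X = rng.uniform(0, 1, size=(1000, 2))
y = (X[:, 0] + X[:, 1] > 1.0).astype(float)

# Just 3 parameters: 2 weights + 1 bias
w = np.zeros(2)
b = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for epoch in range(500):
    p = sigmoid(X @ w + b)            # forward: Linear (2→1) → Sigmoid
    grad = p - y                      # dLoss/dz for binary cross-entropy
    w -= lr * X.T @ grad / len(X)     # gradient step on the 2 weights
    b -= lr * grad.mean()             # and on the bias

accuracy = ((sigmoid(X @ w + b) > 0.5) == y).mean()
print(f"accuracy: {accuracy:.2%}")
```

Because the data is linearly separable, plain gradient descent on these 3 parameters is enough to reach high accuracy.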
## 📈 Expected Results

- **Training Time**: ~30 seconds
- **Accuracy**: 95%+ (problem is linearly separable)
- **Parameters**: Just 3 (2 weights + 1 bias)

## 💡 Historical Significance

- **1957**: Rosenblatt introduces the first trainable neural network
- **Innovation**: Weights that adjust based on errors
- **Limitation**: Can't solve XOR (see xor_1969 example)
- **Legacy**: Foundation for all modern neural networks

## 🔧 Command Line Options

- `--test-only`: Test architecture without training
- `--epochs N`: Number of training epochs (default: 100)

## 📚 What You Learn

- How the first neural network worked
- Why gradients enable learning
- YOUR Linear layer performs the same math as in 1957
- Limitations that led to multi-layer networks
@@ -1,145 +0,0 @@
# ⊕ XOR Problem (1969) - Minsky & Papert

## Historical Significance

In 1969, Marvin Minsky and Seymour Papert published "Perceptrons," mathematically proving that single-layer perceptrons **cannot** solve the XOR problem. This revelation killed neural network research funding for over a decade - the infamous "AI Winter."

In 1986, Rumelhart, Hinton, and Williams published the backpropagation algorithm for multi-layer networks, and XOR became trivial. This milestone recreates both the crisis and the solution using YOUR TinyTorch!

## Prerequisites

Complete these TinyTorch modules first:

**For Part 1 (xor_crisis.py):**
- Module 01 (Tensor)
- Module 02 (Activations)
- Module 03 (Layers)
- Module 04 (Losses)
- Module 05 (Autograd)
- Module 06 (Optimizers)

**For Part 2 (xor_solved.py):**
- All of the above ✓

## Quick Start

### Part 1: The Crisis (1969)

Watch a single-layer perceptron **fail** to learn XOR:

```bash
python milestones/02_xor_crisis_1969/xor_crisis.py
```

**Expected:** ~50% accuracy (random guessing) - proves Minsky was right!

### Part 2: The Solution (1986)

Watch a multi-layer network **solve** the "impossible" problem:

```bash
python milestones/02_xor_crisis_1969/xor_solved.py
```

**Expected:** 75%+ accuracy (problem solved!) - proves hidden layers work!
## The XOR Problem

### What is XOR?

XOR (Exclusive OR) outputs 1 when inputs **differ**, 0 when they're the **same**:

```
┌────┬────┬─────┐
│ x₁ │ x₂ │ XOR │
├────┼────┼─────┤
│ 0  │ 0  │  0  │ ← same
│ 0  │ 1  │  1  │ ← different
│ 1  │ 0  │  1  │ ← different
│ 1  │ 1  │  0  │ ← same
└────┴────┴─────┘
```

### Why It's Impossible for Single Layers

The problem is **non-linearly separable** - no single straight line can separate the points:

```
Visual Representation:

1 │  ○ (0,1)      ● (1,1)     Try drawing a line:
  │  [1]          [0]         ANY line fails!
  │
0 │  ● (0,0)      ○ (1,0)
  │  [0]          [1]
  └─────────────────
     0             1
```

This fundamental limitation ended the first era of neural networks.
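The crisis can be reproduced in a few lines. This minimal NumPy sketch (a hypothetical stand-in for xor_crisis.py, not the actual TinyTorch code) trains a single sigmoid unit on the four XOR points. By symmetry, the gradient at the zero initialization is exactly zero, so the loss never moves off ln 2 ≈ 0.693 — exactly the "stuck around 0.69" behavior described in this milestone:

```python
import numpy as np

# The four XOR points and their labels
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0., 1., 1., 0.])

w = np.zeros(2)
b = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(1000):
    p = sigmoid(X @ w + b)
    w -= 0.5 * X.T @ (p - y) / 4   # this gradient is exactly zero on XOR
    b -= 0.5 * (p - y).mean()      # ...and so is this one

p = sigmoid(X @ w + b)
loss = -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()
print(f"loss after training: {loss:.3f}")  # 0.693 — stuck at ln 2, random guessing
```

No matter how long the loop runs, the single layer stays at 50% accuracy on these points.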
## The Solution

Hidden layers create a **new feature space** where XOR becomes linearly separable!

### Original 1986 Architecture

```
Input (2) → Hidden (2) + Sigmoid → Output (1) + Sigmoid

Total: Only 9 parameters!
```

The 2 hidden units learn:
- `h₁ ≈ x₁ AND NOT x₂`
- `h₂ ≈ x₂ AND NOT x₁`
- `output ≈ h₁ OR h₂` = XOR

### Our Implementation

```
Input (2) → Hidden (4-8) + ReLU → Output (1) + Sigmoid

Modern activation, slightly larger for robustness
```
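As a sanity check that hidden layers really do crack XOR, here is a self-contained NumPy sketch of this shape. It is hypothetical illustration code, not the milestone script, and it uses 4 tanh hidden units in place of the ReLU layer (tanh keeps the hand-written backward pass to one line):

```python
import numpy as np

rng = np.random.default_rng(42)

# The four XOR points and their labels
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.], [1.], [1.], [0.]])

# Input (2) → Hidden (4) + tanh → Output (1) + sigmoid
W1 = rng.normal(0, 1, (2, 4)); b1 = np.zeros(4)
W2 = rng.normal(0, 1, (4, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for step in range(5000):
    h = np.tanh(X @ W1 + b1)           # hidden layer builds new features
    p = sigmoid(h @ W2 + b2)           # output layer separates them linearly
    dz2 = (p - y) / len(X)             # backprop: BCE + sigmoid gradient
    dW2 = h.T @ dz2; db2 = dz2.sum(0)
    dh = dz2 @ W2.T * (1 - h ** 2)     # tanh derivative
    dW1 = X.T @ dh; db1 = dh.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

pred = (sigmoid(np.tanh(X @ W1 + b1) @ W2 + b2) > 0.5).astype(int)
print(pred.ravel())  # the four XOR predictions, in input order
```

The same gradient flow through the hidden layer is what YOUR autograd automates in xor_solved.py.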
## Expected Results

### Part 1: The Crisis

- **Accuracy:** ~50% (random guessing)
- **Loss:** Stuck around 0.69 (not decreasing)
- **Weights:** Don't converge to meaningful values
- **Conclusion:** Single-layer perceptrons **cannot** solve XOR

### Part 2: The Solution

- **Accuracy:** 75-100% (problem solved!)
- **Loss:** Decreases to ~0.35 or lower
- **Weights:** Learn meaningful features
- **Conclusion:** Multi-layer networks **can** solve XOR

## What You Learn

1. **Why depth matters** - Hidden layers enable non-linear functions
2. **Historical context** - The XOR crisis that stopped AI research
3. **The breakthrough** - Backpropagation through hidden layers
4. **Your autograd works!** - Multi-layer gradients flow correctly

## Files in This Milestone

- `xor_crisis.py` - Single-layer perceptron **failing** on XOR (1969 crisis)
- `xor_solved.py` - Multi-layer network **solving** XOR (1986 breakthrough)
- `README.md` - This file

## Historical Timeline

- **1969:** Minsky & Papert prove single-layer networks can't solve XOR
- **1970-1986:** AI Winter - 17 years of minimal neural network research
- **1986:** Rumelhart, Hinton, Williams publish backpropagation for multi-layer nets
- **1986+:** AI Renaissance begins
- **TODAY:** Deep learning powers GPT, AlphaGo, autonomous vehicles, etc.

## Next Steps

After completing this milestone:

- **Milestone 03:** MLP Revival (1986) - Train deeper networks on real data
- **Module 08:** DataLoaders for batch processing
- **Module 09:** CNNs for image recognition

Every modern AI architecture builds on what you just learned - hidden layers + backpropagation!
@@ -1,92 +0,0 @@
# 🔢 MNIST MLP (1986) - Backpropagation Revolution

## What This Demonstrates

Multi-layer network solving real vision! Backpropagation enables training deep networks on actual handwritten digits.

## Prerequisites

Complete these TinyTorch modules first:
- Module 02 (Tensor) - Data structures
- Module 03 (Activations) - ReLU, Softmax
- Module 04 (Layers) - Linear layers
- Module 06 (Autograd) - Backpropagation
- Module 07 (Optimizers) - SGD optimizer
- Module 08 (Training) - Training loops

Note: Runs BEFORE Module 10 (DataLoader), so uses manual batching.

## 🚀 Quick Start

```bash
# Train on MNIST digits
python train_mlp.py

# Test architecture only
python train_mlp.py --test-only

# Quick training (fewer epochs)
python train_mlp.py --epochs 3
```

## 📊 Dataset Information

### MNIST Handwritten Digits

- **Size**: 70,000 grayscale 28×28 images (60K train, 10K test)
- **Classes**: Digits 0-9
- **Download**: ~10MB from http://yann.lecun.com/exdb/mnist/
- **Storage**: Cached in `examples/datasets/mnist/` after first download

### Sample Digits

```
   "7"        "2"        "1"
░░░████░░  █████████  ░░░██░░░
░░░░░██░░  ░░░░░░██░  ░░███░░░
░░░░██░░░  ░░░░░██░░  ░░░██░░░
░░░██░░░░  ░░░██░░░░  ░░░██░░░
░░██░░░░░  ░░██░░░░░  ░░░██░░░
░░██░░░░░  ██████████ ░░░██░░░
```

### Data Flow

1. **Download**: Automatic from LeCun's website
2. **Format**: Flatten 28×28 → 784 features
3. **Batching**: Manual (DataLoader not available yet)
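Manual batching is simple enough to sketch. A hypothetical version of the loop (small random arrays standing in for the real flattened MNIST data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the flattened MNIST arrays (28×28 → 784 features each)
images = rng.normal(size=(1000, 784)).astype(np.float32)
labels = rng.integers(0, 10, size=1000)

batch_size = 32

# Manual batching: shuffle indices each epoch, then slice fixed-size chunks
indices = rng.permutation(len(images))
for start in range(0, len(images), batch_size):
    batch_idx = indices[start:start + batch_size]
    x_batch = images[batch_idx]   # (32, 784)
    y_batch = labels[batch_idx]   # (32,)
    # ... forward / loss / backward / optimizer step would go here ...
    break  # one batch shown for brevity

print(x_batch.shape, y_batch.shape)
```

Module 10's DataLoader later wraps exactly this shuffle-and-slice pattern behind an iterator.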
## 🏗️ Architecture

```
Input (784) → Linear (784→128) → ReLU → Linear (128→64) → ReLU → Linear (64→10) → Output
                    ↑                          ↑                        ↑
              Hidden Layer 1             Hidden Layer 2             10 Classes
```
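The "~100K weights" figure quoted for this model follows directly from the layer sizes above:

```python
# Parameter count for the MLP above: each Linear layer holds
# in_features × out_features weights plus out_features biases.
layers = [(784, 128), (128, 64), (64, 10)]
total = sum(n_in * n_out + n_out for n_in, n_out in layers)
print(total)  # 109386 — roughly the "~100K weights" figure
```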
## 📈 Expected Results

- **Training Time**: 2-3 minutes (5 epochs)
- **Accuracy**: 95%+ on test set
- **Parameters**: ~100K weights

## 💡 Historical Significance

- **1986**: Backprop paper enables deep learning
- **Innovation**: Automatic gradient computation
- **Impact**: Proved neural networks could solve real problems
- **YOUR Version**: Same architecture, YOUR implementation!

## 🔧 Command Line Options

- `--test-only`: Test architecture without training
- `--epochs N`: Training epochs (default: 5)

## 📚 What You Learn

- How to handle real vision datasets
- Multi-layer networks for complex patterns
- Manual batching before DataLoader
- YOUR complete training pipeline works!

## 🐛 Troubleshooting

### Download Issues

If MNIST download fails:
- Check your internet connection
- The script falls back to synthetic data automatically
- Manual download: http://yann.lecun.com/exdb/mnist/

### Memory Issues

- Reduce batch size in the code (default: 32)
- Train for fewer epochs: `--epochs 2`
@@ -1,98 +0,0 @@
# 🖼️ CIFAR-10 CNN Example

## What This Demonstrates

A modern CNN architecture for natural image classification using YOUR TinyTorch implementations!

## Prerequisites

Complete these TinyTorch modules first:
- Module 02 (Tensor) - Data structures
- Module 03 (Activations) - ReLU
- Module 04 (Layers) - Linear layers
- Module 07 (Optimizers) - Adam
- Module 09 (Spatial) - Conv2d, MaxPool2D
- Module 10 (DataLoader) - Dataset, DataLoader

## 🚀 Quick Start

```bash
# Test architecture only (no data download)
python train_cnn.py --test-only

# Train with real CIFAR-10 data (~170MB download)
python train_cnn.py

# Quick test with subset of data
python train_cnn.py --quick-test
```
## 📊 Dataset Information

### CIFAR-10 Details

- **Size**: 60,000 32×32 color images (50K train, 10K test)
- **Classes**: airplane, car, bird, cat, deer, dog, frog, horse, ship, truck
- **Download**: ~170MB from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
- **Storage**: Cached in `examples/datasets/cifar-10/` after first download

### Data Flow

1. **First Run**: Downloads CIFAR-10 from the web (shows progress)
2. **Subsequent Runs**: Uses cached data (no re-download)
3. **Offline Mode**: Falls back to synthetic data if download fails

### Dataset Handling

```python
# The example uses DatasetManager for downloading
data_manager = DatasetManager()  # Handles download/caching
(train_data, train_labels), (test_data, test_labels) = data_manager.get_cifar10()

# Then wraps the arrays in YOUR Dataset interface
train_dataset = CIFARDataset(train_data, train_labels)  # YOUR Dataset

# Finally uses YOUR DataLoader for batching
train_loader = DataLoader(train_dataset, batch_size=32)  # YOUR DataLoader
```
## 🏗️ Architecture

```
Input (32×32×3) → Conv2d (3→32) → ReLU → MaxPool (2×2) →
Conv2d (32→64) → ReLU → MaxPool (2×2) → Flatten →
Linear (2304→256) → ReLU → Linear (256→10) → Output
```
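The Flatten size of 2304 follows from the spatial dimensions, assuming 3×3 convolution kernels with no padding and 2×2 max pooling with stride 2 (an assumption consistent with the numbers above, though the kernel sizes are not stated here):

```python
# Track the spatial size through the network above
size = 32
size = size - 2    # Conv2d 3→32, 3×3 kernel, no padding: 32 → 30
size = size // 2   # MaxPool 2×2: 30 → 15
size = size - 2    # Conv2d 32→64, 3×3 kernel, no padding: 15 → 13
size = size // 2   # MaxPool 2×2: 13 → 6
flat = 64 * size * size
print(flat)  # 2304 features feeding Linear (2304→256)
```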
## 📈 Expected Results

- **Training Time**: 3-5 minutes for demo (3 epochs, 100 batches/epoch)
- **Accuracy**: 65%+ on test set (with simple architecture)
- **Parameters**: ~600K weights

## 🔧 Command Line Options

- `--epochs N`: Number of training epochs (default: 3)
- `--batch-size N`: Batch size (default: 32)
- `--test-only`: Test architecture without training
- `--quick-test`: Use subset of data for quick testing
- `--no-visualize`: Skip visualization

## 💡 What You Learn

- How CNNs extract hierarchical features from images
- Why spatial structure matters for vision
- How YOUR Conv2d, MaxPool2D, and DataLoader work together
- Complete end-to-end training pipeline with real data

## 🐛 Troubleshooting

### Download Issues

If CIFAR-10 download fails:
- Check internet connection
- The example will automatically use synthetic data
- You can manually download from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz

### Memory Issues

If you run out of memory:
- Use smaller batch size: `--batch-size 16`
- Use quick test mode: `--quick-test`
- Reduce number of epochs: `--epochs 1`

## 📚 Educational Notes

This example shows how YOUR implementations handle:
- **Spatial feature extraction** through convolutions
- **Efficient data loading** with batching and shuffling
- **Real-world datasets** with proper train/test splits
- **Complete training loops** with YOUR optimizer and autograd
@@ -1,191 +0,0 @@
# Performance Metrics Demo - Phase 1 Complete ✅

**Date:** November 5, 2025
**Status:** Ready for Module 14 KV Caching Implementation

---

## 🎯 What Was Added

Enhanced `vaswani_chatgpt.py` with comprehensive performance metrics to prepare students for Module 14 (KV Caching).

### Key Changes

1. **Enhanced `generate()` method**
   - Tracks start/end time
   - Counts tokens generated
   - Calculates tokens/sec
   - Optional `return_stats=True` parameter

2. **Performance display during demo**
   - Per-question speed metrics
   - Summary performance table
   - Educational note about KV caching

3. **Training checkpoints show speed**
   - Live generation speed during testing
   - Average speed across test prompts
---

## 📊 What Students Will See

### During Training (Every 3 Epochs)

```
🧪 Testing Live Predictions:

Q: Hello!
A: Hi there! How are you?
⚡ 42.3 tok/s

Q: What is your name?
A: I am TinyBot, a chatbot
⚡ 38.7 tok/s

Q: What color is the sky?
A: The sky is blue
⚡ 45.1 tok/s

Average generation speed: 42.0 tokens/sec
```

### Final Demo Output

```
======================================================================
🤖 TinyBot Demo: Ask Me Questions!
======================================================================

Q: Hello!
A: Hi there! How are you today?
⚡ 43.5 tok/s | 📊 28 tokens | ⏱️ 0.643s

Q: What is your name?
A: I am TinyBot, a friendly chatbot.
⚡ 41.2 tok/s | 📊 34 tokens | ⏱️ 0.825s

Q: What color is the sky?
A: The sky is blue on a clear day.
⚡ 39.8 tok/s | 📊 32 tokens | ⏱️ 0.804s

Q: How many legs does a dog have?
A: A dog has four legs.
⚡ 44.7 tok/s | 📊 22 tokens | ⏱️ 0.492s

Q: What is 2 plus 3?
A: 2 plus 3 equals 5.
⚡ 46.1 tok/s | 📊 19 tokens | ⏱️ 0.412s

Q: What do you use a pen for?
A: You use a pen for writing.
⚡ 42.8 tok/s | 📊 25 tokens | ⏱️ 0.584s

======================================================================

╭────── ⚡ Generation Performance Summary ──────╮
│ Metric                    │ Value            │
├───────────────────────────┼──────────────────┤
│ Average Speed             │ 43.0 tokens/sec  │
│ Average Time/Question     │ 0.627 seconds    │
│ Total Tokens Generated    │ 160 tokens       │
│ Total Generation Time     │ 3.76 seconds     │
│ Questions Answered        │ 6                │
╰───────────────────────────┴──────────────────╯

💡 Note: In Module 14 (KV Caching), you'll learn how to make this 10-15x faster!
   Current: ~43 tok/s → With KV Cache: ~516 tok/s 🚀
```
---

## 🎓 Educational Value

### For Students Before Module 14

Students will:
1. ✅ See concrete performance numbers (not just loss values)
2. ✅ Understand that ~40-50 tok/s is the baseline
3. ✅ Get excited about the 10-15x speedup promise
4. ✅ Naturally wonder: "How does KV caching work?"

### Setting Up the Motivation

The final note creates natural curiosity:

```
💡 Note: In Module 14 (KV Caching), you'll learn how to make this 10-15x faster!
   Current: ~43 tok/s → With KV Cache: ~516 tok/s 🚀
```

Students will think:
- "Wow, I can make my transformer 10x faster?"
- "What is KV caching?"
- "I want to learn that next!"
---

## 🚀 Next Phase: Module 14 Implementation

### Phase 2: Create Benchmark Comparison Script

After implementing Module 14, create `benchmark_caching.py`:

```python
# Compare performance with/without KV caching
results = {
    'no_cache': benchmark_generation(model, prompts, use_cache=False),
    'with_cache': benchmark_generation(model, prompts, use_cache=True),
}

# Show dramatic speedup
print_comparison_table(results)
```

### Phase 3: Side-by-Side Interactive Demo

Create `performance_comparison.py` showing both running simultaneously.
---

## 📈 Expected Performance Ranges

Based on TinyTorch transformer implementation:

| Configuration | Tokens/Sec (No Cache) | Tokens/Sec (With Cache) | Speedup |
|---------------|-----------------------|-------------------------|---------|
| Tiny (embed=64, layers=2) | ~80 tok/s | ~600 tok/s | 7.5x |
| Small (embed=96, layers=4) | ~40 tok/s | ~500 tok/s | 12.5x |
| Medium (embed=128, layers=6) | ~25 tok/s | ~400 tok/s | 16x |
| Large (embed=256, layers=8) | ~12 tok/s | ~200 tok/s | 16.7x |

**Key Insight:** The speedup increases with:
- Larger models (more computation saved)
- Longer sequences (more tokens to cache)
- More attention heads (more KV pairs to reuse)
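A toy cost model makes the sequence-length effect concrete. It counts only attention steps and ignores the per-layer matmuls that dominate real runtimes, so actual speedups (like the 7.5x-16.7x above) are far smaller than this bound:

```python
# Without a KV cache, generating token t re-runs attention over all t
# previous positions, so a T-token generation costs ~T²/2 attention steps.
# With a cache, each new token attends over cached keys/values once: ~T steps.
T = 100
no_cache_steps = sum(t for t in range(1, T + 1))  # 1 + 2 + ... + 100 = 5050
with_cache_steps = T                              # 100
print(no_cache_steps / with_cache_steps)          # 50.5x fewer attention steps
```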
---

## ✅ Phase 1 Complete Checklist

- [x] Added timing to `generate()` method
- [x] Created `return_stats` parameter
- [x] Enhanced `demo_questions()` with metrics
- [x] Updated `test_model_predictions()` with speed display
- [x] Added performance summary table
- [x] Included educational note about Module 14
- [x] Tested syntax and committed changes
- [ ] **Next:** Implement Module 14 (KV Caching)

---

## 🎯 Success Criteria

Students should be able to:
1. ✅ Run `vaswani_chatgpt.py` and see performance metrics
2. ✅ Understand their transformer generates ~40-50 tokens/sec
3. ✅ See the performance summary table
4. ✅ Be motivated to learn KV caching for speedup

---

*Ready to implement Module 14: KV Caching! 🚀*