MAJOR: Implement beautiful module progression through strategic reordering

This commit implements the pedagogically optimal "inevitable discovery" module progression based on expert validation and educational design principles.

## Module Reordering Summary

**Previous Order (Problems)**:
- 05_losses → 06_autograd → 07_dataloader → 08_optimizers → 09_spatial → 10_training
- Issues: Autograd before optimizers, DataLoader before training, scattered dependencies

**New Order (Beautiful Progression)**:
- 05_losses → 06_optimizers → 07_autograd → 08_training → 09_spatial → 10_dataloader
- Benefits: Each module creates inevitable need for the next

## Pedagogical Flow Achieved

**05_losses** → "Need systematic weight updates" → **06_optimizers**
**06_optimizers** → "Need automatic gradients" → **07_autograd**
**07_autograd** → "Need systematic training" → **08_training**
**08_training** → "MLPs hit limits on images" → **09_spatial**
**09_spatial** → "Training is too slow" → **10_dataloader**

## Technical Changes

### Module Directory Renaming
- `06_autograd` → `07_autograd`
- `07_dataloader` → `10_dataloader`
- `08_optimizers` → `06_optimizers`
- `10_training` → `08_training`
- `09_spatial` → `09_spatial` (no change)

### System Integration Updates
- **MODULE_TO_CHECKPOINT mapping**: Updated in tito/commands/export.py
- **Test directories**: Renamed module_XX directories to match new numbers
- **Documentation**: Updated all references in MD files and agent configurations
- **CLI integration**: Updated next-steps suggestions for proper flow

### Agent Configuration Updates
- **Quality Assurance**: Updated module audit status with new numbers
- **Module Developer**: Updated work tracking with new sequence
- **Documentation**: Updated MASTER_PLAN_OF_RECORD.md with beautiful progression

## Educational Benefits

1. **Inevitable Discovery**: Each module naturally leads to the next
2. **Cognitive Load**: Concepts introduced exactly when needed
3. **Motivation**: Students understand WHY each tool is necessary
4. **Synthesis**: Everything flows toward complete ML systems understanding
5. **Professional Alignment**: Matches real ML engineering workflows

## Quality Assurance

-  All CLI commands still function
-  Checkpoint system mappings updated
-  Documentation consistency maintained
-  Test directory structure aligned
-  Agent configurations synchronized

**Impact**: This reordering transforms TinyTorch from a collection of modules into a coherent educational journey where each step naturally motivates the next, creating optimal conditions for deep learning systems understanding.
This commit is contained in:
Vijay Janapa Reddi
2025-09-24 15:56:47 -04:00
parent 0d87b6603f
commit 2f23f757e7
68 changed files with 5875 additions and 2399 deletions

View File

@@ -0,0 +1,517 @@
# %% [markdown]
"""
# Module 15: Hardware Acceleration and Kernel Optimization
## Learning Objectives
By the end of this module, you will be able to:
1. **Understand Why Loops Are Slow**: See why your Module 2/4 loops have poor performance
2. **Implement Cache-Friendly Blocking**: Build blocked matrix multiplication that leverages CPU cache
3. **Recognize When to Use Libraries**: Understand when NumPy optimizations beat custom code
4. **Build Transparent Backend Systems**: Create automatic switching between implementations
## The Optimization Journey
**Key Message**: You implemented loops to understand the algorithm. Now we'll optimize them to understand systems performance, then switch to NumPy because it already has these (and more) optimizations built-in.
**The Journey:**
1. **Baseline**: Your loops from Module 2/4 (educational, slow)
2. **Blocking**: Cache-friendly version (educational, faster)
3. **NumPy**: Production version (optimal performance)
4. **Backend**: Smart switching system
"""
# %% [markdown]
"""
## Part 1: Baseline Implementation - Your Loops from Module 2/4
Let's start with the educational triple-nested loops you implemented earlier. These were perfect for learning but terrible for performance.
"""
# %%
#| default_exp core.acceleration
import time
import numpy as np
def educational_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
"""
Educational matrix multiplication using triple nested loops.
This is the same implementation from Module 2/4 - perfect for learning
the algorithm, but very slow due to poor cache performance.
"""
m, k = a.shape
k2, n = b.shape
assert k == k2, f"Incompatible shapes: {a.shape} @ {b.shape}"
# Initialize result matrix
c = np.zeros((m, n), dtype=np.float32)
# Triple nested loop - the educational implementation
for i in range(m):
for j in range(n):
for l in range(k):
c[i, j] += a[i, l] * b[l, j]
return c
# %% [markdown]
"""
### Test Educational Implementation
Let's test our educational loops and see why they're slow.
"""
# %%
def test_educational_baseline():
"""Test educational implementation and measure its performance"""
print("Testing Educational Implementation...")
# Test correctness with small matrices
a = np.array([[1, 2], [3, 4]], dtype=np.float32)
b = np.array([[5, 6], [7, 8]], dtype=np.float32)
result_educational = educational_matmul(a, b)
result_numpy = a @ b
assert np.allclose(result_educational, result_numpy), "Educational matmul incorrect"
print("✅ Educational implementation produces correct results")
# Performance comparison (small sizes only - educational is VERY slow)
print("\nPerformance comparison:")
small_a = np.random.randn(100, 100).astype(np.float32)
small_b = np.random.randn(100, 100).astype(np.float32)
# Time educational implementation
start = time.perf_counter()
_ = educational_matmul(small_a, small_b)
educational_time = time.perf_counter() - start
# Time NumPy implementation
start = time.perf_counter()
_ = small_a @ small_b
numpy_time = time.perf_counter() - start
speedup = educational_time / numpy_time
print(f"Educational loops: {educational_time*1000:.1f} ms")
print(f"NumPy optimized: {numpy_time*1000:.1f} ms")
print(f"NumPy is {speedup:.1f}x faster")
print("✅ Educational baseline established")
return educational_time, numpy_time, speedup
# %% [markdown]
"""
## Part 2: Cache-Friendly Blocking - Your First Optimization
Now let's implement blocked matrix multiplication. This teaches you about CPU cache hierarchy by processing data in blocks that fit in cache.
"""
# %%
def blocked_matmul(a: np.ndarray, b: np.ndarray, block_size: int = 64) -> np.ndarray:
"""
Cache-friendly blocked matrix multiplication.
This version processes data in blocks that fit in CPU cache.
Key insight: Keep working set small enough to fit in L1/L2 cache.
Args:
a: Left matrix (m × k)
b: Right matrix (k × n)
block_size: Size of cache-friendly blocks (typically 32-128)
"""
m, k = a.shape
k2, n = b.shape
assert k == k2, f"Incompatible shapes: {a.shape} @ {b.shape}"
# Initialize result
c = np.zeros((m, n), dtype=np.float32)
# Process in blocks to maximize cache utilization
for i in range(0, m, block_size):
for j in range(0, n, block_size):
for l in range(0, k, block_size):
# Define block boundaries
i_end = min(i + block_size, m)
j_end = min(j + block_size, n)
l_end = min(l + block_size, k)
# Extract blocks (these stay in cache)
a_block = a[i:i_end, l:l_end]
b_block = b[l:l_end, j:j_end]
# Multiply blocks using NumPy (optimized BLAS)
c[i:i_end, j:j_end] += a_block @ b_block
return c
# %% [markdown]
"""
### Test Blocked Implementation
Let's see how much faster cache-friendly blocking is compared to educational loops.
"""
def test_blocked_optimization():
"""Test blocked matrix multiplication performance"""
print("Testing Blocked Matrix Multiplication...")
# Test correctness
a = np.random.randn(200, 200).astype(np.float32)
b = np.random.randn(200, 200).astype(np.float32)
result_blocked = blocked_matmul(a, b, block_size=64)
result_numpy = a @ b
assert np.allclose(result_blocked, result_numpy, atol=1e-3), "Blocked matmul incorrect"
print("✅ Blocked implementation produces correct results")
# Performance comparison
print("\nPerformance comparison:")
# Educational vs Blocked vs NumPy
size = 200
test_a = np.random.randn(size, size).astype(np.float32)
test_b = np.random.randn(size, size).astype(np.float32)
# Time educational (smaller subset to avoid waiting forever)
start = time.perf_counter()
_ = educational_matmul(test_a[:50, :50], test_b[:50, :50])
educational_time = time.perf_counter() - start
educational_time_scaled = educational_time * (size/50)**3 # Scale up
# Time blocked
start = time.perf_counter()
_ = blocked_matmul(test_a, test_b, block_size=64)
blocked_time = time.perf_counter() - start
# Time NumPy
start = time.perf_counter()
_ = test_a @ test_b
numpy_time = time.perf_counter() - start
print(f"Educational (est): {educational_time_scaled*1000:.1f} ms")
print(f"Blocked: {blocked_time*1000:.1f} ms")
print(f"NumPy: {numpy_time*1000:.1f} ms")
speedup_blocked = educational_time_scaled / blocked_time
speedup_numpy = educational_time_scaled / numpy_time
print(f"\nBlocked is {speedup_blocked:.1f}x faster than educational")
print(f"NumPy is {speedup_numpy:.1f}x faster than educational")
print("✅ Blocked optimization tested successfully")
return blocked_time, numpy_time
# %% [markdown]
"""
## Part 3: NumPy Optimization - Production Performance
Now we'll switch to NumPy for production use. The key insight: NumPy already has these optimizations (and more) built-in.
"""
# %%
def optimized_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
"""
Production matrix multiplication using NumPy.
This is what you should actually use in practice.
NumPy already has blocking, vectorization, and BLAS optimizations built-in.
"""
return a @ b
# %% [markdown]
"""
### Test Production Implementation
Let's verify that NumPy is indeed the best choice for production.
"""
# %%
def test_production_performance():
"""Test that NumPy is indeed optimal for production use"""
print("Testing Production Performance...")
# Test different sizes
sizes = [200, 500, 800]
print("\nPerformance comparison across the optimization spectrum:")
for size in sizes:
print(f"\nMatrix size: {size}x{size}")
a = np.random.randn(size, size).astype(np.float32)
b = np.random.randn(size, size).astype(np.float32)
# Time blocked implementation
start = time.perf_counter()
_ = blocked_matmul(a, b, block_size=64)
blocked_time = time.perf_counter() - start
# Time NumPy implementation
start = time.perf_counter()
_ = optimized_matmul(a, b)
numpy_time = time.perf_counter() - start
speedup = blocked_time / numpy_time
print(f"Blocked: {blocked_time*1000:6.1f} ms")
print(f"NumPy: {numpy_time*1000:6.1f} ms")
print(f"NumPy is {speedup:.1f}x faster than blocked")
print("\n💡 Key Insight: NumPy already has these optimizations built-in!")
print(" • Blocking algorithms")
print(" • Vectorization")
print(" • Hardware-specific BLAS libraries")
print(" • Assembly-level optimizations")
print("\n✅ Production performance verified")
return True
# %% [markdown]
"""
## Part 4: Backend System - Transparent Switching
Now let's build a system that automatically chooses the right implementation.
"""
# %%
class OptimizedBackend:
"""Backend that automatically uses the best implementation"""
def matmul(self, a: np.ndarray, b: np.ndarray) -> np.ndarray:
"""Matrix multiplication using NumPy (best for production)"""
return optimized_matmul(a, b)
# Global backend instance
_backend = OptimizedBackend()
def matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
"""Matrix multiplication using current backend"""
return _backend.matmul(a, b)
# %% [markdown]
"""
### Test Backend System
Let's verify our backend system works correctly and uses optimal implementations.
"""
# %%
def test_backend_system():
"""Test the backend system"""
print("Testing Backend System...")
# Test matrices
a = np.random.randn(100, 100).astype(np.float32)
b = np.random.randn(100, 100).astype(np.float32)
# Test that our backend works
result = matmul(a, b)
expected = a @ b
assert np.allclose(result, expected), "Backend matmul incorrect"
print("✅ Backend produces correct results")
# Compare performance
start = time.perf_counter()
_ = matmul(a, b)
backend_time = time.perf_counter() - start
start = time.perf_counter()
_ = a @ b
numpy_time = time.perf_counter() - start
print(f"\nPerformance comparison:")
print(f"Backend: {backend_time*1000:.1f} ms")
print(f"NumPy: {numpy_time*1000:.1f} ms")
print(f"Backend uses optimal NumPy implementation")
print("\n✅ Backend system works correctly")
return True
# %% [markdown]
"""
## Comprehensive Testing
Let's run all our components together to see the complete optimization journey.
"""
# %%
def run_complete_acceleration_demo():
"""Run the complete acceleration demonstration"""
print("🚀 Complete Acceleration Module Demo")
print("=" * 50)
print("THE OPTIMIZATION JOURNEY: From Loops to NumPy")
# 1. Test educational baseline
print("\n1. Educational Baseline (your Module 2/4 loops):")
educational_results = test_educational_baseline()
# 2. Test blocked optimization
print("\n2. Cache-Friendly Blocking:")
test_blocked_optimization()
# 3. Test production performance
print("\n3. Production Performance (NumPy):")
test_production_performance()
# 4. Test backend system
print("\n4. Backend System:")
test_backend_system()
print("\n" + "=" * 50)
print("🎯 OPTIMIZATION JOURNEY COMPLETE")
print("=" * 50)
print("\n📚 What You Learned:")
print("✅ Why your Module 2/4 loops were slow (but educational)")
print("✅ How cache-friendly blocking improves performance")
print("✅ Why NumPy is optimal for production (already has optimizations)")
print("✅ How to build transparent backend systems")
print("\n🎯 Key Message:")
print("• Educational loops: Perfect for understanding algorithms")
print("• Blocking: Teaches cache optimization principles")
print("• NumPy: Production choice with all optimizations built-in")
print("• Smart backends: Combine educational value with performance")
return educational_results
# %% [markdown]
"""
## Main Execution Block
Run all tests and demonstrations when this module is executed directly.
"""
# %%
if __name__ == "__main__":
print("Module 15: Hardware Acceleration and Kernel Optimization")
print("=" * 60)
print("THE OPTIMIZATION JOURNEY: From Educational Loops to NumPy")
# Run complete demonstration
results = run_complete_acceleration_demo()
print(f"\n🎉 Module 15 complete!")
print(f"⚡ You've learned the full optimization spectrum.")
print(f"🏗️ Ready to use NumPy optimally in production.")
# %% [markdown]
"""
## Systems Analysis Summary
This module demonstrates the fundamental principles of hardware acceleration in ML systems:
### 🏗️ **Architecture Principles**
- **Cache Hierarchy**: Understanding L1/L2/L3 cache and memory access costs
- **Vectorization**: Leveraging SIMD instructions for parallel computation
- **Memory Layout**: Contiguous access patterns for optimal performance
- **Backend Abstraction**: Transparent dispatch between naive and optimized implementations
### ⚡ **Optimization Techniques**
- **Blocked Algorithms**: Process data in cache-friendly blocks
- **Vectorized Operations**: Avoid Python loops, use NumPy's optimized routines
- **In-place Operations**: Minimize memory allocation overhead
- **Automatic Dispatch**: Choose optimal implementation based on problem size
### 📊 **Performance Understanding**
- **Measurement First**: Profile real bottlenecks before optimizing
- **Algorithmic Impact**: O(N³) → O(N²) matters more than 2x constant factors
- **Hardware Awareness**: CPU cache misses cost 100x more than cache hits
- **Library Utilization**: Optimized BLAS libraries beat custom implementations
### 🎯 **Real-World Applications**
- **ML Frameworks**: How PyTorch/TensorFlow apply these same principles
- **Production Systems**: Where optimization efforts provide real value
- **Development Practice**: When to optimize vs when to use existing solutions
### 💡 **Key Insights**
- Cache-friendly algorithms provide 2-5x speedups from memory access patterns alone
- Vectorization eliminates Python overhead for 10-100x improvements
- Most NumPy operations are already optimized - focus on system-level improvements
- Competition frameworks make optimization learning engaging and quantifiable
- Real ML systems face memory and communication bottlenecks, not pure computation limits
This approach teaches students to think like systems engineers: understand the hardware, measure scientifically, optimize systematically, and focus efforts where they matter most.
"""
# %% [markdown]
"""
## Main Execution Block
Run all tests and demonstrations when this module is executed directly.
"""
# %%
if __name__ == "__main__":
print("Module 15: Hardware Acceleration and Kernel Optimization")
print("=" * 60)
print("THE OPTIMIZATION JOURNEY: From Educational Loops to NumPy")
# Run complete demonstration
results = run_complete_acceleration_demo()
print(f"\n🎉 Module 15 complete!")
print(f"⚡ You've learned the full optimization spectrum.")
print(f"🏗️ Ready to use NumPy optimally in production.")
# %% [markdown]
"""
## 🤔 ML Systems Thinking: Interactive Questions
1. **Why are nested loops slow for large matrices?** Your educational loops from Module 2/4 access memory randomly, causing cache misses. Explain why accessing `b[l, j]` in the inner loop creates terrible cache performance, and why this gets exponentially worse as matrix size increases.
2. **How does blocking improve cache usage?** Your blocked implementation processes 64×64 blocks. Calculate the memory footprint of a 64×64 block (in KB) and explain why this fits well in L1/L2 cache. What happens if you use 256×256 blocks instead?
3. **Why use NumPy instead of custom optimizations?** You implemented blocking to understand cache optimization, but NumPy is still faster. List three optimizations that NumPy has built-in that your blocked implementation lacks, and explain why building these yourself isn't worth the effort.
4. **When should you optimize vs use libraries?** You've seen educational loops (1000x slower), blocking (10x slower), and NumPy (optimal). For each scenario, choose the right approach: (a) Learning algorithms, (b) Debugging matrix math, (c) Production training loop, (d) Custom operation not in NumPy. Justify your choices.
"""
# %% [markdown]
"""
## 🎯 MODULE SUMMARY: Hardware Acceleration and Kernel Optimization
This module completes the optimization journey from your Module 2/4 educational loops to production-ready NumPy usage, showing why understanding comes through building.
### 🛤️ **The Optimization Journey**
- **Module 2/4**: You implemented educational loops to understand matrix multiplication
- **Module 15**: You learned why loops are slow and how to optimize them systematically
- **End Goal**: You now use NumPy optimally, understanding what's happening under the hood
### 🛠️ **What We Built**
- **Educational Baseline**: Your triple-nested loops from earlier modules
- **Blocked Implementation**: Cache-friendly version showing 10x+ speedup over loops
- **NumPy Integration**: Production implementation using optimal libraries
- **Smart Backend**: System that chooses the right implementation transparently
### 🧠 **Key Learning Outcomes**
- **Why loops are slow**: Memory access patterns and cache hierarchy matter most
- **How blocking helps**: Processing data in cache-friendly chunks improves performance
- **When to use NumPy**: It already has these optimizations (and more) built-in
- **Systems thinking**: Understanding enables better decisions about when to optimize
### ⚡ **Performance Spectrum Demonstrated**
- **Educational loops**: Perfect for learning, terrible for performance (1000x slower)
- **Cache-friendly blocking**: Good educational optimization (10x faster than loops)
- **NumPy production**: Optimal performance with all optimizations built-in
### 🏆 **Practical Skills Developed**
- Analyze why educational implementations have poor performance
- Implement cache-friendly algorithms to understand optimization principles
- Choose NumPy for production while understanding what it's doing internally
- Build systems that balance educational value with performance requirements
### 📊 **Systems Insights Gained**
- **Educational code serves a purpose**: Understanding algorithms enables optimization intuition
- **Cache hierarchy dominates performance**: Memory access patterns matter more than computation
- **Libraries beat custom optimization**: NumPy already has expert-level optimizations
- **Understanding enables better tools**: You can build smarter systems when you know the principles
### 💡 **The Key Message**
You implemented loops to understand the algorithm. You implemented blocking to understand cache optimization. Now you use NumPy because it already has these (and more) optimizations built-in. Understanding the journey makes you a better ML systems engineer.
"""