# %% [markdown]
"""
# Module 15: Hardware Acceleration and Kernel Optimization

## Learning Objectives

By the end of this module, you will be able to:

1. **Understand Why Loops Are Slow**: Explain why your Module 2/4 triple-nested loops perform poorly
2. **Implement Cache-Friendly Blocking**: Build blocked matrix multiplication that exploits the CPU cache
3. **Recognize When to Use Libraries**: Understand when NumPy's optimizations beat custom code
4. **Build Transparent Backend Systems**: Create automatic switching between implementations

## The Optimization Journey

**Key Message**: You implemented loops to understand the algorithm. Now we'll optimize them to understand systems performance, then switch to NumPy because it already has these (and more) optimizations built in.

**The Journey:**

1. **Baseline**: Your loops from Module 2/4 (educational, slow)
2. **Blocking**: Cache-friendly version (educational, faster)
3. **NumPy**: Production version (optimal performance)
4. **Backend**: Smart switching system
"""

# %% [markdown]
"""
## Part 1: Baseline Implementation - Your Loops from Module 2/4

Let's start with the educational triple-nested loops you implemented earlier. These were perfect for learning but terrible for performance.
"""

# %%
#| default_exp core.acceleration

import time

import numpy as np

def educational_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """
    Educational matrix multiplication using triple nested loops.

    This is the same implementation from Module 2/4 - perfect for learning
    the algorithm, but very slow due to poor cache performance.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, f"Incompatible shapes: {a.shape} @ {b.shape}"

    # Initialize result matrix
    c = np.zeros((m, n), dtype=np.float32)

    # Triple nested loop - the educational implementation
    for i in range(m):
        for j in range(n):
            for l in range(k):
                c[i, j] += a[i, l] * b[l, j]

    return c

# %% [markdown]
"""
### Test Educational Implementation

Let's test our educational loops and see why they're slow.
"""

# %%
def test_educational_baseline():
    """Test educational implementation and measure its performance"""
    print("Testing Educational Implementation...")

    # Test correctness with small matrices
    a = np.array([[1, 2], [3, 4]], dtype=np.float32)
    b = np.array([[5, 6], [7, 8]], dtype=np.float32)

    result_educational = educational_matmul(a, b)
    result_numpy = a @ b
    assert np.allclose(result_educational, result_numpy), "Educational matmul incorrect"
    print("✅ Educational implementation produces correct results")

    # Performance comparison (small sizes only - educational is VERY slow)
    print("\nPerformance comparison:")
    small_a = np.random.randn(100, 100).astype(np.float32)
    small_b = np.random.randn(100, 100).astype(np.float32)

    # Time educational implementation
    start = time.perf_counter()
    _ = educational_matmul(small_a, small_b)
    educational_time = time.perf_counter() - start

    # Time NumPy implementation
    start = time.perf_counter()
    _ = small_a @ small_b
    numpy_time = time.perf_counter() - start

    speedup = educational_time / numpy_time
    print(f"Educational loops: {educational_time*1000:.1f} ms")
    print(f"NumPy optimized:   {numpy_time*1000:.1f} ms")
    print(f"NumPy is {speedup:.1f}x faster")

    print("✅ Educational baseline established")
    return educational_time, numpy_time, speedup

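# %% [markdown]
"""
### Aside: A Half-Step Toward NumPy - Vectorizing the Inner Loop

Before full blocking, it's worth seeing a smaller win. The sketch below is an illustration we've added (the function name is ours, not part of TinyTorch): replacing only the innermost loop with a NumPy row update turns m·n·k interpreted scalar operations into m·k vectorized row operations, and it accesses `b` contiguously. It typically lands somewhere between the pure loops and `a @ b` in speed.
"""

# %%
def rowwise_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Like educational_matmul, but the innermost loop is one vectorized row update."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, f"Incompatible shapes: {a.shape} @ {b.shape}"
    c = np.zeros((m, n), dtype=np.float32)
    for i in range(m):
        for l in range(k):
            # One scalar from a scales an entire contiguous row of b at once
            c[i, :] += a[i, l] * b[l, :]
    return c
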
# %% [markdown]
"""
## Part 2: Cache-Friendly Blocking - Your First Optimization

Now let's implement blocked matrix multiplication. This teaches you about the CPU cache hierarchy by processing data in blocks that fit in cache.
"""

# %%
def blocked_matmul(a: np.ndarray, b: np.ndarray, block_size: int = 64) -> np.ndarray:
    """
    Cache-friendly blocked matrix multiplication.

    This version processes data in blocks that fit in the CPU cache.
    Key insight: keep the working set small enough to fit in L1/L2 cache.

    Args:
        a: Left matrix (m × k)
        b: Right matrix (k × n)
        block_size: Size of cache-friendly blocks (typically 32-128)
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, f"Incompatible shapes: {a.shape} @ {b.shape}"

    # Initialize result
    c = np.zeros((m, n), dtype=np.float32)

    # Process in blocks to maximize cache utilization
    for i in range(0, m, block_size):
        for j in range(0, n, block_size):
            for l in range(0, k, block_size):
                # Define block boundaries
                i_end = min(i + block_size, m)
                j_end = min(j + block_size, n)
                l_end = min(l + block_size, k)

                # Extract blocks (these stay in cache)
                a_block = a[i:i_end, l:l_end]
                b_block = b[l:l_end, j:j_end]

                # Multiply blocks using NumPy (optimized BLAS)
                c[i:i_end, j:j_end] += a_block @ b_block

    return c

# %% [markdown]
"""
### Test Blocked Implementation

Let's see how much faster cache-friendly blocking is compared to the educational loops.
"""

# %%
def test_blocked_optimization():
    """Test blocked matrix multiplication performance"""
    print("Testing Blocked Matrix Multiplication...")

    # Test correctness
    a = np.random.randn(200, 200).astype(np.float32)
    b = np.random.randn(200, 200).astype(np.float32)

    result_blocked = blocked_matmul(a, b, block_size=64)
    result_numpy = a @ b

    assert np.allclose(result_blocked, result_numpy, atol=1e-3), "Blocked matmul incorrect"
    print("✅ Blocked implementation produces correct results")

    # Performance comparison
    print("\nPerformance comparison:")

    # Educational vs Blocked vs NumPy
    size = 200
    test_a = np.random.randn(size, size).astype(np.float32)
    test_b = np.random.randn(size, size).astype(np.float32)

    # Time educational on a 50x50 subset (the full size would take far too long)
    start = time.perf_counter()
    _ = educational_matmul(test_a[:50, :50], test_b[:50, :50])
    educational_time = time.perf_counter() - start
    # Extrapolate to the full size: matmul work grows as O(n^3)
    educational_time_scaled = educational_time * (size / 50) ** 3

    # Time blocked
    start = time.perf_counter()
    _ = blocked_matmul(test_a, test_b, block_size=64)
    blocked_time = time.perf_counter() - start

    # Time NumPy
    start = time.perf_counter()
    _ = test_a @ test_b
    numpy_time = time.perf_counter() - start

    print(f"Educational (est): {educational_time_scaled*1000:.1f} ms")
    print(f"Blocked:           {blocked_time*1000:.1f} ms")
    print(f"NumPy:             {numpy_time*1000:.1f} ms")

    speedup_blocked = educational_time_scaled / blocked_time
    speedup_numpy = educational_time_scaled / numpy_time

    print(f"\nBlocked is {speedup_blocked:.1f}x faster than educational")
    print(f"NumPy is {speedup_numpy:.1f}x faster than educational")

    print("✅ Blocked optimization tested successfully")
    return blocked_time, numpy_time

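# %% [markdown]
"""
### Aside: Picking a Block Size

The default `block_size=64` is a reasonable guess, not a law: a 64×64 float32 block is 64 × 64 × 4 bytes = 16 KB, so the three active blocks (from `a`, `b`, and `c`) fit comfortably alongside each other in a typical L1/L2 cache. The quick sweep below is an illustration we've added (the helper name is ours, not part of TinyTorch); it measures what your machine actually prefers.
"""

# %%
def sweep_block_sizes(size: int = 400, candidates=(16, 32, 64, 128, 256)):
    """Time blocked_matmul across several block sizes on random matrices."""
    a = np.random.randn(size, size).astype(np.float32)
    b = np.random.randn(size, size).astype(np.float32)
    for bs in candidates:
        start = time.perf_counter()
        _ = blocked_matmul(a, b, block_size=bs)
        elapsed = time.perf_counter() - start
        print(f"block_size={bs:4d}: {elapsed*1000:7.1f} ms")
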
# %% [markdown]
"""
## Part 3: NumPy Optimization - Production Performance

Now we'll switch to NumPy for production use. The key insight: NumPy already has these optimizations (and more) built in.
"""

# %%
def optimized_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """
    Production matrix multiplication using NumPy.

    This is what you should actually use in practice.
    NumPy already has blocking, vectorization, and BLAS optimizations built in.
    """
    return a @ b

# %% [markdown]
"""
### Test Production Implementation

Let's verify that NumPy is indeed the best choice for production.
"""

# %%
def test_production_performance():
    """Test that NumPy is indeed optimal for production use"""
    print("Testing Production Performance...")

    # Test different sizes
    sizes = [200, 500, 800]

    print("\nPerformance comparison across the optimization spectrum:")

    for size in sizes:
        print(f"\nMatrix size: {size}x{size}")
        a = np.random.randn(size, size).astype(np.float32)
        b = np.random.randn(size, size).astype(np.float32)

        # Time blocked implementation
        start = time.perf_counter()
        _ = blocked_matmul(a, b, block_size=64)
        blocked_time = time.perf_counter() - start

        # Time NumPy implementation
        start = time.perf_counter()
        _ = optimized_matmul(a, b)
        numpy_time = time.perf_counter() - start

        speedup = blocked_time / numpy_time
        print(f"Blocked: {blocked_time*1000:6.1f} ms")
        print(f"NumPy:   {numpy_time*1000:6.1f} ms")
        print(f"NumPy is {speedup:.1f}x faster than blocked")

    print("\n💡 Key Insight: NumPy already has these optimizations built in!")
    print("   • Blocking algorithms")
    print("   • Vectorization")
    print("   • Hardware-specific BLAS libraries")
    print("   • Assembly-level optimizations")

    print("\n✅ Production performance verified")
    return True

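# %% [markdown]
"""
### Aside: Which BLAS Is Doing the Work?

The "assembly-level optimizations" live in whichever BLAS library your NumPy build links against (OpenBLAS, MKL, Accelerate, ...). As a quick check - an addition for illustration, not part of the original module - NumPy can print its own build configuration:
"""

# %%
# Prints NumPy's build/link configuration, including the BLAS/LAPACK backend in use.
np.show_config()
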
# %% [markdown]
"""
## Part 4: Backend System - Transparent Switching

Now let's build a system that automatically chooses the right implementation.
"""

# %%
class OptimizedBackend:
    """Backend that automatically uses the best implementation"""

    def matmul(self, a: np.ndarray, b: np.ndarray) -> np.ndarray:
        """Matrix multiplication using NumPy (best for production)"""
        return optimized_matmul(a, b)

# Global backend instance
_backend = OptimizedBackend()

def matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Matrix multiplication using the current backend"""
    return _backend.matmul(a, b)

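# %% [markdown]
"""
### Aside: What Size-Based Dispatch Could Look Like

`OptimizedBackend` always forwards to NumPy, which is the right production default. The variant below is hypothetical (the class name and threshold are our assumptions, not part of TinyTorch): it keeps the educational loops for tiny inputs - say, when you want to step through the algorithm in a debugger - and dispatches to NumPy for everything else.
"""

# %%
class DispatchingBackend:
    """Hypothetical backend: educational loops for tiny matrices, NumPy otherwise."""

    def __init__(self, loop_threshold: int = 8):
        # Below this dimension the readable triple loop is cheap enough to step through
        self.loop_threshold = loop_threshold

    def matmul(self, a: np.ndarray, b: np.ndarray) -> np.ndarray:
        if max(a.shape[0], a.shape[1], b.shape[1]) <= self.loop_threshold:
            return educational_matmul(a, b)  # easy to trace and debug
        return optimized_matmul(a, b)  # production path
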
# %% [markdown]
"""
### Test Backend System

Let's verify our backend system works correctly and uses optimal implementations.
"""

# %%
def test_backend_system():
    """Test the backend system"""
    print("Testing Backend System...")

    # Test matrices
    a = np.random.randn(100, 100).astype(np.float32)
    b = np.random.randn(100, 100).astype(np.float32)

    # Test that our backend works
    result = matmul(a, b)
    expected = a @ b

    assert np.allclose(result, expected), "Backend matmul incorrect"
    print("✅ Backend produces correct results")

    # Compare performance
    start = time.perf_counter()
    _ = matmul(a, b)
    backend_time = time.perf_counter() - start

    start = time.perf_counter()
    _ = a @ b
    numpy_time = time.perf_counter() - start

    print("\nPerformance comparison:")
    print(f"Backend: {backend_time*1000:.1f} ms")
    print(f"NumPy:   {numpy_time*1000:.1f} ms")
    print("Backend uses the optimal NumPy implementation")

    print("\n✅ Backend system works correctly")
    return True

# %% [markdown]
"""
## Comprehensive Testing

Let's run all our components together to see the complete optimization journey.
"""

# %%
def run_complete_acceleration_demo():
    """Run the complete acceleration demonstration"""
    print("🚀 Complete Acceleration Module Demo")
    print("=" * 50)
    print("THE OPTIMIZATION JOURNEY: From Loops to NumPy")

    # 1. Test educational baseline
    print("\n1. Educational Baseline (your Module 2/4 loops):")
    educational_results = test_educational_baseline()

    # 2. Test blocked optimization
    print("\n2. Cache-Friendly Blocking:")
    test_blocked_optimization()

    # 3. Test production performance
    print("\n3. Production Performance (NumPy):")
    test_production_performance()

    # 4. Test backend system
    print("\n4. Backend System:")
    test_backend_system()

    print("\n" + "=" * 50)
    print("🎯 OPTIMIZATION JOURNEY COMPLETE")
    print("=" * 50)

    print("\n📚 What You Learned:")
    print("✅ Why your Module 2/4 loops were slow (but educational)")
    print("✅ How cache-friendly blocking improves performance")
    print("✅ Why NumPy is optimal for production (optimizations already built in)")
    print("✅ How to build transparent backend systems")

    print("\n🎯 Key Message:")
    print("• Educational loops: Perfect for understanding algorithms")
    print("• Blocking: Teaches cache optimization principles")
    print("• NumPy: Production choice with all optimizations built in")
    print("• Smart backends: Combine educational value with performance")

    return educational_results

# %% [markdown]
"""
## Main Execution Block

Run all tests and demonstrations when this module is executed directly.
"""

# %%
if __name__ == "__main__":
    print("Module 15: Hardware Acceleration and Kernel Optimization")
    print("=" * 60)
    print("THE OPTIMIZATION JOURNEY: From Educational Loops to NumPy")

    # Run complete demonstration
    results = run_complete_acceleration_demo()

    print("\n🎉 Module 15 complete!")
    print("⚡ You've learned the full optimization spectrum.")
    print("🏗️ Ready to use NumPy optimally in production.")

# %% [markdown]
"""
## Systems Analysis Summary

This module demonstrates the fundamental principles of hardware acceleration in ML systems:

### 🏗️ **Architecture Principles**
- **Cache Hierarchy**: Understanding L1/L2/L3 caches and memory access costs
- **Vectorization**: Leveraging SIMD instructions for parallel computation
- **Memory Layout**: Contiguous access patterns for optimal performance
- **Backend Abstraction**: Transparent dispatch between naive and optimized implementations

### ⚡ **Optimization Techniques**
- **Blocked Algorithms**: Process data in cache-friendly blocks
- **Vectorized Operations**: Avoid Python loops; use NumPy's optimized routines (see the sketch after this summary)
- **In-place Operations**: Minimize memory allocation overhead (see the sketch after this summary)
- **Automatic Dispatch**: Choose the optimal implementation based on problem size (as sketched in Part 4)

### 📊 **Performance Understanding**
- **Measurement First**: Profile real bottlenecks before optimizing
- **Algorithmic Impact**: An O(N³) → O(N²) improvement matters more than a 2x constant factor
- **Hardware Awareness**: A CPU cache miss can cost on the order of 100x more than a cache hit
- **Library Utilization**: Optimized BLAS libraries beat custom implementations

### 🎯 **Real-World Applications**
- **ML Frameworks**: How PyTorch/TensorFlow apply these same principles
- **Production Systems**: Where optimization efforts provide real value
- **Development Practice**: When to optimize vs. when to use existing solutions

### 💡 **Key Insights**
- Cache-friendly algorithms provide 2-5x speedups from memory access patterns alone
- Vectorization eliminates Python overhead for 10-100x improvements
- Most NumPy operations are already optimized - focus on system-level improvements
- Competition frameworks make optimization learning engaging and quantifiable
- Real ML systems face memory and communication bottlenecks, not pure computation limits

This approach teaches students to think like systems engineers: understand the hardware, measure scientifically, optimize systematically, and focus effort where it matters most.
"""

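# %% [markdown]
"""
### Aside: Vectorization and In-place Operations

A minimal sketch of two techniques named in the summary above - an illustration we've added, with hypothetical helper names. `np.multiply(..., out=...)` writes into a preallocated buffer, so the hot path allocates no temporaries.
"""

# %%
def scale_rows_loop(x: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Python-loop version: one interpreted multiply per row."""
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        out[i] = x[i] * scales[i]
    return out

def scale_rows_vectorized(x: np.ndarray, scales: np.ndarray, out: np.ndarray) -> np.ndarray:
    """Vectorized and in-place: one broadcasted multiply into a reused buffer."""
    np.multiply(x, scales[:, None], out=out)  # no temporary arrays allocated
    return out
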
# %% [markdown]
"""
## 🤔 ML Systems Thinking: Interactive Questions

1. **Why are nested loops slow for large matrices?** Your educational loops from Module 2/4 access memory in a cache-unfriendly order. Explain why accessing `b[l, j]` in the inner loop strides down a column, touching a new cache line on nearly every iteration, and why performance degrades sharply once the matrices no longer fit in cache.

2. **How does blocking improve cache usage?** Your blocked implementation processes 64×64 blocks. Calculate the memory footprint of a 64×64 float32 block (in KB) and explain why this fits well in L1/L2 cache. What happens if you use 256×256 blocks instead?

3. **Why use NumPy instead of custom optimizations?** You implemented blocking to understand cache optimization, but NumPy is still faster. List three optimizations that NumPy has built in that your blocked implementation lacks, and explain why building these yourself isn't worth the effort.

4. **When should you optimize vs. use libraries?** You've seen educational loops (roughly 1000x slower), blocking (roughly 10x slower), and NumPy (optimal). For each scenario, choose the right approach: (a) learning algorithms, (b) debugging matrix math, (c) a production training loop, (d) a custom operation not in NumPy. Justify your choices.
"""

# %% [markdown]
"""
## 🎯 MODULE SUMMARY: Hardware Acceleration and Kernel Optimization

This module completes the optimization journey from your Module 2/4 educational loops to production-ready NumPy usage, showing why understanding comes through building.

### 🛤️ **The Optimization Journey**
- **Module 2/4**: You implemented educational loops to understand matrix multiplication
- **Module 15**: You learned why loops are slow and how to optimize them systematically
- **End Goal**: You now use NumPy optimally, understanding what's happening under the hood

### 🛠️ **What We Built**
- **Educational Baseline**: Your triple-nested loops from earlier modules
- **Blocked Implementation**: Cache-friendly version showing a 10x+ speedup over loops
- **NumPy Integration**: Production implementation using optimal libraries
- **Smart Backend**: System that chooses the right implementation transparently

### 🧠 **Key Learning Outcomes**
- **Why loops are slow**: Memory access patterns and the cache hierarchy matter most
- **How blocking helps**: Processing data in cache-friendly chunks improves performance
- **When to use NumPy**: It already has these optimizations (and more) built in
- **Systems thinking**: Understanding enables better decisions about when to optimize

### ⚡ **Performance Spectrum Demonstrated**
- **Educational loops**: Perfect for learning, terrible for performance (roughly 1000x slower)
- **Cache-friendly blocking**: Good educational optimization (roughly 10x faster than loops)
- **NumPy production**: Optimal performance with all optimizations built in

### 🏆 **Practical Skills Developed**
- Analyze why educational implementations have poor performance
- Implement cache-friendly algorithms to understand optimization principles
- Choose NumPy for production while understanding what it's doing internally
- Build systems that balance educational value with performance requirements

### 📊 **Systems Insights Gained**
- **Educational code serves a purpose**: Understanding algorithms builds optimization intuition
- **Cache hierarchy dominates performance**: Memory access patterns matter more than raw computation
- **Libraries beat custom optimization**: NumPy already has expert-level optimizations
- **Understanding enables better tools**: You can build smarter systems when you know the principles

### 💡 **The Key Message**
You implemented loops to understand the algorithm. You implemented blocking to understand cache optimization. Now you use NumPy because it already has these (and more) optimizations built in. Understanding the journey makes you a better ML systems engineer.
"""