TinyTorch/modules/15_acceleration/acceleration_dev.py
Vijay Janapa Reddi 2f23f757e7 MAJOR: Implement beautiful module progression through strategic reordering
This commit implements the pedagogically optimal "inevitable discovery" module progression based on expert validation and educational design principles.

## Module Reordering Summary

**Previous Order (Problems)**:
- 05_losses → 06_autograd → 07_dataloader → 08_optimizers → 09_spatial → 10_training
- Issues: Autograd before optimizers, DataLoader before training, scattered dependencies

**New Order (Beautiful Progression)**:
- 05_losses → 06_optimizers → 07_autograd → 08_training → 09_spatial → 10_dataloader
- Benefits: Each module creates inevitable need for the next

## Pedagogical Flow Achieved

- **05_losses** → "Need systematic weight updates" → **06_optimizers**
- **06_optimizers** → "Need automatic gradients" → **07_autograd**
- **07_autograd** → "Need systematic training" → **08_training**
- **08_training** → "MLPs hit limits on images" → **09_spatial**
- **09_spatial** → "Training is too slow" → **10_dataloader**

## Technical Changes

### Module Directory Renaming
- `06_autograd` → `07_autograd`
- `07_dataloader` → `10_dataloader`
- `08_optimizers` → `06_optimizers`
- `10_training` → `08_training`
- `09_spatial` → `09_spatial` (no change)

### System Integration Updates
- **MODULE_TO_CHECKPOINT mapping**: Updated in tito/commands/export.py
- **Test directories**: Renamed module_XX directories to match new numbers
- **Documentation**: Updated all references in MD files and agent configurations
- **CLI integration**: Updated next-steps suggestions for proper flow

### Agent Configuration Updates
- **Quality Assurance**: Updated module audit status with new numbers
- **Module Developer**: Updated work tracking with new sequence
- **Documentation**: Updated MASTER_PLAN_OF_RECORD.md with beautiful progression

## Educational Benefits

1. **Inevitable Discovery**: Each module naturally leads to the next
2. **Cognitive Load**: Concepts introduced exactly when needed
3. **Motivation**: Students understand WHY each tool is necessary
4. **Synthesis**: Everything flows toward complete ML systems understanding
5. **Professional Alignment**: Matches real ML engineering workflows

## Quality Assurance

- ✅ All CLI commands still function
- ✅ Checkpoint system mappings updated
- ✅ Documentation consistency maintained
- ✅ Test directory structure aligned
- ✅ Agent configurations synchronized

**Impact**: This reordering transforms TinyTorch from a collection of modules into a coherent educational journey where each step naturally motivates the next, creating optimal conditions for deep learning systems understanding.
# %% [markdown]
"""
# Module 15: Hardware Acceleration and Kernel Optimization

## Learning Objectives

By the end of this module, you will be able to:

1. **Understand Why Loops Are Slow**: See why your Module 2/4 loops have poor performance
2. **Implement Cache-Friendly Blocking**: Build blocked matrix multiplication that leverages CPU cache
3. **Recognize When to Use Libraries**: Understand when NumPy optimizations beat custom code
4. **Build Transparent Backend Systems**: Create automatic switching between implementations

## The Optimization Journey

**Key Message**: You implemented loops to understand the algorithm. Now we'll optimize them to understand systems performance, then switch to NumPy because it already has these (and more) optimizations built-in.

**The Journey:**

1. **Baseline**: Your loops from Module 2/4 (educational, slow)
2. **Blocking**: Cache-friendly version (educational, faster)
3. **NumPy**: Production version (optimal performance)
4. **Backend**: Smart switching system
"""
# %% [markdown]
"""
## Part 1: Baseline Implementation - Your Loops from Module 2/4
Let's start with the educational triple-nested loops you implemented earlier. These were perfect for learning but terrible for performance.
"""
# %%
#| default_exp core.acceleration

import time

import numpy as np


def educational_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """
    Educational matrix multiplication using triple nested loops.

    This is the same implementation from Module 2/4 - perfect for learning
    the algorithm, but very slow due to poor cache performance.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, f"Incompatible shapes: {a.shape} @ {b.shape}"

    # Initialize result matrix
    c = np.zeros((m, n), dtype=np.float32)

    # Triple nested loop - the educational implementation
    for i in range(m):
        for j in range(n):
            for l in range(k):
                c[i, j] += a[i, l] * b[l, j]
    return c
# %% [markdown]
"""
### Test Educational Implementation
Let's test our educational loops and see why they're slow.
"""
# %%
def test_educational_baseline():
    """Test educational implementation and measure its performance"""
    print("Testing Educational Implementation...")

    # Test correctness with small matrices
    a = np.array([[1, 2], [3, 4]], dtype=np.float32)
    b = np.array([[5, 6], [7, 8]], dtype=np.float32)
    result_educational = educational_matmul(a, b)
    result_numpy = a @ b
    assert np.allclose(result_educational, result_numpy), "Educational matmul incorrect"
    print("✅ Educational implementation produces correct results")

    # Performance comparison (small sizes only - educational is VERY slow)
    print("\nPerformance comparison:")
    small_a = np.random.randn(100, 100).astype(np.float32)
    small_b = np.random.randn(100, 100).astype(np.float32)

    # Time educational implementation
    start = time.perf_counter()
    _ = educational_matmul(small_a, small_b)
    educational_time = time.perf_counter() - start

    # Time NumPy implementation
    start = time.perf_counter()
    _ = small_a @ small_b
    numpy_time = time.perf_counter() - start

    speedup = educational_time / numpy_time
    print(f"Educational loops: {educational_time*1000:.1f} ms")
    print(f"NumPy optimized:   {numpy_time*1000:.1f} ms")
    print(f"NumPy is {speedup:.1f}x faster")
    print("✅ Educational baseline established")
    return educational_time, numpy_time, speedup
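# %% [markdown]
"""
### Aside: More Robust Timing
The single-shot timings above are sensitive to scheduler noise and first-call overhead. A common improvement is to repeat a measurement and keep the minimum, since noise only ever adds time. Below is a minimal sketch of such a helper (`time_best_of` is an illustrative name, not part of TinyTorch); the tests in this module stick with single-shot timing for simplicity.
"""
# %%
def time_best_of(fn, *args, repeats: int = 5) -> float:
    """Return the best (minimum) wall-clock time over several runs.

    Illustrative helper: the minimum estimates intrinsic cost better
    than the mean, because system noise only ever adds time.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best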
# %% [markdown]
"""
## Part 2: Cache-Friendly Blocking - Your First Optimization
Now let's implement blocked matrix multiplication. This teaches you about CPU cache hierarchy by processing data in blocks that fit in cache.
"""
# %%
def blocked_matmul(a: np.ndarray, b: np.ndarray, block_size: int = 64) -> np.ndarray:
    """
    Cache-friendly blocked matrix multiplication.

    This version processes data in blocks that fit in CPU cache.
    Key insight: Keep the working set small enough to fit in L1/L2 cache.

    Args:
        a: Left matrix (m × k)
        b: Right matrix (k × n)
        block_size: Size of cache-friendly blocks (typically 32-128)
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, f"Incompatible shapes: {a.shape} @ {b.shape}"

    # Initialize result
    c = np.zeros((m, n), dtype=np.float32)

    # Process in blocks to maximize cache utilization
    for i in range(0, m, block_size):
        for j in range(0, n, block_size):
            for l in range(0, k, block_size):
                # Define block boundaries
                i_end = min(i + block_size, m)
                j_end = min(j + block_size, n)
                l_end = min(l + block_size, k)

                # Extract blocks (these stay in cache)
                a_block = a[i:i_end, l:l_end]
                b_block = b[l:l_end, j:j_end]

                # Multiply blocks using NumPy (optimized BLAS)
                c[i:i_end, j:j_end] += a_block @ b_block
    return c
# %% [markdown]
"""
### Test Blocked Implementation
Let's see how much faster cache-friendly blocking is compared to educational loops.
"""
# %%
def test_blocked_optimization():
    """Test blocked matrix multiplication performance"""
    print("Testing Blocked Matrix Multiplication...")

    # Test correctness
    a = np.random.randn(200, 200).astype(np.float32)
    b = np.random.randn(200, 200).astype(np.float32)
    result_blocked = blocked_matmul(a, b, block_size=64)
    result_numpy = a @ b
    assert np.allclose(result_blocked, result_numpy, atol=1e-3), "Blocked matmul incorrect"
    print("✅ Blocked implementation produces correct results")

    # Performance comparison: educational vs blocked vs NumPy
    print("\nPerformance comparison:")
    size = 200
    test_a = np.random.randn(size, size).astype(np.float32)
    test_b = np.random.randn(size, size).astype(np.float32)

    # Time educational loops on a smaller subset to avoid waiting forever,
    # then scale the estimate by the O(n^3) cost ratio
    start = time.perf_counter()
    _ = educational_matmul(test_a[:50, :50], test_b[:50, :50])
    educational_time = time.perf_counter() - start
    educational_time_scaled = educational_time * (size / 50) ** 3

    # Time blocked
    start = time.perf_counter()
    _ = blocked_matmul(test_a, test_b, block_size=64)
    blocked_time = time.perf_counter() - start

    # Time NumPy
    start = time.perf_counter()
    _ = test_a @ test_b
    numpy_time = time.perf_counter() - start

    print(f"Educational (est): {educational_time_scaled*1000:.1f} ms")
    print(f"Blocked:           {blocked_time*1000:.1f} ms")
    print(f"NumPy:             {numpy_time*1000:.1f} ms")

    speedup_blocked = educational_time_scaled / blocked_time
    speedup_numpy = educational_time_scaled / numpy_time
    print(f"\nBlocked is {speedup_blocked:.1f}x faster than educational")
    print(f"NumPy is {speedup_numpy:.1f}x faster than educational")
    print("✅ Blocked optimization tested successfully")
    return blocked_time, numpy_time
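# %% [markdown]
"""
### Exploring Block Size
`block_size` is the cache-tuning knob of `blocked_matmul`. The quick sweep below is a hedged sketch (`sweep_block_sizes` is an illustrative helper, and the best value depends on your CPU's cache sizes), but the shape of the trade-off is general: very small blocks pay Python loop overhead, very large blocks spill out of cache.
"""
# %%
def sweep_block_sizes(size: int = 400) -> None:
    """Time blocked_matmul across several block sizes (results are machine-specific)."""
    a = np.random.randn(size, size).astype(np.float32)
    b = np.random.randn(size, size).astype(np.float32)
    for block_size in (16, 32, 64, 128, 256):
        start = time.perf_counter()
        _ = blocked_matmul(a, b, block_size=block_size)
        elapsed = time.perf_counter() - start
        print(f"block_size={block_size:4d}: {elapsed*1000:7.1f} ms")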
# %% [markdown]
"""
## Part 3: NumPy Optimization - Production Performance
Now we'll switch to NumPy for production use. The key insight: NumPy already has these optimizations (and more) built-in.
"""
# %%
def optimized_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """
    Production matrix multiplication using NumPy.

    This is what you should actually use in practice.
    NumPy already has blocking, vectorization, and BLAS optimizations built-in.
    """
    return a @ b
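# %% [markdown]
"""
Curious which optimized library your NumPy build delegates to? `np.show_config()` prints the build configuration, including the linked BLAS/LAPACK implementation (OpenBLAS, MKL, or Accelerate, depending on your installation). The exact output varies by platform.
"""
# %%
# Inspect which BLAS/LAPACK library backs `a @ b` on this machine.
# Output differs between NumPy installations.
np.show_config()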
# %% [markdown]
"""
### Test Production Implementation
Let's verify that NumPy is indeed the best choice for production.
"""
# %%
def test_production_performance():
    """Test that NumPy is indeed optimal for production use"""
    print("Testing Production Performance...")

    # Test different sizes
    sizes = [200, 500, 800]
    print("\nPerformance comparison across the optimization spectrum:")
    for size in sizes:
        print(f"\nMatrix size: {size}x{size}")
        a = np.random.randn(size, size).astype(np.float32)
        b = np.random.randn(size, size).astype(np.float32)

        # Time blocked implementation
        start = time.perf_counter()
        _ = blocked_matmul(a, b, block_size=64)
        blocked_time = time.perf_counter() - start

        # Time NumPy implementation
        start = time.perf_counter()
        _ = optimized_matmul(a, b)
        numpy_time = time.perf_counter() - start

        speedup = blocked_time / numpy_time
        print(f"Blocked: {blocked_time*1000:6.1f} ms")
        print(f"NumPy:   {numpy_time*1000:6.1f} ms")
        print(f"NumPy is {speedup:.1f}x faster than blocked")

    print("\n💡 Key Insight: NumPy already has these optimizations built-in!")
    print("   • Blocking algorithms")
    print("   • Vectorization")
    print("   • Hardware-specific BLAS libraries")
    print("   • Assembly-level optimizations")
    print("\n✅ Production performance verified")
    return True
# %% [markdown]
"""
## Part 4: Backend System - Transparent Switching
Now let's build a system that automatically chooses the right implementation.
"""
# %%
class OptimizedBackend:
    """Backend that automatically uses the best implementation"""

    def matmul(self, a: np.ndarray, b: np.ndarray) -> np.ndarray:
        """Matrix multiplication using NumPy (best for production)"""
        return optimized_matmul(a, b)


# Global backend instance
_backend = OptimizedBackend()


def matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Matrix multiplication using the current backend"""
    return _backend.matmul(a, b)
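# %% [markdown]
"""
The backend above always delegates to NumPy, which is the right production default. To make "automatic switching" concrete, here is a hedged sketch of a size-based dispatcher (`ThresholdBackend` is an illustrative name, not TinyTorch API): tiny matrices go through the slow but inspectable educational path, everything else uses NumPy. Real frameworks dispatch on device and dtype as well, not just size.
"""
# %%
class ThresholdBackend:
    """Illustrative backend sketch: dispatch on problem size."""

    def __init__(self, debug_threshold: int = 8):
        # Matrices at or below this dimension use the slow, steppable path.
        self.debug_threshold = debug_threshold

    def matmul(self, a: np.ndarray, b: np.ndarray) -> np.ndarray:
        if max(a.shape + b.shape) <= self.debug_threshold:
            return educational_matmul(a, b)  # easy to step through in a debugger
        return optimized_matmul(a, b)  # production path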
# %% [markdown]
"""
### Test Backend System
Let's verify our backend system works correctly and uses optimal implementations.
"""
# %%
def test_backend_system():
    """Test the backend system"""
    print("Testing Backend System...")

    # Test matrices
    a = np.random.randn(100, 100).astype(np.float32)
    b = np.random.randn(100, 100).astype(np.float32)

    # Test that our backend works
    result = matmul(a, b)
    expected = a @ b
    assert np.allclose(result, expected), "Backend matmul incorrect"
    print("✅ Backend produces correct results")

    # Compare performance
    start = time.perf_counter()
    _ = matmul(a, b)
    backend_time = time.perf_counter() - start

    start = time.perf_counter()
    _ = a @ b
    numpy_time = time.perf_counter() - start

    print("\nPerformance comparison:")
    print(f"Backend: {backend_time*1000:.1f} ms")
    print(f"NumPy:   {numpy_time*1000:.1f} ms")
    print("Backend uses the optimal NumPy implementation")
    print("\n✅ Backend system works correctly")
    return True
# %% [markdown]
"""
## Comprehensive Testing
Let's run all our components together to see the complete optimization journey.
"""
# %%
def run_complete_acceleration_demo():
    """Run the complete acceleration demonstration"""
    print("🚀 Complete Acceleration Module Demo")
    print("=" * 50)
    print("THE OPTIMIZATION JOURNEY: From Loops to NumPy")

    # 1. Test educational baseline
    print("\n1. Educational Baseline (your Module 2/4 loops):")
    educational_results = test_educational_baseline()

    # 2. Test blocked optimization
    print("\n2. Cache-Friendly Blocking:")
    test_blocked_optimization()

    # 3. Test production performance
    print("\n3. Production Performance (NumPy):")
    test_production_performance()

    # 4. Test backend system
    print("\n4. Backend System:")
    test_backend_system()

    print("\n" + "=" * 50)
    print("🎯 OPTIMIZATION JOURNEY COMPLETE")
    print("=" * 50)
    print("\n📚 What You Learned:")
    print("✅ Why your Module 2/4 loops were slow (but educational)")
    print("✅ How cache-friendly blocking improves performance")
    print("✅ Why NumPy is optimal for production (already has optimizations)")
    print("✅ How to build transparent backend systems")
    print("\n🎯 Key Message:")
    print("• Educational loops: Perfect for understanding algorithms")
    print("• Blocking: Teaches cache optimization principles")
    print("• NumPy: Production choice with all optimizations built-in")
    print("• Smart backends: Combine educational value with performance")
    return educational_results
# %% [markdown]
"""
## Main Execution Block
Run all tests and demonstrations when this module is executed directly.
"""
# %%
if __name__ == "__main__":
    print("Module 15: Hardware Acceleration and Kernel Optimization")
    print("=" * 60)
    print("THE OPTIMIZATION JOURNEY: From Educational Loops to NumPy")

    # Run complete demonstration
    results = run_complete_acceleration_demo()

    print("\n🎉 Module 15 complete!")
    print("⚡ You've learned the full optimization spectrum.")
    print("🏗️ Ready to use NumPy optimally in production.")
# %% [markdown]
"""
## Systems Analysis Summary
This module demonstrates the fundamental principles of hardware acceleration in ML systems:
### 🏗️ **Architecture Principles**
- **Cache Hierarchy**: Understanding L1/L2/L3 cache and memory access costs
- **Vectorization**: Leveraging SIMD instructions for parallel computation
- **Memory Layout**: Contiguous access patterns for optimal performance
- **Backend Abstraction**: Transparent dispatch between naive and optimized implementations
### ⚡ **Optimization Techniques**
- **Blocked Algorithms**: Process data in cache-friendly blocks
- **Vectorized Operations**: Avoid Python loops, use NumPy's optimized routines
- **In-place Operations**: Minimize memory allocation overhead (see the sketch after this summary)
- **Automatic Dispatch**: Choose optimal implementation based on problem size
### 📊 **Performance Understanding**
- **Measurement First**: Profile real bottlenecks before optimizing
- **Algorithmic Impact**: O(N³) → O(N²) matters more than 2x constant factors
- **Hardware Awareness**: CPU cache misses cost 100x more than cache hits
- **Library Utilization**: Optimized BLAS libraries beat custom implementations
### 🎯 **Real-World Applications**
- **ML Frameworks**: How PyTorch/TensorFlow apply these same principles
- **Production Systems**: Where optimization efforts provide real value
- **Development Practice**: When to optimize vs when to use existing solutions
### 💡 **Key Insights**
- Cache-friendly algorithms provide 2-5x speedups from memory access patterns alone
- Vectorization eliminates Python overhead for 10-100x improvements
- Most NumPy operations are already optimized - focus on system-level improvements
- Competition frameworks make optimization learning engaging and quantifiable
- Real ML systems face memory and communication bottlenecks, not pure computation limits
This approach teaches students to think like systems engineers: understand the hardware, measure scientifically, optimize systematically, and focus efforts where they matter most.
"""
# %% [markdown]
"""
## 🤔 ML Systems Thinking: Interactive Questions
1. **Why are nested loops slow for large matrices?** Your educational loops from Module 2/4 access memory with poor locality, causing cache misses. Explain why accessing `b[l, j]` in the inner loop creates terrible cache performance (each step of `l` jumps an entire row ahead in memory), and why the penalty grows sharply once the matrices no longer fit in cache.
2. **How does blocking improve cache usage?** Your blocked implementation processes 64×64 blocks. Calculate the memory footprint of a 64×64 block (in KB) and explain why this fits well in L1/L2 cache. What happens if you use 256×256 blocks instead?
3. **Why use NumPy instead of custom optimizations?** You implemented blocking to understand cache optimization, but NumPy is still faster. List three optimizations that NumPy has built-in that your blocked implementation lacks, and explain why building these yourself isn't worth the effort.
4. **When should you optimize vs use libraries?** You've seen educational loops (1000x slower), blocking (10x slower), and NumPy (optimal). For each scenario, choose the right approach: (a) Learning algorithms, (b) Debugging matrix math, (c) Production training loop, (d) Custom operation not in NumPy. Justify your choices.
"""
# %% [markdown]
"""
## 🎯 MODULE SUMMARY: Hardware Acceleration and Kernel Optimization
This module completes the optimization journey from your Module 2/4 educational loops to production-ready NumPy usage, showing why understanding comes through building.
### 🛤️ **The Optimization Journey**
- **Module 2/4**: You implemented educational loops to understand matrix multiplication
- **Module 15**: You learned why loops are slow and how to optimize them systematically
- **End Goal**: You now use NumPy optimally, understanding what's happening under the hood
### 🛠️ **What We Built**
- **Educational Baseline**: Your triple-nested loops from earlier modules
- **Blocked Implementation**: Cache-friendly version showing 10x+ speedup over loops
- **NumPy Integration**: Production implementation using optimal libraries
- **Smart Backend**: System that chooses the right implementation transparently
### 🧠 **Key Learning Outcomes**
- **Why loops are slow**: Memory access patterns and cache hierarchy matter most
- **How blocking helps**: Processing data in cache-friendly chunks improves performance
- **When to use NumPy**: It already has these optimizations (and more) built-in
- **Systems thinking**: Understanding enables better decisions about when to optimize
### ⚡ **Performance Spectrum Demonstrated**
- **Educational loops**: Perfect for learning, terrible for performance (1000x slower)
- **Cache-friendly blocking**: Good educational optimization (10x faster than loops)
- **NumPy production**: Optimal performance with all optimizations built-in
### 🏆 **Practical Skills Developed**
- Analyze why educational implementations have poor performance
- Implement cache-friendly algorithms to understand optimization principles
- Choose NumPy for production while understanding what it's doing internally
- Build systems that balance educational value with performance requirements
### 📊 **Systems Insights Gained**
- **Educational code serves a purpose**: Understanding algorithms enables optimization intuition
- **Cache hierarchy dominates performance**: Memory access patterns matter more than computation
- **Libraries beat custom optimization**: NumPy already has expert-level optimizations
- **Understanding enables better tools**: You can build smarter systems when you know the principles
### 💡 **The Key Message**
You implemented loops to understand the algorithm. You implemented blocking to understand cache optimization. Now you use NumPy because it already has these (and more) optimizations built-in. Understanding the journey makes you a better ML systems engineer.
"""