# %% [markdown]
"""
# Module 15: Hardware Acceleration and Kernel Optimization

## Learning Objectives

By the end of this module, you will be able to:

1. **Understand Why Loops Are Slow**: See why your Module 2/4 loops have poor performance
2. **Implement Cache-Friendly Blocking**: Build blocked matrix multiplication that leverages CPU cache
3. **Recognize When to Use Libraries**: Understand when NumPy optimizations beat custom code
4. **Build Transparent Backend Systems**: Create automatic switching between implementations

## The Optimization Journey

**Key Message**: You implemented loops to understand the algorithm. Now we'll optimize them to understand systems performance, then switch to NumPy because it already has these (and more) optimizations built-in.

**The Journey:**

1. **Baseline**: Your loops from Module 2/4 (educational, slow)
2. **Blocking**: Cache-friendly version (educational, faster)
3. **NumPy**: Production version (optimal performance)
4. **Backend**: Smart switching system
"""

# %% [markdown]
"""
## Part 1: Baseline Implementation - Your Loops from Module 2/4

Let's start with the educational triple-nested loops you implemented earlier. These were perfect for learning but terrible for performance.
"""

# %%
#| default_exp core.acceleration

import time
import numpy as np


def educational_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """
    Educational matrix multiplication using triple nested loops.

    This is the same implementation from Module 2/4 - perfect for learning
    the algorithm, but very slow due to poor cache performance.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, f"Incompatible shapes: {a.shape} @ {b.shape}"

    # Initialize result matrix
    c = np.zeros((m, n), dtype=np.float32)

    # Triple nested loop - the educational implementation
    for i in range(m):
        for j in range(n):
            for l in range(k):
                c[i, j] += a[i, l] * b[l, j]

    return c

# %% [markdown]
"""
### Test Educational Implementation

Let's test our educational loops and see why they're slow.
"""

# %%
def test_educational_baseline():
    """Test educational implementation and measure its performance"""
    print("Testing Educational Implementation...")

    # Test correctness with small matrices
    a = np.array([[1, 2], [3, 4]], dtype=np.float32)
    b = np.array([[5, 6], [7, 8]], dtype=np.float32)

    result_educational = educational_matmul(a, b)
    result_numpy = a @ b

    assert np.allclose(result_educational, result_numpy), "Educational matmul incorrect"
    print("✅ Educational implementation produces correct results")

    # Performance comparison (small sizes only - educational is VERY slow)
    print("\nPerformance comparison:")
    small_a = np.random.randn(100, 100).astype(np.float32)
    small_b = np.random.randn(100, 100).astype(np.float32)

    # Time educational implementation
    start = time.perf_counter()
    _ = educational_matmul(small_a, small_b)
    educational_time = time.perf_counter() - start

    # Time NumPy implementation
    start = time.perf_counter()
    _ = small_a @ small_b
    numpy_time = time.perf_counter() - start

    speedup = educational_time / numpy_time
    print(f"Educational loops: {educational_time*1000:.1f} ms")
    print(f"NumPy optimized:   {numpy_time*1000:.1f} ms")
    print(f"NumPy is {speedup:.1f}x faster")

    print("✅ Educational baseline established")
    return educational_time, numpy_time, speedup
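# %% [markdown]
"""
### Aside: Seeing Memory Access Patterns Directly

Before optimizing the loops, it helps to observe the cache effect in isolation. The sketch below is a minimal illustration (the helper name is ours, not part of the module's API): it times summing a large C-contiguous matrix row by row (contiguous reads) versus column by column (strided reads). The exact ratio depends on your CPU's cache sizes, but strided traversal is typically several times slower once the array no longer fits in cache.
"""

# %%
def demo_access_patterns(n: int = 4000) -> None:
    """Illustrative helper: time contiguous (row) vs strided (column) traversal."""
    x = np.random.randn(n, n).astype(np.float32)  # C-contiguous: each row is adjacent in memory

    start = time.perf_counter()
    for i in range(n):
        _ = x[i, :].sum()  # contiguous slice: sequential reads, prefetcher-friendly
    row_time = time.perf_counter() - start

    start = time.perf_counter()
    for j in range(n):
        _ = x[:, j].sum()  # strided slice: roughly one cache line fetched per element
    col_time = time.perf_counter() - start

    print(f"Row-wise (contiguous): {row_time*1000:.1f} ms")
    print(f"Column-wise (strided): {col_time*1000:.1f} ms")
    print(f"Strided traversal is {col_time/row_time:.1f}x slower on this machine")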
""" # %% def blocked_matmul(a: np.ndarray, b: np.ndarray, block_size: int = 64) -> np.ndarray: """ Cache-friendly blocked matrix multiplication. This version processes data in blocks that fit in CPU cache. Key insight: Keep working set small enough to fit in L1/L2 cache. Args: a: Left matrix (m × k) b: Right matrix (k × n) block_size: Size of cache-friendly blocks (typically 32-128) """ m, k = a.shape k2, n = b.shape assert k == k2, f"Incompatible shapes: {a.shape} @ {b.shape}" # Initialize result c = np.zeros((m, n), dtype=np.float32) # Process in blocks to maximize cache utilization for i in range(0, m, block_size): for j in range(0, n, block_size): for l in range(0, k, block_size): # Define block boundaries i_end = min(i + block_size, m) j_end = min(j + block_size, n) l_end = min(l + block_size, k) # Extract blocks (these stay in cache) a_block = a[i:i_end, l:l_end] b_block = b[l:l_end, j:j_end] # Multiply blocks using NumPy (optimized BLAS) c[i:i_end, j:j_end] += a_block @ b_block return c # %% [markdown] """ ### Test Blocked Implementation Let's see how much faster cache-friendly blocking is compared to educational loops. """ def test_blocked_optimization(): """Test blocked matrix multiplication performance""" print("Testing Blocked Matrix Multiplication...") # Test correctness a = np.random.randn(200, 200).astype(np.float32) b = np.random.randn(200, 200).astype(np.float32) result_blocked = blocked_matmul(a, b, block_size=64) result_numpy = a @ b assert np.allclose(result_blocked, result_numpy, atol=1e-3), "Blocked matmul incorrect" print("✅ Blocked implementation produces correct results") # Performance comparison print("\nPerformance comparison:") # Educational vs Blocked vs NumPy size = 200 test_a = np.random.randn(size, size).astype(np.float32) test_b = np.random.randn(size, size).astype(np.float32) # Time educational (smaller subset to avoid waiting forever) start = time.perf_counter() _ = educational_matmul(test_a[:50, :50], test_b[:50, :50]) educational_time = time.perf_counter() - start educational_time_scaled = educational_time * (size/50)**3 # Scale up # Time blocked start = time.perf_counter() _ = blocked_matmul(test_a, test_b, block_size=64) blocked_time = time.perf_counter() - start # Time NumPy start = time.perf_counter() _ = test_a @ test_b numpy_time = time.perf_counter() - start print(f"Educational (est): {educational_time_scaled*1000:.1f} ms") print(f"Blocked: {blocked_time*1000:.1f} ms") print(f"NumPy: {numpy_time*1000:.1f} ms") speedup_blocked = educational_time_scaled / blocked_time speedup_numpy = educational_time_scaled / numpy_time print(f"\nBlocked is {speedup_blocked:.1f}x faster than educational") print(f"NumPy is {speedup_numpy:.1f}x faster than educational") print("✅ Blocked optimization tested successfully") return blocked_time, numpy_time # %% [markdown] """ ## Part 3: NumPy Optimization - Production Performance Now we'll switch to NumPy for production use. The key insight: NumPy already has these optimizations (and more) built-in. """ # %% def optimized_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray: """ Production matrix multiplication using NumPy. This is what you should actually use in practice. NumPy already has blocking, vectorization, and BLAS optimizations built-in. """ return a @ b # %% [markdown] """ ### Test Production Implementation Let's verify that NumPy is indeed the best choice for production. 
""" # %% def test_production_performance(): """Test that NumPy is indeed optimal for production use""" print("Testing Production Performance...") # Test different sizes sizes = [200, 500, 800] print("\nPerformance comparison across the optimization spectrum:") for size in sizes: print(f"\nMatrix size: {size}x{size}") a = np.random.randn(size, size).astype(np.float32) b = np.random.randn(size, size).astype(np.float32) # Time blocked implementation start = time.perf_counter() _ = blocked_matmul(a, b, block_size=64) blocked_time = time.perf_counter() - start # Time NumPy implementation start = time.perf_counter() _ = optimized_matmul(a, b) numpy_time = time.perf_counter() - start speedup = blocked_time / numpy_time print(f"Blocked: {blocked_time*1000:6.1f} ms") print(f"NumPy: {numpy_time*1000:6.1f} ms") print(f"NumPy is {speedup:.1f}x faster than blocked") print("\n💡 Key Insight: NumPy already has these optimizations built-in!") print(" • Blocking algorithms") print(" • Vectorization") print(" • Hardware-specific BLAS libraries") print(" • Assembly-level optimizations") print("\n✅ Production performance verified") return True # %% [markdown] """ ## Part 4: Backend System - Transparent Switching Now let's build a system that automatically chooses the right implementation. """ # %% class OptimizedBackend: """Backend that automatically uses the best implementation""" def matmul(self, a: np.ndarray, b: np.ndarray) -> np.ndarray: """Matrix multiplication using NumPy (best for production)""" return optimized_matmul(a, b) # Global backend instance _backend = OptimizedBackend() def matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray: """Matrix multiplication using current backend""" return _backend.matmul(a, b) # %% [markdown] """ ### Test Backend System Let's verify our backend system works correctly and uses optimal implementations. """ # %% def test_backend_system(): """Test the backend system""" print("Testing Backend System...") # Test matrices a = np.random.randn(100, 100).astype(np.float32) b = np.random.randn(100, 100).astype(np.float32) # Test that our backend works result = matmul(a, b) expected = a @ b assert np.allclose(result, expected), "Backend matmul incorrect" print("✅ Backend produces correct results") # Compare performance start = time.perf_counter() _ = matmul(a, b) backend_time = time.perf_counter() - start start = time.perf_counter() _ = a @ b numpy_time = time.perf_counter() - start print(f"\nPerformance comparison:") print(f"Backend: {backend_time*1000:.1f} ms") print(f"NumPy: {numpy_time*1000:.1f} ms") print(f"Backend uses optimal NumPy implementation") print("\n✅ Backend system works correctly") return True # %% [markdown] """ ## Comprehensive Testing Let's run all our components together to see the complete optimization journey. """ # %% def run_complete_acceleration_demo(): """Run the complete acceleration demonstration""" print("🚀 Complete Acceleration Module Demo") print("=" * 50) print("THE OPTIMIZATION JOURNEY: From Loops to NumPy") # 1. Test educational baseline print("\n1. Educational Baseline (your Module 2/4 loops):") educational_results = test_educational_baseline() # 2. Test blocked optimization print("\n2. Cache-Friendly Blocking:") test_blocked_optimization() # 3. Test production performance print("\n3. Production Performance (NumPy):") test_production_performance() # 4. Test backend system print("\n4. 
Backend System:") test_backend_system() print("\n" + "=" * 50) print("🎯 OPTIMIZATION JOURNEY COMPLETE") print("=" * 50) print("\n📚 What You Learned:") print("✅ Why your Module 2/4 loops were slow (but educational)") print("✅ How cache-friendly blocking improves performance") print("✅ Why NumPy is optimal for production (already has optimizations)") print("✅ How to build transparent backend systems") print("\n🎯 Key Message:") print("• Educational loops: Perfect for understanding algorithms") print("• Blocking: Teaches cache optimization principles") print("• NumPy: Production choice with all optimizations built-in") print("• Smart backends: Combine educational value with performance") return educational_results # %% [markdown] """ ## Main Execution Block Run all tests and demonstrations when this module is executed directly. """ # %% if __name__ == "__main__": print("Module 15: Hardware Acceleration and Kernel Optimization") print("=" * 60) print("THE OPTIMIZATION JOURNEY: From Educational Loops to NumPy") # Run complete demonstration results = run_complete_acceleration_demo() print(f"\n🎉 Module 15 complete!") print(f"⚡ You've learned the full optimization spectrum.") print(f"🏗️ Ready to use NumPy optimally in production.") # %% [markdown] """ ## Systems Analysis Summary This module demonstrates the fundamental principles of hardware acceleration in ML systems: ### 🏗️ **Architecture Principles** - **Cache Hierarchy**: Understanding L1/L2/L3 cache and memory access costs - **Vectorization**: Leveraging SIMD instructions for parallel computation - **Memory Layout**: Contiguous access patterns for optimal performance - **Backend Abstraction**: Transparent dispatch between naive and optimized implementations ### ⚡ **Optimization Techniques** - **Blocked Algorithms**: Process data in cache-friendly blocks - **Vectorized Operations**: Avoid Python loops, use NumPy's optimized routines - **In-place Operations**: Minimize memory allocation overhead - **Automatic Dispatch**: Choose optimal implementation based on problem size ### 📊 **Performance Understanding** - **Measurement First**: Profile real bottlenecks before optimizing - **Algorithmic Impact**: O(N³) → O(N²) matters more than 2x constant factors - **Hardware Awareness**: CPU cache misses cost 100x more than cache hits - **Library Utilization**: Optimized BLAS libraries beat custom implementations ### 🎯 **Real-World Applications** - **ML Frameworks**: How PyTorch/TensorFlow apply these same principles - **Production Systems**: Where optimization efforts provide real value - **Development Practice**: When to optimize vs when to use existing solutions ### 💡 **Key Insights** - Cache-friendly algorithms provide 2-5x speedups from memory access patterns alone - Vectorization eliminates Python overhead for 10-100x improvements - Most NumPy operations are already optimized - focus on system-level improvements - Competition frameworks make optimization learning engaging and quantifiable - Real ML systems face memory and communication bottlenecks, not pure computation limits This approach teaches students to think like systems engineers: understand the hardware, measure scientifically, optimize systematically, and focus efforts where they matter most. """ # %% [markdown] """ ## Main Execution Block Run all tests and demonstrations when this module is executed directly. 
""" # %% if __name__ == "__main__": print("Module 15: Hardware Acceleration and Kernel Optimization") print("=" * 60) print("THE OPTIMIZATION JOURNEY: From Educational Loops to NumPy") # Run complete demonstration results = run_complete_acceleration_demo() print(f"\n🎉 Module 15 complete!") print(f"⚡ You've learned the full optimization spectrum.") print(f"🏗️ Ready to use NumPy optimally in production.") # %% [markdown] """ ## 🤔 ML Systems Thinking: Interactive Questions 1. **Why are nested loops slow for large matrices?** Your educational loops from Module 2/4 access memory randomly, causing cache misses. Explain why accessing `b[l, j]` in the inner loop creates terrible cache performance, and why this gets exponentially worse as matrix size increases. 2. **How does blocking improve cache usage?** Your blocked implementation processes 64×64 blocks. Calculate the memory footprint of a 64×64 block (in KB) and explain why this fits well in L1/L2 cache. What happens if you use 256×256 blocks instead? 3. **Why use NumPy instead of custom optimizations?** You implemented blocking to understand cache optimization, but NumPy is still faster. List three optimizations that NumPy has built-in that your blocked implementation lacks, and explain why building these yourself isn't worth the effort. 4. **When should you optimize vs use libraries?** You've seen educational loops (1000x slower), blocking (10x slower), and NumPy (optimal). For each scenario, choose the right approach: (a) Learning algorithms, (b) Debugging matrix math, (c) Production training loop, (d) Custom operation not in NumPy. Justify your choices. """ # %% [markdown] """ ## 🎯 MODULE SUMMARY: Hardware Acceleration and Kernel Optimization This module completes the optimization journey from your Module 2/4 educational loops to production-ready NumPy usage, showing why understanding comes through building. 
# %% [markdown]
"""
## 🎯 MODULE SUMMARY: Hardware Acceleration and Kernel Optimization

This module completes the optimization journey from your Module 2/4 educational loops to production-ready NumPy usage, showing why understanding comes through building.

### 🛤️ **The Optimization Journey**
- **Module 2/4**: You implemented educational loops to understand matrix multiplication
- **Module 15**: You learned why loops are slow and how to optimize them systematically
- **End Goal**: You now use NumPy optimally, understanding what's happening under the hood

### 🛠️ **What We Built**
- **Educational Baseline**: Your triple-nested loops from earlier modules
- **Blocked Implementation**: Cache-friendly version showing 10x+ speedup over loops
- **NumPy Integration**: Production implementation using optimal libraries
- **Smart Backend**: System that chooses the right implementation transparently

### 🧠 **Key Learning Outcomes**
- **Why loops are slow**: Memory access patterns and cache hierarchy matter most
- **How blocking helps**: Processing data in cache-friendly chunks improves performance
- **When to use NumPy**: It already has these optimizations (and more) built-in
- **Systems thinking**: Understanding enables better decisions about when to optimize

### ⚡ **Performance Spectrum Demonstrated**
- **Educational loops**: Perfect for learning, terrible for performance (1000x slower)
- **Cache-friendly blocking**: Good educational optimization (10x faster than loops)
- **NumPy production**: Optimal performance with all optimizations built-in

### 🏆 **Practical Skills Developed**
- Analyze why educational implementations have poor performance
- Implement cache-friendly algorithms to understand optimization principles
- Choose NumPy for production while understanding what it's doing internally
- Build systems that balance educational value with performance requirements

### 📊 **Systems Insights Gained**
- **Educational code serves a purpose**: Understanding algorithms enables optimization intuition
- **Cache hierarchy dominates performance**: Memory access patterns matter more than computation
- **Libraries beat custom optimization**: NumPy already has expert-level optimizations
- **Understanding enables better tools**: You can build smarter systems when you know the principles

### 💡 **The Key Message**

You implemented loops to understand the algorithm. You implemented blocking to understand cache optimization. Now you use NumPy because it already has these (and more) optimizations built-in. Understanding the journey makes you a better ML systems engineer.
"""