# %% [markdown]
"""
# Module 15: Hardware Acceleration and Kernel Optimization

## Learning Objectives

By the end of this module, you will be able to:

1. **Understand Why Loops Are Slow**: Explain why your Module 2/4 triple-nested loops perform poorly
2. **Implement Cache-Friendly Blocking**: Build blocked matrix multiplication that exploits the CPU cache
3. **Recognize When to Use Libraries**: Understand when NumPy's optimizations beat custom code
4. **Build Transparent Backend Systems**: Create automatic switching between implementations

## The Optimization Journey

**Key Message**: You implemented loops to understand the algorithm. Now we'll optimize them to understand systems performance, then switch to NumPy because it already has these (and more) optimizations built in.

**The Journey:**

1. **Baseline**: Your loops from Module 2/4 (educational, slow)
2. **Blocking**: Cache-friendly version (educational, faster)
3. **NumPy**: Production version (optimal performance)
4. **Backend**: Smart switching system
"""

# %% [markdown]
"""
## Part 1: Baseline Implementation - Your Loops from Module 2/4

Let's start with the educational triple-nested loops you implemented earlier. These were perfect for learning but terrible for performance.
"""

# %%
#| default_exp core.acceleration

import time

import numpy as np

def educational_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """
    Educational matrix multiplication using triple nested loops.

    This is the same implementation from Module 2/4 - perfect for learning
    the algorithm, but very slow due to poor cache performance.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, f"Incompatible shapes: {a.shape} @ {b.shape}"

    # Initialize result matrix
    c = np.zeros((m, n), dtype=np.float32)

    # Triple nested loop - the educational implementation
    for i in range(m):
        for j in range(n):
            for l in range(k):
                c[i, j] += a[i, l] * b[l, j]

    return c

# %% [markdown]
"""
### Test Educational Implementation

Let's test our educational loops and see why they're slow.
"""

# %%
def test_educational_baseline():
    """Test educational implementation and measure its performance"""
    print("Testing Educational Implementation...")

    # Test correctness with small matrices
    a = np.array([[1, 2], [3, 4]], dtype=np.float32)
    b = np.array([[5, 6], [7, 8]], dtype=np.float32)

    result_educational = educational_matmul(a, b)
    result_numpy = a @ b
    assert np.allclose(result_educational, result_numpy), "Educational matmul incorrect"
    print("✅ Educational implementation produces correct results")

    # Performance comparison (small sizes only - educational is VERY slow)
    print("\nPerformance comparison:")
    small_a = np.random.randn(100, 100).astype(np.float32)
    small_b = np.random.randn(100, 100).astype(np.float32)

    # Time educational implementation
    start = time.perf_counter()
    _ = educational_matmul(small_a, small_b)
    educational_time = time.perf_counter() - start

    # Time NumPy implementation
    start = time.perf_counter()
    _ = small_a @ small_b
    numpy_time = time.perf_counter() - start

    speedup = educational_time / numpy_time
    print(f"Educational loops: {educational_time*1000:.1f} ms")
    print(f"NumPy optimized:   {numpy_time*1000:.1f} ms")
    print(f"NumPy is {speedup:.1f}x faster")

    print("✅ Educational baseline established")
    return educational_time, numpy_time, speedup

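# %% [markdown]
"""
### Aside: A Half-Step Toward NumPy - Vectorizing the Inner Loop

Before full blocking, it's worth seeing a smaller win. The sketch below is an illustration we've added (the function name is ours, not part of TinyTorch): replacing only the innermost loop with a NumPy row update turns m·n·k interpreted scalar operations into m·k vectorized row operations, and it accesses `b` contiguously. It typically lands somewhere between the pure loops and `a @ b` in speed.
"""

# %%
def rowwise_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Like educational_matmul, but the innermost loop is one vectorized row update."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, f"Incompatible shapes: {a.shape} @ {b.shape}"
    c = np.zeros((m, n), dtype=np.float32)
    for i in range(m):
        for l in range(k):
            # One scalar from a scales an entire contiguous row of b at once
            c[i, :] += a[i, l] * b[l, :]
    return c
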
# %% [markdown]
"""
## Part 2: Cache-Friendly Blocking - Your First Optimization

Now let's implement blocked matrix multiplication. This teaches you about the CPU cache hierarchy by processing data in blocks that fit in cache.
"""

# %%
def blocked_matmul(a: np.ndarray, b: np.ndarray, block_size: int = 64) -> np.ndarray:
    """
    Cache-friendly blocked matrix multiplication.

    This version processes data in blocks that fit in the CPU cache.
    Key insight: keep the working set small enough to fit in L1/L2 cache.

    Args:
        a: Left matrix (m × k)
        b: Right matrix (k × n)
        block_size: Size of cache-friendly blocks (typically 32-128)
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, f"Incompatible shapes: {a.shape} @ {b.shape}"

    # Initialize result
    c = np.zeros((m, n), dtype=np.float32)

    # Process in blocks to maximize cache utilization
    for i in range(0, m, block_size):
        for j in range(0, n, block_size):
            for l in range(0, k, block_size):
                # Define block boundaries
                i_end = min(i + block_size, m)
                j_end = min(j + block_size, n)
                l_end = min(l + block_size, k)

                # Extract blocks (these stay in cache)
                a_block = a[i:i_end, l:l_end]
                b_block = b[l:l_end, j:j_end]

                # Multiply blocks using NumPy (optimized BLAS)
                c[i:i_end, j:j_end] += a_block @ b_block

    return c

# %% [markdown]
"""
### Test Blocked Implementation

Let's see how much faster cache-friendly blocking is compared to the educational loops.
"""

# %%
def test_blocked_optimization():
    """Test blocked matrix multiplication performance"""
    print("Testing Blocked Matrix Multiplication...")

    # Test correctness
    a = np.random.randn(200, 200).astype(np.float32)
    b = np.random.randn(200, 200).astype(np.float32)

    result_blocked = blocked_matmul(a, b, block_size=64)
    result_numpy = a @ b

    assert np.allclose(result_blocked, result_numpy, atol=1e-3), "Blocked matmul incorrect"
    print("✅ Blocked implementation produces correct results")

    # Performance comparison
    print("\nPerformance comparison:")

    # Educational vs Blocked vs NumPy
    size = 200
    test_a = np.random.randn(size, size).astype(np.float32)
    test_b = np.random.randn(size, size).astype(np.float32)

    # Time educational on a 50x50 subset (the full size would take far too long)
    start = time.perf_counter()
    _ = educational_matmul(test_a[:50, :50], test_b[:50, :50])
    educational_time = time.perf_counter() - start
    # Extrapolate to the full size: matmul work grows as O(n^3)
    educational_time_scaled = educational_time * (size / 50) ** 3

    # Time blocked
    start = time.perf_counter()
    _ = blocked_matmul(test_a, test_b, block_size=64)
    blocked_time = time.perf_counter() - start

    # Time NumPy
    start = time.perf_counter()
    _ = test_a @ test_b
    numpy_time = time.perf_counter() - start

    print(f"Educational (est): {educational_time_scaled*1000:.1f} ms")
    print(f"Blocked:           {blocked_time*1000:.1f} ms")
    print(f"NumPy:             {numpy_time*1000:.1f} ms")

    speedup_blocked = educational_time_scaled / blocked_time
    speedup_numpy = educational_time_scaled / numpy_time

    print(f"\nBlocked is {speedup_blocked:.1f}x faster than educational")
    print(f"NumPy is {speedup_numpy:.1f}x faster than educational")

    print("✅ Blocked optimization tested successfully")
    return blocked_time, numpy_time

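# %% [markdown]
"""
### Aside: Picking a Block Size

The default `block_size=64` is a reasonable guess, not a law: a 64×64 float32 block is 64 × 64 × 4 bytes = 16 KB, so the three active blocks (from `a`, `b`, and `c`) fit comfortably alongside each other in a typical L1/L2 cache. The quick sweep below is an illustration we've added (the helper name is ours, not part of TinyTorch); it measures what your machine actually prefers.
"""

# %%
def sweep_block_sizes(size: int = 400, candidates=(16, 32, 64, 128, 256)):
    """Time blocked_matmul across several block sizes on random matrices."""
    a = np.random.randn(size, size).astype(np.float32)
    b = np.random.randn(size, size).astype(np.float32)
    for bs in candidates:
        start = time.perf_counter()
        _ = blocked_matmul(a, b, block_size=bs)
        elapsed = time.perf_counter() - start
        print(f"block_size={bs:4d}: {elapsed*1000:7.1f} ms")
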
# %% [markdown]
"""
## Part 3: NumPy Optimization - Production Performance

Now we'll switch to NumPy for production use. The key insight: NumPy already has these optimizations (and more) built in.
"""

# %%
def optimized_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """
    Production matrix multiplication using NumPy.

    This is what you should actually use in practice.
    NumPy already has blocking, vectorization, and BLAS optimizations built in.
    """
    return a @ b

# %% [markdown]
"""
### Test Production Implementation

Let's verify that NumPy is indeed the best choice for production.
"""

# %%
def test_production_performance():
    """Test that NumPy is indeed optimal for production use"""
    print("Testing Production Performance...")

    # Test different sizes
    sizes = [200, 500, 800]

    print("\nPerformance comparison across the optimization spectrum:")

    for size in sizes:
        print(f"\nMatrix size: {size}x{size}")
        a = np.random.randn(size, size).astype(np.float32)
        b = np.random.randn(size, size).astype(np.float32)

        # Time blocked implementation
        start = time.perf_counter()
        _ = blocked_matmul(a, b, block_size=64)
        blocked_time = time.perf_counter() - start

        # Time NumPy implementation
        start = time.perf_counter()
        _ = optimized_matmul(a, b)
        numpy_time = time.perf_counter() - start

        speedup = blocked_time / numpy_time
        print(f"Blocked: {blocked_time*1000:6.1f} ms")
        print(f"NumPy:   {numpy_time*1000:6.1f} ms")
        print(f"NumPy is {speedup:.1f}x faster than blocked")

    print("\n💡 Key Insight: NumPy already has these optimizations built in!")
    print("   • Blocking algorithms")
    print("   • Vectorization")
    print("   • Hardware-specific BLAS libraries")
    print("   • Assembly-level optimizations")

    print("\n✅ Production performance verified")
    return True

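# %% [markdown]
"""
### Aside: Which BLAS Is Doing the Work?

The "assembly-level optimizations" live in whichever BLAS library your NumPy build links against (OpenBLAS, MKL, Accelerate, ...). As a quick check - an addition for illustration, not part of the original module - NumPy can print its own build configuration:
"""

# %%
# Prints NumPy's build/link configuration, including the BLAS/LAPACK backend in use.
np.show_config()
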
# %% [markdown]
"""
## Part 4: Backend System - Transparent Switching

Now let's build a system that automatically chooses the right implementation.
"""

# %%
class OptimizedBackend:
    """Backend that automatically uses the best implementation"""

    def matmul(self, a: np.ndarray, b: np.ndarray) -> np.ndarray:
        """Matrix multiplication using NumPy (best for production)"""
        return optimized_matmul(a, b)

# Global backend instance
_backend = OptimizedBackend()

def matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Matrix multiplication using the current backend"""
    return _backend.matmul(a, b)

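# %% [markdown]
"""
### Aside: What Size-Based Dispatch Could Look Like

`OptimizedBackend` always forwards to NumPy, which is the right production default. The variant below is hypothetical (the class name and threshold are our assumptions, not part of TinyTorch): it keeps the educational loops for tiny inputs - say, when you want to step through the algorithm in a debugger - and dispatches to NumPy for everything else.
"""

# %%
class DispatchingBackend:
    """Hypothetical backend: educational loops for tiny matrices, NumPy otherwise."""

    def __init__(self, loop_threshold: int = 8):
        # Below this dimension the readable triple loop is cheap enough to step through
        self.loop_threshold = loop_threshold

    def matmul(self, a: np.ndarray, b: np.ndarray) -> np.ndarray:
        if max(a.shape[0], a.shape[1], b.shape[1]) <= self.loop_threshold:
            return educational_matmul(a, b)  # easy to trace and debug
        return optimized_matmul(a, b)  # production path
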
# %% [markdown]
"""
### Test Backend System

Let's verify our backend system works correctly and uses optimal implementations.
"""

# %%
def test_backend_system():
    """Test the backend system"""
    print("Testing Backend System...")

    # Test matrices
    a = np.random.randn(100, 100).astype(np.float32)
    b = np.random.randn(100, 100).astype(np.float32)

    # Test that our backend works
    result = matmul(a, b)
    expected = a @ b

    assert np.allclose(result, expected), "Backend matmul incorrect"
    print("✅ Backend produces correct results")

    # Compare performance
    start = time.perf_counter()
    _ = matmul(a, b)
    backend_time = time.perf_counter() - start

    start = time.perf_counter()
    _ = a @ b
    numpy_time = time.perf_counter() - start

    print("\nPerformance comparison:")
    print(f"Backend: {backend_time*1000:.1f} ms")
    print(f"NumPy:   {numpy_time*1000:.1f} ms")
    print("Backend uses the optimal NumPy implementation")

    print("\n✅ Backend system works correctly")
    return True

# %% [markdown]
"""
## Comprehensive Testing

Let's run all our components together to see the complete optimization journey.
"""

# %%
def run_complete_acceleration_demo():
    """Run the complete acceleration demonstration"""
    print("🚀 Complete Acceleration Module Demo")
    print("=" * 50)
    print("THE OPTIMIZATION JOURNEY: From Loops to NumPy")

    # 1. Test educational baseline
    print("\n1. Educational Baseline (your Module 2/4 loops):")
    educational_results = test_educational_baseline()

    # 2. Test blocked optimization
    print("\n2. Cache-Friendly Blocking:")
    test_blocked_optimization()

    # 3. Test production performance
    print("\n3. Production Performance (NumPy):")
    test_production_performance()

    # 4. Test backend system
    print("\n4. Backend System:")
    test_backend_system()

    print("\n" + "=" * 50)
    print("🎯 OPTIMIZATION JOURNEY COMPLETE")
    print("=" * 50)

    print("\n📚 What You Learned:")
    print("✅ Why your Module 2/4 loops were slow (but educational)")
    print("✅ How cache-friendly blocking improves performance")
    print("✅ Why NumPy is optimal for production (optimizations already built in)")
    print("✅ How to build transparent backend systems")

    print("\n🎯 Key Message:")
    print("• Educational loops: Perfect for understanding algorithms")
    print("• Blocking: Teaches cache optimization principles")
    print("• NumPy: Production choice with all optimizations built in")
    print("• Smart backends: Combine educational value with performance")

    return educational_results

# %% [markdown]
"""
## Main Execution Block

Run all tests and demonstrations when this module is executed directly.
"""

# %%
if __name__ == "__main__":
    print("Module 15: Hardware Acceleration and Kernel Optimization")
    print("=" * 60)
    print("THE OPTIMIZATION JOURNEY: From Educational Loops to NumPy")

    # Run complete demonstration
    results = run_complete_acceleration_demo()

    print("\n🎉 Module 15 complete!")
    print("⚡ You've learned the full optimization spectrum.")
    print("🏗️ Ready to use NumPy optimally in production.")

# %% [markdown]
"""
## Systems Analysis Summary

This module demonstrates the fundamental principles of hardware acceleration in ML systems:

### 🏗️ **Architecture Principles**
- **Cache Hierarchy**: Understanding L1/L2/L3 caches and memory access costs
- **Vectorization**: Leveraging SIMD instructions for parallel computation
- **Memory Layout**: Contiguous access patterns for optimal performance
- **Backend Abstraction**: Transparent dispatch between naive and optimized implementations

### ⚡ **Optimization Techniques**
- **Blocked Algorithms**: Process data in cache-friendly blocks
- **Vectorized Operations**: Avoid Python loops; use NumPy's optimized routines (see the sketch after this summary)
- **In-place Operations**: Minimize memory allocation overhead (see the sketch after this summary)
- **Automatic Dispatch**: Choose the optimal implementation based on problem size (as sketched in Part 4)

### 📊 **Performance Understanding**
- **Measurement First**: Profile real bottlenecks before optimizing
- **Algorithmic Impact**: An O(N³) → O(N²) improvement matters more than a 2x constant factor
- **Hardware Awareness**: A CPU cache miss can cost on the order of 100x more than a cache hit
- **Library Utilization**: Optimized BLAS libraries beat custom implementations

### 🎯 **Real-World Applications**
- **ML Frameworks**: How PyTorch/TensorFlow apply these same principles
- **Production Systems**: Where optimization efforts provide real value
- **Development Practice**: When to optimize vs. when to use existing solutions

### 💡 **Key Insights**
- Cache-friendly algorithms provide 2-5x speedups from memory access patterns alone
- Vectorization eliminates Python overhead for 10-100x improvements
- Most NumPy operations are already optimized - focus on system-level improvements
- Competition frameworks make optimization learning engaging and quantifiable
- Real ML systems face memory and communication bottlenecks, not pure computation limits

This approach teaches students to think like systems engineers: understand the hardware, measure scientifically, optimize systematically, and focus effort where it matters most.
"""

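# %% [markdown]
"""
### Aside: Vectorization and In-place Operations

A minimal sketch of two techniques named in the summary above - an illustration we've added, with hypothetical helper names. `np.multiply(..., out=...)` writes into a preallocated buffer, so the hot path allocates no temporaries.
"""

# %%
def scale_rows_loop(x: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Python-loop version: one interpreted multiply per row."""
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        out[i] = x[i] * scales[i]
    return out

def scale_rows_vectorized(x: np.ndarray, scales: np.ndarray, out: np.ndarray) -> np.ndarray:
    """Vectorized and in-place: one broadcasted multiply into a reused buffer."""
    np.multiply(x, scales[:, None], out=out)  # no temporary arrays allocated
    return out
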
# %% [markdown]
"""
## 🤔 ML Systems Thinking: Interactive Questions

1. **Why are nested loops slow for large matrices?** Your educational loops from Module 2/4 access memory in a cache-unfriendly order. Explain why accessing `b[l, j]` in the inner loop strides down a column, touching a new cache line on nearly every iteration, and why performance degrades sharply once the matrices no longer fit in cache.

2. **How does blocking improve cache usage?** Your blocked implementation processes 64×64 blocks. Calculate the memory footprint of a 64×64 float32 block (in KB) and explain why this fits well in L1/L2 cache. What happens if you use 256×256 blocks instead?

3. **Why use NumPy instead of custom optimizations?** You implemented blocking to understand cache optimization, but NumPy is still faster. List three optimizations that NumPy has built in that your blocked implementation lacks, and explain why building these yourself isn't worth the effort.

4. **When should you optimize vs. use libraries?** You've seen educational loops (roughly 1000x slower), blocking (roughly 10x slower), and NumPy (optimal). For each scenario, choose the right approach: (a) learning algorithms, (b) debugging matrix math, (c) a production training loop, (d) a custom operation not in NumPy. Justify your choices.
"""

# %% [markdown]
"""
## 🎯 MODULE SUMMARY: Hardware Acceleration and Kernel Optimization

This module completes the optimization journey from your Module 2/4 educational loops to production-ready NumPy usage, showing why understanding comes through building.

### 🛤️ **The Optimization Journey**
- **Module 2/4**: You implemented educational loops to understand matrix multiplication
- **Module 15**: You learned why loops are slow and how to optimize them systematically
- **End Goal**: You now use NumPy optimally, understanding what's happening under the hood

### 🛠️ **What We Built**
- **Educational Baseline**: Your triple-nested loops from earlier modules
- **Blocked Implementation**: Cache-friendly version showing a 10x+ speedup over loops
- **NumPy Integration**: Production implementation using optimal libraries
- **Smart Backend**: System that chooses the right implementation transparently

### 🧠 **Key Learning Outcomes**
- **Why loops are slow**: Memory access patterns and the cache hierarchy matter most
- **How blocking helps**: Processing data in cache-friendly chunks improves performance
- **When to use NumPy**: It already has these optimizations (and more) built in
- **Systems thinking**: Understanding enables better decisions about when to optimize

### ⚡ **Performance Spectrum Demonstrated**
- **Educational loops**: Perfect for learning, terrible for performance (roughly 1000x slower)
- **Cache-friendly blocking**: Good educational optimization (roughly 10x faster than loops)
- **NumPy production**: Optimal performance with all optimizations built in

### 🏆 **Practical Skills Developed**
- Analyze why educational implementations have poor performance
- Implement cache-friendly algorithms to understand optimization principles
- Choose NumPy for production while understanding what it's doing internally
- Build systems that balance educational value with performance requirements

### 📊 **Systems Insights Gained**
- **Educational code serves a purpose**: Understanding algorithms builds optimization intuition
- **Cache hierarchy dominates performance**: Memory access patterns matter more than raw computation
- **Libraries beat custom optimization**: NumPy already has expert-level optimizations
- **Understanding enables better tools**: You can build smarter systems when you know the principles

### 💡 **The Key Message**
You implemented loops to understand the algorithm. You implemented blocking to understand cache optimization. Now you use NumPy because it already has these (and more) optimizations built in. Understanding the journey makes you a better ML systems engineer.
"""