mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-05-02 02:47:32 -05:00
- Added progressive complexity guidelines (Foundation/Intermediate/Advanced) - Added measurement function consolidation to prevent information overload - Fixed all diagnostic issues in losses_dev.py - Fixed markdown formatting across all modules - Consolidated redundant analysis functions in foundation modules - Fixed syntax errors and unused variables - Ensured all educational content is in proper markdown cells for Jupyter
1533 lines
63 KiB
Python
1533 lines
63 KiB
Python
# %% [markdown]
|
||
"""
|
||
# Module 16: Hardware Acceleration - The Free Speedup!
|
||
|
||
Welcome to Hardware Acceleration! You'll discover the easiest optimization in ML systems - getting 100x speedups with zero code changes!
|
||
|
||
## LINK Building on Previous Learning
|
||
**What You Built Before**:
|
||
- Module 02 (Tensor): Triple-nested loops for matrix operations
|
||
- Module 04 (Layers): Forward pass implementations
|
||
- Module 15 (Profiling): Performance measurement and bottleneck identification
|
||
|
||
**What's Working**: You can implement any matrix operation correctly using educational loops!
|
||
|
||
**The Gap**: Your educational loops are 1000x slower than production code, limiting real ML applications.
|
||
|
||
**This Module's Solution**: Learn the optimization spectrum from educational to production performance.
|
||
|
||
**Connection Map**:
|
||
```
|
||
Profiling -> Acceleration -> Production ML
|
||
(identify) (optimize) (deploy at scale)
|
||
```
|
||
|
||
## Learning Goals (Systems-Focused Framework)
|
||
- **Systems understanding**: CPU cache hierarchy and memory access patterns
|
||
- **Core implementation skill**: Cache-friendly blocking algorithms
|
||
- **Pattern/abstraction mastery**: Backend abstraction and automatic dispatch
|
||
- **Framework connections**: How PyTorch/TensorFlow achieve performance
|
||
- **Optimization trade-offs**: Educational clarity vs production speed
|
||
|
||
## Build -> Use -> Reflect
|
||
1. **Build**: Cache-friendly blocked matrix multiplication from scratch
|
||
2. **Use**: Apply acceleration to real ML model operations (MLP, CNN, Attention)
|
||
3. **Reflect**: Analyze the educational-to-production optimization spectrum
|
||
|
||
## Systems Reality Check
|
||
TIP **Production Context**: ML frameworks use these exact principles for 100x speedups
|
||
SPEED **Performance Insight**: Memory access patterns matter more than raw computation speed
|
||
|
||
## The Free Speedup Journey
|
||
|
||
**Key Message**: This is the EASIEST optimization - just use better backends! No accuracy trade-offs, no complex math - just 10-100x faster code.
|
||
|
||
```
|
||
Educational Loops -> Cache Blocking -> NumPy/BLAS -> Smart Backends
|
||
(learning) (understanding) (production) (automation)
|
||
1000x slower 100x slower optimal speed transparent
|
||
```
|
||
|
||
**Visual Performance Spectrum**:
|
||
```
|
||
Performance: [████████████████████████████████████████] 100% NumPy
|
||
[████] 4% Blocked
|
||
[▌] 0.1% Naive
|
||
```
|
||
|
||
**Why This Works**: Same math, better implementation. Free performance with zero downsides!
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## Part 1: Baseline Implementation - Your Loops from Module 2/4
|
||
|
||
Let's start with the educational triple-nested loops you implemented earlier. These were perfect for learning but terrible for performance.
|
||
|
||
### CPU vs GPU Architecture Fundamentals
|
||
|
||
```
|
||
CPU Architecture (Optimized for Sequential): GPU Architecture (Optimized for Parallel):
|
||
+---------------------------------------------+ +---------------------------------------------+
|
||
| Complex Control Unit | | Simple Control Units |
|
||
| +---------+ Large Caches | | +-+ +-+ +-+ +-+ Small Caches |
|
||
| | Core 1 | +--------------------------+ | | |C| |C| |C| |C| +----------------------+ |
|
||
| +---------+ | L3 Cache (8MB) | | | +-+ +-+ +-+ +-+ | Shared Memory (48KB) | |
|
||
| +---------+ | | | | +-+ +-+ +-+ +-+ | | |
|
||
| | Core 2 | +--------------------------+ | | |C| |C| |C| |C| +----------------------+ |
|
||
| +---------+ | | +-+ +-+ +-+ +-+ ... (thousands of cores) |
|
||
| +---------+ Main Memory (16GB) | | |
|
||
| | Core 4 | +--------------------------+ | | High Bandwidth Memory (HBM) |
|
||
| +---------+ | 200+ cycle latency | | | +--------------------------------------+ |
|
||
| +--------------------------+ | | | 1000+ GB/s bandwidth | |
|
||
+---------------------------------------------+ +---------------------------------------------+
|
||
|
||
CPU: Few cores, complex, optimized for latency GPU: Many cores, simple, optimized for throughput
|
||
Best for: Sequential algorithms, complex logic Best for: Parallel algorithms, simple operations
|
||
```
|
||
|
||
### Memory Hierarchy Deep Dive
|
||
|
||
```
|
||
Memory Hierarchy (Latency and Size Trade-offs):
|
||
|
||
Registers: 4 bytes | 1 cycle | ██████████ Speed
|
||
L1 Cache: 32KB | 3-4 cycles | ████████▒▒
|
||
L2 Cache: 256KB | 10-20 cycles | ██████▒▒▒▒
|
||
L3 Cache: 8MB | 50-100 cycles| ████▒▒▒▒▒▒
|
||
Main RAM: 16GB | 200+ cycles | ██▒▒▒▒▒▒▒▒
|
||
SSD Storage: 1TB | 100,000+ cyc | ▒▒▒▒▒▒▒▒▒▒
|
||
^ ^
|
||
Size Speed
|
||
```
|
||
|
||
**The Cache Miss Problem**:
|
||
- Cache hit: Data found in L1 -> 1 cycle
|
||
- Cache miss: Must fetch from RAM -> 200+ cycles
|
||
- 200x slowdown for every cache miss!
|
||
"""
|
||
|
||
# %%
|
||
#| default_exp backends.acceleration
|
||
|
||
import time
|
||
import numpy as np
|
||
import matplotlib.pyplot as plt
|
||
from typing import Tuple, Dict, List
|
||
|
||
def matmul_naive(a: np.ndarray, b: np.ndarray) -> np.ndarray:
|
||
"""
|
||
Educational matrix multiplication using triple nested loops.
|
||
|
||
This is the same implementation from Module 2/4 - perfect for learning
|
||
the algorithm, but very slow due to poor cache performance.
|
||
|
||
Memory Access Pattern Analysis:
|
||
```
|
||
Inner loop accesses:
|
||
a[i, k] -> Sequential access (cache-friendly)
|
||
b[k, j] -> Strided access (cache-hostile!)
|
||
|
||
For 1000*1000 matrices:
|
||
- a[i,k]: 1000 sequential reads per row (good)
|
||
- b[k,j]: 1000 random column reads (terrible!)
|
||
- Total cache misses: ~1 billion!
|
||
```
|
||
"""
|
||
m, k = a.shape
|
||
k2, n = b.shape
|
||
assert k == k2, f"Incompatible shapes: {a.shape} @ {b.shape}"
|
||
|
||
# Initialize result matrix (contiguous memory allocation)
|
||
c = np.zeros((m, n), dtype=np.float32)
|
||
|
||
# Triple nested loop - the educational implementation
|
||
# i loop: iterates over output rows
|
||
# j loop: iterates over output columns
|
||
# k loop: performs dot product computation
|
||
for i in range(m): # Output row
|
||
for j in range(n): # Output column
|
||
for k_idx in range(k): # Dot product accumulation
|
||
# Cache analysis: a[i,k_idx] = sequential (good)
|
||
# b[k_idx,j] = strided (bad!)
|
||
c[i, j] += a[i, k_idx] * b[k_idx, j]
|
||
|
||
return c
|
||
|
||
# MAGNIFY SYSTEMS INSIGHT: Memory Access Pattern Analysis
|
||
def analyze_memory_access_patterns():
|
||
"""
|
||
Visualize why naive loops create terrible cache performance.
|
||
|
||
This analysis shows the fundamental problem with nested loops:
|
||
cache-hostile memory access patterns that destroy performance.
|
||
"""
|
||
try:
|
||
print("📊 Memory Access Pattern Analysis")
|
||
print("=" * 45)
|
||
|
||
# Simulate memory access for small matrix
|
||
size = 4
|
||
print(f"\nAnalyzing {size}x{size} matrix multiplication:")
|
||
print("\nMatrix A (row-major layout):")
|
||
print("Memory: [a00 a01 a02 a03 | a10 a11 a12 a13 | a20 a21 a22 a23 | a30 a31 a32 a33]")
|
||
print("\nMatrix B (row-major layout):")
|
||
print("Memory: [b00 b01 b02 b03 | b10 b11 b12 b13 | b20 b21 b22 b23 | b30 b31 b32 b33]")
|
||
|
||
print("\n🔴 PROBLEM: Computing C[0,0] = sum(A[0,k] * B[k,0])")
|
||
print("A[0,k] accesses: a00, a01, a02, a03 (sequential OK)")
|
||
print("B[k,0] accesses: b00, b10, b20, b30 (every 4th element FAIL)")
|
||
|
||
print("\n📊 Cache Miss Analysis:")
|
||
cache_line_size = 64 # bytes
|
||
float_size = 4 # bytes
|
||
elements_per_line = cache_line_size // float_size # 16 elements
|
||
|
||
print(f"Cache line size: {cache_line_size} bytes = {elements_per_line} float32s")
|
||
print(f"Sequential access (A): 1 cache miss per {elements_per_line} elements")
|
||
print(f"Strided access (B): 1 cache miss per element (worst case)")
|
||
|
||
# Calculate for realistic size
|
||
n = 1000
|
||
sequential_misses = n // elements_per_line
|
||
strided_misses = n
|
||
total_operations = n * n * n
|
||
total_misses = total_operations * (sequential_misses + strided_misses) // n
|
||
|
||
print(f"\n📊 Scaling to {n}x{n} matrices:")
|
||
print(f"Total operations: {total_operations:,}")
|
||
print(f"Estimated cache misses: {total_misses:,}")
|
||
print(f"Cache miss rate: {total_misses/total_operations:.1%}")
|
||
print(f"\n📊 Why this kills performance:")
|
||
print(f"Cache hit: 1 cycle")
|
||
print(f"Cache miss: 200+ cycles")
|
||
print(f"Performance penalty: 200x slower!")
|
||
|
||
except Exception as e:
|
||
print(f"WARNING️ Error in memory analysis: {e}")
|
||
print("Make sure numpy is available")
|
||
|
||
# Run the analysis
|
||
analyze_memory_access_patterns()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### TEST Unit Test: Educational Implementation
|
||
|
||
Let's test our educational loops and measure their performance characteristics.
|
||
"""
|
||
|
||
# PASS IMPLEMENTATION CHECKPOINT: Naive matrix multiplication complete
|
||
|
||
# THINK PREDICTION: How much slower are educational loops vs NumPy?
|
||
# Your guess: ___x slower for 100x100 matrices
|
||
|
||
# MAGNIFY SYSTEMS INSIGHT #1: Why Educational Loops Are Slow
|
||
def analyze_educational_loop_performance():
|
||
"""
|
||
Measure and understand why educational loops create performance problems.
|
||
|
||
This analysis reveals the fundamental performance characteristics
|
||
that students experience when implementing algorithms from scratch.
|
||
"""
|
||
try:
|
||
print("📊 Educational Loop Performance Analysis")
|
||
print("=" * 50)
|
||
|
||
# Test progressively larger matrices to show scaling
|
||
sizes = [50, 100, 200]
|
||
|
||
print("\nPerformance scaling with matrix size:")
|
||
print("Size | Naive Time | NumPy Time | Slowdown | O(N³) Theory")
|
||
print("-" * 60)
|
||
|
||
baseline_naive = None
|
||
baseline_numpy = None
|
||
|
||
for size in sizes:
|
||
# Create test matrices
|
||
a = np.random.randn(size, size).astype(np.float32)
|
||
b = np.random.randn(size, size).astype(np.float32)
|
||
|
||
# Time naive implementation
|
||
start = time.perf_counter()
|
||
_ = matmul_naive(a, b)
|
||
naive_time = time.perf_counter() - start
|
||
|
||
# Time NumPy implementation
|
||
start = time.perf_counter()
|
||
_ = a @ b
|
||
numpy_time = time.perf_counter() - start
|
||
|
||
# Calculate slowdown
|
||
slowdown = naive_time / numpy_time
|
||
|
||
# Calculate theoretical scaling (O(N³))
|
||
if baseline_naive is None:
|
||
baseline_naive = naive_time
|
||
baseline_numpy = numpy_time
|
||
theory_scale = 1.0
|
||
else:
|
||
theory_scale = (size / sizes[0]) ** 3
|
||
|
||
print(f"{size:4d} | {naive_time*1000:9.1f}ms | {numpy_time*1000:9.1f}ms | {slowdown:7.0f}x | {theory_scale:8.1f}x")
|
||
|
||
print(f"\n📊 Key Performance Insights:")
|
||
print(f"• Educational loops: Perfect for learning algorithms")
|
||
print(f"• Scaling follows O(N³): doubling size = 8x operations")
|
||
print(f"• Cache misses make large matrices exponentially slower")
|
||
print(f"• NumPy: Professional optimizations give 100-1000x speedup")
|
||
|
||
print(f"\nTIP Why This Matters for ML Systems:")
|
||
print(f"• Understanding algorithms != performance optimization")
|
||
print(f"• Educational clarity vs production speed trade-off")
|
||
print(f"• Memory access patterns dominate performance")
|
||
print(f"• Library choice impacts application feasibility")
|
||
|
||
except Exception as e:
|
||
print(f"WARNING️ Error in performance analysis: {e}")
|
||
print("Make sure matrices are small enough for educational timing")
|
||
|
||
# Run the educational performance analysis
|
||
analyze_educational_loop_performance()
|
||
|
||
# %%
|
||
def test_naive_baseline():
|
||
"""
|
||
Test naive implementation and measure its performance characteristics.
|
||
|
||
This test validates correctness and demonstrates the performance gap
|
||
between educational loops and optimized implementations.
|
||
"""
|
||
print("TEST Testing Naive Implementation...")
|
||
|
||
# Test correctness with small matrices first
|
||
a = np.array([[1, 2], [3, 4]], dtype=np.float32)
|
||
b = np.array([[5, 6], [7, 8]], dtype=np.float32)
|
||
|
||
result_naive = matmul_naive(a, b)
|
||
result_numpy = a @ b
|
||
expected = np.array([[19, 22], [43, 50]], dtype=np.float32)
|
||
|
||
assert np.allclose(result_naive, result_numpy), "Naive matmul incorrect vs NumPy"
|
||
assert np.allclose(result_naive, expected), "Naive matmul incorrect vs expected"
|
||
print("PASS Naive implementation produces correct results")
|
||
|
||
# Performance comparison (small sizes only - educational is VERY slow)
|
||
print("\n📊 Performance comparison:")
|
||
small_a = np.random.randn(100, 100).astype(np.float32)
|
||
small_b = np.random.randn(100, 100).astype(np.float32)
|
||
|
||
# Time naive implementation (limit size to avoid excessive wait)
|
||
start = time.perf_counter()
|
||
_ = matmul_naive(small_a, small_b)
|
||
naive_time = time.perf_counter() - start
|
||
|
||
# Time NumPy implementation
|
||
start = time.perf_counter()
|
||
_ = small_a @ small_b
|
||
numpy_time = time.perf_counter() - start
|
||
|
||
speedup = naive_time / numpy_time
|
||
|
||
print(f"Naive loops: {naive_time*1000:8.1f} ms")
|
||
print(f"NumPy optimized: {numpy_time*1000:8.1f} ms")
|
||
print(f"Speedup: {speedup:8.1f}x faster")
|
||
|
||
# Estimate scaling behavior
|
||
print(f"\n📊 Scaling Analysis (100x100 baseline):")
|
||
print(f"For 500x500 matrix: ~{speedup * 125:.0f}x slower than NumPy") # (500/100)^3 = 125
|
||
print(f"For 1000x1000 matrix: ~{speedup * 1000:.0f}x slower than NumPy") # (1000/100)^3 = 1000
|
||
print(f"\nTIP Why: O(N³) complexity + cache misses = exponential slowdown")
|
||
|
||
print("PASS Naive baseline established")
|
||
return naive_time, numpy_time, speedup
|
||
|
||
# Execute the test
|
||
test_naive_baseline()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## Part 2: Understanding Cache Hierarchy - Why Memory Matters More Than Computation
|
||
|
||
**The Big Insight**: Modern CPUs are FAST at computation but SLOW at memory access. Cache hierarchy makes the difference between fast and slow code.
|
||
|
||
### CPU Cache Hierarchy Visualization
|
||
```
|
||
CPU Cache Hierarchy (Latency vs Capacity Trade-off):
|
||
|
||
+----------------------------------------------------------------------------------+
|
||
Register: 4 bytes | ██████████ 1 cycle (instant access) |
|
||
L1 Cache: 32KB | ████████▒▒ 3-4 cycles (lightning fast) |
|
||
L2 Cache: 256KB | ██████▒▒▒▒ 10-20 cycles (fast) |
|
||
L3 Cache: 8MB | ████▒▒▒▒▒▒ 50-100 cycles(slow) |
|
||
Main RAM: 16GB | ██▒▒▒▒▒▒▒▒ 200+ cycles (VERY slow) |
|
||
SSD: 1TB | ▒▒▒▒▒▒▒▒▒▒ 100,000+ cyc (glacial) |
|
||
+----------------------------------------------------------------------------------+
|
||
Size Speed Characteristics
|
||
```
|
||
|
||
**Key Principle**: Keep your working set in L1/L2 cache for 100x better performance!
|
||
|
||
### Vectorization vs Parallelization Concepts
|
||
|
||
```
|
||
Vectorization (SIMD - Single Instruction, Multiple Data):
|
||
+--------------------------------------------------+
|
||
| Scalar: for i in range(4): c[i] = a[i] + b[i] |
|
||
| ADD a[0], b[0] -> c[0] (4 operations) |
|
||
| ADD a[1], b[1] -> c[1] |
|
||
| ADD a[2], b[2] -> c[2] |
|
||
| ADD a[3], b[3] -> c[3] |
|
||
| |
|
||
| Vector: c = a + b (NumPy/BLAS) |
|
||
| VADD [a0,a1,a2,a3], [b0,b1,b2,b3] |
|
||
| -> [c0,c1,c2,c3] (1 operation!) |
|
||
+--------------------------------------------------+
|
||
|
||
Parallelization (Multiple cores working simultaneously):
|
||
+--------------------------------------------------+
|
||
| Core 1: Computes rows 0-249 of result matrix |
|
||
| Core 2: Computes rows 250-499 of result matrix |
|
||
| Core 3: Computes rows 500-749 of result matrix |
|
||
| Core 4: Computes rows 750-999 of result matrix |
|
||
| |
|
||
| 4x speedup (ideal) if no synchronization costs |
|
||
+--------------------------------------------------+
|
||
```
|
||
|
||
### Memory Access Pattern Analysis
|
||
|
||
Your naive loops access memory like this:
|
||
```python
|
||
for i in range(m): # Loop over output rows
|
||
for j in range(n): # Loop over output columns
|
||
for k in range(k): # Loop over dot product
|
||
c[i,j] += a[i,k] * b[k,j] # b[k,j] creates cache misses!
|
||
```
|
||
|
||
**The Problem**: `b[k,j]` creates terrible access patterns:
|
||
- Each `j` increment jumps to a new column (cache miss)
|
||
- Each `k` increment jumps to a new row (another cache miss)
|
||
- For 1000*1000 matrix: 1 billion cache misses!
|
||
|
||
**Visualization of Memory Access**:
|
||
```
|
||
Matrix B in memory (row-major):
|
||
[b00 b01 b02 b03 | b10 b11 b12 b13 | b20 b21 b22 b23 | ...]
|
||
|
||
Accessing column 0: b00, b10, b20, b30, ...
|
||
| | | |
|
||
4 4 4 4 elements apart = strided access
|
||
🔴 🔴 🔴 🔴 cache misses!
|
||
```
|
||
|
||
**The Solution**: Process in blocks that fit in cache.
|
||
"""
|
||
|
||
# %%
|
||
def matmul_blocked(a: np.ndarray, b: np.ndarray, block_size: int = 64) -> np.ndarray:
|
||
"""
|
||
Cache-friendly blocked matrix multiplication.
|
||
|
||
This version processes data in blocks that fit in CPU cache,
|
||
dramatically reducing cache misses and improving performance.
|
||
|
||
**Memory Analysis (Quantitative)**:
|
||
- 64*64 float32 block = 4096 * 4 bytes = 16KB per block
|
||
- 3 blocks (A_block, B_block, C_block) = 48KB total
|
||
- Fits comfortably in 256KB L2 cache with room for other data
|
||
- Reuses each data element 64 times before evicting from cache
|
||
|
||
**Why This Works**:
|
||
- Naive: 1 cache miss per operation (terrible)
|
||
- Blocked: 1 cache miss per 64 operations (64x better!)
|
||
|
||
**Blocking Visualization**:
|
||
```
|
||
Large Matrix Multiplication:
|
||
A (1000x1000) * B (1000x1000) = C (1000x1000)
|
||
|
||
Blocked Approach:
|
||
+----------------+ +----------------+ +----------------+
|
||
| 64x64| | | 64x64| | | 64x64| |
|
||
| block | A | * | block | B | = | block | C |
|
||
| | | | | | | | |
|
||
+----------------+ +----------------+ +----------------+
|
||
|
||
Each 64x64 block fits in L1/L2 cache!
|
||
```
|
||
|
||
Args:
|
||
a: Left matrix (m * k)
|
||
b: Right matrix (k * n)
|
||
block_size: Cache-friendly block size (64 = 16KB fits in L2 cache)
|
||
"""
|
||
m, k = a.shape
|
||
k2, n = b.shape
|
||
assert k == k2, f"Incompatible shapes: {a.shape} @ {b.shape}"
|
||
|
||
# Initialize result matrix with zeros
|
||
c = np.zeros((m, n), dtype=np.float32)
|
||
|
||
# Process in blocks to maximize cache utilization
|
||
# Outer loops: iterate over blocks
|
||
for i in range(0, m, block_size): # Block rows in A and C
|
||
for j in range(0, n, block_size): # Block columns in B and C
|
||
for k_idx in range(0, k, block_size): # Block columns in A, rows in B
|
||
|
||
# Define block boundaries (handle edge cases)
|
||
i_end = min(i + block_size, m)
|
||
j_end = min(j + block_size, n)
|
||
k_end = min(k_idx + block_size, k)
|
||
|
||
# Extract blocks that fit in cache
|
||
# These slices create views, not copies (memory efficient)
|
||
a_block = a[i:i_end, k_idx:k_end] # Shape: (<=64, <=64)
|
||
b_block = b[k_idx:k_end, j:j_end] # Shape: (<=64, <=64)
|
||
|
||
# Multiply blocks using optimized NumPy BLAS
|
||
# This operates on cache-resident data
|
||
c[i:i_end, j:j_end] += a_block @ b_block
|
||
|
||
return c
|
||
|
||
def calculate_cache_footprint(block_size: int) -> dict:
|
||
"""
|
||
Calculate memory footprint for educational purposes.
|
||
|
||
This helps students understand why different block sizes work better or worse.
|
||
Block size optimization is crucial for cache performance.
|
||
"""
|
||
bytes_per_float = 4 # float32 size
|
||
elements_per_block = block_size * block_size
|
||
bytes_per_block = elements_per_block * bytes_per_float
|
||
total_blocks = 3 # A_block, B_block, C_block
|
||
total_bytes = bytes_per_block * total_blocks
|
||
|
||
# Cache size thresholds (typical modern CPU)
|
||
l1_cache_size = 32 * 1024 # 32KB L1 data cache
|
||
l2_cache_size = 256 * 1024 # 256KB L2 cache
|
||
l3_cache_size = 8 * 1024 * 1024 # 8MB L3 cache
|
||
|
||
return {
|
||
"block_size": block_size,
|
||
"elements_per_block": elements_per_block,
|
||
"bytes_per_block": bytes_per_block,
|
||
"total_bytes": total_bytes,
|
||
"total_kb": total_bytes / 1024,
|
||
"fits_in_l1": total_bytes <= l1_cache_size,
|
||
"fits_in_l2": total_bytes <= l2_cache_size,
|
||
"fits_in_l3": total_bytes <= l3_cache_size,
|
||
"cache_level": (
|
||
"L1" if total_bytes <= l1_cache_size else
|
||
"L2" if total_bytes <= l2_cache_size else
|
||
"L3" if total_bytes <= l3_cache_size else
|
||
"RAM"
|
||
)
|
||
}
|
||
|
||
# MAGNIFY SYSTEMS INSIGHT: Cache Optimization Analysis
|
||
def analyze_cache_optimization():
|
||
"""
|
||
Analyze how different block sizes affect cache performance.
|
||
|
||
This demonstrates the trade-off between cache utilization
|
||
and computational efficiency in blocked algorithms.
|
||
"""
|
||
try:
|
||
print("📊 Cache Optimization Analysis")
|
||
print("=" * 40)
|
||
|
||
# Test different block sizes
|
||
block_sizes = [16, 32, 64, 128, 256]
|
||
|
||
print("\nBlock Size Analysis:")
|
||
print("Size | Elements | Memory | Cache Level | Efficiency")
|
||
print("-" * 55)
|
||
|
||
for block_size in block_sizes:
|
||
footprint = calculate_cache_footprint(block_size)
|
||
|
||
# Calculate computational efficiency
|
||
# Smaller blocks = more overhead, larger blocks = cache misses
|
||
if footprint["fits_in_l1"]:
|
||
efficiency = "Excellent"
|
||
elif footprint["fits_in_l2"]:
|
||
efficiency = "Good"
|
||
elif footprint["fits_in_l3"]:
|
||
efficiency = "Fair"
|
||
else:
|
||
efficiency = "Poor"
|
||
|
||
print(f"{block_size:4d} | {footprint['elements_per_block']:8d} | {footprint['total_kb']:6.1f}KB | {footprint['cache_level']:10s} | {efficiency}")
|
||
|
||
print("\n📊 Optimal Block Size Analysis:")
|
||
optimal = calculate_cache_footprint(64)
|
||
print(f"64x64 blocks use {optimal['total_kb']:.1f}KB")
|
||
print(f"Fits in: {optimal['cache_level']} cache")
|
||
print(f"Reuse factor: Each element used 64 times")
|
||
print(f"Cache efficiency: 64x better than naive")
|
||
|
||
print("\nTIP Key Insights:")
|
||
print("• Blocks too small: High loop overhead")
|
||
print("• Blocks too large: Cache misses")
|
||
print("• Sweet spot: 64x64 fits in L2 cache")
|
||
print("• Modern CPUs: Designed for this pattern!")
|
||
|
||
except Exception as e:
|
||
print(f"WARNING️ Error in cache analysis: {e}")
|
||
|
||
# Run the cache analysis
|
||
analyze_cache_optimization()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### TEST Unit Test: Blocked Implementation
|
||
|
||
Let's see how much faster cache-friendly blocking is compared to educational loops.
|
||
"""
|
||
|
||
# PASS IMPLEMENTATION CHECKPOINT: Cache-friendly blocking complete
|
||
|
||
# THINK PREDICTION: How much speedup does cache blocking provide?
|
||
# Your guess: ___x faster than educational loops
|
||
|
||
# MAGNIFY SYSTEMS INSIGHT #2: Cache Blocking Effectiveness
|
||
def analyze_cache_blocking_effectiveness():
|
||
"""
|
||
Measure how cache-friendly blocking improves performance.
|
||
|
||
This demonstrates the practical impact of designing algorithms
|
||
that work with CPU cache hierarchy instead of against it.
|
||
"""
|
||
try:
|
||
print("📊 Cache Blocking Effectiveness Analysis")
|
||
print("=" * 45)
|
||
|
||
# Test different block sizes to show optimal choice
|
||
matrix_size = 300
|
||
block_sizes = [32, 64, 128, 256]
|
||
|
||
# Create test matrices
|
||
a = np.random.randn(matrix_size, matrix_size).astype(np.float32)
|
||
b = np.random.randn(matrix_size, matrix_size).astype(np.float32)
|
||
|
||
print(f"\nBlock Size Optimization (Matrix: {matrix_size}x{matrix_size}):")
|
||
print("Block | Time (ms) | Cache Fit | Efficiency")
|
||
print("-" * 45)
|
||
|
||
best_time = float('inf')
|
||
best_block = 64
|
||
|
||
for block_size in block_sizes:
|
||
# Time blocked implementation
|
||
start = time.perf_counter()
|
||
_ = matmul_blocked(a, b, block_size=block_size)
|
||
blocked_time = time.perf_counter() - start
|
||
|
||
# Calculate cache footprint
|
||
footprint = calculate_cache_footprint(block_size)
|
||
|
||
# Determine efficiency
|
||
if blocked_time < best_time:
|
||
best_time = blocked_time
|
||
best_block = block_size
|
||
efficiency = "Optimal"
|
||
elif blocked_time < best_time * 1.2:
|
||
efficiency = "Good"
|
||
else:
|
||
efficiency = "Suboptimal"
|
||
|
||
print(f"{block_size:5d} | {blocked_time*1000:8.1f} | {footprint['cache_level']:8s} | {efficiency}")
|
||
|
||
# Compare with naive and NumPy
|
||
print(f"\n📊 Performance Comparison:")
|
||
|
||
# Time naive (small subset)
|
||
start = time.perf_counter()
|
||
_ = matmul_naive(a[:50, :50], b[:50, :50])
|
||
naive_time = time.perf_counter() - start
|
||
naive_scaled = naive_time * (matrix_size / 50) ** 3
|
||
|
||
# Time NumPy
|
||
start = time.perf_counter()
|
||
_ = a @ b
|
||
numpy_time = time.perf_counter() - start
|
||
|
||
print(f"Naive (estimated): {naive_scaled*1000:8.1f}ms")
|
||
print(f"Blocked (optimal): {best_time*1000:8.1f}ms")
|
||
print(f"NumPy (production): {numpy_time*1000:8.1f}ms")
|
||
|
||
speedup_blocked = naive_scaled / best_time
|
||
speedup_numpy = naive_scaled / numpy_time
|
||
|
||
print(f"\nROCKET Speedup Results:")
|
||
print(f"Blocking: {speedup_blocked:.0f}x faster than naive")
|
||
print(f"NumPy: {speedup_numpy:.0f}x faster than naive")
|
||
print(f"Block size {best_block}: Optimal for this matrix size")
|
||
|
||
print(f"\nTIP Key Cache Insights:")
|
||
print(f"• 64x64 blocks typically optimal (fits L2 cache)")
|
||
print(f"• Too small: High loop overhead")
|
||
print(f"• Too large: Cache misses return")
|
||
print(f"• Cache hierarchy shapes algorithm design")
|
||
|
||
except Exception as e:
|
||
print(f"WARNING️ Error in blocking analysis: {e}")
|
||
print("Make sure all blocking functions are implemented correctly")
|
||
|
||
# Run the cache blocking analysis
|
||
analyze_cache_blocking_effectiveness()
|
||
|
||
def test_blocked_optimization():
|
||
"""Test blocked matrix multiplication performance"""
|
||
print("Testing Blocked Matrix Multiplication...")
|
||
|
||
# Test correctness
|
||
a = np.random.randn(200, 200).astype(np.float32)
|
||
b = np.random.randn(200, 200).astype(np.float32)
|
||
|
||
result_blocked = matmul_blocked(a, b, block_size=64)
|
||
result_numpy = a @ b
|
||
|
||
assert np.allclose(result_blocked, result_numpy, atol=1e-3), "Blocked matmul incorrect"
|
||
print("PASS Blocked implementation produces correct results")
|
||
|
||
# Performance comparison
|
||
print("\nPerformance comparison:")
|
||
|
||
# Educational vs Blocked vs NumPy
|
||
size = 200
|
||
test_a = np.random.randn(size, size).astype(np.float32)
|
||
test_b = np.random.randn(size, size).astype(np.float32)
|
||
|
||
# Time educational (smaller subset to avoid waiting forever)
|
||
start = time.perf_counter()
|
||
_ = matmul_naive(test_a[:50, :50], test_b[:50, :50])
|
||
naive_time = time.perf_counter() - start
|
||
# Scale cubic complexity: (200/50)³ = 4³ = 64x operations
|
||
scaling_factor = (size / 50) ** 3
|
||
naive_time_scaled = naive_time * scaling_factor
|
||
|
||
# Time blocked
|
||
start = time.perf_counter()
|
||
_ = matmul_blocked(test_a, test_b, block_size=64)
|
||
blocked_time = time.perf_counter() - start
|
||
|
||
# Time NumPy
|
||
start = time.perf_counter()
|
||
_ = test_a @ test_b
|
||
numpy_time = time.perf_counter() - start
|
||
|
||
print(f"Naive (estimated): {naive_time_scaled*1000:.1f} ms")
|
||
print(f"Blocked: {blocked_time*1000:.1f} ms")
|
||
print(f"NumPy: {numpy_time*1000:.1f} ms")
|
||
|
||
speedup_blocked = naive_time_scaled / blocked_time
|
||
speedup_numpy = naive_time_scaled / numpy_time
|
||
|
||
print(f"\nROCKET SPEEDUP RESULTS:")
|
||
print(f"Blocked is {speedup_blocked:.1f}x faster than naive loops!")
|
||
print(f"NumPy is {speedup_numpy:.1f}x faster than naive loops!")
|
||
print(f"\nTIP Why blocking works: Better cache utilization!")
|
||
print(f" • Naive: 1 cache miss per operation")
|
||
print(f" • Blocked: 1 cache miss per 64 operations")
|
||
print(f" • NumPy: Professional optimizations + vectorization")
|
||
|
||
print("PASS Blocked optimization tested successfully")
|
||
return blocked_time, numpy_time
|
||
|
||
# Execute the blocked test
|
||
test_blocked_optimization()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## Part 3: NumPy Optimization - Production Performance
|
||
|
||
Now we'll switch to NumPy for production use. The key insight: NumPy already has these optimizations (and more) built-in.
|
||
"""
|
||
|
||
# %%
|
||
def matmul_numpy(a: np.ndarray, b: np.ndarray) -> np.ndarray:
|
||
"""
|
||
Production matrix multiplication using NumPy.
|
||
|
||
This is what you should actually use in practice.
|
||
NumPy already has blocking, vectorization, and BLAS optimizations built-in.
|
||
"""
|
||
return a @ b
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### TEST Unit Test: Production Implementation
|
||
|
||
Let's verify that NumPy is indeed the best choice for production.
|
||
"""
|
||
|
||
# PASS IMPLEMENTATION CHECKPOINT: Production backend system complete
|
||
|
||
# THINK PREDICTION: What makes NumPy faster than our blocking algorithm?
|
||
# Your answer: ___ (vectorization, BLAS, assembly, etc.)
|
||
|
||
# MAGNIFY SYSTEMS INSIGHT #3: Production Optimization Analysis
|
||
def analyze_production_optimization_stack():
|
||
"""
|
||
Analyze the complete optimization stack that makes NumPy so fast.
|
||
|
||
This reveals why production libraries beat custom implementations
|
||
and what optimizations are built into professional ML frameworks.
|
||
"""
|
||
try:
|
||
print("📊 Production Optimization Stack Analysis")
|
||
print("=" * 50)
|
||
|
||
# Test across range of sizes to show scaling characteristics
|
||
sizes = [100, 300, 500, 1000]
|
||
|
||
print("\nOptimization Stack Performance:")
|
||
print("Size | Naive Est | Blocked | NumPy | Block->NumPy | Total Speedup")
|
||
print("-" * 70)
|
||
|
||
for size in sizes:
|
||
# Create test matrices
|
||
a = np.random.randn(size, size).astype(np.float32)
|
||
b = np.random.randn(size, size).astype(np.float32)
|
||
|
||
# Time blocked implementation
|
||
start = time.perf_counter()
|
||
_ = matmul_blocked(a, b, block_size=64)
|
||
blocked_time = time.perf_counter() - start
|
||
|
||
# Time NumPy implementation
|
||
start = time.perf_counter()
|
||
_ = a @ b
|
||
numpy_time = time.perf_counter() - start
|
||
|
||
# Estimate naive time (from small sample)
|
||
if size <= 200:
|
||
start = time.perf_counter()
|
||
_ = matmul_naive(a[:50, :50], b[:50, :50])
|
||
naive_small = time.perf_counter() - start
|
||
naive_estimated = naive_small * (size / 50) ** 3
|
||
else:
|
||
# Use previous scaling for larger matrices
|
||
naive_estimated = naive_time * (size / 200) ** 3 if 'naive_time' in locals() else blocked_time * 100
|
||
|
||
# Calculate speedups
|
||
block_speedup = naive_estimated / blocked_time
|
||
numpy_speedup = blocked_time / numpy_time
|
||
total_speedup = naive_estimated / numpy_time
|
||
|
||
print(f"{size:4d} | {naive_estimated*1000:8.0f}ms | {blocked_time*1000:6.1f}ms | {numpy_time*1000:4.1f}ms | {numpy_speedup:9.1f}x | {total_speedup:11.0f}x")
|
||
|
||
if size == 200: # Store for scaling estimation
|
||
naive_time = naive_estimated
|
||
|
||
print(f"\n📊 NumPy's Optimization Stack:")
|
||
print(f"🔧 1. Cache Blocking: Process data in cache-friendly chunks")
|
||
print(f"🔧 2. Vectorization: SIMD instructions (4-8x speedup)")
|
||
print(f"🔧 3. BLAS Libraries: Hand-optimized linear algebra (Intel MKL, OpenBLAS)")
|
||
print(f"🔧 4. Assembly Kernels: CPU-specific optimizations")
|
||
print(f"🔧 5. Memory Layout: Optimal data structure organization")
|
||
print(f"🔧 6. Threading: Automatic parallelization for large matrices")
|
||
|
||
print(f"\n📊 Development Cost vs Performance Benefit:")
|
||
print(f"• Custom blocking: 1 week implementation -> 10-50x speedup")
|
||
print(f"• BLAS integration: 1 month implementation -> additional 5-10x")
|
||
print(f"• Assembly optimization: 6+ months -> additional 2-5x")
|
||
print(f"• NumPy: 0 development time -> all optimizations included")
|
||
|
||
print(f"\nTIP ML Systems Engineering Insight:")
|
||
print(f"• Focus on system architecture, not micro-optimizations")
|
||
print(f"• Leverage existing optimized libraries (NumPy, PyTorch, TensorFlow)")
|
||
print(f"• Understanding principles enables better system design")
|
||
print(f"• Build on foundations, don't reinvent optimized wheels")
|
||
|
||
except Exception as e:
|
||
print(f"WARNING️ Error in production analysis: {e}")
|
||
print("Make sure all performance functions are implemented correctly")
|
||
|
||
# Run the production optimization analysis
|
||
analyze_production_optimization_stack()
|
||
|
||
# %%
|
||
def test_production_performance():
|
||
"""Test that NumPy is indeed optimal for production use"""
|
||
print("Testing Production Performance...")
|
||
|
||
# Test different sizes
|
||
sizes = [200, 500, 800]
|
||
|
||
print("\nPerformance comparison across the optimization spectrum:")
|
||
|
||
for size in sizes:
|
||
print(f"\nMatrix size: {size}x{size}")
|
||
a = np.random.randn(size, size).astype(np.float32)
|
||
b = np.random.randn(size, size).astype(np.float32)
|
||
|
||
# Time blocked implementation
|
||
start = time.perf_counter()
|
||
_ = matmul_blocked(a, b, block_size=64)
|
||
blocked_time = time.perf_counter() - start
|
||
|
||
# Time NumPy implementation
|
||
start = time.perf_counter()
|
||
_ = matmul_numpy(a, b)
|
||
numpy_time = time.perf_counter() - start
|
||
|
||
speedup = blocked_time / numpy_time
|
||
print(f"Blocked: {blocked_time*1000:6.1f} ms")
|
||
print(f"NumPy: {numpy_time*1000:6.1f} ms")
|
||
print(f"NumPy is {speedup:.1f}x faster than blocked")
|
||
|
||
print("\nTIP Key Insight: NumPy already has these optimizations built-in!")
|
||
print(" • Blocking algorithms")
|
||
print(" • Vectorization")
|
||
print(" • Hardware-specific BLAS libraries")
|
||
print(" • Assembly-level optimizations")
|
||
|
||
print("\nPASS Production performance verified")
|
||
return True
|
||
|
||
# Execute the production test
|
||
test_production_performance()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## Part 4: Smart Backend System - Transparent Optimization
|
||
|
||
Now let's build a system that automatically chooses the right implementation. This is how real ML frameworks work!
|
||
"""
|
||
|
||
# %%
|
||
class OptimizedBackend:
|
||
"""
|
||
Smart backend that automatically dispatches to optimal implementations.
|
||
|
||
This demonstrates how real ML frameworks (PyTorch, TensorFlow) work:
|
||
- Single API for users
|
||
- Automatic dispatch to fastest implementation
|
||
- Transparent optimization without code changes
|
||
"""
|
||
|
||
def dispatch(self, op: str, *args, **kwargs):
|
||
"""Dispatch operations to optimal implementations"""
|
||
if op == "matmul":
|
||
return self.matmul(*args, **kwargs)
|
||
else:
|
||
raise NotImplementedError(f"Operation {op} not implemented")
|
||
|
||
def matmul(self, a: np.ndarray, b: np.ndarray) -> np.ndarray:
|
||
"""
|
||
Matrix multiplication with automatic optimization selection.
|
||
|
||
For production: Always use NumPy (has all optimizations built-in)
|
||
For education: Could switch based on size, but NumPy is always best
|
||
"""
|
||
# In a real system, you might choose based on:
|
||
# - Matrix size (small vs large)
|
||
# - Hardware available (CPU vs GPU)
|
||
# - Memory constraints
|
||
#
|
||
# But NumPy is almost always the right choice for CPU
|
||
return matmul_numpy(a, b)
|
||
|
||
# Global backend instance
|
||
_backend = OptimizedBackend()
|
||
|
||
def matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
|
||
"""
|
||
Matrix multiplication using optimal backend.
|
||
|
||
This is the API students should use - it automatically
|
||
selects the best implementation available.
|
||
"""
|
||
return _backend.dispatch("matmul", a, b)
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### TEST Unit Test: Backend System
|
||
|
||
Let's verify our backend system works correctly and uses optimal implementations.
|
||
"""
|
||
|
||
# %%
|
||
def test_backend_system():
|
||
"""Test the backend system"""
|
||
print("Testing Backend System...")
|
||
|
||
# Test matrices
|
||
a = np.random.randn(100, 100).astype(np.float32)
|
||
b = np.random.randn(100, 100).astype(np.float32)
|
||
|
||
# Test that our backend works
|
||
result = matmul(a, b)
|
||
expected = a @ b
|
||
|
||
assert np.allclose(result, expected), "Backend matmul incorrect"
|
||
print("PASS Backend produces correct results")
|
||
|
||
# Compare performance
|
||
start = time.perf_counter()
|
||
_ = matmul(a, b)
|
||
backend_time = time.perf_counter() - start
|
||
|
||
start = time.perf_counter()
|
||
_ = a @ b
|
||
numpy_time = time.perf_counter() - start
|
||
|
||
print(f"\nPerformance comparison:")
|
||
print(f"Backend: {backend_time*1000:.1f} ms")
|
||
print(f"NumPy: {numpy_time*1000:.1f} ms")
|
||
print(f"Backend uses optimal NumPy implementation")
|
||
|
||
print("\nPASS Backend system works correctly")
|
||
return True
|
||
|
||
# Execute the backend test
|
||
test_backend_system()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## TARGET Computational Assessment Questions
|
||
|
||
Practice your understanding of hardware acceleration concepts with these NBGrader-compatible questions.
|
||
|
||
These questions test your ability to analyze performance characteristics, optimize for cache hierarchy, and understand the engineering trade-offs in hardware acceleration. They're grounded in the actual implementations you just built and tested.
|
||
"""
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "acceleration-q1", "locked": false, "schema_version": 3, "solution": true, "task": false}
|
||
def calculate_cache_efficiency(matrix_size: int, block_size: int) -> Tuple[int, float]:
|
||
"""
|
||
Calculate the cache efficiency improvement of blocked vs naive matrix multiplication.
|
||
|
||
For a matrix_size * matrix_size multiplication using block_size * block_size blocks:
|
||
1. Calculate total number of cache misses for naive implementation
|
||
2. Calculate total number of cache misses for blocked implementation
|
||
3. Return (total_operations, efficiency_ratio)
|
||
|
||
Assumptions:
|
||
- Cache line = 64 bytes = 16 float32 elements
|
||
- Naive: Every B[k,j] access is a cache miss (column-major access)
|
||
- Blocked: 1 cache miss per block load, then block stays in cache
|
||
|
||
Args:
|
||
matrix_size: Size of square matrices (N*N)
|
||
block_size: Size of blocks for blocked algorithm
|
||
|
||
Returns:
|
||
Tuple[int, float]: (total_operations, cache_efficiency_ratio)
|
||
|
||
TODO: Implement cache efficiency calculation for blocked matrix multiplication
|
||
|
||
HINTS:
|
||
- Total operations = matrix_size³
|
||
- Naive cache misses ~= matrix_size³ (every B access misses)
|
||
- Blocked cache misses = (matrix_size/block_size)³ * block_size²
|
||
- Efficiency ratio = naive_misses / blocked_misses
|
||
"""
|
||
### BEGIN SOLUTION
|
||
# Total operations for matrix multiplication
|
||
total_operations = matrix_size ** 3
|
||
|
||
# Naive implementation cache misses
|
||
# Every access to B[k,j] causes a cache miss due to column-major access
|
||
naive_cache_misses = total_operations
|
||
|
||
# Blocked implementation cache misses
|
||
# Number of blocks in each dimension
|
||
blocks_per_dim = (matrix_size + block_size - 1) // block_size # Ceiling division
|
||
total_blocks = blocks_per_dim ** 3
|
||
|
||
# Each block is loaded once, then all elements accessed from cache
|
||
blocked_cache_misses = total_blocks * block_size ** 2
|
||
|
||
# Cache efficiency ratio
|
||
efficiency_ratio = naive_cache_misses / blocked_cache_misses if blocked_cache_misses > 0 else 1.0
|
||
|
||
return total_operations, efficiency_ratio
|
||
### END SOLUTION
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "acceleration-q2", "locked": false, "schema_version": 3, "solution": true, "task": false}
|
||
def analyze_vectorization_speedup(array_size: int, vector_width: int) -> Tuple[int, int, float]:
|
||
"""
|
||
Analyze the theoretical speedup from vectorization (SIMD instructions).
|
||
|
||
Calculate:
|
||
1. Number of scalar operations needed
|
||
2. Number of vector operations needed
|
||
3. Theoretical speedup ratio
|
||
|
||
Args:
|
||
array_size: Number of elements to process
|
||
vector_width: Number of elements processed per vector instruction
|
||
|
||
Returns:
|
||
Tuple[int, int, float]: (scalar_ops, vector_ops, speedup_ratio)
|
||
|
||
TODO: Calculate vectorization speedup for array operations
|
||
|
||
APPROACH:
|
||
1. Scalar: One operation per element
|
||
2. Vector: One operation per vector_width elements (with remainder)
|
||
3. Speedup: scalar_ops / vector_ops
|
||
|
||
EXAMPLE:
|
||
>>> scalar_ops, vector_ops, speedup = analyze_vectorization_speedup(1000, 4)
|
||
>>> print(f"Scalar: {scalar_ops}, Vector: {vector_ops}, Speedup: {speedup:.1f}x")
|
||
Scalar: 1000, Vector: 250, Speedup: 4.0x
|
||
"""
|
||
### BEGIN SOLUTION
|
||
# Scalar operations: one per element
|
||
scalar_ops = array_size
|
||
|
||
# Vector operations: ceiling division to handle remainder
|
||
vector_ops = (array_size + vector_width - 1) // vector_width
|
||
|
||
# Theoretical speedup (ignores overhead, assumes perfect vectorization)
|
||
speedup_ratio = scalar_ops / vector_ops if vector_ops > 0 else 1.0
|
||
|
||
return scalar_ops, vector_ops, speedup_ratio
|
||
### END SOLUTION
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "acceleration-q3", "locked": false, "schema_version": 3, "solution": true, "task": false}
|
||
def optimize_block_size(matrix_size: int, cache_sizes: Dict[str, int]) -> Tuple[int, str, float]:
|
||
"""
|
||
Find the optimal block size for a given matrix size and cache hierarchy.
|
||
|
||
Test block sizes [16, 32, 64, 128, 256] and select the largest that fits in L2 cache.
|
||
|
||
Args:
|
||
matrix_size: Size of square matrix to multiply
|
||
cache_sizes: Dictionary with cache sizes in bytes, e.g., {"L1": 32768, "L2": 262144}
|
||
|
||
Returns:
|
||
Tuple[int, str, float]: (optimal_block_size, cache_level, memory_utilization)
|
||
|
||
TODO: Find optimal block size based on cache constraints
|
||
|
||
APPROACH:
|
||
1. For each candidate block size, calculate memory footprint
|
||
2. Check which cache level it fits in (3 blocks * block_size² * 4 bytes)
|
||
3. Select largest block size that fits in L2 cache
|
||
4. Calculate memory utilization = footprint / cache_size
|
||
|
||
EXAMPLE:
|
||
>>> cache_sizes = {"L1": 32768, "L2": 262144}
|
||
>>> block_size, level, util = optimize_block_size(1000, cache_sizes)
|
||
>>> print(f"Optimal: {block_size}x{block_size}, fits in {level}, {util:.1%} utilization")
|
||
Optimal: 64x64, fits in L2, 18.8% utilization
|
||
"""
|
||
### BEGIN SOLUTION
|
||
candidate_sizes = [16, 32, 64, 128, 256]
|
||
bytes_per_float = 4
|
||
blocks_needed = 3 # A_block, B_block, C_block
|
||
|
||
optimal_block_size = 16 # Default fallback
|
||
cache_level = "RAM"
|
||
memory_utilization = 0.0
|
||
|
||
# Test each candidate size
|
||
for block_size in candidate_sizes:
|
||
# Calculate memory footprint
|
||
elements_per_block = block_size * block_size
|
||
bytes_per_block = elements_per_block * bytes_per_float
|
||
total_footprint = bytes_per_block * blocks_needed
|
||
|
||
# Check which cache level it fits in
|
||
if total_footprint <= cache_sizes.get("L1", 0):
|
||
# Prefer L2 for larger block sizes (better computational efficiency)
|
||
if block_size >= optimal_block_size:
|
||
optimal_block_size = block_size
|
||
cache_level = "L1"
|
||
memory_utilization = total_footprint / cache_sizes["L1"]
|
||
elif total_footprint <= cache_sizes.get("L2", 0):
|
||
# L2 is the sweet spot for most cases
|
||
if block_size >= optimal_block_size:
|
||
optimal_block_size = block_size
|
||
cache_level = "L2"
|
||
memory_utilization = total_footprint / cache_sizes["L2"]
|
||
|
||
return optimal_block_size, cache_level, memory_utilization
|
||
### END SOLUTION
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "acceleration-q4", "locked": false, "schema_version": 3, "solution": true, "task": false}
|
||
def compare_acceleration_techniques(matrix_size: int) -> Dict[str, float]:
|
||
"""
|
||
Compare the theoretical speedup of different acceleration techniques.
|
||
|
||
Calculate expected speedup for:
|
||
1. "cache_blocking": Blocked algorithm (64x64 blocks)
|
||
2. "vectorization": SIMD with 8-wide vectors
|
||
3. "parallelization": 4-core CPU parallelization
|
||
4. "combined": All techniques together
|
||
|
||
Args:
|
||
matrix_size: Size of square matrices
|
||
|
||
Returns:
|
||
Dict[str, float]: Speedup factors for each technique
|
||
|
||
TODO: Calculate theoretical speedups for different acceleration techniques
|
||
|
||
APPROACH:
|
||
1. Cache blocking: Use previous cache efficiency calculation
|
||
2. Vectorization: 8-wide SIMD operations
|
||
3. Parallelization: 4 cores working in parallel
|
||
4. Combined: Multiply individual speedups (idealized)
|
||
|
||
ASSUMPTIONS:
|
||
- Perfect scaling (no overhead)
|
||
- Cache blocking gives efficiency_ratio improvement
|
||
- Vectorization gives 8x speedup
|
||
- Parallelization gives 4x speedup
|
||
"""
|
||
### BEGIN SOLUTION
|
||
# Cache blocking speedup (using 64x64 blocks)
|
||
block_size = 64
|
||
_, cache_speedup = calculate_cache_efficiency(matrix_size, block_size)
|
||
|
||
# Vectorization speedup (8-wide SIMD)
|
||
vector_width = 8
|
||
_, _, vectorization_speedup = analyze_vectorization_speedup(matrix_size ** 3, vector_width)
|
||
|
||
# Parallelization speedup (4 cores)
|
||
parallelization_speedup = 4.0
|
||
|
||
# Combined speedup (multiplicative - idealized)
|
||
combined_speedup = cache_speedup * vectorization_speedup * parallelization_speedup
|
||
|
||
return {
|
||
"cache_blocking": cache_speedup,
|
||
"vectorization": vectorization_speedup,
|
||
"parallelization": parallelization_speedup,
|
||
"combined": combined_speedup
|
||
}
|
||
### END SOLUTION
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## Part 5: Real-World Application Testing
|
||
|
||
Let's test our optimizations on actual ML model operations: MLP layers, CNN convolutions, and Transformer attention.
|
||
"""
|
||
|
||
# %%
|
||
def test_ml_model_acceleration():
|
||
"""Test acceleration on real ML model operations"""
|
||
print("Testing Acceleration on Real ML Models...")
|
||
|
||
# Test 1: MLP Forward Pass (common in Module 4)
|
||
print("\n1. MLP Forward Pass (256 -> 128 -> 64):")
|
||
batch_size, input_dim, hidden_dim, output_dim = 32, 256, 128, 64
|
||
|
||
# Simulated MLP layers
|
||
x = np.random.randn(batch_size, input_dim).astype(np.float32)
|
||
W1 = np.random.randn(input_dim, hidden_dim).astype(np.float32)
|
||
W2 = np.random.randn(hidden_dim, output_dim).astype(np.float32)
|
||
|
||
# Time naive implementation (small version)
|
||
start = time.perf_counter()
|
||
h1_naive = matmul_naive(x[:8, :64], W1[:64, :32]) # Scaled down
|
||
h2_naive = matmul_naive(h1_naive, W2[:32, :16]) # Scaled down
|
||
naive_time = time.perf_counter() - start
|
||
|
||
# Time optimized implementation
|
||
start = time.perf_counter()
|
||
h1_opt = matmul(x, W1)
|
||
h2_opt = matmul(h1_opt, W2)
|
||
opt_time = time.perf_counter() - start
|
||
|
||
# Scale for: batch_size (32/8) * input_dim (256/64) * hidden_dim (128/32)
|
||
batch_scale = 32/8 # 4x more samples
|
||
input_scale = 256/64 # 4x larger input
|
||
hidden_scale = 128/32 # 4x larger hidden layer
|
||
naive_scaled = naive_time * batch_scale * input_scale * hidden_scale
|
||
speedup = naive_scaled / opt_time
|
||
|
||
print(f" Naive (estimated): {naive_scaled*1000:.1f} ms")
|
||
print(f" Optimized: {opt_time*1000:.1f} ms")
|
||
print(f" Speedup: {speedup:.1f}x faster!")
|
||
|
||
# Test 2: CNN-like Convolution (flattened as matrix multiply)
|
||
print("\n2. CNN Convolution (as matrix multiply):")
|
||
# Simulate im2col operation for 3x3 convolution
|
||
img_patches = np.random.randn(1024, 27).astype(np.float32) # 32x32 image, 3x3 patches
|
||
conv_filters = np.random.randn(27, 64).astype(np.float32) # 64 filters
|
||
|
||
start = time.perf_counter()
|
||
conv_output = matmul(img_patches, conv_filters)
|
||
conv_time = time.perf_counter() - start
|
||
print(f" Convolution output: {conv_time*1000:.1f} ms")
|
||
print(f" Shape: {conv_output.shape} (1024 locations * 64 filters)")
|
||
|
||
# Test 3: Transformer-like Attention (scaled down)
|
||
print("\n3. Transformer Attention (Q·K^T):")
|
||
seq_len, d_model = 128, 256
|
||
Q = np.random.randn(seq_len, d_model).astype(np.float32)
|
||
K = np.random.randn(seq_len, d_model).astype(np.float32)
|
||
|
||
start = time.perf_counter()
|
||
attention_scores = matmul(Q, K.T) # Shape: (seq_len, seq_len)
|
||
attn_time = time.perf_counter() - start
|
||
print(f" Attention computation: {attn_time*1000:.1f} ms")
|
||
print(f" Shape: {attention_scores.shape} (128*128 attention matrix)")
|
||
|
||
print(f"\nPASS All ML model operations accelerated successfully!")
|
||
print(f"TIP Key insight: Matrix multiplication is EVERYWHERE in ML!")
|
||
return True
|
||
|
||
# Execute the ML model test
|
||
test_ml_model_acceleration()
|
||
|
||
# MAGNIFY SYSTEMS INSIGHT: Acceleration Scaling Analysis
|
||
def analyze_acceleration_scaling():
|
||
"""
|
||
Analyze how different acceleration techniques scale with problem size.
|
||
|
||
This demonstrates the performance characteristics of optimization
|
||
techniques across a range of matrix sizes typical in ML workloads.
|
||
"""
|
||
try:
|
||
print("📊 Acceleration Scaling Analysis")
|
||
print("=" * 45)
|
||
|
||
# Test different matrix sizes (typical ML workloads)
|
||
matrix_sizes = [100, 200, 500, 1000, 2000]
|
||
|
||
print("\nScaling Analysis Across Matrix Sizes:")
|
||
print("Size | Cache Block | Vectorization | Parallelization | Combined")
|
||
print("-" * 65)
|
||
|
||
for size in matrix_sizes:
|
||
# Calculate speedups for this matrix size
|
||
speedups = compare_acceleration_techniques(size)
|
||
|
||
print(f"{size:4d} | {speedups['cache_blocking']:10.1f} | {speedups['vectorization']:12.1f} | {speedups['parallelization']:14.1f} | {speedups['combined']:7.0f}")
|
||
|
||
print(f"\n📊 Key Scaling Insights:")
|
||
|
||
# Analyze cache blocking scaling
|
||
small_speedup = compare_acceleration_techniques(100)['cache_blocking']
|
||
large_speedup = compare_acceleration_techniques(2000)['cache_blocking']
|
||
|
||
print(f"• Cache blocking: {small_speedup:.1f}x -> {large_speedup:.1f}x (scales with cache misses)")
|
||
print(f"• Vectorization: 8.0x constant (independent of matrix size)")
|
||
print(f"• Parallelization: 4.0x constant (perfect scaling assumed)")
|
||
print(f"• Combined: Multiplicative effect = cache * vector * parallel")
|
||
|
||
print(f"\n📊 Real-World Performance Expectations:")
|
||
realistic_combined = large_speedup * 4.0 * 4.0 # Conservative vectorization
|
||
print(f"• Realistic combined speedup: ~{realistic_combined:.0f}x")
|
||
print(f"• Why not perfect: Memory bandwidth limits, overhead, synchronization")
|
||
print(f"• Production systems: Focus on cache + vectorization first")
|
||
|
||
print(f"\nTIP ML Systems Implications:")
|
||
print(f"• Small models (<=500): Vectorization dominates")
|
||
print(f"• Large models (>=1000): Cache optimization critical")
|
||
print(f"• Production: Memory bandwidth becomes bottleneck")
|
||
print(f"• GPU: Different scaling - thousands of cores, different cache hierarchy")
|
||
|
||
except Exception as e:
|
||
print(f"WARNING️ Error in scaling analysis: {e}")
|
||
print("Make sure all analysis functions are implemented correctly")
|
||
|
||
# Run the scaling analysis
|
||
analyze_acceleration_scaling()
|
||
|
||
def run_complete_acceleration_demo():
|
||
"""Run the complete acceleration demonstration"""
|
||
print("ROCKET Complete Hardware Acceleration Demo")
|
||
print("=" * 55)
|
||
print("THE FREE SPEEDUP: From Naive Loops to Optimized Backends")
|
||
|
||
# 1. Test naive baseline
|
||
print("\n1. Naive Baseline (your Module 2/4 loops):")
|
||
naive_results = test_naive_baseline()
|
||
|
||
# 2. Test blocked optimization
|
||
print("\n2. Cache-Friendly Blocking:")
|
||
test_blocked_optimization()
|
||
|
||
# 3. Test production performance
|
||
print("\n3. Production Performance (NumPy):")
|
||
test_production_performance()
|
||
|
||
# 4. Test ML model acceleration
|
||
print("\n4. Real ML Model Acceleration:")
|
||
test_ml_model_acceleration()
|
||
|
||
# 5. Test backend system
|
||
print("\n5. Smart Backend System:")
|
||
test_backend_system()
|
||
|
||
print("\n" + "=" * 55)
|
||
print("TARGET HARDWARE ACCELERATION MASTERED")
|
||
print("=" * 55)
|
||
|
||
print("\n📚 What You Mastered:")
|
||
print("PASS Why your Module 2/4 loops were slow (cache hierarchy matters!)")
|
||
print("PASS How cache-friendly blocking works (process data in chunks)")
|
||
print("PASS Why NumPy dominates (professional optimizations built-in)")
|
||
print("PASS How to build smart backend systems (automatic optimization)")
|
||
print("PASS Real ML applications (MLPs, CNNs, Transformers all use matmul!)")
|
||
|
||
print("\nTARGET The Free Speedup Philosophy:")
|
||
print("• ROCKET Same math, better implementation = 100x speedup")
|
||
print("• 🧠 Educational loops teach algorithms")
|
||
print("• SPEED Blocked algorithms teach cache optimization")
|
||
print("• 🏭 NumPy provides production performance")
|
||
print("• TARGET Smart backends make optimization transparent")
|
||
print("• TIP Understanding the spectrum makes you a better engineer!")
|
||
|
||
return naive_results
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## Systems Analysis Summary
|
||
|
||
This module demonstrates the fundamental principles of hardware acceleration in ML systems:
|
||
|
||
### 🏗️ **Architecture Principles**
|
||
- **Cache Hierarchy**: Understanding L1/L2/L3 cache and memory access costs
|
||
- **Vectorization**: Leveraging SIMD instructions for parallel computation
|
||
- **Memory Layout**: Contiguous access patterns for optimal performance
|
||
- **Backend Abstraction**: Transparent dispatch between naive and optimized implementations
|
||
|
||
### SPEED **Optimization Techniques**
|
||
- **Blocked Algorithms**: Process data in cache-friendly blocks
|
||
- **Vectorized Operations**: Avoid Python loops, use NumPy's optimized routines
|
||
- **In-place Operations**: Minimize memory allocation overhead
|
||
- **Automatic Dispatch**: Choose optimal implementation based on problem size
|
||
|
||
### 📊 **Performance Understanding**
|
||
- **Measurement First**: Profile real bottlenecks before optimizing
|
||
- **Algorithmic Impact**: O(N³) -> O(N²) matters more than 2x constant factors
|
||
- **Hardware Awareness**: CPU cache misses cost 100x more than cache hits
|
||
- **Library Utilization**: Optimized BLAS libraries beat custom implementations
|
||
|
||
### TARGET **Real-World Applications**
|
||
- **ML Frameworks**: How PyTorch/TensorFlow apply these same principles
|
||
- **Production Systems**: Where optimization efforts provide real value
|
||
- **Development Practice**: When to optimize vs when to use existing solutions
|
||
|
||
### TIP **Key Insights**
|
||
- Cache-friendly algorithms provide 2-5x speedups from memory access patterns alone
|
||
- Vectorization eliminates Python overhead for 10-100x improvements
|
||
- Most NumPy operations are already optimized - focus on system-level improvements
|
||
- Competition frameworks make optimization learning engaging and quantifiable
|
||
- Real ML systems face memory and communication bottlenecks, not pure computation limits
|
||
|
||
This approach teaches students to think like systems engineers: understand the hardware, measure scientifically, optimize systematically, and focus efforts where they matter most.
|
||
"""
|
||
|
||
def test_unit_all():
|
||
"""Run all unit tests for the acceleration module."""
|
||
print("TEST Running all Hardware Acceleration tests...")
|
||
print("=" * 55)
|
||
|
||
try:
|
||
# Test educational baseline
|
||
print("\n1. Testing educational baseline...")
|
||
test_naive_baseline()
|
||
|
||
# Test cache blocking optimization
|
||
print("\n2. Testing cache blocking...")
|
||
test_blocked_optimization()
|
||
|
||
# Test production performance
|
||
print("\n3. Testing production performance...")
|
||
test_production_performance()
|
||
|
||
# Test backend system
|
||
print("\n4. Testing backend system...")
|
||
test_backend_system()
|
||
|
||
# Test ML model acceleration
|
||
print("\n5. Testing ML model acceleration...")
|
||
test_ml_model_acceleration()
|
||
|
||
print("\n" + "=" * 55)
|
||
print("PASS All Hardware Acceleration tests passed!")
|
||
print("ROCKET Module ready for production ML systems.")
|
||
|
||
except Exception as e:
|
||
print(f"FAIL Test failed: {e}")
|
||
raise
|
||
|
||
if __name__ == "__main__":
|
||
print("Module 16: Hardware Acceleration - The Free Speedup!")
|
||
print("=" * 60)
|
||
print("ROCKET THE EASIEST OPTIMIZATION: Better Backends, Zero Trade-offs")
|
||
|
||
# Run complete testing suite
|
||
test_unit_all()
|
||
|
||
print(f"\nCELEBRATE Module 16: Hardware Acceleration COMPLETE!")
|
||
print(f"SPEED Mastered: 10-100x speedups with no accuracy loss")
|
||
print(f"🧠 Learned: Cache hierarchy, blocking, vectorization")
|
||
print(f"🏭 Applied: MLPs, CNNs, Transformers all benefit")
|
||
print(f"TARGET Ready: To build high-performance ML systems!")
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## THINK ML Systems Thinking: Interactive Questions
|
||
|
||
1. **Memory Access Pattern Analysis**: In your `matmul_naive()` implementation, the innermost loop accesses `a[i, k]` sequentially but `b[k, j]` with large strides. When you tested 200*200 matrices, you saw dramatic slowdowns. Analyze why: (a) Calculate cache misses for both access patterns, (b) Explain why `b[k, j]` creates O(N²) cache misses, (c) Show how this scales to 1000*1000 matrices, and (d) Design a memory layout that would eliminate strided access.
|
||
|
||
2. **Cache Blocking Optimization**: Your `matmul_blocked()` function uses 64*64 blocks and showed significant speedups over naive loops. Analyze the cache efficiency: (a) Calculate total memory footprint (3 blocks * 64² * 4 bytes), (b) Verify it fits in L2 cache (256KB), (c) Compute cache reuse factor (64 operations per cache line), (d) Predict performance change with 128*128 blocks, and (e) Explain why your cache analysis function showed 64*64 as optimal.
|
||
|
||
3. **Production Stack Engineering**: You measured that NumPy beats your blocked implementation by 5-10x. Analyze the engineering trade-offs: (a) List three specific optimizations NumPy includes (BLAS, vectorization, threading), (b) Calculate development time vs. performance gain for each, (c) Estimate why custom optimization rarely beats production libraries, and (d) Determine when custom optimization is justified in ML systems.
|
||
|
||
4. **ML Acceleration Architecture**: Your tests showed acceleration benefits for MLP, CNN, and Transformer operations. Design an acceleration strategy: (a) Rank these operations by matrix multiplication density, (b) Identify memory bandwidth vs. compute bottlenecks for each, (c) Predict how GPU acceleration would change the rankings, and (d) Explain why understanding this spectrum enables better ML systems engineering decisions.
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## TARGET MODULE SUMMARY: Hardware Acceleration - The Free Speedup
|
||
|
||
This module demonstrates the easiest optimization in ML systems: using better backends for free speedups with zero accuracy trade-offs. You learned why understanding the optimization spectrum makes you a better engineer.
|
||
|
||
### 🛤️ **The Free Speedup Journey**
|
||
- **Educational Foundation**: Your Module 2/4 loops taught you the algorithm (perfect for learning)
|
||
- **Performance Understanding**: Module 15 showed you WHY loops are slow (profiling first)
|
||
- **Optimization Mastery**: Now you achieve 100x speedups by choosing better implementations
|
||
- **Systems Thinking**: Understanding the spectrum from educational to production code
|
||
|
||
### 🛠️ **What We Built and Tested**
|
||
- **Educational Baseline**: Your triple-nested loops from Module 2/4 (algorithm understanding)
|
||
- **Cache-Friendly Blocking**: 64*64 blocks fitting in L1/L2 cache (10x+ speedup)
|
||
- **NumPy Production**: Leveraging professional BLAS optimizations (another 10x speedup)
|
||
- **Smart Backend System**: Automatic dispatch to optimal implementations
|
||
- **Real ML Applications**: MLP, CNN, Transformer operations using matrix multiplication
|
||
|
||
### 🧠 **Key Learning Outcomes**
|
||
- **Why loops are slow**: Memory access patterns and cache hierarchy matter most
|
||
- **How blocking helps**: Processing data in cache-friendly chunks improves performance
|
||
- **When to use NumPy**: It already has these optimizations (and more) built-in
|
||
- **Systems thinking**: Understanding enables better decisions about when to optimize
|
||
|
||
### SPEED **Performance Spectrum Mastered**
|
||
- **Educational loops**: Algorithm understanding (1000x slower, perfect for learning)
|
||
- **Cache-friendly blocking**: Systems understanding (100x slower, teaches optimization)
|
||
- **NumPy production**: Professional performance (optimal speed, built-in optimizations)
|
||
- **Smart backends**: Engineering understanding (transparent optimization selection)
|
||
|
||
### 🏆 **Practical Skills Developed**
|
||
- Analyze why educational implementations have poor performance
|
||
- Implement cache-friendly algorithms to understand optimization principles
|
||
- Choose NumPy for production while understanding what it's doing internally
|
||
- Build systems that balance educational value with performance requirements
|
||
|
||
### 📊 **Systems Insights Gained**
|
||
- **Educational code serves a purpose**: Understanding algorithms enables optimization intuition
|
||
- **Cache hierarchy dominates performance**: Memory access patterns matter more than computation
|
||
- **Libraries beat custom optimization**: NumPy already has expert-level optimizations
|
||
- **Understanding enables better tools**: You can build smarter systems when you know the principles
|
||
|
||
### TIP **The Free Speedup Philosophy**
|
||
This is the EASIEST optimization in ML systems: same math, better implementation, massive speedups, zero downsides. You implemented loops to understand algorithms. You implemented blocking to understand cache optimization. Now you use NumPy because it has all optimizations built-in. Understanding this spectrum - from educational to production - makes you a superior ML systems engineer who can make informed optimization decisions.
|
||
"""
|
||
|