# ---
# jupyter:
# jupytext:
# text_representation:
# extension: .py
# format_name: percent
# format_version: '1.3'
# jupytext_version: 1.18.1
# kernelspec:
# display_name: Python 3 (ipykernel)
# language: python
# name: python3
# ---
# %%
#| default_exp optimization.acceleration
#| export
# %% [markdown]
"""
# Module 16: Acceleration - Making Models Run Faster
Welcome to Module 16! You're about to master the art of neural network acceleration through vectorization, kernel fusion, and mixed precision training.
## 🔗 Prerequisites & Progress
**You've Built**: Complete training pipeline with profiling capabilities
**You'll Build**: Acceleration techniques including vectorization, operation fusion, and mixed precision
**You'll Enable**: Production-ready optimization for real-world deployment
**Connection Map**:
```
Profiling (Module 15) → Acceleration (Module 16) → Quantization (Module 17)
    (measurement)           (optimization)            (precision reduction)
```
## Learning Objectives
By the end of this module, you will:
1. Implement vectorized operations for maximum throughput
2. Create fused operations to reduce memory bandwidth
3. Build mixed precision training for memory efficiency
4. Understand the relationship between compute and memory bandwidth
5. Analyze acceleration trade-offs in production systems
Let's optimize for speed!
## 📦 Where This Code Lives in the Final Package
**Learning Side:** You work in `modules/16_acceleration/acceleration_dev.py`
**Building Side:** Code exports to `tinytorch.optimization.acceleration`
```python
# How to use this module:
from tinytorch.optimization.acceleration import vectorized_matmul, fused_gelu, MixedPrecisionTrainer
```
**Why this matters:**
- **Learning:** Complete acceleration system in one focused module for deep understanding
- **Production:** Proper organization like PyTorch's torch.amp and torch.jit with optimization components
- **Consistency:** All acceleration operations and mixed precision training in optimization.acceleration
- **Integration:** Works seamlessly with profiling for complete performance optimization
"""
# %%
import numpy as np
import time
from typing import Dict, List, Tuple, Optional, Any, Union
# %% [markdown]
"""
## 1. Introduction - The Performance Challenge
Modern neural networks face two fundamental bottlenecks that limit their speed:
### The Two Enemies of Performance
**1. Compute Bound Operations:**
```
CPU/GPU Cores: [====BUSY====] [====BUSY====] [====BUSY====]
Memory Bus: [---idle---] [---idle---] [---idle---]
When: Matrix multiplication, convolutions
Solution: Vectorization, better algorithms
```
**2. Memory Bound Operations:**
```
CPU/GPU Cores: [--idle--] [--idle--] [--idle--]
Memory Bus: [========SATURATED========]
When: Element-wise operations, small tensors
Solution: Kernel fusion, memory layout optimization
```
### The Roofline Model - Your Performance Compass
Every processor has fundamental limits:
```
Performance │              ┌──────────────────────────────
  (GFLOPS)  │             ╱│   Peak Performance
            │            ╱ │   (Compute Bound Region)
            │           ╱  │
            │          ╱   │
            │         ╱    │   Memory Bound Region
            │        ╱     │   (slope = peak bandwidth)
            │       ╱      │
            └──────╱───────┼─────────────────────────────
                Low        High      Arithmetic Intensity
                                     (FLOPs/Byte)
```
**Key Insight**: Understand where your operations live on this graph to optimize effectively.
### Why This Module Matters
Real-world performance wins:
- **2-5× speedup** from vectorization
- **30-50% memory reduction** from mixed precision
- **2-3× throughput** from kernel fusion
- **10× scaling improvement** for large models
"""
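# %% [markdown]
"""
### 📐 Quick Check: Classifying an Operation on the Roofline

To make the roofline concrete, here is a small back-of-the-envelope sketch. The 50 GFLOP/s and 25 GB/s machine numbers are illustrative assumptions, not measurements of any real chip:

```python
# Roofline back-of-the-envelope: classify an operation as compute- or
# memory-bound by comparing its arithmetic intensity (FLOPs per byte of
# memory traffic) against the machine balance point.

PEAK_GFLOPS = 50.0  # assumed peak compute (illustrative)
PEAK_GBPS = 25.0    # assumed peak memory bandwidth (illustrative)
BALANCE = PEAK_GFLOPS / PEAK_GBPS  # FLOPs/byte where the roofline bends

def arithmetic_intensity(flops, bytes_moved):
    return flops / bytes_moved

# Element-wise add of two float32 vectors of length n:
# 1 FLOP per element, 12 bytes per element (read a, read b, write c).
n = 1_000_000
ai_add = arithmetic_intensity(n, 12 * n)

# Square matmul C = A @ B with float32 N×N matrices:
# 2N³ FLOPs, ~3N² × 4 bytes of traffic (read A, read B, write C).
N = 1024
ai_matmul = arithmetic_intensity(2 * N**3, 3 * N**2 * 4)

for name, ai in [("add", ai_add), ("matmul", ai_matmul)]:
    bound = "compute-bound" if ai > BALANCE else "memory-bound"
    print(f"{name}: {ai:.2f} FLOPs/byte -> {bound}")
```

On these assumed numbers, the element-wise add sits far below the balance point (memory-bound) while the matmul sits far above it (compute-bound), which is exactly why fusion targets the former and BLAS targets the latter.
"""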
# %% nbgrader={"grade": false, "grade_id": "tensor-import", "solution": true}
"""
## 🔗 Module Dependencies
This module REQUIRES completion of:
- Module 01 (Tensor): Foundation data structure we optimize
- Module 03 (Layers): Linear layers for vectorization
- Module 14 (Profiling): Profiler for measuring improvements
**Progressive Building**:
```
Module 01 (Tensor) ──> [This Module: Optimize Tensor operations]
Module 03 (Layers) ──> [This Module: Optimize Linear layers]
Module 14 (Profiling) ──> [This Module: Measure improvements]
```
**What You've Built**:
- Module 01: Tensor (what we optimize)
- Module 03: Linear layers (uses optimized ops)
- Module 14: Profiling (measure improvements)
**What This Module Adds**:
- Vectorized operations (SIMD optimization)
- Kernel fusion (memory efficiency)
- Mixed precision training (memory/speed)
**To verify dependencies are met, run**:
```
python -c "from tinytorch.core.tensor import Tensor; print('✅ Module 01 ready')"
python -c "from tinytorch.core.layers import Linear; print('✅ Module 03 ready')"
python -c "from tinytorch.profiling.profiler import Profiler; print('✅ Module 14 ready')"
```
"""
# Direct imports from previous modules - these MUST exist
# If imports fail, students will get clear educational errors
try:
    from tinytorch.core.tensor import Tensor  # Module 01: What we optimize
except ImportError as e:
    raise ImportError(
        "❌ Module 18 (Acceleration) requires Module 01 (Tensor) to be completed first.\n"
        "   This module optimizes Tensor operations - you need Tensor to exist first!\n"
        "   Please complete Module 01 first, then run 'tito module complete 01'.\n"
        "   Original error: " + str(e)
    ) from e

try:
    from tinytorch.core.layers import Linear  # Module 03: Uses optimized ops
except ImportError as e:
    raise ImportError(
        "❌ Module 18 (Acceleration) requires Module 03 (Layers) to be completed first.\n"
        "   This module optimizes Linear layer operations.\n"
        "   Please complete Module 03 first, then run 'tito module complete 03'.\n"
        "   Original error: " + str(e)
    ) from e
# %% [markdown]
"""
## 2. Foundations - Vectorization: From Loops to Lightning
### The SIMD Revolution
Modern processors can execute **Single Instruction, Multiple Data** operations:
```
Traditional Loop (Scalar):        SIMD Vectorized:
  for i in range(4):              ┌─────┬─────┬─────┬─────┐
      c[i] = a[i] + b[i]          │ALU 0│ALU 1│ALU 2│ALU 3│
                                  └─────┴─────┴─────┴─────┘
  ┌─────┐
  │ ALU │  1 element per cycle    4 elements per cycle
  └─────┘
```
### Memory Access Patterns: The Hidden Performance Killer
```
Sequential Access (FAST):
Memory: [A][B][C][D][E][F][G][H]
Access:  ↓  ↓  ↓  ↓               → Cache friendly

Strided Access (SLOWER):
Memory: [A][ ][B][ ][C][ ][D][ ]
Access:  ↓     ↓     ↓     ↓      → Cache misses

Random Access (SLOWEST):
Memory: [A][B][C][D][E][F][G][H]
Access:  ↓        ↑  ↓        ↑   → Cache chaos
```
### Matrix Multiplication: The King of Vectorization
Matrix multiplication is **perfectly suited** for vectorization:
```
Matrix A (M×K) × Matrix B (K×N) = Matrix C (M×N)
Computation Pattern:
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ a₁₁ a₁₂ a₁₃ a₁₄│ × │ b₁₁ b₁₂ b₁₃ b₁₄│ = │ c₁₁ c₁₂ c₁₃ c₁₄│
│ a₂₁ a₂₂ a₂₃ a₂₄│ │ b₂₁ b₂₂ b₂₃ b₂₄│ │ c₂₁ c₂₂ c₂₃ c₂₄│
│ a₃₁ a₃₂ a₃₃ a₃₄│ │ b₃₁ b₃₂ b₃₃ b₃₄│ │ c₃₁ c₃₂ c₃₃ c₃₄│
│ a₄₁ a₄₂ a₄₃ a₄₄│ │ b₄₁ b₄₂ b₄₃ b₄₄│ │ c₄₁ c₄₂ c₄₃ c₄₄│
└─────────────────┘ └─────────────────┘ └─────────────────┘
For c₁₁: Row₁ · Column₁ = a₁₁×b₁₁ + a₁₂×b₂₁ + a₁₃×b₃₁ + a₁₄×b₄₁
VECTORIZABLE!
```
**Why vectorization wins:**
- **High arithmetic intensity**: 2N³ FLOPs over only ~3N² values of data
- **Predictable memory access**: Sequential row/column reads
- **Parallelizable**: Independent dot products
- **Cache-friendly**: Data reuse in inner loops
"""
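# %% [markdown]
"""
Before implementing `vectorized_matmul`, it helps to see the gap between scalar and vectorized code firsthand. This sketch times a pure-Python triple loop against NumPy's BLAS-backed `@` on a deliberately small problem (absolute timings vary by machine, so no specific speedup is asserted):

```python
import time
import numpy as np

def loop_matmul(a, b):
    # Scalar triple-loop matmul: one multiply-add at a time.
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((m, n), dtype=np.float64)
    for i in range(m):
        for j in range(n):
            s = 0.0
            for p in range(k):
                s += a[i, p] * b[p, j]
            c[i, j] = s
    return c

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64)).astype(np.float32)
b = rng.standard_normal((64, 64)).astype(np.float32)

t0 = time.perf_counter()
c_loop = loop_matmul(a, b)
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
c_blas = a @ b  # BLAS GEMM: SIMD + cache blocking + threading
t_blas = time.perf_counter() - t0

assert np.allclose(c_loop, c_blas, atol=1e-3)  # same math, very different speed
print(f"loop: {t_loop*1e3:.1f} ms   BLAS: {t_blas*1e3:.3f} ms")
```

The two paths compute identical results; all of the difference is in how the hardware is driven.
"""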
# %% nbgrader={"grade": false, "grade_id": "vectorized-matmul", "solution": true}
def vectorized_matmul(a: Tensor, b: Tensor) -> Tensor:
    """
    High-performance matrix multiplication using vectorized operations.

    This implementation leverages optimized BLAS libraries that use:
    - SIMD instructions for parallel computation
    - Cache-blocking for memory efficiency
    - Multi-threading for CPU parallelization

    TODO: Implement production-grade matrix multiplication

    APPROACH:
    1. Validate shapes are compatible for matrix multiplication
    2. Use NumPy's optimized dot product (calls BLAS GEMM)
    3. Return result wrapped in Tensor

    EXAMPLE:
    >>> a = Tensor([[1, 2], [3, 4]])  # 2×2
    >>> b = Tensor([[5, 6], [7, 8]])  # 2×2
    >>> result = vectorized_matmul(a, b)
    >>> print(result.data)
    [[19 22]   # [1×5+2×7, 1×6+2×8] = [19, 22]
     [43 50]]  # [3×5+4×7, 3×6+4×8] = [43, 50]

    PERFORMANCE CHARACTERISTICS:
    - Time Complexity: O(N³) but highly optimized
    - Space Complexity: O(N²) for result
    - Arithmetic Intensity: 2N³ FLOPs / 12N² bytes ≈ N/6 FLOPs per byte
      (for float32; grows with N, so large matmuls become compute-bound)

    HINTS:
    - Check a.shape[-1] == b.shape[-2] for inner dimension match
    - Use np.matmul() for batch support and optimization
    - Trust BLAS to handle the vectorization magic
    """
    ### BEGIN SOLUTION
    # Input validation for matrix multiplication
    if len(a.shape) < 2 or len(b.shape) < 2:
        raise ValueError(
            f"Matrix multiplication requires 2D+ tensors, got shapes {a.shape} and {b.shape}. "
            f"💡 HINT: Use reshape() to add dimensions if needed."
        )
    if a.shape[-1] != b.shape[-2]:
        raise ValueError(
            f"Matrix multiplication shape mismatch: {a.shape} @ {b.shape}. "
            f"Inner dimensions must match: a.shape[-1]={a.shape[-1]} != b.shape[-2]={b.shape[-2]}. "
            f"💡 HINT: For A@B, A's columns must equal B's rows."
        )
    # Use NumPy's highly optimized matrix multiplication.
    # This calls BLAS GEMM (General Matrix Multiply), which uses:
    # - SIMD vectorization for parallel arithmetic
    # - Cache blocking for memory efficiency
    # - Multi-threading on multi-core systems
    result_data = np.matmul(a.data, b.data)
    return Tensor(result_data)
    ### END SOLUTION
# %% nbgrader={"grade": true, "grade_id": "test-vectorized-matmul", "locked": true, "points": 10}
def test_unit_vectorized_matmul():
    """🔬 Test vectorized matrix multiplication implementation."""
    print("🔬 Unit Test: Vectorized Matrix Multiplication...")
    # Test basic 2D multiplication
    a = Tensor([[1, 2], [3, 4]])
    b = Tensor([[5, 6], [7, 8]])
    result = vectorized_matmul(a, b)
    expected = np.array([[19, 22], [43, 50]])
    assert np.allclose(result.data, expected), f"Basic matmul failed: expected {expected}, got {result.data}"
    # Test batch multiplication (3D tensors)
    batch_size, m, k, n = 2, 3, 4, 5
    a_batch = Tensor(np.random.randn(batch_size, m, k))
    b_batch = Tensor(np.random.randn(batch_size, k, n))
    result_batch = vectorized_matmul(a_batch, b_batch)
    assert result_batch.shape == (batch_size, m, n), f"Wrong batch shape: {result_batch.shape}"
    # Test broadcasting (different batch dimensions)
    a_single = Tensor(np.random.randn(m, k))
    b_batch = Tensor(np.random.randn(batch_size, k, n))
    result_broadcast = vectorized_matmul(a_single, b_batch)
    assert result_broadcast.shape == (batch_size, m, n), f"Broadcasting failed: {result_broadcast.shape}"
    # Test error cases
    try:
        vectorized_matmul(Tensor([1, 2, 3]), Tensor([4, 5]))  # 1D tensors
        assert False, "Should reject 1D tensors"
    except ValueError as e:
        assert "2D+" in str(e)
    try:
        vectorized_matmul(Tensor([[1, 2]]), Tensor([[1], [2], [3]]))  # Shape mismatch
        assert False, "Should reject incompatible shapes"
    except ValueError as e:
        assert "shape mismatch" in str(e).lower()
    print("✅ vectorized_matmul works correctly!")
test_unit_vectorized_matmul()
# %% [markdown]
"""
## 3. Implementation - Kernel Fusion: Eliminating Memory Bottlenecks
### The Memory Bandwidth Crisis
Consider this innocent-looking computation: `y = gelu(x * weight + bias)`
**Naive Implementation (Memory Intensive):**
```
Step 1: temp1 = x * weight → Read 8GB, Write 4GB
Step 2: temp2 = temp1 + bias → Read 4GB, Write 4GB
Step 3: y = gelu(temp2) → Read 4GB, Write 4GB
Total: 28GB memory traffic!
```
**Fused Implementation (Memory Efficient):**
```
Single Step: y = gelu(x * weight + bias) → Read 8GB, Write 4GB
Total: 12GB memory traffic!
~60% memory traffic reduction!
```
### Understanding GELU: The Smooth Activation
GELU (Gaussian Error Linear Unit) is used in transformers because it's **smooth** (differentiable everywhere):
```
Activation Functions Compared:
  ReLU:    max(0, x)          (kinked at 0: discontinuous gradient)
  GELU:    x · Φ(x)           (smooth everywhere, non-saturating for x > 0)
  Sigmoid: 1 / (1 + e⁻ˣ)      (smooth, but saturates for large |x|)
```
**GELU Formula**: `GELU(x) = x * Φ(x)` where Φ is the standard normal CDF
**Fast Approximation**: `GELU(x) ≈ 0.5 * x * (1 + tanh(√(2/π) * (x + 0.044715 * x³)))`
### Kernel Fusion Strategy
```
Unfused Operations:              Fused Operation:
┌─────────────────┐              ┌─────────────────┐
│ x³ computation  │ → temp1      │                 │
└─────────────────┘              │                 │
┌─────────────────┐              │                 │
│ polynomial part │ → temp2      │  All operations │
└─────────────────┘              │  combined in a  │
┌─────────────────┐              │  single kernel  │
│ tanh computation│ → temp3      │                 │
└─────────────────┘              │                 │
┌─────────────────┐              │                 │
│ final multiply  │ → result     │                 │
└─────────────────┘              └─────────────────┘
5 memory round-trips             1 memory round-trip
```
"""
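# %% [markdown]
"""
Before fusing anything, it is worth verifying that the tanh-based formula is a faithful stand-in for the exact definition. This sketch compares the approximation against GELU(x) = x · Φ(x), computed exactly via `math.erf`:

```python
import math
import numpy as np

def gelu_exact(x):
    # Exact GELU: x * Phi(x), with the normal CDF expressed through erf.
    phi = np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x])
    return x * phi

def gelu_tanh(x):
    # Fast tanh approximation (the one implemented in this module).
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + np.tanh(c * (x + 0.044715 * x**3)))

x = np.linspace(-5.0, 5.0, 201)
max_err = np.max(np.abs(gelu_exact(x) - gelu_tanh(x)))
print(f"max |exact - approx| on [-5, 5]: {max_err:.2e}")
assert max_err < 1e-2  # the approximation is accurate to roughly 1e-3
```

The approximation error is orders of magnitude smaller than typical FP16 rounding, so fusing the tanh form sacrifices essentially nothing.
"""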
# %% nbgrader={"grade": false, "grade_id": "fused-gelu", "solution": true}
def fused_gelu(x: Tensor) -> Tensor:
    """
    Fused GELU activation that combines all operations in a single kernel.

    GELU combines the benefits of ReLU and sigmoid:
    - Smooth everywhere (unlike ReLU's discontinuity at 0)
    - Non-saturating for positive values (unlike sigmoid)
    - Probabilistic interpretation: x * P(X ≤ x) where X ~ N(0,1)

    Mathematical Definition:
    GELU(x) = x * Φ(x) where Φ(x) is the standard normal CDF

    Fast Approximation (used here):
    GELU(x) ≈ 0.5 * x * (1 + tanh(√(2/π) * (x + 0.044715 * x³)))

    TODO: Implement fused GELU to minimize memory bandwidth

    APPROACH:
    1. Compute all intermediate values in a single expression
    2. Avoid creating named temporary arrays
    3. Let NumPy's broadcasting handle vectorization

    EXAMPLE:
    >>> x = Tensor([-2, -1, 0, 1, 2])
    >>> result = fused_gelu(x)
    >>> print(result.data)
    [-0.04550026 -0.15865526  0.          0.8413447   1.9544997 ]
    # Notice: smooth transition through 0, positive bias

    MEMORY EFFICIENCY:
    - Unfused: 7 temporary arrays × input_size × 4 bytes
    - Fused: minimal temporaries, direct computation
    - Bandwidth reduction: up to ~80% for memory-bound operations

    HINTS:
    - Use np.sqrt(2.0 / np.pi) for the constant
    - Keep the entire expression in one statement to avoid named temporaries
    - NumPy evaluates each operator with a vectorized kernel
    """
    ### BEGIN SOLUTION
    # Mathematical constant for GELU approximation
    sqrt_2_over_pi = np.sqrt(2.0 / np.pi)
    # Fused GELU computation - all operations in a single expression.
    # This avoids named intermediate Tensors and the Python overhead of
    # wrapping each step. (NumPy still allocates per-operator temporaries;
    # true single-kernel fusion needs a compiler such as numba or numexpr.)
    result_data = 0.5 * x.data * (
        1.0 + np.tanh(sqrt_2_over_pi * (x.data + 0.044715 * x.data**3))
    )
    return Tensor(result_data)
    ### END SOLUTION
# %% nbgrader={"grade": true, "grade_id": "test-fused-gelu", "locked": true, "points": 10}
def test_unit_fused_gelu():
    """🔬 Test fused GELU activation implementation."""
    print("🔬 Unit Test: Fused GELU...")
    # Test basic properties
    x = Tensor([-3, -1, 0, 1, 3])
    result = fused_gelu(x)
    # GELU(0) = 0 (exact property)
    assert abs(result.data[2]) < 1e-6, f"GELU(0) should be 0, got {result.data[2]}"
    # GELU is smooth and increasing
    assert result.data[4] > result.data[3] > result.data[2], "GELU should be increasing"
    # GELU has positive bias (unlike ReLU)
    assert result.data[3] > 0.8, "GELU(1) should be close to 1"
    assert result.data[1] > -0.2, "GELU(-1) should be slightly negative"
    # Test numerical stability with extreme values
    x_extreme = Tensor([-10, -5, 0, 5, 10])
    result_extreme = fused_gelu(x_extreme)
    assert not np.any(np.isnan(result_extreme.data)), "No NaN values allowed"
    assert not np.any(np.isinf(result_extreme.data)), "No infinite values allowed"
    # Test large tensor processing
    x_large = Tensor(np.random.randn(1000, 1000).astype(np.float32))
    result_large = fused_gelu(x_large)
    assert result_large.shape == x_large.shape, "Shape preservation failed"
    assert result_large.data.dtype == np.float32, "Data type preservation failed"
    # Test that positive inputs are mostly preserved (GELU ≈ x for large positive x)
    x_positive = Tensor([5.0])
    result_positive = fused_gelu(x_positive)
    assert result_positive.data[0] > 4.9, "Large positive values should be nearly preserved"
    print("✅ fused_gelu works correctly!")
test_unit_fused_gelu()
# %% [markdown]
"""
### 🔬 Performance Analysis: Measuring Fusion Benefits
Let's quantify the impact of kernel fusion by comparing fused vs unfused implementations.
"""
# %% nbgrader={"grade": false, "grade_id": "unfused-gelu", "solution": true}
def unfused_gelu(x: Tensor) -> Tensor:
    """
    Deliberately unfused GELU implementation for performance comparison.

    This version creates multiple intermediate tensors to simulate
    the memory bandwidth overhead of unfused operations.

    TODO: Implement GELU with explicit intermediate steps

    APPROACH:
    1. Break computation into individual steps
    2. Create temporary Tensor objects for each step
    3. This simulates real memory allocation overhead

    PERFORMANCE IMPACT:
    - Creates 7 temporary arrays
    - Each array allocation/deallocation has overhead
    - More memory bandwidth usage
    - Potential cache misses between operations
    """
    ### BEGIN SOLUTION
    # Unfused version - creates many intermediate arrays
    sqrt_2_over_pi = np.sqrt(2.0 / np.pi)
    # Each operation creates a temporary array (simulating kernel launches)
    temp1 = Tensor(x.data**3)                      # x³
    temp2 = Tensor(0.044715 * temp1.data)          # 0.044715 * x³
    temp3 = Tensor(x.data + temp2.data)            # x + 0.044715 * x³
    temp4 = Tensor(sqrt_2_over_pi * temp3.data)    # √(2/π) * (...)
    temp5 = Tensor(np.tanh(temp4.data))            # tanh(...)
    temp6 = Tensor(1.0 + temp5.data)               # 1 + tanh(...)
    temp7 = Tensor(x.data * temp6.data)            # x * (1 + tanh(...))
    result = Tensor(0.5 * temp7.data)              # 0.5 * x * (...)
    return result
    ### END SOLUTION
# %% nbgrader={"grade": true, "grade_id": "test-fusion-speedup", "locked": true, "points": 10}
def test_unit_fusion_speedup():
    """🔬 Measure the performance impact of kernel fusion."""
    print("🔬 Unit Test: Kernel Fusion Performance Impact...")
    # Create moderately large tensor for meaningful timing
    size = 2000
    x = Tensor(np.random.randn(size, size).astype(np.float32))
    warmup_iterations = 2
    timing_iterations = 5
    # Warmup both implementations
    for _ in range(warmup_iterations):
        _ = unfused_gelu(x)
        _ = fused_gelu(x)
    # Time unfused version
    start = time.time()
    for _ in range(timing_iterations):
        result_unfused = unfused_gelu(x)
    unfused_time = time.time() - start
    # Time fused version
    start = time.time()
    for _ in range(timing_iterations):
        result_fused = fused_gelu(x)
    fused_time = time.time() - start
    # Verify numerical correctness
    assert np.allclose(result_unfused.data, result_fused.data, atol=1e-6), \
        "Fused and unfused implementations must be numerically equivalent"
    # Calculate performance metrics
    speedup = unfused_time / fused_time if fused_time > 0 else 1.0
    unfused_per_elem = (unfused_time / timing_iterations) / (size * size) * 1e9  # ns per element
    fused_per_elem = (fused_time / timing_iterations) / (size * size) * 1e9
    print(f"📊 Kernel Fusion Performance Analysis:")
    print(f"   Tensor size: {size}×{size} = {size*size:,} elements")
    print(f"   Unfused time: {unfused_time/timing_iterations*1000:.2f} ms")
    print(f"   Fused time: {fused_time/timing_iterations*1000:.2f} ms")
    print(f"   Speedup: {speedup:.2f}× faster")
    print(f"   Per-element: {unfused_per_elem:.1f} ns → {fused_per_elem:.1f} ns")
    # Memory bandwidth estimate
    bytes_per_elem = 4  # float32
    unfused_memory_ops = 7  # 7 intermediate arrays
    fused_memory_ops = 2  # read input, write output
    unfused_bandwidth = (unfused_memory_ops * size * size * bytes_per_elem) / (unfused_time / timing_iterations) / 1e9
    fused_bandwidth = (fused_memory_ops * size * size * bytes_per_elem) / (fused_time / timing_iterations) / 1e9
    print(f"   Memory efficiency: {unfused_memory_ops} → {fused_memory_ops} memory ops")
    print(f"   Effective bandwidth: {unfused_bandwidth:.1f} → {fused_bandwidth:.1f} GB/s")
    # Interpret results
    if speedup > 1.5:
        print("🚀 Excellent! Kernel fusion providing significant speedup")
    elif speedup > 1.1:
        print("✅ Good! Kernel fusion providing measurable benefit")
    else:
        print("⚠️ Limited speedup - may be compute-bound or small tensor size")
    print("✅ Fusion performance analysis completed!")
test_unit_fusion_speedup()
# %% [markdown]
"""
## 4. Integration - Mixed Precision Training: Memory and Speed
### The Mixed Precision Revolution
Modern GPUs (like V100, A100) have specialized **Tensor Cores** that can perform FP16 operations much faster than FP32:
```
Performance Comparison (Theoretical Peak):
┌─────────────────┬────────────────┬────────────────┐
│ Precision       │ V100 TFLOPS    │ A100 TFLOPS    │
├─────────────────┼────────────────┼────────────────┤
│ FP32 (float)    │ 15.7           │ 19.5           │
│ FP16 (half)     │ 125.0          │ 312.0          │
│ Speedup         │ 8×             │ 16×            │
└─────────────────┴────────────────┴────────────────┘
```
### The Challenge: FP16 Precision Limitations
FP16 has a much smaller range than FP32:
```
FP32 (32-bit):                      FP16 (16-bit):
┌──────┬───────┬───────────┐        ┌──────┬───────┬───────────┐
│ Sign │ 8-bit │ 23-bit    │        │ Sign │ 5-bit │ 10-bit    │
│ bit  │ Exp   │ Mantissa  │        │ bit  │ Exp   │ Mantissa  │
└──────┴───────┴───────────┘        └──────┴───────┴───────────┘
Range: ±3.4 × 10³⁸                  Range: ±6.5 × 10⁴
Precision: ~7 decimal digits        Precision: ~3 decimal digits

Problem: Small gradients (below ~6e-5, FP16's smallest normal value)
lose precision or flush to ZERO in FP16!
```
### The Solution: Automatic Loss Scaling
```
Without Loss Scaling:                With Loss Scaling (×1024):
Loss = 0.0001                        Loss = 0.0001
  ↓ backward                           ↓ scale: 0.0001 × 1024 = 0.1024
Gradients = 0.00001                    ↓ backward
  ↓ convert to FP16                  Scaled gradients = 0.01024
Gradients = 0.0 (UNDERFLOW!)           ↓ convert to FP16: 0.01024 ✓
  ↓                                    ↓ unscale: 0.01024 / 1024 = 0.00001
No learning!                         Successful learning!
```
### Mixed Precision Memory Benefits
```
Model Component Breakdown:
┌─────────────────┬─────────────┬─────────────┬─────────────┐
│ Component       │ FP32 Memory │ FP16 Memory │ Savings     │
├─────────────────┼─────────────┼─────────────┼─────────────┤
│ Parameters      │ 4N          │ 4N          │ 0%          │
│ Gradients       │ 4N          │ 2N          │ 50%         │
│ Activations     │ 4A          │ 2A          │ 50%         │
│ Optimizer State │ 8N          │ 8N          │ 0%          │
├─────────────────┼─────────────┼─────────────┼─────────────┤
│ Total (typical) │ ~20N        │ ~16N        │ 20%         │
│ Activation-heavy│ ~40N        │ ~24N        │ 40%         │
└─────────────────┴─────────────┴─────────────┴─────────────┘
N = parameter count, A = activation memory
```
"""
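# %% [markdown]
"""
The underflow problem and the loss-scaling fix can both be demonstrated in a few lines of NumPy, since `np.float16` is a real IEEE half-precision type:

```python
import numpy as np

# A tiny but meaningful FP32 gradient, below FP16's subnormal floor (~6e-8)
grad = 1e-8
naive = np.float16(grad)  # direct cast underflows to zero
print("naive FP16 cast:", naive)

scale = 1024.0
scaled = np.float16(grad * scale)   # 1.024e-5 is representable in FP16
recovered = float(scaled) / scale   # unscale back in FP32
print("scaled FP16:", scaled, "-> recovered:", recovered)

assert naive == 0.0                          # underflow without scaling
assert abs(recovered - grad) / grad < 0.01   # scaling preserves ~1% accuracy
```

This is the whole story of loss scaling in miniature: multiply before the precision-losing step, divide after it.
"""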
# %% nbgrader={"grade": false, "grade_id": "mixed-precision-trainer", "solution": true}
class MixedPrecisionTrainer:
    """
    Mixed precision trainer with automatic loss scaling.

    Implements the same pattern as PyTorch's Automatic Mixed Precision (AMP):
    1. Forward pass in FP16 for speed and memory efficiency
    2. Loss scaling to prevent gradient underflow
    3. Gradient computation and unscaling
    4. Parameter updates in FP32 for numerical stability

    The key insight: keep different parts of training in optimal precision.
    """

    def __init__(self, model, optimizer, loss_scale: float = 1024.0, max_loss_scale: float = 65536.0):
        """
        Initialize mixed precision training infrastructure.

        TODO: Set up automatic loss scaling and overflow detection

        APPROACH:
        1. Store model and optimizer references
        2. Initialize dynamic loss scaling parameters
        3. Set up overflow detection and scale adjustment logic

        Args:
            model: Neural network model
            optimizer: Parameter optimizer (SGD, Adam, etc.)
            loss_scale: Initial scaling factor for gradients
            max_loss_scale: Maximum allowed loss scale

        LOSS SCALING STRATEGY:
        - Start with a reasonable scale (1024)
        - Increase gradually if no overflow (better precision)
        - Decrease immediately on overflow (stability)
        - This balances numerical precision with training stability

        HINTS:
        - Track consecutive successful steps for scale increases
        - Use exponential backoff on overflow detection
        - Keep scale within reasonable bounds [1, 65536]
        """
        ### BEGIN SOLUTION
        self.model = model
        self.optimizer = optimizer
        # Loss scaling parameters
        self.loss_scale = loss_scale
        self.max_loss_scale = max_loss_scale
        self.min_loss_scale = 1.0
        # Dynamic scaling parameters
        self.scale_growth_factor = 2.0    # Multiply by 2 when increasing
        self.scale_backoff_factor = 0.5   # Divide by 2 when decreasing
        self.growth_interval = 2000       # Steps between scale increases
        self.steps_since_last_scale_update = 0
        # Overflow tracking
        self.overflow_detected = False
        ### END SOLUTION
    def scale_loss(self, loss: Tensor) -> Tensor:
        """
        Scale loss to prevent gradient underflow in FP16.

        The fundamental challenge: FP16's smallest normal value is ~6e-5.
        Much smaller gradients (common in deep networks) lose precision
        or flush to zero without scaling.

        TODO: Apply loss scaling for mixed precision stability

        APPROACH:
        1. Multiply loss by current scale factor
        2. This amplifies gradients proportionally
        3. Return scaled loss for backward pass

        MATHEMATICAL INSIGHT:
        If loss = 1e-6 and scale = 1024:
            scaled_loss = 1e-6 × 1024 = 1.024e-3
        After the backward pass:
            scaled_gradients = 1024 × gradients
        These larger gradients survive FP16 conversion!

        EXAMPLE:
        >>> trainer = MixedPrecisionTrainer(model, optimizer)
        >>> loss = Tensor([0.0001])  # Small loss
        >>> scaled = trainer.scale_loss(loss)
        >>> print(scaled.data)  # [0.1024] (0.0001 × 1024)
        """
        ### BEGIN SOLUTION
        # Scale the loss to amplify gradients.
        # This prevents gradient underflow in FP16 arithmetic.
        scaled_data = loss.data * self.loss_scale
        return Tensor(scaled_data)
        ### END SOLUTION
    def unscale_gradients(self, parameters: List[Tensor]) -> bool:
        """
        Unscale gradients and detect overflow from FP16 conversion.

        After the backward pass on a scaled loss, gradients are scaled too.
        We must unscale them AND check for overflow/underflow.

        TODO: Implement gradient unscaling with overflow detection

        APPROACH:
        1. Divide all gradients by loss scale (restore original magnitude)
        2. Check for inf/nan values (indicates FP16 overflow)
        3. Return True if gradients are valid, False if overflow detected

        OVERFLOW DETECTION:
        inf/nan in gradients indicates:
        - Gradient magnitude too large for FP16
        - Numerical instability in computation
        - Loss scale too aggressive

        When overflow occurs:
        - Skip the parameter update (unstable gradients)
        - Reduce loss scale for the next iteration
        - Continue training with the lower scale

        HINTS:
        - Use np.isfinite() to detect inf/nan efficiently
        - Process all parameters even if overflow is found
        - Set self.overflow_detected flag for scale adjustment
        """
        ### BEGIN SOLUTION
        self.overflow_detected = False
        # Unscale all gradients and check for overflow
        for param in parameters:
            if param.grad is not None:
                # Unscale gradients to original magnitude
                param.grad.data = param.grad.data / self.loss_scale
                # Check for overflow/underflow (inf/nan values)
                if not np.all(np.isfinite(param.grad.data)):
                    self.overflow_detected = True
                    # Continue processing to unscale all gradients
        return not self.overflow_detected
        ### END SOLUTION
    def update_loss_scale(self):
        """
        Dynamically adjust loss scale based on training stability.

        Implements the "Goldilocks" principle for loss scaling:
        - Too low: precision loss from small gradients
        - Too high: overflow and instability
        - Just right: maximum precision without overflow

        TODO: Implement adaptive loss scale adjustment

        APPROACH:
        1. If overflow detected: reduce scale immediately (stability)
        2. If no overflow for many steps: increase scale (precision)
        3. Keep scale within reasonable bounds

        SCALING STRATEGY:
        - Aggressive reduction on overflow (×0.5)
        - Conservative growth during stability (×2 every 2000 steps)
        - This favors stability over maximum precision

        WHY THIS WORKS:
        - Most training is stable (gradual scale increase)
        - Occasional instability (rapid scale decrease)
        - Converges to an appropriate scale for the current training phase
        """
        ### BEGIN SOLUTION
        if self.overflow_detected:
            # Immediately reduce scale on overflow
            self.loss_scale = max(
                self.min_loss_scale,
                self.loss_scale * self.scale_backoff_factor
            )
            self.steps_since_last_scale_update = 0
        else:
            # Gradually increase scale if stable
            self.steps_since_last_scale_update += 1
            if self.steps_since_last_scale_update >= self.growth_interval:
                self.loss_scale = min(
                    self.max_loss_scale,
                    self.loss_scale * self.scale_growth_factor
                )
                self.steps_since_last_scale_update = 0
        ### END SOLUTION
    def train_step(self, batch: Tuple[Tensor, Tensor]) -> Dict[str, float]:
        """
        Execute a complete mixed precision training step.

        Orchestrates the entire mixed precision training process:
        1. Forward pass (FP16 in a real implementation)
        2. Loss computation and scaling
        3. Backward pass on scaled loss
        4. Gradient unscaling and overflow detection
        5. Conditional parameter update
        6. Loss scale adjustment

        TODO: Implement end-to-end mixed precision training step

        APPROACH:
        1. Clear gradients from the previous step
        2. Forward pass through model
        3. Compute and scale loss
        4. Backward pass to compute scaled gradients
        5. Unscale gradients and check for overflow
        6. Update parameters only if no overflow
        7. Adjust loss scale based on stability

        CRITICAL INSIGHT:
        Skip parameter updates on overflow! Unstable gradients
        would move parameters in the wrong direction.

        RETURN FORMAT:
        Dictionary with training metrics:
        - loss: unscaled loss value
        - loss_scale: current scaling factor
        - overflow: whether overflow occurred
        - gradients_valid: whether the update was applied

        HINTS:
        - Use self.optimizer.zero_grad() to clear gradients
        - Get parameters with gradients for unscaling
        - Only call optimizer.step() if gradients are valid
        """
        ### BEGIN SOLUTION
        inputs, targets = batch
        # Clear gradients from previous step
        self.optimizer.zero_grad()
        # Forward pass (would use FP16 autocast in a real implementation).
        # For simulation, we work in FP32 but apply the scaling principles.
        outputs = self.model(inputs)
        # Compute loss (unscaled)
        loss = self._compute_loss(outputs, targets)
        # Scale loss for mixed precision
        scaled_loss = self.scale_loss(loss)
        # Backward pass on scaled loss
        scaled_loss.backward()
        # Get all parameters with gradients
        parameters = [p for p in self.model.parameters() if p.grad is not None]
        # Unscale gradients and detect overflow
        gradients_valid = self.unscale_gradients(parameters)
        # Update parameters only if no overflow
        if gradients_valid:
            self.optimizer.step()
        # Adjust loss scale based on stability
        self.update_loss_scale()
        # Return training metrics
        return {
            'loss': loss.data.item() if hasattr(loss.data, 'item') else float(loss.data),
            'loss_scale': self.loss_scale,
            'overflow': self.overflow_detected,
            'gradients_valid': gradients_valid
        }
        ### END SOLUTION
    def _compute_loss(self, outputs: Tensor, targets: Tensor) -> Tensor:
        """Simple MSE loss for demonstration purposes."""
        diff = Tensor(outputs.data - targets.data)
        return Tensor(np.mean(diff.data**2))
# %% nbgrader={"grade": true, "grade_id": "test-mixed-precision", "locked": true, "points": 15}
def test_unit_mixed_precision():
    """🔬 Test mixed precision training components comprehensively."""
    print("🔬 Unit Test: Mixed Precision Training...")
    # Create mock model and optimizer for testing
    class MockModel:
        def __init__(self):
            self.weight = Tensor(np.random.randn(10, 5).astype(np.float32))
            self.weight.grad = None
        def __call__(self, x):
            return x.matmul(self.weight)
        def parameters(self):
            return [self.weight]
    class MockOptimizer:
        def __init__(self, params):
            self.params = params
            self.updates_applied = 0
        def zero_grad(self):
            for p in self.params:
                p.grad = None
        def step(self):
            for p in self.params:
                if p.grad is not None:
                    p.data = p.data - 0.01 * p.grad.data
            self.updates_applied += 1
    # Initialize mixed precision trainer
    model = MockModel()
    optimizer = MockOptimizer(model.parameters())
    trainer = MixedPrecisionTrainer(model, optimizer, loss_scale=1024.0)
    # Test 1: Loss scaling
    print("   Testing loss scaling...")
    loss = Tensor([0.001])
    scaled_loss = trainer.scale_loss(loss)
    expected_scaled = 0.001 * 1024.0
    assert np.isclose(scaled_loss.data[0], expected_scaled), \
        f"Loss scaling failed: expected {expected_scaled}, got {scaled_loss.data[0]}"
    # Test 2: Gradient unscaling (normal case)
    print("   Testing gradient unscaling...")
    model.weight.grad = Tensor(np.full((10, 5), 1024.0))  # Simulate scaled gradients
    valid = trainer.unscale_gradients([model.weight])
    assert valid, "Should detect valid gradients"
    assert np.allclose(model.weight.grad.data, 1.0), "Gradient unscaling failed"
    # Test 3: Overflow detection
    print("   Testing overflow detection...")
    model.weight.grad = Tensor(np.full((10, 5), np.inf))  # Simulate overflow
    valid = trainer.unscale_gradients([model.weight])
    assert not valid, "Should detect overflow"
    assert trainer.overflow_detected, "Overflow flag not set"
    # Test 4: Loss scale adjustment after overflow
    print("   Testing loss scale adjustment...")
    initial_scale = trainer.loss_scale
    trainer.update_loss_scale()  # Should reduce scale due to overflow
    assert trainer.loss_scale < initial_scale, \
        f"Scale should decrease after overflow: {initial_scale} → {trainer.loss_scale}"
    # Test 5: Loss scale increase during stability
    print("   Testing loss scale increase...")
    trainer.overflow_detected = False
    trainer.steps_since_last_scale_update = 2000  # Simulate stable training
    scale_before = trainer.loss_scale
    trainer.update_loss_scale()
    assert trainer.loss_scale > scale_before, "Scale should increase during stability"
    # Test 6: End-to-end training step
    print("   Testing complete training step...")
    inputs = Tensor(np.random.randn(8, 10).astype(np.float32))
    targets = Tensor(np.random.randn(8, 5).astype(np.float32))
    initial_updates = optimizer.updates_applied
    metrics = trainer.train_step((inputs, targets))
    # Verify metrics structure
    required_keys = ['loss', 'loss_scale', 'overflow', 'gradients_valid']
    for key in required_keys:
        assert key in metrics, f"Missing metric: {key}"
    # Verify loss is reasonable
    assert isinstance(metrics['loss'], (int, float)), "Loss should be numeric"
    assert metrics['loss'] >= 0, "Loss should be non-negative"
    # Verify loss scale is positive
    assert metrics['loss_scale'] > 0, "Loss scale should be positive"
    print("✅ Mixed precision training works correctly!")
test_unit_mixed_precision()
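The halve-on-overflow / double-after-stability policy that these tests exercise can be isolated into a tiny state machine. This is a sketch, not the trainer itself; the 2000-step growth interval mirrors the stability window assumed in the test above:

```python
class LossScaleSim:
    """Minimal dynamic loss-scale state machine (illustrative sketch)."""
    def __init__(self, scale=1024.0, growth_interval=2000):
        self.scale = scale
        self.growth_interval = growth_interval
        self.stable_steps = 0

    def update(self, overflow: bool):
        if overflow:
            self.scale *= 0.5        # back off immediately on inf/nan gradients
            self.stable_steps = 0
        else:
            self.stable_steps += 1
            if self.stable_steps >= self.growth_interval:
                self.scale *= 2.0    # grow cautiously after sustained stability
                self.stable_steps = 0

sim = LossScaleSim()
sim.update(overflow=True)
print(sim.scale)                     # 512.0
for _ in range(2000):
    sim.update(overflow=False)
print(sim.scale)                     # 1024.0
```

Backing off immediately but growing slowly keeps a single bad step from derailing training, while the scale still recovers once gradients are healthy again.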
# %% [markdown]
"""
## 5. Systems Analysis - Performance Scaling Patterns
Let's analyze how our acceleration techniques perform across different scenarios and understand their scaling characteristics.
"""
# %% nbgrader={"grade": false, "grade_id": "analyze-vectorization", "solution": true}
def analyze_vectorization_scaling():
"""📊 Analyze vectorization performance across different tensor sizes."""
print("📊 Analyzing vectorization scaling behavior...")
# Test sizes spanning different cache regimes
sizes = [64, 128, 256, 512, 1024, 2048]
print("\n🔍 Vectorization Scaling Analysis:")
print("┌─────────┬─────────────┬─────────────┬─────────────┬─────────────┐")
print("│ Size │ Time (ms) │ GFLOPS │ Bandwidth │ Efficiency │")
print("│ │ │ │ (GB/s) │ (% of peak) │")
print("├─────────┼─────────────┼─────────────┼─────────────┼─────────────┤")
for size in sizes:
# Create test matrices
a = Tensor(np.random.randn(size, size).astype(np.float32))
b = Tensor(np.random.randn(size, size).astype(np.float32))
# Warm up
for _ in range(2):
_ = vectorized_matmul(a, b)
# Time vectorized implementation
iterations = max(1, 100 // (size // 64)) # Fewer iterations for larger sizes
start = time.time()
for _ in range(iterations):
result = vectorized_matmul(a, b)
elapsed = (time.time() - start) / iterations
# Calculate performance metrics
flops = 2 * size**3 # 2N³ FLOPs for matrix multiplication
gflops = flops / (elapsed * 1e9)
bytes_accessed = 3 * size * size * 4 # 3 matrices × size² × 4 bytes
bandwidth = bytes_accessed / (elapsed * 1e9)
# Estimate efficiency (rough baseline: modern CPU ~100-500 GFLOPS peak)
estimated_peak_gflops = 200 # Conservative estimate
efficiency = min(100, gflops / estimated_peak_gflops * 100)
print(f"{size:6d}{elapsed*1000:9.2f}{gflops:9.1f}{bandwidth:9.1f}{efficiency:9.1f}")
print("└─────────┴─────────────┴─────────────┴─────────────┴─────────────┘")
print(f"\n💡 Vectorization insights:")
print(f" • Small matrices: Limited by overhead and cache effects")
print(f" • Medium matrices: Sweet spot for cache reuse")
print(f" • Large matrices: Memory bandwidth becomes limiting factor")
print(f" • BLAS libraries automatically optimize for each size regime")
print("🚀 Vectorization effectiveness depends on problem size and hardware")
analyze_vectorization_scaling()
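To see why vectorization matters at all, it helps to race a pure-Python triple loop against the BLAS-backed `@` operator. This sketch is independent of `vectorized_matmul`; the size of 64 is chosen only so the naive loop finishes quickly:

```python
import time
import numpy as np

def naive_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Triple-loop reference: same 2N³ FLOPs, but no SIMD, blocking, or BLAS."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2
    out = np.zeros((n, m), dtype=np.float32)
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[i, p] * b[p, j]
            out[i, j] = acc
    return out

n = 64  # small enough that the pure-Python loop finishes in a moment
a = np.random.randn(n, n).astype(np.float32)
b = np.random.randn(n, n).astype(np.float32)

t0 = time.time(); slow = naive_matmul(a, b); t_naive = time.time() - t0
t0 = time.time(); fast = a @ b; t_blas = time.time() - t0

assert np.allclose(slow, fast, atol=1e-2)
print(f"naive: {t_naive*1e3:.1f} ms, BLAS: {t_blas*1e3:.3f} ms, "
      f"speedup ≈ {t_naive / max(t_blas, 1e-9):.0f}×")
```

Even at this tiny size the BLAS path is typically orders of magnitude faster, and the gap widens with size until memory bandwidth takes over, as the table above shows.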
# %% nbgrader={"grade": false, "grade_id": "analyze-arithmetic-intensity", "solution": true}
def analyze_arithmetic_intensity():
"""📊 Demonstrate the roofline model with different operations."""
print("📊 Analyzing arithmetic intensity patterns...")
size = 1024
iterations = 10
operations = []
# Create test data
x = Tensor(np.random.randn(size, size).astype(np.float32))
y = Tensor(np.random.randn(size, size).astype(np.float32))
print("\n🎯 Arithmetic Intensity Analysis:")
print("┌─────────────────────┬─────────┬─────────────┬─────────────┬─────────────┐")
print("│ Operation │ AI │ Time (ms) │ GFLOPS │ GB/s │")
print("│ │(FLOPs/B)│ │ │ │")
print("├─────────────────────┼─────────┼─────────────┼─────────────┼─────────────┤")
# 1. Element-wise addition (very low arithmetic intensity)
start = time.time()
for _ in range(iterations):
_ = Tensor(x.data + y.data)
add_time = (time.time() - start) / iterations
add_flops = size * size # One addition per element
add_bytes = 3 * size * size * 4 # Read x, read y, write result
add_ai = add_flops / add_bytes
add_gflops = add_flops / (add_time * 1e9)
add_bandwidth = add_bytes / (add_time * 1e9)
print(f"│ Element-wise Add │ {add_ai:6.3f}{add_time*1000:9.2f}{add_gflops:9.1f}{add_bandwidth:9.1f}")
    # 2. Element-wise multiply (same arithmetic intensity as addition)
start = time.time()
for _ in range(iterations):
_ = Tensor(x.data * y.data)
mul_time = (time.time() - start) / iterations
mul_flops = size * size
mul_bytes = 3 * size * size * 4
mul_ai = mul_flops / mul_bytes
mul_gflops = mul_flops / (mul_time * 1e9)
mul_bandwidth = mul_bytes / (mul_time * 1e9)
print(f"│ Element-wise Mult │ {mul_ai:6.3f}{mul_time*1000:9.2f}{mul_gflops:9.1f}{mul_bandwidth:9.1f}")
# 3. GELU (medium arithmetic intensity)
start = time.time()
for _ in range(iterations):
_ = fused_gelu(x)
gelu_time = (time.time() - start) / iterations
gelu_flops = size * size * 8 # Approximate: x³, add, mul, tanh, etc.
gelu_bytes = 2 * size * size * 4 # Read x, write result
gelu_ai = gelu_flops / gelu_bytes
gelu_gflops = gelu_flops / (gelu_time * 1e9)
gelu_bandwidth = gelu_bytes / (gelu_time * 1e9)
print(f"│ Fused GELU │ {gelu_ai:6.3f}{gelu_time*1000:9.2f}{gelu_gflops:9.1f}{gelu_bandwidth:9.1f}")
# 4. Matrix multiplication (high arithmetic intensity)
start = time.time()
for _ in range(iterations):
_ = vectorized_matmul(x, y)
matmul_time = (time.time() - start) / iterations
matmul_flops = 2 * size**3 # 2N³ FLOPs
matmul_bytes = 3 * size * size * 4 # 3 matrices
matmul_ai = matmul_flops / matmul_bytes
matmul_gflops = matmul_flops / (matmul_time * 1e9)
matmul_bandwidth = matmul_bytes / (matmul_time * 1e9)
print(f"│ Matrix Multiply │ {matmul_ai:6.3f}{matmul_time*1000:9.2f}{matmul_gflops:9.1f}{matmul_bandwidth:9.1f}")
print("└─────────────────────┴─────────┴─────────────┴─────────────┴─────────────┘")
print(f"\n💡 Roofline Model Insights:")
print(f" 📊 Low AI (< 1): Memory bound - limited by bandwidth")
print(f" 📊 Med AI (1-10): Transitional - depends on implementation")
print(f" 📊 High AI (> 10): Compute bound - limited by ALU throughput")
print(f" 🎯 Matrix multiplication ({matmul_ai:.1f} AI) is ideal for GPUs/TPUs")
print(f" ⚡ Element-wise ops ({add_ai:.3f} AI) need memory optimization")
print("🚀 Design algorithms with high arithmetic intensity for performance")
analyze_arithmetic_intensity()
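The regimes in the table follow directly from the roofline model: attainable throughput is the lesser of the compute roof and AI × memory bandwidth. The peak numbers below are illustrative placeholders (matching the ~200 GFLOPS estimate used earlier), not measurements of any particular machine:

```python
def roofline_gflops(ai_flops_per_byte: float,
                    peak_gflops: float = 200.0,
                    peak_bw_gbs: float = 50.0) -> float:
    """Attainable GFLOPS under the roofline model.

    Below the ridge point (peak_gflops / peak_bw_gbs = 4 FLOPs/byte here)
    an operation is memory bound; above it, compute bound.
    """
    return min(peak_gflops, ai_flops_per_byte * peak_bw_gbs)

# Element-wise add: 1 FLOP per 12 bytes -> deep in the memory-bound regime
print(roofline_gflops(1 / 12))   # 50/12 ≈ 4.17 GFLOPS
# 1024×1024 matmul: ~170 FLOPs/byte -> pinned at the compute roof
print(roofline_gflops(170.7))    # 200.0
```

The ridge point at 4 FLOPs/byte explains the table: element-wise ops sit far below it no matter how fast the ALUs are, while matrix multiplication sits far above it and can saturate the compute roof.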
# %% nbgrader={"grade": false, "grade_id": "analyze-mixed-precision-benefits", "solution": true}
def analyze_mixed_precision_benefits():
"""📊 Quantify mixed precision memory and performance benefits."""
print("📊 Analyzing mixed precision benefits across model sizes...")
# Define representative model configurations
model_configs = [
("Tiny CNN", {"params": 50_000, "activations": 100_000}),
("Small BERT", {"params": 10_000_000, "activations": 5_000_000}),
("Medium GPT", {"params": 100_000_000, "activations": 50_000_000}),
("Large Transformer", {"params": 1_000_000_000, "activations": 500_000_000}),
]
print("\n🧮 Mixed Precision Memory Analysis:")
print("┌─────────────────┬─────────────┬─────────────┬─────────────┬─────────────┐")
print("│ Model Type │ Parameters │ FP32 Memory │ FP16 Memory │ Savings │")
print("│ │ │ (GB) │ (GB) │ (%) │")
print("├─────────────────┼─────────────┼─────────────┼─────────────┼─────────────┤")
for name, config in model_configs:
param_count = config["params"]
activation_count = config["activations"]
# Memory calculation (bytes)
# Parameters: always FP32 for stability
param_memory = param_count * 4
# FP32 training memory
fp32_activations = activation_count * 4
fp32_gradients = param_count * 4
fp32_optimizer = param_count * 8 # Adam: momentum + velocity
fp32_total = param_memory + fp32_activations + fp32_gradients + fp32_optimizer
# Mixed precision memory
fp16_activations = activation_count * 2 # FP16 activations
fp16_gradients = param_count * 2 # FP16 gradients during backward
mixed_total = param_memory + fp16_activations + fp16_gradients + fp32_optimizer
# Calculate savings
savings_gb = (fp32_total - mixed_total) / 1e9
savings_pct = (fp32_total - mixed_total) / fp32_total * 100
print(f"{name:14s}{param_count:10,d}{fp32_total/1e9:9.1f}{mixed_total/1e9:9.1f}{savings_pct:9.1f}")
print("└─────────────────┴─────────────┴─────────────┴─────────────┴─────────────┘")
# Performance simulation
print(f"\n⚡ Mixed Precision Performance Simulation:")
# Simulate different batch sizes to show memory pressure
batch_sizes = [8, 16, 32, 64]
hidden_size = 1024
seq_length = 512
print("┌─────────────┬─────────────┬─────────────┬─────────────┬─────────────┐")
print("│ Batch Size │ FP32 Mem │ FP16 Mem │ Throughput │ Efficiency │")
print("│ │ (GB) │ (GB) │ Gain │ Gain │")
print("├─────────────┼─────────────┼─────────────┼─────────────┼─────────────┤")
for batch_size in batch_sizes:
# Memory for activations (dominant for large models)
elements = batch_size * seq_length * hidden_size
fp32_mem = elements * 4 / 1e9 # 4 bytes per FP32
fp16_mem = elements * 2 / 1e9 # 2 bytes per FP16
# Simulate throughput gains (based on Tensor Core speedups)
# Real speedups depend on hardware and operation mix
throughput_gain = 1.4 # Conservative estimate for mixed workloads
# Memory efficiency enables larger batch sizes
max_fp32_batch = 32 # Assume memory limit
max_fp16_batch = 64 # Double capacity with FP16
efficiency_gain = max_fp16_batch / max_fp32_batch if batch_size <= max_fp32_batch else "OOM"
efficiency_str = f"{efficiency_gain:.1f}×" if isinstance(efficiency_gain, float) else efficiency_gain
print(f"{batch_size:10d}{fp32_mem:9.2f}{fp16_mem:9.2f}{throughput_gain:9.1f}×{efficiency_str:9s}")
print("└─────────────┴─────────────┴─────────────┴─────────────┴─────────────┘")
print(f"\n💡 Mixed Precision Key Benefits:")
print(f" 🎯 Memory: 20-40% reduction enables larger models/batches")
print(f" ⚡ Speed: 1.3-2× throughput on modern hardware (V100+)")
print(f" 📈 Scale: Essential for billion-parameter models")
print(f" ⚠️ Complexity: Requires careful loss scaling and overflow handling")
print("🚀 Mixed precision is crucial for competitive ML training")
analyze_mixed_precision_benefits()
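The accounting behind the memory table can be factored into one helper, under the same assumptions: FP32 master weights, FP32 Adam state (momentum + velocity), and FP16 activations and gradients in the mixed case:

```python
def training_memory_bytes(params: int, activations: int, mixed: bool = False) -> int:
    """Approximate training footprint using the same accounting as the table above."""
    act_bytes = 2 if mixed else 4    # FP16 vs FP32 activations
    grad_bytes = 2 if mixed else 4   # FP16 vs FP32 gradients
    return (params * 4               # FP32 master weights
            + activations * act_bytes
            + params * grad_bytes
            + params * 8)            # Adam momentum + velocity (FP32)

fp32 = training_memory_bytes(100_000_000, 50_000_000)
mixed = training_memory_bytes(100_000_000, 50_000_000, mixed=True)
print(f"FP32: {fp32/1e9:.1f} GB, mixed: {mixed/1e9:.1f} GB, "
      f"savings: {(fp32 - mixed) / fp32 * 100:.1f}%")  # savings: 16.7%
```

For this parameter-to-activation ratio the saving is about 17%; activation-heavy workloads (long sequences, large batches) save considerably more because activations are the halved term.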
# %% [markdown]
"""
## 6. Optimization Insights - Production Acceleration Strategy
Understanding when and how to apply different acceleration techniques in real-world scenarios.
"""
# %% nbgrader={"grade": false, "grade_id": "acceleration-decision-framework", "solution": true}
def analyze_acceleration_decision_framework():
"""📊 Decision framework for choosing acceleration techniques."""
print("📊 Acceleration Technique Decision Framework...")
# Define workload characteristics
workloads = [
("Research Training", {
"memory_pressure": "medium",
"latency_sensitive": False,
"stability_critical": False,
"development_speed": "high",
"hardware_variety": "high"
}),
("Production Training", {
"memory_pressure": "high",
"latency_sensitive": False,
"stability_critical": True,
"development_speed": "medium",
"hardware_variety": "low"
}),
("Real-time Inference", {
"memory_pressure": "medium",
"latency_sensitive": True,
"stability_critical": True,
"development_speed": "low",
"hardware_variety": "medium"
}),
("Edge Deployment", {
"memory_pressure": "very_high",
"latency_sensitive": True,
"stability_critical": True,
"development_speed": "low",
"hardware_variety": "very_high"
}),
("Batch Inference", {
"memory_pressure": "low",
"latency_sensitive": False,
"stability_critical": True,
"development_speed": "medium",
"hardware_variety": "low"
})
]
# Define technique characteristics
techniques = {
"Vectorization": {
"implementation_cost": "low",
"memory_benefit": "none",
"latency_benefit": "high",
"stability_risk": "none",
"hardware_dependency": "low"
},
"Kernel Fusion": {
"implementation_cost": "medium",
"memory_benefit": "medium",
"latency_benefit": "medium",
"stability_risk": "low",
"hardware_dependency": "medium"
},
"Mixed Precision": {
"implementation_cost": "high",
"memory_benefit": "high",
"latency_benefit": "high",
"stability_risk": "medium",
"hardware_dependency": "high"
},
"Graph Optimization": {
"implementation_cost": "very_high",
"memory_benefit": "medium",
"latency_benefit": "very_high",
"stability_risk": "low",
"hardware_dependency": "very_high"
}
}
print("\n🎯 Acceleration Technique Recommendations:")
print("┌─────────────────────┬─────────────┬─────────────┬─────────────┬─────────────┐")
print("│ Workload │ Vectorize │ Fuse Kernels│ Mixed Prec │ Graph Opt │")
print("├─────────────────────┼─────────────┼─────────────┼─────────────┼─────────────┤")
for workload_name, workload_chars in workloads:
recommendations = []
for technique_name in ["Vectorization", "Kernel Fusion", "Mixed Precision", "Graph Optimization"]:
tech_chars = techniques[technique_name]
score = 0
# Benefit vs requirement matching
if workload_chars["memory_pressure"] in ["high", "very_high"]:
if tech_chars["memory_benefit"] in ["medium", "high"]:
score += 2
if workload_chars["latency_sensitive"]:
if tech_chars["latency_benefit"] in ["medium", "high", "very_high"]:
score += 2
# Risk vs tolerance matching
if workload_chars["stability_critical"]:
if tech_chars["stability_risk"] in ["none", "low"]:
score += 1
elif tech_chars["stability_risk"] == "medium":
score -= 1
# Implementation cost vs development speed
if workload_chars["development_speed"] == "high":
if tech_chars["implementation_cost"] in ["low", "medium"]:
score += 1
elif tech_chars["implementation_cost"] in ["high", "very_high"]:
score -= 1
# Hardware dependency vs variety
if workload_chars["hardware_variety"] in ["high", "very_high"]:
if tech_chars["hardware_dependency"] in ["low", "medium"]:
score += 1
elif tech_chars["hardware_dependency"] in ["high", "very_high"]:
score -= 2
# Convert score to recommendation
if score >= 3:
rec = "✅ High"
elif score >= 1:
rec = "⚡ Medium"
elif score >= 0:
rec = "⚠️ Low"
else:
rec = "❌ Skip"
recommendations.append(rec)
rec_line = "".join(f"{rec:10s}" for rec in recommendations)
print(f"{workload_name:18s}{rec_line}")
print("└─────────────────────┴─────────────┴─────────────┴─────────────┴─────────────┘")
# Implementation priority framework
print(f"\n🛠️ Implementation Priority Framework:")
print(f" 📊 Phase 1 (Always): Vectorization")
print(f" • Low risk, high reward")
print(f" • Works on any hardware")
print(f" • Foundation for other optimizations")
print(f" ")
print(f" 📊 Phase 2 (Memory constrained): Kernel Fusion")
print(f" • Targets memory-bound operations")
print(f" • Moderate complexity")
print(f" • Significant wins on element-wise ops")
print(f" ")
print(f" 📊 Phase 3 (Large models): Mixed Precision")
print(f" • Essential for large model training")
print(f" • Requires careful validation")
print(f" • Hardware-dependent benefits")
print(f" ")
print(f" 📊 Phase 4 (Production): Graph Optimization")
print(f" • Maximum performance extraction")
print(f" • High implementation cost")
print(f" • Deployment-specific tuning")
print(f"\n💡 Key Decision Factors:")
print(f" 🎯 Start simple: Vectorization first, always")
print(f" 📈 Scale up: Add complexity only when needed")
print(f" ⚡ Measure impact: Profile before and after each optimization")
print(f" 🔄 Iterate: Optimization is an ongoing process, not one-time")
print("🚀 Systematic acceleration beats random optimization")
analyze_acceleration_decision_framework()
# %% [markdown]
"""
## 7. Module Integration Test
Final validation that all acceleration components work together correctly.
"""
# %% nbgrader={"grade": true, "grade_id": "test-module", "locked": true, "points": 20}
def test_module():
"""
Comprehensive test of entire acceleration module functionality.
This final test ensures:
- All acceleration techniques work correctly
- Performance improvements are measurable
- Mixed precision training is stable
- Components integrate seamlessly
- Module is ready for production use
"""
print("🧪 RUNNING MODULE INTEGRATION TEST")
print("=" * 50)
# Run all unit tests
print("Running unit tests...")
test_unit_vectorized_matmul()
test_unit_fused_gelu()
test_unit_fusion_speedup()
test_unit_mixed_precision()
print("\nRunning integration scenarios...")
# Test realistic acceleration pipeline
print("🔬 Integration Test: Complete acceleration pipeline...")
# Create realistic model scenario
batch_size, seq_len, hidden_dim = 16, 64, 256
print(f" Model config: batch={batch_size}, seq_len={seq_len}, hidden={hidden_dim}")
# Test data
x = Tensor(np.random.randn(batch_size, seq_len, hidden_dim).astype(np.float32))
weight = Tensor(np.random.randn(hidden_dim, hidden_dim).astype(np.float32))
print(f" Input tensor: {x.shape}, Weight tensor: {weight.shape}")
# Test complete pipeline: reshape → matmul → activation → mixed precision
print(" Testing vectorized operations...")
# Reshape for matrix multiplication (flatten batch and sequence)
x_reshaped = Tensor(x.data.reshape(-1, hidden_dim))
assert x_reshaped.shape == (batch_size * seq_len, hidden_dim)
# Vectorized matrix multiplication
linear_output = vectorized_matmul(x_reshaped, weight)
assert linear_output.shape == (batch_size * seq_len, hidden_dim)
print(f" ✅ Matrix multiplication: {x_reshaped.shape} @ {weight.shape}{linear_output.shape}")
# Fused activation
activated = fused_gelu(linear_output)
assert activated.shape == linear_output.shape
print(f" ✅ Fused GELU activation: {linear_output.shape}{activated.shape}")
# Reshape back to original structure
final_output = Tensor(activated.data.reshape(batch_size, seq_len, hidden_dim))
assert final_output.shape == x.shape
print(f" ✅ Output reshape: {activated.shape}{final_output.shape}")
print(" Testing mixed precision training integration...")
# Create complete model for mixed precision testing
class TransformerBlock:
def __init__(self, hidden_dim):
self.hidden_dim = hidden_dim
self.weight1 = Tensor(np.random.randn(hidden_dim, hidden_dim).astype(np.float32))
self.weight2 = Tensor(np.random.randn(hidden_dim, hidden_dim).astype(np.float32))
self.weight1.grad = None
self.weight2.grad = None
def __call__(self, x):
# Simulate transformer block: linear → activation → linear
batch_size, seq_len, hidden_dim = x.shape
x_flat = Tensor(x.data.reshape(-1, hidden_dim))
# First linear layer
h1 = vectorized_matmul(x_flat, self.weight1)
h1_activated = fused_gelu(h1)
# Second linear layer
h2 = vectorized_matmul(h1_activated, self.weight2)
# Reshape back
output = Tensor(h2.data.reshape(batch_size, seq_len, hidden_dim))
return output
def parameters(self):
return [self.weight1, self.weight2]
class SimpleOptimizer:
def __init__(self, params):
self.params = params
def zero_grad(self):
for p in self.params:
p.grad = None
def step(self):
for p in self.params:
if p.grad is not None:
p.data = p.data - 0.001 * p.grad.data
# Initialize model and optimizer
model = TransformerBlock(hidden_dim)
optimizer = SimpleOptimizer(model.parameters())
trainer = MixedPrecisionTrainer(model, optimizer, loss_scale=512.0)
print(f" Model parameters: {len(model.parameters())}")
print(f" Initial loss scale: {trainer.loss_scale}")
# Simulate training steps
print(" Running training steps...")
targets = Tensor(np.random.randn(batch_size, seq_len, hidden_dim).astype(np.float32))
training_metrics = []
for step in range(5):
metrics = trainer.train_step((x, targets))
training_metrics.append(metrics)
# Verify metrics are reasonable
assert isinstance(metrics['loss'], (int, float))
assert metrics['loss'] >= 0
assert metrics['loss_scale'] > 0
assert isinstance(metrics['overflow'], bool)
assert isinstance(metrics['gradients_valid'], bool)
print(f" ✅ Completed {len(training_metrics)} training steps")
# Analyze training stability
losses = [m['loss'] for m in training_metrics]
overflows = [m['overflow'] for m in training_metrics]
print(f" Loss range: {min(losses):.6f} - {max(losses):.6f}")
print(f" Overflow rate: {sum(overflows)}/{len(overflows)} steps")
print(" Testing performance characteristics...")
# Verify acceleration provides measurable benefits
test_sizes = [128, 256]
for size in test_sizes:
test_x = Tensor(np.random.randn(size, size).astype(np.float32))
test_y = Tensor(np.random.randn(size, size).astype(np.float32))
# Time operations and verify reasonable performance
start = time.time()
_ = vectorized_matmul(test_x, test_y)
matmul_time = time.time() - start
start = time.time()
_ = fused_gelu(test_x)
gelu_time = time.time() - start
# Verify operations complete in reasonable time
assert matmul_time < 1.0, f"Matrix multiplication too slow: {matmul_time:.3f}s"
assert gelu_time < 0.1, f"GELU activation too slow: {gelu_time:.3f}s"
print(f" ✅ Size {size}: matmul={matmul_time*1000:.1f}ms, gelu={gelu_time*1000:.1f}ms")
print(" Testing memory efficiency...")
# Verify mixed precision reduces memory usage conceptually
param_count = sum(p.data.size for p in model.parameters())
activation_count = batch_size * seq_len * hidden_dim
fp32_memory = (param_count + activation_count) * 4 # 4 bytes per FP32
mixed_memory = param_count * 4 + activation_count * 2 # FP32 params + FP16 activations
memory_savings = (fp32_memory - mixed_memory) / fp32_memory * 100
print(f" Memory analysis: {memory_savings:.1f}% savings from mixed precision")
assert memory_savings > 0, "Mixed precision should reduce memory usage"
print("✅ End-to-end acceleration pipeline works!")
print("\n" + "=" * 50)
print("🎉 ALL TESTS PASSED! Module ready for export.")
print("Run: tito module complete 16")
# Call the module test
test_module()
# %% nbgrader={"grade": false, "grade_id": "main-execution", "solution": false}
# Main execution block
if __name__ == "__main__":
print("🚀 Running Acceleration module...")
test_module()
print("✅ Module validation complete!")
# %% [markdown]
"""
## 🤔 ML Systems Thinking: Acceleration and Performance
### Question 1: Arithmetic Intensity Analysis
You implemented vectorized matrix multiplication and fused GELU.
- Matrix multiplication (1024×1024): performs ~2.1 billion FLOPs and moves ~12 MB of data (two matrices read, one written)
- Arithmetic intensity: _____ FLOPs/byte
- Compared to element-wise addition (~0.083 FLOPs/byte, i.e. 1 FLOP per 12 bytes): _____× higher intensity
- Why does this make matrix multiplication ideal for GPUs? _____
### Question 2: Kernel Fusion Memory Benefits
Your fused_gelu combines 7 operations into a single expression.
- Unfused version memory accesses: 7 reads + 7 writes = _____ per element
- Fused version memory accesses: 1 read + 1 write = _____ per element
- Memory bandwidth reduction: _____%
- Why is this critical for transformer inference? _____
### Question 3: Mixed Precision Memory Calculation
Your MixedPrecisionTrainer uses FP16 activations, FP32 parameters.
For a 100M parameter model with 50M activation elements:
- FP32 memory: (100M + 50M) × 4 bytes = _____ MB
- Mixed precision memory: 100M × 4 + 50M × 2 = _____ MB
- Memory reduction: _____%
### Question 4: Loss Scaling Strategy
Your trainer starts with loss_scale=1024, grows by 2×, shrinks by 0.5×.
- Smallest normal FP16 value: ~6e-5 (subnormals extend to ~6e-8, with reduced precision)
- Without scaling, gradients < _____ become zero
- With 1024× scaling, gradients down to _____ are preserved
- Why increase scale gradually but decrease immediately? _____
### Question 5: Production Optimization Strategy
Based on your decision framework analysis:
For edge deployment (memory critical, stability required, hardware diverse):
- Priority 1 technique: _____ (low risk, universal)
- Priority 2 technique: _____ (memory benefits)
- Skip technique: _____ (why: _____)
- What's the primary constraint: memory, compute, or power? _____
"""
# %% [markdown]
"""
## 🎯 MODULE SUMMARY: Acceleration
Congratulations! You've mastered the fundamental techniques for accelerating neural networks!
### Key Accomplishments
- Built **vectorized operations** leveraging SIMD and optimized BLAS for 2-5× speedups
- Implemented **kernel fusion** reducing memory bandwidth by 60-80% for element-wise operations
- Created **mixed precision training** with automatic loss scaling for 20-40% memory savings
- Analyzed **arithmetic intensity patterns** and their impact on the roofline model
- Developed **production decision framework** for systematic optimization
- All tests pass ✅ (validated by `test_module()`)
### Systems Insights Discovered
- **Roofline Model**: Operations with high arithmetic intensity (FLOPs/byte) scale better
- **Memory Bandwidth**: Often the limiting factor for modern accelerators
- **Kernel Fusion**: Critical for memory-bound workloads, reduces intermediate storage overhead
- **Mixed Precision**: Essential for large model training, requires careful gradient scaling
- **Optimization Strategy**: Start simple (vectorization), add complexity as needed
### Production Impact
Your acceleration techniques enable:
- **Training larger models** within memory constraints
- **Faster iteration cycles** during research and development
- **Better hardware utilization** across different deployment targets
- **Cost reduction** through improved efficiency
### Ready for Next Steps
Your acceleration implementations provide the foundation for quantization techniques in Module 17.
The performance analysis skills transfer directly to production optimization workflows.
Export with: `tito module complete 16`
**Next**: Module 17 will add quantization to further reduce memory and increase throughput while maintaining accuracy!
"""