# ---
# jupyter:
#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: percent
#       format_version: '1.3'
#       jupytext_version: 1.18.1
#   kernelspec:
#     display_name: Python 3 (ipykernel)
#     language: python
#     name: python3
# ---

# %%
#| default_exp optimization.acceleration
#| export

# %% [markdown]
"""
|
||
# Module 16: Acceleration - Making Models Run Faster
|
||
|
||
Welcome to Module 16! You're about to master the art of neural network acceleration through vectorization, kernel fusion, and mixed precision training.
|
||
|
||
## 🔗 Prerequisites & Progress
|
||
**You've Built**: Complete training pipeline with profiling capabilities
|
||
**You'll Build**: Acceleration techniques including vectorization, operation fusion, and mixed precision
|
||
**You'll Enable**: Production-ready optimization for real-world deployment
|
||
|
||
**Connection Map**:
|
||
```
|
||
Profiling (Module 15) → Acceleration (Module 16) → Quantization (Module 17)
|
||
(measurement) (optimization) (precision reduction)
|
||
```
|
||
|
||
## Learning Objectives
|
||
By the end of this module, you will:
|
||
1. Implement vectorized operations for maximum throughput
|
||
2. Create fused operations to reduce memory bandwidth
|
||
3. Build mixed precision training for memory efficiency
|
||
4. Understand the relationship between compute and memory bandwidth
|
||
5. Analyze acceleration trade-offs in production systems
|
||
|
||
Let's optimize for speed!
|
||
|
||
## 📦 Where This Code Lives in the Final Package
|
||
|
||
**Learning Side:** You work in `modules/16_acceleration/acceleration_dev.py`
|
||
**Building Side:** Code exports to `tinytorch.optimization.acceleration`
|
||
|
||
```python
|
||
# How to use this module:
|
||
from tinytorch.optimization.acceleration import vectorized_matmul, fused_gelu, MixedPrecisionTrainer
|
||
```
|
||
|
||
**Why this matters:**
|
||
- **Learning:** Complete acceleration system in one focused module for deep understanding
|
||
- **Production:** Proper organization like PyTorch's torch.amp and torch.jit with optimization components
|
||
- **Consistency:** All acceleration operations and mixed precision training in optimization.acceleration
|
||
- **Integration:** Works seamlessly with profiling for complete performance optimization
|
||
"""
# %%
import numpy as np
import time
from typing import Dict, List, Tuple, Optional, Any, Union

# %% [markdown]
|
||
"""
|
||
## 1. Introduction - The Performance Challenge
|
||
|
||
Modern neural networks face two fundamental bottlenecks that limit their speed:
|
||
|
||
### The Two Enemies of Performance
|
||
|
||
**1. Compute Bound Operations:**
|
||
```
|
||
CPU/GPU Cores: [====BUSY====] [====BUSY====] [====BUSY====]
|
||
Memory Bus: [---idle---] [---idle---] [---idle---]
|
||
|
||
When: Matrix multiplication, convolutions
|
||
Solution: Vectorization, better algorithms
|
||
```
|
||
|
||
**2. Memory Bound Operations:**
|
||
```
|
||
CPU/GPU Cores: [--idle--] [--idle--] [--idle--]
|
||
Memory Bus: [========SATURATED========]
|
||
|
||
When: Element-wise operations, small tensors
|
||
Solution: Kernel fusion, memory layout optimization
|
||
```
|
||
|
||
### The Roofline Model - Your Performance Compass
|
||
|
||
Every processor has fundamental limits:
|
||
|
||
```
|
||
Performance │ Compute Bound Region
|
||
(GFLOPS) │ ┌─────────────────────
|
||
│ │ Peak Performance
|
||
│ │
|
||
│ ╱│ Memory Bound Region
|
||
│╱ │
|
||
╱│ │
|
||
╱ │ │
|
||
╱ │ │
|
||
╱───│──│───────────────────────
|
||
╱ │ │
|
||
╱ │ │
|
||
╱──────│──│────────────────── Arithmetic Intensity
|
||
│ │ (FLOPs/Byte)
|
||
Low│ │High
|
||
```
|
||
|
||
**Key Insight**: Understand where your operations live on this graph to optimize effectively.
|
||
|
||
### Why This Module Matters
|
||
|
||
Real-world performance wins:
|
||
- **2-5× speedup** from vectorization
|
||
- **30-50% memory reduction** from mixed precision
|
||
- **2-3× throughput** from kernel fusion
|
||
- **10× scaling improvement** for large models
|
||
"""
# %% nbgrader={"grade": false, "grade_id": "tensor-import", "solution": true}
|
||
"""
|
||
## 🔗 Module Dependencies
|
||
|
||
This module REQUIRES completion of:
|
||
- Module 01 (Tensor): Foundation data structure we optimize
|
||
- Module 03 (Layers): Linear layers for vectorization
|
||
- Module 14 (Profiling): Profiler for measuring improvements
|
||
|
||
**Progressive Building**:
|
||
```
|
||
Module 01 (Tensor) ──> [This Module: Optimize Tensor operations]
|
||
Module 03 (Layers) ──> [This Module: Optimize Linear layers]
|
||
Module 14 (Profiling) ──> [This Module: Measure improvements]
|
||
```
|
||
|
||
**What You've Built**:
|
||
- Module 01: Tensor (what we optimize)
|
||
- Module 03: Linear layers (uses optimized ops)
|
||
- Module 14: Profiling (measure improvements)
|
||
|
||
**What This Module Adds**:
|
||
- Vectorized operations (SIMD optimization)
|
||
- Kernel fusion (memory efficiency)
|
||
- Mixed precision training (memory/speed)
|
||
|
||
**To verify dependencies are met, run**:
|
||
python -c "from tinytorch.core.tensor import Tensor; print('✅ Module 01 ready')"
|
||
python -c "from tinytorch.core.layers import Linear; print('✅ Module 03 ready')"
|
||
python -c "from tinytorch.profiling.profiler import Profiler; print('✅ Module 14 ready')"
|
||
"""
|
||
|
||
# Direct imports from previous modules - these MUST exist
|
||
# If imports fail, students will get clear educational errors
|
||
try:
|
||
from tinytorch.core.tensor import Tensor # Module 01: What we optimize
|
||
except ImportError as e:
|
||
raise ImportError(
|
||
"❌ Module 18 (Acceleration) requires Module 01 (Tensor) to be completed first.\n"
|
||
" This module optimizes Tensor operations - you need Tensor to exist first!\n"
|
||
" Please complete Module 01 first, then run 'tito module complete 01'.\n"
|
||
" Original error: " + str(e)
|
||
) from e
|
||
|
||
try:
|
||
from tinytorch.core.layers import Linear # Module 03: Uses optimized ops
|
||
except ImportError as e:
|
||
raise ImportError(
|
||
"❌ Module 18 (Acceleration) requires Module 03 (Layers) to be completed first.\n"
|
||
" This module optimizes Linear layer operations.\n"
|
||
" Please complete Module 03 first, then run 'tito module complete 03'.\n"
|
||
" Original error: " + str(e)
|
||
) from e
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## 2. Foundations - Vectorization: From Loops to Lightning
|
||
|
||
### The SIMD Revolution
|
||
|
||
Modern processors can execute **Single Instruction, Multiple Data** operations:
|
||
|
||
```
|
||
Traditional Loop (Scalar): SIMD Vectorized:
|
||
for i in range(4): ┌─────┐ ┌─────┬─────┬─────┬─────┐
|
||
c[i] = a[i] + b[i] │ ALU │ → │ALU 0│ALU 1│ALU 2│ALU 3│
|
||
└─────┘ └─────┴─────┴─────┴─────┘
|
||
1 element 4 elements per cycle
|
||
per cycle
|
||
```
|
||
|
||
### Memory Access Patterns: The Hidden Performance Killer
|
||
|
||
```
|
||
Sequential Access (FAST):
|
||
Memory: [A][B][C][D][E][F][G][H]
|
||
Access: ↓ ↓ ↓ ↓ → Cache friendly
|
||
|
||
Strided Access (SLOWER):
|
||
Memory: [A][ ][B][ ][C][ ][D][ ]
|
||
Access: ↓ ↓ ↓ ↓ → Cache misses
|
||
|
||
Random Access (SLOWEST):
|
||
Memory: [A][B][C][D][E][F][G][H]
|
||
Access: ↓ ↑ ↓ ↑ → Cache chaos
|
||
```
|
||
|
||
### Matrix Multiplication: The King of Vectorization
|
||
|
||
Matrix multiplication is **perfectly suited** for vectorization:
|
||
|
||
```
|
||
Matrix A (M×K) × Matrix B (K×N) = Matrix C (M×N)
|
||
|
||
Computation Pattern:
|
||
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
|
||
│ a₁₁ a₁₂ a₁₃ a₁₄│ × │ b₁₁ b₁₂ b₁₃ b₁₄│ = │ c₁₁ c₁₂ c₁₃ c₁₄│
|
||
│ a₂₁ a₂₂ a₂₃ a₂₄│ │ b₂₁ b₂₂ b₂₃ b₂₄│ │ c₂₁ c₂₂ c₂₃ c₂₄│
|
||
│ a₃₁ a₃₂ a₃₃ a₃₄│ │ b₃₁ b₃₂ b₃₃ b₃₄│ │ c₃₁ c₃₂ c₃₃ c₃₄│
|
||
│ a₄₁ a₄₂ a₄₃ a₄₄│ │ b₄₁ b₄₂ b₄₃ b₄₄│ │ c₄₁ c₄₂ c₄₃ c₄₄│
|
||
└─────────────────┘ └─────────────────┘ └─────────────────┘
|
||
|
||
For c₁₁: Row₁ · Column₁ = a₁₁×b₁₁ + a₁₂×b₂₁ + a₁₃×b₃₁ + a₁₄×b₄₁
|
||
↑
|
||
VECTORIZABLE!
|
||
```
|
||
|
||
**Why vectorization wins:**
|
||
- **High arithmetic intensity**: 2N³ FLOPs for N³ data
|
||
- **Predictable memory access**: Sequential row/column reads
|
||
- **Parallelizable**: Independent dot products
|
||
- **Cache-friendly**: Data reuse in inner loops
|
||
"""
# %% nbgrader={"grade": false, "grade_id": "vectorized-matmul", "solution": true}
|
||
def vectorized_matmul(a: Tensor, b: Tensor) -> Tensor:
|
||
"""
|
||
High-performance matrix multiplication using vectorized operations.
|
||
|
||
This implementation leverages optimized BLAS libraries that use:
|
||
- SIMD instructions for parallel computation
|
||
- Cache-blocking for memory efficiency
|
||
- Multi-threading for CPU parallelization
|
||
|
||
TODO: Implement production-grade matrix multiplication
|
||
|
||
APPROACH:
|
||
1. Validate shapes are compatible for matrix multiplication
|
||
2. Use NumPy's optimized dot product (calls BLAS GEMM)
|
||
3. Return result wrapped in Tensor
|
||
|
||
EXAMPLE:
|
||
Matrix multiplication visualization:
|
||
>>> a = Tensor([[1, 2], [3, 4]]) # 2×2
|
||
>>> b = Tensor([[5, 6], [7, 8]]) # 2×2
|
||
>>> result = vectorized_matmul(a, b)
|
||
>>> print(result.data)
|
||
[[19 22] # [1×5+2×7, 1×6+2×8] = [19, 22]
|
||
[43 50]] # [3×5+4×7, 3×6+4×8] = [43, 50]
|
||
|
||
PERFORMANCE CHARACTERISTICS:
|
||
- Time Complexity: O(N³) but highly optimized
|
||
- Space Complexity: O(N²) for result
|
||
- Arithmetic Intensity: 2N³ FLOPs / 12N² bytes (three float32 matrices) ≈ N/6 FLOPs per byte (grows with N)
|
||
|
||
HINTS:
|
||
- Check a.shape[-1] == b.shape[-2] for inner dimension match
|
||
- Use np.matmul() for batch support and optimization
|
||
- Trust BLAS to handle the vectorization magic
|
||
"""
|
||
### BEGIN SOLUTION
|
||
# Input validation for matrix multiplication
|
||
if len(a.shape) < 2 or len(b.shape) < 2:
|
||
raise ValueError(
|
||
f"Matrix multiplication requires 2D+ tensors, got shapes {a.shape} and {b.shape}. "
|
||
f"💡 HINT: Use reshape() to add dimensions if needed."
|
||
)
|
||
|
||
if a.shape[-1] != b.shape[-2]:
|
||
raise ValueError(
|
||
f"Matrix multiplication shape mismatch: {a.shape} @ {b.shape}. "
|
||
f"Inner dimensions must match: a.shape[-1]={a.shape[-1]} != b.shape[-2]={b.shape[-2]}. "
|
||
f"💡 HINT: For A@B, A's columns must equal B's rows."
|
||
)
|
||
|
||
# Use NumPy's highly optimized matrix multiplication
|
||
# This calls BLAS GEMM (General Matrix Multiply), which uses:
|
||
# - SIMD vectorization for parallel arithmetic
|
||
# - Cache blocking for memory efficiency
|
||
# - Multi-threading on multi-core systems
|
||
result_data = np.matmul(a.data, b.data)
|
||
|
||
return Tensor(result_data)
|
||
### END SOLUTION
|
||
|
||
# %% nbgrader={"grade": true, "grade_id": "test-vectorized-matmul", "locked": true, "points": 10}
|
||
def test_unit_vectorized_matmul():
|
||
"""🔬 Test vectorized matrix multiplication implementation."""
|
||
print("🔬 Unit Test: Vectorized Matrix Multiplication...")
|
||
|
||
# Test basic 2D multiplication
|
||
a = Tensor([[1, 2], [3, 4]])
|
||
b = Tensor([[5, 6], [7, 8]])
|
||
result = vectorized_matmul(a, b)
|
||
|
||
expected = np.array([[19, 22], [43, 50]])
|
||
assert np.allclose(result.data, expected), f"Basic matmul failed: expected {expected}, got {result.data}"
|
||
|
||
# Test batch multiplication (3D tensors)
|
||
batch_size, m, k, n = 2, 3, 4, 5
|
||
a_batch = Tensor(np.random.randn(batch_size, m, k))
|
||
b_batch = Tensor(np.random.randn(batch_size, k, n))
|
||
result_batch = vectorized_matmul(a_batch, b_batch)
|
||
|
||
assert result_batch.shape == (batch_size, m, n), f"Wrong batch shape: {result_batch.shape}"
|
||
|
||
# Test broadcasting (different batch dimensions)
|
||
a_single = Tensor(np.random.randn(m, k))
|
||
b_batch = Tensor(np.random.randn(batch_size, k, n))
|
||
result_broadcast = vectorized_matmul(a_single, b_batch)
|
||
|
||
assert result_broadcast.shape == (batch_size, m, n), f"Broadcasting failed: {result_broadcast.shape}"
|
||
|
||
# Test error cases
|
||
try:
|
||
vectorized_matmul(Tensor([1, 2, 3]), Tensor([4, 5])) # 1D tensors
|
||
assert False, "Should reject 1D tensors"
|
||
except ValueError as e:
|
||
assert "2D+" in str(e)
|
||
|
||
try:
|
||
vectorized_matmul(Tensor([[1, 2]]), Tensor([[1], [2], [3]])) # Shape mismatch
|
||
assert False, "Should reject incompatible shapes"
|
||
except ValueError as e:
|
||
assert "shape mismatch" in str(e).lower()
|
||
|
||
print("✅ vectorized_matmul works correctly!")
|
||
|
||
test_unit_vectorized_matmul()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## 3. Implementation - Kernel Fusion: Eliminating Memory Bottlenecks
|
||
|
||
### The Memory Bandwidth Crisis
|
||
|
||
Consider this innocent-looking computation: `y = gelu(x * weight + bias)`
|
||
|
||
**Naive Implementation (Memory Intensive):**
|
||
```
|
||
Step 1: temp1 = x * weight → Write 4GB to memory
|
||
Step 2: temp2 = temp1 + bias → Read 4GB, Write 4GB
|
||
Step 3: y = gelu(temp2) → Read 4GB, Write 4GB
|
||
Total: 20GB memory traffic!
|
||
```
|
||
|
||
**Fused Implementation (Memory Efficient):**
|
||
```
|
||
Single Step: y = gelu(x * weight + bias) → Read 8GB, Write 4GB
|
||
Total: 12GB memory traffic!
|
||
60% memory bandwidth reduction!
|
||
```
|
||
|
||
### Understanding GELU: The Smooth Activation
|
||
|
||
GELU (Gaussian Error Linear Unit) is used in transformers because it's **smooth** (differentiable everywhere):
|
||
|
||
```
|
||
Activation Functions Compared:
|
||
|
||
ReLU: GELU: Sigmoid:
|
||
| | 1 ┌─────
|
||
| | ╱ │
|
||
| ╱───│─── ╱ │
|
||
─────┘ ╱─── │ ───╱ │
|
||
Discontinuous Smooth Curve │ Smooth but saturates
|
||
gradient at 0 everywhere │
|
||
```
|
||
|
||
**GELU Formula**: `GELU(x) = x * Φ(x)` where Φ is the standard normal CDF
|
||
|
||
**Fast Approximation**: `GELU(x) ≈ 0.5 * x * (1 + tanh(√(2/π) * (x + 0.044715 * x³)))`
|
||
|
||
### Kernel Fusion Strategy
|
||
|
||
```
|
||
Unfused Operations: Fused Operation:
|
||
┌─────────────────┐ ┌─────────────────┐
|
||
│ x³ computation │ → temp1 │ │
|
||
└─────────────────┘ │ │
|
||
┌─────────────────┐ │ │
|
||
│ polynomial part │ → temp2 │ All operations│
|
||
└─────────────────┘ │ combined in │
|
||
┌─────────────────┐ │ single kernel │
|
||
│ tanh computation│ → temp3 │ │
|
||
└─────────────────┘ │ │
|
||
┌─────────────────┐ │ │
|
||
│ final multiply │ → result │ │
|
||
└─────────────────┘ └─────────────────┘
|
||
|
||
5 memory round-trips 1 memory round-trip
|
||
```
|
||
"""
# %% nbgrader={"grade": false, "grade_id": "fused-gelu", "solution": true}
|
||
def fused_gelu(x: Tensor) -> Tensor:
|
||
"""
|
||
Fused GELU activation that combines all operations in a single kernel.
|
||
|
||
GELU combines the benefits of ReLU and sigmoid:
|
||
- Smooth everywhere (unlike ReLU's discontinuity at 0)
|
||
- Non-saturating for positive values (unlike sigmoid)
|
||
- Probabilistic interpretation: x * P(X ≤ x) where X ~ N(0,1)
|
||
|
||
Mathematical Definition:
|
||
GELU(x) = x * Φ(x) where Φ(x) is the standard normal CDF
|
||
|
||
Fast Approximation (used here):
|
||
GELU(x) ≈ 0.5 * x * (1 + tanh(√(2/π) * (x + 0.044715 * x³)))
|
||
|
||
TODO: Implement fused GELU to minimize memory bandwidth
|
||
|
||
APPROACH:
|
||
1. Compute all intermediate values in a single expression
|
||
2. Avoid creating temporary arrays
|
||
3. Let NumPy's broadcasting handle vectorization
|
||
|
||
EXAMPLE:
|
||
>>> x = Tensor([-2, -1, 0, 1, 2])
|
||
>>> result = fused_gelu(x)
|
||
>>> print(result.data)
|
||
[-0.04550026 -0.15865526 0. 0.8413447 1.9544997 ]
|
||
# Notice: smooth transition through 0, positive bias
|
||
|
||
MEMORY EFFICIENCY:
|
||
- Unfused: 5 temporary arrays × input_size × 4 bytes
|
||
- Fused: 0 temporary arrays, direct computation
|
||
- Bandwidth reduction: ~80% for memory-bound operations
|
||
|
||
HINTS:
|
||
- Use np.sqrt(2.0 / np.pi) for the constant
|
||
- Keep entire expression in one line for maximum fusion
|
||
- NumPy will optimize the expression tree automatically
|
||
"""
|
||
### BEGIN SOLUTION
|
||
# Mathematical constant for GELU approximation
|
||
sqrt_2_over_pi = np.sqrt(2.0 / np.pi)
|
||
|
||
# Fused GELU computation - all operations in single expression
|
||
# This minimizes memory bandwidth by avoiding intermediate arrays
|
||
# NumPy's expression evaluator will optimize this into efficient machine code
|
||
result_data = 0.5 * x.data * (
|
||
1.0 + np.tanh(sqrt_2_over_pi * (x.data + 0.044715 * x.data**3))
|
||
)
|
||
|
||
return Tensor(result_data)
|
||
### END SOLUTION
|
||
|
||
# %% nbgrader={"grade": true, "grade_id": "test-fused-gelu", "locked": true, "points": 10}
|
||
def test_unit_fused_gelu():
|
||
"""🔬 Test fused GELU activation implementation."""
|
||
print("🔬 Unit Test: Fused GELU...")
|
||
|
||
# Test basic properties
|
||
x = Tensor([-3, -1, 0, 1, 3])
|
||
result = fused_gelu(x)
|
||
|
||
# GELU(0) = 0 (exact property)
|
||
assert abs(result.data[2]) < 1e-6, f"GELU(0) should be 0, got {result.data[2]}"
|
||
|
||
# GELU is smooth and increasing
|
||
assert result.data[4] > result.data[3] > result.data[2], "GELU should be increasing"
|
||
|
||
# GELU has positive bias (unlike ReLU)
|
||
assert result.data[3] > 0.8, "GELU(1) should be close to 1"
|
||
assert result.data[1] > -0.2, "GELU(-1) should be slightly negative"
|
||
|
||
# Test numerical stability with extreme values
|
||
x_extreme = Tensor([-10, -5, 0, 5, 10])
|
||
result_extreme = fused_gelu(x_extreme)
|
||
|
||
assert not np.any(np.isnan(result_extreme.data)), "No NaN values allowed"
|
||
assert not np.any(np.isinf(result_extreme.data)), "No infinite values allowed"
|
||
|
||
# Test large tensor processing
|
||
x_large = Tensor(np.random.randn(1000, 1000).astype(np.float32))
|
||
result_large = fused_gelu(x_large)
|
||
|
||
assert result_large.shape == x_large.shape, "Shape preservation failed"
|
||
assert result_large.data.dtype == np.float32, "Data type preservation failed"
|
||
|
||
# Test that positive inputs are mostly preserved (GELU ≈ x for large positive x)
|
||
x_positive = Tensor([5.0])
|
||
result_positive = fused_gelu(x_positive)
|
||
assert result_positive.data[0] > 4.9, "Large positive values should be nearly preserved"
|
||
|
||
print("✅ fused_gelu works correctly!")
|
||
|
||
test_unit_fused_gelu()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### 🔬 Performance Analysis: Measuring Fusion Benefits
|
||
|
||
Let's quantify the impact of kernel fusion by comparing fused vs unfused implementations.
|
||
"""
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "unfused-gelu", "solution": true}
|
||
def unfused_gelu(x: Tensor) -> Tensor:
|
||
"""
|
||
Deliberately unfused GELU implementation for performance comparison.
|
||
|
||
This version creates multiple intermediate tensors to simulate
|
||
the memory bandwidth overhead of unfused operations.
|
||
|
||
TODO: Implement GELU with explicit intermediate steps
|
||
|
||
APPROACH:
|
||
1. Break computation into individual steps
|
||
2. Create temporary Tensor objects for each step
|
||
3. This simulates real memory allocation overhead
|
||
|
||
PERFORMANCE IMPACT:
|
||
- Creates 7 temporary arrays
|
||
- Each array allocation/deallocation has overhead
|
||
- More memory bandwidth usage
|
||
- Potential cache misses between operations
|
||
"""
|
||
### BEGIN SOLUTION
|
||
# Unfused version - creates many intermediate arrays
|
||
sqrt_2_over_pi = np.sqrt(2.0 / np.pi)
|
||
|
||
# Each operation creates a temporary array (simulating kernel launches)
|
||
temp1 = Tensor(x.data**3) # x³
|
||
temp2 = Tensor(0.044715 * temp1.data) # 0.044715 * x³
|
||
temp3 = Tensor(x.data + temp2.data) # x + 0.044715 * x³
|
||
temp4 = Tensor(sqrt_2_over_pi * temp3.data) # √(2/π) * (...)
|
||
temp5 = Tensor(np.tanh(temp4.data)) # tanh(...)
|
||
temp6 = Tensor(1.0 + temp5.data) # 1 + tanh(...)
|
||
temp7 = Tensor(x.data * temp6.data) # x * (1 + tanh(...))
|
||
result = Tensor(0.5 * temp7.data) # 0.5 * x * (...)
|
||
|
||
return result
|
||
### END SOLUTION
|
||
|
||
# %% nbgrader={"grade": true, "grade_id": "test-fusion-speedup", "locked": true, "points": 10}
|
||
def test_unit_fusion_speedup():
|
||
"""🔬 Measure the performance impact of kernel fusion."""
|
||
print("🔬 Unit Test: Kernel Fusion Performance Impact...")
|
||
|
||
# Create moderately large tensor for meaningful timing
|
||
size = 2000
|
||
x = Tensor(np.random.randn(size, size).astype(np.float32))
|
||
warmup_iterations = 2
|
||
timing_iterations = 5
|
||
|
||
# Warmup both implementations
|
||
for _ in range(warmup_iterations):
|
||
_ = unfused_gelu(x)
|
||
_ = fused_gelu(x)
|
||
|
||
# Time unfused version
|
||
start = time.time()
|
||
for _ in range(timing_iterations):
|
||
result_unfused = unfused_gelu(x)
|
||
unfused_time = time.time() - start
|
||
|
||
# Time fused version
|
||
start = time.time()
|
||
for _ in range(timing_iterations):
|
||
result_fused = fused_gelu(x)
|
||
fused_time = time.time() - start
|
||
|
||
# Verify numerical correctness
|
||
assert np.allclose(result_unfused.data, result_fused.data, atol=1e-6), \
|
||
"Fused and unfused implementations must be numerically equivalent"
|
||
|
||
# Calculate performance metrics
|
||
speedup = unfused_time / fused_time if fused_time > 0 else 1.0
|
||
unfused_per_elem = (unfused_time / timing_iterations) / (size * size) * 1e9 # ns per element
|
||
fused_per_elem = (fused_time / timing_iterations) / (size * size) * 1e9
|
||
|
||
print(f"📊 Kernel Fusion Performance Analysis:")
|
||
print(f" Tensor size: {size}×{size} = {size*size:,} elements")
|
||
print(f" Unfused time: {unfused_time/timing_iterations*1000:.2f} ms")
|
||
print(f" Fused time: {fused_time/timing_iterations*1000:.2f} ms")
|
||
print(f" Speedup: {speedup:.2f}× faster")
|
||
print(f" Per-element: {unfused_per_elem:.1f} ns → {fused_per_elem:.1f} ns")
|
||
|
||
# Memory bandwidth estimate
|
||
bytes_per_elem = 4 # float32
|
||
unfused_memory_ops = 7 # 7 intermediate arrays
|
||
fused_memory_ops = 2 # read input, write output
|
||
|
||
unfused_bandwidth = (unfused_memory_ops * size * size * bytes_per_elem) / (unfused_time / timing_iterations) / 1e9
|
||
fused_bandwidth = (fused_memory_ops * size * size * bytes_per_elem) / (fused_time / timing_iterations) / 1e9
|
||
|
||
print(f" Memory efficiency: {unfused_memory_ops}→{fused_memory_ops} memory ops")
|
||
print(f" Effective bandwidth: {unfused_bandwidth:.1f}→{fused_bandwidth:.1f} GB/s")
|
||
|
||
# Interpret results
|
||
if speedup > 1.5:
|
||
print("🚀 Excellent! Kernel fusion providing significant speedup")
|
||
elif speedup > 1.1:
|
||
print("✅ Good! Kernel fusion providing measurable benefit")
|
||
else:
|
||
print("⚠️ Limited speedup - may be compute-bound or small tensor size")
|
||
|
||
print("✅ Fusion performance analysis completed!")
|
||
|
||
test_unit_fusion_speedup()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## 4. Integration - Mixed Precision Training: Memory and Speed
|
||
|
||
### The Mixed Precision Revolution
|
||
|
||
Modern GPUs (like V100, A100) have specialized **Tensor Cores** that can perform FP16 operations much faster than FP32:
|
||
|
||
```
|
||
Performance Comparison (Theoretical Peak):
|
||
┌─────────────────┬────────────────┬────────────────┐
|
||
│ Precision │ V100 TFLOPS │ A100 TFLOPS │
|
||
├─────────────────┼────────────────┼────────────────┤
|
||
│ FP32 (float) │ 15.7 │ 19.5 │
|
||
│ FP16 (half) │ 125.0 │ 312.0 │
|
||
│ Speedup │ 8× │ 16× │
|
||
└─────────────────┴────────────────┴────────────────┘
|
||
```
|
||
|
||
### The Challenge: FP16 Precision Limitations
|
||
|
||
FP16 has a much smaller range than FP32:
|
||
|
||
```
|
||
FP32 (32-bit): FP16 (16-bit):
|
||
┌─────────────────────────────┐ ┌───────────────┐
|
||
│ Sign │ 8-bit │ 23-bit │ │Sign│5-bit│10-bit│
|
||
│ bit │ Exp │ Mantissa │ │bit │ Exp │Mant. │
|
||
└─────────────────────────────┘ └───────────────┘
|
||
Range: ±3.4 × 10³⁸ Range: ±6.5 × 10⁴
|
||
Precision: ~7 decimal digits Precision: ~3 decimal digits
|
||
|
||
Problem: Small gradients (< 6e-5) become ZERO in FP16!
|
||
```
|
||
|
||
### The Solution: Automatic Loss Scaling
|
||
|
||
```
|
||
Training Step Without Scaling: Training Step With Scaling:
|
||
|
||
Loss = 0.0001 Loss = 0.0001
|
||
↓ ↓
|
||
Gradients = 0.00001 Scale × 1024
|
||
↓ ↓
|
||
Convert to FP16 Loss = 0.1024
|
||
↓ ↓
|
||
Gradients = 0.0 (UNDERFLOW!) Gradients = 0.01024
|
||
↓ ↓
|
||
No learning! Convert to FP16: 0.01024 ✓
|
||
↓
|
||
Unscale: 0.01024 / 1024 = 0.00001
|
||
↓
|
||
Successful learning!
|
||
```
|
||
|
||
### Mixed Precision Memory Benefits
|
||
|
||
```
|
||
Model Component Breakdown:
|
||
┌─────────────────┬─────────────┬─────────────┬─────────────┐
|
||
│ Component │ FP32 Memory │ FP16 Memory │ Savings │
|
||
├─────────────────┼─────────────┼─────────────┼─────────────┤
|
||
│ Parameters │ 4N │ 4N │ 0% │
|
||
│ Gradients │ 4N │ 2N │ 50% │
|
||
│ Activations │ 4A │ 2A │ 50% │
|
||
│ Optimizer State │ 8N │ 8N │ 0% │
|
||
├─────────────────┼─────────────┼─────────────┼─────────────┤
|
||
│ Total Typical │ ~20N │ ~16N │ 20% │
|
||
│ Activation-Heavy│ ~40N │ ~24N │ 40% │
|
||
└─────────────────┴─────────────┴─────────────┴─────────────┘
|
||
|
||
N = parameter count, A = activation memory
|
||
```
|
||
"""
# %% nbgrader={"grade": false, "grade_id": "mixed-precision-trainer", "solution": true}
|
||
class MixedPrecisionTrainer:
|
||
"""
|
||
Mixed precision trainer with automatic loss scaling.
|
||
|
||
Implements the same pattern as PyTorch's Automatic Mixed Precision (AMP):
|
||
1. Forward pass in FP16 for speed and memory efficiency
|
||
2. Loss scaling to prevent gradient underflow
|
||
3. Gradient computation and unscaling
|
||
4. Parameter updates in FP32 for numerical stability
|
||
|
||
The key insight: keep different parts of training in optimal precision.
|
||
"""
|
||
|
||
def __init__(self, model, optimizer, loss_scale: float = 1024.0, max_loss_scale: float = 65536.0):
|
||
"""
|
||
Initialize mixed precision training infrastructure.
|
||
|
||
TODO: Set up automatic loss scaling and overflow detection
|
||
|
||
APPROACH:
|
||
1. Store model and optimizer references
|
||
2. Initialize dynamic loss scaling parameters
|
||
3. Set up overflow detection and scale adjustment logic
|
||
|
||
Args:
|
||
model: Neural network model
|
||
optimizer: Parameter optimizer (SGD, Adam, etc.)
|
||
loss_scale: Initial scaling factor for gradients
|
||
max_loss_scale: Maximum allowed loss scale
|
||
|
||
LOSS SCALING STRATEGY:
|
||
- Start with reasonable scale (1024)
|
||
- Increase gradually if no overflow (better precision)
|
||
- Decrease immediately on overflow (stability)
|
||
- This balances numerical precision with training stability
|
||
|
||
HINTS:
|
||
- Track consecutive successful steps for scale increases
|
||
- Use exponential backoff on overflow detection
|
||
- Keep scale within reasonable bounds [1, 65536]
|
||
"""
|
||
### BEGIN SOLUTION
|
||
self.model = model
|
||
self.optimizer = optimizer
|
||
|
||
# Loss scaling parameters
|
||
self.loss_scale = loss_scale
|
||
self.max_loss_scale = max_loss_scale
|
||
self.min_loss_scale = 1.0
|
||
|
||
# Dynamic scaling parameters
|
||
self.scale_growth_factor = 2.0 # Multiply by 2 when increasing
|
||
self.scale_backoff_factor = 0.5 # Divide by 2 when decreasing
|
||
self.growth_interval = 2000 # Steps between scale increases
|
||
self.steps_since_last_scale_update = 0
|
||
|
||
# Overflow tracking
|
||
self.overflow_detected = False
|
||
### END SOLUTION
|
||
|
||
def scale_loss(self, loss: Tensor) -> Tensor:
|
||
"""
|
||
Scale loss to prevent gradient underflow in FP16.
|
||
|
||
The fundamental challenge: FP16's smallest normal value is about 6e-5, so much smaller gradients lose precision or flush to zero.
|
||
Small gradients (common in deep networks) become zero without scaling.
|
||
|
||
TODO: Apply loss scaling for mixed precision stability
|
||
|
||
APPROACH:
|
||
1. Multiply loss by current scale factor
|
||
2. This amplifies gradients proportionally
|
||
3. Return scaled loss for backward pass
|
||
|
||
MATHEMATICAL INSIGHT:
|
||
If loss = 1e-6 and scale = 1024:
|
||
scaled_loss = 1e-6 × 1024 = 1.024e-3
|
||
|
||
After backward pass:
|
||
scaled_gradients = 1.024e-3 × dloss/dparam = 1024 × gradients
|
||
|
||
These larger gradients survive FP16 conversion!
|
||
|
||
EXAMPLE:
|
||
>>> trainer = MixedPrecisionTrainer(model, optimizer)
|
||
>>> loss = Tensor([0.0001]) # Small loss
|
||
>>> scaled = trainer.scale_loss(loss)
|
||
>>> print(scaled.data) # [0.1024] (0.0001 × 1024)
|
||
"""
|
||
### BEGIN SOLUTION
|
||
# Scale the loss to amplify gradients
|
||
# This prevents gradient underflow in FP16 arithmetic
|
||
scaled_data = loss.data * self.loss_scale
|
||
return Tensor(scaled_data)
|
||
### END SOLUTION
|
||
|
||
def unscale_gradients(self, parameters: List[Tensor]) -> bool:
|
||
"""
|
||
Unscale gradients and detect overflow from FP16 conversion.
|
||
|
||
After backward pass on scaled loss, gradients are scaled too.
|
||
We must unscale them AND check for overflow/underflow.
|
||
|
||
TODO: Implement gradient unscaling with overflow detection
|
||
|
||
APPROACH:
|
||
1. Divide all gradients by loss scale (restore original magnitude)
|
||
2. Check for inf/nan values (indicates FP16 overflow)
|
||
3. Return True if gradients are valid, False if overflow detected
|
||
|
||
OVERFLOW DETECTION:
|
||
inf/nan in gradients indicates:
|
||
- Gradient magnitude too large for FP16
|
||
- Numerical instability in computation
|
||
- Loss scale too aggressive
|
||
|
||
When overflow occurs:
|
||
- Skip parameter update (unstable gradients)
|
||
- Reduce loss scale for next iteration
|
||
- Continue training with lower scale
|
||
|
||
HINTS:
|
||
- Use np.isfinite() to detect inf/nan efficiently
|
||
- Process all parameters even if overflow found
|
||
- Set self.overflow_detected flag for scale adjustment
|
||
"""
|
||
### BEGIN SOLUTION
|
||
self.overflow_detected = False
|
||
|
||
# Unscale all gradients and check for overflow
|
||
for param in parameters:
|
||
if param.grad is not None:
|
||
# Unscale gradients to original magnitude
|
||
param.grad.data = param.grad.data / self.loss_scale
|
||
|
||
# Check for overflow/underflow (inf/nan values)
|
||
if not np.all(np.isfinite(param.grad.data)):
|
||
self.overflow_detected = True
|
||
# Continue processing to unscale all gradients
|
||
|
||
return not self.overflow_detected
|
||
### END SOLUTION
|
||
|
||
def update_loss_scale(self):
|
||
"""
|
||
Dynamically adjust loss scale based on training stability.
|
||
|
||
Implements the "Goldilocks" principle for loss scaling:
|
||
- Too low: precision loss from small gradients
|
||
- Too high: overflow and instability
|
||
- Just right: maximum precision without overflow
|
||
|
||
TODO: Implement adaptive loss scale adjustment
|
||
|
||
APPROACH:
|
||
1. If overflow detected: reduce scale immediately (stability)
|
||
2. If no overflow for many steps: increase scale (precision)
|
||
3. Keep scale within reasonable bounds
|
||
|
||
SCALING STRATEGY:
|
||
- Aggressive reduction on overflow (×0.5)
|
||
- Conservative growth during stability (×2 every 2000 steps)
|
||
- This favors stability over maximum precision
|
||
|
||
WHY THIS WORKS:
|
||
- Most training is stable (gradual scale increase)
|
||
- Occasional instability (rapid scale decrease)
|
||
- Converges to optimal scale for current training phase
|
||
"""
|
||
### BEGIN SOLUTION
|
||
if self.overflow_detected:
|
||
# Immediately reduce scale on overflow
|
||
self.loss_scale = max(
|
||
self.min_loss_scale,
|
||
self.loss_scale * self.scale_backoff_factor
|
||
)
|
||
self.steps_since_last_scale_update = 0
|
||
else:
|
||
# Gradually increase scale if stable
|
||
self.steps_since_last_scale_update += 1
|
||
if self.steps_since_last_scale_update >= self.growth_interval:
|
||
self.loss_scale = min(
|
||
self.max_loss_scale,
|
||
self.loss_scale * self.scale_growth_factor
|
||
)
|
||
self.steps_since_last_scale_update = 0
|
||
### END SOLUTION
|
||
|
||
def train_step(self, batch: Tuple[Tensor, Tensor]) -> Dict[str, float]:
|
||
"""
|
||
Execute complete mixed precision training step.
|
||
|
||
Orchestrates the entire mixed precision training process:
|
||
1. Forward pass (FP16 in real implementation)
|
||
2. Loss computation and scaling
|
||
3. Backward pass on scaled loss
|
||
4. Gradient unscaling and overflow detection
|
||
5. Conditional parameter update
|
||
6. Loss scale adjustment
|
||
|
||
TODO: Implement end-to-end mixed precision training step
|
||
|
||
APPROACH:
|
||
1. Clear gradients from previous step
|
||
2. Forward pass through model
|
||
3. Compute and scale loss
|
||
4. Backward pass to compute scaled gradients
|
||
5. Unscale gradients and check for overflow
|
||
6. Update parameters only if no overflow
|
||
7. Adjust loss scale based on stability
|
||
|
||
CRITICAL INSIGHT:
|
||
Skip parameter updates on overflow! Unstable gradients
|
||
would move parameters in wrong direction.
|
||
|
||
RETURN FORMAT:
|
||
Dictionary with training metrics:
|
||
- loss: unscaled loss value
|
||
- loss_scale: current scaling factor
|
||
- overflow: whether overflow occurred
|
||
- gradients_valid: whether update was applied
|
||
|
||
HINTS:
|
||
- Use self.optimizer.zero_grad() to clear gradients
|
||
- Get parameters with gradients for unscaling
|
||
- Only call optimizer.step() if gradients are valid
|
||
"""
|
||
### BEGIN SOLUTION
|
||
inputs, targets = batch
|
||
|
||
# Clear gradients from previous step
|
||
self.optimizer.zero_grad()
|
||
|
||
# Forward pass (would use FP16 autocast in real implementation)
|
||
# For simulation, we work in FP32 but apply scaling principles
|
||
outputs = self.model(inputs)
|
||
|
||
# Compute loss (unscaled)
|
||
loss = self._compute_loss(outputs, targets)
|
||
|
||
# Scale loss for mixed precision
|
||
scaled_loss = self.scale_loss(loss)
|
||
|
||
# Backward pass on scaled loss
|
||
scaled_loss.backward()
|
||
|
||
# Get all parameters with gradients
|
||
parameters = [p for p in self.model.parameters() if p.grad is not None]
|
||
|
||
# Unscale gradients and detect overflow
|
||
gradients_valid = self.unscale_gradients(parameters)
|
||
|
||
# Update parameters only if no overflow
|
||
if gradients_valid:
|
||
self.optimizer.step()
|
||
|
||
# Adjust loss scale based on stability
|
||
self.update_loss_scale()
|
||
|
||
# Return training metrics
|
||
return {
|
||
'loss': loss.data.item() if hasattr(loss.data, 'item') else float(loss.data),
|
||
'loss_scale': self.loss_scale,
|
||
'overflow': self.overflow_detected,
|
||
'gradients_valid': gradients_valid
|
||
}
|
||
### END SOLUTION
|
||
|
||
def _compute_loss(self, outputs: Tensor, targets: Tensor) -> Tensor:
|
||
"""Simple MSE loss for demonstration purposes."""
|
||
diff = Tensor(outputs.data - targets.data)
|
||
return Tensor(np.mean(diff.data**2))
|
||
|
||
# %% nbgrader={"grade": true, "grade_id": "test-mixed-precision", "locked": true, "points": 15}
|
||
def test_unit_mixed_precision():
|
||
"""🔬 Test mixed precision training components comprehensively."""
|
||
print("🔬 Unit Test: Mixed Precision Training...")
|
||
|
||
# Create mock model and optimizer for testing
|
||
class MockModel:
|
||
def __init__(self):
|
||
self.weight = Tensor(np.random.randn(10, 5).astype(np.float32))
|
||
self.weight.grad = None
|
||
|
||
def __call__(self, x):
|
||
return x.matmul(self.weight)
|
||
|
||
def parameters(self):
|
||
return [self.weight]
|
||
|
||
class MockOptimizer:
|
||
def __init__(self, params):
|
||
self.params = params
|
||
self.updates_applied = 0
|
||
|
||
def zero_grad(self):
|
||
for p in self.params:
|
||
p.grad = None
|
||
|
||
def step(self):
|
||
for p in self.params:
|
||
if p.grad is not None:
|
||
p.data = p.data - 0.01 * p.grad.data
|
||
self.updates_applied += 1
|
||
|
||
# Initialize mixed precision trainer
|
||
model = MockModel()
|
||
optimizer = MockOptimizer(model.parameters())
|
||
trainer = MixedPrecisionTrainer(model, optimizer, loss_scale=1024.0)
|
||
|
||
# Test 1: Loss scaling
|
||
print(" Testing loss scaling...")
|
||
loss = Tensor([0.001])
|
||
scaled_loss = trainer.scale_loss(loss)
|
||
expected_scaled = 0.001 * 1024.0
|
||
assert np.isclose(scaled_loss.data[0], expected_scaled), \
|
||
f"Loss scaling failed: expected {expected_scaled}, got {scaled_loss.data[0]}"
|
||
|
||
# Test 2: Gradient unscaling (normal case)
|
||
print(" Testing gradient unscaling...")
|
||
model.weight.grad = Tensor(np.full((10, 5), 1024.0)) # Simulate scaled gradients
|
||
valid = trainer.unscale_gradients([model.weight])
|
||
assert valid, "Should detect valid gradients"
|
||
assert np.allclose(model.weight.grad.data, 1.0), "Gradient unscaling failed"
|
||
|
||
# Test 3: Overflow detection
|
||
print(" Testing overflow detection...")
|
||
model.weight.grad = Tensor(np.full((10, 5), np.inf)) # Simulate overflow
|
||
valid = trainer.unscale_gradients([model.weight])
|
||
assert not valid, "Should detect overflow"
|
||
assert trainer.overflow_detected, "Overflow flag not set"
|
||
|
||
# Test 4: Loss scale adjustment after overflow
|
||
print(" Testing loss scale adjustment...")
|
||
initial_scale = trainer.loss_scale
|
||
trainer.update_loss_scale() # Should reduce scale due to overflow
|
||
assert trainer.loss_scale < initial_scale, \
|
||
f"Scale should decrease after overflow: {initial_scale} → {trainer.loss_scale}"
|
||
|
||
# Test 5: Loss scale increase during stability
|
||
print(" Testing loss scale increase...")
|
||
trainer.overflow_detected = False
|
||
trainer.steps_since_last_scale_update = 2000 # Simulate stable training
|
||
scale_before = trainer.loss_scale
|
||
trainer.update_loss_scale()
|
||
assert trainer.loss_scale > scale_before, "Scale should increase during stability"
|
||
|
||
# Test 6: End-to-end training step
|
||
print(" Testing complete training step...")
|
||
inputs = Tensor(np.random.randn(8, 10).astype(np.float32))
|
||
targets = Tensor(np.random.randn(8, 5).astype(np.float32))
|
||
|
||
initial_updates = optimizer.updates_applied
|
||
metrics = trainer.train_step((inputs, targets))
|
||
|
||
# Verify metrics structure
|
||
required_keys = ['loss', 'loss_scale', 'overflow', 'gradients_valid']
|
||
for key in required_keys:
|
||
assert key in metrics, f"Missing metric: {key}"
|
||
|
||
# Verify loss is reasonable
|
||
assert isinstance(metrics['loss'], (int, float)), "Loss should be numeric"
|
||
assert metrics['loss'] >= 0, "Loss should be non-negative"
|
||
|
||
# Verify loss scale is positive
|
||
assert metrics['loss_scale'] > 0, "Loss scale should be positive"
|
||
|
||
print("✅ Mixed precision training works correctly!")
|
||
|
||
test_unit_mixed_precision()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## 5. Systems Analysis - Performance Scaling Patterns
|
||
|
||
Let's analyze how our acceleration techniques perform across different scenarios and understand their scaling characteristics.
|
||
"""
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "analyze-vectorization", "solution": true}
|
||
def analyze_vectorization_scaling():
|
||
"""📊 Analyze vectorization performance across different tensor sizes."""
|
||
print("📊 Analyzing vectorization scaling behavior...")
|
||
|
||
# Test sizes spanning different cache regimes
|
||
sizes = [64, 128, 256, 512, 1024, 2048]
|
||
|
||
print("\n🔍 Vectorization Scaling Analysis:")
|
||
print("┌─────────┬─────────────┬─────────────┬─────────────┬─────────────┐")
|
||
print("│ Size │ Time (ms) │ GFLOPS │ Bandwidth │ Efficiency │")
|
||
print("│ │ │ │ (GB/s) │ (% of peak) │")
|
||
print("├─────────┼─────────────┼─────────────┼─────────────┼─────────────┤")
|
||
|
||
for size in sizes:
|
||
# Create test matrices
|
||
a = Tensor(np.random.randn(size, size).astype(np.float32))
|
||
b = Tensor(np.random.randn(size, size).astype(np.float32))
|
||
|
||
# Warm up
|
||
for _ in range(2):
|
||
_ = vectorized_matmul(a, b)
|
||
|
||
# Time vectorized implementation
|
||
iterations = max(1, 100 // (size // 64)) # Fewer iterations for larger sizes
|
||
start = time.time()
|
||
for _ in range(iterations):
|
||
result = vectorized_matmul(a, b)
|
||
elapsed = (time.time() - start) / iterations
|
||
|
||
# Calculate performance metrics
|
||
flops = 2 * size**3 # 2N³ FLOPs for matrix multiplication
|
||
gflops = flops / (elapsed * 1e9)
|
||
|
||
bytes_accessed = 3 * size * size * 4 # 3 matrices × size² × 4 bytes
|
||
bandwidth = bytes_accessed / (elapsed * 1e9)
|
||
|
||
# Estimate efficiency (rough baseline: modern CPU ~100-500 GFLOPS peak)
|
||
estimated_peak_gflops = 200 # Conservative estimate
|
||
efficiency = min(100, gflops / estimated_peak_gflops * 100)
|
||
|
||
print(f"│ {size:6d} │ {elapsed*1000:9.2f} │ {gflops:9.1f} │ {bandwidth:9.1f} │ {efficiency:9.1f} │")
|
||
|
||
print("└─────────┴─────────────┴─────────────┴─────────────┴─────────────┘")
|
||
|
||
print(f"\n💡 Vectorization insights:")
|
||
print(f" • Small matrices: Limited by overhead and cache effects")
|
||
print(f" • Medium matrices: Sweet spot for cache reuse")
|
||
print(f" • Large matrices: Memory bandwidth becomes limiting factor")
|
||
print(f" • BLAS libraries automatically optimize for each size regime")
|
||
print("🚀 Vectorization effectiveness depends on problem size and hardware")
|
||
|
||
analyze_vectorization_scaling()
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "analyze-arithmetic-intensity", "solution": true}
|
||
def analyze_arithmetic_intensity():
|
||
"""📊 Demonstrate the roofline model with different operations."""
|
||
print("📊 Analyzing arithmetic intensity patterns...")
|
||
|
||
size = 1024
|
||
iterations = 10
|
||
|
||
operations = []
|
||
|
||
# Create test data
|
||
x = Tensor(np.random.randn(size, size).astype(np.float32))
|
||
y = Tensor(np.random.randn(size, size).astype(np.float32))
|
||
|
||
print("\n🎯 Arithmetic Intensity Analysis:")
|
||
print("┌─────────────────────┬─────────┬─────────────┬─────────────┬─────────────┐")
|
||
print("│ Operation │ AI │ Time (ms) │ GFLOPS │ GB/s │")
|
||
print("│ │(FLOPs/B)│ │ │ │")
|
||
print("├─────────────────────┼─────────┼─────────────┼─────────────┼─────────────┤")
|
||
|
||
# 1. Element-wise addition (very low arithmetic intensity)
|
||
start = time.time()
|
||
for _ in range(iterations):
|
||
_ = Tensor(x.data + y.data)
|
||
add_time = (time.time() - start) / iterations
|
||
|
||
add_flops = size * size # One addition per element
|
||
add_bytes = 3 * size * size * 4 # Read x, read y, write result
|
||
add_ai = add_flops / add_bytes
|
||
add_gflops = add_flops / (add_time * 1e9)
|
||
add_bandwidth = add_bytes / (add_time * 1e9)
|
||
|
||
print(f"│ Element-wise Add │ {add_ai:6.3f} │ {add_time*1000:9.2f} │ {add_gflops:9.1f} │ {add_bandwidth:9.1f} │")
|
||
|
||
# 2. Element-wise multiply (still low, but slightly higher)
|
||
start = time.time()
|
||
for _ in range(iterations):
|
||
_ = Tensor(x.data * y.data)
|
||
mul_time = (time.time() - start) / iterations
|
||
|
||
mul_flops = size * size
|
||
mul_bytes = 3 * size * size * 4
|
||
mul_ai = mul_flops / mul_bytes
|
||
mul_gflops = mul_flops / (mul_time * 1e9)
|
||
mul_bandwidth = mul_bytes / (mul_time * 1e9)
|
||
|
||
print(f"│ Element-wise Mult │ {mul_ai:6.3f} │ {mul_time*1000:9.2f} │ {mul_gflops:9.1f} │ {mul_bandwidth:9.1f} │")
|
||
|
||
# 3. GELU (medium arithmetic intensity)
|
||
start = time.time()
|
||
for _ in range(iterations):
|
||
_ = fused_gelu(x)
|
||
gelu_time = (time.time() - start) / iterations
|
||
|
||
gelu_flops = size * size * 8 # Approximate: x³, add, mul, tanh, etc.
|
||
gelu_bytes = 2 * size * size * 4 # Read x, write result
|
||
gelu_ai = gelu_flops / gelu_bytes
|
||
gelu_gflops = gelu_flops / (gelu_time * 1e9)
|
||
gelu_bandwidth = gelu_bytes / (gelu_time * 1e9)
|
||
|
||
print(f"│ Fused GELU │ {gelu_ai:6.3f} │ {gelu_time*1000:9.2f} │ {gelu_gflops:9.1f} │ {gelu_bandwidth:9.1f} │")
|
||
|
||
# 4. Matrix multiplication (high arithmetic intensity)
|
||
start = time.time()
|
||
for _ in range(iterations):
|
||
_ = vectorized_matmul(x, y)
|
||
matmul_time = (time.time() - start) / iterations
|
||
|
||
matmul_flops = 2 * size**3 # 2N³ FLOPs
|
||
matmul_bytes = 3 * size * size * 4 # 3 matrices
|
||
matmul_ai = matmul_flops / matmul_bytes
|
||
matmul_gflops = matmul_flops / (matmul_time * 1e9)
|
||
matmul_bandwidth = matmul_bytes / (matmul_time * 1e9)
|
||
|
||
print(f"│ Matrix Multiply │ {matmul_ai:6.3f} │ {matmul_time*1000:9.2f} │ {matmul_gflops:9.1f} │ {matmul_bandwidth:9.1f} │")
|
||
|
||
print("└─────────────────────┴─────────┴─────────────┴─────────────┴─────────────┘")
|
||
|
||
print(f"\n💡 Roofline Model Insights:")
|
||
print(f" 📊 Low AI (< 1): Memory bound - limited by bandwidth")
|
||
print(f" 📊 Med AI (1-10): Transitional - depends on implementation")
|
||
print(f" 📊 High AI (> 10): Compute bound - limited by ALU throughput")
|
||
print(f" 🎯 Matrix multiplication ({matmul_ai:.1f} AI) is ideal for GPUs/TPUs")
|
||
print(f" ⚡ Element-wise ops ({add_ai:.3f} AI) need memory optimization")
|
||
print("🚀 Design algorithms with high arithmetic intensity for performance")
|
||
|
||
analyze_arithmetic_intensity()
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "analyze-mixed-precision-benefits", "solution": true}
|
||
def analyze_mixed_precision_benefits():
|
||
"""📊 Quantify mixed precision memory and performance benefits."""
|
||
print("📊 Analyzing mixed precision benefits across model sizes...")
|
||
|
||
# Define representative model configurations
|
||
model_configs = [
|
||
("Tiny CNN", {"params": 50_000, "activations": 100_000}),
|
||
("Small BERT", {"params": 10_000_000, "activations": 5_000_000}),
|
||
("Medium GPT", {"params": 100_000_000, "activations": 50_000_000}),
|
||
("Large Transformer", {"params": 1_000_000_000, "activations": 500_000_000}),
|
||
]
|
||
|
||
print("\n🧮 Mixed Precision Memory Analysis:")
|
||
print("┌─────────────────┬─────────────┬─────────────┬─────────────┬─────────────┐")
|
||
print("│ Model Type │ Parameters │ FP32 Memory │ FP16 Memory │ Savings │")
|
||
print("│ │ │ (GB) │ (GB) │ (%) │")
|
||
print("├─────────────────┼─────────────┼─────────────┼─────────────┼─────────────┤")
|
||
|
||
for name, config in model_configs:
|
||
param_count = config["params"]
|
||
activation_count = config["activations"]
|
||
|
||
# Memory calculation (bytes)
|
||
# Parameters: always FP32 for stability
|
||
param_memory = param_count * 4
|
||
|
||
# FP32 training memory
|
||
fp32_activations = activation_count * 4
|
||
fp32_gradients = param_count * 4
|
||
fp32_optimizer = param_count * 8 # Adam: momentum + velocity
|
||
fp32_total = param_memory + fp32_activations + fp32_gradients + fp32_optimizer
|
||
|
||
# Mixed precision memory
|
||
fp16_activations = activation_count * 2 # FP16 activations
|
||
fp16_gradients = param_count * 2 # FP16 gradients during backward
|
||
mixed_total = param_memory + fp16_activations + fp16_gradients + fp32_optimizer
|
||
|
||
# Calculate savings
|
||
savings_gb = (fp32_total - mixed_total) / 1e9
|
||
savings_pct = (fp32_total - mixed_total) / fp32_total * 100
|
||
|
||
print(f"│ {name:14s} │ {param_count:10,d} │ {fp32_total/1e9:9.1f} │ {mixed_total/1e9:9.1f} │ {savings_pct:9.1f} │")
|
||
|
||
print("└─────────────────┴─────────────┴─────────────┴─────────────┴─────────────┘")
|
||
|
||
# Performance simulation
|
||
print(f"\n⚡ Mixed Precision Performance Simulation:")
|
||
|
||
# Simulate different batch sizes to show memory pressure
|
||
batch_sizes = [8, 16, 32, 64]
|
||
hidden_size = 1024
|
||
seq_length = 512
|
||
|
||
print("┌─────────────┬─────────────┬─────────────┬─────────────┬─────────────┐")
|
||
print("│ Batch Size │ FP32 Mem │ FP16 Mem │ Throughput │ Efficiency │")
|
||
print("│ │ (GB) │ (GB) │ Gain │ Gain │")
|
||
print("├─────────────┼─────────────┼─────────────┼─────────────┼─────────────┤")
|
||
|
||
for batch_size in batch_sizes:
|
||
# Memory for activations (dominant for large models)
|
||
elements = batch_size * seq_length * hidden_size
|
||
|
||
fp32_mem = elements * 4 / 1e9 # 4 bytes per FP32
|
||
fp16_mem = elements * 2 / 1e9 # 2 bytes per FP16
|
||
|
||
# Simulate throughput gains (based on Tensor Core speedups)
|
||
# Real speedups depend on hardware and operation mix
|
||
throughput_gain = 1.4 # Conservative estimate for mixed workloads
|
||
|
||
# Memory efficiency enables larger batch sizes
|
||
max_fp32_batch = 32 # Assume memory limit
|
||
max_fp16_batch = 64 # Double capacity with FP16
|
||
|
||
efficiency_gain = max_fp16_batch / max_fp32_batch if batch_size <= max_fp32_batch else "OOM"
|
||
efficiency_str = f"{efficiency_gain:.1f}×" if isinstance(efficiency_gain, float) else efficiency_gain
|
||
|
||
print(f"│ {batch_size:10d} │ {fp32_mem:9.2f} │ {fp16_mem:9.2f} │ {throughput_gain:9.1f}× │ {efficiency_str:9s} │")
|
||
|
||
print("└─────────────┴─────────────┴─────────────┴─────────────┴─────────────┘")
|
||
|
||
print(f"\n💡 Mixed Precision Key Benefits:")
|
||
print(f" 🎯 Memory: 20-40% reduction enables larger models/batches")
|
||
print(f" ⚡ Speed: 1.3-2× throughput on modern hardware (V100+)")
|
||
print(f" 📈 Scale: Essential for billion-parameter models")
|
||
print(f" ⚠️ Complexity: Requires careful loss scaling and overflow handling")
|
||
print("🚀 Mixed precision is crucial for competitive ML training")
|
||
|
||
analyze_mixed_precision_benefits()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## 6. Optimization Insights - Production Acceleration Strategy
|
||
|
||
Understanding when and how to apply different acceleration techniques in real-world scenarios.
|
||
"""
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "acceleration-decision-framework", "solution": true}
|
||
def analyze_acceleration_decision_framework():
|
||
"""📊 Decision framework for choosing acceleration techniques."""
|
||
print("📊 Acceleration Technique Decision Framework...")
|
||
|
||
# Define workload characteristics
|
||
workloads = [
|
||
("Research Training", {
|
||
"memory_pressure": "medium",
|
||
"latency_sensitive": False,
|
||
"stability_critical": False,
|
||
"development_speed": "high",
|
||
"hardware_variety": "high"
|
||
}),
|
||
("Production Training", {
|
||
"memory_pressure": "high",
|
||
"latency_sensitive": False,
|
||
"stability_critical": True,
|
||
"development_speed": "medium",
|
||
"hardware_variety": "low"
|
||
}),
|
||
("Real-time Inference", {
|
||
"memory_pressure": "medium",
|
||
"latency_sensitive": True,
|
||
"stability_critical": True,
|
||
"development_speed": "low",
|
||
"hardware_variety": "medium"
|
||
}),
|
||
("Edge Deployment", {
|
||
"memory_pressure": "very_high",
|
||
"latency_sensitive": True,
|
||
"stability_critical": True,
|
||
"development_speed": "low",
|
||
"hardware_variety": "very_high"
|
||
}),
|
||
("Batch Inference", {
|
||
"memory_pressure": "low",
|
||
"latency_sensitive": False,
|
||
"stability_critical": True,
|
||
"development_speed": "medium",
|
||
"hardware_variety": "low"
|
||
})
|
||
]
|
||
|
||
# Define technique characteristics
|
||
techniques = {
|
||
"Vectorization": {
|
||
"implementation_cost": "low",
|
||
"memory_benefit": "none",
|
||
"latency_benefit": "high",
|
||
"stability_risk": "none",
|
||
"hardware_dependency": "low"
|
||
},
|
||
"Kernel Fusion": {
|
||
"implementation_cost": "medium",
|
||
"memory_benefit": "medium",
|
||
"latency_benefit": "medium",
|
||
"stability_risk": "low",
|
||
"hardware_dependency": "medium"
|
||
},
|
||
"Mixed Precision": {
|
||
"implementation_cost": "high",
|
||
"memory_benefit": "high",
|
||
"latency_benefit": "high",
|
||
"stability_risk": "medium",
|
||
"hardware_dependency": "high"
|
||
},
|
||
"Graph Optimization": {
|
||
"implementation_cost": "very_high",
|
||
"memory_benefit": "medium",
|
||
"latency_benefit": "very_high",
|
||
"stability_risk": "low",
|
||
"hardware_dependency": "very_high"
|
||
}
|
||
}
|
||
|
||
print("\n🎯 Acceleration Technique Recommendations:")
|
||
print("┌─────────────────────┬─────────────┬─────────────┬─────────────┬─────────────┐")
|
||
print("│ Workload │ Vectorize │ Fuse Kernels│ Mixed Prec │ Graph Opt │")
|
||
print("├─────────────────────┼─────────────┼─────────────┼─────────────┼─────────────┤")
|
||
|
||
for workload_name, workload_chars in workloads:
|
||
recommendations = []
|
||
|
||
for technique_name in ["Vectorization", "Kernel Fusion", "Mixed Precision", "Graph Optimization"]:
|
||
tech_chars = techniques[technique_name]
|
||
score = 0
|
||
|
||
# Benefit vs requirement matching
|
||
if workload_chars["memory_pressure"] in ["high", "very_high"]:
|
||
if tech_chars["memory_benefit"] in ["medium", "high"]:
|
||
score += 2
|
||
|
||
if workload_chars["latency_sensitive"]:
|
||
if tech_chars["latency_benefit"] in ["medium", "high", "very_high"]:
|
||
score += 2
|
||
|
||
# Risk vs tolerance matching
|
||
if workload_chars["stability_critical"]:
|
||
if tech_chars["stability_risk"] in ["none", "low"]:
|
||
score += 1
|
||
elif tech_chars["stability_risk"] == "medium":
|
||
score -= 1
|
||
|
||
# Implementation cost vs development speed
|
||
if workload_chars["development_speed"] == "high":
|
||
if tech_chars["implementation_cost"] in ["low", "medium"]:
|
||
score += 1
|
||
elif tech_chars["implementation_cost"] in ["high", "very_high"]:
|
||
score -= 1
|
||
|
||
# Hardware dependency vs variety
|
||
if workload_chars["hardware_variety"] in ["high", "very_high"]:
|
||
if tech_chars["hardware_dependency"] in ["low", "medium"]:
|
||
score += 1
|
||
elif tech_chars["hardware_dependency"] in ["high", "very_high"]:
|
||
score -= 2
|
||
|
||
# Convert score to recommendation
|
||
if score >= 3:
|
||
rec = "✅ High"
|
||
elif score >= 1:
|
||
rec = "⚡ Medium"
|
||
elif score >= 0:
|
||
rec = "⚠️ Low"
|
||
else:
|
||
rec = "❌ Skip"
|
||
|
||
recommendations.append(rec)
|
||
|
||
rec_line = " │ ".join(f"{rec:10s}" for rec in recommendations)
|
||
print(f"│ {workload_name:18s} │ {rec_line} │")
|
||
|
||
print("└─────────────────────┴─────────────┴─────────────┴─────────────┴─────────────┘")
|
||
|
||
# Implementation priority framework
|
||
print(f"\n🛠️ Implementation Priority Framework:")
|
||
print(f" 📊 Phase 1 (Always): Vectorization")
|
||
print(f" • Low risk, high reward")
|
||
print(f" • Works on any hardware")
|
||
print(f" • Foundation for other optimizations")
|
||
print(f" ")
|
||
print(f" 📊 Phase 2 (Memory constrained): Kernel Fusion")
|
||
print(f" • Targets memory-bound operations")
|
||
print(f" • Moderate complexity")
|
||
print(f" • Significant wins on element-wise ops")
|
||
print(f" ")
|
||
print(f" 📊 Phase 3 (Large models): Mixed Precision")
|
||
print(f" • Essential for large model training")
|
||
print(f" • Requires careful validation")
|
||
print(f" • Hardware-dependent benefits")
|
||
print(f" ")
|
||
print(f" 📊 Phase 4 (Production): Graph Optimization")
|
||
print(f" • Maximum performance extraction")
|
||
print(f" • High implementation cost")
|
||
print(f" • Deployment-specific tuning")
|
||
|
||
print(f"\n💡 Key Decision Factors:")
|
||
print(f" 🎯 Start simple: Vectorization first, always")
|
||
print(f" 📈 Scale up: Add complexity only when needed")
|
||
print(f" ⚡ Measure impact: Profile before and after each optimization")
|
||
print(f" 🔄 Iterate: Optimization is an ongoing process, not one-time")
|
||
print("🚀 Systematic acceleration beats random optimization")
|
||
|
||
analyze_acceleration_decision_framework()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## 7. Module Integration Test
|
||
|
||
Final validation that all acceleration components work together correctly.
|
||
"""
|
||
|
||
# %% nbgrader={"grade": true, "grade_id": "test-module", "locked": true, "points": 20}
|
||
def test_module():
|
||
"""
|
||
Comprehensive test of entire acceleration module functionality.
|
||
|
||
This final test ensures:
|
||
- All acceleration techniques work correctly
|
||
- Performance improvements are measurable
|
||
- Mixed precision training is stable
|
||
- Components integrate seamlessly
|
||
- Module is ready for production use
|
||
"""
|
||
print("🧪 RUNNING MODULE INTEGRATION TEST")
|
||
print("=" * 50)
|
||
|
||
# Run all unit tests
|
||
print("Running unit tests...")
|
||
test_unit_vectorized_matmul()
|
||
test_unit_fused_gelu()
|
||
test_unit_fusion_speedup()
|
||
test_unit_mixed_precision()
|
||
|
||
print("\nRunning integration scenarios...")
|
||
|
||
# Test realistic acceleration pipeline
|
||
print("🔬 Integration Test: Complete acceleration pipeline...")
|
||
|
||
# Create realistic model scenario
|
||
batch_size, seq_len, hidden_dim = 16, 64, 256
|
||
print(f" Model config: batch={batch_size}, seq_len={seq_len}, hidden={hidden_dim}")
|
||
|
||
# Test data
|
||
x = Tensor(np.random.randn(batch_size, seq_len, hidden_dim).astype(np.float32))
|
||
weight = Tensor(np.random.randn(hidden_dim, hidden_dim).astype(np.float32))
|
||
print(f" Input tensor: {x.shape}, Weight tensor: {weight.shape}")
|
||
|
||
# Test complete pipeline: reshape → matmul → activation → mixed precision
|
||
print(" Testing vectorized operations...")
|
||
|
||
# Reshape for matrix multiplication (flatten batch and sequence)
|
||
x_reshaped = Tensor(x.data.reshape(-1, hidden_dim))
|
||
assert x_reshaped.shape == (batch_size * seq_len, hidden_dim)
|
||
|
||
# Vectorized matrix multiplication
|
||
linear_output = vectorized_matmul(x_reshaped, weight)
|
||
assert linear_output.shape == (batch_size * seq_len, hidden_dim)
|
||
print(f" ✅ Matrix multiplication: {x_reshaped.shape} @ {weight.shape} → {linear_output.shape}")
|
||
|
||
# Fused activation
|
||
activated = fused_gelu(linear_output)
|
||
assert activated.shape == linear_output.shape
|
||
print(f" ✅ Fused GELU activation: {linear_output.shape} → {activated.shape}")
|
||
|
||
# Reshape back to original structure
|
||
final_output = Tensor(activated.data.reshape(batch_size, seq_len, hidden_dim))
|
||
assert final_output.shape == x.shape
|
||
print(f" ✅ Output reshape: {activated.shape} → {final_output.shape}")
|
||
|
||
print(" Testing mixed precision training integration...")
|
||
|
||
# Create complete model for mixed precision testing
|
||
class TransformerBlock:
|
||
def __init__(self, hidden_dim):
|
||
self.hidden_dim = hidden_dim
|
||
self.weight1 = Tensor(np.random.randn(hidden_dim, hidden_dim).astype(np.float32))
|
||
self.weight2 = Tensor(np.random.randn(hidden_dim, hidden_dim).astype(np.float32))
|
||
self.weight1.grad = None
|
||
self.weight2.grad = None
|
||
|
||
def __call__(self, x):
|
||
# Simulate transformer block: linear → activation → linear
|
||
batch_size, seq_len, hidden_dim = x.shape
|
||
x_flat = Tensor(x.data.reshape(-1, hidden_dim))
|
||
|
||
# First linear layer
|
||
h1 = vectorized_matmul(x_flat, self.weight1)
|
||
h1_activated = fused_gelu(h1)
|
||
|
||
# Second linear layer
|
||
h2 = vectorized_matmul(h1_activated, self.weight2)
|
||
|
||
# Reshape back
|
||
output = Tensor(h2.data.reshape(batch_size, seq_len, hidden_dim))
|
||
return output
|
||
|
||
def parameters(self):
|
||
return [self.weight1, self.weight2]
|
||
|
||
class SimpleOptimizer:
|
||
def __init__(self, params):
|
||
self.params = params
|
||
|
||
def zero_grad(self):
|
||
for p in self.params:
|
||
p.grad = None
|
||
|
||
def step(self):
|
||
for p in self.params:
|
||
if p.grad is not None:
|
||
p.data = p.data - 0.001 * p.grad.data
|
||
|
||
# Initialize model and optimizer
|
||
model = TransformerBlock(hidden_dim)
|
||
optimizer = SimpleOptimizer(model.parameters())
|
||
trainer = MixedPrecisionTrainer(model, optimizer, loss_scale=512.0)
|
||
|
||
print(f" Model parameters: {len(model.parameters())}")
|
||
print(f" Initial loss scale: {trainer.loss_scale}")
|
||
|
||
# Simulate training steps
|
||
print(" Running training steps...")
|
||
targets = Tensor(np.random.randn(batch_size, seq_len, hidden_dim).astype(np.float32))
|
||
|
||
training_metrics = []
|
||
for step in range(5):
|
||
metrics = trainer.train_step((x, targets))
|
||
training_metrics.append(metrics)
|
||
|
||
# Verify metrics are reasonable
|
||
assert isinstance(metrics['loss'], (int, float))
|
||
assert metrics['loss'] >= 0
|
||
assert metrics['loss_scale'] > 0
|
||
assert isinstance(metrics['overflow'], bool)
|
||
assert isinstance(metrics['gradients_valid'], bool)
|
||
|
||
print(f" ✅ Completed {len(training_metrics)} training steps")
|
||
|
||
# Analyze training stability
|
||
losses = [m['loss'] for m in training_metrics]
|
||
overflows = [m['overflow'] for m in training_metrics]
|
||
|
||
print(f" Loss range: {min(losses):.6f} - {max(losses):.6f}")
|
||
print(f" Overflow rate: {sum(overflows)}/{len(overflows)} steps")
|
||
|
||
print(" Testing performance characteristics...")
|
||
|
||
# Verify acceleration provides measurable benefits
|
||
test_sizes = [128, 256]
|
||
for size in test_sizes:
|
||
test_x = Tensor(np.random.randn(size, size).astype(np.float32))
|
||
test_y = Tensor(np.random.randn(size, size).astype(np.float32))
|
||
|
||
# Time operations and verify reasonable performance
|
||
start = time.time()
|
||
_ = vectorized_matmul(test_x, test_y)
|
||
matmul_time = time.time() - start
|
||
|
||
start = time.time()
|
||
_ = fused_gelu(test_x)
|
||
gelu_time = time.time() - start
|
||
|
||
# Verify operations complete in reasonable time
|
||
assert matmul_time < 1.0, f"Matrix multiplication too slow: {matmul_time:.3f}s"
|
||
assert gelu_time < 0.1, f"GELU activation too slow: {gelu_time:.3f}s"
|
||
|
||
print(f" ✅ Size {size}: matmul={matmul_time*1000:.1f}ms, gelu={gelu_time*1000:.1f}ms")
|
||
|
||
print(" Testing memory efficiency...")
|
||
|
||
# Verify mixed precision reduces memory usage conceptually
|
||
param_count = sum(p.data.size for p in model.parameters())
|
||
activation_count = batch_size * seq_len * hidden_dim
|
||
|
||
fp32_memory = (param_count + activation_count) * 4 # 4 bytes per FP32
|
||
mixed_memory = param_count * 4 + activation_count * 2 # FP32 params + FP16 activations
|
||
memory_savings = (fp32_memory - mixed_memory) / fp32_memory * 100
|
||
|
||
print(f" Memory analysis: {memory_savings:.1f}% savings from mixed precision")
|
||
assert memory_savings > 0, "Mixed precision should reduce memory usage"
|
||
|
||
print("✅ End-to-end acceleration pipeline works!")
|
||
|
||
print("\n" + "=" * 50)
|
||
print("🎉 ALL TESTS PASSED! Module ready for export.")
|
||
print("Run: tito module complete 16")
|
||
|
||
# Call the module test
|
||
test_module()
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "main-execution", "solution": false}
# Main execution block
if __name__ == "__main__":
    print("🚀 Running Acceleration module...")
    test_module()
    print("✅ Module validation complete!")

# %% [markdown]
"""
## 🤔 ML Systems Thinking: Acceleration and Performance

### Question 1: Arithmetic Intensity Analysis
You implemented vectorized matrix multiplication and fused GELU (a self-check sketch follows this question).
- Matrix multiplication (1024×1024): performs ~2.1 billion FLOPs and moves ~12 MB of data
- Arithmetic intensity: _____ FLOPs/byte
- Compared to element-wise addition (0.33 FLOPs/byte): _____× higher intensity
- Why does this make matrix multiplication ideal for GPUs? _____

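One way to sanity-check the arithmetic (a sketch, assuming three square FP32 matrices: A and B read, C written):

```python
N = 1024
flops = 2 * N**3               # one multiply and one add per inner-product term
bytes_moved = 3 * N * N * 4    # read A, read B, write C; 4 bytes per FP32 value
print(f"{flops / bytes_moved:.1f} FLOPs/byte")  # ≈ 170.7
```
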
### Question 2: Kernel Fusion Memory Benefits
Your fused_gelu combines 7 operations into a single expression (a self-check sketch follows this question).
- Unfused version memory accesses: 7 reads + 7 writes = _____ per element
- Fused version memory accesses: 1 read + 1 write = _____ per element
- Memory bandwidth reduction: _____%
- Why is this critical for transformer inference? _____

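A quick self-check of the access counts named above (a sketch):

```python
unfused_accesses = 7 + 7   # each unfused op reads one tensor and writes one intermediate
fused_accesses = 1 + 1     # fused kernel reads the input once and writes the result once
reduction = (unfused_accesses - fused_accesses) / unfused_accesses * 100
print(f"{reduction:.1f}% fewer memory accesses")  # ≈ 85.7%
```
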
### Question 3: Mixed Precision Memory Calculation
Your MixedPrecisionTrainer uses FP16 activations and FP32 parameters (a self-check sketch follows this question).
For a 100M parameter model with 50M activation elements:
- FP32 memory: (100M + 50M) × 4 bytes = _____ MB
- Mixed precision memory: 100M × 4 + 50M × 2 = _____ MB
- Memory reduction: _____%

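To check the memory numbers (a sketch using the counts from the question):

```python
params, activations = 100e6, 50e6
fp32_mb = (params + activations) * 4 / 1e6        # everything stored in 4-byte FP32
mixed_mb = (params * 4 + activations * 2) / 1e6   # FP32 parameters + FP16 activations
print(fp32_mb, mixed_mb, f"{(fp32_mb - mixed_mb) / fp32_mb * 100:.1f}% saved")
```
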
### Question 4: Loss Scaling Strategy
Your trainer starts with loss_scale=1024, grows it by 2×, and shrinks it by 0.5× (a self-check sketch follows this question).
- Minimum normal FP16 value: ~6e-5
- Without scaling, gradients < _____ become zero
- With 1024× scaling, gradients down to _____ are preserved
- Why increase scale gradually but decrease immediately? _____

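A small check of the scaling window (a sketch; ~6e-5 is the smallest normal FP16 magnitude):

```python
fp16_min_normal = 6.1e-5
scale = 1024.0
print("unscaled cutoff:", fp16_min_normal)          # gradients below this underflow toward zero
print("scaled cutoff:  ", fp16_min_normal / scale)  # ≈ 6e-8 once the loss is scaled by 1024×
```
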
### Question 5: Production Optimization Strategy
Based on your decision framework analysis:
For edge deployment (memory critical, stability required, hardware diverse):
- Priority 1 technique: _____ (low risk, universal)
- Priority 2 technique: _____ (memory benefits)
- Skip technique: _____ (why: _____)
- What's the primary constraint: memory, compute, or power? _____
"""

# %% [markdown]
"""
## 🎯 MODULE SUMMARY: Acceleration

Congratulations! You've mastered the fundamental techniques for accelerating neural networks!

### Key Accomplishments
- Built **vectorized operations** leveraging SIMD and optimized BLAS for 2-5× speedups
- Implemented **kernel fusion** reducing memory bandwidth by 60-80% for element-wise operations
- Created **mixed precision training** with automatic loss scaling for 20-40% memory savings
- Analyzed **arithmetic intensity patterns** and their impact on the roofline model
- Developed a **production decision framework** for systematic optimization
- All tests pass ✅ (validated by `test_module()`)

### Systems Insights Discovered
- **Roofline Model**: Operations with high arithmetic intensity (FLOPs/byte) scale better
- **Memory Bandwidth**: Often the limiting factor on modern accelerators
- **Kernel Fusion**: Critical for memory-bound workloads; reduces intermediate storage overhead
- **Mixed Precision**: Essential for large model training; requires careful gradient scaling
- **Optimization Strategy**: Start simple (vectorization), add complexity only as needed

### Production Impact
Your acceleration techniques enable:
- **Training larger models** within memory constraints
- **Faster iteration cycles** during research and development
- **Better hardware utilization** across different deployment targets
- **Cost reduction** through improved efficiency

### Ready for Next Steps
Your acceleration implementations provide the foundation for the quantization techniques in Module 17.
The performance analysis skills transfer directly to production optimization workflows.

Export with: `tito module complete 16`

**Next**: Module 17 will add quantization to further reduce memory and increase throughput while maintaining accuracy!
"""