# ---
# jupyter:
#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: percent
#       format_version: '1.3'
#       jupytext_version: 1.17.1
#   kernelspec:
#     display_name: Python 3 (ipykernel)
#     language: python
#     name: python3
# ---
#| default_exp optimization.acceleration
#| export
# %% [markdown]
"""
# Module 16: Acceleration - Making Models Run Faster
Welcome to Module 16! You're about to master the art of neural network acceleration through vectorization and kernel fusion.
## 🔗 Prerequisites & Progress
**You've Built**: Complete training pipeline with profiling capabilities
**You'll Build**: Acceleration techniques including vectorization and operation fusion
**You'll Enable**: Production-ready optimization for real-world deployment
**Connection Map**:
```
Profiling (Module 15) → Acceleration (Module 16) → Quantization (Module 17)
     (measurement)            (optimization)           (precision reduction)
```
## Learning Objectives
By the end of this module, you will:
1. Implement vectorized operations for maximum throughput
2. Create fused operations to reduce memory bandwidth
3. Understand the relationship between compute and memory bandwidth
4. Analyze acceleration trade-offs in production systems
Let's optimize for speed!
## 📦 Where This Code Lives in the Final Package
**Learning Side:** You work in `modules/16_acceleration/acceleration_dev.py`
**Building Side:** Code exports to `tinytorch.optimization.acceleration`
```python
# How to use this module:
from tinytorch.optimization.acceleration import vectorized_matmul, fused_gelu
```
**Why this matters:**
- **Learning:** Complete acceleration system in one focused module for deep understanding
- **Production:** Proper organization like PyTorch's torch.amp and torch.jit with optimization components
- **Consistency:** All acceleration operations and optimization components in optimization.acceleration
- **Integration:** Works seamlessly with profiling for complete performance optimization
"""
# %%
import numpy as np
import time
from typing import Dict, List, Tuple, Optional, Any, Union
import warnings
# %% [markdown]
"""
## 1. Introduction - The Performance Challenge
Modern neural networks face two fundamental bottlenecks that limit their speed:
### The Two Enemies of Performance
**1. Compute Bound Operations:**
```
CPU/GPU Cores: [====BUSY====] [====BUSY====] [====BUSY====]
Memory Bus: [---idle---] [---idle---] [---idle---]
When: Matrix multiplication, convolutions
Solution: Vectorization, better algorithms
```
**2. Memory Bound Operations:**
```
CPU/GPU Cores: [--idle--] [--idle--] [--idle--]
Memory Bus: [========SATURATED========]
When: Element-wise operations, small tensors
Solution: Kernel fusion, memory layout optimization
```
### The Roofline Model - Your Performance Compass
Every processor has fundamental limits:
```
Performance │
  (GFLOPS)  │               ┌──────────────────────────
            │              ╱     Peak Performance
            │             ╱    (Compute Bound Region)
            │            ╱
            │           ╱
            │          ╱   Memory Bound Region
            │         ╱    (slope = peak bandwidth)
            │        ╱
            └─────────────────────────────────────────
             Low                                  High
                  Arithmetic Intensity (FLOPs/Byte)
```
**Key Insight**: Understand where your operations live on this graph to optimize effectively.
### Why This Module Matters
Real-world performance wins:
- **2-5× speedup** from vectorization
- **2-3× throughput** from kernel fusion
- **10× scaling improvement** for large models
"""
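# %% [markdown]
"""
### Sidebar: A Back-of-the-Envelope Roofline Check

The roofline intuition above can be sketched numerically. The peak numbers below are illustrative assumptions (not measurements of your machine), but the classification logic is exactly the roofline model: an operation whose arithmetic intensity falls left of the ridge point is memory bound.

```python
# Hypothetical machine limits -- illustrative assumptions, not measured values
PEAK_GFLOPS = 200.0   # assumed peak compute throughput (GFLOP/s)
PEAK_GBS = 50.0       # assumed peak memory bandwidth (GB/s)

def roofline_bound(flops, bytes_moved):
    # Arithmetic intensity: useful work per byte of memory traffic
    ai = flops / bytes_moved
    # Ridge point: the AI where the bandwidth line meets the compute ceiling
    ridge = PEAK_GFLOPS / PEAK_GBS
    # Attainable performance is capped by whichever limit binds first
    attainable = min(PEAK_GFLOPS, ai * PEAK_GBS)
    kind = "compute-bound" if ai >= ridge else "memory-bound"
    return ai, attainable, kind

n = 1024
# Element-wise add: 1 FLOP/element, 3 float32 tensors touched (read a, read b, write c)
print(roofline_bound(n * n, 3 * n * n * 4))
# Square matmul: 2n^3 FLOPs over 3 float32 matrices
print(roofline_bound(2 * n**3, 3 * n * n * 4))
```

With these assumed peaks, the element-wise add sits far left of the ridge (memory bound) while the 1024×1024 matmul sits far right (compute bound), which is the same conclusion the analyses later in this module reach by direct measurement.
"""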
# %% nbgrader={"grade": false, "grade_id": "tensor-import", "solution": true}
# Import required dependencies
### BEGIN SOLUTION
from tinytorch.core.tensor import Tensor
### END SOLUTION
# %% [markdown]
"""
## 2. Foundations - Vectorization: From Loops to Lightning
### The SIMD Revolution
Modern processors can execute **Single Instruction, Multiple Data** operations:
```
Traditional Loop (Scalar):          SIMD Vectorized:
for i in range(4):                  ┌─────┬─────┬─────┬─────┐
    c[i] = a[i] + b[i]              │ALU 0│ALU 1│ALU 2│ALU 3│
┌─────┐                             └─────┴─────┴─────┴─────┘
│ ALU │  1 element per cycle        4 elements per cycle
└─────┘
```
### Memory Access Patterns: The Hidden Performance Killer
```
Sequential Access (FAST):
Memory: [A][B][C][D][E][F][G][H]
Access: ↓ ↓ ↓ ↓ → Cache friendly
Strided Access (SLOWER):
Memory: [A][ ][B][ ][C][ ][D][ ]
Access: ↓ ↓ ↓ ↓ → Cache misses
Random Access (SLOWEST):
Memory: [A][B][C][D][E][F][G][H]
Access: ↓ ↑ ↓ ↑ → Cache chaos
```
### Matrix Multiplication: The King of Vectorization
Matrix multiplication is **perfectly suited** for vectorization:
```
Matrix A (M×K) × Matrix B (K×N) = Matrix C (M×N)
Computation Pattern:
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ a₁₁ a₁₂ a₁₃ a₁₄ │ × │ b₁₁ b₁₂ b₁₃ b₁₄ │ = │ c₁₁ c₁₂ c₁₃ c₁₄ │
│ a₂₁ a₂₂ a₂₃ a₂₄ │ │ b₂₁ b₂₂ b₂₃ b₂₄ │ │ c₂₁ c₂₂ c₂₃ c₂₄ │
│ a₃₁ a₃₂ a₃₃ a₃₄ │ │ b₃₁ b₃₂ b₃₃ b₃₄ │ │ c₃₁ c₃₂ c₃₃ c₃₄ │
│ a₄₁ a₄₂ a₄₃ a₄₄ │ │ b₄₁ b₄₂ b₄₃ b₄₄ │ │ c₄₁ c₄₂ c₄₃ c₄₄ │
└─────────────────┘ └─────────────────┘ └─────────────────┘
For c₁₁: Row₁ · Column₁ = a₁₁×b₁₁ + a₁₂×b₂₁ + a₁₃×b₃₁ + a₁₄×b₄₁
VECTORIZABLE!
```
**Why vectorization wins:**
- **High arithmetic intensity**: 2N³ FLOPs over only 3N² data values
- **Predictable memory access**: Sequential row/column reads
- **Parallelizable**: Independent dot products
- **Cache-friendly**: Data reuse in inner loops
"""
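# %% [markdown]
"""
Before implementing `vectorized_matmul`, it helps to see the scalar-vs-vectorized gap concretely. A minimal, self-contained sketch (pure NumPy; the matrix size is kept small so the Python loop finishes quickly, and the exact speedup will vary by machine):

```python
import time
import numpy as np

n = 64
a = np.random.randn(n, n).astype(np.float32)
b = np.random.randn(n, n).astype(np.float32)

def loop_matmul(a, b):
    # Scalar triple loop: one multiply-add at a time (no SIMD, poor cache use)
    m, k = a.shape
    _, p = b.shape
    out = np.zeros((m, p), dtype=np.float32)
    for i in range(m):
        for j in range(p):
            s = 0.0
            for t in range(k):
                s += a[i, t] * b[t, j]
            out[i, j] = s
    return out

t0 = time.perf_counter()
c_loop = loop_matmul(a, b)
loop_s = time.perf_counter() - t0

t0 = time.perf_counter()
c_vec = a @ b  # BLAS-backed: SIMD, cache-blocked, possibly multi-threaded
vec_s = time.perf_counter() - t0

# Same numbers, radically different execution strategy
assert np.allclose(c_loop, c_vec, atol=1e-3)
print(f"loop: {loop_s * 1e3:.1f} ms, vectorized: {vec_s * 1e3:.3f} ms")
```

Both versions compute identical results; only the execution strategy differs. That is the theme of this whole module.
"""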
# %% nbgrader={"grade": false, "grade_id": "vectorized-matmul", "solution": true}
def vectorized_matmul(a: Tensor, b: Tensor) -> Tensor:
"""
High-performance matrix multiplication using vectorized operations.
This implementation leverages optimized BLAS libraries that use:
- SIMD instructions for parallel computation
- Cache-blocking for memory efficiency
- Multi-threading for CPU parallelization
TODO: Implement production-grade matrix multiplication
APPROACH:
1. Validate shapes are compatible for matrix multiplication
2. Use NumPy's optimized dot product (calls BLAS GEMM)
3. Return result wrapped in Tensor
EXAMPLE:
Matrix multiplication visualization:
>>> a = Tensor([[1, 2], [3, 4]]) # 2×2
>>> b = Tensor([[5, 6], [7, 8]]) # 2×2
>>> result = vectorized_matmul(a, b)
>>> print(result.data)
[[19 22] # [1×5+2×7, 1×6+2×8] = [19, 22]
[43 50]] # [3×5+4×7, 3×6+4×8] = [43, 50]
PERFORMANCE CHARACTERISTICS:
- Time Complexity: O(N³) but highly optimized
- Space Complexity: O(N²) for result
    - Arithmetic Intensity: 2N³ FLOPs / 12N² bytes ≈ N/6 FLOPs/byte (grows with N)
HINTS:
- Check a.shape[-1] == b.shape[-2] for inner dimension match
- Use np.matmul() for batch support and optimization
- Trust BLAS to handle the vectorization magic
"""
### BEGIN SOLUTION
# Input validation for matrix multiplication
if len(a.shape) < 2 or len(b.shape) < 2:
raise ValueError(
f"Matrix multiplication requires 2D+ tensors, got shapes {a.shape} and {b.shape}. "
f"💡 HINT: Use reshape() to add dimensions if needed."
)
if a.shape[-1] != b.shape[-2]:
raise ValueError(
f"Matrix multiplication shape mismatch: {a.shape} @ {b.shape}. "
f"Inner dimensions must match: a.shape[-1]={a.shape[-1]} != b.shape[-2]={b.shape[-2]}. "
f"💡 HINT: For A@B, A's columns must equal B's rows."
)
# Use NumPy's highly optimized matrix multiplication
# This calls BLAS GEMM (General Matrix Multiply), which uses:
# - SIMD vectorization for parallel arithmetic
# - Cache blocking for memory efficiency
# - Multi-threading on multi-core systems
result_data = np.matmul(a.data, b.data)
return Tensor(result_data)
### END SOLUTION
# %% nbgrader={"grade": true, "grade_id": "test-vectorized-matmul", "locked": true, "points": 10}
def test_unit_vectorized_matmul():
"""🔬 Test vectorized matrix multiplication implementation."""
print("🔬 Unit Test: Vectorized Matrix Multiplication...")
# Test basic 2D multiplication
a = Tensor([[1, 2], [3, 4]])
b = Tensor([[5, 6], [7, 8]])
result = vectorized_matmul(a, b)
expected = np.array([[19, 22], [43, 50]])
assert np.allclose(result.data, expected), f"Basic matmul failed: expected {expected}, got {result.data}"
# Test batch multiplication (3D tensors)
batch_size, m, k, n = 2, 3, 4, 5
a_batch = Tensor(np.random.randn(batch_size, m, k))
b_batch = Tensor(np.random.randn(batch_size, k, n))
result_batch = vectorized_matmul(a_batch, b_batch)
assert result_batch.shape == (batch_size, m, n), f"Wrong batch shape: {result_batch.shape}"
# Test broadcasting (different batch dimensions)
a_single = Tensor(np.random.randn(m, k))
b_batch = Tensor(np.random.randn(batch_size, k, n))
result_broadcast = vectorized_matmul(a_single, b_batch)
assert result_broadcast.shape == (batch_size, m, n), f"Broadcasting failed: {result_broadcast.shape}"
# Test error cases
try:
vectorized_matmul(Tensor([1, 2, 3]), Tensor([4, 5])) # 1D tensors
assert False, "Should reject 1D tensors"
except ValueError as e:
assert "2D+" in str(e)
try:
vectorized_matmul(Tensor([[1, 2]]), Tensor([[1], [2], [3]])) # Shape mismatch
assert False, "Should reject incompatible shapes"
except ValueError as e:
assert "shape mismatch" in str(e).lower()
print("✅ vectorized_matmul works correctly!")
test_unit_vectorized_matmul()
# %% [markdown]
"""
## 3. Implementation - Kernel Fusion: Eliminating Memory Bottlenecks
### The Memory Bandwidth Crisis
Consider this innocent-looking computation: `y = gelu(x * weight + bias)`
**Naive Implementation (Memory Intensive):**
```
Step 1: temp1 = x * weight → Write 4GB to memory
Step 2: temp2 = temp1 + bias → Read 4GB, Write 4GB
Step 3: y = gelu(temp2) → Read 4GB, Write 4GB
Total: 20GB memory traffic!
```
**Fused Implementation (Memory Efficient):**
```
Single Step: y = gelu(x * weight + bias) → Read 8GB, Write 4GB
Total: 12GB memory traffic!
40% memory traffic reduction (20GB → 12GB)!
```
### Understanding GELU: The Smooth Activation
GELU (Gaussian Error Linear Unit) is used in transformers because it's **smooth** (differentiable everywhere):
```
Activation Functions Compared:

ReLU:                 GELU:                 Sigmoid:
      ╱                     ╱              1 ┌─────
     ╱                     ╱                ╱
─────┘                ────╱              ──╱
Kink (discontinuous   Smooth             Smooth but
gradient) at 0        everywhere         saturates
```
**GELU Formula**: `GELU(x) = x * Φ(x)` where Φ is the standard normal CDF
**Fast Approximation**: `GELU(x) ≈ 0.5 * x * (1 + tanh(√(2/π) * (x + 0.044715 * x³)))`
### Kernel Fusion Strategy
```
Unfused Operations: Fused Operation:
┌─────────────────┐ ┌─────────────────┐
│ x³ computation │ → temp1 │ │
└─────────────────┘ │ │
┌─────────────────┐ │ │
│ polynomial part │ → temp2 │ All operations│
└─────────────────┘ │ combined in │
┌─────────────────┐ │ single kernel │
│ tanh computation│ → temp3 │ │
└─────────────────┘ │ │
┌─────────────────┐ │ │
│ final multiply │ → result │ │
└─────────────────┘ └─────────────────┘
5 memory round-trips 1 memory round-trip
```
"""
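# %% [markdown]
"""
The traffic accounting in the diagram above is easy to mis-estimate by hand, so here it is as a tiny calculation. It uses the same simplified bookkeeping as the diagram (step 1's input reads are not counted, and caches are ignored), with illustrative 4GB tensors:

```python
GB = 4.0  # illustrative tensor size in GB (about 1e9 float32 elements)

# Unfused y = gelu(x * weight + bias): every step round-trips through memory
unfused = (
    GB          # step 1: write temp1 = x * weight
    + 2 * GB    # step 2: read temp1, write temp2 = temp1 + bias
    + 2 * GB    # step 3: read temp2, write y = gelu(temp2)
)

# Fused: read x and weight once (8GB), write y once (4GB)
fused = 2 * GB + GB

saving = 1.0 - fused / unfused
print(f"unfused: {unfused:.0f} GB, fused: {fused:.0f} GB, saving: {saving:.0%}")
```

Changing the assumptions (for example, counting step 1's input reads) shifts the exact percentage, but the qualitative conclusion is robust: fusion removes every intermediate round-trip, and for memory-bound chains that is where most of the time goes.
"""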
# %% nbgrader={"grade": false, "grade_id": "fused-gelu", "solution": true}
def fused_gelu(x: Tensor) -> Tensor:
"""
Fused GELU activation that combines all operations in a single kernel.
GELU combines the benefits of ReLU and sigmoid:
- Smooth everywhere (unlike ReLU's discontinuity at 0)
- Non-saturating for positive values (unlike sigmoid)
- Probabilistic interpretation: x * P(X ≤ x) where X ~ N(0,1)
Mathematical Definition:
GELU(x) = x * Φ(x) where Φ(x) is the standard normal CDF
Fast Approximation (used here):
GELU(x) ≈ 0.5 * x * (1 + tanh(√(2/π) * (x + 0.044715 * x³)))
TODO: Implement fused GELU to minimize memory bandwidth
APPROACH:
1. Compute all intermediate values in a single expression
2. Avoid creating temporary arrays
3. Let NumPy's broadcasting handle vectorization
EXAMPLE:
>>> x = Tensor([-2, -1, 0, 1, 2])
>>> result = fused_gelu(x)
    >>> print(result.data)  # approximately:
    [-0.0454 -0.1588  0.      0.8412  1.9546]
    # Notice: smooth transition through 0, slight negative dip for x < 0
MEMORY EFFICIENCY:
- Unfused: 5 temporary arrays × input_size × 4 bytes
- Fused: 0 temporary arrays, direct computation
- Bandwidth reduction: ~80% for memory-bound operations
    HINTS:
    - Use np.sqrt(2.0 / np.pi) for the constant
    - Keep the entire expression in one statement to avoid extra Python-level passes
    - Note: NumPy still materializes an intermediate array per operator; true
      single-kernel fusion needs a compiler (e.g. numexpr, Numba, torch.compile)
"""
### BEGIN SOLUTION
# Mathematical constant for GELU approximation
sqrt_2_over_pi = np.sqrt(2.0 / np.pi)
    # Fused GELU computation - the whole formula in one statement.
    # This avoids Python-level temporaries and extra passes over the data.
    # (NumPy still evaluates operator-by-operator; a JIT compiler would fuse fully.)
result_data = 0.5 * x.data * (
1.0 + np.tanh(sqrt_2_over_pi * (x.data + 0.044715 * x.data**3))
)
return Tensor(result_data)
### END SOLUTION
# %% nbgrader={"grade": true, "grade_id": "test-fused-gelu", "locked": true, "points": 10}
def test_unit_fused_gelu():
"""🔬 Test fused GELU activation implementation."""
print("🔬 Unit Test: Fused GELU...")
# Test basic properties
x = Tensor([-3, -1, 0, 1, 3])
result = fused_gelu(x)
# GELU(0) = 0 (exact property)
assert abs(result.data[2]) < 1e-6, f"GELU(0) should be 0, got {result.data[2]}"
# GELU is smooth and increasing
assert result.data[4] > result.data[3] > result.data[2], "GELU should be increasing"
# GELU has positive bias (unlike ReLU)
assert result.data[3] > 0.8, "GELU(1) should be close to 1"
assert result.data[1] > -0.2, "GELU(-1) should be slightly negative"
# Test numerical stability with extreme values
x_extreme = Tensor([-10, -5, 0, 5, 10])
result_extreme = fused_gelu(x_extreme)
assert not np.any(np.isnan(result_extreme.data)), "No NaN values allowed"
assert not np.any(np.isinf(result_extreme.data)), "No infinite values allowed"
# Test large tensor processing
x_large = Tensor(np.random.randn(1000, 1000).astype(np.float32))
result_large = fused_gelu(x_large)
assert result_large.shape == x_large.shape, "Shape preservation failed"
assert result_large.data.dtype == np.float32, "Data type preservation failed"
# Test that positive inputs are mostly preserved (GELU ≈ x for large positive x)
x_positive = Tensor([5.0])
result_positive = fused_gelu(x_positive)
assert result_positive.data[0] > 4.9, "Large positive values should be nearly preserved"
print("✅ fused_gelu works correctly!")
test_unit_fused_gelu()
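# %% [markdown]
"""
How tight is the tanh approximation used in `fused_gelu`? The exact GELU uses the standard normal CDF, which can be written with `erf`. A quick self-contained check (SciPy-free; `np.vectorize` is just a convenience loop, fine for a one-off comparison):

```python
import numpy as np
from math import erf, sqrt

x = np.linspace(-5.0, 5.0, 2001)

# Tanh-based approximation (the same formula fused_gelu uses)
c = np.sqrt(2.0 / np.pi)
gelu_tanh = 0.5 * x * (1.0 + np.tanh(c * (x + 0.044715 * x**3)))

# Exact GELU: x * Phi(x), with Phi expressed via the error function
phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))
gelu_exact = x * phi(x)

max_err = float(np.max(np.abs(gelu_tanh - gelu_exact)))
print(f"max |approx - exact| on [-5, 5]: {max_err:.2e}")
```

The maximum absolute error is on the order of 1e-3 or smaller, which is why this approximation is standard practice in transformer implementations: far cheaper than `erf`, and well below float32 training noise.
"""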
# %% [markdown]
"""
### 🔬 Performance Analysis: Measuring Fusion Benefits
Let's quantify the impact of kernel fusion by comparing fused vs unfused implementations.
"""
# %% nbgrader={"grade": false, "grade_id": "unfused-gelu", "solution": true}
def unfused_gelu(x: Tensor) -> Tensor:
"""
Deliberately unfused GELU implementation for performance comparison.
This version creates multiple intermediate tensors to simulate
the memory bandwidth overhead of unfused operations.
TODO: Implement GELU with explicit intermediate steps
APPROACH:
1. Break computation into individual steps
2. Create temporary Tensor objects for each step
3. This simulates real memory allocation overhead
PERFORMANCE IMPACT:
- Creates 7 temporary arrays
- Each array allocation/deallocation has overhead
- More memory bandwidth usage
- Potential cache misses between operations
"""
### BEGIN SOLUTION
# Unfused version - creates many intermediate arrays
sqrt_2_over_pi = np.sqrt(2.0 / np.pi)
# Each operation creates a temporary array (simulating kernel launches)
temp1 = Tensor(x.data**3) # x³
temp2 = Tensor(0.044715 * temp1.data) # 0.044715 * x³
temp3 = Tensor(x.data + temp2.data) # x + 0.044715 * x³
temp4 = Tensor(sqrt_2_over_pi * temp3.data) # √(2/π) * (...)
temp5 = Tensor(np.tanh(temp4.data)) # tanh(...)
temp6 = Tensor(1.0 + temp5.data) # 1 + tanh(...)
temp7 = Tensor(x.data * temp6.data) # x * (1 + tanh(...))
result = Tensor(0.5 * temp7.data) # 0.5 * x * (...)
return result
### END SOLUTION
# %% nbgrader={"grade": true, "grade_id": "test-fusion-speedup", "locked": true, "points": 10}
def test_unit_fusion_speedup():
"""🔬 Measure the performance impact of kernel fusion."""
print("🔬 Unit Test: Kernel Fusion Performance Impact...")
# Create moderately large tensor for meaningful timing
size = 2000
x = Tensor(np.random.randn(size, size).astype(np.float32))
warmup_iterations = 2
timing_iterations = 5
# Warmup both implementations
for _ in range(warmup_iterations):
_ = unfused_gelu(x)
_ = fused_gelu(x)
# Time unfused version
start = time.time()
for _ in range(timing_iterations):
result_unfused = unfused_gelu(x)
unfused_time = time.time() - start
# Time fused version
start = time.time()
for _ in range(timing_iterations):
result_fused = fused_gelu(x)
fused_time = time.time() - start
# Verify numerical correctness
assert np.allclose(result_unfused.data, result_fused.data, atol=1e-6), \
"Fused and unfused implementations must be numerically equivalent"
# Calculate performance metrics
speedup = unfused_time / fused_time if fused_time > 0 else 1.0
unfused_per_elem = (unfused_time / timing_iterations) / (size * size) * 1e9 # ns per element
fused_per_elem = (fused_time / timing_iterations) / (size * size) * 1e9
print(f"📊 Kernel Fusion Performance Analysis:")
print(f" Tensor size: {size}×{size} = {size*size:,} elements")
print(f" Unfused time: {unfused_time/timing_iterations*1000:.2f} ms")
print(f" Fused time: {fused_time/timing_iterations*1000:.2f} ms")
print(f" Speedup: {speedup:.2f}× faster")
print(f" Per-element: {unfused_per_elem:.1f} ns → {fused_per_elem:.1f} ns")
# Memory bandwidth estimate
bytes_per_elem = 4 # float32
unfused_memory_ops = 7 # 7 intermediate arrays
fused_memory_ops = 2 # read input, write output
unfused_bandwidth = (unfused_memory_ops * size * size * bytes_per_elem) / (unfused_time / timing_iterations) / 1e9
fused_bandwidth = (fused_memory_ops * size * size * bytes_per_elem) / (fused_time / timing_iterations) / 1e9
    print(f"   Memory efficiency: {unfused_memory_ops}{fused_memory_ops} memory ops")
    print(f"   Effective bandwidth: {unfused_bandwidth:.1f}{fused_bandwidth:.1f} GB/s")
# Interpret results
if speedup > 1.5:
print("🚀 Excellent! Kernel fusion providing significant speedup")
elif speedup > 1.1:
print("✅ Good! Kernel fusion providing measurable benefit")
else:
print("⚠️ Limited speedup - may be compute-bound or small tensor size")
print("✅ Fusion performance analysis completed!")
test_unit_fusion_speedup()
# %% [markdown]
"""
## 4. Systems Analysis - Performance Scaling Patterns
Let's analyze how our acceleration techniques perform across different scenarios and understand their scaling characteristics.
"""
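# %% [markdown]
"""
A note on methodology before the analyses below: they use simple wall-clock timing with warmup, which is good enough for order-of-magnitude conclusions. A slightly more robust pattern (a hypothetical helper, not part of the module's exported API) uses `time.perf_counter` and the median of several repeats to resist scheduler noise:

```python
import time
import numpy as np

def time_op(fn, *args, warmup=2, repeats=5):
    # Warmup runs absorb one-time costs (caches, lazy initialization)
    for _ in range(warmup):
        fn(*args)
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - t0)
    # Median wall-clock seconds for one call
    return float(np.median(samples))

a = np.random.randn(256, 256).astype(np.float32)
print(f"256x256 matmul: {time_op(np.matmul, a, a) * 1e3:.3f} ms")
```

The median is preferred over the mean because timing noise is one-sided: interruptions make individual runs slower, never faster.
"""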
# %% nbgrader={"grade": false, "grade_id": "analyze-vectorization", "solution": true}
def analyze_vectorization_scaling():
"""📊 Analyze vectorization performance across different tensor sizes."""
print("📊 Analyzing vectorization scaling behavior...")
# Test sizes spanning different cache regimes
sizes = [64, 128, 256, 512, 1024, 2048]
print("\n🔍 Vectorization Scaling Analysis:")
print("┌─────────┬─────────────┬─────────────┬─────────────┬─────────────┐")
print("│ Size │ Time (ms) │ GFLOPS │ Bandwidth │ Efficiency │")
print("│ │ │ │ (GB/s) │ (% of peak) │")
print("├─────────┼─────────────┼─────────────┼─────────────┼─────────────┤")
for size in sizes:
# Create test matrices
a = Tensor(np.random.randn(size, size).astype(np.float32))
b = Tensor(np.random.randn(size, size).astype(np.float32))
# Warm up
for _ in range(2):
_ = vectorized_matmul(a, b)
# Time vectorized implementation
iterations = max(1, 100 // (size // 64)) # Fewer iterations for larger sizes
start = time.time()
for _ in range(iterations):
result = vectorized_matmul(a, b)
elapsed = (time.time() - start) / iterations
# Calculate performance metrics
flops = 2 * size**3 # 2N³ FLOPs for matrix multiplication
gflops = flops / (elapsed * 1e9)
bytes_accessed = 3 * size * size * 4 # 3 matrices × size² × 4 bytes
bandwidth = bytes_accessed / (elapsed * 1e9)
# Estimate efficiency (rough baseline: modern CPU ~100-500 GFLOPS peak)
estimated_peak_gflops = 200 # Conservative estimate
efficiency = min(100, gflops / estimated_peak_gflops * 100)
        print(f"│ {size:7d}{elapsed*1000:11.2f}{gflops:11.1f}{bandwidth:11.1f}{efficiency:11.1f}")
print("└─────────┴─────────────┴─────────────┴─────────────┴─────────────┘")
print(f"\n💡 Vectorization insights:")
print(f" • Small matrices: Limited by overhead and cache effects")
print(f" • Medium matrices: Sweet spot for cache reuse")
print(f" • Large matrices: Memory bandwidth becomes limiting factor")
print(f" • BLAS libraries automatically optimize for each size regime")
print("🚀 Vectorization effectiveness depends on problem size and hardware")
analyze_vectorization_scaling()
# %% nbgrader={"grade": false, "grade_id": "analyze-arithmetic-intensity", "solution": true}
def analyze_arithmetic_intensity():
"""📊 Demonstrate the roofline model with different operations."""
print("📊 Analyzing arithmetic intensity patterns...")
size = 1024
iterations = 10
operations = []
# Create test data
x = Tensor(np.random.randn(size, size).astype(np.float32))
y = Tensor(np.random.randn(size, size).astype(np.float32))
print("\n🎯 Arithmetic Intensity Analysis:")
print("┌─────────────────────┬─────────┬─────────────┬─────────────┬─────────────┐")
print("│ Operation │ AI │ Time (ms) │ GFLOPS │ GB/s │")
print("│ │(FLOPs/B)│ │ │ │")
print("├─────────────────────┼─────────┼─────────────┼─────────────┼─────────────┤")
# 1. Element-wise addition (very low arithmetic intensity)
start = time.time()
for _ in range(iterations):
_ = Tensor(x.data + y.data)
add_time = (time.time() - start) / iterations
add_flops = size * size # One addition per element
add_bytes = 3 * size * size * 4 # Read x, read y, write result
add_ai = add_flops / add_bytes
add_gflops = add_flops / (add_time * 1e9)
add_bandwidth = add_bytes / (add_time * 1e9)
    print(f"│ Element-wise Add    │ {add_ai:7.3f}{add_time*1000:11.2f}{add_gflops:11.1f}{add_bandwidth:11.1f}")
# 2. Element-wise multiply (still low, but slightly higher)
start = time.time()
for _ in range(iterations):
_ = Tensor(x.data * y.data)
mul_time = (time.time() - start) / iterations
mul_flops = size * size
mul_bytes = 3 * size * size * 4
mul_ai = mul_flops / mul_bytes
mul_gflops = mul_flops / (mul_time * 1e9)
mul_bandwidth = mul_bytes / (mul_time * 1e9)
    print(f"│ Element-wise Mult   │ {mul_ai:7.3f}{mul_time*1000:11.2f}{mul_gflops:11.1f}{mul_bandwidth:11.1f}")
# 3. GELU (medium arithmetic intensity)
start = time.time()
for _ in range(iterations):
_ = fused_gelu(x)
gelu_time = (time.time() - start) / iterations
gelu_flops = size * size * 8 # Approximate: x³, add, mul, tanh, etc.
gelu_bytes = 2 * size * size * 4 # Read x, write result
gelu_ai = gelu_flops / gelu_bytes
gelu_gflops = gelu_flops / (gelu_time * 1e9)
gelu_bandwidth = gelu_bytes / (gelu_time * 1e9)
    print(f"│ Fused GELU          │ {gelu_ai:7.3f}{gelu_time*1000:11.2f}{gelu_gflops:11.1f}{gelu_bandwidth:11.1f}")
# 4. Matrix multiplication (high arithmetic intensity)
start = time.time()
for _ in range(iterations):
_ = vectorized_matmul(x, y)
matmul_time = (time.time() - start) / iterations
matmul_flops = 2 * size**3 # 2N³ FLOPs
matmul_bytes = 3 * size * size * 4 # 3 matrices
matmul_ai = matmul_flops / matmul_bytes
matmul_gflops = matmul_flops / (matmul_time * 1e9)
matmul_bandwidth = matmul_bytes / (matmul_time * 1e9)
    print(f"│ Matrix Multiply     │ {matmul_ai:7.3f}{matmul_time*1000:11.2f}{matmul_gflops:11.1f}{matmul_bandwidth:11.1f}")
print("└─────────────────────┴─────────┴─────────────┴─────────────┴─────────────┘")
print(f"\n💡 Roofline Model Insights:")
print(f" 📊 Low AI (< 1): Memory bound - limited by bandwidth")
print(f" 📊 Med AI (1-10): Transitional - depends on implementation")
print(f" 📊 High AI (> 10): Compute bound - limited by ALU throughput")
print(f" 🎯 Matrix multiplication ({matmul_ai:.1f} AI) is ideal for GPUs/TPUs")
print(f" ⚡ Element-wise ops ({add_ai:.3f} AI) need memory optimization")
print("🚀 Design algorithms with high arithmetic intensity for performance")
analyze_arithmetic_intensity()
# %% [markdown]
"""
## 5. Optimization Insights - Production Acceleration Strategy
Understanding when and how to apply different acceleration techniques in real-world scenarios.
"""
# %% nbgrader={"grade": false, "grade_id": "acceleration-decision-framework", "solution": true}
def analyze_acceleration_decision_framework():
"""📊 Decision framework for choosing acceleration techniques."""
print("📊 Acceleration Technique Decision Framework...")
# Define workload characteristics
workloads = [
("Research Training", {
"memory_pressure": "medium",
"latency_sensitive": False,
"stability_critical": False,
"development_speed": "high",
"hardware_variety": "high"
}),
("Production Training", {
"memory_pressure": "high",
"latency_sensitive": False,
"stability_critical": True,
"development_speed": "medium",
"hardware_variety": "low"
}),
("Real-time Inference", {
"memory_pressure": "medium",
"latency_sensitive": True,
"stability_critical": True,
"development_speed": "low",
"hardware_variety": "medium"
}),
("Edge Deployment", {
"memory_pressure": "very_high",
"latency_sensitive": True,
"stability_critical": True,
"development_speed": "low",
"hardware_variety": "very_high"
}),
("Batch Inference", {
"memory_pressure": "low",
"latency_sensitive": False,
"stability_critical": True,
"development_speed": "medium",
"hardware_variety": "low"
})
]
# Define technique characteristics
techniques = {
"Vectorization": {
"implementation_cost": "low",
"memory_benefit": "none",
"latency_benefit": "high",
"stability_risk": "none",
"hardware_dependency": "low"
},
"Kernel Fusion": {
"implementation_cost": "medium",
"memory_benefit": "medium",
"latency_benefit": "medium",
"stability_risk": "low",
"hardware_dependency": "medium"
},
"Graph Optimization": {
"implementation_cost": "very_high",
"memory_benefit": "medium",
"latency_benefit": "very_high",
"stability_risk": "low",
"hardware_dependency": "very_high"
}
}
print("\n🎯 Acceleration Technique Recommendations:")
    print("┌─────────────────────┬─────────────┬─────────────┬─────────────┐")
    print("│ Workload            │ Vectorize   │ Fuse Kernels│ Graph Opt   │")
    print("├─────────────────────┼─────────────┼─────────────┼─────────────┤")
    for workload_name, workload_chars in workloads:
        recommendations = []
        for technique_name in ["Vectorization", "Kernel Fusion", "Graph Optimization"]:
            tech_chars = techniques[technique_name]
            score = 0
            # Benefit vs requirement matching
            if workload_chars["memory_pressure"] in ["high", "very_high"]:
                if tech_chars["memory_benefit"] in ["medium", "high"]:
                    score += 2
            if workload_chars["latency_sensitive"]:
                if tech_chars["latency_benefit"] in ["medium", "high", "very_high"]:
                    score += 2
            # Risk vs tolerance matching
            if workload_chars["stability_critical"]:
                if tech_chars["stability_risk"] in ["none", "low"]:
                    score += 1
                elif tech_chars["stability_risk"] == "medium":
                    score -= 1
            # Implementation cost vs development speed
            if workload_chars["development_speed"] == "high":
                if tech_chars["implementation_cost"] in ["low", "medium"]:
                    score += 1
                elif tech_chars["implementation_cost"] in ["high", "very_high"]:
                    score -= 1
            # Hardware dependency vs variety
            if workload_chars["hardware_variety"] in ["high", "very_high"]:
                if tech_chars["hardware_dependency"] in ["low", "medium"]:
                    score += 1
                elif tech_chars["hardware_dependency"] in ["high", "very_high"]:
                    score -= 2
            # Convert score to recommendation
            if score >= 3:
                rec = "✅ High"
            elif score >= 1:
                rec = "⚡ Medium"
            elif score >= 0:
                rec = "⚠️ Low"
            else:
                rec = "❌ Skip"
            recommendations.append(rec)
        rec_line = "".join(f"│ {rec:11s} " for rec in recommendations) + "│"
        print(f"│ {workload_name:19s} {rec_line}")
    print("└─────────────────────┴─────────────┴─────────────┴─────────────┘")
# Implementation priority framework
print(f"\n🛠️ Implementation Priority Framework:")
print(f" 📊 Phase 1 (Always): Vectorization")
print(f" • Low risk, high reward")
print(f" • Works on any hardware")
print(f" • Foundation for other optimizations")
print(f" ")
print(f" 📊 Phase 2 (Memory constrained): Kernel Fusion")
print(f" • Targets memory-bound operations")
print(f" • Moderate complexity")
print(f" • Significant wins on element-wise ops")
print(f" ")
    print(f"   📊 Phase 3 (Large models): Mixed Precision")
    print(f"     • Essential for large model training")
print(f" • Requires careful validation")
print(f" • Hardware-dependent benefits")
print(f" ")
print(f" 📊 Phase 4 (Production): Graph Optimization")
print(f" • Maximum performance extraction")
print(f" • High implementation cost")
print(f" • Deployment-specific tuning")
print(f"\n💡 Key Decision Factors:")
print(f" 🎯 Start simple: Vectorization first, always")
print(f" 📈 Scale up: Add complexity only when needed")
print(f" ⚡ Measure impact: Profile before and after each optimization")
print(f" 🔄 Iterate: Optimization is an ongoing process, not one-time")
print("🚀 Systematic acceleration beats random optimization")
analyze_acceleration_decision_framework()
# %% [markdown]
"""
## 5.5 Measuring Acceleration Gains with Profiler
Now let's use the **Profiler** tool you built in Module 15 to measure the actual performance improvements from vectorization. This demonstrates the full workflow: build profiling tools (M15), apply optimizations (M16), measure gains (M15+M16).
This is how professional ML engineers work: profile → optimize → measure → repeat.
"""
# %% nbgrader={"grade": false, "grade_id": "demo-profiler-acceleration", "solution": true}
# Import Profiler from Module 15
from tinytorch.profiling.profiler import Profiler
def demo_acceleration_with_profiler():
"""📊 Demonstrate acceleration gains using Profiler from Module 15."""
print("📊 Measuring Acceleration Gains with Profiler")
print("=" * 70)
profiler = Profiler()
# Create two simple models: one slow (loop-based), one fast (vectorized)
class SlowLinear:
"""Linear layer using explicit loops (slow)."""
def __init__(self, in_features, out_features):
self.weight = Tensor(np.random.randn(in_features, out_features).astype(np.float32) * 0.01)
self.name = "slow_linear"
def forward(self, x):
# Explicit loop implementation (for demonstration)
batch_size = x.shape[0]
out_features = self.weight.shape[1]
result = np.zeros((batch_size, out_features), dtype=np.float32)
for i in range(batch_size):
for j in range(out_features):
for k in range(x.shape[1]):
result[i, j] += x.data[i, k] * self.weight.data[k, j]
return Tensor(result)
class FastLinear:
"""Linear layer using vectorized matmul (fast)."""
def __init__(self, in_features, out_features):
self.weight = Tensor(np.random.randn(in_features, out_features).astype(np.float32) * 0.01)
self.name = "fast_linear"
def forward(self, x):
# Vectorized implementation
return vectorized_matmul(x, self.weight)
in_features, out_features = 128, 64
batch_size = 32
# Create models
slow_model = SlowLinear(in_features, out_features)
fast_model = FastLinear(in_features, out_features)
# Create input
input_tensor = Tensor(np.random.randn(batch_size, in_features).astype(np.float32))
print("\n🐢 BEFORE: Loop-based implementation")
print("-" * 70)
# Measure slow model
slow_latency = profiler.measure_latency(slow_model, input_tensor, warmup=3, iterations=10)
slow_flops = profiler.count_flops(slow_model, (batch_size, in_features))
print(f" Latency: {slow_latency:.2f} ms")
print(f" FLOPs: {slow_flops:,}")
print(f" Throughput: {slow_flops / (slow_latency / 1000) / 1e9:.2f} GFLOP/s")
print("\n🚀 AFTER: Vectorized implementation")
print("-" * 70)
# Measure fast model
fast_latency = profiler.measure_latency(fast_model, input_tensor, warmup=3, iterations=10)
fast_flops = profiler.count_flops(fast_model, (batch_size, in_features))
print(f" Latency: {fast_latency:.2f} ms")
print(f" FLOPs: {fast_flops:,}")
print(f" Throughput: {fast_flops / (fast_latency / 1000) / 1e9:.2f} GFLOP/s")
print("\n📈 ACCELERATION GAINS")
print("=" * 70)
speedup = slow_latency / fast_latency
print(f" Speedup: {speedup:.1f}x faster")
print(f" Time saved: {slow_latency - fast_latency:.2f} ms per inference")
print(f" Throughput improvement: {speedup:.1f}x more inferences/second")
print("\n💡 Key Insight:")
    print("   Vectorization with numpy.matmul leverages optimized BLAS libraries")
    print("   that use SIMD instructions and cache-friendly memory access patterns.")
print(f" This is why {speedup:.0f}x speedups are possible with the same FLOPs!")
print("\n✅ This is the power of acceleration: same math, different execution!")
demo_acceleration_with_profiler()
# %% [markdown]
"""
## 6. Module Integration Test
Final validation that all acceleration components work together correctly.
"""
# %% nbgrader={"grade": true, "grade_id": "test-module", "locked": true, "points": 20}
def test_module():
"""
Comprehensive test of entire acceleration module functionality.
This final test ensures:
- All acceleration techniques work correctly
- Performance improvements are measurable
- Components integrate seamlessly
- Module is ready for production use
"""
print("🧪 RUNNING MODULE INTEGRATION TEST")
print("=" * 50)
# Run all unit tests
print("Running unit tests...")
test_unit_vectorized_matmul()
test_unit_fused_gelu()
test_unit_fusion_speedup()
print("\nRunning integration scenarios...")
# Test realistic acceleration pipeline
print("🔬 Integration Test: Complete acceleration pipeline...")
# Create realistic model scenario
batch_size, seq_len, hidden_dim = 16, 64, 256
print(f" Model config: batch={batch_size}, seq_len={seq_len}, hidden={hidden_dim}")
# Test data
x = Tensor(np.random.randn(batch_size, seq_len, hidden_dim).astype(np.float32))
weight = Tensor(np.random.randn(hidden_dim, hidden_dim).astype(np.float32))
print(f" Input tensor: {x.shape}, Weight tensor: {weight.shape}")
# Test complete pipeline: reshape → matmul → activation
print(" Testing vectorized operations...")
# Reshape for matrix multiplication (flatten batch and sequence)
x_reshaped = Tensor(x.data.reshape(-1, hidden_dim))
assert x_reshaped.shape == (batch_size * seq_len, hidden_dim)
# Vectorized matrix multiplication
linear_output = vectorized_matmul(x_reshaped, weight)
assert linear_output.shape == (batch_size * seq_len, hidden_dim)
print(f" ✅ Matrix multiplication: {x_reshaped.shape} @ {weight.shape}{linear_output.shape}")
# Fused activation
activated = fused_gelu(linear_output)
assert activated.shape == linear_output.shape
print(f" ✅ Fused GELU activation: {linear_output.shape}{activated.shape}")
# Reshape back to original structure
final_output = Tensor(activated.data.reshape(batch_size, seq_len, hidden_dim))
assert final_output.shape == x.shape
print(f" ✅ Output reshape: {activated.shape}{final_output.shape}")
class TransformerBlock:
def __init__(self, hidden_dim):
self.hidden_dim = hidden_dim
self.weight1 = Tensor(np.random.randn(hidden_dim, hidden_dim).astype(np.float32))
self.weight2 = Tensor(np.random.randn(hidden_dim, hidden_dim).astype(np.float32))
self.weight1.grad = None
self.weight2.grad = None
def __call__(self, x):
# Simulate transformer block: linear → activation → linear
batch_size, seq_len, hidden_dim = x.shape
x_flat = Tensor(x.data.reshape(-1, hidden_dim))
# First linear layer
h1 = vectorized_matmul(x_flat, self.weight1)
h1_activated = fused_gelu(h1)
# Second linear layer
h2 = vectorized_matmul(h1_activated, self.weight2)
# Reshape back
output = Tensor(h2.data.reshape(batch_size, seq_len, hidden_dim))
return output
def parameters(self):
return [self.weight1, self.weight2]
# Initialize model and test forward pass
model = TransformerBlock(hidden_dim)
print(f" Model parameters: {len(model.parameters())}")
# Test model forward pass with accelerated operations
print(" Testing model forward pass with accelerated operations...")
output = model(x)
assert output.shape == x.shape
print(f" ✅ Model forward pass: {x.shape}{output.shape}")
# Verify accelerated operations provide correct results
print(" Validating numerical correctness...")
# Check output is finite and has reasonable values
assert np.all(np.isfinite(output.data)), "Model output contains NaN or Inf"
output_mean = np.mean(np.abs(output.data))
# Random initialization can produce larger values - verify reasonable range
assert output_mean < 1000.0, f"Output values unreasonably large: {output_mean}"
print(f" ✅ Numerical validation passed (mean magnitude: {output_mean:.4f})")
print(" Testing performance characteristics...")
# Verify acceleration provides measurable benefits
test_sizes = [128, 256]
for size in test_sizes:
test_x = Tensor(np.random.randn(size, size).astype(np.float32))
test_y = Tensor(np.random.randn(size, size).astype(np.float32))
        # Time operations and verify reasonable performance
        # (perf_counter has higher resolution than time.time for short intervals)
        start = time.perf_counter()
        _ = vectorized_matmul(test_x, test_y)
        matmul_time = time.perf_counter() - start
        start = time.perf_counter()
        _ = fused_gelu(test_x)
        gelu_time = time.perf_counter() - start
# Verify operations complete in reasonable time
assert matmul_time < 1.0, f"Matrix multiplication too slow: {matmul_time:.3f}s"
assert gelu_time < 0.1, f"GELU activation too slow: {gelu_time:.3f}s"
print(f" ✅ Size {size}: matmul={matmul_time*1000:.1f}ms, gelu={gelu_time*1000:.1f}ms")
    print("✅ End-to-end acceleration pipeline works!")
print("\n" + "=" * 50)
print("🎉 ALL TESTS PASSED! Module ready for export.")
print("Run: tito module complete 16")
# Call the module test
test_module()
# %% nbgrader={"grade": false, "grade_id": "main-execution", "solution": false}
# Main execution block
if __name__ == "__main__":
print("🚀 Running Acceleration module...")
test_module()
print("✅ Module validation complete!")
# %% [markdown]
"""
## 🤔 ML Systems Thinking: Acceleration and Performance
### Question 1: Arithmetic Intensity Analysis
You implemented vectorized matrix multiplication and fused GELU.
- Matrix multiplication (1024×1024): Performs ~2.1 billion FLOPs, reads ~12 MB data
- Arithmetic intensity: _____ FLOPs/byte
- Compared to element-wise addition (0.33 FLOPs/byte): _____× higher intensity
- Why does this make matrix multiplication ideal for GPUs? _____
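A quick back-of-envelope check of these numbers (a sketch only: it assumes float32 and the simplest possible traffic model, where each matrix moves through memory exactly once and caches are ignored):

```python
# Arithmetic intensity for a 1024x1024 matmul under a one-pass traffic model.
n = 1024
flops = 2 * n ** 3                  # one multiply + one add per inner-loop step
bytes_moved = 3 * n * n * 4         # read A, read B, write C; 4 bytes per float32
print(f"matmul: {flops / 1e9:.2f} GFLOPs over {bytes_moved / 1e6:.1f} MB "
      f"-> {flops / bytes_moved:.1f} FLOPs/byte")

# Element-wise add: 1 FLOP per element, 2 reads + 1 write. Counted per memory
# *access* that is 1/3 ~= 0.33 (the figure quoted above); counted per *byte*
# of float32 traffic it is 1/12 ~= 0.083.
print(f"add: {1 / 3:.2f} FLOPs/access, {1 / 12:.3f} FLOPs/byte")
```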
### Question 2: Kernel Fusion Memory Benefits
Your fused_gelu combines 7 operations into a single expression.
- Unfused version memory accesses: 7 reads + 7 writes = _____ per element
- Fused version memory accesses: 1 read + 1 write = _____ per element
- Memory bandwidth reduction: _____%
- Why is this critical for transformer inference? _____
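The blanks above can be checked with a simple per-element traffic count (a model, not a measurement: it assumes every unfused op streams its operand from memory and writes an intermediate back, while fusion keeps intermediates in registers):

```python
# Memory-traffic model for fused vs. unfused GELU, counted per element.
ops_in_gelu = 7                     # as decomposed in fused_gelu
unfused = 2 * ops_in_gelu           # 7 reads + 7 writes
fused = 2                           # 1 read + 1 write
print(f"unfused: {unfused} accesses/element, fused: {fused} accesses/element")
print(f"bandwidth reduction: {1 - fused / unfused:.0%}")   # 86%
```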
### Question 3: Production Optimization Strategy
Based on your decision framework analysis:
For edge deployment (memory critical, stability required, hardware diverse):
- Priority 1 technique: _____ (low risk, universal)
- Priority 2 technique: _____ (memory benefits)
- Skip technique: _____ (why: _____)
- What's the primary constraint: memory, compute, or power? _____
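One way to reason about these trade-offs is to encode the constraints explicitly. The helper below is a hypothetical sketch: the technique rankings are illustrative assumptions, not measured guidance.

```python
# Hypothetical decision helper for choosing acceleration techniques.
# Rankings are assumptions for illustration only.
def pick_techniques(memory_critical, needs_stability, diverse_hardware):
    plan = ["vectorization"]            # low risk, works on any hardware
    if memory_critical:
        plan.append("kernel fusion")    # cuts intermediate buffer traffic
    if not (needs_stability or diverse_hardware):
        plan.append("custom kernels")   # high payoff, high maintenance cost
    return plan

# Edge deployment: memory critical, stability required, hardware diverse.
print(pick_techniques(memory_critical=True, needs_stability=True,
                      diverse_hardware=True))
# -> ['vectorization', 'kernel fusion']
```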
"""
# %% [markdown]
"""
## 🎯 MODULE SUMMARY: Acceleration
Congratulations! You've mastered the fundamental techniques for accelerating neural networks!
### Key Accomplishments
- Built **vectorized operations** leveraging SIMD and optimized BLAS for order-of-magnitude speedups over interpreted Python loops
- Implemented **kernel fusion** reducing memory bandwidth by 60-80% for element-wise operations
- Analyzed **arithmetic intensity patterns** and their impact on the roofline model
- Developed **production decision framework** for systematic optimization
- All tests pass ✅ (validated by `test_module()`)
### Systems Insights Discovered
- **Roofline Model**: Operations with high arithmetic intensity (FLOPs/byte) scale better
- **Memory Bandwidth**: Often the limiting factor for modern accelerators
- **Kernel Fusion**: Critical for memory-bound workloads, reduces intermediate storage overhead
- **Optimization Strategy**: Start simple (vectorization), add complexity as needed
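The roofline insight above can be made concrete in a few lines. The peak-compute and bandwidth numbers below are illustrative assumptions, not measurements of any real machine:

```python
# Minimal roofline model: attainable performance is capped by either peak
# compute or (bandwidth x arithmetic intensity), whichever is lower.
PEAK_GFLOPS = 200.0      # assumed peak compute
BANDWIDTH_GBS = 25.0     # assumed DRAM bandwidth

def attainable(intensity_flops_per_byte):
    return min(PEAK_GFLOPS, BANDWIDTH_GBS * intensity_flops_per_byte)

for name, ai in [("element-wise add", 1 / 12), ("1024^3 matmul", 170.7)]:
    bound = "memory" if attainable(ai) < PEAK_GFLOPS else "compute"
    print(f"{name:16s}: {attainable(ai):7.2f} GFLOP/s ({bound}-bound)")
```

Low-intensity element-wise ops hit the bandwidth ceiling long before the compute ceiling, which is exactly why fusing them pays off.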
### Production Impact
Your acceleration techniques enable:
- **Training larger models** within memory constraints
- **Faster iteration cycles** during research and development
- **Better hardware utilization** across different deployment targets
- **Cost reduction** through improved efficiency
### Ready for Next Steps
Your acceleration implementations provide the foundation for quantization techniques in Module 17.
The performance analysis skills transfer directly to production optimization workflows.
Export with: `tito module complete 16`
**Next**: Module 17 will add quantization to further reduce memory and increase throughput while maintaining accuracy!
"""