mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-05-07 19:16:40 -05:00
- Added progressive complexity guidelines (Foundation/Intermediate/Advanced) - Added measurement function consolidation to prevent information overload - Fixed all diagnostic issues in losses_dev.py - Fixed markdown formatting across all modules - Consolidated redundant analysis functions in foundation modules - Fixed syntax errors and unused variables - Ensured all educational content is in proper markdown cells for Jupyter
2555 lines
98 KiB
Python
2555 lines
98 KiB
Python
# %% [markdown]
|
||
"""
|
||
# Kernels - High-Performance Computational Kernels
|
||
|
||
Welcome to Kernels! You'll implement high-performance computational kernels that power modern ML systems!
|
||
|
||
## LINK Building on Previous Learning
|
||
**What You Built Before**:
|
||
- Module 11 (Training): Complete training loops with gradient computation
|
||
- Module 12 (Regularization): Advanced training techniques for robust models
|
||
|
||
**What's Working**: You can train neural networks end-to-end with sophisticated optimization and regularization!
|
||
|
||
**The Gap**: Your implementations work correctly but may not be optimized for real-world performance demands.
|
||
|
||
**This Module's Solution**: Implement high-performance computational kernels that optimize memory access, leverage parallelism, and achieve production-grade performance.
|
||
|
||
**Connection Map**:
|
||
```
|
||
Training -> Kernels -> Benchmarking
|
||
(correct) (fast) (measured)
|
||
```
|
||
|
||
## Learning Goals (Your 5-Point Framework)
|
||
- **Systems understanding**: Memory layout, cache optimization, and vectorization for ML operations
|
||
- **Core implementation skill**: Building high-performance computational kernels from scratch
|
||
- **Pattern/abstraction mastery**: Recognizing optimization patterns across different hardware architectures
|
||
- **Framework connections**: Understanding how PyTorch and TensorFlow achieve high performance
|
||
- **Optimization trade-offs**: Balancing memory usage, computational complexity, and parallelism
|
||
|
||
## Build -> Use -> Reflect
|
||
1. **Build**: Implement optimized kernels for matrix operations, activations, and memory management
|
||
2. **Use**: Apply kernels to real ML workloads and measure performance improvements
|
||
3. **Reflect**: Analyze optimization patterns and design production-grade kernel architectures
|
||
|
||
## Systems Reality Check
|
||
TIP **Production Context**: PyTorch uses custom CUDA kernels and CPU vectorization for 10-100x speedups
|
||
SPEED **Performance Insight**: Memory bandwidth is often the limiting factor, not compute - optimize data movement first
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## What Are High-Performance Kernels?
|
||
|
||
High-performance kernels are optimized computational functions that leverage hardware-specific features like:
|
||
|
||
```
|
||
CPU Kernels:
|
||
+-------------------------------------+
|
||
| SIMD Instructions (AVX, SSE) | <- Process 4-16 floats simultaneously
|
||
| Cache-Friendly Memory Patterns | <- Minimize cache misses
|
||
| Loop Unrolling & Vectorization | <- Eliminate loop overhead
|
||
+-------------------------------------+
|
||
|
||
GPU Kernels:
|
||
+-------------------------------------+
|
||
| Thread Blocks & Shared Memory | <- Parallel processing with fast memory
|
||
| Memory Coalescing | <- Efficient global memory access
|
||
| Warp-Level Operations | <- 32 threads execute together
|
||
+-------------------------------------+
|
||
```
|
||
|
||
**Why This Matters for ML Systems:**
|
||
- **Training Speed**: 10-100x faster matrix operations enable larger models
|
||
- **Inference Latency**: Optimized kernels reduce serving costs and improve user experience
|
||
- **Memory Efficiency**: Better data layouts reduce memory bandwidth requirements
|
||
- **Energy Efficiency**: Optimized code reduces power consumption in data centers
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## Mathematical Foundations
|
||
|
||
### Cache-Friendly Matrix Multiplication
|
||
|
||
Standard algorithm is O(n³) but cache-unfriendly:
|
||
```python
|
||
# Cache-unfriendly (random memory access)
|
||
for i in range(n):
|
||
for j in range(n):
|
||
for k in range(n):
|
||
C[i,j] += A[i,k] * B[k,j] # B[k,j] jumps around memory
|
||
```
|
||
|
||
Blocked algorithm improves cache locality:
|
||
```python
|
||
# Cache-friendly (blocked access)
|
||
for bi in range(0, n, block_size):
|
||
for bj in range(0, n, block_size):
|
||
for bk in range(0, n, block_size):
|
||
# Process block that fits in cache
|
||
for i in range(bi, min(bi+block_size, n)):
|
||
for j in range(bj, min(bj+block_size, n)):
|
||
for k in range(bk, min(bk+block_size, n)):
|
||
C[i,j] += A[i,k] * B[k,j]
|
||
```
|
||
|
||
### SIMD Vectorization
|
||
|
||
Single Instruction, Multiple Data (SIMD) processes multiple elements simultaneously:
|
||
|
||
```
|
||
Scalar ReLU (1 element at a time):
|
||
for i in range(n):
|
||
y[i] = max(0, x[i]) # 1 operation per cycle
|
||
|
||
Vectorized ReLU (8 elements at a time with AVX):
|
||
y = np.maximum(0, x) # 8 operations per cycle
|
||
```
|
||
|
||
### Memory Access Patterns
|
||
|
||
```
|
||
Row-Major Access (Fast):
|
||
A[0,0] A[0,1] A[0,2] A[0,3] ... <- Sequential memory access
|
||
|
||
Column-Major Access (Slow):
|
||
A[0,0] A[1,0] A[2,0] A[3,0] ... <- Strided memory access
|
||
|
||
Cache Line Impact:
|
||
+-----+-----+-----+-----+
|
||
| A[0,0:4] loaded together | <- 64-byte cache line
|
||
+-----+-----+-----+-----+
|
||
```
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## Why Build High-Performance Kernels?
|
||
|
||
### Production Performance Requirements
|
||
Modern ML systems require optimized kernels for:
|
||
|
||
1. **Real-Time Inference**: Self-driving cars need <10ms response times
|
||
2. **Large-Scale Training**: Training GPT-scale models requires maximum hardware utilization
|
||
3. **Edge Deployment**: Mobile and IoT devices have limited compute and memory
|
||
4. **Cost Optimization**: Cloud compute costs scale with execution time
|
||
|
||
### Learning Through Implementation
|
||
Building kernels teaches you:
|
||
|
||
- **Hardware-Software Interface**: How software maps to CPU/GPU architecture
|
||
- **Performance Engineering**: Systematic optimization methodology
|
||
- **Production Debugging**: Why ML models are slow and how to fix them
|
||
- **System Design**: How to build scalable ML infrastructure
|
||
|
||
### Connection to Frameworks
|
||
Every major ML framework uses custom kernels:
|
||
- **PyTorch**: ATen library with CUDA kernels and CPU vectorization
|
||
- **TensorFlow**: XLA compiler with hardware-specific optimizations
|
||
- **JAX**: JIT compilation with automatic kernel fusion
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## Production Context - How Real Systems Work
|
||
|
||
### PyTorch Kernel Architecture
|
||
```python
|
||
# High-level PyTorch operation
|
||
result = torch.matmul(A, B)
|
||
|
||
# Maps to optimized kernel based on:
|
||
# - Hardware: CPU (MKL-DNN) vs GPU (cuBLAS)
|
||
# - Data type: float32, float16, int8
|
||
# - Tensor size: Small (custom) vs Large (BLAS)
|
||
# - Memory layout: Contiguous vs Strided
|
||
```
|
||
|
||
### Performance Hierarchy
|
||
```
|
||
1. Specialized Hardware: TPUs, Tensor Cores (100-1000x)
|
||
2. Optimized Libraries: cuBLAS, MKL (10-100x)
|
||
3. Vectorized Code: SIMD, OpenMP (2-10x)
|
||
4. Cache-Friendly: Blocked algorithms (1.5-3x)
|
||
5. Naive Implementation: Baseline (1x)
|
||
```
|
||
|
||
### Real-World Impact
|
||
- **Training Cost**: Optimized kernels reduce AWS training costs by 50-90%
|
||
- **Serving Latency**: Fast inference enables real-time applications
|
||
- **Model Size**: Quantization kernels enable deployment on mobile devices
|
||
- **Energy Usage**: Efficient kernels reduce data center power consumption
|
||
"""
|
||
|
||
# %%
|
||
#| default_exp core.kernels
|
||
import numpy as np
|
||
import sys
|
||
import os
|
||
import time
|
||
import psutil
|
||
from typing import Callable, Dict, Any, Optional, Tuple, List
|
||
from concurrent.futures import ThreadPoolExecutor
|
||
|
||
# Import our existing components
|
||
try:
|
||
from tinytorch.core.tensor import Tensor
|
||
except ImportError:
|
||
# Create minimal mock for development
|
||
class Tensor:
|
||
def __init__(self, data):
|
||
self.data = np.array(data)
|
||
self.shape = self.data.shape
|
||
def __str__(self):
|
||
return f"Tensor({self.data})"
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## Architecture - Building High-Performance Kernels
|
||
|
||
Our kernel optimization strategy follows a systematic hierarchy:
|
||
|
||
```
|
||
TARGET Optimization Strategy:
|
||
+-------------------------------------+
|
||
| 1. Correctness: Get the right answer |
|
||
| 2. Cache Optimization: Memory patterns |
|
||
| 3. Vectorization: SIMD instructions |
|
||
| 4. Parallelization: Multi-core |
|
||
| 5. Quantization: Reduced precision |
|
||
+-------------------------------------+
|
||
|
||
🔧 Implementation Layers:
|
||
+-------------------------------------+
|
||
| Higher Level: Kernel Composition | <- Combine optimizations
|
||
| Mid Level: Algorithm Optimization | <- Cache blocking, tiling
|
||
| Lower Level: Hardware Primitives | <- SIMD, memory layout
|
||
+-------------------------------------+
|
||
```
|
||
|
||
**Design Principles:**
|
||
1. **Measure First**: Profile before optimizing
|
||
2. **Systematic Approach**: One optimization at a time
|
||
3. **Hardware Awareness**: Understand the target architecture
|
||
4. **Composability**: Build higher-level optimizations from primitives
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## Implementation - Building High-Performance Kernels
|
||
|
||
### Core Timing Infrastructure
|
||
"""
|
||
|
||
# %%
|
||
def time_kernel(func: Callable, *args, **kwargs) -> Tuple[Any, float]:
|
||
"""
|
||
Precision timing function for measuring kernel performance.
|
||
|
||
This is the foundation for all performance analysis - accurate timing
|
||
that accounts for CPU frequency scaling and system noise.
|
||
|
||
Args:
|
||
func: The kernel function to time
|
||
*args: Arguments to pass to the function
|
||
**kwargs: Keyword arguments to pass to the function
|
||
|
||
Returns:
|
||
tuple: (function_result, execution_time_microseconds)
|
||
|
||
TODO: Implement high-precision kernel timing with noise reduction.
|
||
|
||
APPROACH:
|
||
1. Use time.perf_counter() for high precision timing
|
||
2. Warm up CPU to stable frequency before measurement
|
||
3. Handle OS scheduling noise with multiple measurements
|
||
4. Return both result and timing for validation
|
||
|
||
EXAMPLE:
|
||
>>> result, time_us = time_kernel(np.matmul, A, B)
|
||
>>> print(f"Matrix multiply took {time_us:.2f} microseconds")
|
||
|
||
PERFORMANCE CONSIDERATIONS:
|
||
- perf_counter() has nanosecond precision on modern systems
|
||
- CPU frequency scaling can affect measurements
|
||
- OS scheduling introduces timing noise
|
||
- Cache state affects first vs subsequent runs
|
||
"""
|
||
### BEGIN SOLUTION
|
||
# Warm-up run to stabilize CPU frequency
|
||
_ = func(*args, **kwargs)
|
||
|
||
# High-precision timing
|
||
start = time.perf_counter()
|
||
result = func(*args, **kwargs)
|
||
end = time.perf_counter()
|
||
|
||
# Convert to microseconds for better readability
|
||
execution_time_us = (end - start) * 1_000_000
|
||
|
||
return result, execution_time_us
|
||
### END SOLUTION
|
||
|
||
# PASS IMPLEMENTATION CHECKPOINT: Timing infrastructure complete
|
||
|
||
# THINK PREDICTION: How much timing overhead does our measurement add?
|
||
# Your guess: _____ microseconds
|
||
|
||
# MAGNIFY SYSTEMS INSIGHT: Timing Overhead Analysis
|
||
def analyze_timing_overhead():
|
||
"""Measure the overhead of our timing infrastructure."""
|
||
try:
|
||
# Test with minimal operation
|
||
def minimal_op():
|
||
return 42
|
||
|
||
# Time the timing overhead
|
||
measurements = []
|
||
for _ in range(100):
|
||
_, timing = time_kernel(minimal_op)
|
||
measurements.append(timing)
|
||
|
||
avg_overhead = np.mean(measurements)
|
||
std_overhead = np.std(measurements)
|
||
min_overhead = np.min(measurements)
|
||
|
||
print(f"Timing overhead analysis:")
|
||
print(f" Average: {avg_overhead:.3f} μs")
|
||
print(f" Std dev: {std_overhead:.3f} μs")
|
||
print(f" Minimum: {min_overhead:.3f} μs")
|
||
print(f" Relative precision: ±{std_overhead/avg_overhead*100:.1f}%")
|
||
|
||
# TIP WHY THIS MATTERS: Timing overhead must be much smaller than
|
||
# the operations we're measuring, or results will be meaningless.
|
||
# Modern CPUs: ~1-10 μs overhead, so measure operations >100 μs
|
||
|
||
return {
|
||
'avg_overhead_us': avg_overhead,
|
||
'precision_percent': std_overhead/avg_overhead*100,
|
||
'reliable_for_operations_above_us': avg_overhead * 10
|
||
}
|
||
except Exception as e:
|
||
print(f"WARNING️ Timing analysis error: {e}")
|
||
return None
|
||
|
||
# Run the analysis
|
||
timing_analysis = analyze_timing_overhead()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### TEST Unit Test: Timing Infrastructure
|
||
This test validates `time_kernel`, ensuring accurate performance measurement
|
||
"""
|
||
|
||
# %%
|
||
def test_unit_timing_infrastructure():
|
||
"""Test timing infrastructure with known operations."""
|
||
print("TEST Unit Test: Timing Infrastructure")
|
||
|
||
# Test 1: Basic timing functionality
|
||
def test_operation():
|
||
time.sleep(0.001) # 1ms sleep
|
||
return "done"
|
||
|
||
result, elapsed_us = time_kernel(test_operation)
|
||
|
||
assert result == "done", "Function result should be preserved"
|
||
assert 800 <= elapsed_us <= 2000, f"1ms sleep should take ~1000μs, got {elapsed_us:.1f}μs"
|
||
print(f"PASS Basic timing: {elapsed_us:.1f}μs for 1ms operation")
|
||
|
||
# Test 2: Timing precision
|
||
def fast_operation():
|
||
return sum(range(1000))
|
||
|
||
measurements = []
|
||
for _ in range(10):
|
||
_, timing = time_kernel(fast_operation)
|
||
measurements.append(timing)
|
||
|
||
cv = np.std(measurements) / np.mean(measurements)
|
||
assert cv < 0.5, f"Timing precision should be reasonable, CV={cv:.3f}"
|
||
print(f"PASS Timing precision: CV={cv:.3f} across 10 measurements")
|
||
|
||
# Test 3: Argument passing
|
||
def add_operation(a, b, c=0):
|
||
return a + b + c
|
||
|
||
result, _ = time_kernel(add_operation, 5, 10, c=2)
|
||
assert result == 17, f"Arguments should pass correctly, got {result}"
|
||
print("PASS Argument passing works correctly")
|
||
|
||
# Run the test
|
||
test_unit_timing_infrastructure()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### Matrix Multiplication Optimization
|
||
"""
|
||
|
||
# %%
|
||
def matmul_baseline(A: np.ndarray, B: np.ndarray) -> np.ndarray:
|
||
"""
|
||
Baseline matrix multiplication using NumPy's optimized implementation.
|
||
|
||
This serves as our reference implementation and performance baseline.
|
||
NumPy uses highly optimized BLAS libraries (Intel MKL, OpenBLAS).
|
||
|
||
Args:
|
||
A: Left matrix (M x K)
|
||
B: Right matrix (K x N)
|
||
|
||
Returns:
|
||
np.ndarray: Result matrix (M x N)
|
||
|
||
TODO: Use NumPy's optimized matrix multiplication as baseline.
|
||
|
||
APPROACH:
|
||
1. Validate input shapes for compatibility
|
||
2. Use np.dot() which calls optimized BLAS
|
||
3. This is our "ground truth" for correctness and baseline for performance
|
||
|
||
EXAMPLE:
|
||
>>> A = np.random.randn(100, 50)
|
||
>>> B = np.random.randn(50, 75)
|
||
>>> C = matmul_baseline(A, B)
|
||
>>> print(C.shape) # (100, 75)
|
||
|
||
PERFORMANCE NOTES:
|
||
- NumPy calls optimized BLAS: Intel MKL or OpenBLAS
|
||
- These libraries use vectorization, threading, and cache optimization
|
||
- Typical performance: 100+ GFLOPS on modern CPUs
|
||
"""
|
||
### BEGIN SOLUTION
|
||
# Validate shapes
|
||
if A.shape[1] != B.shape[0]:
|
||
raise ValueError(f"Cannot multiply {A.shape} and {B.shape}: inner dimensions don't match")
|
||
|
||
# Use NumPy's optimized matrix multiplication
|
||
result = np.dot(A, B)
|
||
|
||
return result
|
||
### END SOLUTION
|
||
|
||
# PASS IMPLEMENTATION CHECKPOINT: Baseline matrix multiplication complete
|
||
|
||
# MAGNIFY SYSTEMS INSIGHT: Matrix Multiplication Performance Scaling
|
||
def analyze_matmul_scaling():
|
||
"""Analyze how matrix multiplication performance scales with size."""
|
||
try:
|
||
sizes = [64, 128, 256, 512]
|
||
results = []
|
||
|
||
for size in sizes:
|
||
A = np.random.randn(size, size).astype(np.float32)
|
||
B = np.random.randn(size, size).astype(np.float32)
|
||
|
||
# Time the operation
|
||
_, time_us = time_kernel(matmul_baseline, A, B)
|
||
|
||
# Calculate metrics
|
||
flops = 2 * size**3 # Multiply-accumulate operations
|
||
gflops = flops / (time_us / 1_000_000) / 1e9
|
||
|
||
results.append({
|
||
'size': size,
|
||
'time_us': time_us,
|
||
'gflops': gflops,
|
||
'memory_mb': (A.nbytes + B.nbytes + A.nbytes) / 1024 / 1024
|
||
})
|
||
|
||
print(f"Size {size:3d}: {time_us:8.1f}μs, {gflops:6.1f} GFLOPS, {results[-1]['memory_mb']:5.1f}MB")
|
||
|
||
# Analyze scaling behavior
|
||
time_scaling = results[-1]['time_us'] / results[0]['time_us']
|
||
size_scaling = (results[-1]['size'] / results[0]['size']) ** 3
|
||
efficiency = time_scaling / size_scaling
|
||
|
||
print(f"\nScaling analysis:")
|
||
print(f" Time scaling: {time_scaling:.1f}x")
|
||
print(f" Theoretical (O(n³)): {size_scaling:.1f}x")
|
||
print(f" Efficiency: {efficiency:.3f} (1.0 = perfect scaling)")
|
||
|
||
# TIP WHY THIS MATTERS: Matrix multiplication is O(n³), but cache effects
|
||
# and memory bandwidth limits mean real performance doesn't scale perfectly.
|
||
# Understanding these limits helps size operations for optimal performance.
|
||
|
||
return results
|
||
|
||
except Exception as e:
|
||
print(f"WARNING️ Scaling analysis error: {e}")
|
||
return None
|
||
|
||
# Run the analysis
|
||
matmul_scaling = analyze_matmul_scaling()
|
||
|
||
# %%
|
||
def cache_friendly_matmul(A: np.ndarray, B: np.ndarray, block_size: int = 64) -> np.ndarray:
|
||
"""
|
||
Cache-friendly matrix multiplication using blocking technique.
|
||
|
||
This implementation improves memory access patterns by processing
|
||
matrices in cache-sized blocks, reducing cache misses.
|
||
|
||
Args:
|
||
A: Left matrix (M x K)
|
||
B: Right matrix (K x N)
|
||
block_size: Size of cache blocks (default 64)
|
||
|
||
Returns:
|
||
np.ndarray: Result matrix (M x N)
|
||
|
||
TODO: Implement cache-friendly matrix multiplication using blocking.
|
||
|
||
APPROACH:
|
||
1. Divide matrices into block_size x block_size blocks
|
||
2. Process blocks in order that maximizes data reuse
|
||
3. Inner loops work on cache-friendly sub-matrices
|
||
4. Accumulate partial results in output blocks
|
||
|
||
BLOCKING ALGORITHM:
|
||
```
|
||
for each block row of A:
|
||
for each block column of B:
|
||
for each block column of A / block row of B:
|
||
multiply sub-blocks and accumulate
|
||
```
|
||
|
||
EXAMPLE:
|
||
>>> A = np.random.randn(128, 128)
|
||
>>> B = np.random.randn(128, 128)
|
||
>>> C = cache_friendly_matmul(A, B, block_size=32)
|
||
|
||
CACHE OPTIMIZATION:
|
||
- block_size should fit in L1 cache (~32KB)
|
||
- For float32: block_size=64 uses ~16KB per block
|
||
- Reduces cache misses from O(n³) to O(n³/B) where B=block_size
|
||
"""
|
||
### BEGIN SOLUTION
|
||
M, K = A.shape
|
||
K2, N = B.shape
|
||
|
||
if K != K2:
|
||
raise ValueError(f"Cannot multiply {A.shape} and {B.shape}")
|
||
|
||
# Initialize result matrix
|
||
C = np.zeros((M, N), dtype=A.dtype)
|
||
|
||
# Cache-friendly blocked multiplication
|
||
for i in range(0, M, block_size):
|
||
for j in range(0, N, block_size):
|
||
for k in range(0, K, block_size):
|
||
# Define block boundaries
|
||
end_i = min(i + block_size, M)
|
||
end_j = min(j + block_size, N)
|
||
end_k = min(k + block_size, K)
|
||
|
||
# Extract blocks
|
||
A_block = A[i:end_i, k:end_k]
|
||
B_block = B[k:end_k, j:end_j]
|
||
|
||
# Multiply blocks and accumulate
|
||
C[i:end_i, j:end_j] += np.dot(A_block, B_block)
|
||
|
||
return C
|
||
### END SOLUTION
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### TEST Unit Test: Cache-Friendly Matrix Multiplication
|
||
This test validates `cache_friendly_matmul`, ensuring correctness and performance improvement
|
||
"""
|
||
|
||
# %%
|
||
def test_unit_cache_friendly_matmul():
|
||
"""Test cache-friendly matrix multiplication."""
|
||
print("TEST Unit Test: Cache-Friendly Matrix Multiplication")
|
||
|
||
# Test 1: Correctness
|
||
A = np.array([[1, 2], [3, 4]], dtype=np.float32)
|
||
B = np.array([[5, 6], [7, 8]], dtype=np.float32)
|
||
|
||
result_cache = cache_friendly_matmul(A, B, block_size=1)
|
||
result_baseline = matmul_baseline(A, B)
|
||
|
||
assert np.allclose(result_cache, result_baseline), "Cache-friendly result should match baseline"
|
||
print("PASS Correctness: Matches baseline implementation")
|
||
|
||
# Test 2: Performance comparison
|
||
size = 256
|
||
A_large = np.random.randn(size, size).astype(np.float32)
|
||
B_large = np.random.randn(size, size).astype(np.float32)
|
||
|
||
_, baseline_time = time_kernel(matmul_baseline, A_large, B_large)
|
||
_, cache_time = time_kernel(cache_friendly_matmul, A_large, B_large, 64)
|
||
|
||
print(f"PASS Performance: Baseline={baseline_time:.1f}μs, Cache-friendly={cache_time:.1f}μs")
|
||
|
||
# Test 3: Different block sizes
|
||
block_sizes = [32, 64, 128]
|
||
for bs in block_sizes:
|
||
result = cache_friendly_matmul(A, B, block_size=bs)
|
||
assert np.allclose(result, result_baseline), f"Block size {bs} should be correct"
|
||
|
||
print(f"PASS Block sizes: Tested {block_sizes}")
|
||
|
||
# Run the test
|
||
test_unit_cache_friendly_matmul()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### Vectorized Operations
|
||
"""
|
||
|
||
# %%
|
||
def vectorized_relu(x: np.ndarray) -> np.ndarray:
|
||
"""
|
||
Vectorized ReLU implementation using SIMD principles.
|
||
|
||
This function demonstrates how to write operations that leverage
|
||
CPU vectorization for better performance than scalar loops.
|
||
|
||
Args:
|
||
x: Input array
|
||
|
||
Returns:
|
||
np.ndarray: ReLU applied element-wise
|
||
|
||
TODO: Implement vectorized ReLU optimized for SIMD execution.
|
||
|
||
APPROACH:
|
||
1. Ensure input array is contiguous for vectorization
|
||
2. Use NumPy's vectorized operations (compile to SIMD)
|
||
3. Handle different data types appropriately
|
||
4. Return result maintaining input shape
|
||
|
||
VECTORIZATION TECHNIQUES:
|
||
- np.maximum() uses SIMD instructions when possible
|
||
- Contiguous memory layout enables efficient vectorization
|
||
- Proper data types (float32) maximize SIMD lane utilization
|
||
|
||
EXAMPLE:
|
||
>>> x = np.array([-2, -1, 0, 1, 2], dtype=np.float32)
|
||
>>> y = vectorized_relu(x)
|
||
>>> print(y) # [0, 0, 0, 1, 2]
|
||
|
||
PERFORMANCE BENEFITS:
|
||
- AVX2: 8 float32 operations per instruction
|
||
- AVX-512: 16 float32 operations per instruction
|
||
- Typical speedup: 4-16x over scalar loops
|
||
"""
|
||
### BEGIN SOLUTION
|
||
# Ensure contiguous memory layout for best SIMD performance
|
||
if not x.flags.c_contiguous:
|
||
x = np.ascontiguousarray(x)
|
||
|
||
# Vectorized ReLU using NumPy's maximum function
|
||
# This compiles to SIMD instructions on modern CPUs
|
||
result = np.maximum(0, x)
|
||
|
||
return result
|
||
### END SOLUTION
|
||
|
||
# %%
|
||
def vectorized_operations(x: np.ndarray, y: np.ndarray) -> Dict[str, np.ndarray]:
|
||
"""
|
||
Collection of vectorized operations demonstrating SIMD principles.
|
||
|
||
Shows how multiple operations can be vectorized efficiently.
|
||
|
||
Args:
|
||
x: First input array
|
||
y: Second input array (must be same shape as x)
|
||
|
||
Returns:
|
||
Dict[str, np.ndarray]: Dictionary of vectorized operation results
|
||
|
||
TODO: Implement vectorized versions of common operations.
|
||
|
||
OPERATIONS TO IMPLEMENT:
|
||
- Element-wise addition, multiplication
|
||
- Squared difference
|
||
- Euclidean distance
|
||
- Dot product
|
||
|
||
APPROACH:
|
||
1. Validate input shapes match
|
||
2. Use NumPy vectorized functions
|
||
3. Combine operations when beneficial
|
||
4. Return comprehensive results dictionary
|
||
|
||
EXAMPLE:
|
||
>>> x = np.array([1, 2, 3, 4])
|
||
>>> y = np.array([2, 3, 4, 5])
|
||
>>> results = vectorized_operations(x, y)
|
||
>>> print(results['element_wise_add']) # [3, 5, 7, 9]
|
||
|
||
VECTORIZATION BENEFITS:
|
||
- Single instruction processes multiple elements
|
||
- Reduced loop overhead
|
||
- Better CPU pipeline utilization
|
||
"""
|
||
### BEGIN SOLUTION
|
||
# Validate shapes
|
||
if x.shape != y.shape:
|
||
raise ValueError(f"Input shapes don't match: {x.shape} vs {y.shape}")
|
||
|
||
# Ensure contiguous arrays for best performance
|
||
if not x.flags.c_contiguous:
|
||
x = np.ascontiguousarray(x)
|
||
if not y.flags.c_contiguous:
|
||
y = np.ascontiguousarray(y)
|
||
|
||
# Vectorized operations
|
||
results = {
|
||
'element_wise_add': x + y,
|
||
'element_wise_multiply': x * y,
|
||
'squared_difference': (x - y) ** 2,
|
||
'euclidean_distance': np.sqrt(np.sum((x - y) ** 2)),
|
||
'dot_product': np.dot(x.flatten(), y.flatten()),
|
||
'cosine_similarity': np.dot(x.flatten(), y.flatten()) / (np.linalg.norm(x) * np.linalg.norm(y))
|
||
}
|
||
|
||
return results
|
||
### END SOLUTION
|
||
|
||
# PASS IMPLEMENTATION CHECKPOINT: Vectorized operations complete
|
||
|
||
# MAGNIFY SYSTEMS INSIGHT: Vectorization Performance Analysis
|
||
def analyze_vectorization_performance():
|
||
"""Compare vectorized vs scalar performance."""
|
||
try:
|
||
size = 100000
|
||
x = np.random.randn(size).astype(np.float32)
|
||
y = np.random.randn(size).astype(np.float32)
|
||
|
||
# Time vectorized ReLU
|
||
_, vec_time = time_kernel(vectorized_relu, x)
|
||
|
||
# Time scalar ReLU (simulated)
|
||
def scalar_relu_simulation(arr):
|
||
# Simulate scalar processing with numpy operations
|
||
# (Real scalar would be much slower)
|
||
result = np.zeros_like(arr)
|
||
for i in range(min(1000, len(arr))): # Sample to avoid timeout
|
||
result[i] = max(0, arr[i])
|
||
return result
|
||
|
||
_, scalar_time = time_kernel(scalar_relu_simulation, x[:1000])
|
||
|
||
# Estimate full scalar time
|
||
estimated_scalar_time = scalar_time * (size / 1000)
|
||
speedup = estimated_scalar_time / vec_time
|
||
|
||
print(f"Vectorization performance analysis:")
|
||
print(f" Array size: {size:,} elements")
|
||
print(f" Vectorized ReLU: {vec_time:.1f}μs")
|
||
print(f" Estimated scalar: {estimated_scalar_time:.1f}μs")
|
||
print(f" Speedup: {speedup:.1f}x")
|
||
|
||
# Test vectorized operations
|
||
_, ops_time = time_kernel(vectorized_operations, x, y)
|
||
operations_per_second = 6 * size / (ops_time / 1_000_000) # 6 operations
|
||
|
||
print(f" Vectorized operations: {ops_time:.1f}μs")
|
||
print(f" Throughput: {operations_per_second/1e6:.1f}M ops/sec")
|
||
|
||
# TIP WHY THIS MATTERS: Vectorization provides 4-16x speedups on modern CPUs.
|
||
# This is essential for real-time inference and efficient training.
|
||
# ML frameworks like PyTorch rely heavily on vectorized operations.
|
||
|
||
return {
|
||
'vectorized_speedup': speedup,
|
||
'throughput_mops': operations_per_second / 1e6
|
||
}
|
||
|
||
except Exception as e:
|
||
print(f"WARNING️ Vectorization analysis error: {e}")
|
||
return None
|
||
|
||
# Run the analysis
|
||
vectorization_analysis = analyze_vectorization_performance()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### TEST Unit Test: Vectorized Operations
|
||
This test validates vectorized implementations for correctness and performance
|
||
"""
|
||
|
||
# %%
|
||
def test_unit_vectorized_operations():
|
||
"""Test vectorized operations."""
|
||
print("TEST Unit Test: Vectorized Operations")
|
||
|
||
# Test 1: Vectorized ReLU correctness
|
||
x = np.array([-2, -1, 0, 1, 2], dtype=np.float32)
|
||
result = vectorized_relu(x)
|
||
expected = np.array([0, 0, 0, 1, 2], dtype=np.float32)
|
||
|
||
assert np.allclose(result, expected), "Vectorized ReLU should be correct"
|
||
print("PASS ReLU correctness: Produces expected outputs")
|
||
|
||
# Test 2: Vectorized operations correctness
|
||
x = np.array([1, 2, 3, 4], dtype=np.float32)
|
||
y = np.array([2, 3, 4, 5], dtype=np.float32)
|
||
|
||
results = vectorized_operations(x, y)
|
||
|
||
assert np.allclose(results['element_wise_add'], [3, 5, 7, 9]), "Addition should be correct"
|
||
assert np.allclose(results['element_wise_multiply'], [2, 6, 12, 20]), "Multiplication should be correct"
|
||
assert np.allclose(results['dot_product'], 40), "Dot product should be correct"
|
||
|
||
print("PASS Operations correctness: All operations produce expected results")
|
||
|
||
# Test 3: Performance with larger arrays
|
||
large_x = np.random.randn(10000).astype(np.float32)
|
||
large_y = np.random.randn(10000).astype(np.float32)
|
||
|
||
_, relu_time = time_kernel(vectorized_relu, large_x)
|
||
_, ops_time = time_kernel(vectorized_operations, large_x, large_y)
|
||
|
||
assert relu_time < 1000, f"ReLU should be fast, took {relu_time:.1f}μs"
|
||
assert ops_time < 5000, f"Operations should be fast, took {ops_time:.1f}μs"
|
||
|
||
print(f"PASS Performance: ReLU={relu_time:.1f}μs, Operations={ops_time:.1f}μs")
|
||
|
||
# Run the test
|
||
test_unit_vectorized_operations()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### Parallel Processing
|
||
"""
|
||
|
||
# %%
|
||
def parallel_relu(x: np.ndarray, num_workers: int = 4) -> np.ndarray:
|
||
"""
|
||
Parallel ReLU implementation using multiple CPU cores.
|
||
|
||
Demonstrates data parallelism by distributing computation
|
||
across multiple worker threads.
|
||
|
||
Args:
|
||
x: Input array
|
||
num_workers: Number of parallel workers
|
||
|
||
Returns:
|
||
np.ndarray: ReLU applied in parallel
|
||
|
||
TODO: Implement parallel ReLU using threading or multiprocessing.
|
||
|
||
APPROACH:
|
||
1. Split input array into chunks for each worker
|
||
2. Process chunks in parallel using ThreadPoolExecutor
|
||
3. Combine results maintaining original order
|
||
4. Handle edge cases (small arrays, uneven splits)
|
||
|
||
PARALLELIZATION STRATEGY:
|
||
- Thread-based for I/O bound or small computations
|
||
- Process-based for CPU-bound large computations
|
||
- Chunk size should balance overhead vs parallelism
|
||
|
||
EXAMPLE:
|
||
>>> x = np.random.randn(100000)
|
||
>>> y = parallel_relu(x, num_workers=8)
|
||
|
||
PERFORMANCE CONSIDERATIONS:
|
||
- Overhead of thread creation and coordination
|
||
- Memory bandwidth limitations
|
||
- Thread synchronization costs
|
||
- Optimal for large arrays where parallelism benefits exceed overhead
|
||
"""
|
||
### BEGIN SOLUTION
|
||
# For small arrays, parallel processing overhead isn't worth it
|
||
if x.size < 10000:
|
||
return vectorized_relu(x)
|
||
|
||
# Split array into chunks
|
||
chunk_size = max(1, x.size // num_workers)
|
||
chunks = []
|
||
flat_x = x.flatten()
|
||
|
||
for i in range(0, len(flat_x), chunk_size):
|
||
chunks.append(flat_x[i:i + chunk_size])
|
||
|
||
# Worker function
|
||
def relu_chunk(chunk):
|
||
return vectorized_relu(chunk)
|
||
|
||
# Process chunks in parallel
|
||
with ThreadPoolExecutor(max_workers=num_workers) as executor:
|
||
# Submit all tasks
|
||
futures = [executor.submit(relu_chunk, chunk) for chunk in chunks]
|
||
|
||
# Collect results in order
|
||
results = [future.result() for future in futures]
|
||
|
||
# Combine results and reshape
|
||
combined = np.concatenate(results)
|
||
return combined.reshape(x.shape)
|
||
### END SOLUTION
|
||
|
||
# %%
|
||
def parallel_batch_processing(batch_data: np.ndarray, operation: Callable = None, num_workers: int = 4) -> np.ndarray:
|
||
"""
|
||
Process batches of data in parallel across multiple workers.
|
||
|
||
Demonstrates how ML frameworks parallelize batch processing
|
||
for improved throughput.
|
||
|
||
Args:
|
||
batch_data: Input batch (batch_size, ...)
|
||
operation: Operation to apply (default: ReLU)
|
||
num_workers: Number of parallel workers
|
||
|
||
Returns:
|
||
np.ndarray: Processed batch data
|
||
|
||
TODO: Implement parallel batch processing.
|
||
|
||
APPROACH:
|
||
1. Split batch across workers (each worker gets some samples)
|
||
2. Apply operation to each worker's subset
|
||
3. Combine results maintaining batch order
|
||
4. Default to ReLU if no operation specified
|
||
|
||
PARALLELIZATION PATTERN:
|
||
- Each worker processes complete samples
|
||
- Good for independent operations on batch elements
|
||
- Scales well with batch size
|
||
|
||
EXAMPLE:
|
||
>>> batch = np.random.randn(128, 784) # 128 samples, 784 features
|
||
>>> result = parallel_batch_processing(batch, vectorized_relu, 4)
|
||
|
||
ML SYSTEMS CONNECTION:
|
||
- PyTorch DataLoader uses similar parallelization
|
||
- GPU tensor operations naturally parallel across batch dimension
|
||
- Critical for large batch training and inference
|
||
"""
|
||
### BEGIN SOLUTION
|
||
if operation is None:
|
||
operation = vectorized_relu
|
||
|
||
batch_size = batch_data.shape[0]
|
||
|
||
# For small batches, parallel processing overhead isn't worth it
|
||
if batch_size < num_workers:
|
||
return operation(batch_data)
|
||
|
||
# Split batch into chunks
|
||
chunk_size = max(1, batch_size // num_workers)
|
||
chunks = []
|
||
|
||
for i in range(0, batch_size, chunk_size):
|
||
end_idx = min(i + chunk_size, batch_size)
|
||
chunks.append(batch_data[i:end_idx])
|
||
|
||
# Process chunks in parallel
|
||
with ThreadPoolExecutor(max_workers=num_workers) as executor:
|
||
# Submit all tasks
|
||
futures = [executor.submit(operation, chunk) for chunk in chunks]
|
||
|
||
# Collect results in order
|
||
results = [future.result() for future in futures]
|
||
|
||
# Combine results
|
||
return np.concatenate(results, axis=0)
|
||
### END SOLUTION
|
||
|
||
# PASS IMPLEMENTATION CHECKPOINT: Parallel processing complete
|
||
|
||
# MAGNIFY SYSTEMS INSIGHT: Parallel Processing Scaling Analysis
|
||
def analyze_parallel_scaling():
|
||
"""Analyze how parallel processing scales with worker count."""
|
||
try:
|
||
# Test data
|
||
large_array = np.random.randn(50000).astype(np.float32)
|
||
batch_data = np.random.randn(64, 1000).astype(np.float32)
|
||
|
||
# Test different worker counts
|
||
worker_counts = [1, 2, 4, 8]
|
||
results = []
|
||
|
||
print("Parallel processing scaling analysis:")
|
||
print("Worker Count | ReLU Time | Batch Time | ReLU Speedup | Batch Speedup")
|
||
print("-" * 70)
|
||
|
||
baseline_relu_time = None
|
||
baseline_batch_time = None
|
||
|
||
for workers in worker_counts:
|
||
# Time parallel ReLU
|
||
_, relu_time = time_kernel(parallel_relu, large_array, workers)
|
||
|
||
# Time parallel batch processing
|
||
_, batch_time = time_kernel(parallel_batch_processing, batch_data, vectorized_relu, workers)
|
||
|
||
# Calculate speedups
|
||
if baseline_relu_time is None:
|
||
baseline_relu_time = relu_time
|
||
baseline_batch_time = batch_time
|
||
relu_speedup = 1.0
|
||
batch_speedup = 1.0
|
||
else:
|
||
relu_speedup = baseline_relu_time / relu_time
|
||
batch_speedup = baseline_batch_time / batch_time
|
||
|
||
results.append({
|
||
'workers': workers,
|
||
'relu_time': relu_time,
|
||
'batch_time': batch_time,
|
||
'relu_speedup': relu_speedup,
|
||
'batch_speedup': batch_speedup
|
||
})
|
||
|
||
print(f"{workers:11d} | {relu_time:8.1f}μs | {batch_time:9.1f}μs | "
|
||
f"{relu_speedup:11.2f}x | {batch_speedup:12.2f}x")
|
||
|
||
# Analyze scaling efficiency
|
||
max_speedup_relu = max(r['relu_speedup'] for r in results)
|
||
max_speedup_batch = max(r['batch_speedup'] for r in results)
|
||
|
||
print(f"\nScaling analysis:")
|
||
print(f" Max ReLU speedup: {max_speedup_relu:.2f}x")
|
||
print(f" Max batch speedup: {max_speedup_batch:.2f}x")
|
||
print(f" ReLU efficiency: {max_speedup_relu/8:.2f} (theoretical max: 1.0)")
|
||
print(f" Batch efficiency: {max_speedup_batch/8:.2f} (theoretical max: 1.0)")
|
||
|
||
# TIP WHY THIS MATTERS: Parallel processing has diminishing returns due to:
|
||
# 1. Thread overhead and synchronization costs
|
||
# 2. Memory bandwidth limitations
|
||
# 3. Amdahl's law - sequential portions limit speedup
|
||
# Understanding these limits helps choose optimal parallelism levels.
|
||
|
||
return results
|
||
|
||
except Exception as e:
|
||
print(f"WARNING️ Parallel scaling analysis error: {e}")
|
||
return None
|
||
|
||
# Run the analysis
|
||
parallel_scaling = analyze_parallel_scaling()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### TEST Unit Test: Parallel Processing
|
||
This test validates parallel implementations for correctness and performance scaling
|
||
"""
|
||
|
||
# %%
|
||
def test_unit_parallel_processing():
|
||
"""Test parallel processing implementations."""
|
||
print("TEST Unit Test: Parallel Processing")
|
||
|
||
# Test 1: Parallel ReLU correctness
|
||
x = np.array([-2, -1, 0, 1, 2], dtype=np.float32)
|
||
|
||
result_parallel = parallel_relu(x, num_workers=2)
|
||
result_sequential = vectorized_relu(x)
|
||
|
||
assert np.allclose(result_parallel, result_sequential), "Parallel ReLU should match sequential"
|
||
print("PASS ReLU correctness: Parallel matches sequential result")
|
||
|
||
# Test 2: Parallel batch processing correctness
|
||
batch = np.random.randn(16, 10).astype(np.float32)
|
||
|
||
result_parallel = parallel_batch_processing(batch, vectorized_relu, num_workers=4)
|
||
result_sequential = vectorized_relu(batch)
|
||
|
||
assert np.allclose(result_parallel, result_sequential), "Parallel batch should match sequential"
|
||
assert result_parallel.shape == batch.shape, "Output shape should match input"
|
||
print("PASS Batch correctness: Parallel matches sequential result")
|
||
|
||
# Test 3: Performance with larger data
|
||
large_x = np.random.randn(20000).astype(np.float32)
|
||
large_batch = np.random.randn(32, 1000).astype(np.float32)
|
||
|
||
_, sequential_time = time_kernel(vectorized_relu, large_x)
|
||
_, parallel_time = time_kernel(parallel_relu, large_x, 4)
|
||
|
||
print(f"PASS Performance: Sequential={sequential_time:.1f}μs, Parallel={parallel_time:.1f}μs")
|
||
|
||
# Test 4: Edge cases
|
||
small_x = np.array([1, 2, 3])
|
||
result_small = parallel_relu(small_x, num_workers=8)
|
||
expected_small = vectorized_relu(small_x)
|
||
|
||
assert np.allclose(result_small, expected_small), "Small arrays should work correctly"
|
||
print("PASS Edge cases: Small arrays handled correctly")
|
||
|
||
# Run the test
|
||
test_unit_parallel_processing()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### Quantization Kernels
|
||
"""
|
||
|
||
# %%
|
||
def quantized_matmul(A: np.ndarray, B: np.ndarray, bits: int = 8) -> np.ndarray:
|
||
"""
|
||
Quantized matrix multiplication for memory and compute efficiency.
|
||
|
||
Implements quantization to reduce memory usage and enable
|
||
efficient inference on edge devices.
|
||
|
||
Args:
|
||
A: Left matrix (float32)
|
||
B: Right matrix (float32)
|
||
bits: Quantization bits (default 8)
|
||
|
||
Returns:
|
||
np.ndarray: Dequantized result matrix
|
||
|
||
TODO: Implement quantized matrix multiplication.
|
||
|
||
APPROACH:
|
||
1. Calculate quantization scales based on data range
|
||
2. Quantize inputs to int8/int16 format
|
||
3. Perform integer matrix multiplication
|
||
4. Dequantize result back to float32
|
||
|
||
QUANTIZATION PROCESS:
|
||
```
|
||
scale = max(abs(data)) / (2^(bits-1) - 1)
|
||
quantized = round(data / scale).clip(-128, 127) # for 8-bit
|
||
result = quantized_A @ quantized_B
|
||
dequantized = result * scale_A * scale_B
|
||
```
|
||
|
||
EXAMPLE:
|
||
>>> A = np.random.randn(64, 32).astype(np.float32)
|
||
>>> B = np.random.randn(32, 48).astype(np.float32)
|
||
>>> C = quantized_matmul(A, B, bits=8)
|
||
|
||
PERFORMANCE BENEFITS:
|
||
- 4x memory reduction (float32 -> int8)
|
||
- Faster integer arithmetic on some hardware
|
||
- Enables deployment on memory-constrained devices
|
||
"""
|
||
### BEGIN SOLUTION
|
||
# Calculate quantization scales
|
||
max_val = 2**(bits-1) - 1 # e.g., 127 for 8-bit
|
||
|
||
scale_A = np.max(np.abs(A)) / max_val if np.max(np.abs(A)) > 0 else 1.0
|
||
scale_B = np.max(np.abs(B)) / max_val if np.max(np.abs(B)) > 0 else 1.0
|
||
|
||
# Quantize inputs
|
||
if bits == 8:
|
||
dtype = np.int8
|
||
min_val, max_val = -128, 127
|
||
elif bits == 16:
|
||
dtype = np.int16
|
||
min_val, max_val = -32768, 32767
|
||
else:
|
||
raise ValueError(f"Unsupported quantization: {bits} bits")
|
||
|
||
A_quantized = np.round(A / scale_A).clip(min_val, max_val).astype(dtype)
|
||
B_quantized = np.round(B / scale_B).clip(min_val, max_val).astype(dtype)
|
||
|
||
# Perform integer matrix multiplication
|
||
# Use int32 accumulation to prevent overflow
|
||
C_quantized = np.dot(A_quantized.astype(np.int32), B_quantized.astype(np.int32))
|
||
|
||
# Dequantize result
|
||
C_dequantized = C_quantized.astype(np.float32) * scale_A * scale_B
|
||
|
||
return C_dequantized
|
||
### END SOLUTION
|
||
|
||
# %%
|
||
def quantized_relu(x: np.ndarray, bits: int = 8) -> np.ndarray:
|
||
"""
|
||
Quantized ReLU activation for efficient inference.
|
||
|
||
Applies ReLU in quantized domain to maintain precision
|
||
while reducing computational overhead.
|
||
|
||
Args:
|
||
x: Input array (float32)
|
||
bits: Quantization bits (default 8)
|
||
|
||
Returns:
|
||
np.ndarray: Quantized ReLU result (dequantized to float32)
|
||
|
||
TODO: Implement quantized ReLU activation.
|
||
|
||
APPROACH:
|
||
1. Calculate quantization scale from input range
|
||
2. Quantize input to integer representation
|
||
3. Apply ReLU in integer domain (max(0, x))
|
||
4. Dequantize result back to float32
|
||
|
||
QUANTIZED RELU PROCESS:
|
||
```
|
||
scale = max(abs(x)) / (2^(bits-1) - 1)
|
||
x_quantized = round(x / scale).clip(-128, 127)
|
||
relu_quantized = max(0, x_quantized)
|
||
result = relu_quantized * scale
|
||
```
|
||
|
||
EXAMPLE:
|
||
>>> x = np.array([-1.0, 0.0, 1.0, 2.0])
|
||
>>> y = quantized_relu(x, bits=8)
|
||
>>> print(y) # [0.0, 0.0, ~1.0, ~2.0]
|
||
|
||
OPTIMIZATION BENEFITS:
|
||
- ReLU in integer domain is just max(0, x)
|
||
- No floating-point operations during activation
|
||
- Maintains quantization format for subsequent operations
|
||
"""
|
||
### BEGIN SOLUTION
|
||
# Calculate quantization scale
|
||
max_val = 2**(bits-1) - 1 # e.g., 127 for 8-bit
|
||
scale = np.max(np.abs(x)) / max_val if np.max(np.abs(x)) > 0 else 1.0
|
||
|
||
# Quantize input
|
||
if bits == 8:
|
||
dtype = np.int8
|
||
min_val, max_val = -128, 127
|
||
elif bits == 16:
|
||
dtype = np.int16
|
||
min_val, max_val = -32768, 32767
|
||
else:
|
||
raise ValueError(f"Unsupported quantization: {bits} bits")
|
||
|
||
x_quantized = np.round(x / scale).clip(min_val, max_val).astype(dtype)
|
||
|
||
# Apply ReLU in quantized domain
|
||
relu_quantized = np.maximum(0, x_quantized)
|
||
|
||
# Dequantize result
|
||
result = relu_quantized.astype(np.float32) * scale
|
||
|
||
return result
|
||
### END SOLUTION
|
||
|
||
# PASS IMPLEMENTATION CHECKPOINT: Quantization kernels complete
|
||
|
||
# MAGNIFY SYSTEMS INSIGHT: Quantization Analysis
|
||
def analyze_quantization_impact():
|
||
"""Analyze the impact of quantization on accuracy and performance."""
|
||
try:
|
||
# Test matrices
|
||
A = np.random.randn(128, 64).astype(np.float32) * 10 # Scale for visible quantization
|
||
B = np.random.randn(64, 96).astype(np.float32) * 10
|
||
x = np.random.randn(1000).astype(np.float32) * 5
|
||
|
||
# Compare quantized vs full precision
|
||
print("Quantization impact analysis:")
|
||
print("Operation | Bits | Accuracy (MSE) | Memory | Time")
|
||
print("-" * 55)
|
||
|
||
# Matrix multiplication analysis
|
||
baseline_matmul = matmul_baseline(A, B)
|
||
baseline_size = A.nbytes + B.nbytes + baseline_matmul.nbytes
|
||
_, baseline_time = time_kernel(matmul_baseline, A, B)
|
||
|
||
for bits in [8, 16]:
|
||
quant_result = quantized_matmul(A, B, bits=bits)
|
||
mse = np.mean((baseline_matmul - quant_result) ** 2)
|
||
|
||
# Estimate quantized memory usage
|
||
if bits == 8:
|
||
quant_size = A.size + B.size + baseline_matmul.size # int8 = 1 byte
|
||
else:
|
||
quant_size = (A.size + B.size + baseline_matmul.size) * 2 # int16 = 2 bytes
|
||
|
||
memory_ratio = quant_size / baseline_size
|
||
|
||
_, quant_time = time_kernel(quantized_matmul, A, B, bits)
|
||
time_ratio = quant_time / baseline_time
|
||
|
||
print(f"MatMul | {bits:4d} | {mse:13.6f} | {memory_ratio:5.2f}x | {time_ratio:5.2f}x")
|
||
|
||
# ReLU analysis
|
||
baseline_relu = vectorized_relu(x)
|
||
_, baseline_relu_time = time_kernel(vectorized_relu, x)
|
||
|
||
for bits in [8, 16]:
|
||
quant_relu = quantized_relu(x, bits=bits)
|
||
mse_relu = np.mean((baseline_relu - quant_relu) ** 2)
|
||
|
||
_, quant_relu_time = time_kernel(quantized_relu, x, bits)
|
||
time_ratio_relu = quant_relu_time / baseline_relu_time
|
||
|
||
print(f"ReLU | {bits:4d} | {mse_relu:13.6f} | {0.25:5.2f}x | {time_ratio_relu:5.2f}x")
|
||
|
||
print(f"\nBaseline performance:")
|
||
print(f" MatMul: {baseline_time:.1f}μs, {baseline_size/1024:.1f}KB")
|
||
print(f" ReLU: {baseline_relu_time:.1f}μs, {x.nbytes/1024:.1f}KB")
|
||
|
||
# TIP WHY THIS MATTERS: Quantization trades accuracy for memory and speed.
|
||
# 8-bit quantization: 4x memory reduction, variable performance impact
|
||
# Critical for edge deployment where memory is constrained
|
||
# Modern ML accelerators (TPUs, mobile chips) heavily use quantization
|
||
|
||
return {
|
||
'matmul_accuracy_8bit': np.mean((baseline_matmul - quantized_matmul(A, B, 8)) ** 2),
|
||
'memory_reduction': baseline_size / (A.size + B.size), # Approximate
|
||
'deployment_ready': True
|
||
}
|
||
|
||
except Exception as e:
|
||
print(f"WARNING️ Quantization analysis error: {e}")
|
||
return None
|
||
|
||
# Run the analysis
|
||
quantization_analysis = analyze_quantization_impact()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### TEST Unit Test: Quantization Kernels
|
||
This test validates quantization implementations for correctness and efficiency trade-offs
|
||
"""
|
||
|
||
# %%
|
||
def test_unit_quantization_kernels():
|
||
"""Test quantization kernel implementations."""
|
||
print("TEST Unit Test: Quantization Kernels")
|
||
|
||
# Test 1: Quantized matrix multiplication correctness
|
||
A = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
|
||
B = np.array([[0.5, 1.5], [2.5, 3.5]], dtype=np.float32)
|
||
|
||
result_quant = quantized_matmul(A, B, bits=8)
|
||
result_baseline = matmul_baseline(A, B)
|
||
|
||
# Should be approximately correct (quantization introduces error)
|
||
relative_error = np.mean(np.abs(result_quant - result_baseline) / np.abs(result_baseline + 1e-8))
|
||
assert relative_error < 0.1, f"Quantization error too high: {relative_error:.3f}"
|
||
print(f"PASS MatMul quantization: relative error {relative_error:.3f}")
|
||
|
||
# Test 2: Quantized ReLU correctness
|
||
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0], dtype=np.float32)
|
||
|
||
result_quant_relu = quantized_relu(x, bits=8)
|
||
result_baseline_relu = vectorized_relu(x)
|
||
|
||
# Check that negative values become zero and positive values remain positive
|
||
assert np.all(result_quant_relu >= 0), "Quantized ReLU should be non-negative"
|
||
assert np.allclose(result_quant_relu[x <= 0], 0, atol=0.1), "Negative inputs should become zero"
|
||
print("PASS ReLU quantization: maintains ReLU properties")
|
||
|
||
# Test 3: Different bit depths
|
||
for bits in [8, 16]:
|
||
result_8bit = quantized_matmul(A, B, bits=bits)
|
||
assert result_8bit.shape == result_baseline.shape, f"{bits}-bit result shape should match"
|
||
|
||
result_relu_bits = quantized_relu(x, bits=bits)
|
||
assert result_relu_bits.shape == x.shape, f"{bits}-bit ReLU shape should match"
|
||
|
||
print("PASS Bit depths: 8-bit and 16-bit quantization work correctly")
|
||
|
||
# Test 4: Performance characteristics
|
||
large_A = np.random.randn(64, 64).astype(np.float32)
|
||
large_B = np.random.randn(64, 64).astype(np.float32)
|
||
|
||
_, baseline_time = time_kernel(matmul_baseline, large_A, large_B)
|
||
_, quant_time = time_kernel(quantized_matmul, large_A, large_B, 8)
|
||
|
||
print(f"PASS Performance: Baseline={baseline_time:.1f}μs, Quantized={quant_time:.1f}μs")
|
||
|
||
# Run the test
|
||
test_unit_quantization_kernels()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## Advanced Systems Analysis Framework
|
||
|
||
Now you'll implement the Progressive Analysis Framework at the **Advanced Level**.
|
||
|
||
At this level, you design comprehensive analyses from scratch - no scaffolding provided.
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### TARGET ADVANCED ANALYSIS CHALLENGE: Comprehensive Kernel Optimization Analysis
|
||
|
||
**CHALLENGE**: Design and implement a complete kernel optimization analysis system that:
|
||
|
||
1. **Performance Profiling**: Measures execution time, throughput, and resource utilization
|
||
2. **Memory Pattern Analysis**: Analyzes cache behavior, memory bandwidth, and access patterns
|
||
3. **Optimization Opportunities**: Identifies bottlenecks and recommends improvements
|
||
4. **Hardware Adaptation**: Adapts recommendations based on target hardware architecture
|
||
5. **Production Readiness**: Assesses readiness for deployment in production ML systems
|
||
|
||
**YOUR MISSION**: Implement `KernelOptimizationAnalyzer` class with methods for comprehensive analysis.
|
||
|
||
**TODO: Design comprehensive kernel optimization analysis from scratch.**
|
||
|
||
**DESIGN REQUIREMENTS**:
|
||
- Analyze cache efficiency and memory bandwidth utilization
|
||
- Identify vectorization opportunities and parallel processing potential
|
||
- Measure quantization impact on accuracy vs performance trade-offs
|
||
- Generate actionable optimization recommendations for production deployment
|
||
- Support analysis across different hardware architectures (CPU, GPU, edge devices)
|
||
|
||
**ANALYSIS FRAMEWORK**:
|
||
```python
|
||
class KernelOptimizationAnalyzer:
|
||
def analyze_cache_efficiency(self, kernel_func, data_sizes):
|
||
# TODO: Measure cache hit rates and memory access patterns
|
||
pass
|
||
|
||
def analyze_vectorization_potential(self, operation_sequence):
|
||
# TODO: Identify SIMD optimization opportunities
|
||
pass
|
||
|
||
def analyze_parallel_scaling(self, workload, worker_counts):
|
||
# TODO: Measure parallel processing efficiency
|
||
pass
|
||
|
||
def analyze_quantization_trade_offs(self, precision_levels):
|
||
# TODO: Accuracy vs performance analysis
|
||
pass
|
||
|
||
def generate_optimization_roadmap(self, target_hardware):
|
||
# TODO: Prioritized recommendations for production deployment
|
||
pass
|
||
```
|
||
|
||
**EXPECTED INSIGHTS**:
|
||
- Cache miss rates and optimal block sizes
|
||
- Vectorization speedup potential and SIMD utilization
|
||
- Parallel processing efficiency and scaling bottlenecks
|
||
- Quantization accuracy degradation vs memory/speed benefits
|
||
- Hardware-specific optimization strategies
|
||
|
||
**PRODUCTION FOCUS**: Your analysis should guide real optimization decisions for production ML systems.
|
||
"""
|
||
|
||
# %%
|
||
class KernelOptimizationAnalyzer:
|
||
"""
|
||
Advanced kernel optimization analysis system for production ML systems.
|
||
|
||
TODO: Design comprehensive analysis from scratch.
|
||
|
||
This class should provide complete optimization analysis including:
|
||
- Cache efficiency and memory bandwidth analysis
|
||
- Vectorization potential and SIMD utilization assessment
|
||
- Parallel processing scaling analysis and bottleneck identification
|
||
- Quantization impact analysis for accuracy vs performance trade-offs
|
||
- Hardware-specific optimization recommendations for production deployment
|
||
|
||
Your implementation should guide real optimization decisions for production ML systems.
|
||
"""
|
||
|
||
def __init__(self, hardware_config: Optional[Dict[str, Any]] = None):
|
||
"""
|
||
Initialize the analyzer with hardware configuration.
|
||
|
||
TODO: Design initialization strategy that detects or accepts hardware specs.
|
||
|
||
Should handle:
|
||
- CPU specifications (cores, cache sizes, SIMD capabilities)
|
||
- Memory hierarchy (L1/L2/L3 cache, RAM bandwidth)
|
||
- GPU specifications (if available)
|
||
- Target deployment environment (cloud, edge, mobile)
|
||
"""
|
||
### BEGIN SOLUTION
|
||
self.hardware_config = hardware_config or self._detect_hardware()
|
||
self.analysis_results = {}
|
||
self.optimization_recommendations = []
|
||
self.baseline_measurements = {}
|
||
|
||
def _detect_hardware(self) -> Dict[str, Any]:
|
||
"""Detect current hardware configuration."""
|
||
return {
|
||
'cpu_cores': psutil.cpu_count(),
|
||
'memory_gb': psutil.virtual_memory().total // (1024**3),
|
||
'cache_sizes': {
|
||
'l1_data': 32768, # 32KB typical L1 data cache
|
||
'l1_instruction': 32768, # 32KB typical L1 instruction cache
|
||
'l2': 262144, # 256KB typical L2 cache
|
||
'l3': 8388608 # 8MB typical L3 cache
|
||
},
|
||
'cpu_frequency': 2.4, # GHz - would detect actual frequency
|
||
'memory_bandwidth': 25.6, # GB/s - would measure actual bandwidth
|
||
'simd_width': 8, # AVX2 - 8 float32 per instruction
|
||
'gpu_available': False,
|
||
'deployment_target': 'cloud' # vs 'edge' or 'mobile'
|
||
}
|
||
### END SOLUTION
|
||
|
||
def analyze_cache_efficiency(self, kernel_func: Callable, data_sizes: List[int],
|
||
access_patterns: List[str] = None) -> Dict[str, Any]:
|
||
"""
|
||
Analyze cache efficiency and memory access patterns.
|
||
|
||
TODO: Design comprehensive cache analysis that measures:
|
||
- Cache hit/miss rates for different data sizes
|
||
- Memory bandwidth utilization
|
||
- Optimal block sizes for cache-friendly algorithms
|
||
- Impact of different access patterns (sequential, strided, random)
|
||
|
||
Should return actionable insights about memory optimization opportunities.
|
||
"""
|
||
### BEGIN SOLUTION
|
||
if access_patterns is None:
|
||
access_patterns = ['sequential', 'strided', 'random']
|
||
|
||
cache_analysis = {
|
||
'data_sizes_tested': data_sizes,
|
||
'access_patterns': access_patterns,
|
||
'cache_efficiency': {},
|
||
'bandwidth_utilization': {},
|
||
'optimal_block_sizes': {},
|
||
'recommendations': []
|
||
}
|
||
|
||
l1_size = self.hardware_config['cache_sizes']['l1_data']
|
||
l2_size = self.hardware_config['cache_sizes']['l2']
|
||
l3_size = self.hardware_config['cache_sizes']['l3']
|
||
|
||
for size in data_sizes:
|
||
# Generate test data
|
||
test_data = np.random.randn(size, size).astype(np.float32)
|
||
data_size_bytes = test_data.nbytes
|
||
|
||
# Time the kernel operation
|
||
_, execution_time = time_kernel(kernel_func, test_data, test_data)
|
||
|
||
# Estimate cache behavior
|
||
if data_size_bytes <= l1_size:
|
||
cache_level = 'L1'
|
||
efficiency = 0.95
|
||
elif data_size_bytes <= l2_size:
|
||
cache_level = 'L2'
|
||
efficiency = 0.85
|
||
elif data_size_bytes <= l3_size:
|
||
cache_level = 'L3'
|
||
efficiency = 0.70
|
||
else:
|
||
cache_level = 'RAM'
|
||
efficiency = 0.30
|
||
|
||
# Calculate bandwidth utilization
|
||
bytes_accessed = data_size_bytes * 2 # Read A, B
|
||
bandwidth_used = bytes_accessed / (execution_time / 1_000_000) / (1024**3) # GB/s
|
||
peak_bandwidth = self.hardware_config['memory_bandwidth']
|
||
bandwidth_util = bandwidth_used / peak_bandwidth
|
||
|
||
cache_analysis['cache_efficiency'][size] = {
|
||
'cache_level': cache_level,
|
||
'efficiency_estimate': efficiency,
|
||
'data_size_mb': data_size_bytes / (1024**2),
|
||
'execution_time_us': execution_time
|
||
}
|
||
|
||
cache_analysis['bandwidth_utilization'][size] = {
|
||
'bandwidth_gb_s': bandwidth_used,
|
||
'utilization_percent': bandwidth_util * 100,
|
||
'bottleneck': 'memory' if bandwidth_util > 0.8 else 'compute'
|
||
}
|
||
|
||
# Determine optimal block sizes
|
||
for cache_level, cache_size in [('L1', l1_size), ('L2', l2_size)]:
|
||
# Optimal block size fits in cache with room for temporaries
|
||
optimal_elements = int((cache_size * 0.7) / 4) # 70% of cache, float32 = 4 bytes
|
||
optimal_block_size = int(np.sqrt(optimal_elements))
|
||
cache_analysis['optimal_block_sizes'][cache_level] = optimal_block_size
|
||
|
||
# Generate recommendations
|
||
if any(analysis['bottleneck'] == 'memory' for analysis in cache_analysis['bandwidth_utilization'].values()):
|
||
cache_analysis['recommendations'].append("Memory bandwidth limited - consider cache blocking")
|
||
|
||
if max(data_sizes)**2 * 4 > l3_size:
|
||
cache_analysis['recommendations'].append(f"Large matrices exceed L3 cache - use block size <= {cache_analysis['optimal_block_sizes']['L2']}")
|
||
|
||
self.analysis_results['cache_efficiency'] = cache_analysis
|
||
return cache_analysis
|
||
### END SOLUTION
|
||
|
||
def analyze_vectorization_potential(self, operation_sequence: List[str],
|
||
data_shapes: List[Tuple[int, ...]] = None) -> Dict[str, Any]:
|
||
"""
|
||
Analyze vectorization potential and SIMD optimization opportunities.
|
||
|
||
TODO: Design analysis that identifies:
|
||
- Operations that can benefit from SIMD vectorization
|
||
- Data layout requirements for optimal vectorization
|
||
- Expected speedup from vectorization
|
||
- Vectorization-friendly algorithm modifications
|
||
|
||
Should provide specific recommendations for SIMD optimization.
|
||
"""
|
||
### BEGIN SOLUTION
|
||
if data_shapes is None:
|
||
data_shapes = [(1000,), (1000, 1000), (100, 100, 100)]
|
||
|
||
vectorization_analysis = {
|
||
'operations_analyzed': operation_sequence,
|
||
'simd_opportunities': {},
|
||
'data_layout_requirements': {},
|
||
'speedup_estimates': {},
|
||
'algorithm_modifications': [],
|
||
'recommendations': []
|
||
}
|
||
|
||
simd_width = self.hardware_config['simd_width']
|
||
|
||
# Analyze each operation for vectorization potential
|
||
vectorizable_ops = {
|
||
'add': {'potential': 'high', 'speedup': simd_width * 0.9},
|
||
'multiply': {'potential': 'high', 'speedup': simd_width * 0.9},
|
||
'relu': {'potential': 'high', 'speedup': simd_width * 0.8},
|
||
'matmul': {'potential': 'medium', 'speedup': 3.0}, # More complex, less perfect vectorization
|
||
'conv2d': {'potential': 'medium', 'speedup': 4.0},
|
||
'softmax': {'potential': 'low', 'speedup': 1.5}, # Has sequential dependencies
|
||
'batchnorm': {'potential': 'high', 'speedup': simd_width * 0.7}
|
||
}
|
||
|
||
for op in operation_sequence:
|
||
if op in vectorizable_ops:
|
||
vectorization_analysis['simd_opportunities'][op] = vectorizable_ops[op]
|
||
else:
|
||
vectorization_analysis['simd_opportunities'][op] = {
|
||
'potential': 'unknown',
|
||
'speedup': 1.0
|
||
}
|
||
|
||
# Analyze data layout requirements
|
||
for i, shape in enumerate(data_shapes):
|
||
layout_analysis = {
|
||
'shape': shape,
|
||
'memory_layout': 'contiguous_required',
|
||
'alignment': 'simd_aligned',
|
||
'stride_pattern': 'unit_stride_optimal'
|
||
}
|
||
|
||
# For multi-dimensional arrays, analyze optimal access patterns
|
||
if len(shape) > 1:
|
||
layout_analysis['access_pattern'] = 'row_major_optimal'
|
||
layout_analysis['vectorization_dimension'] = 'last_dimension'
|
||
|
||
vectorization_analysis['data_layout_requirements'][f'shape_{i}'] = layout_analysis
|
||
|
||
# Calculate overall speedup potential
|
||
total_speedup = 1.0
|
||
for op in operation_sequence:
|
||
if op in vectorization_analysis['simd_opportunities']:
|
||
speedup = vectorization_analysis['simd_opportunities'][op]['speedup']
|
||
total_speedup *= speedup ** (1.0 / len(operation_sequence)) # Geometric mean
|
||
|
||
vectorization_analysis['speedup_estimates']['overall'] = total_speedup
|
||
vectorization_analysis['speedup_estimates']['best_case'] = max(
|
||
vectorization_analysis['simd_opportunities'][op]['speedup']
|
||
for op in operation_sequence
|
||
if op in vectorization_analysis['simd_opportunities']
|
||
)
|
||
|
||
# Algorithm modification suggestions
|
||
if 'matmul' in operation_sequence:
|
||
vectorization_analysis['algorithm_modifications'].append(
|
||
"Use BLAS libraries (MKL, OpenBLAS) for vectorized matrix operations"
|
||
)
|
||
|
||
if any(op in ['add', 'multiply', 'relu'] for op in operation_sequence):
|
||
vectorization_analysis['algorithm_modifications'].append(
|
||
"Ensure contiguous memory layout and use NumPy vectorized operations"
|
||
)
|
||
|
||
# Generate recommendations
|
||
high_potential_ops = [op for op in operation_sequence
|
||
if vectorization_analysis['simd_opportunities'].get(op, {}).get('potential') == 'high']
|
||
|
||
if high_potential_ops:
|
||
vectorization_analysis['recommendations'].append(
|
||
f"High vectorization potential: {', '.join(high_potential_ops)}"
|
||
)
|
||
|
||
if total_speedup > 2.0:
|
||
vectorization_analysis['recommendations'].append(
|
||
f"Significant speedup possible: {total_speedup:.1f}x with full vectorization"
|
||
)
|
||
|
||
self.analysis_results['vectorization_potential'] = vectorization_analysis
|
||
return vectorization_analysis
|
||
### END SOLUTION
|
||
|
||
def analyze_parallel_scaling(self, workload_func: Callable, worker_counts: List[int],
|
||
data_sizes: List[int] = None) -> Dict[str, Any]:
|
||
"""
|
||
Analyze parallel processing efficiency and scaling bottlenecks.
|
||
|
||
TODO: Design analysis that measures:
|
||
- Parallel processing speedup across different worker counts
|
||
- Scaling efficiency and diminishing returns
|
||
- Thread overhead and synchronization costs
|
||
- Optimal parallelism level for different workload sizes
|
||
|
||
Should identify when parallel processing is beneficial vs overhead costs.
|
||
"""
|
||
### BEGIN SOLUTION
|
||
if data_sizes is None:
|
||
data_sizes = [1000, 10000, 100000]
|
||
|
||
parallel_analysis = {
|
||
'worker_counts_tested': worker_counts,
|
||
'data_sizes_tested': data_sizes,
|
||
'scaling_results': {},
|
||
'efficiency_analysis': {},
|
||
'overhead_analysis': {},
|
||
'optimal_parallelism': {},
|
||
'recommendations': []
|
||
}
|
||
|
||
max_cores = self.hardware_config['cpu_cores']
|
||
|
||
for data_size in data_sizes:
|
||
test_data = np.random.randn(data_size).astype(np.float32)
|
||
size_results = {}
|
||
|
||
# Measure performance for different worker counts
|
||
baseline_time = None
|
||
for workers in worker_counts:
|
||
if workers > max_cores:
|
||
continue # Skip if more workers than cores
|
||
|
||
try:
|
||
_, execution_time = time_kernel(workload_func, test_data, workers)
|
||
|
||
if baseline_time is None:
|
||
baseline_time = execution_time
|
||
speedup = 1.0
|
||
efficiency = 1.0
|
||
else:
|
||
speedup = baseline_time / execution_time
|
||
efficiency = speedup / workers
|
||
|
||
size_results[workers] = {
|
||
'execution_time_us': execution_time,
|
||
'speedup': speedup,
|
||
'efficiency': efficiency
|
||
}
|
||
|
||
except Exception as e:
|
||
size_results[workers] = {
|
||
'execution_time_us': None,
|
||
'speedup': 0,
|
||
'efficiency': 0,
|
||
'error': str(e)
|
||
}
|
||
|
||
parallel_analysis['scaling_results'][data_size] = size_results
|
||
|
||
# Analyze scaling efficiency
|
||
if size_results:
|
||
max_speedup = max(result['speedup'] for result in size_results.values() if result['speedup'] > 0)
|
||
best_workers = max(size_results.keys(), key=lambda w: size_results[w]['speedup'])
|
||
|
||
parallel_analysis['efficiency_analysis'][data_size] = {
|
||
'max_speedup': max_speedup,
|
||
'best_worker_count': best_workers,
|
||
'scaling_efficiency': max_speedup / best_workers,
|
||
'diminishing_returns_threshold': best_workers
|
||
}
|
||
|
||
# Estimate overhead
|
||
if len(size_results) >= 2:
|
||
single_thread_time = size_results.get(1, {}).get('execution_time_us', 0)
|
||
two_thread_time = size_results.get(2, {}).get('execution_time_us', single_thread_time)
|
||
|
||
if single_thread_time > 0 and two_thread_time > 0:
|
||
theoretical_two_thread = single_thread_time / 2
|
||
overhead_factor = two_thread_time / theoretical_two_thread
|
||
|
||
parallel_analysis['overhead_analysis'][data_size] = {
|
||
'overhead_factor': overhead_factor,
|
||
'overhead_percent': (overhead_factor - 1) * 100,
|
||
'worthwhile_threshold': single_thread_time * 10 # 10x overhead minimum
|
||
}
|
||
|
||
# Determine optimal parallelism
|
||
for data_size in data_sizes:
|
||
if data_size in parallel_analysis['scaling_results']:
|
||
results = parallel_analysis['scaling_results'][data_size]
|
||
optimal_workers = max(results.keys(),
|
||
key=lambda w: results[w]['speedup'] if results[w]['speedup'] > 0 else 0)
|
||
|
||
parallel_analysis['optimal_parallelism'][data_size] = {
|
||
'optimal_workers': optimal_workers,
|
||
'speedup_at_optimal': results[optimal_workers]['speedup'],
|
||
'efficiency_at_optimal': results[optimal_workers]['efficiency']
|
||
}
|
||
|
||
# Generate recommendations
|
||
avg_efficiency = np.mean([
|
||
analysis['scaling_efficiency']
|
||
for analysis in parallel_analysis['efficiency_analysis'].values()
|
||
])
|
||
|
||
if avg_efficiency > 0.7:
|
||
parallel_analysis['recommendations'].append(
|
||
"Excellent parallel scaling - parallel processing highly beneficial"
|
||
)
|
||
elif avg_efficiency > 0.4:
|
||
parallel_analysis['recommendations'].append(
|
||
"Good parallel scaling - parallel processing beneficial for large workloads"
|
||
)
|
||
else:
|
||
parallel_analysis['recommendations'].append(
|
||
"Poor parallel scaling - overhead exceeds benefits, avoid parallel processing"
|
||
)
|
||
|
||
# Workload size recommendations
|
||
small_workloads = [size for size in data_sizes if size < 10000]
|
||
if small_workloads and any(
|
||
parallel_analysis['overhead_analysis'].get(size, {}).get('overhead_percent', 0) > 50
|
||
for size in small_workloads
|
||
):
|
||
parallel_analysis['recommendations'].append(
|
||
"Small workloads have high overhead - use sequential processing"
|
||
)
|
||
|
||
self.analysis_results['parallel_scaling'] = parallel_analysis
|
||
return parallel_analysis
|
||
### END SOLUTION
|
||
|
||
def analyze_quantization_trade_offs(self, operations: List[Callable],
|
||
precision_levels: List[int] = None,
|
||
accuracy_threshold: float = 0.01) -> Dict[str, Any]:
|
||
"""
|
||
Analyze quantization impact on accuracy vs performance trade-offs.
|
||
|
||
TODO: Design analysis that measures:
|
||
- Accuracy degradation at different quantization levels
|
||
- Performance improvement from reduced precision
|
||
- Memory usage reduction
|
||
- Optimal quantization strategy for production deployment
|
||
|
||
Should provide guidance on quantization deployment decisions.
|
||
"""
|
||
### BEGIN SOLUTION
|
||
if precision_levels is None:
|
||
precision_levels = [32, 16, 8] # float32, float16/int16, int8
|
||
|
||
quantization_analysis = {
|
||
'precision_levels_tested': precision_levels,
|
||
'operations_analyzed': [op.__name__ for op in operations],
|
||
'accuracy_analysis': {},
|
||
'performance_analysis': {},
|
||
'memory_analysis': {},
|
||
'deployment_recommendations': {},
|
||
'recommendations': []
|
||
}
|
||
|
||
# Test data
|
||
test_sizes = [64, 128, 256]
|
||
|
||
for op_func in operations:
|
||
op_name = op_func.__name__
|
||
operation_results = {}
|
||
|
||
for size in test_sizes:
|
||
if 'matmul' in op_name.lower():
|
||
test_data_a = np.random.randn(size, size).astype(np.float32)
|
||
test_data_b = np.random.randn(size, size).astype(np.float32)
|
||
baseline_result = op_func(test_data_a, test_data_b)
|
||
baseline_time = time_kernel(op_func, test_data_a, test_data_b)[1]
|
||
baseline_memory = (test_data_a.nbytes + test_data_b.nbytes + baseline_result.nbytes)
|
||
else:
|
||
test_data = np.random.randn(size, size).astype(np.float32)
|
||
baseline_result = op_func(test_data)
|
||
baseline_time = time_kernel(op_func, test_data)[1]
|
||
baseline_memory = test_data.nbytes + baseline_result.nbytes
|
||
|
||
size_results = {
|
||
'baseline': {
|
||
'precision': 32,
|
||
'accuracy_mse': 0.0,
|
||
'execution_time_us': baseline_time,
|
||
'memory_bytes': baseline_memory,
|
||
'relative_performance': 1.0,
|
||
'relative_memory': 1.0
|
||
}
|
||
}
|
||
|
||
# Test different precision levels
|
||
for bits in precision_levels:
|
||
if bits == 32:
|
||
continue # Already have baseline
|
||
|
||
try:
|
||
if 'matmul' in op_name.lower() and hasattr(op_func, '__name__'):
|
||
# Use quantized version if available
|
||
if bits in [8, 16]:
|
||
quant_result = quantized_matmul(test_data_a, test_data_b, bits=bits)
|
||
quant_time = time_kernel(quantized_matmul, test_data_a, test_data_b, bits)[1]
|
||
elif 'relu' in op_name.lower():
|
||
if bits in [8, 16]:
|
||
quant_result = quantized_relu(test_data, bits=bits)
|
||
quant_time = time_kernel(quantized_relu, test_data, bits)[1]
|
||
else:
|
||
# Simulate quantization effect
|
||
max_val = 2**(bits-1) - 1
|
||
scale = np.max(np.abs(baseline_result)) / max_val
|
||
quantized = np.round(baseline_result / scale) * scale
|
||
quant_result = quantized
|
||
quant_time = baseline_time * 0.8 # Assume some speedup
|
||
|
||
# Calculate accuracy metrics
|
||
mse = np.mean((baseline_result - quant_result) ** 2)
|
||
relative_error = mse / (np.mean(baseline_result ** 2) + 1e-8)
|
||
|
||
# Estimate memory usage
|
||
memory_factor = bits / 32.0
|
||
quant_memory = int(baseline_memory * memory_factor)
|
||
|
||
size_results[bits] = {
|
||
'precision': bits,
|
||
'accuracy_mse': mse,
|
||
'relative_error': relative_error,
|
||
'execution_time_us': quant_time,
|
||
'memory_bytes': quant_memory,
|
||
'relative_performance': baseline_time / quant_time,
|
||
'relative_memory': baseline_memory / quant_memory,
|
||
'acceptable_accuracy': relative_error < accuracy_threshold
|
||
}
|
||
|
||
except Exception as e:
|
||
size_results[bits] = {
|
||
'precision': bits,
|
||
'error': str(e),
|
||
'acceptable_accuracy': False
|
||
}
|
||
|
||
operation_results[size] = size_results
|
||
|
||
quantization_analysis['accuracy_analysis'][op_name] = operation_results
|
||
|
||
# Aggregate analysis across operations and sizes
|
||
for precision in precision_levels:
|
||
if precision == 32:
|
||
continue
|
||
|
||
accuracy_scores = []
|
||
performance_gains = []
|
||
memory_reductions = []
|
||
|
||
for op_name, op_results in quantization_analysis['accuracy_analysis'].items():
|
||
for size, size_results in op_results.items():
|
||
if precision in size_results and 'relative_error' in size_results[precision]:
|
||
accuracy_scores.append(size_results[precision]['acceptable_accuracy'])
|
||
performance_gains.append(size_results[precision]['relative_performance'])
|
||
memory_reductions.append(size_results[precision]['relative_memory'])
|
||
|
||
if accuracy_scores:
|
||
quantization_analysis['deployment_recommendations'][precision] = {
|
||
'accuracy_success_rate': np.mean(accuracy_scores),
|
||
'avg_performance_gain': np.mean(performance_gains),
|
||
'avg_memory_reduction': np.mean(memory_reductions),
|
||
'recommended_for_production': np.mean(accuracy_scores) > 0.8 and np.mean(performance_gains) > 1.1
|
||
}
|
||
|
||
# Generate recommendations
|
||
for precision, metrics in quantization_analysis['deployment_recommendations'].items():
|
||
if metrics['recommended_for_production']:
|
||
quantization_analysis['recommendations'].append(
|
||
f"{precision}-bit quantization: {metrics['avg_performance_gain']:.1f}x speedup, "
|
||
f"{metrics['avg_memory_reduction']:.1f}x memory reduction, "
|
||
f"{metrics['accuracy_success_rate']*100:.0f}% accuracy success rate"
|
||
)
|
||
|
||
if not any(metrics['recommended_for_production']
|
||
for metrics in quantization_analysis['deployment_recommendations'].values()):
|
||
quantization_analysis['recommendations'].append(
|
||
"Quantization not recommended - accuracy degradation exceeds threshold"
|
||
)
|
||
|
||
self.analysis_results['quantization_trade_offs'] = quantization_analysis
|
||
return quantization_analysis
|
||
### END SOLUTION
|
||
|
||
def generate_optimization_roadmap(self, target_hardware: str = 'cloud',
|
||
priority_metrics: List[str] = None) -> Dict[str, Any]:
|
||
"""
|
||
Generate prioritized optimization roadmap for production deployment.
|
||
|
||
TODO: Design roadmap generation that synthesizes all analyses into:
|
||
- Prioritized optimization opportunities
|
||
- Implementation difficulty vs impact assessment
|
||
- Hardware-specific recommendations
|
||
- Deployment timeline and resource requirements
|
||
|
||
Should provide actionable guidance for ML system optimization in production.
|
||
"""
|
||
### BEGIN SOLUTION
|
||
if priority_metrics is None:
|
||
priority_metrics = ['performance', 'memory', 'accuracy']
|
||
|
||
roadmap = {
|
||
'target_hardware': target_hardware,
|
||
'priority_metrics': priority_metrics,
|
||
'optimization_opportunities': [],
|
||
'implementation_plan': {},
|
||
'resource_requirements': {},
|
||
'expected_outcomes': {},
|
||
'recommendations': []
|
||
}
|
||
|
||
# Hardware-specific considerations
|
||
hardware_profiles = {
|
||
'cloud': {
|
||
'cpu_cores': 16,
|
||
'memory_gb': 64,
|
||
'performance_priority': 'high',
|
||
'cost_sensitivity': 'medium',
|
||
'deployment_complexity': 'low'
|
||
},
|
||
'edge': {
|
||
'cpu_cores': 4,
|
||
'memory_gb': 8,
|
||
'performance_priority': 'medium',
|
||
'cost_sensitivity': 'high',
|
||
'deployment_complexity': 'high'
|
||
},
|
||
'mobile': {
|
||
'cpu_cores': 8,
|
||
'memory_gb': 4,
|
||
'performance_priority': 'medium',
|
||
'cost_sensitivity': 'high',
|
||
'deployment_complexity': 'very_high'
|
||
}
|
||
}
|
||
|
||
target_profile = hardware_profiles.get(target_hardware, hardware_profiles['cloud'])
|
||
|
||
# Analyze optimization opportunities from all analyses
|
||
opportunities = []
|
||
|
||
# From cache analysis
|
||
if 'cache_efficiency' in self.analysis_results:
|
||
cache_results = self.analysis_results['cache_efficiency']
|
||
for size, analysis in cache_results['bandwidth_utilization'].items():
|
||
if analysis['bottleneck'] == 'memory':
|
||
opportunities.append({
|
||
'type': 'cache_optimization',
|
||
'impact': 'high',
|
||
'difficulty': 'medium',
|
||
'description': 'Implement cache-friendly blocking algorithms',
|
||
'expected_improvement': '2-4x performance gain',
|
||
'implementation_effort': '2-3 weeks'
|
||
})
|
||
break
|
||
|
||
# From vectorization analysis
|
||
if 'vectorization_potential' in self.analysis_results:
|
||
vec_results = self.analysis_results['vectorization_potential']
|
||
overall_speedup = vec_results['speedup_estimates'].get('overall', 1.0)
|
||
if overall_speedup > 2.0:
|
||
opportunities.append({
|
||
'type': 'vectorization',
|
||
'impact': 'high',
|
||
'difficulty': 'low',
|
||
'description': 'Implement SIMD vectorization for element-wise operations',
|
||
'expected_improvement': f'{overall_speedup:.1f}x performance gain',
|
||
'implementation_effort': '1-2 weeks'
|
||
})
|
||
|
||
# From parallel analysis
|
||
if 'parallel_scaling' in self.analysis_results:
|
||
parallel_results = self.analysis_results['parallel_scaling']
|
||
avg_efficiency = np.mean([
|
||
analysis['scaling_efficiency']
|
||
for analysis in parallel_results['efficiency_analysis'].values()
|
||
]) if parallel_results['efficiency_analysis'] else 0
|
||
|
||
if avg_efficiency > 0.5 and target_profile['cpu_cores'] > 4:
|
||
opportunities.append({
|
||
'type': 'parallelization',
|
||
'impact': 'medium',
|
||
'difficulty': 'medium',
|
||
'description': f'Implement parallel processing for {target_profile["cpu_cores"]} cores',
|
||
'expected_improvement': f'{avg_efficiency * target_profile["cpu_cores"]:.1f}x speedup',
|
||
'implementation_effort': '2-4 weeks'
|
||
})
|
||
|
||
# From quantization analysis
|
||
if 'quantization_trade_offs' in self.analysis_results:
|
||
quant_results = self.analysis_results['quantization_trade_offs']
|
||
for precision, metrics in quant_results['deployment_recommendations'].items():
|
||
if metrics['recommended_for_production']:
|
||
impact_level = 'high' if metrics['avg_memory_reduction'] > 2.0 else 'medium'
|
||
opportunities.append({
|
||
'type': 'quantization',
|
||
'impact': impact_level,
|
||
'difficulty': 'high',
|
||
'description': f'Deploy {precision}-bit quantization',
|
||
'expected_improvement': f'{metrics["avg_performance_gain"]:.1f}x speedup, {metrics["avg_memory_reduction"]:.1f}x memory reduction',
|
||
'implementation_effort': '3-6 weeks'
|
||
})
|
||
break
|
||
|
||
# Sort opportunities by priority
|
||
priority_order = {'high': 3, 'medium': 2, 'low': 1}
|
||
difficulty_penalty = {'low': 0, 'medium': -0.5, 'high': -1, 'very_high': -2}
|
||
|
||
def opportunity_score(opp):
|
||
impact_score = priority_order.get(opp['impact'], 1)
|
||
difficulty_score = difficulty_penalty.get(opp['difficulty'], 0)
|
||
|
||
# Hardware-specific adjustments
|
||
if target_hardware == 'mobile' and opp['type'] == 'quantization':
|
||
impact_score += 1 # Quantization more important for mobile
|
||
elif target_hardware == 'cloud' and opp['type'] == 'parallelization':
|
||
impact_score += 0.5 # Parallelization more beneficial in cloud
|
||
|
||
return impact_score + difficulty_score
|
||
|
||
opportunities.sort(key=opportunity_score, reverse=True)
|
||
roadmap['optimization_opportunities'] = opportunities[:5] # Top 5 opportunities
|
||
|
||
# Create implementation plan
|
||
phases = ['Phase 1 (0-1 months)', 'Phase 2 (1-3 months)', 'Phase 3 (3-6 months)']
|
||
current_phase = 0
|
||
|
||
for i, opportunity in enumerate(roadmap['optimization_opportunities']):
|
||
if i < 2:
|
||
phase = phases[0]
|
||
elif i < 4:
|
||
phase = phases[1]
|
||
else:
|
||
phase = phases[2]
|
||
|
||
if phase not in roadmap['implementation_plan']:
|
||
roadmap['implementation_plan'][phase] = []
|
||
|
||
roadmap['implementation_plan'][phase].append({
|
||
'optimization': opportunity['type'],
|
||
'description': opportunity['description'],
|
||
'effort': opportunity['implementation_effort']
|
||
})
|
||
|
||
# Resource requirements
|
||
roadmap['resource_requirements'] = {
|
||
'engineering_time': '3-6 months for full implementation',
|
||
'hardware_requirements': f"Target: {target_hardware} with {target_profile['cpu_cores']} cores, {target_profile['memory_gb']}GB RAM",
|
||
'testing_infrastructure': 'Performance testing and regression testing framework',
|
||
'deployment_complexity': target_profile['deployment_complexity']
|
||
}
|
||
|
||
# Expected outcomes
|
||
total_performance_gain = 1.0
|
||
total_memory_reduction = 1.0
|
||
|
||
for opp in roadmap['optimization_opportunities']:
|
||
# Extract numerical improvements (simplified)
|
||
if 'x performance gain' in opp['expected_improvement']:
|
||
try:
|
||
gain = float(opp['expected_improvement'].split('x')[0])
|
||
total_performance_gain *= gain ** 0.5 # Assume some compounding
|
||
except:
|
||
pass
|
||
|
||
if 'x memory reduction' in opp['expected_improvement']:
|
||
try:
|
||
reduction = float(opp['expected_improvement'].split('x memory reduction')[0].split()[-1])
|
||
total_memory_reduction *= reduction ** 0.5
|
||
except:
|
||
pass
|
||
|
||
roadmap['expected_outcomes'] = {
|
||
'performance_improvement': f'{total_performance_gain:.1f}x overall speedup',
|
||
'memory_efficiency': f'{total_memory_reduction:.1f}x memory reduction',
|
||
'deployment_readiness': 'Production-ready optimized kernels',
|
||
'maintenance_overhead': 'Low (well-structured optimization patterns)'
|
||
}
|
||
|
||
# Generate final recommendations
|
||
roadmap['recommendations'] = [
|
||
f"Prioritize {roadmap['optimization_opportunities'][0]['type']} optimization first (highest impact)",
|
||
f"Target hardware ({target_hardware}) well-suited for planned optimizations",
|
||
f"Expected overall improvement: {total_performance_gain:.1f}x performance, {total_memory_reduction:.1f}x memory efficiency",
|
||
"Implement comprehensive performance testing before production deployment"
|
||
]
|
||
|
||
if target_hardware in ['edge', 'mobile']:
|
||
roadmap['recommendations'].append(
|
||
"Quantization critical for resource-constrained deployment"
|
||
)
|
||
|
||
self.analysis_results['optimization_roadmap'] = roadmap
|
||
return roadmap
|
||
### END SOLUTION
|
||
|
||
# PASS IMPLEMENTATION CHECKPOINT: Advanced optimization analyzer complete
|
||
|
||
# THINK PREDICTION: What will be the most impactful optimization for matrix operations?
|
||
# Your guess: _______
|
||
|
||
# MAGNIFY SYSTEMS INSIGHT: Comprehensive Kernel Optimization Analysis
|
||
def comprehensive_kernel_analysis():
|
||
"""Run complete kernel optimization analysis using the advanced analyzer."""
|
||
try:
|
||
print("ROCKET Comprehensive Kernel Optimization Analysis")
|
||
print("=" * 60)
|
||
|
||
# Initialize analyzer
|
||
analyzer = KernelOptimizationAnalyzer()
|
||
|
||
# 1. Cache efficiency analysis
|
||
print("\n📊 Cache Efficiency Analysis:")
|
||
cache_results = analyzer.analyze_cache_efficiency(
|
||
matmul_baseline,
|
||
data_sizes=[64, 128, 256, 512],
|
||
access_patterns=['sequential', 'strided']
|
||
)
|
||
|
||
for size, analysis in cache_results['cache_efficiency'].items():
|
||
print(f" Size {size:3d}: {analysis['cache_level']} cache, {analysis['efficiency_estimate']:.1%} efficiency")
|
||
|
||
print(f" Recommendations: {'; '.join(cache_results['recommendations'])}")
|
||
|
||
# 2. Vectorization potential analysis
|
||
print("\nROCKET Vectorization Potential Analysis:")
|
||
vec_results = analyzer.analyze_vectorization_potential(
|
||
['matmul', 'relu', 'add', 'multiply'],
|
||
[(1000,), (1000, 1000)]
|
||
)
|
||
|
||
for op, potential in vec_results['simd_opportunities'].items():
|
||
print(f" {op}: {potential['potential']} potential, {potential['speedup']:.1f}x speedup")
|
||
|
||
print(f" Overall speedup estimate: {vec_results['speedup_estimates']['overall']:.1f}x")
|
||
|
||
# 3. Parallel scaling analysis
|
||
print("\n🔀 Parallel Scaling Analysis:")
|
||
parallel_results = analyzer.analyze_parallel_scaling(
|
||
parallel_relu,
|
||
worker_counts=[1, 2, 4, 8],
|
||
data_sizes=[10000, 50000]
|
||
)
|
||
|
||
for size, analysis in parallel_results['efficiency_analysis'].items():
|
||
print(f" Size {size:5d}: {analysis['max_speedup']:.1f}x max speedup, {analysis['scaling_efficiency']:.1%} efficiency")
|
||
|
||
# 4. Quantization trade-offs analysis
|
||
print("\n🗜️ Quantization Trade-offs Analysis:")
|
||
quant_results = analyzer.analyze_quantization_trade_offs(
|
||
[matmul_baseline, vectorized_relu],
|
||
precision_levels=[32, 16, 8]
|
||
)
|
||
|
||
for precision, metrics in quant_results['deployment_recommendations'].items():
|
||
if metrics['recommended_for_production']:
|
||
print(f" {precision}-bit: {metrics['avg_performance_gain']:.1f}x speedup, "
|
||
f"{metrics['avg_memory_reduction']:.1f}x memory reduction, "
|
||
f"{metrics['accuracy_success_rate']:.0%} accuracy success")
|
||
|
||
# 5. Generate optimization roadmap
|
||
print("\n🗺️ Optimization Roadmap:")
|
||
roadmap = analyzer.generate_optimization_roadmap(
|
||
target_hardware='cloud',
|
||
priority_metrics=['performance', 'memory']
|
||
)
|
||
|
||
print(f" Target: {roadmap['target_hardware']} deployment")
|
||
print(f" Expected outcomes: {roadmap['expected_outcomes']['performance_improvement']}, "
|
||
f"{roadmap['expected_outcomes']['memory_efficiency']}")
|
||
|
||
print("\n Top optimization opportunities:")
|
||
for i, opp in enumerate(roadmap['optimization_opportunities'][:3], 1):
|
||
print(f" {i}. {opp['type']}: {opp['description']}")
|
||
print(f" Impact: {opp['impact']}, Effort: {opp['implementation_effort']}")
|
||
|
||
print("\n Key recommendations:")
|
||
for rec in roadmap['recommendations'][:3]:
|
||
print(f" • {rec}")
|
||
|
||
# TIP WHY THIS MATTERS: Comprehensive analysis guides optimization decisions:
|
||
# 1. Cache analysis reveals memory bottlenecks and optimal algorithms
|
||
# 2. Vectorization analysis shows where SIMD can provide biggest gains
|
||
# 3. Parallel analysis identifies when threading helps vs hurts
|
||
# 4. Quantization analysis balances accuracy vs deployment efficiency
|
||
# 5. Roadmap prioritizes efforts for maximum production impact
|
||
|
||
return {
|
||
'cache_analysis': cache_results,
|
||
'vectorization_analysis': vec_results,
|
||
'parallel_analysis': parallel_results,
|
||
'quantization_analysis': quant_results,
|
||
'optimization_roadmap': roadmap
|
||
}
|
||
|
||
except Exception as e:
|
||
print(f"WARNING️ Comprehensive analysis error: {e}")
|
||
return None
|
||
|
||
# Run the comprehensive analysis
|
||
comprehensive_analysis = comprehensive_kernel_analysis()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### TEST Unit Test: Advanced Optimization Analyzer
|
||
This test validates the comprehensive kernel optimization analyzer
|
||
"""
|
||
|
||
# %%
|
||
def test_unit_advanced_optimization_analyzer():
|
||
"""Test the advanced kernel optimization analyzer."""
|
||
print("TEST Unit Test: Advanced Optimization Analyzer")
|
||
|
||
# Test 1: Analyzer initialization
|
||
analyzer = KernelOptimizationAnalyzer()
|
||
|
||
assert hasattr(analyzer, 'hardware_config'), "Analyzer should have hardware config"
|
||
assert analyzer.hardware_config['cpu_cores'] > 0, "Should detect CPU cores"
|
||
print("PASS Initialization: Hardware configuration detected")
|
||
|
||
# Test 2: Cache efficiency analysis
|
||
cache_results = analyzer.analyze_cache_efficiency(matmul_baseline, [64, 128])
|
||
|
||
assert 'cache_efficiency' in cache_results, "Should return cache efficiency results"
|
||
assert 'bandwidth_utilization' in cache_results, "Should analyze bandwidth utilization"
|
||
assert 'recommendations' in cache_results, "Should provide recommendations"
|
||
print("PASS Cache analysis: Complete analysis with recommendations")
|
||
|
||
# Test 3: Vectorization potential analysis
|
||
vec_results = analyzer.analyze_vectorization_potential(['relu', 'add'])
|
||
|
||
assert 'simd_opportunities' in vec_results, "Should identify SIMD opportunities"
|
||
assert 'speedup_estimates' in vec_results, "Should estimate speedup potential"
|
||
print("PASS Vectorization analysis: SIMD opportunities identified")
|
||
|
||
# Test 4: Parallel scaling analysis
|
||
parallel_results = analyzer.analyze_parallel_scaling(parallel_relu, [1, 2, 4])
|
||
|
||
assert 'scaling_results' in parallel_results, "Should provide scaling results"
|
||
assert 'efficiency_analysis' in parallel_results, "Should analyze efficiency"
|
||
print("PASS Parallel analysis: Scaling efficiency measured")
|
||
|
||
# Test 5: Quantization analysis
|
||
quant_results = analyzer.analyze_quantization_trade_offs([vectorized_relu])
|
||
|
||
assert 'deployment_recommendations' in quant_results, "Should provide deployment recommendations"
|
||
assert 'accuracy_analysis' in quant_results, "Should analyze accuracy impact"
|
||
print("PASS Quantization analysis: Trade-offs evaluated")
|
||
|
||
# Test 6: Optimization roadmap
|
||
roadmap = analyzer.generate_optimization_roadmap('cloud')
|
||
|
||
assert 'optimization_opportunities' in roadmap, "Should identify opportunities"
|
||
assert 'implementation_plan' in roadmap, "Should provide implementation plan"
|
||
assert 'expected_outcomes' in roadmap, "Should estimate outcomes"
|
||
assert 'recommendations' in roadmap, "Should give actionable recommendations"
|
||
print("PASS Roadmap generation: Comprehensive optimization plan created")
|
||
|
||
# Test 7: Integration across analyses
|
||
assert len(analyzer.analysis_results) >= 4, "Should store all analysis results"
|
||
print("PASS Integration: All analyses stored and accessible")
|
||
|
||
# Run the test
|
||
test_unit_advanced_optimization_analyzer()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## Integration - Bringing High-Performance Kernels Together
|
||
|
||
### Kernel Composition and Performance Pipeline
|
||
"""
|
||
|
||
# %%
|
||
def test_unit_all():
|
||
"""Run comprehensive kernel module validation."""
|
||
print("TEST Running all kernel unit tests...")
|
||
|
||
# Core infrastructure tests
|
||
test_unit_timing_infrastructure()
|
||
print()
|
||
|
||
# Matrix operation tests
|
||
test_unit_cache_friendly_matmul()
|
||
print()
|
||
|
||
# Vectorization tests
|
||
test_unit_vectorized_operations()
|
||
print()
|
||
|
||
# Parallel processing tests
|
||
test_unit_parallel_processing()
|
||
print()
|
||
|
||
# Quantization tests
|
||
test_unit_quantization_kernels()
|
||
print()
|
||
|
||
# Advanced analyzer tests
|
||
test_unit_advanced_optimization_analyzer()
|
||
print()
|
||
|
||
print("PASS All kernel unit tests passed! High-performance kernels ready for deployment.")
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## Production Context - Real-World Kernel Usage
|
||
|
||
### How Production ML Systems Use Optimized Kernels
|
||
|
||
Modern ML frameworks achieve their performance through sophisticated kernel optimization:
|
||
|
||
**PyTorch Kernel Architecture:**
|
||
```python
|
||
# High-level PyTorch operation
|
||
result = torch.matmul(A, B)
|
||
|
||
# Dispatches to optimized kernels based on:
|
||
# - Hardware: CPU (Intel MKL) vs GPU (cuBLAS/cuDNN)
|
||
# - Data type: float32, float16, bfloat16, int8
|
||
# - Tensor properties: size, stride, memory layout
|
||
# - Available optimizations: Tensor Cores, quantization
|
||
```
|
||
|
||
**Performance Optimization Stack:**
|
||
```
|
||
Application Level: model(input)
|
||
Framework Level: torch.matmul(A, B)
|
||
Dispatcher Level: select_optimal_kernel(A, B, device)
|
||
Kernel Level: optimized_matmul_cuda/cpu(A, B)
|
||
Hardware Level: CUDA cores, Tensor cores, SIMD units
|
||
```
|
||
|
||
**Real-World Impact:**
|
||
- **Training Acceleration**: Optimized kernels enable training larger models in reasonable time
|
||
- **Inference Speed**: Fast kernels reduce serving latency and costs
|
||
- **Edge Deployment**: Quantized kernels enable deployment on mobile/IoT devices
|
||
- **Energy Efficiency**: Efficient kernels reduce data center power consumption
|
||
|
||
### Framework Integration Patterns
|
||
|
||
**Automatic Kernel Selection:**
|
||
```python
|
||
# Framework chooses optimal implementation
|
||
if tensor.is_cuda and tensor.dtype == torch.float16:
|
||
return tensor_core_matmul(A, B)
|
||
elif tensor.is_cpu and has_avx512():
|
||
return vectorized_cpu_matmul(A, B)
|
||
else:
|
||
return fallback_matmul(A, B)
|
||
```
|
||
|
||
**Performance Profiling Integration:**
|
||
```python
|
||
# Built-in profiling like our analyzer
|
||
with torch.profiler.profile() as prof:
|
||
result = model(input)
|
||
|
||
# Reveals which kernels are bottlenecks
|
||
prof.export_chrome_trace("trace.json")
|
||
```
|
||
"""
|
||
|
||
# %%
|
||
if __name__ == "__main__":
|
||
test_unit_all()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## THINK ML Systems Thinking: Interactive Questions
|
||
|
||
Now that you've implemented high-performance computational kernels, let's explore the systems implications through hands-on analysis.
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### Question 1: Cache Hierarchy Optimization Analysis
|
||
|
||
**Context**: Your `cache_friendly_matmul` function uses blocking to improve cache locality. You measured different block sizes and saw varying performance characteristics.
|
||
|
||
**Reflection Question**: Analyze the cache behavior patterns in your implementation. When you tested block sizes of 32, 64, and 128, how did performance scale with memory hierarchy levels (L1/L2/L3 cache)? Design an adaptive blocking strategy that automatically selects optimal block sizes based on runtime cache analysis. How would you extend your approach to handle matrices that don't fit entirely in any cache level?
|
||
|
||
**Think about**:
|
||
- Cache line sizes and prefetching behavior
|
||
- Multi-level cache optimization strategies
|
||
- Memory bandwidth vs cache capacity trade-offs
|
||
- Production deployment across different CPU architectures
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### Question 2: Vectorization and Parallelization Interaction Analysis
|
||
|
||
**Context**: You implemented both SIMD vectorization (`vectorized_relu`) and multi-threading parallelization (`parallel_relu`). Your performance analysis showed different scaling characteristics.
|
||
|
||
**Reflection Question**: Examine the interaction between vectorization and parallelization in your implementations. How does SIMD vectorization within each thread affect the optimal number of worker threads? Analyze the memory bandwidth contention when multiple threads are performing vectorized operations simultaneously. Design a hybrid optimization strategy that balances SIMD width, thread count, and memory bandwidth for maximum throughput.
|
||
|
||
**Think about**:
|
||
- Memory bandwidth limitations with multiple vectorized threads
|
||
- NUMA topology effects on parallel vectorized operations
|
||
- Thread affinity and cache sharing between cores
|
||
- Optimal work distribution strategies for vectorized workloads
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### Question 3: Production Deployment Optimization Strategy
|
||
|
||
**Context**: Your `KernelOptimizationAnalyzer` generated a comprehensive optimization roadmap with prioritized improvements for production deployment.
|
||
|
||
**Reflection Question**: Based on your optimization analysis results, design a production deployment strategy for a real-time ML inference service. How would you adapt your kernel optimizations for different deployment scenarios: cloud instances with 32+ cores, edge devices with 4 cores and limited memory, and mobile devices with thermal constraints? Create a decision framework that automatically selects optimal kernel implementations based on runtime hardware detection and performance requirements.
|
||
|
||
**Think about**:
|
||
- Runtime performance monitoring and adaptation
|
||
- Thermal management and performance throttling
|
||
- Memory pressure and kernel selection strategies
|
||
- Fallback mechanisms for unsupported optimizations
|
||
- Continuous performance optimization in production
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## TARGET MODULE SUMMARY: Kernels
|
||
|
||
Congratulations! You've successfully implemented high-performance computational kernels that power modern ML systems!
|
||
|
||
### What You've Accomplished
|
||
PASS **High-Performance Implementation**: 200+ lines of optimized kernel code with cache blocking, vectorization, and parallelization
|
||
PASS **Advanced Optimization Analysis**: Comprehensive `KernelOptimizationAnalyzer` with multi-dimensional performance evaluation
|
||
PASS **Production-Ready Kernels**: Matrix multiplication, activation functions, and quantization kernels optimized for real-world deployment
|
||
PASS **Systems Integration**: Complete optimization pipeline from profiling through deployment recommendations
|
||
PASS **Performance Engineering**: Deep understanding of cache hierarchy, SIMD vectorization, and parallel processing trade-offs
|
||
|
||
### Key Learning Outcomes
|
||
- **Cache Optimization**: Implementing cache-friendly algorithms that minimize memory access latency
|
||
- **Vectorization Mastery**: Leveraging SIMD instructions for 4-16x performance improvements
|
||
- **Parallel Processing**: Understanding when parallelization helps vs creates overhead
|
||
- **Quantization Engineering**: Balancing accuracy vs performance for efficient deployment
|
||
- **Production Optimization**: Systematic approach to kernel optimization for real-world ML systems
|
||
|
||
### Mathematical Foundations Mastered
|
||
- **Cache-Friendly Algorithms**: O(n³/B) cache complexity through blocking techniques
|
||
- **SIMD Vectorization**: Processing 4-16 elements simultaneously with vector instructions
|
||
- **Parallel Scaling**: Amdahl's law and parallel efficiency analysis across worker counts
|
||
- **Quantization Mathematics**: Precision reduction with controlled accuracy degradation
|
||
|
||
### Professional Skills Developed
|
||
- **Performance Engineering**: Systematic optimization methodology from profiling to deployment
|
||
- **Systems Architecture**: Understanding hardware-software interface for ML acceleration
|
||
- **Production Deployment**: Optimization strategies for cloud, edge, and mobile environments
|
||
- **Kernel Development**: Building high-performance computational primitives that power ML frameworks
|
||
|
||
### Ready for Advanced Applications
|
||
Your kernel implementations now enable:
|
||
- **Real-Time Inference**: Optimized kernels for low-latency ML serving
|
||
- **Large-Scale Training**: High-performance operations for training large models
|
||
- **Edge Deployment**: Memory-efficient kernels for resource-constrained devices
|
||
- **Framework Development**: Understanding how PyTorch and TensorFlow achieve high performance
|
||
|
||
### Connection to Real ML Systems
|
||
Your implementation mirrors production systems:
|
||
- **PyTorch**: ATen library with CUDA kernels, Intel MKL integration, and automatic kernel selection
|
||
- **TensorFlow**: XLA compiler with hardware-specific optimizations and kernel fusion
|
||
- **Industry Practice**: Cache blocking, vectorization, and quantization are fundamental to all modern ML frameworks
|
||
|
||
### Next Steps
|
||
1. **Export your module**: `tito module complete 13_kernels`
|
||
2. **Validate integration**: `tito test --module kernels`
|
||
3. **Explore advanced optimizations**: GPU kernels, custom CUDA implementations
|
||
4. **Ready for Module 14**: Performance analysis and benchmarking systems
|
||
|
||
**Performance Engineering Mastery**: Your high-performance kernel implementations demonstrate deep understanding of how to optimize ML operations for production deployment - the foundation for building scalable ML infrastructure!
|
||
""" |