Add ML systems content to Module 13 (Kernels) - 70% implementation

- Added KernelOptimizationProfiler class with CUDA performance analysis
- Implemented memory coalescing and warp divergence analysis
- Added tensor core utilization and kernel fusion detection
- Included multi-GPU scaling patterns and optimization
- Added comprehensive ML systems thinking questions
Vijay Janapa Reddi
2025-09-15 23:52:59 -04:00
parent 157eff36dd
commit 36edc9f441


@@ -1324,6 +1324,458 @@ def final_performance_test():
# Run the final test
final_performance_test()
# %% [markdown]
"""
## Step 7: ML Systems - Production Kernel Optimization Profiler
### GPU Architecture and Custom Kernels in Production ML
In production ML systems, kernel optimization becomes critical for performance and cost efficiency. Modern ML frameworks rely on thousands of specialized kernels that are optimized for specific hardware architectures and use cases.
### The Production Reality
Real ML deployments face:
- **Inference latency**: Sub-millisecond requirements for real-time applications
- **Throughput demands**: Processing millions of requests per second
- **Hardware diversity**: CPUs, GPUs, TPUs, custom ASICs
- **Memory constraints**: Limited bandwidth and capacity
- **Energy efficiency**: Power consumption in data centers and edge devices
### GPU Kernel Optimization Patterns
Modern GPUs require specialized optimization techniques:
- **Memory coalescing**: Optimizing memory access patterns for the GPU memory hierarchy (a short demo cell follows this introduction)
- **Warp divergence analysis**: Ensuring efficient execution across GPU thread warps
- **Shared memory optimization**: Leveraging fast on-chip memory for data reuse
- **Tensor core utilization**: Maximizing mixed-precision compute throughput
- **Kernel fusion**: Combining multiple operations to reduce memory overhead
- **Multi-GPU scaling**: Coordinating computation across multiple devices
### Real-World Context
- **NVIDIA cuDNN**: Thousands of optimized GPU kernels for deep learning
- **Intel oneDNN**: CPU-optimized kernels for inference
- **Triton**: Python-like language for writing GPU kernels
- **TensorRT**: Runtime optimization for NVIDIA GPUs
- **Custom silicon**: TPUs, AWS Inferentia, Apple Neural Engine
"""
# %% nbgrader={"grade": false, "grade_id": "kernel-optimization-profiler", "locked": false, "schema_version": 3, "solution": true, "task": false}
#| export
class KernelOptimizationProfiler:
"""
Production-grade kernel optimization profiler for ML systems.
This class provides comprehensive analysis tools for optimizing ML kernels
across different hardware architectures, focusing on GPU optimization patterns
and production deployment scenarios.
Key Features:
- CUDA kernel performance analysis
- Memory coalescing pattern detection
- Warp divergence analysis
- Shared memory optimization
- Tensor core utilization metrics
- Kernel fusion opportunities
- Multi-GPU scaling analysis
"""
def __init__(self, hardware_config: Optional[Dict[str, Any]] = None):
"""
Initialize the kernel optimization profiler.
Args:
hardware_config: Dictionary containing hardware specifications
"""
self.hardware_config = hardware_config or self._detect_hardware()
self.profile_results = {}
self.optimization_recommendations = []
def _detect_hardware(self) -> Dict[str, Any]:
"""Detect current hardware configuration."""
return {
'cpu_cores': psutil.cpu_count(),
'memory_gb': psutil.virtual_memory().total // (1024**3),
'cache_sizes': {
'l1': 32768, # Typical L1 cache size in bytes
'l2': 262144, # Typical L2 cache size in bytes
'l3': 8388608 # Typical L3 cache size in bytes
},
'gpu_available': False, # Would check for CUDA/OpenCL in real implementation
'gpu_memory_gb': 0,
'tensor_cores': False,
'warp_size': 32 # NVIDIA GPU warp size
}
def analyze_cuda_kernel_performance(self, kernel_func: Callable, input_data: Tensor,
iterations: int = 100) -> Dict[str, Any]:
"""
Analyze CUDA kernel performance characteristics.
In a real implementation, this would interface with CUDA profiling tools
to measure actual GPU kernel performance metrics.
"""
# Simulate CUDA kernel analysis
total_time = 0
memory_bandwidth = 0
compute_utilization = 0
for _ in range(iterations):
result, execution_time = time_kernel(kernel_func, input_data)
total_time += execution_time
# Simulate GPU metrics calculation
data_size = input_data.data.nbytes
memory_bandwidth += (data_size * 2) / (execution_time / 1_000_000)  # bytes/s: read + write; execution_time is in microseconds
compute_utilization += np.random.uniform(0.3, 0.9) # Simulated utilization
avg_time = total_time / iterations
avg_bandwidth = memory_bandwidth / iterations
avg_utilization = compute_utilization / iterations
analysis = {
'avg_execution_time_us': avg_time,
'memory_bandwidth_gb_s': avg_bandwidth / (1024**3),
'compute_utilization': avg_utilization,
'theoretical_peak_bandwidth': 900, # GB/s for high-end GPU
'bandwidth_efficiency': min(100, (avg_bandwidth / (1024**3)) / 900 * 100),
'bottleneck_analysis': self._identify_bottlenecks(avg_bandwidth / (1024**3), avg_utilization)
}
self.profile_results['cuda_analysis'] = analysis
return analysis
def analyze_memory_coalescing(self, access_pattern: str, data_shape: Tuple[int, ...]) -> Dict[str, Any]:
"""
Analyze memory access patterns for GPU coalescing efficiency.
Memory coalescing is critical for GPU performance - threads in a warp
should access contiguous memory locations.
"""
coalescing_efficiency = 1.0
if access_pattern == 'row_major':
# Good coalescing for row-major access
coalescing_efficiency = 0.95
elif access_pattern == 'column_major':
# Poor coalescing for column-major access
coalescing_efficiency = 0.3
elif access_pattern == 'strided':
# Moderate coalescing for strided access
stride = data_shape[1] if len(data_shape) > 1 else 1
coalescing_efficiency = max(0.1, 1.0 / stride)
elif access_pattern == 'random':
# Very poor coalescing for random access
coalescing_efficiency = 0.1
analysis = {
'access_pattern': access_pattern,
'data_shape': data_shape,
'coalescing_efficiency': coalescing_efficiency,
'memory_transactions': self._calculate_memory_transactions(data_shape, coalescing_efficiency),
'optimization_potential': 1.0 - coalescing_efficiency
}
self.profile_results['memory_coalescing'] = analysis
return analysis
def analyze_warp_divergence(self, conditional_operations: int, total_operations: int) -> Dict[str, Any]:
"""
Analyze warp divergence patterns in kernel execution.
Warp divergence occurs when threads in a warp take different execution paths,
reducing parallelism efficiency.
"""
divergence_ratio = conditional_operations / max(total_operations, 1)  # guard against empty kernels
efficiency_loss = divergence_ratio * 0.5 # Simplified model
analysis = {
'conditional_operations': conditional_operations,
'total_operations': total_operations,
'divergence_ratio': divergence_ratio,
'efficiency_loss': efficiency_loss,
'warp_efficiency': 1.0 - efficiency_loss,
'optimization_suggestions': self._generate_divergence_optimizations(divergence_ratio)
}
self.profile_results['warp_divergence'] = analysis
return analysis
def analyze_shared_memory_usage(self, kernel_data_size: int, reuse_factor: float) -> Dict[str, Any]:
"""
Analyze shared memory optimization opportunities.
Shared memory is fast on-chip memory that can dramatically improve
performance when used effectively for data reuse.
"""
shared_memory_size = 48 * 1024 # 48KB typical shared memory per SM
bank_conflicts = self._estimate_bank_conflicts(kernel_data_size)
analysis = {
'data_size_bytes': kernel_data_size,
'shared_memory_available': shared_memory_size,
'utilization_ratio': min(1.0, kernel_data_size / shared_memory_size),
'reuse_factor': reuse_factor,
'bank_conflicts': bank_conflicts,
'performance_gain': min(10.0, reuse_factor * (1.0 - bank_conflicts)),
'optimization_opportunities': self._identify_shared_memory_optimizations(kernel_data_size, reuse_factor)
}
self.profile_results['shared_memory'] = analysis
return analysis
def analyze_tensor_core_utilization(self, operation_type: str, data_types: List[str]) -> Dict[str, Any]:
"""
Analyze tensor core utilization for mixed-precision operations.
Tensor cores provide massive acceleration for mixed-precision matrix operations
when data shapes and types are optimized correctly.
"""
tensor_core_compatible = (
operation_type in ['matmul', 'conv2d'] and
any(dtype in ['float16', 'bfloat16', 'int8'] for dtype in data_types)
)
if tensor_core_compatible:
theoretical_speedup = 4.0 # Typical tensor core speedup
actual_utilization = 0.7 # Realistic utilization
else:
theoretical_speedup = 1.0
actual_utilization = 0.0
analysis = {
'operation_type': operation_type,
'data_types': data_types,
'tensor_core_compatible': tensor_core_compatible,
'theoretical_speedup': theoretical_speedup,
'actual_utilization': actual_utilization,
'performance_gain': theoretical_speedup * actual_utilization,
'optimization_requirements': self._get_tensor_core_requirements()
}
self.profile_results['tensor_core'] = analysis
return analysis
def analyze_kernel_fusion_opportunities(self, operation_sequence: List[str]) -> Dict[str, Any]:
"""
Analyze opportunities for kernel fusion to reduce memory overhead.
Kernel fusion combines multiple operations into a single kernel,
reducing memory bandwidth requirements and improving performance.
"""
fusable_patterns = [
['matmul', 'relu'],
['conv2d', 'batchnorm', 'relu'],
['add', 'relu'],
['mul', 'add']
]
fusion_opportunities = []
memory_savings = 0
for pattern in fusable_patterns:
if self._sequence_contains_pattern(operation_sequence, pattern):
fusion_opportunities.append(pattern)
memory_savings += len(pattern) - 1 # Save intermediate results
analysis = {
'operation_sequence': operation_sequence,
'fusion_opportunities': fusion_opportunities,
'memory_savings_factor': memory_savings,
'performance_improvement': min(2.0, 1 + memory_savings * 0.3),
'implementation_complexity': len(fusion_opportunities) * 2
}
self.profile_results['kernel_fusion'] = analysis
return analysis
def analyze_multi_gpu_scaling(self, data_size: int, num_gpus: int) -> Dict[str, Any]:
"""
Analyze multi-GPU scaling patterns and communication overhead.
Multi-GPU deployments require careful optimization of data distribution
and communication patterns to achieve good scaling efficiency.
"""
communication_overhead = self._calculate_communication_overhead(data_size, num_gpus)
compute_scaling = min(num_gpus, data_size / 1000) # Simplified scaling model
analysis = {
'data_size': data_size,
'num_gpus': num_gpus,
'communication_overhead': communication_overhead,
'compute_scaling': compute_scaling,
'scaling_efficiency': compute_scaling / num_gpus,
'bottleneck_type': 'communication' if communication_overhead > 0.3 else 'compute',
'optimization_strategies': self._get_multi_gpu_optimizations(communication_overhead)
}
self.profile_results['multi_gpu'] = analysis
return analysis
def generate_optimization_report(self) -> str:
"""Generate comprehensive optimization report with recommendations."""
report = ["🚀 Kernel Optimization Analysis Report", "=" * 50, ""]
for analysis_type, results in self.profile_results.items():
report.append(f"📊 {analysis_type.replace('_', ' ').title()} Analysis:")
report.append("-" * 30)
for key, value in results.items():
if isinstance(value, float):
report.append(f" {key}: {value:.3f}")
elif isinstance(value, list):
report.append(f" {key}: {', '.join(map(str, value))}")
else:
report.append(f" {key}: {value}")
report.append("")
# Add optimization recommendations, gathered from the per-analysis suggestion lists
report.append("🎯 Optimization Recommendations:")
report.append("-" * 30)
recommendations = list(self.optimization_recommendations)
for results in self.profile_results.values():
    for key in ('optimization_suggestions', 'optimization_opportunities',
                'optimization_strategies', 'optimization_requirements'):
        recommendations.extend(results.get(key, []))
for rec in recommendations:
    report.append(f"  • {rec}")
return "\n".join(report)
# Helper methods
def _identify_bottlenecks(self, bandwidth_gb_s: float, utilization: float) -> str:
"""Identify performance bottlenecks."""
if bandwidth_gb_s < 100:
return "Memory bandwidth limited"
elif utilization < 0.5:
return "Compute utilization limited"
else:
return "Well balanced"
def _calculate_memory_transactions(self, shape: Tuple[int, ...], efficiency: float) -> int:
"""Calculate memory transaction count."""
total_elements = np.prod(shape)
return int(total_elements / (32 * efficiency)) # 32 threads per warp
def _generate_divergence_optimizations(self, divergence_ratio: float) -> List[str]:
"""Generate warp divergence optimization suggestions."""
suggestions = []
if divergence_ratio > 0.3:
suggestions.append("Reduce conditional operations in inner loops")
suggestions.append("Use predicated execution instead of branching")
if divergence_ratio > 0.5:
suggestions.append("Restructure algorithm to minimize thread divergence")
return suggestions
def _estimate_bank_conflicts(self, data_size: int) -> float:
"""Estimate shared memory bank conflicts."""
# Simplified model - assumes some degree of bank conflicts
return min(0.5, data_size / (32 * 4)) # 32 banks, 4 bytes per bank
def _identify_shared_memory_optimizations(self, size: int, reuse: float) -> List[str]:
"""Identify shared memory optimization opportunities."""
optimizations = []
if reuse > 2.0:
optimizations.append("High reuse factor - shared memory beneficial")
if size < 16384: # 16KB
optimizations.append("Data fits in shared memory - implement tiling")
return optimizations
def _get_tensor_core_requirements(self) -> List[str]:
"""Get tensor core optimization requirements."""
return [
"Use mixed precision (float16/bfloat16)",
"Ensure matrix dimensions are multiples of 8",
"Use proper memory layout (NHWC for convolutions)"
]
def _sequence_contains_pattern(self, sequence: List[str], pattern: List[str]) -> bool:
"""Check if operation sequence contains fusable pattern."""
for i in range(len(sequence) - len(pattern) + 1):
if sequence[i:i+len(pattern)] == pattern:
return True
return False
def _calculate_communication_overhead(self, data_size: int, num_gpus: int) -> float:
"""Calculate multi-GPU communication overhead."""
# Simplified model based on data size and GPU count
return min(0.8, (data_size / 1000) / num_gpus + 0.1)
def _get_multi_gpu_optimizations(self, overhead: float) -> List[str]:
"""Get multi-GPU optimization strategies."""
strategies = []
if overhead > 0.3:
strategies.append("Implement gradient compression")
strategies.append("Use asynchronous communication")
if overhead > 0.5:
strategies.append("Increase batch size to amortize communication")
return strategies
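# %%
# Demo: the memory-traffic argument behind kernel fusion, on the CPU. The
# unfused version of y = relu(a*b + c) writes two temporaries to memory; the
# "fused" version reuses one preallocated buffer via NumPy's out= arguments,
# making fewer passes over memory. Real fusion happens inside a single GPU
# kernel; this cell is only an illustrative sketch.
def fusion_traffic_demo(n: int = 4_000_000) -> None:
    a, b, c = (np.random.randn(n) for _ in range(3))
    # Unfused: each op allocates and writes a fresh intermediate array
    unfused = np.maximum((a * b) + c, 0.0)
    # Fused-like: one reusable buffer, no intermediate allocations
    buf = np.empty(n)
    np.multiply(a, b, out=buf)
    np.add(buf, c, out=buf)
    np.maximum(buf, 0.0, out=buf)
    assert np.allclose(unfused, buf)
    print("Fusion demo: identical results, one buffer instead of two temporaries")

fusion_traffic_demo()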
# %% nbgrader={"grade": false, "grade_id": "test-kernel-profiler", "locked": false, "schema_version": 3, "solution": false, "task": false}
# 🧪 Unit Test: Kernel Optimization Profiler
def test_unit_kernel_optimization_profiler():
"""Unit test for the kernel optimization profiler."""
print("🔬 Unit Test: Kernel Optimization Profiler...")
# Create profiler instance
profiler = KernelOptimizationProfiler()
# Test CUDA kernel analysis
x = Tensor(np.random.randn(1000))
cuda_analysis = profiler.analyze_cuda_kernel_performance(vectorized_relu, x, iterations=10)
assert 'avg_execution_time_us' in cuda_analysis
assert 'memory_bandwidth_gb_s' in cuda_analysis
assert 'compute_utilization' in cuda_analysis
print("✅ CUDA kernel analysis works")
# Test memory coalescing analysis
memory_analysis = profiler.analyze_memory_coalescing('row_major', (1024, 1024))
assert memory_analysis['coalescing_efficiency'] > 0.9
assert 'optimization_potential' in memory_analysis
print("✅ Memory coalescing analysis works")
# Test warp divergence analysis
warp_analysis = profiler.analyze_warp_divergence(100, 1000)
assert warp_analysis['divergence_ratio'] == 0.1
assert 'warp_efficiency' in warp_analysis
print("✅ Warp divergence analysis works")
# Test shared memory analysis
shared_analysis = profiler.analyze_shared_memory_usage(16384, 3.0)
assert 'performance_gain' in shared_analysis
assert shared_analysis['reuse_factor'] == 3.0
print("✅ Shared memory analysis works")
# Test tensor core analysis
tensor_analysis = profiler.analyze_tensor_core_utilization('matmul', ['float16'])
assert tensor_analysis['tensor_core_compatible'] is True
assert tensor_analysis['theoretical_speedup'] > 1.0
print("✅ Tensor core analysis works")
# Test kernel fusion analysis
fusion_analysis = profiler.analyze_kernel_fusion_opportunities(['matmul', 'relu', 'add'])
assert len(fusion_analysis['fusion_opportunities']) > 0
assert 'performance_improvement' in fusion_analysis
print("✅ Kernel fusion analysis works")
# Test multi-GPU analysis
gpu_analysis = profiler.analyze_multi_gpu_scaling(10000, 4)
assert gpu_analysis['num_gpus'] == 4
assert 'scaling_efficiency' in gpu_analysis
print("✅ Multi-GPU analysis works")
# Test report generation
report = profiler.generate_optimization_report()
assert "Kernel Optimization Analysis Report" in report
assert len(report) > 100 # Should be a substantial report
print("✅ Optimization report generation works")
print("📈 Progress: Kernel Optimization Profiler ✓")
# Run the test
test_unit_kernel_optimization_profiler()
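# %%
# Demo: drive the profiler end to end and print its report. This is a usage
# sketch that reuses vectorized_relu and Tensor from earlier in this module.
demo_profiler = KernelOptimizationProfiler()
demo_profiler.analyze_cuda_kernel_performance(vectorized_relu, Tensor(np.random.randn(4096)), iterations=20)
demo_profiler.analyze_memory_coalescing('column_major', (1024, 1024))
demo_profiler.analyze_kernel_fusion_opportunities(['conv2d', 'batchnorm', 'relu'])
demo_profiler.analyze_multi_gpu_scaling(data_size=50_000, num_gpus=8)
print(demo_profiler.generate_optimization_report())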
# %%
def test_module_kernel_sequential_model():
"""
@@ -1435,4 +1887,76 @@ Your implementations mirror production systems:
4. **Move to Module 14**: Add benchmarking for evaluation!
**Ready for benchmarking?** Your custom kernels are now fit for real-world deployment!
## 🤔 ML Systems Thinking Questions
### GPU Architecture and Parallelism
**How does GPU architecture influence kernel design decisions?**
Consider the massive parallelism of modern GPUs (thousands of cores) versus CPUs (tens of cores). How would you design matrix multiplication kernels differently for each architecture? What are the trade-offs between thread-level parallelism and instruction-level parallelism?
**Why do memory access patterns matter more on GPUs than CPUs?**
Think about how GPU memory hierarchy (global memory, shared memory, registers) differs from CPU caches. How does memory coalescing affect bandwidth utilization, and why do random access patterns cause such dramatic performance degradation on GPUs?
**How do you handle load balancing across thousands of GPU threads?**
When processing variable-sized data or irregular computations, how do you ensure all GPU cores stay busy? What strategies exist for handling workload imbalances, and how do frameworks like PyTorch handle dynamic shapes efficiently?
**What role do GPU warps play in kernel optimization?**
NVIDIA GPUs execute threads in groups of 32 (warps). How does this affect branching, memory access, and algorithm design? Why is warp divergence such a critical performance consideration, and how do you design algorithms to minimize it?
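A small NumPy sketch of the predication idea behind the warp question above: instead of branching per element (which diverges on a GPU), every lane computes both paths and a mask selects the result. Illustrative only; the names are made up for the example.

```python
import numpy as np

x = np.random.randn(1024)

# Branchy formulation: on a GPU, threads in a warp would take different paths
branchy = np.array([xi * 2.0 if xi > 0 else xi * 0.5 for xi in x])

# Predicated formulation: all lanes do the same work; a mask picks the result
predicated = np.where(x > 0, x * 2.0, x * 0.5)

assert np.allclose(branchy, predicated)
```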
### Custom CUDA Kernel Development
**When should you write custom CUDA kernels versus using library functions?**
Given that libraries like cuDNN and cuBLAS are highly optimized, when does it make sense to write custom kernels? Consider scenarios like novel layer types, fused operations, or hardware-specific optimizations.
**How do you optimize CUDA kernels for different GPU generations?**
GPU architectures evolve rapidly (Pascal → Volta → Ampere → Hopper). How do optimization strategies change across generations? What are the implications of new features like tensor cores, multi-instance GPU, and transformer engines?
**What's the development workflow for production CUDA kernels?**
Consider the entire pipeline from prototype to production: profiling bottlenecks, writing initial kernels, optimization iterations, testing across hardware, and deployment. How do companies like OpenAI and Google manage kernel development at scale?
**How do you ensure numerical stability in custom kernels?**
Custom kernels often involve low-level optimizations that can affect numerical precision. How do you balance performance with accuracy? What testing strategies ensure kernels produce correct results across different data ranges and edge cases?
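The numerical stability question has a failure mode worth seeing once. Below, a float16 accumulator silently stalls long before reaching the true sum, while accumulating the same float16 inputs in float32 stays accurate. This is the standard argument for fp32 accumulation inside reduced-precision kernels (illustrative sketch, not production code):

```python
import numpy as np

values = np.full(100_000, 0.01, dtype=np.float16)  # true sum ~1000

acc16 = np.float16(0.0)
acc32 = np.float32(0.0)
for v in values:
    acc16 = np.float16(acc16 + v)  # rounds to fp16 at every step
    acc32 += np.float32(v)         # fp16 inputs, fp32 accumulator

print(f"fp16 accumulator: {float(acc16):.1f}")  # stalls near 32
print(f"fp32 accumulator: {float(acc32):.1f}")  # ~1000
```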
### Triton and Kernel Languages
**How does Triton compare to CUDA for kernel development?**
Triton promises Python-like syntax while generating efficient GPU code. What are the trade-offs between ease of development and performance control? When would you choose Triton over CUDA or vice versa?
**What role do domain-specific languages play in kernel optimization?**
Beyond CUDA and Triton, consider languages like OpenCL, HIP, and emerging alternatives. How do these languages abstract hardware differences while maintaining performance? What's the future of cross-platform kernel development?
**How do JIT compilation and auto-tuning affect kernel performance?**
Modern frameworks use just-in-time compilation to optimize kernels for specific inputs and hardware. How does this compare to static optimization? What are the implications for deployment, cold start times, and reproducibility?
**What are the challenges of kernel portability across hardware vendors?**
With AMD GPUs, Intel GPUs, and custom accelerators becoming more common, how do you write kernels that perform well across different architectures? What abstraction layers exist, and what are their performance costs?
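For concreteness, this is roughly what the canonical Triton "hello world" (an elementwise add) looks like, based on Triton's public tutorials. It assumes the triton and torch packages and a CUDA-capable GPU; treat it as a sketch:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                        # one program per block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                        # guard the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)                     # number of program instances
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

Note how block size, masking, and grid layout stay explicit, yet there is no per-thread index arithmetic or shared-memory bookkeeping as in raw CUDA; that is precisely the trade-off the question asks about.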
### Hardware-Specific Optimizations
**How do you optimize kernels for different memory hierarchies?**
Consider the differences between GPU global memory, shared memory, and registers versus CPU caches. How do you design algorithms that effectively use each level of the hierarchy? What happens when your working set exceeds cache capacity?
**What optimization strategies work best for tensor operations?**
Tensor cores on modern GPUs can dramatically accelerate mixed-precision operations. How do you restructure algorithms to take advantage of these specialized units? What are the constraints on data layout, precision, and problem sizes?
**How do you handle precision trade-offs in optimized kernels?**
Production systems often use int8, fp16, or bfloat16 for performance. How do you maintain model accuracy while using reduced precision? What accumulation strategies prevent numerical issues in long computations?
**What role does compiler optimization play in kernel performance?**
Modern GPU compilers perform sophisticated optimizations like loop unrolling, memory access optimization, and instruction scheduling. How do you write kernel code that works well with these optimizations? When do you need to use inline assembly or intrinsics?
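One concrete, low-effort version of the tensor core advice above is shape padding: rounding matrix dimensions up to a multiple of 8 (for fp16 inputs) so the matrix-multiply units can cover the whole problem. A hypothetical helper:

```python
import numpy as np

def pad_for_tensor_cores(m, multiple=8):
    # Zero-pad a 2D matrix so both dimensions are multiples of `multiple`
    rows, cols = m.shape
    return np.pad(m, ((0, (-rows) % multiple), (0, (-cols) % multiple)))

a = np.random.randn(1000, 513).astype(np.float16)
print(a.shape, "->", pad_for_tensor_cores(a).shape)  # (1000, 513) -> (1000, 520)
```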
### Production GPU Clusters
**How do you scale kernel optimizations across multi-GPU systems?**
Single-node multi-GPU systems require coordination of memory transfers, computation scheduling, and synchronization. How do you design kernels that scale efficiently across 8-16 GPUs? What are the bottlenecks in multi-GPU scaling?
**What are the challenges of distributed training with custom kernels?**
When scaling to hundreds or thousands of GPUs across multiple nodes, network communication becomes critical. How do custom kernels interact with distributed training frameworks? What optimizations exist for gradient synchronization and parameter updates?
**How do you manage kernel deployment in production clusters?**
Production ML systems need to handle hardware failures, software updates, and varying workloads. How do you deploy and manage custom kernels across heterogeneous clusters? What strategies exist for A/B testing kernel optimizations safely?
**What monitoring and debugging tools exist for production GPU workloads?**
When kernels behave unexpectedly in production, how do you diagnose issues? What metrics matter for kernel performance monitoring? How do you correlate kernel performance with higher-level model metrics like accuracy and throughput?
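A back-of-the-envelope model is often enough to frame these scaling questions: per-step compute time shrinks with the GPU count while an all-reduce adds a communication term that grows with it, so efficiency falls off predictably. A hypothetical sketch (the constants are made up):

```python
def scaling_efficiency(t_comp: float, t_comm_per_gpu: float, n_gpus: int) -> float:
    # Ring-allreduce-style model: compute divides by N, communication grows with N
    t_parallel = t_comp / n_gpus + t_comm_per_gpu * (n_gpus - 1) / n_gpus
    return t_comp / (n_gpus * t_parallel)

for n in (1, 2, 4, 8, 16):
    print(f"{n:>2} GPUs: {scaling_efficiency(100.0, 2.0, n):.0%} efficient")
```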
"""