# --- # jupyter: # jupytext: # text_representation: # extension: .py # format_name: percent # format_version: '1.3' # jupytext_version: 1.17.1 # kernelspec: # display_name: Python 3 (ipykernel) # language: python # name: python3 # --- # %% [markdown] """ # Module 14: Profiling - Measuring What Matters in ML Systems Welcome to Module 14! You'll build professional profiling tools to measure model performance and uncover optimization opportunities. ## πŸ”— Prerequisites & Progress **You've Built**: Complete ML stack from tensors to transformers **You'll Build**: Comprehensive profiling system for parameters, FLOPs, memory, and latency **You'll Enable**: Data-driven optimization decisions and performance analysis **Connection Map**: ``` All Modules (01-13) β†’ Profiling (14) β†’ Optimization Techniques (15-18) (implementations) (measurement) (targeted fixes) ``` **Before starting this module, verify:** - [ ] Module 01 (Tensor): Core tensor operations - [ ] Module 03 (Layers): Linear layer implementation - [ ] Module 08 (Spatial): Convolutional operations This module can work standalone with minimal Tensor implementation, but full functionality requires previous modules for realistic profiling scenarios ## Learning Objectives By the end of this module, you will: 1. Implement a complete Profiler class for model analysis 2. Count parameters and FLOPs accurately for different architectures 3. Measure memory usage and latency with statistical rigor 4. Create production-quality performance analysis tools Let's build the measurement foundation for ML systems optimization! ## πŸ“¦ Where This Code Lives in the Final Package **Learning Side:** You work in `modules/14_profiling/profiling_dev.py` **Building Side:** Code exports to `tinytorch.profiling.profiler` ```python # How to use this module: from tinytorch.profiling.profiler import Profiler, profile_forward_pass, profile_backward_pass ``` **Why this matters:** - **Learning:** Complete profiling system for understanding model performance characteristics - **Production:** Professional measurement tools like those used in PyTorch, TensorFlow - **Consistency:** All profiling and measurement tools in profiling.profiler - **Integration:** Works with any model built using TinyTorch components """ # %% nbgrader={"grade": false, "grade_id": "imports", "solution": true} #| default_exp profiling.profiler #| export import sys import os import time import numpy as np import tracemalloc from typing import Dict, List, Any, Optional, Tuple from collections import defaultdict import gc # Import from TinyTorch package (previous modules must be completed and exported) from tinytorch.core.tensor import Tensor from tinytorch.core.layers import Linear from tinytorch.core.spatial import Conv2d # Constants for memory and performance measurement BYTES_PER_FLOAT32 = 4 # Standard float32 size in bytes KB_TO_BYTES = 1024 # Kilobytes to bytes conversion MB_TO_BYTES = 1024 * 1024 # Megabytes to bytes conversion # %% [markdown] """ ## 1. Introduction: Why Profiling Matters in ML Systems Imagine you're a detective investigating a performance crime. Your model is running slowly, using too much memory, or burning through compute budgets. Without profiling, you're flying blind - making guesses about what to optimize. With profiling, you have evidence. 
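
To make this concrete before we build anything, here is a tiny standalone sketch of evidence-gathering: it times a NumPy matrix multiply with warmup runs and a median, exactly the discipline the `Profiler` below formalizes. It assumes nothing beyond NumPy and the standard library, and the numbers it prints depend entirely on your machine.

```python
import time
import numpy as np

x = np.random.randn(256, 256).astype(np.float32)
w = np.random.randn(256, 256).astype(np.float32)

for _ in range(5):           # warmup: stabilize caches before timing
    x @ w

times_ms = []
for _ in range(50):          # many runs for statistical significance
    t0 = time.perf_counter()
    x @ w
    times_ms.append((time.perf_counter() - t0) * 1000)

flops = 2 * 256 * 256 * 256              # multiply + add per output element
median_ms = float(np.median(times_ms))   # median is robust to outliers
print(f"median: {median_ms:.3f} ms, {flops / 1e9 / (median_ms / 1000):.1f} GFLOP/s")
```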
**The Performance Investigation Process:** ``` Suspect Model β†’ Profile Evidence β†’ Identify Bottleneck β†’ Target Optimization ↓ ↓ ↓ ↓ "Too slow" "200 GFLOP/s" "Memory bound" "Reduce transfers" ``` **Questions Profiling Answers:** - **How many parameters?** (Memory footprint, model size) - **How many FLOPs?** (Computational cost, energy usage) - **Where are bottlenecks?** (Memory vs compute bound) - **What's actual latency?** (Real-world performance) **Production Importance:** In production ML systems, profiling isn't optional - it's survival. A model that's 10% more accurate but 100Γ— slower often can't be deployed. Teams use profiling daily to make data-driven optimization decisions, not guesses. ### The Profiling Workflow Visualization ``` Model β†’ Profiler β†’ Measurements β†’ Analysis β†’ Optimization Decision ↓ ↓ ↓ ↓ ↓ GPT Parameter 125M params Memory Use quantization Counter 2.5B FLOPs bound Reduce precision ``` """ # %% [markdown] """ ### πŸ”— From Implementation to Optimization: The Profiling Foundation **In this module (14)**, you'll build the measurement tools to discover optimization opportunities. **In later modules (15+)**, you'll use these profiling insights to implement optimizations like KV caching. **The Real ML Engineering Workflow**: ``` Step 1: Measure (This Module!) Step 2: Analyze ↓ ↓ Profile baseline β†’ Find bottleneck β†’ Understand cause 40 tok/s 80% in attention O(nΒ²) recomputation ↓ Step 4: Validate Step 3: Optimize (Future Modules) ↓ ↓ Profile optimized ← Verify speedup ← Implement optimization 500 tok/s (12.5x) Measure impact Design solution ``` **Without profiling**: You'd never know WHERE to optimize! **Without measurement**: You couldn't verify improvements! This module teaches the measurement and analysis skills that enable optimization breakthroughs. You'll profile real models and discover bottlenecks just like production ML teams do. """ # %% [markdown] """ ## 2. Foundations: Performance Measurement Principles Before we build our profiler, let's understand what we're measuring and why each metric matters. ### Parameter Counting - Model Size Detective Work Parameters determine your model's memory footprint and storage requirements. Every parameter is typically a 32-bit float (4 bytes), so counting them precisely predicts memory usage. **Parameter Counting Formula:** ``` Linear Layer: (input_features Γ— output_features) + output_features ↑ ↑ ↑ Weight matrix Bias vector Total parameters Example: Linear(768, 3072) β†’ (768 Γ— 3072) + 3072 = 2,362,368 parameters Memory: 2,362,368 Γ— 4 bytes = 9.45 MB ``` ### FLOP Counting - Computational Cost Analysis FLOPs (Floating Point Operations) measure computational work. Unlike wall-clock time, FLOPs are hardware-independent and predict compute costs across different systems. 
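
As a sanity check, the `Linear(768, 3072)` example above can be verified with three lines of arithmetic. One unit caveat: the 9.45 MB figure uses decimal megabytes (10^6 bytes); dividing by 2^20 instead (the `MB_TO_BYTES` constant defined earlier) gives about 9.01 MiB.

```python
in_features, out_features = 768, 3072
params = in_features * out_features + out_features  # weight matrix + bias vector
print(params)                       # 2362368
print(params * 4 / 1e6)             # ~9.45 MB (decimal, as in the example above)
print(params * 4 / (1024 * 1024))   # ~9.01 MiB (binary convention)
```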
**FLOP Formulas for Key Operations:** ``` Matrix Multiplication (M,K) @ (K,N): FLOPs = M Γ— N Γ— K Γ— 2 ↑ ↑ ↑ ↑ Rows Cols Inner Multiply+Add Linear Layer Forward: FLOPs = batch_size Γ— input_features Γ— output_features Γ— 2 ↑ ↑ ↑ Matmul cost Bias add Operations Convolution (simplified): FLOPs = output_H Γ— output_W Γ— kernel_H Γ— kernel_W Γ— in_channels Γ— out_channels Γ— 2 ``` ### Memory Profiling - The Three Types of Memory ML models use memory in three distinct ways, each with different optimization strategies: **Memory Type Breakdown:** ``` Total Training Memory = Parameters + Activations + Gradients + Optimizer State ↓ ↓ ↓ ↓ Model Forward Backward Adam: 2Γ—params weights pass cache gradients SGD: 0Γ—params Example for 125M parameter model: Parameters: 500 MB (125M Γ— 4 bytes) Activations: 200 MB (depends on batch size) Gradients: 500 MB (same as parameters) Adam state: 1,000 MB (momentum + velocity) Total: 2,200 MB (4.4Γ— parameter memory!) ``` ### Latency Measurement - Dealing with Reality Latency measurement is tricky because systems have variance, warmup effects, and measurement overhead. Professional profiling requires statistical rigor. **Latency Measurement Best Practices:** ``` Measurement Protocol: 1. Warmup runs (10+) β†’ CPU/GPU caches warm up 2. Timed runs (100+) β†’ Statistical significance 3. Outlier handling β†’ Use median, not mean 4. Memory cleanup β†’ Prevent contamination Timeline: Warmup: [run][run][run]...[run] ← Don't time these Timing: [⏱run⏱][⏱run⏱]...[⏱run⏱] ← Time these Result: median(all_times) ← Robust to outliers ``` """ # %% [markdown] """ ## 3. Implementation: Building the Core Profiler Class Now let's implement our profiler step by step. We'll start with the foundation and build up to comprehensive analysis. ### The Profiler Architecture ``` Profiler Class β”œβ”€β”€ count_parameters() β†’ Model size analysis β”œβ”€β”€ count_flops() β†’ Computational cost estimation β”œβ”€β”€ measure_memory() β†’ Memory usage tracking β”œβ”€β”€ measure_latency() β†’ Performance timing β”œβ”€β”€ profile_layer() β†’ Layer-wise analysis β”œβ”€β”€ profile_forward_pass() β†’ Complete forward analysis └── profile_backward_pass() β†’ Training analysis Integration: All methods work together to provide comprehensive performance insights ``` """ # %% nbgrader={"grade": false, "grade_id": "profiler_class", "solution": true} #| export class Profiler: """ Professional-grade ML model profiler for performance analysis. Measures parameters, FLOPs, memory usage, and latency with statistical rigor. Used for optimization guidance and deployment planning. """ def __init__(self): """ Initialize profiler with measurement state. TODO: Set up profiler tracking structures APPROACH: 1. Create empty measurements dictionary 2. Initialize operation counters 3. Set up memory tracking state EXAMPLE: >>> profiler = Profiler() >>> profiler.measurements {} HINTS: - Use defaultdict(int) for operation counters - measurements dict will store timing results """ ### BEGIN SOLUTION self.measurements = {} self.operation_counts = defaultdict(int) self.memory_tracker = None ### END SOLUTION def count_parameters(self, model) -> int: """ Count total trainable parameters in a model. TODO: Implement parameter counting for any model with parameters() method APPROACH: 1. Get all parameters from model.parameters() if available 2. For single layers, count weight and bias directly 3. 
Sum total element count across all parameter tensors EXAMPLE: >>> linear = Linear(128, 64) # 128*64 + 64 = 8256 parameters >>> profiler = Profiler() >>> count = profiler.count_parameters(linear) >>> print(count) 8256 HINTS: - Use parameter.data.size for tensor element count - Handle models with and without parameters() method - Don't forget bias terms when present """ ### BEGIN SOLUTION total_params = 0 # Handle SimpleModel pattern (has .layers attribute) if hasattr(model, 'layers'): # SimpleModel: iterate through layers for layer in model.layers: for param in layer.parameters(): total_params += param.data.size elif hasattr(model, 'parameters'): # Model with direct parameters() method for param in model.parameters(): total_params += param.data.size elif hasattr(model, 'weight'): # Single layer (Linear, Conv2d) - all have .weight total_params += model.weight.data.size # Check for bias (may be None) if hasattr(model, 'bias') and model.bias is not None: total_params += model.bias.data.size else: # No parameters (activations, etc.) total_params = 0 return total_params ### END SOLUTION def count_flops(self, model, input_shape: Tuple[int, ...]) -> int: """ Count FLOPs (Floating Point Operations) for one forward pass. TODO: Implement FLOP counting for different layer types APPROACH: 1. Create dummy input with given shape 2. Calculate FLOPs based on layer type and dimensions 3. Handle different model architectures (Linear, Conv2d, Sequential) LAYER-SPECIFIC FLOP FORMULAS: - Linear: input_features Γ— output_features Γ— 2 (matmul + bias) - Conv2d: output_h Γ— output_w Γ— kernel_h Γ— kernel_w Γ— in_channels Γ— out_channels Γ— 2 - Activation: Usually 1 FLOP per element (ReLU, Sigmoid) EXAMPLE: >>> linear = Linear(128, 64) >>> profiler = Profiler() >>> flops = profiler.count_flops(linear, (1, 128)) >>> print(flops) # 128 * 64 * 2 = 16384 16384 HINTS: - Batch dimension doesn't affect per-sample FLOPs - Focus on major operations (matmul, conv) first - For Sequential models, sum FLOPs of all layers """ ### BEGIN SOLUTION # Create dummy input (unused but kept for interface consistency) _dummy_input = Tensor(np.random.randn(*input_shape)) total_flops = 0 # Handle different model types if hasattr(model, '__class__'): model_name = model.__class__.__name__ if model_name == 'Linear': # Linear layer: input_features Γ— output_features Γ— 2 in_features = input_shape[-1] out_features = model.weight.shape[1] if hasattr(model, 'weight') else 1 total_flops = in_features * out_features * 2 elif model_name == 'Conv2d': # Conv2d layer: complex calculation based on output size # Simplified: assume we know the output dimensions if hasattr(model, 'kernel_size') and hasattr(model, 'in_channels'): _batch_size = input_shape[0] if len(input_shape) > 3 else 1 in_channels = model.in_channels out_channels = model.out_channels kernel_h = kernel_w = model.kernel_size # Estimate output size (simplified) input_h, input_w = input_shape[-2], input_shape[-1] output_h = input_h // (model.stride if hasattr(model, 'stride') else 1) output_w = input_w // (model.stride if hasattr(model, 'stride') else 1) total_flops = (output_h * output_w * kernel_h * kernel_w * in_channels * out_channels * 2) elif model_name == 'Sequential': # Sequential model: sum FLOPs of all layers current_shape = input_shape for layer in model.layers: layer_flops = self.count_flops(layer, current_shape) total_flops += layer_flops # Update shape for next layer (simplified) if hasattr(layer, 'weight'): current_shape = current_shape[:-1] + (layer.weight.shape[1],) else: # 
Activation or other: assume 1 FLOP per element
                total_flops = int(np.prod(input_shape))

        return total_flops
        ### END SOLUTION

    def measure_memory(self, model, input_shape: Tuple[int, ...]) -> Dict[str, float]:
        """
        Measure memory usage during forward pass.

        TODO: Implement memory tracking for model execution

        APPROACH:
        1. Use tracemalloc to track memory allocation
        2. Measure baseline memory before model execution
        3. Run forward pass and track peak usage
        4. Calculate different memory components

        RETURN DICTIONARY:
        - 'parameter_memory_mb': Memory for model parameters
        - 'activation_memory_mb': Memory for activations
        - 'peak_memory_mb': Maximum memory usage
        - 'memory_efficiency': Ratio of useful to total memory

        EXAMPLE:
        >>> linear = Linear(1024, 512)
        >>> profiler = Profiler()
        >>> memory = profiler.measure_memory(linear, (32, 1024))
        >>> print(f"Parameters: {memory['parameter_memory_mb']:.1f} MB")
        Parameters: 2.1 MB

        HINTS:
        - Use tracemalloc.start() and tracemalloc.get_traced_memory()
        - Account for float32 = 4 bytes per parameter
        - Activation memory scales with batch size
        """
        ### BEGIN SOLUTION
        # Start memory tracking
        tracemalloc.start()

        # Capture baseline memory so peak usage can be reported relative to it
        baseline_memory = tracemalloc.get_traced_memory()[0]

        # Calculate parameter memory
        param_count = self.count_parameters(model)
        parameter_memory_bytes = param_count * BYTES_PER_FLOAT32
        parameter_memory_mb = parameter_memory_bytes / MB_TO_BYTES

        # Create input and measure activation memory
        dummy_input = Tensor(np.random.randn(*input_shape))
        input_memory_bytes = dummy_input.data.nbytes

        # Estimate activation memory (simplified)
        activation_memory_bytes = input_memory_bytes * 2  # Rough estimate
        activation_memory_mb = activation_memory_bytes / MB_TO_BYTES

        # Run forward pass to measure peak memory usage
        _ = model.forward(dummy_input)

        # Get peak memory relative to the baseline captured above
        _current_memory, peak_memory = tracemalloc.get_traced_memory()
        peak_memory_mb = (peak_memory - baseline_memory) / MB_TO_BYTES
        tracemalloc.stop()

        # Calculate efficiency
        useful_memory = parameter_memory_mb + activation_memory_mb
        memory_efficiency = useful_memory / max(peak_memory_mb, 0.001)  # Avoid division by zero

        return {
            'parameter_memory_mb': parameter_memory_mb,
            'activation_memory_mb': activation_memory_mb,
            'peak_memory_mb': max(peak_memory_mb, useful_memory),
            'memory_efficiency': min(memory_efficiency, 1.0)
        }
        ### END SOLUTION

    def measure_latency(self, model, input_tensor, warmup: int = 10, iterations: int = 100) -> float:
        """
        Measure model inference latency with statistical rigor.

        TODO: Implement accurate latency measurement

        APPROACH:
        1. Run warmup iterations to stabilize performance
        2. Measure multiple iterations for statistical accuracy
        3. Calculate median latency to handle outliers
        4.
Return latency in milliseconds PARAMETERS: - warmup: Number of warmup runs (default 10) - iterations: Number of measurement runs (default 100) EXAMPLE: >>> linear = Linear(128, 64) >>> input_tensor = Tensor(np.random.randn(1, 128)) >>> profiler = Profiler() >>> latency = profiler.measure_latency(linear, input_tensor) >>> print(f"Latency: {latency:.2f} ms") Latency: 0.15 ms HINTS: - Use time.perf_counter() for high precision - Use median instead of mean for robustness against outliers - Handle different model interfaces (forward, __call__) """ ### BEGIN SOLUTION # Warmup runs to stabilize performance for _ in range(warmup): _ = model.forward(input_tensor) # Measurement runs times = [] for _ in range(iterations): start_time = time.perf_counter() _ = model.forward(input_tensor) end_time = time.perf_counter() times.append((end_time - start_time) * 1000) # Convert to milliseconds # Calculate statistics - use median for robustness times = np.array(times) median_latency = np.median(times) return float(median_latency) ### END SOLUTION def profile_layer(self, layer, input_shape: Tuple[int, ...]) -> Dict[str, Any]: """ Profile a single layer comprehensively. TODO: Implement layer-wise profiling APPROACH: 1. Count parameters for this layer 2. Count FLOPs for this layer 3. Measure memory usage 4. Measure latency 5. Return comprehensive layer profile EXAMPLE: >>> linear = Linear(256, 128) >>> profiler = Profiler() >>> profile = profiler.profile_layer(linear, (32, 256)) >>> print(f"Layer uses {profile['parameters']} parameters") Layer uses 32896 parameters HINTS: - Use existing profiler methods (count_parameters, count_flops, etc.) - Create dummy input for latency measurement - Include layer type information in profile """ ### BEGIN SOLUTION # Create dummy input for latency measurement dummy_input = Tensor(np.random.randn(*input_shape)) # Gather all measurements params = self.count_parameters(layer) flops = self.count_flops(layer, input_shape) memory = self.measure_memory(layer, input_shape) latency = self.measure_latency(layer, dummy_input, warmup=3, iterations=10) # Compute derived metrics gflops_per_second = (flops / 1e9) / max(latency / 1000, 1e-6) return { 'layer_type': layer.__class__.__name__, 'parameters': params, 'flops': flops, 'latency_ms': latency, 'gflops_per_second': gflops_per_second, **memory } ### END SOLUTION def profile_forward_pass(self, model, input_tensor) -> Dict[str, Any]: """ Comprehensive profiling of a model's forward pass. TODO: Implement complete forward pass analysis APPROACH: 1. Use Profiler class to gather all measurements 2. Create comprehensive performance profile 3. Add derived metrics and insights 4. 
Return structured analysis results RETURN METRICS: - All basic profiler measurements - FLOPs per second (computational efficiency) - Memory bandwidth utilization - Performance bottleneck identification EXAMPLE: >>> model = Linear(256, 128) >>> input_data = Tensor(np.random.randn(32, 256)) >>> profiler = Profiler() >>> profile = profiler.profile_forward_pass(model, input_data) >>> print(f"Throughput: {profile['gflops_per_second']:.2f} GFLOP/s") Throughput: 2.45 GFLOP/s HINTS: - GFLOP/s = (FLOPs / 1e9) / (latency_ms / 1000) - Memory bandwidth = memory_mb / (latency_ms / 1000) - Consider realistic hardware limits for efficiency calculations """ ### BEGIN SOLUTION # Basic measurements param_count = self.count_parameters(model) flops = self.count_flops(model, input_tensor.shape) memory_stats = self.measure_memory(model, input_tensor.shape) latency_ms = self.measure_latency(model, input_tensor, warmup=5, iterations=20) # Derived metrics latency_seconds = latency_ms / 1000.0 gflops_per_second = (flops / 1e9) / max(latency_seconds, 1e-6) # Memory bandwidth (MB/s) memory_bandwidth = memory_stats['peak_memory_mb'] / max(latency_seconds, 1e-6) # Efficiency metrics theoretical_peak_gflops = 100.0 # Assume 100 GFLOP/s theoretical peak for CPU computational_efficiency = min(gflops_per_second / theoretical_peak_gflops, 1.0) # Bottleneck analysis is_memory_bound = memory_bandwidth > gflops_per_second * 100 # Rough heuristic is_compute_bound = not is_memory_bound return { # Basic measurements 'parameters': param_count, 'flops': flops, 'latency_ms': latency_ms, **memory_stats, # Derived metrics 'gflops_per_second': gflops_per_second, 'memory_bandwidth_mbs': memory_bandwidth, 'computational_efficiency': computational_efficiency, # Bottleneck analysis 'is_memory_bound': is_memory_bound, 'is_compute_bound': is_compute_bound, 'bottleneck': 'memory' if is_memory_bound else 'compute' } ### END SOLUTION def profile_backward_pass(self, model, input_tensor, _loss_fn=None) -> Dict[str, Any]: """ Profile both forward and backward passes for training analysis. TODO: Implement training-focused profiling APPROACH: 1. Profile forward pass first 2. Estimate backward pass costs (typically 2Γ— forward) 3. Calculate total training iteration metrics 4. 
Analyze memory requirements for gradients and optimizers BACKWARD PASS ESTIMATES: - FLOPs: ~2Γ— forward pass (gradient computation) - Memory: +1Γ— parameters (gradient storage) - Latency: ~2Γ— forward pass (more complex operations) EXAMPLE: >>> model = Linear(128, 64) >>> input_data = Tensor(np.random.randn(16, 128)) >>> profiler = Profiler() >>> profile = profiler.profile_backward_pass(model, input_data) >>> print(f"Training iteration: {profile['total_latency_ms']:.2f} ms") Training iteration: 0.45 ms HINTS: - Total memory = parameters + activations + gradients - Optimizer memory depends on algorithm (SGD: 0Γ—, Adam: 2Γ—) - Consider gradient accumulation effects """ ### BEGIN SOLUTION # Get forward pass profile forward_profile = self.profile_forward_pass(model, input_tensor) # Estimate backward pass (typically 2Γ— forward) backward_flops = forward_profile['flops'] * 2 backward_latency_ms = forward_profile['latency_ms'] * 2 # Gradient memory (equal to parameter memory) gradient_memory_mb = forward_profile['parameter_memory_mb'] # Total training iteration total_flops = forward_profile['flops'] + backward_flops total_latency_ms = forward_profile['latency_ms'] + backward_latency_ms total_memory_mb = (forward_profile['parameter_memory_mb'] + forward_profile['activation_memory_mb'] + gradient_memory_mb) # Training efficiency total_gflops_per_second = (total_flops / 1e9) / (total_latency_ms / 1000.0) # Optimizer memory estimates optimizer_memory_estimates = { 'sgd': 0, # No extra memory 'adam': gradient_memory_mb * 2, # Momentum + velocity 'adamw': gradient_memory_mb * 2, # Same as Adam } return { # Forward pass 'forward_flops': forward_profile['flops'], 'forward_latency_ms': forward_profile['latency_ms'], 'forward_memory_mb': forward_profile['peak_memory_mb'], # Backward pass estimates 'backward_flops': backward_flops, 'backward_latency_ms': backward_latency_ms, 'gradient_memory_mb': gradient_memory_mb, # Total training iteration 'total_flops': total_flops, 'total_latency_ms': total_latency_ms, 'total_memory_mb': total_memory_mb, 'total_gflops_per_second': total_gflops_per_second, # Optimizer memory requirements 'optimizer_memory_estimates': optimizer_memory_estimates, # Training insights 'memory_efficiency': forward_profile['memory_efficiency'], 'bottleneck': forward_profile['bottleneck'] } ### END SOLUTION # %% [markdown] """ ## Helper Functions - Quick Profiling Utilities These helper functions provide simplified interfaces for common profiling tasks. They make it easy to quickly profile models and analyze characteristics without manually calling multiple profiler methods. ### Why Helper Functions Matter In production ML engineering, you often need quick insights without setting up full profiling workflows. These utilities provide: - **Quick profiling**: One-line model analysis with formatted output - **Weight analysis**: Understanding parameter distributions for compression - **Student-friendly output**: Clear, formatted results for learning These functions wrap our core Profiler class with convenience interfaces used in real ML workflows for rapid iteration and debugging. """ # %% nbgrader={"grade": false, "grade_id": "helper_quick_profile", "solution": true} #| export def quick_profile(model, input_tensor, profiler=None): """ Quick profiling function for immediate insights. Provides a simplified interface for profiling that displays key metrics in a student-friendly format. 
Args: model: Model to profile input_tensor: Input data for profiling profiler: Optional Profiler instance (creates new one if None) Returns: dict: Profile results with key metrics Example: >>> model = Linear(128, 64) >>> input_data = Tensor(np.random.randn(16, 128)) >>> results = quick_profile(model, input_data) >>> # Displays formatted output automatically """ if profiler is None: profiler = Profiler() profile = profiler.profile_forward_pass(model, input_tensor) # Display formatted results print("πŸ”¬ Quick Profile Results:") print(f" Parameters: {profile['parameters']:,}") print(f" FLOPs: {profile['flops']:,}") print(f" Latency: {profile['latency_ms']:.2f} ms") print(f" Memory: {profile['peak_memory_mb']:.2f} MB") print(f" Bottleneck: {profile['bottleneck']}") print(f" Efficiency: {profile['computational_efficiency']*100:.1f}%") return profile # %% nbgrader={"grade": false, "grade_id": "helper_weight_distribution", "solution": true} #| export def analyze_weight_distribution(model, percentiles=[10, 25, 50, 75, 90]): """ Analyze weight distribution for compression insights. Helps understand which weights are small and might be prunable. Used by Module 17 (Compression) to motivate pruning. Args: model: Model to analyze percentiles: List of percentiles to compute Returns: dict: Weight distribution statistics Example: >>> model = Linear(512, 512) >>> stats = analyze_weight_distribution(model) >>> print(f"Weights < 0.01: {stats['below_threshold_001']:.1f}%") """ # Collect all weights weights = [] if hasattr(model, 'parameters'): for param in model.parameters(): weights.extend(param.data.flatten().tolist()) elif hasattr(model, 'weight'): weights.extend(model.weight.data.flatten().tolist()) else: return {'error': 'No weights found'} weights = np.array(weights) abs_weights = np.abs(weights) # Calculate statistics stats = { 'total_weights': len(weights), 'mean': float(np.mean(abs_weights)), 'std': float(np.std(abs_weights)), 'min': float(np.min(abs_weights)), 'max': float(np.max(abs_weights)), } # Percentile analysis for p in percentiles: stats[f'percentile_{p}'] = float(np.percentile(abs_weights, p)) # Threshold analysis (useful for pruning) for threshold in [0.001, 0.01, 0.1]: below = np.sum(abs_weights < threshold) / len(weights) * 100 stats[f'below_threshold_{str(threshold).replace(".", "")}'] = below return stats # %% [markdown] """ ### πŸ§ͺ Unit Test: Helper Functions This test validates our helper utilities work correctly and provide useful output. 
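
As a preview of what the test below checks, here is one illustrative way to call `analyze_weight_distribution` on a minimal stand-in model. The class name and sizes are arbitrary inventions for this sketch, and the printed statistics vary with the random draw:

```python
class TinyModel:  # hypothetical stand-in with a single weight tensor
    def __init__(self):
        self.weight = Tensor(np.random.randn(32, 16) * 0.05)

stats = analyze_weight_distribution(TinyModel())
print(stats['total_weights'])                       # 512
print(f"median |w|: {stats['percentile_50']:.4f}")
print(f"share below 0.01: {stats['below_threshold_001']:.1f}%")
```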
**What we're testing**: Quick profiling and weight distribution analysis **Why it matters**: These utilities are used daily in production ML workflows **Expected**: Correct profiles with formatted output """ # %% nbgrader={"grade": true, "grade_id": "test_helper_functions", "locked": true, "points": 5} def test_unit_helper_functions(): """πŸ”¬ Test helper function implementations.""" print("πŸ”¬ Unit Test: Helper Functions...") # Test 1: Quick profile function from tinytorch.core.layers import Linear test_model = Linear(16, 8) test_input = Tensor(np.random.randn(8, 16)) profile = quick_profile(test_model, test_input, profiler=Profiler()) # Validate profile contains expected keys assert 'parameters' in profile, "Quick profile should include parameters" assert 'flops' in profile, "Quick profile should include FLOPs" assert 'latency_ms' in profile, "Quick profile should include latency" print("βœ… Quick profile provides comprehensive metrics") # Test 2: Weight distribution analysis class SimpleModel: def __init__(self): self.weight = Tensor(np.random.randn(10, 5) * 0.1) # Small weights model = SimpleModel() stats = analyze_weight_distribution(model) # Validate statistics structure assert 'total_weights' in stats, "Should count total weights" assert 'mean' in stats, "Should compute mean" assert 'std' in stats, "Should compute standard deviation" assert stats['total_weights'] == 50, f"Expected 50 weights, got {stats['total_weights']}" print(f"βœ… Weight distribution analysis: {stats['total_weights']} weights analyzed") # Test 3: Weight distribution with no weights class NoWeightModel: pass no_weight_model = NoWeightModel() stats = analyze_weight_distribution(no_weight_model) assert 'error' in stats, "Should handle models without weights" print("βœ… Handles models without weights gracefully") print("βœ… Helper functions work correctly!") if __name__ == "__main__": test_unit_helper_functions() # %% [markdown] """ ## Parameter Counting - Model Size Analysis Parameter counting is the foundation of model profiling. Every parameter contributes to memory usage, training time, and model complexity. Let's validate our implementation. ### Why Parameter Counting Matters ``` Model Deployment Pipeline: Parameters β†’ Memory β†’ Hardware β†’ Cost ↓ ↓ ↓ ↓ 125M 500MB 8GB GPU $200/month Parameter Growth Examples: Small: GPT-2 Small (124M parameters) β†’ 500MB memory Medium: GPT-2 Medium (350M parameters) β†’ 1.4GB memory Large: GPT-2 Large (774M parameters) β†’ 3.1GB memory XL: GPT-2 XL (1.5B parameters) β†’ 6.0GB memory ``` """ # %% [markdown] """ ### πŸ§ͺ Unit Test: Parameter Counting This test validates our parameter counting works correctly for different model types. 
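
The memory column in the growth table above is just parameters Γ— 4 bytes (float32, decimal units), so a three-line loop reproduces it. Real deployments add activation and framework overhead on top of these weight-only figures:

```python
for name, params in [("GPT-2 Small", 124e6), ("GPT-2 Medium", 350e6),
                     ("GPT-2 Large", 774e6), ("GPT-2 XL", 1.5e9)]:
    print(f"{name}: {params * 4 / 1e9:.1f} GB of float32 weights")
```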
**What we're testing**: Parameter counting accuracy for various architectures
**Why it matters**: Accurate parameter counts predict memory usage and model complexity
**Expected**: Correct counts for known model configurations
"""

# %% nbgrader={"grade": true, "grade_id": "test_parameter_counting", "locked": true, "points": 10}
def test_unit_parameter_counting():
    """πŸ”¬ Test parameter counting implementation."""
    print("πŸ”¬ Unit Test: Parameter Counting...")

    profiler = Profiler()

    # Test 1: Simple model with known parameters
    class SimpleModel:
        def __init__(self):
            self.weight = Tensor(np.random.randn(10, 5))
            self.bias = Tensor(np.random.randn(5))

        def parameters(self):
            return [self.weight, self.bias]

    simple_model = SimpleModel()
    param_count = profiler.count_parameters(simple_model)
    expected_count = 10 * 5 + 5  # weight + bias
    assert param_count == expected_count, f"Expected {expected_count} parameters, got {param_count}"
    print(f"βœ… Simple model: {param_count} parameters")

    # Test 2: Model without parameters
    class NoParamModel:
        def __init__(self):
            pass

    no_param_model = NoParamModel()
    param_count = profiler.count_parameters(no_param_model)
    assert param_count == 0, f"Expected 0 parameters, got {param_count}"
    print(f"βœ… No parameter model: {param_count} parameters")

    # Test 3: Direct tensor (no parameters)
    test_tensor = Tensor(np.random.randn(2, 3))
    param_count = profiler.count_parameters(test_tensor)
    assert param_count == 0, f"Expected 0 parameters for tensor, got {param_count}"
    print(f"βœ… Direct tensor: {param_count} parameters")

    print("βœ… Parameter counting works correctly!")

if __name__ == "__main__":
    test_unit_parameter_counting()

# %% [markdown]
"""
## FLOP Counting - Computational Cost Estimation

FLOPs measure the computational work required for model operations. Unlike latency, FLOPs are hardware-independent and help predict compute costs across different systems.

### FLOP Counting Visualization

```
Linear Layer FLOP Breakdown:
Input (batch=32, features=768) Γ— Weight (768, 3072) + Bias (3072)
    ↓
Matrix Multiplication: 32 Γ— 768 Γ— 3072 Γ— 2 = 150,994,944 FLOPs
Bias Addition:         32 Γ— 3072          =      98,304 FLOPs
    ↓
Total FLOPs: 151,093,248 FLOPs

Convolution FLOP Breakdown:
Input (batch=1, channels=3, H=224, W=224)
Kernel (out=64, in=3, kH=7, kW=7)
    ↓
Output size: (224Γ—224) β†’ (112Γ—112) with stride=2
FLOPs = 112 Γ— 112 Γ— 7 Γ— 7 Γ— 3 Γ— 64 Γ— 2 = 236,027,904 FLOPs
```

### FLOP Counting Strategy

Different operations require different FLOP calculations:
- **Matrix operations**: M Γ— N Γ— K Γ— 2 (multiply + add)
- **Convolutions**: Output spatial Γ— kernel spatial Γ— channels
- **Activations**: Usually 1 FLOP per element
"""

# %% [markdown]
"""
### πŸ§ͺ Unit Test: FLOP Counting

This test validates our FLOP counting for different operations and architectures.
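
The breakdowns above are plain arithmetic, so they are easy to re-derive, and worth re-deriving whenever you quote FLOP counts:

```python
linear_matmul = 32 * 768 * 3072 * 2   # multiply + add
linear_bias = 32 * 3072
print(linear_matmul + linear_bias)    # 151093248

conv_flops = 112 * 112 * 7 * 7 * 3 * 64 * 2
print(conv_flops)                     # 236027904
```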
**What we're testing**: FLOP calculation accuracy for various layer types **Why it matters**: FLOPs predict computational cost and energy usage **Expected**: Correct FLOP counts for known operation types """ # %% nbgrader={"grade": true, "grade_id": "test_flop_counting", "locked": true, "points": 10} def test_unit_flop_counting(): """πŸ”¬ Test FLOP counting implementation.""" print("πŸ”¬ Unit Test: FLOP Counting...") profiler = Profiler() # Test 1: Simple tensor operations test_tensor = Tensor(np.random.randn(4, 8)) flops = profiler.count_flops(test_tensor, (4, 8)) expected_flops = 4 * 8 # 1 FLOP per element for generic operation assert flops == expected_flops, f"Expected {expected_flops} FLOPs, got {flops}" print(f"βœ… Tensor operation: {flops} FLOPs") # Test 2: Simulated Linear layer class MockLinear: def __init__(self, in_features, out_features): self.weight = Tensor(np.random.randn(in_features, out_features)) self.__class__.__name__ = 'Linear' mock_linear = MockLinear(128, 64) flops = profiler.count_flops(mock_linear, (1, 128)) expected_flops = 128 * 64 * 2 # matmul FLOPs assert flops == expected_flops, f"Expected {expected_flops} FLOPs, got {flops}" print(f"βœ… Linear layer: {flops} FLOPs") # Test 3: Batch size independence flops_batch1 = profiler.count_flops(mock_linear, (1, 128)) flops_batch32 = profiler.count_flops(mock_linear, (32, 128)) assert flops_batch1 == flops_batch32, "FLOPs should be independent of batch size" print(f"βœ… Batch independence: {flops_batch1} FLOPs (same for batch 1 and 32)") print("βœ… FLOP counting works correctly!") if __name__ == "__main__": test_unit_flop_counting() # %% [markdown] """ ## Memory Profiling - Understanding Memory Usage Patterns Memory profiling reveals how much RAM your model consumes during training and inference. This is critical for deployment planning and optimization. ### Memory Usage Breakdown ``` ML Model Memory Components: β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ Total Memory β”‚ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”‚ Parameters β”‚ Activations β”‚ Gradients β”‚ β”‚ (persistent) β”‚ (per forward) β”‚ (per backward)β”‚ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”‚ Linear weights β”‚ Hidden states β”‚ βˆ‚L/βˆ‚W β”‚ β”‚ Conv filters β”‚ Attention maps β”‚ βˆ‚L/βˆ‚b β”‚ β”‚ Embeddings β”‚ Residual cache β”‚ Optimizer β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ Memory Scaling: Batch Size β†’ Activation Memory (linear scaling) Model Size β†’ Parameter + Gradient Memory (linear scaling) Sequence Length β†’ Attention Memory (quadratic scaling!) ``` ### Memory Measurement Strategy We use Python's `tracemalloc` to track memory allocations during model execution. This gives us precise measurements of memory usage patterns. """ # %% [markdown] """ ### πŸ§ͺ Unit Test: Memory Measurement This test validates our memory tracking works correctly and provides useful metrics. 
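
Before the test, here is a minimal standalone sketch of the tracemalloc protocol `measure_memory` follows: capture a baseline, allocate, read the peak. It assumes a reasonably recent NumPy, which registers large array buffers with tracemalloc; C-level allocations made outside Python's allocator are not captured.

```python
import tracemalloc
import numpy as np

tracemalloc.start()
baseline, _ = tracemalloc.get_traced_memory()

a = np.random.randn(512, 512)   # ~2 MB of float64
b = a @ a                       # result buffer plus temporaries

_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak above baseline: {(peak - baseline) / (1024 * 1024):.2f} MB")
```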
**What we're testing**: Memory usage measurement and calculation accuracy **Why it matters**: Memory constraints often limit model deployment **Expected**: Reasonable memory measurements with proper components """ # %% nbgrader={"grade": true, "grade_id": "test_memory_measurement", "locked": true, "points": 10} def test_unit_memory_measurement(): """πŸ”¬ Test memory measurement implementation.""" print("πŸ”¬ Unit Test: Memory Measurement...") profiler = Profiler() # Test 1: Basic memory measurement test_tensor = Tensor(np.random.randn(10, 20)) from tinytorch.core.layers import Linear test_model = Linear(20, 10) memory_stats = profiler.measure_memory(test_model, (10, 20)) # Validate dictionary structure required_keys = ['parameter_memory_mb', 'activation_memory_mb', 'peak_memory_mb', 'memory_efficiency'] for key in required_keys: assert key in memory_stats, f"Missing key: {key}" # Validate non-negative values for key in required_keys: assert memory_stats[key] >= 0, f"{key} should be non-negative, got {memory_stats[key]}" print(f"βœ… Basic measurement: {memory_stats['peak_memory_mb']:.3f} MB peak") # Test 2: Memory scaling with size from tinytorch.core.layers import Linear small_model = Linear(5, 5) large_model = Linear(50, 50) small_memory = profiler.measure_memory(small_model, (5, 5)) large_memory = profiler.measure_memory(large_model, (50, 50)) # Larger tensor should use more activation memory assert large_memory['activation_memory_mb'] >= small_memory['activation_memory_mb'], \ "Larger tensor should use more activation memory" print(f"βœ… Scaling: Small {small_memory['activation_memory_mb']:.3f} MB β†’ Large {large_memory['activation_memory_mb']:.3f} MB") # Test 3: Efficiency bounds assert 0 <= memory_stats['memory_efficiency'] <= 1.0, \ f"Memory efficiency should be between 0 and 1, got {memory_stats['memory_efficiency']}" print(f"βœ… Efficiency: {memory_stats['memory_efficiency']:.3f} (0-1 range)") print("βœ… Memory measurement works correctly!") if __name__ == "__main__": test_unit_memory_measurement() # %% [markdown] """ ## Latency Measurement - Accurate Performance Timing Latency measurement is the most challenging part of profiling because it's affected by system state, caching, and measurement overhead. We need statistical rigor to get reliable results. ### Latency Measurement Challenges ``` Timing Challenges: β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ Time Variance β”‚ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”‚ System Noise β”‚ Cache Effects β”‚ Thermal β”‚ β”‚ β”‚ β”‚ Throttling β”‚ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”‚ Background β”‚ Cold start vs β”‚ CPU slows β”‚ β”‚ processes β”‚ warm caches β”‚ when hot β”‚ β”‚ OS scheduling β”‚ Memory locality β”‚ GPU thermal β”‚ β”‚ Network I/O β”‚ Branch predict β”‚ limits β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ Solution: Statistical Approach Warmup β†’ Multiple measurements β†’ Robust statistics (median) ``` ### Measurement Protocol Our latency measurement follows professional benchmarking practices: 1. **Warmup runs** to stabilize system state 2. 
**Multiple measurements** for statistical significance 3. **Median calculation** to handle outliers 4. **Memory cleanup** to prevent contamination """ # %% [markdown] """ ### πŸ§ͺ Unit Test: Latency Measurement This test validates our latency measurement provides consistent and reasonable results. **What we're testing**: Timing accuracy and statistical robustness **Why it matters**: Latency determines real-world deployment feasibility **Expected**: Consistent timing measurements with proper statistical handling """ # %% nbgrader={"grade": true, "grade_id": "test_latency_measurement", "locked": true, "points": 10} def test_unit_latency_measurement(): """πŸ”¬ Test latency measurement implementation.""" print("πŸ”¬ Unit Test: Latency Measurement...") profiler = Profiler() # Test 1: Basic latency measurement from tinytorch.core.layers import Linear test_model = Linear(8, 4) test_input = Tensor(np.random.randn(4, 8)) latency = profiler.measure_latency(test_model, test_input, warmup=2, iterations=5) assert latency >= 0, f"Latency should be non-negative, got {latency}" assert latency < 1000, f"Latency seems too high for simple operation: {latency} ms" print(f"βœ… Basic latency: {latency:.3f} ms") # Test 2: Measurement consistency latencies = [] for _ in range(3): lat = profiler.measure_latency(test_model, test_input, warmup=1, iterations=3) latencies.append(lat) # Measurements should be in reasonable range avg_latency = np.mean(latencies) std_latency = np.std(latencies) assert std_latency < avg_latency, "Standard deviation shouldn't exceed mean for simple operations" print(f"βœ… Consistency: {avg_latency:.3f} Β± {std_latency:.3f} ms") # Test 3: Size scaling small_model = Linear(2, 2) large_model = Linear(20, 20) small_input = Tensor(np.random.randn(2, 2)) large_input = Tensor(np.random.randn(20, 20)) small_latency = profiler.measure_latency(small_model, small_input, warmup=1, iterations=3) large_latency = profiler.measure_latency(large_model, large_input, warmup=1, iterations=3) # Larger operations might take longer (though not guaranteed for simple operations) print(f"βœ… Scaling: Small {small_latency:.3f} ms, Large {large_latency:.3f} ms") print("βœ… Latency measurement works correctly!") if __name__ == "__main__": test_unit_latency_measurement() # %% [markdown] """ ## 4. Integration: Advanced Profiling Functions Now let's validate our higher-level profiling functions that combine core measurements into comprehensive analysis tools. ### Advanced Profiling Architecture ``` Core Profiler Methods β†’ Advanced Analysis Functions β†’ Optimization Insights ↓ ↓ ↓ count_parameters() profile_forward_pass() "Memory-bound workload" count_flops() profile_backward_pass() "Optimize data movement" measure_memory() profile_layer() "Focus on bandwidth" measure_latency() benchmark_efficiency() "Use quantization" ``` ### Forward Pass Profiling - Complete Performance Picture A forward pass profile combines all our measurements to understand model behavior comprehensively. This is essential for optimization decisions. """ # %% [markdown] """ ### Backward Pass Profiling - Training Analysis Training requires both forward and backward passes. The backward pass typically uses 2Γ— the compute and adds gradient memory. Understanding this is crucial for training optimization. 
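
Those backward-pass rules of thumb are easy to turn into a quick estimator. This is a hedged sketch, not a measurement: the helper name is invented here, the 2Γ— backward heuristic is a convention, and real forward/backward ratios vary by architecture.

```python
def training_step_estimate(forward_flops: int, param_count: int,
                           optimizer: str = "adam") -> dict:
    grad_mb = param_count * 4 / (1024 * 1024)              # one float32 gradient per parameter
    opt_mb = {"sgd": 0.0, "adam": 2 * grad_mb}[optimizer]  # Adam: momentum + velocity
    return {
        "total_flops": 3 * forward_flops,                  # forward + ~2x backward
        "gradient_memory_mb": grad_mb,
        "optimizer_memory_mb": opt_mb,
    }

print(training_step_estimate(forward_flops=16_384, param_count=8_256))
```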
### Training Memory Visualization ``` Training Memory Timeline: Forward Pass: [Parameters] + [Activations] ↓ Backward Pass: [Parameters] + [Activations] + [Gradients] ↓ Optimizer: [Parameters] + [Gradients] + [Optimizer State] Memory Examples: Model: 125M parameters (500MB) Forward: 500MB params + 100MB activations = 600MB Backward: 500MB params + 100MB activations + 500MB gradients = 1,100MB Adam: 500MB params + 500MB gradients + 1,000MB momentum/velocity = 2,000MB Total Training Memory: 4Γ— parameter memory! ``` """ # %% [markdown] """ ### πŸ§ͺ Unit Test: Advanced Profiling Functions This test validates our advanced profiling functions provide comprehensive analysis. **What we're testing**: Forward and backward pass profiling completeness **Why it matters**: Training optimization requires understanding both passes **Expected**: Complete profiles with all required metrics and relationships """ # %% nbgrader={"grade": true, "grade_id": "test_advanced_profiling", "locked": true, "points": 15} def test_unit_advanced_profiling(): """πŸ”¬ Test advanced profiling functions.""" print("πŸ”¬ Unit Test: Advanced Profiling Functions...") # Create profiler and test model profiler = Profiler() from tinytorch.core.layers import Linear test_model = Linear(8, 4) test_input = Tensor(np.random.randn(4, 8)) # Test forward pass profiling forward_profile = profiler.profile_forward_pass(test_model, test_input) # Validate forward profile structure required_forward_keys = [ 'parameters', 'flops', 'latency_ms', 'gflops_per_second', 'memory_bandwidth_mbs', 'bottleneck' ] for key in required_forward_keys: assert key in forward_profile, f"Missing key: {key}" assert forward_profile['parameters'] >= 0 assert forward_profile['flops'] >= 0 assert forward_profile['latency_ms'] >= 0 assert forward_profile['gflops_per_second'] >= 0 print(f"βœ… Forward profiling: {forward_profile['gflops_per_second']:.2f} GFLOP/s") # Test backward pass profiling backward_profile = profiler.profile_backward_pass(test_model, test_input) # Validate backward profile structure required_backward_keys = [ 'forward_flops', 'backward_flops', 'total_flops', 'total_latency_ms', 'total_memory_mb', 'optimizer_memory_estimates' ] for key in required_backward_keys: assert key in backward_profile, f"Missing key: {key}" # Validate relationships assert backward_profile['total_flops'] >= backward_profile['forward_flops'] assert backward_profile['total_latency_ms'] >= backward_profile['forward_latency_ms'] assert 'sgd' in backward_profile['optimizer_memory_estimates'] assert 'adam' in backward_profile['optimizer_memory_estimates'] # Check backward pass estimates are reasonable assert backward_profile['backward_flops'] >= backward_profile['forward_flops'], \ "Backward pass should have at least as many FLOPs as forward" assert backward_profile['gradient_memory_mb'] >= 0, \ "Gradient memory should be non-negative" print(f"βœ… Backward profiling: {backward_profile['total_latency_ms']:.2f} ms total") print(f"βœ… Memory breakdown: {backward_profile['total_memory_mb']:.2f} MB training") print("βœ… Advanced profiling functions work correctly!") if __name__ == "__main__": test_unit_advanced_profiling() # %% [markdown] """ ## 5. Systems Analysis: Understanding Performance Characteristics Let's analyze how different model characteristics affect performance. This analysis guides optimization decisions and helps identify bottlenecks. 
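
One lens we will use throughout this section is arithmetic intensity: FLOPs performed per byte moved. The sketch below classifies a workload against illustrative hardware peaks; the 100 GFLOP/s and 50 GB/s figures are placeholders, not measurements of any particular machine.

```python
def classify_bottleneck(flops: int, bytes_moved: int,
                        peak_gflops: float = 100.0,
                        peak_bandwidth_gbs: float = 50.0) -> str:
    intensity = flops / max(bytes_moved, 1)       # FLOPs per byte
    ridge = peak_gflops / peak_bandwidth_gbs      # intensity where the two limits cross
    return "compute-bound" if intensity > ridge else "memory-bound"

# Linear(256, 256) on a batch of 32, counting weight/input/output traffic at float32
flops = 32 * 256 * 256 * 2
bytes_moved = (256 * 256 + 32 * 256 + 32 * 256) * 4
print(classify_bottleneck(flops, bytes_moved))    # "compute-bound" for this toy case
```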
### Performance Analysis Workflow

```
Model Scaling Analysis:

Size β†’ Memory β†’ Latency β†’ Throughput β†’ Bottleneck Identification
 ↓       ↓        ↓           ↓                  ↓
 64     1MB     0.1ms     10K ops/s        Memory bound
128     4MB     0.2ms      8K ops/s        Memory bound
256    16MB     0.5ms      4K ops/s        Memory bound
512    64MB     2.0ms      1K ops/s        Memory bound

Insight: This workload is memory-bound
β†’ Optimize data movement, not compute!
```
"""

# %% nbgrader={"grade": false, "grade_id": "performance_analysis", "solution": true}
def analyze_model_scaling():
    """πŸ“Š Analyze how model performance scales with size."""
    print("πŸ“Š Analyzing Model Scaling Characteristics...")

    from tinytorch.core.layers import Linear

    profiler = Profiler()
    results = []

    # Test different model sizes
    sizes = [64, 128, 256, 512]
    batch_size = 32

    print("\nModel Scaling Analysis:")
    print("Size\tParams\t\tFLOPs\t\tLatency(ms)\tMemory(MB)\tGFLOP/s")
    print("-" * 80)

    for size in sizes:
        # Create models of different sizes for comparison
        test_model = Linear(size, size)
        input_shape = (batch_size, size)
        dummy_input = Tensor(np.random.randn(*input_shape))

        # Analytical linear layer characteristics
        linear_params = size * size + size  # W + b
        # FLOPs for the whole batch: the measured latency covers all 32 samples,
        # so throughput must be computed from total (not per-sample) FLOPs
        linear_flops = batch_size * size * size * 2  # matmul

        # Measure actual performance
        latency = profiler.measure_latency(test_model, dummy_input, warmup=3, iterations=10)
        memory = profiler.measure_memory(test_model, input_shape)

        gflops_per_second = (linear_flops / 1e9) / (latency / 1000)

        results.append({
            'size': size,
            'parameters': linear_params,
            'flops': linear_flops,
            'latency_ms': latency,
            'memory_mb': memory['peak_memory_mb'],
            'gflops_per_second': gflops_per_second
        })

        print(f"{size}\t{linear_params:,}\t\t{linear_flops:,}\t\t"
              f"{latency:.2f}\t\t{memory['peak_memory_mb']:.2f}\t\t"
              f"{gflops_per_second:.2f}")

    # Analysis insights
    print("\nπŸ’‘ Scaling Analysis Insights:")

    # Memory scaling
    memory_growth = results[-1]['memory_mb'] / max(results[0]['memory_mb'], 0.001)
    print(f"Memory grows {memory_growth:.1f}Γ— as size grows from {sizes[0]} to {sizes[-1]}")

    # Compute scaling
    compute_growth = results[-1]['gflops_per_second'] / max(results[0]['gflops_per_second'], 0.001)
    print(f"Compute efficiency changes {compute_growth:.1f}Γ— with size")

    # Performance characteristics
    avg_efficiency = np.mean([r['gflops_per_second'] for r in results])
    if avg_efficiency < 10:  # Arbitrary threshold for "low" efficiency
        print("πŸš€ Low compute efficiency suggests memory-bound workload")
    else:
        print("πŸš€ High compute efficiency suggests compute-bound workload")

def analyze_batch_size_effects():
    """πŸ“Š Analyze how batch size affects performance and efficiency."""
    print("\nπŸ“Š Analyzing Batch Size Effects...")

    from tinytorch.core.layers import Linear

    profiler = Profiler()
    batch_sizes = [1, 8, 32, 128]
    feature_size = 256

    print("\nBatch Size Effects Analysis:")
    print("Batch\tLatency(ms)\tThroughput(samples/s)\tMemory(MB)\tMemory Efficiency")
    print("-" * 85)

    for batch_size in batch_sizes:
        test_model = Linear(feature_size, feature_size)
        input_shape = (batch_size, feature_size)
        dummy_input = Tensor(np.random.randn(*input_shape))

        # Measure performance
        latency = profiler.measure_latency(test_model, dummy_input, warmup=3, iterations=10)
        memory = profiler.measure_memory(test_model, input_shape)

        # Calculate throughput
        samples_per_second = (batch_size * 1000) / latency  # samples/second

        # Calculate efficiency (samples per unit memory)
        efficiency = samples_per_second / max(memory['peak_memory_mb'], 0.001)

        print(f"{batch_size}\t{latency:.2f}\t\t{samples_per_second:.0f}\t\t\t"
              f"{memory['peak_memory_mb']:.2f}\t\t{efficiency:.1f}")
print("\nπŸ’‘ Batch Size Insights:") print("Larger batches typically improve throughput but increase memory usage") # Run the analysis if __name__ == "__main__": analyze_model_scaling() analyze_batch_size_effects() # %% [markdown] """ ## 6. Optimization Insights: Production Performance Patterns Understanding profiling results helps guide optimization decisions. Let's analyze different operation types and measurement overhead. ### Operation Efficiency Analysis ``` Operation Types and Their Characteristics: β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ Operation β”‚ Compute/Memory β”‚ Optimization β”‚ Priority β”‚ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”‚ Matrix Multiply β”‚ Compute-bound β”‚ BLAS libraries β”‚ High β”‚ β”‚ Elementwise β”‚ Memory-bound β”‚ Data locality β”‚ Medium β”‚ β”‚ Reductions β”‚ Memory-bound β”‚ Parallelizationβ”‚ Medium β”‚ β”‚ Attention β”‚ Memory-bound β”‚ FlashAttention β”‚ High β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ Optimization Strategy: 1. Profile first β†’ Identify bottlenecks 2. Focus on compute-bound ops β†’ Algorithmic improvements 3. Focus on memory-bound ops β†’ Data movement optimization 4. Measure again β†’ Verify improvements ``` """ # %% nbgrader={"grade": false, "grade_id": "optimization_insights", "solution": true} def benchmark_operation_efficiency(): """πŸ“Š Compare efficiency of different operations for optimization guidance.""" print("πŸ“Š Benchmarking Operation Efficiency...") profiler = Profiler() operations = [] # Test different operation types size = 256 input_tensor = Tensor(np.random.randn(32, size)) # Elementwise operations (memory-bound) # Create a simple model wrapper for elementwise operations class ElementwiseModel: def forward(self, x): return x + x # Simple elementwise operation elementwise_model = ElementwiseModel() elementwise_latency = profiler.measure_latency(elementwise_model, input_tensor, iterations=20) elementwise_flops = size * 32 # One operation per element operations.append({ 'operation': 'Elementwise', 'latency_ms': elementwise_latency, 'flops': elementwise_flops, 'gflops_per_second': (elementwise_flops / 1e9) / (elementwise_latency / 1000), 'efficiency_class': 'memory-bound', 'optimization_focus': 'data_locality' }) # Matrix operations (compute-bound) from tinytorch.core.layers import Linear matrix_model = Linear(size, size) matrix_latency = profiler.measure_latency(matrix_model, input_tensor, iterations=10) matrix_flops = size * size * 2 # Matrix multiplication operations.append({ 'operation': 'Matrix Multiply', 'latency_ms': matrix_latency, 'flops': matrix_flops, 'gflops_per_second': (matrix_flops / 1e9) / (matrix_latency / 1000), 'efficiency_class': 'compute-bound', 'optimization_focus': 'algorithms' }) # Reduction operations (memory-bound) class ReductionModel: def forward(self, x): return x.sum() # Sum reduction operation reduction_model = ReductionModel() reduction_latency = profiler.measure_latency(reduction_model, input_tensor, iterations=20) reduction_flops 
= size * 32 # Sum reduction operations.append({ 'operation': 'Reduction', 'latency_ms': reduction_latency, 'flops': reduction_flops, 'gflops_per_second': (reduction_flops / 1e9) / (reduction_latency / 1000), 'efficiency_class': 'memory-bound', 'optimization_focus': 'parallelization' }) print("\nOperation Efficiency Comparison:") print("Operation\t\tLatency(ms)\tGFLOP/s\t\tEfficiency Class\tOptimization Focus") print("-" * 95) for op in operations: print(f"{op['operation']:<15}\t{op['latency_ms']:.3f}\t\t" f"{op['gflops_per_second']:.2f}\t\t{op['efficiency_class']:<15}\t{op['optimization_focus']}") print("\nπŸ’‘ Operation Optimization Insights:") # Find most and least efficient best_op = max(operations, key=lambda x: x['gflops_per_second']) worst_op = min(operations, key=lambda x: x['gflops_per_second']) print(f"Most efficient: {best_op['operation']} ({best_op['gflops_per_second']:.2f} GFLOP/s)") print(f"Least efficient: {worst_op['operation']} ({worst_op['gflops_per_second']:.2f} GFLOP/s)") # Count operation types memory_bound_ops = [op for op in operations if op['efficiency_class'] == 'memory-bound'] compute_bound_ops = [op for op in operations if op['efficiency_class'] == 'compute-bound'] print(f"\nπŸš€ Optimization Priority:") if len(memory_bound_ops) > len(compute_bound_ops): print("Focus on memory optimization: data locality, bandwidth, caching") else: print("Focus on compute optimization: better algorithms, vectorization") def analyze_profiling_overhead(): """πŸ“Š Measure the overhead of profiling itself.""" print("\nπŸ“Š Analyzing Profiling Overhead...") # Test with and without profiling test_tensor = Tensor(np.random.randn(100, 100)) iterations = 50 # Without profiling - baseline measurement start_time = time.perf_counter() for _ in range(iterations): _ = test_tensor.data.copy() # Simple operation end_time = time.perf_counter() baseline_ms = (end_time - start_time) * 1000 # With profiling - includes measurement overhead profiler = Profiler() # Create a simple model for profiling overhead measurement class TestModel: def forward(self, x): return x + 1.0 test_model = TestModel() start_time = time.perf_counter() for _ in range(iterations): _ = profiler.measure_latency(test_model, test_tensor, warmup=1, iterations=1) end_time = time.perf_counter() profiled_ms = (end_time - start_time) * 1000 overhead_factor = profiled_ms / max(baseline_ms, 0.001) print(f"\nProfiling Overhead Analysis:") print(f"Baseline execution: {baseline_ms:.2f} ms") print(f"With profiling: {profiled_ms:.2f} ms") print(f"Profiling overhead: {overhead_factor:.1f}Γ— slower") print(f"\nπŸ’‘ Profiling Overhead Insights:") if overhead_factor < 2: print("Low overhead - suitable for frequent profiling") elif overhead_factor < 10: print("Moderate overhead - use for development and debugging") else: print("High overhead - use sparingly in production") # Run optimization analysis if __name__ == "__main__": benchmark_operation_efficiency() analyze_profiling_overhead() # %% [markdown] """ ## πŸ§ͺ Module Integration Test Final validation that everything works together correctly. """ # %% nbgrader={"grade": true, "grade_id": "test_module", "locked": true, "points": 20} def test_module(): """ Comprehensive test of entire profiling module functionality. 
This final test runs before module summary to ensure: - All unit tests pass - Functions work together correctly - Module is ready for integration with TinyTorch """ print("πŸ§ͺ RUNNING MODULE INTEGRATION TEST") print("=" * 50) # Run all unit tests print("Running unit tests...") test_unit_helper_functions() test_unit_parameter_counting() test_unit_flop_counting() test_unit_memory_measurement() test_unit_latency_measurement() test_unit_advanced_profiling() print("\nRunning integration scenarios...") # Test realistic usage patterns print("πŸ”¬ Integration Test: Complete Profiling Workflow...") # Create profiler profiler = Profiler() # Create test model and data from tinytorch.core.layers import Linear test_model = Linear(16, 32) test_input = Tensor(np.random.randn(8, 16)) # Run complete profiling workflow print("1. Measuring model characteristics...") params = profiler.count_parameters(test_model) flops = profiler.count_flops(test_model, test_input.shape) memory = profiler.measure_memory(test_model, test_input.shape) latency = profiler.measure_latency(test_model, test_input, warmup=2, iterations=5) print(f" Parameters: {params}") print(f" FLOPs: {flops}") print(f" Memory: {memory['peak_memory_mb']:.2f} MB") print(f" Latency: {latency:.2f} ms") # Test advanced profiling print("2. Running advanced profiling...") forward_profile = profiler.profile_forward_pass(test_model, test_input) backward_profile = profiler.profile_backward_pass(test_model, test_input) assert 'gflops_per_second' in forward_profile assert 'total_latency_ms' in backward_profile print(f" Forward GFLOP/s: {forward_profile['gflops_per_second']:.2f}") print(f" Training latency: {backward_profile['total_latency_ms']:.2f} ms") # Test bottleneck analysis print("3. Analyzing performance bottlenecks...") bottleneck = forward_profile['bottleneck'] efficiency = forward_profile['computational_efficiency'] print(f" Bottleneck: {bottleneck}") print(f" Compute efficiency: {efficiency:.3f}") # Validate end-to-end workflow assert params >= 0, "Parameter count should be non-negative" assert flops >= 0, "FLOP count should be non-negative" assert memory['peak_memory_mb'] >= 0, "Memory usage should be non-negative" assert latency >= 0, "Latency should be non-negative" assert forward_profile['gflops_per_second'] >= 0, "GFLOP/s should be non-negative" assert backward_profile['total_latency_ms'] >= 0, "Total latency should be non-negative" assert bottleneck in ['memory', 'compute'], "Bottleneck should be memory or compute" assert 0 <= efficiency <= 1, "Efficiency should be between 0 and 1" print("βœ… End-to-end profiling workflow works!") # Test production-like scenario print("4. Testing production profiling scenario...") # Simulate larger model analysis from tinytorch.core.layers import Linear large_model = Linear(512, 256) large_input = Tensor(np.random.randn(32, 512)) # Larger model input large_profile = profiler.profile_forward_pass(large_model, large_input) # Verify profile contains optimization insights assert 'bottleneck' in large_profile, "Profile should identify bottlenecks" assert 'memory_bandwidth_mbs' in large_profile, "Profile should measure memory bandwidth" print(f" Large model analysis: {large_profile['bottleneck']} bottleneck") print(f" Memory bandwidth: {large_profile['memory_bandwidth_mbs']:.1f} MB/s") print("βœ… Production profiling scenario works!") print("\n" + "=" * 50) print("πŸŽ‰ ALL TESTS PASSED! 
Module ready for export.") print("Run: tito module complete 14") # Call before module summary if __name__ == "__main__": test_module() # %% if __name__ == "__main__": print("πŸš€ Running Profiling module...") test_module() print("βœ… Module validation complete!") # %% [markdown] """ ## πŸ€” ML Systems Thinking: Performance Measurement ### Question 1: FLOP Analysis You implemented a profiler that counts FLOPs for different operations. For a Linear layer with 1000 input features and 500 output features: - How many FLOPs are required for one forward pass? _____ FLOPs - If you process a batch of 32 samples, how does this change the per-sample FLOPs? _____ ### Question 2: Memory Scaling Your profiler measures memory usage for models and activations. A transformer model has 125M parameters (500MB at FP32). During training with batch size 16: - What's the minimum memory for gradients? _____ MB - With Adam optimizer, what's the total memory requirement? _____ MB ### Question 3: Performance Bottlenecks You built tools to identify compute vs memory bottlenecks. A model achieves 10 GFLOP/s on hardware with 100 GFLOP/s peak: - What's the computational efficiency? _____% - If doubling batch size doesn't improve GFLOP/s, the bottleneck is likely _____ ### Question 4: Profiling Trade-offs Your profiler adds measurement overhead to understand performance. If profiling adds 5Γ— overhead but reveals a 50% speedup opportunity: - Is the profiling cost justified for development? _____ - When should you disable profiling in production? _____ """ # %% [markdown] """ ## 🎯 MODULE SUMMARY: Profiling Congratulations! You've built a comprehensive profiling system for ML performance analysis! ### Key Accomplishments - Built complete Profiler class with parameter, FLOP, memory, and latency measurement - Implemented advanced profiling functions for forward and backward pass analysis - Discovered performance characteristics through scaling and efficiency analysis - Created production-quality measurement tools for optimization guidance - All tests pass βœ… (validated by `test_module()`) ### Systems Insights Gained - **FLOPs vs Reality**: Theoretical operations don't always predict actual performance - **Memory Bottlenecks**: Many ML operations are limited by memory bandwidth, not compute - **Batch Size Effects**: Larger batches improve throughput but increase memory requirements - **Profiling Overhead**: Measurement tools have costs but enable data-driven optimization ### Production Skills Developed - **Performance Detective Work**: Use data, not guesses, to identify bottlenecks - **Optimization Prioritization**: Focus efforts on actual bottlenecks, not assumptions - **Resource Planning**: Predict memory and compute requirements for deployment - **Statistical Rigor**: Handle measurement variance with proper methodology ### Ready for Next Steps Your profiling implementation enables optimization modules (15-18) to make data-driven optimization decisions. Export with: `tito module complete 14` **Next**: Module 15 (Memoization) will use profiling to discover transformer bottlenecks and fix them! """