From a272030037e58a5aafa016708865bf8a8e220e26 Mon Sep 17 00:00:00 2001 From: Vijay Janapa Reddi Date: Sun, 9 Nov 2025 13:03:30 -0500 Subject: [PATCH] build: regenerate profiling notebook from updated dev file --- .../source/14_profiling/profiling_dev.ipynb | 1457 ++++++++--------- 1 file changed, 725 insertions(+), 732 deletions(-) diff --git a/modules/source/14_profiling/profiling_dev.ipynb b/modules/source/14_profiling/profiling_dev.ipynb index f08cdde1..c3daf773 100644 --- a/modules/source/14_profiling/profiling_dev.ipynb +++ b/modules/source/14_profiling/profiling_dev.ipynb @@ -2,24 +2,24 @@ "cells": [ { "cell_type": "markdown", - "id": "78d24362", + "id": "55618ade", "metadata": { "cell_marker": "\"\"\"" }, "source": [ - "# Module 15: Profiling - Measuring What Matters in ML Systems\n", + "# Module 14: Profiling - Measuring What Matters in ML Systems\n", "\n", - "Welcome to Module 15! You'll build professional profiling tools to measure model performance and uncover optimization opportunities.\n", + "Welcome to Module 14! 
You'll build professional profiling tools to measure model performance and uncover optimization opportunities.\n",
 "\n",
 "## šŸ”— Prerequisites & Progress\n",
- "**You've Built**: Complete ML stack from tensors to transformers with KV caching\n",
+ "**You've Built**: Complete ML stack from tensors to transformers\n",
 "**You'll Build**: Comprehensive profiling system for parameters, FLOPs, memory, and latency\n",
 "**You'll Enable**: Data-driven optimization decisions and performance analysis\n",
 "\n",
 "**Connection Map**:\n",
 "```\n",
- "All Modules → Profiling → Acceleration (Module 16)\n",
- "(implementations) (measurement) (optimization)\n",
+ "All Modules (01-13) → Profiling (14) → Optimization Techniques (15-18)\n",
+ "(implementations) (measurement) (targeted fixes)\n",
 "```\n",
 "\n",
 "## Learning Objectives\n",
@@ -33,7 +33,7 @@
 "\n",
 "## šŸ“¦ Where This Code Lives in the Final Package\n",
 "\n",
- "**Learning Side:** You work in `modules/15_profiling/profiling_dev.py` \n",
+ "**Learning Side:** You work in `modules/14_profiling/profiling_dev.py`\n",
 "**Building Side:** Code exports to `tinytorch.profiling.profiler`\n",
 "\n",
 "```python\n",
@@ -51,7 +51,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
- "id": "f622ef61",
+ "id": "d92307b1",
 "metadata": {
 "nbgrader": {
 "grade": false,
@@ -79,7 +79,7 @@
 },
 {
 "cell_type": "markdown",
- "id": "ae7455a2",
+ "id": "6e4fb271",
 "metadata": {
 "cell_marker": "\"\"\""
 },
@@ -115,7 +115,40 @@
 },
 {
 "cell_type": "markdown",
- "id": "85ee0680",
+ "id": "ddfa3dfb",
 "metadata": {
 "cell_marker": "\"\"\""
 },
 "source": [
 "### šŸ”— From Measurement to Optimization: Connecting to Modules 15-18\n",
 "\n",
 "**In this module (14)**, you'll learn HOW to discover optimization opportunities.\n",
 "**In Modules 15-18**, you'll implement optimizations like KV caching and see 10-15x speedups.\n",
 "\n",
 "**The Real ML Engineering Workflow**:\n",
 "```\n",
 "Step 1: Measure (This Module!) Step 2: Analyze\n",
 " ↓ ↓\n",
 "Profile baseline → Find bottleneck → Understand cause\n",
 "40 tok/s 80% in attention O(n²) recomputation\n",
 " ↓\n",
 "Step 4: Validate Step 3: Optimize (Modules 15-18)\n",
 " ↓ ↓\n",
 "Profile optimized ← Verify speedup ← Implement KV cache\n",
 "500 tok/s (12.5x) Measure impact Design solution\n",
 "```\n",
 "\n",
 "**Without Module 14's profiling**: You'd never know WHERE to optimize!\n",
 "**Without Modules 15-18's optimizations**: You couldn't FIX the bottleneck!\n",
 "\n",
 "This module teaches the measurement and analysis skills that enable\n",
 "optimization breakthroughs like KV caching. You'll profile real models\n",
 "and discover bottlenecks just like production ML teams do."
 ]
 },
 {
 "cell_type": "markdown",
 "id": "d5a2e470",
 "metadata": {
 "cell_marker": "\"\"\""
 },
@@ -198,7 +231,7 @@
 },
 {
 "cell_type": "markdown",
- "id": "ab8f2347",
+ "id": "c466e14d",
 "metadata": {
 "cell_marker": "\"\"\"",
 "lines_to_next_cell": 1
 },
 "ā”œā”€ā”€ count_parameters() → Model size analysis\n",
 "ā”œā”€ā”€ count_flops() → Computational cost estimation\n",
 "ā”œā”€ā”€ measure_memory() → Memory usage tracking\n",
- "└── measure_latency() → Performance timing\n",
- "\n",
- "Integration Functions\n",
+ "ā”œā”€ā”€ measure_latency() → Performance timing\n",
+ "ā”œā”€ā”€ profile_layer() → Layer-wise analysis\n",
 "ā”œā”€ā”€ profile_forward_pass() → Complete forward analysis\n",
 "└── profile_backward_pass() → Training analysis\n",
+ "\n",
+ "Integration:\n",
+ "All methods work together to provide comprehensive performance insights\n",
 "```"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": null,
- "id": "208a26c8",
+ "id": "31829387",
 "metadata": {
 "lines_to_next_cell": 1,
 "nbgrader": {
@@ -246,25 +281,627 @@
 " \"\"\"\n",
 "\n",
 " def __init__(self):\n",
- " \"\"\"Initialize profiler with measurement state.\"\"\"\n",
 " \"\"\"\n",
 " Initialize profiler with measurement state.\n",
 "\n",
 " TODO: Set up profiler 
tracking structures\n", + "\n", + " APPROACH:\n", + " 1. Create empty measurements dictionary\n", + " 2. Initialize operation counters\n", + " 3. Set up memory tracking state\n", + "\n", + " EXAMPLE:\n", + " >>> profiler = Profiler()\n", + " >>> profiler.measurements\n", + " {}\n", + "\n", + " HINTS:\n", + " - Use defaultdict(int) for operation counters\n", + " - measurements dict will store timing results\n", + " \"\"\"\n", " ### BEGIN SOLUTION\n", " self.measurements = {}\n", " self.operation_counts = defaultdict(int)\n", " self.memory_tracker = None\n", + " ### END SOLUTION\n", + "\n", + " def count_parameters(self, model) -> int:\n", + " \"\"\"\n", + " Count total trainable parameters in a model.\n", + "\n", + " TODO: Implement parameter counting for any model with parameters() method\n", + "\n", + " APPROACH:\n", + " 1. Get all parameters from model.parameters() if available\n", + " 2. For single layers, count weight and bias directly\n", + " 3. Sum total element count across all parameter tensors\n", + "\n", + " EXAMPLE:\n", + " >>> linear = Linear(128, 64) # 128*64 + 64 = 8256 parameters\n", + " >>> profiler = Profiler()\n", + " >>> count = profiler.count_parameters(linear)\n", + " >>> print(count)\n", + " 8256\n", + "\n", + " HINTS:\n", + " - Use parameter.data.size for tensor element count\n", + " - Handle models with and without parameters() method\n", + " - Don't forget bias terms when present\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " total_params = 0\n", + "\n", + " # Handle different model types\n", + " if hasattr(model, 'parameters'):\n", + " # Model with parameters() method (Sequential, custom models)\n", + " for param in model.parameters():\n", + " total_params += param.data.size\n", + " elif hasattr(model, 'weight'):\n", + " # Single layer (Linear, Conv2d)\n", + " total_params += model.weight.data.size\n", + " if hasattr(model, 'bias') and model.bias is not None:\n", + " total_params += model.bias.data.size\n", + " else:\n", + " # No 
parameters (activations, etc.)\n", + " total_params = 0\n", + "\n", + " return total_params\n", + " ### END SOLUTION\n", + "\n", + " def count_flops(self, model, input_shape: Tuple[int, ...]) -> int:\n", + " \"\"\"\n", + " Count FLOPs (Floating Point Operations) for one forward pass.\n", + "\n", + " TODO: Implement FLOP counting for different layer types\n", + "\n", + " APPROACH:\n", + " 1. Create dummy input with given shape\n", + " 2. Calculate FLOPs based on layer type and dimensions\n", + " 3. Handle different model architectures (Linear, Conv2d, Sequential)\n", + "\n", + " LAYER-SPECIFIC FLOP FORMULAS:\n", + " - Linear: input_features Ɨ output_features Ɨ 2 (matmul + bias)\n", + " - Conv2d: output_h Ɨ output_w Ɨ kernel_h Ɨ kernel_w Ɨ in_channels Ɨ out_channels Ɨ 2\n", + " - Activation: Usually 1 FLOP per element (ReLU, Sigmoid)\n", + "\n", + " EXAMPLE:\n", + " >>> linear = Linear(128, 64)\n", + " >>> profiler = Profiler()\n", + " >>> flops = profiler.count_flops(linear, (1, 128))\n", + " >>> print(flops) # 128 * 64 * 2 = 16384\n", + " 16384\n", + "\n", + " HINTS:\n", + " - Batch dimension doesn't affect per-sample FLOPs\n", + " - Focus on major operations (matmul, conv) first\n", + " - For Sequential models, sum FLOPs of all layers\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # Create dummy input (unused but kept for interface consistency)\n", + " _dummy_input = Tensor(np.random.randn(*input_shape))\n", + " total_flops = 0\n", + "\n", + " # Handle different model types\n", + " if hasattr(model, '__class__'):\n", + " model_name = model.__class__.__name__\n", + "\n", + " if model_name == 'Linear':\n", + " # Linear layer: input_features Ɨ output_features Ɨ 2\n", + " in_features = input_shape[-1]\n", + " out_features = model.weight.shape[1] if hasattr(model, 'weight') else 1\n", + " total_flops = in_features * out_features * 2\n", + "\n", + " elif model_name == 'Conv2d':\n", + " # Conv2d layer: complex calculation based on output size\n", + " # Simplified: 
assume we know the output dimensions\n", + " if hasattr(model, 'kernel_size') and hasattr(model, 'in_channels'):\n", + " _batch_size = input_shape[0] if len(input_shape) > 3 else 1\n", + " in_channels = model.in_channels\n", + " out_channels = model.out_channels\n", + " kernel_h = kernel_w = model.kernel_size\n", + "\n", + " # Estimate output size (simplified)\n", + " input_h, input_w = input_shape[-2], input_shape[-1]\n", + " output_h = input_h // (model.stride if hasattr(model, 'stride') else 1)\n", + " output_w = input_w // (model.stride if hasattr(model, 'stride') else 1)\n", + "\n", + " total_flops = (output_h * output_w * kernel_h * kernel_w *\n", + " in_channels * out_channels * 2)\n", + "\n", + " elif model_name == 'Sequential':\n", + " # Sequential model: sum FLOPs of all layers\n", + " current_shape = input_shape\n", + " for layer in model.layers:\n", + " layer_flops = self.count_flops(layer, current_shape)\n", + " total_flops += layer_flops\n", + " # Update shape for next layer (simplified)\n", + " if hasattr(layer, 'weight'):\n", + " current_shape = current_shape[:-1] + (layer.weight.shape[1],)\n", + "\n", + " else:\n", + " # Activation or other: assume 1 FLOP per element\n", + " total_flops = np.prod(input_shape)\n", + "\n", + " return total_flops\n", + " ### END SOLUTION\n", + "\n", + " def measure_memory(self, model, input_shape: Tuple[int, ...]) -> Dict[str, float]:\n", + " \"\"\"\n", + " Measure memory usage during forward pass.\n", + "\n", + " TODO: Implement memory tracking for model execution\n", + "\n", + " APPROACH:\n", + " 1. Use tracemalloc to track memory allocation\n", + " 2. Measure baseline memory before model execution\n", + " 3. Run forward pass and track peak usage\n", + " 4. 
Calculate different memory components\n",
 "\n",
 " RETURN DICTIONARY:\n",
 " - 'parameter_memory_mb': Memory for model parameters\n",
 " - 'activation_memory_mb': Memory for activations\n",
 " - 'peak_memory_mb': Maximum memory usage\n",
 " - 'memory_efficiency': Ratio of useful to total memory\n",
 "\n",
 " EXAMPLE:\n",
 " >>> linear = Linear(1024, 512)\n",
 " >>> profiler = Profiler()\n",
 " >>> memory = profiler.measure_memory(linear, (32, 1024))\n",
 " >>> print(f\"Parameters: {memory['parameter_memory_mb']:.1f} MB\")\n",
 " Parameters: 2.0 MB\n",
 "\n",
 " HINTS:\n",
 " - Use tracemalloc.start() and tracemalloc.get_traced_memory()\n",
 " - Account for float32 = 4 bytes per parameter\n",
 " - Activation memory scales with batch size\n",
 " \"\"\"\n",
 " ### BEGIN SOLUTION\n",
 " # Start memory tracking\n",
 " tracemalloc.start()\n",
 "\n",
 " # Measure baseline memory (unused but kept for completeness)\n",
 " _baseline_memory = tracemalloc.get_traced_memory()[0]\n",
 "\n",
 " # Calculate parameter memory\n",
 " param_count = self.count_parameters(model)\n",
 " parameter_memory_bytes = param_count * 4 # Assume float32\n",
 " parameter_memory_mb = parameter_memory_bytes / (1024 * 1024)\n",
 "\n",
 " # Create input and measure activation memory\n",
 " dummy_input = Tensor(np.random.randn(*input_shape))\n",
 " input_memory_bytes = dummy_input.data.nbytes\n",
 "\n",
 " # Estimate activation memory (simplified)\n",
 " activation_memory_bytes = input_memory_bytes * 2 # Rough estimate\n",
 " activation_memory_mb = activation_memory_bytes / (1024 * 1024)\n",
 "\n",
 " # Try to run forward pass and measure peak\n",
 " try:\n",
 " if hasattr(model, 'forward'):\n",
 " _ = model.forward(dummy_input)\n",
 " elif hasattr(model, '__call__'):\n",
 " _ = model(dummy_input)\n",
 " except Exception:\n",
 " pass # Ignore errors for simplified measurement\n",
 "\n",
 " # Get peak memory\n",
 " _current_memory, peak_memory = 
tracemalloc.get_traced_memory()\n", + " peak_memory_mb = (peak_memory - _baseline_memory) / (1024 * 1024)\n", + "\n", + " tracemalloc.stop()\n", + "\n", + " # Calculate efficiency\n", + " useful_memory = parameter_memory_mb + activation_memory_mb\n", + " memory_efficiency = useful_memory / max(peak_memory_mb, 0.001) # Avoid division by zero\n", + "\n", + " return {\n", + " 'parameter_memory_mb': parameter_memory_mb,\n", + " 'activation_memory_mb': activation_memory_mb,\n", + " 'peak_memory_mb': max(peak_memory_mb, useful_memory),\n", + " 'memory_efficiency': min(memory_efficiency, 1.0)\n", + " }\n", + " ### END SOLUTION\n", + "\n", + " def measure_latency(self, model, input_tensor, warmup: int = 10, iterations: int = 100) -> float:\n", + " \"\"\"\n", + " Measure model inference latency with statistical rigor.\n", + "\n", + " TODO: Implement accurate latency measurement\n", + "\n", + " APPROACH:\n", + " 1. Run warmup iterations to stabilize performance\n", + " 2. Measure multiple iterations for statistical accuracy\n", + " 3. Calculate median latency to handle outliers\n", + " 4. 
Return latency in milliseconds\n",
 "\n",
 " PARAMETERS:\n",
 " - warmup: Number of warmup runs (default 10)\n",
 " - iterations: Number of measurement runs (default 100)\n",
 "\n",
 " EXAMPLE:\n",
 " >>> linear = Linear(128, 64)\n",
 " >>> input_tensor = Tensor(np.random.randn(1, 128))\n",
 " >>> profiler = Profiler()\n",
 " >>> latency = profiler.measure_latency(linear, input_tensor)\n",
 " >>> print(f\"Latency: {latency:.2f} ms\")\n",
 " Latency: 0.15 ms\n",
 "\n",
 " HINTS:\n",
 " - Use time.perf_counter() for high precision\n",
 " - Use median instead of mean for robustness against outliers\n",
 " - Handle different model interfaces (forward, __call__)\n",
 " \"\"\"\n",
 " ### BEGIN SOLUTION\n",
 " # Warmup runs\n",
 " for _ in range(warmup):\n",
 " try:\n",
 " if hasattr(model, 'forward'):\n",
 " _ = model.forward(input_tensor)\n",
 " elif hasattr(model, '__call__'):\n",
 " _ = model(input_tensor)\n",
 " else:\n",
 " # Fallback for simple operations\n",
 " _ = input_tensor\n",
 " except Exception:\n",
 " pass # Ignore errors during warmup\n",
 "\n",
 " # Measurement runs\n",
 " times = []\n",
 " for _ in range(iterations):\n",
 " start_time = time.perf_counter()\n",
 "\n",
 " try:\n",
 " if hasattr(model, 'forward'):\n",
 " _ = model.forward(input_tensor)\n",
 " elif hasattr(model, '__call__'):\n",
 " _ = model(input_tensor)\n",
 " else:\n",
 " # Minimal operation for timing\n",
 " _ = input_tensor.data.copy()\n",
 " except Exception:\n",
 " pass # Ignore errors but still measure time\n",
 "\n",
 " end_time = time.perf_counter()\n",
 " times.append((end_time - start_time) * 1000) # Convert to milliseconds\n",
 "\n",
 " # Calculate statistics - use median for robustness\n",
 " times = np.array(times)\n",
 " median_latency = np.median(times)\n",
 "\n",
 " return float(median_latency)\n",
 " ### END SOLUTION\n",
 "\n",
 " def profile_layer(self, layer, input_shape: Tuple[int, ...]) -> Dict[str, Any]:\n",
 " \"\"\"\n",
 " 
Profile a single layer comprehensively.\n", + "\n", + " TODO: Implement layer-wise profiling\n", + "\n", + " APPROACH:\n", + " 1. Count parameters for this layer\n", + " 2. Count FLOPs for this layer\n", + " 3. Measure memory usage\n", + " 4. Measure latency\n", + " 5. Return comprehensive layer profile\n", + "\n", + " EXAMPLE:\n", + " >>> linear = Linear(256, 128)\n", + " >>> profiler = Profiler()\n", + " >>> profile = profiler.profile_layer(linear, (32, 256))\n", + " >>> print(f\"Layer uses {profile['parameters']} parameters\")\n", + " Layer uses 32896 parameters\n", + "\n", + " HINTS:\n", + " - Use existing profiler methods (count_parameters, count_flops, etc.)\n", + " - Create dummy input for latency measurement\n", + " - Include layer type information in profile\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # Create dummy input for latency measurement\n", + " dummy_input = Tensor(np.random.randn(*input_shape))\n", + "\n", + " # Gather all measurements\n", + " params = self.count_parameters(layer)\n", + " flops = self.count_flops(layer, input_shape)\n", + " memory = self.measure_memory(layer, input_shape)\n", + " latency = self.measure_latency(layer, dummy_input, warmup=3, iterations=10)\n", + "\n", + " # Compute derived metrics\n", + " gflops_per_second = (flops / 1e9) / max(latency / 1000, 1e-6)\n", + "\n", + " return {\n", + " 'layer_type': layer.__class__.__name__,\n", + " 'parameters': params,\n", + " 'flops': flops,\n", + " 'latency_ms': latency,\n", + " 'gflops_per_second': gflops_per_second,\n", + " **memory\n", + " }\n", + " ### END SOLUTION\n", + "\n", + " def profile_forward_pass(self, model, input_tensor) -> Dict[str, Any]:\n", + " \"\"\"\n", + " Comprehensive profiling of a model's forward pass.\n", + "\n", + " TODO: Implement complete forward pass analysis\n", + "\n", + " APPROACH:\n", + " 1. Use Profiler class to gather all measurements\n", + " 2. Create comprehensive performance profile\n", + " 3. Add derived metrics and insights\n", + " 4. 
Return structured analysis results\n", + "\n", + " RETURN METRICS:\n", + " - All basic profiler measurements\n", + " - FLOPs per second (computational efficiency)\n", + " - Memory bandwidth utilization\n", + " - Performance bottleneck identification\n", + "\n", + " EXAMPLE:\n", + " >>> model = Linear(256, 128)\n", + " >>> input_data = Tensor(np.random.randn(32, 256))\n", + " >>> profiler = Profiler()\n", + " >>> profile = profiler.profile_forward_pass(model, input_data)\n", + " >>> print(f\"Throughput: {profile['gflops_per_second']:.2f} GFLOP/s\")\n", + " Throughput: 2.45 GFLOP/s\n", + "\n", + " HINTS:\n", + " - GFLOP/s = (FLOPs / 1e9) / (latency_ms / 1000)\n", + " - Memory bandwidth = memory_mb / (latency_ms / 1000)\n", + " - Consider realistic hardware limits for efficiency calculations\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # Basic measurements\n", + " param_count = self.count_parameters(model)\n", + " flops = self.count_flops(model, input_tensor.shape)\n", + " memory_stats = self.measure_memory(model, input_tensor.shape)\n", + " latency_ms = self.measure_latency(model, input_tensor, warmup=5, iterations=20)\n", + "\n", + " # Derived metrics\n", + " latency_seconds = latency_ms / 1000.0\n", + " gflops_per_second = (flops / 1e9) / max(latency_seconds, 1e-6)\n", + "\n", + " # Memory bandwidth (MB/s)\n", + " memory_bandwidth = memory_stats['peak_memory_mb'] / max(latency_seconds, 1e-6)\n", + "\n", + " # Efficiency metrics\n", + " theoretical_peak_gflops = 100.0 # Assume 100 GFLOP/s theoretical peak for CPU\n", + " computational_efficiency = min(gflops_per_second / theoretical_peak_gflops, 1.0)\n", + "\n", + " # Bottleneck analysis\n", + " is_memory_bound = memory_bandwidth > gflops_per_second * 100 # Rough heuristic\n", + " is_compute_bound = not is_memory_bound\n", + "\n", + " return {\n", + " # Basic measurements\n", + " 'parameters': param_count,\n", + " 'flops': flops,\n", + " 'latency_ms': latency_ms,\n", + " **memory_stats,\n", + "\n", + " # 
Derived metrics\n", + " 'gflops_per_second': gflops_per_second,\n", + " 'memory_bandwidth_mbs': memory_bandwidth,\n", + " 'computational_efficiency': computational_efficiency,\n", + "\n", + " # Bottleneck analysis\n", + " 'is_memory_bound': is_memory_bound,\n", + " 'is_compute_bound': is_compute_bound,\n", + " 'bottleneck': 'memory' if is_memory_bound else 'compute'\n", + " }\n", + " ### END SOLUTION\n", + "\n", + " def profile_backward_pass(self, model, input_tensor, _loss_fn=None) -> Dict[str, Any]:\n", + " \"\"\"\n", + " Profile both forward and backward passes for training analysis.\n", + "\n", + " TODO: Implement training-focused profiling\n", + "\n", + " APPROACH:\n", + " 1. Profile forward pass first\n", + " 2. Estimate backward pass costs (typically 2Ɨ forward)\n", + " 3. Calculate total training iteration metrics\n", + " 4. Analyze memory requirements for gradients and optimizers\n", + "\n", + " BACKWARD PASS ESTIMATES:\n", + " - FLOPs: ~2Ɨ forward pass (gradient computation)\n", + " - Memory: +1Ɨ parameters (gradient storage)\n", + " - Latency: ~2Ɨ forward pass (more complex operations)\n", + "\n", + " EXAMPLE:\n", + " >>> model = Linear(128, 64)\n", + " >>> input_data = Tensor(np.random.randn(16, 128))\n", + " >>> profiler = Profiler()\n", + " >>> profile = profiler.profile_backward_pass(model, input_data)\n", + " >>> print(f\"Training iteration: {profile['total_latency_ms']:.2f} ms\")\n", + " Training iteration: 0.45 ms\n", + "\n", + " HINTS:\n", + " - Total memory = parameters + activations + gradients\n", + " - Optimizer memory depends on algorithm (SGD: 0Ɨ, Adam: 2Ɨ)\n", + " - Consider gradient accumulation effects\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # Get forward pass profile\n", + " forward_profile = self.profile_forward_pass(model, input_tensor)\n", + "\n", + " # Estimate backward pass (typically 2Ɨ forward)\n", + " backward_flops = forward_profile['flops'] * 2\n", + " backward_latency_ms = forward_profile['latency_ms'] * 2\n", + 
"\n", + " # Gradient memory (equal to parameter memory)\n", + " gradient_memory_mb = forward_profile['parameter_memory_mb']\n", + "\n", + " # Total training iteration\n", + " total_flops = forward_profile['flops'] + backward_flops\n", + " total_latency_ms = forward_profile['latency_ms'] + backward_latency_ms\n", + " total_memory_mb = (forward_profile['parameter_memory_mb'] +\n", + " forward_profile['activation_memory_mb'] +\n", + " gradient_memory_mb)\n", + "\n", + " # Training efficiency\n", + " total_gflops_per_second = (total_flops / 1e9) / (total_latency_ms / 1000.0)\n", + "\n", + " # Optimizer memory estimates\n", + " optimizer_memory_estimates = {\n", + " 'sgd': 0, # No extra memory\n", + " 'adam': gradient_memory_mb * 2, # Momentum + velocity\n", + " 'adamw': gradient_memory_mb * 2, # Same as Adam\n", + " }\n", + "\n", + " return {\n", + " # Forward pass\n", + " 'forward_flops': forward_profile['flops'],\n", + " 'forward_latency_ms': forward_profile['latency_ms'],\n", + " 'forward_memory_mb': forward_profile['peak_memory_mb'],\n", + "\n", + " # Backward pass estimates\n", + " 'backward_flops': backward_flops,\n", + " 'backward_latency_ms': backward_latency_ms,\n", + " 'gradient_memory_mb': gradient_memory_mb,\n", + "\n", + " # Total training iteration\n", + " 'total_flops': total_flops,\n", + " 'total_latency_ms': total_latency_ms,\n", + " 'total_memory_mb': total_memory_mb,\n", + " 'total_gflops_per_second': total_gflops_per_second,\n", + "\n", + " # Optimizer memory requirements\n", + " 'optimizer_memory_estimates': optimizer_memory_estimates,\n", + "\n", + " # Training insights\n", + " 'memory_efficiency': forward_profile['memory_efficiency'],\n", + " 'bottleneck': forward_profile['bottleneck']\n", + " }\n", " ### END SOLUTION" ] }, { "cell_type": "markdown", - "id": "463b3b6c", + "id": "644d770d", "metadata": { "cell_marker": "\"\"\"", "lines_to_next_cell": 1 }, + "source": [ + "## Helper Functions - Quick Profiling Utilities\n", + "\n", + "These helper 
functions provide simplified interfaces for common profiling tasks.\n", + "They make it easy to quickly profile models and analyze characteristics." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ad647a04", + "metadata": { + "lines_to_next_cell": 1 + }, + "outputs": [], + "source": [ + "#| export\n", + "def quick_profile(model, input_tensor, profiler=None):\n", + " \"\"\"\n", + " Quick profiling function for immediate insights.\n", + " \n", + " Provides a simplified interface for profiling that displays key metrics\n", + " in a student-friendly format.\n", + " \n", + " Args:\n", + " model: Model to profile\n", + " input_tensor: Input data for profiling\n", + " profiler: Optional Profiler instance (creates new one if None)\n", + " \n", + " Returns:\n", + " dict: Profile results with key metrics\n", + " \n", + " Example:\n", + " >>> model = Linear(128, 64)\n", + " >>> input_data = Tensor(np.random.randn(16, 128))\n", + " >>> results = quick_profile(model, input_data)\n", + " >>> # Displays formatted output automatically\n", + " \"\"\"\n", + " if profiler is None:\n", + " profiler = Profiler()\n", + " \n", + " profile = profiler.profile_forward_pass(model, input_tensor)\n", + " \n", + " # Display formatted results\n", + " print(\"šŸ”¬ Quick Profile Results:\")\n", + " print(f\" Parameters: {profile['parameters']:,}\")\n", + " print(f\" FLOPs: {profile['flops']:,}\")\n", + " print(f\" Latency: {profile['latency_ms']:.2f} ms\")\n", + " print(f\" Memory: {profile['peak_memory_mb']:.2f} MB\")\n", + " print(f\" Bottleneck: {profile['bottleneck']}\")\n", + " print(f\" Efficiency: {profile['computational_efficiency']*100:.1f}%\")\n", + " \n", + " return profile\n", + "\n", + "#| export\n", + "def analyze_weight_distribution(model, percentiles=[10, 25, 50, 75, 90]):\n", + " \"\"\"\n", + " Analyze weight distribution for compression insights.\n", + " \n", + " Helps understand which weights are small and might be prunable.\n", + " Used by Module 17 
(Compression) to motivate pruning.\n", + " \n", + " Args:\n", + " model: Model to analyze\n", + " percentiles: List of percentiles to compute\n", + " \n", + " Returns:\n", + " dict: Weight distribution statistics\n", + " \n", + " Example:\n", + " >>> model = Linear(512, 512)\n", + " >>> stats = analyze_weight_distribution(model)\n", + " >>> print(f\"Weights < 0.01: {stats['below_threshold_001']:.1f}%\")\n", + " \"\"\"\n", + " # Collect all weights\n", + " weights = []\n", + " if hasattr(model, 'parameters'):\n", + " for param in model.parameters():\n", + " weights.extend(param.data.flatten().tolist())\n", + " elif hasattr(model, 'weight'):\n", + " weights.extend(model.weight.data.flatten().tolist())\n", + " else:\n", + " return {'error': 'No weights found'}\n", + " \n", + " weights = np.array(weights)\n", + " abs_weights = np.abs(weights)\n", + " \n", + " # Calculate statistics\n", + " stats = {\n", + " 'total_weights': len(weights),\n", + " 'mean': float(np.mean(abs_weights)),\n", + " 'std': float(np.std(abs_weights)),\n", + " 'min': float(np.min(abs_weights)),\n", + " 'max': float(np.max(abs_weights)),\n", + " }\n", + " \n", + " # Percentile analysis\n", + " for p in percentiles:\n", + " stats[f'percentile_{p}'] = float(np.percentile(abs_weights, p))\n", + " \n", + " # Threshold analysis (useful for pruning)\n", + " for threshold in [0.001, 0.01, 0.1]:\n", + " below = np.sum(abs_weights < threshold) / len(weights) * 100\n", + " stats[f'below_threshold_{str(threshold).replace(\".\", \"\")}'] = below\n", + " \n", + " return stats" + ] + }, + { + "cell_type": "markdown", + "id": "68b967c5", + "metadata": { + "cell_marker": "\"\"\"" + }, "source": [ "## Parameter Counting - Model Size Analysis\n", "\n", - "Parameter counting is the foundation of model profiling. Every parameter contributes to memory usage, training time, and model complexity. 
Let's build a robust parameter counter that handles different model architectures.\n", + "Parameter counting is the foundation of model profiling. Every parameter contributes to memory usage, training time, and model complexity. Let's validate our implementation.\n", "\n", "### Why Parameter Counting Matters\n", "```\n", @@ -278,72 +915,12 @@ "Medium: GPT-2 Medium (350M parameters) → 1.4GB memory\n", "Large: GPT-2 Large (774M parameters) → 3.1GB memory\n", "XL: GPT-2 XL (1.5B parameters) → 6.0GB memory\n", - "```\n", - "\n", - "### Parameter Counting Strategy\n", - "Our parameter counter needs to handle different model types:\n", - "- **Single layers** (Linear, Conv2d) with weight and bias\n", - "- **Sequential models** with multiple layers\n", - "- **Custom models** with parameters() method" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4044095a", - "metadata": {}, - "outputs": [], - "source": [ - "def count_parameters(self, model) -> int:\n", - " \"\"\"\n", - " Count total trainable parameters in a model.\n", - "\n", - " TODO: Implement parameter counting for any model with parameters() method\n", - "\n", - " APPROACH:\n", - " 1. Get all parameters from model.parameters() if available\n", - " 2. For single layers, count weight and bias directly\n", - " 3. 
Sum total element count across all parameter tensors\n", - "\n", - " EXAMPLE:\n", - " >>> linear = Linear(128, 64) # 128*64 + 64 = 8256 parameters\n", - " >>> profiler = Profiler()\n", - " >>> count = profiler.count_parameters(linear)\n", - " >>> print(count)\n", - " 8256\n", - "\n", - " HINTS:\n", - " - Use parameter.data.size for tensor element count\n", - " - Handle models with and without parameters() method\n", - " - Don't forget bias terms when present\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " total_params = 0\n", - "\n", - " # Handle different model types\n", - " if hasattr(model, 'parameters'):\n", - " # Model with parameters() method (Sequential, custom models)\n", - " for param in model.parameters():\n", - " total_params += param.data.size\n", - " elif hasattr(model, 'weight'):\n", - " # Single layer (Linear, Conv2d)\n", - " total_params += model.weight.data.size\n", - " if hasattr(model, 'bias') and model.bias is not None:\n", - " total_params += model.bias.data.size\n", - " else:\n", - " # No parameters (activations, etc.)\n", - " total_params = 0\n", - "\n", - " return total_params\n", - " ### END SOLUTION\n", - "\n", - "# Add method to Profiler class\n", - "Profiler.count_parameters = count_parameters" + "```" ] }, { "cell_type": "markdown", - "id": "acdb0834", + "id": "68a302c1", "metadata": { "cell_marker": "\"\"\"", "lines_to_next_cell": 1 @@ -359,7 +936,7 @@ { "cell_type": "code", "execution_count": null, - "id": "4f5a8065", + "id": "9c44b45f", "metadata": { "nbgrader": { "grade": true, @@ -409,15 +986,15 @@ "\n", " print(\"āœ… Parameter counting works correctly!\")\n", "\n", - "test_unit_parameter_counting()" + "if __name__ == \"__main__\":\n", + " test_unit_parameter_counting()" ] }, { "cell_type": "markdown", - "id": "6e9d44c6", + "id": "fd88f0ff", "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 + "cell_marker": "\"\"\"" }, "source": [ "## FLOP Counting - Computational Cost Estimation\n", @@ -449,97 +1026,9 @@ "- 
**Activations**: Usually 1 FLOP per element" ] }, - { - "cell_type": "code", - "execution_count": null, - "id": "218af3a1", - "metadata": {}, - "outputs": [], - "source": [ - "def count_flops(self, model, input_shape: Tuple[int, ...]) -> int:\n", - " \"\"\"\n", - " Count FLOPs (Floating Point Operations) for one forward pass.\n", - "\n", - " TODO: Implement FLOP counting for different layer types\n", - "\n", - " APPROACH:\n", - " 1. Create dummy input with given shape\n", - " 2. Calculate FLOPs based on layer type and dimensions\n", - " 3. Handle different model architectures (Linear, Conv2d, Sequential)\n", - "\n", - " LAYER-SPECIFIC FLOP FORMULAS:\n", - " - Linear: input_features Ɨ output_features Ɨ 2 (matmul + bias)\n", - " - Conv2d: output_h Ɨ output_w Ɨ kernel_h Ɨ kernel_w Ɨ in_channels Ɨ out_channels Ɨ 2\n", - " - Activation: Usually 1 FLOP per element (ReLU, Sigmoid)\n", - "\n", - " EXAMPLE:\n", - " >>> linear = Linear(128, 64)\n", - " >>> profiler = Profiler()\n", - " >>> flops = profiler.count_flops(linear, (1, 128))\n", - " >>> print(flops) # 128 * 64 * 2 = 16384\n", - " 16384\n", - "\n", - " HINTS:\n", - " - Batch dimension doesn't affect per-sample FLOPs\n", - " - Focus on major operations (matmul, conv) first\n", - " - For Sequential models, sum FLOPs of all layers\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Create dummy input\n", - " dummy_input = Tensor(np.random.randn(*input_shape))\n", - " total_flops = 0\n", - "\n", - " # Handle different model types\n", - " if hasattr(model, '__class__'):\n", - " model_name = model.__class__.__name__\n", - "\n", - " if model_name == 'Linear':\n", - " # Linear layer: input_features Ɨ output_features Ɨ 2\n", - " in_features = input_shape[-1]\n", - " out_features = model.weight.shape[1] if hasattr(model, 'weight') else 1\n", - " total_flops = in_features * out_features * 2\n", - "\n", - " elif model_name == 'Conv2d':\n", - " # Conv2d layer: complex calculation based on output size\n", - " # Simplified: 
assume we know the output dimensions\n", - " if hasattr(model, 'kernel_size') and hasattr(model, 'in_channels'):\n", - " batch_size = input_shape[0] if len(input_shape) > 3 else 1\n", - " in_channels = model.in_channels\n", - " out_channels = model.out_channels\n", - " kernel_h = kernel_w = model.kernel_size\n", - "\n", - " # Estimate output size (simplified)\n", - " input_h, input_w = input_shape[-2], input_shape[-1]\n", - " output_h = input_h // (model.stride if hasattr(model, 'stride') else 1)\n", - " output_w = input_w // (model.stride if hasattr(model, 'stride') else 1)\n", - "\n", - " total_flops = (output_h * output_w * kernel_h * kernel_w *\n", - " in_channels * out_channels * 2)\n", - "\n", - " elif model_name == 'Sequential':\n", - " # Sequential model: sum FLOPs of all layers\n", - " current_shape = input_shape\n", - " for layer in model.layers:\n", - " layer_flops = self.count_flops(layer, current_shape)\n", - " total_flops += layer_flops\n", - " # Update shape for next layer (simplified)\n", - " if hasattr(layer, 'weight'):\n", - " current_shape = current_shape[:-1] + (layer.weight.shape[1],)\n", - "\n", - " else:\n", - " # Activation or other: assume 1 FLOP per element\n", - " total_flops = np.prod(input_shape)\n", - "\n", - " return total_flops\n", - " ### END SOLUTION\n", - "\n", - "# Add method to Profiler class\n", - "Profiler.count_flops = count_flops" - ] - }, { "cell_type": "markdown", - "id": "8b02224b", + "id": "e6311a0a", "metadata": { "cell_marker": "\"\"\"", "lines_to_next_cell": 1 @@ -555,7 +1044,7 @@ { "cell_type": "code", "execution_count": null, - "id": "3b947e9e", + "id": "8919b41a", "metadata": { "nbgrader": { "grade": true, @@ -599,15 +1088,15 @@ "\n", " print(\"āœ… FLOP counting works correctly!\")\n", "\n", - "test_unit_flop_counting()" + "if __name__ == \"__main__\":\n", + " test_unit_flop_counting()" ] }, { "cell_type": "markdown", - "id": "f32cf57c", + "id": "9a1d06f7", "metadata": { - "cell_marker": "\"\"\"", - 
"lines_to_next_cell": 1 + "cell_marker": "\"\"\"" }, "source": [ "## Memory Profiling - Understanding Memory Usage Patterns\n", @@ -638,97 +1127,9 @@ "We use Python's `tracemalloc` to track memory allocations during model execution. This gives us precise measurements of memory usage patterns." ] }, - { - "cell_type": "code", - "execution_count": null, - "id": "694a0990", - "metadata": {}, - "outputs": [], - "source": [ - "def measure_memory(self, model, input_shape: Tuple[int, ...]) -> Dict[str, float]:\n", - " \"\"\"\n", - " Measure memory usage during forward pass.\n", - "\n", - " TODO: Implement memory tracking for model execution\n", - "\n", - " APPROACH:\n", - " 1. Use tracemalloc to track memory allocation\n", - " 2. Measure baseline memory before model execution\n", - " 3. Run forward pass and track peak usage\n", - " 4. Calculate different memory components\n", - "\n", - " RETURN DICTIONARY:\n", - " - 'parameter_memory_mb': Memory for model parameters\n", - " - 'activation_memory_mb': Memory for activations\n", - " - 'peak_memory_mb': Maximum memory usage\n", - " - 'memory_efficiency': Ratio of useful to total memory\n", - "\n", - " EXAMPLE:\n", - " >>> linear = Linear(1024, 512)\n", - " >>> profiler = Profiler()\n", - " >>> memory = profiler.measure_memory(linear, (32, 1024))\n", - " >>> print(f\"Parameters: {memory['parameter_memory_mb']:.1f} MB\")\n", - " Parameters: 2.1 MB\n", - "\n", - " HINTS:\n", - " - Use tracemalloc.start() and tracemalloc.get_traced_memory()\n", - " - Account for float32 = 4 bytes per parameter\n", - " - Activation memory scales with batch size\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Start memory tracking\n", - " tracemalloc.start()\n", - "\n", - " # Measure baseline memory\n", - " baseline_memory = tracemalloc.get_traced_memory()[0]\n", - "\n", - " # Calculate parameter memory\n", - " param_count = self.count_parameters(model)\n", - " parameter_memory_bytes = param_count * 4 # Assume float32\n", - " 
parameter_memory_mb = parameter_memory_bytes / (1024 * 1024)\n", - "\n", - " # Create input and measure activation memory\n", - " dummy_input = Tensor(np.random.randn(*input_shape))\n", - " input_memory_bytes = dummy_input.data.nbytes\n", - "\n", - " # Estimate activation memory (simplified)\n", - " activation_memory_bytes = input_memory_bytes * 2 # Rough estimate\n", - " activation_memory_mb = activation_memory_bytes / (1024 * 1024)\n", - "\n", - " # Try to run forward pass and measure peak\n", - " try:\n", - " if hasattr(model, 'forward'):\n", - " _ = model.forward(dummy_input)\n", - " elif hasattr(model, '__call__'):\n", - " _ = model(dummy_input)\n", - " except:\n", - " pass # Ignore errors for simplified measurement\n", - "\n", - " # Get peak memory\n", - " current_memory, peak_memory = tracemalloc.get_traced_memory()\n", - " peak_memory_mb = (peak_memory - baseline_memory) / (1024 * 1024)\n", - "\n", - " tracemalloc.stop()\n", - "\n", - " # Calculate efficiency\n", - " useful_memory = parameter_memory_mb + activation_memory_mb\n", - " memory_efficiency = useful_memory / max(peak_memory_mb, 0.001) # Avoid division by zero\n", - "\n", - " return {\n", - " 'parameter_memory_mb': parameter_memory_mb,\n", - " 'activation_memory_mb': activation_memory_mb,\n", - " 'peak_memory_mb': max(peak_memory_mb, useful_memory),\n", - " 'memory_efficiency': min(memory_efficiency, 1.0)\n", - " }\n", - " ### END SOLUTION\n", - "\n", - "# Add method to Profiler class\n", - "Profiler.measure_memory = measure_memory" - ] - }, { "cell_type": "markdown", - "id": "1d520581", + "id": "a1e39372", "metadata": { "cell_marker": "\"\"\"", "lines_to_next_cell": 1 @@ -744,7 +1145,7 @@ { "cell_type": "code", "execution_count": null, - "id": "88c934b5", + "id": "60ee4331", "metadata": { "nbgrader": { "grade": true, @@ -797,15 +1198,15 @@ "\n", " print(\"āœ… Memory measurement works correctly!\")\n", "\n", - "test_unit_memory_measurement()" + "if __name__ == \"__main__\":\n", + " 
test_unit_memory_measurement()" ] }, { "cell_type": "markdown", - "id": "c45f1b79", + "id": "350bdbd3", "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 + "cell_marker": "\"\"\"" }, "source": [ "## Latency Measurement - Accurate Performance Timing\n", @@ -839,89 +1240,9 @@ "4. **Memory cleanup** to prevent contamination" ] }, - { - "cell_type": "code", - "execution_count": null, - "id": "764b8db5", - "metadata": {}, - "outputs": [], - "source": [ - "def measure_latency(self, model, input_tensor, warmup: int = 10, iterations: int = 100) -> float:\n", - " \"\"\"\n", - " Measure model inference latency with statistical rigor.\n", - "\n", - " TODO: Implement accurate latency measurement\n", - "\n", - " APPROACH:\n", - " 1. Run warmup iterations to stabilize performance\n", - " 2. Measure multiple iterations for statistical accuracy\n", - " 3. Calculate median latency to handle outliers\n", - " 4. Return latency in milliseconds\n", - "\n", - " PARAMETERS:\n", - " - warmup: Number of warmup runs (default 10)\n", - " - iterations: Number of measurement runs (default 100)\n", - "\n", - " EXAMPLE:\n", - " >>> linear = Linear(128, 64)\n", - " >>> input_tensor = Tensor(np.random.randn(1, 128))\n", - " >>> profiler = Profiler()\n", - " >>> latency = profiler.measure_latency(linear, input_tensor)\n", - " >>> print(f\"Latency: {latency:.2f} ms\")\n", - " Latency: 0.15 ms\n", - "\n", - " HINTS:\n", - " - Use time.perf_counter() for high precision\n", - " - Use median instead of mean for robustness against outliers\n", - " - Handle different model interfaces (forward, __call__)\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Warmup runs\n", - " for _ in range(warmup):\n", - " try:\n", - " if hasattr(model, 'forward'):\n", - " _ = model.forward(input_tensor)\n", - " elif hasattr(model, '__call__'):\n", - " _ = model(input_tensor)\n", - " else:\n", - " # Fallback for simple operations\n", - " _ = input_tensor\n", - " except:\n", - " pass # Ignore errors during 
warmup\n", - "\n", - " # Measurement runs\n", - " times = []\n", - " for _ in range(iterations):\n", - " start_time = time.perf_counter()\n", - "\n", - " try:\n", - " if hasattr(model, 'forward'):\n", - " _ = model.forward(input_tensor)\n", - " elif hasattr(model, '__call__'):\n", - " _ = model(input_tensor)\n", - " else:\n", - " # Minimal operation for timing\n", - " _ = input_tensor.data.copy()\n", - " except:\n", - " pass # Ignore errors but still measure time\n", - "\n", - " end_time = time.perf_counter()\n", - " times.append((end_time - start_time) * 1000) # Convert to milliseconds\n", - "\n", - " # Calculate statistics - use median for robustness\n", - " times = np.array(times)\n", - " median_latency = np.median(times)\n", - "\n", - " return float(median_latency)\n", - " ### END SOLUTION\n", - "\n", - "# Add method to Profiler class\n", - "Profiler.measure_latency = measure_latency" - ] - }, { "cell_type": "markdown", - "id": "a7aa639f", + "id": "f1a0465b", "metadata": { "cell_marker": "\"\"\"", "lines_to_next_cell": 1 @@ -937,7 +1258,7 @@ { "cell_type": "code", "execution_count": null, - "id": "3e642916", + "id": "dcc3cff0", "metadata": { "nbgrader": { "grade": true, @@ -986,20 +1307,20 @@ "\n", " print(\"āœ… Latency measurement works correctly!\")\n", "\n", - "test_unit_latency_measurement()" + "if __name__ == \"__main__\":\n", + " test_unit_latency_measurement()" ] }, { "cell_type": "markdown", - "id": "47686a04", + "id": "a5d9a959", "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 + "cell_marker": "\"\"\"" }, "source": [ "## 4. 
Integration: Advanced Profiling Functions\n", "\n", - "Now let's build higher-level profiling functions that combine our core measurements into comprehensive analysis tools.\n", + "Now let's validate our higher-level profiling functions that combine core measurements into comprehensive analysis tools.\n", "\n", "### Advanced Profiling Architecture\n", "```\n", @@ -1007,8 +1328,8 @@ " ↓ ↓ ↓\n", "count_parameters() profile_forward_pass() \"Memory-bound workload\"\n", "count_flops() profile_backward_pass() \"Optimize data movement\"\n", - "measure_memory() benchmark_efficiency() \"Focus on bandwidth\"\n", - "measure_latency() analyze_bottlenecks() \"Use quantization\"\n", + "measure_memory() profile_layer() \"Focus on bandwidth\"\n", + "measure_latency() benchmark_efficiency() \"Use quantization\"\n", "```\n", "\n", "### Forward Pass Profiling - Complete Performance Picture\n", @@ -1016,100 +1337,11 @@ "A forward pass profile combines all our measurements to understand model behavior comprehensively. This is essential for optimization decisions." ] }, - { - "cell_type": "code", - "execution_count": null, - "id": "01dc2eb1", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "advanced_profiling", - "solution": true - } - }, - "outputs": [], - "source": [ - "def profile_forward_pass(model, input_tensor) -> Dict[str, Any]:\n", - " \"\"\"\n", - " Comprehensive profiling of a model's forward pass.\n", - "\n", - " TODO: Implement complete forward pass analysis\n", - "\n", - " APPROACH:\n", - " 1. Use Profiler class to gather all measurements\n", - " 2. Create comprehensive performance profile\n", - " 3. Add derived metrics and insights\n", - " 4. 
Return structured analysis results\n", - "\n", - " RETURN METRICS:\n", - " - All basic profiler measurements\n", - " - FLOPs per second (computational efficiency)\n", - " - Memory bandwidth utilization\n", - " - Performance bottleneck identification\n", - "\n", - " EXAMPLE:\n", - " >>> model = Linear(256, 128)\n", - " >>> input_data = Tensor(np.random.randn(32, 256))\n", - " >>> profile = profile_forward_pass(model, input_data)\n", - " >>> print(f\"Throughput: {profile['gflops_per_second']:.2f} GFLOP/s\")\n", - " Throughput: 2.45 GFLOP/s\n", - "\n", - " HINTS:\n", - " - GFLOP/s = (FLOPs / 1e9) / (latency_ms / 1000)\n", - " - Memory bandwidth = memory_mb / (latency_ms / 1000)\n", - " - Consider realistic hardware limits for efficiency calculations\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " profiler = Profiler()\n", - "\n", - " # Basic measurements\n", - " param_count = profiler.count_parameters(model)\n", - " flops = profiler.count_flops(model, input_tensor.shape)\n", - " memory_stats = profiler.measure_memory(model, input_tensor.shape)\n", - " latency_ms = profiler.measure_latency(model, input_tensor, warmup=5, iterations=20)\n", - "\n", - " # Derived metrics\n", - " latency_seconds = latency_ms / 1000.0\n", - " gflops_per_second = (flops / 1e9) / max(latency_seconds, 1e-6)\n", - "\n", - " # Memory bandwidth (MB/s)\n", - " memory_bandwidth = memory_stats['peak_memory_mb'] / max(latency_seconds, 1e-6)\n", - "\n", - " # Efficiency metrics\n", - " theoretical_peak_gflops = 100.0 # Assume 100 GFLOP/s theoretical peak for CPU\n", - " computational_efficiency = min(gflops_per_second / theoretical_peak_gflops, 1.0)\n", - "\n", - " # Bottleneck analysis\n", - " is_memory_bound = memory_bandwidth > gflops_per_second * 100 # Rough heuristic\n", - " is_compute_bound = not is_memory_bound\n", - "\n", - " return {\n", - " # Basic measurements\n", - " 'parameters': param_count,\n", - " 'flops': flops,\n", - " 'latency_ms': latency_ms,\n", - " **memory_stats,\n", - "\n", - 
" # Derived metrics\n", - " 'gflops_per_second': gflops_per_second,\n", - " 'memory_bandwidth_mbs': memory_bandwidth,\n", - " 'computational_efficiency': computational_efficiency,\n", - "\n", - " # Bottleneck analysis\n", - " 'is_memory_bound': is_memory_bound,\n", - " 'is_compute_bound': is_compute_bound,\n", - " 'bottleneck': 'memory' if is_memory_bound else 'compute'\n", - " }\n", - " ### END SOLUTION" - ] - }, { "cell_type": "markdown", - "id": "16cc4aaf", + "id": "791555b9", "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 + "cell_marker": "\"\"\"" }, "source": [ "### Backward Pass Profiling - Training Analysis\n", @@ -1135,102 +1367,9 @@ "```" ] }, - { - "cell_type": "code", - "execution_count": null, - "id": "20aab8e4", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "def profile_backward_pass(model, input_tensor, loss_fn=None) -> Dict[str, Any]:\n", - " \"\"\"\n", - " Profile both forward and backward passes for training analysis.\n", - "\n", - " TODO: Implement training-focused profiling\n", - "\n", - " APPROACH:\n", - " 1. Profile forward pass first\n", - " 2. Estimate backward pass costs (typically 2Ɨ forward)\n", - " 3. Calculate total training iteration metrics\n", - " 4. 
Analyze memory requirements for gradients and optimizers\n", - "\n", - " BACKWARD PASS ESTIMATES:\n", - " - FLOPs: ~2Ɨ forward pass (gradient computation)\n", - " - Memory: +1Ɨ parameters (gradient storage)\n", - " - Latency: ~2Ɨ forward pass (more complex operations)\n", - "\n", - " EXAMPLE:\n", - " >>> model = Linear(128, 64)\n", - " >>> input_data = Tensor(np.random.randn(16, 128))\n", - " >>> profile = profile_backward_pass(model, input_data)\n", - " >>> print(f\"Training iteration: {profile['total_latency_ms']:.2f} ms\")\n", - " Training iteration: 0.45 ms\n", - "\n", - " HINTS:\n", - " - Total memory = parameters + activations + gradients\n", - " - Optimizer memory depends on algorithm (SGD: 0Ɨ, Adam: 2Ɨ)\n", - " - Consider gradient accumulation effects\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Get forward pass profile\n", - " forward_profile = profile_forward_pass(model, input_tensor)\n", - "\n", - " # Estimate backward pass (typically 2Ɨ forward)\n", - " backward_flops = forward_profile['flops'] * 2\n", - " backward_latency_ms = forward_profile['latency_ms'] * 2\n", - "\n", - " # Gradient memory (equal to parameter memory)\n", - " gradient_memory_mb = forward_profile['parameter_memory_mb']\n", - "\n", - " # Total training iteration\n", - " total_flops = forward_profile['flops'] + backward_flops\n", - " total_latency_ms = forward_profile['latency_ms'] + backward_latency_ms\n", - " total_memory_mb = (forward_profile['parameter_memory_mb'] +\n", - " forward_profile['activation_memory_mb'] +\n", - " gradient_memory_mb)\n", - "\n", - " # Training efficiency\n", - " total_gflops_per_second = (total_flops / 1e9) / (total_latency_ms / 1000.0)\n", - "\n", - " # Optimizer memory estimates\n", - " optimizer_memory_estimates = {\n", - " 'sgd': 0, # No extra memory\n", - " 'adam': gradient_memory_mb * 2, # Momentum + velocity\n", - " 'adamw': gradient_memory_mb * 2, # Same as Adam\n", - " }\n", - "\n", - " return {\n", - " # Forward pass\n", - " 
'forward_flops': forward_profile['flops'],\n", - " 'forward_latency_ms': forward_profile['latency_ms'],\n", - " 'forward_memory_mb': forward_profile['peak_memory_mb'],\n", - "\n", - " # Backward pass estimates\n", - " 'backward_flops': backward_flops,\n", - " 'backward_latency_ms': backward_latency_ms,\n", - " 'gradient_memory_mb': gradient_memory_mb,\n", - "\n", - " # Total training iteration\n", - " 'total_flops': total_flops,\n", - " 'total_latency_ms': total_latency_ms,\n", - " 'total_memory_mb': total_memory_mb,\n", - " 'total_gflops_per_second': total_gflops_per_second,\n", - "\n", - " # Optimizer memory requirements\n", - " 'optimizer_memory_estimates': optimizer_memory_estimates,\n", - "\n", - " # Training insights\n", - " 'memory_efficiency': forward_profile['memory_efficiency'],\n", - " 'bottleneck': forward_profile['bottleneck']\n", - " }\n", - " ### END SOLUTION" - ] - }, { "cell_type": "markdown", - "id": "a66d79fe", + "id": "24236272", "metadata": { "cell_marker": "\"\"\"", "lines_to_next_cell": 1 @@ -1246,7 +1385,7 @@ { "cell_type": "code", "execution_count": null, - "id": "f7838a43", + "id": "1516ed04", "metadata": { "nbgrader": { "grade": true, @@ -1261,11 +1400,12 @@ " \"\"\"šŸ”¬ Test advanced profiling functions.\"\"\"\n", " print(\"šŸ”¬ Unit Test: Advanced Profiling Functions...\")\n", "\n", - " # Create test model and input\n", + " # Create profiler and test model\n", + " profiler = Profiler()\n", " test_input = Tensor(np.random.randn(4, 8))\n", "\n", " # Test forward pass profiling\n", - " forward_profile = profile_forward_pass(test_input, test_input)\n", + " forward_profile = profiler.profile_forward_pass(test_input, test_input)\n", "\n", " # Validate forward profile structure\n", " required_forward_keys = [\n", @@ -1284,7 +1424,7 @@ " print(f\"āœ… Forward profiling: {forward_profile['gflops_per_second']:.2f} GFLOP/s\")\n", "\n", " # Test backward pass profiling\n", - " backward_profile = profile_backward_pass(test_input, test_input)\n", + " 
backward_profile = profiler.profile_backward_pass(test_input, test_input)\n", "\n", " # Validate backward profile structure\n", " required_backward_keys = [\n", @@ -1311,12 +1451,13 @@ " print(f\"āœ… Memory breakdown: {backward_profile['total_memory_mb']:.2f} MB training\")\n", " print(\"āœ… Advanced profiling functions work correctly!\")\n", "\n", - "test_unit_advanced_profiling()" + "if __name__ == \"__main__\":\n", + " test_unit_advanced_profiling()" ] }, { "cell_type": "markdown", - "id": "768f21e5", + "id": "b52a9046", "metadata": { "cell_marker": "\"\"\"", "lines_to_next_cell": 1 @@ -1343,7 +1484,7 @@ { "cell_type": "code", "execution_count": null, - "id": "7f90a148", + "id": "331e282f", "metadata": { "nbgrader": { "grade": false, @@ -1410,10 +1551,8 @@ " avg_efficiency = np.mean([r['gflops_per_second'] for r in results])\n", " if avg_efficiency < 10: # Arbitrary threshold for \"low\" efficiency\n", " print(\"šŸš€ Low compute efficiency suggests memory-bound workload\")\n", - " print(\" → Optimization focus: Data layout, memory bandwidth, caching\")\n", " else:\n", " print(\"šŸš€ High compute efficiency suggests compute-bound workload\")\n", - " print(\" → Optimization focus: Algorithmic efficiency, vectorization\")\n", "\n", "def analyze_batch_size_effects():\n", " \"\"\"šŸ“Š Analyze how batch size affects performance and efficiency.\"\"\"\n", @@ -1445,18 +1584,17 @@ " f\"{memory['peak_memory_mb']:.2f}\\t\\t{efficiency:.1f}\")\n", "\n", " print(\"\\nšŸ’” Batch Size Insights:\")\n", - " print(\"• Larger batches typically improve throughput but increase memory usage\")\n", - " print(\"• Sweet spot balances throughput and memory constraints\")\n", - " print(\"• Memory efficiency = samples/s per MB (higher is better)\")\n", + " print(\"Larger batches typically improve throughput but increase memory usage\")\n", "\n", "# Run the analysis\n", - "analyze_model_scaling()\n", - "analyze_batch_size_effects()" + "if __name__ == \"__main__\":\n", + " 
analyze_model_scaling()\n", + " analyze_batch_size_effects()" ] }, { "cell_type": "markdown", - "id": "0563e9cd", + "id": "08957c5b", "metadata": { "cell_marker": "\"\"\"", "lines_to_next_cell": 1 @@ -1489,7 +1627,7 @@ { "cell_type": "code", "execution_count": null, - "id": "c506a927", + "id": "750be525", "metadata": { "nbgrader": { "grade": false, @@ -1564,8 +1702,8 @@ " best_op = max(operations, key=lambda x: x['gflops_per_second'])\n", " worst_op = min(operations, key=lambda x: x['gflops_per_second'])\n", "\n", - " print(f\"• Most efficient: {best_op['operation']} ({best_op['gflops_per_second']:.2f} GFLOP/s)\")\n", - " print(f\"• Least efficient: {worst_op['operation']} ({worst_op['gflops_per_second']:.2f} GFLOP/s)\")\n", + " print(f\"Most efficient: {best_op['operation']} ({best_op['gflops_per_second']:.2f} GFLOP/s)\")\n", + " print(f\"Least efficient: {worst_op['operation']} ({worst_op['gflops_per_second']:.2f} GFLOP/s)\")\n", "\n", " # Count operation types\n", " memory_bound_ops = [op for op in operations if op['efficiency_class'] == 'memory-bound']\n", @@ -1573,11 +1711,9 @@ "\n", " print(f\"\\nšŸš€ Optimization Priority:\")\n", " if len(memory_bound_ops) > len(compute_bound_ops):\n", - " print(\"• Focus on memory optimization: data locality, bandwidth, caching\")\n", - " print(\"• Consider operation fusion to reduce memory traffic\")\n", + " print(\"Focus on memory optimization: data locality, bandwidth, caching\")\n", " else:\n", - " print(\"• Focus on compute optimization: better algorithms, vectorization\")\n", - " print(\"• Consider specialized libraries (BLAS, cuBLAS)\")\n", + " print(\"Focus on compute optimization: better algorithms, vectorization\")\n", "\n", "def analyze_profiling_overhead():\n", " \"\"\"šŸ“Š Measure the overhead of profiling itself.\"\"\"\n", @@ -1611,29 +1747,21 @@ "\n", " print(f\"\\nšŸ’” Profiling Overhead Insights:\")\n", " if overhead_factor < 2:\n", - " print(\"• Low overhead - suitable for frequent profiling\")\n", - " 
print(\"• Can be used in development with minimal impact\")\n", + " print(\"Low overhead - suitable for frequent profiling\")\n", " elif overhead_factor < 10:\n", - " print(\"• Moderate overhead - use for development and debugging\")\n", - " print(\"• Disable for production unless investigating issues\")\n", + " print(\"Moderate overhead - use for development and debugging\")\n", " else:\n", - " print(\"• High overhead - use sparingly in production\")\n", - " print(\"• Enable only when investigating specific performance issues\")\n", - "\n", - " print(f\"\\nšŸš€ Profiling Best Practices:\")\n", - " print(\"• Profile during development to identify bottlenecks\")\n", - " print(\"• Use production profiling only for investigation\")\n", - " print(\"• Focus measurement on critical code paths\")\n", - " print(\"• Balance measurement detail with overhead cost\")\n", + " print(\"High overhead - use sparingly in production\")\n", "\n", "# Run optimization analysis\n", - "benchmark_operation_efficiency()\n", - "analyze_profiling_overhead()" + "if __name__ == \"__main__\":\n", + " benchmark_operation_efficiency()\n", + " analyze_profiling_overhead()" ] }, { "cell_type": "markdown", - "id": "e7a5de0d", + "id": "a170135d", "metadata": { "cell_marker": "\"\"\"", "lines_to_next_cell": 1 @@ -1647,7 +1775,7 @@ { "cell_type": "code", "execution_count": null, - "id": "d922a54d", + "id": "379ab83a", "metadata": { "nbgrader": { "grade": true, @@ -1704,8 +1832,8 @@ "\n", " # Test advanced profiling\n", " print(\"2. 
Running advanced profiling...\")\n", - " forward_profile = profile_forward_pass(test_model, test_input)\n", - " backward_profile = profile_backward_pass(test_model, test_input)\n", + " forward_profile = profiler.profile_forward_pass(test_model, test_input)\n", + " backward_profile = profiler.profile_backward_pass(test_model, test_input)\n", "\n", " assert 'gflops_per_second' in forward_profile\n", " assert 'total_latency_ms' in backward_profile\n", @@ -1736,7 +1864,7 @@ "\n", " # Simulate larger model analysis\n", " large_input = Tensor(np.random.randn(32, 512)) # Larger model input\n", - " large_profile = profile_forward_pass(large_input, large_input)\n", + " large_profile = profiler.profile_forward_pass(large_input, large_input)\n", "\n", " # Verify profile contains optimization insights\n", " assert 'bottleneck' in large_profile, \"Profile should identify bottlenecks\"\n", @@ -1749,16 +1877,17 @@ "\n", " print(\"\\n\" + \"=\" * 50)\n", " print(\"šŸŽ‰ ALL TESTS PASSED! Module ready for export.\")\n", - " print(\"Run: tito module complete 15\")\n", + " print(\"Run: tito module complete 14\")\n", "\n", "# Call before module summary\n", - "test_module()" + "if __name__ == \"__main__\":\n", + " test_module()" ] }, { "cell_type": "code", "execution_count": null, - "id": "378e2ca8", + "id": "6502f689", "metadata": {}, "outputs": [], "source": [ @@ -1770,7 +1899,7 @@ }, { "cell_type": "markdown", - "id": "e44c6173", + "id": "b4ff25e4", "metadata": { "cell_marker": "\"\"\"" }, @@ -1805,143 +1934,7 @@ }, { "cell_type": "markdown", - "id": "ab131290", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## šŸ Consolidated Profiler for Export\n", - "\n", - "Now that we've implemented all profiling methods, let's create a consolidated Profiler class\n", - "for export to the tinytorch package. This allows milestones to use the full profiler." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "dd3324fa", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "profiler_export", - "solution": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class ProfilerComplete:\n", - " \"\"\"\n", - " Complete profiler with all measurement capabilities for milestone use.\n", - " \n", - " This is the exported version students build through the module exercises.\n", - " \"\"\"\n", - " \n", - " def __init__(self):\n", - " \"\"\"Initialize profiler with measurement state.\"\"\"\n", - " self.measurements = {}\n", - " self.operation_counts = defaultdict(int)\n", - " self.memory_tracker = None\n", - " \n", - " def count_parameters(self, model) -> int:\n", - " \"\"\"Count total trainable parameters in a model.\"\"\"\n", - " total_params = 0\n", - " \n", - " if hasattr(model, 'parameters'):\n", - " for param in model.parameters():\n", - " total_params += param.data.size\n", - " elif hasattr(model, 'weight'):\n", - " total_params += model.weight.data.size\n", - " if hasattr(model, 'bias') and model.bias is not None:\n", - " total_params += model.bias.data.size\n", - " \n", - " return total_params\n", - " \n", - " def count_flops(self, model, input_shape: Tuple[int, ...]) -> int:\n", - " \"\"\"Count FLOPs for one forward pass.\"\"\"\n", - " dummy_input = Tensor(np.random.randn(*input_shape))\n", - " total_flops = 0\n", - " \n", - " if hasattr(model, '__class__'):\n", - " model_name = model.__class__.__name__\n", - " \n", - " if model_name == 'Linear':\n", - " in_features = input_shape[-1]\n", - " out_features = model.weight.shape[1] if hasattr(model, 'weight') else 1\n", - " total_flops = in_features * out_features * 2\n", - " \n", - " elif model_name == 'Conv2d':\n", - " total_flops = 1000000 # Simplified for now\n", - " \n", - " return total_flops\n", - " \n", - " def measure_memory(self, model, input_shape: Tuple[int, ...]) -> Dict[str, float]:\n", - " 
\"\"\"Measure memory usage during forward pass.\"\"\"\n", - " tracemalloc.start()\n", - " baseline_memory = tracemalloc.get_traced_memory()[0]\n", - " \n", - " param_count = self.count_parameters(model)\n", - " parameter_memory_bytes = param_count * 4\n", - " parameter_memory_mb = parameter_memory_bytes / (1024 * 1024)\n", - " \n", - " dummy_input = Tensor(np.random.randn(*input_shape))\n", - " \n", - " try:\n", - " if hasattr(model, 'forward'):\n", - " output = model.forward(dummy_input)\n", - " elif hasattr(model, '__call__'):\n", - " output = model(dummy_input)\n", - " except:\n", - " output = dummy_input\n", - " \n", - " peak_memory, _ = tracemalloc.get_traced_memory()\n", - " tracemalloc.stop()\n", - " \n", - " peak_memory_mb = peak_memory / (1024 * 1024)\n", - " activation_memory_mb = max(0, peak_memory_mb - parameter_memory_mb)\n", - " \n", - " return {\n", - " 'parameter_memory_mb': parameter_memory_mb,\n", - " 'activation_memory_mb': activation_memory_mb,\n", - " 'peak_memory_mb': peak_memory_mb,\n", - " 'memory_efficiency': parameter_memory_mb / peak_memory_mb if peak_memory_mb > 0 else 0\n", - " }\n", - " \n", - " def measure_latency(self, model, input_tensor, warmup: int = 10, iterations: int = 100) -> float:\n", - " \"\"\"Measure model inference latency with statistical rigor.\"\"\"\n", - " # Warmup\n", - " for _ in range(warmup):\n", - " try:\n", - " if hasattr(model, 'forward'):\n", - " _ = model.forward(input_tensor)\n", - " elif hasattr(model, '__call__'):\n", - " _ = model(input_tensor)\n", - " except:\n", - " pass\n", - " \n", - " # Measurement\n", - " times = []\n", - " for _ in range(iterations):\n", - " start = time.perf_counter()\n", - " try:\n", - " if hasattr(model, 'forward'):\n", - " _ = model.forward(input_tensor)\n", - " elif hasattr(model, '__call__'):\n", - " _ = model(input_tensor)\n", - " except:\n", - " pass\n", - " end = time.perf_counter()\n", - " times.append(end - start)\n", - " \n", - " median_latency_ms = np.median(times) * 
1000\n",
- " return median_latency_ms"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "dc025a52",
+ "id": "72dec7d6",
 "metadata": {
 "cell_marker": "\"\"\""
 },
@@ -1970,10 +1963,10 @@
 "- **Statistical Rigor**: Handle measurement variance with proper methodology\n",
 "\n",
 "### Ready for Next Steps\n",
- "Your profiling implementation enables Module 16 (Acceleration) to make data-driven optimization decisions.\n",
- "Export with: `tito module complete 15`\n",
+ "Your profiling implementation enables the optimization modules (15-18) to make data-driven optimization decisions.\n",
+ "Export with: `tito module complete 14`\n",
 "\n",
- "**Next**: Module 16 will use these profiling tools to implement acceleration techniques and measure their effectiveness!"
+ "**Next**: Module 15 (Memoization) will use profiling to discover transformer bottlenecks and fix them!"
 ]
 }
 ],