mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-04-30 23:47:32 -05:00
1983 lines
84 KiB
Plaintext
{
"cells": [
{
"cell_type": "markdown",
"id": "55618ade",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"# Module 14: Profiling - Measuring What Matters in ML Systems\n",
"\n",
"Welcome to Module 14! You'll build professional profiling tools to measure model performance and uncover optimization opportunities.\n",
"\n",
"## 🔗 Prerequisites & Progress\n",
"**You've Built**: Complete ML stack from tensors to transformers\n",
"**You'll Build**: Comprehensive profiling system for parameters, FLOPs, memory, and latency\n",
"**You'll Enable**: Data-driven optimization decisions and performance analysis\n",
"\n",
"**Connection Map**:\n",
"```\n",
"All Modules (01-13) → Profiling (14) → Optimization Techniques (15-18)\n",
"  (implementations)    (measurement)       (targeted fixes)\n",
"```\n",
"\n",
"## Learning Objectives\n",
"By the end of this module, you will:\n",
"1. Implement a complete Profiler class for model analysis\n",
"2. Count parameters and FLOPs accurately for different architectures\n",
"3. Measure memory usage and latency with statistical rigor\n",
"4. Create production-quality performance analysis tools\n",
"\n",
"Let's build the measurement foundation for ML systems optimization!\n",
"\n",
"## 📦 Where This Code Lives in the Final Package\n",
"\n",
"**Learning Side:** You work in `modules/14_profiling/profiling_dev.py`\n",
"**Building Side:** Code exports to `tinytorch.profiling.profiler`\n",
"\n",
"```python\n",
"# How to use this module:\n",
"from tinytorch.profiling.profiler import Profiler, profile_forward_pass, profile_backward_pass\n",
"```\n",
"\n",
"**Why this matters:**\n",
"- **Learning:** Complete profiling system for understanding model performance characteristics\n",
"- **Production:** Professional measurement tools like those used in PyTorch, TensorFlow\n",
"- **Consistency:** All profiling and measurement tools in profiling.profiler\n",
"- **Integration:** Works with any model built using TinyTorch components"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d92307b1",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "imports",
"solution": true
}
},
"outputs": [],
"source": [
"#| default_exp profiling.profiler\n",
"#| export\n",
"\n",
"import time\n",
"import numpy as np\n",
"import tracemalloc\n",
"from typing import Dict, List, Any, Optional, Tuple\n",
"from collections import defaultdict\n",
"import gc\n",
"\n",
"# Import our TinyTorch components for profiling\n",
"from tinytorch.core.tensor import Tensor\n",
"from tinytorch.core.layers import Linear\n",
"from tinytorch.core.spatial import Conv2d"
]
},
{
"cell_type": "markdown",
"id": "6e4fb271",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 1. Introduction: Why Profiling Matters in ML Systems\n",
"\n",
"Imagine you're a detective investigating a performance crime. Your model is running slowly, using too much memory, or burning through compute budgets. Without profiling, you're flying blind - making guesses about what to optimize. With profiling, you have evidence.\n",
"\n",
"**The Performance Investigation Process:**\n",
"```\n",
"Suspect Model → Profile Evidence → Identify Bottleneck → Target Optimization\n",
"      ↓                ↓                   ↓                     ↓\n",
"  \"Too slow\"     \"200 GFLOP/s\"      \"Memory bound\"      \"Reduce transfers\"\n",
"```\n",
"\n",
"**Questions Profiling Answers:**\n",
"- **How many parameters?** (Memory footprint, model size)\n",
"- **How many FLOPs?** (Computational cost, energy usage)\n",
"- **Where are bottlenecks?** (Memory vs compute bound)\n",
"- **What's actual latency?** (Real-world performance)\n",
"\n",
"**Production Importance:**\n",
"In production ML systems, profiling isn't optional - it's survival. A model that's 10% more accurate but 100× slower often can't be deployed. Teams use profiling daily to make data-driven optimization decisions, not guesses.\n",
"\n",
"### The Profiling Workflow Visualization\n",
"```\n",
"Model → Profiler → Measurements → Analysis → Optimization Decision\n",
"  ↓        ↓            ↓            ↓                ↓\n",
" GPT    Parameter   125M params   Memory      Use quantization\n",
"        Counter     2.5B FLOPs    bound       Reduce precision\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "ddfa3dfb",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"### 🔗 From Optimization to Discovery: Connecting Module 14\n",
"\n",
"**In the previous module**, you implemented KV caching and saw a 10-15× speedup.\n",
"**In this module**, you'll learn HOW to discover such optimization opportunities.\n",
"\n",
"**The Real ML Engineering Workflow**:\n",
"```\n",
"Step 1: Measure (This Module!)        Step 2: Analyze\n",
"        ↓                                     ↓\n",
"Profile baseline → Find bottleneck → Understand cause\n",
"40 tok/s           80% in attention  O(n²) recomputation\n",
"                                             ↓\n",
"Step 4: Validate                      Step 3: Optimize (previous module)\n",
"        ↓                                     ↓\n",
"Profile optimized ← Verify speedup ← Implement KV cache\n",
"500 tok/s (12.5x)   Measure impact   Design solution\n",
"```\n",
"\n",
"**Without this module's profiling**: You'd never know WHERE to optimize!\n",
"**Without the previous module's optimization**: You couldn't FIX the bottleneck!\n",
"\n",
"This module teaches the measurement and analysis skills that enable\n",
"optimization breakthroughs like KV caching. You'll profile real models\n",
"and discover bottlenecks just like production ML teams do."
]
},
{
"cell_type": "markdown",
"id": "d5a2e470",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 2. Foundations: Performance Measurement Principles\n",
"\n",
"Before we build our profiler, let's understand what we're measuring and why each metric matters.\n",
"\n",
"### Parameter Counting - Model Size Detective Work\n",
"\n",
"Parameters determine your model's memory footprint and storage requirements. Every parameter is typically a 32-bit float (4 bytes), so counting them precisely predicts memory usage.\n",
"\n",
"**Parameter Counting Formula:**\n",
"```\n",
"Linear Layer: (input_features × output_features) + output_features = total parameters\n",
"                        ↑                               ↑\n",
"                  Weight matrix                    Bias vector\n",
"\n",
"Example: Linear(768, 3072) → (768 × 3072) + 3072 = 2,362,368 parameters\n",
"Memory: 2,362,368 × 4 bytes = 9.45 MB\n",
"```\n",
"\n",
"### FLOP Counting - Computational Cost Analysis\n",
"\n",
"FLOPs (Floating Point Operations) measure computational work. Unlike wall-clock time, FLOPs are hardware-independent and predict compute costs across different systems.\n",
"\n",
"**FLOP Formulas for Key Operations:**\n",
"```\n",
"Matrix Multiplication (M,K) @ (K,N):\n",
"    FLOPs = M × N × K × 2\n",
"            ↑   ↑   ↑   ↑\n",
"          Rows Cols Inner Multiply+Add\n",
"\n",
"Linear Layer Forward:\n",
"    FLOPs = batch_size × input_features × output_features × 2\n",
"    (matmul cost for the whole batch; the ×2 counts one multiply + one add,\n",
"     and the bias add contributes a further batch_size × output_features ops)\n",
"\n",
"Convolution (simplified):\n",
"    FLOPs = output_H × output_W × kernel_H × kernel_W × in_channels × out_channels × 2\n",
"```\n",
"\n",
"### Memory Profiling - The Three Types of Memory\n",
"\n",
"ML models use memory in three distinct ways, each with different optimization strategies:\n",
"\n",
"**Memory Type Breakdown:**\n",
"```\n",
"Total Training Memory = Parameters + Activations + Gradients + Optimizer State\n",
"                             ↓            ↓            ↓             ↓\n",
"                           Model       Forward      Backward   Adam: 2×params\n",
"                          weights     pass cache   gradients   SGD:  0×params\n",
"\n",
"Example for 125M parameter model:\n",
"Parameters:   500 MB  (125M × 4 bytes)\n",
"Activations:  200 MB  (depends on batch size)\n",
"Gradients:    500 MB  (same as parameters)\n",
"Adam state: 1,000 MB  (momentum + velocity)\n",
"Total:      2,200 MB  (4.4× parameter memory!)\n",
"```\n",
"\n",
"### Latency Measurement - Dealing with Reality\n",
"\n",
"Latency measurement is tricky because systems have variance, warmup effects, and measurement overhead. Professional profiling requires statistical rigor.\n",
"\n",
"**Latency Measurement Best Practices:**\n",
"```\n",
"Measurement Protocol:\n",
"1. Warmup runs (10+)  → CPU/GPU caches warm up\n",
"2. Timed runs (100+)  → Statistical significance\n",
"3. Outlier handling   → Use median, not mean\n",
"4. Memory cleanup     → Prevent contamination\n",
"\n",
"Timeline:\n",
"Warmup: [run][run][run]...[run]      ← Don't time these\n",
"Timing: [⏱run⏱][⏱run⏱]...[⏱run⏱]    ← Time these\n",
"Result: median(all_times)            ← Robust to outliers\n",
"```\n",
]
},
{
"cell_type": "markdown",
"id": "c466e14d",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"## 3. Implementation: Building the Core Profiler Class\n",
"\n",
"Now let's implement our profiler step by step. We'll start with the foundation and build up to comprehensive analysis.\n",
"\n",
"### The Profiler Architecture\n",
"```\n",
"Profiler Class\n",
"├── count_parameters()      → Model size analysis\n",
"├── count_flops()           → Computational cost estimation\n",
"├── measure_memory()        → Memory usage tracking\n",
"├── measure_latency()       → Performance timing\n",
"├── profile_layer()         → Layer-wise analysis\n",
"├── profile_forward_pass()  → Complete forward analysis\n",
"└── profile_backward_pass() → Training analysis\n",
"\n",
"Integration:\n",
"All methods work together to provide comprehensive performance insights\n",
"```\n",
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "31829387",
"metadata": {
"lines_to_next_cell": 1,
"nbgrader": {
"grade": false,
"grade_id": "profiler_class",
"solution": true
}
},
"outputs": [],
"source": [
"#| export\n",
"class Profiler:\n",
"    \"\"\"\n",
"    Professional-grade ML model profiler for performance analysis.\n",
"\n",
"    Measures parameters, FLOPs, memory usage, and latency with statistical rigor.\n",
"    Used for optimization guidance and deployment planning.\n",
"    \"\"\"\n",
"\n",
"    def __init__(self):\n",
"        \"\"\"\n",
"        Initialize profiler with measurement state.\n",
"\n",
"        TODO: Set up profiler tracking structures\n",
"\n",
"        APPROACH:\n",
"        1. Create empty measurements dictionary\n",
"        2. Initialize operation counters\n",
"        3. Set up memory tracking state\n",
"\n",
"        EXAMPLE:\n",
"        >>> profiler = Profiler()\n",
"        >>> profiler.measurements\n",
"        {}\n",
"\n",
"        HINTS:\n",
"        - Use defaultdict(int) for operation counters\n",
"        - measurements dict will store timing results\n",
"        \"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        self.measurements = {}\n",
"        self.operation_counts = defaultdict(int)\n",
"        self.memory_tracker = None\n",
"        ### END SOLUTION\n",
"\n",
"    def count_parameters(self, model) -> int:\n",
"        \"\"\"\n",
"        Count total trainable parameters in a model.\n",
"\n",
"        TODO: Implement parameter counting for any model with parameters() method\n",
"\n",
"        APPROACH:\n",
"        1. Get all parameters from model.parameters() if available\n",
"        2. For single layers, count weight and bias directly\n",
"        3. Sum total element count across all parameter tensors\n",
"\n",
"        EXAMPLE:\n",
"        >>> linear = Linear(128, 64)  # 128*64 + 64 = 8256 parameters\n",
"        >>> profiler = Profiler()\n",
"        >>> count = profiler.count_parameters(linear)\n",
"        >>> print(count)\n",
"        8256\n",
"\n",
"        HINTS:\n",
"        - Use parameter.data.size for tensor element count\n",
"        - Handle models with and without parameters() method\n",
"        - Don't forget bias terms when present\n",
"        \"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        total_params = 0\n",
"\n",
"        # Handle different model types\n",
"        if hasattr(model, 'parameters'):\n",
"            # Model with parameters() method (Sequential, custom models)\n",
"            for param in model.parameters():\n",
"                total_params += param.data.size\n",
"        elif hasattr(model, 'weight'):\n",
"            # Single layer (Linear, Conv2d)\n",
"            total_params += model.weight.data.size\n",
"            if hasattr(model, 'bias') and model.bias is not None:\n",
"                total_params += model.bias.data.size\n",
"        else:\n",
"            # No parameters (activations, etc.)\n",
"            total_params = 0\n",
"\n",
"        return total_params\n",
"        ### END SOLUTION\n",
"\n",
"    def count_flops(self, model, input_shape: Tuple[int, ...]) -> int:\n",
"        \"\"\"\n",
"        Count FLOPs (Floating Point Operations) for one forward pass.\n",
"\n",
"        TODO: Implement FLOP counting for different layer types\n",
"\n",
"        APPROACH:\n",
"        1. Create dummy input with given shape\n",
"        2. Calculate FLOPs based on layer type and dimensions\n",
"        3. Handle different model architectures (Linear, Conv2d, Sequential)\n",
"\n",
"        LAYER-SPECIFIC FLOP FORMULAS:\n",
"        - Linear: input_features × output_features × 2 (multiply + add; bias FLOPs ignored here)\n",
"        - Conv2d: output_h × output_w × kernel_h × kernel_w × in_channels × out_channels × 2\n",
"        - Activation: Usually 1 FLOP per element (ReLU, Sigmoid)\n",
"\n",
"        EXAMPLE:\n",
"        >>> linear = Linear(128, 64)\n",
"        >>> profiler = Profiler()\n",
"        >>> flops = profiler.count_flops(linear, (1, 128))\n",
"        >>> print(flops)  # 128 * 64 * 2 = 16384\n",
"        16384\n",
"\n",
"        HINTS:\n",
"        - Batch dimension doesn't affect per-sample FLOPs\n",
"        - Focus on major operations (matmul, conv) first\n",
"        - For Sequential models, sum FLOPs of all layers\n",
"        \"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        # Create dummy input (unused but kept for interface consistency)\n",
"        _dummy_input = Tensor(np.random.randn(*input_shape))\n",
"        total_flops = 0\n",
"\n",
"        # Handle different model types\n",
"        if hasattr(model, '__class__'):\n",
"            model_name = model.__class__.__name__\n",
"\n",
"            if model_name == 'Linear':\n",
"                # Linear layer: input_features × output_features × 2\n",
"                in_features = input_shape[-1]\n",
"                out_features = model.weight.shape[1] if hasattr(model, 'weight') else 1\n",
"                total_flops = in_features * out_features * 2\n",
"\n",
"            elif model_name == 'Conv2d':\n",
"                # Conv2d layer: complex calculation based on output size\n",
"                # Simplified: assume we know the output dimensions\n",
"                if hasattr(model, 'kernel_size') and hasattr(model, 'in_channels'):\n",
"                    _batch_size = input_shape[0] if len(input_shape) > 3 else 1\n",
"                    in_channels = model.in_channels\n",
"                    out_channels = model.out_channels\n",
"                    kernel_h = kernel_w = model.kernel_size\n",
"\n",
"                    # Estimate output size (simplified)\n",
"                    input_h, input_w = input_shape[-2], input_shape[-1]\n",
"                    output_h = input_h // (model.stride if hasattr(model, 'stride') else 1)\n",
"                    output_w = input_w // (model.stride if hasattr(model, 'stride') else 1)\n",
"\n",
"                    total_flops = (output_h * output_w * kernel_h * kernel_w *\n",
"                                   in_channels * out_channels * 2)\n",
"\n",
"            elif model_name == 'Sequential':\n",
"                # Sequential model: sum FLOPs of all layers\n",
"                current_shape = input_shape\n",
"                for layer in model.layers:\n",
"                    layer_flops = self.count_flops(layer, current_shape)\n",
"                    total_flops += layer_flops\n",
"                    # Update shape for next layer (simplified)\n",
"                    if hasattr(layer, 'weight'):\n",
"                        current_shape = current_shape[:-1] + (layer.weight.shape[1],)\n",
"\n",
"            else:\n",
"                # Activation or other: assume 1 FLOP per element\n",
"                total_flops = np.prod(input_shape)\n",
"\n",
"        return total_flops\n",
"        ### END SOLUTION\n",
"\n",
"    def measure_memory(self, model, input_shape: Tuple[int, ...]) -> Dict[str, float]:\n",
"        \"\"\"\n",
"        Measure memory usage during forward pass.\n",
"\n",
"        TODO: Implement memory tracking for model execution\n",
"\n",
"        APPROACH:\n",
"        1. Use tracemalloc to track memory allocation\n",
"        2. Measure baseline memory before model execution\n",
"        3. Run forward pass and track peak usage\n",
"        4. Calculate different memory components\n",
"\n",
"        RETURN DICTIONARY:\n",
"        - 'parameter_memory_mb': Memory for model parameters\n",
"        - 'activation_memory_mb': Memory for activations\n",
"        - 'peak_memory_mb': Maximum memory usage\n",
"        - 'memory_efficiency': Ratio of useful to total memory\n",
"\n",
"        EXAMPLE:\n",
"        >>> linear = Linear(1024, 512)\n",
"        >>> profiler = Profiler()\n",
"        >>> memory = profiler.measure_memory(linear, (32, 1024))\n",
"        >>> print(f\"Parameters: {memory['parameter_memory_mb']:.1f} MB\")\n",
"        Parameters: 2.0 MB\n",
"\n",
"        HINTS:\n",
"        - Use tracemalloc.start() and tracemalloc.get_traced_memory()\n",
"        - Account for float32 = 4 bytes per parameter\n",
"        - Activation memory scales with batch size\n",
"        \"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        # Start memory tracking\n",
"        tracemalloc.start()\n",
"\n",
"        # Record baseline memory so the peak can be reported relative to it\n",
"        _baseline_memory = tracemalloc.get_traced_memory()[0]\n",
"\n",
"        # Calculate parameter memory\n",
"        param_count = self.count_parameters(model)\n",
"        parameter_memory_bytes = param_count * 4  # Assume float32\n",
"        parameter_memory_mb = parameter_memory_bytes / (1024 * 1024)\n",
"\n",
"        # Create input and measure activation memory\n",
"        dummy_input = Tensor(np.random.randn(*input_shape))\n",
"        input_memory_bytes = dummy_input.data.nbytes\n",
"\n",
"        # Estimate activation memory (simplified)\n",
"        activation_memory_bytes = input_memory_bytes * 2  # Rough estimate\n",
"        activation_memory_mb = activation_memory_bytes / (1024 * 1024)\n",
"\n",
"        # Try to run forward pass and measure peak\n",
"        try:\n",
"            if hasattr(model, 'forward'):\n",
"                _ = model.forward(dummy_input)\n",
"            elif hasattr(model, '__call__'):\n",
"                _ = model(dummy_input)\n",
"        except Exception:\n",
"            pass  # Ignore errors for simplified measurement\n",
"\n",
"        # Get peak memory\n",
"        _current_memory, peak_memory = tracemalloc.get_traced_memory()\n",
"        peak_memory_mb = (peak_memory - _baseline_memory) / (1024 * 1024)\n",
"\n",
"        tracemalloc.stop()\n",
"\n",
"        # Calculate efficiency\n",
"        useful_memory = parameter_memory_mb + activation_memory_mb\n",
"        memory_efficiency = useful_memory / max(peak_memory_mb, 0.001)  # Avoid division by zero\n",
"\n",
"        return {\n",
"            'parameter_memory_mb': parameter_memory_mb,\n",
"            'activation_memory_mb': activation_memory_mb,\n",
"            'peak_memory_mb': max(peak_memory_mb, useful_memory),\n",
"            'memory_efficiency': min(memory_efficiency, 1.0)\n",
"        }\n",
"        ### END SOLUTION\n",
"\n",
"    def measure_latency(self, model, input_tensor, warmup: int = 10, iterations: int = 100) -> float:\n",
"        \"\"\"\n",
"        Measure model inference latency with statistical rigor.\n",
"\n",
"        TODO: Implement accurate latency measurement\n",
"\n",
"        APPROACH:\n",
"        1. Run warmup iterations to stabilize performance\n",
"        2. Measure multiple iterations for statistical accuracy\n",
"        3. Calculate median latency to handle outliers\n",
"        4. Return latency in milliseconds\n",
"\n",
"        PARAMETERS:\n",
"        - warmup: Number of warmup runs (default 10)\n",
"        - iterations: Number of measurement runs (default 100)\n",
"\n",
"        EXAMPLE:\n",
"        >>> linear = Linear(128, 64)\n",
"        >>> input_tensor = Tensor(np.random.randn(1, 128))\n",
"        >>> profiler = Profiler()\n",
"        >>> latency = profiler.measure_latency(linear, input_tensor)\n",
"        >>> print(f\"Latency: {latency:.2f} ms\")\n",
"        Latency: 0.15 ms\n",
"\n",
"        HINTS:\n",
"        - Use time.perf_counter() for high precision\n",
"        - Use median instead of mean for robustness against outliers\n",
"        - Handle different model interfaces (forward, __call__)\n",
"        \"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        # Warmup runs\n",
"        for _ in range(warmup):\n",
"            try:\n",
"                if hasattr(model, 'forward'):\n",
"                    _ = model.forward(input_tensor)\n",
"                elif hasattr(model, '__call__'):\n",
"                    _ = model(input_tensor)\n",
"                else:\n",
"                    # Fallback for simple operations\n",
"                    _ = input_tensor\n",
"            except Exception:\n",
"                pass  # Ignore errors during warmup\n",
"\n",
"        # Measurement runs\n",
"        times = []\n",
"        for _ in range(iterations):\n",
"            start_time = time.perf_counter()\n",
"\n",
"            try:\n",
"                if hasattr(model, 'forward'):\n",
"                    _ = model.forward(input_tensor)\n",
"                elif hasattr(model, '__call__'):\n",
"                    _ = model(input_tensor)\n",
"                else:\n",
"                    # Minimal operation for timing\n",
"                    _ = input_tensor.data.copy()\n",
"            except Exception:\n",
"                pass  # Ignore errors but still measure time\n",
"\n",
"            end_time = time.perf_counter()\n",
"            times.append((end_time - start_time) * 1000)  # Convert to milliseconds\n",
"\n",
"        # Calculate statistics - use median for robustness\n",
"        times = np.array(times)\n",
"        median_latency = np.median(times)\n",
"\n",
"        return float(median_latency)\n",
"        ### END SOLUTION\n",
"\n",
"    def profile_layer(self, layer, input_shape: Tuple[int, ...]) -> Dict[str, Any]:\n",
"        \"\"\"\n",
"        Profile a single layer comprehensively.\n",
"\n",
"        TODO: Implement layer-wise profiling\n",
"\n",
"        APPROACH:\n",
"        1. Count parameters for this layer\n",
"        2. Count FLOPs for this layer\n",
"        3. Measure memory usage\n",
"        4. Measure latency\n",
"        5. Return comprehensive layer profile\n",
"\n",
"        EXAMPLE:\n",
"        >>> linear = Linear(256, 128)\n",
"        >>> profiler = Profiler()\n",
"        >>> profile = profiler.profile_layer(linear, (32, 256))\n",
"        >>> print(f\"Layer uses {profile['parameters']} parameters\")\n",
"        Layer uses 32896 parameters\n",
"\n",
"        HINTS:\n",
"        - Use existing profiler methods (count_parameters, count_flops, etc.)\n",
"        - Create dummy input for latency measurement\n",
"        - Include layer type information in profile\n",
"        \"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        # Create dummy input for latency measurement\n",
"        dummy_input = Tensor(np.random.randn(*input_shape))\n",
"\n",
"        # Gather all measurements\n",
"        params = self.count_parameters(layer)\n",
"        flops = self.count_flops(layer, input_shape)\n",
"        memory = self.measure_memory(layer, input_shape)\n",
"        latency = self.measure_latency(layer, dummy_input, warmup=3, iterations=10)\n",
"\n",
"        # Compute derived metrics\n",
"        gflops_per_second = (flops / 1e9) / max(latency / 1000, 1e-6)\n",
"\n",
"        return {\n",
"            'layer_type': layer.__class__.__name__,\n",
"            'parameters': params,\n",
"            'flops': flops,\n",
"            'latency_ms': latency,\n",
"            'gflops_per_second': gflops_per_second,\n",
"            **memory\n",
"        }\n",
"        ### END SOLUTION\n",
"\n",
"    def profile_forward_pass(self, model, input_tensor) -> Dict[str, Any]:\n",
"        \"\"\"\n",
"        Comprehensive profiling of a model's forward pass.\n",
"\n",
"        TODO: Implement complete forward pass analysis\n",
"\n",
"        APPROACH:\n",
"        1. Use Profiler class to gather all measurements\n",
"        2. Create comprehensive performance profile\n",
"        3. Add derived metrics and insights\n",
"        4. Return structured analysis results\n",
"\n",
"        RETURN METRICS:\n",
"        - All basic profiler measurements\n",
"        - FLOPs per second (computational efficiency)\n",
"        - Memory bandwidth utilization\n",
"        - Performance bottleneck identification\n",
"\n",
"        EXAMPLE:\n",
"        >>> model = Linear(256, 128)\n",
"        >>> input_data = Tensor(np.random.randn(32, 256))\n",
"        >>> profiler = Profiler()\n",
"        >>> profile = profiler.profile_forward_pass(model, input_data)\n",
"        >>> print(f\"Throughput: {profile['gflops_per_second']:.2f} GFLOP/s\")\n",
"        Throughput: 2.45 GFLOP/s\n",
"\n",
"        HINTS:\n",
"        - GFLOP/s = (FLOPs / 1e9) / (latency_ms / 1000)\n",
"        - Memory bandwidth = memory_mb / (latency_ms / 1000)\n",
"        - Consider realistic hardware limits for efficiency calculations\n",
"        \"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        # Basic measurements\n",
"        param_count = self.count_parameters(model)\n",
"        flops = self.count_flops(model, input_tensor.shape)\n",
"        memory_stats = self.measure_memory(model, input_tensor.shape)\n",
"        latency_ms = self.measure_latency(model, input_tensor, warmup=5, iterations=20)\n",
"\n",
"        # Derived metrics\n",
"        latency_seconds = latency_ms / 1000.0\n",
"        gflops_per_second = (flops / 1e9) / max(latency_seconds, 1e-6)\n",
"\n",
"        # Memory bandwidth (MB/s)\n",
"        memory_bandwidth = memory_stats['peak_memory_mb'] / max(latency_seconds, 1e-6)\n",
"\n",
"        # Efficiency metrics\n",
"        theoretical_peak_gflops = 100.0  # Assume 100 GFLOP/s theoretical peak for CPU\n",
"        computational_efficiency = min(gflops_per_second / theoretical_peak_gflops, 1.0)\n",
"\n",
"        # Bottleneck analysis\n",
"        is_memory_bound = memory_bandwidth > gflops_per_second * 100  # Rough heuristic\n",
"        is_compute_bound = not is_memory_bound\n",
"\n",
"        return {\n",
"            # Basic measurements\n",
"            'parameters': param_count,\n",
"            'flops': flops,\n",
"            'latency_ms': latency_ms,\n",
"            **memory_stats,\n",
"\n",
"            # Derived metrics\n",
"            'gflops_per_second': gflops_per_second,\n",
"            'memory_bandwidth_mbs': memory_bandwidth,\n",
"            'computational_efficiency': computational_efficiency,\n",
"\n",
"            # Bottleneck analysis\n",
"            'is_memory_bound': is_memory_bound,\n",
"            'is_compute_bound': is_compute_bound,\n",
"            'bottleneck': 'memory' if is_memory_bound else 'compute'\n",
"        }\n",
"        ### END SOLUTION\n",
"\n",
"    def profile_backward_pass(self, model, input_tensor, _loss_fn=None) -> Dict[str, Any]:\n",
"        \"\"\"\n",
"        Profile both forward and backward passes for training analysis.\n",
"\n",
"        TODO: Implement training-focused profiling\n",
"\n",
"        APPROACH:\n",
"        1. Profile forward pass first\n",
"        2. Estimate backward pass costs (typically 2× forward)\n",
"        3. Calculate total training iteration metrics\n",
"        4. Analyze memory requirements for gradients and optimizers\n",
"\n",
"        BACKWARD PASS ESTIMATES:\n",
"        - FLOPs: ~2× forward pass (gradient computation)\n",
"        - Memory: +1× parameters (gradient storage)\n",
"        - Latency: ~2× forward pass (more complex operations)\n",
"\n",
"        EXAMPLE:\n",
"        >>> model = Linear(128, 64)\n",
"        >>> input_data = Tensor(np.random.randn(16, 128))\n",
"        >>> profiler = Profiler()\n",
"        >>> profile = profiler.profile_backward_pass(model, input_data)\n",
"        >>> print(f\"Training iteration: {profile['total_latency_ms']:.2f} ms\")\n",
"        Training iteration: 0.45 ms\n",
"\n",
"        HINTS:\n",
"        - Total memory = parameters + activations + gradients\n",
"        - Optimizer memory depends on algorithm (SGD: 0×, Adam: 2×)\n",
"        - Consider gradient accumulation effects\n",
"        \"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        # Get forward pass profile\n",
"        forward_profile = self.profile_forward_pass(model, input_tensor)\n",
"\n",
"        # Estimate backward pass (typically 2× forward)\n",
"        backward_flops = forward_profile['flops'] * 2\n",
"        backward_latency_ms = forward_profile['latency_ms'] * 2\n",
"\n",
"        # Gradient memory (equal to parameter memory)\n",
"        gradient_memory_mb = forward_profile['parameter_memory_mb']\n",
"\n",
"        # Total training iteration\n",
"        total_flops = forward_profile['flops'] + backward_flops\n",
"        total_latency_ms = forward_profile['latency_ms'] + backward_latency_ms\n",
"        total_memory_mb = (forward_profile['parameter_memory_mb'] +\n",
"                           forward_profile['activation_memory_mb'] +\n",
"                           gradient_memory_mb)\n",
"\n",
"        # Training efficiency\n",
"        total_gflops_per_second = (total_flops / 1e9) / (total_latency_ms / 1000.0)\n",
"\n",
"        # Optimizer memory estimates\n",
"        optimizer_memory_estimates = {\n",
"            'sgd': 0,  # No extra memory\n",
"            'adam': gradient_memory_mb * 2,  # Momentum + velocity\n",
"            'adamw': gradient_memory_mb * 2,  # Same as Adam\n",
"        }\n",
"\n",
"        return {\n",
"            # Forward pass\n",
"            'forward_flops': forward_profile['flops'],\n",
"            'forward_latency_ms': forward_profile['latency_ms'],\n",
"            'forward_memory_mb': forward_profile['peak_memory_mb'],\n",
"\n",
"            # Backward pass estimates\n",
"            'backward_flops': backward_flops,\n",
"            'backward_latency_ms': backward_latency_ms,\n",
"            'gradient_memory_mb': gradient_memory_mb,\n",
"\n",
"            # Total training iteration\n",
"            'total_flops': total_flops,\n",
"            'total_latency_ms': total_latency_ms,\n",
"            'total_memory_mb': total_memory_mb,\n",
"            'total_gflops_per_second': total_gflops_per_second,\n",
"\n",
"            # Optimizer memory requirements\n",
"            'optimizer_memory_estimates': optimizer_memory_estimates,\n",
"\n",
"            # Training insights\n",
"            'memory_efficiency': forward_profile['memory_efficiency'],\n",
"            'bottleneck': forward_profile['bottleneck']\n",
"        }\n",
"        ### END SOLUTION"
]
},
{
"cell_type": "markdown",
"id": "644d770d",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"## Helper Functions - Quick Profiling Utilities\n",
"\n",
"These helper functions provide simplified interfaces for common profiling tasks.\n",
"They make it easy to quickly profile models and analyze characteristics."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ad647a04",
"metadata": {
"lines_to_next_cell": 1
},
"outputs": [],
"source": [
"#| export\n",
"def quick_profile(model, input_tensor, profiler=None):\n",
"    \"\"\"\n",
"    Quick profiling function for immediate insights.\n",
"\n",
"    Provides a simplified interface for profiling that displays key metrics\n",
"    in a student-friendly format.\n",
"\n",
"    Args:\n",
"        model: Model to profile\n",
"        input_tensor: Input data for profiling\n",
"        profiler: Optional Profiler instance (creates new one if None)\n",
"\n",
"    Returns:\n",
"        dict: Profile results with key metrics\n",
"\n",
"    Example:\n",
"        >>> model = Linear(128, 64)\n",
"        >>> input_data = Tensor(np.random.randn(16, 128))\n",
"        >>> results = quick_profile(model, input_data)\n",
"        >>> # Displays formatted output automatically\n",
"    \"\"\"\n",
"    if profiler is None:\n",
"        profiler = Profiler()\n",
"\n",
"    profile = profiler.profile_forward_pass(model, input_tensor)\n",
"\n",
"    # Display formatted results\n",
"    print(\"🔬 Quick Profile Results:\")\n",
"    print(f\"   Parameters: {profile['parameters']:,}\")\n",
"    print(f\"   FLOPs: {profile['flops']:,}\")\n",
"    print(f\"   Latency: {profile['latency_ms']:.2f} ms\")\n",
"    print(f\"   Memory: {profile['peak_memory_mb']:.2f} MB\")\n",
"    print(f\"   Bottleneck: {profile['bottleneck']}\")\n",
"    print(f\"   Efficiency: {profile['computational_efficiency']*100:.1f}%\")\n",
"\n",
"    return profile\n",
"\n",
"#| export\n",
"def analyze_weight_distribution(model, percentiles=[10, 25, 50, 75, 90]):\n",
"    \"\"\"\n",
"    Analyze weight distribution for compression insights.\n",
"\n",
"    Helps understand which weights are small and might be prunable.\n",
"    Used by Module 17 (Compression) to motivate pruning.\n",
"\n",
"    Args:\n",
"        model: Model to analyze\n",
"        percentiles: List of percentiles to compute\n",
"\n",
"    Returns:\n",
"        dict: Weight distribution statistics\n",
"\n",
"    Example:\n",
"        >>> model = Linear(512, 512)\n",
"        >>> stats = analyze_weight_distribution(model)\n",
"        >>> print(f\"Weights < 0.01: {stats['below_threshold_001']:.1f}%\")\n",
"    \"\"\"\n",
"    # Collect all weights\n",
"    weights = []\n",
"    if hasattr(model, 'parameters'):\n",
"        for param in model.parameters():\n",
"            weights.extend(param.data.flatten().tolist())\n",
"    elif hasattr(model, 'weight'):\n",
"        weights.extend(model.weight.data.flatten().tolist())\n",
"    else:\n",
"        return {'error': 'No weights found'}\n",
"\n",
"    weights = np.array(weights)\n",
"    abs_weights = np.abs(weights)\n",
"\n",
"    # Calculate statistics\n",
"    stats = {\n",
"        'total_weights': len(weights),\n",
"        'mean': float(np.mean(abs_weights)),\n",
"        'std': float(np.std(abs_weights)),\n",
"        'min': float(np.min(abs_weights)),\n",
"        'max': float(np.max(abs_weights)),\n",
"    }\n",
"\n",
"    # Percentile analysis\n",
"    for p in percentiles:\n",
"        stats[f'percentile_{p}'] = float(np.percentile(abs_weights, p))\n",
"\n",
"    # Threshold analysis (useful for pruning)\n",
"    for threshold in [0.001, 0.01, 0.1]:\n",
"        below = np.sum(abs_weights < threshold) / len(weights) * 100\n",
"        stats[f'below_threshold_{str(threshold).replace(\".\", \"\")}'] = below\n",
"\n",
"    return stats"
]
},
{
"cell_type": "markdown",
"id": "68b967c5",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## Parameter Counting - Model Size Analysis\n",
"\n",
"Parameter counting is the foundation of model profiling. Every parameter contributes to memory usage, training time, and model complexity. Let's validate our implementation.\n",
"\n",
"### Why Parameter Counting Matters\n",
"```\n",
"Model Deployment Pipeline:\n",
"Parameters → Memory → Hardware → Cost\n",
"     ↓          ↓         ↓          ↓\n",
"   125M       500MB    8GB GPU   $200/month\n",
"\n",
"Parameter Growth Examples:\n",
"Small:  GPT-2 Small  (124M parameters) → 500MB memory\n",
"Medium: GPT-2 Medium (350M parameters) → 1.4GB memory\n",
"Large:  GPT-2 Large  (774M parameters) → 3.1GB memory\n",
"XL:     GPT-2 XL     (1.5B parameters) → 6.0GB memory\n",
"```\n",
]
},
{
"cell_type": "markdown",
"id": "68a302c1",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"### 🧪 Unit Test: Parameter Counting\n",
"This test validates our parameter counting works correctly for different model types.\n",
"**What we're testing**: Parameter counting accuracy for various architectures\n",
"**Why it matters**: Accurate parameter counts predict memory usage and model complexity\n",
"**Expected**: Correct counts for known model configurations"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9c44b45f",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test_parameter_counting",
"locked": true,
"points": 10
}
},
"outputs": [],
"source": [
"def test_unit_parameter_counting():\n",
"    \"\"\"🔬 Test parameter counting implementation.\"\"\"\n",
"    print(\"🔬 Unit Test: Parameter Counting...\")\n",
"\n",
"    profiler = Profiler()\n",
"\n",
"    # Test 1: Simple model with known parameters\n",
"    class SimpleModel:\n",
"        def __init__(self):\n",
"            self.weight = Tensor(np.random.randn(10, 5))\n",
"            self.bias = Tensor(np.random.randn(5))\n",
"\n",
"        def parameters(self):\n",
"            return [self.weight, self.bias]\n",
"\n",
"    simple_model = SimpleModel()\n",
"    param_count = profiler.count_parameters(simple_model)\n",
"    expected_count = 10 * 5 + 5  # weight + bias\n",
"    assert param_count == expected_count, f\"Expected {expected_count} parameters, got {param_count}\"\n",
"    print(f\"✅ Simple model: {param_count} parameters\")\n",
"\n",
"    # Test 2: Model without parameters\n",
"    class NoParamModel:\n",
"        def __init__(self):\n",
"            pass\n",
"\n",
"    no_param_model = NoParamModel()\n",
"    param_count = profiler.count_parameters(no_param_model)\n",
"    assert param_count == 0, f\"Expected 0 parameters, got {param_count}\"\n",
"    print(f\"✅ No parameter model: {param_count} parameters\")\n",
"\n",
"    # Test 3: Direct tensor (no parameters)\n",
"    test_tensor = Tensor(np.random.randn(2, 3))\n",
"    param_count = profiler.count_parameters(test_tensor)\n",
"    assert param_count == 0, f\"Expected 0 parameters for tensor, got {param_count}\"\n",
"    print(f\"✅ Direct tensor: {param_count} parameters\")\n",
"\n",
"    print(\"✅ Parameter counting works correctly!\")\n",
"\n",
"if __name__ == \"__main__\":\n",
"    test_unit_parameter_counting()"
]
},
{
"cell_type": "markdown",
"id": "fd88f0ff",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## FLOP Counting - Computational Cost Estimation\n",
"\n",
"FLOPs measure the computational work required for model operations. Unlike latency, FLOPs are hardware-independent and help predict compute costs across different systems.\n",
"\n",
"### FLOP Counting Visualization\n",
"```\n",
"Linear Layer FLOP Breakdown:\n",
"Input (batch=32, features=768) × Weight (768, 3072) + Bias (3072)\n",
"                              ↓\n",
"Matrix Multiplication: 32 × 768 × 3072 × 2 = 150,994,944 FLOPs\n",
"Bias Addition:         32 × 3072 × 1      =      98,304 FLOPs\n",
"                              ↓\n",
"Total FLOPs: 151,093,248 FLOPs\n",
"\n",
"Convolution FLOP Breakdown:\n",
"Input (batch=1, channels=3, H=224, W=224)\n",
"Kernel (out=64, in=3, kH=7, kW=7)\n",
"                              ↓\n",
"Output size: (224×224) → (112×112) with stride=2\n",
"FLOPs = 112 × 112 × 7 × 7 × 3 × 64 × 2 = 236,027,904 FLOPs\n",
"```\n",
"\n",
"### FLOP Counting Strategy\n",
"Different operations require different FLOP calculations (a checking sketch follows this list):\n",
"- **Matrix operations**: M × N × K × 2 (multiply + add)\n",
"- **Convolutions**: Output spatial × kernel spatial × channels\n",
"- **Activations**: Usually 1 FLOP per element\n",
]
},
{
"cell_type": "markdown",
"id": "e6311a0a",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"### 🧪 Unit Test: FLOP Counting\n",
"This test validates our FLOP counting for different operations and architectures.\n",
"**What we're testing**: FLOP calculation accuracy for various layer types\n",
"**Why it matters**: FLOPs predict computational cost and energy usage\n",
"**Expected**: Correct FLOP counts for known operation types"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8919b41a",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test_flop_counting",
"locked": true,
"points": 10
}
},
"outputs": [],
"source": [
"def test_unit_flop_counting():\n",
"    \"\"\"🔬 Test FLOP counting implementation.\"\"\"\n",
"    print(\"🔬 Unit Test: FLOP Counting...\")\n",
"\n",
"    profiler = Profiler()\n",
"\n",
"    # Test 1: Simple tensor operations\n",
"    test_tensor = Tensor(np.random.randn(4, 8))\n",
"    flops = profiler.count_flops(test_tensor, (4, 8))\n",
"    expected_flops = 4 * 8  # 1 FLOP per element for generic operation\n",
"    assert flops == expected_flops, f\"Expected {expected_flops} FLOPs, got {flops}\"\n",
"    print(f\"✅ Tensor operation: {flops} FLOPs\")\n",
"\n",
"    # Test 2: Simulated Linear layer\n",
"    class MockLinear:\n",
"        def __init__(self, in_features, out_features):\n",
"            self.weight = Tensor(np.random.randn(in_features, out_features))\n",
"            self.__class__.__name__ = 'Linear'\n",
"\n",
"    mock_linear = MockLinear(128, 64)\n",
"    flops = profiler.count_flops(mock_linear, (1, 128))\n",
"    expected_flops = 128 * 64 * 2  # matmul FLOPs\n",
"    assert flops == expected_flops, f\"Expected {expected_flops} FLOPs, got {flops}\"\n",
"    print(f\"✅ Linear layer: {flops} FLOPs\")\n",
"\n",
"    # Test 3: Batch size independence\n",
"    flops_batch1 = profiler.count_flops(mock_linear, (1, 128))\n",
"    flops_batch32 = profiler.count_flops(mock_linear, (32, 128))\n",
"    assert flops_batch1 == flops_batch32, \"FLOPs should be independent of batch size\"\n",
"    print(f\"✅ Batch independence: {flops_batch1} FLOPs (same for batch 1 and 32)\")\n",
"\n",
"    print(\"✅ FLOP counting works correctly!\")\n",
"\n",
"if __name__ == \"__main__\":\n",
"    test_unit_flop_counting()"
]
},
{
"cell_type": "markdown",
"id": "9a1d06f7",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## Memory Profiling - Understanding Memory Usage Patterns\n",
"\n",
"Memory profiling reveals how much RAM your model consumes during training and inference. This is critical for deployment planning and optimization.\n",
"\n",
"### Memory Usage Breakdown\n",
"```\n",
"ML Model Memory Components:\n",
"┌───────────────────────────────────────────────────┐\n",
"│                   Total Memory                    │\n",
"├─────────────────┬─────────────────┬───────────────┤\n",
"│   Parameters    │   Activations   │   Gradients   │\n",
"│  (persistent)   │  (per forward)  │ (per backward)│\n",
"├─────────────────┼─────────────────┼───────────────┤\n",
"│ Linear weights  │ Hidden states   │ ∂L/∂W         │\n",
"│ Conv filters    │ Attention maps  │ ∂L/∂b         │\n",
"│ Embeddings      │ Residual cache  │ Optimizer     │\n",
"└─────────────────┴─────────────────┴───────────────┘\n",
"\n",
"Memory Scaling:\n",
"Batch Size      → Activation Memory (linear scaling)\n",
"Model Size      → Parameter + Gradient Memory (linear scaling)\n",
"Sequence Length → Attention Memory (quadratic scaling!)\n",
"```\n",
"\n",
"### Memory Measurement Strategy\n",
"We use Python's `tracemalloc` to track memory allocations during model execution. This gives us precise measurements of memory usage patterns.\n",
]
},
{
"cell_type": "markdown",
"id": "a1e39372",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"### 🧪 Unit Test: Memory Measurement\n",
"This test validates our memory tracking works correctly and provides useful metrics.\n",
"**What we're testing**: Memory usage measurement and calculation accuracy\n",
"**Why it matters**: Memory constraints often limit model deployment\n",
"**Expected**: Reasonable memory measurements with proper components"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "60ee4331",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test_memory_measurement",
"locked": true,
"points": 10
}
},
"outputs": [],
"source": [
"def test_unit_memory_measurement():\n",
"    \"\"\"🔬 Test memory measurement implementation.\"\"\"\n",
"    print(\"🔬 Unit Test: Memory Measurement...\")\n",
"\n",
"    profiler = Profiler()\n",
"\n",
"    # Test 1: Basic memory measurement\n",
"    test_tensor = Tensor(np.random.randn(10, 20))\n",
"    memory_stats = profiler.measure_memory(test_tensor, (10, 20))\n",
"\n",
"    # Validate dictionary structure\n",
"    required_keys = ['parameter_memory_mb', 'activation_memory_mb', 'peak_memory_mb', 'memory_efficiency']\n",
"    for key in required_keys:\n",
"        assert key in memory_stats, f\"Missing key: {key}\"\n",
"\n",
"    # Validate non-negative values\n",
"    for key in required_keys:\n",
"        assert memory_stats[key] >= 0, f\"{key} should be non-negative, got {memory_stats[key]}\"\n",
"\n",
"    print(f\"✅ Basic measurement: {memory_stats['peak_memory_mb']:.3f} MB peak\")\n",
"\n",
"    # Test 2: Memory scaling with size\n",
"    small_tensor = Tensor(np.random.randn(5, 5))\n",
"    large_tensor = Tensor(np.random.randn(50, 50))\n",
"\n",
"    small_memory = profiler.measure_memory(small_tensor, (5, 5))\n",
"    large_memory = profiler.measure_memory(large_tensor, (50, 50))\n",
"\n",
"    # Larger tensor should use more activation memory\n",
"    assert large_memory['activation_memory_mb'] >= small_memory['activation_memory_mb'], \\\n",
"        \"Larger tensor should use more activation memory\"\n",
"\n",
"    print(f\"✅ Scaling: Small {small_memory['activation_memory_mb']:.3f} MB → Large {large_memory['activation_memory_mb']:.3f} MB\")\n",
"\n",
"    # Test 3: Efficiency bounds\n",
"    assert 0 <= memory_stats['memory_efficiency'] <= 1.0, \\\n",
"        f\"Memory efficiency should be between 0 and 1, got {memory_stats['memory_efficiency']}\"\n",
"\n",
"    print(f\"✅ Efficiency: {memory_stats['memory_efficiency']:.3f} (0-1 range)\")\n",
"\n",
"    print(\"✅ Memory measurement works correctly!\")\n",
"\n",
"if __name__ == \"__main__\":\n",
"    test_unit_memory_measurement()"
]
},
{
"cell_type": "markdown",
"id": "350bdbd3",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## Latency Measurement - Accurate Performance Timing\n",
"\n",
"Latency measurement is the most challenging part of profiling because it's affected by system state, caching, and measurement overhead. We need statistical rigor to get reliable results.\n",
"\n",
"### Latency Measurement Challenges\n",
"```\n",
"Timing Challenges:\n",
"┌─────────────────────────────────────────────────┐\n",
"│                  Time Variance                  │\n",
"├─────────────────┬─────────────────┬─────────────┤\n",
"│  System Noise   │  Cache Effects  │   Thermal   │\n",
"│                 │                 │ Throttling  │\n",
"├─────────────────┼─────────────────┼─────────────┤\n",
"│ Background      │ Cold start vs   │ CPU slows   │\n",
"│ processes       │ warm caches     │ when hot    │\n",
"│ OS scheduling   │ Memory locality │ GPU thermal │\n",
"│ Network I/O     │ Branch predict  │ limits      │\n",
"└─────────────────┴─────────────────┴─────────────┘\n",
"\n",
"Solution: Statistical Approach\n",
"Warmup → Multiple measurements → Robust statistics (median)\n",
"```\n",
"\n",
"### Measurement Protocol\n",
"Our latency measurement follows professional benchmarking practices:\n",
"1. **Warmup runs** to stabilize system state\n",
"2. **Multiple measurements** for statistical significance\n",
"3. **Median calculation** to handle outliers\n",
"4. **Memory cleanup** to prevent contamination"
]
},
{
"cell_type": "markdown",
"id": "f1a0465b",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"### 🧪 Unit Test: Latency Measurement\n",
"This test validates our latency measurement provides consistent and reasonable results.\n",
"**What we're testing**: Timing accuracy and statistical robustness\n",
"**Why it matters**: Latency determines real-world deployment feasibility\n",
"**Expected**: Consistent timing measurements with proper statistical handling"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dcc3cff0",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test_latency_measurement",
"locked": true,
"points": 10
}
},
"outputs": [],
"source": [
"def test_unit_latency_measurement():\n",
"    \"\"\"🔬 Test latency measurement implementation.\"\"\"\n",
"    print(\"🔬 Unit Test: Latency Measurement...\")\n",
"\n",
"    profiler = Profiler()\n",
"\n",
"    # Test 1: Basic latency measurement\n",
"    test_tensor = Tensor(np.random.randn(4, 8))\n",
"    latency = profiler.measure_latency(test_tensor, test_tensor, warmup=2, iterations=5)\n",
"\n",
"    assert latency >= 0, f\"Latency should be non-negative, got {latency}\"\n",
"    assert latency < 1000, f\"Latency seems too high for simple operation: {latency} ms\"\n",
"    print(f\"✅ Basic latency: {latency:.3f} ms\")\n",
"\n",
"    # Test 2: Measurement consistency\n",
"    latencies = []\n",
"    for _ in range(3):\n",
"        lat = profiler.measure_latency(test_tensor, test_tensor, warmup=1, iterations=3)\n",
"        latencies.append(lat)\n",
"\n",
"    # Measurements should be in reasonable range\n",
"    avg_latency = np.mean(latencies)\n",
"    std_latency = np.std(latencies)\n",
"    assert std_latency < avg_latency, \"Standard deviation shouldn't exceed mean for simple operations\"\n",
"    print(f\"✅ Consistency: {avg_latency:.3f} ± {std_latency:.3f} ms\")\n",
"\n",
"    # Test 3: Size scaling\n",
"    small_tensor = Tensor(np.random.randn(2, 2))\n",
"    large_tensor = Tensor(np.random.randn(20, 20))\n",
"\n",
"    small_latency = profiler.measure_latency(small_tensor, small_tensor, warmup=1, iterations=3)\n",
"    large_latency = profiler.measure_latency(large_tensor, large_tensor, warmup=1, iterations=3)\n",
"\n",
"    # Larger operations might take longer (though not guaranteed for simple operations)\n",
"    print(f\"✅ Scaling: Small {small_latency:.3f} ms, Large {large_latency:.3f} ms\")\n",
"\n",
"    print(\"✅ Latency measurement works correctly!\")\n",
"\n",
"if __name__ == \"__main__\":\n",
"    test_unit_latency_measurement()"
]
},
{
"cell_type": "markdown",
"id": "a5d9a959",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 4. Integration: Advanced Profiling Functions\n",
"\n",
"Now let's validate our higher-level profiling functions that combine core measurements into comprehensive analysis tools.\n",
"\n",
"### Advanced Profiling Architecture\n",
"```\n",
"Core Profiler Methods → Advanced Analysis Functions → Optimization Insights\n",
"         ↓                         ↓                           ↓\n",
"count_parameters()      profile_forward_pass()     \"Memory-bound workload\"\n",
"count_flops()           profile_backward_pass()    \"Optimize data movement\"\n",
"measure_memory()        profile_layer()            \"Focus on bandwidth\"\n",
"measure_latency()       benchmark_efficiency()     \"Use quantization\"\n",
"```\n",
"\n",
"### Forward Pass Profiling - Complete Performance Picture\n",
"\n",
"A forward pass profile combines all our measurements to understand model behavior comprehensively. This is essential for optimization decisions.\n",
]
},
{
"cell_type": "markdown",
"id": "791555b9",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"### Backward Pass Profiling - Training Analysis\n",
"\n",
"Training requires both forward and backward passes. The backward pass typically uses 2× the compute and adds gradient memory. Understanding this is crucial for training optimization.\n",
"\n",
"### Training Memory Visualization\n",
"```\n",
"Training Memory Timeline:\n",
"Forward Pass:  [Parameters] + [Activations]\n",
"                     ↓\n",
"Backward Pass: [Parameters] + [Activations] + [Gradients]\n",
"                     ↓\n",
"Optimizer:     [Parameters] + [Gradients] + [Optimizer State]\n",
"\n",
"Memory Examples:\n",
"Model: 125M parameters (500MB)\n",
"Forward:  500MB params + 100MB activations = 600MB\n",
"Backward: 500MB params + 100MB activations + 500MB gradients = 1,100MB\n",
"Adam:     500MB params + 500MB gradients + 1,000MB momentum/velocity = 2,000MB\n",
"\n",
"Total Training Memory: 4× parameter memory!\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "24236272",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"### 🧪 Unit Test: Advanced Profiling Functions\n",
"This test validates our advanced profiling functions provide comprehensive analysis.\n",
"**What we're testing**: Forward and backward pass profiling completeness\n",
"**Why it matters**: Training optimization requires understanding both passes\n",
"**Expected**: Complete profiles with all required metrics and relationships"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1516ed04",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test_advanced_profiling",
"locked": true,
"points": 15
}
},
"outputs": [],
"source": [
"def test_unit_advanced_profiling():\n",
"    \"\"\"🔬 Test advanced profiling functions.\"\"\"\n",
"    print(\"🔬 Unit Test: Advanced Profiling Functions...\")\n",
"\n",
"    # Create profiler and test model\n",
"    profiler = Profiler()\n",
"    test_input = Tensor(np.random.randn(4, 8))\n",
"\n",
"    # Test forward pass profiling\n",
"    forward_profile = profiler.profile_forward_pass(test_input, test_input)\n",
"\n",
"    # Validate forward profile structure\n",
"    required_forward_keys = [\n",
"        'parameters', 'flops', 'latency_ms', 'gflops_per_second',\n",
"        'memory_bandwidth_mbs', 'bottleneck'\n",
"    ]\n",
"\n",
"    for key in required_forward_keys:\n",
"        assert key in forward_profile, f\"Missing key: {key}\"\n",
"\n",
"    assert forward_profile['parameters'] >= 0\n",
"    assert forward_profile['flops'] >= 0\n",
"    assert forward_profile['latency_ms'] >= 0\n",
"    assert forward_profile['gflops_per_second'] >= 0\n",
"\n",
"    print(f\"✅ Forward profiling: {forward_profile['gflops_per_second']:.2f} GFLOP/s\")\n",
"\n",
"    # Test backward pass profiling\n",
"    backward_profile = profiler.profile_backward_pass(test_input, test_input)\n",
"\n",
"    # Validate backward profile structure\n",
"    required_backward_keys = [\n",
"        'forward_flops', 'backward_flops', 'total_flops',\n",
"        'total_latency_ms', 'total_memory_mb', 'optimizer_memory_estimates'\n",
"    ]\n",
"\n",
"    for key in required_backward_keys:\n",
"        assert key in backward_profile, f\"Missing key: {key}\"\n",
"\n",
"    # Validate relationships\n",
"    assert backward_profile['total_flops'] >= backward_profile['forward_flops']\n",
"    assert backward_profile['total_latency_ms'] >= backward_profile['forward_latency_ms']\n",
"    assert 'sgd' in backward_profile['optimizer_memory_estimates']\n",
"    assert 'adam' in backward_profile['optimizer_memory_estimates']\n",
"\n",
"    # Check backward pass estimates are reasonable\n",
"    assert backward_profile['backward_flops'] >= backward_profile['forward_flops'], \\\n",
"        \"Backward pass should have at least as many FLOPs as forward\"\n",
"    assert backward_profile['gradient_memory_mb'] >= 0, \\\n",
"        \"Gradient memory should be non-negative\"\n",
"\n",
"    print(f\"✅ Backward profiling: {backward_profile['total_latency_ms']:.2f} ms total\")\n",
"    print(f\"✅ Memory breakdown: {backward_profile['total_memory_mb']:.2f} MB training\")\n",
"    print(\"✅ Advanced profiling functions work correctly!\")\n",
"\n",
"if __name__ == \"__main__\":\n",
"    test_unit_advanced_profiling()"
]
},
|
||
{
"cell_type": "markdown",
"id": "b52a9046",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"## 5. Systems Analysis: Understanding Performance Characteristics\n",
"\n",
"Let's analyze how different model characteristics affect performance. This analysis guides optimization decisions and helps identify bottlenecks.\n",
"\n",
"### Performance Analysis Workflow\n",
"```\n",
"Model Scaling Analysis:\n",
"Size → Memory → Latency → Throughput  → Bottleneck Identification\n",
"  ↓       ↓        ↓          ↓               ↓\n",
"  64     1MB     0.1ms    10K ops/s      Memory bound\n",
" 128     4MB     0.2ms     8K ops/s      Memory bound\n",
" 256    16MB     0.5ms     4K ops/s      Memory bound\n",
" 512    64MB     2.0ms     1K ops/s      Memory bound\n",
"\n",
"Insight: This workload is memory-bound → Optimize data movement, not compute!\n",
"```"
]
},
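{
"cell_type": "markdown",
"id": "b7d2e4a9",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"One way to make the memory-vs-compute call quantitative is arithmetic intensity: FLOPs per byte of data moved. A minimal sketch, assuming FP32 tensors and an illustrative machine balance (the 50 FLOPs/byte figure is a made-up example, not a measured value):\n",
"\n",
"```python\n",
"# Linear layer y = x @ W with batch b and width n (FP32 = 4 bytes/element)\n",
"b, n = 32, 256\n",
"flops = 2 * b * n * n                      # multiply + add per weight per sample\n",
"bytes_moved = 4 * (b * n + n * n + b * n)  # read x, read W, write y\n",
"intensity = flops / bytes_moved            # FLOPs per byte\n",
"\n",
"machine_balance = 50  # illustrative peak-FLOPs / peak-bandwidth ratio\n",
"verdict = \"compute-bound\" if intensity > machine_balance else \"memory-bound\"\n",
"print(f\"{intensity:.1f} FLOPs/byte -> {verdict}\")  # ~12.8 -> memory-bound\n",
"```"
]
},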
|
||
{
"cell_type": "code",
"execution_count": null,
"id": "331e282f",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "performance_analysis",
"solution": true
}
},
"outputs": [],
"source": [
"def analyze_model_scaling():\n",
"    \"\"\"📊 Analyze how model performance scales with size.\"\"\"\n",
"    print(\"📊 Analyzing Model Scaling Characteristics...\")\n",
"\n",
"    profiler = Profiler()\n",
"    results = []\n",
"\n",
"    # Test different model sizes\n",
"    sizes = [64, 128, 256, 512]\n",
"\n",
"    print(\"\\nModel Scaling Analysis:\")\n",
"    print(\"Size\\tParams\\t\\tFLOPs\\t\\tLatency(ms)\\tMemory(MB)\\tGFLOP/s\")\n",
"    print(\"-\" * 80)\n",
"\n",
"    for size in sizes:\n",
"        # Create models of different sizes for comparison\n",
"        input_shape = (32, size)  # Batch of 32\n",
"        dummy_input = Tensor(np.random.randn(*input_shape))\n",
"\n",
"        # Simulate linear layer characteristics\n",
"        linear_params = size * size + size  # W + b\n",
"        linear_flops = 32 * size * size * 2  # batch × (2·size² matmul FLOPs)\n",
"\n",
"        # Measure actual performance\n",
"        latency = profiler.measure_latency(dummy_input, dummy_input, warmup=3, iterations=10)\n",
"        memory = profiler.measure_memory(dummy_input, input_shape)\n",
"\n",
"        gflops_per_second = (linear_flops / 1e9) / (latency / 1000)\n",
"\n",
"        results.append({\n",
"            'size': size,\n",
"            'parameters': linear_params,\n",
"            'flops': linear_flops,\n",
"            'latency_ms': latency,\n",
"            'memory_mb': memory['peak_memory_mb'],\n",
"            'gflops_per_second': gflops_per_second\n",
"        })\n",
"\n",
"        print(f\"{size}\\t{linear_params:,}\\t\\t{linear_flops:,}\\t\\t\"\n",
"              f\"{latency:.2f}\\t\\t{memory['peak_memory_mb']:.2f}\\t\\t\"\n",
"              f\"{gflops_per_second:.2f}\")\n",
"\n",
"    # Analysis insights\n",
"    print(\"\\n💡 Scaling Analysis Insights:\")\n",
"\n",
"    # Memory scaling\n",
"    memory_growth = results[-1]['memory_mb'] / max(results[0]['memory_mb'], 0.001)\n",
"    print(f\"Memory grows {memory_growth:.1f}× from size {sizes[0]} to size {sizes[-1]}\")\n",
"\n",
"    # Compute scaling\n",
"    compute_growth = results[-1]['gflops_per_second'] / max(results[0]['gflops_per_second'], 0.001)\n",
"    print(f\"Compute efficiency changes {compute_growth:.1f}× with size\")\n",
"\n",
"    # Performance characteristics\n",
"    avg_efficiency = np.mean([r['gflops_per_second'] for r in results])\n",
"    if avg_efficiency < 10:  # Arbitrary threshold for \"low\" efficiency\n",
"        print(\"🚀 Low compute efficiency suggests memory-bound workload\")\n",
"    else:\n",
"        print(\"🚀 High compute efficiency suggests compute-bound workload\")\n",
"\n",
"def analyze_batch_size_effects():\n",
"    \"\"\"📊 Analyze how batch size affects performance and efficiency.\"\"\"\n",
"    print(\"\\n📊 Analyzing Batch Size Effects...\")\n",
"\n",
"    profiler = Profiler()\n",
"    batch_sizes = [1, 8, 32, 128]\n",
"    feature_size = 256\n",
"\n",
"    print(\"\\nBatch Size Effects Analysis:\")\n",
"    print(\"Batch\\tLatency(ms)\\tThroughput(samples/s)\\tMemory(MB)\\tMemory Efficiency\")\n",
"    print(\"-\" * 85)\n",
"\n",
"    for batch_size in batch_sizes:\n",
"        input_shape = (batch_size, feature_size)\n",
"        dummy_input = Tensor(np.random.randn(*input_shape))\n",
"\n",
"        # Measure performance\n",
"        latency = profiler.measure_latency(dummy_input, dummy_input, warmup=3, iterations=10)\n",
"        memory = profiler.measure_memory(dummy_input, input_shape)\n",
"\n",
"        # Calculate throughput (latency is in ms, so scale by 1000)\n",
"        samples_per_second = (batch_size * 1000) / latency\n",
"\n",
"        # Calculate efficiency (samples per unit memory)\n",
"        efficiency = samples_per_second / max(memory['peak_memory_mb'], 0.001)\n",
"\n",
"        print(f\"{batch_size}\\t{latency:.2f}\\t\\t{samples_per_second:.0f}\\t\\t\\t\"\n",
"              f\"{memory['peak_memory_mb']:.2f}\\t\\t{efficiency:.1f}\")\n",
"\n",
"    print(\"\\n💡 Batch Size Insights:\")\n",
"    print(\"Larger batches typically improve throughput but increase memory usage\")\n",
"\n",
"# Run the analysis\n",
"if __name__ == \"__main__\":\n",
"    analyze_model_scaling()\n",
"    analyze_batch_size_effects()"
]
},
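{
"cell_type": "markdown",
"id": "c1f8a2b3",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"A note on the numbers this analysis prints: for the simulated linear layer, parameters scale as size² + size ≈ size², so widening from 64 to 512 (8× wider) multiplies parameters by roughly 64×. A quick check of the formula used above:\n",
"\n",
"```python\n",
"# Recompute linear_params for the sizes profiled above\n",
"for size in [64, 128, 256, 512]:\n",
"    print(size, size * size + size)  # 4,160 -> 262,656 (~64× growth)\n",
"```\n",
"\n",
"Latency usually grows more slowly than FLOPs at small sizes because fixed per-call overheads dominate, which is why measured GFLOP/s often improves as the model gets wider."
]
},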
|
||
{
"cell_type": "markdown",
"id": "08957c5b",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"## 6. Optimization Insights: Production Performance Patterns\n",
"\n",
"Understanding profiling results helps guide optimization decisions. Let's analyze different operation types and measurement overhead.\n",
"\n",
"### Operation Efficiency Analysis\n",
"```\n",
"Operation Types and Their Characteristics:\n",
"┌─────────────────┬──────────────────┬──────────────────┬─────────────────┐\n",
"│ Operation       │ Compute/Memory   │ Optimization     │ Priority        │\n",
"├─────────────────┼──────────────────┼──────────────────┼─────────────────┤\n",
"│ Matrix Multiply │ Compute-bound    │ BLAS libraries   │ High            │\n",
"│ Elementwise     │ Memory-bound     │ Data locality    │ Medium          │\n",
"│ Reductions      │ Memory-bound     │ Parallelization  │ Medium          │\n",
"│ Attention       │ Memory-bound     │ FlashAttention   │ High            │\n",
"└─────────────────┴──────────────────┴──────────────────┴─────────────────┘\n",
"\n",
"Optimization Strategy:\n",
"1. Profile first → Identify bottlenecks\n",
"2. Focus on compute-bound ops → Algorithmic improvements\n",
"3. Focus on memory-bound ops → Data movement optimization\n",
"4. Measure again → Verify improvements (see the sketch below)\n",
"```"
]
},
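{
"cell_type": "markdown",
"id": "d4e6b8c0",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"Step 4 deserves its own habit: never trust an optimization you have not re-measured. A minimal before/after sketch using the `measure_latency` interface from this module (`model`, `optimized_model`, and `sample_input` are placeholders for your own objects):\n",
"\n",
"```python\n",
"profiler = Profiler()\n",
"\n",
"before = profiler.measure_latency(model, sample_input, warmup=3, iterations=10)\n",
"# ... apply an optimization (e.g., better data layout, fused ops) ...\n",
"after = profiler.measure_latency(optimized_model, sample_input, warmup=3, iterations=10)\n",
"\n",
"speedup = before / max(after, 1e-9)\n",
"print(f\"{before:.2f} ms -> {after:.2f} ms ({speedup:.2f}× speedup)\")\n",
"```"
]
},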
|
||
{
"cell_type": "code",
"execution_count": null,
"id": "750be525",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "optimization_insights",
"solution": true
}
},
"outputs": [],
"source": [
"def benchmark_operation_efficiency():\n",
"    \"\"\"📊 Compare efficiency of different operations for optimization guidance.\"\"\"\n",
"    print(\"📊 Benchmarking Operation Efficiency...\")\n",
"\n",
"    profiler = Profiler()\n",
"    operations = []\n",
"\n",
"    # Test different operation types\n",
"    size = 256\n",
"    input_tensor = Tensor(np.random.randn(32, size))\n",
"\n",
"    # Elementwise operations (memory-bound)\n",
"    elementwise_latency = profiler.measure_latency(input_tensor, input_tensor, iterations=20)\n",
"    elementwise_flops = size * 32  # One operation per element\n",
"\n",
"    operations.append({\n",
"        'operation': 'Elementwise',\n",
"        'latency_ms': elementwise_latency,\n",
"        'flops': elementwise_flops,\n",
"        'gflops_per_second': (elementwise_flops / 1e9) / (elementwise_latency / 1000),\n",
"        'efficiency_class': 'memory-bound',\n",
"        'optimization_focus': 'data_locality'\n",
"    })\n",
"\n",
"    # Matrix operations (compute-bound)\n",
"    matrix_tensor = Tensor(np.random.randn(size, size))\n",
"    matrix_latency = profiler.measure_latency(matrix_tensor, input_tensor, iterations=10)\n",
"    matrix_flops = 32 * size * size * 2  # (32×size)·(size×size) matmul: 2 FLOPs per MAC\n",
"\n",
"    operations.append({\n",
"        'operation': 'Matrix Multiply',\n",
"        'latency_ms': matrix_latency,\n",
"        'flops': matrix_flops,\n",
"        'gflops_per_second': (matrix_flops / 1e9) / (matrix_latency / 1000),\n",
"        'efficiency_class': 'compute-bound',\n",
"        'optimization_focus': 'algorithms'\n",
"    })\n",
"\n",
"    # Reduction operations (memory-bound)\n",
"    reduction_latency = profiler.measure_latency(input_tensor, input_tensor, iterations=20)\n",
"    reduction_flops = size * 32  # Sum reduction\n",
"\n",
"    operations.append({\n",
"        'operation': 'Reduction',\n",
"        'latency_ms': reduction_latency,\n",
"        'flops': reduction_flops,\n",
"        'gflops_per_second': (reduction_flops / 1e9) / (reduction_latency / 1000),\n",
"        'efficiency_class': 'memory-bound',\n",
"        'optimization_focus': 'parallelization'\n",
"    })\n",
"\n",
"    print(\"\\nOperation Efficiency Comparison:\")\n",
"    print(\"Operation\\t\\tLatency(ms)\\tGFLOP/s\\t\\tEfficiency Class\\tOptimization Focus\")\n",
"    print(\"-\" * 95)\n",
"\n",
"    for op in operations:\n",
"        print(f\"{op['operation']:<15}\\t{op['latency_ms']:.3f}\\t\\t\"\n",
"              f\"{op['gflops_per_second']:.2f}\\t\\t{op['efficiency_class']:<15}\\t{op['optimization_focus']}\")\n",
"\n",
"    print(\"\\n💡 Operation Optimization Insights:\")\n",
"\n",
"    # Find most and least efficient\n",
"    best_op = max(operations, key=lambda x: x['gflops_per_second'])\n",
"    worst_op = min(operations, key=lambda x: x['gflops_per_second'])\n",
"\n",
"    print(f\"Most efficient: {best_op['operation']} ({best_op['gflops_per_second']:.2f} GFLOP/s)\")\n",
"    print(f\"Least efficient: {worst_op['operation']} ({worst_op['gflops_per_second']:.2f} GFLOP/s)\")\n",
"\n",
"    # Count operation types\n",
"    memory_bound_ops = [op for op in operations if op['efficiency_class'] == 'memory-bound']\n",
"    compute_bound_ops = [op for op in operations if op['efficiency_class'] == 'compute-bound']\n",
"\n",
"    print(f\"\\n🚀 Optimization Priority:\")\n",
"    if len(memory_bound_ops) > len(compute_bound_ops):\n",
"        print(\"Focus on memory optimization: data locality, bandwidth, caching\")\n",
"    else:\n",
"        print(\"Focus on compute optimization: better algorithms, vectorization\")\n",
"\n",
"def analyze_profiling_overhead():\n",
"    \"\"\"📊 Measure the overhead of profiling itself.\"\"\"\n",
"    print(\"\\n📊 Analyzing Profiling Overhead...\")\n",
"\n",
"    # Test with and without profiling\n",
"    test_tensor = Tensor(np.random.randn(100, 100))\n",
"    iterations = 50\n",
"\n",
"    # Without profiling - baseline measurement\n",
"    start_time = time.perf_counter()\n",
"    for _ in range(iterations):\n",
"        _ = test_tensor.data.copy()  # Simple operation\n",
"    end_time = time.perf_counter()\n",
"    baseline_ms = (end_time - start_time) * 1000\n",
"\n",
"    # With profiling - includes measurement overhead\n",
"    profiler = Profiler()\n",
"    start_time = time.perf_counter()\n",
"    for _ in range(iterations):\n",
"        _ = profiler.measure_latency(test_tensor, test_tensor, warmup=1, iterations=1)\n",
"    end_time = time.perf_counter()\n",
"    profiled_ms = (end_time - start_time) * 1000\n",
"\n",
"    overhead_factor = profiled_ms / max(baseline_ms, 0.001)\n",
"\n",
"    print(f\"\\nProfiling Overhead Analysis:\")\n",
"    print(f\"Baseline execution: {baseline_ms:.2f} ms\")\n",
"    print(f\"With profiling: {profiled_ms:.2f} ms\")\n",
"    print(f\"Profiling overhead: {overhead_factor:.1f}× slower\")\n",
"\n",
"    print(f\"\\n💡 Profiling Overhead Insights:\")\n",
"    if overhead_factor < 2:\n",
"        print(\"Low overhead - suitable for frequent profiling\")\n",
"    elif overhead_factor < 10:\n",
"        print(\"Moderate overhead - use for development and debugging\")\n",
"    else:\n",
"        print(\"High overhead - use sparingly in production\")\n",
"\n",
"# Run optimization analysis\n",
"if __name__ == \"__main__\":\n",
"    benchmark_operation_efficiency()\n",
"    analyze_profiling_overhead()"
]
},
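{
"cell_type": "markdown",
"id": "e2a7c5d1",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"Because individual timings are noisy (and profiling adds its own overhead), repeated measurements with a robust summary statistic give more trustworthy numbers. A minimal sketch, assuming the same `measure_latency` interface used above:\n",
"\n",
"```python\n",
"test_tensor = Tensor(np.random.randn(100, 100))\n",
"profiler = Profiler()\n",
"\n",
"# Repeat the measurement and summarize the distribution, not a single run\n",
"runs = [profiler.measure_latency(test_tensor, test_tensor, warmup=1, iterations=5)\n",
"        for _ in range(20)]\n",
"\n",
"print(f\"median: {np.median(runs):.3f} ms, \"\n",
"      f\"p95: {np.percentile(runs, 95):.3f} ms, \"\n",
"      f\"std: {np.std(runs):.3f} ms\")\n",
"```"
]
},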
|
||
{
"cell_type": "markdown",
"id": "a170135d",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"## 🧪 Module Integration Test\n",
"\n",
"Final validation that everything works together correctly."
]
},
|
||
{
"cell_type": "code",
"execution_count": null,
"id": "379ab83a",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test_module",
"locked": true,
"points": 20
}
},
"outputs": [],
"source": [
"def test_module():\n",
"    \"\"\"\n",
"    Comprehensive test of entire profiling module functionality.\n",
"\n",
"    This final test runs before the module summary to ensure:\n",
"    - All unit tests pass\n",
"    - Functions work together correctly\n",
"    - Module is ready for integration with TinyTorch\n",
"    \"\"\"\n",
"    print(\"🧪 RUNNING MODULE INTEGRATION TEST\")\n",
"    print(\"=\" * 50)\n",
"\n",
"    # Run all unit tests\n",
"    print(\"Running unit tests...\")\n",
"    test_unit_parameter_counting()\n",
"    test_unit_flop_counting()\n",
"    test_unit_memory_measurement()\n",
"    test_unit_latency_measurement()\n",
"    test_unit_advanced_profiling()\n",
"\n",
"    print(\"\\nRunning integration scenarios...\")\n",
"\n",
"    # Test realistic usage patterns\n",
"    print(\"🔬 Integration Test: Complete Profiling Workflow...\")\n",
"\n",
"    # Create profiler\n",
"    profiler = Profiler()\n",
"\n",
"    # Create test model and data\n",
"    test_model = Tensor(np.random.randn(16, 32))\n",
"    test_input = Tensor(np.random.randn(8, 16))\n",
"\n",
"    # Run complete profiling workflow\n",
"    print(\"1. Measuring model characteristics...\")\n",
"    params = profiler.count_parameters(test_model)\n",
"    flops = profiler.count_flops(test_model, test_input.shape)\n",
"    memory = profiler.measure_memory(test_model, test_input.shape)\n",
"    latency = profiler.measure_latency(test_model, test_input, warmup=2, iterations=5)\n",
"\n",
"    print(f\"   Parameters: {params}\")\n",
"    print(f\"   FLOPs: {flops}\")\n",
"    print(f\"   Memory: {memory['peak_memory_mb']:.2f} MB\")\n",
"    print(f\"   Latency: {latency:.2f} ms\")\n",
"\n",
"    # Test advanced profiling\n",
"    print(\"2. Running advanced profiling...\")\n",
"    forward_profile = profiler.profile_forward_pass(test_model, test_input)\n",
"    backward_profile = profiler.profile_backward_pass(test_model, test_input)\n",
"\n",
"    assert 'gflops_per_second' in forward_profile\n",
"    assert 'total_latency_ms' in backward_profile\n",
"    print(f\"   Forward GFLOP/s: {forward_profile['gflops_per_second']:.2f}\")\n",
"    print(f\"   Training latency: {backward_profile['total_latency_ms']:.2f} ms\")\n",
"\n",
"    # Test bottleneck analysis\n",
"    print(\"3. Analyzing performance bottlenecks...\")\n",
"    bottleneck = forward_profile['bottleneck']\n",
"    efficiency = forward_profile['computational_efficiency']\n",
"    print(f\"   Bottleneck: {bottleneck}\")\n",
"    print(f\"   Compute efficiency: {efficiency:.3f}\")\n",
"\n",
"    # Validate end-to-end workflow\n",
"    assert params >= 0, \"Parameter count should be non-negative\"\n",
"    assert flops >= 0, \"FLOP count should be non-negative\"\n",
"    assert memory['peak_memory_mb'] >= 0, \"Memory usage should be non-negative\"\n",
"    assert latency >= 0, \"Latency should be non-negative\"\n",
"    assert forward_profile['gflops_per_second'] >= 0, \"GFLOP/s should be non-negative\"\n",
"    assert backward_profile['total_latency_ms'] >= 0, \"Total latency should be non-negative\"\n",
"    assert bottleneck in ['memory', 'compute'], \"Bottleneck should be memory or compute\"\n",
"    assert 0 <= efficiency <= 1, \"Efficiency should be between 0 and 1\"\n",
"\n",
"    print(\"✅ End-to-end profiling workflow works!\")\n",
"\n",
"    # Test production-like scenario\n",
"    print(\"4. Testing production profiling scenario...\")\n",
"\n",
"    # Simulate larger model analysis\n",
"    large_input = Tensor(np.random.randn(32, 512))  # Larger model input\n",
"    large_profile = profiler.profile_forward_pass(large_input, large_input)\n",
"\n",
"    # Verify profile contains optimization insights\n",
"    assert 'bottleneck' in large_profile, \"Profile should identify bottlenecks\"\n",
"    assert 'memory_bandwidth_mbs' in large_profile, \"Profile should measure memory bandwidth\"\n",
"\n",
"    print(f\"   Large model analysis: {large_profile['bottleneck']} bottleneck\")\n",
"    print(f\"   Memory bandwidth: {large_profile['memory_bandwidth_mbs']:.1f} MB/s\")\n",
"\n",
"    print(\"✅ Production profiling scenario works!\")\n",
"\n",
"    print(\"\\n\" + \"=\" * 50)\n",
"    print(\"🎉 ALL TESTS PASSED! Module ready for export.\")\n",
"    print(\"Run: tito module complete 14\")\n",
"\n",
"# Call before module summary\n",
"if __name__ == \"__main__\":\n",
"    test_module()"
]
},
|
||
{
"cell_type": "code",
"execution_count": null,
"id": "6502f689",
"metadata": {},
"outputs": [],
"source": [
"if __name__ == \"__main__\":\n",
"    print(\"🚀 Running Profiling module...\")\n",
"    test_module()\n",
"    print(\"✅ Module validation complete!\")"
]
},
|
||
{
"cell_type": "markdown",
"id": "b4ff25e4",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 🤔 ML Systems Thinking: Performance Measurement\n",
"\n",
"### Question 1: FLOP Analysis\n",
"You implemented a profiler that counts FLOPs for different operations.\n",
"For a Linear layer with 1000 input features and 500 output features:\n",
"- How many FLOPs are required for one forward pass? _____ FLOPs\n",
"- If you process a batch of 32 samples, how does this change the per-sample FLOPs? _____\n",
"\n",
"### Question 2: Memory Scaling\n",
"Your profiler measures memory usage for models and activations.\n",
"A transformer model has 125M parameters (500MB at FP32).\n",
"During training with batch size 16:\n",
"- What's the minimum memory for gradients? _____ MB\n",
"- With Adam optimizer, what's the total memory requirement? _____ MB\n",
"\n",
"### Question 3: Performance Bottlenecks\n",
"You built tools to identify compute vs memory bottlenecks.\n",
"A model achieves 10 GFLOP/s on hardware with 100 GFLOP/s peak:\n",
"- What's the computational efficiency? _____%\n",
"- If doubling batch size doesn't improve GFLOP/s, the bottleneck is likely _____\n",
"\n",
"### Question 4: Profiling Trade-offs\n",
"Your profiler adds measurement overhead to understand performance.\n",
"If profiling adds 5× overhead but reveals a 50% speedup opportunity:\n",
"- Is the profiling cost justified for development? _____\n",
"- When should you disable profiling in production? _____"
]
},
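{
"cell_type": "markdown",
"id": "f9b3d6e4",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"Once you have attempted the questions, a short script can check the arithmetic (a sketch based on the formulas used in this module, not an official answer key):\n",
"\n",
"```python\n",
"# Q1: Linear layer FLOPs = 2 × in_features × out_features (multiply + add)\n",
"print(f\"Q1: {2 * 1000 * 500:,} FLOPs per sample; batching scales total FLOPs,\"\n",
"      \" per-sample FLOPs stay the same\")\n",
"\n",
"# Q2: gradients mirror parameters; Adam adds two state tensors\n",
"print(f\"Q2: 500 MB gradients; {500 + 500 + 2 * 500} MB total (plus activations)\")\n",
"\n",
"# Q3: efficiency = achieved / peak; flat GFLOP/s with larger batches\n",
"# points at a memory bottleneck\n",
"print(f\"Q3: {10 / 100:.0%} computational efficiency\")\n",
"```"
]
},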
|
||
{
"cell_type": "markdown",
"id": "72dec7d6",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 🎯 MODULE SUMMARY: Profiling\n",
"\n",
"Congratulations! You've built a comprehensive profiling system for ML performance analysis!\n",
"\n",
"### Key Accomplishments\n",
"- Built complete Profiler class with parameter, FLOP, memory, and latency measurement\n",
"- Implemented advanced profiling functions for forward and backward pass analysis\n",
"- Discovered performance characteristics through scaling and efficiency analysis\n",
"- Created production-quality measurement tools for optimization guidance\n",
"- All tests pass ✅ (validated by `test_module()`)\n",
"\n",
"### Systems Insights Gained\n",
"- **FLOPs vs Reality**: Theoretical operations don't always predict actual performance\n",
"- **Memory Bottlenecks**: Many ML operations are limited by memory bandwidth, not compute\n",
"- **Batch Size Effects**: Larger batches improve throughput but increase memory requirements\n",
"- **Profiling Overhead**: Measurement tools have costs but enable data-driven optimization\n",
"\n",
"### Production Skills Developed\n",
"- **Performance Detective Work**: Use data, not guesses, to identify bottlenecks\n",
"- **Optimization Prioritization**: Focus efforts on actual bottlenecks, not assumptions\n",
"- **Resource Planning**: Predict memory and compute requirements for deployment\n",
"- **Statistical Rigor**: Handle measurement variance with proper methodology\n",
"\n",
"### Ready for Next Steps\n",
"Your profiling implementation enables the optimization modules (15-18) to make data-driven decisions.\n",
"Export with: `tito module complete 14`\n",
"\n",
"**Next**: Module 15 (Memoization) will use profiling to discover transformer bottlenecks and fix them!"
]
}
|
||
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}