mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-04-30 23:47:32 -05:00
1983 lines
84 KiB
Plaintext
{
"cells": [
{
"cell_type": "markdown",
"id": "55618ade",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"# Module 14: Profiling - Measuring What Matters in ML Systems\n",
"\n",
"Welcome to Module 14! You'll build professional profiling tools to measure model performance and uncover optimization opportunities.\n",
"\n",
"## 🔗 Prerequisites & Progress\n",
"**You've Built**: Complete ML stack from tensors to transformers\n",
"**You'll Build**: Comprehensive profiling system for parameters, FLOPs, memory, and latency\n",
"**You'll Enable**: Data-driven optimization decisions and performance analysis\n",
"\n",
"**Connection Map**:\n",
"```\n",
"All Modules (01-13) → Profiling (14) → Optimization Techniques (15-18)\n",
"  (implementations)    (measurement)       (targeted fixes)\n",
"```\n",
"\n",
"## Learning Objectives\n",
"By the end of this module, you will:\n",
"1. Implement a complete Profiler class for model analysis\n",
"2. Count parameters and FLOPs accurately for different architectures\n",
"3. Measure memory usage and latency with statistical rigor\n",
"4. Create production-quality performance analysis tools\n",
"\n",
"Let's build the measurement foundation for ML systems optimization!\n",
"\n",
"## 📦 Where This Code Lives in the Final Package\n",
"\n",
"**Learning Side:** You work in `modules/14_profiling/profiling_dev.py`\n",
"**Building Side:** Code exports to `tinytorch.profiling.profiler`\n",
"\n",
"```python\n",
"# How to use this module:\n",
"from tinytorch.profiling.profiler import Profiler, profile_forward_pass, profile_backward_pass\n",
"```\n",
"\n",
"**Why this matters:**\n",
"- **Learning:** Complete profiling system for understanding model performance characteristics\n",
"- **Production:** Professional measurement tools like those used in PyTorch, TensorFlow\n",
"- **Consistency:** All profiling and measurement tools in profiling.profiler\n",
"- **Integration:** Works with any model built using TinyTorch components"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d92307b1",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "imports",
"solution": true
}
},
"outputs": [],
"source": [
"#| default_exp profiling.profiler\n",
"#| export\n",
"\n",
"import time\n",
"import numpy as np\n",
"import tracemalloc\n",
"from typing import Dict, List, Any, Optional, Tuple\n",
"from collections import defaultdict\n",
"import gc\n",
"\n",
"# Import our TinyTorch components for profiling\n",
"from tinytorch.core.tensor import Tensor\n",
"from tinytorch.core.layers import Linear\n",
"from tinytorch.core.spatial import Conv2d"
]
},
{
"cell_type": "markdown",
"id": "6e4fb271",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 1. Introduction: Why Profiling Matters in ML Systems\n",
"\n",
"Imagine you're a detective investigating a performance crime. Your model is running slowly, using too much memory, or burning through compute budgets. Without profiling, you're flying blind - making guesses about what to optimize. With profiling, you have evidence.\n",
"\n",
"**The Performance Investigation Process:**\n",
"```\n",
"Suspect Model → Profile Evidence → Identify Bottleneck → Target Optimization\n",
"      ↓                ↓                   ↓                     ↓\n",
"  \"Too slow\"     \"200 GFLOP/s\"      \"Memory bound\"      \"Reduce transfers\"\n",
"```\n",
"\n",
"**Questions Profiling Answers:**\n",
"- **How many parameters?** (Memory footprint, model size)\n",
"- **How many FLOPs?** (Computational cost, energy usage)\n",
"- **Where are bottlenecks?** (Memory vs compute bound)\n",
"- **What's actual latency?** (Real-world performance)\n",
"\n",
"**Production Importance:**\n",
"In production ML systems, profiling isn't optional - it's survival. A model that's 10% more accurate but 100× slower often can't be deployed. Teams use profiling daily to make data-driven optimization decisions, not guesses.\n",
"\n",
"### The Profiling Workflow Visualization\n",
"```\n",
"Model → Profiler → Measurements → Analysis → Optimization Decision\n",
"  ↓        ↓            ↓            ↓                ↓\n",
" GPT    Parameter   125M params   Memory      Use quantization\n",
"        Counter     2.5B FLOPs    bound       Reduce precision\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "ddfa3dfb",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"### 🔗 From Optimization to Discovery: Connecting Module 14\n",
"\n",
"**In the previous module**, you implemented KV caching and saw a 10-15× speedup.\n",
"**In this module**, you'll learn HOW to discover such optimization opportunities.\n",
"\n",
"**The Real ML Engineering Workflow**:\n",
"```\n",
"Step 1: Measure (This Module!)        Step 2: Analyze\n",
"        ↓                                     ↓\n",
"Profile baseline → Find bottleneck → Understand cause\n",
"40 tok/s           80% in attention  O(n²) recomputation\n",
"                                             ↓\n",
"Step 4: Validate                      Step 3: Optimize (previous module)\n",
"        ↓                                     ↓\n",
"Profile optimized ← Verify speedup ← Implement KV cache\n",
"500 tok/s (12.5x)   Measure impact   Design solution\n",
"```\n",
"\n",
"**Without this module's profiling**: You'd never know WHERE to optimize!\n",
"**Without the previous module's optimization**: You couldn't FIX the bottleneck!\n",
"\n",
"This module teaches the measurement and analysis skills that enable\n",
"optimization breakthroughs like KV caching. You'll profile real models\n",
"and discover bottlenecks just like production ML teams do."
]
},
{
"cell_type": "markdown",
"id": "d5a2e470",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 2. Foundations: Performance Measurement Principles\n",
"\n",
"Before we build our profiler, let's understand what we're measuring and why each metric matters.\n",
"\n",
"### Parameter Counting - Model Size Detective Work\n",
"\n",
"Parameters determine your model's memory footprint and storage requirements. Every parameter is typically a 32-bit float (4 bytes), so counting them precisely predicts memory usage.\n",
"\n",
"**Parameter Counting Formula:**\n",
"```\n",
"Linear Layer: (input_features × output_features) + output_features = total parameters\n",
"                        ↑                               ↑\n",
"                  Weight matrix                    Bias vector\n",
"\n",
"Example: Linear(768, 3072) → (768 × 3072) + 3072 = 2,362,368 parameters\n",
"Memory: 2,362,368 × 4 bytes = 9.45 MB\n",
"```\n",
"\n",
"### FLOP Counting - Computational Cost Analysis\n",
"\n",
"FLOPs (Floating Point Operations) measure computational work. Unlike wall-clock time, FLOPs are hardware-independent and predict compute costs across different systems.\n",
"\n",
"**FLOP Formulas for Key Operations:**\n",
"```\n",
"Matrix Multiplication (M,K) @ (K,N):\n",
"    FLOPs = M × N × K × 2\n",
"            ↑   ↑   ↑   ↑\n",
"          Rows Cols Inner Multiply+Add\n",
"\n",
"Linear Layer Forward:\n",
"    FLOPs = batch_size × input_features × output_features × 2\n",
"    (matmul cost for the whole batch; the ×2 counts one multiply + one add,\n",
"     and the bias add contributes a further batch_size × output_features ops)\n",
"\n",
"Convolution (simplified):\n",
"    FLOPs = output_H × output_W × kernel_H × kernel_W × in_channels × out_channels × 2\n",
"```\n",
"\n",
"### Memory Profiling - The Three Types of Memory\n",
"\n",
"ML models use memory in three distinct ways, each with different optimization strategies:\n",
"\n",
"**Memory Type Breakdown:**\n",
"```\n",
"Total Training Memory = Parameters + Activations + Gradients + Optimizer State\n",
"                             ↓            ↓            ↓             ↓\n",
"                           Model       Forward      Backward   Adam: 2×params\n",
"                          weights     pass cache   gradients   SGD:  0×params\n",
"\n",
"Example for 125M parameter model:\n",
"Parameters:   500 MB  (125M × 4 bytes)\n",
"Activations:  200 MB  (depends on batch size)\n",
"Gradients:    500 MB  (same as parameters)\n",
"Adam state: 1,000 MB  (momentum + velocity)\n",
"Total:      2,200 MB  (4.4× parameter memory!)\n",
"```\n",
"\n",
"### Latency Measurement - Dealing with Reality\n",
"\n",
"Latency measurement is tricky because systems have variance, warmup effects, and measurement overhead. Professional profiling requires statistical rigor.\n",
"\n",
"**Latency Measurement Best Practices:**\n",
"```\n",
"Measurement Protocol:\n",
"1. Warmup runs (10+)  → CPU/GPU caches warm up\n",
"2. Timed runs (100+)  → Statistical significance\n",
"3. Outlier handling   → Use median, not mean\n",
"4. Memory cleanup     → Prevent contamination\n",
"\n",
"Timeline:\n",
"Warmup: [run][run][run]...[run]      ← Don't time these\n",
"Timing: [⏱run⏱][⏱run⏱]...[⏱run⏱]    ← Time these\n",
"Result: median(all_times)            ← Robust to outliers\n",
"```\n",
]
},
{
"cell_type": "markdown",
"id": "c466e14d",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"## 3. Implementation: Building the Core Profiler Class\n",
"\n",
"Now let's implement our profiler step by step. We'll start with the foundation and build up to comprehensive analysis.\n",
"\n",
"### The Profiler Architecture\n",
"```\n",
"Profiler Class\n",
"├── count_parameters()      → Model size analysis\n",
"├── count_flops()           → Computational cost estimation\n",
"├── measure_memory()        → Memory usage tracking\n",
"├── measure_latency()       → Performance timing\n",
"├── profile_layer()         → Layer-wise analysis\n",
"├── profile_forward_pass()  → Complete forward analysis\n",
"└── profile_backward_pass() → Training analysis\n",
"\n",
"Integration:\n",
"All methods work together to provide comprehensive performance insights\n",
"```\n",
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "31829387",
"metadata": {
"lines_to_next_cell": 1,
"nbgrader": {
"grade": false,
"grade_id": "profiler_class",
"solution": true
}
},
"outputs": [],
"source": [
"#| export\n",
"class Profiler:\n",
"    \"\"\"\n",
"    Professional-grade ML model profiler for performance analysis.\n",
"\n",
"    Measures parameters, FLOPs, memory usage, and latency with statistical rigor.\n",
"    Used for optimization guidance and deployment planning.\n",
"    \"\"\"\n",
"\n",
"    def __init__(self):\n",
"        \"\"\"\n",
"        Initialize profiler with measurement state.\n",
"\n",
"        TODO: Set up profiler tracking structures\n",
"\n",
"        APPROACH:\n",
"        1. Create empty measurements dictionary\n",
"        2. Initialize operation counters\n",
"        3. Set up memory tracking state\n",
"\n",
"        EXAMPLE:\n",
"        >>> profiler = Profiler()\n",
"        >>> profiler.measurements\n",
"        {}\n",
"\n",
"        HINTS:\n",
"        - Use defaultdict(int) for operation counters\n",
"        - measurements dict will store timing results\n",
"        \"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        self.measurements = {}\n",
"        self.operation_counts = defaultdict(int)\n",
"        self.memory_tracker = None\n",
"        ### END SOLUTION\n",
"\n",
"    def count_parameters(self, model) -> int:\n",
"        \"\"\"\n",
"        Count total trainable parameters in a model.\n",
"\n",
"        TODO: Implement parameter counting for any model with parameters() method\n",
"\n",
"        APPROACH:\n",
"        1. Get all parameters from model.parameters() if available\n",
"        2. For single layers, count weight and bias directly\n",
"        3. Sum total element count across all parameter tensors\n",
"\n",
"        EXAMPLE:\n",
"        >>> linear = Linear(128, 64)  # 128*64 + 64 = 8256 parameters\n",
"        >>> profiler = Profiler()\n",
"        >>> count = profiler.count_parameters(linear)\n",
"        >>> print(count)\n",
"        8256\n",
"\n",
"        HINTS:\n",
"        - Use parameter.data.size for tensor element count\n",
"        - Handle models with and without parameters() method\n",
"        - Don't forget bias terms when present\n",
"        \"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        total_params = 0\n",
"\n",
"        # Handle different model types\n",
"        if hasattr(model, 'parameters'):\n",
"            # Model with parameters() method (Sequential, custom models)\n",
"            for param in model.parameters():\n",
"                total_params += param.data.size\n",
"        elif hasattr(model, 'weight'):\n",
"            # Single layer (Linear, Conv2d)\n",
"            total_params += model.weight.data.size\n",
"            if hasattr(model, 'bias') and model.bias is not None:\n",
"                total_params += model.bias.data.size\n",
"        else:\n",
"            # No parameters (activations, etc.)\n",
"            total_params = 0\n",
"\n",
"        return total_params\n",
"        ### END SOLUTION\n",
"\n",
"    def count_flops(self, model, input_shape: Tuple[int, ...]) -> int:\n",
"        \"\"\"\n",
"        Count FLOPs (Floating Point Operations) for one forward pass.\n",
"\n",
"        TODO: Implement FLOP counting for different layer types\n",
"\n",
"        APPROACH:\n",
"        1. Create dummy input with given shape\n",
"        2. Calculate FLOPs based on layer type and dimensions\n",
"        3. Handle different model architectures (Linear, Conv2d, Sequential)\n",
"\n",
"        LAYER-SPECIFIC FLOP FORMULAS:\n",
"        - Linear: input_features × output_features × 2 (multiply + add; bias FLOPs ignored here)\n",
"        - Conv2d: output_h × output_w × kernel_h × kernel_w × in_channels × out_channels × 2\n",
"        - Activation: Usually 1 FLOP per element (ReLU, Sigmoid)\n",
"\n",
"        EXAMPLE:\n",
"        >>> linear = Linear(128, 64)\n",
"        >>> profiler = Profiler()\n",
"        >>> flops = profiler.count_flops(linear, (1, 128))\n",
"        >>> print(flops)  # 128 * 64 * 2 = 16384\n",
"        16384\n",
"\n",
"        HINTS:\n",
"        - Batch dimension doesn't affect per-sample FLOPs\n",
"        - Focus on major operations (matmul, conv) first\n",
"        - For Sequential models, sum FLOPs of all layers\n",
"        \"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        # Create dummy input (unused but kept for interface consistency)\n",
"        _dummy_input = Tensor(np.random.randn(*input_shape))\n",
"        total_flops = 0\n",
"\n",
"        # Handle different model types\n",
"        if hasattr(model, '__class__'):\n",
"            model_name = model.__class__.__name__\n",
"\n",
"            if model_name == 'Linear':\n",
"                # Linear layer: input_features × output_features × 2\n",
"                in_features = input_shape[-1]\n",
"                out_features = model.weight.shape[1] if hasattr(model, 'weight') else 1\n",
"                total_flops = in_features * out_features * 2\n",
"\n",
"            elif model_name == 'Conv2d':\n",
"                # Conv2d layer: complex calculation based on output size\n",
"                # Simplified: assume we know the output dimensions\n",
"                if hasattr(model, 'kernel_size') and hasattr(model, 'in_channels'):\n",
"                    _batch_size = input_shape[0] if len(input_shape) > 3 else 1\n",
"                    in_channels = model.in_channels\n",
"                    out_channels = model.out_channels\n",
"                    kernel_h = kernel_w = model.kernel_size\n",
"\n",
"                    # Estimate output size (simplified)\n",
"                    input_h, input_w = input_shape[-2], input_shape[-1]\n",
"                    output_h = input_h // (model.stride if hasattr(model, 'stride') else 1)\n",
"                    output_w = input_w // (model.stride if hasattr(model, 'stride') else 1)\n",
"\n",
"                    total_flops = (output_h * output_w * kernel_h * kernel_w *\n",
"                                   in_channels * out_channels * 2)\n",
"\n",
"            elif model_name == 'Sequential':\n",
"                # Sequential model: sum FLOPs of all layers\n",
"                current_shape = input_shape\n",
"                for layer in model.layers:\n",
"                    layer_flops = self.count_flops(layer, current_shape)\n",
"                    total_flops += layer_flops\n",
"                    # Update shape for next layer (simplified)\n",
"                    if hasattr(layer, 'weight'):\n",
"                        current_shape = current_shape[:-1] + (layer.weight.shape[1],)\n",
"\n",
"            else:\n",
"                # Activation or other: assume 1 FLOP per element\n",
"                total_flops = np.prod(input_shape)\n",
"\n",
"        return total_flops\n",
"        ### END SOLUTION\n",
"\n",
"    def measure_memory(self, model, input_shape: Tuple[int, ...]) -> Dict[str, float]:\n",
"        \"\"\"\n",
"        Measure memory usage during forward pass.\n",
"\n",
"        TODO: Implement memory tracking for model execution\n",
"\n",
"        APPROACH:\n",
"        1. Use tracemalloc to track memory allocation\n",
"        2. Measure baseline memory before model execution\n",
"        3. Run forward pass and track peak usage\n",
"        4. Calculate different memory components\n",
"\n",
"        RETURN DICTIONARY:\n",
"        - 'parameter_memory_mb': Memory for model parameters\n",
"        - 'activation_memory_mb': Memory for activations\n",
"        - 'peak_memory_mb': Maximum memory usage\n",
"        - 'memory_efficiency': Ratio of useful to total memory\n",
"\n",
"        EXAMPLE:\n",
"        >>> linear = Linear(1024, 512)\n",
"        >>> profiler = Profiler()\n",
"        >>> memory = profiler.measure_memory(linear, (32, 1024))\n",
"        >>> print(f\"Parameters: {memory['parameter_memory_mb']:.1f} MB\")\n",
"        Parameters: 2.0 MB\n",
"\n",
"        HINTS:\n",
"        - Use tracemalloc.start() and tracemalloc.get_traced_memory()\n",
"        - Account for float32 = 4 bytes per parameter\n",
"        - Activation memory scales with batch size\n",
"        \"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        # Start memory tracking\n",
"        tracemalloc.start()\n",
"\n",
"        # Record baseline memory so the peak can be reported relative to it\n",
"        _baseline_memory = tracemalloc.get_traced_memory()[0]\n",
"\n",
"        # Calculate parameter memory\n",
"        param_count = self.count_parameters(model)\n",
"        parameter_memory_bytes = param_count * 4  # Assume float32\n",
"        parameter_memory_mb = parameter_memory_bytes / (1024 * 1024)\n",
"\n",
"        # Create input and measure activation memory\n",
"        dummy_input = Tensor(np.random.randn(*input_shape))\n",
"        input_memory_bytes = dummy_input.data.nbytes\n",
"\n",
"        # Estimate activation memory (simplified)\n",
"        activation_memory_bytes = input_memory_bytes * 2  # Rough estimate\n",
"        activation_memory_mb = activation_memory_bytes / (1024 * 1024)\n",
"\n",
"        # Try to run forward pass and measure peak\n",
"        try:\n",
"            if hasattr(model, 'forward'):\n",
"                _ = model.forward(dummy_input)\n",
"            elif hasattr(model, '__call__'):\n",
"                _ = model(dummy_input)\n",
"        except Exception:\n",
"            pass  # Ignore errors for simplified measurement\n",
"\n",
"        # Get peak memory\n",
"        _current_memory, peak_memory = tracemalloc.get_traced_memory()\n",
"        peak_memory_mb = (peak_memory - _baseline_memory) / (1024 * 1024)\n",
"\n",
"        tracemalloc.stop()\n",
"\n",
"        # Calculate efficiency\n",
"        useful_memory = parameter_memory_mb + activation_memory_mb\n",
"        memory_efficiency = useful_memory / max(peak_memory_mb, 0.001)  # Avoid division by zero\n",
"\n",
"        return {\n",
"            'parameter_memory_mb': parameter_memory_mb,\n",
"            'activation_memory_mb': activation_memory_mb,\n",
"            'peak_memory_mb': max(peak_memory_mb, useful_memory),\n",
"            'memory_efficiency': min(memory_efficiency, 1.0)\n",
"        }\n",
"        ### END SOLUTION\n",
"\n",
"    def measure_latency(self, model, input_tensor, warmup: int = 10, iterations: int = 100) -> float:\n",
"        \"\"\"\n",
"        Measure model inference latency with statistical rigor.\n",
"\n",
"        TODO: Implement accurate latency measurement\n",
"\n",
"        APPROACH:\n",
"        1. Run warmup iterations to stabilize performance\n",
"        2. Measure multiple iterations for statistical accuracy\n",
"        3. Calculate median latency to handle outliers\n",
"        4. Return latency in milliseconds\n",
"\n",
"        PARAMETERS:\n",
"        - warmup: Number of warmup runs (default 10)\n",
"        - iterations: Number of measurement runs (default 100)\n",
"\n",
"        EXAMPLE:\n",
"        >>> linear = Linear(128, 64)\n",
"        >>> input_tensor = Tensor(np.random.randn(1, 128))\n",
"        >>> profiler = Profiler()\n",
"        >>> latency = profiler.measure_latency(linear, input_tensor)\n",
"        >>> print(f\"Latency: {latency:.2f} ms\")\n",
"        Latency: 0.15 ms\n",
"\n",
"        HINTS:\n",
"        - Use time.perf_counter() for high precision\n",
"        - Use median instead of mean for robustness against outliers\n",
"        - Handle different model interfaces (forward, __call__)\n",
"        \"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        # Warmup runs\n",
"        for _ in range(warmup):\n",
"            try:\n",
"                if hasattr(model, 'forward'):\n",
"                    _ = model.forward(input_tensor)\n",
"                elif hasattr(model, '__call__'):\n",
"                    _ = model(input_tensor)\n",
"                else:\n",
"                    # Fallback for simple operations\n",
"                    _ = input_tensor\n",
"            except Exception:\n",
"                pass  # Ignore errors during warmup\n",
"\n",
"        # Measurement runs\n",
"        times = []\n",
"        for _ in range(iterations):\n",
"            start_time = time.perf_counter()\n",
"\n",
"            try:\n",
"                if hasattr(model, 'forward'):\n",
"                    _ = model.forward(input_tensor)\n",
"                elif hasattr(model, '__call__'):\n",
"                    _ = model(input_tensor)\n",
"                else:\n",
"                    # Minimal operation for timing\n",
"                    _ = input_tensor.data.copy()\n",
"            except Exception:\n",
"                pass  # Ignore errors but still measure time\n",
"\n",
"            end_time = time.perf_counter()\n",
"            times.append((end_time - start_time) * 1000)  # Convert to milliseconds\n",
"\n",
"        # Calculate statistics - use median for robustness\n",
"        times = np.array(times)\n",
"        median_latency = np.median(times)\n",
"\n",
"        return float(median_latency)\n",
"        ### END SOLUTION\n",
"\n",
"    def profile_layer(self, layer, input_shape: Tuple[int, ...]) -> Dict[str, Any]:\n",
"        \"\"\"\n",
"        Profile a single layer comprehensively.\n",
"\n",
"        TODO: Implement layer-wise profiling\n",
"\n",
"        APPROACH:\n",
"        1. Count parameters for this layer\n",
"        2. Count FLOPs for this layer\n",
"        3. Measure memory usage\n",
"        4. Measure latency\n",
"        5. Return comprehensive layer profile\n",
"\n",
"        EXAMPLE:\n",
"        >>> linear = Linear(256, 128)\n",
"        >>> profiler = Profiler()\n",
"        >>> profile = profiler.profile_layer(linear, (32, 256))\n",
"        >>> print(f\"Layer uses {profile['parameters']} parameters\")\n",
"        Layer uses 32896 parameters\n",
"\n",
"        HINTS:\n",
"        - Use existing profiler methods (count_parameters, count_flops, etc.)\n",
"        - Create dummy input for latency measurement\n",
"        - Include layer type information in profile\n",
"        \"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        # Create dummy input for latency measurement\n",
"        dummy_input = Tensor(np.random.randn(*input_shape))\n",
"\n",
"        # Gather all measurements\n",
"        params = self.count_parameters(layer)\n",
"        flops = self.count_flops(layer, input_shape)\n",
"        memory = self.measure_memory(layer, input_shape)\n",
"        latency = self.measure_latency(layer, dummy_input, warmup=3, iterations=10)\n",
"\n",
"        # Compute derived metrics\n",
"        gflops_per_second = (flops / 1e9) / max(latency / 1000, 1e-6)\n",
"\n",
"        return {\n",
"            'layer_type': layer.__class__.__name__,\n",
"            'parameters': params,\n",
"            'flops': flops,\n",
"            'latency_ms': latency,\n",
"            'gflops_per_second': gflops_per_second,\n",
"            **memory\n",
"        }\n",
"        ### END SOLUTION\n",
"\n",
"    def profile_forward_pass(self, model, input_tensor) -> Dict[str, Any]:\n",
"        \"\"\"\n",
"        Comprehensive profiling of a model's forward pass.\n",
"\n",
"        TODO: Implement complete forward pass analysis\n",
"\n",
"        APPROACH:\n",
"        1. Use Profiler class to gather all measurements\n",
"        2. Create comprehensive performance profile\n",
"        3. Add derived metrics and insights\n",
"        4. Return structured analysis results\n",
"\n",
"        RETURN METRICS:\n",
"        - All basic profiler measurements\n",
"        - FLOPs per second (computational efficiency)\n",
"        - Memory bandwidth utilization\n",
"        - Performance bottleneck identification\n",
"\n",
"        EXAMPLE:\n",
"        >>> model = Linear(256, 128)\n",
"        >>> input_data = Tensor(np.random.randn(32, 256))\n",
"        >>> profiler = Profiler()\n",
"        >>> profile = profiler.profile_forward_pass(model, input_data)\n",
"        >>> print(f\"Throughput: {profile['gflops_per_second']:.2f} GFLOP/s\")\n",
"        Throughput: 2.45 GFLOP/s\n",
"\n",
"        HINTS:\n",
"        - GFLOP/s = (FLOPs / 1e9) / (latency_ms / 1000)\n",
"        - Memory bandwidth = memory_mb / (latency_ms / 1000)\n",
"        - Consider realistic hardware limits for efficiency calculations\n",
"        \"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        # Basic measurements\n",
"        param_count = self.count_parameters(model)\n",
"        flops = self.count_flops(model, input_tensor.shape)\n",
"        memory_stats = self.measure_memory(model, input_tensor.shape)\n",
"        latency_ms = self.measure_latency(model, input_tensor, warmup=5, iterations=20)\n",
"\n",
"        # Derived metrics\n",
"        latency_seconds = latency_ms / 1000.0\n",
"        gflops_per_second = (flops / 1e9) / max(latency_seconds, 1e-6)\n",
"\n",
"        # Memory bandwidth (MB/s)\n",
"        memory_bandwidth = memory_stats['peak_memory_mb'] / max(latency_seconds, 1e-6)\n",
"\n",
"        # Efficiency metrics\n",
"        theoretical_peak_gflops = 100.0  # Assume 100 GFLOP/s theoretical peak for CPU\n",
"        computational_efficiency = min(gflops_per_second / theoretical_peak_gflops, 1.0)\n",
"\n",
"        # Bottleneck analysis\n",
"        is_memory_bound = memory_bandwidth > gflops_per_second * 100  # Rough heuristic\n",
"        is_compute_bound = not is_memory_bound\n",
"\n",
"        return {\n",
"            # Basic measurements\n",
"            'parameters': param_count,\n",
"            'flops': flops,\n",
"            'latency_ms': latency_ms,\n",
"            **memory_stats,\n",
"\n",
"            # Derived metrics\n",
"            'gflops_per_second': gflops_per_second,\n",
"            'memory_bandwidth_mbs': memory_bandwidth,\n",
"            'computational_efficiency': computational_efficiency,\n",
"\n",
"            # Bottleneck analysis\n",
"            'is_memory_bound': is_memory_bound,\n",
"            'is_compute_bound': is_compute_bound,\n",
"            'bottleneck': 'memory' if is_memory_bound else 'compute'\n",
"        }\n",
"        ### END SOLUTION\n",
"\n",
"    def profile_backward_pass(self, model, input_tensor, _loss_fn=None) -> Dict[str, Any]:\n",
"        \"\"\"\n",
"        Profile both forward and backward passes for training analysis.\n",
"\n",
"        TODO: Implement training-focused profiling\n",
"\n",
"        APPROACH:\n",
"        1. Profile forward pass first\n",
"        2. Estimate backward pass costs (typically 2× forward)\n",
"        3. Calculate total training iteration metrics\n",
"        4. Analyze memory requirements for gradients and optimizers\n",
"\n",
"        BACKWARD PASS ESTIMATES:\n",
"        - FLOPs: ~2× forward pass (gradient computation)\n",
"        - Memory: +1× parameters (gradient storage)\n",
"        - Latency: ~2× forward pass (more complex operations)\n",
"\n",
"        EXAMPLE:\n",
"        >>> model = Linear(128, 64)\n",
"        >>> input_data = Tensor(np.random.randn(16, 128))\n",
"        >>> profiler = Profiler()\n",
"        >>> profile = profiler.profile_backward_pass(model, input_data)\n",
"        >>> print(f\"Training iteration: {profile['total_latency_ms']:.2f} ms\")\n",
"        Training iteration: 0.45 ms\n",
"\n",
"        HINTS:\n",
"        - Total memory = parameters + activations + gradients\n",
"        - Optimizer memory depends on algorithm (SGD: 0×, Adam: 2×)\n",
"        - Consider gradient accumulation effects\n",
"        \"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        # Get forward pass profile\n",
"        forward_profile = self.profile_forward_pass(model, input_tensor)\n",
"\n",
"        # Estimate backward pass (typically 2× forward)\n",
"        backward_flops = forward_profile['flops'] * 2\n",
"        backward_latency_ms = forward_profile['latency_ms'] * 2\n",
"\n",
"        # Gradient memory (equal to parameter memory)\n",
"        gradient_memory_mb = forward_profile['parameter_memory_mb']\n",
"\n",
"        # Total training iteration\n",
"        total_flops = forward_profile['flops'] + backward_flops\n",
"        total_latency_ms = forward_profile['latency_ms'] + backward_latency_ms\n",
"        total_memory_mb = (forward_profile['parameter_memory_mb'] +\n",
"                           forward_profile['activation_memory_mb'] +\n",
"                           gradient_memory_mb)\n",
"\n",
"        # Training efficiency\n",
"        total_gflops_per_second = (total_flops / 1e9) / (total_latency_ms / 1000.0)\n",
"\n",
"        # Optimizer memory estimates\n",
"        optimizer_memory_estimates = {\n",
"            'sgd': 0,  # No extra memory\n",
"            'adam': gradient_memory_mb * 2,  # Momentum + velocity\n",
"            'adamw': gradient_memory_mb * 2,  # Same as Adam\n",
"        }\n",
"\n",
"        return {\n",
"            # Forward pass\n",
"            'forward_flops': forward_profile['flops'],\n",
"            'forward_latency_ms': forward_profile['latency_ms'],\n",
"            'forward_memory_mb': forward_profile['peak_memory_mb'],\n",
"\n",
"            # Backward pass estimates\n",
"            'backward_flops': backward_flops,\n",
"            'backward_latency_ms': backward_latency_ms,\n",
"            'gradient_memory_mb': gradient_memory_mb,\n",
"\n",
"            # Total training iteration\n",
"            'total_flops': total_flops,\n",
"            'total_latency_ms': total_latency_ms,\n",
"            'total_memory_mb': total_memory_mb,\n",
"            'total_gflops_per_second': total_gflops_per_second,\n",
"\n",
"            # Optimizer memory requirements\n",
"            'optimizer_memory_estimates': optimizer_memory_estimates,\n",
"\n",
"            # Training insights\n",
"            'memory_efficiency': forward_profile['memory_efficiency'],\n",
"            'bottleneck': forward_profile['bottleneck']\n",
"        }\n",
"        ### END SOLUTION"
]
},
{
"cell_type": "markdown",
"id": "644d770d",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"## Helper Functions - Quick Profiling Utilities\n",
"\n",
"These helper functions provide simplified interfaces for common profiling tasks.\n",
"They make it easy to quickly profile models and analyze characteristics."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ad647a04",
"metadata": {
"lines_to_next_cell": 1
},
"outputs": [],
"source": [
"#| export\n",
"def quick_profile(model, input_tensor, profiler=None):\n",
"    \"\"\"\n",
"    Quick profiling function for immediate insights.\n",
"\n",
"    Provides a simplified interface for profiling that displays key metrics\n",
"    in a student-friendly format.\n",
"\n",
"    Args:\n",
"        model: Model to profile\n",
"        input_tensor: Input data for profiling\n",
"        profiler: Optional Profiler instance (creates new one if None)\n",
"\n",
"    Returns:\n",
"        dict: Profile results with key metrics\n",
"\n",
"    Example:\n",
"        >>> model = Linear(128, 64)\n",
"        >>> input_data = Tensor(np.random.randn(16, 128))\n",
"        >>> results = quick_profile(model, input_data)\n",
"        >>> # Displays formatted output automatically\n",
"    \"\"\"\n",
"    if profiler is None:\n",
"        profiler = Profiler()\n",
"\n",
"    profile = profiler.profile_forward_pass(model, input_tensor)\n",
"\n",
"    # Display formatted results\n",
"    print(\"🔬 Quick Profile Results:\")\n",
"    print(f\"   Parameters: {profile['parameters']:,}\")\n",
"    print(f\"   FLOPs: {profile['flops']:,}\")\n",
"    print(f\"   Latency: {profile['latency_ms']:.2f} ms\")\n",
"    print(f\"   Memory: {profile['peak_memory_mb']:.2f} MB\")\n",
"    print(f\"   Bottleneck: {profile['bottleneck']}\")\n",
"    print(f\"   Efficiency: {profile['computational_efficiency']*100:.1f}%\")\n",
"\n",
"    return profile\n",
"\n",
"#| export\n",
"def analyze_weight_distribution(model, percentiles=[10, 25, 50, 75, 90]):\n",
"    \"\"\"\n",
"    Analyze weight distribution for compression insights.\n",
"\n",
"    Helps understand which weights are small and might be prunable.\n",
"    Used by Module 17 (Compression) to motivate pruning.\n",
"\n",
"    Args:\n",
"        model: Model to analyze\n",
"        percentiles: List of percentiles to compute\n",
"\n",
"    Returns:\n",
"        dict: Weight distribution statistics\n",
"\n",
"    Example:\n",
"        >>> model = Linear(512, 512)\n",
"        >>> stats = analyze_weight_distribution(model)\n",
"        >>> print(f\"Weights < 0.01: {stats['below_threshold_001']:.1f}%\")\n",
"    \"\"\"\n",
"    # Collect all weights\n",
"    weights = []\n",
"    if hasattr(model, 'parameters'):\n",
"        for param in model.parameters():\n",
"            weights.extend(param.data.flatten().tolist())\n",
"    elif hasattr(model, 'weight'):\n",
"        weights.extend(model.weight.data.flatten().tolist())\n",
"    else:\n",
"        return {'error': 'No weights found'}\n",
"\n",
"    weights = np.array(weights)\n",
"    abs_weights = np.abs(weights)\n",
"\n",
"    # Calculate statistics\n",
"    stats = {\n",
"        'total_weights': len(weights),\n",
"        'mean': float(np.mean(abs_weights)),\n",
"        'std': float(np.std(abs_weights)),\n",
"        'min': float(np.min(abs_weights)),\n",
"        'max': float(np.max(abs_weights)),\n",
"    }\n",
"\n",
"    # Percentile analysis\n",
"    for p in percentiles:\n",
"        stats[f'percentile_{p}'] = float(np.percentile(abs_weights, p))\n",
"\n",
"    # Threshold analysis (useful for pruning)\n",
"    for threshold in [0.001, 0.01, 0.1]:\n",
"        below = np.sum(abs_weights < threshold) / len(weights) * 100\n",
"        stats[f'below_threshold_{str(threshold).replace(\".\", \"\")}'] = below\n",
"\n",
"    return stats"
]
},
{
"cell_type": "markdown",
"id": "68b967c5",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## Parameter Counting - Model Size Analysis\n",
"\n",
"Parameter counting is the foundation of model profiling. Every parameter contributes to memory usage, training time, and model complexity. Let's validate our implementation.\n",
"\n",
"### Why Parameter Counting Matters\n",
"```\n",
"Model Deployment Pipeline:\n",
"Parameters → Memory → Hardware → Cost\n",
"     ↓          ↓         ↓          ↓\n",
"   125M       500MB    8GB GPU   $200/month\n",
"\n",
"Parameter Growth Examples:\n",
"Small:  GPT-2 Small  (124M parameters) → 500MB memory\n",
"Medium: GPT-2 Medium (350M parameters) → 1.4GB memory\n",
"Large:  GPT-2 Large  (774M parameters) → 3.1GB memory\n",
"XL:     GPT-2 XL     (1.5B parameters) → 6.0GB memory\n",
"```\n",
]
},
{
"cell_type": "markdown",
"id": "68a302c1",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"### 🧪 Unit Test: Parameter Counting\n",
"This test validates our parameter counting works correctly for different model types.\n",
"**What we're testing**: Parameter counting accuracy for various architectures\n",
"**Why it matters**: Accurate parameter counts predict memory usage and model complexity\n",
"**Expected**: Correct counts for known model configurations"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9c44b45f",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test_parameter_counting",
"locked": true,
"points": 10
}
},
"outputs": [],
"source": [
"def test_unit_parameter_counting():\n",
"    \"\"\"🔬 Test parameter counting implementation.\"\"\"\n",
"    print(\"🔬 Unit Test: Parameter Counting...\")\n",
"\n",
"    profiler = Profiler()\n",
"\n",
"    # Test 1: Simple model with known parameters\n",
"    class SimpleModel:\n",
"        def __init__(self):\n",
"            self.weight = Tensor(np.random.randn(10, 5))\n",
"            self.bias = Tensor(np.random.randn(5))\n",
"\n",
"        def parameters(self):\n",
"            return [self.weight, self.bias]\n",
"\n",
"    simple_model = SimpleModel()\n",
"    param_count = profiler.count_parameters(simple_model)\n",
"    expected_count = 10 * 5 + 5  # weight + bias\n",
"    assert param_count == expected_count, f\"Expected {expected_count} parameters, got {param_count}\"\n",
"    print(f\"✅ Simple model: {param_count} parameters\")\n",
"\n",
"    # Test 2: Model without parameters\n",
"    class NoParamModel:\n",
"        def __init__(self):\n",
"            pass\n",
"\n",
"    no_param_model = NoParamModel()\n",
"    param_count = profiler.count_parameters(no_param_model)\n",
"    assert param_count == 0, f\"Expected 0 parameters, got {param_count}\"\n",
"    print(f\"✅ No parameter model: {param_count} parameters\")\n",
"\n",
"    # Test 3: Direct tensor (no parameters)\n",
"    test_tensor = Tensor(np.random.randn(2, 3))\n",
"    param_count = profiler.count_parameters(test_tensor)\n",
"    assert param_count == 0, f\"Expected 0 parameters for tensor, got {param_count}\"\n",
"    print(f\"✅ Direct tensor: {param_count} parameters\")\n",
"\n",
"    print(\"✅ Parameter counting works correctly!\")\n",
"\n",
"if __name__ == \"__main__\":\n",
"    test_unit_parameter_counting()"
]
},
{
"cell_type": "markdown",
"id": "fd88f0ff",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## FLOP Counting - Computational Cost Estimation\n",
"\n",
"FLOPs measure the computational work required for model operations. Unlike latency, FLOPs are hardware-independent and help predict compute costs across different systems.\n",
"\n",
"### FLOP Counting Visualization\n",
"```\n",
"Linear Layer FLOP Breakdown:\n",
"Input (batch=32, features=768) × Weight (768, 3072) + Bias (3072)\n",
"                              ↓\n",
"Matrix Multiplication: 32 × 768 × 3072 × 2 = 150,994,944 FLOPs\n",
"Bias Addition:         32 × 3072 × 1      =      98,304 FLOPs\n",
"                              ↓\n",
"Total FLOPs: 151,093,248 FLOPs\n",
"\n",
"Convolution FLOP Breakdown:\n",
"Input (batch=1, channels=3, H=224, W=224)\n",
"Kernel (out=64, in=3, kH=7, kW=7)\n",
"                              ↓\n",
"Output size: (224×224) → (112×112) with stride=2\n",
"FLOPs = 112 × 112 × 7 × 7 × 3 × 64 × 2 = 236,027,904 FLOPs\n",
"```\n",
"\n",
"### FLOP Counting Strategy\n",
"Different operations require different FLOP calculations (a checking sketch follows this list):\n",
"- **Matrix operations**: M × N × K × 2 (multiply + add)\n",
"- **Convolutions**: Output spatial × kernel spatial × channels\n",
"- **Activations**: Usually 1 FLOP per element\n",
]
},
{
"cell_type": "markdown",
"id": "e6311a0a",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"### 🧪 Unit Test: FLOP Counting\n",
"This test validates our FLOP counting for different operations and architectures.\n",
"**What we're testing**: FLOP calculation accuracy for various layer types\n",
"**Why it matters**: FLOPs predict computational cost and energy usage\n",
"**Expected**: Correct FLOP counts for known operation types"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8919b41a",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test_flop_counting",
"locked": true,
"points": 10
}
},
"outputs": [],
"source": [
"def test_unit_flop_counting():\n",
"    \"\"\"🔬 Test FLOP counting implementation.\"\"\"\n",
"    print(\"🔬 Unit Test: FLOP Counting...\")\n",
"\n",
"    profiler = Profiler()\n",
"\n",
"    # Test 1: Simple tensor operations\n",
"    test_tensor = Tensor(np.random.randn(4, 8))\n",
"    flops = profiler.count_flops(test_tensor, (4, 8))\n",
"    expected_flops = 4 * 8  # 1 FLOP per element for generic operation\n",
"    assert flops == expected_flops, f\"Expected {expected_flops} FLOPs, got {flops}\"\n",
"    print(f\"✅ Tensor operation: {flops} FLOPs\")\n",
"\n",
"    # Test 2: Simulated Linear layer\n",
"    class MockLinear:\n",
"        def __init__(self, in_features, out_features):\n",
"            self.weight = Tensor(np.random.randn(in_features, out_features))\n",
"            self.__class__.__name__ = 'Linear'\n",
"\n",
"    mock_linear = MockLinear(128, 64)\n",
"    flops = profiler.count_flops(mock_linear, (1, 128))\n",
"    expected_flops = 128 * 64 * 2  # matmul FLOPs\n",
"    assert flops == expected_flops, f\"Expected {expected_flops} FLOPs, got {flops}\"\n",
"    print(f\"✅ Linear layer: {flops} FLOPs\")\n",
"\n",
"    # Test 3: Batch size independence\n",
"    flops_batch1 = profiler.count_flops(mock_linear, (1, 128))\n",
"    flops_batch32 = profiler.count_flops(mock_linear, (32, 128))\n",
"    assert flops_batch1 == flops_batch32, \"FLOPs should be independent of batch size\"\n",
"    print(f\"✅ Batch independence: {flops_batch1} FLOPs (same for batch 1 and 32)\")\n",
"\n",
"    print(\"✅ FLOP counting works correctly!\")\n",
"\n",
"if __name__ == \"__main__\":\n",
"    test_unit_flop_counting()"
]
},
{
"cell_type": "markdown",
"id": "9a1d06f7",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## Memory Profiling - Understanding Memory Usage Patterns\n",
"\n",
"Memory profiling reveals how much RAM your model consumes during training and inference. This is critical for deployment planning and optimization.\n",
"\n",
"### Memory Usage Breakdown\n",
"```\n",
"ML Model Memory Components:\n",
"┌───────────────────────────────────────────────────┐\n",
"│                   Total Memory                    │\n",
"├─────────────────┬─────────────────┬───────────────┤\n",
"│   Parameters    │   Activations   │   Gradients   │\n",
"│  (persistent)   │  (per forward)  │ (per backward)│\n",
"├─────────────────┼─────────────────┼───────────────┤\n",
"│ Linear weights  │ Hidden states   │ ∂L/∂W         │\n",
"│ Conv filters    │ Attention maps  │ ∂L/∂b         │\n",
"│ Embeddings      │ Residual cache  │ Optimizer     │\n",
"└─────────────────┴─────────────────┴───────────────┘\n",
"\n",
"Memory Scaling:\n",
"Batch Size      → Activation Memory (linear scaling)\n",
"Model Size      → Parameter + Gradient Memory (linear scaling)\n",
"Sequence Length → Attention Memory (quadratic scaling!)\n",
"```\n",
"\n",
"### Memory Measurement Strategy\n",
"We use Python's `tracemalloc` to track memory allocations during model execution. This gives us precise measurements of memory usage patterns.\n",
]
},
{
"cell_type": "markdown",
"id": "a1e39372",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"### 🧪 Unit Test: Memory Measurement\n",
"This test validates our memory tracking works correctly and provides useful metrics.\n",
"**What we're testing**: Memory usage measurement and calculation accuracy\n",
"**Why it matters**: Memory constraints often limit model deployment\n",
"**Expected**: Reasonable memory measurements with proper components"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "60ee4331",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test_memory_measurement",
"locked": true,
"points": 10
}
},
"outputs": [],
"source": [
"def test_unit_memory_measurement():\n",
"    \"\"\"🔬 Test memory measurement implementation.\"\"\"\n",
"    print(\"🔬 Unit Test: Memory Measurement...\")\n",
"\n",
"    profiler = Profiler()\n",
"\n",
"    # Test 1: Basic memory measurement\n",
"    test_tensor = Tensor(np.random.randn(10, 20))\n",
"    memory_stats = profiler.measure_memory(test_tensor, (10, 20))\n",
"\n",
"    # Validate dictionary structure\n",
"    required_keys = ['parameter_memory_mb', 'activation_memory_mb', 'peak_memory_mb', 'memory_efficiency']\n",
"    for key in required_keys:\n",
"        assert key in memory_stats, f\"Missing key: {key}\"\n",
"\n",
"    # Validate non-negative values\n",
"    for key in required_keys:\n",
"        assert memory_stats[key] >= 0, f\"{key} should be non-negative, got {memory_stats[key]}\"\n",
"\n",
"    print(f\"✅ Basic measurement: {memory_stats['peak_memory_mb']:.3f} MB peak\")\n",
"\n",
"    # Test 2: Memory scaling with size\n",
"    small_tensor = Tensor(np.random.randn(5, 5))\n",
"    large_tensor = Tensor(np.random.randn(50, 50))\n",
"\n",
"    small_memory = profiler.measure_memory(small_tensor, (5, 5))\n",
"    large_memory = profiler.measure_memory(large_tensor, (50, 50))\n",
"\n",
"    # Larger tensor should use more activation memory\n",
"    assert large_memory['activation_memory_mb'] >= small_memory['activation_memory_mb'], \\\n",
"        \"Larger tensor should use more activation memory\"\n",
"\n",
"    print(f\"✅ Scaling: Small {small_memory['activation_memory_mb']:.3f} MB → Large {large_memory['activation_memory_mb']:.3f} MB\")\n",
"\n",
"    # Test 3: Efficiency bounds\n",
"    assert 0 <= memory_stats['memory_efficiency'] <= 1.0, \\\n",
"        f\"Memory efficiency should be between 0 and 1, got {memory_stats['memory_efficiency']}\"\n",
"\n",
"    print(f\"✅ Efficiency: {memory_stats['memory_efficiency']:.3f} (0-1 range)\")\n",
"\n",
"    print(\"✅ Memory measurement works correctly!\")\n",
"\n",
"if __name__ == \"__main__\":\n",
"    test_unit_memory_measurement()"
]
},
{
"cell_type": "markdown",
"id": "350bdbd3",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## Latency Measurement - Accurate Performance Timing\n",
"\n",
"Latency measurement is the most challenging part of profiling because it's affected by system state, caching, and measurement overhead. We need statistical rigor to get reliable results.\n",
"\n",
"### Latency Measurement Challenges\n",
"```\n",
"Timing Challenges:\n",
"┌─────────────────────────────────────────────────┐\n",
"│                  Time Variance                  │\n",
"├─────────────────┬─────────────────┬─────────────┤\n",
"│  System Noise   │  Cache Effects  │   Thermal   │\n",
"│                 │                 │ Throttling  │\n",
"├─────────────────┼─────────────────┼─────────────┤\n",
"│ Background      │ Cold start vs   │ CPU slows   │\n",
"│ processes       │ warm caches     │ when hot    │\n",
"│ OS scheduling   │ Memory locality │ GPU thermal │\n",
"│ Network I/O     │ Branch predict  │ limits      │\n",
"└─────────────────┴─────────────────┴─────────────┘\n",
"\n",
"Solution: Statistical Approach\n",
"Warmup → Multiple measurements → Robust statistics (median)\n",
"```\n",
"\n",
"### Measurement Protocol\n",
"Our latency measurement follows professional benchmarking practices:\n",
"1. **Warmup runs** to stabilize system state\n",
"2. **Multiple measurements** for statistical significance\n",
"3. **Median calculation** to handle outliers\n",
"4. **Memory cleanup** to prevent contamination"
]
},
{
"cell_type": "markdown",
"id": "f1a0465b",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"### 🧪 Unit Test: Latency Measurement\n",
"This test validates our latency measurement provides consistent and reasonable results.\n",
"**What we're testing**: Timing accuracy and statistical robustness\n",
"**Why it matters**: Latency determines real-world deployment feasibility\n",
"**Expected**: Consistent timing measurements with proper statistical handling"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dcc3cff0",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test_latency_measurement",
"locked": true,
"points": 10
}
},
"outputs": [],
"source": [
"def test_unit_latency_measurement():\n",
"    \"\"\"🔬 Test latency measurement implementation.\"\"\"\n",
"    print(\"🔬 Unit Test: Latency Measurement...\")\n",
"\n",
"    profiler = Profiler()\n",
"\n",
"    # Test 1: Basic latency measurement\n",
"    test_tensor = Tensor(np.random.randn(4, 8))\n",
"    latency = profiler.measure_latency(test_tensor, test_tensor, warmup=2, iterations=5)\n",
"\n",
"    assert latency >= 0, f\"Latency should be non-negative, got {latency}\"\n",
"    assert latency < 1000, f\"Latency seems too high for simple operation: {latency} ms\"\n",
"    print(f\"✅ Basic latency: {latency:.3f} ms\")\n",
"\n",
"    # Test 2: Measurement consistency\n",
"    latencies = []\n",
"    for _ in range(3):\n",
"        lat = profiler.measure_latency(test_tensor, test_tensor, warmup=1, iterations=3)\n",
"        latencies.append(lat)\n",
"\n",
"    # Measurements should be in reasonable range\n",
"    avg_latency = np.mean(latencies)\n",
"    std_latency = np.std(latencies)\n",
"    assert std_latency < avg_latency, \"Standard deviation shouldn't exceed mean for simple operations\"\n",
"    print(f\"✅ Consistency: {avg_latency:.3f} ± {std_latency:.3f} ms\")\n",
"\n",
"    # Test 3: Size scaling\n",
"    small_tensor = Tensor(np.random.randn(2, 2))\n",
"    large_tensor = Tensor(np.random.randn(20, 20))\n",
"\n",
"    small_latency = profiler.measure_latency(small_tensor, small_tensor, warmup=1, iterations=3)\n",
"    large_latency = profiler.measure_latency(large_tensor, large_tensor, warmup=1, iterations=3)\n",
"\n",
"    # Larger operations might take longer (though not guaranteed for simple operations)\n",
"    print(f\"✅ Scaling: Small {small_latency:.3f} ms, Large {large_latency:.3f} ms\")\n",
"\n",
"    print(\"✅ Latency measurement works correctly!\")\n",
"\n",
"if __name__ == \"__main__\":\n",
"    test_unit_latency_measurement()"
]
},
{
"cell_type": "markdown",
"id": "a5d9a959",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 4. Integration: Advanced Profiling Functions\n",
"\n",
"Now let's validate our higher-level profiling functions that combine core measurements into comprehensive analysis tools.\n",
"\n",
"### Advanced Profiling Architecture\n",
"```\n",
"Core Profiler Methods → Advanced Analysis Functions → Optimization Insights\n",
"         ↓                         ↓                           ↓\n",
"count_parameters()      profile_forward_pass()     \"Memory-bound workload\"\n",
"count_flops()           profile_backward_pass()    \"Optimize data movement\"\n",
"measure_memory()        profile_layer()            \"Focus on bandwidth\"\n",
"measure_latency()       benchmark_efficiency()     \"Use quantization\"\n",
"```\n",
"\n",
"### Forward Pass Profiling - Complete Performance Picture\n",
"\n",
"A forward pass profile combines all our measurements to understand model behavior comprehensively. This is essential for optimization decisions.\n",
]
},
{
"cell_type": "markdown",
"id": "791555b9",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"### Backward Pass Profiling - Training Analysis\n",
"\n",
"Training requires both forward and backward passes. The backward pass typically uses 2× the compute and adds gradient memory. Understanding this is crucial for training optimization.\n",
"\n",
"### Training Memory Visualization\n",
"```\n",
"Training Memory Timeline:\n",
"Forward Pass:  [Parameters] + [Activations]\n",
"                     ↓\n",
"Backward Pass: [Parameters] + [Activations] + [Gradients]\n",
"                     ↓\n",
"Optimizer:     [Parameters] + [Gradients] + [Optimizer State]\n",
"\n",
"Memory Examples:\n",
"Model: 125M parameters (500MB)\n",
"Forward:  500MB params + 100MB activations = 600MB\n",
"Backward: 500MB params + 100MB activations + 500MB gradients = 1,100MB\n",
"Adam:     500MB params + 500MB gradients + 1,000MB momentum/velocity = 2,000MB\n",
"\n",
"Total Training Memory: 4× parameter memory!\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "24236272",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"### 🧪 Unit Test: Advanced Profiling Functions\n",
"This test validates our advanced profiling functions provide comprehensive analysis.\n",
"**What we're testing**: Forward and backward pass profiling completeness\n",
"**Why it matters**: Training optimization requires understanding both passes\n",
"**Expected**: Complete profiles with all required metrics and relationships"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1516ed04",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test_advanced_profiling",
"locked": true,
"points": 15
}
},
"outputs": [],
"source": [
"def test_unit_advanced_profiling():\n",
"    \"\"\"🔬 Test advanced profiling functions.\"\"\"\n",
"    print(\"🔬 Unit Test: Advanced Profiling Functions...\")\n",
"\n",
"    # Create profiler and test model\n",
"    profiler = Profiler()\n",
"    test_input = Tensor(np.random.randn(4, 8))\n",
"\n",
"    # Test forward pass profiling\n",
"    forward_profile = profiler.profile_forward_pass(test_input, test_input)\n",
"\n",
"    # Validate forward profile structure\n",
"    required_forward_keys = [\n",
"        'parameters', 'flops', 'latency_ms', 'gflops_per_second',\n",
"        'memory_bandwidth_mbs', 'bottleneck'\n",
"    ]\n",
"\n",
"    for key in required_forward_keys:\n",
"        assert key in forward_profile, f\"Missing key: {key}\"\n",
"\n",
"    assert forward_profile['parameters'] >= 0\n",
"    assert forward_profile['flops'] >= 0\n",
"    assert forward_profile['latency_ms'] >= 0\n",
"    assert forward_profile['gflops_per_second'] >= 0\n",
"\n",
"    print(f\"✅ Forward profiling: {forward_profile['gflops_per_second']:.2f} GFLOP/s\")\n",
"\n",
"    # Test backward pass profiling\n",
"    backward_profile = profiler.profile_backward_pass(test_input, test_input)\n",
"\n",
"    # Validate backward profile structure\n",
"    required_backward_keys = [\n",
"        'forward_flops', 'backward_flops', 'total_flops',\n",
"        'total_latency_ms', 'total_memory_mb', 'optimizer_memory_estimates'\n",
"    ]\n",
"\n",
"    for key in required_backward_keys:\n",
"        assert key in backward_profile, f\"Missing key: {key}\"\n",
"\n",
"    # Validate relationships\n",
"    assert backward_profile['total_flops'] >= backward_profile['forward_flops']\n",
"    assert backward_profile['total_latency_ms'] >= backward_profile['forward_latency_ms']\n",
"    assert 'sgd' in backward_profile['optimizer_memory_estimates']\n",
"    assert 'adam' in backward_profile['optimizer_memory_estimates']\n",
"\n",
"    # Check backward pass estimates are reasonable\n",
"    assert backward_profile['backward_flops'] >= backward_profile['forward_flops'], \\\n",
"        \"Backward pass should have at least as many FLOPs as forward\"\n",
"    assert backward_profile['gradient_memory_mb'] >= 0, \\\n",
"        \"Gradient memory should be non-negative\"\n",
"\n",
"    print(f\"✅ Backward profiling: {backward_profile['total_latency_ms']:.2f} ms total\")\n",
"    print(f\"✅ Memory breakdown: {backward_profile['total_memory_mb']:.2f} MB training\")\n",
"    print(\"✅ Advanced profiling functions work correctly!\")\n",
"\n",
"if __name__ == \"__main__\":\n",
"    test_unit_advanced_profiling()"
]
},
|
||
{
"cell_type": "markdown",
"id": "b52a9046",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"## 5. Systems Analysis: Understanding Performance Characteristics\n",
"\n",
"Let's analyze how different model characteristics affect performance. This analysis guides optimization decisions and helps identify bottlenecks.\n",
"\n",
"### Performance Analysis Workflow\n",
"```\n",
"Model Scaling Analysis:\n",
"Size → Memory → Latency → Throughput  → Bottleneck Identification\n",
"  ↓       ↓        ↓          ↓               ↓\n",
"  64     1MB     0.1ms    10K ops/s      Memory bound\n",
" 128     4MB     0.2ms     8K ops/s      Memory bound\n",
" 256    16MB     0.5ms     4K ops/s      Memory bound\n",
" 512    64MB     2.0ms     1K ops/s      Memory bound\n",
"\n",
"Insight: This workload is memory-bound → Optimize data movement, not compute!\n",
"```"
]
},
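{
"cell_type": "markdown",
"id": "b7d2e4a9",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"One way to make the memory-vs-compute call quantitative is arithmetic intensity: FLOPs per byte of data moved. A minimal sketch, assuming FP32 tensors and an illustrative machine balance (the 50 FLOPs/byte figure is a made-up example, not a measured value):\n",
"\n",
"```python\n",
"# Linear layer y = x @ W with batch b and width n (FP32 = 4 bytes/element)\n",
"b, n = 32, 256\n",
"flops = 2 * b * n * n                      # multiply + add per weight per sample\n",
"bytes_moved = 4 * (b * n + n * n + b * n)  # read x, read W, write y\n",
"intensity = flops / bytes_moved            # FLOPs per byte\n",
"\n",
"machine_balance = 50  # illustrative peak-FLOPs / peak-bandwidth ratio\n",
"verdict = \"compute-bound\" if intensity > machine_balance else \"memory-bound\"\n",
"print(f\"{intensity:.1f} FLOPs/byte -> {verdict}\")  # ~12.8 -> memory-bound\n",
"```"
]
},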
|
||
{
"cell_type": "code",
"execution_count": null,
"id": "331e282f",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "performance_analysis",
"solution": true
}
},
"outputs": [],
"source": [
"def analyze_model_scaling():\n",
"    \"\"\"📊 Analyze how model performance scales with size.\"\"\"\n",
"    print(\"📊 Analyzing Model Scaling Characteristics...\")\n",
"\n",
"    profiler = Profiler()\n",
"    results = []\n",
"\n",
"    # Test different model sizes\n",
"    sizes = [64, 128, 256, 512]\n",
"\n",
"    print(\"\\nModel Scaling Analysis:\")\n",
"    print(\"Size\\tParams\\t\\tFLOPs\\t\\tLatency(ms)\\tMemory(MB)\\tGFLOP/s\")\n",
"    print(\"-\" * 80)\n",
"\n",
"    for size in sizes:\n",
"        # Create models of different sizes for comparison\n",
"        input_shape = (32, size)  # Batch of 32\n",
"        dummy_input = Tensor(np.random.randn(*input_shape))\n",
"\n",
"        # Simulate linear layer characteristics\n",
"        linear_params = size * size + size  # W + b\n",
"        linear_flops = 32 * size * size * 2  # batch × (2·size² matmul FLOPs)\n",
"\n",
"        # Measure actual performance\n",
"        latency = profiler.measure_latency(dummy_input, dummy_input, warmup=3, iterations=10)\n",
"        memory = profiler.measure_memory(dummy_input, input_shape)\n",
"\n",
"        gflops_per_second = (linear_flops / 1e9) / (latency / 1000)\n",
"\n",
"        results.append({\n",
"            'size': size,\n",
"            'parameters': linear_params,\n",
"            'flops': linear_flops,\n",
"            'latency_ms': latency,\n",
"            'memory_mb': memory['peak_memory_mb'],\n",
"            'gflops_per_second': gflops_per_second\n",
"        })\n",
"\n",
"        print(f\"{size}\\t{linear_params:,}\\t\\t{linear_flops:,}\\t\\t\"\n",
"              f\"{latency:.2f}\\t\\t{memory['peak_memory_mb']:.2f}\\t\\t\"\n",
"              f\"{gflops_per_second:.2f}\")\n",
"\n",
"    # Analysis insights\n",
"    print(\"\\n💡 Scaling Analysis Insights:\")\n",
"\n",
"    # Memory scaling\n",
"    memory_growth = results[-1]['memory_mb'] / max(results[0]['memory_mb'], 0.001)\n",
"    print(f\"Memory grows {memory_growth:.1f}× from size {sizes[0]} to size {sizes[-1]}\")\n",
"\n",
"    # Compute scaling\n",
"    compute_growth = results[-1]['gflops_per_second'] / max(results[0]['gflops_per_second'], 0.001)\n",
"    print(f\"Compute efficiency changes {compute_growth:.1f}× with size\")\n",
"\n",
"    # Performance characteristics\n",
"    avg_efficiency = np.mean([r['gflops_per_second'] for r in results])\n",
"    if avg_efficiency < 10:  # Arbitrary threshold for \"low\" efficiency\n",
"        print(\"🚀 Low compute efficiency suggests memory-bound workload\")\n",
"    else:\n",
"        print(\"🚀 High compute efficiency suggests compute-bound workload\")\n",
"\n",
"def analyze_batch_size_effects():\n",
"    \"\"\"📊 Analyze how batch size affects performance and efficiency.\"\"\"\n",
"    print(\"\\n📊 Analyzing Batch Size Effects...\")\n",
"\n",
"    profiler = Profiler()\n",
"    batch_sizes = [1, 8, 32, 128]\n",
"    feature_size = 256\n",
"\n",
"    print(\"\\nBatch Size Effects Analysis:\")\n",
"    print(\"Batch\\tLatency(ms)\\tThroughput(samples/s)\\tMemory(MB)\\tMemory Efficiency\")\n",
"    print(\"-\" * 85)\n",
"\n",
"    for batch_size in batch_sizes:\n",
"        input_shape = (batch_size, feature_size)\n",
"        dummy_input = Tensor(np.random.randn(*input_shape))\n",
"\n",
"        # Measure performance\n",
"        latency = profiler.measure_latency(dummy_input, dummy_input, warmup=3, iterations=10)\n",
"        memory = profiler.measure_memory(dummy_input, input_shape)\n",
"\n",
"        # Calculate throughput (latency is in ms, so scale by 1000)\n",
"        samples_per_second = (batch_size * 1000) / latency\n",
"\n",
"        # Calculate efficiency (samples per unit memory)\n",
"        efficiency = samples_per_second / max(memory['peak_memory_mb'], 0.001)\n",
"\n",
"        print(f\"{batch_size}\\t{latency:.2f}\\t\\t{samples_per_second:.0f}\\t\\t\\t\"\n",
"              f\"{memory['peak_memory_mb']:.2f}\\t\\t{efficiency:.1f}\")\n",
"\n",
"    print(\"\\n💡 Batch Size Insights:\")\n",
"    print(\"Larger batches typically improve throughput but increase memory usage\")\n",
"\n",
"# Run the analysis\n",
"if __name__ == \"__main__\":\n",
"    analyze_model_scaling()\n",
"    analyze_batch_size_effects()"
]
},
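{
"cell_type": "markdown",
"id": "c1f8a2b3",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"A note on the numbers this analysis prints: for the simulated linear layer, parameters scale as size² + size ≈ size², so widening from 64 to 512 (8× wider) multiplies parameters by roughly 64×. A quick check of the formula used above:\n",
"\n",
"```python\n",
"# Recompute linear_params for the sizes profiled above\n",
"for size in [64, 128, 256, 512]:\n",
"    print(size, size * size + size)  # 4,160 -> 262,656 (~64× growth)\n",
"```\n",
"\n",
"Latency usually grows more slowly than FLOPs at small sizes because fixed per-call overheads dominate, which is why measured GFLOP/s often improves as the model gets wider."
]
},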
|
||
{
"cell_type": "markdown",
"id": "08957c5b",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"## 6. Optimization Insights: Production Performance Patterns\n",
"\n",
"Understanding profiling results helps guide optimization decisions. Let's analyze different operation types and measurement overhead.\n",
"\n",
"### Operation Efficiency Analysis\n",
"```\n",
"Operation Types and Their Characteristics:\n",
"┌─────────────────┬──────────────────┬──────────────────┬─────────────────┐\n",
"│ Operation       │ Compute/Memory   │ Optimization     │ Priority        │\n",
"├─────────────────┼──────────────────┼──────────────────┼─────────────────┤\n",
"│ Matrix Multiply │ Compute-bound    │ BLAS libraries   │ High            │\n",
"│ Elementwise     │ Memory-bound     │ Data locality    │ Medium          │\n",
"│ Reductions      │ Memory-bound     │ Parallelization  │ Medium          │\n",
"│ Attention       │ Memory-bound     │ FlashAttention   │ High            │\n",
"└─────────────────┴──────────────────┴──────────────────┴─────────────────┘\n",
"\n",
"Optimization Strategy:\n",
"1. Profile first → Identify bottlenecks\n",
"2. Focus on compute-bound ops → Algorithmic improvements\n",
"3. Focus on memory-bound ops → Data movement optimization\n",
"4. Measure again → Verify improvements (see the sketch below)\n",
"```"
]
},
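{
"cell_type": "markdown",
"id": "d4e6b8c0",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"Step 4 deserves its own habit: never trust an optimization you have not re-measured. A minimal before/after sketch using the `measure_latency` interface from this module (`model`, `optimized_model`, and `sample_input` are placeholders for your own objects):\n",
"\n",
"```python\n",
"profiler = Profiler()\n",
"\n",
"before = profiler.measure_latency(model, sample_input, warmup=3, iterations=10)\n",
"# ... apply an optimization (e.g., better data layout, fused ops) ...\n",
"after = profiler.measure_latency(optimized_model, sample_input, warmup=3, iterations=10)\n",
"\n",
"speedup = before / max(after, 1e-9)\n",
"print(f\"{before:.2f} ms -> {after:.2f} ms ({speedup:.2f}× speedup)\")\n",
"```"
]
},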
|
||
{
"cell_type": "code",
"execution_count": null,
"id": "750be525",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "optimization_insights",
"solution": true
}
},
"outputs": [],
"source": [
"def benchmark_operation_efficiency():\n",
"    \"\"\"📊 Compare efficiency of different operations for optimization guidance.\"\"\"\n",
"    print(\"📊 Benchmarking Operation Efficiency...\")\n",
"\n",
"    profiler = Profiler()\n",
"    operations = []\n",
"\n",
"    # Test different operation types\n",
"    size = 256\n",
"    input_tensor = Tensor(np.random.randn(32, size))\n",
"\n",
"    # Elementwise operations (memory-bound)\n",
"    elementwise_latency = profiler.measure_latency(input_tensor, input_tensor, iterations=20)\n",
"    elementwise_flops = size * 32  # One operation per element\n",
"\n",
"    operations.append({\n",
"        'operation': 'Elementwise',\n",
"        'latency_ms': elementwise_latency,\n",
"        'flops': elementwise_flops,\n",
"        'gflops_per_second': (elementwise_flops / 1e9) / (elementwise_latency / 1000),\n",
"        'efficiency_class': 'memory-bound',\n",
"        'optimization_focus': 'data_locality'\n",
"    })\n",
"\n",
"    # Matrix operations (compute-bound)\n",
"    matrix_tensor = Tensor(np.random.randn(size, size))\n",
"    matrix_latency = profiler.measure_latency(matrix_tensor, input_tensor, iterations=10)\n",
"    matrix_flops = 32 * size * size * 2  # (32×size)·(size×size) matmul: 2 FLOPs per MAC\n",
"\n",
"    operations.append({\n",
"        'operation': 'Matrix Multiply',\n",
"        'latency_ms': matrix_latency,\n",
"        'flops': matrix_flops,\n",
"        'gflops_per_second': (matrix_flops / 1e9) / (matrix_latency / 1000),\n",
"        'efficiency_class': 'compute-bound',\n",
"        'optimization_focus': 'algorithms'\n",
"    })\n",
"\n",
"    # Reduction operations (memory-bound)\n",
"    reduction_latency = profiler.measure_latency(input_tensor, input_tensor, iterations=20)\n",
"    reduction_flops = size * 32  # Sum reduction\n",
"\n",
"    operations.append({\n",
"        'operation': 'Reduction',\n",
"        'latency_ms': reduction_latency,\n",
"        'flops': reduction_flops,\n",
"        'gflops_per_second': (reduction_flops / 1e9) / (reduction_latency / 1000),\n",
"        'efficiency_class': 'memory-bound',\n",
"        'optimization_focus': 'parallelization'\n",
"    })\n",
"\n",
"    print(\"\\nOperation Efficiency Comparison:\")\n",
"    print(\"Operation\\t\\tLatency(ms)\\tGFLOP/s\\t\\tEfficiency Class\\tOptimization Focus\")\n",
"    print(\"-\" * 95)\n",
"\n",
"    for op in operations:\n",
"        print(f\"{op['operation']:<15}\\t{op['latency_ms']:.3f}\\t\\t\"\n",
"              f\"{op['gflops_per_second']:.2f}\\t\\t{op['efficiency_class']:<15}\\t{op['optimization_focus']}\")\n",
"\n",
"    print(\"\\n💡 Operation Optimization Insights:\")\n",
"\n",
"    # Find most and least efficient\n",
"    best_op = max(operations, key=lambda x: x['gflops_per_second'])\n",
"    worst_op = min(operations, key=lambda x: x['gflops_per_second'])\n",
"\n",
"    print(f\"Most efficient: {best_op['operation']} ({best_op['gflops_per_second']:.2f} GFLOP/s)\")\n",
"    print(f\"Least efficient: {worst_op['operation']} ({worst_op['gflops_per_second']:.2f} GFLOP/s)\")\n",
"\n",
"    # Count operation types\n",
"    memory_bound_ops = [op for op in operations if op['efficiency_class'] == 'memory-bound']\n",
"    compute_bound_ops = [op for op in operations if op['efficiency_class'] == 'compute-bound']\n",
"\n",
"    print(f\"\\n🚀 Optimization Priority:\")\n",
"    if len(memory_bound_ops) > len(compute_bound_ops):\n",
"        print(\"Focus on memory optimization: data locality, bandwidth, caching\")\n",
"    else:\n",
"        print(\"Focus on compute optimization: better algorithms, vectorization\")\n",
"\n",
"def analyze_profiling_overhead():\n",
"    \"\"\"📊 Measure the overhead of profiling itself.\"\"\"\n",
"    print(\"\\n📊 Analyzing Profiling Overhead...\")\n",
"\n",
"    # Test with and without profiling\n",
"    test_tensor = Tensor(np.random.randn(100, 100))\n",
"    iterations = 50\n",
"\n",
"    # Without profiling - baseline measurement\n",
"    start_time = time.perf_counter()\n",
"    for _ in range(iterations):\n",
"        _ = test_tensor.data.copy()  # Simple operation\n",
"    end_time = time.perf_counter()\n",
"    baseline_ms = (end_time - start_time) * 1000\n",
"\n",
"    # With profiling - includes measurement overhead\n",
"    profiler = Profiler()\n",
"    start_time = time.perf_counter()\n",
"    for _ in range(iterations):\n",
"        _ = profiler.measure_latency(test_tensor, test_tensor, warmup=1, iterations=1)\n",
"    end_time = time.perf_counter()\n",
"    profiled_ms = (end_time - start_time) * 1000\n",
"\n",
"    overhead_factor = profiled_ms / max(baseline_ms, 0.001)\n",
"\n",
"    print(f\"\\nProfiling Overhead Analysis:\")\n",
"    print(f\"Baseline execution: {baseline_ms:.2f} ms\")\n",
"    print(f\"With profiling: {profiled_ms:.2f} ms\")\n",
"    print(f\"Profiling overhead: {overhead_factor:.1f}× slower\")\n",
"\n",
"    print(f\"\\n💡 Profiling Overhead Insights:\")\n",
"    if overhead_factor < 2:\n",
"        print(\"Low overhead - suitable for frequent profiling\")\n",
"    elif overhead_factor < 10:\n",
"        print(\"Moderate overhead - use for development and debugging\")\n",
"    else:\n",
"        print(\"High overhead - use sparingly in production\")\n",
"\n",
"# Run optimization analysis\n",
"if __name__ == \"__main__\":\n",
"    benchmark_operation_efficiency()\n",
"    analyze_profiling_overhead()"
]
},
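{
"cell_type": "markdown",
"id": "e2a7c5d1",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"Because individual timings are noisy (and profiling adds its own overhead), repeated measurements with a robust summary statistic give more trustworthy numbers. A minimal sketch, assuming the same `measure_latency` interface used above:\n",
"\n",
"```python\n",
"test_tensor = Tensor(np.random.randn(100, 100))\n",
"profiler = Profiler()\n",
"\n",
"# Repeat the measurement and summarize the distribution, not a single run\n",
"runs = [profiler.measure_latency(test_tensor, test_tensor, warmup=1, iterations=5)\n",
"        for _ in range(20)]\n",
"\n",
"print(f\"median: {np.median(runs):.3f} ms, \"\n",
"      f\"p95: {np.percentile(runs, 95):.3f} ms, \"\n",
"      f\"std: {np.std(runs):.3f} ms\")\n",
"```"
]
},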
|
||
{
"cell_type": "markdown",
"id": "a170135d",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"## 🧪 Module Integration Test\n",
"\n",
"Final validation that everything works together correctly."
]
},
|
||
{
"cell_type": "code",
"execution_count": null,
"id": "379ab83a",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test_module",
"locked": true,
"points": 20
}
},
"outputs": [],
"source": [
"def test_module():\n",
"    \"\"\"\n",
"    Comprehensive test of entire profiling module functionality.\n",
"\n",
"    This final test runs before the module summary to ensure:\n",
"    - All unit tests pass\n",
"    - Functions work together correctly\n",
"    - Module is ready for integration with TinyTorch\n",
"    \"\"\"\n",
"    print(\"🧪 RUNNING MODULE INTEGRATION TEST\")\n",
"    print(\"=\" * 50)\n",
"\n",
"    # Run all unit tests\n",
"    print(\"Running unit tests...\")\n",
"    test_unit_parameter_counting()\n",
"    test_unit_flop_counting()\n",
"    test_unit_memory_measurement()\n",
"    test_unit_latency_measurement()\n",
"    test_unit_advanced_profiling()\n",
"\n",
"    print(\"\\nRunning integration scenarios...\")\n",
"\n",
"    # Test realistic usage patterns\n",
"    print(\"🔬 Integration Test: Complete Profiling Workflow...\")\n",
"\n",
"    # Create profiler\n",
"    profiler = Profiler()\n",
"\n",
"    # Create test model and data\n",
"    test_model = Tensor(np.random.randn(16, 32))\n",
"    test_input = Tensor(np.random.randn(8, 16))\n",
"\n",
"    # Run complete profiling workflow\n",
"    print(\"1. Measuring model characteristics...\")\n",
"    params = profiler.count_parameters(test_model)\n",
"    flops = profiler.count_flops(test_model, test_input.shape)\n",
"    memory = profiler.measure_memory(test_model, test_input.shape)\n",
"    latency = profiler.measure_latency(test_model, test_input, warmup=2, iterations=5)\n",
"\n",
"    print(f\"   Parameters: {params}\")\n",
"    print(f\"   FLOPs: {flops}\")\n",
"    print(f\"   Memory: {memory['peak_memory_mb']:.2f} MB\")\n",
"    print(f\"   Latency: {latency:.2f} ms\")\n",
"\n",
"    # Test advanced profiling\n",
"    print(\"2. Running advanced profiling...\")\n",
"    forward_profile = profiler.profile_forward_pass(test_model, test_input)\n",
"    backward_profile = profiler.profile_backward_pass(test_model, test_input)\n",
"\n",
"    assert 'gflops_per_second' in forward_profile\n",
"    assert 'total_latency_ms' in backward_profile\n",
"    print(f\"   Forward GFLOP/s: {forward_profile['gflops_per_second']:.2f}\")\n",
"    print(f\"   Training latency: {backward_profile['total_latency_ms']:.2f} ms\")\n",
"\n",
"    # Test bottleneck analysis\n",
"    print(\"3. Analyzing performance bottlenecks...\")\n",
"    bottleneck = forward_profile['bottleneck']\n",
"    efficiency = forward_profile['computational_efficiency']\n",
"    print(f\"   Bottleneck: {bottleneck}\")\n",
"    print(f\"   Compute efficiency: {efficiency:.3f}\")\n",
"\n",
"    # Validate end-to-end workflow\n",
"    assert params >= 0, \"Parameter count should be non-negative\"\n",
"    assert flops >= 0, \"FLOP count should be non-negative\"\n",
"    assert memory['peak_memory_mb'] >= 0, \"Memory usage should be non-negative\"\n",
"    assert latency >= 0, \"Latency should be non-negative\"\n",
"    assert forward_profile['gflops_per_second'] >= 0, \"GFLOP/s should be non-negative\"\n",
"    assert backward_profile['total_latency_ms'] >= 0, \"Total latency should be non-negative\"\n",
"    assert bottleneck in ['memory', 'compute'], \"Bottleneck should be memory or compute\"\n",
"    assert 0 <= efficiency <= 1, \"Efficiency should be between 0 and 1\"\n",
"\n",
"    print(\"✅ End-to-end profiling workflow works!\")\n",
"\n",
"    # Test production-like scenario\n",
"    print(\"4. Testing production profiling scenario...\")\n",
"\n",
"    # Simulate larger model analysis\n",
"    large_input = Tensor(np.random.randn(32, 512))  # Larger model input\n",
"    large_profile = profiler.profile_forward_pass(large_input, large_input)\n",
"\n",
"    # Verify profile contains optimization insights\n",
"    assert 'bottleneck' in large_profile, \"Profile should identify bottlenecks\"\n",
"    assert 'memory_bandwidth_mbs' in large_profile, \"Profile should measure memory bandwidth\"\n",
"\n",
"    print(f\"   Large model analysis: {large_profile['bottleneck']} bottleneck\")\n",
"    print(f\"   Memory bandwidth: {large_profile['memory_bandwidth_mbs']:.1f} MB/s\")\n",
"\n",
"    print(\"✅ Production profiling scenario works!\")\n",
"\n",
"    print(\"\\n\" + \"=\" * 50)\n",
"    print(\"🎉 ALL TESTS PASSED! Module ready for export.\")\n",
"    print(\"Run: tito module complete 14\")\n",
"\n",
"# Call before module summary\n",
"if __name__ == \"__main__\":\n",
"    test_module()"
]
},
|
||
{
"cell_type": "code",
"execution_count": null,
"id": "6502f689",
"metadata": {},
"outputs": [],
"source": [
"if __name__ == \"__main__\":\n",
"    print(\"🚀 Running Profiling module...\")\n",
"    test_module()\n",
"    print(\"✅ Module validation complete!\")"
]
},
|
||
{
"cell_type": "markdown",
"id": "b4ff25e4",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 🤔 ML Systems Thinking: Performance Measurement\n",
"\n",
"### Question 1: FLOP Analysis\n",
"You implemented a profiler that counts FLOPs for different operations.\n",
"For a Linear layer with 1000 input features and 500 output features:\n",
"- How many FLOPs are required for one forward pass? _____ FLOPs\n",
"- If you process a batch of 32 samples, how does this change the per-sample FLOPs? _____\n",
"\n",
"### Question 2: Memory Scaling\n",
"Your profiler measures memory usage for models and activations.\n",
"A transformer model has 125M parameters (500MB at FP32).\n",
"During training with batch size 16:\n",
"- What's the minimum memory for gradients? _____ MB\n",
"- With Adam optimizer, what's the total memory requirement? _____ MB\n",
"\n",
"### Question 3: Performance Bottlenecks\n",
"You built tools to identify compute vs memory bottlenecks.\n",
"A model achieves 10 GFLOP/s on hardware with 100 GFLOP/s peak:\n",
"- What's the computational efficiency? _____%\n",
"- If doubling batch size doesn't improve GFLOP/s, the bottleneck is likely _____\n",
"\n",
"### Question 4: Profiling Trade-offs\n",
"Your profiler adds measurement overhead to understand performance.\n",
"If profiling adds 5× overhead but reveals a 50% speedup opportunity:\n",
"- Is the profiling cost justified for development? _____\n",
"- When should you disable profiling in production? _____"
]
},
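{
"cell_type": "markdown",
"id": "f9b3d6e4",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"Once you have attempted the questions, a short script can check the arithmetic (a sketch based on the formulas used in this module, not an official answer key):\n",
"\n",
"```python\n",
"# Q1: Linear layer FLOPs = 2 × in_features × out_features (multiply + add)\n",
"print(f\"Q1: {2 * 1000 * 500:,} FLOPs per sample; batching scales total FLOPs,\"\n",
"      \" per-sample FLOPs stay the same\")\n",
"\n",
"# Q2: gradients mirror parameters; Adam adds two state tensors\n",
"print(f\"Q2: 500 MB gradients; {500 + 500 + 2 * 500} MB total (plus activations)\")\n",
"\n",
"# Q3: efficiency = achieved / peak; flat GFLOP/s with larger batches\n",
"# points at a memory bottleneck\n",
"print(f\"Q3: {10 / 100:.0%} computational efficiency\")\n",
"```"
]
},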
|
||
{
"cell_type": "markdown",
"id": "72dec7d6",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 🎯 MODULE SUMMARY: Profiling\n",
"\n",
"Congratulations! You've built a comprehensive profiling system for ML performance analysis!\n",
"\n",
"### Key Accomplishments\n",
"- Built complete Profiler class with parameter, FLOP, memory, and latency measurement\n",
"- Implemented advanced profiling functions for forward and backward pass analysis\n",
"- Discovered performance characteristics through scaling and efficiency analysis\n",
"- Created production-quality measurement tools for optimization guidance\n",
"- All tests pass ✅ (validated by `test_module()`)\n",
"\n",
"### Systems Insights Gained\n",
"- **FLOPs vs Reality**: Theoretical operations don't always predict actual performance\n",
"- **Memory Bottlenecks**: Many ML operations are limited by memory bandwidth, not compute\n",
"- **Batch Size Effects**: Larger batches improve throughput but increase memory requirements\n",
"- **Profiling Overhead**: Measurement tools have costs but enable data-driven optimization\n",
"\n",
"### Production Skills Developed\n",
"- **Performance Detective Work**: Use data, not guesses, to identify bottlenecks\n",
"- **Optimization Prioritization**: Focus efforts on actual bottlenecks, not assumptions\n",
"- **Resource Planning**: Predict memory and compute requirements for deployment\n",
"- **Statistical Rigor**: Handle measurement variance with proper methodology\n",
"\n",
"### Ready for Next Steps\n",
"Your profiling implementation enables the optimization modules (15-18) to make data-driven decisions.\n",
"Export with: `tito module complete 14`\n",
"\n",
"**Next**: Module 15 (Memoization) will use profiling to discover transformer bottlenecks and fix them!"
]
}
|
||
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}