From a272030037e58a5aafa016708865bf8a8e220e26 Mon Sep 17 00:00:00 2001 From: Vijay Janapa Reddi Date: Sun, 9 Nov 2025 13:03:30 -0500 Subject: [PATCH] build: regenerate profiling notebook from updated dev file --- .../source/14_profiling/profiling_dev.ipynb | 1457 ++++++++--------- 1 file changed, 725 insertions(+), 732 deletions(-) diff --git a/modules/source/14_profiling/profiling_dev.ipynb b/modules/source/14_profiling/profiling_dev.ipynb index f08cdde1..c3daf773 100644 --- a/modules/source/14_profiling/profiling_dev.ipynb +++ b/modules/source/14_profiling/profiling_dev.ipynb @@ -2,24 +2,24 @@ "cells": [ { "cell_type": "markdown", - "id": "78d24362", + "id": "55618ade", "metadata": { "cell_marker": "\"\"\"" }, "source": [ - "# Module 15: Profiling - Measuring What Matters in ML Systems\n", + "# Module 14: Profiling - Measuring What Matters in ML Systems\n", "\n", - "Welcome to Module 15! You'll build professional profiling tools to measure model performance and uncover optimization opportunities.\n", + "Welcome to Module 14! 
You'll build professional profiling tools to measure model performance and uncover optimization opportunities.\n",
 "\n",
 "## šŸ”— Prerequisites & Progress\n",
- "**You've Built**: Complete ML stack from tensors to transformers with KV caching\n",
+ "**You've Built**: Complete ML stack from tensors to transformers\n",
 "**You'll Build**: Comprehensive profiling system for parameters, FLOPs, memory, and latency\n",
 "**You'll Enable**: Data-driven optimization decisions and performance analysis\n",
 "\n",
 "**Connection Map**:\n",
 "```\n",
- "All Modules → Profiling → Acceleration (Module 16)\n",
- "(implementations) (measurement) (optimization)\n",
+ "All Modules (01-13) → Profiling (14) → Optimization Techniques (15-18)\n",
+ "(implementations) (measurement) (targeted fixes)\n",
 "```\n",
 "\n",
 "## Learning Objectives\n",
@@ -33,7 +33,7 @@
 "\n",
 "## šŸ“¦ Where This Code Lives in the Final Package\n",
 "\n",
- "**Learning Side:** You work in `modules/15_profiling/profiling_dev.py` \n",
+ "**Learning Side:** You work in `modules/14_profiling/profiling_dev.py`\n",
 "**Building Side:** Code exports to `tinytorch.profiling.profiler`\n",
 "\n",
 "```python\n",
@@ -51,7 +51,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
- "id": "f622ef61",
+ "id": "d92307b1",
 "metadata": {
 "nbgrader": {
 "grade": false,
@@ -79,7 +79,7 @@
 },
 {
 "cell_type": "markdown",
- "id": "ae7455a2",
+ "id": "6e4fb271",
 "metadata": {
 "cell_marker": "\"\"\""
 },
@@ -115,7 +115,40 @@
 },
 {
 "cell_type": "markdown",
- "id": "85ee0680",
+ "id": "ddfa3dfb",
 "metadata": {
 "cell_marker": "\"\"\""
 },
 "source": [
 "### šŸ”— From Measurement to Optimization: Connecting to Modules 15-18\n",
 "\n",
 "**In this module (14)**, you'll learn HOW to discover optimization opportunities.\n",
 "**In Modules 15-18**, you'll implement optimizations like KV caching and see 10-15x speedups.\n",
 "\n",
 "**The Real ML Engineering Workflow**:\n",
 "```\n",
 "Step 1: Measure (This Module!) Step 2: Analyze\n",
 " ↓ ↓\n",
 "Profile baseline → Find bottleneck → Understand cause\n",
 "40 tok/s 80% in attention O(n²) recomputation\n",
 " ↓\n",
 "Step 4: Validate Step 3: Optimize (Modules 15-18)\n",
 " ↓ ↓\n",
 "Profile optimized ← Verify speedup ← Implement KV cache\n",
 "500 tok/s (12.5x) Measure impact Design solution\n",
 "```\n",
 "\n",
 "**Without Module 14's profiling**: You'd never know WHERE to optimize!\n",
 "**Without Modules 15-18's optimizations**: You couldn't FIX the bottleneck!\n",
 "\n",
 "This module teaches the measurement and analysis skills that enable\n",
 "optimization breakthroughs like KV caching. You'll profile real models\n",
 "and discover bottlenecks just like production ML teams do."
 ]
 },
 {
 "cell_type": "markdown",
 "id": "d5a2e470",
 "metadata": {
 "cell_marker": "\"\"\""
 },
@@ -198,7 +231,7 @@
 },
 {
 "cell_type": "markdown",
- "id": "ab8f2347",
+ "id": "c466e14d",
 "metadata": {
 "cell_marker": "\"\"\"",
 "lines_to_next_cell": 1
 },
 "ā”œā”€ā”€ count_parameters() → Model size analysis\n",
 "ā”œā”€ā”€ count_flops() → Computational cost estimation\n",
 "ā”œā”€ā”€ measure_memory() → Memory usage tracking\n",
- "└── measure_latency() → Performance timing\n",
- "\n",
- "Integration Functions\n",
+ "ā”œā”€ā”€ measure_latency() → Performance timing\n",
+ "ā”œā”€ā”€ profile_layer() → Layer-wise analysis\n",
 "ā”œā”€ā”€ profile_forward_pass() → Complete forward analysis\n",
 "└── profile_backward_pass() → Training analysis\n",
+ "\n",
+ "Integration:\n",
+ "All methods work together to provide comprehensive performance insights\n",
 "```"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": null,
- "id": "208a26c8",
+ "id": "31829387",
 "metadata": {
 "lines_to_next_cell": 1,
 "nbgrader": {
@@ -246,25 +281,627 @@
 " \"\"\"\n",
 "\n",
 " def __init__(self):\n",
- " \"\"\"Initialize profiler with measurement state.\"\"\"\n",
 " \"\"\"\n",
 " Initialize profiler with measurement state.\n",
 "\n",
 " TODO: Set up profiler 
tracking structures\n", + "\n", + " APPROACH:\n", + " 1. Create empty measurements dictionary\n", + " 2. Initialize operation counters\n", + " 3. Set up memory tracking state\n", + "\n", + " EXAMPLE:\n", + " >>> profiler = Profiler()\n", + " >>> profiler.measurements\n", + " {}\n", + "\n", + " HINTS:\n", + " - Use defaultdict(int) for operation counters\n", + " - measurements dict will store timing results\n", + " \"\"\"\n", " ### BEGIN SOLUTION\n", " self.measurements = {}\n", " self.operation_counts = defaultdict(int)\n", " self.memory_tracker = None\n", + " ### END SOLUTION\n", + "\n", + " def count_parameters(self, model) -> int:\n", + " \"\"\"\n", + " Count total trainable parameters in a model.\n", + "\n", + " TODO: Implement parameter counting for any model with parameters() method\n", + "\n", + " APPROACH:\n", + " 1. Get all parameters from model.parameters() if available\n", + " 2. For single layers, count weight and bias directly\n", + " 3. Sum total element count across all parameter tensors\n", + "\n", + " EXAMPLE:\n", + " >>> linear = Linear(128, 64) # 128*64 + 64 = 8256 parameters\n", + " >>> profiler = Profiler()\n", + " >>> count = profiler.count_parameters(linear)\n", + " >>> print(count)\n", + " 8256\n", + "\n", + " HINTS:\n", + " - Use parameter.data.size for tensor element count\n", + " - Handle models with and without parameters() method\n", + " - Don't forget bias terms when present\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " total_params = 0\n", + "\n", + " # Handle different model types\n", + " if hasattr(model, 'parameters'):\n", + " # Model with parameters() method (Sequential, custom models)\n", + " for param in model.parameters():\n", + " total_params += param.data.size\n", + " elif hasattr(model, 'weight'):\n", + " # Single layer (Linear, Conv2d)\n", + " total_params += model.weight.data.size\n", + " if hasattr(model, 'bias') and model.bias is not None:\n", + " total_params += model.bias.data.size\n", + " else:\n", + " # No 
parameters (activations, etc.)\n", + " total_params = 0\n", + "\n", + " return total_params\n", + " ### END SOLUTION\n", + "\n", + " def count_flops(self, model, input_shape: Tuple[int, ...]) -> int:\n", + " \"\"\"\n", + " Count FLOPs (Floating Point Operations) for one forward pass.\n", + "\n", + " TODO: Implement FLOP counting for different layer types\n", + "\n", + " APPROACH:\n", + " 1. Create dummy input with given shape\n", + " 2. Calculate FLOPs based on layer type and dimensions\n", + " 3. Handle different model architectures (Linear, Conv2d, Sequential)\n", + "\n", + " LAYER-SPECIFIC FLOP FORMULAS:\n", + " - Linear: input_features Ɨ output_features Ɨ 2 (matmul + bias)\n", + " - Conv2d: output_h Ɨ output_w Ɨ kernel_h Ɨ kernel_w Ɨ in_channels Ɨ out_channels Ɨ 2\n", + " - Activation: Usually 1 FLOP per element (ReLU, Sigmoid)\n", + "\n", + " EXAMPLE:\n", + " >>> linear = Linear(128, 64)\n", + " >>> profiler = Profiler()\n", + " >>> flops = profiler.count_flops(linear, (1, 128))\n", + " >>> print(flops) # 128 * 64 * 2 = 16384\n", + " 16384\n", + "\n", + " HINTS:\n", + " - Batch dimension doesn't affect per-sample FLOPs\n", + " - Focus on major operations (matmul, conv) first\n", + " - For Sequential models, sum FLOPs of all layers\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # Create dummy input (unused but kept for interface consistency)\n", + " _dummy_input = Tensor(np.random.randn(*input_shape))\n", + " total_flops = 0\n", + "\n", + " # Handle different model types\n", + " if hasattr(model, '__class__'):\n", + " model_name = model.__class__.__name__\n", + "\n", + " if model_name == 'Linear':\n", + " # Linear layer: input_features Ɨ output_features Ɨ 2\n", + " in_features = input_shape[-1]\n", + " out_features = model.weight.shape[1] if hasattr(model, 'weight') else 1\n", + " total_flops = in_features * out_features * 2\n", + "\n", + " elif model_name == 'Conv2d':\n", + " # Conv2d layer: complex calculation based on output size\n", + " # Simplified: 
assume we know the output dimensions\n", + " if hasattr(model, 'kernel_size') and hasattr(model, 'in_channels'):\n", + " _batch_size = input_shape[0] if len(input_shape) > 3 else 1\n", + " in_channels = model.in_channels\n", + " out_channels = model.out_channels\n", + " kernel_h = kernel_w = model.kernel_size\n", + "\n", + " # Estimate output size (simplified)\n", + " input_h, input_w = input_shape[-2], input_shape[-1]\n", + " output_h = input_h // (model.stride if hasattr(model, 'stride') else 1)\n", + " output_w = input_w // (model.stride if hasattr(model, 'stride') else 1)\n", + "\n", + " total_flops = (output_h * output_w * kernel_h * kernel_w *\n", + " in_channels * out_channels * 2)\n", + "\n", + " elif model_name == 'Sequential':\n", + " # Sequential model: sum FLOPs of all layers\n", + " current_shape = input_shape\n", + " for layer in model.layers:\n", + " layer_flops = self.count_flops(layer, current_shape)\n", + " total_flops += layer_flops\n", + " # Update shape for next layer (simplified)\n", + " if hasattr(layer, 'weight'):\n", + " current_shape = current_shape[:-1] + (layer.weight.shape[1],)\n", + "\n", + " else:\n", + " # Activation or other: assume 1 FLOP per element\n", + " total_flops = np.prod(input_shape)\n", + "\n", + " return total_flops\n", + " ### END SOLUTION\n", + "\n", + " def measure_memory(self, model, input_shape: Tuple[int, ...]) -> Dict[str, float]:\n", + " \"\"\"\n", + " Measure memory usage during forward pass.\n", + "\n", + " TODO: Implement memory tracking for model execution\n", + "\n", + " APPROACH:\n", + " 1. Use tracemalloc to track memory allocation\n", + " 2. Measure baseline memory before model execution\n", + " 3. Run forward pass and track peak usage\n", + " 4. 
Calculate different memory components\n",
 "\n",
 " RETURN DICTIONARY:\n",
 " - 'parameter_memory_mb': Memory for model parameters\n",
 " - 'activation_memory_mb': Memory for activations\n",
 " - 'peak_memory_mb': Maximum memory usage\n",
 " - 'memory_efficiency': Ratio of useful to total memory\n",
 "\n",
 " EXAMPLE:\n",
 " >>> linear = Linear(1024, 512)\n",
 " >>> profiler = Profiler()\n",
 " >>> memory = profiler.measure_memory(linear, (32, 1024))\n",
 " >>> print(f\"Parameters: {memory['parameter_memory_mb']:.1f} MB\")\n",
 " Parameters: 2.0 MB\n",
 "\n",
 " HINTS:\n",
 " - Use tracemalloc.start() and tracemalloc.get_traced_memory()\n",
 " - Account for float32 = 4 bytes per parameter\n",
 " - Activation memory scales with batch size\n",
 " \"\"\"\n",
 " ### BEGIN SOLUTION\n",
 " # Start memory tracking\n",
 " tracemalloc.start()\n",
 "\n",
 " # Measure baseline memory (unused but kept for completeness)\n",
 " _baseline_memory = tracemalloc.get_traced_memory()[0]\n",
 "\n",
 " # Calculate parameter memory\n",
 " param_count = self.count_parameters(model)\n",
 " parameter_memory_bytes = param_count * 4 # Assume float32\n",
 " parameter_memory_mb = parameter_memory_bytes / (1024 * 1024)\n",
 "\n",
 " # Create input and measure activation memory\n",
 " dummy_input = Tensor(np.random.randn(*input_shape))\n",
 " input_memory_bytes = dummy_input.data.nbytes\n",
 "\n",
 " # Estimate activation memory (simplified)\n",
 " activation_memory_bytes = input_memory_bytes * 2 # Rough estimate\n",
 " activation_memory_mb = activation_memory_bytes / (1024 * 1024)\n",
 "\n",
 " # Try to run forward pass and measure peak\n",
 " try:\n",
 " if hasattr(model, 'forward'):\n",
 " _ = model.forward(dummy_input)\n",
 " elif hasattr(model, '__call__'):\n",
 " _ = model(dummy_input)\n",
 " except Exception:\n",
 " pass # Ignore errors for simplified measurement\n",
 "\n",
 " # Get peak memory\n",
 " _current_memory, peak_memory = 
tracemalloc.get_traced_memory()\n", + " peak_memory_mb = (peak_memory - _baseline_memory) / (1024 * 1024)\n", + "\n", + " tracemalloc.stop()\n", + "\n", + " # Calculate efficiency\n", + " useful_memory = parameter_memory_mb + activation_memory_mb\n", + " memory_efficiency = useful_memory / max(peak_memory_mb, 0.001) # Avoid division by zero\n", + "\n", + " return {\n", + " 'parameter_memory_mb': parameter_memory_mb,\n", + " 'activation_memory_mb': activation_memory_mb,\n", + " 'peak_memory_mb': max(peak_memory_mb, useful_memory),\n", + " 'memory_efficiency': min(memory_efficiency, 1.0)\n", + " }\n", + " ### END SOLUTION\n", + "\n", + " def measure_latency(self, model, input_tensor, warmup: int = 10, iterations: int = 100) -> float:\n", + " \"\"\"\n", + " Measure model inference latency with statistical rigor.\n", + "\n", + " TODO: Implement accurate latency measurement\n", + "\n", + " APPROACH:\n", + " 1. Run warmup iterations to stabilize performance\n", + " 2. Measure multiple iterations for statistical accuracy\n", + " 3. Calculate median latency to handle outliers\n", + " 4. 
Return latency in milliseconds\n",
 "\n",
 " PARAMETERS:\n",
 " - warmup: Number of warmup runs (default 10)\n",
 " - iterations: Number of measurement runs (default 100)\n",
 "\n",
 " EXAMPLE:\n",
 " >>> linear = Linear(128, 64)\n",
 " >>> input_tensor = Tensor(np.random.randn(1, 128))\n",
 " >>> profiler = Profiler()\n",
 " >>> latency = profiler.measure_latency(linear, input_tensor)\n",
 " >>> print(f\"Latency: {latency:.2f} ms\")\n",
 " Latency: 0.15 ms\n",
 "\n",
 " HINTS:\n",
 " - Use time.perf_counter() for high precision\n",
 " - Use median instead of mean for robustness against outliers\n",
 " - Handle different model interfaces (forward, __call__)\n",
 " \"\"\"\n",
 " ### BEGIN SOLUTION\n",
 " # Warmup runs\n",
 " for _ in range(warmup):\n",
 " try:\n",
 " if hasattr(model, 'forward'):\n",
 " _ = model.forward(input_tensor)\n",
 " elif hasattr(model, '__call__'):\n",
 " _ = model(input_tensor)\n",
 " else:\n",
 " # Fallback for simple operations\n",
 " _ = input_tensor\n",
 " except Exception:\n",
 " pass # Ignore errors during warmup\n",
 "\n",
 " # Measurement runs\n",
 " times = []\n",
 " for _ in range(iterations):\n",
 " start_time = time.perf_counter()\n",
 "\n",
 " try:\n",
 " if hasattr(model, 'forward'):\n",
 " _ = model.forward(input_tensor)\n",
 " elif hasattr(model, '__call__'):\n",
 " _ = model(input_tensor)\n",
 " else:\n",
 " # Minimal operation for timing\n",
 " _ = input_tensor.data.copy()\n",
 " except Exception:\n",
 " pass # Ignore errors but still measure time\n",
 "\n",
 " end_time = time.perf_counter()\n",
 " times.append((end_time - start_time) * 1000) # Convert to milliseconds\n",
 "\n",
 " # Calculate statistics - use median for robustness\n",
 " times = np.array(times)\n",
 " median_latency = np.median(times)\n",
 "\n",
 " return float(median_latency)\n",
 " ### END SOLUTION\n",
 "\n",
 " def profile_layer(self, layer, input_shape: Tuple[int, ...]) -> Dict[str, Any]:\n",
 " \"\"\"\n",
 " 
Profile a single layer comprehensively.\n", + "\n", + " TODO: Implement layer-wise profiling\n", + "\n", + " APPROACH:\n", + " 1. Count parameters for this layer\n", + " 2. Count FLOPs for this layer\n", + " 3. Measure memory usage\n", + " 4. Measure latency\n", + " 5. Return comprehensive layer profile\n", + "\n", + " EXAMPLE:\n", + " >>> linear = Linear(256, 128)\n", + " >>> profiler = Profiler()\n", + " >>> profile = profiler.profile_layer(linear, (32, 256))\n", + " >>> print(f\"Layer uses {profile['parameters']} parameters\")\n", + " Layer uses 32896 parameters\n", + "\n", + " HINTS:\n", + " - Use existing profiler methods (count_parameters, count_flops, etc.)\n", + " - Create dummy input for latency measurement\n", + " - Include layer type information in profile\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # Create dummy input for latency measurement\n", + " dummy_input = Tensor(np.random.randn(*input_shape))\n", + "\n", + " # Gather all measurements\n", + " params = self.count_parameters(layer)\n", + " flops = self.count_flops(layer, input_shape)\n", + " memory = self.measure_memory(layer, input_shape)\n", + " latency = self.measure_latency(layer, dummy_input, warmup=3, iterations=10)\n", + "\n", + " # Compute derived metrics\n", + " gflops_per_second = (flops / 1e9) / max(latency / 1000, 1e-6)\n", + "\n", + " return {\n", + " 'layer_type': layer.__class__.__name__,\n", + " 'parameters': params,\n", + " 'flops': flops,\n", + " 'latency_ms': latency,\n", + " 'gflops_per_second': gflops_per_second,\n", + " **memory\n", + " }\n", + " ### END SOLUTION\n", + "\n", + " def profile_forward_pass(self, model, input_tensor) -> Dict[str, Any]:\n", + " \"\"\"\n", + " Comprehensive profiling of a model's forward pass.\n", + "\n", + " TODO: Implement complete forward pass analysis\n", + "\n", + " APPROACH:\n", + " 1. Use Profiler class to gather all measurements\n", + " 2. Create comprehensive performance profile\n", + " 3. Add derived metrics and insights\n", + " 4. 
Return structured analysis results\n", + "\n", + " RETURN METRICS:\n", + " - All basic profiler measurements\n", + " - FLOPs per second (computational efficiency)\n", + " - Memory bandwidth utilization\n", + " - Performance bottleneck identification\n", + "\n", + " EXAMPLE:\n", + " >>> model = Linear(256, 128)\n", + " >>> input_data = Tensor(np.random.randn(32, 256))\n", + " >>> profiler = Profiler()\n", + " >>> profile = profiler.profile_forward_pass(model, input_data)\n", + " >>> print(f\"Throughput: {profile['gflops_per_second']:.2f} GFLOP/s\")\n", + " Throughput: 2.45 GFLOP/s\n", + "\n", + " HINTS:\n", + " - GFLOP/s = (FLOPs / 1e9) / (latency_ms / 1000)\n", + " - Memory bandwidth = memory_mb / (latency_ms / 1000)\n", + " - Consider realistic hardware limits for efficiency calculations\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # Basic measurements\n", + " param_count = self.count_parameters(model)\n", + " flops = self.count_flops(model, input_tensor.shape)\n", + " memory_stats = self.measure_memory(model, input_tensor.shape)\n", + " latency_ms = self.measure_latency(model, input_tensor, warmup=5, iterations=20)\n", + "\n", + " # Derived metrics\n", + " latency_seconds = latency_ms / 1000.0\n", + " gflops_per_second = (flops / 1e9) / max(latency_seconds, 1e-6)\n", + "\n", + " # Memory bandwidth (MB/s)\n", + " memory_bandwidth = memory_stats['peak_memory_mb'] / max(latency_seconds, 1e-6)\n", + "\n", + " # Efficiency metrics\n", + " theoretical_peak_gflops = 100.0 # Assume 100 GFLOP/s theoretical peak for CPU\n", + " computational_efficiency = min(gflops_per_second / theoretical_peak_gflops, 1.0)\n", + "\n", + " # Bottleneck analysis\n", + " is_memory_bound = memory_bandwidth > gflops_per_second * 100 # Rough heuristic\n", + " is_compute_bound = not is_memory_bound\n", + "\n", + " return {\n", + " # Basic measurements\n", + " 'parameters': param_count,\n", + " 'flops': flops,\n", + " 'latency_ms': latency_ms,\n", + " **memory_stats,\n", + "\n", + " # 
Derived metrics\n", + " 'gflops_per_second': gflops_per_second,\n", + " 'memory_bandwidth_mbs': memory_bandwidth,\n", + " 'computational_efficiency': computational_efficiency,\n", + "\n", + " # Bottleneck analysis\n", + " 'is_memory_bound': is_memory_bound,\n", + " 'is_compute_bound': is_compute_bound,\n", + " 'bottleneck': 'memory' if is_memory_bound else 'compute'\n", + " }\n", + " ### END SOLUTION\n", + "\n", + " def profile_backward_pass(self, model, input_tensor, _loss_fn=None) -> Dict[str, Any]:\n", + " \"\"\"\n", + " Profile both forward and backward passes for training analysis.\n", + "\n", + " TODO: Implement training-focused profiling\n", + "\n", + " APPROACH:\n", + " 1. Profile forward pass first\n", + " 2. Estimate backward pass costs (typically 2Ɨ forward)\n", + " 3. Calculate total training iteration metrics\n", + " 4. Analyze memory requirements for gradients and optimizers\n", + "\n", + " BACKWARD PASS ESTIMATES:\n", + " - FLOPs: ~2Ɨ forward pass (gradient computation)\n", + " - Memory: +1Ɨ parameters (gradient storage)\n", + " - Latency: ~2Ɨ forward pass (more complex operations)\n", + "\n", + " EXAMPLE:\n", + " >>> model = Linear(128, 64)\n", + " >>> input_data = Tensor(np.random.randn(16, 128))\n", + " >>> profiler = Profiler()\n", + " >>> profile = profiler.profile_backward_pass(model, input_data)\n", + " >>> print(f\"Training iteration: {profile['total_latency_ms']:.2f} ms\")\n", + " Training iteration: 0.45 ms\n", + "\n", + " HINTS:\n", + " - Total memory = parameters + activations + gradients\n", + " - Optimizer memory depends on algorithm (SGD: 0Ɨ, Adam: 2Ɨ)\n", + " - Consider gradient accumulation effects\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # Get forward pass profile\n", + " forward_profile = self.profile_forward_pass(model, input_tensor)\n", + "\n", + " # Estimate backward pass (typically 2Ɨ forward)\n", + " backward_flops = forward_profile['flops'] * 2\n", + " backward_latency_ms = forward_profile['latency_ms'] * 2\n", + 
"\n", + " # Gradient memory (equal to parameter memory)\n", + " gradient_memory_mb = forward_profile['parameter_memory_mb']\n", + "\n", + " # Total training iteration\n", + " total_flops = forward_profile['flops'] + backward_flops\n", + " total_latency_ms = forward_profile['latency_ms'] + backward_latency_ms\n", + " total_memory_mb = (forward_profile['parameter_memory_mb'] +\n", + " forward_profile['activation_memory_mb'] +\n", + " gradient_memory_mb)\n", + "\n", + " # Training efficiency\n", + " total_gflops_per_second = (total_flops / 1e9) / (total_latency_ms / 1000.0)\n", + "\n", + " # Optimizer memory estimates\n", + " optimizer_memory_estimates = {\n", + " 'sgd': 0, # No extra memory\n", + " 'adam': gradient_memory_mb * 2, # Momentum + velocity\n", + " 'adamw': gradient_memory_mb * 2, # Same as Adam\n", + " }\n", + "\n", + " return {\n", + " # Forward pass\n", + " 'forward_flops': forward_profile['flops'],\n", + " 'forward_latency_ms': forward_profile['latency_ms'],\n", + " 'forward_memory_mb': forward_profile['peak_memory_mb'],\n", + "\n", + " # Backward pass estimates\n", + " 'backward_flops': backward_flops,\n", + " 'backward_latency_ms': backward_latency_ms,\n", + " 'gradient_memory_mb': gradient_memory_mb,\n", + "\n", + " # Total training iteration\n", + " 'total_flops': total_flops,\n", + " 'total_latency_ms': total_latency_ms,\n", + " 'total_memory_mb': total_memory_mb,\n", + " 'total_gflops_per_second': total_gflops_per_second,\n", + "\n", + " # Optimizer memory requirements\n", + " 'optimizer_memory_estimates': optimizer_memory_estimates,\n", + "\n", + " # Training insights\n", + " 'memory_efficiency': forward_profile['memory_efficiency'],\n", + " 'bottleneck': forward_profile['bottleneck']\n", + " }\n", " ### END SOLUTION" ] }, { "cell_type": "markdown", - "id": "463b3b6c", + "id": "644d770d", "metadata": { "cell_marker": "\"\"\"", "lines_to_next_cell": 1 }, + "source": [ + "## Helper Functions - Quick Profiling Utilities\n", + "\n", + "These helper 
functions provide simplified interfaces for common profiling tasks.\n", + "They make it easy to quickly profile models and analyze characteristics." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ad647a04", + "metadata": { + "lines_to_next_cell": 1 + }, + "outputs": [], + "source": [ + "#| export\n", + "def quick_profile(model, input_tensor, profiler=None):\n", + " \"\"\"\n", + " Quick profiling function for immediate insights.\n", + " \n", + " Provides a simplified interface for profiling that displays key metrics\n", + " in a student-friendly format.\n", + " \n", + " Args:\n", + " model: Model to profile\n", + " input_tensor: Input data for profiling\n", + " profiler: Optional Profiler instance (creates new one if None)\n", + " \n", + " Returns:\n", + " dict: Profile results with key metrics\n", + " \n", + " Example:\n", + " >>> model = Linear(128, 64)\n", + " >>> input_data = Tensor(np.random.randn(16, 128))\n", + " >>> results = quick_profile(model, input_data)\n", + " >>> # Displays formatted output automatically\n", + " \"\"\"\n", + " if profiler is None:\n", + " profiler = Profiler()\n", + " \n", + " profile = profiler.profile_forward_pass(model, input_tensor)\n", + " \n", + " # Display formatted results\n", + " print(\"šŸ”¬ Quick Profile Results:\")\n", + " print(f\" Parameters: {profile['parameters']:,}\")\n", + " print(f\" FLOPs: {profile['flops']:,}\")\n", + " print(f\" Latency: {profile['latency_ms']:.2f} ms\")\n", + " print(f\" Memory: {profile['peak_memory_mb']:.2f} MB\")\n", + " print(f\" Bottleneck: {profile['bottleneck']}\")\n", + " print(f\" Efficiency: {profile['computational_efficiency']*100:.1f}%\")\n", + " \n", + " return profile\n", + "\n", + "#| export\n", + "def analyze_weight_distribution(model, percentiles=[10, 25, 50, 75, 90]):\n", + " \"\"\"\n", + " Analyze weight distribution for compression insights.\n", + " \n", + " Helps understand which weights are small and might be prunable.\n", + " Used by Module 17 
(Compression) to motivate pruning.\n", + " \n", + " Args:\n", + " model: Model to analyze\n", + " percentiles: List of percentiles to compute\n", + " \n", + " Returns:\n", + " dict: Weight distribution statistics\n", + " \n", + " Example:\n", + " >>> model = Linear(512, 512)\n", + " >>> stats = analyze_weight_distribution(model)\n", + " >>> print(f\"Weights < 0.01: {stats['below_threshold_001']:.1f}%\")\n", + " \"\"\"\n", + " # Collect all weights\n", + " weights = []\n", + " if hasattr(model, 'parameters'):\n", + " for param in model.parameters():\n", + " weights.extend(param.data.flatten().tolist())\n", + " elif hasattr(model, 'weight'):\n", + " weights.extend(model.weight.data.flatten().tolist())\n", + " else:\n", + " return {'error': 'No weights found'}\n", + " \n", + " weights = np.array(weights)\n", + " abs_weights = np.abs(weights)\n", + " \n", + " # Calculate statistics\n", + " stats = {\n", + " 'total_weights': len(weights),\n", + " 'mean': float(np.mean(abs_weights)),\n", + " 'std': float(np.std(abs_weights)),\n", + " 'min': float(np.min(abs_weights)),\n", + " 'max': float(np.max(abs_weights)),\n", + " }\n", + " \n", + " # Percentile analysis\n", + " for p in percentiles:\n", + " stats[f'percentile_{p}'] = float(np.percentile(abs_weights, p))\n", + " \n", + " # Threshold analysis (useful for pruning)\n", + " for threshold in [0.001, 0.01, 0.1]:\n", + " below = np.sum(abs_weights < threshold) / len(weights) * 100\n", + " stats[f'below_threshold_{str(threshold).replace(\".\", \"\")}'] = below\n", + " \n", + " return stats" + ] + }, + { + "cell_type": "markdown", + "id": "68b967c5", + "metadata": { + "cell_marker": "\"\"\"" + }, "source": [ "## Parameter Counting - Model Size Analysis\n", "\n", - "Parameter counting is the foundation of model profiling. Every parameter contributes to memory usage, training time, and model complexity. 
Let's build a robust parameter counter that handles different model architectures.\n", + "Parameter counting is the foundation of model profiling. Every parameter contributes to memory usage, training time, and model complexity. Let's validate our implementation.\n", "\n", "### Why Parameter Counting Matters\n", "```\n", @@ -278,72 +915,12 @@ "Medium: GPT-2 Medium (350M parameters) → 1.4GB memory\n", "Large: GPT-2 Large (774M parameters) → 3.1GB memory\n", "XL: GPT-2 XL (1.5B parameters) → 6.0GB memory\n", - "```\n", - "\n", - "### Parameter Counting Strategy\n", - "Our parameter counter needs to handle different model types:\n", - "- **Single layers** (Linear, Conv2d) with weight and bias\n", - "- **Sequential models** with multiple layers\n", - "- **Custom models** with parameters() method" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4044095a", - "metadata": {}, - "outputs": [], - "source": [ - "def count_parameters(self, model) -> int:\n", - " \"\"\"\n", - " Count total trainable parameters in a model.\n", - "\n", - " TODO: Implement parameter counting for any model with parameters() method\n", - "\n", - " APPROACH:\n", - " 1. Get all parameters from model.parameters() if available\n", - " 2. For single layers, count weight and bias directly\n", - " 3. 
Sum total element count across all parameter tensors\n", - "\n", - " EXAMPLE:\n", - " >>> linear = Linear(128, 64) # 128*64 + 64 = 8256 parameters\n", - " >>> profiler = Profiler()\n", - " >>> count = profiler.count_parameters(linear)\n", - " >>> print(count)\n", - " 8256\n", - "\n", - " HINTS:\n", - " - Use parameter.data.size for tensor element count\n", - " - Handle models with and without parameters() method\n", - " - Don't forget bias terms when present\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " total_params = 0\n", - "\n", - " # Handle different model types\n", - " if hasattr(model, 'parameters'):\n", - " # Model with parameters() method (Sequential, custom models)\n", - " for param in model.parameters():\n", - " total_params += param.data.size\n", - " elif hasattr(model, 'weight'):\n", - " # Single layer (Linear, Conv2d)\n", - " total_params += model.weight.data.size\n", - " if hasattr(model, 'bias') and model.bias is not None:\n", - " total_params += model.bias.data.size\n", - " else:\n", - " # No parameters (activations, etc.)\n", - " total_params = 0\n", - "\n", - " return total_params\n", - " ### END SOLUTION\n", - "\n", - "# Add method to Profiler class\n", - "Profiler.count_parameters = count_parameters" + "```" ] }, { "cell_type": "markdown", - "id": "acdb0834", + "id": "68a302c1", "metadata": { "cell_marker": "\"\"\"", "lines_to_next_cell": 1 @@ -359,7 +936,7 @@ { "cell_type": "code", "execution_count": null, - "id": "4f5a8065", + "id": "9c44b45f", "metadata": { "nbgrader": { "grade": true, @@ -409,15 +986,15 @@ "\n", " print(\"āœ… Parameter counting works correctly!\")\n", "\n", - "test_unit_parameter_counting()" + "if __name__ == \"__main__\":\n", + " test_unit_parameter_counting()" ] }, { "cell_type": "markdown", - "id": "6e9d44c6", + "id": "fd88f0ff", "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 + "cell_marker": "\"\"\"" }, "source": [ "## FLOP Counting - Computational Cost Estimation\n", @@ -449,97 +1026,9 @@ "- 
**Activations**: Usually 1 FLOP per element" ] }, - { - "cell_type": "code", - "execution_count": null, - "id": "218af3a1", - "metadata": {}, - "outputs": [], - "source": [ - "def count_flops(self, model, input_shape: Tuple[int, ...]) -> int:\n", - " \"\"\"\n", - " Count FLOPs (Floating Point Operations) for one forward pass.\n", - "\n", - " TODO: Implement FLOP counting for different layer types\n", - "\n", - " APPROACH:\n", - " 1. Create dummy input with given shape\n", - " 2. Calculate FLOPs based on layer type and dimensions\n", - " 3. Handle different model architectures (Linear, Conv2d, Sequential)\n", - "\n", - " LAYER-SPECIFIC FLOP FORMULAS:\n", - " - Linear: input_features Ɨ output_features Ɨ 2 (matmul + bias)\n", - " - Conv2d: output_h Ɨ output_w Ɨ kernel_h Ɨ kernel_w Ɨ in_channels Ɨ out_channels Ɨ 2\n", - " - Activation: Usually 1 FLOP per element (ReLU, Sigmoid)\n", - "\n", - " EXAMPLE:\n", - " >>> linear = Linear(128, 64)\n", - " >>> profiler = Profiler()\n", - " >>> flops = profiler.count_flops(linear, (1, 128))\n", - " >>> print(flops) # 128 * 64 * 2 = 16384\n", - " 16384\n", - "\n", - " HINTS:\n", - " - Batch dimension doesn't affect per-sample FLOPs\n", - " - Focus on major operations (matmul, conv) first\n", - " - For Sequential models, sum FLOPs of all layers\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Create dummy input\n", - " dummy_input = Tensor(np.random.randn(*input_shape))\n", - " total_flops = 0\n", - "\n", - " # Handle different model types\n", - " if hasattr(model, '__class__'):\n", - " model_name = model.__class__.__name__\n", - "\n", - " if model_name == 'Linear':\n", - " # Linear layer: input_features Ɨ output_features Ɨ 2\n", - " in_features = input_shape[-1]\n", - " out_features = model.weight.shape[1] if hasattr(model, 'weight') else 1\n", - " total_flops = in_features * out_features * 2\n", - "\n", - " elif model_name == 'Conv2d':\n", - " # Conv2d layer: complex calculation based on output size\n", - " # Simplified: 
assume we know the output dimensions\n", - " if hasattr(model, 'kernel_size') and hasattr(model, 'in_channels'):\n", - " batch_size = input_shape[0] if len(input_shape) > 3 else 1\n", - " in_channels = model.in_channels\n", - " out_channels = model.out_channels\n", - " kernel_h = kernel_w = model.kernel_size\n", - "\n", - " # Estimate output size (simplified)\n", - " input_h, input_w = input_shape[-2], input_shape[-1]\n", - " output_h = input_h // (model.stride if hasattr(model, 'stride') else 1)\n", - " output_w = input_w // (model.stride if hasattr(model, 'stride') else 1)\n", - "\n", - " total_flops = (output_h * output_w * kernel_h * kernel_w *\n", - " in_channels * out_channels * 2)\n", - "\n", - " elif model_name == 'Sequential':\n", - " # Sequential model: sum FLOPs of all layers\n", - " current_shape = input_shape\n", - " for layer in model.layers:\n", - " layer_flops = self.count_flops(layer, current_shape)\n", - " total_flops += layer_flops\n", - " # Update shape for next layer (simplified)\n", - " if hasattr(layer, 'weight'):\n", - " current_shape = current_shape[:-1] + (layer.weight.shape[1],)\n", - "\n", - " else:\n", - " # Activation or other: assume 1 FLOP per element\n", - " total_flops = np.prod(input_shape)\n", - "\n", - " return total_flops\n", - " ### END SOLUTION\n", - "\n", - "# Add method to Profiler class\n", - "Profiler.count_flops = count_flops" - ] - }, { "cell_type": "markdown", - "id": "8b02224b", + "id": "e6311a0a", "metadata": { "cell_marker": "\"\"\"", "lines_to_next_cell": 1 @@ -555,7 +1044,7 @@ { "cell_type": "code", "execution_count": null, - "id": "3b947e9e", + "id": "8919b41a", "metadata": { "nbgrader": { "grade": true, @@ -599,15 +1088,15 @@ "\n", " print(\"āœ… FLOP counting works correctly!\")\n", "\n", - "test_unit_flop_counting()" + "if __name__ == \"__main__\":\n", + " test_unit_flop_counting()" ] }, { "cell_type": "markdown", - "id": "f32cf57c", + "id": "9a1d06f7", "metadata": { - "cell_marker": "\"\"\"", - 
"lines_to_next_cell": 1 + "cell_marker": "\"\"\"" }, "source": [ "## Memory Profiling - Understanding Memory Usage Patterns\n", @@ -638,97 +1127,9 @@ "We use Python's `tracemalloc` to track memory allocations during model execution. This gives us precise measurements of memory usage patterns." ] }, - { - "cell_type": "code", - "execution_count": null, - "id": "694a0990", - "metadata": {}, - "outputs": [], - "source": [ - "def measure_memory(self, model, input_shape: Tuple[int, ...]) -> Dict[str, float]:\n", - " \"\"\"\n", - " Measure memory usage during forward pass.\n", - "\n", - " TODO: Implement memory tracking for model execution\n", - "\n", - " APPROACH:\n", - " 1. Use tracemalloc to track memory allocation\n", - " 2. Measure baseline memory before model execution\n", - " 3. Run forward pass and track peak usage\n", - " 4. Calculate different memory components\n", - "\n", - " RETURN DICTIONARY:\n", - " - 'parameter_memory_mb': Memory for model parameters\n", - " - 'activation_memory_mb': Memory for activations\n", - " - 'peak_memory_mb': Maximum memory usage\n", - " - 'memory_efficiency': Ratio of useful to total memory\n", - "\n", - " EXAMPLE:\n", - " >>> linear = Linear(1024, 512)\n", - " >>> profiler = Profiler()\n", - " >>> memory = profiler.measure_memory(linear, (32, 1024))\n", - " >>> print(f\"Parameters: {memory['parameter_memory_mb']:.1f} MB\")\n", - " Parameters: 2.1 MB\n", - "\n", - " HINTS:\n", - " - Use tracemalloc.start() and tracemalloc.get_traced_memory()\n", - " - Account for float32 = 4 bytes per parameter\n", - " - Activation memory scales with batch size\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Start memory tracking\n", - " tracemalloc.start()\n", - "\n", - " # Measure baseline memory\n", - " baseline_memory = tracemalloc.get_traced_memory()[0]\n", - "\n", - " # Calculate parameter memory\n", - " param_count = self.count_parameters(model)\n", - " parameter_memory_bytes = param_count * 4 # Assume float32\n", - " 
parameter_memory_mb = parameter_memory_bytes / (1024 * 1024)\n", - "\n", - " # Create input and measure activation memory\n", - " dummy_input = Tensor(np.random.randn(*input_shape))\n", - " input_memory_bytes = dummy_input.data.nbytes\n", - "\n", - " # Estimate activation memory (simplified)\n", - " activation_memory_bytes = input_memory_bytes * 2 # Rough estimate\n", - " activation_memory_mb = activation_memory_bytes / (1024 * 1024)\n", - "\n", - " # Try to run forward pass and measure peak\n", - " try:\n", - " if hasattr(model, 'forward'):\n", - " _ = model.forward(dummy_input)\n", - " elif hasattr(model, '__call__'):\n", - " _ = model(dummy_input)\n", - " except:\n", - " pass # Ignore errors for simplified measurement\n", - "\n", - " # Get peak memory\n", - " current_memory, peak_memory = tracemalloc.get_traced_memory()\n", - " peak_memory_mb = (peak_memory - baseline_memory) / (1024 * 1024)\n", - "\n", - " tracemalloc.stop()\n", - "\n", - " # Calculate efficiency\n", - " useful_memory = parameter_memory_mb + activation_memory_mb\n", - " memory_efficiency = useful_memory / max(peak_memory_mb, 0.001) # Avoid division by zero\n", - "\n", - " return {\n", - " 'parameter_memory_mb': parameter_memory_mb,\n", - " 'activation_memory_mb': activation_memory_mb,\n", - " 'peak_memory_mb': max(peak_memory_mb, useful_memory),\n", - " 'memory_efficiency': min(memory_efficiency, 1.0)\n", - " }\n", - " ### END SOLUTION\n", - "\n", - "# Add method to Profiler class\n", - "Profiler.measure_memory = measure_memory" - ] - }, { "cell_type": "markdown", - "id": "1d520581", + "id": "a1e39372", "metadata": { "cell_marker": "\"\"\"", "lines_to_next_cell": 1 @@ -744,7 +1145,7 @@ { "cell_type": "code", "execution_count": null, - "id": "88c934b5", + "id": "60ee4331", "metadata": { "nbgrader": { "grade": true, @@ -797,15 +1198,15 @@ "\n", " print(\"āœ… Memory measurement works correctly!\")\n", "\n", - "test_unit_memory_measurement()" + "if __name__ == \"__main__\":\n", + " 
test_unit_memory_measurement()" ] }, { "cell_type": "markdown", - "id": "c45f1b79", + "id": "350bdbd3", "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 + "cell_marker": "\"\"\"" }, "source": [ "## Latency Measurement - Accurate Performance Timing\n", @@ -839,89 +1240,9 @@ "4. **Memory cleanup** to prevent contamination" ] }, - { - "cell_type": "code", - "execution_count": null, - "id": "764b8db5", - "metadata": {}, - "outputs": [], - "source": [ - "def measure_latency(self, model, input_tensor, warmup: int = 10, iterations: int = 100) -> float:\n", - " \"\"\"\n", - " Measure model inference latency with statistical rigor.\n", - "\n", - " TODO: Implement accurate latency measurement\n", - "\n", - " APPROACH:\n", - " 1. Run warmup iterations to stabilize performance\n", - " 2. Measure multiple iterations for statistical accuracy\n", - " 3. Calculate median latency to handle outliers\n", - " 4. Return latency in milliseconds\n", - "\n", - " PARAMETERS:\n", - " - warmup: Number of warmup runs (default 10)\n", - " - iterations: Number of measurement runs (default 100)\n", - "\n", - " EXAMPLE:\n", - " >>> linear = Linear(128, 64)\n", - " >>> input_tensor = Tensor(np.random.randn(1, 128))\n", - " >>> profiler = Profiler()\n", - " >>> latency = profiler.measure_latency(linear, input_tensor)\n", - " >>> print(f\"Latency: {latency:.2f} ms\")\n", - " Latency: 0.15 ms\n", - "\n", - " HINTS:\n", - " - Use time.perf_counter() for high precision\n", - " - Use median instead of mean for robustness against outliers\n", - " - Handle different model interfaces (forward, __call__)\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Warmup runs\n", - " for _ in range(warmup):\n", - " try:\n", - " if hasattr(model, 'forward'):\n", - " _ = model.forward(input_tensor)\n", - " elif hasattr(model, '__call__'):\n", - " _ = model(input_tensor)\n", - " else:\n", - " # Fallback for simple operations\n", - " _ = input_tensor\n", - " except:\n", - " pass # Ignore errors during 
warmup\n", - "\n", - " # Measurement runs\n", - " times = []\n", - " for _ in range(iterations):\n", - " start_time = time.perf_counter()\n", - "\n", - " try:\n", - " if hasattr(model, 'forward'):\n", - " _ = model.forward(input_tensor)\n", - " elif hasattr(model, '__call__'):\n", - " _ = model(input_tensor)\n", - " else:\n", - " # Minimal operation for timing\n", - " _ = input_tensor.data.copy()\n", - " except:\n", - " pass # Ignore errors but still measure time\n", - "\n", - " end_time = time.perf_counter()\n", - " times.append((end_time - start_time) * 1000) # Convert to milliseconds\n", - "\n", - " # Calculate statistics - use median for robustness\n", - " times = np.array(times)\n", - " median_latency = np.median(times)\n", - "\n", - " return float(median_latency)\n", - " ### END SOLUTION\n", - "\n", - "# Add method to Profiler class\n", - "Profiler.measure_latency = measure_latency" - ] - }, { "cell_type": "markdown", - "id": "a7aa639f", + "id": "f1a0465b", "metadata": { "cell_marker": "\"\"\"", "lines_to_next_cell": 1 @@ -937,7 +1258,7 @@ { "cell_type": "code", "execution_count": null, - "id": "3e642916", + "id": "dcc3cff0", "metadata": { "nbgrader": { "grade": true, @@ -986,20 +1307,20 @@ "\n", " print(\"āœ… Latency measurement works correctly!\")\n", "\n", - "test_unit_latency_measurement()" + "if __name__ == \"__main__\":\n", + " test_unit_latency_measurement()" ] }, { "cell_type": "markdown", - "id": "47686a04", + "id": "a5d9a959", "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 + "cell_marker": "\"\"\"" }, "source": [ "## 4. 
Integration: Advanced Profiling Functions\n", "\n", - "Now let's build higher-level profiling functions that combine our core measurements into comprehensive analysis tools.\n", + "Now let's validate our higher-level profiling functions that combine core measurements into comprehensive analysis tools.\n", "\n", "### Advanced Profiling Architecture\n", "```\n", @@ -1007,8 +1328,8 @@ " ↓ ↓ ↓\n", "count_parameters() profile_forward_pass() \"Memory-bound workload\"\n", "count_flops() profile_backward_pass() \"Optimize data movement\"\n", - "measure_memory() benchmark_efficiency() \"Focus on bandwidth\"\n", - "measure_latency() analyze_bottlenecks() \"Use quantization\"\n", + "measure_memory() profile_layer() \"Focus on bandwidth\"\n", + "measure_latency() benchmark_efficiency() \"Use quantization\"\n", "```\n", "\n", "### Forward Pass Profiling - Complete Performance Picture\n", @@ -1016,100 +1337,11 @@ "A forward pass profile combines all our measurements to understand model behavior comprehensively. This is essential for optimization decisions." ] }, - { - "cell_type": "code", - "execution_count": null, - "id": "01dc2eb1", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "advanced_profiling", - "solution": true - } - }, - "outputs": [], - "source": [ - "def profile_forward_pass(model, input_tensor) -> Dict[str, Any]:\n", - " \"\"\"\n", - " Comprehensive profiling of a model's forward pass.\n", - "\n", - " TODO: Implement complete forward pass analysis\n", - "\n", - " APPROACH:\n", - " 1. Use Profiler class to gather all measurements\n", - " 2. Create comprehensive performance profile\n", - " 3. Add derived metrics and insights\n", - " 4. 
Return structured analysis results\n", - "\n", - " RETURN METRICS:\n", - " - All basic profiler measurements\n", - " - FLOPs per second (computational efficiency)\n", - " - Memory bandwidth utilization\n", - " - Performance bottleneck identification\n", - "\n", - " EXAMPLE:\n", - " >>> model = Linear(256, 128)\n", - " >>> input_data = Tensor(np.random.randn(32, 256))\n", - " >>> profile = profile_forward_pass(model, input_data)\n", - " >>> print(f\"Throughput: {profile['gflops_per_second']:.2f} GFLOP/s\")\n", - " Throughput: 2.45 GFLOP/s\n", - "\n", - " HINTS:\n", - " - GFLOP/s = (FLOPs / 1e9) / (latency_ms / 1000)\n", - " - Memory bandwidth = memory_mb / (latency_ms / 1000)\n", - " - Consider realistic hardware limits for efficiency calculations\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " profiler = Profiler()\n", - "\n", - " # Basic measurements\n", - " param_count = profiler.count_parameters(model)\n", - " flops = profiler.count_flops(model, input_tensor.shape)\n", - " memory_stats = profiler.measure_memory(model, input_tensor.shape)\n", - " latency_ms = profiler.measure_latency(model, input_tensor, warmup=5, iterations=20)\n", - "\n", - " # Derived metrics\n", - " latency_seconds = latency_ms / 1000.0\n", - " gflops_per_second = (flops / 1e9) / max(latency_seconds, 1e-6)\n", - "\n", - " # Memory bandwidth (MB/s)\n", - " memory_bandwidth = memory_stats['peak_memory_mb'] / max(latency_seconds, 1e-6)\n", - "\n", - " # Efficiency metrics\n", - " theoretical_peak_gflops = 100.0 # Assume 100 GFLOP/s theoretical peak for CPU\n", - " computational_efficiency = min(gflops_per_second / theoretical_peak_gflops, 1.0)\n", - "\n", - " # Bottleneck analysis\n", - " is_memory_bound = memory_bandwidth > gflops_per_second * 100 # Rough heuristic\n", - " is_compute_bound = not is_memory_bound\n", - "\n", - " return {\n", - " # Basic measurements\n", - " 'parameters': param_count,\n", - " 'flops': flops,\n", - " 'latency_ms': latency_ms,\n", - " **memory_stats,\n", - "\n", - 
" # Derived metrics\n", - " 'gflops_per_second': gflops_per_second,\n", - " 'memory_bandwidth_mbs': memory_bandwidth,\n", - " 'computational_efficiency': computational_efficiency,\n", - "\n", - " # Bottleneck analysis\n", - " 'is_memory_bound': is_memory_bound,\n", - " 'is_compute_bound': is_compute_bound,\n", - " 'bottleneck': 'memory' if is_memory_bound else 'compute'\n", - " }\n", - " ### END SOLUTION" - ] - }, { "cell_type": "markdown", - "id": "16cc4aaf", + "id": "791555b9", "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 + "cell_marker": "\"\"\"" }, "source": [ "### Backward Pass Profiling - Training Analysis\n", @@ -1135,102 +1367,9 @@ "```" ] }, - { - "cell_type": "code", - "execution_count": null, - "id": "20aab8e4", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "def profile_backward_pass(model, input_tensor, loss_fn=None) -> Dict[str, Any]:\n", - " \"\"\"\n", - " Profile both forward and backward passes for training analysis.\n", - "\n", - " TODO: Implement training-focused profiling\n", - "\n", - " APPROACH:\n", - " 1. Profile forward pass first\n", - " 2. Estimate backward pass costs (typically 2Ɨ forward)\n", - " 3. Calculate total training iteration metrics\n", - " 4. 
Analyze memory requirements for gradients and optimizers\n", - "\n", - " BACKWARD PASS ESTIMATES:\n", - " - FLOPs: ~2Ɨ forward pass (gradient computation)\n", - " - Memory: +1Ɨ parameters (gradient storage)\n", - " - Latency: ~2Ɨ forward pass (more complex operations)\n", - "\n", - " EXAMPLE:\n", - " >>> model = Linear(128, 64)\n", - " >>> input_data = Tensor(np.random.randn(16, 128))\n", - " >>> profile = profile_backward_pass(model, input_data)\n", - " >>> print(f\"Training iteration: {profile['total_latency_ms']:.2f} ms\")\n", - " Training iteration: 0.45 ms\n", - "\n", - " HINTS:\n", - " - Total memory = parameters + activations + gradients\n", - " - Optimizer memory depends on algorithm (SGD: 0Ɨ, Adam: 2Ɨ)\n", - " - Consider gradient accumulation effects\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Get forward pass profile\n", - " forward_profile = profile_forward_pass(model, input_tensor)\n", - "\n", - " # Estimate backward pass (typically 2Ɨ forward)\n", - " backward_flops = forward_profile['flops'] * 2\n", - " backward_latency_ms = forward_profile['latency_ms'] * 2\n", - "\n", - " # Gradient memory (equal to parameter memory)\n", - " gradient_memory_mb = forward_profile['parameter_memory_mb']\n", - "\n", - " # Total training iteration\n", - " total_flops = forward_profile['flops'] + backward_flops\n", - " total_latency_ms = forward_profile['latency_ms'] + backward_latency_ms\n", - " total_memory_mb = (forward_profile['parameter_memory_mb'] +\n", - " forward_profile['activation_memory_mb'] +\n", - " gradient_memory_mb)\n", - "\n", - " # Training efficiency\n", - " total_gflops_per_second = (total_flops / 1e9) / (total_latency_ms / 1000.0)\n", - "\n", - " # Optimizer memory estimates\n", - " optimizer_memory_estimates = {\n", - " 'sgd': 0, # No extra memory\n", - " 'adam': gradient_memory_mb * 2, # Momentum + velocity\n", - " 'adamw': gradient_memory_mb * 2, # Same as Adam\n", - " }\n", - "\n", - " return {\n", - " # Forward pass\n", - " 
'forward_flops': forward_profile['flops'],\n", - " 'forward_latency_ms': forward_profile['latency_ms'],\n", - " 'forward_memory_mb': forward_profile['peak_memory_mb'],\n", - "\n", - " # Backward pass estimates\n", - " 'backward_flops': backward_flops,\n", - " 'backward_latency_ms': backward_latency_ms,\n", - " 'gradient_memory_mb': gradient_memory_mb,\n", - "\n", - " # Total training iteration\n", - " 'total_flops': total_flops,\n", - " 'total_latency_ms': total_latency_ms,\n", - " 'total_memory_mb': total_memory_mb,\n", - " 'total_gflops_per_second': total_gflops_per_second,\n", - "\n", - " # Optimizer memory requirements\n", - " 'optimizer_memory_estimates': optimizer_memory_estimates,\n", - "\n", - " # Training insights\n", - " 'memory_efficiency': forward_profile['memory_efficiency'],\n", - " 'bottleneck': forward_profile['bottleneck']\n", - " }\n", - " ### END SOLUTION" - ] - }, { "cell_type": "markdown", - "id": "a66d79fe", + "id": "24236272", "metadata": { "cell_marker": "\"\"\"", "lines_to_next_cell": 1 @@ -1246,7 +1385,7 @@ { "cell_type": "code", "execution_count": null, - "id": "f7838a43", + "id": "1516ed04", "metadata": { "nbgrader": { "grade": true, @@ -1261,11 +1400,12 @@ " \"\"\"šŸ”¬ Test advanced profiling functions.\"\"\"\n", " print(\"šŸ”¬ Unit Test: Advanced Profiling Functions...\")\n", "\n", - " # Create test model and input\n", + " # Create profiler and test model\n", + " profiler = Profiler()\n", " test_input = Tensor(np.random.randn(4, 8))\n", "\n", " # Test forward pass profiling\n", - " forward_profile = profile_forward_pass(test_input, test_input)\n", + " forward_profile = profiler.profile_forward_pass(test_input, test_input)\n", "\n", " # Validate forward profile structure\n", " required_forward_keys = [\n", @@ -1284,7 +1424,7 @@ " print(f\"āœ… Forward profiling: {forward_profile['gflops_per_second']:.2f} GFLOP/s\")\n", "\n", " # Test backward pass profiling\n", - " backward_profile = profile_backward_pass(test_input, test_input)\n", + " 
backward_profile = profiler.profile_backward_pass(test_input, test_input)\n", "\n", " # Validate backward profile structure\n", " required_backward_keys = [\n", @@ -1311,12 +1451,13 @@ " print(f\"āœ… Memory breakdown: {backward_profile['total_memory_mb']:.2f} MB training\")\n", " print(\"āœ… Advanced profiling functions work correctly!\")\n", "\n", - "test_unit_advanced_profiling()" + "if __name__ == \"__main__\":\n", + " test_unit_advanced_profiling()" ] }, { "cell_type": "markdown", - "id": "768f21e5", + "id": "b52a9046", "metadata": { "cell_marker": "\"\"\"", "lines_to_next_cell": 1 @@ -1343,7 +1484,7 @@ { "cell_type": "code", "execution_count": null, - "id": "7f90a148", + "id": "331e282f", "metadata": { "nbgrader": { "grade": false, @@ -1410,10 +1551,8 @@ " avg_efficiency = np.mean([r['gflops_per_second'] for r in results])\n", " if avg_efficiency < 10: # Arbitrary threshold for \"low\" efficiency\n", " print(\"šŸš€ Low compute efficiency suggests memory-bound workload\")\n", - " print(\" → Optimization focus: Data layout, memory bandwidth, caching\")\n", " else:\n", " print(\"šŸš€ High compute efficiency suggests compute-bound workload\")\n", - " print(\" → Optimization focus: Algorithmic efficiency, vectorization\")\n", "\n", "def analyze_batch_size_effects():\n", " \"\"\"šŸ“Š Analyze how batch size affects performance and efficiency.\"\"\"\n", @@ -1445,18 +1584,17 @@ " f\"{memory['peak_memory_mb']:.2f}\\t\\t{efficiency:.1f}\")\n", "\n", " print(\"\\nšŸ’” Batch Size Insights:\")\n", - " print(\"• Larger batches typically improve throughput but increase memory usage\")\n", - " print(\"• Sweet spot balances throughput and memory constraints\")\n", - " print(\"• Memory efficiency = samples/s per MB (higher is better)\")\n", + " print(\"Larger batches typically improve throughput but increase memory usage\")\n", "\n", "# Run the analysis\n", - "analyze_model_scaling()\n", - "analyze_batch_size_effects()" + "if __name__ == \"__main__\":\n", + " 
analyze_model_scaling()\n", + " analyze_batch_size_effects()" ] }, { "cell_type": "markdown", - "id": "0563e9cd", + "id": "08957c5b", "metadata": { "cell_marker": "\"\"\"", "lines_to_next_cell": 1 @@ -1489,7 +1627,7 @@ { "cell_type": "code", "execution_count": null, - "id": "c506a927", + "id": "750be525", "metadata": { "nbgrader": { "grade": false, @@ -1564,8 +1702,8 @@ " best_op = max(operations, key=lambda x: x['gflops_per_second'])\n", " worst_op = min(operations, key=lambda x: x['gflops_per_second'])\n", "\n", - " print(f\"• Most efficient: {best_op['operation']} ({best_op['gflops_per_second']:.2f} GFLOP/s)\")\n", - " print(f\"• Least efficient: {worst_op['operation']} ({worst_op['gflops_per_second']:.2f} GFLOP/s)\")\n", + " print(f\"Most efficient: {best_op['operation']} ({best_op['gflops_per_second']:.2f} GFLOP/s)\")\n", + " print(f\"Least efficient: {worst_op['operation']} ({worst_op['gflops_per_second']:.2f} GFLOP/s)\")\n", "\n", " # Count operation types\n", " memory_bound_ops = [op for op in operations if op['efficiency_class'] == 'memory-bound']\n", @@ -1573,11 +1711,9 @@ "\n", " print(f\"\\nšŸš€ Optimization Priority:\")\n", " if len(memory_bound_ops) > len(compute_bound_ops):\n", - " print(\"• Focus on memory optimization: data locality, bandwidth, caching\")\n", - " print(\"• Consider operation fusion to reduce memory traffic\")\n", + " print(\"Focus on memory optimization: data locality, bandwidth, caching\")\n", " else:\n", - " print(\"• Focus on compute optimization: better algorithms, vectorization\")\n", - " print(\"• Consider specialized libraries (BLAS, cuBLAS)\")\n", + " print(\"Focus on compute optimization: better algorithms, vectorization\")\n", "\n", "def analyze_profiling_overhead():\n", " \"\"\"šŸ“Š Measure the overhead of profiling itself.\"\"\"\n", @@ -1611,29 +1747,21 @@ "\n", " print(f\"\\nšŸ’” Profiling Overhead Insights:\")\n", " if overhead_factor < 2:\n", - " print(\"• Low overhead - suitable for frequent profiling\")\n", - " 
print(\"• Can be used in development with minimal impact\")\n", + " print(\"Low overhead - suitable for frequent profiling\")\n", " elif overhead_factor < 10:\n", - " print(\"• Moderate overhead - use for development and debugging\")\n", - " print(\"• Disable for production unless investigating issues\")\n", + " print(\"Moderate overhead - use for development and debugging\")\n", " else:\n", - " print(\"• High overhead - use sparingly in production\")\n", - " print(\"• Enable only when investigating specific performance issues\")\n", - "\n", - " print(f\"\\nšŸš€ Profiling Best Practices:\")\n", - " print(\"• Profile during development to identify bottlenecks\")\n", - " print(\"• Use production profiling only for investigation\")\n", - " print(\"• Focus measurement on critical code paths\")\n", - " print(\"• Balance measurement detail with overhead cost\")\n", + " print(\"High overhead - use sparingly in production\")\n", "\n", "# Run optimization analysis\n", - "benchmark_operation_efficiency()\n", - "analyze_profiling_overhead()" + "if __name__ == \"__main__\":\n", + " benchmark_operation_efficiency()\n", + " analyze_profiling_overhead()" ] }, { "cell_type": "markdown", - "id": "e7a5de0d", + "id": "a170135d", "metadata": { "cell_marker": "\"\"\"", "lines_to_next_cell": 1 @@ -1647,7 +1775,7 @@ { "cell_type": "code", "execution_count": null, - "id": "d922a54d", + "id": "379ab83a", "metadata": { "nbgrader": { "grade": true, @@ -1704,8 +1832,8 @@ "\n", " # Test advanced profiling\n", " print(\"2. 
Running advanced profiling...\")\n", - " forward_profile = profile_forward_pass(test_model, test_input)\n", - " backward_profile = profile_backward_pass(test_model, test_input)\n", + " forward_profile = profiler.profile_forward_pass(test_model, test_input)\n", + " backward_profile = profiler.profile_backward_pass(test_model, test_input)\n", "\n", " assert 'gflops_per_second' in forward_profile\n", " assert 'total_latency_ms' in backward_profile\n", @@ -1736,7 +1864,7 @@ "\n", " # Simulate larger model analysis\n", " large_input = Tensor(np.random.randn(32, 512)) # Larger model input\n", - " large_profile = profile_forward_pass(large_input, large_input)\n", + " large_profile = profiler.profile_forward_pass(large_input, large_input)\n", "\n", " # Verify profile contains optimization insights\n", " assert 'bottleneck' in large_profile, \"Profile should identify bottlenecks\"\n", @@ -1749,16 +1877,17 @@ "\n", " print(\"\\n\" + \"=\" * 50)\n", " print(\"šŸŽ‰ ALL TESTS PASSED! Module ready for export.\")\n", - " print(\"Run: tito module complete 15\")\n", + " print(\"Run: tito module complete 14\")\n", "\n", "# Call before module summary\n", - "test_module()" + "if __name__ == \"__main__\":\n", + " test_module()" ] }, { "cell_type": "code", "execution_count": null, - "id": "378e2ca8", + "id": "6502f689", "metadata": {}, "outputs": [], "source": [ @@ -1770,7 +1899,7 @@ }, { "cell_type": "markdown", - "id": "e44c6173", + "id": "b4ff25e4", "metadata": { "cell_marker": "\"\"\"" }, @@ -1805,143 +1934,7 @@ }, { "cell_type": "markdown", - "id": "ab131290", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## šŸ Consolidated Profiler for Export\n", - "\n", - "Now that we've implemented all profiling methods, let's create a consolidated Profiler class\n", - "for export to the tinytorch package. This allows milestones to use the full profiler." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "dd3324fa", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "profiler_export", - "solution": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class ProfilerComplete:\n", - " \"\"\"\n", - " Complete profiler with all measurement capabilities for milestone use.\n", - " \n", - " This is the exported version students build through the module exercises.\n", - " \"\"\"\n", - " \n", - " def __init__(self):\n", - " \"\"\"Initialize profiler with measurement state.\"\"\"\n", - " self.measurements = {}\n", - " self.operation_counts = defaultdict(int)\n", - " self.memory_tracker = None\n", - " \n", - " def count_parameters(self, model) -> int:\n", - " \"\"\"Count total trainable parameters in a model.\"\"\"\n", - " total_params = 0\n", - " \n", - " if hasattr(model, 'parameters'):\n", - " for param in model.parameters():\n", - " total_params += param.data.size\n", - " elif hasattr(model, 'weight'):\n", - " total_params += model.weight.data.size\n", - " if hasattr(model, 'bias') and model.bias is not None:\n", - " total_params += model.bias.data.size\n", - " \n", - " return total_params\n", - " \n", - " def count_flops(self, model, input_shape: Tuple[int, ...]) -> int:\n", - " \"\"\"Count FLOPs for one forward pass.\"\"\"\n", - " dummy_input = Tensor(np.random.randn(*input_shape))\n", - " total_flops = 0\n", - " \n", - " if hasattr(model, '__class__'):\n", - " model_name = model.__class__.__name__\n", - " \n", - " if model_name == 'Linear':\n", - " in_features = input_shape[-1]\n", - " out_features = model.weight.shape[1] if hasattr(model, 'weight') else 1\n", - " total_flops = in_features * out_features * 2\n", - " \n", - " elif model_name == 'Conv2d':\n", - " total_flops = 1000000 # Simplified for now\n", - " \n", - " return total_flops\n", - " \n", - " def measure_memory(self, model, input_shape: Tuple[int, ...]) -> Dict[str, float]:\n", - " 
\"\"\"Measure memory usage during forward pass.\"\"\"\n", - " tracemalloc.start()\n", - " baseline_memory = tracemalloc.get_traced_memory()[0]\n", - " \n", - " param_count = self.count_parameters(model)\n", - " parameter_memory_bytes = param_count * 4\n", - " parameter_memory_mb = parameter_memory_bytes / (1024 * 1024)\n", - " \n", - " dummy_input = Tensor(np.random.randn(*input_shape))\n", - " \n", - " try:\n", - " if hasattr(model, 'forward'):\n", - " output = model.forward(dummy_input)\n", - " elif hasattr(model, '__call__'):\n", - " output = model(dummy_input)\n", - " except:\n", - " output = dummy_input\n", - " \n", - " peak_memory, _ = tracemalloc.get_traced_memory()\n", - " tracemalloc.stop()\n", - " \n", - " peak_memory_mb = peak_memory / (1024 * 1024)\n", - " activation_memory_mb = max(0, peak_memory_mb - parameter_memory_mb)\n", - " \n", - " return {\n", - " 'parameter_memory_mb': parameter_memory_mb,\n", - " 'activation_memory_mb': activation_memory_mb,\n", - " 'peak_memory_mb': peak_memory_mb,\n", - " 'memory_efficiency': parameter_memory_mb / peak_memory_mb if peak_memory_mb > 0 else 0\n", - " }\n", - " \n", - " def measure_latency(self, model, input_tensor, warmup: int = 10, iterations: int = 100) -> float:\n", - " \"\"\"Measure model inference latency with statistical rigor.\"\"\"\n", - " # Warmup\n", - " for _ in range(warmup):\n", - " try:\n", - " if hasattr(model, 'forward'):\n", - " _ = model.forward(input_tensor)\n", - " elif hasattr(model, '__call__'):\n", - " _ = model(input_tensor)\n", - " except:\n", - " pass\n", - " \n", - " # Measurement\n", - " times = []\n", - " for _ in range(iterations):\n", - " start = time.perf_counter()\n", - " try:\n", - " if hasattr(model, 'forward'):\n", - " _ = model.forward(input_tensor)\n", - " elif hasattr(model, '__call__'):\n", - " _ = model(input_tensor)\n", - " except:\n", - " pass\n", - " end = time.perf_counter()\n", - " times.append(end - start)\n", - " \n", - " median_latency_ms = np.median(times) * 
1000\n",
- " return median_latency_ms"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "dc025a52",
+ "id": "72dec7d6",
 "metadata": {
 "cell_marker": "\"\"\""
 },
@@ -1970,10 +1963,10 @@
 "- **Statistical Rigor**: Handle measurement variance with proper methodology\n",
 "\n",
 "### Ready for Next Steps\n",
- "Your profiling implementation enables Module 16 (Acceleration) to make data-driven optimization decisions.\n",
- "Export with: `tito module complete 15`\n",
+ "Your profiling implementation enables the optimization modules (15-18) to make data-driven optimization decisions.\n",
+ "Export with: `tito module complete 14`\n",
 "\n",
- "**Next**: Module 16 will use these profiling tools to implement acceleration techniques and measure their effectiveness!"
+ "**Next**: Module 15 (Memoization) will use profiling to discover transformer bottlenecks and fix them!"
 ]
 }
 ],