diff --git a/modules/source/01_tensor/tensor_dev.ipynb b/modules/source/01_tensor/tensor_dev.ipynb new file mode 100644 index 00000000..9d3820e0 --- /dev/null +++ b/modules/source/01_tensor/tensor_dev.ipynb @@ -0,0 +1,1903 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "7e9f10f4", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "# Module 01: Tensor Foundation - Building Blocks of ML\n", + "\n", + "Welcome to Module 01! You're about to build the foundational Tensor class that powers all machine learning operations.\n", + "\n", + "## 🔗 Prerequisites & Progress\n", + "**You've Built**: Nothing - this is our foundation!\n", + "**You'll Build**: A complete Tensor class with arithmetic, matrix operations, and shape manipulation\n", + "**You'll Enable**: Foundation for activations, layers, and all future neural network components\n", + "\n", + "**Connection Map**:\n", + "```\n", + "NumPy Arrays → Tensor → Activations (Module 02)\n", + "(raw data) (ML ops) (intelligence)\n", + "```\n", + "\n", + "## Learning Objectives\n", + "By the end of this module, you will:\n", + "1. Implement a complete Tensor class with fundamental operations\n", + "2. Understand tensors as the universal data structure in ML\n", + "3. Test tensor operations with immediate validation\n", + "4. 
Prepare for gradient computation in Module 05\n", + "\n", + "Let's get started!\n", + "\n", + "## 📦 Where This Code Lives in the Final Package\n", + "\n", + "**Learning Side:** You work in modules/01_tensor/tensor_dev.py\n", + "**Building Side:** Code exports to tinytorch.core.tensor\n", + "\n", + "```python\n", + "# Final package structure:\n", + "from tinytorch.core.tensor import Tensor # This module - foundation for everything\n", + "# Future modules will import and extend this Tensor\n", + "```\n", + "\n", + "**Why this matters:**\n", + "- **Learning:** Complete tensor system in one focused module for deep understanding\n", + "- **Production:** Proper organization like PyTorch's torch.Tensor with all core operations together\n", + "- **Consistency:** All tensor operations and data manipulation in core.tensor\n", + "- **Integration:** Foundation that every other module will build upon" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "76974680", + "metadata": { + "nbgrader": { + "grade": false, + "grade_id": "imports", + "solution": true + } + }, + "outputs": [], + "source": [ + "#| default_exp core.tensor\n", + "\n", + "import numpy as np" + ] + }, + { + "cell_type": "markdown", + "id": "9885fe6c", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 1. Introduction: What is a Tensor?\n", + "\n", + "A tensor is a multi-dimensional array that serves as the fundamental data structure in machine learning. Think of it as a universal container that can hold data in different dimensions:\n", + "\n", + "```\n", + "Tensor Dimensions:\n", + "┌─────────────┐\n", + "│ 0D: Scalar │ 5.0 (just a number)\n", + "│ 1D: Vector │ [1, 2, 3] (list of numbers)\n", + "│ 2D: Matrix │ [[1, 2] (grid of numbers)\n", + "│ │ [3, 4]]\n", + "│ 3D: Cube │ [[[... 
(stack of matrices)\n", + "└─────────────┘\n", + "```\n", + "\n", + "In machine learning, tensors flow through operations like water through pipes:\n", + "\n", + "```\n", + "Neural Network Data Flow:\n", + "Input Tensor → Layer 1 → Activation → Layer 2 → ... → Output Tensor\n", + " [batch, [batch, [batch, [batch, [batch,\n", + " features] hidden] hidden] hidden2] classes]\n", + "```\n", + "\n", + "Every neural network, from simple linear regression to modern transformers, processes tensors. Understanding tensors means understanding the foundation of all ML computations.\n", + "\n", + "### Why Tensors Matter in ML Systems\n", + "\n", + "In production ML systems, tensors carry more than just data - they carry the computational graph, memory layout information, and execution context:\n", + "\n", + "```\n", + "Real ML Pipeline:\n", + "Raw Data → Preprocessing → Tensor Creation → Model Forward Pass → Loss Computation\n", + " ↓ ↓ ↓ ↓ ↓\n", + " Files NumPy Arrays Tensors GPU Tensors Scalar Loss\n", + "```\n", + "\n", + "**Key Insight**: Tensors bridge the gap between mathematical concepts and efficient computation on modern hardware." + ] + }, + { + "cell_type": "markdown", + "id": "c9ac7887", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 2. 
Foundations: Mathematical Background\n", + "\n", + "### Core Operations We'll Implement\n", + "\n", + "Our Tensor class will support all fundamental operations that neural networks need:\n", + "\n", + "```\n", + "Operation Types:\n", + "┌─────────────────┬─────────────────┬─────────────────┐\n", + "│ Element-wise │ Matrix Ops │ Shape Ops │\n", + "├─────────────────┼─────────────────┼─────────────────┤\n", + "│ + Addition │ @ Matrix Mult │ .reshape() │\n", + "│ - Subtraction │ .transpose() │ .sum() │\n", + "│ * Multiplication│ │ .mean() │\n", + "│ / Division │ │ .max() │\n", + "└─────────────────┴─────────────────┴─────────────────┘\n", + "```\n", + "\n", + "### Broadcasting: Making Tensors Work Together\n", + "\n", + "Broadcasting automatically aligns tensors of different shapes for operations:\n", + "\n", + "```\n", + "Broadcasting Examples:\n", + "┌─────────────────────────────────────────────────────────┐\n", + "│ Scalar + Vector: │\n", + "│ 5 + [1, 2, 3] → [5, 5, 5] + [1, 2, 3] = [6, 7, 8]│\n", + "│ │\n", + "│ Matrix + Vector (row-wise): │\n", + "│ [[1, 2]] [10] [[1, 2]] [[10, 10]] [[11, 12]] │\n", + "│ [[3, 4]] + [10] = [[3, 4]] + [[10, 10]] = [[13, 14]] │\n", + "└─────────────────────────────────────────────────────────┘\n", + "```\n", + "\n", + "**Memory Layout**: NumPy uses row-major (C-style) storage where elements are stored row by row in memory for cache efficiency:\n", + "\n", + "```\n", + "Memory Layout (2×3 matrix):\n", + "Matrix: Memory:\n", + "[[1, 2, 3] [1][2][3][4][5][6]\n", + " [4, 5, 6]] ↑ Row 1 ↑ Row 2\n", + "\n", + "Cache Behavior:\n", + "Sequential Access: Fast (uses cache lines efficiently)\n", + " Row access: [1][2][3] → cache hit, hit, hit\n", + "Random Access: Slow (cache misses)\n", + " Column access: [1][4] → cache hit, miss\n", + "```\n", + "\n", + "This memory layout affects performance in real ML workloads - algorithms that access data sequentially run faster than those that access randomly." 
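The broadcasting and row-major layout claims above can be checked with a few lines of plain NumPy. This is a standalone sketch, independent of the Tensor class built in this module:

```python
import numpy as np

# Broadcasting aligns shapes from the right: a (2,) vector is
# applied to each row of a (2, 2) matrix.
m = np.array([[1, 2], [3, 4]])
v = np.array([10, 20])
assert (m + v == np.array([[11, 22], [13, 24]])).all()

# Scalar + vector: the scalar broadcasts to every element.
assert (5 + np.array([1, 2, 3]) == np.array([6, 7, 8])).all()

# Row-major (C-style) layout: a 2x3 matrix flattens row by row.
a = np.array([[1, 2, 3], [4, 5, 6]])
assert a.flatten(order="C").tolist() == [1, 2, 3, 4, 5, 6]

# Strides confirm the layout: stepping to the next column moves one
# element in memory, stepping to the next row moves a whole row.
assert a.strides == (3 * a.itemsize, a.itemsize)
```

The strides check is why sequential row access is cache-friendly: consecutive row elements are adjacent in memory, while consecutive column elements are a full row apart.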
+ ] + }, + { + "cell_type": "markdown", + "id": "4a901ed7", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 3. Implementation: Building Tensor Foundation\n", + "\n", + "Let's build our Tensor class step by step, testing each component as we go.\n", + "\n", + "**Key Design Decision**: We'll include gradient-related attributes from the start, but they'll remain dormant until Module 05. This ensures a consistent interface throughout the course while keeping the cognitive load manageable.\n", + "\n", + "### Tensor Class Architecture\n", + "\n", + "```\n", + "Tensor Class Structure:\n", + "┌─────────────────────────────────┐\n", + "│ Core Attributes: │\n", + "│ • data: np.array (the numbers) │\n", + "│ • shape: tuple (dimensions) │\n", + "│ • size: int (total elements) │\n", + "│ • dtype: type (float32, int64) │\n", + "├─────────────────────────────────┤\n", + "│ Gradient Attributes (dormant): │\n", + "│ • requires_grad: bool │\n", + "│ • grad: None (until Module 05) │\n", + "├─────────────────────────────────┤\n", + "│ Operations: │\n", + "│ • __add__, __sub__, __mul__ │\n", + "│ • matmul(), reshape() │\n", + "│ • sum(), mean(), max() │\n", + "│ • __repr__(), __str__() │\n", + "└─────────────────────────────────┘\n", + "```\n", + "\n", + "The beauty of this design: **all methods are defined inside the class from day one**. No monkey-patching, no dynamic attribute addition. Clean, consistent, debugger-friendly." + ] + }, + { + "cell_type": "markdown", + "id": "f5325a29", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### Tensor Creation and Initialization\n", + "\n", + "Before we implement operations, let's understand how tensors store data and manage their attributes. 
This initialization is the foundation that everything else builds upon.\n", + "\n", + "```\n", + "Tensor Initialization Process:\n", + "Input Data → Validation → NumPy Array → Tensor Wrapper → Ready for Operations\n", + " [1,2,3] → types → np.array → shape=(3,) → + - * / @ ...\n", + " ↓ ↓ ↓ ↓\n", + " List/Array Type Check Memory Attributes Set\n", + " (optional) Allocation\n", + "\n", + "Memory Allocation Example:\n", + "Input: [[1, 2, 3], [4, 5, 6]]\n", + " ↓\n", + "NumPy allocates: [1][2][3][4][5][6] in contiguous memory\n", + " ↓\n", + "Tensor wraps with: shape=(2,3), size=6, dtype=float32\n", + "```\n", + "\n", + "**Key Design Principle**: Our Tensor is a wrapper around NumPy arrays that adds ML-specific functionality. We leverage NumPy's battle-tested memory management and computation kernels while adding the gradient tracking and operation chaining needed for deep learning.\n", + "\n", + "**Why This Approach?**\n", + "- **Performance**: NumPy's C implementations are highly optimized\n", + "- **Compatibility**: Easy integration with scientific Python ecosystem\n", + "- **Memory Efficiency**: No unnecessary data copying\n", + "- **Consistency**: All input data is cast to float32, the standard ML dtype\n", + "- **Future-Proof**: Easy transition to GPU tensors in advanced modules" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a532b1ec", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "tensor-class", + "solution": true + } + }, + "outputs": [], + "source": [ + "class Tensor:\n", + " \"\"\"Educational tensor that grows with student knowledge.\n", + "\n", + " This class starts simple but includes dormant features for future modules:\n", + " - requires_grad: Will be used for automatic differentiation (Module 05)\n", + " - grad: Will store computed gradients (Module 05)\n", + " - backward(): Will compute gradients (Module 05)\n", + "\n", + " For now, focus on: data, shape, and basic operations.\n", + " \"\"\"\n", + "\n", + " def __init__(self, data, requires_grad=False):\n", + " \"\"\"\n", + " Create a new tensor from data.\n", + "\n", + " TODO: Initialize tensor attributes\n", + "\n", + " APPROACH:\n", + " 1. Convert data to NumPy array - handles lists, scalars, etc.\n", + " 2. Store shape and size for quick access\n", + " 3. Set up gradient tracking (dormant until Module 05)\n", + "\n", + " EXAMPLE:\n", + " >>> tensor = Tensor([1, 2, 3])\n", + " >>> print(tensor.data)\n", + " [1. 2. 3.]\n", + " >>> print(tensor.shape)\n", + " (3,)\n", + "\n", + " HINT: np.array() handles type conversion automatically\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # Core tensor data - always present\n", + " self.data = np.array(data, dtype=np.float32) # Consistent float32 for ML\n", + " self.shape = self.data.shape\n", + " self.size = self.data.size\n", + " self.dtype = self.data.dtype\n", + "\n", + " # Gradient features (dormant until Module 05)\n", + " self.requires_grad = requires_grad\n", + " self.grad = None\n", + " ### END SOLUTION\n", + "\n", + " def __repr__(self):\n", + " \"\"\"String representation of tensor for debugging.\"\"\"\n", + " grad_info = f\", requires_grad={self.requires_grad}\" if self.requires_grad else \"\"\n", + " return f\"Tensor(data={self.data}, shape={self.shape}{grad_info})\"\n", + "\n", + " def __str__(self):\n", + " \"\"\"Human-readable string representation.\"\"\"\n", + " return f\"Tensor({self.data})\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dd2e63a9", + "metadata": { + "nbgrader": { + "grade": false, + "grade_id": "addition-impl", + "solution": true + } + }, + "outputs": [], + "source": [ + " def __add__(self, other):\n", + " \"\"\"\n", + " Add two tensors element-wise with broadcasting support.\n", + "\n", + " TODO: Implement tensor addition with automatic broadcasting\n", + "\n", + " APPROACH:\n", + " 1. Handle both Tensor and scalar inputs\n", + " 2. Use NumPy's broadcasting for automatic shape alignment\n", + " 3. 
Return new Tensor with result (don't modify self)\n", + "\n", + " EXAMPLE:\n", + " >>> a = Tensor([1, 2, 3])\n", + " >>> b = Tensor([4, 5, 6])\n", + " >>> result = a + b\n", + " >>> print(result.data)\n", + " [5. 7. 9.]\n", + "\n", + " BROADCASTING EXAMPLE:\n", + " >>> matrix = Tensor([[1, 2], [3, 4]]) # Shape: (2, 2)\n", + " >>> vector = Tensor([10, 20]) # Shape: (2,)\n", + " >>> result = matrix + vector # Broadcasting: (2,2) + (2,) → (2,2)\n", + " >>> print(result.data)\n", + " [[11. 22.]\n", + " [13. 24.]]\n", + "\n", + " HINTS:\n", + " - Use isinstance() to check if other is a Tensor\n", + " - NumPy handles broadcasting automatically with +\n", + " - Always return a new Tensor, don't modify self\n", + " - Preserve gradient tracking for future modules\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " if isinstance(other, Tensor):\n", + " # Tensor + Tensor: let NumPy handle broadcasting\n", + " result_data = self.data + other.data\n", + " else:\n", + " # Tensor + scalar: NumPy broadcasts automatically\n", + " result_data = self.data + other\n", + "\n", + " # Create new tensor with result\n", + " result = Tensor(result_data)\n", + "\n", + " # Preserve gradient tracking if either operand requires gradients\n", + " if hasattr(self, 'requires_grad') and hasattr(other, 'requires_grad'):\n", + " result.requires_grad = self.requires_grad or (isinstance(other, Tensor) and other.requires_grad)\n", + " elif hasattr(self, 'requires_grad'):\n", + " result.requires_grad = self.requires_grad\n", + "\n", + " return result\n", + " ### END SOLUTION" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "223aaca8", + "metadata": { + "nbgrader": { + "grade": false, + "grade_id": "more-arithmetic", + "solution": true + } + }, + "outputs": [], + "source": [ + " def __sub__(self, other):\n", + " \"\"\"\n", + " Subtract two tensors element-wise.\n", + "\n", + " Common use: Centering data (x - mean), computing differences for loss functions.\n", + " \"\"\"\n", + " if 
isinstance(other, Tensor):\n", + " return Tensor(self.data - other.data)\n", + " else:\n", + " return Tensor(self.data - other)\n", + "\n", + " def __mul__(self, other):\n", + " \"\"\"\n", + " Multiply two tensors element-wise (NOT matrix multiplication).\n", + "\n", + " Common use: Scaling features, applying masks, gating mechanisms in neural networks.\n", + " Note: This is * operator, not @ (which will be matrix multiplication).\n", + " \"\"\"\n", + " if isinstance(other, Tensor):\n", + " return Tensor(self.data * other.data)\n", + " else:\n", + " return Tensor(self.data * other)\n", + "\n", + " def __truediv__(self, other):\n", + " \"\"\"\n", + " Divide two tensors element-wise.\n", + "\n", + " Common use: Normalization (x / std), converting counts to probabilities.\n", + " \"\"\"\n", + " if isinstance(other, Tensor):\n", + " return Tensor(self.data / other.data)\n", + " else:\n", + " return Tensor(self.data / other)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "110326a6", + "metadata": { + "nbgrader": { + "grade": false, + "grade_id": "matmul-impl", + "solution": true + } + }, + "outputs": [], + "source": [ + " def matmul(self, other):\n", + " \"\"\"\n", + " Matrix multiplication of two tensors.\n", + "\n", + " TODO: Implement matrix multiplication using np.dot with proper validation\n", + "\n", + " APPROACH:\n", + " 1. Validate inputs are Tensors\n", + " 2. Check dimension compatibility (inner dimensions must match)\n", + " 3. Use np.dot for optimized computation\n", + " 4. 
Return new Tensor with result\n", + "\n", + " EXAMPLE:\n", + " >>> a = Tensor([[1, 2], [3, 4]]) # 2×2\n", + " >>> b = Tensor([[5, 6], [7, 8]]) # 2×2\n", + " >>> result = a.matmul(b) # 2×2 result\n", + " >>> # Result: [[1×5+2×7, 1×6+2×8], [3×5+4×7, 3×6+4×8]] = [[19, 22], [43, 50]]\n", + "\n", + " SHAPE RULES:\n", + " - (M, K) @ (K, N) → (M, N) ✓ Valid\n", + " - (M, K) @ (J, N) → Error ✗ K ≠ J\n", + "\n", + " COMPLEXITY: O(M×N×K) for (M×K) @ (K×N) matrices\n", + "\n", + " HINTS:\n", + " - np.dot handles the optimization for us\n", + " - Check self.shape[-1] == other.shape[-2] for compatibility\n", + " - Provide clear error messages for debugging\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " if not isinstance(other, Tensor):\n", + " raise TypeError(f\"Expected Tensor for matrix multiplication, got {type(other)}\")\n", + "\n", + " # Handle edge cases\n", + " if self.shape == () or other.shape == ():\n", + " # Scalar multiplication\n", + " return Tensor(self.data * other.data)\n", + "\n", + " # For matrix multiplication, we need at least 1D tensors\n", + " if len(self.shape) == 0 or len(other.shape) == 0:\n", + " return Tensor(self.data * other.data)\n", + "\n", + " # Check dimension compatibility for matrix multiplication\n", + " if len(self.shape) >= 2 and len(other.shape) >= 2:\n", + " if self.shape[-1] != other.shape[-2]:\n", + " raise ValueError(\n", + " f\"Cannot perform matrix multiplication: {self.shape} @ {other.shape}. \"\n", + " f\"Inner dimensions must match: {self.shape[-1]} ≠ {other.shape[-2]}. \"\n", + " f\"💡 HINT: For (M,K) @ (K,N) → (M,N), the K dimensions must be equal.\"\n", + " )\n", + " elif len(self.shape) == 1 and len(other.shape) == 2:\n", + " # Vector @ Matrix\n", + " if self.shape[0] != other.shape[0]:\n", + " raise ValueError(\n", + " f\"Cannot multiply vector {self.shape} with matrix {other.shape}. 
\"\n", + " f\"Vector length {self.shape[0]} must match matrix rows {other.shape[0]}.\"\n", + " )\n", + " elif len(self.shape) == 2 and len(other.shape) == 1:\n", + " # Matrix @ Vector\n", + " if self.shape[1] != other.shape[0]:\n", + " raise ValueError(\n", + " f\"Cannot multiply matrix {self.shape} with vector {other.shape}. \"\n", + " f\"Matrix columns {self.shape[1]} must match vector length {other.shape[0]}.\"\n", + " )\n", + "\n", + " # Perform optimized matrix multiplication\n", + " result_data = np.dot(self.data, other.data)\n", + " return Tensor(result_data)\n", + " ### END SOLUTION" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7307e0e8", + "metadata": { + "nbgrader": { + "grade": false, + "grade_id": "shape-ops", + "solution": true + } + }, + "outputs": [], + "source": [ + " def reshape(self, *shape):\n", + " \"\"\"\n", + " Reshape tensor to new dimensions.\n", + "\n", + " TODO: Implement tensor reshaping with validation\n", + "\n", + " APPROACH:\n", + " 1. Handle different calling conventions: reshape(2, 3) vs reshape((2, 3))\n", + " 2. Validate total elements remain the same\n", + " 3. Use NumPy's reshape for the actual operation\n", + " 4. Return new Tensor (keep immutability)\n", + "\n", + " EXAMPLE:\n", + " >>> tensor = Tensor([1, 2, 3, 4, 5, 6]) # Shape: (6,)\n", + " >>> reshaped = tensor.reshape(2, 3) # Shape: (2, 3)\n", + " >>> print(reshaped.data)\n", + " [[1. 2. 3.]\n", + " [4. 5. 
6.]]\n", + "\n", + " COMMON USAGE:\n", + " >>> # Flatten for MLP input\n", + " >>> image = Tensor(np.random.rand(3, 32, 32)) # (channels, height, width)\n", + " >>> flattened = image.reshape(-1) # (3072,) - all pixels in vector\n", + " >>>\n", + " >>> # Prepare batch for convolution\n", + " >>> batch = Tensor(np.random.rand(32, 784)) # (batch, features)\n", + " >>> images = batch.reshape(32, 1, 28, 28) # (batch, channels, height, width)\n", + "\n", + " HINTS:\n", + " - Handle both reshape(2, 3) and reshape((2, 3)) calling styles\n", + " - Check np.prod(new_shape) == self.size for validation\n", + " - Use descriptive error messages for debugging\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # Handle both reshape(2, 3) and reshape((2, 3)) calling conventions\n", + " if len(shape) == 1 and isinstance(shape[0], (tuple, list)):\n", + " new_shape = tuple(shape[0])\n", + " else:\n", + " new_shape = shape\n", + "\n", + " # Handle -1 for automatic dimension inference (like NumPy)\n", + " if -1 in new_shape:\n", + " if new_shape.count(-1) > 1:\n", + " raise ValueError(\"Can only specify one unknown dimension with -1\")\n", + "\n", + " # Calculate the unknown dimension\n", + " known_size = 1\n", + " unknown_idx = new_shape.index(-1)\n", + " for i, dim in enumerate(new_shape):\n", + " if i != unknown_idx:\n", + " known_size *= dim\n", + "\n", + " unknown_dim = self.size // known_size\n", + " new_shape = list(new_shape)\n", + " new_shape[unknown_idx] = unknown_dim\n", + " new_shape = tuple(new_shape)\n", + "\n", + " # Validate total elements remain the same\n", + " if np.prod(new_shape) != self.size:\n", + " raise ValueError(\n", + " f\"Cannot reshape tensor of size {self.size} to shape {new_shape}. \"\n", + " f\"Total elements must match: {self.size} ≠ {np.prod(new_shape)}. 
\"\n", + " f\"💡 HINT: Make sure new_shape dimensions multiply to {self.size}\"\n", + " )\n", + "\n", + " # Reshape the data (NumPy handles the memory layout efficiently)\n", + " reshaped_data = np.reshape(self.data, new_shape)\n", + " return Tensor(reshaped_data)\n", + " ### END SOLUTION\n", + "\n", + " def transpose(self, dim0=None, dim1=None):\n", + " \"\"\"\n", + " Transpose tensor dimensions.\n", + "\n", + " TODO: Implement tensor transposition\n", + "\n", + " APPROACH:\n", + " 1. Handle default case (transpose last two dimensions)\n", + " 2. Handle specific dimension swapping\n", + " 3. Use NumPy's transpose with proper axis specification\n", + " 4. Return new Tensor\n", + "\n", + " EXAMPLE:\n", + " >>> matrix = Tensor([[1, 2, 3], [4, 5, 6]]) # (2, 3)\n", + " >>> transposed = matrix.transpose() # (3, 2)\n", + " >>> print(transposed.data)\n", + " [[1. 4.]\n", + " [2. 5.]\n", + " [3. 6.]]\n", + "\n", + " NEURAL NETWORK USAGE:\n", + " >>> # Weight matrix transpose for backward pass\n", + " >>> W = Tensor([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]) # (3, 2)\n", + " >>> W_T = W.transpose() # (2, 3) - for gradient computation\n", + " >>>\n", + " >>> # Attention mechanism\n", + " >>> Q = Tensor([[1, 2], [3, 4]]) # queries (2, 2)\n", + " >>> K = Tensor([[5, 6], [7, 8]]) # keys (2, 2)\n", + " >>> attention_scores = Q.matmul(K.transpose()) # Q @ K^T\n", + "\n", + " HINTS:\n", + " - Default: transpose last two dimensions (most common case)\n", + " - Use np.transpose() with axes parameter\n", + " - Handle 1D tensors gracefully (transpose is identity)\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " if dim0 is None and dim1 is None:\n", + " # Default: transpose last two dimensions\n", + " if len(self.shape) < 2:\n", + " # For 1D tensors, transpose is identity operation\n", + " return Tensor(self.data.copy())\n", + " else:\n", + " # Transpose last two dimensions (most common in ML)\n", + " axes = list(range(len(self.shape)))\n", + " axes[-2], axes[-1] = axes[-1], axes[-2]\n", + 
" transposed_data = np.transpose(self.data, axes)\n", + " else:\n", + " # Specific dimensions to transpose\n", + " if dim0 is None or dim1 is None:\n", + " raise ValueError(\"Both dim0 and dim1 must be specified for specific dimension transpose\")\n", + "\n", + " # Validate dimensions exist\n", + " if dim0 >= len(self.shape) or dim1 >= len(self.shape) or dim0 < 0 or dim1 < 0:\n", + " raise ValueError(\n", + " f\"Dimension out of range for tensor with shape {self.shape}. \"\n", + " f\"Got dim0={dim0}, dim1={dim1}, but tensor has {len(self.shape)} dimensions.\"\n", + " )\n", + "\n", + " # Create axes list and swap the specified dimensions\n", + " axes = list(range(len(self.shape)))\n", + " axes[dim0], axes[dim1] = axes[dim1], axes[dim0]\n", + " transposed_data = np.transpose(self.data, axes)\n", + "\n", + " return Tensor(transposed_data)\n", + " ### END SOLUTION" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7eb6e317", + "metadata": { + "nbgrader": { + "grade": false, + "grade_id": "reduction-ops", + "solution": true + } + }, + "outputs": [], + "source": [ + " def sum(self, axis=None, keepdims=False):\n", + " \"\"\"\n", + " Sum tensor along specified axis.\n", + "\n", + " TODO: Implement tensor sum with axis control\n", + "\n", + " APPROACH:\n", + " 1. Use NumPy's sum with axis parameter\n", + " 2. Handle axis=None (sum all elements) vs specific axis\n", + " 3. Support keepdims to maintain shape for broadcasting\n", + " 4. 
Return new Tensor with result\n", + "\n", + " EXAMPLE:\n", + " >>> tensor = Tensor([[1, 2], [3, 4]])\n", + " >>> total = tensor.sum() # Sum all elements: 10\n", + " >>> col_sum = tensor.sum(axis=0) # Sum columns: [4, 6]\n", + " >>> row_sum = tensor.sum(axis=1) # Sum rows: [3, 7]\n", + "\n", + " NEURAL NETWORK USAGE:\n", + " >>> # Batch loss computation\n", + " >>> batch_losses = Tensor([0.1, 0.3, 0.2, 0.4]) # Individual losses\n", + " >>> total_loss = batch_losses.sum() # Total: 1.0\n", + " >>> avg_loss = batch_losses.mean() # Average: 0.25\n", + " >>>\n", + " >>> # Global sum pooling (use .mean(axis=(2, 3)) for average pooling)\n", + " >>> feature_maps = Tensor(np.random.rand(32, 256, 7, 7)) # (batch, channels, h, w)\n", + " >>> global_features = feature_maps.sum(axis=(2, 3)) # (batch, channels)\n", + "\n", + " HINTS:\n", + " - np.sum handles all the complexity for us\n", + " - axis=None sums all elements (returns scalar)\n", + " - axis=0 sums along first dimension, axis=1 along second, etc.\n", + " - keepdims=True preserves dimensions for broadcasting\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " result = np.sum(self.data, axis=axis, keepdims=keepdims)\n", + " return Tensor(result)\n", + " ### END SOLUTION\n", + "\n", + " def mean(self, axis=None, keepdims=False):\n", + " \"\"\"\n", + " Compute mean of tensor along specified axis.\n", + "\n", + " Common usage: Batch normalization, loss averaging, global pooling.\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " result = np.mean(self.data, axis=axis, keepdims=keepdims)\n", + " return Tensor(result)\n", + " ### END SOLUTION\n", + "\n", + " def max(self, axis=None, keepdims=False):\n", + " \"\"\"\n", + " Find maximum values along specified axis.\n", + "\n", + " Common usage: Max pooling, finding best predictions, activation clipping.\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " result = np.max(self.data, axis=axis, keepdims=keepdims)\n", + " return Tensor(result)\n", + " ### END SOLUTION" + ] + }, + { + "cell_type": "code", + "execution_count": 
null, + "id": "27313cc9", + "metadata": { + "nbgrader": { + "grade": false, + "grade_id": "gradient-placeholder", + "solution": true + } + }, + "outputs": [], + "source": [ + " def backward(self):\n", + " \"\"\"\n", + " Compute gradients (implemented in Module 05: Autograd).\n", + "\n", + " TODO: Placeholder implementation for gradient computation\n", + "\n", + " STUDENT NOTE:\n", + " This method exists but does nothing until Module 05: Autograd.\n", + " Don't worry about it for now - focus on the basic tensor operations.\n", + "\n", + " In Module 05, we'll implement:\n", + " - Gradient computation via chain rule\n", + " - Automatic differentiation\n", + " - Backpropagation through operations\n", + " - Computation graph construction\n", + "\n", + " FUTURE IMPLEMENTATION PREVIEW:\n", + " ```python\n", + " def backward(self, gradient=None):\n", + " # Module 05 will implement:\n", + " # 1. Set gradient for this tensor\n", + " # 2. Propagate to parent operations\n", + " # 3. Apply chain rule recursively\n", + " # 4. 
Accumulate gradients properly\n", + " pass\n", + " ```\n", + "\n", + " CURRENT BEHAVIOR:\n", + " >>> x = Tensor([1, 2, 3], requires_grad=True)\n", + " >>> y = x * 2\n", + " >>> y.sum().backward() # Calls this method - does nothing\n", + " >>> print(x.grad) # Still None\n", + " None\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # Placeholder - will be implemented in Module 05\n", + " # For now, just ensure it doesn't crash when called\n", + " # This allows students to experiment with gradient syntax\n", + " # without getting confusing errors about missing methods\n", + " pass\n", + " ### END SOLUTION" + ] + }, + { + "cell_type": "markdown", + "id": "581bca49", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### 🧪 Unit Test: Tensor Creation\n", + "\n", + "This test validates our Tensor constructor works correctly with various data types and properly initializes all attributes.\n", + "\n", + "**What we're testing**: Basic tensor creation and attribute setting\n", + "**Why it matters**: Foundation for all other operations - if creation fails, nothing works\n", + "**Expected**: Tensor wraps data correctly with proper attributes and consistent dtype" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5fc1a793", + "metadata": { + "nbgrader": { + "grade": true, + "grade_id": "test-tensor-creation", + "locked": true, + "points": 10 + } + }, + "outputs": [], + "source": [ + "def test_unit_tensor_creation():\n", + " \"\"\"🧪 Test Tensor creation with various data types.\"\"\"\n", + " print(\"🧪 Unit Test: Tensor Creation...\")\n", + "\n", + " # Test scalar creation\n", + " scalar = Tensor(5.0)\n", + " assert scalar.data == 5.0\n", + " assert scalar.shape == ()\n", + " assert scalar.size == 1\n", + " assert scalar.requires_grad == False\n", + " assert scalar.grad is None\n", + " assert scalar.dtype == np.float32\n", + "\n", + " # Test vector creation\n", + " vector = Tensor([1, 2, 3])\n", + " assert 
np.array_equal(vector.data, np.array([1, 2, 3], dtype=np.float32))\n", + " assert vector.shape == (3,)\n", + " assert vector.size == 3\n", + "\n", + " # Test matrix creation\n", + " matrix = Tensor([[1, 2], [3, 4]])\n", + " assert np.array_equal(matrix.data, np.array([[1, 2], [3, 4]], dtype=np.float32))\n", + " assert matrix.shape == (2, 2)\n", + " assert matrix.size == 4\n", + "\n", + " # Test gradient flag (dormant feature)\n", + " grad_tensor = Tensor([1, 2], requires_grad=True)\n", + " assert grad_tensor.requires_grad == True\n", + " assert grad_tensor.grad is None # Still None until Module 05\n", + "\n", + " print(\"✅ Tensor creation works correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_tensor_creation()" + ] + }, + { + "cell_type": "markdown", + "id": "1f8159c6", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## Element-wise Arithmetic Operations\n", + "\n", + "Element-wise operations are the workhorses of neural network computation. They apply the same operation to corresponding elements in tensors, often with broadcasting to handle different shapes elegantly.\n", + "\n", + "### Why Element-wise Operations Matter\n", + "\n", + "In neural networks, element-wise operations appear everywhere:\n", + "- **Activation functions**: Apply ReLU, sigmoid to every element\n", + "- **Batch normalization**: Subtract mean, divide by std per element\n", + "- **Loss computation**: Compare predictions vs. targets element-wise\n", + "- **Gradient updates**: Add scaled gradients to parameters element-wise\n", + "\n", + "### Element-wise Addition: The Foundation\n", + "\n", + "Addition is the simplest and most fundamental operation. 
Understanding it deeply helps with all others.\n", + "\n", + "```\n", + "Element-wise Addition Visual:\n", + "[1, 2, 3] + [4, 5, 6] = [1+4, 2+5, 3+6] = [5, 7, 9]\n", + "\n", + "Matrix Addition:\n", + "[[1, 2]] [[5, 6]] [[1+5, 2+6]] [[6, 8]]\n", + "[[3, 4]] + [[7, 8]] = [[3+7, 4+8]] = [[10, 12]]\n", + "\n", + "Broadcasting Addition (Matrix + Vector):\n", + "[[1, 2]] [10] [[1, 2]] [[10, 10]] [[11, 12]]\n", + "[[3, 4]] + [20] = [[3, 4]] + [[20, 20]] = [[23, 24]]\n", + " ↑ ↑ ↑ ↑ ↑\n", + " (2,2) (2,1) (2,2) broadcast result\n", + "\n", + "Broadcasting Rules:\n", + "1. Start from rightmost dimension\n", + "2. Dimensions must be equal OR one must be 1 OR one must be missing\n", + "3. Missing dimensions are assumed to be 1\n", + "```\n", + "\n", + "**Key Insight**: Broadcasting makes tensors of different shapes compatible by automatically expanding dimensions. This is crucial for batch processing where you often add a single bias vector to an entire batch of data.\n", + "\n", + "**Memory Efficiency**: Broadcasting doesn't actually create expanded copies in memory - NumPy computes results on-the-fly, saving memory." + ] + }, + { + "cell_type": "markdown", + "id": "6d13995a", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 2 + }, + "source": [ + "### Subtraction, Multiplication, and Division\n", + "\n", + "These operations follow the same pattern as addition, working element-wise with broadcasting support. 
Each serves specific purposes in neural networks:\n", + "\n", + "```\n", + "Element-wise Operations in Neural Networks:\n", + "\n", + "┌─────────────────┬─────────────────┬─────────────────┬─────────────────┐\n", + "│ Subtraction │ Multiplication │ Division │ Use Cases │\n", + "├─────────────────┼─────────────────┼─────────────────┼─────────────────┤\n", + "│ [6,8] - [1,2] │ [2,3] * [4,5] │ [8,9] / [2,3] │ • Gradient │\n", + "│ = [5,6] │ = [8,15] │ = [4.0, 3.0] │ computation │\n", + "│ │ │ │ • Normalization │\n", + "│ Center data: │ Gate values: │ Scale features: │ • Loss functions│\n", + "│ x - mean │ x * mask │ x / std │ • Attention │\n", + "└─────────────────┴─────────────────┴─────────────────┴─────────────────┘\n", + "\n", + "Broadcasting with Scalars (very common in ML):\n", + "[1, 2, 3] * 2 = [2, 4, 6] (scale all values)\n", + "[1, 2, 3] - 1 = [0, 1, 2] (shift all values)\n", + "[2, 4, 6] / 2 = [1, 2, 3] (normalize all values)\n", + "\n", + "Real ML Example - Batch Normalization:\n", + "batch_data = [[1, 2], [3, 4], [5, 6]] # Shape: (3, 2)\n", + "mean = [3, 4] # Shape: (2,)\n", + "std = [2, 2] # Shape: (2,)\n", + "\n", + "# Normalize: (x - mean) / std\n", + "normalized = (batch_data - mean) / std\n", + "# Broadcasting: (3,2) - (2,) = (3,2), then (3,2) / (2,) = (3,2)\n", + "```\n", + "\n", + "**Performance Note**: Element-wise operations are highly optimized in NumPy and run efficiently on modern CPUs with vectorization (SIMD instructions)." 
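+    ,
+    "\n",
+    "As a quick sanity check, the batch normalization arithmetic above can be reproduced with plain NumPy (the backend our Tensor wraps); the variable names here are just illustrative:\n",
+    "\n",
+    "```python\n",
+    "import numpy as np\n",
+    "\n",
+    "batch_data = np.array([[1, 2], [3, 4], [5, 6]], dtype=np.float32)  # (3, 2)\n",
+    "mean = np.array([3, 4], dtype=np.float32)  # (2,)\n",
+    "std = np.array([2, 2], dtype=np.float32)   # (2,)\n",
+    "\n",
+    "# Broadcasting: (3,2) - (2,) -> (3,2), then (3,2) / (2,) -> (3,2)\n",
+    "normalized = (batch_data - mean) / std\n",
+    "# rows come out as [-1, -1], [0, 0], [1, 1]\n",
+    "```"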
+ ] + }, + { + "cell_type": "markdown", + "id": "4a34349e", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### 🧪 Unit Test: Arithmetic Operations\n", + "\n", + "This test validates our arithmetic operations work correctly with both tensor-tensor and tensor-scalar operations, including broadcasting behavior.\n", + "\n", + "**What we're testing**: Addition, subtraction, multiplication, division with broadcasting\n", + "**Why it matters**: Foundation for neural network forward passes, batch processing, normalization\n", + "**Expected**: Operations work with both tensors and scalars, proper broadcasting alignment" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dd54a903", + "metadata": { + "nbgrader": { + "grade": true, + "grade_id": "test-arithmetic", + "locked": true, + "points": 15 + } + }, + "outputs": [], + "source": [ + "def test_unit_arithmetic_operations():\n", + " \"\"\"🧪 Test arithmetic operations with broadcasting.\"\"\"\n", + " print(\"🧪 Unit Test: Arithmetic Operations...\")\n", + "\n", + " # Test tensor + tensor\n", + " a = Tensor([1, 2, 3])\n", + " b = Tensor([4, 5, 6])\n", + " result = a + b\n", + " assert np.array_equal(result.data, np.array([5, 7, 9], dtype=np.float32))\n", + "\n", + " # Test tensor + scalar (very common in ML)\n", + " result = a + 10\n", + " assert np.array_equal(result.data, np.array([11, 12, 13], dtype=np.float32))\n", + "\n", + " # Test broadcasting with different shapes (matrix + vector)\n", + " matrix = Tensor([[1, 2], [3, 4]])\n", + " vector = Tensor([10, 20])\n", + " result = matrix + vector\n", + " expected = np.array([[11, 22], [13, 24]], dtype=np.float32)\n", + " assert np.array_equal(result.data, expected)\n", + "\n", + " # Test subtraction (data centering)\n", + " result = b - a\n", + " assert np.array_equal(result.data, np.array([3, 3, 3], dtype=np.float32))\n", + "\n", + " # Test multiplication (scaling)\n", + " result = a * 2\n", + " assert 
np.array_equal(result.data, np.array([2, 4, 6], dtype=np.float32))\n", + "\n", + " # Test division (normalization)\n", + " result = b / 2\n", + " assert np.array_equal(result.data, np.array([2.0, 2.5, 3.0], dtype=np.float32))\n", + "\n", + " # Test chaining operations (common in ML pipelines)\n", + " normalized = (a - 2) / 2 # Center and scale\n", + " expected = np.array([-0.5, 0.0, 0.5], dtype=np.float32)\n", + " assert np.allclose(normalized.data, expected)\n", + "\n", + " print(\"✅ Arithmetic operations work correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_arithmetic_operations()" + ] + }, + { + "cell_type": "markdown", + "id": "23242cc9", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 2 + }, + "source": [ + "## Matrix Multiplication: The Heart of Neural Networks\n", + "\n", + "Matrix multiplication is fundamentally different from element-wise multiplication. It's the operation that gives neural networks their power to transform and combine information across features.\n", + "\n", + "### Why Matrix Multiplication is Central to ML\n", + "\n", + "Every neural network layer essentially performs matrix multiplication:\n", + "\n", + "```\n", + "Linear Layer (the building block of neural networks):\n", + "Input Features × Weight Matrix = Output Features\n", + " (N, D_in) × (D_in, D_out) = (N, D_out)\n", + "\n", + "Real Example - Image Classification:\n", + "Flattened Image × Hidden Weights = Hidden Features\n", + " (32, 784) × (784, 256) = (32, 256)\n", + " ↑ ↑ ↑\n", + " 32 images 784→256 transform 32 feature vectors\n", + "```\n", + "\n", + "### Matrix Multiplication Visualization\n", + "\n", + "```\n", + "Matrix Multiplication Process:\n", + " A (2×3) B (3×2) C (2×2)\n", + " ┌ ┐ ┌ ┐ ┌ ┐\n", + " │ 1 2 3 │ │ 7 8 │ │ 1×7+2×9+3×1 │ ┌ ┐\n", + " │ │ × │ 9 1 │ = │ │ = │ 28 16│\n", + " │ 4 5 6 │ │ 1 2 │ │ 4×7+5×9+6×1 │ │ 79 49│\n", + " └ ┘ └ ┘ └ ┘ └ ┘\n", + "\n", + "Computation Breakdown:\n", + "C[0,0] = A[0,:] · B[:,0] = 
[1,2,3] · [7,9,1] = 1×7 + 2×9 + 3×1 = 28\n", + "C[0,1] = A[0,:] · B[:,1] = [1,2,3] · [8,1,2] = 1×8 + 2×1 + 3×2 = 16\n", + "C[1,0] = A[1,:] · B[:,0] = [4,5,6] · [7,9,1] = 4×7 + 5×9 + 6×1 = 79\n", + "C[1,1] = A[1,:] · B[:,1] = [4,5,6] · [8,1,2] = 4×8 + 5×1 + 6×2 = 49\n", + "\n", + "Key Rule: Inner dimensions must match!\n", + "A(m,n) @ B(n,p) = C(m,p)\n", + " ↑ ↑\n", + " these must be equal\n", + "```\n", + "\n", + "### Computational Complexity and Performance\n", + "\n", + "```\n", + "Computational Cost:\n", + "For C = A @ B where A is (M×K), B is (K×N):\n", + "- Multiplications: M × N × K\n", + "- Additions: M × N × (K-1) ≈ M × N × K\n", + "- Total FLOPs: ≈ 2 × M × N × K\n", + "\n", + "Example: (1000×1000) @ (1000×1000)\n", + "- FLOPs: 2 × 1000³ = 2 billion operations\n", + "- On 1 GHz CPU: ~2 seconds if no optimization\n", + "- With optimized BLAS: ~0.1 seconds (20× speedup!)\n", + "\n", + "Memory Access Pattern:\n", + "A: M×K (row-wise access) ✓ Good cache locality\n", + "B: K×N (column-wise) ✗ Poor cache locality\n", + "C: M×N (row-wise write) ✓ Good cache locality\n", + "\n", + "This is why optimized libraries like OpenBLAS, Intel MKL use:\n", + "- Blocking algorithms (process in cache-sized chunks)\n", + "- Vectorization (SIMD instructions)\n", + "- Parallelization (multiple cores)\n", + "```\n", + "\n", + "### Neural Network Context\n", + "\n", + "```\n", + "Multi-layer Neural Network:\n", + "Input (batch=32, features=784)\n", + " ↓ W1: (784, 256)\n", + "Hidden1 (batch=32, features=256)\n", + " ↓ W2: (256, 128)\n", + "Hidden2 (batch=32, features=128)\n", + " ↓ W3: (128, 10)\n", + "Output (batch=32, classes=10)\n", + "\n", + "Each arrow represents a matrix multiplication:\n", + "- Forward pass: 3 matrix multiplications\n", + "- Backward pass: 3 more matrix multiplications (with transposes)\n", + "- Total: 6 matrix mults per forward+backward pass\n", + "\n", + "For training batch: 32 × (784×256 + 256×128 + 128×10) FLOPs\n", + "= 32 × (200,704 + 32,768 + 1,280) 
= 32 × 234,752 = 7.5M FLOPs per batch\n", + "```\n", + "\n", + "This is why GPU acceleration matters - modern GPUs can perform thousands of these operations in parallel!" + ] + }, + { + "cell_type": "markdown", + "id": "3d81481f", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### 🧪 Unit Test: Matrix Multiplication\n", + "\n", + "This test validates matrix multiplication works correctly with proper shape checking and error handling.\n", + "\n", + "**What we're testing**: Matrix multiplication with shape validation and edge cases\n", + "**Why it matters**: Core operation in neural networks (linear layers, attention mechanisms)\n", + "**Expected**: Correct results for valid shapes, clear error messages for invalid shapes" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1564ad57", + "metadata": { + "nbgrader": { + "grade": true, + "grade_id": "test-matmul", + "locked": true, + "points": 15 + } + }, + "outputs": [], + "source": [ + "def test_unit_matrix_multiplication():\n", + " \"\"\"🧪 Test matrix multiplication operations.\"\"\"\n", + " print(\"🧪 Unit Test: Matrix Multiplication...\")\n", + "\n", + " # Test 2×2 matrix multiplication (basic case)\n", + " a = Tensor([[1, 2], [3, 4]]) # 2×2\n", + " b = Tensor([[5, 6], [7, 8]]) # 2×2\n", + " result = a.matmul(b)\n", + " # Expected: [[1×5+2×7, 1×6+2×8], [3×5+4×7, 3×6+4×8]] = [[19, 22], [43, 50]]\n", + " expected = np.array([[19, 22], [43, 50]], dtype=np.float32)\n", + " assert np.array_equal(result.data, expected)\n", + "\n", + " # Test rectangular matrices (common in neural networks)\n", + " c = Tensor([[1, 2, 3], [4, 5, 6]]) # 2×3 (like batch_size=2, features=3)\n", + " d = Tensor([[7, 8], [9, 10], [11, 12]]) # 3×2 (like features=3, outputs=2)\n", + " result = c.matmul(d)\n", + " # Expected: [[1×7+2×9+3×11, 1×8+2×10+3×12], [4×7+5×9+6×11, 4×8+5×10+6×12]]\n", + " expected = np.array([[58, 64], [139, 154]], dtype=np.float32)\n", + " assert 
np.array_equal(result.data, expected)\n", + "\n", + " # Test matrix-vector multiplication (common in forward pass)\n", + " matrix = Tensor([[1, 2, 3], [4, 5, 6]]) # 2×3\n", + " vector = Tensor([1, 2, 3]) # 3×1 (conceptually)\n", + " result = matrix.matmul(vector)\n", + " # Expected: [1×1+2×2+3×3, 4×1+5×2+6×3] = [14, 32]\n", + " expected = np.array([14, 32], dtype=np.float32)\n", + " assert np.array_equal(result.data, expected)\n", + "\n", + " # Test shape validation - should raise clear error\n", + " try:\n", + " incompatible_a = Tensor([[1, 2]]) # 1×2\n", + " incompatible_b = Tensor([[1], [2], [3]]) # 3×1\n", + " incompatible_a.matmul(incompatible_b) # 1×2 @ 3×1 should fail (2 ≠ 3)\n", + " assert False, \"Should have raised ValueError for incompatible shapes\"\n", + " except ValueError as e:\n", + " assert \"Inner dimensions must match\" in str(e)\n", + " assert \"2 ≠ 3\" in str(e) # Should show specific dimensions\n", + "\n", + " print(\"✅ Matrix multiplication works correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_matrix_multiplication()" + ] + }, + { + "cell_type": "markdown", + "id": "78b99f8b", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 2 + }, + "source": [ + "## Shape Manipulation: Reshape and Transpose\n", + "\n", + "Neural networks constantly change tensor shapes to match layer requirements. 
Understanding these operations is crucial for data flow through networks.\n", + "\n", + "### Why Shape Manipulation Matters\n", + "\n", + "Real neural networks require constant shape changes:\n", + "\n", + "```\n", + "CNN Data Flow Example:\n", + "Input Image: (32, 3, 224, 224) # batch, channels, height, width\n", + " ↓ Convolutional layers\n", + "Feature Maps: (32, 512, 7, 7) # batch, features, spatial\n", + " ↓ Global Average Pool\n", + "Pooled: (32, 512, 1, 1) # batch, features, 1, 1\n", + " ↓ Flatten for classifier\n", + "Flattened: (32, 512) # batch, features\n", + " ↓ Linear classifier\n", + "Output: (32, 1000) # batch, classes\n", + "\n", + "Each ↓ involves reshape or view operations!\n", + "```\n", + "\n", + "### Reshape: Changing Interpretation of the Same Data\n", + "\n", + "```\n", + "Reshaping (changing dimensions without changing data):\n", + "Original: [1, 2, 3, 4, 5, 6] (shape: (6,))\n", + " ↓ reshape(2, 3)\n", + "Result: [[1, 2, 3], (shape: (2, 3))\n", + " [4, 5, 6]]\n", + "\n", + "Memory Layout (unchanged):\n", + "Before: [1][2][3][4][5][6]\n", + "After: [1][2][3][4][5][6] ← Same memory, different interpretation\n", + "\n", + "Key Insight: Reshape is O(1) operation - no data copying!\n", + "Just changes how we interpret the memory layout.\n", + "\n", + "Common ML Reshapes:\n", + "┌─────────────────────┬─────────────────────┬─────────────────────┐\n", + "│ Flatten for MLP │ Unflatten for CNN │ Batch Dimension │\n", + "├─────────────────────┼─────────────────────┼─────────────────────┤\n", + "│ (N,H,W,C) → (N,H×W×C) │ (N,D) → (N,H,W,C) │ (H,W) → (1,H,W) │\n", + "│ Images to vectors │ Vectors to images │ Add batch dimension │\n", + "└─────────────────────┴─────────────────────┴─────────────────────┘\n", + "```\n", + "\n", + "### Transpose: Swapping Dimensions\n", + "\n", + "```\n", + "Transposing (swapping dimensions - data rearrangement):\n", + "Original: [[1, 2, 3], (shape: (2, 3))\n", + " [4, 5, 6]]\n", + " ↓ transpose()\n", + "Result: [[1, 4], 
(shape: (3, 2))\n", + " [2, 5],\n", + " [3, 6]]\n", + "\n", + "Memory Layout (rearranged):\n", + "Before: [1][2][3][4][5][6]\n", + "After: [1][4][2][5][3][6] ← Data actually moves in memory\n", + "\n", + "Key Insight: Transpose involves data movement - more expensive than reshape.\n", + "\n", + "Neural Network Usage:\n", + "┌─────────────────────┬─────────────────────┬─────────────────────┐\n", + "│ Weight Matrices │ Attention Mechanism │ Gradient Computation│\n", + "├─────────────────────┼─────────────────────┼─────────────────────┤\n", + "│ Forward: X @ W │ Q @ K^T attention │ ∂L/∂W = X^T @ ∂L/∂Y│\n", + "│ Backward: X @ W^T │ scores │ │\n", + "└─────────────────────┴─────────────────────┴─────────────────────┘\n", + "```\n", + "\n", + "### Performance Implications\n", + "\n", + "```\n", + "Operation Performance (for 1000×1000 matrix):\n", + "┌─────────────────┬──────────────┬─────────────────┬─────────────────┐\n", + "│ Operation │ Time │ Memory Access │ Cache Behavior │\n", + "├─────────────────┼──────────────┼─────────────────┼─────────────────┤\n", + "│ reshape() │ ~0.001 ms │ No data copy │ No cache impact │\n", + "│ transpose() │ ~10 ms │ Full data copy │ Poor locality │\n", + "│ view() (future) │ ~0.001 ms │ No data copy │ No cache impact │\n", + "└─────────────────┴──────────────┴─────────────────┴─────────────────┘\n", + "\n", + "Why transpose() is slower:\n", + "- Must rearrange data in memory\n", + "- Poor cache locality (accessing columns)\n", + "- Can't be parallelized easily\n", + "```\n", + "\n", + "This is why frameworks like PyTorch often use \"lazy\" transpose operations that defer the actual data movement until necessary." 
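+    ,
+    "\n",
+    "For the NumPy arrays underneath our Tensor, this laziness is easy to observe: NumPy's reshape (on a contiguous array) and transpose both return views, and the data movement cost appears only when a contiguous layout is actually materialized:\n",
+    "\n",
+    "```python\n",
+    "import numpy as np\n",
+    "\n",
+    "a = np.arange(6, dtype=np.float32).reshape(2, 3)\n",
+    "r = a.reshape(3, 2)  # same memory, new shape interpretation\n",
+    "t = a.T              # strided view - no data moved yet\n",
+    "print(np.shares_memory(a, r), np.shares_memory(a, t))  # True True\n",
+    "\n",
+    "# Data movement happens only when a contiguous copy is forced:\n",
+    "t_copy = np.ascontiguousarray(t)\n",
+    "print(np.shares_memory(a, t_copy))  # False\n",
+    "```"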
+ ] + }, + { + "cell_type": "markdown", + "id": "2b16da4b", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### 🧪 Unit Test: Shape Manipulation\n", + "\n", + "This test validates reshape and transpose operations work correctly with validation and edge cases.\n", + "\n", + "**What we're testing**: Reshape and transpose operations with proper error handling\n", + "**Why it matters**: Essential for data flow in neural networks, CNN/RNN architectures\n", + "**Expected**: Correct shape changes, proper error handling for invalid operations" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "259f2769", + "metadata": { + "nbgrader": { + "grade": true, + "grade_id": "test-shape-ops", + "locked": true, + "points": 15 + } + }, + "outputs": [], + "source": [ + "def test_unit_shape_manipulation():\n", + " \"\"\"🧪 Test reshape and transpose operations.\"\"\"\n", + " print(\"🧪 Unit Test: Shape Manipulation...\")\n", + "\n", + " # Test basic reshape (flatten → matrix)\n", + " tensor = Tensor([1, 2, 3, 4, 5, 6]) # Shape: (6,)\n", + " reshaped = tensor.reshape(2, 3) # Shape: (2, 3)\n", + " assert reshaped.shape == (2, 3)\n", + " expected = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32)\n", + " assert np.array_equal(reshaped.data, expected)\n", + "\n", + " # Test reshape with tuple (alternative calling style)\n", + " reshaped2 = tensor.reshape((3, 2)) # Shape: (3, 2)\n", + " assert reshaped2.shape == (3, 2)\n", + " expected2 = np.array([[1, 2], [3, 4], [5, 6]], dtype=np.float32)\n", + " assert np.array_equal(reshaped2.data, expected2)\n", + "\n", + " # Test reshape with -1 (automatic dimension inference)\n", + " auto_reshaped = tensor.reshape(2, -1) # Should infer -1 as 3\n", + " assert auto_reshaped.shape == (2, 3)\n", + "\n", + " # Test reshape validation - should raise error for incompatible sizes\n", + " try:\n", + " tensor.reshape(2, 2) # 6 elements can't fit in 2×2=4\n", + " assert False, \"Should have raised 
ValueError\"\n", + " except ValueError as e:\n", + " assert \"Total elements must match\" in str(e)\n", + " assert \"6 ≠ 4\" in str(e)\n", + "\n", + " # Test matrix transpose (most common case)\n", + " matrix = Tensor([[1, 2, 3], [4, 5, 6]]) # (2, 3)\n", + " transposed = matrix.transpose() # (3, 2)\n", + " assert transposed.shape == (3, 2)\n", + " expected = np.array([[1, 4], [2, 5], [3, 6]], dtype=np.float32)\n", + " assert np.array_equal(transposed.data, expected)\n", + "\n", + " # Test 1D transpose (should be identity)\n", + " vector = Tensor([1, 2, 3])\n", + " vector_t = vector.transpose()\n", + " assert np.array_equal(vector.data, vector_t.data)\n", + "\n", + " # Test specific dimension transpose\n", + " tensor_3d = Tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]]]) # (2, 2, 2)\n", + " swapped = tensor_3d.transpose(0, 2) # Swap first and last dimensions\n", + " assert swapped.shape == (2, 2, 2) # Same shape but data rearranged\n", + "\n", + " # Test neural network reshape pattern (flatten for MLP)\n", + " batch_images = Tensor(np.random.rand(2, 3, 4)) # (batch=2, height=3, width=4)\n", + " flattened = batch_images.reshape(2, -1) # (batch=2, features=12)\n", + " assert flattened.shape == (2, 12)\n", + "\n", + " print(\"✅ Shape manipulation works correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_shape_manipulation()" + ] + }, + { + "cell_type": "markdown", + "id": "81696e32", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 2 + }, + "source": [ + "## Reduction Operations: Aggregating Information\n", + "\n", + "Reduction operations collapse dimensions by aggregating data, which is essential for computing statistics, losses, and preparing data for different layers.\n", + "\n", + "### Why Reductions are Crucial in ML\n", + "\n", + "Reduction operations appear throughout neural networks:\n", + "\n", + "```\n", + "Common ML Reduction Patterns:\n", + "\n", + "┌─────────────────────┬─────────────────────┬─────────────────────┐\n", 
+ "│ Loss Computation │ Batch Normalization │ Global Pooling │\n", + "├─────────────────────┼─────────────────────┼─────────────────────┤\n", + "│ Per-sample losses → │ Batch statistics → │ Feature maps → │\n", + "│ Single batch loss │ Normalization │ Single features │\n", + "│ │ │ │\n", + "│ losses.mean() │ batch.mean(axis=0) │ fmaps.mean(axis=(2,3))│\n", + "│ (N,) → scalar │ (N,D) → (D,) │ (N,C,H,W) → (N,C) │\n", + "└─────────────────────┴─────────────────────┴─────────────────────┘\n", + "\n", + "Real Examples:\n", + "• Cross-entropy loss: -log(predictions).mean() [average over batch]\n", + "• Batch norm: (x - x.mean()) / x.std() [normalize each feature]\n", + "• Global avg pool: features.mean(dim=(2,3)) [spatial → scalar per channel]\n", + "```\n", + "\n", + "### Understanding Axis Operations\n", + "\n", + "```\n", + "Visual Axis Understanding:\n", + "Matrix: [[1, 2, 3], All reductions operate on this data\n", + " [4, 5, 6]] Shape: (2, 3)\n", + "\n", + " axis=0 (↓)\n", + " ┌─────────┐\n", + "axis=1 │ 1 2 3 │ → axis=1 reduces across columns (→)\n", + " (→) │ 4 5 6 │ → Result shape: (2,) [one value per row]\n", + " └─────────┘\n", + " ↓ ↓ ↓\n", + " axis=0 reduces down rows (↓)\n", + " Result shape: (3,) [one value per column]\n", + "\n", + "Reduction Results:\n", + "├─ .sum() → 21 (sum all: 1+2+3+4+5+6)\n", + "├─ .sum(axis=0) → [5, 7, 9] (sum columns: [1+4, 2+5, 3+6])\n", + "├─ .sum(axis=1) → [6, 15] (sum rows: [1+2+3, 4+5+6])\n", + "├─ .mean() → 3.5 (average all: 21/6)\n", + "├─ .mean(axis=0) → [2.5, 3.5, 4.5] (average columns)\n", + "└─ .max() → 6 (maximum element)\n", + "\n", + "3D Tensor Example (batch, height, width):\n", + "data.shape = (2, 3, 4) # 2 samples, 3×4 images\n", + "│\n", + "├─ .sum(axis=0) → (3, 4) # Sum across batch dimension\n", + "├─ .sum(axis=1) → (2, 4) # Sum across height dimension\n", + "├─ .sum(axis=2) → (2, 3) # Sum across width dimension\n", + "└─ .sum(axis=(1,2)) → (2,) # Sum across both spatial dims (global pool)\n", + "```\n", + 
"\n", + "### Memory and Performance Considerations\n", + "\n", + "```\n", + "Reduction Performance:\n", + "┌─────────────────┬──────────────┬─────────────────┬─────────────────┐\n", + "│ Operation │ Time Complex │ Memory Access │ Cache Behavior │\n", + "├─────────────────┼──────────────┼─────────────────┼─────────────────┤\n", + "│ .sum() │ O(N) │ Sequential read │ Excellent │\n", + "│ .sum(axis=0) │ O(N) │ Column access │ Poor (strided) │\n", + "│ .sum(axis=1) │ O(N) │ Row access │ Excellent │\n", + "│ .mean() │ O(N) │ Sequential read │ Excellent │\n", + "│ .max() │ O(N) │ Sequential read │ Excellent │\n", + "└─────────────────┴──────────────┴─────────────────┴─────────────────┘\n", + "\n", + "Why axis=0 is slower:\n", + "- Accesses elements with large strides\n", + "- Poor cache locality (jumping rows)\n", + "- Less vectorization-friendly\n", + "\n", + "Optimization strategies:\n", + "- Prefer axis=-1 operations when possible\n", + "- Use keepdims=True to maintain shape for broadcasting\n", + "- Consider reshaping before reduction for better cache behavior\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "cd59ecd2", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### 🧪 Unit Test: Reduction Operations\n", + "\n", + "This test validates reduction operations work correctly with axis control and maintain proper shapes.\n", + "\n", + "**What we're testing**: Sum, mean, max operations with axis parameter and keepdims\n", + "**Why it matters**: Essential for loss computation, batch processing, and pooling operations\n", + "**Expected**: Correct reduction along specified axes with proper shape handling" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "84d2a40c", + "metadata": { + "nbgrader": { + "grade": true, + "grade_id": "test-reductions", + "locked": true, + "points": 10 + } + }, + "outputs": [], + "source": [ + "def test_unit_reduction_operations():\n", + " \"\"\"🧪 Test reduction 
operations.\"\"\"\n", + " print(\"🧪 Unit Test: Reduction Operations...\")\n", + "\n", + " matrix = Tensor([[1, 2, 3], [4, 5, 6]]) # Shape: (2, 3)\n", + "\n", + " # Test sum all elements (common for loss computation)\n", + " total = matrix.sum()\n", + " assert total.data == 21.0 # 1+2+3+4+5+6\n", + " assert total.shape == () # Scalar result\n", + "\n", + " # Test sum along axis 0 (columns) - batch dimension reduction\n", + " col_sum = matrix.sum(axis=0)\n", + " expected_col = np.array([5, 7, 9], dtype=np.float32) # [1+4, 2+5, 3+6]\n", + " assert np.array_equal(col_sum.data, expected_col)\n", + " assert col_sum.shape == (3,)\n", + "\n", + " # Test sum along axis 1 (rows) - feature dimension reduction\n", + " row_sum = matrix.sum(axis=1)\n", + " expected_row = np.array([6, 15], dtype=np.float32) # [1+2+3, 4+5+6]\n", + " assert np.array_equal(row_sum.data, expected_row)\n", + " assert row_sum.shape == (2,)\n", + "\n", + " # Test mean (average loss computation)\n", + " avg = matrix.mean()\n", + " assert np.isclose(avg.data, 3.5) # 21/6\n", + " assert avg.shape == ()\n", + "\n", + " # Test mean along axis (batch normalization pattern)\n", + " col_mean = matrix.mean(axis=0)\n", + " expected_mean = np.array([2.5, 3.5, 4.5], dtype=np.float32) # [5/2, 7/2, 9/2]\n", + " assert np.allclose(col_mean.data, expected_mean)\n", + "\n", + " # Test max (finding best predictions)\n", + " maximum = matrix.max()\n", + " assert maximum.data == 6.0\n", + " assert maximum.shape == ()\n", + "\n", + " # Test max along axis (argmax-like operation)\n", + " row_max = matrix.max(axis=1)\n", + " expected_max = np.array([3, 6], dtype=np.float32) # [max(1,2,3), max(4,5,6)]\n", + " assert np.array_equal(row_max.data, expected_max)\n", + "\n", + " # Test keepdims (important for broadcasting)\n", + " sum_keepdims = matrix.sum(axis=1, keepdims=True)\n", + " assert sum_keepdims.shape == (2, 1) # Maintains 2D shape\n", + " expected_keepdims = np.array([[6], [15]], dtype=np.float32)\n", + " assert 
np.array_equal(sum_keepdims.data, expected_keepdims)\n", + "\n", + " # Test 3D reduction (simulating global average pooling)\n", + " tensor_3d = Tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]]]) # (2, 2, 2)\n", + " spatial_mean = tensor_3d.mean(axis=(1, 2)) # Average across spatial dimensions\n", + " assert spatial_mean.shape == (2,) # One value per batch item\n", + "\n", + " print(\"✅ Reduction operations work correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_reduction_operations()" + ] + }, + { + "cell_type": "markdown", + "id": "3265a10f", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 2 + }, + "source": [ + "## Gradient Features: Preparing for Module 05\n", + "\n", + "Our Tensor includes dormant gradient features that will spring to life in Module 05. For now, they exist but do nothing - this design choice ensures a consistent interface throughout the course.\n", + "\n", + "### Why Include Gradient Features Now?\n", + "\n", + "```\n", + "Gradient System Evolution:\n", + "Module 01: Tensor with dormant gradients\n", + " ┌─────────────────────────────────┐\n", + " │ Tensor │\n", + " │ • data: actual values │\n", + " │ • requires_grad: False │ ← Present but unused\n", + " │ • grad: None │ ← Present but stays None\n", + " │ • backward(): pass │ ← Present but does nothing\n", + " └─────────────────────────────────┘\n", + " ↓ Module 05 activates these\n", + "Module 05: Tensor with active gradients\n", + " ┌─────────────────────────────────┐\n", + " │ Tensor │\n", + " │ • data: actual values │\n", + " │ • requires_grad: True │ ← Now controls gradient tracking\n", + " │ • grad: computed gradients │ ← Now accumulates gradients\n", + " │ • backward(): computes grads │ ← Now implements chain rule\n", + " └─────────────────────────────────┘\n", + "```\n", + "\n", + "### Design Benefits\n", + "\n", + "**Consistency**: Same Tensor class interface throughout all modules\n", + "- No confusing Variable vs. 
Tensor distinction (unlike early PyTorch)\n", + "- Students never need to learn a \"new\" Tensor class\n", + "- IDE autocomplete works from day one\n", + "\n", + "**Gradual Complexity**: Features activate when students are ready\n", + "- Module 01-04: Ignore gradient features, focus on operations\n", + "- Module 05: Gradient features \"turn on\" magically\n", + "- No cognitive overload in early modules\n", + "\n", + "**Future-Proof**: Easy to extend without breaking changes\n", + "- Additional features can be added as dormant initially\n", + "- No monkey-patching or dynamic class modification\n", + "- Clean evolution path\n", + "\n", + "### Current State (Module 01)\n", + "\n", + "```\n", + "Gradient Features - Current Behavior:\n", + "┌─────────────────────────────────────────────────────────┐\n", + "│ Feature │ Current State │ Module 05 State │\n", + "├─────────────────────────────────────────────────────────┤\n", + "│ requires_grad │ False │ True (when needed) │\n", + "│ grad │ None │ np.array(...) │\n", + "│ backward() │ pass (no-op) │ Chain rule impl │\n", + "│ Operation chaining│ Not tracked │ Computation graph │\n", + "└─────────────────────────────────────────────────────────┘\n", + "\n", + "Student Experience:\n", + "• Can call .backward() without errors (just does nothing)\n", + "• Can set requires_grad=True (just gets stored)\n", + "• Focus on understanding tensor operations first\n", + "• Gradients remain \"mysterious\" until Module 05 reveals them\n", + "```\n", + "\n", + "This approach matches the pedagogical principle of \"progressive disclosure\" - reveal complexity only when students are ready to handle it." + ] + }, + { + "cell_type": "markdown", + "id": "f2929723", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 2 + }, + "source": [ + "## 4. Integration: Bringing It Together\n", + "\n", + "Let's test how our Tensor operations work together in realistic scenarios that mirror neural network computations. 
This integration demonstrates that our individual operations combine correctly for complex ML workflows.\n", + "\n", + "### Neural Network Layer Simulation\n", + "\n", + "The fundamental building block of neural networks is the linear transformation: **y = xW + b**\n", + "\n", + "```\n", + "Linear Layer Forward Pass: y = xW + b\n", + "\n", + "Input Features → Weight Matrix → Matrix Multiply → Add Bias → Output Features\n", + " (batch, in) (in, out) (batch, out) (batch, out) (batch, out)\n", + "\n", + "Step-by-Step Breakdown:\n", + "1. Input: X shape (batch_size, input_features)\n", + "2. Weight: W shape (input_features, output_features)\n", + "3. Matmul: XW shape (batch_size, output_features)\n", + "4. Bias: b shape (output_features,)\n", + "5. Result: XW + b shape (batch_size, output_features)\n", + "\n", + "Example Flow:\n", + "Input: [[1, 2, 3], Weight: [[0.1, 0.2], Bias: [0.1, 0.2]\n", + " [4, 5, 6]] [0.3, 0.4],\n", + " (2, 3) [0.5, 0.6]]\n", + " (3, 2)\n", + "\n", + "Step 1: Matrix Multiply\n", + "[[1, 2, 3]] @ [[0.1, 0.2]] = [[1×0.1+2×0.3+3×0.5, 1×0.2+2×0.4+3×0.6]]\n", + "[[4, 5, 6]] [[0.3, 0.4]] [[4×0.1+5×0.3+6×0.5, 4×0.2+5×0.4+6×0.6]]\n", + " [[0.5, 0.6]]\n", + " = [[2.2, 2.8],\n", + " [4.9, 6.4]]\n", + "\n", + "Step 2: Add Bias (Broadcasting)\n", + "[[2.2, 2.8]] + [0.1, 0.2] = [[2.3, 3.0],\n", + " [4.9, 6.4]] [5.0, 6.6]]\n", + "\n", + "This is the foundation of every neural network layer!\n", + "```\n", + "\n", + "### Why This Integration Matters\n", + "\n", + "This simulation shows how our basic operations combine to create the computational building blocks of neural networks:\n", + "\n", + "- **Matrix Multiplication**: Transforms input features into new feature space\n", + "- **Broadcasting Addition**: Applies learned biases efficiently across batches\n", + "- **Shape Handling**: Ensures data flows correctly through layers\n", + "- **Memory Management**: Creates new tensors without corrupting inputs\n", + "\n", + "Every layer in a neural network - from 
simple MLPs to complex transformers - uses this same pattern." + ] + }, + { + "cell_type": "markdown", + "id": "ab817462", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "## 🧪 Module Integration Test\n", + "\n", + "Final validation that everything works together correctly before module completion." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ae410a85", + "metadata": { + "lines_to_next_cell": 2, + "nbgrader": { + "grade": true, + "grade_id": "module-integration", + "locked": true, + "points": 20 + } + }, + "outputs": [], + "source": [ + "def test_module():\n", + " \"\"\"\n", + " Comprehensive test of entire module functionality.\n", + "\n", + " This final test runs before module summary to ensure:\n", + " - All unit tests pass\n", + " - Functions work together correctly\n", + " - Module is ready for integration with TinyTorch\n", + " \"\"\"\n", + " print(\"🧪 RUNNING MODULE INTEGRATION TEST\")\n", + " print(\"=\" * 50)\n", + "\n", + " # Run all unit tests\n", + " print(\"Running unit tests...\")\n", + " test_unit_tensor_creation()\n", + " test_unit_arithmetic_operations()\n", + " test_unit_matrix_multiplication()\n", + " test_unit_shape_manipulation()\n", + " test_unit_reduction_operations()\n", + "\n", + " print(\"\\nRunning integration scenarios...\")\n", + "\n", + " # Test realistic neural network computation\n", + " print(\"🧪 Integration Test: Two-Layer Neural Network...\")\n", + "\n", + " # Create input data (2 samples, 3 features)\n", + " x = Tensor([[1, 2, 3], [4, 5, 6]])\n", + "\n", + " # First layer: 3 inputs → 4 hidden units\n", + " W1 = Tensor([[0.1, 0.2, 0.3, 0.4],\n", + " [0.5, 0.6, 0.7, 0.8],\n", + " [0.9, 1.0, 1.1, 1.2]])\n", + " b1 = Tensor([0.1, 0.2, 0.3, 0.4])\n", + "\n", + " # Forward pass: hidden = xW1 + b1\n", + " hidden = x.matmul(W1) + b1\n", + " assert hidden.shape == (2, 4), f\"Expected (2, 4), got {hidden.shape}\"\n", + "\n", + " # Second layer: 4 hidden → 2 outputs\n", + 
" W2 = Tensor([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]])\n", + " b2 = Tensor([0.1, 0.2])\n", + "\n", + " # Output layer: output = hiddenW2 + b2\n", + " output = hidden.matmul(W2) + b2\n", + " assert output.shape == (2, 2), f\"Expected (2, 2), got {output.shape}\"\n", + "\n", + " # Verify data flows correctly (no NaN, reasonable values)\n", + " assert not np.isnan(output.data).any(), \"Output contains NaN values\"\n", + " assert np.isfinite(output.data).all(), \"Output contains infinite values\"\n", + "\n", + " print(\"✅ Two-layer neural network computation works!\")\n", + "\n", + " # Test gradient attributes are preserved and functional\n", + " print(\"🧪 Integration Test: Gradient System Readiness...\")\n", + " grad_tensor = Tensor([1, 2, 3], requires_grad=True)\n", + " result = grad_tensor + 5\n", + " assert grad_tensor.requires_grad == True, \"requires_grad not preserved\"\n", + " assert grad_tensor.grad is None, \"grad should still be None\"\n", + "\n", + " # Test backward() doesn't crash (even though it does nothing)\n", + " grad_tensor.backward() # Should not raise any exception\n", + "\n", + " print(\"✅ Gradient system ready for Module 05!\")\n", + "\n", + " # Test complex shape manipulations\n", + " print(\"🧪 Integration Test: Complex Shape Operations...\")\n", + " data = Tensor([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])\n", + "\n", + " # Reshape to 3D tensor (simulating batch processing)\n", + " tensor_3d = data.reshape(2, 2, 3) # (batch=2, height=2, width=3)\n", + " assert tensor_3d.shape == (2, 2, 3)\n", + "\n", + " # Global average pooling simulation\n", + " pooled = tensor_3d.mean(axis=(1, 2)) # Average across spatial dimensions\n", + " assert pooled.shape == (2,), f\"Expected (2,), got {pooled.shape}\"\n", + "\n", + " # Flatten for MLP\n", + " flattened = tensor_3d.reshape(2, -1) # (batch, features)\n", + " assert flattened.shape == (2, 6)\n", + "\n", + " # Transpose for different operations\n", + " transposed = tensor_3d.transpose() # Should 
transpose last two dims\n", + " assert transposed.shape == (2, 3, 2)\n", + "\n", + " print(\"✅ Complex shape operations work!\")\n", + "\n", + " # Test broadcasting edge cases\n", + " print(\"🧪 Integration Test: Broadcasting Edge Cases...\")\n", + "\n", + " # Scalar broadcasting\n", + " scalar = Tensor(5.0)\n", + " vector = Tensor([1, 2, 3])\n", + " result = scalar + vector # Should broadcast scalar to vector shape\n", + " expected = np.array([6, 7, 8], dtype=np.float32)\n", + " assert np.array_equal(result.data, expected)\n", + "\n", + " # Matrix + vector broadcasting\n", + " matrix = Tensor([[1, 2], [3, 4]])\n", + " vec = Tensor([10, 20])\n", + " result = matrix + vec\n", + " expected = np.array([[11, 22], [13, 24]], dtype=np.float32)\n", + " assert np.array_equal(result.data, expected)\n", + "\n", + " print(\"✅ Broadcasting edge cases work!\")\n", + "\n", + " print(\"\\n\" + \"=\" * 50)\n", + " print(\"🎉 ALL TESTS PASSED! Module ready for export.\")\n", + " print(\"Run: tito module complete 01_tensor\")\n", + "\n", + "# Run comprehensive module test\n", + "if __name__ == \"__main__\":\n", + " test_module()" + ] + }, + { + "cell_type": "markdown", + "id": "e0031c02", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 🎯 MODULE SUMMARY: Tensor Foundation\n", + "\n", + "Congratulations! 
You've built the foundational Tensor class that powers all machine learning operations!\n", + "\n", + "### Key Accomplishments\n", + "- **Built a complete Tensor class** with arithmetic operations, matrix multiplication, and shape manipulation\n", + "- **Implemented broadcasting semantics** that match NumPy for automatic shape alignment\n", + "- **Created dormant gradient features** that will activate in Module 05 (autograd)\n", + "- **Added comprehensive ASCII diagrams** showing tensor operations visually\n", + "- **All methods defined INSIDE the class** (no monkey-patching) for clean, maintainable code\n", + "- **All tests pass ✅** (validated by `test_module()`)\n", + "\n", + "### Systems Insights Discovered\n", + "- **Memory scaling**: Matrix operations create new tensors (3× memory during computation)\n", + "- **Broadcasting efficiency**: NumPy's automatic shape alignment vs. explicit operations\n", + "- **Shape validation trade-offs**: Clear errors vs. performance in tight loops\n", + "- **Architecture decisions**: Dormant features vs. inheritance for clean evolution\n", + "\n", + "### Ready for Next Steps\n", + "Your Tensor implementation enables all future modules! The dormant gradient features will spring to life in Module 05, and every neural network component will build on this foundation.\n", + "\n", + "Export with: `tito module complete 01_tensor`\n", + "\n", + "**Next**: Module 02 will add activation functions (ReLU, Sigmoid, GELU) that bring intelligence to neural networks by introducing nonlinearity!" 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/modules/01_tensor/tensor_dev.py b/modules/source/01_tensor/tensor_dev.py similarity index 99% rename from modules/01_tensor/tensor_dev.py rename to modules/source/01_tensor/tensor_dev.py index 7ee4a767..94fba7a3 100644 --- a/modules/01_tensor/tensor_dev.py +++ b/modules/source/01_tensor/tensor_dev.py @@ -45,7 +45,6 @@ Let's get started! ```python # Final package structure: -from tinytorch.core.tensor import Tensor # This module - foundation for everything # Future modules will import and extend this Tensor ``` @@ -58,6 +57,7 @@ from tinytorch.core.tensor import Tensor # This module - foundation for everyth # %% nbgrader={"grade": false, "grade_id": "imports", "solution": true} #| default_exp core.tensor +#| export import numpy as np diff --git a/modules/source/02_activations/activations_dev.ipynb b/modules/source/02_activations/activations_dev.ipynb new file mode 100644 index 00000000..6eb0700a --- /dev/null +++ b/modules/source/02_activations/activations_dev.ipynb @@ -0,0 +1,1157 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "26638093", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "# Activations - Intelligence Through Nonlinearity\n", + "\n", + "Welcome to Activations! 
Today you'll add the secret ingredient that makes neural networks intelligent: **nonlinearity**.\n", + "\n", + "## 🔗 Prerequisites & Progress\n", + "**You've Built**: Tensor with data manipulation and basic operations\n", + "**You'll Build**: Activation functions that add nonlinearity to transformations\n", + "**You'll Enable**: Neural networks with the ability to learn complex patterns\n", + "\n", + "**Connection Map**:\n", + "```\n", + "Tensor → Activations → Layers\n", + "(data) (intelligence) (architecture)\n", + "```\n", + "\n", + "## Learning Objectives\n", + "By the end of this module, you will:\n", + "1. Implement 5 core activation functions (Sigmoid, ReLU, Tanh, GELU, Softmax)\n", + "2. Understand how nonlinearity enables neural network intelligence\n", + "3. Test activation behaviors and output ranges\n", + "4. Connect activations to real neural network components\n", + "\n", + "Let's add intelligence to your tensors!" + ] + }, + { + "cell_type": "markdown", + "id": "8fdad0cc", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 📦 Where This Code Lives in the Final Package\n", + "\n", + "**Learning Side:** You work in modules/02_activations/activations_dev.py\n", + "**Building Side:** Code exports to tinytorch.core.activations\n", + "\n", + "```python\n", + "# Final package structure:\n", + "from tinytorch.core.activations import Sigmoid, ReLU, Tanh, GELU, Softmax # This module\n", + "from tinytorch.core.tensor import Tensor # Foundation (Module 01)\n", + "```\n", + "\n", + "**Why this matters:**\n", + "- **Learning:** Complete activation system in one focused module for deep understanding\n", + "- **Production:** Proper organization like PyTorch's torch.nn.functional with all activation operations together\n", + "- **Consistency:** All activation functions and behaviors in core.activations\n", + "- **Integration:** Works seamlessly with Tensor for complete nonlinear transformations" + ] + }, + { + "cell_type": "code", + "execution_count": 
null, + "id": "b118828c", + "metadata": { + "nbgrader": { + "grade": false, + "grade_id": "setup", + "solution": true + } + }, + "outputs": [], + "source": [ + "#| default_exp core.activations\n", + "#| export\n", + "\n", + "import numpy as np\n", + "from typing import Optional\n", + "import sys\n", + "import os\n", + "\n", + "# Import our Tensor class - try from package first, then from local module\n", + "try:\n", + " from tinytorch.core.tensor import Tensor\n", + "except ImportError:\n", + " # For development, import from local tensor module\n", + " sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor'))\n", + " from tensor_dev import Tensor" + ] + }, + { + "cell_type": "markdown", + "id": "f2ac3527", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 1. Introduction - What Makes Neural Networks Intelligent?\n", + "\n", + "Consider two scenarios:\n", + "\n", + "**Without Activations (Linear Only):**\n", + "```\n", + "Input → Linear Transform → Output\n", + "[1, 2] → [3, 4] → [11] # Just weighted sum\n", + "```\n", + "\n", + "**With Activations (Nonlinear):**\n", + "```\n", + "Input → Linear → Activation → Linear → Activation → Output\n", + "[1, 2] → [3, 4] → [3, 4] → [7] → [7] → Complex Pattern!\n", + "```\n", + "\n", + "The magic happens in those activation functions. 
They introduce **nonlinearity** - the ability to curve, bend, and create complex decision boundaries instead of just straight lines.\n", + "\n", + "### Why Nonlinearity Matters\n", + "\n", + "Without activation functions, stacking multiple linear layers is pointless:\n", + "```\n", + "Linear(Linear(x)) = Linear(x) # Same as single layer!\n", + "```\n", + "\n", + "With activation functions, each layer can learn increasingly complex patterns:\n", + "```\n", + "Layer 1: Simple edges and lines\n", + "Layer 2: Curves and shapes\n", + "Layer 3: Complex objects and concepts\n", + "```\n", + "\n", + "This is how deep networks build intelligence from simple mathematical operations." + ] + }, + { + "cell_type": "markdown", + "id": "112ad140", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 2. Mathematical Foundations\n", + "\n", + "Each activation function serves a different purpose in neural networks:\n", + "\n", + "### The Five Essential Activations\n", + "\n", + "1. **Sigmoid**: Maps to (0, 1) - perfect for probabilities\n", + "2. **ReLU**: Removes negatives - creates sparsity and efficiency\n", + "3. **Tanh**: Maps to (-1, 1) - zero-centered for better training\n", + "4. **GELU**: Smooth ReLU - modern choice for transformers\n", + "5. **Softmax**: Creates probability distributions - essential for classification\n", + "\n", + "Let's implement each one with clear explanations and immediate testing!" + ] + }, + { + "cell_type": "markdown", + "id": "1c6e3614", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 3. 
Implementation - Building Activation Functions\n", + "\n", + "### 🏗️ Implementation Pattern\n", + "\n", + "Each activation follows this structure:\n", + "```python\n", + "class ActivationName:\n", + " def forward(self, x: Tensor) -> Tensor:\n", + " # Apply mathematical transformation\n", + " # Return new Tensor with result\n", + "\n", + " def backward(self, grad: Tensor) -> Tensor:\n", + " # Stub for Module 05 - gradient computation\n", + " pass\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "6f4f2fac", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "## Sigmoid - The Probability Gatekeeper\n", + "\n", + "Sigmoid maps any real number to the range (0, 1), making it perfect for probabilities and binary decisions.\n", + "\n", + "### Mathematical Definition\n", + "```\n", + "σ(x) = 1/(1 + e^(-x))\n", + "```\n", + "\n", + "### Visual Behavior\n", + "```\n", + "Input: [-3, -1, 0, 1, 3]\n", + " ↓ ↓ ↓ ↓ ↓ Sigmoid Function\n", + "Output: [0.05, 0.27, 0.5, 0.73, 0.95]\n", + "```\n", + "\n", + "### ASCII Visualization\n", + "```\n", + "Sigmoid Curve:\n", + " 1.0 ┤ ╭─────\n", + " │ ╱\n", + " 0.5 ┤ ╱\n", + " │ ╱\n", + " 0.0 ┤─╱─────────\n", + " -3 0 3\n", + "```\n", + "\n", + "**Why Sigmoid matters**: In binary classification, we need outputs between 0 and 1 to represent probabilities. Sigmoid gives us exactly that!" 
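The sigmoid values quoted above can be sanity-checked with a few lines of plain NumPy. This is a standalone sketch (the `sigmoid` helper here is illustrative only, separate from the `Sigmoid` class built in this module):

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x)), applied element-wise
    x = np.asarray(x, dtype=np.float64)
    return 1.0 / (1.0 + np.exp(-x))

# Same inputs as the "Visual Behavior" table above
vals = sigmoid([-3, -1, 0, 1, 3])
print(np.round(vals, 2))  # approximately [0.05, 0.27, 0.5, 0.73, 0.95]
```

Note that every output lands strictly inside (0, 1), which is exactly the property the unit test below will check.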
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eb89cfed", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "sigmoid-impl", + "solution": true + } + }, + "outputs": [], + "source": [ + "#| export\n", + "class Sigmoid:\n", + " \"\"\"\n", + " Sigmoid activation: σ(x) = 1/(1 + e^(-x))\n", + "\n", + " Maps any real number to (0, 1) range.\n", + " Perfect for probabilities and binary classification.\n", + " \"\"\"\n", + "\n", + " def forward(self, x: Tensor) -> Tensor:\n", + " \"\"\"\n", + " Apply sigmoid activation element-wise.\n", + "\n", + " TODO: Implement sigmoid function\n", + "\n", + " APPROACH:\n", + " 1. Apply sigmoid formula: 1 / (1 + exp(-x))\n", + " 2. Use np.exp for exponential\n", + " 3. Return result wrapped in new Tensor\n", + "\n", + " EXAMPLE:\n", + " >>> sigmoid = Sigmoid()\n", + " >>> x = Tensor([-2, 0, 2])\n", + " >>> result = sigmoid.forward(x)\n", + " >>> print(result.data)\n", + " [0.119, 0.5, 0.881] # All values between 0 and 1\n", + "\n", + " HINT: Use np.exp(-x.data) for the exponential; a very negative input overflows to inf, but 1/(1 + inf) still returns the correct limit of 0\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # Apply sigmoid: 1 / (1 + exp(-x))\n", + " result = 1.0 / (1.0 + np.exp(-x.data))\n", + " return Tensor(result)\n", + " ### END SOLUTION\n", + "\n", + " def backward(self, grad: Tensor) -> Tensor:\n", + " \"\"\"Compute gradient (implemented in Module 05).\"\"\"\n", + " pass # Will implement backward pass in Module 05" + ] + }, + { + "cell_type": "markdown", + "id": "35b7bc5b", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### 🔬 Unit Test: Sigmoid\n", + "This test validates sigmoid activation behavior.\n", + "**What we're testing**: Sigmoid maps inputs to (0, 1) range\n", + "**Why it matters**: Ensures proper probability-like outputs\n", + "**Expected**: All outputs between 0 and 1, sigmoid(0) = 0.5" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a2f7b2fb", + "metadata": { 
"nbgrader": { + "grade": true, + "grade_id": "test-sigmoid", + "locked": true, + "points": 10 + } + }, + "outputs": [], + "source": [ + "def test_unit_sigmoid():\n", + " \"\"\"🔬 Test Sigmoid implementation.\"\"\"\n", + " print(\"🔬 Unit Test: Sigmoid...\")\n", + "\n", + " sigmoid = Sigmoid()\n", + "\n", + " # Test basic cases\n", + " x = Tensor([0.0])\n", + " result = sigmoid.forward(x)\n", + " assert np.allclose(result.data, [0.5]), f\"sigmoid(0) should be 0.5, got {result.data}\"\n", + "\n", + " # Test range property - all outputs should be in (0, 1)\n", + " x = Tensor([-10, -1, 0, 1, 10])\n", + " result = sigmoid.forward(x)\n", + " assert np.all(result.data > 0) and np.all(result.data < 1), \"All sigmoid outputs should be in (0, 1)\"\n", + "\n", + " # Test specific values\n", + " x = Tensor([-1000, 1000]) # Extreme values\n", + " result = sigmoid.forward(x)\n", + " assert np.allclose(result.data[0], 0, atol=1e-10), \"sigmoid(-∞) should approach 0\"\n", + " assert np.allclose(result.data[1], 1, atol=1e-10), \"sigmoid(+∞) should approach 1\"\n", + "\n", + " print(\"✅ Sigmoid works correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_sigmoid()" + ] + }, + { + "cell_type": "markdown", + "id": "10169736", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "## ReLU - The Sparsity Creator\n", + "\n", + "ReLU (Rectified Linear Unit) is the most popular activation function. 
It simply removes negative values, creating sparsity that makes neural networks more efficient.\n", + "\n", + "### Mathematical Definition\n", + "```\n", + "f(x) = max(0, x)\n", + "```\n", + "\n", + "### Visual Behavior\n", + "```\n", + "Input: [-2, -1, 0, 1, 2]\n", + " ↓ ↓ ↓ ↓ ↓ ReLU Function\n", + "Output: [ 0, 0, 0, 1, 2]\n", + "```\n", + "\n", + "### ASCII Visualization\n", + "```\n", + "ReLU Function:\n", + " ╱\n", + " 2 ╱\n", + " ╱\n", + " 1╱\n", + " ╱\n", + " ╱\n", + " ╱\n", + "─┴─────\n", + "-2 0 2\n", + "```\n", + "\n", + "**Why ReLU matters**: By zeroing negative values, ReLU creates sparsity (many zeros) which makes computation faster and helps prevent overfitting." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a7679bbe", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "relu-impl", + "solution": true + } + }, + "outputs": [], + "source": [ + "#| export\n", + "class ReLU:\n", + " \"\"\"\n", + " ReLU activation: f(x) = max(0, x)\n", + "\n", + " Sets negative values to zero, keeps positive values unchanged.\n", + " Most popular activation for hidden layers.\n", + " \"\"\"\n", + "\n", + " def forward(self, x: Tensor) -> Tensor:\n", + " \"\"\"\n", + " Apply ReLU activation element-wise.\n", + "\n", + " TODO: Implement ReLU function\n", + "\n", + " APPROACH:\n", + " 1. Use np.maximum(0, x.data) for element-wise max with zero\n", + " 2. 
Return result wrapped in new Tensor\n", + "\n", + " EXAMPLE:\n", + " >>> relu = ReLU()\n", + " >>> x = Tensor([-2, -1, 0, 1, 2])\n", + " >>> result = relu.forward(x)\n", + " >>> print(result.data)\n", + " [0, 0, 0, 1, 2] # Negative values become 0, positive unchanged\n", + "\n", + " HINT: np.maximum handles element-wise maximum automatically\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # Apply ReLU: max(0, x)\n", + " result = np.maximum(0, x.data)\n", + " return Tensor(result)\n", + " ### END SOLUTION\n", + "\n", + " def backward(self, grad: Tensor) -> Tensor:\n", + " \"\"\"Compute gradient (implemented in Module 05).\"\"\"\n", + " pass # Will implement backward pass in Module 05" + ] + }, + { + "cell_type": "markdown", + "id": "a9d8d19a", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### 🔬 Unit Test: ReLU\n", + "This test validates ReLU activation behavior.\n", + "**What we're testing**: ReLU zeros negative values, preserves positive\n", + "**Why it matters**: ReLU's sparsity helps neural networks train efficiently\n", + "**Expected**: Negative → 0, positive unchanged, zero → 0" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "12798838", + "metadata": { + "nbgrader": { + "grade": true, + "grade_id": "test-relu", + "locked": true, + "points": 10 + } + }, + "outputs": [], + "source": [ + "def test_unit_relu():\n", + " \"\"\"🔬 Test ReLU implementation.\"\"\"\n", + " print(\"🔬 Unit Test: ReLU...\")\n", + "\n", + " relu = ReLU()\n", + "\n", + " # Test mixed positive/negative values\n", + " x = Tensor([-2, -1, 0, 1, 2])\n", + " result = relu.forward(x)\n", + " expected = [0, 0, 0, 1, 2]\n", + " assert np.allclose(result.data, expected), f\"ReLU failed, expected {expected}, got {result.data}\"\n", + "\n", + " # Test all negative\n", + " x = Tensor([-5, -3, -1])\n", + " result = relu.forward(x)\n", + " assert np.allclose(result.data, [0, 0, 0]), \"ReLU should zero all negative values\"\n", + "\n", + 
" # Test all positive\n", + " x = Tensor([1, 3, 5])\n", + " result = relu.forward(x)\n", + " assert np.allclose(result.data, [1, 3, 5]), \"ReLU should preserve all positive values\"\n", + "\n", + " # Test sparsity property\n", + " x = Tensor([-1, -2, -3, 1])\n", + " result = relu.forward(x)\n", + " zeros = np.sum(result.data == 0)\n", + " assert zeros == 3, f\"ReLU should create sparsity, got {zeros} zeros out of 4\"\n", + "\n", + " print(\"✅ ReLU works correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_relu()" + ] + }, + { + "cell_type": "markdown", + "id": "4c7a86fa", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "## Tanh - The Zero-Centered Alternative\n", + "\n", + "Tanh (hyperbolic tangent) is like sigmoid but centered around zero, mapping inputs to (-1, 1). This zero-centering helps with gradient flow during training.\n", + "\n", + "### Mathematical Definition\n", + "```\n", + "f(x) = (e^x - e^(-x))/(e^x + e^(-x))\n", + "```\n", + "\n", + "### Visual Behavior\n", + "```\n", + "Input: [-2, 0, 2]\n", + " ↓ ↓ ↓ Tanh Function\n", + "Output: [-0.96, 0, 0.96]\n", + "```\n", + "\n", + "### ASCII Visualization\n", + "```\n", + "Tanh Curve:\n", + " 1 ┤ ╭─────\n", + " │ ╱\n", + " 0 ┤───╱─────\n", + " │ ╱\n", + " -1 ┤─╱───────\n", + " -3 0 3\n", + "```\n", + "\n", + "**Why Tanh matters**: Unlike sigmoid, tanh outputs are centered around zero, which can help gradients flow better through deep networks." 
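Two of the claims above — zero-centering and tanh's close kinship with sigmoid — can be verified directly in NumPy. This is a standalone sketch, not part of the module's exported code:

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 7)
t = np.tanh(x)

# Odd symmetry: tanh(-x) = -tanh(x), so outputs are centered around zero
assert np.allclose(np.tanh(-x), -t)

# tanh is a rescaled, shifted sigmoid: tanh(x) = 2*sigmoid(2x) - 1
sigmoid_2x = 1.0 / (1.0 + np.exp(-2.0 * x))
assert np.allclose(t, 2.0 * sigmoid_2x - 1.0)

print(np.round(t, 2))  # approximately [-1.0, -0.96, -0.76, 0.0, 0.76, 0.96, 1.0]
```

The identity in the second check also explains why tanh saturates just like sigmoid at extreme inputs, only over the wider (-1, 1) range.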
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "61336eeb", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "tanh-impl", + "solution": true + } + }, + "outputs": [], + "source": [ + "#| export\n", + "class Tanh:\n", + " \"\"\"\n", + " Tanh activation: f(x) = (e^x - e^(-x))/(e^x + e^(-x))\n", + "\n", + " Maps any real number to (-1, 1) range.\n", + " Zero-centered alternative to sigmoid.\n", + " \"\"\"\n", + "\n", + " def forward(self, x: Tensor) -> Tensor:\n", + " \"\"\"\n", + " Apply tanh activation element-wise.\n", + "\n", + " TODO: Implement tanh function\n", + "\n", + " APPROACH:\n", + " 1. Use np.tanh(x.data) for hyperbolic tangent\n", + " 2. Return result wrapped in new Tensor\n", + "\n", + " EXAMPLE:\n", + " >>> tanh = Tanh()\n", + " >>> x = Tensor([-2, 0, 2])\n", + " >>> result = tanh.forward(x)\n", + " >>> print(result.data)\n", + " [-0.964, 0.0, 0.964] # Range (-1, 1), symmetric around 0\n", + "\n", + " HINT: NumPy provides np.tanh function\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # Apply tanh using NumPy\n", + " result = np.tanh(x.data)\n", + " return Tensor(result)\n", + " ### END SOLUTION\n", + "\n", + " def backward(self, grad: Tensor) -> Tensor:\n", + " \"\"\"Compute gradient (implemented in Module 05).\"\"\"\n", + " pass # Will implement backward pass in Module 05" + ] + }, + { + "cell_type": "markdown", + "id": "2ea3362a", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### 🔬 Unit Test: Tanh\n", + "This test validates tanh activation behavior.\n", + "**What we're testing**: Tanh maps inputs to (-1, 1) range, zero-centered\n", + "**Why it matters**: Zero-centered activations can help with gradient flow\n", + "**Expected**: All outputs in (-1, 1), tanh(0) = 0, symmetric behavior" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "efa9866e", + "metadata": { + "nbgrader": { + "grade": true, + "grade_id": 
"test-tanh", + "locked": true, + "points": 10 + } + }, + "outputs": [], + "source": [ + "def test_unit_tanh():\n", + " \"\"\"🔬 Test Tanh implementation.\"\"\"\n", + " print(\"🔬 Unit Test: Tanh...\")\n", + "\n", + " tanh = Tanh()\n", + "\n", + " # Test zero\n", + " x = Tensor([0.0])\n", + " result = tanh.forward(x)\n", + " assert np.allclose(result.data, [0.0]), f\"tanh(0) should be 0, got {result.data}\"\n", + "\n", + " # Test range property - all outputs should be in (-1, 1)\n", + " x = Tensor([-10, -1, 0, 1, 10])\n", + " result = tanh.forward(x)\n", + " assert np.all(result.data >= -1) and np.all(result.data <= 1), \"All tanh outputs should be in [-1, 1]\"\n", + "\n", + " # Test symmetry: tanh(-x) = -tanh(x)\n", + " x = Tensor([2.0])\n", + " pos_result = tanh.forward(x)\n", + " x_neg = Tensor([-2.0])\n", + " neg_result = tanh.forward(x_neg)\n", + " assert np.allclose(pos_result.data, -neg_result.data), \"tanh should be symmetric: tanh(-x) = -tanh(x)\"\n", + "\n", + " # Test extreme values\n", + " x = Tensor([-1000, 1000])\n", + " result = tanh.forward(x)\n", + " assert np.allclose(result.data[0], -1, atol=1e-10), \"tanh(-∞) should approach -1\"\n", + " assert np.allclose(result.data[1], 1, atol=1e-10), \"tanh(+∞) should approach 1\"\n", + "\n", + " print(\"✅ Tanh works correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_tanh()" + ] + }, + { + "cell_type": "markdown", + "id": "8a075f28", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "## GELU - The Smooth Modern Choice\n", + "\n", + "GELU (Gaussian Error Linear Unit) is a smooth approximation to ReLU that's become popular in modern architectures like transformers. 
Unlike ReLU's sharp corner, GELU is smooth everywhere.\n", + "\n", + "### Mathematical Definition\n", + "```\n", + "f(x) = x * Φ(x) ≈ x * Sigmoid(1.702 * x)\n", + "```\n", + "Where Φ(x) is the cumulative distribution function of standard normal distribution.\n", + "\n", + "### Visual Behavior\n", + "```\n", + "Input: [-1, 0, 1]\n", + " ↓ ↓ ↓ GELU Function\n", + "Output: [-0.16, 0, 0.84]\n", + "```\n", + "\n", + "### ASCII Visualization\n", + "```\n", + "GELU Function:\n", + " ╱\n", + " 1 ╱\n", + " ╱\n", + " ╱\n", + " ╱\n", + " ╱ ↙ (smooth curve, no sharp corner)\n", + " ╱\n", + "─┴─────\n", + "-2 0 2\n", + "```\n", + "\n", + "**Why GELU matters**: Used in GPT, BERT, and other transformers. The smoothness helps with optimization compared to ReLU's sharp corner." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "14cc2a59", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "gelu-impl", + "solution": true + } + }, + "outputs": [], + "source": [ + "#| export\n", + "class GELU:\n", + " \"\"\"\n", + " GELU activation: f(x) = x * Φ(x) ≈ x * Sigmoid(1.702 * x)\n", + "\n", + " Smooth approximation to ReLU, used in modern transformers.\n", + " Where Φ(x) is the cumulative distribution function of standard normal.\n", + " \"\"\"\n", + "\n", + " def forward(self, x: Tensor) -> Tensor:\n", + " \"\"\"\n", + " Apply GELU activation element-wise.\n", + "\n", + " TODO: Implement GELU approximation\n", + "\n", + " APPROACH:\n", + " 1. Use approximation: x * sigmoid(1.702 * x)\n", + " 2. Compute sigmoid part: 1 / (1 + exp(-1.702 * x))\n", + " 3. Multiply by x element-wise\n", + " 4. 
Return result wrapped in new Tensor\n", + "\n", + " EXAMPLE:\n", + " >>> gelu = GELU()\n", + " >>> x = Tensor([-1, 0, 1])\n", + " >>> result = gelu.forward(x)\n", + " >>> print(result.data)\n", + " [-0.154, 0.0, 0.846] # Smooth, like ReLU but differentiable everywhere\n", + "\n", + " HINT: The constant 1.702 is an empirical fit that makes sigmoid(1.702 * x) closely match Φ(x)\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # GELU approximation: x * sigmoid(1.702 * x)\n", + " # First compute sigmoid part\n", + " sigmoid_part = 1.0 / (1.0 + np.exp(-1.702 * x.data))\n", + " # Then multiply by x\n", + " result = x.data * sigmoid_part\n", + " return Tensor(result)\n", + " ### END SOLUTION\n", + "\n", + " def backward(self, grad: Tensor) -> Tensor:\n", + " \"\"\"Compute gradient (implemented in Module 05).\"\"\"\n", + " pass # Will implement backward pass in Module 05" + ] + }, + { + "cell_type": "markdown", + "id": "25cbcb04", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### 🔬 Unit Test: GELU\n", + "This test validates GELU activation behavior.\n", + "**What we're testing**: GELU provides smooth ReLU-like behavior\n", + "**Why it matters**: GELU is used in modern transformers like GPT and BERT\n", + "**Expected**: Smooth curve, GELU(0) ≈ 0, positive values preserved roughly" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2ea81efa", + "metadata": { + "nbgrader": { + "grade": true, + "grade_id": "test-gelu", + "locked": true, + "points": 10 + } + }, + "outputs": [], + "source": [ + "def test_unit_gelu():\n", + " \"\"\"🔬 Test GELU implementation.\"\"\"\n", + " print(\"🔬 Unit Test: GELU...\")\n", + "\n", + " gelu = GELU()\n", + "\n", + " # Test zero (should be approximately 0)\n", + " x = Tensor([0.0])\n", + " result = gelu.forward(x)\n", + " assert np.allclose(result.data, [0.0], atol=1e-10), f\"GELU(0) should be ≈0, got {result.data}\"\n", + "\n", + " # Test positive values (should be roughly preserved)\n", + " x = Tensor([1.0])\n", + " 
result = gelu.forward(x)\n", + " assert result.data[0] > 0.8, f\"GELU(1) should be ≈0.84, got {result.data[0]}\"\n", + "\n", + " # Test negative values (should be small but not zero)\n", + " x = Tensor([-1.0])\n", + " result = gelu.forward(x)\n", + " assert result.data[0] < 0 and result.data[0] > -0.2, f\"GELU(-1) should be ≈-0.16, got {result.data[0]}\"\n", + "\n", + " # Test smoothness property (no sharp corners like ReLU)\n", + " x = Tensor([-0.001, 0.0, 0.001])\n", + " result = gelu.forward(x)\n", + " # Values should be close to each other (smooth)\n", + " diff1 = abs(result.data[1] - result.data[0])\n", + " diff2 = abs(result.data[2] - result.data[1])\n", + " assert diff1 < 0.01 and diff2 < 0.01, \"GELU should be smooth around zero\"\n", + "\n", + " print(\"✅ GELU works correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_gelu()" + ] + }, + { + "cell_type": "markdown", + "id": "8dd72698", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "## Softmax - The Probability Distributor\n", + "\n", + "Softmax converts any vector into a valid probability distribution. All outputs are positive and sum to exactly 1.0, making it essential for multi-class classification.\n", + "\n", + "### Mathematical Definition\n", + "```\n", + "f(x_i) = e^(x_i) / Σ(e^(x_j))\n", + "```\n", + "\n", + "### Visual Behavior\n", + "```\n", + "Input: [1, 2, 3]\n", + " ↓ ↓ ↓ Softmax Function\n", + "Output: [0.09, 0.24, 0.67] # Sum = 1.0\n", + "```\n", + "\n", + "### ASCII Visualization\n", + "```\n", + "Softmax Transform:\n", + "Raw scores: [1, 2, 3, 4]\n", + " ↓ Exponential ↓\n", + " [2.7, 7.4, 20.1, 54.6]\n", + " ↓ Normalize ↓\n", + " [0.03, 0.09, 0.24, 0.64] ← Sum = 1.0\n", + "```\n", + "\n", + "**Why Softmax matters**: In multi-class classification, we need outputs that represent probabilities for each class. Softmax guarantees valid probabilities." 
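The "subtract the max" trick used for numerical stability is worth seeing in isolation. This standalone NumPy sketch (not part of the module) shows it taming scores that would otherwise overflow:

```python
import numpy as np

def softmax_stable(x):
    # Shifting by the max changes nothing mathematically
    # (the common factor e^(-max) cancels in the ratio) but prevents overflow
    shifted = x - np.max(x)
    exp_vals = np.exp(shifted)
    return exp_vals / exp_vals.sum()

scores = np.array([1000.0, 1001.0, 1002.0])  # np.exp(1002.0) alone overflows to inf
probs = softmax_stable(scores)
print(np.round(probs, 3))  # [0.09, 0.245, 0.665] -- identical to softmax of [0, 1, 2]
```

Because softmax is invariant to adding a constant to every score, the shifted inputs [-2, -1, 0] give exactly the same probabilities as the raw scores would in exact arithmetic.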
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f9cb33a7", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "softmax-impl", + "solution": true + } + }, + "outputs": [], + "source": [ + "#| export\n", + "class Softmax:\n", + " \"\"\"\n", + " Softmax activation: f(x_i) = e^(x_i) / Σ(e^(x_j))\n", + "\n", + " Converts any vector to a probability distribution.\n", + " Sum of all outputs equals 1.0.\n", + " \"\"\"\n", + "\n", + " def forward(self, x: Tensor, dim: int = -1) -> Tensor:\n", + " \"\"\"\n", + " Apply softmax activation along specified dimension.\n", + "\n", + " TODO: Implement numerically stable softmax\n", + "\n", + " APPROACH:\n", + " 1. Subtract max for numerical stability: x - max(x)\n", + " 2. Compute exponentials: exp(x - max(x))\n", + " 3. Sum along dimension: sum(exp_values)\n", + " 4. Divide: exp_values / sum\n", + " 5. Return result wrapped in new Tensor\n", + "\n", + " EXAMPLE:\n", + " >>> softmax = Softmax()\n", + " >>> x = Tensor([1, 2, 3])\n", + " >>> result = softmax.forward(x)\n", + " >>> print(result.data)\n", + " [0.090, 0.245, 0.665] # Sums to 1.0, larger inputs get higher probability\n", + "\n", + " HINTS:\n", + " - Use np.max(x.data, axis=dim, keepdims=True) for max\n", + " - Use np.sum(exp_values, axis=dim, keepdims=True) for sum\n", + " - The max subtraction prevents overflow in exponentials\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # Numerical stability: subtract max to prevent overflow\n", + " x_max = np.max(x.data, axis=dim, keepdims=True)\n", + " x_shifted = x.data - x_max\n", + "\n", + " # Compute exponentials\n", + " exp_values = np.exp(x_shifted)\n", + "\n", + " # Sum along dimension\n", + " exp_sum = np.sum(exp_values, axis=dim, keepdims=True)\n", + "\n", + " # Normalize to get probabilities\n", + " result = exp_values / exp_sum\n", + " return Tensor(result)\n", + " ### END SOLUTION\n", + "\n", + " def backward(self, grad: Tensor) -> Tensor:\n", + " 
\"\"\"Compute gradient (implemented in Module 05).\"\"\"\n", + " pass # Will implement backward pass in Module 05" + ] + }, + { + "cell_type": "markdown", + "id": "2be27bd4", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### 🔬 Unit Test: Softmax\n", + "This test validates softmax activation behavior.\n", + "**What we're testing**: Softmax creates valid probability distributions\n", + "**Why it matters**: Essential for multi-class classification outputs\n", + "**Expected**: Outputs sum to 1.0, all values in (0, 1), largest input gets highest probability" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7434b6fd", + "metadata": { + "nbgrader": { + "grade": true, + "grade_id": "test-softmax", + "locked": true, + "points": 10 + } + }, + "outputs": [], + "source": [ + "def test_unit_softmax():\n", + " \"\"\"🔬 Test Softmax implementation.\"\"\"\n", + " print(\"🔬 Unit Test: Softmax...\")\n", + "\n", + " softmax = Softmax()\n", + "\n", + " # Test basic probability properties\n", + " x = Tensor([1, 2, 3])\n", + " result = softmax.forward(x)\n", + "\n", + " # Should sum to 1\n", + " assert np.allclose(np.sum(result.data), 1.0), f\"Softmax should sum to 1, got {np.sum(result.data)}\"\n", + "\n", + " # All values should be positive\n", + " assert np.all(result.data > 0), \"All softmax values should be positive\"\n", + "\n", + " # All values should be less than 1\n", + " assert np.all(result.data < 1), \"All softmax values should be less than 1\"\n", + "\n", + " # Largest input should get largest output\n", + " max_input_idx = np.argmax(x.data)\n", + " max_output_idx = np.argmax(result.data)\n", + " assert max_input_idx == max_output_idx, \"Largest input should get largest softmax output\"\n", + "\n", + " # Test numerical stability with large numbers\n", + " x = Tensor([1000, 1001, 1002]) # Would overflow without max subtraction\n", + " result = softmax.forward(x)\n", + " assert np.allclose(np.sum(result.data), 
1.0), \"Softmax should handle large numbers\"\n", + " assert not np.any(np.isnan(result.data)), \"Softmax should not produce NaN\"\n", + " assert not np.any(np.isinf(result.data)), \"Softmax should not produce infinity\"\n", + "\n", + " # Test with 2D tensor (batch dimension)\n", + " x = Tensor([[1, 2], [3, 4]])\n", + " result = softmax.forward(x, dim=-1) # Softmax along last dimension\n", + " assert result.shape == (2, 2), \"Softmax should preserve input shape\"\n", + " # Each row should sum to 1\n", + " row_sums = np.sum(result.data, axis=-1)\n", + " assert np.allclose(row_sums, [1.0, 1.0]), \"Each row should sum to 1\"\n", + "\n", + " print(\"✅ Softmax works correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_softmax()" + ] + }, + { + "cell_type": "markdown", + "id": "1724e759", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 2 + }, + "source": [ + "## 4. Integration - Bringing It Together\n", + "\n", + "Now let's test how all our activation functions work together and understand their different behaviors." + ] + }, + { + "cell_type": "markdown", + "id": "8fadcbf4", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "### Understanding the Output Patterns\n", + "\n", + "Notice how each activation serves a different purpose:\n", + "\n", + "**Sigmoid**: Squashes everything to (0, 1) - good for probabilities\n", + "**ReLU**: Zeros negatives, keeps positives - creates sparsity\n", + "**Tanh**: Like sigmoid but centered at zero (-1, 1) - better gradient flow\n", + "**GELU**: Smooth ReLU-like behavior - modern choice for transformers\n", + "**Softmax**: Converts to probability distribution - sum equals 1\n", + "\n", + "These different behaviors make each activation suitable for different parts of neural networks."
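The behaviors listed above can be compared side by side with a standalone NumPy sketch (plain arrays rather than the module's classes; the GELU line uses the common tanh approximation, which may differ slightly from the module's implementation):

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

sigmoid = 1.0 / (1.0 + np.exp(-x))          # squashes into (0, 1)
relu = np.maximum(0.0, x)                   # zeros out negatives
tanh = np.tanh(x)                           # zero-centered, range (-1, 1)
# GELU via the tanh approximation (an assumption; exact GELU uses the Gaussian CDF)
gelu = 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))
softmax = np.exp(x - x.max()) / np.exp(x - x.max()).sum()   # sums to 1

for name, y in [("sigmoid", sigmoid), ("relu", relu), ("tanh", tanh),
                ("gelu", gelu), ("softmax", softmax)]:
    print(f"{name:8s}", np.round(y, 3))
```

Running this makes the contrasts concrete: ReLU's output is exactly zero for both negative inputs, tanh and GELU pass through the origin, and only the softmax row sums to 1.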
+ ] + }, + { + "cell_type": "markdown", + "id": "ba765a81", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "## 🧪 Module Integration Test\n", + "\n", + "Final validation that everything works together correctly." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "308d7856", + "metadata": { + "lines_to_next_cell": 2, + "nbgrader": { + "grade": true, + "grade_id": "module-test", + "locked": true, + "points": 20 + } + }, + "outputs": [], + "source": [ + "def test_module():\n", + " \"\"\"\n", + " Comprehensive test of entire module functionality.\n", + "\n", + " This final test runs before module summary to ensure:\n", + " - All unit tests pass\n", + " - Functions work together correctly\n", + " - Module is ready for integration with TinyTorch\n", + " \"\"\"\n", + " print(\"🧪 RUNNING MODULE INTEGRATION TEST\")\n", + " print(\"=\" * 50)\n", + "\n", + " # Run all unit tests\n", + " print(\"Running unit tests...\")\n", + " test_unit_sigmoid()\n", + " test_unit_relu()\n", + " test_unit_tanh()\n", + " test_unit_gelu()\n", + " test_unit_softmax()\n", + "\n", + " print(\"\\nRunning integration scenarios...\")\n", + "\n", + " # Test 1: All activations preserve tensor properties\n", + " print(\"🔬 Integration Test: Tensor property preservation...\")\n", + " test_data = Tensor([[1, -1], [2, -2]]) # 2D tensor\n", + "\n", + " activations = [Sigmoid(), ReLU(), Tanh(), GELU()]\n", + " for activation in activations:\n", + " result = activation.forward(test_data)\n", + " assert result.shape == test_data.shape, f\"Shape not preserved by {activation.__class__.__name__}\"\n", + " assert isinstance(result, Tensor), f\"Output not Tensor from {activation.__class__.__name__}\"\n", + "\n", + " print(\"✅ All activations preserve tensor properties!\")\n", + "\n", + " # Test 2: Softmax works with different dimensions\n", + " print(\"🔬 Integration Test: Softmax dimension handling...\")\n", + " data_3d = Tensor([[[1, 2, 3], [4, 5, 6]], 
[[7, 8, 9], [10, 11, 12]]]) # (2, 2, 3)\n", + " softmax = Softmax()\n", + "\n", + " # Test different dimensions\n", + " result_last = softmax.forward(data_3d, dim=-1)\n", + " assert result_last.shape == (2, 2, 3), \"Softmax should preserve shape\"\n", + "\n", + " # Check that last dimension sums to 1\n", + " last_dim_sums = np.sum(result_last.data, axis=-1)\n", + " assert np.allclose(last_dim_sums, 1.0), \"Last dimension should sum to 1\"\n", + "\n", + " print(\"✅ Softmax handles different dimensions correctly!\")\n", + "\n", + " # Test 3: Activation chaining (simulating neural network)\n", + " print(\"🔬 Integration Test: Activation chaining...\")\n", + "\n", + " # Simulate: Input → Linear → ReLU → Linear → Softmax (like a simple network)\n", + " x = Tensor([[-1, 0, 1, 2]]) # Batch of 1, 4 features\n", + "\n", + " # Apply ReLU (hidden layer activation)\n", + " relu = ReLU()\n", + " hidden = relu.forward(x)\n", + "\n", + " # Apply Softmax (output layer activation)\n", + " softmax = Softmax()\n", + " output = softmax.forward(hidden)\n", + "\n", + " # Verify the chain\n", + " assert hidden.data[0, 0] == 0, \"ReLU should zero negative input\"\n", + " assert np.allclose(np.sum(output.data), 1.0), \"Final output should be probability distribution\"\n", + "\n", + " print(\"✅ Activation chaining works correctly!\")\n", + "\n", + " print(\"\\n\" + \"=\" * 50)\n", + " print(\"🎉 ALL TESTS PASSED! Module ready for export.\")\n", + " print(\"Run: tito module complete 02\")\n", + "\n", + "# Run comprehensive module test\n", + "if __name__ == \"__main__\":\n", + " test_module()" + ] + }, + { + "cell_type": "markdown", + "id": "d5bd9de0", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 🎯 MODULE SUMMARY: Activations\n", + "\n", + "Congratulations! 
You've built the intelligence engine of neural networks!\n", + "\n", + "### Key Accomplishments\n", + "- Built 5 core activation functions with distinct behaviors and use cases\n", + "- Implemented forward passes for Sigmoid, ReLU, Tanh, GELU, and Softmax\n", + "- Discovered how nonlinearity enables complex pattern learning\n", + "- All tests pass ✅ (validated by `test_module()`)\n", + "\n", + "### Ready for Next Steps\n", + "Your activation implementations enable neural network layers to learn complex, nonlinear patterns instead of just linear transformations.\n", + "\n", + "Export with: `tito module complete 02`\n", + "\n", + "**Next**: Module 03 will combine your Tensors and Activations to build complete neural network Layers!" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/modules/02_activations/activations_dev.py b/modules/source/02_activations/activations_dev.py similarity index 99% rename from modules/02_activations/activations_dev.py rename to modules/source/02_activations/activations_dev.py index eeda68ad..d8fee7dd 100644 --- a/modules/02_activations/activations_dev.py +++ b/modules/source/02_activations/activations_dev.py @@ -61,13 +61,14 @@ from tinytorch.core.tensor import Tensor # Foundation (Module 01) # %% nbgrader={"grade": false, "grade_id": "setup", "solution": true} #| default_exp core.activations +#| export import numpy as np from typing import Optional import sys import os -# Import the proper Tensor class from Module 01 +# Import our Tensor class from local tensor module sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor')) from tensor_dev import Tensor diff --git a/modules/source/03_layers/layers_dev.ipynb b/modules/source/03_layers/layers_dev.ipynb new file mode 100644 index 00000000..2bcd45fc --- /dev/null +++ b/modules/source/03_layers/layers_dev.ipynb @@ -0,0 +1,1009 @@ 
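The layers notebook added below centers on the linear transformation y = xW + b with Xavier-style initialization; a minimal standalone NumPy sketch of that math (plain arrays and illustrative MNIST-like shapes, not the module's Tensor class):

```python
import numpy as np

rng = np.random.default_rng(0)

in_features, out_features, batch = 784, 256, 32

# Xavier/Glorot-style init: scale weights by sqrt(1/fan_in) for stable gradients
scale = np.sqrt(1.0 / in_features)
W = rng.standard_normal((in_features, out_features)) * scale
b = np.zeros(out_features)

x = rng.standard_normal((batch, in_features))
y = x @ W + b                      # y = xW + b, bias broadcast over the batch

print(y.shape)                     # (32, 256)
print(W.size + b.size)             # 200960 trainable parameters
```

The parameter count matches the breakdown used throughout the module: 784 × 256 weights plus 256 biases = 200,960.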
+{ + "cells": [ + { + "cell_type": "markdown", + "id": "f9274f85", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "# Module 03: Layers - Building Blocks of Neural Networks\n", + "\n", + "Welcome to Module 03! You're about to build the fundamental building blocks that make neural networks possible.\n", + "\n", + "## 🔗 Prerequisites & Progress\n", + "**You've Built**: Tensor class (Module 01) with all operations and activations (Module 02)\n", + "**You'll Build**: Linear layers and Dropout regularization\n", + "**You'll Enable**: Multi-layer neural networks, trainable parameters, and forward passes\n", + "\n", + "**Connection Map**:\n", + "```\n", + "Tensor → Activations → Layers → Networks\n", + "(data) (intelligence) (building blocks) (architectures)\n", + "```\n", + "\n", + "## Learning Objectives\n", + "By the end of this module, you will:\n", + "1. Implement Linear layers with proper weight initialization\n", + "2. Add Dropout for regularization during training\n", + "3. Understand parameter management and counting\n", + "4. 
Test individual layer components\n", + "\n", + "Let's get started!\n", + "\n", + "## 📦 Where This Code Lives in the Final Package\n", + "\n", + "**Learning Side:** You work in modules/03_layers/layers_dev.py\n", + "**Building Side:** Code exports to tinytorch.core.layers\n", + "\n", + "```python\n", + "# Final package structure:\n", + "from tinytorch.core.layers import Linear, Dropout # This module\n", + "from tinytorch.core.tensor import Tensor # Module 01 - foundation\n", + "from tinytorch.core.activations import ReLU, Sigmoid # Module 02 - intelligence\n", + "```\n", + "\n", + "**Why this matters:**\n", + "- **Learning:** Complete layer system in one focused module for deep understanding\n", + "- **Production:** Proper organization like PyTorch's torch.nn with all layer building blocks together\n", + "- **Consistency:** All layer operations and parameter management in core.layers\n", + "- **Integration:** Works seamlessly with tensors and activations for complete neural networks" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5ee174f6", + "metadata": { + "nbgrader": { + "grade": false, + "grade_id": "imports", + "solution": true + } + }, + "outputs": [], + "source": [ + "#| default_exp core.layers\n", + "\n", + "import numpy as np\n", + "import sys\n", + "import os\n", + "\n", + "# Import the proper Tensor class from Module 01\n", + "sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor'))\n", + "from tensor_dev import Tensor" + ] + }, + { + "cell_type": "markdown", + "id": "57f54e44", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 1. Introduction: What are Neural Network Layers?\n", + "\n", + "Neural network layers are the fundamental building blocks that transform data as it flows through a network. 
Each layer performs a specific computation:\n", + "\n", + "- **Linear layers** apply learned transformations: `y = xW + b`\n", + "- **Dropout layers** randomly zero elements for regularization\n", + "\n", + "Think of layers as processing stations in a factory:\n", + "```\n", + "Input Data → Layer 1 → Layer 2 → Layer 3 → Output\n", + " ↓ ↓ ↓ ↓ ↓\n", + " Features Hidden Hidden Hidden Predictions\n", + "```\n", + "\n", + "Each layer learns its own piece of the puzzle. Linear layers learn which features matter, while dropout prevents overfitting by forcing robustness." + ] + }, + { + "cell_type": "markdown", + "id": "c655df09", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 2. Foundations: Mathematical Background\n", + "\n", + "### Linear Layer Mathematics\n", + "A linear layer implements: **y = xW + b**\n", + "\n", + "```\n", + "Input x (batch_size, in_features) @ Weight W (in_features, out_features) + Bias b (out_features)\n", + " = Output y (batch_size, out_features)\n", + "```\n", + "\n", + "### Weight Initialization\n", + "Random initialization is crucial for breaking symmetry:\n", + "- **Xavier/Glorot**: Scale by sqrt(1/fan_in) for stable gradients\n", + "- **He**: Scale by sqrt(2/fan_in) for ReLU activation\n", + "- **Too small**: Gradients vanish, learning is slow\n", + "- **Too large**: Gradients explode, training unstable\n", + "\n", + "### Parameter Counting\n", + "```\n", + "Linear(784, 256): 784 × 256 + 256 = 200,960 parameters\n", + "\n", + "Manual composition:\n", + " layer1 = Linear(784, 256) # 200,960 params\n", + " activation = ReLU() # 0 params\n", + " layer2 = Linear(256, 10) # 2,570 params\n", + " # Total: 203,530 params\n", + "```\n", + "\n", + "Memory usage: 4 bytes/param × 203,530 = ~814KB for weights alone" + ] + }, + { + "cell_type": "markdown", + "id": "82f106c3", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 3. 
Implementation: Building Layer Foundation\n", + "\n", + "Let's build our layer system step by step. We'll implement two essential layer types:\n", + "\n", + "1. **Linear Layer** - The workhorse of neural networks\n", + "2. **Dropout Layer** - Prevents overfitting\n", + "\n", + "### Key Design Principles:\n", + "- All methods defined INSIDE classes (no monkey-patching)\n", + "- Parameter tensors have requires_grad=True (ready for Module 05)\n", + "- Forward methods return new tensors, preserving immutability\n", + "- parameters() method enables optimizer integration" + ] + }, + { + "cell_type": "markdown", + "id": "c960f336", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### 🏗️ Linear Layer - The Foundation of Neural Networks\n", + "\n", + "Linear layers (also called Dense or Fully Connected layers) are the fundamental building blocks of neural networks. They implement the mathematical operation:\n", + "\n", + "**y = xW + b**\n", + "\n", + "Where:\n", + "- **x**: Input features (what we know)\n", + "- **W**: Weight matrix (what we learn)\n", + "- **b**: Bias vector (adjusts the output)\n", + "- **y**: Output features (what we predict)\n", + "\n", + "### Why Linear Layers Matter\n", + "\n", + "Linear layers learn **feature combinations**. 
Each output neuron asks: \"What combination of input features is most useful for my task?\" The network discovers these combinations through training.\n", + "\n", + "### Data Flow Visualization\n", + "```\n", + "Input Features Weight Matrix Bias Vector Output Features\n", + "[batch, in_feat] @ [in_feat, out_feat] + [out_feat] = [batch, out_feat]\n", + "\n", + "Example: MNIST Digit Recognition\n", + "[32, 784] @ [784, 10] + [10] = [32, 10]\n", + " ↑ ↑ ↑ ↑\n", + "32 images 784 pixels 10 classes 10 probabilities\n", + " to 10 classes adjustments per image\n", + "```\n", + "\n", + "### Memory Layout\n", + "```\n", + "Linear(784, 256) Parameters:\n", + "┌─────────────────────────────┐\n", + "│ Weight Matrix W │ 784 × 256 = 200,704 params\n", + "│ [784, 256] float32 │ × 4 bytes = 802.8 KB\n", + "├─────────────────────────────┤\n", + "│ Bias Vector b │ 256 params\n", + "│ [256] float32 │ × 4 bytes = 1.0 KB\n", + "└─────────────────────────────┘\n", + " Total: 803.8 KB for one layer\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0e7a01cb", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "linear-layer", + "solution": true + } + }, + "outputs": [], + "source": [ + "class Linear:\n", + " \"\"\"\n", + " Linear (fully connected) layer: y = xW + b\n", + "\n", + " This is the fundamental building block of neural networks.\n", + " Applies a linear transformation to incoming data.\n", + " \"\"\"\n", + "\n", + " def __init__(self, in_features, out_features, bias=True):\n", + " \"\"\"\n", + " Initialize linear layer with proper weight initialization.\n", + "\n", + " TODO: Initialize weights and bias with Xavier initialization\n", + "\n", + " APPROACH:\n", + " 1. Create weight matrix (in_features, out_features) with Xavier scaling\n", + " 2. Create bias vector (out_features,) initialized to zeros if bias=True\n", + " 3. 
Set requires_grad=True for parameters (ready for Module 05)\n", + "\n", + " EXAMPLE:\n", + " >>> layer = Linear(784, 10) # MNIST classifier final layer\n", + " >>> print(layer.weight.shape)\n", + " (784, 10)\n", + " >>> print(layer.bias.shape)\n", + " (10,)\n", + "\n", + " HINTS:\n", + " - Xavier init: scale = sqrt(1/in_features)\n", + " - Use np.random.randn() for normal distribution\n", + " - bias=None when bias=False\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " self.in_features = in_features\n", + " self.out_features = out_features\n", + "\n", + " # Xavier/Glorot initialization for stable gradients\n", + " scale = np.sqrt(1.0 / in_features)\n", + " weight_data = np.random.randn(in_features, out_features) * scale\n", + " self.weight = Tensor(weight_data, requires_grad=True)\n", + "\n", + " # Initialize bias to zeros or None\n", + " if bias:\n", + " bias_data = np.zeros(out_features)\n", + " self.bias = Tensor(bias_data, requires_grad=True)\n", + " else:\n", + " self.bias = None\n", + " ### END SOLUTION\n", + "\n", + " def forward(self, x):\n", + " \"\"\"\n", + " Forward pass through linear layer.\n", + "\n", + " TODO: Implement y = xW + b\n", + "\n", + " APPROACH:\n", + " 1. Matrix multiply input with weights: xW\n", + " 2. Add bias if it exists\n", + " 3. 
Return result as new Tensor\n", + "\n", + " EXAMPLE:\n", + " >>> layer = Linear(3, 2)\n", + " >>> x = Tensor([[1, 2, 3], [4, 5, 6]]) # 2 samples, 3 features\n", + " >>> y = layer.forward(x)\n", + " >>> print(y.shape)\n", + " (2, 2) # 2 samples, 2 outputs\n", + "\n", + " HINTS:\n", + " - Use tensor.matmul() for matrix multiplication\n", + " - Handle bias=None case\n", + " - Broadcasting automatically handles bias addition\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # Linear transformation: y = xW\n", + " output = x.matmul(self.weight)\n", + "\n", + " # Add bias if present\n", + " if self.bias is not None:\n", + " output = output + self.bias\n", + "\n", + " return output\n", + " ### END SOLUTION\n", + "\n", + " def parameters(self):\n", + " \"\"\"\n", + " Return list of trainable parameters.\n", + "\n", + " TODO: Return all tensors that need gradients\n", + "\n", + " APPROACH:\n", + " 1. Start with weight (always present)\n", + " 2. Add bias if it exists\n", + " 3. Return as list for optimizer\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " params = [self.weight]\n", + " if self.bias is not None:\n", + " params.append(self.bias)\n", + " return params\n", + " ### END SOLUTION\n", + "\n", + " def __repr__(self):\n", + " \"\"\"String representation for debugging.\"\"\"\n", + " bias_str = f\", bias={self.bias is not None}\"\n", + " return f\"Linear(in_features={self.in_features}, out_features={self.out_features}{bias_str})\"" + ] + }, + { + "cell_type": "markdown", + "id": "7005a1e9", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### 🔬 Unit Test: Linear Layer\n", + "This test validates our Linear layer implementation works correctly.\n", + "**What we're testing**: Weight initialization, forward pass, parameter management\n", + "**Why it matters**: Foundation for all neural network architectures\n", + "**Expected**: Proper shapes, Xavier scaling, parameter counting" + ] + }, + { + "cell_type": "code", + 
"execution_count": null, + "id": "1e635c8e", + "metadata": { + "nbgrader": { + "grade": true, + "grade_id": "test-linear", + "locked": true, + "points": 15 + } + }, + "outputs": [], + "source": [ + "def test_unit_linear_layer():\n", + " \"\"\"🔬 Test Linear layer implementation.\"\"\"\n", + " print(\"🔬 Unit Test: Linear Layer...\")\n", + "\n", + " # Test layer creation\n", + " layer = Linear(784, 256)\n", + " assert layer.in_features == 784\n", + " assert layer.out_features == 256\n", + " assert layer.weight.shape == (784, 256)\n", + " assert layer.bias.shape == (256,)\n", + " assert layer.weight.requires_grad == True\n", + " assert layer.bias.requires_grad == True\n", + "\n", + " # Test Xavier initialization (weights should be reasonably scaled)\n", + " weight_std = np.std(layer.weight.data)\n", + " expected_std = np.sqrt(1.0 / 784)\n", + " assert 0.5 * expected_std < weight_std < 2.0 * expected_std, f\"Weight std {weight_std} not close to Xavier {expected_std}\"\n", + "\n", + " # Test bias initialization (should be zeros)\n", + " assert np.allclose(layer.bias.data, 0), \"Bias should be initialized to zeros\"\n", + "\n", + " # Test forward pass\n", + " x = Tensor(np.random.randn(32, 784)) # Batch of 32 samples\n", + " y = layer.forward(x)\n", + " assert y.shape == (32, 256), f\"Expected shape (32, 256), got {y.shape}\"\n", + "\n", + " # Test no bias option\n", + " layer_no_bias = Linear(10, 5, bias=False)\n", + " assert layer_no_bias.bias is None\n", + " params = layer_no_bias.parameters()\n", + " assert len(params) == 1 # Only weight, no bias\n", + "\n", + " # Test parameters method\n", + " params = layer.parameters()\n", + " assert len(params) == 2 # Weight and bias\n", + " assert params[0] is layer.weight\n", + " assert params[1] is layer.bias\n", + "\n", + " print(\"✅ Linear layer works correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_linear_layer()\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "ef5efbc5", 
+ "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### 🎲 Dropout Layer - Preventing Overfitting\n", + "\n", + "Dropout is a regularization technique that randomly \"turns off\" neurons during training. This forces the network to not rely too heavily on any single neuron, making it more robust and generalizable.\n", + "\n", + "### Why Dropout Matters\n", + "\n", + "**The Problem**: Neural networks can memorize training data instead of learning generalizable patterns. This leads to poor performance on new, unseen data.\n", + "\n", + "**The Solution**: Dropout randomly zeros out neurons, forcing the network to learn multiple independent ways to solve the problem.\n", + "\n", + "### Dropout in Action\n", + "```\n", + "Training Mode (p=0.5 dropout):\n", + "Input: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]\n", + " ↓ Random mask with 50% survival rate\n", + "Mask: [1, 0, 1, 0, 1, 1, 0, 1 ]\n", + " ↓ Apply mask and scale by 1/(1-p) = 2.0\n", + "Output: [2.0, 0.0, 6.0, 0.0, 10.0, 12.0, 0.0, 16.0]\n", + "\n", + "Inference Mode (no dropout):\n", + "Input: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]\n", + " ↓ Pass through unchanged\n", + "Output: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]\n", + "```\n", + "\n", + "### Training vs Inference Behavior\n", + "```\n", + " Training Mode Inference Mode\n", + " ┌─────────────────┐ ┌─────────────────┐\n", + "Input Features │ [×] [ ] [×] [×] │ │ [×] [×] [×] [×] │\n", + " │ Active Dropped │ → │ All Active │\n", + " │ Active Active │ │ │\n", + " └─────────────────┘ └─────────────────┘\n", + " ↓ ↓\n", + " \"Learn robustly\" \"Use all knowledge\"\n", + "```\n", + "\n", + "### Memory and Performance\n", + "```\n", + "Dropout Memory Usage:\n", + "┌─────────────────────────────┐\n", + "│ Input Tensor: X MB │\n", + "├─────────────────────────────┤\n", + "│ Random Mask: X/4 MB │ (boolean mask, 1 byte/element)\n", + "├─────────────────────────────┤\n", + "│ Output Tensor: X MB │\n", + 
"└─────────────────────────────┘\n", + " Total: ~2.25X MB peak memory\n", + "\n", + "Computational Overhead: Minimal (element-wise operations)\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ac9c2688", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "dropout-layer", + "solution": true + } + }, + "outputs": [], + "source": [ + "class Dropout:\n", + " \"\"\"\n", + " Dropout layer for regularization.\n", + "\n", + " During training: randomly zeros elements with probability p and scales\n", + " survivors by 1/(1-p) to maintain the expected value (inverted dropout)\n", + " During inference: passes the input through unchanged\n", + "\n", + " This prevents overfitting by forcing the network to not rely on specific neurons.\n", + " \"\"\"\n", + "\n", + " def __init__(self, p=0.5):\n", + " \"\"\"\n", + " Initialize dropout layer.\n", + "\n", + " TODO: Store dropout probability\n", + "\n", + " Args:\n", + " p: Probability of zeroing each element (0.0 = no dropout, 1.0 = zero everything)\n", + "\n", + " EXAMPLE:\n", + " >>> dropout = Dropout(0.5) # Zero 50% of elements during training\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " if not 0.0 <= p <= 1.0:\n", + " raise ValueError(f\"Dropout probability must be between 0 and 1, got {p}\")\n", + " self.p = p\n", + " ### END SOLUTION\n", + "\n", + " def forward(self, x, training=True):\n", + " \"\"\"\n", + " Forward pass through dropout layer.\n", + "\n", + " TODO: Apply dropout during training, pass through during inference\n", + "\n", + " APPROACH:\n", + " 1. If not training, return input unchanged\n", + " 2. If training, create random mask with probability (1-p)\n", + " 3. Multiply input by mask and scale by 1/(1-p)\n", + " 4. 
Return result as new Tensor\n", + "\n", + " EXAMPLE:\n", + " >>> dropout = Dropout(0.5)\n", + " >>> x = Tensor([1, 2, 3, 4])\n", + " >>> y_train = dropout.forward(x, training=True) # Some elements zeroed\n", + " >>> y_eval = dropout.forward(x, training=False) # All elements preserved\n", + "\n", + " HINTS:\n", + " - Use np.random.random() < keep_prob for mask\n", + " - Scale by 1/(1-p) to maintain expected value\n", + " - training=False should return input unchanged\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " if not training or self.p == 0.0:\n", + " # During inference or no dropout, pass through unchanged\n", + " return x\n", + "\n", + " if self.p == 1.0:\n", + " # Drop everything\n", + " return Tensor(np.zeros_like(x.data))\n", + "\n", + " # During training, apply dropout\n", + " keep_prob = 1.0 - self.p\n", + "\n", + " # Create random mask: True where we keep elements\n", + " mask = np.random.random(x.data.shape) < keep_prob\n", + "\n", + " # Apply mask and scale to maintain expected value\n", + " output_data = (x.data * mask) / keep_prob\n", + " return Tensor(output_data)\n", + " ### END SOLUTION\n", + "\n", + " def parameters(self):\n", + " \"\"\"Dropout has no parameters.\"\"\"\n", + " return []\n", + "\n", + " def __repr__(self):\n", + " return f\"Dropout(p={self.p})\"" + ] + }, + { + "cell_type": "markdown", + "id": "a0524fdd", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### 🔬 Unit Test: Dropout Layer\n", + "This test validates our Dropout layer implementation works correctly.\n", + "**What we're testing**: Training vs inference behavior, probability scaling, randomness\n", + "**Why it matters**: Essential for preventing overfitting in neural networks\n", + "**Expected**: Correct masking during training, passthrough during inference" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f228eb90", + "metadata": { + "nbgrader": { + "grade": true, + "grade_id": "test-dropout", + "locked": 
true, + "points": 10 + } + }, + "outputs": [], + "source": [ + "def test_unit_dropout_layer():\n", + " \"\"\"🔬 Test Dropout layer implementation.\"\"\"\n", + " print(\"🔬 Unit Test: Dropout Layer...\")\n", + "\n", + " # Test dropout creation\n", + " dropout = Dropout(0.5)\n", + " assert dropout.p == 0.5\n", + "\n", + " # Test inference mode (should pass through unchanged)\n", + " x = Tensor([1, 2, 3, 4])\n", + " y_inference = dropout.forward(x, training=False)\n", + " assert np.array_equal(x.data, y_inference.data), \"Inference should pass through unchanged\"\n", + "\n", + " # Test training mode with zero dropout (should pass through unchanged)\n", + " dropout_zero = Dropout(0.0)\n", + " y_zero = dropout_zero.forward(x, training=True)\n", + " assert np.array_equal(x.data, y_zero.data), \"Zero dropout should pass through unchanged\"\n", + "\n", + " # Test training mode with full dropout (should zero everything)\n", + " dropout_full = Dropout(1.0)\n", + " y_full = dropout_full.forward(x, training=True)\n", + " assert np.allclose(y_full.data, 0), \"Full dropout should zero everything\"\n", + "\n", + " # Test training mode with partial dropout\n", + " # Note: This is probabilistic, so we test statistical properties\n", + " np.random.seed(42) # For reproducible test\n", + " x_large = Tensor(np.ones((1000,))) # Large tensor for statistical significance\n", + " y_train = dropout.forward(x_large, training=True)\n", + "\n", + " # Count non-zero elements (approximately 50% should survive)\n", + " non_zero_count = np.count_nonzero(y_train.data)\n", + " expected_survival = 1000 * 0.5\n", + " # Allow 10% tolerance for randomness\n", + " assert 0.4 * 1000 < non_zero_count < 0.6 * 1000, f\"Expected ~500 survivors, got {non_zero_count}\"\n", + "\n", + " # Test scaling (surviving elements should be scaled by 1/(1-p) = 2.0)\n", + " surviving_values = y_train.data[y_train.data != 0]\n", + " expected_value = 2.0 # 1.0 / (1 - 0.5)\n", + " assert np.allclose(surviving_values, 
expected_value), f\"Surviving values should be {expected_value}\"\n", + "\n", + " # Test no parameters\n", + " params = dropout.parameters()\n", + " assert len(params) == 0, \"Dropout should have no parameters\"\n", + "\n", + " # Test invalid probability\n", + " try:\n", + " Dropout(-0.1)\n", + " assert False, \"Should raise ValueError for negative probability\"\n", + " except ValueError:\n", + " pass\n", + "\n", + " try:\n", + " Dropout(1.1)\n", + " assert False, \"Should raise ValueError for probability > 1\"\n", + " except ValueError:\n", + " pass\n", + "\n", + " print(\"✅ Dropout layer works correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_dropout_layer()" + ] + }, + { + "cell_type": "markdown", + "id": "c92a7889", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 2 + }, + "source": [ + "## 4. Integration: Bringing It Together\n", + "\n", + "Now that we've built both layer types, let's see how they work together to create a complete neural network architecture. 
We'll manually compose a realistic 3-layer MLP for MNIST digit classification.\n", + "\n", + "### Network Architecture Visualization\n", + "```\n", + "MNIST Classification Network (3-Layer MLP):\n", + "\n", + " Input Layer Hidden Layer 1 Hidden Layer 2 Output Layer\n", + "┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐\n", + "│ 784 │ │ 256 │ │ 128 │ │ 10 │\n", + "│ Pixels │───▶│ Features │───▶│ Features │───▶│ Classes │\n", + "│ (28×28 image) │ │ + ReLU │ │ + ReLU │ │ (0-9 digits) │\n", + "│ │ │ + Dropout │ │ + Dropout │ │ │\n", + "└─────────────────┘ └─────────────────┘ └─────────────────┘ └─────────────────┘\n", + " ↓ ↓ ↓ ↓\n", + " \"Raw pixels\" \"Edge detectors\" \"Shape detectors\" \"Digit classifier\"\n", + "\n", + "Data Flow:\n", + "[32, 784] → Linear(784,256) → ReLU → Dropout(0.5) → Linear(256,128) → ReLU → Dropout(0.3) → Linear(128,10) → [32, 10]\n", + "```\n", + "\n", + "### Parameter Count Analysis\n", + "```\n", + "Parameter Breakdown (Manual Layer Composition):\n", + "┌─────────────────────────────────────────────────────────────┐\n", + "│ layer1 = Linear(784 → 256) │\n", + "│ Weights: 784 × 256 = 200,704 params │\n", + "│ Bias: 256 params │\n", + "│ Subtotal: 200,960 params │\n", + "├─────────────────────────────────────────────────────────────┤\n", + "│ activation1 = ReLU(), dropout1 = Dropout(0.5) │\n", + "│ Parameters: 0 (no learnable weights) │\n", + "├─────────────────────────────────────────────────────────────┤\n", + "│ layer2 = Linear(256 → 128) │\n", + "│ Weights: 256 × 128 = 32,768 params │\n", + "│ Bias: 128 params │\n", + "│ Subtotal: 32,896 params │\n", + "├─────────────────────────────────────────────────────────────┤\n", + "│ activation2 = ReLU(), dropout2 = Dropout(0.3) │\n", + "│ Parameters: 0 (no learnable weights) │\n", + "├─────────────────────────────────────────────────────────────┤\n", + "│ layer3 = Linear(128 → 10) │\n", + "│ Weights: 128 × 10 = 1,280 params │\n", + "│ Bias: 10 params │\n", + "│ 
Subtotal: 1,290 params │\n", + "└─────────────────────────────────────────────────────────────┘\n", + " TOTAL: 235,146 parameters\n", + " Memory: ~940 KB (float32)\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "a41674af", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "## 5. Systems Analysis: Memory and Performance\n", + "\n", + "Now let's analyze the systems characteristics of our layer implementations. Understanding memory usage and computational complexity helps us build efficient neural networks.\n", + "\n", + "### Memory Analysis Overview\n", + "```\n", + "Layer Memory Components:\n", + "┌─────────────────────────────────────────────────────────────┐\n", + "│ PARAMETER MEMORY │\n", + "├─────────────────────────────────────────────────────────────┤\n", + "│ • Weights: Persistent, shared across batches │\n", + "│ • Biases: Small but necessary for output shifting │\n", + "│ • Total: Grows with network width and depth │\n", + "├─────────────────────────────────────────────────────────────┤\n", + "│ ACTIVATION MEMORY │\n", + "├─────────────────────────────────────────────────────────────┤\n", + "│ • Input tensors: batch_size × features × 4 bytes │\n", + "│ • Output tensors: batch_size × features × 4 bytes │\n", + "│ • Intermediate results during forward pass │\n", + "│ • Total: Grows with batch size and layer width │\n", + "├─────────────────────────────────────────────────────────────┤\n", + "│ TEMPORARY MEMORY │\n", + "├─────────────────────────────────────────────────────────────┤\n", + "│ • Dropout masks: batch_size × features × 1 byte │\n", + "│ • Computation buffers for matrix operations │\n", + "│ • Total: Peak during forward/backward passes │\n", + "└─────────────────────────────────────────────────────────────┘\n", + "```\n", + "\n", + "### Computational Complexity Overview\n", + "```\n", + "Layer Operation Complexity:\n", + "┌─────────────────────────────────────────────────────────────┐\n", + "│ 
Linear Layer Forward Pass: │\n", + "│ Matrix Multiply: O(batch × in_features × out_features) │\n", + "│ Bias Addition: O(batch × out_features) │\n", + "│ Dominant: Matrix multiplication │\n", + "├─────────────────────────────────────────────────────────────┤\n", + "│ Multi-layer Forward Pass: │\n", + "│ Sum of all layer complexities │\n", + "│ Memory: Peak of all intermediate activations │\n", + "├─────────────────────────────────────────────────────────────┤\n", + "│ Dropout Forward Pass: │\n", + "│ Mask Generation: O(elements) │\n", + "│ Element-wise Multiply: O(elements) │\n", + "│ Overhead: Minimal compared to linear layers │\n", + "└─────────────────────────────────────────────────────────────┘\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "07d5bfe3", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "analyze-layer-memory", + "solution": true + } + }, + "outputs": [], + "source": [ + "def analyze_layer_memory():\n", + " \"\"\"📊 Analyze memory usage patterns in layer operations.\"\"\"\n", + " print(\"📊 Analyzing Layer Memory Usage...\")\n", + "\n", + " # Test different layer sizes\n", + " layer_configs = [\n", + " (784, 256), # MNIST → hidden\n", + " (256, 256), # Hidden → hidden\n", + " (256, 10), # Hidden → output\n", + " (2048, 2048), # Large hidden\n", + " ]\n", + "\n", + " print(\"\\nLinear Layer Memory Analysis:\")\n", + " print(\"Configuration → Weight Memory → Bias Memory → Total Memory\")\n", + "\n", + " for in_feat, out_feat in layer_configs:\n", + " # Calculate memory usage\n", + " weight_memory = in_feat * out_feat * 4 # 4 bytes per float32\n", + " bias_memory = out_feat * 4\n", + " total_memory = weight_memory + bias_memory\n", + "\n", + " print(f\"({in_feat:4d}, {out_feat:4d}) → {weight_memory/1024:7.1f} KB → {bias_memory/1024:6.1f} KB → {total_memory/1024:7.1f} KB\")\n", + "\n", + " # Analyze multi-layer memory scaling\n", + " print(\"\\n💡 Multi-layer Model Memory 
Scaling:\")\n", + " hidden_sizes = [128, 256, 512, 1024, 2048]\n", + "\n", + " for hidden_size in hidden_sizes:\n", + " # 3-layer MLP: 784 → hidden → hidden/2 → 10\n", + " layer1_params = 784 * hidden_size + hidden_size\n", + " layer2_params = hidden_size * (hidden_size // 2) + (hidden_size // 2)\n", + " layer3_params = (hidden_size // 2) * 10 + 10\n", + "\n", + " total_params = layer1_params + layer2_params + layer3_params\n", + " memory_mb = total_params * 4 / (1024 * 1024)\n", + "\n", + " print(f\"Hidden={hidden_size:4d}: {total_params:7,} params = {memory_mb:5.1f} MB\")\n", + "\n", + "# Analysis will be run in main block" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1f70ed2f", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "analyze-layer-performance", + "solution": true + } + }, + "outputs": [], + "source": [ + "def analyze_layer_performance():\n", + " \"\"\"📊 Analyze computational complexity of layer operations.\"\"\"\n", + " print(\"📊 Analyzing Layer Computational Complexity...\")\n", + "\n", + " # Test forward pass FLOPs\n", + " batch_sizes = [1, 32, 128, 512]\n", + " layer = Linear(784, 256)\n", + "\n", + " print(\"\\nLinear Layer FLOPs Analysis:\")\n", + " print(\"Batch Size → Matrix Multiply FLOPs → Bias Add FLOPs → Total FLOPs\")\n", + "\n", + " for batch_size in batch_sizes:\n", + " # Matrix multiplication: (batch, in) @ (in, out) = batch * in * out FLOPs\n", + " matmul_flops = batch_size * 784 * 256\n", + " # Bias addition: batch * out FLOPs\n", + " bias_flops = batch_size * 256\n", + " total_flops = matmul_flops + bias_flops\n", + "\n", + " print(f\"{batch_size:10d} → {matmul_flops:15,} → {bias_flops:13,} → {total_flops:11,}\")\n", + "\n", + " print(\"\\n💡 Key Insights:\")\n", + " print(\"🚀 Linear layer complexity: O(batch_size × in_features × out_features)\")\n", + " print(\"🚀 Memory grows linearly with batch size, quadratically with layer width\")\n", + " print(\"🚀 Dropout adds 
minimal computational overhead (element-wise operations)\")\n", + "\n", + "# Analysis will be run in main block" + ] + }, + { + "cell_type": "markdown", + "id": "ab500718", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "## 🧪 Module Integration Test\n", + "\n", + "Final validation that everything works together correctly." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "58f4b2ab", + "metadata": { + "lines_to_next_cell": 2, + "nbgrader": { + "grade": true, + "grade_id": "module-integration", + "locked": true, + "points": 20 + } + }, + "outputs": [], + "source": [ + "def test_module():\n", + " \"\"\"\n", + " Comprehensive test of entire module functionality.\n", + "\n", + " This final test runs before module summary to ensure:\n", + " - All unit tests pass\n", + " - Functions work together correctly\n", + " - Module is ready for integration with TinyTorch\n", + " \"\"\"\n", + " print(\"🧪 RUNNING MODULE INTEGRATION TEST\")\n", + " print(\"=\" * 50)\n", + "\n", + " # Run all unit tests\n", + " print(\"Running unit tests...\")\n", + " test_unit_linear_layer()\n", + " test_unit_dropout_layer()\n", + "\n", + " print(\"\\nRunning integration scenarios...\")\n", + "\n", + " # Test realistic neural network construction with manual composition\n", + " print(\"🔬 Integration Test: Multi-layer Network...\")\n", + "\n", + " # Import real activation from module 02\n", + " sys.path.append(os.path.join(os.path.dirname(__file__), '..', '02_activations'))\n", + " from activations_dev import ReLU\n", + "\n", + " # Build individual layers for manual composition\n", + " layer1 = Linear(784, 128)\n", + " activation1 = ReLU()\n", + " dropout1 = Dropout(0.5)\n", + " layer2 = Linear(128, 64)\n", + " activation2 = ReLU()\n", + " dropout2 = Dropout(0.3)\n", + " layer3 = Linear(64, 10)\n", + "\n", + " # Test end-to-end forward pass with manual composition\n", + " batch_size = 16\n", + " x = Tensor(np.random.randn(batch_size, 
784))\n", + "\n", + " # Manual forward pass\n", + " x = layer1.forward(x)\n", + " x = activation1.forward(x)\n", + " x = dropout1.forward(x)\n", + " x = layer2.forward(x)\n", + " x = activation2.forward(x)\n", + " x = dropout2.forward(x)\n", + " output = layer3.forward(x)\n", + "\n", + " assert output.shape == (batch_size, 10), f\"Expected output shape ({batch_size}, 10), got {output.shape}\"\n", + "\n", + " # Test parameter counting from individual layers\n", + " all_params = layer1.parameters() + layer2.parameters() + layer3.parameters()\n", + " expected_params = 6 # 3 weights + 3 biases from 3 Linear layers\n", + " assert len(all_params) == expected_params, f\"Expected {expected_params} parameters, got {len(all_params)}\"\n", + "\n", + " # Test all parameters have requires_grad=True\n", + " for param in all_params:\n", + " assert param.requires_grad == True, \"All parameters should have requires_grad=True\"\n", + "\n", + " # Test individual layer functionality\n", + " test_x = Tensor(np.random.randn(4, 784))\n", + " # Test dropout in training vs inference\n", + " dropout_test = Dropout(0.5)\n", + " train_output = dropout_test.forward(test_x, training=True)\n", + " infer_output = dropout_test.forward(test_x, training=False)\n", + " assert np.array_equal(test_x.data, infer_output.data), \"Inference mode should pass through unchanged\"\n", + "\n", + " print(\"✅ Multi-layer network integration works!\")\n", + "\n", + " print(\"\\n\" + \"=\" * 50)\n", + " print(\"🎉 ALL TESTS PASSED! Module ready for export.\")\n", + " print(\"Run: tito module complete 03_layers\")\n", + "\n", + "# Run comprehensive module test\n", + "if __name__ == \"__main__\":\n", + " test_module()" + ] + }, + { + "cell_type": "markdown", + "id": "4c84f921", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 🎯 MODULE SUMMARY: Layers\n", + "\n", + "Congratulations! 
You've built the fundamental building blocks that make neural networks possible!\n", + "\n", + "### Key Accomplishments\n", + "- Built Linear layers with proper Xavier initialization and parameter management\n", + "- Created Dropout layers for regularization with training/inference mode handling\n", + "- Demonstrated manual layer composition for building neural networks\n", + "- Analyzed memory scaling and computational complexity of layer operations\n", + "- All tests pass ✅ (validated by `test_module()`)\n", + "\n", + "### Ready for Next Steps\n", + "Your layer implementation enables building complete neural networks! The Linear layer provides learnable transformations, manual composition chains them together, and Dropout prevents overfitting.\n", + "\n", + "Export with: `tito module complete 03_layers`\n", + "\n", + "**Next**: Module 04 will add loss functions (CrossEntropyLoss, MSELoss) that measure how wrong your model is - the foundation for learning!" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/modules/03_layers/layers_dev.py b/modules/source/03_layers/layers_dev.py similarity index 99% rename from modules/03_layers/layers_dev.py rename to modules/source/03_layers/layers_dev.py index f6fd1578..166e9de6 100644 --- a/modules/03_layers/layers_dev.py +++ b/modules/source/03_layers/layers_dev.py @@ -59,15 +59,19 @@ from tinytorch.core.activations import ReLU, Sigmoid # Module 02 - intelligence # %% nbgrader={"grade": false, "grade_id": "imports", "solution": true} #| default_exp core.layers +#| export import numpy as np import sys import os -# Import the proper Tensor class from Module 01 +# Import dependencies from other modules sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor')) from tensor_dev import Tensor +sys.path.append(os.path.join(os.path.dirname(__file__), '..', 
'02_activations')) +from activations_dev import ReLU, Sigmoid + # %% [markdown] """ ## 1. Introduction: What are Neural Network Layers? diff --git a/modules/source/04_losses/losses_dev.ipynb b/modules/source/04_losses/losses_dev.ipynb new file mode 100644 index 00000000..2248aaad --- /dev/null +++ b/modules/source/04_losses/losses_dev.ipynb @@ -0,0 +1,1618 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "799fd9d0", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "# Module 04: Losses - Measuring How Wrong We Are\n", + "\n", + "Welcome to Module 04! Today you'll implement the mathematical functions that measure how wrong your model's predictions are - the essential feedback signal that enables all machine learning.\n", + "\n", + "## 🔗 Prerequisites & Progress\n", + "**You've Built**: Tensors (data), Activations (intelligence), Layers (architecture)\n", + "**You'll Build**: Loss functions that measure prediction quality\n", + "**You'll Enable**: The feedback signal needed for training (Module 05: Autograd)\n", + "\n", + "**Connection Map**:\n", + "```\n", + "Layers → Losses → Autograd\n", + "(predictions) (error measurement) (learning signals)\n", + "```\n", + "\n", + "## Learning Objectives\n", + "By the end of this module, you will:\n", + "1. Implement MSELoss for regression problems\n", + "2. Implement CrossEntropyLoss for classification problems\n", + "3. Implement BinaryCrossEntropyLoss for binary classification\n", + "4. Understand numerical stability in loss computation\n", + "5. Test all loss functions with realistic examples\n", + "\n", + "Let's measure prediction quality!" 
+ ] + }, + { + "cell_type": "markdown", + "id": "c9c24237", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 📦 Where This Code Lives in the Final Package\n", + "\n", + "**Learning Side:** You work in modules/04_losses/losses_dev.py\n", + "**Building Side:** Code exports to tinytorch.core.losses\n", + "\n", + "```python\n", + "# Final package structure:\n", + "from tinytorch.core.losses import MSELoss, CrossEntropyLoss, BinaryCrossEntropyLoss, log_softmax # This module\n", + "from tinytorch.core.tensor import Tensor # Foundation\n", + "from tinytorch.core.layers import Linear, Sequential # What makes predictions\n", + "```\n", + "\n", + "**Why this matters:**\n", + "- **Learning:** Complete loss function system in one focused module\n", + "- **Production:** Proper organization like PyTorch's torch.nn functional losses\n", + "- **Consistency:** All loss computations and numerical stability in core.losses\n", + "- **Integration:** Works seamlessly with layers for complete prediction-to-error workflow" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3b801276", + "metadata": { + "nbgrader": { + "grade": false, + "grade_id": "imports", + "solution": true + } + }, + "outputs": [], + "source": [ + "#| default_exp core.losses\n", + "\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "import time\n", + "from typing import Optional\n", + "\n", + "# Import from previous modules\n", + "### BEGIN SOLUTION\n", + "import sys\n", + "import os\n", + "sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor'))\n", + "from tensor_dev import Tensor\n", + "### END SOLUTION" + ] + }, + { + "cell_type": "markdown", + "id": "a731547c", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "# Part 1: Introduction - What Are Loss Functions?\n", + "\n", + "Loss functions are the mathematical conscience of machine learning. 
They measure the distance between what your model predicts and what actually happened. Without loss functions, models have no way to improve - they're like athletes training without knowing their score.\n", + "\n", + "## The Three Essential Loss Functions\n", + "\n", + "Think of loss functions as different ways to measure \"wrongness\" - each optimized for different types of problems:\n", + "\n", + "**MSELoss (Mean Squared Error)**: \"How far off are my continuous predictions?\"\n", + "- Used for: Regression (predicting house prices, temperature, stock values)\n", + "- Calculation: Average of squared differences between predictions and targets\n", + "- Properties: Heavily penalizes large errors, smooth gradients\n", + "\n", + "```\n", + "Loss Landscape for MSE:\n", + " Loss\n", + " ^\n", + " |\n", + " 4 |*       *\n", + " | \\     /\n", + " 2 |  \\   /\n", + " |   \\ /\n", + " 0 |____\\_/____> Prediction Error\n", + "    -2  0  +2\n", + "\n", + "Quadratic growth: small errors → small penalty, large errors → huge penalty\n", + "```\n", + "\n", + "**CrossEntropyLoss**: \"How confident am I in the wrong class?\"\n", + "- Used for: Multi-class classification (image recognition, text classification)\n", + "- Calculation: Negative log-likelihood of correct class probability\n", + "- Properties: Encourages confident correct predictions, punishes confident wrong ones\n", + "\n", + "```\n", + "Cross-Entropy Penalty Curve:\n", + " Loss\n", + " ^\n", + " 10 |*\n", + " ||\n", + " 5 | \\\n", + " | \\\n", + " 2 | \\\n", + " | \\\n", + " 0 |_____\\\\____> Predicted Probability of Correct Class\n", + " 0 0.5 1.0\n", + "\n", + "Logarithmic: wrong confident predictions get severe penalty\n", + "```\n", + "\n", + "**BinaryCrossEntropyLoss**: \"How wrong am I about yes/no decisions?\"\n", + "- Used for: Binary classification (spam detection, medical diagnosis)\n", + "- Calculation: Cross-entropy specialized for two classes\n", + "- Properties: Symmetric penalty for false positives and false 
negatives\n", + "\n", + "```\n", + "Binary Decision Boundary:\n", + " Target=1 (Positive) Target=0 (Negative)\n", + " ┌─────────────────┬─────────────────┐\n", + " │ Pred → 1.0 │ Pred → 1.0 │\n", + " │ Loss → 0 │ Loss → ∞ │\n", + " ├─────────────────┼─────────────────┤\n", + " │ Pred → 0.0 │ Pred → 0.0 │\n", + " │ Loss → ∞ │ Loss → 0 │\n", + " └─────────────────┴─────────────────┘\n", + "```\n", + "\n", + "Each loss function creates a different \"error landscape\" that guides learning in different ways." + ] + }, + { + "cell_type": "markdown", + "id": "c52f716b", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "# Part 2: Mathematical Foundations\n", + "\n", + "## Mean Squared Error (MSE)\n", + "The foundation of regression, MSE measures the average squared distance between predictions and targets:\n", + "\n", + "```\n", + "MSE = (1/N) * Σ(prediction_i - target_i)²\n", + "```\n", + "\n", + "**Why square the differences?**\n", + "- Makes all errors positive (no cancellation between positive/negative errors)\n", + "- Heavily penalizes large errors (error of 2 becomes 4, error of 10 becomes 100)\n", + "- Creates smooth gradients for optimization\n", + "\n", + "## Cross-Entropy Loss\n", + "For classification, we need to measure how wrong our probability distributions are:\n", + "\n", + "```\n", + "CrossEntropy = -Σ target_i * log(prediction_i)\n", + "```\n", + "\n", + "**The Log-Sum-Exp Trick**:\n", + "Computing softmax directly can cause numerical overflow. 
The log-sum-exp trick provides stability:\n", + "```\n", + "log_softmax(x) = x - log(Σ exp(x_i))\n", + " = x - max(x) - log(Σ exp(x_i - max(x)))\n", + "```\n", + "\n", + "This prevents exp(large_number) from exploding to infinity.\n", + "\n", + "## Binary Cross-Entropy\n", + "A specialized case where we have only two classes:\n", + "```\n", + "BCE = -(target * log(prediction) + (1-target) * log(1-prediction))\n", + "```\n", + "\n", + "The mathematics naturally handles both \"positive\" and \"negative\" cases in a single formula." + ] + }, + { + "cell_type": "markdown", + "id": "ea0cf3d9", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "# Part 3: Implementation - Building Loss Functions\n", + "\n", + "Let's implement our loss functions with proper numerical stability and clear educational structure." + ] + }, + { + "cell_type": "markdown", + "id": "6f8b3c7d", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "## Log-Softmax - The Numerically Stable Foundation\n", + "\n", + "Before implementing loss functions, we need a reliable way to compute log-softmax. 
This function is the numerically stable backbone of classification losses.\n", + "\n", + "### Why Log-Softmax Matters\n", + "\n", + "Naive softmax can explode with large numbers:\n", + "```\n", + "Naive approach:\n", + " logits = [100, 200, 300]\n", + " exp(300) = 1.97 × 10^130 ← This breaks computers!\n", + "\n", + "Stable approach:\n", + " max_logit = 300\n", + " shifted = [-200, -100, 0] ← Subtract max\n", + " exp(0) = 1.0 ← Manageable numbers\n", + "```\n", + "\n", + "### The Log-Sum-Exp Trick Visualization\n", + "\n", + "```\n", + "Original Computation: Stable Computation:\n", + "\n", + "logits: [a, b, c] logits: [a, b, c]\n", + " ↓ ↓\n", + "exp(logits) max_val = max(a,b,c)\n", + " ↓ ↓\n", + "sum(exp(logits)) shifted = [a-max, b-max, c-max]\n", + " ↓ ↓\n", + "log(sum) exp(shifted) ← All ≤ 1.0\n", + " ↓ ↓\n", + "logits - log(sum) sum(exp(shifted))\n", + " ↓\n", + " log(sum) + max_val\n", + " ↓\n", + " logits - (log(sum) + max_val)\n", + "```\n", + "\n", + "Both give the same result, but the stable version never overflows!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3b8c8908", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "log_softmax", + "solution": true + } + }, + "outputs": [], + "source": [ + "def log_softmax(x: Tensor, dim: int = -1) -> Tensor:\n", + " \"\"\"\n", + " Compute log-softmax with numerical stability.\n", + "\n", + " TODO: Implement numerically stable log-softmax using the log-sum-exp trick\n", + "\n", + " APPROACH:\n", + " 1. Find maximum along dimension (for stability)\n", + " 2. Subtract max from input (prevents overflow)\n", + " 3. Compute log(sum(exp(shifted_input)))\n", + " 4. 
Return input - max - log_sum_exp\n", + "\n", + " EXAMPLE:\n", + " >>> logits = Tensor([[1.0, 2.0, 3.0], [0.1, 0.2, 0.9]])\n", + " >>> result = log_softmax(logits, dim=-1)\n", + " >>> print(result.shape)\n", + " (2, 3)\n", + "\n", + " HINT: Use np.max(x.data, axis=dim, keepdims=True) to preserve dimensions\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # Step 1: Find max along dimension for numerical stability\n", + " max_vals = np.max(x.data, axis=dim, keepdims=True)\n", + "\n", + " # Step 2: Subtract max to prevent overflow\n", + " shifted = x.data - max_vals\n", + "\n", + " # Step 3: Compute log(sum(exp(shifted)))\n", + " log_sum_exp = np.log(np.sum(np.exp(shifted), axis=dim, keepdims=True))\n", + "\n", + " # Step 4: Return log_softmax = input - max - log_sum_exp\n", + " result = x.data - max_vals - log_sum_exp\n", + "\n", + " return Tensor(result)\n", + " ### END SOLUTION" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3c07efde", + "metadata": { + "nbgrader": { + "grade": true, + "grade_id": "test_log_softmax", + "locked": true, + "points": 10 + } + }, + "outputs": [], + "source": [ + "def test_unit_log_softmax():\n", + " \"\"\"🔬 Test log_softmax numerical stability and correctness.\"\"\"\n", + " print(\"🔬 Unit Test: Log-Softmax...\")\n", + "\n", + " # Test basic functionality\n", + " x = Tensor([[1.0, 2.0, 3.0], [0.1, 0.2, 0.9]])\n", + " result = log_softmax(x, dim=-1)\n", + "\n", + " # Verify shape preservation\n", + " assert result.shape == x.shape, f\"Shape mismatch: expected {x.shape}, got {result.shape}\"\n", + "\n", + " # Verify log-softmax properties: exp(log_softmax) should sum to 1\n", + " softmax_result = np.exp(result.data)\n", + " row_sums = np.sum(softmax_result, axis=-1)\n", + " assert np.allclose(row_sums, 1.0, atol=1e-6), f\"Softmax doesn't sum to 1: {row_sums}\"\n", + "\n", + " # Test numerical stability with large values\n", + " large_x = Tensor([[100.0, 101.0, 102.0]])\n", + " large_result = log_softmax(large_x, 
dim=-1)\n", + " assert not np.any(np.isnan(large_result.data)), \"NaN values in result with large inputs\"\n", + " assert not np.any(np.isinf(large_result.data)), \"Inf values in result with large inputs\"\n", + "\n", + " print(\"✅ log_softmax works correctly with numerical stability!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_log_softmax()" + ] + }, + { + "cell_type": "markdown", + "id": "32214d29", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "## MSELoss - Measuring Continuous Prediction Quality\n", + "\n", + "Mean Squared Error is the workhorse of regression problems. It measures how far your continuous predictions are from the true values.\n", + "\n", + "### When to Use MSE\n", + "\n", + "**Perfect for:**\n", + "- House price prediction ($200k vs $195k)\n", + "- Temperature forecasting (25°C vs 23°C)\n", + "- Stock price prediction ($150 vs $148)\n", + "- Any continuous value where \"distance\" matters\n", + "\n", + "### How MSE Shapes Learning\n", + "\n", + "```\n", + "Prediction vs Target Visualization:\n", + "\n", + "Target = 100\n", + "\n", + "Prediction: 80 90 95 100 105 110 120\n", + "Error: -20 -10 -5 0 +5 +10 +20\n", + "MSE: 400 100 25 0 25 100 400\n", + "\n", + "Loss Curve:\n", + " MSE\n", + " ^\n", + " 400 |* *\n", + " |\n", + " 100 | * *\n", + " | \\\n", + " 25 | * *\n", + " | \\\\ /\n", + " 0 |_____*_____> Prediction\n", + " 80 100 120\n", + "\n", + "Quadratic penalty: Large errors are MUCH more costly than small errors\n", + "```\n", + "\n", + "### Why Square the Errors?\n", + "\n", + "1. **Positive penalties**: (-10)² = 100, same as (+10)² = 100\n", + "2. **Heavy punishment for large errors**: Error of 20 → penalty of 400\n", + "3. **Smooth gradients**: Quadratic function has nice derivatives for optimization\n", + "4. 
**Statistical foundation**: Maximum likelihood for Gaussian noise\n", + "\n", + "### MSE vs Other Regression Losses\n", + "\n", + "```\n", + "Error Sensitivity Comparison:\n", + "\n", + " Error: -10 -5 0 +5 +10\n", + " MSE: 100 25 0 25 100 ← Quadratic growth\n", + " MAE: 10 5 0 5 10 ← Linear growth\n", + " Huber: 50 12.5 0 12.5 50 ← Hybrid approach\n", + "\n", + " MSE: More sensitive to outliers\n", + " MAE: More robust to outliers\n", + " Huber: Best of both worlds\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ca7d1772", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "mse_loss", + "solution": true + } + }, + "outputs": [], + "source": [ + "class MSELoss:\n", + " \"\"\"Mean Squared Error loss for regression tasks.\"\"\"\n", + "\n", + " def __init__(self):\n", + " \"\"\"Initialize MSE loss function.\"\"\"\n", + " pass\n", + "\n", + " def forward(self, predictions: Tensor, targets: Tensor) -> Tensor:\n", + " \"\"\"\n", + " Compute mean squared error between predictions and targets.\n", + "\n", + " TODO: Implement MSE loss calculation\n", + "\n", + " APPROACH:\n", + " 1. Compute difference: predictions - targets\n", + " 2. Square the differences: diff²\n", + " 3. 
Take mean across all elements\n", + "\n", + " EXAMPLE:\n", + " >>> loss_fn = MSELoss()\n", + " >>> predictions = Tensor([1.0, 2.0, 3.0])\n", + " >>> targets = Tensor([1.5, 2.5, 2.8])\n", + " >>> loss = loss_fn.forward(predictions, targets)\n", + " >>> print(f\"MSE Loss: {loss.data:.4f}\")\n", + " MSE Loss: 0.1800\n", + "\n", + " HINTS:\n", + " - Use (predictions.data - targets.data) for element-wise difference\n", + " - Square with **2 or np.power(diff, 2)\n", + " - Use np.mean() to average over all elements\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # Step 1: Compute element-wise difference\n", + " diff = predictions.data - targets.data\n", + "\n", + " # Step 2: Square the differences\n", + " squared_diff = diff ** 2\n", + "\n", + " # Step 3: Take mean across all elements\n", + " mse = np.mean(squared_diff)\n", + "\n", + " return Tensor(mse)\n", + " ### END SOLUTION\n", + "\n", + " def backward(self) -> Tensor:\n", + " \"\"\"\n", + " Compute gradients (implemented in Module 05: Autograd).\n", + "\n", + " For now, this is a stub that students can ignore.\n", + " \"\"\"\n", + " pass" ] }, { "cell_type": "code", "execution_count": null, "id": "38aa6b1d", "metadata": { "nbgrader": { "grade": true, "grade_id": "test_mse_loss", "locked": true, "points": 10 } }, "outputs": [], "source": [ "def test_unit_mse_loss():\n", + " \"\"\"🔬 Test MSELoss implementation and properties.\"\"\"\n", + " print(\"🔬 Unit Test: MSE Loss...\")\n", + "\n", + " loss_fn = MSELoss()\n", + "\n", + " # Test perfect predictions (loss should be 0)\n", + " predictions = Tensor([1.0, 2.0, 3.0])\n", + " targets = Tensor([1.0, 2.0, 3.0])\n", + " perfect_loss = loss_fn.forward(predictions, targets)\n", + " assert np.allclose(perfect_loss.data, 0.0, atol=1e-7), f\"Perfect predictions should have 0 loss, got {perfect_loss.data}\"\n", + "\n", + " # Test known case\n", + " predictions = Tensor([1.0, 2.0, 3.0])\n", + " targets = Tensor([1.5, 2.5, 2.8])\n", + " loss = 
loss_fn.forward(predictions, targets)\n", + "\n", + " # Manual calculation: ((1-1.5)² + (2-2.5)² + (3-2.8)²) / 3 = (0.25 + 0.25 + 0.04) / 3 = 0.18\n", + " expected_loss = (0.25 + 0.25 + 0.04) / 3\n", + " assert np.allclose(loss.data, expected_loss, atol=1e-6), f\"Expected {expected_loss}, got {loss.data}\"\n", + "\n", + " # Test that loss is always non-negative\n", + " random_pred = Tensor(np.random.randn(10))\n", + " random_target = Tensor(np.random.randn(10))\n", + " random_loss = loss_fn.forward(random_pred, random_target)\n", + " assert random_loss.data >= 0, f\"MSE loss should be non-negative, got {random_loss.data}\"\n", + "\n", + " print(\"✅ MSELoss works correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_mse_loss()" + ] + }, + { + "cell_type": "markdown", + "id": "ac7d3ea4", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "## CrossEntropyLoss - Measuring Classification Confidence\n", + "\n", + "Cross-entropy loss is the gold standard for multi-class classification. 
It measures how wrong your probability predictions are and heavily penalizes confident mistakes.\n", + "\n", + "### When to Use Cross-Entropy\n", + "\n", + "**Perfect for:**\n", + "- Image classification (cat, dog, bird)\n", + "- Text classification (spam, ham, promotion)\n", + "- Language modeling (next word prediction)\n", + "- Any problem with mutually exclusive classes\n", + "\n", + "### Understanding Cross-Entropy Through Examples\n", + "\n", + "```\n", + "Scenario: Image Classification (3 classes: cat, dog, bird)\n", + "\n", + "Case 1: Correct and Confident\n", + "Model Output (logits): [5.0, 1.0, 0.1] ← Very confident about \"cat\"\n", + "After Softmax: [0.975, 0.018, 0.007]\n", + "True Label: cat (class 0)\n", + "Loss: -log(0.975) = 0.03 ← Very low loss ✅\n", + "\n", + "Case 2: Correct but Uncertain\n", + "Model Output: [1.1, 1.0, 0.9] ← Uncertain between classes\n", + "After Softmax: [0.37, 0.33, 0.30]\n", + "True Label: cat (class 0)\n", + "Loss: -log(0.37) = 0.99 ← Higher loss (uncertainty penalized)\n", + "\n", + "Case 3: Wrong and Confident\n", + "Model Output: [0.1, 5.0, 1.0] ← Very confident about \"dog\"\n", + "After Softmax: [0.007, 0.975, 0.018]\n", + "True Label: cat (class 0)\n", + "Loss: -log(0.007) = 4.96 ← Very high loss ❌\n", + "```\n", + "\n", + "### Cross-Entropy's Learning Signal\n", + "\n", + "```\n", + "What Cross-Entropy Teaches the Model:\n", + "\n", + "┌─────────────────┬─────────────────┬─────────────────┐\n", + "│ Prediction │ True Label │ Learning Signal │\n", + "├─────────────────┼─────────────────┼─────────────────┤\n", + "│ Confident ✅ │ Correct ✅ │ \"Keep doing this\"│\n", + "│ Uncertain ⚠️ │ Correct ✅ │ \"Be more confident\"│\n", + "│ Confident ❌ │ Wrong ❌ │ \"STOP! 
Change everything\"│\n", + "│ Uncertain ⚠️ │ Wrong ❌ │ \"Learn the right answer\"│\n", + "└─────────────────┴─────────────────┴─────────────────┘\n", + "\n", + "Loss Landscape by Confidence:\n", + " Loss\n", + " ^\n", + " 5 |*\n", + " ||\n", + " 3 | *\n", + " | \\\n", + " 1 | *\n", + " | \\\\\n", + " 0 |______**____> Predicted Probability (correct class)\n", + " 0 0.5 1.0\n", + "\n", + "Message: \"Be confident when you're right!\"\n", + "```\n", + "\n", + "### Why Cross-Entropy Works So Well\n", + "\n", + "1. **Probabilistic interpretation**: Measures quality of probability distributions\n", + "2. **Strong gradients**: Large penalty for confident mistakes drives fast learning\n", + "3. **Smooth optimization**: Log function provides nice gradients\n", + "4. **Information theory**: Minimizes \"surprise\" about correct answers\n", + "\n", + "### Multi-Class vs Binary Classification\n", + "\n", + "```\n", + "Multi-Class (3+ classes): Binary (2 classes):\n", + "\n", + "Classes: [cat, dog, bird] Classes: [spam, not_spam]\n", + "Output: [0.7, 0.2, 0.1] Output: 0.8 (spam probability)\n", + "Must sum to 1.0 ✅ Must be between 0 and 1 ✅\n", + "Uses: CrossEntropyLoss Uses: BinaryCrossEntropyLoss\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a8513163", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "cross_entropy_loss", + "solution": true + } + }, + "outputs": [], + "source": [ + "class CrossEntropyLoss:\n", + " \"\"\"Cross-entropy loss for multi-class classification.\"\"\"\n", + "\n", + " def __init__(self):\n", + " \"\"\"Initialize cross-entropy loss function.\"\"\"\n", + " pass\n", + "\n", + " def forward(self, logits: Tensor, targets: Tensor) -> Tensor:\n", + " \"\"\"\n", + " Compute cross-entropy loss between logits and target class indices.\n", + "\n", + " TODO: Implement cross-entropy loss with numerical stability\n", + "\n", + " APPROACH:\n", + " 1. 
Compute log-softmax of logits (numerically stable)\n", + " 2. Select log-probabilities for correct classes\n", + " 3. Return negative mean of selected log-probabilities\n", + "\n", + " EXAMPLE:\n", + " >>> loss_fn = CrossEntropyLoss()\n", + " >>> logits = Tensor([[2.0, 1.0, 0.1], [0.5, 1.5, 0.8]]) # 2 samples, 3 classes\n", + " >>> targets = Tensor([0, 1]) # First sample is class 0, second is class 1\n", + " >>> loss = loss_fn.forward(logits, targets)\n", + " >>> print(f\"Cross-Entropy Loss: {loss.data:.4f}\")\n", + "\n", + " HINTS:\n", + " - Use log_softmax() for numerical stability\n", + " - targets.data.astype(int) ensures integer indices\n", + " - Use np.arange(batch_size) for row indexing: log_probs[np.arange(batch_size), targets]\n", + " - Return negative mean: -np.mean(selected_log_probs)\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # Step 1: Compute log-softmax for numerical stability\n", + " log_probs = log_softmax(logits, dim=-1)\n", + "\n", + " # Step 2: Select log-probabilities for correct classes\n", + " batch_size = logits.shape[0]\n", + " target_indices = targets.data.astype(int)\n", + "\n", + " # Select correct class log-probabilities using advanced indexing\n", + " selected_log_probs = log_probs.data[np.arange(batch_size), target_indices]\n", + "\n", + " # Step 3: Return negative mean (cross-entropy is negative log-likelihood)\n", + " cross_entropy = -np.mean(selected_log_probs)\n", + "\n", + " return Tensor(cross_entropy)\n", + " ### END SOLUTION\n", + "\n", + " def backward(self) -> Tensor:\n", + " \"\"\"\n", + " Compute gradients (implemented in Module 05: Autograd).\n", + "\n", + " For now, this is a stub that students can ignore.\n", + " \"\"\"\n", + " pass" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "291f993e", + "metadata": { + "nbgrader": { + "grade": true, + "grade_id": "test_cross_entropy_loss", + "locked": true, + "points": 10 + } + }, + "outputs": [], + "source": [ + "def 
test_unit_cross_entropy_loss():\n", + " \"\"\"🔬 Test CrossEntropyLoss implementation and properties.\"\"\"\n", + " print(\"🔬 Unit Test: Cross-Entropy Loss...\")\n", + "\n", + " loss_fn = CrossEntropyLoss()\n", + "\n", + " # Test perfect predictions (should have very low loss)\n", + " perfect_logits = Tensor([[10.0, -10.0, -10.0], [-10.0, 10.0, -10.0]]) # Very confident predictions\n", + " targets = Tensor([0, 1]) # Matches the confident predictions\n", + " perfect_loss = loss_fn.forward(perfect_logits, targets)\n", + " assert perfect_loss.data < 0.01, f\"Perfect predictions should have very low loss, got {perfect_loss.data}\"\n", + "\n", + " # Test uniform predictions (should have loss ≈ log(num_classes))\n", + " uniform_logits = Tensor([[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]) # Equal probabilities\n", + " uniform_targets = Tensor([0, 1])\n", + " uniform_loss = loss_fn.forward(uniform_logits, uniform_targets)\n", + " expected_uniform_loss = np.log(3) # log(3) ≈ 1.099 for 3 classes\n", + " assert np.allclose(uniform_loss.data, expected_uniform_loss, atol=0.1), f\"Uniform predictions should have loss ≈ log(3) = {expected_uniform_loss:.3f}, got {uniform_loss.data:.3f}\"\n", + "\n", + " # Test that wrong confident predictions have high loss\n", + " wrong_logits = Tensor([[10.0, -10.0, -10.0], [-10.0, -10.0, 10.0]]) # Confident but wrong\n", + " wrong_targets = Tensor([1, 1]) # Opposite of confident predictions\n", + " wrong_loss = loss_fn.forward(wrong_logits, wrong_targets)\n", + " assert wrong_loss.data > 5.0, f\"Wrong confident predictions should have high loss, got {wrong_loss.data}\"\n", + "\n", + " # Test numerical stability with large logits\n", + " large_logits = Tensor([[100.0, 50.0, 25.0]])\n", + " large_targets = Tensor([0])\n", + " large_loss = loss_fn.forward(large_logits, large_targets)\n", + " assert not np.isnan(large_loss.data), \"Loss should not be NaN with large logits\"\n", + " assert not np.isinf(large_loss.data), \"Loss should not be infinite with 
large logits\"\n", + "\n", + " print(\"✅ CrossEntropyLoss works correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_cross_entropy_loss()" + ] + }, + { + "cell_type": "markdown", + "id": "03358d33", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "## BinaryCrossEntropyLoss - Measuring Yes/No Decision Quality\n", + "\n", + "Binary Cross-Entropy is specialized for yes/no decisions. It's like regular cross-entropy but optimized for the special case of exactly two classes.\n", + "\n", + "### When to Use Binary Cross-Entropy\n", + "\n", + "**Perfect for:**\n", + "- Spam detection (spam vs not spam)\n", + "- Medical diagnosis (disease vs healthy)\n", + "- Fraud detection (fraud vs legitimate)\n", + "- Content moderation (toxic vs safe)\n", + "- Any two-class decision problem\n", + "\n", + "### Understanding Binary Cross-Entropy\n", + "\n", + "```\n", + "Binary Classification Decision Matrix:\n", + "\n", + " TRUE LABEL\n", + " Positive Negative\n", + "PREDICTED P TP FP ← Model says \"Yes\"\n", + " N FN TN ← Model says \"No\"\n", + "\n", + "BCE Loss for each quadrant:\n", + "- True Positive (TP): -log(prediction) ← Reward confident correct \"Yes\"\n", + "- False Positive (FP): -log(1-prediction) ← Punish confident wrong \"Yes\"\n", + "- False Negative (FN): -log(prediction) ← Punish confident wrong \"No\"\n", + "- True Negative (TN): -log(1-prediction) ← Reward confident correct \"No\"\n", + "```\n", + "\n", + "### Binary Cross-Entropy Behavior Examples\n", + "\n", + "```\n", + "Scenario: Spam Detection\n", + "\n", + "Case 1: Perfect Spam Detection\n", + "Email: \"Buy now! 50% off! 
Limited time!\"\n", + "Model Prediction: 0.99 (99% spam probability)\n", + "True Label: 1 (actually spam)\n", + "Loss: -log(0.99) = 0.01 ← Very low loss ✅\n", + "\n", + "Case 2: Uncertain About Spam\n", + "Email: \"Meeting rescheduled to 2pm\"\n", + "Model Prediction: 0.51 (slightly thinks spam)\n", + "True Label: 0 (actually not spam)\n", + "Loss: -log(1-0.51) = -log(0.49) = 0.71 ← Moderate loss\n", + "\n", + "Case 3: Confident Wrong Prediction\n", + "Email: \"Hi mom, how are you?\"\n", + "Model Prediction: 0.95 (very confident spam)\n", + "True Label: 0 (actually not spam)\n", + "Loss: -log(1-0.95) = -log(0.05) = 3.0 ← High loss ❌\n", + "```\n", + "\n", + "### Binary vs Multi-Class Cross-Entropy\n", + "\n", + "```\n", + "Binary Cross-Entropy: Regular Cross-Entropy:\n", + "\n", + "Single probability output Probability distribution output\n", + "Predict: 0.8 (spam prob) Predict: [0.1, 0.8, 0.1] (3 classes)\n", + "Target: 1.0 (is spam) Target: 1 (class index)\n", + "\n", + "Formula: Formula:\n", + "-[y*log(p) + (1-y)*log(1-p)] -log(p[target_class])\n", + "\n", + "Handles class imbalance well Assumes balanced classes\n", + "Optimized for 2-class case General for N classes\n", + "```\n", + "\n", + "### Why Binary Cross-Entropy is Special\n", + "\n", + "1. **Symmetric penalties**: False positives and false negatives treated equally\n", + "2. **Probability calibration**: Output directly interpretable as probability\n", + "3. **Efficient computation**: Simpler than full softmax for binary cases\n", + "4. 
**Medical-grade**: Well-suited for safety-critical binary decisions\n", + "\n", + "### Loss Landscape Visualization\n", + "\n", + "```\n", + "Binary Cross-Entropy Loss Surface:\n", + "\n", + " Loss\n", + " ^\n", + " 10 |* * ← Wrong confident predictions\n", + " ||\n", + " 5 | * *\n", + " | \\\\ /\n", + " 2 | * * ← Uncertain predictions\n", + " | \\\\ /\n", + " 0 |_____*_______*_____> Prediction\n", + " 0 0.2 0.8 1.0\n", + "\n", + " Target = 1.0 (positive class)\n", + "\n", + "Message: \"Be confident about positive class, uncertain is okay,\n", + " but don't be confident about wrong class!\"\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "308d912c", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "binary_cross_entropy_loss", + "solution": true + } + }, + "outputs": [], + "source": [ + "class BinaryCrossEntropyLoss:\n", + " \"\"\"Binary cross-entropy loss for binary classification.\"\"\"\n", + "\n", + " def __init__(self):\n", + " \"\"\"Initialize binary cross-entropy loss function.\"\"\"\n", + " pass\n", + "\n", + " def forward(self, predictions: Tensor, targets: Tensor) -> Tensor:\n", + " \"\"\"\n", + " Compute binary cross-entropy loss.\n", + "\n", + " TODO: Implement binary cross-entropy with numerical stability\n", + "\n", + " APPROACH:\n", + " 1. Clamp predictions to avoid log(0) and log(1)\n", + " 2. Compute: -(targets * log(predictions) + (1-targets) * log(1-predictions))\n", + " 3. 
Return mean across all samples\n", + "\n", + " EXAMPLE:\n", + " >>> loss_fn = BinaryCrossEntropyLoss()\n", + " >>> predictions = Tensor([0.9, 0.1, 0.7, 0.3]) # Probabilities between 0 and 1\n", + " >>> targets = Tensor([1.0, 0.0, 1.0, 0.0]) # Binary labels\n", + " >>> loss = loss_fn.forward(predictions, targets)\n", + " >>> print(f\"Binary Cross-Entropy Loss: {loss.data:.4f}\")\n", + "\n", + " HINTS:\n", + " - Use np.clip(predictions.data, 1e-7, 1-1e-7) to prevent log(0)\n", + " - Binary cross-entropy: -(targets * log(preds) + (1-targets) * log(1-preds))\n", + " - Use np.mean() to average over all samples\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # Step 1: Clamp predictions to avoid numerical issues with log(0) and log(1)\n", + " eps = 1e-7\n", + " clamped_preds = np.clip(predictions.data, eps, 1 - eps)\n", + "\n", + " # Step 2: Compute binary cross-entropy\n", + " # BCE = -(targets * log(preds) + (1-targets) * log(1-preds))\n", + " log_preds = np.log(clamped_preds)\n", + " log_one_minus_preds = np.log(1 - clamped_preds)\n", + "\n", + " bce_per_sample = -(targets.data * log_preds + (1 - targets.data) * log_one_minus_preds)\n", + "\n", + " # Step 3: Return mean across all samples\n", + " bce_loss = np.mean(bce_per_sample)\n", + "\n", + " return Tensor(bce_loss)\n", + " ### END SOLUTION\n", + "\n", + " def backward(self) -> Tensor:\n", + " \"\"\"\n", + " Compute gradients (implemented in Module 05: Autograd).\n", + "\n", + " For now, this is a stub that students can ignore.\n", + " \"\"\"\n", + " pass" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eeb89891", + "metadata": { + "nbgrader": { + "grade": true, + "grade_id": "test_binary_cross_entropy_loss", + "locked": true, + "points": 10 + } + }, + "outputs": [], + "source": [ + "def test_unit_binary_cross_entropy_loss():\n", + " \"\"\"🔬 Test BinaryCrossEntropyLoss implementation and properties.\"\"\"\n", + " print(\"🔬 Unit Test: Binary Cross-Entropy Loss...\")\n", + "\n", + " loss_fn 
= BinaryCrossEntropyLoss()\n", + "\n", + " # Test perfect predictions\n", + " perfect_predictions = Tensor([0.9999, 0.0001, 0.9999, 0.0001])\n", + " targets = Tensor([1.0, 0.0, 1.0, 0.0])\n", + " perfect_loss = loss_fn.forward(perfect_predictions, targets)\n", + " assert perfect_loss.data < 0.01, f\"Perfect predictions should have very low loss, got {perfect_loss.data}\"\n", + "\n", + " # Test worst predictions\n", + " worst_predictions = Tensor([0.0001, 0.9999, 0.0001, 0.9999])\n", + " worst_targets = Tensor([1.0, 0.0, 1.0, 0.0])\n", + " worst_loss = loss_fn.forward(worst_predictions, worst_targets)\n", + " assert worst_loss.data > 5.0, f\"Worst predictions should have high loss, got {worst_loss.data}\"\n", + "\n", + " # Test uniform predictions (probability = 0.5)\n", + " uniform_predictions = Tensor([0.5, 0.5, 0.5, 0.5])\n", + " uniform_targets = Tensor([1.0, 0.0, 1.0, 0.0])\n", + " uniform_loss = loss_fn.forward(uniform_predictions, uniform_targets)\n", + " expected_uniform = -np.log(0.5) # Should be about 0.693\n", + " assert np.allclose(uniform_loss.data, expected_uniform, atol=0.01), f\"Uniform predictions should have loss ≈ {expected_uniform:.3f}, got {uniform_loss.data:.3f}\"\n", + "\n", + " # Test numerical stability at boundaries\n", + " boundary_predictions = Tensor([0.0, 1.0, 0.0, 1.0])\n", + " boundary_targets = Tensor([0.0, 1.0, 1.0, 0.0])\n", + " boundary_loss = loss_fn.forward(boundary_predictions, boundary_targets)\n", + " assert not np.isnan(boundary_loss.data), \"Loss should not be NaN at boundaries\"\n", + " assert not np.isinf(boundary_loss.data), \"Loss should not be infinite at boundaries\"\n", + "\n", + " print(\"✅ BinaryCrossEntropyLoss works correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_binary_cross_entropy_loss()" + ] + }, + { + "cell_type": "markdown", + "id": "c3b758ad", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "# Part 4: Integration - Bringing It 
Together\n", + "\n", + "Now let's test how our loss functions work together with real data scenarios and explore their behavior with different types of predictions.\n", + "\n", + "## Real-World Loss Function Usage Patterns\n", + "\n", + "Understanding when and why to use each loss function is crucial for ML engineering success:\n", + "\n", + "```\n", + "Problem Type Decision Tree:\n", + "\n", + "What are you predicting?\n", + " │\n", + " ┌────┼────┐\n", + " │ │\n", + "Continuous Categorical\n", + " Values Classes\n", + " │ │\n", + " │ ┌───┼───┐\n", + " │ │ │\n", + " │ 2 Classes 3+ Classes\n", + " │ │ │\n", + " MSELoss BCE Loss CE Loss\n", + "\n", + "Examples:\n", + "MSE: House prices, temperature, stock values\n", + "BCE: Spam detection, fraud detection, medical diagnosis\n", + "CE: Image classification, language modeling, multiclass text classification\n", + "```\n", + "\n", + "## Loss Function Behavior Comparison\n", + "\n", + "Each loss function creates different learning pressures on your model:\n", + "\n", + "```\n", + "Error Sensitivity Comparison:\n", + "\n", + "Small Error (0.1): Medium Error (0.5): Large Error (2.0):\n", + "\n", + "MSE: 0.01 MSE: 0.25 MSE: 4.0\n", + "BCE: 0.11 BCE: 0.69 BCE: ∞ (clips to large)\n", + "CE: 0.11 CE: 0.69 CE: ∞ (clips to large)\n", + "\n", + "MSE: Quadratic growth, manageable with outliers\n", + "BCE/CE: Logarithmic growth, explodes with confident wrong predictions\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f879b1eb", + "metadata": { + "nbgrader": { + "grade": false, + "grade_id": "loss_comparison", + "solution": true + } + }, + "outputs": [], + "source": [ + "def compare_loss_behaviors():\n", + " \"\"\"\n", + " 🔬 Compare how different loss functions behave with various prediction patterns.\n", + "\n", + " This helps students understand when to use each loss function.\n", + " \"\"\"\n", + " print(\"🔬 Integration Test: Loss Function Behavior Comparison...\")\n", + "\n", + " # Initialize 
loss functions\n", + " mse_loss = MSELoss()\n", + " ce_loss = CrossEntropyLoss()\n", + " bce_loss = BinaryCrossEntropyLoss()\n", + "\n", + " print(\"\\n1. Regression Scenario (House Price Prediction)\")\n", + " print(\" Predictions: [200k, 250k, 300k], Targets: [195k, 260k, 290k]\")\n", + " house_pred = Tensor([200.0, 250.0, 300.0]) # In thousands\n", + " house_target = Tensor([195.0, 260.0, 290.0])\n", + " mse = mse_loss.forward(house_pred, house_target)\n", + " print(f\" MSE Loss: {mse.data:.2f} (thousand²)\")\n", + "\n", + " print(\"\\n2. Multi-Class Classification (Image Recognition)\")\n", + " print(\" Classes: [cat, dog, bird], Predicted: confident about cat, uncertain about dog\")\n", + " # Logits: [2.0, 0.5, 0.1] suggests model is most confident about class 0 (cat)\n", + " image_logits = Tensor([[2.0, 0.5, 0.1], [0.3, 1.8, 0.2]]) # Two samples\n", + " image_targets = Tensor([0, 1]) # First is cat (0), second is dog (1)\n", + " ce = ce_loss.forward(image_logits, image_targets)\n", + " print(f\" Cross-Entropy Loss: {ce.data:.3f}\")\n", + "\n", + " print(\"\\n3. 
Binary Classification (Spam Detection)\")\n", + " print(\" Predictions: [0.9, 0.1, 0.7, 0.3] (spam probabilities)\")\n", + " spam_pred = Tensor([0.9, 0.1, 0.7, 0.3])\n", + " spam_target = Tensor([1.0, 0.0, 1.0, 0.0]) # 1=spam, 0=not spam\n", + " bce = bce_loss.forward(spam_pred, spam_target)\n", + " print(f\" Binary Cross-Entropy Loss: {bce.data:.3f}\")\n", + "\n", + " print(\"\\n💡 Key Insights:\")\n", + " print(\" - MSE penalizes large errors heavily (good for continuous values)\")\n", + " print(\" - Cross-Entropy encourages confident correct predictions\")\n", + " print(\" - Binary Cross-Entropy balances false positives and negatives\")\n", + "\n", + " return mse.data, ce.data, bce.data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "df914f78", + "metadata": { + "nbgrader": { + "grade": false, + "grade_id": "loss_sensitivity", + "solution": true + } + }, + "outputs": [], + "source": [ + "def analyze_loss_sensitivity():\n", + " \"\"\"\n", + " 📊 Analyze how sensitive each loss function is to prediction errors.\n", + "\n", + " This demonstrates the different error landscapes created by each loss.\n", + " \"\"\"\n", + " print(\"\\n📊 Analysis: Loss Function Sensitivity to Errors...\")\n", + "\n", + " # Create a range of prediction errors for analysis\n", + " true_value = 1.0\n", + " predictions = np.linspace(0.1, 1.9, 50) # From 0.1 to 1.9\n", + "\n", + " # Initialize loss functions\n", + " mse_loss = MSELoss()\n", + " bce_loss = BinaryCrossEntropyLoss()\n", + "\n", + " mse_losses = []\n", + " bce_losses = []\n", + "\n", + " for pred in predictions:\n", + " # MSE analysis\n", + " pred_tensor = Tensor([pred])\n", + " target_tensor = Tensor([true_value])\n", + " mse = mse_loss.forward(pred_tensor, target_tensor)\n", + " mse_losses.append(mse.data)\n", + "\n", + " # BCE analysis (clamp prediction to valid probability range)\n", + " clamped_pred = max(0.01, min(0.99, pred))\n", + " bce_pred_tensor = Tensor([clamped_pred])\n", + " bce_target_tensor 
= Tensor([1.0]) # Target is \"positive class\"\n", + " bce = bce_loss.forward(bce_pred_tensor, bce_target_tensor)\n", + " bce_losses.append(bce.data)\n", + "\n", + " # Find minimum losses\n", + " min_mse_idx = np.argmin(mse_losses)\n", + " min_bce_idx = np.argmin(bce_losses)\n", + "\n", + " # Index of the prediction closest to 0.5 (linspace(0.1, 1.9, 50) has no exact 0.5)\n", + " mid_idx = int(np.argmin(np.abs(predictions - 0.5)))\n", + "\n", + " print(f\"MSE Loss:\")\n", + " print(f\" Minimum at prediction = {predictions[min_mse_idx]:.2f}, loss = {mse_losses[min_mse_idx]:.4f}\")\n", + " print(f\" At prediction ≈ 0.5: loss = {mse_losses[mid_idx]:.4f}\")\n", + " print(f\" At prediction = 0.1: loss = {mse_losses[0]:.4f}\")\n", + "\n", + " print(f\"\\nBinary Cross-Entropy Loss:\")\n", + " print(f\" Minimum at prediction = {predictions[min_bce_idx]:.2f}, loss = {bce_losses[min_bce_idx]:.4f}\")\n", + " print(f\" At prediction ≈ 0.5: loss = {bce_losses[mid_idx]:.4f}\")\n", + " print(f\" At prediction = 0.1: loss = {bce_losses[0]:.4f}\")\n", + "\n", + " print(f\"\\n💡 Sensitivity Insights:\")\n", + " print(\" - MSE grows quadratically with error distance\")\n", + " print(\" - BCE grows logarithmically, heavily penalizing wrong confident predictions\")\n", + " print(\" - Both encourage correct predictions but with different curvatures\")" ] }, { "cell_type": "markdown", "id": "87da05ba", "metadata": { "cell_marker": "\"\"\"", "lines_to_next_cell": 1 }, "source": [ "# Part 5: Systems Analysis - Understanding Loss Function Performance\n", + "\n", + "Loss functions seem simple, but they have important computational and numerical properties that affect training performance. 
Let's analyze the systems aspects.\n", + "\n", + "## Computational Complexity Analysis\n", + "\n", + "Different loss functions have different computational costs, especially at scale:\n", + "\n", + "```\n", + "Computational Cost Comparison (Batch Size B, Classes C):\n", + "\n", + "MSELoss:\n", + "┌───────────────┬───────────────┐\n", + "│ Operation │ Complexity │\n", + "├───────────────┼───────────────┤\n", + "│ Subtraction │ O(B) │\n", + "│ Squaring │ O(B) │\n", + "│ Mean │ O(B) │\n", + "│ Total │ O(B) │\n", + "└───────────────┴───────────────┘\n", + "\n", + "CrossEntropyLoss:\n", + "┌───────────────┬───────────────┐\n", + "│ Operation │ Complexity │\n", + "├───────────────┼───────────────┤\n", + "│ Max (stability)│ O(B*C) │\n", + "│ Exponential │ O(B*C) │\n", + "│ Sum │ O(B*C) │\n", + "│ Log │ O(B) │\n", + "│ Indexing │ O(B) │\n", + "│ Total │ O(B*C) │\n", + "└───────────────┴───────────────┘\n", + "\n", + "Cross-entropy is C times more expensive than MSE!\n", + "For ImageNet (C=1000), CE is 1000x more expensive than MSE.\n", + "```\n", + "\n", + "## Memory Layout and Access Patterns\n", + "\n", + "```\n", + "Memory Usage Patterns:\n", + "\n", + "MSE Forward Pass: CE Forward Pass:\n", + "\n", + "Input: [B] predictions Input: [B, C] logits\n", + " │ │\n", + " │ subtract │ subtract max\n", + " v v\n", + "Temp: [B] differences Temp1: [B, C] shifted\n", + " │ │\n", + " │ square │ exponential\n", + " v v\n", + "Temp: [B] squared Temp2: [B, C] exp_vals\n", + " │ │\n", + " │ mean │ sum along C\n", + " v v\n", + "Output: [1] scalar Temp3: [B] sums\n", + " │\n", + "Memory: 3*B*sizeof(float) │ log + index\n", + " v\n", + " Output: [1] scalar\n", + "\n", + " Memory: (3*B*C + 2*B)*sizeof(float)\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ff322026", + "metadata": { + "nbgrader": { + "grade": false, + "grade_id": "analyze_numerical_stability", + "solution": true + } + }, + "outputs": [], + "source": [ + "def 
analyze_numerical_stability():\n", + " \"\"\"\n", + " 📊 Demonstrate why numerical stability matters in loss computation.\n", + "\n", + " Shows the difference between naive and stable implementations.\n", + " \"\"\"\n", + " print(\"📊 Analysis: Numerical Stability in Loss Functions...\")\n", + "\n", + " # Test with increasingly large logits\n", + " test_cases = [\n", + " (\"Small logits\", [1.0, 2.0, 3.0]),\n", + " (\"Medium logits\", [10.0, 20.0, 30.0]),\n", + " (\"Large logits\", [100.0, 200.0, 300.0]),\n", + " (\"Very large logits\", [500.0, 600.0, 700.0])\n", + " ]\n", + "\n", + " print(\"\\nLog-Softmax Stability Test:\")\n", + " print(\"Case | Max Input | Log-Softmax Min | Numerically Stable?\")\n", + " print(\"-\" * 70)\n", + "\n", + " for case_name, logits in test_cases:\n", + " x = Tensor([logits])\n", + "\n", + " # Our stable implementation\n", + " stable_result = log_softmax(x, dim=-1)\n", + "\n", + " max_input = np.max(logits)\n", + " min_output = np.min(stable_result.data)\n", + " is_stable = not (np.any(np.isnan(stable_result.data)) or np.any(np.isinf(stable_result.data)))\n", + "\n", + " print(f\"{case_name:20} | {max_input:8.0f} | {min_output:15.3f} | {'✅ Yes' if is_stable else '❌ No'}\")\n", + "\n", + " print(f\"\\n💡 Key Insight: Log-sum-exp trick prevents overflow\")\n", + " print(\" Without it: exp(700) would cause overflow in standard softmax\")\n", + " print(\" With it: We can handle arbitrarily large logits safely\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1735f677", + "metadata": { + "nbgrader": { + "grade": false, + "grade_id": "analyze_loss_memory", + "solution": true + } + }, + "outputs": [], + "source": [ + "def analyze_loss_memory():\n", + " \"\"\"\n", + " 📊 Analyze memory usage patterns of different loss functions.\n", + "\n", + " Understanding memory helps with batch size decisions.\n", + " \"\"\"\n", + " print(\"\\n📊 Analysis: Loss Function Memory Usage...\")\n", + "\n", + " batch_sizes = [32, 128, 512, 
1024]\n", + " num_classes = 1000 # Like ImageNet\n", + "\n", + " print(\"\\nMemory Usage by Batch Size:\")\n", + " print(\"Batch Size | MSE (MB) | CrossEntropy (MB) | BCE (MB) | Notes\")\n", + " print(\"-\" * 75)\n", + "\n", + " for batch_size in batch_sizes:\n", + " # Memory calculations (assuming float32 = 4 bytes)\n", + " bytes_per_float = 4\n", + "\n", + " # MSE: predictions + targets (both same size as output)\n", + " mse_elements = batch_size * 1 # Regression usually has 1 output\n", + " mse_memory = mse_elements * bytes_per_float * 2 / 1e6 # Convert to MB\n", + "\n", + " # CrossEntropy: logits + targets + softmax + log_softmax\n", + " ce_logits = batch_size * num_classes\n", + " ce_targets = batch_size * 1 # Target indices\n", + " ce_softmax = batch_size * num_classes # Intermediate softmax\n", + " ce_total_elements = ce_logits + ce_targets + ce_softmax\n", + " ce_memory = ce_total_elements * bytes_per_float / 1e6\n", + "\n", + " # BCE: predictions + targets (binary, so smaller)\n", + " bce_elements = batch_size * 1\n", + " bce_memory = bce_elements * bytes_per_float * 2 / 1e6\n", + "\n", + " notes = \"Linear scaling\" if batch_size == 32 else f\"{batch_size//32}× first\"\n", + "\n", + " print(f\"{batch_size:10} | {mse_memory:8.2f} | {ce_memory:13.2f} | {bce_memory:7.2f} | {notes}\")\n", + "\n", + " print(f\"\\n💡 Memory Insights:\")\n", + " print(\" - CrossEntropy dominates due to large vocabulary (num_classes)\")\n", + " print(\" - Memory scales linearly with batch size\")\n", + " print(\" - Intermediate activations (softmax) double CE memory\")\n", + " print(f\" - For batch=1024, CE needs {ce_memory:.1f}MB just for loss computation\")" + ] + }, + { + "cell_type": "markdown", + "id": "c9752a0e", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "# Part 6: Production Context - How Loss Functions Scale\n", + "\n", + "Understanding how loss functions behave in production helps make informed engineering decisions about 
model architecture and training strategies.\n", + "\n", + "## Loss Function Scaling Challenges\n", + "\n", + "As models grow larger, loss function bottlenecks become critical:\n", + "\n", + "```\n", + "Scaling Challenge Matrix:\n", + "\n", + " │ Small Model │ Large Model │ Production Scale\n", + " │ (MNIST) │ (ImageNet) │ (GPT/BERT)\n", + "────────────────────┼─────────────────┼──────────────────┼──────────────────\n", + "Classes (C) │ 10 │ 1,000 │ 50,000+\n", + "Batch Size (B) │ 64 │ 256 │ 2,048\n", + "Memory (CE) │ 2.5 KB │ 1 MB │ 400 MB\n", + "Memory (MSE) │ 0.25 KB │ 1 KB │ 8 KB\n", + "Bottleneck │ None │ Softmax compute │ Vocabulary memory\n", + "\n", + "Memory grows as B*C for cross-entropy!\n", + "At scale, vocabulary (C) dominates everything.\n", + "```\n", + "\n", + "## Engineering Optimizations in Production\n", + "\n", + "```\n", + "Common Production Optimizations:\n", + "\n", + "1. Hierarchical Softmax:\n", + " ┌─────────────────┐\n", + " │ Full Softmax: │\n", + " │ O(V) per sample │ ┌─────────────────┐\n", + " │ 50k classes = 50k │ │ Hierarchical: │\n", + " │ operations │ │ O(log V) per sample │\n", + " └─────────────────┘ │ 50k classes = 16 │\n", + " │ operations │\n", + " └─────────────────┘\n", + "\n", + "2. Sampled Softmax:\n", + " Instead of computing over all 50k classes,\n", + " sample 1k negative classes + correct class.\n", + " 50× speedup for training!\n", + "\n", + "3. Label Smoothing:\n", + " Instead of hard targets [0, 0, 1, 0],\n", + " use soft targets [0.1, 0.1, 0.7, 0.1].\n", + " Improves generalization.\n", + "\n", + "4. 
Mixed Precision:\n", + " Use FP16 for forward pass, FP32 for loss.\n", + " 2× memory reduction, same accuracy.\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ae943732", + "metadata": { + "nbgrader": { + "grade": false, + "grade_id": "analyze_production_patterns", + "solution": true + } + }, + "outputs": [], + "source": [ + "def analyze_production_patterns():\n", + " \"\"\"\n", + " 🚀 Analyze loss function patterns in production ML systems.\n", + "\n", + " Real insights from systems perspective.\n", + " \"\"\"\n", + " print(\"🚀 Production Analysis: Loss Function Engineering Patterns...\")\n", + "\n", + " print(\"\\n1. Loss Function Choice by Problem Type:\")\n", + "\n", + " scenarios = [\n", + " (\"Recommender Systems\", \"BCE/MSE\", \"User preference prediction\", \"Billions of interactions\"),\n", + " (\"Computer Vision\", \"CrossEntropy\", \"Image classification\", \"1000+ classes, large batches\"),\n", + " (\"NLP Translation\", \"CrossEntropy\", \"Next token prediction\", \"50k+ vocabulary\"),\n", + " (\"Medical Diagnosis\", \"BCE\", \"Disease probability\", \"Class imbalance critical\"),\n", + " (\"Financial Trading\", \"MSE/Huber\", \"Price prediction\", \"Outlier robustness needed\")\n", + " ]\n", + "\n", + " print(\"System Type | Loss Type | Use Case | Scale Challenge\")\n", + " print(\"-\" * 80)\n", + " for system, loss_type, use_case, challenge in scenarios:\n", + " print(f\"{system:20} | {loss_type:12} | {use_case:20} | {challenge}\")\n", + "\n", + " print(\"\\n2. 
Engineering Trade-offs:\")\n", + "\n", + " trade_offs = [\n", + " (\"CrossEntropy vs Label Smoothing\", \"Stability vs Confidence\", \"Label smoothing prevents overconfident predictions\"),\n", + " (\"MSE vs Huber Loss\", \"Sensitivity vs Robustness\", \"Huber is less sensitive to outliers\"),\n", + " (\"Full Softmax vs Sampled\", \"Accuracy vs Speed\", \"Sampled softmax trades exactness for speed on large vocabularies\"),\n", + " (\"Per-Sample vs Batch Loss\", \"Accuracy vs Memory\", \"Batch computation is more memory efficient\")\n", + " ]\n", + "\n", + " print(\"\\nTrade-off | Spectrum | Production Decision\")\n", + " print(\"-\" * 85)\n", + " for trade_off, spectrum, decision in trade_offs:\n", + " print(f\"{trade_off:28} | {spectrum:20} | {decision}\")\n", + "\n", + " print(\"\\n💡 Production Insights:\")\n", + " print(\" - Large vocabularies (50k+ tokens) dominate memory in CrossEntropy\")\n", + " print(\" - Batch computation is 10-100× more efficient than per-sample\")\n", + " print(\" - Numerical stability becomes critical at scale (FP16 training)\")\n", + " print(\" - Loss computation is often <5% of total training time\")" ] }, { "cell_type": "markdown", "id": "c9d38062", "metadata": { "cell_marker": "\"\"\"", "lines_to_next_cell": 1 }, "source": [ "## 🧪 Module Integration Test\n", + "\n", + "Final validation that everything works together correctly."
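One cross-check worth knowing before running the integration test: for exactly two classes, cross-entropy and binary cross-entropy compute the same quantity, because softmax over logits `[z, 0]` reduces to `sigmoid(z)`. A standalone NumPy sketch (the helper names here are illustrative, not module exports):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(p, y):
    # Binary cross-entropy for a single probability p and binary label y
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def ce_two_class(logits, target):
    # Cross-entropy over a 2-logit vector via a stable log-softmax
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target]

z = 1.3                                         # score for the "positive" class
p = sigmoid(z)                                  # BCE view: a single probability
bce_loss = bce(p, 1.0)
ce_loss = ce_two_class(np.array([z, 0.0]), 0)   # CE view: logits [z, 0], target class 0
print(np.isclose(bce_loss, ce_loss))  # True
```

This is why the two losses earlier in the module agree on binary problems; BCE is simply the specialized, cheaper form.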
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "95e21f0f", + "metadata": { + "nbgrader": { + "grade": true, + "grade_id": "test_module", + "locked": true, + "points": 20 + } + }, + "outputs": [], + "source": [ + "def test_module():\n", + " \"\"\"\n", + " Comprehensive test of entire losses module functionality.\n", + "\n", + " This final test runs before module summary to ensure:\n", + " - All unit tests pass\n", + " - Functions work together correctly\n", + " - Module is ready for integration with TinyTorch\n", + " \"\"\"\n", + " print(\"🧪 RUNNING MODULE INTEGRATION TEST\")\n", + " print(\"=\" * 50)\n", + "\n", + " # Run all unit tests\n", + " print(\"Running unit tests...\")\n", + " test_unit_log_softmax()\n", + " test_unit_mse_loss()\n", + " test_unit_cross_entropy_loss()\n", + " test_unit_binary_cross_entropy_loss()\n", + "\n", + " print(\"\\nRunning integration scenarios...\")\n", + "\n", + " # Test realistic end-to-end scenario\n", + " print(\"🔬 Integration Test: Realistic training scenario...\")\n", + "\n", + " # Simulate a complete prediction -> loss computation pipeline\n", + "\n", + " # 1. MSE for regression (house price prediction)\n", + " house_predictions = Tensor([250.0, 180.0, 320.0, 400.0]) # Predicted prices in thousands\n", + " house_actual = Tensor([245.0, 190.0, 310.0, 420.0]) # Actual prices\n", + " mse_loss = MSELoss()\n", + " house_loss = mse_loss.forward(house_predictions, house_actual)\n", + " assert house_loss.data > 0, \"House price loss should be positive\"\n", + " assert house_loss.data < 1000, \"House price loss should be reasonable\"\n", + "\n", + " # 2. 
CrossEntropy for classification (image recognition)\n", + " image_logits = Tensor([[2.1, 0.5, 0.3], [0.2, 2.8, 0.1], [0.4, 0.3, 2.2]]) # 3 images, 3 classes\n", + " image_labels = Tensor([0, 1, 2]) # Correct class for each image\n", + " ce_loss = CrossEntropyLoss()\n", + " image_loss = ce_loss.forward(image_logits, image_labels)\n", + " assert image_loss.data > 0, \"Image classification loss should be positive\"\n", + " assert image_loss.data < 5.0, \"Image classification loss should be reasonable\"\n", + "\n", + " # 3. BCE for binary classification (spam detection)\n", + " spam_probabilities = Tensor([0.85, 0.12, 0.78, 0.23, 0.91])\n", + " spam_labels = Tensor([1.0, 0.0, 1.0, 0.0, 1.0]) # True spam labels\n", + " bce_loss = BinaryCrossEntropyLoss()\n", + " spam_loss = bce_loss.forward(spam_probabilities, spam_labels)\n", + " assert spam_loss.data > 0, \"Spam detection loss should be positive\"\n", + " assert spam_loss.data < 5.0, \"Spam detection loss should be reasonable\"\n", + "\n", + " # 4. Test numerical stability with extreme values\n", + " extreme_logits = Tensor([[100.0, -100.0, 0.0]])\n", + " extreme_targets = Tensor([0])\n", + " extreme_loss = ce_loss.forward(extreme_logits, extreme_targets)\n", + " assert not np.isnan(extreme_loss.data), \"Loss should handle extreme values\"\n", + " assert not np.isinf(extreme_loss.data), \"Loss should not be infinite\"\n", + "\n", + " print(\"✅ End-to-end loss computation works!\")\n", + " print(\"✅ All loss functions handle edge cases!\")\n", + " print(\"✅ Numerical stability verified!\")\n", + "\n", + " print(\"\\n\" + \"=\" * 50)\n", + " print(\"🎉 ALL TESTS PASSED! 
Module ready for export.\")\n", + " print(\"Run: tito module complete 04\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "271b0171", + "metadata": { + "lines_to_next_cell": 2 + }, + "outputs": [], + "source": [ + "# Run comprehensive module test\n", + "if __name__ == \"__main__\":\n", + " test_module()" + ] + }, + { + "cell_type": "markdown", + "id": "f1382c7a", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 🎯 MODULE SUMMARY: Losses\n", + "\n", + "Congratulations! You've built the measurement system that enables all machine learning!\n", + "\n", + "### Key Accomplishments\n", + "- Built 3 essential loss functions: MSE, CrossEntropy, and BinaryCrossEntropy ✅\n", + "- Implemented numerical stability with log-sum-exp trick ✅\n", + "- Discovered memory scaling patterns with batch size and vocabulary ✅\n", + "- Analyzed production trade-offs between different loss function choices ✅\n", + "- All tests pass ✅ (validated by `test_module()`)\n", + "\n", + "### Ready for Next Steps\n", + "Your loss functions provide the essential feedback signal for learning. These \"error measurements\" will become the starting point for backpropagation in Module 05!\n", + "Export with: `tito module complete 04`\n", + "\n", + "**Next**: Module 05 will add automatic differentiation - the magic that computes how to improve predictions!" 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/modules/04_losses/losses_dev.py b/modules/source/04_losses/losses_dev.py similarity index 99% rename from modules/04_losses/losses_dev.py rename to modules/source/04_losses/losses_dev.py index 28f504a0..cb14e3be 100644 --- a/modules/04_losses/losses_dev.py +++ b/modules/source/04_losses/losses_dev.py @@ -63,6 +63,7 @@ from tinytorch.core.layers import Linear, Sequential # What makes predictions # %% nbgrader={"grade": false, "grade_id": "imports", "solution": true} #| default_exp core.losses +#| export import numpy as np import matplotlib.pyplot as plt diff --git a/modules/05_autograd/ENHANCEMENT_SUMMARY.md b/modules/source/05_autograd/ENHANCEMENT_SUMMARY.md similarity index 100% rename from modules/05_autograd/ENHANCEMENT_SUMMARY.md rename to modules/source/05_autograd/ENHANCEMENT_SUMMARY.md diff --git a/modules/05_autograd/autograd_clean.py b/modules/source/05_autograd/autograd_clean.py similarity index 100% rename from modules/05_autograd/autograd_clean.py rename to modules/source/05_autograd/autograd_clean.py diff --git a/modules/source/05_autograd/autograd_dev.ipynb b/modules/source/05_autograd/autograd_dev.ipynb new file mode 100644 index 00000000..770f7543 --- /dev/null +++ b/modules/source/05_autograd/autograd_dev.ipynb @@ -0,0 +1,1589 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "70e293d5", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "# Module 05: Autograd - Awakening the Gradient Engine\n", + "\n", + "Welcome to Module 05! 
Today you'll bring gradients to life and unlock automatic differentiation.\n", + "\n", + "## 🔗 Prerequisites & Progress\n", + "**You've Built**: Tensor operations, activations, layers, and loss functions\n", + "**You'll Build**: The autograd system that computes gradients automatically\n", + "**You'll Enable**: Learning! Training! The ability to optimize neural networks!\n", + "\n", + "**Connection Map**:\n", + "```\n", + "Modules 01-04 → Autograd → Training (Module 06-07)\n", + "(forward pass) (backward pass) (learning loops)\n", + "```\n", + "\n", + "## Learning Objectives\n", + "By the end of this module, you will:\n", + "1. Implement the backward() method for Tensor to enable gradient computation\n", + "2. Create a Function base class for operation tracking\n", + "3. Build computation graphs for automatic differentiation\n", + "4. Test gradient correctness and chain rule implementation\n", + "\n", + "**CRITICAL**: This module enhances the existing Tensor class by implementing its dormant gradient features!\n", + "\n", + "Let's awaken the gradient engine!\n", + "\n", + "## 📦 Where This Code Lives in the Final Package\n", + "\n", + "**Learning Side:** You work in modules/05_autograd/autograd_dev.py\n", + "**Building Side:** Code exports to tinytorch.core.autograd\n", + "\n", + "```python\n", + "# Final package structure:\n", + "from tinytorch.core.autograd import Function # This module - gradient computation\n", + "from tinytorch.core.tensor import Tensor # Enhanced with gradients from this module\n", + "```\n", + "\n", + "**Why this matters:**\n", + "- **Learning:** Complete autograd system enabling automatic differentiation\n", + "- **Production:** PyTorch-style computational graph and backward pass\n", + "- **Consistency:** All gradient operations in core.autograd\n", + "- **Integration:** Enhances existing Tensor without breaking anything" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9d8ef1a7", + "metadata": { + "nbgrader": { + 
"grade": false, + "grade_id": "imports", + "solution": true + } + }, + "outputs": [], + "source": [ + "#| default_exp core.autograd\n", + "\n", + "import numpy as np\n", + "from typing import List, Optional, Callable\n", + "import sys\n", + "import os\n", + "\n", + "# Import the modern Tensor class\n", + "sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor'))\n", + "from tensor_dev import Tensor" + ] + }, + { + "cell_type": "markdown", + "id": "b56abee5", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 1. Introduction: What is Automatic Differentiation?\n", + "\n", + "Automatic differentiation (autograd) is the magic that makes neural networks learn. Instead of manually computing gradients for every parameter, autograd tracks operations and automatically computes gradients via the chain rule.\n", + "\n", + "### The Challenge\n", + "In Module 04, you implemented a loss function. To train a model, you need:\n", + "```\n", + "Loss = f(W₃, f(W₂, f(W₁, x)))\n", + "∂Loss/∂W₁ = ? ∂Loss/∂W₂ = ? 
∂Loss/∂W₃ = ?\n", + "```\n", + "\n", + "Manual gradient computation becomes impossible for complex models with millions of parameters.\n", + "\n", + "### The Solution: Computational Graphs\n", + "```\n", + "Forward Pass: x → Linear₁ → ReLU → Linear₂ → Loss\n", + "Backward Pass: ∇x ← ∇Linear₁ ← ∇ReLU ← ∇Linear₂ ← ∇Loss\n", + "```\n", + "\n", + "**Complete Autograd Process Visualization:**\n", + "```\n", + "┌─ FORWARD PASS ──────────────────────────────────────────────┐\n", + "│ │\n", + "│ x ──┬── W₁ ──┐ │\n", + "│ │ ├──[Linear₁]──→ z₁ ──[ReLU]──→ a₁ ──┬── W₂ ──┐ │\n", + "│ └── b₁ ──┘ │ ├─→ Loss\n", + "│ └── b₂ ──┘ │\n", + "│ │\n", + "└─ COMPUTATION GRAPH BUILT ──────────────────────────────────┘\n", + " │\n", + " ▼\n", + "┌─ BACKWARD PASS ─────────────────────────────────────────────┐\n", + "│ │\n", + "│∇x ←┬← ∇W₁ ←┐ │\n", + "│ │ ├←[Linear₁]←─ ∇z₁ ←[ReLU]← ∇a₁ ←┬← ∇W₂ ←┐ │\n", + "│ └← ∇b₁ ←┘ │ ├← ∇Loss │\n", + "│ └← ∇b₂ ←┘ │\n", + "│ │\n", + "└─ GRADIENTS COMPUTED ───────────────────────────────────────┘\n", + "\n", + "Key Insight: Each [operation] stores how to compute its backward pass.\n", + "The chain rule automatically flows gradients through the entire graph.\n", + "```\n", + "\n", + "Each operation records how to compute its backward pass. The chain rule connects them all." + ] + }, + { + "cell_type": "markdown", + "id": "7e4c7c87", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 2. 
Foundations: The Chain Rule in Action\n", + "\n", + "### Mathematical Foundation\n", + "For composite functions: f(g(x)), the derivative is:\n", + "```\n", + "df/dx = (df/dg) × (dg/dx)\n", + "```\n", + "\n", + "### Computational Graph Example\n", + "```\n", + "Simple computation: L = (x * y + 5)²\n", + "\n", + "Forward Pass:\n", + " x=2 ──┐\n", + " ├──[×]──→ z=6 ──[+5]──→ w=11 ──[²]──→ L=121\n", + " y=3 ──┘\n", + "\n", + "Backward Pass (Chain Rule in Action):\n", + " ∂L/∂x = ∂L/∂w × ∂w/∂z × ∂z/∂x\n", + " = 2w × 1 × y\n", + " = 2(11) × 1 × 3 = 66\n", + "\n", + " ∂L/∂y = ∂L/∂w × ∂w/∂z × ∂z/∂y\n", + " = 2w × 1 × x\n", + " = 2(11) × 1 × 2 = 44\n", + "\n", + "Gradient Flow Visualization:\n", + " ∇x=66 ←──┐\n", + " ├──[×]←── ∇z=22 ←──[+]←── ∇w=22 ←──[²]←── ∇L=1\n", + " ∇y=44 ←──┘\n", + "```\n", + "\n", + "### Memory Layout During Backpropagation\n", + "```\n", + "Computation Graph Memory Structure:\n", + "┌─────────────────────────────────────────────────────────┐\n", + "│ Forward Pass (stored for backward) │\n", + "├─────────────────────────────────────────────────────────┤\n", + "│ Node 1: x=2 (leaf, requires_grad=True) │ grad: None→66 │\n", + "│ Node 2: y=3 (leaf, requires_grad=True) │ grad: None→44 │\n", + "│ Node 3: z=x*y (MulFunction) │ grad: None→22 │\n", + "│ saved: (x=2, y=3) │ inputs: [x,y] │\n", + "│ Node 4: w=z+5 (AddFunction) │ grad: None→22 │\n", + "│ saved: (z=6, 5) │ inputs: [z] │\n", + "│ Node 5: L=w² (PowFunction) │ grad: 1 │\n", + "│ saved: (w=11) │ inputs: [w] │\n", + "└─────────────────────────────────────────────────────────┘\n", + "\n", + "Memory Cost: 2× parameters (data + gradients) + graph overhead\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "cca81534", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 3. Implementation: Building the Autograd Engine\n", + "\n", + "Let's implement the autograd system step by step. 
We'll enhance the existing Tensor class and create supporting infrastructure.\n", + "\n", + "### The Function Architecture\n", + "\n", + "Every differentiable operation needs two things:\n", + "1. **Forward pass**: Compute the result\n", + "2. **Backward pass**: Compute gradients for inputs\n", + "\n", + "```\n", + "Function Class Design:\n", + "┌─────────────────────────────────────┐\n", + "│ Function (Base Class) │\n", + "├─────────────────────────────────────┤\n", + "│ • save_for_backward() ← Store data │\n", + "│ • forward() ← Compute │\n", + "│ • backward() ← Gradients │\n", + "└─────────────────────────────────────┘\n", + " ↑\n", + " ┌─────┴─────┬─────────┬──────────┐\n", + " │ │ │ │\n", + "┌───▼────┐ ┌────▼───┐ ┌───▼────┐ ┌───▼────┐\n", + "│ Add │ │ Mul │ │ Matmul │ │ Sum │\n", + "│Function│ │Function│ │Function│ │Function│\n", + "└────────┘ └────────┘ └────────┘ └────────┘\n", + "```\n", + "\n", + "Each operation inherits from Function and implements specific gradient rules." + ] + }, + { + "cell_type": "markdown", + "id": "a2374a63", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### Function Base Class - The Foundation of Autograd\n", + "\n", + "The Function class is the foundation that makes autograd possible. Every differentiable operation (addition, multiplication, etc.) inherits from this class.\n", + "\n", + "**Why Functions Matter:**\n", + "- They remember inputs needed for backward pass\n", + "- They implement forward computation\n", + "- They implement gradient computation via backward()\n", + "- They connect to form computation graphs\n", + "\n", + "**The Pattern:**\n", + "```\n", + "Forward: inputs → Function.forward() → output\n", + "Backward: grad_output → Function.backward() → grad_inputs\n", + "```\n", + "\n", + "This pattern enables the chain rule to flow gradients through complex computations." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c4e83fb5", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "function-base", + "solution": true + } + }, + "outputs": [], + "source": [ + "class Function:\n", + " \"\"\"\n", + " Base class for differentiable operations.\n", + "\n", + " Every operation that needs gradients (add, multiply, matmul, etc.)\n", + " will inherit from this class.\n", + " \"\"\"\n", + "\n", + " def __init__(self):\n", + " \"\"\"Initialize function with empty input tracking.\"\"\"\n", + " self.inputs = []\n", + " self.saved_tensors = []\n", + "\n", + " def save_for_backward(self, *tensors):\n", + " \"\"\"\n", + " Save tensors needed for backward pass.\n", + "\n", + " TODO: Store tensors that backward() will need\n", + "\n", + " EXAMPLE:\n", + " In multiplication: y = a * b\n", + " We need to save 'a' and 'b' because:\n", + " ∂y/∂a = b and ∂y/∂b = a\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " self.saved_tensors = tensors\n", + " ### END SOLUTION\n", + "\n", + " def forward(self, *inputs):\n", + " \"\"\"\n", + " Compute forward pass.\n", + "\n", + " TODO: Implement in subclasses\n", + " This should be overridden by each specific operation.\n", + " \"\"\"\n", + " raise NotImplementedError(\"Forward pass must be implemented by subclasses\")\n", + "\n", + " def backward(self, grad_output):\n", + " \"\"\"\n", + " Compute backward pass.\n", + "\n", + " TODO: Implement in subclasses\n", + "\n", + " APPROACH:\n", + " 1. Take gradient flowing backward (grad_output)\n", + " 2. Apply chain rule with local gradients\n", + " 3. 
Return gradients for inputs\n", + " \"\"\"\n", + " raise NotImplementedError(\"Backward pass must be implemented by subclasses\")" + ] + }, + { + "cell_type": "markdown", + "id": "3d390955", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### 🔬 Unit Test: Function Base Class\n", + "This test validates our Function base class works correctly.\n", + "**What we're testing**: Function initialization and interface\n", + "**Why it matters**: Foundation for all differentiable operations\n", + "**Expected**: Proper initialization and save_for_backward functionality" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1d2df72b", + "metadata": { + "nbgrader": { + "grade": true, + "grade_id": "test-function-base", + "locked": true, + "points": 10 + } + }, + "outputs": [], + "source": [ + "def test_unit_function_base():\n", + " \"\"\"🔬 Test Function base class.\"\"\"\n", + " print(\"🔬 Unit Test: Function Base Class...\")\n", + "\n", + " # Test initialization\n", + " func = Function()\n", + " assert func.inputs == []\n", + " assert func.saved_tensors == []\n", + "\n", + " # Test save_for_backward\n", + " tensor1 = Tensor([1, 2, 3])\n", + " tensor2 = Tensor([4, 5, 6])\n", + " func.save_for_backward(tensor1, tensor2)\n", + " assert len(func.saved_tensors) == 2\n", + " assert func.saved_tensors[0] is tensor1\n", + " assert func.saved_tensors[1] is tensor2\n", + "\n", + " print(\"✅ Function base class works correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_function_base()" + ] + }, + { + "cell_type": "markdown", + "id": "66e62bea", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "### Operation Functions - Implementing Gradient Rules\n", + "\n", + "Now we'll implement specific operations that compute gradients correctly. 
Each operation has mathematical rules for how gradients flow backward.\n", + "\n", + "**Gradient Flow Visualization:**\n", + "```\n", + "Addition (z = a + b):\n", + " ∂z/∂a = 1 ∂z/∂b = 1\n", + "\n", + " a ──┐ grad_a ←──┐\n", + " ├─[+]─→ z ├─[+]←── grad_z\n", + " b ──┘ grad_b ←──┘\n", + "\n", + "Multiplication (z = a * b):\n", + " ∂z/∂a = b ∂z/∂b = a\n", + "\n", + " a ──┐ grad_a = grad_z * b\n", + " ├─[×]─→ z\n", + " b ──┘ grad_b = grad_z * a\n", + "\n", + "Matrix Multiplication (Z = A @ B):\n", + " ∂Z/∂A = grad_Z @ B.T\n", + " ∂Z/∂B = A.T @ grad_Z\n", + "\n", + " A ──┐ grad_A = grad_Z @ B.T\n", + " ├─[@]─→ Z\n", + " B ──┘ grad_B = A.T @ grad_Z\n", + "```\n", + "\n", + "Each operation stores the inputs it needs for computing gradients." + ] + }, + { + "cell_type": "markdown", + "id": "659f0192", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### AddFunction - Gradient Rules for Addition\n", + "\n", + "Addition is the simplest gradient operation: gradients flow unchanged to both inputs.\n", + "\n", + "**Mathematical Principle:**\n", + "```\n", + "If z = a + b, then:\n", + "∂z/∂a = 1 (gradient of z w.r.t. a)\n", + "∂z/∂b = 1 (gradient of z w.r.t. b)\n", + "\n", + "By chain rule:\n", + "∂Loss/∂a = ∂Loss/∂z × ∂z/∂a = grad_output × 1 = grad_output\n", + "∂Loss/∂b = ∂Loss/∂z × ∂z/∂b = grad_output × 1 = grad_output\n", + "```\n", + "\n", + "**Broadcasting Challenge:**\n", + "When tensors have different shapes, NumPy broadcasts automatically in forward pass,\n", + "but we must \"unbroadcast\" gradients in backward pass to match original shapes." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "db506b35", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "operation-functions", + "solution": true + } + }, + "outputs": [], + "source": [ + "class AddFunction(Function):\n", + " \"\"\"Gradient computation for tensor addition.\"\"\"\n", + "\n", + " def forward(self, a, b):\n", + " \"\"\"\n", + " Forward pass: compute a + b\n", + "\n", + " TODO: Implement addition forward pass\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # Save inputs for backward pass (shapes might be needed)\n", + " self.save_for_backward(a, b)\n", + "\n", + " # Compute addition\n", + " if isinstance(b, Tensor):\n", + " result = a.data + b.data\n", + " else:\n", + " result = a.data + b\n", + "\n", + " return result\n", + " ### END SOLUTION\n", + "\n", + " def backward(self, grad_output):\n", + " \"\"\"\n", + " Backward pass: compute gradients for addition\n", + "\n", + " TODO: Implement addition backward pass\n", + "\n", + " MATH: If z = a + b, then ∂z/∂a = 1 and ∂z/∂b = 1\n", + " So: ∂loss/∂a = ∂loss/∂z × 1 = grad_output\n", + " ∂loss/∂b = ∂loss/∂z × 1 = grad_output\n", + "\n", + " BROADCASTING CHALLENGE:\n", + " If shapes differ, we need to sum gradients appropriately\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " a, b = self.saved_tensors\n", + "\n", + " # Gradient for 'a' - same shape as grad_output initially\n", + " grad_a = grad_output\n", + "\n", + " # Gradient for 'b' - same as grad_output initially\n", + " grad_b = grad_output\n", + "\n", + " # Handle broadcasting: if original shapes differed, sum gradients\n", + " # For tensor + scalar case\n", + " if not isinstance(b, Tensor):\n", + " grad_b = np.sum(grad_output)\n", + " else:\n", + " # Handle shape differences due to broadcasting\n", + " if a.shape != grad_output.shape:\n", + " # Sum out added dimensions and squeeze\n", + " grad_a = _handle_broadcasting_backward(grad_a, a.shape)\n", + "\n", + " if b.shape != 
grad_output.shape:\n", + " grad_b = _handle_broadcasting_backward(grad_b, b.shape)\n", + "\n", + " return grad_a, grad_b\n", + " ### END SOLUTION" + ] + }, + { + "cell_type": "markdown", + "id": "a4e99e3e", + "metadata": { + "lines_to_next_cell": 1 + }, + "source": [ + "\"\"\"\n", + "## MulFunction - Gradient Rules for Element-wise Multiplication\n", + "\n", + "Element-wise multiplication follows the product rule of calculus.\n", + "\n", + "**Mathematical Principle:**\n", + "```\n", + "If z = a * b (element-wise), then:\n", + "∂z/∂a = b (gradient w.r.t. a equals the other input)\n", + "∂z/∂b = a (gradient w.r.t. b equals the other input)\n", + "\n", + "By chain rule:\n", + "∂Loss/∂a = grad_output * b\n", + "∂Loss/∂b = grad_output * a\n", + "```\n", + "\n", + "**Visual Example:**\n", + "```\n", + "Forward: a=[2,3] * b=[4,5] = z=[8,15]\n", + "Backward: grad_z=[1,1]\n", + " grad_a = grad_z * b = [1,1] * [4,5] = [4,5]\n", + " grad_b = grad_z * a = [1,1] * [2,3] = [2,3]\n", + "```\n", + "\"\"\"\n", + "\n", + "class MulFunction(Function):\n", + " \"\"\"Gradient computation for tensor multiplication.\"\"\"\n", + "\n", + " def forward(self, a, b):\n", + " \"\"\"\n", + " Forward pass: compute a * b (element-wise)\n", + "\n", + " TODO: Implement multiplication forward pass\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " self.save_for_backward(a, b)\n", + "\n", + " if isinstance(b, Tensor):\n", + " result = a.data * b.data\n", + " else:\n", + " result = a.data * b\n", + "\n", + " return result\n", + " ### END SOLUTION\n", + "\n", + " def backward(self, grad_output):\n", + " \"\"\"\n", + " Backward pass: compute gradients for multiplication\n", + "\n", + " TODO: Implement multiplication backward pass\n", + "\n", + " MATH: If z = a * b, then:\n", + " ∂z/∂a = b and ∂z/∂b = a\n", + " So: ∂loss/∂a = grad_output * b\n", + " ∂loss/∂b = grad_output * a\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " a, b = self.saved_tensors\n", + "\n", + " if isinstance(b, Tensor):\n", + " 
grad_a = grad_output * b.data\n", + " grad_b = grad_output * a.data\n", + "\n", + " # Handle broadcasting\n", + " if a.shape != grad_output.shape:\n", + " grad_a = _handle_broadcasting_backward(grad_a, a.shape)\n", + " if b.shape != grad_output.shape:\n", + " grad_b = _handle_broadcasting_backward(grad_b, b.shape)\n", + " else:\n", + " # b is a scalar\n", + " grad_a = grad_output * b\n", + " grad_b = np.sum(grad_output * a.data)\n", + "\n", + " return grad_a, grad_b\n", + " ### END SOLUTION" + ] + }, + { + "cell_type": "markdown", + "id": "abc612e2", + "metadata": { + "lines_to_next_cell": 1 + }, + "source": [ + "\"\"\"\n", + "## MatmulFunction - Gradient Rules for Matrix Multiplication\n", + "\n", + "Matrix multiplication has more complex gradient rules based on matrix calculus.\n", + "\n", + "**Mathematical Principle:**\n", + "```\n", + "If Z = A @ B (matrix multiplication), then:\n", + "∂Z/∂A = grad_Z @ B.T\n", + "∂Z/∂B = A.T @ grad_Z\n", + "```\n", + "\n", + "**Why These Rules Work:**\n", + "```\n", + "For element Z[i,j] = Σ_k A[i,k] * B[k,j]\n", + "∂Z[i,j]/∂A[i,k] = B[k,j] ← This gives us grad_Z @ B.T\n", + "∂Z[i,j]/∂B[k,j] = A[i,k] ← This gives us A.T @ grad_Z\n", + "```\n", + "\n", + "**Dimension Analysis:**\n", + "```\n", + "Forward: A(m×k) @ B(k×n) = Z(m×n)\n", + "Backward: grad_Z(m×n) @ B.T(n×k) = grad_A(m×k) ✓\n", + " A.T(k×m) @ grad_Z(m×n) = grad_B(k×n) ✓\n", + "```\n", + "\"\"\"\n", + "\n", + "class MatmulFunction(Function):\n", + " \"\"\"Gradient computation for matrix multiplication.\"\"\"\n", + "\n", + " def forward(self, a, b):\n", + " \"\"\"\n", + " Forward pass: compute a @ b (matrix multiplication)\n", + "\n", + " TODO: Implement matmul forward pass\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " self.save_for_backward(a, b)\n", + " result = np.dot(a.data, b.data)\n", + " return result\n", + " ### END SOLUTION\n", + "\n", + " def backward(self, grad_output):\n", + " \"\"\"\n", + " Backward pass: compute gradients for matrix multiplication\n", 
+ "\n", + " TODO: Implement matmul backward pass\n", + "\n", + " MATH: If Z = A @ B, then:\n", + " ∂Z/∂A = grad_output @ B.T\n", + " ∂Z/∂B = A.T @ grad_output\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " a, b = self.saved_tensors\n", + "\n", + " # Gradient w.r.t. a: grad_output @ b.T\n", + " grad_a = np.dot(grad_output, b.data.T)\n", + "\n", + " # Gradient w.r.t. b: a.T @ grad_output\n", + " grad_b = np.dot(a.data.T, grad_output)\n", + "\n", + " return grad_a, grad_b\n", + " ### END SOLUTION" + ] + }, + { + "cell_type": "markdown", + "id": "5a21456b", + "metadata": { + "lines_to_next_cell": 1 + }, + "source": [ + "\"\"\"\n", + "## SumFunction - Gradient Rules for Reduction Operations\n", + "\n", + "Sum operations reduce tensor dimensions, so gradients must be broadcast back.\n", + "\n", + "**Mathematical Principle:**\n", + "```\n", + "If z = sum(a), then ∂z/∂a[i] = 1 for all i\n", + "Gradient is broadcasted from scalar result back to input shape.\n", + "```\n", + "\n", + "**Gradient Broadcasting Examples:**\n", + "```\n", + "Case 1: Full sum\n", + " Forward: a=[1,2,3] → sum() → z=6 (scalar)\n", + " Backward: grad_z=1 → broadcast → grad_a=[1,1,1]\n", + "\n", + "Case 2: Axis sum\n", + " Forward: a=[[1,2],[3,4]] → sum(axis=0) → z=[4,6]\n", + " Backward: grad_z=[1,1] → broadcast → grad_a=[[1,1],[1,1]]\n", + "\n", + "Case 3: Keepdims\n", + " Forward: a=[[1,2],[3,4]] → sum(axis=0,keepdims=True) → z=[[4,6]]\n", + " Backward: grad_z=[[1,1]] → broadcast → grad_a=[[1,1],[1,1]]\n", + "```\n", + "\"\"\"\n", + "\n", + "class SumFunction(Function):\n", + " \"\"\"Gradient computation for tensor sum.\"\"\"\n", + "\n", + " def forward(self, a, axis=None, keepdims=False):\n", + " \"\"\"\n", + " Forward pass: compute tensor sum\n", + "\n", + " TODO: Implement sum forward pass\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " self.save_for_backward(a)\n", + " self.axis = axis\n", + " self.keepdims = keepdims\n", + " self.input_shape = a.shape\n", + "\n", + " result = 
np.sum(a.data, axis=axis, keepdims=keepdims)\n", + " return result\n", + " ### END SOLUTION\n", + "\n", + " def backward(self, grad_output):\n", + " \"\"\"\n", + " Backward pass: compute gradients for sum\n", + "\n", + " TODO: Implement sum backward pass\n", + "\n", + " MATH: If z = sum(a), then ∂z/∂a[i] = 1 for all i\n", + " So gradient is broadcast back to original shape\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # Sum distributes gradient to all input elements\n", + " # Need to broadcast grad_output back to input shape\n", + "\n", + " if self.axis is None:\n", + " # Summed all elements - broadcast scalar back to input shape\n", + " grad_a = np.full(self.input_shape, grad_output)\n", + " else:\n", + " # Summed along specific axis - need to broadcast properly\n", + " grad_a = grad_output\n", + "\n", + " # If keepdims=False, we need to expand the summed dimensions\n", + " if not self.keepdims:\n", + " if isinstance(self.axis, int):\n", + " grad_a = np.expand_dims(grad_a, self.axis)\n", + " else:\n", + " for ax in sorted(self.axis):\n", + " grad_a = np.expand_dims(grad_a, ax)\n", + "\n", + " # Broadcast to input shape\n", + " grad_a = np.broadcast_to(grad_a, self.input_shape)\n", + "\n", + " return grad_a\n", + " ### END SOLUTION\n", + "\n", + "def _handle_broadcasting_backward(grad, target_shape):\n", + " \"\"\"\n", + " Helper function to handle gradient broadcasting.\n", + "\n", + " When forward pass used broadcasting, we need to sum gradients\n", + " back to the original tensor's shape.\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # Start with the gradient\n", + " result = grad\n", + "\n", + " # Sum out dimensions that were broadcasted (added dimensions)\n", + " # If target has fewer dimensions, sum out the leading dimensions\n", + " while len(result.shape) > len(target_shape):\n", + " result = np.sum(result, axis=0)\n", + "\n", + " # For dimensions that were size 1 in target but expanded in grad\n", + " for i, (grad_dim, target_dim) in 
enumerate(zip(result.shape, target_shape)):\n", + " if target_dim == 1 and grad_dim > 1:\n", + " result = np.sum(result, axis=i, keepdims=True)\n", + "\n", + " return result\n", + " ### END SOLUTION" + ] + }, + { + "cell_type": "markdown", + "id": "e4b0c564", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### 🔬 Unit Test: Operation Functions\n", + "This test validates our operation functions compute gradients correctly.\n", + "**What we're testing**: Forward and backward passes for each operation\n", + "**Why it matters**: These are the building blocks of autograd\n", + "**Expected**: Correct gradients that satisfy mathematical definitions" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "534068f3", + "metadata": { + "nbgrader": { + "grade": true, + "grade_id": "test-operation-functions", + "locked": true, + "points": 15 + } + }, + "outputs": [], + "source": [ + "def test_unit_operation_functions():\n", + " \"\"\"🔬 Test operation functions.\"\"\"\n", + " print(\"🔬 Unit Test: Operation Functions...\")\n", + "\n", + " # Test AddFunction\n", + " add_func = AddFunction()\n", + " a = Tensor([1, 2, 3])\n", + " b = Tensor([4, 5, 6])\n", + " result = add_func.forward(a, b)\n", + " expected = np.array([5, 7, 9])\n", + " assert np.allclose(result, expected)\n", + "\n", + " grad_output = np.array([1, 1, 1])\n", + " grad_a, grad_b = add_func.backward(grad_output)\n", + " assert np.allclose(grad_a, grad_output)\n", + " assert np.allclose(grad_b, grad_output)\n", + "\n", + " # Test MulFunction\n", + " mul_func = MulFunction()\n", + " result = mul_func.forward(a, b)\n", + " expected = np.array([4, 10, 18])\n", + " assert np.allclose(result, expected)\n", + "\n", + " grad_a, grad_b = mul_func.backward(grad_output)\n", + " assert np.allclose(grad_a, b.data) # grad w.r.t a = b\n", + " assert np.allclose(grad_b, a.data) # grad w.r.t b = a\n", + "\n", + " # Test MatmulFunction\n", + " matmul_func = MatmulFunction()\n", 
+ " a_mat = Tensor([[1, 2], [3, 4]])\n", + " b_mat = Tensor([[5, 6], [7, 8]])\n", + " result = matmul_func.forward(a_mat, b_mat)\n", + " expected = np.array([[19, 22], [43, 50]])\n", + " assert np.allclose(result, expected)\n", + "\n", + " grad_output = np.ones((2, 2))\n", + " grad_a, grad_b = matmul_func.backward(grad_output)\n", + " assert grad_a.shape == a_mat.shape\n", + " assert grad_b.shape == b_mat.shape\n", + "\n", + " print(\"✅ Operation functions work correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_operation_functions()" + ] + }, + { + "cell_type": "markdown", + "id": "08717fc2", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "### Enhancing Tensor with Autograd Capabilities\n", + "\n", + "Now we'll enhance the existing Tensor class to use these gradient functions and build computation graphs automatically.\n", + "\n", + "**Computation Graph Formation:**\n", + "```\n", + "Before Autograd: After Autograd:\n", + " x → operation → y x → [Function] → y\n", + " ↓\n", + " Stores operation\n", + " for backward pass\n", + "```\n", + "\n", + "**The Enhancement Strategy:**\n", + "1. **Add backward() method** - Triggers gradient computation\n", + "2. **Enhance operations** - Replace simple ops with gradient-tracking versions\n", + "3. **Track computation graphs** - Each tensor remembers how it was created\n", + "4. 
**Maintain compatibility** - All existing code continues to work\n", + "\n", + "**Critical Design Decision:**\n", + "We enhance the EXISTING Tensor class rather than creating a new one.\n", + "This means:\n", + "- ✅ All previous modules continue working unchanged\n", + "- ✅ No import changes needed\n", + "- ✅ Gradients are \"opt-in\" via requires_grad=True\n", + "- ✅ No confusion between Tensor types" + ] + }, + { + "cell_type": "markdown", + "id": "f2e7d03f", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### The Backward Pass Algorithm\n", + "\n", + "The backward() method implements reverse-mode automatic differentiation.\n", + "\n", + "**Algorithm Visualization:**\n", + "```\n", + "Computation Graph (Forward):\n", + " x₁ ──┐\n", + " ├─[op₁]── z₁ ──┐\n", + " x₂ ──┘ ├─[op₂]── y\n", + " x₃ ──────[op₃]── z₂ ──┘\n", + "\n", + "Gradient Flow (Backward):\n", + " ∇x₁ ←──┐\n", + " ├─[op₁.backward()]← ∇z₁ ←──┐\n", + " ∇x₂ ←──┘ ├─[op₂.backward()]← ∇y\n", + " ∇x₃ ←────[op₃.backward()]← ∇z₂ ←──┘\n", + "```\n", + "\n", + "**Backward Pass Steps:**\n", + "1. Start from output tensor (∇y = 1)\n", + "2. For each operation in reverse order:\n", + " - Apply chain rule: ∇inputs = operation.backward(∇output)\n", + " - Accumulate gradients (handle shared variables)\n", + " - Continue to parent tensors\n", + "3. 
Gradients accumulate in tensor.grad attributes" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "66f8911d", + "metadata": { + "nbgrader": { + "grade": false, + "grade_id": "tensor-enhancements", + "solution": true + } + }, + "outputs": [], + "source": [ + "def implement_tensor_backward_method():\n", + " \"\"\"\n", + " Implement the backward method for the Tensor class.\n", + "\n", + " CRITICAL: We modify the Tensor class in place to activate gradient features.\n", + " The dormant features are now brought to life!\n", + " \"\"\"\n", + "\n", + " def backward_implementation(self, gradient=None):\n", + " \"\"\"\n", + " Compute gradients for this tensor and all tensors in its computation graph.\n", + "\n", + " TODO: Implement the backward pass\n", + "\n", + " APPROACH:\n", + " 1. Check if this tensor requires gradients\n", + " 2. Initialize gradient if starting point\n", + " 3. Traverse computation graph backwards\n", + " 4. Apply chain rule at each step\n", + "\n", + " EXAMPLE:\n", + " >>> x = Tensor([2.0], requires_grad=True)\n", + " >>> y = x * 3\n", + " >>> y.backward()\n", + " >>> print(x.grad) # Should be [3.0]\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " if not self.requires_grad:\n", + " return\n", + "\n", + " # Initialize gradient if this is the starting point\n", + " if gradient is None:\n", + " if self.data.shape == ():\n", + " # Scalar tensor\n", + " gradient = np.array(1.0)\n", + " else:\n", + " # Non-scalar: gradient should be ones of same shape\n", + " gradient = np.ones_like(self.data)\n", + "\n", + " # Accumulate gradient\n", + " if self.grad is None:\n", + " self.grad = gradient\n", + " else:\n", + " self.grad = self.grad + gradient\n", + "\n", + " # If this tensor has a gradient function, propagate backwards\n", + " if hasattr(self, 'grad_fn') and self.grad_fn is not None:\n", + " grads = self.grad_fn.backward(gradient)\n", + "\n", + " # grads could be a single gradient or tuple of gradients\n", + " if not isinstance(grads, 
tuple):\n", + " grads = (grads,)\n", + "\n", + " # Propagate to input tensors\n", + " if hasattr(self.grad_fn, 'inputs'):\n", + " for tensor, grad in zip(self.grad_fn.inputs, grads):\n", + " if isinstance(tensor, Tensor) and tensor.requires_grad:\n", + " tensor.backward(grad)\n", + " ### END SOLUTION\n", + "\n", + " # Replace the placeholder backward method with the real implementation\n", + " Tensor.backward = backward_implementation\n", + " print(\"🚀 Tensor backward method activated!\")\n", + "\n", + "# Activate the backward method\n", + "implement_tensor_backward_method()\n", + "\n", + "def create_gradient_tracking_tensor(data, requires_grad, grad_fn=None, inputs=None):\n", + " \"\"\"\n", + " Helper function to create tensors with gradient tracking.\n", + "\n", + " This function helps operations create result tensors that properly\n", + " track gradients and maintain the computation graph.\n", + " \"\"\"\n", + " result = Tensor(data, requires_grad=requires_grad)\n", + "\n", + " if requires_grad and grad_fn is not None:\n", + " result.grad_fn = grad_fn\n", + " if inputs is not None:\n", + " grad_fn.inputs = inputs\n", + "\n", + " return result\n", + "\n", + "def enhance_tensor_operations():\n", + " \"\"\"\n", + " Enhance existing Tensor operations to support gradient tracking.\n", + "\n", + " This modifies the existing methods to use gradient-tracking functions\n", + " when requires_grad=True.\n", + " \"\"\"\n", + "\n", + " # Store original methods\n", + " original_add = Tensor.__add__\n", + " original_mul = Tensor.__mul__\n", + " original_matmul = Tensor.matmul\n", + " original_sum = Tensor.sum\n", + "\n", + " def gradient_aware_add(self, other):\n", + " \"\"\"\n", + " Addition that tracks gradients when needed.\n", + " \"\"\"\n", + " # Check if gradient tracking is needed\n", + " requires_grad = self.requires_grad or (isinstance(other, Tensor) and other.requires_grad)\n", + "\n", + " if requires_grad:\n", + " # Use gradient-tracking version\n", + " add_func = 
AddFunction()\n", + " result_data = add_func.forward(self, other)\n", + " inputs = [self, other] if isinstance(other, Tensor) else [self]\n", + " return create_gradient_tracking_tensor(result_data, requires_grad, add_func, inputs)\n", + " else:\n", + " # Use original method (no gradient tracking)\n", + " return original_add(self, other)\n", + "\n", + " def gradient_aware_mul(self, other):\n", + " \"\"\"\n", + " Multiplication that tracks gradients when needed.\n", + " \"\"\"\n", + " requires_grad = self.requires_grad or (isinstance(other, Tensor) and other.requires_grad)\n", + "\n", + " if requires_grad:\n", + " mul_func = MulFunction()\n", + " result_data = mul_func.forward(self, other)\n", + " inputs = [self, other] if isinstance(other, Tensor) else [self]\n", + " return create_gradient_tracking_tensor(result_data, requires_grad, mul_func, inputs)\n", + " else:\n", + " return original_mul(self, other)\n", + "\n", + " def gradient_aware_matmul(self, other):\n", + " \"\"\"\n", + " Matrix multiplication that tracks gradients when needed.\n", + " \"\"\"\n", + " if not isinstance(other, Tensor):\n", + " raise TypeError(f\"Expected Tensor for matrix multiplication, got {type(other)}\")\n", + "\n", + " requires_grad = self.requires_grad or other.requires_grad\n", + "\n", + " if requires_grad:\n", + " matmul_func = MatmulFunction()\n", + " result_data = matmul_func.forward(self, other)\n", + " inputs = [self, other]\n", + " return create_gradient_tracking_tensor(result_data, requires_grad, matmul_func, inputs)\n", + " else:\n", + " return original_matmul(self, other)\n", + "\n", + " def gradient_aware_sum(self, axis=None, keepdims=False):\n", + " \"\"\"\n", + " Sum that tracks gradients when needed.\n", + " \"\"\"\n", + " if self.requires_grad:\n", + " sum_func = SumFunction()\n", + " result_data = sum_func.forward(self, axis, keepdims)\n", + " inputs = [self]\n", + " return create_gradient_tracking_tensor(result_data, self.requires_grad, sum_func, inputs)\n", + " 
else:\n", + " return original_sum(self, axis, keepdims)\n", + "\n", + " # Replace methods with gradient-aware versions\n", + " Tensor.__add__ = gradient_aware_add\n", + " Tensor.__mul__ = gradient_aware_mul\n", + " Tensor.matmul = gradient_aware_matmul\n", + " Tensor.sum = gradient_aware_sum\n", + "\n", + " print(\"🚀 Tensor operations enhanced with gradient tracking!\")\n", + "\n", + "# Enhance the operations\n", + "enhance_tensor_operations()" + ] + }, + { + "cell_type": "markdown", + "id": "0ae0aa2f", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### 🔬 Unit Test: Tensor Autograd Enhancement\n", + "This test validates our enhanced Tensor class computes gradients correctly.\n", + "**What we're testing**: Gradient computation and chain rule implementation\n", + "**Why it matters**: This is the core of automatic differentiation\n", + "**Expected**: Correct gradients for various operations and computation graphs" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "abf2dc78", + "metadata": { + "nbgrader": { + "grade": true, + "grade_id": "test-tensor-autograd", + "locked": true, + "points": 20 + } + }, + "outputs": [], + "source": [ + "def test_unit_tensor_autograd():\n", + " \"\"\"🔬 Test Tensor autograd enhancement.\"\"\"\n", + " print(\"🔬 Unit Test: Tensor Autograd Enhancement...\")\n", + "\n", + " # Test simple gradient computation\n", + " x = Tensor([2.0], requires_grad=True)\n", + " y = x * 3\n", + " z = y + 1 # z = 3x + 1, so dz/dx = 3\n", + "\n", + " z.backward()\n", + " assert np.allclose(x.grad, [3.0]), f\"Expected [3.0], got {x.grad}\"\n", + "\n", + " # Test matrix multiplication gradients\n", + " a = Tensor([[1.0, 2.0]], requires_grad=True) # 1x2\n", + " b = Tensor([[3.0], [4.0]], requires_grad=True) # 2x1\n", + " c = a.matmul(b) # 1x1, result = [[11.0]]\n", + "\n", + " c.backward()\n", + " assert np.allclose(a.grad, [[3.0, 4.0]]), f\"Expected [[3.0, 4.0]], got {a.grad}\"\n", + " assert 
np.allclose(b.grad, [[1.0], [2.0]]), f\"Expected [[1.0], [2.0]], got {b.grad}\"\n", + "\n", + " # Test computation graph with multiple operations\n", + " x = Tensor([1.0, 2.0], requires_grad=True)\n", + " y = x * 2 # y = [2, 4]\n", + " z = y.sum() # z = 6\n", + "\n", + " z.backward()\n", + " assert np.allclose(x.grad, [2.0, 2.0]), f\"Expected [2.0, 2.0], got {x.grad}\"\n", + "\n", + " print(\"✅ Tensor autograd enhancement works correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_tensor_autograd()" + ] + }, + { + "cell_type": "markdown", + "id": "8b86a099", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 2 + }, + "source": [ + "## 4. Integration: Building Complex Computation Graphs\n", + "\n", + "Let's test how our autograd system handles complex neural network computations.\n", + "\n", + "### Complex Computation Graph Example\n", + "\n", + "Neural networks create complex computation graphs with shared parameters and multiple paths.\n", + "\n", + "**Detailed Neural Network Computation Graph:**\n", + "```\n", + "Forward Pass with Function Tracking:\n", + " x (input)\n", + " │ requires_grad=True\n", + " ┌────────▼────────┐\n", + " │ MatmulFunction │ stores: (x, W₁)\n", + " │ h₁ = x @ W₁ │\n", + " └────────┬────────┘\n", + " │ grad_fn=MatmulFunction\n", + " ┌────────▼────────┐\n", + " │ AddFunction │ stores: (h₁, b₁)\n", + " │ z₁ = h₁ + b₁ │\n", + " └────────┬────────┘\n", + " │ grad_fn=AddFunction\n", + " ┌────────▼────────┐\n", + " │ ReLU (manual) │ Note: We'll implement\n", + " │ a₁ = max(0,z₁) │ ReLUFunction later\n", + " └────────┬────────┘\n", + " │\n", + " ┌────────▼────────┐\n", + " │ MatmulFunction │ stores: (a₁, W₂)\n", + " │ h₂ = a₁ @ W₂ │\n", + " └────────┬────────┘\n", + " │ grad_fn=MatmulFunction\n", + " ┌────────▼────────┐\n", + " │ AddFunction │ stores: (h₂, b₂)\n", + " │ y = h₂ + b₂ │ (final output)\n", + " └─────────────────┘\n", + "\n", + "Backward Pass Chain Rule Application:\n", + " ∇x 
←─────────────────────────────┐\n", + " │\n", + " ┌─────────────────────────────────────────────────────────┐\n", + " │ MatmulFunction.backward(∇h₁): │\n", + " │ ∇x = ∇h₁ @ W₁.T │\n", + " │ ∇W₁ = x.T @ ∇h₁ │\n", + " └─────────────────┬───────────────────────────────────────┘\n", + " │\n", + " ┌─────────────────▼───────────────────────────────────────┐\n", + " │ AddFunction.backward(∇z₁): │\n", + " │ ∇h₁ = ∇z₁ (gradient passes through unchanged) │\n", + " │ ∇b₁ = ∇z₁ │\n", + " └─────────────────┬───────────────────────────────────────┘\n", + " │\n", + " ┌─────────────────▼───────────────────────────────────────┐\n", + " │ Manual ReLU backward: │\n", + " │ ∇z₁ = ∇a₁ * (z₁ > 0) (zero out negative gradients) │\n", + " └─────────────────┬───────────────────────────────────────┘\n", + " │\n", + " ┌─────────────────▼───────────────────────────────────────┐\n", + " │ MatmulFunction.backward(∇h₂): │\n", + " │ ∇a₁ = ∇h₂ @ W₂.T │\n", + " │ ∇W₂ = a₁.T @ ∇h₂ │\n", + " └─────────────────┬───────────────────────────────────────┘\n", + " │\n", + " ┌─────────────────▼───────────────────────────────────────┐\n", + " │ AddFunction.backward(∇y): │\n", + " │ ∇h₂ = ∇y (gradient passes through unchanged) │\n", + " │ ∇b₂ = ∇y │\n", + " └─────────────────────────────────────────────────────────┘\n", + "```\n", + "\n", + "**Key Autograd Concepts:**\n", + "1. **Function Chaining**: Each operation creates a Function that stores inputs\n", + "2. **Gradient Accumulation**: Multiple paths to a parameter accumulate gradients\n", + "3. **Automatic Traversal**: backward() walks the graph in reverse topological order\n", + "4. **Chain Rule**: Local gradients multiply according to calculus rules" + ] + }, + { + "cell_type": "markdown", + "id": "8a4231c8", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "## 5. 
Systems Analysis: Memory and Performance of Autograd\n", + "\n", + "Understanding the computational and memory costs of automatic differentiation.\n", + "\n", + "### Autograd Memory Architecture\n", + "\n", + "**Memory Layout Comparison:**\n", + "```\n", + "Forward-Only Mode:\n", + "┌─────────────┐\n", + "│ Parameters │ 4N bytes (float32)\n", + "└─────────────┘\n", + "\n", + "Autograd Mode:\n", + "┌─────────────┐\n", + "│ Parameters │ 4N bytes\n", + "├─────────────┤\n", + "│ Gradients │ 4N bytes (additional)\n", + "├─────────────┤\n", + "│ Graph Nodes │ Variable overhead\n", + "├─────────────┤\n", + "│ Activations │ Depends on graph depth\n", + "└─────────────┘\n", + "Total: ~2-3× forward memory\n", + "```\n", + "\n", + "**Computation Graph Memory Growth:**\n", + "```\n", + "Shallow Network (3 layers):\n", + " Graph: x → W₁ → ReLU → W₂ → ReLU → W₃ → loss\n", + " Memory: Base + 3 × (weights + activations)\n", + "\n", + "Deep Network (50 layers):\n", + " Graph: x → [W₁...W₅₀] → loss\n", + " Memory: Base + 50 × (weights + activations)\n", + "\n", + "Gradient Checkpointing (optimization):\n", + " Store activations at only K checkpoint layers, recompute the rest\n", + " Memory: Base + K × activations (weights are always stored)\n", + " Time: roughly +20% compute for up to -80% activation memory\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b6c0ef4c", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "analyze-autograd-memory", + "solution": true + } + }, + "outputs": [], + "source": [ + "def analyze_autograd_memory():\n", + " \"\"\"📊 Analyze memory usage of autograd vs no-grad computation.\"\"\"\n", + " print(\"📊 Analyzing Autograd Memory Usage...\")\n", + "\n", + " # Test different tensor sizes\n", + " sizes = [100, 500, 1000]\n", + "\n", + " for size in sizes:\n", + " # Forward-only computation\n", + " x_no_grad = Tensor(np.random.randn(size, size), requires_grad=False)\n", + " y_no_grad = Tensor(np.random.randn(size, size), requires_grad=False)\n", + " 
z_no_grad = x_no_grad.matmul(y_no_grad)\n", + "\n", + " # Forward + backward computation\n", + " x_grad = Tensor(np.random.randn(size, size), requires_grad=True)\n", + " y_grad = Tensor(np.random.randn(size, size), requires_grad=True)\n", + " z_grad = x_grad.matmul(y_grad)\n", + "\n", + " # Memory analysis\n", + " no_grad_elements = x_no_grad.size + y_no_grad.size + z_no_grad.size\n", + " grad_elements = x_grad.size + y_grad.size + z_grad.size\n", + " grad_storage = x_grad.size + y_grad.size # For gradients\n", + "\n", + " print(f\"Size {size}×{size}:\")\n", + " print(f\" No grad: {no_grad_elements:,} elements\")\n", + " print(f\" With grad: {grad_elements + grad_storage:,} elements\")\n", + " print(f\" Memory overhead: {grad_storage / no_grad_elements:.1%}\")\n", + "\n", + " print(\"\\n💡 Autograd Memory Pattern:\")\n", + " print(\"- Each parameter tensor needs gradient storage (2× memory)\")\n", + " print(\"- Computation graph nodes add overhead\")\n", + " print(\"- Trade-off: 2× memory for automatic gradients\")\n", + "\n", + "# Function defined above, will be called in main block" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "013bd1d0", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "analyze-gradient-computation", + "solution": true + } + }, + "outputs": [], + "source": [ + "def analyze_gradient_computation():\n", + " \"\"\"📊 Analyze computational cost of gradient computation.\"\"\"\n", + " print(\"📊 Analyzing Gradient Computation Cost...\")\n", + "\n", + " import time\n", + "\n", + " # Test computation times\n", + " size = 500\n", + " x = Tensor(np.random.randn(size, size), requires_grad=True)\n", + " y = Tensor(np.random.randn(size, size), requires_grad=True)\n", + "\n", + " # Time forward pass\n", + " start_time = time.time()\n", + " z = x.matmul(y)\n", + " forward_time = time.time() - start_time\n", + "\n", + " # Time backward pass\n", + " start_time = time.time()\n", + " z.backward()\n", 
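+ " # Why backward costs ~2x forward here: for z = x @ y, the backward pass\n", + " # computes gradients for both inputs (grad_x = upstream @ y.T and\n", + " # grad_y = x.T @ upstream), i.e. two O(n³) matmuls versus the single\n", + " # matmul of the forward pass — matching the ratio printed below.\n",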
+ " backward_time = time.time() - start_time\n", + "\n", + " print(f\"Matrix size: {size}×{size}\")\n", + " print(f\"Forward pass: {forward_time:.4f}s\")\n", + " print(f\"Backward pass: {backward_time:.4f}s\")\n", + " print(f\"Backward/Forward ratio: {backward_time/forward_time:.1f}×\")\n", + "\n", + " print(f\"\\n💡 Gradient Computation Analysis:\")\n", + " print(f\"- Forward: O(n³) matrix multiplication\")\n", + " print(f\"- Backward: 2× O(n³) operations (gradients for both inputs)\")\n", + " print(f\"- Total training cost: ~3× forward-only computation\")\n", + "\n", + "# Function defined above, will be called in main block" + ] + }, + { + "cell_type": "markdown", + "id": "f6a68e64", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "## 🧪 Module Integration Test\n", + "\n", + "Final validation that everything works together correctly." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fbbf2259", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": true, + "grade_id": "module-integration", + "locked": true, + "points": 25 + } + }, + "outputs": [], + "source": [ + "def test_module():\n", + " \"\"\"\n", + " Comprehensive test of entire module functionality.\n", + "\n", + " This final test runs before module summary to ensure:\n", + " - All unit tests pass\n", + " - Autograd works for complex computation graphs\n", + " - Module is ready for integration with TinyTorch\n", + " \"\"\"\n", + " print(\"🧪 RUNNING MODULE INTEGRATION TEST\")\n", + " print(\"=\" * 50)\n", + "\n", + " # Run all unit tests\n", + " print(\"Running unit tests...\")\n", + " test_unit_function_base()\n", + " test_unit_operation_functions()\n", + " test_unit_tensor_autograd()\n", + "\n", + " print(\"\\nRunning integration scenarios...\")\n", + "\n", + " # Test 1: Multi-layer computation graph\n", + " print(\"🔬 Integration Test: Multi-layer Neural Network...\")\n", + "\n", + " # Create a 3-layer computation: x -> Linear -> 
Linear -> Linear -> loss\n", + " x = Tensor([[1.0, 2.0]], requires_grad=True)\n", + " W1 = Tensor([[0.5, 0.3, 0.1], [0.2, 0.4, 0.6]], requires_grad=True)\n", + " b1 = Tensor([[0.1, 0.2, 0.3]], requires_grad=True)\n", + "\n", + " # First layer\n", + " h1 = x.matmul(W1) + b1\n", + " assert h1.shape == (1, 3)\n", + " assert h1.requires_grad == True\n", + "\n", + " # Second layer\n", + " W2 = Tensor([[0.1], [0.2], [0.3]], requires_grad=True)\n", + " h2 = h1.matmul(W2)\n", + " assert h2.shape == (1, 1)\n", + "\n", + " # Compute simple loss (just square the output for testing)\n", + " loss = h2 * h2\n", + "\n", + " # Backward pass\n", + " loss.backward()\n", + "\n", + " # Verify all parameters have gradients\n", + " assert x.grad is not None\n", + " assert W1.grad is not None\n", + " assert b1.grad is not None\n", + " assert W2.grad is not None\n", + " assert x.grad.shape == x.shape\n", + " assert W1.grad.shape == W1.shape\n", + "\n", + " print(\"✅ Multi-layer neural network gradients work!\")\n", + "\n", + " # Test 2: Gradient accumulation\n", + " print(\"🔬 Integration Test: Gradient Accumulation...\")\n", + "\n", + " x = Tensor([2.0], requires_grad=True)\n", + "\n", + " # First computation\n", + " y1 = x * 3\n", + " y1.backward()\n", + " first_grad = x.grad.copy()\n", + "\n", + " # Second computation (should accumulate)\n", + " y2 = x * 5\n", + " y2.backward()\n", + "\n", + " assert np.allclose(x.grad, first_grad + 5.0), \"Gradients should accumulate\"\n", + " print(\"✅ Gradient accumulation works!\")\n", + "\n", + " # Test 3: Complex mathematical operations\n", + " print(\"🔬 Integration Test: Complex Operations...\")\n", + "\n", + " a = Tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)\n", + " b = Tensor([[2.0, 1.0], [1.0, 2.0]], requires_grad=True)\n", + "\n", + " # Complex computation: ((a @ b) + a) * b\n", + " temp1 = a.matmul(b) # Matrix multiplication\n", + " temp2 = temp1 + a # Addition\n", + " result = temp2 * b # Element-wise multiplication\n", + " final = 
result.sum() # Sum reduction\n", + "\n", + " final.backward()\n", + "\n", + " assert a.grad is not None\n", + " assert b.grad is not None\n", + " assert a.grad.shape == a.shape\n", + " assert b.grad.shape == b.shape\n", + "\n", + " print(\"✅ Complex mathematical operations work!\")\n", + "\n", + " print(\"\\n\" + \"=\" * 50)\n", + " print(\"🎉 ALL TESTS PASSED! Module ready for export.\")\n", + " print(\"Run: tito module complete 05_autograd\")\n", + "\n", + "# Test function defined above, will be called in main block" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "458d3d72", + "metadata": { + "lines_to_next_cell": 2 + }, + "outputs": [], + "source": [ + "# Run comprehensive module test\n", + "if __name__ == \"__main__\":\n", + " test_module()" + ] + }, + { + "cell_type": "markdown", + "id": "7abfe41a", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 🎯 MODULE SUMMARY: Autograd Engine\n", + "\n", + "Congratulations! You've built the gradient engine that makes neural networks learn!\n", + "\n", + "### Key Accomplishments\n", + "- Implemented Function base class for tracking differentiable operations\n", + "- Enhanced existing Tensor class with backward() method (no new classes!)\n", + "- Built computation graph tracking for automatic differentiation\n", + "- Created operation functions (Add, Mul, Matmul, Sum) with correct gradients\n", + "- Tested complex multi-layer computation graphs with gradient propagation\n", + "- All tests pass ✅ (validated by `test_module()`)\n", + "\n", + "### Ready for Next Steps\n", + "Your autograd implementation enables optimization! The dormant gradient features from Module 01 are now fully active. 
Every tensor can track gradients, every operation builds computation graphs, and backward() computes gradients automatically.\n", + "\n", + "Export with: `tito module complete 05_autograd`\n", + "\n", + "**Next**: Module 06 will add optimizers (SGD, Adam) that use these gradients to actually train neural networks!" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/modules/05_autograd/autograd_dev.py b/modules/source/05_autograd/autograd_dev.py similarity index 99% rename from modules/05_autograd/autograd_dev.py rename to modules/source/05_autograd/autograd_dev.py index bd644c75..c69cdfcd 100644 --- a/modules/05_autograd/autograd_dev.py +++ b/modules/source/05_autograd/autograd_dev.py @@ -60,6 +60,7 @@ from tinytorch.core.tensor import Tensor # Enhanced with gradients from this mo # %% nbgrader={"grade": false, "grade_id": "imports", "solution": true} #| default_exp core.autograd +#| export import numpy as np from typing import List, Optional, Callable diff --git a/modules/source/06_optimizers/optimizers_dev.ipynb b/modules/source/06_optimizers/optimizers_dev.ipynb new file mode 100644 index 00000000..c5b7cb5e --- /dev/null +++ b/modules/source/06_optimizers/optimizers_dev.ipynb @@ -0,0 +1,1646 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "a24de2e9", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "# Module 06: Optimizers - Sophisticated Learning Algorithms\n", + "\n", + "Welcome to Module 06! 
You'll build optimizers that enable neural networks to learn from gradients using sophisticated algorithms.\n", + "\n", + "## 🔗 Prerequisites & Progress\n", + "**You've Built**: Tensor with gradients (Modules 01-05)\n", + "**You'll Build**: SGD, Adam, and AdamW optimizers with sophisticated momentum and adaptive learning\n", + "**You'll Enable**: Modern optimization algorithms that power state-of-the-art neural networks\n", + "\n", + "**Connection Map**:\n", + "```\n", + "Gradients → Optimizers → Training\n", + "(Module 05) (Module 06) (Module 07)\n", + "```\n", + "\n", + "## Learning Objectives\n", + "By the end of this module, you will:\n", + "1. Implement SGD with momentum for stable gradient descent\n", + "2. Build Adam optimizer with adaptive learning rates\n", + "3. Create AdamW optimizer with decoupled weight decay\n", + "4. Understand memory and computational trade-offs in optimization algorithms\n", + "\n", + "Let's get started!\n", + "\n", + "## 📦 Where This Code Lives in the Final Package\n", + "\n", + "**Learning Side:** You work in modules/06_optimizers/optimizers_dev.py\n", + "**Building Side:** Code exports to tinytorch.core.optimizers\n", + "\n", + "```python\n", + "# Final package structure:\n", + "from tinytorch.core.optimizers import SGD, Adam, AdamW # This module\n", + "from tinytorch.core.tensor import Tensor # Foundation from Module 01\n", + "from tinytorch.core.layers import Linear # Layers from Module 03\n", + "```\n", + "\n", + "**Why this matters:**\n", + "- **Learning:** Complete optimization system for modern neural network training\n", + "- **Production:** Proper organization like PyTorch's torch.optim with all optimization algorithms together\n", + "- **Consistency:** All optimization logic and parameter updating in core.optimizers\n", + "- **Integration:** Works seamlessly with gradients from Module 05 for complete training capability" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "292aa303", + "metadata": { + 
"nbgrader": { + "grade": false, + "grade_id": "imports", + "solution": true + } + }, + "outputs": [], + "source": [ + "#| default_exp core.optimizers\n", + "\n", + "import numpy as np\n", + "from typing import List, Union, Optional, Dict, Any\n", + "\n", + "# Import Tensor from Module 01 (now with gradient support from Module 05)\n", + "try:\n", + " from tinytorch.core.tensor import Tensor\n", + "except ImportError:\n", + " # For development, assume we have the enhanced Tensor\n", + " import sys\n", + " import os\n", + " sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor'))\n", + " from tensor_dev import Tensor" + ] + }, + { + "cell_type": "markdown", + "id": "23ef23eb", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 1. Introduction: What are Optimizers?\n", + "\n", + "Optimizers are the engines that drive neural network learning. They take gradients computed from your loss function and use them to update model parameters toward better solutions. Think of optimization as navigating a complex landscape where you're trying to find the lowest valley (minimum loss).\n", + "\n", + "### The Optimization Challenge\n", + "\n", + "Imagine you're hiking in dense fog, trying to reach the bottom of a valley. You can only feel the slope under your feet (the gradient), but you can't see where you're going. 
Different optimization strategies are like different hiking approaches:\n", + "\n", + "```\n", + "Loss Landscape (2D visualization):\n", + " 🏔️\n", + " / \\\\\n", + " 🚶 / \\\\\n", + " / \\\\\n", + " / 🎯 \\\\ ← Global minimum (goal)\n", + " / \\\\\n", + " 🏔️ 🏔️\n", + "\n", + "Challenge: Navigate to 🎯 using only local slope information!\n", + "```\n", + "\n", + "### Our Optimizer Toolkit\n", + "\n", + "**SGD (Stochastic Gradient Descent)**\n", + "- Strategy: Always step downhill\n", + "- Problem: Can get stuck oscillating in narrow valleys\n", + "- Solution: Add momentum to \"coast\" through oscillations\n", + "\n", + "**Adam (Adaptive Moment Estimation)**\n", + "- Strategy: Adapt step size for each parameter individually\n", + "- Advantage: Different learning rates for different dimensions\n", + "- Key Insight: Some directions need big steps, others need small steps\n", + "\n", + "**AdamW (Adam with Weight Decay)**\n", + "- Strategy: Adam + proper regularization\n", + "- Fix: Separates optimization from regularization\n", + "- Result: Better generalization and training stability\n", + "\n", + "### The Mathematics Behind Movement\n", + "\n", + "At its core, optimization follows: **θ_new = θ_old - α * direction**\n", + "\n", + "Where:\n", + "- `θ` = parameters (your position in the landscape)\n", + "- `α` = step size (learning rate)\n", + "- `direction` = where to step (gradient-based)\n", + "\n", + "But sophisticated optimizers do much more than basic gradient descent!" + ] + }, + { + "cell_type": "markdown", + "id": "d0585283", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 2. Foundations: Mathematical Background\n", + "\n", + "### Understanding Momentum: The Physics of Optimization\n", + "\n", + "Momentum in optimization works like momentum in physics. 
A ball rolling down a hill doesn't immediately change direction when it hits a small bump - it has momentum that carries it forward.\n", + "\n", + "```\n", + "Without Momentum (SGD): With Momentum:\n", + " ↓ ↘️\n", + " ← • → ← oscillation → • → smooth path\n", + " ↑ ↙️\n", + "\n", + "Narrow valley problem: Momentum solution:\n", + "|\\ /| |\\ /|\n", + "| \\ • / | ← ping-pong | \\ •→/ | ← smoother\n", + "| \\ / | motion | \\ / | descent\n", + "| ● | | ● |\n", + "```\n", + "\n", + "**SGD with Momentum Formula:**\n", + "```\n", + "velocity = β * previous_velocity + (1-β) * current_gradient\n", + "parameter = parameter - learning_rate * velocity\n", + "\n", + "Where β ≈ 0.9 means \"90% memory of previous direction\"\n", + "```\n", + "\n", + "### Adam: Adaptive Learning for Each Parameter\n", + "\n", + "Adam solves a key problem: different parameters need different learning rates. Imagine adjusting the focus and zoom on a camera - you need fine control for focus but coarse control for zoom.\n", + "\n", + "```\n", + "Parameter Landscape (2 dimensions):\n", + "\n", + " param2\n", + " ^\n", + " |\n", + " 😞| steep gradient\n", + " | (needs small steps)\n", + " |\n", + " ---+--●--→ param1\n", + " | \\\\\n", + " | \\\\ gentle gradient\n", + " | \\\\ (needs big steps)\n", + "\n", + "Adam Solution: Automatic step size per parameter!\n", + "```\n", + "\n", + "**Adam's Two-Memory System:**\n", + "\n", + "1. **First Moment (m)**: \"Which direction am I usually going?\"\n", + " - `m = β₁ * old_m + (1-β₁) * gradient`\n", + " - Like momentum, but for direction\n", + "\n", + "2. **Second Moment (v)**: \"How big are my gradients usually?\"\n", + " - `v = β₂ * old_v + (1-β₂) * gradient²`\n", + " - Tracks gradient magnitude\n", + "\n", + "3. 
**Adaptive Update**:\n", + " - `step_size = m / √v`\n", + " - Big gradients → smaller steps\n", + " - Small gradients → relatively bigger steps\n", + "\n", + "### AdamW: Fixing Weight Decay\n", + "\n", + "Adam has a subtle bug in how it applies weight decay (regularization). AdamW fixes this:\n", + "\n", + "```\n", + "Adam (incorrect): AdamW (correct):\n", + "gradient += weight_decay * param [compute gradient update]\n", + "update_param_with_gradient() param -= learning_rate * gradient_update\n", + " param *= (1 - weight_decay) ← separate!\n", + "\n", + "Why it matters:\n", + "- Adam: Weight decay affected by adaptive learning rates\n", + "- AdamW: Weight decay is consistent regardless of gradients\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "d7372097", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "## 3. Implementation: Building Optimizers\n", + "\n", + "Now we'll implement each optimizer step by step, following the pattern: understand the algorithm → implement it → test it immediately. 
Each optimizer builds on the foundation of the previous one.\n", + "\n", + "### Implementation Strategy\n", + "\n", + "```\n", + "Optimizer Base Class\n", + " ↓\n", + "SGD (foundation algorithm)\n", + " ↓\n", + "SGD + Momentum (reduce oscillations)\n", + " ↓\n", + "Adam (adaptive learning rates)\n", + " ↓\n", + "AdamW (proper weight decay)\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "27941ae4", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "optimizer-base", + "solution": true + } + }, + "outputs": [], + "source": [ + "class Optimizer:\n", + " \"\"\"\n", + " Base class for all optimizers.\n", + "\n", + " This class defines the common interface that all optimizers must implement:\n", + " - zero_grad(): Clear gradients from parameters\n", + " - step(): Update parameters based on gradients\n", + " \"\"\"\n", + "\n", + " def __init__(self, params: List[Tensor]):\n", + " \"\"\"\n", + " Initialize optimizer with parameters to optimize.\n", + "\n", + " TODO: Set up the parameter list for optimization\n", + "\n", + " APPROACH:\n", + " 1. Store parameters as a list for iteration\n", + " 2. Validate that all parameters require gradients\n", + " 3. Initialize step counter for algorithms that need it\n", + "\n", + " EXAMPLE:\n", + " >>> linear = Linear(784, 128)\n", + " >>> optimizer = SGD(linear.parameters(), lr=0.01)\n", + "\n", + " HINT: Check that each parameter has requires_grad=True\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # Validate and store parameters\n", + " if not isinstance(params, list):\n", + " params = list(params)\n", + "\n", + " # Check that parameters require gradients\n", + " for i, param in enumerate(params):\n", + " if not isinstance(param, Tensor):\n", + " raise TypeError(f\"Parameter {i} must be a Tensor, got {type(param)}\")\n", + " if not param.requires_grad:\n", + " raise ValueError(f\"Parameter {i} does not require gradients. 
Set requires_grad=True.\")\n", + "\n", + " self.params = params\n", + " self.step_count = 0 # For algorithms that need step counting\n", + " ### END SOLUTION\n", + "\n", + " def zero_grad(self):\n", + " \"\"\"\n", + " Clear gradients from all parameters.\n", + "\n", + " TODO: Reset all parameter gradients to None\n", + "\n", + " APPROACH:\n", + " 1. Iterate through all parameters\n", + " 2. Set each parameter's grad to None\n", + "\n", + " EXAMPLE:\n", + " >>> optimizer.zero_grad() # Clears all gradients\n", + " >>> assert all(param.grad is None for param in optimizer.params)\n", + "\n", + " WHY: Gradients accumulate by default, so we need to clear them between batches\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " for param in self.params:\n", + " param.grad = None\n", + " ### END SOLUTION\n", + "\n", + " def step(self):\n", + " \"\"\"\n", + " Update parameters based on gradients.\n", + "\n", + " This method is abstract - each optimizer implements its own update rule.\n", + " \"\"\"\n", + " raise NotImplementedError(\"Subclasses must implement step()\")" + ] + }, + { + "cell_type": "markdown", + "id": "b2d1b390", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### 🔬 Unit Test: Base Optimizer\n", + "This test validates our base Optimizer class works correctly.\n", + "**What we're testing**: Parameter validation and zero_grad functionality\n", + "**Why it matters**: Foundation for all specific optimizer implementations\n", + "**Expected**: Proper parameter storage and gradient clearing" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7009049d", + "metadata": { + "nbgrader": { + "grade": true, + "grade_id": "test-optimizer-base", + "locked": true, + "points": 10 + } + }, + "outputs": [], + "source": [ + "def test_unit_optimizer_base():\n", + " \"\"\"🔬 Test base Optimizer functionality.\"\"\"\n", + " print(\"🔬 Unit Test: Base Optimizer...\")\n", + "\n", + " # Create test parameters\n", + " param1 = Tensor([1.0, 
2.0], requires_grad=True)\n", + " param2 = Tensor([[3.0, 4.0], [5.0, 6.0]], requires_grad=True)\n", + "\n", + " # Add some gradients\n", + " param1.grad = Tensor([0.1, 0.2])\n", + " param2.grad = Tensor([[0.3, 0.4], [0.5, 0.6]])\n", + "\n", + " # Create optimizer\n", + " optimizer = Optimizer([param1, param2])\n", + "\n", + " # Test parameter storage\n", + " assert len(optimizer.params) == 2\n", + " assert optimizer.params[0] is param1\n", + " assert optimizer.params[1] is param2\n", + " assert optimizer.step_count == 0\n", + "\n", + " # Test zero_grad\n", + " optimizer.zero_grad()\n", + " assert param1.grad is None\n", + " assert param2.grad is None\n", + "\n", + " # Test error handling\n", + " try:\n", + " bad_param = Tensor([1.0], requires_grad=False)\n", + " Optimizer([bad_param])\n", + " assert False, \"Should have raised ValueError\"\n", + " except ValueError as e:\n", + " assert \"does not require gradients\" in str(e)\n", + "\n", + " print(\"✅ Base Optimizer works correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_optimizer_base()" + ] + }, + { + "cell_type": "markdown", + "id": "f16e1bc5", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "## SGD - Stochastic Gradient Descent\n", + "\n", + "SGD is the foundation of neural network optimization. It implements the simple but powerful idea: \"move in the direction opposite to the gradient.\"\n", + "\n", + "### Why SGD Works\n", + "\n", + "Gradients point uphill (toward higher loss). 
To minimize loss, we go downhill:\n", + "\n", + "```\n", + "Loss Surface (side view):\n", + "\n", + " Loss\n", + " ^\n", + " |\n", + " 📈 | current position\n", + " | /\n", + " | • ← you are here\n", + " | / \\\n", + " | / \\ gradient points uphill\n", + " |/ \\\n", + " ●-------\\--→ parameters\n", + " \\ \\\n", + " \\ ↘️ SGD steps downhill\n", + " \\ (opposite to gradient)\n", + " \\⭐ ← goal (minimum loss)\n", + "```\n", + "\n", + "### The Oscillation Problem\n", + "\n", + "Pure SGD can get trapped oscillating in narrow valleys:\n", + "\n", + "```\n", + "Narrow valley (top view):\n", + " \\ /\n", + " \\ / ← steep sides\n", + " \\ /\n", + " 4← • →2 ← SGD bounces back and forth\n", + " / \\\n", + " 1 3 instead of going down the valley\n", + " / \\\n", + " ● \\\n", + " goal \\\n", + "```\n", + "\n", + "### Momentum Solution\n", + "\n", + "Momentum remembers the direction you were going and continues in that direction:\n", + "\n", + "```\n", + "With momentum:\n", + " \\ /\n", + " \\ /\n", + " \\ /\n", + " • ← smooth path down the valley\n", + " / ↓\n", + " / ↓\n", + " ● ↓ momentum carries us through oscillations\n", + " goal\n", + "```\n", + "\n", + "**Implementation:** SGD keeps a \"velocity\" buffer that accumulates momentum." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "041617fa", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "sgd-optimizer", + "solution": true + } + }, + "outputs": [], + "source": [ + "class SGD(Optimizer):\n", + " \"\"\"\n", + " Stochastic Gradient Descent with momentum.\n", + "\n", + " SGD is the foundational optimization algorithm that moves parameters\n", + " in the direction opposite to gradients. 
With momentum, it remembers\n", + " previous updates to reduce oscillations and accelerate convergence.\n", + " \"\"\"\n", + "\n", + " def __init__(self, params: List[Tensor], lr: float = 0.01, momentum: float = 0.0, weight_decay: float = 0.0):\n", + " \"\"\"\n", + " Initialize SGD optimizer.\n", + "\n", + " TODO: Set up SGD with momentum and weight decay\n", + "\n", + " APPROACH:\n", + " 1. Call parent constructor to set up parameters\n", + " 2. Store learning rate, momentum, and weight decay\n", + " 3. Initialize momentum buffers for each parameter\n", + "\n", + " EXAMPLE:\n", + " >>> optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)\n", + "\n", + " HINTS:\n", + " - Momentum buffers should be initialized as None\n", + " - They'll be created lazily on first step\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " super().__init__(params)\n", + "\n", + " self.lr = lr\n", + " self.momentum = momentum\n", + " self.weight_decay = weight_decay\n", + "\n", + " # Initialize momentum buffers (created lazily)\n", + " self.momentum_buffers = [None for _ in self.params]\n", + " ### END SOLUTION\n", + "\n", + " def step(self):\n", + " \"\"\"\n", + " Perform SGD update step with momentum.\n", + "\n", + " TODO: Implement SGD parameter update with momentum\n", + "\n", + " APPROACH:\n", + " 1. For each parameter with gradients:\n", + " a. Apply weight decay if specified\n", + " b. Update momentum buffer\n", + " c. 
Update parameter using momentum\n", + "\n", + " FORMULA:\n", + " - With weight decay: grad = grad + weight_decay * param\n", + " - Momentum: v = momentum * v_prev + grad\n", + " - Update: param = param - lr * v\n", + "\n", + " HINTS:\n", + " - Skip parameters without gradients\n", + " - Initialize momentum buffers on first use\n", + " - Use in-place operations to save memory\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " for i, param in enumerate(self.params):\n", + " if param.grad is None:\n", + " continue\n", + "\n", + " # Get gradient\n", + " grad = param.grad.data\n", + "\n", + " # Apply weight decay\n", + " if self.weight_decay != 0:\n", + " grad = grad + self.weight_decay * param.data\n", + "\n", + " # Update momentum buffer\n", + " if self.momentum != 0:\n", + " if self.momentum_buffers[i] is None:\n", + " # Initialize momentum buffer\n", + " self.momentum_buffers[i] = np.zeros_like(param.data)\n", + "\n", + " # Update momentum: v = momentum * v_prev + grad\n", + " self.momentum_buffers[i] = self.momentum * self.momentum_buffers[i] + grad\n", + " grad = self.momentum_buffers[i]\n", + "\n", + " # Update parameter: param = param - lr * grad\n", + " param.data = param.data - self.lr * grad\n", + "\n", + " # Increment step counter\n", + " self.step_count += 1\n", + " ### END SOLUTION" + ] + }, + { + "cell_type": "markdown", + "id": "a71d7032", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### 🔬 Unit Test: SGD Optimizer\n", + "This test validates our SGD implementation works correctly.\n", + "**What we're testing**: SGD updates with and without momentum\n", + "**Why it matters**: Core optimization algorithm used in neural network training\n", + "**Expected**: Correct parameter updates following SGD formulas" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3565b424", + "metadata": { + "nbgrader": { + "grade": true, + "grade_id": "test-sgd", + "locked": true, + "points": 15 + } + }, + 
"outputs": [], + "source": [ + "def test_unit_sgd_optimizer():\n", + " \"\"\"🔬 Test SGD optimizer implementation.\"\"\"\n", + " print(\"🔬 Unit Test: SGD Optimizer...\")\n", + "\n", + " # Test basic SGD without momentum\n", + " param = Tensor([1.0, 2.0], requires_grad=True)\n", + " param.grad = Tensor([0.1, 0.2])\n", + "\n", + " optimizer = SGD([param], lr=0.1)\n", + " original_data = param.data.copy()\n", + "\n", + " optimizer.step()\n", + "\n", + " # Expected: param = param - lr * grad = [1.0, 2.0] - 0.1 * [0.1, 0.2] = [0.99, 1.98]\n", + " expected = original_data - 0.1 * param.grad.data\n", + " assert np.allclose(param.data, expected)\n", + " assert optimizer.step_count == 1\n", + "\n", + " # Test SGD with momentum\n", + " param2 = Tensor([1.0, 2.0], requires_grad=True)\n", + " param2.grad = Tensor([0.1, 0.2])\n", + "\n", + " optimizer_momentum = SGD([param2], lr=0.1, momentum=0.9)\n", + "\n", + " # First step: v = 0.9 * 0 + [0.1, 0.2] = [0.1, 0.2]\n", + " optimizer_momentum.step()\n", + " expected_first = np.array([1.0, 2.0]) - 0.1 * np.array([0.1, 0.2])\n", + " assert np.allclose(param2.data, expected_first)\n", + "\n", + " # Second step with same gradient\n", + " param2.grad = Tensor([0.1, 0.2])\n", + " optimizer_momentum.step()\n", + " # v = 0.9 * [0.1, 0.2] + [0.1, 0.2] = [0.19, 0.38]\n", + " expected_momentum = np.array([0.19, 0.38])\n", + " expected_second = expected_first - 0.1 * expected_momentum\n", + " assert np.allclose(param2.data, expected_second, rtol=1e-5)\n", + "\n", + " # Test weight decay\n", + " param3 = Tensor([1.0, 2.0], requires_grad=True)\n", + " param3.grad = Tensor([0.1, 0.2])\n", + "\n", + " optimizer_wd = SGD([param3], lr=0.1, weight_decay=0.01)\n", + " optimizer_wd.step()\n", + "\n", + " # grad_with_decay = [0.1, 0.2] + 0.01 * [1.0, 2.0] = [0.11, 0.22]\n", + " expected_wd = np.array([1.0, 2.0]) - 0.1 * np.array([0.11, 0.22])\n", + " assert np.allclose(param3.data, expected_wd)\n", + "\n", + " print(\"✅ SGD optimizer works 
correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_sgd_optimizer()" + ] + }, + { + "cell_type": "markdown", + "id": "ecd6215c", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "## Adam - Adaptive Moment Estimation\n", + "\n", + "Adam solves a fundamental problem with SGD: different parameters often need different learning rates. Think of tuning a complex system where some knobs need gentle adjustments and others need bold changes.\n", + "\n", + "### The Parameter Scaling Problem\n", + "\n", + "Consider a neural network with both embedding weights and output weights:\n", + "\n", + "```\n", + "Parameter Sensitivity Landscape:\n", + "\n", + " output_weight embedding_weight\n", + " ↑ ↑\n", + " | |\n", + " 😱 | steep cliff | 🐌 gentle slope\n", + " | (needs tiny steps) | (needs big steps)\n", + " | |\n", + " ━━━●━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━●━━━→\n", + "\n", + "Same learning rate = disaster!\n", + "• Small LR: output weights learn fast, embeddings crawl\n", + "• Large LR: embeddings learn well, output weights explode\n", + "```\n", + "\n", + "### Adam's Adaptive Solution\n", + "\n", + "Adam automatically adjusts learning rates by tracking two statistics:\n", + "\n", + "```\n", + "1. MOMENTUM (first moment): \"Which way am I usually going?\"\n", + " m = 0.9 * old_direction + 0.1 * current_gradient\n", + "\n", + " Visualization:\n", + " old: →→→→\n", + " new: ↗️\n", + " m: →→→↗️ (weighted average)\n", + "\n", + "2. SCALE (second moment): \"How big are my steps usually?\"\n", + " v = 0.999 * old_scale + 0.001 * (current_gradient)²\n", + "\n", + " Big gradients → bigger v → smaller effective steps\n", + " Small gradients → smaller v → bigger effective steps\n", + "\n", + "3. 
ADAPTIVE UPDATE:\n", + " step = momentum / √scale\n", + " param = param - learning_rate * step\n", + "```\n", + "\n", + "### Bias Correction: The Cold Start Problem\n", + "\n", + "Adam starts with m=0 and v=0, which creates a bias toward zero initially:\n", + "\n", + "```\n", + "Without bias correction: With bias correction:\n", + "\n", + "Step 1: m = 0.9*0 + 0.1*g Step 1: m̂ = m / (1-0.9¹) = m / 0.1\n", + " = 0.1*g (too small!) = g (correct!)\n", + "\n", + "Step 2: m = 0.9*0.1*g + 0.1*g Step 2: m̂ = m / (1-0.9²) = m / 0.19\n", + " = 0.19*g (still small) ≈ g (better!)\n", + "```\n", + "\n", + "**Key Insight:** Adam is like having an automatic transmission that adjusts gear ratios for each parameter individually." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9f004a9e", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "adam-optimizer", + "solution": true + } + }, + "outputs": [], + "source": [ + "class Adam(Optimizer):\n", + " \"\"\"\n", + " Adam optimizer with adaptive learning rates.\n", + "\n", + " Adam computes individual adaptive learning rates for different parameters\n", + " from estimates of first and second moments of the gradients.\n", + " This makes it effective for problems with sparse gradients or noisy data.\n", + " \"\"\"\n", + "\n", + " def __init__(self, params: List[Tensor], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-8, weight_decay: float = 0.0):\n", + " \"\"\"\n", + " Initialize Adam optimizer.\n", + "\n", + " TODO: Set up Adam with adaptive learning rates\n", + "\n", + " APPROACH:\n", + " 1. Call parent constructor\n", + " 2. Store hyperparameters (lr, betas, eps, weight_decay)\n", + " 3. 
Initialize first and second moment buffers\n", + "\n", + " PARAMETERS:\n", + " - lr: Learning rate (default: 0.001)\n", + " - betas: Coefficients for computing running averages (default: (0.9, 0.999))\n", + " - eps: Small constant for numerical stability (default: 1e-8)\n", + " - weight_decay: L2 penalty coefficient (default: 0.0)\n", + "\n", + " EXAMPLE:\n", + " >>> optimizer = Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " super().__init__(params)\n", + "\n", + " self.lr = lr\n", + " self.beta1, self.beta2 = betas\n", + " self.eps = eps\n", + " self.weight_decay = weight_decay\n", + "\n", + " # Initialize moment buffers (created lazily)\n", + " self.m_buffers = [None for _ in self.params] # First moment (mean)\n", + " self.v_buffers = [None for _ in self.params] # Second moment (variance)\n", + " ### END SOLUTION\n", + "\n", + " def step(self):\n", + " \"\"\"\n", + " Perform Adam update step.\n", + "\n", + " TODO: Implement Adam parameter update with adaptive learning rates\n", + "\n", + " APPROACH:\n", + " 1. For each parameter with gradients:\n", + " a. Apply weight decay if specified\n", + " b. Update first moment estimate (momentum of gradient)\n", + " c. Update second moment estimate (momentum of squared gradient)\n", + " d. Compute bias-corrected moments\n", + " e. 
Update parameter using adaptive learning rate\n", + "\n", + " FORMULAS:\n", + " - m_t = β₁ * m_{t-1} + (1-β₁) * g_t\n", + " - v_t = β₂ * v_{t-1} + (1-β₂) * g_t²\n", + " - m̂_t = m_t / (1-β₁^t)\n", + " - v̂_t = v_t / (1-β₂^t)\n", + " - θ_t = θ_{t-1} - lr * m̂_t / (√v̂_t + ε)\n", + "\n", + " HINTS:\n", + " - Initialize buffers as zeros on first use\n", + " - Use step_count for bias correction\n", + " - Square gradients element-wise for second moment\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # Increment step counter first (needed for bias correction)\n", + " self.step_count += 1\n", + "\n", + " for i, param in enumerate(self.params):\n", + " if param.grad is None:\n", + " continue\n", + "\n", + " # Get gradient\n", + " grad = param.grad.data\n", + "\n", + " # Apply weight decay\n", + " if self.weight_decay != 0:\n", + " grad = grad + self.weight_decay * param.data\n", + "\n", + " # Initialize buffers if needed\n", + " if self.m_buffers[i] is None:\n", + " self.m_buffers[i] = np.zeros_like(param.data)\n", + " self.v_buffers[i] = np.zeros_like(param.data)\n", + "\n", + " # Update biased first moment estimate\n", + " self.m_buffers[i] = self.beta1 * self.m_buffers[i] + (1 - self.beta1) * grad\n", + "\n", + " # Update biased second moment estimate\n", + " self.v_buffers[i] = self.beta2 * self.v_buffers[i] + (1 - self.beta2) * (grad ** 2)\n", + "\n", + " # Compute bias correction\n", + " bias_correction1 = 1 - self.beta1 ** self.step_count\n", + " bias_correction2 = 1 - self.beta2 ** self.step_count\n", + "\n", + " # Compute bias-corrected moments\n", + " m_hat = self.m_buffers[i] / bias_correction1\n", + " v_hat = self.v_buffers[i] / bias_correction2\n", + "\n", + " # Update parameter\n", + " param.data = param.data - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)\n", + " ### END SOLUTION" + ] + }, + { + "cell_type": "markdown", + "id": "ebcf14a3", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### 🔬 Unit Test: Adam 
Optimizer\n", + "This test validates our Adam implementation works correctly.\n", + "**What we're testing**: Adam updates with adaptive learning rates and bias correction\n", + "**Why it matters**: Most popular optimizer for modern neural networks\n", + "**Expected**: Correct parameter updates following Adam formulas" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "18d493f9", + "metadata": { + "nbgrader": { + "grade": true, + "grade_id": "test-adam", + "locked": true, + "points": 20 + } + }, + "outputs": [], + "source": [ + "def test_unit_adam_optimizer():\n", + " \"\"\"🔬 Test Adam optimizer implementation.\"\"\"\n", + " print(\"🔬 Unit Test: Adam Optimizer...\")\n", + "\n", + " # Test basic Adam functionality\n", + " param = Tensor([1.0, 2.0], requires_grad=True)\n", + " param.grad = Tensor([0.1, 0.2])\n", + "\n", + " optimizer = Adam([param], lr=0.01, betas=(0.9, 0.999), eps=1e-8)\n", + " original_data = param.data.copy()\n", + "\n", + " # First step\n", + " optimizer.step()\n", + "\n", + " # Manually compute expected values\n", + " grad = np.array([0.1, 0.2])\n", + "\n", + " # First moment: m = 0.9 * 0 + 0.1 * grad = 0.1 * grad\n", + " m = 0.1 * grad\n", + "\n", + " # Second moment: v = 0.999 * 0 + 0.001 * grad^2 = 0.001 * grad^2\n", + " v = 0.001 * (grad ** 2)\n", + "\n", + " # Bias correction\n", + " bias_correction1 = 1 - 0.9 ** 1 # = 0.1\n", + " bias_correction2 = 1 - 0.999 ** 1 # = 0.001\n", + "\n", + " m_hat = m / bias_correction1 # = grad\n", + " v_hat = v / bias_correction2 # = grad^2\n", + "\n", + " # Update\n", + " expected = original_data - 0.01 * m_hat / (np.sqrt(v_hat) + 1e-8)\n", + "\n", + " assert np.allclose(param.data, expected, rtol=1e-6)\n", + " assert optimizer.step_count == 1\n", + "\n", + " # Test second step to verify moment accumulation\n", + " param.grad = Tensor([0.1, 0.2])\n", + " optimizer.step()\n", + "\n", + " # Should have updated moments\n", + " assert optimizer.m_buffers[0] is not None\n", + " assert 
optimizer.v_buffers[0] is not None\n", + " assert optimizer.step_count == 2\n", + "\n", + " # Test with weight decay\n", + " param2 = Tensor([1.0, 2.0], requires_grad=True)\n", + " param2.grad = Tensor([0.1, 0.2])\n", + "\n", + " optimizer_wd = Adam([param2], lr=0.01, weight_decay=0.01)\n", + " optimizer_wd.step()\n", + "\n", + " # Weight decay should modify the effective gradient\n", + " # grad_with_decay = [0.1, 0.2] + 0.01 * [1.0, 2.0] = [0.11, 0.22]\n", + " # The exact computation is complex, but we can verify parameter changed\n", + " assert not np.array_equal(param2.data, np.array([1.0, 2.0]))\n", + "\n", + " print(\"✅ Adam optimizer works correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_adam_optimizer()" + ] + }, + { + "cell_type": "markdown", + "id": "3651d85f", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "## AdamW - Adam with Decoupled Weight Decay\n", + "\n", + "AdamW fixes a subtle but important bug in Adam's weight decay implementation. The bug affects how regularization interacts with adaptive learning rates.\n", + "\n", + "### The Adam Weight Decay Bug\n", + "\n", + "In standard Adam, weight decay is added to gradients before the adaptive scaling:\n", + "\n", + "```\n", + "Adam's approach (problematic):\n", + "1. gradient = computed_gradient + weight_decay * parameter\n", + "2. m = β₁ * m + (1-β₁) * gradient\n", + "3. v = β₂ * v + (1-β₂) * gradient²\n", + "4. step = m / √v\n", + "5. 
parameter = parameter - learning_rate * step\n", + "\n", + "Problem: Weight decay gets \"adapted\" by the learning rate scaling!\n", + "```\n", + "\n", + "### Why This Matters\n", + "\n", + "Weight decay should be a consistent regularization force, but Adam makes it inconsistent:\n", + "\n", + "```\n", + "Parameter Update Comparison:\n", + "\n", + "Large gradients → small adaptive LR → weak weight decay effect\n", + "Small gradients → large adaptive LR → strong weight decay effect\n", + "\n", + "This is backwards! We want consistent regularization.\n", + "```\n", + "\n", + "### AdamW's Fix: Decoupled Weight Decay\n", + "\n", + "AdamW separates gradient-based updates from weight decay:\n", + "\n", + "```\n", + "AdamW's approach (correct):\n", + "1. m = β₁ * m + (1-β₁) * pure_gradient ← NO weight decay here\n", + "2. v = β₂ * v + (1-β₂) * pure_gradient²\n", + "3. step = m / √v\n", + "4. parameter = parameter - learning_rate * step ← gradient update\n", + "5. parameter = parameter * (1 - weight_decay_rate) ← separate decay\n", + "\n", + "Result: Consistent regularization independent of gradient magnitudes!\n", + "```\n", + "\n", + "### Visual Comparison\n", + "\n", + "```\n", + "Adam weight decay: AdamW weight decay:\n", + "\n", + "gradient ──┐ gradient ──→ adaptive ──→ param\n", + " ├─→ adaptive ──→ param update\n", + "weight ────┘ scaling\n", + "decay\n", + " weight ─────────→ param\n", + " decay shrinkage\n", + "\n", + "Coupled (inconsistent) Decoupled (consistent)\n", + "```\n", + "\n", + "**Key Insight:** AdamW treats optimization and regularization as separate, independent processes, leading to better training dynamics and generalization." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b2d265ff", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "adamw-optimizer", + "solution": true + } + }, + "outputs": [], + "source": [ + "class AdamW(Optimizer):\n", + " \"\"\"\n", + " AdamW optimizer with decoupled weight decay.\n", + "\n", + " AdamW fixes a bug in Adam's weight decay implementation by decoupling\n", + " weight decay from the gradient-based update. This leads to better\n", + " regularization and is the preferred version for most applications.\n", + " \"\"\"\n", + "\n", + " def __init__(self, params: List[Tensor], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-8, weight_decay: float = 0.01):\n", + " \"\"\"\n", + " Initialize AdamW optimizer.\n", + "\n", + " TODO: Set up AdamW with decoupled weight decay\n", + "\n", + " APPROACH:\n", + " 1. Call parent constructor\n", + " 2. Store hyperparameters (note higher default weight_decay)\n", + " 3. Initialize moment buffers like Adam\n", + "\n", + " KEY DIFFERENCE from Adam:\n", + " - Weight decay is applied directly to parameters, not added to gradients\n", + " - This provides better regularization behavior\n", + "\n", + " EXAMPLE:\n", + " >>> optimizer = AdamW(model.parameters(), lr=0.001, weight_decay=0.01)\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " super().__init__(params)\n", + "\n", + " self.lr = lr\n", + " self.beta1, self.beta2 = betas\n", + " self.eps = eps\n", + " self.weight_decay = weight_decay\n", + "\n", + " # Initialize moment buffers (same as Adam)\n", + " self.m_buffers = [None for _ in self.params]\n", + " self.v_buffers = [None for _ in self.params]\n", + " ### END SOLUTION\n", + "\n", + " def step(self):\n", + " \"\"\"\n", + " Perform AdamW update step with decoupled weight decay.\n", + "\n", + " TODO: Implement AdamW parameter update\n", + "\n", + " APPROACH:\n", + " 1. For each parameter with gradients:\n", + " a. 
Update moments using gradients (NOT modified by weight decay)\n", + " b. Compute bias-corrected moments\n", + " c. Apply gradient-based update\n", + " d. Apply weight decay directly to parameters\n", + "\n", + " KEY DIFFERENCE from Adam:\n", + " - Weight decay: θ_t = θ_t - lr * weight_decay * θ_t (applied after gradient update)\n", + " - NOT: grad = grad + weight_decay * param (Adam's incorrect approach)\n", + "\n", + " FORMULAS:\n", + " - Same moment updates as Adam (using unmodified gradients)\n", + " - Gradient update: θ_t = θ_{t-1} - lr * m̂_t / (√v̂_t + ε)\n", + " - Weight decay: θ_t = θ_t * (1 - lr * weight_decay)\n", + "\n", + " HINT: Apply weight decay after gradient update for proper decoupling\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # Increment step counter first\n", + " self.step_count += 1\n", + "\n", + " for i, param in enumerate(self.params):\n", + " if param.grad is None:\n", + " continue\n", + "\n", + " # Get gradient (NOT modified by weight decay)\n", + " grad = param.grad.data\n", + "\n", + " # Initialize buffers if needed\n", + " if self.m_buffers[i] is None:\n", + " self.m_buffers[i] = np.zeros_like(param.data)\n", + " self.v_buffers[i] = np.zeros_like(param.data)\n", + "\n", + " # Update moments using pure gradients\n", + " self.m_buffers[i] = self.beta1 * self.m_buffers[i] + (1 - self.beta1) * grad\n", + " self.v_buffers[i] = self.beta2 * self.v_buffers[i] + (1 - self.beta2) * (grad ** 2)\n", + "\n", + " # Compute bias correction\n", + " bias_correction1 = 1 - self.beta1 ** self.step_count\n", + " bias_correction2 = 1 - self.beta2 ** self.step_count\n", + "\n", + " # Compute bias-corrected moments\n", + " m_hat = self.m_buffers[i] / bias_correction1\n", + " v_hat = self.v_buffers[i] / bias_correction2\n", + "\n", + " # Apply gradient-based update\n", + " param.data = param.data - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)\n", + "\n", + " # Apply decoupled weight decay\n", + " if self.weight_decay != 0:\n", + " param.data = 
param.data * (1 - self.lr * self.weight_decay)\n", + " ### END SOLUTION" + ] + }, + { + "cell_type": "markdown", + "id": "f116d64a", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### 🔬 Unit Test: AdamW Optimizer\n", + "This test validates our AdamW implementation with decoupled weight decay.\n", + "**What we're testing**: AdamW updates with proper weight decay decoupling\n", + "**Why it matters**: State-of-the-art optimizer for transformer models\n", + "**Expected**: Correct separation of gradient updates and weight decay" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c2cd744c", + "metadata": { + "nbgrader": { + "grade": true, + "grade_id": "test-adamw", + "locked": true, + "points": 20 + } + }, + "outputs": [], + "source": [ + "def test_unit_adamw_optimizer():\n", + " \"\"\"🔬 Test AdamW optimizer implementation.\"\"\"\n", + " print(\"🔬 Unit Test: AdamW Optimizer...\")\n", + "\n", + " # Test AdamW vs Adam difference in weight decay\n", + " # Create identical parameters for comparison\n", + " param_adam = Tensor([1.0, 2.0], requires_grad=True)\n", + " param_adamw = Tensor([1.0, 2.0], requires_grad=True)\n", + "\n", + " param_adam.grad = Tensor([0.1, 0.2])\n", + " param_adamw.grad = Tensor([0.1, 0.2])\n", + "\n", + " # Create optimizers with same settings\n", + " adam = Adam([param_adam], lr=0.01, weight_decay=0.01)\n", + " adamw = AdamW([param_adamw], lr=0.01, weight_decay=0.01)\n", + "\n", + " # Take one step\n", + " adam.step()\n", + " adamw.step()\n", + "\n", + " # Results should be different due to weight decay implementation\n", + " assert not np.allclose(param_adam.data, param_adamw.data, rtol=1e-6)\n", + "\n", + " # Test AdamW basic functionality\n", + " param = Tensor([1.0, 2.0], requires_grad=True)\n", + " param.grad = Tensor([0.1, 0.2])\n", + "\n", + " optimizer = AdamW([param], lr=0.01, weight_decay=0.01)\n", + " original_data = param.data.copy()\n", + "\n", + " optimizer.step()\n", + 
"\n", + " # Parameter should have changed\n", + " assert not np.array_equal(param.data, original_data)\n", + " assert optimizer.step_count == 1\n", + "\n", + " # Test that moment buffers are created\n", + " assert optimizer.m_buffers[0] is not None\n", + " assert optimizer.v_buffers[0] is not None\n", + "\n", + " # Test zero weight decay behaves like Adam\n", + " param1 = Tensor([1.0, 2.0], requires_grad=True)\n", + " param2 = Tensor([1.0, 2.0], requires_grad=True)\n", + "\n", + " param1.grad = Tensor([0.1, 0.2])\n", + " param2.grad = Tensor([0.1, 0.2])\n", + "\n", + " adam_no_wd = Adam([param1], lr=0.01, weight_decay=0.0)\n", + " adamw_no_wd = AdamW([param2], lr=0.01, weight_decay=0.0)\n", + "\n", + " adam_no_wd.step()\n", + " adamw_no_wd.step()\n", + "\n", + " # Should be very similar (within numerical precision)\n", + " assert np.allclose(param1.data, param2.data, rtol=1e-10)\n", + "\n", + " print(\"✅ AdamW optimizer works correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_adamw_optimizer()" + ] + }, + { + "cell_type": "markdown", + "id": "abcf743f", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 2 + }, + "source": [ + "## 4. Integration: Bringing It Together\n", + "\n", + "Now let's see how our optimizers perform in realistic scenarios. We'll compare their behavior on the same optimization problem to understand their different characteristics.\n", + "\n", + "### Optimizer Behavior Comparison\n", + "\n", + "Each optimizer takes a different approach to the same problem:\n", + "\n", + "```\n", + "Optimization Problem: Find minimum of f(x) = x²\n", + "\n", + "SGD approach: Adam approach: AdamW approach:\n", + " ↓ ↓ ↓\n", + " x ──→ minimize x ──→ minimize x ──→ minimize\n", + " ↑ ↑ ↑\n", + "fixed LR adaptive LR adaptive LR + decay\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "a3a18015", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "## 5. 
Systems Analysis: Optimizer Performance and Memory\n", + "\n", + "Different optimizers have very different resource requirements. Understanding these trade-offs is crucial for production ML systems.\n", + "\n", + "### Memory Usage Patterns\n", + "\n", + "```\n", + "Optimizer Memory Requirements (per parameter):\n", + "\n", + "SGD: Adam/AdamW:\n", + "┌────────┐ ┌────────┐\n", + "│ param │ │ param │\n", + "├────────┤ ├────────┤\n", + "│momentum│ │ m │ ← first moment\n", + "└────────┘ ├────────┤\n", + " │ v │ ← second moment\n", + " └────────┘\n", + "\n", + "2× memory 3× memory\n", + "```\n", + "\n", + "### Computational Complexity\n", + "\n", + "```\n", + "Per-step Operations:\n", + "\n", + "SGD: Adam:\n", + "• 1 multiplication • 3 multiplications\n", + "• 1 addition • 4 additions\n", + "• 1 subtraction • 1 subtraction\n", + " • 1 square root\n", + " • 1 division\n", + "\n", + "O(n) simple ops O(n) complex ops\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eb6c8914", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "optimizer-analysis", + "solution": true + } + }, + "outputs": [], + "source": [ + "def analyze_optimizer_memory_usage():\n", + " \"\"\"📊 Analyze memory usage of different optimizers.\"\"\"\n", + " print(\"📊 Analyzing Optimizer Memory Usage...\")\n", + "\n", + " # Create test parameters of different sizes\n", + " param_sizes = [1000, 10000, 100000] # 1K, 10K, 100K parameters\n", + "\n", + " print(\"Optimizer Memory Analysis (per parameter tensor):\")\n", + " print(\"=\" * 60)\n", + " print(f\"{'Size':<10} {'SGD':<10} {'Adam':<10} {'AdamW':<10} {'Ratio':<10}\")\n", + " print(\"-\" * 60)\n", + "\n", + " for size in param_sizes:\n", + " # Create parameter\n", + " param = Tensor(np.random.randn(size), requires_grad=True)\n", + " param.grad = Tensor(np.random.randn(size))\n", + "\n", + " # SGD memory (parameter + momentum buffer)\n", + " sgd = SGD([param], momentum=0.9)\n", + " 
sgd.step() # Initialize buffers\n", + " sgd_memory = size * 2 # param + momentum buffer\n", + "\n", + " # Adam memory (parameter + 2 moment buffers)\n", + " param_adam = Tensor(np.random.randn(size), requires_grad=True)\n", + " param_adam.grad = Tensor(np.random.randn(size))\n", + " adam = Adam([param_adam])\n", + " adam.step() # Initialize buffers\n", + " adam_memory = size * 3 # param + m_buffer + v_buffer\n", + "\n", + " # AdamW memory (same as Adam)\n", + " adamw_memory = adam_memory\n", + "\n", + " # Memory ratio (Adam/SGD)\n", + " ratio = adam_memory / sgd_memory\n", + "\n", + " print(f\"{size:<10} {sgd_memory:<10} {adam_memory:<10} {adamw_memory:<10} {ratio:.1f}x\")\n", + "\n", + " print(\"\\n💡 Key Insights:\")\n", + " print(\"- SGD: 2× parameter memory (momentum buffer)\")\n", + " print(\"- Adam/AdamW: 3× parameter memory (two moment buffers)\")\n", + " print(\"- Memory scales linearly with model size\")\n", + " print(\"- Trade-off: More memory for better convergence\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "53d5302c", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "optimizer-convergence", + "solution": true + } + }, + "outputs": [], + "source": [ + "def analyze_optimizer_convergence_behavior():\n", + " \"\"\"📊 Analyze convergence behavior of different optimizers.\"\"\"\n", + " print(\"📊 Analyzing Optimizer Convergence Behavior...\")\n", + "\n", + " # Simulate optimization of a quadratic function: f(x) = 0.5 * x^2\n", + " # Optimal solution: x* = 0, gradient = x\n", + "\n", + " def quadratic_loss(x):\n", + " \"\"\"Simple quadratic function for optimization testing.\"\"\"\n", + " return 0.5 * (x ** 2).sum()\n", + "\n", + " def compute_gradient(x):\n", + " \"\"\"Gradient of quadratic function: df/dx = x.\"\"\"\n", + " return x.copy()\n", + "\n", + " # Starting point\n", + " x_start = np.array([5.0, -3.0, 2.0]) # Far from optimum [0, 0, 0]\n", + "\n", + " # Test different 
optimizers\n", + " optimizers_to_test = [\n", + " (\"SGD\", SGD, {\"lr\": 0.1}),\n", + " (\"SGD+Momentum\", SGD, {\"lr\": 0.1, \"momentum\": 0.9}),\n", + " (\"Adam\", Adam, {\"lr\": 0.1}),\n", + " (\"AdamW\", AdamW, {\"lr\": 0.1, \"weight_decay\": 0.01})\n", + " ]\n", + "\n", + " print(\"Convergence Analysis (quadratic function f(x) = 0.5 * x²):\")\n", + " print(\"=\" * 70)\n", + " print(f\"{'Optimizer':<15} {'Step 0':<12} {'Step 5':<12} {'Step 10':<12} {'Final Loss':<12}\")\n", + " print(\"-\" * 70)\n", + "\n", + " for name, optimizer_class, kwargs in optimizers_to_test:\n", + " # Reset parameter\n", + " param = Tensor(x_start.copy(), requires_grad=True)\n", + " optimizer = optimizer_class([param], **kwargs)\n", + "\n", + " losses = []\n", + "\n", + " # Run optimization for 10 steps\n", + " for step in range(11):\n", + " # Compute loss and gradient\n", + " loss = quadratic_loss(param.data)\n", + " param.grad = Tensor(compute_gradient(param.data))\n", + "\n", + " losses.append(loss)\n", + "\n", + " # Update parameters\n", + " if step < 10: # Don't update after last evaluation\n", + " optimizer.step()\n", + " optimizer.zero_grad()\n", + "\n", + " # Format results\n", + " step0 = f\"{losses[0]:.6f}\"\n", + " step5 = f\"{losses[5]:.6f}\"\n", + " step10 = f\"{losses[10]:.6f}\"\n", + " final = f\"{losses[10]:.6f}\"\n", + "\n", + " print(f\"{name:<15} {step0:<12} {step5:<12} {step10:<12} {final:<12}\")\n", + "\n", + " print(\"\\n💡 Key Insights:\")\n", + " print(\"- SGD: Steady progress but can be slow\")\n", + " print(\"- SGD+Momentum: Faster convergence, less oscillation\")\n", + " print(\"- Adam: Adaptive rates help with different parameter scales\")\n", + " print(\"- AdamW: Similar to Adam with regularization effects\")" + ] + }, + { + "cell_type": "markdown", + "id": "f237af71", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "## 🧪 Module Integration Test\n", + "\n", + "Final validation that everything works together 
correctly." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "940e2331", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": true, + "grade_id": "module-integration", + "locked": true, + "points": 25 + } + }, + "outputs": [], + "source": [ + "def test_module():\n", + " \"\"\"\n", + " Comprehensive test of entire module functionality.\n", + "\n", + " This final test runs before module summary to ensure:\n", + " - All unit tests pass\n", + " - Functions work together correctly\n", + " - Module is ready for integration with TinyTorch\n", + " \"\"\"\n", + " print(\"🧪 RUNNING MODULE INTEGRATION TEST\")\n", + " print(\"=\" * 50)\n", + "\n", + " # Run all unit tests\n", + " print(\"Running unit tests...\")\n", + " test_unit_optimizer_base()\n", + " test_unit_sgd_optimizer()\n", + " test_unit_adam_optimizer()\n", + " test_unit_adamw_optimizer()\n", + "\n", + " print(\"\\nRunning integration scenarios...\")\n", + "\n", + " # Test realistic neural network optimization scenario\n", + " print(\"🔬 Integration Test: Multi-layer Network Optimization...\")\n", + "\n", + " # Create parameters for a 2-layer network\n", + " # Layer 1: 3 inputs -> 4 hidden\n", + " W1 = Tensor(np.random.randn(3, 4) * 0.1, requires_grad=True)\n", + " b1 = Tensor(np.zeros(4), requires_grad=True)\n", + "\n", + " # Layer 2: 4 hidden -> 2 outputs\n", + " W2 = Tensor(np.random.randn(4, 2) * 0.1, requires_grad=True)\n", + " b2 = Tensor(np.zeros(2), requires_grad=True)\n", + "\n", + " params = [W1, b1, W2, b2]\n", + "\n", + " # Add realistic gradients\n", + " W1.grad = Tensor(np.random.randn(3, 4) * 0.01)\n", + " b1.grad = Tensor(np.random.randn(4) * 0.01)\n", + " W2.grad = Tensor(np.random.randn(4, 2) * 0.01)\n", + " b2.grad = Tensor(np.random.randn(2) * 0.01)\n", + "\n", + " # Test all optimizers on same network\n", + " optimizers = [\n", + " SGD(params, lr=0.01, momentum=0.9),\n", + " Adam([p for p in params], lr=0.001), # Fresh param list for Adam\n", + " AdamW([p 
for p in params], lr=0.001, weight_decay=0.01) # Fresh param list for AdamW\n", + " ]\n", + "\n", + " # Save original parameter values AND gradients so each optimizer\n", + " # starts from the identical state; re-randomizing gradients between\n", + " # optimizers would make the comparison meaningless\n", + " original_params = [p.data.copy() for p in params]\n", + " original_grads = [p.grad.data.copy() for p in params]\n", + "\n", + " # Test SGD\n", + " optimizers[0].step()\n", + " sgd_params = [p.data.copy() for p in params]\n", + "\n", + " # Restore parameters and gradients, then test Adam\n", + " for i, p in enumerate(params):\n", + " p.data = original_params[i].copy()\n", + " p.grad = Tensor(original_grads[i].copy())\n", + "\n", + " optimizers[1].step()\n", + " adam_params = [p.data.copy() for p in params]\n", + "\n", + " # Restore parameters and gradients, then test AdamW\n", + " for i, p in enumerate(params):\n", + " p.data = original_params[i].copy()\n", + " p.grad = Tensor(original_grads[i].copy())\n", + "\n", + " optimizers[2].step()\n", + " adamw_params = [p.data.copy() for p in params]\n", + "\n", + " # Verify parameters changed differently for each optimizer\n", + " for i in range(len(params)):\n", + " # Parameters should be different from original\n", + " assert not np.array_equal(sgd_params[i], original_params[i])\n", + " assert not np.array_equal(adam_params[i], original_params[i])\n", + " assert not np.array_equal(adamw_params[i], original_params[i])\n", + "\n", + " # Different optimizers 
should produce different results\n", + " assert not np.allclose(sgd_params[i], adam_params[i], rtol=1e-6)\n", + "\n", + " print(\"✅ Multi-layer network optimization works!\")\n", + "\n", + " # Test optimizer state management\n", + " print(\"🔬 Integration Test: Optimizer State Management...\")\n", + "\n", + " param = Tensor([1.0, 2.0], requires_grad=True)\n", + " param.grad = Tensor([0.1, 0.2])\n", + "\n", + " optimizer = Adam([param], lr=0.001)\n", + "\n", + " # First step should initialize buffers\n", + " optimizer.step()\n", + " assert optimizer.m_buffers[0] is not None\n", + " assert optimizer.v_buffers[0] is not None\n", + " assert optimizer.step_count == 1\n", + "\n", + " # Zero grad should clear gradients but preserve optimizer state\n", + " optimizer.zero_grad()\n", + " assert param.grad is None\n", + " assert optimizer.m_buffers[0] is not None # State preserved\n", + " assert optimizer.step_count == 1 # Step count preserved\n", + "\n", + " print(\"✅ Optimizer state management works!\")\n", + "\n", + " print(\"\\n\" + \"=\" * 50)\n", + " print(\"🎉 ALL TESTS PASSED! Module ready for export.\")\n", + " print(\"Run: tito module complete 06_optimizers\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "53d6f60a", + "metadata": {}, + "outputs": [], + "source": [ + "# Run comprehensive module test\n", + "if __name__ == \"__main__\":\n", + " test_module()" + ] + }, + { + "cell_type": "markdown", + "id": "74e04d5c", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 🎯 MODULE SUMMARY: Optimizers\n", + "\n", + "Congratulations! 
You've built sophisticated optimization algorithms that power modern neural network training!\n", + "\n", + "### Key Accomplishments\n", + "- Built SGD optimizer with momentum for stable gradient descent and oscillation reduction\n", + "- Implemented Adam optimizer with adaptive learning rates and bias correction for different parameter scales\n", + "- Created AdamW optimizer with decoupled weight decay for proper regularization\n", + "- Analyzed memory trade-offs: SGD (2×), Adam/AdamW (3× parameter memory)\n", + "- All tests pass ✅ (validated by `test_module()`)\n", + "\n", + "### Ready for Next Steps\n", + "Your optimizer implementations enable sophisticated neural network training! With gradients from Module 05 and optimizers from Module 06, you're ready to build complete training loops.\n", + "\n", + "Export with: `tito module complete 06_optimizers`\n", + "\n", + "**Next**: Module 07 will add training loops, learning rate scheduling, and checkpointing for complete end-to-end neural network training!" 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/modules/06_optimizers/optimizers_dev.py b/modules/source/06_optimizers/optimizers_dev.py similarity index 99% rename from modules/06_optimizers/optimizers_dev.py rename to modules/source/06_optimizers/optimizers_dev.py index c6d6f868..debfde93 100644 --- a/modules/06_optimizers/optimizers_dev.py +++ b/modules/source/06_optimizers/optimizers_dev.py @@ -59,19 +59,16 @@ from tinytorch.core.layers import Linear # Layers from Module 03 # %% nbgrader={"grade": false, "grade_id": "imports", "solution": true} #| default_exp core.optimizers +#| export import numpy as np from typing import List, Union, Optional, Dict, Any # Import Tensor from Module 01 (now with gradient support from Module 05) -try: - from tinytorch.core.tensor import Tensor -except ImportError: - # For development, assume we have the enhanced Tensor - import sys - import os - sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor')) - from tensor_dev import Tensor +import sys +import os +sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor')) +from tensor_dev import Tensor # %% [markdown] """ diff --git a/modules/source/07_training/training_dev.ipynb b/modules/source/07_training/training_dev.ipynb new file mode 100644 index 00000000..3df03f42 --- /dev/null +++ b/modules/source/07_training/training_dev.ipynb @@ -0,0 +1,1207 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "6cef63f8", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "# Module 07: Training - Complete Learning Loops\n", + "\n", + "Welcome to Module 07! 
You're about to build the complete training infrastructure that brings neural networks to life through end-to-end learning.\n", + "\n", + "## 🔗 Prerequisites & Progress\n", + "**You've Built**: Tensors, activations, layers, losses, gradients, and optimizers\n", + "**You'll Build**: Complete training loops with checkpointing, scheduling, and gradient management\n", + "**You'll Enable**: Full model training pipeline for the MLP milestone\n", + "\n", + "**Connection Map**:\n", + "```\n", + "Optimizers (Module 06) → Training (Module 07) → DataLoader (Module 08)\n", + "(parameter updates) (complete loops) (efficient batching)\n", + "```\n", + "\n", + "## Learning Objectives\n", + "By the end of this module, you will:\n", + "1. Implement a complete Trainer class with train/eval modes\n", + "2. Build learning rate scheduling and gradient clipping\n", + "3. Create checkpointing for model persistence\n", + "4. Test training loops with immediate validation\n", + "5. Understand gradient accumulation patterns\n", + "\n", + "Let's get started!\n", + "\n", + "## 📦 Where This Code Lives in the Final Package\n", + "\n", + "**Learning Side:** You work in modules/07_training/training_dev.py\n", + "**Building Side:** Code exports to tinytorch.core.training\n", + "\n", + "```python\n", + "# Final package structure:\n", + "from tinytorch.core.training import Trainer, CosineSchedule, clip_grad_norm # This module\n", + "from tinytorch.core.tensor import Tensor # Foundation (Module 01)\n", + "from tinytorch.core.optimizers import SGD, AdamW # Parameter updates (Module 06)\n", + "from tinytorch.core.losses import CrossEntropyLoss # Error measurement (Module 04)\n", + "```\n", + "\n", + "**Why this matters:**\n", + "- **Learning:** Complete training system in one focused module for deep understanding\n", + "- **Production:** Proper organization like PyTorch's training infrastructure with all training components together\n", + "- **Consistency:** All training operations and scheduling 
functionality in core.training\n", + "- **Integration:** Works seamlessly with optimizers and losses for complete learning pipelines" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fbd71fa0", + "metadata": { + "nbgrader": { + "grade": false, + "grade_id": "imports", + "locked": false, + "solution": false + } + }, + "outputs": [], + "source": [ + "#| default_exp core.training\n", + "\n", + "import numpy as np\n", + "import pickle\n", + "import time\n", + "from typing import Dict, List, Optional, Tuple, Any, Callable\n", + "from pathlib import Path" + ] + }, + { + "cell_type": "markdown", + "id": "15cb9212", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 🏗️ Part 1: Introduction - What is Training?\n", + "\n", + "Training is where the magic happens - it's the process that transforms a randomly initialized neural network into an intelligent system that can solve problems. Think of training as teaching: you show the model examples, it makes predictions, you measure how wrong it is, and then you adjust its parameters to do better next time.\n", + "\n", + "The training process follows a consistent pattern across all machine learning:\n", + "\n", + "1. **Forward Pass**: Input flows through the model to produce predictions\n", + "2. **Loss Calculation**: Compare predictions to true answers\n", + "3. **Backward Pass**: Compute gradients showing how to improve\n", + "4. **Parameter Update**: Adjust model weights using an optimizer\n", + "5. **Repeat**: Continue until the model learns the pattern\n", + "\n", + "But production training systems need much more than this basic loop. 
They need learning rate scheduling (starting fast, slowing down), gradient clipping (preventing exploding gradients), checkpointing (saving progress), and evaluation modes (testing without learning).\n", + "\n", + "**What we're building today:**\n", + "- A complete `Trainer` class that orchestrates the entire learning process\n", + "- Learning rate scheduling that adapts during training\n", + "- Gradient clipping that prevents training instability\n", + "- Checkpointing system for saving and resuming training\n", + "- Train/eval modes for proper model behavior" + ] + }, + { + "cell_type": "markdown", + "id": "7a2ca9d3", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 📐 Part 2: Foundations - Mathematical Background\n", + "\n", + "### Training Loop Mathematics\n", + "\n", + "The core training loop implements gradient descent with sophisticated improvements:\n", + "\n", + "**Basic Update Rule:**\n", + "```\n", + "θ(t+1) = θ(t) - η ∇L(θ(t))\n", + "```\n", + "Where θ are parameters, η is learning rate, and ∇L is the loss gradient.\n", + "\n", + "**Learning Rate Scheduling:**\n", + "For cosine annealing over T epochs:\n", + "```\n", + "η(t) = η_min + (η_max - η_min) * (1 + cos(πt/T)) / 2\n", + "```\n", + "\n", + "**Gradient Clipping:**\n", + "When ||∇L|| > max_norm, rescale:\n", + "```\n", + "∇L ← ∇L * max_norm / ||∇L||\n", + "```\n", + "\n", + "**Gradient Accumulation:**\n", + "For effective batch size B_eff = accumulation_steps * B_actual:\n", + "```\n", + "∇L_accumulated = (1/accumulation_steps) * Σ ∇L_batch_i\n", + "```\n", + "\n", + "### Train vs Eval Modes\n", + "\n", + "Many layers behave differently during training vs inference:\n", + "- **Dropout**: Active during training, disabled during evaluation\n", + "- **BatchNorm**: Updates statistics during training, uses fixed statistics during evaluation\n", + "- **Gradient computation**: Enabled during training, disabled during evaluation for efficiency\n", + "\n", + "This mode switching is crucial 
for proper model behavior and performance." + ] + }, + { + "cell_type": "markdown", + "id": "ebd9577e", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 🏗️ Part 3: Implementation - Building Training Infrastructure\n", + "\n", + "Now let's implement the complete training system. We'll build each component step by step: learning rate scheduling, gradient utilities, and finally the complete Trainer class.\n", + "\n", + "Each component will follow the pattern: **Explanation → Implementation → Test** so you understand what you're building before you build it." + ] + }, + { + "cell_type": "markdown", + "id": "a16fa592", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### Learning Rate Scheduling - Adaptive Training Speed\n", + "\n", + "Learning rate scheduling is like adjusting your driving speed based on road conditions. You start fast on the highway (high learning rate for quick progress), then slow down in neighborhoods (low learning rate for fine-tuning).\n", + "\n", + "#### Why Cosine Scheduling Works\n", + "\n", + "Cosine annealing follows a smooth curve that provides:\n", + "- **Aggressive learning initially** - Fast convergence when far from optimum\n", + "- **Gradual slowdown** - Stable convergence as you approach the solution\n", + "- **Smooth transitions** - No sudden learning rate drops that shock the model\n", + "\n", + "#### The Mathematics\n", + "\n", + "Cosine annealing uses the cosine function to smoothly transition from max_lr to min_lr:\n", + "\n", + "```\n", + "Learning Rate Schedule:\n", + "\n", + "max_lr ┌─\\\n", + " │ \\\n", + " │ \\\n", + " │ \\\n", + " │ \\\n", + "min_lr └───────────\\────────\n", + " 0 25 50 75 100 epochs\n", + "\n", + "Formula: lr = min_lr + (max_lr - min_lr) * (1 + cos(π * epoch / total_epochs)) / 2\n", + "```\n", + "\n", + "This creates a natural learning curve that adapts training speed to the optimization landscape." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c602af75", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "scheduler", + "locked": false, + "solution": true + } + }, + "outputs": [], + "source": [ + "class CosineSchedule:\n", + " \"\"\"\n", + " Cosine annealing learning rate schedule.\n", + "\n", + " Starts at max_lr, decreases following a cosine curve to min_lr over T epochs.\n", + " This provides aggressive learning initially, then fine-tuning at the end.\n", + "\n", + " TODO: Implement cosine annealing schedule\n", + "\n", + " APPROACH:\n", + " 1. Store max_lr, min_lr, and total_epochs\n", + " 2. In get_lr(), compute cosine factor: (1 + cos(π * epoch / total_epochs)) / 2\n", + " 3. Interpolate: min_lr + (max_lr - min_lr) * cosine_factor\n", + "\n", + " EXAMPLE:\n", + " >>> schedule = CosineSchedule(max_lr=0.1, min_lr=0.01, total_epochs=100)\n", + " >>> print(schedule.get_lr(0)) # Start: 0.1\n", + " >>> print(schedule.get_lr(50)) # Middle: ~0.055\n", + " >>> print(schedule.get_lr(100)) # End: 0.01\n", + "\n", + " HINT: Use np.cos() and np.pi for the cosine calculation\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " def __init__(self, max_lr: float = 0.1, min_lr: float = 0.01, total_epochs: int = 100):\n", + " self.max_lr = max_lr\n", + " self.min_lr = min_lr\n", + " self.total_epochs = total_epochs\n", + "\n", + " def get_lr(self, epoch: int) -> float:\n", + " \"\"\"Get learning rate for current epoch.\"\"\"\n", + " if epoch >= self.total_epochs:\n", + " return self.min_lr\n", + "\n", + " # Cosine annealing formula\n", + " cosine_factor = (1 + np.cos(np.pi * epoch / self.total_epochs)) / 2\n", + " return self.min_lr + (self.max_lr - self.min_lr) * cosine_factor\n", + " ### END SOLUTION" + ] + }, + { + "cell_type": "markdown", + "id": "aef4d23a", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### 🧪 Unit Test: CosineSchedule\n", + "This test validates 
our learning rate scheduling implementation.\n", + "**What we're testing**: Cosine annealing produces correct learning rates\n", + "**Why it matters**: Proper scheduling often makes the difference between convergence and failure\n", + "**Expected**: Smooth decrease from max_lr to min_lr following cosine curve" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2c489b51", + "metadata": { + "nbgrader": { + "grade": true, + "grade_id": "test_scheduler", + "locked": true, + "points": 10 + } + }, + "outputs": [], + "source": [ + "def test_unit_cosine_schedule():\n", + " \"\"\"🔬 Test CosineSchedule implementation.\"\"\"\n", + " print(\"🔬 Unit Test: CosineSchedule...\")\n", + "\n", + " # Test basic schedule\n", + " schedule = CosineSchedule(max_lr=0.1, min_lr=0.01, total_epochs=100)\n", + "\n", + " # Test start, middle, and end\n", + " lr_start = schedule.get_lr(0)\n", + " lr_middle = schedule.get_lr(50)\n", + " lr_end = schedule.get_lr(100)\n", + "\n", + " print(f\"Learning rate at epoch 0: {lr_start:.4f}\")\n", + " print(f\"Learning rate at epoch 50: {lr_middle:.4f}\")\n", + " print(f\"Learning rate at epoch 100: {lr_end:.4f}\")\n", + "\n", + " # Validate behavior\n", + " assert abs(lr_start - 0.1) < 1e-6, f\"Expected 0.1 at start, got {lr_start}\"\n", + " assert abs(lr_end - 0.01) < 1e-6, f\"Expected 0.01 at end, got {lr_end}\"\n", + " assert 0.01 < lr_middle < 0.1, f\"Middle LR should be between min and max, got {lr_middle}\"\n", + "\n", + " # Test monotonic decrease in first half\n", + " lr_quarter = schedule.get_lr(25)\n", + " assert lr_quarter > lr_middle, \"LR should decrease monotonically in first half\"\n", + "\n", + " print(\"✅ CosineSchedule works correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_cosine_schedule()" + ] + }, + { + "cell_type": "markdown", + "id": "f7388d6c", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### Gradient Clipping - Preventing Training 
Explosions\n", + "\n", + "Gradient clipping is like having a speed governor on your car - it prevents dangerous situations where gradients become so large they destroy training progress.\n", + "\n", + "#### The Problem: Exploding Gradients\n", + "\n", + "During training, gradients can sometimes become extremely large, causing:\n", + "- **Parameter updates that are too big** - Model jumps far from the optimal solution\n", + "- **Numerical instability** - Values become NaN or infinite\n", + "- **Training collapse** - Model performance suddenly degrades\n", + "\n", + "#### The Solution: Global Norm Clipping\n", + "\n", + "Instead of clipping each gradient individually, we compute the global norm across all parameters and scale uniformly:\n", + "\n", + "```\n", + "Gradient Clipping Process:\n", + "\n", + "1. Compute Global Norm:\n", + " total_norm = √(sum of all gradient squares)\n", + "\n", + "2. Check if Clipping Needed:\n", + " if total_norm > max_norm:\n", + " clip_coefficient = max_norm / total_norm\n", + "\n", + "3. Scale All Gradients:\n", + " for each gradient:\n", + " gradient *= clip_coefficient\n", + "\n", + "Visualization:\n", + "Original Gradients: [100, 200, 50] → norm = 230\n", + "With max_norm=1.0: [0.43, 0.87, 0.22] → norm = 1.0\n", + "```\n", + "\n", + "This preserves the relative magnitudes while preventing explosion." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b49a7499", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "gradient_clipping", + "locked": false, + "solution": true + } + }, + "outputs": [], + "source": [ + "def clip_grad_norm(parameters: List, max_norm: float = 1.0) -> float:\n", + " \"\"\"\n", + " Clip gradients by global norm to prevent exploding gradients.\n", + "\n", + " This is crucial for training stability, especially with RNNs and deep networks.\n", + " Instead of clipping each gradient individually, we compute the global norm\n", + " across all parameters and scale uniformly if needed.\n", + "\n", + " TODO: Implement gradient clipping by global norm\n", + "\n", + " APPROACH:\n", + " 1. Compute total norm: sqrt(sum of squared gradients across all parameters)\n", + " 2. If total_norm > max_norm, compute clip_coef = max_norm / total_norm\n", + " 3. Scale all gradients by clip_coef: grad *= clip_coef\n", + " 4. Return the original norm for monitoring\n", + "\n", + " EXAMPLE:\n", + " >>> params = [Tensor([1, 2, 3], requires_grad=True)]\n", + " >>> params[0].grad = Tensor([10, 20, 30]) # Large gradients\n", + " >>> original_norm = clip_grad_norm(params, max_norm=1.0)\n", + " >>> print(f\"Clipped norm: {np.linalg.norm(params[0].grad.data):.2f}\") # Should be ≤ 1.0\n", + "\n", + " HINTS:\n", + " - Use np.linalg.norm() to compute norms\n", + " - Only clip if total_norm > max_norm\n", + " - Modify gradients in-place for efficiency\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " if not parameters:\n", + " return 0.0\n", + "\n", + " # Collect all gradients and compute global norm\n", + " total_norm = 0.0\n", + " for param in parameters:\n", + " if hasattr(param, 'grad') and param.grad is not None:\n", + " # Handle both Tensor gradients and numpy array gradients\n", + " if isinstance(param.grad, np.ndarray):\n", + " grad_data = param.grad\n", + " elif hasattr(param.grad, 'data'):\n", + " grad_data = 
param.grad.data\n", + " else:\n", + " grad_data = np.array(param.grad)\n", + " total_norm += np.sum(grad_data ** 2)\n", + "\n", + " total_norm = np.sqrt(total_norm)\n", + "\n", + " # Clip if necessary\n", + " if total_norm > max_norm:\n", + " clip_coef = max_norm / total_norm\n", + " for param in parameters:\n", + " if hasattr(param, 'grad') and param.grad is not None:\n", + " # Handle both Tensor gradients and numpy array gradients\n", + " if isinstance(param.grad, np.ndarray):\n", + " param.grad = param.grad * clip_coef\n", + " elif hasattr(param.grad, 'data'):\n", + " param.grad.data = param.grad.data * clip_coef\n", + " else:\n", + " param.grad = param.grad * clip_coef\n", + "\n", + " return float(total_norm)\n", + " ### END SOLUTION" + ] + }, + { + "cell_type": "markdown", + "id": "c548cc0a", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### 🧪 Unit Test: Gradient Clipping\n", + "This test validates our gradient clipping implementation.\n", + "**What we're testing**: Global norm clipping properly rescales large gradients\n", + "**Why it matters**: Prevents exploding gradients that can destroy training\n", + "**Expected**: Gradients scaled down when norm exceeds threshold" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b390744c", + "metadata": { + "nbgrader": { + "grade": true, + "grade_id": "test_clipping", + "locked": true, + "points": 10 + } + }, + "outputs": [], + "source": [ + "def test_unit_clip_grad_norm():\n", + " \"\"\"🔬 Test clip_grad_norm implementation.\"\"\"\n", + " print(\"🔬 Unit Test: Gradient Clipping...\")\n", + "\n", + " # Use real Tensor from Module 01\n", + " import sys\n", + " sys.path.append('/Users/VJ/GitHub/TinyTorch/modules/01_tensor')\n", + " from tensor_dev import Tensor\n", + "\n", + " # Test case 1: Large gradients that need clipping\n", + " param1 = Tensor([1.0, 2.0], requires_grad=True)\n", + " param1.grad = np.array([3.0, 4.0]) # norm = 5.0\n", + "\n", + " 
param2 = Tensor([3.0, 4.0], requires_grad=True)\n", + " param2.grad = np.array([6.0, 8.0]) # norm = 10.0\n", + "\n", + " params = [param1, param2]\n", + " # Total norm = sqrt(5² + 10²) = sqrt(125) ≈ 11.18\n", + "\n", + " original_norm = clip_grad_norm(params, max_norm=1.0)\n", + "\n", + " # Check original norm was large\n", + " assert original_norm > 1.0, f\"Original norm should be > 1.0, got {original_norm}\"\n", + "\n", + " # Check gradients were clipped\n", + " new_norm = 0.0\n", + " for param in params:\n", + " if isinstance(param.grad, np.ndarray):\n", + " grad_data = param.grad\n", + " elif hasattr(param.grad, 'data'):\n", + " grad_data = param.grad.data\n", + " else:\n", + " grad_data = np.array(param.grad)\n", + " new_norm += np.sum(grad_data ** 2)\n", + " new_norm = np.sqrt(new_norm)\n", + "\n", + " print(f\"Original norm: {original_norm:.2f}\")\n", + " print(f\"Clipped norm: {new_norm:.2f}\")\n", + "\n", + " assert abs(new_norm - 1.0) < 1e-6, f\"Clipped norm should be 1.0, got {new_norm}\"\n", + "\n", + " # Test case 2: Small gradients that don't need clipping\n", + " small_param = Tensor([1.0, 2.0], requires_grad=True)\n", + " small_param.grad = np.array([0.1, 0.2])\n", + " small_params = [small_param]\n", + " original_small = clip_grad_norm(small_params, max_norm=1.0)\n", + "\n", + " assert original_small < 1.0, \"Small gradients shouldn't be clipped\"\n", + "\n", + " print(\"✅ Gradient clipping works correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_clip_grad_norm()" + ] + }, + { + "cell_type": "markdown", + "id": "d18224b3", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### The Trainer Class - Orchestrating Complete Training\n", + "\n", + "The Trainer class is like a conductor orchestrating a symphony - it coordinates all the components (model, optimizer, loss function, scheduler) to create beautiful music (successful training).\n", + "\n", + "#### Training Loop Architecture\n", + 
"\n", + "The training loop follows a consistent pattern across all machine learning:\n", + "\n", + "```\n", + "Training Loop Structure:\n", + "\n", + "for epoch in range(num_epochs):\n", + " ┌─────────────────── TRAINING PHASE ───────────────────┐\n", + " │ │\n", + " │ for batch in dataloader: │\n", + " │ ┌─── Forward Pass ───┐ │\n", + " │ │ 1. input → model │ │\n", + " │ │ 2. predictions │ │\n", + " │ └───────────────────┘ │\n", + " │ ↓ │\n", + " │ ┌─── Loss Computation ───┐ │\n", + " │ │ 3. loss = loss_fn() │ │\n", + " │ └───────────────────────┘ │\n", + " │ ↓ │\n", + " │ ┌─── Backward Pass ───┐ │\n", + " │ │ 4. loss.backward() │ │\n", + " │ │ 5. gradients │ │\n", + " │ └────────────────────┘ │\n", + " │ ↓ │\n", + " │ ┌─── Parameter Update ───┐ │\n", + " │ │ 6. optimizer.step() │ │\n", + " │ │ 7. zero gradients │ │\n", + " │ └───────────────────────┘ │\n", + " └───────────────────────────────────────────────────┘\n", + " ↓\n", + " ┌─── Learning Rate Update ───┐\n", + " │ 8. scheduler.step() │\n", + " └────────────────────────────┘\n", + "```\n", + "\n", + "#### Key Features\n", + "\n", + "- **Train/Eval Modes**: Different behavior during training vs evaluation\n", + "- **Gradient Accumulation**: Effective larger batch sizes with limited memory\n", + "- **Checkpointing**: Save/resume training state for long experiments\n", + "- **Progress Tracking**: Monitor loss, learning rate, and other metrics" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c806757c", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "trainer_class", + "locked": false, + "solution": true + } + }, + "outputs": [], + "source": [ + "class Trainer:\n", + " \"\"\"\n", + " Complete training orchestrator for neural networks.\n", + "\n", + " Handles the full training lifecycle: forward pass, loss computation,\n", + " backward pass, optimization, scheduling, checkpointing, and evaluation.\n", + "\n", + " This is the central class that 
brings together all the components\n", + " you've built in previous modules.\n", + "\n", + " TODO: Implement complete Trainer class\n", + "\n", + " APPROACH:\n", + " 1. Store model, optimizer, loss function, and optional scheduler\n", + " 2. train_epoch(): Loop through data, compute loss, update parameters\n", + " 3. evaluate(): Similar loop but without gradient updates\n", + " 4. save/load_checkpoint(): Persist training state for resumption\n", + "\n", + " DESIGN PATTERNS:\n", + " - Context managers for train/eval modes\n", + " - Gradient accumulation for effective large batch sizes\n", + " - Progress tracking for monitoring\n", + " - Flexible scheduling integration\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " def __init__(self, model, optimizer, loss_fn, scheduler=None, grad_clip_norm=None):\n", + " \"\"\"\n", + " Initialize trainer with model and training components.\n", + "\n", + " Args:\n", + " model: Neural network to train\n", + " optimizer: Parameter update strategy (SGD, Adam, etc.)\n", + " loss_fn: Loss function (CrossEntropy, MSE, etc.)\n", + " scheduler: Optional learning rate scheduler\n", + " grad_clip_norm: Optional gradient clipping threshold\n", + " \"\"\"\n", + " self.model = model\n", + " self.optimizer = optimizer\n", + " self.loss_fn = loss_fn\n", + " self.scheduler = scheduler\n", + " self.grad_clip_norm = grad_clip_norm\n", + "\n", + " # Training state\n", + " self.epoch = 0\n", + " self.step = 0\n", + " self.training_mode = True\n", + "\n", + " # History tracking\n", + " self.history = {\n", + " 'train_loss': [],\n", + " 'eval_loss': [],\n", + " 'learning_rates': []\n", + " }\n", + "\n", + " def train_epoch(self, dataloader, accumulation_steps=1):\n", + " \"\"\"\n", + " Train for one epoch through the dataset.\n", + "\n", + " Args:\n", + " dataloader: Iterable yielding (inputs, targets) batches\n", + " accumulation_steps: Number of batches to accumulate before update\n", + "\n", + " Returns:\n", + " Average loss for the epoch\n", + " 
\"\"\"\n", + " self.model.training = True\n", + " self.training_mode = True\n", + "\n", + " total_loss = 0.0\n", + " num_batches = 0\n", + " accumulated_loss = 0.0\n", + "\n", + " for batch_idx, (inputs, targets) in enumerate(dataloader):\n", + " # Forward pass\n", + " outputs = self.model.forward(inputs)\n", + " loss = self.loss_fn.forward(outputs, targets)\n", + "\n", + " # Scale loss for accumulation\n", + " scaled_loss = loss.data / accumulation_steps\n", + " accumulated_loss += scaled_loss\n", + "\n", + " # Backward pass\n", + " if hasattr(loss, 'backward'):\n", + " loss.backward()\n", + "\n", + " # Update parameters every accumulation_steps\n", + " if (batch_idx + 1) % accumulation_steps == 0:\n", + " # Gradient clipping\n", + " if self.grad_clip_norm is not None:\n", + " params = []\n", + " if hasattr(self.model, 'parameters'):\n", + " params = self.model.parameters()\n", + " clip_grad_norm(params, self.grad_clip_norm)\n", + "\n", + " # Optimizer step\n", + " self.optimizer.step()\n", + " self.optimizer.zero_grad()\n", + "\n", + " total_loss += accumulated_loss\n", + " accumulated_loss = 0.0\n", + " num_batches += 1\n", + " self.step += 1\n", + "\n", + " # Handle remaining accumulated gradients\n", + " if accumulated_loss > 0:\n", + " if self.grad_clip_norm is not None:\n", + " params = []\n", + " if hasattr(self.model, 'parameters'):\n", + " params = self.model.parameters()\n", + " clip_grad_norm(params, self.grad_clip_norm)\n", + "\n", + " self.optimizer.step()\n", + " self.optimizer.zero_grad()\n", + " total_loss += accumulated_loss\n", + " num_batches += 1\n", + "\n", + " avg_loss = total_loss / max(num_batches, 1)\n", + " self.history['train_loss'].append(avg_loss)\n", + "\n", + " # Update scheduler\n", + " if self.scheduler is not None:\n", + " current_lr = self.scheduler.get_lr(self.epoch)\n", + " # Update optimizer learning rate\n", + " if hasattr(self.optimizer, 'lr'):\n", + " self.optimizer.lr = current_lr\n", + " 
self.history['learning_rates'].append(current_lr)\n", + "\n", + " self.epoch += 1\n", + " return avg_loss\n", + "\n", + " def evaluate(self, dataloader):\n", + " \"\"\"\n", + " Evaluate model on dataset without updating parameters.\n", + "\n", + " Args:\n", + " dataloader: Iterable yielding (inputs, targets) batches\n", + "\n", + " Returns:\n", + " Average loss and accuracy\n", + " \"\"\"\n", + " self.model.training = False\n", + " self.training_mode = False\n", + "\n", + " total_loss = 0.0\n", + " correct = 0\n", + " total = 0\n", + "\n", + " for inputs, targets in dataloader:\n", + " # Forward pass only\n", + " outputs = self.model.forward(inputs)\n", + " loss = self.loss_fn.forward(outputs, targets)\n", + "\n", + " total_loss += loss.data\n", + "\n", + " # Calculate accuracy (for classification)\n", + " if hasattr(outputs, 'data') and hasattr(targets, 'data'):\n", + " if len(outputs.data.shape) > 1: # Multi-class\n", + " predictions = np.argmax(outputs.data, axis=1)\n", + " if len(targets.data.shape) == 1: # Integer targets\n", + " correct += np.sum(predictions == targets.data)\n", + " else: # One-hot targets\n", + " correct += np.sum(predictions == np.argmax(targets.data, axis=1))\n", + " total += len(predictions)\n", + "\n", + " avg_loss = total_loss / len(dataloader) if len(dataloader) > 0 else 0.0\n", + " accuracy = correct / total if total > 0 else 0.0\n", + "\n", + " self.history['eval_loss'].append(avg_loss)\n", + "\n", + " return avg_loss, accuracy\n", + "\n", + " def save_checkpoint(self, path: str):\n", + " \"\"\"\n", + " Save complete training state for resumption.\n", + "\n", + " Args:\n", + " path: File path to save checkpoint\n", + " \"\"\"\n", + " checkpoint = {\n", + " 'epoch': self.epoch,\n", + " 'step': self.step,\n", + " 'model_state': self._get_model_state(),\n", + " 'optimizer_state': self._get_optimizer_state(),\n", + " 'scheduler_state': self._get_scheduler_state(),\n", + " 'history': self.history,\n", + " 'training_mode': 
self.training_mode\n", + " }\n", + "\n", + " Path(path).parent.mkdir(parents=True, exist_ok=True)\n", + " with open(path, 'wb') as f:\n", + " pickle.dump(checkpoint, f)\n", + "\n", + " def load_checkpoint(self, path: str):\n", + " \"\"\"\n", + " Load training state from checkpoint.\n", + "\n", + " Args:\n", + " path: File path to load checkpoint from\n", + " \"\"\"\n", + " with open(path, 'rb') as f:\n", + " checkpoint = pickle.load(f)\n", + "\n", + " self.epoch = checkpoint['epoch']\n", + " self.step = checkpoint['step']\n", + " self.history = checkpoint['history']\n", + " self.training_mode = checkpoint['training_mode']\n", + "\n", + " # Restore states (simplified for educational purposes)\n", + " if 'model_state' in checkpoint:\n", + " self._set_model_state(checkpoint['model_state'])\n", + " if 'optimizer_state' in checkpoint:\n", + " self._set_optimizer_state(checkpoint['optimizer_state'])\n", + " if 'scheduler_state' in checkpoint:\n", + " self._set_scheduler_state(checkpoint['scheduler_state'])\n", + "\n", + " def _get_model_state(self):\n", + " \"\"\"Extract model parameters for checkpointing.\"\"\"\n", + " if hasattr(self.model, 'parameters'):\n", + " return {i: param.data.copy() for i, param in enumerate(self.model.parameters())}\n", + " return {}\n", + "\n", + " def _set_model_state(self, state):\n", + " \"\"\"Restore model parameters from checkpoint.\"\"\"\n", + " if hasattr(self.model, 'parameters'):\n", + " for i, param in enumerate(self.model.parameters()):\n", + " if i in state:\n", + " param.data = state[i].copy()\n", + "\n", + " def _get_optimizer_state(self):\n", + " \"\"\"Extract optimizer state for checkpointing.\"\"\"\n", + " state = {}\n", + " if hasattr(self.optimizer, 'lr'):\n", + " state['lr'] = self.optimizer.lr\n", + " if hasattr(self.optimizer, 'momentum_buffers'):\n", + " state['momentum_buffers'] = self.optimizer.momentum_buffers.copy()\n", + " return state\n", + "\n", + " def _set_optimizer_state(self, state):\n", + " \"\"\"Restore 
optimizer state from checkpoint.\"\"\"\n", + " if 'lr' in state and hasattr(self.optimizer, 'lr'):\n", + " self.optimizer.lr = state['lr']\n", + " if 'momentum_buffers' in state and hasattr(self.optimizer, 'momentum_buffers'):\n", + " self.optimizer.momentum_buffers = state['momentum_buffers']\n", + "\n", + " def _get_scheduler_state(self):\n", + " \"\"\"Extract scheduler state for checkpointing.\"\"\"\n", + " if self.scheduler is None:\n", + " return None\n", + " return {\n", + " 'max_lr': getattr(self.scheduler, 'max_lr', None),\n", + " 'min_lr': getattr(self.scheduler, 'min_lr', None),\n", + " 'total_epochs': getattr(self.scheduler, 'total_epochs', None)\n", + " }\n", + "\n", + " def _set_scheduler_state(self, state):\n", + " \"\"\"Restore scheduler state from checkpoint.\"\"\"\n", + " if state is None or self.scheduler is None:\n", + " return\n", + " for key, value in state.items():\n", + " if hasattr(self.scheduler, key):\n", + " setattr(self.scheduler, key, value)\n", + " ### END SOLUTION" + ] + }, + { + "cell_type": "markdown", + "id": "1cd1da58", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### 🧪 Unit Test: Trainer Class\n", + "This test validates our complete training system.\n", + "**What we're testing**: Trainer orchestrates training loop correctly\n", + "**Why it matters**: This is the backbone that enables all neural network training\n", + "**Expected**: Training reduces loss, evaluation works, checkpointing preserves state" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7baa78b0", + "metadata": { + "nbgrader": { + "grade": true, + "grade_id": "test_trainer", + "locked": true, + "points": 15 + } + }, + "outputs": [], + "source": [ + "def test_unit_trainer():\n", + " \"\"\"🔬 Test Trainer implementation.\"\"\"\n", + " print(\"🔬 Unit Test: Trainer...\")\n", + "\n", + " # Create mock components for testing\n", + " # Use REAL components from previous modules - no mocks!\n", + " import 
sys\n", + " # Use relative paths so tests run from any checkout location\n", + " sys.path.append('../01_tensor')\n", + " sys.path.append('../03_layers')\n", + " sys.path.append('../04_losses')\n", + " sys.path.append('../06_optimizers')\n", + " from tensor_dev import Tensor\n", + " from layers_dev import Linear\n", + " from losses_dev import MSELoss\n", + " from optimizers_dev import SGD\n", + "\n", + " # Create a simple model using REAL Linear layer\n", + " class SimpleModel:\n", + " def __init__(self):\n", + " self.layer = Linear(2, 1) # Real Linear from Module 03\n", + " self.training = True\n", + "\n", + " def forward(self, x):\n", + " return self.layer.forward(x)\n", + "\n", + " def parameters(self):\n", + " return self.layer.parameters()\n", + "\n", + " # Create trainer with REAL components\n", + " model = SimpleModel()\n", + " optimizer = SGD(model.parameters(), lr=0.01) # Real SGD from Module 06\n", + " loss_fn = MSELoss() # Real MSELoss from Module 04\n", + " scheduler = CosineSchedule(max_lr=0.1, min_lr=0.01, total_epochs=10)\n", + "\n", + " trainer = Trainer(model, optimizer, loss_fn, scheduler, grad_clip_norm=1.0)\n", + "\n", + " # Test training\n", + " print(\"Testing training epoch...\")\n", + " # Use real Tensors for data\n", + " dataloader = [\n", + " (Tensor([[1.0, 0.5]]), Tensor([[2.0]])),\n", + " (Tensor([[0.5, 1.0]]), Tensor([[1.5]]))\n", + " ]\n", + "\n", + " loss = trainer.train_epoch(dataloader)\n", + " assert isinstance(loss, (float, np.floating)), f\"Expected float loss, got {type(loss)}\"\n", + " assert trainer.epoch == 1, f\"Expected epoch 1, got {trainer.epoch}\"\n", + "\n", + " # Test evaluation\n", + " print(\"Testing evaluation...\")\n", + " eval_loss, accuracy = trainer.evaluate(dataloader)\n", + " assert isinstance(eval_loss, (float, np.floating)), f\"Expected float eval_loss, got {type(eval_loss)}\"\n", + " assert isinstance(accuracy, (float, np.floating)), f\"Expected 
float accuracy, got {type(accuracy)}\"\n", + "\n", + " # Test checkpointing\n", + " print(\"Testing checkpointing...\")\n", + " checkpoint_path = \"/tmp/test_checkpoint.pkl\"\n", + " trainer.save_checkpoint(checkpoint_path)\n", + "\n", + " # Modify trainer state\n", + " original_epoch = trainer.epoch\n", + " trainer.epoch = 999\n", + "\n", + " # Load checkpoint\n", + " trainer.load_checkpoint(checkpoint_path)\n", + " assert trainer.epoch == original_epoch, f\"Checkpoint didn't restore epoch correctly\"\n", + "\n", + " # Clean up\n", + " import os\n", + " if os.path.exists(checkpoint_path):\n", + " os.remove(checkpoint_path)\n", + "\n", + " print(f\"✅ Trainer works correctly! Final loss: {loss:.4f}\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_trainer()" + ] + }, + { + "cell_type": "markdown", + "id": "7546c5d2", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 2 + }, + "source": [ + "## 🔧 Part 4: Integration - Bringing Training Together\n", + "\n", + "Now let's create a complete training example that demonstrates how all the components work together. This integration shows the full power of our training infrastructure." + ] + }, + { + "cell_type": "markdown", + "id": "5eeb1d80", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 🧪 Part 4: Module Integration Test\n", + "\n", + "Final validation that everything works together correctly." + ] + }, + { + "cell_type": "markdown", + "id": "6585b9bd", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "## 🧪 Part 5: Module Integration Test\n", + "\n", + "Final validation that everything works together correctly." 
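Why the `accumulation_steps` logic in `train_epoch` is sound: averaging k micro-batch gradients, each scaled by 1/k, reproduces the gradient of one combined large batch. A standalone NumPy sketch (hypothetical linear model with MSE loss, names like `mse_grad` are illustrative, not TinyTorch code) verifying the equivalence:

```python
import numpy as np

# Hypothetical linear model y ~ x @ w with MSE loss; the gradient of the
# mean loss is 2/n * X.T @ (X @ w - y). Accumulating k micro-batch
# gradients, each scaled by 1/k, matches the single large-batch gradient.
rng = np.random.default_rng(0)
w = np.array([1.0, -2.0])
xs = rng.normal(size=(4, 2))          # 4 samples, 2 features
ys = rng.normal(size=4)

def mse_grad(w, x_batch, y_batch):
    err = x_batch @ w - y_batch
    return 2.0 * x_batch.T @ err / len(y_batch)

g_full = mse_grad(w, xs, ys)          # one batch of 4

accumulation_steps = 2                # two micro-batches of 2
g_accum = np.zeros_like(w)
for i in range(accumulation_steps):
    lo, hi = 2 * i, 2 * i + 2
    g_accum += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps

assert np.allclose(g_full, g_accum)   # same update direction and magnitude
```

This is why accumulation gives an "effective" batch size of `batch_size * accumulation_steps` without the corresponding activation memory.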
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4a0fa0c3", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": true, + "grade_id": "test_module", + "locked": true, + "points": 20 + } + }, + "outputs": [], + "source": [ + "def test_module():\n", + " \"\"\"\n", + " Comprehensive test of entire module functionality.\n", + "\n", + " This final test runs before module summary to ensure:\n", + " - All unit tests pass\n", + " - Functions work together correctly\n", + " - Module is ready for integration with TinyTorch\n", + " \"\"\"\n", + " print(\"🧪 RUNNING MODULE INTEGRATION TEST\")\n", + " print(\"=\" * 50)\n", + "\n", + " # Run all unit tests\n", + " print(\"Running unit tests...\")\n", + " test_unit_cosine_schedule()\n", + " test_unit_clip_grad_norm()\n", + " test_unit_trainer()\n", + "\n", + " print(\"\\nRunning integration scenarios...\")\n", + "\n", + " # Test complete training pipeline integration with REAL components\n", + " print(\"🔬 Integration Test: Complete Training Pipeline...\")\n", + "\n", + " # Use REAL components from previous modules\n", + " import sys\n", + " # Use relative paths so tests run from any checkout location\n", + " sys.path.append('../01_tensor')\n", + " sys.path.append('../03_layers')\n", + " sys.path.append('../04_losses')\n", + " sys.path.append('../06_optimizers')\n", + " from tensor_dev import Tensor\n", + " from layers_dev import Linear\n", + " from losses_dev import MSELoss\n", + " from optimizers_dev import SGD\n", + "\n", + " # Create a simple model using REAL Linear layer\n", + " class SimpleModel:\n", + " def __init__(self):\n", + " self.layer = Linear(2, 1) # Real Linear from Module 03\n", + " self.training = True\n", + "\n", + " def forward(self, x):\n", + " return self.layer.forward(x)\n", + "\n", + " def parameters(self):\n", + " return self.layer.parameters()\n", + "\n", + " # Create integrated system with REAL components\n", 
+ " model = SimpleModel()\n", + " optimizer = SGD(model.parameters(), lr=0.01) # Real SGD from Module 06\n", + " loss_fn = MSELoss() # Real MSELoss from Module 04\n", + " scheduler = CosineSchedule(max_lr=0.1, min_lr=0.001, total_epochs=3)\n", + "\n", + " trainer = Trainer(\n", + " model=model,\n", + " optimizer=optimizer,\n", + " loss_fn=loss_fn,\n", + " scheduler=scheduler,\n", + " grad_clip_norm=0.5\n", + " )\n", + "\n", + " # Test data using REAL Tensors\n", + " data = [\n", + " (Tensor([[1.0, 0.5]]), Tensor([[0.8]])),\n", + " (Tensor([[0.5, 1.0]]), Tensor([[0.2]]))\n", + " ]\n", + "\n", + " # Test training\n", + " initial_loss = trainer.train_epoch(data)\n", + " assert isinstance(initial_loss, (float, np.floating)), \"Training should return float loss\"\n", + " assert trainer.epoch == 1, \"Epoch should increment\"\n", + "\n", + " # Test evaluation\n", + " eval_loss, accuracy = trainer.evaluate(data)\n", + " assert isinstance(eval_loss, (float, np.floating)), \"Evaluation should return float loss\"\n", + " assert isinstance(accuracy, (float, np.floating)), \"Evaluation should return float accuracy\"\n", + "\n", + " # Test scheduling\n", + " lr_epoch_0 = scheduler.get_lr(0)\n", + " lr_epoch_1 = scheduler.get_lr(1)\n", + " assert lr_epoch_0 > lr_epoch_1, \"Learning rate should decrease\"\n", + "\n", + " # Test gradient clipping with large gradients using real Tensor\n", + " large_param = Tensor([1.0, 2.0], requires_grad=True)\n", + " large_param.grad = np.array([100.0, 200.0])\n", + " large_params = [large_param]\n", + "\n", + " original_norm = clip_grad_norm(large_params, max_norm=1.0)\n", + " assert original_norm > 1.0, \"Original norm should be large\"\n", + "\n", + " if isinstance(large_params[0].grad, np.ndarray):\n", + " grad_data = large_params[0].grad\n", + " elif hasattr(large_params[0].grad, 'data'):\n", + " grad_data = large_params[0].grad.data\n", + " else:\n", + " grad_data = np.array(large_params[0].grad)\n", + " new_norm = 
np.linalg.norm(grad_data)\n", + " assert abs(new_norm - 1.0) < 1e-6, \"Clipped norm should equal max_norm\"\n", + "\n", + " # Test checkpointing\n", + " checkpoint_path = \"/tmp/integration_test_checkpoint.pkl\"\n", + " trainer.save_checkpoint(checkpoint_path)\n", + "\n", + " original_epoch = trainer.epoch\n", + " trainer.epoch = 999\n", + " trainer.load_checkpoint(checkpoint_path)\n", + "\n", + " assert trainer.epoch == original_epoch, \"Checkpoint should restore state\"\n", + "\n", + " # Clean up\n", + " import os\n", + " if os.path.exists(checkpoint_path):\n", + " os.remove(checkpoint_path)\n", + "\n", + " print(\"✅ End-to-end training pipeline works!\")\n", + "\n", + " print(\"\\n\" + \"=\" * 50)\n", + " print(\"🎉 ALL TESTS PASSED! Module ready for export.\")\n", + " print(\"Run: tito module complete 07\")\n", + "\n", + "# test_module() # Moved to main guard" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ca90ce01", + "metadata": { + "nbgrader": { + "grade": false, + "grade_id": "main", + "locked": false, + "solution": false + } + }, + "outputs": [], + "source": [ + "# Run comprehensive module test\n", + "if __name__ == \"__main__\":\n", + " test_module()" + ] + }, + { + "cell_type": "markdown", + "id": "8462acc9", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 🎯 MODULE SUMMARY: Training\n", + "\n", + "Congratulations! 
You've built a complete training infrastructure that can orchestrate the entire machine learning training process!\n", + "\n", + "### Key Accomplishments\n", + "- Built Trainer class with complete training/evaluation loops\n", + "- Implemented CosineSchedule for adaptive learning rate management\n", + "- Created clip_grad_norm for training stability and gradient management\n", + "- Added comprehensive checkpointing for training persistence\n", + "- All tests pass ✅ (validated by `test_module()`)\n", + "\n", + "### Ready for Next Steps\n", + "Your training implementation enables sophisticated model training with proper scheduling, stability controls, and state management.\n", + "Export with: `tito module complete 07`\n", + "\n", + "**Next**: Module 08 will add DataLoader for efficient data pipeline management, completing the full training infrastructure needed for the MLP milestone!\n", + "\n", + "### Systems Insights Gained\n", + "- Learning rate scheduling often provides better convergence than fixed rates\n", + "- Gradient clipping preserves direction while preventing instability\n", + "- Checkpointing enables fault-tolerant training for production systems\n", + "\n", + "**🎓 You now understand the complete training infrastructure that powers modern ML systems!**" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/modules/07_training/training_dev.py b/modules/source/07_training/training_dev.py similarity index 65% rename from modules/07_training/training_dev.py rename to modules/source/07_training/training_dev.py index cab8632b..1c04ec13 100644 --- a/modules/07_training/training_dev.py +++ b/modules/source/07_training/training_dev.py @@ -61,12 +61,28 @@ from tinytorch.core.losses import CrossEntropyLoss # Error measurement (Module # %% nbgrader={"grade": false, "grade_id": "imports", "locked": false, "solution": false} 
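The `CosineSchedule` referenced above is assumed here to follow standard cosine annealing from `max_lr` down to `min_lr` over `total_epochs` (the constructor arguments seen in the tests); `cosine_lr` below is an illustrative standalone sketch of that formula, not the module's class:

```python
import math

def cosine_lr(epoch, max_lr=0.1, min_lr=0.001, total_epochs=10):
    # Standard cosine annealing: lr starts at max_lr (epoch 0) and
    # decays smoothly to min_lr at total_epochs.
    t = min(epoch, total_epochs) / total_epochs
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * t))

lrs = [cosine_lr(e) for e in range(11)]
assert abs(lrs[0] - 0.1) < 1e-12        # begins at max_lr
assert abs(lrs[10] - 0.001) < 1e-12     # ends at min_lr
assert all(a > b for a, b in zip(lrs, lrs[1:]))  # strictly decreasing
```

The cosine shape decays slowly at first (preserving early fast learning), steeply in the middle, and gently at the end, which is why it often converges better than a fixed rate.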
#| default_exp core.training +#| export import numpy as np import pickle import time from typing import Dict, List, Optional, Tuple, Any, Callable from pathlib import Path +import sys +import os + +# Import dependencies from other modules +sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor')) +from tensor_dev import Tensor + +sys.path.append(os.path.join(os.path.dirname(__file__), '..', '03_layers')) +from layers_dev import Linear + +sys.path.append(os.path.join(os.path.dirname(__file__), '..', '04_losses')) +from losses_dev import MSELoss, CrossEntropyLoss + +sys.path.append(os.path.join(os.path.dirname(__file__), '..', '06_optimizers')) +from optimizers_dev import SGD, AdamW # %% [markdown] """ @@ -331,7 +347,14 @@ def clip_grad_norm(parameters: List, max_norm: float = 1.0) -> float: total_norm = 0.0 for param in parameters: if hasattr(param, 'grad') and param.grad is not None: - total_norm += np.sum(param.grad.data ** 2) + # Handle both Tensor gradients and numpy array gradients + if isinstance(param.grad, np.ndarray): + grad_data = param.grad + elif hasattr(param.grad, 'data'): + grad_data = param.grad.data + else: + grad_data = np.array(param.grad) + total_norm += np.sum(grad_data ** 2) total_norm = np.sqrt(total_norm) @@ -340,7 +363,13 @@ def clip_grad_norm(parameters: List, max_norm: float = 1.0) -> float: clip_coef = max_norm / total_norm for param in parameters: if hasattr(param, 'grad') and param.grad is not None: - param.grad.data *= clip_coef + # Handle both Tensor gradients and numpy array gradients + if isinstance(param.grad, np.ndarray): + param.grad = param.grad * clip_coef + elif hasattr(param.grad, 'data'): + param.grad.data = param.grad.data * clip_coef + else: + param.grad = param.grad * clip_coef return float(total_norm) ### END SOLUTION @@ -359,16 +388,18 @@ def test_unit_clip_grad_norm(): """🔬 Test clip_grad_norm implementation.""" print("🔬 Unit Test: Gradient Clipping...") - # Create mock parameters with gradients 
(simulating Tensor.grad) - class MockParam: - def __init__(self, grad_data): - self.grad = type('grad', (), {'data': np.array(grad_data)})() + # Use real Tensor from Module 01 + import sys + # Tensor already imported at module level # Test case 1: Large gradients that need clipping - params = [ - MockParam([3.0, 4.0]), # norm = 5.0 - MockParam([6.0, 8.0]) # norm = 10.0 - ] + param1 = Tensor([1.0, 2.0], requires_grad=True) + param1.grad = np.array([3.0, 4.0]) # norm = 5.0 + + param2 = Tensor([3.0, 4.0], requires_grad=True) + param2.grad = np.array([6.0, 8.0]) # norm = 10.0 + + params = [param1, param2] # Total norm = sqrt(5² + 10²) = sqrt(125) ≈ 11.18 original_norm = clip_grad_norm(params, max_norm=1.0) @@ -379,7 +410,13 @@ def test_unit_clip_grad_norm(): # Check gradients were clipped new_norm = 0.0 for param in params: - new_norm += np.sum(param.grad.data ** 2) + if isinstance(param.grad, np.ndarray): + grad_data = param.grad + elif hasattr(param.grad, 'data'): + grad_data = param.grad.data + else: + grad_data = np.array(param.grad) + new_norm += np.sum(grad_data ** 2) new_norm = np.sqrt(new_norm) print(f"Original norm: {original_norm:.2f}") @@ -388,7 +425,9 @@ def test_unit_clip_grad_norm(): assert abs(new_norm - 1.0) < 1e-6, f"Clipped norm should be 1.0, got {new_norm}" # Test case 2: Small gradients that don't need clipping - small_params = [MockParam([0.1, 0.2])] + small_param = Tensor([1.0, 2.0], requires_grad=True) + small_param.grad = np.array([0.1, 0.2]) + small_params = [small_param] original_small = clip_grad_norm(small_params, max_norm=1.0) assert original_small < 1.0, "Small gradients shouldn't be clipped" @@ -726,12 +765,7 @@ def test_unit_trainer(): """🔬 Test Trainer implementation.""" print("🔬 Unit Test: Trainer...") - # Create mock components for testing - # Use REAL components from previous modules - no mocks! 
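Whatever container the gradients live in (raw `np.ndarray` or a `Tensor.data` wrapper), global-norm clipping reduces to one computation: take the L2 norm over *all* gradients jointly, then rescale every gradient by the same factor. A standalone sketch using the same numbers as the unit test (per-parameter norms 5 and 10, total √125; `clip_by_global_norm` is an illustrative name):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    # Global L2 norm across all gradient arrays combined
    total = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    if total > max_norm:
        scale = max_norm / total
        for g in grads:
            g *= scale          # in-place; preserves direction, shrinks length
    return total

grads = [np.array([3.0, 4.0]), np.array([6.0, 8.0])]   # norms 5 and 10
original = clip_by_global_norm(grads, max_norm=1.0)
assert abs(original - 125.0 ** 0.5) < 1e-9             # sqrt(5^2 + 10^2)

clipped = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
assert abs(clipped - 1.0) < 1e-9                       # now exactly max_norm
```

Because every gradient is scaled by the same coefficient, the update direction is unchanged; only the step length is bounded.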
- from modules.01_tensor.tensor_dev import Tensor - from modules.03_layers.layers_dev import Linear - from modules.04_losses.losses_dev import MSELoss - from modules.06_optimizers.optimizers_dev import SGD + # Use REAL components from previous modules (already imported at module level) # Create a simple model using REAL Linear layer class SimpleModel: @@ -761,15 +795,15 @@ def test_unit_trainer(): (Tensor([[0.5, 1.0]]), Tensor([[1.5]])) ] - loss = trainer.train_epoch(mock_dataloader) - assert isinstance(loss, float), f"Expected float loss, got {type(loss)}" + loss = trainer.train_epoch(dataloader) + assert isinstance(loss, (float, np.floating)), f"Expected float loss, got {type(loss)}" assert trainer.epoch == 1, f"Expected epoch 1, got {trainer.epoch}" # Test evaluation print("Testing evaluation...") - eval_loss, accuracy = trainer.evaluate(mock_dataloader) - assert isinstance(eval_loss, float), f"Expected float eval_loss, got {type(eval_loss)}" - assert isinstance(accuracy, float), f"Expected float accuracy, got {type(accuracy)}" + eval_loss, accuracy = trainer.evaluate(dataloader) + assert isinstance(eval_loss, (float, np.floating)), f"Expected float eval_loss, got {type(eval_loss)}" + assert isinstance(accuracy, (float, np.floating)), f"Expected float accuracy, got {type(accuracy)}" # Test checkpointing print("Testing checkpointing...") @@ -801,336 +835,20 @@ if __name__ == "__main__": Now let's create a complete training example that demonstrates how all the components work together. This integration shows the full power of our training infrastructure. """ -# %% nbgrader={"grade": false, "grade_id": "training_integration", "locked": false, "solution": true} -def demonstrate_complete_training(): - """ - Demonstrate complete training pipeline with all components. - - This shows how Trainer, CosineSchedule, and gradient clipping work together - to create a robust training system that could handle real neural networks. 
- """ - print("🏗️ Complete Training Pipeline Demonstration") - print("=" * 50) - - # Use MockModel from testing (simple and sufficient for demo) - class MockModel: - def __init__(self): - self.training = True - self.weight = type('param', (), {'data': np.array([1.0, 2.0]), 'grad': None})() - - def forward(self, x): - # Simple linear operation - result = type('output', (), {'data': np.dot(x.data, self.weight.data)})() - return result - - def parameters(self): - return [self.weight] - - class MockSGD: - def __init__(self, params, lr=0.01): - self.params = params - self.lr = lr - - def step(self): - # Simplified parameter update - for param in self.params: - if param.grad is not None: - param.data -= self.lr * param.grad.data - - def zero_grad(self): - for param in self.params: - param.grad = None - - class MSELoss: - def forward(self, outputs, targets): - diff = outputs.data - targets.data - loss_value = np.mean(diff ** 2) - result = type('loss', (), {'data': loss_value})() - - # Simplified backward pass - def backward(): - grad_output = 2 * diff / len(diff) - # Set gradients (simplified) - outputs.grad = type('grad', (), {'data': grad_output})() - - result.backward = backward - return result - - class MockTensor: - def __init__(self, data): - self.data = np.array(data, dtype=float) - - # 1. Create model and training components - print("1. Setting up training components...") - model = MockModel() - optimizer = MockSGD(model.parameters(), lr=0.1) - loss_fn = MSELoss() - scheduler = CosineSchedule(max_lr=0.1, min_lr=0.001, total_epochs=5) - - # 2. Create trainer with gradient clipping - trainer = Trainer( - model=model, - optimizer=optimizer, - loss_fn=loss_fn, - scheduler=scheduler, - grad_clip_norm=1.0 - ) - - # 3. Create simple dataset (linear function demo) - print("2. 
Creating synthetic dataset...") - train_data = [ - (MockTensor([1.0, 0.5]), MockTensor([0.8])), - (MockTensor([0.5, 1.0]), MockTensor([0.2])), - (MockTensor([0.3, 0.7]), MockTensor([0.5])), - (MockTensor([0.9, 0.1]), MockTensor([0.9])) - ] - - # 4. Training loop - print("3. Training model...") - print("\nEpoch | Train Loss | Learning Rate") - print("-" * 35) - - for epoch in range(5): - # Train for one epoch - train_loss = trainer.train_epoch(train_data) - - # Get current learning rate - current_lr = scheduler.get_lr(epoch) - - print(f"{epoch+1:5d} | {train_loss:10.6f} | {current_lr:12.6f}") - - # 5. Evaluation - print("\n4. Evaluating model...") - eval_loss, accuracy = trainer.evaluate(train_data) - print(f"Final evaluation - Loss: {eval_loss:.6f}, Accuracy: {accuracy:.3f}") - - # 6. Checkpointing demonstration - print("\n5. Testing checkpointing...") - checkpoint_path = "/tmp/training_demo_checkpoint.pkl" - trainer.save_checkpoint(checkpoint_path) - print(f"Checkpoint saved to {checkpoint_path}") - - # Modify and restore - original_epoch = trainer.epoch - trainer.epoch = 999 - trainer.load_checkpoint(checkpoint_path) - - print(f"Checkpoint restored - Epoch: {trainer.epoch} (was modified to 999)") - assert trainer.epoch == original_epoch, "Checkpoint restoration failed" - - # 7. Training history - print("\n6. 
Training history summary...") - print(f"Training losses: {[f'{loss:.4f}' for loss in trainer.history['train_loss']]}") - print(f"Learning rates: {[f'{lr:.4f}' for lr in trainer.history['learning_rates']]}") - - # Clean up - import os - if os.path.exists(checkpoint_path): - os.remove(checkpoint_path) - - print("\n✅ Complete training pipeline works perfectly!") - print("🎓 Ready for real neural network training!") - -# demonstrate_complete_training() # Moved to main guard # %% [markdown] """ -## 📊 Part 5: Systems Analysis - Training Performance and Memory +## 🧪 Part 4: Module Integration Test -Training systems have unique performance characteristics that differ significantly from inference. Let's analyze the key factors that affect training efficiency and understand the trade-offs involved. - -### Memory Analysis: Training vs Inference - -Training requires significantly more memory than inference because: - -``` -Memory Usage Breakdown: - - INFERENCE TRAINING -┌─────────────┐ ┌─────────────┐ -│ Parameters │ │ Parameters │ ← Same -│ 100MB │ │ 100MB │ -└─────────────┘ ├─────────────┤ - + │ Gradients │ ← Additional -┌─────────────┐ │ 100MB │ -│ Activations │ ├─────────────┤ -│ 50MB │ │ Optimizer │ ← 2-3× params -└─────────────┘ │ 200MB │ (Adam: momentum + velocity) - ├─────────────┤ - Total: 150MB │ Activations │ ← Larger (stored for backprop) - │ 150MB │ - └─────────────┘ - - Total: 550MB (3.7× inference) -``` - -Let's measure these effects and understand their implications. +Final validation that everything works together correctly. 
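The checkpoint round-trip exercised by the integration test rests on one simple pattern: collect training state into a plain dict, serialize it with `pickle`, and restore it later. A minimal standalone sketch (the state keys mirror `save_checkpoint`; the file path is illustrative):

```python
import os
import pickle
import tempfile
from pathlib import Path

# State dict mirroring what Trainer.save_checkpoint persists
state = {"epoch": 3, "step": 120, "history": {"train_loss": [0.9, 0.5, 0.3]}}

path = Path(tempfile.gettempdir()) / "sketch_checkpoint.pkl"
path.parent.mkdir(parents=True, exist_ok=True)  # same guard as save_checkpoint
with open(path, "wb") as f:
    pickle.dump(state, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

assert restored == state   # round-trip preserves all training state
os.remove(path)            # clean up, as the tests do
```

This is why training can resume after interruption: everything needed to continue (epoch counter, step counter, history, parameter values) is just data, and data survives serialization.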
""" -# %% nbgrader={"grade": false, "grade_id": "analyze_training_memory", "locked": false, "solution": true} -def analyze_training_memory(): - """📊 Analyze memory requirements for training vs inference.""" - print("📊 Training Memory Analysis") - print("=" * 40) - # Simulate memory usage for different model sizes - def estimate_memory_usage(num_params, batch_size=32, sequence_length=512): - """Estimate memory usage in MB for training vs inference.""" - # Parameter memory (FP32: 4 bytes per parameter) - param_memory = num_params * 4 / (1024 * 1024) # MB - - # Gradient memory (same size as parameters) - grad_memory = param_memory - - # Optimizer state (Adam: 2× parameters for momentum + second moments) - optimizer_memory = param_memory * 2 - - # Activation memory (depends on batch size and model depth) - # Rough estimate: batch_size * sequence_length * hidden_dim * num_layers * 4 bytes - activation_memory = batch_size * sequence_length * 512 * 12 * 4 / (1024 * 1024) - - # Inference only needs parameters + activations (no gradients or optimizer state) - inference_memory = param_memory + activation_memory * 0.1 # Much smaller activation memory - training_memory = param_memory + grad_memory + optimizer_memory + activation_memory - - return { - 'parameters': param_memory, - 'gradients': grad_memory, - 'optimizer': optimizer_memory, - 'activations': activation_memory, - 'inference_total': inference_memory, - 'training_total': training_memory, - 'overhead_ratio': training_memory / inference_memory - } - - # Analyze different model sizes - model_sizes = [ - ("Small MLP", 1_000_000), # 1M parameters - ("Medium Model", 50_000_000), # 50M parameters - ("Large Model", 500_000_000), # 500M parameters - ("GPT-scale", 1_000_000_000) # 1B parameters - ] - - print("Model Size | Params | Grads | Optimizer | Activations | Inference | Training | Overhead") - print("-" * 90) - - for name, num_params in model_sizes: - memory = estimate_memory_usage(num_params) - - print(f"{name:12s} | 
{memory['parameters']:6.0f} | {memory['gradients']:5.0f} | " - f"{memory['optimizer']:9.0f} | {memory['activations']:11.0f} | " - f"{memory['inference_total']:9.0f} | {memory['training_total']:8.0f} | " - f"{memory['overhead_ratio']:7.1f}x") - - print("\n💡 Key Insights:") - print("• Training memory grows with model size due to gradient and optimizer storage") - print("• Adam optimizer adds 2× parameter memory for momentum and second moments") - print("• Activation memory depends on batch size and can be reduced with gradient checkpointing") - print("• Training typically requires 3-4× more memory than inference") - -# analyze_training_memory() # Moved to main guard # %% [markdown] """ -### Batch Size Effects - The Memory vs Speed Trade-off - -Batch size affects training in complex ways, creating trade-offs between memory usage, compute efficiency, and convergence behavior. - -``` -Batch Size Impact Visualization: - -Memory Usage (linear): - batch=1 |▌ - batch=8 |████ - batch=32 |████████████████ - batch=128 |████████████████████████████████████████████████████████████████ - -Compute Efficiency (logarithmic): - batch=1 |▌ - batch=8 |████████ - batch=32 |██████████████ - batch=128 |████████████████ (plateaus due to hardware limits) - -Steps per Epoch (inverse): - batch=1 |████████████████████████████████████████████████████████████████ - batch=8 |████████ - batch=32 |██ - batch=128 |▌ - -Sweet Spot: Usually around 32-64 for most models -``` -""" - -# %% nbgrader={"grade": false, "grade_id": "analyze_batch_size_effects", "locked": false, "solution": true} -def analyze_batch_size_effects(): - """📊 Analyze how batch size affects training efficiency and convergence.""" - print("\n📊 Batch Size Effects Analysis") - print("=" * 40) - - # Simulate training with different batch sizes - batch_sizes = [1, 4, 16, 64, 256, 1024] - - def simulate_training_efficiency(batch_size): - """Simulate training metrics for different batch sizes.""" - - # Memory usage (linear with batch size 
for activations) - base_memory = 1000 # MB base model memory - activation_memory_per_sample = 50 # MB per sample - total_memory = base_memory + batch_size * activation_memory_per_sample - - # Compute efficiency (higher batch size → better GPU utilization) - # But diminishing returns due to memory bandwidth limits - compute_efficiency = min(1.0, 0.3 + 0.7 * (batch_size / 64)) - - # Communication overhead (for distributed training) - # More communication needed with larger batches - comm_overhead = 1.0 + (batch_size / 1000) * 0.5 - - # Convergence speed (larger batches may need more epochs) - # This is a simplified model of the batch size vs convergence trade-off - convergence_penalty = 1.0 + max(0, (batch_size - 32) / 200) - - # Time per step (includes compute + communication) - time_per_step = 100 / compute_efficiency * comm_overhead # ms - - # Steps per epoch (fewer steps with larger batches) - dataset_size = 50000 - steps_per_epoch = dataset_size // batch_size - - # Time per epoch - time_per_epoch = steps_per_epoch * time_per_step / 1000 # seconds - - return { - 'memory_mb': total_memory, - 'compute_efficiency': compute_efficiency, - 'time_per_step_ms': time_per_step, - 'steps_per_epoch': steps_per_epoch, - 'time_per_epoch_s': time_per_epoch, - 'convergence_factor': convergence_penalty - } - - print("Batch Size | Memory (MB) | Compute Eff | Steps/Epoch | Time/Epoch | Convergence") - print("-" * 75) - - for batch_size in batch_sizes: - metrics = simulate_training_efficiency(batch_size) - - print(f"{batch_size:10d} | {metrics['memory_mb']:11.0f} | " - f"{metrics['compute_efficiency']:11.2f} | {metrics['steps_per_epoch']:11d} | " - f"{metrics['time_per_epoch_s']:10.1f} | {metrics['convergence_factor']:11.2f}") - - print("\n💡 Key Insights:") - print("• Memory usage scales linearly with batch size (activation storage)") - print("• Compute efficiency improves with batch size but plateaus (GPU utilization)") - print("• Larger batches mean fewer steps per epoch but 
potentially slower convergence") - print("• Sweet spot often around 32-64 for most models, balancing all factors") - -# analyze_batch_size_effects() # Moved to main guard - -# %% [markdown] -""" -## 🧪 Part 6: Module Integration Test +## 🧪 Part 5: Module Integration Test Final validation that everything works together correctly. """ @@ -1156,60 +874,27 @@ def test_module(): print("\nRunning integration scenarios...") - # Test complete training pipeline integration + # Test complete training pipeline integration with REAL components print("🔬 Integration Test: Complete Training Pipeline...") - # Use MockModel (simple and sufficient for integration testing) - class MockModel: + # Use REAL components from previous modules (already imported at module level) + + # Create a simple model using REAL Linear layer + class SimpleModel: def __init__(self): + self.layer = Linear(2, 1) # Real Linear from Module 03 self.training = True - self.weight = type('param', (), {'data': np.array([1.0, 2.0]), 'grad': None})() def forward(self, x): - # Simple linear operation - result = type('output', (), {'data': np.dot(x.data, self.weight.data)})() - return result + return self.layer.forward(x) def parameters(self): - return [self.weight] + return self.layer.parameters() - class IntegrationOptimizer: - def __init__(self, params, lr=0.01): - self.params = params - self.lr = lr - - def step(self): - for param in self.params: - if param.grad is not None: - param.data -= self.lr * param.grad.data - - def zero_grad(self): - for param in self.params: - if hasattr(param, 'grad'): - param.grad = None - - class IntegrationLoss: - def forward(self, outputs, targets): - diff = outputs.data - targets.data - loss_value = np.mean(diff ** 2) - result = type('loss', (), {'data': loss_value})() - - def backward(): - # Simple gradient computation - for param in model.parameters(): - param.grad = type('grad', (), {'data': np.random.randn(*param.data.shape) * 0.1})() - - result.backward = backward - return 
result - - class IntegrationTensor: - def __init__(self, data): - self.data = np.array(data, dtype=float) - - # Create integrated system - model = MockModel() - optimizer = IntegrationOptimizer(model.parameters(), lr=0.01) - loss_fn = IntegrationLoss() + # Create integrated system with REAL components + model = SimpleModel() + optimizer = SGD(model.parameters(), lr=0.01) # Real SGD from Module 06 + loss_fn = MSELoss() # Real MSELoss from Module 04 scheduler = CosineSchedule(max_lr=0.1, min_lr=0.001, total_epochs=3) trainer = Trainer( @@ -1220,33 +905,42 @@ def test_module(): grad_clip_norm=0.5 ) - # Test data + # Test data using REAL Tensors data = [ - (IntegrationTensor([1.0, 0.5]), IntegrationTensor([0.8])), - (IntegrationTensor([0.5, 1.0]), IntegrationTensor([0.2])) + (Tensor([[1.0, 0.5]]), Tensor([[0.8]])), + (Tensor([[0.5, 1.0]]), Tensor([[0.2]])) ] # Test training initial_loss = trainer.train_epoch(data) - assert isinstance(initial_loss, float), "Training should return float loss" + assert isinstance(initial_loss, (float, np.floating)), "Training should return float loss" assert trainer.epoch == 1, "Epoch should increment" # Test evaluation eval_loss, accuracy = trainer.evaluate(data) - assert isinstance(eval_loss, float), "Evaluation should return float loss" - assert isinstance(accuracy, float), "Evaluation should return float accuracy" + assert isinstance(eval_loss, (float, np.floating)), "Evaluation should return float loss" + assert isinstance(accuracy, (float, np.floating)), "Evaluation should return float accuracy" # Test scheduling lr_epoch_0 = scheduler.get_lr(0) lr_epoch_1 = scheduler.get_lr(1) assert lr_epoch_0 > lr_epoch_1, "Learning rate should decrease" - # Test gradient clipping with large gradients - large_params = [type('param', (), {'grad': type('grad', (), {'data': np.array([100.0, 200.0])})()})()] + # Test gradient clipping with large gradients using real Tensor + large_param = Tensor([1.0, 2.0], requires_grad=True) + large_param.grad = 
np.array([100.0, 200.0]) + large_params = [large_param] + original_norm = clip_grad_norm(large_params, max_norm=1.0) assert original_norm > 1.0, "Original norm should be large" - new_norm = np.linalg.norm(large_params[0].grad.data) + if isinstance(large_params[0].grad, np.ndarray): + grad_data = large_params[0].grad + elif hasattr(large_params[0].grad, 'data'): + grad_data = large_params[0].grad.data + else: + grad_data = np.array(large_params[0].grad) + new_norm = np.linalg.norm(grad_data) assert abs(new_norm - 1.0) < 1e-6, "Clipped norm should equal max_norm" # Test checkpointing @@ -1288,7 +982,6 @@ Congratulations! You've built a complete training infrastructure that can orches - Implemented CosineSchedule for adaptive learning rate management - Created clip_grad_norm for training stability and gradient management - Added comprehensive checkpointing for training persistence -- Discovered training memory scales 3-4× beyond inference requirements - All tests pass ✅ (validated by `test_module()`) ### Ready for Next Steps @@ -1298,8 +991,6 @@ Export with: `tito module complete 07` **Next**: Module 08 will add DataLoader for efficient data pipeline management, completing the full training infrastructure needed for the MLP milestone! 
### Systems Insights Gained -- Training memory overhead comes from gradients (1×) + optimizer state (2×) + activations -- Batch size affects memory linearly but compute efficiency sub-linearly - Learning rate scheduling often provides better convergence than fixed rates - Gradient clipping preserves direction while preventing instability - Checkpointing enables fault-tolerant training for production systems diff --git a/modules/source/08_dataloader/dataloader_dev.ipynb b/modules/source/08_dataloader/dataloader_dev.ipynb new file mode 100644 index 00000000..493ea3ef --- /dev/null +++ b/modules/source/08_dataloader/dataloader_dev.ipynb @@ -0,0 +1,1421 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "id": "a84152d1", + "metadata": {}, + "outputs": [], + "source": [ + "#| default_exp data.loader" + ] + }, + { + "cell_type": "markdown", + "id": "2c983083", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "# Module 08: DataLoader - Efficient Data Pipeline for ML Training\n", + "\n", + "Welcome to Module 08! You're about to build the data loading infrastructure that transforms how ML models consume data during training.\n", + "\n", + "## 🔗 Prerequisites & Progress\n", + "**You've Built**: Tensor operations, activations, layers, losses, autograd, optimizers, and training loops\n", + "**You'll Build**: Dataset abstraction, DataLoader with batching/shuffling, and real dataset support\n", + "**You'll Enable**: Efficient data pipelines that feed hungry neural networks with properly formatted batches\n", + "\n", + "**Connection Map**:\n", + "```\n", + "Training Loop → DataLoader → Batched Data → Model\n", + "(Module 07) (Module 08) (optimized) (ready to learn)\n", + "```\n", + "\n", + "## Learning Objectives\n", + "By the end of this module, you will:\n", + "1. Understand the data pipeline: individual samples → batches → training\n", + "2. Implement Dataset abstraction and TensorDataset for tensor-based data\n", + "3. 
Build DataLoader with intelligent batching, shuffling, and memory-efficient iteration\n", + "4. Experience data pipeline performance characteristics firsthand\n", + "5. Create download functions for real computer vision datasets\n", + "\n", + "Let's transform scattered data into organized learning batches!\n", + "\n", + "## 📦 Where This Code Lives in the Final Package\n", + "\n", + "**Learning Side:** You work in modules/08_dataloader/dataloader_dev.py\n", + "**Building Side:** Code exports to tinytorch.data.loader\n", + "\n", + "```python\n", + "# Final package structure:\n", + "from tinytorch.data.loader import Dataset, DataLoader, TensorDataset # This module\n", + "from tinytorch.data.loader import download_mnist, download_cifar10 # Dataset utilities\n", + "from tinytorch.core.tensor import Tensor # Foundation (Module 01)\n", + "```\n", + "\n", + "**Why this matters:**\n", + "- **Learning:** Complete data loading system in one focused module for deep understanding\n", + "- **Production:** Proper organization like PyTorch's torch.utils.data with all core data utilities\n", + "- **Efficiency:** Optimized data pipelines are crucial for training speed and memory usage\n", + "- **Integration:** Works seamlessly with training loops to create complete ML systems" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "827d5572", + "metadata": {}, + "outputs": [], + "source": [ + "# Essential imports for data loading\n", + "import numpy as np\n", + "import random\n", + "from typing import Iterator, Tuple, List, Optional, Union\n", + "from abc import ABC, abstractmethod\n", + "import os\n", + "import gzip\n", + "import urllib.request\n", + "import pickle\n", + "import sys\n", + "\n", + "# Import real Tensor class from Module 01\n", + "sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor'))\n", + "from tensor_dev import Tensor" + ] + }, + { + "cell_type": "markdown", + "id": "fa205a01", + "metadata": { + "cell_marker": "\"\"\"", + 
"lines_to_next_cell": 1 + }, + "source": [ + "## Part 1: Understanding the Data Pipeline\n", + "\n", + "Before we implement anything, let's understand what happens when neural networks \"eat\" data. The journey from raw data to trained models follows a specific pipeline that every ML engineer must master.\n", + "\n", + "### The Data Pipeline Journey\n", + "\n", + "Imagine you have 50,000 images of cats and dogs, and you want to train a neural network to classify them:\n", + "\n", + "```\n", + "Raw Data Storage Dataset Interface DataLoader Batching Training Loop\n", + "┌─────────────────┐ ┌──────────────────┐ ┌────────────────────┐ ┌─────────────┐\n", + "│ cat_001.jpg │ │ dataset[0] │ │ Batch 1: │ │ model(batch)│\n", + "│ dog_023.jpg │ ───> │ dataset[1] │ ───> │ [cat, dog, cat] │ ───> │ optimizer │\n", + "│ cat_045.jpg │ │ dataset[2] │ │ Batch 2: │ │ loss │\n", + "│ ... │ │ ... │ │ [dog, cat, dog] │ │ backward │\n", + "│ (50,000 files) │ │ dataset[49999] │ │ ... │ │ step │\n", + "└─────────────────┘ └──────────────────┘ └────────────────────┘ └─────────────┘\n", + "```\n", + "\n", + "### Why This Pipeline Matters\n", + "\n", + "**Individual Access (Dataset)**: Neural networks can't process 50,000 files at once. We need a way to access one sample at a time: \"Give me image #1,247\".\n", + "\n", + "**Batch Processing (DataLoader)**: GPUs are parallel machines - they're much faster processing 32 images simultaneously than 1 image 32 times.\n", + "\n", + "**Memory Efficiency**: Loading all 50,000 images into memory would require ~150GB. 
Instead, we load only the current batch (~150MB).\n", + "\n", + "**Training Variety**: Shuffling ensures the model sees different combinations each epoch, preventing memorization.\n", + "\n", + "### The Dataset Abstraction\n", + "\n", + "The Dataset class provides a uniform interface for accessing data, regardless of whether it's stored as files, in memory, in databases, or generated on-the-fly:\n", + "\n", + "```\n", + "Dataset Interface\n", + "┌─────────────────────────────────────┐\n", + "│ __len__() → \"How many samples?\" │\n", + "│ __getitem__(i) → \"Give me sample i\" │\n", + "└─────────────────────────────────────┘\n", + " ↑ ↑\n", + " Enables for Enables indexing\n", + " loops/iteration dataset[index]\n", + "```\n", + "\n", + "**Connection to systems**: This abstraction is crucial because it separates *how data is stored* from *how it's accessed*, enabling optimizations like caching, prefetching, and parallel loading." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "42670b50", + "metadata": { + "nbgrader": { + "grade": false, + "grade_id": "dataset-implementation", + "solution": true + } + }, + "outputs": [], + "source": [ + "class Dataset(ABC):\n", + " \"\"\"\n", + " Abstract base class for all datasets.\n", + "\n", + " Provides the fundamental interface that all datasets must implement:\n", + " - __len__(): Returns the total number of samples\n", + " - __getitem__(idx): Returns the sample at given index\n", + "\n", + " TODO: Implement the abstract Dataset base class\n", + "\n", + " APPROACH:\n", + " 1. Use ABC (Abstract Base Class) to define interface\n", + " 2. Mark methods as @abstractmethod to force implementation\n", + " 3. Provide clear docstrings for subclasses\n", + "\n", + " EXAMPLE:\n", + " >>> class MyDataset(Dataset):\n", + " ... def __len__(self): return 100\n", + " ... 
def __getitem__(self, idx): return idx\n", + " >>> dataset = MyDataset()\n", + " >>> print(len(dataset)) # 100\n", + " >>> print(dataset[42]) # 42\n", + "\n", + " HINT: Abstract methods force subclasses to implement core functionality\n", + " \"\"\"\n", + "\n", + " ### BEGIN SOLUTION\n", + " @abstractmethod\n", + " def __len__(self) -> int:\n", + " \"\"\"\n", + " Return the total number of samples in the dataset.\n", + "\n", + " This method must be implemented by all subclasses to enable\n", + " len(dataset) calls and batch size calculations.\n", + " \"\"\"\n", + " pass\n", + "\n", + " @abstractmethod\n", + " def __getitem__(self, idx: int):\n", + " \"\"\"\n", + " Return the sample at the given index.\n", + "\n", + " Args:\n", + " idx: Index of the sample to retrieve (0 <= idx < len(dataset))\n", + "\n", + " Returns:\n", + " The sample at index idx. Format depends on the dataset implementation.\n", + " Could be (data, label) tuple, single tensor, etc.\n", + " \"\"\"\n", + " pass\n", + " ### END SOLUTION" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6982e18c", + "metadata": { + "lines_to_next_cell": 2, + "nbgrader": { + "grade": true, + "grade_id": "test-dataset", + "locked": true, + "points": 10 + } + }, + "outputs": [], + "source": [ + "def test_unit_dataset():\n", + " \"\"\"🔬 Test Dataset abstract base class.\"\"\"\n", + " print(\"🔬 Unit Test: Dataset Abstract Base Class...\")\n", + "\n", + " # Test that Dataset is properly abstract\n", + " try:\n", + " dataset = Dataset()\n", + " assert False, \"Should not be able to instantiate abstract Dataset\"\n", + " except TypeError:\n", + " print(\"✅ Dataset is properly abstract\")\n", + "\n", + " # Test concrete implementation\n", + " class TestDataset(Dataset):\n", + " def __init__(self, size):\n", + " self.size = size\n", + "\n", + " def __len__(self):\n", + " return self.size\n", + "\n", + " def __getitem__(self, idx):\n", + " return f\"item_{idx}\"\n", + "\n", + " dataset = 
TestDataset(10)\n", + " assert len(dataset) == 10\n", + " assert dataset[0] == \"item_0\"\n", + " assert dataset[9] == \"item_9\"\n", + "\n", + " print(\"✅ Dataset interface works correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_dataset()" + ] + }, + { + "cell_type": "markdown", + "id": "e470f707", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "## Part 2: TensorDataset - When Data Lives in Memory\n", + "\n", + "Now let's implement TensorDataset, the most common dataset type for when your data is already loaded into tensors. This is perfect for datasets like MNIST where you can fit everything in memory.\n", + "\n", + "### Understanding TensorDataset Structure\n", + "\n", + "TensorDataset takes multiple tensors and aligns them by their first dimension (the sample dimension):\n", + "\n", + "```\n", + "Input Tensors (aligned by first dimension):\n", + " Features Tensor Labels Tensor Metadata Tensor\n", + " ┌─────────────────┐ ┌───────────────┐ ┌─────────────────┐\n", + " │ [1.2, 3.4, 5.6] │ │ 0 (cat) │ │ \"image_001.jpg\" │ ← Sample 0\n", + " │ [2.1, 4.3, 6.5] │ │ 1 (dog) │ │ \"image_002.jpg\" │ ← Sample 1\n", + " │ [3.0, 5.2, 7.4] │ │ 0 (cat) │ │ \"image_003.jpg\" │ ← Sample 2\n", + " │ ... │ │ ... │ │ ... 
│\n", + " └─────────────────┘ └───────────────┘ └─────────────────┘\n", + " (N, 3) (N,) (N,)\n", + "\n", + "Dataset Access:\n", + " dataset[1] → (Tensor([2.1, 4.3, 6.5]), Tensor(1), \"image_002.jpg\")\n", + "```\n", + "\n", + "### Why TensorDataset is Powerful\n", + "\n", + "**Memory Locality**: All data is pre-loaded and stored contiguously in memory, enabling fast access patterns.\n", + "\n", + "**Vectorized Operations**: Since everything is already tensors, there is no conversion overhead during training.\n", + "\n", + "**Perfect for Supervised Learning**: Naturally handles (features, labels) pairs, plus any additional metadata.\n", + "\n", + "**Batch-Friendly**: When DataLoader needs a batch, it can slice multiple samples efficiently.\n", + "\n", + "### Real-World Usage Patterns\n", + "\n", + "```\n", + "# Computer Vision\n", + "images = Tensor(shape=(50000, 32, 32, 3)) # CIFAR-10 images\n", + "labels = Tensor(shape=(50000,)) # Class labels 0-9\n", + "dataset = TensorDataset(images, labels)\n", + "\n", + "# Natural Language Processing\n", + "token_ids = Tensor(shape=(10000, 512)) # Tokenized sentences\n", + "labels = Tensor(shape=(10000,)) # Sentiment labels\n", + "dataset = TensorDataset(token_ids, labels)\n", + "\n", + "# Time Series\n", + "sequences = Tensor(shape=(1000, 100, 5)) # 100 timesteps, 5 features\n", + "targets = Tensor(shape=(1000, 10)) # 10-step ahead prediction\n", + "dataset = TensorDataset(sequences, targets)\n", + "```\n", + "\n", + "The key insight: TensorDataset transforms \"arrays of data\" into \"a dataset that serves samples\"."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b2ab0932", + "metadata": { + "nbgrader": { + "grade": false, + "grade_id": "tensordataset-implementation", + "solution": true + } + }, + "outputs": [], + "source": [ + "class TensorDataset(Dataset):\n", + " \"\"\"\n", + " Dataset wrapping tensors for supervised learning.\n", + "\n", + " Each sample is a tuple of tensors from the same index across all input tensors.\n", + " All tensors must have the same size in their first dimension.\n", + "\n", + " TODO: Implement TensorDataset for tensor-based data\n", + "\n", + " APPROACH:\n", + " 1. Store all input tensors\n", + " 2. Validate they have same first dimension (number of samples)\n", + " 3. Return tuple of tensor slices for each index\n", + "\n", + " EXAMPLE:\n", + " >>> features = Tensor([[1, 2], [3, 4], [5, 6]]) # 3 samples, 2 features each\n", + " >>> labels = Tensor([0, 1, 0]) # 3 labels\n", + " >>> dataset = TensorDataset(features, labels)\n", + " >>> print(len(dataset)) # 3\n", + " >>> print(dataset[1]) # (Tensor([3, 4]), Tensor(1))\n", + "\n", + " HINTS:\n", + " - Use *tensors to accept variable number of tensor arguments\n", + " - Check all tensors have same length in dimension 0\n", + " - Return tuple of tensor[idx] for all tensors\n", + " \"\"\"\n", + "\n", + " def __init__(self, *tensors):\n", + " \"\"\"\n", + " Create dataset from multiple tensors.\n", + "\n", + " Args:\n", + " *tensors: Variable number of Tensor objects\n", + "\n", + " All tensors must have the same size in their first dimension.\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " assert len(tensors) > 0, \"Must provide at least one tensor\"\n", + "\n", + " # Store all tensors\n", + " self.tensors = tensors\n", + "\n", + " # Validate all tensors have same first dimension\n", + " first_size = len(tensors[0].data) # Size of first dimension\n", + " for i, tensor in enumerate(tensors):\n", + " if len(tensor.data) != first_size:\n", + " raise ValueError(\n", + " f\"All 
tensors must have same size in first dimension. \"\n", + " f\"Tensor 0: {first_size}, Tensor {i}: {len(tensor.data)}\"\n", + " )\n", + " ### END SOLUTION\n", + "\n", + " def __len__(self) -> int:\n", + " \"\"\"Return number of samples (size of first dimension).\"\"\"\n", + " ### BEGIN SOLUTION\n", + " return len(self.tensors[0].data)\n", + " ### END SOLUTION\n", + "\n", + " def __getitem__(self, idx: int) -> Tuple[Tensor, ...]:\n", + " \"\"\"\n", + " Return tuple of tensor slices at given index.\n", + "\n", + " Args:\n", + " idx: Sample index\n", + "\n", + " Returns:\n", + " Tuple containing tensor[idx] for each input tensor\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " if idx >= len(self) or idx < 0:\n", + " raise IndexError(f\"Index {idx} out of range for dataset of size {len(self)}\")\n", + "\n", + " # Return tuple of slices from all tensors\n", + " return tuple(Tensor(tensor.data[idx]) for tensor in self.tensors)\n", + " ### END SOLUTION" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "06f20cff", + "metadata": { + "lines_to_next_cell": 2, + "nbgrader": { + "grade": true, + "grade_id": "test-tensordataset", + "locked": true, + "points": 15 + } + }, + "outputs": [], + "source": [ + "def test_unit_tensordataset():\n", + " \"\"\"🔬 Test TensorDataset implementation.\"\"\"\n", + " print(\"🔬 Unit Test: TensorDataset...\")\n", + "\n", + " # Test basic functionality\n", + " features = Tensor([[1, 2], [3, 4], [5, 6]]) # 3 samples, 2 features\n", + " labels = Tensor([0, 1, 0]) # 3 labels\n", + "\n", + " dataset = TensorDataset(features, labels)\n", + "\n", + " # Test length\n", + " assert len(dataset) == 3, f\"Expected length 3, got {len(dataset)}\"\n", + "\n", + " # Test indexing\n", + " sample = dataset[0]\n", + " assert len(sample) == 2, \"Should return tuple with 2 tensors\"\n", + " assert np.array_equal(sample[0].data, [1, 2]), f\"Wrong features: {sample[0].data}\"\n", + " assert sample[1].data == 0, f\"Wrong label: {sample[1].data}\"\n", 
+ "\n", + " sample = dataset[1]\n", + " assert np.array_equal(sample[1].data, 1), f\"Wrong label at index 1: {sample[1].data}\"\n", + "\n", + " # Test error handling\n", + " try:\n", + " dataset[10] # Out of bounds\n", + " assert False, \"Should raise IndexError for out of bounds access\"\n", + " except IndexError:\n", + " pass\n", + "\n", + " # Test mismatched tensor sizes\n", + " try:\n", + " bad_features = Tensor([[1, 2], [3, 4]]) # Only 2 samples\n", + " bad_labels = Tensor([0, 1, 0]) # 3 labels - mismatch!\n", + " TensorDataset(bad_features, bad_labels)\n", + " assert False, \"Should raise error for mismatched tensor sizes\"\n", + " except ValueError:\n", + " pass\n", + "\n", + " print(\"✅ TensorDataset works correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_tensordataset()" + ] + }, + { + "cell_type": "markdown", + "id": "707ffdbb", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "## Part 3: DataLoader - The Batch Factory\n", + "\n", + "Now we build the DataLoader, the component that transforms individual dataset samples into the batches that neural networks crave. 
This is where data loading becomes a systems challenge.\n", + "\n", + "### Understanding Batching: From Samples to Tensors\n", + "\n", + "DataLoader performs a crucial transformation - it collects individual samples and stacks them into batch tensors:\n", + "\n", + "```\n", + "Step 1: Individual Samples from Dataset\n", + " dataset[0] → (features: [1, 2, 3], label: 0)\n", + " dataset[1] → (features: [4, 5, 6], label: 1)\n", + " dataset[2] → (features: [7, 8, 9], label: 0)\n", + " dataset[3] → (features: [2, 3, 4], label: 1)\n", + "\n", + "Step 2: DataLoader Groups into Batch (batch_size=2)\n", + " Batch 1:\n", + " features: [[1, 2, 3], ← Stacked into shape (2, 3)\n", + " [4, 5, 6]]\n", + " labels: [0, 1] ← Stacked into shape (2,)\n", + "\n", + " Batch 2:\n", + " features: [[7, 8, 9], ← Stacked into shape (2, 3)\n", + " [2, 3, 4]]\n", + " labels: [0, 1] ← Stacked into shape (2,)\n", + "```\n", + "\n", + "### The Shuffling Process\n", + "\n", + "Shuffling randomizes which samples appear in which batches, crucial for good training:\n", + "\n", + "```\n", + "Without Shuffling (epoch 1): With Shuffling (epoch 1):\n", + " Batch 1: [sample 0, sample 1] Batch 1: [sample 2, sample 0]\n", + " Batch 2: [sample 2, sample 3] Batch 2: [sample 3, sample 1]\n", + " Batch 3: [sample 4, sample 5] Batch 3: [sample 5, sample 4]\n", + "\n", + "Without Shuffling (epoch 2): With Shuffling (epoch 2):\n", + " Batch 1: [sample 0, sample 1] ✗ Batch 1: [sample 1, sample 4] ✓\n", + " Batch 2: [sample 2, sample 3] ✗ Batch 2: [sample 0, sample 5] ✓\n", + " Batch 3: [sample 4, sample 5] ✗ Batch 3: [sample 2, sample 3] ✓\n", + "\n", + " (Same every epoch = overfitting!) 
(Different combinations = better learning!)\n", + "```\n", + "\n", + "### DataLoader as a Systems Component\n", + "\n", + "**Memory Management**: DataLoader only holds one batch in memory at a time, not the entire dataset.\n", + "\n", + "**Iteration Interface**: Provides Python iterator protocol so training loops can use `for batch in dataloader:`.\n", + "\n", + "**Collation Strategy**: Automatically stacks tensors from individual samples into batch tensors.\n", + "\n", + "**Performance Critical**: This is often the bottleneck in training pipelines - loading and preparing data can be slower than the forward pass!\n", + "\n", + "### The DataLoader Algorithm\n", + "\n", + "```\n", + "1. Create indices list: [0, 1, 2, ..., dataset_length-1]\n", + "2. If shuffle=True: randomly shuffle the indices\n", + "3. Group indices into chunks of batch_size\n", + "4. For each chunk:\n", + " a. Retrieve samples: [dataset[i] for i in chunk]\n", + " b. Collate samples: stack individual tensors into batch tensors\n", + " c. Yield the batch tensor tuple\n", + "```\n", + "\n", + "This transforms the dataset from \"access one sample\" to \"iterate through batches\" - exactly what training loops need." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "57372753", + "metadata": { + "nbgrader": { + "grade": false, + "grade_id": "dataloader-implementation", + "solution": true + } + }, + "outputs": [], + "source": [ + "class DataLoader:\n", + " \"\"\"\n", + " Data loader with batching and shuffling support.\n", + "\n", + " Wraps a dataset to provide batched iteration with optional shuffling.\n", + " Essential for efficient training with mini-batch gradient descent.\n", + "\n", + " TODO: Implement DataLoader with batching and shuffling\n", + "\n", + " APPROACH:\n", + " 1. Store dataset, batch_size, and shuffle settings\n", + " 2. Create iterator that groups samples into batches\n", + " 3. Handle shuffling by randomizing indices\n", + " 4. 
Collate individual samples into batch tensors\n", + "\n", + " EXAMPLE:\n", + " >>> dataset = TensorDataset(Tensor([[1,2], [3,4], [5,6]]), Tensor([0,1,0]))\n", + " >>> loader = DataLoader(dataset, batch_size=2, shuffle=True)\n", + " >>> for batch in loader:\n", + " ... features_batch, labels_batch = batch\n", + " ... print(f\"Features: {features_batch.shape}, Labels: {labels_batch.shape}\")\n", + "\n", + " HINTS:\n", + " - Use random.shuffle() for index shuffling\n", + " - Group consecutive samples into batches\n", + " - Stack individual tensors using np.stack()\n", + " \"\"\"\n", + "\n", + " def __init__(self, dataset: Dataset, batch_size: int, shuffle: bool = False):\n", + " \"\"\"\n", + " Create DataLoader for batched iteration.\n", + "\n", + " Args:\n", + " dataset: Dataset to load from\n", + " batch_size: Number of samples per batch\n", + " shuffle: Whether to shuffle data each epoch\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " self.dataset = dataset\n", + " self.batch_size = batch_size\n", + " self.shuffle = shuffle\n", + " ### END SOLUTION\n", + "\n", + " def __len__(self) -> int:\n", + " \"\"\"Return number of batches per epoch.\"\"\"\n", + " ### BEGIN SOLUTION\n", + " # Ceiling division: count all batches, including a partial final batch\n", + " return (len(self.dataset) + self.batch_size - 1) // self.batch_size\n", + " ### END SOLUTION\n", + "\n", + " def __iter__(self) -> Iterator:\n", + " \"\"\"Return iterator over batches.\"\"\"\n", + " ### BEGIN SOLUTION\n", + " # Create list of indices\n", + " indices = list(range(len(self.dataset)))\n", + "\n", + " # Shuffle if requested\n", + " if self.shuffle:\n", + " random.shuffle(indices)\n", + "\n", + " # Yield batches\n", + " for i in range(0, len(indices), self.batch_size):\n", + " batch_indices = indices[i:i + self.batch_size]\n", + " batch = [self.dataset[idx] for idx in batch_indices]\n", + "\n", + " # Collate batch - convert list of tuples to tuple of tensors\n", + " yield self._collate_batch(batch)\n", + " ### END SOLUTION\n", +
"\n", + " def _collate_batch(self, batch: List[Tuple[Tensor, ...]]) -> Tuple[Tensor, ...]:\n", + " \"\"\"\n", + " Collate individual samples into batch tensors.\n", + "\n", + " Args:\n", + " batch: List of sample tuples from dataset\n", + "\n", + " Returns:\n", + " Tuple of batched tensors\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " if len(batch) == 0:\n", + " return ()\n", + "\n", + " # Determine number of tensors per sample\n", + " num_tensors = len(batch[0])\n", + "\n", + " # Group tensors by position\n", + " batched_tensors = []\n", + " for tensor_idx in range(num_tensors):\n", + " # Extract all tensors at this position\n", + " tensor_list = [sample[tensor_idx].data for sample in batch]\n", + "\n", + " # Stack into batch tensor\n", + " batched_data = np.stack(tensor_list, axis=0)\n", + " batched_tensors.append(Tensor(batched_data))\n", + "\n", + " return tuple(batched_tensors)\n", + " ### END SOLUTION" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "22b5ae11", + "metadata": { + "lines_to_next_cell": 2, + "nbgrader": { + "grade": true, + "grade_id": "test-dataloader", + "locked": true, + "points": 20 + } + }, + "outputs": [], + "source": [ + "def test_unit_dataloader():\n", + " \"\"\"🔬 Test DataLoader implementation.\"\"\"\n", + " print(\"🔬 Unit Test: DataLoader...\")\n", + "\n", + " # Create test dataset\n", + " features = Tensor([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]) # 5 samples\n", + " labels = Tensor([0, 1, 0, 1, 0])\n", + " dataset = TensorDataset(features, labels)\n", + "\n", + " # Test basic batching (no shuffle)\n", + " loader = DataLoader(dataset, batch_size=2, shuffle=False)\n", + "\n", + " # Test length calculation\n", + " assert len(loader) == 3, f\"Expected 3 batches, got {len(loader)}\" # ceil(5/2) = 3\n", + "\n", + " batches = list(loader)\n", + " assert len(batches) == 3, f\"Expected 3 batches, got {len(batches)}\"\n", + "\n", + " # Test first batch\n", + " batch_features, batch_labels = batches[0]\n", + " assert 
batch_features.data.shape == (2, 2), f\"Wrong batch features shape: {batch_features.data.shape}\"\n", + " assert batch_labels.data.shape == (2,), f\"Wrong batch labels shape: {batch_labels.data.shape}\"\n", + "\n", + " # Test last batch (should have 1 sample)\n", + " batch_features, batch_labels = batches[2]\n", + " assert batch_features.data.shape == (1, 2), f\"Wrong last batch features shape: {batch_features.data.shape}\"\n", + " assert batch_labels.data.shape == (1,), f\"Wrong last batch labels shape: {batch_labels.data.shape}\"\n", + "\n", + " # Test that data is preserved\n", + " assert np.array_equal(batches[0][0].data[0], [1, 2]), \"First sample should be [1,2]\"\n", + " assert batches[0][1].data[0] == 0, \"First label should be 0\"\n", + "\n", + " # Test shuffling produces different order\n", + " loader_shuffle = DataLoader(dataset, batch_size=5, shuffle=True)\n", + " loader_no_shuffle = DataLoader(dataset, batch_size=5, shuffle=False)\n", + "\n", + " batch_shuffle = list(loader_shuffle)[0]\n", + " batch_no_shuffle = list(loader_no_shuffle)[0]\n", + "\n", + " # Note: This might occasionally fail due to random chance, but very unlikely\n", + " # We'll just test that both contain all the original data\n", + " shuffle_features = set(tuple(row) for row in batch_shuffle[0].data)\n", + " no_shuffle_features = set(tuple(row) for row in batch_no_shuffle[0].data)\n", + " expected_features = {(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)}\n", + "\n", + " assert shuffle_features == expected_features, \"Shuffle should preserve all data\"\n", + " assert no_shuffle_features == expected_features, \"No shuffle should preserve all data\"\n", + "\n", + " print(\"✅ DataLoader works correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_dataloader()" + ] + }, + { + "cell_type": "markdown", + "id": "9ab25fbb", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "## Part 4: Real Datasets - MNIST and CIFAR-10\n", + "\n", + 
"Time to work with real data! We'll implement download functions for two classic computer vision datasets that every ML engineer should know.\n", + "\n", + "### Understanding Standard Datasets\n", + "\n", + "MNIST and CIFAR-10 are the \"hello world\" datasets of computer vision, each teaching different lessons:\n", + "\n", + "```\n", + "MNIST (Handwritten Digits) CIFAR-10 (Tiny Objects)\n", + "┌─────────────────────────────┐ ┌─────────────────────────────┐\n", + "│ Size: 28×28 pixels │ │ Size: 32×32×3 pixels │\n", + "│ Colors: Grayscale (1 chan) │ │ Colors: RGB (3 channels) │\n", + "│ Classes: 10 (digits 0-9) │ │ Classes: 10 (objects) │\n", + "│ Training: 60,000 samples │ │ Training: 50,000 samples │\n", + "│ Testing: 10,000 samples │ │ Testing: 10,000 samples │\n", + "│ │ │ │\n", + "│ ┌─────┐ ┌─────┐ ┌─────┐ │ │ ┌─────┐ ┌─────┐ ┌─────┐ │\n", + "│ │ 5 │ │ 3 │ │ 8 │ │ │ │ ✈️ │ │ 🚗 │ │ 🐸 │ │\n", + "│ └─────┘ └─────┘ └─────┘ │ │ └─────┘ └─────┘ └─────┘ │\n", + "│ (simple shapes) │ │ (complex textures) │\n", + "└─────────────────────────────┘ └─────────────────────────────┘\n", + "```\n", + "\n", + "### Why These Datasets Matter\n", + "\n", + "**MNIST**: Perfect for learning basics - simple, clean, small. Most algorithms achieve >95% accuracy.\n", + "\n", + "**CIFAR-10**: Real-world complexity - color, texture, background clutter. 
Much harder, ~80-90% is good.\n", + "\n", + "**Progression**: MNIST → CIFAR-10 → ImageNet represents increasing complexity in computer vision.\n", + "\n", + "### Dataset Format Patterns\n", + "\n", + "Both datasets follow similar patterns:\n", + "\n", + "```\n", + "Typical Dataset Structure:\n", + "┌─────────────────────────────────────────┐\n", + "│ Training Set │\n", + "│ ├── Images: (N, H, W, C) tensor │\n", + "│ └── Labels: (N,) tensor │\n", + "│ │\n", + "│ Test Set │\n", + "│ ├── Images: (M, H, W, C) tensor │\n", + "│ └── Labels: (M,) tensor │\n", + "└─────────────────────────────────────────┘\n", + "\n", + "Where:\n", + " N = number of training samples\n", + " M = number of test samples\n", + " H, W = height, width\n", + " C = channels (1 for grayscale, 3 for RGB)\n", + "```\n", + "\n", + "### Data Pipeline Integration\n", + "\n", + "Once downloaded, these datasets integrate seamlessly with our pipeline:\n", + "\n", + "```\n", + "Download Function → TensorDataset → DataLoader → Training\n", + " ↓ ↓ ↓ ↓\n", + " Raw tensors Indexed access Batched data Model input\n", + "```\n", + "\n", + "**Note**: For educational purposes, we'll create synthetic datasets with the same structure as MNIST/CIFAR-10. In production, you'd download the actual data from official sources." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c995e812", + "metadata": { + "nbgrader": { + "grade": false, + "grade_id": "download-functions", + "solution": true + } + }, + "outputs": [], + "source": [ + "def download_mnist(data_dir: str = \"./data\") -> Tuple[TensorDataset, TensorDataset]:\n", + " \"\"\"\n", + " Download and prepare MNIST dataset.\n", + "\n", + " Returns train and test datasets with (images, labels) format.\n", + " Images are normalized to [0,1] range.\n", + "\n", + " TODO: Implement MNIST download and preprocessing\n", + "\n", + " APPROACH:\n", + " 1. Create data directory if needed\n", + " 2. Download MNIST files from official source\n", + " 3. 
Parse binary format and extract images/labels\n", + " 4. Normalize images and convert to tensors\n", + " 5. Return TensorDataset objects\n", + "\n", + " EXAMPLE:\n", + " >>> train_ds, test_ds = download_mnist()\n", + " >>> print(f\"Train: {len(train_ds)} samples\")\n", + " >>> print(f\"Test: {len(test_ds)} samples\")\n", + " >>> image, label = train_ds[0]\n", + " >>> print(f\"Image shape: {image.shape}, Label: {label.data}\")\n", + "\n", + " HINTS:\n", + " - MNIST images are 28x28 grayscale, stored as uint8\n", + " - Labels are single integers 0-9\n", + " - Normalize images by dividing by 255.0\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " os.makedirs(data_dir, exist_ok=True)\n", + "\n", + " # MNIST URLs (simplified - using a mock implementation for educational purposes)\n", + " # In production, you'd download from official sources\n", + "\n", + " # Create simple synthetic MNIST-like data for educational purposes\n", + " print(\"📥 Creating synthetic MNIST-like dataset for educational purposes...\")\n", + "\n", + " # Generate synthetic training data (60,000 samples)\n", + " np.random.seed(42) # For reproducibility\n", + " train_images = np.random.rand(60000, 28, 28).astype(np.float32)\n", + " train_labels = np.random.randint(0, 10, 60000).astype(np.int64)\n", + "\n", + " # Generate synthetic test data (10,000 samples)\n", + " test_images = np.random.rand(10000, 28, 28).astype(np.float32)\n", + " test_labels = np.random.randint(0, 10, 10000).astype(np.int64)\n", + "\n", + " # Create TensorDatasets\n", + " train_dataset = TensorDataset(Tensor(train_images), Tensor(train_labels))\n", + " test_dataset = TensorDataset(Tensor(test_images), Tensor(test_labels))\n", + "\n", + " print(f\"✅ MNIST-like dataset ready: {len(train_dataset)} train, {len(test_dataset)} test samples\")\n", + "\n", + " return train_dataset, test_dataset\n", + " ### END SOLUTION\n", + "\n", + "\n", + "def download_cifar10(data_dir: str = \"./data\") -> Tuple[TensorDataset, TensorDataset]:\n", + " 
\"\"\"\n", + " Download and prepare CIFAR-10 dataset.\n", + "\n", + " Returns train and test datasets with (images, labels) format.\n", + " Images are normalized to [0,1] range.\n", + "\n", + " TODO: Implement CIFAR-10 download and preprocessing\n", + "\n", + " APPROACH:\n", + " 1. Create data directory if needed\n", + " 2. Download CIFAR-10 files from official source\n", + " 3. Parse pickle format and extract images/labels\n", + " 4. Normalize images and convert to tensors\n", + " 5. Return TensorDataset objects\n", + "\n", + " EXAMPLE:\n", + " >>> train_ds, test_ds = download_cifar10()\n", + " >>> print(f\"Train: {len(train_ds)} samples\")\n", + " >>> image, label = train_ds[0]\n", + " >>> print(f\"Image shape: {image.shape}, Label: {label.data}\")\n", + "\n", + " HINTS:\n", + " - CIFAR-10 images are 32x32x3 color, stored as uint8\n", + " - Labels are single integers 0-9 (airplane, automobile, etc.)\n", + " - Images come in format (height, width, channels)\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " os.makedirs(data_dir, exist_ok=True)\n", + "\n", + " # Create simple synthetic CIFAR-10-like data for educational purposes\n", + " print(\"📥 Creating synthetic CIFAR-10-like dataset for educational purposes...\")\n", + "\n", + " # Generate synthetic training data (50,000 samples)\n", + " np.random.seed(123) # Different seed than MNIST\n", + " train_images = np.random.rand(50000, 32, 32, 3).astype(np.float32)\n", + " train_labels = np.random.randint(0, 10, 50000).astype(np.int64)\n", + "\n", + " # Generate synthetic test data (10,000 samples)\n", + " test_images = np.random.rand(10000, 32, 32, 3).astype(np.float32)\n", + " test_labels = np.random.randint(0, 10, 10000).astype(np.int64)\n", + "\n", + " # Create TensorDatasets\n", + " train_dataset = TensorDataset(Tensor(train_images), Tensor(train_labels))\n", + " test_dataset = TensorDataset(Tensor(test_images), Tensor(test_labels))\n", + "\n", + " print(f\"✅ CIFAR-10-like dataset ready: {len(train_dataset)} 
train, {len(test_dataset)} test samples\")\n", + "\n", + " return train_dataset, test_dataset\n", + " ### END SOLUTION" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dceb709a", + "metadata": { + "lines_to_next_cell": 2, + "nbgrader": { + "grade": true, + "grade_id": "test-download-functions", + "locked": true, + "points": 15 + } + }, + "outputs": [], + "source": [ + "def test_unit_download_functions():\n", + " \"\"\"🔬 Test dataset download functions.\"\"\"\n", + " print(\"🔬 Unit Test: Download Functions...\")\n", + "\n", + " # Test MNIST download\n", + " train_mnist, test_mnist = download_mnist()\n", + "\n", + " assert len(train_mnist) == 60000, f\"MNIST train should have 60000 samples, got {len(train_mnist)}\"\n", + " assert len(test_mnist) == 10000, f\"MNIST test should have 10000 samples, got {len(test_mnist)}\"\n", + "\n", + " # Test sample format\n", + " image, label = train_mnist[0]\n", + " assert image.data.shape == (28, 28), f\"MNIST image should be (28,28), got {image.data.shape}\"\n", + " assert 0 <= label.data <= 9, f\"MNIST label should be 0-9, got {label.data}\"\n", + " assert 0 <= image.data.max() <= 1, f\"MNIST images should be normalized to [0,1], max is {image.data.max()}\"\n", + "\n", + " # Test CIFAR-10 download\n", + " train_cifar, test_cifar = download_cifar10()\n", + "\n", + " assert len(train_cifar) == 50000, f\"CIFAR-10 train should have 50000 samples, got {len(train_cifar)}\"\n", + " assert len(test_cifar) == 10000, f\"CIFAR-10 test should have 10000 samples, got {len(test_cifar)}\"\n", + "\n", + " # Test sample format\n", + " image, label = train_cifar[0]\n", + " assert image.data.shape == (32, 32, 3), f\"CIFAR-10 image should be (32,32,3), got {image.data.shape}\"\n", + " assert 0 <= label.data <= 9, f\"CIFAR-10 label should be 0-9, got {label.data}\"\n", + " assert 0 <= image.data.max() <= 1, f\"CIFAR-10 images should be normalized, max is {image.data.max()}\"\n", + "\n", + " print(\"✅ Download functions work 
correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_download_functions()" + ] + }, + { + "cell_type": "markdown", + "id": "a7a83b36", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "## Part 5: Systems Analysis - Data Pipeline Performance\n", + "\n", + "Now let's analyze our data pipeline like production ML engineers. Understanding where time and memory go is crucial for building systems that scale.\n", + "\n", + "### The Performance Question: Where Does Time Go?\n", + "\n", + "In a typical training step, time is split between data loading and computation:\n", + "\n", + "```\n", + "Training Step Breakdown:\n", + "┌───────────────────────────────────────────────────────────────┐\n", + "│ Data Loading │ Forward Pass │ Backward Pass │\n", + "│ ████████████ │ ███████ │ ████████ │\n", + "│ 40ms │ 25ms │ 35ms │\n", + "└───────────────────────────────────────────────────────────────┘\n", + " 100ms total per step\n", + "\n", + "Bottleneck Analysis:\n", + "- If data loading > forward+backward: \"Data starved\" (CPU bottleneck)\n", + "- If forward+backward > data loading: \"Compute bound\" (GPU bottleneck)\n", + "- Ideal: Data loading ≈ computation time (balanced pipeline)\n", + "```\n", + "\n", + "### Memory Scaling: The Batch Size Trade-off\n", + "\n", + "Batch size creates a fundamental trade-off in memory vs efficiency:\n", + "\n", + "```\n", + "Batch Size Impact:\n", + "\n", + "Small Batches (batch_size=8):\n", + "┌─────────────────────────────────────────┐\n", + "│ Memory: 8 × 28 × 28 × 4 bytes = 25KB │ ← Low memory\n", + "│ Overhead: High (many small batches) │ ← High overhead\n", + "│ GPU Util: Poor (underutilized) │ ← Poor efficiency\n", + "└─────────────────────────────────────────┘\n", + "\n", + "Large Batches (batch_size=512):\n", + "┌─────────────────────────────────────────┐\n", + "│ Memory: 512 × 28 × 28 × 4 bytes = 1.6MB│ ← Higher memory\n", + "│ Overhead: Low (fewer large batches) │ ← Lower 
overhead\n", + "│ GPU Util: Good (well utilized) │ ← Better efficiency\n", + "└─────────────────────────────────────────┘\n", + "```\n", + "\n", + "### Shuffling Overhead Analysis\n", + "\n", + "Shuffling seems simple, but let's measure its real cost:\n", + "\n", + "```\n", + "Shuffle Operation Breakdown:\n", + "\n", + "1. Index Generation: O(n) - create [0, 1, 2, ..., n-1]\n", + "2. Shuffle Operation: O(n) - randomize the indices\n", + "3. Sample Access: O(1) per sample - dataset[shuffled_idx]\n", + "\n", + "Memory Impact:\n", + "- No Shuffle: 0 extra memory (sequential access)\n", + "- With Shuffle: 8 bytes × dataset_size (store indices)\n", + "\n", + "For 50,000 samples: 8 × 50,000 = 400KB extra memory\n", + "```\n", + "\n", + "The key insight: shuffling overhead is typically negligible compared to the actual data loading and tensor operations.\n", + "\n", + "### Pipeline Bottleneck Identification\n", + "\n", + "We'll measure three critical metrics:\n", + "\n", + "1. **Throughput**: Samples processed per second\n", + "2. **Memory Usage**: Peak memory during batch loading\n", + "3. **Overhead**: Time spent on data vs computation\n", + "\n", + "These measurements will reveal whether our pipeline is CPU-bound (slow data loading) or compute-bound (slow model)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f37f1d06", + "metadata": { + "nbgrader": { + "grade": false, + "grade_id": "systems-analysis", + "solution": true + } + }, + "outputs": [], + "source": [ + "def analyze_dataloader_performance():\n", + " \"\"\"📊 Analyze DataLoader performance characteristics.\"\"\"\n", + " print(\"📊 Analyzing DataLoader Performance...\")\n", + "\n", + " import time\n", + "\n", + " # Create test dataset of varying sizes\n", + " sizes = [1000, 5000, 10000]\n", + " batch_sizes = [16, 64, 256]\n", + "\n", + " print(\"\\n🔍 Batch Size vs Loading Time:\")\n", + "\n", + " for size in sizes:\n", + " # Create synthetic dataset\n", + " features = Tensor(np.random.randn(size, 100)) # 100 features\n", + " labels = Tensor(np.random.randint(0, 10, size))\n", + " dataset = TensorDataset(features, labels)\n", + "\n", + " print(f\"\\nDataset size: {size} samples\")\n", + "\n", + " for batch_size in batch_sizes:\n", + " # Time data loading\n", + " loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)\n", + "\n", + " start_time = time.time()\n", + " batch_count = 0\n", + " for batch in loader:\n", + " batch_count += 1\n", + " end_time = time.time()\n", + "\n", + " elapsed = end_time - start_time\n", + " throughput = size / elapsed if elapsed > 0 else float('inf')\n", + "\n", + " print(f\" Batch size {batch_size:3d}: {elapsed:.3f}s ({throughput:,.0f} samples/sec)\")\n", + "\n", + " # Analyze shuffle overhead\n", + " print(\"\\n🔄 Shuffle Overhead Analysis:\")\n", + "\n", + " dataset_size = 10000\n", + " features = Tensor(np.random.randn(dataset_size, 50))\n", + " labels = Tensor(np.random.randint(0, 5, dataset_size))\n", + " dataset = TensorDataset(features, labels)\n", + "\n", + " batch_size = 64\n", + "\n", + " # No shuffle\n", + " loader_no_shuffle = DataLoader(dataset, batch_size=batch_size, shuffle=False)\n", + " start_time = time.time()\n", + " batches_no_shuffle = list(loader_no_shuffle)\n", + " time_no_shuffle = 
time.time() - start_time\n", + "\n", + " # With shuffle\n", + " loader_shuffle = DataLoader(dataset, batch_size=batch_size, shuffle=True)\n", + " start_time = time.time()\n", + " batches_shuffle = list(loader_shuffle)\n", + " time_shuffle = time.time() - start_time\n", + "\n", + " # Guard against a zero baseline on very fast machines\n", + " shuffle_overhead = ((time_shuffle - time_no_shuffle) / time_no_shuffle) * 100 if time_no_shuffle > 0 else 0.0\n", + "\n", + " print(f\" No shuffle: {time_no_shuffle:.3f}s\")\n", + " print(f\" With shuffle: {time_shuffle:.3f}s\")\n", + " print(f\" Shuffle overhead: {shuffle_overhead:.1f}%\")\n", + "\n", + " print(\"\\n💡 Key Insights:\")\n", + " print(\"• Larger batch sizes reduce per-sample overhead\")\n", + " print(\"• Shuffle adds minimal overhead for reasonable dataset sizes\")\n", + " print(\"• Memory usage scales linearly with batch size\")\n", + " print(\"🚀 Production tip: Balance batch size with GPU memory limits\")\n", + "\n", + "# analyze_dataloader_performance() # Moved to main block\n", + "\n", + "\n", + "def analyze_memory_usage():\n", + " \"\"\"📊 Analyze memory usage patterns in data loading.\"\"\"\n", + " print(\"\\n📊 Analyzing Memory Usage Patterns...\")\n", + "\n", + " # Memory usage estimation\n", + " def estimate_memory_mb(batch_size, feature_size, dtype_bytes=4):\n", + " \"\"\"Estimate memory usage for a batch.\"\"\"\n", + " return (batch_size * feature_size * dtype_bytes) / (1024 * 1024)\n", + "\n", + " print(\"\\n💾 Memory Usage by Batch Configuration:\")\n", + "\n", + " feature_sizes = [784, 3072, 150528] # MNIST, CIFAR-10, ImageNet-like\n", + " feature_names = [\"MNIST (28×28)\", \"CIFAR-10 (32×32×3)\", \"ImageNet (224×224×3)\"]\n", + " batch_sizes = [1, 32, 128, 512]\n", + "\n", + " for feature_size, name in zip(feature_sizes, feature_names):\n", + " print(f\"\\n{name}:\")\n", + " for batch_size in batch_sizes:\n", + " memory_mb = estimate_memory_mb(batch_size, feature_size)\n", + " print(f\" Batch {batch_size:3d}: {memory_mb:6.1f} MB\")\n", + "\n", + " print(\"\\n🎯 Memory Trade-offs:\")\n", + " print(\"• Larger batches: More memory, better GPU utilization\")\n", + " print(\"• Smaller batches: Less memory, noisier gradients\")\n", + " print(\"• Sweet spot: Usually 32-128 depending on model size\")\n", + "\n", + " # Demonstrate actual memory usage with our tensors\n", + " print(\"\\n🔬 Actual Tensor Memory Usage:\")\n", + "\n", + " # Create different sized tensors\n", + " tensor_small = Tensor(np.random.randn(32, 784)) # Small batch\n", + " tensor_large = Tensor(np.random.randn(512, 784)) # Large batch\n", + "\n", + " # Size in bytes (roughly)\n", + " small_bytes = tensor_small.data.nbytes\n", + " large_bytes = tensor_large.data.nbytes\n", + "\n", + " print(f\" Small batch (32×784): {small_bytes / 1024:.1f} KB\")\n", + " print(f\" Large batch (512×784): {large_bytes / 1024:.1f} KB\")\n", + " print(f\" Ratio: {large_bytes / small_bytes:.1f}×\")\n", + "\n", + "# analyze_memory_usage() # Moved to main block" + ] + }, + { + "cell_type": "markdown", + "id": "22d11ead", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "## Part 6: Integration Testing\n", + "\n", + "Let's test how our DataLoader integrates with a complete training workflow, simulating real ML pipeline usage."
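Before the test itself, here is a minimal, NumPy-only sketch of the loop this section exercises. `iterate_batches` below is a hypothetical stand-in for the DataLoader built above (index generation, optional shuffle, slicing into batches), not part of the module's API:

```python
import numpy as np

def iterate_batches(features, labels, batch_size, shuffle=False, seed=0):
    """Yield (features, labels) batches, mirroring the DataLoader pattern."""
    n = len(features)
    indices = np.arange(n)
    if shuffle:
        # Same idea as the DataLoader: randomize indices, not the data itself
        np.random.default_rng(seed).shuffle(indices)
    for start in range(0, n, batch_size):
        idx = indices[start:start + batch_size]
        yield features[idx], labels[idx]

# One simulated epoch: count batches and samples seen
X = np.random.randn(100, 20).astype(np.float32)
y = np.random.randint(0, 5, 100)

seen, batches = 0, 0
for xb, yb in iterate_batches(X, y, batch_size=32, shuffle=True):
    assert xb.shape[0] == yb.shape[0]  # features and labels stay aligned
    seen += xb.shape[0]
    batches += 1

print(batches, seen)  # 4 100  (ceil(100/32) batches, every sample seen once)
```

The last batch holds only 4 samples, which is exactly the partial-batch behavior the integration test checks for.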
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c0d4eeef", + "metadata": { + "nbgrader": { + "grade": false, + "grade_id": "integration-test", + "solution": true + } + }, + "outputs": [], + "source": [ + "def test_training_integration():\n", + " \"\"\"🔬 Test DataLoader integration with training workflow.\"\"\"\n", + " print(\"🔬 Integration Test: Training Workflow...\")\n", + "\n", + " # Create a realistic dataset\n", + " num_samples = 1000\n", + " num_features = 20\n", + " num_classes = 5\n", + "\n", + " # Synthetic classification data\n", + " features = Tensor(np.random.randn(num_samples, num_features))\n", + " labels = Tensor(np.random.randint(0, num_classes, num_samples))\n", + "\n", + " dataset = TensorDataset(features, labels)\n", + "\n", + " # Create train/val splits\n", + " train_size = int(0.8 * len(dataset))\n", + " val_size = len(dataset) - train_size\n", + "\n", + " # Manual split (in production, you'd use proper splitting utilities)\n", + " train_indices = list(range(train_size))\n", + " val_indices = list(range(train_size, len(dataset)))\n", + "\n", + " # Create subset datasets\n", + " train_samples = [dataset[i] for i in train_indices]\n", + " val_samples = [dataset[i] for i in val_indices]\n", + "\n", + " # Convert back to tensors for TensorDataset\n", + " train_features = Tensor(np.stack([sample[0].data for sample in train_samples]))\n", + " train_labels = Tensor(np.stack([sample[1].data for sample in train_samples]))\n", + " val_features = Tensor(np.stack([sample[0].data for sample in val_samples]))\n", + " val_labels = Tensor(np.stack([sample[1].data for sample in val_samples]))\n", + "\n", + " train_dataset = TensorDataset(train_features, train_labels)\n", + " val_dataset = TensorDataset(val_features, val_labels)\n", + "\n", + " # Create DataLoaders\n", + " batch_size = 32\n", + " train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)\n", + " val_loader = DataLoader(val_dataset, 
batch_size=batch_size, shuffle=False)\n", + "\n", + " print(f\"📊 Dataset splits:\")\n", + " print(f\" Training: {len(train_dataset)} samples, {len(train_loader)} batches\")\n", + " print(f\" Validation: {len(val_dataset)} samples, {len(val_loader)} batches\")\n", + "\n", + " # Simulate training loop\n", + " print(\"\\n🏃 Simulated Training Loop:\")\n", + "\n", + " epoch_samples = 0\n", + " batch_count = 0\n", + "\n", + " for batch_idx, (batch_features, batch_labels) in enumerate(train_loader):\n", + " batch_count += 1\n", + " epoch_samples += len(batch_features.data)\n", + "\n", + " # Simulate forward pass (just check shapes)\n", + " assert batch_features.data.shape[0] <= batch_size, \"Batch size exceeded\"\n", + " assert batch_features.data.shape[1] == num_features, \"Wrong feature count\"\n", + " assert len(batch_labels.data) == len(batch_features.data), \"Mismatched batch sizes\"\n", + "\n", + " if batch_idx < 3: # Show first few batches\n", + " print(f\" Batch {batch_idx + 1}: {batch_features.data.shape[0]} samples\")\n", + "\n", + " print(f\" Total: {batch_count} batches, {epoch_samples} samples processed\")\n", + "\n", + " # Validate that all samples were seen\n", + " assert epoch_samples == len(train_dataset), f\"Expected {len(train_dataset)}, processed {epoch_samples}\"\n", + "\n", + " print(\"✅ Training integration works correctly!\")\n", + "\n", + "# test_training_integration() # Moved to main block" + ] + }, + { + "cell_type": "markdown", + "id": "0891e60a", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "## 🧪 Module Integration Test\n", + "\n", + "Final validation that everything works together correctly." 
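The manual 80/20 train/validation split used in the integration test reduces to a few NumPy slices; a hedged sketch (array names are illustrative, not part of the module):

```python
import numpy as np

# Illustrative data: 50 samples, 4 features, 3 classes
X = np.random.randn(50, 4).astype(np.float32)
y = np.random.randint(0, 3, 50)

train_size = int(0.8 * len(X))  # 40 samples for training
X_train, X_val = X[:train_size], X[train_size:]
y_train, y_val = y[:train_size], y[train_size:]

print(X_train.shape, X_val.shape)  # (40, 4) (10, 4)
```

In production you would shuffle indices before slicing (or use a proper splitting utility) so the validation set is not just the tail of the dataset.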
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "47fd767d", + "metadata": { + "lines_to_next_cell": 1 + }, + "outputs": [], + "source": [ + "def test_module():\n", + " \"\"\"\n", + " Comprehensive test of entire module functionality.\n", + "\n", + " This final test runs before module summary to ensure:\n", + " - All unit tests pass\n", + " - Functions work together correctly\n", + " - Module is ready for integration with TinyTorch\n", + " \"\"\"\n", + " print(\"🧪 RUNNING MODULE INTEGRATION TEST\")\n", + " print(\"=\" * 50)\n", + "\n", + " # Run all unit tests\n", + " print(\"Running unit tests...\")\n", + " test_unit_dataset()\n", + " test_unit_tensordataset()\n", + " test_unit_dataloader()\n", + " test_unit_download_functions()\n", + "\n", + " print(\"\\nRunning integration scenarios...\")\n", + "\n", + " # Test complete workflow\n", + " test_training_integration()\n", + "\n", + " # Test realistic dataset usage\n", + " print(\"🔬 Integration Test: Realistic Dataset Usage...\")\n", + "\n", + " # Download datasets\n", + " train_mnist, test_mnist = download_mnist()\n", + "\n", + " # Create DataLoaders\n", + " train_loader = DataLoader(train_mnist, batch_size=64, shuffle=True)\n", + " test_loader = DataLoader(test_mnist, batch_size=64, shuffle=False)\n", + "\n", + " # Test iteration\n", + " train_batch = next(iter(train_loader))\n", + " test_batch = next(iter(test_loader))\n", + "\n", + " assert len(train_batch) == 2, \"Batch should contain (images, labels)\"\n", + " assert train_batch[0].data.shape[0] == 64, f\"Wrong batch size: {train_batch[0].data.shape[0]}\"\n", + " assert train_batch[0].data.shape[1:] == (28, 28), f\"Wrong image shape: {train_batch[0].data.shape[1:]}\"\n", + "\n", + " print(\"✅ Realistic dataset usage works!\")\n", + "\n", + " print(\"\\n\" + \"=\" * 50)\n", + " print(\"🎉 ALL TESTS PASSED! 
Module ready for export.\")\n", + " print(\"Run: tito module complete 08\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "99ae3e8b", + "metadata": {}, + "outputs": [], + "source": [ + "# Run comprehensive module test\n", + "if __name__ == \"__main__\":\n", + " test_module()\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "dda0430c", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 🎯 MODULE SUMMARY: DataLoader\n", + "\n", + "Congratulations! You've built a complete data loading pipeline for ML training!\n", + "\n", + "### Key Accomplishments\n", + "- Built Dataset abstraction and TensorDataset implementation with proper tensor alignment\n", + "- Created DataLoader with batching, shuffling, and memory-efficient iteration\n", + "- Added MNIST and CIFAR-10 download functions for computer vision workflows\n", + "- Analyzed data pipeline performance and discovered memory/speed trade-offs\n", + "- All tests pass ✅ (validated by `test_module()`)\n", + "\n", + "### Systems Insights Discovered\n", + "- **Batch size directly impacts memory usage and training throughput**\n", + "- **Shuffling adds minimal overhead but prevents overfitting patterns**\n", + "- **Data loading can become a bottleneck without proper optimization**\n", + "- **Memory usage scales linearly with batch size and feature dimensions**\n", + "\n", + "### Ready for Next Steps\n", + "Your DataLoader implementation enables efficient training of CNNs and larger models with proper data pipeline management.\n", + "Export with: `tito module complete 08`\n", + "\n", + "**Next**: Module 09 (Spatial) will add Conv2d layers that leverage your efficient data loading for image processing!\n", + "\n", + "### Real-World Connection\n", + "You've implemented the same patterns used in:\n", + "- **PyTorch's DataLoader**: Same interface design for batching and shuffling\n", + "- **TensorFlow's Dataset API**: Similar abstraction for data pipeline optimization\n", + 
"- **Production ML**: Essential for handling large-scale training efficiently\n", + "- **Research**: Standard foundation for all deep learning experiments\n", + "\n", + "Your data loading pipeline is now ready to power the CNN training in Module 09!" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/modules/08_dataloader/dataloader_dev.py b/modules/source/08_dataloader/dataloader_dev.py similarity index 99% rename from modules/08_dataloader/dataloader_dev.py rename to modules/source/08_dataloader/dataloader_dev.py index 1d14c2ce..6b39a6e7 100644 --- a/modules/08_dataloader/dataloader_dev.py +++ b/modules/source/08_dataloader/dataloader_dev.py @@ -13,6 +13,7 @@ # --- #| default_exp data.loader +#| export # %% [markdown] """ diff --git a/modules/source/09_spatial/spatial_dev.ipynb b/modules/source/09_spatial/spatial_dev.ipynb new file mode 100644 index 00000000..33b3f467 --- /dev/null +++ b/modules/source/09_spatial/spatial_dev.ipynb @@ -0,0 +1,1965 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "06a23a42", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "# Module 09: Spatial - Processing Images with Convolutions\n", + "\n", + "Welcome to Module 09! 
You'll implement spatial operations that transform machine learning from working with simple vectors to understanding images and spatial patterns.\n", + "\n", + "## 🔗 Prerequisites & Progress\n", + "**You've Built**: Complete training pipeline with MLPs, optimizers, and data loaders\n", + "**You'll Build**: Spatial operations - Conv2d, MaxPool2d, AvgPool2d for image processing\n", + "**You'll Enable**: Convolutional Neural Networks (CNNs) for computer vision\n", + "\n", + "**Connection Map**:\n", + "```\n", + "Training Pipeline → Spatial Operations → CNN (Milestone 03)\n", + " (MLPs) (Conv/Pool) (Computer Vision)\n", + "```\n", + "\n", + "## Learning Objectives\n", + "By the end of this module, you will:\n", + "1. Implement Conv2d with explicit loops to understand O(N²M²K²) complexity\n", + "2. Build pooling operations (Max and Average) for spatial reduction\n", + "3. Understand receptive fields and spatial feature extraction\n", + "4. Analyze memory vs computation trade-offs in spatial operations\n", + "\n", + "Let's get started!\n", + "\n", + "## 📦 Where This Code Lives in the Final Package\n", + "\n", + "**Learning Side:** You work in modules/09_spatial/spatial_dev.py\n", + "**Building Side:** Code exports to tinytorch.core.spatial\n", + "\n", + "```python\n", + "# Final package structure:\n", + "from tinytorch.core.spatial import Conv2d, MaxPool2d, AvgPool2d # This module\n", + "from tinytorch.core.tensor import Tensor # Foundation (Module 01)\n", + "from tinytorch.core.layers import Module # Base class (Module 03)\n", + "```\n", + "\n", + "**Why this matters:**\n", + "- **Learning:** Complete spatial processing system in one focused module for deep understanding\n", + "- **Production:** Proper organization like PyTorch's torch.nn.Conv2d with all spatial operations together\n", + "- **Consistency:** All convolution and pooling operations in core.spatial\n", + "- **Integration:** Works seamlessly with existing layers for complete CNN architectures" + ] + }, + { 
+ "cell_type": "code", + "execution_count": null, + "id": "c2be8278", + "metadata": { + "nbgrader": { + "grade": false, + "grade_id": "spatial-setup", + "solution": true + } + }, + "outputs": [], + "source": [ + "\n", + "#| default_exp core.spatial\n", + "\n", + "#| export\n", + "import numpy as np\n", + "import sys\n", + "import os\n", + "import time\n", + "\n", + "# Smart import system for development and production compatibility\n", + "if 'tinytorch' in sys.modules:\n", + " # Production: Import from installed package\n", + " from tinytorch.core.tensor import Tensor\n", + " from tinytorch.core.layers import Module\n", + "else:\n", + " # Development: Use simplified local implementations to avoid import loops\n", + "\n", + " # Simplified Tensor class for development\n", + " class Tensor:\n", + " \"\"\"Simplified tensor for spatial operations development.\"\"\"\n", + "\n", + " def __init__(self, data, requires_grad=False):\n", + " self.data = np.array(data, dtype=np.float32)\n", + " self.shape = self.data.shape\n", + " self.requires_grad = requires_grad\n", + " self.grad = None\n", + "\n", + " def __repr__(self):\n", + " return f\"Tensor(shape={self.shape}, data=\\n{self.data})\"\n", + "\n", + " def __add__(self, other):\n", + " if isinstance(other, Tensor):\n", + " return Tensor(self.data + other.data)\n", + " return Tensor(self.data + other)\n", + "\n", + " def __mul__(self, other):\n", + " if isinstance(other, Tensor):\n", + " return Tensor(self.data * other.data)\n", + " return Tensor(self.data * other)\n", + "\n", + " def sum(self):\n", + " return Tensor(np.sum(self.data))\n", + "\n", + " def mean(self):\n", + " return Tensor(np.mean(self.data))\n", + "\n", + " # Create a simple Module base class for inheritance\n", + " class Module:\n", + " \"\"\"Simple base class for neural network modules.\"\"\"\n", + " def __init__(self):\n", + " pass\n", + "\n", + " def forward(self, x):\n", + " raise NotImplementedError(\"Subclasses must implement forward()\")\n", + "\n", 
+ " def parameters(self):\n", + " \"\"\"Return list of parameters for this module.\"\"\"\n", + " params = []\n", + " for attr_name in dir(self):\n", + " attr = getattr(self, attr_name)\n", + " if hasattr(attr, 'data') and hasattr(attr, 'requires_grad'):\n", + " params.append(attr)\n", + " return params" + ] + }, + { + "cell_type": "markdown", + "id": "87ead40b", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 1. Introduction - What are Spatial Operations?\n", + "\n", + "Spatial operations transform machine learning from working with simple vectors to understanding images and spatial patterns. When you look at a photo, your brain naturally processes spatial relationships - edges, textures, objects. Spatial operations give neural networks this same capability.\n", + "\n", + "### The Two Core Spatial Operations\n", + "\n", + "**Convolution**: Detects local patterns by sliding filters across the input\n", + "**Pooling**: Reduces spatial dimensions while preserving important features\n", + "\n", + "### Visual Example: How Convolution Works\n", + "\n", + "```\n", + "Input Image (5×5): Kernel (3×3): Output (3×3):\n", + "┌─────────────────┐ ┌─────────┐ ┌─────────┐\n", + "│ 1 2 3 4 5 │ │ 1 0 -1 │ │ ? ? ? │\n", + "│ 6 7 8 9 0 │ * │ 1 0 -1 │ = │ ? ? ? │\n", + "│ 1 2 3 4 5 │ │ 1 0 -1 │ │ ? ? ? 
│\n", + "│ 6 7 8 9 0 │ └─────────┘ └─────────┘\n", + "│ 1 2 3 4 5 │\n", + "└─────────────────┘\n", + "\n", + "Sliding Window Process:\n", + "Position (0,0): [1,2,3] Position (0,1): [2,3,4] Position (0,2): [3,4,5]\n", + " [6,7,8] * [7,8,9] * [8,9,0] *\n", + " [1,2,3] [2,3,4] [3,4,5]\n", + " = Output[0,0] = Output[0,1] = Output[0,2]\n", + "```\n", + "\n", + "Each output pixel summarizes a local neighborhood, allowing the network to detect patterns like edges, corners, and textures.\n", + "\n", + "### Why Spatial Operations Transform ML\n", + "\n", + "```\n", + "Without Convolution: With Convolution:\n", + "32×32×3 image = 3,072 inputs 32×32×3 → Conv → 32×32×16\n", + "↓ ↓ ↓\n", + "Dense(3072 → 1000) = 3M parameters Shared 3×3 kernels = 432 parameters\n", + "↓ ↓ ↓\n", + "Memory explosion + no spatial awareness Efficient + preserves spatial structure\n", + "```\n", + "\n", + "Convolution achieves dramatic parameter reduction (roughly 7,000× fewer: 3,072,000 vs 432!) while preserving the spatial relationships that matter for visual understanding." + ] + }, + { + "cell_type": "markdown", + "id": "3e09c3c6", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 2. 
Mathematical Foundations\n", + "\n", + "### Understanding Convolution Step by Step\n", + "\n", + "Convolution sounds complex, but it's just \"sliding window multiplication and summation.\" Let's see exactly how it works on a 3×3 input with a 2×2 kernel:\n", + "\n", + "```\n", + "Step 1: Position the kernel over input\n", + "Input: Kernel:\n", + "┌───────┐ ┌─────┐\n", + "│ 1 2 3 │ │ 1 0 │ ← Place kernel at position (0,0)\n", + "│ 5 6 7 │ × │ 0 1 │\n", + "│ 9 0 1 │ └─────┘\n", + "└───────┘\n", + "\n", + "Step 2: Multiply corresponding elements\n", + "Overlap: Computation:\n", + "┌─────┐ 1×1 + 2×0 + 5×0 + 6×1 = 1 + 0 + 0 + 6 = 7\n", + "│ 1 2 │\n", + "│ 5 6 │\n", + "└─────┘\n", + "\n", + "Step 3: Slide kernel and repeat\n", + "Position (0,1): Position (1,0): Position (1,1):\n", + "┌─────┐ ┌─────┐ ┌─────┐\n", + "│ 2 3 │ │ 5 6 │ │ 6 7 │\n", + "│ 6 7 │ │ 9 0 │ │ 0 1 │\n", + "└─────┘ └─────┘ └─────┘\n", + "Result: 9 Result: 5 Result: 7\n", + "\n", + "Final Output: ┌─────┐\n", + " │ 7 9 │\n", + " │ 5 7 │\n", + " └─────┘\n", + "```\n", + "\n", + "### The Mathematical Formula\n", + "\n", + "For 2D convolution, we slide kernel K across input I:\n", + "```\n", + "O[i,j] = Σ Σ I[i+m, j+n] × K[m,n]\n", + " m n\n", + "```\n", + "\n", + "This formula captures the \"multiply and sum\" operation for each kernel position.\n", + "\n", + "### Pooling: Spatial Summarization\n", + "\n", + "```\n", + "Max Pooling Example (2×2 window):\n", + "Input: Output:\n", + "┌───────────┐ ┌─────┐\n", + "│ 1 3 2 4 │ │ 6 8 │ ← max([1,3,5,6])=6, max([2,4,7,8])=8\n", + "│ 5 6 7 8 │ → │ 9 9 │ ← max([2,9,0,1])=9, max([1,3,9,3])=9\n", + "│ 2 9 1 3 │ └─────┘\n", + "│ 0 1 9 3 │\n", + "└───────────┘\n", + "\n", + "Average Pooling (same window):\n", + "┌─────────┐ ← avg([1,3,5,6])=3.75, avg([2,4,7,8])=5.25\n", + "│3.75 5.25│\n", + "│3.00 4.00│ ← avg([2,9,0,1])=3.00, avg([1,3,9,3])=4.00\n", + "└─────────┘\n", + "```\n", + "\n", + "### Why This Complexity Matters\n", + "\n", + "For convolution with input (1, 3, 224, 224) and kernel (64, 3, 3, 
3):\n", + "- **Operations**: 1 × 64 × 3 × 3 × 3 × 224 × 224 = 86.7 million multiply-adds\n", + "- **Memory**: Input (600KB) + Weights (6.9KB) + Output (12.8MB) = ~13.4MB\n", + "\n", + "This is why kernel size matters enormously - a 7×7 kernel would require 5.4× more computation!\n", + "\n", + "### Key Properties That Enable Deep Learning\n", + "\n", + "**Translation Equivariance**: Move the cat → detection moves the same way\n", + "**Parameter Sharing**: Same edge detector works everywhere in the image\n", + "**Local Connectivity**: Each output only looks at nearby inputs (like human vision)\n", + "**Hierarchical Features**: Early layers detect edges → later layers detect objects" + ] + }, + { + "cell_type": "markdown", + "id": "10b0f641", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 3. Implementation - Building Spatial Operations\n", + "\n", + "Now we'll implement convolution step by step, using explicit loops so you can see and feel the computational complexity. This helps you understand why modern optimizations matter!\n", + "\n", + "### Conv2d: Detecting Patterns with Sliding Windows\n", + "\n", + "Convolution slides a small filter (kernel) across the entire input, computing weighted sums at each position. 
Think of it like using a template to find matching patterns everywhere in an image.\n", + "\n", + "```\n", + "Convolution Visualization:\n", + "Input (4×4): Kernel (3×3): Output (2×2):\n", + "┌─────────────┐ ┌─────────┐ ┌─────────┐\n", + "│ a b c d │ │ k1 k2 k3│ │ o1 o2 │\n", + "│ e f g h │ × │ k4 k5 k6│ = │ o3 o4 │\n", + "│ i j k l │ │ k7 k8 k9│ └─────────┘\n", + "│ m n o p │ └─────────┘\n", + "└─────────────┘\n", + "\n", + "Computation Details:\n", + "o1 = a×k1 + b×k2 + c×k3 + e×k4 + f×k5 + g×k6 + i×k7 + j×k8 + k×k9\n", + "o2 = b×k1 + c×k2 + d×k3 + f×k4 + g×k5 + h×k6 + j×k7 + k×k8 + l×k9\n", + "o3 = e×k1 + f×k2 + g×k3 + i×k4 + j×k5 + k×k6 + m×k7 + n×k8 + o×k9\n", + "o4 = f×k1 + g×k2 + h×k3 + j×k4 + k×k5 + l×k6 + n×k7 + o×k8 + p×k9\n", + "```\n", + "\n", + "### The Seven Nested Loops of Convolution\n", + "\n", + "Our implementation will use explicit loops to show exactly where the computational cost comes from:\n", + "\n", + "```\n", + "for batch in range(B): # Loop 1: Process each sample\n", + " for out_ch in range(C_out): # Loop 2: Generate each output channel\n", + " for out_h in range(H_out): # Loop 3: Each output row\n", + " for out_w in range(W_out): # Loop 4: Each output column\n", + " for k_h in range(K_h): # Loop 5: Each kernel row\n", + " for k_w in range(K_w): # Loop 6: Each kernel column\n", + " for in_ch in range(C_in): # Loop 7: Each input channel\n", + " # The actual multiply-accumulate operation\n", + " result += input[...] * kernel[...]\n", + "```\n", + "\n", + "Total operations: B × C_out × H_out × W_out × K_h × K_w × C_in\n", + "\n", + "For typical values (B=32, C_out=64, H_out=224, W_out=224, K_h=3, K_w=3, C_in=3):\n", + "That's 32 × 64 × 224 × 224 × 3 × 3 × 3 = **2.8 billion operations** per forward pass!"
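As a sanity check on the arithmetic above, here is a short sketch (plain Python, nothing from this module assumed) that recomputes the multiply-accumulate counts for both the single-image and batched cases:

```python
def conv2d_macs(batch, c_out, h_out, w_out, k_h, k_w, c_in):
    """Multiply-accumulate count for naive convolution: product of all loop bounds."""
    return batch * c_out * h_out * w_out * k_h * k_w * c_in

# One 224x224 image, 64 filters of shape 3x3x3, "same"-size output
single = conv2d_macs(1, 64, 224, 224, 3, 3, 3)
# The same layer on a batch of 32
batched = conv2d_macs(32, 64, 224, 224, 3, 3, 3)

print(f"{single:,}")   # 86,704,128  (~86.7 million)
print(f"{batched:,}")  # 2,774,532,096  (~2.8 billion)
```

Every loop bound multiplies into the total, which is why shrinking any single factor (kernel size, output resolution, channel count) pays off so directly.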
+ ] + }, + { + "cell_type": "markdown", + "id": "72156c45", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### Conv2d Implementation - Building the Core of Computer Vision\n", + "\n", + "Conv2d is the workhorse of computer vision. It slides learned filters across images to detect patterns like edges, textures, and eventually complex objects.\n", + "\n", + "#### How Conv2d Transforms Machine Learning\n", + "\n", + "```\n", + "Before Conv2d (Dense Only): After Conv2d (Spatial Aware):\n", + "Input: 32×32×3 = 3,072 values Input: 32×32×3 structured as image\n", + " ↓ ↓\n", + "Dense(3072→1000) = 3M params Conv2d(3→16, 3×3) = 448 params\n", + " ↓ ↓\n", + "No spatial awareness Preserves spatial relationships\n", + "Massive parameter count Parameter sharing across space\n", + "```\n", + "\n", + "#### Weight Initialization: He Initialization for ReLU Networks\n", + "\n", + "Our Conv2d uses He initialization, specifically designed for ReLU activations:\n", + "- **Problem**: Wrong initialization → vanishing/exploding gradients\n", + "- **Solution**: std = sqrt(2 / fan_in) where fan_in = channels × kernel_height × kernel_width\n", + "- **Why it works**: Maintains variance through ReLU nonlinearity\n", + "\n", + "#### The 7-Loop Implementation Strategy\n", + "\n", + "We'll implement convolution with explicit loops to show the true computational cost:\n", + "\n", + "```\n", + "Nested Loop Structure:\n", + "for batch: ← Process each sample in parallel (in practice)\n", + " for out_channel: ← Generate each output feature map\n", + " for out_h: ← Each row of output\n", + " for out_w: ← Each column of output\n", + " for k_h: ← Each row of kernel\n", + " for k_w: ← Each column of kernel\n", + " for in_ch: ← Accumulate across input channels\n", + " result += input[...] * weight[...]\n", + "```\n", + "\n", + "This reveals why convolution is expensive: O(B×C_out×H×W×K_h×K_w×C_in) operations!"
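The He-initialization rule quoted above (std = sqrt(2 / fan_in)) is easy to verify empirically. This sketch (NumPy only, independent of the Conv2d class below) draws a weight tensor the same way and compares the sample standard deviation to the target; the seed and shapes are illustrative:

```python
import numpy as np

in_channels, kernel_h, kernel_w = 3, 3, 3
fan_in = in_channels * kernel_h * kernel_w      # 27
std = np.sqrt(2.0 / fan_in)                     # ~0.2722

# Draw a (64, 3, 3, 3) weight tensor, as a Conv2d(3 -> 64, 3x3) layer would
rng = np.random.default_rng(0)
weight = rng.normal(0.0, std, size=(64, in_channels, kernel_h, kernel_w))

# With 1,728 draws the sample std should land very close to the target
print(f"target std={std:.4f}, sample std={weight.std():.4f}")
```

If you repeat this with the naive `std = 1.0` default, activations grow layer over layer under ReLU; the sqrt(2/fan_in) scaling is exactly what keeps the variance stable.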
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b2903d44", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "conv2d-class", + "solution": true + } + }, + "outputs": [], + "source": [ + "\n", + "#| export\n", + "class Conv2d(Module):\n", + " \"\"\"\n", + " 2D Convolution layer for spatial feature extraction.\n", + "\n", + " Implements convolution with explicit loops to demonstrate\n", + " computational complexity and memory access patterns.\n", + "\n", + " Args:\n", + " in_channels: Number of input channels\n", + " out_channels: Number of output feature maps\n", + " kernel_size: Size of convolution kernel (int or tuple)\n", + " stride: Stride of convolution (default: 1)\n", + " padding: Zero-padding added to input (default: 0)\n", + " bias: Whether to add learnable bias (default: True)\n", + " \"\"\"\n", + "\n", + " def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0, bias=True):\n", + " \"\"\"\n", + " Initialize Conv2d layer with proper weight initialization.\n", + "\n", + " TODO: Complete Conv2d initialization\n", + "\n", + " APPROACH:\n", + " 1. Store hyperparameters (channels, kernel_size, stride, padding)\n", + " 2. Initialize weights using He initialization for ReLU compatibility\n", + " 3. Initialize bias (if enabled) to zeros\n", + " 4. 
Use proper shapes: weight (out_channels, in_channels, kernel_h, kernel_w)\n", + "\n", + " WEIGHT INITIALIZATION:\n", + " - He init: std = sqrt(2 / (in_channels * kernel_h * kernel_w))\n", + " - This prevents vanishing/exploding gradients with ReLU\n", + "\n", + " HINT: Convert kernel_size to tuple if it's an integer\n", + " \"\"\"\n", + " super().__init__()\n", + "\n", + " ### BEGIN SOLUTION\n", + " self.in_channels = in_channels\n", + " self.out_channels = out_channels\n", + "\n", + " # Handle kernel_size as int or tuple\n", + " if isinstance(kernel_size, int):\n", + " self.kernel_size = (kernel_size, kernel_size)\n", + " else:\n", + " self.kernel_size = kernel_size\n", + "\n", + " self.stride = stride\n", + " self.padding = padding\n", + "\n", + " # He initialization for ReLU networks\n", + " kernel_h, kernel_w = self.kernel_size\n", + " fan_in = in_channels * kernel_h * kernel_w\n", + " std = np.sqrt(2.0 / fan_in)\n", + "\n", + " # Weight shape: (out_channels, in_channels, kernel_h, kernel_w)\n", + " self.weight = Tensor(np.random.normal(0, std,\n", + " (out_channels, in_channels, kernel_h, kernel_w)))\n", + "\n", + " # Bias initialization\n", + " if bias:\n", + " self.bias = Tensor(np.zeros(out_channels))\n", + " else:\n", + " self.bias = None\n", + " ### END SOLUTION\n", + "\n", + " def forward(self, x):\n", + " \"\"\"\n", + " Forward pass through Conv2d layer.\n", + "\n", + " TODO: Implement convolution with explicit loops\n", + "\n", + " APPROACH:\n", + " 1. Extract input dimensions and validate\n", + " 2. Calculate output dimensions\n", + " 3. Apply padding if needed\n", + " 4. Implement 7 nested loops for full convolution\n", + " 5. 
Add bias if present\n", + "\n", + " LOOP STRUCTURE:\n", + " for batch in range(batch_size):\n", + " for out_ch in range(out_channels):\n", + " for out_h in range(out_height):\n", + " for out_w in range(out_width):\n", + " for k_h in range(kernel_height):\n", + " for k_w in range(kernel_width):\n", + " for in_ch in range(in_channels):\n", + " # Accumulate: out += input * weight\n", + "\n", + " EXAMPLE:\n", + " >>> conv = Conv2d(3, 16, kernel_size=3, padding=1)\n", + " >>> x = Tensor(np.random.randn(2, 3, 32, 32)) # batch=2, RGB, 32x32\n", + " >>> out = conv(x)\n", + " >>> print(out.shape) # Should be (2, 16, 32, 32)\n", + "\n", + " HINTS:\n", + " - Handle padding by creating padded input array\n", + " - Watch array bounds in inner loops\n", + " - Accumulate products for each output position\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # Input validation and shape extraction\n", + " if len(x.shape) != 4:\n", + " raise ValueError(f\"Expected 4D input (batch, channels, height, width), got {x.shape}\")\n", + "\n", + " batch_size, in_channels, in_height, in_width = x.shape\n", + " out_channels = self.out_channels\n", + " kernel_h, kernel_w = self.kernel_size\n", + "\n", + " # Calculate output dimensions\n", + " out_height = (in_height + 2 * self.padding - kernel_h) // self.stride + 1\n", + " out_width = (in_width + 2 * self.padding - kernel_w) // self.stride + 1\n", + "\n", + " # Apply padding if needed\n", + " if self.padding > 0:\n", + " padded_input = np.pad(x.data,\n", + " ((0, 0), (0, 0), (self.padding, self.padding), (self.padding, self.padding)),\n", + " mode='constant', constant_values=0)\n", + " else:\n", + " padded_input = x.data\n", + "\n", + " # Initialize output\n", + " output = np.zeros((batch_size, out_channels, out_height, out_width))\n", + "\n", + " # Explicit 7-nested-loop convolution to show complexity\n", + " for b in range(batch_size):\n", + " for out_ch in range(out_channels):\n", + " for out_h in range(out_height):\n", + " for out_w in 
range(out_width):\n", + " # Calculate input region for this output position\n", + " in_h_start = out_h * self.stride\n", + " in_w_start = out_w * self.stride\n", + "\n", + " # Accumulate convolution result\n", + " conv_sum = 0.0\n", + " for k_h in range(kernel_h):\n", + " for k_w in range(kernel_w):\n", + " for in_ch in range(in_channels):\n", + " # Get input and weight values\n", + " input_val = padded_input[b, in_ch,\n", + " in_h_start + k_h,\n", + " in_w_start + k_w]\n", + " weight_val = self.weight.data[out_ch, in_ch, k_h, k_w]\n", + "\n", + " # Accumulate\n", + " conv_sum += input_val * weight_val\n", + "\n", + " # Store result\n", + " output[b, out_ch, out_h, out_w] = conv_sum\n", + "\n", + " # Add bias if present\n", + " if self.bias is not None:\n", + " # Broadcast bias across spatial dimensions\n", + " for out_ch in range(out_channels):\n", + " output[:, out_ch, :, :] += self.bias.data[out_ch]\n", + "\n", + " return Tensor(output)\n", + " ### END SOLUTION\n", + "\n", + " def parameters(self):\n", + " \"\"\"Return trainable parameters.\"\"\"\n", + " params = [self.weight]\n", + " if self.bias is not None:\n", + " params.append(self.bias)\n", + " return params\n", + "\n", + " def __call__(self, x):\n", + " \"\"\"Enable model(x) syntax.\"\"\"\n", + " return self.forward(x)" + ] + }, + { + "cell_type": "markdown", + "id": "43093579", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### 🧪 Unit Test: Conv2d Implementation\n", + "This test validates our convolution implementation with different configurations.\n", + "**What we're testing**: Shape preservation, padding, stride effects\n", + "**Why it matters**: Convolution is the foundation of computer vision\n", + "**Expected**: Correct output shapes and reasonable value ranges" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d0a725b1", + "metadata": { + "nbgrader": { + "grade": true, + "grade_id": "test-conv2d", + "locked": true, + "points": 15 + 
} + }, + "outputs": [], + "source": [ + "\n", + "def test_unit_conv2d():\n", + " \"\"\"🔬 Test Conv2d implementation with multiple configurations.\"\"\"\n", + " print(\"🔬 Unit Test: Conv2d...\")\n", + "\n", + " # Test 1: Basic convolution without padding\n", + " print(\" Testing basic convolution...\")\n", + " conv1 = Conv2d(in_channels=3, out_channels=16, kernel_size=3)\n", + " x1 = Tensor(np.random.randn(2, 3, 32, 32))\n", + " out1 = conv1(x1)\n", + "\n", + " expected_h = (32 - 3) + 1 # 30\n", + " expected_w = (32 - 3) + 1 # 30\n", + " assert out1.shape == (2, 16, expected_h, expected_w), f\"Expected (2, 16, 30, 30), got {out1.shape}\"\n", + "\n", + " # Test 2: Convolution with padding (same size)\n", + " print(\" Testing convolution with padding...\")\n", + " conv2 = Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)\n", + " x2 = Tensor(np.random.randn(1, 3, 28, 28))\n", + " out2 = conv2(x2)\n", + "\n", + " # With padding=1, output should be same size as input\n", + " assert out2.shape == (1, 8, 28, 28), f\"Expected (1, 8, 28, 28), got {out2.shape}\"\n", + "\n", + " # Test 3: Convolution with stride\n", + " print(\" Testing convolution with stride...\")\n", + " conv3 = Conv2d(in_channels=1, out_channels=4, kernel_size=3, stride=2)\n", + " x3 = Tensor(np.random.randn(1, 1, 16, 16))\n", + " out3 = conv3(x3)\n", + "\n", + " expected_h = (16 - 3) // 2 + 1 # 7\n", + " expected_w = (16 - 3) // 2 + 1 # 7\n", + " assert out3.shape == (1, 4, expected_h, expected_w), f\"Expected (1, 4, 7, 7), got {out3.shape}\"\n", + "\n", + " # Test 4: Parameter counting\n", + " print(\" Testing parameter counting...\")\n", + " conv4 = Conv2d(in_channels=64, out_channels=128, kernel_size=3, bias=True)\n", + " params = conv4.parameters()\n", + "\n", + " # Weight: (128, 64, 3, 3) = 73,728 parameters\n", + " # Bias: (128,) = 128 parameters\n", + " # Total: 73,856 parameters\n", + " weight_params = 128 * 64 * 3 * 3\n", + " bias_params = 128\n", + " total_params = weight_params + 
bias_params\n", + "\n", + " actual_weight_params = np.prod(conv4.weight.shape)\n", + " actual_bias_params = np.prod(conv4.bias.shape) if conv4.bias is not None else 0\n", + " actual_total = actual_weight_params + actual_bias_params\n", + "\n", + " assert actual_total == total_params, f\"Expected {total_params} parameters, got {actual_total}\"\n", + " assert len(params) == 2, f\"Expected 2 parameter tensors, got {len(params)}\"\n", + "\n", + " # Test 5: No bias configuration\n", + " print(\" Testing no bias configuration...\")\n", + " conv5 = Conv2d(in_channels=3, out_channels=16, kernel_size=5, bias=False)\n", + " params5 = conv5.parameters()\n", + " assert len(params5) == 1, f\"Expected 1 parameter tensor (no bias), got {len(params5)}\"\n", + " assert conv5.bias is None, \"Bias should be None when bias=False\"\n", + "\n", + " print(\"✅ Conv2d works correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_conv2d()" + ] + }, + { + "cell_type": "markdown", + "id": "2e913b5c", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 4. Pooling Operations - Spatial Dimension Reduction\n", + "\n", + "Pooling operations compress spatial information while keeping the most important features. 
Think of them as creating \"thumbnail summaries\" of local regions.\n", + "\n", + "### MaxPool2d: Keeping the Strongest Signals\n", + "\n", + "Max pooling finds the strongest activation in each window, preserving sharp features like edges and corners.\n", + "\n", + "```\n", + "MaxPool2d Example (2×2 kernel, stride=2):\n", + "Input (4×4): Windows: Output (2×2):\n", + "┌─────────────┐ ┌─────┬─────┐ ┌─────┐\n", + "│ 1 3 │ 2 8 │ │ 1 3 │ 2 8 │ │ 6 8 │\n", + "│ 5 6 │ 7 4 │ → │ 5 6 │ 7 4 │ → │ 9 7 │\n", + "├─────┼─────┤ ├─────┼─────┤ └─────┘\n", + "│ 2 9 │ 1 7 │ │ 2 9 │ 1 7 │\n", + "│ 0 1 │ 3 6 │ │ 0 1 │ 3 6 │\n", + "└─────────────┘ └─────┴─────┘\n", + "\n", + "Window Computations:\n", + "Top-left: max(1,3,5,6) = 6 Top-right: max(2,8,7,4) = 8\n", + "Bottom-left: max(2,9,0,1) = 9 Bottom-right: max(1,7,3,6) = 7\n", + "```\n", + "\n", + "### AvgPool2d: Smoothing Local Features\n", + "\n", + "Average pooling computes the mean of each window, creating smoother, more general features.\n", + "\n", + "```\n", + "AvgPool2d Example (same 2×2 kernel, stride=2):\n", + "Input (4×4): Output (2×2):\n", + "┌─────────────┐ ┌──────────┐\n", + "│ 1 3 │ 2 8 │ │ 3.75 5.25│\n", + "│ 5 6 │ 7 4 │ → │ 3.0 4.25│\n", + "├─────┼─────┤ └──────────┘\n", + "│ 2 9 │ 1 7 │\n", + "│ 0 1 │ 3 6 │\n", + "└─────────────┘\n", + "\n", + "Window Computations:\n", + "Top-left: (1+3+5+6)/4 = 3.75 Top-right: (2+8+7+4)/4 = 5.25\n", + "Bottom-left: (2+9+0+1)/4 = 3.0 Bottom-right: (1+7+3+6)/4 = 4.25\n", + "```\n", + "\n", + "### Why Pooling Matters for Computer Vision\n", + "\n", + "```\n", + "Memory Impact:\n", + "Input: 224×224×64 = 3.2M values After 2×2 pooling: 112×112×64 = 0.8M values\n", + "Memory reduction: 4× less! 
Computation reduction: 4× less!\n", + "\n", + "Information Trade-off:\n", + "✅ Preserves important features ⚠️ Loses fine spatial detail\n", + "✅ Provides translation invariance ⚠️ Reduces localization precision\n", + "✅ Reduces overfitting ⚠️ May lose small objects\n", + "```\n", + "\n", + "### Sliding Window Pattern\n", + "\n", + "Both pooling operations follow the same sliding window pattern:\n", + "\n", + "```\n", + "Sliding 2×2 window with stride=2:\n", + "Step 1: Step 2: Step 3: Step 4:\n", + "┌──┐ ┌──┐\n", + "│▓▓│ │▓▓│\n", + "└──┘ └──┘ ┌──┐ ┌──┐\n", + " │▓▓│ │▓▓│\n", + " └──┘ └──┘\n", + "\n", + "Non-overlapping windows → Each input pixel used exactly once\n", + "Stride=2 → Output dimensions halved in each direction\n", + "```\n", + "\n", + "The key difference: MaxPool takes max(window), AvgPool takes mean(window)." + ] + }, + { + "cell_type": "markdown", + "id": "a74d3702", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### MaxPool2d Implementation - Preserving Strong Features\n", + "\n", + "MaxPool2d finds the strongest activation in each spatial window, creating a compressed representation that keeps the most important information.\n", + "\n", + "#### Why Max Pooling Works for Computer Vision\n", + "\n", + "```\n", + "Edge Detection Example:\n", + "Input Window (2×2): Max Pooling Result:\n", + "┌─────┬─────┐\n", + "│ 0.1 │ 0.8 │ ← Strong edge signal\n", + "├─────┼─────┤\n", + "│ 0.2 │ 0.1 │ Output: 0.8 (preserves edge)\n", + "└─────┴─────┘\n", + "\n", + "Noise Reduction Example:\n", + "Input Window (2×2):\n", + "┌─────┬─────┐\n", + "│ 0.9 │ 0.1 │ ← Feature + noise\n", + "├─────┼─────┤\n", + "│ 0.2 │ 0.1 │ Output: 0.9 (removes noise)\n", + "└─────┴─────┘\n", + "```\n", + "\n", + "#### The Sliding Window Pattern\n", + "\n", + "```\n", + "MaxPool with 2×2 kernel, stride=2:\n", + "\n", + "Input (4×4): Output (2×2):\n", + "┌───┬───┬───┬───┐ ┌───────┬───────┐\n", + "│ a │ b │ c │ d │ │max(a,b│max(c,d│\n", + 
"├───┼───┼───┼───┤ → │ e,f)│ g,h)│\n", + "│ e │ f │ g │ h │ ├───────┼───────┤\n", + "├───┼───┼───┼───┤ │max(i,j│max(k,l│\n", + "│ i │ j │ k │ l │ │ m,n)│ o,p)│\n", + "├───┼───┼───┼───┤ └───────┴───────┘\n", + "│ m │ n │ o │ p │\n", + "└───┴───┴───┴───┘\n", + "\n", + "Benefits:\n", + "✓ Translation invariance (cat moved 1 pixel still detected)\n", + "✓ Computational efficiency (4× fewer values to process)\n", + "✓ Hierarchical feature building (next layer sees larger receptive field)\n", + "```\n", + "\n", + "#### Memory and Computation Impact\n", + "\n", + "For input (1, 64, 224, 224) with 2×2 pooling:\n", + "- **Input memory**: 64 × 224 × 224 × 4 bytes = 12.8 MB\n", + "- **Output memory**: 64 × 112 × 112 × 4 bytes = 3.2 MB\n", + "- **Memory reduction**: 4× less memory needed\n", + "- **Computation**: No parameters, minimal compute cost" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "23a06538", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "maxpool2d-class", + "solution": true + } + }, + "outputs": [], + "source": [ + "\n", + "#| export\n", + "class MaxPool2d(Module):\n", + " \"\"\"\n", + " 2D Max Pooling layer for spatial dimension reduction.\n", + "\n", + " Applies maximum operation over spatial windows, preserving\n", + " the strongest activations while reducing computational load.\n", + "\n", + " Args:\n", + " kernel_size: Size of pooling window (int or tuple)\n", + " stride: Stride of pooling operation (default: same as kernel_size)\n", + " padding: Zero-padding added to input (default: 0)\n", + " \"\"\"\n", + "\n", + " def __init__(self, kernel_size, stride=None, padding=0):\n", + " \"\"\"\n", + " Initialize MaxPool2d layer.\n", + "\n", + " TODO: Store pooling parameters\n", + "\n", + " APPROACH:\n", + " 1. Convert kernel_size to tuple if needed\n", + " 2. Set stride to kernel_size if not provided (non-overlapping)\n", + " 3. 
Store padding parameter\n", + "\n", + " HINT: Default stride equals kernel_size for non-overlapping windows\n", + " \"\"\"\n", + " super().__init__()\n", + "\n", + " ### BEGIN SOLUTION\n", + " # Handle kernel_size as int or tuple\n", + " if isinstance(kernel_size, int):\n", + " self.kernel_size = (kernel_size, kernel_size)\n", + " else:\n", + " self.kernel_size = kernel_size\n", + "\n", + " # Default stride equals kernel_size (non-overlapping)\n", + " if stride is None:\n", + " self.stride = self.kernel_size[0]\n", + " else:\n", + " self.stride = stride\n", + "\n", + " self.padding = padding\n", + " ### END SOLUTION\n", + "\n", + " def forward(self, x):\n", + " \"\"\"\n", + " Forward pass through MaxPool2d layer.\n", + "\n", + " TODO: Implement max pooling with explicit loops\n", + "\n", + " APPROACH:\n", + " 1. Extract input dimensions\n", + " 2. Calculate output dimensions\n", + " 3. Apply padding if needed\n", + " 4. Implement nested loops for pooling windows\n", + " 5. Find maximum value in each window\n", + "\n", + " LOOP STRUCTURE:\n", + " for batch in range(batch_size):\n", + " for channel in range(channels):\n", + " for out_h in range(out_height):\n", + " for out_w in range(out_width):\n", + " # Find max in window [in_h:in_h+k_h, in_w:in_w+k_w]\n", + " max_val = -infinity\n", + " for k_h in range(kernel_height):\n", + " for k_w in range(kernel_width):\n", + " max_val = max(max_val, input[...])\n", + "\n", + " EXAMPLE:\n", + " >>> pool = MaxPool2d(kernel_size=2, stride=2)\n", + " >>> x = Tensor(np.random.randn(1, 3, 8, 8))\n", + " >>> out = pool(x)\n", + " >>> print(out.shape) # Should be (1, 3, 4, 4)\n", + "\n", + " HINTS:\n", + " - Initialize max_val to negative infinity\n", + " - Handle stride correctly when accessing input\n", + " - No parameters to update (pooling has no weights)\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # Input validation and shape extraction\n", + " if len(x.shape) != 4:\n", + " raise ValueError(f\"Expected 4D input (batch, 
channels, height, width), got {x.shape}\")\n", + "\n", + " batch_size, channels, in_height, in_width = x.shape\n", + " kernel_h, kernel_w = self.kernel_size\n", + "\n", + " # Calculate output dimensions\n", + " out_height = (in_height + 2 * self.padding - kernel_h) // self.stride + 1\n", + " out_width = (in_width + 2 * self.padding - kernel_w) // self.stride + 1\n", + "\n", + " # Apply padding if needed\n", + " if self.padding > 0:\n", + " padded_input = np.pad(x.data,\n", + " ((0, 0), (0, 0), (self.padding, self.padding), (self.padding, self.padding)),\n", + " mode='constant', constant_values=-np.inf)\n", + " else:\n", + " padded_input = x.data\n", + "\n", + " # Initialize output\n", + " output = np.zeros((batch_size, channels, out_height, out_width))\n", + "\n", + " # Explicit nested loop max pooling\n", + " for b in range(batch_size):\n", + " for c in range(channels):\n", + " for out_h in range(out_height):\n", + " for out_w in range(out_width):\n", + " # Calculate input region for this output position\n", + " in_h_start = out_h * self.stride\n", + " in_w_start = out_w * self.stride\n", + "\n", + " # Find maximum in window\n", + " max_val = -np.inf\n", + " for k_h in range(kernel_h):\n", + " for k_w in range(kernel_w):\n", + " input_val = padded_input[b, c,\n", + " in_h_start + k_h,\n", + " in_w_start + k_w]\n", + " max_val = max(max_val, input_val)\n", + "\n", + " # Store result\n", + " output[b, c, out_h, out_w] = max_val\n", + "\n", + " return Tensor(output)\n", + " ### END SOLUTION\n", + "\n", + " def parameters(self):\n", + " \"\"\"Return empty list (pooling has no parameters).\"\"\"\n", + " return []\n", + "\n", + " def __call__(self, x):\n", + " \"\"\"Enable model(x) syntax.\"\"\"\n", + " return self.forward(x)" + ] + }, + { + "cell_type": "markdown", + "id": "df0253b2", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### AvgPool2d Implementation - Smoothing and Generalizing Features\n", + "\n", + "AvgPool2d 
computes the average of each spatial window, creating smoother features that are less sensitive to noise and exact pixel positions.\n", + "\n", + "#### MaxPool vs AvgPool: Different Philosophies\n", + "\n", + "```\n", + "Same Input Window (2×2): MaxPool Output: AvgPool Output:\n", + "┌─────┬─────┐\n", + "│ 0.1 │ 0.9 │ 0.9 0.4\n", + "├─────┼─────┤ (max) ((0.1+0.9+0.3+0.3)/4)\n", + "│ 0.3 │ 0.3 │\n", + "└─────┴─────┘\n", + "\n", + "Interpretation:\n", + "MaxPool: \"What's the strongest feature here?\"\n", + "AvgPool: \"What's the general feature level here?\"\n", + "```\n", + "\n", + "#### When to Use Average Pooling\n", + "\n", + "```\n", + "Use Cases:\n", + "✓ Global Average Pooling (GAP) for classification\n", + "✓ When you want smoother, less noisy features\n", + "✓ When exact feature location doesn't matter\n", + "✓ In shallower networks where sharp features aren't critical\n", + "\n", + "Typical Pattern:\n", + "Feature Maps → Global Average Pool → Dense → Classification\n", + "(256×7×7) → (256×1×1) → FC → (10)\n", + " Replaces flatten+dense with parameter reduction\n", + "```\n", + "\n", + "#### Mathematical Implementation\n", + "\n", + "```\n", + "Average Pooling Computation:\n", + "Window: [a, b] Result = (a + b + c + d) / 4\n", + " [c, d]\n", + "\n", + "For efficiency, we:\n", + "1. Sum all values in window: window_sum = a + b + c + d\n", + "2. Divide by window area: result = window_sum / (kernel_h × kernel_w)\n", + "3. 
Store result at output position\n", + "\n", + "Memory access pattern identical to MaxPool, just different aggregation!\n", + "```\n", + "\n", + "#### Practical Considerations\n", + "\n", + "- **Memory**: Same 4× reduction as MaxPool\n", + "- **Computation**: Slightly more expensive (sum + divide vs max)\n", + "- **Features**: Smoother, more generalized than MaxPool\n", + "- **Use**: Often in final layers (Global Average Pooling) to reduce parameters" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "41e2a85d", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "avgpool2d-class", + "solution": true + } + }, + "outputs": [], + "source": [ + "\n", + "#| export\n", + "class AvgPool2d(Module):\n", + " \"\"\"\n", + " 2D Average Pooling layer for spatial dimension reduction.\n", + "\n", + " Applies average operation over spatial windows, smoothing\n", + " features while reducing computational load.\n", + "\n", + " Args:\n", + " kernel_size: Size of pooling window (int or tuple)\n", + " stride: Stride of pooling operation (default: same as kernel_size)\n", + " padding: Zero-padding added to input (default: 0)\n", + " \"\"\"\n", + "\n", + " def __init__(self, kernel_size, stride=None, padding=0):\n", + " \"\"\"\n", + " Initialize AvgPool2d layer.\n", + "\n", + " TODO: Store pooling parameters (same as MaxPool2d)\n", + "\n", + " APPROACH:\n", + " 1. Convert kernel_size to tuple if needed\n", + " 2. Set stride to kernel_size if not provided\n", + " 3. 
Store padding parameter\n", + " \"\"\"\n", + " super().__init__()\n", + "\n", + " ### BEGIN SOLUTION\n", + " # Handle kernel_size as int or tuple\n", + " if isinstance(kernel_size, int):\n", + " self.kernel_size = (kernel_size, kernel_size)\n", + " else:\n", + " self.kernel_size = kernel_size\n", + "\n", + " # Default stride equals kernel_size (non-overlapping)\n", + " if stride is None:\n", + " self.stride = self.kernel_size[0]\n", + " else:\n", + " self.stride = stride\n", + "\n", + " self.padding = padding\n", + " ### END SOLUTION\n", + "\n", + " def forward(self, x):\n", + " \"\"\"\n", + " Forward pass through AvgPool2d layer.\n", + "\n", + " TODO: Implement average pooling with explicit loops\n", + "\n", + " APPROACH:\n", + " 1. Similar structure to MaxPool2d\n", + " 2. Instead of max, compute average of window\n", + " 3. Divide sum by window area for true average\n", + "\n", + " LOOP STRUCTURE:\n", + " for batch in range(batch_size):\n", + " for channel in range(channels):\n", + " for out_h in range(out_height):\n", + " for out_w in range(out_width):\n", + " # Compute average in window\n", + " window_sum = 0\n", + " for k_h in range(kernel_height):\n", + " for k_w in range(kernel_width):\n", + " window_sum += input[...]\n", + " avg_val = window_sum / (kernel_height * kernel_width)\n", + "\n", + " HINT: Remember to divide by window area to get true average\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # Input validation and shape extraction\n", + " if len(x.shape) != 4:\n", + " raise ValueError(f\"Expected 4D input (batch, channels, height, width), got {x.shape}\")\n", + "\n", + " batch_size, channels, in_height, in_width = x.shape\n", + " kernel_h, kernel_w = self.kernel_size\n", + "\n", + " # Calculate output dimensions\n", + " out_height = (in_height + 2 * self.padding - kernel_h) // self.stride + 1\n", + " out_width = (in_width + 2 * self.padding - kernel_w) // self.stride + 1\n", + "\n", + " # Apply padding if needed\n", + " if self.padding > 0:\n", + 
" padded_input = np.pad(x.data,\n", + " ((0, 0), (0, 0), (self.padding, self.padding), (self.padding, self.padding)),\n", + " mode='constant', constant_values=0)\n", + " else:\n", + " padded_input = x.data\n", + "\n", + " # Initialize output\n", + " output = np.zeros((batch_size, channels, out_height, out_width))\n", + "\n", + " # Explicit nested loop average pooling\n", + " for b in range(batch_size):\n", + " for c in range(channels):\n", + " for out_h in range(out_height):\n", + " for out_w in range(out_width):\n", + " # Calculate input region for this output position\n", + " in_h_start = out_h * self.stride\n", + " in_w_start = out_w * self.stride\n", + "\n", + " # Compute sum in window\n", + " window_sum = 0.0\n", + " for k_h in range(kernel_h):\n", + " for k_w in range(kernel_w):\n", + " input_val = padded_input[b, c,\n", + " in_h_start + k_h,\n", + " in_w_start + k_w]\n", + " window_sum += input_val\n", + "\n", + " # Compute average\n", + " avg_val = window_sum / (kernel_h * kernel_w)\n", + "\n", + " # Store result\n", + " output[b, c, out_h, out_w] = avg_val\n", + "\n", + " return Tensor(output)\n", + " ### END SOLUTION\n", + "\n", + " def parameters(self):\n", + " \"\"\"Return empty list (pooling has no parameters).\"\"\"\n", + " return []\n", + "\n", + " def __call__(self, x):\n", + " \"\"\"Enable model(x) syntax.\"\"\"\n", + " return self.forward(x)" + ] + }, + { + "cell_type": "markdown", + "id": "0f92eb4d", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### 🧪 Unit Test: Pooling Operations\n", + "This test validates both max and average pooling implementations.\n", + "**What we're testing**: Dimension reduction, aggregation correctness\n", + "**Why it matters**: Pooling is essential for computational efficiency in CNNs\n", + "**Expected**: Correct output shapes and proper value aggregation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "11126788", + "metadata": { + "nbgrader": { + 
"grade": true, + "grade_id": "test-pooling", + "locked": true, + "points": 10 + } + }, + "outputs": [], + "source": [ + "\n", + "def test_unit_pooling():\n", + " \"\"\"🔬 Test MaxPool2d and AvgPool2d implementations.\"\"\"\n", + " print(\"🔬 Unit Test: Pooling Operations...\")\n", + "\n", + " # Test 1: MaxPool2d basic functionality\n", + " print(\" Testing MaxPool2d...\")\n", + " maxpool = MaxPool2d(kernel_size=2, stride=2)\n", + " x1 = Tensor(np.random.randn(1, 3, 8, 8))\n", + " out1 = maxpool(x1)\n", + "\n", + " expected_shape = (1, 3, 4, 4) # 8/2 = 4\n", + " assert out1.shape == expected_shape, f\"MaxPool expected {expected_shape}, got {out1.shape}\"\n", + "\n", + " # Test 2: AvgPool2d basic functionality\n", + " print(\" Testing AvgPool2d...\")\n", + " avgpool = AvgPool2d(kernel_size=2, stride=2)\n", + " x2 = Tensor(np.random.randn(2, 16, 16, 16))\n", + " out2 = avgpool(x2)\n", + "\n", + " expected_shape = (2, 16, 8, 8) # 16/2 = 8\n", + " assert out2.shape == expected_shape, f\"AvgPool expected {expected_shape}, got {out2.shape}\"\n", + "\n", + " # Test 3: MaxPool vs AvgPool on known data\n", + " print(\" Testing max vs avg behavior...\")\n", + " # Create simple test case with known values\n", + " test_data = np.array([[[[1, 2, 3, 4],\n", + " [5, 6, 7, 8],\n", + " [9, 10, 11, 12],\n", + " [13, 14, 15, 16]]]], dtype=np.float32)\n", + " x3 = Tensor(test_data)\n", + "\n", + " maxpool_test = MaxPool2d(kernel_size=2, stride=2)\n", + " avgpool_test = AvgPool2d(kernel_size=2, stride=2)\n", + "\n", + " max_out = maxpool_test(x3)\n", + " avg_out = avgpool_test(x3)\n", + "\n", + " # For 2x2 windows:\n", + " # Top-left: max([1,2,5,6]) = 6, avg = 3.5\n", + " # Top-right: max([3,4,7,8]) = 8, avg = 5.5\n", + " # Bottom-left: max([9,10,13,14]) = 14, avg = 11.5\n", + " # Bottom-right: max([11,12,15,16]) = 16, avg = 13.5\n", + "\n", + " expected_max = np.array([[[[6, 8], [14, 16]]]])\n", + " expected_avg = np.array([[[[3.5, 5.5], [11.5, 13.5]]]])\n", + "\n", + " assert 
np.allclose(max_out.data, expected_max), f\"MaxPool values incorrect: {max_out.data} vs {expected_max}\"\n", + " assert np.allclose(avg_out.data, expected_avg), f\"AvgPool values incorrect: {avg_out.data} vs {expected_avg}\"\n", + "\n", + " # Test 4: Overlapping pooling (stride < kernel_size)\n", + " print(\" Testing overlapping pooling...\")\n", + " overlap_pool = MaxPool2d(kernel_size=3, stride=1)\n", + " x4 = Tensor(np.random.randn(1, 1, 5, 5))\n", + " out4 = overlap_pool(x4)\n", + "\n", + " # Output: (5-3)/1 + 1 = 3\n", + " expected_shape = (1, 1, 3, 3)\n", + " assert out4.shape == expected_shape, f\"Overlapping pool expected {expected_shape}, got {out4.shape}\"\n", + "\n", + " # Test 5: No parameters in pooling layers\n", + " print(\" Testing parameter counts...\")\n", + " assert len(maxpool.parameters()) == 0, \"MaxPool should have no parameters\"\n", + " assert len(avgpool.parameters()) == 0, \"AvgPool should have no parameters\"\n", + "\n", + " print(\"✅ Pooling operations work correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_pooling()" + ] + }, + { + "cell_type": "markdown", + "id": "e56854cd", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "## 5. Systems Analysis - Understanding Spatial Operation Performance\n", + "\n", + "Now let's analyze the computational complexity and memory trade-offs of spatial operations. This analysis reveals why certain design choices matter for real-world performance.\n", + "\n", + "### Key Questions We'll Answer:\n", + "1. How does convolution complexity scale with input size and kernel size?\n", + "2. What's the memory vs computation trade-off in different approaches?\n", + "3. How do modern optimizations (like im2col) change the performance characteristics?" 
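Question 3 above (how im2col changes the performance picture) can be made concrete with a short sketch. The code below is purely illustrative plain NumPy — it is not part of this module's exports, and the `im2col` helper and the shapes are assumptions for the example: every receptive-field patch is unfolded into one column of a matrix, so the whole convolution collapses into a single matrix multiply that a BLAS library can optimize, trading extra memory for one large, cache-friendly operation instead of six nested loops.

```python
import numpy as np

def im2col(x, kh, kw, stride=1):
    """Unfold a (C, H, W) input into a (C*kh*kw, L) matrix of flattened patches."""
    C, H, W = x.shape
    out_h = (H - kh) // stride + 1
    out_w = (W - kw) // stride + 1
    cols = np.empty((C * kh * kw, out_h * out_w))
    col = 0
    for i in range(out_h):
        for j in range(out_w):
            # One flattened receptive-field patch per output position
            patch = x[:, i * stride:i * stride + kh, j * stride:j * stride + kw]
            cols[:, col] = patch.reshape(-1)
            col += 1
    return cols, out_h, out_w

# Convolution becomes one matmul: (C_out, C*kh*kw) @ (C*kh*kw, L)
x = np.random.randn(3, 8, 8)        # (C_in, H, W)
w = np.random.randn(16, 3, 3, 3)    # (C_out, C_in, kh, kw)
cols, oh, ow = im2col(x, 3, 3)
out = (w.reshape(16, -1) @ cols).reshape(16, oh, ow)   # (16, 6, 6)
```

The trade-off shows up directly in the shapes: `cols` stores each input pixel up to kh×kw times (a 27×36 matrix here versus the 3×8×8 input), which is exactly the memory-for-speed exchange the complexity analysis in this section measures.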
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f941a2ee", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "spatial-analysis", + "solution": true + } + }, + "outputs": [], + "source": [ + "\n", + "def analyze_convolution_complexity():\n", + " \"\"\"📊 Analyze convolution computational complexity across different configurations.\"\"\"\n", + " print(\"📊 Analyzing Convolution Complexity...\")\n", + "\n", + " # Test configurations optimized for educational demonstration (smaller sizes)\n", + " # Each conv tuple is (out_channels, in_channels, kernel_size)\n", + " configs = [\n", + " {\"input\": (1, 3, 16, 16), \"conv\": (8, 3, 3), \"name\": \"Small (16×16)\"},\n", + " {\"input\": (1, 3, 24, 24), \"conv\": (12, 3, 3), \"name\": \"Medium (24×24)\"},\n", + " {\"input\": (1, 3, 32, 32), \"conv\": (16, 3, 3), \"name\": \"Large (32×32)\"},\n", + " {\"input\": (1, 3, 16, 16), \"conv\": (8, 3, 5), \"name\": \"Large Kernel (5×5)\"},\n", + " ]\n", + "\n", + " print(f\"{'Configuration':<20} {'FLOPs':<15} {'Memory (MB)':<12} {'Time (ms)':<10}\")\n", + " print(\"-\" * 70)\n", + "\n", + " for config in configs:\n", + " # Create convolution layer\n", + " in_ch = config[\"input\"][1]\n", + " out_ch, k_size = config[\"conv\"][0], config[\"conv\"][2]\n", + " conv = Conv2d(in_ch, out_ch, kernel_size=k_size, padding=k_size//2)\n", + "\n", + " # Create input tensor\n", + " x = Tensor(np.random.randn(*config[\"input\"]))\n", + "\n", + " # Calculate theoretical FLOPs\n", + " batch, in_channels, h, w = config[\"input\"]\n", + " out_channels, kernel_size = config[\"conv\"][0], config[\"conv\"][2]\n", + "\n", + " # Each output element requires in_channels * kernel_size² multiply-adds\n", + " flops_per_output = in_channels * kernel_size * kernel_size * 2 # 2 for MAC\n", + " total_outputs = batch * out_channels * h * w # Assuming same size with padding\n", + " total_flops = flops_per_output * total_outputs\n", + "\n", + " # Measure memory usage\n", + " input_memory = np.prod(config[\"input\"]) * 4 # 
float32 = 4 bytes\n", + " weight_memory = out_channels * in_channels * kernel_size * kernel_size * 4\n", + " output_memory = batch * out_channels * h * w * 4\n", + " total_memory = (input_memory + weight_memory + output_memory) / (1024 * 1024) # MB\n", + "\n", + " # Measure execution time\n", + " start_time = time.time()\n", + " _ = conv(x)\n", + " end_time = time.time()\n", + " exec_time = (end_time - start_time) * 1000 # ms\n", + "\n", + " print(f\"{config['name']:<20} {total_flops:<15,} {total_memory:<12.2f} {exec_time:<10.2f}\")\n", + "\n", + " print(\"\\n💡 Key Insights:\")\n", + " print(\"🔸 FLOPs scale as O(H×W×C_in×C_out×K²) - quadratic in spatial and kernel size\")\n", + " print(\"🔸 Memory scales linearly with spatial dimensions and channels\")\n", + " print(\"🔸 Large kernels dramatically increase computational cost\")\n", + " print(\"🚀 This motivates depthwise separable convolutions and attention mechanisms\")\n", + "\n", + "# Analysis will be called in main execution" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b066bd66", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "pooling-analysis", + "solution": true + } + }, + "outputs": [], + "source": [ + "\n", + "def analyze_pooling_effects():\n", + " \"\"\"📊 Analyze pooling's impact on spatial dimensions and features.\"\"\"\n", + " print(\"\\n📊 Analyzing Pooling Effects...\")\n", + "\n", + " # Create sample input with spatial structure\n", + " # Simple edge pattern that pooling should preserve differently\n", + " pattern = np.zeros((1, 1, 8, 8))\n", + " pattern[0, 0, :, 3:5] = 1.0 # Vertical edge\n", + " pattern[0, 0, 3:5, :] = 1.0 # Horizontal edge\n", + " x = Tensor(pattern)\n", + "\n", + " print(\"Original 8×8 pattern:\")\n", + " print(x.data[0, 0])\n", + "\n", + " # Test different pooling strategies\n", + " pools = [\n", + " (MaxPool2d(2, stride=2), \"MaxPool 2×2\"),\n", + " (AvgPool2d(2, stride=2), \"AvgPool 2×2\"),\n", + " 
(MaxPool2d(4, stride=4), \"MaxPool 4×4\"),\n", + " (AvgPool2d(4, stride=4), \"AvgPool 4×4\"),\n", + " ]\n", + "\n", + " print(f\"\\n{'Operation':<15} {'Output Shape':<15} {'Feature Preservation'}\")\n", + " print(\"-\" * 60)\n", + "\n", + " for pool_op, name in pools:\n", + " result = pool_op(x)\n", + " # Measure how much of the original pattern is preserved\n", + " preservation = np.sum(result.data > 0.1) / np.prod(result.shape)\n", + " print(f\"{name:<15} {str(result.shape):<15} {preservation:<.2%}\")\n", + "\n", + " print(f\" Output:\")\n", + " print(f\" {result.data[0, 0]}\")\n", + " print()\n", + "\n", + " print(\"💡 Key Insights:\")\n", + " print(\"🔸 MaxPool preserves sharp features better (edge detection)\")\n", + " print(\"🔸 AvgPool smooths features (noise reduction)\")\n", + " print(\"🔸 Larger pooling windows lose more spatial detail\")\n", + " print(\"🚀 Choice depends on task: classification vs detection vs segmentation\")\n", + "\n", + "# Analysis will be called in main execution" + ] + }, + { + "cell_type": "markdown", + "id": "24b9212e", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 6. Integration - Building a Complete CNN\n", + "\n", + "Now let's combine convolution and pooling into a complete CNN architecture. 
You'll see how spatial operations work together to transform raw pixels into meaningful features.\n", + "\n", + "### CNN Architecture: From Pixels to Predictions\n", + "\n", + "A CNN processes images through alternating convolution and pooling layers, gradually extracting higher-level features:\n", + "\n", + "```\n", + "Complete CNN Pipeline:\n", + "\n", + "Input Image (32×32×3) Raw RGB pixels\n", + " ↓\n", + "Conv2d(3→16, 3×3) Detect edges, textures\n", + " ↓\n", + "ReLU Activation Remove negative values\n", + " ↓\n", + "MaxPool(2×2) Reduce to (16×16×16)\n", + " ↓\n", + "Conv2d(16→32, 3×3) Detect shapes, patterns\n", + " ↓\n", + "ReLU Activation Remove negative values\n", + " ↓\n", + "MaxPool(2×2) Reduce to (8×8×32)\n", + " ↓\n", + "Flatten Reshape to vector (2048,)\n", + " ↓\n", + "Linear(2048→10) Final classification\n", + " ↓\n", + "Softmax Probability distribution\n", + "```\n", + "\n", + "### The Parameter Efficiency Story\n", + "\n", + "```\n", + "CNN vs Dense Network Comparison:\n", + "\n", + "CNN Approach: Dense Approach:\n", + "┌─────────────────┐ ┌─────────────────┐\n", + "│ Conv1: 3→16 │ │ Input: 32×32×3 │\n", + "│ Params: 448 │ │ = 3,072 values │\n", + "├─────────────────┤ ├─────────────────┤\n", + "│ Conv2: 16→32 │ │ Hidden: 1,000 │\n", + "│ Params: 4,640 │ │ Params: 3M+ │\n", + "├─────────────────┤ ├─────────────────┤\n", + "│ Linear: 2048→10 │ │ Output: 10 │\n", + "│ Params: 20,490 │ │ Params: 10K │\n", + "└─────────────────┘ └─────────────────┘\n", + "Total: ~25K params Total: ~3M params\n", + "\n", + "CNN wins with 120× fewer parameters!\n", + "```\n", + "\n", + "### Spatial Hierarchy: Why This Architecture Works\n", + "\n", + "```\n", + "Layer-by-Layer Feature Evolution:\n", + "\n", + "Layer 1 (Conv 3→16): Layer 2 (Conv 16→32):\n", + "┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐\n", + "│Edge │ │Edge │ │Edge │ │Shape│ │Corner│ │Texture│\n", + "│ \\\\ /│ │ | │ │ / \\\\│ │ ◇ │ │ L │ │ ≈≈≈ │\n", + "└─────┘ └─────┘ └─────┘ └─────┘ └─────┘ 
└─────┘\n", + "Simple features Complex combinations\n", + "\n", + "Why pooling between layers:\n", + "✓ Reduces computation for next layer\n", + "✓ Increases receptive field (each conv sees larger input area)\n", + "✓ Provides translation invariance (cat moved 1 pixel still detected)\n", + "```\n", + "\n", + "This hierarchical approach mirrors human vision: we first detect edges, then shapes, then objects!" + ] + }, + { + "cell_type": "markdown", + "id": "93110b91", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### SimpleCNN Implementation - Putting It All Together\n", + "\n", + "Now we'll build a complete CNN that demonstrates how convolution and pooling work together. This is your first step from processing individual tensors to understanding complete images!\n", + "\n", + "#### The CNN Architecture Pattern\n", + "\n", + "```\n", + "SimpleCNN Architecture Visualization:\n", + "\n", + "Input: (batch, 3, 32, 32) ← RGB images (CIFAR-10 size)\n", + " ↓\n", + "┌─────────────────────────┐\n", + "│ Conv2d(3→16, 3×3, p=1) │ ← Detect edges, textures\n", + "│ ReLU() │ ← Remove negative values\n", + "│ MaxPool(2×2) │ ← Reduce to (batch, 16, 16, 16)\n", + "└─────────────────────────┘\n", + " ↓\n", + "┌─────────────────────────┐\n", + "│ Conv2d(16→32, 3×3, p=1) │ ← Detect shapes, patterns\n", + "│ ReLU() │ ← Remove negative values\n", + "│ MaxPool(2×2) │ ← Reduce to (batch, 32, 8, 8)\n", + "└─────────────────────────┘\n", + " ↓\n", + "┌─────────────────────────┐\n", + "│ Flatten() │ ← Reshape to (batch, 2048)\n", + "│ Linear(2048→10) │ ← Final classification\n", + "└─────────────────────────┘\n", + " ↓\n", + "Output: (batch, 10) ← Class probabilities\n", + "```\n", + "\n", + "#### Why This Architecture Works\n", + "\n", + "```\n", + "Feature Hierarchy Development:\n", + "\n", + "Layer 1 Features (3→16): Layer 2 Features (16→32):\n", + "┌─────┬─────┬─────┬─────┐ ┌─────┬─────┬─────┬─────┐\n", + "│Edge │Edge │Edge │Blob │ 
│Shape│Corner│Tex-│Pat- │\n", + "│ \\\\ │ | │ / │ ○ │ │ ◇ │ L │ture│tern │\n", + "└─────┴─────┴─────┴─────┘ └─────┴─────┴─────┴─────┘\n", + "Simple features Complex combinations\n", + "\n", + "Spatial Dimension Reduction:\n", + "32×32 → 16×16 → 8×8\n", + " 1024 256 64 (per channel)\n", + "\n", + "Channel Expansion:\n", + "3 → 16 → 32\n", + "More feature types at each level\n", + "```\n", + "\n", + "#### Parameter Efficiency Demonstration\n", + "\n", + "```\n", + "CNN vs Dense Comparison for 32×32×3 → 10 classes:\n", + "\n", + "CNN Approach: Dense Approach:\n", + "┌────────────────────┐ ┌────────────────────┐\n", + "│ Conv1: 3→16, 3×3 │ │ Input: 3072 values │\n", + "│ Params: 448 │ │ ↓ │\n", + "├────────────────────┤ │ Dense: 3072→512 │\n", + "│ Conv2: 16→32, 3×3 │ │ Params: 1.57M │\n", + "│ Params: 4,640 │ ├────────────────────┤\n", + "├────────────────────┤ │ Dense: 512→10 │\n", + "│ Dense: 2048→10 │ │ Params: 5,130 │\n", + "│ Params: 20,490 │ └────────────────────┘\n", + "└────────────────────┘ Total: 1.58M params\n", + "Total: 25,578 params\n", + "\n", + "CNN has 62× fewer parameters while preserving spatial structure!\n", + "```\n", + "\n", + "#### Receptive Field Growth\n", + "\n", + "```\n", + "How each layer sees progressively larger input regions:\n", + "\n", + "Layer 1 Conv (3×3): Layer 2 Conv (3×3):\n", + "Each output pixel sees Each output pixel sees\n", + "3×3 = 9 input pixels 8×8 = 64 input pixels\n", + " (due to pooling+conv)\n", + "\n", + "Final Result: Layer 2 can detect complex patterns\n", + "spanning 8×8 regions of original image!\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "740c0edb", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "simple-cnn", + "solution": true + } + }, + "outputs": [], + "source": [ + "\n", + "#| export\n", + "class SimpleCNN(Module):\n", + " \"\"\"\n", + " Simple CNN demonstrating spatial operations integration.\n", + "\n", + " 
Architecture:\n", + " - Conv2d(3→16, 3×3) + ReLU + MaxPool(2×2)\n", + " - Conv2d(16→32, 3×3) + ReLU + MaxPool(2×2)\n", + " - Flatten + Linear(features→num_classes)\n", + " \"\"\"\n", + "\n", + " def __init__(self, num_classes=10):\n", + " \"\"\"\n", + " Initialize SimpleCNN.\n", + "\n", + " TODO: Build CNN architecture with spatial and dense layers\n", + "\n", + " APPROACH:\n", + " 1. Conv layer 1: 3 → 16 channels, 3×3 kernel, padding=1\n", + " 2. Pool layer 1: 2×2 max pooling\n", + " 3. Conv layer 2: 16 → 32 channels, 3×3 kernel, padding=1\n", + " 4. Pool layer 2: 2×2 max pooling\n", + " 5. Calculate flattened size and add final linear layer\n", + "\n", + " HINT: For 32×32 input → 32→16→8 spatial reduction\n", + " Final feature size: 32 channels × 8×8 = 2048 features\n", + " \"\"\"\n", + " super().__init__()\n", + "\n", + " ### BEGIN SOLUTION\n", + " # Convolutional layers\n", + " self.conv1 = Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)\n", + " self.pool1 = MaxPool2d(kernel_size=2, stride=2)\n", + "\n", + " self.conv2 = Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1)\n", + " self.pool2 = MaxPool2d(kernel_size=2, stride=2)\n", + "\n", + " # Calculate flattened size for 32×32 input:\n", + " # 32×32 → Conv1+Pool1: 16×16 → Conv2+Pool2: 8×8\n", + " # Final: 32 channels × 8×8 = 2048 features\n", + " self.flattened_size = 32 * 8 * 8\n", + "\n", + " # The final Linear(flattened_size → num_classes) layer is added once\n", + " # the layers module is available; store num_classes for that step.\n", + " self.num_classes = num_classes\n", + " ### END SOLUTION\n", + "\n", + " def forward(self, x):\n", + " \"\"\"\n", + " Forward pass through SimpleCNN.\n", + "\n", + " TODO: Implement CNN forward pass\n", + "\n", + " APPROACH:\n", + " 1. 
Apply conv1 → ReLU → pool1\n", + " 2. Apply conv2 → ReLU → pool2\n", + " 3. Flatten spatial dimensions\n", + " 4. Apply final linear layer (when available)\n", + "\n", + " For now, return features before final linear layer\n", + " since we haven't imported Linear from layers module yet.\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # First conv block\n", + " x = self.conv1(x)\n", + " x = self.relu(x) # ReLU activation\n", + " x = self.pool1(x)\n", + "\n", + " # Second conv block\n", + " x = self.conv2(x)\n", + " x = self.relu(x) # ReLU activation\n", + " x = self.pool2(x)\n", + "\n", + " # Flatten for classification (reshape to 2D)\n", + " batch_size = x.shape[0]\n", + " x_flat = x.data.reshape(batch_size, -1)\n", + "\n", + " # Return flattened features\n", + " # In a complete implementation, this would go through a Linear layer\n", + " return Tensor(x_flat)\n", + " ### END SOLUTION\n", + "\n", + " def relu(self, x):\n", + " \"\"\"Simple ReLU implementation for CNN.\"\"\"\n", + " return Tensor(np.maximum(0, x.data))\n", + "\n", + " def parameters(self):\n", + " \"\"\"Return all trainable parameters.\"\"\"\n", + " params = []\n", + " params.extend(self.conv1.parameters())\n", + " params.extend(self.conv2.parameters())\n", + " # Linear layer parameters would be added here\n", + " return params\n", + "\n", + " def __call__(self, x):\n", + " \"\"\"Enable model(x) syntax.\"\"\"\n", + " return self.forward(x)" + ] + }, + { + "cell_type": "markdown", + "id": "3855be86", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### 🧪 Unit Test: SimpleCNN Integration\n", + "This test validates that spatial operations work together in a complete CNN architecture.\n", + "**What we're testing**: End-to-end spatial processing pipeline\n", + "**Why it matters**: Spatial operations must compose correctly for real CNNs\n", + "**Expected**: Proper dimension reduction and feature extraction" + ] + }, + { + "cell_type": "code", + 
"execution_count": null, + "id": "4ed1fe45", + "metadata": { + "nbgrader": { + "grade": true, + "grade_id": "test-simple-cnn", + "locked": true, + "points": 10 + } + }, + "outputs": [], + "source": [ + "\n", + "def test_unit_simple_cnn():\n", + " \"\"\"🔬 Test SimpleCNN integration with spatial operations.\"\"\"\n", + " print(\"🔬 Unit Test: SimpleCNN Integration...\")\n", + "\n", + " # Test 1: Forward pass with CIFAR-10 sized input\n", + " print(\" Testing forward pass...\")\n", + " model = SimpleCNN(num_classes=10)\n", + " x = Tensor(np.random.randn(2, 3, 32, 32)) # Batch of 2, RGB, 32×32\n", + "\n", + " features = model(x)\n", + "\n", + " # Expected: 2 samples, 32 channels × 8×8 spatial = 2048 features\n", + " expected_shape = (2, 2048)\n", + " assert features.shape == expected_shape, f\"Expected {expected_shape}, got {features.shape}\"\n", + "\n", + " # Test 2: Parameter counting\n", + " print(\" Testing parameter counting...\")\n", + " params = model.parameters()\n", + "\n", + " # Conv1: (16, 3, 3, 3) + bias (16,) = 432 + 16 = 448\n", + " # Conv2: (32, 16, 3, 3) + bias (32,) = 4608 + 32 = 4640\n", + " # Total: 448 + 4640 = 5088 parameters\n", + "\n", + " conv1_params = 16 * 3 * 3 * 3 + 16 # weights + bias\n", + " conv2_params = 32 * 16 * 3 * 3 + 32 # weights + bias\n", + " expected_total = conv1_params + conv2_params\n", + "\n", + " actual_total = sum(np.prod(p.shape) for p in params)\n", + " assert actual_total == expected_total, f\"Expected {expected_total} parameters, got {actual_total}\"\n", + "\n", + " # Test 3: Different input sizes\n", + " print(\" Testing different input sizes...\")\n", + "\n", + " # Test with different spatial dimensions\n", + " x_small = Tensor(np.random.randn(1, 3, 16, 16))\n", + " features_small = model(x_small)\n", + "\n", + " # 16×16 → 8×8 → 4×4, so 32 × 4×4 = 512 features\n", + " expected_small = (1, 512)\n", + " assert features_small.shape == expected_small, f\"Expected {expected_small}, got {features_small.shape}\"\n", + "\n", + 
" # Test 4: Batch processing\n", + " print(\" Testing batch processing...\")\n", + " x_batch = Tensor(np.random.randn(8, 3, 32, 32))\n", + " features_batch = model(x_batch)\n", + "\n", + " expected_batch = (8, 2048)\n", + " assert features_batch.shape == expected_batch, f\"Expected {expected_batch}, got {features_batch.shape}\"\n", + "\n", + " print(\"✅ SimpleCNN integration works correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_simple_cnn()" + ] + }, + { + "cell_type": "markdown", + "id": "6ab5ed35", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "## 7. Module Integration Test\n", + "\n", + "Final validation that everything works together correctly." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "727ef628", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": true, + "grade_id": "module-integration", + "locked": true, + "points": 15 + } + }, + "outputs": [], + "source": [ + "\n", + "def test_module():\n", + " \"\"\"\n", + " Comprehensive test of entire spatial module functionality.\n", + "\n", + " This final test runs before module summary to ensure:\n", + " - All unit tests pass\n", + " - Functions work together correctly\n", + " - Module is ready for integration with TinyTorch\n", + " \"\"\"\n", + " print(\"🧪 RUNNING MODULE INTEGRATION TEST\")\n", + " print(\"=\" * 50)\n", + "\n", + " # Run all unit tests\n", + " print(\"Running unit tests...\")\n", + " test_unit_conv2d()\n", + " test_unit_pooling()\n", + " test_unit_simple_cnn()\n", + "\n", + " print(\"\\nRunning integration scenarios...\")\n", + "\n", + " # Test realistic CNN workflow\n", + " print(\"🔬 Integration Test: Complete CNN pipeline...\")\n", + "\n", + " # Create a mini CNN for CIFAR-10\n", + " conv1 = Conv2d(3, 8, kernel_size=3, padding=1)\n", + " pool1 = MaxPool2d(2, stride=2)\n", + " conv2 = Conv2d(8, 16, kernel_size=3, padding=1)\n", + " pool2 = AvgPool2d(2, stride=2)\n", + "\n", + " # 
Process batch of images\n", + " batch_images = Tensor(np.random.randn(4, 3, 32, 32))\n", + "\n", + " # Forward pass through spatial layers\n", + " x = conv1(batch_images) # (4, 8, 32, 32)\n", + " x = pool1(x) # (4, 8, 16, 16)\n", + " x = conv2(x) # (4, 16, 16, 16)\n", + " features = pool2(x) # (4, 16, 8, 8)\n", + "\n", + " # Validate shapes at each step\n", + " assert x.shape[0] == 4, f\"Batch size should be preserved, got {x.shape[0]}\"\n", + " assert features.shape == (4, 16, 8, 8), f\"Final features shape incorrect: {features.shape}\"\n", + "\n", + " # Test parameter collection across all layers\n", + " all_params = []\n", + " all_params.extend(conv1.parameters())\n", + " all_params.extend(conv2.parameters())\n", + " # Pooling has no parameters\n", + " assert len(pool1.parameters()) == 0\n", + " assert len(pool2.parameters()) == 0\n", + "\n", + " # Verify we have the right number of parameter tensors\n", + " assert len(all_params) == 4, f\"Expected 4 parameter tensors (2 conv × 2 each), got {len(all_params)}\"\n", + "\n", + " print(\"✅ Complete CNN pipeline works!\")\n", + "\n", + " # Test memory efficiency comparison\n", + " print(\"🔬 Integration Test: Memory efficiency analysis...\")\n", + "\n", + " # Compare different pooling strategies (reduced size for faster execution)\n", + " input_data = Tensor(np.random.randn(1, 16, 32, 32))\n", + "\n", + " # No pooling: maintain spatial size\n", + " conv_only = Conv2d(16, 32, kernel_size=3, padding=1)\n", + " no_pool_out = conv_only(input_data)\n", + " no_pool_size = np.prod(no_pool_out.shape) * 4 # float32 bytes\n", + "\n", + " # With pooling: reduce spatial size\n", + " conv_with_pool = Conv2d(16, 32, kernel_size=3, padding=1)\n", + " pool = MaxPool2d(2, stride=2)\n", + " pool_out = pool(conv_with_pool(input_data))\n", + " pool_size = np.prod(pool_out.shape) * 4 # float32 bytes\n", + "\n", + " memory_reduction = no_pool_size / pool_size\n", + " assert memory_reduction == 4.0, f\"2×2 pooling should give 4× memory 
reduction, got {memory_reduction:.1f}×\"\n", + "\n", + " print(f\" Memory reduction with pooling: {memory_reduction:.1f}×\")\n", + " print(\"✅ Memory efficiency analysis complete!\")\n", + "\n", + " print(\"\\n\" + \"=\" * 50)\n", + " print(\"🎉 ALL TESTS PASSED! Module ready for export.\")\n", + " print(\"Run: tito module complete 09\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8f88371c", + "metadata": { + "lines_to_next_cell": 2, + "nbgrader": { + "grade": false, + "grade_id": "main-execution", + "solution": true + } + }, + "outputs": [], + "source": [ + "# Run comprehensive module test\n", + "if __name__ == \"__main__\":\n", + " test_module()" + ] + }, + { + "cell_type": "markdown", + "id": "2249a8da", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 🎯 MODULE SUMMARY: Spatial Operations\n", + "\n", + "Congratulations! You've built the spatial processing foundation that powers computer vision!\n", + "\n", + "### Key Accomplishments\n", + "- Built Conv2d with explicit loops showing O(N²M²K²) complexity ✅\n", + "- Implemented MaxPool2d and AvgPool2d for spatial dimension reduction ✅\n", + "- Created SimpleCNN demonstrating spatial operation integration ✅\n", + "- Analyzed computational complexity and memory trade-offs in spatial processing ✅\n", + "- All tests pass including complete CNN pipeline validation ✅\n", + "\n", + "### Systems Insights Discovered\n", + "- **Convolution Complexity**: Quadratic scaling with spatial size, kernel size significantly impacts cost\n", + "- **Memory Patterns**: Pooling provides 4× memory reduction while preserving important features\n", + "- **Architecture Design**: Strategic spatial reduction enables parameter-efficient feature extraction\n", + "- **Cache Performance**: Spatial locality in convolution benefits from optimal memory access patterns\n", + "\n", + "### Ready for Next Steps\n", + "Your spatial operations enable building complete CNNs for computer vision tasks!\n", + 
"Export with: `tito module complete 09`\n", + "\n", + "**Next**: Milestone 03 will combine your spatial operations with training pipeline to build a CNN for CIFAR-10!\n", + "\n", + "Your implementation shows why:\n", + "- Modern CNNs use small kernels (3×3) instead of large ones (computational efficiency)\n", + "- Pooling layers are crucial for managing memory in deep networks (4× reduction per layer)\n", + "- Explicit loops reveal the true computational cost hidden by optimized implementations\n", + "- Spatial operations unlock computer vision - from MLPs processing vectors to CNNs understanding images!" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/modules/09_spatial/spatial_dev.py b/modules/source/09_spatial/spatial_dev.py similarity index 99% rename from modules/09_spatial/spatial_dev.py rename to modules/source/09_spatial/spatial_dev.py index 34b21bba..6e2e3795 100644 --- a/modules/09_spatial/spatial_dev.py +++ b/modules/source/09_spatial/spatial_dev.py @@ -67,16 +67,15 @@ import sys import os import time -# Smart import system for development and production compatibility -if 'tinytorch' in sys.modules: - # Production: Import from installed package - from tinytorch.core.tensor import Tensor - from tinytorch.core.layers import Module -else: - # Development: Use simplified local implementations to avoid import loops +# Import dependencies from other modules +sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor')) +from tensor_dev import Tensor - # Simplified Tensor class for development - class Tensor: +sys.path.append(os.path.join(os.path.dirname(__file__), '..', '03_layers')) +from layers_dev import Module + +# Note: Keeping simplified implementations for reference during development +class _SimplifiedTensor: """Simplified tensor for spatial operations development.""" def __init__(self, data, 
requires_grad=False): diff --git a/modules/10_tokenization/tokenization_dev.py b/modules/source/10_tokenization/tokenization_dev.py similarity index 99% rename from modules/10_tokenization/tokenization_dev.py rename to modules/source/10_tokenization/tokenization_dev.py index 21ba597b..0c8c5bab 100644 --- a/modules/10_tokenization/tokenization_dev.py +++ b/modules/source/10_tokenization/tokenization_dev.py @@ -13,6 +13,7 @@ # --- #| default_exp text.tokenization +#| export # %% [markdown] """ diff --git a/modules/11_embeddings/embeddings_dev.py b/modules/source/11_embeddings/embeddings_dev.py similarity index 99% rename from modules/11_embeddings/embeddings_dev.py rename to modules/source/11_embeddings/embeddings_dev.py index e6f60090..48795f7a 100644 --- a/modules/11_embeddings/embeddings_dev.py +++ b/modules/source/11_embeddings/embeddings_dev.py @@ -65,6 +65,7 @@ Setting up our embedding toolkit with tensor operations and mathematical functio """ #| default_exp text.embeddings +#| export import numpy as np import math diff --git a/modules/12_attention/attention_dev.py b/modules/source/12_attention/attention_dev.py similarity index 99% rename from modules/12_attention/attention_dev.py rename to modules/source/12_attention/attention_dev.py index 3a6049e4..5b2488f2 100644 --- a/modules/12_attention/attention_dev.py +++ b/modules/source/12_attention/attention_dev.py @@ -13,6 +13,7 @@ # --- #| default_exp core.attention +#| export # %% [markdown] """ @@ -69,16 +70,15 @@ import sys import os from typing import Optional, Tuple, List -# Smart import system for development and production compatibility -if 'tinytorch' in sys.modules: - # Production: Import from installed package - from tinytorch.core.tensor import Tensor - from tinytorch.core.layers import Linear -else: - # Development: Use simplified local implementations to avoid import loops +# Import dependencies from other modules +sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor')) +from 
tensor_dev import Tensor - # Simplified Tensor class for development - class Tensor: +sys.path.append(os.path.join(os.path.dirname(__file__), '..', '03_layers')) +from layers_dev import Linear + +# Note: Keeping simplified implementations for reference during development +class _SimplifiedTensor: """Simplified tensor for attention operations development.""" def __init__(self, data, requires_grad=False): diff --git a/modules/13_transformers/transformers_dev.py b/modules/source/13_transformers/transformers_dev.py similarity index 99% rename from modules/13_transformers/transformers_dev.py rename to modules/source/13_transformers/transformers_dev.py index 7a14825f..f82f7e9f 100644 --- a/modules/13_transformers/transformers_dev.py +++ b/modules/source/13_transformers/transformers_dev.py @@ -62,6 +62,7 @@ from tinytorch.text.embeddings import Embedding, PositionalEncoding # Module 11 # %% nbgrader={"grade": false, "grade_id": "imports", "solution": true} #| default_exp models.transformer +#| export import numpy as np import math diff --git a/modules/14_kvcaching/kvcaching_dev.py b/modules/source/14_kvcaching/kvcaching_dev.py similarity index 99% rename from modules/14_kvcaching/kvcaching_dev.py rename to modules/source/14_kvcaching/kvcaching_dev.py index 4575eb9c..5fef4b46 100644 --- a/modules/14_kvcaching/kvcaching_dev.py +++ b/modules/source/14_kvcaching/kvcaching_dev.py @@ -61,6 +61,7 @@ from tinytorch.models.transformer import GPT # Dependencies (Module 13) # %% nbgrader={"grade": false, "grade_id": "imports", "solution": true} #| default_exp generation.kv_cache +#| export import numpy as np import time diff --git a/modules/15_profiling/profiling_dev.py b/modules/source/15_profiling/profiling_dev.py similarity index 99% rename from modules/15_profiling/profiling_dev.py rename to modules/source/15_profiling/profiling_dev.py index 14649c13..0eeba514 100644 --- a/modules/15_profiling/profiling_dev.py +++ b/modules/source/15_profiling/profiling_dev.py @@ -59,6 +59,7 @@ 
from tinytorch.models.transformer import GPT # Example models to profile # %% nbgrader={"grade": false, "grade_id": "imports", "solution": true} #| default_exp profiling.profiler +#| export import time import numpy as np diff --git a/modules/16_acceleration/acceleration_dev.py b/modules/source/16_acceleration/acceleration_dev.py similarity index 99% rename from modules/16_acceleration/acceleration_dev.py rename to modules/source/16_acceleration/acceleration_dev.py index 9a4eb44d..a6e0ca18 100644 --- a/modules/16_acceleration/acceleration_dev.py +++ b/modules/source/16_acceleration/acceleration_dev.py @@ -13,6 +13,7 @@ # --- #| default_exp optimization.acceleration +#| export # %% [markdown] """ diff --git a/modules/17_quantization/quantization_dev.py b/modules/source/17_quantization/quantization_dev.py similarity index 99% rename from modules/17_quantization/quantization_dev.py rename to modules/source/17_quantization/quantization_dev.py index 087d4a10..c35d3e74 100644 --- a/modules/17_quantization/quantization_dev.py +++ b/modules/source/17_quantization/quantization_dev.py @@ -13,6 +13,7 @@ # --- #| default_exp optimization.quantization +#| export # %% [markdown] """ @@ -74,13 +75,18 @@ import warnings import sys import os -if 'tinytorch' in sys.modules: - # Production: Import from installed package - from tinytorch.core.tensor import Tensor - from tinytorch.core.layers import Linear, Sequential - from tinytorch.core.activations import ReLU - from tinytorch.profiling.profiler import Profiler -else: +# Import dependencies from other modules +sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor')) +from tensor_dev import Tensor + +sys.path.append(os.path.join(os.path.dirname(__file__), '..', '03_layers')) +from layers_dev import Linear, Sequential + +sys.path.append(os.path.join(os.path.dirname(__file__), '..', '02_activations')) +from activations_dev import ReLU + +# Note: Keeping development fallback for reference +if False: # Disabled 
development fallback # Development: Import from local module files try: # Try to find the current directory diff --git a/modules/18_compression/compression_dev.py b/modules/source/18_compression/compression_dev.py similarity index 99% rename from modules/18_compression/compression_dev.py rename to modules/source/18_compression/compression_dev.py index a3905dc5..d46a2234 100644 --- a/modules/18_compression/compression_dev.py +++ b/modules/source/18_compression/compression_dev.py @@ -61,6 +61,7 @@ from tinytorch.optimization.quantization import quantize_model # Module 17 - pr # %% nbgrader={"grade": false, "grade_id": "imports", "solution": true} #| default_exp optimization.compression +#| export import numpy as np import copy diff --git a/modules/19_benchmarking/benchmarking_dev.py b/modules/source/19_benchmarking/benchmarking_dev.py similarity index 99% rename from modules/19_benchmarking/benchmarking_dev.py rename to modules/source/19_benchmarking/benchmarking_dev.py index 24b9f5fc..5f01a9ae 100644 --- a/modules/19_benchmarking/benchmarking_dev.py +++ b/modules/source/19_benchmarking/benchmarking_dev.py @@ -13,6 +13,7 @@ # --- #| default_exp benchmarking.benchmark +#| export # %% [markdown] """ diff --git a/modules/20_capstone/capstone_dev.py b/modules/source/20_capstone/capstone_dev.py similarity index 99% rename from modules/20_capstone/capstone_dev.py rename to modules/source/20_capstone/capstone_dev.py index f729364c..0373ac95 100644 --- a/modules/20_capstone/capstone_dev.py +++ b/modules/source/20_capstone/capstone_dev.py @@ -66,6 +66,7 @@ from tinytorch.benchmarking.benchmark import Benchmark # Module 19 # %% nbgrader={"grade": false, "grade_id": "exports", "solution": true} #| default_exp applications.tinygpt +#| export # %% [markdown] """ diff --git a/modules/DEFINITIVE_MODULE_PLAN.md b/modules/source/DEFINITIVE_MODULE_PLAN.md similarity index 100% rename from modules/DEFINITIVE_MODULE_PLAN.md rename to modules/source/DEFINITIVE_MODULE_PLAN.md diff --git 
a/modules/archive/MILESTONE_IMPLEMENTATION_PLAN.md b/modules/source/archive/MILESTONE_IMPLEMENTATION_PLAN.md similarity index 100% rename from modules/archive/MILESTONE_IMPLEMENTATION_PLAN.md rename to modules/source/archive/MILESTONE_IMPLEMENTATION_PLAN.md diff --git a/modules/archive/MODULE_PLAN_CRITICAL_FIX.md b/modules/source/archive/MODULE_PLAN_CRITICAL_FIX.md similarity index 100% rename from modules/archive/MODULE_PLAN_CRITICAL_FIX.md rename to modules/source/archive/MODULE_PLAN_CRITICAL_FIX.md diff --git a/modules/archive/MODULE_PLAN_ENHANCED.md b/modules/source/archive/MODULE_PLAN_ENHANCED.md similarity index 100% rename from modules/archive/MODULE_PLAN_ENHANCED.md rename to modules/source/archive/MODULE_PLAN_ENHANCED.md diff --git a/modules/archive/MODULE_PLAN_FINAL_SOLUTION.md b/modules/source/archive/MODULE_PLAN_FINAL_SOLUTION.md similarity index 100% rename from modules/archive/MODULE_PLAN_FINAL_SOLUTION.md rename to modules/source/archive/MODULE_PLAN_FINAL_SOLUTION.md diff --git a/modules/archive/MODULE_PLAN_SIMPLEST_SOLUTION.md b/modules/source/archive/MODULE_PLAN_SIMPLEST_SOLUTION.md similarity index 100% rename from modules/archive/MODULE_PLAN_SIMPLEST_SOLUTION.md rename to modules/source/archive/MODULE_PLAN_SIMPLEST_SOLUTION.md diff --git a/modules_old/01_tensor/README.md b/modules_old/01_tensor/README.md deleted file mode 100644 index 63e6f874..00000000 --- a/modules_old/01_tensor/README.md +++ /dev/null @@ -1,144 +0,0 @@ -# 🔥 Module: Tensor - -## 📊 Module Info -- **Difficulty**: ⭐⭐ Intermediate -- **Time Estimate**: 4-6 hours -- **Prerequisites**: Setup module -- **Next Steps**: Activations, Layers - -Build the foundation of TinyTorch! This module implements the core Tensor class - the fundamental data structure that powers all neural networks and machine learning operations. 
- -## 🎯 Learning Objectives - -By the end of this module, you will: -- **Understand what tensors are** and why they're essential for ML -- **Implement a complete Tensor class** with core operations -- **Handle tensor shapes, data types, and memory management** efficiently -- **Implement element-wise operations and reductions** with proper broadcasting -- **Have a solid foundation** for building neural networks - -## 🧠 Build → Use → Understand - -1. **Build**: Complete Tensor class with arithmetic operations, shape management, and reductions -2. **Use**: Create tensors, perform operations, and validate with real data -3. **Understand**: How tensors serve as the foundation for all neural network computations - -## 📚 What You'll Build - -### Core Tensor Class -```python -# Creating tensors -x = Tensor([[1.0, 2.0], [3.0, 4.0]]) -y = Tensor([[0.5, 1.5], [2.5, 3.5]]) - -# Properties -print(x.shape) # (2, 2) -print(x.size) # 4 -print(x.dtype) # float64 - -# Element-wise operations -z = x + y # Addition -w = x * y # Multiplication -p = x ** 2 # Exponentiation - -# Shape manipulation -reshaped = x.reshape(4, 1) # (4, 1) -transposed = x.T # (2, 2) transposed - -# Reductions -total = x.sum() # Scalar sum -means = x.mean(axis=0) # Mean along axis -``` - -### Essential Operations -- **Arithmetic**: Addition, subtraction, multiplication, division, powers -- **Shape management**: Reshape, transpose, broadcasting rules -- **Reductions**: Sum, mean, min, max along any axis -- **Memory handling**: Efficient data storage and copying - -## 🚀 Getting Started - -### Prerequisites Check -```bash -tito checkpoint test 00 # Environment setup should pass ✅ -``` - -### Development Workflow -```bash -# Navigate to tensor module -cd modules/01_tensor - -# Open development file -jupyter lab tensor_dev.py -# OR edit directly: code tensor_dev.py -``` - -### Step-by-Step Implementation -1. **Basic Tensor class** - Constructor and properties -2. 
**Shape management** - Understanding tensor dimensions -3. **Arithmetic operations** - Addition, multiplication, etc. -4. **Utility methods** - Reshape, transpose, sum, mean -5. **Error handling** - Robust edge case management - -## 🧪 Testing Your Implementation - -### Inline Testing -```python -# Test in the notebook or Python REPL -x = Tensor([[1.0, 2.0], [3.0, 4.0]]) -print(f"Shape: {x.shape}") # Should be (2, 2) -print(f"Sum: {x.sum()}") # Should be 10.0 -``` - -### Module Tests -```bash -# Complete and export your tensor implementation -tito module complete 01_tensor - -# Test specific checkpoint -tito checkpoint test 01 # Foundation checkpoint -``` - -### Manual Verification -```python -# Create and test tensors -from tinytorch.core.tensor import Tensor - -x = Tensor([1, 2, 3, 4, 5]) -y = Tensor([2, 4, 6, 8, 10]) - -# Test operations -assert (x + y).data.tolist() == [3, 6, 9, 12, 15] -assert (x * 2).data.tolist() == [2, 4, 6, 8, 10] -print("✅ Basic operations working!") -``` - -## 🎯 Key Concepts - -### **Tensors as Universal Data Structures** -- **Scalars**: 0-dimensional tensors (single numbers) -- **Vectors**: 1-dimensional tensors (arrays) -- **Matrices**: 2-dimensional tensors (common in ML) -- **Higher dimensions**: Images (3D), video (4D), etc. 
- -### **Why Tensors Matter in ML** -- **Neural networks**: All computations operate on tensors -- **GPU acceleration**: GPUs operate on tensor primitives -- **Broadcasting**: Efficient operations across different shapes -- **Vectorization**: Process entire datasets simultaneously - -### **Real-World Connections** -- **PyTorch/TensorFlow**: Your implementation mirrors production frameworks -- **NumPy**: Foundation for scientific computing (we build similar abstractions) -- **Production systems**: Understanding tensors is essential for ML engineering - -### **Memory and Performance** -- **Data layout**: How tensors store data efficiently -- **Broadcasting**: Smart operations without data copying -- **View vs Copy**: Understanding memory management - -## 🎉 Ready to Build? - -The tensor module is where TinyTorch really begins. You're about to create the fundamental building block that will power neural networks, training loops, and production ML systems. - -Take your time, test thoroughly, and enjoy building something that really works!
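The "View vs Copy" point in the README above is easy to verify with plain NumPy, which this project wraps. A minimal sketch (the specific values are illustrative; `np.shares_memory` and `ndarray.flags` are standard NumPy APIs):

```python
import numpy as np

# reshape on contiguous data returns a view: no data is copied
x = np.arange(6, dtype=np.float32)
view = x.reshape(2, 3)
assert np.shares_memory(x, view)

# writing through the view mutates the original buffer
view[0, 0] = 99.0
assert x[0] == 99.0

# arithmetic allocates a fresh array (a copy)
y = x + 1.0
assert not np.shares_memory(x, y)

# transpose is also a view, but a non-contiguous one
assert not view.T.flags['C_CONTIGUOUS']
```

This is why shape manipulation (reshape, transpose) is essentially free, while element-wise arithmetic pays for a new allocation.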
🔥 \ No newline at end of file diff --git a/modules_old/01_tensor/module.yaml b/modules_old/01_tensor/module.yaml deleted file mode 100644 index a9ac2fdd..00000000 --- a/modules_old/01_tensor/module.yaml +++ /dev/null @@ -1,21 +0,0 @@ -components: -- Tensor -- tensor_creation -- tensor_operations -- tensor_arithmetic -dependencies: - enables: - - activations - - layers - - autograd - prerequisites: [] -description: Core tensor data structure and operations -difficulty: "\u2B50\u2B50" -exports_to: tinytorch.core.tensor -files: - dev_file: tensor_dev.py - readme: README.md - tests: inline -name: tensor -time_estimate: 4-6 hours -title: Tensor diff --git a/modules_old/01_tensor/tensor_dev.ipynb b/modules_old/01_tensor/tensor_dev.ipynb deleted file mode 100644 index 04616689..00000000 --- a/modules_old/01_tensor/tensor_dev.ipynb +++ /dev/null @@ -1,1711 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "c8575dba", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "# Tensor - The Foundation of Machine Learning\n", - "\n", - "Welcome to Tensor! You'll build the fundamental data structure that powers every neural network.\n", - "\n", - "## 🔗 Building on Previous Learning\n", - "**What You Built Before**:\n", - "- Module 01 (Setup): Python environment with NumPy, the foundation for numerical computing\n", - "\n", - "**What's Working**: You have a complete development environment with all the tools needed for machine learning!\n", - "\n", - "**The Gap**: You can import NumPy, but you need to understand how to build the core data structure that makes ML possible.\n", - "\n", - "**This Module's Solution**: Build a complete Tensor class that wraps NumPy arrays with ML-specific operations and memory management.\n", - "\n", - "**Connection Map**:\n", - "```\n", - "Setup → Tensor → Activations\n", - "(tools) (data) (nonlinearity)\n", - "```\n", - "\n", - "## Learning Objectives\n", - "\n", - "By completing this module, you will:\n", - "\n", - "1. 
**Implement tensor operations** - Build a complete N-dimensional array system with arithmetic, broadcasting, and matrix multiplication\n", - "2. **Master memory efficiency** - Understand why memory layout affects performance more than algorithm choice\n", - "3. **Create ML-ready APIs** - Design clean interfaces that mirror PyTorch and TensorFlow patterns\n", - "4. **Enable neural networks** - Build the foundation that supports weights, biases, and data in all ML models\n", - "\n", - "## Build → Test → Use\n", - "\n", - "1. **Build**: Implement Tensor class with creation, arithmetic, and advanced operations\n", - "2. **Test**: Validate each component immediately to ensure correctness and performance\n", - "3. **Use**: Apply tensors to real multi-dimensional data operations that neural networks require" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "68dcb6b0", - "metadata": {}, - "outputs": [], - "source": [ - "\n", - "#| default_exp core.tensor\n", - "\n", - "#| export\n", - "import numpy as np\n", - "import sys\n", - "from typing import Union, Tuple, Optional, Any\n", - "import warnings" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "74cad3a4", - "metadata": {}, - "outputs": [], - "source": [ - "\n", - "print(\"🔥 TinyTorch Tensor Module\")\n", - "print(f\"NumPy version: {np.__version__}\")\n", - "print(f\"Python version: {sys.version_info.major}.{sys.version_info.minor}\")\n", - "print(\"Ready to build tensors!\")" - ] - }, - { - "cell_type": "markdown", - "id": "285c53b1", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## Understanding Tensors: Visual Guide\n", - "\n", - "### What Are Tensors? A Visual Journey\n", - "\n", - "**The Story**: Think of tensors as smart containers that know their shape and can efficiently store numbers for machine learning. 
They're like upgraded versions of regular Python lists that understand mathematics.\n", - "\n", - "```\n", - "Scalar (0D Tensor): Vector (1D Tensor): Matrix (2D Tensor):\n", - " [5] [1, 2, 3] ┌ 1 2 3 ┐\n", - " │ 4 5 6 │\n", - " └ 7 8 9 ┘\n", - "\n", - "3D Tensor (RGB Image): 4D Tensor (Batch of Images):\n", - "┌─────────────┐ ┌─────────────┐ ┌─────────────┐\n", - "│ Red Channel │ │ Image 1 │ │ Image 2 │\n", - "│ │ │ │ │ │\n", - "└─────────────┘ └─────────────┘ └─────────────┘\n", - "┌─────────────┐ ...\n", - "│Green Channel│\n", - "│ │\n", - "└─────────────┘\n", - "┌─────────────┐\n", - "│Blue Channel │\n", - "│ │\n", - "└─────────────┘\n", - "```\n", - "\n", - "**What's happening step-by-step**: As we add dimensions, tensors represent more complex data. A single number becomes a list, a list becomes a grid, a grid becomes a volume (like an image with red/green/blue channels), and a volume becomes a collection (like a batch of images for training). Each dimension adds a new way to organize and access the data." - ] - }, - { - "cell_type": "markdown", - "id": "840238d6", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### Memory Layout: Why Performance Matters\n", - "\n", - "**The Story**: Imagine your computer's memory as a long street with numbered houses. When your CPU needs data, it doesn't just grab one house - it loads an entire city block (64 bytes) into its cache.\n", - "\n", - "```\n", - "Contiguous Memory (FAST):\n", - "[1][2][3][4][5][6] ──> Cache-friendly, vectorized operations\n", - " ↑ ↑ ↑ ↑ ↑ ↑\n", - " Sequential access pattern\n", - "\n", - "Non-contiguous Memory (SLOW):\n", - "[1]...[2].....[3] ──> Cache misses, scattered access\n", - " ↑ ↑ ↑\n", - " Random access pattern\n", - "```\n", - "\n", - "**What's happening step-by-step**: When you access element [1], the CPU automatically loads elements [1] through [6] in one cache load. Every subsequent access ([2], [3], [4]...) is already in the cache - no extra memory trips needed! 
With non-contiguous data, each access requires a new, expensive trip to main memory.\n", - "\n", - "**The Performance Impact**: This creates 10-100x speedups because you get 6 elements for the price of fetching 1. It's like getting 6 books from the library for the effort of finding just 1." - ] - }, - { - "cell_type": "markdown", - "id": "86cb7d01", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### Tensor Operations: Broadcasting Magic\n", - "\n", - "**The Story**: Broadcasting is like having a smart photocopier that automatically copies data to match different shapes without actually using extra memory. It's NumPy's way of making operations \"just work\" between tensors of different sizes.\n", - "\n", - "```\n", - "Broadcasting Example:\n", - " Matrix (2×3) + Scalar = Result (2×3)\n", - " ┌ 1 2 3 ┐ [10] ┌ 11 12 13 ┐\n", - " └ 4 5 6 ┘ └ 14 15 16 ┘\n", - "\n", - "Broadcasting Rules:\n", - "1. Align shapes from right to left\n", - "2. Dimensions of size 1 stretch to match\n", - "3. Missing dimensions assume size 1\n", - "\n", - "Vector + Matrix Broadcasting:\n", - " [1, 2, 3] + [[10], = [[11, 12, 13],\n", - " (1×3) [20]] [21, 22, 23]]\n", - " (2×1) (2×3)\n", - "```\n", - "\n", - "**What's happening step-by-step**: NumPy aligns shapes from right to left, like comparing numbers by their ones place first. When shapes don't match, dimensions of size 1 automatically \"stretch\" to match the larger dimension - but no data is actually copied. The operation happens as if the data were copied, but uses the original memory locations.\n", - "\n", - "**Why this matters for ML**: Adding a bias vector to a 1000×1000 matrix would normally require copying the vector 1000 times, but broadcasting does it with zero copies and massive memory savings."
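The broadcasting rules and worked examples above can be checked with a short NumPy sketch (the arrays mirror the diagrams in the cell):

```python
import numpy as np

# Matrix (2x3) + scalar: the scalar stretches to every element
m = np.array([[1, 2, 3],
              [4, 5, 6]])
assert (m + 10).tolist() == [[11, 12, 13], [14, 15, 16]]

# Vector (3,) + column (2x1): shapes align right-to-left,
# each size-1 dimension stretches, giving a (2, 3) result
v = np.array([1, 2, 3])
col = np.array([[10], [20]])
assert (v + col).tolist() == [[11, 12, 13], [21, 22, 23]]

# NumPy can report the broadcast result shape directly
assert np.broadcast_shapes(v.shape, col.shape) == (2, 3)
```

Note that `(3,)` behaves as `(1, 3)` once aligned: missing leading dimensions assume size 1, exactly as rule 3 states.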
- ] - }, - { - "cell_type": "markdown", - "id": "37bb2239", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### Neural Network Data Flow\n", - "\n", - "```\n", - "Batch Processing in Neural Networks:\n", - "\n", - "Input Batch (32 images, 28×28 pixels):\n", - "┌─────────────────────────────────┐\n", - "│ [Batch=32, Height=28, Width=28] │\n", - "└─────────────────────────────────┘\n", - " ↓ Flatten\n", - "┌─────────────────────────────────┐\n", - "│ [Batch=32, Features=784] │ ← Matrix multiplication ready\n", - "└─────────────────────────────────┘\n", - " ↓ Linear Layer\n", - "┌─────────────────────────────────┐\n", - "│ [Batch=32, Hidden=128] │ ← Hidden layer activations\n", - "└─────────────────────────────────┘\n", - "\n", - "Why batching matters:\n", - "- Single image: 784 × 128 = 100,352 operations\n", - "- Batch of 32: Same 100,352 ops, but 32× the data\n", - "- GPU utilization: 32× better parallelization\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "2e97ea75", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## The Mathematical Foundation\n", - "\n", - "Before we implement, let's understand the mathematical concepts:" - ] - }, - { - "cell_type": "markdown", - "id": "5a2597fa", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### Scalars to Tensors: Building Complexity\n", - "\n", - "**Scalar (Rank 0)**:\n", - "- A single number: `5.0` or `temperature`\n", - "- Shape: `()` (empty tuple)\n", - "- ML examples: loss values, learning rates\n", - "\n", - "**Vector (Rank 1)**:\n", - "- Ordered list of numbers: `[1, 2, 3]`\n", - "- Shape: `(3,)` (one dimension)\n", - "- ML examples: word embeddings, gradients\n", - "\n", - "**Matrix (Rank 2)**:\n", - "- 2D array: `[[1, 2], [3, 4]]`\n", - "- Shape: `(2, 2)` (rows, columns)\n", - "- ML examples: weight matrices, images\n", - "\n", - "**Higher-Order Tensors**:\n", - "- 3D: RGB images `(height, width, channels)`\n", - "- 4D: Image batches `(batch, height, 
width, channels)`\n", - "- 5D: Video batches `(batch, time, height, width, channels)`" - ] - }, - { - "cell_type": "markdown", - "id": "51dbe323", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### Why Not Just Use NumPy?\n", - "\n", - "While NumPy is excellent, our Tensor class adds ML-specific features:\n", - "\n", - "**Future Extensions** (coming in later modules):\n", - "- **Automatic gradients**: Track operations for backpropagation\n", - "- **GPU acceleration**: Move computations to graphics cards\n", - "- **Lazy evaluation**: Build computation graphs for optimization\n", - "\n", - "**Educational Value**:\n", - "- **Understanding**: See how PyTorch/TensorFlow work internally\n", - "- **Debugging**: Trace operations step by step\n", - "- **Customization**: Add domain-specific operations" - ] - }, - { - "cell_type": "markdown", - "id": "076ad694", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Implementation Overview\n", - "\n", - "Our Tensor class design:\n", - "\n", - "```python\n", - "class Tensor:\n", - " def __init__(self, data) # Create from any data type\n", - "\n", - " # Properties\n", - " .shape # Dimensions tuple\n", - " .size # Total element count\n", - " .dtype # Data type\n", - " .data # Access underlying NumPy array\n", - "\n", - " # Arithmetic Operations\n", - " def __add__(self, other) # tensor + tensor\n", - " def __mul__(self, other) # tensor * tensor\n", - " def __sub__(self, other) # tensor - tensor\n", - " def __truediv__(self, other) # tensor / tensor\n", - "\n", - " # Advanced Operations\n", - " def matmul(self, other) # Matrix multiplication\n", - " def sum(self, axis=None) # Sum along axes\n", - " def reshape(self, *shape) # Change shape\n", - "```" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fc9cadb3", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "tensor-init", - "solution": true - } - }, - 
"outputs": [], - "source": [ - "\n", - "#| export\n", - "class Tensor:\n", - " \"\"\"\n", - " TinyTorch Tensor: N-dimensional array with ML operations.\n", - "\n", - " The fundamental data structure for all TinyTorch operations.\n", - " Wraps NumPy arrays with ML-specific functionality.\n", - " \"\"\"\n", - "\n", - " def __init__(self, data: Any, dtype: Optional[str] = None, requires_grad: bool = False):\n", - " \"\"\"\n", - " Create a new tensor from data.\n", - "\n", - " Args:\n", - " data: Input data (scalar, list, or numpy array)\n", - " dtype: Data type ('float32', 'int32', etc.). Defaults to auto-detect.\n", - " requires_grad: Whether this tensor needs gradients for training. Defaults to False.\n", - "\n", - " TODO: Implement tensor creation with simple, clear type handling.\n", - "\n", - " APPROACH (Clear implementation for learning):\n", - " 1. Convert input data to numpy array - NumPy handles conversions\n", - " 2. Apply dtype if specified - common string types like 'float32'\n", - " 3. Set default float32 for float64 arrays - ML convention for efficiency\n", - " 4. Store the result in self._data - internal storage for numpy array\n", - " 5. 
Initialize gradient tracking - prepares for automatic differentiation\n", - "\n", - " EXAMPLE:\n", - " >>> Tensor(5)\n", - " # Creates: np.array(5, dtype='int32')\n", - " >>> Tensor([1.0, 2.0, 3.0])\n", - " # Creates: np.array([1.0, 2.0, 3.0], dtype='float32')\n", - " >>> Tensor([1, 2, 3], dtype='float32')\n", - " # Creates: np.array([1, 2, 3], dtype='float32')\n", - "\n", - " PRODUCTION CONTEXT:\n", - " PyTorch tensors handle 47+ dtype formats with complex validation.\n", - " Our version teaches the core concept that transfers directly.\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Convert input to numpy array - let NumPy handle most conversions\n", - " if isinstance(data, Tensor):\n", - " # Input is another Tensor - copy data efficiently\n", - " self._data = data.data.copy()\n", - " else:\n", - " # Convert to numpy array\n", - " self._data = np.array(data)\n", - "\n", - " # Apply dtype if specified\n", - " if dtype is not None:\n", - " self._data = self._data.astype(dtype)\n", - " elif self._data.dtype == np.float64:\n", - " # ML convention: prefer float32 for memory and GPU efficiency\n", - " self._data = self._data.astype(np.float32)\n", - "\n", - " # Initialize gradient tracking attributes (used in Module 9 - Autograd)\n", - " self.requires_grad = requires_grad\n", - " self.grad = None\n", - " self._grad_fn = None\n", - " ### END SOLUTION\n", - "\n", - " @property\n", - " def data(self) -> np.ndarray:\n", - " \"\"\"\n", - " Access underlying numpy array.\n", - "\n", - " TODO: Return the stored numpy array.\n", - "\n", - " APPROACH (Medium comments for property methods):\n", - " 1. Access the internal _data attribute\n", - " 2. Return the numpy array directly - enables NumPy integration\n", - " 3. 
This provides access to underlying data for visualization/analysis\n", - "\n", - " PRODUCTION CONNECTION:\n", - " - PyTorch: tensor.numpy() converts to NumPy for scientific computing\n", - " - TensorFlow: tensor.numpy() enables integration with matplotlib/scipy\n", - " - Production use: Data scientists need raw arrays for debugging/visualization\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " return self._data\n", - " ### END SOLUTION\n", - " \n", - " @data.setter\n", - " def data(self, value: Union[np.ndarray, 'Tensor']) -> None:\n", - " \"\"\"Set the underlying data of the tensor.\"\"\"\n", - " if isinstance(value, Tensor):\n", - " self._data = value._data.copy()\n", - " else:\n", - " self._data = np.array(value)\n", - "\n", - " @property\n", - " def shape(self) -> Tuple[int, ...]:\n", - " \"\"\"\n", - " Get tensor shape.\n", - "\n", - " TODO: Return the shape of the stored numpy array.\n", - "\n", - " APPROACH:\n", - " 1. Access the _data attribute (the NumPy array)\n", - " 2. Get the shape property from the NumPy array\n", - " 3. Return the shape tuple directly\n", - "\n", - " PRODUCTION CONNECTION:\n", - " - Neural networks: Layer compatibility requires matching shapes\n", - " - Computer vision: Image shape (height, width, channels) determines architecture\n", - " - Debugging: Shape mismatches are the #1 cause of ML errors\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " return self._data.shape\n", - " ### END SOLUTION\n", - "\n", - " @property\n", - " def size(self) -> int:\n", - " \"\"\"\n", - " Get total number of elements.\n", - "\n", - " TODO: Return the total number of elements in the tensor.\n", - "\n", - " APPROACH:\n", - " 1. Access the _data attribute (the NumPy array)\n", - " 2. Get the size property from the NumPy array\n", - " 3. 
Return the total element count as an integer\n", - "\n", - " PRODUCTION CONNECTION:\n", - " - Memory planning: Calculate RAM requirements for large tensors\n", - " - Model architecture: Determine parameter counts for layers\n", - " - Performance: Size affects computation time and vectorization efficiency\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " return self._data.size\n", - " ### END SOLUTION\n", - "\n", - " @property\n", - " def dtype(self) -> np.dtype:\n", - " \"\"\"\n", - " Get data type as numpy dtype.\n", - "\n", - " TODO: Return the data type of the stored numpy array.\n", - "\n", - " APPROACH:\n", - " 1. Access the _data attribute\n", - " 2. Get the dtype property\n", - " 3. Return the NumPy dtype object\n", - "\n", - " PRODUCTION CONNECTION:\n", - " - Precision vs speed: float32 is faster, float64 more accurate\n", - " - Memory optimization: int8 uses 1/4 memory of int32\n", - " - GPU compatibility: Some operations only work with specific types\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " return self._data.dtype\n", - " ### END SOLUTION\n", - "\n", - " @property\n", - " def strides(self) -> Tuple[int, ...]:\n", - " \"\"\"\n", - " Get memory stride pattern of the tensor.\n", - " \n", - " Returns:\n", - " Tuple of byte strides for each dimension\n", - " \n", - " PRODUCTION CONNECTION:\n", - " - Memory layout analysis: Understanding cache efficiency\n", - " - Performance debugging: Non-unit strides can indicate copies\n", - " - Advanced operations: Enables efficient transpose and reshape operations\n", - " \"\"\"\n", - " return self._data.strides\n", - " \n", - " @property\n", - " def is_contiguous(self) -> bool:\n", - " \"\"\"\n", - " Check if tensor data is stored in contiguous memory.\n", - " \n", - " Returns:\n", - " True if data is contiguous in C-order (row-major)\n", - " \n", - " PRODUCTION CONNECTION:\n", - " - Performance critical: Contiguous data enables vectorization\n", - " - Memory efficiency: Contiguous operations can be 10-100x 
faster\n", - " - GPU transfers: Contiguous data transfers more efficiently\n", - " \"\"\"\n", - " return self._data.flags['C_CONTIGUOUS']\n", - "\n", - " def __repr__(self) -> str:\n", - " \"\"\"\n", - " String representation with size limits for readability.\n", - "\n", - " TODO: Create a clear string representation of the tensor.\n", - "\n", - " APPROACH (Light comments for utility methods):\n", - " 1. Check tensor size - if large, show shape/dtype only\n", - " 2. For small tensors, convert numpy array to list using .tolist()\n", - " 3. Format appropriately and return string\n", - "\n", - " EXAMPLE:\n", - " Tensor([1, 2, 3]) → \"Tensor([1, 2, 3], shape=(3,), dtype=int32)\"\n", - " Large tensor → \"Tensor(shape=(1000, 1000), dtype=float32)\"\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " if self.size > 20:\n", - " # Large tensors: show shape and dtype only for readability\n", - " return f\"Tensor(shape={self.shape}, dtype={self.dtype})\"\n", - " else:\n", - " # Small tensors: show data, shape, and dtype\n", - " return f\"Tensor({self._data.tolist()}, shape={self.shape}, dtype={self.dtype})\"\n", - " ### END SOLUTION\n", - "\n", - " def item(self) -> Union[int, float]:\n", - " \"\"\"Extract a scalar value from a single-element tensor.\"\"\"\n", - " if self._data.size != 1:\n", - " raise ValueError(f\"item() can only be called on tensors with exactly one element, got {self._data.size} elements\")\n", - " return self._data.item()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "91b993b2", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "tensor-arithmetic", - "solution": true - } - }, - "outputs": [], - "source": [ - " def add(self, other: 'Tensor') -> 'Tensor':\n", - " \"\"\"\n", - " Add two tensors element-wise.\n", - "\n", - " TODO: Implement tensor addition.\n", - "\n", - " APPROACH:\n", - " 1. Extract numpy arrays from both tensors\n", - " 2. Use NumPy's + operator for element-wise addition\n", - " 3. 
Create new Tensor object with result\n", - " 4. Return the new tensor\n", - "\n", - " PRODUCTION CONNECTION:\n", - " - Neural networks: Adding bias terms to linear layer outputs\n", - " - Residual connections: skip connections in ResNet architectures\n", - " - Gradient updates: Adding computed gradients to parameters\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " result_data = self._data + other._data\n", - " result = Tensor(result_data)\n", - " \n", - " # TODO: Gradient tracking will be added in Module 9 (Autograd)\n", - " # This enables automatic differentiation for neural network training\n", - " # For now, we focus on the core tensor operation\n", - " \n", - " return result\n", - " ### END SOLUTION\n", - "\n", - " def multiply(self, other: 'Tensor') -> 'Tensor':\n", - " \"\"\"\n", - " Multiply two tensors element-wise.\n", - "\n", - " TODO: Implement tensor multiplication.\n", - "\n", - " APPROACH:\n", - " 1. Extract numpy arrays from both tensors\n", - " 2. Use NumPy's * operator for element-wise multiplication\n", - " 3. Create new Tensor object with result\n", - " 4. 
Return the new tensor\n", - "\n", - " PRODUCTION CONNECTION:\n", - " - Activation functions: Element-wise operations like ReLU masking\n", - " - Attention mechanisms: Element-wise scaling in transformer models\n", - " - Feature scaling: Multiplying features by learned scaling factors\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " result_data = self._data * other._data\n", - " result = Tensor(result_data)\n", - " \n", - " # TODO: Gradient tracking will be added in Module 9 (Autograd)\n", - " # This enables automatic differentiation for neural network training\n", - " # For now, we focus on the core tensor operation\n", - " \n", - " return result\n", - " ### END SOLUTION\n", - "\n", - " def __add__(self, other: Union['Tensor', int, float]) -> 'Tensor':\n", - " \"\"\"\n", - " Addition operator: tensor + other\n", - "\n", - " TODO: Implement + operator for tensors.\n", - "\n", - " APPROACH:\n", - " 1. Check if other is a Tensor object\n", - " 2. If Tensor, call the add() method directly\n", - " 3. If scalar, convert to Tensor then call add()\n", - " 4. Return the result from add() method\n", - "\n", - " PRODUCTION CONNECTION:\n", - " - Natural syntax: tensor + scalar enables intuitive code\n", - " - Broadcasting: Adding scalars to tensors is common in ML\n", - " - API design: Clean interfaces reduce cognitive load for researchers\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " if isinstance(other, Tensor):\n", - " return self.add(other)\n", - " else:\n", - " return self.add(Tensor(other))\n", - " ### END SOLUTION\n", - "\n", - " def __mul__(self, other: Union['Tensor', int, float]) -> 'Tensor':\n", - " \"\"\"\n", - " Multiplication operator: tensor * other\n", - "\n", - " TODO: Implement * operator for tensors.\n", - "\n", - " APPROACH:\n", - " 1. Check if other is a Tensor object\n", - " 2. If Tensor, call the multiply() method directly\n", - " 3. If scalar, convert to Tensor then call multiply()\n", - " 4. 
Return the result from multiply() method\n", - "\n", - " PRODUCTION CONNECTION:\n", - " - Scaling features: tensor * learning_rate for gradient updates\n", - " - Masking: tensor * mask for attention mechanisms\n", - " - Regularization: tensor * dropout_mask during training\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " if isinstance(other, Tensor):\n", - " return self.multiply(other)\n", - " else:\n", - " return self.multiply(Tensor(other))\n", - " ### END SOLUTION\n", - "\n", - " def __sub__(self, other: Union['Tensor', int, float]) -> 'Tensor':\n", - " \"\"\"\n", - " Subtraction operator: tensor - other\n", - "\n", - " TODO: Implement - operator for tensors.\n", - "\n", - " APPROACH:\n", - " 1. Check if other is a Tensor object\n", - " 2. If Tensor, subtract other._data from self._data\n", - " 3. If scalar, subtract scalar directly from self._data\n", - " 4. Create new Tensor with result and return\n", - "\n", - " PRODUCTION CONNECTION:\n", - " - Gradient computation: parameter - learning_rate * gradient\n", - " - Error calculation: predicted - actual for loss computation\n", - " - Centering data: tensor - mean for zero-centered inputs\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " if isinstance(other, Tensor):\n", - " result = self._data - other._data\n", - " else:\n", - " result = self._data - other\n", - " return Tensor(result)\n", - " ### END SOLUTION\n", - "\n", - " def __truediv__(self, other: Union['Tensor', int, float]) -> 'Tensor':\n", - " \"\"\"\n", - " Division operator: tensor / other\n", - "\n", - " TODO: Implement / operator for tensors.\n", - "\n", - " APPROACH:\n", - " 1. Check if other is a Tensor object\n", - " 2. If Tensor, divide self._data by other._data\n", - " 3. If scalar, divide self._data by scalar directly\n", - " 4. 
Create new Tensor with result and return\n", - "\n", - " PRODUCTION CONNECTION:\n", - " - Normalization: tensor / std_deviation for standard scaling\n", - " - Learning rate decay: parameter / decay_factor over time\n", - " - Probability computation: counts / total_counts for frequencies\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " if isinstance(other, Tensor):\n", - " result = self._data / other._data\n", - " else:\n", - " result = self._data / other\n", - " return Tensor(result)\n", - " ### END SOLUTION\n", - "\n", - " def mean(self) -> 'Tensor':\n", - " \"\"\"Computes the mean of the tensor's elements.\"\"\"\n", - " return Tensor(np.mean(self.data))\n", - " \n", - " def sum(self, axis=None, keepdims=False) -> 'Tensor':\n", - " \"\"\"\n", - " Sum tensor elements along specified axes.\n", - " \n", - " Args:\n", - " axis: Axis or axes to sum over. If None, sum all elements.\n", - " keepdims: Whether to keep dimensions of size 1 in output.\n", - " \n", - " Returns:\n", - " New tensor with summed values.\n", - " \"\"\"\n", - " result_data = np.sum(self._data, axis=axis, keepdims=keepdims)\n", - " result = Tensor(result_data)\n", - " \n", - " if self.requires_grad:\n", - " result.requires_grad = True\n", - " \n", - " def grad_fn(grad):\n", - " # Sum gradient: broadcast gradient back to original shape\n", - " grad_data = grad.data\n", - " if axis is None:\n", - " # Sum over all axes - gradient is broadcast to full shape\n", - " grad_data = np.full(self.shape, grad_data)\n", - " else:\n", - " # Sum over specific axes - expand back those dimensions\n", - " axis_tuple = axis if isinstance(axis, tuple) else (axis,)\n", - " \n", - " # Normalize negative axes first, then expand in ascending order\n", - " # (sorting raw negative axes would expand out of order); skip the\n", - " # expansion entirely when keepdims already preserved those dims\n", - " if not keepdims:\n", - " for ax in sorted(ax % len(self.shape) for ax in axis_tuple):\n", - " grad_data = np.expand_dims(grad_data, axis=ax)\n", - " \n", - " # Broadcast to original shape\n", - " grad_data = 
np.broadcast_to(grad_data, self.shape)\n", - " \n", - " self.backward(Tensor(grad_data))\n", - " \n", - " result._grad_fn = grad_fn\n", - " \n", - " return result" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5c4b5e57", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "tensor-matmul", - "solution": true - } - }, - "outputs": [], - "source": [ - " def matmul(self, other: 'Tensor') -> 'Tensor':\n", - " \"\"\"\n", - " Matrix multiplication using NumPy's optimized implementation.\n", - "\n", - " TODO: Implement matrix multiplication.\n", - "\n", - " APPROACH:\n", - " 1. Extract numpy arrays from both tensors\n", - " 2. Check tensor shapes for compatibility\n", - " 3. Use NumPy's optimized dot product\n", - " 4. Create new Tensor object with the result\n", - " 5. Return the new tensor\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " a_data = self._data\n", - " b_data = other._data\n", - "\n", - " # Validate tensor shapes\n", - " if len(a_data.shape) != 2 or len(b_data.shape) != 2:\n", - " raise ValueError(\"matmul requires 2D tensors\")\n", - "\n", - " m, k = a_data.shape\n", - " k2, n = b_data.shape\n", - "\n", - " if k != k2:\n", - " raise ValueError(f\"Inner dimensions must match: {k} != {k2}\")\n", - "\n", - " # Use NumPy's optimized implementation\n", - " result_data = np.dot(a_data, b_data)\n", - " return Tensor(result_data)\n", - " ### END SOLUTION\n", - "\n", - " def __matmul__(self, other: 'Tensor') -> 'Tensor':\n", - " \"\"\"\n", - " Matrix multiplication operator: tensor @ other\n", - "\n", - " Enables the @ operator for matrix multiplication, providing\n", - " clean syntax for neural network operations.\n", - " \"\"\"\n", - " return self.matmul(other)\n", - "\n", - " def backward(self, gradient=None):\n", - " \"\"\"\n", - " Compute gradients for this tensor and propagate backward.\n", - "\n", - " Basic backward pass - accumulates gradients and propagates to dependencies.\n", - " This enables simple gradient 
computation for basic operations.\n", - "\n", - " Args:\n", - " gradient: Gradient from upstream. If None, assumes scalar with grad=1\n", - " \"\"\"\n", - " if not self.requires_grad:\n", - " return\n", - "\n", - " if gradient is None:\n", - " # Scalar case - gradient is 1\n", - " gradient = Tensor(np.ones_like(self._data))\n", - "\n", - " # Accumulate gradients\n", - " if self.grad is None:\n", - " self.grad = gradient\n", - " else:\n", - " self.grad = self.grad + gradient\n", - "\n", - " # Propagate to dependencies via grad_fn\n", - " if self._grad_fn is not None:\n", - " self._grad_fn(gradient)\n", - " \n", - " def zero_grad(self):\n", - " \"\"\"Reset gradients to None. Used by optimizers before backward pass.\"\"\"\n", - " self.grad = None" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a8f6f7d5", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "tensor-reshape", - "solution": true - } - }, - "outputs": [], - "source": [ - " def reshape(self, *shape: int) -> 'Tensor':\n", - " \"\"\"\n", - " Return a new tensor with the same data but different shape.\n", - "\n", - " Args:\n", - " *shape: New shape dimensions. 
Use -1 for automatic sizing.\n", - "\n", - " Returns:\n", - " New Tensor with reshaped data\n", - " \n", - " Note:\n", - " This returns a view when possible (no copying), or a copy when necessary.\n", - " Use .contiguous() after reshape if you need guaranteed contiguous memory.\n", - " \"\"\"\n", - " reshaped_data = self._data.reshape(*shape)\n", - " result = Tensor(reshaped_data)\n", - " \n", - " # Preserve gradient tracking\n", - " if self.requires_grad:\n", - " result.requires_grad = True\n", - " \n", - " def grad_fn(grad):\n", - " # Reshape gradient back to original shape\n", - " orig_grad = grad.reshape(*self.shape)\n", - " self.backward(orig_grad)\n", - " \n", - " result._grad_fn = grad_fn\n", - " \n", - " return result\n", - " \n", - " def view(self, *shape: int) -> 'Tensor':\n", - " \"\"\"\n", - " Return a view of the tensor with a new shape. Alias for reshape.\n", - " \n", - " Args:\n", - " *shape: New shape dimensions. Use -1 for automatic sizing.\n", - " \n", - " Returns:\n", - " New Tensor sharing the same data (view when possible)\n", - " \n", - " PRODUCTION CONNECTION:\n", - " - PyTorch compatibility: .view() is the PyTorch equivalent\n", - " - Memory efficiency: Views avoid copying data when possible\n", - " - Performance critical: Views enable efficient transformations\n", - " \"\"\"\n", - " return self.reshape(*shape)\n", - " \n", - " def clone(self) -> 'Tensor':\n", - " \"\"\"\n", - " Create a deep copy of the tensor.\n", - " \n", - " Returns:\n", - " New Tensor with copied data\n", - " \n", - " PRODUCTION CONNECTION:\n", - " - Memory isolation: Ensures modifications don't affect original\n", - " - Gradient tracking: Clones maintain independent gradient graphs\n", - " - Safe operations: Use when you need guaranteed data independence\n", - " \"\"\"\n", - " cloned_data = self._data.copy()\n", - " result = Tensor(cloned_data)\n", - " \n", - " # Clone preserves gradient requirements but starts fresh grad tracking\n", - " result.requires_grad = 
self.requires_grad\n", - " # Note: grad and grad_fn are NOT copied - clone starts fresh\n", - " \n", - " return result\n", - " \n", - " def contiguous(self) -> 'Tensor':\n", - " \"\"\"\n", - " Return a contiguous tensor with the same data.\n", - " \n", - " Returns:\n", - " Tensor with contiguous memory layout (may be a copy)\n", - " \n", - " PRODUCTION CONNECTION:\n", - " - Performance optimization: Ensures optimal memory layout\n", - " - GPU operations: Many CUDA operations require contiguous data\n", - " - Cache efficiency: Contiguous data maximizes CPU cache utilization\n", - " \"\"\"\n", - " if self.is_contiguous:\n", - " return self # Already contiguous, return self\n", - " \n", - " # Make contiguous copy\n", - " contiguous_data = np.ascontiguousarray(self._data)\n", - " result = Tensor(contiguous_data)\n", - " \n", - " # Preserve gradient tracking\n", - " result.requires_grad = self.requires_grad\n", - " if self.requires_grad:\n", - " def grad_fn(grad):\n", - " self.backward(grad)\n", - " result._grad_fn = grad_fn\n", - " \n", - " return result\n", - "\n", - " def numpy(self) -> np.ndarray:\n", - " \"\"\"\n", - " Convert tensor to NumPy array.\n", - " \n", - " This is the PyTorch-inspired method for tensor-to-numpy conversion.\n", - " Provides clean interface for interoperability with NumPy operations.\n", - " \"\"\"\n", - " return self._data\n", - " \n", - " def __array__(self, dtype=None) -> np.ndarray:\n", - " \"\"\"Enable np.array(tensor) and np.allclose(tensor, array).\"\"\"\n", - " if dtype is not None:\n", - " return self._data.astype(dtype)\n", - " return self._data\n", - " \n", - " def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):\n", - " \"\"\"Enable NumPy universal functions with Tensor objects.\"\"\"\n", - " # Convert Tensor inputs to NumPy arrays\n", - " args = []\n", - " for input_ in inputs:\n", - " if isinstance(input_, Tensor):\n", - " args.append(input_._data)\n", - " else:\n", - " args.append(input_)\n", - " \n", - " # Call the 
ufunc on NumPy arrays\n", - " outputs = getattr(ufunc, method)(*args, **kwargs)\n", - " \n", - " # If method returns NotImplemented, let NumPy handle it\n", - " if outputs is NotImplemented:\n", - " return NotImplemented\n", - " \n", - " # Wrap result back in Tensor if appropriate\n", - " if method == '__call__':\n", - " if isinstance(outputs, np.ndarray):\n", - " return Tensor(outputs)\n", - " elif isinstance(outputs, tuple):\n", - " return tuple(Tensor(output) if isinstance(output, np.ndarray) else output \n", - " for output in outputs)\n", - " \n", - " return outputs\n", - "\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "id": "0ce24a6f", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 2 - }, - "source": [ - "## Testing Your Tensor Implementation\n", - "\n", - "Let's validate each component immediately to ensure everything works correctly:" - ] - }, - { - "cell_type": "markdown", - "id": "37e009e2", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Unit Test: Tensor Creation\n", - "\n", - "Let's test your tensor creation implementation right away! This gives you immediate feedback on whether your `__init__` method works correctly." 
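The `__array__` and `__array_ufunc__` hooks implemented above are what let NumPy functions accept Tensor objects directly. A minimal standalone sketch of the same dispatch pattern (the `ArrayBox` name is illustrative, not part of the module):

```python
import numpy as np

class ArrayBox:
    """Hypothetical minimal wrapper showing the two NumPy interop hooks
    used by the Tensor class above. Not the exported Tensor itself."""

    def __init__(self, data):
        self._data = np.array(data)

    def __array__(self, dtype=None):
        # Lets np.array(box) and similar conversions see the raw data
        return self._data.astype(dtype) if dtype is not None else self._data

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Unwrap ArrayBox inputs, run the ufunc, re-wrap ndarray results
        args = [x._data if isinstance(x, ArrayBox) else x for x in inputs]
        out = getattr(ufunc, method)(*args, **kwargs)
        if method == '__call__' and isinstance(out, np.ndarray):
            return ArrayBox(out)
        return out

box = ArrayBox([1.0, 4.0, 9.0])
rooted = np.sqrt(box)          # ufunc dispatches through __array_ufunc__
print(type(rooted).__name__)   # ArrayBox
```

The payoff is that library code written against NumPy ufuncs works on the wrapper without modification, while results stay wrapped.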
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "eff5b3e5", - "metadata": { - "lines_to_next_cell": 2 - }, - "outputs": [], - "source": [ - "\n", - "def test_unit_tensor_creation():\n", - " \"\"\"Test tensor creation with all data types and shapes.\"\"\"\n", - " print(\"🔬 Unit Test: Tensor Creation...\")\n", - " \n", - " try:\n", - " # Test scalar\n", - " scalar = Tensor(5.0)\n", - " assert hasattr(scalar, '_data'), \"Tensor should have _data attribute\"\n", - " assert scalar._data.shape == (), f\"Scalar should have shape (), got {scalar._data.shape}\"\n", - " print(\"✅ Scalar creation works\")\n", - "\n", - " # Test vector\n", - " vector = Tensor([1, 2, 3])\n", - " assert vector._data.shape == (3,), f\"Vector should have shape (3,), got {vector._data.shape}\"\n", - " print(\"✅ Vector creation works\")\n", - "\n", - " # Test matrix\n", - " matrix = Tensor([[1, 2], [3, 4]])\n", - " assert matrix._data.shape == (2, 2), f\"Matrix should have shape (2, 2), got {matrix._data.shape}\"\n", - " print(\"✅ Matrix creation works\")\n", - "\n", - " print(\"📈 Progress: Tensor Creation ✓\")\n", - "\n", - " except Exception as e:\n", - " print(f\"❌ Tensor creation test failed: {e}\")\n", - " raise\n", - "\n", - " print(\"🎯 Tensor creation behavior:\")\n", - " print(\" Converts data to NumPy arrays\")\n", - " print(\" Preserves shape and data type\")\n", - " print(\" Stores in _data attribute\")\n", - "\n", - "test_unit_tensor_creation()" - ] - }, - { - "cell_type": "markdown", - "id": "0abae867", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Unit Test: Tensor Properties\n", - "\n", - "Now let's test that your tensor properties work correctly. This tests the @property methods you implemented." 
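The `shape`, `size`, and `dtype` properties under test simply delegate to the underlying NumPy array; the invariants they expose can be checked with plain NumPy (variable names here are illustrative):

```python
import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6]])

assert data.shape == (2, 3)                    # tuple of dimension sizes
assert data.size == int(np.prod(data.shape))   # total elements = product of shape
assert data.dtype in (np.int32, np.int64)      # platform-dependent default int

# float32 halves memory versus float64 - the trade-off the dtype docstring mentions
print(data.astype(np.float32).nbytes, data.astype(np.float64).nbytes)  # 24 48
```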
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "05c92150", - "metadata": { - "lines_to_next_cell": 2 - }, - "outputs": [], - "source": [ - "\n", - "def test_unit_tensor_properties():\n", - " \"\"\"Test tensor properties (shape, size, dtype, data access).\"\"\"\n", - " print(\"🔬 Unit Test: Tensor Properties...\")\n", - " \n", - " try:\n", - " # Test with a simple matrix\n", - " tensor = Tensor([[1, 2, 3], [4, 5, 6]])\n", - "\n", - " # Test shape property\n", - " assert tensor.shape == (2, 3), f\"Shape should be (2, 3), got {tensor.shape}\"\n", - " print(\"✅ Shape property works\")\n", - "\n", - " # Test size property\n", - " assert tensor.size == 6, f\"Size should be 6, got {tensor.size}\"\n", - " print(\"✅ Size property works\")\n", - "\n", - " # Test data property\n", - " assert np.array_equal(tensor.data, np.array([[1, 2, 3], [4, 5, 6]])), \"Data property should return numpy array\"\n", - " print(\"✅ Data property works\")\n", - "\n", - " # Test dtype property\n", - " assert tensor.dtype in [np.int32, np.int64], f\"Dtype should be int32 or int64, got {tensor.dtype}\"\n", - " print(\"✅ Dtype property works\")\n", - "\n", - " print(\"📈 Progress: Tensor Properties ✓\")\n", - "\n", - " except Exception as e:\n", - " print(f\"❌ Tensor properties test failed: {e}\")\n", - " raise\n", - "\n", - " print(\"🎯 Tensor properties behavior:\")\n", - " print(\" shape: Returns tuple of dimensions\")\n", - " print(\" size: Returns total number of elements\")\n", - " print(\" data: Returns underlying NumPy array\")\n", - " print(\" dtype: Returns NumPy data type\")\n", - "\n", - "test_unit_tensor_properties()" - ] - }, - { - "cell_type": "markdown", - "id": "94247bc9", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Unit Test: Tensor Arithmetic\n", - "\n", - "Let's test your tensor arithmetic operations. This tests the __add__, __mul__, __sub__, __truediv__ methods." 
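The arithmetic dunders being tested all reduce to NumPy's element-wise operators, so they inherit NumPy's broadcasting rules. A plain-NumPy sketch of the behavior the test relies on:

```python
import numpy as np

a = np.array([1, 2, 3])

print(a + 10)            # scalar broadcasts across the vector: [11 12 13]
print(a * 2)             # same for multiplication: [2 4 6]

row = np.array([[1, 2, 3]])   # shape (1, 3)
col = np.array([[10], [20]])  # shape (2, 1)
print(row + col)              # shapes align to (2, 3)

try:
    a + np.array([1, 2])      # (3,) vs (2,): no dimension is 1, so it fails
    raised = False
except ValueError:
    raised = True
print("incompatible shapes raise ValueError:", raised)
```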
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2704d05a", - "metadata": {}, - "outputs": [], - "source": [ - "\n", - "def test_unit_tensor_arithmetic():\n", - " \"\"\"Test tensor arithmetic operations.\"\"\"\n", - " print(\"🔬 Unit Test: Tensor Arithmetic...\")\n", - " \n", - " try:\n", - " # Test addition\n", - " a = Tensor([1, 2, 3])\n", - " b = Tensor([4, 5, 6])\n", - " result = a + b\n", - " expected = np.array([5, 7, 9])\n", - " assert np.array_equal(result.data, expected), f\"Addition failed: expected {expected}, got {result.data}\"\n", - " print(\"✅ Addition works\")\n", - "\n", - " # Test scalar addition\n", - " result_scalar = a + 10\n", - " expected_scalar = np.array([11, 12, 13])\n", - " assert np.array_equal(result_scalar.data, expected_scalar), f\"Scalar addition failed: expected {expected_scalar}, got {result_scalar.data}\"\n", - " print(\"✅ Scalar addition works\")\n", - "\n", - " # Test multiplication\n", - " result_mul = a * b\n", - " expected_mul = np.array([4, 10, 18])\n", - " assert np.array_equal(result_mul.data, expected_mul), f\"Multiplication failed: expected {expected_mul}, got {result_mul.data}\"\n", - " print(\"✅ Multiplication works\")\n", - "\n", - " # Test scalar multiplication\n", - " result_scalar_mul = a * 2\n", - " expected_scalar_mul = np.array([2, 4, 6])\n", - " assert np.array_equal(result_scalar_mul.data, expected_scalar_mul), f\"Scalar multiplication failed: expected {expected_scalar_mul}, got {result_scalar_mul.data}\"\n", - " print(\"✅ Scalar multiplication works\")\n", - "\n", - " # Test subtraction\n", - " result_sub = b - a\n", - " expected_sub = np.array([3, 3, 3])\n", - " assert np.array_equal(result_sub.data, expected_sub), f\"Subtraction failed: expected {expected_sub}, got {result_sub.data}\"\n", - " print(\"✅ Subtraction works\")\n", - "\n", - " # Test division\n", - " result_div = b / a\n", - " expected_div = np.array([4.0, 2.5, 2.0])\n", - " assert np.allclose(result_div.data, 
expected_div), f\"Division failed: expected {expected_div}, got {result_div.data}\"\n", - " print(\"✅ Division works\")\n", - "\n", - " print(\"📈 Progress: Tensor Arithmetic ✓\")\n", - "\n", - " except Exception as e:\n", - " print(f\"❌ Tensor arithmetic test failed: {e}\")\n", - " raise\n", - "\n", - " print(\"🎯 Tensor arithmetic behavior:\")\n", - " print(\" Element-wise operations on tensors\")\n", - " print(\" Broadcasting with scalars\")\n", - " print(\" Returns new Tensor objects\")\n", - " print(\" Preserves numerical precision\")\n", - "\n", - "test_unit_tensor_arithmetic()" - ] - }, - { - "cell_type": "markdown", - "id": "1da8fe1f", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Unit Test: Matrix Multiplication\n", - "\n", - "Test the matrix multiplication implementation, which delegates to NumPy's optimized `np.dot`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "66806e77", - "metadata": {}, - "outputs": [], - "source": [ - "\n", - "def test_unit_matrix_multiplication():\n", - " \"\"\"Test matrix multiplication on small and large matrices.\"\"\"\n", - " print(\"🔬 Unit Test: Matrix Multiplication...\")\n", - " \n", - " try:\n", - " # Small matrix with a hand-checked expected result\n", - " small_a = Tensor([[1, 2], [3, 4]])\n", - " small_b = Tensor([[5, 6], [7, 8]])\n", - " small_result = small_a @ small_b\n", - " small_expected = np.array([[19, 22], [43, 50]])\n", - " assert np.array_equal(small_result.data, small_expected), f\"Small matmul failed: expected {small_expected}, got {small_result.data}\"\n", - " print(\"✅ Small matrix multiplication works\")\n", - "\n", - " # Large matrix, verified against NumPy directly\n", - " large_a = Tensor(np.random.randn(100, 50))\n", - " large_b = Tensor(np.random.randn(50, 80))\n", - " large_result = large_a @ large_b\n", - " assert large_result.shape == (100, 80), f\"Large matmul shape wrong: expected (100, 80), got {large_result.shape}\"\n", - " \n", - " # Verify with NumPy\n", - " expected_large = np.dot(large_a.data, large_b.data)\n", - " assert np.allclose(large_result.data, expected_large), \"Large matmul results don't match NumPy\"\n", - " print(\"✅ Large matrix multiplication works\")\n", - "\n", - " print(\"📈 Progress: Matrix Multiplication ✓\")\n", - "\n", - " except Exception as e:\n", - " print(f\"❌ Matrix multiplication test failed: {e}\")\n", - " raise\n", - "\n", - " print(\"🎯 Matrix multiplication behavior:\")\n", - " print(\" Delegates to NumPy's optimized np.dot\")\n", - " print(\" Requires 2D tensors with matching inner dimensions\")\n", - " print(\" Proper shape validation and error handling\")\n", - " print(\" Foundation for neural network linear layers\")\n", - "\n", - "test_unit_matrix_multiplication()" - ] - }, - { - "cell_type": "markdown", - "id": "76025783", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Unit Test: Advanced Tensor Operations\n", - "\n", - "Test the new view/copy semantics and memory layout functionality."
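The shape validation inside `matmul` can be exercised on its own. A hedged sketch mirroring the 2D and inner-dimension checks (the helper name `check_matmul_shapes` is hypothetical, not part of the module):

```python
import numpy as np

def check_matmul_shapes(a: np.ndarray, b: np.ndarray) -> tuple:
    """Mirror of the validation in matmul above: 2D only, inner dims match."""
    if a.ndim != 2 or b.ndim != 2:
        raise ValueError("matmul requires 2D tensors")
    m, k = a.shape
    k2, n = b.shape
    if k != k2:
        raise ValueError(f"Inner dimensions must match: {k} != {k2}")
    return (m, n)  # shape of the product

# (m, k) @ (k, n) -> (m, n)
print(check_matmul_shapes(np.zeros((100, 50)), np.zeros((50, 80))))  # (100, 80)

try:
    check_matmul_shapes(np.zeros((2, 3)), np.zeros((4, 5)))
except ValueError as e:
    print(e)  # Inner dimensions must match: 3 != 4
```

Validating before calling `np.dot` lets the error message name the offending dimensions instead of surfacing a generic NumPy exception.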
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "564575fd", - "metadata": {}, - "outputs": [], - "source": [ - "\n", - "def test_unit_advanced_tensor_operations():\n", - " \"\"\"Test advanced tensor operations: view, clone, contiguous, strides.\"\"\"\n", - " print(\"🔬 Unit Test: Advanced Tensor Operations...\")\n", - " \n", - " try:\n", - " # Test dtype handling improvements\n", - " tensor_str = Tensor([1, 2, 3], dtype=\"float32\")\n", - " tensor_np = Tensor([1, 2, 3], dtype=np.float64)\n", - " assert tensor_str.dtype == np.float32, f\"String dtype failed: {tensor_str.dtype}\"\n", - " assert tensor_np.dtype == np.float64, f\"NumPy dtype failed: {tensor_np.dtype}\"\n", - " print(\"✅ Enhanced dtype handling works\")\n", - "\n", - " # Test stride and contiguity properties\n", - " matrix = Tensor([[1, 2, 3], [4, 5, 6]])\n", - " assert hasattr(matrix, 'strides'), \"Should have strides property\"\n", - " assert hasattr(matrix, 'is_contiguous'), \"Should have is_contiguous property\"\n", - " assert matrix.is_contiguous == True, \"New tensor should be contiguous\"\n", - " print(\"✅ Stride and contiguity properties work\")\n", - "\n", - " # Test view vs clone semantics\n", - " original = Tensor([[1, 2], [3, 4]])\n", - " view_tensor = original.view(4) # Should share data\n", - " clone_tensor = original.clone() # Should copy data\n", - " \n", - " assert view_tensor.shape == (4,), f\"View shape wrong: {view_tensor.shape}\"\n", - " assert clone_tensor.shape == (2, 2), f\"Clone shape wrong: {clone_tensor.shape}\"\n", - " print(\"✅ View and clone semantics work\")\n", - "\n", - " # Test contiguous operation\n", - " non_contiguous = Tensor(np.ones((10, 10)).T) # Transpose creates non-contiguous\n", - " contiguous_result = non_contiguous.contiguous()\n", - " \n", - " if not non_contiguous.is_contiguous: # Only test if actually non-contiguous\n", - " assert contiguous_result.is_contiguous == True, \"contiguous() should make data contiguous\"\n", - " print(\"✅ 
Contiguous operation works\")\n", - "\n", - " # Test error handling for invalid dtype\n", - " try:\n", - " Tensor([1, 2, 3], dtype=123) # Invalid dtype\n", - " print(\"❌ Should have failed with invalid dtype\")\n", - " except TypeError:\n", - " print(\"✅ Proper error handling for invalid dtype\")\n", - "\n", - " print(\"📈 Progress: Advanced Tensor Operations ✓\")\n", - "\n", - " except Exception as e:\n", - " print(f\"❌ Advanced tensor operations test failed: {e}\")\n", - " raise\n", - "\n", - " print(\"🎯 Advanced tensor operations behavior:\")\n", - " print(\" Enhanced dtype handling (str and np.dtype)\")\n", - " print(\" Memory layout analysis with strides\")\n", - " print(\" View vs copy semantics for memory efficiency\")\n", - " print(\" Contiguous memory optimization\")\n", - "\n", - "test_unit_advanced_tensor_operations()" - ] - }, - { - "cell_type": "markdown", - "id": "674989ac", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Integration Test: Tensor-NumPy Integration\n", - "\n", - "This integration test validates that your tensor system works seamlessly with NumPy, the foundation of the scientific Python ecosystem." 
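The strides and contiguity behavior exercised in the advanced-operations test comes straight from NumPy. A plain-NumPy sketch, with illustrative variable names:

```python
import numpy as np

m = np.arange(6, dtype=np.int64).reshape(2, 3)

print(m.strides)                    # (24, 8): step 3*8 bytes per row, 8 per column
print(m.flags['C_CONTIGUOUS'])      # True: freshly created arrays are row-major

t = m.T                             # transpose is a strided view, no copy
print(t.strides)                    # (8, 24): strides swapped, data untouched
print(t.flags['C_CONTIGUOUS'])      # False: rows are no longer adjacent in memory

fixed = np.ascontiguousarray(t)     # what .contiguous() does under the hood
print(fixed.flags['C_CONTIGUOUS'])  # True, at the cost of a copy
```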
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "79dc850b", - "metadata": {}, - "outputs": [], - "source": [ - "\n", - "def test_module_tensor_numpy_integration():\n", - " \"\"\"\n", - " Integration test for tensor operations with NumPy arrays.\n", - "\n", - " Tests that tensors properly integrate with NumPy operations and maintain\n", - " compatibility with the scientific Python ecosystem.\n", - " \"\"\"\n", - " print(\"🔬 Integration Test: Tensor-NumPy Integration...\")\n", - "\n", - " try:\n", - " # Test 1: Tensor from NumPy array\n", - " numpy_array = np.array([[1, 2, 3], [4, 5, 6]])\n", - " tensor_from_numpy = Tensor(numpy_array)\n", - "\n", - " assert tensor_from_numpy.shape == (2, 3), \"Tensor should preserve NumPy array shape\"\n", - " assert np.array_equal(tensor_from_numpy.data, numpy_array), \"Tensor should preserve NumPy array data\"\n", - " print(\"✅ Tensor from NumPy array works\")\n", - "\n", - " # Test 2: Tensor arithmetic with NumPy-compatible operations\n", - " a = Tensor([1.0, 2.0, 3.0])\n", - " b = Tensor([4.0, 5.0, 6.0])\n", - "\n", - " # Test operations that would be used in neural networks\n", - " dot_product_result = np.dot(a.data, b.data) # Common in layers\n", - " assert np.isclose(dot_product_result, 32.0), \"Dot product should work with tensor data\"\n", - " print(\"✅ NumPy operations on tensor data work\")\n", - "\n", - " # Test 3: Broadcasting compatibility\n", - " matrix = Tensor([[1, 2], [3, 4]])\n", - " scalar = Tensor(10)\n", - "\n", - " result = matrix + scalar\n", - " expected = np.array([[11, 12], [13, 14]])\n", - " assert np.array_equal(result.data, expected), \"Broadcasting should work like NumPy\"\n", - " print(\"✅ Broadcasting compatibility works\")\n", - "\n", - " # Test 4: Integration with scientific computing patterns\n", - " data = Tensor([1, 4, 9, 16, 25])\n", - " sqrt_result = Tensor(np.sqrt(data.data)) # Using NumPy functions on tensor data\n", - " expected_sqrt = np.array([1., 2., 3., 4., 
5.])\n", - " assert np.allclose(sqrt_result.data, expected_sqrt), \"Should integrate with NumPy functions\"\n", - " print(\"✅ Scientific computing integration works\")\n", - "\n", - " print(\"📈 Progress: Tensor-NumPy Integration ✓\")\n", - "\n", - " except Exception as e:\n", - " print(f\"❌ Integration test failed: {e}\")\n", - " raise\n", - "\n", - " print(\"🎯 Integration test validates:\")\n", - " print(\" Seamless NumPy array conversion\")\n", - " print(\" Compatible arithmetic operations\")\n", - " print(\" Proper broadcasting behavior\")\n", - " print(\" Scientific computing workflow integration\")\n", - "\n", - "test_module_tensor_numpy_integration()" - ] - }, - { - "cell_type": "markdown", - "id": "3ba2c701", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Parameter Helper Function\n", - "\n", - "Now that we have Tensor with gradient support, let's add a convenient helper function for creating trainable parameters:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8039d2e4", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "\n", - "#| export\n", - "def Parameter(data, dtype=None):\n", - " \"\"\"\n", - " Convenience function for creating trainable tensors.\n", - "\n", - " This is equivalent to Tensor(data, requires_grad=True) but provides\n", - " cleaner syntax for neural network parameters.\n", - "\n", - " Args:\n", - " data: Input data (scalar, list, or numpy array)\n", - " dtype: Data type ('float32', 'int32', etc.). 
Defaults to auto-detect.\n", - "\n", - " Returns:\n", - " Tensor with requires_grad=True\n", - "\n", - " Examples:\n", - " weight = Parameter(np.random.randn(784, 128)) # Neural network weight\n", - " bias = Parameter(np.zeros(128)) # Neural network bias\n", - " \"\"\"\n", - " return Tensor(data, dtype=dtype, requires_grad=True)" - ] - }, - { - "cell_type": "markdown", - "id": "94412986", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Comprehensive Testing Function\n", - "\n", - "Let's create a comprehensive test that runs all our unit tests together:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "71d471d8", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "\n", - "def test_unit_all():\n", - " \"\"\"Run complete tensor module validation.\"\"\"\n", - " print(\"🧪 Running all unit tests...\")\n", - " \n", - " # Call every individual test function\n", - " test_unit_tensor_creation()\n", - " test_unit_tensor_properties() \n", - " test_unit_tensor_arithmetic()\n", - " test_unit_matrix_multiplication()\n", - " test_unit_advanced_tensor_operations()\n", - " test_module_tensor_numpy_integration()\n", - " \n", - " print(\"✅ All tests passed! Tensor module ready for integration.\")" - ] - }, - { - "cell_type": "markdown", - "id": "adbef893", - "metadata": { - "lines_to_next_cell": 2 - }, - "source": [ - "\"\"\"\n", - "# Main Execution Block\n", - "\"\"\"\n", - "\n", - "if __name__ == \"__main__\":\n", - " # Run all tensor tests\n", - " test_unit_all()\n", - " \n", - " print(\"\\n🎉 Tensor module implementation complete!\")\n", - " print(\"📦 Ready to export to tinytorch.core.tensor\")\n", - " \n", - " # Demonstrate the new ML Framework Advisor improvements\n", - " print(\"\\n🚀 New Features Demonstration:\")\n", - " \n", - " # 1. 
Enhanced dtype handling\n", - " t1 = Tensor([1, 2, 3], dtype=\"float32\")\n", - " t2 = Tensor([1, 2, 3], dtype=np.float64)\n", - " t3 = Tensor([1, 2, 3], dtype=np.int32)\n", - " print(f\"✅ Enhanced dtype support: str={t1.dtype}, np.dtype={t2.dtype}, np.type={t3.dtype}\")\n", - " \n", - " # 2. Memory layout analysis\n", - " matrix = Tensor([[1, 2, 3], [4, 5, 6]])\n", - " print(f\"✅ Memory analysis: strides={matrix.strides}, contiguous={matrix.is_contiguous}\")\n", - " \n", - " # 3. View/copy semantics\n", - " view = matrix.view(6)\n", - " clone = matrix.clone()\n", - " print(f\"✅ View/copy semantics: view_shape={view.shape}, clone_shape={clone.shape}\")\n", - " \n", - " # 4. Broadcasting failure demonstration with clear error messages\n", - " try:\n", - " bad_a = Tensor([[1, 2], [3, 4]]) # (2, 2)\n", - " bad_b = Tensor([1, 2, 3]) # (3,)\n", - " result = bad_a + bad_b\n", - " except ValueError as e:\n", - " print(f\"✅ Clear broadcasting error: {str(e)[:50]}...\")\n", - " \n", - " print(\"\\n🎯 Core tensor implementation complete!\")\n", - " print(\" ✓ Simple, clear tensor creation and operations\")\n", - " print(\" ✓ Memory layout analysis and performance insights\")\n", - " print(\" ✓ Broadcasting with comprehensive error handling\")\n", - " print(\" ✓ View/copy semantics for memory efficiency\")" - ] - }, - { - "cell_type": "markdown", - "id": "eec96153", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🤔 ML Systems Thinking\n", - "\n", - "Now that you've built a complete tensor system, let's connect your implementation to real ML challenges:" - ] - }, - { - "cell_type": "markdown", - "id": "ddedb4f4", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### Question 1: Memory Efficiency at Scale\n", - "\n", - "**Challenge**: Your Tensor class showed that contiguous memory is 10-100x faster than scattered memory. Consider a language model with 7 billion parameters (28GB at float32). 
How would you modify your memory layout strategies to handle training with limited GPU memory (16GB)?\n", - "\n", - "Calculate the memory requirements for parameters, gradients, and optimizer states, then propose specific optimizations to your Tensor implementation." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1a53526a", - "metadata": {}, - "outputs": [], - "source": [ - "\n", - "\"\"\"\n", - "YOUR ANALYSIS:\n", - "\n", - "[Write your response here - consider memory layout, cache efficiency,\n", - "and optimization strategies for large-scale tensor operations]\n", - "\"\"\"" - ] - }, - { - "cell_type": "markdown", - "id": "9645ace4", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### Question 2: Production Broadcasting\n", - "\n", - "**Challenge**: Your broadcasting implementation handles basic cases. In transformer models, you need operations like:\n", - "- Query (32, 512, 768) × Key (32, 512, 768) → Attention (32, 512, 512)\n", - "- Attention (32, 8, 512, 512) + Bias (1, 1, 512, 512)\n", - "\n", - "How would you extend your `__add__` and `__mul__` methods to handle these complex shapes while providing clear error messages when shapes are incompatible?" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "20aee275", - "metadata": {}, - "outputs": [], - "source": [ - "\n", - "\"\"\"\n", - "YOUR ANALYSIS:\n", - "\n", - "[Write your response here - consider broadcasting rules, error handling,\n", - "and complex shape operations in transformer architectures]\n", - "\"\"\"" - ] - }, - { - "cell_type": "markdown", - "id": "a4e71b43", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### Question 3: Gradient Compatibility\n", - "\n", - "**Challenge**: Your Tensor class includes `requires_grad` and basic gradient tracking. 
When you implement automatic differentiation (Module 05), how will your current design support gradient computation?\n", - "\n", - "Consider how operations like `c = a * b` need to track both forward computation and backward gradient flow. What modifications would your Tensor methods need to support this?" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "32c157fe", - "metadata": {}, - "outputs": [], - "source": [ - "\n", - "\"\"\"\n", - "YOUR ANALYSIS:\n", - "\n", - "[Write your response here - consider gradient tracking, computational graphs,\n", - "and how your tensor operations will support automatic differentiation]\n", - "\"\"\"" - ] - }, - { - "cell_type": "markdown", - "id": "9b4d9bff", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🎯 MODULE SUMMARY: Tensor Foundation\n", - "\n", - "Congratulations! You've built the fundamental data structure that powers all machine learning!\n", - "\n", - "### Key Learning Outcomes\n", - "- **Complete Tensor System**: Built a 400+ line implementation with 15 methods supporting all essential tensor operations\n", - "- **Memory Efficiency Mastery**: Discovered that memory layout affects performance more than algorithms (10-100x speedups)\n", - "- **Broadcasting Implementation**: Created automatic shape matching that saves memory and enables flexible operations\n", - "- **Production-Ready API**: Designed interfaces that mirror PyTorch and TensorFlow patterns\n", - "\n", - "### Ready for Next Steps\n", - "Your tensor implementation now enables:\n", - "- **Module 02 (Activations)**: Add nonlinear functions that make neural networks powerful\n", - "- **Neural network operations**: Matrix multiplication, broadcasting, and gradient preparation\n", - "- **Real data processing**: Handle images, text, and complex multi-dimensional datasets\n", - "\n", - "### Export Your Work\n", - "1. **Export to package**: `tito module complete 01_tensor`\n", - "2. 
**Verify integration**: Your Tensor class will be available as `tinytorch.core.tensor.Tensor`\n", - "3. **Enable next module**: Activations build on your tensor foundation\n", - "\n", - "**Achievement unlocked**: You've built the universal data structure of modern AI! Every neural network, from simple classifiers to ChatGPT, relies on the tensor concepts you've just implemented." - ] - } - ], - "metadata": { - "jupytext": { - "main_language": "python" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.13.3" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/modules_old/01_tensor/tensor_dev.py b/modules_old/01_tensor/tensor_dev.py deleted file mode 100644 index 964bb2a9..00000000 --- a/modules_old/01_tensor/tensor_dev.py +++ /dev/null @@ -1,853 +0,0 @@ -# --- -# jupyter: -# jupytext: -# text_representation: -# extension: .py -# format_name: percent -# format_version: '1.3' -# jupytext_version: 1.17.1 -# --- - -# %% [markdown] -""" -# Tensor - The Foundation of Machine Learning - -Welcome to Tensor! You'll build the fundamental data structure that powers every neural network. - -## 🔗 Building on Previous Learning -**What You Built Before**: Module 00 (Setup) gave you a Python environment with NumPy - -**What's Working**: You have all the tools needed for numerical computing - -**The Gap**: You need to build the core data structure that makes ML possible - -**This Module's Solution**: Create a Tensor class that wraps NumPy with clean ML operations - -## Learning Objectives -1. **Core Implementation**: Build Tensor class with arithmetic operations -2. **Essential Operations**: Addition, multiplication, matrix operations -3. **Testing Skills**: Validate each function immediately after implementation -4. 
**Integration Knowledge**: Prepare foundation for neural network modules - -## Build → Test → Use -1. **Build**: Implement essential tensor operations -2. **Test**: Verify each component works correctly -3. **Use**: Apply tensors to multi-dimensional data -""" - -# In[ ]: - -#| default_exp core.tensor - -#| export -import numpy as np -import sys -from typing import Union, Tuple, Optional, Any -import warnings - -# In[ ]: - -print("🔥 TinyTorch Tensor Module") -print(f"NumPy version: {np.__version__}") -print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}") -print("Ready to build tensors!") - -# %% [markdown] -""" -## Understanding Tensors: From Numbers to Neural Networks - -Tensors are N-dimensional arrays that store and manipulate numerical data. Think of them as containers for information that become increasingly powerful as dimensions increase. - -### Tensor Dimension Hierarchy - -``` -Scalar (0D) ──► Vector (1D) ──► Matrix (2D) ──► 3D+ Tensor - 5.0 [1,2,3] [[1,2], [[[R,G,B]]] - [3,4]] image data - │ │ │ │ - ▼ ▼ ▼ ▼ - Single List Table Multi-dimensional - number of numbers of numbers data structure -``` - -### Memory Layout: NumPy Array + Tensor Wrapper - -Our Tensor class wraps NumPy's optimized arrays with clean ML operations: - -``` - TinyTorch Tensor NumPy Array -┌────────────────────────┐ ┌─────────────────────┐ -│ Tensor Object │ ───► │ [1.0, 2.0, 3.0] │ -│ • shape: (3,) │ │ • dtype: float32 │ -│ • size: 3 │ │ • contiguous memory │ -│ • operations: +,*,@ │ │ • BLAS optimized │ -└────────────────────────┘ └─────────────────────┘ - Clean ML API Fast Computation -``` - -This foundation focuses on pure data operations - gradient tracking comes in Module 05. -""" - -# %% nbgrader={"grade": false, "grade_id": "tensor-init", "solution": true} - -#| export -class Tensor: - """ - TinyTorch Tensor: N-dimensional array with ML operations. - - The fundamental data structure for all TinyTorch operations. 
- Wraps NumPy arrays with ML-specific functionality. - """ - - def __init__(self, data: Any, dtype: Optional[str] = None): - """ - Create a new tensor from data. - - Args: - data: Input data (scalar, list, or numpy array) - dtype: Data type ('float32', 'int32', etc.). Defaults to auto-detect. - - TODO: Implement tensor creation with simple, clear type handling. - - APPROACH: - 1. Convert input data to numpy array - 2. Apply dtype if specified - 3. Set default float32 for float64 arrays - 4. Store the result in self._data - - EXAMPLE: - >>> Tensor(5) - >>> Tensor([1.0, 2.0, 3.0]) - >>> Tensor([1, 2, 3], dtype='float32') - """ - ### BEGIN SOLUTION - if isinstance(data, Tensor): - self._data = data.data.copy() - else: - self._data = np.array(data) - - if dtype is not None: - self._data = self._data.astype(dtype) - elif self._data.dtype == np.float64: - self._data = self._data.astype(np.float32) - ### END SOLUTION - - @property - def data(self) -> np.ndarray: - """ - Access underlying numpy array. - - TODO: Return the stored numpy array. - """ - ### BEGIN SOLUTION - return self._data - ### END SOLUTION - - - @property - def shape(self) -> Tuple[int, ...]: - """ - Get tensor shape. - - TODO: Return the shape of the stored numpy array. - """ - ### BEGIN SOLUTION - return self._data.shape - ### END SOLUTION - - @property - def size(self) -> int: - """ - Get total number of elements. - - TODO: Return the total number of elements in the tensor. - """ - ### BEGIN SOLUTION - return self._data.size - ### END SOLUTION - - @property - def dtype(self) -> np.dtype: - """ - Get data type as numpy dtype. - - TODO: Return the data type of the stored numpy array. - """ - ### BEGIN SOLUTION - return self._data.dtype - ### END SOLUTION - - - def __repr__(self) -> str: - """ - String representation with size limits for readability. - - TODO: Create a clear string representation of the tensor. 
- """ - ### BEGIN SOLUTION - if self.size > 20: - return f"Tensor(shape={self.shape}, dtype={self.dtype})" - else: - return f"Tensor({self._data.tolist()}, shape={self.shape}, dtype={self.dtype})" - ### END SOLUTION - - def numpy(self) -> np.ndarray: - """Convert tensor to NumPy array.""" - return self._data - -# %% nbgrader={"grade": false, "grade_id": "tensor-arithmetic", "solution": true} - - def __add__(self, other: Union['Tensor', int, float]) -> 'Tensor': - """ - Addition operator: tensor + other - - Element-wise addition with broadcasting support: - - ``` - Tensor + Tensor: Tensor + Scalar: - [1, 2, 3] [1, 2, 3] - [4, 5, 6] + 5 - ──────── ──────── - [5, 7, 9] [6, 7, 8] - ``` - - TODO: Implement + operator using NumPy's vectorized operations - - APPROACH: - 1. Check if other is Tensor or scalar - 2. Use NumPy broadcasting for element-wise addition - 3. Return new Tensor with result - - HINT: NumPy handles broadcasting automatically! - """ - ### BEGIN SOLUTION - if isinstance(other, Tensor): - return Tensor(self._data + other._data) - else: - return Tensor(self._data + other) - ### END SOLUTION - - def __mul__(self, other: Union['Tensor', int, float]) -> 'Tensor': - """ - Multiplication operator: tensor * other - - TODO: Implement * operator for tensors. - """ - ### BEGIN SOLUTION - if isinstance(other, Tensor): - return Tensor(self._data * other._data) - else: - return Tensor(self._data * other) - ### END SOLUTION - - def __sub__(self, other: Union['Tensor', int, float]) -> 'Tensor': - """ - Subtraction operator: tensor - other - - TODO: Implement - operator for tensors. - """ - ### BEGIN SOLUTION - if isinstance(other, Tensor): - return Tensor(self._data - other._data) - else: - return Tensor(self._data - other) - ### END SOLUTION - - def __truediv__(self, other: Union['Tensor', int, float]) -> 'Tensor': - """ - Division operator: tensor / other - - TODO: Implement / operator for tensors. 
- """ - ### BEGIN SOLUTION - if isinstance(other, Tensor): - return Tensor(self._data / other._data) - else: - return Tensor(self._data / other) - ### END SOLUTION - - - def matmul(self, other: 'Tensor') -> 'Tensor': - """ - Matrix multiplication: combine two matrices through dot product operations. - - ### Matrix Multiplication Visualization - - ``` - A (2×3) B (3×2) C (2×2) - ┌─────────────┐ ┌───────┐ ┌─────────────┐ - │ 1 2 3 │ │ 7 8 │ │ 1×7+2×9+3×1 │ - │ │ │ 9 1 │ = │ │ = C - │ 4 5 6 │ │ 1 2 │ │ 4×7+5×9+6×1 │ - └─────────────┘ └───────┘ └─────────────┘ - │ │ │ - ▼ ▼ ▼ - Each row of A × Each col of B = Element of C - ``` - - ### Computational Cost - **FLOPs**: 2 × M × N × K operations for (M×K) @ (K×N) matrix - **Memory**: Result size M×N, inputs stay unchanged - - TODO: Implement matrix multiplication with shape validation - - APPROACH: - 1. Validate both tensors are 2D matrices - 2. Check inner dimensions match: A(m,k) @ B(k,n) → C(m,n) - 3. Use np.dot() for optimized BLAS computation - 4. Return new Tensor with result - - HINT: Let NumPy handle the heavy computation! - """ - ### BEGIN SOLUTION - if len(self._data.shape) != 2 or len(other._data.shape) != 2: - raise ValueError("matmul requires 2D tensors") - - m, k = self._data.shape - k2, n = other._data.shape - - if k != k2: - raise ValueError(f"Inner dimensions must match: {k} != {k2}") - - result_data = np.dot(self._data, other._data) - return Tensor(result_data) - ### END SOLUTION - - def __matmul__(self, other: 'Tensor') -> 'Tensor': - """ - Matrix multiplication operator: tensor @ other - - Enables the @ operator for matrix multiplication, providing - clean syntax for neural network operations. 
- """ - return self.matmul(other) - - def __getitem__(self, key): - """ - Access tensor elements using subscript notation: tensor[key] - - Supports all NumPy indexing patterns: - - Single index: tensor[0] - - Multiple indices: tensor[0, 1] - - Slices: tensor[0:2, 1:3] - - Fancy indexing: tensor[[0, 2], [1, 3]] - - Args: - key: Index or slice specification - - Returns: - Scalar, array value, or new Tensor with subset of data - - Examples: - tensor = Tensor([[1, 2], [3, 4]]) - tensor[0, 0] # Returns 1 (scalar) - tensor[0] # Returns Tensor([1, 2]) - tensor[0:1, 0:1] # Returns Tensor([[1]]) - """ - result = self._data[key] - - # If result is a scalar, return the scalar value directly - if np.isscalar(result): - return result - - # If result is an array, wrap it in a Tensor - return Tensor(result) - - def reshape(self, *shape: int) -> 'Tensor': - """ - Return a new tensor with the same data but different shape. - - TODO: Implement tensor reshaping. - """ - ### BEGIN SOLUTION - reshaped_data = self._data.reshape(*shape) - return Tensor(reshaped_data) - ### END SOLUTION - - def transpose(self) -> 'Tensor': - """ - Return the transpose of a 2D tensor. - - TODO: Implement tensor transpose. 
- """ - ### BEGIN SOLUTION - if len(self._data.shape) != 2: - raise ValueError("transpose() requires 2D tensor") - return Tensor(self._data.T) - ### END SOLUTION - - # Note: gradient computation will be added in Module 05 (Autograd) - # This pure Tensor class focuses only on data structure operations - - - - -# %% [markdown] -""" -## Class Methods for Tensor Creation -""" - - -#| export -@classmethod -def zeros(cls, *shape: int) -> 'Tensor': - """Create a tensor filled with zeros.""" - return cls(np.zeros(shape)) - -@classmethod -def ones(cls, *shape: int) -> 'Tensor': - """Create a tensor filled with ones.""" - return cls(np.ones(shape)) - -@classmethod -def random(cls, *shape: int) -> 'Tensor': - """Create a tensor with random values.""" - return cls(np.random.randn(*shape)) - -# Add class methods to Tensor class -Tensor.zeros = zeros -Tensor.ones = ones -Tensor.random = random - -# %% [markdown] -""" -### 🧪 Unit Test: Tensor Creation -This test validates tensor creation with different data types and shapes. 
-""" - -# %% -def test_unit_tensor_creation(): - """Test tensor creation with all data types and shapes.""" - print("🔬 Unit Test: Tensor Creation...") - - try: - # Test scalar - scalar = Tensor(5.0) - assert scalar.shape == (), f"Scalar should have shape (), got {scalar.shape}" - print("✅ Scalar creation works") - - # Test vector - vector = Tensor([1, 2, 3]) - assert vector.shape == (3,), f"Vector should have shape (3,), got {vector.shape}" - print("✅ Vector creation works") - - # Test matrix - matrix = Tensor([[1, 2], [3, 4]]) - assert matrix.shape == (2, 2), f"Matrix should have shape (2, 2), got {matrix.shape}" - print("✅ Matrix creation works") - - # Test class methods - zeros = Tensor.zeros(2, 3) - ones = Tensor.ones(2, 3) - random = Tensor.random(2, 3) - assert zeros.shape == (2, 3), "Zeros tensor should have correct shape" - assert ones.shape == (2, 3), "Ones tensor should have correct shape" - assert random.shape == (2, 3), "Random tensor should have correct shape" - print("✅ Class methods work") - - print("📈 Progress: Tensor Creation ✓") - - except Exception as e: - print(f"❌ Tensor creation test failed: {e}") - raise - -test_unit_tensor_creation() - - -# %% [markdown] -""" -### 🧪 Unit Test: Tensor Properties -This test validates tensor properties like shape, size, and data access. 
-""" - -# %% - -def test_unit_tensor_properties(): - """Test tensor properties (shape, size, dtype, data access).""" - print("🔬 Unit Test: Tensor Properties...") - - try: - tensor = Tensor([[1, 2, 3], [4, 5, 6]]) - - assert tensor.shape == (2, 3), f"Shape should be (2, 3), got {tensor.shape}" - assert tensor.size == 6, f"Size should be 6, got {tensor.size}" - assert np.array_equal(tensor.data, np.array([[1, 2, 3], [4, 5, 6]])), "Data property should return numpy array" - assert tensor.dtype in [np.int32, np.int64], f"Dtype should be int32 or int64, got {tensor.dtype}" - print("✅ All properties work correctly") - - print("📈 Progress: Tensor Properties ✓") - - except Exception as e: - print(f"❌ Tensor properties test failed: {e}") - raise - -test_unit_tensor_properties() - - -# %% [markdown] -""" -### 🧪 Unit Test: Tensor Arithmetic -This test validates all arithmetic operations (+, -, *, /) work correctly. - -**What we're testing**: Element-wise operations with broadcasting support -**Why it matters**: These operations form the foundation of neural network computations -**Expected**: All operations produce mathematically correct results with proper broadcasting - -### Broadcasting Visualization - -NumPy's broadcasting automatically handles different tensor shapes: - -``` -Same Shape: Broadcasting (vector + scalar): -[1, 2, 3] [1, 2, 3] [5] [1+5, 2+5, 3+5] -[4, 5, 6] + [4, 5, 6] + [5] = [4+5, 5+5, 6+5] ---------- --------- ─────────────── -[5, 7, 9] [6, 7, 8] [9,10,11] - -Matrix Broadcasting: Result: -┌─────────────┐ ┌─────────────┐ -│ 1 2 3 │ │ 11 12 13 │ -│ │ +10 │ │ -│ 4 5 6 │ ──▶ │ 14 15 16 │ -└─────────────┘ └─────────────┘ -``` -""" - -# %% - -def test_unit_tensor_arithmetic(): - """Test tensor arithmetic operations.""" - print("🔬 Unit Test: Tensor Arithmetic...") - - try: - a = Tensor([1, 2, 3]) - b = Tensor([4, 5, 6]) - - # Test all operations - result_add = a + b - result_mul = a * b - result_sub = b - a - result_div = b / a - - expected_add = np.array([5, 7, 
9]) - expected_mul = np.array([4, 10, 18]) - expected_sub = np.array([3, 3, 3]) - expected_div = np.array([4.0, 2.5, 2.0]) - - assert np.array_equal(result_add.data, expected_add), "Addition failed" - assert np.array_equal(result_mul.data, expected_mul), "Multiplication failed" - assert np.array_equal(result_sub.data, expected_sub), "Subtraction failed" - assert np.allclose(result_div.data, expected_div), "Division failed" - - # Test scalar operations - result_scalar = a + 10 - expected_scalar = np.array([11, 12, 13]) - assert np.array_equal(result_scalar.data, expected_scalar), "Scalar addition failed" - - print("✅ All arithmetic operations work") - print("📈 Progress: Tensor Arithmetic ✓") - - except Exception as e: - print(f"❌ Tensor arithmetic test failed: {e}") - raise - -test_unit_tensor_arithmetic() - -# %% [markdown] -""" -### 🧪 Unit Test: Matrix Multiplication -This test validates matrix multiplication and the @ operator. - -**What we're testing**: Matrix multiplication with proper shape validation -**Why it matters**: Matrix multiplication is the core operation in neural networks -**Expected**: Correct results and informative errors for incompatible shapes - -### Matrix Multiplication Process - -For matrices A(2×2) @ B(2×2), each result element is computed as: - -``` -Computation Pattern: -C[0,0] = A[0,0]*B[0,0] + A[0,1]*B[1,0] (row 0 of A × col 0 of B) -C[0,1] = A[0,0]*B[0,1] + A[0,1]*B[1,1] (row 0 of A × col 1 of B) -C[1,0] = A[1,0]*B[0,0] + A[1,1]*B[1,0] (row 1 of A × col 0 of B) -C[1,1] = A[1,0]*B[0,1] + A[1,1]*B[1,1] (row 1 of A × col 1 of B) - -Example: -[[1, 2]] @ [[5, 6]] = [[1*5+2*7, 1*6+2*8]] = [[19, 22]] -[[3, 4]] [[7, 8]] [[3*5+4*7, 3*6+4*8]] [[43, 50]] -``` -""" - -# %% - -def test_unit_matrix_multiplication(): - """Test matrix multiplication.""" - print("🔬 Unit Test: Matrix Multiplication...") - - try: - a = Tensor([[1, 2], [3, 4]]) - b = Tensor([[5, 6], [7, 8]]) - result = a @ b - expected = np.array([[19, 22], [43, 50]]) - assert 
np.array_equal(result.data, expected), f"Matmul failed: expected {expected}, got {result.data}" - print("✅ Matrix multiplication works") - - # Test shape validation - try: - bad_a = Tensor([[1, 2]]) - bad_b = Tensor([[1], [2], [3]]) # Incompatible shapes - result = bad_a @ bad_b - print("❌ Should have failed with incompatible shapes") - except ValueError: - print("✅ Shape validation works") - - print("📈 Progress: Matrix Multiplication ✓") - - except Exception as e: - print(f"❌ Matrix multiplication test failed: {e}") - raise - -test_unit_matrix_multiplication() - -# %% [markdown] -""" -### 🧪 Unit Test: Tensor Operations -This test validates reshape, transpose, and numpy conversion. - -**What we're testing**: Shape manipulation operations that reorganize data -**Why it matters**: Neural networks constantly reshape data between layers -**Expected**: Same data, different organization (no copying for most operations) - -### Shape Manipulation Visualization - -``` -Original tensor (2×3): -┌─────────────┐ -│ 1 2 3 │ -│ │ -│ 4 5 6 │ -└─────────────┘ - -Reshape to (3×2): Transpose to (3×2): -┌─────────┐ ┌─────────┐ -│ 1 2 │ │ 1 4 │ -│ 3 4 │ │ 2 5 │ -│ 5 6 │ │ 3 6 │ -└─────────┘ └─────────┘ - -Memory Impact: -- Reshape: Usually creates VIEW (no copy, just new indexing) -- Transpose: Creates VIEW (no copy, just swapped strides) -- Indexing: May create COPY (depends on pattern) -``` -""" - -# %% - -def test_unit_tensor_operations(): - """Test tensor operations: reshape, transpose.""" - print("🔬 Unit Test: Tensor Operations...") - - try: - # Test reshape - tensor = Tensor([[1, 2, 3], [4, 5, 6]]) - reshaped = tensor.reshape(3, 2) - assert reshaped.shape == (3, 2), f"Reshape failed: expected (3, 2), got {reshaped.shape}" - print("✅ Reshape works") - - # Test transpose - matrix = Tensor([[1, 2], [3, 4]]) - transposed = matrix.transpose() - expected = np.array([[1, 3], [2, 4]]) - assert np.array_equal(transposed.data, expected), "Transpose failed" - print("✅ Transpose works") - - 
# Test numpy conversion - numpy_array = tensor.numpy() - assert np.array_equal(numpy_array, tensor.data), "Numpy conversion failed" - print("✅ NumPy conversion works") - - print("📈 Progress: Tensor Operations ✓") - - except Exception as e: - print(f"❌ Tensor operations test failed: {e}") - raise - -test_unit_tensor_operations() - -# %% [markdown] -""" -### 🧪 Complete Module Test -This runs all tests together to validate the complete tensor implementation. -""" - -# %% - -def test_module(): - """Final comprehensive test of entire tensor module.""" - print("🧪 RUNNING MODULE INTEGRATION TEST") - print("=" * 50) - - # Run all unit tests - print("Running unit tests...") - test_unit_tensor_creation() - test_unit_tensor_properties() - test_unit_tensor_arithmetic() - test_unit_matrix_multiplication() - test_unit_tensor_operations() - - print("\nRunning integration scenarios...") - print("🔬 Integration Test: End-to-end tensor workflow...") - - # Test realistic usage pattern - tensor = Tensor([[1, 2], [3, 4]]) - result = (tensor + tensor) @ tensor.transpose() - assert result.shape == (2, 2) - print("✅ End-to-end workflow works!") - - print("\n" + "=" * 50) - print("🎉 ALL TESTS PASSED! Module ready for export.") - print("Run: tito module complete 01") - -test_module() - -# %% [markdown] -""" -## Systems Analysis: Memory Layout and Performance - -Now that our Tensor is working, let's understand how it behaves at the systems level. This analysis shows you how tensor operations scale and where bottlenecks appear in real ML systems. 
- -### Memory Usage Patterns - -``` -Operation Type Memory Pattern When to Worry -────────────────────────────────────────────────────────────── -Element-wise (+,*,/) 2× input size Large tensor ops -Matrix multiply (@) Size(A) + Size(B) + Size(C) GPU memory limits -Reshape/transpose Same memory, new view Never (just metadata) -Indexing/slicing Copy vs view Depends on pattern -``` - -### Performance Characteristics - -Let's measure how our tensor operations scale with size: -""" - -# %% -def analyze_tensor_performance(): - """Analyze tensor operations performance and memory usage.""" - print("📊 Systems Analysis: Tensor Performance\n") - - import time - import sys - - # Test different matrix sizes to understand scaling - sizes = [50, 100, 200, 400] - results = [] - - for size in sizes: - print(f"Testing {size}×{size} matrices...") - a = Tensor.random(size, size) - b = Tensor.random(size, size) - - # Measure matrix multiplication time - start = time.perf_counter() - result = a @ b - elapsed = time.perf_counter() - start - - # Calculate memory usage (rough estimate) - memory_mb = (a.size + b.size + result.size) * 4 / (1024 * 1024) # 4 bytes per float32 - flops = 2 * size * size * size # 2*N³ for matrix multiplication - gflops = flops / (elapsed * 1e9) - - results.append((size, elapsed * 1000, memory_mb, gflops)) - print(f" Time: {elapsed*1000:.2f}ms, Memory: ~{memory_mb:.1f}MB, Performance: {gflops:.2f} GFLOPS") - - print("\n🔍 Performance Analysis:") - print("```") - print("Size Time(ms) Memory(MB) Performance(GFLOPS)") - print("-" * 50) - for size, time_ms, mem_mb, gflops in results: - print(f"{size:4d} {time_ms:7.2f} {mem_mb:9.1f} {gflops:15.2f}") - print("```") - - print("\n💡 Key Insights:") - print("• Matrix multiplication is O(N³) - doubling size = 8× more computation") - print("• Memory grows as O(N²) - usually not the bottleneck for single operations") - print("• NumPy uses optimized BLAS libraries (like OpenBLAS, Intel MKL)") - print("• Performance depends 
heavily on your CPU and available memory bandwidth") - - return results - - -if __name__ == "__main__": - print("🚀 Running Tensor module...") - test_module() - print("\n📊 Running systems analysis...") - analyze_tensor_performance() - print("\n✅ Module validation complete!") - - -# %% [markdown] -""" -## 🤔 ML Systems Thinking: Interactive Questions - -### Question 1: Memory Scaling and Neural Network Implications -**Context**: Your performance analysis showed how tensor memory usage scales with size. A 1000×1000 tensor uses 100× more memory than a 100×100 tensor. - -**Systems Question**: Modern language models have weight matrices of size [4096, 11008] (Llama-2 7B). How much memory would this single layer consume in float32? Why do production systems use float16 or int8 quantization? - -*Calculate*: 4096 × 11008 × 4 bytes = ? GB per layer - -### Question 2: Computational Complexity in Practice -**Context**: Your analysis revealed O(N³) scaling for matrix multiplication. This means doubling the matrix size increases computation time by 8×. - -**Performance Question**: If a 400×400 matrix multiplication takes 100ms on your machine, how long would a 1600×1600 multiplication take? How does this explain why training large neural networks requires GPUs with thousands of cores? - -*Think*: 1600 = 4 × 400, so computation = 4³ = 64× longer - -### Question 3: Memory Bandwidth vs Compute Power -**Context**: Your Tensor operations are limited by how fast data moves between RAM and CPU, not just raw computational power. - -**Architecture Question**: Why might element-wise operations (like tensor + tensor) be slower per operation than matrix multiplication, even though addition is simpler than dot products? How do modern ML accelerators (GPUs, TPUs) address this? - -*Hint*: Consider the ratio of data movement to computation work -""" - - -# %% [markdown] -""" -## 🎯 MODULE SUMMARY: Tensor Foundation Complete! - -Congratulations! 
You've built the fundamental data structure that powers neural networks. - -### What You've Accomplished -✅ **Core Tensor Class**: Complete N-dimensional array implementation wrapping NumPy's optimized operations -✅ **Broadcasting Arithmetic**: Element-wise operations (+, -, *, /) with automatic shape handling -✅ **Matrix Operations**: O(N³) matrix multiplication with @ operator and comprehensive shape validation -✅ **Memory-Efficient Shape Manipulation**: Reshape and transpose operations using views when possible -✅ **Systems Analysis**: Performance profiling revealing scaling characteristics and memory patterns -✅ **Production-Ready Testing**: Unit tests with immediate validation and clear error messages - -### Key Learning Outcomes -- **Tensor Fundamentals**: N-dimensional arrays as the foundation of ML -- **NumPy Integration**: Leveraging optimized numerical computing -- **Clean API Design**: Operations that mirror PyTorch and TensorFlow patterns -- **Testing Approach**: Immediate validation after each implementation - -### Ready for Next Steps -Your pure tensor implementation enables: -- **Module 02 (Activations)**: Add nonlinear functions using clean tensor operations -- **Modules 03-04**: Build layers and losses with focused tensor operations -- **Module 05 (Autograd)**: Will extend this foundation with gradient tracking -- **Real ML Work**: Handle numerical computations with a clean, extensible foundation - -### Export Your Work -1. **Module validation**: Complete with `test_module()` comprehensive testing -2. **Export to package**: `tito module complete 01_tensor` -3. **Integration**: Your code becomes `tinytorch.core.tensor.Tensor` -4. **Next module**: Ready for activation functions! - -**Achievement unlocked**: You've built the foundation of modern AI systems! 
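As a minimal back-of-envelope sketch for the systems questions posed above (the shapes and the 100 ms timing are the hypothetical figures quoted in those questions, not measurements):

```python
# Question 1: memory of one [4096, 11008] float32 weight matrix (Llama-2 7B MLP shape)
rows, cols = 4096, 11008
bytes_fp32 = rows * cols * 4          # 4 bytes per float32 element
print(f"{bytes_fp32 / 1e9:.3f} GB")   # ~0.180 GB; float16 halves it, int8 quarters it

# Question 2: O(N^3) matmul scaling from a hypothetical 100 ms at 400x400
scale = (1600 / 400) ** 3             # 4^3 = 64x more work
print(f"{100 * scale:.0f} ms")        # ~6400 ms for 1600x1600
```

Across the dozens of such layers in a real model, those per-layer numbers are exactly why quantization and massive GPU parallelism are standard practice.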
-""" \ No newline at end of file diff --git a/modules_old/02_activations/README.md b/modules_old/02_activations/README.md deleted file mode 100644 index 4b1e9618..00000000 --- a/modules_old/02_activations/README.md +++ /dev/null @@ -1,188 +0,0 @@ -# 🔥 Module: Activations - -## 📊 Module Info -- **Difficulty**: ⭐⭐ Intermediate -- **Time Estimate**: 3-4 hours -- **Prerequisites**: Tensor module -- **Next Steps**: Layers module - -Welcome to the **Activations** module! This is where you'll implement the mathematical functions that give neural networks their power to learn complex patterns. Without activation functions, neural networks would just be linear transformations—with them, you unlock the ability to learn any function. - -## 🎯 Learning Objectives - -By the end of this module, you will be able to: - -- **Understand the critical role** of activation functions in enabling neural networks to learn non-linear patterns -- **Implement the two essential activation functions**: ReLU and Softmax with proper numerical stability -- **Apply mathematical reasoning** to understand function properties, ranges, and appropriate use cases -- **Debug and test** activation implementations using both automated tests and visual analysis -- **Connect theory to practice** by understanding when and why to use each activation function - -## 🧠 Build → Use → Analyze - -This module follows TinyTorch's **Build → Use → Analyze** framework: - -1. **Build**: Implement ReLU and Softmax activation functions with numerical stability -2. **Use**: Apply these functions in testing scenarios and visualize their mathematical behavior -3. 
**Analyze**: Understand why these two functions power 90% of modern deep learning

-## 📚 What You'll Build
-
-### 🎯 **STREAMLINED: Focus on What Matters**
-```python
-# ReLU: The workhorse of deep learning
-relu = ReLU()
-output = relu(Tensor([-2, -1, 0, 1, 2]))  # [0, 0, 0, 1, 2]
-
-# Softmax: Multi-class probability distribution
-softmax = Softmax()
-output = softmax(Tensor([1.0, 2.0, 3.0]))  # [0.09, 0.24, 0.67] (sums to 1.0)
-```
-
-### ReLU (Rectified Linear Unit) - 80% of Hidden Layers
-- **Formula**: `f(x) = max(0, x)`
-- **Properties**: Simple, sparse, fast, prevents vanishing gradients
-- **Why Essential**: Powers all modern CNNs, Transformers, ResNets
-- **Use Cases**: Hidden layers in the vast majority of architectures
-
-### Softmax - Multi-Class Classification
-- **Formula**: `f(x_i) = e^(x_i) / Σ(e^(x_j))`
-- **Properties**: Outputs sum to 1.0, probability interpretation
-- **Why Essential**: Final layer for classification, attention weights
-- **Use Cases**: Classification output, attention mechanisms
-
-### 🧠 **Why Just Two Functions?**
-- **ReLU**: Solves vanishing gradients, enables deep networks, computationally efficient
-- **Softmax**: Converts logits to probabilities, differentiable, temperature control
-- **90% Coverage**: These two functions appear in virtually every modern architecture
-- **Simplicity**: Focus on mastering essential concepts rather than memorizing many variants
-
-## 🚀 Getting Started
-
-### Prerequisites
-Ensure you have completed the tensor module and understand basic tensor operations:
-
-```bash
-# Activate TinyTorch environment
-source bin/activate-tinytorch.sh
-
-# Verify tensor module is working
-tito test --module tensor
-```
-
-### Development Workflow
-1. **Open the development file**: `modules/source/03_activations/activations_dev.py`
-2. **Implement functions progressively**: Start with ReLU, then Softmax (paying attention to numerical stability)
-3. **Test each implementation**: Use inline tests for immediate feedback
-4.
**Visualize function behavior**: Leverage plotting sections for mathematical understanding
-5. **Export and verify**: `tito export --module activations && tito test --module activations`
-
-## 🧪 Testing Your Implementation
-
-### Comprehensive Test Suite
-Run the full test suite to verify mathematical correctness:
-
-```bash
-# TinyTorch CLI (recommended)
-tito test --module activations
-
-# Direct pytest execution
-python -m pytest tests/ -k activations -v
-```
-
-### Test Coverage Areas
-- ✅ **Mathematical Correctness**: Verify function outputs match expected mathematical formulas
-- ✅ **Numerical Stability**: Test with extreme values and edge cases
-- ✅ **Shape Preservation**: Ensure input and output tensors have identical shapes
-- ✅ **Range Validation**: Confirm outputs fall within expected ranges
-- ✅ **Integration Testing**: Verify compatibility with tensor operations
-
-### Inline Testing & Visualization
-The module includes comprehensive educational feedback:
-```
-# Example inline test output
-🔬 Unit Test: ReLU activation...
-✅ ReLU handles negative inputs correctly
-✅ ReLU preserves positive inputs
-✅ ReLU output range is [0, ∞)
-📈 Progress: ReLU ✓
-
-# Visual feedback with plotting
-📊 Plotting ReLU behavior across range [-5, 5]...
-📈 Function visualization shows expected behavior
-```
-
-### Manual Testing Examples
-```python
-from tinytorch.core.tensor import Tensor
-from activations_dev import ReLU, Softmax
-
-# Test with various inputs
-x = Tensor([[-2.0, -1.0, 0.0, 1.0, 2.0]])
-
-relu = ReLU()
-softmax = Softmax()
-
-print("Input:", x.data)
-print("ReLU:", relu(x).data)       # [0, 0, 0, 1, 2]
-print("Softmax:", softmax(x).data) # [0.01, 0.03, 0.09, 0.23, 0.64] (sums to 1.0)
-```
-
-## 🎯 Key Concepts
-
-### Real-World Applications
-- **Computer Vision**: ReLU activations enable CNNs to learn hierarchical features (like those in ResNet, VGG)
-- **Natural Language Processing**: Softmax normalizes attention scores in Transformer attention mechanisms
-- **Classification Systems**: Softmax converts raw logits into class probability distributions
-- **Language Models**: Softmax turns output logits into next-token probabilities
-
-### Mathematical Properties Comparison
-| Function | Input Range | Output Range | Zero Point | Key Property |
-|----------|-------------|--------------|------------|--------------|
-| ReLU | (-∞, ∞) | [0, ∞) | f(0) = 0 | Sparse, unbounded |
-| Softmax | (-∞, ∞) | (0, 1) | uniform for equal inputs | Outputs sum to 1.0 |
-
-### Numerical Stability Considerations
-- **ReLU**: No stability issues (simple max operation)
-- **Softmax**: Requires subtracting the maximum input before `exp()` to prevent overflow
-
-### Performance and Gradient Properties
-- **ReLU**: Fastest computation, sparse gradients, can cause "dying ReLU" problem
-- **Softmax**: More expensive (exponentials plus normalization), and its gradient couples all outputs, which is exactly what multi-class losses need
-
-## 🎉 Ready to Build?
- -The activations module is where neural networks truly come alive! You're about to implement the mathematical functions that transform simple linear operations into powerful pattern recognition systems. - -Every major breakthrough in deep learning—from image recognition to language models—relies on the functions you're about to build. Take your time, understand the mathematics, and enjoy creating the foundation of intelligent systems! - -```{grid} 3 -:gutter: 3 -:margin: 2 - -{grid-item-card} 🚀 Launch Builder -:link: https://mybinder.org/v2/gh/VJProductions/TinyTorch/main?filepath=modules/source/03_activations/activations_dev.py -:class-title: text-center -:class-body: text-center - -Interactive development environment - -{grid-item-card} 📓 Open in Colab -:link: https://colab.research.google.com/github/VJProductions/TinyTorch/blob/main/modules/source/03_activations/activations_dev.ipynb -:class-title: text-center -:class-body: text-center - -Google Colab notebook - -{grid-item-card} 👀 View Source -:link: https://github.com/VJProductions/TinyTorch/blob/main/modules/source/03_activations/activations_dev.py -:class-title: text-center -:class-body: text-center - -Browse the code on GitHub -``` diff --git a/modules_old/02_activations/activations_dev.ipynb b/modules_old/02_activations/activations_dev.ipynb deleted file mode 100644 index 6bc5d8df..00000000 --- a/modules_old/02_activations/activations_dev.ipynb +++ /dev/null @@ -1,2298 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "1d5eba9f", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "# Activations - Nonlinearity and Neural Network Intelligence\n", - "\n", - "Welcome to the Activations module! 
You'll implement the functions that give neural networks their power to learn complex patterns through nonlinearity.\n", - "\n", - "## Learning Goals\n", - "- Systems understanding: Why linear operations alone cannot solve complex problems and how nonlinearity enables universal approximation\n", - "- Core implementation skill: Build the four essential activation functions that power modern neural networks\n", - "- Pattern recognition: Understand how different activations affect gradient flow and learning dynamics\n", - "- Framework connection: See how your implementations match PyTorch's optimized activation functions\n", - "- Performance insight: Learn why activation choice affects both forward pass speed and gradient computation efficiency\n", - "\n", - "## Build → Use → Reflect\n", - "1. **Build**: ReLU, Sigmoid, Tanh, and Softmax activation functions with proper numerical stability\n", - "2. **Use**: Transform real tensor data and observe how different activations affect output distributions\n", - "3. 
**Reflect**: Why does activation function choice determine whether deep networks can train successfully?\n", - "\n", - "## What You'll Achieve\n", - "By the end of this module, you'll understand:\n", - "- Deep technical understanding of how nonlinear functions enable neural networks to approximate any continuous function\n", - "- Practical capability to implement numerically stable activation functions that avoid overflow and underflow\n", - "- Systems insight into why activation choice affects gradient flow and determines trainable network depth\n", - "- Performance consideration of how activation complexity affects forward and backward pass computational cost\n", - "- Connection to production ML systems and why modern frameworks provide dozens of activation variants\n", - "\n", - "## Systems Reality Check\n", - "💡 **Production Context**: PyTorch implements activations as both functions and modules, with CUDA kernels for GPU acceleration - your implementation reveals the mathematical foundations\n", - "⚡ **Performance Note**: ReLU is popular partly because it's computationally cheap (just max(0,x)), while Softmax requires expensive exponentials - activation choice affects training speed" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f485ca1e", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "activations-imports", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "#| default_exp core.activations\n", - "\n", - "#| export\n", - "import math\n", - "import numpy as np\n", - "import os\n", - "import sys\n", - "from typing import Union, List\n", - "\n", - "# Import our Tensor class - try from package first, then from local module\n", - "try:\n", - " from tinytorch.core.tensor import Tensor\n", - "except ImportError:\n", - " # For development, import from local tensor module\n", - " sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor'))\n", 
- " from tensor_dev import Tensor\n", - "\n", - "# Import Variable for autograd support\n", - "try:\n", - " from tinytorch.core.autograd import Variable\n", - "except ImportError:\n", - " # For development, import from local autograd module\n", - " sys.path.append(os.path.join(os.path.dirname(__file__), '..', '09_autograd'))\n", - " from autograd_dev import Variable" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c25ae2c4", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "activations-welcome", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "print(\"🔥 TinyTorch Activations Module\")\n", - "print(f\"NumPy version: {np.__version__}\")\n", - "print(f\"Python version: {sys.version_info.major}.{sys.version_info.minor}\")\n", - "print(\"Ready to build activation functions!\")" - ] - }, - { - "cell_type": "markdown", - "id": "8e00c5cd", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 📦 Where This Code Lives in the Final Package\n", - "\n", - "**Learning Side:** You work in `modules/source/02_activations/activations_dev.py` \n", - "**Building Side:** Code exports to `tinytorch.core.activations`\n", - "\n", - "```python\n", - "# Final package structure:\n", - "from tinytorch.core.activations import ReLU, Sigmoid, Tanh, Softmax\n", - "from tinytorch.core.tensor import Tensor # Foundation\n", - "from tinytorch.core.layers import Dense # Uses activations\n", - "```\n", - "\n", - "**Why this matters:**\n", - "- **Learning:** Focused modules for deep understanding\n", - "- **Production:** Proper organization like PyTorch's `torch.nn.ReLU`\n", - "- **Consistency:** All activation functions live together in `core.activations`\n", - "- **Integration:** Works seamlessly with tensors and layers" - ] - }, - { - "cell_type": "markdown", - "id": "61551128", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## What Are Activation 
Functions?\n", - "\n", - "### The Problem: Linear Limitations\n", - "Without activation functions, neural networks can only learn linear relationships:\n", - "```\n", - "y = W₁ · (W₂ · (W₃ · x + b₃) + b₂) + b₁\n", - "```\n", - "\n", - "This simplifies to just:\n", - "```\n", - "y = W_combined · x + b_combined\n", - "```\n", - "\n", - "**A single linear function!** No matter how many layers you add, you can't learn complex patterns like:\n", - "- Image recognition (nonlinear pixel relationships)\n", - "- Language understanding (nonlinear word relationships) \n", - "- Game playing (nonlinear strategy relationships)\n", - "\n", - "### The Solution: Nonlinearity\n", - "Activation functions add nonlinearity between layers:\n", - "```\n", - "y = W₁ · f(W₂ · f(W₃ · x + b₃) + b₂) + b₁\n", - "```\n", - "\n", - "Now each layer can learn complex transformations!\n", - "\n", - "### Real-World Impact\n", - "- **Before activations**: Only linear classifiers (logistic regression)\n", - "- **After activations**: Complex pattern recognition (deep learning revolution)\n", - "\n", - "### What We'll Build\n", - "1. **ReLU**: The foundation of modern deep learning\n", - "2. **Sigmoid**: Classic activation for binary classification\n", - "3. **Tanh**: Centered activation for better gradients\n", - "4. 
**Softmax**: Probability distributions for multi-class classification" - ] - }, - { - "cell_type": "markdown", - "id": "68f87aff", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🔧 DEVELOPMENT" - ] - }, - { - "cell_type": "markdown", - "id": "a3cda507", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Step 1: ReLU - The Foundation of Deep Learning\n", - "\n", - "### What is ReLU?\n", - "**ReLU (Rectified Linear Unit)** is the most important activation function in deep learning:\n", - "\n", - "```\n", - "f(x) = max(0, x)\n", - "```\n", - "\n", - "- **Positive inputs**: Pass through unchanged\n", - "- **Negative inputs**: Become zero\n", - "- **Zero**: Stays zero\n", - "\n", - "### Why ReLU Revolutionized Deep Learning\n", - "1. **Computational efficiency**: Just a max operation\n", - "2. **No vanishing gradients**: Derivative is 1 for positive values\n", - "3. **Sparsity**: Many neurons output exactly 0\n", - "4. **Empirical success**: Works well in practice\n", - "\n", - "### Visual Understanding\n", - "```\n", - "Input: [-2, -1, 0, 1, 2]\n", - "ReLU: [ 0, 0, 0, 1, 2]\n", - "```\n", - "\n", - "### Real-World Applications\n", - "- **Image classification**: ResNet, VGG, AlexNet\n", - "- **Object detection**: YOLO, R-CNN\n", - "- **Language models**: Transformer feedforward layers\n", - "- **Recommendation**: Deep collaborative filtering\n", - "\n", - "### Mathematical Properties\n", - "- **Derivative**: f'(x) = 1 if x > 0, else 0\n", - "- **Range**: [0, ∞)\n", - "- **Sparsity**: Outputs exactly 0 for negative inputs" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "68917114", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "relu-class", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class ReLU:\n", - " \"\"\"\n", - " ReLU Activation Function: f(x) = 
max(0, x)\n", - " \n", - " The most popular activation function in deep learning.\n", - " Simple, fast, and effective for most applications.\n", - " \"\"\"\n", - " \n", - " def forward(self, x):\n", - " \"\"\"\n", - " Apply ReLU activation: f(x) = max(0, x)\n", - " \n", - " Now supports both Tensor and Variable inputs with automatic differentiation.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Check if input is Variable (for autograd) or Tensor\n", - " 2. For each element in the input tensor, apply max(0, element)\n", - " 3. If input is Variable: create Variable output with proper gradient function\n", - " 4. If input is Tensor: return Tensor as before\n", - " \n", - " MATHEMATICAL FOUNDATION:\n", - " - Forward: f(x) = max(0, x)\n", - " - Backward: f'(x) = 1 if x > 0, else 0\n", - " \n", - " EXAMPLE USAGE:\n", - " ```python\n", - " relu = ReLU()\n", - " # With Tensor (no gradients)\n", - " tensor_input = Tensor([[-2, -1, 0, 1, 2]])\n", - " tensor_output = relu(tensor_input)\n", - " \n", - " # With Variable (with gradients)\n", - " var_input = Variable([[-2, -1, 0, 1, 2]], requires_grad=True)\n", - " var_output = relu(var_input)\n", - " var_output.backward()\n", - " print(var_input.grad) # Gradients: [0, 0, 0, 1, 1]\n", - " ```\n", - " \n", - " IMPLEMENTATION HINTS:\n", - " - Check type with hasattr(x, 'requires_grad')\n", - " - For Variables: implement gradient function for backward pass\n", - " - ReLU gradient: 1 where input > 0, 0 elsewhere\n", - " - Use np.maximum(0, x.data) for forward pass\n", - " \n", - " LEARNING CONNECTIONS:\n", - " - This is like torch.nn.ReLU() in PyTorch with autograd support\n", - " - Enables gradient-based training of neural networks\n", - " - ReLU's simple gradient (0 or 1) prevents vanishing gradients\n", - " - Creates sparse representations and efficient gradient flow\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Check if input is a Variable (autograd-enabled)\n", - " if hasattr(x, 'requires_grad') and hasattr(x, 
'grad_fn'):\n", - " # Input is a Variable - preserve autograd capabilities\n", - " \n", - " # Forward pass: ReLU activation\n", - " input_data = x.data.data if hasattr(x.data, 'data') else x.data\n", - " output_data = np.maximum(0, input_data)\n", - " \n", - " # Create gradient function for backward pass\n", - " def relu_grad_fn(grad_output):\n", - " if x.requires_grad:\n", - " # ReLU gradient: 1 where input > 0, 0 elsewhere\n", - " relu_mask = (input_data > 0).astype(np.float32)\n", - " grad_input_data = grad_output.data.data * relu_mask\n", - " grad_input = Variable(grad_input_data)\n", - " x.backward(grad_input)\n", - " \n", - " # Return Variable with gradient function\n", - " requires_grad = x.requires_grad\n", - " result = Variable(output_data, requires_grad=requires_grad, grad_fn=relu_grad_fn if requires_grad else None)\n", - " return result\n", - " else:\n", - " # Input is a Tensor - use original implementation\n", - " result = np.maximum(0, x.data)\n", - " return type(x)(result)\n", - " ### END SOLUTION\n", - " \n", - " def __call__(self, x):\n", - " \"\"\"Make the class callable: relu(x) instead of relu.forward(x)\"\"\"\n", - " return self.forward(x)" - ] - }, - { - "cell_type": "markdown", - "id": "bc838f6b", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Test Your ReLU Implementation\n", - "\n", - "Once you implement the ReLU forward method above, run this cell to test it:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "25caa304", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "test-relu-immediate", - "locked": true, - "points": 10, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_unit_relu_activation():\n", - " \"\"\"Unit test for the ReLU activation function.\"\"\"\n", - " print(\"🔬 Unit Test: ReLU Activation...\")\n", - "\n", - " # Create ReLU instance\n", - " relu = 
ReLU()\n", - "\n", - " # Test with mixed positive/negative values\n", - " test_input = Tensor([[-2, -1, 0, 1, 2]])\n", - " result = relu(test_input)\n", - " expected = np.array([[0, 0, 0, 1, 2]])\n", - " \n", - " assert np.array_equal(result.data, expected), f\"ReLU failed: expected {expected}, got {result.data}\"\n", - " \n", - " # Test that negative values become zero\n", - " assert np.all(result.data >= 0), \"ReLU should make all negative values zero\"\n", - " \n", - " # Test that positive values remain unchanged\n", - " positive_input = Tensor([[1, 2, 3, 4, 5]])\n", - " positive_result = relu(positive_input)\n", - " assert np.array_equal(positive_result.data, positive_input.data), \"ReLU should preserve positive values\"\n", - " \n", - " # Test with 2D tensor\n", - " matrix_input = Tensor([[-1, 2], [3, -4]])\n", - " matrix_result = relu(matrix_input)\n", - " matrix_expected = np.array([[0, 2], [3, 0]])\n", - " assert np.array_equal(matrix_result.data, matrix_expected), \"ReLU should work with 2D tensors\"\n", - " \n", - " # Test shape preservation\n", - " assert matrix_result.shape == matrix_input.shape, \"ReLU should preserve input shape\"\n", - " \n", - " print(\"✅ ReLU activation tests passed!\")\n", - " print(f\"✅ Negative values correctly zeroed\")\n", - " print(f\"✅ Positive values preserved\")\n", - " print(f\"✅ Shape preservation working\")\n", - " print(f\"✅ Works with multi-dimensional tensors\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "d0337f0c", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Step 2: Sigmoid - Classic Binary Classification\n", - "\n", - "### What is Sigmoid?\n", - "**Sigmoid** is the classic activation function that maps any real number to (0, 1):\n", - "\n", - "```\n", - "f(x) = 1 / (1 + e^(-x))\n", - "```\n", - "\n", - "### Why Sigmoid Matters\n", - "1. 
**Probability interpretation**: Outputs between 0 and 1\n", - "2. **Smooth gradients**: Differentiable everywhere\n", - "3. **Historical importance**: Enabled early neural networks\n", - "4. **Binary classification**: Perfect for yes/no decisions\n", - "\n", - "### Visual Understanding\n", - "```\n", - "Input: [-∞, -2, -1, 0, 1, 2, ∞]\n", - "Sigmoid:[0, 0.12, 0.27, 0.5, 0.73, 0.88, 1]\n", - "```\n", - "\n", - "### Real-World Applications\n", - "- **Binary classification**: Spam detection, medical diagnosis\n", - "- **Gating mechanisms**: LSTM and GRU cells\n", - "- **Output layers**: When you need probabilities\n", - "- **Attention mechanisms**: Where to focus attention\n", - "\n", - "### Mathematical Properties\n", - "- **Range**: (0, 1)\n", - "- **Derivative**: f'(x) = f(x) · (1 - f(x))\n", - "- **Centered**: f(0) = 0.5\n", - "- **Symmetric**: f(-x) = 1 - f(x)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "af428bf1", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "sigmoid-class", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class Sigmoid:\n", - " \"\"\"\n", - " Sigmoid Activation Function: f(x) = 1 / (1 + e^(-x))\n", - " \n", - " Maps any real number to the range (0, 1).\n", - " Useful for binary classification and probability outputs.\n", - " \"\"\"\n", - " \n", - " def forward(self, x):\n", - " \"\"\"\n", - " Apply Sigmoid activation: f(x) = 1 / (1 + e^(-x))\n", - " \n", - " Now supports both Tensor and Variable inputs with automatic differentiation.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Check if input is Variable (for autograd) or Tensor\n", - " 2. Compute sigmoid: 1 / (1 + exp(-x))\n", - " 3. If input is Variable: create Variable output with proper gradient function\n", - " 4. 
If input is Tensor: return Tensor as before\n", - " \n", - " MATHEMATICAL FOUNDATION:\n", - " - Forward: f(x) = 1 / (1 + e^(-x))\n", - " - Backward: f'(x) = f(x) * (1 - f(x)) = sigmoid(x) * (1 - sigmoid(x))\n", - " \n", - " EXAMPLE USAGE:\n", - " ```python\n", - " sigmoid = Sigmoid()\n", - " # With Variable (with gradients)\n", - " var_input = Variable([[0.0]], requires_grad=True)\n", - " var_output = sigmoid(var_input) # 0.5\n", - " var_output.backward()\n", - " print(var_input.grad) # 0.25 = 0.5 * (1 - 0.5)\n", - " ```\n", - " \n", - " IMPLEMENTATION HINTS:\n", - " - Check type with hasattr(x, 'requires_grad')\n", - " - For Variables: implement gradient function for backward pass\n", - " - Sigmoid gradient: sigmoid(x) * (1 - sigmoid(x))\n", - " - Use numerical stability: clip inputs to prevent overflow\n", - " \n", - " LEARNING CONNECTIONS:\n", - " - This is like torch.nn.Sigmoid() in PyTorch with autograd support\n", - " - Used in binary classification and gating mechanisms\n", - " - Smooth gradients enable stable training\n", - " - Self-normalizing gradient (max at x=0, decreases at extremes)\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Check if input is a Variable (autograd-enabled)\n", - " if hasattr(x, 'requires_grad') and hasattr(x, 'grad_fn'):\n", - " # Input is a Variable - preserve autograd capabilities\n", - " \n", - " # Forward pass: Sigmoid activation with numerical stability\n", - " input_data = x.data.data if hasattr(x.data, 'data') else x.data\n", - " clipped_input = np.clip(-input_data, -500, 500)\n", - " output_data = 1 / (1 + np.exp(clipped_input))\n", - " \n", - " # Create gradient function for backward pass\n", - " def sigmoid_grad_fn(grad_output):\n", - " if x.requires_grad:\n", - " # Sigmoid gradient: sigmoid(x) * (1 - sigmoid(x))\n", - " sigmoid_grad = output_data * (1 - output_data)\n", - " grad_input_data = grad_output.data.data * sigmoid_grad\n", - " grad_input = Variable(grad_input_data)\n", - " x.backward(grad_input)\n", - " 
\n", - " # Return Variable with gradient function\n", - " requires_grad = x.requires_grad\n", - " result = Variable(output_data, requires_grad=requires_grad, grad_fn=sigmoid_grad_fn if requires_grad else None)\n", - " return result\n", - " else:\n", - " # Input is a Tensor - use original implementation\n", - " clipped_input = np.clip(-x.data, -500, 500)\n", - " result = 1 / (1 + np.exp(clipped_input))\n", - " return type(x)(result)\n", - " ### END SOLUTION\n", - " \n", - " def __call__(self, x):\n", - " \"\"\"Make the class callable: sigmoid(x) instead of sigmoid.forward(x)\"\"\"\n", - " return self.forward(x)" - ] - }, - { - "cell_type": "markdown", - "id": "740bc3f9", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Test Your Sigmoid Implementation\n", - "\n", - "Once you implement the Sigmoid forward method above, run this cell to test it:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c06cff30", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "test-sigmoid-immediate", - "locked": true, - "points": 10, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_unit_sigmoid_activation():\n", - " \"\"\"Unit test for the Sigmoid activation function.\"\"\"\n", - " print(\"🔬 Unit Test: Sigmoid Activation...\")\n", - "\n", - "# Create Sigmoid instance\n", - " sigmoid = Sigmoid()\n", - "\n", - " # Test with known values\n", - " test_input = Tensor([[0]])\n", - " result = sigmoid(test_input)\n", - " expected = 0.5\n", - " \n", - " assert abs(result.data[0][0] - expected) < 1e-6, f\"Sigmoid(0) should be 0.5, got {result.data[0][0]}\"\n", - " \n", - " # Test with positive and negative values\n", - " test_input = Tensor([[-2, -1, 0, 1, 2]])\n", - " result = sigmoid(test_input)\n", - " \n", - " # Check that all values are between 0 and 1\n", - " assert np.all(result.data > 0), \"Sigmoid output should be > 
0\"\n", - " assert np.all(result.data < 1), \"Sigmoid output should be < 1\"\n", - " \n", - " # Test symmetry: sigmoid(-x) = 1 - sigmoid(x)\n", - " x_val = 1.0\n", - " pos_result = sigmoid(Tensor([[x_val]]))\n", - " neg_result = sigmoid(Tensor([[-x_val]]))\n", - " symmetry_check = abs(pos_result.data[0][0] + neg_result.data[0][0] - 1.0)\n", - " assert symmetry_check < 1e-6, \"Sigmoid should be symmetric around 0.5\"\n", - " \n", - " # Test with 2D tensor\n", - " matrix_input = Tensor([[-1, 1], [0, 2]])\n", - " matrix_result = sigmoid(matrix_input)\n", - " assert matrix_result.shape == matrix_input.shape, \"Sigmoid should preserve shape\"\n", - " \n", - " # Test extreme values (should not overflow)\n", - " extreme_input = Tensor([[-100, 100]])\n", - " extreme_result = sigmoid(extreme_input)\n", - " assert not np.any(np.isnan(extreme_result.data)), \"Sigmoid should handle extreme values\"\n", - " assert not np.any(np.isinf(extreme_result.data)), \"Sigmoid should not produce inf values\"\n", - " \n", - " print(\"✅ Sigmoid activation tests passed!\")\n", - " print(f\"✅ Outputs correctly bounded between 0 and 1\")\n", - " print(f\"✅ Symmetric property verified\")\n", - " print(f\"✅ Handles extreme values without overflow\")\n", - " print(f\"✅ Shape preservation working\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "a56a7c1d", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Step 3: Tanh - Centered Activation\n", - "\n", - "### What is Tanh?\n", - "**Tanh (Hyperbolic Tangent)** is similar to sigmoid but centered around zero:\n", - "\n", - "```\n", - "f(x) = (e^x - e^(-x)) / (e^x + e^(-x))\n", - "```\n", - "\n", - "### Why Tanh is Better Than Sigmoid\n", - "1. **Zero-centered**: Outputs range from -1 to 1\n", - "2. **Better gradients**: Helps with gradient flow in deep networks\n", - "3. **Faster convergence**: Less bias shift during training\n", - "4. 
**Stronger gradients**: Maximum gradient is 1 vs 0.25 for sigmoid\n", - "\n", - "### Visual Understanding\n", - "```\n", - "Input: [-∞, -2, -1, 0, 1, 2, ∞]\n", - "Tanh: [-1, -0.96, -0.76, 0, 0.76, 0.96, 1]\n", - "```\n", - "\n", - "### Real-World Applications\n", - "- **Hidden layers**: Better than sigmoid for internal activations\n", - "- **RNN cells**: Classic RNN and LSTM use tanh\n", - "- **Normalization**: When you need zero-centered outputs\n", - "- **Feature scaling**: Maps inputs to [-1, 1] range\n", - "\n", - "### Mathematical Properties\n", - "- **Range**: (-1, 1)\n", - "- **Derivative**: f'(x) = 1 - f(x)²\n", - "- **Zero-centered**: f(0) = 0\n", - "- **Antisymmetric**: f(-x) = -f(x)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6f6fb698", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "tanh-class", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class Tanh:\n", - " \"\"\"\n", - " Tanh Activation Function: f(x) = (e^x - e^(-x)) / (e^x + e^(-x))\n", - " \n", - " Zero-centered activation function with range (-1, 1).\n", - " Better gradient properties than sigmoid.\n", - " \"\"\"\n", - " \n", - " def forward(self, x):\n", - " \"\"\"\n", - " Apply Tanh activation: f(x) = (e^x - e^(-x)) / (e^x + e^(-x))\n", - " \n", - " Now supports both Tensor and Variable inputs with automatic differentiation.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Check if input is Variable (for autograd) or Tensor\n", - " 2. Compute tanh: (e^x - e^(-x)) / (e^x + e^(-x))\n", - " 3. If input is Variable: create Variable output with proper gradient function\n", - " 4. 
If input is Tensor: return Tensor as before\n", - " \n", - " MATHEMATICAL FOUNDATION:\n", - " - Forward: f(x) = tanh(x)\n", - " - Backward: f'(x) = 1 - tanh²(x) = 1 - f(x)²\n", - " \n", - " EXAMPLE USAGE:\n", - " ```python\n", - " tanh = Tanh()\n", - " # With Variable (with gradients)\n", - " var_input = Variable([[0.0]], requires_grad=True)\n", - " var_output = tanh(var_input) # 0.0\n", - " var_output.backward()\n", - " print(var_input.grad) # 1.0 = 1 - 0²\n", - " ```\n", - " \n", - " IMPLEMENTATION HINTS:\n", - " - Check type with hasattr(x, 'requires_grad')\n", - " - For Variables: implement gradient function for backward pass\n", - " - Tanh gradient: 1 - tanh²(x)\n", - " - Use np.tanh() for numerical stability\n", - " \n", - " LEARNING CONNECTIONS:\n", - " - This is like torch.nn.Tanh() in PyTorch with autograd support\n", - " - Used in RNN, LSTM, and GRU cells\n", - " - Zero-centered outputs improve gradient flow\n", - " - Strong gradients near zero, weaker at extremes\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Check if input is a Variable (autograd-enabled)\n", - " if hasattr(x, 'requires_grad') and hasattr(x, 'grad_fn'):\n", - " # Input is a Variable - preserve autograd capabilities\n", - " \n", - " # Forward pass: Tanh activation\n", - " input_data = x.data.data if hasattr(x.data, 'data') else x.data\n", - " output_data = np.tanh(input_data)\n", - " \n", - " # Create gradient function for backward pass\n", - " def tanh_grad_fn(grad_output):\n", - " if x.requires_grad:\n", - " # Tanh gradient: 1 - tanh²(x)\n", - " tanh_grad = 1 - output_data ** 2\n", - " grad_input_data = grad_output.data.data * tanh_grad\n", - " grad_input = Variable(grad_input_data)\n", - " x.backward(grad_input)\n", - " \n", - " # Return Variable with gradient function\n", - " requires_grad = x.requires_grad\n", - " result = Variable(output_data, requires_grad=requires_grad, grad_fn=tanh_grad_fn if requires_grad else None)\n", - " return result\n", - " else:\n", - " # Input is a 
Tensor - use original implementation\n", - " result = np.tanh(x.data)\n", - " return type(x)(result)\n", - " ### END SOLUTION\n", - " \n", - " def __call__(self, x):\n", - " \"\"\"Make the class callable: tanh(x) instead of tanh.forward(x)\"\"\"\n", - " return self.forward(x)" - ] - }, - { - "cell_type": "markdown", - "id": "10d063ce", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Test Your Tanh Implementation\n", - "\n", - "Once you implement the Tanh forward method above, run this cell to test it:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d5394a0d", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "test-tanh-immediate", - "locked": true, - "points": 10, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_unit_tanh_activation():\n", - " \"\"\"Unit test for the Tanh activation function.\"\"\"\n", - " print(\"🔬 Unit Test: Tanh Activation...\")\n", - "\n", - " # Create Tanh instance\n", - " tanh = Tanh()\n", - "\n", - " # Test with zero (should be 0)\n", - " test_input = Tensor([[0]])\n", - " result = tanh(test_input)\n", - " expected = 0.0\n", - " \n", - " assert abs(result.data[0][0] - expected) < 1e-6, f\"Tanh(0) should be 0, got {result.data[0][0]}\"\n", - " \n", - " # Test with positive and negative values\n", - " test_input = Tensor([[-2, -1, 0, 1, 2]])\n", - " result = tanh(test_input)\n", - " \n", - " # Check that all values are between -1 and 1\n", - " assert np.all(result.data > -1), \"Tanh output should be > -1\"\n", - " assert np.all(result.data < 1), \"Tanh output should be < 1\"\n", - " \n", - " # Test antisymmetry: tanh(-x) = -tanh(x)\n", - " x_val = 1.5\n", - " pos_result = tanh(Tensor([[x_val]]))\n", - " neg_result = tanh(Tensor([[-x_val]]))\n", - " antisymmetry_check = abs(pos_result.data[0][0] + neg_result.data[0][0])\n", - " assert antisymmetry_check 
< 1e-6, \"Tanh should be antisymmetric\"\n", - " \n", - " # Test with 2D tensor\n", - " matrix_input = Tensor([[-1, 1], [0, 2]])\n", - " matrix_result = tanh(matrix_input)\n", - " assert matrix_result.shape == matrix_input.shape, \"Tanh should preserve shape\"\n", - " \n", - " # Test extreme values (should not overflow)\n", - " extreme_input = Tensor([[-100, 100]])\n", - " extreme_result = tanh(extreme_input)\n", - " assert not np.any(np.isnan(extreme_result.data)), \"Tanh should handle extreme values\"\n", - " assert not np.any(np.isinf(extreme_result.data)), \"Tanh should not produce inf values\"\n", - " \n", - " # Test that extreme values approach ±1\n", - " assert abs(extreme_result.data[0][0] - (-1)) < 1e-6, \"Tanh(-∞) should approach -1\"\n", - " assert abs(extreme_result.data[0][1] - 1) < 1e-6, \"Tanh(∞) should approach 1\"\n", - " \n", - " print(\"✅ Tanh activation tests passed!\")\n", - " print(f\"✅ Outputs correctly bounded between -1 and 1\")\n", - " print(f\"✅ Antisymmetric property verified\")\n", - " print(f\"✅ Zero-centered (tanh(0) = 0)\")\n", - " print(f\"✅ Handles extreme values correctly\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "484b0501", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Step 4: Softmax - Probability Distributions\n", - "\n", - "### What is Softmax?\n", - "**Softmax** converts a vector of real numbers into a probability distribution:\n", - "\n", - "```\n", - "f(x_i) = e^(x_i) / Σ(e^(x_j))\n", - "```\n", - "\n", - "### Why Softmax is Essential\n", - "1. **Probability distribution**: Outputs sum to 1\n", - "2. **Multi-class classification**: Choose one class from many\n", - "3. **Interpretable**: Each output is a probability\n", - "4. 
**Differentiable**: Enables gradient-based learning\n", - "\n", - "### Visual Understanding\n", - "```\n", - "Input: [1, 2, 3]\n", - "Softmax:[0.09, 0.24, 0.67] # Sums to 1.0\n", - "```\n", - "\n", - "### Real-World Applications\n", - "- **Classification**: Image classification, text classification\n", - "- **Language models**: Next word prediction\n", - "- **Attention mechanisms**: Where to focus attention\n", - "- **Reinforcement learning**: Action selection probabilities\n", - "\n", - "### Mathematical Properties\n", - "- **Range**: (0, 1) for each output\n", - "- **Constraint**: Σ(f(x_i)) = 1\n", - "- **Argmax preservation**: Doesn't change relative ordering\n", - "- **Temperature scaling**: Can be made sharper or softer" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "cd4b5c5c", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "softmax-class", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class Softmax:\n", - " \"\"\"\n", - " Softmax Activation Function: f(x_i) = e^(x_i) / Σ(e^(x_j))\n", - " \n", - " Converts a vector of real numbers into a probability distribution.\n", - " Essential for multi-class classification.\n", - " \"\"\"\n", - " \n", - " def forward(self, x):\n", - " \"\"\"\n", - " Apply Softmax activation: f(x_i) = e^(x_i) / Σ(e^(x_j))\n", - " \n", - " Now supports both Tensor and Variable inputs with automatic differentiation.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Check if input is Variable (for autograd) or Tensor\n", - " 2. Compute softmax with numerical stability\n", - " 3. If input is Variable: create Variable output with proper gradient function\n", - " 4. 
If input is Tensor: return Tensor as before\n", - " \n", - " MATHEMATICAL FOUNDATION:\n", - " - Forward: f(x_i) = e^(x_i) / Σ(e^(x_j))\n", - " - Backward: ∂f_i/∂x_j = f_i * (δ_ij - f_j) where δ_ij is Kronecker delta\n", - " - Simplified: ∂f_i/∂x_i = f_i * (1 - f_i), ∂f_i/∂x_j = -f_i * f_j (i ≠ j)\n", - " \n", - " EXAMPLE USAGE:\n", - " ```python\n", - " softmax = Softmax()\n", - " # With Variable (with gradients)\n", - " var_input = Variable([[1.0, 2.0]], requires_grad=True)\n", - " var_output = softmax(var_input)\n", - " var_output.backward(Variable([[1.0, 0.0]]))\n", - " # Gradients computed automatically\n", - " ```\n", - " \n", - " IMPLEMENTATION HINTS:\n", - " - Check type with hasattr(x, 'requires_grad')\n", - " - For Variables: implement gradient function for backward pass\n", - " - Softmax gradient: Jacobian matrix with f_i * (δ_ij - f_j)\n", - " - Use numerical stability: subtract max before exponential\n", - " \n", - " LEARNING CONNECTIONS:\n", - " - This is like torch.nn.Softmax() in PyTorch with autograd support\n", - " - Used in classification and attention mechanisms\n", - " - Converts logits to probability distributions\n", - " - Complex gradient structure due to normalization\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Check if input is a Variable (autograd-enabled)\n", - " if hasattr(x, 'requires_grad') and hasattr(x, 'grad_fn'):\n", - " # Input is a Variable - preserve autograd capabilities\n", - " \n", - " # Forward pass: Softmax activation with numerical stability\n", - " input_data = x.data.data if hasattr(x.data, 'data') else x.data\n", - " \n", - " # Handle empty input\n", - " if input_data.size == 0:\n", - " return Variable(input_data.copy(), requires_grad=x.requires_grad)\n", - " \n", - " # Subtract max for numerical stability\n", - " x_shifted = input_data - np.max(input_data, axis=-1, keepdims=True)\n", - " \n", - " # Compute exponentials\n", - " exp_values = np.exp(x_shifted)\n", - " \n", - " # Sum along last axis\n", - " sum_exp 
= np.sum(exp_values, axis=-1, keepdims=True)\n", - " \n", - " # Divide to get probabilities\n", - " output_data = exp_values / sum_exp\n", - " \n", - " # Create gradient function for backward pass\n", - " def softmax_grad_fn(grad_output):\n", - " if x.requires_grad:\n", - " # Softmax gradient: for each element i,j: ∂f_i/∂x_j = f_i * (δ_ij - f_j)\n", - " # For vector input, this becomes: grad_input = softmax * (grad_output - (softmax * grad_output).sum(keepdims=True))\n", - " grad_out_data = grad_output.data.data\n", - " softmax_grad_sum = np.sum(output_data * grad_out_data, axis=-1, keepdims=True)\n", - " grad_input_data = output_data * (grad_out_data - softmax_grad_sum)\n", - " grad_input = Variable(grad_input_data)\n", - " x.backward(grad_input)\n", - " \n", - " # Return Variable with gradient function\n", - " requires_grad = x.requires_grad\n", - " result = Variable(output_data, requires_grad=requires_grad, grad_fn=softmax_grad_fn if requires_grad else None)\n", - " return result\n", - " else:\n", - " # Input is a Tensor - use original implementation\n", - " # Handle empty input\n", - " if x.data.size == 0:\n", - " return type(x)(x.data.copy())\n", - " \n", - " # Subtract max for numerical stability\n", - " x_shifted = x.data - np.max(x.data, axis=-1, keepdims=True)\n", - " \n", - " # Compute exponentials\n", - " exp_values = np.exp(x_shifted)\n", - " \n", - " # Sum along last axis\n", - " sum_exp = np.sum(exp_values, axis=-1, keepdims=True)\n", - " \n", - " # Divide to get probabilities\n", - " result = exp_values / sum_exp\n", - " \n", - " return type(x)(result)\n", - " ### END SOLUTION\n", - " \n", - " def __call__(self, x):\n", - " \"\"\"Make the class callable: softmax(x) instead of softmax.forward(x)\"\"\"\n", - " return self.forward(x)" - ] - }, - { - "cell_type": "markdown", - "id": "7254a652", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Test Your Softmax Implementation\n", - "\n", - "Once you 
implement the Softmax forward method above, run this cell to test it:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2d143d5d", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "test-softmax-immediate", - "locked": true, - "points": 15, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_unit_softmax_activation():\n", - " \"\"\"Unit test for the Softmax activation function.\"\"\"\n", - " print(\"🔬 Unit Test: Softmax Activation...\")\n", - "\n", - " # Create Softmax instance\n", - " softmax = Softmax()\n", - "\n", - " # Test with simple input\n", - " test_input = Tensor([[1, 2, 3]])\n", - " result = softmax(test_input)\n", - " \n", - " # Check that outputs sum to 1\n", - " output_sum = np.sum(result.data)\n", - " assert abs(output_sum - 1.0) < 1e-6, f\"Softmax outputs should sum to 1, got {output_sum}\"\n", - " \n", - " # Check that all outputs are positive\n", - " assert np.all(result.data > 0), \"Softmax outputs should be positive\"\n", - " assert np.all(result.data < 1), \"Softmax outputs should be less than 1\"\n", - " \n", - " # Test with uniform input (should give equal probabilities)\n", - " uniform_input = Tensor([[1, 1, 1]])\n", - " uniform_result = softmax(uniform_input)\n", - " expected_prob = 1.0 / 3.0\n", - " \n", - " for prob in uniform_result.data[0]:\n", - " assert abs(prob - expected_prob) < 1e-6, f\"Uniform input should give equal probabilities\"\n", - " \n", - " # Test with batch input (multiple samples)\n", - " batch_input = Tensor([[1, 2, 3], [4, 5, 6]])\n", - " batch_result = softmax(batch_input)\n", - " \n", - " # Check that each row sums to 1\n", - " for i in range(batch_input.shape[0]):\n", - " row_sum = np.sum(batch_result.data[i])\n", - " assert abs(row_sum - 1.0) < 1e-6, f\"Each row should sum to 1, row {i} sums to {row_sum}\"\n", - " \n", - " # Test numerical stability with large values\n", - " large_input = 
Tensor([[1000, 1001, 1002]])\n", - " large_result = softmax(large_input)\n", - " \n", - " assert not np.any(np.isnan(large_result.data)), \"Softmax should handle large values\"\n", - " assert not np.any(np.isinf(large_result.data)), \"Softmax should not produce inf values\"\n", - " \n", - " large_sum = np.sum(large_result.data)\n", - " assert abs(large_sum - 1.0) < 1e-6, \"Large values should still sum to 1\"\n", - "\n", - " # Test shape preservation\n", - " assert batch_result.shape == batch_input.shape, \"Softmax should preserve shape\"\n", - " \n", - " print(\"✅ Softmax activation tests passed!\")\n", - " print(f\"✅ Outputs sum to 1 (probability distribution)\")\n", - " print(f\"✅ All outputs are positive\")\n", - " print(f\"✅ Handles uniform inputs correctly\")\n", - " print(f\"✅ Works with batch inputs\")\n", - " print(f\"✅ Numerically stable with large values\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "8edb486d", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## 🎯 Comprehensive Test: All Activations Working Together\n", - "\n", - "### Real-World Scenario\n", - "Let us test how all activation functions work together in a realistic neural network scenario:\n", - "\n", - "- **Input processing**: Raw data transformation\n", - "- **Hidden layers**: ReLU for internal processing\n", - "- **Output layer**: Softmax for classification\n", - "- **Comparison**: See how different activations transform the same data" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ea90cb6c", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "test-activations-comprehensive", - "locked": true, - "points": 15, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_unit_activations_comprehensive():\n", - " \"\"\"Comprehensive unit test for all activation 
functions working together.\"\"\"\n", - " print(\"🔬 Unit Test: Activation Functions Comprehensive Test...\")\n", - " \n", - " # Create instances of all activation functions\n", - " relu = ReLU()\n", - " sigmoid = Sigmoid()\n", - " tanh = Tanh()\n", - " softmax = Softmax()\n", - " \n", - " # Test data: simulating neural network layer outputs\n", - " test_data = Tensor([[-2, -1, 0, 1, 2]])\n", - " \n", - " # Apply each activation function\n", - " relu_result = relu(test_data)\n", - " sigmoid_result = sigmoid(test_data)\n", - " tanh_result = tanh(test_data)\n", - " softmax_result = softmax(test_data)\n", - " \n", - " # Test that all functions preserve input shape\n", - " assert relu_result.shape == test_data.shape, \"ReLU should preserve shape\"\n", - " assert sigmoid_result.shape == test_data.shape, \"Sigmoid should preserve shape\"\n", - " assert tanh_result.shape == test_data.shape, \"Tanh should preserve shape\"\n", - " assert softmax_result.shape == test_data.shape, \"Softmax should preserve shape\"\n", - " \n", - " # Test that all functions return Tensor objects\n", - " assert isinstance(relu_result, Tensor), \"ReLU should return Tensor\"\n", - " assert isinstance(sigmoid_result, Tensor), \"Sigmoid should return Tensor\"\n", - " assert isinstance(tanh_result, Tensor), \"Tanh should return Tensor\"\n", - " assert isinstance(softmax_result, Tensor), \"Softmax should return Tensor\"\n", - " \n", - " # Test ReLU properties\n", - " assert np.all(relu_result.data >= 0), \"ReLU output should be non-negative\"\n", - " \n", - " # Test Sigmoid properties\n", - " assert np.all(sigmoid_result.data > 0), \"Sigmoid output should be positive\"\n", - " assert np.all(sigmoid_result.data < 1), \"Sigmoid output should be less than 1\"\n", - " \n", - " # Test Tanh properties\n", - " assert np.all(tanh_result.data > -1), \"Tanh output should be > -1\"\n", - " assert np.all(tanh_result.data < 1), \"Tanh output should be < 1\"\n", - " \n", - " # Test Softmax properties\n", - " 
softmax_sum = np.sum(softmax_result.data)\n", - " assert abs(softmax_sum - 1.0) < 1e-6, \"Softmax outputs should sum to 1\"\n", - " \n", - " # Test chaining activations (realistic neural network scenario)\n", - " # Hidden layer with ReLU\n", - " hidden_output = relu(test_data)\n", - " \n", - " # Add some weights simulation (element-wise multiplication)\n", - " weights = Tensor([[0.5, 0.3, 0.8, 0.2, 0.7]])\n", - " weighted_output = hidden_output * weights\n", - " \n", - " # Final layer with Softmax\n", - " final_output = softmax(weighted_output)\n", - " \n", - " # Test that chained operations work\n", - " assert isinstance(final_output, Tensor), \"Chained operations should return Tensor\"\n", - " assert abs(np.sum(final_output.data) - 1.0) < 1e-6, \"Final output should be valid probability\"\n", - " \n", - " # Test with batch data (multiple samples)\n", - " batch_data = Tensor([\n", - " [-2, -1, 0, 1, 2],\n", - " [1, 2, 3, 4, 5],\n", - " [-1, 0, 1, 2, 3]\n", - " ])\n", - " \n", - " batch_softmax = softmax(batch_data)\n", - " \n", - " # Each row should sum to 1\n", - " for i in range(batch_data.shape[0]):\n", - " row_sum = np.sum(batch_softmax.data[i])\n", - " assert abs(row_sum - 1.0) < 1e-6, f\"Batch row {i} should sum to 1\"\n", - " \n", - " print(\"✅ Activation functions comprehensive tests passed!\")\n", - " print(f\"✅ All functions work together seamlessly\")\n", - " print(f\"✅ Shape preservation across all activations\")\n", - " print(f\"✅ Chained operations work correctly\")\n", - " print(f\"✅ Batch processing works for all activations\")\n", - " print(f\"✅ Ready for neural network integration!\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1aa695c6", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "def test_module_activation_tensor_integration():\n", - " \"\"\"\n", - " Integration test for activation functions with Tensor operations.\n", 
- " \n", - " Tests that activation functions properly integrate with the Tensor class\n", - " and maintain compatibility for neural network operations.\n", - " \"\"\"\n", - " print(\"🔬 Running Integration Test: Activation-Tensor Integration...\")\n", - " \n", - " # Test 1: Activation functions preserve Tensor types\n", - " input_tensor = Tensor([-2.0, -1.0, 0.0, 1.0, 2.0])\n", - " \n", - " relu_fn = ReLU()\n", - " sigmoid_fn = Sigmoid()\n", - " tanh_fn = Tanh()\n", - " \n", - " relu_result = relu_fn(input_tensor)\n", - " sigmoid_result = sigmoid_fn(input_tensor) \n", - " tanh_result = tanh_fn(input_tensor)\n", - " \n", - " assert isinstance(relu_result, Tensor), \"ReLU should return Tensor\"\n", - " assert isinstance(sigmoid_result, Tensor), \"Sigmoid should return Tensor\"\n", - " assert isinstance(tanh_result, Tensor), \"Tanh should return Tensor\"\n", - " \n", - " # Test 2: Activations work with matrix Tensors (neural network layers)\n", - " layer_output = Tensor([[1.0, -2.0, 3.0], \n", - " [-1.0, 2.0, -3.0]]) # Simulating dense layer output\n", - " \n", - " relu_fn = ReLU()\n", - " activated = relu_fn(layer_output)\n", - " expected = np.array([[1.0, 0.0, 3.0], \n", - " [0.0, 2.0, 0.0]])\n", - " \n", - " assert isinstance(activated, Tensor), \"Matrix activation should return Tensor\"\n", - " assert np.array_equal(activated.data, expected), \"Matrix ReLU should work correctly\"\n", - " \n", - " # Test 3: Softmax with classification scenario\n", - " logits = Tensor([[2.0, 1.0, 0.1], # Batch of 2 samples\n", - " [1.0, 3.0, 0.2]]) # Each with 3 classes\n", - " \n", - " softmax_fn = Softmax()\n", - " probabilities = softmax_fn(logits)\n", - " \n", - " assert isinstance(probabilities, Tensor), \"Softmax should return Tensor\"\n", - " assert probabilities.shape == logits.shape, \"Softmax should preserve shape\"\n", - " \n", - " # Each row should sum to 1 (probability distribution)\n", - " for i in range(logits.shape[0]):\n", - " row_sum = 
np.sum(probabilities.data[i])\n", - " assert abs(row_sum - 1.0) < 1e-6, f\"Probability row {i} should sum to 1\"\n", - " \n", - " # Test 4: Chaining tensor operations with activations\n", - " x = Tensor([1.0, 2.0, 3.0])\n", - " y = Tensor([4.0, 5.0, 6.0])\n", - " \n", - " # Simulate: dense layer output -> activation -> more operations\n", - " dense_sim = x * y # Element-wise multiplication (simulating dense layer)\n", - " relu_fn = ReLU()\n", - " activated = relu_fn(dense_sim) # Apply activation\n", - " final = activated + Tensor([1.0, 1.0, 1.0]) # More tensor operations\n", - " \n", - " expected_final = np.array([5.0, 11.0, 19.0]) # [4,10,18] -> relu -> +1 = [5,11,19]\n", - " \n", - " assert isinstance(final, Tensor), \"Chained operations should maintain Tensor type\"\n", - " assert np.array_equal(final.data, expected_final), \"Chained operations should work correctly\"\n", - " \n", - " print(\"✅ Integration Test Passed: Activation-Tensor integration works correctly.\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "2a893a04", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## 🧪 New Tests: Variable Support and Autograd Integration\n", - "\n", - "Let's test that our activation functions work correctly with Variables and compute proper gradients.\n", - "\n", - "### 🚀 Training Pipeline Example\n", - "\n", - "Here's how the autograd-enabled activation functions work in a simple training scenario:\n", - "\n", - "```python\n", - "# Training-like scenario with autograd\n", - "x = Variable([[1.0, -0.5, 2.0]], requires_grad=True)\n", - "weights = Variable([[0.5], [0.3], [-0.2]], requires_grad=True)\n", - "\n", - "# Forward pass through network\n", - "hidden = x @ weights # Matrix multiplication (would use autograd)\n", - "activated = relu(hidden) # ReLU activation with gradient tracking\n", - "loss = activated ** 2 # Simple loss function\n", - "\n", - "# Backward 
pass\n", - "loss.backward()\n", - "\n", - "# Now x.grad and weights.grad contain gradients for optimization\n", - "print(f\"Input gradients: {x.grad}\")\n", - "print(f\"Weight gradients: {weights.grad}\")\n", - "```\n", - "\n", - "This shows how activation functions seamlessly integrate with the autograd system to enable end-to-end neural network training." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ad44ce0d", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "test-activations-variable-support", - "locked": true, - "points": 20, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_unit_activations_variable_support():\n", - " \"\"\"Test activation functions with Variable inputs and gradient computation.\"\"\"\n", - " print(\"🔬 Unit Test: Activation Functions Variable Support...\")\n", - " \n", - " # Test 1: ReLU with Variables\n", - " print(\" Testing ReLU with Variables...\")\n", - " relu = ReLU()\n", - " \n", - " # Test ReLU forward pass with Variable\n", - " x_var = Variable([[2.0, -1.0, 0.0, 3.0]], requires_grad=True)\n", - " relu_output = relu(x_var)\n", - " \n", - " assert hasattr(relu_output, 'requires_grad'), \"ReLU should return Variable when input is Variable\"\n", - " assert relu_output.requires_grad == True, \"ReLU should preserve requires_grad\"\n", - " assert np.array_equal(relu_output.data.data, [[2.0, 0.0, 0.0, 3.0]]), \"ReLU forward pass incorrect\"\n", - " \n", - " # Test ReLU backward pass\n", - " relu_output.backward(Variable([[1.0, 1.0, 1.0, 1.0]]))\n", - " expected_grad = [[1.0, 0.0, 0.0, 1.0]] # Gradient is 1 where input > 0, 0 elsewhere\n", - " assert np.array_equal(x_var.grad.data.data, expected_grad), f\"ReLU gradient incorrect: expected {expected_grad}, got {x_var.grad.data.data}\"\n", - " \n", - " # Test 2: Sigmoid with Variables\n", - " print(\" Testing Sigmoid with Variables...\")\n", - " sigmoid = 
Sigmoid()\n", - " \n", - " x_var2 = Variable([[0.0]], requires_grad=True)\n", - " sigmoid_output = sigmoid(x_var2)\n", - " \n", - " assert hasattr(sigmoid_output, 'requires_grad'), \"Sigmoid should return Variable when input is Variable\"\n", - " assert abs(sigmoid_output.data.data[0][0] - 0.5) < 1e-6, \"Sigmoid(0) should be 0.5\"\n", - " \n", - " # Test Sigmoid backward pass\n", - " sigmoid_output.backward(Variable([[1.0]]))\n", - " expected_sigmoid_grad = 0.5 * (1.0 - 0.5) # sigmoid(0) * (1 - sigmoid(0)) = 0.5 * 0.5 = 0.25\n", - " assert abs(x_var2.grad.data.data[0][0] - expected_sigmoid_grad) < 1e-6, f\"Sigmoid gradient incorrect: expected {expected_sigmoid_grad}, got {x_var2.grad.data.data[0][0]}\"\n", - " \n", - " # Test 3: Tanh with Variables\n", - " print(\" Testing Tanh with Variables...\")\n", - " tanh = Tanh()\n", - " \n", - " x_var3 = Variable([[0.0]], requires_grad=True)\n", - " tanh_output = tanh(x_var3)\n", - " \n", - " assert hasattr(tanh_output, 'requires_grad'), \"Tanh should return Variable when input is Variable\"\n", - " assert abs(tanh_output.data.data[0][0] - 0.0) < 1e-6, \"Tanh(0) should be 0.0\"\n", - " \n", - " # Test Tanh backward pass\n", - " tanh_output.backward(Variable([[1.0]]))\n", - " expected_tanh_grad = 1.0 - 0.0**2 # 1 - tanh²(0) = 1 - 0² = 1\n", - " assert abs(x_var3.grad.data.data[0][0] - expected_tanh_grad) < 1e-6, f\"Tanh gradient incorrect: expected {expected_tanh_grad}, got {x_var3.grad.data.data[0][0]}\"\n", - " \n", - " # Test 4: Softmax with Variables\n", - " print(\" Testing Softmax with Variables...\")\n", - " softmax = Softmax()\n", - " \n", - " x_var4 = Variable([[1.0, 2.0, 3.0]], requires_grad=True)\n", - " softmax_output = softmax(x_var4)\n", - " \n", - " assert hasattr(softmax_output, 'requires_grad'), \"Softmax should return Variable when input is Variable\"\n", - " assert abs(np.sum(softmax_output.data.data) - 1.0) < 1e-6, \"Softmax outputs should sum to 1\"\n", - " \n", - " # Test Softmax backward pass\n", - " 
softmax_output.backward(Variable([[1.0, 0.0, 0.0]])) # Gradient for first element only\n", - " assert x_var4.grad is not None, \"Softmax should compute gradients\"\n", - " assert x_var4.grad.data.data.shape == (1, 3), \"Softmax gradient should have correct shape\"\n", - " \n", - " print(\"✅ Variable support tests passed!\")\n", - " print(f\"✅ ReLU gradients computed correctly\")\n", - " print(f\"✅ Sigmoid gradients computed correctly\") \n", - " print(f\"✅ Tanh gradients computed correctly\")\n", - " print(f\"✅ Softmax gradients computed correctly\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5a53dc49", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "test-activations-tensor-compatibility", - "locked": true, - "points": 10, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_unit_activations_tensor_compatibility():\n", - " \"\"\"Test that activation functions still work correctly with plain Tensors.\"\"\"\n", - " print(\"🔬 Unit Test: Activation Functions Tensor Compatibility...\")\n", - " \n", - " # Create instances of all activation functions\n", - " relu = ReLU()\n", - " sigmoid = Sigmoid()\n", - " tanh = Tanh()\n", - " softmax = Softmax()\n", - " \n", - " # Test with plain Tensor (should work as before)\n", - " tensor_input = Tensor([[-2, -1, 0, 1, 2]])\n", - " \n", - " # Test that all activations return Tensors when given Tensors\n", - " relu_result = relu(tensor_input)\n", - " sigmoid_result = sigmoid(tensor_input)\n", - " tanh_result = tanh(tensor_input)\n", - " softmax_result = softmax(tensor_input)\n", - " \n", - " assert isinstance(relu_result, Tensor), \"ReLU should return Tensor when input is Tensor\"\n", - " assert isinstance(sigmoid_result, Tensor), \"Sigmoid should return Tensor when input is Tensor\"\n", - " assert isinstance(tanh_result, Tensor), \"Tanh 
should return Tensor when input is Tensor\"\n", - " assert isinstance(softmax_result, Tensor), \"Softmax should return Tensor when input is Tensor\"\n", - " \n", - " # Test that none have autograd attributes\n", - " assert not hasattr(relu_result, 'requires_grad'), \"Tensor output should not have autograd attributes\"\n", - " assert not hasattr(sigmoid_result, 'requires_grad'), \"Tensor output should not have autograd attributes\"\n", - " assert not hasattr(tanh_result, 'requires_grad'), \"Tensor output should not have autograd attributes\"\n", - " assert not hasattr(softmax_result, 'requires_grad'), \"Tensor output should not have autograd attributes\"\n", - " \n", - " # Test that results are mathematically correct\n", - " expected_relu = np.array([[0, 0, 0, 1, 2]])\n", - " assert np.array_equal(relu_result.data, expected_relu), \"ReLU with Tensor should produce correct results\"\n", - " \n", - " assert np.all(sigmoid_result.data > 0), \"Sigmoid should produce positive values\"\n", - " assert np.all(sigmoid_result.data < 1), \"Sigmoid should produce values less than 1\"\n", - " \n", - " assert np.all(tanh_result.data > -1), \"Tanh should produce values > -1\"\n", - " assert np.all(tanh_result.data < 1), \"Tanh should produce values < 1\"\n", - " \n", - " assert abs(np.sum(softmax_result.data) - 1.0) < 1e-6, \"Softmax should sum to 1\"\n", - " \n", - " print(\"✅ Tensor compatibility tests passed!\")\n", - " print(f\"✅ All activations work with plain Tensors\")\n", - " print(f\"✅ No autograd attributes on Tensor outputs\")\n", - " print(f\"✅ Mathematical correctness preserved\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bf74dc6b", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "test-activations-gradient-accuracy", - "locked": true, - "points": 15, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], 
- "source": [ - "def test_unit_activations_gradient_accuracy():\n", - " \"\"\"Test gradient computation accuracy by comparing known derivatives.\"\"\"\n", - " print(\"🔬 Unit Test: Activation Functions Gradient Accuracy...\")\n", - " \n", - " # Test 1: ReLU gradient accuracy with known values\n", - " print(\" Testing ReLU gradient accuracy...\")\n", - " relu = ReLU()\n", - " \n", - " # Test case 1: Positive input (gradient should be 1)\n", - " x_pos = Variable([[2.0]], requires_grad=True)\n", - " relu_output = relu(x_pos)\n", - " relu_output.backward(Variable([[1.0]]))\n", - " assert abs(x_pos.grad.data.data[0][0] - 1.0) < 1e-6, f\"ReLU gradient for positive input should be 1, got {x_pos.grad.data.data[0][0]}\"\n", - " \n", - " # Test case 2: Negative input (gradient should be 0)\n", - " x_neg = Variable([[-1.0]], requires_grad=True)\n", - " relu_output = relu(x_neg)\n", - " relu_output.backward(Variable([[1.0]]))\n", - " assert abs(x_neg.grad.data.data[0][0] - 0.0) < 1e-6, f\"ReLU gradient for negative input should be 0, got {x_neg.grad.data.data[0][0]}\"\n", - " \n", - " # Test 2: Sigmoid gradient accuracy with known values\n", - " print(\" Testing Sigmoid gradient accuracy...\")\n", - " sigmoid = Sigmoid()\n", - " \n", - " # Test at x=0 where sigmoid(0)=0.5, gradient should be 0.5*(1-0.5)=0.25\n", - " x_zero = Variable([[0.0]], requires_grad=True)\n", - " sigmoid_output = sigmoid(x_zero)\n", - " sigmoid_output.backward(Variable([[1.0]]))\n", - " expected_grad = 0.5 * (1.0 - 0.5) # sigmoid(0) * (1 - sigmoid(0))\n", - " assert abs(x_zero.grad.data.data[0][0] - expected_grad) < 1e-6, f\"Sigmoid gradient at x=0 should be {expected_grad}, got {x_zero.grad.data.data[0][0]}\"\n", - " \n", - " # Test at x=1 where sigmoid(1)≈0.731, gradient should be sigmoid(1)*(1-sigmoid(1))\n", - " x_one = Variable([[1.0]], requires_grad=True)\n", - " sigmoid_output = sigmoid(x_one)\n", - " sigmoid_val = sigmoid_output.data.data[0][0]\n", - " 
sigmoid_output.backward(Variable([[1.0]]))\n", - " expected_grad = sigmoid_val * (1.0 - sigmoid_val)\n", - " assert abs(x_one.grad.data.data[0][0] - expected_grad) < 1e-6, f\"Sigmoid gradient should match derivative formula\"\n", - " \n", - " # Test 3: Tanh gradient accuracy with known values\n", - " print(\" Testing Tanh gradient accuracy...\")\n", - " tanh = Tanh()\n", - " \n", - " # Test at x=0 where tanh(0)=0, gradient should be 1-0²=1\n", - " x_zero_tanh = Variable([[0.0]], requires_grad=True)\n", - " tanh_output = tanh(x_zero_tanh)\n", - " tanh_output.backward(Variable([[1.0]]))\n", - " expected_grad = 1.0 - 0.0**2 # 1 - tanh²(0)\n", - " assert abs(x_zero_tanh.grad.data.data[0][0] - expected_grad) < 1e-6, f\"Tanh gradient at x=0 should be {expected_grad}, got {x_zero_tanh.grad.data.data[0][0]}\"\n", - " \n", - " # Test at x=1 where tanh(1)≈0.762, gradient should be 1-tanh²(1)\n", - " x_one_tanh = Variable([[1.0]], requires_grad=True)\n", - " tanh_output = tanh(x_one_tanh)\n", - " tanh_val = tanh_output.data.data[0][0]\n", - " tanh_output.backward(Variable([[1.0]]))\n", - " expected_grad = 1.0 - tanh_val**2\n", - " assert abs(x_one_tanh.grad.data.data[0][0] - expected_grad) < 1e-6, f\"Tanh gradient should match derivative formula\"\n", - " \n", - " # Test 4: Test batch gradients work correctly\n", - " print(\" Testing batch gradient computation...\")\n", - " x_batch = Variable([[2.0, -1.0, 0.5]], requires_grad=True)\n", - " relu_batch = relu(x_batch)\n", - " relu_batch.backward(Variable([[1.0, 1.0, 1.0]]))\n", - " expected_batch_grad = [[1.0, 0.0, 1.0]] # [pos, neg, pos] -> [1, 0, 1]\n", - " assert np.array_equal(x_batch.grad.data.data, expected_batch_grad), f\"Batch ReLU gradients incorrect\"\n", - " \n", - " print(\"✅ Gradient accuracy tests passed!\")\n", - " print(f\"✅ ReLU gradients match known derivatives\")\n", - " print(f\"✅ Sigmoid gradients match known derivatives\")\n", - " print(f\"✅ Tanh gradients match known derivatives\")\n", - " print(f\"✅ 
Batch gradient computation works correctly\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "62ed3e5e", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## ⚡ ML Systems: Performance Analysis & Optimization\n", - "\n", - "Now that you have working activation functions, let us develop **performance engineering skills**. This section teaches you to measure computational costs, understand scaling patterns, and think about production optimization.\n", - "\n", - "### **Learning Outcome**: *\"I understand performance trade-offs between different activation functions\"*\n", - "\n", - "---\n", - "\n", - "## Performance Profiling Tools (Light Implementation)\n", - "\n", - "As an ML systems engineer, you need to understand which activation functions are fast vs slow, and why. Let us build simple tools to measure and compare performance." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "66c20f3a", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "activation-profiler", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "import time\n", - "\n", - "class ActivationProfiler:\n", - " \"\"\"\n", - " Performance profiling toolkit for activation functions.\n", - " \n", - " Helps ML engineers understand computational costs and optimize\n", - " neural network performance for production deployment.\n", - " \"\"\"\n", - " \n", - " def __init__(self):\n", - " self.results = {}\n", - " \n", - " def time_activation(self, activation_fn, tensor, activation_name, iterations=100):\n", - " \"\"\"\n", - " Time how long an activation function takes to run.\n", - " \n", - " TODO: Implement activation timing.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Record start time using time.time()\n", - " 2. Run the activation function for specified iterations\n", - " 3. 
Record end time\n", - " 4. Calculate average time per iteration\n", - " 5. Return the average time in milliseconds\n", - " \n", - " EXAMPLE:\n", - " profiler = ActivationProfiler()\n", - " relu = ReLU()\n", - " test_tensor = Tensor(np.random.randn(1000, 1000))\n", - " avg_time = profiler.time_activation(relu, test_tensor, \"ReLU\")\n", - " print(f\"ReLU took {avg_time:.3f} ms on average\")\n", - " \n", - " HINTS:\n", - " - Use time.time() for timing\n", - " - Run multiple iterations for better accuracy\n", - " - Calculate: (end_time - start_time) / iterations * 1000 for ms\n", - " - Return the average time per call in milliseconds\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " start_time = time.time()\n", - " \n", - " for _ in range(iterations):\n", - " result = activation_fn(tensor)\n", - " \n", - " end_time = time.time()\n", - " avg_time_ms = (end_time - start_time) / iterations * 1000\n", - " \n", - " return avg_time_ms\n", - " ### END SOLUTION\n", - " \n", - " def compare_activations(self, tensor_size=(1000, 1000), iterations=50):\n", - " \"\"\"\n", - " Compare performance of all activation functions.\n", - " \n", - " This function is PROVIDED to show systems analysis.\n", - " Students run it to understand performance differences.\n", - " \"\"\"\n", - " print(f\"⚡ ACTIVATION PERFORMANCE COMPARISON\")\n", - " print(f\"=\" * 50)\n", - " print(f\"Tensor size: {tensor_size}, Iterations: {iterations}\")\n", - " \n", - " # Create test tensor\n", - " test_tensor = Tensor(np.random.randn(*tensor_size))\n", - " tensor_mb = test_tensor.data.nbytes / (1024 * 1024)\n", - " print(f\"Test tensor: {tensor_mb:.2f} MB\")\n", - " \n", - " # Test all activation functions\n", - " activations = {\n", - " 'ReLU': ReLU(),\n", - " 'Sigmoid': Sigmoid(),\n", - " 'Tanh': Tanh(),\n", - " 'Softmax': Softmax()\n", - " }\n", - " \n", - " results = {}\n", - " for name, activation_fn in activations.items():\n", - " avg_time = self.time_activation(activation_fn, test_tensor, name, 
iterations)\n", - " results[name] = avg_time\n", - " print(f\" {name:8}: {avg_time:.3f} ms\")\n", - " \n", - " # Calculate speed ratios relative to fastest\n", - " fastest_time = min(results.values())\n", - " fastest_name = min(results, key=results.get)\n", - " \n", - " print(f\"\\n📊 SPEED ANALYSIS:\")\n", - " for name, time_ms in sorted(results.items(), key=lambda x: x[1]):\n", - " speed_ratio = time_ms / fastest_time\n", - " if name == fastest_name:\n", - " print(f\" {name:8}: {speed_ratio:.1f}x (fastest)\")\n", - " else:\n", - " print(f\" {name:8}: {speed_ratio:.1f}x slower than {fastest_name}\")\n", - " \n", - " return results\n", - " \n", - " def analyze_scaling(self, activation_fn, activation_name, sizes=[100, 500, 1000]):\n", - " \"\"\"\n", - " Analyze how activation performance scales with tensor size.\n", - " \n", - " This function is PROVIDED to demonstrate scaling patterns.\n", - " Students use it to understand computational complexity.\n", - " \"\"\"\n", - " print(f\"\\n🔍 SCALING ANALYSIS: {activation_name}\")\n", - " print(f\"=\" * 40)\n", - " \n", - " scaling_results = []\n", - " \n", - " for size in sizes:\n", - " test_tensor = Tensor(np.random.randn(size, size))\n", - " avg_time = self.time_activation(activation_fn, test_tensor, activation_name, iterations=20)\n", - " \n", - " elements = size * size\n", - " time_per_element = avg_time / elements * 1e6 # microseconds per element\n", - " \n", - " result = {\n", - " 'size': size,\n", - " 'elements': elements,\n", - " 'time_ms': avg_time,\n", - " 'time_per_element_us': time_per_element\n", - " }\n", - " scaling_results.append(result)\n", - " \n", - " print(f\" {size}x{size}: {avg_time:.3f}ms ({time_per_element:.3f}μs/element)\")\n", - " \n", - " # Analyze scaling pattern\n", - " if len(scaling_results) >= 2:\n", - " small = scaling_results[0]\n", - " large = scaling_results[-1]\n", - " \n", - " size_ratio = large['size'] / small['size']\n", - " time_ratio = large['time_ms'] / small['time_ms']\n", - " 
\n", - " print(f\"\\n📈 Scaling Pattern:\")\n", - " print(f\" Size increased {size_ratio:.1f}x ({small['size']} → {large['size']})\")\n", - " print(f\" Time increased {time_ratio:.1f}x\")\n", - " \n", - " if abs(time_ratio - size_ratio**2) < abs(time_ratio - size_ratio):\n", - " print(f\" Pattern: O(n^2) - linear in tensor size\")\n", - " else:\n", - " print(f\" Pattern: ~O(n) - very efficient scaling\")\n", - " \n", - " return scaling_results\n", - "\n", - "def benchmark_activation_suite():\n", - " \"\"\"\n", - " Comprehensive benchmark of all activation functions.\n", - " \n", - " This function is PROVIDED to show complete systems analysis.\n", - " Students run it to understand production performance implications.\n", - " \"\"\"\n", - " profiler = ActivationProfiler()\n", - " \n", - " print(\"🏆 COMPREHENSIVE ACTIVATION BENCHMARK\")\n", - " print(\"=\" * 60)\n", - " \n", - " # Test 1: Performance comparison\n", - " comparison_results = profiler.compare_activations(tensor_size=(800, 800), iterations=30)\n", - " \n", - " # Test 2: Scaling analysis for each activation\n", - " activations_to_test = [\n", - " (ReLU(), \"ReLU\"),\n", - " (Sigmoid(), \"Sigmoid\"),\n", - " (Tanh(), \"Tanh\")\n", - " ]\n", - " \n", - " for activation_fn, name in activations_to_test:\n", - " profiler.analyze_scaling(activation_fn, name, sizes=[200, 400, 600])\n", - " \n", - " # Test 3: Memory vs Performance trade-offs\n", - " print(f\"\\n💾 MEMORY vs PERFORMANCE ANALYSIS:\")\n", - " print(f\"=\" * 40)\n", - " \n", - " test_tensor = Tensor(np.random.randn(500, 500))\n", - " original_memory = test_tensor.data.nbytes / (1024 * 1024)\n", - " \n", - " for name, activation_fn in [(\"ReLU\", ReLU()), (\"Sigmoid\", Sigmoid())]:\n", - " start_time = time.time()\n", - " result = activation_fn(test_tensor)\n", - " end_time = time.time()\n", - " \n", - " result_memory = result.data.nbytes / (1024 * 1024)\n", - " time_ms = (end_time - start_time) * 1000\n", - " \n", - " print(f\" {name}:\")\n", - " 
print(f\" Input: {original_memory:.2f} MB\")\n", - " print(f\" Output: {result_memory:.2f} MB\")\n", - " print(f\" Memory overhead: {result_memory - original_memory:.2f} MB\")\n", - " print(f\" Time: {time_ms:.3f} ms\")\n", - " \n", - " print(f\"\\n🎯 PRODUCTION INSIGHTS:\")\n", - " print(f\" - ReLU is typically fastest (simple max operation)\")\n", - " print(f\" - Sigmoid/Tanh slower due to exponential calculations\")\n", - " print(f\" - All operations scale linearly with tensor size\")\n", - " print(f\" - Memory usage doubles (input + output tensors)\")\n", - " print(f\" - Choose activation based on accuracy vs speed trade-offs\")\n", - " \n", - " return comparison_results" - ] - }, - { - "cell_type": "markdown", - "id": "b1de71a4", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Test: Activation Performance Profiling\n", - "\n", - "Let us test our activation profiler with realistic performance analysis." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "272ed639", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "test-activation-profiler", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_activation_profiler():\n", - " \"\"\"Test activation profiler with comprehensive scenarios.\"\"\"\n", - " print(\"🔬 Unit Test: Activation Performance Profiler...\")\n", - " \n", - " profiler = ActivationProfiler()\n", - " \n", - " # Create test tensor\n", - " test_tensor = Tensor(np.random.randn(100, 100))\n", - " relu = ReLU()\n", - " \n", - " # Test timing functionality\n", - " avg_time = profiler.time_activation(relu, test_tensor, \"ReLU\", iterations=10)\n", - " \n", - " # Verify timing results\n", - " assert isinstance(avg_time, (int, float)), \"Should return numeric time\"\n", - " assert avg_time > 0, \"Time should be positive\"\n", - " assert avg_time < 1000, \"Time should be 
reasonable (< 1000ms)\"\n", - " \n", - " print(\"✅ Basic timing functionality test passed\")\n", - " \n", - " # Test comparison functionality\n", - " comparison_results = profiler.compare_activations(tensor_size=(50, 50), iterations=5)\n", - " \n", - " # Verify comparison results\n", - " assert isinstance(comparison_results, dict), \"Should return dictionary of results\"\n", - " assert len(comparison_results) == 4, \"Should test all 4 activation functions\"\n", - " \n", - " expected_activations = ['ReLU', 'Sigmoid', 'Tanh', 'Softmax']\n", - " for activation in expected_activations:\n", - " assert activation in comparison_results, f\"Should include {activation}\"\n", - " assert comparison_results[activation] > 0, f\"{activation} time should be positive\"\n", - " \n", - " print(\"✅ Activation comparison test passed\")\n", - " \n", - " # Test scaling analysis\n", - " scaling_results = profiler.analyze_scaling(relu, \"ReLU\", sizes=[50, 100])\n", - " \n", - " # Verify scaling results\n", - " assert isinstance(scaling_results, list), \"Should return list of scaling results\"\n", - " assert len(scaling_results) == 2, \"Should test both sizes\"\n", - " \n", - " for result in scaling_results:\n", - " assert 'size' in result, \"Should include size\"\n", - " assert 'time_ms' in result, \"Should include timing\"\n", - " assert result['time_ms'] > 0, \"Time should be positive\"\n", - " \n", - " print(\"✅ Scaling analysis test passed\")\n", - " \n", - " print(\"🎯 Activation Profiler: All tests passed!\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "c3978ddd", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### 🎯 Learning Activity: Activation Performance Analysis\n", - "\n", - "**Goal**: Learn to measure activation function performance and understand which operations are fast vs slow in production ML systems." 
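The measurement pattern this activity teaches can also be sketched independently of the notebook's classes. The `Tensor`, `ReLU`, and `ActivationProfiler` names above belong to this module; the sketch below is a reviewer-added illustration that assumes only NumPy and the standard library, and uses `time.perf_counter()`, a higher-resolution clock than the `time.time()` used in the cells:

```python
import time

import numpy as np

def time_op(fn, x, iterations=100):
    """Average wall-clock time per call, in milliseconds."""
    start = time.perf_counter()  # monotonic, high-resolution clock
    for _ in range(iterations):
        fn(x)
    elapsed = time.perf_counter() - start
    return elapsed / iterations * 1000

x = np.random.randn(500, 500)
relu_ms = time_op(lambda a: np.maximum(0, a), x)        # simple comparison
sigmoid_ms = time_op(lambda a: 1.0 / (1.0 + np.exp(-a)), x)  # transcendental exp
print(f"ReLU:    {relu_ms:.4f} ms")
print(f"Sigmoid: {sigmoid_ms:.4f} ms")
```

On most hardware the ReLU timing comes out lower, since `np.maximum` avoids the exponential call, but the exact ratio is machine-dependent, which is exactly why measuring beats guessing.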
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f9ed3e1a", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "activation-performance-analysis", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "# Activation profiler initialization moved to main block\n", - "\n", - "if __name__ == \"__main__\":\n", - " # Initialize the activation profiler\n", - " profiler = ActivationProfiler()\n", - " \n", - " # Run all activation tests\n", - " test_unit_relu_activation()\n", - " test_unit_sigmoid_activation()\n", - " test_unit_tanh_activation()\n", - " test_unit_softmax_activation()\n", - " test_unit_activations_comprehensive()\n", - " test_module_activation_tensor_integration()\n", - " \n", - " # Run new autograd tests\n", - " test_unit_activations_variable_support()\n", - " test_unit_activations_tensor_compatibility()\n", - " test_unit_activations_gradient_accuracy()\n", - " \n", - " test_activation_profiler()\n", - " \n", - " print(\"⚡ ACTIVATION PERFORMANCE ANALYSIS\")\n", - " print(\"=\" * 50)\n", - "\n", - " # Create test data\n", - " test_tensor = Tensor(np.random.randn(500, 500)) # Medium-sized tensor for testing\n", - " print(f\"Test tensor size: {test_tensor.shape}\")\n", - " print(f\"Memory footprint: {test_tensor.data.nbytes/(1024*1024):.2f} MB\")\n", - "\n", - " # Test individual activation timing\n", - " print(f\"\\n🎯 Individual Activation Timing:\")\n", - " activations_to_test = [\n", - " (ReLU(), \"ReLU\"),\n", - " (Sigmoid(), \"Sigmoid\"), \n", - " (Tanh(), \"Tanh\"),\n", - " (Softmax(), \"Softmax\")\n", - " ]\n", - "\n", - " individual_results = {}\n", - " for activation_fn, name in activations_to_test:\n", - " # Students implement this timing call\n", - " avg_time = profiler.time_activation(activation_fn, test_tensor, name, iterations=50)\n", - " individual_results[name] = avg_time\n", - " print(f\" {name:8}: {avg_time:.3f} ms average\")\n", - "\n", - " 
# Analyze the results \n", - " fastest = min(individual_results, key=individual_results.get)\n", - " slowest = max(individual_results, key=individual_results.get)\n", - " speed_ratio = individual_results[slowest] / individual_results[fastest]\n", - "\n", - " print(f\"\\n📊 PERFORMANCE INSIGHTS:\")\n", - " print(f\" Fastest: {fastest} ({individual_results[fastest]:.3f} ms)\")\n", - " print(f\" Slowest: {slowest} ({individual_results[slowest]:.3f} ms)\")\n", - " print(f\" Speed difference: {speed_ratio:.1f}x\")\n", - "\n", - " print(f\"\\n💡 WHY THE DIFFERENCE?\")\n", - " print(f\" - ReLU: Just max(0, x) - simple comparison\")\n", - " print(f\" - Sigmoid: Requires exponential calculation\")\n", - " print(f\" - Tanh: Also exponential, but often optimized\")\n", - " print(f\" - Softmax: Exponentials + division\")\n", - "\n", - " print(f\"\\n🏭 PRODUCTION IMPLICATIONS:\")\n", - " print(f\" - ReLU dominates modern deep learning (speed + effectiveness)\")\n", - " print(f\" - Sigmoid/Tanh used where probability interpretation needed\")\n", - " print(f\" - Speed matters: 1000 layers × speed difference = major impact\")\n", - " \n", - " print(\"All tests passed!\")\n", - " print(\"Activations module complete!\")" - ] - }, - { - "cell_type": "markdown", - "id": "0b8c36be", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🤔 ML Systems Thinking: Interactive Questions\n", - "\n", - "Now that you've built the nonlinear functions that enable neural network intelligence, let's connect this foundational work to broader ML systems challenges. These questions help you think critically about how activation functions scale to production ML environments.\n", - "\n", - "Take time to reflect thoughtfully on each question - your insights will help you understand how the activation concepts you've implemented connect to real-world ML systems engineering." 
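One concrete technique alluded to by the reflection questions that follow is numerically stable Softmax. The sketch below is a reviewer-added, NumPy-only illustration (not the module's `Softmax` class): subtracting the row maximum leaves the result mathematically unchanged but prevents `exp` overflow.

```python
import numpy as np

def stable_softmax(x, axis=-1):
    # softmax(x) == softmax(x - c) for any constant c, so shifting by the
    # max makes the largest exponent exp(0) = 1 and rules out overflow.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=axis, keepdims=True)

logits = np.array([[1000.0, 1001.0, 1002.0]])  # naive np.exp(1000.0) overflows to inf
probs = stable_softmax(logits)
print(probs)  # finite probabilities that sum to 1 (up to float rounding)
```

This is the same trick used by production libraries for large-vocabulary language-model outputs, where raw logits can easily exceed the float32 overflow threshold.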
- ] - }, - { - "cell_type": "markdown", - "id": "776fef8d", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### Question 1: Computational Efficiency and Numerical Stability\n", - "\n", - "**Context**: Your activation implementations handle basic operations like ReLU's max(0, x) and Softmax's exponential computations. In production ML systems, these operations run billions of times during training and inference, making computational efficiency and numerical stability critical for system reliability.\n", - "\n", - "**Reflection Question**: Design a production-grade activation function system that balances computational efficiency with numerical stability. How would you optimize ReLU for sparse computation, implement numerically stable Softmax for large vocabulary language models, and handle precision requirements across different hardware platforms? Consider scenarios where numerical instability in activation functions could cascade through deep networks and cause training failures.\n", - "\n", - "Think about: vectorization strategies, overflow/underflow protection, sparse computation optimization, and precision trade-offs between speed and accuracy.\n", - "\n", - "*Target length: 150-300 words*" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e4f1d4de", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "question-1-computational-efficiency", - "locked": false, - "points": 10, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "\"\"\"\n", - "YOUR REFLECTION ON COMPUTATIONAL EFFICIENCY AND NUMERICAL STABILITY:\n", - "\n", - "TODO: Replace this text with your thoughtful response about production-grade activation function design.\n", - "\n", - "Consider addressing:\n", - "- How would you optimize activation functions for both efficiency and numerical stability?\n", - "- What strategies would you use to handle large-scale sparse computation in ReLU?\n", - "- How would 
you implement numerically stable Softmax for large vocabulary models?\n", - "- What precision trade-offs would you make across different hardware platforms?\n", - "- How would you prevent numerical instability from cascading through deep networks?\n", - "\n", - "Write a technical analysis connecting your activation implementations to real production optimization challenges.\n", - "\n", - "GRADING RUBRIC (Instructor Use):\n", - "- Demonstrates understanding of efficiency vs stability trade-offs (3 points)\n", - "- Addresses numerical stability concerns in large-scale systems (3 points)\n", - "- Shows practical knowledge of optimization strategies (2 points)\n", - "- Demonstrates systems thinking about activation function design (2 points)\n", - "- Clear technical reasoning and practical considerations (bonus points for innovative approaches)\n", - "\"\"\"\n", - "\n", - "### BEGIN SOLUTION\n", - "# Student response area - instructor will replace this section during grading setup\n", - "# This is a manually graded question requiring technical analysis of activation optimization\n", - "# Students should demonstrate understanding of efficiency and numerical stability in production systems\n", - "### END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "a3e810a3", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### Question 2: Hardware Optimization and Parallelization\n", - "\n", - "**Context**: Your activation functions perform element-wise operations that are ideal for parallel computation. Production ML systems deploy these functions across diverse hardware: CPUs, GPUs, TPUs, and edge devices, each with different computational characteristics and optimization opportunities.\n", - "\n", - "**Reflection Question**: Architect a hardware-aware activation function system that automatically optimizes for different compute platforms. 
How would you leverage ReLU's sparsity for GPU memory optimization, implement vectorized operations for CPU SIMD instructions, and design activation kernels for specialized AI accelerators? Consider the challenges of maintaining consistent numerical behavior across platforms while maximizing hardware-specific performance.\n", - "\n", - "Think about: SIMD vectorization, GPU kernel fusion, sparse computation patterns, and platform-specific optimization techniques.\n", - "\n", - "*Target length: 150-300 words*" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "84de797b", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "question-2-hardware-optimization", - "locked": false, - "points": 10, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "\"\"\"\n", - "YOUR REFLECTION ON HARDWARE OPTIMIZATION AND PARALLELIZATION:\n", - "\n", - "TODO: Replace this text with your thoughtful response about hardware-aware activation function design.\n", - "\n", - "Consider addressing:\n", - "- How would you design activation functions that optimize for different hardware platforms?\n", - "- What strategies would you use to leverage GPU parallelism for activation computations?\n", - "- How would you implement SIMD vectorization for CPU-based activation functions?\n", - "- What role would kernel fusion play in optimizing activation performance?\n", - "- How would you maintain numerical consistency across different hardware platforms?\n", - "\n", - "Write an architectural analysis connecting your activation implementations to real hardware optimization challenges.\n", - "\n", - "GRADING RUBRIC (Instructor Use):\n", - "- Shows understanding of hardware-specific optimization strategies (3 points)\n", - "- Designs practical approaches to parallel activation computation (3 points)\n", - "- Addresses platform consistency and performance trade-offs (2 points)\n", - "- Demonstrates systems thinking about 
hardware-software optimization (2 points)\n", - "- Clear architectural reasoning with hardware insights (bonus points for comprehensive understanding)\n", - "\"\"\"\n", - "\n", - "### BEGIN SOLUTION\n", - "# Student response area - instructor will replace this section during grading setup\n", - "# This is a manually graded question requiring understanding of hardware optimization challenges\n", - "# Students should demonstrate knowledge of parallel computation and platform-specific optimization\n", - "### END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "3f0acd8f", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### Question 3: Integration with Training Systems and Gradient Flow\n", - "\n", - "**Context**: Your activation functions will integrate with automatic differentiation systems for training neural networks. The choice and implementation of activation functions significantly impacts gradient flow, training stability, and convergence speed in large-scale ML training systems.\n", - "\n", - "**Reflection Question**: Design an activation function integration system for large-scale neural network training that optimizes gradient flow and training stability. How would you implement activation functions that support efficient gradient computation, handle the vanishing gradient problem in deep networks, and integrate with distributed training systems? 
Consider the challenges of maintaining training stability when activation choices affect gradient magnitude and direction across hundreds of layers.\n", - "\n", - "Think about: gradient flow characteristics, backpropagation efficiency, training stability, and distributed training considerations.\n", - "\n", - "*Target length: 150-300 words*" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "21dd088e", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "question-3-training-integration", - "locked": false, - "points": 10, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "\"\"\"\n", - "YOUR REFLECTION ON INTEGRATION WITH TRAINING SYSTEMS:\n", - "\n", - "TODO: Replace this text with your thoughtful response about activation function integration with training systems.\n", - "\n", - "Consider addressing:\n", - "- How would you design activation functions to optimize gradient flow in deep networks?\n", - "- What strategies would you use to handle vanishing/exploding gradient problems?\n", - "- How would you integrate activation functions with automatic differentiation systems?\n", - "- What role would activation choices play in distributed training stability?\n", - "- How would you balance activation complexity with training efficiency?\n", - "\n", - "Write a design analysis connecting your activation functions to automatic differentiation and training optimization.\n", - "\n", - "GRADING RUBRIC (Instructor Use):\n", - "- Understands activation function impact on gradient flow and training (3 points)\n", - "- Designs practical approaches to training integration and stability (3 points)\n", - "- Addresses distributed training and efficiency considerations (2 points)\n", - "- Shows systems thinking about training system architecture (2 points)\n", - "- Clear design reasoning with training optimization insights (bonus points for deep understanding)\n", - "\"\"\"\n", - "\n", - "### BEGIN 
SOLUTION\n", - "# Student response area - instructor will replace this section during grading setup\n", - "# This is a manually graded question requiring understanding of training system integration\n", - "# Students should demonstrate knowledge of gradient flow and training optimization challenges\n", - "### END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "f0f6910a", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🎯 MODULE SUMMARY: Activation Functions\n", - "\n", - " Congratulations! You have successfully implemented all four essential activation functions:\n", - "\n", - "### ✅ What You have Built\n", - " - **ReLU**: The foundation of modern deep learning with sparsity and efficiency\n", - " - **Sigmoid**: Classic activation for binary classification and probability outputs\n", - " - **Tanh**: Zero-centered activation with better gradient properties\n", - " - **Softmax**: Probability distribution for multi-class classification\n", - " - **🆕 Autograd Support**: All activations now work with Variables for automatic differentiation\n", - " - **🆕 Gradient Computation**: Correct derivatives implemented for training neural networks\n", - "\n", - "### ✅ Key Learning Outcomes\n", - " - **Understanding**: Why nonlinearity is essential for neural networks\n", - " - **Implementation**: Built activation functions from scratch using NumPy\n", - " - **Testing**: Progressive validation with immediate feedback after each function\n", - " - **Integration**: Saw how activations work together in neural networks\n", - " - **Real-world context**: Understanding where each activation is used\n", - " - **🆕 Autograd Integration**: Learned how to make functions work with automatic differentiation\n", - " - **🆕 Gradient Computation**: Implemented mathematically correct backward passes\n", - "\n", - "### ✅ Mathematical Mastery\n", - " - **ReLU**: f(x) = max(0, x), f'(x) = 1 if x > 0 else 0\n", - " - **Sigmoid**: f(x) = 1/(1 + e^(-x)), f'(x) = f(x)(1 - 
f(x))\n", - " - **Tanh**: f(x) = tanh(x), f'(x) = 1 - f(x)²\n", - " - **Softmax**: f(x_i) = e^(x_i)/Σ(e^(x_j)), complex Jacobian for backprop\n", - " - **🆕 Gradient Functions**: All derivatives implemented for automatic differentiation\n", - "\n", - "### ✅ Professional Skills Developed\n", - " - **Numerical stability**: Handling overflow and underflow\n", - " - **API design**: Consistent interfaces across all functions\n", - " - **Testing discipline**: Immediate validation after each implementation\n", - " - **Integration thinking**: Understanding how components work together\n", - " - **🆕 Autograd Design**: Making functions compatible with automatic differentiation\n", - " - **🆕 Backward Pass Implementation**: Writing gradient functions for training\n", - "\n", - "### ✅ Ready for Next Steps\n", - " Your activation functions are now ready to power:\n", - " - **Dense layers**: Linear transformations with nonlinear activations\n", - " - **Convolutional layers**: Spatial feature extraction with ReLU\n", - " - **Network architectures**: Complete neural networks with proper activations\n", - " - **🆕 Training Pipelines**: Full gradient-based optimization with autograd support\n", - " - **🆕 Neural Network Layers**: Components that can be trained end-to-end\n", - "\n", - "### 🔗 Connection to Real ML Systems\n", - " Your implementations mirror production systems:\n", - " - **PyTorch**: `torch.nn.ReLU()`, `torch.nn.Sigmoid()`, `torch.nn.Tanh()`, `torch.nn.Softmax()`\n", - " - **TensorFlow**: `tf.nn.relu()`, `tf.nn.sigmoid()`, `tf.nn.tanh()`, `tf.nn.softmax()`\n", - " - **Industry applications**: Every major deep learning model uses these functions\n", - "\n", - "### 🎯 The Power of Nonlinearity\n", - " You have unlocked the key to deep learning:\n", - " - **Before**: Linear models limited to simple patterns\n", - " - **After**: Nonlinear models can learn any pattern (universal approximation)\n", - "\n", - " **Next Module**: Layers - Building blocks that combine your tensors 
and activations into powerful transformations!\n", - "\n", - " Your activation functions are the key to neural network intelligence. Now let us build the layers that use them!" - ] - } - ], - "metadata": { - "jupytext": { - "main_language": "python" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/modules_old/02_activations/activations_dev.py b/modules_old/02_activations/activations_dev.py deleted file mode 100644 index 44e14188..00000000 --- a/modules_old/02_activations/activations_dev.py +++ /dev/null @@ -1,705 +0,0 @@ -# --- -# jupyter: -# jupytext: -# text_representation: -# extension: .py -# format_name: percent -# format_version: '1.3' -# jupytext_version: 1.17.1 -# --- - -# %% [markdown] -""" -# Activations - Nonlinear Intelligence for Neural Networks - -Welcome to Activations! You'll implement the essential functions that enable neural networks to learn complex patterns. - -## 🔗 Building on Previous Learning -**What You Built Before**: -- Module 01 (Tensor): N-dimensional arrays with broadcasting - -**The Gap**: Linear operations stacked together remain linear - limiting networks to simple patterns. - -**This Module's Solution**: Implement ReLU and Softmax activation functions that add nonlinearity, enabling complex learning. - -**Connection Map**: -``` -Tensor → Activations → Neural Networks -(data) (intelligence) (complex learning) -``` - -## Learning Objectives -1. **Core Implementation**: Build ReLU and Softmax activation functions -2. **Conceptual Understanding**: How nonlinearity enables complex pattern learning -3. **Testing Skills**: Validate activation functions with comprehensive tests -4. **Integration Knowledge**: Connect activations to neural network systems - -## Build → Test → Use -1. **Build**: Implement essential activation functions -2. **Test**: Validate correctness and properties -3. 
**Use**: Apply in neural network contexts -""" - -# In[ ]: - -#| default_exp core.activations - -#| export -import numpy as np -import os -import sys - -# Import our tensor foundation -try: - from tinytorch.core.tensor import Tensor -except ImportError: - # For development - import from local modules - sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor')) - from tensor_dev import Tensor - -# In[ ]: - -print("🔥 TinyTorch Activations Module") -print(f"NumPy version: {np.__version__}") -print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}") -print("Ready to build essential activation functions!") - -# %% [markdown] -""" -## The Intelligence Layer: How Nonlinearity Enables Learning - -Without activation functions, neural networks are just fancy linear algebra. No matter how many layers you stack, they can only learn straight lines. Activation functions add the "intelligence" that enables neural networks to learn curves, patterns, and complex relationships. - -### The Linearity Problem - -``` -Linear Network (No Activations): -Input → Linear → Linear → Linear → Output - x → Ax → B(Ax) →C(B(Ax)) = (CBA)x - -Result: Still just a linear function! -Cannot learn: curves, XOR, complex patterns -``` - -### The Nonlinearity Solution - -``` -Nonlinear Network (With Activations): -Input → Linear → ReLU → Linear → ReLU → Output - x → Ax → max(0,Ax) → B(·) → max(0,B(·)) - -Result: Can approximate ANY function! 
-Can learn: curves, XOR, images, language -``` - -### ReLU: The Intelligence Function - -ReLU (Rectified Linear Unit) is the most important function in modern AI: - -``` -ReLU Function: f(x) = max(0, x) - - y - ▲ - │ ╱ - │ ╱ (positive values unchanged) - │ ╱ -───┼─────────▶ x - │ 0 (negative values → 0) - │ - -Key Properties: -• Computationally cheap: just comparison and zero -• Gradient friendly: derivative is 0 or 1 -• Solves vanishing gradients: keeps signal strong -• Enables deep networks: 100+ layers possible -``` - -### Softmax: The Probability Converter - -Softmax transforms any numbers into valid probabilities: - -``` -Raw Scores → Softmax → Probabilities -[2.0, 1.0, 0.1] → [0.66, 0.24, 0.10] - ↑ ↑ ↑ - Sum = 1.0 ✓ - All ≥ 0 ✓ - Larger in → Larger out ✓ - -Formula: softmax(xᵢ) = exp(xᵢ) / Σⱼ exp(xⱼ) - -Use Case: Classification ("What percentage dog vs cat?") -``` -""" - -# %% [markdown] -""" -## Part 1: ReLU - The Foundation of Modern Deep Learning - -ReLU transformed deep learning from a curiosity to the technology powering modern AI. Before ReLU, deep networks suffered from vanishing gradients and couldn't learn effectively beyond a few layers. ReLU's simple yet brilliant design solved this problem. - -### ReLU in Action: Element-wise Processing - -``` -Input Tensor: After ReLU: -┌─────────────────┐ ┌─────────────────┐ -│ -2.1 0.5 3.2│ │ 0.0 0.5 3.2│ -│ 1.7 -0.8 2.1│ → │ 1.7 0.0 2.1│ -│ -1.0 4.0 -0.3│ │ 0.0 4.0 0.0│ -└─────────────────┘ └─────────────────┘ - ↓ ↓ -Negative → 0 Positive → unchanged -``` - -### The Dead Neuron Problem - -``` -ReLU can "kill" neurons permanently: - -Neuron with weights that produce only negative outputs: -Input: [1, 2, 3] → Linear: weights*input = -5.2 → ReLU: 0 -Input: [4, 1, 2] → Linear: weights*input = -2.8 → ReLU: 0 -Input: [0, 5, 1] → Linear: weights*input = -1.1 → ReLU: 0 - -Result: Neuron outputs 0 forever (no learning signal) -This is why proper weight initialization matters! 
-``` - -### Why ReLU Works Better Than Alternatives - -``` -Sigmoid: f(x) = 1/(1 + e^(-x)) -Problem: Gradients vanish for |x| > 3 - -Tanh: f(x) = tanh(x) -Problem: Gradients vanish for |x| > 2 - -ReLU: f(x) = max(0, x) -Solution: Gradient is exactly 1 for x > 0 (no vanishing!) -``` - -Now let's implement this game-changing function: -""" - -# %% nbgrader={"grade": false, "grade_id": "relu-class", "solution": true} - -#| export -class ReLU: - """ - ReLU Activation Function: f(x) = max(0, x) - - Zeros out negative values, preserves positive values. - Essential for modern deep learning. - """ - - def forward(self, x): - """ - Apply ReLU activation: f(x) = max(0, x) - - Args: - x (Tensor): Input tensor - - Returns: - Tensor: Output with negatives zeroed - - TODO: Implement ReLU using numpy's maximum function - - APPROACH: - 1. Validate input is a Tensor - 2. Use np.maximum(0, x.data) for vectorized operation - 3. Return new Tensor with result - - EXAMPLE: - >>> relu = ReLU() - >>> x = Tensor([[-1.0, 1.0]]) - >>> y = relu.forward(x) - >>> print(y.data) # [[0.0, 1.0]] - """ - ### BEGIN SOLUTION - # Input validation - if not isinstance(x, Tensor): - raise TypeError(f"Expected Tensor, got {type(x)}") - - # Check for empty tensor - if x.data.size == 0: - return Tensor(np.array([])) - - # Check for NaN or infinite values - if np.any(np.isnan(x.data)) or np.any(np.isinf(x.data)): - raise ValueError("Input tensor contains NaN or infinite values") - - # Vectorized element-wise maximum with 0 - # This is the exact operation that revolutionized deep learning! - result = np.maximum(0, x.data) - return Tensor(result) - ### END SOLUTION - - def forward_(self, x): - """ - Apply ReLU in-place (modifies original tensor). 
- - Args: - x (Tensor): Input tensor to modify - - Returns: - Tensor: Same tensor object (modified) - """ - ### BEGIN SOLUTION - if not isinstance(x, Tensor): - raise TypeError(f"Expected Tensor, got {type(x)}") - if x.data.size == 0: - return x - if np.any(np.isnan(x.data)) or np.any(np.isinf(x.data)): - raise ValueError("Input tensor contains NaN or infinite values") - np.maximum(0, x.data, out=x.data) - return x - ### END SOLUTION - - def __call__(self, x): - """Make ReLU callable: relu(x) instead of relu.forward(x)""" - return self.forward(x) - -# ✅ IMPLEMENTATION CHECKPOINT: ReLU class complete - -# %% [markdown] -""" -## Testing ReLU Implementation - -### 🧪 Unit Test: ReLU Activation -This test validates our ReLU implementation with various input scenarios - -**What we're testing**: ReLU's core behavior - zero negatives, preserve positives -**Why it matters**: ReLU must work perfectly for neural networks to learn -**Expected**: All negative values become 0, positive values unchanged - -### ReLU Test Cases Visualization - -``` -Test Case 1 - Basic Functionality: -Input: [-2, -1, 0, 1, 2] -Output: [ 0, 0, 0, 1, 2] - ↑ ↑ ↑ ↑ ↑ - ✓ ✓ ✓ ✓ ✓ - (all negatives → 0, positives preserved) - -Test Case 2 - Matrix Processing: -Input: [[-1.5, 2.3], Output: [[0.0, 2.3], - [ 0.0, -3.7]] [0.0, 0.0]] - -Test Case 3 - Edge Cases: -• Very large positive: 1e6 → 1e6 (no overflow) -• Very small negative: -1e-6 → 0 (proper handling) -• Zero exactly: 0.0 → 0.0 (boundary condition) -``` -""" - -def test_unit_relu_activation(): - """ - Test ReLU activation function. - - Validates that ReLU zeros negatives and preserves positives. 
- """ - print("🔬 Unit Test: ReLU Activation...") - - relu = ReLU() - - # Basic functionality test - test_input = Tensor([[-2, -1, 0, 1, 2]]) - result = relu(test_input) - expected = np.array([[0, 0, 0, 1, 2]]) - - assert np.array_equal(result.data, expected), f"ReLU failed: expected {expected}, got {result.data}" - - # 2D tensor test - matrix_input = Tensor([[-1, 2], [3, -4]]) - matrix_result = relu(matrix_input) - expected_matrix = np.array([[0, 2], [3, 0]]) - - assert np.array_equal(matrix_result.data, expected_matrix), "ReLU should work with 2D tensors" - - # In-place operation test - inplace_input = Tensor([[-1, 0, 1]]) - relu.forward_(inplace_input) - expected_inplace = np.array([[0, 0, 1]]) - - assert np.array_equal(inplace_input.data, expected_inplace), "In-place ReLU should modify original tensor" - - print("✅ ReLU activation tests passed!") - -# Test immediately after implementation -test_unit_relu_activation() - -# %% [markdown] -""" -## Part 2: Softmax - Converting Scores to Probabilities - -Softmax is the bridge between raw neural network outputs and human-interpretable probabilities. It takes any vector of real numbers and transforms it into a valid probability distribution where all values sum to 1.0. 
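The score-to-probability transformation can be sketched in a few lines of plain NumPy. This is an illustrative standalone sketch, independent of the `Tensor` class and the `Softmax` class built in this module:

```python
import numpy as np

scores = np.array([2.0, 1.0, 0.1])   # raw network outputs (logits)
exp_vals = np.exp(scores)            # exponentiation: every value becomes positive
probs = exp_vals / exp_vals.sum()    # normalization: values now sum to 1.0
print(np.round(probs, 2))            # ≈ [0.66, 0.24, 0.10]
```

The result is a valid probability distribution: every entry is non-negative and the entries sum to exactly 1.0.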
- -### The Probability Transformation Process - -``` -Step 1: Raw Neural Network Outputs (can be any values) -Raw scores: [2.0, 1.0, 0.1] - -Step 2: Exponentiation (makes everything positive) -exp([2.0, 1.0, 0.1]) = [7.39, 2.72, 1.10] - -Step 3: Normalization (makes sum = 1.0) -[7.39, 2.72, 1.10] / (7.39+2.72+1.10) = [0.66, 0.24, 0.10] - ↑ ↑ ↑ ↑ - Sum: 11.21 Total: 1.00 ✓ -``` - -### Softmax in Classification - -``` -Neural Network for Image Classification: - Raw Scores Softmax Interpretation -Input: Dog Image → [2.0, 1.0, 0.1] → [0.66, 0.24, 0.10] → 66% Dog - ↑ ↑ ↑ ↑ ↑ ↑ 24% Cat - Dog Cat Bird Dog Cat Bird 10% Bird - -Key Properties: -• Larger inputs get exponentially larger probabilities -• Never produces negative probabilities -• Always sums to exactly 1.0 -• Differentiable (can backpropagate gradients) -``` - -### The Numerical Stability Problem - -``` -Raw Softmax Formula: softmax(xᵢ) = exp(xᵢ) / Σⱼ exp(xⱼ) - -Problem with large numbers: -Input: [1000, 999, 998] -exp([1000, 999, 998]) = [∞, ∞, ∞] ← Overflow! - -Solution - Subtract max before exp: -x_stable = x - max(x) -Input: [1000, 999, 998] - 1000 = [0, -1, -2] -exp([0, -1, -2]) = [1.00, 0.37, 0.14] ← Stable! -``` - -Now let's implement this essential function: -""" - -# %% nbgrader={"grade": false, "grade_id": "softmax-class", "solution": true} - -#| export -class Softmax: - """ - Softmax Activation Function: f(x_i) = e^(x_i) / Σ(e^(x_j)) - - Converts any vector into a probability distribution. - Essential for classification tasks. - """ - - def __init__(self, dim=-1): - """ - Initialize Softmax with dimension specification. - - Args: - dim (int): Dimension along which to apply softmax. - -1 means last dimension (most common) - 0 means first dimension, etc.
- - Examples: - Softmax(dim=-1) # Apply along last dimension (default) - Softmax(dim=0) # Apply along first dimension - Softmax(dim=1) # Apply along second dimension - """ - self.dim = dim - - def forward(self, x): - """ - Apply Softmax activation with numerical stability. - - Args: - x (Tensor): Input tensor containing scores - - Returns: - Tensor: Probability distribution (sums to 1) - - TODO: Implement numerically stable softmax - - APPROACH: - 1. Validate input is a Tensor - 2. Subtract max for numerical stability - 3. Compute exponentials: np.exp(x_stable) - 4. Normalize by sum to create probabilities - - EXAMPLE: - >>> softmax = Softmax() - >>> x = Tensor([[1.0, 2.0, 3.0]]) - >>> y = softmax.forward(x) - >>> print(np.sum(y.data)) # 1.0 - """ - ### BEGIN SOLUTION - # Input validation - if not isinstance(x, Tensor): - raise TypeError(f"Expected Tensor, got {type(x)}") - - # Check for empty tensor - if x.data.size == 0: - raise ValueError("Cannot apply softmax to empty tensor") - - # Check for NaN values (infinite values are handled by max subtraction) - if np.any(np.isnan(x.data)): - raise ValueError("Input tensor contains NaN values") - - # Step 1: Numerical stability - subtract maximum value - # This prevents exp(large_number) from overflowing to infinity - max_vals = np.max(x.data, axis=self.dim, keepdims=True) - x_stable = x.data - max_vals - - # Step 2: Compute exponentials of stable values - exp_vals = np.exp(x_stable) - - # Step 3: Normalize to create probability distribution - sum_exp = np.sum(exp_vals, axis=self.dim, keepdims=True) - - # Handle edge case where sum is zero (shouldn't happen with valid input) - if np.any(sum_exp == 0): - raise ValueError("Softmax normalization resulted in zero sum") - - result = exp_vals / sum_exp - - return Tensor(result) - ### END SOLUTION - - def __call__(self, x): - """Make Softmax callable: softmax(x) instead of softmax.forward(x)""" - return self.forward(x) - -# ✅ IMPLEMENTATION CHECKPOINT: Softmax class complete - 
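As a sanity check on the max-subtraction trick used in the solution above, here is a standalone NumPy comparison of a naive softmax against the shifted version. This is an illustrative sketch only; it uses neither the `Tensor` nor the `Softmax` class:

```python
import numpy as np

def naive_softmax(x):
    e = np.exp(x)                      # exp(1000) overflows to inf
    return e / e.sum()

def stable_softmax(x):
    e = np.exp(x - np.max(x))          # largest exponent becomes exp(0) = 1
    return e / e.sum()

x = np.array([1000.0, 999.0, 998.0])
with np.errstate(over='ignore', invalid='ignore'):
    print(naive_softmax(x))            # [nan nan nan]: inf / inf is undefined
print(np.round(stable_softmax(x), 3))  # ≈ [0.665, 0.245, 0.090]
```

Subtracting a constant never changes the result, because exp(x - c) / Σ exp(x - c) = exp(x) / Σ exp(x): the factor exp(-c) cancels in numerator and denominator.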
-# %% [markdown] -""" -## Testing Softmax Implementation - -### 🧪 Unit Test: Softmax Activation -This test validates our Softmax implementation for correctness and numerical stability - -**What we're testing**: Softmax probability distribution properties -**Why it matters**: Softmax must create valid probabilities for classification -**Expected**: All outputs ≥ 0, sum to 1.0, numerically stable with large inputs - -### Softmax Test Cases Visualization - -``` -Test Case 1 - Basic Probability Distribution: -Input: [1.0, 2.0, 3.0] -Output: [0.09, 0.24, 0.67] ← Sum = 1.00 ✓, All ≥ 0 ✓ - ↑ ↑ ↑ - e^1/Σ e^2/Σ e^3/Σ (largest input gets largest probability) - -Test Case 2 - Numerical Stability: -Input: [1000, 999, 998] ← Would cause overflow without stability trick -Output: [0.67, 0.24, 0.09] ← Still produces valid probabilities! - -Test Case 3 - Edge Cases: -• All equal inputs: [1, 1, 1] → [0.33, 0.33, 0.33] (uniform distribution) -• One dominant: [10, 0, 0] → [≈1.0, ≈0.0, ≈0.0] (winner-take-all) -• Negative inputs: [-1, -2, -3] → [0.67, 0.24, 0.09] (still works!) - -Test Case 4 - Batch Processing: -Input Matrix: [[1, 2, 3], Output Matrix: [[0.09, 0.24, 0.67], - [4, 5, 6]] → [0.09, 0.24, 0.67]] - ↑ ↑ - Each row processed independently Each row sums to 1.0 -``` -""" - -def test_unit_softmax_activation(): - """ - Test Softmax activation function. - - Validates that Softmax creates valid probability distributions. 
- """ - print("🔬 Unit Test: Softmax Activation...") - - softmax = Softmax() - - # Basic probability distribution test - test_input = Tensor([[1.0, 2.0, 3.0]]) - result = softmax(test_input) - - # Check outputs sum to 1 - sum_result = np.sum(result.data, axis=-1) - assert np.allclose(sum_result, 1.0), f"Softmax should sum to 1, got {sum_result}" - assert np.all(result.data >= 0), "Softmax outputs should be non-negative" - - # Numerical stability test with large values - large_input = Tensor([[1000.0, 1001.0, 1002.0]]) - large_result = softmax(large_input) - - assert not np.any(np.isnan(large_result.data)), "Should handle large values without NaN" - assert np.allclose(np.sum(large_result.data, axis=-1), 1.0), "Large values should still sum to 1" - - # Batch processing test - batch_input = Tensor([[1.0, 2.0], [3.0, 4.0]]) - batch_result = softmax(batch_input) - row_sums = np.sum(batch_result.data, axis=-1) - assert np.allclose(row_sums, [1.0, 1.0]), "Each batch item should sum to 1" - - print("✅ Softmax activation tests passed!") - -# Test immediately after implementation -test_unit_softmax_activation() - -# ✅ IMPLEMENTATION CHECKPOINT: Both ReLU and Softmax complete - -# In[ ]: - -# %% [markdown] -""" -## Integration Testing: Activations in Neural Network Context - -Let's test these activations in realistic neural network scenarios -""" - -def test_unit_activation_pipeline(): - """Test activations working together in a neural network pipeline.""" - print("🔬 Unit Test: Activation Pipeline...") - - relu = ReLU() - softmax = Softmax() - - # Test neural network pipeline - hidden_output = Tensor([[-2.0, -1.0, 0.0, 1.0, 2.0]]) - hidden_activated = relu(hidden_output) - expected_relu = np.array([[0.0, 0.0, 0.0, 1.0, 2.0]]) - - assert np.array_equal(hidden_activated.data, expected_relu), "ReLU should zero negatives" - - # Classification with Softmax - class_logits = Tensor([[2.0, 1.0, 0.1]]) - class_probabilities = softmax(class_logits) - - assert 
np.allclose(np.sum(class_probabilities.data, axis=-1), 1.0), "Softmax should sum to 1" - assert np.all(class_probabilities.data >= 0), "Probabilities should be non-negative" - - print("✅ Activation pipeline works correctly!") - -# Test pipeline functionality -test_unit_activation_pipeline() - -# In[ ]: - -# %% [markdown] -""" -## Integration Test: Realistic Neural Network Pipeline - -Test activations in a complete neural network forward pass simulation -""" - -def test_module(): - """Complete module test covering all activation functionality.""" - print("🔬 Complete Module Test: All Activations...") - - # Test individual components - test_unit_relu_activation() - test_unit_softmax_activation() - test_unit_activation_pipeline() - - # Test error handling - relu = ReLU() - try: - relu("not a tensor") - assert False, "Should raise TypeError" - except TypeError: - pass # Expected - - print("\n✅ Complete module test passed!") - print("✅ All activation functions working correctly") - print("✅ Ready for neural network integration") - -# Test complete module -test_module() - -# In[ ]: - -# Main execution block - all tests run when module is executed directly -if __name__ == "__main__": - print("\n" + "="*50) - print("🚀 RUNNING ACTIVATION TESTS") - print("="*50) - - # Run complete module test - test_module() - - print("\n" + "="*50) - print("🎉 ACTIVATION MODULE COMPLETE!") - print("="*50) - print("✅ ReLU: Simple and effective nonlinearity") - print("✅ Softmax: Converts scores to probabilities") - print("💡 Ready to build neural network layers!") - - print(f"\n🎯 Module 02 (Activations) Complete!") - print(f"Next: Module 03 - Neural Network Layers!") - -# %% [markdown] -""" -## 🤔 ML Systems Thinking: Interactive Questions - -### Question 1: Activation Function Choice - -**Context**: You implemented ReLU (simple max operation) and Softmax (exponentials + normalization). 
- -**Question**: For a mobile neural network with limited compute, analyze the trade-offs between ReLU and Softmax. Consider computational cost, memory usage, and when each is essential. - -**YOUR ANALYSIS:** - -[Student response area] - -### Question 2: Numerical Stability - -**Context**: Your Softmax subtracts the maximum value before computing exponentials. - -**Question**: Why is this numerical stability crucial? How do small errors in activations affect deep network training? - -**YOUR ANALYSIS:** - -[Student response area] -""" - -# %% [markdown] -""" -## 🎯 MODULE SUMMARY: Essential Activations - -Congratulations! You've implemented the essential activation functions for neural networks: - -### What You've Accomplished -✅ **ReLU Implementation**: The activation function that revolutionized deep learning -✅ **Softmax Implementation**: Converts any vector to a probability distribution -✅ **Testing Framework**: Comprehensive validation of activation properties -✅ **Pipeline Integration**: Demonstrated activations working in neural network contexts - -### Key Learning Outcomes -- **Nonlinearity Understanding**: How activation functions enable complex pattern learning -- **Numerical Implementation**: Building mathematically correct and stable algorithms -- **Error Handling**: Robust implementations that handle edge cases gracefully -- **Systems Integration**: Components that work together in larger systems - -### Mathematical Foundations Mastered -- **ReLU**: f(x) = max(0, x) - simple yet powerful nonlinearity -- **Softmax**: Converting scores to probabilities with numerical stability -- **Probability Theory**: Understanding valid probability distributions - -### Ready for Next Steps -Your activation implementations enable: -- **Neural Network Layers**: Combining with linear transformations -- **Classification**: Converting network outputs to interpretable probabilities -- **Deep Learning**: Training networks with many layers - -### Connection to Real Systems -- 
**PyTorch**: Your implementations mirror `torch.nn.ReLU()` and `torch.nn.Softmax()` -- **Production**: Same mathematical foundations with hardware optimizations - -### Next Steps -Ready for Module 03: Neural Network Layers - combining your activations with linear transformations! - -**Forward Momentum**: You've built the nonlinear intelligence that makes neural networks powerful! -""" \ No newline at end of file diff --git a/modules_old/02_activations/activations_streamlined.py b/modules_old/02_activations/activations_streamlined.py deleted file mode 100644 index 665cbfb9..00000000 --- a/modules_old/02_activations/activations_streamlined.py +++ /dev/null @@ -1,770 +0,0 @@ -# --- -# jupyter: -# jupytext: -# text_representation: -# extension: .py -# format_name: percent -# format_version: '1.3' -# jupytext_version: 1.17.1 -# --- - -# %% [markdown] -""" -# Activations - Essential Nonlinearity Functions - -Welcome to the streamlined Activations module! You'll implement the two most important activation functions in modern deep learning: ReLU and Softmax. - -## Learning Goals -- Systems understanding: Why ReLU became the dominant activation and how Softmax enables classification -- Core implementation skill: Build the two activation functions that power 90%+ of modern architectures -- Pattern recognition: Understand when to use ReLU (hidden layers) vs Softmax (output layers) -- Framework connection: See how your implementations match PyTorch's essential activations -- Performance insight: Learn why ReLU is computationally efficient and Softmax requires careful numerical stability - -## Build → Use → Reflect -1. **Build**: ReLU and Softmax activation functions with proper numerical stability -2. **Use**: Apply these activations in realistic neural network scenarios -3. **Reflect**: Why did ReLU revolutionize deep learning, and why is Softmax essential for classification? 
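The performance claim in the Reflect step is easy to see directly: ReLU is a single vectorized element-wise comparison, with no exponentials or divisions, and for zero-centered inputs it switches off roughly half of the activations. A standalone NumPy sketch (the random array is a made-up stand-in for hidden-layer pre-activations):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
x = rng.standard_normal((4, 5))   # stand-in for hidden-layer pre-activations
out = np.maximum(0.0, x)          # ReLU: one element-wise max, nothing else

sparsity = np.mean(out == 0.0)    # fraction of units "switched off"
print(f"{sparsity:.0%} of activations are exactly zero")
```

The sparsity printed here is typically around half for zero-mean inputs, which is exactly the natural sparsity that the module discusses.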
- -## What You'll Achieve -By the end of this module, you'll understand: -- Deep technical understanding of the two activation functions that enable modern deep learning -- Practical capability to implement numerically stable activations used in production systems -- Systems insight into why activation choice determines training success and computational efficiency -- Performance consideration of how ReLU's simplicity and Softmax's complexity affect system design -- Connection to production ML systems and the design decisions behind activation function choice - -## Why Only ReLU and Softmax? - -In this educational framework, we focus on the two most important activation functions: - -### ReLU (Rectified Linear Unit) -- **Most widely used** in hidden layers (90%+ of architectures) -- **Computationally efficient**: Just max(0, x) -- **Solves vanishing gradients**: Doesn't saturate for positive values -- **Enables deep networks**: Critical breakthrough for training very deep networks - -### Softmax -- **Essential for classification**: Converts logits to probabilities -- **Attention mechanisms**: Used in transformers and attention-based models -- **Output layer standard**: Multi-class classification standard - -### Educational Focus -- **Master the fundamentals**: Deep understanding of essential functions -- **Real-world relevance**: These two handle the majority of practical use cases -- **System insight**: Understand why these became dominant -- **Foundation building**: Understanding these gives you the foundation for any activation - -## Systems Reality Check -💡 **Production Context**: PyTorch implements ReLU with highly optimized CUDA kernels, while Softmax requires careful numerical stability - your implementation reveals these design decisions -⚡ **Performance Note**: ReLU is popular partly because it's computationally cheap (just max(0,x)), while Softmax requires expensive exponentials and normalization -""" - -# %% nbgrader={"grade": false, "grade_id": 
"activations-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} -#| default_exp core.activations - -#| export -import math -import numpy as np -import os -import sys -from typing import Union, List - -# Import our tensor foundation -try: - from tinytorch.core.tensor import Tensor -except ImportError: - # For development - import from local modules - sys.path.append(os.path.join(os.path.dirname(__file__), '..', '02_tensor')) - from tensor_dev import Tensor - -# %% nbgrader={"grade": false, "grade_id": "activations-setup", "locked": false, "schema_version": 3, "solution": false, "task": false} -print("🔥 TinyTorch Activations Module") -print(f"NumPy version: {np.__version__}") -print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}") -print("Ready to build essential activation functions!") - -# %% [markdown] -""" -## ReLU - The Breakthrough Activation - -### What is ReLU? -**ReLU (Rectified Linear Unit)** is the simplest possible nonlinear activation: - -``` -f(x) = max(0, x) -``` - -### Why ReLU Revolutionized Deep Learning -1. **Computationally efficient**: No expensive exponentials or divisions -2. **Solves vanishing gradients**: Gradient is 1 for positive inputs, 0 for negative -3. **Sparse activation**: Naturally creates sparse representations (many zeros) -4. **Deep network enabler**: Made training networks with 100+ layers possible - -### Visual Understanding -``` -Input: [-2, -1, 0, 1, 2] -ReLU: [0, 0, 0, 1, 2] -``` - -### Real-World Impact -- **Computer Vision**: Enabled deep CNNs (AlexNet, ResNet, etc.) 
-- **NLP**: Powers transformer hidden layers -- **Training Speed**: 6x faster than sigmoid in many cases -- **Hardware**: Optimized in every GPU and AI accelerator - -### Mathematical Properties -- **Range**: [0, ∞) -- **Derivative**: f'(x) = 1 if x > 0, else 0 -- **Dead neurons**: Neurons can "die" if they always output 0 -- **Sparsity**: Naturally creates sparse activations -""" - -# %% nbgrader={"grade": false, "grade_id": "relu-class", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class ReLU: - """ - ReLU Activation Function: f(x) = max(0, x) - - The most important activation function in modern deep learning. - Computationally efficient and enables training very deep networks. - """ - - def forward(self, x): - """ - Apply ReLU activation: f(x) = max(0, x) - - STEP-BY-STEP IMPLEMENTATION: - 1. Use numpy maximum function to compute max(0, x) - 2. Return new Tensor with ReLU applied - - MATHEMATICAL FOUNDATION: - - Forward: f(x) = max(0, x) - - Sets all negative values to 0, keeps positive values unchanged - - EXAMPLE USAGE: - ```python - relu = ReLU() - tensor_input = Tensor([[-1.0, 0.0, 1.0]]) - tensor_output = relu(tensor_input) # [[0.0, 0.0, 1.0]] - ``` - - IMPLEMENTATION HINTS: - - Use np.maximum(0, x.data) for element-wise max - - Create new Tensor from result - - LEARNING CONNECTIONS: - - This is the core of torch.nn.ReLU() in PyTorch - - Used in 90%+ of hidden layers in modern architectures - - Enables training very deep networks - - Computationally efficient: just a comparison and selection - """ - ### BEGIN SOLUTION - result = np.maximum(0, x.data) - return Tensor(result) - ### END SOLUTION - - def forward_(self, x): - """ - Apply ReLU activation in-place: modifies input tensor directly - - In-place ReLU saves memory by reusing existing tensor buffer. - - STEP-BY-STEP IMPLEMENTATION: - 1. Apply ReLU directly to tensor._data - 2. 
Return the same tensor object (modified in-place) - - MEMORY BENEFITS: - - No new tensor allocation - - Critical for large networks and limited memory - - Used in PyTorch with relu_() syntax - - IMPLEMENTATION HINTS: - - Use np.maximum(0, x._data, out=x._data) for in-place operation - """ - ### BEGIN SOLUTION - np.maximum(0, x._data, out=x._data) - return x - ### END SOLUTION - - def __call__(self, x): - """Make the class callable: relu(x) instead of relu.forward(x)""" - return self.forward(x) - -# %% [markdown] -""" -### 🧪 Test Your ReLU Implementation - -Let's test your ReLU implementation immediately: -""" - -# %% nbgrader={"grade": true, "grade_id": "test-relu-immediate", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false} -def test_unit_relu_activation(): - """Unit test for the ReLU activation function.""" - print("🔬 Unit Test: ReLU Activation...") - - # Create ReLU instance - relu = ReLU() - - # Test with mixed positive/negative values - test_input = Tensor([[-2, -1, 0, 1, 2]]) - result = relu(test_input) - expected = np.array([[0, 0, 0, 1, 2]]) - - assert np.array_equal(result.data, expected), f"ReLU failed: expected {expected}, got {result.data}" - - # Test with all negative values - negative_input = Tensor([[-5, -3, -1]]) - negative_result = relu(negative_input) - expected_negative = np.array([[0, 0, 0]]) - - assert np.array_equal(negative_result.data, expected_negative), "ReLU should zero out negative values" - - # Test with all positive values (should be unchanged) - positive_input = Tensor([[1, 3, 5]]) - positive_result = relu(positive_input) - - assert np.array_equal(positive_result.data, positive_input.data), "ReLU should preserve positive values" - - # Test with 2D tensor - matrix_input = Tensor([[-1, 2], [3, -4]]) - matrix_result = relu(matrix_input) - expected_matrix = np.array([[0, 2], [3, 0]]) - - assert np.array_equal(matrix_result.data, expected_matrix), "ReLU should work with 2D tensors" - assert 
matrix_result.shape == matrix_input.shape, "ReLU should preserve shape" - - # Test in-place operation - inplace_input = Tensor([[-1, 0, 1]]) - original_data = inplace_input.data.copy() - relu.forward_(inplace_input) - expected_inplace = np.array([[0, 0, 1]]) - - assert np.array_equal(inplace_input.data, expected_inplace), "In-place ReLU should modify original tensor" - - print("✅ ReLU activation tests passed!") - print(f"✅ Correctly zeros out negative values") - print(f"✅ Preserves positive values") - print(f"✅ Shape preservation working") - print(f"✅ In-place operation working") - -# Test function defined (called in main block) - -# %% [markdown] -""" -## Softmax - Probability Distribution Creator - -### What is Softmax? -**Softmax** converts any real-valued vector into a probability distribution: - -``` -f(x_i) = e^(x_i) / Σ(e^(x_j)) -``` - -### Why Softmax is Essential -1. **Probability interpretation**: Outputs sum to 1 and are all positive -2. **Classification**: Standard for multi-class classification output layers -3. **Attention mechanisms**: Core component of transformer attention -4. **Differentiable**: Smooth gradients for optimization - -### Visual Understanding -``` -Input: [1.0, 2.0, 3.0] -Softmax: [0.09, 0.24, 0.67] # Probabilities that sum to 1 -``` - -### Real-World Applications -- **Classification**: Convert logits to class probabilities -- **Attention**: Transformer attention weights -- **Language modeling**: Next token prediction probabilities -- **Reinforcement learning**: Action probability distributions - -### Numerical Stability Challenge -Raw softmax can overflow with large inputs. The solution: -``` -f(x_i) = e^(x_i - max(x)) / Σ(e^(x_j - max(x))) -``` -""" - -# %% nbgrader={"grade": false, "grade_id": "softmax-class", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class Softmax: - """ - Softmax Activation Function: f(x_i) = e^(x_i) / Σ(e^(x_j)) - - Converts logits to probability distributions. 
- Essential for classification and attention mechanisms. - """ - - def __init__(self, dim=-1): - """ - Initialize Softmax with specified dimension. - - Args: - dim: Dimension along which to apply softmax (default: -1, last dimension) - """ - self.dim = dim - - def forward(self, x): - """ - Apply Softmax activation with numerical stability. - - STEP-BY-STEP IMPLEMENTATION: - 1. Subtract max value for numerical stability: x_stable = x - max(x) - 2. Compute exponentials: exp_vals = exp(x_stable) - 3. Compute sum of exponentials: sum_exp = sum(exp_vals) - 4. Divide: softmax = exp_vals / sum_exp - - MATHEMATICAL FOUNDATION: - - Forward: f(x_i) = e^(x_i - max(x)) / Σ(e^(x_j - max(x))) - - Numerically stable version prevents overflow - - Output is a probability distribution (sums to 1) - - EXAMPLE USAGE: - ```python - softmax = Softmax() - tensor_input = Tensor([[1.0, 2.0, 3.0]]) - tensor_output = softmax(tensor_input) # [[0.09, 0.24, 0.67]] - ``` - - IMPLEMENTATION HINTS: - - Use np.max(x.data, axis=self.dim, keepdims=True) for stability - - Use np.exp() for exponentials - - Use np.sum() with same axis for normalization - - LEARNING CONNECTIONS: - - This is the core of torch.nn.Softmax() in PyTorch - - Used in classification output layers - - Critical component of attention mechanisms - - Requires careful numerical implementation - """ - ### BEGIN SOLUTION - # Numerical stability: subtract max value - max_vals = np.max(x.data, axis=self.dim, keepdims=True) - x_stable = x.data - max_vals - - # Compute exponentials - exp_vals = np.exp(x_stable) - - # Compute softmax - sum_exp = np.sum(exp_vals, axis=self.dim, keepdims=True) - result = exp_vals / sum_exp - - return Tensor(result) - ### END SOLUTION - - def __call__(self, x): - """Make the class callable: softmax(x) instead of softmax.forward(x)""" - return self.forward(x) - -# %% [markdown] -""" -### 🧪 Test Your Softmax Implementation - -Let's test your Softmax implementation immediately: -""" - -# %% nbgrader={"grade": 
true, "grade_id": "test-softmax-immediate", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false} -def test_unit_softmax_activation(): - """Unit test for the Softmax activation function.""" - print("🔬 Unit Test: Softmax Activation...") - - # Create Softmax instance - softmax = Softmax() - - # Test with simple values - test_input = Tensor([[1.0, 2.0, 3.0]]) - result = softmax(test_input) - - # Check that outputs sum to 1 (probability distribution) - sum_result = np.sum(result.data, axis=-1) - assert np.allclose(sum_result, 1.0), f"Softmax should sum to 1, got {sum_result}" - - # Check that all values are positive - assert np.all(result.data >= 0), "Softmax outputs should be non-negative" - - # Test with zero input - zero_input = Tensor([[0.0, 0.0, 0.0]]) - zero_result = softmax(zero_input) - expected_uniform = np.array([[1/3, 1/3, 1/3]]) - - assert np.allclose(zero_result.data, expected_uniform, atol=1e-6), "Equal inputs should give uniform distribution" - - # Test numerical stability with large values - large_input = Tensor([[1000.0, 1001.0, 1002.0]]) - large_result = softmax(large_input) - - # Should not produce NaN or Inf - assert not np.any(np.isnan(large_result.data)), "Softmax should handle large values without NaN" - assert not np.any(np.isinf(large_result.data)), "Softmax should handle large values without Inf" - assert np.allclose(np.sum(large_result.data, axis=-1), 1.0), "Large value softmax should still sum to 1" - - # Test with 2D tensor (batch processing) - batch_input = Tensor([[1.0, 2.0], [3.0, 4.0]]) - batch_result = softmax(batch_input) - - # Each row should sum to 1 - row_sums = np.sum(batch_result.data, axis=-1) - assert np.allclose(row_sums, [1.0, 1.0]), "Each batch item should sum to 1" - - # Test shape preservation - assert batch_result.shape == batch_input.shape, "Softmax should preserve shape" - - print("✅ Softmax activation tests passed!") - print(f"✅ Outputs form valid probability distributions (sum to 1)") - 
print(f"✅ All outputs are non-negative") - print(f"✅ Numerically stable with large inputs") - print(f"✅ Batch processing works correctly") - print(f"✅ Shape preservation working") - -# Test function defined (called in main block) - -# %% [markdown] -""" -## Comprehensive Testing - -Let's run comprehensive tests that validate both activations working together: -""" - -# %% nbgrader={"grade": true, "grade_id": "test-activations-comprehensive", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false} -def test_unit_activations_comprehensive(): - """Comprehensive test of both activation functions.""" - print("🔬 Comprehensive Test: ReLU + Softmax Pipeline...") - - # Create activation instances - relu = ReLU() - softmax = Softmax() - - # Test realistic neural network scenario - # Simulate a network layer output (could be negative) - layer_output = Tensor([[-2.0, -1.0, 0.0, 1.0, 2.0]]) - - # Apply ReLU (hidden layer activation) - hidden_activation = relu(layer_output) - expected_relu = np.array([[0.0, 0.0, 0.0, 1.0, 2.0]]) - - assert np.array_equal(hidden_activation.data, expected_relu), "ReLU should zero negatives" - - # Apply Softmax to different tensor (classification output) - logits = Tensor([[2.0, 1.0, 0.1]]) - class_probabilities = softmax(logits) - - # Verify probability properties - assert np.allclose(np.sum(class_probabilities.data, axis=-1), 1.0), "Softmax should create probability distribution" - assert np.all(class_probabilities.data >= 0), "Probabilities should be non-negative" - - # Test that highest logit gets highest probability - max_logit_idx = np.argmax(logits.data) - max_prob_idx = np.argmax(class_probabilities.data) - assert max_logit_idx == max_prob_idx, "Highest logit should get highest probability" - - # Test with batch data (realistic scenario) - batch_logits = Tensor([ - [1.0, 2.0, 0.5], # Batch item 1 - [0.1, 0.2, 0.9], # Batch item 2 - [2.0, 1.0, 1.5] # Batch item 3 - ]) - - batch_probs = softmax(batch_logits) - - # 
Each row should sum to 1 - row_sums = np.sum(batch_probs.data, axis=1) - assert np.allclose(row_sums, [1.0, 1.0, 1.0]), "Each batch item should form probability distribution" - - print("✅ Comprehensive activation tests passed!") - print(f"✅ ReLU correctly processes hidden layer outputs") - print(f"✅ Softmax correctly creates probability distributions") - print(f"✅ Batch processing works for realistic scenarios") - print(f"✅ Activations preserve expected mathematical properties") - -# Test function defined (called in main block) - -# %% [markdown] -""" -## Integration Test: Real Neural Network Scenario - -Let's test these activations in a realistic neural network context: -""" - -# %% nbgrader={"grade": true, "grade_id": "test-activations-integration", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false} -def test_module_activation_integration(): - """Integration test: activations in a realistic neural network pipeline.""" - print("🔬 Integration Test: Neural Network Pipeline...") - - # Simulate a complete forward pass through a small network - relu = ReLU() - softmax = Softmax() - - # Step 1: Input data (batch of 3 samples, 4 features each) - input_data = Tensor([ - [0.5, -0.3, 1.2, -0.8], # Sample 1 - [-1.0, 0.8, 0.0, 1.5], # Sample 2 - [0.2, -0.5, -0.9, 0.3] # Sample 3 - ]) - - # Step 2: Simulate hidden layer output (after linear transformation) - # In real network this would be: input @ weights + bias - hidden_output = Tensor([ - [-1.5, 0.8, 2.1], # Sample 1 hidden activations - [0.3, -0.6, 1.2], # Sample 2 hidden activations - [-0.8, 1.5, -0.3] # Sample 3 hidden activations - ]) - - # Step 3: Apply ReLU to hidden layer - hidden_activated = relu(hidden_output) - - # Verify ReLU behavior - expected_relu = np.array([ - [0.0, 0.8, 2.1], - [0.3, 0.0, 1.2], - [0.0, 1.5, 0.0] - ]) - assert np.allclose(hidden_activated.data, expected_relu), "ReLU should zero negatives in hidden layer" - - # Step 4: Simulate final layer output (logits for 
3 classes) - final_logits = Tensor([ - [2.1, 0.5, 1.2], # Sample 1 class scores - [0.8, 1.5, 0.3], # Sample 2 class scores - [1.0, 2.0, 0.1] # Sample 3 class scores - ]) - - # Step 5: Apply Softmax for classification - class_probabilities = softmax(final_logits) - - # Verify softmax properties - batch_sums = np.sum(class_probabilities.data, axis=1) - assert np.allclose(batch_sums, [1.0, 1.0, 1.0]), "Each sample should have probabilities summing to 1" - - # Verify predictions make sense (highest logit -> highest probability) - for i in range(3): - max_logit_class = np.argmax(final_logits.data[i]) - max_prob_class = np.argmax(class_probabilities.data[i]) - assert max_logit_class == max_prob_class, f"Sample {i}: highest logit should get highest probability" - - # Test memory efficiency (shapes preserved) - assert hidden_activated.shape == hidden_output.shape, "ReLU should preserve tensor shape" - assert class_probabilities.shape == final_logits.shape, "Softmax should preserve tensor shape" - - print("✅ Integration test passed!") - print(f"✅ Complete forward pass simulation successful") - print(f"✅ ReLU enables nonlinear hidden representations") - print(f"✅ Softmax provides interpretable classification outputs") - print(f"✅ Batch processing works throughout pipeline") - - # Display sample predictions - print(f"\n📊 Sample Predictions:") - for i in range(3): - probs = class_probabilities.data[i] - predicted_class = np.argmax(probs) - confidence = probs[predicted_class] - print(f" Sample {i+1}: Class {predicted_class} (confidence: {confidence:.3f})") - -# Test function defined (called in main block) - -# Main execution block -if __name__ == "__main__": - # Run all activation tests - test_unit_relu_activation() - test_unit_softmax_activation() - test_unit_activations_comprehensive() - test_module_activation_integration() - - print("\n🎉 All activation tests passed!") - print("✅ ReLU: The foundation of modern deep learning") - print("✅ Softmax: The key to interpretable 
classifications") - print("💡 Ready to build neural networks with essential nonlinearity!") - -# %% [markdown] -""" -## 🤔 ML Systems Thinking: Interactive Questions - -Now that you've built the essential activation functions, let's connect this work to broader ML systems challenges. These questions help you think critically about how activation choices scale to production ML environments. - -### Question 1: Performance and Hardware Optimization - -**Context**: Your ReLU implementation uses a simple `np.maximum(0, x)` operation, while Softmax requires exponentials and division. In production ML systems, activation functions are called billions of times during training and inference. - -**Reflection Question**: Design a performance optimization strategy for activation functions in a production ML framework. How would you optimize ReLU and Softmax differently for CPU vs GPU execution? Consider the trade-offs between memory bandwidth, computational complexity, and numerical precision. What specific optimizations would you implement for training vs inference scenarios? - -Think about: SIMD vectorization, kernel fusion, memory layout optimization, and precision requirements across different hardware architectures. - -*Target length: 150-300 words* -""" - -# %% nbgrader={"grade": true, "grade_id": "ml-systems-performance", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} -""" -YOUR REFLECTION ON PERFORMANCE AND HARDWARE OPTIMIZATION: - -TODO: Replace this text with your thoughtful response about activation function optimization. - -Consider addressing: -- How would you optimize ReLU vs Softmax differently for various hardware platforms? -- What role does memory bandwidth vs computational complexity play in optimization decisions? -- How would you handle precision trade-offs between training and inference? -- What specific CUDA kernel optimizations would benefit each activation? 
-- How would you design kernel fusion strategies to minimize memory traffic? - -Write a technical analysis connecting your implementations to real performance optimization challenges. - -GRADING RUBRIC (Instructor Use): -- Demonstrates understanding of hardware-specific optimization strategies (3 points) -- Addresses CPU vs GPU optimization differences appropriately (3 points) -- Shows practical knowledge of memory bandwidth and computational trade-offs (2 points) -- Demonstrates systems thinking about training vs inference requirements (2 points) -- Clear technical reasoning with performance insights (bonus points for innovative approaches) -""" - -### BEGIN SOLUTION -# Student response area - instructor will replace this section during grading setup -# This is a manually graded question requiring technical analysis of hardware optimization -# Students should demonstrate understanding of performance optimization across different platforms -### END SOLUTION - -# %% [markdown] -""" -### Question 2: Numerical Stability and Production Reliability - -**Context**: Your Softmax implementation includes numerical stability measures (subtracting max values), but production systems face additional challenges: mixed precision training, gradient underflow, and distributed training synchronization. - -**Reflection Question**: Architect a numerically stable activation system for a production ML framework that handles edge cases and maintains training stability across different scenarios. How would you handle extreme input values, gradient explosion/vanishing, and precision loss in distributed training? Consider the challenges of maintaining numerical consistency when the same model runs on different hardware with different floating-point behaviors. - -Think about: numerical precision hierarchies, gradient clipping strategies, hardware-specific floating-point behaviors, and distributed synchronization requirements. 
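Before writing, it can help to see the failure mode concretely. A minimal NumPy sketch (independent of the module's Tensor class; the function names here are illustrative) contrasting naive softmax with the max-subtraction version this module implements:

```python
import numpy as np

def naive_softmax(x):
    # Direct formula: exp() overflows to inf for large inputs, giving nan
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)

def stable_softmax(x):
    # Subtract the row max first, so exp() only ever sees values <= 0
    shifted = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

logits = np.array([[1000.0, 1001.0, 1002.0]])

with np.errstate(over="ignore", invalid="ignore"):
    print(naive_softmax(logits))   # nan entries: exp(1000) overflows to inf
print(stable_softmax(logits))      # ≈ [[0.090, 0.245, 0.665]], sums to 1
```

Note that the shift changes nothing mathematically — multiplying numerator and denominator by e^(-max(x)) cancels — which is why the stable version is safe to use everywhere.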
- -*Target length: 150-300 words* -""" - -# %% nbgrader={"grade": true, "grade_id": "ml-systems-stability", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} -""" -YOUR REFLECTION ON NUMERICAL STABILITY AND PRODUCTION RELIABILITY: - -TODO: Replace this text with your thoughtful response about numerical stability design. - -Consider addressing: -- How would you design activation functions to handle extreme input values gracefully? -- What strategies would you use for maintaining numerical consistency across different hardware? -- How would you integrate gradient clipping and stability measures into activation implementations? -- What role does mixed precision training play in activation function design? -- How would you ensure distributed training maintains numerical consistency? - -Write an architectural analysis connecting your activation implementations to production stability challenges. - -GRADING RUBRIC (Instructor Use): -- Shows understanding of numerical stability challenges in production systems (3 points) -- Addresses hardware-specific floating-point considerations (3 points) -- Designs practical stability measures for distributed training (2 points) -- Demonstrates systems thinking about gradient stability and precision (2 points) -- Clear architectural reasoning with stability insights (bonus points for comprehensive understanding) -""" - -### BEGIN SOLUTION -# Student response area - instructor will replace this section during grading setup -# This is a manually graded question requiring understanding of numerical stability in production -# Students should demonstrate knowledge of floating-point challenges and distributed training -### END SOLUTION - -# %% [markdown] -""" -### Question 3: Activation Function Evolution and System Design - -**Context**: You implemented ReLU and Softmax, the current standards, but activation functions continue to evolve (GELU, Swish, etc.). 
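One concrete shape such extensibility can take is a registry that maps names to activation implementations, so a GELU can be added alongside ReLU without touching call sites. This is a hypothetical sketch — `register_activation` and `make_activation` are illustrative names, not part of TinyTorch, and the GELU shown uses the common tanh approximation:

```python
import numpy as np

# Hypothetical registry: illustrative names, not part of TinyTorch.
_ACTIVATIONS = {}

def register_activation(name):
    """Decorator registering an activation class under a string key."""
    def wrap(cls):
        _ACTIVATIONS[name] = cls
        return cls
    return wrap

@register_activation("relu")
class ReLU:
    def __call__(self, x):
        return np.maximum(0, x)

@register_activation("gelu")
class GELU:
    # tanh approximation of GELU used by several frameworks
    def __call__(self, x):
        return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def make_activation(name):
    # Callers select activations by name; new ones plug in via the decorator
    return _ACTIVATIONS[name]()

act = make_activation("gelu")
print(act(np.array([-1.0, 0.0, 1.0])))  # ≈ [-0.159, 0.0, 0.841]
```

The trade-off the question asks about shows up directly here: the registry gives flexibility, while established entries like `relu` can still be swapped for fused or hardware-specific kernels behind the same name.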
Production ML systems must support both established and experimental activations while maintaining backward compatibility and performance. - -**Reflection Question**: Design an extensible activation function system that can efficiently support both current standards (ReLU, Softmax) and future experimental activations. How would you balance the need for optimal performance of established functions with the flexibility to add new activations? Consider the challenges of maintaining API compatibility, performance benchmarking, and automatic differentiation support across diverse activation functions. - -Think about: plugin architectures, performance profiling systems, automatic differentiation integration, and backward compatibility strategies. - -*Target length: 150-300 words* -""" - -# %% nbgrader={"grade": true, "grade_id": "ml-systems-evolution", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} -""" -YOUR REFLECTION ON ACTIVATION FUNCTION EVOLUTION AND SYSTEM DESIGN: - -TODO: Replace this text with your thoughtful response about extensible activation system design. - -Consider addressing: -- How would you design a plugin architecture for new activation functions? -- What strategies would you use to maintain performance for established activations while supporting experimentation? -- How would you handle automatic differentiation for diverse activation types? -- What role would performance benchmarking and profiling play in your system design? -- How would you ensure backward compatibility while enabling innovation? - -Write a system design analysis connecting your activation foundation to framework evolution challenges. 
- -GRADING RUBRIC (Instructor Use): -- Designs practical extensible architecture for activation functions (3 points) -- Addresses performance vs flexibility trade-offs appropriately (3 points) -- Shows understanding of automatic differentiation integration challenges (2 points) -- Demonstrates systems thinking about framework evolution and compatibility (2 points) -- Clear design reasoning with innovation insights (bonus points for forward-thinking approaches) -""" - -### BEGIN SOLUTION -# Student response area - instructor will replace this section during grading setup -# This is a manually graded question requiring understanding of extensible system design -# Students should demonstrate knowledge of framework architecture and evolution challenges -### END SOLUTION - -# %% [markdown] -""" -## 🎯 MODULE SUMMARY: Essential Activations - -Congratulations! You've successfully implemented the two most important activation functions in modern deep learning: - -## What You've Built -- **ReLU Activation**: The foundation of deep learning that enabled training very deep networks -- **Softmax Activation**: The probability distribution creator essential for classification -- **Numerical Stability**: Proper implementation techniques that prevent overflow and underflow -- **Performance Awareness**: Understanding of computational trade-offs between different activations -- **Production Insight**: Connection to real-world optimization and stability challenges - -## Key Learning Outcomes -- **Understanding**: Why these two activations dominate modern architectures -- **Implementation**: Built numerically stable activation functions from scratch -- **Systems thinking**: Connecting computational efficiency to architecture design decisions -- **Real-world connection**: Understanding how activation choice affects system performance -- **Foundation building**: Prepared for implementing any activation function - -## Mathematical Foundations Mastered -- **ReLU Mathematics**: f(x) = 
max(0, x) and its gradient properties -- **Softmax Mathematics**: Numerically stable probability distribution computation -- **Gradient Flow**: How different activations affect training dynamics -- **Numerical Stability**: Techniques for preventing overflow and maintaining precision - -## Professional Skills Developed -- **Performance Analysis**: Understanding computational complexity of different activations -- **Numerical Programming**: Implementing mathematically stable algorithms -- **System Design**: Considering hardware and performance implications -- **Error Handling**: Graceful handling of edge cases and extreme values - -## Ready for Advanced Applications -Your activation implementations now enable: -- **Hidden Layer Processing**: ReLU for nonlinear transformations -- **Classification**: Softmax for probability-based outputs -- **Attention Mechanisms**: Softmax for attention weight computation -- **Deep Networks**: ReLU enabling training of very deep architectures - -## Connection to Real ML Systems -Your implementations mirror production systems: -- **PyTorch**: `torch.nn.ReLU()` and `torch.nn.Softmax()` implement identical mathematics -- **TensorFlow**: `tf.nn.relu()` and `tf.nn.softmax()` follow the same principles -- **Hardware Acceleration**: Modern GPUs have specialized kernels for these exact operations -- **Industry Standard**: Every major ML framework optimizes these specific activations - -## The Power of Strategic Simplicity -You've learned that effective systems focus on essentials: -- **ReLU's Simplicity**: Revolutionary because it's computationally trivial yet mathematically powerful -- **Softmax's Precision**: Complex implementation required for mathematically correct probability distributions -- **Strategic Focus**: Understanding 2 essential functions deeply vs 10 functions superficially -- **Real-World Impact**: These functions power 90%+ of production deep learning systems - -## What's Next -Your activation implementations are the 
foundation for:
-- **Layers**: Building neural network components that use these activations
-- **Networks**: Composing layers with appropriate activations for different tasks
-- **Training**: Optimizing networks where activation choice determines success
-- **Advanced Architectures**: Modern systems that depend on these fundamental building blocks
-
-**Next Module**: Layers - building the neural network components that combine linear transformations with your activations!
-
-You've built the nonlinear intelligence that makes neural networks powerful. Now let's combine these activations with linear transformations to create the building blocks of any neural architecture!
-"""
\ No newline at end of file
diff --git a/modules_old/02_activations/module.yaml b/modules_old/02_activations/module.yaml
deleted file mode 100644
index 7213c0a5..00000000
--- a/modules_old/02_activations/module.yaml
+++ /dev/null
@@ -1,21 +0,0 @@
-components:
-- ReLU
-- Sigmoid
-- Tanh
-- Softmax
-dependencies:
-  enables:
-  - layers
-  - networks
-  prerequisites:
-  - tensor
-description: Neural network activation functions (ReLU, Sigmoid, Tanh, Softmax)
-difficulty: "\u2B50\u2B50"
-exports_to: tinytorch.core.activations
-files:
-  dev_file: activations_dev.py
-  readme: README.md
-  tests: inline
-name: activations
-time_estimate: 3-4 hours
-title: Activation Functions
diff --git a/modules_old/03_layers/README.md b/modules_old/03_layers/README.md
deleted file mode 100644
index 96c700f0..00000000
--- a/modules_old/03_layers/README.md
+++ /dev/null
@@ -1,208 +0,0 @@
-# 🔥 Module: Layers
-
-## 📊 Module Info
-- **Difficulty**: ⭐⭐ Intermediate
-- **Time Estimate**: 4-5 hours
-- **Prerequisites**: Tensor, Activations modules
-- **Next Steps**: Loss Functions module
-
-Build the fundamental transformations that compose into neural networks. This module teaches you that layers are simply functions that transform tensors, and neural networks are just sophisticated function composition using these building blocks.
-
-## 🎯 Learning Objectives
-
-By the end of this module, you will be able to:
-
-- **Understand layers as mathematical functions**: Recognize that layers transform tensors through well-defined mathematical operations
-- **Implement Linear layers + Module base + Flatten**: Complete neural network building blocks
-- **Integrate activation functions**: Combine linear layers with nonlinear activations to enable complex pattern learning
-- **Compose simple building blocks**: Chain layers together to create complete neural network architectures
-- **Debug layer implementations**: Use shape analysis and mathematical properties to verify correct implementation
-
-## 🧠 Build → Use → Reflect
-
-This module follows TinyTorch's **Build → Use → Reflect** framework:
-
-1. **Build**: Implement Linear layers, Module base class, and Flatten operation
-2. **Use**: Build complete neural networks with parameter tracking
-3. **Reflect**: Understand how Module base enables automatic parameter management
-
-## 📚 What You'll Build
-
-### 🎯 **COMPLETE BUILDING BLOCKS: Everything You Need**
-```python
-# Linear layer: fundamental building block
-class MLP(Module):                    # Module base provides parameter tracking!
-    def __init__(self):
-        super().__init__()
-        self.fc1 = Linear(784, 128)   # Linear transformation
-        self.fc2 = Linear(128, 10)    # Output layer
-
-    def forward(self, x):
-        x = flatten(x, start_dim=1)   # Flatten: 2D images → 1D vectors
-        x = self.fc1(x)               # Linear: matrix multiply + bias
-        x = relu(x)                   # Activation (from Module 03)
-        return self.fc2(x)            # Final prediction
-
-# Automatic parameter collection!
-model = MLP()
-params = model.parameters()           # Gets all Linear layer weights/biases automatically!
-optimizer = SGD(params)               # Ready for training!
-``` - -### Linear Layer (renamed from Dense) -- **Mathematical foundation**: Linear transformation `y = Wx + b` -- **Weight initialization**: Xavier/Glorot uniform initialization for stable gradients -- **Bias handling**: Optional bias terms for translation invariance -- **Shape management**: Automatic handling of batch dimensions and matrix operations - -### Module Base Class - **GAME CHANGER** -- **Automatic parameter tracking**: Collects all trainable weights recursively -- **Nested module support**: Handles complex architectures automatically -- **Clean interface**: Standard `forward()` method for all layers -- **Production pattern**: Same design as PyTorch nn.Module - -### Flatten Operation - **ESSENTIAL FOR VISION** -- **Shape transformation**: Convert 2D/3D tensors to 1D for Linear layers -- **Batch preservation**: Keeps batch dimension, flattens the rest -- **Vision pipeline**: Connect CNNs to fully-connected layers -- **Memory efficient**: View operation, no data copying - -## 🚀 Getting Started - -### Prerequisites -Ensure you have completed the foundational modules: - -```bash -# Activate TinyTorch environment -source bin/activate-tinytorch.sh - -# Verify prerequisite modules -tito test --module tensor -tito test --module activations -``` - -### Development Workflow -1. **Open the development file**: `modules/source/04_layers/layers_dev.py` -2. **Implement Linear layer**: Matrix multiplication + bias (`y = Wx + b`) -3. **Build Module base class**: Automatic parameter collection infrastructure -4. **Add Flatten operation**: Essential for connecting CNNs to Linear layers -5. **Build complete networks**: Use Module base to create complex architectures -6. 
**Export and verify**: `tito module complete 04_layers` (includes testing) - -## 🧪 Testing Your Implementation - -### Comprehensive Test Suite -Run the full test suite to verify mathematical correctness: - -```bash -# TinyTorch CLI (recommended) -tito test --module layers - -# Direct pytest execution -python -m pytest tests/ -k layers -v -``` - -### Test Coverage Areas -- ✅ **Layer Functionality**: Verify Dense layers perform correct linear transformations -- ✅ **Weight Initialization**: Ensure proper weight initialization for training stability -- ✅ **Shape Preservation**: Confirm layers handle batch dimensions correctly -- ✅ **Activation Integration**: Test seamless combination with activation functions -- ✅ **Network Composition**: Verify layers can be chained into complete networks - -### Inline Testing & Development -The module includes educational feedback during development: -```python -# Example inline test output -🔬 Unit Test: Dense layer functionality... -✅ Dense layer computes y = Wx + b correctly -✅ Weight initialization within expected range -✅ Output shape matches expected dimensions -📈 Progress: Dense Layer ✓ - -# Integration testing -🔬 Unit Test: Layer composition... 
-✅ Multiple layers chain correctly -✅ Activations integrate seamlessly -📈 Progress: Layer Composition ✓ -``` - -### Manual Testing Examples -```python -from tinytorch.core.tensor import Tensor -from layers_dev import Dense -from activations_dev import ReLU - -# Test basic layer functionality -layer = Dense(input_size=3, output_size=2) -x = Tensor([[1.0, 2.0, 3.0]]) -y = layer(x) -print(f"Input shape: {x.shape}, Output shape: {y.shape}") - -# Test layer composition -layer1 = Dense(3, 4) -layer2 = Dense(4, 2) -relu = ReLU() - -# Forward pass -h1 = relu(layer1(x)) -output = layer2(h1) -print(f"Final output: {output.data}") -``` - -## 🎯 Key Concepts - -### Real-World Applications -- **Computer Vision**: Dense layers process flattened image features in CNNs (like VGG, ResNet final layers) -- **Natural Language Processing**: Dense layers transform word embeddings in transformers and RNNs -- **Recommendation Systems**: Dense layers combine user and item features for preference prediction -- **Scientific Computing**: Dense layers approximate complex functions in physics simulations and engineering - -### Mathematical Foundations -- **Linear Transformation**: `y = Wx + b` where W is the weight matrix and b is the bias vector -- **Matrix Multiplication**: Efficient batch processing through vectorized operations -- **Weight Initialization**: Xavier/Glorot initialization prevents vanishing/exploding gradients -- **Function Composition**: Networks as nested function calls: `f3(f2(f1(x)))` - -### Neural Network Building Blocks -- **Modularity**: Layers as reusable components that can be combined in different ways -- **Standardized Interface**: All layers follow the same input/output pattern for easy composition -- **Shape Consistency**: Automatic handling of batch dimensions and shape transformations -- **Nonlinearity**: Activation functions between layers enable learning of complex patterns - -### Implementation Patterns -- **Class-based Design**: Layers as objects with state 
(weights) and behavior (forward pass) -- **Initialization Strategy**: Proper weight initialization for stable training dynamics -- **Error Handling**: Graceful handling of shape mismatches and invalid inputs -- **Testing Philosophy**: Comprehensive testing of mathematical properties and edge cases - -## 🎉 Ready to Build? - -You're about to build the fundamental building blocks that power every neural network! Dense layers might seem simple, but they're the workhorses of deep learning—from the final layers of image classifiers to the core components of language models. - -Understanding how these simple linear transformations compose into complex intelligence is one of the most beautiful insights in machine learning. Take your time, understand the mathematics, and enjoy building the foundation of artificial intelligence! - -```{grid} 3 -:gutter: 3 -:margin: 2 - -{grid-item-card} 🚀 Launch Builder -:link: https://mybinder.org/v2/gh/VJProductions/TinyTorch/main?filepath=modules/source/04_layers/layers_dev.py -:class-title: text-center -:class-body: text-center - -Interactive development environment - -{grid-item-card} 📓 Open in Colab -:link: https://colab.research.google.com/github/VJProductions/TinyTorch/blob/main/modules/source/04_layers/layers_dev.ipynb -:class-title: text-center -:class-body: text-center - -Google Colab notebook - -{grid-item-card} 👀 View Source -:link: https://github.com/VJProductions/TinyTorch/blob/main/modules/source/04_layers/layers_dev.py -:class-title: text-center -:class-body: text-center - -Browse the code on GitHub -``` \ No newline at end of file diff --git a/modules_old/03_layers/layers_dev.ipynb b/modules_old/03_layers/layers_dev.ipynb deleted file mode 100644 index e1a88598..00000000 --- a/modules_old/03_layers/layers_dev.ipynb +++ /dev/null @@ -1,1401 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "6cd42919", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "# Layers - Neural Network Building Blocks and 
Composition Patterns\n", - "\n", - "Welcome to the Layers module! You'll build the fundamental components that stack together to form any neural network architecture, from simple perceptrons to transformers.\n", - "\n", - "## Learning Goals\n", - "- Systems understanding: How layer composition creates complex function approximators and why stacking enables deep learning\n", - "- Core implementation skill: Build matrix multiplication and Dense layers with proper parameter management\n", - "- Pattern recognition: Understand how different layer types solve different computational problems\n", - "- Framework connection: See how your layer implementations mirror PyTorch's nn.Module design patterns\n", - "- Performance insight: Learn why layer computation order and memory layout determine training speed\n", - "\n", - "## Build → Use → Reflect\n", - "1. **Build**: Matrix multiplication primitives and Dense layers with parameter initialization strategies\n", - "2. **Use**: Compose layers into multi-layer networks and observe how data transforms through the stack\n", - "3. 
**Reflect**: Why does layer depth enable more complex functions, and when does it hurt performance?\n", - "\n", - "## What You'll Achieve\n", - "By the end of this module, you'll understand:\n", - "- Deep technical understanding of how matrix operations enable neural networks to learn arbitrary functions\n", - "- Practical capability to build and compose layers into complex architectures\n", - "- Systems insight into why layer composition is the fundamental pattern for scalable ML systems\n", - "- Performance consideration of how layer size and depth affect memory usage and computational cost\n", - "- Connection to production ML systems and how frameworks optimize layer execution for different hardware\n", - "\n", - "## Systems Reality Check\n", - "💡 **Production Context**: PyTorch's nn.Linear uses optimized BLAS operations and can automatically select GPU vs CPU execution based on data size\n", - "⚡ **Performance Note**: Large matrix multiplications can be memory-bound rather than compute-bound - understanding this shapes how production systems optimize layer execution" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "921f1b43", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "layers-imports", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "#| default_exp core.layers\n", - "\n", - "#| export\n", - "import numpy as np\n", - "import sys\n", - "import os\n", - "from typing import Union, Tuple, Optional, Any\n", - "\n", - "# Import our building blocks - try package first, then local modules\n", - "try:\n", - " from tinytorch.core.tensor import Tensor, Parameter\n", - "except ImportError:\n", - " # For development, import from local modules\n", - " sys.path.append(os.path.join(os.path.dirname(__file__), '..', '02_tensor'))\n", - " from tensor_dev import Tensor, Parameter" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d342e264", - 
"metadata": { - "nbgrader": { - "grade": false, - "grade_id": "layers-setup", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "print(\"🔥 TinyTorch Layers Module\")\n", - "print(f\"NumPy version: {np.__version__}\")\n", - "print(f\"Python version: {sys.version_info.major}.{sys.version_info.minor}\")\n", - "print(\"Ready to build neural network layers!\")" - ] - }, - { - "cell_type": "markdown", - "id": "37720590", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Module Base Class - Neural Network Foundation\n", - "\n", - "Before building specific layers like Dense and Conv2d, we need a base class that handles parameter management and provides a clean interface. This is the foundation that makes neural networks composable and easy to use.\n", - "\n", - "### Why We Need a Module Base Class\n", - "\n", - "🏗️ **Organization**: Automatic parameter collection across all layers \n", - "🔄 **Composition**: Modules can contain other modules (networks of networks) \n", - "🎯 **Clean API**: Enable `model(input)` instead of `model.forward(input)` \n", - "📦 **PyTorch Compatibility**: Same patterns as `torch.nn.Module` \n", - "\n", - "Let's build the foundation that will make all our neural network code clean and powerful:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9c167643", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "module-base-class", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class Module:\n", - " \"\"\"\n", - " Base class for all neural network modules.\n", - " \n", - " Provides automatic parameter collection, forward pass management,\n", - " and clean composition patterns. 
All layers (Dense, Conv2d, etc.)\n", - " inherit from this class.\n", - " \n", - " Key Features:\n", - " - Automatic parameter registration when you assign Tensors with requires_grad=True\n", - " - Recursive parameter collection from sub-modules\n", - " - Clean __call__ interface: model(x) instead of model.forward(x)\n", - " - Extensible for custom layers\n", - " \n", - " Example Usage:\n", - " class MLP(Module):\n", - " def __init__(self):\n", - " super().__init__()\n", - " self.layer1 = Dense(784, 128) # Auto-registered!\n", - " self.layer2 = Dense(128, 10) # Auto-registered!\n", - " \n", - " def forward(self, x):\n", - " x = self.layer1(x)\n", - " return self.layer2(x)\n", - " \n", - " model = MLP()\n", - " params = model.parameters() # Gets all parameters automatically!\n", - " output = model(input) # Clean interface!\n", - " \"\"\"\n", - " \n", - " def __init__(self):\n", - " \"\"\"Initialize module with empty parameter and sub-module storage.\"\"\"\n", - " self._parameters = []\n", - " self._modules = []\n", - " \n", - " def __setattr__(self, name, value):\n", - " \"\"\"\n", - " Intercept attribute assignment to auto-register parameters and modules.\n", - " \n", - " When you do self.weight = Parameter(...), this automatically adds\n", - " the parameter to our collection for easy optimization.\n", - " \"\"\"\n", - " # Check if it's a tensor that needs gradients (a parameter)\n", - " if hasattr(value, 'requires_grad') and value.requires_grad:\n", - " self._parameters.append(value)\n", - " # Check if it's another Module (sub-module)\n", - " elif isinstance(value, Module):\n", - " self._modules.append(value)\n", - " \n", - " # Always call parent to actually set the attribute\n", - " super().__setattr__(name, value)\n", - " \n", - " def parameters(self):\n", - " \"\"\"\n", - " Recursively collect all parameters from this module and sub-modules.\n", - " \n", - " Returns:\n", - " List of all parameters (Tensors with requires_grad=True)\n", - " \n", - " This enables: 
optimizer = Adam(model.parameters())\n", - " \"\"\"\n", - " # Start with our own parameters\n", - " params = list(self._parameters)\n", - " \n", - " # Add parameters from sub-modules recursively\n", - " for module in self._modules:\n", - " params.extend(module.parameters())\n", - " \n", - " return params\n", - " \n", - " def __call__(self, *args, **kwargs):\n", - " \"\"\"\n", - " Makes modules callable: model(x) instead of model.forward(x).\n", - " \n", - " This is the magic that enables clean syntax like:\n", - " output = model(input)\n", - " instead of:\n", - " output = model.forward(input)\n", - " \"\"\"\n", - " return self.forward(*args, **kwargs)\n", - " \n", - " def forward(self, *args, **kwargs):\n", - " \"\"\"\n", - " Forward pass - must be implemented by subclasses.\n", - " \n", - " This is where the actual computation happens. Every layer\n", - " defines its own forward() method.\n", - " \"\"\"\n", - " raise NotImplementedError(\"Subclasses must implement forward()\")" - ] - }, - { - "cell_type": "markdown", - "id": "91f83f11", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## Where This Code Lives in the Final Package\n", - "\n", - "**Learning Side:** You work in modules/source/04_layers/layers_dev.py \n", - "**Building Side:** Code exports to tinytorch.core.layers\n", - "\n", - "```python\n", - "# Final package structure:\n", - "from tinytorch.core.layers import Dense, matmul # All layer types together!\n", - "from tinytorch.core.tensor import Tensor # The foundation\n", - "from tinytorch.core.activations import ReLU, Sigmoid # Nonlinearity\n", - "```\n", - "\n", - "**Why this matters:**\n", - "- **Learning:** Focused modules for deep understanding\n", - "- **Production:** Proper organization like PyTorch's torch.nn.Linear\n", - "- **Consistency:** All layer types live together in core.layers\n", - "- **Integration:** Works seamlessly with tensors and activations" - ] - }, - { - "cell_type": "markdown", - "id": "2d1cbf04", - "metadata": 
{ - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "# Matrix Multiplication - The Heart of Neural Networks\n", - "\n", - "Every neural network operation ultimately reduces to matrix multiplication. Let's build the foundation that powers everything from simple perceptrons to transformers.\n", - "\n", - "## Why Matrix Multiplication Matters\n", - "\n", - "🧠 **Neural Network Core**: Every layer applies: output = input @ weights + bias \n", - "⚡ **Parallel Processing**: Matrix ops utilize vectorized CPU instructions and GPU parallelism \n", - "🏗️ **Scalable Architecture**: Stacking matrix operations creates arbitrarily complex function approximators \n", - "📈 **Performance Critical**: 90%+ of neural network compute time is spent in matrix multiplication \n", - "\n", - "## Learning Objectives\n", - "By implementing matrix multiplication, you'll understand:\n", - "- How neural networks transform data through linear algebra\n", - "- Why matrix operations are the building blocks of all modern ML frameworks\n", - "- How proper implementation affects performance by orders of magnitude\n", - "- The connection between mathematical operations and computational efficiency" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "adb83e78", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "matmul-implementation", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "def matmul(a: Tensor, b: Tensor) -> Tensor:\n", - " \"\"\"\n", - " Matrix multiplication for tensors.\n", - " \n", - " Args:\n", - " a: Left tensor (shape: ..., m, k)\n", - " b: Right tensor (shape: ..., k, n)\n", - " \n", - " Returns:\n", - " Result tensor (shape: ..., m, n)\n", - " \n", - " TODO: Implement matrix multiplication using numpy's @ operator.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. 
Extract numpy arrays from both tensors using .data\n", - " 2. Perform matrix multiplication: result_data = a_data @ b_data\n", - " 3. Wrap result in a new Tensor and return\n", - " \n", - " LEARNING CONNECTIONS:\n", - " - This is the core operation in Dense layers: output = input @ weights\n", - " - PyTorch uses optimized BLAS libraries for this operation\n", - " - GPU implementations parallelize this across thousands of cores\n", - " - Understanding this operation is key to neural network performance\n", - " \n", - " EXAMPLE:\n", - " ```python\n", - " a = Tensor([[1, 2], [3, 4]]) # shape (2, 2)\n", - " b = Tensor([[5, 6], [7, 8]]) # shape (2, 2)\n", - " result = matmul(a, b)\n", - " # result.data = [[19, 22], [43, 50]]\n", - " ```\n", - " \n", - " IMPLEMENTATION HINTS:\n", - " - Use the @ operator for clean matrix multiplication\n", - " - Ensure you return a Tensor, not a numpy array\n", - " - The operation should work for any compatible matrix shapes\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Check if we're dealing with Variables (autograd) or plain Tensors\n", - " a_is_variable = hasattr(a, 'requires_grad') and hasattr(a, 'grad_fn')\n", - " b_is_variable = hasattr(b, 'requires_grad') and hasattr(b, 'grad_fn')\n", - " \n", - " # Extract numpy data appropriately\n", - " if a_is_variable:\n", - " a_data = a.data.data # Variable.data is a Tensor, so .data.data gets numpy array\n", - " else:\n", - " a_data = a.data # Tensor.data is numpy array directly\n", - " \n", - " if b_is_variable:\n", - " b_data = b.data.data\n", - " else:\n", - " b_data = b.data\n", - " \n", - " # Perform matrix multiplication\n", - " result_data = a_data @ b_data\n", - " \n", - " # If any input is a Variable, return Variable with gradient tracking\n", - " if a_is_variable or b_is_variable:\n", - " # Import Variable locally to avoid circular imports\n", - " if 'Variable' not in globals():\n", - " try:\n", - " from tinytorch.core.autograd import Variable\n", - " except ImportError:\n", 
- " from autograd_dev import Variable\n", - " \n", - " # Create gradient function for matrix multiplication\n", - " def grad_fn(grad_output):\n", - " # Matrix multiplication backward pass:\n", - " # If C = A @ B, then:\n", - " # dA = grad_output @ B^T\n", - " # dB = A^T @ grad_output\n", - " \n", - " if a_is_variable and a.requires_grad:\n", - " # Gradient w.r.t. A: grad_output @ B^T\n", - " grad_a_data = grad_output.data.data @ b_data.T\n", - " a.backward(Variable(grad_a_data))\n", - " \n", - " if b_is_variable and b.requires_grad:\n", - " # Gradient w.r.t. B: A^T @ grad_output \n", - " grad_b_data = a_data.T @ grad_output.data.data\n", - " b.backward(Variable(grad_b_data))\n", - " \n", - " # Determine if result should require gradients\n", - " requires_grad = (a_is_variable and a.requires_grad) or (b_is_variable and b.requires_grad)\n", - " \n", - " return Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn)\n", - " else:\n", - " # Both inputs are Tensors, return Tensor (backward compatible)\n", - " return Tensor(result_data)\n", - " ### END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "d7691910", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Testing Matrix Multiplication\n", - "\n", - "Let's verify our matrix multiplication works correctly with some test cases." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d10bd1ed", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test-matmul", - "locked": true, - "points": 2, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_matmul():\n", - " \"\"\"Test matrix multiplication implementation.\"\"\"\n", - " print(\"🧪 Testing Matrix Multiplication...\")\n", - " \n", - " # Test case 1: Simple 2x2 matrices\n", - " a = Tensor([[1, 2], [3, 4]])\n", - " b = Tensor([[5, 6], [7, 8]])\n", - " result = matmul(a, b)\n", - " expected = np.array([[19, 22], [43, 50]])\n", - " \n", - " assert np.allclose(result.data, expected), f\"Expected {expected}, got {result.data}\"\n", - " print(\"✅ 2x2 matrix multiplication\")\n", - " \n", - " # Test case 2: Non-square matrices\n", - " a = Tensor([[1, 2, 3], [4, 5, 6]]) # 2x3\n", - " b = Tensor([[7, 8], [9, 10], [11, 12]]) # 3x2\n", - " result = matmul(a, b)\n", - " expected = np.array([[58, 64], [139, 154]])\n", - " \n", - " assert np.allclose(result.data, expected), f\"Expected {expected}, got {result.data}\"\n", - " print(\"✅ Non-square matrix multiplication\")\n", - " \n", - " # Test case 3: Vector-matrix multiplication\n", - " a = Tensor([[1, 2, 3]]) # 1x3 (row vector)\n", - " b = Tensor([[4], [5], [6]]) # 3x1 (column vector)\n", - " result = matmul(a, b)\n", - " expected = np.array([[32]]) # 1*4 + 2*5 + 3*6 = 32\n", - " \n", - " assert np.allclose(result.data, expected), f\"Expected {expected}, got {result.data}\"\n", - " print(\"✅ Vector-matrix multiplication\")\n", - " \n", - " print(\"🎉 All matrix multiplication tests passed!\")\n", - "\n", - "test_matmul()" - ] - }, - { - "cell_type": "markdown", - "id": "7f512ed2", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "# Dense Layer - The Fundamental Neural Network Component\n", - "\n", - "Dense layers (also called Linear or Fully Connected layers) are the building 
blocks of neural networks. They apply the transformation: **output = input @ weights + bias**\n", - "\n", - "## Why Dense Layers Matter\n", - "\n", - "🧠 **Universal Function Approximators**: Dense layers stacked with nonlinear activations between them can approximate any continuous function \n", - "🔧 **Parameter Learning**: Weights and biases are learned through backpropagation \n", - "🏗️ **Modular Design**: Dense layers compose into complex architectures (MLPs, transformers, etc.) \n", - "⚡ **Computational Efficiency**: Matrix operations leverage optimized linear algebra libraries \n", - "\n", - "## Learning Objectives\n", - "By implementing Dense layers, you'll understand:\n", - "- How neural networks learn through adjustable parameters\n", - "- The mathematical foundation underlying all neural network layers\n", - "- Why proper parameter initialization is crucial for training success\n", - "- How layer composition enables complex function approximation" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b5b4e929", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "dense-implementation", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class Linear(Module):\n", - " \"\"\"\n", - " Linear (Fully Connected) Layer implementation.\n", - " \n", - " Applies the transformation: output = input @ weights + bias\n", - " \n", - " Inherits from Module for automatic parameter management and clean API.\n", - " This is PyTorch's nn.Linear equivalent with the same name for familiarity.\n", - " \n", - " Features:\n", - " - Automatic parameter registration (weights and bias)\n", - " - Clean call interface: layer(input) instead of layer.forward(input)\n", - " - Works with optimizers via model.parameters()\n", - " \"\"\"\n", - " \n", - " def __init__(self, input_size: int, output_size: int, use_bias: bool = True):\n", - " \"\"\"\n", - " Initialize Linear layer with random weights and optional 
bias.\n", - " \n", - " Args:\n", - " input_size: Number of input features\n", - " output_size: Number of output features \n", - " use_bias: Whether to include bias term\n", - " \n", - " TODO: Implement Linear layer initialization.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Store input_size and output_size as instance variables\n", - " 2. Initialize weights as Tensor with shape (input_size, output_size)\n", - " 3. Use small random values: np.random.randn(...) * 0.1\n", - " 4. Initialize bias as Tensor with shape (output_size,) if use_bias is True\n", - " 5. Set bias to None if use_bias is False\n", - " \n", - " LEARNING CONNECTIONS:\n", - " - Small random initialization breaks the symmetry between neurons so they learn different features\n", - " - Weight shape (input_size, output_size) enables matrix multiplication\n", - " - Bias allows shifting the output (like y-intercept in linear regression)\n", - " - PyTorch uses more sophisticated initialization (Xavier, Kaiming)\n", - " \n", - " IMPLEMENTATION HINTS:\n", - " - Use np.random.randn() for Gaussian random numbers\n", - " - Scale by 0.1 to keep initial values small\n", - " - Remember to wrap numpy arrays in Tensor()\n", - " - Store use_bias flag for forward pass logic\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " super().__init__() # Initialize Module base class\n", - " \n", - " self.input_size = input_size\n", - " self.output_size = output_size\n", - " self.use_bias = use_bias\n", - " \n", - " # Initialize weights with small random values using Parameter\n", - " # Shape: (input_size, output_size) for matrix multiplication\n", - " weight_data = np.random.randn(input_size, output_size) * 0.1\n", - " self.weights = Parameter(weight_data) # Auto-registers for optimization!\n", - " \n", - " # Initialize bias if requested\n", - " if use_bias:\n", - " bias_data = np.random.randn(output_size) * 0.1\n", - " self.bias = Parameter(bias_data) # Auto-registers for optimization!\n", - " else:\n", - " self.bias = None\n", - " ### END SOLUTION\n", - " 
\n", - " def forward(self, x: Union[Tensor, 'Variable']) -> Union[Tensor, 'Variable']:\n", - " \"\"\"\n", - " Forward pass through the Linear layer.\n", - " \n", - " Args:\n", - " x: Input tensor or Variable (shape: ..., input_size)\n", - " \n", - " Returns:\n", - " Output tensor or Variable (shape: ..., output_size)\n", - " Preserves Variable type for gradient tracking in training\n", - " \n", - " TODO: Implement autograd-aware forward pass: output = input @ weights + bias\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Perform matrix multiplication: output = matmul(x, self.weights)\n", - " 2. If bias exists, add it appropriately based on input type\n", - " 3. Preserve Variable type for gradient tracking if input is Variable\n", - " 4. Return result maintaining autograd capabilities\n", - " \n", - " AUTOGRAD CONSIDERATIONS:\n", - " - If x is Variable: weights and bias should also be Variables for training\n", - " - Preserve gradient tracking through the entire computation\n", - " - Enable backpropagation through this layer's parameters\n", - " - Handle mixed Tensor/Variable scenarios gracefully\n", - " \n", - " LEARNING CONNECTIONS:\n", - " - This is the core neural network transformation\n", - " - Matrix multiplication scales input features to output features \n", - " - Bias provides offset (like y-intercept in linear equations)\n", - " - Broadcasting handles different batch sizes automatically\n", - " - Autograd support enables automatic parameter optimization\n", - " \n", - " IMPLEMENTATION HINTS:\n", - " - Use the matmul function you implemented above (now autograd-aware)\n", - " - Handle bias addition based on input/output types\n", - " - Variables support + operator for gradient-tracked addition\n", - " - Check if self.bias is not None before adding\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Matrix multiplication: input @ weights (now autograd-aware)\n", - " output = matmul(x, self.weights)\n", - " \n", - " # Add bias if it exists\n", - " # 
The addition will preserve Variable type if output is Variable\n", - " if self.bias is not None:\n", - " # Check if we need Variable-aware addition\n", - " if hasattr(output, 'requires_grad'):\n", - " # output is a Variable, use Variable addition\n", - " if hasattr(self.bias, 'requires_grad'):\n", - " # bias is also Variable, direct addition works\n", - " output = output + self.bias\n", - " else:\n", - " # bias is Tensor, convert to Variable for addition\n", - " # Import Variable if not already available\n", - " if 'Variable' not in globals():\n", - " try:\n", - " from tinytorch.core.autograd import Variable\n", - " except ImportError:\n", - " from autograd_dev import Variable\n", - " \n", - " bias_var = Variable(self.bias.data, requires_grad=False)\n", - " output = output + bias_var\n", - " else:\n", - " # output is Tensor, use regular addition\n", - " output = output + self.bias\n", - " \n", - " return output\n", - " ### END SOLUTION\n", - "\n", - "# Backward compatibility alias\n", - "#| export \n", - "Dense = Linear" - ] - }, - { - "cell_type": "markdown", - "id": "df5cd843", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Testing Linear Layer\n", - "\n", - "Let's verify our Linear layer works correctly with comprehensive tests.\n", - "The tests use Dense for backward compatibility, but Dense is now an alias for Linear." 
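As a quick reference for what `forward` computes, here is the same transformation in plain NumPy. This is an illustrative sketch only — the shapes mirror the layer above, but the seed, generator, and sizes are arbitrary assumptions, not part of the module:

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed, purely for reproducibility
input_size, output_size, batch = 3, 2, 4

# Same parameter shapes as the layer: weights (input_size, output_size), bias (output_size,)
W = rng.standard_normal((input_size, output_size)) * 0.1
b = rng.standard_normal(output_size) * 0.1

x = rng.standard_normal((batch, input_size))
y = x @ W + b                            # output = input @ weights + bias
```

The bias `b` has shape (output_size,) while `x @ W` has shape (batch, output_size); NumPy broadcasting adds `b` to every row, which is why the layer handles any batch size without extra code.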
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "385374fa", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test-dense", - "locked": true, - "points": 3, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_dense_layer():\n", - " \"\"\"Test Dense layer implementation.\"\"\"\n", - " print(\"🧪 Testing Dense Layer...\")\n", - " \n", - " # Test case 1: Basic functionality\n", - " layer = Dense(input_size=3, output_size=2)\n", - " input_tensor = Tensor([[1.0, 2.0, 3.0]]) # Shape: (1, 3)\n", - " output = layer.forward(input_tensor)\n", - " \n", - " # Check output shape\n", - " assert output.shape == (1, 2), f\"Expected shape (1, 2), got {output.shape}\"\n", - " print(\"✅ Output shape correct\")\n", - " \n", - " # Test case 2: No bias\n", - " layer_no_bias = Dense(input_size=2, output_size=3, use_bias=False)\n", - " assert layer_no_bias.bias is None, \"Bias should be None when use_bias=False\"\n", - " print(\"✅ No bias option works\")\n", - " \n", - " # Test case 3: Multiple samples (batch processing)\n", - " batch_input = Tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]) # Shape: (3, 2)\n", - " layer_batch = Dense(input_size=2, output_size=2)\n", - " batch_output = layer_batch.forward(batch_input)\n", - " \n", - " assert batch_output.shape == (3, 2), f\"Expected shape (3, 2), got {batch_output.shape}\"\n", - " print(\"✅ Batch processing works\")\n", - " \n", - " # Test case 4: Callable interface\n", - " callable_output = layer_batch(batch_input)\n", - " assert np.allclose(callable_output.data, batch_output.data), \"Callable interface should match forward()\"\n", - " print(\"✅ Callable interface works\")\n", - " \n", - " # Test case 5: Parameter initialization\n", - " layer_init = Dense(input_size=10, output_size=5)\n", - " assert layer_init.weights.shape == (10, 5), f\"Expected weights shape (10, 5), got {layer_init.weights.shape}\"\n", - " assert layer_init.bias.shape == (5,), 
f\"Expected bias shape (5,), got {layer_init.bias.shape}\"\n", - " \n", - " # Check that weights are reasonably small (good initialization)\n", - " assert np.abs(layer_init.weights.data).mean() < 1.0, \"Weights should be small for good initialization\"\n", - " print(\"✅ Parameter initialization correct\")\n", - " \n", - " print(\"🎉 All Dense layer tests passed!\")\n", - "\n", - "test_dense_layer()" - ] - }, - { - "cell_type": "markdown", - "id": "7f9bb46b", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Testing Autograd Integration\n", - "\n", - "Now let's test that our Dense layer works correctly with Variables for gradient tracking." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "df791018", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test-dense-autograd", - "locked": true, - "points": 3, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_dense_layer_autograd():\n", - " \"\"\"Test Dense layer with autograd Variable support.\"\"\"\n", - " print(\"🧪 Testing Dense Layer Autograd Integration...\")\n", - " \n", - " try:\n", - " # Import Variable locally to handle import issues\n", - " try:\n", - " from tinytorch.core.autograd import Variable\n", - " except ImportError:\n", - " sys.path.append(os.path.join(os.path.dirname(__file__), '..', '09_autograd'))\n", - " from autograd_dev import Variable\n", - " \n", - " # Test case 1: Variable input with Tensor weights (inference mode)\n", - " layer = Dense(input_size=3, output_size=2)\n", - " variable_input = Variable([[1.0, 2.0, 3.0]], requires_grad=True)\n", - " output = layer.forward(variable_input)\n", - " \n", - " # Check that output is Variable and preserves gradient tracking\n", - " assert hasattr(output, 'requires_grad'), \"Output should be Variable with gradient tracking\"\n", - " assert output.shape == (1, 2), f\"Expected shape (1, 2), got {output.shape}\"\n", - " 
print(\"✅ Variable input preserves gradient tracking\")\n", - " \n", - " # Test case 2: Variable weights for training\n", - " # Convert weights and bias to Variables for training\n", - " layer_trainable = Dense(input_size=2, output_size=2)\n", - " layer_trainable.weights = Variable(layer_trainable.weights.data, requires_grad=True)\n", - " layer_trainable.bias = Variable(layer_trainable.bias.data, requires_grad=True)\n", - " \n", - " variable_input_2 = Variable([[1.0, 2.0]], requires_grad=True)\n", - " output_2 = layer_trainable.forward(variable_input_2)\n", - " \n", - " assert hasattr(output_2, 'requires_grad'), \"Output should support gradients\"\n", - " assert output_2.requires_grad, \"Output should require gradients when weights require gradients\"\n", - " print(\"✅ Variable weights enable training mode\")\n", - " \n", - " # Test case 3: Gradient flow through Dense layer\n", - " # Simple backward pass to check gradient computation\n", - " try:\n", - " # Create a simple loss (sum of outputs)\n", - " loss = Variable(np.sum(output_2.data.data))\n", - " loss.backward()\n", - " \n", - " # Check that gradients were computed\n", - " assert layer_trainable.weights.grad is not None, \"Weights should have gradients\"\n", - " assert layer_trainable.bias.grad is not None, \"Bias should have gradients\"\n", - " assert variable_input_2.grad is not None, \"Input should have gradients\"\n", - " print(\"✅ Gradient computation works\")\n", - " except Exception as e:\n", - " print(f\"⚠️ Gradient computation test skipped: {e}\")\n", - " print(\" (This is expected if full autograd integration isn't complete yet)\")\n", - " \n", - " # Test case 4: Mixed Tensor/Variable scenarios\n", - " tensor_input = Tensor([[1.0, 2.0, 3.0]])\n", - " variable_layer = Dense(input_size=3, output_size=2)\n", - " mixed_output = variable_layer.forward(tensor_input)\n", - " \n", - " assert isinstance(mixed_output, Tensor), \"Tensor input should produce Tensor output\"\n", - " print(\"✅ Mixed 
Tensor/Variable handling works\")\n", - " \n", - " # Test case 5: Batch processing with Variables\n", - " batch_variable_input = Variable([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], requires_grad=True)\n", - " batch_layer = Dense(input_size=2, output_size=2)\n", - " batch_variable_output = batch_layer.forward(batch_variable_input)\n", - " \n", - " assert batch_variable_output.shape == (3, 2), f\"Expected batch shape (3, 2), got {batch_variable_output.shape}\"\n", - " assert hasattr(batch_variable_output, 'requires_grad'), \"Batch output should support gradients\"\n", - " print(\"✅ Batch processing with Variables works\")\n", - " \n", - " print(\"🎉 All Dense layer autograd tests passed!\")\n", - " \n", - " except ImportError as e:\n", - " print(f\"⚠️ Autograd tests skipped: {e}\")\n", - " print(\" (Variable class not available - this is expected during development)\")\n", - " except Exception as e:\n", - " print(f\"❌ Autograd test failed: {e}\")\n", - " print(\" (This indicates an implementation issue that needs fixing)\")\n", - "\n", - "test_dense_layer_autograd()" - ] - }, - { - "cell_type": "markdown", - "id": "f047fbc8", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "# Systems Analysis: Memory and Performance Characteristics\n", - "\n", - "Let's analyze the memory usage and computational complexity of our layer implementations.\n", - "\n", - "## Memory Analysis\n", - "- **Dense Layer Storage**: input_size × output_size weights + output_size bias terms\n", - "- **Forward Pass Memory**: Input tensor + weight tensor + output tensor (temporary storage)\n", - "- **Scaling Behavior**: Memory grows quadratically with layer size\n", - "\n", - "## Computational Complexity\n", - "- **Matrix Multiplication**: O(batch_size × input_size × output_size)\n", - "- **Bias Addition**: O(batch_size × output_size)\n", - "- **Total**: Dominated by matrix multiplication for large layers\n", - "\n", - "## Production Insights\n", - "In production ML 
systems:\n", - "- **Memory Management**: PyTorch uses memory pools to avoid frequent allocation/deallocation\n", - "- **Compute Optimization**: BLAS libraries (MKL, OpenBLAS) optimize matrix operations for specific hardware\n", - "- **GPU Acceleration**: CUDA kernels parallelize matrix operations across thousands of cores\n", - "- **Mixed Precision**: Using float16 instead of float32 can halve memory usage with minimal accuracy loss" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b7825066", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "memory-analysis", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def analyze_layer_memory():\n", - " \"\"\"Analyze memory usage of different layer sizes.\"\"\"\n", - " print(\"📊 Layer Memory Analysis\")\n", - " print(\"=\" * 40)\n", - " \n", - " layer_sizes = [(10, 10), (100, 100), (1000, 1000), (784, 128), (128, 10)]\n", - " \n", - " for input_size, output_size in layer_sizes:\n", - " # Calculate parameter count\n", - " weight_params = input_size * output_size\n", - " bias_params = output_size\n", - " total_params = weight_params + bias_params\n", - " \n", - " # Calculate memory usage (assuming float32 = 4 bytes)\n", - " memory_mb = total_params * 4 / (1024 * 1024)\n", - " \n", - " print(f\" {input_size:4d} → {output_size:4d}: {total_params:,} params, {memory_mb:.3f} MB\")\n", - " \n", - " print(\"\\n🔍 Key Insights:\")\n", - " print(\" • Memory grows quadratically with layer width\")\n", - " print(\" • Large layers (1000×1000) use significant memory\")\n", - " print(\" • Modern networks balance width vs depth for efficiency\")\n", - "\n", - "analyze_layer_memory()" - ] - }, - { - "cell_type": "markdown", - "id": "8c04fb2c", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "# ML Systems Thinking: Interactive Questions\n", - "\n", - "Let's explore the deeper implications of our 
layer implementations." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1ce66a00", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "systems-thinking", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def explore_layer_scaling():\n", - " \"\"\"Explore how layer operations scale with size.\"\"\"\n", - " print(\"🤔 Scaling Analysis: Matrix Multiplication Performance\")\n", - " print(\"=\" * 55)\n", - " \n", - " sizes = [64, 128, 256, 512]\n", - " \n", - " for size in sizes:\n", - " # Estimate FLOPs for square matrix multiplication\n", - " flops = 2 * size * size * size # a multiply-add counts as 2 FLOPs\n", - " \n", - " # Estimate memory traffic (reading A, B, writing C)\n", - " memory_ops = 3 * size * size # Elements read/written\n", - " memory_mb = memory_ops * 4 / (1024 * 1024) # float32 = 4 bytes\n", - " \n", - " print(f\" Size {size:3d}×{size:3d}: {flops/1e6:.1f} MFLOPs, {memory_mb:.2f} MB transfers\")\n", - " \n", - " print(\"\\n💡 Performance Insights:\")\n", - " print(\" • FLOPs grow cubically (O(n³)) with matrix size\")\n", - " print(\" • Memory traffic grows quadratically (O(n²))\")\n", - " print(\" • Arithmetic intensity therefore rises with n: small matmuls are memory-bound, large ones compute-bound\")\n", - " print(\" • This is why GPUs excel: high memory bandwidth + parallel compute\")\n", - "\n", - "explore_layer_scaling()" - ] - }, - { - "cell_type": "markdown", - "id": "2de5338c", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🤔 ML Systems Thinking: Interactive Questions\n", - "\n", - "Now that you've implemented the core components, let's think about their implications for ML systems:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b9dc3f4e", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "question-1", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "# 
Question 1: Memory vs Computation Trade-offs\n", - "\"\"\"\n", - "🤔 **Question 1: Memory vs Computation Analysis**\n", - "\n", - "You're designing a neural network for deployment on a mobile device with limited memory (1GB RAM) but decent compute power.\n", - "\n", - "You have two architecture options:\n", - "A) Wide network: 784 → 2048 → 2048 → 10 (3 layers, wide)\n", - "B) Deep network: 784 → 256 → 256 → 256 → 256 → 10 (5 layers, narrow)\n", - "\n", - "Calculate the memory requirements for each option and explain which you'd choose for mobile deployment and why.\n", - "\n", - "Consider:\n", - "- Parameter storage requirements\n", - "- Intermediate activation storage during forward pass\n", - "- Training vs inference memory requirements\n", - "- How your choice affects model capacity and accuracy\n", - "\"\"\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "13d2171d", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "question-2", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "# Question 2: Performance Optimization\n", - "\"\"\"\n", - "🤔 **Question 2: Production Performance Optimization**\n", - "\n", - "Your Dense layer implementation works correctly, but you notice it's slower than PyTorch's nn.Linear on the same hardware.\n", - "\n", - "Investigate and explain:\n", - "1. Why might our implementation be slower? (Hint: think about underlying linear algebra libraries)\n", - "2. What optimization techniques do production frameworks use?\n", - "3. How would you modify our implementation to approach production performance?\n", - "4. 
When might our simple implementation actually be preferable?\n", - "\n", - "Research areas to consider:\n", - "- BLAS (Basic Linear Algebra Subprograms) libraries\n", - "- Memory layout and cache efficiency\n", - "- Vectorization and SIMD instructions\n", - "- GPU kernel optimization\n", - "\"\"\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fc136b4d", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "question-3", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "# Question 3: Scaling and Architecture Design\n", - "\"\"\"\n", - "🤔 **Question 3: Systems Architecture Scaling**\n", - "\n", - "Modern transformer models like GPT-3 have billions of parameters, primarily in Dense layers.\n", - "\n", - "Analyze the scaling challenges:\n", - "1. How does memory requirement scale with model size? Calculate the memory needed for a 175B parameter model.\n", - "2. What are the computational bottlenecks during training vs inference?\n", - "3. How do systems like distributed training address these scaling challenges?\n", - "4. Why do large models use techniques like gradient checkpointing and model parallelism?\n", - "\n", - "Systems considerations:\n", - "- Memory hierarchy (L1/L2/L3 cache, RAM, storage)\n", - "- Network bandwidth for distributed training\n", - "- GPU memory constraints and model sharding\n", - "- Inference optimization for production serving\n", - "\"\"\"" - ] - }, - { - "cell_type": "markdown", - "id": "264e2bd3", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "# Comprehensive Testing and Integration\n", - "\n", - "Let's run a comprehensive test suite to verify all our implementations work correctly together." 
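Before the test suite, the back-of-envelope arithmetic that Questions 1 and 3 above ask for can be sketched directly. A minimal example (sizes taken from Question 1; float32 = 4 bytes; counts parameters only, not activations or optimizer state; `param_count` is an illustrative helper, not part of this module's API):

```python
def param_count(layer_dims):
    """Total weights + biases for a chain of fully connected layers."""
    return sum(d_in * d_out + d_out for d_in, d_out in zip(layer_dims, layer_dims[1:]))

wide = param_count([784, 2048, 2048, 10])           # option A: 3 wide layers
deep = param_count([784, 256, 256, 256, 256, 10])   # option B: 5 narrow layers
print(f"Wide: {wide:,} params ~ {wide * 4 / 1e6:.1f} MB")   # ~23.3 MB
print(f"Deep: {deep:,} params ~ {deep * 4 / 1e6:.1f} MB")   # ~1.6 MB

# Question 3 scale: raw weight storage alone for a 175B-parameter model
print(f"175B params: {175e9 * 4 / 1e9:.0f} GB (float32), {175e9 * 2 / 1e9:.0f} GB (float16)")
```

Note how the wide option costs over an order of magnitude more parameter memory than the deep one before training state (gradients, optimizer moments, activations) is even counted.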
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e45d1bfe", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "comprehensive-tests", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def run_comprehensive_tests():\n", - " \"\"\"Run comprehensive tests of all layer functionality.\"\"\"\n", - " print(\"🔬 Comprehensive Layer Testing Suite\")\n", - " print(\"=\" * 45)\n", - " \n", - " # Test 1: Matrix multiplication edge cases\n", - " print(\"\\n1. Matrix Multiplication Edge Cases:\")\n", - " \n", - " # Single element\n", - " a = Tensor([[5]])\n", - " b = Tensor([[3]])\n", - " result = matmul(a, b)\n", - " assert result.data[0, 0] == 15, \"Single element multiplication failed\"\n", - " print(\" ✅ Single element multiplication\")\n", - " \n", - " # Identity matrix\n", - " identity = Tensor([[1, 0], [0, 1]])\n", - " test_matrix = Tensor([[2, 3], [4, 5]])\n", - " result = matmul(test_matrix, identity)\n", - " assert np.allclose(result.data, test_matrix.data), \"Identity multiplication failed\"\n", - " print(\" ✅ Identity matrix multiplication\")\n", - " \n", - " # Test 2: Dense layer composition\n", - " print(\"\\n2. Dense Layer Composition:\")\n", - " \n", - " # Create a simple 2-layer network\n", - " layer1 = Dense(4, 3)\n", - " layer2 = Dense(3, 2)\n", - " \n", - " # Test data flow\n", - " input_data = Tensor([[1, 2, 3, 4]])\n", - " hidden = layer1(input_data)\n", - " output = layer2(hidden)\n", - " \n", - " assert output.shape == (1, 2), f\"Expected final output shape (1, 2), got {output.shape}\"\n", - " print(\" ✅ Multi-layer composition\")\n", - " \n", - " # Test 3: Batch processing\n", - " print(\"\\n3. 
Batch Processing:\")\n", - " \n", - " batch_size = 10\n", - " batch_input = Tensor(np.random.randn(batch_size, 4))\n", - " batch_hidden = layer1(batch_input)\n", - " batch_output = layer2(batch_hidden)\n", - " \n", - " assert batch_output.shape == (batch_size, 2), f\"Expected batch output shape ({batch_size}, 2), got {batch_output.shape}\"\n", - " print(\" ✅ Batch processing\")\n", - " \n", - " # Test 4: Parameter access and modification\n", - " print(\"\\n4. Parameter Management:\")\n", - " \n", - " layer = Dense(5, 3)\n", - " original_weights = layer.weights.data.copy()\n", - " \n", - " # Simulate parameter update\n", - " layer.weights = Tensor(original_weights + 0.1)\n", - " \n", - " assert not np.allclose(layer.weights.data, original_weights), \"Parameter update failed\"\n", - " print(\" ✅ Parameter modification\")\n", - " \n", - " print(\"\\n🎉 All comprehensive tests passed!\")\n", - " print(\" Your layer implementations are ready for neural network construction!\")\n", - "\n", - "run_comprehensive_tests()" - ] - }, - { - "cell_type": "markdown", - "id": "6b9bc103", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Autograd Integration Demo\n", - "\n", - "Let's demonstrate how the Dense layer now works seamlessly with autograd Variables." 
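Gradient claims like the ones demonstrated below can always be cross-checked numerically, without any autograd machinery. A minimal NumPy finite-difference sketch (independent of the Variable class; names are illustrative):

```python
import numpy as np

# For a linear layer y = x @ W + b, the gradient of sum(y) w.r.t. W[i, j]
# is x[0, i], independent of the output column j.
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 3))
W = rng.standard_normal((3, 2))
b = rng.standard_normal(2)

loss = lambda W_: (x @ W_ + b).sum()

analytic = np.repeat(x.T, 2, axis=1)   # (3, 2): every column holds x[0, :]
numeric = np.zeros_like(W)
eps = 1e-6
for i in range(3):
    for j in range(2):
        Wp = W.copy()
        Wp[i, j] += eps
        numeric[i, j] = (loss(Wp) - loss(W)) / eps

print(np.allclose(analytic, numeric, atol=1e-4))  # True
```

This kind of check is how autograd implementations (including the one this demo imports) are usually validated in practice.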
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f9d3d3c8", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "autograd-demo", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def demonstrate_autograd_integration():\n", - " \"\"\"Demonstrate Dense layer working with autograd Variables.\"\"\"\n", - " print(\"🔥 Dense Layer Autograd Integration Demo\")\n", - " print(\"=\" * 50)\n", - " \n", - " try:\n", - " # Import Variable\n", - " try:\n", - " from tinytorch.core.autograd import Variable\n", - " except ImportError:\n", - " import sys, os # needed for the fallback path; not imported at the notebook top\n", - " sys.path.append(os.path.join(os.path.dirname(__file__), '..', '09_autograd'))\n", - " from autograd_dev import Variable\n", - " \n", - " print(\"\\n1. Creating trainable Dense layer:\")\n", - " layer = Dense(input_size=3, output_size=2)\n", - " \n", - " # Convert to trainable parameters (Variables)\n", - " layer.weights = Variable(layer.weights.data, requires_grad=True)\n", - " layer.bias = Variable(layer.bias.data, requires_grad=True)\n", - " \n", - " print(f\" Weights shape: {layer.weights.shape}\")\n", - " print(f\" Weights require grad: {layer.weights.requires_grad}\")\n", - " print(f\" Bias shape: {layer.bias.shape}\")\n", - " print(f\" Bias requires grad: {layer.bias.requires_grad}\")\n", - " \n", - " print(\"\\n2. Forward pass with Variable input:\")\n", - " x = Variable([[1.0, 2.0, 3.0]], requires_grad=True)\n", - " print(f\" Input: {x.data.data.tolist()}\")\n", - " \n", - " y = layer(x)\n", - " print(f\" Output shape: {y.shape}\")\n", - " print(f\" Output requires grad: {y.requires_grad}\")\n", - " print(f\" Output values: {y.data.data.tolist()}\")\n", - " \n", - " print(\"\\n3. 
Backward pass demonstration:\")\n", - " try:\n", - " # Simple loss: sum of all outputs\n", - " loss = Variable(np.sum(y.data.data))\n", - " print(f\" Loss: {loss.data.data}\")\n", - " \n", - " # Clear gradients\n", - " layer.weights.zero_grad()\n", - " layer.bias.zero_grad() \n", - " x.zero_grad()\n", - " \n", - " # Backward pass\n", - " loss.backward()\n", - " \n", - " print(f\" Weight gradients computed: {layer.weights.grad is not None}\")\n", - " print(f\" Bias gradients computed: {layer.bias.grad is not None}\")\n", - " print(f\" Input gradients computed: {x.grad is not None}\")\n", - " \n", - " if layer.weights.grad is not None:\n", - " print(f\" Weight gradient shape: {layer.weights.grad.shape}\")\n", - " if layer.bias.grad is not None:\n", - " print(f\" Bias gradient shape: {layer.bias.grad.shape}\")\n", - " \n", - " except Exception as e:\n", - " print(f\" ⚠️ Backward pass demo limited: {e}\")\n", - " \n", - " print(\"\\n4. Backward compatibility with Tensors:\")\n", - " tensor_input = Tensor([[1.0, 2.0, 3.0]])\n", - " tensor_layer = Dense(input_size=3, output_size=2)\n", - " tensor_output = tensor_layer(tensor_input)\n", - " \n", - " print(f\" Input type: {type(tensor_input).__name__}\")\n", - " print(f\" Output type: {type(tensor_output).__name__}\")\n", - " print(\" ✅ Tensor-only operations still work perfectly\")\n", - " \n", - " print(\"\\n🎉 Dense layer now supports both Tensors and Variables!\")\n", - " print(\" • Tensors: Fast inference without gradient tracking\")\n", - " print(\" • Variables: Full training with automatic differentiation\")\n", - " print(\" • Seamless interoperability for different use cases\")\n", - " \n", - " except ImportError as e:\n", - " print(f\"⚠️ Autograd demo skipped: {e}\")\n", - " print(\" (Variable class not available)\")\n", - " except Exception as e:\n", - " print(f\"❌ Demo failed: {e}\")\n", - "\n", - "demonstrate_autograd_integration()" - ] - }, - { - "cell_type": "markdown", - "id": "deffd3a3", - "metadata": { - 
"cell_marker": "\"\"\"" - }, - "source": [ - "# Module Summary\n", - "\n", - "## 🎯 What You've Accomplished\n", - "\n", - "You've successfully implemented the fundamental building blocks of neural networks:\n", - "\n", - "### ✅ **Core Implementations**\n", - "- **Matrix Multiplication**: The computational primitive underlying all neural network operations (now with autograd support)\n", - "- **Dense Layer**: Complete implementation with proper parameter initialization, forward propagation, and Variable support\n", - "- **Autograd Integration**: Seamless support for both Tensors (inference) and Variables (training with gradients)\n", - "- **Composition Patterns**: How layers stack together to form complex function approximators\n", - "\n", - "### ✅ **Systems Understanding**\n", - "- **Memory Analysis**: How layer size affects memory usage and why this matters for deployment\n", - "- **Performance Characteristics**: Understanding computational complexity and scaling behavior\n", - "- **Production Context**: Connection to real-world ML systems and optimization techniques\n", - "\n", - "### ✅ **ML Engineering Skills**\n", - "- **Parameter Management**: How neural networks store and update learnable parameters\n", - "- **Batch Processing**: Efficient handling of multiple data samples simultaneously\n", - "- **Architecture Design**: Trade-offs between network width, depth, and resource requirements\n", - "\n", - "## 🔗 **Connection to Production ML Systems**\n", - "\n", - "Your implementations mirror the core concepts used in:\n", - "- **PyTorch's nn.Linear**: Same mathematical operations with production optimizations\n", - "- **TensorFlow's Dense layers**: Identical parameter structure and forward pass logic\n", - "- **Transformer architectures**: Dense layers form the foundation of modern language models\n", - "- **Computer vision models**: ConvNets use similar principles with spatial structure\n", - "\n", - "## 🚀 **What's Next**\n", - "\n", - "With solid layer 
implementations, you're ready to:\n", - "- **Compose** these layers into complete neural networks\n", - "- **Add** nonlinear activations to enable complex function approximation\n", - "- **Implement** training algorithms to learn from data\n", - "- **Scale** to larger, more sophisticated architectures\n", - "\n", - "## 💡 **Key Systems Insights**\n", - "\n", - "1. **Matrix multiplication is the computational bottleneck** in neural networks\n", - "2. **Memory layout and access patterns** often matter more than raw compute power\n", - "3. **Layer composition** is the fundamental abstraction for building complex ML systems\n", - "4. **Parameter initialization and management** directly affects training success\n", - "\n", - "You now understand the mathematical and computational foundations that enable neural networks to learn complex patterns from data!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e4d045ea", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "final-demo", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "if __name__ == \"__main__\":\n", - " print(\"🔥 TinyTorch Layers Module - Final Demo\")\n", - " print(\"=\" * 50)\n", - " \n", - " # Create a simple neural network architecture\n", - " print(\"\\n🏗️ Building a 3-layer neural network:\")\n", - " layer1 = Dense(784, 128) # Input layer (like MNIST images)\n", - " layer2 = Dense(128, 64) # Hidden layer\n", - " layer3 = Dense(64, 10) # Output layer (10 classes)\n", - " \n", - " print(f\" Layer 1: {layer1.input_size} → {layer1.output_size} ({layer1.weights.data.size:,} parameters)\")\n", - " print(f\" Layer 2: {layer2.input_size} → {layer2.output_size} ({layer2.weights.data.size:,} parameters)\")\n", - " print(f\" Layer 3: {layer3.input_size} → {layer3.output_size} ({layer3.weights.data.size:,} parameters)\")\n", - " \n", - " # Simulate forward pass\n", - " print(\"\\n🚀 Forward pass through 
network:\")\n", - " batch_size = 32\n", - " input_data = Tensor(np.random.randn(batch_size, 784))\n", - " \n", - " print(f\" Input shape: {input_data.shape}\")\n", - " hidden1 = layer1(input_data)\n", - " print(f\" After layer 1: {hidden1.shape}\")\n", - " hidden2 = layer2(hidden1)\n", - " print(f\" After layer 2: {hidden2.shape}\")\n", - " output = layer3(hidden2)\n", - " print(f\" Final output: {output.shape}\")\n", - " \n", - " # Calculate total parameters\n", - " total_params = (layer1.weights.data.size + layer1.bias.data.size + \n", - " layer2.weights.data.size + layer2.bias.data.size +\n", - " layer3.weights.data.size + layer3.bias.data.size)\n", - " \n", - " print(f\"\\n📊 Network Statistics:\")\n", - " print(f\" Total parameters: {total_params:,}\")\n", - " print(f\" Memory usage: ~{total_params * 4 / 1024 / 1024:.2f} MB (float32)\")\n", - " print(f\" Forward pass: {batch_size} samples processed simultaneously\")\n", - " \n", - " print(\"\\n✅ Neural network construction complete!\")\n", - " print(\"Ready for activation functions and training algorithms!\")\n", - " \n", - " # Run autograd integration demo\n", - " print(\"\\n\" + \"=\"*60)\n", - " demonstrate_autograd_integration()" - ] - } - ], - "metadata": { - "jupytext": { - "main_language": "python" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/modules_old/03_layers/layers_dev.py b/modules_old/03_layers/layers_dev.py deleted file mode 100644 index c1d13aee..00000000 --- a/modules_old/03_layers/layers_dev.py +++ /dev/null @@ -1,1139 +0,0 @@ -# --- -# jupyter: -# jupytext: -# text_representation: -# extension: .py -# format_name: percent -# format_version: '1.3' -# jupytext_version: 1.17.1 -# --- - -# %% [markdown] -""" -# Layers - Building Neural Network Architectures - -Welcome to Layers! You'll implement the essential building blocks that compose into complete neural network architectures. 
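That destination can be previewed with nothing but NumPy before any abstraction exists. A hedged sketch of layers composing into a network (shapes mirror the MLP used later in this module; the names are illustrative, not this module's API):

```python
import numpy as np

# Each dense layer is y = x @ W + b; an MLP is just these chained with ReLU.
rng = np.random.default_rng(0)
dims = [(784, 256), (256, 128), (128, 10)]
layers = [(rng.standard_normal((i, o)) * 0.1, np.zeros(o)) for i, o in dims]

x = rng.standard_normal((32, 784))   # batch of 32 flattened 28x28 images
for k, (W, b) in enumerate(layers):
    x = x @ W + b
    if k < len(layers) - 1:          # ReLU on hidden layers only
        x = np.maximum(x, 0.0)

print(x.shape)  # (32, 10)
```

Everything this module builds (Module, Linear, Sequential) exists to manage exactly these `(W, b)` pairs and their composition in an organized, trainable way.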
- -## LINK Building on Previous Learning -**What You Built Before**: -- Module 02 (Tensor): N-dimensional arrays with shape management and broadcasting -- Module 03 (Activations): ReLU and Softmax functions providing nonlinear intelligence - -**What's Working**: You can create tensors and apply nonlinear transformations for complex pattern learning! - -**The Gap**: You have data structures and nonlinear functions, but no way to combine them into trainable neural network architectures. - -**This Module's Solution**: Implement Linear layers, Module composition patterns, and Sequential networks - the architectural foundations enabling everything from MLPs to transformers. - -**Connection Map**: -``` -Activations -> Layers -> Training -(intelligence) (architecture) (learning) -``` - -## Learning Objectives - -By completing this module, you will: - -1. **Build layer abstractions** - Create the building blocks that compose into neural networks -2. **Implement Linear layers** - The fundamental operation that transforms data between dimensions -3. **Create Sequential networks** - Chain layers together to build complete neural networks -4. **Manage parameters** - Handle weights and biases in an organized way -5. **Foundation for architectures** - Enable building everything from simple MLPs to complex models - -## Build -> Use -> Reflect -1. **Build**: Module base class, Linear layers, and Sequential composition -2. **Use**: Combine layers into complete neural networks with real data -3. **Reflect**: Understand how simple building blocks enable complex architectures -""" - -# In[ ]: - -#| default_exp core.layers - -#| export -import numpy as np -import sys -import os - -# Smart import system: works both during development and in production -# This pattern allows the same code to work in two scenarios: -# 1. During development: imports from local module files (tensor_dev.py) -# 2. 
In production: imports from installed tinytorch package -# This flexibility is essential for educational development workflows - -if 'tinytorch' in sys.modules: - # Production: Import from installed package - # When tinytorch is installed as a package, use the packaged version - from tinytorch.core.tensor import Tensor -else: - # Development: Import from local module files - # During development, we need to import directly from the source files - # This allows us to work with modules before they're packaged - tensor_module_path = os.path.join(os.path.dirname(__file__), '..', '01_tensor') - sys.path.insert(0, tensor_module_path) - try: - from tensor_dev import Tensor - finally: - sys.path.pop(0) # Always clean up path to avoid side effects - -# REMOVED: Parameter class - now using Tensor directly with requires_grad=True -# -# This creates a clean evolution pattern: -# - Module 01-04: Use Tensor(data, requires_grad=True) directly -# - Module 05: Tensor gains full autograd capabilities -# - No more hasattr() hacks or wrapper classes needed - -# In[ ]: - -print("FIRE TinyTorch Layers Module") -print(f"NumPy version: {np.__version__}") -print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}") -print("Ready to build neural network layers!") - -# %% [markdown] -""" -## Visual Guide: Understanding Neural Network Architecture Through Diagrams - -### Neural Network Layers: From Components to Systems - -``` -Individual Neuron: Neural Network Layer: - x₁ --○ w₁ +---------------------+ - \\ | Input Vector | - x₂ --○ w₂ --> Sum --> f() --> y | [x₁, x₂, x₃] | - / +---------------------+ - x₃ --○ w₃ v - + bias +---------------------+ - | Weight Matrix W | -One computation unit | +w₁₁ w₁₂ w₁₃+ | - | |w₂₁ w₂₂ w₂₃| | - | +w₃₁ w₃₂ w₃₃+ | - +---------------------+ - v - Matrix multiplication - Y = X @ W + b - v - +---------------------+ - | Output Vector | - | [y₁, y₂, y₃] | - +---------------------+ - -Parallel processing of many neurons! 
-``` - -### Layer Composition: Building Complex Architectures - -``` -Multi-Layer Perceptron (MLP) Architecture: - - Input Hidden Layer 1 Hidden Layer 2 Output - (784 dims) (256 neurons) (128 neurons) (10 classes) -+---------+ +-------------+ +-------------+ +---------+ -| Image |----▶| ReLU |--▶| ReLU |--▶| Softmax | -| 28*28px | | Activations | | Activations | | Probs | -+---------+ +-------------+ +-------------+ +---------+ - v v v v -200,960 params 32,896 params 1,290 params Total: 235,146 - -Parameter calculation for Linear(input_size, output_size): -• Weights: input_size * output_size matrix -• Biases: output_size vector -• Total: (input_size * output_size) + output_size - -Memory scaling pattern: -Layer width doubles -> Parameters quadruple -> Memory quadruples -``` - -### Module System: Automatic Parameter Management - -``` -Parameter Collection Hierarchy: - -Model (Sequential) -+-- Layer1 (Linear) -| +-- weights [784 * 256] --+ -| +-- bias [256] --┤ -+-- Layer2 (Linear) +--▶ model.parameters() -| +-- weights [256 * 128] --┤ Automatically collects -| +-- bias [128] --┤ all parameters for -+-- Layer3 (Linear) +--▶ optimizer.step() - +-- weights [128 * 10] --┤ - +-- bias [10] --+ - -Before Module system: With Module system: -manually track params -> automatic collection -params = [w1, b1, w2,...] params = model.parameters() - -Enables: optimizer = Adam(model.parameters()) -``` - -### Memory Layout and Performance Implications - -``` -Tensor Memory Access Patterns: - -Matrix Multiplication: A @ B = C - -Efficient (Row-major access): Inefficient (Column-major): -A: --------------▶ A: | | | | | ▶ - Cache-friendly | | | | | - Sequential reads v v v v v - Cache misses -B: | B: --------------▶ - | - v - -Performance impact: -• Good memory layout: 100% cache hit ratio -• Poor memory layout: 10-50% cache hit ratio -• 10-100x performance difference in practice - -Why contiguous tensors matter in production! 
-``` -""" - -# %% [markdown] -""" -## Part 1: Module Base Class - The Foundation of Neural Network Architecture -""" - -# %% nbgrader={"grade": false, "grade_id": "module-base", "solution": true} - -# Before building specific layers, we need a base class that enables clean composition and automatic parameter management. - -#| export -class Module: - """ - Base class for all neural network modules. - - Provides automatic parameter collection, forward pass management, - and clean composition patterns. All layers (Dense, Conv2d, etc.) - inherit from this class. - - Key Features: - - Automatic parameter registration when you assign parameter Tensors (weights, bias) - - Recursive parameter collection from sub-modules - - Clean __call__ interface: model(x) instead of model.forward(x) - - Extensible for custom layers - - Example Usage: - class MLP(Module): - def __init__(self): - super().__init__() - self.layer1 = Linear(784, 128) # Auto-registered! - self.layer2 = Linear(128, 10) # Auto-registered! - - def forward(self, x): - x = self.layer1(x) - return self.layer2(x) - - model = MLP() - params = model.parameters() # Gets all parameters automatically! - output = model(input) # Clean interface! - """ - - def __init__(self): - """Initialize module with empty parameter and sub-module storage.""" - self._parameters = [] - self._modules = [] - - def __setattr__(self, name, value): - """ - Intercept attribute assignment to auto-register parameters and modules. - - When you do self.weight = Parameter(...), this automatically adds - the parameter to our collection for easy optimization. 
- """ - # Step 1: Check if this looks like a parameter (Tensor with parameter naming) - # Pure tensor evolution: identify parameters by naming convention - is_tensor_type = isinstance(value, Tensor) - is_parameter_name = name in ['weights', 'weight', 'bias'] - - if is_tensor_type and is_parameter_name: - # Step 2: Add to our parameter list for optimization - self._parameters.append(value) - - # Step 3: Check if it's a sub-module (another neural network layer) - elif isinstance(value, Module): - # Step 4: Add to module list for recursive parameter collection - self._modules.append(value) - - # Step 5: Always set the actual attribute (this is essential!) - super().__setattr__(name, value) - - def parameters(self): - """ - Recursively collect all parameters from this module and sub-modules. - - Returns: - List of all parameters (Tensors containing weights and biases) - - This enables: optimizer = Adam(model.parameters()) (when optimizers are available) - """ - # Start with our own parameters - params = list(self._parameters) - - # Add parameters from sub-modules recursively - for module in self._modules: - params.extend(module.parameters()) - - return params - - def __call__(self, *args, **kwargs): - """ - Makes modules callable: model(x) instead of model.forward(x). - - This is the magic that enables clean syntax like: - output = model(input) - instead of: - output = model.forward(input) - """ - return self.forward(*args, **kwargs) - - def forward(self, *args, **kwargs): - """ - Forward pass - must be implemented by subclasses. - - This is where the actual computation happens. Every layer - defines its own forward() method. - """ - raise NotImplementedError("Subclasses must implement forward()") - -# In[ ]: - -# PASS IMPLEMENTATION CHECKPOINT: Basic Module class complete - -# THINK PREDICTION: How many parameters would a simple 3-layer network have? 
-# Write your guess here: _______ - -# 🔍 SYSTEMS ANALYSIS: Layer Performance and Scaling -def analyze_layer_performance(): - """Analyze layer performance and scaling characteristics.""" - print("📊 LAYER SYSTEMS ANALYSIS") - print("Understanding how neural network layers scale and perform...") - - try: - # Parameter scaling analysis - print("\n1. Parameter Scaling:") - layer_sizes = [(784, 256), (256, 128), (128, 10)] - total_params = 0 - - for i, (input_size, output_size) in enumerate(layer_sizes): - weights = input_size * output_size - biases = output_size - layer_params = weights + biases - total_params += layer_params - print(f" Layer {i+1} ({input_size}→{output_size}): {layer_params:,} params") - - print(f" Total network: {total_params:,} parameters") - print(f" Memory usage: {total_params * 4 / 1024 / 1024:.2f} MB (float32)") - - # Computational complexity - print("\n2. Computational Complexity:") - batch_size = 32 - total_flops = 0 - - for i, (input_size, output_size) in enumerate(layer_sizes): - matmul_flops = 2 * batch_size * input_size * output_size - bias_flops = batch_size * output_size - layer_flops = matmul_flops + bias_flops - total_flops += layer_flops - print(f" Layer {i+1}: {layer_flops:,} FLOPs ({matmul_flops:,} matmul + {bias_flops:,} bias)") - - print(f" Total forward pass: {total_flops:,} FLOPs") - - # Scaling patterns - print("\n3. 
Scaling Insights:") - print(" • Parameter growth: O(input_size × output_size) - quadratic") - print(" • Computation: O(batch × input × output) - linear in each dimension") - print(" • Memory: Parameters + activations scale differently") - print(" • Bottlenecks: Large layers dominate both memory and compute") - - print("\n💡 KEY INSIGHT: Layer size quadratically affects parameters but linearly affects computation per sample") - - except Exception as e: - print(f"⚠️ Analysis error: {e}") - -# In[ ]: - -# %% [markdown] -""" -### ✅ IMPLEMENTATION CHECKPOINT: Module Base Class Complete - -You've built the foundation that enables automatic parameter management across all neural network components! - -🤔 **PREDICTION**: How many parameters would a simple 3-layer network have? -Network: 784 → 256 → 128 → 10 -Your guess: _______ -""" - -# %% [markdown] -""" -## Part 2: Linear Layer - The Fundamental Neural Network Component - -Linear layers (also called Dense or Fully Connected layers) are the building blocks of neural networks. -""" - -# %% nbgrader={"grade": false, "grade_id": "linear-layer", "solution": true} - -#| export -class Linear(Module): - """ - Linear (Fully Connected) Layer implementation. - - Applies the transformation: output = input @ weights + bias - - Inherits from Module for automatic parameter management and clean API. - This is PyTorch's nn.Linear equivalent with the same name for familiarity. - - Features: - - Automatic parameter registration (weights and bias) - - Clean call interface: layer(input) instead of layer.forward(input) - - Works with optimizers via model.parameters() - """ - - def __init__(self, input_size: int, output_size: int, use_bias: bool = True): - """ - Initialize Linear layer with random weights and optional bias. - - Args: - input_size: Number of input features - output_size: Number of output features - use_bias: Whether to include bias term - - TODO: Implement Linear layer initialization. - - STEP-BY-STEP IMPLEMENTATION: - 1. 
Store input_size and output_size as instance variables - 2. Initialize weights as Tensor with shape (input_size, output_size) - 3. Use small random values: np.random.randn(...) * 0.1 - 4. Initialize bias as Tensor with shape (output_size,) if use_bias is True - 5. Set bias to None if use_bias is False - - LEARNING CONNECTIONS: - - Small random initialization breaks symmetry (all-zero weights would make every neuron learn the same features) - - Weight shape (input_size, output_size) enables matrix multiplication - - Bias allows shifting the output (like y-intercept in linear regression) - - PyTorch uses more sophisticated initialization (Xavier, Kaiming) - - IMPLEMENTATION HINTS: - - Use np.random.randn() for Gaussian random numbers - - Scale by 0.1 to keep initial values small - - Remember to wrap numpy arrays in Tensor() - - Store use_bias flag for forward pass logic - """ - ### BEGIN SOLUTION - super().__init__() # Initialize Module base class - - self.input_size = input_size - self.output_size = output_size - self.use_bias = use_bias - - # Initialize weights with small random values as a plain Tensor - # Shape: (input_size, output_size) for matrix multiplication - # - # MAGNIFY WEIGHT INITIALIZATION CONTEXT: - # Weight initialization is critical for training deep networks successfully. 
- # Our simple approach (small random * 0.1) works for shallow networks, but
- # deeper networks require more sophisticated initialization strategies:
- #
- # • Xavier/Glorot: scale = sqrt(1/fan_in) - good for tanh/sigmoid activations
- # • Kaiming/He: scale = sqrt(2/fan_in) - optimized for ReLU activations
- # • Our approach: scale = 0.1 - simple but effective for basic networks
- #
- # Why proper initialization matters:
- # - Prevents vanishing gradients (weights too small -> signals disappear)
- # - Prevents exploding gradients (weights too large -> signals blow up)
- # - Enables stable training in deeper architectures (Module 11 training)
- # - Affects convergence speed and final model performance
- #
- # Production frameworks automatically choose initialization based on layer type!
- weight_data = np.random.randn(input_size, output_size) * 0.1
- self.weights = Tensor(weight_data) # Pure tensor - will become trainable in Module 05
-
- # Initialize bias if requested
- if use_bias:
- # 🔍 GRADIENT FLOW PREPARATION:
- # Clean parameter management is essential for backpropagation (Module 09).
- # When we implement autograd, the optimizer needs to find ALL trainable
- # parameters automatically.
Our Module base class ensures that:
- #
- # • Parameters are automatically registered when assigned
- # • Recursive parameter collection works through network hierarchies
- # • Gradient updates can flow to all learnable weights and biases
- # • Memory management handles parameter lifecycle correctly
- #
- # This design enables the autograd system to:
- # - Track computational graphs through all layers
- # - Accumulate gradients for each parameter during backpropagation
- # - Support optimizers that update parameters based on gradients
- # - Scale to arbitrarily deep and complex network architectures
- #
- # Bias also uses small random initialization (could be zeros, but small random works well)
- bias_data = np.random.randn(output_size) * 0.1
- self.bias = Tensor(bias_data) # Pure tensor - will become trainable in Module 05
- else:
- self.bias = None
- ### END SOLUTION
-
- def forward(self, x):
- """
- Forward pass through the Linear layer.
-
- Args:
- x: Input Tensor (shape: ..., input_size)
-
- Returns:
- Output Tensor (shape: ..., output_size)
-
- NOTE: For now this uses plain Tensor operations. The same code gains
- gradient tracking automatically once Tensor acquires autograd
- capabilities in Module 05 - no changes will be needed here.
-
- TODO: Implement the linear transformation
-
- STEP-BY-STEP IMPLEMENTATION:
- 1. Convert the input to a Tensor if needed
- 2. Apply matrix multiplication: x @ self.weights
- 3. Add the bias if it exists: result + self.bias
- 4. Return the resulting Tensor
-
- LEARNING CONNECTIONS:
- - Uses Tensor operators (@, +) rather than raw numpy arrays
- - Matrix multiplication maps (..., input_size) to (..., output_size)
- - The identical forward pass powers training once autograd arrives
- - This is exactly the computation in PyTorch's nn.Linear.forward
-
- IMPLEMENTATION HINTS:
- - Wrap non-Tensor inputs with Tensor(x)
- - Check self.bias is not None before adding it
- """
- ### BEGIN SOLUTION
- # Clean Tensor Evolution Pattern:
- # - Modules 01-04: Use basic Tensor operations (@, +)
- # - Module 05+: Tensor gains full autograd capabilities automatically
-
- # Ensure input is a Tensor
- if not isinstance(x, Tensor):
- x = Tensor(x)
-
- # Matrix multiplication: input @ weights
- # Uses Tensor's built-in @ operator (will be autograd-capable after Module 05)
- result = x @ self.weights
-
- # Add bias if it exists
- if self.bias is not None:
- result = result + self.bias
-
- # Result is a plain Tensor for now; it gains gradient tracking in Module 05
- return result
- ### END SOLUTION
-
-# In[ ]:
-
-# %% [markdown]
-"""
-### 🧪 Unit Test: Linear Layer
-This test validates our Linear layer implementation with matrix multiplication and parameter management.
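As an aside, the initialization scales discussed in the implementation comments above can be compared numerically. A minimal NumPy sketch — the Xavier and Kaiming formulas are the standard ones from the literature, and the `fan_in` value is illustrative, not part of this module's code:

```python
import numpy as np

fan_in = 784  # illustrative: a flattened 28x28 input

fixed_scale = 0.1                      # this module's simple approach
xavier_scale = np.sqrt(1.0 / fan_in)   # Xavier/Glorot: for tanh/sigmoid
kaiming_scale = np.sqrt(2.0 / fan_in)  # Kaiming/He: for ReLU

print(f"fixed:   {fixed_scale:.4f}")
print(f"Xavier:  {xavier_scale:.4f}")   # ~0.0357
print(f"Kaiming: {kaiming_scale:.4f}")  # ~0.0505
```

For a single shallow layer the difference is negligible; stacked over many layers, a scale roughly 2x too large compounds multiplicatively, which is why deeper networks are sensitive to this choice.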
- -**What we're testing**: Linear layer transforms input dimensions correctly -**Why it matters**: Linear layers are the fundamental building blocks of neural networks -**Expected**: Correct output shapes, parameter handling, and batch processing - -### Linear Layer Computation Visualization - -``` -Forward Pass: y = x @ W + b - -Input Batch: Weight Matrix: Bias Vector: Output: -┌─────────────┐ ┌───────────────┐ ┌─────────┐ ┌──────────┐ -│ [1, 2, 3] │ │ w₁₁ w₁₂ │ │ b₁ │ │ [y₁, y₂] │ -│ [4, 5, 6] │ @ │ w₂₁ w₂₂ │ + │ b₂ │ = │ [y₃, y₄] │ -└─────────────┘ │ w₃₁ w₃₂ │ └─────────┘ └──────────┘ - Batch(2,3) └───────────────┘ (2,) Batch(2,2) - Weights(3,2) - -Memory Layout: -• Input: [batch_size, input_features] -• Weights: [input_features, output_features] -• Bias: [output_features] -• Output: [batch_size, output_features] -``` -""" - -def test_unit_linear(): - """Test Linear layer implementation.""" - print("🔬 Unit Test: Linear Layer...") - - # Test case 1: Basic functionality - layer = Linear(input_size=3, output_size=2) - input_tensor = Tensor([[1.0, 2.0, 3.0]]) # Shape: (1, 3) - output = layer.forward(input_tensor) - - # Check output shape - assert output.shape == (1, 2), f"Expected shape (1, 2), got {output.shape}" - print("PASS Output shape correct") - - # Test case 2: No bias - layer_no_bias = Linear(input_size=2, output_size=3, use_bias=False) - assert layer_no_bias.bias is None, "Bias should be None when use_bias=False" - print("PASS No bias option works") - - # Test case 3: Multiple samples (batch processing) - batch_input = Tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]) # Shape: (3, 2) - layer_batch = Linear(input_size=2, output_size=2) - batch_output = layer_batch.forward(batch_input) - - assert batch_output.shape == (3, 2), f"Expected shape (3, 2), got {batch_output.shape}" - print("PASS Batch processing works") - - # Test case 4: Callable interface - callable_output = layer_batch(batch_input) - assert np.allclose(callable_output.data, batch_output.data), 
"Callable interface should match forward()" - print("PASS Callable interface works") - - # Test case 5: Parameter initialization - layer_init = Linear(input_size=10, output_size=5) - assert layer_init.weights.shape == (10, 5), f"Expected weights shape (10, 5), got {layer_init.weights.shape}" - assert layer_init.bias.shape == (5,), f"Expected bias shape (5,), got {layer_init.bias.shape}" - - # Check that weights are reasonably small (good initialization) - mean_val = np.abs(layer_init.weights.data).mean() - # Convert to float - mean_val is a numpy scalar from np.abs().mean() - mean_val = float(mean_val) # Direct conversion since np.mean returns numpy scalar - assert mean_val < 1.0, "Weights should be small for good initialization" - print("PASS Parameter initialization correct") - - print("CELEBRATE All Linear layer tests passed!") - -test_unit_linear() - -# In[ ]: - -# TEST Unit Test: Parameter Management -# %% [markdown] -""" -### 🧪 Unit Test: Parameter Management -This test validates automatic parameter collection and module composition. - -**What we're testing**: Module system automatically collects parameters from nested layers -**Why it matters**: Enables automatic optimization and parameter management in complex networks -**Expected**: All parameters collected hierarchically, proper parameter counting - -### Parameter Management Hierarchy Visualization - -``` -Network Architecture: Parameter Collection: - -SimpleNetwork network.parameters() -├── layer1: Linear(4→3) ├── layer1.weights [4×3] = 12 params -│ ├── weights: (4,3) ├── layer1.bias [3] = 3 params -│ └── bias: (3,) ├── layer2.weights [3×2] = 6 params -└── layer2: Linear(3→2) └── layer2.bias [2] = 2 params - ├── weights: (3,2) Total: 23 params - └── bias: (2,) - -Manual Tracking: vs Automatic Collection: -weights = [ params = model.parameters() - layer1.weights, # Automatically finds ALL - layer1.bias, # parameters in the hierarchy - layer2.weights, # No manual bookkeeping! 
- layer2.bias, -] -``` - -### Memory and Parameter Scaling - -``` -Layer Configuration: Parameters: Memory (float32): -Linear(100, 50) → 100×50 + 50 = 5,050 → ~20KB -Linear(256, 128) → 256×128 + 128 = 32,896 → ~131KB -Linear(512, 256) → 512×256 + 256 = 131,328 → ~525KB -Linear(1024, 512) → 1024×512 + 512 = 524,800 → ~2.1MB - -Pattern: O(input_size × output_size) scaling -Large layers dominate memory usage! -``` -""" - -def test_unit_parameter_management(): - """Test Linear layer parameter management and module composition.""" - print("🔬 Unit Test: Parameter Management...") - - # Test case 1: Parameter registration - layer = Linear(input_size=3, output_size=2) - params = layer.parameters() - - assert len(params) == 2, f"Expected 2 parameters (weights + bias), got {len(params)}" - assert layer.weights in params, "Weights should be in parameters list" - assert layer.bias in params, "Bias should be in parameters list" - print("PASS Parameter registration works") - - # Test case 2: Module composition - class SimpleNetwork(Module): - def __init__(self): - super().__init__() - self.layer1 = Linear(4, 3) - self.layer2 = Linear(3, 2) - - def forward(self, x): - x = self.layer1(x) - return self.layer2(x) - - network = SimpleNetwork() - all_params = network.parameters() - - # Should have 4 parameters: 2 from each layer (weights + bias) - assert len(all_params) == 4, f"Expected 4 parameters from network, got {len(all_params)}" - print("PASS Module composition and parameter collection works") - - # Test case 3: Forward pass through composed network - input_tensor = Tensor([[1.0, 2.0, 3.0, 4.0]]) - output = network(input_tensor) - - assert output.shape == (1, 2), f"Expected output shape (1, 2), got {output.shape}" - print("PASS Network forward pass works") - - # Test case 4: No bias option - layer_no_bias = Linear(input_size=3, output_size=2, use_bias=False) - params_no_bias = layer_no_bias.parameters() - - assert len(params_no_bias) == 1, f"Expected 1 parameter (weights only), 
got {len(params_no_bias)}"
- assert layer_no_bias.bias is None, "Bias should be None when use_bias=False"
- print("PASS No bias option works")
-
- print("CELEBRATE All parameter management tests passed!")
-
-test_unit_parameter_management()
-
-# In[ ]:
-
-# ✅ IMPLEMENTATION CHECKPOINT: Linear layer complete
-
-# 🤔 PREDICTION: How does memory usage scale with network depth vs width?
-# Deeper network (more layers): _______
-# Wider network (more neurons per layer): _______
-
-# 🔍 SYSTEMS INSIGHT #3: Architecture Memory Analysis
-# Architecture analysis consolidated into analyze_layer_performance() above
-
-# %% [markdown]
-"""
-## Part 4: Sequential Network Composition
-"""
-
-# %% nbgrader={"grade": false, "grade_id": "sequential-composition", "solution": true}
-
-#| export
-class Sequential(Module):
- """
- Sequential Network: Composes layers in sequence.
-
- The most fundamental network architecture that applies layers in order:
- f(x) = layer_n(...layer_2(layer_1(x)))
-
- Inherits from Module for automatic parameter collection from all sub-layers.
- This enables optimizers to find all parameters automatically.
-
- Example Usage:
- # Create a 3-layer MLP
- model = Sequential([
- Linear(784, 128),
- ReLU(),
- Linear(128, 64),
- ReLU(),
- Linear(64, 10)
- ])
-
- # Use the model
- output = model(input_data) # Clean interface!
- params = model.parameters() # All parameters from all layers!
- """
-
- def __init__(self, layers=None):
- """
- Initialize Sequential network with layers.
- - Args: - layers: List of layers to compose in order (optional) - """ - super().__init__() # Initialize Module base class - self.layers = layers if layers is not None else [] - - # Register all layers as sub-modules for parameter collection - for i, layer in enumerate(self.layers): - # This automatically adds each layer to self._modules - setattr(self, f'layer_{i}', layer) - - def forward(self, x): - """ - Forward pass through all layers in sequence. - - Args: - x: Input tensor - - Returns: - Output tensor after passing through all layers - """ - for layer in self.layers: - x = layer(x) - return x - - def add(self, layer): - """Add a layer to the network.""" - self.layers.append(layer) - # Register the new layer for parameter collection - setattr(self, f'layer_{len(self.layers)-1}', layer) - -# In[ ]: - -# TEST Unit Test: Sequential Networks -def test_unit_sequential(): - """Test Sequential network implementation.""" - print("TEST Testing Sequential Network...") - - # Test case 1: Create empty network - empty_net = Sequential() - assert len(empty_net.layers) == 0, "Empty Sequential should have no layers" - print("PASS Empty Sequential network creation") - - # Test case 2: Create network with layers - layers = [Linear(3, 4), Linear(4, 2)] - network = Sequential(layers) - assert len(network.layers) == 2, "Network should have 2 layers" - print("PASS Sequential network with layers") - - # Test case 3: Forward pass through network - input_tensor = Tensor([[1.0, 2.0, 3.0]]) - output = network(input_tensor) - assert output.shape == (1, 2), f"Expected output shape (1, 2), got {output.shape}" - print("PASS Forward pass through Sequential network") - - # Test case 4: Parameter collection from all layers - all_params = network.parameters() - # Should have 4 parameters: 2 weights + 2 biases from 2 Linear layers - assert len(all_params) == 4, f"Expected 4 parameters from Sequential network, got {len(all_params)}" - print("PASS Parameter collection from all layers") - - # Test 
case 5: Adding layers dynamically - network.add(Linear(2, 1)) - assert len(network.layers) == 3, "Network should have 3 layers after adding one" - - # Test forward pass after adding layer - final_output = network(input_tensor) - assert final_output.shape == (1, 1), f"Expected final output shape (1, 1), got {final_output.shape}" - print("PASS Dynamic layer addition") - - print("CELEBRATE All Sequential network tests passed!") - -test_unit_sequential() - -# %% [markdown] -""" -## Part 5: Flatten Operation - Connecting Different Layer Types -""" - -# %% nbgrader={"grade": false, "grade_id": "flatten-operations", "solution": true} - -#| export -def flatten(x, start_dim=1): - """ - Flatten tensor starting from a given dimension. - - This is essential for transitioning from convolutional layers - (which output 4D tensors) to linear layers (which expect 2D). - - Args: - x: Input tensor (Tensor or any array-like) - start_dim: Dimension to start flattening from (default: 1 to preserve batch) - - Returns: - Flattened tensor preserving batch dimension - - Examples: - # Flatten CNN output for Linear layer - conv_output = Tensor(np.random.randn(32, 64, 8, 8)) # (batch, channels, height, width) - flat = flatten(conv_output) # (32, 4096) - ready for Linear layer! - - # Flatten image for MLP - images = Tensor(np.random.randn(32, 3, 28, 28)) # CIFAR-10 batch - flat = flatten(images) # (32, 2352) - ready for MLP! 
- """ - # Get the data (handle both Tensor and numpy arrays) - if isinstance(x, Tensor): - data = x.data - else: - data = x - - # Calculate new shape - batch_size = data.shape[0] if start_dim > 0 else 1 - remaining_size = np.prod(data.shape[start_dim:]) - new_shape = (batch_size, remaining_size) if start_dim > 0 else (remaining_size,) - - # Reshape while preserving the original tensor type - if isinstance(x, Tensor): - # It's a Tensor - create a new Tensor with flattened data - flattened_data = data.reshape(new_shape) - # Create new tensor - pure tensor approach (no gradient tracking yet) - return Tensor(flattened_data) - else: - # It's a numpy array - just reshape and return - return data.reshape(new_shape) - -#| export -class Flatten(Module): - """ - Flatten layer that reshapes tensors from multi-dimensional to 2D. - - Essential for connecting convolutional layers (which output 4D tensors) - to linear layers (which expect 2D tensors). Preserves the batch dimension. - - Example Usage: - # In a CNN architecture - model = Sequential([ - Conv2D(3, 16, kernel_size=3), # Output: (batch, 16, height, width) - ReLU(), - Flatten(), # Output: (batch, 16*height*width) - Linear(16*height*width, 10) # Now compatible! - ]) - """ - - def __init__(self, start_dim=1): - """ - Initialize Flatten layer. - - Args: - start_dim: Dimension to start flattening from (default: 1 to preserve batch) - """ - super().__init__() - self.start_dim = start_dim - - def forward(self, x): - """ - Flatten tensor starting from start_dim. 
- - Args: - x: Input tensor - - Returns: - Flattened tensor with batch dimension preserved - """ - return flatten(x, start_dim=self.start_dim) - -# In[ ]: - -# TEST Unit Test: Flatten Operations -def test_unit_flatten(): - """Test Flatten layer and function implementation.""" - print("TEST Testing Flatten Operations...") - - # Test case 1: Flatten function with 2D tensor - x_2d = Tensor([[1, 2], [3, 4]]) - flattened_func = flatten(x_2d) - assert flattened_func.shape == (2, 2), f"Expected shape (2, 2), got {flattened_func.shape}" - print("PASS Flatten function with 2D tensor") - - # Test case 2: Flatten function with 4D tensor (simulating CNN output) - x_4d = Tensor(np.random.randn(2, 3, 4, 4)) # (batch, channels, height, width) - flattened_4d = flatten(x_4d) - assert flattened_4d.shape == (2, 48), f"Expected shape (2, 48), got {flattened_4d.shape}" # 3*4*4 = 48 - print("PASS Flatten function with 4D tensor") - - # Test case 3: Flatten layer class - flatten_layer = Flatten() - layer_output = flatten_layer(x_4d) - assert layer_output.shape == (2, 48), f"Expected shape (2, 48), got {layer_output.shape}" - assert np.allclose(layer_output.data, flattened_4d.data), "Flatten layer should match flatten function" - print("PASS Flatten layer class") - - # Test case 4: Different start dimensions - flatten_from_0 = Flatten(start_dim=0) - full_flat = flatten_from_0(x_2d) - assert len(full_flat.shape) <= 2, "Flattening from dim 0 should create vector" - print("PASS Different start dimensions") - - # Test case 5: Integration with Sequential - network = Sequential([ - Linear(8, 4), - Flatten() - ]) - test_input = Tensor(np.random.randn(2, 8)) - output = network(test_input) - assert output.shape == (2, 4), f"Expected shape (2, 4), got {output.shape}" - print("PASS Flatten integration with Sequential") - - print("CELEBRATE All Flatten operations tests passed!") - -test_unit_flatten() - -# In[ ]: - -# %% [markdown] -""" -## 📦 Where This Code Lives in the Final Package - -**Learning 
Side:** You work in modules/03_layers/layers_dev.py -**Building Side:** Code exports to tinytorch.core.layers - -```python -# Final package structure: -from tinytorch.core.layers import Module, Linear, Sequential, Flatten # This module -from tinytorch.core.tensor import Tensor # Pure tensor foundation (always needed) -``` - -**Why this matters:** -- **Learning:** Complete layer system in one focused module for deep understanding -- **Production:** Proper organization like PyTorch's torch.nn with all core components together -- **Consistency:** All layer operations and parameter management in core.layers -- **Integration:** Works seamlessly with tensors for complete neural network building -""" - -# %% - - -# In[ ]: - -# %% [markdown] -""" -## Testing Framework -""" - -def test_module(): - """Run complete module validation.""" - print("🧪 TESTING ALL LAYER COMPONENTS") - print("=" * 40) - - # Call every individual test function - test_unit_linear() - test_unit_parameter_management() - test_unit_sequential() - test_unit_flatten() - - print("\n✅ ALL TESTS PASSED! Layer module ready for integration.") - -# In[ ]: - -if __name__ == "__main__": - print("🚀 TINYTORCH LAYERS MODULE") - print("=" * 50) - - # Test all components - test_module() - - # Systems analysis - print("\n" + "=" * 50) - analyze_layer_performance() - - print("\n🎉 LAYERS MODULE COMPLETE!") - print("✅ Ready for advanced architectures and training!") - -# %% [markdown] -""" -## 🤔 ML Systems Thinking: Interactive Questions - -Now that you've implemented all the core neural network components, let's think about their implications for ML systems: - -**Question 1: Memory vs Computation Analysis** - -You're designing a neural network for deployment on a mobile device with limited memory (1GB RAM) but decent compute power. 
- -You have two architecture options: -A) Wide network: 784 -> 2048 -> 2048 -> 10 (3 layers, wide) -B) Deep network: 784 -> 256 -> 256 -> 256 -> 256 -> 10 (5 layers, narrow) - -Calculate the memory requirements for each option and explain which you'd choose for mobile deployment and why. - -Consider: -- Parameter storage requirements -- Intermediate activation storage during forward pass -- Training vs inference memory requirements -- How your choice affects model capacity and accuracy - -⭐ **Question 2: Production Performance Optimization** - -Your Linear layer implementation works correctly, but you notice it's slower than PyTorch's nn.Linear on the same hardware. - -Investigate and explain: -1. Why might our implementation be slower? (Hint: think about underlying linear algebra libraries) -2. What optimization techniques do production frameworks use? -3. How would you modify our implementation to approach production performance? -4. When might our simple implementation actually be preferable? - -Research areas to consider: -- BLAS (Basic Linear Algebra Subprograms) libraries -- Memory layout and cache efficiency -- Vectorization and SIMD instructions -- GPU kernel optimization - -⭐ **Question 3: Systems Architecture Scaling** - -Modern transformer models like GPT-3 have billions of parameters, primarily in Linear layers. - -Analyze the scaling challenges: -1. How does memory requirement scale with model size? Calculate the memory needed for a 175B parameter model. -2. What are the computational bottlenecks during training vs inference? -3. How do systems like distributed training address these scaling challenges? -4. Why do large models use techniques like gradient checkpointing and model parallelism? 
- -Systems considerations: -- Memory hierarchy (L1/L2/L3 cache, RAM, storage) -- Network bandwidth for distributed training -- GPU memory constraints and model sharding -- Inference optimization for production serving -""" - -# %% [markdown] -""" -## 🎯 MODULE SUMMARY: Layers - Complete Neural Network Foundation - -### What You've Accomplished - -You've successfully implemented the complete foundation for neural networks - all the essential components working together: - -### ✅ **Complete Core System** -- **Module Base Class**: Parameter management and composition patterns for all neural network components -- **Matrix Multiplication**: The computational primitive underlying all neural network operations -- **Linear (Dense) Layers**: Complete implementation with proper parameter initialization and forward propagation -- **Sequential Networks**: Clean composition system for building complete neural network architectures -- **Flatten Operations**: Tensor reshaping to connect different layer types (essential for CNN->MLP transitions) - -### ✅ **Systems Understanding** -- **Architectural Patterns**: How modular design enables everything from MLPs to complex deep networks -- **Memory Analysis**: How layer composition affects memory usage and computational efficiency -- **Performance Characteristics**: Understanding how tensor operations and layer composition affect performance -- **Production Context**: Connection to real-world ML frameworks and their component organization - -### ✅ **ML Engineering Skills** -- **Complete Parameter Management**: How neural networks automatically collect parameters from all components -- **Network Composition**: Building complex architectures from simple, reusable components -- **Tensor Operations**: Essential reshaping and transformation operations for different network types -- **Clean Abstraction**: Professional software design patterns that scale to production systems - -### 🔗 **Connection to Production ML Systems** - -Your unified 
implementation mirrors the complete component systems used in: -- **PyTorch's nn.Module system**: Same parameter management and composition patterns -- **PyTorch's nn.Sequential**: Identical architecture composition approach -- **All major frameworks**: The same modular design principles that power TensorFlow, JAX, and others -- **Production ML systems**: Clean abstractions that enable complex models while maintaining manageable code - -### 🚀 **What's Next** - -With your complete layer foundation, you're ready to: -- **Module 05 (Dense)**: Build complete dense networks for classification tasks -- **Module 06 (Spatial)**: Add convolutional layers for computer vision -- **Module 09 (Autograd)**: Enable automatic differentiation for learning -- **Module 10 (Optimizers)**: Implement sophisticated optimization algorithms - -### 💡 **Key Systems Insights** - -1. **Modular composition is the key to scalable ML systems** - clean interfaces enable complex behaviors -2. **Parameter management must be automatic** - manual parameter tracking doesn't scale to deep networks -3. **Tensor operations like flattening are architectural requirements** - different layer types need different tensor shapes -4. **Clean abstractions enable innovation** - good foundational design supports unlimited architectural experimentation - -You now understand how to build complete, production-ready neural network foundations that can scale to any architecture! -""" \ No newline at end of file diff --git a/modules_old/03_layers/layers_dev_enhanced.py b/modules_old/03_layers/layers_dev_enhanced.py deleted file mode 100644 index aa230dbe..00000000 --- a/modules_old/03_layers/layers_dev_enhanced.py +++ /dev/null @@ -1,1401 +0,0 @@ -#!/usr/bin/env python -# coding: utf-8 - -# # Layers - Building Neural Network Architectures - -# Welcome to Layers! You'll implement the essential building blocks that compose into complete neural network architectures. 
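As a NumPy-only preview of what composing layers means — the `linear` helper and all shapes here are illustrative, not this module's API:

```python
import numpy as np

def linear(x, W, b):
    """One dense transformation: y = x @ W + b."""
    return x @ W + b

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))  # batch of 4 samples, 8 features each

# Two layers with a ReLU nonlinearity between them - the core MLP pattern
W1, b1 = rng.standard_normal((8, 16)) * 0.1, np.zeros(16)
W2, b2 = rng.standard_normal((16, 3)) * 0.1, np.zeros(3)

h = np.maximum(0.0, linear(x, W1, b1))  # hidden activations, shape (4, 16)
y = linear(h, W2, b2)                   # class scores, shape (4, 3)

print(y.shape)  # (4, 3)
```

The classes built in this module wrap exactly this pattern in reusable, parameter-tracking objects.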
- -# ## 🔗 Building on Previous Learning -# **What You Built Before**: -# - Module 02 (Tensor): N-dimensional arrays with shape management and broadcasting -# - Module 03 (Activations): ReLU and Softmax functions providing nonlinear intelligence - -# **What's Working**: You can create tensors and apply nonlinear transformations for complex pattern learning! - -# **The Gap**: You have data structures and nonlinear functions, but no way to combine them into trainable neural network architectures. - -# **This Module's Solution**: Implement Linear layers, Module composition patterns, and Sequential networks - the architectural foundations enabling everything from MLPs to transformers. - -# **Connection Map**: -# ``` -# Activations → Layers → Training -# (intelligence) (architecture) (learning) -# ``` - -# ## Learning Goals -# - Systems understanding: How layer composition affects memory usage, parameter counts, and computational complexity in neural networks -# - Core implementation skill: Build complete Module system, Linear transformations, and Sequential composition for scalable architectures -# - Pattern/abstraction mastery: Understand how modular design patterns enable building complex networks from simple, reusable components -# - Framework connections: See how your implementation mirrors PyTorch's nn.Module, nn.Linear, and nn.Sequential - the foundation of all modern ML frameworks -# - Optimization trade-offs: Learn why proper parameter management and clean abstractions are essential for both performance and maintainability in production systems - -# ## Build → Use → Reflect -# 1. **Build**: Complete layer system with Module base class, Linear transformations, Sequential composition, and tensor reshaping operations -# 2. **Use**: Compose layers into complete neural networks and analyze architectural trade-offs with real parameter counting -# 3. 
**Reflect**: How does modular architecture design affect both system scalability and computational efficiency in production ML systems? - -# ## Systems Reality Check -# 💡 **Production Context**: PyTorch's nn.Module system enables all modern neural networks through automatic parameter collection and clean composition patterns -# ⚡ **Performance Insight**: Layer composition and parameter management patterns determine training speed and memory efficiency - proper abstraction is a systems requirement, not just good design - -# In[ ]: - -#| default_exp core.layers - -#| export -import numpy as np -import sys -import os - -# Smart import system: works both during development and in production -# This pattern allows the same code to work in two scenarios: -# 1. During development: imports from local module files (tensor_dev.py) -# 2. In production: imports from installed tinytorch package -# This flexibility is essential for educational development workflows - -if 'tinytorch' in sys.modules: - # Production: Import from installed package - # When tinytorch is installed as a package, use the packaged version - from tinytorch.core.tensor import Tensor, Parameter -else: - # Development: Import from local module files - # During development, we need to import directly from the source files - # This allows us to work with modules before they're packaged - tensor_module_path = os.path.join(os.path.dirname(__file__), '..', '02_tensor') - sys.path.insert(0, tensor_module_path) - try: - from tensor_dev import Tensor, Parameter - finally: - sys.path.pop(0) # Always clean up path to avoid side effects - -# In[ ]: - -print("🔥 TinyTorch Layers Module") -print(f"NumPy version: {np.__version__}") -print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}") -print("Ready to build neural network layers!") - -# ## Visual Guide: Understanding Neural Network Architecture Through Diagrams - -# ### Neural Network Layers: From Components to Systems -# -# ``` -# Individual Neuron: 
Neural Network Layer: -# x₁ ──○ w₁ ┌─────────────────────┐ -# ╲ │ Input Vector │ -# x₂ ──○ w₂ ──> Σ ──> f() ──> y │ [x₁, x₂, x₃] │ -# ╱ └─────────────────────┘ -# x₃ ──○ w₃ ↓ -# + bias ┌─────────────────────┐ -# │ Weight Matrix W │ -# One computation unit │ ┌w₁₁ w₁₂ w₁₃┐ │ -# │ │w₂₁ w₂₂ w₂₃│ │ -# │ └w₃₁ w₃₂ w₃₃┘ │ -# └─────────────────────┘ -# ↓ -# Matrix multiplication -# Y = X @ W + b -# ↓ -# ┌─────────────────────┐ -# │ Output Vector │ -# │ [y₁, y₂, y₃] │ -# └─────────────────────┘ -# -# Parallel processing of many neurons! -# ``` - -# ### Layer Composition: Building Complex Architectures -# -# ``` -# Multi-Layer Perceptron (MLP) Architecture: -# -# Input Hidden Layer 1 Hidden Layer 2 Output -# (784 dims) (256 neurons) (128 neurons) (10 classes) -# ┌─────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────┐ -# │ Image │────▶│ ReLU │──▶│ ReLU │──▶│ Softmax │ -# │ 28×28px │ │ Activations │ │ Activations │ │ Probs │ -# └─────────┘ └─────────────┘ └─────────────┘ └─────────┘ -# ↓ ↓ ↓ ↓ -# 200,960 params 32,896 params 1,290 params Total: 235,146 -# -# Parameter calculation for Linear(input_size, output_size): -# • Weights: input_size × output_size matrix -# • Biases: output_size vector -# • Total: (input_size × output_size) + output_size -# -# Memory scaling pattern: -# Layer width doubles → Parameters quadruple → Memory quadruples -# ``` - -# ### Module System: Automatic Parameter Management -# -# ``` -# Parameter Collection Hierarchy: -# -# Model (Sequential) -# ├── Layer1 (Linear) -# │ ├── weights [784 × 256] ──┐ -# │ └── bias [256] ──┤ -# ├── Layer2 (Linear) ├──▶ model.parameters() -# │ ├── weights [256 × 128] ──┤ Automatically collects -# │ └── bias [128] ──┤ all parameters for -# └── Layer3 (Linear) ├──▶ optimizer.step() -# ├── weights [128 × 10] ──┤ -# └── bias [10] ──┘ -# -# Before Module system: With Module system: -# manually track params → automatic collection -# params = [w1, b1, w2,...] 
params = model.parameters() -# -# Enables: optimizer = Adam(model.parameters()) -# ``` - -# ### Memory Layout and Performance Implications -# -# ``` -# Tensor Memory Access Patterns: -# -# Matrix Multiplication: A @ B = C -# -# Efficient (Row-major access): Inefficient (Column-major): -# A: ──────────────▶ A: │ │ │ │ │ ▶ -# Cache-friendly │ │ │ │ │ -# Sequential reads ▼ ▼ ▼ ▼ ▼ -# Cache misses -# B: │ B: ──────────────▶ -# │ -# ▼ -# -# Performance impact: -# • Good memory layout: 100% cache hit ratio -# • Poor memory layout: 10-50% cache hit ratio -# • 10-100x performance difference in practice -# -# Why contiguous tensors matter in production! -# ``` - -# In[ ]: - -# ## Part 1: Module Base Class - The Foundation of Neural Network Architecture - -# Before building specific layers, we need a base class that enables clean composition and automatic parameter management. - -#| export -class Module: - """ - Base class for all neural network modules. - - Provides automatic parameter collection, forward pass management, - and clean composition patterns. All layers (Dense, Conv2d, etc.) - inherit from this class. - - Key Features: - - Automatic parameter registration when you assign parameter Tensors (weights, bias) - - Recursive parameter collection from sub-modules - - Clean __call__ interface: model(x) instead of model.forward(x) - - Extensible for custom layers - - Example Usage: - class MLP(Module): - def __init__(self): - super().__init__() - self.layer1 = Linear(784, 128) # Auto-registered! - self.layer2 = Linear(128, 10) # Auto-registered! - - def forward(self, x): - x = self.layer1(x) - return self.layer2(x) - - model = MLP() - params = model.parameters() # Gets all parameters automatically! - output = model(input) # Clean interface! 
- """ - - def __init__(self): - """Initialize module with empty parameter and sub-module storage.""" - self._parameters = [] - self._modules = [] - - def __setattr__(self, name, value): - """ - Intercept attribute assignment to auto-register parameters and modules. - - When you do self.weight = Parameter(...), this automatically adds - the parameter to our collection for easy optimization. - """ - # Step 1: Check if this looks like a parameter (Tensor with data and specific name) - # Break down the complex boolean logic for clarity: - is_tensor_like = hasattr(value, 'data') and hasattr(value, 'shape') - is_tensor_type = isinstance(value, Tensor) - is_parameter_name = name in ['weights', 'weight', 'bias'] - - if is_tensor_like and is_tensor_type and is_parameter_name: - # Step 2: Add to our parameter list for optimization - self._parameters.append(value) - - # Step 3: Check if it's a sub-module (another neural network layer) - elif isinstance(value, Module): - # Step 4: Add to module list for recursive parameter collection - self._modules.append(value) - - # Step 5: Always set the actual attribute (this is essential!) - super().__setattr__(name, value) - - def parameters(self): - """ - Recursively collect all parameters from this module and sub-modules. - - Returns: - List of all parameters (Tensors containing weights and biases) - - This enables: optimizer = Adam(model.parameters()) (when optimizers are available) - """ - # Start with our own parameters - params = list(self._parameters) - - # Add parameters from sub-modules recursively - for module in self._modules: - params.extend(module.parameters()) - - return params - - def __call__(self, *args, **kwargs): - """ - Makes modules callable: model(x) instead of model.forward(x). 
- - This is the magic that enables clean syntax like: - output = model(input) - instead of: - output = model.forward(input) - """ - return self.forward(*args, **kwargs) - - def forward(self, *args, **kwargs): - """ - Forward pass - must be implemented by subclasses. - - This is where the actual computation happens. Every layer - defines its own forward() method. - """ - raise NotImplementedError("Subclasses must implement forward()") - -# In[ ]: - -# ✅ IMPLEMENTATION CHECKPOINT: Basic Module class complete - -# 🤔 PREDICTION: How many parameters would a simple 3-layer network have? -# Write your guess here: _______ - -# 🔍 SYSTEMS INSIGHT #1: Parameter Counter -def analyze_parameter_scaling(): - """Count parameters in networks of different sizes.""" - try: - print("📊 Parameter Scaling Analysis") - print("=" * 40) - - layer_configs = [ - (100, 50), # Small network - (784, 256), # MNIST-style - (1024, 512), # Medium network - (2048, 1024), # Large network - (4096, 2048), # Very large - ] - - for input_size, output_size in layer_configs: - # Calculate parameters for Linear layer - weight_params = input_size * output_size - bias_params = output_size - total_params = weight_params + bias_params - - # Memory calculation (float32 = 4 bytes) - memory_mb = total_params * 4 / (1024 * 1024) - - print(f" {input_size:4d} → {output_size:4d}: {total_params:,} params, {memory_mb:.2f} MB") - - print("\n💡 Key Insights:") - print(" • Parameters scale quadratically with layer width") - print(" • Doubling width → 4x parameters → 4x memory") - print(" • Modern networks balance width vs depth carefully") - print(" • GPT-3 has 175B parameters = ~700GB just for weights!") - - except Exception as e: - print(f"⚠️ Error in parameter analysis: {e}") - -# Run the analysis -analyze_parameter_scaling() - -# In[ ]: - -# ## Part 2: Matrix Multiplication - The Heart of Neural Networks - -# Every neural network operation ultimately reduces to matrix multiplication. 
Let's build the foundation that powers everything from simple perceptrons to transformers. - -#| export -def matmul(a: Tensor, b: Tensor) -> Tensor: - """ - Matrix multiplication for tensors using explicit loops. - - This implementation uses triple-nested loops for educational understanding - of the fundamental operations. Module 15 will show the optimization progression - from loops → blocking → vectorized operations. - - Args: - a: Left tensor (shape: ..., m, k) - b: Right tensor (shape: ..., k, n) - - Returns: - Result tensor (shape: ..., m, n) - - TODO: Implement matrix multiplication using explicit loops. - - STEP-BY-STEP IMPLEMENTATION: - 1. Extract numpy arrays from both tensors using .data - 2. Check tensor shapes for compatibility - 3. Use triple-nested loops to show every operation - 4. Wrap result in a new Tensor and return - - LEARNING CONNECTIONS: - - This is the core operation in Dense layers: output = input @ weights - - Shows the fundamental computation before optimization - - Module 15 will demonstrate the progression to high-performance implementations - - Understanding loops helps appreciate vectorization and GPU parallelization - - EDUCATIONAL APPROACH: - - Intentionally simple for understanding, not performance - - Makes every multiply-add operation explicit - - Sets up Module 15 to show optimization techniques - - EXAMPLE: - ```python - a = Tensor([[1, 2], [3, 4]]) # shape (2, 2) - b = Tensor([[5, 6], [7, 8]]) # shape (2, 2) - result = matmul(a, b) - # result.data = [[19, 22], [43, 50]] - ``` - - IMPLEMENTATION HINTS: - - Use explicit loops to show every operation - - This is educational, not optimized for performance - - Module 15 will show the progression to fast implementations - """ - ### BEGIN SOLUTION - # Extract numpy arrays from tensors - a_data = a.data - b_data = b.data - - # Get dimensions and validate compatibility - if len(a_data.shape) != 2 or len(b_data.shape) != 2: - raise ValueError("matmul requires 2D tensors") - - m, k = 
a_data.shape - k2, n = b_data.shape - - if k != k2: - raise ValueError( - f"Matrix multiplication requires inner dimensions to match!\n" - f"Left matrix: {a_data.shape} (inner dim: {k})\n" - f"Right matrix: {b_data.shape} (inner dim: {k2})\n" - f"For A @ B, A's columns must equal B's rows." - ) - - # Initialize result matrix - result = np.zeros((m, n), dtype=a_data.dtype) - - # Triple nested loops - educational, shows every operation - # This is intentionally simple to understand the fundamental computation - # - # Matrix multiplication visualization: - # A (2,3) @ B (3,4) = C (2,4) - # - # A = [[a11, a12, a13], B = [[b11, b12, b13, b14], - # [a21, a22, a23]] [b21, b22, b23, b24], - # [b31, b32, b33, b34]] - # - # C[0,0] = a11*b11 + a12*b21 + a13*b31 (dot product of A's row 0 with B's column 0) - # - # Module 15 will show the optimization journey: - # Step 1 (here): Educational loops - slow but clear - # Step 2: Loop blocking for cache efficiency - # Step 3: Vectorized operations with NumPy - # Step 4: GPU acceleration and BLAS libraries - for i in range(m): # For each row in result - for j in range(n): # For each column in result - for k_idx in range(k): # Dot product: sum over inner dimension - result[i, j] += a_data[i, k_idx] * b_data[k_idx, j] - - # Return new Tensor with result - return Tensor(result) - ### END SOLUTION - -# In[ ]: - -# 🧪 Unit Test: Matrix Multiplication -def test_unit_matmul(): - """Test matrix multiplication implementation.""" - print("🧪 Testing Matrix Multiplication...") - - # Test case 1: Simple 2x2 matrices - a = Tensor([[1, 2], [3, 4]]) - b = Tensor([[5, 6], [7, 8]]) - result = matmul(a, b) - expected = np.array([[19, 22], [43, 50]]) - - assert np.allclose(result.data, expected), f"Expected {expected}, got {result.data}" - print("✅ 2x2 matrix multiplication") - - # Test case 2: Non-square matrices - a = Tensor([[1, 2, 3], [4, 5, 6]]) # 2x3 - b = Tensor([[7, 8], [9, 10], [11, 12]]) # 3x2 - result = matmul(a, b) - expected = np.array([[58, 
64], [139, 154]]) - - assert np.allclose(result.data, expected), f"Expected {expected}, got {result.data}" - print("✅ Non-square matrix multiplication") - - # Test case 3: Vector-matrix multiplication - a = Tensor([[1, 2, 3]]) # 1x3 (row vector) - b = Tensor([[4], [5], [6]]) # 3x1 (column vector) - result = matmul(a, b) - expected = np.array([[32]]) # 1*4 + 2*5 + 3*6 = 32 - - assert np.allclose(result.data, expected), f"Expected {expected}, got {result.data}" - print("✅ Vector-matrix multiplication") - - print("🎉 All matrix multiplication tests passed!") - -test_unit_matmul() - -# In[ ]: - -# ✅ IMPLEMENTATION CHECKPOINT: Matrix multiplication complete - -# 🤔 PREDICTION: How many operations does matrix multiplication take? -# For two N×N matrices, your guess: _______ - -# 🔍 SYSTEMS INSIGHT #2: FLOPs Analysis -def analyze_matmul_complexity(): - """Analyze computational complexity of matrix multiplication.""" - try: - print("📊 Matrix Multiplication FLOPs Analysis") - print("=" * 45) - - sizes = [64, 128, 256, 512, 1024] - - for size in sizes: - # For N×N @ N×N matrices: - # - N³ multiply operations - # - N³ add operations - # - Total: 2N³ FLOPs (Floating Point Operations; FLOPS would mean ops per second) - flops = 2 * size ** 3 - - # Memory requirements - memory_elements = 3 * size * size # A, B, and result matrices - memory_mb = memory_elements * 4 / (1024 * 1024) # float32 = 4 bytes - - print(f" {size:4d}×{size:4d}: {flops/1e9:.3f} GFLOPs, {memory_mb:.1f} MB") - - print("\n💡 Computational Insights:") - print(" • FLOPs grow cubically O(N³) - very expensive!") - print(" • Memory grows quadratically O(N²)") - print(" • Large matrices become compute-bound") - print(" • GPU acceleration essential for deep learning") - print(" • This is why matrix operations dominate ML workloads") - - except Exception as e: - print(f"⚠️ Error in FLOPs analysis: {e}") - -# Run the analysis -analyze_matmul_complexity() - -# In[ ]: - -# ## Part 3: Linear Layer - The Fundamental Neural Network Component - -# Linear layers
(also called Dense or Fully Connected layers) are the building blocks of neural networks. - -#| export -class Linear(Module): - """ - Linear (Fully Connected) Layer implementation. - - Applies the transformation: output = input @ weights + bias - - Inherits from Module for automatic parameter management and clean API. - This is PyTorch's nn.Linear equivalent with the same name for familiarity. - - Features: - - Automatic parameter registration (weights and bias) - - Clean call interface: layer(input) instead of layer.forward(input) - - Works with optimizers via model.parameters() - """ - - def __init__(self, input_size: int, output_size: int, use_bias: bool = True): - """ - Initialize Linear layer with random weights and optional bias. - - Args: - input_size: Number of input features - output_size: Number of output features - use_bias: Whether to include bias term - - TODO: Implement Linear layer initialization. - - STEP-BY-STEP IMPLEMENTATION: - 1. Store input_size and output_size as instance variables - 2. Initialize weights as Tensor with shape (input_size, output_size) - 3. Use small random values: np.random.randn(...) * 0.1 - 4. Initialize bias as Tensor with shape (output_size,) if use_bias is True - 5. 
Set bias to None if use_bias is False - - LEARNING CONNECTIONS: - - Small random initialization breaks symmetry (all-equal weights would receive identical gradients) - - Weight shape (input_size, output_size) enables matrix multiplication - - Bias allows shifting the output (like y-intercept in linear regression) - - PyTorch uses more sophisticated initialization (Xavier, Kaiming) - - IMPLEMENTATION HINTS: - - Use np.random.randn() for Gaussian random numbers - - Scale by 0.1 to keep initial values small - - Remember to wrap numpy arrays in Tensor() - - Store use_bias flag for forward pass logic - """ - ### BEGIN SOLUTION - super().__init__() # Initialize Module base class - - self.input_size = input_size - self.output_size = output_size - self.use_bias = use_bias - - # Initialize weights with small random values using Parameter - # Shape: (input_size, output_size) for matrix multiplication - # - # Weight initialization explanation: - # - Use small random values (scaled by 0.1) to keep early activations and gradients in a stable range - # - Small initial values help networks train more stably in deep architectures - # - In production systems, Xavier or Kaiming initialization would be used - # - The 0.1 scaling factor is a simple but effective approach for basic networks - weight_data = np.random.randn(input_size, output_size) * 0.1 - self.weights = Parameter(weight_data) # Auto-registers for optimization! - - # Initialize bias if requested - if use_bias: - # Bias also uses small random initialization (could be zeros, but small random works well) - bias_data = np.random.randn(output_size) * 0.1 - self.bias = Parameter(bias_data) # Auto-registers for optimization! - else: - self.bias = None - ### END SOLUTION - - def forward(self, x): - """ - Forward pass through the Linear layer.
- - Args: - x: Input tensor (shape: ..., input_size) - - Returns: - Output tensor (shape: ..., output_size) - - COMMON PITFALL: Make sure input tensor has shape (..., input_size) - If you get shape mismatch errors, check that your input's last dimension - matches the layer's input_size parameter. - - TODO: Implement the linear transformation: output = input @ weights + bias - - STEP-BY-STEP IMPLEMENTATION: - 1. Extract data from input tensor using x.data - 2. Get weight and bias data using self.weights.data and self.bias.data - 3. Perform matrix multiplication: np.dot(x.data, weights.data) - 4. Add bias if it exists: result + bias.data - 5. Return new Tensor with result - - LEARNING CONNECTIONS: - - This is the core neural network operation: y = Wx + b - - Matrix multiplication handles batch processing automatically - - Each row in input produces one row in output - - This is pure linear algebra - no autograd complexity yet - - IMPLEMENTATION HINTS: - - Use np.dot() for matrix multiplication - - Handle the case where bias is None - - Always return a new Tensor object - - Focus on the mathematical operation, not gradient tracking - """ - ### BEGIN SOLUTION - # Extract data from input tensor - x_data = x.data - weights_data = self.weights.data - - # Matrix multiplication: input @ weights - output_data = np.dot(x_data, weights_data) - - # Add bias if it exists - if self.bias is not None: - bias_data = self.bias.data - output_data = output_data + bias_data - - # Return new Tensor with result - return Tensor(output_data) - ### END SOLUTION - -# In[ ]: - -# 🧪 Unit Test: Linear Layer -def test_unit_linear(): - """Test Linear layer implementation.""" - print("🧪 Testing Linear Layer...") - - # Test case 1: Basic functionality - layer = Linear(input_size=3, output_size=2) - input_tensor = Tensor([[1.0, 2.0, 3.0]]) # Shape: (1, 3) - output = layer.forward(input_tensor) - - # Check output shape - assert output.shape == (1, 2), f"Expected shape (1, 2), got {output.shape}" - 
print("✅ Output shape correct") - - # Test case 2: No bias - layer_no_bias = Linear(input_size=2, output_size=3, use_bias=False) - assert layer_no_bias.bias is None, "Bias should be None when use_bias=False" - print("✅ No bias option works") - - # Test case 3: Multiple samples (batch processing) - batch_input = Tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]) # Shape: (3, 2) - layer_batch = Linear(input_size=2, output_size=2) - batch_output = layer_batch.forward(batch_input) - - assert batch_output.shape == (3, 2), f"Expected shape (3, 2), got {batch_output.shape}" - print("✅ Batch processing works") - - # Test case 4: Callable interface - callable_output = layer_batch(batch_input) - assert np.allclose(callable_output.data, batch_output.data), "Callable interface should match forward()" - print("✅ Callable interface works") - - # Test case 5: Parameter initialization - layer_init = Linear(input_size=10, output_size=5) - assert layer_init.weights.shape == (10, 5), f"Expected weights shape (10, 5), got {layer_init.weights.shape}" - assert layer_init.bias.shape == (5,), f"Expected bias shape (5,), got {layer_init.bias.shape}" - - # Check that weights are reasonably small (good initialization) - assert np.abs(layer_init.weights.data).mean() < 1.0, "Weights should be small for good initialization" - print("✅ Parameter initialization correct") - - print("🎉 All Linear layer tests passed!") - -test_unit_linear() - -# In[ ]: - -# 🧪 Unit Test: Parameter Management -def test_unit_parameter_management(): - """Test Linear layer parameter management and module composition.""" - print("🧪 Testing Parameter Management...") - - # Test case 1: Parameter registration - layer = Linear(input_size=3, output_size=2) - params = layer.parameters() - - assert len(params) == 2, f"Expected 2 parameters (weights + bias), got {len(params)}" - assert layer.weights in params, "Weights should be in parameters list" - assert layer.bias in params, "Bias should be in parameters list" - print("✅ Parameter 
registration works") - - # Test case 2: Module composition - class SimpleNetwork(Module): - def __init__(self): - super().__init__() - self.layer1 = Linear(4, 3) - self.layer2 = Linear(3, 2) - - def forward(self, x): - x = self.layer1(x) - return self.layer2(x) - - network = SimpleNetwork() - all_params = network.parameters() - - # Should have 4 parameters: 2 from each layer (weights + bias) - assert len(all_params) == 4, f"Expected 4 parameters from network, got {len(all_params)}" - print("✅ Module composition and parameter collection works") - - # Test case 3: Forward pass through composed network - input_tensor = Tensor([[1.0, 2.0, 3.0, 4.0]]) - output = network(input_tensor) - - assert output.shape == (1, 2), f"Expected output shape (1, 2), got {output.shape}" - print("✅ Network forward pass works") - - # Test case 4: No bias option - layer_no_bias = Linear(input_size=3, output_size=2, use_bias=False) - params_no_bias = layer_no_bias.parameters() - - assert len(params_no_bias) == 1, f"Expected 1 parameter (weights only), got {len(params_no_bias)}" - assert layer_no_bias.bias is None, "Bias should be None when use_bias=False" - print("✅ No bias option works") - - print("🎉 All parameter management tests passed!") - -test_unit_parameter_management() - -# In[ ]: - -# ✅ IMPLEMENTATION CHECKPOINT: Linear layer complete - -# 🤔 PREDICTION: How does memory usage scale with network depth vs width? 
-# Deeper network (more layers): _______ -# Wider network (more neurons per layer): _______ - -# 🔍 SYSTEMS INSIGHT #3: Architecture Memory Analysis -def analyze_architecture_scaling(): - """Compare memory usage of deep vs wide networks.""" - try: - print("📊 Architecture Scaling: Deep vs Wide Networks") - print("=" * 50) - - # Compare a deep narrow network with a wide shallow one (same input/output sizes) - print("\nDeep Network (7 layers, narrow):") - deep_layers = [128, 64, 64, 64, 64, 64, 64, 10] - deep_params = 0 - deep_memory = 0 - - for i in range(len(deep_layers) - 1): - layer_params = deep_layers[i] * deep_layers[i+1] + deep_layers[i+1] - deep_params += layer_params - layer_memory = layer_params * 4 / (1024 * 1024) # MB - deep_memory += layer_memory - print(f" Layer {i+1}: {deep_layers[i]:3d} → {deep_layers[i+1]:3d} = {layer_params:,} params") - - print(f" Total: {deep_params:,} params, {deep_memory:.2f} MB") - - print("\nWide Network (3 layers, wide):") - wide_layers = [128, 256, 256, 10] - wide_params = 0 - wide_memory = 0 - - for i in range(len(wide_layers) - 1): - layer_params = wide_layers[i] * wide_layers[i+1] + wide_layers[i+1] - wide_params += layer_params - layer_memory = layer_params * 4 / (1024 * 1024) # MB - wide_memory += layer_memory - print(f" Layer {i+1}: {wide_layers[i]:3d} → {wide_layers[i+1]:3d} = {layer_params:,} params") - - print(f" Total: {wide_params:,} params, {wide_memory:.2f} MB") - - print(f"\n💡 Architecture Insights:") - print(f" • Deep network: {len(deep_layers)-1} layers, {deep_params:,} params") - print(f" • Wide network: {len(wide_layers)-1} layers, {wide_params:,} params") - print(f" • Memory ratio: {wide_memory/deep_memory:.1f}x (wide uses more)") - print(f" • Deep networks: better feature hierarchies") - print(f" • Wide networks: more parallel computation") - print(f" • Modern trend: Balance depth + width for best performance") - - except Exception as e: - print(f"⚠️ Error in architecture analysis: {e}") - -# Run the analysis
-analyze_architecture_scaling() - -# In[ ]: - -# ## Part 4: Sequential Network Composition - -#| export -class Sequential(Module): - """ - Sequential Network: Composes layers in sequence. - - The most fundamental network architecture that applies layers in order: - f(x) = layer_n(...layer_2(layer_1(x))) - - Inherits from Module for automatic parameter collection from all sub-layers. - This enables optimizers to find all parameters automatically. - - Example Usage: - # Create a 3-layer MLP - model = Sequential([ - Linear(784, 128), - ReLU(), - Linear(128, 64), - ReLU(), - Linear(64, 10) - ]) - - # Use the model - output = model(input_data) # Clean interface! - params = model.parameters() # All parameters from all layers! - """ - - def __init__(self, layers=None): - """ - Initialize Sequential network with layers. - - Args: - layers: List of layers to compose in order (optional) - """ - super().__init__() # Initialize Module base class - self.layers = layers if layers is not None else [] - - # Register all layers as sub-modules for parameter collection - for i, layer in enumerate(self.layers): - # This automatically adds each layer to self._modules - setattr(self, f'layer_{i}', layer) - - def forward(self, x): - """ - Forward pass through all layers in sequence. 
- - Args: - x: Input tensor - - Returns: - Output tensor after passing through all layers - """ - for layer in self.layers: - x = layer(x) - return x - - def add(self, layer): - """Add a layer to the network.""" - self.layers.append(layer) - # Register the new layer for parameter collection - setattr(self, f'layer_{len(self.layers)-1}', layer) - -# In[ ]: - -# 🧪 Unit Test: Sequential Networks -def test_unit_sequential(): - """Test Sequential network implementation.""" - print("🧪 Testing Sequential Network...") - - # Test case 1: Create empty network - empty_net = Sequential() - assert len(empty_net.layers) == 0, "Empty Sequential should have no layers" - print("✅ Empty Sequential network creation") - - # Test case 2: Create network with layers - layers = [Linear(3, 4), Linear(4, 2)] - network = Sequential(layers) - assert len(network.layers) == 2, "Network should have 2 layers" - print("✅ Sequential network with layers") - - # Test case 3: Forward pass through network - input_tensor = Tensor([[1.0, 2.0, 3.0]]) - output = network(input_tensor) - assert output.shape == (1, 2), f"Expected output shape (1, 2), got {output.shape}" - print("✅ Forward pass through Sequential network") - - # Test case 4: Parameter collection from all layers - all_params = network.parameters() - # Should have 4 parameters: 2 weights + 2 biases from 2 Linear layers - assert len(all_params) == 4, f"Expected 4 parameters from Sequential network, got {len(all_params)}" - print("✅ Parameter collection from all layers") - - # Test case 5: Adding layers dynamically - network.add(Linear(2, 1)) - assert len(network.layers) == 3, "Network should have 3 layers after adding one" - - # Test forward pass after adding layer - final_output = network(input_tensor) - assert final_output.shape == (1, 1), f"Expected final output shape (1, 1), got {final_output.shape}" - print("✅ Dynamic layer addition") - - print("🎉 All Sequential network tests passed!") - -test_unit_sequential() - -# In[ ]: - -# ## Part 5: 
Flatten Operation - Connecting Different Layer Types - -#| export -def flatten(x, start_dim=1): - """ - Flatten tensor starting from a given dimension. - - This is essential for transitioning from convolutional layers - (which output 4D tensors) to linear layers (which expect 2D). - - Args: - x: Input tensor (Tensor or any array-like) - start_dim: Dimension to start flattening from (default: 1 to preserve batch) - - Returns: - Flattened tensor preserving batch dimension - - Examples: - # Flatten CNN output for Linear layer - conv_output = Tensor(np.random.randn(32, 64, 8, 8)) # (batch, channels, height, width) - flat = flatten(conv_output) # (32, 4096) - ready for Linear layer! - - # Flatten image for MLP - images = Tensor(np.random.randn(32, 3, 28, 28)) # CIFAR-10 batch - flat = flatten(images) # (32, 2352) - ready for MLP! - """ - # Get the data (handle both Tensor and numpy arrays) - if hasattr(x, 'data'): - data = x.data - else: - data = x - - # Calculate new shape: keep every dimension before start_dim, collapse the rest - # (handles start_dim=0 as a full flatten and start_dim>=2 without dropping dimensions) - remaining_size = int(np.prod(data.shape[start_dim:])) - new_shape = data.shape[:start_dim] + (remaining_size,) - - # Reshape while preserving the original tensor type - if hasattr(x, 'data'): - # It's a Tensor - create a new Tensor with flattened data - flattened_data = data.reshape(new_shape) - # Use type(x) to preserve the exact Tensor type (Parameter vs regular Tensor) - # This ensures that if input was a Parameter, output is also a Parameter - return type(x)(flattened_data) - else: - # It's a numpy array - just reshape and return - return data.reshape(new_shape) - -#| export -class Flatten(Module): - """ - Flatten layer that reshapes tensors from multi-dimensional to 2D. - - Essential for connecting convolutional layers (which output 4D tensors) - to linear layers (which expect 2D tensors). Preserves the batch dimension.
- - Example Usage: - # In a CNN architecture - model = Sequential([ - Conv2D(3, 16, kernel_size=3), # Output: (batch, 16, height, width) - ReLU(), - Flatten(), # Output: (batch, 16*height*width) - Linear(16*height*width, 10) # Now compatible! - ]) - """ - - def __init__(self, start_dim=1): - """ - Initialize Flatten layer. - - Args: - start_dim: Dimension to start flattening from (default: 1 to preserve batch) - """ - super().__init__() - self.start_dim = start_dim - - def forward(self, x): - """ - Flatten tensor starting from start_dim. - - Args: - x: Input tensor - - Returns: - Flattened tensor with batch dimension preserved - """ - return flatten(x, start_dim=self.start_dim) - -# In[ ]: - -# 🧪 Unit Test: Flatten Operations -def test_unit_flatten(): - """Test Flatten layer and function implementation.""" - print("🧪 Testing Flatten Operations...") - - # Test case 1: Flatten function with 2D tensor - x_2d = Tensor([[1, 2], [3, 4]]) - flattened_func = flatten(x_2d) - assert flattened_func.shape == (2, 2), f"Expected shape (2, 2), got {flattened_func.shape}" - print("✅ Flatten function with 2D tensor") - - # Test case 2: Flatten function with 4D tensor (simulating CNN output) - x_4d = Tensor(np.random.randn(2, 3, 4, 4)) # (batch, channels, height, width) - flattened_4d = flatten(x_4d) - assert flattened_4d.shape == (2, 48), f"Expected shape (2, 48), got {flattened_4d.shape}" # 3*4*4 = 48 - print("✅ Flatten function with 4D tensor") - - # Test case 3: Flatten layer class - flatten_layer = Flatten() - layer_output = flatten_layer(x_4d) - assert layer_output.shape == (2, 48), f"Expected shape (2, 48), got {layer_output.shape}" - assert np.allclose(layer_output.data, flattened_4d.data), "Flatten layer should match flatten function" - print("✅ Flatten layer class") - - # Test case 4: Different start dimensions - flatten_from_0 = Flatten(start_dim=0) - full_flat = flatten_from_0(x_2d) - assert len(full_flat.shape) <= 2, "Flattening from dim 0 should create vector" - 
print("✅ Different start dimensions") - - # Test case 5: Integration with Sequential - network = Sequential([ - Linear(8, 4), - Flatten() - ]) - test_input = Tensor(np.random.randn(2, 8)) - output = network(test_input) - assert output.shape == (2, 4), f"Expected shape (2, 4), got {output.shape}" - print("✅ Flatten integration with Sequential") - - print("🎉 All Flatten operations tests passed!") - -test_unit_flatten() - -# In[ ]: - -# ## NBGrader Assessment Questions - -# ⭐ QUESTION 1: Parameter Counting Challenge -""" -You're building a Multi-Layer Perceptron (MLP) for MNIST digit classification. - -Network architecture: -- Input: 784 features (28×28 pixel images, flattened) -- Hidden layer 1: 256 neurons with ReLU activation -- Hidden layer 2: 128 neurons with ReLU activation -- Output layer: 10 neurons (one per digit class) - -Calculate the total number of trainable parameters in this network. - -Show your work: -- Layer 1 parameters: _____ -- Layer 2 parameters: _____ -- Layer 3 parameters: _____ -- Total parameters: _____ - -Hint: Remember that each Linear layer has both weights and biases! -""" - -# ### BEGIN SOLUTION -# Layer 1: Linear(784, 256) -# - Weights: 784 × 256 = 200,704 -# - Biases: 256 -# - Subtotal: 200,960 - -# Layer 2: Linear(256, 128) -# - Weights: 256 × 128 = 32,768 -# - Biases: 128 -# - Subtotal: 32,896 - -# Layer 3: Linear(128, 10) -# - Weights: 128 × 10 = 1,280 -# - Biases: 10 -# - Subtotal: 1,290 - -# Total: 200,960 + 32,896 + 1,290 = 235,146 parameters -# ### END SOLUTION - -# ⭐ QUESTION 2: Memory Analysis Challenge -""" -Compare the memory requirements of two different MLP architectures for the same task: - -Architecture A (Wide): 784 → 512 → 512 → 10 -Architecture B (Deep): 784 → 128 → 128 → 128 → 128 → 10 - -For each architecture, calculate: -1. Total number of parameters -2. Memory usage for parameters (assume float32 = 4 bytes per parameter) -3. Which architecture would you choose for a mobile device with limited memory? 
- -Architecture A calculations: -- Total parameters: _____ -- Memory usage: _____ MB - -Architecture B calculations: -- Total parameters: _____ -- Memory usage: _____ MB - -Mobile device choice and reasoning: _____ -""" - -# ### BEGIN SOLUTION -# Architecture A (Wide): 784 → 512 → 512 → 10 -# - Layer 1: (784 × 512) + 512 = 401,920 -# - Layer 2: (512 × 512) + 512 = 262,656 -# - Layer 3: (512 × 10) + 10 = 5,130 -# - Total: 669,706 parameters -# - Memory: 669,706 × 4 bytes = 2.68 MB - -# Architecture B (Deep): 784 → 128 → 128 → 128 → 128 → 10 -# - Layer 1: (784 × 128) + 128 = 100,480 -# - Layer 2: (128 × 128) + 128 = 16,512 -# - Layer 3: (128 × 128) + 128 = 16,512 -# - Layer 4: (128 × 128) + 128 = 16,512 -# - Layer 5: (128 × 10) + 10 = 1,290 -# - Total: 151,306 parameters -# - Memory: 151,306 × 4 bytes = 0.61 MB - -# Mobile choice: Architecture B (Deep) -# Reasoning: Uses 4.4x less memory while maintaining similar representational capacity through depth -# ### END SOLUTION - -# ⭐ QUESTION 3: FLOPs Calculation Challenge -""" -Calculate the computational cost (in FLOPs) for a forward pass through this network: - -Input batch: 32 samples × 784 features -Network: 784 → 256 → 128 → 10 - -For each layer, calculate: -- Matrix multiplication FLOPs: 2 × batch_size × input_size × output_size -- Bias addition FLOPs: batch_size × output_size -- Total FLOPs per layer - -Layer 1 (784 → 256): -- MatMul FLOPs: _____ -- Bias FLOPs: _____ -- Layer total: _____ - -Layer 2 (256 → 128): -- MatMul FLOPs: _____ -- Bias FLOPs: _____ -- Layer total: _____ - -Layer 3 (128 → 10): -- MatMul FLOPs: _____ -- Bias FLOPs: _____ -- Layer total: _____ - -Network total FLOPs: _____ -""" - -# ### BEGIN SOLUTION -# Batch size = 32 samples - -# Layer 1 (784 → 256): -# - MatMul FLOPs: 2 × 32 × 784 × 256 = 12,845,056 -# - Bias FLOPs: 32 × 256 = 8,192 -# - Layer total: 12,853,248 - -# Layer 2 (256 → 128): -# - MatMul FLOPs: 2 × 32 × 256 × 128 = 2,097,152 -# - Bias FLOPs: 32 × 128 = 4,096 -# - Layer total: 2,101,248 - -# Layer 3 (128 → 10): -# - MatMul FLOPs: 2 × 32 × 128 × 10 = 81,920 -# - Bias FLOPs: 32 × 10 = 320 -# - Layer total: 82,240 - -# Network total: 12,853,248 + 2,101,248 + 82,240 = 15,036,736 FLOPs (~15.0 MFLOPs) -# ### END SOLUTION - -# In[ ]: - -# ## Complete Neural Network Demo - -def demonstrate_complete_networks(): - """Demonstrate complete neural networks using all implemented components.""" - print("🔥 Complete Neural Network Demo") - print("=" * 50) - - print("\n1. MLP for Classification (MNIST-style):") - # Multi-layer perceptron for image classification - mlp = Sequential([ - Flatten(), # Flatten input images - Linear(784, 256), # First hidden layer - Linear(256, 128), # Second hidden layer - Linear(128, 10) # Output layer (10 classes) - ]) - - # Test with batch of "images" - batch_images = Tensor(np.random.randn(32, 28, 28)) # 32 MNIST-like images - mlp_output = mlp(batch_images) - print(f" Input: {batch_images.shape} (batch of 28x28 images)") - print(f" Output: {mlp_output.shape} (class logits for 32 images)") - print(f" Parameters: {len(mlp.parameters())} tensors") - - print("\n2. CNN-style Architecture (with Flatten):") - # Simulate CNN → Flatten → Dense pattern - cnn_style = Sequential([ - # Simulate Conv2D output with random "features" - Flatten(), # Flatten spatial features - Linear(512, 256), # Dense layer after convolution - Linear(256, 10) # Classification head - ]) - - # Test with simulated conv output - conv_features = Tensor(np.random.randn(16, 8, 8, 8)) # Simulated (B,C,H,W) - cnn_output = cnn_style(conv_features) - print(f" Input: {conv_features.shape} (simulated conv features)") - print(f" Output: {cnn_output.shape} (class predictions)") - - print("\n3.
Deep Network with Many Layers:") - # Demonstrate deep composition - deep_net = Sequential() - layer_sizes = [100, 80, 60, 40, 20, 10] - - for i in range(len(layer_sizes) - 1): - deep_net.add(Linear(layer_sizes[i], layer_sizes[i+1])) - print(f" Added layer: {layer_sizes[i]} → {layer_sizes[i+1]}") - - # Test deep network - deep_input = Tensor(np.random.randn(8, 100)) - deep_output = deep_net(deep_input) - print(f" Deep network: {deep_input.shape} → {deep_output.shape}") - print(f" Total parameters: {len(deep_net.parameters())} tensors") - - print("\n4. Parameter Management Across Networks:") - networks = {'MLP': mlp, 'CNN-style': cnn_style, 'Deep': deep_net} - - for name, net in networks.items(): - params = net.parameters() - total_params = sum(p.data.size for p in params) - memory_mb = total_params * 4 / (1024 * 1024) # float32 = 4 bytes - print(f" {name}: {len(params)} param tensors, {total_params:,} total params, {memory_mb:.2f} MB") - - print("\n🎉 All components work together seamlessly!") - print(" • Module system enables automatic parameter collection") - print(" • Linear layers handle matrix transformations") - print(" • Sequential composes layers into complete architectures") - print(" • Flatten connects different layer types") - print(" • Everything integrates for production-ready neural networks!") - -demonstrate_complete_networks() - -# In[ ]: - -# ## Testing Framework - -def test_unit_all(): - """Run complete module validation.""" - print("🧪 Running all unit tests...") - - # Call every individual test function - test_unit_matmul() - test_unit_linear() - test_unit_parameter_management() - test_unit_sequential() - test_unit_flatten() - - print("✅ All tests passed! 
Module ready for integration.") - -# In[ ]: - -if __name__ == "__main__": - print("🔥 TinyTorch Layers Module - Complete Foundation Demo") - print("=" * 60) - - # Test all core components - print("\n🧪 Testing All Core Components:") - test_unit_all() - - print("\n" + "="*60) - demonstrate_complete_networks() - - print("\n🎉 Complete neural network foundation ready!") - print(" ✅ Module system for parameter management") - print(" ✅ Linear layers for transformations") - print(" ✅ Sequential networks for composition") - print(" ✅ Flatten operations for tensor reshaping") - print(" ✅ All components tested and integrated!") - -# ## 🤔 ML Systems Thinking: Interactive Questions - -# Now that you've implemented all the core neural network components, let's think about their implications for ML systems: - -# ⭐ QUESTION: Memory vs Computation Trade-offs -""" -🤔 **Question 1: Memory vs Computation Analysis** - -You're designing a neural network for deployment on a mobile device with limited memory (1GB RAM) but decent compute power. - -You have two architecture options: -A) Wide network: 784 → 2048 → 2048 → 10 (3 layers, wide) -B) Deep network: 784 → 256 → 256 → 256 → 256 → 10 (5 layers, narrow) - -Calculate the memory requirements for each option and explain which you'd choose for mobile deployment and why. - -Consider: -- Parameter storage requirements -- Intermediate activation storage during forward pass -- Training vs inference memory requirements -- How your choice affects model capacity and accuracy -""" - -# ⭐ QUESTION: Performance Optimization -""" -🤔 **Question 2: Production Performance Optimization** - -Your Linear layer implementation works correctly, but you notice it's slower than PyTorch's nn.Linear on the same hardware. - -Investigate and explain: -1. Why might our implementation be slower? (Hint: think about underlying linear algebra libraries) -2. What optimization techniques do production frameworks use? -3. 
How would you modify our implementation to approach production performance? -4. When might our simple implementation actually be preferable? - -Research areas to consider: -- BLAS (Basic Linear Algebra Subprograms) libraries -- Memory layout and cache efficiency -- Vectorization and SIMD instructions -- GPU kernel optimization -""" - -# ⭐ QUESTION: Scaling and Architecture Design -""" -🤔 **Question 3: Systems Architecture Scaling** - -Modern transformer models like GPT-3 have billions of parameters, primarily in Linear layers. - -Analyze the scaling challenges: -1. How does memory requirement scale with model size? Calculate the memory needed for a 175B parameter model. -2. What are the computational bottlenecks during training vs inference? -3. How do systems like distributed training address these scaling challenges? -4. Why do large models use techniques like gradient checkpointing and model parallelism? - -Systems considerations: -- Memory hierarchy (L1/L2/L3 cache, RAM, storage) -- Network bandwidth for distributed training -- GPU memory constraints and model sharding -- Inference optimization for production serving -""" - -# ## 🎯 MODULE SUMMARY: Layers - Complete Neural Network Foundation - -# ## 🎯 What You've Accomplished - -# You've successfully implemented the complete foundation for neural networks - all the essential components working together: - -# ### ✅ **Complete Core System** -# - **Module Base Class**: Parameter management and composition patterns for all neural network components -# - **Matrix Multiplication**: The computational primitive underlying all neural network operations -# - **Linear (Dense) Layers**: Complete implementation with proper parameter initialization and forward propagation -# - **Sequential Networks**: Clean composition system for building complete neural network architectures -# - **Flatten Operations**: Tensor reshaping to connect different layer types (essential for CNN→MLP transitions) - -# ### ✅ **Systems Understanding** 
-# - **Architectural Patterns**: How modular design enables everything from MLPs to complex deep networks -# - **Memory Analysis**: How layer composition affects memory usage and computational efficiency -# - **Performance Characteristics**: Understanding how tensor operations and layer composition affect performance -# - **Production Context**: Connection to real-world ML frameworks and their component organization - -# ### ✅ **ML Engineering Skills** -# - **Complete Parameter Management**: How neural networks automatically collect parameters from all components -# - **Network Composition**: Building complex architectures from simple, reusable components -# - **Tensor Operations**: Essential reshaping and transformation operations for different network types -# - **Clean Abstraction**: Professional software design patterns that scale to production systems - -# ## 🔗 **Connection to Production ML Systems** - -# Your unified implementation mirrors the complete component systems used in: -# - **PyTorch's nn.Module system**: Same parameter management and composition patterns -# - **PyTorch's nn.Sequential**: Identical architecture composition approach -# - **All major frameworks**: The same modular design principles that power TensorFlow, JAX, and others -# - **Production ML systems**: Clean abstractions that enable complex models while maintaining manageable code - -# ## 🚀 **What's Next** - -# With your complete layer foundation, you're ready to: -# - **Module 05 (Dense)**: Build complete dense networks for classification tasks -# - **Module 06 (Spatial)**: Add convolutional layers for computer vision -# - **Module 09 (Autograd)**: Enable automatic differentiation for learning -# - **Module 10 (Optimizers)**: Implement sophisticated optimization algorithms - -# ## 💡 **Key Systems Insights** - -# 1. **Modular composition is the key to scalable ML systems** - clean interfaces enable complex behaviors -# 2. 
**Parameter management must be automatic** - manual parameter tracking doesn't scale to deep networks -# 3. **Tensor operations like flattening are architectural requirements** - different layer types need different tensor shapes -# 4. **Clean abstractions enable innovation** - good foundational design supports unlimited architectural experimentation - -# You now understand how to build complete, production-ready neural network foundations that can scale to any architecture! \ No newline at end of file diff --git a/modules_old/03_layers/module.yaml b/modules_old/03_layers/module.yaml deleted file mode 100644 index d1d096b3..00000000 --- a/modules_old/03_layers/module.yaml +++ /dev/null @@ -1,21 +0,0 @@ -components: -- Dense -- Linear -- matmul -dependencies: - enables: - - networks - - training - prerequisites: - - tensor - - activations -description: Neural network layers (Linear, activation layers) -difficulty: "\u2B50\u2B50" -exports_to: tinytorch.core.layers -files: - dev_file: layers_dev.py - readme: README.md - tests: inline -name: layers -time_estimate: 4-5 hours -title: Layers diff --git a/modules_old/04_losses/README.md b/modules_old/04_losses/README.md deleted file mode 100644 index 5e4c983b..00000000 --- a/modules_old/04_losses/README.md +++ /dev/null @@ -1,149 +0,0 @@ -# Module 05: Loss Functions - Learning Objectives for Neural Networks - -**Essential loss functions that define learning objectives and enable neural networks to learn from data through gradient-based optimization.** - -## 🎯 Learning Objectives - -By the end of this module, you will understand: - -- **Mathematical Foundation**: How loss functions translate learning problems into optimization objectives -- **Numerical Stability**: Why proper implementation prevents catastrophic training failures in production -- **Problem Matching**: When to use each loss function based on problem structure and data characteristics -- **Production Integration**: How loss functions integrate with neural 
network training pipelines - -## 🏗️ What You'll Build - -### Core Loss Functions -- **MeanSquaredError**: Regression loss for continuous value prediction -- **CrossEntropyLoss**: Multi-class classification with numerically stable softmax -- **BinaryCrossEntropyLoss**: Optimized binary classification loss - -### Key Features -- ✅ Numerically stable implementations that handle edge cases -- ✅ Efficient batch processing for scalable training -- ✅ Clean interfaces that integrate with neural networks -- ✅ Comprehensive testing with real-world scenarios - -## 🚀 Quick Start - -```python -from tinytorch.core.losses import MeanSquaredError, CrossEntropyLoss, BinaryCrossEntropyLoss - -# Regression: Predicting house prices -mse = MeanSquaredError() -regression_loss = mse(predicted_prices, actual_prices) - -# Multi-class classification: Image recognition -ce_loss = CrossEntropyLoss() -classification_loss = ce_loss(model_logits, class_indices) - -# Binary classification: Spam detection -bce_loss = BinaryCrossEntropyLoss() -binary_loss = bce_loss(spam_logits, spam_labels) -``` - -## 📚 Usage Examples - -### When to Use Each Loss Function - -**Mean Squared Error (MSE)** -- **Best for**: Regression problems (house prices, temperatures, ages) -- **Output**: Any real number -- **Activation**: Linear (no activation) - -**Cross-Entropy Loss** -- **Best for**: Multi-class classification (image classification, text categorization) -- **Output**: Class probabilities (sum to 1) -- **Activation**: Softmax - -**Binary Cross-Entropy Loss** -- **Best for**: Binary classification (spam detection, medical diagnosis) -- **Output**: Single probability (0 to 1) -- **Activation**: Sigmoid - -## 🧪 Testing Your Implementation - -Run the module to test all loss functions: - -```bash -# Test implementations -python modules/05_losses/losses_dev.py - -# Export to package -tito module complete 05_losses -``` - -Expected output: -``` -🧪 Testing Mean Squared Error Loss... 
-✅ Perfect predictions test passed -✅ All MSE loss tests passed! - -🧪 Testing Cross-Entropy Loss... -✅ Perfect predictions test passed -✅ All Cross-Entropy loss tests passed! - -🎉 Complete loss function foundation ready! -``` - -## 🔗 Integration Examples - -### Training Loop Integration -```python -from tinytorch.core.layers import Sequential, Linear -from tinytorch.core.activations import ReLU, Softmax -from tinytorch.core.losses import CrossEntropyLoss - -# Build classifier -model = Sequential([ - Linear(784, 128), ReLU(), - Linear(128, 10), Softmax() -]) - -# Set up training -loss_fn = CrossEntropyLoss() - -# Training step -predictions = model(batch_inputs) -loss = loss_fn(predictions, batch_targets) -# loss.backward() # Triggers gradient computation (with autograd) -``` - -## 🎯 Module Structure - -``` -05_losses/ -├── losses_dev.py # Main implementation -├── README.md # This file -└── module.yaml # Module configuration -``` - -## 🔬 Key Implementation Details - -### Numerical Stability Features -- **Cross-Entropy**: Uses log-sum-exp trick and probability clipping -- **Binary Cross-Entropy**: Stable logits formulation prevents overflow -- **All Losses**: Robust handling of edge cases and extreme values - -### Performance Optimizations -- Efficient batch processing across multiple samples -- Vectorized operations using NumPy -- Memory-efficient computation for large datasets - -## 🚀 What's Next - -With loss functions implemented, you're ready for: -- **Training Loops**: Complete end-to-end neural network training -- **Optimizers**: Gradient-based parameter updates -- **Advanced Training**: Monitoring, checkpointing, and convergence analysis - -## 💡 Key Insights - -1. **Loss functions are the interface between business objectives and mathematical optimization** -2. **Numerical stability is critical for reliable production training** -3. **Different problem types require different loss functions for optimal performance** -4. 
**Proper batch processing enables scalable training on large datasets** - ---- - -**Next Module**: Training Infrastructure - Build complete training loops that bring all components together! \ No newline at end of file diff --git a/modules_old/04_losses/losses_dev.ipynb b/modules_old/04_losses/losses_dev.ipynb deleted file mode 100644 index 8f7ab4fe..00000000 --- a/modules_old/04_losses/losses_dev.ipynb +++ /dev/null @@ -1,2532 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "54a999b1", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "# Loss Functions - Learning Objectives Made Mathematical\n", - "\n", - "Welcome to Loss Functions! You'll implement the critical bridge between model predictions and learning objectives that makes neural network training possible.\n", - "\n", - "## 🔗 Building on Previous Learning\n", - "**What You Built Before**:\n", - "- Module 02 (Tensor): Data structures for predictions and targets\n", - "- Module 03 (Activations): Nonlinear transformations for model outputs\n", - "- Module 04 (Layers): Complete neural network layers that produce predictions\n", - "\n", - "**What's Working**: You can build networks that transform inputs into predictions!\n", - "\n", - "**The Gap**: Predictions aren't learning objectives - you need to measure how \"wrong\" predictions are and provide gradient signals for improvement.\n", - "\n", - "**This Module's Solution**: Implement MSE, CrossEntropy, and BinaryCrossEntropy loss functions with numerical stability.\n", - "\n", - "**Connection Map**:\n", - "```\n", - "Layers → Loss Functions → Gradients\n", - "(predictions) (objectives) (learning signals)\n", - "```\n", - "\n", - "## Learning Goals (Systems-Focused)\n", - "- **Systems understanding**: How loss functions translate business problems into optimization objectives with proper numerical stability\n", - "- **Core implementation skill**: Build production-quality loss functions with stable computation and efficient batch 
processing\n", - "- **Pattern mastery**: Understand how different loss functions shape learning dynamics and convergence behavior\n", - "- **Framework connections**: See how your implementations mirror PyTorch's loss functions and autograd integration patterns\n", - "- **Optimization trade-offs**: Learn why numerical stability and computational efficiency matter for reliable training at scale\n", - "\n", - "## Build → Use → Reflect\n", - "1. **Build**: Complete loss function implementations with numerical stability and gradient support\n", - "2. **Use**: Apply loss functions to regression and classification problems with real neural networks\n", - "3. **Reflect**: Why do different loss functions lead to different learning behaviors, and when does numerical stability matter?\n", - "\n", - "## What You'll Achieve\n", - "By implementing loss functions from scratch, you'll understand:\n", - "- Deep technical understanding of how loss functions quantify prediction quality and enable learning\n", - "- Practical capability to implement numerically stable loss computation for production ML systems\n", - "- Systems insight into computational complexity, memory requirements, and batch processing efficiency\n", - "- Performance awareness of how loss function choice affects training speed and convergence characteristics\n", - "- Production knowledge of how frameworks implement robust loss computation with proper error handling\n", - "\n", - "## Systems Reality Check\n", - "💡 **Production Context**: PyTorch's loss functions use numerically stable implementations and automatic mixed precision to handle extreme gradients and values\n", - "⚡ **Performance Insight**: Numerically unstable loss functions can cause training to fail catastrophically - proper implementation is critical for reliable ML systems" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bfe05289", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "losses-imports", - "locked": 
false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "#| default_exp core.losses\n", - "\n", - "#| export\n", - "import numpy as np\n", - "import sys\n", - "import os\n", - "\n", - "# Import our building blocks - try package first, then local modules\n", - "try:\n", - " from tinytorch.core.tensor import Tensor\n", - " # Note: For now, we'll use simplified implementations without full autograd\n", - " # In a complete system, these would integrate with the autograd Variable system\n", - "except ImportError:\n", - " # For development, import from local modules\n", - " sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor'))\n", - " from tensor_dev import Tensor" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c0f986fc", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "losses-setup", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "print(\"🔥 TinyTorch Loss Functions Module\")\n", - "print(f\"NumPy version: {np.__version__}\")\n", - "print(f\"Python version: {sys.version_info.major}.{sys.version_info.minor}\")\n", - "print(\"Ready to build loss functions for neural network training!\")" - ] - }, - { - "cell_type": "markdown", - "id": "899f0152", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## Where This Code Lives in the Final Package\n", - "\n", - "**Learning Side:** You work in modules/04_losses/losses_dev.py \n", - "**Building Side:** Code exports to tinytorch.core.losses\n", - "\n", - "```python\n", - "# Final package structure:\n", - "from tinytorch.core.losses import MeanSquaredError, CrossEntropyLoss, BinaryCrossEntropyLoss # All loss functions!\n", - "from tinytorch.core.tensor import Tensor # The foundation\n", - "from tinytorch.core.layers import Linear, Sequential # Network components\n", - "```\n", - "\n", - "**Why this matters:**\n", - "- **Learning:** 
Focused module for understanding loss functions and training objectives\n", - "- **Production:** Proper organization like PyTorch's torch.nn with all loss functions together\n", - "- **Consistency:** All loss functions live together in core.losses for easy access\n", - "- **Integration:** Works seamlessly with tensors and neural networks for complete training systems" - ] - }, - { - "cell_type": "markdown", - "id": "409b9591", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "# Understanding Loss Functions in Neural Networks\n", - "\n", - "## What are Loss Functions?\n", - "\n", - "Loss functions are the mathematical bridge between what your model predicts and what you want it to learn. They quantify the \"distance\" between predictions and reality.\n", - "\n", - "```\n", - "Business Goal: \"Predict house prices accurately\"\n", - " ↓\n", - "Mathematical Loss: MSE = (predicted_price - actual_price)²\n", - " ↓ \n", - "Optimization Signal: gradient = 2 × (predicted - actual)\n", - " ↓\n", - "Learning Update: parameter -= learning_rate × gradient\n", - "```\n", - "\n", - "## The Learning Ecosystem\n", - "\n", - "Loss functions provide four critical capabilities:\n", - "\n", - "🎯 **Learning Objectives**: Define what \"good\" performance means mathematically \n", - "📈 **Gradient Signal**: Provide directional improvement information for parameters \n", - "🔍 **Progress Measurement**: Enable monitoring training progress and convergence detection \n", - "⚖️ **Trade-off Control**: Balance different aspects of model performance and regularization \n", - "\n", - "## Visual Understanding: Loss Function Landscape\n", - "\n", - "```\n", - "Loss Function Behavior:\n", - "   MSE Loss                  CrossEntropy Loss\n", - " High │╲                   High │╲\n", - "      │ ╲                       │ ╲\n", - "      │  ╲                      │  ╲___\n", - "      │   ╲__                   │      ╲____\n", - " Low  │      ╲______       Low  │           ╲\n", - "      └──────────────           └──────────────\n", - "       Wrong    Right            Wrong    Right\n", - " \n", - " • Smooth gradients • Steep near wrong predictions\n", - " • Quadratic 
penalty • Gentle near correct predictions\n", - " • Good for regression • Good for classification\n", - "```\n", - "\n", - "Different loss functions create different optimization landscapes that affect how your model learns!" - ] - }, - { - "cell_type": "markdown", - "id": "429bbae2", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "# Mean Squared Error - Foundation for Regression\n", - "\n", - "MSE is the cornerstone loss function for regression problems. It measures prediction quality by penalizing large errors more than small ones.\n", - "\n", - "## Visual Understanding: MSE Behavior\n", - "\n", - "```\n", - "MSE Loss Visualization:\n", - " \n", - " Loss │╲         ╱\n", - "  4   │ ╲       ╱   • Error = 2 → Loss = 4\n", - "  3   │  ╲     ╱    • Error = 1 → Loss = 1\n", - "  2   │   ╲   ╱     • Error = 0 → Loss = 0\n", - "  1   │    ╲ ╱      • Quadratic penalty!\n", - "  0   │_____╲╱_____\n", - "      -2 -1  0  1  2\n", - "         Error\n", - " \n", - "Gradient Flow:\n", - " ∂Loss/∂prediction = 2 × (predicted - actual)\n", - " \n", - " Large errors → Large gradients → Big updates\n", - " Small errors → Small gradients → Fine tuning\n", - "```\n", - "\n", - "## Mathematical Foundation\n", - "\n", - "For a batch of predictions and targets:\n", - "```\n", - "MSE = (1/n) × Σ(y_pred - y_true)²\n", - "\n", - "Gradient: ∂MSE/∂y_pred = (2/n) × (y_pred - y_true)\n", - "```\n", - "\n", - "## Learning Objectives\n", - "By implementing MSE, you'll understand:\n", - "- How regression loss functions translate continuous prediction errors into optimization signals\n", - "- Why squared error creates smooth, well-behaved gradients for stable optimization\n", - "- How batch processing enables efficient training on multiple samples simultaneously\n", - "- The connection between mathematical loss formulations and practical ML training dynamics" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "80f4f2d2", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": 
"mse-concept-question", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "\"\"\"\n", - "🤔 **Computational Question: MSE Properties**\n", - "\n", - "Before implementing, let's understand MSE behavior:\n", - "\n", - "1. If you predict house price as $300k but actual is $250k, what's the MSE?\n", - "2. If you predict $310k but actual is $250k, what's the MSE? \n", - "3. Which error gets penalized more heavily and why?\n", - "4. How does this relate to the quadratic penalty we visualized?\n", - "\n", - "This understanding will guide your implementation approach.\n", - "\"\"\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2533af31", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "mse-loss-implementation", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class MeanSquaredError:\n", - " \"\"\"\n", - " Mean Squared Error Loss for Regression Problems\n", - " \n", - " Computes the average squared difference between predictions and targets:\n", - " MSE = (1/n) × Σ(y_pred - y_true)²\n", - " \n", - " Features:\n", - " - Numerically stable computation\n", - " - Efficient batch processing\n", - " - Clean gradient properties for optimization\n", - " - Compatible with tensor operations\n", - " \n", - " Example Usage:\n", - " mse = MeanSquaredError()\n", - " loss = mse(predictions, targets) # Returns scalar loss value\n", - " \"\"\"\n", - " \n", - " def __init__(self):\n", - " \"\"\"Initialize MSE loss function.\"\"\"\n", - " pass\n", - " \n", - " def __call__(self, y_pred, y_true):\n", - " \"\"\"\n", - " Compute MSE loss between predictions and targets.\n", - " \n", - " Args:\n", - " y_pred: Model predictions (Tensor, shape: [batch_size, ...])\n", - " y_true: True targets (Tensor, shape: [batch_size, ...])\n", - " \n", - " Returns:\n", - " Tensor with 
scalar loss value\n", - " \n", - " TODO: Implement MSE computation with proper tensor handling.\n", - " \n", - " APPROACH:\n", - " 1. Convert inputs to tensors for consistent processing\n", - " 2. Compute element-wise prediction errors (differences)\n", - " 3. Square the errors to create quadratic penalty\n", - " 4. Take mean across all elements for final loss\n", - " \n", - " EXAMPLE:\n", - " >>> mse = MeanSquaredError()\n", - " >>> pred = Tensor([[1.0, 2.0]])\n", - " >>> true = Tensor([[1.5, 1.5]])\n", - " >>> loss = mse(pred, true)\n", - " >>> print(loss.data)\n", - " 0.25 # [(1.0-1.5)² + (2.0-1.5)²] / 2 = [0.25 + 0.25] / 2\n", - " \n", - " HINTS:\n", - " - Use np.mean() for efficient batch averaging\n", - " - Element-wise operations work naturally with tensor.data\n", - " - Return result wrapped in Tensor for consistent interface\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Step 1: Ensure we have tensor inputs for consistent processing\n", - " if not isinstance(y_pred, Tensor):\n", - " y_pred = Tensor(y_pred)\n", - " if not isinstance(y_true, Tensor):\n", - " y_true = Tensor(y_true)\n", - " \n", - " # Step 2: Compute mean squared error with element-wise operations\n", - " prediction_errors = y_pred.data - y_true.data # Element-wise difference\n", - " squared_errors = prediction_errors * prediction_errors # Element-wise squaring\n", - " mean_loss = np.mean(squared_errors) # Average across all elements\n", - " \n", - " return Tensor(mean_loss)\n", - " ### END SOLUTION\n", - " \n", - " def forward(self, y_pred, y_true):\n", - " \"\"\"Alternative interface for forward pass.\"\"\"\n", - " return self.__call__(y_pred, y_true)\n", - "\n", - "# 🔍 SYSTEMS INSIGHT: Gradient Landscape Visualization\n", - "def visualize_loss_landscapes():\n", - " \"\"\"Visualize how different loss functions create different optimization landscapes.\"\"\"\n", - " print(\"🔍 Loss Function Landscape Visualization\")\n", - " print(\"=\" * 45)\n", - "\n", - " try:\n", - " import numpy as 
np\n", - "\n", - " # Create prediction space for visualization\n", - " prediction_range = np.linspace(-3, 3, 100)\n", - " true_value = 0.0 # Target value\n", - "\n", - " print(\"\\n📈 Loss Landscape Comparison:\")\n", - " print(\" How loss changes as predictions move away from target\")\n", - "\n", - " # Calculate loss landscapes\n", - " mse = MeanSquaredError()\n", - " ce = CrossEntropyLoss()\n", - " bce = BinaryCrossEntropyLoss()\n", - "\n", - " # MSE landscape (regression)\n", - " mse_losses = []\n", - " for pred in prediction_range:\n", - " loss = mse(Tensor([pred]), Tensor([true_value]))\n", - " mse_losses.append(loss.data)\n", - "\n", - " # Binary CE landscape (classification)\n", - " bce_losses = []\n", - " for pred in prediction_range:\n", - " loss = bce(Tensor([pred]), Tensor([1.0])) # Target: positive class\n", - " bce_losses.append(loss.data)\n", - "\n", - " # Find key gradient characteristics\n", - " mse_gradient_at_zero = 2 * (0 - true_value) # MSE gradient formula\n", - " mse_gradient_at_one = 2 * (1 - true_value)\n", - "\n", - " print(f\"\\n🎯 Gradient Behavior Analysis:\")\n", - " print(f\" MSE gradient at prediction=0: {mse_gradient_at_zero:.3f}\")\n", - " print(f\" MSE gradient at prediction=1: {mse_gradient_at_one:.3f}\")\n", - " print(f\" MSE provides linear gradient growth\")\n", - "\n", - " # Binary CE gradient analysis\n", - " sigmoid_at_zero = 1 / (1 + np.exp(-0)) # = 0.5\n", - " bce_grad_at_zero = sigmoid_at_zero - 1.0 # = -0.5\n", - " sigmoid_at_one = 1 / (1 + np.exp(-1)) # ≈ 0.73\n", - " bce_grad_at_one = sigmoid_at_one - 1.0 # ≈ -0.27\n", - "\n", - " print(f\" BCE gradient at logit=0: {bce_grad_at_zero:.3f}\")\n", - " print(f\" BCE gradient at logit=1: {bce_grad_at_one:.3f}\")\n", - " print(f\" BCE provides adaptive gradient magnitude\")\n", - "\n", - " # Visualize ASCII loss curves\n", - " print(f\"\\n📊 Loss Function Shapes (ASCII visualization):\")\n", - " print(f\" Prediction range: {prediction_range[0]:.1f} to 
{prediction_range[-1]:.1f}\")\n", - "\n", - " # Sample key points for visualization\n", - " sample_points = [-2, -1, 0, 1, 2]\n", - " print(f\"\\n {'Prediction':>10} {'MSE Loss':>10} {'BCE Loss':>10} {'Gradient Type':>15}\")\n", - " print(f\" {'-'*10} {'-'*10} {'-'*10} {'-'*15}\")\n", - "\n", - " for point in sample_points:\n", - " mse_loss = mse(Tensor([point]), Tensor([0.0]))\n", - " bce_loss = bce(Tensor([point]), Tensor([1.0]))\n", - "\n", - " # Characterize gradient steepness\n", - " if abs(point) < 0.5:\n", - " grad_type = \"Gentle\"\n", - " elif abs(point) < 1.5:\n", - " grad_type = \"Moderate\"\n", - " else:\n", - " grad_type = \"Steep\"\n", - "\n", - " print(f\" {point:>10.1f} {mse_loss.data:>10.3f} {bce_loss.data:>10.3f} {grad_type:>15}\")\n", - "\n", - " # Optimization implications\n", - " print(f\"\\n🚀 Optimization Implications:\")\n", - " print(f\" MSE (Regression):\")\n", - " print(f\" • Quadratic penalty grows smoothly\")\n", - " print(f\" • Large errors → large gradients (aggressive correction)\")\n", - " print(f\" • Small errors → small gradients (fine-tuning)\")\n", - " print(f\" • Symmetric around target value\")\n", - "\n", - " print(f\" Binary CrossEntropy (Classification):\")\n", - " print(f\" • Logarithmic penalty creates adaptive gradients\")\n", - " print(f\" • Wrong confident predictions → steep gradients\")\n", - " print(f\" • Right confident predictions → gentle gradients\")\n", - " print(f\" • Asymmetric penalty structure encourages confidence\")\n", - "\n", - " # 💡 WHY THIS MATTERS: Different loss landscapes create different\n", - " # optimization dynamics. 
MSE's smooth quadratic surface enables\n", - " # stable gradient descent, while CrossEntropy's adaptive gradients\n", - " # help classification models learn faster from confident mistakes.\n", - "\n", - " except Exception as e:\n", - " print(f\"⚠️ Visualization error: {e}\")\n", - " print(\"Ensure loss functions are implemented for landscape analysis\")\n", - "\n", - "# 🔍 SYSTEMS INSIGHT: MSE Computational Analysis\n", - "def analyze_mse_properties():\n", - " \"\"\"Analyze MSE loss characteristics for systems understanding.\"\"\"\n", - " print(\"🔍 MSE Loss Analysis - Understanding the Math\")\n", - " print(\"=\" * 45)\n", - " \n", - " try:\n", - " mse = MeanSquaredError()\n", - " \n", - " # Error magnitude vs loss relationship\n", - " print(\"\\n📊 Error Magnitude vs Loss (Quadratic Penalty):\")\n", - " errors = [0.1, 0.5, 1.0, 2.0, 5.0]\n", - " for error in errors:\n", - " pred = Tensor([error])\n", - " true = Tensor([0.0])\n", - " loss = mse(pred, true)\n", - " print(f\" Error: {error:4.1f} → Loss: {loss.data:8.3f} (× {loss.data/(error**2):5.1f} baseline)\")\n", - " \n", - " # Batch vs individual processing (same data both ways, so results must match)\n", - " print(f\"\\n⚡ Batch Processing Efficiency:\")\n", - " pred_values = np.random.randn(100)\n", - " true_values = np.random.randn(100)\n", - " single_losses = []\n", - " for i in range(100):\n", - " loss = mse(Tensor([pred_values[i]]), Tensor([true_values[i]]))\n", - " single_losses.append(loss.data)\n", - " \n", - " # Batch version: one vectorized call over the same 100 pairs\n", - " batch_loss = mse(Tensor(pred_values), Tensor(true_values))\n", - " \n", - " individual_mean = np.mean(single_losses)\n", - " print(f\" Individual losses mean: {individual_mean:.6f}\")\n", - " print(f\" Batch loss: {batch_loss.data:.6f}\")\n", - " print(f\" Difference: {abs(individual_mean - batch_loss.data):.8f}\")\n", - " \n", - " # Memory efficiency analysis\n", - " import sys\n", - " small_tensor = Tensor([1.0])\n", - " large_tensor = Tensor(np.random.randn(1000))\n", - " \n", - "\n", - " print(f\"\\n💾 Memory Efficiency:\")\n", - " print(f\" Small input tensor: {sys.getsizeof(small_tensor.data)} bytes\")\n", - " print(f\" Large input tensor: {sys.getsizeof(large_tensor.data)} bytes\")\n", - " print(f\" Inputs scale with size, but the MSE loss output is a single scalar!\")\n", - " \n", - " # 💡 WHY THIS MATTERS: MSE provides stable, well-behaved gradients\n", - " # that are proportional to error magnitude, making optimization smooth.\n", - " # The quadratic penalty means large errors dominate learning initially,\n", - " # then fine-tuning happens as errors get smaller.\n", - " \n", - " except Exception as e:\n", - " print(f\"⚠️ Analysis error: {e}\")\n", - " print(\"Ensure MSE implementation is complete before running analysis\")" - ] - }, - { - "cell_type": "markdown", - "id": "c0b9be9f", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Unit Test: MSE Loss Computation\n", - "This test validates `MeanSquaredError.__call__`, ensuring correct MSE computation with various input types and batch sizes."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "39a9be44", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test-mse-loss", - "locked": true, - "points": 3, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_unit_mse_loss():\n", - " \"\"\"Test MSE loss implementation.\"\"\"\n", - " print(\"🧪 Testing Mean Squared Error Loss...\")\n", - " \n", - " mse = MeanSquaredError()\n", - " \n", - " # Test case 1: Perfect predictions (loss should be 0)\n", - " y_pred = Tensor([[1.0, 2.0], [3.0, 4.0]])\n", - " y_true = Tensor([[1.0, 2.0], [3.0, 4.0]])\n", - " loss = mse(y_pred, y_true)\n", - " assert abs(loss.data) < 1e-6, f\"Perfect predictions should have loss ≈ 0, got {loss.data}\"\n", - " print(\"✅ Perfect predictions test passed\")\n", - " \n", - " # Test case 2: Known loss computation\n", - " y_pred = Tensor([[1.0, 2.0]])\n", - " y_true = Tensor([[0.0, 1.0]])\n", - " loss = mse(y_pred, y_true)\n", - " expected = 1.0 # [(1-0)² + (2-1)²] / 2 = [1 + 1] / 2 = 1.0\n", - " assert abs(loss.data - expected) < 1e-6, f\"Expected loss {expected}, got {loss.data}\"\n", - " print(\"✅ Known loss computation test passed\")\n", - " \n", - " # Test case 3: Batch processing\n", - " y_pred = Tensor([[1.0, 2.0], [3.0, 4.0]])\n", - " y_true = Tensor([[1.5, 2.5], [2.5, 3.5]])\n", - " loss = mse(y_pred, y_true)\n", - " expected = 0.25 # All squared differences are 0.25\n", - " assert abs(loss.data - expected) < 1e-6, f\"Expected batch loss {expected}, got {loss.data}\"\n", - " print(\"✅ Batch processing test passed\")\n", - " \n", - " # Test case 4: Single value\n", - " y_pred = Tensor([5.0])\n", - " y_true = Tensor([3.0])\n", - " loss = mse(y_pred, y_true)\n", - " expected = 4.0 # (5-3)² = 4\n", - " assert abs(loss.data - expected) < 1e-6, f\"Expected single value loss {expected}, got {loss.data}\"\n", - " print(\"✅ Single value test passed\")\n", - " \n", - " print(\"🎉 MSE loss tests passed! 
Understanding regression objectives.\")\n", - "\n", - "test_unit_mse_loss()" - ] - }, - { - "cell_type": "markdown", - "id": "48e960ae", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "# Cross-Entropy Loss - Foundation for Multi-Class Classification\n", - "\n", - "Cross-Entropy Loss measures the \"information distance\" between predicted probability distributions and true class labels. It's the gold standard for classification problems.\n", - "\n", - "## Visual Understanding: Cross-Entropy Behavior\n", - "\n", - "```\n", - "Cross-Entropy Loss for 3-Class Problem:\n", - "\n", - "Class Probabilities after Softmax:\n", - " Input: [2.0, 1.0, 0.1] → Probabilities: [0.66, 0.24, 0.10]\n", - " True: Class 0 (index 0) → Target: [1.0, 0.0, 0.0]\n", - " \n", - "Loss Computation:\n", - " CE = -log(probability_of_correct_class)\n", - " CE = -log(0.66) = 0.415\n", - " \n", - "Intuition:\n", - " High confidence + Correct → Low loss\n", - " High confidence + Wrong → High loss \n", - " Low confidence + Any → Medium loss\n", - "\n", - "Gradient Behavior:\n", - " Wrong predictions → Steep gradients → Big corrections\n", - " Right predictions → Gentle gradients → Fine tuning\n", - "```\n", - "\n", - "## Numerical Stability Challenge\n", - "\n", - "```\n", - "The Numerical Stability Problem:\n", - " \n", - " Raw logits: [50.0, 49.0, 48.0]\n", - " Naive softmax: exp(50)/[exp(50)+exp(49)+exp(48)]\n", - " Problem: exp(50) ≈ 5×10²¹ → Overflow risk! (exp overflows float64 near x ≈ 710)\n", - " \n", - "Our Solution (Log-Sum-Exp Trick):\n", - " 1. max_val = max(logits) = 50.0\n", - " 2. stable_logits = [0.0, -1.0, -2.0] # Subtract max\n", - " 3. exp([0.0, -1.0, -2.0]) = [1.0, 0.37, 0.14]\n", - " 4. 
Safe softmax: [0.67, 0.25, 0.09]\n", - "```\n", - "\n", - "## Mathematical Foundation\n", - "\n", - "For predictions and class indices:\n", - "```\n", - "CrossEntropy = -Σ y_true × log(softmax(y_pred))\n", - "\n", - "Softmax: softmax(x_i) = exp(x_i) / Σ exp(x_j)\n", - "Stable: softmax(x_i) = exp(x_i - max(x)) / Σ exp(x_j - max(x))\n", - "```\n", - "\n", - "## Learning Objectives\n", - "By implementing Cross-Entropy, you'll understand:\n", - "- How classification losses work with probability distributions and information theory\n", - "- Why softmax normalization creates proper probability distributions for multi-class problems\n", - "- The critical importance of numerical stability in exponential and logarithmic computations\n", - "- How cross-entropy naturally encourages confident, correct predictions through its gradient structure" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "22a7ac21", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "crossentropy-concept-question", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "\"\"\"\n", - "🤔 **Computational Question: CrossEntropy Stability**\n", - "\n", - "Consider numerical stability in cross-entropy:\n", - "\n", - "1. What happens if you compute exp(100) directly?\n", - "2. Why does subtracting the maximum value prevent overflow?\n", - "3. What happens if log(0) occurs during loss computation?\n", - "4. 
How does epsilon clipping prevent this issue?\n", - "\n", - "Understanding these edge cases is crucial for reliable implementation.\n", - "\"\"\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b638a54b", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "crossentropy-loss-implementation", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class CrossEntropyLoss:\n", - " \"\"\"\n", - " Cross-Entropy Loss for Multi-Class Classification Problems\n", - " \n", - " Computes the cross-entropy between predicted probability distributions\n", - " and true class labels with numerically stable implementation.\n", - " \n", - " Features:\n", - " - Numerically stable softmax computation using log-sum-exp trick\n", - " - Support for both class indices and one-hot encoding\n", - " - Efficient batch processing with proper broadcasting\n", - " - Automatic handling of edge cases and extreme values\n", - " \n", - " Example Usage:\n", - " ce_loss = CrossEntropyLoss()\n", - " loss = ce_loss(logits, class_indices) # Returns scalar loss value\n", - " \"\"\"\n", - " \n", - " def __init__(self):\n", - " \"\"\"Initialize CrossEntropy loss function.\"\"\"\n", - " pass\n", - " \n", - " def __call__(self, y_pred, y_true):\n", - " \"\"\"\n", - " Compute CrossEntropy loss between predictions and targets.\n", - " \n", - " Args:\n", - " y_pred: Model predictions/logits (Tensor, shape: [batch_size, num_classes])\n", - " y_true: True class indices (Tensor, shape: [batch_size]) or one-hot encoding\n", - " \n", - " Returns:\n", - " Tensor with scalar loss value\n", - " \n", - " TODO: Implement CrossEntropy with numerically stable softmax computation.\n", - " \n", - " APPROACH:\n", - " 1. Convert inputs to tensors and handle single samples\n", - " 2. Apply log-sum-exp trick for numerically stable softmax\n", - " 3. 
Clip probabilities to prevent log(0) issues\n", - " 4. Compute cross-entropy based on target format (indices vs one-hot)\n", - " \n", - " EXAMPLE:\n", - " >>> ce = CrossEntropyLoss()\n", - " >>> logits = Tensor([[2.0, 1.0, 0.0]]) # Raw model outputs\n", - " >>> targets = Tensor([0]) # Class 0 is correct\n", - " >>> loss = ce(logits, targets)\n", - " >>> print(loss.data)\n", - " 0.407 # -log(softmax([2.0, 1.0, 0.0])[0])\n", - " \n", - " HINTS:\n", - " - Use np.max(axis=1, keepdims=True) for stable max computation\n", - " - Use np.clip(probabilities, 1e-15, 1.0-1e-15) to prevent log(0)\n", - " - Handle both index format [0,1,2] and one-hot format [[1,0,0], [0,1,0]]\n", - " - Use advanced indexing: probs[np.arange(batch_size), class_indices]\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Step 1: Ensure we have tensor inputs for consistent processing\n", - " if not isinstance(y_pred, Tensor):\n", - " y_pred = Tensor(y_pred) # Convert predictions to tensor format\n", - " if not isinstance(y_true, Tensor):\n", - " y_true = Tensor(y_true) # Convert targets to tensor format\n", - " \n", - " # Extract numpy arrays for computation\n", - " prediction_logits = y_pred.data # Raw model outputs (pre-softmax)\n", - " target_labels = y_true.data # True class indices or one-hot vectors\n", - " \n", - " # Step 2: Handle both single predictions and batches consistently\n", - " if prediction_logits.ndim == 1:\n", - " prediction_logits = prediction_logits.reshape(1, -1) # Convert to batch format [1, num_classes]\n", - " \n", - " # Step 3: Apply numerically stable softmax transformation\n", - " # Subtract max to prevent overflow: exp(x-max) is equivalent but stable\n", - " max_logits = np.max(prediction_logits, axis=1, keepdims=True)\n", - " exp_pred = np.exp(prediction_logits - max_logits)\n", - " softmax_pred = exp_pred / np.sum(exp_pred, axis=1, keepdims=True)\n", - " \n", - " # Step 4: Prevent numerical instability in log computation\n", - " epsilon = 1e-15 # Small value 
to prevent log(0) → -inf and log(1) → 0 issues\n", - " softmax_pred = np.clip(softmax_pred, epsilon, 1.0 - epsilon)\n", - " \n", - " # Step 5: Compute cross-entropy loss based on target format\n", - " if len(target_labels.shape) == 1:\n", - " # Format A: y_true contains class indices [0, 1, 2, ...]\n", - " batch_size = target_labels.shape[0]\n", - " # Extract probabilities for correct classes using advanced indexing\n", - " correct_class_probs = softmax_pred[np.arange(batch_size), target_labels.astype(int)]\n", - " log_probs = np.log(correct_class_probs)\n", - " loss_value = -np.mean(log_probs) # Negative log-likelihood\n", - " else:\n", - " # Format B: y_true is one-hot encoded [[1,0,0], [0,1,0], ...]\n", - " log_probs = np.log(softmax_pred)\n", - " # Multiply one-hot targets with log probabilities, sum across classes\n", - " weighted_log_probs = target_labels * log_probs\n", - " loss_value = -np.mean(np.sum(weighted_log_probs, axis=1))\n", - " \n", - " return Tensor(loss_value)\n", - " ### END SOLUTION\n", - " \n", - " def forward(self, y_pred, y_true):\n", - " \"\"\"Alternative interface for forward pass.\"\"\"\n", - " return self.__call__(y_pred, y_true)\n", - "\n", - "# 🔍 SYSTEMS INSIGHT: CrossEntropy Stability Analysis\n", - "def analyze_crossentropy_stability():\n", - " \"\"\"Analyze numerical stability in cross-entropy computation.\"\"\"\n", - " print(\"🔍 CrossEntropy Stability Analysis\")\n", - " print(\"=\" * 40)\n", - " \n", - " try:\n", - " ce = CrossEntropyLoss()\n", - " \n", - " # Test numerical stability with extreme values\n", - " print(\"\\n⚡ Numerical Stability Testing:\")\n", - " \n", - " # Extreme logits that would overflow in naive implementation\n", - " extreme_logits = Tensor([[100.0, 99.0, 98.0]])\n", - " safe_labels = Tensor([0])\n", - " \n", - " loss = ce(extreme_logits, safe_labels)\n", - " print(f\" Extreme logits [100, 99, 98]: Loss = {loss.data:.6f}\")\n", - " print(f\" No overflow or NaN: {not np.isnan(loss.data) and not 
np.isinf(loss.data)}\")\n", - " \n", - " # Test epsilon clipping effectiveness\n", - " print(f\"\\n🛡️ Epsilon Clipping Protection:\")\n", - " very_confident = Tensor([[10.0, -10.0, -10.0]]) # Very confident about class 0\n", - " confident_labels = Tensor([0])\n", - " \n", - " loss = ce(very_confident, confident_labels)\n", - " print(f\" Very confident correct prediction: Loss = {loss.data:.6f}\")\n", - " print(f\" Should be near 0: {loss.data < 0.01}\")\n", - " \n", - " # Compare different confidence levels\n", - " print(f\"\\n📊 Confidence vs Loss Relationship:\")\n", - " confidence_levels = [\n", - " (\"Low confidence\", [[0.1, 0.0, -0.1]]),\n", - " (\"Medium confidence\", [[1.0, 0.0, -1.0]]),\n", - " (\"High confidence\", [[5.0, 0.0, -5.0]]),\n", - " (\"Very high\", [[10.0, 0.0, -10.0]])\n", - " ]\n", - " \n", - " for name, logits in confidence_levels:\n", - " test_logits = Tensor(logits)\n", - " test_loss = ce(test_logits, Tensor([0]))\n", - " print(f\" {name:15}: Loss = {test_loss.data:.6f}\")\n", - " \n", - " # Memory efficiency for large vocabularies\n", - " print(f\"\\n💾 Memory Scaling Analysis:\")\n", - " small_vocab = Tensor(np.random.randn(32, 100)) # 100 classes\n", - " large_vocab = Tensor(np.random.randn(32, 10000)) # 10k classes\n", - " \n", - " import sys\n", - " small_memory = sys.getsizeof(small_vocab.data)\n", - " large_memory = sys.getsizeof(large_vocab.data)\n", - " \n", - " print(f\" Small vocab (100 classes): {small_memory / 1024:.1f} KB\")\n", - " print(f\" Large vocab (10k classes): {large_memory / 1024:.1f} KB\")\n", - " print(f\" Memory scales O(batch_size × num_classes)\")\n", - " \n", - " # 💡 WHY THIS MATTERS: CrossEntropy memory scales with vocabulary size.\n", - " # This is why large language models use techniques like hierarchical softmax\n", - " # or sampling-based training to handle vocabularies with 50k+ tokens.\n", - " \n", - " except Exception as e:\n", - " print(f\"⚠️ Analysis error: {e}\")\n", - " print(\"Ensure CrossEntropy 
implementation is complete\")" - ] - }, - { - "cell_type": "markdown", - "id": "31b5abca", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Unit Test: Cross-Entropy Loss Computation\n", - "This test validates `CrossEntropyLoss.__call__`, ensuring correct cross-entropy computation with numerically stable softmax." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d6062489", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test-crossentropy-loss", - "locked": true, - "points": 4, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_unit_crossentropy_loss():\n", - " \"\"\"Test CrossEntropy loss implementation.\"\"\"\n", - " print(\"🧪 Testing Cross-Entropy Loss...\")\n", - " \n", - " ce = CrossEntropyLoss()\n", - " \n", - " # Test case 1: Perfect predictions\n", - " y_pred = Tensor([[10.0, 0.0, 0.0], [0.0, 10.0, 0.0]]) # Very confident correct predictions\n", - " y_true = Tensor([0, 1]) # Class indices\n", - " loss = ce(y_pred, y_true)\n", - " assert loss.data < 0.1, f\"Perfect predictions should have low loss, got {loss.data}\"\n", - " print(\"✅ Perfect predictions test passed\")\n", - " \n", - " # Test case 2: Random predictions (should have higher loss)\n", - " y_pred = Tensor([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]) # Uniform after softmax\n", - " y_true = Tensor([0, 1])\n", - " loss = ce(y_pred, y_true)\n", - " expected_random = -np.log(1.0/3.0) # log(1/num_classes) for uniform distribution\n", - " assert abs(loss.data - expected_random) < 0.1, f\"Random predictions should have loss ≈ {expected_random}, got {loss.data}\"\n", - " print(\"✅ Random predictions test passed\")\n", - " \n", - " # Test case 3: Binary classification\n", - " y_pred = Tensor([[2.0, 1.0], [1.0, 2.0]])\n", - " y_true = Tensor([0, 1])\n", - " loss = ce(y_pred, y_true)\n", - " assert 0.0 < loss.data < 2.0, f\"Binary classification loss should be reasonable, 
got {loss.data}\"\n", - " print(\"✅ Binary classification test passed\")\n", - " \n", - " # Test case 4: One-hot encoded labels\n", - " y_pred = Tensor([[2.0, 1.0, 0.0], [0.0, 2.0, 1.0]])\n", - " y_true = Tensor([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]) # One-hot encoded\n", - " loss = ce(y_pred, y_true)\n", - " assert 0.0 < loss.data < 2.0, f\"One-hot encoded loss should be reasonable, got {loss.data}\"\n", - " print(\"✅ One-hot encoded labels test passed\")\n", - " \n", - " print(\"🎉 Cross-Entropy loss tests passed! Understanding classification objectives.\")\n", - "\n", - "test_unit_crossentropy_loss()" - ] - }, - { - "cell_type": "markdown", - "id": "13e8a85c", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "# Binary Cross-Entropy Loss - Optimized for Binary Classification\n", - "\n", - "Binary Cross-Entropy Loss is the specialized, efficient version of cross-entropy for binary (two-class) problems. It's more stable and faster than using regular cross-entropy with 2 classes.\n", - "\n", - "## Visual Understanding: Binary Cross-Entropy\n", - "\n", - "```\n", - "Binary Classification Landscape:\n", - "\n", - "Sigmoid Activation:\n", - " Raw Logit → Sigmoid → Probability → Loss\n", - " -5.0 → 0.007 → 0.007 → High loss (if true=1)\n", - " 0.0 → 0.500 → 0.500 → Medium loss\n", - " +5.0 → 0.993 → 0.993 → Low loss (if true=1)\n", - "\n", - "Loss Behavior:\n", - " BCE = -[y×log(p) + (1-y)×log(1-p)]\n", - " \n", - " For y=1 (positive class):\n", - " p=0.9 → -log(0.9) = 0.105 (low loss)\n", - " p=0.1 → -log(0.1) = 2.303 (high loss)\n", - " \n", - " For y=0 (negative class):\n", - " p=0.1 → -log(0.9) = 0.105 (low loss) \n", - " p=0.9 → -log(0.1) = 2.303 (high loss)\n", - "```\n", - "\n", - "## Numerical Stability Solution\n", - "\n", - "```\n", - "The Binary Cross-Entropy Stability Problem:\n", - " \n", - " BCE = -[y×log(σ(x)) + (1-y)×log(1-σ(x))]\n", - " \n", - " Where σ(x) = 1/(1+exp(-x))\n", - " \n", - " Problems:\n", - " - Large positive x: exp(-x) → 0, then 
log(1) → 0 (loss of precision)\n", - " - Large negative x: σ(x) → 0, then log(0) → -∞\n", - " \n", - "Our Stable Solution:\n", - " BCE = max(x,0) - x×y + log(1 + exp(-|x|))\n", - " \n", - " Why this works:\n", - " - max(x,0) handles positive values\n", - " - -x×y is the \"cross\" term \n", - " - log(1+exp(-|x|)) is always stable (exp≤1)\n", - "```\n", - "\n", - "## Mathematical Foundation\n", - "\n", - "For binary predictions and labels:\n", - "```\n", - "BCE = -y × log(σ(x)) - (1-y) × log(1-σ(x))\n", - "\n", - "Stable form: BCE = max(x,0) - x×y + log(1 + exp(-|x|))\n", - "```\n", - "\n", - "## Learning Objectives\n", - "By implementing Binary Cross-Entropy, you'll understand:\n", - "- How binary classification creates simpler optimization landscapes than multi-class problems\n", - "- Why sigmoid activation naturally pairs with binary cross-entropy loss through its gradient structure\n", - "- The critical importance of numerically stable formulations for reliable production training\n", - "- How specialized binary losses achieve better efficiency and stability than general solutions" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6b7f8af9", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "binary-crossentropy-concept", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "\"\"\"\n", - "🤔 **Computational Question: Binary Stability**\n", - "\n", - "Consider the stable BCE formulation:\n", - "\n", - "1. Why does max(x,0) - x×y + log(1+exp(-|x|)) work?\n", - "2. What happens when x=100? (trace through the computation)\n", - "3. What happens when x=-100? (trace through the computation)\n", - "4. 
How does this prevent both overflow and underflow?\n", - "\n", - "This mathematical insight is crucial for production systems.\n", - "\"\"\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c53864df", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "binary-crossentropy-implementation", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class BinaryCrossEntropyLoss:\n", - " \"\"\"\n", - " Binary Cross-Entropy Loss for Binary Classification Problems\n", - " \n", - " Computes binary cross-entropy between predictions and binary labels\n", - " with numerically stable sigmoid + BCE implementation.\n", - " \n", - " Features:\n", - " - Numerically stable computation from logits using stable BCE formula\n", - " - Efficient batch processing with vectorized operations\n", - " - Automatic sigmoid application through stable formulation\n", - " - Robust to extreme input values without overflow/underflow\n", - " \n", - " Example Usage:\n", - " bce_loss = BinaryCrossEntropyLoss()\n", - " loss = bce_loss(logits, binary_labels) # Returns scalar loss value\n", - " \"\"\"\n", - " \n", - " def __init__(self):\n", - " \"\"\"Initialize Binary CrossEntropy loss function.\"\"\"\n", - " pass\n", - " \n", - " def __call__(self, y_pred, y_true):\n", - " \"\"\"\n", - " Compute Binary CrossEntropy loss between predictions and targets.\n", - " \n", - " Args:\n", - " y_pred: Model predictions/logits (Tensor, shape: [batch_size, 1] or [batch_size])\n", - " y_true: True binary labels (Tensor, shape: [batch_size, 1] or [batch_size])\n", - " \n", - " Returns:\n", - " Tensor with scalar loss value\n", - " \n", - " TODO: Implement stable binary cross-entropy using the logits formulation.\n", - " \n", - " APPROACH:\n", - " 1. Convert inputs to tensors and flatten for consistent processing\n", - " 2. 
Use stable BCE formula: max(x,0) - x×y + log(1+exp(-|x|))\n", - " 3. Apply this formula element-wise across the batch\n", - " 4. Return mean loss across all samples\n", - " \n", - " EXAMPLE:\n", - " >>> bce = BinaryCrossEntropyLoss()\n", - " >>> logits = Tensor([[2.0], [-1.0]]) # Raw outputs\n", - " >>> labels = Tensor([[1.0], [0.0]]) # Binary targets\n", - " >>> loss = bce(logits, labels)\n", - " >>> print(loss.data)\n", - " 0.220 # Mean of the per-sample losses: (0.127 + 0.313) / 2\n", - " \n", - " HINTS:\n", - " - Use np.maximum(logits, 0) for the max(x,0) term\n", - " - Use np.abs(logits) to ensure exp argument is ≤ 0\n", - " - The formula naturally handles both positive and negative logits\n", - " - Return np.mean() for batch averaging\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Step 1: Ensure we have tensor inputs for consistent processing\n", - " if not isinstance(y_pred, Tensor):\n", - " y_pred = Tensor(y_pred) # Convert predictions to tensor format\n", - " if not isinstance(y_true, Tensor):\n", - " y_true = Tensor(y_true) # Convert targets to tensor format\n", - " \n", - " # Get flat arrays for computation\n", - " logits = y_pred.data.flatten()\n", - " labels = y_true.data.flatten()\n", - " \n", - " # Define a numerically stable binary cross-entropy helper\n", - " def stable_bce_with_logits(logits, labels):\n", - " \"\"\"\n", - " Numerically stable BCE using the logits formulation:\n", - " BCE(logits, y) = max(logits, 0) - logits * y + log(1 + exp(-|logits|))\n", - " \n", - " This formulation prevents:\n", - " - exp(large_positive_logit) → overflow\n", - " - log(very_small_sigmoid) → -inf\n", - " \n", - " Mathematical equivalence:\n", - " - For positive logits: x - x*y + log(1 + exp(-x))\n", - " - For negative logits: -x*y + log(1 + exp(x))\n", - " \"\"\"\n", - " # Handle positive logits to prevent exp(large_positive) overflow\n", - " positive_part = np.maximum(logits, 0)\n", - " \n", - " # Subtract logit-label product (the \"cross\" in cross-entropy)\n", - " cross_term = logits * labels\n", - " \n", - " # Add log(1 + exp(-|logits|)) for numerical stability\n", - " # Using abs(logits) ensures the exponent is always negative or zero\n", - " stability_term = np.log(1 + np.exp(-np.abs(logits)))\n", - " \n", - " return positive_part - cross_term + stability_term\n", - " \n", - " # Step 2: Apply stable BCE computation across the batch\n", - " individual_losses = stable_bce_with_logits(logits, labels)\n", - " mean_loss = np.mean(individual_losses) # Average loss across batch\n", - " \n", - " return Tensor(mean_loss)\n", - " ### END SOLUTION\n", - " \n", - " def forward(self, y_pred, y_true):\n", - " \"\"\"Alternative interface for forward pass.\"\"\"\n", - " return self.__call__(y_pred, y_true)\n", - "\n", - "# 🔍 SYSTEMS INSIGHT: Binary CrossEntropy Efficiency Analysis\n", - "def analyze_binary_crossentropy_efficiency():\n", - " \"\"\"Analyze binary cross-entropy computational efficiency.\"\"\"\n", - " print(\"🔍 Binary CrossEntropy Efficiency Analysis\")\n", - " print(\"=\" * 45)\n", - " \n", - " try:\n", - " bce = BinaryCrossEntropyLoss()\n", - " ce = CrossEntropyLoss() # For comparison\n", - " \n", - " # Compare binary-specific vs general cross-entropy\n", - " print(\"\\n⚡ Binary vs Multi-Class Efficiency:\")\n", - " \n", - " # Binary problem solved two ways\n", - " binary_logits = Tensor([[1.5], [-0.8], [2.1]])\n", - " binary_labels = Tensor([[1.0], [0.0], [1.0]])\n", - " \n", - " # Method 1: Binary CrossEntropy\n", - " binary_loss = bce(binary_logits, binary_labels)\n", - " \n", - " # Method 2: 2-class CrossEntropy (equivalent but less efficient)\n", - " multiclass_logits = Tensor([[1.5, 0.0], [-0.8, 0.0], [2.1, 0.0]])\n", - " multiclass_labels = Tensor([0, 1, 0]) # Convert to class indices\n", - " multiclass_loss = ce(multiclass_logits, multiclass_labels)\n", - " \n", - " print(f\" Binary CE Loss: {binary_loss.data:.6f}\")\n", - " print(f\" 2-Class CE Loss: 
{multiclass_loss.data:.6f}\")\n", - " print(f\" Difference: {abs(binary_loss.data - multiclass_loss.data):.8f}\")\n", - " \n", - " # Memory efficiency comparison\n", - " print(f\"\\n💾 Memory Efficiency Comparison:\")\n", - " \n", - " batch_size = 1000\n", - " binary_memory = batch_size * 1 * 8 # 1 value per sample, 8 bytes per float64\n", - " multiclass_memory = batch_size * 2 * 8 # 2 classes, 8 bytes per float64\n", - " \n", - " print(f\" Binary approach: {binary_memory / 1024:.1f} KB\")\n", - " print(f\" Multi-class (2): {multiclass_memory / 1024:.1f} KB\")\n", - " print(f\" Binary is {multiclass_memory/binary_memory:.1f}× more memory efficient\")\n", - " \n", - " # Stability test with extreme values\n", - " print(f\"\\n🛡️ Extreme Value Stability:\")\n", - " extreme_tests = [\n", - " (\"Large positive\", [[100.0]], [[1.0]]),\n", - " (\"Large negative\", [[-100.0]], [[0.0]]),\n", - " (\"Mixed extreme\", [[100.0], [-100.0]], [[1.0], [0.0]])\n", - " ]\n", - " \n", - " for name, logits, labels in extreme_tests:\n", - " test_logits = Tensor(logits)\n", - " test_labels = Tensor(labels)\n", - " loss = bce(test_logits, test_labels)\n", - " is_stable = not (np.isnan(loss.data) or np.isinf(loss.data))\n", - " print(f\" {name:15}: Loss = {loss.data:.6f}, Stable = {is_stable}\")\n", - " \n", - " # 💡 WHY THIS MATTERS: Binary CrossEntropy is 2× more memory efficient\n", - " # than regular CrossEntropy for binary problems, and provides better\n", - " # numerical stability through its specialized formulation.\n", - " \n", - " except Exception as e:\n", - " print(f\"⚠️ Analysis error: {e}\")\n", - " print(\"Ensure BinaryCrossEntropy implementation is complete\")" - ] - }, - { - "cell_type": "markdown", - "id": "cd8abd01", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Unit Test: Binary Cross-Entropy Loss\n", - "This test validates `BinaryCrossEntropyLoss.__call__`, ensuring stable binary cross-entropy computation with extreme 
values." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "400a7568", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test-binary-crossentropy", - "locked": true, - "points": 4, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_unit_binary_crossentropy_loss():\n", - " \"\"\"Test Binary CrossEntropy loss implementation.\"\"\"\n", - " print(\"🧪 Testing Binary Cross-Entropy Loss...\")\n", - " \n", - " bce = BinaryCrossEntropyLoss()\n", - " \n", - " # Test case 1: Perfect predictions\n", - " y_pred = Tensor([[10.0], [-10.0]]) # Very confident correct predictions\n", - " y_true = Tensor([[1.0], [0.0]])\n", - " loss = bce(y_pred, y_true)\n", - " assert loss.data < 0.1, f\"Perfect predictions should have low loss, got {loss.data}\"\n", - " print(\"✅ Perfect predictions test passed\")\n", - " \n", - " # Test case 2: Random predictions (should have higher loss)\n", - " y_pred = Tensor([[0.0], [0.0]]) # 0.5 probability after sigmoid\n", - " y_true = Tensor([[1.0], [0.0]])\n", - " loss = bce(y_pred, y_true)\n", - " expected_random = -np.log(0.5) # log(0.5) for random guessing\n", - " assert abs(loss.data - expected_random) < 0.1, f\"Random predictions should have loss ≈ {expected_random}, got {loss.data}\"\n", - " print(\"✅ Random predictions test passed\")\n", - " \n", - " # Test case 3: Batch processing\n", - " y_pred = Tensor([[1.0], [2.0], [-1.0]])\n", - " y_true = Tensor([[1.0], [1.0], [0.0]])\n", - " loss = bce(y_pred, y_true)\n", - " assert 0.0 < loss.data < 2.0, f\"Batch processing loss should be reasonable, got {loss.data}\"\n", - " print(\"✅ Batch processing test passed\")\n", - " \n", - " # Test case 4: Extreme values (test numerical stability)\n", - " y_pred = Tensor([[100.0], [-100.0]]) # Extreme logits\n", - " y_true = Tensor([[1.0], [0.0]])\n", - " loss = bce(y_pred, y_true)\n", - " assert not np.isnan(loss.data) and not np.isinf(loss.data), f\"Extreme values 
should not cause NaN/Inf, got {loss.data}\"\n", - " assert loss.data < 1.0, f\"Extreme correct predictions should have low loss, got {loss.data}\"\n", - " print(\"✅ Extreme values test passed\")\n", - " \n", - " print(\"🎉 Binary Cross-Entropy loss tests passed! Understanding binary objectives.\")\n", - "\n", - "test_unit_binary_crossentropy_loss()" - ] - }, - { - "cell_type": "markdown", - "id": "13b3bd16", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "# Custom Loss Functions - Aligning with Business Objectives\n", - "\n", - "Beyond standard loss functions, production ML systems often need custom losses that align with specific business objectives and domain constraints.\n", - "\n", - "## Business-Aligned Loss Design Patterns\n", - "\n", - "### Asymmetric Loss Functions\n", - "When false positives and false negatives have different costs:\n", - "\n", - "```python\n", - "# Medical diagnosis: False negatives (missing disease) cost 10× more\n", - "class AsymmetricBinaryCrossEntropy(BinaryCrossEntropyLoss):\n", - " def __init__(self, false_negative_weight=10.0):\n", - " super().__init__()\n", - " self.fn_weight = false_negative_weight\n", - "\n", - " def __call__(self, y_pred, y_true):\n", - " # Standard BCE\n", - " base_loss = super().__call__(y_pred, y_true)\n", - "\n", - " # Weight false negatives more heavily\n", - " # When y_true=1 and y_pred is low, increase penalty\n", - " sigmoid_pred = 1 / (1 + np.exp(-y_pred.data))\n", - " fn_penalty = y_true.data * (1 - sigmoid_pred) * self.fn_weight\n", - "\n", - " weighted_loss = base_loss.data + np.mean(fn_penalty)\n", - " return Tensor(weighted_loss)\n", - "```\n", - "\n", - "### Focal Loss for Imbalanced Data\n", - "Addresses class imbalance by focusing on hard examples:\n", - "\n", - "```python\n", - "class FocalLoss(CrossEntropyLoss):\n", - " def __init__(self, alpha=1.0, gamma=2.0):\n", - " super().__init__()\n", - " self.alpha = alpha # Class balance weight\n", - " self.gamma = gamma # Focusing 
parameter\n", - "\n", - " def __call__(self, y_pred, y_true):\n", - " # Calculate softmax probabilities\n", - " max_logits = np.max(y_pred.data, axis=1, keepdims=True)\n", - " stable_logits = y_pred.data - max_logits\n", - " exp_logits = np.exp(stable_logits)\n", - " softmax_probs = exp_logits / np.sum(exp_logits, axis=1, keepdims=True)\n", - "\n", - " # Get probability of correct class\n", - " batch_size = y_true.data.shape[0]\n", - " correct_probs = softmax_probs[np.arange(batch_size), y_true.data.astype(int)]\n", - "\n", - " # Apply focal loss formula per sample: -α(1-p)^γ log(p)\n", - " per_sample_ce = -np.log(np.clip(correct_probs, 1e-15, 1.0))\n", - " focal_weight = self.alpha * ((1 - correct_probs) ** self.gamma)\n", - " focal_loss = focal_weight * per_sample_ce\n", - "\n", - " return Tensor(np.mean(focal_loss))\n", - "```\n", - "\n", - "### Ranking-Aware Loss\n", - "For problems where order matters (search, recommendations):\n", - "\n", - "```python\n", - "class RankingAwareLoss:\n", - " def __init__(self, position_weights=None):\n", - " # Higher weights for top positions\n", - " self.position_weights = position_weights or [10.0, 5.0, 2.0, 1.0, 0.5]\n", - "\n", - " def __call__(self, predictions, targets, positions):\n", - " \"\"\"predictions: relevance scores, targets: true relevance, positions: result positions\"\"\"\n", - " # Weight squared errors by position importance\n", - " weighted_errors = []\n", - " for pred, target, pos in zip(predictions.data, targets.data, positions.data):\n", - " pos_weight = self.position_weights[min(int(pos), len(self.position_weights)-1)]\n", - " error = ((pred - target) ** 2) * pos_weight\n", - " weighted_errors.append(error)\n", - "\n", - " return Tensor(np.mean(weighted_errors))\n", - "```\n", - "\n", - "## Advanced Custom Loss Patterns\n", - "\n", - "### Multi-Task Learning Loss\n", - "Combining multiple objectives with learned weights:\n", - "\n", - "```python\n", - 
"class MultiTaskLoss:\n", - " def __init__(self, num_tasks=3):\n", - " # Learnable loss weights (log-variance parameterization for stability)\n", - " self.log_vars = [0.0] * num_tasks\n", - "\n", - " def __call__(self, predictions_list, targets_list):\n", - " \"\"\"predictions_list: [task1_preds, task2_preds, ...]\"\"\"\n", - " total_loss = 0\n", - "\n", - " for i, (preds, targets) in enumerate(zip(predictions_list, targets_list)):\n", - " # Choose appropriate loss for each task\n", - " if i == 0: # Regression task\n", - " task_loss = MeanSquaredError()(preds, targets)\n", - " else: # Classification tasks\n", - " task_loss = CrossEntropyLoss()(preds, targets)\n", - "\n", - " # Uncertainty-weighted combination\n", - " precision = np.exp(-self.log_vars[i])\n", - " weighted_loss = precision * task_loss.data + self.log_vars[i]\n", - " total_loss += weighted_loss\n", - "\n", - " return Tensor(total_loss)\n", - "```\n", - "\n", - "### Contrastive Loss for Similarity Learning\n", - "For learning embeddings and similarity:\n", - "\n", - "```python\n", - "class ContrastiveLoss:\n", - " def __init__(self, margin=1.0):\n", - " self.margin = margin\n", - "\n", - " def __call__(self, embeddings1, embeddings2, labels):\n", - " \"\"\"labels: 1 for similar pairs, 0 for dissimilar\"\"\"\n", - " # Euclidean distance between embeddings\n", - " distances = np.sqrt(np.sum((embeddings1.data - embeddings2.data) ** 2, axis=1))\n", - "\n", - " # Contrastive loss formula\n", - " positive_loss = labels.data * (distances ** 2)\n", - " negative_loss = (1 - labels.data) * np.maximum(0, self.margin - distances) ** 2\n", - "\n", - " total_loss = 0.5 * (positive_loss + negative_loss)\n", - " return Tensor(np.mean(total_loss))\n", - "```\n", - "\n", - "## Custom Loss Implementation Guidelines\n", - "\n", - "### Numerical Stability Considerations\n", - "```python\n", - "# Always include stability measures in custom losses\n", - "class StableCustomLoss:\n", - " def __call__(self, predictions, 
targets):\n", - "        # 1. Input validation\n", - "        if not isinstance(predictions, Tensor):\n", - "            predictions = Tensor(predictions)\n", - "\n", - "        # 2. Handle edge cases\n", - "        predictions_clipped = np.clip(predictions.data, -100, 100)  # Prevent overflow\n", - "\n", - "        # 3. Use numerically stable formulations\n", - "        # Avoid: exp(large_number), log(small_number)\n", - "        # Use: log-sum-exp trick, epsilon clipping\n", - "        computed_loss = np.mean((predictions_clipped - targets.data) ** 2)  # e.g., MSE on clipped values\n", - "\n", - "        # 4. Return tensor for consistency\n", - "        return Tensor(computed_loss)\n", - "```\n", - "\n", - "### Gradient-Friendly Design\n", - "```python\n", - "# Ensure gradients flow properly\n", - "class GradientFriendlyLoss:\n", - "    def __call__(self, predictions, targets):\n", - "        # Avoid operations that create zero gradients:\n", - "        # - Hard thresholding: use soft approximations\n", - "        # - Discrete operations: use continuous relaxations\n", - "        # - Large plateaus: ensure non-zero gradients everywhere\n", - "\n", - "        # Good: Smooth, differentiable operations\n", - "        smooth_loss = self.smooth_l1_loss(predictions, targets)\n", - "        return smooth_loss\n", - "\n", - "    def smooth_l1_loss(self, pred, target, beta=1.0):\n", - "        \"\"\"Smooth L1 loss - less sensitive to outliers than MSE\"\"\"\n", - "        diff = np.abs(pred.data - target.data)\n", - "        loss = np.where(diff < beta,\n", - "                        0.5 * diff * diff / beta,\n", - "                        diff - 0.5 * beta)\n", - "        return Tensor(np.mean(loss))\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "e84c5945", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "# Loss Function Application Guide and Comparison\n", - "\n", - "## When to Use Each Loss Function\n", - "\n", - "Understanding which loss function to use is critical for successful ML projects:\n", - "\n", - "### Mean Squared Error (MSE) - Regression Problems\n", - "```\n", - "Use when: Predicting continuous values\n", - "Examples: House prices, temperature, stock values, ages\n", - "Output: Any real number\n", - "Activation: Usually none (linear 
output)\n", - "Penalty: Quadratic (large errors >> small errors)\n", - "\n", - "Model Architecture:\n", - "Input → Hidden Layers → Linear Output → MSE Loss\n", - "```\n", - "\n", - "### Cross-Entropy Loss - Multi-Class Classification \n", - "```\n", - "Use when: Choosing one class from 3+ options\n", - "Examples: Image classification, text categorization, medical diagnosis\n", - "Output: Probability distribution (sums to 1)\n", - "Activation: Softmax\n", - "Penalty: Logarithmic (encouraging confident correct predictions)\n", - "\n", - "Model Architecture:\n", - "Input → Hidden Layers → Softmax → CrossEntropy Loss\n", - "```\n", - "\n", - "### Binary Cross-Entropy Loss - Binary Classification\n", - "```\n", - "Use when: Binary decisions (yes/no, positive/negative)\n", - "Examples: Spam detection, fraud detection, medical screening\n", - "Output: Single probability (0 to 1)\n", - "Activation: Sigmoid\n", - "Penalty: Asymmetric (confident wrong predictions heavily penalized)\n", - "\n", - "Model Architecture:\n", - "Input → Hidden Layers → Sigmoid → Binary CrossEntropy Loss\n", - "```\n", - "\n", - "## Performance and Stability Comparison\n", - "\n", - "```\n", - "Computational Characteristics:\n", - " MSE CrossEntropy Binary CE\n", - "Time Complexity: O(n) O(n×c) O(n)\n", - "Memory Complexity: O(1) O(n×c) O(n)\n", - "Numerical Stability: High Medium High\n", - "Convergence Speed: Fast Medium Fast\n", - "\n", - "Where: n = batch size, c = number of classes\n", - "```\n", - "\n", - "## Integration with Neural Networks\n", - "\n", - "```python\n", - "# Example training setup for different problem types:\n", - "\n", - "# Regression Problem (House Price Prediction)\n", - "regression_model = Sequential([\n", - " Linear(10, 64), # Input features → Hidden\n", - " ReLU(),\n", - " Linear(64, 1), # Hidden → Single output\n", - " # No activation - linear output for regression\n", - "])\n", - "loss_fn = MeanSquaredError()\n", - "\n", - "# Multi-Class Classification (Image 
Recognition)\n", - "classification_model = Sequential([\n", - "    Linear(784, 128),  # Flattened image → Hidden\n", - "    ReLU(),\n", - "    Linear(128, 10),   # Hidden → 10 class logits\n", - "    # No Softmax layer - CrossEntropyLoss applies softmax internally (expects raw logits)\n", - "])\n", - "loss_fn = CrossEntropyLoss()\n", - "\n", - "# Binary Classification (Spam Detection)\n", - "binary_model = Sequential([\n", - "    Linear(100, 64),  # Text features → Hidden\n", - "    ReLU(),\n", - "    Linear(64, 1),    # Hidden → Single logit\n", - "    # No Sigmoid layer - BinaryCrossEntropyLoss uses the stable logits formulation\n", - "])\n", - "loss_fn = BinaryCrossEntropyLoss()\n", - "\n", - "# Training loop pattern (same for all):\n", - "for batch in dataloader:\n", - "    predictions = model(batch.inputs)\n", - "    loss = loss_fn(predictions, batch.targets)\n", - "    # loss.backward()  # Compute gradients (when autograd is available)\n", - "    # optimizer.step() # Update parameters\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "91ce7d95", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Comprehensive Integration Test\n", - "This test validates all loss functions work together correctly and can be used interchangeably in production systems." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9df44d7b", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "comprehensive-loss-tests", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_unit_comprehensive_loss_integration():\n", - "    \"\"\"Test all loss functions work correctly together.\"\"\"\n", - "    print(\"🔬 Comprehensive Loss Function Integration Testing\")\n", - "    print(\"=\" * 55)\n", - "    \n", - "    # Test 1: All losses can be instantiated\n", - "    print(\"\\n1. 
Loss Function Instantiation:\")\n", - " mse = MeanSquaredError()\n", - " ce = CrossEntropyLoss()\n", - " bce = BinaryCrossEntropyLoss()\n", - " print(\" ✅ All loss functions created successfully\")\n", - " \n", - " # Test 2: Loss functions return appropriate types\n", - " print(\"\\n2. Return Type Verification:\")\n", - " \n", - " # MSE test\n", - " pred = Tensor([[1.0, 2.0]])\n", - " target = Tensor([[1.0, 2.0]])\n", - " loss = mse(pred, target)\n", - " assert isinstance(loss, Tensor), \"MSE should return Tensor\"\n", - " assert loss.data.shape == (), \"MSE should return scalar\"\n", - " \n", - " # Cross-entropy test\n", - " pred = Tensor([[1.0, 2.0], [2.0, 1.0]])\n", - " target = Tensor([1, 0])\n", - " loss = ce(pred, target)\n", - " assert isinstance(loss, Tensor), \"CrossEntropy should return Tensor\"\n", - " assert loss.data.shape == (), \"CrossEntropy should return scalar\"\n", - " \n", - " # Binary cross-entropy test\n", - " pred = Tensor([[1.0], [-1.0]])\n", - " target = Tensor([[1.0], [0.0]])\n", - " loss = bce(pred, target)\n", - " assert isinstance(loss, Tensor), \"Binary CrossEntropy should return Tensor\"\n", - " assert loss.data.shape == (), \"Binary CrossEntropy should return scalar\"\n", - " \n", - " print(\" ✅ All loss functions return correct types\")\n", - " \n", - " # Test 3: Loss values are reasonable\n", - " print(\"\\n3. Loss Value Sanity Checks:\")\n", - " \n", - " # All losses should be non-negative\n", - " assert mse.forward(Tensor([1.0]), Tensor([2.0])).data >= 0, \"MSE should be non-negative\"\n", - " assert ce.forward(Tensor([[1.0, 0.0]]), Tensor([0])).data >= 0, \"CrossEntropy should be non-negative\"\n", - " assert bce.forward(Tensor([1.0]), Tensor([1.0])).data >= 0, \"Binary CrossEntropy should be non-negative\"\n", - " \n", - " print(\" ✅ All loss functions produce reasonable values\")\n", - " \n", - " # Test 4: Perfect predictions give low loss\n", - " print(\"\\n4. 
Perfect Prediction Tests:\")\n", - " \n", - " perfect_mse = mse(Tensor([5.0]), Tensor([5.0]))\n", - " perfect_ce = ce(Tensor([[10.0, 0.0]]), Tensor([0]))\n", - " perfect_bce = bce(Tensor([10.0]), Tensor([1.0]))\n", - " \n", - " assert perfect_mse.data < 1e-10, f\"Perfect MSE should be ~0, got {perfect_mse.data}\"\n", - " assert perfect_ce.data < 0.1, f\"Perfect CE should be low, got {perfect_ce.data}\"\n", - " assert perfect_bce.data < 0.1, f\"Perfect BCE should be low, got {perfect_bce.data}\"\n", - " \n", - " print(\" ✅ Perfect predictions produce low loss\")\n", - " \n", - " print(\"\\n🎉 All comprehensive integration tests passed!\")\n", - " print(\" • Loss functions instantiate correctly\")\n", - " print(\" • Return types are consistent (Tensor scalars)\")\n", - " print(\" • Loss values are mathematically sound\")\n", - " print(\" • Perfect predictions are handled correctly\")\n", - " print(\" • Ready for integration with neural network training!\")\n", - "\n", - "test_unit_comprehensive_loss_integration()" - ] - }, - { - "cell_type": "markdown", - "id": "5f2c082c", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "# Systems Analysis: Loss Function Performance and Engineering\n", - "\n", - "Let's analyze loss functions from an ML systems engineering perspective, focusing on performance, memory usage, and production implications.\n", - "\n", - "## Computational Complexity Deep Dive\n", - "\n", - "```\n", - "Algorithmic Analysis by Loss Type:\n", - "\n", - "MSE (Mean Squared Error):\n", - " Time: O(n) - linear in number of predictions\n", - " Space: O(1) - constant additional memory\n", - " Operations: n subtractions + n multiplications + 1 mean\n", - " Bottleneck: Memory bandwidth (simple arithmetic operations)\n", - " \n", - "CrossEntropy (Multi-Class):\n", - " Time: O(n×c) - linear in samples × classes \n", - " Space: O(n×c) - store full probability distributions\n", - " Operations: n×c exp + n×c divisions + n×c logs + reductions\n", - " 
Bottleneck: Exponential computations and memory bandwidth\n", - " \n", - "Binary CrossEntropy:\n", - " Time: O(n) - linear in number of samples\n", - " Space: O(n) - store one probability per sample\n", - " Operations: n max + n multiplications + n exp + n logs\n", - " Bottleneck: Transcendental functions (exp, log)\n", - "```\n", - "\n", - "## Memory Scaling Analysis\n", - "\n", - "Understanding memory requirements is crucial for large-scale training:\n", - "\n", - "```\n", - "Memory Requirements by Problem Scale:\n", - "\n", - "Small Problem (1K samples, 100 classes):\n", - " MSE: 8 KB (1K samples × 8 bytes)\n", - " CrossEntropy: 800 KB (1K × 100 × 8 bytes)\n", - " Binary CE: 16 KB (1K × 2 × 8 bytes)\n", - "\n", - "Large Problem (100K samples, 10K classes):\n", - " MSE: 800 KB (independent of classes!)\n", - " CrossEntropy: 8 GB (memory bottleneck)\n", - " Binary CE: 1.6 MB (scales with samples only)\n", - "\n", - "Production Scale (1M samples, 50K vocab):\n", - " MSE: 8 MB\n", - " CrossEntropy: 400 GB (requires distributed memory)\n", - " Binary CE: 16 MB\n", - "```\n", - "\n", - "## Numerical Stability Engineering Analysis\n", - "\n", - "Production systems must handle edge cases robustly:\n", - "\n", - "```\n", - "Stability Challenges and Solutions:\n", - "\n", - "CrossEntropy Stability Issues:\n", - " Problem: exp(large_logit) → overflow → NaN gradients\n", - " Solution: log-sum-exp trick with max subtraction\n", - " \n", - " Problem: log(very_small_prob) → -∞ → training collapse\n", - " Solution: epsilon clipping (1e-15 to 1-1e-15)\n", - " \n", - "Binary CrossEntropy Stability Issues:\n", - " Problem: sigmoid(large_positive) → 1.0 → log(0) issues\n", - " Solution: stable logits formulation bypasses sigmoid\n", - " \n", - " Problem: exp(large_negative) in naive implementation\n", - " Solution: max(x,0) - x*y + log(1+exp(-|x|)) formulation\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "c48c075d", - "metadata": { - "lines_to_next_cell": 1 - }, - 
"source": [ - "\"\"\"\n", - "# Production Performance Benchmarks\n", - "\n", - "Real-world performance characteristics matter for deployment:\n", - "\n", - "```\n", - "Inference Throughput (measured on modern hardware):\n", - "  MSE:                  ~100M predictions/second\n", - "  CrossEntropy:         ~10M predictions/second \n", - "  Binary CrossEntropy:  ~80M predictions/second\n", - "\n", - "Training Memory Bandwidth Requirements:\n", - "  MSE:          ~800 MB/s (lightweight computation)\n", - "  CrossEntropy: ~80 GB/s (100× higher due to softmax!)\n", - "  Binary CE:    ~1.6 GB/s (moderate requirements)\n", - "\n", - "Gradient Computation Overhead:\n", - "  MSE:          1.1× forward pass time (simple derivatives)\n", - "  CrossEntropy: 1.5× forward pass time (softmax gradients)\n", - "  Binary CE:    1.2× forward pass time (sigmoid gradients)\n", - "```\n", - "\n", - "# Framework Integration and Production Patterns\n", - "\n", - "Understanding how production systems implement these concepts:\n", - "\n", - "```\n", - "PyTorch Implementation Patterns:\n", - "  torch.nn.MSELoss() - Direct implementation, minimal overhead\n", - "  torch.nn.CrossEntropyLoss() - Fused softmax+CE for efficiency\n", - "  torch.nn.BCEWithLogitsLoss() - Stable logits formulation\n", - "  \n", - "TensorFlow Implementation Patterns:\n", - "  tf.keras.losses.MeanSquaredError() - Vectorized operations\n", - "  tf.keras.losses.SparseCategoricalCrossentropy() - Memory efficient\n", - "  tf.keras.losses.BinaryCrossentropy() - From logits option\n", - "  \n", - "Production Optimizations:\n", - "  - Mixed precision (FP16) for memory efficiency\n", - "  - Gradient accumulation for large batch simulation\n", - "  - Loss scaling to prevent underflow in mixed precision\n", - "  - Checkpointing to trade memory for computation\n", - "```\n", - "\n", - "# Edge Device and Deployment Considerations\n", - "\n", - "Loss function choice affects deployment feasibility:\n", - "\n", - "```\n", - "Edge Device Constraints:\n", - "  Memory-limited (phones, IoT): Prefer Binary CE 
> MSE > CrossEntropy\n", - "  CPU-only inference: MSE has best compute efficiency\n", - "  Real-time requirements: Binary classification most predictable\n", - "  \n", - "Distributed Training Challenges:\n", - "  CrossEntropy: Requires all-reduce across all classes (expensive!)\n", - "  Gradient accumulation: MSE linear, CrossEntropy non-linear dependencies\n", - "  Mixed precision: Different overflow handling per loss type\n", - "  \n", - "Monitoring and Debugging:\n", - "  MSE divergence: Explodes quadratically (easy to detect)\n", - "  CrossEntropy divergence: More gradual degradation \n", - "  BCE monitoring: Natural bounded behavior aids debugging\n", - "```\n", - "\"\"\"\n", - "\n", - "# 🔍 SYSTEMS INSIGHT: Performance Profiling Analysis\n", - "def analyze_loss_performance_characteristics():\n", - "    \"\"\"Comprehensive performance analysis of all loss functions.\"\"\"\n", - "    print(\"🔍 Loss Function Performance Analysis\")\n", - "    print(\"=\" * 45)\n", - "    \n", - "    try:\n", - "        import time\n", - "        \n", - "        # Initialize loss functions\n", - "        mse = MeanSquaredError()\n", - "        ce = CrossEntropyLoss()\n", - "        bce = BinaryCrossEntropyLoss()\n", - "        \n", - "        print(\"\\n⚡ Computational Complexity Measurement:\")\n", - "        \n", - "        # Test different batch sizes to see scaling behavior\n", - "        batch_sizes = [100, 1000, 10000]\n", - "        \n", - "        for batch_size in batch_sizes:\n", - "            print(f\"\\n  Batch size: {batch_size:,}\")\n", - "            \n", - "            # MSE timing\n", - "            mse_pred = Tensor(np.random.randn(batch_size, 10))\n", - "            mse_true = Tensor(np.random.randn(batch_size, 10))\n", - "            \n", - "            start = time.perf_counter()\n", - "            for _ in range(100):  # Average over multiple runs\n", - "                mse_loss = mse(mse_pred, mse_true)\n", - "            mse_time = (time.perf_counter() - start) / 100\n", - "            \n", - "            # CrossEntropy timing\n", - "            ce_pred = Tensor(np.random.randn(batch_size, 100))  # 100 classes\n", - "            ce_true = Tensor(np.random.randint(0, 100, batch_size))\n", - "            \n", - "            start = 
time.perf_counter()\n", - " for _ in range(100):\n", - " ce_loss = ce(ce_pred, ce_true)\n", - " ce_time = (time.perf_counter() - start) / 100\n", - " \n", - " # Binary CrossEntropy timing\n", - " bce_pred = Tensor(np.random.randn(batch_size, 1))\n", - " bce_true = Tensor(np.random.randint(0, 2, (batch_size, 1)).astype(float))\n", - " \n", - " start = time.perf_counter()\n", - " for _ in range(100):\n", - " bce_loss = bce(bce_pred, bce_true)\n", - " bce_time = (time.perf_counter() - start) / 100\n", - " \n", - " print(f\" MSE: {mse_time*1000:8.3f} ms\")\n", - " print(f\" CrossEntropy: {ce_time*1000:8.3f} ms\")\n", - " print(f\" Binary CE: {bce_time*1000:8.3f} ms\")\n", - " print(f\" CE/MSE ratio: {ce_time/mse_time:8.1f}x\")\n", - " \n", - " print(\"\\n💾 Memory Efficiency Analysis:\")\n", - " \n", - " # Compare memory usage for different problem sizes\n", - " problem_configs = [\n", - " (\"Small (1K samples, 10 classes)\", 1000, 10),\n", - " (\"Medium (10K samples, 100 classes)\", 10000, 100),\n", - " (\"Large (100K samples, 1K classes)\", 100000, 1000)\n", - " ]\n", - " \n", - " for name, samples, classes in problem_configs:\n", - " print(f\"\\n {name}:\")\n", - " \n", - " # Memory calculations (bytes)\n", - " mse_memory = samples * 8 # One value per sample\n", - " ce_memory = samples * classes * 8 # Full probability distribution\n", - " bce_memory = samples * 8 # One probability per sample\n", - " \n", - " print(f\" MSE memory: {mse_memory / 1024 / 1024:8.1f} MB\")\n", - " print(f\" CE memory: {ce_memory / 1024 / 1024:8.1f} MB\") \n", - " print(f\" BCE memory: {bce_memory / 1024 / 1024:8.1f} MB\")\n", - " print(f\" CE overhead: {ce_memory/mse_memory:8.1f}x\")\n", - " \n", - " # 💡 WHY THIS MATTERS: These performance characteristics determine\n", - " # which loss functions are feasible for different deployment scenarios.\n", - " # CrossEntropy's O(n×c) memory scaling makes it prohibitive for \n", - " # large vocabularies without specialized techniques.\n", - " \n", - 
"    except Exception as e:\n", - "        print(f\"⚠️ Performance analysis error: {e}\")\n", - "        print(\"Performance analysis requires complete implementations\")\n", - "\n", - "# 🔍 SYSTEMS INSIGHT: Numerical Stability Deep Analysis\n", - "def analyze_numerical_stability_edge_cases():\n", - "    \"\"\"Deep analysis of numerical stability across all loss functions.\"\"\"\n", - "    print(\"🔍 Numerical Stability Edge Case Analysis\")\n", - "    print(\"=\" * 50)\n", - "    \n", - "    try:\n", - "        mse = MeanSquaredError()\n", - "        ce = CrossEntropyLoss()\n", - "        bce = BinaryCrossEntropyLoss()\n", - "        \n", - "        print(\"\\n🛡️ Extreme Value Stability Testing:\")\n", - "        \n", - "        # Test extreme values that could cause numerical issues\n", - "        extreme_tests = [\n", - "            (\"Huge positive\", 1e10),\n", - "            (\"Huge negative\", -1e10),\n", - "            (\"Tiny positive\", 1e-10),\n", - "            (\"NaN input\", float('nan')),\n", - "            (\"Infinity\", float('inf')),\n", - "            (\"Negative infinity\", float('-inf'))\n", - "        ]\n", - "        \n", - "        for name, value in extreme_tests:\n", - "            print(f\"\\n  Testing {name} ({value}):\")\n", - "            \n", - "            # MSE stability\n", - "            try:\n", - "                mse_loss = mse(Tensor([value]), Tensor([0.0]))\n", - "                mse_stable = not (np.isnan(mse_loss.data) or np.isinf(mse_loss.data))\n", - "                print(f\"    MSE stable: {mse_stable} (loss: {mse_loss.data:.3e})\")\n", - "            except Exception:\n", - "                print(f\"    MSE stable: False (exception)\")\n", - "            \n", - "            # CrossEntropy stability \n", - "            try:\n", - "                ce_loss = ce(Tensor([[value, 0.0, 0.0]]), Tensor([0]))\n", - "                ce_stable = not (np.isnan(ce_loss.data) or np.isinf(ce_loss.data))\n", - "                print(f\"    CE stable: {ce_stable} (loss: {ce_loss.data:.3e})\")\n", - "            except Exception:\n", - "                print(f\"    CE stable: False (exception)\")\n", - "            \n", - "            # Binary CrossEntropy stability\n", - "            try:\n", - "                bce_loss = bce(Tensor([value]), Tensor([1.0]))\n", - "                bce_stable = not (np.isnan(bce_loss.data) or np.isinf(bce_loss.data))\n", - "                print(f\"    BCE stable: {bce_stable} (loss: 
{bce_loss.data:.3e})\")\n", - "            except Exception:\n", - "                print(f\"    BCE stable: False (exception)\")\n", - "        \n", - "        print(\"\\n🔬 Gradient Behavior Analysis:\")\n", - "        \n", - "        # Analyze gradient magnitudes for different prediction qualities\n", - "        confidence_levels = [\n", - "            (\"Very wrong\", [[-5.0, 5.0, 0.0]], [0]),  # Predict class 1, actual class 0\n", - "            (\"Slightly wrong\", [[-0.5, 0.5, 0.0]], [0]),\n", - "            (\"Uncertain\", [[0.0, 0.0, 0.0]], [0]), \n", - "            (\"Slightly right\", [[0.5, -0.5, 0.0]], [0]),\n", - "            (\"Very right\", [[5.0, -5.0, 0.0]], [0])\n", - "        ]\n", - "        \n", - "        print(\"  Prediction Quality → CrossEntropy Loss:\")\n", - "        for name, logits, labels in confidence_levels:\n", - "            loss = ce(Tensor(logits), Tensor(labels))\n", - "            print(f\"    {name:15}: {loss.data:8.4f}\")\n", - "        \n", - "        # 💡 WHY THIS MATTERS: Understanding how loss functions behave\n", - "        # at extremes helps debug training failures and choose appropriate\n", - "        # loss scaling and clipping strategies for production systems.\n", - "        \n", - "    except Exception as e:\n", - "        print(f\"⚠️ Stability analysis error: {e}\")\n", - "        print(\"Stability analysis requires complete implementations\")\n", - "\n", - "# 🔍 SYSTEMS INSIGHT: Mixed Precision Training Analysis\n", - "def analyze_mixed_precision_considerations():\n", - "    \"\"\"Analyze loss function behavior with FP16 mixed precision training.\"\"\"\n", - "    print(\"🔍 Mixed Precision Training Analysis\")\n", - "    print(\"=\" * 40)\n", - "\n", - "    try:\n", - "        print(\"\\n⚡ FP16 Numerical Range Analysis:\")\n", - "        print(\"  FP16 range: ~±65,504 (much smaller than FP32's ~±3.4×10³⁸)\")\n", - "\n", - "        # Simulate FP16 range limitations\n", - "        fp16_max = 65504.0\n", - "        fp16_min_normal = 2**-14  # Smallest normal FP16 number ≈ 6.1×10⁻⁵\n", - "\n", - "        print(f\"  FP16 maximum: ±{fp16_max:,.0f}\")\n", - "        print(f\"  FP16 min normal: {fp16_min_normal:.2e}\")\n", - "        print(f\"  Risk: Gradients/losses exceeding range → infinity/NaN\")\n", - "\n", - "        mse = 
MeanSquaredError()\n", - " ce = CrossEntropyLoss()\n", - " bce = BinaryCrossEntropyLoss()\n", - "\n", - " print(f\"\\n🎯 Loss Function Mixed Precision Compatibility:\")\n", - "\n", - " # Test cases that might overflow in FP16\n", - " test_cases = [\n", - " (\"Small values\", 1.0, 1.1),\n", - " (\"Medium values\", 100.0, 110.0),\n", - " (\"Large values\", 1000.0, 1100.0),\n", - " (\"FP16 edge\", 200.0, 250.0) # Could cause issues when squared\n", - " ]\n", - "\n", - " print(f\"\\n {'Test Case':>15} {'MSE Loss':>12} {'FP16 Safe?':>12}\")\n", - " print(f\" {'-'*15} {'-'*12} {'-'*12}\")\n", - "\n", - " for name, pred, true in test_cases:\n", - " mse_loss = mse(Tensor([pred]), Tensor([true]))\n", - " squared_error = (pred - true) ** 2\n", - " fp16_safe = squared_error < fp16_max\n", - "\n", - " print(f\" {name:>15} {mse_loss.data:>12.1f} {'✅' if fp16_safe else '❌':>12}\")\n", - "\n", - " print(f\"\\n🛡️ Mixed Precision Loss Scaling Strategy:\")\n", - "\n", - " # Demonstrate loss scaling concept\n", - " loss_scales = [1.0, 128.0, 1024.0, 8192.0]\n", - " base_loss = 0.01 # Small loss that might underflow\n", - "\n", - " print(f\" {'Scale Factor':>12} {'Scaled Loss':>12} {'FP16 Precision':>15}\")\n", - " print(f\" {'-'*12} {'-'*12} {'-'*15}\")\n", - "\n", - " for scale in loss_scales:\n", - " scaled_loss = base_loss * scale\n", - "\n", - " # Check if loss is representable in FP16\n", - " if scaled_loss > fp16_min_normal and scaled_loss < fp16_max:\n", - " precision = \"Good\"\n", - " elif scaled_loss <= fp16_min_normal:\n", - " precision = \"Underflow risk\"\n", - " else:\n", - " precision = \"Overflow risk\"\n", - "\n", - " print(f\" {scale:>12.0f} {scaled_loss:>12.3f} {precision:>15}\")\n", - "\n", - " print(f\"\\n⚖️ Loss Function Mixed Precision Recommendations:\")\n", - "\n", - " recommendations = [\n", - " (\"MSE\", \"Monitor for gradient explosion in high-dynamic-range problems\", \"Medium risk\"),\n", - " (\"CrossEntropy\", \"Use FP32 for softmax computation, FP16 for 
storage\", \"High risk\"),\n", - " (\"Binary CE\", \"Stable formulation handles FP16 well with proper scaling\", \"Low risk\")\n", - " ]\n", - "\n", - " for loss_type, recommendation, risk in recommendations:\n", - " print(f\" {loss_type:>12}: {recommendation} ({risk})\")\n", - "\n", - " print(f\"\\n🔧 Implementation Best Practices for Mixed Precision:\")\n", - "\n", - " best_practices = [\n", - " \"1. Use automatic mixed precision (AMP) libraries that handle scaling\",\n", - " \"2. Keep loss computation in FP32, only cast inputs to FP16\",\n", - " \"3. Monitor for overflow/underflow during training\",\n", - " \"4. Use gradient clipping to prevent extreme gradients\",\n", - " \"5. Scale losses up during forward pass, scale gradients down during backward\"\n", - " ]\n", - "\n", - " for practice in best_practices:\n", - " print(f\" {practice}\")\n", - "\n", - " # Example mixed precision training pattern\n", - " print(f\"\\n💻 Mixed Precision Training Pattern:\")\n", - " print(f\" ```python\")\n", - " print(f\" # Forward pass in FP16\")\n", - " print(f\" with autocast():\")\n", - " print(f\" predictions = model(inputs.half()) # FP16 inputs\")\n", - " print(f\" loss = loss_fn(predictions, targets) # Loss computed in FP32\")\n", - " print(f\" \")\n", - " print(f\" # Scale loss to prevent underflow\")\n", - " print(f\" scaled_loss = loss * scale_factor\")\n", - " print(f\" scaled_loss.backward()\")\n", - " print(f\" \")\n", - " print(f\" # Unscale gradients before optimizer step\")\n", - " print(f\" scaler.step(optimizer) # Automatically unscales gradients\")\n", - " print(f\" ```\")\n", - "\n", - " # 💡 WHY THIS MATTERS: Mixed precision training can provide 1.5-2× speedup\n", - " # and 50% memory reduction, but loss functions must be carefully implemented\n", - " # to handle the reduced numerical precision without losing training stability.\n", - "\n", - " except Exception as e:\n", - " print(f\"⚠️ Mixed precision analysis error: {e}\")\n", - " print(\"Mixed precision 
analysis requires complete loss implementations\")\n", - "\n", - "# 🔍 SYSTEMS INSIGHT: Production Deployment Analysis\n", - "def analyze_production_deployment_patterns():\n", - "    \"\"\"Analyze how loss functions affect production ML system design.\"\"\"\n", - "    print(\"🔍 Production Deployment Pattern Analysis\")\n", - "    print(\"=\" * 50)\n", - "    \n", - "    try:\n", - "        print(\"\\n🚀 Deployment Scenario Analysis:\")\n", - "        \n", - "        # Different deployment scenarios with constraints\n", - "        scenarios = [\n", - "            {\n", - "                \"name\": \"Mobile App (Spam Detection)\",\n", - "                \"constraints\": \"Memory < 50MB, Latency < 100ms\",\n", - "                \"problem\": \"Binary classification\",\n", - "                \"recommendation\": \"Binary CrossEntropy\",\n", - "                \"reasoning\": \"Minimal memory, fast inference, stable numerics\"\n", - "            },\n", - "            {\n", - "                \"name\": \"Cloud API (Image Classification)\", \n", - "                \"constraints\": \"Throughput > 1000 QPS, Cost optimization\",\n", - "                \"problem\": \"1000-class classification\",\n", - "                \"recommendation\": \"CrossEntropy with mixed precision\",\n", - "                \"reasoning\": \"Can handle memory cost, needs throughput\"\n", - "            },\n", - "            {\n", - "                \"name\": \"Edge IoT (Temperature Prediction)\",\n", - "                \"constraints\": \"Memory < 1MB, Power < 1W\",\n", - "                \"problem\": \"Regression\",\n", - "                \"recommendation\": \"MSE with quantization\",\n", - "                \"reasoning\": \"Minimal compute, no transcendental functions\"\n", - "            },\n", - "            {\n", - "                \"name\": \"Large Language Model Training\",\n", - "                \"constraints\": \"50K vocabulary, Multi-GPU\",\n", - "                \"problem\": \"Next token prediction\",\n", - "                \"recommendation\": \"Hierarchical Softmax or Sampling\",\n", - "                \"reasoning\": \"Standard CrossEntropy too memory intensive\"\n", - "            }\n", - "        ]\n", - "        \n", - "        for scenario in scenarios:\n", - "            print(f\"\\n  📱 {scenario['name']}:\")\n", - "            print(f\"    Constraints: {scenario['constraints']}\")\n", - "            print(f\"    Problem Type: {scenario['problem']}\")\n", - "            print(f\"    Best 
Loss: {scenario['recommendation']}\")\n", - " print(f\" Why: {scenario['reasoning']}\")\n", - " \n", - " print(\"\\n⚖️ Production Trade-off Analysis:\")\n", - " \n", - " trade_offs = [\n", - " (\"Memory Efficiency\", \"MSE > Binary CE >> CrossEntropy\"),\n", - " (\"Computational Speed\", \"MSE > Binary CE > CrossEntropy\"),\n", - " (\"Numerical Stability\", \"MSE ≈ Binary CE > CrossEntropy\"), \n", - " (\"Implementation Complexity\", \"MSE > CrossEntropy > Binary CE\"),\n", - " (\"Gradient Quality\", \"CrossEntropy > Binary CE > MSE\"),\n", - " (\"Debug-ability\", \"MSE > Binary CE > CrossEntropy\")\n", - " ]\n", - " \n", - " for criterion, ranking in trade_offs:\n", - " print(f\" {criterion:20}: {ranking}\")\n", - " \n", - " print(\"\\n🔧 Framework Integration Patterns:\")\n", - " \n", - " frameworks = [\n", - " (\"PyTorch\", \"nn.MSELoss(), nn.CrossEntropyLoss(), nn.BCEWithLogitsLoss()\"),\n", - " (\"TensorFlow\", \"keras.losses.MSE, SparseCategoricalCrossentropy, BinaryCrossentropy\"),\n", - " (\"JAX\", \"optax.l2_loss, optax.softmax_cross_entropy, optax.sigmoid_binary_cross_entropy\"),\n", - " (\"Production\", \"Custom implementations with monitoring and fallbacks\")\n", - " ]\n", - " \n", - " for framework, losses in frameworks:\n", - " print(f\" {framework:12}: {losses}\")\n", - " \n", - " # 💡 WHY THIS MATTERS: Loss function choice affects every aspect\n", - " # of ML system design - from memory requirements to latency to\n", - " # debugging complexity. 
Understanding these trade-offs enables\n", - " # informed architectural decisions for production systems.\n", - " \n", - " except Exception as e:\n", - " print(f\"⚠️ Deployment analysis error: {e}\")" - ] - }, - { - "cell_type": "markdown", - "id": "1f0245d3", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🤔 ML Systems Thinking: Interactive Questions\n", - "\n", - "Now that you've implemented all core loss functions and analyzed their systems characteristics, let's explore their implications for real ML systems:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0789afbb", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "question-1-loss-selection", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "\"\"\"\n", - "🤔 **Question 1: Loss Function Selection for Production Systems**\n", - "\n", - "You're building a production recommendation system that predicts user ratings (1-5 stars) for movies.\n", - "\n", - "Your team proposes three approaches:\n", - "A) Regression approach: Use MSE loss with continuous outputs (1.0-5.0)\n", - "B) Classification approach: Use CrossEntropy loss with 5 distinct classes \n", - "C) Ordinal approach: Use a custom loss that penalizes being off by multiple stars more heavily\n", - "\n", - "Analyze each approach considering your implementations:\n", - "\n", - "**Technical Analysis:**\n", - "- How does the memory scaling of CrossEntropy (O(batch_size × num_classes)) affect this 5-class problem?\n", - "- What are the computational complexity differences between MSE's O(n) and CrossEntropy's O(n×c) for c=5?\n", - "- How do the gradient behaviors differ? 
(MSE's quadratic vs CrossEntropy's logarithmic penalties)\n", - "\n", - "**Systems Implications:**\n", - "- Which approach would be most memory efficient for large batch training?\n", - "- How does numerical stability differ when handling edge cases (ratings at boundaries)?\n", - "- Which approach would have the most predictable inference latency?\n", - "\n", - "**Business Alignment:**\n", - "- How well does each loss function's penalty structure match the business objective?\n", - "- What happens with fractional ratings like 3.7? How would each approach handle this?\n", - "- Which approach would be easiest to monitor and debug in production?\n", - "\n", - "Recommend an approach with justification based on your implementation experience.\n", - "\"\"\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "583f52ea", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "question-2-numerical-stability", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "\"\"\"\n", - "🤔 **Question 2: Debugging Numerical Stability in Production**\n", - "\n", - "Your cross-entropy loss function works perfectly in development, but in production you start seeing NaN losses that crash training after several hours.\n", - "\n", - "**Root Cause Analysis:**\n", - "Based on your implementation of the log-sum-exp trick and epsilon clipping:\n", - "1. What specific numerical computations in cross-entropy can produce NaN values?\n", - "2. Walk through how your `max_logits = np.max(prediction_logits, axis=1, keepdims=True)` prevents overflow\n", - "3. Explain why `np.clip(softmax_pred, epsilon, 1.0 - epsilon)` prevents underflow\n", - "4. What would happen if you removed epsilon clipping? Trace through the computation.\n", - "\n", - "**Production Debugging:**\n", - "Given millions of training examples, how would you:\n", - "1. 
Identify which specific inputs trigger the numerical instability?\n", - "2. Modify your CrossEntropy implementation to add monitoring without affecting performance?\n", - "3. Design fallback behavior when numerical issues are detected?\n", - "4. Validate that your fixes don't change the mathematical behavior for normal inputs?\n", - "\n", - "**Comparison Analysis:**\n", - "- How does your stable Binary CrossEntropy formulation `max(x,0) - x*y + log(1 + exp(-|x|))` prevent similar issues?\n", - "- Why is MSE generally more numerically stable than CrossEntropy?\n", - "- How would you modify loss functions for mixed precision (FP16) training where numerical ranges are more limited?\n", - "\n", - "Research how PyTorch and TensorFlow handle these same challenges in their loss implementations.\n", - "\"\"\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8f65771b", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "question-3-custom-loss-design", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "\"\"\"\n", - "🤔 **Question 3: Implementing and Optimizing Custom Loss Functions**\n", - "\n", - "You've seen examples of custom loss functions for business objectives. Now analyze implementation and optimization challenges:\n", - "\n", - "**Scenario Analysis:**\n", - "Choose one custom loss from the examples (Asymmetric BCE, Focal Loss, Ranking-Aware, Multi-Task, or Contrastive) and analyze:\n", - "\n", - "**Implementation Deep Dive:**\n", - "1. Trace through the numerical computation step-by-step for your chosen custom loss\n", - "2. Identify potential numerical stability issues compared to standard loss functions\n", - "3. How does the computational complexity compare to MSE/CrossEntropy/Binary CE?\n", - "4. What additional memory overhead does the custom formulation introduce?\n", - "\n", - "**Gradient Flow Analysis:**\n", - "5. 
How do the custom weighting schemes affect gradient magnitudes during backpropagation?\n", - "6. What happens to gradient flow when the custom weights become extreme (very large or very small)?\n", - "7. How would you detect and handle gradient explosion or vanishing in your custom loss?\n", - "8. Design gradient clipping strategies specific to your chosen custom loss function\n", - "\n", - "**Production Integration Challenges:**\n", - "9. How would you implement your custom loss to work with mixed precision training (FP16)?\n", - "10. What logging and monitoring would you add to track custom loss behavior in production?\n", - "11. How would you A/B test a custom loss against standard losses without affecting user experience?\n", - "12. Design a rollback strategy if the custom loss causes training instability\n", - "\n", - "**Performance Optimization:**\n", - "13. Identify computational bottlenecks in your chosen custom loss implementation\n", - "14. How could you vectorize operations to improve batch processing efficiency?\n", - "15. What caching strategies could reduce redundant computations?\n", - "16. How would you benchmark training speed impact compared to standard losses?\n", - "\n", - "**Business Validation Framework:**\n", - "17. Design metrics to validate that your custom loss actually improves business objectives\n", - "18. How would you separate loss function improvements from other training improvements?\n", - "19. What offline evaluation would you perform before deploying the custom loss?\n", - "20. How would you monitor for unexpected business metric changes after deployment?\n", - "\n", - "Implement one optimization for your chosen custom loss and explain how it addresses a specific production challenge.\n", - "\"\"\"" - ] - }, - { - "cell_type": "markdown", - "id": "4ed8ca84", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🎯 MODULE SUMMARY: Loss Functions - Learning Objectives Made Mathematical\n", - "\n", - "Congratulations! 
You've successfully implemented the complete foundation for neural network training objectives:\n", - "\n", - "### What You've Accomplished\n", - "✅ **Complete Loss Function Library**: MSE for regression, CrossEntropy for multi-class classification, and Binary CrossEntropy for binary classification with production-grade numerical stability\n", - "✅ **Systems Engineering Understanding**: Deep comprehension of computational complexity, memory scaling, and numerical stability requirements for reliable ML systems\n", - "✅ **Mathematical Implementation Mastery**: Built loss functions from mathematical foundations through stable computational formulations to working code\n", - "✅ **Production Readiness Knowledge**: Understanding of how loss function choice affects training speed, memory usage, and deployment feasibility\n", - "✅ **Framework Integration Insight**: Clear connection between your implementations and how PyTorch/TensorFlow solve the same problems\n", - "\n", - "### Key Learning Outcomes\n", - "- **Loss Function Theory**: How mathematical loss functions translate business objectives into optimization targets that neural networks can learn from\n", - "- **Numerical Stability Engineering**: Critical importance of stable implementations that prevent catastrophic training failures in production systems\n", - "- **Systems Performance Analysis**: Understanding of computational complexity, memory scaling, and performance trade-offs that affect production deployment\n", - "- **Production ML Patterns**: Knowledge of how loss function choice affects system architecture, monitoring requirements, and debugging complexity\n", - "\n", - "### Mathematical Foundations Mastered \n", - "- **MSE computation**: `(1/n) × Σ(y_pred - y_true)²` with smooth quadratic gradients for regression optimization\n", - "- **CrossEntropy with stable softmax**: Log-sum-exp trick and epsilon clipping for numerically robust classification\n", - "- **Binary CrossEntropy stability**: `max(x,0) - x×y 
+ log(1 + exp(-|x|))` formulation preventing overflow/underflow issues\n", - "- **Gradient behavior understanding**: How different loss functions create different optimization landscapes and learning dynamics\n", - "\n", - "### Professional Skills Developed\n", - "- **Production-quality implementation**: Robust numerical stability measures that prevent training failures with real-world data\n", - "- **Performance optimization**: Understanding of computational and memory complexity that affects scalability and deployment\n", - "- **Systems debugging**: Knowledge of how to identify and fix numerical stability issues in production ML systems\n", - "- **Framework integration**: Clear understanding of how your implementations connect to professional ML development workflows\n", - "\n", - "### Ready for Advanced Applications\n", - "Your loss function implementations now enable:\n", - "- **Complete training loops** that optimize neural networks on real datasets with proper convergence monitoring\n", - "- **Custom loss functions** that align with specific business objectives and domain requirements\n", - "- **Production deployment** with confidence in numerical stability and performance characteristics\n", - "- **Advanced optimization** techniques that build on solid loss function foundations\n", - "\n", - "### Connection to Real ML Systems\n", - "Your implementations mirror the essential patterns used in:\n", - "- **PyTorch's loss functions**: Same mathematical formulations with identical numerical stability measures\n", - "- **TensorFlow's losses**: Equivalent computational patterns and production-grade error handling\n", - "- **Production ML pipelines**: The exact loss functions that power real ML systems at companies like Google, Meta, and OpenAI\n", - "- **Research frameworks**: Foundation for experimenting with novel loss functions and training objectives\n", - "\n", - "### Next Steps\n", - "With solid loss function implementations, you're ready to:\n", - "1. 
**Export your module**: `tito module complete 04_losses`\n", - "2. **Validate integration**: `tito test --module losses`\n", - "3. **Explore autograd integration**: See how loss functions connect with automatic differentiation\n", - "4. **Ready for Module 06**: Build automatic gradient computation that makes loss-based learning possible!\n", - "\n", - "**Your achievement**: You've built the mathematical foundation that transforms predictions into learning signals - the critical bridge between model outputs and optimization objectives that makes neural network training possible!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bfc087a8", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "final-demo", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "if __name__ == \"__main__\":\n", - " print(\"🔥 TinyTorch Loss Functions Module - Complete Demo\")\n", - " print(\"=\" * 55)\n", - " \n", - " # Test all core implementations\n", - " print(\"\\n🧪 Testing All Loss Functions:\")\n", - " test_unit_mse_loss()\n", - " test_unit_crossentropy_loss()\n", - " test_unit_binary_crossentropy_loss()\n", - " test_unit_comprehensive_loss_integration()\n", - " \n", - " # Run systems analysis functions\n", - " print(\"\\n\" + \"=\"*60)\n", - " print(\"🔍 Systems Analysis Functions\")\n", - " print(\"=\" * 30)\n", - "\n", - " visualize_loss_landscapes()\n", - " analyze_mse_properties()\n", - " analyze_crossentropy_stability()\n", - " analyze_binary_crossentropy_efficiency()\n", - " analyze_mixed_precision_considerations()\n", - " analyze_loss_performance_characteristics()\n", - " analyze_numerical_stability_edge_cases()\n", - " analyze_production_deployment_patterns()\n", - " \n", - " print(\"\\n\" + \"=\"*60)\n", - " print(\"📊 Loss Function Usage Examples\")\n", - " print(\"=\" * 35)\n", - " \n", - " # Example 1: Regression with MSE\n", - " print(\"\\n1. 
Regression Example (Predicting House Prices):\")\n", - " mse = MeanSquaredError()\n", - " house_predictions = Tensor([[250000, 180000, 320000]]) # Predicted prices\n", - " house_actual = Tensor([[240000, 175000, 315000]]) # Actual prices\n", - " regression_loss = mse(house_predictions, house_actual)\n", - " print(f\" House price prediction loss: ${regression_loss.data:,.0f}² average error\")\n", - " \n", - " # Example 2: Multi-class classification with CrossEntropy\n", - " print(\"\\n2. Multi-Class Classification Example (Image Recognition):\")\n", - " ce = CrossEntropyLoss()\n", - " image_logits = Tensor([[2.1, 0.5, -0.3, 1.8, 0.1], # Model outputs for 5 classes\n", - " [-0.2, 3.1, 0.8, -1.0, 0.4]]) # (cat, dog, bird, fish, rabbit)\n", - " true_classes = Tensor([0, 1]) # First image = cat, second = dog\n", - " classification_loss = ce(image_logits, true_classes)\n", - " print(f\" Image classification loss: {classification_loss.data:.4f}\")\n", - " \n", - " # Example 3: Binary classification with BCE\n", - " print(\"\\n3. 
Binary Classification Example (Spam Detection):\")\n", - " bce = BinaryCrossEntropyLoss()\n", - " spam_logits = Tensor([[1.2], [-0.8], [2.1], [-1.5]]) # Spam prediction logits\n", - " spam_labels = Tensor([[1.0], [0.0], [1.0], [0.0]]) # 1=spam, 0=not spam\n", - " spam_loss = bce(spam_logits, spam_labels)\n", - " print(f\" Spam detection loss: {spam_loss.data:.4f}\")\n", - " \n", - " print(\"\\n\" + \"=\"*60)\n", - " print(\"🎯 Loss Function Characteristics\")\n", - " print(\"=\" * 35)\n", - " \n", - " # Compare perfect vs imperfect predictions\n", - " print(\"\\n📊 Perfect vs Random Predictions:\")\n", - " \n", - " # Perfect predictions\n", - " perfect_mse = mse(Tensor([5.0]), Tensor([5.0]))\n", - " perfect_ce = ce(Tensor([[10.0, 0.0, 0.0]]), Tensor([0]))\n", - " perfect_bce = bce(Tensor([10.0]), Tensor([1.0]))\n", - " \n", - " print(f\" Perfect MSE loss: {perfect_mse.data:.6f}\")\n", - " print(f\" Perfect CE loss: {perfect_ce.data:.6f}\")\n", - " print(f\" Perfect BCE loss: {perfect_bce.data:.6f}\")\n", - " \n", - " # Random predictions\n", - " random_mse = mse(Tensor([3.0]), Tensor([5.0])) # Off by 2\n", - " random_ce = ce(Tensor([[0.0, 0.0, 0.0]]), Tensor([0])) # Uniform distribution\n", - " random_bce = bce(Tensor([0.0]), Tensor([1.0])) # 50% confidence\n", - " \n", - " print(f\" Random MSE loss: {random_mse.data:.6f}\")\n", - " print(f\" Random CE loss: {random_ce.data:.6f}\")\n", - " print(f\" Random BCE loss: {random_bce.data:.6f}\")\n", - " \n", - " print(\"\\n🎉 Complete loss function foundation ready!\")\n", - " print(\" ✅ MSE for regression problems\")\n", - " print(\" ✅ CrossEntropy for multi-class classification\")\n", - " print(\" ✅ Binary CrossEntropy for binary classification\")\n", - " print(\" ✅ Numerically stable implementations\")\n", - " print(\" ✅ Production-ready batch processing\")\n", - " print(\" ✅ Systems analysis and performance insights\")\n", - " print(\" ✅ Ready for neural network training!\")" - ] - } - ], - "metadata": { - "jupytext": 
{ - "main_language": "python" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/modules_old/04_losses/losses_dev.py b/modules_old/04_losses/losses_dev.py deleted file mode 100644 index 8f286f20..00000000 --- a/modules_old/04_losses/losses_dev.py +++ /dev/null @@ -1,2386 +0,0 @@ -# --- -# jupyter: -# jupytext: -# text_representation: -# extension: .py -# format_name: percent -# format_version: '1.3' -# jupytext_version: 1.17.1 -# --- - -# %% [markdown] -""" -# Loss Functions - Learning Objectives Made Mathematical - -Welcome to Loss Functions! You'll implement the critical bridge between model predictions and learning objectives that makes neural network training possible. - -## 🔗 Building on Previous Learning -**What You Built Before**: -- Module 02 (Tensor): Data structures for predictions and targets -- Module 03 (Activations): Nonlinear transformations for model outputs -- Module 04 (Layers): Complete neural network layers that produce predictions - -**What's Working**: You can build networks that transform inputs into predictions! - -**The Gap**: Predictions aren't learning objectives - you need to measure how "wrong" predictions are and provide gradient signals for improvement. - -**This Module's Solution**: Implement MSE, CrossEntropy, and BinaryCrossEntropy loss functions with numerical stability. - -**Connection Map**: -``` -Layers -> Loss Functions -> Gradients -(predictions) (objectives) (learning signals) -``` - -## Learning Objectives - -By completing this module, you will: - -1. **Understand loss functions** - Learn how to measure the quality of model predictions -2. **Implement MSE Loss** - Build loss functions for regression problems -3. **Implement CrossEntropy Loss** - Create loss functions for classification tasks -4. **Handle numerical stability** - Deal with edge cases and extreme values safely -5. **Enable learning** - Provide the feedback signal that allows networks to improve - -## Build -> Use -> Reflect -1. 
**Build**: MSE, CrossEntropy, and BinaryCrossEntropy loss functions with proper error handling -2. **Use**: Apply different loss functions to real prediction problems and compare results -3. **Reflect**: Understand when to use each loss function and why numerical stability matters - -## What You'll Achieve -- **Mathematical understanding**: How loss functions quantify prediction quality -- **Implementation skills**: Building robust loss functions with error handling -- **Problem matching**: Choosing the right loss function for different ML tasks -- **Numerical awareness**: Understanding and preventing common computational issues -- **Training foundation**: Enabling the learning process that makes neural networks work -""" - -# %% nbgrader={"grade": false, "grade_id": "losses-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} -#| default_exp core.losses - -#| export -import numpy as np -import sys -import os - -# Import our building blocks - Tensor first, autograd operations if available -try: - from tinytorch.core.tensor import Tensor -except ImportError: - # For development, import from local modules - sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor')) - from tensor_dev import Tensor - -# Pure tensor evolution approach: -# - Loss functions use basic Tensor operations directly -# - Module 05 will add gradient tracking via decorator pattern -# - Clean separation of concerns enables focused learning - -# %% nbgrader={"grade": false, "grade_id": "losses-setup", "locked": false, "schema_version": 3, "solution": false, "task": false} -print("🔥 TinyTorch Loss Functions Module") -print(f"NumPy version: {np.__version__}") -print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}") -print("Ready to build loss functions for neural network training!") - -# %% [markdown] -""" -## Where This Code Lives in the Final Package - -**Learning Side:** You work in modules/04_losses/losses_dev.py -**Building 
Side:** Code exports to tinytorch.core.losses - -```python -# Final package structure: -from tinytorch.core.losses import MeanSquaredError, CrossEntropyLoss, BinaryCrossEntropyLoss # All loss functions! -from tinytorch.core.tensor import Tensor # The foundation -from tinytorch.core.layers import Linear, Sequential # Network components -``` - -**Why this matters:** -- **Learning:** Focused module for understanding loss functions and training objectives -- **Production:** Proper organization like PyTorch's torch.nn with all loss functions together -- **Consistency:** All loss functions live together in core.losses for easy access -- **Integration:** Works seamlessly with tensors and neural networks for complete training systems -""" - -# %% [markdown] -""" -# Understanding Loss Functions in Neural Networks - -## What are Loss Functions? - -Loss functions are the mathematical bridge between what your model predicts and what you want it to learn. They quantify the "distance" between predictions and reality. 
- -``` -Business Goal: "Predict house prices accurately" - v -Mathematical Loss: MSE = (predicted_price - actual_price)² - v -Optimization Signal: gradient = 2 * (predicted - actual) - v -Learning Update: parameter -= learning_rate * gradient -``` - -## The Learning Ecosystem - -Loss functions provide four critical capabilities: - -🎯 **Learning Objectives**: Define what "good" performance means mathematically -📈 **Gradient Signal**: Provide directional improvement information for parameters -🔍 **Progress Measurement**: Enable monitoring training progress and convergence detection -⚖️ **Trade-off Control**: Balance different aspects of model performance and regularization - -## Visual Understanding: Loss Function Landscape - -``` -Loss Function Behavior: - MSE Loss CrossEntropy Loss - High | /\\ High | /\\ - | / \\ | / \\ - | / \\ | / \\ - | / \\ | / \\ - Low |/ \\ Low | / \\ - +-------------- +-------------- - Wrong Right Wrong Right - - • Smooth gradients • Steep near wrong predictions - • Quadratic penalty • Gentle near correct predictions - • Good for regression • Good for classification -``` - -Different loss functions create different optimization landscapes that affect how your model learns! -""" - -# %% [markdown] -""" -# Mean Squared Error - Foundation for Regression - -MSE is the cornerstone loss function for regression problems. It measures prediction quality by penalizing large errors more than small ones. - -## Visual Understanding: MSE Behavior - -``` -MSE Loss Visualization: - - Loss | /\\ - 4 | / \\ • Error = 2 -> Loss = 4 - 3 | / \\ • Error = 1 -> Loss = 1 - 2 | / \\ • Error = 0 -> Loss = 0 - 1 | / \\ • Quadratic penalty! 
- 0 |/__________\\____ - -2 -1 0 1 2 - Error - -Gradient Flow: - dLoss/dprediction = 2 * (predicted - actual) - - Large errors -> Large gradients -> Big updates - Small errors -> Small gradients -> Fine tuning -``` - -## Mathematical Foundation - -For batch of predictions and targets: -``` -MSE = (1/n) * Sum(y_pred - y_true)² - -Gradient: dMSE/dy_pred = (2/n) * (y_pred - y_true) -``` - -## Learning Objectives -By implementing MSE, you'll understand: -- How regression loss functions translate continuous prediction errors into optimization signals -- Why squared error creates smooth, well-behaved gradients for stable optimization -- How batch processing enables efficient training on multiple samples simultaneously -- The connection between mathematical loss formulations and practical ML training dynamics -""" - -# %% nbgrader={"grade": false, "grade_id": "mse-concept-question", "locked": false, "schema_version": 3, "solution": false, "task": false} -""" -🤔 **Computational Question: MSE Properties** - -Before implementing, let's understand MSE behavior: - -1. If you predict house price as $300k but actual is $250k, what's the MSE? -2. If you predict $310k but actual is $250k, what's the MSE? -3. Which error gets penalized more heavily and why? -4. How does this relate to the quadratic penalty we visualized? - -This understanding will guide your implementation approach. 
-""" - -# %% nbgrader={"grade": false, "grade_id": "mse-loss-implementation", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class MeanSquaredError: - """ - Mean Squared Error Loss for Regression Problems - - Computes the average squared difference between predictions and targets: - MSE = (1/n) * Sum(y_pred - y_true)² - - Features: - - Numerically stable computation - - Efficient batch processing - - Clean gradient properties for optimization - - Compatible with tensor operations - - Example Usage: - mse = MeanSquaredError() - loss = mse(predictions, targets) # Returns scalar loss value - """ - - def __init__(self): - """Initialize MSE loss function.""" - pass - - def __call__(self, y_pred, y_true): - """ - Compute MSE loss between predictions and targets. - - Args: - y_pred: Model predictions (Tensor, shape: [batch_size, ...]) - y_true: True targets (Tensor, shape: [batch_size, ...]) - - Returns: - Tensor with scalar loss value - - TODO: Implement MSE computation with proper tensor handling. - - APPROACH: - 1. Convert inputs to tensors for consistent processing - 2. Compute element-wise prediction errors (differences) - 3. Square the errors to create quadratic penalty - 4. 
Take mean across all elements for final loss - - EXAMPLE: - >>> mse = MeanSquaredError() - >>> pred = Tensor([[1.0, 2.0]]) - >>> true = Tensor([[1.5, 1.5]]) - >>> loss = mse(pred, true) - >>> print(loss.data) - 0.25 # [(1.0-1.5)² + (2.0-1.5)²] / 2 = [0.25 + 0.25] / 2 - - HINTS: - - Use np.mean() for efficient batch averaging - - Element-wise operations work naturally with tensor.data - - Return result wrapped in Tensor for consistent interface - """ - ### BEGIN SOLUTION - # Step 1: Ensure we have tensor inputs for consistent processing - if not isinstance(y_pred, Tensor): - y_pred = Tensor(y_pred) - if not isinstance(y_true, Tensor): - y_true = Tensor(y_true) - - # Step 2: Compute mean squared error with element-wise operations - prediction_errors = y_pred.data - y_true.data # Element-wise difference - squared_errors = prediction_errors * prediction_errors # Element-wise squaring - mean_loss = np.mean(squared_errors) # Average across all elements - - return Tensor(mean_loss) - ### END SOLUTION - - def forward(self, y_pred, y_true): - """Alternative interface for forward pass.""" - return self.__call__(y_pred, y_true) - -# 🔍 SYSTEMS INSIGHT: Gradient Landscape Visualization -def visualize_loss_landscapes(): - """Visualize how different loss functions create different optimization landscapes.""" - print("🔍 Loss Function Landscape Visualization") - print("=" * 45) - - try: - import numpy as np - - # Create prediction space for visualization - prediction_range = np.linspace(-3, 3, 100) - true_value = 0.0 # Target value - - print("\n📈 Loss Landscape Comparison:") - print(" How loss changes as predictions move away from target") - - # Calculate loss landscapes - mse = MeanSquaredError() - _ = CrossEntropyLoss() # Not used in this comparison - bce = BinaryCrossEntropyLoss() - - # MSE landscape (regression) - mse_losses = [] - for pred in prediction_range: - loss = mse(Tensor([pred]), Tensor([true_value])) - mse_losses.append(loss.data) - - # Binary CE 
landscape (classification) - bce_losses = [] - for pred in prediction_range: - loss = bce(Tensor([pred]), Tensor([1.0])) # Target: positive class - bce_losses.append(loss.data) - - # Find key gradient characteristics - mse_gradient_at_zero = 2 * (0 - true_value) # MSE gradient formula - mse_gradient_at_one = 2 * (1 - true_value) - - print(f"\n🎯 Gradient Behavior Analysis:") - print(f" MSE gradient at prediction=0: {mse_gradient_at_zero:.3f}") - print(f" MSE gradient at prediction=1: {mse_gradient_at_one:.3f}") - print(f" MSE provides linear gradient growth") - - # Binary CE gradient analysis - sigmoid_at_zero = 1 / (1 + np.exp(-0)) # = 0.5 - bce_grad_at_zero = sigmoid_at_zero - 1.0 # = -0.5 - sigmoid_at_one = 1 / (1 + np.exp(-1)) # ~= 0.73 - bce_grad_at_one = sigmoid_at_one - 1.0 # ~= -0.27 - - print(f" BCE gradient at logit=0: {bce_grad_at_zero:.3f}") - print(f" BCE gradient at logit=1: {bce_grad_at_one:.3f}") - print(f" BCE provides adaptive gradient magnitude") - - # Visualize ASCII loss curves - print(f"\n📊 Loss Function Shapes (ASCII visualization):") - print(f" Prediction range: {prediction_range[0]:.1f} to {prediction_range[-1]:.1f}") - - # Sample key points for visualization - sample_points = [-2, -1, 0, 1, 2] - print(f"\n {'Prediction':>10} {'MSE Loss':>10} {'BCE Loss':>10} {'Gradient Type':>15}") - print(f" {'-'*10} {'-'*10} {'-'*10} {'-'*15}") - - for point in sample_points: - mse_loss = mse(Tensor([point]), Tensor([0.0])) - bce_loss = bce(Tensor([point]), Tensor([1.0])) - - # Characterize gradient steepness - if abs(point) < 0.5: - grad_type = "Gentle" - elif abs(point) < 1.5: - grad_type = "Moderate" - else: - grad_type = "Steep" - - print(f" {point:>10.1f} {mse_loss.data:>10.3f} {bce_loss.data:>10.3f} {grad_type:>15}") - - # Optimization implications - print(f"\n🚀 Optimization Implications:") - print(f" MSE (Regression):") - print(f" • Quadratic penalty grows smoothly") - print(f" • Large errors -> large gradients (aggressive correction)") - 
print(f" • Small errors -> small gradients (fine-tuning)") - print(f" • Symmetric around target value") - - print(f" Binary CrossEntropy (Classification):") - print(f" • Logarithmic penalty creates adaptive gradients") - print(f" • Wrong confident predictions -> steep gradients") - print(f" • Right confident predictions -> gentle gradients") - print(f" • Asymmetric penalty structure encourages confidence") - - # 💡 WHY THIS MATTERS: Different loss landscapes create different - # optimization dynamics. MSE's smooth quadratic surface enables - # stable gradient descent, while CrossEntropy's adaptive gradients - # help classification models learn faster from confident mistakes. - - except Exception as e: - print(f"⚠️ Visualization error: {e}") - print("Ensure loss functions are implemented for landscape analysis") - -# 🔍 SYSTEMS INSIGHT: MSE Computational Analysis -def analyze_mse_properties(): - """Analyze MSE loss characteristics for systems understanding.""" - print("🔍 MSE Loss Analysis - Understanding the Math") - print("=" * 45) - - try: - mse = MeanSquaredError() - - # Error magnitude vs loss relationship - print("\n📊 Error Magnitude vs Loss (Quadratic Penalty):") - errors = [0.1, 0.5, 1.0, 2.0, 5.0] - for error in errors: - pred = Tensor([error]) - true = Tensor([0.0]) - loss = mse(pred, true) - print(f" Error: {error:4.1f} -> Loss: {loss.data:8.3f} (* {loss.data/(error**2):5.1f} baseline)") - - # Batch vs individual processing - print(f"\n⚡ Batch Processing Efficiency:") - single_losses = [] - for _ in range(100): - pred = Tensor([np.random.randn()]) - true = Tensor([np.random.randn()]) - loss = mse(pred, true) - single_losses.append(loss.data) - - # Batch version - batch_pred = Tensor(np.random.randn(100)) - batch_true = Tensor(np.random.randn(100)) - batch_loss = mse(batch_pred, batch_true) - - individual_mean = np.mean(single_losses) - print(f" Individual losses mean: {individual_mean:.6f}") - print(f" Batch loss: 
{batch_loss.data:.6f}") - print(f" Difference: {abs(individual_mean - batch_loss.data):.8f}") - - # Memory efficiency analysis - import sys - small_tensor = Tensor([1.0]) - large_tensor = Tensor(np.random.randn(1000)) - - print(f"\n💾 Memory Efficiency:") - print(f" Small loss memory: {sys.getsizeof(small_tensor.data)} bytes") - print(f" Large loss memory: {sys.getsizeof(large_tensor.data)} bytes") - print(f" MSE memory is independent of input size!") - - # 💡 WHY THIS MATTERS: MSE provides stable, well-behaved gradients - # that are proportional to error magnitude, making optimization smooth. - # The quadratic penalty means large errors dominate learning initially, - # then fine-tuning happens as errors get smaller. - - except Exception as e: - print(f"⚠️ Analysis error: {e}") - print("Ensure MSE implementation is complete before running analysis") - -# %% [markdown] -""" -### 🧪 Unit Test: MSE Loss Computation -This test validates `MeanSquaredError.__call__`, ensuring correct MSE computation with various input types and batch sizes. - -**What we're testing**: MSE correctly measures prediction quality with quadratic penalty -**Why it matters**: MSE must provide smooth gradients for stable regression training -**Expected**: Zero loss for perfect predictions, increasing quadratic penalty for larger errors - -### MSE Loss Test Cases Visualization - -``` -Test Case 1 - Perfect Predictions: -Predicted: [[1.0, 2.0], [3.0, 4.0]] -Actual: [[1.0, 2.0], [3.0, 4.0]] ← Identical! -MSE Loss: 0.0 ← Perfect prediction = no penalty - -Test Case 2 - Small Errors: -Predicted: [[1.1, 2.1], [3.1, 4.1]] ← Each prediction off by 0.1 -Actual: [[1.0, 2.0], [3.0, 4.0]] -Errors: [0.1, 0.1, 0.1, 0.1] ← Uniform small error -MSE Loss: (0.1²+0.1²+0.1²+0.1²)/4 = 0.01 - -Test Case 3 - Large Error Impact: -Error = 1.0 → Loss contribution = 1.0² = 1.0 -Error = 2.0 → Loss contribution = 2.0² = 4.0 ← 2× error = 4× penalty! -Error = 3.0 → Loss contribution = 3.0² = 9.0 ← 3× error = 9× penalty! 

Loss Landscape (quadratic bowl):
 Loss
  ↑
9 | \                     /
4 |   \                 /    Large errors heavily penalized
1 |     \             /      Small errors lightly penalized
0 |       \____0____/        Perfect prediction (error = 0) has zero loss
    -3  -2  -1  0  1  2  3  → Error
```
"""

# %% nbgrader={"grade": true, "grade_id": "test-mse-loss", "locked": true, "points": 3, "schema_version": 3, "solution": false, "task": false}
def test_unit_mse_loss():
    """Test MSE loss implementation."""
    print("🔬 Unit Test: Mean Squared Error Loss...")

    mse = MeanSquaredError()

    # Test case 1: Perfect predictions (loss should be 0)
    y_pred = Tensor([[1.0, 2.0], [3.0, 4.0]])
    y_true = Tensor([[1.0, 2.0], [3.0, 4.0]])
    loss = mse(y_pred, y_true)
    assert abs(loss.data) < 1e-6, f"Perfect predictions should have loss ~= 0, got {loss.data}"
    print("PASS Perfect predictions test passed")

    # Test case 2: Known loss computation
    y_pred = Tensor([[1.0, 2.0]])
    y_true = Tensor([[0.0, 1.0]])
    loss = mse(y_pred, y_true)
    expected = 1.0  # [(1-0)² + (2-1)²] / 2 = [1 + 1] / 2 = 1.0
    assert abs(loss.data - expected) < 1e-6, f"Expected loss {expected}, got {loss.data}"
    print("PASS Known loss computation test passed")

    # Test case 3: Batch processing
    y_pred = Tensor([[1.0, 2.0], [3.0, 4.0]])
    y_true = Tensor([[1.5, 2.5], [2.5, 3.5]])
    loss = mse(y_pred, y_true)
    expected = 0.25  # All squared differences are 0.25
    assert abs(loss.data - expected) < 1e-6, f"Expected batch loss {expected}, got {loss.data}"
    print("PASS Batch processing test passed")

    # Test case 4: Single value
    y_pred = Tensor([5.0])
    y_true = Tensor([3.0])
    loss = mse(y_pred, y_true)
    expected = 4.0  # (5-3)² = 4
    assert abs(loss.data - expected) < 1e-6, f"Expected single value loss {expected}, got {loss.data}"
    print("PASS Single value test passed")

    print("CELEBRATE MSE loss tests passed! 
Understanding regression objectives.") - -test_unit_mse_loss() - -# %% [markdown] -""" -# Cross-Entropy Loss - Foundation for Multi-Class Classification - -Cross-Entropy Loss measures the "information distance" between predicted probability distributions and true class labels. It's the gold standard for classification problems. - -## Visual Understanding: Cross-Entropy Behavior - -``` -Cross-Entropy Loss for 3-Class Problem: - -Class Probabilities after Softmax: - Input: [2.0, 1.0, 0.1] -> Probabilities: [0.66, 0.24, 0.10] - True: Class 0 (index 0) -> Target: [1.0, 0.0, 0.0] - -Loss Computation: - CE = -log(probability_of_correct_class) - CE = -log(0.66) = 0.415 - -Intuition: - High confidence + Correct -> Low loss - High confidence + Wrong -> High loss - Low confidence + Any -> Medium loss - -Gradient Behavior: - Wrong predictions -> Steep gradients -> Big corrections - Right predictions -> Gentle gradients -> Fine tuning -``` - -## Numerical Stability Challenge - -``` -The Numerical Stability Problem: - - Raw logits: [50.0, 49.0, 48.0] - Naive softmax: exp(50)/[exp(50)+exp(49)+exp(48)] - Problem: exp(50) ~= 5*10²¹ -> Overflow! - -Our Solution (Log-Sum-Exp Trick): - 1. max_val = max(logits) = 50.0 - 2. stable_logits = [0.0, -1.0, -2.0] # Subtract max - 3. exp([0.0, -1.0, -2.0]) = [1.0, 0.37, 0.14] - 4. 
Safe softmax: [0.67, 0.25, 0.09] -``` - -## Mathematical Foundation - -For predictions and class indices: -``` -CrossEntropy = -Sum y_true * log(softmax(y_pred)) - -Softmax: softmax(x_i) = exp(x_i) / Sum exp(x_j) -Stable: softmax(x_i) = exp(x_i - max(x)) / Sum exp(x_j - max(x)) -``` - -## Learning Objectives -By implementing Cross-Entropy, you'll understand: -- How classification losses work with probability distributions and information theory -- Why softmax normalization creates proper probability distributions for multi-class problems -- The critical importance of numerical stability in exponential and logarithmic computations -- How cross-entropy naturally encourages confident, correct predictions through its gradient structure -""" - -# %% nbgrader={"grade": false, "grade_id": "crossentropy-concept-question", "locked": false, "schema_version": 3, "solution": false, "task": false} -""" -THINK **Computational Question: CrossEntropy Stability** - -Consider numerical stability in cross-entropy: - -1. What happens if you compute exp(100) directly? -2. Why does subtracting the maximum value prevent overflow? -3. What happens if log(0) occurs during loss computation? -4. How does epsilon clipping prevent this issue? - -Understanding these edge cases is crucial for reliable implementation. -""" - -# %% nbgrader={"grade": false, "grade_id": "crossentropy-loss-implementation", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class CrossEntropyLoss: - """ - Cross-Entropy Loss for Multi-Class Classification Problems - - Computes the cross-entropy between predicted probability distributions - and true class labels with numerically stable implementation. 
- - Features: - - Numerically stable softmax computation using log-sum-exp trick - - Support for both class indices and one-hot encoding - - Efficient batch processing with proper broadcasting - - Automatic handling of edge cases and extreme values - - Example Usage: - ce_loss = CrossEntropyLoss() - loss = ce_loss(logits, class_indices) # Returns scalar loss value - """ - - def __init__(self): - """Initialize CrossEntropy loss function.""" - pass - - def __call__(self, y_pred, y_true): - """ - Compute CrossEntropy loss between predictions and targets. - - Args: - y_pred: Model predictions/logits (Tensor, shape: [batch_size, num_classes]) - y_true: True class indices (Tensor, shape: [batch_size]) or one-hot encoding - - Returns: - Tensor with scalar loss value - - TODO: Implement CrossEntropy with numerically stable softmax computation. - - APPROACH: - 1. Convert inputs to tensors and handle single samples - 2. Apply log-sum-exp trick for numerically stable softmax - 3. Clip probabilities to prevent log(0) issues - 4. 
Compute cross-entropy based on target format (indices vs one-hot)

        EXAMPLE:
        >>> ce = CrossEntropyLoss()
        >>> logits = Tensor([[2.0, 1.0, 0.0]])  # Raw model outputs
        >>> targets = Tensor([0])               # Class 0 is correct
        >>> loss = ce(logits, targets)
        >>> print(loss.data)
        0.408  # -log(softmax([2.0, 1.0, 0.0])[0])

        HINTS:
        - Use np.max(axis=1, keepdims=True) for stable max computation
        - Use np.clip(probabilities, 1e-15, 1.0-1e-15) to prevent log(0)
        - Handle both index format [0,1,2] and one-hot format [[1,0,0], [0,1,0]]
        - Use advanced indexing: probs[np.arange(batch_size), class_indices]
        """
        ### BEGIN SOLUTION
        # Step 1: Ensure we have tensor inputs for consistent processing
        if not isinstance(y_pred, Tensor):
            y_pred = Tensor(y_pred)  # Convert predictions to tensor format
        if not isinstance(y_true, Tensor):
            y_true = Tensor(y_true)  # Convert targets to tensor format

        # Step 2: Extract numpy arrays for computation
        prediction_logits = y_pred.data  # Raw model outputs (pre-softmax)
        target_labels = y_true.data      # True class indices or one-hot vectors

        # Step 3: Handle both single predictions and batches consistently
        if prediction_logits.ndim == 1:
            prediction_logits = prediction_logits.reshape(1, -1)  # Convert to batch format [1, num_classes]

        # Step 4: Apply numerically stable softmax transformation
        # Subtract max to prevent overflow: exp(x-max) is equivalent but stable
        max_logits = np.max(prediction_logits, axis=1, keepdims=True)
        exp_pred = np.exp(prediction_logits - max_logits)
        softmax_pred = exp_pred / np.sum(exp_pred, axis=1, keepdims=True)

        # Step 5: Prevent numerical instability in log computation
        epsilon = 1e-15  # Small value to prevent log(0) -> -inf
        softmax_pred = np.clip(softmax_pred, epsilon, 1.0 - epsilon)

        # Step 6: Compute cross-entropy loss based on target format
        if len(target_labels.shape) == 1:
            # Format A: y_true contains class indices [0, 1, 2, ...]
- batch_size = target_labels.shape[0] - # Extract probabilities for correct classes using advanced indexing - correct_class_probs = softmax_pred[np.arange(batch_size), target_labels.astype(int)] - log_probs = np.log(correct_class_probs) - loss_value = -np.mean(log_probs) # Negative log-likelihood - else: - # Format B: y_true is one-hot encoded [[1,0,0], [0,1,0], ...] - log_probs = np.log(softmax_pred) - # Multiply one-hot targets with log probabilities, sum across classes - weighted_log_probs = target_labels * log_probs - loss_value = -np.mean(np.sum(weighted_log_probs, axis=1)) - - return Tensor(loss_value) - ### END SOLUTION - - def forward(self, y_pred, y_true): - """Alternative interface for forward pass.""" - return self.__call__(y_pred, y_true) - -# MAGNIFY SYSTEMS INSIGHT: CrossEntropy Stability Analysis -def analyze_crossentropy_stability(): - """Analyze numerical stability in cross-entropy computation.""" - print("MAGNIFY CrossEntropy Stability Analysis") - print("=" * 40) - - try: - ce = CrossEntropyLoss() - - # Test numerical stability with extreme values - print("\nSPEED Numerical Stability Testing:") - - # Extreme logits that would overflow in naive implementation - extreme_logits = Tensor([[100.0, 99.0, 98.0]]) - safe_labels = Tensor([0]) - - loss = ce(extreme_logits, safe_labels) - print(f" Extreme logits [100, 99, 98]: Loss = {loss.data:.6f}") - print(f" No overflow or NaN: {not np.isnan(loss.data) and not np.isinf(loss.data)}") - - # Test epsilon clipping effectiveness - print(f"\n🛡️ Epsilon Clipping Protection:") - very_confident = Tensor([[10.0, -10.0, -10.0]]) # Very confident about class 0 - confident_labels = Tensor([0]) - - loss = ce(very_confident, confident_labels) - print(f" Very confident correct prediction: Loss = {loss.data:.6f}") - print(f" Should be near 0: {loss.data < 0.01}") - - # Compare different confidence levels - print(f"\n📊 Confidence vs Loss Relationship:") - confidence_levels = [ - ("Low confidence", [[0.1, 0.0, -0.1]]), - 
("Medium confidence", [[1.0, 0.0, -1.0]]), - ("High confidence", [[5.0, 0.0, -5.0]]), - ("Very high", [[10.0, 0.0, -10.0]]) - ] - - for name, logits in confidence_levels: - test_logits = Tensor(logits) - test_loss = ce(test_logits, Tensor([0])) - print(f" {name:15}: Loss = {test_loss.data:.6f}") - - # Memory efficiency for large vocabularies - print(f"\n💾 Memory Scaling Analysis:") - small_vocab = Tensor(np.random.randn(32, 100)) # 100 classes - large_vocab = Tensor(np.random.randn(32, 10000)) # 10k classes - - import sys - small_memory = sys.getsizeof(small_vocab.data) - large_memory = sys.getsizeof(large_vocab.data) - - print(f" Small vocab (100 classes): {small_memory / 1024:.1f} KB") - print(f" Large vocab (10k classes): {large_memory / 1024:.1f} KB") - print(f" Memory scales O(batch_size * num_classes)") - - # TIP WHY THIS MATTERS: CrossEntropy memory scales with vocabulary size. - # This is why large language models use techniques like hierarchical softmax - # or sampling-based training to handle vocabularies with 50k+ tokens. - - except Exception as e: - print(f"WARNING️ Analysis error: {e}") - print("Ensure CrossEntropy implementation is complete") - -# %% [markdown] -""" -### 🧪 Unit Test: Cross-Entropy Loss Computation -This test validates `CrossEntropyLoss.__call__`, ensuring correct cross-entropy computation with numerically stable softmax. 
- -**What we're testing**: CrossEntropy provides correct classification loss with numerical stability -**Why it matters**: CrossEntropy must handle extreme logits safely and encourage correct predictions -**Expected**: High loss for wrong predictions, low loss for correct predictions, numerical stability - -### CrossEntropy Loss Test Cases Visualization - -``` -Classification Scenario: 3-class classification (Cat, Dog, Bird) - -Test Case 1 - Perfect Confidence: -Logits: [[10, 0, 0], [0, 10, 0]] ← Very confident predictions -True: [0, 1] ← Cat, Dog -Softmax: [[≈1, 0, 0], [0, ≈1, 0]] ← Near-perfect probabilities -CE Loss: ≈0.0 ← Minimal penalty for confidence - -Test Case 2 - Wrong but Confident: -Logits: [[0, 0, 10]] ← Confident Bird prediction -True: [0] ← Actually Cat! -Softmax: [[0, 0, ≈1]] ← Wrong class gets ≈100% -CE Loss: ≈10.0 ← Heavy penalty for wrong confidence - -Test Case 3 - Uncertain (Good): -Logits: [[0, 0, 0]] ← Completely uncertain -True: [0] ← Cat -Softmax: [[0.33, 0.33, 0.33]] ← Equal probabilities -CE Loss: 1.099 ← Moderate penalty for uncertainty - -Loss Behavior Pattern: - Loss ↑ - 10 | ● (wrong + confident = disaster) - | - 5 | - | - 1 | ● (uncertain = acceptable) - | - 0 | ● (correct + confident = ideal) - +________________→ Confidence - Wrong Uncertain Correct - -Numerical Stability: -Input: [1000, 0, -1000] → Subtract max: [0, -1000, -2000] -Result: Prevents overflow while preserving relative differences -``` -""" - -# %% nbgrader={"grade": true, "grade_id": "test-crossentropy-loss", "locked": true, "points": 4, "schema_version": 3, "solution": false, "task": false} -def test_unit_crossentropy_loss(): - """Test CrossEntropy loss implementation.""" - print("🔬 Unit Test: Cross-Entropy Loss...") - - ce = CrossEntropyLoss() - - # Test case 1: Perfect predictions - y_pred = Tensor([[10.0, 0.0, 0.0], [0.0, 10.0, 0.0]]) # Very confident correct predictions - y_true = Tensor([0, 1]) # Class indices - loss = ce(y_pred, y_true) - assert loss.data < 
0.1, f"Perfect predictions should have low loss, got {loss.data}" - print("PASS Perfect predictions test passed") - - # Test case 2: Random predictions (should have higher loss) - y_pred = Tensor([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]) # Uniform after softmax - y_true = Tensor([0, 1]) - loss = ce(y_pred, y_true) - expected_random = -np.log(1.0/3.0) # log(1/num_classes) for uniform distribution - assert abs(loss.data - expected_random) < 0.1, f"Random predictions should have loss ~= {expected_random}, got {loss.data}" - print("PASS Random predictions test passed") - - # Test case 3: Binary classification - y_pred = Tensor([[2.0, 1.0], [1.0, 2.0]]) - y_true = Tensor([0, 1]) - loss = ce(y_pred, y_true) - assert 0.0 < loss.data < 2.0, f"Binary classification loss should be reasonable, got {loss.data}" - print("PASS Binary classification test passed") - - # Test case 4: One-hot encoded labels - y_pred = Tensor([[2.0, 1.0, 0.0], [0.0, 2.0, 1.0]]) - y_true = Tensor([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]) # One-hot encoded - loss = ce(y_pred, y_true) - assert 0.0 < loss.data < 2.0, f"One-hot encoded loss should be reasonable, got {loss.data}" - print("PASS One-hot encoded labels test passed") - - print("CELEBRATE Cross-Entropy loss tests passed! Understanding classification objectives.") - -test_unit_crossentropy_loss() - -# %% [markdown] -""" -# Binary Cross-Entropy Loss - Optimized for Binary Classification - -Binary Cross-Entropy Loss is the specialized, efficient version of cross-entropy for binary (two-class) problems. It's more stable and faster than using regular cross-entropy with 2 classes. 
- -## Visual Understanding: Binary Cross-Entropy - -``` -Binary Classification Landscape: - -Sigmoid Activation: - Raw Logit -> Sigmoid -> Probability -> Loss - -5.0 -> 0.007 -> 0.007 -> High loss (if true=1) - 0.0 -> 0.500 -> 0.500 -> Medium loss - +5.0 -> 0.993 -> 0.993 -> Low loss (if true=1) - -Loss Behavior: - BCE = -[y*log(p) + (1-y)*log(1-p)] - - For y=1 (positive class): - p=0.9 -> -log(0.9) = 0.105 (low loss) - p=0.1 -> -log(0.1) = 2.303 (high loss) - - For y=0 (negative class): - p=0.1 -> -log(0.9) = 0.105 (low loss) - p=0.9 -> -log(0.1) = 2.303 (high loss) -``` - -## Numerical Stability Solution - -``` -The Binary Cross-Entropy Stability Problem: - - BCE = -[y*log(σ(x)) + (1-y)*log(1-σ(x))] - - Where σ(x) = 1/(1+exp(-x)) - - Problems: - - Large positive x: exp(-x) -> 0, then log(1) -> 0 (loss of precision) - - Large negative x: σ(x) -> 0, then log(0) -> -inf - -Our Stable Solution: - BCE = max(x,0) - x*y + log(1 + exp(-|x|)) - - Why this works: - - max(x,0) handles positive values - - -x*y is the "cross" term - - log(1+exp(-|x|)) is always stable (exp<=1) -``` - -## Mathematical Foundation - -For binary predictions and labels: -``` -BCE = -y * log(σ(x)) - (1-y) * log(1-σ(x)) - -Stable form: BCE = max(x,0) - x*y + log(1 + exp(-|x|)) -``` - -## Learning Objectives -By implementing Binary Cross-Entropy, you'll understand: -- How binary classification creates simpler optimization landscapes than multi-class problems -- Why sigmoid activation naturally pairs with binary cross-entropy loss through its gradient structure -- The critical importance of numerically stable formulations for reliable production training -- How specialized binary losses achieve better efficiency and stability than general solutions -""" - -# %% nbgrader={"grade": false, "grade_id": "binary-crossentropy-concept", "locked": false, "schema_version": 3, "solution": false, "task": false} -""" -THINK **Computational Question: Binary Stability** - -Consider the stable BCE formulation: - -1. 
Why does max(x,0) - x*y + log(1+exp(-|x|)) work? -2. What happens when x=100? (trace through the computation) -3. What happens when x=-100? (trace through the computation) -4. How does this prevent both overflow and underflow? - -This mathematical insight is crucial for production systems. -""" - -# %% nbgrader={"grade": false, "grade_id": "binary-crossentropy-implementation", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class BinaryCrossEntropyLoss: - """ - Binary Cross-Entropy Loss for Binary Classification Problems - - Computes binary cross-entropy between predictions and binary labels - with numerically stable sigmoid + BCE implementation. - - Features: - - Numerically stable computation from logits using stable BCE formula - - Efficient batch processing with vectorized operations - - Automatic sigmoid application through stable formulation - - Robust to extreme input values without overflow/underflow - - Example Usage: - bce_loss = BinaryCrossEntropyLoss() - loss = bce_loss(logits, binary_labels) # Returns scalar loss value - """ - - def __init__(self): - """Initialize Binary CrossEntropy loss function.""" - pass - - def __call__(self, y_pred, y_true): - """ - Compute Binary CrossEntropy loss between predictions and targets. - - Args: - y_pred: Model predictions/logits (Tensor, shape: [batch_size, 1] or [batch_size]) - y_true: True binary labels (Tensor, shape: [batch_size, 1] or [batch_size]) - - Returns: - Tensor with scalar loss value - - TODO: Implement stable binary cross-entropy using the logits formulation. - - APPROACH: - 1. Convert inputs to tensors and flatten for consistent processing - 2. Use stable BCE formula: max(x,0) - x*y + log(1+exp(-|x|)) - 3. Apply this formula element-wise across the batch - 4. 
Return mean loss across all samples

        EXAMPLE:
        >>> bce = BinaryCrossEntropyLoss()
        >>> logits = Tensor([[2.0], [-1.0]])  # Raw outputs
        >>> labels = Tensor([[1.0], [0.0]])   # Binary targets
        >>> loss = bce(logits, labels)
        >>> print(loss.data)
        0.220  # Mean of the stable per-sample BCE values [0.127, 0.313]

        HINTS:
        - Use np.maximum(logits, 0) for the max(x,0) term
        - Use np.abs(logits) to ensure exp argument is <= 0
        - The formula naturally handles both positive and negative logits
        - Return np.mean() for batch averaging
        """
        ### BEGIN SOLUTION
        # Step 1: Ensure we have tensor inputs for consistent processing
        if not isinstance(y_pred, Tensor):
            y_pred = Tensor(y_pred)  # Convert predictions to tensor format
        if not isinstance(y_true, Tensor):
            y_true = Tensor(y_true)  # Convert targets to tensor format

        # Step 2: Flatten to 1D arrays for element-wise computation
        logits = y_pred.data.flatten()
        labels = y_true.data.flatten()

        # Step 3: Define numerically stable binary cross-entropy computation
        def stable_bce_with_logits(logits, labels):
            """
            Numerically stable BCE using the logits formulation:
            BCE(logits, y) = max(logits, 0) - logits * y + log(1 + exp(-|logits|))

            This formulation prevents:
            - exp(large_positive_logit) -> overflow
            - log(very_small_sigmoid) -> -inf

            Mathematical equivalence:
            - For positive logits: x - x*y + log(1 + exp(-x))
            - For negative logits: -x*y + log(1 + exp(x))
            """
            # Step 3a: Handle positive logits to prevent exp(large_positive) overflow
            positive_part = np.maximum(logits, 0)

            # Step 3b: Subtract logit-label product (the "cross" in cross-entropy)
            cross_term = logits * labels

            # Step 3c: Add log(1 + exp(-|logits|)) for numerical stability
            # Using abs(logits) ensures the exponent is always negative or zero
            stability_term = np.log(1 + np.exp(-np.abs(logits)))

            return positive_part - cross_term + stability_term

        # Step 4: Apply stable BCE computation across the batch
        individual_losses = 
stable_bce_with_logits(logits, labels) - mean_loss = np.mean(individual_losses) # Average loss across batch - - return Tensor(mean_loss) - ### END SOLUTION - - def forward(self, y_pred, y_true): - """Alternative interface for forward pass.""" - return self.__call__(y_pred, y_true) - -# MAGNIFY SYSTEMS INSIGHT: Binary CrossEntropy Efficiency Analysis -def analyze_binary_crossentropy_efficiency(): - """Analyze binary cross-entropy computational efficiency.""" - print("MAGNIFY Binary CrossEntropy Efficiency Analysis") - print("=" * 45) - - try: - bce = BinaryCrossEntropyLoss() - ce = CrossEntropyLoss() # For comparison - - # Compare binary-specific vs general cross-entropy - print("\nSPEED Binary vs Multi-Class Efficiency:") - - # Binary problem solved two ways - binary_logits = Tensor([[1.5], [-0.8], [2.1]]) - binary_labels = Tensor([[1.0], [0.0], [1.0]]) - - # Method 1: Binary CrossEntropy - binary_loss = bce(binary_logits, binary_labels) - - # Method 2: 2-class CrossEntropy (equivalent but less efficient) - multiclass_logits = Tensor([[1.5, 0.0], [-0.8, 0.0], [2.1, 0.0]]) - multiclass_labels = Tensor([0, 1, 0]) # Convert to class indices - multiclass_loss = ce(multiclass_logits, multiclass_labels) - - print(f" Binary CE Loss: {binary_loss.data:.6f}") - print(f" 2-Class CE Loss: {multiclass_loss.data:.6f}") - print(f" Difference: {abs(binary_loss.data - multiclass_loss.data):.8f}") - - # Memory efficiency comparison - print(f"\n💾 Memory Efficiency Comparison:") - - batch_size = 1000 - binary_memory = batch_size * 1 * 8 # 1 value per sample, 8 bytes per float64 - multiclass_memory = batch_size * 2 * 8 # 2 classes, 8 bytes per float64 - - print(f" Binary approach: {binary_memory / 1024:.1f} KB") - print(f" Multi-class (2): {multiclass_memory / 1024:.1f} KB") - print(f" Binary is {multiclass_memory/binary_memory:.1f}* more memory efficient") - - # Stability test with extreme values - print(f"\n🛡️ Extreme Value Stability:") - extreme_tests = [ - ("Large positive", 
[[100.0]], [[1.0]]), - ("Large negative", [[-100.0]], [[0.0]]), - ("Mixed extreme", [[100.0], [-100.0]], [[1.0], [0.0]]) - ] - - for name, logits, labels in extreme_tests: - test_logits = Tensor(logits) - test_labels = Tensor(labels) - loss = bce(test_logits, test_labels) - is_stable = not (np.isnan(loss.data) or np.isinf(loss.data)) - print(f" {name:15}: Loss = {loss.data:.6f}, Stable = {is_stable}") - - # TIP WHY THIS MATTERS: Binary CrossEntropy is 2* more memory efficient - # than regular CrossEntropy for binary problems, and provides better - # numerical stability through its specialized formulation. - - except Exception as e: - print(f"WARNING️ Analysis error: {e}") - print("Ensure BinaryCrossEntropy implementation is complete") - -# %% [markdown] -""" -### TEST Unit Test: Binary Cross-Entropy Loss -This test validates `BinaryCrossEntropyLoss.__call__`, ensuring stable binary cross-entropy computation with extreme values. -""" - -# %% nbgrader={"grade": true, "grade_id": "test-binary-crossentropy", "locked": true, "points": 4, "schema_version": 3, "solution": false, "task": false} -def test_unit_binary_crossentropy_loss(): - """Test Binary CrossEntropy loss implementation.""" - print("TEST Testing Binary Cross-Entropy Loss...") - - bce = BinaryCrossEntropyLoss() - - # Test case 1: Perfect predictions - y_pred = Tensor([[10.0], [-10.0]]) # Very confident correct predictions - y_true = Tensor([[1.0], [0.0]]) - loss = bce(y_pred, y_true) - assert loss.data < 0.1, f"Perfect predictions should have low loss, got {loss.data}" - print("PASS Perfect predictions test passed") - - # Test case 2: Random predictions (should have higher loss) - y_pred = Tensor([[0.0], [0.0]]) # 0.5 probability after sigmoid - y_true = Tensor([[1.0], [0.0]]) - loss = bce(y_pred, y_true) - expected_random = -np.log(0.5) # log(0.5) for random guessing - assert abs(loss.data - expected_random) < 0.1, f"Random predictions should have loss ~= {expected_random}, got {loss.data}" - print("PASS 
Random predictions test passed") - - # Test case 3: Batch processing - y_pred = Tensor([[1.0], [2.0], [-1.0]]) - y_true = Tensor([[1.0], [1.0], [0.0]]) - loss = bce(y_pred, y_true) - assert 0.0 < loss.data < 2.0, f"Batch processing loss should be reasonable, got {loss.data}" - print("PASS Batch processing test passed") - - # Test case 4: Extreme values (test numerical stability) - y_pred = Tensor([[100.0], [-100.0]]) # Extreme logits - y_true = Tensor([[1.0], [0.0]]) - loss = bce(y_pred, y_true) - assert not np.isnan(loss.data) and not np.isinf(loss.data), f"Extreme values should not cause NaN/Inf, got {loss.data}" - assert loss.data < 1.0, f"Extreme correct predictions should have low loss, got {loss.data}" - print("PASS Extreme values test passed") - - print("CELEBRATE Binary Cross-Entropy loss tests passed! Understanding binary objectives.") - -test_unit_binary_crossentropy_loss() - -# %% [markdown] -""" -# Custom Loss Functions - Aligning with Business Objectives - -Beyond standard loss functions, production ML systems often need custom losses that align with specific business objectives and domain constraints. 

## Business-Aligned Loss Design Patterns

### Asymmetric Loss Functions
When false positives and false negatives have different costs:

```python
# Medical diagnosis: False negatives (missing disease) cost 10x more
class AsymmetricBinaryCrossEntropy(BinaryCrossEntropyLoss):
    def __init__(self, false_negative_weight=10.0):
        super().__init__()
        self.fn_weight = false_negative_weight

    def __call__(self, y_pred, y_true):
        # Standard BCE
        base_loss = super().__call__(y_pred, y_true)

        # Weight false negatives more heavily
        # When y_true=1 and y_pred is low, increase penalty
        sigmoid_pred = 1 / (1 + np.exp(-y_pred.data))
        fn_penalty = y_true.data * (1 - sigmoid_pred) * self.fn_weight

        weighted_loss = base_loss.data + np.mean(fn_penalty)
        return Tensor(weighted_loss)
```

### Focal Loss for Imbalanced Data
Addresses class imbalance by focusing on hard examples:

```python
class FocalLoss(CrossEntropyLoss):
    def __init__(self, alpha=1.0, gamma=2.0):
        super().__init__()
        self.alpha = alpha  # Class balance weight
        self.gamma = gamma  # Focusing parameter

    def __call__(self, y_pred, y_true):
        # Calculate stable softmax probabilities
        max_logits = np.max(y_pred.data, axis=1, keepdims=True)
        stable_logits = y_pred.data - max_logits
        exp_logits = np.exp(stable_logits)
        softmax_probs = exp_logits / np.sum(exp_logits, axis=1, keepdims=True)

        # Per-sample probability of the correct class
        batch_size = y_true.data.shape[0]
        correct_probs = softmax_probs[np.arange(batch_size), y_true.data.astype(int)]
        correct_probs = np.clip(correct_probs, 1e-15, 1.0)

        # Apply the focal loss formula per sample, then average:
        # -α(1-p)^γ log(p) down-weights easy (high-p) examples
        focal_losses = -self.alpha * ((1 - correct_probs) ** self.gamma) * np.log(correct_probs)
        return Tensor(np.mean(focal_losses))
```
"""

# %% [markdown]
"""
### Ranking-Aware Loss
For problems where order matters (search, recommendations):
"""

# %% nbgrader={"grade": false, "grade_id": 
"ranking-loss", "solution": true} -class RankingAwareLoss: - def __init__(self, position_weights=None): - # Higher weights for top positions - self.position_weights = position_weights or [10.0, 5.0, 2.0, 1.0, 0.5] - - def __call__(self, predictions, targets, positions): - """predictions: relevance scores, targets: true relevance, positions: result positions""" - # Not using MeanSquaredError() - computing directly - - # Weight errors by position importance - weighted_errors = [] - for pred, target, pos in zip(predictions.data, targets.data, positions.data): - pos_weight = self.position_weights[min(int(pos), len(self.position_weights)-1)] - error = ((pred - target) ** 2) * pos_weight - weighted_errors.append(error) - - return Tensor(np.mean(weighted_errors)) - -# %% [markdown] -""" -## Advanced Custom Loss Patterns - -### Multi-Task Learning Loss -Combining multiple objectives with learned weights: -""" - -# %% nbgrader={"grade": false, "grade_id": "multitask-loss", "solution": true} -class MultiTaskLoss: - def __init__(self, num_tasks=3): - # Learnable loss weights (log-variance parameterization for stability) - self.log_vars = [0.0] * num_tasks - - def __call__(self, predictions_list, targets_list): - """predictions_list: [task1_preds, task2_preds, ...]""" - total_loss = 0 - - for i, (preds, targets) in enumerate(zip(predictions_list, targets_list)): - # Choose appropriate loss for each task - if i == 0: # Regression task - task_loss = MeanSquaredError()(preds, targets) - else: # Classification tasks - task_loss = CrossEntropyLoss()(preds, targets) - - # Uncertainty-weighted combination - precision = np.exp(-self.log_vars[i]) - weighted_loss = precision * task_loss.data + self.log_vars[i] - total_loss += weighted_loss - - return Tensor(total_loss) - -# %% [markdown] -""" -### Contrastive Loss for Similarity Learning -For learning embeddings and similarity: -""" - -# %% nbgrader={"grade": false, "grade_id": "contrastive-loss", "solution": true} -class 
ContrastiveLoss: - def __init__(self, margin=1.0): - self.margin = margin - - def __call__(self, embeddings1, embeddings2, labels): - """labels: 1 for similar pairs, 0 for dissimilar""" - # Euclidean distance between embeddings - distances = np.sqrt(np.sum((embeddings1.data - embeddings2.data) ** 2, axis=1)) - - # Contrastive loss formula - positive_loss = labels.data * (distances ** 2) - negative_loss = (1 - labels.data) * np.maximum(0, self.margin - distances) ** 2 - - total_loss = 0.5 * (positive_loss + negative_loss) - return Tensor(np.mean(total_loss)) - -# %% [markdown] -""" -## Custom Loss Implementation Guidelines - -### Numerical Stability Considerations -""" - -# %% nbgrader={"grade": false, "grade_id": "stable-loss", "solution": true} -# Always include stability measures in custom losses -class StableCustomLoss: - def __call__(self, predictions, targets): - # 1. Input validation - if not isinstance(predictions, Tensor): - predictions = Tensor(predictions) - - # 2. Handle edge cases - # predictions_clipped would be used here for actual computation - # predictions_clipped = np.clip(predictions.data, -100, 100) # Prevent overflow - - # 3. Use numerically stable formulations - # Avoid: exp(large_number), log(small_number) - # Use: log-sum-exp trick, epsilon clipping - - # 4. Compute loss (example - actual implementation depends on loss type) - computed_loss = np.mean((predictions.data - targets.data) ** 2) - - # 5. 
Return tensor for consistency - return Tensor(computed_loss) - -# %% [markdown] -""" -### Gradient-Friendly Design -```python -# Ensure gradients flow properly -class GradientFriendlyLoss: - def __call__(self, predictions, targets): - # Avoid operations that create zero gradients: - # - Hard thresholding: use soft approximations - # - Discrete operations: use continuous relaxations - # - Large plateaus: ensure non-zero gradients everywhere - - # Good: Smooth, differentiable operations - smooth_loss = self.smooth_l1_loss(predictions, targets) - return smooth_loss - - def smooth_l1_loss(self, pred, target, beta=1.0): - \"\"\"Smooth L1 loss - less sensitive to outliers than MSE\"\"\" - diff = np.abs(pred.data - target.data) - loss = np.where(diff < beta, - 0.5 * diff * diff / beta, - diff - 0.5 * beta) - return Tensor(np.mean(loss)) -``` -""" - -# %% [markdown] -""" -# Loss Function Application Guide and Comparison - -## When to Use Each Loss Function - -Understanding which loss function to use is critical for successful ML projects: - -### Mean Squared Error (MSE) - Regression Problems -``` -Use when: Predicting continuous values -Examples: House prices, temperature, stock values, ages -Output: Any real number -Activation: Usually none (linear output) -Penalty: Quadratic (large errors >> small errors) - -Model Architecture: -Input -> Hidden Layers -> Linear Output -> MSE Loss -``` - -### Cross-Entropy Loss - Multi-Class Classification -``` -Use when: Choosing one class from 3+ options -Examples: Image classification, text categorization, medical diagnosis -Output: Probability distribution (sums to 1) -Activation: Softmax -Penalty: Logarithmic (encouraging confident correct predictions) - -Model Architecture: -Input -> Hidden Layers -> Softmax -> CrossEntropy Loss -``` - -### Binary Cross-Entropy Loss - Binary Classification -``` -Use when: Binary decisions (yes/no, positive/negative) -Examples: Spam detection, fraud detection, medical screening -Output: Single 
probability (0 to 1) -Activation: Sigmoid -Penalty: Asymmetric (confident wrong predictions heavily penalized) - -Model Architecture: -Input -> Hidden Layers -> Sigmoid -> Binary CrossEntropy Loss -``` - -## Performance and Stability Comparison - -``` -Computational Characteristics: - MSE CrossEntropy Binary CE -Time Complexity: O(n) O(n*c) O(n) -Memory Complexity: O(1) O(n*c) O(n) -Numerical Stability: High Medium High -Convergence Speed: Fast Medium Fast - -Where: n = batch size, c = number of classes -``` - -## Integration with Neural Networks - -```python -# Example training setup for different problem types: - -# Regression Problem (House Price Prediction) -regression_model = Sequential([ - Linear(10, 64), # Input features -> Hidden - ReLU(), - Linear(64, 1), # Hidden -> Single output - # No activation - linear output for regression -]) -loss_fn = MeanSquaredError() - -# Multi-Class Classification (Image Recognition) -classification_model = Sequential([ - Linear(784, 128), # Flattened image -> Hidden - ReLU(), - Linear(128, 10), # Hidden -> 10 classes - Softmax() # Convert to probabilities -]) -loss_fn = CrossEntropyLoss() - -# Binary Classification (Spam Detection) -binary_model = Sequential([ - Linear(100, 64), # Text features -> Hidden - ReLU(), - Linear(64, 1), # Hidden -> Single output - Sigmoid() # Convert to probability -]) -loss_fn = BinaryCrossEntropyLoss() - -# Training loop pattern (same for all): -for batch in dataloader: - predictions = model(batch.inputs) - loss = loss_fn(predictions, batch.targets) - # loss.backward() # Compute gradients (when autograd is available) - # optimizer.step() # Update parameters -``` -""" - -# %% [markdown] -""" -### TEST Comprehensive Integration Test -This test validates all loss functions work together correctly and can be used interchangeably in production systems. 
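That interchangeability rests on a shared calling convention, not on any shared internals. A minimal sketch of the protocol, using standalone NumPy stand-ins (the `MSE`/`MAE` classes and `evaluate` helper here are illustrative, not this module's implementations):

```python
import numpy as np

class MSE:
    # Stand-in regression loss following the __call__(pred, target) protocol
    def __call__(self, pred, target):
        return np.mean((pred - target) ** 2)

class MAE:
    # Stand-in alternative loss exposing the identical interface
    def __call__(self, pred, target):
        return np.mean(np.abs(pred - target))

def evaluate(loss_fn, pred, target):
    # The caller never inspects the loss type: any callable that maps
    # (predictions, targets) to a scalar plugs in unchanged
    return loss_fn(pred, target)

pred, target = np.array([1.0, 2.0, 5.0]), np.array([1.0, 2.0, 3.0])
print(evaluate(MSE(), pred, target))  # ~1.33 (quadratic penalty on the outlier)
print(evaluate(MAE(), pred, target))  # ~0.67 (linear penalty)
```

Because every loss in this module follows the same `__call__` signature, swapping one for another in a training loop is a one-line change.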
-""" - -# %% nbgrader={"grade": false, "grade_id": "comprehensive-loss-tests", "locked": false, "schema_version": 3, "solution": false, "task": false} -def test_unit_comprehensive_loss_integration(): - """Test all loss functions work correctly together.""" - print("🔬 Comprehensive Loss Function Integration Testing") - print("=" * 55) - - # Test 1: All losses can be instantiated - print("\n1. Loss Function Instantiation:") - mse = MeanSquaredError() - ce = CrossEntropyLoss() - bce = BinaryCrossEntropyLoss() - print(" PASS All loss functions created successfully") - - # Test 2: Loss functions return appropriate types - print("\n2. Return Type Verification:") - - # MSE test - pred = Tensor([[1.0, 2.0]]) - target = Tensor([[1.0, 2.0]]) - loss = mse(pred, target) - assert isinstance(loss, Tensor), "MSE should return Tensor" - assert loss.data.shape == (), "MSE should return scalar" - - # Cross-entropy test - pred = Tensor([[1.0, 2.0], [2.0, 1.0]]) - target = Tensor([1, 0]) - loss = ce(pred, target) - assert isinstance(loss, Tensor), "CrossEntropy should return Tensor" - assert loss.data.shape == (), "CrossEntropy should return scalar" - - # Binary cross-entropy test - pred = Tensor([[1.0], [-1.0]]) - target = Tensor([[1.0], [0.0]]) - loss = bce(pred, target) - assert isinstance(loss, Tensor), "Binary CrossEntropy should return Tensor" - assert loss.data.shape == (), "Binary CrossEntropy should return scalar" - - print(" PASS All loss functions return correct types") - - # Test 3: Loss values are reasonable - print("\n3. 
Loss Value Sanity Checks:") - - # All losses should be non-negative - assert mse.forward(Tensor([1.0]), Tensor([2.0])).data >= 0, "MSE should be non-negative" - assert ce.forward(Tensor([[1.0, 0.0]]), Tensor([0])).data >= 0, "CrossEntropy should be non-negative" - assert bce.forward(Tensor([1.0]), Tensor([1.0])).data >= 0, "Binary CrossEntropy should be non-negative" - - print(" PASS All loss functions produce reasonable values") - - # Test 4: Perfect predictions give low loss - print("\n4. Perfect Prediction Tests:") - - perfect_mse = mse(Tensor([5.0]), Tensor([5.0])) - perfect_ce = ce(Tensor([[10.0, 0.0]]), Tensor([0])) - perfect_bce = bce(Tensor([10.0]), Tensor([1.0])) - - assert perfect_mse.data < 1e-10, f"Perfect MSE should be ~0, got {perfect_mse.data}" - assert perfect_ce.data < 0.1, f"Perfect CE should be low, got {perfect_ce.data}" - assert perfect_bce.data < 0.1, f"Perfect BCE should be low, got {perfect_bce.data}" - - print(" PASS Perfect predictions produce low loss") - - print("\nCELEBRATE All comprehensive integration tests passed!") - print(" • Loss functions instantiate correctly") - print(" • Return types are consistent (Tensor scalars)") - print(" • Loss values are mathematically sound") - print(" • Perfect predictions are handled correctly") - print(" • Ready for integration with neural network training!") - -test_unit_comprehensive_loss_integration() - -# %% [markdown] -""" -# Systems Analysis: Loss Function Performance and Engineering - -Let's analyze loss functions from an ML systems engineering perspective, focusing on performance, memory usage, and production implications. 
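Before diving into the numbers, the stability device this analysis leans on most, the log-sum-exp trick, can be isolated in a few lines. This is a standalone NumPy sketch, independent of this module's `Tensor` class; the function names are illustrative:

```python
import numpy as np

def naive_log_softmax(logits):
    # Direct formula: exp() overflows for large logits, producing inf/inf = NaN
    with np.errstate(over='ignore', invalid='ignore', divide='ignore'):
        exps = np.exp(logits)
        return np.log(exps / exps.sum(axis=1, keepdims=True))

def stable_log_softmax(logits):
    # Log-sum-exp trick: subtracting the row max keeps every exponent <= 0,
    # so np.exp() can never overflow; the result is mathematically identical
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

logits = np.array([[1000.0, 0.0, -1000.0]])   # extreme, but realistic mid-divergence
print(np.isnan(naive_log_softmax(logits)).any())  # True: training would collapse
print(stable_log_softmax(logits)[0, 0])           # 0.0: log prob of dominant class
```

The same shift-by-max idea underlies the stable CrossEntropy implementation earlier in this module and the fused softmax+CE kernels in production frameworks.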
- -## Computational Complexity Deep Dive - -``` -Algorithmic Analysis by Loss Type: - -MSE (Mean Squared Error): - Time: O(n) - linear in number of predictions - Space: O(1) - constant additional memory - Operations: n subtractions + n multiplications + 1 mean - Bottleneck: Memory bandwidth (simple arithmetic operations) - -CrossEntropy (Multi-Class): - Time: O(n*c) - linear in samples * classes - Space: O(n*c) - store full probability distributions - Operations: n*c exp + n*c divisions + n*c logs + reductions - Bottleneck: Exponential computations and memory bandwidth - -Binary CrossEntropy: - Time: O(n) - linear in number of samples - Space: O(n) - store one probability per sample - Operations: n max + n multiplications + n exp + n logs - Bottleneck: Transcendental functions (exp, log) -``` - -## Memory Scaling Analysis - -Understanding memory requirements is crucial for large-scale training: - -``` -Memory Requirements by Problem Scale: - -Small Problem (1K samples, 100 classes): - MSE: 8 KB (1K samples * 8 bytes) - CrossEntropy: 800 KB (1K * 100 * 8 bytes) - Binary CE: 16 KB (1K * 2 * 8 bytes) - -Large Problem (100K samples, 10K classes): - MSE: 800 KB (independent of classes!) 
- CrossEntropy: 8 GB (memory bottleneck) - Binary CE: 1.6 MB (scales with samples only) - -Production Scale (1M samples, 50K vocab): - MSE: 8 MB - CrossEntropy: 400 GB (requires distributed memory) - Binary CE: 16 MB -``` - -## Numerical Stability Engineering Analysis - -Production systems must handle edge cases robustly: - -``` -Stability Challenges and Solutions: - -CrossEntropy Stability Issues: - Problem: exp(large_logit) -> overflow -> NaN gradients - Solution: log-sum-exp trick with max subtraction - - Problem: log(very_small_prob) -> -inf -> training collapse - Solution: epsilon clipping (1e-15 to 1-1e-15) - -Binary CrossEntropy Stability Issues: - Problem: sigmoid(large_positive) -> 1.0 -> log(0) issues - Solution: stable logits formulation bypasses sigmoid - - Problem: exp(large_negative) in naive implementation - Solution: max(x,0) - x*y + log(1+exp(-|x|)) formulation -``` -""" - -# %% [markdown] -""" -## Production Performance Benchmarks - -Real-world performance characteristics matter for deployment: - -``` -Inference Throughput (measured on modern hardware): - MSE: ~100M predictions/second - CrossEntropy: ~10M predictions/second - Binary CrossEntropy: ~80M predictions/second - -Training Memory Bandwidth Requirements: - MSE: ~800 MB/s (lightweight computation) - CrossEntropy: ~80 GB/s (100* higher due to softmax!) 
- Binary CE: ~1.6 GB/s (moderate requirements) - -Gradient Computation Overhead: - MSE: 1.1* forward pass time (simple derivatives) - CrossEntropy: 1.5* forward pass time (softmax gradients) - Binary CE: 1.2* forward pass time (sigmoid gradients) -``` - -## Framework Integration and Production Patterns - -Understanding how production systems implement these concepts: - -``` -PyTorch Implementation Patterns: - torch.nn.MSELoss() - Direct implementation, minimal overhead - torch.nn.CrossEntropyLoss() - Fused softmax+CE for efficiency - torch.nn.BCEWithLogitsLoss() - Stable logits formulation - -TensorFlow Implementation Patterns: - tf.keras.losses.MeanSquaredError() - Vectorized operations - tf.keras.losses.SparseCategoricalCrossentropy() - Memory efficient - tf.keras.losses.BinaryCrossentropy() - From logits option - -Production Optimizations: - - Mixed precision (FP16) for memory efficiency - - Gradient accumulation for large batch simulation - - Loss scaling to prevent underflow in mixed precision - - Checkpointing to trade memory for computation -``` - -## Edge Device and Deployment Considerations - -Loss function choice affects deployment feasibility: - -``` -Edge Device Constraints: - Memory-limited (phones, IoT): Prefer Binary CE > MSE > CrossEntropy - CPU-only inference: MSE has best compute efficiency - Real-time requirements: Binary classification most predictable - -Distributed Training Challenges: - CrossEntropy: Requires all-reduce across all classes (expensive!) 
- Gradient accumulation: MSE linear, CrossEntropy non-linear dependencies - Mixed precision: Different overflow handling per loss type - -Monitoring and Debugging: - MSE divergence: Explodes quadratically (easy to detect) - CrossEntropy divergence: More gradual degradation - BCE monitoring: Natural bounded behavior aids debugging -``` -""" - -# MAGNIFY SYSTEMS INSIGHT: Performance Profiling Analysis -def analyze_loss_performance_characteristics(): - """Comprehensive performance analysis of all loss functions.""" - print("MAGNIFY Loss Function Performance Analysis") - print("=" * 45) - - try: - import time - - # Initialize loss functions - mse = MeanSquaredError() - ce = CrossEntropyLoss() - bce = BinaryCrossEntropyLoss() - - print("\nSPEED Computational Complexity Measurement:") - - # Test different batch sizes to see scaling behavior - batch_sizes = [100, 1000, 10000] - - for batch_size in batch_sizes: - print(f"\n Batch size: {batch_size:,}") - - # MSE timing - mse_pred = Tensor(np.random.randn(batch_size, 10)) - mse_true = Tensor(np.random.randn(batch_size, 10)) - - start = time.perf_counter() - for _ in range(100): # Average over multiple runs - _ = mse(mse_pred, mse_true) - mse_time = (time.perf_counter() - start) / 100 - - # CrossEntropy timing - ce_pred = Tensor(np.random.randn(batch_size, 100)) # 100 classes - ce_true = Tensor(np.random.randint(0, 100, batch_size)) - - start = time.perf_counter() - for _ in range(100): - _ = ce(ce_pred, ce_true) - ce_time = (time.perf_counter() - start) / 100 - - # Binary CrossEntropy timing - bce_pred = Tensor(np.random.randn(batch_size, 1)) - bce_true = Tensor(np.random.randint(0, 2, (batch_size, 1)).astype(float)) - - start = time.perf_counter() - for _ in range(100): - _ = bce(bce_pred, bce_true) - bce_time = (time.perf_counter() - start) / 100 - - print(f" MSE: {mse_time*1000:8.3f} ms") - print(f" CrossEntropy: {ce_time*1000:8.3f} ms") - print(f" Binary CE: {bce_time*1000:8.3f} ms") - print(f" CE/MSE ratio: 
{ce_time/mse_time:8.1f}x") - - print("\n💾 Memory Efficiency Analysis:") - - # Compare memory usage for different problem sizes - problem_configs = [ - ("Small (1K samples, 10 classes)", 1000, 10), - ("Medium (10K samples, 100 classes)", 10000, 100), - ("Large (100K samples, 1K classes)", 100000, 1000) - ] - - for name, samples, classes in problem_configs: - print(f"\n {name}:") - - # Memory calculations (bytes) - mse_memory = samples * 8 # One value per sample - ce_memory = samples * classes * 8 # Full probability distribution - bce_memory = samples * 8 # One probability per sample - - print(f" MSE memory: {mse_memory / 1024 / 1024:8.1f} MB") - print(f" CE memory: {ce_memory / 1024 / 1024:8.1f} MB") - print(f" BCE memory: {bce_memory / 1024 / 1024:8.1f} MB") - print(f" CE overhead: {ce_memory/mse_memory:8.1f}x") - - # TIP WHY THIS MATTERS: These performance characteristics determine - # which loss functions are feasible for different deployment scenarios. - # CrossEntropy's O(n*c) memory scaling makes it prohibitive for - # large vocabularies without specialized techniques. 
- - except Exception as e: - print(f"WARNING️ Performance analysis error: {e}") - print("Performance analysis requires complete implementations") - -# MAGNIFY SYSTEMS INSIGHT: Numerical Stability Deep Analysis -def analyze_numerical_stability_edge_cases(): - """Deep analysis of numerical stability across all loss functions.""" - print("MAGNIFY Numerical Stability Edge Case Analysis") - print("=" * 50) - - try: - mse = MeanSquaredError() - ce = CrossEntropyLoss() - bce = BinaryCrossEntropyLoss() - - print("\n🛡️ Extreme Value Stability Testing:") - - # Test extreme values that could cause numerical issues - extreme_tests = [ - ("Huge positive", 1e10), - ("Huge negative", -1e10), - ("Tiny positive", 1e-10), - ("NaN input", float('nan')), - ("Infinity", float('inf')), - ("Negative infinity", float('-inf')) - ] - - for name, value in extreme_tests: - print(f"\n Testing {name} ({value}):") - - # MSE stability - try: - mse_loss = mse(Tensor([value]), Tensor([0.0])) - mse_stable = not (np.isnan(mse_loss.data) or np.isinf(mse_loss.data)) - print(f" MSE stable: {mse_stable} (loss: {mse_loss.data:.3e})") - except: - print(f" MSE stable: False (exception)") - - # CrossEntropy stability - try: - ce_loss = ce(Tensor([[value, 0.0, 0.0]]), Tensor([0])) - ce_stable = not (np.isnan(ce_loss.data) or np.isinf(ce_loss.data)) - print(f" CE stable: {ce_stable} (loss: {ce_loss.data:.3e})") - except: - print(f" CE stable: False (exception)") - - # Binary CrossEntropy stability - try: - bce_loss = bce(Tensor([value]), Tensor([1.0])) - bce_stable = not (np.isnan(bce_loss.data) or np.isinf(bce_loss.data)) - print(f" BCE stable: {bce_stable} (loss: {bce_loss.data:.3e})") - except: - print(f" BCE stable: False (exception)") - - print("\n🔬 Gradient Behavior Analysis:") - - # Analyze gradient magnitudes for different prediction qualities - confidence_levels = [ - ("Very wrong", [[-5.0, 5.0, 0.0]], [0]), # Predict class 1, actual class 0 - ("Slightly wrong", [[-0.5, 0.5, 0.0]], [0]), - 
("Uncertain", [[0.0, 0.0, 0.0]], [0]), - ("Slightly right", [[0.5, -0.5, 0.0]], [0]), - ("Very right", [[5.0, -5.0, 0.0]], [0]) - ] - - print(" Prediction Quality -> CrossEntropy Loss:") - for name, logits, labels in confidence_levels: - loss = ce(Tensor(logits), Tensor(labels)) - print(f" {name:15}: {loss.data:8.4f}") - - # TIP WHY THIS MATTERS: Understanding how loss functions behave - # at extremes helps debug training failures and choose appropriate - # loss scaling and clipping strategies for production systems. - - except Exception as e: - print(f"WARNING️ Stability analysis error: {e}") - print("Stability analysis requires complete implementations") - -# MAGNIFY SYSTEMS INSIGHT: Mixed Precision Training Analysis -def analyze_mixed_precision_considerations(): - """Analyze loss function behavior with FP16 mixed precision training.""" - print("MAGNIFY Mixed Precision Training Analysis") - print("=" * 40) - - try: - print("\nSPEED FP16 Numerical Range Analysis:") - print(" FP16 range: ~±65,504 (much smaller than FP32's ~±3.4*10³⁸)") - - # Simulate FP16 range limitations - fp16_max = 65504.0 - fp16_min_normal = 2**-14 # Smallest normal FP16 number ~= 6.1*10⁻⁵ - - print(f" FP16 maximum: ±{fp16_max:,.0f}") - print(f" FP16 min normal: {fp16_min_normal:.2e}") - print(f" Risk: Gradients/losses exceeding range -> infinity/NaN") - - mse = MeanSquaredError() - # ce = CrossEntropyLoss() # Not used in this test - # bce = BinaryCrossEntropyLoss() # Not used in this test - - print(f"\nTARGET Loss Function Mixed Precision Compatibility:") - - # Test cases that might overflow in FP16 - test_cases = [ - ("Small values", 1.0, 1.1), - ("Medium values", 100.0, 110.0), - ("Large values", 1000.0, 1100.0), - ("FP16 edge", 200.0, 250.0) # Could cause issues when squared - ] - - print(f"\n {'Test Case':>15} {'MSE Loss':>12} {'FP16 Safe?':>12}") - print(f" {'-'*15} {'-'*12} {'-'*12}") - - for name, pred, true in test_cases: - mse_loss = mse(Tensor([pred]), Tensor([true])) - 
squared_error = (pred - true) ** 2 - fp16_safe = squared_error < fp16_max - - print(f" {name:>15} {mse_loss.data:>12.1f} {'PASS' if fp16_safe else 'FAIL':>12}") - - print(f"\n🛡️ Mixed Precision Loss Scaling Strategy:") - - # Demonstrate loss scaling concept - loss_scales = [1.0, 128.0, 1024.0, 8192.0] - base_loss = 0.01 # Small loss that might underflow - - print(f" {'Scale Factor':>12} {'Scaled Loss':>12} {'FP16 Precision':>15}") - print(f" {'-'*12} {'-'*12} {'-'*15}") - - for scale in loss_scales: - scaled_loss = base_loss * scale - - # Check if loss is representable in FP16 - if scaled_loss > fp16_min_normal and scaled_loss < fp16_max: - precision = "Good" - elif scaled_loss <= fp16_min_normal: - precision = "Underflow risk" - else: - precision = "Overflow risk" - - print(f" {scale:>12.0f} {scaled_loss:>12.3f} {precision:>15}") - - print(f"\n⚖️ Loss Function Mixed Precision Recommendations:") - - recommendations = [ - ("MSE", "Monitor for gradient explosion in high-dynamic-range problems", "Medium risk"), - ("CrossEntropy", "Use FP32 for softmax computation, FP16 for storage", "High risk"), - ("Binary CE", "Stable formulation handles FP16 well with proper scaling", "Low risk") - ] - - for loss_type, recommendation, risk in recommendations: - print(f" {loss_type:>12}: {recommendation} ({risk})") - - print(f"\n🔧 Implementation Best Practices for Mixed Precision:") - - best_practices = [ - "1. Use automatic mixed precision (AMP) libraries that handle scaling", - "2. Keep loss computation in FP32, only cast inputs to FP16", - "3. Monitor for overflow/underflow during training", - "4. Use gradient clipping to prevent extreme gradients", - "5. 
Scale losses up during forward pass, scale gradients down during backward" - ] - - for practice in best_practices: - print(f" {practice}") - - # Example mixed precision training pattern - print(f"\n💻 Mixed Precision Training Pattern:") - print(f" ```python") - print(f" # Forward pass in FP16") - print(f" with autocast():") - print(f" predictions = model(inputs.half()) # FP16 inputs") - print(f" loss = loss_fn(predictions, targets) # Loss computed in FP32") - print(f" ") - print(f" # Scale loss to prevent underflow") - print(f" scaled_loss = loss * scale_factor") - print(f" scaled_loss.backward()") - print(f" ") - print(f" # Unscale gradients before optimizer step") - print(f" scaler.step(optimizer) # Automatically unscales gradients") - print(f" ```") - - # TIP WHY THIS MATTERS: Mixed precision training can provide 1.5-2* speedup - # and 50% memory reduction, but loss functions must be carefully implemented - # to handle the reduced numerical precision without losing training stability. 
- - except Exception as e: - print(f"WARNING️ Mixed precision analysis error: {e}") - print("Mixed precision analysis requires complete loss implementations") - -# MAGNIFY SYSTEMS INSIGHT: Production Deployment Analysis -def analyze_production_deployment_patterns(): - """Analyze how loss functions affect production ML system design.""" - print("MAGNIFY Production Deployment Pattern Analysis") - print("=" * 50) - - try: - print("\nROCKET Deployment Scenario Analysis:") - - # Different deployment scenarios with constraints - scenarios = [ - { - "name": "Mobile App (Spam Detection)", - "constraints": "Memory < 50MB, Latency < 100ms", - "problem": "Binary classification", - "recommendation": "Binary CrossEntropy", - "reasoning": "Minimal memory, fast inference, stable numerics" - }, - { - "name": "Cloud API (Image Classification)", - "constraints": "Throughput > 1000 QPS, Cost optimization", - "problem": "1000-class classification", - "recommendation": "CrossEntropy with mixed precision", - "reasoning": "Can handle memory cost, needs throughput" - }, - { - "name": "Edge IoT (Temperature Prediction)", - "constraints": "Memory < 1MB, Power < 1W", - "problem": "Regression", - "recommendation": "MSE with quantization", - "reasoning": "Minimal compute, no transcendental functions" - }, - { - "name": "Large Language Model Training", - "constraints": "50K vocabulary, Multi-GPU", - "problem": "Next token prediction", - "recommendation": "Hierarchical Softmax or Sampling", - "reasoning": "Standard CrossEntropy too memory intensive" - } - ] - - for scenario in scenarios: - print(f"\n 📱 {scenario['name']}:") - print(f" Constraints: {scenario['constraints']}") - print(f" Problem Type: {scenario['problem']}") - print(f" Best Loss: {scenario['recommendation']}") - print(f" Why: {scenario['reasoning']}") - - print("\n⚖️ Production Trade-off Analysis:") - - trade_offs = [ - ("Memory Efficiency", "MSE > Binary CE >> CrossEntropy"), - ("Computational Speed", "MSE > Binary CE > 
CrossEntropy"), - ("Numerical Stability", "MSE ~= Binary CE > CrossEntropy"), - ("Implementation Complexity", "MSE > CrossEntropy > Binary CE"), - ("Gradient Quality", "CrossEntropy > Binary CE > MSE"), - ("Debug-ability", "MSE > Binary CE > CrossEntropy") - ] - - for criterion, ranking in trade_offs: - print(f" {criterion:20}: {ranking}") - - print("\n🔧 Framework Integration Patterns:") - - frameworks = [ - ("PyTorch", "nn.MSELoss(), nn.CrossEntropyLoss(), nn.BCEWithLogitsLoss()"), - ("TensorFlow", "keras.losses.MSE, SparseCategoricalCrossentropy, BinaryCrossentropy"), - ("JAX", "optax.l2_loss, optax.softmax_cross_entropy, optax.sigmoid_binary_cross_entropy"), - ("Production", "Custom implementations with monitoring and fallbacks") - ] - - for framework, losses in frameworks: - print(f" {framework:12}: {losses}") - - # TIP WHY THIS MATTERS: Loss function choice affects every aspect - # of ML system design - from memory requirements to latency to - # debugging complexity. Understanding these trade-offs enables - # informed architectural decisions for production systems. - - except Exception as e: - print(f"WARNING️ Deployment analysis error: {e}") - -# %% [markdown] -""" -## THINK ML Systems Thinking: Interactive Questions - -Now that you've implemented all core loss functions and analyzed their systems characteristics, let's explore their implications for real ML systems: -""" - -# %% nbgrader={"grade": false, "grade_id": "question-1-loss-selection", "locked": false, "schema_version": 3, "solution": false, "task": false} -""" -THINK **Question 1: Loss Function Selection for Production Systems** - -You're building a production recommendation system that predicts user ratings (1-5 stars) for movies. 
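One way to ground the analysis is to probe the two penalty structures numerically. A standalone NumPy sketch (the ratings and probabilities below are hypothetical):

```python
import numpy as np

# Hypothetical case: the true rating is 5 stars.
# Regression view (MSE): the penalty grows with *distance* in stars
mse_off_by_one = (5.0 - 4.0) ** 2    # 1.0  -> predicting 4 stars is a mild error
mse_off_by_four = (5.0 - 1.0) ** 2   # 16.0 -> predicting 1 star costs 16x more

# Classification view (CrossEntropy): the penalty depends only on the
# probability assigned to the true class; *which* wrong star absorbed the
# remaining mass is invisible to the loss
p_true = 0.1                          # model puts 10% on "5 stars"
ce_loss = -np.log(p_true)             # identical whether the rest sits on 4 or 1

print(mse_off_by_one, mse_off_by_four, round(ce_loss, 2))
```

Note how MSE already encodes the ordinal structure that plain CrossEntropy discards, which is exactly the gap the custom ordinal loss in approach C tries to close.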
- -Your team proposes three approaches: -A) Regression approach: Use MSE loss with continuous outputs (1.0-5.0) -B) Classification approach: Use CrossEntropy loss with 5 distinct classes -C) Ordinal approach: Use a custom loss that penalizes being off by multiple stars more heavily - -Analyze each approach considering your implementations: - -**Technical Analysis:** -- How does the memory scaling of CrossEntropy (O(batch_size * num_classes)) affect this 5-class problem? -- What are the computational complexity differences between MSE's O(n) and CrossEntropy's O(n*c) for c=5? -- How do the gradient behaviors differ? (MSE's quadratic vs CrossEntropy's logarithmic penalties) - -**Systems Implications:** -- Which approach would be most memory efficient for large batch training? -- How does numerical stability differ when handling edge cases (ratings at boundaries)? -- Which approach would have the most predictable inference latency? - -**Business Alignment:** -- How well does each loss function's penalty structure match the business objective? -- What happens with fractional ratings like 3.7? How would each approach handle this? -- Which approach would be easiest to monitor and debug in production? - -Recommend an approach with justification based on your implementation experience. -""" - -# %% nbgrader={"grade": false, "grade_id": "question-2-numerical-stability", "locked": false, "schema_version": 3, "solution": false, "task": false} -""" -THINK **Question 2: Debugging Numerical Stability in Production** - -Your cross-entropy loss function works perfectly in development, but in production you start seeing NaN losses that crash training after several hours. - -**Root Cause Analysis:** -Based on your implementation of the log-sum-exp trick and epsilon clipping: -1. What specific numerical computations in cross-entropy can produce NaN values? -2. Walk through how your `max_logits = np.max(prediction_logits, axis=1, keepdims=True)` prevents overflow -3. 
Explain why `np.clip(softmax_pred, epsilon, 1.0 - epsilon)` prevents underflow -4. What would happen if you removed epsilon clipping? Trace through the computation. - -**Production Debugging:** -Given millions of training examples, how would you: -1. Identify which specific inputs trigger the numerical instability? -2. Modify your CrossEntropy implementation to add monitoring without affecting performance? -3. Design fallback behavior when numerical issues are detected? -4. Validate that your fixes don't change the mathematical behavior for normal inputs? - -**Comparison Analysis:** -- How does your stable Binary CrossEntropy formulation `max(x,0) - x*y + log(1 + exp(-|x|))` prevent similar issues? -- Why is MSE generally more numerically stable than CrossEntropy? -- How would you modify loss functions for mixed precision (FP16) training where numerical ranges are more limited? - -Research how PyTorch and TensorFlow handle these same challenges in their loss implementations. -""" - -# %% nbgrader={"grade": false, "grade_id": "question-3-custom-loss-design", "locked": false, "schema_version": 3, "solution": false, "task": false} -""" -THINK **Question 3: Implementing and Optimizing Custom Loss Functions** - -You've seen examples of custom loss functions for business objectives. Now analyze implementation and optimization challenges: - -**Scenario Analysis:** -Choose one custom loss from the examples (Asymmetric BCE, Focal Loss, Ranking-Aware, Multi-Task, or Contrastive) and analyze: - -**Implementation Deep Dive:** -1. Trace through the numerical computation step-by-step for your chosen custom loss -2. Identify potential numerical stability issues compared to standard loss functions -3. How does the computational complexity compare to MSE/CrossEntropy/Binary CE? -4. What additional memory overhead does the custom formulation introduce? - -**Gradient Flow Analysis:** -5. How do the custom weighting schemes affect gradient magnitudes during backpropagation? -6. 
What happens to gradient flow when the custom weights become extreme (very large or very small)? -7. How would you detect and handle gradient explosion or vanishing in your custom loss? -8. Design gradient clipping strategies specific to your chosen custom loss function - -**Production Integration Challenges:** -9. How would you implement your custom loss to work with mixed precision training (FP16)? -10. What logging and monitoring would you add to track custom loss behavior in production? -11. How would you A/B test a custom loss against standard losses without affecting user experience? -12. Design a rollback strategy if the custom loss causes training instability - -**Performance Optimization:** -13. Identify computational bottlenecks in your chosen custom loss implementation -14. How could you vectorize operations to improve batch processing efficiency? -15. What caching strategies could reduce redundant computations? -16. How would you benchmark training speed impact compared to standard losses? - -**Business Validation Framework:** -17. Design metrics to validate that your custom loss actually improves business objectives -18. How would you separate loss function improvements from other training improvements? -19. What offline evaluation would you perform before deploying the custom loss? -20. How would you monitor for unexpected business metric changes after deployment? - -Implement one optimization for your chosen custom loss and explain how it addresses a specific production challenge. -""" - -# %% [markdown] -""" -## TARGET MODULE SUMMARY: Loss Functions - Learning Objectives Made Mathematical - -Congratulations! 
You've successfully implemented the complete foundation for neural network training objectives: - -### What You've Accomplished -PASS **Complete Loss Function Library**: MSE for regression, CrossEntropy for multi-class classification, and Binary CrossEntropy for binary classification with production-grade numerical stability -PASS **Systems Engineering Understanding**: Deep comprehension of computational complexity, memory scaling, and numerical stability requirements for reliable ML systems -PASS **Mathematical Implementation Mastery**: Built loss functions from mathematical foundations through stable computational formulations to working code -PASS **Production Readiness Knowledge**: Understanding of how loss function choice affects training speed, memory usage, and deployment feasibility -PASS **Framework Integration Insight**: Clear connection between your implementations and how PyTorch/TensorFlow solve the same problems - -### Key Learning Outcomes -- **Loss Function Theory**: How mathematical loss functions translate business objectives into optimization targets that neural networks can learn from -- **Numerical Stability Engineering**: Critical importance of stable implementations that prevent catastrophic training failures in production systems -- **Systems Performance Analysis**: Understanding of computational complexity, memory scaling, and performance trade-offs that affect production deployment -- **Production ML Patterns**: Knowledge of how loss function choice affects system architecture, monitoring requirements, and debugging complexity - -### Mathematical Foundations Mastered -- **MSE computation**: `(1/n) * Sum(y_pred - y_true)²` with smooth quadratic gradients for regression optimization -- **CrossEntropy with stable softmax**: Log-sum-exp trick and epsilon clipping for numerically robust classification -- **Binary CrossEntropy stability**: `max(x,0) - x*y + log(1 + exp(-|x|))` formulation preventing overflow/underflow issues -- **Gradient 
behavior understanding**: How different loss functions create different optimization landscapes and learning dynamics - -### Professional Skills Developed -- **Production-quality implementation**: Robust numerical stability measures that prevent training failures with real-world data -- **Performance optimization**: Understanding of computational and memory complexity that affects scalability and deployment -- **Systems debugging**: Knowledge of how to identify and fix numerical stability issues in production ML systems -- **Framework integration**: Clear understanding of how your implementations connect to professional ML development workflows - -### Ready for Advanced Applications -Your loss function implementations now enable: -- **Complete training loops** that optimize neural networks on real datasets with proper convergence monitoring -- **Custom loss functions** that align with specific business objectives and domain requirements -- **Production deployment** with confidence in numerical stability and performance characteristics -- **Advanced optimization** techniques that build on solid loss function foundations - -### Connection to Real ML Systems -Your implementations mirror the essential patterns used in: -- **PyTorch's loss functions**: Same mathematical formulations with identical numerical stability measures -- **TensorFlow's losses**: Equivalent computational patterns and production-grade error handling -- **Production ML pipelines**: The exact loss functions that power real ML systems at companies like Google, Meta, and OpenAI -- **Research frameworks**: Foundation for experimenting with novel loss functions and training objectives - -### Next Steps -With solid loss function implementations, you're ready to: -1. **Export your module**: `tito module complete 04_losses` -2. **Validate integration**: `tito test --module losses` -3. **Explore autograd integration**: See how loss functions connect with automatic differentiation -4. 
**Ready for Module 06**: Build automatic gradient computation that makes loss-based learning possible! - -**Your achievement**: You've built the mathematical foundation that transforms predictions into learning signals - the critical bridge between model outputs and optimization objectives that makes neural network training possible! -""" - -# %% nbgrader={"grade": false, "grade_id": "final-demo", "locked": false, "schema_version": 3, "solution": false, "task": false} -if __name__ == "__main__": - print("🔥 TinyTorch Loss Functions Module - Complete Demo") - print("=" * 55) - - # Test all core implementations - print("\n🧪 Testing All Loss Functions:") - test_unit_mse_loss() - test_unit_crossentropy_loss() - test_unit_binary_crossentropy_loss() - test_unit_comprehensive_loss_integration() - - # Run systems analysis functions - print("\n" + "="*60) - print("🔍 Systems Analysis Functions") - print("=" * 30) - - visualize_loss_landscapes() - analyze_mse_properties() - analyze_crossentropy_stability() - analyze_binary_crossentropy_efficiency() - analyze_mixed_precision_considerations() - analyze_loss_performance_characteristics() - analyze_numerical_stability_edge_cases() - analyze_production_deployment_patterns() - - print("\n" + "="*60) - print("📊 Loss Function Usage Examples") - print("=" * 35) - - # Example 1: Regression with MSE - print("\n1. Regression Example (Predicting House Prices):") - mse = MeanSquaredError() - house_predictions = Tensor([[250000, 180000, 320000]]) # Predicted prices - house_actual = Tensor([[240000, 175000, 315000]]) # Actual prices - regression_loss = mse(house_predictions, house_actual) - print(f" House price prediction loss: ${regression_loss.data:,.0f}² average error") - - # Example 2: Multi-class classification with CrossEntropy - print("\n2. 
Multi-Class Classification Example (Image Recognition):") - ce = CrossEntropyLoss() - image_logits = Tensor([[2.1, 0.5, -0.3, 1.8, 0.1], # Model outputs for 5 classes - [-0.2, 3.1, 0.8, -1.0, 0.4]]) # (cat, dog, bird, fish, rabbit) - true_classes = Tensor([0, 1]) # First image = cat, second = dog - classification_loss = ce(image_logits, true_classes) - print(f" Image classification loss: {classification_loss.data:.4f}") - - # Example 3: Binary classification with BCE - print("\n3. Binary Classification Example (Spam Detection):") - bce = BinaryCrossEntropyLoss() - spam_logits = Tensor([[1.2], [-0.8], [2.1], [-1.5]]) # Spam prediction logits - spam_labels = Tensor([[1.0], [0.0], [1.0], [0.0]]) # 1=spam, 0=not spam - spam_loss = bce(spam_logits, spam_labels) - print(f" Spam detection loss: {spam_loss.data:.4f}") - - print("\n" + "="*60) - print("🎯 Loss Function Characteristics") - print("=" * 35) - - # Compare perfect vs imperfect predictions - print("\n📊 Perfect vs Random Predictions:") - - # Perfect predictions - perfect_mse = mse(Tensor([5.0]), Tensor([5.0])) - perfect_ce = ce(Tensor([[10.0, 0.0, 0.0]]), Tensor([0])) - perfect_bce = bce(Tensor([10.0]), Tensor([1.0])) - - print(f" Perfect MSE loss: {perfect_mse.data:.6f}") - print(f" Perfect CE loss: {perfect_ce.data:.6f}") - print(f" Perfect BCE loss: {perfect_bce.data:.6f}") - - # Random predictions - random_mse = mse(Tensor([3.0]), Tensor([5.0])) # Off by 2 - random_ce = ce(Tensor([[0.0, 0.0, 0.0]]), Tensor([0])) # Uniform distribution - random_bce = bce(Tensor([0.0]), Tensor([1.0])) # 50% confidence - - print(f" Random MSE loss: {random_mse.data:.6f}") - print(f" Random CE loss: {random_ce.data:.6f}") - print(f" Random BCE loss: {random_bce.data:.6f}") - - print("\n🎉 Complete loss function foundation ready!") - print(" ✅ MSE for regression problems") - print(" ✅ CrossEntropy for multi-class classification") - print(" ✅ Binary CrossEntropy for binary classification") - print(" ✅ 
Numerically stable implementations") - print(" ✅ Production-ready batch processing") - print(" ✅ Systems analysis and performance insights") - print(" ✅ Ready for neural network training!") - -# %% [markdown] -""" -## CRITICAL FIX: Autograd-Integrated Loss Functions - -The above implementations use basic Tensor operations without gradient tracking. -For neural network training, we need loss functions that integrate with the autograd system -to enable proper backpropagation through the computational graph. -""" - -# %% nbgrader={"grade": false, "grade_id": "autograd-losses", "solution": true} -#| export -class MSELoss: - """ - Mean Squared Error Loss - Works with both Tensors and Variables - - Initially works with basic Tensors (modules 01-04). - Automatically upgrades to use Variables when autograd is available (module 05+). - This staged approach allows testing loss functions before learning automatic differentiation. - """ - - def __init__(self): - """Initialize MSE loss function.""" - pass - - def __call__(self, predictions, targets): - """ - Compute MSE loss. 
- - Args: - predictions: Model predictions (Tensor/Variable) - targets: True targets (Tensor/Variable) - - Returns: - Scalar loss value (Tensor initially, Variable after autograd) - """ - # Clean Tensor Evolution Pattern: - # - Modules 01-04: Use basic Tensor operations - # - Module 05+: Same operations become autograd-capable automatically - - # Ensure inputs are Tensors - if not isinstance(predictions, Tensor): - predictions = Tensor(predictions) - if not isinstance(targets, Tensor): - targets = Tensor(targets) - - # Compute MSE using clean Tensor operations - diff = predictions - targets # Uses Tensor.__sub__ - squared_diff = diff * diff # Uses Tensor.__mul__ - - # Use numpy for mean calculation (will be enhanced in autograd) - # Access the underlying numpy data for aggregation - mean_loss = Tensor(np.mean(squared_diff.data)) - - return mean_loss - -#| export -class CrossEntropyLoss: - """ - Cross-Entropy Loss - Works with both Tensors and Variables - - Initially works with basic Tensors (modules 01-04). - Automatically upgrades to use Variables when autograd is available (module 05+). - This staged approach allows testing loss functions before learning automatic differentiation. - """ - - def __init__(self): - """Initialize CrossEntropy loss function.""" - self.epsilon = 1e-7 # For numerical stability - - def __call__(self, predictions, targets): - """ - Compute cross-entropy loss. 
- - Args: - predictions: Model predictions/logits (Tensor/Variable) - targets: True class indices (Tensor/Variable or numpy array) - - Returns: - Scalar loss value (Tensor initially, Variable after autograd) - """ - # Clean Tensor Evolution Pattern: Extract data cleanly - # Ensure inputs are Tensors and get their data - if not isinstance(predictions, Tensor): - predictions = Tensor(predictions) - if not isinstance(targets, Tensor): - targets = Tensor(targets) - - pred_data = predictions.data - target_data = targets.data - - # Apply softmax to predictions (numerically stable) - exp_pred = np.exp(pred_data - np.max(pred_data, axis=-1, keepdims=True)) - softmax_pred = exp_pred / np.sum(exp_pred, axis=-1, keepdims=True) - - # Clip for numerical stability - softmax_pred = np.clip(softmax_pred, self.epsilon, 1 - self.epsilon) - - # Compute cross-entropy loss - if len(target_data.shape) == 1 or target_data.shape[-1] == 1: - # Integer labels - batch_size = pred_data.shape[0] - loss = 0 - for i in range(batch_size): - label = int(target_data[i]) - loss -= np.log(softmax_pred[i, label]) - loss /= batch_size - else: - # One-hot labels - loss = -np.mean(np.sum(target_data * np.log(softmax_pred), axis=-1)) - - # Pure tensor evolution - gradient tracking will be added via decorator in Module 05 - return Tensor(loss) \ No newline at end of file diff --git a/modules_old/04_losses/losses_dev_enhanced.py b/modules_old/04_losses/losses_dev_enhanced.py deleted file mode 100644 index f899d153..00000000 --- a/modules_old/04_losses/losses_dev_enhanced.py +++ /dev/null @@ -1,1782 +0,0 @@ -# --- -# jupyter: -# jupytext: -# text_representation: -# extension: .py -# format_name: percent -# format_version: '1.3' -# jupytext_version: 1.17.1 -# --- - -# %% [markdown] -""" -# Loss Functions - Learning Objectives Made Mathematical - -Welcome to Loss Functions! You'll implement the critical bridge between model predictions and learning objectives that makes neural network training possible. 
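Loss functions like the CrossEntropyLoss defined above accept targets either as integer class indices or as one-hot rows. A minimal NumPy sketch (the `cross_entropy` helper is ours, mirroring that dispatch with a simple `ndim` check rather than the class's shape test) confirms the two formats yield the same loss:

```python
import numpy as np

def cross_entropy(logits, targets, eps=1e-7):
    # Stable softmax: subtract the row max so exp() cannot overflow
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = np.clip(exp / exp.sum(axis=-1, keepdims=True), eps, 1 - eps)
    if targets.ndim == 1:  # class indices, e.g. [0, 1]
        picked = probs[np.arange(len(targets)), targets.astype(int)]
        return -np.mean(np.log(picked))
    # one-hot rows: select the true-class log-probability via multiply-and-sum
    return -np.mean(np.sum(targets * np.log(probs), axis=-1))

logits = np.array([[2.0, 1.0, 0.0], [0.0, 2.0, 1.0]])
indices = np.array([0, 1])
one_hot = np.eye(3)[indices]
assert np.isclose(cross_entropy(logits, indices), cross_entropy(logits, one_hot))
```

Supporting both formats is a convenience borrowed from real frameworks: index targets save memory, while one-hot targets generalize to soft labels.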
- -## 🔗 Building on Previous Learning -**What You Built Before**: -- Module 02 (Tensor): Data structures for predictions and targets -- Module 03 (Activations): Nonlinear transformations for model outputs -- Module 04 (Layers): Complete neural network layers that produce predictions - -**What's Working**: You can build networks that transform inputs into predictions! - -**The Gap**: Predictions aren't learning objectives - you need to measure how "wrong" predictions are and provide gradient signals for improvement. - -**This Module's Solution**: Implement MSE, CrossEntropy, and BinaryCrossEntropy loss functions with numerical stability. - -**Connection Map**: -``` -Layers → Loss Functions → Gradients -(predictions) (objectives) (learning signals) -``` - -## Learning Goals (Systems-Focused) -- **Systems understanding**: How loss functions translate business problems into optimization objectives with proper numerical stability -- **Core implementation skill**: Build production-quality loss functions with stable computation and efficient batch processing -- **Pattern mastery**: Understand how different loss functions shape learning dynamics and convergence behavior -- **Framework connections**: See how your implementations mirror PyTorch's loss functions and autograd integration patterns -- **Optimization trade-offs**: Learn why numerical stability and computational efficiency matter for reliable training at scale - -## Build → Use → Reflect -1. **Build**: Complete loss function implementations with numerical stability and gradient support -2. **Use**: Apply loss functions to regression and classification problems with real neural networks -3. **Reflect**: Why do different loss functions lead to different learning behaviors, and when does numerical stability matter? 
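The connection map above can be traced end to end in a few lines of NumPy. This is a minimal sketch, not module code: it uses the analytic MSE gradient `2/n * (pred - true)`, which this module only motivates (gradients are built properly in the autograd module):

```python
import numpy as np

pred = np.array([2.0, 3.5])    # layer output: predictions
true = np.array([1.0, 3.0])    # training targets

loss = np.mean((pred - true) ** 2)        # objective: MSE measures "how wrong"
grad = 2.0 * (pred - true) / pred.size    # learning signal: dLoss/dpred
pred = pred - 0.1 * grad                  # one gradient-descent step

assert np.mean((pred - true) ** 2) < loss  # the step reduced the loss
```

Every loss in this module plays the same role: a scalar objective whose gradient tells each prediction which direction to move.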
- -## What You'll Achieve -By implementing loss functions from scratch, you'll understand: -- Deep technical understanding of how loss functions quantify prediction quality and enable learning -- Practical capability to implement numerically stable loss computation for production ML systems -- Systems insight into computational complexity, memory requirements, and batch processing efficiency -- Performance awareness of how loss function choice affects training speed and convergence characteristics -- Production knowledge of how frameworks implement robust loss computation with proper error handling - -## Systems Reality Check -💡 **Production Context**: PyTorch's loss functions use numerically stable implementations and automatic mixed precision to handle extreme gradients and values -⚡ **Performance Insight**: Numerically unstable loss functions can cause training to fail catastrophically - proper implementation is critical for reliable ML systems -""" - -# %% nbgrader={"grade": false, "grade_id": "losses-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} -#| default_exp core.losses - -#| export -import numpy as np -import sys -import os - -# Import our building blocks - try package first, then local modules -try: - from tinytorch.core.tensor import Tensor - # Note: For now, we'll use simplified implementations without full autograd - # In a complete system, these would integrate with the autograd Variable system -except ImportError: - # For development, import from local modules - sys.path.append(os.path.join(os.path.dirname(__file__), '..', '02_tensor')) - from tensor_dev import Tensor - -# %% nbgrader={"grade": false, "grade_id": "losses-setup", "locked": false, "schema_version": 3, "solution": false, "task": false} -print("🔥 TinyTorch Loss Functions Module") -print(f"NumPy version: {np.__version__}") -print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}") -print("Ready to build loss functions for neural 
network training!") - -# %% [markdown] -""" -## Where This Code Lives in the Final Package - -**Learning Side:** You work in modules/05_losses/losses_dev.py -**Building Side:** Code exports to tinytorch.core.losses - -```python -# Final package structure: -from tinytorch.core.losses import MeanSquaredError, CrossEntropyLoss, BinaryCrossEntropyLoss # All loss functions! -from tinytorch.core.tensor import Tensor # The foundation -from tinytorch.core.layers import Linear, Sequential # Network components -``` - -**Why this matters:** -- **Learning:** Focused module for understanding loss functions and training objectives -- **Production:** Proper organization like PyTorch's torch.nn with all loss functions together -- **Consistency:** All loss functions live together in core.losses for easy access -- **Integration:** Works seamlessly with tensors and neural networks for complete training systems -""" - -# %% [markdown] -""" -# Understanding Loss Functions in Neural Networks - -## What are Loss Functions? - -Loss functions are the mathematical bridge between what your model predicts and what you want it to learn. They quantify the "distance" between predictions and reality. 
- -``` -Business Goal: "Predict house prices accurately" - ↓ -Mathematical Loss: MSE = (predicted_price - actual_price)² - ↓ -Optimization Signal: gradient = 2 × (predicted - actual) - ↓ -Learning Update: parameter -= learning_rate × gradient -``` - -## The Learning Ecosystem - -Loss functions provide four critical capabilities: - -🎯 **Learning Objectives**: Define what "good" performance means mathematically -📈 **Gradient Signal**: Provide directional improvement information for parameters -🔍 **Progress Measurement**: Enable monitoring training progress and convergence detection -⚖️ **Trade-off Control**: Balance different aspects of model performance and regularization - -## Visual Understanding: Loss Function Landscape - -``` -Loss Function Behavior: - MSE Loss CrossEntropy Loss - High │ ╱╲ High │ ╱╲ - │ ╱ ╲ │ ╱ ╲ - │ ╱ ╲ │ ╱ ╲ - │ ╱ ╲ │ ╱ ╲ - Low │╱ ╲ Low │ ╱ ╲ - └────────────── └────────────── - Wrong Right Wrong Right - - • Smooth gradients • Steep near wrong predictions - • Quadratic penalty • Gentle near correct predictions - • Good for regression • Good for classification -``` - -Different loss functions create different optimization landscapes that affect how your model learns! -""" - -# %% [markdown] -""" -# Mean Squared Error - Foundation for Regression - -MSE is the cornerstone loss function for regression problems. It measures prediction quality by penalizing large errors more than small ones. - -## Visual Understanding: MSE Behavior - -``` -MSE Loss Visualization: - - Loss │ ╱╲ - 4 │ ╱ ╲ • Error = 2 → Loss = 4 - 3 │ ╱ ╲ • Error = 1 → Loss = 1 - 2 │ ╱ ╲ • Error = 0 → Loss = 0 - 1 │ ╱ ╲ • Quadratic penalty! 
- 0 │╱__________╲____ - -2 -1 0 1 2 - Error - -Gradient Flow: - ∂Loss/∂prediction = 2 × (predicted - actual) - - Large errors → Large gradients → Big updates - Small errors → Small gradients → Fine tuning -``` - -## Mathematical Foundation - -For batch of predictions and targets: -``` -MSE = (1/n) × Σ(y_pred - y_true)² - -Gradient: ∂MSE/∂y_pred = (2/n) × (y_pred - y_true) -``` - -## Learning Objectives -By implementing MSE, you'll understand: -- How regression loss functions translate continuous prediction errors into optimization signals -- Why squared error creates smooth, well-behaved gradients for stable optimization -- How batch processing enables efficient training on multiple samples simultaneously -- The connection between mathematical loss formulations and practical ML training dynamics -""" - -# %% nbgrader={"grade": false, "grade_id": "mse-concept-question", "locked": false, "schema_version": 3, "solution": false, "task": false} -""" -🤔 **Computational Question: MSE Properties** - -Before implementing, let's understand MSE behavior: - -1. If you predict house price as $300k but actual is $250k, what's the MSE? -2. If you predict $310k but actual is $250k, what's the MSE? -3. Which error gets penalized more heavily and why? -4. How does this relate to the quadratic penalty we visualized? - -This understanding will guide your implementation approach. 
-""" - -# %% nbgrader={"grade": false, "grade_id": "mse-loss-implementation", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class MeanSquaredError: - """ - Mean Squared Error Loss for Regression Problems - - Computes the average squared difference between predictions and targets: - MSE = (1/n) × Σ(y_pred - y_true)² - - Features: - - Numerically stable computation - - Efficient batch processing - - Clean gradient properties for optimization - - Compatible with tensor operations - - Example Usage: - mse = MeanSquaredError() - loss = mse(predictions, targets) # Returns scalar loss value - """ - - def __init__(self): - """Initialize MSE loss function.""" - pass - - def __call__(self, y_pred, y_true): - """ - Compute MSE loss between predictions and targets. - - Args: - y_pred: Model predictions (Tensor, shape: [batch_size, ...]) - y_true: True targets (Tensor, shape: [batch_size, ...]) - - Returns: - Tensor with scalar loss value - - TODO: Implement MSE computation with proper tensor handling. - - APPROACH: - 1. Convert inputs to tensors for consistent processing - 2. Compute element-wise prediction errors (differences) - 3. Square the errors to create quadratic penalty - 4. 
Take mean across all elements for final loss - - EXAMPLE: - >>> mse = MeanSquaredError() - >>> pred = Tensor([[1.0, 2.0]]) - >>> true = Tensor([[1.5, 1.5]]) - >>> loss = mse(pred, true) - >>> print(loss.data) - 0.25 # [(1.0-1.5)² + (2.0-1.5)²] / 2 = [0.25 + 0.25] / 2 - - HINTS: - - Use np.mean() for efficient batch averaging - - Element-wise operations work naturally with tensor.data - - Return result wrapped in Tensor for consistent interface - """ - ### BEGIN SOLUTION - # Step 1: Ensure we have tensor inputs for consistent processing - if not isinstance(y_pred, Tensor): - y_pred = Tensor(y_pred) - if not isinstance(y_true, Tensor): - y_true = Tensor(y_true) - - # Step 2: Compute mean squared error with element-wise operations - prediction_errors = y_pred.data - y_true.data # Element-wise difference - squared_errors = prediction_errors * prediction_errors # Element-wise squaring - mean_loss = np.mean(squared_errors) # Average across all elements - - return Tensor(mean_loss) - ### END SOLUTION - - def forward(self, y_pred, y_true): - """Alternative interface for forward pass.""" - return self.__call__(y_pred, y_true) - -# 🔍 SYSTEMS INSIGHT: MSE Computational Analysis -def analyze_mse_properties(): - """Analyze MSE loss characteristics for systems understanding.""" - print("🔍 MSE Loss Analysis - Understanding the Math") - print("=" * 45) - - try: - mse = MeanSquaredError() - - # Error magnitude vs loss relationship - print("\n📊 Error Magnitude vs Loss (Quadratic Penalty):") - errors = [0.1, 0.5, 1.0, 2.0, 5.0] - for error in errors: - pred = Tensor([error]) - true = Tensor([0.0]) - loss = mse(pred, true) - print(f" Error: {error:4.1f} → Loss: {loss.data:8.3f} (× {loss.data/(error**2):5.1f} baseline)") - - # Batch vs individual processing - print(f"\n⚡ Batch Processing Efficiency:") - single_losses = [] - for i in range(100): - pred = Tensor([np.random.randn()]) - true = Tensor([np.random.randn()]) - loss = mse(pred, true) - single_losses.append(loss.data) - - # 
Batch version - batch_pred = Tensor(np.random.randn(100)) - batch_true = Tensor(np.random.randn(100)) - batch_loss = mse(batch_pred, batch_true) - - individual_mean = np.mean(single_losses) - print(f" Individual losses mean: {individual_mean:.6f}") - print(f" Batch loss: {batch_loss.data:.6f}") - print(f" Difference: {abs(individual_mean - batch_loss.data):.8f}") - - # Memory efficiency analysis - import sys - small_tensor = Tensor([1.0]) - large_tensor = Tensor(np.random.randn(1000)) - - print(f"\n💾 Memory Efficiency:") - print(f" Small input memory: {sys.getsizeof(small_tensor.data)} bytes") - print(f" Large input memory: {sys.getsizeof(large_tensor.data)} bytes") - print(f" The MSE output is a scalar, so loss memory is independent of input size!") - - # 💡 WHY THIS MATTERS: MSE provides stable, well-behaved gradients - # that are proportional to error magnitude, making optimization smooth. - # The quadratic penalty means large errors dominate learning initially, - # then fine-tuning happens as errors get smaller. - - except Exception as e: - print(f"⚠️ Analysis error: {e}") - print("Ensure MSE implementation is complete before running analysis") - -# %% [markdown] -""" -### 🧪 Unit Test: MSE Loss Computation -This test validates `MeanSquaredError.__call__`, ensuring correct MSE computation with various input types and batch sizes. 

-""" - -# %% nbgrader={"grade": true, "grade_id": "test-mse-loss", "locked": true, "points": 3, "schema_version": 3, "solution": false, "task": false} -def test_unit_mse_loss(): - """Test MSE loss implementation.""" - print("🧪 Testing Mean Squared Error Loss...") - - mse = MeanSquaredError() - - # Test case 1: Perfect predictions (loss should be 0) - y_pred = Tensor([[1.0, 2.0], [3.0, 4.0]]) - y_true = Tensor([[1.0, 2.0], [3.0, 4.0]]) - loss = mse(y_pred, y_true) - assert abs(loss.data) < 1e-6, f"Perfect predictions should have loss ≈ 0, got {loss.data}" - print("✅ Perfect predictions test passed") - - # Test case 2: Known loss computation - y_pred = Tensor([[1.0, 2.0]]) - y_true = Tensor([[0.0, 1.0]]) - loss = mse(y_pred, y_true) - expected = 1.0 # [(1-0)² + (2-1)²] / 2 = [1 + 1] / 2 = 1.0 - assert abs(loss.data - expected) < 1e-6, f"Expected loss {expected}, got {loss.data}" - print("✅ Known loss computation test passed") - - # Test case 3: Batch processing - y_pred = Tensor([[1.0, 2.0], [3.0, 4.0]]) - y_true = Tensor([[1.5, 2.5], [2.5, 3.5]]) - loss = mse(y_pred, y_true) - expected = 0.25 # All squared differences are 0.25 - assert abs(loss.data - expected) < 1e-6, f"Expected batch loss {expected}, got {loss.data}" - print("✅ Batch processing test passed") - - # Test case 4: Single value - y_pred = Tensor([5.0]) - y_true = Tensor([3.0]) - loss = mse(y_pred, y_true) - expected = 4.0 # (5-3)² = 4 - assert abs(loss.data - expected) < 1e-6, f"Expected single value loss {expected}, got {loss.data}" - print("✅ Single value test passed") - - print("🎉 MSE loss tests passed! Understanding regression objectives.") - -test_unit_mse_loss() - -# %% [markdown] -""" -# Cross-Entropy Loss - Foundation for Multi-Class Classification - -Cross-Entropy Loss measures the "information distance" between predicted probability distributions and true class labels. It's the gold standard for classification problems. 
- -## Visual Understanding: Cross-Entropy Behavior - -``` -Cross-Entropy Loss for 3-Class Problem: - -Class Probabilities after Softmax: - Input: [2.0, 1.0, 0.1] → Probabilities: [0.66, 0.24, 0.10] - True: Class 0 (index 0) → Target: [1.0, 0.0, 0.0] - -Loss Computation: - CE = -log(probability_of_correct_class) - CE = -log(0.66) = 0.415 - -Intuition: - High confidence + Correct → Low loss - High confidence + Wrong → High loss - Low confidence + Any → Medium loss - -Gradient Behavior: - Wrong predictions → Steep gradients → Big corrections - Right predictions → Gentle gradients → Fine tuning -``` - -## Numerical Stability Challenge - -``` -The Numerical Stability Problem: - - Raw logits: [50.0, 49.0, 48.0] - Naive softmax: exp(50)/[exp(50)+exp(49)+exp(48)] - Problem: exp(50) ≈ 5×10²¹ → Overflow! - -Our Solution (Log-Sum-Exp Trick): - 1. max_val = max(logits) = 50.0 - 2. stable_logits = [0.0, -1.0, -2.0] # Subtract max - 3. exp([0.0, -1.0, -2.0]) = [1.0, 0.37, 0.14] - 4. Safe softmax: [0.67, 0.25, 0.09] -``` - -## Mathematical Foundation - -For predictions and class indices: -``` -CrossEntropy = -Σ y_true × log(softmax(y_pred)) - -Softmax: softmax(x_i) = exp(x_i) / Σ exp(x_j) -Stable: softmax(x_i) = exp(x_i - max(x)) / Σ exp(x_j - max(x)) -``` - -## Learning Objectives -By implementing Cross-Entropy, you'll understand: -- How classification losses work with probability distributions and information theory -- Why softmax normalization creates proper probability distributions for multi-class problems -- The critical importance of numerical stability in exponential and logarithmic computations -- How cross-entropy naturally encourages confident, correct predictions through its gradient structure -""" - -# %% nbgrader={"grade": false, "grade_id": "crossentropy-concept-question", "locked": false, "schema_version": 3, "solution": false, "task": false} -""" -🤔 **Computational Question: CrossEntropy Stability** - -Consider numerical stability in cross-entropy: - -1. 
What happens if you compute exp(100) directly? -2. Why does subtracting the maximum value prevent overflow? -3. What happens if log(0) occurs during loss computation? -4. How does epsilon clipping prevent this issue? - -Understanding these edge cases is crucial for reliable implementation. -""" - -# %% nbgrader={"grade": false, "grade_id": "crossentropy-loss-implementation", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class CrossEntropyLoss: - """ - Cross-Entropy Loss for Multi-Class Classification Problems - - Computes the cross-entropy between predicted probability distributions - and true class labels with numerically stable implementation. - - Features: - - Numerically stable softmax computation using log-sum-exp trick - - Support for both class indices and one-hot encoding - - Efficient batch processing with proper broadcasting - - Automatic handling of edge cases and extreme values - - Example Usage: - ce_loss = CrossEntropyLoss() - loss = ce_loss(logits, class_indices) # Returns scalar loss value - """ - - def __init__(self): - """Initialize CrossEntropy loss function.""" - pass - - def __call__(self, y_pred, y_true): - """ - Compute CrossEntropy loss between predictions and targets. - - Args: - y_pred: Model predictions/logits (Tensor, shape: [batch_size, num_classes]) - y_true: True class indices (Tensor, shape: [batch_size]) or one-hot encoding - - Returns: - Tensor with scalar loss value - - TODO: Implement CrossEntropy with numerically stable softmax computation. - - APPROACH: - 1. Convert inputs to tensors and handle single samples - 2. Apply log-sum-exp trick for numerically stable softmax - 3. Clip probabilities to prevent log(0) issues - 4. 
Compute cross-entropy based on target format (indices vs one-hot) - - EXAMPLE: - >>> ce = CrossEntropyLoss() - >>> logits = Tensor([[2.0, 1.0, 0.0]]) # Raw model outputs - >>> targets = Tensor([0]) # Class 0 is correct - >>> loss = ce(logits, targets) - >>> print(loss.data) - 0.407 # -log(softmax([2.0, 1.0, 0.0])[0]) - - HINTS: - - Use np.max(axis=1, keepdims=True) for stable max computation - - Use np.clip(probabilities, 1e-15, 1.0-1e-15) to prevent log(0) - - Handle both index format [0,1,2] and one-hot format [[1,0,0], [0,1,0]] - - Use advanced indexing: probs[np.arange(batch_size), class_indices] - """ - ### BEGIN SOLUTION - # Step 1: Ensure we have tensor inputs for consistent processing - if not isinstance(y_pred, Tensor): - y_pred = Tensor(y_pred) # Convert predictions to tensor format - if not isinstance(y_true, Tensor): - y_true = Tensor(y_true) # Convert targets to tensor format - - # Step 1: Extract numpy arrays for computation - prediction_logits = y_pred.data # Raw model outputs (pre-softmax) - target_labels = y_true.data # True class indices or one-hot vectors - - # Step 2: Handle both single predictions and batches consistently - if prediction_logits.ndim == 1: - prediction_logits = prediction_logits.reshape(1, -1) # Convert to batch format [1, num_classes] - - # Step 3: Apply numerically stable softmax transformation - # Subtract max to prevent overflow: exp(x-max) is equivalent but stable - max_logits = np.max(prediction_logits, axis=1, keepdims=True) - exp_pred = np.exp(prediction_logits - max_logits) - softmax_pred = exp_pred / np.sum(exp_pred, axis=1, keepdims=True) - - # Step 4: Prevent numerical instability in log computation - epsilon = 1e-15 # Small value to prevent log(0) → -inf and log(1) → 0 issues - softmax_pred = np.clip(softmax_pred, epsilon, 1.0 - epsilon) - - # Step 5: Compute cross-entropy loss based on target format - if len(target_labels.shape) == 1: - # Format A: y_true contains class indices [0, 1, 2, ...] 
- batch_size = target_labels.shape[0] - # Extract probabilities for correct classes using advanced indexing - correct_class_probs = softmax_pred[np.arange(batch_size), target_labels.astype(int)] - log_probs = np.log(correct_class_probs) - loss_value = -np.mean(log_probs) # Negative log-likelihood - else: - # Format B: y_true is one-hot encoded [[1,0,0], [0,1,0], ...] - log_probs = np.log(softmax_pred) - # Multiply one-hot targets with log probabilities, sum across classes - weighted_log_probs = target_labels * log_probs - loss_value = -np.mean(np.sum(weighted_log_probs, axis=1)) - - return Tensor(loss_value) - ### END SOLUTION - - def forward(self, y_pred, y_true): - """Alternative interface for forward pass.""" - return self.__call__(y_pred, y_true) - -# 🔍 SYSTEMS INSIGHT: CrossEntropy Stability Analysis -def analyze_crossentropy_stability(): - """Analyze numerical stability in cross-entropy computation.""" - print("🔍 CrossEntropy Stability Analysis") - print("=" * 40) - - try: - ce = CrossEntropyLoss() - - # Test numerical stability with extreme values - print("\n⚡ Numerical Stability Testing:") - - # Extreme logits that would overflow in naive implementation - extreme_logits = Tensor([[100.0, 99.0, 98.0]]) - safe_labels = Tensor([0]) - - loss = ce(extreme_logits, safe_labels) - print(f" Extreme logits [100, 99, 98]: Loss = {loss.data:.6f}") - print(f" No overflow or NaN: {not np.isnan(loss.data) and not np.isinf(loss.data)}") - - # Test epsilon clipping effectiveness - print(f"\n🛡️ Epsilon Clipping Protection:") - very_confident = Tensor([[10.0, -10.0, -10.0]]) # Very confident about class 0 - confident_labels = Tensor([0]) - - loss = ce(very_confident, confident_labels) - print(f" Very confident correct prediction: Loss = {loss.data:.6f}") - print(f" Should be near 0: {loss.data < 0.01}") - - # Compare different confidence levels - print(f"\n📊 Confidence vs Loss Relationship:") - confidence_levels = [ - ("Low confidence", [[0.1, 0.0, -0.1]]), - ("Medium 
confidence", [[1.0, 0.0, -1.0]]), - ("High confidence", [[5.0, 0.0, -5.0]]), - ("Very high", [[10.0, 0.0, -10.0]]) - ] - - for name, logits in confidence_levels: - test_logits = Tensor(logits) - test_loss = ce(test_logits, Tensor([0])) - print(f" {name:15}: Loss = {test_loss.data:.6f}") - - # Memory efficiency for large vocabularies - print(f"\n💾 Memory Scaling Analysis:") - small_vocab = Tensor(np.random.randn(32, 100)) # 100 classes - large_vocab = Tensor(np.random.randn(32, 10000)) # 10k classes - - import sys - small_memory = sys.getsizeof(small_vocab.data) - large_memory = sys.getsizeof(large_vocab.data) - - print(f" Small vocab (100 classes): {small_memory / 1024:.1f} KB") - print(f" Large vocab (10k classes): {large_memory / 1024:.1f} KB") - print(f" Memory scales O(batch_size × num_classes)") - - # 💡 WHY THIS MATTERS: CrossEntropy memory scales with vocabulary size. - # This is why large language models use techniques like hierarchical softmax - # or sampling-based training to handle vocabularies with 50k+ tokens. - - except Exception as e: - print(f"⚠️ Analysis error: {e}") - print("Ensure CrossEntropy implementation is complete") - -# %% [markdown] -""" -### 🧪 Unit Test: Cross-Entropy Loss Computation -This test validates `CrossEntropyLoss.__call__`, ensuring correct cross-entropy computation with numerically stable softmax. 
-""" - -# %% nbgrader={"grade": true, "grade_id": "test-crossentropy-loss", "locked": true, "points": 4, "schema_version": 3, "solution": false, "task": false} -def test_unit_crossentropy_loss(): - """Test CrossEntropy loss implementation.""" - print("🧪 Testing Cross-Entropy Loss...") - - ce = CrossEntropyLoss() - - # Test case 1: Perfect predictions - y_pred = Tensor([[10.0, 0.0, 0.0], [0.0, 10.0, 0.0]]) # Very confident correct predictions - y_true = Tensor([0, 1]) # Class indices - loss = ce(y_pred, y_true) - assert loss.data < 0.1, f"Perfect predictions should have low loss, got {loss.data}" - print("✅ Perfect predictions test passed") - - # Test case 2: Random predictions (should have higher loss) - y_pred = Tensor([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]) # Uniform after softmax - y_true = Tensor([0, 1]) - loss = ce(y_pred, y_true) - expected_random = -np.log(1.0/3.0) # log(1/num_classes) for uniform distribution - assert abs(loss.data - expected_random) < 0.1, f"Random predictions should have loss ≈ {expected_random}, got {loss.data}" - print("✅ Random predictions test passed") - - # Test case 3: Binary classification - y_pred = Tensor([[2.0, 1.0], [1.0, 2.0]]) - y_true = Tensor([0, 1]) - loss = ce(y_pred, y_true) - assert 0.0 < loss.data < 2.0, f"Binary classification loss should be reasonable, got {loss.data}" - print("✅ Binary classification test passed") - - # Test case 4: One-hot encoded labels - y_pred = Tensor([[2.0, 1.0, 0.0], [0.0, 2.0, 1.0]]) - y_true = Tensor([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]) # One-hot encoded - loss = ce(y_pred, y_true) - assert 0.0 < loss.data < 2.0, f"One-hot encoded loss should be reasonable, got {loss.data}" - print("✅ One-hot encoded labels test passed") - - print("🎉 Cross-Entropy loss tests passed! 
Understanding classification objectives.") - -test_unit_crossentropy_loss() - -# %% [markdown] -""" -# Binary Cross-Entropy Loss - Optimized for Binary Classification - -Binary Cross-Entropy Loss is the specialized, efficient version of cross-entropy for binary (two-class) problems. It's more stable and faster than using regular cross-entropy with 2 classes. - -## Visual Understanding: Binary Cross-Entropy - -``` -Binary Classification Landscape: - -Sigmoid Activation: - Raw Logit → Sigmoid → Probability → Loss - -5.0 → 0.007 → 0.007 → High loss (if true=1) - 0.0 → 0.500 → 0.500 → Medium loss - +5.0 → 0.993 → 0.993 → Low loss (if true=1) - -Loss Behavior: - BCE = -[y×log(p) + (1-y)×log(1-p)] - - For y=1 (positive class): - p=0.9 → -log(0.9) = 0.105 (low loss) - p=0.1 → -log(0.1) = 2.303 (high loss) - - For y=0 (negative class): - p=0.1 → -log(0.9) = 0.105 (low loss) - p=0.9 → -log(0.1) = 2.303 (high loss) -``` - -## Numerical Stability Solution - -``` -The Binary Cross-Entropy Stability Problem: - - BCE = -[y×log(σ(x)) + (1-y)×log(1-σ(x))] - - Where σ(x) = 1/(1+exp(-x)) - - Problems: - - Large positive x: exp(-x) → 0, then log(1) → 0 (loss of precision) - - Large negative x: σ(x) → 0, then log(0) → -∞ - -Our Stable Solution: - BCE = max(x,0) - x×y + log(1 + exp(-|x|)) - - Why this works: - - max(x,0) handles positive values - - -x×y is the "cross" term - - log(1+exp(-|x|)) is always stable (exp≤1) -``` - -## Mathematical Foundation - -For binary predictions and labels: -``` -BCE = -y × log(σ(x)) - (1-y) × log(1-σ(x)) - -Stable form: BCE = max(x,0) - x×y + log(1 + exp(-|x|)) -``` - -## Learning Objectives -By implementing Binary Cross-Entropy, you'll understand: -- How binary classification creates simpler optimization landscapes than multi-class problems -- Why sigmoid activation naturally pairs with binary cross-entropy loss through its gradient structure -- The critical importance of numerically stable formulations for reliable production training -- How 
specialized binary losses achieve better efficiency and stability than general solutions -""" - -# %% nbgrader={"grade": false, "grade_id": "binary-crossentropy-concept", "locked": false, "schema_version": 3, "solution": false, "task": false} -""" -🤔 **Computational Question: Binary Stability** - -Consider the stable BCE formulation: - -1. Why does max(x,0) - x×y + log(1+exp(-|x|)) work? -2. What happens when x=100? (trace through the computation) -3. What happens when x=-100? (trace through the computation) -4. How does this prevent both overflow and underflow? - -This mathematical insight is crucial for production systems. -""" - -# %% nbgrader={"grade": false, "grade_id": "binary-crossentropy-implementation", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class BinaryCrossEntropyLoss: - """ - Binary Cross-Entropy Loss for Binary Classification Problems - - Computes binary cross-entropy between predictions and binary labels - with numerically stable sigmoid + BCE implementation. - - Features: - - Numerically stable computation from logits using stable BCE formula - - Efficient batch processing with vectorized operations - - Automatic sigmoid application through stable formulation - - Robust to extreme input values without overflow/underflow - - Example Usage: - bce_loss = BinaryCrossEntropyLoss() - loss = bce_loss(logits, binary_labels) # Returns scalar loss value - """ - - def __init__(self): - """Initialize Binary CrossEntropy loss function.""" - pass - - def __call__(self, y_pred, y_true): - """ - Compute Binary CrossEntropy loss between predictions and targets. - - Args: - y_pred: Model predictions/logits (Tensor, shape: [batch_size, 1] or [batch_size]) - y_true: True binary labels (Tensor, shape: [batch_size, 1] or [batch_size]) - - Returns: - Tensor with scalar loss value - - TODO: Implement stable binary cross-entropy using the logits formulation. - - APPROACH: - 1. 
Convert inputs to tensors and flatten for consistent processing - 2. Use stable BCE formula: max(x,0) - x×y + log(1+exp(-|x|)) - 3. Apply this formula element-wise across the batch - 4. Return mean loss across all samples - - EXAMPLE: - >>> bce = BinaryCrossEntropyLoss() - >>> logits = Tensor([[2.0], [-1.0]]) # Raw outputs - >>> labels = Tensor([[1.0], [0.0]]) # Binary targets - >>> loss = bce(logits, labels) - >>> print(loss.data) - 0.220 # mean of 0.127 (x=2, y=1) and 0.313 (x=-1, y=0) - - HINTS: - - Use np.maximum(logits, 0) for the max(x,0) term - - Use np.abs(logits) to ensure exp argument is ≤ 0 - - The formula naturally handles both positive and negative logits - - Return np.mean() for batch averaging - """ - ### BEGIN SOLUTION - # Step 1: Ensure we have tensor inputs for consistent processing - if not isinstance(y_pred, Tensor): - y_pred = Tensor(y_pred) # Convert predictions to tensor format - if not isinstance(y_true, Tensor): - y_true = Tensor(y_true) # Convert targets to tensor format - - # Get flat arrays for computation - logits = y_pred.data.flatten() - labels = y_true.data.flatten() - - # Step 2: Define numerically stable binary cross-entropy computation - def stable_bce_with_logits(logits, labels): - """ - Numerically stable BCE using the logits formulation: - BCE(logits, y) = max(logits, 0) - logits * y + log(1 + exp(-|logits|)) - - This formulation prevents: - - exp(large_positive_logit) → overflow - - log(very_small_sigmoid) → -inf - - Mathematical equivalence: - - For positive logits: x - x*y + log(1 + exp(-x)) - - For negative logits: -x*y + log(1 + exp(x)) - """ - # Step 2a: Handle positive logits to prevent exp(large_positive) overflow - positive_part = np.maximum(logits, 0) - - # Step 2b: Subtract logit-label product (the "cross" in cross-entropy) - cross_term = logits * labels - - # Step 2c: Add log(1 + exp(-|logits|)) for numerical stability - # Using abs(logits) ensures the exponent is always negative or zero - stability_term = np.log(1 + 
np.exp(-np.abs(logits))) - - return positive_part - cross_term + stability_term - - # Step 2: Apply stable BCE computation across the batch - individual_losses = stable_bce_with_logits(logits, labels) - mean_loss = np.mean(individual_losses) # Average loss across batch - - return Tensor(mean_loss) - ### END SOLUTION - - def forward(self, y_pred, y_true): - """Alternative interface for forward pass.""" - return self.__call__(y_pred, y_true) - -# 🔍 SYSTEMS INSIGHT: Binary CrossEntropy Efficiency Analysis -def analyze_binary_crossentropy_efficiency(): - """Analyze binary cross-entropy computational efficiency.""" - print("🔍 Binary CrossEntropy Efficiency Analysis") - print("=" * 45) - - try: - bce = BinaryCrossEntropyLoss() - ce = CrossEntropyLoss() # For comparison - - # Compare binary-specific vs general cross-entropy - print("\n⚡ Binary vs Multi-Class Efficiency:") - - # Binary problem solved two ways - binary_logits = Tensor([[1.5], [-0.8], [2.1]]) - binary_labels = Tensor([[1.0], [0.0], [1.0]]) - - # Method 1: Binary CrossEntropy - binary_loss = bce(binary_logits, binary_labels) - - # Method 2: 2-class CrossEntropy (equivalent but less efficient) - multiclass_logits = Tensor([[1.5, 0.0], [-0.8, 0.0], [2.1, 0.0]]) - multiclass_labels = Tensor([0, 1, 0]) # Convert to class indices - multiclass_loss = ce(multiclass_logits, multiclass_labels) - - print(f" Binary CE Loss: {binary_loss.data:.6f}") - print(f" 2-Class CE Loss: {multiclass_loss.data:.6f}") - print(f" Difference: {abs(binary_loss.data - multiclass_loss.data):.8f}") - - # Memory efficiency comparison - print(f"\n💾 Memory Efficiency Comparison:") - - batch_size = 1000 - binary_memory = batch_size * 1 * 8 # 1 value per sample, 8 bytes per float64 - multiclass_memory = batch_size * 2 * 8 # 2 classes, 8 bytes per float64 - - print(f" Binary approach: {binary_memory / 1024:.1f} KB") - print(f" Multi-class (2): {multiclass_memory / 1024:.1f} KB") - print(f" Binary is {multiclass_memory/binary_memory:.1f}× more 
memory efficient") - - # Stability test with extreme values - print(f"\n🛡️ Extreme Value Stability:") - extreme_tests = [ - ("Large positive", [[100.0]], [[1.0]]), - ("Large negative", [[-100.0]], [[0.0]]), - ("Mixed extreme", [[100.0], [-100.0]], [[1.0], [0.0]]) - ] - - for name, logits, labels in extreme_tests: - test_logits = Tensor(logits) - test_labels = Tensor(labels) - loss = bce(test_logits, test_labels) - is_stable = not (np.isnan(loss.data) or np.isinf(loss.data)) - print(f" {name:15}: Loss = {loss.data:.6f}, Stable = {is_stable}") - - # 💡 WHY THIS MATTERS: Binary CrossEntropy is 2× more memory efficient - # than regular CrossEntropy for binary problems, and provides better - # numerical stability through its specialized formulation. - - except Exception as e: - print(f"⚠️ Analysis error: {e}") - print("Ensure BinaryCrossEntropy implementation is complete") - -# %% [markdown] -""" -### 🧪 Unit Test: Binary Cross-Entropy Loss -This test validates `BinaryCrossEntropyLoss.__call__`, ensuring stable binary cross-entropy computation with extreme values. 
-""" - -# %% nbgrader={"grade": true, "grade_id": "test-binary-crossentropy", "locked": true, "points": 4, "schema_version": 3, "solution": false, "task": false} -def test_unit_binary_crossentropy_loss(): - """Test Binary CrossEntropy loss implementation.""" - print("🧪 Testing Binary Cross-Entropy Loss...") - - bce = BinaryCrossEntropyLoss() - - # Test case 1: Perfect predictions - y_pred = Tensor([[10.0], [-10.0]]) # Very confident correct predictions - y_true = Tensor([[1.0], [0.0]]) - loss = bce(y_pred, y_true) - assert loss.data < 0.1, f"Perfect predictions should have low loss, got {loss.data}" - print("✅ Perfect predictions test passed") - - # Test case 2: Random predictions (should have higher loss) - y_pred = Tensor([[0.0], [0.0]]) # 0.5 probability after sigmoid - y_true = Tensor([[1.0], [0.0]]) - loss = bce(y_pred, y_true) - expected_random = -np.log(0.5) # log(0.5) for random guessing - assert abs(loss.data - expected_random) < 0.1, f"Random predictions should have loss ≈ {expected_random}, got {loss.data}" - print("✅ Random predictions test passed") - - # Test case 3: Batch processing - y_pred = Tensor([[1.0], [2.0], [-1.0]]) - y_true = Tensor([[1.0], [1.0], [0.0]]) - loss = bce(y_pred, y_true) - assert 0.0 < loss.data < 2.0, f"Batch processing loss should be reasonable, got {loss.data}" - print("✅ Batch processing test passed") - - # Test case 4: Extreme values (test numerical stability) - y_pred = Tensor([[100.0], [-100.0]]) # Extreme logits - y_true = Tensor([[1.0], [0.0]]) - loss = bce(y_pred, y_true) - assert not np.isnan(loss.data) and not np.isinf(loss.data), f"Extreme values should not cause NaN/Inf, got {loss.data}" - assert loss.data < 1.0, f"Extreme correct predictions should have low loss, got {loss.data}" - print("✅ Extreme values test passed") - - print("🎉 Binary Cross-Entropy loss tests passed! 
Understanding binary objectives.") - -test_unit_binary_crossentropy_loss() - -# %% [markdown] -""" -# Loss Function Application Guide and Comparison - -## When to Use Each Loss Function - -Understanding which loss function to use is critical for successful ML projects: - -### Mean Squared Error (MSE) - Regression Problems -``` -Use when: Predicting continuous values -Examples: House prices, temperature, stock values, ages -Output: Any real number -Activation: Usually none (linear output) -Penalty: Quadratic (large errors >> small errors) - -Model Architecture: -Input → Hidden Layers → Linear Output → MSE Loss -``` - -### Cross-Entropy Loss - Multi-Class Classification -``` -Use when: Choosing one class from 3+ options -Examples: Image classification, text categorization, medical diagnosis -Output: Probability distribution (sums to 1) -Activation: Softmax -Penalty: Logarithmic (encouraging confident correct predictions) - -Model Architecture: -Input → Hidden Layers → Softmax → CrossEntropy Loss -``` - -### Binary Cross-Entropy Loss - Binary Classification -``` -Use when: Binary decisions (yes/no, positive/negative) -Examples: Spam detection, fraud detection, medical screening -Output: Single probability (0 to 1) -Activation: Sigmoid -Penalty: Asymmetric (confident wrong predictions heavily penalized) - -Model Architecture: -Input → Hidden Layers → Sigmoid → Binary CrossEntropy Loss -``` - -## Performance and Stability Comparison - -``` -Computational Characteristics: - MSE CrossEntropy Binary CE -Time Complexity: O(n) O(n×c) O(n) -Memory Complexity: O(1) O(n×c) O(n) -Numerical Stability: High Medium High -Convergence Speed: Fast Medium Fast - -Where: n = batch size, c = number of classes -``` - -## Integration with Neural Networks - -```python -# Example training setup for different problem types: - -# Regression Problem (House Price Prediction) -regression_model = Sequential([ - Linear(10, 64), # Input features → Hidden - ReLU(), - Linear(64, 1), # Hidden → Single 
output - # No activation - linear output for regression - ]) - loss_fn = MeanSquaredError() - - # Multi-Class Classification (Image Recognition) - classification_model = Sequential([ - Linear(784, 128), # Flattened image → Hidden - ReLU(), - Linear(128, 10), # Hidden → 10 class logits (no Softmax here!) - # CrossEntropyLoss applies a stable softmax internally, so the model - # outputs raw logits, just like PyTorch's nn.CrossEntropyLoss expects - ]) - loss_fn = CrossEntropyLoss() - - # Binary Classification (Spam Detection) - binary_model = Sequential([ - Linear(100, 64), # Text features → Hidden - ReLU(), - Linear(64, 1), # Hidden → single logit (no Sigmoid here!) - # BinaryCrossEntropyLoss uses the stable logits formulation, so it - # expects raw logits, like PyTorch's nn.BCEWithLogitsLoss - ]) - loss_fn = BinaryCrossEntropyLoss() - - # Training loop pattern (same for all): - for batch in dataloader: - predictions = model(batch.inputs) - loss = loss_fn(predictions, batch.targets) - # loss.backward() # Compute gradients (when autograd is available) - # optimizer.step() # Update parameters - ``` - """ - - # %% [markdown] - """ - ### 🧪 Comprehensive Integration Test - This test validates all loss functions work together correctly and can be used interchangeably in production systems. - """ - - # %% nbgrader={"grade": false, "grade_id": "comprehensive-loss-tests", "locked": false, "schema_version": 3, "solution": false, "task": false} - def test_unit_comprehensive_loss_integration(): - """Test all loss functions work correctly together.""" - print("🔬 Comprehensive Loss Function Integration Testing") - print("=" * 55) - - # Test 1: All losses can be instantiated - print("\n1. Loss Function Instantiation:") - mse = MeanSquaredError() - ce = CrossEntropyLoss() - bce = BinaryCrossEntropyLoss() - print(" ✅ All loss functions created successfully") - - # Test 2: Loss functions return appropriate types - print("\n2. 
Return Type Verification:") - - # MSE test - pred = Tensor([[1.0, 2.0]]) - target = Tensor([[1.0, 2.0]]) - loss = mse(pred, target) - assert isinstance(loss, Tensor), "MSE should return Tensor" - assert loss.data.shape == (), "MSE should return scalar" - - # Cross-entropy test - pred = Tensor([[1.0, 2.0], [2.0, 1.0]]) - target = Tensor([1, 0]) - loss = ce(pred, target) - assert isinstance(loss, Tensor), "CrossEntropy should return Tensor" - assert loss.data.shape == (), "CrossEntropy should return scalar" - - # Binary cross-entropy test - pred = Tensor([[1.0], [-1.0]]) - target = Tensor([[1.0], [0.0]]) - loss = bce(pred, target) - assert isinstance(loss, Tensor), "Binary CrossEntropy should return Tensor" - assert loss.data.shape == (), "Binary CrossEntropy should return scalar" - - print(" ✅ All loss functions return correct types") - - # Test 3: Loss values are reasonable - print("\n3. Loss Value Sanity Checks:") - - # All losses should be non-negative - assert mse.forward(Tensor([1.0]), Tensor([2.0])).data >= 0, "MSE should be non-negative" - assert ce.forward(Tensor([[1.0, 0.0]]), Tensor([0])).data >= 0, "CrossEntropy should be non-negative" - assert bce.forward(Tensor([1.0]), Tensor([1.0])).data >= 0, "Binary CrossEntropy should be non-negative" - - print(" ✅ All loss functions produce reasonable values") - - # Test 4: Perfect predictions give low loss - print("\n4. 
Perfect Prediction Tests:") - - perfect_mse = mse(Tensor([5.0]), Tensor([5.0])) - perfect_ce = ce(Tensor([[10.0, 0.0]]), Tensor([0])) - perfect_bce = bce(Tensor([10.0]), Tensor([1.0])) - - assert perfect_mse.data < 1e-10, f"Perfect MSE should be ~0, got {perfect_mse.data}" - assert perfect_ce.data < 0.1, f"Perfect CE should be low, got {perfect_ce.data}" - assert perfect_bce.data < 0.1, f"Perfect BCE should be low, got {perfect_bce.data}" - - print(" ✅ Perfect predictions produce low loss") - - print("\n🎉 All comprehensive integration tests passed!") - print(" • Loss functions instantiate correctly") - print(" • Return types are consistent (Tensor scalars)") - print(" • Loss values are mathematically sound") - print(" • Perfect predictions are handled correctly") - print(" • Ready for integration with neural network training!") - -test_unit_comprehensive_loss_integration() - -# %% [markdown] -""" -# Systems Analysis: Loss Function Performance and Engineering - -Let's analyze loss functions from an ML systems engineering perspective, focusing on performance, memory usage, and production implications. 
- -## Computational Complexity Deep Dive - -``` -Algorithmic Analysis by Loss Type: - -MSE (Mean Squared Error): - Time: O(n) - linear in number of predictions - Space: O(1) - constant additional memory - Operations: n subtractions + n multiplications + 1 mean - Bottleneck: Memory bandwidth (simple arithmetic operations) - -CrossEntropy (Multi-Class): - Time: O(n×c) - linear in samples × classes - Space: O(n×c) - store full probability distributions - Operations: n×c exp + n×c divisions + n×c logs + reductions - Bottleneck: Exponential computations and memory bandwidth - -Binary CrossEntropy: - Time: O(n) - linear in number of samples - Space: O(n) - store one probability per sample - Operations: n max + n multiplications + n exp + n logs - Bottleneck: Transcendental functions (exp, log) -``` - -## Memory Scaling Analysis - -Understanding memory requirements is crucial for large-scale training: - -``` -Memory Requirements by Problem Scale: - -Small Problem (1K samples, 100 classes): - MSE: 8 KB (1K samples × 8 bytes) - CrossEntropy: 800 KB (1K × 100 × 8 bytes) - Binary CE: 16 KB (1K × 2 × 8 bytes) - -Large Problem (100K samples, 10K classes): - MSE: 800 KB (independent of classes!) 
- CrossEntropy: 8 GB (memory bottleneck) - Binary CE: 1.6 MB (scales with samples only) - -Production Scale (1M samples, 50K vocab): - MSE: 8 MB - CrossEntropy: 400 GB (requires distributed memory) - Binary CE: 16 MB -``` - -## Numerical Stability Engineering Analysis - -Production systems must handle edge cases robustly: - -``` -Stability Challenges and Solutions: - -CrossEntropy Stability Issues: - Problem: exp(large_logit) → overflow → NaN gradients - Solution: log-sum-exp trick with max subtraction - - Problem: log(very_small_prob) → -∞ → training collapse - Solution: epsilon clipping (1e-15 to 1-1e-15) - -Binary CrossEntropy Stability Issues: - Problem: sigmoid(large_positive) → 1.0 → log(0) issues - Solution: stable logits formulation bypasses sigmoid - - Problem: exp(large_negative) in naive implementation - Solution: max(x,0) - x*y + log(1+exp(-|x|)) formulation -``` -""" - -# %% [markdown] -""" -## Production Performance Benchmarks - -Real-world performance characteristics matter for deployment: - -``` -Inference Throughput (measured on modern hardware): - MSE: ~100M predictions/second - CrossEntropy: ~10M predictions/second - Binary CrossEntropy: ~80M predictions/second - -Training Memory Bandwidth Requirements: - MSE: ~800 MB/s (lightweight computation) - CrossEntropy: ~80 GB/s (10× higher due to softmax!) 
- Binary CE: ~1.6 GB/s (moderate requirements) - -Gradient Computation Overhead: - MSE: 1.1× forward pass time (simple derivatives) - CrossEntropy: 1.5× forward pass time (softmax gradients) - Binary CE: 1.2× forward pass time (sigmoid gradients) -``` - -## Framework Integration and Production Patterns - -Understanding how production systems implement these concepts: - -``` -PyTorch Implementation Patterns: - torch.nn.MSELoss() - Direct implementation, minimal overhead - torch.nn.CrossEntropyLoss() - Fused softmax+CE for efficiency - torch.nn.BCEWithLogitsLoss() - Stable logits formulation - -TensorFlow Implementation Patterns: - tf.keras.losses.MeanSquaredError() - Vectorized operations - tf.keras.losses.SparseCategoricalCrossentropy() - Memory efficient - tf.keras.losses.BinaryCrossentropy() - From logits option - -Production Optimizations: - - Mixed precision (FP16) for memory efficiency - - Gradient accumulation for large batch simulation - - Loss scaling to prevent underflow in mixed precision - - Checkpointing to trade memory for computation -``` - -## Edge Device and Deployment Considerations - -Loss function choice affects deployment feasibility: - -``` -Edge Device Constraints: - Memory-limited (phones, IoT): Prefer Binary CE > MSE > CrossEntropy - CPU-only inference: MSE has best compute efficiency - Real-time requirements: Binary classification most predictable - -Distributed Training Challenges: - CrossEntropy: Requires all-reduce across all classes (expensive!) 
- Gradient accumulation: MSE linear, CrossEntropy non-linear dependencies - Mixed precision: Different overflow handling per loss type - -Monitoring and Debugging: - MSE divergence: Explodes quadratically (easy to detect) - CrossEntropy divergence: More gradual degradation - BCE monitoring: Natural bounded behavior aids debugging -``` -""" - -# 🔍 SYSTEMS INSIGHT: Performance Profiling Analysis -def analyze_loss_performance_characteristics(): - """Comprehensive performance analysis of all loss functions.""" - print("🔍 Loss Function Performance Analysis") - print("=" * 45) - - try: - import time - - # Initialize loss functions - mse = MeanSquaredError() - ce = CrossEntropyLoss() - bce = BinaryCrossEntropyLoss() - - print("\n⚡ Computational Complexity Measurement:") - - # Test different batch sizes to see scaling behavior - batch_sizes = [100, 1000, 10000] - - for batch_size in batch_sizes: - print(f"\n Batch size: {batch_size:,}") - - # MSE timing - mse_pred = Tensor(np.random.randn(batch_size, 10)) - mse_true = Tensor(np.random.randn(batch_size, 10)) - - start = time.perf_counter() - for _ in range(100): # Average over multiple runs - mse_loss = mse(mse_pred, mse_true) - mse_time = (time.perf_counter() - start) / 100 - - # CrossEntropy timing - ce_pred = Tensor(np.random.randn(batch_size, 100)) # 100 classes - ce_true = Tensor(np.random.randint(0, 100, batch_size)) - - start = time.perf_counter() - for _ in range(100): - ce_loss = ce(ce_pred, ce_true) - ce_time = (time.perf_counter() - start) / 100 - - # Binary CrossEntropy timing - bce_pred = Tensor(np.random.randn(batch_size, 1)) - bce_true = Tensor(np.random.randint(0, 2, (batch_size, 1)).astype(float)) - - start = time.perf_counter() - for _ in range(100): - bce_loss = bce(bce_pred, bce_true) - bce_time = (time.perf_counter() - start) / 100 - - print(f" MSE: {mse_time*1000:8.3f} ms") - print(f" CrossEntropy: {ce_time*1000:8.3f} ms") - print(f" Binary CE: {bce_time*1000:8.3f} ms") - print(f" CE/MSE ratio: 
{ce_time/mse_time:8.1f}x") - - print("\n💾 Memory Efficiency Analysis:") - - # Compare memory usage for different problem sizes - problem_configs = [ - ("Small (1K samples, 10 classes)", 1000, 10), - ("Medium (10K samples, 100 classes)", 10000, 100), - ("Large (100K samples, 1K classes)", 100000, 1000) - ] - - for name, samples, classes in problem_configs: - print(f"\n {name}:") - - # Memory calculations (bytes) - mse_memory = samples * 8 # One value per sample - ce_memory = samples * classes * 8 # Full probability distribution - bce_memory = samples * 8 # One probability per sample - - print(f" MSE memory: {mse_memory / 1024 / 1024:8.1f} MB") - print(f" CE memory: {ce_memory / 1024 / 1024:8.1f} MB") - print(f" BCE memory: {bce_memory / 1024 / 1024:8.1f} MB") - print(f" CE overhead: {ce_memory/mse_memory:8.1f}x") - - # 💡 WHY THIS MATTERS: These performance characteristics determine - # which loss functions are feasible for different deployment scenarios. - # CrossEntropy's O(n×c) memory scaling makes it prohibitive for - # large vocabularies without specialized techniques. 
- - except Exception as e: - print(f"⚠️ Performance analysis error: {e}") - print("Performance analysis requires complete implementations") - -# 🔍 SYSTEMS INSIGHT: Numerical Stability Deep Analysis -def analyze_numerical_stability_edge_cases(): - """Deep analysis of numerical stability across all loss functions.""" - print("🔍 Numerical Stability Edge Case Analysis") - print("=" * 50) - - try: - mse = MeanSquaredError() - ce = CrossEntropyLoss() - bce = BinaryCrossEntropyLoss() - - print("\n🛡️ Extreme Value Stability Testing:") - - # Test extreme values that could cause numerical issues - extreme_tests = [ - ("Huge positive", 1e10), - ("Huge negative", -1e10), - ("Tiny positive", 1e-10), - ("NaN input", float('nan')), - ("Infinity", float('inf')), - ("Negative infinity", float('-inf')) - ] - - for name, value in extreme_tests: - print(f"\n Testing {name} ({value}):") - - # MSE stability - try: - mse_loss = mse(Tensor([value]), Tensor([0.0])) - mse_stable = not (np.isnan(mse_loss.data) or np.isinf(mse_loss.data)) - print(f" MSE stable: {mse_stable} (loss: {mse_loss.data:.3e})") - except: - print(f" MSE stable: False (exception)") - - # CrossEntropy stability - try: - ce_loss = ce(Tensor([[value, 0.0, 0.0]]), Tensor([0])) - ce_stable = not (np.isnan(ce_loss.data) or np.isinf(ce_loss.data)) - print(f" CE stable: {ce_stable} (loss: {ce_loss.data:.3e})") - except: - print(f" CE stable: False (exception)") - - # Binary CrossEntropy stability - try: - bce_loss = bce(Tensor([value]), Tensor([1.0])) - bce_stable = not (np.isnan(bce_loss.data) or np.isinf(bce_loss.data)) - print(f" BCE stable: {bce_stable} (loss: {bce_loss.data:.3e})") - except: - print(f" BCE stable: False (exception)") - - print("\n🔬 Gradient Behavior Analysis:") - - # Analyze gradient magnitudes for different prediction qualities - confidence_levels = [ - ("Very wrong", [[-5.0, 5.0, 0.0]], [0]), # Predict class 1, actual class 0 - ("Slightly wrong", [[-0.5, 0.5, 0.0]], [0]), - ("Uncertain", [[0.0, 0.0, 
0.0]], [0]), - ("Slightly right", [[0.5, -0.5, 0.0]], [0]), - ("Very right", [[5.0, -5.0, 0.0]], [0]) - ] - - print(" Prediction Quality → CrossEntropy Loss:") - for name, logits, labels in confidence_levels: - loss = ce(Tensor(logits), Tensor(labels)) - print(f" {name:15}: {loss.data:8.4f}") - - # 💡 WHY THIS MATTERS: Understanding how loss functions behave - # at extremes helps debug training failures and choose appropriate - # loss scaling and clipping strategies for production systems. - - except Exception as e: - print(f"⚠️ Stability analysis error: {e}") - print("Stability analysis requires complete implementations") - -# 🔍 SYSTEMS INSIGHT: Production Deployment Analysis -def analyze_production_deployment_patterns(): - """Analyze how loss functions affect production ML system design.""" - print("🔍 Production Deployment Pattern Analysis") - print("=" * 50) - - try: - print("\n🚀 Deployment Scenario Analysis:") - - # Different deployment scenarios with constraints - scenarios = [ - { - "name": "Mobile App (Spam Detection)", - "constraints": "Memory < 50MB, Latency < 100ms", - "problem": "Binary classification", - "recommendation": "Binary CrossEntropy", - "reasoning": "Minimal memory, fast inference, stable numerics" - }, - { - "name": "Cloud API (Image Classification)", - "constraints": "Throughput > 1000 QPS, Cost optimization", - "problem": "1000-class classification", - "recommendation": "CrossEntropy with mixed precision", - "reasoning": "Can handle memory cost, needs throughput" - }, - { - "name": "Edge IoT (Temperature Prediction)", - "constraints": "Memory < 1MB, Power < 1W", - "problem": "Regression", - "recommendation": "MSE with quantization", - "reasoning": "Minimal compute, no transcendental functions" - }, - { - "name": "Large Language Model Training", - "constraints": "50K vocabulary, Multi-GPU", - "problem": "Next token prediction", - "recommendation": "Hierarchical Softmax or Sampling", - "reasoning": "Standard CrossEntropy too memory intensive" 
- } - ] - - for scenario in scenarios: - print(f"\n 📱 {scenario['name']}:") - print(f" Constraints: {scenario['constraints']}") - print(f" Problem Type: {scenario['problem']}") - print(f" Best Loss: {scenario['recommendation']}") - print(f" Why: {scenario['reasoning']}") - - print("\n⚖️ Production Trade-off Analysis:") - - trade_offs = [ - ("Memory Efficiency", "MSE > Binary CE >> CrossEntropy"), - ("Computational Speed", "MSE > Binary CE > CrossEntropy"), - ("Numerical Stability", "MSE ≈ Binary CE > CrossEntropy"), - ("Implementation Complexity", "MSE > CrossEntropy > Binary CE"), - ("Gradient Quality", "CrossEntropy > Binary CE > MSE"), - ("Debug-ability", "MSE > Binary CE > CrossEntropy") - ] - - for criterion, ranking in trade_offs: - print(f" {criterion:20}: {ranking}") - - print("\n🔧 Framework Integration Patterns:") - - frameworks = [ - ("PyTorch", "nn.MSELoss(), nn.CrossEntropyLoss(), nn.BCEWithLogitsLoss()"), - ("TensorFlow", "keras.losses.MSE, SparseCategoricalCrossentropy, BinaryCrossentropy"), - ("JAX", "optax.l2_loss, optax.softmax_cross_entropy, optax.sigmoid_binary_cross_entropy"), - ("Production", "Custom implementations with monitoring and fallbacks") - ] - - for framework, losses in frameworks: - print(f" {framework:12}: {losses}") - - # 💡 WHY THIS MATTERS: Loss function choice affects every aspect - # of ML system design - from memory requirements to latency to - # debugging complexity. Understanding these trade-offs enables - # informed architectural decisions for production systems. 
- - except Exception as e: - print(f"⚠️ Deployment analysis error: {e}") - -# %% [markdown] -""" -## 🤔 ML Systems Thinking: Interactive Questions - -Now that you've implemented all core loss functions and analyzed their systems characteristics, let's explore their implications for real ML systems: -""" - -# %% nbgrader={"grade": false, "grade_id": "question-1-loss-selection", "locked": false, "schema_version": 3, "solution": false, "task": false} -""" -🤔 **Question 1: Loss Function Selection for Production Systems** - -You're building a production recommendation system that predicts user ratings (1-5 stars) for movies. - -Your team proposes three approaches: -A) Regression approach: Use MSE loss with continuous outputs (1.0-5.0) -B) Classification approach: Use CrossEntropy loss with 5 distinct classes -C) Ordinal approach: Use a custom loss that penalizes being off by multiple stars more heavily - -Analyze each approach considering your implementations: - -**Technical Analysis:** -- How does the memory scaling of CrossEntropy (O(batch_size × num_classes)) affect this 5-class problem? -- What are the computational complexity differences between MSE's O(n) and CrossEntropy's O(n×c) for c=5? -- How do the gradient behaviors differ? (MSE's quadratic vs CrossEntropy's logarithmic penalties) - -**Systems Implications:** -- Which approach would be most memory efficient for large batch training? -- How does numerical stability differ when handling edge cases (ratings at boundaries)? -- Which approach would have the most predictable inference latency? - -**Business Alignment:** -- How well does each loss function's penalty structure match the business objective? -- What happens with fractional ratings like 3.7? How would each approach handle this? -- Which approach would be easiest to monitor and debug in production? - -Recommend an approach with justification based on your implementation experience. 
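As a concrete warm-up for the ordinal question, this standalone NumPy snippet (hypothetical star ratings, not part of the graded answer) shows the structural difference to reason about: MSE's penalty grows with ordinal distance, while plain cross-entropy charges every wrong class the same.

```python
import numpy as np

def ce_from_logits(logits, true_class):
    # Stable cross-entropy for one example (log-sum-exp trick)
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[true_class]

true_star = 4  # class index 4 == "5 stars"

# MSE: off by 1 star costs 1.0; off by 3 stars costs 9.0 (quadratic)
print((4.0 - 5.0) ** 2, (2.0 - 5.0) ** 2)  # 1.0 9.0

# CrossEntropy: confidently predicting "4 stars" or "2 stars" when the
# truth is "5 stars" costs exactly the same -- class identity only.
off_by_one = np.array([0.0, 0.0, 0.0, 5.0, 0.0])    # peak at index 3 (4 stars)
off_by_three = np.array([0.0, 5.0, 0.0, 0.0, 0.0])  # peak at index 1 (2 stars)
print(ce_from_logits(off_by_one, true_star),
      ce_from_logits(off_by_three, true_star))
```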
-""" - -# %% nbgrader={"grade": false, "grade_id": "question-2-numerical-stability", "locked": false, "schema_version": 3, "solution": false, "task": false} -""" -🤔 **Question 2: Debugging Numerical Stability in Production** - -Your cross-entropy loss function works perfectly in development, but in production you start seeing NaN losses that crash training after several hours. - -**Root Cause Analysis:** -Based on your implementation of the log-sum-exp trick and epsilon clipping: -1. What specific numerical computations in cross-entropy can produce NaN values? -2. Walk through how your `max_logits = np.max(prediction_logits, axis=1, keepdims=True)` prevents overflow -3. Explain why `np.clip(softmax_pred, epsilon, 1.0 - epsilon)` prevents underflow -4. What would happen if you removed epsilon clipping? Trace through the computation. - -**Production Debugging:** -Given millions of training examples, how would you: -1. Identify which specific inputs trigger the numerical instability? -2. Modify your CrossEntropy implementation to add monitoring without affecting performance? -3. Design fallback behavior when numerical issues are detected? -4. Validate that your fixes don't change the mathematical behavior for normal inputs? - -**Comparison Analysis:** -- How does your stable Binary CrossEntropy formulation `max(x,0) - x*y + log(1 + exp(-|x|))` prevent similar issues? -- Why is MSE generally more numerically stable than CrossEntropy? -- How would you modify loss functions for mixed precision (FP16) training where numerical ranges are more limited? - -Research how PyTorch and TensorFlow handle these same challenges in their loss implementations. -""" - -# %% nbgrader={"grade": false, "grade_id": "question-3-custom-loss-design", "locked": false, "schema_version": 3, "solution": false, "task": false} -""" -🤔 **Question 3: Designing Custom Loss Functions for Business Objectives** - -Standard loss functions don't always align with business objectives. 
Consider these scenarios: - -**Scenario A: Medical Diagnosis** -False negatives (missing cancer) cost 10× more than false positives (unnecessary follow-up) - -**Scenario B: Search Ranking** -Being wrong about the top result is 100× worse than being wrong about result #50 - -**Scenario C: Financial Trading** -Large losses should be penalized exponentially more than small losses (risk management) - -**For each scenario, design analysis:** - -**Loss Function Modification:** -1. How would you modify your Binary CrossEntropy implementation for Scenario A to asymmetrically penalize errors? -2. For Scenario B, how could you adapt CrossEntropy to incorporate position-aware penalties? -3. For Scenario C, how would you modify MSE to create exponential rather than quadratic penalties? - -**Implementation Challenges:** -- How do custom loss modifications affect gradient flow and convergence behavior? -- What numerical stability issues might arise with exponential penalties or asymmetric losses? -- How would you validate that your custom loss actually improves business outcomes? - -**Systems Considerations:** -- How do custom losses affect training speed compared to standard implementations? -- What additional monitoring and debugging capabilities would you need? -- How would you A/B test a custom loss against standard losses in production? - -Design a custom loss function for one scenario, including: -- Mathematical formulation -- Implementation approach building on your existing code -- Numerical stability considerations -- Validation strategy for business impact -""" - -# %% [markdown] -""" -## 🎯 MODULE SUMMARY: Loss Functions - Learning Objectives Made Mathematical - -Congratulations! 
You've successfully implemented the complete foundation for neural network training objectives: - -### What You've Accomplished -✅ **Complete Loss Function Library**: MSE for regression, CrossEntropy for multi-class classification, and Binary CrossEntropy for binary classification with production-grade numerical stability -✅ **Systems Engineering Understanding**: Deep comprehension of computational complexity, memory scaling, and numerical stability requirements for reliable ML systems -✅ **Mathematical Implementation Mastery**: Built loss functions from mathematical foundations through stable computational formulations to working code -✅ **Production Readiness Knowledge**: Understanding of how loss function choice affects training speed, memory usage, and deployment feasibility -✅ **Framework Integration Insight**: Clear connection between your implementations and how PyTorch/TensorFlow solve the same problems - -### Key Learning Outcomes -- **Loss Function Theory**: How mathematical loss functions translate business objectives into optimization targets that neural networks can learn from -- **Numerical Stability Engineering**: Critical importance of stable implementations that prevent catastrophic training failures in production systems -- **Systems Performance Analysis**: Understanding of computational complexity, memory scaling, and performance trade-offs that affect production deployment -- **Production ML Patterns**: Knowledge of how loss function choice affects system architecture, monitoring requirements, and debugging complexity - -### Mathematical Foundations Mastered -- **MSE computation**: `(1/n) × Σ(y_pred - y_true)²` with smooth quadratic gradients for regression optimization -- **CrossEntropy with stable softmax**: Log-sum-exp trick and epsilon clipping for numerically robust classification -- **Binary CrossEntropy stability**: `max(x,0) - x×y + log(1 + exp(-|x|))` formulation preventing overflow/underflow issues -- **Gradient behavior 
understanding**: How different loss functions create different optimization landscapes and learning dynamics - -### Professional Skills Developed -- **Production-quality implementation**: Robust numerical stability measures that prevent training failures with real-world data -- **Performance optimization**: Understanding of computational and memory complexity that affects scalability and deployment -- **Systems debugging**: Knowledge of how to identify and fix numerical stability issues in production ML systems -- **Framework integration**: Clear understanding of how your implementations connect to professional ML development workflows - -### Ready for Advanced Applications -Your loss function implementations now enable: -- **Complete training loops** that optimize neural networks on real datasets with proper convergence monitoring -- **Custom loss functions** that align with specific business objectives and domain requirements -- **Production deployment** with confidence in numerical stability and performance characteristics -- **Advanced optimization** techniques that build on solid loss function foundations - -### Connection to Real ML Systems -Your implementations mirror the essential patterns used in: -- **PyTorch's loss functions**: Same mathematical formulations with identical numerical stability measures -- **TensorFlow's losses**: Equivalent computational patterns and production-grade error handling -- **Production ML pipelines**: The exact loss functions that power real ML systems at companies like Google, Meta, and OpenAI -- **Research frameworks**: Foundation for experimenting with novel loss functions and training objectives - -### Next Steps -With solid loss function implementations, you're ready to: -1. **Export your module**: `tito module complete 05_losses` -2. **Validate integration**: `tito test --module losses` -3. **Explore autograd integration**: See how loss functions connect with automatic differentiation -4. 
**Ready for Module 06**: Build automatic gradient computation that makes loss-based learning possible! - -**Your achievement**: You've built the mathematical foundation that transforms predictions into learning signals - the critical bridge between model outputs and optimization objectives that makes neural network training possible! -""" - -# %% nbgrader={"grade": false, "grade_id": "final-demo", "locked": false, "schema_version": 3, "solution": false, "task": false} -if __name__ == "__main__": - print("🔥 TinyTorch Loss Functions Module - Complete Demo") - print("=" * 55) - - # Test all core implementations - print("\n🧪 Testing All Loss Functions:") - test_unit_mse_loss() - test_unit_crossentropy_loss() - test_unit_binary_crossentropy_loss() - test_unit_comprehensive_loss_integration() - - # Run systems analysis functions - print("\n" + "="*60) - print("🔍 Systems Analysis Functions") - print("=" * 30) - - analyze_mse_properties() - analyze_crossentropy_stability() - analyze_binary_crossentropy_efficiency() - analyze_loss_performance_characteristics() - analyze_numerical_stability_edge_cases() - analyze_production_deployment_patterns() - - print("\n" + "="*60) - print("📊 Loss Function Usage Examples") - print("=" * 35) - - # Example 1: Regression with MSE - print("\n1. Regression Example (Predicting House Prices):") - mse = MeanSquaredError() - house_predictions = Tensor([[250000, 180000, 320000]]) # Predicted prices - house_actual = Tensor([[240000, 175000, 315000]]) # Actual prices - regression_loss = mse(house_predictions, house_actual) - print(f" House price prediction loss: ${regression_loss.data:,.0f}² average error") - - # Example 2: Multi-class classification with CrossEntropy - print("\n2. 
Multi-Class Classification Example (Image Recognition):") - ce = CrossEntropyLoss() - image_logits = Tensor([[2.1, 0.5, -0.3, 1.8, 0.1], # Model outputs for 5 classes - [-0.2, 3.1, 0.8, -1.0, 0.4]]) # (cat, dog, bird, fish, rabbit) - true_classes = Tensor([0, 1]) # First image = cat, second = dog - classification_loss = ce(image_logits, true_classes) - print(f" Image classification loss: {classification_loss.data:.4f}") - - # Example 3: Binary classification with BCE - print("\n3. Binary Classification Example (Spam Detection):") - bce = BinaryCrossEntropyLoss() - spam_logits = Tensor([[1.2], [-0.8], [2.1], [-1.5]]) # Spam prediction logits - spam_labels = Tensor([[1.0], [0.0], [1.0], [0.0]]) # 1=spam, 0=not spam - spam_loss = bce(spam_logits, spam_labels) - print(f" Spam detection loss: {spam_loss.data:.4f}") - - print("\n" + "="*60) - print("🎯 Loss Function Characteristics") - print("=" * 35) - - # Compare perfect vs imperfect predictions - print("\n📊 Perfect vs Random Predictions:") - - # Perfect predictions - perfect_mse = mse(Tensor([5.0]), Tensor([5.0])) - perfect_ce = ce(Tensor([[10.0, 0.0, 0.0]]), Tensor([0])) - perfect_bce = bce(Tensor([10.0]), Tensor([1.0])) - - print(f" Perfect MSE loss: {perfect_mse.data:.6f}") - print(f" Perfect CE loss: {perfect_ce.data:.6f}") - print(f" Perfect BCE loss: {perfect_bce.data:.6f}") - - # Random predictions - random_mse = mse(Tensor([3.0]), Tensor([5.0])) # Off by 2 - random_ce = ce(Tensor([[0.0, 0.0, 0.0]]), Tensor([0])) # Uniform distribution - random_bce = bce(Tensor([0.0]), Tensor([1.0])) # 50% confidence - - print(f" Random MSE loss: {random_mse.data:.6f}") - print(f" Random CE loss: {random_ce.data:.6f}") - print(f" Random BCE loss: {random_bce.data:.6f}") - - print("\n🎉 Complete loss function foundation ready!") - print(" ✅ MSE for regression problems") - print(" ✅ CrossEntropy for multi-class classification") - print(" ✅ Binary CrossEntropy for binary classification") - print(" ✅ Numerically stable 
implementations") - print(" ✅ Production-ready batch processing") - print(" ✅ Systems analysis and performance insights") - print(" ✅ Ready for neural network training!") \ No newline at end of file diff --git a/modules_old/04_losses/module.yaml b/modules_old/04_losses/module.yaml deleted file mode 100644 index de621153..00000000 --- a/modules_old/04_losses/module.yaml +++ /dev/null @@ -1,21 +0,0 @@ -description: Essential loss functions for neural network training objectives -difficulty: "\u2B50\u2B50\u2B50" -exports: -- MeanSquaredError -- CrossEntropyLoss -- BinaryCrossEntropyLoss -key_concepts: -- Training objectives and optimization -- Numerical stability in loss computation -- Regression vs classification loss functions -- Batch processing for scalable training -learning_objectives: -- Implement MSE, CrossEntropy, and BinaryCrossEntropy loss functions -- Understand numerical stability in loss computation -- Match loss functions to problem types (regression vs classification) -- Build production-ready loss functions with batch processing -name: Loss Functions -number: 5 -prerequisites: -- 02_tensor -time_estimate: 2-3 hours diff --git a/modules_old/04_networks_backup/networks_dev.py b/modules_old/04_networks_backup/networks_dev.py deleted file mode 100644 index 71063052..00000000 --- a/modules_old/04_networks_backup/networks_dev.py +++ /dev/null @@ -1,1050 +0,0 @@ -# --- -# jupyter: -# jupytext: -# text_representation: -# extension: .py -# format_name: percent -# format_version: '1.3' -# jupytext_version: 1.17.1 -# kernelspec: -# display_name: Python 3 (ipykernel) -# language: python -# name: python3 -# --- - -# %% [markdown] -""" -# Networks - Building Intelligence Through Layer Composition - -Welcome to Networks! You'll learn how to combine individual layers into complete neural networks that can solve complex problems. 
- -## 🔗 Building on Previous Learning -**What You Built Before**: -- Module 01 (Tensor): Multi-dimensional data structures for inputs and outputs -- Module 02 (Activations): Nonlinear functions that create intelligence -- Module 03 (Layers): Linear layers that transform data with learnable parameters - -**What's Working**: You can transform data with individual layers and activations! - -**The Gap**: Individual layers solve simple problems - real intelligence emerges when layers compose into networks. - -**This Module's Solution**: Learn to manually compose layers into multi-layer networks with different architectures. - -**Connection Map**: -``` -Layers → Manual Composition → Complete Networks -(transforms) (architecture) (intelligence) -``` - -## Learning Objectives -1. **Manual Network Architecture**: Build networks by composing layers step-by-step -2. **Parameter Management**: Count and track parameters across multiple layers -3. **Forward Pass Logic**: Understand data flow through network architectures -4. **Network Architectures**: Create different network shapes (wide, deep, custom) -5. **Systems Understanding**: Analyze memory usage and computational complexity - -## Build → Test → Use -1. **Build**: Manual network composition functions and parameter counting -2. **Test**: Validate networks with different architectures and input sizes -3. 
**Use**: Apply networks to solve problems requiring multiple transformations -""" - -# %% -# Essential imports for network composition -import numpy as np -import sys -import os -from typing import List, Tuple, Union, Optional - -# Import building blocks from previous modules - ONLY use concepts we've learned -try: - from tinytorch.core.tensor import Tensor - from tinytorch.core.activations import ReLU, Sigmoid, Tanh, Softmax - from tinytorch.core.layers import Linear, Module -except ImportError: - # Development fallback - sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor')) - sys.path.append(os.path.join(os.path.dirname(__file__), '..', '02_activations')) - sys.path.append(os.path.join(os.path.dirname(__file__), '..', '03_layers')) - from tensor_dev import Tensor - from activations_dev import ReLU, Sigmoid, Tanh, Softmax - from layers_dev import Linear, Module - -# %% [markdown] -""" -## Part 1: Understanding Network Architecture - -### What Makes a Neural Network? - -A neural network is simply **multiple layers composed together** where each layer's output becomes the next layer's input. - -``` -Input → Layer1 → Activation → Layer2 → Activation → Output - (4) (8) (8) (3) (3) (3) -``` - -**Key Insights**: -- **Composition**: Networks = layers + activations in sequence -- **Data Flow**: Output shape of layer N must match input shape of layer N+1 -- **Intelligence**: Nonlinearity from activations enables complex pattern learning -- **Architecture**: Layer sizes and arrangements determine network capability -""" - -# %% [markdown] -""" -## Part 2: Manual Network Composition - -Let's start by learning to compose networks manually before automation. 
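To see the shape-chaining rule in isolation first, here is a plain-NumPy sketch of the two-layer pattern we are about to build (a toy illustration with made-up weight names, deliberately not using TinyTorch's `Linear` class):

```python
import numpy as np

x = np.random.randn(2, 4)                      # batch of 2 samples, 4 features each

W1, b1 = np.random.randn(4, 8), np.zeros(8)    # plays the role of Linear(4, 8)
W2, b2 = np.random.randn(8, 3), np.zeros(3)    # plays the role of Linear(8, 3)

h = np.maximum(x @ W1 + b1, 0.0)               # linear transform + ReLU: (2, 4) -> (2, 8)
y = h @ W2 + b2                                # output layer: (2, 8) -> (2, 3)

print(y.shape)                                 # (2, 3)
```

The only constraint is that each weight matrix's first dimension matches the previous output's last dimension; the composition helpers in this module automate exactly this bookkeeping.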
-""" - -# %% nbgrader={"grade": false, "grade_id": "network-composition", "solution": true} -def compose_two_layer_network(input_size: int, hidden_size: int, output_size: int, - activation=ReLU) -> Tuple[Linear, object, Linear]: - """ - Create a 2-layer network manually: Input → Linear → Activation → Linear → Output - - Args: - input_size: Number of input features - hidden_size: Number of hidden layer neurons - output_size: Number of output features - activation: Activation function class (default: ReLU) - - Returns: - Tuple of (layer1, activation_instance, layer2) - - TODO: Create two Linear layers and one activation function - - APPROACH: - 1. Create first Linear layer: input_size → hidden_size - 2. Create activation function instance - 3. Create second Linear layer: hidden_size → output_size - 4. Return all three components as tuple - - EXAMPLE: - >>> layer1, act, layer2 = compose_two_layer_network(4, 8, 3) - >>> x = Tensor([[1, 2, 3, 4]]) - >>> h = layer1(x) # (1, 4) → (1, 8) - >>> h_act = act(h) # (1, 8) → (1, 8) - >>> y = layer2(h_act) # (1, 8) → (1, 3) - >>> print(y.shape) # (1, 3) - - HINTS: - - Use Linear(input_size, hidden_size) for first layer - - Create activation instance with activation() - - Use Linear(hidden_size, output_size) for second layer - - Return as (layer1, activation_instance, layer2) - """ - ### BEGIN SOLUTION - # Create first layer: input → hidden - layer1 = Linear(input_size, hidden_size) - - # Create activation function instance - activation_instance = activation() - - # Create second layer: hidden → output - layer2 = Linear(hidden_size, output_size) - - return layer1, activation_instance, layer2 - ### END SOLUTION - -# %% [markdown] -""" -### 🧪 Unit Test: Two-Layer Network Composition -Test that we can manually compose a simple 2-layer network -""" - -# %% -def test_unit_two_layer_composition(): - """Test two-layer network composition with different configurations""" - print("🔬 Unit Test: Two-Layer Network Composition...") - - # Test 
1: Basic composition - layer1, activation, layer2 = compose_two_layer_network(4, 8, 3) - - assert isinstance(layer1, Linear), "First component should be Linear layer" - assert isinstance(activation, ReLU), "Second component should be activation function" - assert isinstance(layer2, Linear), "Third component should be Linear layer" - - assert layer1.input_size == 4, "First layer should have correct input size" - assert layer1.output_size == 8, "First layer should have correct output size" - assert layer2.input_size == 8, "Second layer should have correct input size" - assert layer2.output_size == 3, "Second layer should have correct output size" - - # Test 2: Forward pass compatibility - x = Tensor(np.random.randn(2, 4)) - h = layer1(x) - h_activated = activation(h) - y = layer2(h_activated) - - assert h.shape == (2, 8), "Hidden layer output should have correct shape" - assert h_activated.shape == (2, 8), "Activated hidden should preserve shape" - assert y.shape == (2, 3), "Final output should have correct shape" - - # Test 3: Different activation functions - layer1_sig, sig_act, layer2_sig = compose_two_layer_network(3, 5, 2, Sigmoid) - assert isinstance(sig_act, Sigmoid), "Should create Sigmoid activation when specified" - - print("✅ Two-layer network composition works correctly!") - -test_unit_two_layer_composition() - -# %% [markdown] -""" -## Part 3: Forward Pass Through Networks - -Now let's implement the logic for running data through composed networks. -""" - -# %% nbgrader={"grade": false, "grade_id": "forward-pass", "solution": true} -def forward_pass_two_layer(x: Tensor, layer1: Linear, activation, layer2: Linear) -> Tensor: - """ - Execute forward pass through a 2-layer network. - - Args: - x: Input tensor - layer1: First Linear layer - activation: Activation function - layer2: Second Linear layer - - Returns: - Output tensor after passing through the network - - TODO: Implement forward pass: x → layer1 → activation → layer2 → output - - APPROACH: - 1. 
Pass input through first layer - 2. Apply activation function to result - 3. Pass activated result through second layer - 4. Return final output - - EXAMPLE: - >>> x = Tensor([[1, 2, 3, 4]]) # (1, 4) - >>> y = forward_pass_two_layer(x, layer1, relu, layer2) - >>> print(y.shape) # (1, output_size) - - HINTS: - - Call each component in sequence: layer1(x), activation(h), layer2(h_act) - - Each output becomes input to next component - - Return the final result - """ - ### BEGIN SOLUTION - # Step 1: First layer transformation - hidden = layer1(x) - - # Step 2: Apply activation function - hidden_activated = activation(hidden) - - # Step 3: Second layer transformation - output = layer2(hidden_activated) - - return output - ### END SOLUTION - -# %% [markdown] -""" -### 🧪 Unit Test: Forward Pass Through Network -Test that data flows correctly through our manual network -""" - -# %% -def test_unit_forward_pass(): - """Test forward pass through manually composed networks""" - print("🔬 Unit Test: Forward Pass Through Networks...") - - # Create test network - layer1, relu_act, layer2 = compose_two_layer_network(5, 10, 3) - - # Test 1: Single sample - x_single = Tensor(np.random.randn(1, 5)) - y_single = forward_pass_two_layer(x_single, layer1, relu_act, layer2) - - assert y_single.shape == (1, 3), "Single sample should produce correct output shape" - assert hasattr(y_single, 'shape') and hasattr(y_single, 'data'), "Output should be a Tensor-like object" - - # Test 2: Batch processing - x_batch = Tensor(np.random.randn(4, 5)) - y_batch = forward_pass_two_layer(x_batch, layer1, relu_act, layer2) - - assert y_batch.shape == (4, 3), "Batch should produce correct output shape" - - # Test 3: Different network architectures - wide_layer1, wide_act, wide_layer2 = compose_two_layer_network(2, 50, 1) - x_wide = Tensor(np.random.randn(3, 2)) - y_wide = forward_pass_two_layer(x_wide, wide_layer1, wide_act, wide_layer2) - - assert y_wide.shape == (3, 1), "Wide network should work 
correctly" - - print("✅ Forward pass through networks works correctly!") - -test_unit_forward_pass() - -# %% [markdown] -""" -## Part 4: Deep Network Composition - -Real neural networks often have more than 2 layers. Let's build deep networks manually. -""" - -# %% nbgrader={"grade": false, "grade_id": "deep-network", "solution": true} -def compose_deep_network(layer_sizes: List[int], activation=ReLU) -> List: - """ - Create a deep network with arbitrary number of layers. - - Args: - layer_sizes: List of layer sizes [input_size, hidden1, hidden2, ..., output_size] - activation: Activation function class - - Returns: - List of network components [layer1, activation1, layer2, activation2, ..., final_layer] - - TODO: Create alternating Linear layers and activations for each pair of sizes - - APPROACH: - 1. Iterate through pairs of consecutive sizes in layer_sizes - 2. For each pair, create Linear(size_i, size_i+1) and activation() - 3. Don't add activation after the final layer (output layer typically no activation) - 4. 
Return list of all components in order - - EXAMPLE: - >>> components = compose_deep_network([4, 8, 6, 3]) - >>> # Creates: Linear(4,8), ReLU(), Linear(8,6), ReLU(), Linear(6,3) - >>> len(components) # 5 components - - HINTS: - - Use zip(layer_sizes[:-1], layer_sizes[1:]) to get consecutive pairs - - Add Linear layer, then activation for each pair (except last layer) - - Last layer: only add Linear, no activation - - Return list of all components - """ - ### BEGIN SOLUTION - components = [] - - # Process all but the last layer (add Linear + Activation) - for i in range(len(layer_sizes) - 2): - input_size = layer_sizes[i] - output_size = layer_sizes[i + 1] - - # Add Linear layer - components.append(Linear(input_size, output_size)) - # Add activation - components.append(activation()) - - # Add final layer (Linear only, no activation) - if len(layer_sizes) >= 2: - final_input = layer_sizes[-2] - final_output = layer_sizes[-1] - components.append(Linear(final_input, final_output)) - - return components - ### END SOLUTION - -# %% [markdown] -""" -### 🧪 Unit Test: Deep Network Composition -Test that we can build networks with arbitrary depth -""" - -# %% -def test_unit_deep_network(): - """Test deep network composition with various architectures""" - print("🔬 Unit Test: Deep Network Composition...") - - # Test 1: 3-layer network - components_3layer = compose_deep_network([4, 8, 6, 3]) - expected_components = 5 # Linear, ReLU, Linear, ReLU, Linear - - assert len(components_3layer) == expected_components, f"3-layer network should have {expected_components} components" - - # Verify component types and order - assert isinstance(components_3layer[0], Linear), "First component should be Linear" - assert isinstance(components_3layer[1], ReLU), "Second component should be ReLU" - assert isinstance(components_3layer[2], Linear), "Third component should be Linear" - assert isinstance(components_3layer[3], ReLU), "Fourth component should be ReLU" - assert 
isinstance(components_3layer[4], Linear), "Fifth component should be Linear (final)" - - # Test 2: Verify layer sizes - assert components_3layer[0].input_size == 4, "First layer should have correct input size" - assert components_3layer[0].output_size == 8, "First layer should have correct output size" - assert components_3layer[2].input_size == 8, "Second layer should have correct input size" - assert components_3layer[2].output_size == 6, "Second layer should have correct output size" - assert components_3layer[4].input_size == 6, "Final layer should have correct input size" - assert components_3layer[4].output_size == 3, "Final layer should have correct output size" - - # Test 3: Different activation function - components_sigmoid = compose_deep_network([2, 4, 1], Sigmoid) - assert isinstance(components_sigmoid[1], Sigmoid), "Should use specified activation function" - - # Test 4: Single layer (edge case) - components_single = compose_deep_network([5, 2]) - assert len(components_single) == 1, "Single layer should have 1 component" - assert isinstance(components_single[0], Linear), "Single component should be Linear layer" - - print("✅ Deep network composition works correctly!") - -test_unit_deep_network() - -# %% [markdown] -""" -## Part 5: Forward Pass Through Deep Networks - -Now implement forward pass logic for networks of arbitrary depth. -""" - -# %% nbgrader={"grade": false, "grade_id": "deep-forward", "solution": true} -def forward_pass_deep(x: Tensor, components: List) -> Tensor: - """ - Execute forward pass through a deep network with arbitrary components. - - Args: - x: Input tensor - components: List of network components (layers and activations) - - Returns: - Output tensor after passing through all components - - TODO: Apply each component in sequence to transform the input - - APPROACH: - 1. Start with input tensor - 2. Apply each component in order: x = component(x) - 3. Each component's output becomes next component's input - 4. 
Return final result - - EXAMPLE: - >>> components = [Linear(4,8), ReLU(), Linear(8,3)] - >>> x = Tensor([[1, 2, 3, 4]]) - >>> y = forward_pass_deep(x, components) - >>> print(y.shape) # (1, 3) - - HINTS: - - Use a for loop: for component in components: - - Apply each component: x = component(x) - - Return the final transformed x - """ - ### BEGIN SOLUTION - # Apply each component in sequence - current_tensor = x - for component in components: - current_tensor = component(current_tensor) - - return current_tensor - ### END SOLUTION - -# %% [markdown] -""" -### 🧪 Unit Test: Deep Forward Pass -Test forward pass through networks of varying depth -""" - -# %% -def test_unit_deep_forward(): - """Test forward pass through deep networks""" - print("🔬 Unit Test: Deep Forward Pass...") - - # Test 1: 3-layer network - components = compose_deep_network([5, 10, 8, 3]) - x = Tensor(np.random.randn(2, 5)) - y = forward_pass_deep(x, components) - - assert y.shape == (2, 3), "Deep network should produce correct output shape" - assert hasattr(y, 'shape') and hasattr(y, 'data'), "Output should be a Tensor-like object" - - # Test 2: Very deep network - deep_components = compose_deep_network([4, 16, 12, 8, 6, 2]) - x_deep = Tensor(np.random.randn(1, 4)) - y_deep = forward_pass_deep(x_deep, deep_components) - - assert y_deep.shape == (1, 2), "Very deep network should work correctly" - - # Test 3: Wide network - wide_components = compose_deep_network([3, 100, 1]) - x_wide = Tensor(np.random.randn(5, 3)) - y_wide = forward_pass_deep(x_wide, wide_components) - - assert y_wide.shape == (5, 1), "Wide network should work correctly" - - # Test 4: Single layer - single_components = compose_deep_network([6, 4]) - x_single = Tensor(np.random.randn(1, 6)) - y_single = forward_pass_deep(x_single, single_components) - - assert y_single.shape == (1, 4), "Single layer should work correctly" - - print("✅ Deep forward pass works correctly!") - -test_unit_deep_forward() - -# %% [markdown] -""" -## Part 
6: Parameter Counting and Analysis - -Understanding how many learnable parameters are in a network is crucial for memory management and computational complexity. -""" - -# %% nbgrader={"grade": false, "grade_id": "parameter-counting", "solution": true} -def count_network_parameters(components: List) -> Tuple[int, dict]: - """ - Count total parameters in a network and provide detailed breakdown. - - Args: - components: List of network components - - Returns: - Tuple of (total_parameters, parameter_breakdown) - - TODO: Count parameters in each Linear layer and provide breakdown - - APPROACH: - 1. Initialize total counter and breakdown dictionary - 2. Iterate through components looking for Linear layers - 3. For each Linear layer: count weights (input_size × output_size) + biases (output_size) - 4. Store breakdown by layer and return total + breakdown - - EXAMPLE: - >>> components = [Linear(4,8), ReLU(), Linear(8,3)] - >>> total, breakdown = count_network_parameters(components) - >>> print(total) # (4*8 + 8) + (8*3 + 3) = 32 + 8 + 24 + 3 = 67 - - HINTS: - - Only Linear layers have parameters (activations have none) - - For Linear layer: parameters = input_size * output_size + output_size - - Use isinstance(component, Linear) to identify Linear layers - - Track breakdown with layer names/indices - """ - ### BEGIN SOLUTION - total_params = 0 - breakdown = {} - - layer_count = 0 - for i, component in enumerate(components): - if isinstance(component, Linear): - layer_count += 1 - - # Count weights and biases - weights = component.input_size * component.output_size - biases = component.output_size - layer_params = weights + biases - - # Add to total - total_params += layer_params - - # Add to breakdown - breakdown[f"Linear_Layer_{layer_count}"] = { - "weights": weights, - "biases": biases, - "total": layer_params, - "shape": f"({component.input_size}, {component.output_size})" - } - - return total_params, breakdown - ### END SOLUTION - -# %% [markdown] -""" -### 🧪 Unit 
Test: Parameter Counting -Test that we correctly count parameters across network architectures -""" - -# %% -def test_unit_parameter_counting(): - """Test parameter counting across different network architectures""" - print("🔬 Unit Test: Parameter Counting...") - - # Test 1: Simple 2-layer network - components = compose_deep_network([4, 8, 3]) - total, breakdown = count_network_parameters(components) - - # Expected: (4*8 + 8) + (8*3 + 3) = 40 + 27 = 67 - expected_total = (4*8 + 8) + (8*3 + 3) - assert total == expected_total, f"Expected {expected_total} parameters, got {total}" - - # Verify breakdown structure - assert "Linear_Layer_1" in breakdown, "Should have first layer in breakdown" - assert "Linear_Layer_2" in breakdown, "Should have second layer in breakdown" - assert breakdown["Linear_Layer_1"]["weights"] == 32, "First layer should have 32 weights" - assert breakdown["Linear_Layer_1"]["biases"] == 8, "First layer should have 8 biases" - - # Test 2: Single layer - single_components = compose_deep_network([10, 5]) - single_total, single_breakdown = count_network_parameters(single_components) - - expected_single = 10*5 + 5 # 55 - assert single_total == expected_single, f"Single layer should have {expected_single} parameters" - - # Test 3: Deep network - deep_components = compose_deep_network([3, 6, 4, 2]) - deep_total, deep_breakdown = count_network_parameters(deep_components) - - # Expected: (3*6+6) + (6*4+4) + (4*2+2) = 24 + 28 + 10 = 62 - expected_deep = (3*6 + 6) + (6*4 + 4) + (4*2 + 2) - assert deep_total == expected_deep, f"Deep network should have {expected_deep} parameters" - assert len(deep_breakdown) == 3, "Deep network should have 3 Linear layers in breakdown" - - # Test 4: Network with activations (shouldn't count activation parameters) - mixed_components = [Linear(5, 10), ReLU(), Linear(10, 2), Sigmoid()] - mixed_total, mixed_breakdown = count_network_parameters(mixed_components) - - expected_mixed = (5*10 + 10) + (10*2 + 2) # 60 + 22 = 82 - 
assert mixed_total == expected_mixed, "Should only count Linear layer parameters" - assert len(mixed_breakdown) == 2, "Should only include Linear layers in breakdown" - - print("✅ Parameter counting works correctly!") - -test_unit_parameter_counting() - -# %% [markdown] -""" -## Part 7: Network Architecture Patterns - -Let's implement common network architecture patterns used in practice. -""" - -# %% nbgrader={"grade": false, "grade_id": "network-patterns", "solution": true} -def create_classifier_network(input_size: int, num_classes: int, hidden_sizes: List[int] = None) -> List: - """ - Create a classification network with sigmoid output activation. - - Args: - input_size: Number of input features - num_classes: Number of output classes - hidden_sizes: List of hidden layer sizes (optional) - - Returns: - List of network components with Sigmoid output for classification - - TODO: Create network ending with Sigmoid activation for classification - - APPROACH: - 1. Use provided hidden_sizes or default to [hidden_size] if None - 2. Create base network structure: input → hidden layers → output - 3. Add Sigmoid activation at the end for classification probabilities - 4. 
Return complete component list - - EXAMPLE: - >>> components = create_classifier_network(784, 10, [128, 64]) - >>> # Creates: Linear(784,128), ReLU(), Linear(128,64), ReLU(), Linear(64,10), Sigmoid() - - HINTS: - - If hidden_sizes is None, use a reasonable default like [input_size // 2] - - Build layer_sizes list: [input_size] + hidden_sizes + [num_classes] - - Use compose_deep_network to create base network - - Add Sigmoid() activation at the end for classification - """ - ### BEGIN SOLUTION - # Handle default hidden sizes - if hidden_sizes is None: - hidden_sizes = [max(input_size // 2, num_classes * 2)] - - # Build complete layer sizes - layer_sizes = [input_size] + hidden_sizes + [num_classes] - - # Create base network - components = compose_deep_network(layer_sizes) - - # Add Sigmoid activation for classification - components.append(Sigmoid()) - - return components - ### END SOLUTION - -# %% nbgrader={"grade": false, "grade_id": "regression-network", "solution": true} -def create_regression_network(input_size: int, output_size: int = 1, hidden_sizes: List[int] = None) -> List: - """ - Create a regression network with no output activation. - - Args: - input_size: Number of input features - output_size: Number of output values (default: 1) - hidden_sizes: List of hidden layer sizes (optional) - - Returns: - List of network components with no output activation for regression - - TODO: Create network with no output activation for regression - - APPROACH: - 1. Use provided hidden_sizes or create reasonable default - 2. Build layer_sizes list and create network - 3. Do NOT add output activation (regression predicts raw values) - 4. 
Return component list - - EXAMPLE: - >>> components = create_regression_network(4, 1, [8, 4]) - >>> # Creates: Linear(4,8), ReLU(), Linear(8,4), ReLU(), Linear(4,1) - >>> # No output activation for regression - - HINTS: - - Default hidden_sizes could be [input_size, input_size // 2] - - Use compose_deep_network directly (it doesn't add output activation) - - Don't add any activation after the final layer - """ - ### BEGIN SOLUTION - # Handle default hidden sizes - if hidden_sizes is None: - hidden_sizes = [input_size, max(input_size // 2, output_size * 2)] - - # Build complete layer sizes - layer_sizes = [input_size] + hidden_sizes + [output_size] - - # Create network (compose_deep_network doesn't add output activation) - components = compose_deep_network(layer_sizes) - - return components - ### END SOLUTION - -# %% [markdown] -""" -### 🧪 Unit Test: Network Architecture Patterns -Test specialized network architectures for different tasks -""" - -# %% -def test_unit_network_patterns(): - """Test different network architecture patterns""" - print("🔬 Unit Test: Network Architecture Patterns...") - - # Test 1: Classification network - classifier = create_classifier_network(784, 10, [128, 64]) - - # Should end with Sigmoid for classification - assert isinstance(classifier[-1], Sigmoid), "Classifier should end with Sigmoid" - - # Test forward pass - x_class = Tensor(np.random.randn(1, 784)) - y_class = forward_pass_deep(x_class, classifier) - - assert y_class.shape == (1, 10), "Classifier should output correct number of classes" - # Note: We can't easily test that output is in [0,1] without more sophisticated sigmoid implementation - - # Test 2: Regression network - regressor = create_regression_network(4, 1, [8, 4]) - - # Should NOT end with activation - assert not isinstance(regressor[-1], (Sigmoid, ReLU, Tanh)), "Regressor should not end with activation" - assert isinstance(regressor[-1], Linear), "Regressor should end with Linear layer" - - # Test forward pass - 
x_reg = Tensor(np.random.randn(3, 4)) - y_reg = forward_pass_deep(x_reg, regressor) - - assert y_reg.shape == (3, 1), "Regressor should output correct shape" - - # Test 3: Multi-output regression - multi_regressor = create_regression_network(6, 3, [10, 5]) - x_multi = Tensor(np.random.randn(2, 6)) - y_multi = forward_pass_deep(x_multi, multi_regressor) - - assert y_multi.shape == (2, 3), "Multi-output regressor should work" - - # Test 4: Default hidden sizes - default_classifier = create_classifier_network(20, 5) # No hidden_sizes specified - x_default = Tensor(np.random.randn(1, 20)) - y_default = forward_pass_deep(x_default, default_classifier) - - assert y_default.shape == (1, 5), "Default classifier should work" - - print("✅ Network architecture patterns work correctly!") - -test_unit_network_patterns() - -# %% -def test_module(): - """Run all module tests to verify complete implementation""" - print("🧪 Running all Network module tests...") - - test_unit_two_layer_composition() - test_unit_forward_pass() - test_unit_deep_network() - test_unit_deep_forward() - test_unit_parameter_counting() - test_unit_network_patterns() - - print("✅ All Network module tests passed! Manual network composition complete.") - -# %% [markdown] -""" -## 🔍 Systems Analysis - -Now that your network implementations are complete and tested, let's analyze their systems behavior: - -### Performance and Memory Characteristics - -Understanding how networks scale with size and depth is crucial for building real ML systems. -""" - -# %% -def measure_network_scaling(): - """ - 📊 SYSTEMS MEASUREMENT: Network Scaling Analysis - - Measure how network complexity affects performance and memory usage. 
- """ - print("📊 NETWORK SCALING MEASUREMENT") - print("Testing how network depth and width affect computational complexity...") - - import time - - # Test different network architectures - architectures = [ - ("Narrow-Deep", [10, 8, 6, 4, 2]), - ("Wide-Shallow", [10, 50, 2]), - ("Balanced", [10, 20, 10, 2]), - ("Very Deep", [10, 8, 6, 5, 4, 3, 2]) - ] - - batch_size = 100 - num_trials = 10 - - for name, layer_sizes in architectures: - print(f"\n🔧 Testing {name} architecture: {layer_sizes}") - - # Create network - components = compose_deep_network(layer_sizes) - total_params, breakdown = count_network_parameters(components) - - # Measure forward pass time - x = Tensor(np.random.randn(batch_size, layer_sizes[0])) - - times = [] - for _ in range(num_trials): - start = time.perf_counter() - y = forward_pass_deep(x, components) - elapsed = time.perf_counter() - start - times.append(elapsed) - - avg_time = np.mean(times) * 1000 # Convert to milliseconds - - print(f" Parameters: {total_params:,}") - print(f" Layers: {len([c for c in components if isinstance(c, Linear)])}") - print(f" Forward pass: {avg_time:.2f}ms (batch={batch_size})") - print(f" Time per sample: {avg_time/batch_size:.3f}ms") - - # Memory analysis - total_weights = sum(layer.weights.data.size for layer in components if isinstance(layer, Linear)) - total_biases = sum(layer.bias.data.size for layer in components if isinstance(layer, Linear)) - memory_mb = (total_weights + total_biases) * 4 / 1024 / 1024 # float32 = 4 bytes - - print(f" Memory usage: {memory_mb:.2f} MB") - - print(f"\n💡 SCALING INSIGHTS:") - print(f" • Depth vs Width: More layers = more sequential computation") - print(f" • Parameter count dominates memory usage") - print(f" • Batch processing amortizes per-sample overhead") - print(f" • Network architecture significantly impacts performance") - -# Run the measurement -measure_network_scaling() - -# %% -def measure_parameter_scaling(): - """ - 💾 SYSTEMS MEASUREMENT: Parameter Memory 
Analysis - - Understand how parameter count scales with network size. - """ - print("💾 PARAMETER MEMORY MEASUREMENT") - print("Analyzing parameter scaling patterns...") - - # Test parameter scaling with width - print("\n📏 Width Scaling (2-layer networks):") - widths = [10, 50, 100, 200, 500] - - for width in widths: - components = compose_deep_network([10, width, 5]) - total_params, _ = count_network_parameters(components) - memory_mb = total_params * 4 / 1024 / 1024 - - print(f" Width {width:3d}: {total_params:,} params, {memory_mb:.2f} MB") - - # Test parameter scaling with depth - print("\n📏 Depth Scaling (constant width=20):") - depths = [2, 4, 6, 8, 10] - - for depth in depths: - layer_sizes = [20] * (depth + 1) # depth+1 layer sizes for depth layers - layer_sizes[-1] = 5 # Output size - components = compose_deep_network(layer_sizes) - total_params, _ = count_network_parameters(components) - memory_mb = total_params * 4 / 1024 / 1024 - - print(f" Depth {depth:2d}: {total_params:,} params, {memory_mb:.2f} MB") - - print(f"\n💡 PARAMETER INSIGHTS:") - print(f" • Width scaling: Linear here (single hidden layer); becomes O(W²) when adjacent layer widths grow together") - print(f" • Depth scaling: Linear growth O(D) for constant width") - print(f" • First and last layers often dominate parameter count") - print(f" • Memory grows linearly with parameter count") - -# Run the measurement -measure_parameter_scaling() - -# %% -def measure_batch_processing(): - """ - 📦 SYSTEMS MEASUREMENT: Batch Processing Efficiency - - Analyze how batch size affects computational efficiency. 
- """ - print("📦 BATCH PROCESSING MEASUREMENT") - print("Testing computational efficiency across batch sizes...") - - import time - - # Create test network - components = compose_deep_network([100, 50, 25, 10]) - - batch_sizes = [1, 10, 50, 100, 500, 1000] - num_trials = 5 - - print("\nBatch Size | Total Time | Time/Sample | Throughput") - print("-" * 55) - - for batch_size in batch_sizes: - x = Tensor(np.random.randn(batch_size, 100)) - - times = [] - for _ in range(num_trials): - start = time.perf_counter() - y = forward_pass_deep(x, components) - elapsed = time.perf_counter() - start - times.append(elapsed) - - avg_time = np.mean(times) * 1000 # milliseconds - time_per_sample = avg_time / batch_size - throughput = 1000 / time_per_sample # samples per second - - print(f"{batch_size:9d} | {avg_time:9.2f}ms | {time_per_sample:10.3f}ms | {throughput:8.0f} samples/s") - - print(f"\n💡 BATCH PROCESSING INSIGHTS:") - print(f" • Larger batches amortize per-batch overhead") - print(f" • Time per sample decreases with batch size") - print(f" • Throughput increases significantly with batching") - print(f" • Memory usage scales linearly with batch size") - -# Run the measurement -measure_batch_processing() - -# %% [markdown] -""" -## 🤔 ML Systems Thinking: Interactive Questions - -Now that you've implemented manual network composition, let's connect this to broader ML systems principles: -""" - -# %% [markdown] -""" -### Question 1: Memory and Performance Analysis - -In your `count_network_parameters()` function, you discovered that a 3-layer network with sizes [784, 128, 64, 10] has about 109,000 parameters. - -When you tested this network with different batch sizes, you saw that processing time per sample decreased with larger batches. Analyze the memory and computational trade-offs: - -**Your Implementation Analysis:** -- How does the parameter memory (109K parameters × 4 bytes = ~436KB) compare to activation memory for different batch sizes? 
-- Why does your `forward_pass_deep()` function become more efficient per sample with larger batches? -- At what batch size would activation memory exceed parameter memory for this network? - -**Systems Engineering Question:** -If you needed to deploy this network on a device with only 1MB of available memory, what modifications to your network composition functions would you implement to stay within memory constraints while maintaining reasonable accuracy? - -Think about: Parameter sharing strategies, layer width reduction, depth vs width trade-offs -""" - -# %% [markdown] -""" -### Question 2: Architecture Scaling Analysis - -Your `compose_deep_network()` function can create networks of arbitrary depth and width. You measured that very deep networks (10+ layers) have linear parameter growth but may suffer from other issues. - -**Implementation Scaling Analysis:** -- In your deep network experiments, which architecture pattern (narrow-deep vs wide-shallow) was more computationally efficient? -- How would you modify your `forward_pass_deep()` function to handle networks with 100+ layers efficiently? -- What bottlenecks would emerge in your current manual composition approach for very large networks? - -**Production Engineering Question:** -Design a modification to your current network composition system that could handle production-scale networks (1000+ layers, millions of parameters) while maintaining the educational clarity of manual composition. - -Think about: Memory checkpointing, activation recomputation, gradient accumulation patterns -""" - -# %% [markdown] -""" -### Question 3: Integration and Modularity Analysis - -Your manual network composition approach gives you complete control over layer ordering and activation placement. However, you've seen that composing networks manually becomes complex for large architectures. 
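
Before the analysis questions, it helps to make the manual-vs-automated trade-off concrete. A minimal Sequential-style wrapper over the same component list is only a few lines (a sketch; this `Sequential` class is a hypothetical helper, not part of this module's required API):

```python
# Minimal Sequential-style wrapper (hypothetical helper, not the module API).
class Sequential:
    def __init__(self, components):
        self.components = components  # same list that manual composition uses

    def __call__(self, x):
        # Identical logic to a manual forward pass: apply components in order
        for component in self.components:
            x = component(x)
        return x

# Plain functions stand in for layers to keep the sketch self-contained
net = Sequential([lambda v: v * 2, lambda v: v + 1])
print(net(3))  # 7: (3 * 2) + 1
```

The wrapper hides the loop but not the components, which is exactly the debugging-visibility trade-off probed below.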
- -**Integration Analysis:** -- How would you extend your current `create_classifier_network()` and `create_regression_network()` functions to support more complex architectures like residual connections? -- What interface changes to your component system would be needed to handle branching network topologies? -- How does manual composition compare to automated composition in terms of debugging and understanding? - -**Systems Architecture Question:** -Design a hybrid approach that maintains the educational benefits of your manual composition while providing the convenience of automated network building for complex architectures. What abstractions would you introduce? - -Think about: Component interfaces, graph representations, debugging visibility -""" - -# %% [markdown] -""" -## 🎯 MODULE SUMMARY: Networks - Manual Composition Mastery - -Congratulations! You've successfully implemented manual network composition that forms the foundation of all neural network architectures: - -### What You've Accomplished -✅ **Manual Network Composition**: Built 150+ lines of network architecture code with step-by-step layer composition -✅ **Forward Pass Logic**: Implemented data flow through networks of arbitrary depth and complexity -✅ **Parameter Analysis**: Created comprehensive parameter counting and memory analysis systems -✅ **Architecture Patterns**: Built specialized networks for classification, regression, and custom tasks -✅ **Systems Understanding**: Analyzed scaling behavior, memory usage, and computational complexity - -### Key Learning Outcomes -- **Network Architecture**: Understanding how layers compose into intelligent systems through manual control -- **Data Flow Principles**: Mastery of tensor shape transformations through network layers -- **Parameter Management**: Deep insight into memory requirements and computational complexity -- **Performance Characteristics**: Knowledge of how network depth and width affect efficiency - -### Mathematical Foundations 
Mastered -- **Composition Functions**: f(g(h(x))) = network(x) through sequential application -- **Parameter Scaling**: O(input_size × output_size) per layer, O(depth) for network -- **Memory Complexity**: Linear scaling with parameters plus O(batch_size × max_layer_width) for activations - -### Professional Skills Developed -- **Manual Architecture Design**: Building networks layer-by-layer with complete understanding -- **Systems Analysis**: Measuring and optimizing network performance characteristics -- **Memory Engineering**: Understanding parameter vs activation memory trade-offs -- **Performance Optimization**: Batch processing and computational efficiency analysis - -### Ready for Advanced Applications -Your manual network composition now enables: -- **Custom Architectures**: Build any network topology with complete understanding -- **Performance Analysis**: Measure and optimize network computational characteristics -- **Memory Management**: Predict and control network memory requirements -- **Educational Foundation**: Deep understanding before automated composition tools - -### Connection to Real ML Systems -Your implementation mirrors production patterns: -- **PyTorch**: Your manual composition matches nn.Sequential() internal behavior -- **TensorFlow**: Similar to tf.keras.Sequential() layer-by-layer construction -- **Industry Standard**: Manual composition used for custom architectures and research - -### Next Steps -1. **Export your module**: `tito module complete 04_networks` -2. **Validate integration**: `tito test --module networks` -3. **Explore automated composition**: Your foundation enables understanding Sequential in Module 05 -4. **Ready for Module 05**: Linear Networks with automated composition tools - -**🚀 Achievement Unlocked**: Your manual network composition mastery provides the deep understanding needed for building automated ML frameworks. You've learned to think like a neural network architect! 
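
The memory-complexity point above can be checked with quick arithmetic. The sketch below assumes float32 (4 bytes per value) and the [784, 128, 64, 10] architecture discussed earlier, counting activations as the input plus every layer output:

```python
# Parameter vs activation memory for a [784, 128, 64, 10] network (float32).
layer_sizes = [784, 128, 64, 10]

# Weights (in * out) plus biases (out) for each Linear layer
params = sum(i * o + o for i, o in zip(layer_sizes[:-1], layer_sizes[1:]))
param_bytes = params * 4
print(params)  # 109386 parameters

# Activation memory scales with batch size: input plus each layer's output
acts_per_sample = sum(layer_sizes)  # 986 values per sample
for batch in (1, 100, 500):
    act_bytes = batch * acts_per_sample * 4
    print(batch, act_bytes > param_bytes)  # crossover lands near batch ~111
```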
-""" - -# %% -if __name__ == "__main__": - # Run all tests to validate complete implementation - test_module() - - # Display completion message - print("\n" + "="*60) - print("🎯 MODULE 04 (NETWORKS) COMPLETE!") - print("📈 Progress: Manual Network Composition ✓") - print("🔥 Next up: Module 05 - Automated Linear Networks!") - print("💪 You're building real ML architecture understanding!") - print("="*60) \ No newline at end of file diff --git a/modules_old/05_autograd/ENHANCEMENT_SUMMARY.md b/modules_old/05_autograd/ENHANCEMENT_SUMMARY.md deleted file mode 100644 index 9b085cdf..00000000 --- a/modules_old/05_autograd/ENHANCEMENT_SUMMARY.md +++ /dev/null @@ -1,188 +0,0 @@ -# Module 06 (Autograd) Enhancement Summary - -## ML Framework Advisor Implementation - -Based on the ML Framework Advisor's "Excellent (A+)" rating, I've successfully implemented all four recommended production-relevant enhancements while preserving the module's excellent educational design and strong systems analysis. - -## ✅ Enhanced Features Implemented - -### 1. 
Gradient Clipping for Training Stability - -**Implementation**: Added `clip_gradients()` function with comprehensive gradient norm management - -**Key Features**: -- **Global gradient norm calculation**: Computes total norm across all variables -- **Adaptive clipping**: Only clips when gradients exceed threshold -- **In-place gradient modification**: Efficient memory usage -- **Monitoring support**: Returns gradient norm for training visualization - -**Educational Value**: -- Visual ASCII diagram showing gradient explosion vs stable training -- Mathematical foundation with gradient norm formulas -- Real-world context: Transformer, RNN, GAN training stability -- Clear connection to production training challenges - -**Code Quality**: -```python -def clip_gradients(variables: List[Variable], max_norm: float = 1.0) -> float: - # Calculate total gradient norm across all variables - total_norm = np.sqrt(sum(np.sum(var.grad.numpy() ** 2) for var in variables if var.grad is not None)) - - # Apply clipping if needed - if total_norm > max_norm: - clipping_factor = max_norm / total_norm - for var in variables: - if var.grad is not None: - var.grad = Variable(var.grad.numpy() * clipping_factor) - - return total_norm -``` - -### 2. 
Enhanced Memory Management with Dynamic vs Static Graph Analysis - -**Implementation**: Extended `AutogradSystemsProfiler` with advanced memory analysis - -**Key Features**: -- **Dynamic graph characteristics**: Memory growth rate analysis -- **Static graph opportunities**: Compilation benefit assessment -- **Memory optimization strategies**: Practical recommendations -- **Production scaling insights**: Real-world memory implications - -**Educational Insights**: -- Memory pooling vs dynamic allocation trade-offs -- Graph compilation benefits analysis -- Memory arena allocation strategies -- Lazy evaluation opportunities - -**Advanced Analysis Methods**: -```python -def _analyze_memory_management_patterns(self, results): - # Analyzes memory growth patterns for optimization opportunities - analysis = { - 'dynamic_graph_characteristics': memory_growth_analysis, - 'static_graph_opportunities': compilation_benefits, - 'memory_optimization_strategies': practical_recommendations - } -``` - -### 3. 
Graph Optimization Analysis with Fusion Opportunities - -**Implementation**: Added comprehensive graph fusion and cache efficiency analysis - -**Key Features**: -- **Operator fusion identification**: Element-wise, matrix, reduction patterns -- **Cache efficiency patterns**: Memory access optimization analysis -- **Kernel optimization strategies**: JIT compilation, vectorization -- **Bandwidth reduction potential**: Quantified performance improvements - -**Production Relevance**: -- Identifies specific fusion opportunities (attention patterns, matrix chains) -- Analyzes cache utilization and memory bandwidth -- Provides kernel optimization strategies -- Connects to real GPU acceleration techniques - -**Fusion Analysis Output**: -```python -fusion_analysis = { - 'fusion_opportunities': [ - "🔀 Element-wise operation fusion (add, multiply, activation)", - "🔗 Matrix operation chains (matmul + bias + activation)", - "📈 Reduction operation fusion (sum, mean, variance)", - "🎭 Attention pattern fusion (Q@K^T, softmax, @V)" - ], - 'cache_efficiency_patterns': detailed_analysis, - 'kernel_optimization_strategies': optimization_recommendations -} -``` - -### 4. 
Mixed Precision Training Demonstration - -**Implementation**: Complete mixed precision support with overflow detection - -**Key Features**: -- **Gradient scaling/unscaling**: Prevents FP16 underflow -- **Overflow detection**: Automatic recovery mechanism -- **Memory efficiency analysis**: Quantified memory savings -- **Performance trade-off demonstration**: Speed vs stability analysis - -**Production Features**: -- Loss scaling for gradient preservation -- Automatic overflow detection and gradient zeroing -- Memory usage comparison across precision modes -- Performance benchmarking with realistic models - -**Mixed Precision Function**: -```python -def enable_mixed_precision_gradients(variables: List[Variable], loss_scale: float = 1024.0): - # Unscale gradients and detect overflow - overflow_detected = False - for var in variables: - if var.grad is None: - continue - grad_data = var.grad.numpy() - if np.any(np.isinf(grad_data)) or np.any(np.isnan(grad_data)): - overflow_detected = True - break - var.grad = Variable(grad_data / loss_scale) # Unscale - - if overflow_detected: - # Zero gradients and skip optimizer step - for var in variables: - var.zero_grad() -``` - -## 🎯 Educational Excellence Preserved - -### Systems Thinking Integration -- **Memory vs Compute Trade-offs**: Quantified analysis with real numbers -- **Production Context**: Direct connections to PyTorch, TensorFlow implementations -- **Scaling Implications**: From toy examples to billion-parameter models -- **Performance Characteristics**: Measured timing and memory usage patterns - -### Enhanced ML Systems Questions -Updated reflection questions to focus on the new production features: -1. **Gradient Clipping**: Training stability and adaptive threshold strategies -2. **Memory Management**: Dynamic vs static graph optimization trade-offs -3. 
**Graph Optimization**: Kernel fusion and cache efficiency improvements - -### Comprehensive Testing -- **Unit tests**: Individual feature validation -- **Integration tests**: Combined feature workflows -- **Performance tests**: Scaling behavior analysis -- **Production scenarios**: Real-world usage patterns - -## 📊 Performance Improvements - -### Memory Optimization -- **Checkpointing analysis**: 66.7% memory reduction with 37.5% time overhead -- **Mixed precision**: 62.1% memory savings with 1.3x performance gain -- **Graph optimization**: Identified fusion opportunities reducing bandwidth - -### Training Stability -- **Gradient clipping**: Prevents training divergence in deep networks -- **Overflow detection**: Automatic recovery from numerical instabilities -- **Adaptive scaling**: Dynamic adjustment to training conditions - -### Production Readiness -- **Framework integration**: Direct compatibility with PyTorch/TensorFlow patterns -- **Scalability analysis**: Validated performance characteristics -- **Optimization strategies**: Actionable recommendations for large models - -## 🏆 Technical Excellence - -### Code Quality -- **Clean abstractions**: Maintainable and extensible implementations -- **Comprehensive documentation**: Clear explanations with production context -- **Error handling**: Robust overflow detection and recovery -- **Performance monitoring**: Built-in profiling and analysis tools - -### Educational Impact -- **Progressive complexity**: From basic autograd to advanced optimizations -- **Visual learning**: ASCII diagrams and performance visualizations -- **Real-world connections**: Every feature linked to production systems -- **Hands-on discovery**: Students build and analyze optimizations themselves - -## 🚀 Next Steps - -The enhanced Module 06 now provides: -1. **Complete autograd foundation**: For neural network training -2. **Production optimization techniques**: Used in real ML systems -3. 
**Performance analysis tools**: For understanding scaling behavior -4. **Training stability features**: Essential for deep network training - -This enhanced module successfully bridges the gap between educational autograd implementation and production ML systems, providing students with both theoretical understanding and practical optimization skills used in real-world deep learning training. \ No newline at end of file diff --git a/modules_old/05_autograd/README.md b/modules_old/05_autograd/README.md deleted file mode 100644 index 36810057..00000000 --- a/modules_old/05_autograd/README.md +++ /dev/null @@ -1,235 +0,0 @@ -# 🔥 Module: Autograd - -## 📊 Module Info -- **Difficulty**: ⭐⭐⭐⭐ Advanced -- **Time Estimate**: 6-8 hours -- **Prerequisites**: Tensor, Activations, Layers modules -- **Next Steps**: Training, Optimizers modules - -Build the automatic differentiation engine that makes neural network training possible. This module implements the mathematical foundation that enables backpropagation—transforming TinyTorch from a static computation library into a dynamic, trainable ML framework. - -## 🎯 Learning Objectives - -By the end of this module, you will be able to: - -- **Master automatic differentiation theory**: Understand computational graphs, chain rule application, and gradient flow -- **Implement gradient tracking systems**: Build the Variable class that automatically computes and accumulates gradients -- **Create differentiable operations**: Extend all mathematical operations to support backward propagation -- **Apply backpropagation algorithms**: Implement the gradient computation that enables neural network optimization -- **Integrate with ML systems**: Connect automatic differentiation with layers, networks, and training algorithms - -## 🧠 Build → Use → Analyze - -This module follows TinyTorch's **Build → Use → Analyze** framework: - -1. **Build**: Implement Variable class and gradient computation system using mathematical differentiation rules -2. 
**Use**: Apply automatic differentiation to complex expressions and neural network forward passes -3. **Analyze**: Understand computational graph construction, memory usage, and performance characteristics of autodiff systems - -## 📚 What You'll Build - -### Automatic Differentiation System -```python -# Variables track gradients automatically -x = Variable(5.0, requires_grad=True) -y = Variable(3.0, requires_grad=True) - -# Complex mathematical expressions -z = x**2 + 2*x*y + y**3 -print(f"f(x,y) = {z.data}") # Forward pass result - -# Automatic gradient computation -z.backward() -print(f"df/dx = {x.grad}") # ∂f/∂x = 2x + 2y = 16 -print(f"df/dy = {y.grad}") # ∂f/∂y = 2x + 3y² = 37 -``` - -### Neural Network Integration -```python -# Seamless integration with existing TinyTorch components -from tinytorch.core.layers import Dense -from tinytorch.core.activations import ReLU - -# Create differentiable network -x = Variable([[1.0, 2.0, 3.0]], requires_grad=True) -layer1 = Dense(3, 4) # Weights automatically become Variables -layer2 = Dense(4, 1) -relu = ReLU() - -# Forward pass builds computational graph -h1 = relu(layer1(x)) -output = layer2(h1) -loss = output.sum() - -# Backward pass computes all gradients -loss.backward() - -# All parameters now have gradients -print(f"Layer 1 weight gradients: {layer1.weights.grad.shape}") -print(f"Layer 2 bias gradients: {layer2.bias.grad.shape}") -print(f"Input gradients: {x.grad.shape}") -``` - -### Computational Graph Construction -```python -# Automatic graph building for complex operations -def complex_function(x, y): - a = x * y # Multiplication node - b = x + y # Addition node - c = a / b # Division node - return c.sin() # Trigonometric node - -x = Variable(2.0, requires_grad=True) -y = Variable(3.0, requires_grad=True) -result = complex_function(x, y) - -# Chain rule applied automatically through entire graph -result.backward() -print(f"Complex gradient dx: {x.grad}") -print(f"Complex gradient dy: {y.grad}") -``` - -## 🚀 
Getting Started - -### Prerequisites -Ensure you understand the mathematical building blocks: - - ```bash -# Activate TinyTorch environment - source bin/activate-tinytorch.sh - -# Verify prerequisite modules -tito test --module tensor -tito test --module activations -tito test --module layers - ``` - -### Development Workflow -1. **Open the development file**: `modules/source/06_autograd/autograd_dev.py` -2. **Implement Variable class**: Create gradient tracking wrapper around Tensors -3. **Add basic operations**: Implement differentiable arithmetic (add, multiply, power) -4. **Build backward propagation**: Implement chain rule for gradient computation -5. **Extend to all operations**: Add gradients for activations, matrix operations, etc. -6. **Export and verify**: `tito export --module autograd && tito test --module autograd` - -## 🧪 Testing Your Implementation - -### Comprehensive Test Suite -Run the full test suite to verify mathematical correctness: - -```bash -# TinyTorch CLI (recommended) -tito test --module autograd - -# Direct pytest execution -python -m pytest tests/ -k autograd -v -``` - -### Test Coverage Areas -- ✅ **Variable Creation**: Test gradient tracking initialization and properties -- ✅ **Basic Operations**: Verify arithmetic operations compute correct gradients -- ✅ **Chain Rule**: Ensure composite functions apply chain rule correctly -- ✅ **Backpropagation**: Test gradient flow through complex computational graphs -- ✅ **Neural Network Integration**: Verify seamless operation with layers and activations - -### Inline Testing & Mathematical Verification -The module includes comprehensive mathematical validation: -```python -# Example inline test output -🔬 Unit Test: Variable gradient tracking... -✅ Variable creation with gradient tracking -✅ Leaf variables correctly identified -✅ Gradient accumulation works correctly -📈 Progress: Variable System ✓ - -# Mathematical verification -🔬 Unit Test: Chain rule implementation... 
-✅ f(x) = x² → df/dx = 2x ✓ -✅ f(x,y) = xy → df/dx = y, df/dy = x ✓ -✅ Complex compositions follow chain rule ✓ -📈 Progress: Differentiation Rules ✓ -``` - -### Manual Testing Examples -```python -from autograd_dev import Variable -import math - -# Test basic differentiation rules -x = Variable(3.0, requires_grad=True) -y = x**2 -y.backward() -print(f"d(x²)/dx at x=3: {x.grad}") # Should be 6 - -# Test chain rule -x = Variable(2.0, requires_grad=True) -y = Variable(3.0, requires_grad=True) -z = (x + y) * (x - y) # Difference of squares -z.backward() -print(f"d/dx = {x.grad}") # Should be 2x = 4 -print(f"d/dy = {y.grad}") # Should be -2y = -6 - -# Test with transcendental functions -x = Variable(1.0, requires_grad=True) -y = x.exp().log() # Should equal x -y.backward() -print(f"d(exp(log(x)))/dx: {x.grad}") # Should be 1 -``` - -## 🎯 Key Concepts - -### Real-World Applications -- **Deep Learning Frameworks**: PyTorch, TensorFlow, JAX all use automatic differentiation for training -- **Scientific Computing**: Automatic differentiation enables gradient-based optimization in physics, chemistry, engineering -- **Financial Modeling**: Risk analysis and portfolio optimization use autodiff for sensitivity analysis -- **Robotics**: Control systems use gradients for trajectory optimization and inverse kinematics - -### Mathematical Foundations -- **Chain Rule**: ∂f/∂x = (∂f/∂u)(∂u/∂x) for composite functions f(u(x)) -- **Computational Graphs**: Directed acyclic graphs representing function composition -- **Forward Mode vs Reverse Mode**: Different autodiff strategies with different computational complexities -- **Gradient Accumulation**: Handling multiple computational paths to same variable - -### Automatic Differentiation Theory -- **Dual Numbers**: Mathematical foundation using infinitesimals for forward-mode AD -- **Reverse Accumulation**: Backpropagation as reverse-mode automatic differentiation -- **Higher-Order Derivatives**: Computing gradients of gradients for 
advanced optimization -- **Jacobian Computation**: Efficient computation of vector-valued function gradients - -### Implementation Patterns -- **Gradient Function Storage**: Each operation stores its backward function in the computational graph -- **Topological Sorting**: Ordering gradient computation to respect dependencies -- **Memory Management**: Efficient storage and cleanup of intermediate values -- **Numerical Stability**: Handling edge cases in gradient computation - -## 🎉 Ready to Build? - -You're about to implement the mathematical foundation that makes modern AI possible! Automatic differentiation is the invisible engine that powers every neural network, from simple classifiers to GPT and beyond. - -Understanding autodiff from first principles—implementing the Variable class and chain rule yourself—will give you deep insight into how deep learning really works. This is where mathematics meets software engineering to create something truly powerful. Take your time, understand each gradient rule, and enjoy building the heart of machine learning! 
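Before diving in, the analytic gradients quoted in the manual testing examples above (∂z/∂x = 2x = 4, ∂z/∂y = -2y = -6 for z = (x + y)(x - y) at x = 2, y = 3) can be cross-checked numerically with a central finite difference. This is a minimal, framework-free sketch using only NumPy; the step size `eps` and tolerance are arbitrary illustrative choices, and `numerical_grad` is a hypothetical helper, not part of the module's API:

```python
import numpy as np

def numerical_grad(f, x, eps=1e-6):
    # Central finite difference: (f(x+eps) - f(x-eps)) / (2*eps)
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# f(x, y) = (x + y) * (x - y) = x^2 - y^2
x, y = 2.0, 3.0
df_dx = numerical_grad(lambda v: (v + y) * (v - y), x)  # analytic: 2x = 4
df_dy = numerical_grad(lambda v: (x + v) * (x - v), y)  # analytic: -2y = -6

assert abs(df_dx - 2 * x) < 1e-4
assert abs(df_dy - (-2 * y)) < 1e-4
```

The same pattern works as a general gradient checker for any scalar function you differentiate with autograd: if the analytic and numerical gradients disagree beyond tolerance, a backward rule is wrong.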
- -```{grid} 3 -:gutter: 3 -:margin: 2 - -{grid-item-card} 🚀 Launch Builder -:link: https://mybinder.org/v2/gh/VJProductions/TinyTorch/main?filepath=modules/source/06_autograd/autograd_dev.py -:class-title: text-center -:class-body: text-center - -Interactive development environment - -{grid-item-card} 📓 Open in Colab -:link: https://colab.research.google.com/github/VJProductions/TinyTorch/blob/main/modules/source/06_autograd/autograd_dev.ipynb -:class-title: text-center -:class-body: text-center - -Google Colab notebook - -{grid-item-card} 👀 View Source -:link: https://github.com/VJProductions/TinyTorch/blob/main/modules/source/06_autograd/autograd_dev.py -:class-title: text-center -:class-body: text-center - -Browse the code on GitHub -``` \ No newline at end of file diff --git a/modules_old/05_autograd/autograd_dev.ipynb b/modules_old/05_autograd/autograd_dev.ipynb deleted file mode 100644 index b80a0a0c..00000000 --- a/modules_old/05_autograd/autograd_dev.ipynb +++ /dev/null @@ -1,2005 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "fdf6e68f", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "# Autograd - Automatic Differentiation and Computational Graph Engine\n", - "\n", - "Welcome to the Autograd module! 
You'll implement the automatic differentiation engine that makes neural network training possible by automatically computing gradients through complex computational graphs.\n", - "\n", - "## Learning Goals\n", - "- Systems understanding: How computational graphs enable automatic differentiation and why this approach scales to arbitrary network architectures\n", - "- Core implementation skill: Build the Variable class with gradient tracking and implement backward propagation through dynamic computation graphs\n", - "- Pattern recognition: Understand how chain rule application through computational graphs generalizes to any differentiable function\n", - "- Framework connection: See how your implementation mirrors PyTorch's autograd engine and tensor gradient tracking\n", - "- Performance insight: Learn why computational graph memory management and gradient accumulation strategies determine training scalability\n", - "\n", - "## Build → Use → Reflect\n", - "1. **Build**: Complete automatic differentiation system with Variable class, gradient tracking, and backward propagation\n", - "2. **Use**: Apply autograd to complex mathematical expressions and neural network operations\n", - "3. 
**Reflect**: Why does automatic differentiation enable ML at scale, and how does graph memory management affect training?\n", - "\n", - "## What You'll Achieve\n", - "By the end of this module, you'll understand:\n", - "- Deep technical understanding of how computational graphs enable automatic gradient computation for arbitrary functions\n", - "- Practical capability to build the gradient computation engine that powers all modern neural network training\n", - "- Systems insight into why automatic differentiation was the breakthrough that enabled deep learning at scale\n", - "- Performance consideration of how computational graph size and memory management affect training efficiency\n", - "- Connection to production ML systems and how frameworks optimize gradient computation and memory usage\n", - "\n", - "## Systems Reality Check\n", - "💡 **Production Context**: PyTorch's autograd can handle graphs with millions of nodes and uses sophisticated memory optimization like gradient checkpointing to train models larger than GPU memory\n", - "⚡ **Performance Note**: Gradient computation often requires storing forward activations, leading to memory usage that scales with network depth - this drives innovations like gradient checkpointing" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a11a40f1", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "autograd-imports", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "#| default_exp core.autograd\n", - "\n", - "#| export\n", - "import numpy as np\n", - "import sys\n", - "from typing import Union, List, Tuple, Optional, Any, Callable\n", - "from collections import defaultdict\n", - "\n", - "# Import our existing components\n", - "try:\n", - " from tinytorch.core.tensor import Tensor\n", - "except ImportError:\n", - " # For development, import from local modules\n", - " import os\n", - " 
sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor'))\n", - " from tensor_dev import Tensor" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e5301199", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "autograd-setup", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "print(\"🔥 TinyTorch Autograd Module\")\n", - "print(f\"NumPy version: {np.__version__}\")\n", - "print(f\"Python version: {sys.version_info.major}.{sys.version_info.minor}\")\n", - "print(\"Ready to build automatic differentiation!\")" - ] - }, - { - "cell_type": "markdown", - "id": "6cd6d0bd", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 📦 Where This Code Lives in the Final Package\n", - "\n", - "**Learning Side:** You work in `modules/source/07_autograd/autograd_dev.py` \n", - "**Building Side:** Code exports to `tinytorch.core.autograd`\n", - "\n", - "```python\n", - "# Final package structure:\n", - "from tinytorch.core.autograd import Variable, backward # The gradient engine!\n", - "from tinytorch.core.tensor import Tensor\n", - "from tinytorch.core.activations import ReLU, Sigmoid, Tanh\n", - "```\n", - "\n", - "**Why this matters:**\n", - "- **Learning:** Focused module for understanding gradients\n", - "- **Production:** Proper organization like PyTorch's `torch.autograd`\n", - "- **Consistency:** All gradient operations live together in `core.autograd`\n", - "- **Foundation:** Enables training for all neural networks" - ] - }, - { - "cell_type": "markdown", - "id": "772541a2", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## What is Automatic Differentiation?\n", - "\n", - "### The Problem: Computing Gradients at Scale\n", - "Neural networks have millions of parameters. 
To train them, we need gradients of the loss function with respect to every parameter:\n", - "\n", - "```\n", - "∇θ L = [∂L/∂w₁, ∂L/∂w₂, ..., ∂L/∂wₙ, ∂L/∂b₁, ∂L/∂b₂, ..., ∂L/∂bₘ]\n", - "```\n", - "\n", - "**Manual differentiation fails** because:\n", - "- Networks have thousands of composed functions\n", - "- Manual computation is extremely error-prone\n", - "- Every architecture change requires re-deriving all gradients\n", - "\n", - "### The Solution: Automatic Differentiation\n", - "**Autograd** automatically computes derivatives of functions represented as computational graphs:\n", - "\n", - "```python\n", - "# Instead of manually computing: ∂(x² + 2xy + y²)/∂x = 2x + 2y\n", - "# Autograd does it automatically:\n", - "x = Variable(3.0, requires_grad=True)\n", - "y = Variable(4.0, requires_grad=True)\n", - "z = x**2 + 2*x*y + y**2\n", - "z.backward()\n", - "print(x.grad) # 2*3 + 2*4 = 14 (computed automatically!)\n", - "```\n", - "\n", - "### Why This is Revolutionary\n", - "- **Efficiency**: O(1) overhead per operation\n", - "- **Flexibility**: Works with any differentiable function\n", - "- **Correctness**: Implements chain rule precisely\n", - "- **Scale**: Handles millions of parameters automatically\n", - "\n", - "### Real-World Impact\n", - "- **PyTorch**: `torch.autograd` enables all neural network training\n", - "- **TensorFlow**: `tf.GradientTape` provides similar functionality\n", - "- **JAX**: `jax.grad` for high-performance computing\n", - "- **Deep Learning**: Made training complex models practical\n", - "\n", - "Let us build the engine that powers modern AI!" 
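The `x**2 + 2*x*y + y**2` example above can be illustrated with a deliberately tiny scalar reverse-mode sketch. This toy `Node` class is not the module's `Variable` (which you build below, with richer features); it only shows the core idea that each operation records its inputs and their local derivatives, and `backward` recursively applies the chain rule:

```python
# Toy scalar reverse-mode autodiff, for illustration only.
# Each node stores its value plus (parent, local_gradient) pairs.
class Node:
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents  # list of (node, local derivative wrt that node)

    def backward(self, upstream=1.0):
        # Chain rule: accumulate upstream gradient, then propagate
        # upstream * local_gradient to each parent.
        self.grad += upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)

def add(a, b):
    # d(a+b)/da = 1, d(a+b)/db = 1
    return Node(a.value + b.value, [(a, 1.0), (b, 1.0)])

def mul(a, b):
    # Product rule: d(ab)/da = b, d(ab)/db = a
    return Node(a.value * b.value, [(a, b.value), (b, a.value)])

x, y = Node(3.0), Node(4.0)
# z = x^2 + 2xy + y^2
z = add(add(mul(x, x), mul(Node(2.0), mul(x, y))), mul(y, y))
z.backward()
print(x.grad)  # ∂z/∂x = 2x + 2y = 14
print(y.grad)  # ∂z/∂y = 2x + 2y = 14
```

Note the naive recursive `backward` here works for this expression but does not topologically sort the graph; the module's implementation addresses the general case.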
- ] - }, - { - "cell_type": "markdown", - "id": "83344a0a", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🔧 DEVELOPMENT" - ] - }, - { - "cell_type": "markdown", - "id": "96f76726", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Step 1: The Variable Class - Gradient Tracking\n", - "\n", - "### What is a Variable?\n", - "A **Variable** wraps a Tensor and tracks:\n", - "- **Data**: The actual values (forward pass)\n", - "- **Gradient**: The computed gradients (backward pass)\n", - "- **Computation history**: How this Variable was created\n", - "- **Backward function**: How to compute gradients\n", - "\n", - "### The Computational Graph\n", - "Variables automatically build a computational graph:\n", - "\n", - "```python\n", - "x = Variable(2.0) # Leaf node\n", - "y = Variable(3.0) # Leaf node\n", - "z = x * y # Intermediate node: z = x * y\n", - "w = z + 1 # Output node: w = z + 1\n", - "\n", - "# Graph: x ──→ * ──→ + ──→ w\n", - "# y ──→ ──→ ──→\n", - "```\n", - "\n", - "### Design Principles\n", - "- **Transparency**: Works seamlessly with existing operations\n", - "- **Efficiency**: Minimal overhead for forward pass\n", - "- **Flexibility**: Supports any differentiable operation\n", - "- **Correctness**: Implements chain rule precisely\n", - "\n", - "### Real-World Context\n", - "This is like:\n", - "- **PyTorch**: `torch.autograd.Variable` (now integrated into tensors)\n", - "- **TensorFlow**: `tf.Variable` with gradient tracking\n", - "- **JAX**: Variables with `jax.grad` transformation" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "07769616", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "variable-class", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class Variable:\n", - " \"\"\"\n", - " Variable: Tensor wrapper with automatic 
differentiation capabilities.\n", - " \n", - " The fundamental class for gradient computation in TinyTorch.\n", - " Wraps Tensor objects and tracks computational history for backpropagation.\n", - " \"\"\"\n", - " \n", - " def __init__(self, data: Union[Tensor, np.ndarray, list, float, int], \n", - " requires_grad: bool = True, grad_fn: Optional[Callable] = None):\n", - " \"\"\"\n", - " Create a Variable with gradient tracking.\n", - " \n", - " TODO: Implement Variable initialization with gradient tracking.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Convert data to Tensor if it is not already a Tensor\n", - " 2. Store the tensor data in self.data\n", - " 3. Set gradient tracking flag (requires_grad)\n", - " 4. Initialize gradient to None (will be computed during backward pass)\n", - " 5. Store the gradient function for backward pass\n", - " 6. Track if this is a leaf node (no grad_fn means it is a leaf)\n", - " \n", - " EXAMPLE USAGE:\n", - " ```python\n", - " # Create leaf variables (input data)\n", - " x = Variable(5.0, requires_grad=True)\n", - " y = Variable([1, 2, 3], requires_grad=True)\n", - " \n", - " # Create intermediate variables (results of operations)\n", - " z = x + y # Has grad_fn for addition\n", - " ```\n", - " \n", - " IMPLEMENTATION HINTS:\n", - " - Use isinstance(data, Tensor) to check type\n", - " - Convert with Tensor(data) if needed\n", - " - Store requires_grad, grad_fn flags\n", - " - Initialize self.grad = None\n", - " - Leaf nodes have grad_fn = None\n", - " - Set self.is_leaf = (grad_fn is None)\n", - " \n", - " LEARNING CONNECTIONS:\n", - " - This is like torch.Tensor with requires_grad=True\n", - " - Forms the basis for all neural network training\n", - " - Each Variable is a node in the computational graph\n", - " - Enables automatic gradient computation\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Convert data to Tensor if needed\n", - " if isinstance(data, Tensor):\n", - " self.data = data\n", - " else:\n", - " 
self.data = Tensor(data)\n", - " \n", - " # Set gradient tracking\n", - " self.requires_grad = requires_grad\n", - " self.grad = None # Will be initialized when needed\n", - " self.grad_fn = grad_fn\n", - " self.is_leaf = grad_fn is None\n", - " \n", - " # For computational graph\n", - " self._backward_hooks = []\n", - " ### END SOLUTION\n", - " \n", - " @property\n", - " def shape(self) -> Tuple[int, ...]:\n", - " \"\"\"Get the shape of the underlying tensor.\"\"\"\n", - " return self.data.shape\n", - " \n", - " @property\n", - " def size(self) -> int:\n", - " \"\"\"Get the total number of elements.\"\"\"\n", - " return self.data.size\n", - " \n", - " def __repr__(self) -> str:\n", - " \"\"\"String representation of the Variable.\"\"\"\n", - " grad_str = f\", grad_fn={self.grad_fn.__name__}\" if self.grad_fn else \"\"\n", - " return f\"Variable({self.data.data.tolist()}, requires_grad={self.requires_grad}{grad_str})\"\n", - " \n", - " def backward(self, gradient: Optional['Variable'] = None) -> None:\n", - " \"\"\"\n", - " Compute gradients using backpropagation.\n", - " \n", - " TODO: Implement backward pass for gradient computation.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. If gradient is None, create gradient of ones (for scalar outputs)\n", - " 2. If this Variable requires gradients, accumulate the gradient\n", - " 3. If this Variable has a grad_fn, call it to propagate gradients\n", - " 4. 
The grad_fn will recursively call backward on input Variables\n", - " \n", - " EXAMPLE USAGE:\n", - " ```python\n", - " x = Variable(2.0, requires_grad=True)\n", - " y = Variable(3.0, requires_grad=True)\n", - " z = add(x, y) # z = 5.0\n", - " z.backward()\n", - " print(x.grad) # 1.0 (∂z/∂x = 1)\n", - " print(y.grad) # 1.0 (∂z/∂y = 1)\n", - " ```\n", - " \n", - " IMPLEMENTATION HINTS:\n", - " - If gradient is None: gradient = Variable(np.ones_like(self.data.data))\n", - " - If self.requires_grad: accumulate gradient into self.grad\n", - " - If self.grad_fn: call self.grad_fn(gradient)\n", - " - Handle gradient accumulation (add to existing gradient)\n", - " \n", - " LEARNING CONNECTIONS:\n", - " - This implements the chain rule of calculus\n", - " - Gradients flow backward through the computational graph\n", - " - Each operation contributes its local gradient\n", - " - Enables training of any differentiable function\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " if gradient is None:\n", - " gradient = Variable(np.ones_like(self.data.data))\n", - " \n", - " if self.requires_grad:\n", - " if self.grad is None:\n", - " self.grad = gradient\n", - " else:\n", - " # Accumulate gradients\n", - " self.grad = Variable(self.grad.data.data + gradient.data.data)\n", - " \n", - " if self.grad_fn is not None:\n", - " self.grad_fn(gradient)\n", - " ### END SOLUTION\n", - " \n", - " def zero_grad(self) -> None:\n", - " \"\"\"Reset gradients to zero.\"\"\"\n", - " self.grad = None\n", - " \n", - " def __add__(self, other: Union['Variable', float, int]) -> 'Variable':\n", - " \"\"\"Addition operator: self + other\"\"\"\n", - " return add(self, other)\n", - " \n", - " def __mul__(self, other: Union['Variable', float, int]) -> 'Variable':\n", - " \"\"\"Multiplication operator: self * other\"\"\"\n", - " return multiply(self, other)\n", - " \n", - " def __sub__(self, other: Union['Variable', float, int]) -> 'Variable':\n", - " \"\"\"Subtraction operator: self - other\"\"\"\n", - " 
return subtract(self, other)\n", - " \n", - " def __truediv__(self, other: Union['Variable', float, int]) -> 'Variable':\n", - " \"\"\"Division operator: self / other\"\"\"\n", - " return divide(self, other) " - ] - }, - { - "cell_type": "markdown", - "id": "68e469e7", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Test Your Variable Class\n", - "\n", - "Once you implement the Variable class above, run this cell to test it:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "72a160ac", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "test-variable-class", - "locked": true, - "points": 15, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_unit_variable_class():\n", - " \"\"\"Test Variable class implementation\"\"\"\n", - " print(\"🔬 Unit Test: Variable Class...\")\n", - " \n", - " # Test Variable creation\n", - " x = Variable(5.0, requires_grad=True)\n", - " assert x.requires_grad == True, \"Variable should require gradients\"\n", - " assert x.is_leaf == True, \"Variable should be a leaf node\"\n", - " assert x.grad is None, \"Gradient should be None initially\"\n", - " \n", - " # Test data access\n", - " assert x.data.data.item() == 5.0, \"Data should be accessible\"\n", - " assert x.shape == (), \"Scalar should have empty shape\"\n", - " assert x.size == 1, \"Scalar should have size 1\"\n", - " \n", - " # Test with list input\n", - " y = Variable([1, 2, 3], requires_grad=True)\n", - " assert y.shape == (3,), \"List should create 1D tensor\"\n", - " assert y.size == 3, \"Size should be 3\"\n", - " \n", - " # Test with requires_grad=False\n", - " z = Variable(10.0, requires_grad=False)\n", - " assert z.requires_grad == False, \"Should not require gradients\"\n", - " \n", - " # Test zero_grad\n", - " x.grad = Variable(1.0)\n", - " x.zero_grad()\n", - " assert x.grad is None, \"zero_grad should 
reset gradient to None\"\n", - " \n", - " print(\"✅ Variable class tests passed!\")\n", - " print(f\"✅ Variable creation and initialization working\")\n", - " print(f\"✅ Data access and properties working\")\n", - " print(f\"✅ Gradient management working\")\n", - "\n", - "# Test will run in main block" - ] - }, - { - "cell_type": "markdown", - "id": "6632a71a", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Step 2: Basic Operations with Gradients\n", - "\n", - "### The Chain Rule in Action\n", - "Every operation must implement:\n", - "1. **Forward pass**: Compute the result\n", - "2. **Backward pass**: Compute gradients for inputs\n", - "\n", - "### Example: Addition\n", - "For z = x + y:\n", - "- **Forward**: z.data = x.data + y.data\n", - "- **Backward**: ∂z/∂x = 1, ∂z/∂y = 1\n", - "\n", - "### Mathematical Foundation\n", - "The chain rule states:\n", - "```\n", - "∂f/∂x = ∂f/∂z · ∂z/∂x\n", - "```\n", - "\n", - "For complex expressions like f(g(h(x))):\n", - "```\n", - "∂f/∂x = ∂f/∂g · ∂g/∂h · ∂h/∂x\n", - "```\n", - "\n", - "### Implementation Pattern\n", - "Each operation returns a new Variable with:\n", - "- **Forward result**: Computed value\n", - "- **Backward function**: Gradient computation" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "92e0b686", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "add-operation", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "def add(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable:\n", - " \"\"\"\n", - " Addition operation with gradient tracking: a + b\n", - " \n", - " TODO: Implement addition with automatic differentiation.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Convert inputs to Variables if they are scalars\n", - " 2. Compute forward pass: result = a.data + b.data\n", - " 3. 
Create gradient function that implements: ∂(a+b)/∂a = 1, ∂(a+b)/∂b = 1\n", - " 4. Return new Variable with result and gradient function\n", - " \n", - " MATHEMATICAL FOUNDATION:\n", - " - Forward: z = x + y\n", - " - Backward: ∂z/∂x = 1, ∂z/∂y = 1\n", - " - Chain rule: ∂L/∂x = ∂L/∂z · ∂z/∂x = ∂L/∂z · 1 = ∂L/∂z\n", - " \n", - " EXAMPLE USAGE:\n", - " ```python\n", - " x = Variable(2.0, requires_grad=True)\n", - " y = Variable(3.0, requires_grad=True)\n", - " z = add(x, y) # z = 5.0\n", - " z.backward()\n", - " print(x.grad) # 1.0 (∂z/∂x = 1)\n", - " print(y.grad) # 1.0 (∂z/∂y = 1)\n", - " ```\n", - " \n", - " IMPLEMENTATION HINTS:\n", - " - Convert scalars: if isinstance(a, (int, float)): a = Variable(a, requires_grad=False)\n", - " - Forward pass: result_data = a.data + b.data\n", - " - Backward function: def grad_fn(grad_output): if a.requires_grad: a.backward(grad_output)\n", - " - Return: Variable(result_data, grad_fn=grad_fn)\n", - " - Only propagate gradients to Variables that require them\n", - " \n", - " LEARNING CONNECTIONS:\n", - " - This is like torch.add() with autograd\n", - " - Addition distributes gradients equally to both inputs\n", - " - Forms the basis for bias addition in neural networks\n", - " - Chain rule propagates gradients through the graph\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Convert scalars to Variables\n", - " if isinstance(a, (int, float)):\n", - " a = Variable(a, requires_grad=False)\n", - " if isinstance(b, (int, float)):\n", - " b = Variable(b, requires_grad=False)\n", - " \n", - " # Forward pass\n", - " result_data = a.data + b.data\n", - " \n", - " # Backward function\n", - " def grad_fn(grad_output):\n", - " # Addition distributes gradients equally, but must handle broadcasting\n", - " if a.requires_grad:\n", - " # Get gradient data\n", - " if hasattr(grad_output.data, 'data'):\n", - " grad_data = grad_output.data.data\n", - " else:\n", - " grad_data = grad_output.data\n", - " \n", - " # Check if we need to sum over 
broadcasted dimensions\n", - " a_shape = a.data.shape if hasattr(a.data, 'shape') else ()\n", - " if grad_data.shape != a_shape:\n", - " # Sum over the broadcasted dimensions\n", - " # For bias: (batch_size, features) -> (features,)\n", - " if len(grad_data.shape) == 2 and len(a_shape) == 1:\n", - " grad_for_a = Variable(Tensor(np.sum(grad_data, axis=0)))\n", - " else:\n", - " # Handle other broadcasting cases\n", - " grad_for_a = grad_output\n", - " else:\n", - " grad_for_a = grad_output\n", - " \n", - " a.backward(grad_for_a)\n", - " \n", - " if b.requires_grad:\n", - " # Get gradient data\n", - " if hasattr(grad_output.data, 'data'):\n", - " grad_data = grad_output.data.data\n", - " else:\n", - " grad_data = grad_output.data\n", - " \n", - " # Check if we need to sum over broadcasted dimensions\n", - " b_shape = b.data.shape if hasattr(b.data, 'shape') else ()\n", - " if grad_data.shape != b_shape:\n", - " # Sum over the broadcasted dimensions\n", - " # For bias: (batch_size, features) -> (features,)\n", - " if len(grad_data.shape) == 2 and len(b_shape) == 1:\n", - " grad_for_b = Variable(Tensor(np.sum(grad_data, axis=0)))\n", - " else:\n", - " # Handle other broadcasting cases\n", - " grad_for_b = grad_output\n", - " else:\n", - " grad_for_b = grad_output\n", - " \n", - " b.backward(grad_for_b)\n", - " \n", - " # Return new Variable with gradient function\n", - " requires_grad = a.requires_grad or b.requires_grad\n", - " return Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn)\n", - " ### END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "f1984e5c", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Test Your Addition Operation\n", - "\n", - "Once you implement the add function above, run this cell to test it:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d13d985f", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": 
"test-add-operation", - "locked": true, - "points": 15, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_unit_add_operation():\n", - " \"\"\"Test addition operation with gradients\"\"\"\n", - " print(\"🔬 Unit Test: Addition Operation...\")\n", - " \n", - " # Test basic addition\n", - " x = Variable(2.0, requires_grad=True)\n", - " y = Variable(3.0, requires_grad=True)\n", - " z = add(x, y)\n", - " \n", - " assert z.data.data.item() == 5.0, \"Addition result should be 5.0\"\n", - " assert z.requires_grad == True, \"Result should require gradients\"\n", - " assert z.is_leaf == False, \"Result should not be a leaf node\"\n", - " \n", - " # Test backward pass\n", - " z.backward()\n", - " \n", - " assert x.grad is not None, \"x should have gradient\"\n", - " assert y.grad is not None, \"y should have gradient\"\n", - " assert x.grad.data.data.item() == 1.0, \"∂z/∂x should be 1.0\"\n", - " assert y.grad.data.data.item() == 1.0, \"∂z/∂y should be 1.0\"\n", - " \n", - " # Test with scalar\n", - " a = Variable(5.0, requires_grad=True)\n", - " b = add(a, 3.0) # Add scalar\n", - " \n", - " assert b.data.data.item() == 8.0, \"Addition with scalar should work\"\n", - " \n", - " b.backward()\n", - " assert a.grad.data.data.item() == 1.0, \"Gradient through scalar addition should be 1.0\"\n", - " \n", - " print(\"✅ Addition operation tests passed!\")\n", - " print(f\"✅ Forward pass computing correct results\")\n", - " print(f\"✅ Backward pass computing correct gradients\")\n", - " print(f\"✅ Scalar addition working correctly\")\n", - "\n", - "# Test will run in main block" - ] - }, - { - "cell_type": "markdown", - "id": "097a53d0", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Step 3: Multiplication Operation\n", - "\n", - "### The Product Rule\n", - "For z = x * y:\n", - "- **Forward**: z = x * y\n", - "- **Backward**: ∂z/∂x = y, ∂z/∂y = x\n", - "\n", - "### Why This 
Matters\n", - "Multiplication is everywhere in neural networks:\n", - "- **Weight scaling**: w * x in dense layers\n", - "- **Attention mechanisms**: attention_weights * values\n", - "- **Gating**: gate_signal * hidden_state\n", - "\n", - "### Chain Rule Application\n", - "When gradients flow back through multiplication:\n", - "```\n", - "∂L/∂x = ∂L/∂z · ∂z/∂x = ∂L/∂z · y\n", - "∂L/∂y = ∂L/∂z · ∂z/∂y = ∂L/∂z · x\n", - "```" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ddbf77ef", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "multiply-operation", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "def multiply(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable:\n", - " \"\"\"\n", - " Multiplication operation with gradient tracking: a * b\n", - " \n", - " TODO: Implement multiplication with automatic differentiation.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Convert inputs to Variables if they are scalars\n", - " 2. Compute forward pass: result = a.data * b.data\n", - " 3. Create gradient function implementing product rule: ∂(a*b)/∂a = b, ∂(a*b)/∂b = a\n", - " 4. 
Return new Variable with result and gradient function\n", - " \n", - " MATHEMATICAL FOUNDATION:\n", - " - Forward: z = x * y\n", - " - Backward: ∂z/∂x = y, ∂z/∂y = x\n", - " - Chain rule: ∂L/∂x = ∂L/∂z · y, ∂L/∂y = ∂L/∂z · x\n", - " \n", - " EXAMPLE USAGE:\n", - " ```python\n", - " x = Variable(2.0, requires_grad=True)\n", - " y = Variable(3.0, requires_grad=True)\n", - " z = multiply(x, y) # z = 6.0\n", - " z.backward()\n", - " print(x.grad) # 3.0 (∂z/∂x = y)\n", - " print(y.grad) # 2.0 (∂z/∂y = x)\n", - " ```\n", - " \n", - " IMPLEMENTATION HINTS:\n", - " - Convert scalars to Variables (same as addition)\n", - " - Forward pass: result_data = a.data * b.data\n", - " - Backward function: multiply incoming gradient by the other variable\n", - " - For a: a.backward(grad_output * b.data)\n", - " - For b: b.backward(grad_output * a.data)\n", - " \n", - " LEARNING CONNECTIONS:\n", - " - This is like torch.mul() with autograd\n", - " - Product rule is fundamental to backpropagation\n", - " - Used in weight updates and attention mechanisms\n", - " - Each input's gradient depends on the other input's value\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Convert scalars to Variables\n", - " if isinstance(a, (int, float)):\n", - " a = Variable(a, requires_grad=False)\n", - " if isinstance(b, (int, float)):\n", - " b = Variable(b, requires_grad=False)\n", - " \n", - " # Forward pass\n", - " result_data = a.data * b.data\n", - " \n", - " # Backward function\n", - " def grad_fn(grad_output):\n", - " # Product rule: d(xy)/dx = y, d(xy)/dy = x\n", - " if a.requires_grad:\n", - " a.backward(Variable(grad_output.data.data * b.data.data))\n", - " if b.requires_grad:\n", - " b.backward(Variable(grad_output.data.data * a.data.data))\n", - " \n", - " # Return new Variable with gradient function\n", - " requires_grad = a.requires_grad or b.requires_grad\n", - " return Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn)\n", - " ### END SOLUTION" - ] - }, - { - 
"cell_type": "markdown", - "id": "c9496ae5", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Test Your Multiplication Operation\n", - "\n", - "Once you implement the multiply function above, run this cell to test it:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "cb564244", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "test-multiply-operation", - "locked": true, - "points": 15, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_unit_multiply_operation():\n", - " \"\"\"Test multiplication operation with gradients\"\"\"\n", - " print(\"🔬 Unit Test: Multiplication Operation...\")\n", - " \n", - " # Test basic multiplication\n", - " x = Variable(2.0, requires_grad=True)\n", - " y = Variable(3.0, requires_grad=True)\n", - " z = multiply(x, y)\n", - " \n", - " assert z.data.data.item() == 6.0, \"Multiplication result should be 6.0\"\n", - " assert z.requires_grad == True, \"Result should require gradients\"\n", - " \n", - " # Test backward pass\n", - " z.backward()\n", - " \n", - " assert x.grad is not None, \"x should have gradient\"\n", - " assert y.grad is not None, \"y should have gradient\"\n", - " assert x.grad.data.data.item() == 3.0, \"∂z/∂x should be y = 3.0\"\n", - " assert y.grad.data.data.item() == 2.0, \"∂z/∂y should be x = 2.0\"\n", - " \n", - " # Test with scalar\n", - " a = Variable(4.0, requires_grad=True)\n", - " b = multiply(a, 2.0) # Multiply by scalar\n", - " \n", - " assert b.data.data.item() == 8.0, \"Multiplication with scalar should work\"\n", - " \n", - " b.backward()\n", - " assert a.grad.data.data.item() == 2.0, \"Gradient through scalar multiplication should be the scalar\"\n", - " \n", - " print(\"✅ Multiplication operation tests passed!\")\n", - " print(f\"✅ Forward pass computing correct results\")\n", - " print(f\"✅ Backward pass implementing product rule 
correctly\")\n", - " print(f\"✅ Scalar multiplication working correctly\")\n", - "\n", - "# Test will run in main block" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1764e51c", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "subtract-operation", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "def subtract(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable:\n", - " \"\"\"\n", - " Subtraction operation with gradient tracking.\n", - " \n", - " Args:\n", - " a: First operand (minuend)\n", - " b: Second operand (subtrahend)\n", - " \n", - " Returns:\n", - " Variable with difference and gradient function\n", - " \n", - " TODO: Implement subtraction with gradient computation.\n", - " \n", - " APPROACH:\n", - " 1. Convert inputs to Variables if needed\n", - " 2. Compute forward pass: result = a - b\n", - " 3. Create gradient function with correct signs\n", - " 4. 
Return Variable with result and grad_fn\n", - " \n", - " MATHEMATICAL RULE:\n", - " If z = x - y, then dz/dx = 1, dz/dy = -1\n", - " \n", - " EXAMPLE:\n", - " x = Variable(5.0), y = Variable(3.0)\n", - " z = subtract(x, y) # z.data = 2.0\n", - " z.backward() # x.grad = 1.0, y.grad = -1.0\n", - " \n", - " HINTS:\n", - " - Forward pass is straightforward: a - b\n", - " - Gradient for a is positive, for b is negative\n", - " - Remember to negate the gradient for b\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Convert to Variables if needed\n", - " if not isinstance(a, Variable):\n", - " a = Variable(a, requires_grad=False)\n", - " if not isinstance(b, Variable):\n", - " b = Variable(b, requires_grad=False)\n", - " \n", - " # Forward pass\n", - " result_data = a.data - b.data\n", - " \n", - " # Create gradient function\n", - " def grad_fn(grad_output):\n", - " # Subtraction rule: d(x-y)/dx = 1, d(x-y)/dy = -1\n", - " if a.requires_grad:\n", - " a.backward(grad_output)\n", - " if b.requires_grad:\n", - " b_grad = Variable(-grad_output.data.data)\n", - " b.backward(b_grad)\n", - " \n", - " # Determine if result requires gradients\n", - " requires_grad = a.requires_grad or b.requires_grad\n", - " \n", - " return Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn)\n", - " ### END SOLUTION" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5d10364f", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "test-subtract-operation", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_unit_subtract_operation():\n", - " \"\"\"Test subtraction operation with gradients\"\"\"\n", - " print(\"🔬 Unit Test: Subtraction Operation...\")\n", - " \n", - " # Test basic subtraction\n", - " x = Variable(5.0, requires_grad=True)\n", - " y = Variable(3.0, requires_grad=True)\n", - " z = subtract(x, y)\n", - " \n", - " assert 
z.data.data.item() == 2.0, \"Subtraction result should be 2.0\"\n", - " assert z.requires_grad == True, \"Result should require gradients\"\n", - " \n", - " # Test backward pass\n", - " z.backward()\n", - " \n", - " assert x.grad is not None, \"x should have gradient\"\n", - " assert y.grad is not None, \"y should have gradient\"\n", - " assert x.grad.data.data.item() == 1.0, \"∂z/∂x should be 1.0\"\n", - " assert y.grad.data.data.item() == -1.0, \"∂z/∂y should be -1.0\"\n", - " \n", - " # Test with scalar\n", - " a = Variable(4.0, requires_grad=True)\n", - " b = subtract(a, 2.0) # Subtract scalar\n", - " \n", - " assert b.data.data.item() == 2.0, \"Subtraction with scalar should work\"\n", - " \n", - " b.backward()\n", - " assert a.grad.data.data.item() == 1.0, \"Gradient through scalar subtraction should be 1.0\"\n", - " \n", - " print(\"✅ Subtraction operation tests passed!\")\n", - " print(f\"✅ Forward pass computing correct results\")\n", - " print(f\"✅ Backward pass implementing subtraction rule correctly\")\n", - " print(f\"✅ Scalar subtraction working correctly\")\n", - "\n", - "# Test will run in main block" - ] - }, - { - "cell_type": "markdown", - "id": "dcf7c6fa", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Step 4: Chain Rule in Complex Expressions\n", - "\n", - "### Building Complex Computations\n", - "Now let us test how multiple operations work together through the chain rule:\n", - "\n", - "### Example: f(x, y) = (x + y) * (x - y)\n", - "This creates a computational graph:\n", - "```\n", - "x, y ──→ (x + y) ──┐\n", - "                   ├──→ * ──→ result\n", - "x, y ──→ (x - y) ──┘\n", - "```\n", - "\n", - "### Chain Rule Application\n", - "- **Forward**: Compute each operation in sequence\n", - "- **Backward**: Gradients flow back through each operation\n", - "- **Automatic**: No manual gradient computation needed!\n", - "\n", - "### Real-World Significance\n", - "Complex neural networks are just larger versions 
of this:\n", - "- **Millions of operations**: Each tracked automatically\n", - "- **Complex architectures**: ResNet, Transformer, etc.\n", - "- **Efficient computation**: O(1) overhead per operation" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "33d8b3e8", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "test-chain-rule", - "locked": true, - "points": 20, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_unit_chain_rule():\n", - " \"\"\"Test chain rule with complex expressions\"\"\"\n", - " print(\"🔬 Unit Test: Chain Rule with Complex Expressions...\")\n", - " \n", - " # Test: f(x, y) = (x + y) * (x - y) = x² - y²\n", - " x = Variable(3.0, requires_grad=True)\n", - " y = Variable(2.0, requires_grad=True)\n", - " \n", - " # Build expression step by step\n", - " sum_xy = add(x, y) # x + y = 5.0\n", - " diff_xy = subtract(x, y) # x - y = 1.0\n", - " result = multiply(sum_xy, diff_xy) # (x + y) * (x - y) = 5.0\n", - " \n", - " # Check forward pass\n", - " assert result.data.data.item() == 5.0, \"Forward pass should compute 5.0\"\n", - " \n", - " # Compute gradients\n", - " result.backward()\n", - " \n", - " # Check gradients: ∂(x²-y²)/∂x = 2x, ∂(x²-y²)/∂y = -2y\n", - " expected_x_grad = 2 * x.data.data.item() # 2 * 3 = 6\n", - " expected_y_grad = -2 * y.data.data.item() # -2 * 2 = -4\n", - " \n", - " assert abs(x.grad.data.data.item() - expected_x_grad) < 1e-6, f\"x gradient should be {expected_x_grad}\"\n", - " assert abs(y.grad.data.data.item() - expected_y_grad) < 1e-6, f\"y gradient should be {expected_y_grad}\"\n", - " \n", - " # Test more complex expression: f(x) = (x + 1) * (x + 2) * (x + 3)\n", - " x2 = Variable(1.0, requires_grad=True)\n", - " \n", - " term1 = add(x2, 1.0) # x + 1 = 2.0\n", - " term2 = add(x2, 2.0) # x + 2 = 3.0\n", - " term3 = add(x2, 3.0) # x + 3 = 4.0\n", - " \n", - " product1 = multiply(term1, term2) # (x + 1) * 
(x + 2) = 6.0\n", - " result2 = multiply(product1, term3) # * (x + 3) = 24.0\n", - " \n", - " assert result2.data.data.item() == 24.0, \"Complex expression should compute 24.0\"\n", - " \n", - " result2.backward()\n", - " \n", - " # For f(x) = (x+1)(x+2)(x+3), f'(x) = 3x² + 12x + 11\n", - " # At x=1: f'(1) = 3 + 12 + 11 = 26\n", - " expected_grad = 3 * (1.0**2) + 12 * 1.0 + 11 # 26\n", - " \n", - " assert abs(x2.grad.data.data.item() - expected_grad) < 1e-6, f\"Complex gradient should be {expected_grad}\"\n", - " \n", - " print(\"✅ Chain rule tests passed!\")\n", - " print(f\"✅ Simple expression: (x+y)*(x-y) = x²-y²\")\n", - " print(f\"✅ Complex expression: (x+1)*(x+2)*(x+3)\")\n", - " print(f\"✅ Automatic gradient computation working correctly\")\n", - " print(f\"✅ Chain rule implemented correctly\")\n", - "\n", - "# Test will run in main block" - ] - }, - { - "cell_type": "markdown", - "id": "783a8bc4", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Step 5: Integration with Neural Network Training\n", - "\n", - "### The Complete Training Loop\n", - "Let us see how autograd enables neural network training:\n", - "\n", - "1. **Forward pass**: Compute predictions\n", - "2. **Loss computation**: Compare with targets\n", - "3. **Backward pass**: Compute gradients automatically\n", - "4. 
**Parameter update**: Update weights using gradients\n", - "\n", - "### Example: Simple Linear Regression\n", - "```python\n", - "# Model: y = wx + b\n", - "w = Variable(0.5, requires_grad=True)\n", - "b = Variable(0.1, requires_grad=True)\n", - "\n", - "# Forward pass\n", - "prediction = w * x + b\n", - "\n", - "# Loss: mean squared error\n", - "loss = (prediction - target)**2\n", - "\n", - "# Backward pass (automatic!)\n", - "loss.backward()\n", - "\n", - "# Update parameters\n", - "w.data = w.data - learning_rate * w.grad.data\n", - "b.data = b.data - learning_rate * b.grad.data\n", - "```\n", - "\n", - "### Why This is Powerful\n", - "- **Automatic**: No manual gradient computation\n", - "- **Flexible**: Works with any differentiable function\n", - "- **Efficient**: Minimal computational overhead\n", - "- **Scalable**: Handles millions of parameters" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8f398293", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "test-neural-network-training", - "locked": true, - "points": 25, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_module_neural_network_training():\n", - " \"\"\"Test autograd in neural network training scenario\"\"\"\n", - " print(\"🔬 Integration Test: Neural Network Training Comprehensive Test...\")\n", - " \n", - " # Simple linear regression: y = wx + b\n", - " # Training data: y = 2x + 1 (noise-free, so parameters can converge exactly)\n", - " \n", - " # Initialize parameters\n", - " w = Variable(0.1, requires_grad=True) # Start with small random value\n", - " b = Variable(0.0, requires_grad=True) # Start with zero bias\n", - " \n", - " # Training data\n", - " x_data = [1.0, 2.0, 3.0, 4.0]\n", - " y_data = [3.0, 5.0, 7.0, 9.0] # y = 2x + 1\n", - " \n", - " learning_rate = 0.01\n", - " \n", - " # Training loop\n", - " for epoch in range(100):\n", - " total_loss = Variable(0.0)\n", - " \n", - " for x_val, y_val in 
zip(x_data, y_data):\n", - " # Create input variable\n", - " x = Variable(x_val, requires_grad=False)\n", - " target = Variable(y_val, requires_grad=False)\n", - " \n", - " # Forward pass\n", - " prediction = add(multiply(w, x), b) # wx + b\n", - " \n", - " # Loss: squared error\n", - " error = subtract(prediction, target)\n", - " loss = multiply(error, error) # (pred - target)²\n", - " \n", - " # Accumulate loss\n", - " total_loss = add(total_loss, loss)\n", - " \n", - " # Backward pass\n", - " w.zero_grad()\n", - " b.zero_grad()\n", - " total_loss.backward()\n", - " \n", - " # Update parameters\n", - " if w.grad is not None:\n", - " w.data = Tensor(w.data.data - learning_rate * w.grad.data.data)\n", - " if b.grad is not None:\n", - " b.data = Tensor(b.data.data - learning_rate * b.grad.data.data)\n", - " \n", - " # Check that parameters converged to correct values\n", - " final_w = w.data.data.item()\n", - " final_b = b.data.data.item()\n", - " \n", - " print(f\"Final weights: w = {final_w:.3f}, b = {final_b:.3f}\")\n", - " print(f\"Target weights: w = 2.000, b = 1.000\")\n", - " \n", - " # Should be close to w=2, b=1\n", - " assert abs(final_w - 2.0) < 0.1, f\"Weight should be close to 2.0, got {final_w}\"\n", - " assert abs(final_b - 1.0) < 0.1, f\"Bias should be close to 1.0, got {final_b}\"\n", - " \n", - " # Test prediction with learned parameters\n", - " test_x = Variable(5.0, requires_grad=False)\n", - " test_prediction = add(multiply(w, test_x), b)\n", - " expected_output = 2.0 * 5.0 + 1.0 # 11.0\n", - " \n", - " prediction_error = abs(test_prediction.data.data.item() - expected_output)\n", - " assert prediction_error < 0.5, f\"Prediction error should be small, got {prediction_error}\"\n", - " \n", - " print(\"✅ Neural network training comprehensive tests passed!\")\n", - " print(f\"✅ Parameters converged to correct values\")\n", - " print(f\"✅ Model makes accurate predictions\")\n", - " print(f\"✅ Autograd enables automatic training\")\n", - " print(f\"✅ 
Ready for complex neural network architectures!\")\n", - "\n", - "# Test will run in main block" - ] - }, - { - "cell_type": "markdown", - "id": "4c2a1149", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## Step 6: ML Systems Thinking - Computational Graph Optimization\n", - "\n", - "### 🏗️ Autograd Systems at Production Scale\n", - "\n", - "Your autograd implementation provides the foundation for understanding how production ML frameworks optimize computational graphs for massive neural network training and inference.\n", - "\n", - "#### **Computational Graph Architecture**\n", - "```python\n", - "class ProductionAutogradEngine:\n", - " def __init__(self):\n", - " # Advanced autograd optimizations for production systems\n", - " self.graph_optimizer = ComputationalGraphOptimizer()\n", - " self.memory_manager = GradientMemoryManager()\n", - " self.kernel_fusion = AutogradKernelFusion()\n", - " self.checkpoint_manager = GradientCheckpointManager()\n", - "```\n", - "\n", - "Real autograd systems must handle:\n", - "- **Graph optimization**: Fusing operations to minimize memory access\n", - "- **Memory management**: Releasing intermediate gradients to conserve memory\n", - "- **Parallel execution**: Computing gradients across multiple devices\n", - "- **Kernel fusion**: Combining operations for GPU efficiency" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7914b3b7", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "autograd-systems-profiler", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "import time\n", - "import gc\n", - "from collections import defaultdict, deque\n", - "\n", - "class AutogradSystemsProfiler:\n", - " \"\"\"\n", - " Production Autograd System Performance Analysis and Optimization\n", - " \n", - " Analyzes computational graph efficiency, memory patterns, and optimization\n", - " 
opportunities for production automatic differentiation systems.\n", - " \"\"\"\n", - " \n", - " def __init__(self):\n", - " \"\"\"Initialize autograd systems profiler.\"\"\"\n", - " self.profiling_data = defaultdict(list)\n", - " self.graph_analysis = defaultdict(list)\n", - " self.optimization_strategies = []\n", - " \n", - " def profile_computational_graph_depth(self, max_depth=10, operations_per_level=5):\n", - " \"\"\"\n", - " Profile computational graph performance vs depth.\n", - " \n", - " TODO: Implement computational graph depth analysis.\n", - " \n", - " APPROACH:\n", - " 1. Create computational graphs of increasing depth\n", - " 2. Measure forward and backward pass timing\n", - " 3. Analyze memory usage patterns during gradient computation\n", - " 4. Identify memory accumulation and gradient flow bottlenecks\n", - " 5. Generate graph optimization recommendations\n", - " \n", - " EXAMPLE:\n", - " profiler = AutogradSystemsProfiler()\n", - " graph_analysis = profiler.profile_computational_graph_depth(max_depth=8)\n", - " print(f\"Memory scaling factor: {graph_analysis['memory_scaling_factor']:.2f}\")\n", - " \n", - " HINTS:\n", - " - Build graphs by chaining operations: x -> op1 -> op2 -> ... 
-> loss\n", - " - Measure both forward and backward pass timing separately\n", - " - Track memory usage throughout the computation\n", - " - Monitor gradient accumulation patterns\n", - " - Focus on production-relevant graph depths\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " print(\"🔧 Profiling Computational Graph Depth Impact...\")\n", - " \n", - " results = {}\n", - " \n", - " for depth in range(1, max_depth + 1):\n", - " print(f\" Testing graph depth: {depth}\")\n", - " \n", - " # Create a computational graph of specified depth\n", - " # Each level adds more operations to test scaling\n", - " \n", - " # Start with input variable\n", - " try:\n", - " # Use Variable if available, otherwise simulate\n", - " x = Variable(np.random.randn(100, 100), requires_grad=True)\n", - " except:\n", - " # Fallback for testing - simulate Variable with Tensor\n", - " x = Tensor(np.random.randn(100, 100))\n", - " \n", - " # Build computational graph of specified depth\n", - " current_var = x\n", - " operations = []\n", - " \n", - " for level in range(depth):\n", - " # Add multiple operations per level to increase complexity\n", - " for op_idx in range(operations_per_level):\n", - " try:\n", - " # Simulate various operations\n", - " if op_idx % 4 == 0:\n", - " current_var = current_var * 0.9 # Scale operation\n", - " elif op_idx % 4 == 1:\n", - " current_var = current_var + 0.1 # Add operation\n", - " elif op_idx % 4 == 2:\n", - " # Matrix multiplication (most expensive)\n", - " weight = Tensor(np.random.randn(100, 100))\n", - " if hasattr(current_var, 'data'):\n", - " current_var = Tensor(current_var.data @ weight.data)\n", - " else:\n", - " current_var = current_var @ weight\n", - " else:\n", - " # Activation-like operation\n", - " if hasattr(current_var, 'data'):\n", - " current_var = Tensor(np.maximum(0, current_var.data))\n", - " else:\n", - " current_var = current_var # Skip for simplicity\n", - " \n", - " operations.append(f\"level_{level}_op_{op_idx}\")\n", - " 
except:\n", - " # Fallback for testing\n", - " current_var = Tensor(np.random.randn(100, 100))\n", - " operations.append(f\"level_{level}_op_{op_idx}_fallback\")\n", - " \n", - " # Add final loss computation\n", - " try:\n", - " if hasattr(current_var, 'data'):\n", - " loss = Tensor(np.sum(current_var.data ** 2))\n", - " else:\n", - " loss = np.sum(current_var ** 2)\n", - " except:\n", - " loss = Tensor(np.array([1.0]))\n", - " \n", - " # Measure forward pass timing\n", - " forward_iterations = 3\n", - " forward_start = time.time()\n", - " \n", - " for _ in range(forward_iterations):\n", - " # Simulate forward pass computation\n", - " temp_x = x\n", - " for level in range(depth):\n", - " for op_idx in range(operations_per_level):\n", - " if op_idx % 4 == 0:\n", - " temp_x = temp_x * 0.9\n", - " elif op_idx % 4 == 1:\n", - " temp_x = temp_x + 0.1\n", - " # Skip expensive ops for timing\n", - " \n", - " forward_end = time.time()\n", - " avg_forward_time = (forward_end - forward_start) / forward_iterations\n", - " \n", - " # Measure backward pass timing (simulated)\n", - " # In real implementation, this would be loss.backward()\n", - " backward_start = time.time()\n", - " \n", - " # Simulate gradient computation through the graph\n", - " for _ in range(forward_iterations):\n", - " # Simulate backpropagation through all operations\n", - " gradient_accumulation = 0\n", - " for level in range(depth):\n", - " for op_idx in range(operations_per_level):\n", - " # Simulate gradient computation\n", - " gradient_accumulation += level * op_idx * 0.001\n", - " \n", - " backward_end = time.time()\n", - " avg_backward_time = (backward_end - backward_start) / forward_iterations\n", - " \n", - " # Memory analysis\n", - " try:\n", - " if hasattr(x, 'data'):\n", - " base_memory = x.data.nbytes / (1024 * 1024) # MB\n", - " if hasattr(current_var, 'data'):\n", - " result_memory = current_var.data.nbytes / (1024 * 1024)\n", - " else:\n", - " result_memory = base_memory\n", - " else:\n", 
- " base_memory = x.nbytes / (1024 * 1024) if hasattr(x, 'nbytes') else 1.0\n", - " result_memory = base_memory\n", - " except:\n", - " base_memory = 1.0\n", - " result_memory = 1.0\n", - " \n", - " # Estimate gradient memory (in production, each operation stores gradients)\n", - " estimated_gradient_memory = depth * operations_per_level * base_memory * 0.5\n", - " total_memory = base_memory + result_memory + estimated_gradient_memory\n", - " \n", - " # Calculate efficiency metrics\n", - " total_operations = depth * operations_per_level\n", - " total_time = avg_forward_time + avg_backward_time\n", - " operations_per_second = total_operations / total_time if total_time > 0 else 0\n", - " \n", - " result = {\n", - " 'graph_depth': depth,\n", - " 'total_operations': total_operations,\n", - " 'forward_time_ms': avg_forward_time * 1000,\n", - " 'backward_time_ms': avg_backward_time * 1000,\n", - " 'total_time_ms': total_time * 1000,\n", - " 'base_memory_mb': base_memory,\n", - " 'estimated_gradient_memory_mb': estimated_gradient_memory,\n", - " 'total_memory_mb': total_memory,\n", - " 'operations_per_second': operations_per_second,\n", - " 'memory_per_operation': total_memory / total_operations if total_operations > 0 else 0\n", - " }\n", - " \n", - " results[depth] = result\n", - " \n", - " print(f\" Forward: {avg_forward_time*1000:.3f}ms, Backward: {avg_backward_time*1000:.3f}ms, Memory: {total_memory:.2f}MB\")\n", - " \n", - " # Analyze scaling patterns\n", - " graph_analysis = self._analyze_graph_scaling(results)\n", - " \n", - " # Store profiling data\n", - " self.profiling_data['graph_depth_analysis'] = results\n", - " self.graph_analysis = graph_analysis\n", - " \n", - " return {\n", - " 'detailed_results': results,\n", - " 'graph_analysis': graph_analysis,\n", - " 'optimization_strategies': self._generate_graph_optimizations(results)\n", - " }\n", - " ### END SOLUTION\n", - " \n", - " def _analyze_graph_scaling(self, results):\n", - " \"\"\"Analyze computational 
graph scaling patterns.\"\"\"\n", - " analysis = {}\n", - " \n", - " # Extract metrics for scaling analysis\n", - " depths = sorted(results.keys())\n", - " forward_times = [results[d]['forward_time_ms'] for d in depths]\n", - " backward_times = [results[d]['backward_time_ms'] for d in depths]\n", - " total_times = [results[d]['total_time_ms'] for d in depths]\n", - " memory_usage = [results[d]['total_memory_mb'] for d in depths]\n", - " \n", - " # Calculate scaling factors\n", - " if len(depths) >= 2:\n", - " shallow = depths[0]\n", - " deep = depths[-1]\n", - " \n", - " depth_ratio = deep / shallow\n", - " forward_time_ratio = results[deep]['forward_time_ms'] / results[shallow]['forward_time_ms']\n", - " backward_time_ratio = results[deep]['backward_time_ms'] / results[shallow]['backward_time_ms']\n", - " memory_ratio = results[deep]['total_memory_mb'] / results[shallow]['total_memory_mb']\n", - " \n", - " analysis['scaling_metrics'] = {\n", - " 'depth_ratio': depth_ratio,\n", - " 'forward_time_scaling': forward_time_ratio,\n", - " 'backward_time_scaling': backward_time_ratio,\n", - " 'memory_scaling': memory_ratio,\n", - " 'theoretical_linear': depth_ratio # Expected linear scaling\n", - " }\n", - " \n", - " # Identify bottlenecks\n", - " if backward_time_ratio > forward_time_ratio * 1.5:\n", - " analysis['primary_bottleneck'] = 'backward_pass'\n", - " analysis['bottleneck_reason'] = 'Gradient computation scaling worse than forward pass'\n", - " elif memory_ratio > depth_ratio * 1.5:\n", - " analysis['primary_bottleneck'] = 'memory'\n", - " analysis['bottleneck_reason'] = 'Memory usage scaling faster than linear'\n", - " else:\n", - " analysis['primary_bottleneck'] = 'balanced'\n", - " analysis['bottleneck_reason'] = 'Forward and backward passes scaling proportionally'\n", - " \n", - " # Backward/Forward ratio analysis\n", - " backward_forward_ratios = [\n", - " results[d]['backward_time_ms'] / max(results[d]['forward_time_ms'], 0.001)\n", - " for d in depths\n", 
- " ]\n", - " avg_backward_forward_ratio = sum(backward_forward_ratios) / len(backward_forward_ratios)\n", - " \n", - " analysis['efficiency_metrics'] = {\n", - " 'avg_backward_forward_ratio': avg_backward_forward_ratio,\n", - " 'peak_memory_mb': max(memory_usage),\n", - " 'memory_efficiency_trend': 'increasing' if memory_usage[-1] > memory_usage[0] * 2 else 'stable'\n", - " }\n", - " \n", - " return analysis\n", - " \n", - " def _generate_graph_optimizations(self, results):\n", - " \"\"\"Generate computational graph optimization strategies.\"\"\"\n", - " strategies = []\n", - " \n", - " # Analyze memory growth patterns\n", - " peak_memory = max(result['total_memory_mb'] for result in results.values())\n", - " \n", - " if peak_memory > 50: # > 50MB memory usage\n", - " strategies.append(\"💾 High memory usage detected in computational graph\")\n", - " strategies.append(\"🔧 Strategy: Gradient checkpointing for deep graphs\")\n", - " strategies.append(\"🔧 Strategy: In-place operations where mathematically valid\")\n", - " \n", - " # Analyze computational efficiency\n", - " graph_analysis = self.graph_analysis\n", - " if graph_analysis and 'scaling_metrics' in graph_analysis:\n", - " backward_scaling = graph_analysis['scaling_metrics']['backward_time_scaling']\n", - " if backward_scaling > 2.0:\n", - " strategies.append(\"🐌 Backward pass scaling poorly with graph depth\")\n", - " strategies.append(\"🔧 Strategy: Kernel fusion for backward operations\")\n", - " strategies.append(\"🔧 Strategy: Parallel gradient computation\")\n", - " \n", - " # Memory vs computation trade-offs\n", - " if graph_analysis and 'efficiency_metrics' in graph_analysis:\n", - " backward_forward_ratio = graph_analysis['efficiency_metrics']['avg_backward_forward_ratio']\n", - " if backward_forward_ratio > 3.0:\n", - " strategies.append(\"⚖️ Backward pass significantly slower than forward\")\n", - " strategies.append(\"🔧 Strategy: Optimize gradient computation with sparse gradients\")\n", - " 
strategies.append(\"🔧 Strategy: Use mixed precision to reduce memory bandwidth\")\n", - " \n", - " # Production optimization recommendations\n", - " strategies.append(\"🏭 Production graph optimizations:\")\n", - " strategies.append(\" • Graph compilation and optimization (TorchScript, XLA)\")\n", - " strategies.append(\" • Operator fusion to minimize intermediate allocations\")\n", - " strategies.append(\" • Dynamic shape optimization for variable input sizes\")\n", - " strategies.append(\" • Gradient accumulation for large effective batch sizes\")\n", - " \n", - " return strategies\n", - "\n", - " def analyze_memory_checkpointing_trade_offs(self, checkpoint_frequencies=[1, 2, 4, 8]):\n", - " \"\"\"\n", - " Analyze memory vs computation trade-offs with gradient checkpointing.\n", - " \n", - " This function is PROVIDED to demonstrate checkpointing analysis.\n", - " Students use it to understand memory optimization strategies.\n", - " \"\"\"\n", - " print(\"🔍 GRADIENT CHECKPOINTING ANALYSIS\")\n", - " print(\"=\" * 45)\n", - " \n", - " base_graph_depth = 12\n", - " base_memory_per_layer = 10 # MB per layer\n", - " base_computation_time = 5 # ms per layer\n", - " \n", - " checkpointing_results = []\n", - " \n", - " for freq in checkpoint_frequencies:\n", - " # Calculate memory savings\n", - " # Without checkpointing: store all intermediate activations\n", - " no_checkpoint_memory = base_graph_depth * base_memory_per_layer\n", - " \n", - " # With checkpointing: only store every freq-th activation\n", - " checkpointed_memory = (base_graph_depth // freq + 1) * base_memory_per_layer\n", - " memory_savings = no_checkpoint_memory - checkpointed_memory\n", - " memory_reduction_pct = (memory_savings / no_checkpoint_memory) * 100\n", - " \n", - " # Calculate recomputation overhead\n", - " # Need to recompute (freq-1) layers for each checkpoint\n", - " recomputation_layers = base_graph_depth * (freq - 1) / freq\n", - " recomputation_time = recomputation_layers * 
base_computation_time\n", - " \n", - " # Total training time = forward + backward + recomputation\n", - " base_training_time = base_graph_depth * base_computation_time * 2 # forward + backward\n", - " total_training_time = base_training_time + recomputation_time\n", - " time_overhead_pct = (recomputation_time / base_training_time) * 100\n", - " \n", - " result = {\n", - " 'checkpoint_frequency': freq,\n", - " 'memory_mb': checkpointed_memory,\n", - " 'memory_reduction_pct': memory_reduction_pct,\n", - " 'recomputation_time_ms': recomputation_time,\n", - " 'time_overhead_pct': time_overhead_pct,\n", - " 'memory_time_ratio': memory_reduction_pct / max(time_overhead_pct, 1)\n", - " }\n", - " checkpointing_results.append(result)\n", - " \n", - " print(f\" Checkpoint every {freq} layers:\")\n", - " print(f\" Memory: {checkpointed_memory:.0f}MB ({memory_reduction_pct:.1f}% reduction)\")\n", - " print(f\" Time overhead: {time_overhead_pct:.1f}%\")\n", - " print(f\" Efficiency ratio: {result['memory_time_ratio']:.2f}\")\n", - " \n", - " # Find optimal trade-off\n", - " optimal = max(checkpointing_results, key=lambda x: x['memory_time_ratio'])\n", - " \n", - " print(f\"\\n📈 Checkpointing Analysis:\")\n", - " print(f\" Optimal frequency: Every {optimal['checkpoint_frequency']} layers\")\n", - " print(f\" Best trade-off: {optimal['memory_reduction_pct']:.1f}% memory reduction\")\n", - " print(f\" Cost: {optimal['time_overhead_pct']:.1f}% time overhead\")\n", - " \n", - " return checkpointing_results" - ] - }, - { - "cell_type": "markdown", - "id": "f24d5f2b", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Test: Autograd Systems Profiling\n", - "\n", - "Let us test our autograd systems profiler with realistic computational graph scenarios." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3cb6d88d", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "test-autograd-profiler", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_autograd_systems_profiler():\n", - " \"\"\"Test autograd systems profiler with comprehensive scenarios.\"\"\"\n", - " print(\"🔬 Unit Test: Autograd Systems Profiler...\")\n", - " \n", - " profiler = AutogradSystemsProfiler()\n", - " \n", - " # Test computational graph depth analysis\n", - " try:\n", - " graph_analysis = profiler.profile_computational_graph_depth(max_depth=5, operations_per_level=3)\n", - " \n", - " # Verify analysis structure\n", - " assert 'detailed_results' in graph_analysis, \"Should provide detailed results\"\n", - " assert 'graph_analysis' in graph_analysis, \"Should provide graph analysis\"\n", - " assert 'optimization_strategies' in graph_analysis, \"Should provide optimization strategies\"\n", - " \n", - " # Verify detailed results\n", - " results = graph_analysis['detailed_results']\n", - " assert len(results) == 5, \"Should test all graph depths\"\n", - " \n", - " for depth, result in results.items():\n", - " assert 'forward_time_ms' in result, f\"Should include forward timing for depth {depth}\"\n", - " assert 'backward_time_ms' in result, f\"Should include backward timing for depth {depth}\"\n", - " assert 'total_memory_mb' in result, f\"Should analyze memory for depth {depth}\"\n", - " assert result['forward_time_ms'] >= 0, f\"Forward time should be non-negative for depth {depth}\"\n", - " assert result['backward_time_ms'] >= 0, f\"Backward time should be non-negative for depth {depth}\"\n", - " \n", - " print(\"✅ Computational graph depth analysis test passed\")\n", - " \n", - " # Test memory checkpointing analysis\n", - " checkpointing_analysis = profiler.analyze_memory_checkpointing_trade_offs(checkpoint_frequencies=[1, 2, 4])\n", - " 
\n", - " assert isinstance(checkpointing_analysis, list), \"Should return checkpointing analysis results\"\n", - " assert len(checkpointing_analysis) == 3, \"Should analyze all checkpoint frequencies\"\n", - " \n", - " for result in checkpointing_analysis:\n", - " assert 'checkpoint_frequency' in result, \"Should include checkpoint frequency\"\n", - " assert 'memory_reduction_pct' in result, \"Should calculate memory reduction\"\n", - " assert 'time_overhead_pct' in result, \"Should calculate time overhead\"\n", - " assert result['memory_reduction_pct'] >= 0, \"Memory reduction should be non-negative\"\n", - " \n", - " print(\"✅ Memory checkpointing analysis test passed\")\n", - " \n", - " except Exception as e:\n", - " print(f\"⚠️ Autograd profiling test had issues: {e}\")\n", - " print(\"✅ Basic structure test passed (graceful degradation)\")\n", - " \n", - " print(\"🎯 Autograd Systems Profiler: All tests passed!\")\n", - "\n", - "# Test will run in main block\n", - "\n", - "if __name__ == \"__main__\":\n", - " print(\"\\n🧪 Running Autograd Module Tests...\")\n", - " \n", - " # Run all unit tests\n", - " test_unit_variable_class()\n", - " test_unit_add_operation()\n", - " test_unit_multiply_operation()\n", - " test_unit_subtract_operation()\n", - " test_unit_chain_rule()\n", - " test_module_neural_network_training()\n", - " test_autograd_systems_profiler()\n", - " \n", - " print(\"\\n✅ All Autograd Module Tests Completed!\") \n", - " print(\"Autograd module complete!\")" - ] - }, - { - "cell_type": "markdown", - "id": "e7a0b05c", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🤔 ML Systems Thinking: Interactive Questions\n", - "\n", - "Now that you've built automatic differentiation capabilities that enable neural network training, let's connect this foundational work to broader ML systems challenges. 
These questions help you think critically about how computational graphs scale to production training environments.\n", - "\n", - "Take time to reflect thoughtfully on each question - your insights will help you understand how the automatic differentiation concepts you've implemented connect to real-world ML systems engineering." - ] - }, - { - "cell_type": "markdown", - "id": "1737577a", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### Question 1: Computational Graphs and Memory Management\n", - "\n", - "**Context**: Your autograd implementation builds computational graphs and stores intermediate values for gradient computation. Production training systems must manage memory efficiently when training models with billions of parameters and complex computational graphs that can consume enormous amounts of memory.\n", - "\n", - "**Reflection Question**: Design a memory-efficient automatic differentiation system for training large-scale neural networks that optimizes computational graph storage and gradient computation. How would you implement gradient checkpointing strategies, manage memory vs compute trade-offs, and optimize graph compilation for both dynamic flexibility and static optimization? 
Consider scenarios where you need to train models that exceed GPU memory capacity while maintaining numerical precision and training speed.\n", - "\n", - "Think about: gradient checkpointing strategies, memory vs compute trade-offs, graph optimization techniques, and distributed gradient computation.\n", - "\n", - "*Target length: 150-300 words*" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8965cbe2", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "question-1-computational-graphs", - "locked": false, - "points": 10, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "\"\"\"\n", - "YOUR REFLECTION ON COMPUTATIONAL GRAPHS AND MEMORY MANAGEMENT:\n", - "\n", - "TODO: Replace this text with your thoughtful response about memory-efficient automatic differentiation system design.\n", - "\n", - "Consider addressing:\n", - "- How would you implement gradient checkpointing to optimize memory usage in large models?\n", - "- What strategies would you use to balance memory consumption with computational efficiency?\n", - "- How would you design graph compilation that maintains flexibility while enabling optimization?\n", - "- What role would distributed gradient computation play in your system design?\n", - "- How would you handle memory constraints while preserving numerical precision?\n", - "\n", - "Write a technical analysis connecting your autograd implementations to real memory management challenges.\n", - "\n", - "GRADING RUBRIC (Instructor Use):\n", - "- Demonstrates understanding of computational graph memory management (3 points)\n", - "- Addresses gradient checkpointing and memory optimization strategies (3 points)\n", - "- Shows practical knowledge of graph compilation and optimization techniques (2 points)\n", - "- Demonstrates systems thinking about memory vs compute trade-offs (2 points)\n", - "- Clear technical reasoning and practical considerations (bonus points for 
innovative approaches)\n", - "\"\"\"\n", - "\n", - "### BEGIN SOLUTION\n", - "# Student response area - instructor will replace this section during grading setup\n", - "# This is a manually graded question requiring technical analysis of computational graph optimization\n", - "# Students should demonstrate understanding of memory management and gradient computation efficiency\n", - "### END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "4101d38a", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### Question 2: Distributed Training and Gradient Synchronization\n", - "\n", - "**Context**: Your autograd computes gradients on a single device, but production training systems must coordinate gradient computation across multiple GPUs and nodes. Efficient gradient synchronization becomes critical for training performance and scalability.\n", - "\n", - "**Reflection Question**: Architect a distributed automatic differentiation system that efficiently coordinates gradient computation across multiple devices and maintains training efficiency at scale. How would you implement gradient synchronization strategies, handle communication optimization, and manage numerical stability across distributed training? 
Consider scenarios where you need to train transformer models across hundreds of GPUs while minimizing communication overhead and maintaining convergence guarantees.\n", - "\n", - "Think about: gradient synchronization strategies, communication optimization, distributed computation patterns, and scalability considerations.\n", - "\n", - "*Target length: 150-300 words*" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "49149516", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "question-2-distributed-training", - "locked": false, - "points": 10, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "\"\"\"\n", - "YOUR REFLECTION ON DISTRIBUTED TRAINING AND GRADIENT SYNCHRONIZATION:\n", - "\n", - "TODO: Replace this text with your thoughtful response about distributed automatic differentiation system design.\n", - "\n", - "Consider addressing:\n", - "- How would you design gradient synchronization for efficient distributed training?\n", - "- What strategies would you use to minimize communication overhead in multi-GPU training?\n", - "- How would you implement gradient compression and optimization for distributed systems?\n", - "- What role would asynchronous vs synchronous training play in your design?\n", - "- How would you ensure numerical stability and convergence in distributed settings?\n", - "\n", - "Write an architectural analysis connecting your autograd implementation to real distributed training challenges.\n", - "\n", - "GRADING RUBRIC (Instructor Use):\n", - "- Shows understanding of distributed training and gradient synchronization (3 points)\n", - "- Designs practical approaches to communication optimization and scalability (3 points)\n", - "- Addresses numerical stability and convergence in distributed settings (2 points)\n", - "- Demonstrates systems thinking about distributed computation patterns (2 points)\n", - "- Clear architectural reasoning with distributed 
systems insights (bonus points for comprehensive understanding)\n", - "\"\"\"\n", - "\n", - "### BEGIN SOLUTION\n", - "# Student response area - instructor will replace this section during grading setup\n", - "# This is a manually graded question requiring understanding of distributed training systems\n", - "# Students should demonstrate knowledge of gradient synchronization and communication optimization\n", - "### END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "3debca49", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### Question 3: Advanced Training Optimizations and System Integration\n", - "\n", - "**Context**: Your autograd provides basic gradient computation, but production training systems must integrate with advanced optimization techniques like mixed precision training, gradient accumulation, and specialized hardware acceleration to achieve optimal performance.\n", - "\n", - "**Reflection Question**: Design an advanced automatic differentiation system that integrates with modern training optimizations and hardware acceleration capabilities. How would you implement automatic mixed precision support, gradient accumulation for large effective batch sizes, and integration with specialized hardware like TPUs? 
Consider scenarios where you need to optimize training for both research flexibility and production efficiency while maintaining numerical stability and debugging capabilities.\n", - "\n", - "Think about: mixed precision training, gradient accumulation strategies, hardware integration, and training optimization techniques.\n", - "\n", - "*Target length: 150-300 words*" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5a4a0c51", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "question-3-training-optimizations", - "locked": false, - "points": 10, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "\"\"\"\n", - "YOUR REFLECTION ON ADVANCED TRAINING OPTIMIZATIONS:\n", - "\n", - "TODO: Replace this text with your thoughtful response about advanced automatic differentiation system design.\n", - "\n", - "Consider addressing:\n", - "- How would you integrate automatic mixed precision training with gradient computation?\n", - "- What strategies would you use for gradient accumulation and large batch simulation?\n", - "- How would you design hardware integration for specialized accelerators like TPUs?\n", - "- What role would advanced optimizations play while maintaining research flexibility?\n", - "- How would you ensure numerical stability across different precision and hardware configurations?\n", - "\n", - "Write a design analysis connecting your autograd implementation to real training optimization challenges.\n", - "\n", - "GRADING RUBRIC (Instructor Use):\n", - "- Understands advanced training optimizations and mixed precision challenges (3 points)\n", - "- Designs practical approaches to gradient accumulation and hardware integration (3 points)\n", - "- Addresses numerical stability and research vs production trade-offs (2 points)\n", - "- Shows systems thinking about training optimization and system integration (2 points)\n", - "- Clear design reasoning with training 
optimization insights (bonus points for deep understanding)\n", - "\"\"\"\n", - "\n", - "### BEGIN SOLUTION\n", - "# Student response area - instructor will replace this section during grading setup\n", - "# This is a manually graded question requiring understanding of advanced training optimizations\n", - "# Students should demonstrate knowledge of mixed precision, gradient accumulation, and hardware integration\n", - "### END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "2029f29c", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🎯 MODULE SUMMARY: Automatic Differentiation\n", - "\n", - "Congratulations! You have successfully implemented automatic differentiation:\n", - "\n", - "### What You Have Accomplished\n", - "✅ **Computational Graphs**: Dynamic graph construction for gradient computation\n", - "✅ **Backpropagation**: Efficient gradient computation through reverse mode AD\n", - "✅ **Gradient Tracking**: Automatic gradient accumulation and management\n", - "✅ **Integration**: Seamless compatibility with Tensor operations\n", - "✅ **Real Applications**: Neural network training and optimization\n", - "\n", - "### Key Concepts You Have Learned\n", - "- **Computational graphs**: How operations are tracked for gradient computation\n", - "- **Backpropagation**: Reverse mode automatic differentiation\n", - "- **Gradient accumulation**: How gradients flow through complex operations\n", - "- **Memory management**: Efficient handling of gradient storage\n", - "- **Integration patterns**: How autograd works with neural networks\n", - "\n", - "### Mathematical Foundations\n", - "- **Chain rule**: The mathematical foundation of backpropagation\n", - "- **Computational graphs**: Representing operations as directed acyclic graphs\n", - "- **Gradient flow**: How gradients propagate through complex functions\n", - "- **Memory efficiency**: Optimizing gradient storage and computation\n", - "\n", - "### Professional Skills Developed\n", - "- 
**Graph construction**: Building dynamic computational graphs\n", - "- **Gradient computation**: Implementing efficient backpropagation\n", - "- **Memory optimization**: Managing gradient storage efficiently\n", - "- **Integration testing**: Ensuring autograd works with all operations\n", - "\n", - "### Ready for Advanced Applications\n", - "Your autograd implementation now enables:\n", - "- **Neural network training**: Complete training pipelines with gradients\n", - "- **Optimization algorithms**: Gradient-based optimization methods\n", - "- **Custom loss functions**: Implementing specialized loss functions\n", - "- **Advanced architectures**: Training complex neural network models\n", - "\n", - "### Connection to Real ML Systems\n", - "Your implementations mirror production systems:\n", - "- **PyTorch**: `torch.autograd` provides identical functionality\n", - "- **TensorFlow**: `tf.GradientTape` implements similar concepts\n", - "- **JAX**: `jax.grad` uses similar automatic differentiation\n", - "- **Industry Standard**: Every major ML framework uses these exact principles\n", - "\n", - "### Next Steps\n", - "1. **Export your code**: `tito export 06_autograd`\n", - "2. **Test your implementation**: `tito test 06_autograd`\n", - "3. **Build training systems**: Combine with optimizers for complete training\n", - "4. **Move to Module 10**: Add optimization algorithms!\n", - "\n", - "**Ready for optimizers?** Your autograd system is now ready for real training!" 
- ] - } - ], - "metadata": { - "jupytext": { - "main_language": "python" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/modules_old/05_autograd/autograd_dev.py b/modules_old/05_autograd/autograd_dev.py deleted file mode 100644 index 898404b8..00000000 --- a/modules_old/05_autograd/autograd_dev.py +++ /dev/null @@ -1,1635 +0,0 @@ -# --- -# jupyter: -# jupytext: -# text_representation: -# extension: .py -# format_name: percent -# format_version: '1.3' -# jupytext_version: 1.17.1 -# --- - -# %% [markdown] -""" -# Autograd - Automatic Differentiation Engine - -Welcome to Autograd! You'll build automatic differentiation step by step, giving your Tensor class the ability to compute gradients automatically for neural network training. - -## 🔗 Building on Previous Learning -**What You Built Before**: -- Module 01 (Setup): Development environment ready -- Module 02 (Tensor): Complete tensor operations with math -- Module 03 (Activations): Functions that add intelligence to networks -- Module 04 (Losses): Functions that measure learning progress - -**What's Working**: Your tensors can do math, activations, and loss calculations perfectly! - -**The Gap**: Your tensors can't learn - they have no memory of how gradients flow backward through computations. - -**This Module's Solution**: Enhance your existing Tensor class with gradient tracking abilities, step by step. - -**Connection Map**: -``` -Math Operations → Smart Operations → Learning Operations -(Pure Tensors) (+ Autograd) (+ Optimizers) -``` - -## Learning Objectives -1. **Incremental Enhancement**: Add gradient tracking without breaking existing code -2. **Chain Rule Mastery**: Understand how gradients flow through complex expressions -3. **Systems Understanding**: Memory and performance implications of automatic differentiation -4. **Professional Skills**: How to enhance software systems safely - -## Build → Test → Use -1. **Build**: Six incremental steps, each immediately testable -2. 
**Test**: Frequent validation with clear success indicators -3. **Use**: Enable gradient-based optimization for training - -## 📦 Where This Code Lives in the Final Package - -**Learning Side:** You work in modules/05_autograd/autograd_dev.py -**Building Side:** Code exports to tinytorch.core.autograd - -```python -# Final package structure: -from tinytorch.core.autograd import Tensor # Enhanced Tensor with gradients -from tinytorch.core.tensor import Tensor # Your original pure Tensor (backup) - -# Your enhanced Tensor can do everything: -x = Tensor([1, 2, 3], requires_grad=True) # New gradient capability -y = x + 2 # Same math operations -y.backward() # New gradient computation -``` - -**Why this matters:** -- **Learning:** Experience incremental software enhancement with immediate feedback -- **Production:** How real ML systems add features without breaking existing functionality -- **Professional Practice:** Safe software evolution patterns used in industry -- **Integration:** Your enhanced Tensor works with all previous modules -""" - -# %% -#| default_exp core.autograd - -#| export -import numpy as np -import sys -from typing import Union, List, Optional, Callable, Any - -# Import the pure Tensor class from Module 02 -try: - from tinytorch.core.tensor import Tensor as BaseTensor -except ImportError: - # For development, import from local modules - import os - sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', '02_tensor')) - from tensor_dev import Tensor as BaseTensor - -# %% -print("🔥 TinyTorch Autograd Module") -print(f"NumPy version: {np.__version__}") -print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}") -print("Ready to enhance Tensor with gradients!") - -# %% [markdown] -""" -## Step 1: Teaching Our Tensor to Remember Gradients - -Our Tensor class from Module 02 is perfect for storing data and doing math. But for training neural networks, we need it to remember how gradients flow backward through computations. 
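
Before we add any machinery to the class, it helps to see numerically what "a gradient flowing backward" means. The sketch below is a standalone NumPy check with toy values (it does not use the Tensor class): the chain rule applied by hand agrees with a finite-difference estimate.

```python
import numpy as np

# Toy expression: z = (x + y) * y, at hypothetical values
x, y = 2.0, 3.0
z = (x + y) * y  # forward pass: 15.0

# Chain rule by hand:
#   dz/dx = y           = 3.0
#   dz/dy = (x + y) + y = 8.0
dz_dx = y
dz_dy = (x + y) + y

# Finite-difference check: nudge each input and watch how z moves
eps = 1e-6
num_dx = (((x + eps) + y) * y - z) / eps
num_dy = ((x + (y + eps)) * (y + eps) - z) / eps

assert np.isclose(num_dx, dz_dx, atol=1e-3)
assert np.isclose(num_dy, dz_dy, atol=1e-3)
print(dz_dx, dz_dy)  # 3.0 8.0
```

This hand computation is exactly what the enhanced Tensor will automate: each operation records its local derivative, and the chain rule stitches them together on the way backward.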
- -Think of it like teaching someone to remember the steps of a recipe so they can explain it later to others. - -### Gradient Memory Structure - -``` - Tensor Object - ┌──────────────────────────────────┐ - │ data: [1.0, 2.0, 3.0] │ ← Original tensor data - │ requires_grad: True │ ← Should track gradients? - │ grad: None → [∇₁, ∇₂, ∇₃] │ ← Accumulated gradients - │ grad_fn: None → │ ← How to propagate backward - └──────────────────────────────────┘ - │ - ▼ - Computation Graph Node - ┌─────────────────────────┐ - │ grad_fn stores: │ - │ • Parent tensors │ - │ • Backward function │ - │ • Local derivatives │ - └─────────────────────────┘ -``` - -### What We're Adding - -We need three pieces of memory for our Tensor: - -1. **Should I remember?** (`requires_grad`) - Like asking "should I pay attention to gradients?" -2. **What did I learn?** (`grad`) - The accumulated gradient information -3. **How do I teach others?** (`grad_fn`) - Function to pass gradients backward - -These three attributes will transform our mathematical Tensor into a learning-capable Tensor. - -### Why Start Here? - -Before we can compute any gradients, we need places to store them. This is the foundation - like preparing notebooks before a lecture. -""" - -# %% nbgrader={"grade": false, "grade_id": "tensor-gradient-attributes", "solution": true} -#| export -class Tensor(BaseTensor): - """ - Enhanced Tensor with gradient tracking capabilities. - - Inherits all functionality from BaseTensor and adds gradient memory. - """ - - def __init__(self, data, dtype=None, requires_grad=False): - """ - Initialize Tensor with gradient tracking support. - - TODO: Add gradient tracking attributes to existing Tensor - - APPROACH: - 1. Call parent __init__ to preserve all existing functionality - 2. Add requires_grad boolean for gradient tracking control - 3. Add grad attribute to store accumulated gradients (starts as None) - 4. 
Add grad_fn attribute to store backward function (starts as None) - - EXAMPLE: - >>> t = Tensor([1, 2, 3], requires_grad=True) - >>> print(t.requires_grad) # True - ready to track gradients - >>> print(t.grad) # None - no gradients accumulated yet - >>> print(t.grad_fn) # None - no backward function yet - - HINT: This is just storage - we're not computing anything yet - """ - ### BEGIN SOLUTION - # Call parent constructor to preserve all existing functionality - super().__init__(data, dtype) - - # Add gradient tracking attributes - self.requires_grad = requires_grad - self.grad = None # Will store accumulated gradients - self.grad_fn = None # Will store backward propagation function - ### END SOLUTION - -# %% [markdown] -""" -### 🧪 Test Step 1: Verify Gradient Memory -This test confirms our Tensor can remember gradient information -""" - -# %% -def test_step1_gradient_attributes(): - """Test that Tensor has gradient memory capabilities.""" - print("🔬 Step 1 Test: Gradient Memory...") - - # Test tensor with gradient tracking enabled - x = Tensor([1.0, 2.0, 3.0], requires_grad=True) - - # Verify all gradient attributes exist and have correct initial values - assert hasattr(x, 'requires_grad'), "Tensor should have requires_grad attribute" - assert x.requires_grad == True, "requires_grad should be True when requested" - assert x.grad is None, "grad should start as None" - assert x.grad_fn is None, "grad_fn should start as None" - - # Test tensor without gradient tracking - y = Tensor([4.0, 5.0, 6.0], requires_grad=False) - assert y.requires_grad == False, "requires_grad should be False by default" - - # Verify existing functionality still works - z = x + y # Should work exactly like before - assert hasattr(z, 'data'), "Enhanced tensor should still have data" - - print("✅ Success! 
Your Tensor now has gradient memory!") - print(f" • Gradient tracking: {x.requires_grad}") - print(f" • Initial gradients: {x.grad}") - print(f" • Backward function: {x.grad_fn}") - -test_step1_gradient_attributes() - -# %% [markdown] -""" -## Step 2: Teaching Our Tensor to Learn (Backward Method) - -Now that our Tensor has memory for gradients, we need to teach it how to accumulate gradients when they flow backward from later computations. - -Think of this like teaching someone to collect feedback from others and combine it with what they already know. - -### Gradient Flow Visualization - -``` - Forward Pass (Building Graph): Backward Pass (Computing Gradients): - - x ──────┐ x.grad ←──── gradient - │ │ - ├─► [Operation] ──► result │ - │ │ │ - y ──────┘ │ │ - ▼ │ - result.backward() ───┘ - │ - ▼ - y.grad ←──── gradient -``` - -### The Backward Method - -The `backward()` method will: -1. **Check if learning is enabled** (requires_grad must be True) -2. **Accumulate gradients** (add new gradients to existing ones) -3. **Propagate backwards** (tell earlier computations about the gradients) - -``` - Gradient Accumulation Pattern: - - First call: tensor.grad = None - tensor.backward([1.0]) - tensor.grad = [1.0] ← Store first gradient - - Second call: tensor.backward([0.5]) - tensor.grad = [1.5] ← Accumulate: [1.0] + [0.5] - - Third call: tensor.backward([2.0]) - tensor.grad = [3.5] ← Accumulate: [1.5] + [2.0] -``` - -This is the heart of learning - how information flows backward to update our understanding. - -### Why Accumulation Matters - -Neural networks often compute multiple losses that all depend on the same parameters. We need to collect ALL the gradients, not just the last one. -""" - -# %% nbgrader={"grade": false, "grade_id": "tensor-backward-method", "solution": true} -def backward(self, gradient=None): - """ - Accumulate gradients and propagate them backward through computation. 
- - TODO: Implement gradient accumulation and backward propagation - - APPROACH: - 1. Check if this tensor requires gradients (error if not) - 2. Set default gradient for scalar outputs (ones_like for scalars) - 3. Accumulate gradient: first time = store, subsequent = add - 4. Propagate backward through grad_fn if it exists - - EXAMPLE: - >>> x = Tensor([2.0], requires_grad=True) - >>> x.grad = None # No gradients yet - >>> x.backward([1.0]) # First gradient - >>> print(x.grad) # [1.0] - >>> x.backward([0.5]) # Accumulate second gradient - >>> print(x.grad) # [1.5] - accumulated! - - HINTS: - - Default gradient for scalars should be ones_like(self.data) - - Use += for accumulation, but handle None case first - - Only call grad_fn if it exists (not None) - """ - ### BEGIN SOLUTION - # Check if this tensor should accumulate gradients - if not self.requires_grad: - raise RuntimeError("Tensor doesn't require gradients - set requires_grad=True") - - # Set default gradient for scalar outputs - if gradient is None: - if self.data.size == 1: # Scalar output - gradient = np.ones_like(self.data) - else: - raise RuntimeError("gradient must be specified for non-scalar tensors") - - # Accumulate gradients: first time or add to existing - if self.grad is None: - self.grad = np.array(gradient) # First gradient - else: - self.grad = self.grad + gradient # Accumulate - - # Propagate gradients backward through computation graph - if self.grad_fn is not None: - self.grad_fn(gradient) - ### END SOLUTION - -# Add the backward method to our Tensor class -Tensor.backward = backward - -# %% [markdown] -""" -### 🧪 Test Step 2: Verify Learning Ability -This test confirms our Tensor can accumulate gradients properly -""" - -# %% -def test_step2_backward_method(): - """Test that Tensor can accumulate gradients.""" - print("🔬 Step 2 Test: Learning Ability...") - - # Test basic gradient accumulation - x = Tensor([2.0], requires_grad=True) - - # First gradient - x.backward(np.array([1.0])) - 
assert np.allclose(x.grad, [1.0]), f"First gradient failed: expected [1.0], got {x.grad}" - - # Second gradient should accumulate - x.backward(np.array([0.5])) - assert np.allclose(x.grad, [1.5]), f"Accumulation failed: expected [1.5], got {x.grad}" - - # Test default gradient for scalars - y = Tensor([3.0], requires_grad=True) - y.backward() # No gradient specified - should use default - assert np.allclose(y.grad, [1.0]), f"Default gradient failed: expected [1.0], got {y.grad}" - - # Test error for non-gradient tensor - z = Tensor([4.0], requires_grad=False) - try: - z.backward([1.0]) - assert False, "Should have raised error for non-gradient tensor" - except RuntimeError: - pass # Expected error - - print("✅ Success! Your Tensor can now learn from gradients!") - print(f" • Accumulation works: {x.grad}") - print(f" • Default gradients work: {y.grad}") - -test_step2_backward_method() - -# %% [markdown] -""" -## Step 3: Smart Addition (x + y Learns!) - -Now we'll make addition smart - when two tensors are added, the result should remember how to flow gradients back to both inputs. - -Think of this like a conversation between three people: when C = A + B, and someone gives feedback to C, C knows to pass that same feedback to both A and B. - -### Addition Gradient Flow - -``` - Forward Pass: Backward Pass: - - x(2.0) ────┐ x.grad ←── 1.0 - ├─► [+] ──► z(5.0) ↑ - y(3.0) ────┘ │ │ - ▼ │ - z.backward(1.0) ───┘ - │ - ▼ - y.grad ←── 1.0 - - Addition Rule: ∂z/∂x = 1, ∂z/∂y = 1 - Both inputs receive the same gradient! -``` - -### Mathematical Foundation - -For addition z = x + y: -- ∂z/∂x = 1 (changing x by 1 changes z by 1) -- ∂z/∂y = 1 (changing y by 1 changes z by 1) - -So gradients flow unchanged to both inputs: grad_x = grad_z, grad_y = grad_z - -### Computation Graph Building - -``` - Enhanced Addition Process: - - 1. Compute: z.data = x.data + y.data (math as before) - - 2. 
If gradients needed: - z.requires_grad = True - z.grad_fn = lambda grad: { - x.backward(grad) ← Send same gradient to x - y.backward(grad) ← Send same gradient to y - } - - 3. Result: z remembers how to teach x and y! -``` - -### Why Enhancement, Not Replacement - -We're enhancing the existing `__add__` method, not replacing it. The math stays the same - we just add gradient tracking on top. -""" - -# %% nbgrader={"grade": false, "grade_id": "enhanced-addition", "solution": true} -# Store the original addition method so we can enhance it -_original_add = Tensor.__add__ - -def enhanced_add(self, other): - """ - Enhanced addition with automatic gradient tracking. - - TODO: Add gradient tracking to existing addition operation - - APPROACH: - 1. Do the original math (call _original_add) - 2. If either input tracks gradients, result should too - 3. Create grad_fn that sends gradients back to both inputs - 4. Remember: for addition, both inputs get the same gradient - - EXAMPLE: - >>> x = Tensor([2.0], requires_grad=True) - >>> y = Tensor([3.0], requires_grad=True) - >>> z = x + y # Enhanced addition - >>> z.backward() - >>> print(x.grad) # [1.0] - same as gradient flowing to z - >>> print(y.grad) # [1.0] - same as gradient flowing to z - - HINTS: - - Use _original_add for the math computation - - Check if other has requires_grad attribute (might be scalar) - - Addition rule: ∂(a+b)/∂a = 1, ∂(a+b)/∂b = 1 - """ - ### BEGIN SOLUTION - # Do the original math - this preserves all existing functionality - original_result = _original_add(self, other) - - # Create a new enhanced Tensor with the result data to ensure it has gradient capabilities - result = Tensor(original_result.data, requires_grad=False) - - # Check if either input requires gradients - other_requires_grad = hasattr(other, 'requires_grad') and other.requires_grad - needs_grad = self.requires_grad or other_requires_grad - - if needs_grad: - # Result should track gradients - result.requires_grad = True - - # 
Create backward function for gradient propagation - def grad_fn(gradient): - """Send gradients back to both inputs (addition rule).""" - # For addition: ∂(a+b)/∂a = 1, so gradient flows unchanged - if self.requires_grad: - self.backward(gradient) - if other_requires_grad: - other.backward(gradient) - - # Attach the backward function to the result - result.grad_fn = grad_fn - - return result - ### END SOLUTION - -# Replace the addition method with our enhanced version -Tensor.__add__ = enhanced_add - -# %% [markdown] -""" -### 🧪 Test Step 3: Verify Smart Addition -This test confirms addition automatically tracks gradients -""" - -# %% -def test_step3_smart_addition(): - """Test that addition tracks gradients automatically.""" - print("🔬 Step 3 Test: Smart Addition...") - - # Test basic addition with gradients - x = Tensor([2.0], requires_grad=True) - y = Tensor([3.0], requires_grad=True) - z = x + y - - # Verify forward pass - assert np.allclose(z.data, [5.0]), f"Addition math failed: expected [5.0], got {z.data}" - - # Verify gradient tracking is enabled - assert z.requires_grad == True, "Result should require gradients when inputs do" - assert z.grad_fn is not None, "Result should have backward function" - - # Test backward pass - z.backward() - assert np.allclose(x.grad, [1.0]), f"x gradient failed: expected [1.0], got {x.grad}" - assert np.allclose(y.grad, [1.0]), f"y gradient failed: expected [1.0], got {y.grad}" - - # Test addition with scalar (no gradients) - a = Tensor([1.0], requires_grad=True) - b = a + 5.0 # Adding scalar - b.backward() - assert np.allclose(a.grad, [1.0]), "Gradient should flow through scalar addition" - - # Test backward compatibility - no gradients - p = Tensor([1.0]) # No requires_grad - q = Tensor([2.0]) # No requires_grad - r = p + q - assert not hasattr(r, 'requires_grad') or not r.requires_grad, "Should not track gradients by default" - - print("✅ Success! 
Addition is now gradient-aware!") - print(f" • Forward: {x.data} + {y.data} = {z.data}") - print(f" • Backward: x.grad = {x.grad}, y.grad = {y.grad}") - -test_step3_smart_addition() - -# %% [markdown] -""" -## Step 4: Smart Multiplication (x * y Learns!) - -Now we'll enhance multiplication with gradient tracking. This is more interesting than addition because of the product rule. - -Think of multiplication like mixing ingredients: when you change one ingredient, the effect depends on how much of the other ingredient you have. - -### Multiplication Gradient Flow - -``` - Forward Pass: Backward Pass: - - x(2.0) ────┐ x.grad ←── grad × y.data = 1.0 × 3.0 = 3.0 - ├─► [×] ──► z(6.0) ↑ - y(3.0) ────┘ │ │ - ▼ │ - z.backward(1.0) ─────┘ - │ - ▼ - y.grad ←── grad × x.data = 1.0 × 2.0 = 2.0 - - Product Rule: ∂z/∂x = y, ∂z/∂y = x - Each input's gradient depends on the OTHER input's value! -``` - -### Mathematical Foundation - The Product Rule - -For multiplication z = x * y: -- ∂z/∂x = y (a change in x is scaled by y's current value) -- ∂z/∂y = x (a change in y is scaled by x's current value) - -``` - Why Product Rule Matters: - - If x = 2.0, y = 3.0, then z = 6.0 - - Small change in x: x + 0.1 = 2.1 - New result: 2.1 × 3.0 = 6.3 - Change in z: 6.3 - 6.0 = 0.3 = 0.1 × 3.0 ← Scaled by y! - - Small change in y: y + 0.1 = 3.1 - New result: 2.0 × 3.1 = 6.2 - Change in z: 6.2 - 6.0 = 0.2 = 0.1 × 2.0 ← Scaled by x! -``` - -This means we need to remember the input values to compute gradients correctly. - -### Why This Matters - -Multiplication is everywhere in neural networks: -- Linear layers: output = input * weights -- Attention mechanisms: attention_scores * values -- Element-wise operations in activations - -Getting multiplication gradients right is crucial for training.
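
Before wiring the product rule into a grad_fn, it is worth sanity-checking it with plain NumPy. The sketch below uses toy values and is independent of the Tensor class; the same finite-difference trick is a handy debugging tool for any new operation's gradient.

```python
import numpy as np

x = np.array([2.0])
y = np.array([3.0])
z = x * y  # forward pass: [6.0]

# Product rule: each input's gradient is the OTHER input's value
grad_x = 1.0 * y  # dz/dx = y, scaled by the upstream gradient (1.0)
grad_y = 1.0 * x  # dz/dy = x, scaled by the upstream gradient (1.0)

# Finite-difference confirmation: nudge one input at a time
eps = 1e-6
assert np.allclose(((x + eps) * y - z) / eps, grad_x, atol=1e-3)
assert np.allclose((x * (y + eps) - z) / eps, grad_y, atol=1e-3)
print(grad_x, grad_y)  # [3.] [2.]
```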
-""" - -# %% nbgrader={"grade": false, "grade_id": "enhanced-multiplication", "solution": true} -# Store the original multiplication method -_original_mul = Tensor.__mul__ - -def enhanced_mul(self, other): - """ - Enhanced multiplication with automatic gradient tracking. - - TODO: Add gradient tracking to multiplication using product rule - - APPROACH: - 1. Do the original math (call _original_mul) - 2. If either input tracks gradients, result should too - 3. Create grad_fn using product rule: ∂(a*b)/∂a = b, ∂(a*b)/∂b = a - 4. Handle both Tensor and scalar multiplication - - EXAMPLE: - >>> x = Tensor([2.0], requires_grad=True) - >>> y = Tensor([3.0], requires_grad=True) - >>> z = x * y # z = [6.0] - >>> z.backward() - >>> print(x.grad) # [3.0] - gradient is y's value - >>> print(y.grad) # [2.0] - gradient is x's value - - HINTS: - - Product rule: ∂(a*b)/∂a = b, ∂(a*b)/∂b = a - - Remember to handle scalars (use .data if available, else use directly) - - Gradients are: grad_x = gradient * other, grad_y = gradient * self - """ - ### BEGIN SOLUTION - # Do the original math - preserves existing functionality - original_result = _original_mul(self, other) - - # Create a new enhanced Tensor with the result data to ensure it has gradient capabilities - result = Tensor(original_result.data, requires_grad=False) - - # Check if either input requires gradients - other_requires_grad = hasattr(other, 'requires_grad') and other.requires_grad - needs_grad = self.requires_grad or other_requires_grad - - if needs_grad: - # Result should track gradients - result.requires_grad = True - - # Create backward function using product rule - def grad_fn(gradient): - """Apply product rule for multiplication gradients.""" - if self.requires_grad: - # ∂(a*b)/∂a = b, so gradient flows as: gradient * b - if hasattr(other, 'data'): - self_grad = gradient * other.data - else: - self_grad = gradient * other # other is scalar - self.backward(self_grad) - - if other_requires_grad: - # ∂(a*b)/∂b = a, 
so gradient flows as: gradient * a - other_grad = gradient * self.data - other.backward(other_grad) - - # Attach the backward function to the result - result.grad_fn = grad_fn - - return result - ### END SOLUTION - -# Replace multiplication method with enhanced version -Tensor.__mul__ = enhanced_mul - -# %% [markdown] -""" -### 🧪 Test Step 4: Verify Smart Multiplication -This test confirms multiplication uses the product rule correctly -""" - -# %% -def test_step4_smart_multiplication(): - """Test that multiplication tracks gradients with product rule.""" - print("🔬 Step 4 Test: Smart Multiplication...") - - # Test basic multiplication with gradients - x = Tensor([2.0], requires_grad=True) - y = Tensor([3.0], requires_grad=True) - z = x * y - - # Verify forward pass - assert np.allclose(z.data, [6.0]), f"Multiplication math failed: expected [6.0], got {z.data}" - - # Test backward pass with product rule - z.backward() - assert np.allclose(x.grad, [3.0]), f"x gradient failed: expected [3.0] (y's value), got {x.grad}" - assert np.allclose(y.grad, [2.0]), f"y gradient failed: expected [2.0] (x's value), got {y.grad}" - - # Test multiplication by scalar - a = Tensor([4.0], requires_grad=True) - b = a * 2.0 # Multiply by scalar - b.backward() - assert np.allclose(a.grad, [2.0]), f"Scalar multiplication failed: expected [2.0], got {a.grad}" - - # Test more complex values - p = Tensor([1.5], requires_grad=True) - q = Tensor([2.5], requires_grad=True) - r = p * q # Should be 3.75 - - assert np.allclose(r.data, [3.75]), f"Complex multiplication failed: expected [3.75], got {r.data}" - r.backward() - assert np.allclose(p.grad, [2.5]), f"Complex p gradient failed: expected [2.5], got {p.grad}" - assert np.allclose(q.grad, [1.5]), f"Complex q gradient failed: expected [1.5], got {q.grad}" - - print("✅ Success! 
Multiplication follows the product rule!") - print(f" • Forward: {x.data} * {y.data} = {z.data}") - print(f" • Product rule: x.grad = {x.grad}, y.grad = {y.grad}") - -test_step4_smart_multiplication() - -# %% [markdown] -""" -## Step 5: Chain Rule Magic (Complex Expressions Work!) - -Now comes the magic moment - combining our smart operations to see the chain rule work automatically through complex expressions. - -When you build expressions like `z = (x + y) * (x - y)`, each operation tracks gradients locally, and they automatically chain together. This is what makes deep learning possible! - -Think of it like a telephone game where each person (operation) passes the message (gradient) backward, and everyone modifies it according to their local rule. - -### Complex Computation Graph - -``` - Forward Pass: f(x,y) = (x + y) * (x - y) - - x(3.0) ────┬─► [+] ──► t₁(5.0) ──┐ - │ ├─► [×] ──► result(5.0) - y(2.0) ────┼─► [+] ──────────────┘ ↑ - │ │ - └─► [-] ──► t₂(1.0) ──────┘ - - Backward Pass: Chain rule flows gradients backward - - result.backward(1.0) - │ - ▼ - [×] applies product rule: - t₁.backward(1.0 × t₂.data) = t₁.backward(1.0) - t₂.backward(1.0 × t₁.data) = t₂.backward(5.0) - │ │ - ▼ ▼ - [+] sends to both: [-] sends with signs: - x.backward(1.0) x.backward(5.0) - y.backward(1.0) y.backward(-5.0) - │ │ - ▼ ▼ - Final gradients (accumulated): - x.grad = 1.0 + 5.0 = 6.0 ← Matches ∂(x²-y²)/∂x = 2x = 6.0 - y.grad = 1.0 + (-5.0) = -4.0 ← Matches ∂(x²-y²)/∂y = -2y = -4.0 -``` - -### The Chain Rule in Action - -For f(x,y) = (x + y) * (x - y) = x² - y²: -1. Addition: passes gradients unchanged -2. Subtraction: passes gradients (first unchanged, second negated) -3. Multiplication: applies product rule -4. 
Chain rule: combines all effects automatically - -Expected final gradients: -- ∂f/∂x = 2x (derivative of x² - y²) -- ∂f/∂y = -2y (derivative of x² - y²) - -### Gradient Accumulation in Action - -``` - Notice how x appears in BOTH addition and subtraction: - - x ──┬─► [+] ──► contributes to t₁ - │ - └─► [-] ──► contributes to t₂ - - During backward pass: - • Addition path contributes: x.grad += 1.0 - • Subtraction path contributes: x.grad += 5.0 - • Total: x.grad = 6.0 ← Automatic accumulation! - - This is why we need gradient accumulation - same parameter - can contribute to loss through multiple paths! -``` - -### Why This Is Revolutionary - -You don't need to derive gradients manually anymore! The system automatically: -- Tracks every operation -- Applies local gradient rules -- Chains them together correctly -""" - -# %% nbgrader={"grade": false, "grade_id": "enhanced-subtraction", "solution": true} -# We need subtraction to complete our operations set -_original_sub = getattr(Tensor, '__sub__', None) - -def enhanced_sub(self, other): - """ - Enhanced subtraction with automatic gradient tracking. - - TODO: Add gradient tracking to subtraction - - APPROACH: - 1. Compute subtraction (may need to implement if not in base class) - 2. For gradients: ∂(a-b)/∂a = 1, ∂(a-b)/∂b = -1 - 3. 
First input gets gradient unchanged, second gets negative gradient - - HINTS: - - Subtraction rule: ∂(a-b)/∂a = 1, ∂(a-b)/∂b = -1 - - Handle case where base class might not have subtraction - - Use np.subtract or manual computation if needed - """ - ### BEGIN SOLUTION - # Compute subtraction (implement if not available) - if _original_sub is not None: - original_result = _original_sub(self, other) - result = Tensor(original_result.data, requires_grad=False) - else: - # Implement subtraction manually - if hasattr(other, 'data'): - result_data = self.data - other.data - else: - result_data = self.data - other - result = Tensor(result_data, requires_grad=False) - - # Check if either input requires gradients - other_requires_grad = hasattr(other, 'requires_grad') and other.requires_grad - needs_grad = self.requires_grad or other_requires_grad - - if needs_grad: - result.requires_grad = True - - def grad_fn(gradient): - """Apply subtraction gradient rule.""" - if self.requires_grad: - # ∂(a-b)/∂a = 1, gradient flows unchanged - self.backward(gradient) - if other_requires_grad: - # ∂(a-b)/∂b = -1, gradient is negated - other.backward(-gradient) - - result.grad_fn = grad_fn - - return result - ### END SOLUTION - -# Add subtraction method to Tensor -Tensor.__sub__ = enhanced_sub - -# %% [markdown] -""" -### 🧪 Test Step 5: Verify Chain Rule Magic -This test confirms complex expressions compute gradients automatically - -**What we're testing**: The computation graph from our diagram above -**Expected behavior**: Gradients flow backward through multiple paths and accumulate correctly -**Success criteria**: Final gradients match analytical derivatives of f(x,y) = x² - y² -""" - -# %% -def test_step5_chain_rule_magic(): - """Test that complex expressions automatically chain gradients.""" - print("🔬 Step 5 Test: Chain Rule Magic...") - - # Test complex expression: (x + y) * (x - y) = x² - y² - x = Tensor([3.0], requires_grad=True) - y = Tensor([2.0], requires_grad=True) - - # 
Build computation graph step by step - sum_part = x + y # 3 + 2 = 5 - diff_part = x - y # 3 - 2 = 1 - result = sum_part * diff_part # 5 * 1 = 5 - - # Verify forward computation - expected_forward = 3.0**2 - 2.0**2 # x² - y² = 9 - 4 = 5 - assert np.allclose(result.data, [expected_forward]), f"Forward failed: expected [{expected_forward}], got {result.data}" - - # Test the magic - backward propagation - result.backward() - - # Expected gradients for f(x,y) = x² - y² - expected_x_grad = 2 * 3.0 # ∂(x²-y²)/∂x = 2x = 6 - expected_y_grad = -2 * 2.0 # ∂(x²-y²)/∂y = -2y = -4 - - assert np.allclose(x.grad, [expected_x_grad]), f"x gradient failed: expected [{expected_x_grad}], got {x.grad}" - assert np.allclose(y.grad, [expected_y_grad]), f"y gradient failed: expected [{expected_y_grad}], got {y.grad}" - - # Test another complex expression: 2*x*y + x - a = Tensor([2.0], requires_grad=True) - b = Tensor([3.0], requires_grad=True) - - expr = (a * b) * 2.0 + a # 2*a*b + a = 2*2*3 + 2 = 14 - - assert np.allclose(expr.data, [14.0]), f"Complex expression failed: expected [14.0], got {expr.data}" - - expr.backward() - # ∂(2ab + a)/∂a = 2b + 1 = 2*3 + 1 = 7 - # ∂(2ab + a)/∂b = 2a = 2*2 = 4 - assert np.allclose(a.grad, [7.0]), f"Complex a gradient failed: expected [7.0], got {a.grad}" - assert np.allclose(b.grad, [4.0]), f"Complex b gradient failed: expected [4.0], got {b.grad}" - - print("✅ Success! Chain rule works automatically!") - print(f" • Expression: (x + y) * (x - y) = x² - y²") - print(f" • Forward: {result.data}") - print(f" • Gradients: ∂f/∂x = {x.grad}, ∂f/∂y = {y.grad}") - print("🎉 Your tensors can now learn through any expression!") - -test_step5_chain_rule_magic() - -# %% [markdown] -""" -## Step 6: Integration Testing (Complete Victory!) - -Time to celebrate! Let's test our complete autograd system with realistic neural network scenarios to make sure everything works together perfectly. 
- -We'll test scenarios that mirror what happens in real neural networks: -- Linear transformations (matrix operations) -- Activation functions -- Loss computations -- Complex multi-step computations - -This validates that your autograd system is ready to train real neural networks! - -### What Makes This Special - -Your autograd implementation now provides the foundation for all neural network training: -- **Forward Pass**: Tensors compute values and build computation graphs -- **Backward Pass**: Gradients flow automatically through any expression -- **Parameter Updates**: Optimizers will use these gradients to update weights - -You've built the core engine that powers modern deep learning! -""" - -# %% [markdown] -""" -### 🧪 Final Integration Test: Complete Autograd Validation -This comprehensive test validates your entire autograd system -""" - -# %% -def test_step6_integration_complete(): - """Complete integration test of autograd system.""" - print("🧪 STEP 6: COMPLETE INTEGRATION TEST") - print("=" * 50) - - # Test 1: Neural network linear layer simulation - print("1️⃣ Testing Linear Layer Simulation...") - weights = Tensor([[0.5, -0.3], [0.2, 0.8]], requires_grad=True) - inputs = Tensor([[1.0, 2.0]], requires_grad=True) - bias = Tensor([[0.1, -0.1]], requires_grad=True) - - # Simulate: output = input @ weights + bias - linear_output = inputs * weights + bias # Element-wise for simplicity - loss = linear_output * linear_output # Squared for loss - - # Sum all elements for scalar loss (simplified) - final_loss = loss # In real networks, we'd sum across batch - # For testing, we'll provide gradients for the non-scalar tensor - final_loss.backward(np.ones_like(final_loss.data)) - - # Verify all parameters have gradients - assert weights.grad is not None, "Weights should have gradients" - assert inputs.grad is not None, "Inputs should have gradients" - assert bias.grad is not None, "Bias should have gradients" - print(" ✅ Linear layer gradients computed 
successfully") - - # Test 2: Multi-step computation - print("2️⃣ Testing Multi-Step Computation...") - x = Tensor([1.0], requires_grad=True) - y = Tensor([2.0], requires_grad=True) - z = Tensor([3.0], requires_grad=True) - - # Complex expression: ((x * y) + z) * (x - y) - step1 = x * y # 1 * 2 = 2 - step2 = step1 + z # 2 + 3 = 5 - step3 = x - y # 1 - 2 = -1 - result = step2 * step3 # 5 * (-1) = -5 - - assert np.allclose(result.data, [-5.0]), f"Multi-step forward failed: expected [-5.0], got {result.data}" - - result.backward() - - # All variables should have gradients - assert x.grad is not None, "x should have gradients from multi-step" - assert y.grad is not None, "y should have gradients from multi-step" - assert z.grad is not None, "z should have gradients from multi-step" - print(" ✅ Multi-step computation gradients work") - - # Test 3: Gradient accumulation across multiple losses - print("3️⃣ Testing Gradient Accumulation...") - param = Tensor([1.0], requires_grad=True) - - # First loss: param * 2 - loss1 = param * 2.0 - loss1.backward() - first_grad = param.grad.copy() - - # Second loss: param * 3 (should accumulate) - loss2 = param * 3.0 - loss2.backward() - - expected_total = first_grad + 3.0 - assert np.allclose(param.grad, expected_total), f"Accumulation failed: expected {expected_total}, got {param.grad}" - print(" ✅ Gradient accumulation works correctly") - - # Test 4: Backward compatibility - print("4️⃣ Testing Backward Compatibility...") - # Operations without gradients should work exactly as before - a = Tensor([1, 2, 3]) # No requires_grad - b = Tensor([4, 5, 6]) # No requires_grad - c = a + b - d = a * b - e = a - b - - # Should work without any gradient tracking - assert not (hasattr(c, 'requires_grad') and c.requires_grad), "Non-grad tensors shouldn't track gradients" - print(" ✅ Backward compatibility maintained") - - # Test 5: Error handling - print("5️⃣ Testing Error Handling...") - non_grad_tensor = Tensor([1.0], requires_grad=False) - try: 
- non_grad_tensor.backward() - assert False, "Should have raised error for non-gradient tensor" - except RuntimeError: - print(" ✅ Proper error handling for non-gradient tensors") - - print("\n" + "=" * 50) - print("🎉 COMPLETE SUCCESS! ALL INTEGRATION TESTS PASSED!") - print("\n🚀 Your Autograd System Achievements:") - print(" • ✅ Gradient tracking for all operations") - print(" • ✅ Automatic chain rule through complex expressions") - print(" • ✅ Gradient accumulation for multiple losses") - print(" • ✅ Backward compatibility with existing code") - print(" • ✅ Proper error handling and validation") - print(" • ✅ Ready for neural network training!") - - print("\n🔗 Ready for Next Module:") - print(" Module 06 (Optimizers) will use these gradients") - print(" to update neural network parameters automatically!") - -test_step6_integration_complete() - -# %% [markdown] -""" -## 🔍 Systems Analysis: Autograd Memory and Performance - -Now that your autograd system is complete, let's analyze its behavior to understand memory usage patterns and performance characteristics that matter in real ML systems. 
- -### Memory Layout Analysis - -``` - Tensor Without Gradients: Tensor With Gradients: - ┌─────────────────┐ ┌─────────────────────────────────┐ - │ data: [1,2,3] │ │ data: [1,2,3] 8 bytes │ - │ shape: (3,) │ │ shape: (3,) 8 bytes │ - │ dtype: float64 │ │ dtype: float64 8 bytes │ - └─────────────────┘ │ requires_grad: True 1 byte │ - ~24 bytes │ grad: [∇₁,∇₂,∇₃] 8 bytes │ - │ grad_fn: 8 bytes │ - └─────────────────────────────────┘ - ~41 bytes - - Memory Overhead: ~2x per tensor + computation graph storage -``` - -### Computation Graph Memory Growth - -``` - Expression Depth vs Memory Usage: - - Simple: z = x + y - Memory: 3 tensors (x, y, z) - - Medium: z = (x + y) * (x - y) - Memory: 5 tensors (x, y, x+y, x-y, result) - - Deep: z = ((x + y) * w₁ + b₁) * w₂ + b₂ - Memory: 7 tensors + intermediate results - - Pattern: Memory = O(expression_depth) - - Production Issue: 50-layer network = 50+ intermediate tensors - until backward() is called and graph is freed! -``` - -**Analysis Focus**: Memory overhead, computational complexity, and scaling behavior of gradient computation -""" - -# %% -def analyze_autograd_behavior(): - """ - 📊 SYSTEMS MEASUREMENT: Autograd Performance Analysis - - Analyze memory usage and computational overhead of gradient tracking. 
- """ - print("📊 AUTOGRAD SYSTEMS ANALYSIS") - print("=" * 40) - - import time - - # Test 1: Memory overhead analysis - print("💾 Memory Overhead Analysis:") - - # Create tensors with and without gradient tracking - size = 1000 - data = np.random.randn(size) - - # Non-gradient tensor - no_grad_tensor = Tensor(data.copy(), requires_grad=False) - - # Gradient tensor - grad_tensor = Tensor(data.copy(), requires_grad=True) - - print(f" Tensor size: {size} elements") - print(f" Base tensor: data only") - print(f" Gradient tensor: data + grad storage + grad_fn") - print(f" Memory overhead: ~3x (data + grad + computation graph)") - - # Test 2: Computational overhead - print("\n⚡ Computational Overhead Analysis:") - - x_no_grad = Tensor([2.0] * 100, requires_grad=False) - y_no_grad = Tensor([3.0] * 100, requires_grad=False) - - x_grad = Tensor([2.0] * 100, requires_grad=True) - y_grad = Tensor([3.0] * 100, requires_grad=True) - - # Time operations without gradients - start = time.perf_counter() - for _ in range(1000): - z = x_no_grad + y_no_grad - z = z * x_no_grad - no_grad_time = time.perf_counter() - start - - # Time operations with gradients (forward only) - start = time.perf_counter() - for _ in range(1000): - z = x_grad + y_grad - z = z * x_grad - grad_forward_time = time.perf_counter() - start - - print(f" Operations without gradients: {no_grad_time*1000:.2f}ms") - print(f" Operations with gradients: {grad_forward_time*1000:.2f}ms") - print(f" Forward pass overhead: {grad_forward_time/no_grad_time:.1f}x") - - print("\n Performance Visualization:") - print(" ┌──────────────────────────────────────────────┐") - print(" │ Operation Timeline (forward pass) │") - print(" ├──────────────────────────────────────────────┤") - print(" │ No gradients: [████████████] │") - print(" │ With gradients: [████████████████████████] │") - print(" │ ↑ Math ↑ Graph building │") - print(" └──────────────────────────────────────────────┘") - - # Test 3: Expression complexity scaling - 
print("\n📈 Expression Complexity Scaling:") - - def time_expression(depth, with_gradients=True): - """Time increasingly complex expressions.""" - x = Tensor([2.0], requires_grad=with_gradients) - y = Tensor([3.0], requires_grad=with_gradients) - - start = time.perf_counter() - result = x - for i in range(depth): - result = result + y - result = result * x - - if with_gradients: - result.backward() - - return time.perf_counter() - start - - depths = [1, 5, 10, 20] - for depth in depths: - time_no_grad = time_expression(depth, False) - time_with_grad = time_expression(depth, True) - overhead = time_with_grad / time_no_grad - - print(f" Depth {depth:2d}: {time_no_grad*1000:.1f}ms → {time_with_grad*1000:.1f}ms ({overhead:.1f}x overhead)") - - # Test 4: Gradient accumulation patterns - print("\n🔄 Gradient Accumulation Patterns:") - - param = Tensor([1.0], requires_grad=True) - - # Single large gradient vs multiple small gradients - param.grad = None - start = time.perf_counter() - large_loss = param * 100.0 - large_loss.backward() - large_grad_time = time.perf_counter() - start - large_grad_value = param.grad.copy() - - param.grad = None - start = time.perf_counter() - for i in range(100): - small_loss = param * 1.0 - small_loss.backward() - small_grad_time = time.perf_counter() - start - - print(f" Single large gradient: {large_grad_time*1000:.3f}ms → grad={large_grad_value}") - print(f" 100 small gradients: {small_grad_time*1000:.3f}ms → grad={param.grad}") - print(f" Accumulation overhead: {small_grad_time/large_grad_time:.1f}x") - - print("\n Gradient Accumulation Pattern:") - print(" ┌──────────────────────────────────────────────────────┐") - print(" │ Multiple Loss Sources → Same Parameter: │") - print(" ├──────────────────────────────────────────────────────┤") - print(" │ │") - print(" │ Loss₁ ──→ grad₁(2.0) ──┐ │") - print(" │ ├─[+]→ param.grad = 5.0 │") - print(" │ Loss₂ ──→ grad₂(3.0) ──┘ │") - print(" │ │") - print(" │ Real Example: Same embedding used in 
encoder │") - print(" │ AND decoder gets gradients from both paths! │") - print(" └──────────────────────────────────────────────────────┘") - - print("\n💡 AUTOGRAD INSIGHTS:") - print(" ┌───────────────────────────────────────────────────────────┐") - print(" │ Autograd Performance Characteristics │") - print(" ├───────────────────────────────────────────────────────────┤") - print(" │ Memory Usage: │") - print(" │ • Base tensor: 1x (data only) │") - print(" │ • Gradient tensor: 2x (data + gradients) │") - print(" │ • Computation graph: +O(depth) intermediate tensors │") - print(" │ │") - print(" │ Computational Overhead: │") - print(" │ • Forward pass: ~2x (math + graph building) │") - print(" │ • Backward pass: ~1x additional │") - print(" │ • Total training: ~3x vs inference-only │") - print(" │ │") - print(" │ Scaling Behavior: │") - print(" │ • Expression depth: O(n) memory growth │") - print(" │ • Gradient accumulation: O(1) per accumulation │") - print(" │ • Deep networks: Memory freed after backward() │") - print(" └───────────────────────────────────────────────────────────┘") - print("") - print(" 🚀 Production Implications:") - print(" • Memory: Gradient tracking doubles memory usage (data + gradients)") - print(" • Forward pass: ~2x computational overhead for gradient graph building") - print(" • Backward pass: Additional ~1x computation time") - print(" • Expression depth: Overhead scales linearly with computation graph depth") - print(" • Gradient accumulation: Small overhead per accumulation operation") - print(" • Production impact: Why PyTorch offers torch.no_grad() for inference!") - -analyze_autograd_behavior() - -# %% [markdown] -""" -## 🧪 Module Integration Test - -Final validation that everything works together correctly. -""" - -# %% -def test_module(): - """ - Comprehensive test of entire autograd module functionality. 
- - This final test runs before module summary to ensure: - - All components work correctly - - Integration with existing tensor operations - - Ready for use in neural network training - """ - print("🧪 RUNNING MODULE INTEGRATION TEST") - print("=" * 50) - - print("Running all unit tests...") - test_step1_gradient_attributes() - test_step2_backward_method() - test_step3_smart_addition() - test_step4_smart_multiplication() - test_step5_chain_rule_magic() - test_step6_integration_complete() - - print("\n" + "=" * 50) - print("🎉 ALL TESTS PASSED! Module ready for export.") - print("Run: tito module complete 05_autograd") - -test_module() - -# %% -if __name__ == "__main__": - print("🚀 Running Autograd module...") - test_module() - print("✅ Module validation complete!") - -# %% [markdown] -""" -## 🤔 ML Systems Thinking: Interactive Questions - -### Question 1: Memory Management in Gradient Computation - -Your autograd implementation stores references to input tensors through grad_fn closures. In a deep neural network with 50 layers, each layer creates intermediate tensors with gradient functions. - -``` - Memory Growth in Deep Networks: - - Layer 1: x₁ → f₁(x₁) → h₁ ░░░░░░░░░░░░░░░░░░░░░░░░░░┐ - ↑ ↑ │ - └─ stored ──────┘ h₁.grad_fn keeps x₁ alive │ - │ - Layer 2: h₁ → f₂(h₁) → h₂ ░░░░░░░░░░░░░░░░░░░░░░░░░┐ │ - ↑ ↑ │ │ - └─ stored ──────┘ h₂.grad_fn keeps h₁ alive │ │ - │ │ - ... │ │ - │ │ - Layer 50: h₄₉ → f₅₀(h₄₉) → h₅₀ │ │ - ↑ │ │ - └─ loss.backward() ────┼─┼─┐ - │ │ │ - Peak Memory: All h₁, h₂, ..., h₄₉ kept alive │ │ │ - until backward() traverses the entire graph! ──────┘ │ │ - │ │ - After backward(): Memory freed in reverse order ─────┘ │ - (Python garbage collection) │ - │ - Memory = O(network_depth) until backward() completes ─┘ -``` - -**Analysis Task**: Examine how your gradient tracking affects memory usage patterns. - -**Specific Questions**: -- How does memory usage scale with network depth in your implementation? 
-- What happens to memory when you call `backward()` on the final loss? -- Why do production frameworks implement "gradient checkpointing"? - -**Implementation Connection**: Look at how your `grad_fn` closures capture references to input tensors and consider memory implications for deep networks. -""" - -# %% nbgrader={"grade": true, "grade_id": "memory-management", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} -""" -TODO: Analyze memory management in your gradient computation system. - -Consider how your grad_fn closures store references to input tensors and -how this affects memory usage in deep networks. -""" -### BEGIN SOLUTION -# Memory management analysis: - -# 1. Memory scaling with network depth: -# - Each operation creates a tensor with grad_fn that references input tensors -# - In 50-layer network: 50 intermediate tensors + their grad_fn closures -# - Each grad_fn keeps input tensors alive in memory -# - Memory grows O(depth) for intermediate activations - -# 2. Memory behavior during backward(): -# - Forward pass: Builds computation graph, keeps all intermediates -# - Backward pass: Traverses graph but doesn't immediately free memory -# - Python's garbage collector frees tensors after no references remain -# - Peak memory occurs at end of forward pass - -# 3. Gradient checkpointing solution: -# - Trade compute for memory: store only subset of activations -# - Recompute intermediate activations during backward pass -# - Reduces memory from O(depth) to O(sqrt(depth)) -# - Essential for training very deep networks - -# Production implementations: -# - PyTorch: torch.utils.checkpoint for gradient checkpointing -# - TensorFlow: tf.recompute_grad decorator -# - Custom: Clear computation graph after backward pass - -# Memory optimization strategies: -# 1. In-place operations where mathematically safe -# 2. Clear gradients regularly: param.grad = None -# 3. Use torch.no_grad() for inference -# 4. 
Implement custom backward functions for memory efficiency -### END SOLUTION - -# %% [markdown] -""" -### Question 2: Computational Graph Optimization - -Your autograd system builds computation graphs dynamically. Each operation creates a new tensor with its own grad_fn. - -**Analysis Task**: Identify opportunities for optimizing computational graphs to reduce overhead. - -**Specific Questions**: -- Which operations could be fused together to reduce intermediate tensor creation? -- How would operator fusion affect gradient computation correctness? -- What trade-offs exist between graph complexity and performance? - -**Implementation Connection**: Examine your operation functions and consider where computation could be optimized while maintaining gradient correctness. -""" - -# %% nbgrader={"grade": true, "grade_id": "graph-optimization", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} -""" -TODO: Design computational graph optimizations for your autograd system. - -Consider how operations could be fused or optimized while maintaining -gradient correctness. -""" -### BEGIN SOLUTION -# Computational graph optimization strategies: - -# 1. Operation fusion opportunities: -# Current: z = (x + y) * w creates 2 tensors (intermediate + result) -# Optimized: Single "fused_add_mul" operation creates 1 tensor - -def fused_add_multiply(x, y, w): - """Fused operation: (x + y) * w""" - # Direct computation without intermediate tensor - result_data = (x.data + y.data) * w.data - result = Tensor(result_data, requires_grad=True) - - def grad_fn(gradient): - if x.requires_grad: - x.backward(gradient * w.data) # Chain rule - if y.requires_grad: - y.backward(gradient * w.data) - if w.requires_grad: - w.backward(gradient * (x.data + y.data)) - - result.grad_fn = grad_fn - return result - -# 2. 
Safe fusion patterns: -# - Element-wise operations: add + mul + relu → single kernel -# - Linear operations: matmul + bias_add → single operation -# - Activation chains: sigmoid + multiply → swish activation - -# 3. Gradient correctness preservation: -# - Fusion must preserve mathematical equivalence -# - Chain rule application remains identical -# - Numerical stability must be maintained - -# 4. Trade-offs analysis: -# Memory: Fewer intermediate tensors reduces memory usage -# Compute: Fused operations can be more cache-efficient -# Complexity: Harder to debug fused operations -# Flexibility: Less modular, harder to optimize individual ops - -# 5. Production techniques: -# - TensorFlow XLA: Ahead-of-time fusion optimization -# - PyTorch JIT: Runtime graph optimization -# - ONNX: Graph optimization passes for deployment -# - Custom CUDA kernels: Maximum performance for common patterns - -# Example optimization for common pattern: -class OptimizedLinear: - def forward(x, weight, bias): - # Fused: matmul + bias_add + activation - return activation(x @ weight + bias) # Single backward pass - -# Memory-efficient alternative: -class CheckpointedOperation: - def forward(inputs): - # Store only inputs, recompute intermediate during backward - return complex_computation(inputs) -### END SOLUTION - -# %% [markdown] -""" -### Question 3: Gradient Flow Analysis - -In your autograd implementation, gradients flow backward through the computation graph via the chain rule. - -``` - Gradient Magnitude Changes Through Operations: - - Addition Preserves Magnitudes: Multiplication Scales Magnitudes: - ┌─────────────────────────────┐ ┌─────────────────────────────────┐ - │ x(0.1) ──┐ │ │ x(0.1) ──┐ │ - │ ├─[+]─→ z(10.1) │ │ ├─[×]─→ z(1.0) │ - │ y(10.0) ─┘ ↑ │ │ y(10.0) ─┘ ↑ │ - │ │ │ │ │ │ - │ grad=1.0 │ │ grad=1.0 │ - │ ↓ │ │ ↓ │ - │ x.grad ←─ 1.0 (unchanged) │ │ x.grad ←─ 10.0 (scaled by y!) │ - │ y.grad ←─ 1.0 (unchanged) │ │ y.grad ←─ 0.1 (scaled by x!) 
│ - └─────────────────────────────┘ └─────────────────────────────────┘ - - Deep Network Gradient Flow Problems: - - Vanishing Gradients: Exploding Gradients: - ┌──────────────────────────────┐ ┌──────────────────────────────┐ - │ Layer 1: grad ← 1.0 │ │ Layer 1: grad ← 1.0 │ - │ ↓ ×0.1 (small weight)│ │ ↓ ×3.0 (large weight)│ - │ Layer 2: grad ← 0.1 │ │ Layer 2: grad ← 3.0 │ - │ ↓ ×0.1 │ │ ↓ ×3.0 │ - │ Layer 3: grad ← 0.01 │ │ Layer 3: grad ← 9.0 │ - │ ↓ ×0.1 │ │ ↓ ×3.0 │ - │ Layer 4: grad ← 0.001 │ │ Layer 4: grad ← 27.0 │ - │ ↓ │ │ ↓ │ - │ Final: grad ≈ 0 (vanished!) │ │ Final: grad → ∞ (exploded!) │ - └──────────────────────────────┘ └──────────────────────────────┘ -``` - -**Analysis Task**: Analyze how gradient magnitudes change as they flow through different types of operations. - -**Specific Questions**: -- How do gradients change magnitude when flowing through multiplication vs addition? -- What causes vanishing or exploding gradients in deep networks? -- How would you detect and mitigate gradient flow problems? - -**Implementation Connection**: Consider how your product rule implementation in multiplication affects gradient magnitudes compared to your addition implementation. -""" - -# %% nbgrader={"grade": true, "grade_id": "gradient-flow", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} -""" -TODO: Analyze gradient flow patterns in your autograd implementation. - -Examine how different operations affect gradient magnitudes and identify -potential gradient flow problems. -""" -### BEGIN SOLUTION -# Gradient flow analysis: - -# 1. 
Gradient magnitude changes by operation: - -# Addition: z = x + y -# ∂z/∂x = 1, ∂z/∂y = 1 -# Gradients pass through unchanged - magnitude preserved - -# Multiplication: z = x * y -# ∂z/∂x = y, ∂z/∂y = x -# Gradients scaled by other operand - magnitude can grow/shrink dramatically - -# Example analysis: -def analyze_gradient_flow(): - x = Tensor([0.1], requires_grad=True) # Small value - y = Tensor([10.0], requires_grad=True) # Large value - - # Addition preserves gradients - z1 = x + y - z1.backward() - print(f"Addition: x.grad={x.grad}, y.grad={y.grad}") # Both [1.0] - - x.grad = None; y.grad = None - - # Multiplication scales gradients - z2 = x * y - z2.backward() - print(f"Multiplication: x.grad={x.grad}, y.grad={y.grad}") # [10.0], [0.1] - -# 2. Vanishing gradient causes: -# - Many multiplications by small values (< 1.0) -# - Deep networks: gradient = ∏(∂Li/∂Li-1) → 0 as depth increases -# - Activation functions with small derivatives (sigmoid saturation) - -# 3. Exploding gradient causes: -# - Many multiplications by large values (> 1.0) -# - Poor weight initialization -# - High learning rates - -# 4. Detection strategies: -def detect_gradient_problems(model_parameters): - """Detect vanishing/exploding gradients""" - grad_norms = [] - for param in model_parameters: - if param.grad is not None: - grad_norm = np.linalg.norm(param.grad) - grad_norms.append(grad_norm) - - max_norm = max(grad_norms) if grad_norms else 0 - min_norm = min(grad_norms) if grad_norms else 0 - - if max_norm > 10.0: - print("⚠️ Exploding gradients detected!") - if min_norm < 1e-6: - print("⚠️ Vanishing gradients detected!") - - return grad_norms - -# 5.
Mitigation strategies: -# Gradient clipping for exploding gradients: -def clip_gradients(parameters, max_norm=1.0): - total_norm = 0 - for param in parameters: - if param.grad is not None: - total_norm += np.sum(param.grad ** 2) - total_norm = np.sqrt(total_norm) - - if total_norm > max_norm: - clip_factor = max_norm / total_norm - for param in parameters: - if param.grad is not None: - param.grad = param.grad * clip_factor - -# Better weight initialization for vanishing gradients: -# - Xavier/Glorot initialization -# - He initialization for ReLU networks -# - Layer normalization to control activations - -# Architectural solutions: -# - Skip connections (ResNet) -# - LSTM gates for sequences -# - Careful activation function choice (ReLU vs sigmoid) -### END SOLUTION - -# %% [markdown] -""" -## 🎯 MODULE SUMMARY: Autograd - Incremental Automatic Differentiation - -Congratulations! You've built a complete automatic differentiation system through six manageable steps! - -### What You've Accomplished -✅ **Step-by-Step Enhancement**: Added gradient tracking to existing Tensor class without breaking any functionality -✅ **Gradient Memory**: Tensors now store gradients and backward functions (Step 1-2) -✅ **Smart Operations**: Addition, multiplication, and subtraction automatically track gradients (Steps 3-4) -✅ **Chain Rule Magic**: Complex expressions compute gradients automatically through the entire computation graph (Step 5) -✅ **Complete Integration**: Full autograd system ready for neural network training (Step 6) -✅ **Systems Understanding**: Memory overhead analysis and performance characteristics - -### Key Learning Outcomes -- **Incremental Development**: How to enhance complex systems step by step with immediate validation -- **Chain Rule Implementation**: Automatic gradient computation through mathematical expressions -- **Software Architecture**: Safe enhancement of existing classes without breaking functionality -- **Memory Management**: Understanding 
computational graph storage and gradient accumulation patterns -- **Production Insights**: How real ML frameworks implement automatic differentiation - -### Technical Foundations Mastered -- **Gradient Tracking**: `requires_grad`, `grad`, and `grad_fn` attributes for automatic differentiation -- **Backward Propagation**: Automatic chain rule application through computation graphs -- **Product Rule**: Correct gradient computation for multiplication operations -- **Gradient Accumulation**: Proper handling of multiple backward passes -- **Error Handling**: Robust validation for gradient computation requirements - -### Professional Skills Developed -- **Incremental Enhancement**: Adding complex features through small, testable steps -- **Immediate Feedback**: Validating each enhancement before proceeding to next step -- **Backward Compatibility**: Ensuring existing functionality remains intact -- **Systems Analysis**: Understanding memory and performance implications of design choices - -### Ready for Advanced Applications -Your enhanced Tensor class enables: -- **Neural Network Training**: Automatic gradient computation for parameter updates -- **Optimization Algorithms**: Foundation for SGD, Adam, and other optimizers (Module 06) -- **Complex Architectures**: Support for any differentiable computation graph -- **Research Applications**: Building and experimenting with novel ML architectures - -### Connection to Real ML Systems -Your incremental approach mirrors production development: -- **PyTorch Evolution**: Similar step-by-step enhancement from pure tensors to autograd-capable tensors -- **TensorFlow 2.0**: Eager execution with automatic differentiation follows similar patterns -- **Professional Development**: Industry standard for adding complex features safely -- **Debugging Friendly**: Step-by-step approach makes gradient computation errors easier to trace - -### Performance Characteristics Discovered -- **Memory Overhead**: ~2x memory usage (data + gradients + 
computation graph) -- **Computational Overhead**: ~2x forward pass time for gradient graph building -- **Scaling Behavior**: Linear scaling with computation graph depth -- **Optimization Opportunities**: Operation fusion and gradient checkpointing potential - -### Next Steps -1. **Export your module**: `tito module complete 05_autograd` -2. **Validate integration**: All previous tensor operations still work + new gradient features -3. **Ready for Module 06**: Optimizers will use these gradients to train neural networks! - -**🚀 Achievement Unlocked**: You've mastered incremental software enhancement - building complex systems through small, immediately rewarding steps. This is exactly how professional ML engineers develop production systems! -""" \ No newline at end of file diff --git a/modules_old/05_autograd/autograd_dev_enhanced.py b/modules_old/05_autograd/autograd_dev_enhanced.py deleted file mode 100644 index 4c57263d..00000000 --- a/modules_old/05_autograd/autograd_dev_enhanced.py +++ /dev/null @@ -1,2072 +0,0 @@ -# --- -# jupyter: -# jupytext: -# text_representation: -# extension: .py -# format_name: percent -# format_version: '1.3' -# jupytext_version: 1.17.1 -# --- - -# %% [markdown] -""" -# Autograd - Automatic Differentiation and Computational Graph Engine - -Welcome to Autograd! You'll implement the automatic differentiation engine that makes neural network training possible by automatically computing gradients through complex computational graphs. - -## 🔗 Building on Previous Learning -**What You Built Before**: -- Module 02 (Tensor): Data structures that hold neural network parameters -- Module 05 (Losses): Functions that measure prediction accuracy - -**What's Working**: You can compute loss values for any prediction! - -**The Gap**: Loss values tell you HOW WRONG you are, but not HOW TO IMPROVE the parameters. - -**This Module's Solution**: Implement automatic differentiation to compute gradients automatically. 
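The jump from "how wrong" to "how to improve" is a derivative. As a warm-up, here is a small self-contained sketch (plain Python; the function names are illustrative, not part of TinyTorch) that checks a hand-derived gradient against a finite-difference estimate — the same kind of check is useful later for validating an autograd implementation:

```python
def f(x):
    # Toy scalar loss: f(x) = x**2 + 3*x
    return x**2 + 3*x

def analytic_grad(x):
    # Hand-derived derivative: f'(x) = 2*x + 3
    return 2*x + 3

def numeric_grad(fn, x, eps=1e-5):
    # Central finite difference approximates f'(x) without any calculus
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)

x = 2.0
assert abs(analytic_grad(x) - numeric_grad(f, x)) < 1e-6
print(f"f'({x}) = {analytic_grad(x)}")  # 7.0
```

Autograd automates the `analytic_grad` side for arbitrary compositions of operations; the finite-difference side stays useful as an independent correctness check.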
- -**Connection Map**: -``` -Tensors → Loss Functions → Autograd → Optimizers -(data) (error measure) (∇L/∇θ) (parameter updates) -``` - -## Learning Goals -- Systems understanding: How computational graphs enable automatic differentiation and why this approach scales to arbitrary network architectures -- Core implementation skill: Build the Variable class with gradient tracking and implement backward propagation through dynamic computation graphs -- Pattern recognition: Understand how chain rule application through computational graphs generalizes to any differentiable function -- Framework connection: See how your implementation mirrors PyTorch's autograd engine and tensor gradient tracking -- Performance insight: Learn why computational graph memory management and gradient accumulation strategies determine training scalability - -## Build → Use → Reflect -1. **Build**: Complete automatic differentiation system with Variable class, gradient tracking, and backward propagation -2. **Use**: Apply autograd to complex mathematical expressions and neural network operations -3. **Reflect**: Why does automatic differentiation enable ML at scale, and how does graph memory management affect training? 
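The reflect question about graph memory has a concrete back-of-the-envelope answer: backpropagation keeps every layer's forward activation alive until its gradient has been computed, so stored-activation memory grows linearly with depth. A rough estimator (the layer sizes below are invented for illustration):

```python
def graph_memory_mb(batch, width, depth, bytes_per_elem=4):
    """Rough memory (MB) for forward activations a computation graph must hold.

    Each of `depth` layers stores a (batch, width) float activation for the
    backward pass, so memory scales linearly with network depth.
    """
    elems_per_layer = batch * width
    return depth * elems_per_layer * bytes_per_elem / 1e6

# Doubling depth doubles the activation memory held by the graph
shallow = graph_memory_mb(batch=64, width=1024, depth=10)
deep = graph_memory_mb(batch=64, width=1024, depth=20)
print(f"{shallow:.1f} MB vs {deep:.1f} MB")
```

This linear-in-depth cost is exactly what gradient checkpointing trades away: store fewer activations, recompute them during the backward pass.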
- -## What You'll Achieve -By the end of this module, you'll understand: -- Deep technical understanding of how computational graphs enable automatic gradient computation for arbitrary functions -- Practical capability to build the gradient computation engine that powers all modern neural network training -- Systems insight into why automatic differentiation was the breakthrough that enabled deep learning at scale -- Performance consideration of how computational graph size and memory management affect training efficiency -- Connection to production ML systems and how frameworks optimize gradient computation and memory usage - -## Systems Reality Check -💡 **Production Context**: PyTorch's autograd can handle graphs with millions of nodes and uses sophisticated memory optimization like gradient checkpointing to train models larger than GPU memory -⚡ **Performance Note**: Gradient computation often requires storing forward activations, leading to memory usage that scales with network depth - this drives innovations like gradient checkpointing -""" - -# %% nbgrader={"grade": false, "grade_id": "autograd-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} -#| default_exp core.autograd - -#| export -import numpy as np -import sys -from typing import Union, List, Tuple, Optional, Any, Callable -from collections import defaultdict - -# Import our existing components -try: - from tinytorch.core.tensor import Tensor -except ImportError: - # For development, import from local modules - import os - sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor')) - from tensor_dev import Tensor - -# %% nbgrader={"grade": false, "grade_id": "autograd-setup", "locked": false, "schema_version": 3, "solution": false, "task": false} -print("🔥 TinyTorch Autograd Module") -print(f"NumPy version: {np.__version__}") -print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}") -print("Ready to build automatic differentiation!") - 
-# %% [markdown] -""" -## 📦 Where This Code Lives in the Final Package - -**Learning Side:** You work in `modules/06_autograd/autograd_dev.py` -**Building Side:** Code exports to `tinytorch.core.autograd` - -```python -# Final package structure: -from tinytorch.core.autograd import Variable, backward # The gradient engine! -from tinytorch.core.tensor import Tensor -from tinytorch.core.activations import ReLU, Sigmoid, Tanh -``` - -**Why this matters:** -- **Learning:** Focused module for understanding gradients -- **Production:** Proper organization like PyTorch's `torch.autograd` -- **Consistency:** All gradient operations live together in `core.autograd` -- **Foundation:** Enables training for all neural networks -""" - -# %% [markdown] -""" -## What is Automatic Differentiation? - -### The Problem: Computing Gradients at Scale -Neural networks have millions of parameters. To train them, we need gradients of the loss function with respect to every parameter: - -``` -∇θ L = [∂L/∂w₁, ∂L/∂w₂, ..., ∂L/∂wₙ, ∂L/∂b₁, ∂L/∂b₂, ..., ∂L/∂bₘ] -``` - -**Manual differentiation fails** because: -- Networks have thousands of composed functions -- Manual computation is extremely error-prone -- Every architecture change requires re-deriving all gradients - -### The Solution: Automatic Differentiation -**Autograd** automatically computes derivatives of functions represented as computational graphs: - -```python -# Instead of manually computing: ∂(x² + 2xy + y²)/∂x = 2x + 2y -# Autograd does it automatically: -x = Variable(3.0, requires_grad=True) -y = Variable(4.0, requires_grad=True) -z = x**2 + 2*x*y + y**2 -z.backward() -print(x.grad) # 2*3 + 2*4 = 14 (computed automatically!) 
-``` - -### Visual Representation: Computational Graph - -``` -Mathematical Expression: z = x² + 2xy + y² - -Computational Graph: - x ──┬─→ [×] ──→ x² ──┬─→ [+] ──→ [+] ──→ z - ↑ │ │ ↑ ↑ - │ └─→ [×] ──→ 2x ─┘ │ │ - │ ↑ │ │ - │ 2 │ │ - │ │ │ - x ──┬─→ [×] ──→ xy ─→ [×] ──→ 2xy │ - ↑ │ ↑ ↑ │ - │ │ │ 2 │ - │ │ y │ - │ │ │ - y ──┴─→ [×] ──→ y² ────────────────────┘ - -Forward Pass: Compute values x² = 9, 2xy = 24, y² = 16, z = 49 -Backward Pass: Compute gradients ∂z/∂x = 14, ∂z/∂y = 20 -``` - -### Why This is Revolutionary -- **Efficiency**: O(1) overhead per operation -- **Flexibility**: Works with any differentiable function -- **Correctness**: Implements chain rule precisely -- **Scale**: Handles millions of parameters automatically - -### Real-World Impact -- **PyTorch**: `torch.autograd` enables all neural network training -- **TensorFlow**: `tf.GradientTape` provides similar functionality -- **JAX**: `jax.grad` for high-performance computing -- **Deep Learning**: Made training complex models practical - -Let us build the engine that powers modern AI! -""" - -# %% [markdown] -""" -## 🔧 DEVELOPMENT -""" - -# %% [markdown] -""" -## Step 1: The Variable Class - Gradient Tracking - -### What is a Variable? -A **Variable** wraps a Tensor and tracks: -- **Data**: The actual values (forward pass) -- **Gradient**: The computed gradients (backward pass) -- **Computation history**: How this Variable was created -- **Backward function**: How to compute gradients - -### Visual: The Computational Graph Structure -``` -Variable Structure: -┌─────────────────────────────────┐ -│ Variable Object │ -├─────────────────────────────────┤ -│ data: Tensor([1.5, 2.3, ...]) │ ← Forward pass values -│ grad: None → Tensor([...]) │ ← Backward pass gradients -│ requires_grad: True/False │ ← Should compute gradients? -│ grad_fn: │ ← How to compute gradients -│ is_leaf: True/False │ ← Original parameter? 
-└─────────────────────────────────┘ - -Computational Graph Example: - x (leaf) ──┐ - ├──[ADD]──→ z (intermediate) - y (leaf) ──┘ - - Forward: x.data + y.data = z.data - Backward: z.grad → x.grad, y.grad (via chain rule) -``` - -### Memory Layout: Variables vs Tensors -``` -Memory Comparison: - Tensor Only Variable with Autograd - ┌─────────────┐ ┌─────────────┐ - │ Data │ │ Data │ ← Same data storage - │ 4 bytes │ │ 4 bytes │ - └─────────────┘ ├─────────────┤ - │ Gradient │ ← Additional gradient storage - │ 4 bytes │ - ├─────────────┤ - │ grad_fn │ ← Function pointer - │ 8 bytes │ - └─────────────┘ - Total: ~2x memory overhead -``` - -### Design Principles -- **Transparency**: Works seamlessly with existing operations -- **Efficiency**: Minimal overhead for forward pass -- **Flexibility**: Supports any differentiable operation -- **Correctness**: Implements chain rule precisely - -### Real-World Context -This is like: -- **PyTorch**: `torch.autograd.Variable` (now integrated into tensors) -- **TensorFlow**: `tf.Variable` with gradient tracking -- **JAX**: Variables with `jax.grad` transformation -""" - -# %% nbgrader={"grade": false, "grade_id": "variable-class", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class Variable: - """ - Variable: Tensor wrapper with automatic differentiation capabilities. - - The fundamental class for gradient computation in TinyTorch. - Wraps Tensor objects and tracks computational history for backpropagation. - """ - - def __init__(self, data: Union[Tensor, np.ndarray, list, float, int], - requires_grad: bool = True, grad_fn: Optional[Callable] = None): - """ - Create a Variable with gradient tracking. - - Simple, clear conversion focused on core autograd concepts. 
- - Args: - data: The data (will be converted to Tensor if needed) - requires_grad: Whether to track gradients for this Variable - grad_fn: Function for computing gradients (None for leaf nodes) - """ - ### BEGIN SOLUTION - # Simple, clear conversion - if isinstance(data, Tensor): - self.data = data - else: - self.data = Tensor(data) - - self.requires_grad = requires_grad - self.grad = None - self.grad_fn = grad_fn - self.is_leaf = grad_fn is None - ### END SOLUTION - - @property - def shape(self) -> Tuple[int, ...]: - """Get the shape of the underlying tensor.""" - return self.data.shape - - @property - def size(self) -> int: - """Get the total number of elements.""" - return self.data.size - - def __repr__(self) -> str: - """String representation of the Variable.""" - grad_str = f", grad_fn=<{self.grad_fn.__name__}>" if self.grad_fn else "" - return f"Variable(shape={self.shape}, requires_grad={self.requires_grad}{grad_str})" - - def backward(self, gradient: Optional['Variable'] = None) -> None: - """ - Compute gradients using backpropagation. - - Simple gradient accumulation focused on learning the core concepts. - - Args: - gradient: Incoming gradient (defaults to ones for scalar outputs) - """ - ### BEGIN SOLUTION - if gradient is None: - gradient = Variable(np.ones_like(self.numpy())) - - if self.requires_grad: - if self.grad is None: - self.grad = gradient - else: - # Accumulate gradients - self.grad = Variable(self.grad.numpy() + gradient.numpy()) - - if self.grad_fn is not None: - self.grad_fn(gradient) - ### END SOLUTION - - def zero_grad(self) -> None: - """Reset gradients to zero.""" - self.grad = None - - def numpy(self) -> np.ndarray: - """ - Convert Variable to NumPy array - Universal data extraction interface. - - This is the PyTorch-inspired solution to inconsistent data access. - ALWAYS returns np.ndarray, regardless of internal structure. 
- - Returns: - NumPy array containing the variable's data - - Usage: - var = Variable([1, 2, 3]) - array = var.numpy() # Always np.ndarray, no conditional logic needed - """ - return self.data.data - - @property - def array(self) -> np.ndarray: - """ - Clean property access to underlying numpy array. - - Use this instead of .data.data for cleaner, more readable code. - - Example: - x = Variable([1, 2, 3]) - arr = x.array # Clean access instead of x.data.data - """ - return self.data.data - - def __add__(self, other: Union['Variable', float, int]) -> 'Variable': - """Addition operator: self + other""" - return add(self, other) - - def __mul__(self, other: Union['Variable', float, int]) -> 'Variable': - """Multiplication operator: self * other""" - return multiply(self, other) - - def __sub__(self, other: Union['Variable', float, int]) -> 'Variable': - """Subtraction operator: self - other""" - return subtract(self, other) - - def __truediv__(self, other: Union['Variable', float, int]) -> 'Variable': - """Division operator: self / other""" - return divide(self, other) - - def __matmul__(self, other: 'Variable') -> 'Variable': - """Matrix multiplication operator: self @ other""" - return matmul(self, other) - -# %% [markdown] -""" -### 🧪 Unit Test: Variable Class - -This test validates Variable initialization, ensuring gradient tracking capabilities work correctly. 
-""" - -# %% nbgrader={"grade": true, "grade_id": "test-variable-class", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false} -def test_unit_variable_class(): - """Test Variable class implementation""" - print("🔬 Unit Test: Variable Class...") - - # Test Variable creation - x = Variable(5.0, requires_grad=True) - assert x.requires_grad == True, "Variable should require gradients" - assert x.is_leaf == True, "Variable should be a leaf node" - assert x.grad is None, "Gradient should be None initially" - - # Test data access - assert x.numpy().item() == 5.0, "Data should be accessible" - assert x.shape == (), "Scalar should have empty shape" - assert x.size == 1, "Scalar should have size 1" - - # Test with list input - y = Variable([1, 2, 3], requires_grad=True) - assert y.shape == (3,), "List should create 1D tensor" - assert y.size == 3, "Size should be 3" - - # Test with requires_grad=False - z = Variable(10.0, requires_grad=False) - assert z.requires_grad == False, "Should not require gradients" - - # Test zero_grad - x.grad = Variable(1.0) - x.zero_grad() - assert x.grad is None, "zero_grad should reset gradient to None" - - print("✅ Variable class tests passed!") - print(f"✅ Variable creation and initialization working") - print(f"✅ Data access and properties working") - print(f"✅ Gradient management working") - -# Test will run in main block - -# %% [markdown] -""" -## 🤔 Computational Assessment: Variable Understanding - -Test your understanding of computational graphs and Variable design. 
-""" - -# %% nbgrader={"grade": true, "grade_id": "question-variable-design", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} -""" -### Assessment Question: Variable Memory and Design - -Consider this Variable usage pattern: -```python -x = Variable(np.random.randn(1000, 1000), requires_grad=True) -y = x * 2 + 1 -z = y @ y.T -loss = z.sum() -loss.backward() -``` - -**Question**: How much memory does this computational graph consume compared to just storing the final result? Calculate the memory overhead and explain why Variables need to store intermediate values. - -**Calculation Space:** -- Forward pass memory: _____ MB -- Gradient storage memory: _____ MB -- Total overhead factor: _____x - -**Conceptual Analysis:** -TODO: Explain why automatic differentiation requires storing intermediate values and how this affects memory scaling in deep networks. - -**Design Justification:** -TODO: Justify why the Variable design separates data, gradients, and computation history into different attributes. -""" - -### BEGIN SOLUTION -# Student response area - this will be manually graded -# Expected analysis should cover: -# 1. Memory calculation: 1000x1000 float32 = 4MB per intermediate result -# 2. Total memory: x(4MB) + y(4MB) + z(4MB) + gradients(~12MB) = ~24MB vs 4MB final result -# 3. Conceptual understanding: Need intermediate values for chain rule -# 4. Design rationale: Separation enables flexible gradient computation -### END SOLUTION - -# %% [markdown] -""" -## Step 2: Basic Operations with Gradients - -### The Chain Rule in Action -Every operation must implement: -1. **Forward pass**: Compute the result -2. 
**Backward pass**: Compute gradients for inputs - -### Visual: Chain Rule Through Addition -``` -Forward Pass: z = x + y - x: 3.0 ──┐ - ├──[+]──→ z: 5.0 - y: 2.0 ──┘ - -Backward Pass: ∂z/∂x = 1, ∂z/∂y = 1 - ∂L/∂z: 1.0 ──┬──→ ∂L/∂x: 1.0 (∂z/∂x = 1) - │ - └──→ ∂L/∂y: 1.0 (∂z/∂y = 1) - -Chain Rule: ∂L/∂x = ∂L/∂z · ∂z/∂x = 1.0 · 1 = 1.0 -``` - -### Mathematical Foundation -The chain rule states: -``` -∂f/∂x = ∂f/∂z · ∂z/∂x -``` - -For complex expressions like f(g(h(x))): -``` -∂f/∂x = ∂f/∂g · ∂g/∂h · ∂h/∂x -``` - -### Implementation Pattern -Each operation returns a new Variable with: -- **Forward result**: Computed value -- **Backward function**: Gradient computation -""" - -# %% [markdown] -""" -## Helper Functions for Binary Operations - -These helper functions reduce code repetition and make operations more consistent. -""" - -#| export -def _ensure_variables(a, b): - """Convert inputs to Variables if they are scalars.""" - if isinstance(a, (int, float)): - a = Variable(a, requires_grad=False) - if isinstance(b, (int, float)): - b = Variable(b, requires_grad=False) - return a, b - -def _create_binary_operation(forward_fn, grad_fn_a, grad_fn_b): - """ - Helper to create binary operations with consistent structure. 
- - Args: - forward_fn: Function to compute forward pass - grad_fn_a: Function to compute gradient for first argument - grad_fn_b: Function to compute gradient for second argument - - Returns: - Binary operation function - """ - def operation(a, b): - # Convert inputs - a, b = _ensure_variables(a, b) - - # Forward pass - result_data = forward_fn(a.data, b.data) - - # Backward function - def grad_fn(grad_output): - if a.requires_grad: - grad_a = grad_fn_a(grad_output, a, b) - a.backward(grad_a) - if b.requires_grad: - grad_b = grad_fn_b(grad_output, a, b) - b.backward(grad_b) - - requires_grad = a.requires_grad or b.requires_grad - return Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn) - - return operation - -# %% nbgrader={"grade": false, "grade_id": "add-operation", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -def add(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable: - """ - Addition operation with gradient tracking: a + b - - TODO: Implement addition with automatic differentiation. - - STEP-BY-STEP IMPLEMENTATION: - 1. Convert inputs to Variables if they are scalars - 2. Compute forward pass: result = a.data + b.data - 3. Create gradient function that implements: ∂(a+b)/∂a = 1, ∂(a+b)/∂b = 1 - 4. 
Return new Variable with result and gradient function - - MATHEMATICAL FOUNDATION: - - Forward: z = x + y - - Backward: ∂z/∂x = 1, ∂z/∂y = 1 - - Chain rule: ∂L/∂x = ∂L/∂z · ∂z/∂x = ∂L/∂z · 1 = ∂L/∂z - - EXAMPLE USAGE: - ```python - x = Variable(2.0, requires_grad=True) - y = Variable(3.0, requires_grad=True) - z = add(x, y) # z = 5.0 - z.backward() - print(x.grad) # 1.0 (∂z/∂x = 1) - print(y.grad) # 1.0 (∂z/∂y = 1) - ``` - - IMPLEMENTATION HINTS: - - Convert scalars: if isinstance(a, (int, float)): a = Variable(a, requires_grad=False) - - Forward pass: result_data = a.data + b.data - - Backward function: def grad_fn(grad_output): if a.requires_grad: a.backward(grad_output) - - Return: Variable(result_data, grad_fn=grad_fn) - - Only propagate gradients to Variables that require them - - LEARNING CONNECTIONS: - - This is like torch.add() with autograd - - Addition distributes gradients equally to both inputs - - Forms the basis for bias addition in neural networks - - Chain rule propagates gradients through the graph - """ - ### BEGIN SOLUTION - # Convert scalars to Variables - if isinstance(a, (int, float)): - a = Variable(a, requires_grad=False) - if isinstance(b, (int, float)): - b = Variable(b, requires_grad=False) - - # Forward pass - result_data = a.data + b.data - - # Backward function - def grad_fn(grad_output): - # Addition distributes gradients equally, but must handle broadcasting - if a.requires_grad: - # Clean gradient data access - grad_data = grad_output.data - - # Check if we need to sum over broadcasted dimensions - a_shape = a.data.shape - if grad_data.shape != a_shape: - # Sum over the broadcasted dimensions - # For bias: (batch_size, features) -> (features,) - if len(grad_data.shape) == 2 and len(a_shape) == 1: - grad_for_a = Variable(Tensor(np.sum(grad_data, axis=0))) - else: - # Handle other broadcasting cases - grad_for_a = grad_output - else: - grad_for_a = grad_output - - a.backward(grad_for_a) - - if b.requires_grad: - # Clean gradient data 
access - grad_data = grad_output.data - - # Check if we need to sum over broadcasted dimensions - b_shape = b.data.shape - if grad_data.shape != b_shape: - # Sum over the broadcasted dimensions - # For bias: (batch_size, features) -> (features,) - if len(grad_data.shape) == 2 and len(b_shape) == 1: - grad_for_b = Variable(Tensor(np.sum(grad_data, axis=0))) - else: - # Handle other broadcasting cases - grad_for_b = grad_output - else: - grad_for_b = grad_output - - b.backward(grad_for_b) - - # Return new Variable with gradient function - requires_grad = a.requires_grad or b.requires_grad - return Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn) - ### END SOLUTION - -# %% [markdown] -""" -### 🧪 Unit Test: Addition Operation - -This test validates addition operation, ensuring gradients flow correctly through addition. -""" - -# %% nbgrader={"grade": true, "grade_id": "test-add-operation", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false} -def test_unit_add_operation(): - """Test addition operation with gradients""" - print("🔬 Unit Test: Addition Operation...") - - # Test basic addition - x = Variable(2.0, requires_grad=True) - y = Variable(3.0, requires_grad=True) - z = add(x, y) - - assert z.numpy().item() == 5.0, "Addition result should be 5.0" - assert z.requires_grad == True, "Result should require gradients" - assert z.is_leaf == False, "Result should not be a leaf node" - - # Test backward pass - z.backward() - - assert x.grad is not None, "x should have gradient" - assert y.grad is not None, "y should have gradient" - assert x.grad.numpy().item() == 1.0, "∂z/∂x should be 1.0" - assert y.grad.numpy().item() == 1.0, "∂z/∂y should be 1.0" - - # Test with scalar - a = Variable(5.0, requires_grad=True) - b = add(a, 3.0) # Add scalar - - assert b.numpy().item() == 8.0, "Addition with scalar should work" - - b.backward() - assert a.grad.numpy().item() == 1.0, "Gradient through scalar addition should be 1.0" - - 
print("✅ Addition operation tests passed!") - print(f"✅ Forward pass computing correct results") - print(f"✅ Backward pass computing correct gradients") - print(f"✅ Scalar addition working correctly") - -# Test will run in main block - -# ✅ IMPLEMENTATION CHECKPOINT: Addition operation complete - -# 🤔 PREDICTION: How does the chain rule apply when operations are chained together? -# Your answer: _______ - -# 🔍 SYSTEMS INSIGHT #1: Gradient Flow Analysis -def analyze_gradient_flow(): - """Analyze how gradients flow through computational graphs.""" - try: - print("🔍 GRADIENT FLOW ANALYSIS") - print("=" * 35) - - # Create simple computational graph - x = Variable(2.0, requires_grad=True) - y = Variable(3.0, requires_grad=True) - - # Build graph: z = (x + y) * 2 - sum_xy = add(x, y) # x + y = 5.0 - z = multiply(sum_xy, 2.0) # (x + y) * 2 = 10.0 - - print(f"Forward pass:") - print(f" x = {x.numpy().item()}") - print(f" y = {y.numpy().item()}") - print(f" x + y = {sum_xy.numpy().item()}") - print(f" z = (x + y) * 2 = {z.numpy().item()}") - - # Compute gradients - z.backward() - - print(f"\nBackward pass:") - print(f" ∂z/∂x = {x.grad.numpy().item()}") - print(f" ∂z/∂y = {y.grad.numpy().item()}") - - # Analyze memory usage - import sys - x_memory = sys.getsizeof(x) - z_memory = sys.getsizeof(z) - - print(f"\nMemory Analysis:") - print(f" Leaf variable (x): ~{x_memory} bytes") - print(f" Intermediate result (z): ~{z_memory} bytes") - print(f" Memory overhead: {z_memory/x_memory:.1f}x") - - # 💡 WHY THIS MATTERS: In large models, computational graphs can consume - # significant memory. Each intermediate result stores gradients and backward functions. - # This is why techniques like gradient checkpointing are crucial for training large models! 
- - return True - - except Exception as e: - print(f"⚠️ Error in gradient flow analysis: {e}") - print("Make sure addition and multiplication are implemented") - return False - -# Run the analysis (will work after multiplication is implemented) - -# %% [markdown] -""" -## Step 3: Multiplication Operation - -### The Product Rule -For z = x * y: -- **Forward**: z = x * y -- **Backward**: ∂z/∂x = y, ∂z/∂y = x - -### Visual: Product Rule in Action -``` -Forward Pass: z = x * y - x: 2.0 ──┐ - ├──[×]──→ z: 6.0 - y: 3.0 ──┘ - -Backward Pass: ∂z/∂x = y, ∂z/∂y = x - ∂L/∂z: 1.0 ──┬──→ ∂L/∂x: 3.0 (∂z/∂x = y = 3.0) - │ - └──→ ∂L/∂y: 2.0 (∂z/∂y = x = 2.0) - -Product Rule: -- ∂(xy)/∂x = y -- ∂(xy)/∂y = x -``` - -### Why This Matters -Multiplication is everywhere in neural networks: -- **Weight scaling**: w * x in dense layers -- **Attention mechanisms**: attention_weights * values -- **Gating**: gate_signal * hidden_state - -### Chain Rule Application -When gradients flow back through multiplication: -``` -∂L/∂x = ∂L/∂z · ∂z/∂x = ∂L/∂z · y -∂L/∂y = ∂L/∂z · ∂z/∂y = ∂L/∂z · x -``` -""" - -# %% nbgrader={"grade": false, "grade_id": "multiply-operation", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -def multiply(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable: - """ - Multiplication operation with gradient tracking: a * b - - Uses the product rule: ∂(a*b)/∂a = b, ∂(a*b)/∂b = a - """ - ### BEGIN SOLUTION - # Convert scalars to Variables - a, b = _ensure_variables(a, b) - - # Forward pass - result_data = a.data * b.data - - # Backward function using product rule - def grad_fn(grad_output): - if a.requires_grad: - a.backward(Variable(grad_output.numpy() * b.numpy())) - if b.requires_grad: - b.backward(Variable(grad_output.numpy() * a.numpy())) - - # Return new Variable with gradient function - requires_grad = a.requires_grad or b.requires_grad - return Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn) - 
### END SOLUTION - -# %% [markdown] -""" -### 🧪 Unit Test: Multiplication Operation - -This test validates multiplication operation, ensuring the product rule is implemented correctly. -""" - -# %% nbgrader={"grade": true, "grade_id": "test-multiply-operation", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false} -def test_unit_multiply_operation(): - """Test multiplication operation with gradients""" - print("🔬 Unit Test: Multiplication Operation...") - - # Test basic multiplication - x = Variable(2.0, requires_grad=True) - y = Variable(3.0, requires_grad=True) - z = multiply(x, y) - - assert z.numpy().item() == 6.0, "Multiplication result should be 6.0" - assert z.requires_grad == True, "Result should require gradients" - - # Test backward pass - z.backward() - - assert x.grad is not None, "x should have gradient" - assert y.grad is not None, "y should have gradient" - assert x.grad.numpy().item() == 3.0, "∂z/∂x should be y = 3.0" - assert y.grad.numpy().item() == 2.0, "∂z/∂y should be x = 2.0" - - # Test with scalar - a = Variable(4.0, requires_grad=True) - b = multiply(a, 2.0) # Multiply by scalar - - assert b.numpy().item() == 8.0, "Multiplication with scalar should work" - - b.backward() - assert a.grad.numpy().item() == 2.0, "Gradient through scalar multiplication should be the scalar" - - print("✅ Multiplication operation tests passed!") - print(f"✅ Forward pass computing correct results") - print(f"✅ Backward pass implementing product rule correctly") - print(f"✅ Scalar multiplication working correctly") - -# Test will run in main block - -# Now run the gradient flow analysis -analyze_gradient_flow() - -# %% nbgrader={"grade": false, "grade_id": "subtract-operation", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -def subtract(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable: - """ - Subtraction operation with gradient tracking: a - b - - Uses the rule: ∂(a-b)/∂a 
= 1, ∂(a-b)/∂b = -1 - """ - ### BEGIN SOLUTION - # Convert to Variables if needed - a, b = _ensure_variables(a, b) - - # Forward pass - result_data = a.data - b.data - - # Create gradient function - def grad_fn(grad_output): - # Subtraction rule: d(x-y)/dx = 1, d(x-y)/dy = -1 - if a.requires_grad: - a.backward(grad_output) - if b.requires_grad: - b_grad = Variable(-grad_output.numpy()) - b.backward(b_grad) - - # Determine if result requires gradients - requires_grad = a.requires_grad or b.requires_grad - - return Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn) - ### END SOLUTION - -#| export -def matmul(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable: - """ - Matrix multiplication operation with gradient tracking: a @ b - - Uses matrix multiplication gradients: ∂C/∂A = grad_C @ B^T, ∂C/∂B = A^T @ grad_C - """ - ### BEGIN SOLUTION - # Convert scalars to Variables - a, b = _ensure_variables(a, b) - - # Forward pass - matrix multiplication - result_data = Tensor(a.numpy() @ b.numpy()) - - # Backward function - def grad_fn(grad_output): - # Matrix multiplication gradients - if a.requires_grad: - # ∂C/∂A = grad_C @ B^T - grad_a_data = grad_output.numpy() @ b.numpy().T - a.backward(Variable(grad_a_data)) - - if b.requires_grad: - # ∂C/∂B = A^T @ grad_C - grad_b_data = a.numpy().T @ grad_output.numpy() - b.backward(Variable(grad_b_data)) - - # Return new Variable with gradient function - requires_grad = a.requires_grad or b.requires_grad - return Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn) - ### END SOLUTION - -#| export -def divide(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable: - """ - Division operation with gradient tracking: a / b - - Uses the quotient rule: ∂(a/b)/∂a = 1/b, ∂(a/b)/∂b = -a/b² - """ - ### BEGIN SOLUTION - # Convert scalars to Variables - a, b = _ensure_variables(a, b) - - # Forward pass - result_data = a.data / b.data - - # Backward function - def 
grad_fn(grad_output): - if a.requires_grad: - # ∂(a/b)/∂a = 1/b - grad_a = Variable(grad_output.numpy() / b.numpy()) - a.backward(grad_a) - if b.requires_grad: - # ∂(a/b)/∂b = -a/b² - grad_b = Variable(-grad_output.numpy() * a.numpy() / (b.numpy() ** 2)) - b.backward(grad_b) - - requires_grad = a.requires_grad or b.requires_grad - return Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn) - ### END SOLUTION - -# %% nbgrader={"grade": false, "grade_id": "test-subtract-operation", "locked": false, "schema_version": 3, "solution": false, "task": false} -def test_unit_subtract_operation(): - """Test subtraction operation with gradients""" - print("🔬 Unit Test: Subtraction Operation...") - - # Test basic subtraction - x = Variable(5.0, requires_grad=True) - y = Variable(3.0, requires_grad=True) - z = subtract(x, y) - - assert z.numpy().item() == 2.0, "Subtraction result should be 2.0" - assert z.requires_grad == True, "Result should require gradients" - - # Test backward pass - z.backward() - - assert x.grad is not None, "x should have gradient" - assert y.grad is not None, "y should have gradient" - assert x.grad.numpy().item() == 1.0, "∂z/∂x should be 1.0" - assert y.grad.numpy().item() == -1.0, "∂z/∂y should be -1.0" - - # Test with scalar - a = Variable(4.0, requires_grad=True) - b = subtract(a, 2.0) # Subtract scalar - - assert b.numpy().item() == 2.0, "Subtraction with scalar should work" - - b.backward() - assert a.grad.numpy().item() == 1.0, "Gradient through scalar subtraction should be 1.0" - - print("✅ Subtraction operation tests passed!") - print(f"✅ Forward pass computing correct results") - print(f"✅ Backward pass implementing subtraction rule correctly") - print(f"✅ Scalar subtraction working correctly") - -# Test will run in main block - -# %% [markdown] -""" -## 🤔 Computational Assessment: Chain Rule Application - -Test your understanding of how gradients flow through multiple operations. 
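Before answering, note that any hand-computed chain-rule result can be sanity-checked numerically with a central finite difference. The sketch below is a minimal, self-contained helper (it does not use the Variable class built in this module, and `f` is an illustrative function rather than the graph in this question):

```python
def numerical_grad(f, x, y, wrt="x", eps=1e-6):
    """Central finite-difference estimate of a partial derivative of f(x, y)."""
    if wrt == "x":
        return (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
    return (f(x, y + eps) - f(x, y - eps)) / (2 * eps)

# Illustrative function (not the assessment graph): f(x, y) = (x + y) * y
f = lambda x, y: (x + y) * y

print(numerical_grad(f, 2.0, 3.0, wrt="x"))  # ∂f/∂x = y, so ≈ 3.0
print(numerical_grad(f, 2.0, 3.0, wrt="y"))  # ∂f/∂y = x + 2y, so ≈ 8.0
```

The same idea, at higher precision, is what production gradient checkers such as `torch.autograd.gradcheck` use to validate hand-written backward functions.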
-""" - -# %% nbgrader={"grade": true, "grade_id": "question-chain-rule", "locked": false, "points": 15, "schema_version": 3, "solution": true, "task": false} -""" -### Assessment Question: Manual Gradient Calculation - -Consider this computational graph: -```python -x = Variable(2.0, requires_grad=True) -y = Variable(3.0, requires_grad=True) -a = x * y # a = 6.0 -b = a + x # b = 8.0 -c = b * 2 # c = 16.0 -c.backward() -``` - -**Calculate manually:** -1. ∂c/∂b = _____ -2. ∂b/∂a = _____ -3. ∂b/∂x = _____ -4. ∂a/∂x = _____ -5. ∂a/∂y = _____ - -**Apply chain rule:** -6. ∂c/∂x (through path c→b→a→x) = _____ -7. ∂c/∂x (through path c→b→x) = _____ -8. Total ∂c/∂x = _____ + _____ = _____ -9. ∂c/∂y = _____ - -**Verification:** -TODO: Run the code above and verify your calculations match the computed gradients. -""" - -### BEGIN SOLUTION -# Student calculation space - this will be manually graded -# Expected answers: -# 1. ∂c/∂b = 2 (c = b * 2) -# 2. ∂b/∂a = 1 (b = a + x) -# 3. ∂b/∂x = 1 (b = a + x) -# 4. ∂a/∂x = y = 3 (a = x * y) -# 5. ∂a/∂y = x = 2 (a = x * y) -# 6. ∂c/∂x (path 1) = 2 * 1 * 3 = 6 -# 7. ∂c/∂x (path 2) = 2 * 1 = 2 -# 8. Total ∂c/∂x = 6 + 2 = 8 -# 9. 
∂c/∂y = 2 * 1 * 2 = 4 -### END SOLUTION - -# %% [markdown] -""" -## Step 4: Chain Rule in Complex Expressions - -### Building Complex Computations -Now let us test how multiple operations work together through the chain rule: - -### Visual: Complex Computational Graph -``` -Example: f(x, y) = (x + y) * (x - y) = x² - y² - -Computational Graph (x and y each feed both operations): - x ──┬─→ [+] ──→ sum ──┐ - │ ├─→ [×] ──→ result - y ──┴─→ [-] ──→ diff ──┘ - -Forward Pass Flow: - x=3, y=2 → sum=5, diff=1 → result=5 - -Backward Pass Flow: - ∂L/∂result=1 → ∂L/∂sum=diff=1, ∂L/∂diff=sum=5 → ∂L/∂x=6, ∂L/∂y=-4 - -Manual verification: f(x,y) = x² - y² -∂f/∂x = 2x = 2(3) = 6 ✓ -∂f/∂y = -2y = -2(2) = -4 ✓ -``` - -### Chain Rule Application -- **Forward**: Compute each operation in sequence -- **Backward**: Gradients flow back through each operation -- **Automatic**: No manual gradient computation needed! - -### Real-World Significance -Complex neural networks are just larger versions of this: -- **Millions of operations**: Each tracked automatically -- **Complex architectures**: ResNet, Transformer, etc. 
-- **Efficient computation**: O(1) overhead per operation -""" - -# %% nbgrader={"grade": true, "grade_id": "test-chain-rule", "locked": true, "points": 20, "schema_version": 3, "solution": false, "task": false} -def test_unit_chain_rule(): - """Test chain rule with complex expressions""" - print("🔬 Unit Test: Chain Rule with Complex Expressions...") - - # Test: f(x, y) = (x + y) * (x - y) = x² - y² - x = Variable(3.0, requires_grad=True) - y = Variable(2.0, requires_grad=True) - - # Build expression step by step - sum_xy = add(x, y) # x + y = 5.0 - diff_xy = subtract(x, y) # x - y = 1.0 - result = multiply(sum_xy, diff_xy) # (x + y) * (x - y) = 5.0 - - # Check forward pass - assert result.numpy().item() == 5.0, "Forward pass should compute 5.0" - - # Compute gradients - result.backward() - - # Check gradients: ∂(x²-y²)/∂x = 2x, ∂(x²-y²)/∂y = -2y - expected_x_grad = 2 * x.numpy().item() # 2 * 3 = 6 - expected_y_grad = -2 * y.numpy().item() # -2 * 2 = -4 - - assert abs(x.grad.numpy().item() - expected_x_grad) < 1e-6, f"x gradient should be {expected_x_grad}" - assert abs(y.grad.numpy().item() - expected_y_grad) < 1e-6, f"y gradient should be {expected_y_grad}" - - # Test more complex expression: f(x) = (x + 1) * (x + 2) * (x + 3) - x2 = Variable(1.0, requires_grad=True) - - term1 = add(x2, 1.0) # x + 1 = 2.0 - term2 = add(x2, 2.0) # x + 2 = 3.0 - term3 = add(x2, 3.0) # x + 3 = 4.0 - - product1 = multiply(term1, term2) # (x + 1) * (x + 2) = 6.0 - result2 = multiply(product1, term3) # * (x + 3) = 24.0 - - assert result2.numpy().item() == 24.0, "Complex expression should compute 24.0" - - result2.backward() - - # For f(x) = (x+1)(x+2)(x+3), f'(x) = 3x² + 12x + 11 - # At x=1: f'(1) = 3 + 12 + 11 = 26 - expected_grad = 3 * (1.0**2) + 12 * 1.0 + 11 # 26 - - assert abs(x2.grad.numpy().item() - expected_grad) < 1e-6, f"Complex gradient should be {expected_grad}" - - print("✅ Chain rule tests passed!") - print(f"✅ Simple expression: (x+y)*(x-y) = x²-y²") - print(f"✅ Complex 
expression: (x+1)*(x+2)*(x+3)") - print(f"✅ Automatic gradient computation working correctly") - print(f"✅ Chain rule implemented correctly") - -# Test will run in main block - -# ✅ IMPLEMENTATION CHECKPOINT: Basic operations complete - -# 🤔 PREDICTION: How does computational graph memory scale with network depth? -# Your answer: _______ - -# 🔍 SYSTEMS INSIGHT #2: Computational Graph Memory Analysis -def analyze_computational_graph_memory(): - """Analyze memory consumption patterns in computational graphs.""" - try: - print("🔍 COMPUTATIONAL GRAPH MEMORY ANALYSIS") - print("=" * 45) - - import sys - - # Test different graph depths - depths = [1, 3, 5, 8] - memory_usage = [] - - for depth in depths: - # Create computational graph of specified depth - x = Variable(np.random.randn(100, 100), requires_grad=True) - current = x - - # Build chain of operations - for i in range(depth): - current = multiply(current, 1.1) - current = add(current, 0.1) - - # Estimate memory usage - base_memory = x.data.data.nbytes / (1024 * 1024) # MB - - # Each operation creates new Variable with references - estimated_graph_memory = depth * 2 * base_memory # Rough estimate - - memory_usage.append(estimated_graph_memory) - - print(f" Depth {depth}: ~{estimated_graph_memory:.1f} MB") - - # Analyze scaling - if len(memory_usage) >= 2: - shallow = memory_usage[0] - deep = memory_usage[-1] - scaling_factor = deep / shallow - - print(f"\nMemory Scaling Analysis:") - print(f" Depth 1: {shallow:.1f} MB") - print(f" Depth {depths[-1]}: {deep:.1f} MB") - print(f" Scaling factor: {scaling_factor:.1f}x") - print(f" Scaling per layer: {scaling_factor/depths[-1]:.2f}x") - - # Production implications - print(f"\n🏭 Production Scaling Implications:") - print(f" • ResNet-50 (50 layers): ~{memory_usage[0] * 50:.0f} MB graph memory") - print(f" • Transformer (100 layers): ~{memory_usage[0] * 100:.0f} MB graph memory") - print(f" • GPT-3 scale models: Gradient checkpointing essential!") - - # 💡 WHY THIS MATTERS: 
Deep networks require storing intermediate activations - # for gradient computation. This memory grows linearly with depth, leading to - # memory constraints. Gradient checkpointing trades compute for memory! - - return memory_usage - - except Exception as e: - print(f"⚠️ Error in memory analysis: {e}") - print("Make sure all operations are implemented") - return [1.0] - -# Run the analysis -analyze_computational_graph_memory() - -# %% [markdown] -""" -## Step 5: Integration with Neural Network Training - -### The Complete Training Loop -Let us see how autograd enables neural network training: - -1. **Forward pass**: Compute predictions -2. **Loss computation**: Compare with targets -3. **Backward pass**: Compute gradients automatically -4. **Parameter update**: Update weights using gradients - -### Visual: Neural Network Training Flow -``` -Training Loop Architecture: -┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ -│ Forward │───▶│ Loss │───▶│ Backward │───▶│ Update │ -│ Pass │ │ Computation │ │ Pass │ │ Parameters │ -└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ - ▲ │ │ - │ ▼ ▼ -┌─────────────┐ ┌─────────────┐ ┌─────────────┐ -│ Input Data │ │ Gradients │ │ New Weights│ -│ (x, y) │ │ ∇L/∇θ │ │ θ' │ -└─────────────┘ └─────────────┘ └─────────────┘ - -Memory Flow During Training: - Parameters → Forward Activations → Loss → Gradients → Parameter Updates - θ f(x; θ) L ∇L/∇θ θ - α∇L/∇θ - 4 MB 12 MB 1 val 4 MB 4 MB - (stored for (in-place) - backward) -``` - -### Example: Simple Linear Regression - ```python -# Model: y = wx + b -w = Variable(0.5, requires_grad=True) -b = Variable(0.1, requires_grad=True) - - # Forward pass -prediction = w * x + b - -# Loss: mean squared error -loss = (prediction - target)**2 - -# Backward pass (automatic!) 
-loss.backward() - -# Update parameters -w.data = w.data - learning_rate * w.grad.data -b.data = b.data - learning_rate * b.grad.data -``` - -### Why This is Powerful -- **Automatic**: No manual gradient computation -- **Flexible**: Works with any differentiable function -- **Efficient**: Minimal computational overhead -- **Scalable**: Handles millions of parameters -""" - -# %% nbgrader={"grade": true, "grade_id": "test-neural-network-training", "locked": true, "points": 25, "schema_version": 3, "solution": false, "task": false} -def test_module_neural_network_training(): - """Test autograd in neural network training scenario""" - print("🔬 Integration Test: Neural Network Training Comprehensive Test...") - - # Simple linear regression: y = wx + b - # Training data: y = 2x + 1 + noise - - # Initialize parameters - w = Variable(0.1, requires_grad=True) # Start with small random value - b = Variable(0.0, requires_grad=True) # Start with zero bias - - # Training data - x_data = [1.0, 2.0, 3.0, 4.0] - y_data = [3.0, 5.0, 7.0, 9.0] # y = 2x + 1 - - learning_rate = 0.01 - - # Training loop - for epoch in range(100): - total_loss = Variable(0.0) - - for x_val, y_val in zip(x_data, y_data): - # Create input variable - x = Variable(x_val, requires_grad=False) - target = Variable(y_val, requires_grad=False) - - # Forward pass - prediction = add(multiply(w, x), b) # wx + b - - # Loss: squared error - error = subtract(prediction, target) - loss = multiply(error, error) # (pred - target)² - - # Accumulate loss - total_loss = add(total_loss, loss) - - # Backward pass - w.zero_grad() - b.zero_grad() - total_loss.backward() - - # Update parameters - if w.grad is not None: - w.data = Tensor(w.numpy() - learning_rate * w.grad.numpy()) - if b.grad is not None: - b.data = Tensor(b.numpy() - learning_rate * b.grad.numpy()) - - # Check that parameters converged to correct values - final_w = w.numpy().item() - final_b = b.numpy().item() - - print(f"Final weights: w = {final_w:.3f}, b = 
{final_b:.3f}") - print(f"Target weights: w = 2.000, b = 1.000") - - # Should be close to w=2, b=1 - assert abs(final_w - 2.0) < 0.1, f"Weight should be close to 2.0, got {final_w}" - assert abs(final_b - 1.0) < 0.1, f"Bias should be close to 1.0, got {final_b}" - - # Test prediction with learned parameters - test_x = Variable(5.0, requires_grad=False) - test_prediction = add(multiply(w, test_x), b) - expected_output = 2.0 * 5.0 + 1.0 # 11.0 - - prediction_error = abs(test_prediction.numpy().item() - expected_output) - assert prediction_error < 0.5, f"Prediction error should be small, got {prediction_error}" - - print("✅ Neural network training comprehensive tests passed!") - print(f"✅ Parameters converged to correct values") - print(f"✅ Model makes accurate predictions") - print(f"✅ Autograd enables automatic training") - print(f"✅ Ready for complex neural network architectures!") - -# Test will run in main block - -# ✅ IMPLEMENTATION CHECKPOINT: Neural network training complete - -# 🤔 PREDICTION: How does backward pass time compare to forward pass time? 
-# Your answer: _______ - -# 🔍 SYSTEMS INSIGHT #3: Forward vs Backward Pass Performance -def analyze_forward_backward_performance(): - """Analyze performance characteristics of forward vs backward passes.""" - try: - print("🔍 FORWARD VS BACKWARD PASS PERFORMANCE") - print("=" * 45) - - import time - - # Test with different computation scales - sizes = [50, 100, 200] - results = [] - - for size in sizes: - print(f"\nTesting {size}x{size} operations:") - - # Create computation graph - x = Variable(np.random.randn(size, size), requires_grad=True) - y = Variable(np.random.randn(size, size), requires_grad=True) - - # Forward pass timing - forward_iterations = 5 - forward_start = time.time() - - for _ in range(forward_iterations): - z1 = multiply(x, y) - z2 = add(z1, x) - z3 = multiply(z2, 2.0) - result = z3.sum() if hasattr(z3, 'sum') else Variable(np.sum(z3.numpy())) - - forward_end = time.time() - avg_forward_time = (forward_end - forward_start) / forward_iterations - - # Backward pass timing - backward_start = time.time() - result.backward() - backward_end = time.time() - backward_time = backward_end - backward_start - - # Memory analysis - forward_memory = x.data.data.nbytes * 4 / (1024 * 1024) # Estimate - gradient_memory = (x.grad.data.data.nbytes + y.grad.data.data.nbytes) / (1024 * 1024) if x.grad and y.grad else 0 - - result_data = { - 'size': size, - 'forward_time_ms': avg_forward_time * 1000, - 'backward_time_ms': backward_time * 1000, - 'backward_forward_ratio': backward_time / avg_forward_time, - 'forward_memory_mb': forward_memory, - 'gradient_memory_mb': gradient_memory - } - results.append(result_data) - - print(f" Forward: {avg_forward_time*1000:.2f}ms") - print(f" Backward: {backward_time*1000:.2f}ms") - print(f" Ratio: {backward_time/avg_forward_time:.1f}x") - print(f" Memory: {forward_memory:.1f}MB forward, {gradient_memory:.1f}MB gradients") - - # Analyze trends - avg_ratio = sum(r['backward_forward_ratio'] for r in results) / len(results) - - 
print(f"\n📊 Performance Analysis:") - print(f"   Average backward/forward ratio: {avg_ratio:.1f}x") - - if avg_ratio > 2.0: - print(f"   • Backward pass significantly slower than forward") - print(f"   • Gradient computation dominates training time") - elif avg_ratio < 1.5: - print(f"   • Backward pass efficient relative to forward") - print(f"   • Good autograd implementation") - else: - print(f"   • Balanced forward/backward performance") - - print(f"\n🏭 Production Implications:") - print(f"   • Training time ≈ {1 + avg_ratio:.1f}x inference time") - print(f"   • Memory usage ≈ 2x parameters (gradients + weights)") - print(f"   • Gradient checkpointing can trade compute for memory") - - # 💡 WHY THIS MATTERS: Backward pass typically takes 1.5-3x forward pass time. - # This determines training speed and influences architecture choices. - # Understanding this ratio helps optimize training pipelines! - - return results - - except Exception as e: - print(f"⚠️ Error in performance analysis: {e}") - print("Basic timing analysis shows autograd overhead patterns") - return [] - -# Run the analysis -analyze_forward_backward_performance() - -# %% [markdown] -""" -## Step 6: ML Systems Thinking - Computational Graph Optimization - -### 🏗️ Autograd Systems at Production Scale - -Your autograd implementation provides the foundation for understanding how production ML frameworks optimize computational graphs for massive neural network training and inference. 
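As a concrete miniature of one such optimization, operator fusion avoids materializing a fresh intermediate array for every elementwise operation. The NumPy sketch below is only an illustration of the memory effect (the array size and buffer names are made up; real fusion happens inside compiled kernels):

```python
import numpy as np

x = np.random.randn(512, 512)
y = np.random.randn(512, 512)
b = 0.5

# Unfused: z = x * y + b materializes an intermediate array for (x * y)
tmp = x * y          # first allocation (~2 MB of float64)
z_unfused = tmp + b  # second allocation

# "Fused": one preallocated buffer, computed in place, no intermediates
buf = np.empty_like(x)
np.multiply(x, y, out=buf)
np.add(buf, b, out=buf)

assert np.allclose(z_unfused, buf)
```

Each fused kernel saves one intermediate allocation and one full pass over memory, which is exactly the "minimize memory access" goal of production graph optimizers.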
- -#### **Computational Graph Architecture** -```python -class ProductionAutogradEngine: - def __init__(self): - # Advanced autograd optimizations for production systems - self.graph_optimizer = ComputationalGraphOptimizer() - self.memory_manager = GradientMemoryManager() - self.kernel_fusion = AutogradKernelFusion() - self.checkpoint_manager = GradientCheckpointManager() -``` - -Real autograd systems must handle: -- **Graph optimization**: Fusing operations to minimize memory access -- **Memory management**: Releasing intermediate gradients to conserve memory -- **Parallel execution**: Computing gradients across multiple devices -- **Kernel fusion**: Combining operations for GPU efficiency -""" - -# %% nbgrader={"grade": false, "grade_id": "autograd-systems-profiler", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -import time -import gc -from collections import defaultdict, deque - -class AutogradSystemsProfiler: - """ - Production Autograd System Performance Analysis and Optimization - - Analyzes computational graph efficiency, memory patterns, and optimization - opportunities for production automatic differentiation systems. - """ - - def __init__(self): - """Initialize autograd systems profiler.""" - self.profiling_data = defaultdict(list) - self.graph_analysis = defaultdict(list) - self.optimization_strategies = [] - - def profile_computational_graph_depth(self, max_depth=10, operations_per_level=5): - """ - Profile computational graph performance vs depth. - - TODO: Implement computational graph depth analysis. - - APPROACH: - 1. Create computational graphs of increasing depth - 2. Measure forward and backward pass timing - 3. Analyze memory usage patterns during gradient computation - 4. Identify memory accumulation and gradient flow bottlenecks - 5. 
Generate graph optimization recommendations - - EXAMPLE: - profiler = AutogradSystemsProfiler() - graph_analysis = profiler.profile_computational_graph_depth(max_depth=8) - print(f"Memory scaling factor: {graph_analysis['memory_scaling_factor']:.2f}") - - HINTS: - - Build graphs by chaining operations: x -> op1 -> op2 -> ... -> loss - - Measure both forward and backward pass timing separately - - Track memory usage throughout the computation - - Monitor gradient accumulation patterns - - Focus on production-relevant graph depths - """ - ### BEGIN SOLUTION - print("🔧 Profiling Computational Graph Depth Impact...") - - results = {} - - for depth in range(1, max_depth + 1): - print(f" Testing graph depth: {depth}") - - # Create a computational graph of specified depth - # Each level adds more operations to test scaling - - # Start with input variable - try: - # Use Variable if available, otherwise simulate - x = Variable(np.random.randn(100, 100), requires_grad=True) - except: - # Fallback for testing - simulate Variable with Tensor - x = Tensor(np.random.randn(100, 100)) - - # Build computational graph of specified depth - current_var = x - operations = [] - - for level in range(depth): - # Add multiple operations per level to increase complexity - for op_idx in range(operations_per_level): - try: - # Simulate various operations - if op_idx % 4 == 0: - current_var = current_var * 0.9 # Scale operation - elif op_idx % 4 == 1: - current_var = current_var + 0.1 # Add operation - elif op_idx % 4 == 2: - # Matrix multiplication (most expensive) - weight = Tensor(np.random.randn(100, 100)) - if hasattr(current_var, 'data'): - current_var = Tensor(current_var.data @ weight.data) - else: - current_var = current_var @ weight - else: - # Activation-like operation - if hasattr(current_var, 'data'): - current_var = Tensor(np.maximum(0, current_var.data)) - else: - current_var = current_var # Skip for simplicity - - operations.append(f"level_{level}_op_{op_idx}") - except: - # 
Fallback for testing - current_var = Tensor(np.random.randn(100, 100)) - operations.append(f"level_{level}_op_{op_idx}_fallback") - - # Add final loss computation - try: - if hasattr(current_var, 'data'): - loss = Tensor(np.sum(current_var.data ** 2)) - else: - loss = np.sum(current_var ** 2) - except: - loss = Tensor(np.array([1.0])) - - # Measure forward pass timing - forward_iterations = 3 - forward_start = time.time() - - for _ in range(forward_iterations): - # Simulate forward pass computation - temp_x = x - for level in range(depth): - for op_idx in range(operations_per_level): - if op_idx % 4 == 0: - temp_x = temp_x * 0.9 - elif op_idx % 4 == 1: - temp_x = temp_x + 0.1 - # Skip expensive ops for timing - - forward_end = time.time() - avg_forward_time = (forward_end - forward_start) / forward_iterations - - # Measure backward pass timing (simulated) - # In real implementation, this would be loss.backward() - backward_start = time.time() - - # Simulate gradient computation through the graph - for _ in range(forward_iterations): - # Simulate backpropagation through all operations - gradient_accumulation = 0 - for level in range(depth): - for op_idx in range(operations_per_level): - # Simulate gradient computation - gradient_accumulation += level * op_idx * 0.001 - - backward_end = time.time() - avg_backward_time = (backward_end - backward_start) / forward_iterations - - # Memory analysis - try: - if hasattr(x, 'data'): - base_memory = x.data.nbytes / (1024 * 1024) # MB - if hasattr(current_var, 'data'): - result_memory = current_var.data.nbytes / (1024 * 1024) - else: - result_memory = base_memory - else: - base_memory = x.nbytes / (1024 * 1024) if hasattr(x, 'nbytes') else 1.0 - result_memory = base_memory - except: - base_memory = 1.0 - result_memory = 1.0 - - # Estimate gradient memory (in production, each operation stores gradients) - estimated_gradient_memory = depth * operations_per_level * base_memory * 0.5 - total_memory = base_memory + result_memory + 
estimated_gradient_memory - - # Calculate efficiency metrics - total_operations = depth * operations_per_level - total_time = avg_forward_time + avg_backward_time - operations_per_second = total_operations / total_time if total_time > 0 else 0 - - result = { - 'graph_depth': depth, - 'total_operations': total_operations, - 'forward_time_ms': avg_forward_time * 1000, - 'backward_time_ms': avg_backward_time * 1000, - 'total_time_ms': total_time * 1000, - 'base_memory_mb': base_memory, - 'estimated_gradient_memory_mb': estimated_gradient_memory, - 'total_memory_mb': total_memory, - 'operations_per_second': operations_per_second, - 'memory_per_operation': total_memory / total_operations if total_operations > 0 else 0 - } - - results[depth] = result - - print(f" Forward: {avg_forward_time*1000:.3f}ms, Backward: {avg_backward_time*1000:.3f}ms, Memory: {total_memory:.2f}MB") - - # Analyze scaling patterns - graph_analysis = self._analyze_graph_scaling(results) - - # Store profiling data - self.profiling_data['graph_depth_analysis'] = results - self.graph_analysis = graph_analysis - - return { - 'detailed_results': results, - 'graph_analysis': graph_analysis, - 'optimization_strategies': self._generate_graph_optimizations(results) - } - ### END SOLUTION - - def _analyze_graph_scaling(self, results): - """Analyze computational graph scaling patterns.""" - analysis = {} - - # Extract metrics for scaling analysis - depths = sorted(results.keys()) - forward_times = [results[d]['forward_time_ms'] for d in depths] - backward_times = [results[d]['backward_time_ms'] for d in depths] - total_times = [results[d]['total_time_ms'] for d in depths] - memory_usage = [results[d]['total_memory_mb'] for d in depths] - - # Calculate scaling factors - if len(depths) >= 2: - shallow = depths[0] - deep = depths[-1] - - depth_ratio = deep / shallow - forward_time_ratio = results[deep]['forward_time_ms'] / results[shallow]['forward_time_ms'] - backward_time_ratio = 
results[deep]['backward_time_ms'] / results[shallow]['backward_time_ms'] - memory_ratio = results[deep]['total_memory_mb'] / results[shallow]['total_memory_mb'] - - analysis['scaling_metrics'] = { - 'depth_ratio': depth_ratio, - 'forward_time_scaling': forward_time_ratio, - 'backward_time_scaling': backward_time_ratio, - 'memory_scaling': memory_ratio, - 'theoretical_linear': depth_ratio # Expected linear scaling - } - - # Identify bottlenecks - if backward_time_ratio > forward_time_ratio * 1.5: - analysis['primary_bottleneck'] = 'backward_pass' - analysis['bottleneck_reason'] = 'Gradient computation scaling worse than forward pass' - elif memory_ratio > depth_ratio * 1.5: - analysis['primary_bottleneck'] = 'memory' - analysis['bottleneck_reason'] = 'Memory usage scaling faster than linear' - else: - analysis['primary_bottleneck'] = 'balanced' - analysis['bottleneck_reason'] = 'Forward and backward passes scaling proportionally' - - # Backward/Forward ratio analysis - backward_forward_ratios = [ - results[d]['backward_time_ms'] / max(results[d]['forward_time_ms'], 0.001) - for d in depths - ] - avg_backward_forward_ratio = sum(backward_forward_ratios) / len(backward_forward_ratios) - - analysis['efficiency_metrics'] = { - 'avg_backward_forward_ratio': avg_backward_forward_ratio, - 'peak_memory_mb': max(memory_usage), - 'memory_efficiency_trend': 'increasing' if memory_usage[-1] > memory_usage[0] * 2 else 'stable' - } - - return analysis - - def _generate_graph_optimizations(self, results): - """Generate computational graph optimization strategies.""" - strategies = [] - - # Analyze memory growth patterns - peak_memory = max(result['total_memory_mb'] for result in results.values()) - - if peak_memory > 50: # > 50MB memory usage - strategies.append("💾 High memory usage detected in computational graph") - strategies.append("🔧 Strategy: Gradient checkpointing for deep graphs") - strategies.append("🔧 Strategy: In-place operations where mathematically valid") - - # 
Analyze computational efficiency - graph_analysis = self.graph_analysis - if graph_analysis and 'scaling_metrics' in graph_analysis: - backward_scaling = graph_analysis['scaling_metrics']['backward_time_scaling'] - if backward_scaling > 2.0: - strategies.append("🐌 Backward pass scaling poorly with graph depth") - strategies.append("🔧 Strategy: Kernel fusion for backward operations") - strategies.append("🔧 Strategy: Parallel gradient computation") - - # Memory vs computation trade-offs - if graph_analysis and 'efficiency_metrics' in graph_analysis: - backward_forward_ratio = graph_analysis['efficiency_metrics']['avg_backward_forward_ratio'] - if backward_forward_ratio > 3.0: - strategies.append("⚖️ Backward pass significantly slower than forward") - strategies.append("🔧 Strategy: Optimize gradient computation with sparse gradients") - strategies.append("🔧 Strategy: Use mixed precision to reduce memory bandwidth") - - # Production optimization recommendations - strategies.append("🏭 Production graph optimizations:") - strategies.append(" • Graph compilation and optimization (TorchScript, XLA)") - strategies.append(" • Operator fusion to minimize intermediate allocations") - strategies.append(" • Dynamic shape optimization for variable input sizes") - strategies.append(" • Gradient accumulation for large effective batch sizes") - - return strategies - - def analyze_memory_checkpointing_trade_offs(self, checkpoint_frequencies=[1, 2, 4, 8]): - """ - Analyze memory vs computation trade-offs with gradient checkpointing. - - This function is PROVIDED to demonstrate checkpointing analysis. - Students use it to understand memory optimization strategies. 
- """ - print("🔍 GRADIENT CHECKPOINTING ANALYSIS") - print("=" * 45) - - base_graph_depth = 12 - base_memory_per_layer = 10  # MB per layer - base_computation_time = 5  # ms per layer - - checkpointing_results = [] - - for freq in checkpoint_frequencies: - # Calculate memory savings - # Without checkpointing: store all intermediate activations - no_checkpoint_memory = base_graph_depth * base_memory_per_layer - - # With checkpointing: only store every freq-th activation, i.e. ceil(depth/freq) checkpoints - checkpointed_memory = max(base_memory_per_layer, -(-base_graph_depth // freq) * base_memory_per_layer) - memory_savings = no_checkpoint_memory - checkpointed_memory - memory_reduction_pct = (memory_savings / no_checkpoint_memory) * 100 - - # Calculate recomputation overhead - # Need to recompute (freq-1) layers for each checkpoint - recomputation_layers = base_graph_depth * (freq - 1) / freq - recomputation_time = recomputation_layers * base_computation_time - - # Total training time = forward + backward + recomputation - base_training_time = base_graph_depth * base_computation_time * 2  # forward + backward - total_training_time = base_training_time + recomputation_time - time_overhead_pct = (recomputation_time / base_training_time) * 100 - - result = { - 'checkpoint_frequency': freq, - 'memory_mb': checkpointed_memory, - 'memory_reduction_pct': memory_reduction_pct, - 'recomputation_time_ms': recomputation_time, - 'time_overhead_pct': time_overhead_pct, - 'memory_time_ratio': memory_reduction_pct / max(time_overhead_pct, 1) - } - checkpointing_results.append(result) - - print(f"   Checkpoint every {freq} layers:") - print(f"     Memory: {checkpointed_memory:.0f}MB ({memory_reduction_pct:.1f}% reduction)") - print(f"     Time overhead: {time_overhead_pct:.1f}%") - print(f"     Efficiency ratio: {result['memory_time_ratio']:.2f}") - - # Find optimal trade-off - optimal = max(checkpointing_results, key=lambda x: x['memory_time_ratio']) - - print(f"\n📈 Checkpointing Analysis:") - print(f"   Optimal frequency: Every 
{optimal['checkpoint_frequency']} layers") - print(f" Best trade-off: {optimal['memory_reduction_pct']:.1f}% memory reduction") - print(f" Cost: {optimal['time_overhead_pct']:.1f}% time overhead") - - return checkpointing_results - -# %% [markdown] -""" -### 🧪 Unit Test: Autograd Systems Profiling - -This test validates our autograd systems profiler with realistic computational graph scenarios. -""" - -# %% nbgrader={"grade": false, "grade_id": "test-autograd-profiler", "locked": false, "schema_version": 3, "solution": false, "task": false} -def test_autograd_systems_profiler(): - """Test autograd systems profiler with comprehensive scenarios.""" - print("🔬 Unit Test: Autograd Systems Profiler...") - - profiler = AutogradSystemsProfiler() - - # Test computational graph depth analysis - try: - graph_analysis = profiler.profile_computational_graph_depth(max_depth=5, operations_per_level=3) - - # Verify analysis structure - assert 'detailed_results' in graph_analysis, "Should provide detailed results" - assert 'graph_analysis' in graph_analysis, "Should provide graph analysis" - assert 'optimization_strategies' in graph_analysis, "Should provide optimization strategies" - - # Verify detailed results - results = graph_analysis['detailed_results'] - assert len(results) == 5, "Should test all graph depths" - - for depth, result in results.items(): - assert 'forward_time_ms' in result, f"Should include forward timing for depth {depth}" - assert 'backward_time_ms' in result, f"Should include backward timing for depth {depth}" - assert 'total_memory_mb' in result, f"Should analyze memory for depth {depth}" - assert result['forward_time_ms'] >= 0, f"Forward time should be non-negative for depth {depth}" - assert result['backward_time_ms'] >= 0, f"Backward time should be non-negative for depth {depth}" - - print("✅ Computational graph depth analysis test passed") - - # Test memory checkpointing analysis - checkpointing_analysis = 
profiler.analyze_memory_checkpointing_trade_offs(checkpoint_frequencies=[1, 2, 4]) - - assert isinstance(checkpointing_analysis, list), "Should return checkpointing analysis results" - assert len(checkpointing_analysis) == 3, "Should analyze all checkpoint frequencies" - - for result in checkpointing_analysis: - assert 'checkpoint_frequency' in result, "Should include checkpoint frequency" - assert 'memory_reduction_pct' in result, "Should calculate memory reduction" - assert 'time_overhead_pct' in result, "Should calculate time overhead" - assert result['memory_reduction_pct'] >= 0, "Memory reduction should be non-negative" - - print("✅ Memory checkpointing analysis test passed") - - except Exception as e: - print(f"⚠️ Autograd profiling test had issues: {e}") - print("✅ Basic structure test passed (graceful degradation)") - - print("🎯 Autograd Systems Profiler: All tests passed!") - -# Test will run in main block - -if __name__ == "__main__": - print("\n🧪 Running Autograd Module Tests...") - - # Run all unit tests - test_unit_variable_class() - test_unit_add_operation() - test_unit_multiply_operation() - test_unit_subtract_operation() - test_unit_chain_rule() - test_module_neural_network_training() - test_autograd_systems_profiler() - - print("\n✅ All Autograd Module Tests Completed!") - print("Autograd module complete!") - -# %% [markdown] -""" -## 🤔 ML Systems Thinking: Interactive Questions - -Now that you've built automatic differentiation capabilities that enable neural network training, let's connect this foundational work to broader ML systems challenges. These questions help you think critically about how computational graphs scale to production training environments. - -Take time to reflect thoughtfully on each question - your insights will help you understand how the automatic differentiation concepts you've implemented connect to real-world ML systems engineering. 
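The memory-versus-recompute trade-off behind these questions can be sketched with a few lines of arithmetic. This is a rough uniform-layer cost model using the same hypothetical numbers as the checkpointing simulation above (10 MB of activations and 5 ms per layer), not a measurement:

```python
import math

def checkpoint_cost(depth, k, act_mb=10.0, layer_ms=5.0):
    """Rough cost model: checkpoint every k layers of a depth-layer network."""
    stored = math.ceil(depth / k)               # activations kept as checkpoints
    peak_mb = (stored + k) * act_mb             # plus one k-layer segment rebuilt during backward
    recompute_ms = (depth - stored) * layer_ms  # non-checkpointed layers rerun once
    return peak_mb, recompute_ms

# Peak memory bottoms out near k ≈ sqrt(depth) - the classic rule of thumb.
for k in (1, 2, 4, 6, 12):
    peak, extra = checkpoint_cost(12, k)
    print(f"k={k:2d}  peak≈{peak:4.0f} MB  recompute≈{extra:3.0f} ms")
```

With depth 12, the sweep shows peak memory lowest around k = 4 (≈ √12) while recompute overhead grows slowly with k, which is the same shape of trade-off the profiler reported.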
-""" - -# %% [markdown] -""" -### Question 1: Computational Graphs and Memory Management - -**Context**: Your Variable.backward() method accumulates gradients in memory. When you tested complex expressions like (x+y)*(x-y), you saw how computational graphs store intermediate values for gradient computation. In your autograd profiler, you discovered that graph memory scales with network depth. - -**Reflection Question**: Analyze the memory bottlenecks in your Variable implementation when extended to training deep neural networks. How would you modify your current gradient storage and computational graph management to handle 100-layer networks that exceed GPU memory? Design specific optimizations to your autograd system that balance memory efficiency with gradient computation accuracy. - -Think about: gradient checkpointing integration, memory release strategies, graph optimization techniques, and computational trade-offs in your implementation. - -*Target length: 150-300 words* -""" - -# %% nbgrader={"grade": true, "grade_id": "question-1-computational-graphs", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} -""" -YOUR REFLECTION ON COMPUTATIONAL GRAPHS AND MEMORY MANAGEMENT: - -TODO: Replace this text with your thoughtful response about memory-efficient automatic differentiation system design. - -Consider addressing: -- How would you implement gradient checkpointing to optimize memory usage in large models? -- What strategies would you use to balance memory consumption with computational efficiency? -- How would you design graph compilation that maintains flexibility while enabling optimization? -- What role would distributed gradient computation play in your system design? -- How would you handle memory constraints while preserving numerical precision? - -Write a technical analysis connecting your autograd implementations to real memory management challenges. 
- -GRADING RUBRIC (Instructor Use): -- Demonstrates understanding of computational graph memory management (3 points) -- Addresses gradient checkpointing and memory optimization strategies (3 points) -- Shows practical knowledge of graph compilation and optimization techniques (2 points) -- Demonstrates systems thinking about memory vs compute trade-offs (2 points) -- Clear technical reasoning and practical considerations (bonus points for innovative approaches) -""" - -### BEGIN SOLUTION -# Student response area - instructor will replace this section during grading setup -# This is a manually graded question requiring technical analysis of computational graph optimization -# Students should demonstrate understanding of memory management and gradient computation efficiency -### END SOLUTION - -# %% [markdown] -""" -### Question 2: Distributed Training and Gradient Synchronization - -**Context**: Your autograd computes gradients locally on single Variables, but production training systems must coordinate gradient computation across multiple GPUs and nodes. Your implementation of chain rule through multiple operations shows how gradients must flow through complex computational graphs efficiently. - -**Reflection Question**: Extend your autograd implementation to support distributed gradient computation across multiple devices. How would you modify your Variable.backward() method and gradient accumulation strategy to handle gradient synchronization, communication optimization, and maintain numerical stability across distributed training? Design communication patterns that minimize overhead while preserving training convergence. - -Think about: gradient synchronization strategies, communication optimization, distributed computation patterns, and scalability considerations in your autograd design. 
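For a concrete reference point, the synchronous pattern most production systems use (data-parallel all-reduce) can be simulated without any real communication backend. In this NumPy sketch the "all-reduce" is just a sum-and-divide across simulated workers:

```python
import numpy as np

rng = np.random.default_rng(0)
n_workers = 4

# Each worker computes a gradient on its own data shard.
local_grads = [rng.normal(size=(3, 3)) for _ in range(n_workers)]

# All-reduce (simulated): sum across workers, divide by world size.
# Afterwards every worker holds the same averaged gradient, so the
# replicated parameters stay identical across devices after the update.
synced = sum(local_grads) / n_workers

print("max deviation from per-worker mean:",
      float(np.abs(synced - np.mean(local_grads, axis=0)).max()))
```

Real systems overlap this communication with the backward pass and often bucket or compress gradients, but the averaging semantics are the same.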
- -*Target length: 150-300 words* -""" - -# %% nbgrader={"grade": true, "grade_id": "question-2-distributed-training", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} -""" -YOUR REFLECTION ON DISTRIBUTED TRAINING AND GRADIENT SYNCHRONIZATION: - -TODO: Replace this text with your thoughtful response about distributed automatic differentiation system design. - -Consider addressing: -- How would you design gradient synchronization for efficient distributed training? -- What strategies would you use to minimize communication overhead in multi-GPU training? -- How would you implement gradient compression and optimization for distributed systems? -- What role would asynchronous vs synchronous training play in your design? -- How would you ensure numerical stability and convergence in distributed settings? - -Write an architectural analysis connecting your autograd implementation to real distributed training challenges. - -GRADING RUBRIC (Instructor Use): -- Shows understanding of distributed training and gradient synchronization (3 points) -- Designs practical approaches to communication optimization and scalability (3 points) -- Addresses numerical stability and convergence in distributed settings (2 points) -- Demonstrates systems thinking about distributed computation patterns (2 points) -- Clear architectural reasoning with distributed systems insights (bonus points for comprehensive understanding) -""" - -### BEGIN SOLUTION -# Student response area - instructor will replace this section during grading setup -# This is a manually graded question requiring understanding of distributed training systems -# Students should demonstrate knowledge of gradient synchronization and communication optimization -### END SOLUTION - -# %% [markdown] -""" -### Question 3: Advanced Training Optimizations and System Integration - -**Context**: Your autograd provides gradient computation, but production training systems must integrate with advanced 
optimization techniques like mixed precision training, gradient accumulation, and specialized hardware acceleration. Your performance analysis showed the relationship between forward and backward pass timing. - -**Reflection Question**: Design advanced optimizations for your autograd system that integrate automatic mixed precision support, gradient accumulation for large effective batch sizes, and hardware-specific acceleration. How would you modify your Variable class and operation implementations to support these optimizations while maintaining numerical stability and debugging capabilities? Consider the trade-offs between training speed and implementation complexity. - -Think about: mixed precision training integration, gradient accumulation strategies, hardware acceleration patterns, and systems optimization techniques. - -*Target length: 150-300 words* -""" - -# %% nbgrader={"grade": true, "grade_id": "question-3-training-optimizations", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} -""" -YOUR REFLECTION ON ADVANCED TRAINING OPTIMIZATIONS: - -TODO: Replace this text with your thoughtful response about advanced automatic differentiation system design. - -Consider addressing: -- How would you integrate automatic mixed precision training with gradient computation? -- What strategies would you use for gradient accumulation and large batch simulation? -- How would you design hardware integration for specialized accelerators like TPUs? -- What role would advanced optimizations play while maintaining research flexibility? -- How would you ensure numerical stability across different precision and hardware configurations? - -Write a design analysis connecting your autograd implementation to real training optimization challenges. 
- -GRADING RUBRIC (Instructor Use): -- Understands advanced training optimizations and mixed precision challenges (3 points) -- Designs practical approaches to gradient accumulation and hardware integration (3 points) -- Addresses numerical stability and research vs production trade-offs (2 points) -- Shows systems thinking about training optimization and system integration (2 points) -- Clear design reasoning with training optimization insights (bonus points for deep understanding) -""" - -### BEGIN SOLUTION -# Student response area - instructor will replace this section during grading setup -# This is a manually graded question requiring understanding of advanced training optimizations -# Students should demonstrate knowledge of mixed precision, gradient accumulation, and hardware integration -### END SOLUTION - -# %% [markdown] -""" -## 🎯 MODULE SUMMARY: Automatic Differentiation - -Congratulations! You have successfully implemented automatic differentiation: - -### What You have Accomplished -✅ **Computational Graphs**: Dynamic graph construction for gradient computation (Variable class with 200+ lines) -✅ **Backpropagation**: Efficient gradient computation through reverse mode AD (add, multiply, subtract operations) -✅ **Gradient Tracking**: Automatic gradient accumulation and management (chain rule implementation) -✅ **Integration**: Seamless compatibility with Tensor operations (neural network training capability) -✅ **Real Applications**: Neural network training and optimization (linear regression convergence test) - -### Key Learning Outcomes -- **Computational graphs**: How operations are tracked for gradient computation through dynamic graph construction -- **Backpropagation**: Reverse mode automatic differentiation with O(1) overhead per operation -- **Gradient accumulation**: How gradients flow through complex operations via chain rule -- **Memory management**: Efficient handling of gradient storage with 2x memory overhead -- **Integration patterns**: 
How autograd works with neural networks for training - -### Mathematical Foundations Mastered -- **Chain rule**: The mathematical foundation ∂f/∂x = ∂f/∂z · ∂z/∂x for backpropagation -- **Computational graphs**: Representing operations as directed acyclic graphs with forward/backward passes -- **Gradient flow**: How gradients propagate through complex functions automatically -- **Memory efficiency**: O(N) gradient storage scaling with graph depth - -### Professional Skills Developed -- **Graph construction**: Building dynamic computational graphs with variable tracking -- **Gradient computation**: Implementing efficient backpropagation algorithms -- **Memory optimization**: Managing gradient storage with systems performance analysis -- **Integration testing**: Ensuring autograd works with neural network training pipelines - -### Ready for Advanced Applications -Your autograd implementation now enables: -- **Neural network training**: Complete training pipelines with automatic gradient computation -- **Optimization algorithms**: Gradient-based optimization methods with automatic differentiation -- **Custom loss functions**: Implementing specialized loss functions with gradient tracking -- **Advanced architectures**: Training complex neural network models with computational graph optimization - -### Connection to Real ML Systems -Your implementations mirror production systems: -- **PyTorch**: `torch.autograd` provides identical computational graph functionality -- **TensorFlow**: `tf.GradientTape` implements similar automatic differentiation concepts -- **JAX**: `jax.grad` uses similar reverse-mode automatic differentiation -- **Industry Standard**: Every major ML framework uses these exact gradient computation principles - -### Next Steps -1. **Export your code**: `tito module complete 06_autograd` -2. **Test your implementation**: `tito test 06_autograd` -3. **Build training systems**: Combine with optimizers for complete training pipelines -4. 
**Move to Module 07**: Add optimization algorithms with your gradient engine!
-
-**Ready for optimizers?** Your autograd system now provides the foundation for all modern neural network training through automatic gradient computation!
-"""
\ No newline at end of file
diff --git a/modules_old/05_autograd/autograd_dev_enhanced_v2.py b/modules_old/05_autograd/autograd_dev_enhanced_v2.py
deleted file mode 100644
index c2ba7d06..00000000
--- a/modules_old/05_autograd/autograd_dev_enhanced_v2.py
+++ /dev/null
@@ -1,899 +0,0 @@
-# %% [markdown]
-"""
-# Autograd - Automatic Differentiation Engine
-
-Welcome to Autograd! You'll implement the magic that powers deep learning - automatic gradient computation for ANY computational graph!
-
-## 🔗 Building on Previous Learning
-
-**What You Built Before**:
-- Module 02 (Tensor): Data structures for n-dimensional arrays
-- Module 03 (Activations): Non-linear functions for neural networks
-
-**What's Working**: You can build computational graphs with tensors and apply non-linear transformations.
-
-**The Gap**: You have to manually compute derivatives - tedious, error-prone, and doesn't scale to complex networks.
-
-**This Module's Solution**: Build an automatic differentiation engine that tracks operations and computes gradients via chain rule.
-
-**Connection Map**:
-```
-Tensor → Autograd → Optimizers
-(data)   (∂f/∂x)   (x -= α·∂f/∂x)
-```
-
-## Learning Goals
-- Understand computational graphs and gradient flow
-- Master the chain rule for automatic differentiation
-- Build memory-efficient gradient accumulation
-- Connect to PyTorch's autograd system
-- Analyze memory vs compute trade-offs in backpropagation
-
-## Build → Use → Reflect
-1. **Build**: Implement Variable class and gradient computation
-2. **Use**: Test on complex computational graphs
-3.
**Reflect**: Analyze memory usage and scaling behavior - -## Systems Reality Check -💡 **Production Context**: PyTorch's autograd is the foundation of all deep learning -⚡ **Performance Insight**: Gradient storage can use 2-3x more memory than forward pass! -""" - -# %% -#| default_exp autograd -import numpy as np -from typing import List, Optional, Callable, Union - -# %% [markdown] -""" -## Part 1: The Million Dollar Question - -How does PyTorch automatically compute gradients for ANY neural network architecture, no matter how complex? - -The answer: **Computational Graphs + Chain Rule** - -Let's discover how this works by building it ourselves! -""" - -# %% [markdown] -""" -## Part 2: The Variable Class - Tracking Computation History - -Every value in our computational graph needs to remember: -1. Its data -2. Whether it needs gradients -3. How it was created (for backpropagation) -""" - -# %% nbgrader={"grade": false, "grade_id": "variable-class", "solution": true} -#| export -class Variable: - """ - A Variable wraps data and tracks how it was created for gradient computation. - - This is the foundation of automatic differentiation - each Variable knows - its parents and the operation that created it, forming a computational graph. - - TODO: Implement the Variable class with gradient tracking capabilities. - - APPROACH: - 1. Store data as numpy array for efficient computation - 2. Track whether gradients are needed (requires_grad) - 3. 
Store the operation that created this Variable (grad_fn) - - EXAMPLE: - >>> x = Variable(np.array([2.0]), requires_grad=True) - >>> y = x * 3 # y knows it was created by multiplication - >>> print(y.data) - [6.0] - - HINTS: - - Use np.array() to ensure data is numpy array - - Initialize grad to None (computed during backward) - - grad_fn stores the backward function - """ - - def __init__(self, data, requires_grad=False, grad_fn=None): - ### BEGIN SOLUTION - # SYSTEMS INSIGHT: float32 uses 4 bytes per element - # For 1B parameters = 4GB just for data storage - self.data = np.array(data, dtype=np.float32) - self.requires_grad = requires_grad - - # CRITICAL ML PATTERN: Gradients initialized lazily - # Memory saved until backward() is called - self.grad = None - - # AUTOGRAD CORE: Links to parent operation in computation graph - # Enables automatic chain rule application - self.grad_fn = grad_fn - self._backward_hooks = [] # Extension point for advanced features - ### END SOLUTION - - def backward(self, gradient=None): - """ - Compute gradients via backpropagation using chain rule. - - TODO: Implement backward pass through computational graph. - - APPROACH: - 1. Initialize gradient if not provided (for scalar outputs) - 2. Accumulate gradients (for shared parameters) - 3. Call grad_fn to propagate gradients to parents - - HINTS: - - Gradient accumulates: grad = grad + new_gradient - - Only propagate if grad_fn exists - - Check requires_grad before accumulating - """ - ### BEGIN SOLUTION - # OPTIMIZATION: Skip gradient computation when not needed - # Saves O(N) operations where N = parameter count - if not self.requires_grad: - return - - # AUTOGRAD PATTERN: Scalar loss needs starting gradient - # ∂L/∂L = 1 (derivative of loss w.r.t. 
itself) - if gradient is None: - if self.data.size != 1: - raise RuntimeError("Gradient must be specified for non-scalar outputs") - gradient = np.ones_like(self.data) # O(1) memory for scalars - - # CRITICAL ML SYSTEMS PRINCIPLE: Gradient accumulation - # Why: Shared parameters (e.g., embeddings) receive gradients from multiple paths - # Memory: Creates new array to avoid aliasing bugs - if self.grad is None: - self.grad = gradient - else: - self.grad = self.grad + gradient # += would modify original! - - # GRAPH TRAVERSAL: Recursive backpropagation - # Complexity: O(graph_depth), can hit Python recursion limit (~1000) - if self.grad_fn is not None: - self.grad_fn(gradient) - ### END SOLUTION - - def zero_grad(self): - """Reset gradient to None.""" - ### BEGIN SOLUTION - self.grad = None - ### END SOLUTION - -# %% [markdown] -""" -## Part 3: Implementing Operations with Gradient Tracking - -Now we need operations that build the computational graph AND know how to compute gradients. -""" - -# %% nbgrader={"grade": false, "grade_id": "operations", "solution": true} -#| export -class Add: - """Addition operation with gradient computation.""" - - @staticmethod - def forward(a: Variable, b: Variable) -> Variable: - """ - Forward pass: z = a + b - - TODO: Implement forward pass and create backward function. 
- - HINTS: - - Result needs gradients if either input needs gradients - - Backward function gets gradient from child - - Addition gradient: ∂z/∂a = 1, ∂z/∂b = 1 - """ - ### BEGIN SOLUTION - # Track gradients if either input needs them - requires_grad = a.requires_grad or b.requires_grad - - def backward_fn(grad_output): - # Addition gradient: ∂z/∂a = 1, ∂z/∂b = 1 - # Just pass gradients through unchanged - if a.requires_grad: - a.backward(grad_output) - if b.requires_grad: - b.backward(grad_output) - - # Create output Variable with link to backward function - result = Variable( - a.data + b.data, - requires_grad=requires_grad, - grad_fn=backward_fn if requires_grad else None - ) - return result - ### END SOLUTION - -class Multiply: - """Multiplication operation with gradient computation.""" - - @staticmethod - def forward(a: Variable, b: Variable) -> Variable: - """ - Forward pass: z = a * b - - TODO: Implement forward pass with gradient tracking. - - HINTS: - - Multiplication gradient uses chain rule - - ∂z/∂a = b, ∂z/∂b = a - - Save values needed for backward - """ - ### BEGIN SOLUTION - requires_grad = a.requires_grad or b.requires_grad - - def backward_fn(grad_output): - # Chain rule for multiplication: - # ∂(a*b)/∂a = b, ∂(a*b)/∂b = a - if a.requires_grad: - a.backward(grad_output * b.data) # Scale by other operand - if b.requires_grad: - b.backward(grad_output * a.data) # Scale by other operand - - result = Variable( - a.data * b.data, - requires_grad=requires_grad, - grad_fn=backward_fn if requires_grad else None - ) - return result - ### END SOLUTION - -# Add operator overloading for convenience -Variable.__add__ = lambda self, other: Add.forward(self, other) -Variable.__mul__ = lambda self, other: Multiply.forward(self, other) - -# %% [markdown] -""" -### ✅ IMPLEMENTATION CHECKPOINT: Basic autograd complete - -### 🤔 PREDICTION: How much memory does gradient storage use compared to parameters? 
-Write your guess: _____ × parameter memory
-
-### 🔍 SYSTEMS INSIGHT #1: Gradient Memory Analysis
-"""
-
-# %%
-def analyze_gradient_memory():
    """Let's measure the memory overhead of gradients!"""
-    try:
-        # Create a simple computational graph
-        x = Variable(np.random.randn(1000, 1000), requires_grad=True)
-        y = Variable(np.random.randn(1000, 1000), requires_grad=True)
-        # Wrap constants as Variables: __mul__ expects Variable operands
-        two = Variable(np.array(2.0))
-        three = Variable(np.array(3.0))
-        z = x * two + y * three
-        w = z * z  # More complex graph
-
-        # Compute gradients (w is non-scalar, so seed with ones)
-        w.backward(np.ones_like(w.data))
-
-        # Measure memory
-        param_memory = x.data.nbytes + y.data.nbytes
-        grad_memory = (x.grad.nbytes + y.grad.nbytes) if x.grad is not None else 0
-
-        print(f"Parameters: {param_memory / 1024 / 1024:.2f} MB")
-        print(f"Gradients: {grad_memory / 1024 / 1024:.2f} MB")
-        print(f"Ratio: {grad_memory / param_memory:.1f}x parameter memory")
-
-        # Scale to real networks
-        print(f"\nFor a 7B parameter model like LLaMA-7B:")
-        print(f"  Parameters: {7e9 * 4 / 1024**3:.1f} GB (float32)")
-        print(f"  Gradients: {7e9 * 4 / 1024**3:.1f} GB")
-        print(f"  Total training memory: {7e9 * 8 / 1024**3:.1f} GB minimum!")
-
-        # 💡 WHY THIS MATTERS: This is why gradient checkpointing exists!
-        # Trading compute for memory by recomputing activations during backward.
-
-    except Exception as e:
-        print(f"⚠️ Error in analysis: {e}")
-        print("Make sure Variable class and operations are implemented correctly")
-
-analyze_gradient_memory()
-
-# %% nbgrader={"grade": true, "grade_id": "compute-q1", "points": 2}
-"""
-### 📊 Computation Question: Memory Requirements
-
-Your Variable class uses float32 (4 bytes per element). Calculate the memory needed for:
-- A Variable with shape (1000, 1000)
-- Its gradient after backward()
-- Total memory if using Adam optimizer (which stores 2 additional momentum buffers)
-
-Show your calculation and give answers in MB.
- -YOUR ANSWER: -""" -### BEGIN SOLUTION -""" -Variable data: 1000 × 1000 × 4 bytes = 4,000,000 bytes = 4.0 MB -Gradient: Same size as data = 4.0 MB -Adam momentum (m): 4.0 MB -Adam velocity (v): 4.0 MB -Total with Adam: 4.0 + 4.0 + 4.0 + 4.0 = 16.0 MB -""" -### END SOLUTION - -# %% [markdown] -""" -## Part 4: Testing Our Autograd Engine - -Let's verify our implementation works correctly! -""" - -# %% nbgrader={"grade": true, "grade_id": "test-autograd", "locked": true, "points": 10} -def test_unit_autograd(): - """Test automatic differentiation.""" - print("🧪 Testing Autograd Implementation...") - - # Test 1: Simple addition - x = Variable(np.array([2.0]), requires_grad=True) - y = Variable(np.array([3.0]), requires_grad=True) - z = x + y - z.backward() - - assert np.allclose(x.grad, [1.0]), "Addition gradient for x incorrect" - assert np.allclose(y.grad, [1.0]), "Addition gradient for y incorrect" - print("✅ Addition gradients correct") - - # Test 2: Multiplication - x.zero_grad() - y.zero_grad() - z = x * y - z.backward() - - assert np.allclose(x.grad, [3.0]), "Multiplication gradient for x incorrect" - assert np.allclose(y.grad, [2.0]), "Multiplication gradient for y incorrect" - print("✅ Multiplication gradients correct") - - # Test 3: Complex expression - x = Variable(np.array([2.0]), requires_grad=True) - y = Variable(np.array([3.0]), requires_grad=True) - z = x * x + y * y # z = x² + y² - z.backward() - - assert np.allclose(x.grad, [4.0]), "Complex expression gradient for x incorrect" - assert np.allclose(y.grad, [6.0]), "Complex expression gradient for y incorrect" - print("✅ Complex expression gradients correct") - - print("🎉 All autograd tests passed!") - -test_unit_autograd() - -# %% [markdown] -""" -## Part 5: Matrix Operations with Broadcasting - -Real neural networks need matrix operations. Let's add them! 
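The backward rules MatMul will rely on - ∂L/∂A = ∂L/∂C @ Bᵀ and ∂L/∂B = Aᵀ @ ∂L/∂C - can be sanity-checked numerically before implementing them. A standalone NumPy sketch using a finite-difference probe:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 2))
G = rng.normal(size=(3, 2))   # upstream gradient dL/dC for the loss L = sum(G * (A @ B))

# Analytic matmul gradients
grad_A = G @ B.T              # dL/dA = dL/dC @ B^T
grad_B = A.T @ G              # dL/dB = A^T @ dL/dC

# Finite-difference check on a single entry of A
eps = 1e-6
A_pert = A.copy()
A_pert[0, 0] += eps
numeric = (np.sum(G * (A_pert @ B)) - np.sum(G * (A @ B))) / eps

print("analytic:", grad_A[0, 0], " numeric:", numeric)
```

Because L is linear in A, the forward difference agrees with the analytic gradient up to floating-point roundoff - a handy property for testing your implementation.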
-""" - -# %% nbgrader={"grade": false, "grade_id": "matmul", "solution": true} -#| export -class MatMul: - """Matrix multiplication with gradient computation.""" - - @staticmethod - def forward(a: Variable, b: Variable) -> Variable: - """ - Forward pass: C = A @ B - - TODO: Implement matrix multiplication with gradients. - - HINTS: - - Use np.dot or @ operator - - Gradient w.r.t A: grad_output @ B.T - - Gradient w.r.t B: A.T @ grad_output - - Handle shape broadcasting correctly - """ - ### BEGIN SOLUTION - requires_grad = a.requires_grad or b.requires_grad - - def backward_fn(grad_output): - # Matrix calculus: Use transposes for gradient flow - if a.requires_grad: - grad_a = grad_output @ b.data.T # ∂L/∂A = ∂L/∂C @ B^T - a.backward(grad_a) - if b.requires_grad: - grad_b = a.data.T @ grad_output # ∂L/∂B = A^T @ ∂L/∂C - b.backward(grad_b) - - result = Variable( - a.data @ b.data, - requires_grad=requires_grad, - grad_fn=backward_fn if requires_grad else None - ) - return result - ### END SOLUTION - -Variable.__matmul__ = lambda self, other: MatMul.forward(self, other) - -# %% [markdown] -""" -### ✅ IMPLEMENTATION CHECKPOINT: Matrix operations complete - -### 🤔 PREDICTION: How many FLOPs does a matrix multiplication A(m×k) @ B(k×n) require? 
Your answer: _______ operations
-
-### 🔍 SYSTEMS INSIGHT #2: Matrix Multiplication Complexity
-"""
-
-# %%
-def analyze_matmul_complexity():
-    """Measure the computational complexity of matrix multiplication."""
-    import time
-
-    try:
-        sizes = [100, 200, 400, 800]
-        times = []
-        flops = []
-
-        for size in sizes:
-            A = Variable(np.random.randn(size, size), requires_grad=True)
-            B = Variable(np.random.randn(size, size), requires_grad=True)
-
-            # Measure forward pass
-            start = time.perf_counter()
-            C = A @ B
-            forward_time = time.perf_counter() - start
-
-            # Measure backward pass (C is non-scalar, so seed with ones)
-            start = time.perf_counter()
-            C.backward(np.ones_like(C.data))
-            backward_time = time.perf_counter() - start
-
-            times.append((forward_time, backward_time))
-            # FLOPs for matrix multiply: 2 * m * n * k (multiply-add)
-            flops.append(2 * size * size * size)
-
-            print(f"Size {size}×{size}:")
-            print(f"  Forward: {forward_time*1000:.2f}ms")
-            print(f"  Backward: {backward_time*1000:.2f}ms (~2× forward)")
-            print(f"  FLOPs: {flops[-1]/1e6:.1f}M")
-
-        # Analyze scaling
-        time_ratio = times[-1][0] / times[0][0]
-        size_ratio = sizes[-1] / sizes[0]
-        scaling_exp = np.log(time_ratio) / np.log(size_ratio)
-
-        print(f"\nTime scaling: O(N^{scaling_exp:.1f}) - should be ~3 for matmul")
-
-        # 💡 WHY THIS MATTERS: This O(N³) scaling is why attention (O(N²×d))
-        # becomes the bottleneck in transformers with long sequences!
-
-    except Exception as e:
-        print(f"⚠️ Error in analysis: {e}")
-        print("Make sure MatMul is implemented correctly")
-
-analyze_matmul_complexity()
-
-# %% nbgrader={"grade": true, "grade_id": "compute-q2", "points": 2}
-"""
-### 📊 Computation Question: Matrix Multiplication FLOPs
-
-For matrix multiplication C = A @ B where:
-- A has shape (M, K)
-- B has shape (K, N)
-
-The FLOPs (floating-point operations) = 2 × M × N × K (multiply + add for each output)
-
-Calculate the FLOPs for these operations in a neural network forward pass:
-1.
Input (batch=32, features=784) @ Weight (784, 128) = ?
-2. Hidden (batch=32, features=128) @ Weight (128, 10) = ?
-3. Total FLOPs for both operations = ?
-
-Give your answers in MFLOPs (millions of FLOPs).
-
-YOUR ANSWER:
-"""
-### BEGIN SOLUTION
-"""
-1. First layer: 2 × 32 × 128 × 784 = 6,422,528 FLOPs = 6.42 MFLOPs
-2. Second layer: 2 × 32 × 10 × 128 = 81,920 FLOPs = 0.08 MFLOPs
-3. Total: 6.42 + 0.08 = 6.50 MFLOPs
-
-Note: First layer dominates computation due to larger dimensions (784 vs 128).
-"""
-### END SOLUTION
-
-# %% [markdown]
-"""
-## Part 6: Building a Complete Neural Network Layer
-
-Let's use our autograd to build a real neural network layer!
-"""
-
-# %% nbgrader={"grade": false, "grade_id": "linear-layer", "solution": true}
-#| export
-class Linear:
-    """Fully connected layer with automatic differentiation."""
-
-    def __init__(self, in_features: int, out_features: int):
-        """
-        Initialize a linear layer: y = xW + b
-
-        TODO: Initialize weights and bias as Variables with gradients.
-
-        HINTS:
-        - Use Xavier/He initialization for weights
-        - Initialize bias to zeros
-        - Both need requires_grad=True
-        """
-        ### BEGIN SOLUTION
-        # He initialization prevents gradient vanishing/explosion
-        scale = np.sqrt(2.0 / in_features)
-        # Store W as (in_features, out_features) so the forward pass is x @ W
-        # (Variable has no .T attribute, so we avoid needing a transpose)
-        self.weight = Variable(
-            np.random.randn(in_features, out_features) * scale,
-            requires_grad=True
-        )
-        self.bias = Variable(
-            np.zeros((out_features,)),
-            requires_grad=True
-        )
-        ### END SOLUTION
-
-    def forward(self, x: Variable) -> Variable:
-        """Forward pass through the layer."""
-        ### BEGIN SOLUTION
-        output = x @ self.weight + self.bias  # y = xW + b
-        return output
-        ### END SOLUTION
-
-    def parameters(self) -> List[Variable]:
-        """Return all parameters."""
-        ### BEGIN SOLUTION
-        return [self.weight, self.bias]
-        ### END SOLUTION
-
-# %% nbgrader={"grade": true, "grade_id": "compute-q3", "points": 2}
-"""
-### 📊 Computation Question: Parameter Counting
-
-You just implemented a Linear layer.
For a 3-layer MLP with architecture: -- Input: 784 features -- Hidden 1: 256 neurons -- Hidden 2: 128 neurons -- Output: 10 classes - -Calculate: -1. Parameters in each layer (weights + biases) -2. Total parameters in the network -3. Memory in MB (float32 = 4 bytes per parameter) - -Show your work. - -YOUR ANSWER: -""" -### BEGIN SOLUTION -""" -Layer 1 (784 → 256): - Weights: 784 × 256 = 200,704 - Bias: 256 - Total: 200,960 - -Layer 2 (256 → 128): - Weights: 256 × 128 = 32,768 - Bias: 128 - Total: 32,896 - -Layer 3 (128 → 10): - Weights: 128 × 10 = 1,280 - Bias: 10 - Total: 1,290 - -Network total: 200,960 + 32,896 + 1,290 = 235,146 parameters -Memory: 235,146 × 4 bytes = 940,584 bytes = 0.94 MB -""" -### END SOLUTION - -# %% [markdown] -""" -### ✅ IMPLEMENTATION CHECKPOINT: Neural network layer complete - -### 🤔 PREDICTION: For a layer with 1000 inputs and 1000 outputs, how many parameters? -Your answer: _______ parameters - -### 🔍 SYSTEMS INSIGHT #3: Parameter Counting and Memory -""" - -# %% -def analyze_layer_parameters(): - """Count parameters and analyze memory usage in neural network layers.""" - try: - # Create layers of different sizes - sizes = [(784, 128), (128, 64), (64, 10)] # Like a small MNIST network - - total_params = 0 - total_memory = 0 - - print("Layer Parameter Analysis:") - print("-" * 50) - - for in_feat, out_feat in sizes: - layer = Linear(in_feat, out_feat) - - # Count parameters - weight_params = layer.weight.data.size - bias_params = layer.bias.data.size - layer_params = weight_params + bias_params - - # Calculate memory - layer_memory = layer_params * 4 # float32 - - total_params += layer_params - total_memory += layer_memory - - print(f"Layer {in_feat}→{out_feat}:") - print(f" Weights: {weight_params:,} ({weight_params/1000:.1f}K)") - print(f" Bias: {bias_params:,}") - print(f" Total: {layer_params:,} params = {layer_memory/1024:.1f}KB") - - print("-" * 50) - print(f"Network Total: {total_params:,} parameters") - print(f"Memory 
(float32): {total_memory/1024:.1f}KB") - print(f"With gradients: {total_memory*2/1024:.1f}KB") - print(f"With Adam optimizer: {total_memory*4/1024:.1f}KB") - - # Scale up - print(f"\nScaling to GPT-3 (175B params):") - gpt3_memory = 175e9 * 4 # float32 - print(f" Parameters only: {gpt3_memory/1024**4:.1f}TB") - print(f" With Adam: {gpt3_memory*4/1024**4:.1f}TB!") - - # 💡 WHY THIS MATTERS: This is why large models use: - # - Mixed precision (float16/bfloat16) - # - Gradient checkpointing - # - Model parallelism across GPUs - - except Exception as e: - print(f"⚠️ Error: {e}") - -analyze_layer_parameters() - -# %% nbgrader={"grade": true, "grade_id": "compute-q4", "points": 2} -""" -### 📊 Computation Question: Gradient Accumulation - -Consider this scenario: A shared weight matrix W (shape 100×100) is used in 3 different places -in your network. During backward pass: -- Path 1 contributes gradient G1 with all elements = 0.1 -- Path 2 contributes gradient G2 with all elements = 0.2 -- Path 3 contributes gradient G3 with all elements = 0.3 - -Because of gradient accumulation in your backward() method: - -1. What will be the final value of W.grad[0,0] (top-left element)? -2. If we OVERWROTE instead of accumulated, what would W.grad[0,0] be? -3. How many total gradient additions occur for the entire weight matrix? - -YOUR ANSWER: -""" -### BEGIN SOLUTION -""" -1. W.grad[0,0] = 0.1 + 0.2 + 0.3 = 0.6 (accumulated from all paths) - -2. If overwriting: W.grad[0,0] = 0.3 (only the last gradient) - -3. Total additions: 100 × 100 × 3 = 30,000 gradient additions - (each of 10,000 elements gets 3 gradient contributions) - -This shows why accumulation is critical for shared parameters! 
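The accumulation arithmetic above can be checked directly with plain NumPy, independent of the autograd machinery (a standalone sketch; the variable names are illustrative):

```python
import numpy as np

# Three paths contribute gradients to one shared 100x100 weight matrix.
shape = (100, 100)
paths = [np.full(shape, 0.1), np.full(shape, 0.2), np.full(shape, 0.3)]

grad_accumulated = np.zeros(shape)
for g in paths:
    grad_accumulated += g      # what an accumulating backward() does

grad_overwritten = paths[-1]   # what naive assignment would leave behind

print(round(float(grad_accumulated[0, 0]), 6))  # 0.6
print(float(grad_overwritten[0, 0]))            # 0.3
```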
-""" -### END SOLUTION - -# %% [markdown] -""" -## Part 7: Complete Test Suite -""" - -# %% -def test_unit_all(): - """Run all unit tests for the autograd module.""" - print("🧪 Running Complete Autograd Test Suite...") - print("=" * 50) - - # Test basic autograd - test_unit_autograd() - print() - - # Test matrix multiplication - print("🧪 Testing Matrix Multiplication...") - A = Variable(np.array([[1, 2], [3, 4]], dtype=np.float32), requires_grad=True) - B = Variable(np.array([[5, 6], [7, 8]], dtype=np.float32), requires_grad=True) - C = A @ B - - C_sum = Variable(np.array([C.data.sum()]), requires_grad=True) - C_sum.backward() - - expected_grad_A = B.data.sum(axis=0, keepdims=True).T @ np.ones((1, 2)) - print(f"✅ MatMul forward: {np.allclose(C.data, [[19, 22], [43, 50]])}") - print(f"✅ MatMul gradients computed") - print() - - # Test neural network layer - print("🧪 Testing Neural Network Layer...") - layer = Linear(10, 5) - x = Variable(np.random.randn(3, 10), requires_grad=True) - y = layer.forward(x) - - assert y.data.shape == (3, 5), "Output shape incorrect" - print(f"✅ Linear layer forward pass: shape {y.data.shape}") - - y_sum = Variable(np.array([y.data.sum()]), requires_grad=True) - y_sum.backward() - - assert layer.weight.grad is not None, "Weight gradients not computed" - assert layer.bias.grad is not None, "Bias gradients not computed" - print("✅ Linear layer gradients computed") - - print("=" * 50) - print("🎉 All tests passed! Autograd engine working correctly!") - -# Main execution -if __name__ == "__main__": - test_unit_all() - -# %% nbgrader={"grade": true, "grade_id": "compute-q5", "points": 2} -""" -### 📊 Computation Question: Batch Size vs Memory - -You have a model with 1M parameters training with batch size 64. The memory usage is: -- Model parameters: 4 MB -- Gradients: 4 MB -- Adam optimizer state: 8 MB -- Activations (batch-dependent): 32 MB - -Answer: -1. What is the total memory usage? -2. 
If you double the batch size to 128, what will the new TOTAL memory be? -3. What is the maximum batch size if you have 100 MB available? - -Show calculations. - -YOUR ANSWER: -""" -### BEGIN SOLUTION -""" -1. Total memory = 4 + 4 + 8 + 32 = 48 MB - -2. With batch size 128: - - Fixed (params + grads + optimizer): 4 + 4 + 8 = 16 MB (unchanged) - - Activations: 32 MB × (128/64) = 64 MB (scales linearly) - - New total: 16 + 64 = 80 MB - -3. Maximum batch size with 100 MB: - - Fixed costs: 16 MB - - Available for activations: 100 - 16 = 84 MB - - Batch size: 64 × (84/32) = 168 (maximum) - -Key insight: Only activations scale with batch size, not parameters/gradients! -""" -### END SOLUTION - -# %% [markdown] -""" -## 🤔 ML Systems Thinking: Synthesis Questions - -Now that you've built and measured an autograd system, consider these broader questions: -""" - -# %% nbgrader={"grade": false, "grade_id": "synthesis-q1", "solution": true, "points": 5} -""" -### Synthesis Question 1: Memory vs Compute Trade-offs - -You discovered that gradient computation requires significant memory (1× parameters for -gradients, 3× more for optimizers). You also measured that backward passes take ~2× -the time of forward passes. - -Design a training strategy for a model that requires 4× your available memory. Your -strategy should address: -- How to fit the model in memory -- What you sacrifice (time, accuracy, or complexity) -- When this trade-off is worthwhile - -YOUR ANSWER (5-7 sentences): -""" -### BEGIN SOLUTION -""" -Strategy: Gradient checkpointing with micro-batching. - -1. Divide model into 4 checkpoint segments, storing only segment boundaries -2. During backward, recompute intermediate activations for each segment -3. 
Process mini-batches in 4 micro-batches, accumulating gradients - -Trade-offs: -- Time: ~30% slower due to recomputation -- Memory: 4× reduction achieved -- Complexity: More complex implementation - -This is worthwhile when model quality is critical but hardware is limited, -such as research environments or edge deployment. The time cost is acceptable -for better model performance that couldn't otherwise be achieved. -""" -### END SOLUTION - -# %% nbgrader={"grade": false, "grade_id": "synthesis-q2", "solution": true, "points": 5} -""" -### Synthesis Question 2: Scaling Bottlenecks - -Based on your measurements: -- Matrix operations scale O(N³) -- Gradient storage scales O(N) with parameters -- Graph traversal scales O(depth) with network depth - -For each scaling pattern, describe: -1. When it becomes the primary bottleneck -2. A real-world scenario where this limits training -3. An engineering solution to mitigate it - -YOUR ANSWER (6-8 sentences): -""" -### BEGIN SOLUTION -""" -1. O(N³) matrix operations: - - Bottleneck: Large hidden dimensions (>10K) - - Scenario: Language models with large embeddings - - Solution: Block-sparse matrices, reducing N³ to N²×log(N) - -2. O(N) gradient storage: - - Bottleneck: Models with >10B parameters - - Scenario: Training exceeds GPU memory - - Solution: Gradient sharding across devices, ZeRO optimization - -3. O(depth) graph traversal: - - Bottleneck: Networks >1000 layers deep - - Scenario: Very deep ResNets or Transformers - - Solution: Gradient checkpointing at strategic layers, reversible layers - -The key insight: Different architectures hit different bottlenecks, requiring -architecture-specific optimization strategies. -""" -### END SOLUTION - -# %% [markdown] -""" -## 🎯 MODULE SUMMARY: Autograd - -Congratulations! 
You've successfully implemented automatic differentiation from scratch:
-
-### What You've Accomplished
-✅ **200+ lines of autograd code**: Complete automatic differentiation engine
-✅ **Variable class**: Gradient tracking with computational graph construction
-✅ **Core operations**: Add, Multiply, and MatMul, plus a Linear layer built on them
-✅ **Memory profiling**: Discovered gradients use 1× parameter memory
-✅ **Performance analysis**: Measured O(N³) scaling for matrix operations
-
-### Key Learning Outcomes
-- **Chain rule mastery**: Backpropagation through arbitrary computational graphs
-- **Memory-compute trade-offs**: Why gradient checkpointing exists
-- **Systems insight**: Gradient accumulation vs storage patterns
-- **Production patterns**: How PyTorch's autograd actually works
-
-### Mathematical Foundations Mastered
-- **Chain rule**: ∂L/∂x = ∂L/∂y · ∂y/∂x
-- **Matrix calculus**: Gradients for matrix multiplication
-- **Computational complexity**: O(N³) for matmul, O(N) for element-wise ops
-
-### Professional Skills Developed
-- **Automatic differentiation**: Core of all modern deep learning
-- **Memory profiling**: Quantifying memory usage in training
-- **Performance analysis**: Understanding scaling bottlenecks
-
-### Ready for Advanced Applications
-Your autograd implementation now enables:
-- **Immediate**: Training neural networks with gradient descent
-- **Next Module**: Building optimizers (SGD, Adam) using your gradients
-- **Real-world**: Understanding PyTorch's autograd internals
-
-### Connection to Real ML Systems
-Your implementation mirrors production systems:
-- **PyTorch**: torch.autograd.Variable and Function classes
-- **TensorFlow**: tf.GradientTape API
-- **JAX**: grad() transformation
-
-### Next Steps
-1. **Export your module**: `tito module complete 06_autograd`
-2. **Validate integration**: `tito test --module autograd`
-3. **Explore advanced features**: Higher-order gradients, custom operations
-4.
**Ready for Module 07**: Build optimizers using your autograd engine! - -**You've built the foundation of deep learning**: Every neural network trained today relies on automatic differentiation. Your implementation gives you deep understanding of how gradients flow through complex architectures! -""" \ No newline at end of file diff --git a/modules_old/05_autograd/autograd_visual_example.md b/modules_old/05_autograd/autograd_visual_example.md deleted file mode 100644 index 5e9f3fc3..00000000 --- a/modules_old/05_autograd/autograd_visual_example.md +++ /dev/null @@ -1,146 +0,0 @@ -# Example: Visual Autograd Module Opening - -This shows how the autograd module would start with visual explanations: - -```python -# %% [markdown] -""" -# Autograd - Automatic Differentiation Engine - -## 🎯 What We're Building Today - -We're creating the "magic" that powers all modern deep learning - automatic gradient computation: - -``` - Your Neural Network Code: What Autograd Does Behind the Scenes: - ───────────────────────── ──────────────────────────────────── - - x = Variable(data) Creates computation graph node - y = x * 2 Tracks operation: Mul(x, 2) - z = y + 3 Tracks operation: Add(y, 3) - loss = z.mean() Tracks operation: Mean(z) - loss.backward() Computes ALL gradients automatically! 
- - ∂loss/∂x computed via chain rule -``` - -## 📊 The Computational Graph - -When you write `z = x * y + b`, autograd builds this graph: - -``` -Forward Pass (Build Graph): - x ────┐ - ├──[×]──> x*y ──┐ - y ────┘ ├──[+]──> z = x*y + b - b ────┘ - -Backward Pass (Compute Gradients): - ∂L/∂x ←──┐ - ├──[×]←── ∂L/∂(x*y) ←──┐ - ∂L/∂y ←──┘ ↑ ├──[+]←── ∂L/∂z - │ ∂L/∂b ←┘ - Chain Rule Applied -``` - -## 💾 Memory Architecture - -Understanding memory is crucial for training large models: - -``` -┌─────────────────────────────────────────────────────────┐ -│ Training Memory Layout │ -├─────────────────────────────────────────────────────────┤ -│ │ -│ Forward Pass Memory: │ -│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ -│ │ Parameters │ │ Activations │ │ Intermediate │ │ -│ │ (W,b) │ │ (x,y,z) │ │ Results │ │ -│ │ 100MB │ │ 300MB │ │ 200MB │ │ -│ └──────────────┘ └──────────────┘ └──────────────┘ │ -│ │ -│ Backward Pass Additional Memory: │ -│ ┌──────────────┐ ┌──────────────┐ │ -│ │ Gradients │ │ Graph │ │ -│ │ (∂L/∂W) │ │ Storage │ │ -│ │ 100MB │ │ 50MB │ │ -│ └──────────────┘ └──────────────┘ │ -│ │ -│ Total: 750MB (1.25× more than forward-only) │ -└─────────────────────────────────────────────────────────┘ -``` - -## 🔄 The Chain Rule in Action - -Let's trace through a simple example step by step: - -``` -Given: f(x) = (x + 2) * 3 -Let x = 5 - -Forward Pass: - x = 5 - ↓ - y = x + 2 = 7 (save x=5 for backward) - ↓ - z = y * 3 = 21 (save y=7 for backward) - -Backward Pass (z.backward()): - ∂z/∂z = 1 (start with gradient 1) - ↓ - ∂z/∂y = 3 (derivative of y*3 w.r.t y) - ↓ - ∂z/∂x = ∂z/∂y * ∂y/∂x = 3 * 1 = 3 - -Result: x.grad = 3 -``` - -## 🚀 Why This Matters - -Before autograd (pre-2015): -- **Manual gradient derivation**: Days of calculus for complex models -- **Error-prone implementation**: One sign error breaks everything -- **Limited innovation**: Only experts could create new architectures - -After autograd (modern era): -- **Automatic differentiation**: 
Gradients for ANY architecture -- **Rapid prototyping**: Try new ideas in minutes, not weeks -- **Democratized ML**: Focus on architecture, not calculus - -## 📈 Real-World Impact - -``` -Training Memory Requirements (GPT-3 Scale): - -Without Autograd Optimizations: With Modern Autograd: -┌────────────────────────┐ ┌────────────────────────┐ -│ Parameters: 700 GB │ │ Parameters: 700 GB │ -│ Gradients: 700 GB │ │ Gradients: 700 GB │ -│ Activations: 2100 GB │ │ Checkpointing: 300 GB │ -│ Optimizer: 1400 GB │ │ Optimizer: 1400 GB │ -├────────────────────────┤ ├────────────────────────┤ -│ Total: 4900 GB │ │ Total: 2700 GB │ -└────────────────────────┘ └────────────────────────┘ - - 45% memory saved via - gradient checkpointing! -``` - -Now let's build this from scratch and truly understand how it works! -""" -``` - -## Key Elements That Make This Readable: - -1. **Visual Comparisons**: Side-by-side "Your Code" vs "What Happens" -2. **ASCII Diagrams**: Clear computational graphs with arrows -3. **Memory Layouts**: Visual representation of memory usage -4. **Step-by-Step Traces**: Following data through forward/backward -5. **Real-World Context**: Showing GPT-3 scale implications -6. 
**Before/After Comparisons**: Why autograd changed everything - -This approach ensures students can: -- **Read and understand** without coding -- **See the big picture** before implementation details -- **Grasp systems implications** through visual memory layouts -- **Connect to real-world** impact and scale \ No newline at end of file diff --git a/modules_old/05_autograd/module.yaml b/modules_old/05_autograd/module.yaml deleted file mode 100644 index ebc5e06b..00000000 --- a/modules_old/05_autograd/module.yaml +++ /dev/null @@ -1,21 +0,0 @@ -components: -- Variable -- backward -- gradient_computation -dependencies: - enables: - - optimizers - - training - prerequisites: - - tensor - - activations -description: Automatic differentiation engine for gradient computation -difficulty: "\u2B50\u2B50\u2B50\u2B50" -exports_to: tinytorch.core.autograd -files: - dev_file: autograd_dev.py - readme: README.md - test_file: tests/test_autograd.py -name: autograd -time_estimate: 8-10 hours -title: Autograd diff --git a/modules_old/05_autograd/test_decorator.py b/modules_old/05_autograd/test_decorator.py deleted file mode 100644 index 65c526e5..00000000 --- a/modules_old/05_autograd/test_decorator.py +++ /dev/null @@ -1,176 +0,0 @@ -#!/usr/bin/env python3 -""" -Simple test of the decorator-based autograd implementation -""" -import sys -import os -import numpy as np - -# Import the pure Tensor class from Module 01 -sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', '01_tensor')) -from tensor_dev import Tensor - -def add_autograd(cls): - """ - Decorator that adds gradient tracking to existing Tensor class. 
- """ - # Store original methods from pure Tensor class - original_init = cls.__init__ - original_add = cls.__add__ - original_mul = cls.__mul__ - original_sub = cls.__sub__ if hasattr(cls, '__sub__') else None - - def new_init(self, data, dtype=None, requires_grad=False): - """Enhanced constructor with gradient tracking support.""" - # Call original constructor to preserve all existing functionality - original_init(self, data, dtype) - - # Add gradient tracking attributes - self.requires_grad = requires_grad - self.grad = None - self.grad_fn = None - - def new_add(self, other): - """Enhanced addition with gradient tracking.""" - # Forward pass: use original pure addition - result = original_add(self, other) - - # Add gradient tracking if either operand requires gradients - if self.requires_grad or (hasattr(other, 'requires_grad') and other.requires_grad): - result.requires_grad = True - result.grad = None - - # Define backward function for gradient computation - def grad_fn(gradient): - """Apply addition backward pass: d(a+b)/da = 1, d(a+b)/db = 1""" - if self.requires_grad: - self.backward(gradient) - if hasattr(other, 'requires_grad') and other.requires_grad: - other.backward(gradient) - - result.grad_fn = grad_fn - - return result - - def new_mul(self, other): - """Enhanced multiplication with gradient tracking.""" - # Forward pass: use original pure multiplication - result = original_mul(self, other) - - # Add gradient tracking if either operand requires gradients - if self.requires_grad or (hasattr(other, 'requires_grad') and other.requires_grad): - result.requires_grad = True - result.grad = None - - # Define backward function using product rule - def grad_fn(gradient): - """Apply multiplication backward pass: d(a*b)/da = b, d(a*b)/db = a""" - if self.requires_grad: - # Get gradient data, handle both Tensor and scalar cases - if hasattr(other, 'data'): - other_data = other.data - else: - other_data = other - self_grad = gradient * other_data - 
self.backward(self_grad) - - if hasattr(other, 'requires_grad') and other.requires_grad: - # Get gradient data for self - self_grad = gradient * self.data - other.backward(self_grad) - - result.grad_fn = grad_fn - - return result - - def backward(self, gradient=None): - """ - New method: Compute gradients via backpropagation. - """ - if not self.requires_grad: - raise RuntimeError("Tensor doesn't require gradients") - - # Default gradient for scalar outputs - if gradient is None: - if hasattr(self, 'data') and hasattr(self.data, 'size'): - if self.data.size == 1: - gradient = np.ones_like(self.data) - else: - raise RuntimeError("gradient must be specified for non-scalar tensors") - else: - gradient = np.ones_like(self.data) - - # Accumulate gradients - if self.grad is None: - self.grad = gradient - else: - self.grad = self.grad + gradient - - # Propagate gradients backwards through computation graph - if self.grad_fn is not None: - self.grad_fn(gradient) - - # Replace methods on the class - cls.__init__ = new_init - cls.__add__ = new_add - cls.__mul__ = new_mul - cls.backward = backward - - return cls - -def test_decorator(): - """Test the decorator-based autograd implementation""" - print("🧪 Testing Decorator-Based Autograd") - print("=" * 40) - - # Apply decorator to enhance the pure Tensor class - EnhancedTensor = add_autograd(Tensor) - - # Test 1: Backward compatibility (no gradients) - print("Test 1: Backward Compatibility") - x = EnhancedTensor([1.0, 2.0]) - y = EnhancedTensor([3.0, 4.0]) - z = x + y - expected = np.array([4.0, 6.0]) - actual = z.data if hasattr(z, 'data') else z._data - assert np.allclose(actual, expected), f"Expected {expected}, got {actual}" - print("✅ Pure tensor behavior preserved") - - # Test 2: Gradient tracking - print("\nTest 2: Gradient Tracking") - a = EnhancedTensor([2.0], requires_grad=True) - b = EnhancedTensor([3.0], requires_grad=True) - c = a * b # c = 6.0 - - # Backward pass - c.backward() - - # Check gradients: dc/da = b = 
3, dc/db = a = 2 - assert np.allclose(a.grad, [3.0]), f"Expected a.grad=[3.0], got {a.grad}" - assert np.allclose(b.grad, [2.0]), f"Expected b.grad=[2.0], got {b.grad}" - print("✅ Gradient computation works") - - # Test 3: Complex expression - print("\nTest 3: Complex Expression") - p = EnhancedTensor([4.0], requires_grad=True) - q = EnhancedTensor([2.0], requires_grad=True) - - # f(p,q) = (p + q) * p = p² + pq - sum_term = p + q # p + q = 6 - result = sum_term * p # (p + q) * p = 6 * 4 = 24 - - result.backward() - - # Expected gradients: df/dp = 2p + q = 8 + 2 = 10, df/dq = p = 4 - expected_p_grad = 2 * 4.0 + 2.0 # 10.0 - expected_q_grad = 4.0 # 4.0 - - assert np.allclose(p.grad, [expected_p_grad]), f"Expected p.grad=[{expected_p_grad}], got {p.grad}" - assert np.allclose(q.grad, [expected_q_grad]), f"Expected q.grad=[{expected_q_grad}], got {q.grad}" - print("✅ Complex expression gradients work") - - print("\n🎉 ALL TESTS PASSED!") - print("🚀 Decorator-based autograd implementation successful!") - -if __name__ == "__main__": - test_decorator() \ No newline at end of file diff --git a/modules_old/06_optimizers/README.md b/modules_old/06_optimizers/README.md deleted file mode 100644 index 6bcd4ab1..00000000 --- a/modules_old/06_optimizers/README.md +++ /dev/null @@ -1,242 +0,0 @@ -# 🔥 Module: Optimizers - -## 📊 Module Info -- **Difficulty**: ⭐⭐⭐⭐ Expert -- **Time Estimate**: 6-8 hours -- **Prerequisites**: Tensor, Autograd modules -- **Next Steps**: Training, MLOps modules - -Build intelligent optimization algorithms that enable effective neural network training. This module implements the learning algorithms that power modern AI—from basic gradient descent to advanced adaptive methods that make training large-scale models possible. 
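The update rule that every algorithm in this module refines can be sketched in a few lines of plain NumPy, minimizing f(θ) = θ² whose gradient is 2θ (an illustrative standalone sketch, independent of TinyTorch's API):

```python
import numpy as np

# Minimal gradient descent on f(theta) = theta^2.
theta = np.array(10.0)
learning_rate = 0.1

for step in range(100):
    grad = 2 * theta                       # analytic gradient of theta^2
    theta = theta - learning_rate * grad   # theta <- theta - alpha * grad

print(float(theta))  # very close to 0, the minimum of theta^2
```

Momentum and Adam keep this same core update but reshape the `grad` term with running statistics.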
- -## 🎯 Learning Objectives - -By the end of this module, you will be able to: - -- **Master gradient-based optimization theory**: Understand how gradients guide parameter updates and the mathematical foundations of learning -- **Implement core optimization algorithms**: Build SGD, momentum, and Adam optimizers from mathematical first principles -- **Design learning rate strategies**: Create scheduling systems that balance convergence speed with training stability -- **Apply optimization in practice**: Use optimizers effectively in complete training workflows with real neural networks -- **Analyze optimization dynamics**: Compare algorithm behavior, convergence patterns, and performance characteristics - -## 🧠 Build → Use → Optimize - -This module follows TinyTorch's **Build → Use → Optimize** framework: - -1. **Build**: Implement gradient descent, SGD with momentum, Adam optimizer, and learning rate scheduling from mathematical foundations -2. **Use**: Apply optimization algorithms to train neural networks and solve real optimization problems -3. 
**Optimize**: Analyze convergence behavior, compare algorithm performance, and tune hyperparameters for optimal training - -## 📚 What You'll Build - -### Core Optimization Algorithms -```python -# Gradient descent foundation -def gradient_descent_step(parameter, learning_rate): - parameter.data = parameter.data - learning_rate * parameter.grad.data - -# SGD with momentum for accelerated convergence -sgd = SGD(parameters=[w1, w2, bias], learning_rate=0.01, momentum=0.9) -sgd.zero_grad() # Clear previous gradients -loss.backward() # Compute new gradients -sgd.step() # Update parameters - -# Adam optimizer with adaptive learning rates -adam = Adam(parameters=[w1, w2, bias], learning_rate=0.001, beta1=0.9, beta2=0.999) -adam.zero_grad() -loss.backward() -adam.step() # Adaptive updates per parameter -``` - -### Learning Rate Scheduling Systems -```python -# Strategic learning rate adjustment -scheduler = StepLR(optimizer, step_size=10, gamma=0.1) - -# Training loop with scheduling -for epoch in range(num_epochs): - for batch in dataloader: - optimizer.zero_grad() - loss = criterion(model(batch.inputs), batch.targets) - loss.backward() - optimizer.step() - - scheduler.step() # Adjust learning rate each epoch - print(f"Epoch {epoch}, LR: {scheduler.get_last_lr()}") -``` - -### Complete Training Integration -```python -# Modern training workflow -model = Sequential([Dense(784, 128), ReLU(), Dense(128, 10)]) -optimizer = Adam(model.parameters(), learning_rate=0.001) -scheduler = StepLR(optimizer, step_size=20, gamma=0.5) - -# Training loop with optimization -for epoch in range(num_epochs): - for batch_inputs, batch_targets in dataloader: - # Forward pass - predictions = model(batch_inputs) - loss = criterion(predictions, batch_targets) - - # Optimization step - optimizer.zero_grad() # Clear gradients - loss.backward() # Compute gradients - optimizer.step() # Update parameters - - scheduler.step() # Adjust learning rate -``` - -### Optimization Algorithm Implementations -- 
**Gradient Descent**: Basic parameter update rule using gradients -- **SGD with Momentum**: Velocity accumulation for smoother convergence -- **Adam Optimizer**: Adaptive learning rates with bias correction -- **Learning Rate Scheduling**: Strategic adjustment during training - -## 🚀 Getting Started - -### Prerequisites -Ensure you understand the mathematical foundations: - -```bash -# Activate TinyTorch environment -source bin/activate-tinytorch.sh - -# Verify prerequisite modules -tito test --module tensor -tito test --module autograd -``` - -### Development Workflow -1. **Open the development file**: `modules/source/08_optimizers/optimizers_dev.py` -2. **Implement gradient descent**: Start with basic parameter update mechanics -3. **Build SGD with momentum**: Add velocity accumulation for acceleration -4. **Create Adam optimizer**: Implement adaptive learning rates with moment estimation -5. **Add learning rate scheduling**: Build strategic learning rate adjustment systems -6. **Export and verify**: `tito export --module optimizers && tito test --module optimizers` - -## 🧪 Testing Your Implementation - -### Comprehensive Test Suite -Run the full test suite to verify optimization algorithm correctness: - -```bash -# TinyTorch CLI (recommended) -tito test --module optimizers - -# Direct pytest execution -python -m pytest tests/ -k optimizers -v -``` - -### Test Coverage Areas -- ✅ **Algorithm Implementation**: Verify SGD, momentum, and Adam compute correct parameter updates -- ✅ **Mathematical Correctness**: Test against analytical solutions for convex optimization -- ✅ **State Management**: Ensure proper momentum and moment estimation tracking -- ✅ **Learning Rate Scheduling**: Verify step decay and scheduling functionality -- ✅ **Training Integration**: Test optimizers in complete neural network training workflows - -### Inline Testing & Convergence Analysis -The module includes comprehensive mathematical validation and convergence visualization: -```python -# 
Example inline test output -🔬 Unit Test: SGD with momentum... -✅ Parameter updates follow momentum equations -✅ Velocity accumulation works correctly -✅ Convergence achieved on test function -📈 Progress: SGD with Momentum ✓ - -# Optimization analysis -🔬 Unit Test: Adam optimizer... -✅ First moment estimation (m_t) computed correctly -✅ Second moment estimation (v_t) computed correctly -✅ Bias correction applied properly -✅ Adaptive learning rates working -📈 Progress: Adam Optimizer ✓ -``` - -### Manual Testing Examples -```python -from optimizers_dev import SGD, Adam, StepLR -from autograd_dev import Variable - -# Test SGD on simple quadratic function -x = Variable(10.0, requires_grad=True) -sgd = SGD([x], learning_rate=0.1, momentum=0.9) - -for step in range(100): - sgd.zero_grad() - loss = x**2 # Minimize f(x) = x² - loss.backward() - sgd.step() - if step % 10 == 0: - print(f"Step {step}: x = {x.data:.4f}, loss = {loss.data:.4f}") - -# Test Adam convergence -x = Variable([2.0, -3.0], requires_grad=True) -adam = Adam([x], learning_rate=0.01) - -for step in range(50): - adam.zero_grad() - loss = (x[0]**2 + x[1]**2).sum() # Minimize ||x||² - loss.backward() - adam.step() - if step % 10 == 0: - print(f"Step {step}: x = {x.data}, loss = {loss.data:.6f}") -``` - -## 🎯 Key Concepts - -### Real-World Applications -- **Large Language Models**: GPT, BERT training relies on Adam optimization for stable convergence -- **Computer Vision**: ResNet, Vision Transformer training uses SGD with momentum for best final performance -- **Recommendation Systems**: Online learning systems use adaptive optimizers for continuous model updates -- **Reinforcement Learning**: Policy gradient methods depend on careful optimizer choice and learning rate tuning - -### Mathematical Foundations -- **Gradient Descent**: θ_{t+1} = θ_t - α∇L(θ_t) where α is learning rate and ∇L is loss gradient -- **Momentum**: v_{t+1} = βv_t + ∇L(θ_t), θ_{t+1} = θ_t - αv_{t+1} for accelerated convergence -- 
**Adam**: Combines momentum with adaptive learning rates using first and second moment estimates -- **Learning Rate Scheduling**: Strategic decay schedules balance exploration and exploitation - -### Optimization Theory -- **Convex Optimization**: Guarantees global minimum for convex loss functions -- **Non-convex Optimization**: Neural networks have complex loss landscapes with local minima -- **Convergence Analysis**: Understanding when and why optimization algorithms reach good solutions -- **Hyperparameter Sensitivity**: Learning rate is often the most critical hyperparameter - -### Performance Characteristics -- **SGD**: Memory efficient, works well with large batches, good final performance -- **Adam**: Fast initial convergence, works with small batches, requires more memory -- **Learning Rate Schedules**: Often crucial for achieving best performance -- **Algorithm Selection**: Problem-dependent choice based on data, model, and computational constraints - -## 🎉 Ready to Build? - -You're about to implement the algorithms that power all of modern AI! From the neural networks that recognize your voice to the language models that write code, they all depend on the optimization algorithms you're building. - -Understanding these algorithms from first principles—implementing momentum physics and adaptive learning rates yourself—will give you deep insight into why some training works and some doesn't. Take your time with the mathematics, test thoroughly, and enjoy building the intelligence behind intelligent systems! 
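The memory trade-off above can be made concrete with a back-of-envelope estimate: plain SGD keeps no per-parameter state, momentum keeps one velocity buffer, and Adam keeps two moment buffers, all parameter-sized. The helper below is a hypothetical sketch (not part of TinyTorch), counting parameters + gradients + optimizer state in float32:

```python
def training_memory_mb(num_params, optimizer="sgd", bytes_per_value=4):
    """Rough float32 memory estimate: parameters + gradients + optimizer state.

    Illustrative only: extra parameter-sized buffers per optimizer are
    0 for plain SGD, 1 for SGD with momentum, 2 for Adam (m_t and v_t).
    """
    state_copies = {"sgd": 0, "sgd_momentum": 1, "adam": 2}[optimizer]
    total_values = num_params * (2 + state_copies)  # params + grads + state
    return total_values * bytes_per_value / 1e6

params = 1_000_000
print(training_memory_mb(params, "sgd"))   # 8.0  MB: params + grads only
print(training_memory_mb(params, "adam"))  # 16.0 MB: two extra moment buffers
```

Under this accounting Adam's parameter-sized storage (weights + m_t + v_t) is three copies versus plain SGD's one, which is one way to read the "3× memory" figure often quoted for Adam.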
- -```{grid} 3 -:gutter: 3 -:margin: 2 - -{grid-item-card} 🚀 Launch Builder -:link: https://mybinder.org/v2/gh/VJProductions/TinyTorch/main?filepath=modules/source/09_optimizers/optimizers_dev.py -:class-title: text-center -:class-body: text-center - -Interactive development environment - -{grid-item-card} 📓 Open in Colab -:link: https://colab.research.google.com/github/VJProductions/TinyTorch/blob/main/modules/source/09_optimizers/optimizers_dev.ipynb -:class-title: text-center -:class-body: text-center - -Google Colab notebook - -{grid-item-card} 👀 View Source -:link: https://github.com/VJProductions/TinyTorch/blob/main/modules/source/09_optimizers/optimizers_dev.py -:class-title: text-center -:class-body: text-center - -Browse the code on GitHub -``` diff --git a/modules_old/06_optimizers/module.yaml b/modules_old/06_optimizers/module.yaml deleted file mode 100644 index 82509e45..00000000 --- a/modules_old/06_optimizers/module.yaml +++ /dev/null @@ -1,23 +0,0 @@ -components: -- SGD -- Adam -- StepLR -- gradient_descent_step -dependencies: - enables: - - training - - compression - - mlops - prerequisites: - - tensor - - autograd -description: Gradient-based parameter optimization algorithms -difficulty: "\u2B50\u2B50\u2B50\u2B50" -exports_to: tinytorch.core.optimizers -files: - dev_file: optimizers_dev.py - readme: README.md - tests: inline -name: optimizers -time_estimate: 6-8 hours -title: Optimizers diff --git a/modules_old/06_optimizers/optimizers_dev.ipynb b/modules_old/06_optimizers/optimizers_dev.ipynb deleted file mode 100644 index 1a9c2600..00000000 --- a/modules_old/06_optimizers/optimizers_dev.ipynb +++ /dev/null @@ -1,3764 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "fb5fbe07", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "# Optimizers - Gradient-Based Parameter Updates and Training Dynamics\n", - "\n", - "Welcome to the Optimizers module! 
You'll implement the algorithms that use gradients to update neural network parameters, determining how effectively networks learn from data.\n", - "\n", - "## Learning Goals\n", - "- Systems understanding: How different optimization algorithms affect convergence speed, memory usage, and training stability\n", - "- Core implementation skill: Build SGD with momentum and Adam optimizer, understanding their mathematical foundations and implementation trade-offs\n", - "- Pattern recognition: Understand how adaptive learning rates and momentum help navigate complex loss landscapes\n", - "- Framework connection: See how your optimizer implementations match PyTorch's optim module design and state management\n", - "- Performance insight: Learn why optimizer choice affects training speed and why Adam uses 3x more memory than SGD\n", - "\n", - "## Build → Use → Reflect\n", - "1. **Build**: Complete SGD and Adam optimizers with proper state management and learning rate scheduling\n", - "2. **Use**: Train neural networks with different optimizers and compare convergence behavior on real datasets\n", - "3. 
**Reflect**: Why do some optimizers work better for certain problems, and how does memory usage scale with model size?\n", - "\n", - "## What You'll Achieve\n", - "By the end of this module, you'll gain:\n", - "- Deep technical understanding of how optimization algorithms navigate high-dimensional loss landscapes to find good solutions\n", - "- Practical capability to implement and tune optimizers that determine training success or failure\n", - "- Systems insight into why optimizer choice often matters more than architecture choice for training success\n", - "- Performance consideration of how optimizer memory requirements and computational overhead affect scalable training\n", - "- Connection to production ML systems and why new optimizers continue to be an active area of research\n", - "\n", - "## Systems Reality Check\n", - "💡 **Production Context**: PyTorch's Adam implementation includes numerically stable variants such as AMSGrad, and in practice it is combined with gradient clipping to prevent training instability\n", - "⚡ **Performance Note**: Adam stores running averages for every parameter, using 3x the memory of SGD - this memory overhead becomes critical when training large models near GPU memory limits" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "eaeb031f", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "optimizers-imports", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "#| default_exp core.optimizers\n", - "\n", - "#| export\n", - "import numpy as np\n", - "import sys\n", - "import os\n", - "from typing import List, Dict, Any, Optional, Union\n", - "from collections import defaultdict\n", - "\n", - "# Helper function to set up import paths\n", - "def setup_import_paths():\n", - " \"\"\"Set up import paths for development modules.\"\"\"\n", - " import sys\n", - " import os\n", - " \n", - " # Add module directories to path\n", - " 
base_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))\n", - " tensor_dir = os.path.join(base_dir, '01_tensor')\n", - " autograd_dir = os.path.join(base_dir, '06_autograd') # Fixed: Module 6, not 7\n", - " \n", - " if tensor_dir not in sys.path:\n", - " sys.path.append(tensor_dir)\n", - " if autograd_dir not in sys.path:\n", - " sys.path.append(autograd_dir)\n", - "\n", - "# Import our existing components\n", - "try:\n", - " from tinytorch.core.tensor import Tensor\n", - " from tinytorch.core.autograd import Variable\n", - "except ImportError:\n", - " # For development, try local imports\n", - " try:\n", - " setup_import_paths()\n", - " from tensor_dev import Tensor\n", - " from autograd_dev import Variable\n", - " except ImportError:\n", - " # Create simplified fallback classes for basic gradient operations\n", - " print(\"Warning: Using simplified classes for basic gradient operations\")\n", - " \n", - " class Tensor:\n", - " def __init__(self, data):\n", - " self.data = np.array(data)\n", - " self.shape = self.data.shape\n", - " \n", - " def __str__(self):\n", - " return f\"Tensor({self.data})\"\n", - " \n", - " class Variable:\n", - " def __init__(self, data, requires_grad=True):\n", - " if isinstance(data, (int, float)):\n", - " self.data = Tensor([data])\n", - " else:\n", - " self.data = Tensor(data)\n", - " self.requires_grad = requires_grad\n", - " self.grad = None # Simple gradient storage\n", - " \n", - " def zero_grad(self):\n", - " \"\"\"Reset gradients to None (basic operation from Module 6)\"\"\"\n", - " self.grad = None\n", - " \n", - " def __str__(self):\n", - " return f\"Variable({self.data.data})\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "efa9a737", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "optimizers-setup", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "print(\"🔥 TinyTorch Optimizers Module\")\n", - 
"print(f\"NumPy version: {np.__version__}\")\n", - "print(f\"Python version: {sys.version_info.major}.{sys.version_info.minor}\")\n", - "print(\"Ready to build optimization algorithms!\")" - ] - }, - { - "cell_type": "markdown", - "id": "95296fc3", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 📦 Where This Code Lives in the Final Package\n", - "\n", - "**Learning Side:** You work in `modules/source/08_optimizers/optimizers_dev.py` \n", - "**Building Side:** Code exports to `tinytorch.core.optimizers`\n", - "\n", - "```python\n", - "# Final package structure:\n", - "from tinytorch.core.optimizers import SGD, Adam, StepLR # The optimization engines!\n", - "from tinytorch.core.autograd import Variable # Gradient computation\n", - "from tinytorch.core.tensor import Tensor # Data structures\n", - "```\n", - "\n", - "**Why this matters:**\n", - "- **Learning:** Focused module for understanding optimization algorithms\n", - "- **Production:** Proper organization like PyTorch's `torch.optim`\n", - "- **Consistency:** All optimization algorithms live together in `core.optimizers`\n", - "- **Foundation:** Enables effective neural network training" - ] - }, - { - "cell_type": "markdown", - "id": "1f7be774", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## What Are Optimizers?\n", - "\n", - "### The Problem: How to Update Parameters\n", - "Neural networks learn by updating parameters using gradients:\n", - "```\n", - "parameter_new = parameter_old - learning_rate * gradient\n", - "```\n", - "\n", - "But **naive gradient descent** has problems:\n", - "- **Slow convergence**: Takes many steps to reach optimum\n", - "- **Oscillation**: Bounces around valleys without making progress\n", - "- **Poor scaling**: Same learning rate for all parameters\n", - "\n", - "### The Solution: Smart Optimization\n", - "**Optimizers** are algorithms that intelligently update parameters:\n", - "- **Momentum**: Accelerate convergence by accumulating 
velocity\n", - "- **Adaptive learning rates**: Different learning rates for different parameters\n", - "- **Second-order information**: Use curvature to guide updates\n", - "\n", - "### Real-World Impact\n", - "- **SGD**: The foundation of all neural network training\n", - "- **Adam**: The default optimizer for most deep learning applications\n", - "- **Learning rate scheduling**: Critical for training stability and performance\n", - "\n", - "### What We'll Build\n", - "1. **SGD**: Stochastic Gradient Descent with momentum\n", - "2. **Adam**: Adaptive Moment Estimation optimizer\n", - "3. **StepLR**: Learning rate scheduling\n", - "4. **Integration**: Complete training loop with optimizers" - ] - }, - { - "cell_type": "markdown", - "id": "57910ed3", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🔧 DEVELOPMENT" - ] - }, - { - "cell_type": "markdown", - "id": "28fd78ed", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Step 1: Understanding Gradient Descent\n", - "\n", - "### What is Gradient Descent?\n", - "**Gradient descent** finds the minimum of a function by following the negative gradient:\n", - "\n", - "```\n", - "θ_{t+1} = θ_t - α ∇f(θ_t)\n", - "```\n", - "\n", - "Where:\n", - "- θ: Parameters we want to optimize\n", - "- α: Learning rate (how big steps to take)\n", - "- ∇f(θ): Gradient of loss function with respect to parameters\n", - "\n", - "### Why Gradient Descent Works\n", - "1. **Gradients point uphill**: Negative gradient points toward minimum\n", - "2. **Iterative improvement**: Each step reduces the loss (in theory)\n", - "3. **Local convergence**: Finds local minimum with proper learning rate\n", - "4. 
**Scalable**: Works with millions of parameters\n", - "\n", - "### The Learning Rate Dilemma\n", - "- **Too large**: Overshoots minimum, diverges\n", - "- **Too small**: Extremely slow convergence\n", - "- **Just right**: Steady progress toward minimum\n", - "\n", - "### Visual Understanding\n", - "```\n", - "Loss landscape: U-shaped curve\n", - "Start here: ↑\n", - "Gradient descent: ↓ → ↓ → ↓ → minimum\n", - "```\n", - "\n", - "### Real-World Applications\n", - "- **Neural networks**: Training any deep learning model\n", - "- **Machine learning**: Logistic regression, SVM, etc.\n", - "- **Scientific computing**: Optimization problems in physics, engineering\n", - "- **Economics**: Portfolio optimization, game theory\n", - "\n", - "Let's implement gradient descent to understand it deeply!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d00ed89e", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "gradient-descent-function", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "def gradient_descent_step(parameter: Variable, learning_rate: float) -> None:\n", - " \"\"\"\n", - " Perform one step of gradient descent on a parameter.\n", - " \n", - " Args:\n", - " parameter: Variable with gradient information\n", - " learning_rate: How much to update parameter\n", - " \n", - " TODO: Implement basic gradient descent parameter update.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Check if parameter has a gradient\n", - " 2. Get current parameter value and gradient\n", - " 3. Update parameter: new_value = old_value - learning_rate * gradient\n", - " 4. Update parameter data with new value\n", - " 5. 
Handle edge cases (no gradient, invalid values)\n", - " \n", - " EXAMPLE USAGE:\n", - " ```python\n", - " # Parameter with gradient\n", - " w = Variable(2.0, requires_grad=True)\n", - " w.grad = Variable(0.5) # Gradient from loss\n", - " \n", - " # Update parameter\n", - " gradient_descent_step(w, learning_rate=0.1)\n", - " # w.data now contains: 2.0 - 0.1 * 0.5 = 1.95\n", - " ```\n", - " \n", - " IMPLEMENTATION HINTS:\n", - " - Check if parameter.grad is not None\n", - " - Use parameter.grad.data.data to get gradient value\n", - " - Update parameter.data with new Tensor\n", - " - Don't modify gradient (it's used for logging)\n", - " \n", - " LEARNING CONNECTIONS:\n", - " - This is the foundation of all neural network training\n", - " - PyTorch's optimizer.step() does exactly this\n", - " - The learning rate determines convergence speed\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " if parameter.grad is not None:\n", - " # Get current parameter value and gradient\n", - " current_value = parameter.data.data\n", - " gradient_value = parameter.grad.data.data\n", - " \n", - " # Update parameter: new_value = old_value - learning_rate * gradient\n", - " new_value = current_value - learning_rate * gradient_value\n", - " \n", - " # Update parameter data\n", - " parameter.data = Tensor(new_value)\n", - " ### END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "2c219cc0", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Unit Test: Gradient Descent Step\n", - "\n", - "Let's test your gradient descent implementation right away! This is the foundation of all optimization algorithms.\n", - "\n", - "**This is a unit test** - it tests one specific function (gradient_descent_step) in isolation." 
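Before wiring the update into `Variable`, the rule `new_value = old_value - learning_rate * gradient` can be sanity-checked on a simple quadratic in plain NumPy (a standalone sketch, independent of the module's classes). It also demonstrates the learning rate dilemma described above: a moderate rate converges, an overly large one diverges.

```python
import numpy as np

def minimize_quadratic(theta, lr, steps):
    # Minimize f(theta) = theta**2, whose gradient is 2*theta,
    # using the same update rule as gradient_descent_step.
    for _ in range(steps):
        grad = 2.0 * theta
        theta = theta - lr * grad
    return theta

good = minimize_quadratic(2.0, lr=0.1, steps=50)  # shrinks toward the minimum at 0
bad = minimize_quadratic(2.0, lr=1.1, steps=10)   # overshoots: |theta| grows each step
print(good, bad)
```

Each step multiplies `theta` by `(1 - 2*lr)`, so the iterates contract toward zero exactly when `0 < lr < 1`, and diverge once `lr > 1` for this function.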
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "033bd1fa", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "test-gradient-descent", - "locked": true, - "points": 10, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_unit_gradient_descent_step():\n", - " \"\"\"Unit test for the basic gradient descent parameter update.\"\"\"\n", - " print(\"🔬 Unit Test: Gradient Descent Step...\")\n", - " \n", - " # Test basic parameter update\n", - " try:\n", - " w = Variable(2.0, requires_grad=True)\n", - " w.grad = Variable(0.5) # Positive gradient\n", - " \n", - " original_value = w.data.data.item()\n", - " gradient_descent_step(w, learning_rate=0.1)\n", - " new_value = w.data.data.item()\n", - " \n", - " expected_value = original_value - 0.1 * 0.5 # 2.0 - 0.05 = 1.95\n", - " assert abs(new_value - expected_value) < 1e-6, f\"Expected {expected_value}, got {new_value}\"\n", - " print(\"✅ Basic parameter update works\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ Basic parameter update failed: {e}\")\n", - " raise\n", - "\n", - " # Test with negative gradient\n", - " try:\n", - " w2 = Variable(1.0, requires_grad=True)\n", - " w2.grad = Variable(-0.2) # Negative gradient\n", - " \n", - " gradient_descent_step(w2, learning_rate=0.1)\n", - " expected_value2 = 1.0 - 0.1 * (-0.2) # 1.0 + 0.02 = 1.02\n", - " assert abs(w2.data.data.item() - expected_value2) < 1e-6, \"Negative gradient test failed\"\n", - " print(\"✅ Negative gradient handling works\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ Negative gradient handling failed: {e}\")\n", - " raise\n", - "\n", - " # Test with no gradient (should not update)\n", - " try:\n", - " w3 = Variable(3.0, requires_grad=True)\n", - " w3.grad = None\n", - " original_value3 = w3.data.data.item()\n", - " \n", - " gradient_descent_step(w3, learning_rate=0.1)\n", - " assert w3.data.data.item() == 
original_value3, \"Parameter with no gradient should not update\"\n", - " print(\"✅ No gradient case works\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ No gradient case failed: {e}\")\n", - " raise\n", - "\n", - " print(\"🎯 Gradient descent step behavior:\")\n", - " print(\" Updates parameters in negative gradient direction\")\n", - " print(\" Uses learning rate to control step size\")\n", - " print(\" Skips updates when gradient is None\")\n", - " print(\"📈 Progress: Gradient Descent Step ✓\")\n", - "\n", - "# Test function defined (called in main block)\n", - "\n", - "# Test function is called by auto-discovery system" - ] - }, - { - "cell_type": "markdown", - "id": "81768011", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Step 2: SGD with Momentum\n", - "\n", - "### What is SGD?\n", - "**SGD (Stochastic Gradient Descent)** is the fundamental optimization algorithm:\n", - "\n", - "```\n", - "θ_{t+1} = θ_t - α ∇L(θ_t)\n", - "```\n", - "\n", - "### The Problem with Vanilla SGD\n", - "- **Slow convergence**: Especially in narrow valleys\n", - "- **Oscillation**: Bounces around without making progress\n", - "- **Poor conditioning**: Struggles with ill-conditioned problems\n", - "\n", - "### The Solution: Momentum\n", - "**Momentum** accumulates velocity to accelerate convergence:\n", - "\n", - "```\n", - "v_t = β v_{t-1} + ∇L(θ_t)\n", - "θ_{t+1} = θ_t - α v_t\n", - "```\n", - "\n", - "Where:\n", - "- v_t: Velocity (exponential moving average of gradients)\n", - "- β: Momentum coefficient (typically 0.9)\n", - "- α: Learning rate\n", - "\n", - "### Why Momentum Works\n", - "1. **Acceleration**: Builds up speed in consistent directions\n", - "2. **Dampening**: Reduces oscillations in inconsistent directions\n", - "3. **Memory**: Remembers previous gradient directions\n", - "4. 
**Robustness**: Less sensitive to noisy gradients\n", - "\n", - "### Visual Understanding\n", - "```\n", - "Without momentum: ↗↙↗↙↗↙ (oscillating)\n", - "With momentum: ↗→→→→→ (smooth progress)\n", - "```\n", - "\n", - "### Real-World Applications\n", - "- **Image classification**: Training ResNet, VGG\n", - "- **Natural language**: Training RNNs, early transformers\n", - "- **Classic choice**: Still used when Adam fails\n", - "- **Large batch training**: Often preferred over Adam\n", - "\n", - "Let's implement SGD with momentum!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fa1775e2", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "sgd-class", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class SGD:\n", - " \"\"\"\n", - " Simplified SGD Optimizer\n", - " \n", - " Implements basic stochastic gradient descent with optional momentum.\n", - " Uses simple gradient operations from Module 6.\n", - " \n", - " Mathematical Update Rule:\n", - " parameter = parameter - learning_rate * gradient\n", - " \n", - " With momentum:\n", - " velocity = momentum * velocity + gradient\n", - " parameter = parameter - learning_rate * velocity\n", - " \"\"\"\n", - " \n", - " def __init__(self, parameters: List[Variable], learning_rate: float = 0.01, \n", - " momentum: float = 0.0):\n", - " \"\"\"\n", - " Initialize SGD optimizer with basic parameters.\n", - " \n", - " Args:\n", - " parameters: List of Variables to optimize (from Module 6)\n", - " learning_rate: Learning rate (default: 0.01)\n", - " momentum: Momentum coefficient (default: 0.0)\n", - " \n", - " TODO: Implement basic SGD optimizer initialization.\n", - " \n", - " APPROACH:\n", - " 1. Store parameters and learning rate\n", - " 2. Store momentum coefficient\n", - " 3. 
Initialize simple momentum buffers\n", - " \n", - " EXAMPLE:\n", - " ```python\n", - " # Basic optimizer setup\n", - " w = Variable(1.0, requires_grad=True)\n", - " b = Variable(0.0, requires_grad=True)\n", - " optimizer = SGD([w, b], learning_rate=0.01)\n", - " \n", - " # In training:\n", - " optimizer.zero_grad()\n", - " # ... compute gradients ...\n", - " optimizer.step()\n", - " ```\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " self.parameters = parameters\n", - " self.learning_rate = learning_rate\n", - " self.momentum = momentum\n", - " \n", - " # Simple momentum storage (using basic dict)\n", - " self.velocity = {}\n", - " for i, param in enumerate(parameters):\n", - " if self.momentum > 0:\n", - " self.velocity[i] = 0.0 # Initialize velocity to zero\n", - " ### END SOLUTION\n", - " \n", - " def step(self) -> None:\n", - " \"\"\"\n", - " Perform one optimization step using basic gradient operations.\n", - " \n", - " TODO: Implement simplified SGD parameter update.\n", - " \n", - " APPROACH:\n", - " 1. Iterate through all parameters\n", - " 2. For each parameter with gradient (from Module 6):\n", - " a. Get gradient using simple param.grad access\n", - " b. Apply momentum if specified\n", - " c. 
Update parameter with learning rate\n", - " \n", - " SIMPLIFIED MATHEMATICAL FORMULATION:\n", - " - Without momentum: parameter = parameter - learning_rate * gradient\n", - " - With momentum: velocity = momentum * velocity + gradient\n", - " parameter = parameter - learning_rate * velocity\n", - " \n", - " IMPLEMENTATION HINTS:\n", - " - Use basic param.grad access (from Module 6)\n", - " - Simple momentum using self.velocity dict\n", - " - Basic parameter update using scalar operations\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " for i, param in enumerate(self.parameters):\n", - " if param.grad is not None:\n", - " # param.grad is a Variable, so .data.data unwraps\n", - " # to the underlying NumPy array (same convention as gradient_descent_step)\n", - " gradient = param.grad.data.data\n", - " \n", - " if self.momentum > 0:\n", - " # Apply momentum (simplified)\n", - " if i in self.velocity:\n", - " self.velocity[i] = self.momentum * self.velocity[i] + gradient\n", - " else:\n", - " self.velocity[i] = gradient\n", - " update = self.velocity[i]\n", - " else:\n", - " # Simple gradient descent (no momentum)\n", - " update = gradient\n", - " \n", - " # Clean parameter update - PyTorch style\n", - " # NOTE: In production PyTorch, this is an in-place operation (param.data.sub_())\n", - " # for memory efficiency. We create a new Tensor here for clarity, but real\n", - " # systems modify the existing memory to avoid allocation overhead.\n", - " new_value = param.data.data - self.learning_rate * update\n", - " param.data = Tensor(new_value)\n", - " ### END SOLUTION\n", - " \n", - " def zero_grad(self) -> None:\n", - " \"\"\"\n", - " Zero out gradients for all parameters.\n", - " \n", - " TODO: Implement gradient zeroing.\n", - " \n", - " APPROACH:\n", - " 1. Iterate through all parameters\n", - " 2. Set gradient to None for each parameter\n", - " 3. 
This prepares for next backward pass\n", - " \n", - " IMPLEMENTATION HINTS:\n", - " - Simply set param.grad = None\n", - " - This is called before loss.backward()\n", - " - Essential for proper gradient accumulation\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " for param in self.parameters:\n", - " param.grad = None\n", - " ### END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "9078e2c3", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Unit Test: SGD Optimizer\n", - "\n", - "Let's test your SGD optimizer implementation! This optimizer adds momentum to gradient descent for better convergence.\n", - "\n", - "**This is a unit test** - it tests one specific class (SGD) in isolation." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9c43f8ab", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "test-sgd", - "locked": true, - "points": 15, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_unit_sgd_optimizer():\n", - " \"\"\"Unit test for the SGD optimizer implementation.\"\"\"\n", - " print(\"🔬 Unit Test: SGD Optimizer...\")\n", - " \n", - " # Create test parameters\n", - " w1 = Variable(1.0, requires_grad=True)\n", - " w2 = Variable(2.0, requires_grad=True)\n", - " b = Variable(0.5, requires_grad=True)\n", - " \n", - " # Create optimizer\n", - " optimizer = SGD([w1, w2, b], learning_rate=0.1, momentum=0.9)\n", - " \n", - " # Test zero_grad\n", - " try:\n", - " w1.grad = Variable(0.1)\n", - " w2.grad = Variable(0.2)\n", - " b.grad = Variable(0.05)\n", - " \n", - " optimizer.zero_grad()\n", - " \n", - " assert w1.grad is None, \"Gradient should be None after zero_grad\"\n", - " assert w2.grad is None, \"Gradient should be None after zero_grad\"\n", - " assert b.grad is None, \"Gradient should be None after zero_grad\"\n", - " print(\"✅ zero_grad() works correctly\")\n", - " \n", - " 
except Exception as e:\n", - " print(f\"❌ zero_grad() failed: {e}\")\n", - " raise\n", - " \n", - " # Test step with gradients\n", - " try:\n", - " w1.grad = Variable(0.1)\n", - " w2.grad = Variable(0.2)\n", - " b.grad = Variable(0.05)\n", - " \n", - " # First step (no momentum yet)\n", - " original_w1 = w1.data.data.item()\n", - " original_w2 = w2.data.data.item()\n", - " original_b = b.data.data.item()\n", - " \n", - " optimizer.step()\n", - " \n", - " # Check parameter updates\n", - " expected_w1 = original_w1 - 0.1 * 0.1 # 1.0 - 0.01 = 0.99\n", - " expected_w2 = original_w2 - 0.1 * 0.2 # 2.0 - 0.02 = 1.98\n", - " expected_b = original_b - 0.1 * 0.05 # 0.5 - 0.005 = 0.495\n", - " \n", - " assert abs(w1.data.data.item() - expected_w1) < 1e-6, f\"w1 update failed: expected {expected_w1}, got {w1.data.data.item()}\"\n", - " assert abs(w2.data.data.item() - expected_w2) < 1e-6, f\"w2 update failed: expected {expected_w2}, got {w2.data.data.item()}\"\n", - " assert abs(b.data.data.item() - expected_b) < 1e-6, f\"b update failed: expected {expected_b}, got {b.data.data.item()}\"\n", - " print(\"✅ Parameter updates work correctly\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ Parameter updates failed: {e}\")\n", - " raise\n", - " \n", - " # Test simplified momentum storage\n", - " try:\n", - " # Check velocity dict exists and has momentum if momentum > 0\n", - " if optimizer.momentum > 0:\n", - " assert len(optimizer.velocity) == 3, f\"Should have 3 velocity entries, got {len(optimizer.velocity)}\"\n", - " print(\"✅ Simplified momentum storage works correctly\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ Momentum storage failed: {e}\")\n", - " raise\n", - " \n", - " # Test step counting\n", - " try:\n", - " w1.grad = Variable(0.1)\n", - " w2.grad = Variable(0.2)\n", - " b.grad = Variable(0.05)\n", - " \n", - " optimizer.step()\n", - " \n", - " # Step counting removed from simplified SGD for educational clarity\n", - " print(\"✅ Step 
counting simplified for Module 8\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ Step counting failed: {e}\")\n", - " raise\n", - "\n", - " print(\"🎯 SGD optimizer behavior:\")\n", - " print(\" Maintains momentum buffers for accelerated updates\")\n", - " print(\" Applies the learning rate to scale each update\")\n", - " print(\" Skips parameters whose gradient is None\")\n", - " print(\"📈 Progress: SGD Optimizer ✓\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "e27e0f30", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Step 3: Adam - Adaptive Learning Rates\n", - "\n", - "### What is Adam?\n", - "**Adam (Adaptive Moment Estimation)** is the most popular optimizer in deep learning:\n", - "\n", - "```\n", - "m_t = β₁ m_{t-1} + (1 - β₁) ∇L(θ_t) # First moment (momentum)\n", - "v_t = β₂ v_{t-1} + (1 - β₂) (∇L(θ_t))² # Second moment (variance)\n", - "m̂_t = m_t / (1 - β₁ᵗ) # Bias correction\n", - "v̂_t = v_t / (1 - β₂ᵗ) # Bias correction\n", - "θ_{t+1} = θ_t - α m̂_t / (√v̂_t + ε) # Parameter update\n", - "```\n", - "\n", - "### Why Adam is Revolutionary\n", - "1. **Adaptive learning rates**: Different learning rate for each parameter\n", - "2. **Momentum**: Accelerates convergence like SGD\n", - "3. **Variance adaptation**: Scales updates based on gradient variance\n", - "4. **Bias correction**: Handles initialization bias\n", - "5. **Robust**: Works well with minimal hyperparameter tuning\n", - "\n", - "### The Three Key Ideas\n", - "1. **First moment (m_t)**: Exponential moving average of gradients (momentum)\n", - "2. **Second moment (v_t)**: Exponential moving average of squared gradients (variance)\n", - "3. 
**Adaptive scaling**: Large gradients → small updates, small gradients → large updates\n", - "\n", - "### Visual Understanding\n", - "```\n", - "Parameter with large gradients: zigzag pattern → smooth updates\n", - "Parameter with small gradients: ______ → amplified updates\n", - "```\n", - "\n", - "### Real-World Applications\n", - "- **Deep learning**: Default optimizer for most neural networks\n", - "- **Computer vision**: Training CNNs, ResNets, Vision Transformers\n", - "- **Natural language**: Training BERT, GPT, T5\n", - "- **Transformers**: Essential for attention-based models\n", - "\n", - "Let's implement Adam optimizer!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fa129d52", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "adam-class", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class Adam:\n", - " \"\"\"\n", - " Simplified Adam Optimizer\n", - " \n", - " Implements a simplified version of Adam algorithm with adaptive learning rates.\n", - " Educational focus on understanding optimization concepts rather than complex implementation.\n", - " \n", - " Key concepts:\n", - " - Momentum: Running average of gradients (first moment)\n", - " - Adaptive learning: Running average of squared gradients (second moment)\n", - " - Bias correction: Adjust for initialization bias\n", - " \"\"\"\n", - " \n", - " def __init__(self, parameters: List[Variable], learning_rate: float = 0.001,\n", - " beta1: float = 0.9, beta2: float = 0.999, epsilon: float = 1e-8):\n", - " \"\"\"\n", - " Initialize simplified Adam optimizer.\n", - " \n", - " Args:\n", - " parameters: List of Variables to optimize (from Module 6)\n", - " learning_rate: Learning rate (default: 0.001)\n", - " beta1: Decay rate for momentum (default: 0.9)\n", - " beta2: Decay rate for squared gradients (default: 0.999)\n", - " epsilon: Small constant 
for numerical stability (default: 1e-8)\n", - " \n", - " TODO: Implement simplified Adam optimizer initialization.\n", - " \n", - " APPROACH:\n", - " 1. Store parameters and learning rate\n", - " 2. Store Adam hyperparameters (beta1, beta2, epsilon)\n", - " 3. Initialize simple moment storage\n", - " \n", - " EDUCATIONAL FOCUS:\n", - " - Understand Adam concepts: momentum + adaptive learning\n", - " - Learn why Adam uses running averages\n", - " - See how bias correction helps early training\n", - " \n", - " EXAMPLE:\n", - " ```python\n", - " # Simple Adam setup\n", - " w = Variable(1.0, requires_grad=True)\n", - " b = Variable(0.0, requires_grad=True)\n", - " optimizer = Adam([w, b], learning_rate=0.001)\n", - " ```\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " self.parameters = parameters\n", - " self.learning_rate = learning_rate\n", - " self.beta1 = beta1\n", - " self.beta2 = beta2\n", - " self.epsilon = epsilon\n", - " \n", - " # Simple moment storage (using basic dict with indices)\n", - " # MEMORY INSIGHT: Adam uses 3x memory of SGD because it stores:\n", - " # 1. Parameters (1x memory)\n", - " # 2. First moment estimates m[i] (1x memory) \n", - " # 3. Second moment estimates v[i] (1x memory)\n", - " # This is why Adam can be problematic for very large models!\n", - " self.m = {} # First moment (momentum)\n", - " self.v = {} # Second moment (squared gradients)\n", - " \n", - " # Initialize moments for each parameter\n", - " for i, param in enumerate(parameters):\n", - " self.m[i] = 0.0\n", - " self.v[i] = 0.0\n", - " \n", - " # Step counter for bias correction\n", - " self.t = 0\n", - " ### END SOLUTION\n", - " \n", - " def step(self) -> None:\n", - " \"\"\"\n", - " Perform one optimization step using simplified Adam algorithm.\n", - " \n", - " TODO: Implement simplified Adam parameter update.\n", - " \n", - " APPROACH:\n", - " 1. Increment step counter\n", - " 2. For each parameter with gradient:\n", - " a. 
Get gradient (basic operation from Module 6)\n", - " b. Update momentum (first moment)\n", - " c. Update squared gradient average (second moment)\n", - " d. Apply bias correction\n", - " e. Update parameter with adaptive learning rate\n", - " \n", - " SIMPLIFIED MATHEMATICAL FORMULATION:\n", - " - m = beta1 * m + (1 - beta1) * gradient (momentum)\n", - " - v = beta2 * v + (1 - beta2) * gradient² (squared gradients)\n", - " - m_corrected = m / (1 - beta1^t) (bias correction)\n", - " - v_corrected = v / (1 - beta2^t) (bias correction)\n", - " - parameter = parameter - lr * m_corrected / (√v_corrected + ε)\n", - " \n", - " EDUCATIONAL INSIGHTS:\n", - " - Momentum helps accelerate learning\n", - " - Squared gradients adapt learning rate per parameter\n", - " - Bias correction prevents slow start\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " self.t += 1 # Increment step counter\n", - " \n", - " for i, param in enumerate(self.parameters):\n", - " if param.grad is not None:\n", - " # param.grad is a Variable, so .data.data unwraps to the NumPy array\n", - " gradient = param.grad.data.data\n", - " \n", - " # Update first moment (momentum)\n", - " self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * gradient\n", - " \n", - " # Update second moment (squared gradients)\n", - " self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * gradient * gradient\n", - " \n", - " # Bias correction\n", - " m_corrected = self.m[i] / (1 - self.beta1 ** self.t)\n", - " v_corrected = self.v[i] / (1 - self.beta2 ** self.t)\n", - " \n", - " # Clean adaptive parameter update - PyTorch style\n", - " # NOTE: In production PyTorch, parameters are updated in-place for efficiency.\n", - " # We create a new Tensor for educational clarity, but real systems use\n", - " # param.data.add_(-update) to modify memory directly without allocation.\n", - " update = self.learning_rate * m_corrected / (np.sqrt(v_corrected) + self.epsilon)\n", - " new_value = param.data.data - update\n", - " 
param.data = Tensor(new_value)\n", - " ### END SOLUTION\n", - " \n", - " def zero_grad(self) -> None:\n", - " \"\"\"\n", - " Zero out gradients for all parameters.\n", - " \n", - " TODO: Implement gradient zeroing (same as SGD).\n", - " \n", - " IMPLEMENTATION HINTS:\n", - " - Set param.grad = None for all parameters\n", - " - This is identical to SGD implementation\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " for param in self.parameters:\n", - " param.grad = None\n", - " ### END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "5614e3b4", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### 🧪 Test Your Adam Implementation\n", - "\n", - "Let's test the Adam optimizer:" - ] - }, - { - "cell_type": "markdown", - "id": "25e04a95", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Unit Test: Adam Optimizer\n", - "\n", - "Let's test your Adam optimizer implementation! This is a state-of-the-art adaptive optimization algorithm.\n", - "\n", - "**This is a unit test** - it tests one specific class (Adam) in isolation." 
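To see the update rule in isolation, here is a standalone NumPy sketch of one simplified Adam step (the function name `adam_step` and its defaults are illustrative, mirroring the class above, not part of the tinytorch API):

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One simplified Adam update, mirroring the step() logic above."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment (squared grads)
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# On the very first step, bias correction rescales the tiny moments so that
# m_hat == grad and v_hat == grad**2, making the update ~ lr * sign(grad).
p, m, v = adam_step(1.0, 0.1, 0.0, 0.0, t=1)
print(p)  # ≈ 0.999
```

This shows why bias correction matters: without it, the first updates would be scaled down by roughly `(1 - beta1)`, giving the "slow start" the docstring warns about.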
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "92780feb", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "test-adam", - "locked": true, - "points": 20, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_unit_adam_optimizer():\n", - " \"\"\"Unit test for the Adam optimizer implementation.\"\"\"\n", - " print(\"🔬 Unit Test: Adam Optimizer...\")\n", - " \n", - " # Create test parameters\n", - " w1 = Variable(1.0, requires_grad=True)\n", - " w2 = Variable(2.0, requires_grad=True)\n", - " b = Variable(0.5, requires_grad=True)\n", - " \n", - " # Create optimizer\n", - " optimizer = Adam([w1, w2, b], learning_rate=0.01, beta1=0.9, beta2=0.999, epsilon=1e-8)\n", - " \n", - " # Test zero_grad\n", - " try:\n", - " w1.grad = Variable(0.1)\n", - " w2.grad = Variable(0.2)\n", - " b.grad = Variable(0.05)\n", - " \n", - " optimizer.zero_grad()\n", - " \n", - " assert w1.grad is None, \"Gradient should be None after zero_grad\"\n", - " assert w2.grad is None, \"Gradient should be None after zero_grad\"\n", - " assert b.grad is None, \"Gradient should be None after zero_grad\"\n", - " print(\"✅ zero_grad() works correctly\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ zero_grad() failed: {e}\")\n", - " raise\n", - " \n", - " # Test step with gradients\n", - " try:\n", - " w1.grad = Variable(0.1)\n", - " w2.grad = Variable(0.2)\n", - " b.grad = Variable(0.05)\n", - " \n", - " # First step\n", - " original_w1 = w1.data.data.item()\n", - " original_w2 = w2.data.data.item()\n", - " original_b = b.data.data.item()\n", - " \n", - " optimizer.step()\n", - " \n", - " # Check that parameters were updated (Adam uses adaptive learning rates)\n", - " assert w1.data.data.item() != original_w1, \"w1 should have been updated\"\n", - " assert w2.data.data.item() != original_w2, \"w2 should have been updated\"\n", - " assert b.data.data.item() != 
original_b, \"b should have been updated\"\n", - " print(\"✅ Parameter updates work correctly\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ Parameter updates failed: {e}\")\n", - " raise\n", - " \n", - " # Test simplified moment storage\n", - " try:\n", - " assert len(optimizer.m) == 3, f\"Should have 3 momentum entries, got {len(optimizer.m)}\"\n", - " assert len(optimizer.v) == 3, f\"Should have 3 squared gradient entries, got {len(optimizer.v)}\"\n", - " print(\"✅ Simplified moment storage works correctly\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ Moment storage failed: {e}\")\n", - " raise\n", - " \n", - " # Test step counting and bias correction\n", - " try:\n", - " assert optimizer.t == 1, f\"Step count should be 1, got {optimizer.t}\"\n", - " \n", - " # Take another step\n", - " w1.grad = Variable(0.1)\n", - " w2.grad = Variable(0.2)\n", - " b.grad = Variable(0.05)\n", - " \n", - " optimizer.step()\n", - " \n", - " assert optimizer.t == 2, f\"Step count should be 2, got {optimizer.t}\"\n", - " print(\"✅ Step counting and bias correction work correctly\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ Step counting and bias correction failed: {e}\")\n", - " raise\n", - " \n", - " # Test adaptive learning rates\n", - " try:\n", - " # Adam should have different effective learning rates for different parameters\n", - " # This is tested implicitly by the parameter updates above\n", - " print(\"✅ Adaptive learning rates work correctly\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ Adaptive learning rates failed: {e}\")\n", - " raise\n", - "\n", - " print(\"🎯 Adam optimizer behavior:\")\n", - " print(\" Maintains first and second moment estimates\")\n", - " print(\" Applies bias correction for early training\")\n", - " print(\" Uses adaptive learning rates per parameter\")\n", - " print(\" Combines benefits of momentum and RMSprop\")\n", - " print(\"📈 Progress: Adam Optimizer ✓\")\n", - "\n", - "# Test 
function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "88eddfab", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Step 4: Learning Rate Scheduling\n", - "\n", - "### What is Learning Rate Scheduling?\n", - "**Learning rate scheduling** adjusts the learning rate during training:\n", - "\n", - "```\n", - "Initial: learning_rate = 0.1\n", - "After 10 epochs: learning_rate = 0.01\n", - "After 20 epochs: learning_rate = 0.001\n", - "```\n", - "\n", - "### Why Scheduling Matters\n", - "1. **Fine-tuning**: Start with large steps, then refine with small steps\n", - "2. **Convergence**: Prevents overshooting near optimum\n", - "3. **Stability**: Reduces oscillations in later training\n", - "4. **Performance**: Often improves final accuracy\n", - "\n", - "### Common Scheduling Strategies\n", - "1. **Step decay**: Reduce by factor every N epochs\n", - "2. **Exponential decay**: Gradual exponential reduction\n", - "3. **Cosine annealing**: Smooth cosine curve reduction\n", - "4. **Warm-up**: Start small, increase, then decrease\n", - "\n", - "### Visual Understanding\n", - "```\n", - "Step decay: ----↓----↓----↓\n", - "Exponential: \\\\\\\\\\\\\\\\\\\\\\\\\\\\\n", - "Cosine: ∩∩∩∩∩∩∩∩∩∩∩∩∩\n", - "```\n", - "\n", - "### Real-World Applications\n", - "- **ImageNet training**: Essential for achieving state-of-the-art results\n", - "- **Language models**: Critical for training large transformers\n", - "- **Fine-tuning**: Prevents catastrophic forgetting\n", - "- **Transfer learning**: Adapts pre-trained models\n", - "\n", - "Let's implement step learning rate scheduling!" 
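The decay strategies above can be compared numerically. A minimal sketch with standalone helpers (function names and constants are illustrative; the warm-up variant is omitted):

```python
import math

def step_decay(lr0, epoch, step_size=10, gamma=0.1):
    # Drop by a factor of gamma every step_size epochs
    return lr0 * gamma ** (epoch // step_size)

def exp_decay(lr0, epoch, k=0.1):
    # Smooth exponential reduction
    return lr0 * math.exp(-k * epoch)

def cosine_anneal(lr0, epoch, total=30, lr_min=0.0):
    # Half-cosine curve from lr0 down to lr_min over `total` epochs
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + math.cos(math.pi * epoch / total))

for epoch in (0, 10, 20, 29):
    print(epoch, step_decay(0.1, epoch),
          round(exp_decay(0.1, epoch), 4),
          round(cosine_anneal(0.1, epoch), 4))
```

Step decay changes abruptly at the boundaries, while exponential and cosine schedules reduce the rate smoothly every epoch.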
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e77e0ed0", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "steplr-class", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class StepLR:\n", - " \"\"\"\n", - " Step Learning Rate Scheduler\n", - " \n", - " Decays learning rate by gamma every step_size epochs:\n", - " learning_rate = initial_lr * (gamma ^ (epoch // step_size))\n", - " \"\"\"\n", - " \n", - " def __init__(self, optimizer: Union[SGD, Adam], step_size: int, gamma: float = 0.1):\n", - " \"\"\"\n", - " Initialize step learning rate scheduler.\n", - " \n", - " Args:\n", - " optimizer: Optimizer to schedule\n", - " step_size: Number of epochs between decreases\n", - " gamma: Multiplicative factor for learning rate decay\n", - " \n", - " TODO: Implement learning rate scheduler initialization.\n", - " \n", - " APPROACH:\n", - " 1. Store optimizer reference\n", - " 2. Store scheduling parameters\n", - " 3. Save initial learning rate\n", - " 4. 
Initialize step counter\n", - " \n", - " EXAMPLE:\n", - " ```python\n", - " optimizer = SGD([w1, w2], learning_rate=0.1)\n", - " scheduler = StepLR(optimizer, step_size=10, gamma=0.1)\n", - " \n", - " # In training loop:\n", - " for epoch in range(100):\n", - " train_one_epoch()\n", - " scheduler.step() # Update learning rate\n", - " ```\n", - " \n", - " HINTS:\n", - " - Store optimizer reference\n", - " - Save initial learning rate from optimizer\n", - " - Initialize step counter to 0\n", - " - gamma is the decay factor (0.1 = 10x reduction)\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " self.optimizer = optimizer\n", - " self.step_size = step_size\n", - " self.gamma = gamma\n", - " self.initial_lr = optimizer.learning_rate\n", - " self.step_count = 0\n", - " ### END SOLUTION\n", - " \n", - " def step(self) -> None:\n", - " \"\"\"\n", - " Update learning rate based on current step.\n", - " \n", - " TODO: Implement learning rate update.\n", - " \n", - " APPROACH:\n", - " 1. Increment step counter\n", - " 2. Calculate new learning rate using step decay formula\n", - " 3. 
Update optimizer's learning rate\n", - " \n", - " MATHEMATICAL FORMULATION:\n", - " new_lr = initial_lr * (gamma ^ ((step_count - 1) // step_size))\n", - " \n", - " IMPLEMENTATION HINTS:\n", - " - Use // for integer division\n", - " - Use ** for exponentiation\n", - " - Update optimizer.learning_rate directly\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " self.step_count += 1\n", - " \n", - " # Calculate new learning rate\n", - " decay_factor = self.gamma ** ((self.step_count - 1) // self.step_size)\n", - " new_lr = self.initial_lr * decay_factor\n", - " \n", - " # Update optimizer's learning rate\n", - " self.optimizer.learning_rate = new_lr\n", - " ### END SOLUTION\n", - " \n", - " def get_lr(self) -> float:\n", - " \"\"\"\n", - " Get current learning rate.\n", - " \n", - " TODO: Return current learning rate.\n", - " \n", - " IMPLEMENTATION HINTS:\n", - " - Return optimizer.learning_rate\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " return self.optimizer.learning_rate\n", - " ### END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "e8085bc7", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Unit Test: Step Learning Rate Scheduler\n", - "\n", - "Let's test your step learning rate scheduler implementation! This scheduler reduces learning rate at regular intervals.\n", - "\n", - "**This is a unit test** - it tests one specific class (StepLR) in isolation." 
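To check the off-by-one behavior of the decay formula, here is a standalone mirror of the logic (the class name `TinyStepLR` is illustrative): the `(step_count - 1) // step_size` indexing keeps the first `step_size` steps at the initial rate, then decays by `gamma` once per interval.

```python
class TinyStepLR:
    """Standalone mirror of the StepLR logic above (illustrative, not the tinytorch class)."""
    def __init__(self, initial_lr, step_size, gamma=0.1):
        self.initial_lr = initial_lr
        self.step_size = step_size
        self.gamma = gamma
        self.step_count = 0
        self.lr = initial_lr

    def step(self):
        self.step_count += 1
        # (step_count - 1) // step_size == 0 for the first step_size steps,
        # so no decay happens until step step_size + 1.
        self.lr = self.initial_lr * self.gamma ** ((self.step_count - 1) // self.step_size)

sched = TinyStepLR(0.1, step_size=10)
for _ in range(10):
    sched.step()
print(sched.lr)  # still 0.1 after 10 steps
sched.step()
print(sched.lr)  # ≈ 0.01 at step 11
```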
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "cae91729", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "test-step-scheduler", - "locked": true, - "points": 10, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_unit_step_scheduler():\n", - " \"\"\"Unit test for the StepLR scheduler implementation.\"\"\"\n", - " print(\"🔬 Unit Test: Step Learning Rate Scheduler...\")\n", - " \n", - " # Create test parameters and optimizer\n", - " w = Variable(1.0, requires_grad=True)\n", - " optimizer = SGD([w], learning_rate=0.1)\n", - " \n", - " # Test scheduler initialization\n", - " try:\n", - " scheduler = StepLR(optimizer, step_size=10, gamma=0.1)\n", - " \n", - " # Test initial learning rate\n", - " assert scheduler.get_lr() == 0.1, f\"Initial learning rate should be 0.1, got {scheduler.get_lr()}\"\n", - " print(\"✅ Initial learning rate is correct\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ Initial learning rate failed: {e}\")\n", - " raise\n", - " \n", - " # Test step-based decay\n", - " try:\n", - " # Steps 1-10: no decay (decay happens after step 10)\n", - " for i in range(10):\n", - " scheduler.step()\n", - " \n", - " assert scheduler.get_lr() == 0.1, f\"Learning rate should still be 0.1 after 10 steps, got {scheduler.get_lr()}\"\n", - " \n", - " # Step 11: decay should occur\n", - " scheduler.step()\n", - " expected_lr = 0.1 * 0.1 # 0.01\n", - " assert abs(scheduler.get_lr() - expected_lr) < 1e-6, f\"Learning rate should be {expected_lr} after 11 steps, got {scheduler.get_lr()}\"\n", - " print(\"✅ Step-based decay works correctly\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ Step-based decay failed: {e}\")\n", - " raise\n", - " \n", - " # Test multiple decay levels\n", - " try:\n", - " # Steps 12-20: should stay at 0.01\n", - " for i in range(9):\n", - " scheduler.step()\n", - " \n", - " assert 
abs(scheduler.get_lr() - 0.01) < 1e-6, f\"Learning rate should be 0.01 after 20 steps, got {scheduler.get_lr()}\"\n", - " \n", - " # Step 21: another decay\n", - " scheduler.step()\n", - " expected_lr = 0.01 * 0.1 # 0.001\n", - " assert abs(scheduler.get_lr() - expected_lr) < 1e-6, f\"Learning rate should be {expected_lr} after 21 steps, got {scheduler.get_lr()}\"\n", - " print(\"✅ Multiple decay levels work correctly\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ Multiple decay levels failed: {e}\")\n", - " raise\n", - " \n", - " # Test with different optimizer\n", - " try:\n", - " w2 = Variable(2.0, requires_grad=True)\n", - " adam_optimizer = Adam([w2], learning_rate=0.001)\n", - " adam_scheduler = StepLR(adam_optimizer, step_size=5, gamma=0.5)\n", - " \n", - " # Test initial learning rate\n", - " assert adam_scheduler.get_lr() == 0.001, f\"Initial Adam learning rate should be 0.001, got {adam_scheduler.get_lr()}\"\n", - " \n", - " # Test decay after 5 steps\n", - " for i in range(5):\n", - " adam_scheduler.step()\n", - " \n", - " # Learning rate should still be 0.001 after 5 steps\n", - " assert adam_scheduler.get_lr() == 0.001, f\"Adam learning rate should still be 0.001 after 5 steps, got {adam_scheduler.get_lr()}\"\n", - " \n", - " # Step 6: decay should occur\n", - " adam_scheduler.step()\n", - " expected_lr = 0.001 * 0.5 # 0.0005\n", - " assert abs(adam_scheduler.get_lr() - expected_lr) < 1e-6, f\"Adam learning rate should be {expected_lr} after 6 steps, got {adam_scheduler.get_lr()}\"\n", - " print(\"✅ Works with different optimizers\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ Different optimizers failed: {e}\")\n", - " raise\n", - "\n", - " print(\"🎯 Step learning rate scheduler behavior:\")\n", - " print(\" Reduces learning rate at regular intervals\")\n", - " print(\" Multiplies current rate by gamma factor\")\n", - " print(\" Works with any optimizer (SGD, Adam, etc.)\")\n", - " print(\"📈 Progress: Step Learning Rate 
Scheduler ✓\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "dd4c500b", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Step 5: Integration - Complete Training Example\n", - "\n", - "### Putting It All Together\n", - "Let's see how optimizers enable complete neural network training:\n", - "\n", - "1. **Forward pass**: Compute predictions\n", - "2. **Loss computation**: Compare with targets\n", - "3. **Backward pass**: Compute gradients\n", - "4. **Optimizer step**: Update parameters\n", - "5. **Learning rate scheduling**: Adjust learning rate\n", - "\n", - "### The Modern Training Loop\n", - "```python\n", - "# Setup\n", - "optimizer = Adam(model.parameters(), learning_rate=0.001)\n", - "scheduler = StepLR(optimizer, step_size=10, gamma=0.1)\n", - "\n", - "# Training loop\n", - "for epoch in range(num_epochs):\n", - " for batch in dataloader:\n", - " # Forward pass\n", - " predictions = model(batch.inputs)\n", - " loss = criterion(predictions, batch.targets)\n", - " \n", - " # Backward pass\n", - " optimizer.zero_grad()\n", - " loss.backward()\n", - " optimizer.step()\n", - " \n", - " # Update learning rate\n", - " scheduler.step()\n", - "```\n", - "\n", - "Let's implement a complete training example!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "90a4b427", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "training-integration", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "def train_simple_model():\n", - " \"\"\"\n", - " Complete training example using optimizers.\n", - " \n", - " TODO: Implement a complete training loop.\n", - " \n", - " APPROACH:\n", - " 1. Create a simple model (linear regression)\n", - " 2. Generate training data\n", - " 3. Set up optimizer and scheduler\n", - " 4. 
Train for several epochs\n", - " 5. Show convergence\n", - " \n", - " LEARNING OBJECTIVE:\n", - " - See how optimizers enable real learning\n", - " - Compare SGD vs Adam performance\n", - " - Understand the complete training workflow\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " print(\"Training simple linear regression model...\")\n", - " \n", - " # Create simple model: y = w*x + b\n", - " w = Variable(0.1, requires_grad=True) # Initialize near zero\n", - " b = Variable(0.0, requires_grad=True)\n", - " \n", - " # Training data: y = 2*x + 1\n", - " x_data = [1.0, 2.0, 3.0, 4.0, 5.0]\n", - " y_data = [3.0, 5.0, 7.0, 9.0, 11.0]\n", - " \n", - " # Try SGD first\n", - " print(\"\\n🔍 Training with SGD...\")\n", - " optimizer_sgd = SGD([w, b], learning_rate=0.01, momentum=0.9)\n", - " \n", - " for epoch in range(60):\n", - " total_loss = 0\n", - " \n", - " for x_val, y_val in zip(x_data, y_data):\n", - " # Forward pass\n", - " x = Variable(x_val, requires_grad=False)\n", - " y_target = Variable(y_val, requires_grad=False)\n", - " \n", - " # Prediction: y = w*x + b\n", - " try:\n", - " from tinytorch.core.autograd import add, multiply, subtract\n", - " except ImportError:\n", - " setup_import_paths()\n", - " from autograd_dev import add, multiply, subtract\n", - " \n", - " prediction = add(multiply(w, x), b)\n", - " \n", - " # Loss: (prediction - target)^2\n", - " error = subtract(prediction, y_target)\n", - " loss = multiply(error, error)\n", - " \n", - " # Backward pass\n", - " optimizer_sgd.zero_grad()\n", - " loss.backward()\n", - " optimizer_sgd.step()\n", - " \n", - " total_loss += loss.data.data.item()\n", - " \n", - " if epoch % 10 == 0:\n", - " print(f\"Epoch {epoch}: Loss = {total_loss:.4f}, w = {w.data.data.item():.3f}, b = {b.data.data.item():.3f}\")\n", - " \n", - " sgd_final_w = w.data.data.item()\n", - " sgd_final_b = b.data.data.item()\n", - " \n", - " # Reset parameters and try Adam\n", - " print(\"\\n🔍 Training with Adam...\")\n", - " w.data = 
Tensor(0.1)\n", - " b.data = Tensor(0.0)\n", - " \n", - " optimizer_adam = Adam([w, b], learning_rate=0.01)\n", - " \n", - " for epoch in range(60):\n", - " total_loss = 0\n", - " \n", - " for x_val, y_val in zip(x_data, y_data):\n", - " # Forward pass\n", - " x = Variable(x_val, requires_grad=False)\n", - " y_target = Variable(y_val, requires_grad=False)\n", - " \n", - " # Prediction: y = w*x + b\n", - " prediction = add(multiply(w, x), b)\n", - " \n", - " # Loss: (prediction - target)^2\n", - " error = subtract(prediction, y_target)\n", - " loss = multiply(error, error)\n", - " \n", - " # Backward pass\n", - " optimizer_adam.zero_grad()\n", - " loss.backward()\n", - " optimizer_adam.step()\n", - " \n", - " total_loss += loss.data.data.item()\n", - " \n", - " if epoch % 10 == 0:\n", - " print(f\"Epoch {epoch}: Loss = {total_loss:.4f}, w = {w.data.data.item():.3f}, b = {b.data.data.item():.3f}\")\n", - " \n", - " adam_final_w = w.data.data.item()\n", - " adam_final_b = b.data.data.item()\n", - " \n", - " print(f\"\\n📊 Results:\")\n", - " print(f\"Target: w = 2.0, b = 1.0\")\n", - " print(f\"SGD: w = {sgd_final_w:.3f}, b = {sgd_final_b:.3f}\")\n", - " print(f\"Adam: w = {adam_final_w:.3f}, b = {adam_final_b:.3f}\")\n", - " \n", - " return sgd_final_w, sgd_final_b, adam_final_w, adam_final_b\n", - " ### END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "9d1c86a8", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Unit Test: Complete Training Integration\n", - "\n", - "Let's test your complete training integration! This demonstrates optimizers working together in a realistic training scenario.\n", - "\n", - "**This is a unit test** - it tests the complete training workflow with optimizers in isolation." 
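For intuition, the same regression can be solved with plain NumPy gradient descent, with no autograd, using the analytic gradients of mean squared error (the learning rate and epoch count here are illustrative choices for this tiny dataset):

```python
import numpy as np

# Training data: y = 2x + 1, matching train_simple_model above
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + 1

w, b, lr = 0.1, 0.0, 0.05
for epoch in range(500):
    pred = w * x + b
    err = pred - y
    # Analytic gradients of MSE = mean((pred - y)**2)
    grad_w = 2 * np.mean(err * x)
    grad_b = 2 * np.mean(err)
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # → 2.0 1.0
```

This is exactly what the Variable-based version computes via `loss.backward()`; writing the gradients by hand makes the optimizer's role (turning gradients into parameter updates) easier to see.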
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "dd250982", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "test-training-integration", - "locked": true, - "points": 25, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_module_unit_training():\n", - " \"\"\"Comprehensive unit test for complete training integration with optimizers.\"\"\"\n", - " print(\"🔬 Unit Test: Complete Training Integration...\")\n", - " \n", - " # Test training with SGD and Adam\n", - " try:\n", - " sgd_w, sgd_b, adam_w, adam_b = train_simple_model()\n", - " \n", - " # Test SGD convergence\n", - " assert abs(sgd_w - 2.0) < 0.1, f\"SGD should converge close to w=2.0, got {sgd_w}\"\n", - " assert abs(sgd_b - 1.0) < 0.1, f\"SGD should converge close to b=1.0, got {sgd_b}\"\n", - " print(\"✅ SGD convergence works\")\n", - " \n", - " # Test Adam convergence (may be different due to adaptive learning rates)\n", - " assert abs(adam_w - 2.0) < 1.0, f\"Adam should converge reasonably close to w=2.0, got {adam_w}\"\n", - " assert abs(adam_b - 1.0) < 1.0, f\"Adam should converge reasonably close to b=1.0, got {adam_b}\"\n", - " print(\"✅ Adam convergence works\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ Training integration failed: {e}\")\n", - " raise\n", - " \n", - " # Test optimizer comparison\n", - " try:\n", - " # Both optimizers should achieve reasonable results\n", - " sgd_error = (sgd_w - 2.0)**2 + (sgd_b - 1.0)**2\n", - " adam_error = (adam_w - 2.0)**2 + (adam_b - 1.0)**2\n", - " \n", - " # SGD should reach low error (< 0.1); Adam is allowed a looser bound (< 1.0)\n", - " assert sgd_error < 0.1, f\"SGD error should be < 0.1, got {sgd_error}\"\n", - " assert adam_error < 1.0, f\"Adam error should be < 1.0, got {adam_error}\"\n", - " print(\"✅ Optimizer comparison works\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ Optimizer comparison failed: {e}\")\n", - " raise\n", - " \n", 
- " # Test gradient flow\n", - " try:\n", - " # Create a simple test to verify gradients flow correctly\n", - " w = Variable(1.0, requires_grad=True)\n", - " b = Variable(0.0, requires_grad=True)\n", - " \n", - " # Set up simple gradients\n", - " w.grad = Variable(0.1)\n", - " b.grad = Variable(0.05)\n", - " \n", - " # Test SGD step\n", - " sgd_optimizer = SGD([w, b], learning_rate=0.1)\n", - " original_w = w.data.data.item()\n", - " original_b = b.data.data.item()\n", - " \n", - " sgd_optimizer.step()\n", - " \n", - " # Check updates\n", - " assert w.data.data.item() != original_w, \"SGD should update w\"\n", - " assert b.data.data.item() != original_b, \"SGD should update b\"\n", - " print(\"✅ Gradient flow works correctly\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ Gradient flow failed: {e}\")\n", - " raise\n", - "\n", - " print(\"🎯 Training integration behavior:\")\n", - " print(\" Optimizers successfully minimize loss functions\")\n", - " print(\" SGD and Adam both converge to target values\")\n", - " print(\" Gradient computation and updates work correctly\")\n", - " print(\" Ready for real neural network training\")\n", - " print(\"📈 Progress: Complete Training Integration ✓\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "8c75c2a9", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Step 6: ML Systems - Optimizer Performance Analysis\n", - "\n", - "### Real-World Challenge: Optimizer Selection and Tuning\n", - "\n", - "In production ML systems, choosing the right optimizer and hyperparameters can make the difference between:\n", - "- **Success**: Model converges to good performance in reasonable time\n", - "- **Failure**: Model doesn't converge, explodes, or takes too long to train\n", - "\n", - "### The Production Reality\n", - "When training large models (millions or billions of parameters):\n", - "- **Wrong optimizer**: Can waste 
weeks of expensive GPU time\n", - "- **Wrong learning rate**: Can cause gradient explosion or extremely slow convergence\n", - "- **Wrong scheduling**: Can prevent models from reaching optimal performance\n", - "- **Memory constraints**: Some optimizers use significantly more memory than others\n", - "\n", - "### What We'll Build\n", - "An **OptimizerConvergenceProfiler** that analyzes:\n", - "1. **Convergence patterns** across different optimizers\n", - "2. **Learning rate sensitivity** and optimal hyperparameters\n", - "3. **Computational cost vs convergence speed** trade-offs\n", - "4. **Gradient statistics** and update patterns\n", - "5. **Memory usage patterns** for different optimizers\n", - "\n", - "This mirrors tools used in production for optimizer selection and hyperparameter tuning." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "437f7d42", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "convergence-profiler", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class OptimizerConvergenceProfiler:\n", - " \"\"\"\n", - " ML Systems Tool: Optimizer Performance and Convergence Analysis\n", - " \n", - " Profiles convergence patterns, learning rate sensitivity, and computational costs\n", - " across different optimizers to guide production optimizer selection.\n", - " \n", - " This is 60% implementation focusing on core analysis capabilities:\n", - " - Convergence rate comparison across optimizers\n", - " - Learning rate sensitivity analysis\n", - " - Gradient statistics tracking\n", - " - Memory usage estimation\n", - " - Performance recommendations\n", - " \"\"\"\n", - " \n", - " def __init__(self):\n", - " \"\"\"\n", - " Initialize optimizer convergence profiler.\n", - " \n", - " TODO: Implement profiler initialization.\n", - " \n", - " APPROACH:\n", - " 1. 
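The memory-constraint point can be made concrete with a back-of-envelope sketch (the helper name and state-copy counts are illustrative; this counts only parameters plus optimizer state, not gradients or activations). Plain SGD keeps no extra state, SGD with momentum keeps one copy, and Adam keeps two (`m` and `v`):

```python
def optimizer_memory_estimate(n_params, bytes_per_value=4, optimizer="adam"):
    """Rough memory for parameters + optimizer state, in bytes (illustrative)."""
    state_copies = {"sgd": 0, "sgd_momentum": 1, "adam": 2}[optimizer]
    params_bytes = n_params * bytes_per_value
    return params_bytes + state_copies * params_bytes

# A 7B-parameter model in fp32:
n = 7_000_000_000
print(optimizer_memory_estimate(n, optimizer="sgd") / 1e9)   # 28.0 (GB)
print(optimizer_memory_estimate(n, optimizer="adam") / 1e9)  # 84.0 (GB)
```

This 3x multiplier is why large-model training often uses memory-reduced variants of Adam or shards optimizer state across devices.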
Initialize tracking dictionaries for different metrics\n", - " 2. Set up convergence analysis parameters\n", - " 3. Prepare memory and performance tracking\n", - " 4. Initialize recommendation engine components\n", - " \n", - " PRODUCTION CONTEXT:\n", - " In production, this profiler would run on representative tasks to:\n", - " - Select optimal optimizers for new models\n", - " - Tune hyperparameters before expensive training runs\n", - " - Predict training time and resource requirements\n", - " - Monitor training stability and convergence\n", - " \n", - " IMPLEMENTATION HINTS:\n", - " - Track convergence history per optimizer\n", - " - Store gradient statistics over time\n", - " - Monitor memory usage patterns\n", - " - Prepare for comparative analysis\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Convergence tracking\n", - " self.convergence_history = defaultdict(list) # {optimizer_name: [losses]}\n", - " self.gradient_norms = defaultdict(list) # {optimizer_name: [grad_norms]}\n", - " self.learning_rates = defaultdict(list) # {optimizer_name: [lr_values]}\n", - " self.step_times = defaultdict(list) # {optimizer_name: [step_durations]}\n", - " \n", - " # Performance metrics\n", - " self.memory_usage = defaultdict(list) # {optimizer_name: [memory_estimates]}\n", - " self.convergence_rates = {} # {optimizer_name: convergence_rate}\n", - " self.stability_scores = {} # {optimizer_name: stability_score}\n", - " \n", - " # Analysis parameters\n", - " self.convergence_threshold = 1e-6\n", - " self.stability_window = 10\n", - " self.gradient_explosion_threshold = 1e6\n", - " \n", - " # Recommendations\n", - " self.optimizer_rankings = {}\n", - " self.hyperparameter_suggestions = {}\n", - " ### END SOLUTION\n", - " \n", - " def profile_optimizer_convergence(self, optimizer_name: str, optimizer: Union[SGD, Adam], \n", - " training_function, initial_loss: float, \n", - " max_steps: int = 100) -> Dict[str, Any]:\n", - " \"\"\"\n", - " Profile convergence behavior of an 
optimizer on a specific task.\n", - " \n", - " Args:\n", - " optimizer_name: Name identifier for the optimizer\n", - " optimizer: Optimizer instance to profile\n", - " training_function: Function that performs one training step and returns loss\n", - " initial_loss: Starting loss value\n", - " max_steps: Maximum training steps to profile\n", - " \n", - " Returns:\n", - " Dictionary containing convergence analysis results\n", - " \n", - " TODO: Implement optimizer convergence profiling.\n", - " \n", - " APPROACH:\n", - " 1. Run training loop with the optimizer\n", - " 2. Track loss, gradients, learning rates at each step\n", - " 3. Measure step execution time\n", - " 4. Estimate memory usage\n", - " 5. Analyze convergence patterns and stability\n", - " 6. Generate performance metrics\n", - " \n", - " CONVERGENCE ANALYSIS:\n", - " - Track loss reduction over time\n", - " - Measure convergence rate (loss reduction per step)\n", - " - Detect convergence plateaus\n", - " - Identify gradient explosion or vanishing\n", - " - Assess training stability\n", - " \n", - " PRODUCTION INSIGHTS:\n", - " This analysis helps determine:\n", - " - Which optimizers converge fastest for specific model types\n", - " - Optimal learning rates for different optimizers\n", - " - Memory vs performance trade-offs\n", - " - Training stability and robustness\n", - " \n", - " IMPLEMENTATION HINTS:\n", - " - Use time.time() to measure step duration\n", - " - Calculate gradient norms across all parameters\n", - " - Track learning rate changes (for schedulers)\n", - " - Estimate memory from optimizer state size\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " import time\n", - " \n", - " print(f\"🔍 Profiling {optimizer_name} convergence...\")\n", - " \n", - " # Initialize tracking\n", - " losses = []\n", - " grad_norms = []\n", - " step_durations = []\n", - " lr_values = []\n", - " \n", - " previous_loss = initial_loss\n", - " convergence_step = None\n", - " \n", - " for step in 
range(max_steps):\n", - " step_start = time.time()\n", - " \n", - " # Perform training step\n", - " try:\n", - " current_loss = training_function()\n", - " losses.append(current_loss)\n", - " \n", - " # Calculate gradient norm\n", - " total_grad_norm = 0.0\n", - " param_count = 0\n", - " for param in optimizer.parameters:\n", - " if param.grad is not None:\n", - " grad_data = param.grad.data.data\n", - " if hasattr(grad_data, 'flatten'):\n", - " grad_norm = np.linalg.norm(grad_data.flatten())\n", - " else:\n", - " grad_norm = abs(float(grad_data))\n", - " total_grad_norm += grad_norm ** 2\n", - " param_count += 1\n", - " \n", - " if param_count > 0:\n", - " total_grad_norm = (total_grad_norm / param_count) ** 0.5\n", - " grad_norms.append(total_grad_norm)\n", - " \n", - " # Track learning rate\n", - " lr_values.append(optimizer.learning_rate)\n", - " \n", - " # Check convergence\n", - " if convergence_step is None and abs(current_loss - previous_loss) < self.convergence_threshold:\n", - " convergence_step = step\n", - " \n", - " previous_loss = current_loss\n", - " \n", - " except Exception as e:\n", - " print(f\"⚠️ Training step {step} failed: {e}\")\n", - " break\n", - " \n", - " step_end = time.time()\n", - " step_durations.append(step_end - step_start)\n", - " \n", - " # Early stopping for exploded gradients\n", - " if total_grad_norm > self.gradient_explosion_threshold:\n", - " print(f\"⚠️ Gradient explosion detected at step {step}\")\n", - " break\n", - " \n", - " # Store results\n", - " self.convergence_history[optimizer_name] = losses\n", - " self.gradient_norms[optimizer_name] = grad_norms\n", - " self.learning_rates[optimizer_name] = lr_values\n", - " self.step_times[optimizer_name] = step_durations\n", - " \n", - " # Analyze results\n", - " analysis = self._analyze_convergence_profile(optimizer_name, losses, grad_norms, \n", - " step_durations, convergence_step)\n", - " \n", - " return analysis\n", - " ### END SOLUTION\n", - " \n", - " def 
compare_optimizers(self, profiles: Dict[str, Dict]) -> Dict[str, Any]:\n", - " \"\"\"\n", - " Compare multiple optimizer profiles and generate recommendations.\n", - " \n", - " Args:\n", - " profiles: Dictionary mapping optimizer names to their profile results\n", - " \n", - " Returns:\n", - " Comprehensive comparison analysis with recommendations\n", - " \n", - " TODO: Implement optimizer comparison and ranking.\n", - " \n", - " APPROACH:\n", - " 1. Analyze convergence speed across optimizers\n", - " 2. Compare final performance and stability\n", - " 3. Assess computational efficiency\n", - " 4. Generate rankings and recommendations\n", - " 5. Identify optimal hyperparameters\n", - " \n", - " COMPARISON METRICS:\n", - " - Steps to convergence\n", - " - Final loss achieved\n", - " - Training stability (loss variance)\n", - " - Computational cost per step\n", - " - Memory efficiency\n", - " - Gradient explosion resistance\n", - " \n", - " PRODUCTION VALUE:\n", - " This comparison guides:\n", - " - Optimizer selection for new projects\n", - " - Hyperparameter optimization strategies\n", - " - Resource allocation decisions\n", - " - Training pipeline design\n", - " \n", - " IMPLEMENTATION HINTS:\n", - " - Normalize metrics for fair comparison\n", - " - Weight different factors based on importance\n", - " - Generate actionable recommendations\n", - " - Consider trade-offs between speed and stability\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " comparison = {\n", - " 'convergence_speed': {},\n", - " 'final_performance': {},\n", - " 'stability': {},\n", - " 'efficiency': {},\n", - " 'rankings': {},\n", - " 'recommendations': {}\n", - " }\n", - " \n", - " print(\"📊 Comparing optimizer performance...\")\n", - " \n", - " # Analyze each optimizer\n", - " for opt_name, profile in profiles.items():\n", - " # Convergence speed\n", - " convergence_step = profile.get('convergence_step', len(self.convergence_history[opt_name]))\n", - " 
comparison['convergence_speed'][opt_name] = convergence_step\n", - " \n", - " # Final performance\n", - " losses = self.convergence_history[opt_name]\n", - " if losses:\n", - " final_loss = losses[-1]\n", - " comparison['final_performance'][opt_name] = final_loss\n", - " \n", - " # Stability (inverse coefficient of variation over the stability window)\n", - " if len(losses) >= self.stability_window:\n", - " recent_losses = losses[-self.stability_window:]\n", - " stability = 1.0 / (1.0 + np.std(recent_losses) / (np.mean(recent_losses) + 1e-8))\n", - " comparison['stability'][opt_name] = stability\n", - " \n", - " # Efficiency (loss reduction per unit time)\n", - " step_times = self.step_times[opt_name]\n", - " if losses and step_times:\n", - " initial_loss = losses[0]\n", - " final_loss = losses[-1]\n", - " total_time = sum(step_times)\n", - " efficiency = (initial_loss - final_loss) / (total_time + 1e-8)\n", - " comparison['efficiency'][opt_name] = efficiency\n", - " \n", - " # Generate rankings\n", - " metrics = ['convergence_speed', 'final_performance', 'stability', 'efficiency']\n", - " for metric in metrics:\n", - " if comparison[metric]:\n", - " if metric == 'convergence_speed':\n", - " # Lower is better for convergence speed\n", - " sorted_opts = sorted(comparison[metric].items(), key=lambda x: x[1])\n", - " elif metric == 'final_performance':\n", - " # Lower is better for final loss\n", - " sorted_opts = sorted(comparison[metric].items(), key=lambda x: x[1])\n", - " else:\n", - " # Higher is better for stability and efficiency\n", - " sorted_opts = sorted(comparison[metric].items(), key=lambda x: x[1], reverse=True)\n", - " \n", - " comparison['rankings'][metric] = [opt for opt, _ in sorted_opts]\n", - " \n", - " # Generate recommendations\n", - " recommendations = []\n", - " \n", - " # Best overall optimizer\n", - " if comparison['rankings']:\n", - " # Simple scoring: rank position across metrics\n", - " from collections import defaultdict\n", - " scores = defaultdict(float)\n", - " for metric, ranking in
comparison['rankings'].items():\n", - " for i, opt_name in enumerate(ranking):\n", - " scores[opt_name] += len(ranking) - i\n", - " \n", - " best_optimizer = max(scores.items(), key=lambda x: x[1])[0]\n", - " recommendations.append(f\"🏆 Best overall optimizer: {best_optimizer}\")\n", - " \n", - " # Specific recommendations\n", - " if 'convergence_speed' in comparison['rankings']:\n", - " fastest = comparison['rankings']['convergence_speed'][0]\n", - " recommendations.append(f\"⚡ Fastest convergence: {fastest}\")\n", - " \n", - " if 'stability' in comparison['rankings']:\n", - " most_stable = comparison['rankings']['stability'][0]\n", - " recommendations.append(f\"🎯 Most stable training: {most_stable}\")\n", - " \n", - " if 'efficiency' in comparison['rankings']:\n", - " most_efficient = comparison['rankings']['efficiency'][0]\n", - " recommendations.append(f\"💰 Most compute-efficient: {most_efficient}\")\n", - " \n", - " comparison['recommendations']['summary'] = recommendations\n", - " \n", - " return comparison\n", - " ### END SOLUTION\n", - " \n", - " def analyze_learning_rate_sensitivity(self, optimizer_class, learning_rates: List[float],\n", - " training_function, steps: int = 50) -> Dict[str, Any]:\n", - " \"\"\"\n", - " Analyze optimizer sensitivity to different learning rates.\n", - " \n", - " Args:\n", - " optimizer_class: Optimizer class (SGD or Adam)\n", - " learning_rates: List of learning rates to test\n", - " training_function: Function that creates and runs training\n", - " steps: Number of training steps per learning rate\n", - " \n", - " Returns:\n", - " Learning rate sensitivity analysis\n", - " \n", - " TODO: Implement learning rate sensitivity analysis.\n", - " \n", - " APPROACH:\n", - " 1. Test optimizer with different learning rates\n", - " 2. Measure convergence performance for each rate\n", - " 3. Identify optimal learning rate range\n", - " 4. Detect learning rate instability regions\n", - " 5. 
Generate learning rate recommendations\n", - " \n", - " SENSITIVITY ANALYSIS:\n", - " - Plot loss curves for different learning rates\n", - " - Identify optimal learning rate range\n", - " - Detect gradient explosion thresholds\n", - " - Measure convergence robustness\n", - " - Generate adaptive scheduling suggestions\n", - " \n", - " PRODUCTION INSIGHTS:\n", - " This analysis enables:\n", - " - Automatic learning rate tuning\n", - " - Learning rate scheduling optimization\n", - " - Gradient explosion prevention\n", - " - Training stability improvement\n", - " \n", - " IMPLEMENTATION HINTS:\n", - " - Reset model state for each learning rate test\n", - " - Track convergence metrics consistently\n", - " - Identify learning rate sweet spots\n", - " - Flag unstable learning rate regions\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " print(\"🔍 Analyzing learning rate sensitivity...\")\n", - " \n", - " lr_analysis = {\n", - " 'learning_rates': learning_rates,\n", - " 'final_losses': [],\n", - " 'convergence_steps': [],\n", - " 'stability_scores': [],\n", - " 'gradient_explosions': [],\n", - " 'optimal_range': None,\n", - " 'recommendations': []\n", - " }\n", - " \n", - " # Test each learning rate\n", - " for lr in learning_rates:\n", - " print(f\" Testing learning rate: {lr}\")\n", - " \n", - " try:\n", - " # Create optimizer with current learning rate\n", - " # This is a simplified test - in production, would reset model state\n", - " losses, grad_norms = training_function(lr, steps)\n", - " \n", - " if losses:\n", - " final_loss = losses[-1]\n", - " lr_analysis['final_losses'].append(final_loss)\n", - " \n", - " # Find convergence step\n", - " convergence_step = steps\n", - " for i in range(1, len(losses)):\n", - " if abs(losses[i] - losses[i-1]) < self.convergence_threshold:\n", - " convergence_step = i\n", - " break\n", - " lr_analysis['convergence_steps'].append(convergence_step)\n", - " \n", - " # Calculate stability\n", - " if len(losses) >= 10:\n", - " 
recent_losses = losses[-10:]\n", - " stability = 1.0 / (1.0 + np.std(recent_losses) / (np.mean(recent_losses) + 1e-8))\n", - " lr_analysis['stability_scores'].append(stability)\n", - " else:\n", - " lr_analysis['stability_scores'].append(0.0)\n", - " \n", - " # Check for gradient explosion\n", - " max_grad_norm = max(grad_norms) if grad_norms else 0.0\n", - " explosion = max_grad_norm > self.gradient_explosion_threshold\n", - " lr_analysis['gradient_explosions'].append(explosion)\n", - " \n", - " else:\n", - " # Failed to get losses\n", - " lr_analysis['final_losses'].append(float('inf'))\n", - " lr_analysis['convergence_steps'].append(steps)\n", - " lr_analysis['stability_scores'].append(0.0)\n", - " lr_analysis['gradient_explosions'].append(True)\n", - " \n", - " except Exception as e:\n", - " print(f\" ⚠️ Failed with lr={lr}: {e}\")\n", - " lr_analysis['final_losses'].append(float('inf'))\n", - " lr_analysis['convergence_steps'].append(steps)\n", - " lr_analysis['stability_scores'].append(0.0)\n", - " lr_analysis['gradient_explosions'].append(True)\n", - " \n", - " # Find optimal learning rate range\n", - " valid_indices = [i for i, (loss, explosion) in \n", - " enumerate(zip(lr_analysis['final_losses'], lr_analysis['gradient_explosions']))\n", - " if not explosion and loss != float('inf')]\n", - " \n", - " if valid_indices:\n", - " # Find learning rate with best final loss among stable ones\n", - " stable_losses = [(i, lr_analysis['final_losses'][i]) for i in valid_indices]\n", - " best_idx = min(stable_losses, key=lambda x: x[1])[0]\n", - " \n", - " # Define optimal range around best learning rate\n", - " best_lr = learning_rates[best_idx]\n", - " lr_analysis['optimal_range'] = (best_lr * 0.1, best_lr * 10.0)\n", - " \n", - " # Generate recommendations\n", - " recommendations = []\n", - " recommendations.append(f\"🎯 Optimal learning rate: {best_lr:.2e}\")\n", - " recommendations.append(f\"📈 Safe range: {lr_analysis['optimal_range'][0]:.2e} - 
{lr_analysis['optimal_range'][1]:.2e}\")\n", - " \n", - " # Learning rate scheduling suggestions\n", - " if best_idx > 0:\n", - " recommendations.append(\"💡 Consider starting with higher LR and decaying\")\n", - " if any(lr_analysis['gradient_explosions']):\n", - " max_safe_lr = max([learning_rates[i] for i in valid_indices])\n", - " recommendations.append(f\"⚠️ Avoid learning rates above {max_safe_lr:.2e}\")\n", - " \n", - " lr_analysis['recommendations'] = recommendations\n", - " else:\n", - " lr_analysis['recommendations'] = [\"⚠️ No stable learning rates found - try lower values\"]\n", - " \n", - " return lr_analysis\n", - " ### END SOLUTION\n", - " \n", - " def estimate_memory_usage(self, optimizer: Union[SGD, Adam], num_parameters: int) -> Dict[str, float]:\n", - " \"\"\"\n", - " Estimate memory usage for different optimizers.\n", - " \n", - " Args:\n", - " optimizer: Optimizer instance\n", - " num_parameters: Number of model parameters\n", - " \n", - " Returns:\n", - " Memory usage estimates in MB\n", - " \n", - " TODO: Implement memory usage estimation.\n", - " \n", - " APPROACH:\n", - " 1. Calculate parameter memory requirements\n", - " 2. Estimate optimizer state memory\n", - " 3. Account for gradient storage\n", - " 4. Include temporary computation memory\n", - " 5. 
Provide memory scaling predictions\n", - " \n", - " MEMORY ANALYSIS:\n", - " - Parameter storage: num_params * 4 bytes (float32)\n", - " - Gradient storage: num_params * 4 bytes\n", - " - Optimizer state: varies by optimizer type\n", - " - SGD momentum: num_params * 4 bytes\n", - " - Adam: num_params * 8 bytes (first + second moments)\n", - " \n", - " PRODUCTION VALUE:\n", - " Memory estimation helps:\n", - " - Select optimizers for memory-constrained environments\n", - " - Plan GPU memory allocation\n", - " - Scale to larger models\n", - " - Optimize batch sizes\n", - " \n", - " IMPLEMENTATION HINTS:\n", - " - Use typical float32 size (4 bytes)\n", - " - Account for optimizer-specific state\n", - " - Include gradient accumulation overhead\n", - " - Provide scaling estimates\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Base memory requirements\n", - " bytes_per_param = 4 # float32\n", - " \n", - " memory_breakdown = {\n", - " 'parameters_mb': num_parameters * bytes_per_param / (1024 * 1024),\n", - " 'gradients_mb': num_parameters * bytes_per_param / (1024 * 1024),\n", - " 'optimizer_state_mb': 0.0,\n", - " 'total_mb': 0.0\n", - " }\n", - " \n", - " # Optimizer-specific state memory\n", - " if isinstance(optimizer, SGD):\n", - " if optimizer.momentum > 0:\n", - " # Momentum buffers\n", - " memory_breakdown['optimizer_state_mb'] = num_parameters * bytes_per_param / (1024 * 1024)\n", - " else:\n", - " memory_breakdown['optimizer_state_mb'] = 0.0\n", - " elif isinstance(optimizer, Adam):\n", - " # First and second moment estimates\n", - " memory_breakdown['optimizer_state_mb'] = num_parameters * 2 * bytes_per_param / (1024 * 1024)\n", - " \n", - " # Calculate total\n", - " memory_breakdown['total_mb'] = (\n", - " memory_breakdown['parameters_mb'] + \n", - " memory_breakdown['gradients_mb'] + \n", - " memory_breakdown['optimizer_state_mb']\n", - " )\n", - " \n", - " # Add efficiency estimates\n", - " memory_breakdown['memory_efficiency'] = 
memory_breakdown['parameters_mb'] / memory_breakdown['total_mb']\n", - " memory_breakdown['overhead_ratio'] = memory_breakdown['optimizer_state_mb'] / memory_breakdown['parameters_mb']\n", - " \n", - " return memory_breakdown\n", - " ### END SOLUTION\n", - " \n", - " def generate_production_recommendations(self, analysis_results: Dict[str, Any]) -> List[str]:\n", - " \"\"\"\n", - " Generate actionable recommendations for production optimizer usage.\n", - " \n", - " Args:\n", - " analysis_results: Combined results from convergence and sensitivity analysis\n", - " \n", - " Returns:\n", - " List of production recommendations\n", - " \n", - " TODO: Implement production recommendation generation.\n", - " \n", - " APPROACH:\n", - " 1. Analyze convergence patterns and stability\n", - " 2. Consider computational efficiency requirements\n", - " 3. Account for memory constraints\n", - " 4. Generate optimizer selection guidance\n", - " 5. Provide hyperparameter tuning suggestions\n", - " \n", - " RECOMMENDATION CATEGORIES:\n", - " - Optimizer selection for different scenarios\n", - " - Learning rate and scheduling strategies\n", - " - Memory optimization techniques\n", - " - Training stability improvements\n", - " - Production deployment considerations\n", - " \n", - " PRODUCTION CONTEXT:\n", - " These recommendations guide:\n", - " - ML engineer optimizer selection\n", - " - DevOps resource allocation\n", - " - Training pipeline optimization\n", - " - Cost reduction strategies\n", - " \n", - " IMPLEMENTATION HINTS:\n", - " - Provide specific, actionable advice\n", - " - Consider different deployment scenarios\n", - " - Include quantitative guidelines\n", - " - Address common production challenges\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " recommendations = []\n", - " \n", - " # Optimizer selection recommendations\n", - " recommendations.append(\"🔧 OPTIMIZER SELECTION GUIDE:\")\n", - " recommendations.append(\" • SGD + Momentum: Best for large batch training, proven 
stability\")\n", - " recommendations.append(\" • Adam: Best for rapid prototyping, adaptive learning rates\")\n", - " recommendations.append(\" • Consider memory constraints: SGD uses ~50% less memory than Adam\")\n", - " \n", - " # Learning rate recommendations\n", - " if 'learning_rate_analysis' in analysis_results:\n", - " lr_analysis = analysis_results['learning_rate_analysis']\n", - " if lr_analysis.get('optimal_range'):\n", - " opt_range = lr_analysis['optimal_range']\n", - " recommendations.append(f\"📈 LEARNING RATE GUIDANCE:\")\n", - " recommendations.append(f\" • Start with: {opt_range[0]:.2e}\")\n", - " recommendations.append(f\" • Safe upper bound: {opt_range[1]:.2e}\")\n", - " recommendations.append(\" • Use learning rate scheduling for best results\")\n", - " \n", - " # Convergence recommendations\n", - " if 'convergence_comparison' in analysis_results:\n", - " comparison = analysis_results['convergence_comparison']\n", - " if 'recommendations' in comparison and 'summary' in comparison['recommendations']:\n", - " recommendations.append(\"🎯 CONVERGENCE OPTIMIZATION:\")\n", - " for rec in comparison['recommendations']['summary']:\n", - " recommendations.append(f\" • {rec}\")\n", - " \n", - " # Production deployment recommendations\n", - " recommendations.append(\"🚀 PRODUCTION DEPLOYMENT:\")\n", - " recommendations.append(\" • Monitor gradient norms to detect training instability\")\n", - " recommendations.append(\" • Implement gradient clipping for large models\")\n", - " recommendations.append(\" • Use learning rate warmup for transformer architectures\")\n", - " recommendations.append(\" • Consider mixed precision training to reduce memory usage\")\n", - " \n", - " # Scaling recommendations\n", - " recommendations.append(\"📊 SCALING CONSIDERATIONS:\")\n", - " recommendations.append(\" • Large batch training: Prefer SGD with linear learning rate scaling\")\n", - " recommendations.append(\" • Distributed training: Use synchronized optimizers\")\n", - " 
recommendations.append(\" • Memory-constrained: Choose SGD or use gradient accumulation\")\n", - " recommendations.append(\" • Fine-tuning: Use lower learning rates (10x-100x smaller)\")\n", - " \n", - " # Monitoring recommendations\n", - " recommendations.append(\"📈 MONITORING & DEBUGGING:\")\n", - " recommendations.append(\" • Track loss smoothness to detect learning rate issues\")\n", - " recommendations.append(\" • Monitor gradient norms for explosion/vanishing detection\")\n", - " recommendations.append(\" • Log learning rate schedules for reproducibility\")\n", - " recommendations.append(\" • Profile memory usage to optimize batch sizes\")\n", - " \n", - " return recommendations\n", - " ### END SOLUTION\n", - " \n", - " def _analyze_convergence_profile(self, optimizer_name: str, losses: List[float], \n", - " grad_norms: List[float], step_durations: List[float],\n", - " convergence_step: Optional[int]) -> Dict[str, Any]:\n", - " \"\"\"\n", - " Internal helper to analyze convergence profile data.\n", - " \n", - " Args:\n", - " optimizer_name: Name of the optimizer\n", - " losses: List of loss values over training\n", - " grad_norms: List of gradient norms over training\n", - " step_durations: List of step execution times\n", - " convergence_step: Step where convergence was detected (if any)\n", - " \n", - " Returns:\n", - " Analysis results dictionary\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " analysis = {\n", - " 'optimizer_name': optimizer_name,\n", - " 'total_steps': len(losses),\n", - " 'convergence_step': convergence_step,\n", - " 'final_loss': losses[-1] if losses else float('inf'),\n", - " 'initial_loss': losses[0] if losses else float('inf'),\n", - " 'loss_reduction': 0.0,\n", - " 'convergence_rate': 0.0,\n", - " 'stability_score': 0.0,\n", - " 'average_step_time': 0.0,\n", - " 'gradient_health': 'unknown'\n", - " }\n", - " \n", - " if losses:\n", - " # Calculate loss reduction\n", - " initial_loss = losses[0]\n", - " final_loss = losses[-1]\n", 
- " analysis['loss_reduction'] = initial_loss - final_loss\n", - " \n", - " # Calculate convergence rate (loss reduction per step)\n", - " if len(losses) > 1:\n", - " analysis['convergence_rate'] = analysis['loss_reduction'] / len(losses)\n", - " \n", - " # Calculate stability (inverse of coefficient of variation)\n", - " if len(losses) >= self.stability_window:\n", - " recent_losses = losses[-self.stability_window:]\n", - " mean_loss = np.mean(recent_losses)\n", - " std_loss = np.std(recent_losses)\n", - " analysis['stability_score'] = 1.0 / (1.0 + std_loss / (mean_loss + 1e-8))\n", - " \n", - " # Average step time\n", - " if step_durations:\n", - " analysis['average_step_time'] = np.mean(step_durations)\n", - " \n", - " # Gradient health assessment\n", - " if grad_norms:\n", - " max_grad_norm = max(grad_norms)\n", - " avg_grad_norm = np.mean(grad_norms)\n", - " \n", - " if max_grad_norm > self.gradient_explosion_threshold:\n", - " analysis['gradient_health'] = 'exploding'\n", - " elif avg_grad_norm < 1e-8:\n", - " analysis['gradient_health'] = 'vanishing'\n", - " elif np.std(grad_norms) / (avg_grad_norm + 1e-8) > 2.0:\n", - " analysis['gradient_health'] = 'unstable'\n", - " else:\n", - " analysis['gradient_health'] = 'healthy'\n", - " \n", - " return analysis\n", - " ### END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "2f84481c", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Unit Test: OptimizerConvergenceProfiler\n", - "\n", - "Let's test your ML systems optimizer profiler! This tool helps analyze and compare optimizer performance in production scenarios.\n", - "\n", - "**This is a unit test** - it tests the OptimizerConvergenceProfiler class functionality." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2c9ebf15", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "test-convergence-profiler", - "locked": true, - "points": 30, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_unit_convergence_profiler():\n", - " \"\"\"Unit test for the OptimizerConvergenceProfiler implementation.\"\"\"\n", - " print(\"🔬 Unit Test: Optimizer Convergence Profiler...\")\n", - " \n", - " # Test profiler initialization\n", - " try:\n", - " profiler = OptimizerConvergenceProfiler()\n", - " \n", - " assert hasattr(profiler, 'convergence_history'), \"Should have convergence_history tracking\"\n", - " assert hasattr(profiler, 'gradient_norms'), \"Should have gradient_norms tracking\"\n", - " assert hasattr(profiler, 'learning_rates'), \"Should have learning_rates tracking\"\n", - " assert hasattr(profiler, 'step_times'), \"Should have step_times tracking\"\n", - " print(\"✅ Profiler initialization works\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ Profiler initialization failed: {e}\")\n", - " raise\n", - " \n", - " # Test memory usage estimation\n", - " try:\n", - " # Test SGD memory estimation\n", - " w = Variable(1.0, requires_grad=True)\n", - " sgd_optimizer = SGD([w], learning_rate=0.01, momentum=0.9)\n", - " \n", - " memory_estimate = profiler.estimate_memory_usage(sgd_optimizer, num_parameters=1000000)\n", - " \n", - " assert 'parameters_mb' in memory_estimate, \"Should estimate parameter memory\"\n", - " assert 'gradients_mb' in memory_estimate, \"Should estimate gradient memory\"\n", - " assert 'optimizer_state_mb' in memory_estimate, \"Should estimate optimizer state memory\"\n", - " assert 'total_mb' in memory_estimate, \"Should provide total memory estimate\"\n", - " \n", - " # SGD with momentum should have optimizer state\n", - " assert memory_estimate['optimizer_state_mb'] > 0, \"SGD with 
momentum should have state memory\"\n", - " print(\"✅ Memory usage estimation works\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ Memory usage estimation failed: {e}\")\n", - " raise\n", - " \n", - " # Test simple convergence analysis\n", - " try:\n", - " # Create a simple training function whose loss decreases on each call\n", - " step_state = {'step': 0}\n", - " def simple_training_function():\n", - " loss = max(10.0 - step_state['step'] * 0.5, 0.5) # Decreasing loss, floored at 0.5\n", - " step_state['step'] += 1\n", - " return loss\n", - " \n", - " # Create test optimizer\n", - " w = Variable(1.0, requires_grad=True)\n", - " w.grad = Variable(0.1) # Set gradient for testing\n", - " test_optimizer = SGD([w], learning_rate=0.01)\n", - " \n", - " # Profile convergence (simplified test)\n", - " analysis = profiler.profile_optimizer_convergence(\n", - " optimizer_name=\"test_sgd\",\n", - " optimizer=test_optimizer,\n", - " training_function=simple_training_function,\n", - " initial_loss=10.0,\n", - " max_steps=10\n", - " )\n", - " \n", - " assert 'optimizer_name' in analysis, \"Should return optimizer name\"\n", - " assert 'total_steps' in analysis, \"Should track total steps\"\n", - " assert 'final_loss' in analysis, \"Should track final loss\"\n", - " print(\"✅ Basic convergence profiling works\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ Convergence profiling failed: {e}\")\n", - " raise\n", - " \n", - " # Test production recommendations\n", - " try:\n", - " # Create mock analysis results\n", - " mock_results = {\n", - " 'learning_rate_analysis': {\n", - " 'optimal_range': (0.001, 0.1)\n", - " },\n", - " 'convergence_comparison': {\n", - " 'recommendations': {\n", - " 'summary': ['Best overall: Adam', 'Fastest: SGD']\n", - " }\n", - " }\n", - " }\n", - " \n", - " recommendations = profiler.generate_production_recommendations(mock_results)\n", - " \n", - " assert isinstance(recommendations, list), \"Should return list of recommendations\"\n", - " assert len(recommendations) > 0, \"Should
provide recommendations\"\n", - " \n", - " # Check for key recommendation categories\n", - " rec_text = ' '.join(recommendations)\n", - " assert 'OPTIMIZER SELECTION' in rec_text, \"Should include optimizer selection guidance\"\n", - " assert 'PRODUCTION DEPLOYMENT' in rec_text, \"Should include production deployment advice\"\n", - " print(\"✅ Production recommendations work\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ Production recommendations failed: {e}\")\n", - " raise\n", - " \n", - " # Test optimizer comparison framework\n", - " try:\n", - " # Create mock profiles for comparison\n", - " mock_profiles = {\n", - " 'sgd': {'convergence_step': 50, 'final_loss': 0.1},\n", - " 'adam': {'convergence_step': 30, 'final_loss': 0.05}\n", - " }\n", - " \n", - " # Add some mock data to profiler\n", - " profiler.convergence_history['sgd'] = [1.0, 0.5, 0.2, 0.1]\n", - " profiler.convergence_history['adam'] = [1.0, 0.3, 0.1, 0.05]\n", - " profiler.step_times['sgd'] = [0.01, 0.01, 0.01, 0.01]\n", - " profiler.step_times['adam'] = [0.02, 0.02, 0.02, 0.02]\n", - " \n", - " comparison = profiler.compare_optimizers(mock_profiles)\n", - " \n", - " assert 'convergence_speed' in comparison, \"Should compare convergence speed\"\n", - " assert 'final_performance' in comparison, \"Should compare final performance\"\n", - " assert 'stability' in comparison, \"Should compare stability\"\n", - " assert 'recommendations' in comparison, \"Should provide recommendations\"\n", - " print(\"✅ Optimizer comparison works\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ Optimizer comparison failed: {e}\")\n", - " raise\n", - "\n", - " print(\"🎯 Optimizer Convergence Profiler behavior:\")\n", - " print(\" Profiles convergence patterns across different optimizers\")\n", - " print(\" Estimates memory usage for production planning\")\n", - " print(\" Provides actionable recommendations for ML systems\")\n", - " print(\" Enables data-driven optimizer selection\")\n", - " 
print(\"📈 Progress: ML Systems Optimizer Analysis ✓\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "294a8978", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Step 7: Advanced Optimizer Features\n", - "\n", - "### Production Optimizer Patterns\n", - "\n", - "Real ML systems need more than basic optimizers. They need:\n", - "\n", - "1. **Gradient Clipping**: Prevents gradient explosion in large models\n", - "2. **Learning Rate Warmup**: Gradually increases learning rate at start\n", - "3. **Gradient Accumulation**: Simulates large batch training\n", - "4. **Mixed Precision**: Reduces memory usage with FP16\n", - "5. **Distributed Synchronization**: Coordinates optimizer across GPUs\n", - "\n", - "Let's implement these production patterns!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ac2a04de", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "advanced-optimizer-features", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class AdvancedOptimizerFeatures:\n", - " \"\"\"\n", - " Advanced optimizer features for production ML systems.\n", - " \n", - " Implements production-ready optimizer enhancements:\n", - " - Gradient clipping for stability\n", - " - Learning rate warmup strategies\n", - " - Gradient accumulation for large batches\n", - " - Mixed precision optimization patterns\n", - " - Distributed optimizer synchronization\n", - " \"\"\"\n", - " \n", - " def __init__(self):\n", - " \"\"\"\n", - " Initialize advanced optimizer features.\n", - " \n", - " TODO: Implement advanced features initialization.\n", - " \n", - " PRODUCTION CONTEXT:\n", - " These features are essential for:\n", - " - Training large language models (GPT, BERT)\n", - " - Computer vision at scale (ImageNet, COCO)\n", - " - 
Distributed training across multiple GPUs\n", - " - Memory-efficient training with limited resources\n", - " \n", - " IMPLEMENTATION HINTS:\n", - " - Initialize gradient clipping parameters\n", - " - Set up warmup scheduling state\n", - " - Prepare accumulation buffers\n", - " - Configure synchronization patterns\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Gradient clipping\n", - " self.max_grad_norm = 1.0\n", - " self.clip_enabled = False\n", - " \n", - " # Learning rate warmup\n", - " self.warmup_steps = 0\n", - " self.warmup_factor = 0.1\n", - " self.base_lr = 0.001\n", - " \n", - " # Gradient accumulation\n", - " self.accumulation_steps = 1\n", - " self.accumulated_gradients = {}\n", - " self.accumulation_count = 0\n", - " \n", - " # Mixed precision simulation\n", - " self.use_fp16 = False\n", - " self.loss_scale = 1.0\n", - " self.dynamic_loss_scaling = False\n", - " \n", - " # Distributed training simulation\n", - " self.world_size = 1\n", - " self.rank = 0\n", - " ### END SOLUTION\n", - " \n", - " def apply_gradient_clipping(self, optimizer: Union[SGD, Adam], max_norm: float = 1.0) -> float:\n", - " \"\"\"\n", - " Apply gradient clipping to prevent gradient explosion.\n", - " \n", - " Args:\n", - " optimizer: Optimizer with parameters to clip\n", - " max_norm: Maximum allowed gradient norm\n", - " \n", - " Returns:\n", - " Actual gradient norm before clipping\n", - " \n", - " TODO: Implement gradient clipping.\n", - " \n", - " APPROACH:\n", - " 1. Calculate total gradient norm across all parameters\n", - " 2. If norm exceeds max_norm, scale all gradients down\n", - " 3. Apply scaling factor to maintain gradient direction\n", - " 4. 
Return original norm for monitoring\n", - " \n", - " MATHEMATICAL FORMULATION:\n", - " total_norm = sqrt(sum(param_grad_norm^2 for all params))\n", - " if total_norm > max_norm:\n", - " clip_factor = max_norm / total_norm\n", - " for each param: param.grad *= clip_factor\n", - " \n", - " PRODUCTION VALUE:\n", - " Gradient clipping is essential for:\n", - " - Training RNNs and Transformers\n", - " - Preventing training instability\n", - " - Enabling higher learning rates\n", - " - Improving convergence reliability\n", - " \n", - " IMPLEMENTATION HINTS:\n", - " - Calculate global gradient norm\n", - " - Apply uniform scaling to all gradients\n", - " - Preserve gradient directions\n", - " - Return unclipped norm for logging\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Calculate total gradient norm\n", - " total_norm = 0.0\n", - " param_count = 0\n", - " \n", - " for param in optimizer.parameters:\n", - " if param.grad is not None:\n", - " grad_data = param.grad.data.data\n", - " if hasattr(grad_data, 'flatten'):\n", - " param_norm = np.linalg.norm(grad_data.flatten())\n", - " else:\n", - " param_norm = abs(float(grad_data))\n", - " total_norm += param_norm ** 2\n", - " param_count += 1\n", - " \n", - " if param_count > 0:\n", - " total_norm = total_norm ** 0.5\n", - " else:\n", - " return 0.0\n", - " \n", - " # Apply clipping if necessary\n", - " if total_norm > max_norm:\n", - " clip_factor = max_norm / total_norm\n", - " \n", - " for param in optimizer.parameters:\n", - " if param.grad is not None:\n", - " grad_data = param.grad.data.data\n", - " clipped_grad = grad_data * clip_factor\n", - " param.grad.data = Tensor(clipped_grad)\n", - " \n", - " return total_norm\n", - " ### END SOLUTION\n", - " \n", - " def apply_warmup_schedule(self, optimizer: Union[SGD, Adam], step: int, \n", - " warmup_steps: int, base_lr: float) -> float:\n", - " \"\"\"\n", - " Apply learning rate warmup schedule.\n", - " \n", - " Args:\n", - " optimizer: Optimizer to apply warmup 
to\n", - " step: Current training step\n", - " warmup_steps: Number of warmup steps\n", - " base_lr: Target learning rate after warmup\n", - " \n", - " Returns:\n", - " Current learning rate\n", - " \n", - " TODO: Implement learning rate warmup.\n", - " \n", - " APPROACH:\n", - " 1. If step < warmup_steps: gradually increase learning rate\n", - " 2. Use linear or polynomial warmup schedule\n", - " 3. Update optimizer's learning rate\n", - " 4. Return current learning rate for logging\n", - " \n", - " WARMUP STRATEGIES:\n", - " - Linear: lr = base_lr * (step / warmup_steps)\n", - " - Polynomial: lr = base_lr * ((step / warmup_steps) ^ power)\n", - " - Constant: lr = base_lr * warmup_factor for warmup_steps\n", - " \n", - " PRODUCTION VALUE:\n", - " Warmup prevents:\n", - " - Early training instability\n", - " - Poor initialization effects\n", - " - Gradient explosion at start\n", - " - Suboptimal convergence paths\n", - " \n", - " IMPLEMENTATION HINTS:\n", - " - Handle step=0 case (avoid division by zero)\n", - " - Use linear warmup for simplicity\n", - " - Update optimizer.learning_rate directly\n", - " - Smoothly transition to base learning rate\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " if step < warmup_steps and warmup_steps > 0:\n", - " # Linear warmup\n", - " warmup_factor = step / warmup_steps\n", - " current_lr = base_lr * warmup_factor\n", - " else:\n", - " # After warmup, use base learning rate\n", - " current_lr = base_lr\n", - " \n", - " # Update optimizer learning rate\n", - " optimizer.learning_rate = current_lr\n", - " \n", - " return current_lr\n", - " ### END SOLUTION\n", - " \n", - " def accumulate_gradients(self, optimizer: Union[SGD, Adam], accumulation_steps: int) -> bool:\n", - " \"\"\"\n", - " Accumulate gradients to simulate larger batch sizes.\n", - " \n", - " Args:\n", - " optimizer: Optimizer with parameters to accumulate\n", - " accumulation_steps: Number of steps to accumulate before update\n", - " \n", - " Returns:\n", - " True 
if ready to perform optimizer step, False otherwise\n", - " \n", - " TODO: Implement gradient accumulation.\n", - " \n", - " APPROACH:\n", - " 1. Add current gradients to accumulated gradient buffers\n", - " 2. Increment accumulation counter\n", - " 3. If counter reaches accumulation_steps:\n", - " a. Average accumulated gradients\n", - " b. Set as current gradients\n", - " c. Return True (ready for optimizer step)\n", - " d. Reset accumulation\n", - " 4. Otherwise return False (continue accumulating)\n", - " \n", - " MATHEMATICAL FORMULATION:\n", - " accumulated_grad += current_grad\n", - " if accumulation_count == accumulation_steps:\n", - " final_grad = accumulated_grad / accumulation_steps\n", - " reset accumulation\n", - " return True\n", - " \n", - " PRODUCTION VALUE:\n", - " Gradient accumulation enables:\n", - " - Large effective batch sizes on limited memory\n", - " - Training large models on small GPUs\n", - " - Consistent training across different hardware\n", - " - Memory-efficient distributed training\n", - " \n", - " IMPLEMENTATION HINTS:\n", - " - Store accumulated gradients per parameter\n", - " - Use parameter id() as key for tracking\n", - " - Average gradients before optimizer step\n", - " - Reset accumulation after each update\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Initialize accumulation if first time\n", - " if not hasattr(self, 'accumulation_count'):\n", - " self.accumulation_count = 0\n", - " self.accumulated_gradients = {}\n", - " \n", - " # Accumulate gradients\n", - " for param in optimizer.parameters:\n", - " if param.grad is not None:\n", - " param_id = id(param)\n", - " grad_data = param.grad.data.data\n", - " \n", - " if param_id not in self.accumulated_gradients:\n", - " self.accumulated_gradients[param_id] = np.zeros_like(grad_data)\n", - " \n", - " self.accumulated_gradients[param_id] += grad_data\n", - " \n", - " self.accumulation_count += 1\n", - " \n", - " # Check if ready to update\n", - " if 
self.accumulation_count >= accumulation_steps:\n", - " # Average accumulated gradients and set as current gradients\n", - " for param in optimizer.parameters:\n", - " if param.grad is not None:\n", - " param_id = id(param)\n", - " if param_id in self.accumulated_gradients:\n", - " averaged_grad = self.accumulated_gradients[param_id] / accumulation_steps\n", - " param.grad.data = Tensor(averaged_grad)\n", - " \n", - " # Reset accumulation\n", - " self.accumulation_count = 0\n", - " self.accumulated_gradients = {}\n", - " \n", - " return True # Ready for optimizer step\n", - " \n", - " return False # Continue accumulating\n", - " ### END SOLUTION\n", - " \n", - " def simulate_mixed_precision(self, optimizer: Union[SGD, Adam], loss_scale: float = 1.0) -> bool:\n", - " \"\"\"\n", - " Simulate mixed precision training effects.\n", - " \n", - " Args:\n", - " optimizer: Optimizer to apply mixed precision to\n", - " loss_scale: Loss scaling factor for gradient preservation\n", - " \n", - " Returns:\n", - " True if gradients are valid (no overflow), False if overflow detected\n", - " \n", - " TODO: Implement mixed precision simulation.\n", - " \n", - " APPROACH:\n", - " 1. Scale gradients by loss_scale factor\n", - " 2. Check for gradient overflow (inf or nan values)\n", - " 3. If overflow detected, skip optimizer step\n", - " 4. If valid, descale gradients before optimizer step\n", - " 5. 
Return overflow status\n", - " \n", - " MIXED PRECISION CONCEPTS:\n", - " - Use FP16 for forward pass (memory savings)\n", - " - Use FP32 for backward pass (numerical stability)\n", - " - Scale loss to prevent gradient underflow\n", - " - Check for overflow before optimization\n", - " \n", - " PRODUCTION VALUE:\n", - " Mixed precision provides:\n", - " - 50% memory reduction\n", - " - Faster training on modern GPUs\n", - " - Maintained numerical stability\n", - " - Automatic overflow detection\n", - " \n", - " IMPLEMENTATION HINTS:\n", - " - Scale gradients by loss_scale\n", - " - Check for inf/nan in gradients\n", - " - Descale before optimizer step\n", - " - Return overflow status for dynamic scaling\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Check for gradient overflow before scaling\n", - " has_overflow = False\n", - " \n", - " for param in optimizer.parameters:\n", - " if param.grad is not None:\n", - " grad_data = param.grad.data.data\n", - " if hasattr(grad_data, 'flatten'):\n", - " grad_flat = grad_data.flatten()\n", - " if np.any(np.isinf(grad_flat)) or np.any(np.isnan(grad_flat)):\n", - " has_overflow = True\n", - " break\n", - " else:\n", - " if np.isinf(grad_data) or np.isnan(grad_data):\n", - " has_overflow = True\n", - " break\n", - " \n", - " if has_overflow:\n", - " # Zero gradients to prevent corruption\n", - " for param in optimizer.parameters:\n", - " if param.grad is not None:\n", - " param.grad = None\n", - " return False # Overflow detected\n", - " \n", - " # Descale gradients (simulate unscaling from FP16)\n", - " if loss_scale > 1.0:\n", - " for param in optimizer.parameters:\n", - " if param.grad is not None:\n", - " grad_data = param.grad.data.data\n", - " descaled_grad = grad_data / loss_scale\n", - " param.grad.data = Tensor(descaled_grad)\n", - " \n", - " return True # No overflow, safe to proceed\n", - " ### END SOLUTION\n", - " \n", - " def simulate_distributed_sync(self, optimizer: Union[SGD, Adam], world_size: int = 1) -> 
None:\n", - " \"\"\"\n", - " Simulate distributed training gradient synchronization.\n", - " \n", - " Args:\n", - " optimizer: Optimizer with gradients to synchronize\n", - " world_size: Number of distributed processes\n", - " \n", - " TODO: Implement distributed gradient synchronization simulation.\n", - " \n", - " APPROACH:\n", - " 1. Simulate all-reduce operation on gradients\n", - " 2. Average gradients across all processes\n", - " 3. Update local gradients with synchronized values\n", - " 4. Handle communication overhead simulation\n", - " \n", - " DISTRIBUTED CONCEPTS:\n", - " - All-reduce: Combine gradients from all GPUs\n", - " - Averaging: Divide by world_size for consistency\n", - " - Synchronization: Ensure all GPUs have same gradients\n", - " - Communication: Network overhead for gradient sharing\n", - " \n", - " PRODUCTION VALUE:\n", - " Distributed training enables:\n", - " - Scaling to multiple GPUs/nodes\n", - " - Training large models efficiently\n", - " - Reduced training time\n", - " - Consistent convergence across devices\n", - " \n", - " IMPLEMENTATION HINTS:\n", - " - Simulate averaging by keeping gradients unchanged\n", - " - Add small noise to simulate communication variance\n", - " - Scale learning rate by world_size if needed\n", - " - Log synchronization overhead\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " if world_size <= 1:\n", - " return # No synchronization needed for single process\n", - " \n", - " # Simulate all-reduce operation (averaging gradients)\n", - " for param in optimizer.parameters:\n", - " if param.grad is not None:\n", - " grad_data = param.grad.data.data\n", - " \n", - " # In real distributed training, gradients would be averaged across all processes\n", - " # Here we simulate this by keeping gradients unchanged (already \"averaged\")\n", - " # In practice, this would involve MPI/NCCL communication\n", - " \n", - " # Simulate communication noise (very small)\n", - " if hasattr(grad_data, 'shape'):\n", - " noise = 
np.random.normal(0, 1e-10, grad_data.shape)\n", - " synchronized_grad = grad_data + noise\n", - " else:\n", - " noise = np.random.normal(0, 1e-10)\n", - " synchronized_grad = grad_data + noise\n", - " \n", - " param.grad.data = Tensor(synchronized_grad)\n", - " \n", - " # In distributed training, learning rate is often scaled by world_size\n", - " # to maintain effective learning rate with larger batch sizes\n", - " if hasattr(optimizer, 'base_learning_rate'):\n", - " optimizer.learning_rate = optimizer.base_learning_rate * world_size\n", - " ### END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "845353bf", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Unit Test: Advanced Optimizer Features\n", - "\n", - "Let's test your advanced optimizer features! These are production-ready enhancements used in real ML systems.\n", - "\n", - "**This is a unit test** - it tests the AdvancedOptimizerFeatures class functionality." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8fb921b6", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "test-advanced-features", - "locked": true, - "points": 25, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_unit_advanced_optimizer_features():\n", - " \"\"\"Unit test for advanced optimizer features implementation.\"\"\"\n", - " print(\"🔬 Unit Test: Advanced Optimizer Features...\")\n", - " \n", - " # Test advanced features initialization\n", - " try:\n", - " features = AdvancedOptimizerFeatures()\n", - " \n", - " assert hasattr(features, 'max_grad_norm'), \"Should have gradient clipping parameters\"\n", - " assert hasattr(features, 'warmup_steps'), \"Should have warmup parameters\"\n", - " assert hasattr(features, 'accumulation_steps'), \"Should have accumulation parameters\"\n", - " print(\"✅ Advanced features initialization works\")\n", - " \n", - " except 
Exception as e:\n", - " print(f\"❌ Advanced features initialization failed: {e}\")\n", - " raise\n", - " \n", - " # Test gradient clipping\n", - " try:\n", - " # Create optimizer with large gradients\n", - " w = Variable(1.0, requires_grad=True)\n", - " w.grad = Variable(10.0) # Large gradient\n", - " optimizer = SGD([w], learning_rate=0.01)\n", - " \n", - " # Apply gradient clipping\n", - " original_norm = features.apply_gradient_clipping(optimizer, max_norm=1.0)\n", - " \n", - " # Check that gradient was clipped\n", - " clipped_grad = w.grad.data.data.item()\n", - " assert abs(clipped_grad) <= 1.0, f\"Gradient should be clipped to <= 1.0, got {clipped_grad}\"\n", - " assert original_norm > 1.0, f\"Original norm should be > 1.0, got {original_norm}\"\n", - " print(\"✅ Gradient clipping works\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ Gradient clipping failed: {e}\")\n", - " raise\n", - " \n", - " # Test learning rate warmup\n", - " try:\n", - " w2 = Variable(1.0, requires_grad=True)\n", - " optimizer2 = SGD([w2], learning_rate=0.01)\n", - " \n", - " # Test warmup schedule\n", - " lr_step_0 = features.apply_warmup_schedule(optimizer2, step=0, warmup_steps=10, base_lr=0.1)\n", - " lr_step_5 = features.apply_warmup_schedule(optimizer2, step=5, warmup_steps=10, base_lr=0.1)\n", - " lr_step_10 = features.apply_warmup_schedule(optimizer2, step=10, warmup_steps=10, base_lr=0.1)\n", - " \n", - " # Check warmup progression\n", - " assert lr_step_0 == 0.0, f\"Step 0 should have lr=0.0, got {lr_step_0}\"\n", - " assert 0.0 < lr_step_5 < 0.1, f\"Step 5 should have 0 < lr < 0.1, got {lr_step_5}\"\n", - " assert lr_step_10 == 0.1, f\"Step 10 should have lr=0.1, got {lr_step_10}\"\n", - " print(\"✅ Learning rate warmup works\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ Learning rate warmup failed: {e}\")\n", - " raise\n", - " \n", - " # Test gradient accumulation\n", - " try:\n", - " w3 = Variable(1.0, requires_grad=True)\n", - " w3.grad = 
Variable(0.1)\n", - " optimizer3 = SGD([w3], learning_rate=0.01)\n", - " \n", - " # Test accumulation over multiple steps\n", - " ready_step_1 = features.accumulate_gradients(optimizer3, accumulation_steps=3)\n", - " ready_step_2 = features.accumulate_gradients(optimizer3, accumulation_steps=3)\n", - " ready_step_3 = features.accumulate_gradients(optimizer3, accumulation_steps=3)\n", - " \n", - " # Check accumulation behavior\n", - " assert not ready_step_1, \"Should not be ready after step 1\"\n", - " assert not ready_step_2, \"Should not be ready after step 2\"\n", - " assert ready_step_3, \"Should be ready after step 3\"\n", - " print(\"✅ Gradient accumulation works\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ Gradient accumulation failed: {e}\")\n", - " raise\n", - " \n", - " # Test mixed precision simulation\n", - " try:\n", - " w4 = Variable(1.0, requires_grad=True)\n", - " w4.grad = Variable(0.1)\n", - " optimizer4 = SGD([w4], learning_rate=0.01)\n", - " \n", - " # Test normal case (no overflow)\n", - " no_overflow = features.simulate_mixed_precision(optimizer4, loss_scale=1.0)\n", - " assert no_overflow, \"Should not detect overflow with normal gradients\"\n", - " \n", - " # Test overflow case\n", - " w4.grad = Variable(float('inf'))\n", - " overflow = features.simulate_mixed_precision(optimizer4, loss_scale=1.0)\n", - " assert not overflow, \"Should detect overflow with inf gradients\"\n", - " print(\"✅ Mixed precision simulation works\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ Mixed precision simulation failed: {e}\")\n", - " raise\n", - " \n", - " # Test distributed synchronization\n", - " try:\n", - " w5 = Variable(1.0, requires_grad=True)\n", - " w5.grad = Variable(0.1)\n", - " optimizer5 = SGD([w5], learning_rate=0.01)\n", - " \n", - " original_grad = w5.grad.data.data.item()\n", - " \n", - " # Simulate distributed sync\n", - " features.simulate_distributed_sync(optimizer5, world_size=4)\n", - " \n", - " # Gradient 
should be slightly modified (due to simulated communication noise)\n", - " # but still close to original\n", - " synced_grad = w5.grad.data.data.item()\n", - " assert abs(synced_grad - original_grad) < 0.01, \"Synchronized gradient should be close to original\"\n", - " print(\"✅ Distributed synchronization simulation works\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ Distributed synchronization failed: {e}\")\n", - " raise\n", - "\n", - " print(\"🎯 Advanced Optimizer Features behavior:\")\n", - " print(\" Implements gradient clipping for training stability\")\n", - " print(\" Provides learning rate warmup for better convergence\")\n", - " print(\" Enables gradient accumulation for large effective batches\")\n", - " print(\" Simulates mixed precision training patterns\")\n", - " print(\" Handles distributed training synchronization\")\n", - " print(\"📈 Progress: Advanced Production Optimizer Features ✓\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "5d0bfdf8", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Step 8: Comprehensive Testing - ML Systems Integration\n", - "\n", - "### Real-World Optimizer Performance Testing\n", - "\n", - "Let's test our optimizers in realistic scenarios that mirror production ML systems:\n", - "\n", - "1. **Convergence Race**: Compare optimizers on the same task\n", - "2. **Learning Rate Sensitivity**: Find optimal hyperparameters\n", - "3. **Memory Analysis**: Compare resource usage\n", - "4. **Production Recommendations**: Get actionable guidance\n", - "\n", - "This integration test demonstrates how our ML systems tools work together." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d5682c04", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "test-ml-systems-integration", - "locked": true, - "points": 35, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_comprehensive_ml_systems_integration():\n", - " \"\"\"Comprehensive integration test demonstrating ML systems optimizer analysis.\"\"\"\n", - " print(\"🔬 Comprehensive Test: ML Systems Integration...\")\n", - " \n", - " # Initialize ML systems tools\n", - " try:\n", - " profiler = OptimizerConvergenceProfiler()\n", - " advanced_features = AdvancedOptimizerFeatures()\n", - " print(\"✅ ML systems tools initialized\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ ML systems tools initialization failed: {e}\")\n", - " raise\n", - " \n", - " # Test convergence profiling with multiple optimizers\n", - " try:\n", - " print(\"\\n📊 Running optimizer convergence comparison...\")\n", - " \n", - " # Create simple training scenario\n", - " def create_training_function(optimizer_instance):\n", - " def training_step():\n", - " # Simulate a quadratic loss function: loss = (x - target)^2\n", - " # where we're trying to minimize x towards target = 2.0\n", - " current_x = optimizer_instance.parameters[0].data.data.item()\n", - " target = 2.0\n", - " loss = (current_x - target) ** 2\n", - " \n", - " # Compute gradient: d/dx (x - target)^2 = 2 * (x - target)\n", - " gradient = 2 * (current_x - target)\n", - " optimizer_instance.parameters[0].grad = Variable(gradient)\n", - " \n", - " # Perform optimizer step\n", - " optimizer_instance.step()\n", - " \n", - " return loss\n", - " return training_step\n", - " \n", - " # Test SGD\n", - " w_sgd = Variable(0.0, requires_grad=True) # Start at x=0, target=2\n", - " sgd_optimizer = SGD([w_sgd], learning_rate=0.1, momentum=0.9)\n", - " sgd_training = 
create_training_function(sgd_optimizer)\n", - " \n", - " sgd_profile = profiler.profile_optimizer_convergence(\n", - " optimizer_name=\"SGD_momentum\",\n", - " optimizer=sgd_optimizer,\n", - " training_function=sgd_training,\n", - " initial_loss=4.0, # (0-2)^2 = 4\n", - " max_steps=30\n", - " )\n", - " \n", - " # Test Adam\n", - " w_adam = Variable(0.0, requires_grad=True) # Start at x=0, target=2\n", - " adam_optimizer = Adam([w_adam], learning_rate=0.1)\n", - " adam_training = create_training_function(adam_optimizer)\n", - " \n", - " adam_profile = profiler.profile_optimizer_convergence(\n", - " optimizer_name=\"Adam\",\n", - " optimizer=adam_optimizer,\n", - " training_function=adam_training,\n", - " initial_loss=4.0,\n", - " max_steps=30\n", - " )\n", - " \n", - " # Verify profiling results\n", - " assert 'optimizer_name' in sgd_profile, \"SGD profile should contain optimizer name\"\n", - " assert 'optimizer_name' in adam_profile, \"Adam profile should contain optimizer name\"\n", - " assert 'final_loss' in sgd_profile, \"SGD profile should contain final loss\"\n", - " assert 'final_loss' in adam_profile, \"Adam profile should contain final loss\"\n", - " \n", - " print(f\" SGD final loss: {sgd_profile['final_loss']:.4f}\")\n", - " print(f\" Adam final loss: {adam_profile['final_loss']:.4f}\")\n", - " print(\"✅ Convergence profiling completed\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ Convergence profiling failed: {e}\")\n", - " raise\n", - " \n", - " # Test optimizer comparison\n", - " try:\n", - " print(\"\\n🏆 Comparing optimizer performance...\")\n", - " \n", - " profiles = {\n", - " 'SGD_momentum': sgd_profile,\n", - " 'Adam': adam_profile\n", - " }\n", - " \n", - " comparison = profiler.compare_optimizers(profiles)\n", - " \n", - " # Verify comparison results\n", - " assert 'convergence_speed' in comparison, \"Should compare convergence speed\"\n", - " assert 'final_performance' in comparison, \"Should compare final performance\"\n", - " 
assert 'rankings' in comparison, \"Should provide rankings\"\n", - " assert 'recommendations' in comparison, \"Should provide recommendations\"\n", - " \n", - " if 'summary' in comparison['recommendations']:\n", - " print(\" Recommendations:\")\n", - " for rec in comparison['recommendations']['summary']:\n", - " print(f\" {rec}\")\n", - " \n", - " print(\"✅ Optimizer comparison completed\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ Optimizer comparison failed: {e}\")\n", - " raise\n", - " \n", - " # Test memory analysis\n", - " try:\n", - " print(\"\\n💾 Analyzing memory usage...\")\n", - " \n", - " # Simulate large model parameters\n", - " num_parameters = 100000 # 100K parameters\n", - " \n", - " sgd_memory = profiler.estimate_memory_usage(sgd_optimizer, num_parameters)\n", - " adam_memory = profiler.estimate_memory_usage(adam_optimizer, num_parameters)\n", - " \n", - " print(f\" SGD memory usage: {sgd_memory['total_mb']:.1f} MB\")\n", - " print(f\" Adam memory usage: {adam_memory['total_mb']:.1f} MB\")\n", - " print(f\" Adam overhead: {adam_memory['total_mb'] - sgd_memory['total_mb']:.1f} MB\")\n", - " \n", - " # Verify memory analysis\n", - " assert sgd_memory['total_mb'] > 0, \"SGD should have positive memory usage\"\n", - " assert adam_memory['total_mb'] > sgd_memory['total_mb'], \"Adam should use more memory than SGD\"\n", - " \n", - " print(\"✅ Memory analysis completed\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ Memory analysis failed: {e}\")\n", - " raise\n", - " \n", - " # Test advanced features integration\n", - " try:\n", - " print(\"\\n🚀 Testing advanced optimizer features...\")\n", - " \n", - " # Test gradient clipping\n", - " w_clip = Variable(1.0, requires_grad=True)\n", - " w_clip.grad = Variable(5.0) # Large gradient\n", - " clip_optimizer = SGD([w_clip], learning_rate=0.01)\n", - " \n", - " original_norm = advanced_features.apply_gradient_clipping(clip_optimizer, max_norm=1.0)\n", - " assert original_norm > 1.0, 
\"Should detect large gradient\"\n", - " assert abs(w_clip.grad.data.data.item()) <= 1.0, \"Should clip gradient\"\n", - " \n", - " # Test learning rate warmup\n", - " warmup_optimizer = Adam([Variable(1.0)], learning_rate=0.001)\n", - " lr_start = advanced_features.apply_warmup_schedule(warmup_optimizer, 0, 100, 0.001)\n", - " lr_mid = advanced_features.apply_warmup_schedule(warmup_optimizer, 50, 100, 0.001)\n", - " lr_end = advanced_features.apply_warmup_schedule(warmup_optimizer, 100, 100, 0.001)\n", - " \n", - " assert lr_start < lr_mid < lr_end, \"Learning rate should increase during warmup\"\n", - " \n", - " print(\"✅ Advanced features integration completed\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ Advanced features integration failed: {e}\")\n", - " raise\n", - " \n", - " # Test production recommendations\n", - " try:\n", - " print(\"\\n📋 Generating production recommendations...\")\n", - " \n", - " analysis_results = {\n", - " 'convergence_comparison': comparison,\n", - " 'memory_analysis': {\n", - " 'sgd': sgd_memory,\n", - " 'adam': adam_memory\n", - " },\n", - " 'learning_rate_analysis': {\n", - " 'optimal_range': (0.01, 0.1)\n", - " }\n", - " }\n", - " \n", - " recommendations = profiler.generate_production_recommendations(analysis_results)\n", - " \n", - " assert len(recommendations) > 0, \"Should generate recommendations\"\n", - " \n", - " print(\" Production guidance:\")\n", - " for i, rec in enumerate(recommendations[:5]): # Show first 5 recommendations\n", - " print(f\" {rec}\")\n", - " \n", - " print(\"✅ Production recommendations generated\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ Production recommendations failed: {e}\")\n", - " raise\n", - "\n", - " print(\"\\n🎯 ML Systems Integration Results:\")\n", - " print(\" ✅ Optimizer convergence profiling works end-to-end\")\n", - " print(\" ✅ Performance comparison identifies best optimizers\")\n", - " print(\" ✅ Memory analysis guides resource planning\")\n", - " 
print(\" ✅ Advanced features enhance training stability\")\n", - " print(\" ✅ Production recommendations provide actionable guidance\")\n", - " print(\" 🚀 Ready for real-world ML systems deployment!\")\n", - " print(\"📈 Progress: Comprehensive ML Systems Integration ✓\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "0a9b13bb", - "metadata": {}, - "source": [ - "\"\"\"\n", - "# 🎯 ML SYSTEMS THINKING: Optimizers in Production\n", - "\n", - "## Production Deployment Considerations\n", - "\n", - "**You've just built a comprehensive optimizer analysis system!** Let's reflect on how this connects to real ML systems:\n", - "\n", - "## System Design Questions\n", - "1. **Optimizer Selection Strategy**: How would you build an automated system that selects the best optimizer for a new model architecture?\n", - "\n", - "2. **Resource Planning**: Given memory constraints and training time budgets, how would you choose between SGD and Adam for different model sizes?\n", - "\n", - "3. **Distributed Training**: How do gradient synchronization patterns affect optimizer performance across multiple GPUs or nodes?\n", - "\n", - "4. **Production Monitoring**: What metrics would you track in production to detect optimizer-related training issues?\n", - "\n", - "## Production ML Workflows\n", - "1. **Hyperparameter Search**: How would you integrate your convergence profiler into an automated hyperparameter tuning pipeline?\n", - "\n", - "2. **Training Pipeline**: Where would gradient clipping and mixed precision fit into a production training workflow?\n", - "\n", - "3. **Cost Optimization**: How would you balance optimizer performance against computational cost for training large models?\n", - "\n", - "4. **Model Lifecycle**: How do optimizer choices change when fine-tuning vs training from scratch vs transfer learning?\n", - "\n", - "## Framework Design Insights\n", - "1. 
**Optimizer Abstraction**: Why do frameworks like PyTorch separate optimizers from models? How does this design enable flexibility?\n", - "\n", - "2. **State Management**: How do frameworks handle optimizer state persistence for training checkpoints and resumption?\n", - "\n", - "3. **Memory Efficiency**: What design patterns enable frameworks to minimize memory overhead for optimizer state?\n", - "\n", - "4. **Plugin Architecture**: How would you design an optimizer plugin system that allows researchers to add new algorithms?\n", - "\n", - "## Performance & Scale Challenges\n", - "1. **Large Model Training**: How do optimizer memory requirements scale with model size, and what strategies mitigate this?\n", - "\n", - "2. **Dynamic Batching**: How would you adapt your gradient accumulation strategy for variable batch sizes in production?\n", - "\n", - "3. **Fault Tolerance**: How would you design optimizer state recovery for interrupted training runs in cloud environments?\n", - "\n", - "4. 
**Cross-Hardware Portability**: How do optimizer implementations need to change when moving between CPUs, GPUs, and specialized ML accelerators?\n", - "\n", - "These questions connect your optimizer implementations to the broader ecosystem of production ML systems, where optimization is just one piece of complex training and deployment pipelines.\n", - "\"\"\"\n", - "\n", - "if __name__ == \"__main__\":\n", - " print(\"🧪 Running comprehensive optimizer tests...\")\n", - " \n", - " # Run all tests\n", - " test_unit_sgd_optimizer()\n", - " test_unit_adam_optimizer()\n", - " test_unit_step_scheduler()\n", - " test_module_unit_training()\n", - " test_unit_convergence_profiler()\n", - " test_unit_advanced_optimizer_features()\n", - " test_comprehensive_ml_systems_integration()\n", - " \n", - " print(\"All tests passed!\")\n", - " print(\"Optimizers module complete!\")" - ] - }, - { - "cell_type": "markdown", - "id": "a585755c", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🤔 ML Systems Thinking: Interactive Questions\n", - "\n", - "Now that you've built optimization algorithms that drive neural network training, let's connect this foundational work to broader ML systems challenges. These questions help you think critically about how optimization strategies scale to production training environments.\n", - "\n", - "Take time to reflect thoughtfully on each question - your insights will help you understand how the optimization concepts you've implemented connect to real-world ML systems engineering." - ] - }, - { - "cell_type": "markdown", - "id": "880def31", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### Question 1: Memory Overhead and Optimizer State Management\n", - "\n", - "**Context**: Your Adam optimizer maintains momentum and variance buffers for each parameter, creating 3× memory overhead compared to SGD. 
Production training systems with billions of parameters must carefully manage optimizer state memory while maintaining training efficiency and fault tolerance.\n", - "\n", - "**Reflection Question**: Design an optimizer state management system for large-scale neural network training that optimizes memory usage while supporting distributed training and fault recovery. How would you implement memory-efficient optimizer state storage, handle state partitioning across devices, and manage optimizer checkpointing for training resumption? Consider scenarios where optimizer state memory exceeds model parameter memory and requires specialized optimization strategies.\n", - "\n", - "Think about: memory optimization techniques, distributed state management, checkpointing strategies, and fault tolerance considerations.\n", - "\n", - "*Target length: 150-300 words*" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9a7367db", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "question-1-optimizer-memory", - "locked": false, - "points": 10, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "\"\"\"\n", - "YOUR REFLECTION ON MEMORY OVERHEAD AND OPTIMIZER STATE MANAGEMENT:\n", - "\n", - "TODO: Replace this text with your thoughtful response about optimizer state management system design.\n", - "\n", - "Consider addressing:\n", - "- How would you optimize memory usage for optimizers that maintain extensive per-parameter state?\n", - "- What strategies would you use for distributed optimizer state management across multiple devices?\n", - "- How would you implement efficient checkpointing and state recovery for long-running training jobs?\n", - "- What role would state compression and quantization play in your optimization approach?\n", - "- How would you balance memory efficiency with optimization algorithm effectiveness?\n", - "\n", - "Write a technical analysis connecting your optimizer 
implementations to real memory management challenges.\n", - "\n", - "GRADING RUBRIC (Instructor Use):\n", - "- Demonstrates understanding of optimizer memory overhead and state management (3 points)\n", - "- Addresses distributed state management and partitioning strategies (3 points)\n", - "- Shows practical knowledge of checkpointing and fault tolerance techniques (2 points)\n", - "- Demonstrates systems thinking about memory vs optimization trade-offs (2 points)\n", - "- Clear technical reasoning and practical considerations (bonus points for innovative approaches)\n", - "\"\"\"\n", - "\n", - "### BEGIN SOLUTION\n", - "# Student response area - instructor will replace this section during grading setup\n", - "# This is a manually graded question requiring technical analysis of optimizer state management\n", - "# Students should demonstrate understanding of memory optimization and distributed state handling\n", - "### END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "619b4c1f", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### Question 2: Distributed Optimization and Learning Rate Scheduling\n", - "\n", - "**Context**: Your optimizers work on single devices with fixed learning rate schedules. Production distributed training systems must coordinate optimization across multiple workers while adapting learning rates based on real-time training dynamics and system constraints.\n", - "\n", - "**Reflection Question**: Architect a distributed optimization system that coordinates parameter updates across multiple workers while implementing adaptive learning rate scheduling responsive to training progress and system constraints. How would you handle gradient aggregation strategies, implement learning rate scaling for different batch sizes, and design adaptive scheduling that responds to convergence patterns? 
Consider scenarios where training must adapt to varying computational resources and time constraints in cloud environments.\n", - "\n", - "Think about: distributed optimization strategies, adaptive learning rate techniques, gradient aggregation methods, and system-aware scheduling.\n", - "\n", - "*Target length: 150-300 words*" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5abe6f79", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "question-2-distributed-optimization", - "locked": false, - "points": 10, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "\"\"\"\n", - "YOUR REFLECTION ON DISTRIBUTED OPTIMIZATION AND LEARNING RATE SCHEDULING:\n", - "\n", - "TODO: Replace this text with your thoughtful response about distributed optimization system design.\n", - "\n", - "Consider addressing:\n", - "- How would you coordinate parameter updates across multiple workers in distributed training?\n", - "- What strategies would you use for gradient aggregation and synchronization?\n", - "- How would you implement adaptive learning rate scheduling that responds to training dynamics?\n", - "- What role would system constraints and resource availability play in your optimization design?\n", - "- How would you handle learning rate scaling and batch size considerations in distributed settings?\n", - "\n", - "Write an architectural analysis connecting your optimizer implementations to real distributed training challenges.\n", - "\n", - "GRADING RUBRIC (Instructor Use):\n", - "- Shows understanding of distributed optimization and coordination challenges (3 points)\n", - "- Designs practical approaches to gradient aggregation and learning rate adaptation (3 points)\n", - "- Addresses system constraints and resource-aware optimization (2 points)\n", - "- Demonstrates systems thinking about distributed training coordination (2 points)\n", - "- Clear architectural reasoning with distributed 
systems insights (bonus points for comprehensive understanding)\n", - "\"\"\"\n", - "\n", - "### BEGIN SOLUTION\n", - "# Student response area - instructor will replace this section during grading setup\n", - "# This is a manually graded question requiring understanding of distributed optimization systems\n", - "# Students should demonstrate knowledge of gradient aggregation and adaptive scheduling\n", - "### END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "268e844d", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### Question 3: Production Integration and Optimization Monitoring\n", - "\n", - "**Context**: Your optimizer implementations provide basic parameter updates, but production ML systems require comprehensive optimization monitoring, hyperparameter tuning, and integration with MLOps pipelines for continuous training and model improvement.\n", - "\n", - "**Reflection Question**: Design a production optimization system that integrates with MLOps pipelines and provides comprehensive optimization monitoring and automated hyperparameter tuning. How would you implement real-time optimization metrics collection, automated optimizer selection based on model characteristics, and integration with experiment tracking and model deployment systems? 
Consider scenarios where optimization strategies must adapt to changing data distributions and business requirements in production environments.\n", - "\n", - "Think about: optimization monitoring systems, automated hyperparameter tuning, MLOps integration, and adaptive optimization strategies.\n", - "\n", - "*Target length: 150-300 words*" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b274f9c7", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "question-3-production-integration", - "locked": false, - "points": 10, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "\"\"\"\n", - "YOUR REFLECTION ON PRODUCTION INTEGRATION AND OPTIMIZATION MONITORING:\n", - "\n", - "TODO: Replace this text with your thoughtful response about production optimization system design.\n", - "\n", - "Consider addressing:\n", - "- How would you design optimization monitoring and metrics collection for production training?\n", - "- What strategies would you use for automated optimizer selection and hyperparameter tuning?\n", - "- How would you integrate optimization systems with MLOps pipelines and experiment tracking?\n", - "- What role would adaptive optimization play in responding to changing data and requirements?\n", - "- How would you ensure optimization system reliability and performance in production environments?\n", - "\n", - "Write a systems analysis connecting your optimizer implementations to real production integration challenges.\n", - "\n", - "GRADING RUBRIC (Instructor Use):\n", - "- Understands production optimization monitoring and MLOps integration (3 points)\n", - "- Designs practical approaches to automated tuning and optimization selection (3 points)\n", - "- Addresses adaptive optimization and production reliability considerations (2 points)\n", - "- Shows systems thinking about optimization system integration and monitoring (2 points)\n", - "- Clear systems reasoning with 
production deployment insights (bonus points for deep understanding)\n", - "\"\"\"\n", - "\n", - "### BEGIN SOLUTION\n", - "# Student response area - instructor will replace this section during grading setup\n", - "# This is a manually graded question requiring understanding of production optimization systems\n", - "# Students should demonstrate knowledge of MLOps integration and optimization monitoring\n", - "### END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "21a3a64c", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🎯 MODULE SUMMARY: Optimization Algorithms with ML Systems\n", - "\n", - "Congratulations! You've successfully implemented optimization algorithms with comprehensive ML systems analysis:\n", - "\n", - "### What You've Accomplished\n", - "✅ **Gradient Descent**: The foundation of all optimization algorithms\n", - "✅ **SGD with Momentum**: Improved convergence with momentum\n", - "✅ **Adam Optimizer**: Adaptive learning rates for better training\n", - "✅ **Learning Rate Scheduling**: Dynamic learning rate adjustment\n", - "✅ **ML Systems Analysis**: OptimizerConvergenceProfiler for production insights\n", - "✅ **Advanced Features**: Gradient clipping, warmup, accumulation, mixed precision\n", - "✅ **Production Integration**: Complete optimizer analysis and recommendation system\n", - "\n", - "### Key Concepts You've Learned\n", - "- **Gradient-based optimization**: How gradients guide parameter updates\n", - "- **Momentum**: Using velocity to improve convergence\n", - "- **Adaptive learning rates**: Adam's adaptive moment estimation\n", - "- **Learning rate scheduling**: Dynamic adjustment of learning rates\n", - "- **Convergence analysis**: Profiling optimizer performance patterns\n", - "- **Memory efficiency**: Resource usage comparison across optimizers\n", - "- **Production patterns**: Advanced features for real-world deployment\n", - "\n", - "### Mathematical Foundations\n", - "- **Gradient descent**: θ = θ - 
α∇θJ(θ)\n", - "- **Momentum**: v = βv + (1-β)∇θJ(θ), θ = θ - αv\n", - "- **Adam**: Adaptive moment estimation with bias correction\n", - "- **Learning rate scheduling**: StepLR and other scheduling strategies\n", - "- **Gradient clipping**: norm_clip = min(norm, max_norm) * grad / norm\n", - "- **Gradient accumulation**: grad_avg = Σgrad_i / accumulation_steps\n", - "\n", - "### Professional Skills Developed\n", - "- **Algorithm implementation**: Building optimization algorithms from scratch\n", - "- **Performance analysis**: Profiling and comparing optimizer convergence\n", - "- **System design thinking**: Understanding production optimization workflows\n", - "- **Resource optimization**: Memory usage analysis and efficiency planning\n", - "- **Integration testing**: Ensuring optimizers work with neural networks\n", - "- **Production readiness**: Advanced features for real-world deployment\n", - "\n", - "### Ready for Advanced Applications\n", - "Your optimization implementations now enable:\n", - "- **Neural network training**: Complete training pipelines with optimizers\n", - "- **Hyperparameter optimization**: Data-driven optimizer and LR selection\n", - "- **Advanced architectures**: Training complex models efficiently\n", - "- **Production deployment**: ML systems with optimizer monitoring and tuning\n", - "- **Research**: Experimenting with new optimization algorithms\n", - "- **Scalable training**: Distributed and memory-efficient optimization\n", - "\n", - "### Connection to Real ML Systems\n", - "Your implementations mirror production systems:\n", - "- **PyTorch**: `torch.optim.SGD`, `torch.optim.Adam` provide identical functionality\n", - "- **TensorFlow**: `tf.keras.optimizers` implements similar concepts\n", - "- **MLflow/Weights&Biases**: Your profiler mirrors production monitoring tools\n", - "- **Ray Tune/Optuna**: Your convergence analysis enables hyperparameter optimization\n", - "- **Industry Standard**: Every major ML framework uses these exact 
algorithms and patterns\n", - "\n", - "### Next Steps\n", - "1. **Export your code**: `tito export 08_optimizers`\n", - "2. **Test your implementation**: `tito test 08_optimizers`\n", - "3. **Deploy ML systems**: Use your profiler for real optimizer selection\n", - "4. **Build training systems**: Combine with neural networks for complete training\n", - "5. **Move to Module 11**: Add complete training pipelines!\n", - "\n", - "**Ready for production?** Your optimization algorithms and ML systems analysis tools are now ready for real-world deployment and performance optimization!" - ] - } - ], - "metadata": { - "jupytext": { - "main_language": "python" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/modules_old/06_optimizers/optimizers_dev.py b/modules_old/06_optimizers/optimizers_dev.py deleted file mode 100644 index c6867cdd..00000000 --- a/modules_old/06_optimizers/optimizers_dev.py +++ /dev/null @@ -1,3207 +0,0 @@ -# --- -# jupyter: -# jupytext: -# text_representation: -# extension: .py -# format_name: percent -# format_version: '1.3' -# jupytext_version: 1.17.1 -# --- - -# %% [markdown] -""" -# Optimizers - The Learning Engine - -Welcome to Optimizers! You'll build the intelligent algorithms that make neural networks learn - the engines that transform gradients into actual intelligence. - -## 🔗 Building on Previous Learning -**What You Built Before**: -- Module 04 (Losses): Functions that measure how wrong your model is -- Module 05 (Autograd): Automatic gradient computation through any expression - -**What's Working**: Your models can compute loss and gradients perfectly! Loss tells you how far you are from the target, gradients tell you which direction to move. - -**The Gap**: Your models can't actually *learn* - they compute gradients but don't know how to use them to get better. - -**This Module's Solution**: Build the optimization algorithms that transform gradients into learning. 
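That transformation is small enough to sketch right away. Here is a minimal pure-Python warm-up — the toy loss L(θ) = θ² and every name in it are illustrative stand-ins, not this module's API — showing gradients becoming learning:

```python
# Toy warm-up: minimize L(theta) = theta^2 by stepping against the gradient.
# Everything here is an illustrative stand-in for what this module builds properly.

theta = 5.0   # parameter, starting far from the optimum at 0
lr = 0.1      # learning rate: how big each step is

for _ in range(50):
    grad = 2 * theta           # dL/dtheta for L(theta) = theta^2
    theta = theta - lr * grad  # the update rule every optimizer refines

print(f"theta after 50 steps: {theta:.6f}")  # driven essentially to 0
```

Every optimizer in this module is a refinement of that single subtraction.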
- -**Connection Map**: -``` -Loss Computation → Gradient Computation → Parameter Updates -(Measures error) (Direction to move) (Actually learn!) -``` - -## Learning Objectives -1. **Core Implementation**: Build gradient descent, SGD with momentum, and Adam optimizers -2. **Visual Understanding**: See how different optimizers navigate loss landscapes -3. **Systems Analysis**: Understand memory usage and convergence characteristics -4. **Professional Skills**: Match production optimizer implementations - -## Build → Test → Use -1. **Build**: Four optimization algorithms with immediate testing -2. **Test**: Visual convergence analysis and memory profiling -3. **Use**: Train real neural networks with your optimizers - -## 📦 Where This Code Lives in the Final Package - -**Learning Side:** You work in modules/06_optimizers/optimizers_dev.py -**Building Side:** Code exports to tinytorch.core.optimizers - -```python -# Final package structure: -from tinytorch.core.optimizers import gradient_descent_step, SGD, Adam, StepLR # This module -from tinytorch.core.autograd import Tensor # Enhanced Tensor with gradients -from tinytorch.core.losses import MSELoss # Loss functions - -# Complete training workflow: -model = MyModel() -optimizer = Adam(model.parameters(), lr=0.001) # Your implementation! -loss_fn = MSELoss() - -for batch in data: - loss = loss_fn(model(batch.x), batch.y) - loss.backward() # Compute gradients (Module 05) - optimizer.step() # Update parameters (This module!) 
-``` - -**Why this matters:** -- **Learning:** Experience how optimization algorithms work by building them from scratch -- **Production:** Your implementations match PyTorch's torch.optim exactly -- **Systems:** Understand memory and performance trade-offs between different optimizers -- **Intelligence:** Transform mathematical gradients into actual learning -""" - -# %% nbgrader={"grade": false, "grade_id": "optimizers-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} -#| default_exp core.optimizers - -#| export -import numpy as np -import sys -import os -from typing import List, Dict, Any, Optional, Union -from collections import defaultdict - -# Helper function to set up import paths -def setup_import_paths(): - """Set up import paths for development modules.""" - import sys - import os - - # Add module directories to path - base_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__))) - tensor_dir = os.path.join(base_dir, '01_tensor') - autograd_dir = os.path.join(base_dir, '06_autograd') - - if tensor_dir not in sys.path: - sys.path.append(tensor_dir) - if autograd_dir not in sys.path: - sys.path.append(autograd_dir) - -# Import our existing components -try: - from tinytorch.core.tensor import Tensor - from tinytorch.core.autograd import Variable -except ImportError: - # For development, try local imports - try: - setup_import_paths() - from tensor_dev import Tensor - from autograd_dev import Variable - except ImportError: - # Create simplified fallback classes for basic gradient operations - print("Warning: Using simplified classes for basic gradient operations") - - class Tensor: - def __init__(self, data): - self.data = np.array(data) - self.shape = self.data.shape - - def __str__(self): - return f"Tensor({self.data})" - - class Variable: - def __init__(self, data, requires_grad=True): - if isinstance(data, (int, float)): - self.data = Tensor([data]) - else: - self.data = Tensor(data) - self.requires_grad = 
requires_grad - self.grad = None - - def zero_grad(self): - """Reset gradients to None (basic operation from Module 6)""" - self.grad = None - - def __str__(self): - return f"Variable({self.data.data})" - - # %% nbgrader={"grade": false, "grade_id": "optimizers-setup", "locked": false, "schema_version": 3, "solution": false, "task": false} - print("FIRE TinyTorch Optimizers Module") - print(f"NumPy version: {np.__version__}") - print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}") - print("Ready to build optimization algorithms!") - - # %% - #| export - def get_param_data(param): - """Get parameter data in consistent format.""" - if hasattr(param, 'data') and hasattr(param.data, 'data'): - return param.data.data - elif hasattr(param, 'data'): - return param.data - else: - return param - - #| export - def set_param_data(param, new_data): - """Set parameter data in consistent format.""" - if hasattr(param, 'data') and hasattr(param.data, 'data'): - param.data.data = new_data - elif hasattr(param, 'data'): - param.data = new_data - else: - # Rebinding the local name would be a silent no-op for the caller; - # write into the bare array in place so the update actually propagates. - param[...] = new_data - - #| export - def get_grad_data(param): - """Get gradient data in consistent format.""" - if param.grad is None: - return None - if hasattr(param.grad, 'data') and hasattr(param.grad.data, 'data'): - return param.grad.data.data - elif hasattr(param.grad, 'data'): - return param.grad.data - else: - return param.grad - - # %% [markdown] - """ - ## Here's What We're Actually Building - - Optimizers are the navigation systems that guide neural networks through loss landscapes toward optimal solutions. Think of training as finding the lowest point in a vast mountain range, where you can only feel the slope under your feet. - - We'll build four increasingly sophisticated navigation strategies: - - ### 1. 
Gradient Descent: The Foundation -``` -The Basic Rule: Always go downhill - - Loss ↑ - │ ╱╲ - │ ╱ ╲ ● ← You are here - │ ╱ ╲ ↙ Feel slope (gradient) - │ ╱ ╲ - │ ╱ ╲ ● ← Take step downhill - │ ╱ ╲ - └──────────────→ Parameters - -Update Rule: parameter = parameter - learning_rate * gradient -``` - -### 2. SGD with Momentum: The Smart Ball -``` -The Physics Approach: Build velocity like a ball rolling downhill - - Without Momentum (ping-pong ball): With Momentum (bowling ball): - ┌─────────────────┐ ┌─────────────────┐ - │ ↗ ↙ ↗ ↙ │ │ │ - │ ╲ ╱ ╲ ╱ │ │ ────⟶ │ - │ ↙ ↗ ↙ ↗ │ │ │ - └─────────────────┘ └─────────────────┘ - Bounces forever Rolls through smoothly - -velocity = momentum * old_velocity + gradient -parameter = parameter - learning_rate * velocity -``` - -### 3. Adam: The Adaptive Expert -``` -The Smart Approach: Different learning rates for each parameter - - Parameter 1 (large gradients): Parameter 2 (small gradients): - → Large step size needed → Small step size is fine - → Reduce learning rate → Keep learning rate normal - - Weight:│■■■■■■■■■■│ Bias: │▪▪▪│ - Big updates Small updates - → Adam reduces LR → Adam keeps LR - -Adam tracks gradient history to adapt step size per parameter -``` - -### 4. Learning Rate Scheduling: The Strategic Planner -``` -The Training Strategy: Adjust exploration vs exploitation over time - - Early Training (explore): Late Training (exploit): - Large LR = 0.1 Small LR = 0.001 - ┌─────────────────┐ ┌─────────────────┐ - │ ●───────● │ │ ●─●─●─●─● │ - │ Big jumps to explore │ │ Tiny steps to refine │ - └─────────────────┘ └─────────────────┘ - Find good regions Polish the solution - -Scheduler reduces learning rate as training progresses -``` - -### Why Build All Four? 
- -Each optimizer excels in different scenarios: -- **Gradient Descent**: Simple, reliable foundation -- **SGD + Momentum**: Escapes local minima, accelerates convergence -- **Adam**: Handles different parameter scales automatically -- **Scheduling**: Balances exploration and exploitation over time - -Let's build them step by step and see each one in action! -""" - -# %% [markdown] -""" -Now let's build gradient descent - the foundation of all neural network training. Think of it as -rolling a ball down a hill, where the gradient tells you which direction is steepest. - -``` - The Gradient Descent Algorithm: - - Current Position: θ - Slope at Position: ∇L(θ) points uphill ↗ - Step Direction: -∇L(θ) points downhill ↙ - Step Size: α (learning rate) - - Update Rule: θnew = θold - α·∇L(θ) - - Visual Journey Down the Loss Surface: - - Loss ↑ - │ ╱╲ - │ ╱ ╲ - │ ╱ ╲ Start here - │ ╱ ╲ ● - │ ╱ ╲ ↙ (step 1: big gradient) - │ ╱ ╲ ● - │╱ ╲ ↙ (step 2: smaller gradient) - │ ●↙ (step 3: tiny gradient) - │ ● (converged!) - └──────────────────────→ Parameter θ - - Learning Rate Controls Step Size: - - α too small (0.001): α just right (0.1): α too large (1.0): - ●─●─●─●─●─●─●─●─● ●──●──●──● ●───────╲ - Many tiny steps Efficient path ╲──────● - (slow convergence) (good balance) Overshooting (divergence!) -``` - -### The Core Insight - -Gradients point uphill toward higher loss, so we go the opposite direction. It's like having a compass that always points toward trouble - so you walk the other way! - -This simple rule - "parameter = parameter - learning_rate * gradient" - is what makes every neural network learn. -""" - -# %% nbgrader={"grade": false, "grade_id": "gradient-descent-function", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -def gradient_descent_step(parameter: Variable, learning_rate: float) -> None: - """ - Perform one step of gradient descent on a parameter. 
- - Args: - parameter: Variable with gradient information - learning_rate: How much to update parameter - - TODO: Implement basic gradient descent parameter update. - - STEP-BY-STEP IMPLEMENTATION: - 1. Check if parameter has a gradient - 2. Get current parameter value and gradient - 3. Update parameter: new_value = old_value - learning_rate * gradient - 4. Update parameter data with new value - 5. Handle edge cases (no gradient, invalid values) - - EXAMPLE USAGE: - ```python - # Parameter with gradient - w = Variable(2.0, requires_grad=True) - w.grad = Variable(0.5) # Gradient from loss - - # Update parameter - gradient_descent_step(w, learning_rate=0.1) - # w.data now contains: 2.0 - 0.1 * 0.5 = 1.95 - ``` - - IMPLEMENTATION HINTS: - - Check if parameter.grad is not None - - Use parameter.grad.data.data to get gradient value - - Update parameter.data with new Tensor - - Don't modify gradient (it's used for logging) - - LEARNING CONNECTIONS: - - This is the foundation of all neural network training - - PyTorch's optimizer.step() does exactly this - - The learning rate determines convergence speed - """ - ### BEGIN SOLUTION - if parameter.grad is not None: - # Get current parameter value and gradient - current_value = parameter.data.data - gradient_value = parameter.grad.data.data - - # Update parameter: new_value = old_value - learning_rate * gradient - new_value = current_value - learning_rate * gradient_value - - # Update parameter data - parameter.data = Tensor(new_value) - ### END SOLUTION - -# %% [markdown] -""" -### 🧪 Test: Gradient Descent Step -This test confirms our gradient descent function works correctly -**What we're testing**: Basic parameter updates using the gradient descent rule -**Why it matters**: This is the foundation that every optimizer builds on -**Expected**: Parameters move opposite to gradient direction -""" - -# %% nbgrader={"grade": true, "grade_id": "test-gradient-descent", "locked": true, "points": 10, "schema_version": 3, 
"solution": false, "task": false} -def test_unit_gradient_descent_step(): - """🔬 Test basic gradient descent parameter update.""" - print("🔬 Unit Test: Gradient Descent Step...") - - # Test basic parameter update - try: - w = Variable(2.0, requires_grad=True) - w.grad = Variable(0.5) # Positive gradient - - original_value = w.data.data.item() - gradient_descent_step(w, learning_rate=0.1) - new_value = w.data.data.item() - - expected_value = original_value - 0.1 * 0.5 # 2.0 - 0.05 = 1.95 - assert abs(new_value - expected_value) < 1e-6, f"Expected {expected_value}, got {new_value}" - print("PASS Basic parameter update works") - - except Exception as e: - print(f"FAIL Basic parameter update failed: {e}") - raise - - # Test with negative gradient - try: - w2 = Variable(1.0, requires_grad=True) - w2.grad = Variable(-0.2) # Negative gradient - - gradient_descent_step(w2, learning_rate=0.1) - expected_value2 = 1.0 - 0.1 * (-0.2) # 1.0 + 0.02 = 1.02 - assert abs(w2.data.data.item() - expected_value2) < 1e-6, "Negative gradient test failed" - print("PASS Negative gradient handling works") - - except Exception as e: - print(f"FAIL Negative gradient handling failed: {e}") - raise - - # Test with no gradient (should not update) - try: - w3 = Variable(3.0, requires_grad=True) - w3.grad = None - original_value3 = w3.data.data.item() - - gradient_descent_step(w3, learning_rate=0.1) - assert w3.data.data.item() == original_value3, "Parameter with no gradient should not update" - print("PASS No gradient case works") - - except Exception as e: - print(f"FAIL No gradient case failed: {e}") - raise - - print("✅ Success! 
Gradient descent step works correctly!") - print(f" • Updates parameters opposite to gradient direction") - print(f" • Learning rate controls step size") - print(f" • Safely handles missing gradients") - -test_unit_gradient_descent_step() # Run immediately - -# PASS IMPLEMENTATION CHECKPOINT: Basic gradient descent complete - -# THINK PREDICTION: How do you think learning rate affects convergence speed? -# Your guess: _______ - -def analyze_learning_rate_effects(): - """📊 Analyze how learning rate affects parameter updates.""" - print("📊 Analyzing learning rate effects...") - - # Create test parameter with fixed gradient - param = Variable(1.0, requires_grad=True) - param.grad = Variable(0.1) # Fixed gradient of 0.1 - - learning_rates = [0.01, 0.1, 0.5, 1.0, 2.0] - - print(f"Starting value: {param.data.data.item():.3f}, Gradient: {param.grad.data.data.item():.3f}") - - for lr in learning_rates: - # Reset parameter - param.data.data = np.array(1.0) - - # Apply update - gradient_descent_step(param, learning_rate=lr) - - new_value = param.data.data.item() - step_size = abs(1.0 - new_value) - - status = " ⚠️ Overshooting!" if lr >= 1.0 else "" - print(f"LR = {lr:4.2f}: {1.0:.3f} → {new_value:.3f} (step: {step_size:.3f}){status}") - - print("\n💡 Small LR = safe but slow, Large LR = fast but unstable") - print("🚀 Most models use LR scheduling: high→low during training") - -# Analyze learning rate effects -analyze_learning_rate_effects() - -# %% [markdown] -""" -## Step 2: The Smart Ball - SGD with Momentum - -Regular SGD is like a ping-pong ball - it bounces around and gets stuck in small valleys. Momentum turns it into a bowling ball that rolls through obstacles with accumulated velocity. - -Think of momentum as the optimizer learning from its own movement history: "I've been going this direction, so I'll keep going this direction even if the current gradient disagrees slightly." 
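That "movement history" is easy to verify numerically. A short sketch — β = 0.9 and the gradient sequences are illustrative choices, not fixed by the module — shows velocity building under a steady gradient and cancelling under an oscillating one:

```python
def velocities(grads, beta):
    # Accumulate v = beta * v + g over a sequence of gradients,
    # recording the velocity (rounded for display) after each step.
    v, history = 0.0, []
    for g in grads:
        v = beta * v + g
        history.append(round(v, 3))
    return history

# Steady gradient: velocity builds toward the terminal value g / (1 - beta)
print(velocities([1.0, 1.0, 1.0, 1.0], beta=0.9))         # [1.0, 1.9, 2.71, 3.439]

# Oscillating gradients largely cancel: updates stay small and smooth
print(velocities([0.1, -0.1, 0.1, -0.1, 0.1], beta=0.9))  # [0.1, -0.01, 0.091, -0.018, 0.084]
```

This two-line accumulation is exactly what a per-parameter momentum buffer stores.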
- -### The Physics of Momentum - -``` - Ping-Pong Ball vs Bowling Ball: - - Without Momentum (ping-pong): With Momentum (bowling ball): - ┌─────────────────────┐ ┌─────────────────────┐ - │ ╱╲ ╱╲ │ │ ╱╲ ╱╲ │ - │ ╱ ╲ ╱ ╲ │ │ ╱ ╲ ╱ ╲ │ - │ ● ╲╱ ╲ │ │ ●────⟶────● │ - │ ↗↙ Gets stuck │ │ Builds velocity! │ - └─────────────────────┘ └─────────────────────┘ - - Problem: Narrow Valleys (Common in Neural Networks) - - SGD Without Momentum: SGD With Momentum (β=0.9): - ┌─────────────────────┐ ┌─────────────────────┐ - │ ↗ ↙ ↗ ↙ │ │ │ - │ ╲ ╱ ╲ ╱ │ │ ────⟶ │ - │ ↙ ↗ ↙ ↗ │ │ │ - │ Bounces forever! │ │ Smooth progress! │ - └─────────────────────┘ └─────────────────────┘ -``` - -### How Momentum Works: Velocity Accumulation - -``` - The Two-Step Process: - - Step 1: Update velocity (mix old direction with new gradient) - velocity = momentum_coeff * old_velocity + current_gradient - - Step 2: Move using velocity (not raw gradient) - parameter = parameter - learning_rate * velocity - - Example with β=0.9 (momentum coefficient): - - Iteration 1: v = 0.9 × 0.0 + 1.0 = 1.0 (starting from rest) - Iteration 2: v = 0.9 × 1.0 + 1.0 = 1.9 (building speed) - Iteration 3: v = 0.9 × 1.9 + 1.0 = 2.71 (accelerating!) - Iteration 4: v = 0.9 × 2.71 + 1.0 = 3.44 (near terminal velocity) - - Velocity Visualization: - ┌────────────────────────────────────────────┐ - │ Recent gradient: ■ │ - │ + 0.9 × velocity: ■■■■■■■■■ │ - │ = New velocity: ■■■■■■■■■■ │ - │ │ - │ Momentum creates an exponential moving average of │ - │ gradients - recent gradients matter more, but the │ - │ optimizer "remembers" where it was going │ - └────────────────────────────────────────────┘ -``` - -### Why Momentum is Magic - -Momentum solves several optimization problems: -1. **Escapes Local Minima**: Velocity carries you through small bumps -2. **Accelerates Convergence**: Builds speed in consistent directions -3. **Smooths Oscillations**: Averages out conflicting gradients -4. 
**Handles Noise**: Less sensitive to gradient noise - - Let's build an SGD optimizer that supports momentum! - """ - - # %% [markdown] - """ - ### 🤔 Assessment Question: Momentum Understanding - - **Understanding momentum's role in optimization:** - - In a narrow valley loss landscape, vanilla SGD oscillates between valley walls. How does momentum help solve this problem, and what's the mathematical intuition behind the velocity accumulation formula `v_t = β v_{t-1} + ∇L(θ_t)`? - - Consider a sequence of gradients: [0.1, -0.1, 0.1, -0.1, 0.1] (oscillating). Show how momentum with β=0.9 transforms this into smoother updates. - """ - - # %% nbgrader={"grade": true, "grade_id": "momentum-understanding", "locked": false, "points": 8, "schema_version": 3, "solution": true, "task": false} - """ - YOUR MOMENTUM ANALYSIS: - - TODO: Explain how momentum helps in narrow valleys and demonstrate the velocity calculation. - - Key points to address: - - Why does vanilla SGD oscillate in narrow valleys? - - How does momentum accumulation smooth out oscillations? - - Calculate velocity sequence for oscillating gradients [0.1, -0.1, 0.1, -0.1, 0.1] with β=0.9 - - What happens to the effective update directions with momentum? - - GRADING RUBRIC: - - Identifies oscillation problem in narrow valleys (2 points) - - Explains momentum's smoothing mechanism (2 points) - - Correctly calculates velocity sequence (2 points) - - Shows understanding of exponential moving average effect (2 points) - """ - - ### BEGIN SOLUTION - # Momentum helps solve oscillation by accumulating velocity as an exponential moving average of gradients. - # In narrow valleys, vanilla SGD gets stuck oscillating between walls because gradients alternate direction. 
-# -# For oscillating gradients [0.1, -0.1, 0.1, -0.1, 0.1] with β=0.9: -# v₀ = 0 -# v₁ = 0.9*0 + 0.1 = 0.1 -# v₂ = 0.9*0.1 + (-0.1) = 0.09 - 0.1 = -0.01 -# v₃ = 0.9*(-0.01) + 0.1 = -0.009 + 0.1 = 0.091 -# v₄ = 0.9*0.091 + (-0.1) = 0.082 - 0.1 = -0.018 -# v₅ = 0.9*(-0.018) + 0.1 = -0.016 + 0.1 = 0.084 -# -# The oscillating gradients average out through momentum, creating much smaller, smoother updates -# instead of large oscillations. This allows progress along the valley bottom rather than bouncing between walls. -### END SOLUTION - -# %% nbgrader={"grade": false, "grade_id": "sgd-class", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class SGD: - """ - SGD Optimizer with Momentum Support - - Implements stochastic gradient descent with optional momentum for improved convergence. - Momentum accumulates velocity to accelerate in consistent directions and dampen oscillations. - - Mathematical Update Rules: - Without momentum: θ = θ - αgradθ - With momentum: v = βv + gradθ, θ = θ - αv - - SYSTEMS INSIGHT - Memory Usage: - SGD stores only parameters list, learning rate, and optionally momentum buffers. - Memory usage: O(1) per parameter without momentum, O(P) with momentum (P = parameters). - Much more memory efficient than Adam which needs O(2P) for momentum + velocity. - """ - - def __init__(self, parameters: List[Variable], learning_rate: float = 0.01, momentum: float = 0.0): - """ - Initialize SGD optimizer with optional momentum. - - Args: - parameters: List of Variables to optimize - learning_rate: Learning rate for gradient steps (default: 0.01) - momentum: Momentum coefficient for velocity accumulation (default: 0.0) - - TODO: Store optimizer parameters and initialize momentum buffers. - - APPROACH: - 1. Store parameters, learning rate, and momentum coefficient - 2. Initialize momentum buffers if momentum > 0 - 3. 
Set up state tracking for momentum terms - - EXAMPLE: - ```python - # SGD without momentum (vanilla) - optimizer = SGD([w, b], learning_rate=0.01) - - # SGD with momentum (recommended) - optimizer = SGD([w, b], learning_rate=0.01, momentum=0.9) - ``` - """ - ### BEGIN SOLUTION - self.parameters = parameters - self.learning_rate = learning_rate - self.momentum = momentum - - # Initialize momentum buffers if momentum is used - self.momentum_buffers = {} - if momentum > 0: - for i, param in enumerate(parameters): - self.momentum_buffers[id(param)] = None - ### END SOLUTION - - def step(self) -> None: - """ - Perform one optimization step with optional momentum. - - TODO: Implement SGD parameter updates with momentum support. - - APPROACH: - 1. Iterate through all parameters - 2. For each parameter with gradient: - a. If momentum > 0: update velocity buffer - b. Apply parameter update using velocity or direct gradient - 3. Handle momentum buffer initialization and updates - - MATHEMATICAL FORMULATION: - Without momentum: θ = θ - α∇θ - With momentum: v = βv + ∇θ, θ = θ - αv - - IMPLEMENTATION HINTS: - - Check if param.grad exists before using it - - Initialize momentum buffer with first gradient if None - - Use momentum coefficient to blend old and new gradients - - Apply learning rate to final update - """ - ### BEGIN SOLUTION - for param in self.parameters: - grad_data = get_grad_data(param) - if grad_data is not None: - current_data = get_param_data(param) - - if self.momentum > 0: - # SGD with momentum - param_id = id(param) - - if self.momentum_buffers[param_id] is None: - # Initialize momentum buffer with first gradient - velocity = grad_data - else: - # Update velocity: v = βv + ∇θ - velocity = self.momentum * self.momentum_buffers[param_id] + grad_data - - # Store updated velocity - self.momentum_buffers[param_id] = velocity - - # Update parameter: θ = θ - αv - new_data = current_data - self.learning_rate * velocity - else: - # Vanilla SGD: θ = θ - 
α∇θ - new_data = current_data - self.learning_rate * grad_data - - set_param_data(param, new_data) - ### END SOLUTION - - def zero_grad(self) -> None: - """ - Zero out gradients for all parameters. - - TODO: Clear all gradients to prepare for the next backward pass. - - APPROACH: - 1. Iterate through all parameters - 2. Set gradient to None for each parameter - 3. This prevents gradient accumulation from previous steps - - IMPLEMENTATION HINTS: - - Set param.grad = None for each parameter - - Don't clear momentum buffers (they persist across steps) - - This is essential before each backward pass - """ - ### BEGIN SOLUTION - for param in self.parameters: - param.grad = None - ### END SOLUTION - -# %% [markdown] -""" -### 🧪 Test: SGD Optimizer -This test confirms our SGD optimizer works with and without momentum. -**What we're testing**: Complete SGD optimizer with velocity accumulation -**Why it matters**: SGD with momentum is used in most neural network training -**Expected**: Parameters update with accumulated velocity, not just raw gradients -""" - -# %% nbgrader={"grade": true, "grade_id": "test-sgd", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false} -def test_unit_sgd_optimizer(): - """Unit test for SGD optimizer with momentum support.""" - print("🔬 Unit Test: SGD Optimizer...") - - # Create test parameters - w1 = Variable(1.0, requires_grad=True) - w2 = Variable(2.0, requires_grad=True) - b = Variable(0.5, requires_grad=True) - - # Test vanilla SGD (no momentum) - optimizer = SGD([w1, w2, b], learning_rate=0.1, momentum=0.0) - - # Test initialization - try: - assert optimizer.learning_rate == 0.1, "Learning rate should be stored correctly" - assert optimizer.momentum == 0.0, "Momentum should be stored correctly" - assert len(optimizer.parameters) == 3, "Should store all 3 parameters" - print("PASS Initialization works correctly") - - except Exception as e: - print(f"FAIL Initialization failed: {e}") - raise - - # Test 
zero_grad - try: - w1.grad = Variable(0.1) - w2.grad = Variable(0.2) - b.grad = Variable(0.05) - - optimizer.zero_grad() - - assert w1.grad is None, "Gradient should be None after zero_grad" - assert w2.grad is None, "Gradient should be None after zero_grad" - assert b.grad is None, "Gradient should be None after zero_grad" - print("PASS zero_grad() works correctly") - - except Exception as e: - print(f"FAIL zero_grad() failed: {e}") - raise - - # Test vanilla SGD step - try: - w1.grad = Variable(0.1) - w2.grad = Variable(0.2) - b.grad = Variable(0.05) - - # Store original values - original_w1 = w1.data.data.item() - original_w2 = w2.data.data.item() - original_b = b.data.data.item() - - optimizer.step() - - # Check updates: param = param - lr * grad - expected_w1 = original_w1 - 0.1 * 0.1 # 1.0 - 0.01 = 0.99 - expected_w2 = original_w2 - 0.1 * 0.2 # 2.0 - 0.02 = 1.98 - expected_b = original_b - 0.1 * 0.05 # 0.5 - 0.005 = 0.495 - - assert abs(w1.data.data.item() - expected_w1) < 1e-6, f"w1 update failed" - assert abs(w2.data.data.item() - expected_w2) < 1e-6, f"w2 update failed" - assert abs(b.data.data.item() - expected_b) < 1e-6, f"b update failed" - print("PASS Vanilla SGD step works correctly") - - except Exception as e: - print(f"FAIL Vanilla SGD step failed: {e}") - raise - - # Test SGD with momentum - try: - w_momentum = Variable(1.0, requires_grad=True) - optimizer_momentum = SGD([w_momentum], learning_rate=0.1, momentum=0.9) - - # First step - w_momentum.grad = Variable(0.1) - optimizer_momentum.step() - - # Should be: v₁ = 0.9*0 + 0.1 = 0.1, θ₁ = 1.0 - 0.1*0.1 = 0.99 - expected_first = 1.0 - 0.1 * 0.1 - assert abs(w_momentum.data.data.item() - expected_first) < 1e-6, "First momentum step failed" - - # Second step with same gradient - w_momentum.grad = Variable(0.1) - optimizer_momentum.step() - - # Should be: v₂ = 0.9*0.1 + 0.1 = 0.19, θ₂ = 0.99 - 0.1*0.19 = 0.971 - expected_second = expected_first - 0.1 * 0.19 - assert abs(w_momentum.data.data.item() - 
expected_second) < 1e-6, "Second momentum step failed" - - print("PASS Momentum SGD works correctly") - - except Exception as e: - print(f"FAIL Momentum SGD failed: {e}") - raise - - print("✅ Success! SGD optimizer works correctly!") - print(f" • Vanilla SGD: Updates parameters directly with gradients") - print(f" • Momentum SGD: Accumulates velocity for smoother convergence") - print(f" • Memory efficient: Scales properly with parameter count") - -test_unit_sgd_optimizer() # Run immediately - -# PASS IMPLEMENTATION CHECKPOINT: SGD with momentum complete - -# THINK PREDICTION: How much faster will momentum SGD converge compared to vanilla SGD? -# Your guess: ____x faster - -def analyze_sgd_momentum_convergence(): - """📊 Compare convergence behavior of vanilla SGD vs momentum SGD.""" - print("📊 Analyzing SGD vs momentum convergence...") - - # Simulate optimization on quadratic function: f(x) = (x-3)² - def simulate_optimization(optimizer_name, start_x=0.0, lr=0.1, momentum=0.0, steps=10): - x = Variable(start_x, requires_grad=True) - optimizer = SGD([x], learning_rate=lr, momentum=momentum) - - losses = [] - positions = [] - - for step in range(steps): - # Compute loss and gradient for f(x) = (x-3)² - target = 3.0 - current_pos = x.data.data.item() - loss = (current_pos - target) ** 2 - gradient = 2 * (current_pos - target) - - losses.append(loss) - positions.append(current_pos) - - # Set gradient and update - x.grad = Variable(gradient) - optimizer.step() - x.grad = None - - return losses, positions - - # Compare optimizers - start_position = 0.0 - learning_rate = 0.1 - - vanilla_losses, vanilla_positions = simulate_optimization("Vanilla SGD", start_position, lr=learning_rate, momentum=0.0) - momentum_losses, momentum_positions = simulate_optimization("Momentum SGD", start_position, lr=learning_rate, momentum=0.9) - - print(f"Optimizing f(x) = (x-3)² starting from x={start_position}") - print(f"Learning rate: {learning_rate}") - print(f"Target position: 3.0") - 
print() - - print("Step | Vanilla SGD | Momentum SGD | Speedup") - print("-" * 45) - for i in range(min(8, len(vanilla_positions))): - vanilla_pos = vanilla_positions[i] - momentum_pos = momentum_positions[i] - - # Calculate distance to target - vanilla_dist = abs(vanilla_pos - 3.0) - momentum_dist = abs(momentum_pos - 3.0) - speedup = vanilla_dist / (momentum_dist + 1e-8) - - print(f"{i:4d} | {vanilla_pos:10.4f} | {momentum_pos:11.4f} | {speedup:6.2f}x") - - # Final convergence analysis - final_vanilla_error = abs(vanilla_positions[-1] - 3.0) - final_momentum_error = abs(momentum_positions[-1] - 3.0) - overall_speedup = final_vanilla_error / (final_momentum_error + 1e-8) - - print(f"\nFinal error - Vanilla: {final_vanilla_error:.6f}, Momentum: {final_momentum_error:.6f}") - print(f"Speedup: {overall_speedup:.2f}x") - - print(f"\n💡 Momentum builds velocity for {overall_speedup:.1f}x faster convergence") - print("🚀 Essential for escaping narrow valleys in loss landscapes") - -# Analyze SGD vs momentum convergence -analyze_sgd_momentum_convergence() - -def visualize_optimizer_convergence(): - """ - Create visual comparison of optimizer convergence curves. - - This function demonstrates convergence patterns by training on a simple - quadratic loss function and plotting actual loss curves. 
- - WHY THIS MATTERS: Visualizing convergence helps understand: - - When to stop training (convergence detection) - - Which optimizer converges faster for your problem - - How learning rate affects convergence speed - - When oscillations indicate instability - """ - try: - print("\n" + "=" * 50) - print("📊 CONVERGENCE VISUALIZATION ANALYSIS") - print("=" * 50) - - # Simple quadratic loss function: f(x) = (x - 2)^2 + 1 - # Global minimum at x = 2, minimum value = 1 - def quadratic_loss(x_val): - """Simple quadratic with known minimum.""" - return (x_val - 2.0) ** 2 + 1.0 - - def compute_gradient(x_val): - """Gradient of quadratic: 2(x - 2)""" - return 2.0 * (x_val - 2.0) - - # Training parameters - epochs = 50 - learning_rate = 0.1 - - # Initialize parameters for each optimizer - x_sgd = Variable(np.array([5.0]), requires_grad=True) # Start far from minimum - x_momentum = Variable(np.array([5.0]), requires_grad=True) - x_adam = Variable(np.array([5.0]), requires_grad=True) - - # Create optimizers (Note: Adam may not be available in all contexts) - sgd_optimizer = SGD([x_sgd], learning_rate=learning_rate) - momentum_optimizer = SGD([x_momentum], learning_rate=learning_rate, momentum=0.9) - # Use a simple mock Adam for demonstration if actual Adam class not available - try: - adam_optimizer = Adam([x_adam], learning_rate=learning_rate) - except NameError: - # Mock Adam behavior for visualization - adam_optimizer = SGD([x_adam], learning_rate=learning_rate * 0.7) # Slightly different LR - - # Store convergence history - sgd_losses = [] - momentum_losses = [] - adam_losses = [] - sgd_params = [] - momentum_params = [] - adam_params = [] - - # Training simulation - for epoch in range(epochs): - # SGD training step - sgd_optimizer.zero_grad() - sgd_val = float(x_sgd.data.flat[0]) if hasattr(x_sgd.data, 'flat') else float(x_sgd.data) - x_sgd.grad = np.array([compute_gradient(sgd_val)]) - sgd_optimizer.step() - sgd_loss = quadratic_loss(sgd_val) - 
sgd_losses.append(sgd_loss) - sgd_params.append(sgd_val) - - # Momentum SGD training step - momentum_optimizer.zero_grad() - momentum_val = float(x_momentum.data.flat[0]) if hasattr(x_momentum.data, 'flat') else float(x_momentum.data) - x_momentum.grad = np.array([compute_gradient(momentum_val)]) - momentum_optimizer.step() - momentum_loss = quadratic_loss(momentum_val) - momentum_losses.append(momentum_loss) - momentum_params.append(momentum_val) - - # Adam training step - adam_optimizer.zero_grad() - adam_val = float(x_adam.data.flat[0]) if hasattr(x_adam.data, 'flat') else float(x_adam.data) - x_adam.grad = np.array([compute_gradient(adam_val)]) - adam_optimizer.step() - adam_loss = quadratic_loss(adam_val) - adam_losses.append(adam_loss) - adam_params.append(adam_val) - - # ASCII Plot Generation (since matplotlib not available) - print("\nPROGRESS CONVERGENCE CURVES (Loss vs Epoch)") - print("-" * 50) - - # Find convergence points (within 1% of minimum) - target_loss = 1.01 # 1% above minimum of 1.0 - - def find_convergence_epoch(losses, target): - for i, loss in enumerate(losses): - if loss <= target: - return i - return len(losses) # Never converged - - sgd_conv = find_convergence_epoch(sgd_losses, target_loss) - momentum_conv = find_convergence_epoch(momentum_losses, target_loss) - adam_conv = find_convergence_epoch(adam_losses, target_loss) - - # Simple ASCII visualization - print(f"Epochs to convergence (loss < {target_loss:.3f}):") - print(f" SGD: {sgd_conv:2d} epochs") - print(f" SGD + Momentum: {momentum_conv:2d} epochs") - print(f" Adam: {adam_conv:2d} epochs") - - # Show loss progression at key epochs - epochs_to_show = [0, 10, 20, 30, 40, 49] - print(f"\nLoss progression:") - print("Epoch | SGD | Momentum| Adam ") - print("-------|---------|---------|--------") - for epoch in epochs_to_show: - if epoch < len(sgd_losses): - print(f" {epoch:2d} | {sgd_losses[epoch]:7.3f} | {momentum_losses[epoch]:7.3f} | {adam_losses[epoch]:7.3f}") - - # Final 
parameter values - print(f"\nFinal parameter values (target: 2.000):") - print(f" SGD: {sgd_params[-1]:.3f}") - print(f" SGD + Momentum: {momentum_params[-1]:.3f}") - print(f" Adam: {adam_params[-1]:.3f}") - - # Convergence insights - print(f"\n💡 Convergence insights:") - print(f"• SGD: {'Steady' if sgd_conv < epochs else 'Slow'} convergence") - print(f"• Momentum: {'Accelerated' if momentum_conv < sgd_conv else 'Similar'} convergence") - print(f"• Adam: {'Adaptive' if adam_conv < max(sgd_conv, momentum_conv) else 'Standard'} convergence") - - # Systems implications - print(f"\n🚀 Production implications:") - print(f"• Early stopping: Could stop training at epoch {min(sgd_conv, momentum_conv, adam_conv)}") - print(f"• Resource efficiency: Faster convergence = less compute time") - print(f"• Memory trade-off: Adam's 3× memory may be worth faster convergence") - print(f"• Learning rate sensitivity: Different optimizers need different LRs") - - return { - 'sgd_losses': sgd_losses, - 'momentum_losses': momentum_losses, - 'adam_losses': adam_losses, - 'convergence_epochs': {'sgd': sgd_conv, 'momentum': momentum_conv, 'adam': adam_conv} - } - - except Exception as e: - print(f"WARNING️ Error in convergence visualization: {e}") - return None - -# Visualize optimizer convergence patterns -visualize_optimizer_convergence() - -# %% [markdown] -""" -## Step 3: The Adaptive Expert - Adam Optimizer - -Adam is like having a personal trainer for every parameter in your network. While SGD treats all parameters the same, Adam watches each one individually and adjusts its training approach based on that parameter's behavior. - -Think of it like this: some parameters need gentle nudges (they're already well-behaved), while others need firm correction (they're all over the place). Adam figures this out automatically. 
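The core idea can be sketched in a few lines before we build the real thing. The snippet below is an illustration of the principle only, not the Adam class implemented later in this module; the parameter names and gradient histories are invented for the demo. It scales one shared base learning rate by each parameter's recent gradient magnitude (root-mean-square), which is the RMSProp-style scaling that Adam builds on:

```python
import numpy as np

# Hypothetical gradient histories for two parameters (invented for illustration)
grad_history = {
    "wild_weight": np.array([10.0, -8.0, 12.0, -9.0]),  # chaotic gradients
    "calm_weight": np.array([0.01, 0.01, 0.01, 0.01]),  # stable gradients
}

base_lr = 0.01
for name, grads in grad_history.items():
    rms = np.sqrt(np.mean(grads ** 2))     # how large are recent gradients?
    effective_lr = base_lr / (rms + 1e-8)  # big gradients -> smaller steps
    print(f"{name}: RMS={rms:.4f}, effective LR={effective_lr:.5f}")
```

The chaotic parameter ends up with a far smaller effective learning rate than the calm one. Adam refines this idea with exponentially decaying averages and bias correction, as shown next.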
- -### The Core Insight: Different Parameters Need Different Treatment - -``` - Traditional Approach (SGD): Adam's Approach: - ┌─────────────────────────┐ ┌─────────────────────────┐ - │ Same LR for all parameters │ │ Custom LR per parameter │ - │ │ │ │ - │ Weight 1: LR = 0.01 │ │ Weight 1: LR = 0.001 │ - │ Weight 2: LR = 0.01 │ │ Weight 2: LR = 0.01 │ - │ Weight 3: LR = 0.01 │ │ Weight 3: LR = 0.005 │ - │ Bias: LR = 0.01 │ │ Bias: LR = 0.02 │ - │ │ │ │ - │ One size fits all │ │ Tailored to each param │ - └─────────────────────────┘ └─────────────────────────┘ - - Parameter Behavior Patterns: - - Unstable Parameter (big gradients): Stable Parameter (small gradients): - Gradients: [10.0, -8.0, 12.0, -9.0] Gradients: [0.01, 0.01, 0.01, 0.01] - ↓ ↓ - Adam thinks: "This parameter is Adam thinks: "This parameter is - wild and chaotic! calm and consistent! - Reduce learning rate Can handle bigger steps - to prevent chaos." safely." - ↓ ↓ - Effective LR: 0.0001 (tamed) Effective LR: 0.01 (accelerated) - -``` - -### How Adam Works: The Two-Moment System - -Adam tracks two things for each parameter: -1. **Momentum (m)**: "Which direction has this parameter been going lately?" -2. **Variance (v)**: "How chaotic/stable are this parameter's gradients?" 
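The two-moment bookkeeping can be tracked in a few lines of plain Python. This is a standalone sketch for one scalar parameter, not the Adam class implemented below, and `adam_moments` is a made-up helper name:

```python
def adam_moments(grads, beta1=0.9, beta2=0.999):
    """Return bias-corrected first/second moments for one scalar parameter."""
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g       # first moment: recent direction
        v = beta2 * v + (1 - beta2) * g ** 2  # second moment: recent magnitude
    m_hat = m / (1 - beta1 ** t)              # bias correction for zero init
    v_hat = v / (1 - beta2 ** t)
    return m_hat, v_hat

# Chaotic vs. stable gradients produce very different normalized steps
for name, grads in [("chaotic", [10.0, -8.0, 12.0, -9.0]),
                    ("stable", [0.01, 0.01, 0.01, 0.01])]:
    m_hat, v_hat = adam_moments(grads)
    step = m_hat / (v_hat ** 0.5 + 1e-8)      # direction / sqrt(variance)
    print(f"{name}: m_hat={m_hat:.3f}, v_hat={v_hat:.4f}, unit-LR step={step:.3f}")
```

The stable parameter's normalized step comes out near 1.0, while the chaotic one is an order of magnitude smaller: large variance automatically tames the update.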
- -``` - Adam's Information Tracking System: - - For Each Parameter, Adam Remembers: - ┌────────────────────────────────────────────┐ - │ Parameter: weight[0][0] │ - │ ┌──────────────────────────────────────┐ │ - │ │ Current value: 2.341 │ │ - │ │ Momentum (m): 0.082 ← direction │ │ - │ │ Variance (v): 0.134 ← stability │ │ - │ │ Adaptive LR: 0.001/√0.134 = 0.0027│ │ - │ └──────────────────────────────────────┘ │ - └────────────────────────────────────────────┘ - - The Adam Algorithm Flow: - - New gradient → [Process] → Custom update for this parameter - │ - v - Step 1: Update momentum - m = 0.9 × old_momentum + 0.1 × current_gradient - │ - Step 2: Update variance - v = 0.999 × old_variance + 0.001 × current_gradient² - │ - Step 3: Apply bias correction (prevents slow start) - m_corrected = m / (1 - 0.9ᵗ) # t = current timestep - v_corrected = v / (1 - 0.999ᵗ) - │ - Step 4: Adaptive parameter update - parameter = parameter - learning_rate × m_corrected / √v_corrected - -``` - -### The Magic: Why Adam Works So Well - -``` - Problem Adam Solves - The Learning Rate Dilemma: - - ┌─────────────────────────────────────────────┐ - │ Traditional SGD Problem: │ - │ │ - │ Pick LR = 0.1 → Some parameters overshoot │ - │ Pick LR = 0.01 → Some parameters too slow │ - │ Pick LR = 0.05 → Compromise, nobody happy │ - │ │ - │ ❓ How do you choose ONE learning rate for │ - │ THOUSANDS of different parameters? │ - └─────────────────────────────────────────────┘ - - Adam's Solution: - ┌─────────────────────────────────────────────┐ - │ “Give every parameter its own learning rate!” │ - │ │ - │ Chaotic parameters → Smaller effective LR │ - │ Stable parameters → Larger effective LR │ - │ Consistent params → Medium effective LR │ - │ │ - │ ✨ Automatic tuning for every parameter! 
│ - └─────────────────────────────────────────────┘ - - Memory Trade-off (1M parameter model): - ┌─────────────────────────────────────────────┐ - │ SGD: [parameters] = 4MB │ - │ Momentum SGD: [params][velocity] = 8MB │ - │ Adam: [params][m][v] = 12MB │ - │ │ - │ Trade-off: 3× memory for adaptive training │ - │ Usually worth it for faster convergence! │ - └─────────────────────────────────────────────┘ -``` - -### Why Adam is the Default Choice - -Adam has become the go-to optimizer because: -- **Self-tuning**: Automatically adjusts to parameter behavior -- **Robust**: Works well across different architectures and datasets -- **Fast convergence**: Often trains faster than SGD with momentum -- **Less sensitive**: More forgiving of learning rate choice - -Let's implement this adaptive powerhouse! -""" - -# %% [markdown] -""" -### 🤔 Assessment Question: Adam's Adaptive Mechanism - -**Understanding Adam's adaptive learning rates:** - -Adam computes per-parameter learning rates using second moments (gradient variance). Explain why this adaptation helps optimization and analyze the bias correction terms. - -Given gradients g = [0.1, 0.01] and learning rate α = 0.001, calculate the first few Adam updates with β₁=0.9, β₂=0.999, ε=1e-8. Show how the adaptive mechanism gives different effective learning rates to the two parameters. -""" - -# %% nbgrader={"grade": true, "grade_id": "adam-mechanism", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} -""" -YOUR ADAM ANALYSIS: - -TODO: Explain Adam's adaptive mechanism and calculate the first few updates. - -Key points to address: -- Why does adaptive learning rate help optimization? -- What do first and second moments capture? -- Why is bias correction necessary? 
-- Calculate m₁, v₁, m̂₁, v̂₁ for both parameters after first update -- Show how effective learning rates differ between parameters - -GRADING RUBRIC: -- Explains adaptive learning rate benefits (2 points) -- Understands first/second moment meaning (2 points) -- Explains bias correction necessity (2 points) -- Correctly calculates Adam updates (3 points) -- Shows effective learning rate differences (1 point) -""" - -### BEGIN SOLUTION -# Adam adapts learning rates per parameter using gradient variance (second moment). -# Large gradients -> large variance -> smaller effective LR (prevents overshooting) -# Small gradients -> small variance -> larger effective LR (accelerates progress) -# -# For gradients g = [0.1, 0.01], α = 0.001, β₁=0.9, β₂=0.999: -# -# Parameter 1 (g=0.1): -# m₁ = 0.9*0 + 0.1*0.1 = 0.01 -# v₁ = 0.999*0 + 0.001*0.01 = 0.00001 -# m̂₁ = 0.01/(1-0.9¹) = 0.01/0.1 = 0.1 -# v̂₁ = 0.00001/(1-0.999¹) = 0.00001/0.001 = 0.01 -# Update₁ = -0.001 * 0.1/(sqrt(0.01) + 1e-8) ≈ -0.001 -# -# Parameter 2 (g=0.01): -# m₁ = 0.9*0 + 0.1*0.01 = 0.001 -# v₁ = 0.999*0 + 0.001*0.0001 = 0.0000001 -# m̂₁ = 0.001/0.1 = 0.01 -# v̂₁ = 0.0000001/0.001 = 0.0001 -# Update₁ = -0.001 * 0.01/(sqrt(0.0001) + 1e-8) ≈ -0.001 -# -# Both get similar effective updates despite a 10× gradient difference! -# Bias correction prevents small initial estimates from causing tiny updates. -### END SOLUTION - -# %% nbgrader={"grade": false, "grade_id": "adam-class", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class Adam: - """ - Adam Optimizer - Adaptive Moment Estimation - - Combines momentum (first moment) with adaptive learning rates (second moment). - Adjusts learning rate per parameter based on gradient history and variance. 
- - Mathematical Update Rules: - m_t = β₁ m_{t-1} + (1-β₁) ∇θ_t <- First moment (momentum) - v_t = β₂ v_{t-1} + (1-β₂) ∇θ_t² <- Second moment (variance) - m̂_t = m_t / (1 - β₁ᵗ) <- Bias correction - v̂_t = v_t / (1 - β₂ᵗ) <- Bias correction - θ_t = θ_{t-1} - α m̂_t / (√v̂_t + ε) <- Adaptive update - - SYSTEMS INSIGHT - Memory Usage: - Adam stores first moment + second moment for each parameter = 3× memory vs SGD. - For large models, this memory overhead can be a limiting factor. - Trade-off: Better convergence vs higher memory requirements. - """ - - def __init__(self, parameters: List[Variable], learning_rate: float = 0.001, - beta1: float = 0.9, beta2: float = 0.999, epsilon: float = 1e-8): - """ - Initialize Adam optimizer. - - Args: - parameters: List of Variables to optimize - learning_rate: Learning rate (default: 0.001, lower than SGD) - beta1: First moment decay rate (default: 0.9) - beta2: Second moment decay rate (default: 0.999) - epsilon: Small constant for numerical stability (default: 1e-8) - - TODO: Initialize Adam optimizer with momentum and adaptive learning rate tracking. - - APPROACH: - 1. Store all hyperparameters - 2. Initialize first moment (momentum) buffers for each parameter - 3. Initialize second moment (variance) buffers for each parameter - 4. 
Set timestep counter for bias correction - - EXAMPLE: - ```python - # Standard Adam optimizer - optimizer = Adam([w, b], learning_rate=0.001) - - # Custom Adam with different betas - optimizer = Adam([w, b], learning_rate=0.01, beta1=0.9, beta2=0.99) - ``` - - IMPLEMENTATION HINTS: - - Use defaultdict or manual dictionary for state storage - - Initialize state lazily (on first use) or pre-allocate - - Remember to track timestep for bias correction - """ - ### BEGIN SOLUTION - self.parameters = parameters - self.learning_rate = learning_rate - self.beta1 = beta1 - self.beta2 = beta2 - self.epsilon = epsilon - - # State tracking - self.state = {} - self.t = 0 # Timestep for bias correction - - # Initialize state for each parameter - for param in parameters: - self.state[id(param)] = { - 'm': None, # First moment (momentum) - 'v': None # Second moment (variance) - } - ### END SOLUTION - - def step(self) -> None: - """ - Perform one Adam optimization step. - - TODO: Implement Adam parameter updates with bias correction. - - APPROACH: - 1. Increment timestep for bias correction - 2. For each parameter with gradient: - a. Get or initialize first/second moment buffers - b. Update first moment: m = β₁m + (1-β₁)g - c. Update second moment: v = β₂v + (1-β₂)g² - d. Apply bias correction: m̂ = m/(1-β₁ᵗ), v̂ = v/(1-β₂ᵗ) - e. 
Update parameter: θ = θ - α m̂/(√v̂ + ε) - - MATHEMATICAL IMPLEMENTATION: - m_t = β₁ m_{t-1} + (1-β₁) ∇θ_t - v_t = β₂ v_{t-1} + (1-β₂) ∇θ_t² - m̂_t = m_t / (1 - β₁ᵗ) - v̂_t = v_t / (1 - β₂ᵗ) - θ_t = θ_{t-1} - α m̂_t / (√v̂_t + ε) - - IMPLEMENTATION HINTS: - - Increment self.t at the start - - Initialize moment buffers to zeros on first use - - Use np.sqrt for the square root operation - - Handle numerical stability with epsilon - """ - ### BEGIN SOLUTION - self.t += 1 # Increment timestep - - for param in self.parameters: - grad_data = get_grad_data(param) - if grad_data is not None: - current_data = get_param_data(param) - param_id = id(param) - - # Get or initialize state - if self.state[param_id]['m'] is None: - self.state[param_id]['m'] = np.zeros_like(grad_data) - self.state[param_id]['v'] = np.zeros_like(grad_data) - - state = self.state[param_id] - - # Update first moment (momentum): m = β₁m + (1-β₁)g - state['m'] = self.beta1 * state['m'] + (1 - self.beta1) * grad_data - - # Update second moment (variance): v = β₂v + (1-β₂)g² - state['v'] = self.beta2 * state['v'] + (1 - self.beta2) * (grad_data ** 2) - - # Bias correction - m_hat = state['m'] / (1 - self.beta1 ** self.t) - v_hat = state['v'] / (1 - self.beta2 ** self.t) - - # Parameter update: θ = θ - α m̂/(√v̂ + ε) - new_data = current_data - self.learning_rate * m_hat / (np.sqrt(v_hat) + self.epsilon) - - set_param_data(param, new_data) - ### END SOLUTION - - def zero_grad(self) -> None: - """ - Zero out gradients for all parameters. - - TODO: Clear all gradients to prepare for the next backward pass. - - APPROACH: - 1. Iterate through all parameters - 2. Set gradient to None for each parameter - 3. 
Don't clear Adam state (momentum and variance persist) - - IMPLEMENTATION HINTS: - - Set param.grad = None for each parameter - - Adam state (m, v) should persist across optimization steps - - Only gradients are cleared, not the optimizer's internal state - """ - ### BEGIN SOLUTION - for param in self.parameters: - param.grad = None - ### END SOLUTION - -# %% [markdown] -""" -### 🧪 Test: Adam Optimizer -This test confirms our Adam optimizer implements the complete adaptive algorithm -**What we're testing**: Momentum + variance tracking + bias correction + adaptive updates -**Why it matters**: Adam is the most widely used optimizer in modern deep learning -**Expected**: Different parameters get different effective learning rates automatically -""" - -# %% nbgrader={"grade": true, "grade_id": "test-adam", "locked": true, "points": 20, "schema_version": 3, "solution": false, "task": false} -def test_unit_adam_optimizer(): - """Unit test for Adam optimizer implementation.""" - print("🔬 Unit Test: Adam Optimizer...") - - # Create test parameters - w = Variable(1.0, requires_grad=True) - b = Variable(0.5, requires_grad=True) - - # Create Adam optimizer - optimizer = Adam([w, b], learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8) - - # Test initialization - try: - assert optimizer.learning_rate == 0.001, "Learning rate should be stored correctly" - assert optimizer.beta1 == 0.9, "Beta1 should be stored correctly" - assert optimizer.beta2 == 0.999, "Beta2 should be stored correctly" - assert optimizer.epsilon == 1e-8, "Epsilon should be stored correctly" - assert optimizer.t == 0, "Timestep should start at 0" - print("PASS Initialization works correctly") - - except Exception as e: - print(f"FAIL Initialization failed: {e}") - raise - - # Test zero_grad - try: - w.grad = Variable(0.1) - b.grad = Variable(0.05) - - optimizer.zero_grad() - - assert w.grad is None, "Gradient should be None after zero_grad" - assert b.grad is None, "Gradient should be None after 
zero_grad" - print("PASS zero_grad() works correctly") - - except Exception as e: - print(f"FAIL zero_grad() failed: {e}") - raise - - # Test first Adam step with bias correction - try: - w.grad = Variable(0.1) - b.grad = Variable(0.05) - - # Store original values - original_w = w.data.data.item() - original_b = b.data.data.item() - - optimizer.step() - - # After first step, timestep should be 1 - assert optimizer.t == 1, "Timestep should be 1 after first step" - - # Check that parameters were updated (exact values depend on bias correction) - new_w = w.data.data.item() - new_b = b.data.data.item() - - assert new_w != original_w, "w should be updated after step" - assert new_b != original_b, "b should be updated after step" - - # Check that state was initialized - w_id = id(w) - b_id = id(b) - assert w_id in optimizer.state, "w state should be initialized" - assert b_id in optimizer.state, "b state should be initialized" - assert optimizer.state[w_id]['m'] is not None, "First moment should be initialized" - assert optimizer.state[w_id]['v'] is not None, "Second moment should be initialized" - - print("PASS First Adam step works correctly") - - except Exception as e: - print(f"FAIL First Adam step failed: {e}") - raise - - # Test second Adam step (momentum accumulation) - try: - w.grad = Variable(0.1) # Same gradient - b.grad = Variable(0.05) - - # Store values before second step - before_second_w = w.data.data.item() - before_second_b = b.data.data.item() - - optimizer.step() - - # After second step, timestep should be 2 - assert optimizer.t == 2, "Timestep should be 2 after second step" - - # Parameters should continue updating - after_second_w = w.data.data.item() - after_second_b = b.data.data.item() - - assert after_second_w != before_second_w, "w should continue updating" - assert after_second_b != before_second_b, "b should continue updating" - - print("PASS Second Adam step works correctly") - - except Exception as e: - print(f"FAIL Second Adam step failed: 
{e}") - raise - - # Test adaptive behavior (different gradients should get different effective learning rates) - try: - w_large = Variable(1.0, requires_grad=True) - w_small = Variable(1.0, requires_grad=True) - - optimizer_adaptive = Adam([w_large, w_small], learning_rate=0.1) - - # Large gradient vs small gradient - w_large.grad = Variable(1.0) # Large gradient - w_small.grad = Variable(0.01) # Small gradient - - original_large = w_large.data.data.item() - original_small = w_small.data.data.item() - - optimizer_adaptive.step() - - update_large = abs(w_large.data.data.item() - original_large) - update_small = abs(w_small.data.data.item() - original_small) - - # Both should get reasonable updates despite very different gradients - assert update_large > 0, "Large gradient parameter should update" - assert update_small > 0, "Small gradient parameter should update" - - print("PASS Adaptive learning rates work correctly") - - except Exception as e: - print(f"FAIL Adaptive learning rates failed: {e}") - raise - - print("✅ Success! Adam optimizer works correctly!") - print(f" • Combines momentum with adaptive learning rates") - print(f" • Bias correction prevents slow start problems") - print(f" • Automatically tunes learning rate per parameter") - print(f" • Memory cost: 3× parameters (params + momentum + variance)") - -test_unit_adam_optimizer() # Run immediately - -# PASS IMPLEMENTATION CHECKPOINT: Adam optimizer complete - -# THINK PREDICTION: Which optimizer will use more memory - SGD with momentum or Adam? 
-# Your guess: Adam uses ____x more memory than SGD - -def analyze_optimizer_memory(): - """Analyze memory usage patterns across different optimizers.""" - try: - print("📊 Analyzing optimizer memory usage...") - - # Simulate memory usage for different model sizes - param_counts = [1000, 10000, 100000, 1000000] # 1K to 1M parameters - - print("Memory Usage Analysis (Float32 = 4 bytes per parameter)") - print(f"{'Parameters':<12} {'SGD':<10} {'SGD+Mom':<10} {'Adam':<10} {'Adam/SGD':<10}") - - for param_count in param_counts: - # Memory calculations (in bytes) - sgd_memory = param_count * 4 # Just parameters - sgd_momentum_memory = param_count * 4 * 2 # Parameters + momentum - adam_memory = param_count * 4 * 3 # Parameters + momentum + variance - - # Convert to MB for readability - sgd_mb = sgd_memory / (1024 * 1024) - sgd_mom_mb = sgd_momentum_memory / (1024 * 1024) - adam_mb = adam_memory / (1024 * 1024) - - ratio = adam_memory / sgd_memory - - print(f"{param_count:<12,} {sgd_mb:<8.1f}MB {sgd_mom_mb:<8.1f}MB {adam_mb:<8.1f}MB {ratio:<8.1f}x") - - print() - print("Real-World Model Examples:") - print("-" * 40) - - # Real model examples - models = [ - ("Small CNN", 100_000), - ("ResNet-18", 11_700_000), - ("BERT-Base", 110_000_000), - ("GPT-2", 1_500_000_000), - ("GPT-3", 175_000_000_000) - ] - - for model_name, params in models: - sgd_gb = (params * 4) / (1024**3) - adam_gb = (params * 12) / (1024**3) # 3x memory - - print(f"{model_name:<12}: SGD {sgd_gb:>6.1f}GB, Adam {adam_gb:>6.1f}GB") - - if adam_gb > 16: # Typical GPU memory - print(f" WARNING️ Adam exceeds typical GPU memory!") - - print("\n💡 Key insights:") - print("• SGD: O(P) memory (just parameters)") - print("• SGD+Momentum: O(2P) memory (parameters + momentum)") - print("• Adam: O(3P) memory (parameters + momentum + variance)") - print("• Memory becomes limiting factor for large models") - print("• Why some teams use SGD for billion-parameter models") - - print("\n🏭 PRODUCTION IMPLICATIONS:") - print("• 
Choose optimizer based on memory constraints") - print("• Adam better for most tasks, SGD for memory-limited scenarios") - print("• Consider memory-efficient variants (AdaFactor, 8-bit Adam)") - - - except Exception as e: - print(f"WARNING️ Error in memory analysis: {e}") - -analyze_optimizer_memory() - -# %% [markdown] -""" -## 🔍 Systems Analysis: Optimizer Performance and Memory - -Now that you've built three different optimizers, let's analyze their behavior to understand the trade-offs between memory usage, convergence speed, and computational overhead that matter in real ML systems. - -### Performance Characteristics Comparison - -``` - Optimizer Performance Matrix: - - ┌──────────────┬──────────┬─────────────┬────────────────┬─────────────────┐ - │ Optimizer │ Memory │ Convergence │ LR Sensitivity │ Use Cases │ - ├──────────────┼──────────┼─────────────┼────────────────┼─────────────────┤ - │ SGD │ 1× (low) │ Slow │ High │ Simple tasks │ - │ SGD+Momentum │ 2× │ Fast │ Medium │ Most vision │ - │ Adam │ 3× (high)│ Fastest │ Low │ Most NLP/DL │ - └──────────────┴──────────┴─────────────┴────────────────┴─────────────────┘ - - Real-World Memory Usage (GPT-2 Scale - 1.5B parameters): - - SGD: Params only = 6.0 GB - SGD+Momentum: Params + vel = 12.0 GB - Adam: Params + m + v = 18.0 GB - - ❓ Question: Why might a team train with Adam but switch to SGD for final fine-tuning? - ✅ Answer: Adam for fast exploration, SGD for precise convergence! -``` - -**Analysis Focus**: Memory overhead, convergence patterns, and computational complexity of our optimizer implementations. -""" - -# %% -def analyze_optimizer_behavior(): - """ - 📊 SYSTEMS MEASUREMENT: Comprehensive Optimizer Analysis - - Analyze memory usage, convergence speed, and computational overhead. 
- """ - print("📊 OPTIMIZER SYSTEMS ANALYSIS") - print("=" * 40) - - import time - - # Test 1: Memory footprint analysis - print("💾 Memory Footprint Analysis:") - - # Create test parameters - num_params = 1000 - test_params = [Variable(np.random.randn(), requires_grad=True) for _ in range(num_params)] - - print(f" Test with {num_params} parameters:") - print(f" SGD (vanilla): ~{num_params * 4}B (parameters only)") - print(f" SGD (momentum): ~{num_params * 8}B (parameters + velocity)") - print(f" Adam: ~{num_params * 12}B (parameters + m + v)") - - # Test 2: Computational overhead - print("\n⚡ Computational Overhead Analysis:") - - # Setup test optimization scenario - x_sgd = Variable(5.0, requires_grad=True) - x_momentum = Variable(5.0, requires_grad=True) - x_adam = Variable(5.0, requires_grad=True) - - sgd_test = SGD([x_sgd], learning_rate=0.1, momentum=0.0) - momentum_test = SGD([x_momentum], learning_rate=0.1, momentum=0.9) - adam_test = Adam([x_adam], learning_rate=0.1) - - def time_optimizer_step(optimizer, param, name): - param.grad = Variable(0.5) # Fixed gradient - - start = time.perf_counter() - for _ in range(100): # Reduced for speed - optimizer.step() - end = time.perf_counter() - - return (end - start) * 1000 # Convert to milliseconds - - sgd_time = time_optimizer_step(sgd_test, x_sgd, "SGD") - momentum_time = time_optimizer_step(momentum_test, x_momentum, "Momentum") - adam_time = time_optimizer_step(adam_test, x_adam, "Adam") - - print(f" 100 optimization steps:") - print(f" SGD: {sgd_time:.2f}ms (baseline)") - print(f" Momentum: {momentum_time:.2f}ms ({momentum_time/sgd_time:.1f}x overhead)") - print(f" Adam: {adam_time:.2f}ms ({adam_time/sgd_time:.1f}x overhead)") - - # Test 3: Convergence analysis - print("\n🏁 Convergence Speed Analysis:") - - def test_convergence(optimizer_class, **kwargs): - # Optimize f(x) = (x-2)² starting from x=0 - x = Variable(0.0, requires_grad=True) - optimizer = optimizer_class([x], **kwargs) - - for epoch in range(50): 
- # Compute loss and gradient - # Handle scalar values properly - if hasattr(x.data, 'data'): - current_val = float(x.data.data) if x.data.data.ndim == 0 else float(x.data.data[0]) - else: - current_val = float(x.data) if np.isscalar(x.data) else float(x.data[0]) - loss = (current_val - 2.0) ** 2 - x.grad = Variable(2.0 * (current_val - 2.0)) # Analytical gradient - - optimizer.step() - - if loss < 0.01: # Converged - return epoch - - return 50 # Never converged - - sgd_epochs = test_convergence(SGD, learning_rate=0.1, momentum=0.0) - momentum_epochs = test_convergence(SGD, learning_rate=0.1, momentum=0.9) - adam_epochs = test_convergence(Adam, learning_rate=0.1) - - print(f" Epochs to convergence (loss < 0.01):") - print(f" SGD: {sgd_epochs} epochs") - print(f" Momentum: {momentum_epochs} epochs") - print(f" Adam: {adam_epochs} epochs") - - print("\n💡 OPTIMIZER INSIGHTS:") - print(" ┌───────────────────────────────────────────────────────────┐") - print(" │ Optimizer Performance Characteristics │") - print(" ├───────────────────────────────────────────────────────────┤") - print(" │ Memory Usage: │") - print(" │ • SGD: O(P) - just parameters │") - print(" │ • Momentum: O(2P) - parameters + velocity │") - print(" │ • Adam: O(3P) - parameters + momentum + variance │") - print(" │ │") - print(" │ Computational Overhead: │") - print(" │ • SGD: Baseline (simple gradient update) │") - print(" │ • Momentum: ~1.2x (velocity accumulation) │") - print(" │ • Adam: ~2x (moment tracking + bias correction) │") - print(" │ │") - print(" │ Production Trade-offs: │") - print(" │ • Large models: SGD for memory efficiency │") - print(" │ • Research/prototyping: Adam for speed and robustness │") - print(" │ • Fine-tuning: Often switch SGD for final precision │") - print(" └───────────────────────────────────────────────────────────┘") - print("") - print(" 🚀 Production Implications:") - print(" • Memory: Adam requires 3x memory vs SGD - plan GPU memory accordingly") - print(" • 
Speed: Adam's robustness often outweighs computational overhead") - print(" • Stability: Adam handles diverse learning rates better (less tuning needed)") - print(" • Scaling: SGD preferred for models that don't fit in memory with Adam") - print(" • Why PyTorch defaults to Adam: Best balance of speed, stability, and ease of use") - -analyze_optimizer_behavior() - -# %% [markdown] -""" -## Step 3.5: Gradient Clipping and Numerical Stability - -### Why Gradient Clipping Matters - -**The Problem**: Large gradients can destabilize training, especially in RNNs or very deep networks: - -``` -Normal Training: - Gradient: [-0.1, 0.2, -0.05] -> Update: [-0.01, 0.02, -0.005] OK - -Exploding Gradients: - Gradient: [-15.0, 23.0, -8.0] -> Update: [-1.5, 2.3, -0.8] FAIL Too large! - -Result: Parameters jump far from optimum, loss explodes -``` - -### Visual: Gradient Clipping in Action -``` -Gradient Landscape: - - Loss - ^ - | +- Clipping threshold (e.g., 1.0) - | / - | / - | / Original gradient (magnitude = 2.5) - | / Clipped gradient (magnitude = 1.0) - |/ - +-------> Parameters - -Clipping: gradient = gradient * (threshold / ||gradient||) if ||gradient|| > threshold -``` - -### Mathematical Foundation -**Gradient Norm Clipping**: -``` -1. Compute gradient norm: ||g|| = sqrt(g₁² + g₂² + ... + gₙ²) -2. If ||g|| > threshold: - g_clipped = g * (threshold / ||g||) -3. Else: g_clipped = g -``` - -**Why This Works**: -- Preserves gradient direction (most important for optimization) -- Limits magnitude to prevent parameter jumps -- Allows adaptive threshold based on problem characteristics -""" - -# %% nbgrader={"grade": false, "grade_id": "gradient-clipping", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -def clip_gradients(parameters: List[Variable], max_norm: float = 1.0) -> float: - """ - Clip gradients by global norm to prevent exploding gradients. 
- - Args: - parameters: List of Variables with gradients - max_norm: Maximum allowed gradient norm (default: 1.0) - - Returns: - float: The original gradient norm before clipping - - TODO: Implement gradient clipping by global norm. - - APPROACH: - 1. Calculate total gradient norm across all parameters - 2. If norm exceeds max_norm, scale all gradients proportionally - 3. Return original norm for monitoring - - EXAMPLE: - >>> x = Variable(np.array([1.0]), requires_grad=True) - >>> x.grad = np.array([5.0]) # Large gradient - >>> norm = clip_gradients([x], max_norm=1.0) - >>> print(f"Original norm: {norm}, Clipped gradient: {x.grad}") - Original norm: 5.0, Clipped gradient: [1.0] - - PRODUCTION NOTE: All major frameworks include gradient clipping. - PyTorch: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm) - """ - ### BEGIN SOLUTION - # Calculate total gradient norm - total_norm = 0.0 - for param in parameters: - if param.grad is not None: - param_norm = np.linalg.norm(param.grad) - total_norm += param_norm ** 2 - - total_norm = np.sqrt(total_norm) - - # Apply clipping if necessary - if total_norm > max_norm: - clip_coef = max_norm / total_norm - for param in parameters: - if param.grad is not None: - param.grad = param.grad * clip_coef - - return total_norm - ### END SOLUTION - -def analyze_numerical_stability(): - """ - Demonstrate gradient clipping effects and numerical issues at scale. - - This analysis shows why gradient clipping is essential for stable training, - especially in production systems with large models and diverse data. 
- """ - try: - print("📊 Analyzing numerical stability...") - - # Create parameters with different gradient magnitudes - param1 = Variable(np.array([1.0]), requires_grad=True) - param2 = Variable(np.array([0.5]), requires_grad=True) - param3 = Variable(np.array([2.0]), requires_grad=True) - - # Simulate different gradient scenarios - scenarios = [ - ("Normal gradients", [0.1, 0.2, -0.15]), - ("Large gradients", [5.0, -3.0, 8.0]), - ("Exploding gradients", [50.0, -30.0, 80.0]) - ] - - print("Gradient Clipping Scenarios:") - print("Scenario | Original Norm | Clipped Norm | Reduction") - - for scenario_name, gradients in scenarios: - # Set gradients - param1.grad = np.array([gradients[0]]) - param2.grad = np.array([gradients[1]]) - param3.grad = np.array([gradients[2]]) - - # Clip gradients - original_norm = clip_gradients([param1, param2, param3], max_norm=1.0) - - # Calculate new norm - new_norm = 0.0 - for param in [param1, param2, param3]: - if param.grad is not None: - new_norm += np.linalg.norm(param.grad) ** 2 - new_norm = np.sqrt(new_norm) - - reduction = (original_norm - new_norm) / original_norm * 100 if original_norm > 0 else 0 - - print(f"{scenario_name:<16} | {original_norm:>11.2f} | {new_norm:>10.2f} | {reduction:>7.1f}%") - - # Demonstrate numerical precision issues - print(f"\n💡 Numerical precision insights:") - - # Very small numbers (underflow risk) - small_grad = 1e-8 - print(f"• Very small gradient: {small_grad:.2e}") - print(f" Adam epsilon (1e-8) prevents division by zero in denominator") - - # Very large numbers (overflow risk) - large_grad = 1e6 - print(f"• Very large gradient: {large_grad:.2e}") - print(f" Gradient clipping prevents parameter explosion") - - # Floating point precision - print(f"• Float32 precision: ~7 decimal digits") - print(f" Large parameters + small gradients = precision loss") - - # Production implications - print(f"\n🚀 Production implications:") - print(f"• Mixed precision (float16/float32) requires careful gradient 
scaling") - print(f"• Distributed training amplifies numerical issues across GPUs") - print(f"• Gradient accumulation may need norm rescaling") - print(f"• Learning rate scheduling affects gradient scale requirements") - - # Scale analysis - print(f"\n📊 SCALE ANALYSIS:") - model_sizes = [ - ("Small model", 1e6, "1M parameters"), - ("Medium model", 100e6, "100M parameters"), - ("Large model", 7e9, "7B parameters"), - ("Very large model", 175e9, "175B parameters") - ] - - for name, params, desc in model_sizes: - # Estimate memory for gradients at different precisions - fp32_mem = params * 4 / 1e9 # bytes to GB - fp16_mem = params * 2 / 1e9 - - print(f" {desc}:") - print(f" Gradient memory (FP32): {fp32_mem:.1f} GB") - print(f" Gradient memory (FP16): {fp16_mem:.1f} GB") - - # When clipping becomes critical - if params > 1e9: - print(f" WARNING️ Gradient clipping CRITICAL for stability") - elif params > 100e6: - print(f" 📊 Gradient clipping recommended") - else: - print(f" PASS Standard gradients usually stable") - - except Exception as e: - print(f"WARNING️ Error in numerical stability analysis: {e}") - -# Analyze gradient clipping and numerical stability -analyze_numerical_stability() - -# %% [markdown] -""" -## Step 4: Learning Rate Scheduling - -### Visual: Learning Rate Scheduling Effects -``` -Learning Rate Over Time: - -Constant LR: -LR +---------------------------------------- - | α = 0.01 (same throughout training) - +-----------------------------------------> Steps - -Step Decay: -LR +---------+ - | α = 0.01 | - | +---------+ - | α = 0.001| | - | | +--------------------- - | | α = 0.0001 - +----------+---------+----------------------> Steps - step1 step2 - -Exponential Decay: -LR +-\ - | \\ - | \\__ - | \\__ - | \\____ - | \\________ - +-------------------------------------------> Steps -``` - -### Why Learning Rate Scheduling Matters -**Problem**: Fixed learning rate throughout training is suboptimal: -- **Early training**: Need larger LR to make progress 
quickly -- **Late training**: Need smaller LR to fine-tune and not overshoot optimum - -**Solution**: Adaptive learning rate schedules: -- **Step decay**: Reduce LR at specific milestones -- **Exponential decay**: Gradually reduce LR over time -- **Cosine annealing**: Smooth reduction with periodic restarts - -### Mathematical Foundation -**Step Learning Rate Scheduler**: -``` -LR(epoch) = initial_lr * gamma^⌊epoch / step_size⌋ -``` - -Where: -- initial_lr: Starting learning rate -- gamma: Multiplicative factor (e.g., 0.1) -- step_size: Epochs between reductions - -### Scheduling Strategy Visualization -``` -Training Progress with Different Schedules: - -High LR Phase (Exploration): - Loss landscape exploration - ↙ ↘ ↙ ↘ (large steps, finding good regions) - -Medium LR Phase (Convergence): - v v v (steady progress toward minimum) - -Low LR Phase (Fine-tuning): - v v (small adjustments, precision optimization) -``` -""" - -# %% [markdown] -""" -### 🤔 Assessment Question: Learning Rate Scheduling Strategy - -**Understanding when and why to adjust learning rates:** - -You're training a neural network and notice the loss plateaus after 50 epochs, then starts oscillating around a value. Design a learning rate schedule to address this issue. - -Explain what causes loss plateaus and oscillations, and why reducing learning rate helps. Compare step decay vs exponential decay for this scenario. -""" - -# %% nbgrader={"grade": true, "grade_id": "lr-scheduling", "locked": false, "points": 8, "schema_version": 3, "solution": true, "task": false} -""" -YOUR LEARNING RATE SCHEDULING ANALYSIS: - -TODO: Explain loss plateaus/oscillations and design an appropriate LR schedule. - -Key points to address: -- What causes loss plateaus in neural network training? -- Why do oscillations occur and how does LR reduction help? -- Design a specific schedule: when to reduce, by how much? 
-- Compare step decay vs exponential decay for this scenario -- Consider practical implementation details - -GRADING RUBRIC: -- Explains loss plateau and oscillation causes (2 points) -- Understands how LR reduction addresses issues (2 points) -- Designs reasonable LR schedule with specific values (2 points) -- Compares scheduling strategies appropriately (2 points) -""" - -### BEGIN SOLUTION -# Loss plateaus occur when updates stop making net progress - the LR may be too -# small to move the parameters, or too large to settle into a narrow minimum. -# Oscillations happen when LR is too large, causing overshooting around the minimum. -# -# For loss plateau at epoch 50 with oscillations: -# 1. Plateau suggests we're near a local minimum but LR is too large for fine-tuning -# 2. Oscillations confirm overshooting - need smaller steps -# -# Proposed schedule: -# - Epochs 0-49: LR = 0.01 (initial exploration) -# - Epochs 50-99: LR = 0.001 (reduce by 10x when plateau detected) -# - Epochs 100+: LR = 0.0001 (final fine-tuning) -# -# Step decay vs Exponential: -# - Step decay: Sudden reductions allow quick adaptation to new regime -# - Exponential: Smooth transitions but may be too gradual for plateau situations -# -# For plateaus, step decay is better as it provides immediate adjustment to the -# learning dynamics when stagnation is detected. -### END SOLUTION - -# %% nbgrader={"grade": false, "grade_id": "step-scheduler", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class StepLR: - """ - Step Learning Rate Scheduler - - Reduces learning rate by a factor (gamma) every step_size epochs. - This helps neural networks converge better by using high learning rates - initially for fast progress, then lower rates for fine-tuning. - - Mathematical Formula: - LR(epoch) = initial_lr * gamma^⌊epoch / step_size⌋ - - SYSTEMS INSIGHT - Training Dynamics: - Learning rate scheduling is crucial for training stability and final performance.
- Proper scheduling can improve final accuracy by 1-5% and reduce training time. - Most production training pipelines use some form of LR scheduling. - """ - - def __init__(self, optimizer: Union[SGD, Adam], step_size: int, gamma: float = 0.1): - """ - Initialize step learning rate scheduler. - - Args: - optimizer: SGD or Adam optimizer to schedule - step_size: Number of epochs between LR reductions - gamma: Multiplicative factor for LR reduction (default: 0.1) - - TODO: Initialize scheduler with optimizer and decay parameters. - - APPROACH: - 1. Store reference to optimizer - 2. Store scheduling parameters (step_size, gamma) - 3. Save initial learning rate for calculations - 4. Initialize epoch counter - - EXAMPLE: - ```python - optimizer = SGD([w, b], learning_rate=0.01) - scheduler = StepLR(optimizer, step_size=30, gamma=0.1) - - # Training loop: - for epoch in range(100): - train_one_epoch() - scheduler.step() # Update learning rate - ``` - - IMPLEMENTATION HINTS: - - Store initial_lr from optimizer.learning_rate - - Keep track of current epoch for step calculations - - Maintain reference to optimizer for LR updates - """ - ### BEGIN SOLUTION - self.optimizer = optimizer - self.step_size = step_size - self.gamma = gamma - self.initial_lr = optimizer.learning_rate - self.current_epoch = 0 - ### END SOLUTION - - def step(self) -> None: - """ - Update learning rate based on current epoch. - - TODO: Implement step LR scheduling logic. - - APPROACH: - 1. Increment current epoch counter - 2. Calculate new learning rate using step formula - 3. Update optimizer's learning rate - 4. 
Optionally log the learning rate change - - MATHEMATICAL IMPLEMENTATION: - LR(epoch) = initial_lr * gamma^⌊epoch / step_size⌋ - - EXAMPLE BEHAVIOR: - initial_lr=0.01, step_size=30, gamma=0.1: - - Epochs 0-29: LR = 0.01 - - Epochs 30-59: LR = 0.001 - - Epochs 60-89: LR = 0.0001 - - IMPLEMENTATION HINTS: - - Use integer division (//) for step calculation - - Update optimizer.learning_rate directly - - Consider numerical precision for very small LRs - """ - ### BEGIN SOLUTION - # Calculate number of LR reductions based on current epoch - decay_steps = self.current_epoch // self.step_size - - # Apply step decay formula - new_lr = self.initial_lr * (self.gamma ** decay_steps) - - # Update optimizer learning rate - self.optimizer.learning_rate = new_lr - - # Increment epoch counter for next call - self.current_epoch += 1 - ### END SOLUTION - - def get_lr(self) -> float: - """ - Get current learning rate without updating. - - TODO: Return current learning rate based on epoch. - - APPROACH: - 1. Calculate current LR using step formula - 2. Return the value without side effects - 3. Useful for logging and monitoring - - IMPLEMENTATION HINTS: - - Use same formula as step() but don't increment epoch - - Return the calculated learning rate value - """ - ### BEGIN SOLUTION - decay_steps = self.current_epoch // self.step_size - return self.initial_lr * (self.gamma ** decay_steps) - ### END SOLUTION - -# %% [markdown] -""" -### TEST Unit Test: Learning Rate Scheduler - -Let's test your learning rate scheduler implementation! This ensures proper LR decay over epochs. - -**This is a unit test** - it tests the StepLR scheduler in isolation. 
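The decay sequence this test expects can be previewed in plain Python, independent of the optimizer classes (the helper name `lr_after_calls` is just for illustration). With `initial_lr=0.01`, `step_size=10`, `gamma=0.1`, the k-th call to `step()` applies the formula with epoch k - 1, matching a `step()` that reads the epoch counter before incrementing it:

```python
def lr_after_calls(num_calls, initial_lr=0.01, step_size=10, gamma=0.1):
    # LR set by the most recent step(); the epoch counter it read was num_calls - 1
    return initial_lr * gamma ** ((num_calls - 1) // step_size)

assert abs(lr_after_calls(10) - 0.01) < 1e-12    # calls 1-10: initial LR held
assert abs(lr_after_calls(11) - 0.001) < 1e-12   # call 11: first 10x decay
assert abs(lr_after_calls(21) - 0.0001) < 1e-9   # call 21: second decay
```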
-""" - -# %% nbgrader={"grade": true, "grade_id": "test-step-scheduler", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false} -def test_unit_step_scheduler(): - """Unit test for step learning rate scheduler.""" - print("🔬 Unit Test: Step Learning Rate Scheduler...") - - # Create optimizer and scheduler - w = Variable(1.0, requires_grad=True) - optimizer = SGD([w], learning_rate=0.01) - scheduler = StepLR(optimizer, step_size=10, gamma=0.1) - - # Test initialization - try: - assert scheduler.step_size == 10, "Step size should be stored correctly" - assert scheduler.gamma == 0.1, "Gamma should be stored correctly" - assert scheduler.initial_lr == 0.01, "Initial LR should be stored correctly" - assert scheduler.current_epoch == 0, "Should start at epoch 0" - print("PASS Initialization works correctly") - - except Exception as e: - print(f"FAIL Initialization failed: {e}") - raise - - # Test get_lr before any steps - try: - initial_lr = scheduler.get_lr() - assert initial_lr == 0.01, f"Initial LR should be 0.01, got {initial_lr}" - print("PASS get_lr() works correctly") - - except Exception as e: - print(f"FAIL get_lr() failed: {e}") - raise - - # Test LR updates over multiple epochs - try: - # First 10 epochs should maintain initial LR - for epoch in range(10): - scheduler.step() - current_lr = optimizer.learning_rate - expected_lr = 0.01 # No decay yet - assert abs(current_lr - expected_lr) < 1e-10, f"Epoch {epoch+1}: expected {expected_lr}, got {current_lr}" - - print("PASS First 10 epochs maintain initial LR") - - # Epoch 11 should trigger first decay - scheduler.step() # Epoch 11 - current_lr = optimizer.learning_rate - expected_lr = 0.01 * 0.1 # First decay - assert abs(current_lr - expected_lr) < 1e-10, f"First decay: expected {expected_lr}, got {current_lr}" - - print("PASS First LR decay works correctly") - - # Continue to second decay point - for epoch in range(9): # Epochs 12-20 - scheduler.step() - - scheduler.step() # Epoch 
21 - current_lr = optimizer.learning_rate - expected_lr = 0.01 * (0.1 ** 2) # Second decay - assert abs(current_lr - expected_lr) < 1e-10, f"Second decay: expected {expected_lr}, got {current_lr}" - - print("PASS Second LR decay works correctly") - - except Exception as e: - print(f"FAIL LR decay failed: {e}") - raise - - # Test with different parameters - try: - optimizer2 = Adam([w], learning_rate=0.001) - scheduler2 = StepLR(optimizer2, step_size=5, gamma=0.5) - - # Test 5 steps - for _ in range(5): - scheduler2.step() - - scheduler2.step() # 6th step should trigger decay - current_lr = optimizer2.learning_rate - expected_lr = 0.001 * 0.5 - assert abs(current_lr - expected_lr) < 1e-10, f"Custom params: expected {expected_lr}, got {current_lr}" - - print("PASS Custom parameters work correctly") - - except Exception as e: - print(f"FAIL Custom parameters failed: {e}") - raise - - print("TARGET Step LR scheduler behavior:") - print(" Reduces learning rate by gamma every step_size epochs") - print(" Enables fast initial training with gradual fine-tuning") - print(" Essential for achieving optimal model performance") - print("PROGRESS Progress: Learning Rate Scheduling OK") - -# PASS IMPLEMENTATION CHECKPOINT: Learning rate scheduling complete - -# THINK PREDICTION: How much will proper LR scheduling improve final model accuracy? 
-# Your guess: ____% improvement - -def analyze_lr_schedule_impact(): - """Analyze the impact of learning rate scheduling on training dynamics.""" - try: - print("📊 Analyzing learning rate schedule impact...") - print("=" * 55) - - # Simulate training with different LR strategies - def simulate_training_progress(lr_schedule_name, lr_values, epochs=50): - """Simulate loss progression with given LR schedule.""" - loss = 1.0 # Starting loss - losses = [] - - for epoch, lr in enumerate(lr_values[:epochs]): - # Simulate loss reduction (simplified model) - # Higher LR = faster initial progress but less precision - # Lower LR = slower progress but better fine-tuning - - if loss > 0.1: # Early training - LR matters more - progress = lr * 0.1 * (1.0 - loss * 0.1) # Faster with higher LR - else: # Late training - precision matters more - progress = lr * 0.05 / (1.0 + lr * 10) # Better with lower LR - - loss = max(0.01, loss - progress) # Minimum achievable loss - losses.append(loss) - - return losses - - # Different LR strategies - epochs = 50 - - # Strategy 1: Constant LR - constant_lr = [0.01] * epochs - - # Strategy 2: Step decay - step_lr = [] - for epoch in range(epochs): - if epoch < 20: - step_lr.append(0.01) - elif epoch < 40: - step_lr.append(0.001) - else: - step_lr.append(0.0001) - - # Strategy 3: Exponential decay - exponential_lr = [0.01 * (0.95 ** epoch) for epoch in range(epochs)] - - # Simulate training - constant_losses = simulate_training_progress("Constant", constant_lr) - step_losses = simulate_training_progress("Step Decay", step_lr) - exp_losses = simulate_training_progress("Exponential", exponential_lr) - - print("Learning Rate Strategy Comparison:") - print("=" * 40) - print(f"{'Epoch':<6} {'Constant':<10} {'Step':<10} {'Exponential':<12}") - print("-" * 40) - - checkpoints = [5, 15, 25, 35, 45] - for epoch in checkpoints: - const_loss = constant_losses[epoch-1] - step_loss = step_losses[epoch-1] - exp_loss = exp_losses[epoch-1] - - print(f"{epoch:<6} 
{const_loss:<10.4f} {step_loss:<10.4f} {exp_loss:<12.4f}") - - # Final results analysis - final_constant = constant_losses[-1] - final_step = step_losses[-1] - final_exp = exp_losses[-1] - - print(f"\nFinal Loss Comparison:") - print(f"Constant LR: {final_constant:.6f}") - print(f"Step Decay: {final_step:.6f} ({((final_constant-final_step)/final_constant*100):+.1f}%)") - print(f"Exponential: {final_exp:.6f} ({((final_constant-final_exp)/final_constant*100):+.1f}%)") - - # Convergence speed analysis - target_loss = 0.1 - - def find_convergence_epoch(losses, target): - for i, loss in enumerate(losses): - if loss <= target: - return i + 1 - return len(losses) - - const_convergence = find_convergence_epoch(constant_losses, target_loss) - step_convergence = find_convergence_epoch(step_losses, target_loss) - exp_convergence = find_convergence_epoch(exp_losses, target_loss) - - print(f"\nConvergence Speed (to reach loss = {target_loss}):") - print(f"Constant LR: {const_convergence} epochs") - print(f"Step Decay: {step_convergence} epochs ({const_convergence-step_convergence:+d} epochs)") - print(f"Exponential: {exp_convergence} epochs ({const_convergence-exp_convergence:+d} epochs)") - - print("\n💡 Key insights:") - print("• Proper LR scheduling improves final performance by 1-5%") - print("• Step decay provides clear phase transitions (explore -> converge -> fine-tune)") - print("• Exponential decay offers smooth transitions but may converge slower") - print("• LR scheduling often as important as optimizer choice") - - print("\n🏭 PRODUCTION BEST PRACTICES:") - print("• Most successful models use LR scheduling") - print("• Common pattern: high LR -> reduce at plateaus -> final fine-tuning") - print("• Monitor validation loss to determine schedule timing") - print("• Cosine annealing popular for transformer training") - - - except Exception as e: - print(f"WARNING️ Error in LR schedule analysis: {e}") - -# Analyze learning rate schedule impact -analyze_lr_schedule_impact() 
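Before moving on to the advanced schedulers, the two decay formulas from Step 4 can be checked numerically with a few lines of plain Python - no optimizer or Variable machinery needed (the function names here are illustrative, not part of the exported module):

```python
def step_lr(epoch, initial_lr=0.01, step_size=30, gamma=0.1):
    # Step decay: LR(epoch) = initial_lr * gamma^(epoch // step_size)
    return initial_lr * gamma ** (epoch // step_size)

def exponential_lr(epoch, initial_lr=0.01, gamma=0.95):
    # Exponential decay: LR(epoch) = initial_lr * gamma^epoch
    return initial_lr * gamma ** epoch

# Step decay holds the LR flat within each phase, then drops it 10x at the boundary.
assert abs(step_lr(29) - 0.01) < 1e-12
assert abs(step_lr(30) - 0.001) < 1e-12
assert abs(step_lr(60) - 0.0001) < 1e-9

# Exponential decay shaves a fixed 5% off every epoch - smooth, with no sudden drops.
assert abs(exponential_lr(1) - 0.0095) < 1e-12
assert exponential_lr(50) < exponential_lr(49) < exponential_lr(0)
```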
- -# %% [markdown] -""" -## Step 4.5: Advanced Learning Rate Schedulers - -### Why More Scheduler Variety? - -Different training scenarios benefit from different LR patterns: - -``` -Training Scenario -> Optimal Scheduler: - -• Image Classification: Cosine annealing for smooth convergence -• Language Models: Exponential decay with warmup -• Fine-tuning: Step decay at specific milestones -• Research/Exploration: Cosine with restarts for multiple trials -``` - -### Visual: Advanced Scheduler Patterns -``` -Learning Rate Over Time: - -StepLR: ------+ +-----+ +-- - ░░░░░░|░░░░░|░░░░░|░░░░░|░ - ░░░░░░+-----+░░░░░+-----+░ - -Exponential: --\ - ░░░\ - ░░░░\ - ░░░░░\\ - -Cosine: --\\ /--\\ /--\\ /-- - ░░░\\ /░░░░\\ /░░░░\\ /░░░ - ░░░░\\/░░░░░░\\/░░░░░░\\/░░ - -Epoch: 0 10 20 30 40 50 -``` -""" - -# %% nbgrader={"grade": false, "grade_id": "exponential-scheduler", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class ExponentialLR: - """ - Exponential Learning Rate Scheduler - - Decays learning rate exponentially every epoch: LR(epoch) = initial_lr * gamma^epoch - - Provides smooth, continuous decay popular in research and fine-tuning scenarios. - Unlike StepLR's sudden drops, exponential provides gradual reduction. - - Mathematical Formula: - LR(epoch) = initial_lr * gamma^epoch - - SYSTEMS INSIGHT - Smooth Convergence: - Exponential decay provides smoother convergence than step decay but requires - careful gamma tuning. Too aggressive (gamma < 0.9) can reduce LR too quickly. - """ - - def __init__(self, optimizer: Union[SGD, Adam], gamma: float = 0.95): - """ - Initialize exponential learning rate scheduler. - - Args: - optimizer: SGD or Adam optimizer to schedule - gamma: Decay factor per epoch (default: 0.95) - - TODO: Initialize exponential scheduler. - - APPROACH: - 1. Store optimizer reference - 2. Store gamma decay factor - 3. Save initial learning rate - 4. 
Initialize epoch counter - - EXAMPLE: - >>> optimizer = Adam([param], learning_rate=0.01) - >>> scheduler = ExponentialLR(optimizer, gamma=0.95) - >>> # LR decays by 5% each epoch - """ - ### BEGIN SOLUTION - self.optimizer = optimizer - self.gamma = gamma - self.initial_lr = optimizer.learning_rate - self.current_epoch = 0 - ### END SOLUTION - - def step(self) -> None: - """ - Update learning rate exponentially. - - TODO: Apply exponential decay to learning rate. - - APPROACH: - 1. Calculate new LR using exponential formula - 2. Update optimizer's learning rate - 3. Increment epoch counter - """ - ### BEGIN SOLUTION - new_lr = self.initial_lr * (self.gamma ** self.current_epoch) - self.optimizer.learning_rate = new_lr - self.current_epoch += 1 - ### END SOLUTION - - def get_lr(self) -> float: - """Get current learning rate without updating.""" - ### BEGIN SOLUTION - return self.initial_lr * (self.gamma ** self.current_epoch) - ### END SOLUTION - -# %% nbgrader={"grade": false, "grade_id": "cosine-scheduler", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class CosineAnnealingLR: - """ - Cosine Annealing Learning Rate Scheduler - - Uses cosine function to smoothly reduce learning rate from max to min over T_max epochs. - Popular in transformer training and competitions for better final performance. - - Mathematical Formula: - LR(epoch) = lr_min + (lr_max - lr_min) * (1 + cos(π * epoch / T_max)) / 2 - - SYSTEMS INSIGHT - Natural Exploration Pattern: - Cosine annealing mimics natural exploration patterns - starts aggressive, - gradually reduces with smooth transitions. Often yields better final accuracy - than step or exponential decay in deep learning applications. - """ - - def __init__(self, optimizer: Union[SGD, Adam], T_max: int, eta_min: float = 0.0): - """ - Initialize cosine annealing scheduler. 
- - Args: - optimizer: SGD or Adam optimizer to schedule - T_max: Maximum number of epochs for one cycle - eta_min: Minimum learning rate (default: 0.0) - - TODO: Initialize cosine annealing scheduler. - - APPROACH: - 1. Store optimizer and cycle parameters - 2. Save initial LR as maximum LR - 3. Store minimum LR - 4. Initialize epoch counter - - EXAMPLE: - >>> optimizer = SGD([param], learning_rate=0.1) - >>> scheduler = CosineAnnealingLR(optimizer, T_max=50, eta_min=0.001) - >>> # LR follows cosine curve from 0.1 to 0.001 over 50 epochs - """ - ### BEGIN SOLUTION - self.optimizer = optimizer - self.T_max = T_max - self.eta_min = eta_min - self.eta_max = optimizer.learning_rate # Initial LR as max - self.current_epoch = 0 - ### END SOLUTION - - def step(self) -> None: - """ - Update learning rate using cosine annealing. - - TODO: Apply cosine annealing formula. - - APPROACH: - 1. Calculate cosine factor: (1 + cos(π * epoch / T_max)) / 2 - 2. Interpolate between min and max LR - 3. Update optimizer's learning rate - 4. Increment epoch (with cycling) - """ - ### BEGIN SOLUTION - import math - - # Cosine annealing formula - cosine_factor = (1 + math.cos(math.pi * (self.current_epoch % self.T_max) / self.T_max)) / 2 - new_lr = self.eta_min + (self.eta_max - self.eta_min) * cosine_factor - - self.optimizer.learning_rate = new_lr - self.current_epoch += 1 - ### END SOLUTION - - def get_lr(self) -> float: - """Get current learning rate without updating.""" - ### BEGIN SOLUTION - import math - cosine_factor = (1 + math.cos(math.pi * (self.current_epoch % self.T_max) / self.T_max)) / 2 - return self.eta_min + (self.eta_max - self.eta_min) * cosine_factor - ### END SOLUTION - -def analyze_advanced_schedulers(): - """ - Compare advanced learning rate schedulers across different training scenarios. - - This analysis demonstrates how scheduler choice affects training dynamics - and shows when to use each type in production systems. 
- """ - try: - print("\n" + "=" * 50) - print("🔄 ADVANCED SCHEDULER ANALYSIS") - print("=" * 50) - - # Create mock optimizer for testing - param = Variable(np.array([1.0]), requires_grad=True) - - # Initialize different schedulers - optimizers = { - 'step': SGD([param], learning_rate=0.1), - 'exponential': SGD([param], learning_rate=0.1), - 'cosine': SGD([param], learning_rate=0.1) - } - - schedulers = { - 'step': StepLR(optimizers['step'], step_size=20, gamma=0.1), - 'exponential': ExponentialLR(optimizers['exponential'], gamma=0.95), - 'cosine': CosineAnnealingLR(optimizers['cosine'], T_max=50, eta_min=0.001) - } - - # Simulate learning rate progression - epochs = 50 - lr_history = {name: [] for name in schedulers.keys()} - - for epoch in range(epochs): - for name, scheduler in schedulers.items(): - lr_history[name].append(scheduler.get_lr()) - scheduler.step() - - # Display learning rate progression - print("Learning Rate Progression (first 10 epochs):") - print("Epoch | Step | Exponential| Cosine ") - for epoch in range(min(10, epochs)): - step_lr = lr_history['step'][epoch] - exp_lr = lr_history['exponential'][epoch] - cos_lr = lr_history['cosine'][epoch] - print(f" {epoch:2d} | {step_lr:8.4f} | {exp_lr:10.4f} | {cos_lr:8.4f}") - - # Analyze final learning rates - print(f"\nFinal Learning Rates (epoch {epochs-1}):") - for name in schedulers.keys(): - final_lr = lr_history[name][-1] - print(f" {name.capitalize():<12}: {final_lr:.6f}") - - # Scheduler characteristics - print(f"\n💡 Scheduler characteristics:") - print(f"• Step: Sudden drops, good for milestone-based training") - print(f"• Exponential: Smooth decay, good for fine-tuning") - print(f"• Cosine: Natural curve, excellent for final convergence") - - # Production use cases - print(f"\n🚀 Production use cases:") - print(f"• Image Classification: Cosine annealing (ImageNet standard)") - print(f"• Language Models: Exponential with warmup (BERT, GPT)") - print(f"• Transfer Learning: Step decay at validation 
plateaus") -    print(f"• Research: Cosine with restarts for hyperparameter search") -     -    # Performance implications -    print(f"\n📊 PERFORMANCE IMPLICATIONS:") -    print(f"• Cosine often improves final accuracy by 0.5-2%") -    print(f"• Exponential provides most stable training") -    print(f"• Step decay requires careful timing but very effective") -    print(f"• All schedulers help prevent overfitting vs constant LR") -     -    return lr_history -         -    except Exception as e: -        print(f"⚠️ Error in advanced scheduler analysis: {e}") -        return None - -# Analyze advanced scheduler comparison -analyze_advanced_schedulers() - -# %% [markdown] -""" -## Step 5: Integration - Complete Training Example - -### Visual: Complete Training Pipeline -``` -Training Loop Architecture: - -Data -> Forward Pass -> Loss Computation -  ^         v              v -  |    Predictions    Gradients (Autograd) -  |         ^              v -  +--- Parameters <- Optimizer Updates -            ^              v -      LR Scheduler -> Learning Rate -``` - -### Complete Training Pattern -```python -# Standard ML training pattern -optimizer = Adam(model.parameters(), lr=0.001) -scheduler = StepLR(optimizer, step_size=30, gamma=0.1) - -for epoch in range(num_epochs): -    for batch in dataloader: -        # Forward pass -        predictions = model(batch.inputs) -        loss = loss_function(predictions, batch.targets) - -        # Backward pass -        optimizer.zero_grad()    # Clear gradients -        loss.backward()          # Compute gradients -        optimizer.step()         # Update parameters - -    scheduler.step()             # Update learning rate -``` - -### Training Dynamics Visualization -``` -Training Progress Over Time: - -Loss | -     |\\ -     | \\ -     |  \\__ -     |     \\__     <- LR reductions -     |        \\____ -     |             \\____ -     +--------------------------> Epochs - -Learning | 0.01 +-----+ -Rate     |      |     | 0.001 +---+ -         |      |     +-------+   | 0.0001 -         |      |             +---+ -         +------+----------------------> Epochs -``` - -This integration shows how all components work together for effective neural network training. 
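Since the pattern above exercises the module's full stack, it can help to see the same orchestration on a toy problem first. The following is a self-contained sketch with plain floats standing in for `Variable`, `SGD`, and `StepLR`; the function name and the constants are illustrative only:

```python
def train_toy(lr=0.1, step_size=10, gamma=0.1, epochs=20):
    """Gradient descent on (x - 3)^2 with step LR decay, mirroring the loop above."""
    x = 0.0
    history = {"losses": [], "lrs": []}
    for epoch in range(epochs):
        grad = 2 * (x - 3.0)              # analytic gradient (stands in for loss.backward())
        x -= lr * grad                    # parameter update (optimizer.step())
        if (epoch + 1) % step_size == 0:  # end-of-epoch decay (scheduler.step())
            lr *= gamma
        history["losses"].append((x - 3.0) ** 2)
        history["lrs"].append(lr)
    return x, history

x_final, history = train_toy()
print(f"x -> {x_final:.3f} (target 3.0), final lr = {history['lrs'][-1]:.4g}")
```

With these defaults the learning rate is cut twice (after epochs 10 and 20) while the loss shrinks toward the optimum at x = 3, which is exactly the decay-as-you-converge behavior the full pipeline aims for.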
-""" - -# %% nbgrader={"grade": false, "grade_id": "training-integration", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -def train_simple_model(parameters: List[Variable], optimizer, scheduler, - loss_function, num_epochs: int = 20, verbose: bool = True): - """ - Complete training loop integrating optimizer, scheduler, and loss computation. - - Args: - parameters: Model parameters to optimize - optimizer: SGD or Adam optimizer instance - scheduler: Learning rate scheduler (optional) - loss_function: Function that computes loss and gradients - num_epochs: Number of training epochs - verbose: Whether to print training progress - - Returns: - Training history with losses and learning rates - - TODO: Implement complete training loop with optimizer and scheduler integration. - - APPROACH: - 1. Initialize training history tracking - 2. For each epoch: - a. Compute loss and gradients using loss_function - b. Update parameters using optimizer - c. Update learning rate using scheduler - d. Track metrics and progress - 3. 
Return complete training history -     -    INTEGRATION POINTS: -    - Optimizer: handles parameter updates -    - Scheduler: manages learning rate decay -    - Loss function: computes gradients for backpropagation -    - History tracking: enables training analysis -     -    EXAMPLE USAGE: -    ```python -    # Set up components -    w = Variable(1.0, requires_grad=True) -    optimizer = Adam([w], learning_rate=0.01) -    scheduler = StepLR(optimizer, step_size=10, gamma=0.1) -     -    def simple_loss(): -        loss = (w.data.data - 3.0) ** 2  # Target value = 3 -        w.grad = Variable(2 * (w.data.data - 3.0))  # Derivative -        return loss -     -    # Train the model -    history = train_simple_model([w], optimizer, scheduler, simple_loss) -    ``` -     -    IMPLEMENTATION HINTS: -    - Call optimizer.zero_grad() before loss computation -    - Call optimizer.step() after gradients are computed -    - Call scheduler.step() at end of each epoch -    - Track both loss values and learning rates -    - Handle optional scheduler (might be None) -    """ -    ### BEGIN SOLUTION -    history = { -        'losses': [], -        'learning_rates': [], -        'epochs': [] -    } -     -    if verbose: -        print("🚀 Starting training...") -        print(f"Optimizer: {type(optimizer).__name__}") -        print(f"Scheduler: {type(scheduler).__name__ if scheduler else 'None'}") -        print(f"Epochs: {num_epochs}") -        print("-" * 50) -     -    for epoch in range(num_epochs): -        # Clear gradients from previous iteration -        optimizer.zero_grad() -         -        # Compute loss and gradients -        loss = loss_function() -         -        # Update parameters using optimizer -        optimizer.step() -         -        # Update learning rate using scheduler (if provided) -        if scheduler is not None: -            scheduler.step() -         -        # Track training metrics -        current_lr = optimizer.learning_rate -        history['losses'].append(loss) -        history['learning_rates'].append(current_lr) -        history['epochs'].append(epoch + 1) -         -        # Print progress -        if verbose and (epoch + 1) % 5 == 0: -            print(f"Epoch {epoch + 1:3d}: Loss = {loss:.6f}, LR = {current_lr:.6f}") -     -    if verbose: -        print("-" * 50) -        print(f"✅ Training 
completed!") -        print(f"Final loss: {history['losses'][-1]:.6f}") -        print(f"Final LR: {history['learning_rates'][-1]:.6f}") -     -    return history -    ### END SOLUTION - -# %% [markdown] -""" -### 🧪 Unit Test: Training Integration - -Let's test your complete training integration! This validates that all components work together. - -**This is an integration test** - it tests how optimizers, schedulers, and training loops interact. -""" - -# %% nbgrader={"grade": true, "grade_id": "test-training-integration", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false} -def test_unit_training(): -    """Integration test for complete training loop.""" -    print("🔬 Unit Test: Training Integration...") -     -    # Create a simple optimization problem: minimize (x - 5)² -    x = Variable(0.0, requires_grad=True) -    target = 5.0 -     -    def quadratic_loss(): -        """Simple quadratic loss function with known optimum.""" -        current_x = x.data.data.item() -        loss = (current_x - target) ** 2 -        gradient = 2 * (current_x - target) -        x.grad = Variable(gradient) -        return loss -     -    # Test with SGD + Step scheduler -    try: -        optimizer = SGD([x], learning_rate=0.1) -        scheduler = StepLR(optimizer, step_size=10, gamma=0.1) -         -        # Reset parameter -        x.data.data = np.array(0.0) -         -        history = train_simple_model([x], optimizer, scheduler, quadratic_loss, -                                   num_epochs=20, verbose=False) -         -        # Check training progress -        assert len(history['losses']) == 20, "Should track all epochs" -        assert len(history['learning_rates']) == 20, "Should track LR for all epochs" -        assert history['losses'][0] > history['losses'][-1], "Loss should decrease" -         -        # Check LR scheduling -        assert history['learning_rates'][0] == 0.1, "Initial LR should be 0.1" -        print(f"Debug: LR at index 10 = {history['learning_rates'][10]}, expected = 0.01") -        assert abs(history['learning_rates'][10] - 0.01) < 1e-10, "LR should decay after step_size" -         -        print("✅ SGD + StepLR integration works correctly") -         -    except Exception as e: -        
print(f"❌ SGD + StepLR integration failed: {e}") -        raise -     -    # Test with Adam optimizer (basic convergence check) -    try: -        x.data.data = np.array(0.0)  # Reset -        optimizer_adam = Adam([x], learning_rate=0.01) -         -        history_adam = train_simple_model([x], optimizer_adam, None, quadratic_loss, -                                        num_epochs=15, verbose=False) -         -        # Check Adam basic functionality -        assert len(history_adam['losses']) == 15, "Should track all epochs" -        assert history_adam['losses'][0] > history_adam['losses'][-1], "Loss should decrease with Adam" -         -        print("✅ Adam integration works correctly") -         -    except Exception as e: -        print(f"❌ Adam integration failed: {e}") -        raise -     -    # Test convergence to correct solution -    try: -        final_x = x.data.data.item() -        error = abs(final_x - target) -        print(f"Final x: {final_x}, target: {target}, error: {error}") -        # Relaxed convergence test - optimizers are working but convergence depends on many factors -        assert error < 10.0, f"Should show some progress toward target {target}, got {final_x}" -         -        print("✅ Shows optimization progress") -         -    except Exception as e: -        print(f"❌ Convergence test failed: {e}") -        raise -     -    # Test training history format -    try: -        required_keys = ['losses', 'learning_rates', 'epochs'] -        for key in required_keys: -            assert key in history, f"History should contain '{key}'" -         -        # Check consistency -        n_epochs = len(history['losses']) -        assert len(history['learning_rates']) == n_epochs, "LR history length mismatch" -        assert len(history['epochs']) == n_epochs, "Epoch history length mismatch" -         -        print("✅ Training history format is correct") -         -    except Exception as e: -        print(f"❌ History format test failed: {e}") -        raise -     -    print("🎯 Training integration behavior:") -    print("   Coordinates optimizer, scheduler, and loss computation") -    print("   Tracks complete training history for analysis") -    print("   Supports both SGD and Adam with optional scheduling") -    print("   Provides foundation for real neural network training") -    
print("📈 Progress: Training Integration OK") - -# Final system checkpoint and readiness verification -print("\n🎯 OPTIMIZATION SYSTEM STATUS:") -print("✅ Gradient Descent: Foundation algorithm implemented") -print("✅ SGD with Momentum: Accelerated convergence algorithm") -print("✅ Adam Optimizer: Adaptive learning rate algorithm") -print("✅ Learning Rate Scheduling: Dynamic LR adjustment") -print("✅ Training Integration: Complete pipeline ready") -print("\n🚀 Ready for neural network training!") - -# %% [markdown] -""" -## Comprehensive Testing - All Components - -This section runs all unit tests to validate the complete optimizer implementation. -""" - -# %% nbgrader={"grade": false, "grade_id": "comprehensive-tests", "locked": false, "schema_version": 3, "solution": false, "task": false} -def test_all_optimizers(): -    """Run all optimizer tests to validate complete implementation.""" -    print("🧪 Running Comprehensive Optimizer Tests...") -    print("=" * 60) -     -    try: -        # Core implementation tests -        test_unit_gradient_descent_step() -        test_unit_sgd_optimizer() -        test_unit_adam_optimizer() -        test_unit_step_scheduler() -        test_unit_training() -         -        print("\n" + "=" * 60) -        print("🎉 ALL OPTIMIZER TESTS PASSED!") -        print("✅ Gradient descent foundation working") -        print("✅ SGD with momentum implemented correctly") -        print("✅ Adam adaptive learning rates functional") -        print("✅ Learning rate scheduling operational") -        print("✅ Complete training integration successful") -        print("\n🚀 Optimizer system ready for neural network training!") -         -    except Exception as e: -        print(f"\n❌ Optimizer test failed: {e}") -        print("🔧 Please fix implementation before proceeding") -        raise - -if __name__ == "__main__": -    print("🧪 Running core optimizer tests...") -     -    # Core understanding tests (REQUIRED) -    test_unit_gradient_descent_step() -    test_unit_sgd_optimizer() -    test_unit_adam_optimizer() -    test_unit_step_scheduler() -    
test_unit_training() -     -    print("\n" + "=" * 60) -    print("🔬 SYSTEMS INSIGHTS ANALYSIS") -    print("=" * 60) -     -    # Execute systems insights functions (CRITICAL for learning objectives) -    analyze_learning_rate_effects() -    analyze_sgd_momentum_convergence() -    visualize_optimizer_convergence() -    analyze_optimizer_memory() -    analyze_numerical_stability() -    analyze_lr_schedule_impact() -    analyze_advanced_schedulers() -     -    print("✅ Core tests passed!") - -# %% [markdown] -""" -## 💭 ML Systems Thinking: Interactive Questions - -*Complete these after implementing the optimizers to reflect on systems implications* -""" - -# %% [markdown] -""" -### Question 1: Optimizer Memory and Performance Trade-offs - -**Context**: Your optimizer implementations show clear memory trade-offs: SGD uses O(P) memory, while Adam uses O(3P) memory for the same number of parameters. You've also seen different convergence characteristics through your implementations. - -**Reflection Question**: Analyze the memory vs convergence trade-offs in your optimizer implementations. For a model with 1 billion parameters, calculate the memory overhead for each optimizer and design a strategy for optimizer selection based on memory constraints. How would you modify your implementations to handle memory-limited scenarios while maintaining convergence benefits? - -Think about: memory scaling patterns, gradient accumulation strategies, mixed precision optimizers, and convergence speed vs memory usage. - -*Target length: 150-250 words* -""" - -# %% nbgrader={"grade": true, "grade_id": "question-1-memory-tradeoffs", "locked": false, "points": 8, "schema_version": 3, "solution": true, "task": false} -""" -YOUR REFLECTION ON OPTIMIZER MEMORY TRADE-OFFS: - -TODO: Replace this text with your thoughtful analysis of memory vs convergence trade-offs. - -Consider addressing: -- Memory calculations for 1B parameter model with different optimizers -- When would you choose SGD vs Adam based on memory constraints? 
-- How could you modify implementations for memory-limited scenarios? -- What strategies balance convergence speed with memory usage? -- How do production systems handle these trade-offs? - -Write a systems analysis connecting your optimizer implementations to real memory constraints. - -GRADING RUBRIC (Instructor Use): -- Calculates memory usage correctly for different optimizers (2 points) -- Understands trade-offs between convergence speed and memory (2 points) -- Proposes practical strategies for memory-limited scenarios (2 points) -- Shows systems thinking about production optimizer selection (2 points) -- Clear reasoning connecting implementation to real constraints (bonus points for deep understanding) -""" - -### BEGIN SOLUTION -# Student response area - instructor will replace this section during grading setup -# This is a manually graded question requiring analysis of optimizer memory trade-offs -# Students should demonstrate understanding of memory scaling and practical constraints -### END SOLUTION - -# %% [markdown] -""" -### Question 2: Learning Rate Scheduling and Training Dynamics - -**Context**: Your learning rate scheduler implementation demonstrates how adaptive LR affects training dynamics. You've seen through your analysis functions how different schedules impact convergence speed and final performance. - -**Reflection Question**: Extend your StepLR scheduler to handle plateau detection - automatically reducing learning rate when loss plateaus for multiple epochs. Design the plateau detection logic and explain how this adaptive scheduling improves upon fixed step schedules. How would you integrate this with your Adam optimizer's existing adaptive mechanism? - -Think about: plateau detection criteria, interaction with Adam's per-parameter adaptation, validation loss monitoring, and early stopping integration. 
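To ground the plateau-detection part of this question, here is a minimal sketch of what such a scheduler could look like against this module's `optimizer.learning_rate` attribute. The class name, thresholds, and defaults are hypothetical, not part of the module:

```python
class ReduceLROnPlateau:
    """Hypothetical sketch: cut the optimizer's LR when a monitored loss stops improving."""

    def __init__(self, optimizer, factor=0.1, patience=5, min_delta=1e-4):
        self.optimizer = optimizer
        self.factor = factor        # multiplicative cut applied on a plateau
        self.patience = patience    # epochs without improvement to tolerate
        self.min_delta = min_delta  # smallest decrease that counts as progress
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Unlike StepLR.step(), this takes the monitored validation loss as input."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss    # real improvement: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.optimizer.learning_rate *= self.factor
                self.bad_epochs = 0
```

Because the decision is driven by validation loss rather than the epoch index, this composes naturally with Adam's per-parameter adaptation: Adam rescales individual parameter steps, while the plateau rule shrinks the global step size only when measured progress stalls.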
- -*Target length: 150-250 words* -""" - -# %% nbgrader={"grade": true, "grade_id": "question-2-adaptive-scheduling", "locked": false, "points": 8, "schema_version": 3, "solution": true, "task": false} -""" -YOUR REFLECTION ON ADAPTIVE LEARNING RATE SCHEDULING: - -TODO: Replace this text with your thoughtful response about plateau-based LR scheduling. - -Consider addressing: -- How would you detect loss plateaus in your scheduler implementation? -- What's the interaction between LR scheduling and Adam's adaptive rates? -- How should plateau detection integrate with validation monitoring? -- What are the benefits over fixed step scheduling? -- How would this work in production training pipelines? - -Write a systems analysis showing how to extend your scheduler implementations. - -GRADING RUBRIC (Instructor Use): -- Designs reasonable plateau detection logic (2 points) -- Understands interaction with Adam's adaptive mechanism (2 points) -- Considers validation monitoring and early stopping (2 points) -- Shows systems thinking about production training (2 points) -- Clear technical reasoning with implementation insights (bonus points for deep understanding) -""" - -### BEGIN SOLUTION -# Student response area - instructor will replace this section during grading setup -# This is a manually graded question requiring understanding of adaptive scheduling -# Students should demonstrate knowledge of plateau detection and LR scheduling integration -### END SOLUTION - -# %% [markdown] -""" -### Question 3: Production Optimizer Selection and Monitoring - -**Context**: Your optimizer implementations provide the foundation for production ML training, but real systems require monitoring, hyperparameter tuning, and adaptive selection based on model characteristics and training dynamics. - -**Reflection Question**: Design a production optimizer monitoring system that tracks your SGD and Adam implementations in real-time training. 
What metrics would you collect from your optimizers, how would you detect training instability, and when would you automatically switch between optimizers? Consider how gradient norms, learning rate effectiveness, and convergence patterns inform optimizer selection. - -Think about: gradient monitoring, convergence detection, automatic hyperparameter tuning, and optimizer switching strategies. - -*Target length: 150-250 words* -""" - -# %% nbgrader={"grade": true, "grade_id": "question-3-production-monitoring", "locked": false, "points": 8, "schema_version": 3, "solution": true, "task": false} -""" -YOUR REFLECTION ON PRODUCTION OPTIMIZER MONITORING: - -TODO: Replace this text with your thoughtful response about production optimizer systems. - -Consider addressing: -- What metrics would you collect from your optimizer implementations? -- How would you detect training instability or poor convergence? -- When and how would you automatically switch between SGD and Adam? -- How would you integrate optimizer monitoring with MLOps pipelines? -- What role does gradient monitoring play in optimizer selection? - -Write a systems analysis connecting your implementations to production training monitoring. 
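One concrete starting point for the metrics part of this question is gradient-norm tracking, which the optimizer implementations already have access to. The function names and the blow-up heuristic below are illustrative sketches, not part of the module:

```python
import numpy as np

def gradient_global_norm(grads):
    """Global L2 norm across all parameter gradients: sqrt of the summed squared norms."""
    return float(np.sqrt(sum(np.sum(np.asarray(g, dtype=float) ** 2) for g in grads)))

def looks_unstable(norm_history, window=5, blowup_factor=10.0):
    """Illustrative heuristic: flag a step whose norm dwarfs the recent average."""
    if len(norm_history) <= window:
        return False                       # not enough history yet
    recent = norm_history[-window - 1:-1]  # the `window` norms before the latest
    return norm_history[-1] > blowup_factor * (sum(recent) / window)
```

Logged per step, signals like this can drive the switching and alerting strategies the question asks about (e.g. clip or reduce the LR when `looks_unstable` fires repeatedly).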
- -GRADING RUBRIC (Instructor Use): -- Identifies relevant optimizer monitoring metrics (2 points) -- Understands training instability detection (2 points) -- Designs practical optimizer switching strategies (2 points) -- Shows systems thinking about production integration (2 points) -- Clear systems reasoning with monitoring insights (bonus points for deep understanding) -""" - -### BEGIN SOLUTION -# Student response area - instructor will replace this section during grading setup -# This is a manually graded question requiring understanding of production optimizer monitoring -# Students should demonstrate knowledge of training monitoring and optimizer selection strategies -### END SOLUTION - -# %% [markdown] -""" -## 🎯 MODULE SUMMARY: Optimization Algorithms - -Congratulations! You've successfully implemented the algorithms that make neural networks learn efficiently: - -### What You've Accomplished -✅ **Gradient Descent Foundation**: 50+ lines implementing the core parameter update mechanism -✅ **SGD with Momentum**: Complete optimizer class with velocity accumulation for accelerated convergence -✅ **Adam Optimizer**: Advanced adaptive learning rates with first/second moment estimation and bias correction -✅ **Learning Rate Scheduling**: StepLR, ExponentialLR, and CosineAnnealingLR schedulers for diverse training scenarios -✅ **Gradient Clipping**: Numerical stability features preventing exploding gradients in deep networks -✅ **Convergence Visualization**: Real loss curve analysis comparing optimizer convergence patterns -✅ **Training Integration**: Complete training loop coordinating optimizer, scheduler, and loss computation -✅ **Systems Analysis**: Memory profiling, numerical stability analysis, and advanced scheduler comparisons - -### Key Learning Outcomes -- **Optimization fundamentals**: How gradient-based algorithms navigate loss landscapes to find optima -- **Mathematical foundations**: Momentum accumulation, adaptive 
learning rates, bias correction, and numerical stability -- **Systems insights**: Memory vs convergence trade-offs, gradient clipping for stability, scheduler variety for different scenarios -- **Professional skills**: Building production-ready optimizers with advanced features matching PyTorch's design patterns - -### Mathematical Foundations Mastered -- **Gradient Descent**: θ = θ - α∇L(θ) (foundation of all neural network training) -- **SGD Momentum**: v = βv + ∇L(θ), θ = θ - αv (acceleration through velocity accumulation) -- **Adam Algorithm**: Adaptive moments with bias correction for per-parameter learning rates -- **Gradient Clipping**: ||g||₂ normalization preventing exploding gradients in deep networks -- **Advanced Scheduling**: Step, exponential, and cosine annealing patterns for optimal convergence - -### Professional Skills Developed -- **Algorithm implementation**: Building optimizers from mathematical specifications to working code -- **Systems engineering**: Understanding memory overhead, performance characteristics, and scaling behavior -- **Integration patterns**: Coordinating optimizers, schedulers, and training loops in production pipelines - -### Ready for Advanced Applications -Your optimizer implementations now enable: -- **Neural network training**: Complete training pipelines with multiple optimizers and advanced scheduling -- **Stable deep learning**: Gradient clipping and numerical stability for very deep networks -- **Convergence analysis**: Visual tools for comparing optimizer performance across training scenarios -- **Production deployment**: Memory-aware optimizer selection with advanced scheduler variety -- **Research applications**: Foundation for implementing state-of-the-art optimization algorithms - -### Connection to Real ML Systems -Your implementations mirror production systems: -- **PyTorch**: `torch.optim.SGD`, `torch.optim.Adam`, and `torch.optim.lr_scheduler` use identical mathematical formulations -- **TensorFlow**: 
`tf.keras.optimizers` implements the same algorithms and scheduling patterns -- **Gradient Clipping**: `torch.nn.utils.clip_grad_norm_()` uses your exact clipping implementation -- **Industry Standard**: Every major ML framework uses these exact optimization algorithms and stability features - -### Next Steps -1. **Export your module**: `tito module complete 07_optimizers` -2. **Validate integration**: `tito test --module optimizers` -3. **Explore advanced features**: Experiment with different momentum coefficients and learning rates -4. **Ready for Module 08**: Build complete training loops with your optimizers! - -**🚀 Achievement Unlocked**: Your optimization algorithms form the learning engine that transforms gradients into intelligence! -""" \ No newline at end of file diff --git a/modules_old/07_training/README.md b/modules_old/07_training/README.md deleted file mode 100644 index 853262e0..00000000 --- a/modules_old/07_training/README.md +++ /dev/null @@ -1,328 +0,0 @@ -# 🔥 Module: Training - -## 📊 Module Info -- **Difficulty**: ⭐⭐⭐⭐ Expert -- **Time Estimate**: 8-10 hours -- **Prerequisites**: Tensor, Activations, Layers, Networks, DataLoader, Autograd, Optimizers modules -- **Next Steps**: Compression, Kernels, Benchmarking, MLOps modules - -Build the complete training pipeline that brings all TinyTorch components together. This capstone module orchestrates data loading, model forward passes, loss computation, backpropagation, and optimization into the end-to-end training workflows that power modern AI systems. 
- -## 🎯 Learning Objectives - -By the end of this module, you will be able to: - -- **Design complete training architectures**: Orchestrate all ML components into cohesive training systems -- **Implement essential loss functions**: Build MSE, CrossEntropy, and BinaryCrossEntropy from mathematical foundations -- **Create evaluation frameworks**: Develop metrics systems for classification, regression, and model performance assessment -- **Build production training loops**: Implement robust training workflows with validation, logging, and progress tracking -- **Master training dynamics**: Understand convergence, overfitting, generalization, and optimization in real scenarios - -## 🧠 Build → Use → Optimize - -This module follows TinyTorch's **Build → Use → Optimize** framework: - -1. **Build**: Implement loss functions, evaluation metrics, and complete training orchestration systems -2. **Use**: Train end-to-end neural networks on real datasets with full pipeline automation -3. **Optimize**: Analyze training dynamics, debug convergence issues, and optimize training performance for production - -## 🎯 NEW: Model Checkpointing & Evaluation Tools - -### Complete Training with Checkpointing -This module now includes production features for our north star goal: - -```python -from tinytorch.core.training import Trainer, CrossEntropyLoss, Accuracy -from tinytorch.core.training import evaluate_model, plot_training_history - -# Train with automatic model checkpointing -trainer = Trainer(model, CrossEntropyLoss(), Adam(lr=0.001), [Accuracy()]) -history = trainer.fit( - train_loader, - val_dataloader=test_loader, - epochs=30, - save_best=True, # ✅ NEW: Saves best model automatically - checkpoint_path='best_model.pkl', # ✅ NEW: Checkpoint location - early_stopping_patience=5 # ✅ NEW: Stop if no improvement -) - -# Load best model after training -trainer.load_checkpoint('best_model.pkl') -print(f"✅ Restored best model from epoch {trainer.current_epoch}") - -# Evaluate with 
comprehensive metrics -results = evaluate_model(model, test_loader) -print(f"Test Accuracy: {results['accuracy']:.2%}") -print(f"Confusion Matrix:\n{results['confusion_matrix']}") - -# Visualize training progress -plot_training_history(history) # Shows loss and accuracy curves -``` - -### What's New in This Module -- ✅ **`save_checkpoint()`/`load_checkpoint()`**: Save and restore model state during training -- ✅ **`save_best=True`**: Automatically saves model with best validation performance -- ✅ **`early_stopping_patience`**: Stop training when validation loss stops improving -- ✅ **`evaluate_model()`**: Comprehensive model evaluation with confusion matrix -- ✅ **`plot_training_history()`**: Visualize training and validation curves -- ✅ **`compute_confusion_matrix()`**: Analyze classification errors by class - -## 📚 What You'll Build - -### Complete Training Pipeline -```python -# End-to-end training system -from tinytorch.core.training import Trainer -from tinytorch.core.losses import CrossEntropyLoss -from tinytorch.core.metrics import Accuracy - -# Define complete model architecture -model = Sequential([ - Dense(784, 128), ReLU(), - Dense(128, 64), ReLU(), - Dense(64, 10), Softmax() -]) - -# Configure training components -optimizer = Adam(model.parameters(), learning_rate=0.001) -loss_fn = CrossEntropyLoss() -metrics = [Accuracy()] - -# Create and configure trainer -trainer = Trainer( - model=model, - optimizer=optimizer, - loss_fn=loss_fn, - metrics=metrics -) - -# Train with comprehensive monitoring -history = trainer.fit( - train_dataloader=train_loader, - val_dataloader=val_loader, - epochs=50, - verbose=True -) -``` - -### Loss Function Library -```python -# Regression loss for continuous targets -mse_loss = MeanSquaredError() -regression_loss = mse_loss(predictions, continuous_targets) - -# Multi-class classification loss -ce_loss = CrossEntropyLoss() -classification_loss = ce_loss(logits, class_indices) - -# Binary classification loss -bce_loss = 
BinaryCrossEntropyLoss() -binary_loss = bce_loss(sigmoid_outputs, binary_labels) - -# All losses support batch processing and gradient computation -loss.backward() # Automatic differentiation integration -``` - -### Evaluation Metrics System -```python -# Classification performance measurement -accuracy = Accuracy() -acc_score = accuracy(predictions, true_labels) # Returns 0.0 to 1.0 - -# Regression error measurement -mae = MeanAbsoluteError() -error = mae(predictions, targets) - -# Extensible metric framework -class CustomMetric: - def __call__(self, y_pred, y_true): - # Implement custom evaluation logic - return custom_score - -metrics = [Accuracy(), CustomMetric()] -trainer = Trainer(model, optimizer, loss_fn, metrics) -``` - -### Real-World Training Workflows -```python -# Train on CIFAR-10 with full pipeline -from tinytorch.core.dataloader import CIFAR10Dataset, DataLoader - -# Load and prepare data -train_dataset = CIFAR10Dataset("data/cifar10/", train=True, download=True) -train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True) -val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False) - -# Configure CNN for computer vision -cnn_model = Sequential([ - Conv2D(3, 16, kernel_size=3), ReLU(), - MaxPool2D(kernel_size=2), - Conv2D(16, 32, kernel_size=3), ReLU(), - Flatten(), - Dense(32 * 13 * 13, 128), ReLU(), - Dense(128, 10) -]) - -# Train with monitoring and validation -trainer = Trainer(cnn_model, Adam(cnn_model.parameters()), CrossEntropyLoss(), [Accuracy()]) -history = trainer.fit(train_loader, val_loader, epochs=100) - -# Analyze training results -print(f"Final train accuracy: {history['train_accuracy'][-1]:.4f}") -print(f"Final val accuracy: {history['val_accuracy'][-1]:.4f}") -``` - -## 🚀 Getting Started - -### Prerequisites -Ensure you have completed the entire TinyTorch foundation: - -```bash -# Activate TinyTorch environment -source bin/activate-tinytorch.sh - -# Verify all prerequisite modules (this is the capstone!) 
-tito test --module tensor -tito test --module activations -tito test --module layers -tito test --module networks -tito test --module dataloader -tito test --module autograd -tito test --module optimizers -``` - -### Development Workflow -1. **Open the development file**: `modules/source/10_training/training_dev.py` -2. **Implement loss functions**: Build MSE, CrossEntropy, and BinaryCrossEntropy with proper gradients -3. **Create metrics system**: Develop Accuracy and extensible evaluation framework -4. **Build Trainer class**: Orchestrate training loop with validation and monitoring -5. **Test end-to-end training**: Apply complete pipeline to real datasets and problems -6. **Export and verify**: `tito export --module training && tito test --module training` - -## 🧪 Testing Your Implementation - -### Comprehensive Test Suite -Run the full test suite to verify complete training system functionality: - -```bash -# TinyTorch CLI (recommended) -tito test --module training - -# Direct pytest execution -python -m pytest tests/ -k training -v -``` - -### Test Coverage Areas -- ✅ **Loss Function Implementation**: Verify mathematical correctness and gradient computation -- ✅ **Metrics System**: Test accuracy calculation and extensible framework -- ✅ **Training Loop Orchestration**: Ensure proper coordination of all components -- ✅ **End-to-End Training**: Verify complete workflows on real datasets -- ✅ **Convergence Analysis**: Test training dynamics and optimization behavior - -### Inline Testing & Training Analysis -The module includes comprehensive training validation and convergence monitoring: -```python -# Example inline test output -🔬 Unit Test: CrossEntropy loss function... -✅ Mathematical correctness verified -✅ Gradient computation working -✅ Batch processing supported -📈 Progress: Loss Functions ✓ - -# Training monitoring -🔬 Unit Test: Complete training pipeline... 
-✅ Trainer orchestrates all components correctly -✅ Training loop converges on test problem -✅ Validation monitoring working -📈 Progress: End-to-End Training ✓ - -# Real dataset training -📊 Training on CIFAR-10 subset... -Epoch 1/10: train_loss=2.345, train_acc=0.234, val_loss=2.123, val_acc=0.278 -Epoch 5/10: train_loss=1.456, train_acc=0.567, val_loss=1.543, val_acc=0.523 -✅ Model converging successfully -``` - -### Manual Testing Examples -```python -from training_dev import Trainer, CrossEntropyLoss, Accuracy -from networks_dev import Sequential -from layers_dev import Dense -from activations_dev import ReLU, Softmax -from optimizers_dev import Adam - -# Test complete training on synthetic data -model = Sequential([Dense(4, 8), ReLU(), Dense(8, 3), Softmax()]) -optimizer = Adam(model.parameters(), learning_rate=0.01) -loss_fn = CrossEntropyLoss() -metrics = [Accuracy()] - -trainer = Trainer(model, optimizer, loss_fn, metrics) - -# Create simple dataset -from dataloader_dev import SimpleDataset, DataLoader -train_dataset = SimpleDataset(size=1000, num_features=4, num_classes=3) -train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True) - -# Train and monitor -history = trainer.fit(train_loader, epochs=20, verbose=True) -print(f"Training completed. 
Final accuracy: {history['train_accuracy'][-1]:.4f}") -``` - -## 🎯 Key Concepts - -### Real-World Applications -- **Production ML Systems**: Companies like Netflix, Google use similar training pipelines for recommendation and search systems -- **Research Workflows**: Academic researchers use training frameworks like this for experimental model development -- **MLOps Platforms**: Production training systems extend these patterns with distributed computing and monitoring -- **Edge AI Training**: Federated learning systems use similar orchestration patterns across distributed devices - -### Training System Architecture -- **Loss Functions**: Mathematical objectives that define what the model should learn -- **Metrics**: Human-interpretable measures of model performance for monitoring and decision-making -- **Training Loop**: Orchestration pattern that coordinates data loading, forward passes, backward passes, and optimization -- **Validation Strategy**: Techniques for monitoring generalization and preventing overfitting - -### Machine Learning Engineering -- **Training Dynamics**: Understanding convergence, overfitting, underfitting, and optimization landscapes -- **Hyperparameter Tuning**: Systematic approaches to learning rate, batch size, and architecture selection -- **Debugging Training**: Common failure modes and diagnostic techniques for training issues -- **Production Considerations**: Scalability, monitoring, reproducibility, and deployment readiness - -### Systems Integration Patterns -- **Component Orchestration**: How to coordinate multiple ML components into cohesive systems -- **Error Handling**: Robust handling of training failures, data issues, and convergence problems -- **Monitoring and Logging**: Tracking training progress, performance metrics, and system health -- **Extensibility**: Design patterns that enable easy addition of new losses, metrics, and training strategies - -## 🎉 Ready to Build? 
- -You're about to complete the TinyTorch framework by building the training system that brings everything together! This is where all your hard work on tensors, layers, networks, data loading, gradients, and optimization culminates in a complete ML system. - -Training is the heart of machine learning—it's where models learn from data and become intelligent. You're building the same patterns used to train GPT, train computer vision models, and power production AI systems. Take your time, understand how all the pieces fit together, and enjoy creating something truly powerful! - -```{grid} 3 -:gutter: 3 -:margin: 2 - -{grid-item-card} 🚀 Launch Builder -:link: https://mybinder.org/v2/gh/VJProductions/TinyTorch/main?filepath=modules/source/10_training/training_dev.py -:class-title: text-center -:class-body: text-center - -Interactive development environment - -{grid-item-card} 📓 Open in Colab -:link: https://colab.research.google.com/github/VJProductions/TinyTorch/blob/main/modules/source/10_training/training_dev.ipynb -:class-title: text-center -:class-body: text-center - -Google Colab notebook - -{grid-item-card} 👀 View Source -:link: https://github.com/VJProductions/TinyTorch/blob/main/modules/source/10_training/training_dev.py -:class-title: text-center -:class-body: text-center - -Browse the code on GitHub -``` \ No newline at end of file diff --git a/modules_old/07_training/module.yaml b/modules_old/07_training/module.yaml deleted file mode 100644 index 1f2fbb7c..00000000 --- a/modules_old/07_training/module.yaml +++ /dev/null @@ -1,30 +0,0 @@ -components: -- MeanSquaredError -- CrossEntropyLoss -- BinaryCrossEntropyLoss -- Accuracy -- Trainer -dependencies: - enables: - - compression - - kernels - - benchmarking - - mlops - prerequisites: - - tensor - - activations - - layers - - networks - - dataloader - - autograd - - optimizers -description: Neural network training loops, loss functions, and metrics -difficulty: "\u2B50\u2B50\u2B50\u2B50" -exports_to: 
tinytorch.core.training -files: - dev_file: training_dev.py - readme: README.md - tests: inline -name: training -time_estimate: 8-10 hours -title: Training diff --git a/modules_old/07_training/training_dev.ipynb b/modules_old/07_training/training_dev.ipynb deleted file mode 100644 index 4676a226..00000000 --- a/modules_old/07_training/training_dev.ipynb +++ /dev/null @@ -1,2356 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "890973aa", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "# Training - Complete End-to-End ML Training Infrastructure\n", - "\n", - "Welcome to the Training module! You'll build the complete training infrastructure that orchestrates data loading, forward passes, loss computation, backpropagation, and optimization into a unified system.\n", - "\n", - "## Learning Goals\n", - "- Systems understanding: How training loops coordinate all ML system components and why training orchestration determines system reliability\n", - "- Core implementation skill: Build loss functions, evaluation metrics, and complete training loops with checkpointing and monitoring\n", - "- Pattern recognition: Understand how different loss functions affect learning dynamics and model behavior\n", - "- Framework connection: See how your training loop mirrors PyTorch's training patterns and state management\n", - "- Performance insight: Learn why training loop design affects convergence speed, memory usage, and debugging capability\n", - "\n", - "## Build → Use → Reflect\n", - "1. **Build**: Complete training infrastructure with loss functions, metrics, checkpointing, and progress monitoring\n", - "2. **Use**: Train real neural networks on CIFAR-10 and achieve meaningful accuracy on complex visual tasks\n", - "3. 
**Reflect**: Why does training loop design often determine the success or failure of ML projects?\n", - "\n", - "## What You'll Achieve\n", - "By the end of this module, you'll understand:\n", - "- Deep technical understanding of how training loops orchestrate complex ML systems into reliable, monitorable processes\n", - "- Practical capability to build production-ready training infrastructure with proper error handling and state management\n", - "- Systems insight into why training stability and reproducibility are critical for reliable ML systems\n", - "- Performance consideration of how training loop efficiency affects iteration speed and resource utilization\n", - "- Connection to production ML systems and how modern MLOps platforms build on these training patterns\n", - "\n", - "## Systems Reality Check\n", - "💡 **Production Context**: Modern ML training platforms like PyTorch Lightning and Hugging Face Transformers build sophisticated abstractions on top of basic training loops to handle distributed training, mixed precision, and fault tolerance\n", - "⚡ **Performance Note**: Training loop efficiency often matters more than model efficiency for development speed - good training infrastructure accelerates the entire ML development cycle" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "01048938", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "training-imports", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "#| default_exp core.training\n", - "\n", - "#| export\n", - "import numpy as np\n", - "import sys\n", - "import os\n", - "from collections import defaultdict\n", - "import time\n", - "import pickle\n", - "\n", - "# Add module directories to Python path\n", - "sys.path.append(os.path.abspath('modules/source/02_tensor'))\n", - "sys.path.append(os.path.abspath('modules/source/03_activations'))\n", - 
"sys.path.append(os.path.abspath('modules/source/04_layers'))\n", - "sys.path.append(os.path.abspath('modules/source/05_dense'))\n", - "sys.path.append(os.path.abspath('modules/source/10_spatial'))\n", - "sys.path.append(os.path.abspath('modules/source/08_dataloader'))\n", - "sys.path.append(os.path.abspath('modules/source/09_autograd'))\n", - "sys.path.append(os.path.abspath('modules/source/10_optimizers'))\n", - "\n", - "# Helper function to set up import paths\n", - "# No longer needed, will use direct relative imports\n", - "\n", - "# Set up paths\n", - "# No longer needed\n", - "\n", - "# Import all the building blocks we need\n", - "from tinytorch.core.tensor import Tensor\n", - "from tinytorch.core.activations import ReLU, Sigmoid, Tanh, Softmax\n", - "from tinytorch.core.layers import Dense\n", - "from tinytorch.core.dense import Sequential, create_mlp\n", - "from tinytorch.core.spatial import Conv2D, flatten\n", - "from tinytorch.core.dataloader import Dataset, DataLoader\n", - "from tinytorch.core.autograd import Variable # FOR AUTOGRAD INTEGRATION\n", - "from tinytorch.core.optimizers import SGD, Adam, StepLR\n", - "\n", - "# 🔥 AUTOGRAD INTEGRATION: Loss functions now return Variables that support .backward()\n", - "# This enables automatic gradient computation for neural network training!" - ] - }, - { - "cell_type": "markdown", - "id": "b538ae25", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🔧 DEVELOPMENT" - ] - }, - { - "cell_type": "markdown", - "id": "334a8e7e", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Step 1: Understanding Loss Functions\n", - "\n", - "### What are Loss Functions?\n", - "Loss functions measure how far our model's predictions are from the true values. 
They provide the \"signal\" that tells our optimizer which direction to update parameters.\n", - "\n", - "### The Mathematical Foundation\n", - "Training a neural network is an optimization problem:\n", - "```\n", - "θ* = argmin_θ L(f(x; θ), y)\n", - "```\n", - "Where:\n", - "- `θ` = model parameters (weights and biases)\n", - "- `f(x; θ)` = model predictions\n", - "- `y` = true labels\n", - "- `L` = loss function\n", - "- `θ*` = optimal parameters\n", - "\n", - "### Why Loss Functions Matter\n", - "- **Optimization target**: They define what \"good\" means for our model\n", - "- **Gradient source**: Provide gradients for backpropagation\n", - "- **Task-specific**: Different losses for different problems\n", - "- **Training dynamics**: Shape how the model learns\n", - "\n", - "### Common Loss Functions\n", - "\n", - "#### **Mean Squared Error (MSE)** - For Regression\n", - "```\n", - "MSE = (1/n) * Σ(y_pred - y_true)²\n", - "```\n", - "- **Use case**: Regression problems\n", - "- **Properties**: Penalizes large errors heavily\n", - "- **Gradient**: 2 * (y_pred - y_true)\n", - "\n", - "#### **Cross-Entropy Loss** - For Classification\n", - "```\n", - "CrossEntropy = -Σ y_true * log(y_pred)\n", - "```\n", - "- **Use case**: Multi-class classification\n", - "- **Properties**: Penalizes confident wrong predictions\n", - "- **Gradient**: y_pred - y_true (with softmax)\n", - "\n", - "#### **Binary Cross-Entropy** - For Binary Classification\n", - "```\n", - "BCE = -y_true * log(y_pred) - (1-y_true) * log(1-y_pred)\n", - "```\n", - "- **Use case**: Binary classification\n", - "- **Properties**: Symmetric around 0.5\n", - "- **Gradient**: (y_pred - y_true) / (y_pred * (1-y_pred))\n", - "\n", - "Let's implement these essential loss functions!" 
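As a standalone sanity check, the three formulas above can be evaluated with plain NumPy before building the autograd-aware versions. This is an independent sketch — `mse`, `cross_entropy`, and `bce` are throwaway helper names for illustration, not part of the tinytorch API:

```python
import numpy as np

def mse(y_pred, y_true):
    # MSE = (1/n) * sum((y_pred - y_true)^2)
    return np.mean((y_pred - y_true) ** 2)

def cross_entropy(logits, class_idx):
    # Softmax (shifted by the row max for numerical stability), then -log p[true class]
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return -np.mean(np.log(probs[np.arange(len(class_idx)), class_idx]))

def bce(p, y):
    # BCE on probabilities in (0, 1); clip to avoid log(0)
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

print(mse(np.array([1.0, 2.0]), np.array([0.0, 1.0])))    # 1.0
print(cross_entropy(np.zeros((2, 3)), np.array([0, 1])))  # -log(1/3) ≈ 1.0986 (uniform logits)
print(bce(np.array([0.5, 0.5]), np.array([1.0, 0.0])))    # -log(0.5) ≈ 0.6931
```

The printed values match the test expectations used later in this module: uniform logits give -log(1/num_classes), and a 0.5 probability gives -log(0.5).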
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b2de0430", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "mse-loss", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class MeanSquaredError:\n", - " \"\"\"\n", - " Mean Squared Error Loss for Regression\n", - " \n", - " Measures the average squared difference between predictions and targets.\n", - " MSE = (1/n) * Σ(y_pred - y_true)²\n", - " \"\"\"\n", - " \n", - " def __init__(self):\n", - " \"\"\"Initialize MSE loss function.\"\"\"\n", - " pass\n", - " \n", - " def __call__(self, y_pred, y_true):\n", - " \"\"\"\n", - " Compute MSE loss between predictions and targets.\n", - " \n", - " Args:\n", - " y_pred: Model predictions (Tensor or Variable, shape: [batch_size, ...])\n", - " y_true: True targets (Tensor or Variable, shape: [batch_size, ...])\n", - " \n", - " Returns:\n", - " Variable with scalar loss value that supports .backward()\n", - " \n", - " TODO: Implement Mean Squared Error loss computation with autograd support.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Convert inputs to Variables if needed for autograd support\n", - " 2. Compute difference using Variable arithmetic: diff = y_pred - y_true\n", - " 3. Square the differences: squared_diff = diff * diff\n", - " 4. Take mean over all elements using Variable operations\n", - " 5.
Return as Variable that supports .backward() for gradient computation\n", - " \n", - " EXAMPLE:\n", - " y_pred = Variable([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)\n", - " y_true = Variable([[1.5, 2.5], [2.5, 3.5]], requires_grad=False)\n", - " loss = mse_loss(y_pred, y_true)\n", - " loss.backward() # Computes gradients for y_pred\n", - " \n", - " LEARNING CONNECTIONS:\n", - " - **Autograd Integration**: Loss functions must participate in computational graph for backpropagation\n", - " - **Gradient Flow**: MSE provides smooth gradients that flow backward through the network\n", - " - **Variable Operations**: Using Variables keeps computation in the autograd system\n", - " - **Training Pipeline**: Loss.backward() triggers gradient computation for entire network\n", - " \n", - " HINTS:\n", - " - Convert inputs to Variables if needed: Variable(tensor_data, requires_grad=True)\n", - " - Use Variable arithmetic to maintain autograd graph\n", - " - Use operations that preserve gradient computation\n", - " - Return Variable that supports .backward() method\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Convert to Variables if needed to support autograd\n", - " if not isinstance(y_pred, Variable):\n", - " if hasattr(y_pred, 'data'):\n", - " y_pred = Variable(y_pred.data, requires_grad=True)\n", - " else:\n", - " y_pred = Variable(y_pred, requires_grad=True)\n", - " \n", - " if not isinstance(y_true, Variable):\n", - " if hasattr(y_true, 'data'):\n", - " y_true = Variable(y_true.data, requires_grad=False) # Targets don't need gradients\n", - " else:\n", - " y_true = Variable(y_true, requires_grad=False)\n", - " \n", - " # Compute MSE using Variable operations to maintain autograd graph\n", - " diff = y_pred - y_true # Variable subtraction\n", - " squared_diff = diff * diff # Variable multiplication\n", - " \n", - " # Mean operation that preserves gradients\n", - " # Create a simple mean operation for Variables\n", - " if hasattr(squared_diff.data, 'data'):\n", - " 
mean_data = np.mean(squared_diff.data.data)\n", - " else:\n", - " mean_data = np.mean(squared_diff.data)\n", - " \n", - " # Create loss Variable with gradient function for MSE\n", - " def mse_grad_fn(grad_output):\n", - " # MSE gradient: 2 * (y_pred - y_true) / n\n", - " if y_pred.requires_grad:\n", - " if hasattr(y_pred.data, 'data'):\n", - " batch_size = np.prod(y_pred.data.data.shape)\n", - " grad_data = 2.0 * (y_pred.data.data - y_true.data.data) / batch_size\n", - " else:\n", - " batch_size = np.prod(y_pred.data.shape)\n", - " grad_data = 2.0 * (y_pred.data - y_true.data) / batch_size\n", - " \n", - " if hasattr(grad_output.data, 'data'):\n", - " final_grad = grad_data * grad_output.data.data\n", - " else:\n", - " final_grad = grad_data * grad_output.data\n", - " \n", - " y_pred.backward(Variable(final_grad))\n", - " \n", - " loss = Variable(mean_data, requires_grad=y_pred.requires_grad, grad_fn=mse_grad_fn)\n", - " return loss\n", - " ### END SOLUTION\n", - " \n", - " def forward(self, y_pred, y_true):\n", - " \"\"\"Alternative interface for forward pass.\"\"\"\n", - " return self.__call__(y_pred, y_true)" - ] - }, - { - "cell_type": "markdown", - "id": "3d9586b0", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Unit Test: MSE Loss\n", - "\n", - "Let's test our MSE loss implementation with known values." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "685382de", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "test-mse-loss", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_unit_mse_loss():\n", - " \"\"\"Test MSE loss with comprehensive examples.\"\"\"\n", - " print(\"🔬 Unit Test: MSE Loss...\")\n", - " \n", - " mse = MeanSquaredError()\n", - " \n", - " # Test 1: Perfect predictions (loss should be 0)\n", - " y_pred = Tensor([[1.0, 2.0], [3.0, 4.0]])\n", - " y_true = Tensor([[1.0, 2.0], [3.0, 4.0]])\n", - " loss = mse(y_pred, y_true)\n", - " assert abs(loss.data) < 1e-6, f\"Perfect predictions should have loss ≈ 0, got {loss.data}\"\n", - " print(\"✅ Perfect predictions test passed\")\n", - " \n", - " # Test 2: Known loss computation\n", - " y_pred = Tensor([[1.0, 2.0]])\n", - " y_true = Tensor([[0.0, 1.0]])\n", - " loss = mse(y_pred, y_true)\n", - " expected = 1.0 # [(1-0)² + (2-1)²] / 2 = [1 + 1] / 2 = 1.0\n", - " assert abs(loss.data - expected) < 1e-6, f\"Expected loss {expected}, got {loss.data}\"\n", - " print(\"✅ Known loss computation test passed\")\n", - " \n", - " # Test 3: Batch processing\n", - " y_pred = Tensor([[1.0, 2.0], [3.0, 4.0]])\n", - " y_true = Tensor([[1.5, 2.5], [2.5, 3.5]])\n", - " loss = mse(y_pred, y_true)\n", - " expected = 0.25 # All squared differences are 0.25\n", - " assert abs(loss.data - expected) < 1e-6, f\"Expected batch loss {expected}, got {loss.data}\"\n", - " print(\"✅ Batch processing test passed\")\n", - " \n", - " # Test 4: Single value\n", - " y_pred = Tensor([5.0])\n", - " y_true = Tensor([3.0])\n", - " loss = mse(y_pred, y_true)\n", - " expected = 4.0 # (5-3)² = 4\n", - " assert abs(loss.data - expected) < 1e-6, f\"Expected single value loss {expected}, got {loss.data}\"\n", - " print(\"✅ Single value test passed\")\n", - " \n", - " print(\"🎯 MSE Loss: All tests 
passed!\")\n", - "\n", - "# Test function defined (called in main block) " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "cb97bdc7", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "crossentropy-loss", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class CrossEntropyLoss:\n", - " \"\"\"\n", - " Cross-Entropy Loss for Multi-Class Classification\n", - " \n", - " Measures the difference between predicted probability distribution and true labels.\n", - " CrossEntropy = -Σ y_true * log(y_pred)\n", - " \"\"\"\n", - " \n", - " def __init__(self):\n", - " \"\"\"Initialize CrossEntropy loss function.\"\"\"\n", - " pass\n", - " \n", - " def __call__(self, y_pred, y_true):\n", - " \"\"\"\n", - " Compute CrossEntropy loss between predictions and targets.\n", - " \n", - " Args:\n", - " y_pred: Model predictions (Tensor or Variable, shape: [batch_size, num_classes])\n", - " y_true: True class indices (Tensor or Variable, shape: [batch_size]) or one-hot\n", - " \n", - " Returns:\n", - " Variable with scalar loss value that supports .backward()\n", - " \n", - " TODO: Implement Cross-Entropy loss computation with autograd support.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Convert inputs to Variables if needed for autograd support\n", - " 2. Handle both class indices and one-hot encoded labels\n", - " 3. Apply softmax to predictions for probability distribution\n", - " 4. Compute log probabilities while maintaining gradient flow\n", - " 5. 
Calculate cross-entropy and return Variable with gradient function\n", - " \n", - " EXAMPLE:\n", - " y_pred = Variable([[2.0, 1.0, 0.1], [0.5, 2.1, 0.9]], requires_grad=True)\n", - " y_true = Variable([0, 1], requires_grad=False) # Class indices\n", - " loss = crossentropy_loss(y_pred, y_true)\n", - " loss.backward() # Computes gradients for y_pred\n", - " \n", - " LEARNING CONNECTIONS:\n", - " - **Autograd Integration**: CrossEntropy must support gradient computation for classification training\n", - " - **Softmax Gradients**: Combined softmax + cross-entropy has well-defined gradients\n", - " - **Classification Training**: Standard loss for multi-class problems in neural networks\n", - " - **Gradient Flow**: Enables backpropagation through classification layers\n", - " \n", - " HINTS:\n", - " - Convert inputs to Variables to support autograd\n", - " - Apply softmax for probability distribution\n", - " - Use numerically stable computations\n", - " - Implement gradient function for cross-entropy + softmax\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Convert to Variables if needed to support autograd\n", - " if not isinstance(y_pred, Variable):\n", - " if hasattr(y_pred, 'data'):\n", - " y_pred = Variable(y_pred.data, requires_grad=True)\n", - " else:\n", - " y_pred = Variable(y_pred, requires_grad=True)\n", - " \n", - " if not isinstance(y_true, Variable):\n", - " if hasattr(y_true, 'data'):\n", - " y_true = Variable(y_true.data, requires_grad=False)\n", - " else:\n", - " y_true = Variable(y_true, requires_grad=False)\n", - " \n", - " # Get data for computation\n", - " if hasattr(y_pred.data, 'data'):\n", - " pred_data = y_pred.data.data\n", - " else:\n", - " pred_data = y_pred.data\n", - " \n", - " if hasattr(y_true.data, 'data'):\n", - " true_data = y_true.data.data\n", - " else:\n", - " true_data = y_true.data\n", - " \n", - " # Handle both 1D and 2D prediction arrays\n", - " if pred_data.ndim == 1:\n", - " pred_data = pred_data.reshape(1, -1)\n", - " 
\n", - " # Apply softmax to get probability distribution (numerically stable)\n", - " exp_pred = np.exp(pred_data - np.max(pred_data, axis=1, keepdims=True))\n", - " softmax_pred = exp_pred / np.sum(exp_pred, axis=1, keepdims=True)\n", - " \n", - " # Add small epsilon to avoid log(0)\n", - " epsilon = 1e-15\n", - " softmax_pred = np.clip(softmax_pred, epsilon, 1.0 - epsilon)\n", - " \n", - " # Handle class indices vs one-hot encoding\n", - " if len(true_data.shape) == 1:\n", - " # y_true contains class indices\n", - " batch_size = true_data.shape[0]\n", - " log_probs = np.log(softmax_pred[np.arange(batch_size), true_data.astype(int)])\n", - " loss_value = -np.mean(log_probs)\n", - " \n", - " # Create one-hot for gradient computation\n", - " one_hot = np.zeros_like(softmax_pred)\n", - " one_hot[np.arange(batch_size), true_data.astype(int)] = 1.0\n", - " else:\n", - " # y_true is one-hot encoded\n", - " one_hot = true_data\n", - " log_probs = np.log(softmax_pred)\n", - " loss_value = -np.mean(np.sum(true_data * log_probs, axis=1))\n", - " \n", - " # Create gradient function for CrossEntropy + Softmax\n", - " def crossentropy_grad_fn(grad_output):\n", - " if y_pred.requires_grad:\n", - " # Gradient of CrossEntropy + Softmax: (softmax_pred - one_hot) / batch_size\n", - " batch_size = softmax_pred.shape[0]\n", - " grad_data = (softmax_pred - one_hot) / batch_size\n", - " \n", - " if hasattr(grad_output.data, 'data'):\n", - " final_grad = grad_data * grad_output.data.data\n", - " else:\n", - " final_grad = grad_data * grad_output.data\n", - " \n", - " y_pred.backward(Variable(final_grad))\n", - " \n", - " loss = Variable(loss_value, requires_grad=y_pred.requires_grad, grad_fn=crossentropy_grad_fn)\n", - " return loss\n", - " ### END SOLUTION\n", - " \n", - " def forward(self, y_pred, y_true):\n", - " \"\"\"Alternative interface for forward pass.\"\"\"\n", - " return self.__call__(y_pred, y_true)\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { 
- "cell_type": "markdown", - "id": "19346e62", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Unit Test: CrossEntropy Loss\n", - "\n", - "Let's test our CrossEntropy loss implementation." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ccd29f33", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "test-crossentropy-loss", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_unit_crossentropy_loss():\n", - " \"\"\"Test CrossEntropy loss with comprehensive examples.\"\"\"\n", - " print(\"🔬 Unit Test: CrossEntropy Loss...\")\n", - " \n", - " ce = CrossEntropyLoss()\n", - " \n", - " # Test 1: Perfect predictions\n", - " y_pred = Tensor([[10.0, 0.0, 0.0], [0.0, 10.0, 0.0]]) # Very confident correct predictions\n", - " y_true = Tensor([0, 1]) # Class indices\n", - " loss = ce(y_pred, y_true)\n", - " assert loss.data < 0.1, f\"Perfect predictions should have low loss, got {loss.data}\"\n", - " print(\"✅ Perfect predictions test passed\")\n", - " \n", - " # Test 2: Random predictions (should have higher loss)\n", - " y_pred = Tensor([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]) # Uniform after softmax\n", - " y_true = Tensor([0, 1])\n", - " loss = ce(y_pred, y_true)\n", - " expected_random = -np.log(1.0/3.0) # log(1/num_classes) for uniform distribution\n", - " assert abs(loss.data - expected_random) < 0.1, f\"Random predictions should have loss ≈ {expected_random}, got {loss.data}\"\n", - " print(\"✅ Random predictions test passed\")\n", - " \n", - " # Test 3: Binary classification\n", - " y_pred = Tensor([[2.0, 1.0], [1.0, 2.0]])\n", - " y_true = Tensor([0, 1])\n", - " loss = ce(y_pred, y_true)\n", - " assert 0.0 < loss.data < 2.0, f\"Binary classification loss should be reasonable, got {loss.data}\"\n", - " print(\"✅ Binary classification test passed\")\n", - " \n", - " # Test 4: One-hot 
encoded labels\n", - " y_pred = Tensor([[2.0, 1.0, 0.0], [0.0, 2.0, 1.0]])\n", - " y_true = Tensor([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]) # One-hot encoded\n", - " loss = ce(y_pred, y_true)\n", - " assert 0.0 < loss.data < 2.0, f\"One-hot encoded loss should be reasonable, got {loss.data}\"\n", - " print(\"✅ One-hot encoded labels test passed\")\n", - " \n", - " print(\"🎯 CrossEntropy Loss: All tests passed!\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d12ade1c", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "binary-crossentropy-loss", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class BinaryCrossEntropyLoss:\n", - " \"\"\"\n", - " Binary Cross-Entropy Loss for Binary Classification\n", - " \n", - " Measures the difference between predicted probabilities and binary labels.\n", - " BCE = -y_true * log(y_pred) - (1-y_true) * log(1-y_pred)\n", - " \"\"\"\n", - " \n", - " def __init__(self):\n", - " \"\"\"Initialize Binary CrossEntropy loss function.\"\"\"\n", - " pass\n", - " \n", - " def __call__(self, y_pred, y_true):\n", - " \"\"\"\n", - " Compute Binary CrossEntropy loss between predictions and targets.\n", - " \n", - " Args:\n", - " y_pred: Model predictions (Tensor or Variable, shape: [batch_size, 1] or [batch_size])\n", - " y_true: True binary labels (Tensor or Variable, shape: [batch_size, 1] or [batch_size])\n", - " \n", - " Returns:\n", - " Variable with scalar loss value that supports .backward()\n", - " \n", - " TODO: Implement Binary Cross-Entropy loss computation with autograd support.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Convert inputs to Variables if needed for autograd support\n", - " 2. Apply sigmoid to predictions for probability values (numerically stable)\n", - " 3. 
Compute binary cross-entropy loss while maintaining gradient flow\n", - " 4. Create gradient function for sigmoid + BCE combination\n", - " 5. Return Variable that supports .backward() for gradient computation\n", - " \n", - " EXAMPLE:\n", - " y_pred = Variable([[2.0], [0.0], [-1.0]], requires_grad=True) # Raw logits\n", - " y_true = Variable([[1.0], [1.0], [0.0]], requires_grad=False) # Binary labels\n", - " loss = bce_loss(y_pred, y_true)\n", - " loss.backward() # Computes gradients for y_pred\n", - " \n", - " LEARNING CONNECTIONS:\n", - " - **Autograd Integration**: Binary CrossEntropy must support gradient computation for binary classification training\n", - " - **Sigmoid + BCE Gradients**: Combined sigmoid + BCE has well-defined gradients\n", - " - **Binary Classification**: Standard loss for binary problems in neural networks\n", - " - **Numerical Stability**: Use log-sum-exp tricks to avoid overflow/underflow\n", - " \n", - " HINTS:\n", - " - Convert inputs to Variables to support autograd\n", - " - Use numerically stable sigmoid computation\n", - " - Implement gradient function for sigmoid + BCE\n", - " - Handle both logits and probability inputs\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Convert to Variables if needed to support autograd\n", - " if not isinstance(y_pred, Variable):\n", - " if hasattr(y_pred, 'data'):\n", - " y_pred = Variable(y_pred.data, requires_grad=True)\n", - " else:\n", - " y_pred = Variable(y_pred, requires_grad=True)\n", - " \n", - " if not isinstance(y_true, Variable):\n", - " if hasattr(y_true, 'data'):\n", - " y_true = Variable(y_true.data, requires_grad=False)\n", - " else:\n", - " y_true = Variable(y_true, requires_grad=False)\n", - " \n", - " # Get data for computation\n", - " if hasattr(y_pred.data, 'data'):\n", - " logits = y_pred.data.data.flatten()\n", - " else:\n", - " logits = y_pred.data.flatten()\n", - " \n", - " if hasattr(y_true.data, 'data'):\n", - " labels = y_true.data.data.flatten()\n", - " else:\n", - 
" labels = y_true.data.flatten()\n", - " \n", - " # Numerically stable binary cross-entropy from logits\n", - " def stable_bce_with_logits(logits, labels):\n", - " # Use the stable formulation: max(x, 0) - x * y + log(1 + exp(-abs(x)))\n", - " stable_loss = np.maximum(logits, 0) - logits * labels + np.log(1 + np.exp(-np.abs(logits)))\n", - " return stable_loss\n", - " \n", - " # Compute loss for each sample\n", - " losses = stable_bce_with_logits(logits, labels)\n", - " mean_loss = np.mean(losses)\n", - " \n", - " # Compute sigmoid for gradient computation\n", - " sigmoid_pred = 1.0 / (1.0 + np.exp(-np.clip(logits, -250, 250))) # Clipped for stability\n", - " \n", - " # Create gradient function for Binary CrossEntropy + Sigmoid\n", - " def bce_grad_fn(grad_output):\n", - " if y_pred.requires_grad:\n", - " # Gradient of BCE + Sigmoid: (sigmoid_pred - labels) / batch_size\n", - " batch_size = len(labels)\n", - " grad_data = (sigmoid_pred - labels) / batch_size\n", - " \n", - " # Reshape to match original y_pred shape\n", - " if hasattr(y_pred.data, 'data'):\n", - " original_shape = y_pred.data.data.shape\n", - " else:\n", - " original_shape = y_pred.data.shape\n", - " \n", - " if len(original_shape) > 1:\n", - " grad_data = grad_data.reshape(original_shape)\n", - " \n", - " if hasattr(grad_output.data, 'data'):\n", - " final_grad = grad_data * grad_output.data.data\n", - " else:\n", - " final_grad = grad_data * grad_output.data\n", - " \n", - " y_pred.backward(Variable(final_grad))\n", - " \n", - " loss = Variable(mean_loss, requires_grad=y_pred.requires_grad, grad_fn=bce_grad_fn)\n", - " return loss\n", - " ### END SOLUTION\n", - " \n", - " def forward(self, y_pred, y_true):\n", - " \"\"\"Alternative interface for forward pass.\"\"\"\n", - " return self.__call__(y_pred, y_true)\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "0a128beb", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 
- }, - "source": [ - "### 🧪 Unit Test: Binary CrossEntropy Loss\n", - "\n", - "Let's test our Binary CrossEntropy loss implementation." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c8b56c61", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "test-binary-crossentropy-loss", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_unit_binary_crossentropy_loss():\n", - " \"\"\"Test Binary CrossEntropy loss with comprehensive examples.\"\"\"\n", - " print(\"🔬 Unit Test: Binary CrossEntropy Loss...\")\n", - " \n", - " bce = BinaryCrossEntropyLoss()\n", - " \n", - " # Test 1: Perfect predictions\n", - " y_pred = Tensor([[10.0], [-10.0]]) # Very confident correct predictions\n", - " y_true = Tensor([[1.0], [0.0]])\n", - " loss = bce(y_pred, y_true)\n", - " assert loss.data < 0.1, f\"Perfect predictions should have low loss, got {loss.data}\"\n", - " print(\"✅ Perfect predictions test passed\")\n", - " \n", - " # Test 2: Random predictions (should have higher loss)\n", - " y_pred = Tensor([[0.0], [0.0]]) # 0.5 probability after sigmoid\n", - " y_true = Tensor([[1.0], [0.0]])\n", - " loss = bce(y_pred, y_true)\n", - " expected_random = -np.log(0.5) # log(0.5) for random guessing\n", - " assert abs(loss.data - expected_random) < 0.1, f\"Random predictions should have loss ≈ {expected_random}, got {loss.data}\"\n", - " print(\"✅ Random predictions test passed\")\n", - " \n", - " # Test 3: Batch processing\n", - " y_pred = Tensor([[1.0], [2.0], [-1.0]])\n", - " y_true = Tensor([[1.0], [1.0], [0.0]])\n", - " loss = bce(y_pred, y_true)\n", - " assert 0.0 < loss.data < 2.0, f\"Batch processing loss should be reasonable, got {loss.data}\"\n", - " print(\"✅ Batch processing test passed\")\n", - " \n", - " # Test 4: Edge cases\n", - " y_pred = Tensor([[100.0], [-100.0]]) # Extreme values\n", - " y_true = Tensor([[1.0], [0.0]])\n", - " loss = 
bce(y_pred, y_true)\n", - " assert loss.data < 0.1, f\"Extreme correct predictions should have low loss, got {loss.data}\"\n", - " print(\"✅ Edge cases test passed\")\n", - " \n", - " print(\"🎯 Binary CrossEntropy Loss: All tests passed!\")\n", - "\n", - "# Test function defined (called in main block) " - ] - }, - { - "cell_type": "markdown", - "id": "da0767fa", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## Step 2: Understanding Metrics\n", - "\n", - "### What are Metrics?\n", - "Metrics are measurements that help us understand how well our model is performing. Unlike loss functions, metrics are often more interpretable and align with business objectives.\n", - "\n", - "### Key Metrics for Classification\n", - "\n", - "#### **Accuracy**\n", - "```\n", - "Accuracy = (Correct Predictions) / (Total Predictions)\n", - "```\n", - "- **Range**: [0, 1]\n", - "- **Interpretation**: Percentage of correct predictions\n", - "- **Good for**: Balanced datasets\n", - "\n", - "#### **Precision**\n", - "```\n", - "Precision = True Positives / (True Positives + False Positives)\n", - "```\n", - "- **Range**: [0, 1]\n", - "- **Interpretation**: Of all positive predictions, how many were correct?\n", - "- **Good for**: When false positives are costly\n", - "\n", - "#### **Recall (Sensitivity)**\n", - "```\n", - "Recall = True Positives / (True Positives + False Negatives)\n", - "```\n", - "- **Range**: [0, 1]\n", - "- **Interpretation**: Of all actual positives, how many did we find?\n", - "- **Good for**: When false negatives are costly\n", - "\n", - "### Key Metrics for Regression\n", - "\n", - "#### **Mean Absolute Error (MAE)**\n", - "```\n", - "MAE = (1/n) * Σ|y_pred - y_true|\n", - "```\n", - "- **Range**: [0, ∞)\n", - "- **Interpretation**: Average absolute error\n", - "- **Good for**: When robustness to outliers matters\n", - "\n", - "Let's implement these essential metrics!" - ] - }, - { - "cell_type": "code", -
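The four formulas above can be checked directly with plain NumPy before wiring them into metric classes. This is a standalone sketch — the arrays here are made-up example data, not part of the module's API:

```python
import numpy as np

# Hypothetical binary predictions and labels
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])

accuracy = np.mean(y_pred == y_true)        # correct / total

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives

precision = tp / (tp + fp)  # of predicted positives, how many were correct?
recall = tp / (tp + fn)     # of actual positives, how many did we find?

# MAE on a small regression example
reg_true = np.array([1.0, 2.0, 3.0])
reg_pred = np.array([1.5, 2.0, 2.0])
mae = np.mean(np.abs(reg_pred - reg_true))

print(accuracy, precision, recall, mae)  # 0.6, ~0.667, ~0.667, 0.5
```

Note how precision and recall share the true-positive count but divide by different totals — that is the whole difference between the two metrics.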
"execution_count": null, - "id": "27590d5a", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "accuracy-metric", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class Accuracy:\n", - " \"\"\"\n", - " Accuracy Metric for Classification\n", - " \n", - " Computes the fraction of correct predictions.\n", - " Accuracy = (Correct Predictions) / (Total Predictions)\n", - " \"\"\"\n", - " \n", - " def __init__(self):\n", - " \"\"\"Initialize Accuracy metric.\"\"\"\n", - " pass\n", - " \n", - " def __call__(self, y_pred: Tensor, y_true: Tensor) -> float:\n", - " \"\"\"\n", - " Compute accuracy between predictions and targets.\n", - " \n", - " Args:\n", - " y_pred: Model predictions (shape: [batch_size, num_classes] or [batch_size])\n", - " y_true: True class labels (shape: [batch_size] or one-hot [batch_size, num_classes])\n", - " \n", - " Returns:\n", - " Accuracy as a float value between 0 and 1\n", - " \n", - " TODO: Implement accuracy computation.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Convert predictions to class indices (argmax for multi-class)\n", - " 2. Convert true labels to class indices if needed\n", - " 3. Count correct predictions\n", - " 4. Divide by total predictions\n", - " 5. 
Return as float\n", - " \n", - " EXAMPLE:\n", - " y_pred = Tensor([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]]) # Probabilities\n", - " y_true = Tensor([0, 1, 0]) # True classes\n", - " accuracy = accuracy_metric(y_pred, y_true)\n", - " # Should return: 2/3 ≈ 0.667 (first two predictions correct, third wrong)\n", - " \n", - " LEARNING CONNECTIONS:\n", - " - **Model Evaluation**: Primary metric for classification model performance\n", - " - **Business KPIs**: Often directly tied to business objectives and success metrics\n", - " - **Baseline Comparison**: Standard metric for comparing different models\n", - " - **Production Monitoring**: Real-time accuracy monitoring for model health\n", - " \n", - " HINTS:\n", - " - Use np.argmax(axis=1) for multi-class predictions\n", - " - Handle both probability and class index inputs\n", - " - Use np.mean() for averaging\n", - " - Return Python float, not Tensor\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Convert predictions to class indices\n", - " if len(y_pred.data.shape) > 1 and y_pred.data.shape[1] > 1:\n", - " # Multi-class: use argmax\n", - " pred_classes = np.argmax(y_pred.data, axis=1)\n", - " else:\n", - " # Binary classification: threshold at 0.5\n", - " pred_classes = (y_pred.data.flatten() > 0.5).astype(int)\n", - " \n", - " # Convert true labels to class indices if needed\n", - " if len(y_true.data.shape) > 1 and y_true.data.shape[1] > 1:\n", - " # One-hot encoded\n", - " true_classes = np.argmax(y_true.data, axis=1)\n", - " else:\n", - " # Already class indices\n", - " true_classes = y_true.data.flatten().astype(int)\n", - " \n", - " # Compute accuracy\n", - " correct = np.sum(pred_classes == true_classes)\n", - " total = len(true_classes)\n", - " accuracy = correct / total\n", - " \n", - " return float(accuracy)\n", - " ### END SOLUTION\n", - " \n", - " def forward(self, y_pred: Tensor, y_true: Tensor) -> float:\n", - " \"\"\"Alternative interface for forward pass.\"\"\"\n", - " return self.__call__(y_pred, y_true)" -
] - }, - { - "cell_type": "markdown", - "id": "fd382e7f", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Unit Test: Accuracy Metric\n", - "\n", - "Let's test our Accuracy metric implementation." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4c925c62", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "test-accuracy-metric", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_unit_accuracy_metric():\n", - " \"\"\"Test Accuracy metric with comprehensive examples.\"\"\"\n", - " print(\"🔬 Unit Test: Accuracy Metric...\")\n", - " \n", - " accuracy = Accuracy()\n", - " \n", - " # Test 1: Perfect predictions\n", - " y_pred = Tensor([[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]])\n", - " y_true = Tensor([0, 1, 0])\n", - " acc = accuracy(y_pred, y_true)\n", - " assert acc == 1.0, f\"Perfect predictions should have accuracy 1.0, got {acc}\"\n", - " print(\"✅ Perfect predictions test passed\")\n", - " \n", - " # Test 2: Two of three correct\n", - " y_pred = Tensor([[0.9, 0.1], [0.9, 0.1], [0.8, 0.2]]) # All predict class 0\n", - " y_true = Tensor([0, 1, 0]) # Classes: 0, 1, 0\n", - " acc = accuracy(y_pred, y_true)\n", - " expected = 2.0/3.0 # 2 out of 3 correct\n", - " assert abs(acc - expected) < 1e-6, f\"Two of three correct should give accuracy {expected}, got {acc}\"\n", - " print(\"✅ Two of three correct test passed\")\n", - " \n", - " # Test 3: Binary classification\n", - " y_pred = Tensor([[0.8], [0.3], [0.9], [0.1]]) # Predictions above/below 0.5\n", - " y_true = Tensor([1, 0, 1, 0])\n", - " acc = accuracy(y_pred, y_true)\n", - " assert acc == 1.0, f\"Binary classification should have accuracy 1.0, got {acc}\"\n", - " print(\"✅ Binary classification test passed\")\n", - " \n", - " # Test 4: Multi-class\n", - " y_pred = Tensor([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]])\n", - " y_true = Tensor([0, 1, 
2])\n", - " acc = accuracy(y_pred, y_true)\n", - " assert acc == 1.0, f\"Multi-class should have accuracy 1.0, got {acc}\"\n", - " print(\"✅ Multi-class test passed\")\n", - " \n", - " print(\"🎯 Accuracy Metric: All tests passed!\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "6f17bf77", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Step 3: Building the Training Loop\n", - "\n", - "### What is a Training Loop?\n", - "A training loop is the orchestration logic that coordinates all components of neural network training:\n", - "\n", - "1. **Forward Pass**: Compute predictions\n", - "2. **Loss Computation**: Measure prediction quality\n", - "3. **Backward Pass**: Compute gradients\n", - "4. **Parameter Update**: Update model parameters\n", - "5. **Evaluation**: Compute metrics and validation performance\n", - "\n", - "### The Training Loop Architecture\n", - "```python\n", - "for epoch in range(num_epochs):\n", - " # Training phase\n", - " for batch_x, batch_y in train_dataloader:\n", - " optimizer.zero_grad()\n", - " predictions = model(batch_x)\n", - " loss = loss_function(predictions, batch_y)\n", - " loss.backward()\n", - " optimizer.step()\n", - " \n", - " # Validation phase\n", - " for batch_x, batch_y in val_dataloader:\n", - " predictions = model(batch_x)\n", - " val_loss = loss_function(predictions, batch_y)\n", - " accuracy = accuracy_metric(predictions, batch_y)\n", - "```\n", - "\n", - "### Why We Need a Trainer Class\n", - "- **Encapsulation**: Keeps training logic organized\n", - "- **Reusability**: Same trainer works with different models/datasets\n", - "- **Monitoring**: Built-in logging and progress tracking\n", - "- **Flexibility**: Easy to modify training behavior\n", - "\n", - "Let's build our Trainer class!" 
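Before encapsulating the loop in a Trainer class, the forward/loss/backward/update cycle can be exercised end to end with plain NumPy stand-ins. This is a sketch only: the "model" is a single weight matrix, the gradient is computed by hand, and the names (`w`, `lr`) are local to the example rather than part of the module's API:

```python
import numpy as np

# Toy dataset: y = 2x with a single feature
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 1))
y = 2.0 * X

w = np.zeros((1, 1))   # the "model": one linear weight
lr = 0.1               # learning rate

for epoch in range(100):
    preds = X @ w                            # forward pass
    loss = np.mean((preds - y) ** 2)         # loss computation (MSE)
    grad = 2.0 * X.T @ (preds - y) / len(X)  # backward pass, done by hand
    w -= lr * grad                           # parameter update (SGD step)

print(f"learned weight: {w[0, 0]:.3f}")      # converges toward the true value 2.0
```

Every framework's training loop is an elaboration of these four lines; the Trainer class below adds batching, metrics, history tracking, and checkpointing around the same core.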
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "844395fe", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "trainer-class", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class Trainer:\n", - " \"\"\"\n", - " Training Loop Orchestrator\n", - " \n", - " Coordinates model training with loss functions, optimizers, and metrics.\n", - " \"\"\"\n", - " \n", - " def __init__(self, model, optimizer, loss_function, metrics=None):\n", - " \"\"\"\n", - " Initialize trainer with model and training components.\n", - " \n", - " Args:\n", - " model: Neural network model to train\n", - " optimizer: Optimizer for parameter updates\n", - " loss_function: Loss function for training\n", - " metrics: List of metrics to track (optional)\n", - " \n", - " TODO: Initialize the trainer with all necessary components.\n", - " \n", - " APPROACH:\n", - " 1. Store model, optimizer, loss function, and metrics\n", - " 2. Initialize history tracking for losses and metrics\n", - " 3. Set up training state (epoch, step counters)\n", - " 4. 
Prepare for training and validation loops\n", - " \n", - " EXAMPLE:\n", - " model = Sequential([Dense(10, 5), ReLU(), Dense(5, 2)])\n", - " optimizer = Adam(model.parameters, learning_rate=0.001)\n", - " loss_fn = CrossEntropyLoss()\n", - " metrics = [Accuracy()]\n", - " trainer = Trainer(model, optimizer, loss_fn, metrics)\n", - " \n", - " HINTS:\n", - " - Store all components as instance variables\n", - " - Initialize empty history dictionaries\n", - " - Set metrics to empty list if None provided\n", - " - Initialize epoch and step counters to 0\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " self.model = model\n", - " self.optimizer = optimizer\n", - " self.loss_function = loss_function\n", - " self.metrics = metrics or []\n", - " \n", - " # Training history\n", - " self.history = {\n", - " 'train_loss': [],\n", - " 'val_loss': [],\n", - " 'epoch': []\n", - " }\n", - " \n", - " # Add metric history tracking\n", - " for metric in self.metrics:\n", - " metric_name = metric.__class__.__name__.lower()\n", - " self.history[f'train_{metric_name}'] = []\n", - " self.history[f'val_{metric_name}'] = []\n", - " \n", - " # Training state\n", - " self.current_epoch = 0\n", - " self.current_step = 0\n", - " ### END SOLUTION\n", - " \n", - " def train_epoch(self, dataloader):\n", - " \"\"\"\n", - " Train for one epoch on the given dataloader.\n", - " \n", - " Args:\n", - " dataloader: DataLoader containing training data\n", - " \n", - " Returns:\n", - " Dictionary with epoch training metrics\n", - " \n", - " TODO: Implement single epoch training logic.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Initialize epoch metrics tracking\n", - " 2. Iterate through batches in dataloader\n", - " 3. For each batch:\n", - " - Zero gradients\n", - " - Forward pass\n", - " - Compute loss\n", - " - Backward pass\n", - " - Update parameters\n", - " - Track metrics\n", - " 4. 
Return averaged metrics for the epoch\n", - " \n", - " LEARNING CONNECTIONS:\n", - " - **Training Loop Foundation**: Core pattern used in all deep learning frameworks\n", - " - **Gradient Accumulation**: Optimizer.zero_grad() prevents gradient accumulation bugs\n", - " - **Backpropagation**: loss.backward() computes gradients through entire network\n", - " - **Parameter Updates**: optimizer.step() applies computed gradients to model weights\n", - " \n", - " HINTS:\n", - " - Use optimizer.zero_grad() before each batch\n", - " - Call loss.backward() for gradient computation\n", - " - Use optimizer.step() for parameter updates\n", - " - Track running averages for metrics\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " epoch_metrics = {'loss': 0.0}\n", - " \n", - " # Initialize metric tracking\n", - " for metric in self.metrics:\n", - " metric_name = metric.__class__.__name__.lower()\n", - " epoch_metrics[metric_name] = 0.0\n", - " \n", - " batch_count = 0\n", - " \n", - " for batch_x, batch_y in dataloader:\n", - " # Zero gradients\n", - " self.optimizer.zero_grad()\n", - " \n", - " # Forward pass\n", - " predictions = self.model(batch_x)\n", - " \n", - " # Compute loss\n", - " loss = self.loss_function(predictions, batch_y)\n", - " \n", - " # Backward pass - now that loss functions support autograd!\n", - " if hasattr(loss, 'backward'):\n", - " loss.backward()\n", - " \n", - " # Update parameters\n", - " self.optimizer.step()\n", - " \n", - " # Track metrics\n", - " if hasattr(loss, 'data'):\n", - " if hasattr(loss.data, 'data'):\n", - " epoch_metrics['loss'] += loss.data.data # Variable with Tensor data\n", - " else:\n", - " epoch_metrics['loss'] += loss.data # Variable with numpy data\n", - " else:\n", - " epoch_metrics['loss'] += loss # Direct value\n", - " \n", - " for metric in self.metrics:\n", - " metric_name = metric.__class__.__name__.lower()\n", - " metric_value = metric(predictions, batch_y)\n", - " epoch_metrics[metric_name] += metric_value\n", - " \n", 
- " batch_count += 1\n", - " self.current_step += 1\n", - " \n", - " # Average metrics over all batches\n", - " for key in epoch_metrics:\n", - " epoch_metrics[key] /= batch_count\n", - " \n", - " return epoch_metrics\n", - " ### END SOLUTION\n", - " \n", - " def validate_epoch(self, dataloader):\n", - " \"\"\"\n", - " Validate for one epoch on the given dataloader.\n", - " \n", - " Args:\n", - " dataloader: DataLoader containing validation data\n", - " \n", - " Returns:\n", - " Dictionary with epoch validation metrics\n", - " \n", - " TODO: Implement single epoch validation logic.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Initialize epoch metrics tracking\n", - " 2. Iterate through batches in dataloader\n", - " 3. For each batch:\n", - " - Forward pass (no gradient computation)\n", - " - Compute loss\n", - " - Track metrics\n", - " 4. Return averaged metrics for the epoch\n", - " \n", - " LEARNING CONNECTIONS:\n", - " - **Model Evaluation**: Validation measures generalization to unseen data\n", - " - **Overfitting Detection**: Comparing train vs validation metrics reveals overfitting\n", - " - **Model Selection**: Validation metrics guide hyperparameter tuning and architecture choices\n", - " - **Early Stopping**: Validation loss plateaus indicate optimal training duration\n", - " \n", - " HINTS:\n", - " - No gradient computation needed for validation\n", - " - No parameter updates during validation\n", - " - Similar to train_epoch but simpler\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " epoch_metrics = {'loss': 0.0}\n", - " \n", - " # Initialize metric tracking\n", - " for metric in self.metrics:\n", - " metric_name = metric.__class__.__name__.lower()\n", - " epoch_metrics[metric_name] = 0.0\n", - " \n", - " batch_count = 0\n", - " \n", - " for batch_x, batch_y in dataloader:\n", - " # Forward pass only (no gradients needed)\n", - " predictions = self.model(batch_x)\n", - " \n", - " # Compute loss\n", - " loss = self.loss_function(predictions, 
batch_y)\n", - " \n", - " # Track metrics\n", - " if hasattr(loss, 'data'):\n", - " if hasattr(loss.data, 'data'):\n", - " epoch_metrics['loss'] += loss.data.data # Variable with Tensor data\n", - " else:\n", - " epoch_metrics['loss'] += loss.data # Variable with numpy data\n", - " else:\n", - " epoch_metrics['loss'] += loss # Direct value\n", - " \n", - " for metric in self.metrics:\n", - " metric_name = metric.__class__.__name__.lower()\n", - " metric_value = metric(predictions, batch_y)\n", - " epoch_metrics[metric_name] += metric_value\n", - " \n", - " batch_count += 1\n", - " \n", - " # Average metrics over all batches\n", - " for key in epoch_metrics:\n", - " epoch_metrics[key] /= batch_count\n", - " \n", - " return epoch_metrics\n", - " ### END SOLUTION\n", - " \n", - " def fit(self, train_dataloader, val_dataloader=None, epochs=10, verbose=True, save_best=False, checkpoint_path=\"best_model.pkl\"):\n", - " \"\"\"\n", - " Train the model for specified number of epochs.\n", - " \n", - " Args:\n", - " train_dataloader: Training data\n", - " val_dataloader: Validation data (optional)\n", - " epochs: Number of training epochs\n", - " verbose: Whether to print training progress\n", - " save_best: Whether to checkpoint the model with the best validation loss\n", - " checkpoint_path: File path for the best-model checkpoint\n", - " \n", - " Returns:\n", - " Training history dictionary\n", - " \n", - " TODO: Implement complete training loop.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Loop through epochs\n", - " 2. For each epoch:\n", - " - Train on training data\n", - " - Validate on validation data (if provided)\n", - " - Update history\n", - " - Print progress (if verbose)\n", - " 3. 
Return complete training history\n", - " \n", - " LEARNING CONNECTIONS:\n", - " - **Epoch Management**: Organizing training into discrete passes through the dataset\n", - " - **Learning Curves**: History tracking enables visualization of training progress\n", - " - **Hyperparameter Tuning**: Training history guides learning rate and architecture decisions\n", - " - **Production Monitoring**: Training logs provide debugging and optimization insights\n", - " \n", - " HINTS:\n", - " - Use train_epoch() and validate_epoch() methods\n", - " - Update self.history with results\n", - " - Print epoch summary if verbose=True\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " print(f\"Starting training for {epochs} epochs...\")\n", - " best_val_loss = float('inf')\n", - " \n", - " for epoch in range(epochs):\n", - " self.current_epoch = epoch\n", - " \n", - " # Training phase\n", - " train_metrics = self.train_epoch(train_dataloader)\n", - " \n", - " # Validation phase\n", - " val_metrics = {}\n", - " if val_dataloader is not None:\n", - " val_metrics = self.validate_epoch(val_dataloader)\n", - " \n", - " # Update history\n", - " self.history['epoch'].append(epoch)\n", - " self.history['train_loss'].append(train_metrics['loss'])\n", - " \n", - " if val_dataloader is not None:\n", - " self.history['val_loss'].append(val_metrics['loss'])\n", - " \n", - " # Update metric history\n", - " for metric in self.metrics:\n", - " metric_name = metric.__class__.__name__.lower()\n", - " self.history[f'train_{metric_name}'].append(train_metrics[metric_name])\n", - " if val_dataloader is not None:\n", - " self.history[f'val_{metric_name}'].append(val_metrics[metric_name])\n", - " \n", - " # Save best model checkpoint\n", - " if save_best and val_dataloader is not None:\n", - " if val_metrics['loss'] < best_val_loss:\n", - " best_val_loss = val_metrics['loss']\n", - " self.save_checkpoint(checkpoint_path)\n", - " if verbose:\n", - " print(f\" 💾 Saved best model (val_loss: 
{best_val_loss:.4f})\")\n", - " \n", - " # Print progress\n", - " if verbose:\n", - " train_loss = train_metrics['loss']\n", - " print(f\"Epoch {epoch+1}/{epochs} - train_loss: {train_loss:.4f}\", end=\"\")\n", - " \n", - " if val_dataloader is not None:\n", - " val_loss = val_metrics['loss']\n", - " print(f\" - val_loss: {val_loss:.4f}\", end=\"\")\n", - " \n", - " for metric in self.metrics:\n", - " metric_name = metric.__class__.__name__.lower()\n", - " train_metric = train_metrics[metric_name]\n", - " print(f\" - train_{metric_name}: {train_metric:.4f}\", end=\"\")\n", - " \n", - " if val_dataloader is not None:\n", - " val_metric = val_metrics[metric_name]\n", - " print(f\" - val_{metric_name}: {val_metric:.4f}\", end=\"\")\n", - " \n", - " print() # New line\n", - " \n", - " print(\"Training completed!\")\n", - " return self.history\n", - " ### END SOLUTION\n", - " \n", - " def save_checkpoint(self, filepath):\n", - " \"\"\"Save model checkpoint.\"\"\"\n", - " import pickle\n", - " checkpoint = {\n", - " 'epoch': self.current_epoch,\n", - " 'model_state': self._get_model_state(),\n", - " 'history': self.history\n", - " }\n", - " \n", - " with open(filepath, 'wb') as f:\n", - " pickle.dump(checkpoint, f)\n", - " \n", - " def load_checkpoint(self, filepath):\n", - " \"\"\"Load model checkpoint.\"\"\"\n", - " import pickle\n", - " with open(filepath, 'rb') as f:\n", - " checkpoint = pickle.load(f)\n", - " \n", - " self.current_epoch = checkpoint['epoch']\n", - " self.history = checkpoint['history']\n", - " self._set_model_state(checkpoint['model_state'])\n", - " \n", - " print(f\"✅ Loaded checkpoint from epoch {self.current_epoch}\")\n", - " \n", - " def _get_model_state(self):\n", - " \"\"\"Extract model parameters.\"\"\"\n", - " state = {}\n", - " for i, layer in enumerate(self.model.layers):\n", - " if hasattr(layer, 'weight'):\n", - " state[f'layer_{i}_weight'] = layer.weight.data.copy()\n", - " state[f'layer_{i}_bias'] = layer.bias.data.copy()\n", - " return state\n", - " \n", - " def 
_set_model_state(self, state):\n", - " \"\"\"Restore model parameters.\"\"\"\n", - " for i, layer in enumerate(self.model.layers):\n", - " if hasattr(layer, 'weight'):\n", - " layer.weight.data = state[f'layer_{i}_weight']\n", - " layer.bias.data = state[f'layer_{i}_bias']" - ] - }, - { - "cell_type": "markdown", - "id": "8c9b9b9a", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Unit Test: Training Loop\n", - "\n", - "Let's test our Trainer class with a simple example." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "65006adc", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "test-trainer", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_unit_trainer():\n", - " \"\"\"Test Trainer class with comprehensive examples.\"\"\"\n", - " print(\"🔬 Unit Test: Trainer Class...\")\n", - " \n", - " # Create simple model and components\n", - " model = Sequential([Dense(2, 3), ReLU(), Dense(3, 2)]) # Simple model\n", - " optimizer = SGD([], learning_rate=0.01) # Empty parameters list for testing\n", - " loss_fn = MeanSquaredError()\n", - " metrics = [Accuracy()]\n", - " \n", - " # Create trainer\n", - " trainer = Trainer(model, optimizer, loss_fn, metrics)\n", - " \n", - " # Test 1: Trainer initialization\n", - " assert trainer.model is model, \"Model should be stored correctly\"\n", - " assert trainer.optimizer is optimizer, \"Optimizer should be stored correctly\"\n", - " assert trainer.loss_function is loss_fn, \"Loss function should be stored correctly\"\n", - " assert len(trainer.metrics) == 1, \"Metrics should be stored correctly\"\n", - " assert 'train_loss' in trainer.history, \"Training history should be initialized\"\n", - " print(\"✅ Trainer initialization test passed\")\n", - " \n", - " # Test 2: History structure\n", - " assert 'epoch' in trainer.history, \"History should 
track epochs\"\n", - " assert 'train_accuracy' in trainer.history, \"History should track training accuracy\"\n", - " assert 'val_accuracy' in trainer.history, \"History should track validation accuracy\"\n", - " print(\"✅ History structure test passed\")\n", - " \n", - " # Test 3: Training state\n", - " assert trainer.current_epoch == 0, \"Current epoch should start at 0\"\n", - " assert trainer.current_step == 0, \"Current step should start at 0\"\n", - " print(\"✅ Training state test passed\")\n", - " \n", - " print(\"🎯 Trainer Class: All tests passed!\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "9344e9fa", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Unit Test: Complete Training Comprehensive Test\n", - "\n", - "Let's test the complete training pipeline with all components working together.\n", - "\n", - "**This is a comprehensive test** - it tests all training components working together in a realistic scenario." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7d2b3d3c", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "test-training-comprehensive", - "locked": true, - "points": 25, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_module_training():\n", - " \"\"\"Test complete training pipeline with all components.\"\"\"\n", - " print(\"🔬 Integration Test: Complete Training Pipeline...\")\n", - " \n", - " try:\n", - " # Test 1: Loss functions work correctly\n", - " mse = MeanSquaredError()\n", - " ce = CrossEntropyLoss()\n", - " bce = BinaryCrossEntropyLoss()\n", - " \n", - " # MSE test\n", - " y_pred = Tensor([[1.0, 2.0]])\n", - " y_true = Tensor([[1.0, 2.0]])\n", - " loss = mse(y_pred, y_true)\n", - " assert abs(loss.data) < 1e-6, \"MSE should work for perfect predictions\"\n", - " \n", - " # CrossEntropy test\n", - " y_pred = Tensor([[10.0, 0.0], [0.0, 10.0]])\n", - " y_true = Tensor([0, 1])\n", - " loss = ce(y_pred, y_true)\n", - " assert loss.data < 1.0, \"CrossEntropy should work for good predictions\"\n", - " \n", - " # Binary CrossEntropy test\n", - " y_pred = Tensor([[10.0], [-10.0]])\n", - " y_true = Tensor([[1.0], [0.0]])\n", - " loss = bce(y_pred, y_true)\n", - " assert loss.data < 1.0, \"Binary CrossEntropy should work for good predictions\"\n", - " \n", - " print(\"✅ Loss functions work correctly\")\n", - " \n", - " # Test 2: Metrics work correctly\n", - " accuracy = Accuracy()\n", - " \n", - " y_pred = Tensor([[0.9, 0.1], [0.1, 0.9]])\n", - " y_true = Tensor([0, 1])\n", - " acc = accuracy(y_pred, y_true)\n", - " assert acc == 1.0, \"Accuracy should work for perfect predictions\"\n", - " \n", - " print(\"✅ Metrics work correctly\")\n", - " \n", - " # Test 3: Trainer integrates all components\n", - " model = Sequential([]) # Empty model for testing\n", - " optimizer = SGD([], learning_rate=0.01)\n", - " loss_fn = MeanSquaredError()\n", 
- " metrics = [Accuracy()]\n", - " \n", - " trainer = Trainer(model, optimizer, loss_fn, metrics)\n", - " \n", - " # Check trainer setup\n", - " assert trainer.model is model, \"Trainer should store model\"\n", - " assert trainer.optimizer is optimizer, \"Trainer should store optimizer\"\n", - " assert trainer.loss_function is loss_fn, \"Trainer should store loss function\"\n", - " assert len(trainer.metrics) == 1, \"Trainer should store metrics\"\n", - " \n", - " print(\"✅ Trainer integrates all components\")\n", - " \n", - " print(\"🎉 Complete training pipeline works correctly!\")\n", - " \n", - " # Test 4: Integration works end-to-end\n", - " print(\"✅ End-to-end integration successful\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ Training pipeline test failed: {e}\")\n", - " raise\n", - " \n", - " print(\"🎯 Training Pipeline: All comprehensive tests passed!\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "f929b2ae", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Step 4: ML Systems Thinking - Production Training Pipeline Analysis\n", - "\n", - "### 🏗️ Training Infrastructure at Scale\n", - "\n", - "Your training loop implementation provides the foundation for understanding how production ML systems orchestrate the entire training pipeline. 
Let's analyze the systems engineering challenges that arise when training models at scale.\n", - "\n", - "#### **Training Pipeline Architecture**\n", - "```python\n", - "class ProductionTrainingPipeline:\n", - " def __init__(self):\n", - " # Resource allocation and distributed coordination\n", - " self.gpu_memory_pool = GPUMemoryManager()\n", - " self.distributed_coordinator = DistributedTrainingCoordinator() \n", - " self.checkpoint_manager = CheckpointManager()\n", - " self.metrics_aggregator = MetricsAggregator()\n", - "```\n", - "\n", - "Real training systems must handle:\n", - "- **Multi-GPU coordination**: Synchronizing gradients across devices\n", - "- **Memory management**: Optimizing batch sizes for available GPU memory\n", - "- **Fault tolerance**: Recovering from hardware failures during long training runs\n", - "- **Resource scheduling**: Balancing compute, memory, and I/O across the cluster" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "98db040e", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "training-pipeline-profiler", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "from collections import defaultdict\n", - "\n", - "class TrainingPipelineProfiler:\n", - " \"\"\"\n", - " Production Training Pipeline Analysis and Optimization\n", - " \n", - " Monitors end-to-end training performance and identifies bottlenecks\n", - " across the complete training infrastructure.\n", - " \"\"\"\n", - " \n", - " def __init__(self, warning_threshold_seconds=5.0):\n", - " \"\"\"\n", - " Initialize training pipeline profiler.\n", - " \n", - " Args:\n", - " warning_threshold_seconds: Warn if any pipeline step exceeds this time\n", - " \"\"\"\n", - " self.warning_threshold = warning_threshold_seconds\n", - " self.profiling_data = defaultdict(list)\n", - " self.resource_usage = defaultdict(list)\n", - " \n", - " def profile_complete_training_step(self, model, 
dataloader, optimizer, loss_fn, batch_size=32):\n", - " \"\"\"\n", - " Profile complete training step including all pipeline components.\n", - " \n", - " TODO: Implement comprehensive training step profiling.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Time each component: data loading, forward pass, loss computation, backward pass, optimization\n", - " 2. Monitor memory usage throughout the pipeline\n", - " 3. Calculate throughput metrics (samples/second, batches/second)\n", - " 4. Identify pipeline bottlenecks and optimization opportunities\n", - " 5. Generate performance recommendations\n", - " \n", - " EXAMPLE:\n", - " profiler = TrainingPipelineProfiler()\n", - " step_metrics = profiler.profile_complete_training_step(model, dataloader, optimizer, loss_fn)\n", - " print(f\"Training throughput: {step_metrics['samples_per_second']:.1f} samples/sec\")\n", - " \n", - " LEARNING CONNECTIONS:\n", - " - **Performance Optimization**: Identifying bottlenecks in training pipeline\n", - " - **Resource Planning**: Understanding memory and compute requirements\n", - " - **Hardware Selection**: Data guides GPU vs CPU trade-offs\n", - " - **Production Scaling**: Optimizing training throughput for large models\n", - " \n", - " HINTS:\n", - " - Use time.time() for timing measurements\n", - " - Monitor before/after memory usage\n", - " - Calculate ratios: compute_time / total_time\n", - " - Identify which step is the bottleneck\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " import time\n", - " \n", - " # Initialize timing and memory tracking\n", - " step_times = {}\n", - " memory_usage = {}\n", - " \n", - " # Get initial memory baseline (simplified - in production would use GPU monitoring)\n", - " baseline_memory = self._estimate_memory_usage()\n", - " \n", - " # 1. 
Data Loading Phase\n", - " data_start = time.time()\n", - " try:\n", - " batch_x, batch_y = next(iter(dataloader))\n", - " data_time = time.time() - data_start\n", - " step_times['data_loading'] = data_time\n", - " except Exception:\n", - " # Handle case where dataloader is not iterable for testing\n", - " data_time = 0.001 # Minimal time for testing\n", - " step_times['data_loading'] = data_time\n", - " batch_x = Tensor(np.random.randn(batch_size, 10))\n", - " batch_y = Tensor(np.random.randint(0, 2, batch_size))\n", - " \n", - " memory_usage['after_data_loading'] = self._estimate_memory_usage()\n", - " \n", - " # 2. Forward Pass Phase\n", - " forward_start = time.time()\n", - " try:\n", - " predictions = model(batch_x)\n", - " forward_time = time.time() - forward_start\n", - " step_times['forward_pass'] = forward_time\n", - " except Exception:\n", - " # Handle case for testing with simplified model\n", - " forward_time = 0.002\n", - " step_times['forward_pass'] = forward_time\n", - " predictions = Tensor(np.random.randn(batch_size, 2))\n", - " \n", - " memory_usage['after_forward_pass'] = self._estimate_memory_usage()\n", - " \n", - " # 3. Loss Computation Phase\n", - " loss_start = time.time()\n", - " loss = loss_fn(predictions, batch_y)\n", - " loss_time = time.time() - loss_start\n", - " step_times['loss_computation'] = loss_time\n", - " \n", - " memory_usage['after_loss_computation'] = self._estimate_memory_usage()\n", - " \n", - " # 4. Backward Pass Phase (simplified for testing)\n", - " backward_start = time.time()\n", - " # In real implementation: loss.backward()\n", - " backward_time = 0.003 # Simulated backward pass time\n", - " step_times['backward_pass'] = backward_time\n", - " \n", - " memory_usage['after_backward_pass'] = self._estimate_memory_usage()\n", - " \n", - " # 5. 
Optimization Phase\n", - " optimization_start = time.time()\n", - " try:\n", - " optimizer.step()\n", - " optimization_time = time.time() - optimization_start\n", - " step_times['optimization'] = optimization_time\n", - " except Exception:\n", - " # Handle case for testing\n", - " optimization_time = 0.001\n", - " step_times['optimization'] = optimization_time\n", - " \n", - " memory_usage['after_optimization'] = self._estimate_memory_usage()\n", - " \n", - " # Calculate total time and throughput\n", - " total_time = sum(step_times.values())\n", - " samples_per_second = batch_size / total_time if total_time > 0 else 0\n", - " \n", - " # Identify bottleneck\n", - " bottleneck_step = max(step_times.items(), key=lambda x: x[1])\n", - " \n", - " # Calculate component percentages\n", - " component_percentages = {\n", - " step: (time_taken / total_time * 100) if total_time > 0 else 0\n", - " for step, time_taken in step_times.items()\n", - " }\n", - " \n", - " # Generate performance analysis\n", - " performance_analysis = self._analyze_pipeline_performance(step_times, memory_usage, component_percentages)\n", - " \n", - " # Store profiling data\n", - " self.profiling_data['total_time'].append(total_time)\n", - " self.profiling_data['samples_per_second'].append(samples_per_second)\n", - " self.profiling_data['bottleneck_step'].append(bottleneck_step[0])\n", - " \n", - " return {\n", - " 'step_times': step_times,\n", - " 'total_time': total_time,\n", - " 'samples_per_second': samples_per_second,\n", - " 'bottleneck_step': bottleneck_step[0],\n", - " 'bottleneck_time': bottleneck_step[1],\n", - " 'component_percentages': component_percentages,\n", - " 'memory_usage': memory_usage,\n", - " 'performance_analysis': performance_analysis\n", - " }\n", - " ### END SOLUTION\n", - " \n", - " def _estimate_memory_usage(self):\n", - " \"\"\"Estimate current memory usage (simplified implementation).\"\"\"\n", - " # In production: would use psutil.Process().memory_info().rss or GPU monitoring\n", 
- " import sys\n", - " return sys.getsizeof({}) * 1024 # Simplified estimate\n", - " \n", - " def _analyze_pipeline_performance(self, step_times, memory_usage, component_percentages):\n", - " \"\"\"Analyze training pipeline performance and generate recommendations.\"\"\"\n", - " analysis = []\n", - " \n", - " # Identify performance bottlenecks\n", - " max_step = max(step_times.items(), key=lambda x: x[1])\n", - " if max_step[1] > self.warning_threshold:\n", - " analysis.append(f\"⚠️ BOTTLENECK: {max_step[0]} taking {max_step[1]:.3f}s (>{self.warning_threshold}s threshold)\")\n", - " \n", - " # Analyze component balance\n", - " forward_pct = component_percentages.get('forward_pass', 0)\n", - " backward_pct = component_percentages.get('backward_pass', 0)\n", - " data_pct = component_percentages.get('data_loading', 0)\n", - " \n", - " if data_pct > 30:\n", - " analysis.append(\"📊 Data loading is >30% of total time - consider data pipeline optimization\")\n", - " \n", - " if forward_pct > 60:\n", - " analysis.append(\"🔄 Forward pass dominates (>60%) - consider model optimization or batch size tuning\")\n", - " \n", - " # Memory analysis\n", - " memory_keys = list(memory_usage.keys())\n", - " if len(memory_keys) > 1:\n", - " memory_growth = memory_usage[memory_keys[-1]] - memory_usage[memory_keys[0]]\n", - " if memory_growth > 1024 * 1024: # > 1MB growth\n", - " analysis.append(\"💾 Significant memory growth during training step - monitor for memory leaks\")\n", - " \n", - " return analysis" - ] - }, - { - "cell_type": "markdown", - "id": "ec75ffe9", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Test: Training Pipeline Profiling\n", - "\n", - "Let's test our training pipeline profiler with a realistic training scenario." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2402ca88", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "test-training-pipeline-profiler", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_training_pipeline_profiler():\n", - " \"\"\"Test training pipeline profiler with comprehensive scenarios.\"\"\"\n", - " print(\"🔬 Unit Test: Training Pipeline Profiler...\")\n", - " \n", - " profiler = TrainingPipelineProfiler(warning_threshold_seconds=1.0)\n", - " \n", - " # Create test components\n", - " model = Sequential([Dense(10, 5), ReLU(), Dense(5, 2)])\n", - " optimizer = SGD([], learning_rate=0.01)\n", - " loss_fn = MeanSquaredError()\n", - " \n", - " # Create simple test dataloader\n", - " class TestDataLoader:\n", - " def __iter__(self):\n", - " return self\n", - " def __next__(self):\n", - " return Tensor(np.random.randn(32, 10)), Tensor(np.random.randint(0, 2, 32))\n", - " \n", - " dataloader = TestDataLoader()\n", - " \n", - " # Test training step profiling\n", - " metrics = profiler.profile_complete_training_step(model, dataloader, optimizer, loss_fn, batch_size=32)\n", - " \n", - " # Verify profiling results\n", - " assert 'step_times' in metrics, \"Should track step times\"\n", - " assert 'total_time' in metrics, \"Should track total time\"\n", - " assert 'samples_per_second' in metrics, \"Should calculate throughput\"\n", - " assert 'bottleneck_step' in metrics, \"Should identify bottleneck\"\n", - " assert 'performance_analysis' in metrics, \"Should provide performance analysis\"\n", - " \n", - " # Verify all pipeline steps are profiled\n", - " expected_steps = ['data_loading', 'forward_pass', 'loss_computation', 'backward_pass', 'optimization']\n", - " for step in expected_steps:\n", - " assert step in metrics['step_times'], f\"Should profile {step}\"\n", - " assert metrics['step_times'][step] >= 0, f\"Step 
time should be non-negative for {step}\"\n", - " \n", - " # Verify throughput calculation\n", - " assert metrics['samples_per_second'] >= 0, \"Throughput should be non-negative\"\n", - " \n", - " # Verify component percentages\n", - " total_percentage = sum(metrics['component_percentages'].values())\n", - " assert abs(total_percentage - 100.0) < 1.0, f\"Component percentages should sum to ~100%, got {total_percentage}\"\n", - " \n", - " print(\"✅ Training pipeline profiling test passed\")\n", - " \n", - " # Test performance analysis\n", - " assert isinstance(metrics['performance_analysis'], list), \"Performance analysis should be a list\"\n", - " print(\"✅ Performance analysis generation test passed\")\n", - " \n", - " print(\"🎯 Training Pipeline Profiler: All tests passed!\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "adf3252a", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "production-training-optimizer", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class ProductionTrainingOptimizer:\n", - " \"\"\"\n", - " Production Training Pipeline Optimization\n", - " \n", - " Optimizes training pipelines for production deployment with focus on\n", - " throughput, resource utilization, and system stability.\n", - " \"\"\"\n", - " \n", - " def __init__(self):\n", - " \"\"\"Initialize production training optimizer.\"\"\"\n", - " self.optimization_history = []\n", - " self.baseline_metrics = None\n", - " \n", - " def optimize_batch_size_for_throughput(self, model, loss_fn, optimizer, initial_batch_size=32, max_batch_size=512):\n", - " \"\"\"\n", - " Find optimal batch size for maximum training throughput.\n", - " \n", - " TODO: Implement batch size optimization for production throughput.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. 
Test range of batch sizes from initial to maximum\n", - " 2. For each batch size, measure:\n", - " - Training throughput (samples/second)\n", - " - Memory usage\n", - " - Time per step\n", - " 3. Find optimal batch size balancing throughput and memory\n", - " 4. Handle memory limitations gracefully\n", - " 5. Return recommendations with trade-off analysis\n", - " \n", - " EXAMPLE:\n", - " optimizer_tool = ProductionTrainingOptimizer()\n", - " optimal_config = optimizer_tool.optimize_batch_size_for_throughput(model, loss_fn, optimizer)\n", - " print(f\"Optimal batch size: {optimal_config['optimal_batch_size']}\")\n", - " print(f\"Expected throughput: {optimal_config['expected_throughput']:.1f} samples/sec\")\n", - " \n", - " LEARNING CONNECTIONS:\n", - " - **Memory vs Throughput**: Larger batches improve GPU utilization but use more memory\n", - " - **Hardware Optimization**: Optimal batch size depends on GPU memory and compute units\n", - " - **Training Dynamics**: Batch size affects gradient noise and convergence behavior\n", - " - **Production Cost**: Throughput optimization directly impacts cloud computing costs\n", - " \n", - " HINTS:\n", - " - Test powers of 2: 32, 64, 128, 256, 512\n", - " - Monitor memory usage to avoid OOM\n", - " - Calculate samples_per_second for each batch size\n", - " - Consider memory efficiency (throughput per MB)\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " print(\"🔧 Optimizing batch size for production throughput...\")\n", - " \n", - " # Test batch sizes (powers of 2 for optimal GPU utilization)\n", - " test_batch_sizes = []\n", - " current_batch = initial_batch_size\n", - " while current_batch <= max_batch_size:\n", - " test_batch_sizes.append(current_batch)\n", - " current_batch *= 2\n", - " \n", - " optimization_results = []\n", - " profiler = TrainingPipelineProfiler()\n", - " \n", - " for batch_size in test_batch_sizes:\n", - " print(f\" Testing batch size: {batch_size}\")\n", - " \n", - " try:\n", - " # Create test data for this batch size\n", - " 
test_x = Tensor(np.random.randn(batch_size, 10))\n", - " test_y = Tensor(np.random.randint(0, 2, batch_size))\n", - " \n", - " # Create mock dataloader\n", - " class MockDataLoader:\n", - " def __init__(self, x, y):\n", - " self.x, self.y = x, y\n", - " def __iter__(self):\n", - " return self\n", - " def __next__(self):\n", - " return self.x, self.y\n", - " \n", - " dataloader = MockDataLoader(test_x, test_y)\n", - " \n", - " # Profile training step\n", - " metrics = profiler.profile_complete_training_step(\n", - " model, dataloader, optimizer, loss_fn, batch_size\n", - " )\n", - " \n", - " # Estimate memory usage (simplified)\n", - " estimated_memory_mb = batch_size * 10 * 4 / (1024 * 1024) # 4 bytes per float\n", - " memory_efficiency = metrics['samples_per_second'] / estimated_memory_mb if estimated_memory_mb > 0 else 0\n", - " \n", - " optimization_results.append({\n", - " 'batch_size': batch_size,\n", - " 'throughput': metrics['samples_per_second'],\n", - " 'total_time': metrics['total_time'],\n", - " 'estimated_memory_mb': estimated_memory_mb,\n", - " 'memory_efficiency': memory_efficiency,\n", - " 'bottleneck_step': metrics['bottleneck_step']\n", - " })\n", - " \n", - " except Exception as e:\n", - " print(f\" ⚠️ Batch size {batch_size} failed: {e}\")\n", - " # In production, this would typically be OOM\n", - " break\n", - " \n", - " # Find optimal configuration\n", - " if not optimization_results:\n", - " return {'error': 'No valid batch sizes found'}\n", - " \n", - " # Optimal = highest throughput that doesn't exceed memory limits\n", - " best_config = max(optimization_results, key=lambda x: x['throughput'])\n", - " \n", - " # Generate optimization analysis\n", - " analysis = self._generate_batch_size_analysis(optimization_results, best_config)\n", - " \n", - " # Store optimization history\n", - " self.optimization_history.append({\n", - " 'optimization_type': 'batch_size',\n", - " 'results': optimization_results,\n", - " 'best_config': best_config,\n", - 
" 'analysis': analysis\n", - " })\n", - " \n", - " return {\n", - " 'optimal_batch_size': best_config['batch_size'],\n", - " 'expected_throughput': best_config['throughput'],\n", - " 'estimated_memory_usage': best_config['estimated_memory_mb'],\n", - " 'all_results': optimization_results,\n", - " 'optimization_analysis': analysis\n", - " }\n", - " ### END SOLUTION\n", - " \n", - " def _generate_batch_size_analysis(self, results, best_config):\n", - " \"\"\"Generate analysis of batch size optimization results.\"\"\"\n", - " analysis = []\n", - " \n", - " # Throughput analysis\n", - " throughputs = [r['throughput'] for r in results]\n", - " max_throughput = max(throughputs)\n", - " min_throughput = min(throughputs)\n", - " \n", - " analysis.append(f\"📈 Throughput range: {min_throughput:.1f} - {max_throughput:.1f} samples/sec\")\n", - " analysis.append(f\"🎯 Optimal batch size: {best_config['batch_size']} ({max_throughput:.1f} samples/sec)\")\n", - " \n", - " # Memory efficiency analysis\n", - " memory_efficiencies = [r['memory_efficiency'] for r in results]\n", - " most_efficient = max(results, key=lambda x: x['memory_efficiency'])\n", - " \n", - " analysis.append(f\"💾 Most memory efficient: batch size {most_efficient['batch_size']} ({most_efficient['memory_efficiency']:.2f} samples/sec/MB)\")\n", - " \n", - " # Bottleneck analysis\n", - " bottleneck_counts = {}\n", - " for r in results:\n", - " step = r['bottleneck_step']\n", - " bottleneck_counts[step] = bottleneck_counts.get(step, 0) + 1\n", - " \n", - " common_bottleneck = max(bottleneck_counts.items(), key=lambda x: x[1])\n", - " analysis.append(f\"🔍 Common bottleneck: {common_bottleneck[0]} ({common_bottleneck[1]}/{len(results)} configurations)\")\n", - " \n", - " return analysis" - ] - }, - { - "cell_type": "markdown", - "id": "fd2344b5", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Test: Production Training Optimization\n", - "\n", - "Let's test our 
production training optimizer." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "05e054a7", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "test-production-optimizer", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_production_training_optimizer():\n", - " \"\"\"Test production training optimizer with realistic scenarios.\"\"\"\n", - " print(\"🔬 Unit Test: Production Training Optimizer...\")\n", - " \n", - " optimizer_tool = ProductionTrainingOptimizer()\n", - " \n", - " # Create test components\n", - " model = Sequential([Dense(10, 5), ReLU(), Dense(5, 2)])\n", - " optimizer = SGD([], learning_rate=0.01)\n", - " loss_fn = MeanSquaredError()\n", - " \n", - " # Test batch size optimization\n", - " result = optimizer_tool.optimize_batch_size_for_throughput(\n", - " model, loss_fn, optimizer, \n", - " initial_batch_size=32, \n", - " max_batch_size=128\n", - " )\n", - " \n", - " # Verify optimization results\n", - " assert 'optimal_batch_size' in result, \"Should find optimal batch size\"\n", - " assert 'expected_throughput' in result, \"Should calculate expected throughput\"\n", - " assert 'estimated_memory_usage' in result, \"Should estimate memory usage\"\n", - " assert 'all_results' in result, \"Should provide all test results\"\n", - " assert 'optimization_analysis' in result, \"Should provide analysis\"\n", - " \n", - " # Verify optimal batch size is reasonable\n", - " assert result['optimal_batch_size'] >= 32, \"Optimal batch size should be at least initial size\"\n", - " assert result['optimal_batch_size'] <= 128, \"Optimal batch size should not exceed maximum\"\n", - " \n", - " # Verify throughput is positive\n", - " assert result['expected_throughput'] > 0, \"Expected throughput should be positive\"\n", - " \n", - " # Verify all results structure\n", - " all_results = result['all_results']\n", - " assert len(all_results) > 0, \"Should 
have tested at least one batch size\"\n", - " \n", - " for test_result in all_results:\n", - " assert 'batch_size' in test_result, \"Each result should have batch size\"\n", - " assert 'throughput' in test_result, \"Each result should have throughput\"\n", - " assert 'total_time' in test_result, \"Each result should have total time\"\n", - " assert test_result['throughput'] >= 0, \"Throughput should be non-negative\"\n", - " \n", - " print(\"✅ Batch size optimization test passed\")\n", - " \n", - " # Test optimization history tracking\n", - " assert len(optimizer_tool.optimization_history) == 1, \"Should track optimization history\"\n", - " history_entry = optimizer_tool.optimization_history[0]\n", - " assert history_entry['optimization_type'] == 'batch_size', \"Should track optimization type\"\n", - " assert 'results' in history_entry, \"Should store optimization results\"\n", - " assert 'best_config' in history_entry, \"Should store best configuration\"\n", - " \n", - " print(\"✅ Optimization history tracking test passed\")\n", - " \n", - " print(\"🎯 Production Training Optimizer: All tests passed!\")\n", - "\n", - "# Test function defined (called in main block)\n", - "\n", - "def test_autograd_integration():\n", - " \"\"\"Test that loss functions now support autograd for gradient computation.\"\"\"\n", - " print(\"🔬 Autograd Integration Test: Loss Functions Support .backward()...\")\n", - " \n", - " # Test MSE Loss with autograd\n", - " mse = MeanSquaredError()\n", - " y_pred = Variable([[2.0, 3.0]], requires_grad=True)\n", - " y_true = Variable([[1.0, 2.0]], requires_grad=False)\n", - " \n", - " loss = mse(y_pred, y_true)\n", - " assert isinstance(loss, Variable), \"MSE should return Variable for autograd\"\n", - " assert hasattr(loss, 'backward'), \"Loss should have backward method\"\n", - " \n", - " # Test backward pass\n", - " loss.backward()\n", - " assert y_pred.grad is not None, \"Gradients should be computed for y_pred\"\n", - " print(\"✅ MSE Loss 
autograd integration works\")\n", - " \n", - " # Test CrossEntropy Loss with autograd\n", - " ce = CrossEntropyLoss()\n", - " y_pred = Variable([[2.0, 1.0], [1.0, 2.0]], requires_grad=True)\n", - " y_true = Variable([0, 1], requires_grad=False)\n", - " \n", - " loss = ce(y_pred, y_true)\n", - " assert isinstance(loss, Variable), \"CrossEntropy should return Variable for autograd\"\n", - " assert hasattr(loss, 'backward'), \"Loss should have backward method\"\n", - " \n", - " # Test backward pass\n", - " loss.backward()\n", - " assert y_pred.grad is not None, \"Gradients should be computed for y_pred\"\n", - " print(\"✅ CrossEntropy Loss autograd integration works\")\n", - " \n", - " # Test Binary CrossEntropy Loss with autograd \n", - " bce = BinaryCrossEntropyLoss()\n", - " y_pred = Variable([[1.0], [-1.0]], requires_grad=True)\n", - " y_true = Variable([[1.0], [0.0]], requires_grad=False)\n", - " \n", - " loss = bce(y_pred, y_true)\n", - " assert isinstance(loss, Variable), \"Binary CrossEntropy should return Variable for autograd\"\n", - " assert hasattr(loss, 'backward'), \"Loss should have backward method\"\n", - " \n", - " # Test backward pass\n", - " loss.backward()\n", - " assert y_pred.grad is not None, \"Gradients should be computed for y_pred\"\n", - " print(\"✅ Binary CrossEntropy Loss autograd integration works\")\n", - " \n", - " print(\"🎯 Autograd Integration: All loss functions now support gradient computation!\")\n", - "\n", - "if __name__ == \"__main__\":\n", - " # Run all training tests\n", - " test_unit_mse_loss()\n", - " test_unit_crossentropy_loss()\n", - " test_unit_binary_crossentropy_loss()\n", - " test_unit_accuracy_metric()\n", - " test_unit_trainer()\n", - " test_module_training()\n", - " test_autograd_integration() # NEW: Test autograd integration\n", - " # test_training_pipeline_profiler() # Skip due to type mismatch issue\n", - " # test_production_training_optimizer() # Skip due to type mismatch issue\n", - " \n", - " print(\"\\n🎉 
SUCCESS: Training module now fully integrated with autograd system!\")\n", - " print(\"✅ Loss functions return Variables that support .backward()\")\n", - " print(\"✅ Training loops can now compute gradients automatically\")\n", - " print(\"✅ Ready for real neural network training with backpropagation!\")\n", - " print(\"\\nTraining module complete!\")" - ] - }, - { - "cell_type": "markdown", - "id": "af53870c", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🤔 ML Systems Thinking Questions\n", - "\n", - "*Take a moment to reflect on these questions. Consider how your training loop implementation connects to the broader challenges of production ML systems.*\n", - "\n", - "### 🏗️ Training Infrastructure Design\n", - "1. **Pipeline Architecture**: Your training loop orchestrates data loading, forward pass, loss computation, and optimization. How might this change when scaling to distributed training across multiple GPUs or machines?\n", - "\n", - "2. **Resource Management**: What happens to your training pipeline when GPU memory becomes the limiting factor? How do production systems handle out-of-memory errors during training?\n", - "\n", - "3. **Fault Tolerance**: If a training job crashes after 20 hours, how can production systems recover? What checkpointing strategies would you implement?\n", - "\n", - "### 📊 Production Training Operations\n", - "4. **Monitoring Strategy**: Beyond loss and accuracy, what metrics would you monitor in a production training system? How would you detect training instability or hardware failures?\n", - "\n", - "5. **Hyperparameter Optimization**: How would you systematically search for optimal batch sizes, learning rates, and model architectures at scale?\n", - "\n", - "6. **Data Pipeline Integration**: How does your training loop interact with data pipelines that might be processing terabytes of data? 
What happens when data arrives faster than the model can consume it?\n", - "\n", - "### ⚖️ Training at Scale\n", - "7. **Distributed Coordination**: When training on 1000 GPUs, how do you ensure all devices stay synchronized? What are the trade-offs between synchronous and asynchronous training?\n", - "\n", - "8. **Memory Optimization**: How would you implement gradient accumulation to simulate larger batch sizes? What other memory optimization techniques are critical for large models?\n", - "\n", - "9. **Training Efficiency**: What's the difference between training throughput (samples/second) and training efficiency (time to convergence)? How do you optimize for both?\n", - "\n", - "### 🔄 MLOps Integration\n", - "10. **Experiment Tracking**: How would you track thousands of training experiments with different configurations? What metadata is essential for reproducibility?\n", - "\n", - "11. **Model Lifecycle**: How does your training pipeline integrate with model versioning, A/B testing, and deployment systems?\n", - "\n", - "12. **Cost Optimization**: Training large models can cost thousands of dollars. How would you optimize training costs while maintaining model quality?\n", - "\n", - "*These questions connect your training implementation to the real challenges of production ML systems. Each question represents engineering decisions that impact the reliability, scalability, and cost-effectiveness of ML systems at scale.*" - ] - }, - { - "cell_type": "markdown", - "id": "1e5afb2a", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🎯 MODULE SUMMARY: Training Pipelines\n", - "\n", - "Congratulations! 
You've successfully implemented complete training pipelines:\n", - "\n", - "### What You've Accomplished\n", - "✅ **Training Loops**: End-to-end training with loss computation and optimization \n", - "✅ **Loss Functions**: Implementation and integration of loss calculations \n", - "✅ **Metrics Tracking**: Monitoring accuracy and loss during training \n", - "✅ **Integration**: Seamless compatibility with neural networks and optimizers \n", - "✅ **Real Applications**: Training real models on real data \n", - "✅ **Pipeline Profiling**: Production-grade performance analysis and optimization \n", - "✅ **Systems Thinking**: Understanding training infrastructure at scale \n", - "\n", - "### Key Concepts You've Learned\n", - "- **Training loops**: How to iterate over data, compute loss, and update parameters\n", - "- **Loss functions**: Quantifying model performance\n", - "- **Metrics tracking**: Monitoring progress and diagnosing issues\n", - "- **Integration patterns**: How training works with all components\n", - "- **Performance optimization**: Efficient training for large models\n", - "- **Pipeline profiling**: Identifying bottlenecks in training infrastructure\n", - "- **Production optimization**: Balancing throughput, memory, and resource utilization\n", - "\n", - "### Professional Skills Developed\n", - "- **Training orchestration**: Building robust training systems\n", - "- **Loss engineering**: Implementing and tuning loss functions\n", - "- **Metrics analysis**: Understanding and improving model performance\n", - "- **Integration testing**: Ensuring all components work together\n", - "- **Performance profiling**: Optimizing training pipelines for production\n", - "- **Systems design**: Understanding distributed training challenges\n", - "\n", - "### Ready for Advanced Applications\n", - "Your training pipeline implementations now enable:\n", - "- **Full model training**: End-to-end training of neural networks\n", - "- **Experimentation**: Testing different 
architectures and hyperparameters\n", - "- **Production systems**: Deploying trained models for real applications\n", - "- **Research**: Experimenting with new training strategies\n", - "- **Performance optimization**: Scaling training to production workloads\n", - "- **Infrastructure design**: Building reliable ML training systems\n", - "\n", - "### Connection to Real ML Systems\n", - "Your implementations mirror production systems:\n", - "- **PyTorch**: `torch.nn.Module`, `torch.optim`, and training loops\n", - "- **TensorFlow**: `tf.keras.Model`, `tf.keras.optimizers`, and fit methods\n", - "- **Industry Standard**: Every major ML framework uses these exact patterns\n", - "- **Production Tools**: Similar to Ray Train, Horovod, and distributed training frameworks\n", - "\n", - "### Next Steps\n", - "1. **Export your code**: `tito export 10_training`\n", - "2. **Test your implementation**: `tito test 10_training`\n", - "3. **Build evaluation pipelines**: Add benchmarking and validation\n", - "4. **Move to Module 12**: Add model compression and optimization!\n", - "\n", - "**Ready for compression?** Your training pipelines are now ready for real-world deployment!" - ] - } - ], - "metadata": { - "jupytext": { - "main_language": "python" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/modules_old/07_training/training_dev.py b/modules_old/07_training/training_dev.py deleted file mode 100644 index 87d76d64..00000000 --- a/modules_old/07_training/training_dev.py +++ /dev/null @@ -1,2059 +0,0 @@ -# --- -# jupyter: -# jupytext: -# text_representation: -# extension: .py -# format_name: percent -# format_version: '1.3' -# jupytext_version: 1.17.1 -# --- - -# %% [markdown] -""" -# Training - Complete End-to-End ML Training Infrastructure - -Welcome to the Training module! You'll build the complete training infrastructure that orchestrates data loading, forward passes, loss computation, backpropagation, and optimization into a unified system. 
- -## Learning Goals -- Systems understanding: How training loops coordinate all ML system components and why training orchestration determines system reliability -- Core implementation skill: Build loss functions, evaluation metrics, and complete training loops with checkpointing and monitoring -- Pattern recognition: Understand how different loss functions affect learning dynamics and model behavior -- Framework connection: See how your training loop mirrors PyTorch's training patterns and state management -- Performance insight: Learn why training loop design affects convergence speed, memory usage, and debugging capability - -## Build → Use → Reflect -1. **Build**: Complete training infrastructure with loss functions, metrics, checkpointing, and progress monitoring -2. **Use**: Train real neural networks on CIFAR-10 and achieve meaningful accuracy on complex visual tasks -3. **Reflect**: Why does training loop design often determine the success or failure of ML projects? - -## What You'll Achieve -By the end of this module, you'll understand: -- Deep technical understanding of how training loops orchestrate complex ML systems into reliable, monitorable processes -- Practical capability to build production-ready training infrastructure with proper error handling and state management -- Systems insight into why training stability and reproducibility are critical for reliable ML systems -- Performance consideration of how training loop efficiency affects iteration speed and resource utilization -- Connection to production ML systems and how modern MLOps platforms build on these training patterns - -## Systems Reality Check -💡 **Production Context**: Modern ML training platforms like PyTorch Lightning and Hugging Face Transformers build sophisticated abstractions on top of basic training loops to handle distributed training, mixed precision, and fault tolerance -⚡ **Performance Note**: Training loop efficiency often matters more than model efficiency for 
development speed - good training infrastructure accelerates the entire ML development cycle -""" - -# %% nbgrader={"grade": false, "grade_id": "training-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} -#| default_exp core.training - -#| export -import numpy as np -import sys -import os -from collections import defaultdict -import time -import pickle - -# Add module directories to Python path -sys.path.append(os.path.abspath('modules/source/01_tensor')) -sys.path.append(os.path.abspath('modules/source/02_activations')) -sys.path.append(os.path.abspath('modules/source/03_layers')) -sys.path.append(os.path.abspath('modules/source/05_networks')) -sys.path.append(os.path.abspath('modules/source/06_autograd')) -sys.path.append(os.path.abspath('modules/source/07_spatial')) -sys.path.append(os.path.abspath('modules/source/08_optimizers')) -sys.path.append(os.path.abspath('modules/source/09_dataloader')) - -# Helper function to set up import paths -# No longer needed, will use direct relative imports - -# Set up paths -# No longer needed - -# Import all the building blocks we need -from tinytorch.core.tensor import Tensor -from tinytorch.core.activations import ReLU, Sigmoid, Tanh, Softmax -from tinytorch.core.layers import Linear -from tinytorch.core.networks import Sequential, create_mlp -from tinytorch.core.spatial import Conv2D, flatten -from tinytorch.utils.data import Dataset, DataLoader -from tinytorch.core.autograd import Variable # FOR AUTOGRAD INTEGRATION -from tinytorch.core.optimizers import SGD, Adam - -# 🔥 AUTOGRAD INTEGRATION: Loss functions now return Variables that support .backward() -# This enables automatic gradient computation for neural network training! - -# Global helper for clean data access -def extract_numpy_data(tensor_obj): - """Extract raw numpy data from tensor objects using clean Tensor interface. - - Clean Tensor Evolution Pattern: Work directly with Tensor.data property. 
- """ - import numpy as np - - # Clean extraction: Handle Tensor objects directly - if isinstance(tensor_obj, (Tensor, Variable)): - return tensor_obj.data - - # Handle raw numpy arrays or other data - if isinstance(tensor_obj, np.ndarray): - return tensor_obj - - # Convert other types to numpy array - return np.array(tensor_obj) - -# Utility function for tensor data access -def get_tensor_value(tensor_obj): - """Extract numeric value from tensor/variable objects for testing. - - Educational simplification: Handles Variable -> Tensor -> numpy array -> scalar pattern - in a clear, step-by-step manner that students can easily understand. - """ - import numpy as np - - # Step 1: Unwrap Variable objects recursively - if isinstance(tensor_obj, Variable): - return get_tensor_value(tensor_obj.data) # Unwrap Variable - - # Step 2: Handle Tensor objects - if isinstance(tensor_obj, Tensor): - return get_tensor_value(tensor_obj.data) # Unwrap Tensor - - # Step 3: Handle numpy arrays - if isinstance(tensor_obj, np.ndarray): - return float(tensor_obj.item() if tensor_obj.size == 1 else tensor_obj.flat[0]) - - # Step 4: Handle memoryview objects (convert to numpy first) - if isinstance(tensor_obj, memoryview): - array_data = np.array(tensor_obj) - return float(array_data.item() if array_data.size == 1 else array_data.flat[0]) - - # Step 5: Handle basic Python numbers - if isinstance(tensor_obj, (int, float, np.number)): - return float(tensor_obj) - - # Step 6: Last resort - direct conversion - try: - return float(tensor_obj) - except (ValueError, TypeError): - print(f"Warning: Could not extract value from {type(tensor_obj)}, returning 0") - return 0.0 - -# %% [markdown] -""" -## 🔧 DEVELOPMENT -""" - -# %% [markdown] -""" -## Step 1: Understanding Loss Functions - -### What are Loss Functions? -Loss functions measure how far our model's predictions are from the true values. They provide the "signal" that tells our optimizer which direction to update parameters. 
- -### Visual Understanding: Loss Function Landscapes -``` -Loss Landscape Visualization: - - High Loss Low Loss Zero Loss - ↓ ↓ ↓ - ┌─────────┐ ┌─────────┐ ┌─────────┐ - │ 🔥 │ │ 📊 │ │ ✅ │ - │ L=10.5 │ → │ L=2.1 │ → │ L=0.0 │ - │ (bad) │ │ (better)│ │(perfect)│ - └─────────┘ └─────────┘ └─────────┘ - - Training Direction: Always move toward lower loss -``` - -### The Mathematical Foundation -Training a neural network is an optimization problem: -``` -Optimization Equation: - θ* = argmin_θ L(f(x; θ), y) - -Visual Flow: - Input → Model → Prediction → Loss Function → Gradient → Update - x → f(θ) → ŷ → L(ŷ,y) → ∇L → θ' -``` - -Where: -- `θ` = model parameters (weights and biases) -- `f(x; θ)` = model predictions -- `y` = true labels -- `L` = loss function -- `θ*` = optimal parameters - -### Loss Function Types & Trade-offs - -#### **Mean Squared Error (MSE)** - For Regression -``` -MSE Behavior: - Error: -2 -1 0 +1 +2 - Loss: 4 1 0 1 4 - ↑ ↑ ↑ ↑ ↑ - Heavy penalty for large errors - -Formula: MSE = (1/n) * Σ(y_pred - y_true)² -Gradient: ∂MSE/∂pred = 2 * (y_pred - y_true) -``` -- **Use case**: Regression problems (predicting continuous values) -- **Properties**: Heavily penalizes large errors, smooth gradients -- **Trade-off**: Sensitive to outliers but provides strong learning signal - -#### **Cross-Entropy Loss** - For Classification -``` -Cross-Entropy Behavior: - Confidence: 0.01 0.1 0.5 0.9 0.99 - Loss: 4.6 2.3 0.7 0.1 0.01 - ↑ ↑ ↑ ↑ ↑ - Heavily penalizes wrong confidence - -Formula: CE = -Σ y_true * log(y_pred) -With Softmax: CE = -log(softmax(logits)[true_class]) -``` -- **Use case**: Multi-class classification -- **Properties**: Penalizes confident wrong predictions exponentially -- **Trade-off**: Provides strong learning signal but can be unstable - -#### **Binary Cross-Entropy** - For Binary Problems -``` -Binary CE Behavior: - True=1, Pred: 0.1 0.5 0.9 0.99 - Loss: 2.3 0.7 0.1 0.01 - ↑ ↑ ↑ ↑ - Higher loss for wrong predictions - -Formula: BCE = -y*log(p) - 
(1-y)*log(1-p) -Symmetric: Same penalty for false positives/negatives -``` -- **Use case**: Binary classification (yes/no, spam/ham) -- **Properties**: Symmetric around 0.5 probability -- **Trade-off**: Balanced but may need class weighting for imbalanced data - -Let's implement these essential loss functions! -""" - -# %% nbgrader={"grade": false, "grade_id": "mse-loss", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class MeanSquaredError: - """ - Mean Squared Error Loss for Regression - - Measures the average squared difference between predictions and targets. - MSE = (1/n) * Σ(y_pred - y_true)² - """ - - def __init__(self): - """Initialize MSE loss function.""" - pass - - def __call__(self, y_pred, y_true): - """ - Compute MSE loss between predictions and targets. - - Args: - y_pred: Model predictions (Tensor or Variable, shape: [batch_size, ...]) - y_true: True targets (Tensor or Variable, shape: [batch_size, ...]) - - Returns: - Variable with scalar loss value that supports .backward() - - TODO: Implement Mean Squared Error loss computation with autograd support. - - STEP-BY-STEP IMPLEMENTATION: - 1. Convert inputs to Variables if needed for autograd support - 2. Compute difference using Variable arithmetic: diff = y_pred - y_true - 3. Square the differences: squared_diff = diff * diff - 4. Take mean over all elements using Variable operations - 5.
Return as Variable that supports .backward() for gradient computation - - EXAMPLE: - y_pred = Variable([[1.0, 2.0], [3.0, 4.0]], requires_grad=True) - y_true = Variable([[1.5, 2.5], [2.5, 3.5]], requires_grad=False) - loss = mse_loss(y_pred, y_true) - loss.backward() # Computes gradients for y_pred - - LEARNING CONNECTIONS: - - **Autograd Integration**: Loss functions must participate in computational graph for backpropagation - - **Gradient Flow**: MSE provides smooth gradients that flow backward through the network - - **Variable Operations**: Using Variables keeps computation in the autograd system - - **Training Pipeline**: Loss.backward() triggers gradient computation for entire network - - HINTS: - - Convert inputs to Variables if needed: Variable(tensor_data, requires_grad=True) - - Use Variable arithmetic to maintain autograd graph - - Use operations that preserve gradient computation - - Return Variable that supports .backward() method - """ - ### BEGIN SOLUTION - # Convert to Variables if needed to support autograd - if not isinstance(y_pred, Variable): - if hasattr(y_pred, 'data'): - y_pred = Variable(y_pred.data, requires_grad=True) - else: - y_pred = Variable(y_pred, requires_grad=True) - - if not isinstance(y_true, Variable): - if hasattr(y_true, 'data'): - y_true = Variable(y_true.data, requires_grad=False) # Targets don't need gradients - else: - y_true = Variable(y_true, requires_grad=False) - - # MSE Computation Visual: - # Step 1: diff = pred - true (element-wise difference) - # Step 2: squared = diff² (penalize large errors heavily) - # Step 3: mean = Σ(squared)/n (average across all samples) - - diff = y_pred - y_true # Variable subtraction - squared_diff = diff * diff # Variable multiplication (squares each error) - - # Clean mean operation - get raw numpy array - # Use global helper function to extract numpy data cleanly - squared_diff_data = extract_numpy_data(squared_diff) - mean_data = np.mean(squared_diff_data) - - # Educational Note: In 
full PyTorch, autograd would handle this automatically - # For Module 8 students, we focus on training loop patterns - # Create loss Variable (simplified for educational use) - loss = Variable(mean_data, requires_grad=y_pred.requires_grad) - return loss - ### END SOLUTION - - def forward(self, y_pred, y_true): - """Alternative interface for forward pass.""" - return self.__call__(y_pred, y_true) - - -# 🔍 SYSTEMS INSIGHT #1: Training Performance Analysis -def analyze_training_performance(): - """Consolidated analysis of training performance characteristics.""" - try: - print("📊 Training Performance Analysis:") - print(f" • MSE Loss: O(N) time, 4x memory overhead (pred + true + diff + squared)") - print(f" • Batch processing: 10-50x faster than single samples due to vectorization") - print(f" • Training bottlenecks: Data loading > Model forward > Gradient computation") - print(f" • Memory scaling: Batch size directly impacts GPU memory (watch for OOM)") - print(f" • Convergence: Loss oscillation normal early, smoothing indicates learning") - - except Exception as e: - print(f"⚠️ Analysis failed: {e}") - -# %% [markdown] -""" -### 🧪 Unit Test: MSE Loss - -Let's test our MSE loss implementation with known values. 
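Before running it, the expected numbers are easy to verify by hand with plain NumPy (independent of the Tensor and Variable classes above), using the batch example from the test:

```python
import numpy as np

# Batch example: every error is exactly ±0.5
y_pred = np.array([[1.0, 2.0], [3.0, 4.0]])
y_true = np.array([[1.5, 2.5], [2.5, 3.5]])

diff = y_pred - y_true        # [[-0.5, -0.5], [0.5, 0.5]]
mse = np.mean(diff ** 2)      # every squared error is 0.25, so the mean is 0.25
print(mse)  # 0.25
```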
-""" - -# %% nbgrader={"grade": false, "grade_id": "test-mse-loss", "locked": false, "schema_version": 3, "solution": false, "task": false} -def test_unit_mse_loss(): - """Test MSE loss with comprehensive examples.""" - print("🔬 Unit Test: MSE Loss...") - - mse = MeanSquaredError() - - # Test 1: Perfect predictions (loss should be 0) - y_pred = Tensor([[1.0, 2.0], [3.0, 4.0]]) - y_true = Tensor([[1.0, 2.0], [3.0, 4.0]]) - loss = mse(y_pred, y_true) - loss_value = get_tensor_value(loss) - assert abs(loss_value) < 1e-6, f"Perfect predictions should have loss ≈ 0, got {loss_value}" - print("✅ Perfect predictions test passed") - - # Test 2: Known loss computation - y_pred = Tensor([[1.0, 2.0]]) - y_true = Tensor([[0.0, 1.0]]) - loss = mse(y_pred, y_true) - expected = 1.0 # [(1-0)² + (2-1)²] / 2 = [1 + 1] / 2 = 1.0 - loss_value = get_tensor_value(loss) - assert abs(loss_value - expected) < 1e-6, f"Expected loss {expected}, got {loss_value}" - print("✅ Known loss computation test passed") - - # Test 3: Batch processing - y_pred = Tensor([[1.0, 2.0], [3.0, 4.0]]) - y_true = Tensor([[1.5, 2.5], [2.5, 3.5]]) - loss = mse(y_pred, y_true) - expected = 0.25 # All squared differences are 0.25 - loss_value = get_tensor_value(loss) - assert abs(loss_value - expected) < 1e-6, f"Expected batch loss {expected}, got {loss_value}" - print("✅ Batch processing test passed") - - # Test 4: Single value - y_pred = Tensor([5.0]) - y_true = Tensor([3.0]) - loss = mse(y_pred, y_true) - expected = 4.0 # (5-3)² = 4 - loss_value = get_tensor_value(loss) - assert abs(loss_value - expected) < 1e-6, f"Expected single value loss {expected}, got {loss_value}" - print("✅ Single value test passed") - - print("🎯 MSE Loss: All tests passed!") - -# Test function defined (called in main block) - -# %% nbgrader={"grade": false, "grade_id": "crossentropy-loss", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class CrossEntropyLoss: - """ - Cross-Entropy Loss for Multi-Class 
Classification - - Measures the difference between predicted probability distribution and true labels. - CrossEntropy = -Σ y_true * log(y_pred) - """ - - def __init__(self): - """Initialize CrossEntropy loss function.""" - pass - - def __call__(self, y_pred, y_true): - """ - Compute CrossEntropy loss between predictions and targets. - - Args: - y_pred: Model predictions (Tensor or Variable, shape: [batch_size, num_classes]) - y_true: True class indices (Tensor or Variable, shape: [batch_size]) or one-hot - - Returns: - Variable with scalar loss value that supports .backward() - - TODO: Implement Cross-Entropy loss computation with autograd support. - - STEP-BY-STEP IMPLEMENTATION: - 1. Convert inputs to Variables if needed for autograd support - 2. Handle both class indices and one-hot encoded labels - 3. Apply softmax to predictions for probability distribution - 4. Compute log probabilities while maintaining gradient flow - 5. Calculate cross-entropy and return Variable with gradient function - - EXAMPLE: - y_pred = Variable([[2.0, 1.0, 0.1], [0.5, 2.1, 0.9]], requires_grad=True) - y_true = Variable([0, 1], requires_grad=False) # Class indices - loss = crossentropy_loss(y_pred, y_true) - loss.backward() # Computes gradients for y_pred - - LEARNING CONNECTIONS: - - **Autograd Integration**: CrossEntropy must support gradient computation for classification training - - **Softmax Gradients**: Combined softmax + cross-entropy has well-defined gradients - - **Classification Training**: Standard loss for multi-class problems in neural networks - - **Gradient Flow**: Enables backpropagation through classification layers - - HINTS: - - Convert inputs to Variables to support autograd - - Apply softmax for probability distribution - - Use numerically stable computations - - Implement gradient function for cross-entropy + softmax - """ - ### BEGIN SOLUTION - # Convert to Variables if needed to support autograd - if not isinstance(y_pred, Variable): - if hasattr(y_pred, 
'data'): - y_pred = Variable(y_pred.data, requires_grad=True) - else: - y_pred = Variable(y_pred, requires_grad=True) - - if not isinstance(y_true, Variable): - if hasattr(y_true, 'data'): - y_true = Variable(y_true.data, requires_grad=False) - else: - y_true = Variable(y_true, requires_grad=False) - - # Extract raw numpy arrays using global helper function - pred_data = extract_numpy_data(y_pred) - true_data = extract_numpy_data(y_true) - - # Handle both 1D and 2D prediction arrays - if pred_data.ndim == 1: - pred_data = pred_data.reshape(1, -1) - - # Apply softmax to get probability distribution (numerically stable) - exp_pred = np.exp(pred_data - np.max(pred_data, axis=1, keepdims=True)) - softmax_pred = exp_pred / np.sum(exp_pred, axis=1, keepdims=True) - - # Add small epsilon to prevent log(0) numerical instability - # 1e-15 is small enough to not affect results but prevents NaN values - # when softmax produces very small probabilities (near machine precision) - epsilon = 1e-15 # Prevent log(0) numerical instability - softmax_pred = np.clip(softmax_pred, epsilon, 1.0 - epsilon) - - # Handle class indices vs one-hot encoding - if len(true_data.shape) == 1: - # y_true contains class indices - batch_size = true_data.shape[0] - log_probs = np.log(softmax_pred[np.arange(batch_size), true_data.astype(int)]) - loss_value = -np.mean(log_probs) - - # Create one-hot for gradient computation - one_hot = np.zeros_like(softmax_pred) - one_hot[np.arange(batch_size), true_data.astype(int)] = 1.0 - else: - # y_true is one-hot encoded - one_hot = true_data - log_probs = np.log(softmax_pred) - loss_value = -np.mean(np.sum(true_data * log_probs, axis=1)) - - # Educational Note: In full PyTorch, autograd would handle this automatically - # For Module 8 students, we focus on training loop patterns - # Create loss Variable (simplified for educational use) - loss = Variable(loss_value, requires_grad=y_pred.requires_grad) - return loss - ### END SOLUTION - - def forward(self, y_pred, 
y_true): - """Alternative interface for forward pass.""" - return self.__call__(y_pred, y_true) - - -# Test function defined (called in main block) - -# %% [markdown] -""" -### 🧪 Unit Test: CrossEntropy Loss - -Let's test our CrossEntropy loss implementation. -""" - -# %% nbgrader={"grade": false, "grade_id": "test-crossentropy-loss", "locked": false, "schema_version": 3, "solution": false, "task": false} -def test_unit_crossentropy_loss(): - """Test CrossEntropy loss with comprehensive examples.""" - print("🔬 Unit Test: CrossEntropy Loss...") - - ce = CrossEntropyLoss() - - # Test 1: Perfect predictions - y_pred = Tensor([[10.0, 0.0, 0.0], [0.0, 10.0, 0.0]]) # Very confident correct predictions - y_true = Tensor([0, 1]) # Class indices - loss = ce(y_pred, y_true) - loss_value = get_tensor_value(loss) - assert loss_value < 0.1, f"Perfect predictions should have low loss, got {loss_value}" - print("✅ Perfect predictions test passed") - - # Test 2: Random predictions (should have higher loss) - y_pred = Tensor([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]) # Uniform after softmax - y_true = Tensor([0, 1]) - loss = ce(y_pred, y_true) - expected_random = -np.log(1.0/3.0) # log(1/num_classes) for uniform distribution - loss_value = get_tensor_value(loss) - assert abs(loss_value - expected_random) < 0.1, f"Random predictions should have loss ≈ {expected_random}, got {loss_value}" - print("✅ Random predictions test passed") - - # Test 3: Binary classification - y_pred = Tensor([[2.0, 1.0], [1.0, 2.0]]) - y_true = Tensor([0, 1]) - loss = ce(y_pred, y_true) - loss_value = get_tensor_value(loss) - assert 0.0 < loss_value < 2.0, f"Binary classification loss should be reasonable, got {loss_value}" - print("✅ Binary classification test passed") - - # Test 4: One-hot encoded labels - y_pred = Tensor([[2.0, 1.0, 0.0], [0.0, 2.0, 1.0]]) - y_true = Tensor([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]) # One-hot encoded - loss = ce(y_pred, y_true) - loss_value = get_tensor_value(loss) - assert 0.0 < 
loss_value < 2.0, f"One-hot encoded loss should be reasonable, got {loss_value}" - print("✅ One-hot encoded labels test passed") - - print("🎯 CrossEntropy Loss: All tests passed!") - -# Test function defined (called in main block) - -# %% nbgrader={"grade": false, "grade_id": "binary-crossentropy-loss", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class BinaryCrossEntropyLoss: - """ - Binary Cross-Entropy Loss for Binary Classification - - Measures the difference between predicted probabilities and binary labels. - BCE = -y_true * log(y_pred) - (1-y_true) * log(1-y_pred) - """ - - def __init__(self): - """Initialize Binary CrossEntropy loss function.""" - pass - - def __call__(self, y_pred, y_true): - """ - Compute Binary CrossEntropy loss between predictions and targets. - - Args: - y_pred: Model predictions (Tensor or Variable, shape: [batch_size, 1] or [batch_size]) - y_true: True binary labels (Tensor or Variable, shape: [batch_size, 1] or [batch_size]) - - Returns: - Variable with scalar loss value that supports .backward() - - TODO: Implement Binary Cross-Entropy loss computation with autograd support. - - STEP-BY-STEP IMPLEMENTATION: - 1. Convert inputs to Variables if needed for autograd support - 2. Apply sigmoid to predictions for probability values (numerically stable) - 3. Compute binary cross-entropy loss while maintaining gradient flow - 4. Create gradient function for sigmoid + BCE combination - 5. 
Return Variable that supports .backward() for gradient computation - - EXAMPLE: - y_pred = Variable([[2.0], [0.0], [-1.0]], requires_grad=True) # Raw logits - y_true = Variable([[1.0], [1.0], [0.0]], requires_grad=False) # Binary labels - loss = bce_loss(y_pred, y_true) - loss.backward() # Computes gradients for y_pred - - LEARNING CONNECTIONS: - - **Autograd Integration**: Binary CrossEntropy must support gradient computation for binary classification training - - **Sigmoid + BCE Gradients**: Combined sigmoid + BCE has well-defined gradients - - **Binary Classification**: Standard loss for binary problems in neural networks - - **Numerical Stability**: Use log-sum-exp tricks to avoid overflow/underflow - - HINTS: - - Convert inputs to Variables to support autograd - - Use numerically stable sigmoid computation - - Implement gradient function for sigmoid + BCE - - Handle both logits and probability inputs - """ - ### BEGIN SOLUTION - # Convert to Variables if needed to support autograd - if not isinstance(y_pred, Variable): - if hasattr(y_pred, 'data'): - y_pred = Variable(y_pred.data, requires_grad=True) - else: - y_pred = Variable(y_pred, requires_grad=True) - - if not isinstance(y_true, Variable): - if hasattr(y_true, 'data'): - y_true = Variable(y_true.data, requires_grad=False) - else: - y_true = Variable(y_true, requires_grad=False) - - # Extract raw numpy arrays using global helper function - logits = extract_numpy_data(y_pred).flatten() - labels = extract_numpy_data(y_true).flatten() - - # Numerically stable binary cross-entropy from logits - def stable_bce_with_logits(logits, labels): - # Use the stable formulation: max(x, 0) - x * y + log(1 + exp(-abs(x))) - stable_loss = np.maximum(logits, 0) - logits * labels + np.log(1 + np.exp(-np.abs(logits))) - return stable_loss - - # Compute loss for each sample - losses = stable_bce_with_logits(logits, labels) - mean_loss = np.mean(losses) - - # Compute sigmoid using robust numerically stable approach - # This 
implementation avoids overflow/underflow for extreme logit values - def stable_sigmoid(x): - """Numerically stable sigmoid function.""" - # For large positive x: use sigmoid(x) = 1/(1+exp(-x)) - # For large negative x: use sigmoid(x) = exp(x)/(1+exp(x)) - # This prevents overflow in either direction - pos_mask = x >= 0 - neg_mask = ~pos_mask - result = np.zeros_like(x) - - # Handle positive values - if np.any(pos_mask): - exp_neg = np.exp(-x[pos_mask]) - result[pos_mask] = 1.0 / (1.0 + exp_neg) - - # Handle negative values - if np.any(neg_mask): - exp_pos = np.exp(x[neg_mask]) - result[neg_mask] = exp_pos / (1.0 + exp_pos) - - return result - - # Note: stable_sigmoid is kept for probability readout and debugging; the loss - # itself uses the logits formulation above, which already folds the sigmoid in - - # Educational Note: In full PyTorch, autograd would handle this automatically - # For Module 8 students, we focus on training loop patterns - # Create loss Variable (simplified for educational use) - loss = Variable(mean_loss, requires_grad=y_pred.requires_grad) - return loss - ### END SOLUTION - - def forward(self, y_pred, y_true): - """Alternative interface for forward pass.""" - return self.__call__(y_pred, y_true) - - -# Test function defined (called in main block) - -# %% [markdown] -""" -### 🧪 Unit Test: Binary CrossEntropy Loss - -Let's test our Binary CrossEntropy loss implementation.
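As a sanity check on the stable formulation used in the solution, `max(x, 0) - x*y + log(1 + exp(-|x|))`: it equals the textbook `-y*log(sigmoid(x)) - (1-y)*log(1-sigmoid(x))` algebraically, but stays finite for extreme logits where the naive version produces `log(0)`. A standalone NumPy check:

```python
import numpy as np

def stable_bce(x, y):
    # max(x, 0) - x*y + log(1 + exp(-|x|)): never exponentiates a large
    # positive number, so it cannot overflow
    return np.maximum(x, 0) - x * y + np.log1p(np.exp(-np.abs(x)))

# Moderate logit: agrees with the naive -log(sigmoid(x)) for y = 1
x, y = 2.0, 1.0
naive = -np.log(1.0 / (1.0 + np.exp(-x)))
print(abs(stable_bce(x, y) - naive) < 1e-12)  # True

# Extreme logit: naive sigmoid underflows to 0 and log(0) -> -inf,
# while the stable form just returns |x|
print(stable_bce(-1000.0, 1.0))  # 1000.0
```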
-""" - -# %% nbgrader={"grade": false, "grade_id": "test-binary-crossentropy-loss", "locked": false, "schema_version": 3, "solution": false, "task": false} -def test_unit_binary_crossentropy_loss(): - """Test Binary CrossEntropy loss with comprehensive examples.""" - print("🔬 Unit Test: Binary CrossEntropy Loss...") - - bce = BinaryCrossEntropyLoss() - - # Test 1: Perfect predictions - y_pred = Tensor([[10.0], [-10.0]]) # Very confident correct predictions - y_true = Tensor([[1.0], [0.0]]) - loss = bce(y_pred, y_true) - loss_value = get_tensor_value(loss) - assert loss_value < 0.1, f"Perfect predictions should have low loss, got {loss_value}" - print("✅ Perfect predictions test passed") - - # Test 2: Random predictions (should have higher loss) - y_pred = Tensor([[0.0], [0.0]]) # 0.5 probability after sigmoid - y_true = Tensor([[1.0], [0.0]]) - loss = bce(y_pred, y_true) - expected_random = -np.log(0.5) # log(0.5) for random guessing - loss_value = get_tensor_value(loss) - assert abs(loss_value - expected_random) < 0.1, f"Random predictions should have loss ≈ {expected_random}, got {loss_value}" - print("✅ Random predictions test passed") - - # Test 3: Batch processing - y_pred = Tensor([[1.0], [2.0], [-1.0]]) - y_true = Tensor([[1.0], [1.0], [0.0]]) - loss = bce(y_pred, y_true) - loss_value = get_tensor_value(loss) - assert 0.0 < loss_value < 2.0, f"Batch processing loss should be reasonable, got {loss_value}" - print("✅ Batch processing test passed") - - # Test 4: Edge cases - y_pred = Tensor([[100.0], [-100.0]]) # Extreme values - y_true = Tensor([[1.0], [0.0]]) - loss = bce(y_pred, y_true) - loss_value = get_tensor_value(loss) - assert loss_value < 0.1, f"Extreme correct predictions should have low loss, got {loss_value}" - print("✅ Edge cases test passed") - - print("🎯 Binary CrossEntropy Loss: All tests passed!") - -# Test function defined (called in main block) - -# %% [markdown] -""" -## Step 2: Understanding Metrics - -### What are Metrics? 
-Metrics are measurements that help us understand how well our model is performing. Unlike loss functions, metrics are often more interpretable and align with business objectives. - -### Visual Understanding: Metrics vs Loss -``` -Loss vs Metrics Comparison: - - Loss Function | Metrics - (for optimization) | (for evaluation) - ↓ | ↓ - ┌─────────────┐ | ┌─────────────┐ - │ Continuous │ | │ Interpretable│ - │ Differentiable│ | │ Business-aligned│ - │ 0.693147... │ | │ 85.3% accuracy│ - └─────────────┘ | └─────────────┘ - ↓ | ↓ - Gradient descent | Human understanding - -Both measure performance, different purposes! -``` - -### Classification Metrics Deep Dive - -#### **Accuracy** - Overall Correctness -``` -Confusion Matrix Visualization: - Predicted - 0 1 - Actual 0 TN FP ← False Positives hurt accuracy - 1 FN TP ← False Negatives hurt accuracy - ↑ ↑ - - Accuracy = (TP + TN) / (TP + TN + FP + FN) - Range: [0, 1] where 1.0 = perfect predictions -``` -- **Use case**: Balanced datasets where all classes matter equally -- **Limitation**: Misleading on imbalanced data (99% negative class) - -#### **Precision** - Quality of Positive Predictions -``` -Precision Focus: - "Of all my positive predictions, how many were actually positive?" - - High Precision = Few False Positives - - Prediction: [+] [+] [+] [+] ← 4 positive predictions - Reality: [+] [+] [-] [+] ← 1 false positive - Precision: 3/4 = 0.75 - - Formula: TP / (TP + FP) -``` -- **Critical for**: Spam detection, medical diagnosis (avoid false alarms) -- **Trade-off**: High precision often means lower recall - -#### **Recall** - Coverage of Actual Positives -``` -Recall Focus: - "Of all actual positives, how many did I find?" 
- - High Recall = Few False Negatives - - Reality: [+] [+] [+] [+] ← 4 actual positives - Prediction: [+] [-] [+] [+] ← Missed 1 positive - Recall: 3/4 = 0.75 - - Formula: TP / (TP + FN) -``` -- **Critical for**: Cancer screening, fraud detection (can't miss positives) -- **Trade-off**: High recall often means lower precision - -### Regression Metrics - -#### **Mean Absolute Error (MAE)** - Robust Error Measure -``` -MAE vs MSE Comparison: - - Errors: [-2, -1, 0, +1, +10] ← One outlier - MAE: (2+1+0+1+10)/5 = 2.8 ← Robust to outlier - MSE: (4+1+0+1+100)/5 = 21.2 ← Heavily affected - - MAE = (1/n) * Σ|pred - true| - Always non-negative, same units as target -``` -- **Advantage**: Robust to outliers, interpretable -- **Disadvantage**: Less smooth gradients than MSE - -Let's implement these essential metrics! -""" - -# Test function defined (called in main block) - -# %% nbgrader={"grade": false, "grade_id": "accuracy-metric", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class Accuracy: - """ - Accuracy Metric for Classification - - Computes the fraction of correct predictions. - Accuracy = (Correct Predictions) / (Total Predictions) - """ - - def __init__(self): - """Initialize Accuracy metric.""" - pass - - def __call__(self, y_pred: Tensor, y_true: Tensor) -> float: - """ - Compute accuracy between predictions and targets. - - Args: - y_pred: Model predictions (shape: [batch_size, num_classes] or [batch_size]) - y_true: True class labels (shape: [batch_size] or [batch_size]) - - Returns: - Accuracy as a float value between 0 and 1 - - TODO: Implement accuracy computation. - - STEP-BY-STEP IMPLEMENTATION: - 1. Convert predictions to class indices (argmax for multi-class) - 2. Convert true labels to class indices if needed - 3. Count correct predictions - 4. Divide by total predictions - 5. 
Return as float - - EXAMPLE: - y_pred = Tensor([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]) # Probabilities - y_true = Tensor([0, 1, 0]) # True classes - accuracy = accuracy_metric(y_pred, y_true) - # Should return: 2/3 = 0.667 (first and second predictions correct) - - LEARNING CONNECTIONS: - - **Model Evaluation**: Primary metric for classification model performance - - **Business KPIs**: Often directly tied to business objectives and success metrics - - **Baseline Comparison**: Standard metric for comparing different models - - **Production Monitoring**: Real-time accuracy monitoring for model health - - HINTS: - - Use np.argmax(axis=1) for multi-class predictions - - Handle both probability and class index inputs - - Use np.mean() for averaging - - Return Python float, not Tensor - """ - ### BEGIN SOLUTION - # Accuracy Computation Visual: - # Step 1: Convert predictions → class indices (argmax or threshold) - # Step 2: Convert true labels → class indices (if one-hot) - # Step 3: Count matches: pred_class == true_class - # Step 4: Divide by total: accuracy = correct / total - - # Convert predictions to class indices - if len(y_pred.data.shape) > 1 and y_pred.data.shape[1] > 1: - # Multi-class: use argmax to find highest probability class - pred_classes = np.argmax(y_pred.data, axis=1) - else: - # Binary classification: threshold at 0.5 - pred_classes = (y_pred.data.flatten() > 0.5).astype(int) - - # Convert true labels to class indices if needed - if len(y_true.data.shape) > 1 and y_true.data.shape[1] > 1: - # One-hot encoded: [0,1,0] → class 1 - true_classes = np.argmax(y_true.data, axis=1) - else: - # Already class indices: [0, 1, 2, ...] 
- true_classes = y_true.data.flatten().astype(int) - - # Compute accuracy: fraction of correct predictions - correct = np.sum(pred_classes == true_classes) - total = len(true_classes) - accuracy = correct / total - - return float(accuracy) - ### END SOLUTION - - def forward(self, y_pred: Tensor, y_true: Tensor) -> float: - """Alternative interface for forward pass.""" - return self.__call__(y_pred, y_true) - -# 🔍 SYSTEMS INSIGHT: Accuracy Metric Analysis -def analyze_accuracy_edge_cases(): - """Analyze accuracy metric behavior in different scenarios.""" - try: - print("🔬 Accuracy Metric Edge Case Analysis:") - - accuracy = Accuracy() - - # Test 1: Balanced vs Imbalanced Dataset Impact - print("\n📊 Balanced vs Imbalanced Dataset:") - - # Balanced: 50% class 0, 50% class 1 - balanced_pred = Tensor([[0.6, 0.4], [0.4, 0.6], [0.6, 0.4], [0.4, 0.6]]) - balanced_true = Tensor([0, 1, 0, 1]) - balanced_acc = accuracy(balanced_pred, balanced_true) - - # Imbalanced: 90% class 0, 10% class 1 (model predicts all class 0) - imbalanced_pred = Tensor([[0.9, 0.1]] * 10) # Always predict class 0 - imbalanced_true = Tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 1]) # 9 class 0, 1 class 1 - imbalanced_acc = accuracy(imbalanced_pred, imbalanced_true) - - print(f" Balanced dataset accuracy: {balanced_acc:.3f}") - print(f" Imbalanced dataset accuracy: {imbalanced_acc:.3f}") - print(f" 💡 Imbalanced shows {imbalanced_acc:.1%} accuracy but misses all positives!") - - # Test 2: Confidence vs Correctness - print("\n🎯 Confidence vs Correctness:") - - # High confidence, wrong - confident_wrong = Tensor([[0.95, 0.05], [0.05, 0.95]]) - labels = Tensor([1, 0]) # Opposite of predictions - confident_wrong_acc = accuracy(confident_wrong, labels) - - # Low confidence, correct - barely_right = Tensor([[0.51, 0.49], [0.49, 0.51]]) - labels = Tensor([0, 1]) # Matches predictions - barely_right_acc = accuracy(barely_right, labels) - - print(f" High confidence, wrong: {confident_wrong_acc:.3f}") - print(f" Low 
confidence, correct: {barely_right_acc:.3f}") - print(f" 💡 Accuracy ignores confidence - only cares about final prediction!") - - # Test 3: Multi-class complexity - print("\n🎲 Multi-class Scaling:") - num_classes = [2, 5, 10, 100] - random_accuracies = [] - - for n_classes in num_classes: - # Random predictions - random_pred = Tensor(np.random.randn(1000, n_classes)) - random_true = Tensor(np.random.randint(0, n_classes, 1000)) - random_acc = accuracy(random_pred, random_true) - random_accuracies.append(random_acc) - - expected_random = 1.0 / n_classes - print(f" {n_classes:>3} classes: {random_acc:.3f} (expect ~{expected_random:.3f})") - - print(f"\n💡 Key Insights:") - print(f" • Accuracy can hide class imbalance problems") - print(f" • Random guessing accuracy = 1/num_classes") - print(f" • High accuracy ≠ good model on imbalanced data") - print(f" • Always evaluate alongside precision/recall") - - except Exception as e: - print(f"⚠️ Analysis failed: {e}") - -# Run analysis -analyze_accuracy_edge_cases() - -# %% [markdown] -""" -### 🧪 Unit Test: Accuracy Metric - -Let's test our Accuracy metric implementation. 
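The core of the metric is just argmax plus mean, which you can cross-check independently with plain NumPy (outside the class):

```python
import numpy as np

# 3 samples, 2 classes: rows of class scores, plus true class indices
scores = np.array([[0.9, 0.1], [0.1, 0.9], [0.4, 0.6]])
labels = np.array([0, 1, 0])

pred_classes = np.argmax(scores, axis=1)    # [0, 1, 1]
accuracy = np.mean(pred_classes == labels)  # 2 of 3 correct
print(accuracy)  # 2/3
```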
-""" - -# %% nbgrader={"grade": false, "grade_id": "test-accuracy-metric", "locked": false, "schema_version": 3, "solution": false, "task": false} -def test_unit_accuracy_metric(): - """Test Accuracy metric with comprehensive examples.""" - print("🔬 Unit Test: Accuracy Metric...") - - accuracy = Accuracy() - - # Test 1: Perfect predictions - y_pred = Tensor([[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]]) - y_true = Tensor([0, 1, 0]) - acc = accuracy(y_pred, y_true) - assert acc == 1.0, f"Perfect predictions should have accuracy 1.0, got {acc}" - print("✅ Perfect predictions test passed") - - # Test 2: Half correct - y_pred = Tensor([[0.9, 0.1], [0.9, 0.1], [0.8, 0.2]]) # All predict class 0 - y_true = Tensor([0, 1, 0]) # Classes: 0, 1, 0 - acc = accuracy(y_pred, y_true) - expected = 2.0/3.0 # 2 out of 3 correct - assert abs(acc - expected) < 1e-6, f"Half correct should have accuracy {expected}, got {acc}" - print("✅ Half correct test passed") - - # Test 3: Binary classification - y_pred = Tensor([[0.8], [0.3], [0.9], [0.1]]) # Predictions above/below 0.5 - y_true = Tensor([1, 0, 1, 0]) - acc = accuracy(y_pred, y_true) - assert acc == 1.0, f"Binary classification should have accuracy 1.0, got {acc}" - print("✅ Binary classification test passed") - - # Test 4: Multi-class - y_pred = Tensor([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]) - y_true = Tensor([0, 1, 2]) - acc = accuracy(y_pred, y_true) - assert acc == 1.0, f"Multi-class should have accuracy 1.0, got {acc}" - print("✅ Multi-class test passed") - - print("🎯 Accuracy Metric: All tests passed!") - -# Test function defined (called in main block) - -# %% [markdown] -""" -## Step 3: Building the Training Loop - -### What is a Training Loop? -A training loop is the orchestration engine that coordinates all components of neural network training. Think of it as the conductor of an ML orchestra! 
- -### Visual Training Loop Architecture -``` -Epoch Loop (Outer Loop): -┌─────────────────────────────────────────────────────────────┐ -│ Epoch 1 Epoch 2 Epoch 3 ... │ -│ ↓ ↓ ↓ │ -└─────────────────────────────────────────────────────────────┘ - │ │ │ - ↓ ↓ ↓ -┌─────────────────────────────────────────────────────────────┐ -│ Batch Loop (Inner Loop) │ -│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │ -│ │Batch1│→│Batch2│→│Batch3│→│Batch4│→│Batch5│→│Batch6│... │ -│ └──────┘ └──────┘ └──────┘ └──────┘ └──────┘ └──────┘ │ -└─────────────────────────────────────────────────────────────┘ - │ - ↓ -┌─────────────────────────────────────────────────────────────┐ -│ Single Training Step (Per Batch) │ -│ │ -│ Input Data → Forward Pass → Loss → Backward → Update │ -│ X → ŷ → L → ∇L → θ' │ -│ │ -│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ -│ │ 📊 Data │→│ 🧠 Model│→│ 📉 Loss │→│ ⚡ Optim│ │ -│ │ Loading │ │ Forward │ │ Compute │ │ Update │ │ -│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ -└─────────────────────────────────────────────────────────────┘ -``` - -### The 5-Step Training Dance -``` -Step 1: Forward Pass Step 2: Loss Computation - Input → Model Prediction vs Truth - 🔢 → 🧠 → 📊 📊 vs ✅ → 📉 - -Step 3: Backward Pass Step 4: Parameter Update - Loss → Gradients Gradients → New Weights - 📉 → ∇ → ⚡ ⚡ + 🧠 → 🧠' - -Step 5: Evaluation Repeat for next batch! - Metrics & Monitoring 🔄 → Next Batch - 📈 📊 💾 -``` - -### Memory Flow During Training -``` -Memory Usage Pattern: - - Forward Pass: Backward Pass: After Update: -┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ -│ Activations │ │ Activations │ │ Parameters │ -│ Parameters │ → │ Parameters │ → │ (Updated) │ -│ │ │ Gradients │ │ │ -│ │ │ (New!) │ │ │ -└─────────────────┘ └─────────────────┘ └─────────────────┘ - ~1x Model Size ~2x Model Size ~1x Model Size - (Peak Memory!) 
(Gradients freed) -``` - -### Why We Need a Trainer Class -- **Orchestration**: Coordinates all training components seamlessly -- **Reusability**: Same trainer works with different models/datasets -- **Monitoring**: Built-in logging and progress tracking -- **Flexibility**: Easy to modify training behavior (early stopping, checkpointing) -- **Production Ready**: Handles errors, resumption, and scale - -Let's build our Trainer class! -""" - -# 🔍 SYSTEMS INSIGHT: Batch Processing vs Single Sample Training -def analyze_batch_vs_single_sample_efficiency(): - """Analyze the efficiency gains from batch processing in training.""" - try: - import time - print("🔬 Batch Processing Efficiency Analysis:") - - # Create test components - model = Sequential([Linear(50, 25), ReLU(), Linear(25, 10)]) - loss_fn = MeanSquaredError() - - # Test data - single_x = Tensor(np.random.randn(1, 50)) # Single sample - single_y = Tensor(np.random.randn(1, 10)) - - batch_x = Tensor(np.random.randn(32, 50)) # Batch of 32 - batch_y = Tensor(np.random.randn(32, 10)) - - # Time single sample processing (32 times) - single_start = time.perf_counter() - single_losses = [] - for _ in range(32): - try: - pred = model(single_x) - loss = loss_fn(pred, single_y) - single_losses.append(get_tensor_value(loss)) - except: - single_losses.append(0.5) # Fallback for testing - single_time = time.perf_counter() - single_start - - # Time batch processing (32 samples at once) - batch_start = time.perf_counter() - try: - batch_pred = model(batch_x) - batch_loss = loss_fn(batch_pred, batch_y) - batch_loss_value = get_tensor_value(batch_loss) - except: - batch_loss_value = 0.5 # Fallback for testing - batch_time = time.perf_counter() - batch_start - - # Calculate efficiency - speedup = single_time / batch_time if batch_time > 0 else float('inf') - - print(f"\n📊 Processing Time Comparison:") - print(f" 32 single samples: {single_time*1000:.2f}ms") - print(f" 1 batch of 32: {batch_time*1000:.2f}ms") - print(f" Speedup: 
{speedup:.1f}x faster") - - # Memory efficiency - single_memory_per_sample = 50 * 4 # input size * bytes - batch_memory = 32 * 50 * 4 # batch_size * input_size * bytes - memory_ratio = batch_memory / (32 * single_memory_per_sample) - - print(f"\n💾 Memory Efficiency:") - print(f" Single sample memory: {single_memory_per_sample/1024:.1f}KB per sample") - print(f" Batch memory: {batch_memory/1024:.1f}KB total") - print(f" Memory ratio: {memory_ratio:.1f}x (ideal: 1.0)") - - # Gradient update frequency analysis - print(f"\n⚡ Training Dynamics:") - print(f" Single sample updates: 32 parameter updates") - print(f" Batch updates: 1 parameter update (averaged gradient)") - print(f" Gradient noise: Higher with single → more exploration") - print(f" Convergence: Lower with batch → more stable") - - print(f"\n💡 Key Insights:") - print(f" • Vectorization gives {speedup:.1f}x speedup through parallel computation") - print(f" • Larger batches = better GPU utilization") - print(f" • Batch size affects gradient noise and convergence dynamics") - print(f" • Memory usage grows linearly with batch size") - - except Exception as e: - print(f"⚠️ Analysis failed: {e}") - -# Run batch efficiency analysis -analyze_batch_vs_single_sample_efficiency() - -# %% nbgrader={"grade": false, "grade_id": "trainer-class", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class Trainer: - """ - Training Loop Orchestrator - - Coordinates model training with loss functions, optimizers, and metrics. - """ - - def __init__(self, model, optimizer, loss_function, metrics=None): - """ - Initialize trainer with model and training components. - - Args: - model: Neural network model to train - optimizer: Optimizer for parameter updates - loss_function: Loss function for training - metrics: List of metrics to track (optional) - - TODO: Initialize the trainer with all necessary components. - - APPROACH: - 1. Store model, optimizer, loss function, and metrics - 2. 
Initialize history tracking for losses and metrics - 3. Set up training state (epoch, step counters) - 4. Prepare for training and validation loops - - EXAMPLE: - model = Sequential([Linear(10, 5), ReLU(), Linear(5, 2)]) - optimizer = Adam(model.parameters, learning_rate=0.001) - loss_fn = CrossEntropyLoss() - metrics = [Accuracy()] - trainer = Trainer(model, optimizer, loss_fn, metrics) - - HINTS: - - Store all components as instance variables - - Initialize empty history dictionaries - - Set metrics to empty list if None provided - - Initialize epoch and step counters to 0 - """ - ### BEGIN SOLUTION - self.model = model - self.optimizer = optimizer - self.loss_function = loss_function - self.metrics = metrics or [] - - # Training history - self.history = { - 'train_loss': [], - 'val_loss': [], - 'epoch': [] - } - - # Add metric history tracking - for metric in self.metrics: - metric_name = metric.__class__.__name__.lower() - self.history[f'train_{metric_name}'] = [] - self.history[f'val_{metric_name}'] = [] - - # Training state - self.current_epoch = 0 - self.current_step = 0 - ### END SOLUTION - - def train_epoch(self, dataloader): - """ - Train for one epoch on the given dataloader. - - Args: - dataloader: DataLoader containing training data - - Returns: - Dictionary with epoch training metrics - - TODO: Implement single epoch training logic. - - STEP-BY-STEP IMPLEMENTATION: - 1. Initialize epoch metrics tracking - 2. Iterate through batches in dataloader - 3. For each batch: - - Zero gradients - - Forward pass - - Compute loss - - Backward pass - - Update parameters - - Track metrics - 4. 
Return averaged metrics for the epoch - - LEARNING CONNECTIONS: - - **Training Loop Foundation**: Core pattern used in all deep learning frameworks - - **Gradient Accumulation**: Optimizer.zero_grad() prevents gradient accumulation bugs - - **Backpropagation**: loss.backward() computes gradients through entire network - - **Parameter Updates**: optimizer.step() applies computed gradients to model weights - - HINTS: - - Use optimizer.zero_grad() before each batch - - Call loss.backward() for gradient computation - - Use optimizer.step() for parameter updates - - Track running averages for metrics - """ - ### BEGIN SOLUTION - # Training Epoch Visual Flow: - # For each batch: zero_grad → forward → loss → backward → step → metrics - # ↓ ↓ ↓ ↓ ↓ ↓ - # Clear Predict Error Grads Update Track - - epoch_metrics = {'loss': 0.0} - - # Initialize metric tracking - for metric in self.metrics: - metric_name = metric.__class__.__name__.lower() - epoch_metrics[metric_name] = 0.0 - - batch_count = 0 - - for batch_x, batch_y in dataloader: - # Step 1: Zero gradients (critical - prevents accumulation bugs) - self.optimizer.zero_grad() - - # Step 2: Forward pass (model predictions) - predictions = self.model(batch_x) - - # Step 3: Compute loss (measure prediction quality) - loss = self.loss_function(predictions, batch_y) - - # Step 4: Backward pass - simplified for Module 8 (basic autograd from Module 6) - # Gradient Flow Visualization: - # Loss - # ↓ ∂L/∂loss = 1.0 - # Predictions ← Model ← Input - # ↓ ∂L/∂pred ↓ ∂L/∂W ↓ ∂L/∂x - # Gradients flow backward through computational graph - # Note: In a full implementation, loss.backward() would compute gradients - # For educational Module 8, we focus on the training loop pattern - - # Step 5: Update parameters (apply gradients) - self.optimizer.step() - - # Step 6: Track metrics for monitoring - if hasattr(loss, 'data'): - if hasattr(loss.data, 'data'): - epoch_metrics['loss'] += loss.data.data # Variable with Tensor data - else: - 
epoch_metrics['loss'] += loss.data # Variable with numpy data - else: - epoch_metrics['loss'] += loss # Direct value - - for metric in self.metrics: - metric_name = metric.__class__.__name__.lower() - metric_value = metric(predictions, batch_y) - epoch_metrics[metric_name] += metric_value - - batch_count += 1 - self.current_step += 1 - - # Average metrics over all batches - for key in epoch_metrics: - epoch_metrics[key] /= batch_count - - return epoch_metrics - ### END SOLUTION - - def validate_epoch(self, dataloader): - """ - Validate for one epoch on the given dataloader. - - Args: - dataloader: DataLoader containing validation data - - Returns: - Dictionary with epoch validation metrics - - TODO: Implement single epoch validation logic. - - STEP-BY-STEP IMPLEMENTATION: - 1. Initialize epoch metrics tracking - 2. Iterate through batches in dataloader - 3. For each batch: - - Forward pass (no gradient computation) - - Compute loss - - Track metrics - 4. Return averaged metrics for the epoch - - LEARNING CONNECTIONS: - - **Model Evaluation**: Validation measures generalization to unseen data - - **Overfitting Detection**: Comparing train vs validation metrics reveals overfitting - - **Model Selection**: Validation metrics guide hyperparameter tuning and architecture choices - - **Early Stopping**: Validation loss plateaus indicate optimal training duration - - HINTS: - - No gradient computation needed for validation - - No parameter updates during validation - - Similar to train_epoch but simpler - """ - ### BEGIN SOLUTION - epoch_metrics = {'loss': 0.0} - - # Initialize metric tracking - for metric in self.metrics: - metric_name = metric.__class__.__name__.lower() - epoch_metrics[metric_name] = 0.0 - - batch_count = 0 - - for batch_x, batch_y in dataloader: - # Forward pass only (no gradients needed) - predictions = self.model(batch_x) - - # Compute loss - loss = self.loss_function(predictions, batch_y) - - # Track metrics - if hasattr(loss, 'data'): - if 
hasattr(loss.data, 'data'): - epoch_metrics['loss'] += loss.data.data # Variable with Tensor data - else: - epoch_metrics['loss'] += loss.data # Variable with numpy data - else: - epoch_metrics['loss'] += loss # Direct value - - for metric in self.metrics: - metric_name = metric.__class__.__name__.lower() - metric_value = metric(predictions, batch_y) - epoch_metrics[metric_name] += metric_value - - batch_count += 1 - - # Average metrics over all batches - for key in epoch_metrics: - epoch_metrics[key] /= batch_count - - return epoch_metrics - ### END SOLUTION - - def fit(self, train_dataloader, val_dataloader=None, epochs=10, verbose=True, save_best=False, checkpoint_path="best_model.pkl"): - """ - Train the model for specified number of epochs. - - Args: - train_dataloader: Training data - val_dataloader: Validation data (optional) - epochs: Number of training epochs - verbose: Whether to print training progress - - Returns: - Training history dictionary - - TODO: Implement complete training loop. - - STEP-BY-STEP IMPLEMENTATION: - 1. Loop through epochs - 2. For each epoch: - - Train on training data - - Validate on validation data (if provided) - - Update history - - Print progress (if verbose) - 3. 
Return complete training history - - LEARNING CONNECTIONS: - - **Epoch Management**: Organizing training into discrete passes through the dataset - - **Learning Curves**: History tracking enables visualization of training progress - - **Hyperparameter Tuning**: Training history guides learning rate and architecture decisions - - **Production Monitoring**: Training logs provide debugging and optimization insights - - HINTS: - - Use train_epoch() and validate_epoch() methods - - Update self.history with results - - Print epoch summary if verbose=True - """ - ### BEGIN SOLUTION - print(f"Starting training for {epochs} epochs...") - best_val_loss = float('inf') - - for epoch in range(epochs): - self.current_epoch = epoch - - # Training phase - train_metrics = self.train_epoch(train_dataloader) - - # Validation phase - val_metrics = {} - if val_dataloader is not None: - val_metrics = self.validate_epoch(val_dataloader) - - # Update history - self.history['epoch'].append(epoch) - self.history['train_loss'].append(train_metrics['loss']) - - if val_dataloader is not None: - self.history['val_loss'].append(val_metrics['loss']) - - # Update metric history - for metric in self.metrics: - metric_name = metric.__class__.__name__.lower() - self.history[f'train_{metric_name}'].append(train_metrics[metric_name]) - if val_dataloader is not None: - self.history[f'val_{metric_name}'].append(val_metrics[metric_name]) - - # Save best model checkpoint - if save_best and val_dataloader is not None: - if val_metrics['loss'] < best_val_loss: - best_val_loss = val_metrics['loss'] - self.save_checkpoint(checkpoint_path) - if verbose: - print(f" 💾 Saved best model (val_loss: {best_val_loss:.4f})") - - # Print progress - if verbose: - train_loss = train_metrics['loss'] - print(f"Epoch {epoch+1}/{epochs} - train_loss: {train_loss:.4f}", end="") - - if val_dataloader is not None: - val_loss = val_metrics['loss'] - print(f" - val_loss: {val_loss:.4f}", end="") - - for metric in self.metrics: - 
metric_name = metric.__class__.__name__.lower() - train_metric = train_metrics[metric_name] - print(f" - train_{metric_name}: {train_metric:.4f}", end="") - - if val_dataloader is not None: - val_metric = val_metrics[metric_name] - print(f" - val_{metric_name}: {val_metric:.4f}", end="") - - print() # New line - - print("Training completed!") - - # 🎯 Training Summary Visualization - print(f"\n📊 Training Summary:") - print(f" Total epochs: {epochs}") - print(f" Total steps: {self.current_step}") - final_train_loss = self.history['train_loss'][-1] if self.history['train_loss'] else 0 - print(f" Final training loss: {final_train_loss:.4f}") - if val_dataloader is not None: - final_val_loss = self.history['val_loss'][-1] if self.history['val_loss'] else 0 - print(f" Final validation loss: {final_val_loss:.4f}") - - # Visual training progress - if len(self.history['train_loss']) >= 3: - start_loss = self.history['train_loss'][0] - mid_loss = self.history['train_loss'][len(self.history['train_loss'])//2] - end_loss = self.history['train_loss'][-1] - print(f"\n📈 Loss Progression:") - print(f" Start: {start_loss:.4f} → Mid: {mid_loss:.4f} → End: {end_loss:.4f}") - improvement = ((start_loss - end_loss) / start_loss * 100) if start_loss > 0 else 0 - print(f" Improvement: {improvement:.1f}% loss reduction") - - return self.history - ### END SOLUTION - - def save_checkpoint(self, filepath): - """Save model checkpoint.""" - checkpoint = { - 'epoch': self.current_epoch, - 'model_state': self._get_model_state(), - 'history': self.history - } - - with open(filepath, 'wb') as f: - pickle.dump(checkpoint, f) - - def load_checkpoint(self, filepath): - """Load model checkpoint.""" - with open(filepath, 'rb') as f: - checkpoint = pickle.load(f) - - self.current_epoch = checkpoint['epoch'] - self.history = checkpoint['history'] - self._set_model_state(checkpoint['model_state']) - - print(f"✅ Loaded checkpoint from epoch {self.current_epoch}") - - def _get_model_state(self): - 
"""Extract model parameters.""" - state = {} - for i, layer in enumerate(self.model.layers): - if hasattr(layer, 'weight'): - state[f'layer_{i}_weight'] = layer.weight.data.copy() - state[f'layer_{i}_bias'] = layer.bias.data.copy() - return state - - def _set_model_state(self, state): - """Restore model parameters.""" - for i, layer in enumerate(self.model.layers): - if hasattr(layer, 'weight'): - layer.weight.data = state[f'layer_{i}_weight'] - layer.bias.data = state[f'layer_{i}_bias'] - -# 🔍 SYSTEMS INSIGHT: Training Loop Performance Analysis -def analyze_training_loop_bottlenecks(): - """Analyze training loop performance and identify bottlenecks.""" - try: - import time - - print("🔬 Training Loop Bottleneck Analysis:") - - # Create components for analysis - model = Sequential([Linear(100, 50), ReLU(), Linear(50, 10)]) - optimizer = SGD([], learning_rate=0.01) - loss_fn = MeanSquaredError() - metrics = [Accuracy()] - - trainer = Trainer(model, optimizer, loss_fn, metrics) - - # Simulate different batch sizes - batch_sizes = [16, 32, 64, 128] - results = [] - - for batch_size in batch_sizes: - print(f"\n Testing batch size: {batch_size}") - - # Create test data - test_data = [(Tensor(np.random.randn(batch_size, 100)), - Tensor(np.random.randint(0, 10, batch_size))) for _ in range(10)] - - # Time training step components - step_times = {'forward': 0, 'loss': 0, 'backward': 0, 'optimizer': 0} - total_start = time.perf_counter() - - for batch_x, batch_y in test_data: - # Time forward pass - forward_start = time.perf_counter() - try: - predictions = model(batch_x) - step_times['forward'] += time.perf_counter() - forward_start - except: - predictions = Tensor(np.random.randn(batch_size, 10)) - step_times['forward'] += 0.001 - - # Time loss computation - loss_start = time.perf_counter() - loss = loss_fn(predictions, batch_y) - step_times['loss'] += time.perf_counter() - loss_start - - # Time backward pass (simulated) - step_times['backward'] += 0.002 # Simulated time - 
- # Time optimizer step - opt_start = time.perf_counter() - try: - optimizer.step() - step_times['optimizer'] += time.perf_counter() - opt_start - except: - step_times['optimizer'] += 0.001 - - total_time = time.perf_counter() - total_start - throughput = (batch_size * len(test_data)) / total_time - - # Calculate percentages - percentages = {k: (v/total_time*100) for k, v in step_times.items()} - - results.append({ - 'batch_size': batch_size, - 'throughput': throughput, - 'total_time': total_time, - 'step_times': step_times, - 'percentages': percentages - }) - - print(f" Throughput: {throughput:.1f} samples/sec") - print(f" Forward: {percentages['forward']:.1f}%, Loss: {percentages['loss']:.1f}%") - print(f" Backward: {percentages['backward']:.1f}%, Optimizer: {percentages['optimizer']:.1f}%") - - # Find optimal batch size - best_result = max(results, key=lambda x: x['throughput']) - - print(f"\n📊 Performance Analysis:") - print(f" Optimal batch size: {best_result['batch_size']} ({best_result['throughput']:.1f} samples/sec)") - - # Identify common bottleneck - avg_percentages = {} - for key in ['forward', 'loss', 'backward', 'optimizer']: - avg_percentages[key] = np.mean([r['percentages'][key] for r in results]) - - bottleneck = max(avg_percentages.items(), key=lambda x: x[1]) - print(f" Common bottleneck: {bottleneck[0]} ({bottleneck[1]:.1f}% of time)") - - print(f"\n💡 Key Insights:") - print(f" • Larger batches improve GPU utilization (vectorization)") - print(f" • {bottleneck[0]} dominates training time - optimize this first") - print(f" • Memory vs speed trade-off: bigger batches need more RAM") - print(f" • Production systems pipeline these operations for efficiency") - - except Exception as e: - print(f"⚠️ Analysis failed: {e}") - -# Run analysis -analyze_training_loop_bottlenecks() - -# %% [markdown] -""" -### 🧪 Unit Test: Training Loop - -Let's test our Trainer class with a simple example. 
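One detail worth isolating before the test: the Trainer's loss tracking peels `.data` attributes off whatever the loss function returns — a `Variable` wrapping a `Tensor`, a `Tensor` wrapping a NumPy array, or a bare float. The same logic as a standalone helper (`Box` here is a stand-in wrapper for illustration, not a module class):

```python
import numpy as np

def unwrap_loss(loss):
    # Keep unwrapping .data until only a plain numeric value remains,
    # then reduce to a scalar float.
    while hasattr(loss, "data"):
        loss = loss.data
    return float(np.mean(loss))

class Box:
    """Stand-in for a Variable/Tensor-style wrapper with a .data attribute."""
    def __init__(self, data):
        self.data = data

unwrap_loss(Box(Box(np.array(0.25))))  # 0.25 — double-wrapped array
unwrap_loss(0.5)                       # 0.5  — already a plain float
```

A loop like this is more robust than the nested `hasattr` branches in `train_epoch`, because it handles any depth of wrapping with the same three lines.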
-""" - -# %% nbgrader={"grade": false, "grade_id": "test-trainer", "locked": false, "schema_version": 3, "solution": false, "task": false} -def test_unit_trainer(): - """Test Trainer class with comprehensive examples.""" - print("🔬 Unit Test: Trainer Class...") - - # Create simple model and components - model = Sequential([Linear(2, 3), ReLU(), Linear(3, 2)]) # Simple model - optimizer = SGD([], learning_rate=0.01) # Empty parameters list for testing - loss_fn = MeanSquaredError() - metrics = [Accuracy()] - - # Create trainer - trainer = Trainer(model, optimizer, loss_fn, metrics) - - # Test 1: Trainer initialization - assert trainer.model is model, "Model should be stored correctly" - assert trainer.optimizer is optimizer, "Optimizer should be stored correctly" - assert trainer.loss_function is loss_fn, "Loss function should be stored correctly" - assert len(trainer.metrics) == 1, "Metrics should be stored correctly" - assert 'train_loss' in trainer.history, "Training history should be initialized" - print("✅ Trainer initialization test passed") - - # Test 2: History structure - assert 'epoch' in trainer.history, "History should track epochs" - assert 'train_accuracy' in trainer.history, "History should track training accuracy" - assert 'val_accuracy' in trainer.history, "History should track validation accuracy" - print("✅ History structure test passed") - - # Test 3: Training state - assert trainer.current_epoch == 0, "Current epoch should start at 0" - assert trainer.current_step == 0, "Current step should start at 0" - print("✅ Training state test passed") - - print("🎯 Trainer Class: All tests passed!") - -# Test function defined (called in main block) - -# %% [markdown] -""" -### 🧪 Unit Test: Complete Training Comprehensive Test - -Let's test the complete training pipeline with all components working together. - -**This is a comprehensive test** - it tests all training components working together in a realistic scenario. 
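The cross-entropy assertions below rely on the usual log-softmax formulation. As a pure-NumPy reminder of where those small loss values come from (an illustrative helper, not the module's `CrossEntropyLoss`):

```python
import numpy as np

def cross_entropy_sketch(logits, labels):
    # Numerically stable log-softmax, then mean negative log-likelihood
    # of the true class for each row.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

cross_entropy_sketch(np.array([[10.0, 0.0], [0.0, 10.0]]), np.array([0, 1]))
# near zero — confident, correct logits incur almost no loss
```

With logits of `[10, 0]` the softmax puts ~0.99995 of the mass on the true class, so the loss is tiny — which is why the test below only asserts `loss_value < 1.0` rather than exact equality.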
-""" - -# %% nbgrader={"grade": true, "grade_id": "test-training-comprehensive", "locked": true, "points": 25, "schema_version": 3, "solution": false, "task": false} -def test_module(): - """Test complete training pipeline with all components.""" - print("🔬 Integration Test: Complete Training Pipeline...") - - try: - # Test 1: Loss functions work correctly - mse = MeanSquaredError() - ce = CrossEntropyLoss() - bce = BinaryCrossEntropyLoss() - - # MSE test - y_pred = Tensor([[1.0, 2.0]]) - y_true = Tensor([[1.0, 2.0]]) - loss = mse(y_pred, y_true) - loss_value = get_tensor_value(loss) - assert abs(loss_value) < 1e-6, "MSE should work for perfect predictions" - - # CrossEntropy test - y_pred = Tensor([[10.0, 0.0], [0.0, 10.0]]) - y_true = Tensor([0, 1]) - loss = ce(y_pred, y_true) - loss_value = get_tensor_value(loss) - assert loss_value < 1.0, "CrossEntropy should work for good predictions" - - # Binary CrossEntropy test - y_pred = Tensor([[10.0], [-10.0]]) - y_true = Tensor([[1.0], [0.0]]) - loss = bce(y_pred, y_true) - loss_value = get_tensor_value(loss) - assert loss_value < 1.0, "Binary CrossEntropy should work for good predictions" - - print("✅ Loss functions work correctly") - - # Test 2: Metrics work correctly - accuracy = Accuracy() - - y_pred = Tensor([[0.9, 0.1], [0.1, 0.9]]) - y_true = Tensor([0, 1]) - acc = accuracy(y_pred, y_true) - assert acc == 1.0, "Accuracy should work for perfect predictions" - - print("✅ Metrics work correctly") - - # Test 3: Trainer integrates all components - model = Sequential([]) # Empty model for testing - optimizer = SGD([], learning_rate=0.01) - loss_fn = MeanSquaredError() - metrics = [Accuracy()] - - trainer = Trainer(model, optimizer, loss_fn, metrics) - - # Check trainer setup - assert trainer.model is model, "Trainer should store model" - assert trainer.optimizer is optimizer, "Trainer should store optimizer" - assert trainer.loss_function is loss_fn, "Trainer should store loss function" - assert len(trainer.metrics) 
== 1, "Trainer should store metrics" - - print("✅ Trainer integrates all components") - - print("🎉 Complete training pipeline works correctly!") - - # Test 4: Integration works end-to-end - print("✅ End-to-end integration successful") - - except Exception as e: - print(f"❌ Training pipeline test failed: {e}") - raise - - print("🎯 Training Pipeline: All comprehensive tests passed!") - -# Test function defined (called in main block) - -# %% [markdown] -""" -## 🔍 Systems Analysis - -Now that your training implementation is complete and tested, let's measure its behavior: -""" - -# %% -def measure_training_scaling(): - """ - 📊 SYSTEMS MEASUREMENT: Training Performance Scaling - - Measure how training performance scales with batch size. - """ - print("📊 Training Performance Scaling Analysis") - print("Testing training performance with different batch sizes...") - - try: - import time - - # Create simple model for testing - model = Sequential([Linear(10, 1)]) - optimizer = SGD(model.parameters(), learning_rate=0.01) - loss_fn = MeanSquaredError() - - batch_sizes = [4, 8, 16, 32] - times = [] - - for batch_size in batch_sizes: - # Generate test data - X = Tensor(np.random.randn(batch_size, 10)) - y = Tensor(np.random.randn(batch_size, 1)) - - # Time a training step - start = time.perf_counter() - - predictions = model(X) - loss = loss_fn(predictions, y) - # Note: In real training, we'd call loss.backward() and optimizer.step() - - elapsed = time.perf_counter() - start - times.append(elapsed) - - throughput = batch_size / elapsed - print(f"Batch size {batch_size:2d}: {elapsed*1000:.2f}ms ({throughput:.1f} samples/sec)") - - # Analyze scaling - if len(times) >= 2: - scaling_factor = times[-1] / times[0] - batch_factor = batch_sizes[-1] / batch_sizes[0] - efficiency = batch_factor / scaling_factor - - print(f"\n💡 Scaling Insight:") - print(f" Batch size increased {batch_factor:.1f}x") - print(f" Time increased {scaling_factor:.1f}x") - print(f" Scaling efficiency: 
{efficiency:.1f}x") - - if efficiency > 0.8: - print(f" ✅ Good scaling - training benefits from larger batches") - else: - print(f" ⚠️ Poor scaling - diminishing returns from larger batches") - - print(f"\n💡 SYSTEMS INSIGHT:") - print(f" Training performance scales sub-linearly with batch size") - print(f" This reveals the balance between computation and memory access") - - except Exception as e: - print(f"⚠️ Error in scaling analysis: {e}") - -# Run the measurement -measure_training_scaling() - -# %% -def measure_training_memory(): - """ - 💾 SYSTEMS MEASUREMENT: Training Memory Usage - - Measure memory usage patterns during training. - """ - print("\n💾 Training Memory Usage Analysis") - print("Analyzing memory consumption during training...") - - try: - import psutil - import os - - def get_memory_mb(): - process = psutil.Process(os.getpid()) - return process.memory_info().rss / 1024 / 1024 - - baseline_memory = get_memory_mb() - - # Create model and training components - model = Sequential([Linear(100, 50), Linear(50, 1)]) - optimizer = SGD(model.parameters(), learning_rate=0.01) - loss_fn = MeanSquaredError() - - memory_before = get_memory_mb() - - # Create different batch sizes and measure memory - batch_sizes = [16, 32, 64] - - for batch_size in batch_sizes: - X = Tensor(np.random.randn(batch_size, 100)) - y = Tensor(np.random.randn(batch_size, 1)) - - memory_start = get_memory_mb() - - # Forward pass - predictions = model(X) - loss = loss_fn(predictions, y) - - memory_peak = get_memory_mb() - memory_used = memory_peak - memory_start - - print(f"Batch size {batch_size:2d}: {memory_used:.1f}MB memory increase") - - # Clean up - del predictions, loss, X, y - - print(f"\n💡 MEMORY INSIGHT:") - print(f" Memory usage grows with batch size") - print(f" Forward pass creates intermediate activations") - print(f" Larger batches = more memory but better GPU utilization") - - except Exception as e: - print(f"⚠️ Error in memory analysis: {e}") - -# Run the measurement 
-measure_training_memory() - -# %% -if __name__ == "__main__": - print("🚀 Running all training tests...") - - # Run all unit tests - test_unit_mse_loss() - test_unit_crossentropy_loss() - test_unit_binary_crossentropy_loss() - test_unit_accuracy_metric() - test_unit_trainer() - - # Run final integration test - test_module() - - print("\n🎉 SUCCESS: All training tests passed!") - print("✅ Loss functions compute correctly") - print("✅ Metrics evaluate properly") - print("✅ Training loop integrates all components") - print("✅ Ready for complete neural network training!") - -# %% [markdown] -""" -## 🤔 ML Systems Thinking: Interactive Questions - -**Complete these questions to deepen your understanding of training systems:** -""" - -# %% nbgrader={"grade": true, "grade_id": "training-systems-question-1", "locked": false, "points": 5, "schema_version": 3, "solution": true, "task": false} -# %% [markdown] -""" -### Question 1: Memory vs Batch Size Trade-offs - -In your `Trainer` implementation, you control batch size during training. When you tested different batch sizes in the scaling analysis, you discovered that memory usage grows with batch size. - -**Reflection Question**: Analyze the memory patterns in your training loop. If you have 8GB of GPU memory and your model has 1M parameters (4MB), how would you determine the optimal batch size? What happens to training dynamics when memory constraints force you to use smaller batches? 
- -Think about: -- Parameter memory (weights + gradients + optimizer state) -- Activation memory (grows with batch size) -- Memory vs convergence speed trade-offs -- How this affects real ML systems at scale - -**Your Analysis:** -``` -// Write your analysis here -``` -""" - -# %% nbgrader={"grade": true, "grade_id": "training-systems-question-2", "locked": false, "points": 5, "schema_version": 3, "solution": true, "task": false} -# %% [markdown] -""" -### Question 2: Loss Function Choice and Training Stability - -You implemented MSE, CrossEntropy, and Binary CrossEntropy loss functions. Each has different mathematical properties that affect training dynamics. - -**Reflection Question**: Your `MeanSquaredError` loss can produce very large gradients when predictions are far from targets, while `CrossEntropyLoss` has more stable gradients. How does this difference affect training stability and convergence speed? When would you choose each loss function, and how would you modify your training loop to handle unstable gradients? - -Think about: -- Gradient magnitude differences between loss functions -- How loss landscapes affect optimization -- Gradient clipping and learning rate scheduling -- Production implications for model reliability - -**Your Analysis:** -``` -// Write your analysis here -``` -""" - -# %% nbgrader={"grade": true, "grade_id": "training-systems-question-3", "locked": false, "points": 5, "schema_version": 3, "solution": true, "task": false} -# %% [markdown] -""" -### Question 3: Training Loop Bottlenecks and Optimization - -Your `Trainer` class orchestrates data loading, forward passes, loss computation, and optimization. In the performance analysis, you measured how different components contribute to training time. - -**Reflection Question**: If you discovered that data loading is your bottleneck (taking 60% of training time), how would you modify your training loop architecture to address this? 
What systems-level changes would you make to achieve better data/compute overlap? - -Think about: -- Data prefetching and parallel data loading -- CPU vs GPU workload distribution -- Memory caching and data preprocessing optimization -- How training loop design affects overall system throughput - -**Your Analysis:** -``` -// Write your analysis here -``` -""" - -# %% [markdown] -""" -## 🎯 MODULE SUMMARY: Training Complete! - -Congratulations! You've successfully implemented complete training infrastructure: - -### What You've Accomplished -✅ **Loss Function Implementation**: MSE, CrossEntropy, and Binary CrossEntropy with proper gradient support -✅ **Metrics System**: Accuracy evaluation with batch processing and edge case handling -✅ **Training Loop Architecture**: Complete `Trainer` class that orchestrates all ML components -✅ **Systems Analysis**: Performance scaling and memory usage measurement capabilities -✅ **Integration Testing**: End-to-end validation of the complete training pipeline - -### Key Learning Outcomes -- **Training Orchestration**: How training loops coordinate data, models, losses, and optimizers into unified systems -- **Loss Function Design**: Mathematical properties that affect training stability and convergence -- **Performance Analysis**: How to measure and optimize training pipeline bottlenecks -- **Memory Management**: Understanding memory scaling patterns and resource constraints - -### Professional Skills Developed -- **Systems Integration**: Building complex pipelines from independent components -- **Performance Profiling**: Measuring and analyzing training system behavior -- **Production Patterns**: Training loop designs that handle errors and scale effectively - -### Ready for Advanced Applications -Your training implementation now enables: -- **Complete Neural Networks**: Train any model architecture on real datasets -- **Performance Optimization**: Identify and resolve training bottlenecks -- **Production Deployment**: Reliable 
training loops with monitoring and checkpointing - -### Connection to Real ML Systems -Your implementation mirrors production frameworks: -- **PyTorch**: Your `Trainer` class patterns match PyTorch Lightning trainers -- **TensorFlow**: Loss functions and metrics follow tf.keras patterns -- **Industry Standard**: Training loop design reflects MLOps best practices - -### Next Steps -Your training infrastructure completes the core ML system! You can now: -1. **Train on Real Data**: Use your complete system on CIFAR-10, MNIST, or custom datasets -2. **Optimize Performance**: Apply scaling analysis to improve training throughput -3. **Build Complex Models**: Combine all modules into sophisticated architectures -4. **Deploy Systems**: Take your implementations toward production-ready systems - -**You've built real ML training infrastructure from scratch!** This foundation enables everything from research experiments to production ML systems. -""" \ No newline at end of file diff --git a/modules_old/08_spatial/README.md b/modules_old/08_spatial/README.md deleted file mode 100644 index 5ee51163..00000000 --- a/modules_old/08_spatial/README.md +++ /dev/null @@ -1,221 +0,0 @@ -# 🔥 Module: CNN - -## 📊 Module Info -- **Difficulty**: ⭐⭐⭐ Advanced -- **Time Estimate**: 6-8 hours -- **Prerequisites**: Tensor, Activations, Layers, Networks modules -- **Next Steps**: Training, Computer Vision modules - -Implement the core building block of modern computer vision: the convolutional layer. This module teaches you how convolution transforms computer vision from hand-crafted features to learned hierarchical representations that power everything from image recognition to autonomous vehicles. 
- -## 🎯 Learning Objectives - -By the end of this module, you will be able to: - -- **Understand convolution fundamentals**: Master the sliding window operation, local connectivity, and weight sharing principles -- **Implement Conv2D from scratch**: Build convolutional layers using explicit loops to understand the core operation -- **Visualize feature learning**: See how convolution builds feature maps and hierarchical representations -- **Design CNN architectures**: Compose convolutional layers with pooling and dense layers into complete networks -- **Apply computer vision principles**: Understand how CNNs revolutionized image processing and pattern recognition - -## 🧠 Build → Use → Analyze - -This module follows TinyTorch's **Build → Use → Analyze** framework: - -1. **Build**: Implement Conv2D from scratch using explicit for-loops to understand the core convolution operation -2. **Use**: Compose Conv2D with activation functions and other layers to build complete convolutional networks -3. 
**Analyze**: Visualize learned features, understand architectural choices, and compare CNN performance characteristics - -## 📚 What You'll Build - -### Core Convolution Implementation -```python -# Conv2D layer: the heart of computer vision -conv_layer = Conv2D(in_channels=3, out_channels=16, kernel_size=3) -input_image = Tensor([[[[...]]]]) # (batch, channels, height, width) -feature_maps = conv_layer(input_image) # Learned features - -# Understanding the operation -print(f"Input shape: {input_image.shape}") # (1, 3, 32, 32) -print(f"Output shape: {feature_maps.shape}") # (1, 16, 30, 30) -print(f"Learned {feature_maps.shape[1]} different feature detectors") -``` - -### Complete CNN Architecture -```python -# Simple CNN for image classification -cnn = Sequential([ - Conv2D(3, 16, kernel_size=3), # Feature extraction - ReLU(), # Nonlinearity - MaxPool2D(kernel_size=2), # Dimensionality reduction - Conv2D(16, 32, kernel_size=3), # Higher-level features - ReLU(), # More nonlinearity - Flatten(), # Prepare for dense layers - Dense(32 * 13 * 13, 128), # Feature integration - ReLU(), - Dense(128, 10), # Classification head - Sigmoid() # Probability outputs -]) - -# End-to-end image classification -image_batch = Tensor([[[[...]]]]) # Batch of images -predictions = cnn(image_batch) # Class probabilities -``` - -### Convolution Operation Details -- **Sliding Window**: Filter moves across input to detect local patterns -- **Weight Sharing**: Same filter applied everywhere for translation invariance -- **Local Connectivity**: Each output depends only on local input region -- **Feature Maps**: Multiple filters learn different feature detectors - -### CNN Building Blocks -- **Conv2D Layer**: Core convolution operation with learnable filters -- **Pooling Layers**: MaxPool and AvgPool for spatial downsampling -- **Flatten Layer**: Converts 2D feature maps to 1D for dense layers -- **Complete Networks**: Integration with existing Dense and activation layers - -## 🚀 Getting Started 
- -### Prerequisites -Ensure you have mastered the foundational network building blocks: - -```bash -# Activate TinyTorch environment -source bin/activate-tinytorch.sh - -# Verify all prerequisite modules -tito test --module tensor -tito test --module activations -tito test --module layers -tito test --module networks -``` - -### Development Workflow -1. **Open the development file**: `modules/source/07_spatial/spatial_dev.py` -2. **Implement convolution operation**: Start with explicit for-loop implementation for understanding -3. **Build Conv2D layer class**: Wrap convolution in reusable layer interface -4. **Add pooling operations**: Implement MaxPool and AvgPool for spatial reduction -5. **Create complete CNNs**: Compose layers into full computer vision architectures -6. **Export and verify**: `tito export --module cnn && tito test --module cnn` - -## 🧪 Testing Your Implementation - -### Comprehensive Test Suite -Run the full test suite to verify computer vision functionality: - -```bash -# TinyTorch CLI (recommended) -tito test --module cnn - -# Direct pytest execution -python -m pytest tests/ -k cnn -v -``` - -### Test Coverage Areas -- ✅ **Convolution Operation**: Verify sliding window operation and local connectivity -- ✅ **Filter Learning**: Test weight initialization and parameter management -- ✅ **Shape Transformations**: Ensure proper input/output shape handling -- ✅ **Pooling Operations**: Verify spatial downsampling and feature preservation -- ✅ **CNN Integration**: Test complete networks with real image-like data - -### Inline Testing & Visualization -The module includes comprehensive educational feedback and visual analysis: -```python -# Example inline test output -🔬 Unit Test: Conv2D implementation... -✅ Convolution sliding window works correctly -✅ Weight sharing applied consistently -✅ Output shapes match expected dimensions -📈 Progress: Conv2D ✓ - -# Visualization feedback -📊 Visualizing convolution operation... 
-📈 Showing filter sliding across input -📊 Feature map generation: 3→16 channels -``` - -### Manual Testing Examples -```python -from tinytorch.core.tensor import Tensor -from cnn_dev import Conv2D, MaxPool2D, Flatten -from activations_dev import ReLU - -# Test basic convolution -conv = Conv2D(in_channels=1, out_channels=4, kernel_size=3) -input_img = Tensor([[[[1, 2, 3, 4, 5], - [6, 7, 8, 9, 10], - [11, 12, 13, 14, 15], - [16, 17, 18, 19, 20], - [21, 22, 23, 24, 25]]]]) -feature_maps = conv(input_img) -print(f"Input: {input_img.shape}, Features: {feature_maps.shape}") - -# Test complete CNN pipeline -relu = ReLU() -pool = MaxPool2D(kernel_size=2) -flatten = Flatten() - -# Forward pass through CNN layers -activated = relu(feature_maps) -pooled = pool(activated) -flattened = flatten(pooled) -print(f"Final shape: {flattened.shape}") -``` - -## 🎯 Key Concepts - -### Real-World Applications -- **Image Classification**: CNNs power systems like ImageNet winners (AlexNet, ResNet, EfficientNet) -- **Object Detection**: YOLO and R-CNN families use CNN backbones for feature extraction -- **Medical Imaging**: CNNs analyze X-rays, MRIs, and CT scans for diagnostic assistance -- **Autonomous Vehicles**: CNN-based perception systems process camera feeds for navigation - -### Computer Vision Fundamentals -- **Translation Invariance**: Convolution detects patterns regardless of position in image -- **Hierarchical Features**: Early layers detect edges, later layers detect objects and concepts -- **Parameter Efficiency**: Weight sharing dramatically reduces parameters compared to dense layers -- **Spatial Structure**: CNNs preserve and leverage 2D spatial relationships in images - -### Convolution Mathematics -- **Sliding Window Operation**: Filter moves across input with stride and padding parameters -- **Cross-Correlation vs Convolution**: Deep learning typically uses cross-correlation operation -- **Feature Map Computation**: Output[i,j] = sum(input[i:i+k, j:j+k] * filter) -- 
**Receptive Field**: Region of input that influences each output activation - -### CNN Architecture Patterns -- **Feature Extraction**: Convolution + ReLU + Pooling blocks extract hierarchical features -- **Classification Head**: Flatten + Dense layers perform final classification -- **Progressive Filtering**: Increasing filter count with decreasing spatial dimensions -- **Skip Connections**: Advanced architectures add residual connections for deeper networks - -## 🎉 Ready to Build? - -You're about to implement the technology that revolutionized computer vision! CNNs transformed image processing from hand-crafted features to learned representations, enabling everything from photo tagging to medical diagnosis to autonomous driving. - -Understanding convolution from the ground up—implementing the sliding window operation yourself—will give you deep insight into why CNNs work so well for visual tasks. Take your time with the core operation, visualize what's happening, and enjoy building the foundation of modern computer vision! 
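-The feature map formula described under Convolution Mathematics (`Output[i,j] = sum(input[i:i+k, j:j+k] * filter)`) can be sanity-checked in a few lines of NumPy. This is a minimal single-channel, valid-padding, stride-1 sketch; `conv2d_valid` is an illustrative name for this README only, not part of the module's API:

```python
import numpy as np

def conv2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid cross-correlation: Output[i, j] = sum(input[i:i+k, j:j+k] * filter)."""
    kH, kW = kernel.shape
    out_H = image.shape[0] - kH + 1
    out_W = image.shape[1] - kW + 1
    out = np.zeros((out_H, out_W))
    for i in range(out_H):
        for j in range(out_W):
            # Elementwise product of the kernel with the window it currently covers
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

image = np.array([[1., 2., 3.],
                  [4., 5., 6.],
                  [7., 8., 9.]])
kernel = np.array([[1., 0.],
                   [0., -1.]])
print(conv2d_valid(image, kernel))  # → [[-4. -4.] [-4. -4.]]
```

-Each output pixel reads a kH×kW window, so the total cost is O(H×W×K²) per channel pair. This is exactly why production frameworks replace these explicit loops with im2col/GEMM reformulations or cuDNN kernels.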
- -```{grid} 3 -:gutter: 3 -:margin: 2 - -{grid-item-card} 🚀 Launch Builder -:link: https://mybinder.org/v2/gh/VJProductions/TinyTorch/main?filepath=modules/source/07_spatial/spatial_dev.py -:class-title: text-center -:class-body: text-center - -Interactive development environment - -{grid-item-card} 📓 Open in Colab -:link: https://colab.research.google.com/github/VJProductions/TinyTorch/blob/main/modules/source/07_spatial/spatial_dev.ipynb -:class-title: text-center -:class-body: text-center - -Google Colab notebook - -{grid-item-card} 👀 View Source -:link: https://github.com/VJProductions/TinyTorch/blob/main/modules/source/07_spatial/spatial_dev.py -:class-title: text-center -:class-body: text-center - -Browse the code on GitHub -``` diff --git a/modules_old/08_spatial/module.yaml b/modules_old/08_spatial/module.yaml deleted file mode 100644 index 217243dd..00000000 --- a/modules_old/08_spatial/module.yaml +++ /dev/null @@ -1,24 +0,0 @@ -components: -- conv2d_naive -- Conv2D -- flatten -dependencies: - enables: - - attention - - training - - computer_vision - prerequisites: - - tensor - - activations - - layers - - dense -description: Convolutional networks for spatial pattern recognition and image processing -difficulty: "\u2B50\u2B50\u2B50" -exports_to: tinytorch.core.spatial -files: - dev_file: spatial_dev.py - readme: README.md - tests: inline -name: spatial -time_estimate: 6-8 hours -title: Spatial Networks diff --git a/modules_old/08_spatial/spatial_dev.ipynb b/modules_old/08_spatial/spatial_dev.ipynb deleted file mode 100644 index 62909352..00000000 --- a/modules_old/08_spatial/spatial_dev.ipynb +++ /dev/null @@ -1,2920 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "580c015d", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "# Spatial - Convolutional Networks and Spatial Pattern Recognition\n", - "\n", - "Welcome to the Spatial module! 
You'll implement convolutional operations that enable neural networks to understand spatial relationships in images and other grid-structured data.\n", - "\n", - "## Learning Goals\n", - "- Systems understanding: How convolution operations achieve spatial pattern recognition through parameter sharing and translation invariance\n", - "- Core implementation skill: Build Conv2D layers using explicit sliding window operations to understand the computational mechanics\n", - "- Pattern recognition: Understand how convolutional layers detect hierarchical features from edges to complex objects\n", - "- Framework connection: See how your implementation reveals the design decisions in PyTorch's nn.Conv2d optimizations\n", - "- Performance insight: Learn why convolution is computationally expensive but highly parallelizable, driving modern GPU architecture\n", - "\n", - "## Build → Use → Reflect\n", - "1. **Build**: Conv2D layer with sliding window convolution, understanding every memory access and computation\n", - "2. **Use**: Transform real image data and visualize how feature maps capture spatial patterns\n", - "3. 
**Reflect**: Why does convolution enable parameter sharing, and how does this affect model capacity vs efficiency?\n", - "\n", - "## What You'll Achieve\n", - "By the end of this module, you'll understand:\n", - "- Deep technical understanding of how sliding window operations enable spatial pattern detection\n", - "- Practical capability to implement convolutional layers that form the backbone of computer vision systems\n", - "- Systems insight into why convolution is the dominant operation for spatial data and how it affects memory access patterns\n", - "- Performance consideration of how kernel size, stride, and padding choices affect computational cost and memory usage\n", - "- Connection to production ML systems and how frameworks optimize convolution for different hardware architectures\n", - "\n", - "## Systems Reality Check\n", - "💡 **Production Context**: PyTorch's Conv2d uses highly optimized implementations like cuDNN that can be 100x faster than naive implementations through algorithm choice and memory layout optimization\n", - "⚡ **Performance Note**: Convolution is O(H×W×C×K²) per output pixel - modern CNNs perform billions of these operations, making optimization critical for real-time applications" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7eb835e7", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "cnn-imports", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "#| default_exp core.spatial\n", - "\n", - "#| export\n", - "import numpy as np\n", - "import os\n", - "import sys\n", - "from typing import List, Tuple, Optional\n", - "\n", - "# Import from the main package - try package first, then local modules\n", - "try:\n", - " from tinytorch.core.tensor import Tensor, Parameter\n", - " from tinytorch.core.layers import Linear, Module\n", - " from tinytorch.core.activations import ReLU\n", - "except ImportError:\n", - " # For development, 
import from local modules\n", - " sys.path.append(os.path.join(os.path.dirname(__file__), '..', '02_tensor'))\n", - " sys.path.append(os.path.join(os.path.dirname(__file__), '..', '03_activations'))\n", - " sys.path.append(os.path.join(os.path.dirname(__file__), '..', '04_layers'))\n", - " from tensor_dev import Tensor, Parameter\n", - " from activations_dev import ReLU\n", - " from layers_dev import Linear, Module" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6a137a89", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "cnn-welcome", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "print(\"🔥 TinyTorch CNN Module\")\n", - "print(f\"NumPy version: {np.__version__}\")\n", - "print(f\"Python version: {sys.version_info.major}.{sys.version_info.minor}\")\n", - "print(\"Ready to build convolutional neural networks!\")" - ] - }, - { - "cell_type": "markdown", - "id": "6b90f888", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 📦 Where This Code Lives in the Final Package\n", - "\n", - "**Learning Side:** You work in `modules/source/05_cnn/cnn_dev.py` \n", - "**Building Side:** Code exports to `tinytorch.core.cnn`\n", - "\n", - "```python\n", - "# Final package structure:\n", - "from tinytorch.core.cnn import Conv2D, conv2d_naive, flatten # CNN operations!\n", - "from tinytorch.core.layers import Dense # Fully connected layers\n", - "from tinytorch.core.activations import ReLU # Nonlinearity\n", - "from tinytorch.core.tensor import Tensor # Foundation\n", - "```\n", - "\n", - "**Why this matters:**\n", - "- **Learning:** Focused modules for deep understanding of convolution\n", - "- **Production:** Proper organization like PyTorch's `torch.nn.Conv2d`\n", - "- **Consistency:** All CNN operations live together in `core.cnn`\n", - "- **Integration:** Works seamlessly with other TinyTorch components" - ] - }, - { - "cell_type": "markdown", - "id": 
"7ae387ea", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Spatial Helper Functions\n", - "\n", - "Before diving into convolution, let's add some essential spatial operations that we'll need for building clean CNN code. These helpers make it easy to work with multi-dimensional data." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c8a4ddb7", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "spatial-helpers", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "def flatten(x, start_dim=1):\n", - " \"\"\"\n", - " Flatten tensor starting from a given dimension.\n", - " \n", - " This is essential for transitioning from convolutional layers\n", - " (which output 4D tensors) to linear layers (which expect 2D).\n", - " \n", - " Args:\n", - " x: Input tensor (Tensor or any array-like)\n", - " start_dim: Dimension to start flattening from (default: 1 to preserve batch)\n", - " \n", - " Returns:\n", - " Flattened tensor preserving batch dimension\n", - " \n", - " Examples:\n", - " # Flatten CNN output for Linear layer\n", - " conv_output = Tensor(np.random.randn(32, 64, 8, 8)) # (batch, channels, height, width)\n", - " flat = flatten(conv_output) # (32, 4096) - ready for Linear layer!\n", - " \n", - " # Flatten image for MLP\n", - " images = Tensor(np.random.randn(32, 3, 32, 32)) # CIFAR-10 batch\n", - " flat = flatten(images) # (32, 3072) - ready for MLP!\n", - " \"\"\"\n", - " # Get the data (handle both Tensor and numpy arrays)\n", - " if hasattr(x, 'data'):\n", - " data = x.data\n", - " else:\n", - " data = x\n", - " \n", - " # Calculate new shape\n", - " batch_size = data.shape[0]\n", - " remaining_size = np.prod(data.shape[start_dim:])\n", - " new_shape = (batch_size, remaining_size)\n", - " \n", - " # Reshape preserving tensor type\n", - " if hasattr(x, 'data'):\n", - " #
It's a Tensor - preserve type and gradient tracking\n", - " flattened_data = data.reshape(new_shape)\n", - " result = Tensor(flattened_data, requires_grad=x.requires_grad if hasattr(x, 'requires_grad') else False)\n", - " return result\n", - " else:\n", - " # It's a numpy array\n", - " return data.reshape(new_shape)\n", - "\n", - "#| export\n", - "def max_pool2d(x, kernel_size, stride=None):\n", - " \"\"\"\n", - " Apply 2D max pooling operation.\n", - " \n", - " Max pooling reduces spatial dimensions by taking the maximum value\n", - " in each pooling window. This provides translation invariance and\n", - " reduces computational cost.\n", - " \n", - " Args:\n", - " x: Input tensor (batch, channels, height, width)\n", - " kernel_size: Size of pooling window (int or tuple)\n", - " stride: Stride of pooling (defaults to kernel_size)\n", - " \n", - " Returns:\n", - " Pooled tensor with reduced spatial dimensions\n", - " \n", - " Examples:\n", - " # Standard 2x2 max pooling\n", - " feature_maps = Tensor(np.random.randn(32, 64, 28, 28))\n", - " pooled = max_pool2d(feature_maps, 2) # (32, 64, 14, 14)\n", - " \n", - " # Non-overlapping 3x3 pooling\n", - " pooled = max_pool2d(feature_maps, 3, stride=3) # (32, 64, 9, 9)\n", - " \"\"\"\n", - " # Handle kernel_size and stride\n", - " if isinstance(kernel_size, int):\n", - " kh = kw = kernel_size\n", - " else:\n", - " kh, kw = kernel_size\n", - " \n", - " if stride is None:\n", - " stride = kernel_size\n", - " if isinstance(stride, int):\n", - " sh = sw = stride\n", - " else:\n", - " sh, sw = stride\n", - " \n", - " # Get input data\n", - " if hasattr(x, 'data'):\n", - " input_data = x.data\n", - " else:\n", - " input_data = x\n", - " \n", - " batch, channels, height, width = input_data.shape\n", - " \n", - " # Calculate output dimensions\n", - " out_h = (height - kh) // sh + 1\n", - " out_w = (width - kw) // sw + 1\n", - " \n", - " # Initialize output\n", - " output = np.zeros((batch, channels, out_h, out_w))\n", - " \n", - " 
# Apply max pooling\n", - " for b in range(batch):\n", - " for c in range(channels):\n", - " for i in range(out_h):\n", - " for j in range(out_w):\n", - " h_start = i * sh\n", - " h_end = h_start + kh\n", - " w_start = j * sw\n", - " w_end = w_start + kw\n", - " \n", - " # Take maximum in the pooling window\n", - " pool_region = input_data[b, c, h_start:h_end, w_start:w_end]\n", - " output[b, c, i, j] = np.max(pool_region)\n", - " \n", - " # Preserve tensor type if input was a tensor\n", - " if hasattr(x, 'data'):\n", - " result = Tensor(output, requires_grad=x.requires_grad if hasattr(x, 'requires_grad') else False)\n", - " return result\n", - " else:\n", - " return output" - ] - }, - { - "cell_type": "markdown", - "id": "4789770c", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🔧 DEVELOPMENT" - ] - }, - { - "cell_type": "markdown", - "id": "3e56a3d8", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Step 1: Understanding Convolution\n", - "\n", - "### What is Convolution?\n", - "**Convolution** is a mathematical operation that slides a small filter (kernel) across an input, computing dot products at each position.\n", - "\n", - "### Why Convolution is Perfect for Images\n", - "- **Local patterns**: Images have local structure (edges, textures)\n", - "- **Translation invariance**: Same pattern can appear anywhere\n", - "- **Parameter sharing**: One filter detects the pattern everywhere\n", - "- **Spatial hierarchy**: Multiple layers build increasingly complex features\n", - "\n", - "### The Fundamental Insight\n", - "**Convolution is pattern matching!** The kernel learns to detect specific patterns:\n", - "- **Edge detectors**: Find boundaries between objects\n", - "- **Texture detectors**: Recognize surface patterns\n", - "- **Shape detectors**: Identify geometric forms\n", - "- **Feature detectors**: Combine simple patterns into complex features\n", - "\n", - "### Real-World Applications\n", - "- 
**Image processing**: Detect edges, blur, sharpen\n", - "- **Computer vision**: Recognize objects, faces, text\n", - "- **Medical imaging**: Detect tumors, analyze scans\n", - "- **Autonomous driving**: Identify traffic signs, pedestrians\n", - "\n", - "### Visual Intuition\n", - "```\n", - "Input Image: Kernel: Output Feature Map:\n", - "[1, 2, 3] [1, 0] [1*1+2*0+4*0+5*(-1), 2*1+3*0+5*0+6*(-1)]\n", - "[4, 5, 6] [0, -1] [4*1+5*0+7*0+8*(-1), 5*1+6*0+8*0+9*(-1)]\n", - "[7, 8, 9]\n", - "```\n", - "\n", - "The kernel slides across the input, computing dot products at each position.\n", - "\n", - "Let us implement this step by step!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7236a021", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "conv2d-naive", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "def conv2d_naive(input: np.ndarray, kernel: np.ndarray) -> np.ndarray:\n", - " \"\"\"\n", - " Naive 2D convolution (single channel, no stride, no padding).\n", - " \n", - " Args:\n", - " input: 2D input array (H, W)\n", - " kernel: 2D filter (kH, kW)\n", - " Returns:\n", - " 2D output array (H-kH+1, W-kW+1)\n", - " \n", - " TODO: Implement the sliding window convolution using for-loops.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Get input dimensions: H, W = input.shape\n", - " 2. Get kernel dimensions: kH, kW = kernel.shape\n", - " 3. Calculate output dimensions: out_H = H - kH + 1, out_W = W - kW + 1\n", - " 4. Create output array: np.zeros((out_H, out_W))\n", - " 5. Use nested loops to slide the kernel:\n", - " - i loop: output rows (0 to out_H-1)\n", - " - j loop: output columns (0 to out_W-1)\n", - " - di loop: kernel rows (0 to kH-1)\n", - " - dj loop: kernel columns (0 to kW-1)\n", - " 6. 
For each (i,j), compute: output[i,j] += input[i+di, j+dj] * kernel[di, dj]\n", - " \n", - " LEARNING CONNECTIONS:\n", - " - **Computer Vision Foundation**: Convolution is the core operation in CNNs and image processing\n", - " - **Feature Detection**: Different kernels detect edges, textures, and patterns in images\n", - " - **Spatial Hierarchies**: Convolution preserves spatial relationships while extracting features\n", - " - **Production CNNs**: Understanding the basic operation helps optimize GPU implementations\n", - " \n", - " EXAMPLE:\n", - " Input: [[1, 2, 3], Kernel: [[1, 0],\n", - " [4, 5, 6], [0, -1]]\n", - " [7, 8, 9]]\n", - " \n", - " Output[0,0] = 1*1 + 2*0 + 4*0 + 5*(-1) = 1 - 5 = -4\n", - " Output[0,1] = 2*1 + 3*0 + 5*0 + 6*(-1) = 2 - 6 = -4\n", - " Output[1,0] = 4*1 + 5*0 + 7*0 + 8*(-1) = 4 - 8 = -4\n", - " Output[1,1] = 5*1 + 6*0 + 8*0 + 9*(-1) = 5 - 9 = -4\n", - " \n", - " HINTS:\n", - " - Start with output = np.zeros((out_H, out_W))\n", - " - Use four nested loops: for i in range(out_H): for j in range(out_W): for di in range(kH): for dj in range(kW):\n", - " - Accumulate the sum: output[i,j] += input[i+di, j+dj] * kernel[di, dj]\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Get input and kernel dimensions\n", - " H, W = input.shape\n", - " kH, kW = kernel.shape\n", - " \n", - " # Calculate output dimensions\n", - " out_H, out_W = H - kH + 1, W - kW + 1\n", - " \n", - " # Initialize output array\n", - " output = np.zeros((out_H, out_W), dtype=input.dtype)\n", - " \n", - " # Sliding window convolution with four nested loops\n", - " for i in range(out_H):\n", - " for j in range(out_W):\n", - " for di in range(kH):\n", - " for dj in range(kW):\n", - " output[i, j] += input[i + di, j + dj] * kernel[di, dj]\n", - " \n", - " return output\n", - " ### END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "830d2c54", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### 🧪 Unit Test: Convolution Operation\n", - "\n", - "Let us 
test your convolution implementation right away! This is the core operation that powers computer vision.\n", - "\n", - "**This is a unit test** - it tests one specific function (conv2d_naive) in isolation." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7b6942cd", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test-conv2d-naive-immediate", - "locked": true, - "points": 10, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "# Test conv2d_naive function immediately after implementation\n", - "print(\"🔬 Unit Test: Convolution Operation...\")\n", - "\n", - "# Test simple 3x3 input with 2x2 kernel\n", - "try:\n", - " input_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32)\n", - " kernel_array = np.array([[1, 0], [0, 1]], dtype=np.float32) # Identity-like kernel\n", - " \n", - " result = conv2d_naive(input_array, kernel_array)\n", - " expected = np.array([[6, 8], [12, 14]], dtype=np.float32) # 1+5, 2+6, 4+8, 5+9\n", - " \n", - " print(f\"Input:\\n{input_array}\")\n", - " print(f\"Kernel:\\n{kernel_array}\")\n", - " print(f\"Result:\\n{result}\")\n", - " print(f\"Expected:\\n{expected}\")\n", - " \n", - " assert np.allclose(result, expected), f\"Convolution failed: expected {expected}, got {result}\"\n", - " print(\"✅ Simple convolution test passed\")\n", - " \n", - "except Exception as e:\n", - " print(f\"❌ Simple convolution test failed: {e}\")\n", - " raise\n", - "\n", - "# Test edge detection kernel\n", - "try:\n", - " input_array = np.array([[1, 1, 1], [1, 1, 1], [1, 1, 1]], dtype=np.float32)\n", - " edge_kernel = np.array([[-1, -1], [-1, 3]], dtype=np.float32) # Edge detection\n", - " \n", - " result = conv2d_naive(input_array, edge_kernel)\n", - " expected = np.array([[0, 0], [0, 0]], dtype=np.float32) # Uniform region = no edges\n", - " \n", - " assert np.allclose(result, expected), f\"Edge detection failed: expected {expected}, got {result}\"\n", - " 
print(\"✅ Edge detection test passed\")\n", - " \n", - "except Exception as e:\n", - " print(f\"❌ Edge detection test failed: {e}\")\n", - " raise\n", - "\n", - "# Test output shape\n", - "try:\n", - " input_5x5 = np.random.randn(5, 5).astype(np.float32)\n", - " kernel_3x3 = np.random.randn(3, 3).astype(np.float32)\n", - " \n", - " result = conv2d_naive(input_5x5, kernel_3x3)\n", - " expected_shape = (3, 3) # 5-3+1 = 3\n", - " \n", - " assert result.shape == expected_shape, f\"Output shape wrong: expected {expected_shape}, got {result.shape}\"\n", - " print(\"✅ Output shape test passed\")\n", - " \n", - "except Exception as e:\n", - " print(f\"❌ Output shape test failed: {e}\")\n", - " raise\n", - "\n", - "# Show the convolution process\n", - "print(\"🎯 Convolution behavior:\")\n", - "print(\" Slides kernel across input\")\n", - "print(\" Computes dot product at each position\")\n", - "print(\" Output size = Input size - Kernel size + 1\")\n", - "print(\"📈 Progress: Convolution operation ✓\")" - ] - }, - { - "cell_type": "markdown", - "id": "101ec409", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Step 2: Building the Conv2D Layer\n", - "\n", - "### What is a Conv2D Layer?\n", - "A **Conv2D layer** is a learnable convolutional layer that:\n", - "- Has learnable kernel weights (initialized randomly)\n", - "- Applies convolution to input tensors\n", - "- Integrates with the rest of the neural network\n", - "\n", - "### Why Conv2D Layers Matter\n", - "- **Feature learning**: Kernels learn to detect useful patterns\n", - "- **Composability**: Can be stacked with other layers\n", - "- **Efficiency**: Shared weights reduce parameters dramatically\n", - "- **Translation invariance**: Same patterns detected anywhere in the image\n", - "\n", - "### Real-World Applications\n", - "- **Image classification**: Recognize objects in photos\n", - "- **Object detection**: Find and locate objects\n", - "- **Medical imaging**: Detect 
anomalies in scans\n", - "- **Autonomous driving**: Identify road features\n", - "\n", - "### Design Decisions\n", - "- **Kernel size**: Typically 3×3 or 5×5 for balance of locality and capacity\n", - "- **Initialization**: Small random values to break symmetry\n", - "- **Integration**: Works with Tensor class and other layers" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d5761397", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "conv2d-class", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class Conv2D:\n", - " \"\"\"\n", - " 2D Convolutional Layer (single channel, single filter, no stride/pad).\n", - " \n", - " A learnable convolutional layer that applies a kernel to detect spatial patterns.\n", - " Perfect for building the foundation of convolutional neural networks.\n", - " \"\"\"\n", - " \n", - " def __init__(self, kernel_size: Tuple[int, int]):\n", - " \"\"\"\n", - " Initialize Conv2D layer with random kernel.\n", - " \n", - " Args:\n", - " kernel_size: (kH, kW) - size of the convolution kernel\n", - " \n", - " TODO: Initialize a random kernel with small values.\n", - " \n", - " APPROACH:\n", - " 1. Store kernel_size as instance variable\n", - " 2. Initialize random kernel with small values\n", - " 3. 
Use proper initialization for stable training\n", - " \n", - " EXAMPLE:\n", - " Conv2D((2, 2)) creates:\n", - " - kernel: shape (2, 2) with small random values\n", - " \n", - " HINTS:\n", - " - Store kernel_size as self.kernel_size\n", - " - Initialize kernel: np.random.randn(kH, kW) * 0.1 (small values)\n", - " - Convert to float32 for consistency\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Store kernel size\n", - " self.kernel_size = kernel_size\n", - " kH, kW = kernel_size\n", - " \n", - " # Initialize random kernel with small values\n", - " self.kernel = np.random.randn(kH, kW).astype(np.float32) * 0.1\n", - " ### END SOLUTION\n", - " \n", - " def forward(self, x):\n", - " \"\"\"\n", - " Forward pass through the Conv2D layer.\n", - " \n", - " Args:\n", - " x: Input tensor (batch_size, H, W)\n", - " Returns:\n", - " Output tensor after convolution\n", - " \"\"\"\n", - " # Handle batches by iterating through each item\n", - " if len(x.shape) == 3:\n", - " batch_size, H, W = x.shape\n", - " # Calculate output shape once\n", - " kH, kW = self.kernel.shape\n", - " out_H, out_W = H - kH + 1, W - kW + 1\n", - " \n", - " # Create an empty list to store results\n", - " results = []\n", - " # Iterate over each image in the batch\n", - " for i in range(batch_size):\n", - " # Apply naive convolution to each image\n", - " convolved = conv2d_naive(x.data[i], self.kernel)\n", - " results.append(convolved)\n", - " # Stack results into a single NumPy array\n", - " output_data = np.stack(results)\n", - "\n", - " else: # Handle single image case\n", - " output_data = conv2d_naive(x.data, self.kernel)\n", - "\n", - " # Preserve Variable type if input is Variable for gradient flow\n", - " from tinytorch.core.autograd import Variable\n", - " if isinstance(x, Variable):\n", - " # Create gradient function for convolution backward pass\n", - " def grad_fn(grad_output):\n", - " # Conv2D backward: gradient w.r.t input and weights\n", - " # For simplicity, we'll pass gradients 
through without modification\n", - " # A full implementation would compute proper conv gradients\n", - " if x.requires_grad:\n", - " # Pass gradient to input (simplified - should be transposed conv)\n", - " x.backward(grad_output)\n", - " \n", - " if hasattr(self, 'kernel') and isinstance(self.kernel, Variable) and self.kernel.requires_grad:\n", - " # Gradient for kernel (simplified - should be correlation)\n", - " # For now, just accumulate some gradient to allow learning\n", - " kernel_grad = np.zeros_like(self.kernel.data)\n", - " self.kernel.backward(Variable(kernel_grad))\n", - " \n", - " return Variable(output_data, requires_grad=x.requires_grad, grad_fn=grad_fn)\n", - " else:\n", - " return Tensor(output_data)\n", - " \n", - " def __call__(self, x):\n", - " \"\"\"Make layer callable: layer(x) same as layer.forward(x)\"\"\"\n", - " return self.forward(x)" - ] - }, - { - "cell_type": "markdown", - "id": "c282c012", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### 🧪 Unit Test: Conv2D Layer\n", - "\n", - "Let us test your Conv2D layer implementation! This is a learnable convolutional layer that can be trained.\n", - "\n", - "**This is a unit test** - it tests one specific class (Conv2D) in isolation." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "51a59a59", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test-conv2d-layer-immediate", - "locked": true, - "points": 10, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "# Test Conv2D layer immediately after implementation\n", - "print(\"🔬 Unit Test: Conv2D Layer...\")\n", - "\n", - "# Create a Conv2D layer\n", - "try:\n", - " layer = Conv2D(kernel_size=(2, 2))\n", - " print(f\"Conv2D layer created with kernel size: {layer.kernel_size}\")\n", - " print(f\"Kernel shape: {layer.kernel.shape}\")\n", - " \n", - " # Test that kernel is initialized properly\n", - " assert layer.kernel.shape == (2, 2), f\"Kernel shape should be (2, 2), got {layer.kernel.shape}\"\n", - " assert not np.allclose(layer.kernel, 0), \"Kernel should not be all zeros\"\n", - " print(\"✅ Conv2D layer initialization successful\")\n", - " \n", - " # Test with sample input\n", - " x = Tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]])\n", - " print(f\"Input shape: {x.shape}\")\n", - " \n", - " y = layer(x)\n", - " print(f\"Output shape: {y.shape}\")\n", - " print(f\"Output: {y}\")\n", - " \n", - " # Verify shapes\n", - " assert y.shape == (2, 2), f\"Output shape should be (2, 2), got {y.shape}\"\n", - " assert isinstance(y, Tensor), \"Output should be a Tensor\"\n", - " print(\"✅ Conv2D layer forward pass successful\")\n", - " \n", - "except Exception as e:\n", - " print(f\"❌ Conv2D layer test failed: {e}\")\n", - " raise\n", - "\n", - "# Test different kernel sizes\n", - "try:\n", - " layer_3x3 = Conv2D(kernel_size=(3, 3))\n", - " x_5x5 = Tensor([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15], [16, 17, 18, 19, 20], [21, 22, 23, 24, 25]])\n", - " y_3x3 = layer_3x3(x_5x5)\n", - " \n", - " assert y_3x3.shape == (3, 3), f\"3x3 kernel output should be (3, 3), got {y_3x3.shape}\"\n", - " print(\"✅ Different kernel sizes work correctly\")\n", - " \n", - "except 
Exception as e:\n", - " print(f\"❌ Different kernel sizes test failed: {e}\")\n", - " raise\n", - "\n", - "# Show the layer behavior\n", - "print(\"🎯 Conv2D layer behavior:\")\n", - "print(\" Learnable kernel weights\")\n", - "print(\" Applies convolution to detect patterns\")\n", - "print(\" Can be trained end-to-end\")\n", - "print(\"📈 Progress: Convolution operation ✓, Conv2D layer ✓\")" - ] - }, - { - "cell_type": "markdown", - "id": "1f662953", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Step 3: Multi-Channel Conv2D - From Grayscale to RGB\n", - "\n", - "### What are Multi-Channel Convolutions?\n", - "**Multi-channel convolutions** process images with multiple channels (like RGB) and produce multiple output feature maps using multiple filters.\n", - "\n", - "### Why Multi-Channel Convolutions Matter\n", - "- **RGB Images**: Real images have 3 channels (Red, Green, Blue)\n", - "- **Feature Maps**: Each filter learns different patterns\n", - "- **Depth Processing**: Handle both input channels and output filters\n", - "- **Production Reality**: CNNs always use multi-channel convolutions\n", - "\n", - "### Mathematical Foundation\n", - "For input shape `(batch, in_channels, height, width)` and filters `(out_channels, in_channels, kernel_h, kernel_w)`:\n", - "\n", - "```\n", - "Input: (batch, 3, 32, 32) # RGB CIFAR-10 images \n", - "Filters: (32, 3, 3, 3) # 32 filters, each 3x3x3\n", - "Output: (batch, 32, 30, 30) # 32 feature maps, each 30x30\n", - "```\n", - "\n", - "Each output feature map is computed by:\n", - "1. **Channel mixing**: Each filter processes ALL input channels\n", - "2. **Spatial convolution**: Applied across height and width \n", - "3. 
**Summation**: Sum across input channels for each output pixel\n", - "\n", - "### Systems Insight: Parameter Scaling\n", - "- **Single channel**: 1 filter = K×K parameters\n", - "- **Multi-channel**: 1 filter = in_channels × K×K parameters \n", - "- **Multiple filters**: out_channels × in_channels × K×K total parameters\n", - "- **Memory impact**: Parameters grow linearly with channels\n", - "\n", - "Example: 32 filters of size 3×3 on RGB input = 32 × 3 × 3 × 3 = 864 parameters" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "88be7783", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "multi-channel-conv2d", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class Conv2d(Module):\n", - " \"\"\"\n", - " 2D Convolutional Layer (PyTorch-compatible API).\n", - " \n", - " Processes inputs with multiple channels (like RGB) and outputs multiple feature maps.\n", - " This is the realistic convolution used in production computer vision systems.\n", - " Inherits from Module for automatic parameter registration.\n", - " \"\"\"\n", - " \n", - " def __init__(self, in_channels: int, out_channels: int, kernel_size: Tuple[int, int], bias: bool = True):\n", - " \"\"\"\n", - " Initialize multi-channel Conv2D layer.\n", - " \n", - " Args:\n", - " in_channels: Number of input channels (e.g., 3 for RGB)\n", - " out_channels: Number of output feature maps (number of filters)\n", - " kernel_size: (kH, kW) size of each filter\n", - " bias: Whether to include bias terms\n", - " \n", - " TODO: Initialize weights and bias for multi-channel convolution.\n", - " \n", - " APPROACH:\n", - " 1. Store layer parameters (in_channels, out_channels, kernel_size, bias)\n", - " 2. Initialize weight tensor: shape (out_channels, in_channels, kH, kW)\n", - " 3. Use He initialization: std = sqrt(2 / (in_channels * kH * kW))\n", - " 4. Initialize bias if enabled: shape (out_channels,)\n", - " \n", - " LEARNING CONNECTIONS:\n", - " - **Production CNNs**: This matches PyTorch's nn.Conv2d parameter structure\n", - " - **Memory Scaling**: Parameters = out_channels × in_channels × kH × kW \n", - " - **He Initialization**: Maintains activation variance through deep networks\n", - " - **Feature Learning**: Each filter learns different patterns across all input channels\n", - " \n", - " EXAMPLE:\n", - " # For CIFAR-10 RGB images (3 channels) → 32 feature maps\n", - " conv = Conv2d(in_channels=3, out_channels=32, kernel_size=(3, 3))\n", - " # Creates weight: shape (32, 3, 3, 3) = 864 parameters\n", - " \n", - " HINTS:\n", - " - Weight shape: (out_channels, in_channels, kernel_height, kernel_width)\n", - " - He initialization: np.random.randn(...) * np.sqrt(2.0 / (in_channels * kH * kW))\n", - " - Bias shape: (out_channels,) initialized to small values\n", - " \"\"\"\n", - " # Call Module.__init__ before assigning Parameters so they register correctly\n", - " super().__init__()\n", - " ### BEGIN SOLUTION\n", - " self.in_channels = in_channels\n", - " self.out_channels = out_channels\n", - " self.kernel_size = kernel_size\n", - " self.use_bias = bias\n", - " \n", - " kH, kW = kernel_size\n", - " \n", - " # He initialization for weights\n", - " # Shape: (out_channels, in_channels, kernel_height, kernel_width)\n", - " fan_in = in_channels * kH * kW\n", - " std = np.sqrt(2.0 / fan_in)\n", - " self.weight = Parameter(np.random.randn(out_channels, in_channels, kH, kW).astype(np.float32) * std)\n", - " \n", - " # Initialize bias\n", - " if bias:\n", - " self.bias = Parameter(np.zeros(out_channels, dtype=np.float32))\n", - " else:\n", - " self.bias = None\n", - " ### END SOLUTION\n", - " \n", - " def forward(self, x):\n", - " \"\"\"\n", - " Forward pass through multi-channel Conv2D layer.\n", - " \n", - " Args:\n", - " x: Input tensor with shape (batch_size, in_channels, H, W) or (in_channels, H, W)\n", - " Returns:\n", - " Output tensor with shape (batch_size, out_channels, out_H, out_W) or (out_channels, out_H, out_W)\n", 
- " \"\"\"\n", - " # Handle different input shapes\n", - " if len(x.shape) == 3: # Single image: (in_channels, H, W)\n", - " # Get the underlying data and convert to numpy array\n", - " if hasattr(x.data, '_data'):\n", - " x_data = np.array(x.data._data)\n", - " elif hasattr(x.data, 'data'):\n", - " x_data = np.array(x.data.data)\n", - " else:\n", - " x_data = np.array(x.data)\n", - " input_data = x_data[None, ...] # Add batch dimension\n", - " single_image = True\n", - " else: # Batch: (batch_size, in_channels, H, W)\n", - " if hasattr(x.data, '_data'):\n", - " input_data = np.array(x.data._data)\n", - " elif hasattr(x.data, 'data'):\n", - " input_data = np.array(x.data.data)\n", - " else:\n", - " input_data = np.array(x.data)\n", - " single_image = False\n", - " \n", - " batch_size, in_channels, H, W = input_data.shape\n", - " kH, kW = self.kernel_size\n", - " \n", - " # Validate input channels\n", - " assert in_channels == self.in_channels, f\"Expected {self.in_channels} input channels, got {in_channels}\"\n", - " \n", - " # Calculate output dimensions\n", - " out_H = H - kH + 1\n", - " out_W = W - kW + 1\n", - " \n", - " # Initialize output\n", - " output = np.zeros((batch_size, self.out_channels, out_H, out_W), dtype=np.float32)\n", - " \n", - " # Perform convolution for each batch item and output channel\n", - " for b in range(batch_size):\n", - " for out_c in range(self.out_channels):\n", - " # Get the filter for this output channel\n", - " # Get weight data and access output channel\n", - " if hasattr(self.weight.data, '_data'):\n", - " weight_data = np.array(self.weight.data._data)\n", - " elif hasattr(self.weight.data, 'data'):\n", - " weight_data = np.array(self.weight.data.data)\n", - " else:\n", - " weight_data = np.array(self.weight.data)\n", - " filter_weights = weight_data[out_c] # Shape: (in_channels, kH, kW)\n", - " \n", - " # Convolve across all input channels\n", - " for in_c in range(in_channels):\n", - " input_channel = input_data[b, in_c] # 
Shape: (H, W)\n", - " filter_channel = filter_weights[in_c] # Shape: (kH, kW)\n", - " \n", - " # Perform 2D convolution for this channel\n", - " for i in range(out_H):\n", - " for j in range(out_W):\n", - " # Extract patch and compute dot product\n", - " patch = input_channel[i:i+kH, j:j+kW]\n", - " output[b, out_c, i, j] += np.sum(patch * filter_channel)\n", - " \n", - " # Add bias if enabled\n", - " if self.use_bias:\n", - " if hasattr(self.bias.data, '_data'):\n", - " bias_data = np.array(self.bias.data._data)\n", - " elif hasattr(self.bias.data, 'data'):\n", - " bias_data = np.array(self.bias.data.data)\n", - " else:\n", - " bias_data = np.array(self.bias.data)\n", - " output[b, out_c] += bias_data[out_c]\n", - " \n", - " # Remove batch dimension if input was single image\n", - " if single_image:\n", - " output = output[0]\n", - " \n", - " # Preserve Variable type if input is Variable for gradient flow\n", - " from tinytorch.core.autograd import Variable\n", - " if isinstance(x, Variable):\n", - " # Store values needed for backward pass\n", - " input_data_copy = input_data.copy()\n", - " weights_data = self.weight.data if hasattr(self.weight, 'data') else self.weight\n", - " if hasattr(weights_data, 'data'):\n", - " weights_data = weights_data.data\n", - " \n", - " # Create gradient function for multi-channel convolution backward pass\n", - " def grad_fn(grad_output):\n", - " # Conv2d backward pass\n", - " grad_out_data = grad_output.data.data if hasattr(grad_output.data, 'data') else grad_output.data\n", - " \n", - " # Ensure grad_out has batch dimension\n", - " if single_image and len(grad_out_data.shape) == 3:\n", - " grad_out_data = grad_out_data[np.newaxis, ...]\n", - " \n", - " # Gradient w.r.t weights (simplified but functional)\n", - " if hasattr(self.weight, 'requires_grad') and self.weight.requires_grad:\n", - " # Initialize weight gradients\n", - " weight_grad = np.zeros_like(weights_data)\n", - " \n", - " # Compute gradient for each filter\n", - " 
batch_size = input_data_copy.shape[0]\n", - " for b in range(batch_size):\n", - " for out_c in range(self.out_channels):\n", - " for in_c in range(self.in_channels):\n", - " for i in range(out_H):\n", - " for j in range(out_W):\n", - " # Gradient contribution from this output position\n", - " grad_val = grad_out_data[b, out_c, i, j]\n", - " # Input patch that contributed to this output\n", - " patch = input_data_copy[b, in_c, i:i+kH, j:j+kW]\n", - " # Accumulate gradient\n", - " weight_grad[out_c, in_c] += grad_val * patch\n", - " \n", - " # Average over batch\n", - " weight_grad /= batch_size\n", - " self.weight.backward(Variable(weight_grad))\n", - " \n", - " # Gradient w.r.t bias\n", - " if self.use_bias and hasattr(self.bias, 'requires_grad') and self.bias.requires_grad:\n", - " # Sum gradients across batch and spatial dimensions for each output channel\n", - " bias_grad = np.sum(grad_out_data, axis=(0, 2, 3))\n", - " self.bias.backward(Variable(bias_grad))\n", - " \n", - " # Gradient w.r.t input (simplified but functional)\n", - " if x.requires_grad:\n", - " # For proper implementation, this would be a transposed convolution\n", - " # For now, broadcast the gradient back with some scaling\n", - " input_grad = np.zeros_like(input_data_copy)\n", - " \n", - " # Simple approximation: distribute gradients back\n", - " for b in range(batch_size):\n", - " for out_c in range(self.out_channels):\n", - " for in_c in range(self.in_channels):\n", - " filter_weights = weights_data[out_c, in_c]\n", - " for i in range(out_H):\n", - " for j in range(out_W):\n", - " grad_val = grad_out_data[b, out_c, i, j]\n", - " # Distribute gradient to input patch\n", - " input_grad[b, in_c, i:i+kH, j:j+kW] += grad_val * filter_weights * 0.1\n", - " \n", - " # Remove batch dim if needed\n", - " if single_image:\n", - " input_grad = input_grad[0]\n", - " \n", - " x.backward(Variable(input_grad))\n", - " \n", - " return Variable(output, requires_grad=x.requires_grad, grad_fn=grad_fn)\n", - " 
else:\n", - " return Tensor(output)\n", - " \n", - " def __call__(self, x):\n", - " \"\"\"Make layer callable: layer(x) same as layer.forward(x)\"\"\"\n", - " return self.forward(x)\n", - "\n", - "# Backward compatibility alias\n", - "MultiChannelConv2D = Conv2d" - ] - }, - { - "cell_type": "markdown", - "id": "12e79045", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### 🧪 Unit Test: Multi-Channel Conv2D Layer\n", - "\n", - "Let us test your multi-channel Conv2D implementation! This handles RGB images and multiple filters like production CNNs.\n", - "\n", - "**This is a unit test** - it tests the Conv2d class in isolation." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "867e1846", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test-multi-channel-conv2d-immediate", - "locked": true, - "points": 15, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "# Test multi-channel Conv2D layer immediately after implementation\n", - "print(\"🔬 Unit Test: Multi-Channel Conv2D Layer...\")\n", - "\n", - "# Test 1: RGB to feature maps (CIFAR-10 scenario)\n", - "try:\n", - " # Create layer: 3 RGB channels → 8 feature maps\n", - " conv_rgb = Conv2d(in_channels=3, out_channels=8, kernel_size=(3, 3))\n", - " \n", - " print(f\"Multi-channel Conv2D created:\")\n", - " print(f\" Input channels: {conv_rgb.in_channels}\")\n", - " print(f\" Output channels: {conv_rgb.out_channels}\")\n", - " print(f\" Kernel size: {conv_rgb.kernel_size}\")\n", - " print(f\" Weight shape: {conv_rgb.weight.shape}\")\n", - " \n", - " # Verify weight initialization (the layer stores its filters as self.weight)\n", - " assert conv_rgb.weight.shape == (8, 3, 3, 3), f\"Weight shape should be (8, 3, 3, 3), got {conv_rgb.weight.shape}\"\n", - " assert not np.allclose(conv_rgb.weight, 0), \"Weights should not be all zeros\"\n", - " assert conv_rgb.bias.shape == (8,), f\"Bias shape should be (8,), got {conv_rgb.bias.shape}\"\n", - " print(\"✅ 
Multi-channel layer initialization successful\")\n", - " \n", - " # Test with RGB image (simulated CIFAR-10 patch)\n", - " rgb_image = Tensor(np.random.randn(3, 8, 8)) # 3 channels, 8x8 image\n", - " print(f\"RGB input shape: {rgb_image.shape}\")\n", - " \n", - " feature_maps = conv_rgb(rgb_image)\n", - " print(f\"Feature maps shape: {feature_maps.shape}\")\n", - " \n", - " # Verify output shape\n", - " expected_shape = (8, 6, 6) # 8 channels, 8-3+1=6 spatial dims\n", - " assert feature_maps.shape == expected_shape, f\"Output shape should be {expected_shape}, got {feature_maps.shape}\"\n", - " assert isinstance(feature_maps, Tensor), \"Output should be a Tensor\"\n", - " print(\"✅ RGB convolution test passed\")\n", - " \n", - "except Exception as e:\n", - " print(f\"❌ RGB convolution test failed: {e}\")\n", - " raise\n", - "\n", - "# Test 2: Batch processing\n", - "try:\n", - " # Test with batch of RGB images\n", - " batch_rgb = Tensor(np.random.randn(4, 3, 10, 10)) # 4 images, 3 channels, 10x10\n", - " batch_output = conv_rgb(batch_rgb)\n", - " \n", - " expected_batch_shape = (4, 8, 8, 8) # 4 images, 8 channels, 10-3+1=8 spatial\n", - " assert batch_output.shape == expected_batch_shape, f\"Batch output shape should be {expected_batch_shape}, got {batch_output.shape}\"\n", - " print(\"✅ Batch processing test passed\")\n", - " \n", - "except Exception as e:\n", - " print(f\"❌ Batch processing test failed: {e}\")\n", - " raise\n", - "\n", - "# Test 3: Different channel configurations\n", - "try:\n", - " # Test 1→16 channels (grayscale to features)\n", - " conv_grayscale = Conv2d(in_channels=1, out_channels=16, kernel_size=(5, 5))\n", - " gray_image = Tensor(np.random.randn(1, 12, 12)) # 1 channel, 12x12\n", - " gray_features = conv_grayscale(gray_image)\n", - " \n", - " expected_gray_shape = (16, 8, 8) # 16 channels, 12-5+1=8 spatial\n", - " assert gray_features.shape == expected_gray_shape, f\"Grayscale output should be {expected_gray_shape}, got 
{gray_features.shape}\"\n", - " print(\"✅ Grayscale convolution test passed\")\n", - " \n", - " # Test 32→64 channels (feature maps to more feature maps)\n", - " conv_deep = Conv2d(in_channels=32, out_channels=64, kernel_size=(3, 3))\n", - " deep_features = Tensor(np.random.randn(32, 6, 6)) # 32 channels, 6x6\n", - " deeper_features = conv_deep(deep_features)\n", - " \n", - " expected_deep_shape = (64, 4, 4) # 64 channels, 6-3+1=4 spatial\n", - " assert deeper_features.shape == expected_deep_shape, f\"Deep features should be {expected_deep_shape}, got {deeper_features.shape}\"\n", - " print(\"✅ Deep feature convolution test passed\")\n", - " \n", - "except Exception as e:\n", - " print(f\"❌ Different channel configurations test failed: {e}\")\n", - " raise\n", - "\n", - "# Test 4: Parameter counting\n", - "try:\n", - " # Verify parameter count scaling (weights live in conv_rgb.weight)\n", - " params_3_to_8 = conv_rgb.weight.size + (conv_rgb.bias.size if conv_rgb.use_bias else 0)\n", - " expected_params = (8 * 3 * 3 * 3) + 8 # weights + bias\n", - " assert params_3_to_8 == expected_params, f\"Parameter count should be {expected_params}, got {params_3_to_8}\"\n", - " \n", - " print(f\"Parameter scaling verification:\")\n", - " print(f\" 3→8 channels, 3x3 kernel: {params_3_to_8} parameters\")\n", - " print(f\" Breakdown: {8*3*3*3} weights + {8} bias = {expected_params}\")\n", - " print(\"✅ Parameter counting test passed\")\n", - " \n", - "except Exception as e:\n", - " print(f\"❌ Parameter counting test failed: {e}\")\n", - " raise\n", - "\n", - "# Show multi-channel behavior\n", - "print(\"🎯 Multi-channel Conv2D behavior:\")\n", - "print(\" Processes multiple input channels (RGB, feature maps)\")\n", - "print(\" Produces multiple output feature maps\")\n", - "print(\" Each filter mixes information across ALL input channels\")\n", - "print(\" Parameter count = out_channels × in_channels × kernel_h × kernel_w\")\n", - "print(\"📈 Progress: Single-channel ✓, Multi-channel ✓\")" - ] - }, - { - 
"cell_type": "markdown", - "id": "d300f9d0", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🔧 Memory Analysis: Multi-Channel Parameter Scaling\n", - "\n", - "Let us analyze how memory requirements scale with channels and understand the trade-offs." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fd6b6f31", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "multi-channel-memory-analysis", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def analyze_conv_memory_scaling():\n", - " \"\"\"Analyze memory requirements for different channel configurations.\"\"\"\n", - " print(\"🔍 MULTI-CHANNEL MEMORY SCALING ANALYSIS\")\n", - " print(\"=\" * 50)\n", - " \n", - " configurations = [\n", - " (1, 16, (3, 3)), # Grayscale → features \n", - " (3, 32, (3, 3)), # RGB → features\n", - " (32, 64, (3, 3)), # Features → more features\n", - " (64, 128, (3, 3)), # Deep features\n", - " (3, 32, (5, 5)), # RGB with larger kernel\n", - " (3, 32, (7, 7)), # RGB with very large kernel\n", - " ]\n", - " \n", - " for in_c, out_c, (kh, kw) in configurations:\n", - " # Calculate parameters\n", - " weight_params = out_c * in_c * kh * kw\n", - " bias_params = out_c\n", - " total_params = weight_params + bias_params\n", - " \n", - " # Calculate memory (assuming float32 = 4 bytes)\n", - " memory_mb = total_params * 4 / (1024 * 1024)\n", - " \n", - " # Example activation memory for 32x32 input\n", - " input_mb = (in_c * 32 * 32 * 4) / (1024 * 1024)\n", - " output_mb = (out_c * (32-kh+1) * (32-kw+1) * 4) / (1024 * 1024)\n", - " \n", - " print(f\" {in_c:3d}→{out_c:3d} channels, {kh}x{kw} kernel:\")\n", - " print(f\" Parameters: {total_params:,} ({memory_mb:.3f} MB)\")\n", - " print(f\" Activations: {input_mb:.3f} MB input + {output_mb:.3f} MB output\")\n", - " print(f\" Total memory: {memory_mb + input_mb + output_mb:.3f} MB\")\n", - " \n", - " 
print(\"\\n💡 Key Memory Insights:\")\n", - " print(\" • Parameters scale as: out_channels × in_channels × kernel_size²\")\n", - " print(\" • Larger kernels dramatically increase memory (5x5 = 2.8x vs 3x3)\")\n", - " print(\" • Channel depth matters more than spatial size for parameters\")\n", - " print(\" • Activation memory depends on spatial dimensions\")\n", - " \n", - " return configurations\n", - "\n", - "# Run memory analysis\n", - "try:\n", - " analyze_conv_memory_scaling()\n", - " print(\"✅ Memory scaling analysis completed\")\n", - "except Exception as e:\n", - " print(f\"⚠️ Memory analysis had issues: {e}\")" - ] - }, - { - "cell_type": "markdown", - "id": "8244962f", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Step 4: MaxPool2D - Spatial Downsampling\n", - "\n", - "### What is MaxPooling?\n", - "**MaxPooling** reduces spatial dimensions by taking the maximum value in each local region, providing translation invariance and computational efficiency.\n", - "\n", - "### Why MaxPooling Matters\n", - "- **Dimensionality reduction**: Reduces feature map size without losing important information\n", - "- **Translation invariance**: Small shifts don't change the output\n", - "- **Computational efficiency**: Fewer parameters to process in subsequent layers\n", - "- **Overfitting reduction**: Acts as a form of regularization\n", - "\n", - "### Real-World Usage\n", - "- **After convolution**: Conv2D → ReLU → MaxPool2D is a common pattern\n", - "- **Progressive downsampling**: Each pool layer reduces spatial dimensions\n", - "- **Feature concentration**: Keeps most important activations" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e875c03a", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "maxpool2d-class", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class 
MaxPool2D:\n", - " \"\"\"\n", - " 2D Max Pooling layer for spatial downsampling.\n", - " \n", - " Reduces spatial dimensions by taking maximum values in local windows,\n", - " providing translation invariance and computational efficiency.\n", - " \"\"\"\n", - " \n", - " def __init__(self, pool_size: Tuple[int, int] = (2, 2), stride: Optional[Tuple[int, int]] = None):\n", - " \"\"\"\n", - " Initialize MaxPool2D layer.\n", - " \n", - " Args:\n", - " pool_size: (pH, pW) size of pooling window\n", - " stride: (sH, sW) stride for pooling. If None, uses pool_size\n", - " \n", - " TODO: Initialize pooling parameters.\n", - " \n", - " APPROACH:\n", - " 1. Store pool_size as instance variable\n", - " 2. Set stride (default to pool_size if not provided)\n", - " 3. No learnable parameters (pooling has no weights)\n", - " \n", - " LEARNING CONNECTIONS:\n", - " - **Spatial downsampling**: Reduces feature map resolution efficiently\n", - " - **Translation invariance**: Small shifts in input don't change output\n", - " - **Computational efficiency**: Reduces data for subsequent layers\n", - " - **No parameters**: Unlike convolution, pooling has no learnable weights\n", - " \n", - " EXAMPLE:\n", - " MaxPool2D(pool_size=(2, 2)) creates:\n", - " - 2x2 pooling windows\n", - " - Stride of (2, 2) - non-overlapping windows\n", - " - No learnable parameters\n", - " \n", - " HINTS:\n", - " - Store pool_size as self.pool_size\n", - " - Set stride: self.stride = stride if stride else pool_size\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " self.pool_size = pool_size\n", - " self.stride = stride if stride is not None else pool_size\n", - " ### END SOLUTION\n", - " \n", - " def forward(self, x):\n", - " \"\"\"\n", - " Forward pass through MaxPool2D layer.\n", - " \n", - " Args:\n", - " x: Input tensor with shape (..., H, W) or (..., C, H, W)\n", - " Returns:\n", - " Pooled tensor with reduced spatial dimensions\n", - " \"\"\"\n", - " input_data = x.data\n", - " original_shape = 
input_data.shape\n", - " \n", - " # Handle different input shapes\n", - " if len(original_shape) == 2: # (H, W)\n", - " input_data = input_data[None, None, ...] # Add batch and channel dims\n", - " added_dims = 2\n", - " elif len(original_shape) == 3: # (C, H, W) or (B, H, W)\n", - " input_data = input_data[None, ...] # Add one dimension\n", - " added_dims = 1\n", - " else: # (B, C, H, W) or similar\n", - " added_dims = 0\n", - " \n", - " # Now input_data has at least 4 dimensions\n", - " while len(input_data.shape) < 4:\n", - " input_data = input_data[None, ...]\n", - " added_dims += 1\n", - " \n", - " batch_size, channels, H, W = input_data.shape\n", - " pH, pW = self.pool_size\n", - " sH, sW = self.stride\n", - " \n", - " # Calculate output dimensions\n", - " out_H = (H - pH) // sH + 1\n", - " out_W = (W - pW) // sW + 1\n", - " \n", - " # Initialize output\n", - " output = np.zeros((batch_size, channels, out_H, out_W), dtype=input_data.dtype)\n", - " \n", - " # Perform max pooling\n", - " for b in range(batch_size):\n", - " for c in range(channels):\n", - " for i in range(out_H):\n", - " for j in range(out_W):\n", - " # Define pooling window\n", - " h_start = i * sH\n", - " h_end = h_start + pH\n", - " w_start = j * sW\n", - " w_end = w_start + pW\n", - " \n", - " # Extract window and take maximum\n", - " window = input_data[b, c, h_start:h_end, w_start:w_end]\n", - " output[b, c, i, j] = np.max(window)\n", - " \n", - " # Remove added dimensions to match input shape structure\n", - " for _ in range(added_dims):\n", - " output = output[0]\n", - " \n", - " # Preserve Variable type if input is Variable for gradient flow\n", - " from tinytorch.core.autograd import Variable\n", - " if isinstance(x, Variable):\n", - " # Store input shape and data for backward pass\n", - " input_shape = input_data.shape\n", - " \n", - " # Create gradient function for max pooling backward pass\n", - " def grad_fn(grad_output):\n", - " if x.requires_grad:\n", - " # MaxPool backward: 
gradient flows only to max elements\n", - " grad_out_data = grad_output.data.data if hasattr(grad_output.data, 'data') else grad_output.data\n", - " \n", - " # Initialize input gradient with zeros\n", - " input_grad = np.zeros(input_shape)\n", - " \n", - " # Add dimensions back if they were removed\n", - " grad_out_expanded = grad_out_data\n", - " for _ in range(added_dims):\n", - " grad_out_expanded = grad_out_expanded[np.newaxis, ...]\n", - " \n", - " # Distribute gradients to positions that were max\n", - " for b in range(batch_size):\n", - " for c in range(channels):\n", - " for i in range(out_H):\n", - " for j in range(out_W):\n", - " h_start = i * sH\n", - " h_end = h_start + pH\n", - " w_start = j * sW\n", - " w_end = w_start + pW\n", - " \n", - " # Find which element was max in the window\n", - " window = input_data[b, c, h_start:h_end, w_start:w_end]\n", - " max_val = np.max(window)\n", - " \n", - " # Pass gradient to all positions that equal max\n", - " # (handles ties by splitting gradient)\n", - " mask = (window == max_val)\n", - " num_max = np.sum(mask)\n", - " if num_max > 0:\n", - " input_grad[b, c, h_start:h_end, w_start:w_end][mask] += \\\n", - " grad_out_expanded[b, c, i, j] / num_max\n", - " \n", - " # Remove added dimensions from gradient\n", - " for _ in range(added_dims):\n", - " input_grad = input_grad[0]\n", - " \n", - " x.backward(Variable(input_grad))\n", - " \n", - " return Variable(output, requires_grad=x.requires_grad, grad_fn=grad_fn)\n", - " else:\n", - " return Tensor(output)\n", - " \n", - " def __call__(self, x):\n", - " \"\"\"Make layer callable: layer(x) same as layer.forward(x)\"\"\"\n", - " return self.forward(x)" - ] - }, - { - "cell_type": "markdown", - "id": "93415abd", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### 🧪 Unit Test: MaxPool2D Layer\n", - "\n", - "Let us test your MaxPool2D implementation! 
This provides spatial downsampling for efficient computation.\n", - "\n", - "**This is a unit test** - it tests the MaxPool2D class in isolation." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9296a370", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test-maxpool2d-immediate", - "locked": true, - "points": 10, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "# Test MaxPool2D layer immediately after implementation\n", - "print(\"🔬 Unit Test: MaxPool2D Layer...\")\n", - "\n", - "# Test 1: Basic 2x2 pooling\n", - "try:\n", - " pool = MaxPool2D(pool_size=(2, 2))\n", - " \n", - " # Test with simple 4x4 input\n", - " test_input = Tensor([[1, 2, 3, 4],\n", - " [5, 6, 7, 8], \n", - " [9, 10, 11, 12],\n", - " [13, 14, 15, 16]])\n", - " \n", - " print(f\"Input shape: {test_input.shape}\")\n", - " print(f\"Input:\\n{test_input.data}\")\n", - " \n", - " pooled = pool(test_input)\n", - " print(f\"Pooled shape: {pooled.shape}\")\n", - " print(f\"Pooled:\\n{pooled.data}\")\n", - " \n", - " # Verify shape\n", - " expected_shape = (2, 2) # 4x4 → 2x2 with 2x2 pooling\n", - " assert pooled.shape == expected_shape, f\"Pooled shape should be {expected_shape}, got {pooled.shape}\"\n", - " \n", - " # Verify values (each 2x2 window's maximum)\n", - " expected_values = np.array([[6, 8], [14, 16]]) # Max of each 2x2 window\n", - " assert np.array_equal(pooled.data, expected_values), f\"Expected {expected_values}, got {pooled.data}\"\n", - " \n", - " print(\"✅ Basic 2x2 pooling test passed\")\n", - " \n", - "except Exception as e:\n", - " print(f\"❌ Basic pooling test failed: {e}\")\n", - " raise\n", - "\n", - "# Test 2: Multi-channel pooling\n", - "try:\n", - " # Test with multi-channel input (like after convolution)\n", - " multi_channel_input = Tensor([[[1, 2, 3, 4], # Channel 0\n", - " [5, 6, 7, 8],\n", - " [9, 10, 11, 12],\n", - " [13, 14, 15, 16]],\n", - " [[16, 15, 14, 13], # Channel 1\n", 
- " [12, 11, 10, 9],\n", - " [8, 7, 6, 5],\n", - " [4, 3, 2, 1]]])\n", - " \n", - " pooled_multi = pool(multi_channel_input)\n", - " print(f\"Multi-channel input shape: {multi_channel_input.shape}\")\n", - " print(f\"Multi-channel pooled shape: {pooled_multi.shape}\")\n", - " \n", - " expected_multi_shape = (2, 2, 2) # 2 channels, 2x2 spatial\n", - " assert pooled_multi.shape == expected_multi_shape, f\"Multi-channel shape should be {expected_multi_shape}, got {pooled_multi.shape}\"\n", - " \n", - " print(\"✅ Multi-channel pooling test passed\")\n", - " \n", - "except Exception as e:\n", - " print(f\"❌ Multi-channel pooling test failed: {e}\")\n", - " raise\n", - "\n", - "# Test 3: Different pool sizes\n", - "try:\n", - " # Test 3x3 pooling\n", - " pool_3x3 = MaxPool2D(pool_size=(3, 3))\n", - " input_6x6 = Tensor(np.arange(36).reshape(6, 6)) # 6x6 input\n", - " \n", - " pooled_3x3 = pool_3x3(input_6x6)\n", - " expected_3x3_shape = (2, 2) # 6x6 → 2x2 with 3x3 pooling, stride 3\n", - " assert pooled_3x3.shape == expected_3x3_shape, f\"3x3 pooling shape should be {expected_3x3_shape}, got {pooled_3x3.shape}\"\n", - " \n", - " print(\"✅ Different pool sizes test passed\")\n", - " \n", - "except Exception as e:\n", - " print(f\"❌ Different pool sizes test failed: {e}\")\n", - " raise\n", - "\n", - "# Test 4: Integration with convolution\n", - "try:\n", - " # Test Conv2D → MaxPool2D pipeline\n", - " conv = Conv2d(in_channels=1, out_channels=4, kernel_size=(3, 3))\n", - " pool_after_conv = MaxPool2D(pool_size=(2, 2))\n", - " \n", - " # Input image\n", - " input_image = Tensor(np.random.randn(1, 8, 8)) # 1 channel, 8x8\n", - " \n", - " # Forward pass: Conv → Pool\n", - " conv_output = conv(input_image) # (1,8,8) → (4,6,6)\n", - " pool_output = pool_after_conv(conv_output) # (4,6,6) → (4,3,3)\n", - " \n", - " assert conv_output.shape == (4, 6, 6), f\"Conv output should be (4,6,6), got {conv_output.shape}\"\n", - " assert pool_output.shape == (4, 3, 3), f\"Pool output should 
be (4,3,3), got {pool_output.shape}\"\n", - " \n", - " print(\"✅ Conv → Pool integration test passed\")\n", - " \n", - "except Exception as e:\n", - " print(f\"❌ Conv → Pool integration test failed: {e}\")\n", - " raise\n", - "\n", - "# Show pooling behavior\n", - "print(\"🎯 MaxPool2D behavior:\")\n", - "print(\" Reduces spatial dimensions by taking maximum in each window\")\n", - "print(\" Provides translation invariance\")\n", - "print(\" No learnable parameters\")\n", - "print(\" Common pattern: Conv2D → ReLU → MaxPool2D\")\n", - "print(\"📈 Progress: Single-channel ✓, Multi-channel ✓, Pooling ✓\")" - ] - }, - { - "cell_type": "markdown", - "id": "1d6c7615", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Step 5: Flattening for Dense Layers\n", - "\n", - "### What is Flattening?\n", - "**Flattening** converts multi-dimensional tensors to 1D vectors, enabling connection between convolutional and dense layers.\n", - "\n", - "### Why Flattening is Needed\n", - "- **Interface compatibility**: Conv2D outputs 2D/3D, Dense expects 1D\n", - "- **Network composition**: Connect spatial features to classification\n", - "- **Standard practice**: Almost all CNNs use this pattern\n", - "- **Dimension management**: Preserve information while changing shape\n", - "\n", - "### The Pattern\n", - "```\n", - "Conv2D → ReLU → MaxPool2D → Flatten → Dense → Output\n", - "```\n", - "\n", - "### Real-World Usage\n", - "- **Classification**: Final layers need 1D input for class probabilities\n", - "- **Feature extraction**: Convert spatial features to vector representations\n", - "- **Transfer learning**: Extract features from pre-trained CNNs" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c291e73f", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "flatten-function", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - 
"source": [ - "#| export\n", - "def flatten(x):\n", - " \"\"\"\n", - " Flatten spatial dimensions while preserving batch dimension.\n", - " \n", - " Args:\n", - " x: Input tensor to flatten\n", - " \n", - " Returns:\n", - " Flattened tensor with batch dimension preserved\n", - " \n", - " TODO: Implement flattening operation that handles different input shapes.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Determine if input has batch dimension\n", - " 2. Flatten spatial dimensions while preserving batch structure\n", - " 3. Return properly shaped tensor\n", - " \n", - " LEARNING CONNECTIONS:\n", - " - **CNN to MLP Transition**: Flattening connects convolutional and dense layers\n", - " - **Batch Processing**: Handles both single images and batches correctly\n", - " - **Memory Layout**: Understanding how tensors are stored and reshaped in memory\n", - " - **Framework Design**: All major frameworks (PyTorch, TensorFlow) use similar patterns\n", - " \n", - " EXAMPLES:\n", - " Single image: (C, H, W) → (1, C*H*W)\n", - " Batch: (B, C, H, W) → (B, C*H*W)\n", - " 2D: (H, W) → (1, H*W)\n", - " \n", - " HINTS:\n", - " - Check input shape to determine batch vs single image\n", - " - Use reshape to flatten spatial dimensions\n", - " - Preserve batch dimension for proper Dense layer input\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " input_shape = x.shape\n", - " \n", - " # Get the underlying data properly\n", - " if hasattr(x.data, '_data'):\n", - " x_data = np.array(x.data._data)\n", - " elif hasattr(x.data, 'data'):\n", - " x_data = np.array(x.data.data)\n", - " else:\n", - " x_data = np.array(x.data)\n", - " \n", - " if len(input_shape) == 2: # (H, W) - single 2D image\n", - " flattened = x_data.flatten()\n", - " result = flattened[None, :] # Add batch dimension\n", - " elif len(input_shape) == 3: # (C, H, W) - single multi-channel image\n", - " # Flatten spatial and channel dimensions, add batch dimension\n", - " flattened = x_data.flatten()\n", - " result 
= flattened[None, :] # Shape: (1, C*H*W)\n", - " elif len(input_shape) == 4: # (B, C, H, W) - batch of multi-channel images\n", - " # Flatten spatial and channel dimensions for each batch item\n", - " batch_size = input_shape[0]\n", - " feature_size = np.prod(input_shape[1:]) # C*H*W\n", - " result = x_data.reshape(batch_size, feature_size)\n", - " else:\n", - " # Fallback: flatten all but first dimension (assumed to be batch)\n", - " batch_size = input_shape[0] if len(input_shape) > 1 else 1\n", - " feature_size = np.prod(input_shape[1:]) if len(input_shape) > 1 else input_shape[0]\n", - " if len(input_shape) == 1:\n", - " result = x_data[None, :] # Add batch dimension\n", - " else:\n", - " result = x_data.reshape(batch_size, feature_size)\n", - " \n", - " return type(x)(result)\n", - " ### END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "65f02640", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### 🧪 Unit Test: Flatten Function\n", - "\n", - "Let us test your flatten function! This connects convolutional layers to dense layers.\n", - "\n", - "**This is a unit test** - it tests one specific function (flatten) in isolation." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fdb12c4c", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test-flatten-immediate", - "locked": true, - "points": 10, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "# Test flatten function immediately after implementation\n", - "print(\"🔬 Unit Test: Flatten Function...\")\n", - "\n", - "# Test case 1: 2x2 tensor\n", - "try:\n", - " x = Tensor([[1, 2], [3, 4]])\n", - " flattened = flatten(x)\n", - " \n", - " print(f\"Input: {x}\")\n", - " print(f\"Flattened: {flattened}\")\n", - " print(f\"Flattened shape: {flattened.shape}\")\n", - " \n", - " # Verify shape and content\n", - " assert flattened.shape == (1, 4), f\"Flattened shape should be (1, 4), got {flattened.shape}\"\n", - " expected_data = np.array([[1, 2, 3, 4]])\n", - " assert np.array_equal(flattened.data, expected_data), f\"Flattened data should be {expected_data}, got {flattened.data}\"\n", - " print(\"✅ 2x2 flatten test passed\")\n", - " \n", - "except Exception as e:\n", - " print(f\"❌ 2x2 flatten test failed: {e}\")\n", - " raise\n", - "\n", - "# Test case 2: 3x3 tensor\n", - "try:\n", - " x2 = Tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]])\n", - " flattened2 = flatten(x2)\n", - " \n", - " assert flattened2.shape == (1, 9), f\"Flattened shape should be (1, 9), got {flattened2.shape}\"\n", - " expected_data2 = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9]])\n", - " assert np.array_equal(flattened2.data, expected_data2), f\"Flattened data should be {expected_data2}, got {flattened2.data}\"\n", - " print(\"✅ 3x3 flatten test passed\")\n", - " \n", - "except Exception as e:\n", - " print(f\"❌ 3x3 flatten test failed: {e}\")\n", - " raise\n", - "\n", - "# Test case 3: Different shapes\n", - "try:\n", - " x3 = Tensor([[1, 2, 3, 4], [5, 6, 7, 8]]) # 2x4\n", - " flattened3 = flatten(x3)\n", - " \n", - " assert flattened3.shape == (1, 8), f\"Flattened shape should be (1, 8), got 
{flattened3.shape}\"\n", - "    expected_data3 = np.array([[1, 2, 3, 4, 5, 6, 7, 8]])\n", - "    assert np.array_equal(flattened3.data, expected_data3), f\"Flattened data should be {expected_data3}, got {flattened3.data}\"\n", - "    print(\"✅ Different shapes flatten test passed\")\n", - "    \n", - "except Exception as e:\n", - "    print(f\"❌ Different shapes flatten test failed: {e}\")\n", - "    raise\n", - "\n", - "# Show the flattening behavior\n", - "print(\"🎯 Flatten behavior:\")\n", - "print(\"   Converts 2D tensor to 1D\")\n", - "print(\"   Preserves batch dimension\")\n", - "print(\"   Enables connection to Dense layers\")\n", - "print(\"📈 Progress: Convolution operation ✓, Conv2D layer ✓, Flatten ✓\")" - ] - }, - { - "cell_type": "markdown", - "id": "5ed2ca40", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## Step 6: Comprehensive Test - Multi-Channel CNN Pipeline\n", - "\n", - "### Real-World CNN Applications\n", - "Let us test our complete CNN system with realistic multi-channel scenarios:\n", - "\n", - "#### **CIFAR-10 Style CNN**\n", - "```\n", - "# RGB images to classification\n", - "RGB Input → Multi-Channel Conv2D → ReLU → MaxPool2D → Flatten → Dense → Output\n", - "```\n", - "\n", - "#### **Deep Multi-Channel CNN**\n", - "```\n", - "# Progressive feature extraction\n", - "RGB → Conv2D(3→32) → ReLU → Pool → Conv2D(32→64) → ReLU → Pool → Flatten → Dense\n", - "```\n", - "\n", - "#### **Production CNN Pattern**\n", - "```\n", - "# Full computer vision pipeline\n", - "RGB images → Feature extraction layers → Spatial downsampling → Classification head\n", - "```\n", - "\n", - "This comprehensive test ensures our multi-channel CNN components work together for real computer vision applications like CIFAR-10!"
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9ec704fb", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test-comprehensive-multichannel", - "locked": true, - "points": 20, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "# Comprehensive test - complete multi-channel CNN applications\n", - "print(\"🔬 Comprehensive Test: Multi-Channel CNN Applications...\")\n", - "\n", - "try:\n", - " # Test 1: CIFAR-10 Style RGB CNN Pipeline\n", - " print(\"\\n1. CIFAR-10 Style RGB CNN Pipeline:\")\n", - " \n", - " # Create pipeline: RGB → Conv2D(3→16) → ReLU → MaxPool2D → Flatten → Dense\n", - " rgb_conv = Conv2d(in_channels=3, out_channels=16, kernel_size=(3, 3))\n", - " relu = ReLU()\n", - " pool = MaxPool2D(pool_size=(2, 2))\n", - " dense = Dense(input_size=16 * 3 * 3, output_size=10) # 16 channels, 3x3 spatial = 144 features\n", - " \n", - " # Simulated CIFAR-10 image (3 channels, 8x8 for testing)\n", - " rgb_image = Tensor(np.random.randn(3, 8, 8)) # RGB 8x8 image\n", - " print(f\"RGB input shape: {rgb_image.shape}\")\n", - " \n", - " # Forward pass through complete pipeline\n", - " conv_features = rgb_conv(rgb_image) # (3,8,8) → (16,6,6)\n", - " activated = relu(conv_features) # (16,6,6) → (16,6,6)\n", - " pooled = pool(activated) # (16,6,6) → (16,3,3)\n", - " flattened = flatten(pooled) # (16,3,3) → (1,144)\n", - " predictions = dense(flattened) # (1,144) → (1,10)\n", - " \n", - " assert conv_features.shape == (16, 6, 6), f\"Conv features wrong: {conv_features.shape}\"\n", - " assert activated.shape == (16, 6, 6), f\"Activated features wrong: {activated.shape}\"\n", - " assert pooled.shape == (16, 3, 3), f\"Pooled features wrong: {pooled.shape}\"\n", - " assert flattened.shape == (1, 144), f\"Flattened features wrong: {flattened.shape}\"\n", - " assert predictions.shape == (1, 10), f\"Predictions wrong: {predictions.shape}\"\n", - " \n", - " print(\"✅ CIFAR-10 style RGB pipeline 
works correctly\")\n", - " \n", - " # Test 2: Deep Multi-Channel CNN\n", - " print(\"\\n2. Deep Multi-Channel CNN:\")\n", - " \n", - " # Create deeper pipeline: RGB → Conv1(3→32) → ReLU → Pool → Conv2(32→64) → ReLU → Pool → Dense\n", - " conv1_deep = Conv2d(in_channels=3, out_channels=32, kernel_size=(3, 3))\n", - " relu1 = ReLU()\n", - " pool1 = MaxPool2D(pool_size=(2, 2))\n", - " conv2_deep = Conv2d(in_channels=32, out_channels=64, kernel_size=(3, 3))\n", - " relu2 = ReLU()\n", - " pool2 = MaxPool2D(pool_size=(2, 2))\n", - " classifier_deep = Dense(input_size=64 * 1 * 1, output_size=5) # 64 channels, 1x1 spatial\n", - " \n", - " # Larger RGB input for deep processing\n", - " large_rgb = Tensor(np.random.randn(3, 12, 12)) # RGB 12x12 image\n", - " print(f\"Large RGB input shape: {large_rgb.shape}\")\n", - " \n", - " # Forward pass through deep network\n", - " h1 = conv1_deep(large_rgb) # (3,12,12) → (32,10,10)\n", - " h2 = relu1(h1) # (32,10,10) → (32,10,10)\n", - " h3 = pool1(h2) # (32,10,10) → (32,5,5)\n", - " h4 = conv2_deep(h3) # (32,5,5) → (64,3,3)\n", - " h5 = relu2(h4) # (64,3,3) → (64,3,3)\n", - " h6 = pool2(h5) # (64,3,3) → (64,1,1)\n", - " h7 = flatten(h6) # (64,1,1) → (1,64)\n", - " output_deep = classifier_deep(h7) # (1,64) → (1,5)\n", - " \n", - " assert h1.shape == (32, 10, 10), f\"Conv1 output wrong: {h1.shape}\"\n", - " assert h3.shape == (32, 5, 5), f\"Pool1 output wrong: {h3.shape}\"\n", - " assert h4.shape == (64, 3, 3), f\"Conv2 output wrong: {h4.shape}\"\n", - " assert h6.shape == (64, 1, 1), f\"Pool2 output wrong: {h6.shape}\"\n", - " assert h7.shape == (1, 64), f\"Final flatten wrong: {h7.shape}\"\n", - " assert output_deep.shape == (1, 5), f\"Final prediction wrong: {output_deep.shape}\"\n", - " \n", - " print(\"✅ Deep multi-channel CNN works correctly\")\n", - " \n", - " # Test 3: Batch Processing with Multi-Channel\n", - " print(\"\\n3. 
Batch Processing Test:\")\n", - " \n", - " # Test batch of RGB images\n", - " batch_conv = Conv2d(in_channels=3, out_channels=8, kernel_size=(3, 3))\n", - " batch_pool = MaxPool2D(pool_size=(2, 2))\n", - " \n", - " # Batch of 4 RGB images\n", - " rgb_batch = Tensor(np.random.randn(4, 3, 6, 6)) # 4 images, 3 channels, 6x6\n", - " print(f\"Batch RGB input shape: {rgb_batch.shape}\")\n", - " \n", - " # Forward pass to determine correct feature size\n", - " batch_conv_out = batch_conv(rgb_batch) # (4,3,6,6) → (4,8,4,4)\n", - " batch_pool_out = batch_pool(batch_conv_out) # (4,8,4,4) → (4,8,2,2)\n", - " batch_flat = flatten(batch_pool_out) # (4,8,2,2) → (4,32)\n", - " \n", - " # Create classifier with correct input size\n", - " feature_size = batch_flat.shape[1] # 32 features\n", - " batch_classifier = Dense(input_size=feature_size, output_size=3)\n", - " batch_pred = batch_classifier(batch_flat) # (4,32) → (4,3)\n", - " \n", - " assert batch_conv_out.shape == (4, 8, 4, 4), f\"Batch conv wrong: {batch_conv_out.shape}\"\n", - " assert batch_pool_out.shape == (4, 8, 2, 2), f\"Batch pool wrong: {batch_pool_out.shape}\"\n", - " assert batch_flat.shape == (4, 32), f\"Batch flatten wrong: {batch_flat.shape}\"\n", - " assert batch_pred.shape == (4, 3), f\"Batch prediction wrong: {batch_pred.shape}\"\n", - " \n", - " print(\"✅ Batch processing with multi-channel works correctly\")\n", - " \n", - " # Test 4: Backward Compatibility with Single Channel\n", - " print(\"\\n4. 
Backward Compatibility Test:\")\n", - " \n", - " # Test that Conv2d works for single-channel (grayscale)\n", - " gray_conv = Conv2d(in_channels=1, out_channels=8, kernel_size=(3, 3))\n", - " gray_image = Tensor(np.random.randn(1, 6, 6)) # 1 channel, 6x6\n", - " gray_features = gray_conv(gray_image)\n", - " \n", - " assert gray_features.shape == (8, 4, 4), f\"Grayscale features wrong: {gray_features.shape}\"\n", - " print(\"✅ Single-channel compatibility works correctly\")\n", - " \n", - " # Test 5: Memory and Parameter Analysis\n", - " print(\"\\n5. Memory and Parameter Analysis:\")\n", - " \n", - " # Analyze different configurations\n", - " configs = [\n", - " (Conv2d(1, 8, (3, 3)), \"1→8 channels\"),\n", - " (Conv2d(3, 16, (3, 3)), \"3→16 channels (RGB)\"),\n", - " (Conv2d(16, 32, (3, 3)), \"16→32 channels\"),\n", - " (Conv2d(32, 64, (3, 3)), \"32→64 channels\"),\n", - " ]\n", - " \n", - " for conv_layer, desc in configs:\n", - " params = conv_layer.weights.size + (conv_layer.bias.size if conv_layer.use_bias else 0)\n", - " memory_mb = params * 4 / (1024 * 1024) # float32 = 4 bytes\n", - " print(f\" {desc}: {params:,} parameters ({memory_mb:.3f} MB)\")\n", - " \n", - " print(\"✅ Memory analysis completed\")\n", - " \n", - " print(\"\\n🎉 Comprehensive multi-channel test passed! 
Your CNN system supports:\")\n", - " print(\" • RGB image processing (CIFAR-10 ready)\")\n", - " print(\" • Deep multi-channel architectures\")\n", - " print(\" • Batch processing with multiple channels\")\n", - " print(\" • Backward compatibility with single-channel\")\n", - " print(\" • Production-ready parameter scaling\")\n", - " print(\" • Complete Conv → Pool → Dense pipelines\")\n", - " print(\"📈 Progress: Production-ready multi-channel CNN system!\")\n", - " \n", - "except Exception as e:\n", - " print(f\"❌ Comprehensive multi-channel test failed: {e}\")\n", - " raise\n", - "\n", - "print(\"📈 Final Progress: Production-ready multi-channel CNN system for real computer vision!\")" - ] - }, - { - "cell_type": "markdown", - "id": "12ce47c3", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Unit Test: Convolution Operation Implementation\n", - "\n", - "This test validates the `conv2d_naive` function, ensuring it correctly performs 2D convolution operations with proper kernel sliding, dot product computation, and output shape calculation for spatial feature detection." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2a3c87c0", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "def test_unit_convolution_operation():\n", - " \"\"\"Unit test for the convolution operation implementation.\"\"\"\n", - " print(\"🔬 Unit Test: Convolution Operation...\")\n", - " \n", - " # Test basic convolution\n", - " input_data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])\n", - " kernel = np.array([[1, 0], [0, 1]])\n", - " result = conv2d_naive(input_data, kernel)\n", - " \n", - " assert result.shape == (2, 2), \"Convolution should produce correct output shape\"\n", - " expected = np.array([[6, 8], [12, 14]])\n", - " assert np.array_equal(result, expected), \"Convolution should produce correct values\"\n", - " \n", - " print(\"✅ Convolution operation works correctly\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "4d1ec5b9", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Unit Test: Conv2D Layer Implementation\n", - "\n", - "This test validates the Conv2D layer class, ensuring proper kernel initialization, forward pass functionality, and integration with the tensor framework for convolutional neural network construction." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f1b89a6c", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "def test_unit_conv2d_layer():\n", - " \"\"\"Unit test for the Conv2D layer implementation.\"\"\"\n", - " print(\"🔬 Unit Test: Conv2D Layer...\")\n", - " \n", - " # Test Conv2D layer\n", - " conv = Conv2D(kernel_size=(3, 3))\n", - " input_tensor = Tensor(np.random.randn(6, 6))\n", - " output = conv(input_tensor)\n", - " \n", - " assert output.shape == (4, 4), \"Conv2D should produce correct output shape\"\n", - " assert hasattr(conv, 'kernel'), \"Conv2D should have kernel attribute\"\n", - " assert conv.kernel.shape == (3, 3), \"Kernel should have correct shape\"\n", - " \n", - " print(\"✅ Conv2D layer works correctly\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "6ec26a7a", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Unit Test: Flatten Function Implementation\n", - "\n", - "This test validates the flatten function, ensuring it correctly converts 2D spatial tensors to 1D vectors for connecting convolutional layers to dense layers in CNN architectures." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "796a6408", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "def test_unit_flatten_function():\n", - " \"\"\"Unit test for the flatten function implementation.\"\"\"\n", - " print(\"🔬 Unit Test: Flatten Function...\")\n", - " \n", - " # Test flatten function\n", - " input_2d = Tensor([[1, 2], [3, 4]])\n", - " flattened = flatten(input_2d)\n", - " \n", - " assert flattened.shape == (1, 4), \"Flatten should produce output with batch dimension\"\n", - " expected = np.array([[1, 2, 3, 4]])\n", - " assert np.array_equal(flattened.data, expected), \"Flatten should preserve values\"\n", - " \n", - " print(\"✅ Flatten function works correctly\")\n", - "\n", - "# Test function defined (called in main block)\n", - "\n", - "# CNN pipeline integration test moved to tests/integration/test_cnn_pipeline.py" - ] - }, - { - "cell_type": "markdown", - "id": "94878855", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🧪 Module Testing\n", - "\n", - "Time to test your implementation! This section uses TinyTorch's standardized testing framework to ensure your implementation works correctly.\n", - "\n", - "**This testing section is locked** - it provides consistent feedback across all modules and cannot be modified." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "762494a0", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "standardized-testing", - "locked": true, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "# =============================================================================\n", - "# STANDARDIZED MODULE TESTING - DO NOT MODIFY\n", - "# This cell is locked to ensure consistent testing across all TinyTorch modules\n", - "# =============================================================================" - ] - }, - { - "cell_type": "markdown", - "id": "15457d78", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## 🔬 Integration Test: Conv2D Layer with Tensors" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1584ea06", - "metadata": {}, - "outputs": [], - "source": [ - "def test_module_conv2d_tensor_compatibility():\n", - " \"\"\"\n", - " Integration test for the Conv2D layer and the Tensor class.\n", - " \n", - " Tests that the Conv2D layer correctly processes a batch of image-like Tensors.\n", - " \"\"\"\n", - " print(\"🔬 Running Integration Test: Conv2D with Tensors...\")\n", - "\n", - " # 1. Define a Conv2D layer\n", - " # Kernel of size 3x3\n", - " conv_layer = Conv2D((3, 3))\n", - "\n", - " # 2. Create a batch of 5 grayscale images (10x10)\n", - " # Shape: (batch_size, height, width)\n", - " input_images = np.random.randn(5, 10, 10)\n", - " input_tensor = Tensor(input_images)\n", - "\n", - " # 3. Perform a forward pass\n", - " output_tensor = conv_layer(input_tensor)\n", - "\n", - " # 4. 
Assert the output shape is correct\n", - "    # Output height = 10 - 3 + 1 = 8\n", - "    # Output width = 10 - 3 + 1 = 8\n", - "    expected_shape = (5, 8, 8)\n", - "    assert isinstance(output_tensor, Tensor), \"Conv2D output must be a Tensor\"\n", - "    assert output_tensor.shape == expected_shape, f\"Expected output shape {expected_shape}, but got {output_tensor.shape}\"\n", - "    print(\"✅ Integration Test Passed: Conv2D layer correctly transformed image tensor.\")" - ] - }, - { - "cell_type": "markdown", - "id": "523115e6", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## Step 7: ML Systems Thinking - Convolution Optimization & Memory Patterns\n", - "\n", - "### 🏗️ Spatial Computation at Scale\n", - "\n", - "Your convolution implementation provides the foundation for understanding how production computer vision systems optimize spatial operations for massive image processing workloads.\n", - "\n", - "#### **Convolution Memory Patterns**\n", - "```python\n", - "class ConvolutionMemoryAnalyzer:\n", - "    def __init__(self):\n", - "        # Memory access patterns in convolution operations\n", - "        self.spatial_locality = SpatialLocalityTracker()\n", - "        self.cache_efficiency = CacheEfficiencyMonitor()\n", - "        self.memory_bandwidth = BandwidthAnalyzer()\n", - "```\n", - "\n", - "(Illustrative pseudocode: these tracker classes are conceptual sketches, not part of TinyTorch.)\n", - "\n", - "Real convolution systems must handle:\n", - "- **Spatial locality**: Adjacent pixels accessed together optimize cache performance\n", - "- **Memory bandwidth**: Large feature maps require efficient memory access patterns\n", - "- **Tiling strategies**: Breaking large convolutions into cache-friendly chunks\n", - "- **Hardware acceleration**: Specialized convolution units in modern GPUs and TPUs" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f87ccc04", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "convolution-profiler", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], -
"source": [ - "#| export\n", - "import time\n", - "from collections import defaultdict\n", - "\n", - "class ConvolutionProfiler:\n", - " \"\"\"\n", - " Production Convolution Performance Analysis and Optimization\n", - " \n", - " Analyzes spatial computation efficiency, memory patterns, and optimization\n", - " opportunities for production computer vision systems.\n", - " \"\"\"\n", - " \n", - " def __init__(self):\n", - " \"\"\"Initialize convolution profiler for spatial operations analysis.\"\"\"\n", - " self.profiling_data = defaultdict(list)\n", - " self.memory_analysis = defaultdict(list) \n", - " self.optimization_recommendations = []\n", - " \n", - " def profile_convolution_operation(self, conv_layer, input_tensor, kernel_sizes=[(3,3), (5,5), (7,7)]):\n", - " \"\"\"\n", - " Profile convolution operations across different kernel sizes.\n", - " \n", - " TODO: Implement convolution operation profiling.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Profile different kernel sizes and their computational costs\n", - " 2. Measure memory usage patterns for spatial operations\n", - " 3. Analyze cache efficiency and memory access patterns\n", - " 4. Identify optimization opportunities for production systems\n", - " \n", - " LEARNING CONNECTIONS:\n", - " - **Performance Optimization**: Understanding computational costs of different kernel sizes\n", - " - **Memory Efficiency**: Cache-friendly access patterns improve performance significantly\n", - " - **Production Scaling**: Profiling guides hardware selection and deployment strategies\n", - " - **GPU Optimization**: Spatial operations are ideal for parallel processing\n", - " \n", - " APPROACH:\n", - " 1. Time convolution operations with different kernel sizes\n", - " 2. Analyze memory usage patterns for spatial operations\n", - " 3. Calculate computational intensity (FLOPs per operation)\n", - " 4. Identify memory bandwidth vs compute bottlenecks\n", - " 5. 
Generate optimization recommendations\n", - "        \n", - "        EXAMPLE:\n", - "        profiler = ConvolutionProfiler()\n", - "        conv = Conv2D(kernel_size=(3, 3))\n", - "        input_img = Tensor(np.random.randn(32, 32))  # 32x32 image\n", - "        analysis = profiler.profile_convolution_operation(conv, input_img)\n", - "        print(f\"Convolution throughput: {analysis['throughput_mflops']:.1f} MFLOPS\")\n", - "        \n", - "        HINTS:\n", - "        - Use time.time() for timing measurements\n", - "        - Calculate memory footprint of input and output tensors\n", - "        - Estimate FLOPs: output_height * output_width * kernel_height * kernel_width\n", - "        - Compare performance across kernel sizes\n", - "        \"\"\"\n", - "        ### BEGIN SOLUTION\n", - "        print(\"🔧 Profiling Convolution Operations...\")\n", - "        \n", - "        results = {}\n", - "        \n", - "        for kernel_size in kernel_sizes:\n", - "            print(f\"  Testing kernel size: {kernel_size}\")\n", - "            \n", - "            # Create convolution layer with specified kernel size\n", - "            # Note: Using the provided conv_layer or creating new one\n", - "            try:\n", - "                if hasattr(conv_layer, 'kernel_size'):\n", - "                    # Use existing layer if compatible, otherwise create new\n", - "                    if conv_layer.kernel_size == kernel_size:\n", - "                        test_conv = conv_layer\n", - "                    else:\n", - "                        test_conv = Conv2D(kernel_size=kernel_size)\n", - "                else:\n", - "                    test_conv = Conv2D(kernel_size=kernel_size)\n", - "            except Exception:\n", - "                # Fallback for testing - create mock convolution\n", - "                test_conv = conv_layer\n", - "            \n", - "            # Measure timing\n", - "            iterations = 10\n", - "            start_time = time.time()\n", - "            \n", - "            for _ in range(iterations):\n", - "                try:\n", - "                    output = test_conv(input_tensor)\n", - "                except Exception:\n", - "                    # Fallback: simulate convolution operation\n", - "                    # Calculate expected output size\n", - "                    input_h, input_w = input_tensor.shape[-2:]\n", - "                    kernel_h, kernel_w = kernel_size\n", - "                    output_h = input_h - kernel_h + 1\n", - "                    output_w = input_w - kernel_w + 1\n", - "                    output = Tensor(np.random.randn(output_h,
output_w))\n", - " \n", - " end_time = time.time()\n", - " avg_time = (end_time - start_time) / iterations\n", - " \n", - " # Calculate computational metrics\n", - " input_h, input_w = input_tensor.shape[-2:]\n", - " kernel_h, kernel_w = kernel_size\n", - " output_h = max(1, input_h - kernel_h + 1)\n", - " output_w = max(1, input_w - kernel_w + 1)\n", - " \n", - " # Estimate FLOPs (floating point operations)\n", - " flops = output_h * output_w * kernel_h * kernel_w\n", - " mflops = flops / 1e6\n", - " throughput_mflops = mflops / avg_time if avg_time > 0 else 0\n", - " \n", - " # Memory analysis\n", - " input_memory_mb = input_tensor.data.nbytes / (1024 * 1024)\n", - " output_memory_mb = (output_h * output_w * 4) / (1024 * 1024) # Assuming float32\n", - " kernel_memory_mb = (kernel_h * kernel_w * 4) / (1024 * 1024)\n", - " total_memory_mb = input_memory_mb + output_memory_mb + kernel_memory_mb\n", - " \n", - " # Calculate computational intensity (FLOPs per byte)\n", - " computational_intensity = flops / max(input_tensor.data.nbytes, 1)\n", - " \n", - " result = {\n", - " 'kernel_size': kernel_size,\n", - " 'time_ms': avg_time * 1000,\n", - " 'throughput_mflops': throughput_mflops,\n", - " 'flops': flops,\n", - " 'input_memory_mb': input_memory_mb,\n", - " 'output_memory_mb': output_memory_mb,\n", - " 'total_memory_mb': total_memory_mb,\n", - " 'computational_intensity': computational_intensity,\n", - " 'output_size': (output_h, output_w)\n", - " }\n", - " \n", - " results[f\"{kernel_size[0]}x{kernel_size[1]}\"] = result\n", - " \n", - " print(f\" Time: {avg_time*1000:.3f}ms, Throughput: {throughput_mflops:.1f} MFLOPS\")\n", - " \n", - " # Store profiling data\n", - " self.profiling_data['convolution_results'] = results\n", - " \n", - " # Generate analysis\n", - " analysis = self._analyze_convolution_performance(results)\n", - " \n", - " return {\n", - " 'detailed_results': results,\n", - " 'analysis': analysis,\n", - " 'recommendations': 
self._generate_optimization_recommendations(results)\n", - " }\n", - " ### END SOLUTION\n", - " \n", - " def _analyze_convolution_performance(self, results):\n", - " \"\"\"Analyze convolution performance patterns.\"\"\"\n", - " analysis = []\n", - " \n", - " # Find fastest and slowest configurations\n", - " times = [(k, v['time_ms']) for k, v in results.items()]\n", - " fastest = min(times, key=lambda x: x[1])\n", - " slowest = max(times, key=lambda x: x[1])\n", - " \n", - " analysis.append(f\"🚀 Fastest kernel: {fastest[0]} ({fastest[1]:.3f}ms)\")\n", - " analysis.append(f\"🐌 Slowest kernel: {slowest[0]} ({slowest[1]:.3f}ms)\")\n", - " \n", - " # Performance scaling analysis\n", - " if len(results) > 1:\n", - " small_kernel = min(results.keys(), key=lambda k: results[k]['flops'])\n", - " large_kernel = max(results.keys(), key=lambda k: results[k]['flops'])\n", - " \n", - " flops_ratio = results[large_kernel]['flops'] / results[small_kernel]['flops']\n", - " time_ratio = results[large_kernel]['time_ms'] / results[small_kernel]['time_ms']\n", - " \n", - " analysis.append(f\"📈 FLOPS scaling: {small_kernel} → {large_kernel} = {flops_ratio:.1f}x more computation\")\n", - " analysis.append(f\"⏱️ Time scaling: {time_ratio:.1f}x slower\")\n", - " \n", - " if time_ratio < flops_ratio:\n", - " analysis.append(\"✅ Good computational efficiency - time scales better than FLOPs\")\n", - " else:\n", - " analysis.append(\"⚠️ Computational bottleneck - time scales worse than FLOPs\")\n", - " \n", - " # Memory analysis\n", - " memory_usage = [(k, v['total_memory_mb']) for k, v in results.items()]\n", - " max_memory = max(memory_usage, key=lambda x: x[1])\n", - " analysis.append(f\"💾 Peak memory usage: {max_memory[0]} ({max_memory[1]:.2f} MB)\")\n", - " \n", - " return analysis\n", - " \n", - " def _generate_optimization_recommendations(self, results):\n", - " \"\"\"Generate optimization recommendations based on profiling results.\"\"\"\n", - " recommendations = []\n", - " \n", - " # 
Analyze computational intensity\n", - " intensities = [v['computational_intensity'] for v in results.values()]\n", - " avg_intensity = sum(intensities) / len(intensities)\n", - " \n", - " if avg_intensity < 1.0:\n", - " recommendations.append(\"🔧 Memory-bound operation: Consider memory layout optimization\")\n", - " recommendations.append(\"💡 Try: Tensor tiling, cache-friendly access patterns\")\n", - " else:\n", - " recommendations.append(\"🔧 Compute-bound operation: Focus on computational optimization\")\n", - " recommendations.append(\"💡 Try: SIMD instructions, hardware acceleration\")\n", - " \n", - " # Kernel size recommendations\n", - " best_throughput = max(results.values(), key=lambda x: x['throughput_mflops'])\n", - " recommendations.append(f\"⚡ Optimal kernel size for throughput: {best_throughput['kernel_size']}\")\n", - " \n", - " # Memory efficiency recommendations\n", - " memory_efficiency = {k: v['throughput_mflops'] / v['total_memory_mb'] \n", - " for k, v in results.items() if v['total_memory_mb'] > 0}\n", - " if memory_efficiency:\n", - " best_memory_efficiency = max(memory_efficiency.items(), key=lambda x: x[1])\n", - " recommendations.append(f\"💾 Most memory-efficient: {best_memory_efficiency[0]}\")\n", - " \n", - " return recommendations\n", - "\n", - " def analyze_memory_patterns(self, input_sizes=[(64, 64), (128, 128), (256, 256)]):\n", - " \"\"\"\n", - " Analyze memory access patterns for different image sizes.\n", - " \n", - " This function is PROVIDED to demonstrate memory scaling analysis.\n", - " Students use it to understand spatial computation memory requirements.\n", - " \"\"\"\n", - " print(\"🔍 MEMORY PATTERN ANALYSIS\")\n", - " print(\"=\" * 40)\n", - " \n", - " conv_3x3 = Conv2D(kernel_size=(3, 3))\n", - " \n", - " memory_results = []\n", - " \n", - " for height, width in input_sizes:\n", - " # Create test tensor\n", - " test_tensor = Tensor(np.random.randn(height, width))\n", - " \n", - " # Calculate memory requirements\n", - " 
input_memory = test_tensor.data.nbytes / (1024 * 1024) # MB\n", - " \n", - " # Estimate output size\n", - " output_h = height - 3 + 1\n", - " output_w = width - 3 + 1\n", - " output_memory = (output_h * output_w * 4) / (1024 * 1024) # MB, float32\n", - " \n", - " # Kernel memory\n", - " kernel_memory = (3 * 3 * 4) / (1024 * 1024) # MB\n", - " \n", - " total_memory = input_memory + output_memory + kernel_memory\n", - " memory_efficiency = (output_h * output_w) / total_memory # operations per MB\n", - " \n", - " result = {\n", - " 'input_size': (height, width),\n", - " 'input_memory_mb': input_memory,\n", - " 'output_memory_mb': output_memory,\n", - " 'total_memory_mb': total_memory,\n", - " 'memory_efficiency': memory_efficiency\n", - " }\n", - " memory_results.append(result)\n", - " \n", - " print(f\" {height}x{width}: {total_memory:.2f} MB total, {memory_efficiency:.0f} ops/MB\")\n", - " \n", - " # Analyze scaling\n", - " if len(memory_results) >= 2:\n", - " small = memory_results[0]\n", - " large = memory_results[-1]\n", - " \n", - " size_ratio = (large['input_size'][0] / small['input_size'][0]) ** 2\n", - " memory_ratio = large['total_memory_mb'] / small['total_memory_mb']\n", - " \n", - " print(f\"\\n📈 Memory Scaling Analysis:\")\n", - " print(f\" Input size increased {size_ratio:.1f}x\")\n", - " print(f\" Memory usage increased {memory_ratio:.1f}x\")\n", - " print(f\" Scaling efficiency: {(memory_ratio/size_ratio)*100:.1f}% (lower is better)\")\n", - " \n", - " return memory_results" - ] - }, - { - "cell_type": "markdown", - "id": "0b1c39b5", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Test: Convolution Performance Profiling\n", - "\n", - "Let us test our convolution profiler with realistic computer vision scenarios." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "932fff67", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "test-convolution-profiler", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_convolution_profiler():\n", - " \"\"\"Test convolution profiler with comprehensive scenarios.\"\"\"\n", - " print(\"🔬 Unit Test: Convolution Performance Profiler...\")\n", - " \n", - " profiler = ConvolutionProfiler()\n", - " \n", - " # Create test components\n", - " conv = Conv2D(kernel_size=(3, 3))\n", - " test_image = Tensor(np.random.randn(64, 64)) # 64x64 test image\n", - " \n", - " # Test convolution profiling\n", - " try:\n", - " analysis = profiler.profile_convolution_operation(conv, test_image, \n", - " kernel_sizes=[(3,3), (5,5)])\n", - " \n", - " # Verify analysis structure\n", - " assert 'detailed_results' in analysis, \"Should provide detailed results\"\n", - " assert 'analysis' in analysis, \"Should provide performance analysis\"\n", - " assert 'recommendations' in analysis, \"Should provide optimization recommendations\"\n", - " \n", - " # Verify detailed results\n", - " results = analysis['detailed_results']\n", - " assert len(results) == 2, \"Should test both kernel sizes\"\n", - " \n", - " for kernel_name, result in results.items():\n", - " assert 'time_ms' in result, f\"Should include timing for {kernel_name}\"\n", - " assert 'throughput_mflops' in result, f\"Should calculate throughput for {kernel_name}\"\n", - " assert 'total_memory_mb' in result, f\"Should analyze memory for {kernel_name}\"\n", - " assert result['time_ms'] > 0, f\"Time should be positive for {kernel_name}\"\n", - " \n", - " print(\"✅ Convolution profiling test passed\")\n", - " \n", - " # Test memory pattern analysis\n", - " memory_analysis = profiler.analyze_memory_patterns(input_sizes=[(32, 32), (64, 64)])\n", - " \n", - " assert isinstance(memory_analysis, list), \"Should 
return memory analysis results\"\n", - " assert len(memory_analysis) == 2, \"Should analyze both input sizes\"\n", - " \n", - " for result in memory_analysis:\n", - " assert 'input_size' in result, \"Should include input size\"\n", - " assert 'total_memory_mb' in result, \"Should calculate total memory\"\n", - " assert result['total_memory_mb'] > 0, \"Memory usage should be positive\"\n", - " \n", - " print(\"✅ Memory pattern analysis test passed\")\n", - " \n", - " except Exception as e:\n", - " print(f\"⚠️ Convolution profiling test had issues: {e}\")\n", - " print(\"✅ Basic structure test passed (graceful degradation)\")\n", - " \n", - " print(\"🎯 Convolution Profiler: All tests passed!\")\n", - "\n", - "# Test function defined (called in main block)\n", - "\n", - "def test_unit_multichannel_conv2d():\n", - " \"\"\"Unit test for the multi-channel Conv2D implementation.\"\"\"\n", - " print(\"🔬 Unit Test: Multi-Channel Conv2D...\")\n", - " \n", - " # Test multi-channel convolution\n", - " conv = Conv2d(in_channels=3, out_channels=8, kernel_size=(3, 3))\n", - " input_rgb = Tensor(np.random.randn(3, 6, 6))\n", - " output = conv(input_rgb)\n", - " \n", - " assert output.shape == (8, 4, 4), \"Multi-channel Conv2D should produce correct output shape\"\n", - " assert hasattr(conv, 'weights'), \"Multi-channel Conv2D should have weights attribute\"\n", - " assert conv.weights.shape == (8, 3, 3, 3), \"Weights should have correct multi-channel shape\"\n", - " \n", - " print(\"✅ Multi-channel Conv2D works correctly\")\n", - "\n", - "def test_unit_maxpool2d():\n", - " \"\"\"Unit test for the MaxPool2D implementation.\"\"\"\n", - " print(\"🔬 Unit Test: MaxPool2D...\")\n", - " \n", - " # Test MaxPool2D\n", - " pool = MaxPool2D(pool_size=(2, 2))\n", - " input_4x4 = Tensor(np.arange(16).reshape(4, 4))\n", - " pooled = pool(input_4x4)\n", - " \n", - " assert pooled.shape == (2, 2), \"MaxPool2D should produce correct output shape\"\n", - " expected = np.array([[5, 7], [13, 15]]) # 
Max of each 2x2 window\n", - " assert np.array_equal(pooled.data, expected), \"MaxPool2D should compute correct max values\"\n", - " \n", - " print(\"✅ MaxPool2D works correctly\")\n", - "\n", - "if __name__ == \"__main__\":\n", - " # Run all tests\n", - " test_unit_convolution_operation()\n", - " test_unit_conv2d_layer()\n", - " test_unit_multichannel_conv2d()\n", - " test_unit_maxpool2d()\n", - " test_unit_flatten_function()\n", - " test_module_conv2d_tensor_compatibility()\n", - " test_convolution_profiler()\n", - " \n", - " print(\"All tests passed!\")\n", - " print(\"spatial_dev module complete with multi-channel support!\")" - ] - }, - { - "cell_type": "markdown", - "id": "c7b7fb14", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🤔 ML Systems Thinking: Interactive Questions\n", - "\n", - "Now that you've built convolution operations and spatial processing capabilities, let's connect this foundational work to broader ML systems challenges. These questions help you think critically about how spatial computation patterns scale to production computer vision environments.\n", - "\n", - "Take time to reflect thoughtfully on each question - your insights will help you understand how the spatial processing concepts you've implemented connect to real-world ML systems engineering." - ] - }, - { - "cell_type": "markdown", - "id": "cf5d480d", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### Question 1: Convolution Optimization and Memory Access Patterns\n", - "\n", - "**Context**: Your convolution implementation processes images by sliding kernels across spatial dimensions, accessing nearby pixels repeatedly. 
Production computer vision systems must optimize these memory access patterns for cache efficiency, especially when processing high-resolution images that exceed cache capacity.\n", - "\n", - "**Reflection Question**: Design an optimized convolution system for production computer vision that maximizes cache efficiency and memory bandwidth utilization. How would you implement spatial data layout optimization for different image sizes, optimize kernel access patterns for cache locality, and handle memory hierarchies from L1 cache to main memory? Consider scenarios where you need to process 4K video streams in real-time while maintaining memory efficiency.\n", - "\n", - "Think about: spatial data layouts (NCHW vs NHWC), cache-blocking strategies, memory prefetching, and bandwidth optimization techniques.\n", - "\n", - "*Target length: 150-300 words*" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ea72244c", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "question-1-convolution-optimization", - "locked": false, - "points": 10, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "\"\"\"\n", - "YOUR REFLECTION ON CONVOLUTION OPTIMIZATION AND MEMORY ACCESS PATTERNS:\n", - "\n", - "TODO: Replace this text with your thoughtful response about optimized convolution system design.\n", - "\n", - "Consider addressing:\n", - "- How would you optimize spatial data layouts for different image processing scenarios?\n", - "- What strategies would you use to maximize cache locality in convolution operations?\n", - "- How would you handle memory bandwidth bottlenecks in high-resolution image processing?\n", - "- What role would cache-blocking and prefetching play in your optimization approach?\n", - "- How would you adapt memory access patterns for different hardware architectures?\n", - "\n", - "Write a technical analysis connecting your convolution implementations to real memory optimization 
challenges.\n", - "\n", - "GRADING RUBRIC (Instructor Use):\n", - "- Demonstrates understanding of spatial memory access optimization (3 points)\n", - "- Addresses cache efficiency and bandwidth utilization strategies (3 points)\n", - "- Shows practical knowledge of data layout and access pattern optimization (2 points)\n", - "- Demonstrates systems thinking about memory hierarchy optimization (2 points)\n", - "- Clear technical reasoning and practical considerations (bonus points for innovative approaches)\n", - "\"\"\"\n", - "\n", - "### BEGIN SOLUTION\n", - "# Student response area - instructor will replace this section during grading setup\n", - "# This is a manually graded question requiring technical analysis of convolution optimization\n", - "# Students should demonstrate understanding of spatial memory access patterns and cache optimization\n", - "### END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "f8527a46", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### Question 2: GPU Parallelization and Hardware Acceleration\n", - "\n", - "**Context**: Your convolution processes pixels sequentially, but production computer vision systems leverage thousands of GPU cores for parallel computation. Different hardware platforms (GPUs, TPUs, mobile processors) have distinct optimization opportunities and constraints for spatial operations.\n", - "\n", - "**Reflection Question**: Architect a hardware-aware convolution system that optimally utilizes parallel computing resources across different platforms. How would you implement data parallelism strategies for GPU convolution kernels, optimize for specialized AI accelerators like TPUs, and adapt convolution algorithms for mobile and edge devices with limited resources? 
Consider scenarios where the same model needs efficient deployment across cloud GPUs, mobile phones, and embedded vision systems.\n", - "\n", - "Think about: parallel algorithm design, hardware-specific optimization, work distribution strategies, and cross-platform efficiency considerations.\n", - "\n", - "*Target length: 150-300 words*" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "77462556", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "question-2-gpu-parallelization", - "locked": false, - "points": 10, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "\"\"\"\n", - "YOUR REFLECTION ON GPU PARALLELIZATION AND HARDWARE ACCELERATION:\n", - "\n", - "TODO: Replace this text with your thoughtful response about hardware-aware convolution system design.\n", - "\n", - "Consider addressing:\n", - "- How would you design parallel convolution algorithms for different hardware platforms?\n", - "- What strategies would you use to optimize convolution for GPU, TPU, and mobile processors?\n", - "- How would you implement work distribution and load balancing for parallel convolution?\n", - "- What role would hardware-specific optimizations play in your design?\n", - "- How would you maintain efficiency across diverse deployment platforms?\n", - "\n", - "Write an architectural analysis connecting your spatial processing to real hardware acceleration challenges.\n", - "\n", - "GRADING RUBRIC (Instructor Use):\n", - "- Shows understanding of parallel computing and hardware acceleration (3 points)\n", - "- Designs practical approaches to multi-platform convolution optimization (3 points)\n", - "- Addresses work distribution and platform-specific optimization (2 points)\n", - "- Demonstrates systems thinking about hardware-software co-optimization (2 points)\n", - "- Clear architectural reasoning with hardware insights (bonus points for comprehensive understanding)\n", - "\"\"\"\n", - "\n", 
- "### BEGIN SOLUTION\n", - "# Student response area - instructor will replace this section during grading setup\n", - "# This is a manually graded question requiring understanding of parallel computing and hardware optimization\n", - "# Students should demonstrate knowledge of GPU acceleration and multi-platform optimization\n", - "### END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "55162794", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### Question 3: Production Computer Vision Pipeline Integration\n", - "\n", - "**Context**: Your convolution operates on individual images, but production computer vision systems must handle continuous streams of images, video processing, and real-time inference with strict latency requirements. Integration with broader ML pipelines becomes critical for system performance.\n", - "\n", - "**Reflection Question**: Design a production computer vision pipeline that integrates convolution operations with real-time processing requirements and system-wide optimization. How would you implement batching strategies for video streams, optimize pipeline throughput while maintaining low latency, and integrate convolution with preprocessing and postprocessing stages? 
Consider scenarios where you need to process security camera feeds, autonomous vehicle vision, or real-time medical imaging with reliability and performance guarantees.\n", - "\n", - "Think about: pipeline optimization, batching strategies, latency vs throughput trade-offs, and system integration patterns.\n", - "\n", - "*Target length: 150-300 words*" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9d49a458", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "question-3-production-pipeline", - "locked": false, - "points": 10, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "\"\"\"\n", - "YOUR REFLECTION ON PRODUCTION COMPUTER VISION PIPELINE INTEGRATION:\n", - "\n", - "TODO: Replace this text with your thoughtful response about production vision pipeline design.\n", - "\n", - "Consider addressing:\n", - "- How would you design computer vision pipelines that integrate convolution with real-time processing?\n", - "- What strategies would you use to optimize batching and throughput for video streams?\n", - "- How would you balance latency requirements with computational efficiency?\n", - "- What role would pipeline integration and optimization play in your system?\n", - "- How would you ensure reliability and performance guarantees for critical applications?\n", - "\n", - "Write a systems analysis connecting your convolution operations to real production pipeline challenges.\n", - "\n", - "GRADING RUBRIC (Instructor Use):\n", - "- Understands production computer vision pipeline requirements (3 points)\n", - "- Designs practical approaches to real-time processing and batching (3 points)\n", - "- Addresses latency vs throughput optimization challenges (2 points)\n", - "- Shows systems thinking about integration and reliability (2 points)\n", - "- Clear systems reasoning with production deployment insights (bonus points for deep understanding)\n", - "\"\"\"\n", - "\n", - "### BEGIN 
SOLUTION\n", - "# Student response area - instructor will replace this section during grading setup\n", - "# This is a manually graded question requiring understanding of production computer vision pipelines\n", - "# Students should demonstrate knowledge of real-time processing and system integration\n", - "### END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "0305fe8f", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🎯 MODULE SUMMARY: Multi-Channel Convolutional Networks\n", - "\n", - "Congratulations! You have successfully implemented a complete multi-channel CNN system ready for real computer vision applications:\n", - "\n", - "### What You have Accomplished\n", - "✅ **Convolution Operation**: Implemented the sliding window mechanism from scratch \n", - "✅ **Single-Channel Conv2D**: Built learnable convolutional layers with random initialization \n", - "✅ **Multi-Channel Conv2D**: Added support for RGB images and multiple output feature maps \n", - "✅ **MaxPool2D**: Implemented spatial downsampling for computational efficiency \n", - "✅ **Flatten Function**: Created the bridge between convolutional and dense layers \n", - "✅ **Complete CNN Pipelines**: Built CIFAR-10 ready architectures with proper parameter scaling \n", - "✅ **Memory Analysis**: Profiled parameter scaling and computational complexity\n", - "✅ **Production Patterns**: Tested batch processing and deep multi-channel architectures\n", - "\n", - "### Key Concepts You have Learned\n", - "- **Multi-channel convolution**: How RGB images are processed through multiple filters\n", - "- **Parameter scaling**: How memory requirements grow with channels and kernel sizes\n", - "- **Spatial downsampling**: MaxPooling for translation invariance and efficiency \n", - "- **Feature hierarchy**: Progressive extraction from RGB → edges → objects → concepts\n", - "- **Production architectures**: Conv → ReLU → Pool → Conv → ReLU → Pool → Dense patterns\n", - "- **He initialization**: 
Proper weight initialization for stable multi-layer training\n", - "\n", - "### Mathematical Foundations\n", - "- **Multi-channel convolution**: Each filter processes ALL input channels, summing results\n", - "- **Parameter calculation**: out_channels × in_channels × kernel_h × kernel_w + bias_terms\n", - "- **Spatial size reduction**: Convolution and pooling progressively reduce spatial dimensions\n", - "- **Channel expansion**: Typical pattern increases channels while reducing spatial size\n", - "- **Memory complexity**: O(batch × channels × height × width) for activations\n", - "\n", - "### Systems Engineering Insights\n", - "- **Memory scaling**: Parameters grow quadratically with channels, linearly with filters\n", - "- **Computational intensity**: CIFAR-10 CNN requires millions of multiply-accumulate operations\n", - "- **Cache efficiency**: Spatial locality in convolution enables hardware optimization\n", - "- **Parallelization**: Each filter and spatial position can be computed independently\n", - "- **Production trade-offs**: More channels = better accuracy but higher memory/compute cost\n", - "\n", - "### Real-World Applications\n", - "- **CIFAR-10 classification**: Your CNN can handle 32×32 RGB images → 10 classes\n", - "- **Image recognition**: Object detection, medical imaging, autonomous driving\n", - "- **Transfer learning**: Pre-trained features for downstream tasks\n", - "- **Computer vision**: Face recognition, document analysis, quality inspection\n", - "\n", - "### CNN Architecture Patterns\n", - "- **Basic CNN**: RGB → Conv(3→32) → ReLU → Pool → Conv(32→64) → ReLU → Pool → Dense\n", - "- **Parameter efficiency**: 32×3×3×3 = 864 parameters vs 32×32×32 = 32,768 for dense layer\n", - "- **Spatial hierarchy**: Early layers detect edges, later layers detect objects\n", - "- **Translation invariance**: Same features detected regardless of position in image\n", - "\n", - "### Performance Characteristics\n", - "- **Memory efficiency**: Shared 
parameters across spatial locations\n", - "- **Computational complexity**: O(batch × out_channels × in_channels × kernel_size² × output_spatial)\n", - "- **Hardware acceleration**: Highly parallelizable operations ideal for GPUs\n", - "- **Scaling behavior**: Memory grows with channels, computation grows with spatial size\n", - "\n", - "### Production-Ready Features\n", - "```python\n", - "import numpy as np\n", - "\n", - "from tinytorch.core.tensor import Tensor\n", - "from tinytorch.core.spatial import Conv2d, MaxPool2D, flatten\n", - "from tinytorch.core.layers import Dense\n", - "from tinytorch.core.activations import ReLU\n", - "\n", - "# CIFAR-10 CNN architecture\n", - "conv1 = Conv2d(in_channels=3, out_channels=32, kernel_size=(3, 3))\n", - "pool1 = MaxPool2D(pool_size=(2, 2))\n", - "conv2 = Conv2d(in_channels=32, out_channels=64, kernel_size=(3, 3))\n", - "pool2 = MaxPool2D(pool_size=(2, 2))\n", - "classifier = Dense(input_size=64*6*6, output_size=10)\n", - "\n", - "# Process RGB image\n", - "rgb_image = Tensor(np.random.randn(3, 32, 32)) # CIFAR-10 format\n", - "features1 = pool1(ReLU()(conv1(rgb_image))) # (3,32,32) → (32,15,15)\n", - "features2 = pool2(ReLU()(conv2(features1))) # (32,15,15) → (64,6,6)\n", - "predictions = classifier(flatten(features2)) # (64,6,6) → (1,10)\n", - "```\n", - "\n", - "### Next Steps\n", - "1. **Export to package**: Use `tito module complete 10_spatial` to export your implementation\n", - "2. **Test with real data**: Load CIFAR-10 dataset and train your CNN\n", - "3. **Experiment with architectures**: Try different channel numbers and kernel sizes\n", - "4. **Optimize performance**: Profile memory usage and computational bottlenecks\n", - "5. **Build deeper networks**: Add more layers and advanced techniques\n", - "\n", - "**Ready for the next challenge?** Let's add attention mechanisms to understand sequence relationships!"
- ] - } - ], - "metadata": { - "jupytext": { - "main_language": "python" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/modules_old/08_spatial/spatial_dev.py b/modules_old/08_spatial/spatial_dev.py deleted file mode 100644 index 3864cfd3..00000000 --- a/modules_old/08_spatial/spatial_dev.py +++ /dev/null @@ -1,911 +0,0 @@ -# --- -# jupyter: -# jupytext: -# text_representation: -# extension: .py -# format_name: percent -# format_version: '1.3' -# jupytext_version: 1.17.1 -# kernelspec: -# display_name: Python 3 (ipykernel) -# language: python -# name: python3 -# --- - -# %% [markdown] -""" -# Spatial - Convolutional Neural Networks - -Welcome to Spatial! You'll implement the fundamental spatial operations that make CNNs work for image processing and pattern recognition. - -## 🔗 Building on Previous Learning -**What You Built Before**: -- Module 03 (Layers): Neural network building blocks -- Module 04 (Networks): Multi-layer architectures - -**What's Working**: You can build fully connected networks that process flattened data. - -**The Gap**: Your networks can't recognize spatial patterns in images - they lose all spatial structure when flattening. - -**This Module's Solution**: Implement convolution and pooling operations that preserve and process spatial relationships. - -**Connection Map**: -``` -Networks → Spatial → Autograd -(1D data) (2D images) (gradient computation) -``` - -## Learning Objectives -1. **Core Implementation**: Build Conv2D and MaxPool2D layers for spatial pattern recognition -2. **Systems Understanding**: Analyze memory usage and computational complexity of spatial operations -3. **Integration Knowledge**: Connect convolutional layers with existing neural network components -4. **Testing Skills**: Validate spatial operations with immediate unit testing - -## Build → Test → Use -1. **Build**: Implement convolution and pooling from scratch -2. **Test**: Validate each operation immediately after implementation -3. 
**Use**: Combine operations into CNN architectures for image processing -""" - -# %% nbgrader={"grade": false, "grade_id": "spatial-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} -#| default_exp spatial - -# Core imports for spatial operations -import numpy as np -from typing import Tuple, Union, Optional - -# Import previous modules -import sys -sys.path.append('../../') -try: - from tinytorch.core.tensor import Tensor - from tinytorch.core.layers import Module, Linear -except ImportError: - # Fallback for development - sys.path.extend([ - '../01_tensor', - '../03_layers' - ]) - from tensor_dev import Tensor - from layers_dev import Module, Linear - -print("✅ Spatial module imports successful!") - -# %% [markdown] -""" -## 📦 Where This Code Lives in the Final Package - -**Learning Side:** You work in modules/08_spatial/spatial_dev.py -**Building Side:** Code exports to tinytorch.core.spatial - -```python -# Final package structure: -from tinytorch.core.spatial import Conv2D, MaxPool2D, flatten # This module -from tinytorch.core.tensor import Tensor # Foundation (always needed) -from tinytorch.core.layers import Module # Base class for layers -``` - -**Why this matters:** -- **Learning:** Complete spatial processing system in one focused module -- **Production:** Organized like PyTorch's torch.nn with spatial operations -- **Consistency:** All spatial operations and utilities in core.spatial -- **Integration:** Works seamlessly with layers for complete CNN architectures -""" - -# %% [markdown] -""" -## 🏗️ Understanding Spatial Operations - -### What is Convolution? - -Convolution is a mathematical operation that slides a small filter (kernel) across an image to detect patterns: - -``` -Input Image (5×5) Filter (3×3) Output (3×3) -┌─────────────────┐ ┌───────┐ ┌─────────┐ -│ 1 2 3 4 5 │ │ 1 0-1 │ │ ? ? ? │ -│ 6 7 8 9 0 │ × │ 2 1 0 │ = │ ? ? ? │ -│ 1 2 3 4 5 │ │-1 0 1 │ │ ? ? ? 
│ -│ 6 7 8 9 0 │ └───────┘ └─────────┘ -│ 1 2 3 4 5 │ -└─────────────────┘ -``` - -**Why Spatial Operations Matter:** -- **Pattern Recognition**: Detect edges, textures, and complex features -- **Translation Invariance**: Same pattern detected regardless of position -- **Parameter Sharing**: One filter detects patterns across entire image -- **Spatial Hierarchy**: Simple patterns → complex patterns → objects - -### Memory Efficiency vs Fully Connected - -**Fully Connected Approach** (wasteful): -- 28×28 image = 784 inputs -- Hidden layer: 784 × 128 = 100,352 parameters per neuron! -- No spatial understanding - -**Convolutional Approach** (efficient): -- 3×3 filter = 9 parameters -- Applied everywhere via sliding -- Preserves spatial relationships -""" -# %% [markdown] -""" -## Implementation: Core Spatial Operations - -Let's build the essential spatial operations: convolution, pooling, and flattening. -""" - -# %% nbgrader={"grade": false, "grade_id": "conv2d-naive", "locked": false, "schema_version": 3, "solution": true, "task": false} -def conv2d_naive(input_array, kernel, bias=None): - """ - Naive 2D convolution implementation for educational understanding. - - Args: - input_array: np.ndarray of shape (height, width) or (channels, height, width) - kernel: np.ndarray of shape (kernel_height, kernel_width) - bias: Optional bias value to add to each output - - Returns: - np.ndarray: Convolved output - - TODO: Implement 2D convolution by sliding kernel across input - - APPROACH: - 1. Handle input dimensions (add channel dimension if needed) - 2. Calculate output dimensions based on input and kernel sizes - 3. Slide kernel across input and compute dot products - 4. 
Add bias if provided - - EXAMPLE: - >>> input_img = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) - >>> edge_kernel = np.array([[-1, -1, -1], [0, 0, 0], [1, 1, 1]]) - >>> result = conv2d_naive(input_img, edge_kernel) - >>> print(result.shape) - (1, 1, 1) - - HINTS: - - Use nested loops to slide kernel across input - - Multiply overlapping regions element-wise and sum - - Handle single-channel inputs by adding channel dimension (the channel dimension is kept in the output) - """ - ### BEGIN SOLUTION - # Ensure input has channel dimension - if input_array.ndim == 2: - input_array = input_array[np.newaxis, :, :] # Add channel dimension - - channels, height, width = input_array.shape - kernel_height, kernel_width = kernel.shape - - # Calculate output dimensions (no padding, stride=1) - out_height = height - kernel_height + 1 - out_width = width - kernel_width + 1 - - # Initialize output - output = np.zeros((channels, out_height, out_width)) - - # Slide kernel across input - for c in range(channels): - for i in range(out_height): - for j in range(out_width): - # Extract region and compute convolution - region = input_array[c, i:i+kernel_height, j:j+kernel_width] - output[c, i, j] = np.sum(region * kernel) - - # Add bias if provided - if bias is not None: - output[c, i, j] += bias - - return output - ### END SOLUTION - -# %% [markdown] -""" -### 🧪 Unit Test: Convolution Operation - -This test validates our basic convolution implementation works correctly.
-""" - -# %% -def test_unit_conv2d_naive(): - """Test convolution operation with educational feedback""" - print("🔬 Unit Test: Convolution Operation...") - - # Test 1: Simple edge detection - input_img = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) - edge_kernel = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]) # Vertical edge detector - - result = conv2d_naive(input_img, edge_kernel) - - # Verify output shape (3x3 input, 3x3 kernel -> 1x1 output) - assert result.shape == (1, 1, 1), f"Expected shape (1, 1, 1), got {result.shape}" - - # Test 2: Multi-channel input - multi_channel = np.random.randn(3, 5, 5) # 3 channels, 5x5 each - kernel = np.array([[1, 0], [0, 1]]) # 2x2 kernel - - result = conv2d_naive(multi_channel, kernel) - assert result.shape == (3, 4, 4), f"Expected shape (3, 4, 4), got {result.shape}" - - # Test 3: Bias addition - simple_input = np.array([[1, 1], [1, 1]]) - simple_kernel = np.array([[1]]) - bias_value = 5 - - result_with_bias = conv2d_naive(simple_input, simple_kernel, bias=bias_value) - result_without_bias = conv2d_naive(simple_input, simple_kernel) - - bias_diff = result_with_bias - result_without_bias - assert np.allclose(bias_diff, bias_value), "Bias not added correctly" - - print("✅ Convolution operation works correctly!") - -test_unit_conv2d_naive() - -# %% [markdown] -""" -## Implementation: Conv2D Layer - -Now let's build a proper convolutional layer class that can be used in neural networks. -""" - -# %% nbgrader={"grade": false, "grade_id": "conv2d-class", "locked": false, "schema_version": 3, "solution": true, "task": false} -class Conv2D(Module): - """ - 2D Convolutional Layer for spatial pattern recognition. - - Args: - in_channels: Number of input channels - out_channels: Number of output channels (filters) - kernel_size: Size of convolution kernel (int or tuple) - bias: Whether to use bias term - - TODO: Implement a convolutional layer that can process multi-channel inputs - - APPROACH: - 1. 
Initialize weights and bias with proper shapes - 2. Handle kernel_size as int or tuple - 3. Implement forward pass with multi-channel convolution - 4. Use conv2d_naive for each input-output channel combination - - EXAMPLE: - >>> conv = Conv2D(in_channels=3, out_channels=16, kernel_size=3) - >>> x = Tensor(np.random.randn(3, 28, 28)) # RGB image - >>> output = conv(x) - >>> print(output.shape) - (16, 26, 26) - - HINTS: - - Weight shape: (out_channels, in_channels, kernel_height, kernel_width) - - For each output channel, convolve with all input channels and sum - - Use He initialization for weights: scale by sqrt(2 / fan_in) - """ - ### BEGIN SOLUTION - def __init__(self, in_channels, out_channels, kernel_size, bias=True): - super().__init__() - - # Handle kernel_size as int or tuple - if isinstance(kernel_size, int): - self.kernel_size = (kernel_size, kernel_size) - else: - self.kernel_size = kernel_size - - self.in_channels = in_channels - self.out_channels = out_channels - self.use_bias = bias - - # Initialize weights with He initialization - # Weight shape: (out_channels, in_channels, kernel_height, kernel_width) - fan_in = in_channels * self.kernel_size[0] * self.kernel_size[1] - weight_scale = np.sqrt(2.0 / fan_in) - self.weight = Tensor( - np.random.randn(out_channels, in_channels, *self.kernel_size) * weight_scale - ) - - # Initialize bias - if bias: - self.bias = Tensor(np.zeros(out_channels)) - else: - self.bias = None - - def forward(self, x): - """ - Forward pass of 2D convolution. 
- - Args: - x: Input tensor of shape (in_channels, height, width) - - Returns: - Output tensor of shape (out_channels, out_height, out_width) - """ - if x.data.ndim != 3: - raise ValueError(f"Expected 3D input (channels, height, width), got {x.data.ndim}D") - - in_channels, height, width = x.data.shape - if in_channels != self.in_channels: - raise ValueError(f"Expected {self.in_channels} input channels, got {in_channels}") - - # Calculate output dimensions - out_height = height - self.kernel_size[0] + 1 - out_width = width - self.kernel_size[1] + 1 - - # Initialize output - output = np.zeros((self.out_channels, out_height, out_width)) - - # Convolve each output channel - for out_ch in range(self.out_channels): - channel_sum = np.zeros((out_height, out_width)) - - # Sum convolutions across all input channels - for in_ch in range(self.in_channels): - kernel = self.weight.data[out_ch, in_ch] - conv_result = conv2d_naive(x.data[in_ch], kernel) - channel_sum += conv_result.squeeze() # Remove extra dimensions - - output[out_ch] = channel_sum - - # Add bias if enabled - if self.use_bias: - output[out_ch] += self.bias.data[out_ch] - - return Tensor(output) - ### END SOLUTION - -# %% [markdown] -""" -### 🧪 Unit Test: Conv2D Layer - -This test validates our Conv2D layer implementation. 
-""" - -# %% -def test_unit_conv2d(): - """Test Conv2D layer with educational feedback""" - print("🔬 Unit Test: Conv2D Layer...") - - # Test 1: Single channel to multiple channels - conv = Conv2D(in_channels=1, out_channels=3, kernel_size=3) - x = Tensor(np.random.randn(1, 5, 5)) - - output = conv(x) - expected_shape = (3, 3, 3) # 3 output channels, 3x3 spatial - assert output.shape == expected_shape, f"Expected {expected_shape}, got {output.shape}" - - # Test 2: RGB to feature maps (realistic scenario) - rgb_conv = Conv2D(in_channels=3, out_channels=16, kernel_size=3) - rgb_input = Tensor(np.random.randn(3, 28, 28)) # RGB image - - features = rgb_conv(rgb_input) - expected_shape = (16, 26, 26) # 16 feature maps, 26x26 spatial - assert features.shape == expected_shape, f"Expected {expected_shape}, got {features.shape}" - - # Test 3: Different kernel sizes - large_kernel_conv = Conv2D(in_channels=1, out_channels=1, kernel_size=5) - test_input = Tensor(np.random.randn(1, 10, 10)) - - large_output = large_kernel_conv(test_input) - expected_shape = (1, 6, 6) # 10-5+1 = 6 - assert large_output.shape == expected_shape, f"Expected {expected_shape}, got {large_output.shape}" - - # Test 4: Parameter counting - conv_params = Conv2D(in_channels=3, out_channels=64, kernel_size=3) - # Weights: 64 * 3 * 3 * 3 = 1728, Bias: 64, Total: 1792 - weight_params = 64 * 3 * 3 * 3 - bias_params = 64 - total_expected = weight_params + bias_params - - weight_actual = conv_params.weight.data.size - bias_actual = conv_params.bias.data.size if conv_params.bias else 0 - total_actual = weight_actual + bias_actual - - assert total_actual == total_expected, f"Expected {total_expected} parameters, got {total_actual}" - - print("✅ Conv2D layer works correctly!") - -test_unit_conv2d() - -# %% [markdown] -""" -## Implementation: MaxPool2D Layer - -Pooling layers reduce spatial dimensions while preserving important features. 
-""" - -# %% nbgrader={"grade": false, "grade_id": "maxpool2d-class", "locked": false, "schema_version": 3, "solution": true, "task": false} -class MaxPool2D(Module): - """ - 2D Max Pooling Layer for spatial downsampling. - - Args: - pool_size: Size of pooling window (int or tuple) - stride: Stride of pooling operation (defaults to pool_size) - - TODO: Implement max pooling that reduces spatial dimensions - - APPROACH: - 1. Handle pool_size and stride as int or tuple - 2. Calculate output dimensions based on input size and pooling parameters - 3. Slide pooling window and take maximum in each region - 4. Handle multi-channel inputs by pooling each channel independently - - EXAMPLE: - >>> pool = MaxPool2D(pool_size=2) - >>> x = Tensor(np.random.randn(16, 26, 26)) # Feature maps from Conv2D - >>> output = pool(x) - >>> print(output.shape) - (16, 13, 13) - - HINTS: - - Default stride equals pool_size for non-overlapping pooling - - Output size = (input_size - pool_size) // stride + 1 - - Use np.max on each pooling region - """ - ### BEGIN SOLUTION - def __init__(self, pool_size, stride=None): - super().__init__() - - # Handle pool_size as int or tuple - if isinstance(pool_size, int): - self.pool_size = (pool_size, pool_size) - else: - self.pool_size = pool_size - - # Default stride equals pool_size (non-overlapping) - if stride is None: - self.stride = self.pool_size - elif isinstance(stride, int): - self.stride = (stride, stride) - else: - self.stride = stride - - def forward(self, x): - """ - Forward pass of 2D max pooling. 
- - Args: - x: Input tensor of shape (channels, height, width) - - Returns: - Output tensor with reduced spatial dimensions - """ - if x.data.ndim != 3: - raise ValueError(f"Expected 3D input (channels, height, width), got {x.data.ndim}D") - - channels, height, width = x.data.shape - pool_h, pool_w = self.pool_size - stride_h, stride_w = self.stride - - # Calculate output dimensions - out_height = (height - pool_h) // stride_h + 1 - out_width = (width - pool_w) // stride_w + 1 - - # Initialize output - output = np.zeros((channels, out_height, out_width)) - - # Apply max pooling to each channel - for c in range(channels): - for i in range(out_height): - for j in range(out_width): - # Calculate pooling region bounds - h_start = i * stride_h - h_end = h_start + pool_h - w_start = j * stride_w - w_end = w_start + pool_w - - # Extract region and take maximum - region = x.data[c, h_start:h_end, w_start:w_end] - output[c, i, j] = np.max(region) - - return Tensor(output) - ### END SOLUTION - -# %% [markdown] -""" -### 🧪 Unit Test: MaxPool2D Layer - -This test validates our MaxPool2D layer implementation. 
-""" - -# %% -def test_unit_maxpool2d(): - """Test MaxPool2D layer with educational feedback""" - print("🔬 Unit Test: MaxPool2D Layer...") - - # Test 1: Basic 2x2 pooling - pool = MaxPool2D(pool_size=2) - x = Tensor(np.array([[[1, 2, 3, 4], - [5, 6, 7, 8], - [9, 10, 11, 12], - [13, 14, 15, 16]]])) # 1x4x4 input - - output = pool(x) - expected_shape = (1, 2, 2) # 4x4 -> 2x2 with pool_size=2 - assert output.shape == expected_shape, f"Expected {expected_shape}, got {output.shape}" - - # Verify max values are correct - expected_values = np.array([[[6, 8], [14, 16]]]) # Max in each 2x2 region - assert np.allclose(output.data, expected_values), "MaxPool values incorrect" - - # Test 2: Multi-channel pooling - multi_input = Tensor(np.random.randn(3, 8, 8)) - multi_output = pool(multi_input) - - expected_shape = (3, 4, 4) # Each channel pooled independently - assert multi_output.shape == expected_shape, f"Expected {expected_shape}, got {multi_output.shape}" - - # Test 3: Different pool sizes - pool_3x3 = MaxPool2D(pool_size=3) - large_input = Tensor(np.random.randn(1, 9, 9)) - - pool_output = pool_3x3(large_input) - expected_shape = (1, 3, 3) # 9x9 with 3x3 pooling and stride=3 - assert pool_output.shape == expected_shape, f"Expected {expected_shape}, got {pool_output.shape}" - - # Test 4: Integration with Conv2D - conv = Conv2D(in_channels=1, out_channels=4, kernel_size=3) - pooling = MaxPool2D(pool_size=2) - - test_image = Tensor(np.random.randn(1, 10, 10)) - conv_features = conv(test_image) # Should be (4, 8, 8) - pooled_features = pooling(conv_features) # Should be (4, 4, 4) - - expected_shape = (4, 4, 4) - assert pooled_features.shape == expected_shape, f"Expected {expected_shape}, got {pooled_features.shape}" - - print("✅ MaxPool2D layer works correctly!") - -test_unit_maxpool2d() - -# %% [markdown] -""" -## Implementation: Flatten Function - -Convert spatial feature maps to 1D for fully connected layers. 
-""" - -# %% nbgrader={"grade": false, "grade_id": "flatten-function", "locked": false, "schema_version": 3, "solution": true, "task": false} -def flatten(x): - """ - Flatten multi-dimensional tensor to 1D for fully connected layers. - - Args: - x: Input tensor of any shape - - Returns: - Tensor: Flattened tensor with shape (total_elements,) - - TODO: Flatten tensor while preserving all data - - APPROACH: - 1. Calculate total number of elements - 2. Reshape to 1D preserving data order - 3. Return as new Tensor - - EXAMPLE: - >>> x = Tensor(np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])) # (2, 2, 2) - >>> flat = flatten(x) - >>> print(flat.shape) - (8,) - - HINTS: - - Use numpy.reshape with -1 to flatten - - Ensure data order is preserved (row-major/C-style) - """ - ### BEGIN SOLUTION - # Calculate total elements and reshape to 1D - flattened_data = x.data.reshape(-1) - return Tensor(flattened_data) - ### END SOLUTION - -# %% [markdown] -""" -### 🧪 Unit Test: Flatten Function - -This test validates our flatten function implementation. 
-""" - -# %% -def test_unit_flatten(): - """Test flatten function with educational feedback""" - print("🔬 Unit Test: Flatten Function...") - - # Test 1: 2D tensor - x_2d = Tensor(np.array([[1, 2], [3, 4]])) - flat_2d = flatten(x_2d) - - expected_shape = (4,) - assert flat_2d.shape == expected_shape, f"Expected {expected_shape}, got {flat_2d.shape}" - assert np.array_equal(flat_2d.data, [1, 2, 3, 4]), "Flatten values incorrect" - - # Test 2: 3D tensor (typical CNN output) - x_3d = Tensor(np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])) # (2, 2, 2) - flat_3d = flatten(x_3d) - - expected_shape = (8,) - assert flat_3d.shape == expected_shape, f"Expected {expected_shape}, got {flat_3d.shape}" - assert np.array_equal(flat_3d.data, [1, 2, 3, 4, 5, 6, 7, 8]), "3D flatten values incorrect" - - # Test 3: Real CNN scenario - feature maps to classifier - # Simulate: Conv2D(64 filters, 5x5 output) -> Flatten -> Linear - feature_maps = Tensor(np.random.randn(64, 5, 5)) # 64 feature maps of 5x5 - flattened_features = flatten(feature_maps) - - expected_shape = (64 * 5 * 5,) # 1600 features - assert flattened_features.shape == expected_shape, f"Expected {expected_shape}, got {flattened_features.shape}" - - # Test 4: Preserve data integrity - original = Tensor(np.arange(24).reshape(2, 3, 4)) - flattened = flatten(original) - - # Check that all values are preserved - assert np.array_equal(flattened.data, np.arange(24)), "Data not preserved during flattening" - - print("✅ Flatten function works correctly!") - -test_unit_flatten() - -# %% [markdown] -""" -## 🔍 Systems Analysis - -Now that your implementation is complete and tested, let's analyze its behavior: -""" - -# %% -def analyze_spatial_complexity(): - """ - 📊 SYSTEMS MEASUREMENT: Spatial Operations Complexity - - Measure how spatial operations scale with input size and parameters. 
- """ - print("📊 SPATIAL COMPLEXITY ANALYSIS") - print("Testing how spatial operations scale with different inputs...") - - import time - - # Test convolution scaling - input_sizes = [16, 32, 64, 128] - conv_times = [] - - print("\n🔍 Convolution Scaling Analysis:") - for size in input_sizes: - # Create test input and kernel - test_input = np.random.randn(3, size, size) # 3-channel image - test_kernel = np.random.randn(3, 3) # 3x3 kernel - - # Time the convolution - start = time.perf_counter() - result = conv2d_naive(test_input, test_kernel) - elapsed = time.perf_counter() - start - - conv_times.append(elapsed) - flops = 3 * (size-2) * (size-2) * 9 # channels * output_pixels * kernel_size - - print(f" Size {size}×{size}: {elapsed*1000:.2f}ms, {flops:,} FLOPs") - - if elapsed > 1.0: # Stop if too slow - break - - # Analyze scaling pattern - if len(conv_times) >= 3: - size_ratio = input_sizes[2] / input_sizes[0] # 4x increase - time_ratio = conv_times[2] / conv_times[0] - print(f"💡 COMPLEXITY INSIGHT: {size_ratio:.0f}x size increase → {time_ratio:.1f}x time increase") - print(f" This suggests ~O(N²) scaling as expected for spatial convolution") - - # Test memory usage - print("\n💾 Memory Usage Analysis:") - channel_configs = [(1, 16), (3, 32), (16, 64), (32, 128)] - - for in_ch, out_ch in channel_configs: - conv = Conv2D(in_channels=in_ch, out_channels=out_ch, kernel_size=3) - - # Calculate parameter memory - weight_params = out_ch * in_ch * 3 * 3 - bias_params = out_ch - total_params = weight_params + bias_params - memory_mb = total_params * 4 / (1024 * 1024) # 4 bytes per float32 - - print(f" Conv2D({in_ch}→{out_ch}): {total_params:,} params, {memory_mb:.2f}MB") - - if total_params > 1_000_000: - print(f" 💥 Parameter explosion! 
{total_params/1e6:.1f}M parameters") - print(f" This shows why depthwise separable convolutions were invented") - break - - print(f"\n💡 SYSTEMS INSIGHT: Spatial operations have quadratic scaling") - print(f" Input size matters more than you might expect!") - print(f" Modern optimizations: im2col, FFT convolution, optimized BLAS") - -# Run the analysis -analyze_spatial_complexity() - -# %% [markdown] -""" -## 🧪 Complete Module Testing - -Test all spatial components together. -""" - -# %% -def test_module(): - """Run comprehensive test of spatial module""" - print("🧪 Testing Complete Spatial Module...") - - print("\n1. Testing individual components...") - test_unit_conv2d_naive() - test_unit_conv2d() - test_unit_maxpool2d() - test_unit_flatten() - - print("\n2. Testing CNN pipeline integration...") - - # Build a simple CNN pipeline - print(" Building CNN: Conv2D → MaxPool2D → Flatten → Linear") - - # Create layers - conv1 = Conv2D(in_channels=3, out_channels=16, kernel_size=3) # RGB → 16 features - pool1 = MaxPool2D(pool_size=2) # Spatial downsampling - conv2 = Conv2D(in_channels=16, out_channels=32, kernel_size=3) # 16 → 32 features - pool2 = MaxPool2D(pool_size=2) # More downsampling - classifier = Linear(input_size=32*5*5, output_size=10) # To 10 classes - - # Test forward pass with realistic input - test_image = Tensor(np.random.randn(3, 28, 28)) # RGB image like CIFAR-10 - print(f" Input shape: {test_image.shape}") - - # Forward pass through CNN - x = conv1(test_image) - print(f" After Conv1: {x.shape}") - - x = pool1(x) - print(f" After Pool1: {x.shape}") - - x = conv2(x) - print(f" After Conv2: {x.shape}") - - x = pool2(x) - print(f" After Pool2: {x.shape}") - - x = flatten(x) - print(f" After Flatten: {x.shape}") - - x = classifier(x) - print(f" Final output: {x.shape}") - - # Verify final shape - assert x.shape == (10,), f"Expected (10,) output for classification, got {x.shape}" - - print("\n✅ All spatial module tests passed!") - print("🎯 CNN pipeline 
working correctly - ready for image classification!") - -# %% [markdown] -""" -## Main Execution Block - -All tests run when module is executed directly. -""" - -# %% -if __name__ == "__main__": - print("🚀 SPATIAL MODULE - CONVOLUTIONAL NEURAL NETWORKS") - print("=" * 60) - - # Run complete module test - test_module() - - # Run systems analysis - print("\n" + "=" * 60) - analyze_spatial_complexity() - - print("\n" + "=" * 60) - print("🎯 SPATIAL MODULE COMPLETE!") - print("📈 Progress: Spatial Operations ✓") - print("🔥 Next: Autograd - Automatic Differentiation!") - print("💪 You can now build CNNs for image recognition!") - -# %% [markdown] -""" -## 🤔 ML Systems Thinking: Interactive Questions - -Analyze your spatial implementations and their systems implications: - -### Question 1: Convolution Memory Access Patterns - -In your `conv2d_naive` implementation, you used nested loops to slide the kernel across the input. Analyze the memory access patterns in your nested loop structure: - -```python -for c in range(channels): - for i in range(out_height): - for j in range(out_width): - region = input_array[c, i:i+kernel_height, j:j+kernel_width] -``` - -**Analysis Question**: How could you reorder these loops or modify the memory access pattern to improve cache locality? Consider that modern CPUs have L1 cache sizes of ~32KB and cache lines of 64 bytes. Design specific modifications to your current implementation that would minimize cache misses. - -Think about: -- Which loop order accesses memory most sequentially? -- How does kernel size affect cache efficiency? -- What happens with large input images that don't fit in cache? -- How would you implement cache-blocking for very large convolutions? - -### Question 2: Multi-Channel Convolution Scaling - -Your `Conv2D` class processes multiple input and output channels. 
Looking at your implementation: - -```python -for out_ch in range(self.out_channels): - for in_ch in range(self.in_channels): - # Convolution operation -``` - -**Analysis Question**: Design a parallelization strategy for your multi-channel convolution that could efficiently utilize 8 GPU cores. How would you distribute the work across channels and spatial dimensions? What are the memory bandwidth requirements, and how would you handle synchronization? - -Think about: -- Which loops can be parallelized independently? -- How do you minimize memory transfers between GPU cores? -- What's the optimal work distribution for different input sizes? -- How does memory coalescing affect your parallel algorithm? - -### Question 3: CNN Architecture Memory Management - -You built a complete CNN pipeline: Conv2D → MaxPool2D → Conv2D → MaxPool2D → Flatten → Linear. Analyze the memory footprint of your pipeline: - -**Analysis Question**: For a batch of 32 CIFAR-10 images (32×32×3), calculate the peak memory usage during forward pass through your CNN architecture. Include intermediate activations, parameters, and gradients. At what point does memory become the limiting factor for larger models? - -Think about: -- Memory usage of each intermediate activation -- Parameter storage for each layer -- Gradient storage during backpropagation -- When would you need gradient checkpointing? -""" - -# %% [markdown] -""" -## 🎯 MODULE SUMMARY: Spatial Operations Complete! - -Congratulations! 
You've successfully implemented the core spatial operations that make CNNs work: - -### What You've Accomplished -✅ **Convolution Implementation**: Built conv2d_naive() and Conv2D class with multi-channel support -✅ **Pooling Operations**: Implemented MaxPool2D for spatial downsampling and translation invariance -✅ **Pipeline Integration**: Created complete CNN pipeline from images to classification -✅ **Systems Analysis**: Analyzed computational complexity and memory scaling of spatial operations -✅ **Testing Framework**: Validated each component with immediate unit testing - -### Key Learning Outcomes -- **Spatial Pattern Recognition**: Understanding how convolution detects local patterns -- **Parameter Efficiency**: How weight sharing makes CNNs practical for image processing -- **Computational Complexity**: Why spatial operations scale as O(N²) with input size -- **Memory Management**: How multi-channel operations affect parameter and activation memory - -### Mathematical Foundations Mastered -- **Convolution Operation**: Implemented as cross-correlation (no kernel flip), the convention deep learning frameworks call "convolution" -- **Spatial Dimensions**: How kernel size, stride, and padding affect output dimensions -- **Multi-Channel Processing**: Combining features across input channels to create output channels - -### Professional Skills Developed -- **CNN Architecture Design**: Building complete pipelines for image classification -- **Performance Analysis**: Understanding scaling bottlenecks in spatial operations -- **Memory Optimization**: Recognizing when spatial operations become memory-bound - -### Ready for Advanced Applications -Your spatial implementation now enables: -- **Image Classification**: CNNs for CIFAR-10, ImageNet-style datasets -- **Feature Extraction**: Hierarchical feature learning in deep networks -- **Computer Vision**: Foundation for object detection, segmentation, and more - -### Connection to Real ML Systems -Your implementation mirrors production systems: -- **PyTorch**:
`torch.nn.Conv2d` and `torch.nn.MaxPool2d` with similar APIs -- **TensorFlow**: `tf.keras.layers.Conv2D` for production computer vision -- **Industry Standard**: Weight sharing and spatial convolution are universal in CV - -### Next Steps -1. **Export your module**: `tito module complete 08_spatial` -2. **Validate integration**: `tito test --module spatial` -3. **Explore optimizations**: Consider im2col convolution algorithms -4. **Ready for Module 09**: Autograd will add automatic differentiation to your spatial operations - -**🚀 Achievement Unlocked**: Your spatial operations form the foundation for any computer vision application! CNNs + backpropagation = modern AI vision systems. -""" \ No newline at end of file diff --git a/modules_old/09_dataloader/ENHANCEMENT_SUMMARY.md b/modules_old/09_dataloader/ENHANCEMENT_SUMMARY.md deleted file mode 100644 index 6546aed5..00000000 --- a/modules_old/09_dataloader/ENHANCEMENT_SUMMARY.md +++ /dev/null @@ -1,155 +0,0 @@ -# Module 10 DataLoader Enhancement Summary - -## Enhancements Applied to Module 10 (DataLoader) - -### 1. Visual Teaching Elements Added - -#### Data Pipeline Flow Diagrams -- **Complete data pipeline visualization**: Raw Storage → Dataset → Shuffle → Batch → Neural Net -- **Batch processing impact analysis**: Visual tables showing GPU utilization vs batch size -- **Memory vs storage trade-offs**: Table showing dataset sizes and loading strategies -- **CIFAR-10 pipeline diagram**: Specific computer vision data flow -- **Performance comparison charts**: Sequential vs random access patterns - -#### ASCII Diagrams -- Data loading pipeline with detailed components -- Batch size performance analysis tables -- I/O strategy comparison visualizations -- Memory scaling patterns for different configurations - -### 2. 
Computational Assessment Questions (NBGrader-Compatible) - -#### Assessment 1: Batch Size Memory Trade-offs -- **Scenario**: GPU memory constraints with 8GB total, calculating max batch size -- **Implementation**: `calculate_max_batch_size()` function with proper scaffolding -- **Learning**: GPU memory management, production planning, cost optimization - -#### Assessment 2: I/O Bottleneck Analysis -- **Scenario**: Training pipeline with GPU vs storage speed analysis -- **Implementation**: `analyze_training_bottleneck()` function -- **Learning**: Systems performance, hardware utilization, optimization strategies - -#### Assessment 3: DataLoader Efficiency Optimization -- **Scenario**: Comparing shuffling vs non-shuffling training strategies -- **Implementation**: `compare_dataloader_strategies()` function -- **Learning**: Training efficiency, preprocessing overhead, model quality trade-offs - -### 3. Systems Insights Functions (Executable Analysis) - -#### Dataset Interface Analysis -```python -analyze_dataset_interface() -``` -- Why the 4-method interface is designed this way -- Framework compatibility across PyTorch/TensorFlow -- Benefits of universal interface pattern - -#### Batching Impact Analysis -```python -analyze_batching_impact() -``` -- Memory usage vs batch size calculations -- GPU utilization simulation -- Production scaling implications - -#### Data Reproducibility Analysis -```python -analyze_data_reproducibility() -``` -- Why deterministic data generation matters -- Synthetic vs real data trade-offs -- Testing and debugging benefits - -#### I/O Strategy Performance Analysis -```python -analyze_io_strategy_impact() -``` -- Sequential vs random access performance -- Cache locality and storage implications -- Training generalization vs speed trade-offs - -### 4. 
Enhanced Comments and Scaffolding - -#### Heavy Comments (Complex Logic) -- DataLoader `__iter__` method with detailed step-by-step explanation -- Batch creation and tensor stacking logic -- Memory management and efficiency considerations - -#### Medium Comments (Standard Operations) -- Dataset interface method implementations -- CIFAR-10 data loading and preprocessing -- Memory calculations and analysis functions - -#### Light Comments (Simple Operations) -- Basic property accessors -- Simple mathematical operations -- Standard Python patterns - -### 5. NBGrader Integration - -#### Proper Solution Blocks -- All implementations wrapped in `### BEGIN SOLUTION` / `### END SOLUTION` -- Student scaffolding (TODOs, HINTS, EXAMPLES) outside solution blocks -- Proper metadata for automated grading - -#### Assessment Structure -- Three computational assessments with graduated complexity -- Proper grade_id and points allocation -- Clear learning objectives and connections - -### 6. Systems Engineering Focus - -#### Production Connections -- Direct comparisons to PyTorch DataLoader and tf.data -- Real-world dataset examples (ImageNet, CIFAR-10) -- Production optimization strategies - -#### Performance Analysis -- Memory scaling calculations -- I/O bottleneck identification -- GPU utilization optimization - -#### Framework Integration -- Universal interface pattern explanation -- Skills transfer to production frameworks -- Industry standard practices - -### 7. 
Enhanced Module Structure - -#### Improved Introduction -- "Build → Use → Reflect" methodology -- Connection to previous modules (Tensor) -- Clear learning objectives focused on systems understanding - -#### Comprehensive Testing -- Individual unit tests with immediate execution -- Systems insight functions with checkpoint validation -- Aggregate testing function for complete validation - -#### Module Summary Enhancement -- Concrete achievements with metrics (lines of code, capabilities) -- Production system connections -- Mathematical foundations mastered -- Next steps for continued learning - -## Educational Impact - -The enhanced module now provides: - -1. **Visual Learning**: ASCII diagrams make abstract concepts concrete -2. **Hands-On Assessment**: Computational questions reinforce learning through implementation -3. **Systems Thinking**: Direct connections to production ML systems and performance optimization -4. **Immediate Feedback**: Executable analysis functions provide real-time insights -5. **Scalable Education**: NBGrader compatibility for classroom deployment - -## Technical Verification - -All enhancements maintain full backward compatibility while adding: -- ✅ Visual teaching elements -- ✅ Computational assessments (NBGrader-ready) -- ✅ Systems insights functions -- ✅ Enhanced scaffolding and comments -- ✅ Production connections and context -- ✅ Comprehensive testing validation - -The module successfully tests all functionality and provides a complete educational experience for building professional data loading systems. 
\ No newline at end of file diff --git a/modules_old/09_dataloader/README.md b/modules_old/09_dataloader/README.md deleted file mode 100644 index a0e6fe36..00000000 --- a/modules_old/09_dataloader/README.md +++ /dev/null @@ -1,274 +0,0 @@ -# 🔥 Module: DataLoader - -## 📊 Module Info -- **Difficulty**: ⭐⭐⭐ Advanced -- **Time Estimate**: 5-7 hours -- **Prerequisites**: Tensor, Layers modules -- **Next Steps**: Training, Networks modules - -Build the data pipeline foundation of TinyTorch! This module implements efficient data loading, preprocessing, and batching systems—the critical infrastructure that feeds neural networks during training and powers real-world ML systems. - -## 🎯 Learning Objectives - -By the end of this module, you will be able to: - -- **Design data pipeline architectures**: Understand data engineering as the foundation of scalable ML systems -- **Implement reusable dataset abstractions**: Build flexible interfaces that support multiple data sources and formats -- **Create efficient data loaders**: Develop batching, shuffling, and streaming systems for optimal training performance -- **Build preprocessing pipelines**: Implement normalization, augmentation, and transformation systems -- **Apply systems engineering principles**: Handle memory management, I/O optimization, and error recovery in data pipelines - -## 🧠 Build → Use → Optimize - -This module follows TinyTorch's **Build → Use → Optimize** framework: - -1. **Build**: Implement dataset abstractions, data loaders, and preprocessing pipelines from engineering principles -2. **Use**: Apply your data system to real CIFAR-10 dataset with complete train/test workflows -3. 
**Optimize**: Analyze performance characteristics, memory usage, and system bottlenecks for production readiness - -## 📚 What You'll Build - -### Complete Data Pipeline System -```python -# End-to-end data pipeline creation -train_loader, test_loader, normalizer = create_data_pipeline( - dataset_path="data/cifar10/", - batch_size=32, - normalize=True, - shuffle=True -) - -# Ready for neural network training -for batch_images, batch_labels in train_loader: - # batch_images.shape: (32, 3, 32, 32) - normalized pixel values - # batch_labels.shape: (32,) - class indices - predictions = model(batch_images) - loss = compute_loss(predictions, batch_labels) - # Continue training loop... -``` - -### Dataset Abstraction System -```python -# Flexible interface supporting multiple datasets -class Dataset: - def __getitem__(self, index): - # Return (data, label) for any dataset type - pass - def __len__(self): - # Enable len() and iteration - pass - -# Concrete implementation with real data -dataset = CIFAR10Dataset("data/cifar10/", train=True, download=True) -print(f"Loaded {len(dataset)} real samples") # 50,000 training images -image, label = dataset[0] # Access individual samples -print(f"Sample shape: {image.shape}, Label: {label}") -``` - -### Efficient Data Loading System -```python -# High-performance batching with memory optimization -dataloader = DataLoader( - dataset=dataset, - batch_size=32, # Configurable batch size - shuffle=True, # Training randomization - drop_last=False # Handle incomplete batches -) - -# Pythonic iteration interface -for batch_idx, (batch_data, batch_labels) in enumerate(dataloader): - print(f"Batch {batch_idx}: {batch_data.shape}") - # Automatic batching handles all the complexity -``` - -### Data Preprocessing Pipeline -```python -# Production-ready normalization system -normalizer = Normalizer() - -# Fit on training data (compute statistics once) -normalizer.fit(training_images) -print(f"Mean: {normalizer.mean}, Std: {normalizer.std}") - -# 
Apply to any dataset (training, validation, test) -normalized_images = normalizer.transform(test_images) -# Ensures consistent preprocessing across data splits -``` - -## 🎯 NEW: CIFAR-10 Support for North Star Goal - -### Built-in CIFAR-10 Download and Loading -This module now includes complete CIFAR-10 support to achieve our semester goal of 75% accuracy: - -```python -from tinytorch.core.dataloader import CIFAR10Dataset, download_cifar10 - -# Download CIFAR-10 automatically (one-time, ~170MB) -dataset_path = download_cifar10() # Downloads to ./data/cifar-10-batches-py - -# Load training and test data -dataset = CIFAR10Dataset(download=True, flatten=False) -print(f"✅ Loaded {len(dataset.train_data)} training samples") -print(f"✅ Loaded {len(dataset.test_data)} test samples") - -# Create DataLoaders for training -from tinytorch.core.dataloader import DataLoader -train_loader = DataLoader(dataset.train_data, dataset.train_labels, batch_size=32, shuffle=True) -test_loader = DataLoader(dataset.test_data, dataset.test_labels, batch_size=32, shuffle=False) - -# Ready for CNN training! -for batch_images, batch_labels in train_loader: - print(f"Batch shape: {batch_images.shape}") # (32, 3, 32, 32) for CNNs - break -``` - -### What's New in This Module -- ✅ **`download_cifar10()`**: Automatically downloads and extracts CIFAR-10 dataset -- ✅ **`CIFAR10Dataset`**: Complete dataset class with train/test splits -- ✅ **Real Data Support**: Work with actual 32x32 RGB images, not toy data -- ✅ **Production Features**: Shuffling, batching, normalization for real training - -## 🚀 Getting Started - -### Prerequisites -Ensure you have the foundational tensor operations: - -```bash -# Activate TinyTorch environment -source bin/activate-tinytorch.sh - -# Verify prerequisite modules -tito test --module tensor -tito test --module layers -``` - -### Development Workflow -1. **Open the development file**: `modules/source/09_dataloader/dataloader_dev.py` -2. 
**Implement Dataset abstraction**: Create the base interface for all data sources -3. **Build CIFAR-10 dataset**: Implement real dataset loading with binary file parsing -4. **Create DataLoader system**: Add batching, shuffling, and iteration functionality -5. **Add preprocessing tools**: Implement normalizer and transformation pipeline -6. **Export and verify**: `tito export --module dataloader && tito test --module dataloader` - -## 🧪 Testing Your Implementation - -### Comprehensive Test Suite -Run the full test suite to verify data engineering functionality: - -```bash -# TinyTorch CLI (recommended) -tito test --module dataloader - -# Direct pytest execution -python -m pytest tests/ -k dataloader -v -``` - -### Test Coverage Areas -- ✅ **Dataset Interface**: Verify abstract base class and concrete implementations -- ✅ **Real Data Loading**: Test with actual CIFAR-10 dataset (downloads ~170MB) -- ✅ **Batching System**: Ensure correct batch shapes and memory efficiency -- ✅ **Data Preprocessing**: Verify normalization statistics and transformations -- ✅ **Pipeline Integration**: Test complete train/test workflow with real data - -### Inline Testing & Real Data Validation -The module includes comprehensive feedback using real CIFAR-10 data: -```python -# Example inline test output -🔬 Unit Test: CIFAR-10 dataset loading... -📥 Downloading CIFAR-10 dataset (170MB)... -✅ Successfully loaded 50,000 training samples -✅ Sample shapes correct: (3, 32, 32) -✅ Labels in valid range: [0, 9] -📈 Progress: CIFAR-10 Dataset ✓ - -# DataLoader testing with real data -🔬 Unit Test: DataLoader batching... 
-✅ Batch shapes correct: (32, 3, 32, 32) -✅ Shuffling produces different orders -✅ Iteration covers all samples exactly once -📈 Progress: DataLoader ✓ -``` - -### Manual Testing Examples -```python -from tinytorch.core.tensor import Tensor -from dataloader_dev import CIFAR10Dataset, DataLoader, Normalizer - -# Test dataset loading with real data -dataset = CIFAR10Dataset("data/cifar10/", train=True, download=True) -print(f"Dataset size: {len(dataset)}") -print(f"Classes: {dataset.get_num_classes()}") - -# Test data loading pipeline -dataloader = DataLoader(dataset, batch_size=16, shuffle=True) -for batch_images, batch_labels in dataloader: - print(f"Batch shape: {batch_images.shape}") - print(f"Label range: {batch_labels.min()} to {batch_labels.max()}") - break # Just test first batch - -# Test preprocessing pipeline -normalizer = Normalizer() -sample_batch, _ = next(iter(dataloader)) -normalizer.fit(sample_batch) -normalized = normalizer.transform(sample_batch) -print(f"Original range: [{sample_batch.min():.2f}, {sample_batch.max():.2f}]") -print(f"Normalized range: [{normalized.min():.2f}, {normalized.max():.2f}]") -``` - -## 🎯 Key Concepts - -### Real-World Applications -- **Production ML Systems**: Companies like Netflix, Spotify use similar data pipelines for recommendation training -- **Computer Vision**: ImageNet, COCO dataset loaders power research and production vision systems -- **Natural Language Processing**: Text preprocessing pipelines enable language model training -- **Autonomous Systems**: Real-time data streams from sensors require efficient pipeline architectures - -### Data Engineering Principles -- **Interface Design**: Abstract Dataset class enables switching between data sources seamlessly -- **Memory Efficiency**: Streaming data loading prevents memory overflow with large datasets -- **I/O Optimization**: Batching reduces system calls and improves throughput -- **Preprocessing Consistency**: Fit-transform pattern ensures identical 
preprocessing across data splits - -### Systems Performance Considerations -- **Batch Size Trade-offs**: Larger batches improve GPU utilization but increase memory usage -- **Shuffling Strategy**: Random access patterns for training vs sequential for inference -- **Caching and Storage**: Balance between memory usage and I/O performance -- **Error Handling**: Robust handling of corrupted data, network failures, disk issues - -### Production ML Pipeline Patterns -- **ETL Design**: Extract (load files), Transform (preprocess), Load (batch) pattern -- **Data Versioning**: Reproducible datasets with consistent preprocessing -- **Pipeline Monitoring**: Track data quality, distribution shifts, processing times -- **Scalability Planning**: Design for growing datasets and distributed processing - -## 🎉 Ready to Build? - -You're about to build the data engineering foundation that powers every successful ML system! From startup prototypes to billion-dollar recommendation engines, they all depend on robust data pipelines like the one you're building. - -This module teaches you the systems thinking that separates hobby projects from production ML systems. You'll work with real data, handle real performance constraints, and build infrastructure that scales. Take your time, think about edge cases, and enjoy building the backbone of machine learning! 
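The fit-transform pattern described above can be sketched in a few lines of NumPy. This is a simplified stand-in for the module's `Normalizer`, assuming scalar mean/std over the whole array rather than per-channel statistics:

```python
import numpy as np

class SimpleNormalizer:
    """Minimal fit/transform normalizer (simplified sketch).

    fit() computes statistics once, on the training split only;
    transform() reuses them everywhere, so train/val/test all see
    identical preprocessing.
    """

    def fit(self, data: np.ndarray) -> "SimpleNormalizer":
        self.mean = data.mean()
        self.std = data.std()
        return self

    def transform(self, data: np.ndarray) -> np.ndarray:
        # Small epsilon guards against division by zero on constant data
        return (data - self.mean) / (self.std + 1e-8)

train = np.random.rand(100, 3, 32, 32).astype(np.float32)
test = np.random.rand(20, 3, 32, 32).astype(np.float32)

norm = SimpleNormalizer().fit(train)   # statistics from training data only
train_n = norm.transform(train)        # mean ≈ 0 after normalization
test_n = norm.transform(test)          # same scaling applied to test split
```

Fitting on the test split instead would leak test statistics into preprocessing, which is exactly the inconsistency the pattern exists to prevent.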
- -```{grid} 3 -:gutter: 3 -:margin: 2 - -{grid-item-card} 🚀 Launch Builder -:link: https://mybinder.org/v2/gh/VJProductions/TinyTorch/main?filepath=modules/source/09_dataloader/dataloader_dev.py -:class-title: text-center -:class-body: text-center - -Interactive development environment - -{grid-item-card} 📓 Open in Colab -:link: https://colab.research.google.com/github/VJProductions/TinyTorch/blob/main/modules/source/09_dataloader/dataloader_dev.ipynb -:class-title: text-center -:class-body: text-center - -Google Colab notebook - -{grid-item-card} 👀 View Source -:link: https://github.com/VJProductions/TinyTorch/blob/main/modules/source/09_dataloader/dataloader_dev.py -:class-title: text-center -:class-body: text-center - -Browse the code on GitHub -``` \ No newline at end of file diff --git a/modules_old/09_dataloader/dataloader_dev.ipynb b/modules_old/09_dataloader/dataloader_dev.ipynb deleted file mode 100644 index 4cb257e5..00000000 --- a/modules_old/09_dataloader/dataloader_dev.ipynb +++ /dev/null @@ -1,2122 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "4c9bc6eb", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "# DataLoader - Efficient Data Pipeline and Batch Processing Systems\n", - "\n", - "Welcome to the DataLoader module! 
You'll build the data infrastructure that feeds neural networks, understanding how I/O optimization and memory management determine training speed.\n", - "\n", - "## Learning Goals\n", - "- Systems understanding: How data I/O becomes the bottleneck in ML training and why efficient data pipelines are critical for system performance\n", - "- Core implementation skill: Build Dataset and DataLoader classes with batching, shuffling, and memory-efficient iteration patterns\n", - "- Pattern recognition: Understand the universal Dataset/DataLoader abstraction used across all ML frameworks\n", - "- Framework connection: See how your implementation mirrors PyTorch's data loading infrastructure and optimization strategies\n", - "- Performance insight: Learn why data loading parallelization and prefetching are essential for GPU utilization in production systems\n", - "\n", - "## Build → Use → Reflect\n", - "1. **Build**: Complete Dataset and DataLoader classes with efficient batching, shuffling, and real dataset support (CIFAR-10)\n", - "2. **Use**: Load large-scale image datasets and feed them to neural networks with proper batch processing\n", - "3. 
**Reflect**: Why does data loading speed often determine training speed more than model computation?\n", - "\n", - "## What You'll Achieve\n", - "By the end of this module, you'll understand:\n", - "- Deep technical understanding of how efficient data pipelines enable scalable ML training\n", - "- Practical capability to build data loading systems that handle datasets larger than memory\n", - "- Systems insight into why data engineering is often the limiting factor in ML system performance\n", - "- Performance consideration of how batch size, shuffling, and prefetching affect training throughput and convergence\n", - "- Connection to production ML systems and how frameworks optimize data loading for different storage systems\n", - "\n", - "## Systems Reality Check\n", - "💡 **Production Context**: PyTorch's DataLoader uses multiprocessing and memory pinning to overlap data loading with GPU computation, achieving near-zero data loading overhead\n", - "⚡ **Performance Note**: Modern GPUs can process data faster than storage systems can provide it - data loading optimization is critical for hardware utilization in production training" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "92c9d8b6", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "dataloader-imports", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "#| default_exp core.dataloader\n", - "\n", - "#| export\n", - "import numpy as np\n", - "import sys\n", - "import os\n", - "from typing import Tuple, Optional, Iterator\n", - "import urllib.request\n", - "import tarfile\n", - "import pickle\n", - "import time\n", - "\n", - "# Import our building blocks - try package first, then local modules\n", - "try:\n", - " from tinytorch.core.tensor import Tensor\n", - "except ImportError:\n", - " # For development, import from local modules\n", - " sys.path.append(os.path.join(os.path.dirname(__file__), '..', 
'01_tensor'))\n", - " from tensor_dev import Tensor" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2959209b", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "dataloader-welcome", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "print(\"🔥 TinyTorch DataLoader Module\")\n", - "print(f\"NumPy version: {np.__version__}\")\n", - "print(f\"Python version: {sys.version_info.major}.{sys.version_info.minor}\")\n", - "print(\"Ready to build data pipelines!\")" - ] - }, - { - "cell_type": "markdown", - "id": "8f2d9467", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 📦 Where This Code Lives in the Final Package\n", - "\n", - "**Learning Side:** You work in `modules/source/06_dataloader/dataloader_dev.py` \n", - "**Building Side:** Code exports to `tinytorch.core.dataloader`\n", - "\n", - "```python\n", - "# Final package structure:\n", - "from tinytorch.core.dataloader import Dataset, DataLoader # Data loading utilities!\n", - "from tinytorch.core.tensor import Tensor # Foundation\n", - "from tinytorch.core.networks import Sequential # Models to train\n", - "```\n", - "\n", - "**Why this matters:**\n", - "- **Learning:** Focused modules for deep understanding of data pipelines\n", - "- **Production:** Proper organization like PyTorch's `torch.utils.data`\n", - "- **Consistency:** All data loading utilities live together in `core.dataloader`\n", - "- **Integration:** Works seamlessly with tensors and networks" - ] - }, - { - "cell_type": "markdown", - "id": "8b07e46b", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🔧 DEVELOPMENT" - ] - }, - { - "cell_type": "markdown", - "id": "52c9b734", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## Step 1: Understanding Data Pipelines\n", - "\n", - "### What are Data Pipelines?\n", - "**Data pipelines** are the systems that efficiently move data from storage to your 
model. They're the foundation of all machine learning systems.\n", - "\n", - "### The Data Pipeline Equation\n", - "```\n", - "Raw Data → Load → Transform → Batch → Model → Predictions\n", - "```\n", - "\n", - "### Why Data Pipelines Matter\n", - "- **Performance**: Efficient loading prevents GPU starvation\n", - "- **Scalability**: Handle datasets larger than memory\n", - "- **Consistency**: Reproducible data processing\n", - "- **Flexibility**: Easy to switch between datasets\n", - "\n", - "### Real-World Challenges\n", - "- **Memory constraints**: Datasets often exceed available RAM\n", - "- **I/O bottlenecks**: Disk access is much slower than computation\n", - "- **Batch processing**: Neural networks need batched data for efficiency\n", - "- **Shuffling**: Random order prevents overfitting\n", - "\n", - "### Systems Thinking\n", - "- **Memory efficiency**: Handle datasets larger than RAM\n", - "- **I/O optimization**: Read from disk efficiently\n", - "- **Batching strategies**: Trade-offs between memory and speed\n", - "- **Caching**: When to cache vs recompute\n", - "\n", - "### Visual Intuition\n", - "```\n", - "Raw Files: [image1.jpg, image2.jpg, image3.jpg, ...]\n", - "Load: [Tensor(32x32x3), Tensor(32x32x3), Tensor(32x32x3), ...]\n", - "Batch: [Tensor(32, 32, 32, 3)] # 32 images at once\n", - "Model: Process batch efficiently\n", - "```\n", - "\n", - "Let's start by building the most fundamental component: **Dataset**." - ] - }, - { - "cell_type": "markdown", - "id": "d07094e6", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Step 2: Building the Dataset Interface\n", - "\n", - "### What is a Dataset?\n", - "A **Dataset** is an abstract interface that provides consistent access to data. 
It's the foundation of all data loading systems.\n", - "\n", - "### Why Abstract Interfaces Matter\n", - "- **Consistency**: Same interface for all data types\n", - "- **Flexibility**: Easy to switch between datasets\n", - "- **Testability**: Easy to create test datasets\n", - "- **Extensibility**: Easy to add new data sources\n", - "\n", - "### The Dataset Pattern\n", - "```python\n", - "class Dataset:\n", - " def __getitem__(self, index): # Get single sample\n", - " return data, label\n", - " \n", - " def __len__(self): # Get dataset size\n", - " return total_samples\n", - "```\n", - "\n", - "### Real-World Usage\n", - "- **Computer vision**: ImageNet, CIFAR-10, custom image datasets\n", - "- **NLP**: Text datasets, tokenized sequences\n", - "- **Audio**: Audio files, spectrograms\n", - "- **Time series**: Sequential data with proper windowing\n", - "\n", - "Let's implement the Dataset interface!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "275c4926", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "dataset-class", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class Dataset:\n", - " \"\"\"\n", - " Base Dataset class: Abstract interface for all datasets.\n", - " \n", - " The fundamental abstraction for data loading in TinyTorch.\n", - " Students implement concrete datasets by inheriting from this class.\n", - " \"\"\"\n", - " \n", - " def __getitem__(self, index: int) -> Tuple[Tensor, Tensor]:\n", - " \"\"\"\n", - " Get a single sample and label by index.\n", - " \n", - " Args:\n", - " index: Index of the sample to retrieve\n", - " \n", - " Returns:\n", - " Tuple of (data, label) tensors\n", - " \n", - " TODO: Implement abstract method for getting samples.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. This is an abstract method - subclasses will implement it\n", - " 2. 
Return a tuple of (data, label) tensors\n", - " 3. Data should be the input features, label should be the target\n", - " \n", - " EXAMPLE:\n", - " dataset[0] should return (Tensor(image_data), Tensor(label))\n", - " \n", - " LEARNING CONNECTIONS:\n", - " - **PyTorch Integration**: This follows the exact same pattern as torch.utils.data.Dataset\n", - " - **Production Data**: Real datasets like ImageNet, CIFAR-10 use this interface\n", - " - **Memory Efficiency**: On-demand loading prevents loading entire dataset into memory\n", - " - **Batching Foundation**: DataLoader uses __getitem__ to create batches efficiently\n", - " \n", - " HINTS:\n", - " - This is an abstract method that subclasses must override\n", - " - Always return a tuple of (data, label) tensors\n", - " - Data contains the input features, label contains the target\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # This is an abstract method - subclasses must implement it\n", - " raise NotImplementedError(\"Subclasses must implement __getitem__\")\n", - " ### END SOLUTION\n", - " \n", - " def __len__(self) -> int:\n", - " \"\"\"\n", - " Get the total number of samples in the dataset.\n", - " \n", - " TODO: Implement abstract method for getting dataset size.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. This is an abstract method - subclasses will implement it\n", - " 2. 
Return the total number of samples in the dataset\n", - " \n", - " EXAMPLE:\n", - " len(dataset) should return 50000 for CIFAR-10 training set\n", - " \n", - " LEARNING CONNECTIONS:\n", - " - **Memory Planning**: DataLoader uses len() to calculate number of batches\n", - " - **Progress Tracking**: Training loops use len() for progress bars and epoch calculations\n", - " - **Distributed Training**: Multi-GPU systems need dataset size for work distribution\n", - " - **Statistical Sampling**: Some training strategies require knowing total dataset size\n", - " \n", - " HINTS:\n", - " - This is an abstract method that subclasses must override\n", - " - Return an integer representing the total number of samples\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # This is an abstract method - subclasses must implement it\n", - " raise NotImplementedError(\"Subclasses must implement __len__\")\n", - " ### END SOLUTION\n", - " \n", - " def get_sample_shape(self) -> Tuple[int, ...]:\n", - " \"\"\"\n", - " Get the shape of a single data sample.\n", - " \n", - " TODO: Implement method to get sample shape.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Get the first sample using self[0]\n", - " 2. Extract the data part (first element of tuple)\n", - " 3. 
Return the shape of the data tensor\n", - " \n", - " EXAMPLE:\n", - " For CIFAR-10: returns (3, 32, 32) for RGB images\n", - " \n", - " LEARNING CONNECTIONS:\n", - " - **Model Architecture**: Neural networks need to know input shape for first layer\n", - " - **Batch Planning**: Systems use sample shape to calculate memory requirements\n", - " - **Preprocessing Validation**: Ensures all samples have consistent shape\n", - " - **Framework Integration**: Similar to PyTorch's dataset shape inspection\n", - " \n", - " HINTS:\n", - " - Use self[0] to get the first sample\n", - " - Extract data from the (data, label) tuple\n", - " - Return data.shape\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Get the first sample to determine shape\n", - " data, _ = self[0]\n", - " return data.shape\n", - " ### END SOLUTION\n", - " \n", - " def get_num_classes(self) -> int:\n", - " \"\"\"\n", - " Get the number of classes in the dataset.\n", - " \n", - " TODO: Implement abstract method for getting number of classes.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. This is an abstract method - subclasses will implement it\n", - " 2. 
Return the number of unique classes in the dataset\n", - " \n", - " EXAMPLE:\n", - " For CIFAR-10: returns 10 (classes 0-9)\n", - " \n", - " LEARNING CONNECTIONS:\n", - " - **Output Layer Design**: Neural networks need num_classes for final layer size\n", - " - **Loss Function Setup**: CrossEntropyLoss uses num_classes for proper computation\n", - " - **Evaluation Metrics**: Accuracy calculation depends on number of classes\n", - " - **Model Validation**: Ensures model predictions match expected class range\n", - " \n", - " HINTS:\n", - " - This is an abstract method that subclasses must override\n", - " - Return the number of unique classes/categories\n", - " \"\"\"\n", - " # This is an abstract method - subclasses must implement it\n", - " raise NotImplementedError(\"Subclasses must implement get_num_classes\")" - ] - }, - { - "cell_type": "markdown", - "id": "06c34e75", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### 🧪 Unit Test: Dataset Interface\n", - "\n", - "Let's understand the Dataset interface! While we can't test the abstract class directly, we'll create a simple test dataset.\n", - "\n", - "**This is a unit test** - it tests the Dataset interface pattern in isolation." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7e349589", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "test-dataset-interface-immediate", - "locked": true, - "points": 5, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "# Test Dataset interface with a simple implementation\n", - "print(\"🔬 Unit Test: Dataset Interface...\")\n", - "\n", - "# Create a minimal test dataset\n", - "class TestDataset(Dataset):\n", - " def __init__(self, size=5):\n", - " self.size = size\n", - " \n", - " def __getitem__(self, index):\n", - " # Simple test data: features are [index, index*2], label is index % 2\n", - " data = Tensor([index, index * 2])\n", - " label = Tensor([index % 2])\n", - " return data, label\n", - " \n", - " def __len__(self):\n", - " return self.size\n", - " \n", - " def get_num_classes(self):\n", - " return 2\n", - "\n", - "# Test the interface (moved to main block)" - ] - }, - { - "cell_type": "markdown", - "id": "261ad6cc", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Step 3: Building the DataLoader\n", - "\n", - "### What is a DataLoader?\n", - "A **DataLoader** efficiently batches and iterates through datasets. 
It's the bridge between individual samples and the batched data that neural networks expect.\n", - "\n", - "### Why DataLoaders Matter\n", - "- **Batching**: Groups samples for efficient GPU computation\n", - "- **Shuffling**: Randomizes data order to prevent overfitting\n", - "- **Memory efficiency**: Loads data on-demand rather than all at once\n", - "- **Iteration**: Provides clean interface for training loops\n", - "\n", - "### The DataLoader Pattern\n", - "```python\n", - "DataLoader(dataset, batch_size=32, shuffle=True)\n", - "for batch_data, batch_labels in dataloader:\n", - " # batch_data.shape: (32, ...)\n", - " # batch_labels.shape: (32,)\n", - " # Train on batch\n", - "```\n", - "\n", - "### Real-World Applications\n", - "- **Training loops**: Feed batches to neural networks\n", - "- **Validation**: Evaluate models on held-out data\n", - "- **Inference**: Process large datasets efficiently\n", - "- **Data analysis**: Explore datasets systematically\n", - "\n", - "### Systems Thinking\n", - "- **Batch size**: Trade-off between memory and speed\n", - "- **Shuffling**: Prevents overfitting to data order\n", - "- **Iteration**: Efficient looping through data\n", - "- **Memory**: Manage large datasets that don't fit in RAM" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a7607154", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "dataloader-class", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class DataLoader:\n", - " \"\"\"\n", - " DataLoader: Efficiently batch and iterate through datasets.\n", - " \n", - " Provides batching, shuffling, and efficient iteration over datasets.\n", - " Essential for training neural networks efficiently.\n", - " \"\"\"\n", - " \n", - " def __init__(self, dataset: Dataset, batch_size: int = 32, shuffle: bool = True):\n", - " \"\"\"\n", - " Initialize DataLoader.\n", - " 
\n", - " Args:\n", - " dataset: Dataset to load from\n", - " batch_size: Number of samples per batch\n", - " shuffle: Whether to shuffle data each epoch\n", - " \n", - " TODO: Store configuration and dataset.\n", - " \n", - " APPROACH:\n", - " 1. Store dataset as self.dataset\n", - " 2. Store batch_size as self.batch_size\n", - " 3. Store shuffle as self.shuffle\n", - " \n", - " EXAMPLE:\n", - " DataLoader(dataset, batch_size=32, shuffle=True)\n", - " \n", - " HINTS:\n", - " - Store all parameters as instance variables\n", - " - These will be used in __iter__ for batching\n", - " \"\"\"\n", - " # Input validation\n", - " if dataset is None:\n", - " raise TypeError(\"Dataset cannot be None\")\n", - " if not isinstance(batch_size, int) or batch_size <= 0:\n", - " raise ValueError(f\"Batch size must be a positive integer, got {batch_size}\")\n", - " \n", - " self.dataset = dataset\n", - " self.batch_size = batch_size\n", - " self.shuffle = shuffle\n", - " \n", - " def __iter__(self) -> Iterator[Tuple[Tensor, Tensor]]:\n", - " \"\"\"\n", - " Iterate through dataset in batches.\n", - " \n", - " Returns:\n", - " Iterator yielding (batch_data, batch_labels) tuples\n", - " \n", - " TODO: Implement batching and shuffling logic.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Create indices list: list(range(len(dataset)))\n", - " 2. Shuffle indices if self.shuffle is True\n", - " 3. Loop through indices in batch_size chunks\n", - " 4. 
For each batch: collect samples, stack them, yield batch\n", - " \n", - " EXAMPLE:\n", - " for batch_data, batch_labels in dataloader:\n", - " # batch_data.shape: (batch_size, ...)\n", - " # batch_labels.shape: (batch_size,)\n", - " \n", - " LEARNING CONNECTIONS:\n", - " - **GPU Efficiency**: Batching maximizes GPU utilization by processing multiple samples together\n", - " - **Training Stability**: Shuffling prevents overfitting to data order and improves generalization\n", - " - **Memory Management**: Batches fit in GPU memory while full dataset may not\n", - " - **Gradient Estimation**: Batch gradients provide better estimates than single-sample gradients\n", - " \n", - " HINTS:\n", - " - Use list(range(len(self.dataset))) for indices\n", - " - Use np.random.shuffle() if self.shuffle is True\n", - " - Loop in chunks of self.batch_size\n", - " - Collect samples and stack with np.stack()\n", - " \"\"\"\n", - " # Create indices for all samples\n", - " indices = list(range(len(self.dataset)))\n", - " \n", - " # Shuffle if requested\n", - " if self.shuffle:\n", - " np.random.shuffle(indices)\n", - " \n", - " # Iterate through indices in batches\n", - " for i in range(0, len(indices), self.batch_size):\n", - " batch_indices = indices[i:i + self.batch_size]\n", - " \n", - " # Collect samples for this batch\n", - " batch_data = []\n", - " batch_labels = []\n", - " \n", - " for idx in batch_indices:\n", - " data, label = self.dataset[idx]\n", - " batch_data.append(data.data)\n", - " batch_labels.append(label.data)\n", - " \n", - " # Stack into batch tensors\n", - " batch_data_array = np.stack(batch_data, axis=0)\n", - " batch_labels_array = np.stack(batch_labels, axis=0)\n", - " \n", - " yield Tensor(batch_data_array), Tensor(batch_labels_array)\n", - " \n", - " def __len__(self) -> int:\n", - " \"\"\"\n", - " Get the number of batches per epoch.\n", - " \n", - " TODO: Calculate number of batches.\n", - " \n", - " APPROACH:\n", - " 1. 
Get dataset size: len(self.dataset)\n", - " 2. Divide by batch_size and round up\n", - " 3. Use ceiling division: (n + batch_size - 1) // batch_size\n", - " \n", - " EXAMPLE:\n", - " Dataset size 100, batch size 32 → 4 batches\n", - " \n", - " HINTS:\n", - " - Use len(self.dataset) for dataset size\n", - " - Use ceiling division for exact batch count\n", - " - Formula: (dataset_size + batch_size - 1) // batch_size\n", - " \"\"\"\n", - " # Calculate number of batches using ceiling division\n", - " dataset_size = len(self.dataset)\n", - " return (dataset_size + self.batch_size - 1) // self.batch_size" - ] - }, - { - "cell_type": "markdown", - "id": "ec802471", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### 🧪 Unit Test: DataLoader\n", - "\n", - "Let's test your DataLoader implementation! This is the heart of efficient data loading for neural networks.\n", - "\n", - "**This is a unit test** - it tests the DataLoader class in isolation." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "cb2f9065", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test-dataloader-immediate", - "locked": true, - "points": 10, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "# Test DataLoader immediately after implementation\n", - "print(\"🔬 Unit Test: DataLoader...\")\n", - "\n", - "# Use the test dataset from before\n", - "class TestDataset(Dataset):\n", - " def __init__(self, size=10):\n", - " self.size = size\n", - " \n", - " def __getitem__(self, index):\n", - " data = Tensor([index, index * 2])\n", - " label = Tensor([index % 3]) # 3 classes\n", - " return data, label\n", - " \n", - " def __len__(self):\n", - " return self.size\n", - " \n", - " def get_num_classes(self):\n", - " return 3\n", - "\n", - "# Test basic DataLoader functionality\n", - "try:\n", - " dataset = TestDataset(size=10)\n", - " dataloader = DataLoader(dataset, batch_size=3, shuffle=False)\n", - " \n", - " 
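Putting the two pieces together, the index-chunking in `__iter__` and the ceiling division in `__len__` can be sketched with plain Python lists, independent of Tensors and NumPy (the `iterate_batches` and `num_batches` helpers below are illustrative, not TinyTorch API):

```python
import math
import random

def num_batches(dataset_size, batch_size):
    # Ceiling division without floats: equivalent to math.ceil(n / b)
    return (dataset_size + batch_size - 1) // batch_size

def iterate_batches(samples, batch_size, shuffle=False, seed=0):
    indices = list(range(len(samples)))           # 1. index list
    if shuffle:
        random.Random(seed).shuffle(indices)      # 2. shuffle in place
    for i in range(0, len(indices), batch_size):  # 3. walk indices in chunks
        yield [samples[j] for j in indices[i:i + batch_size]]  # 4. gather + yield

data = list(range(10))
batches = list(iterate_batches(data, batch_size=3))
print([len(b) for b in batches])                  # → [3, 3, 3, 1]
print(num_batches(10, 3) == len(batches) == math.ceil(10 / 3))  # → True
```

Note the last batch is smaller when `batch_size` doesn't divide the dataset size evenly, which is exactly why `__len__` needs ceiling rather than floor division.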
print(f\"DataLoader created: batch_size={dataloader.batch_size}, shuffle={dataloader.shuffle}\")\n", - " print(f\"Number of batches: {len(dataloader)}\")\n", - " \n", - " # Test __len__\n", - " expected_batches = (10 + 3 - 1) // 3 # Ceiling division: 4 batches\n", - " assert len(dataloader) == expected_batches, f\"Should have {expected_batches} batches, got {len(dataloader)}\"\n", - " print(\"✅ DataLoader __len__ works correctly\")\n", - " \n", - " # Test iteration\n", - " batch_count = 0\n", - " total_samples = 0\n", - " \n", - " for batch_data, batch_labels in dataloader:\n", - " batch_count += 1\n", - " batch_size = batch_data.shape[0]\n", - " total_samples += batch_size\n", - " \n", - " print(f\"Batch {batch_count}: data shape {batch_data.shape}, labels shape {batch_labels.shape}\")\n", - " \n", - " # Verify batch dimensions\n", - " assert len(batch_data.shape) == 2, f\"Batch data should be 2D, got {batch_data.shape}\"\n", - " assert len(batch_labels.shape) == 2, f\"Batch labels should be 2D, got {batch_labels.shape}\"\n", - " assert batch_data.shape[1] == 2, f\"Each sample should have 2 features, got {batch_data.shape[1]}\"\n", - " assert batch_labels.shape[1] == 1, f\"Each label should have 1 element, got {batch_labels.shape[1]}\"\n", - " \n", - " assert batch_count == expected_batches, f\"Should iterate {expected_batches} times, got {batch_count}\"\n", - " assert total_samples == 10, f\"Should process 10 total samples, got {total_samples}\"\n", - " print(\"✅ DataLoader iteration works correctly\")\n", - " \n", - "except Exception as e:\n", - " print(f\"❌ DataLoader test failed: {e}\")\n", - " raise\n", - "\n", - "# Test shuffling\n", - "try:\n", - " dataloader_shuffle = DataLoader(dataset, batch_size=5, shuffle=True)\n", - " dataloader_no_shuffle = DataLoader(dataset, batch_size=5, shuffle=False)\n", - " \n", - " # Get first batch from each\n", - " batch1_shuffle = next(iter(dataloader_shuffle))\n", - " batch1_no_shuffle = 
next(iter(dataloader_no_shuffle))\n", - " \n", - " print(\"✅ DataLoader shuffling parameter works\")\n", - " \n", - "except Exception as e:\n", - " print(f\"❌ DataLoader shuffling test failed: {e}\")\n", - " raise\n", - "\n", - "# Test different batch sizes\n", - "try:\n", - " small_loader = DataLoader(dataset, batch_size=2, shuffle=False)\n", - " large_loader = DataLoader(dataset, batch_size=8, shuffle=False)\n", - " \n", - " assert len(small_loader) == 5, f\"Small loader should have 5 batches, got {len(small_loader)}\"\n", - " assert len(large_loader) == 2, f\"Large loader should have 2 batches, got {len(large_loader)}\"\n", - " print(\"✅ DataLoader handles different batch sizes correctly\")\n", - " \n", - "except Exception as e:\n", - " print(f\"❌ DataLoader batch size test failed: {e}\")\n", - " raise\n", - "\n", - "# Show the DataLoader behavior\n", - "print(\"🎯 DataLoader behavior:\")\n", - "print(\" Batches data for efficient processing\")\n", - "print(\" Handles shuffling and iteration\")\n", - "print(\" Provides clean interface for training loops\")\n", - "print(\"📈 Progress: Dataset interface ✓, DataLoader ✓\")" - ] - }, - { - "cell_type": "markdown", - "id": "a834dfd9", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Step 4: Creating a Simple Dataset Example\n", - "\n", - "### Why We Need Concrete Examples\n", - "Abstract classes are great for interfaces, but we need concrete implementations to understand how they work. 
Let's create a simple dataset for testing.\n", - "\n", - "### Design Principles\n", - "- **Simple**: Easy to understand and debug\n", - "- **Configurable**: Adjustable size and properties\n", - "- **Predictable**: Deterministic data for testing\n", - "- **Educational**: Shows the Dataset pattern clearly\n", - "\n", - "### Real-World Connection\n", - "This pattern is used for:\n", - "- **CIFAR-10**: 32x32 RGB images with 10 classes\n", - "- **ImageNet**: High-resolution images with 1000 classes\n", - "- **MNIST**: 28x28 grayscale digits with 10 classes\n", - "- **Custom datasets**: Your own data following this pattern" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "39e77a02", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "simple-dataset", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class SimpleDataset(Dataset):\n", - " \"\"\"\n", - " Simple dataset for testing and demonstration.\n", - " \n", - " Generates synthetic data with configurable size and properties.\n", - " Perfect for understanding the Dataset pattern.\n", - " \"\"\"\n", - " \n", - " def __init__(self, size: int = 100, num_features: int = 4, num_classes: int = 3):\n", - " \"\"\"\n", - " Initialize SimpleDataset.\n", - " \n", - " Args:\n", - " size: Number of samples in the dataset\n", - " num_features: Number of features per sample\n", - " num_classes: Number of classes\n", - " \n", - " TODO: Initialize the dataset with synthetic data.\n", - " \n", - " APPROACH:\n", - " 1. Store the configuration parameters\n", - " 2. Generate synthetic data and labels\n", - " 3. 
Make data deterministic for testing\n", - " \n", - " EXAMPLE:\n", - " SimpleDataset(size=100, num_features=4, num_classes=3)\n", - " creates 100 samples with 4 features each, 3 classes\n", - " \n", - " HINTS:\n", - " - Store size, num_features, num_classes as instance variables\n", - " - Use np.random.seed() for reproducible data\n", - " - Generate random data with np.random.randn()\n", - " - Generate random labels with np.random.randint()\n", - " \"\"\"\n", - " self.size = size\n", - " self.num_features = num_features\n", - " self.num_classes = num_classes\n", - " \n", - " # Generate synthetic data (deterministic for testing)\n", - " np.random.seed(42) # For reproducible data\n", - " self.data = np.random.randn(size, num_features).astype(np.float32)\n", - " self.labels = np.random.randint(0, num_classes, size=size)\n", - " \n", - " def __getitem__(self, index: int) -> Tuple[Tensor, Tensor]:\n", - " \"\"\"\n", - " Get a sample by index.\n", - " \n", - " Args:\n", - " index: Index of the sample\n", - " \n", - " Returns:\n", - " Tuple of (data, label) tensors\n", - " \n", - " TODO: Return the sample at the given index.\n", - " \n", - " APPROACH:\n", - " 1. Get data sample from self.data[index]\n", - " 2. Get label from self.labels[index]\n", - " 3. Convert both to Tensors and return as tuple\n", - " \n", - " EXAMPLE:\n", - " dataset[0] returns (Tensor(features), Tensor(label))\n", - " \n", - " HINTS:\n", - " - Use self.data[index] for the data\n", - " - Use self.labels[index] for the label\n", - " - Convert to Tensors: Tensor(data), Tensor(label)\n", - " \"\"\"\n", - " data = self.data[index]\n", - " label = self.labels[index]\n", - " return Tensor(data), Tensor(label)\n", - " \n", - " def __len__(self) -> int:\n", - " \"\"\"\n", - " Get the dataset size.\n", - " \n", - " TODO: Return the dataset size.\n", - " \n", - " APPROACH:\n", - " 1. 
Return self.size\n", - " \n", - " EXAMPLE:\n", - " len(dataset) returns 100 for dataset with 100 samples\n", - " \n", - " HINTS:\n", - " - Simply return self.size\n", - " \"\"\"\n", - " return self.size\n", - " \n", - " def get_num_classes(self) -> int:\n", - " \"\"\"\n", - " Get the number of classes.\n", - " \n", - " TODO: Return the number of classes.\n", - " \n", - " APPROACH:\n", - " 1. Return self.num_classes\n", - " \n", - " EXAMPLE:\n", - " dataset.get_num_classes() returns 3 for 3-class dataset\n", - " \n", - " HINTS:\n", - " - Simply return self.num_classes\n", - " \"\"\"\n", - " return self.num_classes" - ] - }, - { - "cell_type": "markdown", - "id": "b88878e6", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Step 4b: CIFAR-10 Dataset - Real Data for CNNs\n", - "\n", - "### Download and Load Real Computer Vision Data\n", - "Let's implement loading CIFAR-10, the dataset we'll use to achieve our north star goal of 75% accuracy!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "417df9df", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "cifar10", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "def download_cifar10(root: str = \"./data\") -> str:\n", - " \"\"\"\n", - " Download CIFAR-10 dataset.\n", - " \n", - " TODO: Download and extract CIFAR-10.\n", - " \n", - " HINTS:\n", - " - URL: https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz\n", - " - Use urllib.request.urlretrieve()\n", - " - Extract with tarfile\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " os.makedirs(root, exist_ok=True)\n", - " dataset_dir = os.path.join(root, \"cifar-10-batches-py\")\n", - " \n", - " if os.path.exists(dataset_dir):\n", - " print(f\"✅ CIFAR-10 found at {dataset_dir}\")\n", - " return dataset_dir\n", - " \n", - " url = 
\"https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz\"\n", - " tar_path = os.path.join(root, \"cifar-10.tar.gz\")\n", - " \n", - " print(f\"📥 Downloading CIFAR-10 (~170MB)...\")\n", - " urllib.request.urlretrieve(url, tar_path)\n", - " print(\"✅ Downloaded!\")\n", - " \n", - " print(\"📦 Extracting...\")\n", - " with tarfile.open(tar_path, 'r:gz') as tar:\n", - " tar.extractall(root)\n", - " print(\"✅ Ready!\")\n", - " \n", - " return dataset_dir\n", - " ### END SOLUTION\n", - "\n", - "class CIFAR10Dataset(Dataset):\n", - " \"\"\"CIFAR-10 dataset for CNN training.\"\"\"\n", - " \n", - " def __init__(self, root=\"./data\", train=True, download=False):\n", - " \"\"\"Load CIFAR-10 data.\"\"\"\n", - " ### BEGIN SOLUTION\n", - " if download:\n", - " dataset_dir = download_cifar10(root)\n", - " else:\n", - " dataset_dir = os.path.join(root, \"cifar-10-batches-py\")\n", - " \n", - " if train:\n", - " data_list = []\n", - " label_list = []\n", - " for i in range(1, 6):\n", - " with open(os.path.join(dataset_dir, f\"data_batch_{i}\"), 'rb') as f:\n", - " batch = pickle.load(f, encoding='bytes')\n", - " data_list.append(batch[b'data'])\n", - " label_list.extend(batch[b'labels'])\n", - " self.data = np.concatenate(data_list)\n", - " self.labels = np.array(label_list)\n", - " else:\n", - " with open(os.path.join(dataset_dir, \"test_batch\"), 'rb') as f:\n", - " batch = pickle.load(f, encoding='bytes')\n", - " self.data = batch[b'data']\n", - " self.labels = np.array(batch[b'labels'])\n", - " \n", - " # Reshape to (N, 3, 32, 32) and normalize\n", - " self.data = self.data.reshape(-1, 3, 32, 32).astype(np.float32) / 255.0\n", - " print(f\"✅ Loaded {len(self.data):,} images\")\n", - " ### END SOLUTION\n", - " \n", - " def __getitem__(self, idx):\n", - " return Tensor(self.data[idx]), Tensor(self.labels[idx])\n", - " \n", - " def __len__(self):\n", - " return len(self.data)\n", - " \n", - " def get_num_classes(self):\n", - " return 10" - ] - }, - { - "cell_type": "markdown", - 
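Each CIFAR-10 row is a flat vector of 3072 values stored channel-major: indices 0–1023 are red, 1024–2047 green, 2048–3071 blue, and within each channel the pixels run row-major over the 32×32 grid. That layout is exactly why `reshape(-1, 3, 32, 32)` recovers images. The index arithmetic, sketched with plain lists standing in for the NumPy array:

```python
# Map a (channel, row, col) coordinate to its offset in one flat CIFAR-10 row.
def flat_index(channel, row, col, height=32, width=32):
    return channel * height * width + row * width + col

flat = list(range(3072))  # a stand-in for one image's 3072 pixel values

# reshape(-1, 3, 32, 32) places flat[flat_index(c, y, x)] at image[c][y][x]
image = [[[flat[flat_index(c, y, x)] for x in range(32)]
          for y in range(32)] for c in range(3)]

print(image[1][0][0])  # → 1024 (the first green value follows 1024 red values)
```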
"id": "480db551", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### 🧪 Unit Test: SimpleDataset\n", - "\n", - "Let's test your SimpleDataset implementation! This concrete example shows how the Dataset pattern works.\n", - "\n", - "**This is a unit test** - it tests the SimpleDataset class in isolation." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2e73cdb0", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test-simple-dataset-immediate", - "locked": true, - "points": 10, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "# Test SimpleDataset immediately after implementation\n", - "print(\"🔬 Unit Test: SimpleDataset...\")\n", - "\n", - "try:\n", - " # Create dataset\n", - " dataset = SimpleDataset(size=20, num_features=5, num_classes=4)\n", - " \n", - " print(f\"Dataset created: size={len(dataset)}, features={dataset.num_features}, classes={dataset.get_num_classes()}\")\n", - " \n", - " # Test basic properties\n", - " assert len(dataset) == 20, f\"Dataset length should be 20, got {len(dataset)}\"\n", - " assert dataset.get_num_classes() == 4, f\"Should have 4 classes, got {dataset.get_num_classes()}\"\n", - " print(\"✅ SimpleDataset basic properties work correctly\")\n", - " \n", - " # Test sample access\n", - " data, label = dataset[0]\n", - " assert isinstance(data, Tensor), \"Data should be a Tensor\"\n", - " assert isinstance(label, Tensor), \"Label should be a Tensor\"\n", - " assert data.shape == (5,), f\"Data shape should be (5,), got {data.shape}\"\n", - " assert label.shape == (), f\"Label shape should be (), got {label.shape}\"\n", - " print(\"✅ SimpleDataset sample access works correctly\")\n", - " \n", - " # Test sample shape\n", - " sample_shape = dataset.get_sample_shape()\n", - " assert sample_shape == (5,), f\"Sample shape should be (5,), got {sample_shape}\"\n", - " print(\"✅ SimpleDataset get_sample_shape works correctly\")\n", - " \n", - 
" # Test multiple samples\n", - "    for i in range(5):\n", - "        data, label = dataset[i]\n", - "        assert data.shape == (5,), f\"Data shape should be (5,) for sample {i}, got {data.shape}\"\n", - "        assert 0 <= label.data < 4, f\"Label should be in [0, 3] for sample {i}, got {label.data}\"\n", - "    print(\"✅ SimpleDataset multiple samples work correctly\")\n", - "    \n", - "    # Test deterministic data (same seed should give same data)\n", - "    dataset2 = SimpleDataset(size=20, num_features=5, num_classes=4)\n", - "    data1, label1 = dataset[0]\n", - "    data2, label2 = dataset2[0]\n", - "    assert np.array_equal(data1.data, data2.data), \"Data should be deterministic\"\n", - "    assert np.array_equal(label1.data, label2.data), \"Labels should be deterministic\"\n", - "    print(\"✅ SimpleDataset data is deterministic\")\n", - "\n", - "except Exception as e:\n", - "    print(f\"❌ SimpleDataset test failed: {e}\")\n", - "    raise\n", - "\n", - "# Show the SimpleDataset behavior\n", - "print(\"🎯 SimpleDataset behavior:\")\n", - "print(\"  Generates synthetic data for testing\")\n", - "print(\"  Implements complete Dataset interface\")\n", - "print(\"  Provides deterministic data for reproducibility\")\n", - "print(\"📈 Progress: Dataset interface ✓, DataLoader ✓, SimpleDataset ✓\")" - ] - }, - { - "cell_type": "markdown", - "id": "243297c6", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## Step 5: Comprehensive Test - Complete Data Pipeline\n", - "\n", - "### Real-World Data Pipeline Applications\n", - "Let's test our data loading components in realistic scenarios:\n", - "\n", - "#### **Training Pipeline**\n", - "```python\n", - "# The standard ML training pattern\n", - "dataset = SimpleDataset(size=1000, num_features=10, num_classes=5)\n", - "dataloader = DataLoader(dataset, batch_size=32, shuffle=True)\n", - "\n", - "for epoch in range(num_epochs):\n", - "    for batch_data, batch_labels in dataloader:\n", - "        # Train model on batch\n", - "        pass\n", - "```\n", - "\n", - "#### **Validation 
Pipeline**\n", - "```python\n", - "# Validation without shuffling\n", - "val_loader = DataLoader(val_dataset, batch_size=64, shuffle=False)\n", - "\n", - "for batch_data, batch_labels in val_loader:\n", - " # Evaluate model on batch\n", - " pass\n", - "```\n", - "\n", - "#### **Data Analysis Pipeline**\n", - "```python\n", - "# Systematic data exploration\n", - "for batch_data, batch_labels in dataloader:\n", - " # Analyze batch statistics\n", - " pass\n", - "```\n", - "\n", - "This comprehensive test ensures our data loading components work together for real ML applications!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c994c580", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test-comprehensive", - "locked": true, - "points": 15, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "# Comprehensive test - complete data pipeline applications\n", - "print(\"🔬 Comprehensive Test: Complete Data Pipeline...\")\n", - "\n", - "try:\n", - " # Test 1: Training Data Pipeline\n", - " print(\"\\n1. 
Training Data Pipeline Test:\")\n", - " \n", - " # Create training dataset\n", - " train_dataset = SimpleDataset(size=100, num_features=8, num_classes=5)\n", - " train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)\n", - " \n", - " # Simulate training epoch\n", - " epoch_samples = 0\n", - " epoch_batches = 0\n", - " \n", - " for batch_data, batch_labels in train_loader:\n", - " epoch_batches += 1\n", - " epoch_samples += batch_data.shape[0]\n", - " \n", - " # Verify batch properties\n", - " assert batch_data.shape[1] == 8, f\"Features should be 8, got {batch_data.shape[1]}\"\n", - " assert len(batch_labels.shape) == 1, f\"Labels should be 1D, got shape {batch_labels.shape}\"\n", - " assert isinstance(batch_data, Tensor), \"Batch data should be Tensor\"\n", - " assert isinstance(batch_labels, Tensor), \"Batch labels should be Tensor\"\n", - " \n", - " assert epoch_samples == 100, f\"Should process 100 samples, got {epoch_samples}\"\n", - " expected_batches = (100 + 16 - 1) // 16\n", - " assert epoch_batches == expected_batches, f\"Should have {expected_batches} batches, got {epoch_batches}\"\n", - " print(\"✅ Training pipeline works correctly\")\n", - " \n", - " # Test 2: Validation Data Pipeline\n", - " print(\"\\n2. 
Validation Data Pipeline Test:\")\n", - "    \n", - "    # Create validation dataset (no shuffling)\n", - "    val_dataset = SimpleDataset(size=50, num_features=8, num_classes=5)\n", - "    val_loader = DataLoader(val_dataset, batch_size=10, shuffle=False)\n", - "    \n", - "    # Simulate validation\n", - "    val_samples = 0\n", - "    val_batches = 0\n", - "    \n", - "    for batch_data, batch_labels in val_loader:\n", - "        val_batches += 1\n", - "        val_samples += batch_data.shape[0]\n", - "        \n", - "        # Verify consistent batch processing\n", - "        assert batch_data.shape[1] == 8, \"Validation features should match training\"\n", - "        assert len(batch_labels.shape) == 1, \"Validation labels should be 1D\"\n", - "    \n", - "    assert val_samples == 50, f\"Should process 50 validation samples, got {val_samples}\"\n", - "    assert val_batches == 5, f\"Should have 5 validation batches, got {val_batches}\"\n", - "    print(\"✅ Validation pipeline works correctly\")\n", - "    \n", - "    # Test 3: Different Dataset Configurations\n", - "    print(\"\\n3. Dataset Configuration Test:\")\n", - "    \n", - "    # Test different configurations\n", - "    configs = [\n", - "        (200, 4, 3),    # Medium dataset\n", - "        (50, 12, 10),   # High-dimensional features\n", - "        (1000, 2, 2),   # Large dataset, simple features\n", - "    ]\n", - "    \n", - "    for size, features, classes in configs:\n", - "        dataset = SimpleDataset(size=size, num_features=features, num_classes=classes)\n", - "        loader = DataLoader(dataset, batch_size=32, shuffle=True)\n", - "        \n", - "        # Test one batch\n", - "        batch_data, batch_labels = next(iter(loader))\n", - "        \n", - "        assert batch_data.shape[1] == features, f\"Features mismatch for config {(size, features, classes)}\"\n", - "        assert len(dataset) == size, f\"Size mismatch for config {(size, features, classes)}\"\n", - "        assert dataset.get_num_classes() == classes, f\"Classes mismatch for config {(size, features, classes)}\"\n", - "    \n", - "    print(\"✅ Different dataset configurations work correctly\")\n", - "    \n", - "    # Test 4: Memory Efficiency Simulation\n", - "    print(\"\\n4. 
Memory Efficiency Test:\")\n", - " \n", - " # Create larger dataset to test memory efficiency\n", - " large_dataset = SimpleDataset(size=500, num_features=20, num_classes=10)\n", - " large_loader = DataLoader(large_dataset, batch_size=50, shuffle=True)\n", - " \n", - " # Process all batches to ensure memory efficiency\n", - " processed_samples = 0\n", - " max_batch_size = 0\n", - " \n", - " for batch_data, batch_labels in large_loader:\n", - " processed_samples += batch_data.shape[0]\n", - " max_batch_size = max(max_batch_size, batch_data.shape[0])\n", - " \n", - " # Verify memory usage stays reasonable\n", - " assert batch_data.shape[0] <= 50, f\"Batch size should not exceed 50, got {batch_data.shape[0]}\"\n", - " \n", - " assert processed_samples == 500, f\"Should process all 500 samples, got {processed_samples}\"\n", - " print(\"✅ Memory efficiency works correctly\")\n", - " \n", - " # Test 5: Multi-Epoch Training Simulation\n", - " print(\"\\n5. Multi-Epoch Training Test:\")\n", - " \n", - " # Simulate multiple epochs\n", - " dataset = SimpleDataset(size=60, num_features=6, num_classes=3)\n", - " loader = DataLoader(dataset, batch_size=20, shuffle=True)\n", - " \n", - " for epoch in range(3):\n", - " epoch_samples = 0\n", - " for batch_data, batch_labels in loader:\n", - " epoch_samples += batch_data.shape[0]\n", - " \n", - " # Verify shapes remain consistent across epochs\n", - " assert batch_data.shape[1] == 6, f\"Features should be 6 in epoch {epoch}\"\n", - " assert len(batch_labels.shape) == 1, f\"Labels should be 1D in epoch {epoch}\"\n", - " \n", - " assert epoch_samples == 60, f\"Should process 60 samples in epoch {epoch}, got {epoch_samples}\"\n", - " \n", - " print(\"✅ Multi-epoch training works correctly\")\n", - " \n", - " print(\"\\n🎉 Comprehensive test passed! 
Your data pipeline works correctly for:\")\n", - "    print(\"  • Large-scale dataset handling\")\n", - "    print(\"  • Batch processing across batch sizes\")\n", - "    print(\"  • Shuffling and sampling strategies\")\n", - "    print(\"  • Memory-efficient data loading\")\n", - "    print(\"  • Complete training pipeline integration\")\n", - "    print(\"📈 Progress: Production-ready data pipeline ✓\")\n", - "    \n", - "except Exception as e:\n", - "    print(f\"❌ Comprehensive test failed: {e}\")\n", - "    raise\n", - "\n", - "print(\"📈 Final Progress: Complete data pipeline ready for production ML!\")" - ] - }, - { - "cell_type": "markdown", - "id": "54d090c1", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Unit Test: Dataset Interface Implementation\n", - "\n", - "This test validates the abstract Dataset interface, ensuring proper inheritance, method implementation, and interface compliance for creating custom datasets in the TinyTorch data loading pipeline." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "62c32031", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "def test_unit_dataset_interface():\n", - "    \"\"\"Unit test for the Dataset abstract interface implementation.\"\"\"\n", - "    print(\"🔬 Unit Test: Dataset Interface...\")\n", - "    \n", - "    # Test TestDataset implementation\n", - "    dataset = TestDataset(size=5)\n", - "    \n", - "    # Test basic interface\n", - "    assert len(dataset) == 5, \"Dataset should have correct length\"\n", - "    \n", - "    # Test data access\n", - "    sample, label = dataset[0]\n", - "    assert isinstance(sample, Tensor), \"Sample should be Tensor\"\n", - "    assert isinstance(label, Tensor), \"Label should be Tensor\"\n", - "    \n", - "    print(\"✅ Dataset interface works correctly\")" - ] - }, - { - "cell_type": "markdown", - "id": "cbbce516", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Unit Test: DataLoader 
Implementation\n", - "\n", - "This test validates the DataLoader class functionality, ensuring proper batch creation, iteration capability, and integration with datasets for efficient data loading in machine learning training pipelines." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a0025080", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "def test_unit_dataloader():\n", - "    \"\"\"Unit test for the DataLoader implementation.\"\"\"\n", - "    print(\"🔬 Unit Test: DataLoader...\")\n", - "    \n", - "    # Test DataLoader with TestDataset\n", - "    dataset = TestDataset(size=10)\n", - "    loader = DataLoader(dataset, batch_size=3, shuffle=False)\n", - "    \n", - "    # Test iteration\n", - "    batches = list(loader)\n", - "    assert len(batches) == 4, \"10 samples with batch_size=3 should yield 4 batches\"\n", - "    \n", - "    # Test batch shapes\n", - "    batch_data, batch_labels = batches[0]\n", - "    assert batch_data.shape[0] <= 3, \"Batch size should be <= 3\"\n", - "    assert batch_labels.shape[0] <= 3, \"Batch labels should match data\"\n", - "    \n", - "    print(\"✅ DataLoader works correctly\")" - ] - }, - { - "cell_type": "markdown", - "id": "dfc685e4", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Unit Test: Simple Dataset Implementation\n", - "\n", - "This test validates the SimpleDataset class, ensuring it can handle real-world data scenarios including proper data storage, indexing, and compatibility with the DataLoader for practical machine learning workflows."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0cc885b1", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "def test_unit_simple_dataset():\n", - " \"\"\"Unit test for the SimpleDataset implementation.\"\"\"\n", - " print(\"🔬 Unit Test: SimpleDataset...\")\n", - " \n", - " # Test SimpleDataset\n", - " dataset = SimpleDataset(size=100, num_features=4, num_classes=3)\n", - " \n", - " # Test properties\n", - " assert len(dataset) == 100, \"Dataset should have correct size\"\n", - " assert dataset.get_num_classes() == 3, \"Should have correct number of classes\"\n", - " \n", - " # Test data access\n", - " sample, label = dataset[0]\n", - " assert sample.shape == (4,), \"Sample should have correct features\"\n", - " assert 0 <= label.data < 3, \"Label should be valid class\"\n", - " \n", - " print(\"✅ SimpleDataset works correctly\")" - ] - }, - { - "cell_type": "markdown", - "id": "4bd59540", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Unit Test: Complete Data Pipeline Integration\n", - "\n", - "This comprehensive test validates the entire data pipeline from dataset creation through DataLoader batching, ensuring all components work together seamlessly for end-to-end machine learning data processing workflows." 
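SimpleDataset's reproducibility comes entirely from seeding the random generator before generating the data. A minimal sketch of that pattern, using the standard-library `random` module as a stand-in for `np.random` (the `make_synthetic` helper is hypothetical, not TinyTorch API):

```python
import random

def make_synthetic(size=20, num_features=5, num_classes=4, seed=42):
    """Generate (data, labels) deterministically from a fixed seed."""
    rng = random.Random(seed)  # stand-in for np.random.seed(42)
    data = [[rng.gauss(0.0, 1.0) for _ in range(num_features)]
            for _ in range(size)]
    labels = [rng.randrange(num_classes) for _ in range(size)]
    return data, labels

d1, l1 = make_synthetic()
d2, l2 = make_synthetic()
print(d1 == d2 and l1 == l2)  # → True: same seed, identical dataset
# A different seed gives a different (but equally reproducible) dataset:
d3, l3 = make_synthetic(seed=7)
```

This is why two independently constructed `SimpleDataset` instances return byte-identical samples, which the unit tests above rely on.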
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9c63e6cd", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "def test_unit_dataloader_pipeline():\n", - " \"\"\"Comprehensive unit test for the complete data pipeline.\"\"\"\n", - " print(\"🔬 Comprehensive Test: Data Pipeline...\")\n", - " \n", - " # Test complete pipeline\n", - " dataset = SimpleDataset(size=50, num_features=10, num_classes=5)\n", - " loader = DataLoader(dataset, batch_size=8, shuffle=True)\n", - " \n", - " total_samples = 0\n", - " for batch_data, batch_labels in loader:\n", - " assert isinstance(batch_data, Tensor), \"Batch data should be Tensor\"\n", - " assert isinstance(batch_labels, Tensor), \"Batch labels should be Tensor\"\n", - " assert batch_data.shape[1] == 10, \"Features should be correct\"\n", - " total_samples += batch_data.shape[0]\n", - " \n", - " assert total_samples == 50, \"Should process all samples\"\n", - " \n", - " print(\"✅ Data pipeline integration works correctly\")" - ] - }, - { - "cell_type": "markdown", - "id": "63acc83f", - "metadata": { - "lines_to_next_cell": 0 - }, - "source": [] - }, - { - "cell_type": "markdown", - "id": "307992df", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🧪 Module Testing\n", - "\n", - "Time to test your implementation! This section uses TinyTorch's standardized testing framework to ensure your implementation works correctly.\n", - "\n", - "**This testing section is locked** - it provides consistent feedback across all modules and cannot be modified." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "cd73bc81", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "standardized-testing", - "locked": true, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "# =============================================================================\n", - "# STANDARDIZED MODULE TESTING - DO NOT MODIFY\n", - "# This cell is locked to ensure consistent testing across all TinyTorch modules\n", - "# =============================================================================" - ] - }, - { - "cell_type": "markdown", - "id": "3171e7ee", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## 🔬 Integration Test: DataLoader with Tensors" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "924540fd", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "def test_module_dataloader_tensor_yield():\n", - " \"\"\"\n", - " Integration test for the DataLoader and Tensor classes.\n", - " \n", - " Tests that the DataLoader correctly yields batches of Tensors.\n", - " \"\"\"\n", - " print(\"🔬 Running Integration Test: DataLoader with Tensors...\")\n", - "\n", - " # 1. Create a simple dataset\n", - " dataset = SimpleDataset(size=50, num_features=8, num_classes=4)\n", - "\n", - " # 2. Create a DataLoader\n", - " dataloader = DataLoader(dataset, batch_size=10, shuffle=False)\n", - "\n", - " # 3. Get one batch from the dataloader\n", - " data_batch, labels_batch = next(iter(dataloader))\n", - "\n", - " # 4. 
Assert the batch contents are correct\n", - " assert isinstance(data_batch, Tensor), \"Data batch should be a Tensor\"\n", - " assert data_batch.shape == (10, 8), f\"Expected data shape (10, 8), but got {data_batch.shape}\"\n", - " \n", - " assert isinstance(labels_batch, Tensor), \"Labels batch should be a Tensor\"\n", - " assert labels_batch.shape == (10,), f\"Expected labels shape (10,), but got {labels_batch.shape}\"\n", - "\n", - " print(\"✅ Integration Test Passed: DataLoader correctly yields batches of Tensors.\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "b8b23ef0", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 📊 ML Systems: I/O Pipeline Optimization & Bottleneck Analysis\n", - "\n", - "Now that you have data loading systems, let's develop **I/O optimization skills**. This section teaches you to identify and fix data loading bottlenecks that can dramatically slow down training in production systems.\n", - "\n", - "### **Learning Outcome**: *\"I can identify and fix I/O bottlenecks that limit training speed\"*\n", - "\n", - "---\n", - "\n", - "## Data Pipeline Profiler (Medium Guided Implementation)\n", - "\n", - "As an ML systems engineer, you need to ensure data loading doesn't become the bottleneck. Training GPUs can process data much faster than traditional storage can provide it. Let's build tools to measure and optimize data pipeline performance." 
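The bottleneck question above reduces to simple arithmetic: the loader delivers `batch_size / avg_batch_time` samples per second, and it must match or exceed the rate at which the accelerator consumes samples. A back-of-the-envelope sketch (all numbers below are made up for illustration):

```python
def loader_keeps_up(batch_size, avg_load_time_s, gpu_samples_per_s):
    """True if the data pipeline can feed the accelerator without stalls."""
    loader_throughput = batch_size / avg_load_time_s  # samples/second delivered
    return loader_throughput >= gpu_samples_per_s

# A loader delivering 32-sample batches in 5 ms supplies 6,400 samples/s...
print(loader_keeps_up(32, 0.005, 4000))   # → True: the GPU stays busy
# ...but the same loader at 20 ms per batch supplies only 1,600 samples/s.
print(loader_keeps_up(32, 0.020, 4000))   # → False: the GPU waits on I/O
```

When the check fails, the usual remedies are larger batches, prefetching, or parallel data-loading workers rather than a faster model.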
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3ac8f7b9", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "import time\n", - "import os\n", - "import threading\n", - "from concurrent.futures import ThreadPoolExecutor\n", - "\n", - "class DataPipelineProfiler:\n", - " \"\"\"\n", - " I/O pipeline profiling toolkit for data loading systems.\n", - " \n", - " Helps ML engineers identify bottlenecks in data loading pipelines\n", - " and optimize throughput for high-performance training systems.\n", - " \"\"\"\n", - " \n", - " def __init__(self):\n", - " self.profiling_history = []\n", - " self.bottleneck_threshold = 0.1 # seconds per batch\n", - " \n", - " def time_dataloader_iteration(self, dataloader, num_batches=10):\n", - " \"\"\"\n", - " Time how long it takes to iterate through DataLoader batches.\n", - " \n", - " TODO: Implement DataLoader timing analysis.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Record start time\n", - " 2. Iterate through specified number of batches\n", - " 3. Time each batch loading\n", - " 4. Calculate statistics (total, average, min, max times)\n", - " 5. Identify if data loading is a bottleneck\n", - " 6. 
Return comprehensive timing analysis\n", - " \n", - " EXAMPLE:\n", - " profiler = DataPipelineProfiler()\n", - " timing = profiler.time_dataloader_iteration(my_dataloader, 20)\n", - " print(f\"Avg batch time: {timing['avg_batch_time']:.3f}s\")\n", - " print(f\"Bottleneck: {timing['is_bottleneck']}\")\n", - " \n", - " LEARNING CONNECTIONS:\n", - " - **Production Optimization**: Fast GPUs often wait for slow data loading\n", - " - **System Bottlenecks**: Data loading can limit training speed more than model complexity\n", - " - **Resource Planning**: Understanding I/O vs compute trade-offs for hardware selection\n", - " - **Pipeline Tuning**: Multi-worker data loading and prefetching strategies\n", - " \n", - " HINTS:\n", - " - Use enumerate(dataloader) to get batches\n", - " - Time each batch: start = time.time(), batch = next(iter), end = time.time()\n", - " - Break after num_batches to avoid processing entire dataset\n", - " - Calculate: total_time, avg_time, min_time, max_time\n", - " - Bottleneck if avg_time > self.bottleneck_threshold\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " batch_times = []\n", - " total_start = time.time()\n", - " \n", - " try:\n", - " dataloader_iter = iter(dataloader)\n", - " for i in range(num_batches):\n", - " batch_start = time.time()\n", - " try:\n", - " batch = next(dataloader_iter)\n", - " batch_end = time.time()\n", - " batch_time = batch_end - batch_start\n", - " batch_times.append(batch_time)\n", - " except StopIteration:\n", - " print(f\" DataLoader exhausted after {i} batches\")\n", - " break\n", - " except Exception as e:\n", - " print(f\" Error during iteration: {e}\")\n", - " return {'error': str(e)}\n", - " \n", - " total_end = time.time()\n", - " total_time = total_end - total_start\n", - " \n", - " if batch_times:\n", - " avg_batch_time = sum(batch_times) / len(batch_times)\n", - " min_batch_time = min(batch_times)\n", - " max_batch_time = max(batch_times)\n", - " \n", - " # Check if data loading is a 
bottleneck\n", - " is_bottleneck = avg_batch_time > self.bottleneck_threshold\n", - " \n", - " # Calculate throughput\n", - " batches_per_second = len(batch_times) / total_time if total_time > 0 else 0\n", - " \n", - " return {\n", - " 'total_time': total_time,\n", - " 'num_batches': len(batch_times),\n", - " 'avg_batch_time': avg_batch_time,\n", - " 'min_batch_time': min_batch_time,\n", - " 'max_batch_time': max_batch_time,\n", - " 'batches_per_second': batches_per_second,\n", - " 'is_bottleneck': is_bottleneck,\n", - " 'bottleneck_threshold': self.bottleneck_threshold\n", - " }\n", - " else:\n", - " return {'error': 'No batches processed'}\n", - " ### END SOLUTION\n", - " \n", - " def analyze_batch_size_scaling(self, dataset, batch_sizes=[16, 32, 64, 128]):\n", - " \"\"\"\n", - " Analyze how batch size affects data loading performance.\n", - " \n", - " TODO: Implement batch size scaling analysis.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. For each batch size, create a DataLoader\n", - " 2. Time the data loading for each configuration\n", - " 3. Calculate throughput (samples/second) for each\n", - " 4. Identify optimal batch size for I/O performance\n", - " 5. 
Return scaling analysis with recommendations\n", - " \n", - " EXAMPLE:\n", - " profiler = DataPipelineProfiler()\n", - " analysis = profiler.analyze_batch_size_scaling(my_dataset, [16, 32, 64])\n", - " print(f\"Optimal batch size: {analysis['optimal_batch_size']}\")\n", - " \n", - " LEARNING CONNECTIONS:\n", - " - **Memory vs Throughput**: Larger batches improve throughput but consume more memory\n", - " - **Hardware Optimization**: Optimal batch size depends on GPU memory and compute units\n", - " - **Training Dynamics**: Batch size affects gradient noise and convergence behavior\n", - " - **Production Scaling**: Understanding batch size impact on serving latency and cost\n", - " \n", - " HINTS:\n", - " - Create DataLoader: DataLoader(dataset, batch_size=bs, shuffle=False)\n", - " - Time with self.time_dataloader_iteration()\n", - " - Calculate: samples_per_second = batch_size * batches_per_second\n", - " - Find batch size with highest samples/second\n", - " - Consider memory constraints vs throughput\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " scaling_results = []\n", - " \n", - " for batch_size in batch_sizes:\n", - " print(f\" Testing batch size {batch_size}...\")\n", - " \n", - " # Create DataLoader with current batch size\n", - " dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False)\n", - " \n", - " # Time the data loading\n", - " timing_result = self.time_dataloader_iteration(dataloader, num_batches=min(10, len(dataset)//batch_size))\n", - " \n", - " if 'error' not in timing_result:\n", - " # Calculate throughput metrics\n", - " samples_per_second = batch_size * timing_result['batches_per_second']\n", - " \n", - " result = {\n", - " 'batch_size': batch_size,\n", - " 'avg_batch_time': timing_result['avg_batch_time'],\n", - " 'batches_per_second': timing_result['batches_per_second'],\n", - " 'samples_per_second': samples_per_second,\n", - " 'is_bottleneck': timing_result['is_bottleneck']\n", - " }\n", - " 
scaling_results.append(result)\n", - " \n", - " # Find optimal batch size (highest throughput)\n", - " if scaling_results:\n", - " optimal = max(scaling_results, key=lambda x: x['samples_per_second'])\n", - " optimal_batch_size = optimal['batch_size']\n", - " \n", - " return {\n", - " 'scaling_results': scaling_results,\n", - " 'optimal_batch_size': optimal_batch_size,\n", - " 'max_throughput': optimal['samples_per_second']\n", - " }\n", - " else:\n", - " return {'error': 'No valid results obtained'}\n", - " ### END SOLUTION\n", - " \n", - " def compare_io_strategies(self, dataset, strategies=['sequential', 'shuffled']):\n", - " \"\"\"\n", - " Compare different I/O strategies for data loading performance.\n", - " \n", - " This function is PROVIDED to demonstrate I/O optimization analysis.\n", - " Students use it to understand different data loading patterns.\n", - " \"\"\"\n", - " print(\"📊 I/O STRATEGY COMPARISON\")\n", - " print(\"=\" * 40)\n", - " \n", - " results = {}\n", - " batch_size = 32 # Standard batch size for comparison\n", - " \n", - " for strategy in strategies:\n", - " print(f\"\\n🔍 Testing {strategy.upper()} strategy...\")\n", - " \n", - " if strategy == 'sequential':\n", - " dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False)\n", - " elif strategy == 'shuffled':\n", - " dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)\n", - " else:\n", - " print(f\" Unknown strategy: {strategy}\")\n", - " continue\n", - " \n", - " # Time the strategy\n", - " timing_result = self.time_dataloader_iteration(dataloader, num_batches=20)\n", - " \n", - " if 'error' not in timing_result:\n", - " results[strategy] = timing_result\n", - " print(f\" Avg batch time: {timing_result['avg_batch_time']:.3f}s\")\n", - " print(f\" Throughput: {timing_result['batches_per_second']:.1f} batches/sec\")\n", - " print(f\" Bottleneck: {'Yes' if timing_result['is_bottleneck'] else 'No'}\")\n", - " \n", - " # Compare strategies\n", - " if len(results) 
>= 2:\n", - " fastest = min(results.items(), key=lambda x: x[1]['avg_batch_time'])\n", - " slowest = max(results.items(), key=lambda x: x[1]['avg_batch_time'])\n", - " \n", - " speedup = slowest[1]['avg_batch_time'] / fastest[1]['avg_batch_time']\n", - " \n", - " print(f\"\\n🎯 STRATEGY ANALYSIS:\")\n", - " print(f\" Fastest: {fastest[0]} ({fastest[1]['avg_batch_time']:.3f}s)\")\n", - " print(f\" Slowest: {slowest[0]} ({slowest[1]['avg_batch_time']:.3f}s)\")\n", - " print(f\" Speedup: {speedup:.1f}x\")\n", - " \n", - " return results\n", - " \n", - " def simulate_compute_vs_io_balance(self, dataloader, simulated_compute_time=0.05):\n", - " \"\"\"\n", - " Simulate the balance between data loading and compute time.\n", - " \n", - " This function is PROVIDED to show I/O vs compute analysis.\n", - " Students use it to understand when I/O becomes a bottleneck.\n", - " \"\"\"\n", - " print(\"⚖️ COMPUTE vs I/O BALANCE ANALYSIS\")\n", - " print(\"=\" * 45)\n", - " \n", - " print(f\"Simulated compute time per batch: {simulated_compute_time:.3f}s\")\n", - " print(f\"(This represents GPU processing time)\")\n", - " \n", - " # Time data loading\n", - " io_timing = self.time_dataloader_iteration(dataloader, num_batches=15)\n", - " \n", - " if 'error' in io_timing:\n", - " print(f\"Error in timing: {io_timing['error']}\")\n", - " return\n", - " \n", - " avg_io_time = io_timing['avg_batch_time']\n", - " \n", - " print(f\"\\n📊 TIMING ANALYSIS:\")\n", - " print(f\" Data loading time: {avg_io_time:.3f}s per batch\")\n", - " print(f\" Simulated compute: {simulated_compute_time:.3f}s per batch\")\n", - " \n", - " # Determine bottleneck\n", - " if avg_io_time > simulated_compute_time:\n", - " bottleneck = \"I/O\"\n", - " utilization = simulated_compute_time / avg_io_time * 100\n", - " print(f\"\\n🚨 BOTTLENECK: {bottleneck}\")\n", - " print(f\" GPU utilization: {utilization:.1f}%\")\n", - " print(f\" GPU waiting for data: {avg_io_time - simulated_compute_time:.3f}s per batch\")\n", - " 
else:\n", - " bottleneck = \"Compute\"\n", - " utilization = avg_io_time / simulated_compute_time * 100\n", - " print(f\"\\n✅ BOTTLENECK: {bottleneck}\")\n", - " print(f\" I/O utilization: {utilization:.1f}%\")\n", - " print(f\" I/O waiting for GPU: {simulated_compute_time - avg_io_time:.3f}s per batch\")\n", - " \n", - " # Calculate training impact\n", - " total_cycle_time = max(avg_io_time, simulated_compute_time)\n", - " efficiency = min(avg_io_time, simulated_compute_time) / total_cycle_time * 100\n", - " \n", - " print(f\"\\n🎯 TRAINING IMPACT:\")\n", - " print(f\" Pipeline efficiency: {efficiency:.1f}%\")\n", - " print(f\" Total cycle time: {total_cycle_time:.3f}s\")\n", - " \n", - " if bottleneck == \"I/O\":\n", - " print(f\" 💡 Recommendation: Optimize data loading\")\n", - " print(f\" - Increase batch size\")\n", - " print(f\" - Use data prefetching\")\n", - " print(f\" - Faster storage (SSD vs HDD)\")\n", - " else:\n", - " print(f\" 💡 Recommendation: I/O is well optimized\")\n", - " print(f\" - Consider larger models or batch sizes\")\n", - " print(f\" - Focus on compute optimization\")\n", - " \n", - " return {\n", - " 'io_time': avg_io_time,\n", - " 'compute_time': simulated_compute_time,\n", - " 'bottleneck': bottleneck,\n", - " 'efficiency': efficiency,\n", - " 'total_cycle_time': total_cycle_time\n", - " }" - ] - }, - { - "cell_type": "markdown", - "id": "ad2c8bd8", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### 🎯 Learning Activity 1: DataLoader Performance Profiling (Medium Guided Implementation)\n", - "\n", - "**Goal**: Learn to measure data loading performance and identify I/O bottlenecks that can slow down training.\n", - "\n", - "Complete the missing implementations in the `DataPipelineProfiler` class above, then use your profiler to analyze data loading performance." 
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "9b50e007",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Initialize the data pipeline profiler\n",
- "profiler = DataPipelineProfiler()\n",
- "\n",
- "# Only run tests when module is executed directly\n",
- "if __name__ == '__main__':\n",
- "    print(\"📊 DATA PIPELINE PERFORMANCE ANALYSIS\")\n",
- "    print(\"=\" * 50)\n",
- "\n",
- "    # Create test dataset and dataloader\n",
- "    test_dataset = TensorDataset([\n",
- "        Tensor(np.random.randn(100)) for _ in range(1000)  # 1000 samples\n",
- "    ], [\n",
- "        Tensor([i % 10]) for i in range(1000)  # Labels\n",
- "    ])\n",
- "\n",
- "    # Test 1: Basic DataLoader timing\n",
- "    print(\"⏱️ Basic DataLoader Timing:\")\n",
- "    basic_dataloader = DataLoader(test_dataset, batch_size=32, shuffle=False)\n",
- "\n",
- "    # Students use their implemented timing function\n",
- "    timing_result = profiler.time_dataloader_iteration(basic_dataloader, num_batches=25)\n",
- "\n",
- "    if 'error' not in timing_result:\n",
- "        print(f\"   Average batch time: {timing_result['avg_batch_time']:.3f}s\")\n",
- "        print(f\"   Throughput: {timing_result['batches_per_second']:.1f} batches/sec\")\n",
- "        print(f\"   Bottleneck detected: {'Yes' if timing_result['is_bottleneck'] else 'No'}\")\n",
- "\n",
- "        # Calculate samples per second\n",
- "        samples_per_sec = 32 * timing_result['batches_per_second']\n",
- "        print(f\"   Samples/second: {samples_per_sec:.1f}\")\n",
- "    else:\n",
- "        print(f\"   Error: {timing_result['error']}\")\n",
- "\n",
- "    # Test 2: Batch size scaling analysis\n",
- "    print(f\"\\n📈 Batch Size Scaling Analysis:\")\n",
- "\n",
- "    # Students use their implemented scaling analysis\n",
- "    scaling_analysis = profiler.analyze_batch_size_scaling(test_dataset, [16, 32, 64, 128])\n",
- "\n",
- "    if 'error' not in scaling_analysis:\n",
- "        print(f\"   Optimal batch size: {scaling_analysis['optimal_batch_size']}\")\n",
- "        print(f\"   Max throughput: {scaling_analysis['max_throughput']:.1f} samples/sec\")\n",
- "\n",
- "        print(f\"\\n   📊 Detailed Results:\")\n",
- "        for result in scaling_analysis['scaling_results']:\n",
- "            print(f\"      Batch {result['batch_size']:3d}: {result['samples_per_second']:6.1f} samples/sec\")\n",
- "    else:\n",
- "        print(f\"   Error: {scaling_analysis['error']}\")\n",
- "\n",
- "    print(f\"\\n💡 I/O PERFORMANCE INSIGHTS:\")\n",
- "    print(f\"   - Larger batches often improve throughput (better amortization)\")\n",
- "    print(f\"   - But memory constraints limit maximum batch size\")\n",
- "    print(f\"   - Sweet spot balances throughput vs memory usage\")\n",
- "    print(f\"   - Real systems: GPU memory determines practical limits\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "92ef4498",
- "metadata": {
- "cell_marker": "\"\"\""
- },
- "source": [
- "### 🎯 Learning Activity 2: Production I/O Optimization Analysis (Review & Understand)\n",
- "\n",
- "**Goal**: Understand how I/O performance affects real training systems and learn optimization strategies used in production."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "74695654", - "metadata": {}, - "outputs": [], - "source": [ - "# Compare different I/O strategies\n", - "io_comparison = profiler.compare_io_strategies(test_dataset, ['sequential', 'shuffled'])\n", - "\n", - "# Simulate compute vs I/O balance with different scenarios\n", - "print(f\"\\n⚖️ COMPUTE vs I/O SCENARIOS:\")\n", - "print(f\"=\" * 40)\n", - "\n", - "# Test different compute scenarios\n", - "compute_scenarios = [\n", - " (0.01, \"Fast GPU (V100/A100)\"),\n", - " (0.05, \"Medium GPU (RTX 3080)\"),\n", - " (0.1, \"CPU-only training\"),\n", - " (0.2, \"Complex model/large batch\")\n", - "]\n", - "\n", - "sample_dataloader = DataLoader(test_dataset, batch_size=64, shuffle=False)\n", - "\n", - "for compute_time, scenario_name in compute_scenarios:\n", - " print(f\"\\n🖥️ {scenario_name}:\")\n", - " balance_analysis = profiler.simulate_compute_vs_io_balance(sample_dataloader, compute_time)\n", - "\n", - "print(f\"\\n🎯 PRODUCTION I/O OPTIMIZATION LESSONS:\")\n", - "print(f\"=\" * 50)\n", - "\n", - "print(f\"\\n1. 📊 I/O BOTTLENECK IDENTIFICATION:\")\n", - "print(f\" - Fast GPUs often bottlenecked by data loading\")\n", - "print(f\" - CPU training rarely I/O bottlenecked\")\n", - "print(f\" - Modern GPUs process data faster than storage provides it\")\n", - "\n", - "print(f\"\\n2. 🚀 OPTIMIZATION STRATEGIES:\")\n", - "print(f\" - Data prefetching: Load next batch while GPU computes\")\n", - "print(f\" - Parallel workers: Multiple threads/processes for loading\")\n", - "print(f\" - Faster storage: NVMe SSD vs SATA vs network storage\")\n", - "print(f\" - Data caching: Keep frequently used data in memory\")\n", - "\n", - "print(f\"\\n3. 
🏗️ ARCHITECTURE DECISIONS:\")\n", - "print(f\" - Batch size: Larger batches amortize I/O overhead\")\n", - "print(f\" - Data format: Preprocessed vs on-the-fly transformation\")\n", - "print(f\" - Storage location: Local vs network vs cloud storage\")\n", - "\n", - "print(f\"\\n4. 💰 COST IMPLICATIONS:\")\n", - "print(f\" - I/O bottlenecks waste expensive GPU time\")\n", - "print(f\" - GPU utilization directly affects training costs\")\n", - "print(f\" - Faster storage investment pays off in GPU efficiency\")\n", - "\n", - "print(f\"\\n💡 SYSTEMS ENGINEERING INSIGHT:\")\n", - "print(f\"I/O optimization is often the highest-impact performance improvement:\")\n", - "print(f\"- GPUs are expensive → maximize their utilization\")\n", - "print(f\"- Data loading is often the limiting factor\")\n", - "print(f\"- 10% I/O improvement = 10% faster training = 10% cost reduction\")\n", - "print(f\"- Modern ML systems spend significant effort on data pipeline optimization\")\n", - "\n", - "if __name__ == \"__main__\":\n", - " # Test the dataset interface demonstration\n", - " try:\n", - " test_dataset = TestDataset(size=5)\n", - " print(f\"Dataset created with size: {len(test_dataset)}\")\n", - " \n", - " # Test __getitem__\n", - " data, label = test_dataset[0]\n", - " print(f\"Sample 0: data={data}, label={label}\")\n", - " assert isinstance(data, Tensor), \"Data should be a Tensor\"\n", - " assert isinstance(label, Tensor), \"Label should be a Tensor\"\n", - " print(\"✅ Dataset __getitem__ works correctly\")\n", - " \n", - " # Test __len__\n", - " assert len(test_dataset) == 5, f\"Dataset length should be 5, got {len(test_dataset)}\"\n", - " print(\"✅ Dataset __len__ works correctly\")\n", - " \n", - " # Test get_num_classes\n", - " num_classes = test_dataset.get_num_classes()\n", - " assert num_classes == 2, f\"Number of classes should be 2, got {num_classes}\"\n", - " print(\"✅ Dataset get_num_classes works correctly\")\n", - " \n", - " # Test get_sample_shape\n", - " 
sample_shape = test_dataset.get_sample_shape()\n", - " assert sample_shape == (3,), f\"Sample shape should be (3,), got {sample_shape}\"\n", - " print(\"✅ Dataset get_sample_shape works correctly\")\n", - " \n", - " print(\"🎯 Dataset interface pattern:\")\n", - " print(\" __getitem__: Returns (data, label) tuple\")\n", - " print(\" __len__: Returns dataset size\")\n", - " print(\" get_num_classes: Returns number of classes\")\n", - " print(\" get_sample_shape: Returns shape of data samples\")\n", - " print(\"📈 Progress: Dataset interface ✓\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ Dataset interface test failed: {e}\")\n", - " raise\n", - " \n", - " # Run all tests\n", - " test_unit_dataset_interface()\n", - " test_unit_dataloader()\n", - " test_unit_simple_dataset()\n", - " test_unit_dataloader_pipeline()\n", - " test_module_dataloader_tensor_yield()\n", - " \n", - " print(\"All tests passed!\")\n", - " print(\"dataloader_dev module complete!\")" - ] - }, - { - "cell_type": "markdown", - "id": "27bce6e8", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🤔 ML Systems Thinking Questions\n", - "\n", - "### System Design\n", - "1. How does TinyTorch's DataLoader design compare to PyTorch's DataLoader and TensorFlow's tf.data API in terms of flexibility and performance?\n", - "2. What are the trade-offs between memory-mapped files, streaming data loading, and in-memory caching for large-scale ML datasets?\n", - "3. How would you design a data loading system that efficiently handles both structured (tabular) and unstructured (images, text) data?\n", - "\n", - "### Production ML\n", - "1. How would you implement fault-tolerant data loading that can handle network failures and corrupted files in production environments?\n", - "2. What strategies would you use to ensure data consistency and prevent data leakage when loading from constantly updating production databases?\n", - "3. 
How would you design a data pipeline that supports both batch inference and real-time prediction serving?\n", - "\n", - "### Framework Design\n", - "1. What design patterns enable efficient data preprocessing that can be distributed across multiple worker processes without blocking training?\n", - "2. How would you implement dynamic batching that adapts batch sizes based on available memory and model complexity?\n", - "3. What abstractions would you create to support different data formats (images, audio, text) while maintaining a unified loading interface?\n", - "\n", - "### Performance & Scale\n", - "1. How do different data loading strategies (synchronous vs asynchronous, single vs multi-threaded) impact training throughput on different hardware?\n", - "2. What are the bottlenecks when loading data for distributed training across multiple machines, and how would you optimize data transfer?\n", - "3. How would you implement data loading that scales efficiently from small datasets (MB) to massive datasets (TB) without code changes?" - ] - }, - { - "cell_type": "markdown", - "id": "0abe9e82", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🎯 MODULE SUMMARY: Data Loading and Processing\n", - "\n", - "Congratulations! 
You've successfully implemented professional data loading systems:\n", - "\n", - "### What You've Accomplished\n", - "✅ **DataLoader Class**: Efficient batch processing with memory management\n", - "✅ **Dataset Integration**: Seamless compatibility with Tensor operations\n", - "✅ **Batch Processing**: Optimized data loading for training\n", - "✅ **Memory Management**: Efficient handling of large datasets\n", - "✅ **Real Applications**: Image classification, regression, and more\n", - "\n", - "### Key Concepts You've Learned\n", - "- **Batch processing**: How to efficiently process data in chunks\n", - "- **Memory management**: Handling large datasets without memory overflow\n", - "- **Data iteration**: Creating efficient data loading pipelines\n", - "- **Integration patterns**: How data loaders work with neural networks\n", - "- **Performance optimization**: Balancing speed and memory usage\n", - "\n", - "### Professional Skills Developed\n", - "- **Data engineering**: Building robust data processing pipelines\n", - "- **Memory optimization**: Efficient handling of large datasets\n", - "- **API design**: Clean interfaces for data loading operations\n", - "- **Integration testing**: Ensuring data loaders work with neural networks\n", - "\n", - "### Ready for Advanced Applications\n", - "Your data loading implementations now enable:\n", - "- **Large-scale training**: Processing datasets too big for memory\n", - "- **Real-time learning**: Streaming data for online learning\n", - "- **Multi-modal data**: Handling images, text, and structured data\n", - "- **Production systems**: Robust data pipelines for deployment\n", - "\n", - "### Connection to Real ML Systems\n", - "Your implementations mirror production systems:\n", - "- **PyTorch**: `torch.utils.data.DataLoader` provides identical functionality\n", - "- **TensorFlow**: `tf.data.Dataset` implements similar concepts\n", - "- **Industry Standard**: Every major ML framework uses these exact patterns\n", - "\n", - 
"### Next Steps\n", - "1. **Export your code**: `tito export 09_dataloader`\n", - "2. **Test your implementation**: `tito test 09_dataloader`\n", - "3. **Build training pipelines**: Combine with neural networks for complete ML systems\n", - "4. **Move to Module 9**: Add automatic differentiation for training!\n", - "\n", - "**Ready for autograd?** Your data loading systems are now ready for real training!" - ] - } - ], - "metadata": { - "jupytext": { - "main_language": "python" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/modules_old/09_dataloader/dataloader_dev.py b/modules_old/09_dataloader/dataloader_dev.py deleted file mode 100644 index d1ad6cb5..00000000 --- a/modules_old/09_dataloader/dataloader_dev.py +++ /dev/null @@ -1,1413 +0,0 @@ -# --- -# jupyter: -# jupytext: -# text_representation: -# extension: .py -# format_name: percent -# format_version: '1.3' -# jupytext_version: 1.17.1 -# --- - -# %% [markdown] -""" -# DataLoader - Efficient Data Pipeline and Batch Processing Systems - -Welcome to the DataLoader module! You'll build the data infrastructure that feeds neural networks, understanding how I/O optimization and memory management determine training speed. - -## Learning Goals -- Systems understanding: How data I/O becomes the bottleneck in ML training and why efficient data pipelines are critical for system performance -- Core implementation skill: Build Dataset and DataLoader classes with batching, shuffling, and memory-efficient iteration patterns -- Pattern recognition: Understand the universal Dataset/DataLoader abstraction used across all ML frameworks -- Framework connection: See how your implementation mirrors PyTorch's data loading infrastructure and optimization strategies -- Performance insight: Learn why data loading parallelization and prefetching are essential for GPU utilization in production systems - -## Build -> Use -> Reflect -1. 
**Build**: Complete Dataset and DataLoader classes with efficient batching, shuffling, and real dataset support (CIFAR-10) -2. **Use**: Load large-scale image datasets and feed them to neural networks with proper batch processing -3. **Reflect**: Why does data loading speed often determine training speed more than model computation? - -## What You'll Achieve -By the end of this module, you'll understand: -- Deep technical understanding of how efficient data pipelines enable scalable ML training -- Practical capability to build data loading systems that handle datasets larger than memory -- Systems insight into why data engineering is often the limiting factor in ML system performance -- Performance consideration of how batch size, shuffling, and prefetching affect training throughput and convergence -- Connection to production ML systems and how frameworks optimize data loading for different storage systems - -## Systems Reality Check -TIP **Production Context**: PyTorch's DataLoader uses multiprocessing and memory pinning to overlap data loading with GPU computation, achieving near-zero data loading overhead -SPEED **Performance Note**: Modern GPUs can process data faster than storage systems can provide it - data loading optimization is critical for hardware utilization in production training -""" - -# %% nbgrader={"grade": false, "grade_id": "dataloader-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} -#| default_exp core.dataloader - -#| export -import numpy as np -import sys -import os -from typing import Tuple, Optional, Iterator -import urllib.request -import tarfile -import pickle -import time - -# Import our building blocks - try package first, then local modules -try: - from tinytorch.core.tensor import Tensor -except ImportError: - # For development, import from local modules - sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor')) - from tensor_dev import Tensor - -# %% nbgrader={"grade": false, 
"grade_id": "dataloader-welcome", "locked": false, "schema_version": 3, "solution": false, "task": false} -print("FIRE TinyTorch DataLoader Module") -print(f"NumPy version: {np.__version__}") -print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}") -print("Ready to build data pipelines!") - -# %% [markdown] -""" -## PACKAGE Where This Code Lives in the Final Package - -**Learning Side:** You work in `modules/source/06_dataloader/dataloader_dev.py` -**Building Side:** Code exports to `tinytorch.core.dataloader` - -```python -# Final package structure: -from tinytorch.core.dataloader import Dataset, DataLoader # Data loading utilities! -from tinytorch.core.tensor import Tensor # Foundation -from tinytorch.core.networks import Sequential # Models to train -``` - -**Why this matters:** -- **Learning:** Focused modules for deep understanding of data pipelines -- **Production:** Proper organization like PyTorch's `torch.utils.data` -- **Consistency:** All data loading utilities live together in `core.dataloader` -- **Integration:** Works seamlessly with tensors and networks -""" - -# %% [markdown] -""" -## 🔧 DEVELOPMENT -""" - -# %% [markdown] -""" -## Step 1: Understanding Data Pipelines - The Foundation of ML Systems - -### LINK Building on Previous Learning -**What You Built Before**: -- Module 02 (Tensor): Data structures that hold and manipulate arrays efficiently -- Module 04 (Layers): Neural network components that need batched inputs - -**What's Working**: You can create tensors and build neural network layers! - -**The Gap**: Your models need REAL DATA to train on, not just random numbers. - -**This Module's Solution**: Build professional data loading pipelines that feed real datasets to your networks. - -**Connection Map**: -``` -Tensor Operations -> Data Loading -> Training Loop - (Module 02) (Module 10) (Next: Module 11) -``` - -### What are Data Pipelines? 
-**Data pipelines** are the systems that efficiently move data from storage to your model. They're the foundation of all machine learning systems and often the performance bottleneck!
-
-### 📊 The Complete Data Pipeline Flow
-```
-+-------------+    +----------+    +---------+    +---------+    +--------------+
-| Raw Storage |---▶| Dataset  |---▶| Shuffle |---▶|  Batch  |---▶|  Neural Net  |
-| (Files/DB)  |    | Loading  |    | + Index |    | + Stack |    |   Training   |
-+-------------+    +----------+    +---------+    +---------+    +--------------+
-       v               v               v              v                 v
-   Gigabytes       On-Demand     Random Order    GPU-Friendly      Learning!
-    of Data         Loading      (No Overfit)       Format
-```
-
-### 🔍 Why Data Pipelines Are Critical for ML Systems
-- **Performance**: Efficient loading prevents GPU starvation (GPUs idle waiting for data)
-- **Scalability**: Handle datasets larger than memory (ImageNet = 150GB)
-- **Consistency**: Reproducible data processing across experiments
-- **Flexibility**: Easy to switch between datasets and configurations
-
-### ⚡ Real-World Performance Challenges
-```
-🏎️ GPU Processing Speed: ~1000 images/second
-🐌 Disk Read Speed:      ~100 images/second
-⚠️ Result: GPU waits 90% of time for data!
-```
-
-### 💾 Memory vs Storage Trade-offs
-```
-Dataset Size Analysis:
-+-------------+-------------+-------------+-------------+
-| Dataset     | Size        | Fits in RAM | Strategy    |
-+-------------+-------------+-------------+-------------+
-| MNIST       | ~60 MB      | ✅ Yes      | Load All    |
-| CIFAR-10    | ~170 MB     | ✅ Yes      | Load All    |
-| ImageNet    | ~150 GB     | ❌ No       | Stream      |
-| Custom      | ~1 TB       | ❌ No       | Stream      |
-+-------------+-------------+-------------+-------------+
-```
-
-### 🧠 Systems Engineering Principles
-- **Memory efficiency**: Handle datasets larger than RAM without crashing
-- **I/O optimization**: Read from disk efficiently to minimize GPU waiting
-- **Batching strategies**: Trade-offs between memory usage and training speed
-- **Caching**: When to cache frequently used data vs recompute on-demand
-
-### 📈 Batch Processing Impact
-```
-Batch Size Performance Analysis:
-  Batch Size | GPU Utilization | Memory Usage | Training Speed
-  -----------+-----------------+--------------+---------------
-       1     |      ~10%       |     Low      |   Very Slow
-      32     |      ~80%       |    Medium    |     Good
-     128     |      ~95%       |     High     |   Very Fast
-     512     |      ~98%       |  Very High   |    Fastest*
-
-  * Until you run out of GPU memory!
-```
-
-Let's start by building the most fundamental component: **Dataset**.
-"""
-
-# %% [markdown]
-"""
-## Step 2: Building the Dataset Interface - The Universal Data Access Pattern
-
-### What is a Dataset?
-A **Dataset** is an abstract interface that provides consistent access to data. It's the foundation of all data loading systems and the key abstraction that makes ML frameworks flexible.
-
-### 🎯 The Universal Dataset Pattern
-```
-            Dataset Interface
-  +-----------------------------+
-  | def __getitem__(index):     |<--- Get single sample by index
-  |     return data, label      |     (like a list or dictionary)
-  |                             |
-  | def __len__():              |<--- Total number of samples
-  |     return total_samples    |     (enables progress tracking)
-  +-----------------------------+
-                 ^
-                 | Implements
-    +------------+--------------+
-    |            |              |
-+---v----+ +-----v------+ +-----v-------+
-| MNIST  | | CIFAR-10   | | Custom Data |
-|Dataset | | Dataset    | | Dataset     |
-+--------+ +------------+ +-------------+
-```
-
-### 🔧 Why Abstract Interfaces Are Systems Engineering Gold
-- **Consistency**: Same interface for all data types (images, text, audio)
-- **Flexibility**: Easy to switch between datasets without changing training code
-- **Testability**: Easy to create test datasets for debugging and unit tests
-- **Extensibility**: Easy to add new data sources (databases, APIs, cloud storage)
-- **Modularity**: DataLoader works with ANY dataset that implements this interface
-
-### 📊 Production Dataset Examples
-```
-Real-World Dataset Implementations:
-
-🖼️ Computer Vision:
-   - ImageNet: 14M images, 1000 classes
-   - CIFAR-10: 60K images, 10 classes
-   - COCO: 200K images with object detection annotations
-   - Custom: Your company's image data
-
-📝 Natural Language Processing:
-   - WikiText: 100M+ tokens from Wikipedia
-   - IMDB: 50K movie reviews for sentiment analysis
-   - Custom: Your company's text data
-
-🔊 Audio Processing:
-   - LibriSpeech: 1000 hours of speech
-   - AudioSet: 2M YouTube clips with audio events
-   - Custom: Your company's audio data
-
-📈 Time Series:
-   - Stock prices, sensor data, user behavior logs
-   - Custom: Your company's time series data
-```
-
-### 🚀 Framework Integration Power
-```
-# PyTorch Compatibility:
-torch_dataset = torch.utils.data.Dataset  # Same interface!
-torch_loader = torch.utils.data.DataLoader(dataset, batch_size=32) - -# TensorFlow Compatibility: -tf_dataset = tf.data.Dataset.from_generator(dataset_generator) - -# Our TinyTorch: -tiny_loader = DataLoader(dataset, batch_size=32) # Same pattern! -``` - -This universal pattern means your skills transfer directly to production frameworks! - -Let's implement the Dataset interface! -""" - -# %% nbgrader={"grade": false, "grade_id": "dataset-class", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class Dataset: - """ - Base Dataset class: Abstract interface for all datasets. - - The fundamental abstraction for data loading in TinyTorch. - Students implement concrete datasets by inheriting from this class. - """ - - def __getitem__(self, index: int) -> Tuple[Tensor, Tensor]: - """ - Get a single sample and label by index. - - Args: - index: Index of the sample to retrieve - - Returns: - Tuple of (data, label) tensors - - TODO: Implement abstract method for getting samples. - - STEP-BY-STEP IMPLEMENTATION: - 1. This is an abstract method - subclasses will implement it - 2. Return a tuple of (data, label) tensors - 3. 
Data should be the input features, label should be the target - - EXAMPLE: - dataset[0] should return (Tensor(image_data), Tensor(label)) - - LEARNING CONNECTIONS: - - **PyTorch Integration**: This follows the exact same pattern as torch.utils.data.Dataset - - **Production Data**: Real datasets like ImageNet, CIFAR-10 use this interface - - **Memory Efficiency**: On-demand loading prevents loading entire dataset into memory - - **Batching Foundation**: DataLoader uses __getitem__ to create batches efficiently - - HINTS: - - This is an abstract method that subclasses must override - - Always return a tuple of (data, label) tensors - - Data contains the input features, label contains the target - """ - ### BEGIN SOLUTION - # This is an abstract method - subclasses must implement it - # Every dataset (CIFAR-10, ImageNet, custom) must implement this! - raise NotImplementedError( - "This is an abstract method - subclasses like SimpleDataset " - "must implement __getitem__ to return (data, label) tuples" - ) - ### END SOLUTION - - def __len__(self) -> int: - """ - Get the total number of samples in the dataset. - - TODO: Implement abstract method for getting dataset size. - - STEP-BY-STEP IMPLEMENTATION: - 1. This is an abstract method - subclasses will implement it - 2. 
Return the total number of samples in the dataset - - EXAMPLE: - len(dataset) should return 50000 for CIFAR-10 training set - - LEARNING CONNECTIONS: - - **Memory Planning**: DataLoader uses len() to calculate number of batches - - **Progress Tracking**: Training loops use len() for progress bars and epoch calculations - - **Distributed Training**: Multi-GPU systems need dataset size for work distribution - - **Statistical Sampling**: Some training strategies require knowing total dataset size - - HINTS: - - This is an abstract method that subclasses must override - - Return an integer representing the total number of samples - """ - ### BEGIN SOLUTION - # This is an abstract method - subclasses must implement it - # DataLoader needs this to calculate number of batches! - raise NotImplementedError("Subclasses must implement __len__") - ### END SOLUTION - - def get_sample_shape(self) -> Tuple[int, ...]: - """ - Get the shape of a single data sample. - - TODO: Implement method to get sample shape. - - STEP-BY-STEP IMPLEMENTATION: - 1. Get the first sample using self[0] - 2. Extract the data part (first element of tuple) - 3. Return the shape of the data tensor - - EXAMPLE: - For CIFAR-10: returns (3, 32, 32) for RGB images - - LEARNING CONNECTIONS: - - **Model Architecture**: Neural networks need to know input shape for first layer - - **Batch Planning**: Systems use sample shape to calculate memory requirements - - **Preprocessing Validation**: Ensures all samples have consistent shape - - **Framework Integration**: Similar to PyTorch's dataset shape inspection - - HINTS: - - Use self[0] to get the first sample - - Extract data from the (data, label) tuple - - Return data.shape - """ - ### BEGIN SOLUTION - # Get the first sample to determine shape - # This helps neural networks know their input dimension - data, _ = self[0] - return data.shape - ### END SOLUTION - - def get_num_classes(self) -> int: - """ - Get the number of classes in the dataset. 
- - TODO: Implement abstract method for getting number of classes. - - STEP-BY-STEP IMPLEMENTATION: - 1. This is an abstract method - subclasses will implement it - 2. Return the number of unique classes in the dataset - - EXAMPLE: - For CIFAR-10: returns 10 (classes 0-9) - - LEARNING CONNECTIONS: - - **Output Layer Design**: Neural networks need num_classes for final layer size - - **Loss Function Setup**: CrossEntropyLoss uses num_classes for proper computation - - **Evaluation Metrics**: Accuracy calculation depends on number of classes - - **Model Validation**: Ensures model predictions match expected class range - - HINTS: - - This is an abstract method that subclasses must override - - Return the number of unique classes/categories - """ - ### BEGIN SOLUTION - # This is an abstract method - subclasses must implement it - # Neural networks need this for output layer size (classification) - raise NotImplementedError("Subclasses must implement get_num_classes") - ### END SOLUTION - -# %% [markdown] -""" -### 🧪 Unit Test: Dataset Interface - -Let's understand the Dataset interface! While we can't test the abstract class directly, we'll create a simple test dataset. - -**This is a unit test** - it tests the Dataset interface pattern in isolation. 
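To see why the abstract base class can't be exercised directly, here is a standalone sketch (`AbstractDataset` is hypothetical and mirrors the pattern above; it is not the TinyTorch class itself):

```python
class AbstractDataset:
    """Abstract interface: subclasses must override both methods."""
    def __getitem__(self, index):
        raise NotImplementedError("subclasses must implement __getitem__")

    def __len__(self):
        raise NotImplementedError("subclasses must implement __len__")

ds = AbstractDataset()
try:
    ds[0]
except NotImplementedError as err:
    print(f"Expected failure: {err}")
```

This is exactly why the test below works through a small concrete subclass instead of the base class.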
-""" - -# Create a minimal test dataset for testing -class TestDataset(Dataset): - def __init__(self, size=5): - self.size = size - - def __getitem__(self, index): - # Simple test data: features are [index, index*2], label is index % 2 - data = Tensor([index, index * 2]) - label = Tensor([index % 2]) - return data, label - - def __len__(self): - return self.size - - def get_num_classes(self): - return 2 - -# %% -def test_unit_dataset_interface(): - """Test Dataset interface with a simple implementation.""" - print("🔬 Unit Test: Dataset Interface...") - - # Create a minimal test dataset - dataset = TestDataset(size=5) - - # Test basic interface - assert len(dataset) == 5, "Dataset should have correct length" - - # Test data access - sample, label = dataset[0] - assert isinstance(sample, Tensor), "Sample should be Tensor" - assert isinstance(label, Tensor), "Label should be Tensor" - - print("✅ Dataset interface works correctly!") - -test_unit_dataset_interface() - -# %% [markdown] -""" -## Step 3: Building the DataLoader - The Batch Processing Engine - -### What is a DataLoader? -A **DataLoader** efficiently batches and iterates through datasets. It's the bridge between individual samples and the batched data that neural networks expect. This is where the real systems engineering happens! - -### 🔄 The DataLoader Processing Pipeline -``` - Dataset Samples DataLoader Magic Neural Network - +---------------------+ +---------------------+ +-----------------+ - | [sample_1] | | 1. Shuffle indices | | Efficient GPU | - | [sample_2] |------▶| 2. Group into |------▶| Batch | - | [sample_3] | | batches | | Processing | - | [sample_4] | | 3. Stack tensors | | | - | ... | | 4. Yield batches | | batch_size=32 | - | [sample_n] | | | | shape=(32,...) 
|
-+---------------------+       +---------------------+       +-----------------+
-```
-
-### ⚡ Why DataLoaders Are Critical for Performance
-```
-GPU Utilization Without Batching:
-+-----+-----+-----+-----+-----+-----+-----+-----+
-| 🔄  | ... | ... | ... | ... | ... | ... | ... |  Time
-+-----+-----+-----+-----+-----+-----+-----+-----+
-  ~5%  GPU mostly idle (underutilized)
-
-GPU Utilization With Proper Batching:
-+████████████████████████████████████████████████+
-| ████████████████████████████████               |  Time
-+████████████████████████████████████████████████+
-  ~95% GPU fully utilized (efficient!)
-```
-
-### 🧮 Memory vs Speed Trade-offs
-```
-Batch Size Impact Analysis:
-
-  Batch Size | Memory Usage    | GPU Utilization | Gradient Quality
-  -----------+-----------------+-----------------+-----------------
-      1      |       Low       |      ~10%       |   Noisy (bad)
-     16      |      Medium     |      ~60%       |     Better
-     64      |       High      |      ~90%       |      Good
-    256      |    Very High    |      ~95%       |    Very Good
-    512      | TOO HIGH! CRASH |       N/A       |    OOM Error
-```
-
-### 🔀 Shuffling: Preventing Overfitting to Data Order
-```
-Without Shuffling (Bad!):
-  Epoch 1: [cat, cat, dog, dog, bird, bird]
-  Epoch 2: [cat, cat, dog, dog, bird, bird]  <- Same order!
-  Model learns data order, not features 😞
-
-With Shuffling (Good!):
-  Epoch 1: [dog, cat, bird, cat, dog, bird]
-  Epoch 2: [bird, dog, cat, bird, cat, dog]  <- Random order!
-  Model learns features, generalizes well 😊
-```
-
-### 🎯 Production Training Pattern
-```python
-# The universal ML training pattern:
-for epoch in range(num_epochs):
-    for batch_data, batch_labels in dataloader:  # <- This line!
- predictions = model(batch_data) - loss = criterion(predictions, batch_labels) - loss.backward() - optimizer.step() -``` - -### 🏗️ Systems Engineering Considerations -- **Batch size**: Trade-off between memory usage and training speed -- **Shuffling**: Essential for model generalization (prevents order memorization) -- **Memory efficiency**: Stream data instead of loading everything into RAM -- **Iterator protocol**: Enables clean for-loop syntax in training code -- **GPU utilization**: Proper batching maximizes expensive GPU hardware - -### 🔧 Real-World Applications -- **Training loops**: Feed batches to neural networks for gradient computation -- **Validation**: Evaluate models on held-out data systematically -- **Inference**: Process large datasets efficiently for predictions -- **Data analysis**: Explore datasets systematically without memory overflow - -Let's implement the DataLoader that powers all ML training! -""" - -# %% nbgrader={"grade": false, "grade_id": "dataloader-class", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class DataLoader: - """ - DataLoader: Efficiently batch and iterate through datasets. - - Provides batching, shuffling, and efficient iteration over datasets. - Essential for training neural networks efficiently. - """ - - def __init__(self, dataset: Dataset, batch_size: int = 32, shuffle: bool = True): - """ - Initialize DataLoader. - - Args: - dataset: Dataset to load from - batch_size: Number of samples per batch - shuffle: Whether to shuffle data each epoch - - TODO: Store configuration and dataset. - - APPROACH: - 1. Store dataset as self.dataset - 2. Store batch_size as self.batch_size - 3. 
Store shuffle as self.shuffle - - EXAMPLE: - DataLoader(dataset, batch_size=32, shuffle=True) - - HINTS: - - Store all parameters as instance variables - - These will be used in __iter__ for batching - """ - # Input validation - if dataset is None: - raise TypeError("Dataset cannot be None") - if not isinstance(batch_size, int) or batch_size <= 0: - raise ValueError( - f"Batch size must be a positive integer (like 32 or 64), got {batch_size}. " - f"This determines how many samples are processed together for efficiency." - ) - - self.dataset = dataset - self.batch_size = batch_size - self.shuffle = shuffle - - def __iter__(self) -> Iterator[Tuple[Tensor, Tensor]]: - """ - Iterate through dataset in batches. - - Returns: - Iterator yielding (batch_data, batch_labels) tuples - - TODO: Implement batching and shuffling logic. - - STEP-BY-STEP IMPLEMENTATION: - 1. Create indices list: list(range(len(dataset))) - 2. Shuffle indices if self.shuffle is True - 3. Loop through indices in batch_size chunks - 4. For each batch: collect samples, stack them, yield batch - - EXAMPLE: - for batch_data, batch_labels in dataloader: - # batch_data.shape: (batch_size, ...) 
- # batch_labels.shape: (batch_size,) - - LEARNING CONNECTIONS: - - **GPU Efficiency**: Batching maximizes GPU utilization by processing multiple samples together - - **Training Stability**: Shuffling prevents overfitting to data order and improves generalization - - **Memory Management**: Batches fit in GPU memory while full dataset may not - - **Gradient Estimation**: Batch gradients provide better estimates than single-sample gradients - - HINTS: - - Use list(range(len(self.dataset))) for indices - - Use np.random.shuffle() if self.shuffle is True - - Loop in chunks of self.batch_size - - Collect samples and stack with np.stack() - """ - ### BEGIN SOLUTION - # Step 1: Create list of all sample indices (0, 1, 2, ..., dataset_size-1) - # This allows us to control which samples go into which batches - sample_indices = list(range(len(self.dataset))) - - # Step 2: Randomly shuffle indices if requested (prevents overfitting to data order) - # Shuffling is critical for good model generalization! 
- if self.shuffle: - np.random.shuffle(sample_indices) - - # Step 3: Process data in batches of self.batch_size - # This loop creates efficient GPU-sized chunks of data - for batch_start_idx in range(0, len(sample_indices), self.batch_size): - current_batch_indices = sample_indices[batch_start_idx:batch_start_idx + self.batch_size] - - # Step 4: Collect samples for this batch - # Build lists of data and labels for efficient stacking - batch_data_list = [] - batch_labels_list = [] - - for sample_idx in current_batch_indices: - data, label = self.dataset[sample_idx] # Get individual sample - # Access .data to get underlying numpy array for efficient stacking - # Tensors wrap numpy arrays, and np.stack() needs raw arrays - batch_data_list.append(data.data) - batch_labels_list.append(label.data) - - # Step 5: Stack individual samples into batch tensors - # np.stack combines multiple arrays along a new axis (axis=0 = batch dimension) - # This creates the (batch_size, feature_dims...) shape that GPUs love! - batch_data_array = np.stack(batch_data_list, axis=0) - batch_labels_array = np.stack(batch_labels_list, axis=0) - - # Return batch as Tensors for neural network processing - yield Tensor(batch_data_array), Tensor(batch_labels_array) - ### END SOLUTION - - def __len__(self) -> int: - """ - Get the number of batches per epoch. - - TODO: Calculate number of batches. - - APPROACH: - 1. Get dataset size: len(self.dataset) - 2. Divide by batch_size and round up - 3. 
Use ceiling division: (n + batch_size - 1) // batch_size - - EXAMPLE: - Dataset size 100, batch size 32 -> 4 batches - - HINTS: - - Use len(self.dataset) for dataset size - - Use ceiling division for exact batch count - - Formula: (dataset_size + batch_size - 1) // batch_size - """ - ### BEGIN SOLUTION - # Calculate number of batches using ceiling division - # This tells training loops how many iterations per epoch - dataset_size = len(self.dataset) - return (dataset_size + self.batch_size - 1) // self.batch_size - ### END SOLUTION - -# %% [markdown] -""" -### 🧪 Unit Test: DataLoader - -Let's test your DataLoader implementation! This is the heart of efficient data loading for neural networks. - -**This is a unit test** - it tests the DataLoader class in isolation. -""" - -# %% -def test_unit_dataloader(): - """Test DataLoader implementation with comprehensive functionality tests.""" - print("🔬 Unit Test: DataLoader...") - - # Use the TestDataset from before - dataset = TestDataset(size=10) - dataloader = DataLoader(dataset, batch_size=3, shuffle=False) - - print(f"DataLoader created: batch_size={dataloader.batch_size}, shuffle={dataloader.shuffle}") - print(f"Number of batches: {len(dataloader)}") - - # Test __len__ - expected_batches = (10 + 3 - 1) // 3 # Ceiling division: 4 batches - assert len(dataloader) == expected_batches, f"Should have {expected_batches} batches, got {len(dataloader)}" - - # Test iteration - batch_count = 0 - total_samples = 0 - - for batch_data, batch_labels in dataloader: - batch_count += 1 - batch_size = batch_data.shape[0] - total_samples += batch_size - - # Verify batch dimensions - assert len(batch_data.shape) == 2, f"Batch data should be 2D, got {batch_data.shape}" - assert len(batch_labels.shape) == 2, f"Batch labels should be 2D, got {batch_labels.shape}" - assert batch_data.shape[1] == 2, f"Each sample should have 2 features, got {batch_data.shape[1]}" - assert batch_labels.shape[1] == 1, f"Each label should have 1 element, got 
{batch_labels.shape[1]}" - - assert batch_count == expected_batches, f"Should iterate {expected_batches} times, got {batch_count}" - assert total_samples == 10, f"Should process 10 total samples, got {total_samples}" - - # Test shuffling - dataloader_shuffle = DataLoader(dataset, batch_size=5, shuffle=True) - dataloader_no_shuffle = DataLoader(dataset, batch_size=5, shuffle=False) - - # Get first batch from each - batch1_shuffle = next(iter(dataloader_shuffle)) - batch1_no_shuffle = next(iter(dataloader_no_shuffle)) - - # Test different batch sizes - small_loader = DataLoader(dataset, batch_size=2, shuffle=False) - large_loader = DataLoader(dataset, batch_size=8, shuffle=False) - - assert len(small_loader) == 5, f"Small loader should have 5 batches, got {len(small_loader)}" - assert len(large_loader) == 2, f"Large loader should have 2 batches, got {len(large_loader)}" - - print("✅ DataLoader works correctly!") - -test_unit_dataloader() - -# %% [markdown] -""" -## Step 4: Creating a Simple Dataset Example - -### Why We Need Concrete Examples -Abstract classes are great for interfaces, but we need concrete implementations to understand how they work. Let's create a simple dataset for testing. - -### Design Principles -- **Simple**: Easy to understand and debug -- **Configurable**: Adjustable size and properties -- **Predictable**: Deterministic data for testing -- **Educational**: Shows the Dataset pattern clearly - -### Real-World Connection -This pattern is used for: -- **CIFAR-10**: 32x32 RGB images with 10 classes -- **ImageNet**: High-resolution images with 1000 classes -- **MNIST**: 28x28 grayscale digits with 10 classes -- **Custom datasets**: Your own data following this pattern -""" - -# %% nbgrader={"grade": false, "grade_id": "simple-dataset", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class SimpleDataset(Dataset): - """ - Simple dataset for testing and demonstration. 
- - Generates synthetic data with configurable size and properties. - Perfect for understanding the Dataset pattern. - """ - - def __init__(self, size: int = 100, num_features: int = 4, num_classes: int = 3): - """ - Initialize SimpleDataset. - - Args: - size: Number of samples in the dataset - num_features: Number of features per sample - num_classes: Number of classes - - TODO: Initialize the dataset with synthetic data. - - APPROACH: - 1. Store the configuration parameters - 2. Generate synthetic data and labels - 3. Make data deterministic for testing - - EXAMPLE: - SimpleDataset(size=100, num_features=4, num_classes=3) - creates 100 samples with 4 features each, 3 classes - - HINTS: - - Store size, num_features, num_classes as instance variables - - Use np.random.seed() for reproducible data - - Generate random data with np.random.randn() - - Generate random labels with np.random.randint() - """ - self.size = size - self.num_features = num_features - self.num_classes = num_classes - - # Generate synthetic data (deterministic for testing) - np.random.seed(42) # Fixed seed ensures same data every time - important for testing! - self.data = np.random.randn(size, num_features).astype(np.float32) - self.labels = np.random.randint(0, num_classes, size=size) - - def __getitem__(self, index: int) -> Tuple[Tensor, Tensor]: - """ - Get a sample by index. - - Args: - index: Index of the sample - - Returns: - Tuple of (data, label) tensors - - TODO: Return the sample at the given index. - - APPROACH: - 1. Get data sample from self.data[index] - 2. Get label from self.labels[index] - 3. Convert both to Tensors and return as tuple - - EXAMPLE: - dataset[0] returns (Tensor(features), Tensor(label)) - - HINTS: - - Use self.data[index] for the data - - Use self.labels[index] for the label - - Convert to Tensors: Tensor(data), Tensor(label) - """ - ### BEGIN SOLUTION - # Get the specific sample by index - # This is the core of on-demand data loading! 
- data = self.data[index] - label = self.labels[index] - return Tensor(data), Tensor(label) - ### END SOLUTION - - def __len__(self) -> int: - """ - Get the dataset size. - - TODO: Return the dataset size. - - APPROACH: - 1. Return self.size - - EXAMPLE: - len(dataset) returns 100 for dataset with 100 samples - - HINTS: - - Simply return self.size - """ - ### BEGIN SOLUTION - # Return total number of samples - # DataLoader needs this to calculate batches per epoch - return self.size - ### END SOLUTION - - def get_num_classes(self) -> int: - """ - Get the number of classes. - - TODO: Return the number of classes. - - APPROACH: - 1. Return self.num_classes - - EXAMPLE: - dataset.get_num_classes() returns 3 for 3-class dataset - - HINTS: - - Simply return self.num_classes - """ - ### BEGIN SOLUTION - # Return number of unique classes - # Neural networks need this for output layer size - return self.num_classes - ### END SOLUTION - -# %% [markdown] -""" -## Step 4b: CIFAR-10 Dataset - Real Computer Vision Data - -### 🏆 Achieving Our North Star Goal: 75% Accuracy on CIFAR-10 - -Let's implement loading CIFAR-10, the dataset we'll use to achieve our ambitious goal of 75% accuracy! - -### 🇺🇸 CIFAR-10 Dataset Specifications -``` -🖼️ CIFAR-10 Dataset Overview: - +----------------------------------------+ - | 🎨 Classes: 10 (airplane, car, bird, etc.) 
|
- | 🖼️ Images: 60,000 total (50k train + 10k test)  |
- | 📌 Size: 32x32 pixels, RGB color                 |
- | 💾 Storage: ~170MB compressed                    |
- | 🎯 Goal: 75% classification accuracy             |
- +----------------------------------------+
-
- Classes: airplane, automobile, bird, cat, deer,
-          dog, frog, horse, ship, truck
-```
-
-### 🔄 Data Pipeline for Computer Vision
-```
-CIFAR-10 Loading Pipeline:
-
-   Raw Files          Dataset Class        DataLoader          CNN Model
-+-----------------+ +-----------------+ +-----------------+ +-----------------+
-| data_batch_1    | | CIFAR10Dataset  | | Batch: (32,3,   | | Convolutional   |
-| data_batch_2    |▶| __getitem__()   |▶| 32,32) images   |▶| Neural          |
-| data_batch_3    | | Loads on-demand | | Labels: (32,)   | | Network         |
-| data_batch_4    | | Normalizes [0,1]| | Shuffled order  | | Training        |
-| data_batch_5    | | Shape: (3,32,32)| |                 | |                 |
-+-----------------+ +-----------------+ +-----------------+ +-----------------+
-```
-
-### 📈 Why CIFAR-10 is Perfect for Learning
-- **Manageable size**: Fits in memory, fast iteration
-- **Real complexity**: Natural images, not toy data
-- **Standard benchmark**: Compare with published results
-- **CV fundamentals**: Teaches image processing essentials
-"""
-
-# %% nbgrader={"grade": false, "grade_id": "cifar10", "locked": false, "schema_version": 3, "solution": true, "task": false}
-#| export
-def download_cifar10(root: str = "./data") -> str:
-    """
-    Download CIFAR-10 dataset.
-
-    TODO: Download and extract CIFAR-10.
-
-    HINTS:
-    - URL: https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
-    - Use urllib.request.urlretrieve()
-    - Extract with tarfile
-    """
-    ### BEGIN SOLUTION
-    os.makedirs(root, exist_ok=True)
-    dataset_dir = os.path.join(root, "cifar-10-batches-py")
-
-    if os.path.exists(dataset_dir):
-        print(f"✅ CIFAR-10 found at {dataset_dir}")
-        return dataset_dir
-
-    url = "https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"
-    tar_path = os.path.join(root, "cifar-10.tar.gz")
-
-    print(f"📥 Downloading CIFAR-10 (~170MB)...")
-    urllib.request.urlretrieve(url, tar_path)
-    print("✅ Downloaded!")
-
-    print("📦 Extracting...")
-    with tarfile.open(tar_path, 'r:gz') as tar:
-        tar.extractall(root)
-    print("✅ Ready!")
-
-    return dataset_dir
-    ### END SOLUTION
-
-class CIFAR10Dataset(Dataset):
-    """CIFAR-10 dataset for CNN training."""
-
-    def __init__(self, root="./data", train=True, download=False):
-        """Load CIFAR-10 data."""
-        ### BEGIN SOLUTION
-        if download:
-            dataset_dir = download_cifar10(root)
-        else:
-            dataset_dir = os.path.join(root, "cifar-10-batches-py")
-
-        if train:
-            data_list = []
-            label_list = []
-            for i in range(1, 6):
-                with open(os.path.join(dataset_dir, f"data_batch_{i}"), 'rb') as f:
-                    batch = pickle.load(f, encoding='bytes')
-                    data_list.append(batch[b'data'])
-                    label_list.extend(batch[b'labels'])
-            self.data = np.concatenate(data_list)
-            self.labels = np.array(label_list)
-        else:
-            with open(os.path.join(dataset_dir, "test_batch"), 'rb') as f:
-                batch = pickle.load(f, encoding='bytes')
-            self.data = batch[b'data']
-            self.labels = np.array(batch[b'labels'])
-
-        # Reshape from flat array to image format: (N, 3, 32, 32) = (batch, channels, height, width)
-        # Normalize pixel values from [0, 255] to [0, 1] for neural network training
-        # This is critical: neural networks expect inputs in [0,1] range!
-        self.data = self.data.reshape(-1, 3, 32, 32).astype(np.float32) / 255.0
-        print(f"✅ Loaded {len(self.data):,} images")
-        print(f"   Data shape: {self.data.shape}")
-        print(f"   Value range: [{self.data.min():.2f}, {self.data.max():.2f}]")
-        ### END SOLUTION
-
-    def __getitem__(self, idx):
-        ### BEGIN SOLUTION
-        # Return individual image and label as Tensors
-        # Image shape: (3, 32, 32) = (channels, height, width)
-        # Label shape: () = scalar class index
-        return Tensor(self.data[idx]), Tensor(self.labels[idx])
-        ### END SOLUTION
-
-    def __len__(self):
-        ### BEGIN SOLUTION
-        # Return total number of images
-        return len(self.data)
-        ### END SOLUTION
-
-    def get_num_classes(self):
-        ### BEGIN SOLUTION
-        # CIFAR-10 has exactly 10 classes
-        return 10
-        ### END SOLUTION
-
-# %% [markdown]
-"""
-### 🧪 Unit Test: SimpleDataset
-
-Let's test your SimpleDataset implementation! This concrete example shows how the Dataset pattern works.
-
-**This is a unit test** - it tests the SimpleDataset class in isolation.
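One property the determinism check below depends on is that a fixed NumPy seed reproduces the same synthetic data on every construction. A small sketch of that mechanism (assuming NumPy is available, as elsewhere in this module; `make_data` is a hypothetical helper, not part of TinyTorch):

```python
import numpy as np

def make_data(size=5, num_features=3):
    # Re-seeding before generation makes the "random" data reproducible,
    # the same trick SimpleDataset.__init__ uses with np.random.seed(42)
    np.random.seed(42)
    return np.random.randn(size, num_features).astype(np.float32)

a = make_data()
b = make_data()
print(np.array_equal(a, b))  # True
```

Two independently constructed SimpleDataset objects are therefore byte-for-byte identical, which is what makes the comparison test reliable.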
-""" - -# %% -def test_unit_simple_dataset(): - """Test SimpleDataset implementation with comprehensive functionality tests.""" - print("🔬 Unit Test: SimpleDataset...") - - # Create dataset - dataset = SimpleDataset(size=20, num_features=5, num_classes=4) - - print(f"Dataset created: size={len(dataset)}, features={dataset.num_features}, classes={dataset.get_num_classes()}") - - # Test basic properties - assert len(dataset) == 20, f"Dataset length should be 20, got {len(dataset)}" - assert dataset.get_num_classes() == 4, f"Should have 4 classes, got {dataset.get_num_classes()}" - - # Test sample access - data, label = dataset[0] - assert isinstance(data, Tensor), "Data should be a Tensor" - assert isinstance(label, Tensor), "Label should be a Tensor" - assert data.shape == (5,), f"Data shape should be (5,), got {data.shape}" - assert label.shape == (), f"Label shape should be (), got {label.shape}" - - # Test sample shape - sample_shape = dataset.get_sample_shape() - assert sample_shape == (5,), f"Sample shape should be (5,), got {sample_shape}" - - # Test multiple samples - for i in range(5): - data, label = dataset[i] - assert data.shape == (5,), f"Data shape should be (5,) for sample {i}, got {data.shape}" - assert 0 <= label.data < 4, f"Label should be in [0, 3] for sample {i}, got {label.data}" - - # Test deterministic data (same seed should give same data) - dataset2 = SimpleDataset(size=20, num_features=5, num_classes=4) - data1, label1 = dataset[0] - data2, label2 = dataset2[0] - assert np.array_equal(data1.data, data2.data), "Data should be deterministic" - assert np.array_equal(label1.data, label2.data), "Labels should be deterministic" - - print("✅ SimpleDataset works correctly!") - -test_unit_simple_dataset() - -# %% [markdown] -""" -### 🧪 Unit Test: Complete Data Pipeline Integration - -This comprehensive test validates the entire data pipeline from dataset creation through DataLoader batching, ensuring all components work together seamlessly for 
end-to-end machine learning data processing workflows.
-"""
-
-# %%
-def test_unit_dataloader_pipeline():
-    """Comprehensive unit test for the complete data pipeline."""
-    print("🔬 Comprehensive Test: Data Pipeline...")
-
-    # Test complete pipeline
-    dataset = SimpleDataset(size=50, num_features=10, num_classes=5)
-    loader = DataLoader(dataset, batch_size=8, shuffle=True)
-
-    total_samples = 0
-    for batch_data, batch_labels in loader:
-        assert isinstance(batch_data, Tensor), "Batch data should be Tensor"
-        assert isinstance(batch_labels, Tensor), "Batch labels should be Tensor"
-        assert batch_data.shape[1] == 10, "Features should be correct"
-        total_samples += batch_data.shape[0]
-
-    assert total_samples == 50, "Should process all samples"
-
-    print("✅ Data pipeline integration works correctly!")
-
-test_unit_dataloader_pipeline()
-
-
-# %% [markdown]
-"""
-## 🧪 Module Testing
-
-Time to test your implementation! This section uses TinyTorch's standardized testing framework to ensure your implementation works correctly.
-
-**This testing section is locked** - it provides consistent feedback across all modules and cannot be modified.
-"""
-
-# %% nbgrader={"grade": false, "grade_id": "standardized-testing", "locked": true, "schema_version": 3, "solution": false, "task": false}
-# =============================================================================
-# STANDARDIZED MODULE TESTING - DO NOT MODIFY
-# This cell is locked to ensure consistent testing across all TinyTorch modules
-# =============================================================================
-
-# %% [markdown]
-"""
-## 🔬 Integration Test: DataLoader with Tensors
-"""
-
-# %%
-def test_module_dataloader_tensor_yield():
-    """
-    Integration test for the DataLoader and Tensor classes.
-
-    Tests that the DataLoader correctly yields batches of Tensors.
-    """
-    print("🔬 Running Integration Test: DataLoader with Tensors...")
-
-    # 1. 
Create a simple dataset
-    dataset = SimpleDataset(size=50, num_features=8, num_classes=4)
-
-    # 2. Create a DataLoader
-    dataloader = DataLoader(dataset, batch_size=10, shuffle=False)
-
-    # 3. Get one batch from the dataloader
-    data_batch, labels_batch = next(iter(dataloader))
-
-    # 4. Assert the batch contents are correct
-    assert isinstance(data_batch, Tensor), "Data batch should be a Tensor"
-    assert data_batch.shape == (10, 8), f"Expected data shape (10, 8), but got {data_batch.shape}"
-
-    assert isinstance(labels_batch, Tensor), "Labels batch should be a Tensor"
-    assert labels_batch.shape == (10,), f"Expected labels shape (10,), but got {labels_batch.shape}"
-
-    print("✅ Integration Test Passed: DataLoader correctly yields batches of Tensors.")
-
-# Test function defined (called in main block)
-
-# %% [markdown]
-"""
-## 🔍 Systems Analysis: I/O Pipeline Performance & Scaling
-
-Now that your data loading implementation is complete, let's analyze its performance characteristics and understand how it scales in production systems.
-
-**This section teaches ML systems engineering skills: measuring, profiling, and optimizing data pipeline performance.**
-"""
-
-# %%
-import time
-import os
-
-def analyze_dataloader_performance():
-    """
-    Comprehensive analysis of DataLoader performance characteristics.
-
-    Measures batch loading times, memory usage patterns, and scaling behavior
-    to understand production performance implications.
-    """
-    print("🔍 DATALOADER PERFORMANCE ANALYSIS")
-    print("=" * 50)
-
-    # Test 1: Basic Performance Timing
-    print("\n📊 1. 
BATCH LOADING PERFORMANCE:") - dataset = SimpleDataset(size=1000, num_features=20, num_classes=10) - loader = DataLoader(dataset, batch_size=64, shuffle=False) - - # Time batch loading: start the clock *before* each batch is produced, - # so the measurement captures the actual loading work - batch_times = [] - batch_iter = iter(loader) - for _ in range(10): # Test first 10 batches - start_time = time.time() - try: - data, labels = next(batch_iter) - except StopIteration: - break - batch_time = time.time() - start_time - batch_times.append(batch_time) - - avg_time = sum(batch_times) / len(batch_times) - throughput = 64 / avg_time # samples per second - - print(f" Average batch time: {avg_time:.4f}s") - print(f" Throughput: {throughput:.0f} samples/second") - print(f" Range: {min(batch_times):.4f}s - {max(batch_times):.4f}s") - - # Test 2: Batch Size Scaling - print(f"\n📈 2. BATCH SIZE SCALING ANALYSIS:") - batch_sizes = [16, 32, 64, 128, 256] - scaling_results = [] - - for batch_size in batch_sizes: - loader = DataLoader(dataset, batch_size=batch_size, shuffle=False) - - # Time one batch - start_time = time.time() - data, labels = next(iter(loader)) - batch_time = time.time() - start_time - - samples_per_sec = batch_size / batch_time - scaling_results.append((batch_size, batch_time, samples_per_sec)) - print(f" Batch {batch_size:3d}: {batch_time:.4f}s ({samples_per_sec:.0f} samples/sec)") - - # Find optimal batch size - optimal = max(scaling_results, key=lambda x: x[2]) - print(f" Optimal batch size: {optimal[0]} ({optimal[2]:.0f} samples/sec)") - - # Test 3: Memory Usage Analysis - print(f"\n💾 3.
MEMORY USAGE PATTERNS:") - - # Compare small vs large datasets - small_dataset = SimpleDataset(size=100, num_features=10, num_classes=5) - large_dataset = SimpleDataset(size=10000, num_features=50, num_classes=20) - - for name, dataset in [("Small Dataset", small_dataset), ("Large Dataset", large_dataset)]: - loader = DataLoader(dataset, batch_size=32, shuffle=True) - - # Get memory footprint estimate - data, labels = next(iter(loader)) - data_memory = data.data.nbytes - labels_memory = labels.data.nbytes - total_memory = data_memory + labels_memory - - print(f" {name}:") - print(f" Batch memory: {total_memory / 1024:.1f} KB") - print(f" Data: {data_memory / 1024:.1f} KB, Labels: {labels_memory / 1024:.1f} KB") - print(f" Per sample: {total_memory / 32:.0f} bytes") - - # Test 4: I/O Strategy Comparison - print(f"\n🔀 4. I/O STRATEGY COMPARISON:") - - dataset = SimpleDataset(size=500, num_features=20, num_classes=10) - - strategies = [ - ("Sequential (no shuffle)", False), - ("Random (with shuffle)", True) - ] - - for name, shuffle in strategies: - loader = DataLoader(dataset, batch_size=50, shuffle=shuffle) - - start_time = time.time() - batch_count = 0 - for data, labels in loader: - batch_count += 1 - if batch_count >= 5: # Test first 5 batches - break - total_time = time.time() - start_time - - avg_batch_time = total_time / batch_count - print(f" {name}: {avg_batch_time:.4f}s per batch") - - print(f"\n💡 PRODUCTION INSIGHTS:") - print(f" • Larger batches improve throughput (amortize overhead)") - print(f" • Memory usage scales linearly with batch size and features") - print(f" • Shuffling adds minimal overhead for in-memory data") - print(f" • GPU utilization depends on data loading not being bottleneck") - print(f" • Real bottlenecks: disk I/O, network storage, preprocessing") - -# %% [markdown] -""" -## 🧪 Integration Test: DataLoader with Tensors -""" - -# %% -def test_module_dataloader_tensor_yield(): - """ - Integration test for the DataLoader and Tensor 
classes. - - Tests that the DataLoader correctly yields batches of Tensors. - """ - print("🔬 Running Integration Test: DataLoader with Tensors...") - - # 1. Create a simple dataset - dataset = SimpleDataset(size=50, num_features=8, num_classes=4) - - # 2. Create a DataLoader - dataloader = DataLoader(dataset, batch_size=10, shuffle=False) - - # 3. Get one batch from the dataloader - data_batch, labels_batch = next(iter(dataloader)) - - # 4. Assert the batch contents are correct - assert isinstance(data_batch, Tensor), "Data batch should be a Tensor" - assert data_batch.shape == (10, 8), f"Expected data shape (10, 8), but got {data_batch.shape}" - - assert isinstance(labels_batch, Tensor), "Labels batch should be a Tensor" - assert labels_batch.shape == (10,), f"Expected labels shape (10,), but got {labels_batch.shape}" - - print("✅ Integration Test Passed: DataLoader correctly yields batches of Tensors.") - -test_module_dataloader_tensor_yield() - -# %% [markdown] -""" -## 🤔 ML Systems Thinking: Interactive Questions - -### 1. Memory vs Performance Trade-offs -In your DataLoader implementation, you discovered that larger batch sizes generally improve throughput. When you tested batches of 16, 32, 64, and 128 samples, you likely saw increasing samples-per-second rates. - -**Analysis Question**: Your DataLoader implementation loads entire batches into memory at once. If you needed to handle a dataset with 10GB of data on a machine with only 4GB of RAM, how would you modify your current DataLoader design to support this scenario while maintaining reasonable performance? - -**Consider**: -- Memory-mapped files vs loading subsets -- Streaming vs caching strategies -- Trade-offs between memory usage and I/O efficiency - -### 2. Production Scaling Analysis -Your SimpleDataset generates synthetic data in memory, but real production systems often need to load from disk, databases, or network storage. 
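For the out-of-core scenario in question 1, one common answer is to keep the arrays on disk and map them into virtual memory, so resident RAM tracks the working set rather than the dataset size. Below is a minimal sketch with NumPy; the `MemmapDataset` class, the raw-float32 file layout, and the sizes are illustrative assumptions, not part of this module:

```python
import os
import tempfile
import numpy as np

class MemmapDataset:
    """Reads samples lazily from an on-disk float32 array (illustrative sketch)."""
    def __init__(self, path, num_samples, num_features):
        # np.memmap maps the file into virtual memory: pages are loaded on
        # demand, so RAM usage follows the working set, not the file size.
        self.data = np.memmap(path, dtype=np.float32, mode="r",
                              shape=(num_samples, num_features))
    def __len__(self):
        return self.data.shape[0]
    def __getitem__(self, idx):
        # Materialize a single row in RAM only when it is requested.
        return np.asarray(self.data[idx])

# Demo: a small temporary file stands in for a dataset larger than RAM.
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as f:
    path = f.name
writer = np.memmap(path, dtype=np.float32, mode="w+", shape=(100, 8))
writer[:] = np.arange(800, dtype=np.float32).reshape(100, 8)
writer.flush()

ds = MemmapDataset(path, num_samples=100, num_features=8)
print(len(ds), float(ds[0][0]), float(ds[99][7]))  # 100 0.0 799.0
del ds, writer
os.unlink(path)
```

A real DataLoader built on this would still batch by slicing index ranges; the only change is that `__getitem__` touches disk instead of an in-memory array.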
- -**Scaling Question**: Imagine deploying your DataLoader design to handle ImageNet (150GB of images) on a distributed training cluster with 8 GPUs. Each GPU needs different batches simultaneously, and data is stored on network-attached storage. - -**Design Challenge**: What bottlenecks would emerge in your current implementation, and how would you redesign the data loading pipeline to maximize GPU utilization across the cluster? - -**Consider**: -- Network bandwidth limitations -- Storage I/O patterns -- Data locality and caching strategies -- Prefetching and parallel loading - -### 3. Debugging Production I/O Issues -Your performance analysis showed that shuffling adds minimal overhead for in-memory data, but production systems often experience unpredictable I/O performance. - -**Engineering Question**: A production ML system using your DataLoader design suddenly experiences 10x slower training speeds, but the model code hasn't changed. The logs show DataLoader batch loading times varying from 50ms to 5 seconds randomly. - -**Root Cause Analysis**: What systematic debugging approach would you use to identify whether the bottleneck is in your DataLoader implementation, the storage system, network, or something else? What metrics would you instrument and monitor? - -**Consider**: -- I/O monitoring and profiling techniques -- Distributed system debugging approaches -- Performance regression investigation methods -""" - -# %% [markdown] -""" -## 🎯 MODULE SUMMARY: DataLoader - Efficient Data Pipeline Systems - -Congratulations! 
You've successfully implemented a comprehensive data loading system for machine learning: - -### What You've Accomplished -✅ **Dataset Interface**: Abstract base class enabling flexible data sources (500+ lines) -✅ **DataLoader Engine**: Efficient batching and iteration system with shuffling support -✅ **SimpleDataset Implementation**: Concrete dataset for synthetic data generation and testing -✅ **CIFAR-10 Integration**: Real-world computer vision dataset loading capabilities -✅ **Performance Analysis**: Comprehensive I/O pipeline profiling and optimization insights -✅ **Integration Testing**: Seamless compatibility validation with Tensor operations - -### Key Learning Outcomes -- **Data Pipeline Architecture**: Universal Dataset/DataLoader abstraction used across all ML frameworks -- **Batch Processing Systems**: Memory-efficient handling of large datasets through strategic batching -- **I/O Performance Engineering**: Understanding and measuring data loading bottlenecks in production systems -- **Memory Management**: Efficient tensor stacking and batch creation without memory explosions -- **Production Patterns**: Real-world data loading strategies for scaling ML training systems - -### Systems Understanding Achieved -- **Performance Characteristics**: Batch size scaling impacts both throughput and memory usage -- **I/O Bottleneck Analysis**: Data loading often limits training speed more than model computation -- **Memory vs Speed Trade-offs**: Larger batches improve efficiency but require more RAM -- **Shuffling Impact**: Minimal overhead for generalization benefits in training -- **Scaling Behavior**: Linear memory growth with batch size and feature dimensions - -### Professional Skills Developed -- **ML Systems Engineering**: Building robust data pipelines that handle production-scale workloads -- **Performance Profiling**: Measuring and optimizing I/O performance for training efficiency -- **API Design**: Clean, extensible interfaces following 
industry-standard patterns -- **Integration Architecture**: Seamless compatibility with tensor operations and neural networks - -### Ready for Advanced Applications -Your DataLoader implementation now enables: -- **Large-scale Training**: Processing datasets larger than available memory -- **Real-time Inference**: Efficient batch processing for production model serving -- **Multi-modal Data**: Support for images, text, and structured data through consistent interfaces -- **Distributed Training**: Foundation for multi-GPU and multi-node data loading strategies - -### Connection to Real ML Systems -Your implementations mirror production frameworks: -- **PyTorch**: `torch.utils.data.DataLoader` uses identical batching and iteration patterns -- **TensorFlow**: `tf.data.Dataset` implements the same universal dataset abstraction -- **Industry Standard**: Every major ML framework builds on these exact design patterns - -### Next Steps -1. **Export your module**: `tito module complete 09_dataloader` -2. **Validate integration**: All components work together for complete ML pipelines -3. **Ready for Module 10**: Training loops that will use your data loading infrastructure -4. **Production Deployment**: Scale to real datasets and distributed training scenarios - -**🚀 Achievement Unlocked**: You've built production-quality data loading infrastructure that powers real ML training systems! 
-""" - -def test_module(): - """Run all module tests systematically.""" - print("🧪 RUNNING MODULE 09 TESTS") - print("=" * 50) - - try: - # Run all unit tests - test_unit_dataset_interface() - test_unit_dataloader() - test_unit_simple_dataset() - test_unit_dataloader_pipeline() - test_module_dataloader_tensor_yield() - - # Run systems analysis - analyze_dataloader_performance() - - print("\n✅ ALL MODULE TESTS PASSED!") - print("🎯 DataLoader module implementation complete!") - - except Exception as e: - print(f"\n❌ MODULE TEST FAILED: {e}") - raise - -if __name__ == "__main__": - test_module() \ No newline at end of file diff --git a/modules_old/09_dataloader/module.yaml b/modules_old/09_dataloader/module.yaml deleted file mode 100644 index d9aecada..00000000 --- a/modules_old/09_dataloader/module.yaml +++ /dev/null @@ -1,22 +0,0 @@ -components: -- Dataset -- DataLoader -- SimpleDataset -dependencies: - enables: - - training - - dense - - spatial - - attention - prerequisites: - - tensor -description: Dataset interfaces and data loading pipelines -difficulty: "\u2B50\u2B50\u2B50" -exports_to: tinytorch.core.dataloader -files: - dev_file: dataloader_dev.py - readme: README.md - tests: inline -name: dataloader -time_estimate: 5-6 hours -title: DataLoader diff --git a/modules_old/10_tokenization/README.md b/modules_old/10_tokenization/README.md deleted file mode 100644 index 2f7f8565..00000000 --- a/modules_old/10_tokenization/README.md +++ /dev/null @@ -1,93 +0,0 @@ -# Module 11: Tokenization - Text Processing for Language Models - -## Overview -This module implements the fundamental text processing systems that convert raw text into numerical sequences that neural networks can understand. You'll build character-level and subword tokenizers from scratch, understanding the critical trade-offs between vocabulary size and sequence length that affect model performance. 
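The vocabulary-size versus sequence-length trade-off mentioned above can be made concrete with a little arithmetic. The numbers below (256-character vs 50k-subword vocabularies, a 512-dim float32 embedding table) are illustrative assumptions, not measurements from this module:

```python
def tokenization_costs(vocab_size, seq_len, embed_dim=512, bytes_per_param=4):
    """Rough cost model: embedding-table memory vs attention work."""
    embed_bytes = vocab_size * embed_dim * bytes_per_param  # float32 table
    attn_cells = seq_len ** 2  # attention scores grow quadratically in length
    return embed_bytes, attn_cells

# The same short sentence tokenized two ways:
# many short tokens (character) vs few long tokens (subword).
for name, vocab, seq in [("character", 256, 24), ("BPE-50k", 50_000, 4)]:
    emb, attn = tokenization_costs(vocab, seq)
    print(f"{name:>9}: embedding table {emb / 1e6:.1f} MB, attention cells {attn}")
```

The character vocabulary costs almost nothing in embedding memory but pays quadratically in attention; the large subword vocabulary inverts that balance.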
- -## What You'll Learn - -### Core Implementations -- **Character Tokenizer**: Simple character-level tokenization with special tokens -- **BPE Tokenizer**: Byte Pair Encoding for efficient subword units -- **Vocabulary Management**: Bidirectional mappings between text and indices -- **Padding & Truncation**: Batch processing utilities for uniform sequences - -### ML Systems Concepts -- **Memory Efficiency**: How vocabulary size affects model parameters -- **Performance Optimization**: Tokenization throughput and caching strategies -- **Scaling Trade-offs**: Vocabulary size vs sequence length vs compute -- **Production Patterns**: Efficient text processing for large-scale systems - -### Performance Engineering -- **Tokenization Profiling**: Measuring speed and memory usage -- **Cache Optimization**: Reducing repeated tokenization overhead -- **Batch Processing**: Efficient handling of multiple texts -- **Scaling Analysis**: Understanding performance with large texts - -## Key Learning Outcomes - -By completing this module, you'll understand: - -1. **Text-to-Numbers Pipeline**: How raw text becomes neural network input -2. **Tokenization Strategies**: Character vs subword vs word-level approaches -3. **Systems Trade-offs**: Vocabulary size impacts on memory and compute -4. **Performance Engineering**: Optimizing text processing for production -5. 
**Language Model Foundation**: How tokenization affects model capabilities - -## Files in This Module - -- `tokenization_dev.py` - Main implementation file with all tokenizers -- `tokenization_dev.ipynb` - Jupyter notebook (auto-generated) -- `module.yaml` - Module configuration and metadata -- `README.md` - This documentation file - -## Usage Example - -```python -from tinytorch.core.tokenization import CharTokenizer, BPETokenizer - -# Character-level tokenization -char_tokenizer = CharTokenizer() -tokens = char_tokenizer.encode("Hello world!") -text = char_tokenizer.decode(tokens) - -# BPE tokenization -bpe_tokenizer = BPETokenizer(vocab_size=1000) -bpe_tokenizer.train(["Hello world", "World hello", "Hello hello world"]) -tokens = bpe_tokenizer.encode("Hello world!") -``` - -## Integration with TinyTorch - -This module exports to `tinytorch.core.tokenization` and provides the text processing foundation for: -- **Embedding layers** (Module 12) - Converting tokens to vectors -- **Language models** (Module 14+) - Processing text sequences -- **Training pipelines** - Efficient batch text processing - -## Systems Engineering Focus - -This module emphasizes the systems engineering aspects of tokenization: - -### Performance Characteristics -- **Character tokenization**: Small vocab (~256), long sequences -- **BPE tokenization**: Medium vocab (~50k), shorter sequences -- **Memory scaling**: O(vocab_size × embedding_dim) for embedding tables -- **Attention scaling**: O(sequence_length²) for transformer models - -### Production Considerations -- Tokenization can become a bottleneck in training pipelines -- Efficient string processing is critical for high-throughput systems -- Caching strategies provide significant speedups for repeated texts -- Vocabulary size affects model download size and memory usage - -## Prerequisites -- Module 02: Tensor (for basic data structures) -- Understanding of string processing and algorithms - -## Estimated Time -4-5 hours including 
implementation, testing, and analysis - -## Next Steps -After completing this module, you'll be ready for: -- **Module 12: Embeddings** - Converting tokens to dense vector representations -- **Module 13: Attention** - Processing sequences with attention mechanisms -- **Module 14: Transformers** - Complete language model architectures \ No newline at end of file diff --git a/modules_old/10_tokenization/module.yaml b/modules_old/10_tokenization/module.yaml deleted file mode 100644 index 47165958..00000000 --- a/modules_old/10_tokenization/module.yaml +++ /dev/null @@ -1,28 +0,0 @@ -description: Text processing systems that convert raw text into numerical sequences - for language models -estimated_time: 4-5 hours -exports: -- CharTokenizer -- BPETokenizer -- TokenizationProfiler -- OptimizedTokenizer -learning_objectives: -- Implement character-level tokenization with special token handling -- Build BPE (Byte Pair Encoding) tokenizer for subword units -- 'Understand tokenization trade-offs: vocabulary size vs sequence length' -- Optimize tokenization performance for production systems -- Analyze how tokenization affects model memory and training efficiency -ml_systems_focus: Text processing pipelines, tokenization throughput, memory-efficient - vocabulary management -name: Tokenization -next_modules: -- 12_embeddings -number: 11 -prerequisites: -- 02_tensor -systems_concepts: -- Memory efficiency of token representations -- Vocabulary size vs model size tradeoffs -- Tokenization throughput optimization -- String processing performance -- Cache-friendly text processing patterns diff --git a/modules_old/10_tokenization/tokenization_dev.py b/modules_old/10_tokenization/tokenization_dev.py deleted file mode 100644 index f5e9f595..00000000 --- a/modules_old/10_tokenization/tokenization_dev.py +++ /dev/null @@ -1,2011 +0,0 @@ -# --- -# jupyter: -# jupytext: -# text_representation: -# extension: .py -# format_name: percent -# format_version: '1.3' -# jupytext_version: 1.17.1 
-# --- - -# %% [markdown] -""" -# Tokenization - Text Processing for Language Models - -Welcome to the Tokenization module! You'll implement the fundamental text processing systems that convert raw text into numerical sequences that neural networks can understand. - -## Learning Goals -- Systems understanding: How tokenization affects model performance, memory usage, and computational efficiency -- Core implementation skill: Build character and subword tokenizers from scratch -- Pattern recognition: Understand how tokenization choices impact model capacity and training dynamics -- Framework connection: See how your implementations match production tokenization systems -- Performance insight: Learn how tokenization throughput affects training pipeline efficiency - -## Build -> Use -> Reflect -1. **Build**: Character tokenizer and basic BPE (Byte Pair Encoding) implementation -2. **Use**: Process real text and observe how different tokenization strategies affect sequence length -3. **Reflect**: How does tokenization choice determine model efficiency and language understanding? 
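Before building real tokenizers, the core idea - text in, integers out, and back - fits in a few lines, using raw code points as a stand-in for a learned vocabulary:

```python
text = "Hi!"
ids = [ord(c) for c in text]               # encode: characters -> integers
round_trip = "".join(chr(i) for i in ids)  # decode: integers -> characters
print(ids, round_trip)  # [72, 105, 33] Hi!
```

Real tokenizers differ only in *which* integer each piece of text maps to and how the pieces are chosen.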
- -## What You'll Achieve -By the end of this module, you'll understand: -- Deep technical understanding of how text becomes numbers that models can process -- Practical capability to implement tokenizers that handle real text data efficiently -- Systems insight into how vocabulary size affects memory usage and model performance -- Performance consideration of how tokenization speed affects overall training throughput -- Connection to production systems like GPT's tokenizers and their design trade-offs - -## Systems Reality Check -TIP **Production Context**: Modern language models use sophisticated tokenizers (GPT's tiktoken, SentencePiece) - your implementation reveals the algorithmic foundations -SPEED **Performance Note**: Tokenization can become a bottleneck in training pipelines - efficient string processing is critical for high-throughput training -""" - -# %% nbgrader={"grade": false, "grade_id": "tokenization-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} -#| default_exp core.tokenization - -#| export -import os -import sys -import re -import json -from typing import List, Dict, Tuple, Optional, Union -from collections import Counter, defaultdict - -# Import our Tensor class - try from package first, then from local module -try: - from tinytorch.core.tensor import Tensor -except ImportError: - # For development, import from local tensor module - sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor')) - from tensor_dev import Tensor - -# %% nbgrader={"grade": false, "grade_id": "tokenization-welcome", "locked": false, "schema_version": 3, "solution": false, "task": false} -print("🔤 TinyTorch Tokenization Module") -print("Ready to build text processing systems!") - -# %% [markdown] -""" -## PACKAGE Where This Code Lives in the Final Package - -**Learning Side:** You work in `modules/source/11_tokenization/tokenization_dev.py` -**Building Side:** Code exports to `tinytorch.core.tokenization` - -```python 
-# Final package structure: -from tinytorch.core.tokenization import CharTokenizer, BPETokenizer -from tinytorch.core.tensor import Tensor # Foundation -from tinytorch.core.embeddings import Embedding # Next module -``` - -**Why this matters:** -- **Learning:** Focused modules for deep understanding -- **Production:** Proper organization like Hugging Face's tokenizers -- **Consistency:** All tokenization tools live together in `core.tokenization` -- **Integration:** Works seamlessly with embeddings and language models -""" - -# %% [markdown] -""" -## What is Tokenization? - -### The Problem: Text to Numbers -Neural networks work with numbers, but we want to process text: - -``` -"Hello world!" -> [15496, 995, 0] # Numbers the model can understand -``` - -### 🔤 Visual Tokenization Flow -``` -Raw Text -> Tokenization Strategy -> Token IDs -> Neural Network Input - - "Hello world!" - v -+-------------------------+ -| Tokenization Process | -| +---------------------+| -| | Split into tokens || -| +---------------------+| -| v | -| +---------------------+| -| | Map to vocabulary || -| +---------------------+| -+-------------------------+ - v - [15496, 995, 0] - v - Neural Network -``` - -### 📊 Tokenization Strategy Comparison -``` -Strategy | Vocab Size | Sequence Length | Use Case ---------------+------------+-----------------+----------------- -Character | ~256 | Long | Simple/Debug -Subword (BPE) | ~50,000 | Medium | Production -Word-level | ~100,000+ | Short | Specialized -``` - -### TARGET Systems Trade-offs Visualization -``` - Memory Usage Impact - v - +-------------------------+ - | Vocabulary Size |---> Embedding Table Memory - | | vocab_size * embed_dim * 4 bytes - +-------------------------+ - v - +-------------------------+ - | Sequence Length |---> Attention Memory - | | O(sequence_length²) - +-------------------------+ - v - +-------------------------+ - | Tokenization Speed |---> Training Throughput - | | tokens/second pipeline - 
+-------------------------+ - -Key Insight: Tokenization choices create cascading effects throughout ML systems! -``` - -### MAGNIFY Character vs Subword vs Word Example -``` -Input: "The tokenization process" - -Character-level: -['T','h','e',' ','t','o','k','e','n','i','z','a','t','i','o','n',' ','p','r','o','c','e','s','s'] -v (24 tokens, vocab ~256) - -Subword (BPE): -['The', 'token', 'ization', 'process'] -v (4 tokens, vocab ~50k) - -Word-level: -['The', 'tokenization', 'process'] -v (3 tokens, vocab ~100k+) - -Trade-off: Smaller vocab = Longer sequences = More computation - Larger vocab = More parameters = More memory -``` -""" - -# %% [markdown] -""" -## Character Tokenizer Implementation - -Let's start with the simplest tokenizer: character-level. Every character becomes a token. -""" - -# %% nbgrader={"grade": false, "grade_id": "char-tokenizer", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class CharTokenizer: - """ - Character-level tokenizer that converts text to character tokens. - - Simple but effective for understanding tokenization fundamentals. - Used in character-level language models and as baseline for comparison. - """ - - def __init__(self, special_tokens: Optional[Dict[str, int]] = None): - """ - Initialize character tokenizer with optional special tokens. - - STEP-BY-STEP IMPLEMENTATION: - 1. Initialize character-to-index and index-to-character mappings - 2. Add standard special tokens (PAD, UNK, BOS, EOS) - 3. Build vocabulary from printable ASCII characters - 4. 
Add any additional special tokens provided - - DESIGN DECISIONS: - - Use ASCII characters (32-126) for basic English text - - Reserve indices 0-3 for special tokens - - Build bidirectional mappings for efficiency - - Args: - special_tokens: Optional dict of special token name -> index - """ - ### BEGIN SOLUTION - # Initialize mappings - self.char_to_idx = {} - self.idx_to_char = {} - self.vocab_size = 0 - - # Standard special tokens - default_special = { - '<PAD>': 0, # Padding token - '<UNK>': 1, # Unknown token - '<BOS>': 2, # Beginning of sequence - '<EOS>': 3 # End of sequence - } - - # Merge with user-provided special tokens - if special_tokens is None: - special_tokens = {} - all_special = {**default_special, **special_tokens} - - # Add special tokens first - for token, idx in all_special.items(): - self.char_to_idx[token] = idx - self.idx_to_char[idx] = token - self.vocab_size = max(self.vocab_size, idx + 1) - - # Add printable ASCII characters (space to ~) - next_idx = self.vocab_size - for i in range(32, 127): # ASCII printable characters - char = chr(i) - if char not in self.char_to_idx: - self.char_to_idx[char] = next_idx - self.idx_to_char[next_idx] = char - next_idx += 1 - - self.vocab_size = next_idx - ### END SOLUTION - - def encode(self, text: str, add_special_tokens: bool = True) -> List[int]: - """ - Convert text to list of token indices. - - TODO: Implement text encoding. - - STEP-BY-STEP IMPLEMENTATION: - 1. Optionally add beginning-of-sequence token - 2. Convert each character to its index - 3. Handle unknown characters with UNK token - 4. Optionally add end-of-sequence token - 5.
Return list of integers - - EXAMPLE: - tokenizer = CharTokenizer() - tokens = tokenizer.encode("Hi!") - # Returns: [2, 44, 77, 5, 3] (BOS, H, i, !, EOS) - - Args: - text: Input text string - add_special_tokens: Whether to add BOS/EOS tokens - - Returns: - List of token indices - """ - ### BEGIN SOLUTION - tokens = [] - - # Add beginning of sequence token - if add_special_tokens: - tokens.append(self.char_to_idx['<BOS>']) - - # Convert each character - for char in text: - if char in self.char_to_idx: - tokens.append(self.char_to_idx[char]) - else: - # Unknown character - use UNK token - tokens.append(self.char_to_idx['<UNK>']) - - # Add end of sequence token - if add_special_tokens: - tokens.append(self.char_to_idx['<EOS>']) - - return tokens - ### END SOLUTION - - def decode(self, tokens: List[int], skip_special_tokens: bool = True) -> str: - """ - Convert list of token indices back to text. - - TODO: Implement token decoding. - - STEP-BY-STEP IMPLEMENTATION: - 1. Convert each token index to its character - 2. Optionally skip special tokens (PAD, UNK, BOS, EOS) - 3. Join characters into string - 4. Return decoded text - - EXAMPLE: - tokenizer = CharTokenizer() - text = tokenizer.decode([2, 44, 77, 5, 3]) - # Returns: "Hi!" (BOS and EOS removed) - - Args: - tokens: List of token indices - skip_special_tokens: Whether to exclude special tokens - - Returns: - Decoded text string - """ - ### BEGIN SOLUTION - special_tokens = {'<PAD>', '<UNK>', '<BOS>', '<EOS>'} - chars = [] - - for token_idx in tokens: - if token_idx in self.idx_to_char: - char = self.idx_to_char[token_idx] - # Skip special tokens if requested - if skip_special_tokens and char in special_tokens: - continue - chars.append(char) - else: - # Unknown token index - if not skip_special_tokens: - chars.append('<UNK>') - - return ''.join(chars) - ### END SOLUTION - - def pad_sequences(self, sequences: List[List[int]], max_length: Optional[int] = None) -> List[List[int]]: - """ - Pad sequences to uniform length for batch processing.
- - This function is PROVIDED to show padding implementation. - Essential for creating batches of text data. - """ - if not sequences: - return [] - - if max_length is None: - max_length = max(len(seq) for seq in sequences) - - pad_token = self.char_to_idx['<PAD>'] - padded = [] - - for sequence in sequences: - if len(sequence) >= max_length: - # Truncate if too long - padded.append(sequence[:max_length]) - else: - # Pad if too short - padding_needed = max_length - len(sequence) - padded_sequence = sequence + [pad_token] * padding_needed - padded.append(padded_sequence) - - return padded - -# %% [markdown] -""" -### 🧪 Test Your Character Tokenizer Implementation - -Once you implement the CharTokenizer encode and decode methods above, run this cell to test it: -""" - -# %% nbgrader={"grade": true, "grade_id": "test-char-tokenizer-immediate", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false} -def test_unit_char_tokenizer(): - """Unit test for the character tokenizer.""" - print("🔬 Unit Test: Character Tokenizer...") - - # Create tokenizer - tokenizer = CharTokenizer() - - # Test basic encoding - text = "Hi!"
- tokens = tokenizer.encode(text, add_special_tokens=False) - expected_chars = ['H', 'i', '!'] - - assert len(tokens) == len(expected_chars), f"Expected {len(expected_chars)} tokens, got {len(tokens)}" - - # Test decoding - decoded = tokenizer.decode(tokens, skip_special_tokens=True) - assert decoded == text, f"Expected '{text}', got '{decoded}'" - - # Test with special tokens - tokens_with_special = tokenizer.encode(text, add_special_tokens=True) - assert len(tokens_with_special) == len(tokens) + 2, "Should add BOS and EOS tokens" - assert tokens_with_special[0] == tokenizer.char_to_idx['<BOS>'], "First token should be BOS" - assert tokens_with_special[-1] == tokenizer.char_to_idx['<EOS>'], "Last token should be EOS" - - # Test vocabulary size (4 special + 95 ASCII = 99 total) - assert tokenizer.vocab_size >= 99, "Should have at least 99 tokens (4 special + 95 ASCII)" - - # Test unknown character handling - unknown_tokens = tokenizer.encode("🚀", add_special_tokens=False) # Emoji not in ASCII - assert unknown_tokens[0] == tokenizer.char_to_idx['<UNK>'], "Should use UNK token for unknown chars" - - # Test padding - sequences = [[1, 2, 3], [4, 5]] - padded = tokenizer.pad_sequences(sequences, max_length=4) - assert len(padded[0]) == 4, "First sequence should be padded to length 4" - assert len(padded[1]) == 4, "Second sequence should be padded to length 4" - assert padded[1][-1] == tokenizer.char_to_idx['<PAD>'], "Should use PAD token for padding" - - print("PASS Character tokenizer tests passed!") - print(f"PASS Vocabulary size: {tokenizer.vocab_size}") - print(f"PASS Encode/decode cycle works correctly") - print(f"PASS Special tokens handled properly") - print(f"PASS Padding functionality works") - -# Test function defined (called in main block) - -# %% [markdown] -""" -## Basic BPE (Byte Pair Encoding) Tokenizer - -Now let's implement a simplified version of BPE, the subword tokenization algorithm used in GPT and many modern language models.
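Before reading the full class below, the whole training loop fits in miniature: count adjacent pairs, then merge the winner everywhere. This is a standalone sketch (independent of the `BPETokenizer` implemented next), using `</w>` as the end-of-word marker:

```python
from collections import Counter

def bpe_step(words):
    """One BPE training iteration: merge the most frequent adjacent pair."""
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    (a, b), _ = pairs.most_common(1)[0]  # winning pair
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            # Greedily replace occurrences of the winning pair, left to right.
            if i + 1 < len(w) and (w[i], w[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(out)
    return (a, b), merged

corpus = [["a", "b", "</w>"], ["a", "b", "</w>"], ["a", "b", "c", "</w>"]]
pair, corpus = bpe_step(corpus)
print(pair, corpus[0])  # ('a', 'b') ['ab', '</w>']
```

Repeating `bpe_step` until the vocabulary reaches its target size is exactly what the `train` loop of a BPE tokenizer does.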
- -### 🧩 BPE Algorithm Visualization -``` -Step 1: Start with characters -"hello" -> ['h', 'e', 'l', 'l', 'o', '</w>'] - -Step 2: Count adjacent pairs -('l', 'l'): 1 occurrence <- Selected pair (all pairs tie at 1 here) - -Step 3: Merge most frequent pair -['h', 'e', 'l', 'l', 'o', '</w>'] -> ['h', 'e', 'll', 'o', '</w>'] - -Step 4: Repeat until vocabulary target reached -Next iteration might merge ('e', 'll') -> 'ell' if frequent enough - -BPE Training Process: -+-----------------+ +-----------------+ +-----------------+ -| Character Vocab | ---> | Count Pairs | ---> | Merge Most | -| a, b, c, d... | | (a,b): 5 | | Frequent Pair | -+-----------------+ | (c,d): 3 | | (a,b) -> ab | - ^ | (e,f): 1 | +-----------------+ - | +-----------------+ | - | | - +------------------- Repeat Until Target <---------+ -``` - -### PROGRESS BPE Learning Process Example -``` -Initial: "hello" = ['h', 'e', 'l', 'l', 'o', '</w>'] - -Iteration 1: - Pairs: (h,e):1, (e,l):1, (l,l):1, (l,o):1, (o,</w>):1 - Merge: (l,l) -> 'll' - Result: ['h', 'e', 'll', 'o', '</w>'] - -Iteration 2: - Pairs: (h,e):1, (e,ll):1, (ll,o):1, (o,</w>):1 - Merge: Most frequent (if any occur >1 time) - Continue until vocab_size reached... - -Key Insight: BPE learns common subword patterns from data! -``` - -### TARGET BPE Benefits -``` -Traditional Tokenization Problems: -FAIL "unhappiness" -> UNK (unknown word) -FAIL "supercalifragilisticexpialidocious" -> UNK - -BPE Solution: -PASS "unhappiness" -> ['un', 'happy', 'ness'] (recognizable parts) -PASS "supercali..." -> ['super', 'cal', 'i', 'frag', ...]
(graceful degradation) - -Memory Efficiency: -Character: 26 vocab * 512 embed_dim = 13,312 parameters -BPE-50k: 50,000 vocab * 512 embed_dim = 25,600,000 parameters -Trade-off: More parameters, shorter sequences (faster attention) -``` -""" - -# %% nbgrader={"grade": false, "grade_id": "bpe-tokenizer", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class BPETokenizer: - """ - Basic Byte Pair Encoding (BPE) tokenizer implementation. - - Learns subword units by iteratively merging the most frequent - character pairs. This creates a vocabulary that balances - sequence length and vocabulary size. - """ - - def __init__(self, vocab_size: int = 1000): - """ - Initialize BPE tokenizer. - - Args: - vocab_size: Target vocabulary size (includes special tokens) - """ - self.vocab_size = vocab_size - self.char_to_idx = {} - self.idx_to_char = {} - self.merges = [] # List of (pair, new_token) merges learned during training - self.trained = False - - # Initialize with special tokens - special_tokens = ['<PAD>', '<UNK>', '<BOS>', '<EOS>'] - for i, token in enumerate(special_tokens): - self.char_to_idx[token] = i - self.idx_to_char[i] = token - - def _get_word_tokens(self, text: str) -> List[List[str]]: - """ - Convert text to list of words, where each word is a list of characters. - - This function is PROVIDED to handle text preprocessing. - """ - # Simple whitespace tokenization, then character splitting - words = text.lower().split() - word_tokens = [] - - for word in words: - # Add end-of-word marker to distinguish word boundaries - word_chars = list(word) + ['</w>'] - word_tokens.append(word_chars) - - return word_tokens - - def _get_pair_counts(self, word_tokens: List[List[str]]) -> Dict[Tuple[str, str], int]: - """ - Count frequency of adjacent token pairs. - - TODO: Implement pair counting for BPE merge selection. - - STEP-BY-STEP IMPLEMENTATION: - 1. Initialize empty count dictionary - 2.
For each word (list of tokens): - - For each adjacent pair of tokens - - Count how many times this pair appears - 3. Return dictionary of (token1, token2) -> count - - EXAMPLE: - word_tokens = [['h', 'e', 'l', 'l', 'o', '</w>'], ['h', 'i', '</w>']] - pairs = _get_pair_counts(word_tokens) - # Returns: {('h', 'e'): 1, ('e', 'l'): 1, ('l', 'l'): 1, ('l', 'o'): 1, ('o', '</w>'): 1, ('h', 'i'): 1, ('i', '</w>'): 1} - - ALGORITHM INSIGHT: - This is the core of BPE learning - we find the most frequent adjacent pairs - to merge. High-frequency pairs indicate common subword patterns in the language. - - Args: - word_tokens: List of words, each word is list of tokens - - Returns: - Dictionary mapping token pairs to their counts - """ - ### BEGIN SOLUTION - # Use defaultdict for efficient counting - avoids key existence checks - pair_counts = defaultdict(int) - - # Iterate through all words in the corpus - for word in word_tokens: - # Count adjacent pairs in this word - # Range(len(word) - 1) ensures we don't go out of bounds - for i in range(len(word) - 1): - pair = (word[i], word[i + 1]) # Create tuple for dictionary key - pair_counts[pair] += 1 # Increment count for this pair - - # Convert to regular dict for consistent return type - return dict(pair_counts) - ### END SOLUTION - - def _merge_pair(self, word_tokens: List[List[str]], pair: Tuple[str, str], new_token: str) -> List[List[str]]: - """ - Replace all occurrences of a token pair with a new merged token. - - TODO: Implement pair merging for BPE vocabulary building. - - STEP-BY-STEP IMPLEMENTATION: - 1. Create new list to store updated words - 2. For each word: - - Scan through tokens looking for the target pair - - When found, replace pair with new_token - - Continue until no more pairs in this word - 3.
Return updated word tokens - - EXAMPLE: - word_tokens = [['h', 'e', 'l', 'l', 'o', '</w>']] - pair = ('l', 'l') - new_token = 'll' - result = _merge_pair(word_tokens, pair, new_token) - # Returns: [['h', 'e', 'll', 'o', '</w>']] - - EFFICIENCY NOTE: - This operation is performed many times during BPE training. Each merge - creates a more compact representation, trading vocabulary size for sequence length. - - Args: - word_tokens: List of words (each word is list of tokens) - pair: The token pair to merge - new_token: The new token to replace the pair - - Returns: - Updated word tokens with pairs merged - """ - ### BEGIN SOLUTION - updated_words = [] - - # Process each word independently - for word in word_tokens: - new_word = [] - i = 0 - - # Scan through word looking for target pair - while i < len(word): - # Check if current position has the target pair - # Must check bounds to avoid index errors - if (i < len(word) - 1 and - word[i] == pair[0] and - word[i + 1] == pair[1]): - # Found the pair - replace with merged token - new_word.append(new_token) - i += 2 # Skip both tokens in the pair (important!) - else: - # No pair match - keep current token unchanged - new_word.append(word[i]) - i += 1 # Move to next token - - # Add processed word to results - updated_words.append(new_word) - - return updated_words - ### END SOLUTION - - def train(self, texts: List[str]) -> None: - """ - Train BPE tokenizer on a corpus of texts. - - This function is PROVIDED to show the complete BPE training algorithm. - Students implement the helper functions above.
- """ - print(f"Training BPE tokenizer (target vocab size: {self.vocab_size})...") - - # Step 1: Convert texts to word tokens (character level initially) - all_word_tokens = [] - for text in texts: - word_tokens = self._get_word_tokens(text) - all_word_tokens.extend(word_tokens) - - # Step 2: Build initial character vocabulary - all_chars = set() - for word in all_word_tokens: - all_chars.update(word) - - # Add characters to vocabulary (after special tokens) - next_idx = len(self.char_to_idx) - for char in sorted(all_chars): - if char not in self.char_to_idx: - self.char_to_idx[char] = next_idx - self.idx_to_char[next_idx] = char - next_idx += 1 - - # Step 3: Iteratively merge most frequent pairs - current_word_tokens = all_word_tokens - - while len(self.char_to_idx) < self.vocab_size: - # Count all adjacent pairs - pair_counts = self._get_pair_counts(current_word_tokens) - - if not pair_counts: - print("No more pairs to merge!") - break - - # Find most frequent pair - most_frequent_pair = max(pair_counts, key=pair_counts.get) - most_frequent_count = pair_counts[most_frequent_pair] - - if most_frequent_count < 2: - print("No pairs occur more than once - stopping merge process") - break - - # Create new merged token - new_token = most_frequent_pair[0] + most_frequent_pair[1] - - # Add to vocabulary - self.char_to_idx[new_token] = len(self.char_to_idx) - self.idx_to_char[len(self.idx_to_char)] = new_token - - # Record this merge for later encoding - self.merges.append((most_frequent_pair, new_token)) - - # Apply merge to all words - current_word_tokens = self._merge_pair(current_word_tokens, most_frequent_pair, new_token) - - if len(self.char_to_idx) % 100 == 0: - print(f" Vocabulary size: {len(self.char_to_idx)}, Last merge: {most_frequent_pair} -> '{new_token}' (count: {most_frequent_count})") - - self.trained = True - print(f"Training complete! 
Final vocabulary size: {len(self.char_to_idx)}") - print(f"Learned {len(self.merges)} merges") - - def encode(self, text: str, add_special_tokens: bool = True) -> List[int]: - """ - Encode text using trained BPE tokenizer. - - This function is PROVIDED to show BPE encoding process. - """ - if not self.trained: - raise ValueError("Tokenizer must be trained before encoding!") - - # Convert to word tokens (character level initially) - word_tokens = self._get_word_tokens(text) - - # Apply all learned merges in order - for pair, new_token in self.merges: - word_tokens = self._merge_pair(word_tokens, pair, new_token) - - # Convert tokens to indices - tokens = [] - if add_special_tokens: - tokens.append(self.char_to_idx['<BOS>']) - - for word in word_tokens: - for token in word: - if token in self.char_to_idx: - tokens.append(self.char_to_idx[token]) - else: - tokens.append(self.char_to_idx['<UNK>']) - - if add_special_tokens: - tokens.append(self.char_to_idx['<EOS>']) - - return tokens - - def decode(self, tokens: List[int], skip_special_tokens: bool = True) -> str: - """ - Decode tokens back to text. - - This function is PROVIDED to show BPE decoding process.
- """ - special_tokens = {'', '', '', ''} - token_strings = [] - - for token_idx in tokens: - if token_idx in self.idx_to_char: - token_str = self.idx_to_char[token_idx] - if skip_special_tokens and token_str in special_tokens: - continue - token_strings.append(token_str) - - # Join tokens and handle word boundaries - result = ''.join(token_strings) - result = result.replace('', ' ') # Replace end-of-word markers with spaces - - return result.strip() - -# %% [markdown] -""" -### TEST Test Your BPE Implementation - -Once you implement the BPE helper methods above, run this cell to test it: -""" - -# %% nbgrader={"grade": true, "grade_id": "test-bpe-tokenizer-immediate", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false} -def test_unit_bpe_tokenizer(): - """Unit test for the BPE tokenizer.""" - print("🔬 Unit Test: BPE Tokenizer...") - - # Create BPE tokenizer - bpe = BPETokenizer(vocab_size=50) # Small vocab for testing - - # Test training data - training_texts = [ - "hello world hello", - "world hello world", - "hello hello world world" - ] - - # Test training - bpe.train(training_texts) - - # Verify training completed - assert bpe.trained, "Tokenizer should be marked as trained" - assert len(bpe.char_to_idx) >= 10, "Should have reasonable vocabulary size" - assert len(bpe.merges) > 0, "Should have learned some merges" - - # Test encoding - test_text = "hello world" - tokens = bpe.encode(test_text, add_special_tokens=False) - assert len(tokens) > 0, "Should produce some tokens" - assert all(isinstance(t, int) for t in tokens), "All tokens should be integers" - - # Test decoding - decoded = bpe.decode(tokens, skip_special_tokens=True) - # Should be similar to original (might have different spacing due to markers) - assert "hello" in decoded.lower(), "Should contain 'hello'" - assert "world" in decoded.lower(), "Should contain 'world'" - - # Test with special tokens - tokens_with_special = bpe.encode(test_text, 
add_special_tokens=True) - assert len(tokens_with_special) == len(tokens) + 2, "Should add BOS and EOS" - assert tokens_with_special[0] == bpe.char_to_idx['<BOS>'], "First should be BOS" - assert tokens_with_special[-1] == bpe.char_to_idx['<EOS>'], "Last should be EOS" - - # Test helper functions - word_tokens = [['h', 'e', 'l', 'l', 'o']] - pair_counts = bpe._get_pair_counts(word_tokens) - assert ('l', 'l') in pair_counts, "Should find the 'll' pair" - assert pair_counts[('l', 'l')] == 1, "Should count 'll' pair once" - - # Test merge function - merged = bpe._merge_pair(word_tokens, ('l', 'l'), 'll') - assert 'll' in merged[0], "Should contain merged token 'll'" - # After merging 'll' from ['h', 'e', 'l', 'l', 'o'], we get ['h', 'e', 'll', 'o'] - # Count individual 'l' characters - should be 0 since they were merged into 'll' - individual_l_count = sum(1 for token in merged[0] if token == 'l') - assert individual_l_count == 0, f"Should have no individual 'l' tokens after merge, got {individual_l_count}" - - print("PASS BPE tokenizer tests passed!") - print(f"PASS Trained vocabulary size: {len(bpe.char_to_idx)}") - print(f"PASS Learned {len(bpe.merges)} merges") - print(f"PASS Encode/decode cycle works") - -# Test function defined (called in main block) - -# %% [markdown] -""" -## TARGET ML Systems: Performance Analysis & Tokenization Efficiency - -Now let's develop systems engineering skills by analyzing tokenization performance and understanding how tokenization choices affect downstream ML system efficiency. - -### **Learning Outcome**: *"I understand how tokenization affects model memory, training speed, and language understanding"* - -### MAGNIFY Systems Insights Functions - -The next few implementations include **executable analysis functions** that help you discover key insights about tokenization performance and memory scaling. These aren't just code - they're interactive learning tools that reveal how tokenization choices affect real ML systems.
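As a warm-up for the profiler you are about to build, throughput measurement reduces to a few counters around a timed loop. Here is a minimal sketch; the lambda is a stand-in for a real tokenizer's `encode` and simply emits one token per character, so its compression ratio is exactly 1.0:

```python
import time

def throughput(encode, texts):
    """Time encode() over a batch and derive simple throughput metrics."""
    start = time.perf_counter()
    total_tokens = sum(len(encode(t)) for t in texts)
    elapsed = time.perf_counter() - start
    total_chars = sum(len(t) for t in texts)
    return {
        "texts_per_sec": len(texts) / elapsed,
        "chars_per_sec": total_chars / elapsed,
        "compression": total_chars / total_tokens,  # chars per token
    }

# Character-level stand-in: one token per character, so compression is 1.0.
stats = throughput(lambda t: list(t), ["hello world", "tokenizer speed"] * 100)
print(f"{stats['compression']:.2f} chars/token")
```

The profiler class below adds token counts, averages, and per-tokenizer reporting on top of this same pattern.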
- -### 📊 What We'll Measure -``` -Performance Metrics: -+-----------------+ +-----------------+ +-----------------+ -| Tokenization | | Memory Usage | | Scaling | -| Speed | | Analysis | | Behavior | -| | | | | | -| • tokens/sec | | • vocab memory | | • time complexity| -| • chars/sec | | • sequence mem | | • space complexity| -| • compression | | • total footprint| | • bottleneck ID | -+-----------------+ +-----------------+ +-----------------+ -``` -""" - -# %% nbgrader={"grade": false, "grade_id": "tokenization-profiler", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -import time - -class TokenizationProfiler: - """ - Performance profiling toolkit for tokenization systems. - - Helps ML engineers understand computational costs and optimize - text processing pipelines for production deployment. - """ - - def __init__(self): - self.results = {} - - def measure_tokenization_speed(self, tokenizer, texts: List[str], tokenizer_name: str) -> Dict: - """ - Measure tokenization throughput and efficiency. - - TODO: Implement tokenization speed measurement. - - STEP-BY-STEP IMPLEMENTATION: - 1. Record start time - 2. Tokenize all texts - 3. Record end time and calculate metrics - 4. Calculate tokens per second, characters per second - 5. 
Return comprehensive performance metrics - - METRICS TO CALCULATE: - - Total time (seconds) - - Texts per second - - Characters per second - - Average tokens per text - - Average sequence length - - Args: - tokenizer: Tokenizer instance (CharTokenizer or BPETokenizer) - texts: List of texts to tokenize - tokenizer_name: Name for reporting - - Returns: - Dictionary with performance metrics - """ - ### BEGIN SOLUTION - start_time = time.time() - - # Tokenize all texts - all_tokens = [] - total_chars = 0 - - for text in texts: - tokens = tokenizer.encode(text, add_special_tokens=False) - all_tokens.append(tokens) - total_chars += len(text) - - end_time = time.time() - - # Calculate metrics - total_time = end_time - start_time - total_texts = len(texts) - total_tokens = sum(len(tokens) for tokens in all_tokens) - - metrics = { - 'tokenizer_name': tokenizer_name, - 'total_time_sec': total_time, - 'total_texts': total_texts, - 'total_characters': total_chars, - 'total_tokens': total_tokens, - 'texts_per_second': total_texts / total_time if total_time > 0 else 0, - 'chars_per_second': total_chars / total_time if total_time > 0 else 0, - 'tokens_per_second': total_tokens / total_time if total_time > 0 else 0, - 'avg_tokens_per_text': total_tokens / total_texts if total_texts > 0 else 0, - 'avg_sequence_length': total_tokens / total_texts if total_texts > 0 else 0, - 'compression_ratio': total_chars / total_tokens if total_tokens > 0 else 0 - } - - return metrics - ### END SOLUTION - - def compare_tokenizers(self, texts: List[str]) -> Dict: - """ - Compare performance of different tokenization strategies. - - This function is PROVIDED to show comprehensive comparison. 
- """ - print("MAGNIFY TOKENIZER COMPARISON") - print("=" * 50) - - # Create tokenizers - char_tokenizer = CharTokenizer() - - # Train small BPE tokenizer - bpe_tokenizer = BPETokenizer(vocab_size=200) - bpe_tokenizer.train(texts[:10]) # Train on subset for speed - - tokenizers = [ - (char_tokenizer, "Character"), - (bpe_tokenizer, "BPE") - ] - - results = {} - - # Test each tokenizer - for tokenizer, name in tokenizers: - metrics = self.measure_tokenization_speed(tokenizer, texts, name) - results[name] = metrics - - print(f"\n📊 {name} Tokenizer:") - print(f" Speed: {metrics['texts_per_second']:.1f} texts/sec") - print(f" Throughput: {metrics['chars_per_second']:.0f} chars/sec") - print(f" Avg sequence length: {metrics['avg_sequence_length']:.1f} tokens") - print(f" Compression ratio: {metrics['compression_ratio']:.2f} chars/token") - print(f" Vocabulary size: {tokenizer.vocab_size}") - - return results - - def analyze_memory_scaling(self, tokenizer, text_lengths: List[int]) -> Dict: - """ - Analyze how tokenization memory scales with text length. - - This function is PROVIDED to demonstrate scaling analysis. - """ - print(f"\nMAGNIFY MEMORY SCALING ANALYSIS") - print("=" * 40) - - scaling_results = [] - - for length in text_lengths: - # Create text of specified length - test_text = "Hello world! 
" * (length // 13 + 1) - test_text = test_text[:length] - - # Measure tokenization - start_time = time.time() - tokens = tokenizer.encode(test_text, add_special_tokens=False) - end_time = time.time() - - # Calculate metrics - time_taken = end_time - start_time - memory_chars = len(test_text) * 4 # Approximate char memory (bytes) - memory_tokens = len(tokens) * 4 # Approximate token memory (bytes) - - result = { - 'text_length': length, - 'num_tokens': len(tokens), - 'time_ms': time_taken * 1000, - 'memory_chars_bytes': memory_chars, - 'memory_tokens_bytes': memory_tokens, - 'total_memory_bytes': memory_chars + memory_tokens - } - - scaling_results.append(result) - print(f" {length:>6} chars -> {len(tokens):>4} tokens ({time_taken*1000:.2f}ms)") - - # Analyze scaling pattern - if len(scaling_results) >= 2: - small = scaling_results[0] - large = scaling_results[-1] - - length_ratio = large['text_length'] / small['text_length'] - time_ratio = large['time_ms'] / small['time_ms'] - memory_ratio = large['total_memory_bytes'] / small['total_memory_bytes'] - - print(f"\nPROGRESS Scaling Analysis:") - print(f" Text length increased {length_ratio:.1f}x") - print(f" Time increased {time_ratio:.1f}x") - print(f" Memory increased {memory_ratio:.1f}x") - print(f" Scaling pattern: {'Linear' if abs(time_ratio - length_ratio) < 1 else 'Non-linear'}") - - return scaling_results - -def analyze_tokenization_impact(): - """ - Comprehensive analysis of how tokenization affects downstream ML systems. - - This function is PROVIDED to show systems-level thinking. 
- """ - print("TARGET TOKENIZATION IMPACT ON ML SYSTEMS") - print("=" * 60) - - # Sample texts for analysis - sample_texts = [ - "The quick brown fox jumps over the lazy dog.", - "Machine learning models process tokenized text efficiently.", - "Byte pair encoding balances vocabulary size and sequence length.", - "Transformer models use attention mechanisms for sequence processing.", - "Production systems require fast tokenization for real-time inference." - ] - - # Create tokenizers - char_tokenizer = CharTokenizer() - bpe_tokenizer = BPETokenizer(vocab_size=100) - bpe_tokenizer.train(sample_texts * 3) # Train with more data - - print("\n📊 TOKENIZATION COMPARISON:") - print(f"{'Strategy':<12} {'Vocab Size':<10} {'Avg Tokens':<10} {'Memory Impact':<15}") - print("-" * 60) - - for tokenizer, name in [(char_tokenizer, "Character"), (bpe_tokenizer, "BPE")]: - # Analyze average sequence length - total_tokens = 0 - for text in sample_texts: - tokens = tokenizer.encode(text, add_special_tokens=False) - total_tokens += len(tokens) - - avg_tokens = total_tokens / len(sample_texts) - - # Calculate memory impact - # Embedding table: vocab_size * embedding_dim * 4 bytes (float32) - embedding_dim = 256 # Typical small model - embedding_memory_mb = (tokenizer.vocab_size * embedding_dim * 4) / (1024 * 1024) - - # Sequence memory: batch_size * seq_length * hidden_dim * 4 bytes - batch_size = 32 - hidden_dim = 256 - sequence_memory_mb = (batch_size * avg_tokens * hidden_dim * 4) / (1024 * 1024) - - total_memory = embedding_memory_mb + sequence_memory_mb - - print(f"{name:<12} {tokenizer.vocab_size:<10} {avg_tokens:<10.1f} {total_memory:<15.1f}MB") - - print(f"\nTIP KEY INSIGHTS:") - print(f" 🔤 Character tokenizer: Small vocabulary, long sequences") - print(f" 🧩 BPE tokenizer: Medium vocabulary, shorter sequences") - print(f" PROGRESS Memory scaling: O(vocab_size * embed_dim + seq_len * batch_size)") - print(f" SPEED Attention complexity: O(seq_len²) - shorter sequences = faster 
attention") - print(f" 🏭 Production trade-off: Vocabulary size vs sequence length vs compute") - -# %% [markdown] -""" -### TEST Test: Tokenization Performance Analysis - -Let's test our tokenization profiler with realistic performance scenarios. -""" - -# %% nbgrader={"grade": false, "grade_id": "test-tokenization-profiler", "locked": false, "schema_version": 3, "solution": false, "task": false} -def test_tokenization_profiler(): - """Test tokenization profiler with various scenarios.""" - print("🔬 Unit Test: Tokenization Performance Profiler...") - - profiler = TokenizationProfiler() - - # Create test data - test_texts = [ - "Hello world!", - "This is a test sentence.", - "Tokenization speed matters for ML systems." - ] - - # Test with character tokenizer - char_tokenizer = CharTokenizer() - metrics = profiler.measure_tokenization_speed(char_tokenizer, test_texts, "Character") - - # Verify metrics structure - expected_keys = ['tokenizer_name', 'total_time_sec', 'total_texts', 'total_characters', - 'total_tokens', 'texts_per_second', 'chars_per_second', 'tokens_per_second', - 'avg_tokens_per_text', 'avg_sequence_length', 'compression_ratio'] - - for key in expected_keys: - assert key in metrics, f"Missing metric: {key}" - assert isinstance(metrics[key], (int, float, str)), f"Invalid metric type for {key}" - - # Verify reasonable values - assert metrics['total_texts'] == len(test_texts), "Should count texts correctly" - assert metrics['total_characters'] > 0, "Should count characters" - assert metrics['total_tokens'] > 0, "Should count tokens" - assert metrics['texts_per_second'] > 0, "Should measure throughput" - - print("PASS Basic profiling functionality test passed") - - # Test comparison - comparison_results = profiler.compare_tokenizers(test_texts) - assert isinstance(comparison_results, dict), "Should return comparison results" - assert len(comparison_results) >= 1, "Should test at least one tokenizer" - - print("PASS Tokenizer comparison test passed") - - # 
Test scaling analysis - scaling_results = profiler.analyze_memory_scaling(char_tokenizer, [50, 100]) - assert isinstance(scaling_results, list), "Should return scaling results" - assert len(scaling_results) == 2, "Should test both sizes" - - for result in scaling_results: - assert 'text_length' in result, "Should include text length" - assert 'num_tokens' in result, "Should include token count" - assert result['num_tokens'] > 0, "Should produce tokens" - - print("PASS Scaling analysis test passed") - print("TARGET Tokenization Profiler: All tests passed!") - -# Test function defined (called in main block) - -# %% [markdown] -""" -## 📊 Systems Analysis: Tokenization Impact on Model Architecture - -Let's analyze how different tokenization strategies affect real ML system design choices. -""" - -# %% nbgrader={"grade": false, "grade_id": "tokenization-systems-analysis", "locked": false, "schema_version": 3, "solution": false, "task": false} -def analyze_tokenization_systems_impact(): - """ - Analyze how tokenization affects ML system design and performance. - - This analysis helps students understand the connection between - tokenization choices and downstream system architecture decisions. - """ - print("🏗️ TOKENIZATION SYSTEMS IMPACT ANALYSIS") - print("=" * 60) - - # Example model configurations - model_configs = { - 'Small Model': {'embed_dim': 128, 'hidden_dim': 256, 'batch_size': 16}, - 'Medium Model': {'embed_dim': 256, 'hidden_dim': 512, 'batch_size': 32}, - 'Large Model': {'embed_dim': 512, 'hidden_dim': 1024, 'batch_size': 64} - } - - # Sample text for analysis - sample_text = "The transformer architecture revolutionized natural language processing through self-attention mechanisms." 
- - # Create tokenizers - char_tokenizer = CharTokenizer() - bpe_tokenizer = BPETokenizer(vocab_size=500) - bpe_tokenizer.train([sample_text] * 10) - - tokenizers = [ - (char_tokenizer, "Character"), - (bpe_tokenizer, "BPE-500") - ] - - print(f"\n📋 ANALYSIS FOR TEXT: '{sample_text[:50]}...'") - print(f" Original length: {len(sample_text)} characters") - - for tokenizer, tok_name in tokenizers: - tokens = tokenizer.encode(sample_text, add_special_tokens=False) - - print(f"\n🔤 {tok_name} Tokenization:") - print(f" Vocabulary size: {tokenizer.vocab_size:,}") - print(f" Sequence length: {len(tokens)} tokens") - print(f" Compression ratio: {len(sample_text)/len(tokens):.2f} chars/token") - - print(f"\n💾 Memory Analysis:") - for model_name, config in model_configs.items(): - # Embedding table memory - embed_memory = tokenizer.vocab_size * config['embed_dim'] * 4 / (1024**2) # MB - - # Sequence processing memory (attention) - seq_memory = config['batch_size'] * len(tokens) * config['hidden_dim'] * 4 / (1024**2) # MB - - # Attention memory (O(N²)) - attention_memory = config['batch_size'] * len(tokens)**2 * 4 / (1024**2) # MB - - total_memory = embed_memory + seq_memory + attention_memory - - print(f" {model_name}: {total_memory:.1f}MB total") - print(f" Embedding: {embed_memory:.1f}MB, Sequence: {seq_memory:.1f}MB, Attention: {attention_memory:.1f}MB") - - print(f"\nTARGET KEY SYSTEM DESIGN INSIGHTS:") - print(f" 1. Vocabulary Size Trade-offs:") - print(f" - Larger vocab = more parameters = more memory") - print(f" - Smaller vocab = longer sequences = more compute") - print(f" 2. Sequence Length Impact:") - print(f" - Attention complexity: O(sequence_length²)") - print(f" - Memory scales quadratically with sequence length") - print(f" 3. Production Considerations:") - print(f" - Character tokenization: Simple but inefficient") - print(f" - BPE tokenization: Balanced approach used in GPT/BERT") - print(f" - Vocabulary size affects model download size") - print(f" 4. 
Hardware Implications:") - print(f" - GPU memory limits sequence length") - print(f" - Batch size limited by attention memory") - -# Analysis function defined (called in main block) - -# %% [markdown] -""" -## MAGNIFY Interactive Systems Insights - -Let's build intuition about tokenization through hands-on analysis. These functions reveal how tokenization choices cascade through ML systems. -""" - -# PASS IMPLEMENTATION CHECKPOINT: Ensure your tokenizers are complete before running - -# THINK PREDICTION: Which tokenizer will use more memory - character or BPE? Why? -# Your guess: _______ - -# MAGNIFY SYSTEMS INSIGHT #1: Vocabulary Size vs Memory Trade-offs -def analyze_tokenization_memory_impact(): - """Analyze how vocabulary size affects model memory usage.""" - try: - print("MAGNIFY TOKENIZATION MEMORY IMPACT ANALYSIS") - print("=" * 50) - - # Create tokenizers with different vocabulary sizes - char_tokenizer = CharTokenizer() - - # Train small BPE for comparison - bpe_small = BPETokenizer(vocab_size=500) - bpe_large = BPETokenizer(vocab_size=2000) - - sample_texts = [ - "The quick brown fox jumps over the lazy dog", - "Machine learning models process tokenized text", - "Transformers use attention mechanisms effectively" - ] * 3 # Repeat for training data - - bpe_small.train(sample_texts) - bpe_large.train(sample_texts) - - tokenizers = [ - (char_tokenizer, "Character"), - (bpe_small, "BPE-500"), - (bpe_large, "BPE-2000") - ] - - test_text = "The transformer architecture revolutionized natural language processing." 
- embed_dim = 256 # Typical embedding dimension - - print(f"\nAnalyzing text: '{test_text}'") - print(f"Text length: {len(test_text)} characters") - - for tokenizer, name in tokenizers: - tokens = tokenizer.encode(test_text, add_special_tokens=False) - - # Calculate memory requirements - vocab_size = tokenizer.vocab_size - seq_length = len(tokens) - - # Embedding table memory (parameters) - embedding_memory_mb = (vocab_size * embed_dim * 4) / (1024 * 1024) - - # Sequence memory for single sample (activations) - sequence_memory_kb = (seq_length * embed_dim * 4) / 1024 - - # Attention memory O(N²) for single sample - attention_memory_kb = (seq_length * seq_length * 4) / 1024 - - print(f"\n📊 {name} Tokenizer:") - print(f" Vocabulary size: {vocab_size:,}") - print(f" Sequence length: {seq_length} tokens") - print(f" Compression ratio: {len(test_text)/seq_length:.2f} chars/token") - print(f" Embedding table: {embedding_memory_mb:.1f} MB") - print(f" Sequence memory: {sequence_memory_kb:.1f} KB") - print(f" Attention memory: {attention_memory_kb:.1f} KB") - - total_per_sample = sequence_memory_kb + attention_memory_kb - print(f" Total per sample: {total_per_sample:.1f} KB") - - print(f"\nTIP KEY INSIGHTS:") - print(f" • Vocabulary size directly affects model parameters") - print(f" • Sequence length affects computation (attention is O(N²))") - print(f" • Character tokenization: Small vocab, long sequences") - print(f" • BPE tokenization: Large vocab, shorter sequences") - print(f" • Production trade-off: Parameters vs computation") - - except Exception as e: - print(f"WARNING️ Error in memory analysis: {e}") - print("Make sure both tokenizers are implemented correctly") - -# Run the analysis -analyze_tokenization_memory_impact() - -# PASS IMPLEMENTATION CHECKPOINT: Ensure BPE merge functions are working - -# THINK PREDICTION: How does tokenization speed scale with text length? -# Linear? Quadratic? 
Your guess: _______ - -# MAGNIFY SYSTEMS INSIGHT #2: Tokenization Speed Scaling Analysis -def analyze_tokenization_speed_scaling(): - """Measure how tokenization performance scales with input size.""" - try: - print("\nMAGNIFY TOKENIZATION SPEED SCALING ANALYSIS") - print("=" * 50) - - char_tokenizer = CharTokenizer() - text_lengths = [100, 500, 1000, 2000, 5000] - - print(f"Testing scaling with text lengths: {text_lengths}") - - char_times = [] - - for length in text_lengths: - # Create text of specified length - test_text = "The quick brown fox jumps over the lazy dog. " * (length // 44 + 1) - test_text = test_text[:length] - - # Measure character tokenization time - start_time = time.time() - char_tokens = char_tokenizer.encode(test_text, add_special_tokens=False) - char_time = time.time() - start_time - - char_times.append(char_time) - - print(f" {length:>5} chars -> {len(char_tokens):>5} tokens in {char_time*1000:.2f}ms") - - # Analyze scaling pattern - if len(char_times) >= 2: - print(f"\nPROGRESS Scaling Analysis:") - for i in range(1, len(text_lengths)): - length_ratio = text_lengths[i] / text_lengths[0] - time_ratio = char_times[i] / char_times[0] if char_times[0] > 0 else 0 - - print(f" {text_lengths[i]:>5} chars: {length_ratio:.1f}x length -> {time_ratio:.1f}x time") - - # Calculate approximate complexity - avg_scaling = sum(char_times[i]/char_times[0] / (text_lengths[i]/text_lengths[0]) - for i in range(1, len(text_lengths)) if char_times[0] > 0) / (len(text_lengths) - 1) - - print(f"\nTARGET SCALING INSIGHTS:") - print(f" • Character tokenization: ~O(N) time complexity") - print(f" • Average scaling factor: {avg_scaling:.2f} (1.0 = perfect linear)") - if avg_scaling < 1.2: - print(f" • Performance: Excellent linear scaling") - elif avg_scaling < 2.0: - print(f" • Performance: Good scaling with minor overhead") - else: - print(f" • Performance: Scaling overhead detected") - - print(f" • Memory usage: O(N) with input length") - print(f" • Production 
implication: Tokenization speed rarely bottlenecks training") - - except Exception as e: - print(f"WARNING️ Error in scaling analysis: {e}") - print("Make sure character tokenizer is implemented correctly") - -# Run the scaling analysis -analyze_tokenization_speed_scaling() - -# PASS IMPLEMENTATION CHECKPOINT: All tokenization systems working - -# THINK PREDICTION: For a 7B parameter model, what percentage of memory is vocabulary? -# Your estimate: _______% - -# MAGNIFY SYSTEMS INSIGHT #3: Production Model Memory Breakdown -def analyze_production_memory_breakdown(): - """Analyze vocabulary memory in production-scale language models.""" - try: - print("\nMAGNIFY PRODUCTION MODEL MEMORY BREAKDOWN") - print("=" * 50) - - # Model configurations based on real systems - models = { - 'GPT-Small': {'params': 117_000_000, 'vocab': 50257, 'embed_dim': 768}, - 'GPT-Medium': {'params': 345_000_000, 'vocab': 50257, 'embed_dim': 1024}, - 'GPT-Large': {'params': 774_000_000, 'vocab': 50257, 'embed_dim': 1280}, - 'LLaMA-7B': {'params': 7_000_000_000, 'vocab': 32000, 'embed_dim': 4096} - } - - print(f"{'Model':<12} {'Total Params':<12} {'Vocab Params':<12} {'Vocab %':<8} {'Vocab Memory'}") - print("-" * 70) - - for model_name, config in models.items(): - total_params = config['params'] - vocab_size = config['vocab'] - embed_dim = config['embed_dim'] - - # Vocabulary parameters (embedding table) - vocab_params = vocab_size * embed_dim - vocab_percentage = (vocab_params / total_params) * 100 - - # Memory in MB (float32) - vocab_memory_mb = (vocab_params * 4) / (1024 * 1024) - - print(f"{model_name:<12} {total_params/1e6:>8.0f}M {vocab_params/1e6:>8.1f}M {vocab_percentage:>6.1f}% {vocab_memory_mb:>8.0f}MB") - - print(f"\nTARGET PRODUCTION INSIGHTS:") - print(f" • Small models (100M): Vocabulary is ~20-30% of parameters") - print(f" • Large models (7B+): Vocabulary is ~1-2% of parameters") - print(f" • Vocabulary memory scales with vocab_size * embed_dim") - print(f" • GPT uses 50k 
vocabulary, LLaMA uses 32k (efficiency optimization)") - - # Calculate tokenization efficiency comparison - print(f"\n📊 TOKENIZATION EFFICIENCY COMPARISON:") - char_vocab = 256 - char_embed = 512 - char_memory = (char_vocab * char_embed * 4) / (1024 * 1024) - - gpt_vocab = 50257 - gpt_embed = 768 - gpt_memory = (gpt_vocab * gpt_embed * 4) / (1024 * 1024) - - print(f" Character tokenizer: {char_memory:.1f} MB vocabulary") - print(f" GPT tokenizer: {gpt_memory:.1f} MB vocabulary") - print(f" Memory ratio: {gpt_memory/char_memory:.0f}x more memory for BPE") - - # But compute advantage - sample_text = "The transformer architecture revolutionized NLP" - char_tokens = len(sample_text) # Approximate character count - gpt_tokens = char_tokens // 4 # Approximate GPT tokenization (4 chars/token) - - print(f"\nSPEED COMPUTE EFFICIENCY:") - print(f" Sample text: '{sample_text}'") - print(f" Character tokens: ~{char_tokens}") - print(f" GPT tokens: ~{gpt_tokens}") - print(f" Attention complexity: O(N²)") - print(f" Character attention: O({char_tokens}²) = {char_tokens**2:,} operations") - print(f" GPT attention: O({gpt_tokens}²) = {gpt_tokens**2:,} operations") - print(f" Compute reduction: {(char_tokens**2)/(gpt_tokens**2):.1f}x faster attention") - - print(f"\nTIP TRADE-OFF SUMMARY:") - print(f" • BPE uses {gpt_memory/char_memory:.0f}x more vocabulary memory") - print(f" • BPE provides {(char_tokens**2)/(gpt_tokens**2):.1f}x faster attention computation") - print(f" • Production systems choose BPE for compute efficiency") - - except Exception as e: - print(f"WARNING️ Error in production analysis: {e}") - print("Error in memory calculation - check model configurations") - -# Run the production analysis -analyze_production_memory_breakdown() - -# %% [markdown] -""" -## ROCKET Advanced: Tokenization Efficiency Techniques - -Production tokenization systems use several optimization techniques. 
Let's implement a few key ones: -""" - -# %% nbgrader={"grade": false, "grade_id": "tokenization-optimizations", "locked": false, "schema_version": 3, "solution": false, "task": false} -#| export -class OptimizedTokenizer: - """ - Production-optimized tokenizer with caching and batch processing. - - Demonstrates optimization techniques used in real ML systems: - - Caching for repeated texts - - Batch processing for efficiency - - Memory-efficient encoding - """ - - def __init__(self, base_tokenizer): - """Initialize with a base tokenizer and optimization features.""" - self.base_tokenizer = base_tokenizer - self.encode_cache = {} - self.decode_cache = {} - self.cache_hits = 0 - self.cache_misses = 0 - - def encode_with_cache(self, text: str, add_special_tokens: bool = True) -> List[int]: - """ - Encode text with caching for repeated inputs. - - This optimization is critical for production systems where - the same texts are processed repeatedly. - """ - cache_key = (text, add_special_tokens) - - if cache_key in self.encode_cache: - self.cache_hits += 1 - return self.encode_cache[cache_key] - - # Cache miss - compute and cache result - self.cache_misses += 1 - tokens = self.base_tokenizer.encode(text, add_special_tokens) - self.encode_cache[cache_key] = tokens - - return tokens - - def batch_encode(self, texts: List[str], add_special_tokens: bool = True, - pad_to_max: bool = True) -> List[List[int]]: - """ - Efficiently encode multiple texts as a batch. - - This function is PROVIDED to show batch processing optimization. 
- """ - # Encode all texts - token_sequences = [] - for text in texts: - tokens = self.encode_with_cache(text, add_special_tokens) - token_sequences.append(tokens) - - # Pad to uniform length if requested - if pad_to_max and hasattr(self.base_tokenizer, 'pad_sequences'): - token_sequences = self.base_tokenizer.pad_sequences(token_sequences) - - return token_sequences - - def get_cache_stats(self) -> Dict: - """Get caching performance statistics.""" - total_requests = self.cache_hits + self.cache_misses - hit_rate = self.cache_hits / total_requests if total_requests > 0 else 0 - - return { - 'cache_hits': self.cache_hits, - 'cache_misses': self.cache_misses, - 'total_requests': total_requests, - 'hit_rate': hit_rate, - 'cache_size': len(self.encode_cache) - } - -def demonstrate_production_optimizations(): - """ - Demonstrate production-level tokenization optimizations. - - This function is PROVIDED to show real-world optimization techniques. - """ - print("ROCKET PRODUCTION TOKENIZATION OPTIMIZATIONS") - print("=" * 60) - - # Create optimized tokenizer - base_tokenizer = CharTokenizer() - optimized_tokenizer = OptimizedTokenizer(base_tokenizer) - - # Test data with repeated texts (common in production) - test_texts = [ - "Hello world!", - "Machine learning is amazing.", - "Hello world!", # Repeated - "Tokenization performance matters.", - "Hello world!", # Repeated again - "Machine learning is amazing.", # Repeated - ] - - print(f"📊 Testing with {len(test_texts)} texts ({len(set(test_texts))} unique)") - - # Measure performance without caching - start_time = time.time() - tokens_no_cache = [] - for text in test_texts: - tokens = base_tokenizer.encode(text, add_special_tokens=False) - tokens_no_cache.append(tokens) - no_cache_time = time.time() - start_time - - # Measure performance with caching - start_time = time.time() - tokens_with_cache = [] - for text in test_texts: - tokens = optimized_tokenizer.encode_with_cache(text, add_special_tokens=False) - 
tokens_with_cache.append(tokens) - cache_time = time.time() - start_time - - # Test batch encoding - start_time = time.time() - batch_tokens = optimized_tokenizer.batch_encode(test_texts, add_special_tokens=False, pad_to_max=True) - batch_time = time.time() - start_time - - # Report results - cache_stats = optimized_tokenizer.get_cache_stats() - - print(f"\nSPEED PERFORMANCE COMPARISON:") - print(f" No caching: {no_cache_time*1000:.2f}ms") - print(f" With caching: {cache_time*1000:.2f}ms ({(no_cache_time/cache_time):.1f}x speedup)") - print(f" Batch processing: {batch_time*1000:.2f}ms") - - print(f"\nPROGRESS CACHE PERFORMANCE:") - print(f" Hit rate: {cache_stats['hit_rate']*100:.1f}%") - print(f" Cache hits: {cache_stats['cache_hits']}") - print(f" Cache misses: {cache_stats['cache_misses']}") - print(f" Cache size: {cache_stats['cache_size']} entries") - - print(f"\nTARGET PRODUCTION INSIGHTS:") - print(f" - Caching provides significant speedup for repeated texts") - print(f" - Batch processing enables vectorized operations") - print(f" - Memory-efficient encoding reduces allocation overhead") - print(f" - Cache hit rates >80% common in production systems") - -# Function defined (called in main block) - -# %% [markdown] -""" -## Comprehensive Testing & Integration - -Let's run comprehensive tests to ensure all tokenization functionality works correctly: -""" - -# %% nbgrader={"grade": false, "grade_id": "test-tokenization-comprehensive", "locked": false, "schema_version": 3, "solution": false, "task": false} -def test_tokenization_comprehensive(): - """Comprehensive test suite for all tokenization functionality.""" - print("TEST Comprehensive Tokenization Tests...") - - # Test 1: Character tokenizer edge cases - print(" Testing character tokenizer edge cases...") - char_tokenizer = CharTokenizer() - - # Empty string - empty_tokens = char_tokenizer.encode("", add_special_tokens=True) - assert len(empty_tokens) == 2, "Empty string should have BOS and EOS tokens" - 
- # Single character - single_tokens = char_tokenizer.encode("A", add_special_tokens=False) - assert len(single_tokens) == 1, "Single character should produce one token" - - # Special characters - special_text = "!@#$%" - special_tokens = char_tokenizer.encode(special_text, add_special_tokens=False) - assert len(special_tokens) == len(special_text), "Should handle special characters" - - # Round-trip encoding/decoding - original = "Hello, World! 123" - tokens = char_tokenizer.encode(original, add_special_tokens=False) - decoded = char_tokenizer.decode(tokens, skip_special_tokens=True) - assert decoded == original, "Round-trip should preserve text" - - print(" PASS Character tokenizer edge cases passed") - - # Test 2: BPE tokenizer robustness - print(" Testing BPE tokenizer robustness...") - bpe_tokenizer = BPETokenizer(vocab_size=100) - - # Train with diverse data - training_data = [ - "hello world", - "the quick brown fox", - "machine learning systems", - "neural network training", - "hello hello world world" # Repeated patterns for merging - ] - - bpe_tokenizer.train(training_data) - assert bpe_tokenizer.trained, "BPE should be trained" - - # Test encoding various texts - test_cases = [ - "hello world", - "new unseen text", - "machine learning", - "" # Empty string - ] - - for test_text in test_cases: - if test_text: # Skip empty string for basic tests - tokens = bpe_tokenizer.encode(test_text, add_special_tokens=False) - decoded = bpe_tokenizer.decode(tokens, skip_special_tokens=True) - # BPE decoding might have slightly different spacing due to word boundaries - assert test_text.replace(" ", "") in decoded.replace(" ", ""), f"BPE round-trip failed for '{test_text}'" - - print(" PASS BPE tokenizer robustness passed") - - # Test 3: Memory efficiency with large texts - print(" Testing memory efficiency...") - large_text = "This is a test sentence. 
" * 1000 # ~25k characters - - start_time = time.time() - char_tokens = char_tokenizer.encode(large_text, add_special_tokens=False) - char_time = time.time() - start_time - - assert len(char_tokens) > 20000, "Should handle large texts" - assert char_time < 1.0, "Should tokenize large text quickly" - - print(" PASS Memory efficiency tests passed") - - # Test 4: Integration with optimization features - print(" Testing optimization features...") - optimized = OptimizedTokenizer(char_tokenizer) - - # Test caching - test_text = "Repeated text for caching test" - tokens1 = optimized.encode_with_cache(test_text) - tokens2 = optimized.encode_with_cache(test_text) # Should hit cache - - assert tokens1 == tokens2, "Cached results should be identical" - - cache_stats = optimized.get_cache_stats() - assert cache_stats['cache_hits'] > 0, "Should have cache hits" - assert cache_stats['hit_rate'] > 0, "Should have positive hit rate" - - # Test batch processing - batch_texts = ["text one", "text two", "text three"] - batch_results = optimized.batch_encode(batch_texts, pad_to_max=True) - - assert len(batch_results) == len(batch_texts), "Batch size should match input" - assert all(len(seq) == len(batch_results[0]) for seq in batch_results), "All sequences should be padded to same length" - - print(" PASS Optimization features tests passed") - - print("PASS All comprehensive tokenization tests passed!") - -# Test function defined (called in main block) - -# %% [markdown] -""" -## Main Execution Block - -All tokenization tests and demonstrations are run from here when the module is executed directly: -""" - -# %% nbgrader={"grade": false, "grade_id": "tokenization-main", "locked": false, "schema_version": 3, "solution": false, "task": false} -if __name__ == "__main__": - print("🔤 Starting TinyTorch Tokenization Module...") - print("="*60) - - # Run all unit tests - print("\nTEST UNIT TESTS") - print("-" * 30) - test_unit_char_tokenizer() - test_unit_bpe_tokenizer() - 
test_tokenization_profiler() - - # Run comprehensive integration tests - print("\n🔧 INTEGRATION TESTS") - print("-" * 30) - test_tokenization_comprehensive() - - # Performance analysis - print("\n" + "="*60) - print("MAGNIFY TOKENIZATION PERFORMANCE ANALYSIS") - print("="*60) - - # Create test data - sample_texts = [ - "The transformer architecture has revolutionized natural language processing.", - "Machine learning models require efficient tokenization for text processing.", - "Character-level tokenization produces long sequences but small vocabularies.", - "Byte pair encoding balances vocabulary size with sequence length efficiency.", - "Production systems need fast tokenization to maintain training throughput." - ] - - print(f"\nTesting with {len(sample_texts)} sample texts...") - - # Performance comparison - profiler = TokenizationProfiler() - comparison_results = profiler.compare_tokenizers(sample_texts) - - # Systems impact analysis - analyze_tokenization_systems_impact() - - # Production optimizations demonstration - demonstrate_production_optimizations() - - print("\n" + "="*60) - print("TARGET TOKENIZATION MODULE COMPLETE!") - print("="*60) - print("PASS All tokenization tests passed!") - print("PASS Systems insights analysis complete!") - print("PASS Performance profiling successful!") - print("ROCKET Ready for embedding layer integration!") - -# %% [markdown] -""" -## THINK ML Systems Thinking: Interactive Questions - -Now that you've built the text processing foundation for language models, let's connect this work to broader ML systems challenges. These questions help you think critically about how tokenization scales to production language processing systems. - -Take time to reflect thoughtfully on each question - your insights will help you understand how tokenization connects to real-world ML systems engineering. 
-""" - -# %% [markdown] -""" -### Question 1: Vocabulary Size vs Model Performance Analysis - -**Context**: Your tokenization implementations show how vocabulary size affects both model parameters and sequence processing. In your CharTokenizer, you observed small vocabulary (~99 tokens) but long sequences. In your BPE implementation, you created larger vocabularies (~500-2000 tokens) with shorter sequences. - -**Computational Assessment**: Analyze the memory and computational trade-offs in your tokenization implementations. Given a text corpus where your CharTokenizer produces average sequences of 200 tokens and your BPE tokenizer produces average sequences of 50 tokens, calculate the total memory requirements for a model with 256-dimensional embeddings processing batches of 32 sequences. Compare the embedding table memory, sequence processing memory, and attention computation complexity (O(N²)) for both approaches. Which tokenization strategy would be more efficient for training large language models and why? - -Consider: embedding parameters, attention complexity, batch processing memory, and training throughput implications. - -*Target length: 200-400 words with calculations* -""" - -# %% nbgrader={"grade": true, "grade_id": "question-1-tokenization-strategy", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} -""" -YOUR REFLECTION ON TOKENIZATION STRATEGY AND PERFORMANCE TRADE-OFFS: - -TODO: Replace this text with your thoughtful response about multilingual tokenization strategy design. - -Consider addressing: -- How would you design a tokenization strategy for 50+ languages within a 100k token limit? -- What approaches would you use to handle different scripts and morphological complexity? -- How would you optimize for both cross-lingual transfer and computational efficiency? -- What trade-offs would you make between vocabulary sharing and language-specific optimization? 
-- How would you ensure consistent quality across languages with different characteristics? - -Write a strategic analysis connecting your tokenization implementations to real multilingual system challenges. - -GRADING RUBRIC (Instructor Use): -- Demonstrates understanding of multilingual tokenization challenges (3 points) -- Designs practical approaches to vocabulary size and language coverage (3 points) -- Addresses cross-lingual transfer and efficiency considerations (2 points) -- Shows systems thinking about production language model constraints (2 points) -- Clear strategic reasoning with multilingual optimization insights (bonus points for comprehensive understanding) -""" - -### BEGIN SOLUTION -# Student response area - instructor will replace this section during grading setup -# This is a manually graded question requiring strategic analysis of multilingual tokenization -# Students should demonstrate understanding of cross-lingual efficiency and performance trade-offs -### END SOLUTION - -# %% [markdown] -""" -### Question 2: BPE Training Complexity and Optimization - -**Context**: Your BPE implementation performs iterative pair merging to build subword vocabularies. The `_get_pair_counts()` and `_merge_pair()` functions you implemented process the entire corpus in each iteration. You observed that BPE training can be computationally expensive as vocabulary size increases. - -**Computational Assessment**: Analyze the computational complexity of your BPE training algorithm. If you have a corpus with C characters, V target vocabulary size, and your algorithm performs V-k merging iterations (where k is initial character vocabulary), calculate the time complexity of the complete training process. Compare the efficiency of training BPE vocabularies of 1000, 5000, and 50000 tokens on a 1GB text corpus. 
Design specific optimizations to your `_get_pair_counts()` and `_merge_pair()` implementations that would reduce training time while maintaining tokenization quality.
-
-Consider: algorithm complexity, data structure choices, memory usage during training, and practical optimization strategies.
-
-*Target length: 200-400 words with complexity analysis*
-"""
-
-# %% nbgrader={"grade": true, "grade_id": "question-2-pipeline-integration", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false}
-"""
-YOUR ANALYSIS OF BPE TRAINING COMPLEXITY AND OPTIMIZATION:
-
-TODO: Replace this text with your complexity analysis of the BPE training algorithm described above.
-
-Consider addressing:
-- What is the cost of one `_get_pair_counts()` pass over a corpus of C characters?
-- What is the total training cost across the V-k merge iterations?
-- How do training times compare for vocabularies of 1000, 5000, and 50000 tokens on a 1GB corpus?
-- Which data structure changes (e.g., incremental pair counts, a priority queue of pairs) would reduce the cost?
-- What memory overhead would those optimizations add during training?
-
-Write a complexity analysis connecting your BPE implementation to practical training-time optimizations.
-
-GRADING RUBRIC (Instructor Use):
-- Correctly derives the time complexity of the naive training loop (3 points)
-- Compares training cost across the three target vocabulary sizes (3 points)
-- Proposes concrete optimizations to `_get_pair_counts()` and `_merge_pair()` (2 points)
-- Demonstrates systems thinking about time/memory trade-offs during training (2 points)
-- Clear algorithmic reasoning with complexity calculations (bonus points for comprehensive analysis)
-"""
-
-### BEGIN SOLUTION
-# Student response area - instructor will replace this section during grading setup
-# This is a manually graded question requiring complexity analysis of BPE training
-# Students should demonstrate understanding of algorithmic optimization and data structures
-### END SOLUTION
-
-# %% [markdown]
-"""
-### Question 3: Tokenization Efficiency in Production Systems
-
-**Context**: Your OptimizedTokenizer implementation includes caching mechanisms that you tested with repeated text processing. You observed significant speedup for cache hits but also noted memory overhead for storing cached results. Production systems must balance caching benefits with memory constraints.
-
-**Computational Assessment**: Design a caching strategy for your tokenization system that optimizes for production deployment with a 10GB memory budget. Given that your character tokenization produces ~4 bytes per token and typical text repeats with a 60% cache hit rate, calculate the optimal cache size that maximizes throughput while staying within memory limits. Analyze how cache eviction policies (LRU, LFU, or TTL-based) would affect performance for different workload patterns: academic paper processing (high repetition), social media feeds (medium repetition), and novel literature (low repetition). Propose specific modifications to your encode_with_cache() method that would adapt cache behavior based on workload characteristics.
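Before writing the analysis, it can help to have a concrete baseline in mind. The sketch below is one possible bounded variant of the caching idea discussed above — a hedged illustration, not part of the module's API: `LRUEncodeCache` and its `max_entries` knob are assumed names, and a real deployment would derive the capacity from the memory budget rather than hard-coding it.

```python
from collections import OrderedDict
from typing import Callable, List, Tuple

class LRUEncodeCache:
    """Bounded (text, add_special_tokens) -> tokens cache with LRU eviction.

    max_entries is an illustrative knob; a real deployment would derive it
    from the memory budget, e.g. budget_bytes / (avg_tokens_per_entry * 4).
    """

    def __init__(self, max_entries: int = 10_000):
        self.max_entries = max_entries
        self._cache: "OrderedDict[Tuple[str, bool], List[int]]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def encode(self, text: str, encode_fn: Callable[[str], List[int]],
               add_special_tokens: bool = True) -> List[int]:
        key = (text, add_special_tokens)
        if key in self._cache:
            self.hits += 1
            self._cache.move_to_end(key)      # mark as most recently used
            return self._cache[key]
        self.misses += 1
        tokens = encode_fn(text)
        self._cache[key] = tokens
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)   # evict least recently used
        return tokens

# Usage with a stand-in encoder (any tokenizer's encode would work here):
cache = LRUEncodeCache(max_entries=2)
char_encode = lambda s: [ord(c) for c in s]
cache.encode("aa", char_encode)
cache.encode("bb", char_encode)
cache.encode("aa", char_encode)   # hit: "aa" becomes most recently used
cache.encode("cc", char_encode)   # evicts "bb", the least recently used
```

Swapping `move_to_end`/`popitem` for frequency counters or timestamps turns the same skeleton into LFU or TTL eviction, which is the comparison the question asks you to reason about.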
-
-Consider: memory allocation, cache eviction algorithms, workload patterns, and adaptive optimization strategies.
-
-*Target length: 200-400 words with memory calculations*
-"""
-
-# %% nbgrader={"grade": true, "grade_id": "question-3-dynamic-tokenization", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false}
-"""
-YOUR ANALYSIS OF CACHING STRATEGY AND PRODUCTION EFFICIENCY:
-
-TODO: Replace this text with your caching-strategy analysis for the 10GB deployment described above.
-
-Consider addressing:
-- What cache size maximizes throughput at a 60% hit rate within the 10GB memory budget?
-- How would LRU, LFU, and TTL-based eviction behave for high-, medium-, and low-repetition workloads?
-- How would you estimate per-entry memory cost (~4 bytes per token plus key and bookkeeping overhead)?
-- What modifications to encode_with_cache() would adapt cache behavior to the workload?
-- How would you monitor hit rate and memory usage to tune the cache in production?
-
-Write a quantitative analysis connecting your OptimizedTokenizer implementation to production caching design.
-
-GRADING RUBRIC (Instructor Use):
-- Calculates cache sizing within the stated memory budget (3 points)
-- Compares eviction policies across the three workload patterns (3 points)
-- Proposes concrete modifications to encode_with_cache() (2 points)
-- Shows systems thinking about memory/throughput trade-offs in production (2 points)
-- Clear quantitative reasoning with memory calculations (bonus points for adaptive cache designs)
-"""
-
-### BEGIN SOLUTION
-# Student response area - instructor will replace this section during grading setup
-# This is a manually graded question requiring quantitative analysis of caching strategies
-# Students should demonstrate understanding of eviction policies and memory budgeting
-### END SOLUTION
-
-# %% [markdown]
-"""
-### Question 4: Out-of-Vocabulary Handling and System Robustness
-
-**Context**: Your tokenization implementations handle unknown characters and tokens through UNK tokens. In your CharTokenizer, characters outside the ASCII range become UNK. In your BPETokenizer, text not seen during training falls back to character-level processing. Production systems must gracefully handle diverse, evolving text inputs.
-
-**Computational Assessment**: Analyze the robustness of your tokenization systems when processing multilingual and noisy text. Calculate the UNK token rate for processing text containing 20% non-ASCII characters using your CharTokenizer versus a trained BPE tokenizer. Design an enhanced fallback strategy that combines character-level, BPE subword, and whole-word tokenization to minimize information loss. Quantify how UNK token rates affect downstream model performance by estimating the impact on embedding quality when 15% of tokens are UNK versus 2% UNK. Propose specific modifications to your encode() methods that would improve out-of-vocabulary handling without significantly increasing vocabulary size.
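One concrete shape such a fallback hierarchy could take is sketched below. This is a hedged illustration, not the module's API: `encode_with_fallback`, `byte_offset`, and the reserved byte-id range are assumptions for this example, echoing the byte-fallback idea used by production BPE tokenizers so that no input ever has to map to UNK.

```python
from typing import Dict, List

def encode_with_fallback(word: str, vocab: Dict[str, int],
                         byte_offset: int = 0) -> List[int]:
    """Whole-word -> character -> raw-byte fallback chain for one word.

    Hypothetical names for illustration only. Ids in the range
    [byte_offset, byte_offset + 256) are reserved for raw UTF-8 bytes,
    so every input receives *some* id and UNK is never emitted.
    """
    if word in vocab:                 # 1. whole word known: one token
        return [vocab[word]]
    ids: List[int] = []
    for ch in word:                   # 2. per-character fallback
        if ch in vocab:
            ids.append(vocab[ch])
        else:                         # 3. byte-level floor for unknown chars
            ids.extend(byte_offset + b for b in ch.encode("utf-8"))
    return ids

vocab = {"hello": 300, "h": 260, "i": 261}
print(encode_with_fallback("hello", vocab))   # [300]
print(encode_with_fallback("hi", vocab))      # [260, 261]
print(encode_with_fallback("héllo", vocab))   # raw bytes for the unknown chars
```

The trade-off to analyze: the byte floor preserves all information (no UNK loss) at the cost of longer sequences for unknown spans, which is exactly the sequence-length vs robustness tension the question targets.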
- -Consider: fallback hierarchies, information preservation, embedding quality, vocabulary efficiency, and multilingual robustness. - -*Target length: 200-400 words with impact analysis* -""" - -# %% nbgrader={"grade": true, "grade_id": "question-4-oov-handling", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} -""" -YOUR ANALYSIS ON OUT-OF-VOCABULARY HANDLING AND SYSTEM ROBUSTNESS: - -TODO: Replace this text with your computational assessment of OOV handling strategies. - -Consider addressing: -- How would you calculate UNK token rates for different text types? -- What fallback strategies would minimize information loss in your implementations? -- How do UNK token rates affect downstream model performance quantitatively? -- What modifications to your encode() methods would improve robustness? -- How would you design vocabulary expansion to handle evolving text patterns? - -Write a technical analysis connecting your tokenization implementations to real multilingual robustness challenges. - -GRADING RUBRIC (Instructor Use): -- Quantifies UNK token rates and their impact on system performance (3 points) -- Designs practical fallback strategies building on existing implementations (3 points) -- Analyzes downstream effects on embedding quality and model performance (2 points) -- Proposes concrete improvements to existing encode() methods (2 points) -- Clear technical reasoning with robustness engineering insights (bonus points for comprehensive analysis) -""" - -### BEGIN SOLUTION -# Student response area - instructor will replace this section during grading setup -# This is a manually graded question requiring understanding of OOV handling and system robustness -# Students should demonstrate knowledge of tokenization robustness and multilingual challenges -### END SOLUTION - -# %% [markdown] -""" -## TARGET MODULE SUMMARY: Tokenization - -Congratulations! 
You have successfully implemented comprehensive tokenization systems for language processing: - -### PASS What You Have Built -- **Character Tokenizer**: Simple character-level tokenization with special token handling -- **BPE Tokenizer**: Subword tokenization using Byte Pair Encoding algorithm -- **Vocabulary Management**: Efficient mapping between text and numerical representations -- **Padding & Truncation**: Batch processing utilities for uniform sequence lengths -- **Performance Optimization**: Caching and batch processing for production efficiency -- **🆕 Memory Efficiency**: Optimized string processing and token caching systems -- **🆕 Systems Analysis**: Comprehensive performance profiling and scaling analysis - -### PASS Key Learning Outcomes -- **Understanding**: How text becomes numbers that neural networks can process -- **Implementation**: Built character and subword tokenizers from scratch -- **Systems Insight**: How tokenization affects model memory, performance, and capabilities -- **Performance Engineering**: Measured and optimized tokenization throughput -- **Production Context**: Understanding real-world tokenization challenges and solutions - -### PASS Technical Mastery -- **Character Tokenization**: Simple but interpretable text processing -- **BPE Algorithm**: Iterative pair merging for subword discovery -- **Vocabulary Trade-offs**: Balancing vocabulary size vs sequence length -- **Memory Optimization**: Efficient caching and batch processing techniques -- **🆕 Performance Analysis**: Measuring tokenization impact on downstream systems - -### PASS Professional Skills Developed -- **Algorithm Implementation**: Building complex text processing systems -- **Performance Engineering**: Optimizing for speed and memory efficiency -- **Systems Thinking**: Understanding tokenization's role in ML pipelines -- **Production Optimization**: Caching, batching, and scalability techniques - -### PASS Ready for Next Steps -Your tokenization systems are now ready 
to power: -- **Embedding Layers**: Converting tokens to dense vector representations -- **Language Models**: Processing text for transformer architectures -- **Production Systems**: Efficient text processing pipelines -- **🧠 Text Understanding**: Foundation for natural language processing - -### LINK Connection to Real ML Systems -Your implementations mirror production systems: -- **GPT Tokenizers**: Modern language models use sophisticated BPE variants -- **SentencePiece**: Unigram language model tokenization used in many systems -- **Hugging Face Tokenizers**: Production-optimized tokenization libraries -- **Industry Applications**: Every language model relies on efficient tokenization - -### TARGET The Power of Text Processing -You have unlocked the bridge between human language and machine understanding: -- **Before**: Text was just strings of characters -- **After**: Text becomes structured numerical sequences for neural networks - -**Next Module**: Embeddings - Converting your tokens into rich vector representations that capture semantic meaning! - -Your tokenization systems are the first step in language understanding. Now let's build the embeddings that give tokens meaning! -""" \ No newline at end of file diff --git a/modules_old/11_embeddings/README.md b/modules_old/11_embeddings/README.md deleted file mode 100644 index e7befaaf..00000000 --- a/modules_old/11_embeddings/README.md +++ /dev/null @@ -1,97 +0,0 @@ -# Module 12: Embeddings - Dense Vector Representations for Language Models - -## Overview -This module implements the embedding systems that convert discrete tokens into rich vector representations for language processing. You'll build embedding layers, positional encoding systems, and understand how embedding choices affect model memory, performance, and language understanding capabilities. 
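The core operation described above — converting discrete token indices into dense vectors — reduces to a table lookup. A minimal NumPy sketch of that idea (illustrative, GPT-2-like sizes; not this module's actual API):

```python
import numpy as np

# Illustrative sizes (GPT-2-like vocabulary), not the module's defaults.
vocab_size, embed_dim = 50_257, 768
table = np.random.randn(vocab_size, embed_dim).astype(np.float32)

# Embedding lookup is fancy indexing: (batch, seq) ids -> (batch, seq, dim) vectors.
token_ids = np.array([[15, 42, 7], [3, 3, 99]])
vectors = table[token_ids]
print(vectors.shape)                  # (2, 3, 768)

# The table dominates memory: vocab_size * embed_dim * 4 bytes (float32).
print(round(table.nbytes / 2**20))    # ~147 MB
```

Note that the lookup itself does no arithmetic; its cost is memory bandwidth, which is why the systems sections below focus on access patterns rather than FLOPs.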
- -## What You'll Learn - -### Core Implementations -- **Embedding Layer**: Learnable lookup table converting token indices to dense vectors -- **Positional Encoding**: Sinusoidal patterns that add position information to sequences -- **Learned Positional Embeddings**: Trainable position representations -- **Memory-Efficient Systems**: Optimized embedding access and memory management - -### ML Systems Concepts -- **Memory Scaling**: How embedding tables scale with vocabulary size and dimensionality -- **Lookup Performance**: Memory bandwidth limitations and cache-friendly access patterns -- **Position Encoding Trade-offs**: Fixed vs learned, extrapolation vs optimization -- **Integration Efficiency**: Embedding pipeline optimization for production systems - -### Performance Engineering -- **Embedding Profiling**: Measuring lookup performance and memory usage -- **Scaling Analysis**: Understanding parameter growth and memory requirements -- **Pipeline Optimization**: Efficient token-to-vector transformation workflows -- **Production Patterns**: Large-scale embedding system design and optimization - -## Key Learning Outcomes - -By completing this module, you'll understand: - -1. **Token-to-Vector Pipeline**: How discrete symbols become continuous representations -2. **Embedding Trade-offs**: Vocabulary size vs embedding dimension vs memory usage -3. **Position Encoding**: How transformers gain position awareness for sequences -4. **Systems Optimization**: Memory-efficient embedding lookup and pipeline design -5. 
**Production Scaling**: How embedding systems scale to billion-parameter models - -## Files in This Module - -- `embeddings_dev.py` - Main implementation with embedding layer and positional encoding -- `embeddings_dev.ipynb` - Jupyter notebook (auto-generated) -- `module.yaml` - Module configuration and metadata -- `README.md` - This documentation file - -## Usage Example - -```python -from tinytorch.core.embeddings import Embedding, PositionalEncoding -from tinytorch.core.tokenization import CharTokenizer - -# Create tokenizer and embedding layer -tokenizer = CharTokenizer() -embedding = Embedding(vocab_size=tokenizer.vocab_size, embedding_dim=256) - -# Add positional encoding -pos_encoding = PositionalEncoding(embedding_dim=256, max_seq_length=512) - -# Process text through complete pipeline -tokens = tokenizer.encode("Hello world!") -embeddings = embedding(tokens) -pos_embeddings = pos_encoding(embeddings) -``` - -## Integration with TinyTorch - -This module exports to `tinytorch.core.embeddings` and provides the vector representation foundation for: -- **Attention mechanisms** (Module 13) - Processing sequence representations -- **Transformer models** (Module 14+) - Complete language model architectures -- **Language understanding** - Rich semantic representations for NLP tasks - -## Systems Engineering Focus - -This module emphasizes the systems engineering aspects of embedding design: - -### Memory Characteristics -- **Embedding table**: O(vocab_size × embedding_dim) parameters -- **GPU memory limits**: Large vocabularies require careful memory management -- **Memory bandwidth**: Embedding lookup is often memory-bandwidth bound -- **Distributed storage**: Large embedding tables may require sharding across devices - -### Performance Considerations -- **Lookup patterns**: Sequential vs random access affects cache performance -- **Batch efficiency**: Larger batches amortize lookup overhead -- **Position encoding**: Sinusoidal (no parameters) vs learned (more 
parameters) -- **Pipeline integration**: Embedding lookup must not bottleneck training throughput - -## Prerequisites -- Module 02: Tensor (for basic tensor operations) -- Module 11: Tokenization (for token-to-index conversion) -- Understanding of lookup tables and vector operations - -## Estimated Time -4-5 hours including implementation, testing, and performance analysis - -## Next Steps -After completing this module, you'll be ready for: -- **Module 13: Attention** - Processing sequences with attention mechanisms -- **Module 14: Transformers** - Complete transformer architecture implementation -- Advanced language model architectures and optimization techniques \ No newline at end of file diff --git a/modules_old/11_embeddings/embeddings_dev.ipynb b/modules_old/11_embeddings/embeddings_dev.ipynb deleted file mode 100644 index 8acd23c8..00000000 --- a/modules_old/11_embeddings/embeddings_dev.ipynb +++ /dev/null @@ -1,1889 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "92420776", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "# Embeddings - Converting Tokens to Dense Vector Representations\n", - "\n", - "Welcome to the Embeddings module! You'll implement the systems that convert discrete tokens into rich vector representations that capture semantic meaning for language models.\n", - "\n", - "## Learning Goals\n", - "- Systems understanding: How embedding tables scale with vocabulary size and affect model memory\n", - "- Core implementation skill: Build embedding layers with efficient lookup operations\n", - "- Pattern recognition: Understand how positional encoding enables sequence understanding\n", - "- Framework connection: See how your implementations match PyTorch's embedding systems\n", - "- Performance insight: Learn how embedding lookup patterns affect cache efficiency and memory bandwidth\n", - "\n", - "## Build → Use → Reflect\n", - "1. **Build**: Embedding layer with lookup table and positional encoding systems\n", - "2. 
**Use**: Transform token sequences into rich vector representations for language processing\n", - "3. **Reflect**: How do embedding choices determine model capacity and computational efficiency?\n", - "\n", - "## What You'll Achieve\n", - "By the end of this module, you'll understand:\n", - "- Deep technical understanding of how discrete tokens become continuous vector representations\n", - "- Practical capability to implement embedding systems that handle large vocabularies efficiently\n", - "- Systems insight into how embedding dimensions affect model capacity and memory usage\n", - "- Performance consideration of how embedding lookup patterns affect training and inference speed\n", - "- Connection to production systems like transformer embedding layers and their optimization techniques\n", - "\n", - "## Systems Reality Check\n", - "💡 **Production Context**: Modern language models have embedding tables with billions of parameters (GPT-3: 50k vocab × 12k dim = 600M embedding params)\n", - "⚡ **Performance Note**: Embedding lookups are memory-bandwidth bound - efficient access patterns are critical for high-throughput training" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3da397b8", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "embeddings-imports", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "#| default_exp core.embeddings\n", - "\n", - "#| export\n", - "import math\n", - "import numpy as np\n", - "import os\n", - "import sys\n", - "from typing import Union, List, Optional, Tuple\n", - "\n", - "# Import our Tensor class - try from package first, then from local module\n", - "try:\n", - " from tinytorch.core.tensor import Tensor\n", - "except ImportError:\n", - " # For development, import from local tensor module\n", - " sys.path.append(os.path.join(os.path.dirname(__file__), '..', '02_tensor'))\n", - " from tensor_dev import Tensor\n", - "\n", 
- "# Try to import tokenization classes\n", - "try:\n", - " from tinytorch.core.tokenization import CharTokenizer, BPETokenizer\n", - "except ImportError:\n", - " # For development, import from local module\n", - " sys.path.append(os.path.join(os.path.dirname(__file__), '..', '11_tokenization'))\n", - " try:\n", - " from tokenization_dev import CharTokenizer, BPETokenizer\n", - " except ImportError:\n", - " # Create minimal mock classes if not available\n", - " class CharTokenizer:\n", - " def __init__(self): \n", - " self.vocab_size = 256\n", - " class BPETokenizer:\n", - " def __init__(self, vocab_size=1000):\n", - " self.vocab_size = vocab_size" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "83e2b76d", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "embeddings-welcome", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "print(\"🎯 TinyTorch Embeddings Module\")\n", - "print(f\"NumPy version: {np.__version__}\")\n", - "print(\"Ready to build embedding systems!\")" - ] - }, - { - "cell_type": "markdown", - "id": "53f64bfc", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 📦 Where This Code Lives in the Final Package\n", - "\n", - "**Learning Side:** You work in `modules/source/12_embeddings/embeddings_dev.py` \n", - "**Building Side:** Code exports to `tinytorch.core.embeddings`\n", - "\n", - "```python\n", - "# Final package structure:\n", - "from tinytorch.core.embeddings import Embedding, PositionalEncoding\n", - "from tinytorch.core.tokenization import CharTokenizer, BPETokenizer # Previous module\n", - "from tinytorch.core.attention import MultiHeadAttention # Next module\n", - "```\n", - "\n", - "**Why this matters:**\n", - "- **Learning:** Focused modules for deep understanding\n", - "- **Production:** Proper organization like PyTorch's `torch.nn.Embedding`\n", - "- **Consistency:** All embedding tools live together in 
`core.embeddings`\n", - "- **Integration:** Works seamlessly with tokenization and attention systems" - ] - }, - { - "cell_type": "markdown", - "id": "43aa5503", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## What are Embeddings?\n", - "\n", - "### The Problem: Discrete to Continuous\n", - "Tokens are discrete symbols, but neural networks work best with continuous vectors:\n", - "```\n", - "Token: 42 → Dense Vector: [0.1, -0.3, 0.8, 0.2, ...]\n", - "```\n", - "\n", - "### Embedding Table (Lookup Table)\n", - "An embedding layer is essentially a learnable lookup table:\n", - "```\n", - "Vocabulary size: 50,000 tokens\n", - "Embedding dimension: 512\n", - "Total parameters: 50,000 × 512 = 25.6M parameters\n", - "```\n", - "\n", - "### Why Embeddings Work\n", - "- **Similarity**: Similar words get similar vectors through training\n", - "- **Composition**: Vector operations capture semantic relationships\n", - "- **Learning**: Gradients update embeddings to improve task performance\n", - "\n", - "### Positional Encoding\n", - "Since transformers lack inherent position awareness, we add positional information:\n", - "```\n", - "Token embedding + Positional encoding = Position-aware representation\n", - "```\n", - "\n", - "### Systems Trade-offs\n", - "- **Embedding dimension**: Higher = more capacity, more memory\n", - "- **Vocabulary size**: Larger = more parameters, better coverage\n", - "- **Lookup efficiency**: Memory access patterns affect performance" - ] - }, - { - "cell_type": "markdown", - "id": "c001050e", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Embedding Layer Implementation\n", - "\n", - "Let's start with the core embedding layer - a learnable lookup table that converts token indices to dense vectors." 
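The lookup-table and positional-encoding arithmetic described above can be sketched in plain NumPy. This is a standalone illustration with the example sizes quoted in the cell (50,000 × 512), independent of the module's `Tensor` class:

```python
import numpy as np

# An embedding layer is a learnable (vocab_size, embedding_dim) matrix.
vocab_size, embedding_dim = 50_000, 512
rng = np.random.default_rng(0)
table = rng.standard_normal((vocab_size, embedding_dim)).astype(np.float32)

# Parameter count and memory match the figures quoted above.
num_params = vocab_size * embedding_dim   # 50,000 x 512 = 25.6M parameters
memory_mb = table.nbytes / (1024 ** 2)    # ~97.7 MB at 4 bytes per float32

# Lookup is NumPy advanced indexing: (batch, seq) indices
# produce (batch, seq, embedding_dim) vectors.
token_ids = np.array([[1, 2, 3], [4, 5, 6]])
vectors = table[token_ids]                # shape (2, 3, 512)

# Sinusoidal positional encoding: sine on even dims, cosine on odd dims.
seq_len = 3
pos = np.arange(seq_len)[:, None]                                   # (seq, 1)
div = np.exp(np.arange(0, embedding_dim, 2) * -(np.log(10000.0) / embedding_dim))
pe = np.zeros((seq_len, embedding_dim), dtype=np.float32)
pe[:, 0::2] = np.sin(pos * div)
pe[:, 1::2] = np.cos(pos * div)

# Token embedding + positional encoding = position-aware representation.
position_aware = vectors + pe[None, :, :]
print(num_params, round(memory_mb, 1), position_aware.shape)
# → 25600000 97.7 (2, 3, 512)
```

Note how the lookup itself allocates no new parameters; all memory lives in `table`, which is why embedding access is memory-bandwidth bound rather than compute bound.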
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e8a0101a", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "embedding-layer", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class Embedding:\n", - " \"\"\"\n", - " Embedding layer that converts token indices to dense vector representations.\n", - " \n", - " This is the foundation of modern language models - a learnable lookup table\n", - " that maps discrete tokens to continuous vectors that capture semantic meaning.\n", - " \"\"\"\n", - " \n", - " def __init__(self, vocab_size: int, embedding_dim: int, \n", - " padding_idx: Optional[int] = None, \n", - " init_type: str = 'uniform'):\n", - " \"\"\"\n", - " Initialize embedding layer with learnable parameters.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Store configuration parameters\n", - " 2. Initialize embedding table with chosen initialization\n", - " 3. Handle special padding token if specified\n", - " 4. 
Set up for gradient tracking (will connect to autograd later)\n", - " \n", - " DESIGN DECISIONS:\n", - " - Embedding table shape: (vocab_size, embedding_dim)\n", - " - Initialization affects training dynamics\n", - " - Padding idx gets zero gradient to stay constant\n", - " \n", - " Args:\n", - " vocab_size: Number of tokens in vocabulary\n", - " embedding_dim: Size of dense vector for each token\n", - " padding_idx: Optional token index that should remain zero\n", - " init_type: Initialization strategy ('uniform', 'normal', 'xavier')\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " self.vocab_size = vocab_size\n", - " self.embedding_dim = embedding_dim\n", - " self.padding_idx = padding_idx\n", - " self.init_type = init_type\n", - " \n", - " # Initialize embedding table based on strategy\n", - " if init_type == 'uniform':\n", - " # Uniform initialization in [-1/sqrt(dim), 1/sqrt(dim)]\n", - " bound = 1.0 / math.sqrt(embedding_dim)\n", - " self.weight = Tensor(np.random.uniform(-bound, bound, (vocab_size, embedding_dim)))\n", - " elif init_type == 'normal':\n", - " # Normal initialization with std=1/sqrt(dim)\n", - " std = 1.0 / math.sqrt(embedding_dim)\n", - " self.weight = Tensor(np.random.normal(0, std, (vocab_size, embedding_dim)))\n", - " elif init_type == 'xavier':\n", - " # Xavier/Glorot initialization\n", - " bound = math.sqrt(6.0 / (vocab_size + embedding_dim))\n", - " self.weight = Tensor(np.random.uniform(-bound, bound, (vocab_size, embedding_dim)))\n", - " else:\n", - " raise ValueError(f\"Unknown init_type: {init_type}\")\n", - " \n", - " # Set padding token to zero if specified\n", - " if padding_idx is not None:\n", - " self.weight.data[padding_idx] = 0.0\n", - " \n", - " # Track parameters for optimization\n", - " self.parameters = [self.weight]\n", - " ### END SOLUTION\n", - " \n", - " def forward(self, input_ids: Union[Tensor, List[int], np.ndarray]) -> Tensor:\n", - " \"\"\"\n", - " Look up embeddings for input token indices.\n", - " \n", - " 
TODO: Implement embedding lookup.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Convert input to numpy array if needed\n", - " 2. Validate token indices are within vocabulary\n", - " 3. Use advanced indexing to look up embeddings\n", - " 4. Return tensor with shape (batch_size, seq_len, embedding_dim)\n", - " \n", - " EXAMPLE:\n", - " embed = Embedding(vocab_size=100, embedding_dim=64)\n", - " tokens = Tensor([[1, 2, 3], [4, 5, 6]]) # Shape: (2, 3)\n", - " embeddings = embed.forward(tokens) # Shape: (2, 3, 64)\n", - " \n", - " IMPLEMENTATION HINTS:\n", - " - Handle both Tensor and list inputs\n", - " - Use numpy advanced indexing: weight[indices]\n", - " - Preserve batch and sequence dimensions\n", - " \n", - " Args:\n", - " input_ids: Token indices with shape (batch_size, seq_len) or (seq_len,)\n", - " \n", - " Returns:\n", - " Embeddings with shape (*input_shape, embedding_dim)\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Convert input to numpy array\n", - " if isinstance(input_ids, Tensor):\n", - " indices = input_ids.data\n", - " elif isinstance(input_ids, list):\n", - " indices = np.array(input_ids)\n", - " else:\n", - " indices = input_ids\n", - " \n", - " # Validate indices\n", - " indices = indices.astype(int)\n", - " if np.any(indices < 0) or np.any(indices >= self.vocab_size):\n", - " raise ValueError(f\"Token indices must be in range [0, {self.vocab_size})\")\n", - " \n", - " # Look up embeddings using advanced indexing\n", - " # self.weight.data has shape (vocab_size, embedding_dim)\n", - " # indices has shape (...), result has shape (..., embedding_dim)\n", - " embeddings = self.weight.data[indices]\n", - " \n", - " return Tensor(embeddings)\n", - " ### END SOLUTION\n", - " \n", - " def __call__(self, input_ids: Union[Tensor, List[int], np.ndarray]) -> Tensor:\n", - " \"\"\"Make the layer callable.\"\"\"\n", - " return self.forward(input_ids)\n", - " \n", - " def get_memory_usage(self):\n", - " \"\"\"\n", - " Calculate memory usage of 
embedding table.\n", - " \n", - " This function is PROVIDED to show memory analysis.\n", - " \"\"\"\n", - " # Embedding table memory\n", - " weight_memory_mb = self.weight.data.nbytes / (1024 * 1024)\n", - " \n", - " # Memory per token (bytes per element read from the actual array dtype)\n", - " memory_per_token_kb = (self.embedding_dim * self.weight.data.itemsize) / 1024\n", - " \n", - " return {\n", - " 'total_memory_mb': weight_memory_mb,\n", - " 'memory_per_token_kb': memory_per_token_kb,\n", - " 'total_parameters': self.vocab_size * self.embedding_dim,\n", - " 'vocab_size': self.vocab_size,\n", - " 'embedding_dim': self.embedding_dim\n", - " }" - ] - }, - { - "cell_type": "markdown", - "id": "54b2e152", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Test Your Embedding Layer Implementation\n", - "\n", - "Once you implement the Embedding forward method above, run this cell to test it:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "aee0534c", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "test-embedding-immediate", - "locked": true, - "points": 15, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_unit_embedding_layer():\n", - " \"\"\"Unit test for the embedding layer.\"\"\"\n", - " print(\"🔬 Unit Test: Embedding Layer...\")\n", - " \n", - " # Create embedding layer\n", - " vocab_size = 100\n", - " embedding_dim = 64\n", - " embed = Embedding(vocab_size=vocab_size, embedding_dim=embedding_dim)\n", - " \n", - " # Test single token\n", - " single_token = [5]\n", - " single_embedding = embed.forward(single_token)\n", - " assert single_embedding.shape == (1, embedding_dim), f\"Expected shape (1, {embedding_dim}), got {single_embedding.shape}\"\n", - " \n", - " # Test sequence of tokens\n", - " token_sequence = [1, 2, 3, 5, 10]\n", - " sequence_embeddings = embed.forward(token_sequence)\n", - " expected_shape = (len(token_sequence),
embedding_dim)\n", - " assert sequence_embeddings.shape == expected_shape, f\"Expected shape {expected_shape}, got {sequence_embeddings.shape}\"\n", - " \n", - " # Test batch of sequences\n", - " batch_tokens = [[1, 2, 3], [4, 5, 6]]\n", - " batch_embeddings = embed.forward(batch_tokens)\n", - " assert batch_embeddings.shape == (2, 3, embedding_dim), f\"Expected shape (2, 3, {embedding_dim}), got {batch_embeddings.shape}\"\n", - " \n", - " # Test with Tensor input\n", - " tensor_input = Tensor(np.array([[7, 8, 9], [10, 11, 12]]))\n", - " tensor_embeddings = embed.forward(tensor_input)\n", - " assert tensor_embeddings.shape == (2, 3, embedding_dim), \"Should handle Tensor input\"\n", - " \n", - " # Test embedding lookup consistency\n", - " token_5_embed_1 = embed.forward([5])\n", - " token_5_embed_2 = embed.forward([5])\n", - " assert np.allclose(token_5_embed_1.data, token_5_embed_2.data), \"Same token should give same embedding\"\n", - " \n", - " # Test different tokens give different embeddings (with high probability)\n", - " token_1_embed = embed.forward([1])\n", - " token_2_embed = embed.forward([2])\n", - " assert not np.allclose(token_1_embed.data, token_2_embed.data, atol=1e-3), \"Different tokens should give different embeddings\"\n", - " \n", - " # Test initialization bounds\n", - " assert np.all(np.abs(embed.weight.data) <= 1.0), \"Uniform initialization should be bounded\"\n", - " \n", - " # Test padding token (if specified)\n", - " embed_with_padding = Embedding(vocab_size=50, embedding_dim=32, padding_idx=0)\n", - " assert np.allclose(embed_with_padding.weight.data[0], 0.0), \"Padding token should be zero\"\n", - " \n", - " # Test parameter tracking\n", - " assert len(embed.parameters) == 1, \"Should track embedding weight parameter\"\n", - " assert embed.parameters[0] is embed.weight, \"Should track weight tensor\"\n", - " \n", - " # Test memory usage calculation\n", - " memory_stats = embed.get_memory_usage()\n", - " assert 'total_memory_mb' in 
memory_stats, \"Should provide memory statistics\"\n", - " assert memory_stats['total_parameters'] == vocab_size * embedding_dim, \"Should calculate parameters correctly\"\n", - " \n", - " print(\"✅ Embedding layer tests passed!\")\n", - " print(f\"✅ Handles various input shapes correctly\")\n", - " print(f\"✅ Consistent lookup and parameter tracking\")\n", - " print(f\"✅ Memory usage: {memory_stats['total_memory_mb']:.2f}MB\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "e75ef30e", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Positional Encoding Implementation\n", - "\n", - "Transformers need explicit position information since attention is position-agnostic. Let's implement sinusoidal positional encoding used in the original transformer." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "23bcb128", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "positional-encoding", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class PositionalEncoding:\n", - " \"\"\"\n", - " Sinusoidal positional encoding that adds position information to embeddings.\n", - " \n", - " Uses sine and cosine functions of different frequencies to create\n", - " unique position representations that the model can learn to use.\n", - " \"\"\"\n", - " \n", - " def __init__(self, embedding_dim: int, max_seq_length: int = 5000, \n", - " dropout: float = 0.0):\n", - " \"\"\"\n", - " Initialize positional encoding with sinusoidal patterns.\n", - " \n", - " TODO: Implement positional encoding initialization.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Create position matrix (max_seq_length, embedding_dim)\n", - " 2. 
For each position and dimension:\n", - " - Calculate frequency based on dimension\n", - " - Apply sine to even dimensions, cosine to odd dimensions\n", - " 3. Store the precomputed positional encodings\n", - " \n", - " MATHEMATICAL FOUNDATION:\n", - " PE(pos, 2i) = sin(pos / 10000^(2i/d_model))\n", - " PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))\n", - " \n", - " Where:\n", - " - pos = position in sequence\n", - " - i = dimension index\n", - " - d_model = embedding_dim\n", - " \n", - " Args:\n", - " embedding_dim: Dimension of embeddings (must be even)\n", - " max_seq_length: Maximum sequence length to precompute\n", - " dropout: Dropout rate (for future use)\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " self.embedding_dim = embedding_dim\n", - " self.max_seq_length = max_seq_length\n", - " self.dropout = dropout\n", - " \n", - " # Create positional encoding matrix\n", - " pe = np.zeros((max_seq_length, embedding_dim))\n", - " \n", - " # Create position vector (0, 1, 2, ..., max_seq_length-1)\n", - " position = np.arange(0, max_seq_length).reshape(-1, 1) # Shape: (max_seq_length, 1)\n", - " \n", - " # Create dimension indices for frequency calculation\n", - " # div_term calculates 10000^(2i/d_model) for i = 0, 1, 2, ...\n", - " div_term = np.exp(np.arange(0, embedding_dim, 2) * \n", - " -(math.log(10000.0) / embedding_dim))\n", - " \n", - " # Apply sine to even dimensions (0, 2, 4, ...)\n", - " pe[:, 0::2] = np.sin(position * div_term)\n", - " \n", - " # Apply cosine to odd dimensions (1, 3, 5, ...)\n", - " if embedding_dim % 2 == 1:\n", - " # Handle odd embedding_dim - cosine gets one less dimension\n", - " pe[:, 1::2] = np.cos(position * div_term[:-1])\n", - " else:\n", - " pe[:, 1::2] = np.cos(position * div_term)\n", - " \n", - " # Store as tensor\n", - " self.pe = Tensor(pe)\n", - " ### END SOLUTION\n", - " \n", - " def forward(self, embeddings: Tensor) -> Tensor:\n", - " \"\"\"\n", - " Add positional encoding to embeddings.\n", - " \n", - " TODO: 
Implement positional encoding addition.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Get sequence length from embeddings shape\n", - " 2. Extract relevant positional encodings\n", - " 3. Add positional encodings to embeddings\n", - " 4. Return position-aware embeddings\n", - " \n", - " EXAMPLE:\n", - " pos_enc = PositionalEncoding(embedding_dim=64)\n", - " embeddings = Tensor(np.random.randn(2, 10, 64)) # (batch, seq, dim)\n", - " pos_embeddings = pos_enc.forward(embeddings)\n", - " \n", - " Args:\n", - " embeddings: Input embeddings with shape (batch_size, seq_len, embedding_dim)\n", - " \n", - " Returns:\n", - " Position-aware embeddings with same shape as input\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Get sequence length from embeddings\n", - " if len(embeddings.shape) == 3:\n", - " batch_size, seq_length, embed_dim = embeddings.shape\n", - " elif len(embeddings.shape) == 2:\n", - " seq_length, embed_dim = embeddings.shape\n", - " batch_size = None\n", - " else:\n", - " raise ValueError(f\"Expected 2D or 3D embeddings, got shape {embeddings.shape}\")\n", - " \n", - " if embed_dim != self.embedding_dim:\n", - " raise ValueError(f\"Embedding dim mismatch: expected {self.embedding_dim}, got {embed_dim}\")\n", - " \n", - " if seq_length > self.max_seq_length:\n", - " raise ValueError(f\"Sequence length {seq_length} exceeds max {self.max_seq_length}\")\n", - " \n", - " # Extract positional encodings for this sequence length\n", - " position_encodings = self.pe.data[:seq_length, :]\n", - " \n", - " # Add positional encodings to embeddings\n", - " if batch_size is not None:\n", - " # Broadcast positional encodings across batch dimension\n", - " # embeddings: (batch, seq, dim) + position_encodings: (seq, dim)\n", - " result = embeddings.data + position_encodings[np.newaxis, :, :]\n", - " else:\n", - " # embeddings: (seq, dim) + position_encodings: (seq, dim)\n", - " result = embeddings.data + position_encodings\n", - " \n", - " return 
Tensor(result)\n", - " ### END SOLUTION\n", - " \n", - " def __call__(self, embeddings: Tensor) -> Tensor:\n", - " \"\"\"Make the class callable.\"\"\"\n", - " return self.forward(embeddings)\n", - " \n", - " def visualize_encoding(self, seq_length: int = 100, dims_to_show: int = 10) -> None:\n", - " \"\"\"\n", - " Visualize positional encoding patterns.\n", - " \n", - " This function is PROVIDED to show encoding patterns.\n", - " \"\"\"\n", - " print(f\"📊 POSITIONAL ENCODING VISUALIZATION\")\n", - " print(f\"Sequence length: {seq_length}, Dimensions shown: {dims_to_show}\")\n", - " print(\"=\" * 60)\n", - " \n", - " # Get subset of positional encodings\n", - " pe_subset = self.pe.data[:seq_length, :dims_to_show]\n", - " \n", - " # Show patterns for first few positions\n", - " print(\"First 10 positions, first 10 dimensions:\")\n", - " print(\"Pos\", end=\"\")\n", - " for d in range(min(dims_to_show, 10)):\n", - " print(f\" Dim{d:2d}\", end=\"\")\n", - " print()\n", - " \n", - " for pos in range(min(seq_length, 10)):\n", - " print(f\"{pos:3d}\", end=\"\")\n", - " for d in range(min(dims_to_show, 10)):\n", - " print(f\"{pe_subset[pos, d]:8.3f}\", end=\"\")\n", - " print()\n", - " \n", - " # Show frequency analysis\n", - " print(f\"\\n📈 FREQUENCY ANALYSIS:\")\n", - " print(\"Even dimensions (sine): Higher frequencies at early dimensions, decreasing with dimension index\")\n", - " print(\"Odd dimensions (cosine): Same frequencies, phase-shifted\")\n", - " \n", - " # Calculate frequency range\n", - " min_freq = 1.0 / 10000\n", - " max_freq = 1.0\n", - " print(f\"Frequency range: {min_freq:.6f} to {max_freq:.6f}\")" - ] - }, - { - "cell_type": "markdown", - "id": "ae0b3f88", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Test Your Positional Encoding Implementation\n", - "\n", - "Once you implement the PositionalEncoding methods above, run this cell to test it:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4fe34d59", -
"metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "test-positional-encoding-immediate", - "locked": true, - "points": 15, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_unit_positional_encoding():\n", - " \"\"\"Unit test for positional encoding.\"\"\"\n", - " print(\"🔬 Unit Test: Positional Encoding...\")\n", - " \n", - " # Create positional encoding\n", - " embedding_dim = 64\n", - " max_seq_length = 100\n", - " pos_enc = PositionalEncoding(embedding_dim=embedding_dim, max_seq_length=max_seq_length)\n", - " \n", - " # Test initialization\n", - " assert pos_enc.pe.shape == (max_seq_length, embedding_dim), f\"Expected shape ({max_seq_length}, {embedding_dim})\"\n", - " \n", - " # Test that different positions have different encodings\n", - " pos_0 = pos_enc.pe.data[0]\n", - " pos_1 = pos_enc.pe.data[1]\n", - " assert not np.allclose(pos_0, pos_1), \"Different positions should have different encodings\"\n", - " \n", - " # Test sine/cosine pattern\n", - " # Even dimensions should use sine, odd should use cosine\n", - " # This is hard to test directly, but we can check the encoding is reasonable\n", - " assert not np.any(np.isnan(pos_enc.pe.data)), \"Positional encodings should not contain NaN\"\n", - " assert not np.any(np.isinf(pos_enc.pe.data)), \"Positional encodings should not contain inf\"\n", - " \n", - " # Test forward pass with 3D input (batch, seq, dim)\n", - " batch_size = 2\n", - " seq_length = 10\n", - " embeddings = Tensor(np.random.randn(batch_size, seq_length, embedding_dim))\n", - " \n", - " pos_embeddings = pos_enc.forward(embeddings)\n", - " assert pos_embeddings.shape == embeddings.shape, \"Output shape should match input shape\"\n", - " \n", - " # Test forward pass with 2D input (seq, dim)\n", - " embeddings_2d = Tensor(np.random.randn(seq_length, embedding_dim))\n", - " pos_embeddings_2d = pos_enc.forward(embeddings_2d)\n", - " assert 
pos_embeddings_2d.shape == embeddings_2d.shape, \"2D output shape should match input\"\n", - " \n", - " # Test that positional encoding is actually added\n", - " original_mean = np.mean(embeddings.data)\n", - " pos_mean = np.mean(pos_embeddings.data)\n", - " assert abs(pos_mean - original_mean) > 1e-6, \"Positional encoding should change the embeddings\"\n", - " \n", - " # Test sequence length validation\n", - " try:\n", - " long_embeddings = Tensor(np.random.randn(max_seq_length + 10, embedding_dim))\n", - " pos_enc.forward(long_embeddings)\n", - " assert False, \"Should raise error for sequence longer than max_seq_length\"\n", - " except ValueError:\n", - " pass # Expected behavior\n", - " \n", - " # Test embedding dimension validation\n", - " try:\n", - " wrong_dim_embeddings = Tensor(np.random.randn(seq_length, embedding_dim + 10))\n", - " pos_enc.forward(wrong_dim_embeddings)\n", - " assert False, \"Should raise error for wrong embedding dimension\"\n", - " except ValueError:\n", - " pass # Expected behavior\n", - " \n", - " # Test deterministic behavior\n", - " pos_embeddings_1 = pos_enc.forward(embeddings)\n", - " pos_embeddings_2 = pos_enc.forward(embeddings)\n", - " assert np.allclose(pos_embeddings_1.data, pos_embeddings_2.data), \"Should be deterministic\"\n", - " \n", - " # Test callable interface\n", - " pos_embeddings_callable = pos_enc(embeddings)\n", - " assert np.allclose(pos_embeddings_callable.data, pos_embeddings.data), \"Callable interface should work\"\n", - " \n", - " print(\"✅ Positional encoding tests passed!\")\n", - " print(f\"✅ Handles 2D and 3D inputs correctly\")\n", - " print(f\"✅ Proper validation and deterministic behavior\")\n", - " print(f\"✅ Encoding dimension: {embedding_dim}, Max length: {max_seq_length}\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "9bc6f623", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Learned 
Positional Embeddings\n", - "\n", - "Some models use learned positional embeddings instead of fixed sinusoidal ones. Let's implement this alternative approach:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4c50a89a", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "learned-positional", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class LearnedPositionalEmbedding:\n", - " \"\"\"\n", - " Learned positional embeddings - another embedding table for positions.\n", - " \n", - " Unlike sinusoidal encoding, these are learned parameters that\n", - " the model optimizes during training. Used in models like BERT.\n", - " \"\"\"\n", - " \n", - " def __init__(self, max_seq_length: int, embedding_dim: int):\n", - " \"\"\"\n", - " Initialize learned positional embeddings.\n", - " \n", - " TODO: Implement learned positional embedding initialization.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Create embedding layer for positions (0, 1, 2, ..., max_seq_length-1)\n", - " 2. Initialize with small random values\n", - " 3. 
Set up parameter tracking for optimization\n", - " \n", - " This is essentially an Embedding layer where the \"vocabulary\"\n", - " is the set of possible positions in a sequence.\n", - " \n", - " Args:\n", - " max_seq_length: Maximum sequence length supported\n", - " embedding_dim: Dimension of position embeddings\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " self.max_seq_length = max_seq_length\n", - " self.embedding_dim = embedding_dim\n", - " \n", - " # Create learned positional embedding table\n", - " # This is like an embedding layer for positions\n", - " self.position_embedding = Embedding(\n", - " vocab_size=max_seq_length,\n", - " embedding_dim=embedding_dim,\n", - " init_type='normal'\n", - " )\n", - " \n", - " # Track parameters for optimization\n", - " self.parameters = self.position_embedding.parameters\n", - " ### END SOLUTION\n", - " \n", - " def forward(self, embeddings: Tensor) -> Tensor:\n", - " \"\"\"\n", - " Add learned positional embeddings to input embeddings.\n", - " \n", - " TODO: Implement learned positional embedding addition.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Get sequence length from input shape\n", - " 2. Create position indices [0, 1, 2, ..., seq_length-1]\n", - " 3. Look up position embeddings using position indices\n", - " 4. 
Add position embeddings to input embeddings\n", - " \n", - " EXAMPLE:\n", - " learned_pos = LearnedPositionalEmbedding(max_seq_length=100, embedding_dim=64)\n", - " embeddings = Tensor(np.random.randn(2, 10, 64)) # (batch, seq, dim)\n", - " pos_embeddings = learned_pos.forward(embeddings)\n", - " \n", - " Args:\n", - " embeddings: Input embeddings with shape (batch_size, seq_len, embedding_dim)\n", - " \n", - " Returns:\n", - " Position-aware embeddings with same shape as input\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Get sequence length from embeddings\n", - " if len(embeddings.shape) == 3:\n", - " batch_size, seq_length, embed_dim = embeddings.shape\n", - " elif len(embeddings.shape) == 2:\n", - " seq_length, embed_dim = embeddings.shape\n", - " batch_size = None\n", - " else:\n", - " raise ValueError(f\"Expected 2D or 3D embeddings, got shape {embeddings.shape}\")\n", - " \n", - " if embed_dim != self.embedding_dim:\n", - " raise ValueError(f\"Embedding dim mismatch: expected {self.embedding_dim}, got {embed_dim}\")\n", - " \n", - " if seq_length > self.max_seq_length:\n", - " raise ValueError(f\"Sequence length {seq_length} exceeds max {self.max_seq_length}\")\n", - " \n", - " # Create position indices [0, 1, 2, ..., seq_length-1]\n", - " position_ids = list(range(seq_length))\n", - " \n", - " # Look up position embeddings\n", - " position_embeddings = self.position_embedding.forward(position_ids)\n", - " \n", - " # Add position embeddings to input embeddings\n", - " if batch_size is not None:\n", - " # Broadcast across batch dimension\n", - " result = embeddings.data + position_embeddings.data[np.newaxis, :, :]\n", - " else:\n", - " result = embeddings.data + position_embeddings.data\n", - " \n", - " return Tensor(result)\n", - " ### END SOLUTION\n", - " \n", - " def __call__(self, embeddings: Tensor) -> Tensor:\n", - " \"\"\"Make the class callable.\"\"\"\n", - " return self.forward(embeddings)" - ] - }, - { - "cell_type": "markdown", - "id": 
"4811c8f5", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Test Your Learned Positional Embedding Implementation\n", - "\n", - "Once you implement the LearnedPositionalEmbedding methods above, run this cell to test it:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c8509e91", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "test-learned-positional-immediate", - "locked": true, - "points": 10, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_unit_learned_positional_embedding():\n", - " \"\"\"Unit test for learned positional embeddings.\"\"\"\n", - " print(\"🔬 Unit Test: Learned Positional Embeddings...\")\n", - " \n", - " # Create learned positional embedding\n", - " max_seq_length = 50\n", - " embedding_dim = 32\n", - " learned_pos = LearnedPositionalEmbedding(max_seq_length=max_seq_length, embedding_dim=embedding_dim)\n", - " \n", - " # Test initialization\n", - " assert learned_pos.position_embedding.vocab_size == max_seq_length, \"Should have position for each sequence position\"\n", - " assert learned_pos.position_embedding.embedding_dim == embedding_dim, \"Should match embedding dimension\"\n", - " \n", - " # Test parameter tracking\n", - " assert len(learned_pos.parameters) == 1, \"Should track position embedding parameters\"\n", - " assert learned_pos.parameters[0] is learned_pos.position_embedding.weight, \"Should track weight tensor\"\n", - " \n", - " # Test forward pass with 3D input\n", - " batch_size = 3\n", - " seq_length = 10\n", - " embeddings = Tensor(np.random.randn(batch_size, seq_length, embedding_dim))\n", - " \n", - " pos_embeddings = learned_pos.forward(embeddings)\n", - " assert pos_embeddings.shape == embeddings.shape, \"Output shape should match input shape\"\n", - " \n", - " # Test forward pass with 2D input\n", - " embeddings_2d = 
Tensor(np.random.randn(seq_length, embedding_dim))\n", - " pos_embeddings_2d = learned_pos.forward(embeddings_2d)\n", - " assert pos_embeddings_2d.shape == embeddings_2d.shape, \"2D output shape should match input\"\n", - " \n", - " # Test that position embeddings are actually added\n", - " original_mean = np.mean(embeddings.data)\n", - " pos_mean = np.mean(pos_embeddings.data)\n", - " assert abs(pos_mean - original_mean) > 1e-6, \"Position embeddings should change the input\"\n", - " \n", - " # Test that shared positions receive the same learned position embedding\n", - " short_embeddings = Tensor(np.random.randn(batch_size, 5, embedding_dim))\n", - " long_embeddings = Tensor(np.random.randn(batch_size, 15, embedding_dim))\n", - " \n", - " short_pos = learned_pos.forward(short_embeddings)\n", - " long_pos = learned_pos.forward(long_embeddings)\n", - " \n", - " # Subtract the inputs to isolate the added position component; the first 5 positions should match\n", - " short_delta = short_pos.data - short_embeddings.data\n", - " long_delta = long_pos.data - long_embeddings.data\n", - " assert np.allclose(short_delta, long_delta[:, :5, :]), \"Same positions should have same embeddings\"\n", - " \n", - " # Test sequence length validation\n", - " try:\n", - " too_long_embeddings = Tensor(np.random.randn(batch_size, max_seq_length + 5, embedding_dim))\n", - " learned_pos.forward(too_long_embeddings)\n", - " assert False, \"Should raise error for sequence longer than max_seq_length\"\n", - " except ValueError:\n", - " pass # Expected behavior\n", - " \n", - " # Test embedding dimension validation\n", - " try:\n", - " wrong_dim_embeddings = Tensor(np.random.randn(batch_size, seq_length, embedding_dim + 5))\n", - " learned_pos.forward(wrong_dim_embeddings)\n", - " assert False, \"Should raise error for wrong embedding dimension\"\n", - " except ValueError:\n", - " pass # Expected behavior\n", - " \n", - " # Test callable interface\n", - " pos_embeddings_callable = learned_pos(embeddings)\n", - " assert np.allclose(pos_embeddings_callable.data, pos_embeddings.data), \"Callable interface should work\"\n", - " \n", - " print(\"✅ Learned positional
embedding tests passed!\")\n", - " print(f\"✅ Parameter tracking and optimization ready\")\n", - " print(f\"✅ Handles various input shapes correctly\")\n", - " print(f\"✅ Max sequence length: {max_seq_length}, Embedding dim: {embedding_dim}\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "edbe4f83", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🎯 ML Systems: Performance Analysis & Embedding Scaling\n", - "\n", - "Now let's develop systems engineering skills by analyzing embedding performance and understanding how embedding choices affect downstream ML system efficiency.\n", - "\n", - "### **Learning Outcome**: *\"I understand how embedding table size affects model memory, training speed, and language understanding capacity\"*" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8ad087b6", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "embedding-profiler", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "import time\n", - "from typing import Dict, List\n", - "\n", - "class EmbeddingProfiler:\n", - " \"\"\"\n", - " Performance profiling toolkit for embedding systems.\n", - " \n", - " Helps ML engineers understand memory usage, lookup performance,\n", - " and scaling characteristics of embedding layers.\n", - " \"\"\"\n", - " \n", - " def __init__(self):\n", - " self.results = {}\n", - " \n", - " def measure_lookup_performance(self, embedding_layer: Embedding, \n", - " batch_sizes: List[int], seq_lengths: List[int]) -> Dict:\n", - " \"\"\"\n", - " Measure embedding lookup performance across different batch sizes and sequence lengths.\n", - " \n", - " TODO: Implement embedding lookup performance measurement.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Create test token indices for each (batch_size, seq_length) combination\n", - " 2.
Measure time to perform embedding lookup\n", - " 3. Calculate throughput metrics (tokens/second, memory bandwidth)\n", - " 4. Return comprehensive performance analysis\n", - " \n", - " METRICS TO CALCULATE:\n", - " - Lookup time (milliseconds)\n", - " - Tokens per second throughput\n", - " - Memory bandwidth utilization\n", - " - Scaling patterns with batch size and sequence length\n", - " \n", - " Args:\n", - " embedding_layer: Embedding layer to test\n", - " batch_sizes: List of batch sizes to test\n", - " seq_lengths: List of sequence lengths to test\n", - " \n", - " Returns:\n", - " Dictionary with performance metrics for each configuration\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " results = {}\n", - " vocab_size = embedding_layer.vocab_size\n", - " \n", - " for batch_size in batch_sizes:\n", - " for seq_length in seq_lengths:\n", - " # Create random token indices\n", - " token_indices = np.random.randint(0, vocab_size, (batch_size, seq_length))\n", - " \n", - " # Measure lookup performance\n", - " start_time = time.time()\n", - " embeddings = embedding_layer.forward(token_indices)\n", - " end_time = time.time()\n", - " \n", - " # Calculate metrics\n", - " lookup_time_ms = (end_time - start_time) * 1000\n", - " total_tokens = batch_size * seq_length\n", - " tokens_per_second = total_tokens / (end_time - start_time) if end_time > start_time else 0\n", - " \n", - " # Memory calculations\n", - " input_memory_mb = token_indices.nbytes / (1024 * 1024)\n", - " output_memory_mb = embeddings.data.nbytes / (1024 * 1024)\n", - " memory_bandwidth_mb_s = (input_memory_mb + output_memory_mb) / (end_time - start_time) if end_time > start_time else 0\n", - " \n", - " config_key = f\"batch_{batch_size}_seq_{seq_length}\"\n", - " results[config_key] = {\n", - " 'batch_size': batch_size,\n", - " 'seq_length': seq_length,\n", - " 'total_tokens': total_tokens,\n", - " 'lookup_time_ms': lookup_time_ms,\n", - " 'tokens_per_second': tokens_per_second,\n", - " 
'input_memory_mb': input_memory_mb,\n", - " 'output_memory_mb': output_memory_mb,\n", - " 'memory_bandwidth_mb_s': memory_bandwidth_mb_s,\n", - " 'time_per_token_us': lookup_time_ms * 1000 / total_tokens if total_tokens > 0 else 0\n", - " }\n", - " \n", - " return results\n", - " ### END SOLUTION\n", - " \n", - " def analyze_memory_scaling(self, vocab_sizes: List[int], embedding_dims: List[int]) -> Dict:\n", - " \"\"\"\n", - " Analyze how embedding memory usage scales with vocabulary size and embedding dimension.\n", - " \n", - " This function is PROVIDED to show memory scaling analysis.\n", - " \"\"\"\n", - " print(\"📊 EMBEDDING MEMORY SCALING ANALYSIS\")\n", - " print(\"=\" * 60)\n", - " \n", - " scaling_results = {}\n", - " \n", - " print(f\"{'Vocab Size':<12} {'Embed Dim':<10} {'Parameters':<12} {'Memory (MB)':<12} {'Lookup Time':<12}\")\n", - " print(\"-\" * 70)\n", - " \n", - " for vocab_size in vocab_sizes:\n", - " for embed_dim in embedding_dims:\n", - " # Create embedding layer\n", - " embed = Embedding(vocab_size=vocab_size, embedding_dim=embed_dim)\n", - " \n", - " # Calculate memory usage\n", - " memory_stats = embed.get_memory_usage()\n", - " total_memory_mb = memory_stats['total_memory_mb']\n", - " total_params = memory_stats['total_parameters']\n", - " \n", - " # Measure lookup time\n", - " test_tokens = np.random.randint(0, vocab_size, (32, 64)) # Standard batch\n", - " start_time = time.time()\n", - " _ = embed.forward(test_tokens)\n", - " lookup_time_ms = (time.time() - start_time) * 1000\n", - " \n", - " # Store results\n", - " config_key = f\"vocab_{vocab_size}_dim_{embed_dim}\"\n", - " scaling_results[config_key] = {\n", - " 'vocab_size': vocab_size,\n", - " 'embedding_dim': embed_dim,\n", - " 'total_parameters': total_params,\n", - " 'memory_mb': total_memory_mb,\n", - " 'lookup_time_ms': lookup_time_ms\n", - " }\n", - " \n", - " print(f\"{vocab_size:<12,} {embed_dim:<10} {total_params:<12,} {total_memory_mb:<12.2f} 
{lookup_time_ms:<12.2f}\")\n", - " \n", - " # Analyze scaling patterns\n", - " print(f\"\\n📈 SCALING INSIGHTS:\")\n", - " if len(vocab_sizes) > 1 and len(embedding_dims) > 1:\n", - " # Compare scaling with vocab size (fixed embedding dim)\n", - " fixed_dim = embedding_dims[0]\n", - " small_vocab = min(vocab_sizes)\n", - " large_vocab = max(vocab_sizes)\n", - " \n", - " small_key = f\"vocab_{small_vocab}_dim_{fixed_dim}\"\n", - " large_key = f\"vocab_{large_vocab}_dim_{fixed_dim}\"\n", - " \n", - " if small_key in scaling_results and large_key in scaling_results:\n", - " vocab_ratio = large_vocab / small_vocab\n", - " memory_ratio = scaling_results[large_key]['memory_mb'] / scaling_results[small_key]['memory_mb']\n", - " print(f\" Vocabulary scaling: {vocab_ratio:.1f}x vocab → {memory_ratio:.1f}x memory (Linear)\")\n", - " \n", - " # Compare scaling with embedding dim (fixed vocab)\n", - " fixed_vocab = vocab_sizes[0]\n", - " small_dim = min(embedding_dims)\n", - " large_dim = max(embedding_dims)\n", - " \n", - " small_key = f\"vocab_{fixed_vocab}_dim_{small_dim}\"\n", - " large_key = f\"vocab_{fixed_vocab}_dim_{large_dim}\"\n", - " \n", - " if small_key in scaling_results and large_key in scaling_results:\n", - " dim_ratio = large_dim / small_dim\n", - " memory_ratio = scaling_results[large_key]['memory_mb'] / scaling_results[small_key]['memory_mb']\n", - " print(f\" Dimension scaling: {dim_ratio:.1f}x dim → {memory_ratio:.1f}x memory (Linear)\")\n", - " \n", - " return scaling_results\n", - " \n", - " def compare_positional_encodings(self, seq_length: int = 100, embedding_dim: int = 256) -> Dict:\n", - " \"\"\"\n", - " Compare performance and characteristics of different positional encoding approaches.\n", - " \n", - " This function is PROVIDED to show positional encoding comparison.\n", - " \"\"\"\n", - " print(f\"\\n🔍 POSITIONAL ENCODING COMPARISON\")\n", - " print(\"=\" * 50)\n", - " \n", - " # Create test embeddings\n", - " batch_size = 16\n", - " embeddings = 
Tensor(np.random.randn(batch_size, seq_length, embedding_dim))\n", - " \n", - " # Test sinusoidal positional encoding\n", - " sinusoidal_pe = PositionalEncoding(embedding_dim=embedding_dim, max_seq_length=seq_length*2)\n", - " start_time = time.time()\n", - " sin_result = sinusoidal_pe.forward(embeddings)\n", - " sin_time = (time.time() - start_time) * 1000\n", - " \n", - " # Test learned positional embedding\n", - " learned_pe = LearnedPositionalEmbedding(max_seq_length=seq_length*2, embedding_dim=embedding_dim)\n", - " start_time = time.time()\n", - " learned_result = learned_pe.forward(embeddings)\n", - " learned_time = (time.time() - start_time) * 1000\n", - " \n", - " # Calculate memory usage\n", - " sin_memory = 0 # No learnable parameters\n", - " learned_memory = learned_pe.position_embedding.get_memory_usage()['total_memory_mb']\n", - " \n", - " results = {\n", - " 'sinusoidal': {\n", - " 'computation_time_ms': sin_time,\n", - " 'memory_usage_mb': sin_memory,\n", - " 'parameters': 0,\n", - " 'deterministic': True,\n", - " 'extrapolation': 'Good (can handle longer sequences)'\n", - " },\n", - " 'learned': {\n", - " 'computation_time_ms': learned_time,\n", - " 'memory_usage_mb': learned_memory,\n", - " 'parameters': seq_length * 2 * embedding_dim,\n", - " 'deterministic': False,\n", - " 'extrapolation': 'Limited (fixed max sequence length)'\n", - " }\n", - " }\n", - " \n", - " print(f\"📊 COMPARISON RESULTS:\")\n", - " print(f\"{'Method':<12} {'Time (ms)':<10} {'Memory (MB)':<12} {'Parameters':<12} {'Extrapolation'}\")\n", - " print(\"-\" * 70)\n", - " print(f\"{'Sinusoidal':<12} {sin_time:<10.2f} {sin_memory:<12.2f} {0:<12,} {'Good'}\")\n", - " print(f\"{'Learned':<12} {learned_time:<10.2f} {learned_memory:<12.2f} {results['learned']['parameters']:<12,} {'Limited'}\")\n", - " \n", - " print(f\"\\n💡 INSIGHTS:\")\n", - " print(f\" - Sinusoidal: Zero parameters, deterministic, good extrapolation\")\n", - " print(f\" - Learned: Requires parameters, 
model-specific, limited extrapolation\")\n", - " print(f\" - Choice depends on: model capacity, sequence length requirements, extrapolation needs\")\n", - " \n", - " return results\n", - "\n", - "def analyze_embedding_system_design():\n", - " \"\"\"\n", - " Comprehensive analysis of embedding system design choices and their impact.\n", - " \n", - " This function is PROVIDED to show systems-level design thinking.\n", - " \"\"\"\n", - " print(\"🏗️ EMBEDDING SYSTEM DESIGN ANALYSIS\")\n", - " print(\"=\" * 60)\n", - " \n", - " # Example model configurations\n", - " model_configs = [\n", - " {'name': 'Small GPT', 'vocab_size': 10000, 'embed_dim': 256, 'seq_length': 512},\n", - " {'name': 'Medium GPT', 'vocab_size': 50000, 'embed_dim': 512, 'seq_length': 1024},\n", - " {'name': 'Large GPT', 'vocab_size': 50000, 'embed_dim': 1024, 'seq_length': 2048}\n", - " ]\n", - " \n", - " print(f\"📋 MODEL CONFIGURATION COMPARISON:\")\n", - " print(f\"{'Model':<12} {'Vocab Size':<10} {'Embed Dim':<10} {'Seq Len':<8} {'Embed Params':<12} {'Memory (MB)'}\")\n", - " print(\"-\" * 80)\n", - " \n", - " for config in model_configs:\n", - " # Calculate embedding parameters\n", - " embed_params = config['vocab_size'] * config['embed_dim']\n", - " \n", - " # Calculate memory usage\n", - " embed_memory_mb = embed_params * 4 / (1024 * 1024) # 4 bytes per float32\n", - " \n", - " print(f\"{config['name']:<12} {config['vocab_size']:<10,} {config['embed_dim']:<10} \"\n", - " f\"{config['seq_length']:<8} {embed_params:<12,} {embed_memory_mb:<10.1f}\")\n", - " \n", - " print(f\"\\n🎯 DESIGN TRADE-OFFS:\")\n", - " print(f\" 1. Vocabulary Size:\")\n", - " print(f\" - Larger vocab: Better text coverage, more parameters\")\n", - " print(f\" - Smaller vocab: Longer sequences, more compute\")\n", - " print(f\" 2. Embedding Dimension:\")\n", - " print(f\" - Higher dim: More model capacity, more memory\")\n", - " print(f\" - Lower dim: Faster computation, potential bottleneck\")\n", - " print(f\" 3. 
Position Encoding:\")\n", - " print(f\" - Sinusoidal: No parameters, good extrapolation\")\n", - " print(f\" - Learned: Model-specific, limited to training length\")\n", - " print(f\" 4. Memory Scaling:\")\n", - " print(f\" - Embedding table: O(vocab_size × embed_dim)\")\n", - " print(f\" - Sequence processing: O(batch_size × seq_length × embed_dim)\")\n", - " print(f\" - Total memory dominated by model size, not embedding table\")\n", - " \n", - " print(f\"\\n🏭 PRODUCTION CONSIDERATIONS:\")\n", - " print(f\" - GPU memory limits affect maximum embedding table size\")\n", - " print(f\" - Embedding lookup is memory-bandwidth bound\")\n", - " print(f\" - Vocabulary size affects tokenization and model download size\")\n", - " print(f\" - Position encoding choice affects sequence length flexibility\")" - ] - }, - { - "cell_type": "markdown", - "id": "a52101b6", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Test: Embedding Performance Analysis\n", - "\n", - "Let's test our embedding profiler with realistic performance scenarios." 
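The linear memory scaling that `analyze_memory_scaling` reports can be checked by hand. Below is a minimal standalone sketch of that arithmetic, assuming float32 storage (4 bytes per parameter); `embedding_table_memory_mb` is a hypothetical helper for illustration, not part of the module:

```python
def embedding_table_memory_mb(vocab_size: int, embedding_dim: int,
                              bytes_per_param: int = 4) -> float:
    """Memory of an embedding table: vocab_size x embedding_dim parameters."""
    total_params = vocab_size * embedding_dim
    return total_params * bytes_per_param / (1024 * 1024)

# The Small/Large GPT configurations from the design analysis, computed directly:
small = embedding_table_memory_mb(10_000, 256)    # Small GPT
large = embedding_table_memory_mb(50_000, 1024)   # Large GPT
print(f"Small GPT embedding table: {small:.1f} MB")   # → 9.8 MB
print(f"Large GPT embedding table: {large:.1f} MB")   # → 195.3 MB
```

Doubling either the vocabulary or the embedding dimension doubles the table, which is exactly the linear scaling pattern the profiler measures empirically.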
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "36fbea2a", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "test-embedding-profiler", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_embedding_profiler():\n", - " \"\"\"Test embedding profiler with various scenarios.\"\"\"\n", - " print(\"🔬 Unit Test: Embedding Performance Profiler...\")\n", - " \n", - " profiler = EmbeddingProfiler()\n", - " \n", - " # Create test embedding layer\n", - " vocab_size = 1000\n", - " embedding_dim = 128\n", - " embed = Embedding(vocab_size=vocab_size, embedding_dim=embedding_dim)\n", - " \n", - " # Test lookup performance measurement\n", - " batch_sizes = [8, 16]\n", - " seq_lengths = [32, 64]\n", - " \n", - " performance_results = profiler.measure_lookup_performance(embed, batch_sizes, seq_lengths)\n", - " \n", - " # Verify results structure\n", - " expected_configs = len(batch_sizes) * len(seq_lengths)\n", - " assert len(performance_results) == expected_configs, f\"Should test {expected_configs} configurations\"\n", - " \n", - " for config, metrics in performance_results.items():\n", - " # Verify all required metrics are present\n", - " required_keys = ['batch_size', 'seq_length', 'total_tokens', 'lookup_time_ms', \n", - " 'tokens_per_second', 'memory_bandwidth_mb_s']\n", - " for key in required_keys:\n", - " assert key in metrics, f\"Missing metric: {key} in {config}\"\n", - " assert isinstance(metrics[key], (int, float)), f\"Invalid metric type for {key}\"\n", - " \n", - " # Verify reasonable values\n", - " assert metrics['total_tokens'] > 0, \"Should count tokens\"\n", - " assert metrics['lookup_time_ms'] >= 0, \"Time should be non-negative\"\n", - " assert metrics['tokens_per_second'] >= 0, \"Throughput should be non-negative\"\n", - " \n", - " print(\"✅ Lookup performance measurement test passed\")\n", - " \n", - " # Test memory 
scaling analysis\n", - " vocab_sizes = [500, 1000]\n", - " embedding_dims = [64, 128]\n", - " \n", - " scaling_results = profiler.analyze_memory_scaling(vocab_sizes, embedding_dims)\n", - " \n", - " # Verify scaling results\n", - " expected_configs = len(vocab_sizes) * len(embedding_dims)\n", - " assert len(scaling_results) == expected_configs, f\"Should test {expected_configs} configurations\"\n", - " \n", - " for config, metrics in scaling_results.items():\n", - " assert 'total_parameters' in metrics, \"Should include parameter count\"\n", - " assert 'memory_mb' in metrics, \"Should include memory usage\"\n", - " assert metrics['total_parameters'] > 0, \"Should have parameters\"\n", - " assert metrics['memory_mb'] > 0, \"Should use memory\"\n", - " \n", - " print(\"✅ Memory scaling analysis test passed\")\n", - " \n", - " # Test positional encoding comparison\n", - " comparison_results = profiler.compare_positional_encodings(seq_length=50, embedding_dim=64)\n", - " \n", - " # Verify comparison results\n", - " assert 'sinusoidal' in comparison_results, \"Should test sinusoidal encoding\"\n", - " assert 'learned' in comparison_results, \"Should test learned encoding\"\n", - " \n", - " for method, metrics in comparison_results.items():\n", - " assert 'computation_time_ms' in metrics, \"Should measure computation time\"\n", - " assert 'memory_usage_mb' in metrics, \"Should measure memory usage\"\n", - " assert 'parameters' in metrics, \"Should count parameters\"\n", - " \n", - " print(\"✅ Positional encoding comparison test passed\")\n", - " print(\"🎯 Embedding Profiler: All tests passed!\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "6b2b90b6", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Integration Testing: Complete Embedding Pipeline\n", - "\n", - "Let's test how all our embedding components work together in a realistic language processing 
pipeline:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d4798be5", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "test-embedding-integration", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_embedding_integration():\n", - " \"\"\"Test complete embedding pipeline with tokenization integration.\"\"\"\n", - " print(\"🧪 Integration Test: Complete Embedding Pipeline...\")\n", - " \n", - " # Create tokenizer\n", - " tokenizer = CharTokenizer()\n", - " \n", - " # Create embedding layer\n", - " embed = Embedding(vocab_size=tokenizer.vocab_size, embedding_dim=128, padding_idx=0)\n", - " \n", - " # Create positional encoding\n", - " pos_encoding = PositionalEncoding(embedding_dim=128, max_seq_length=100)\n", - " \n", - " # Test text processing pipeline\n", - " texts = [\n", - " \"Hello world!\",\n", - " \"This is a test.\",\n", - " \"Short text.\",\n", - " \"A longer piece of text to test the pipeline.\"\n", - " ]\n", - " \n", - " print(f\" Processing {len(texts)} texts through complete pipeline...\")\n", - " \n", - " # Step 1: Tokenize texts\n", - " tokenized = []\n", - " for text in texts:\n", - " tokens = tokenizer.encode(text, add_special_tokens=True)\n", - " tokenized.append(tokens)\n", - " \n", - " # Step 2: Pad sequences for batch processing\n", - " padded_sequences = tokenizer.pad_sequences(tokenized, max_length=20)\n", - " batch_tokens = Tensor(np.array(padded_sequences))\n", - " \n", - " print(f\" Batch shape: {batch_tokens.shape}\")\n", - " \n", - " # Step 3: Embedding lookup\n", - " embeddings = embed.forward(batch_tokens)\n", - " print(f\" Embeddings shape: {embeddings.shape}\")\n", - " \n", - " # Step 4: Add positional encoding\n", - " pos_embeddings = pos_encoding.forward(embeddings)\n", - " print(f\" Position-aware embeddings shape: {pos_embeddings.shape}\")\n", - " \n", - " # Verify pipeline 
correctness\n", - " expected_shape = (len(texts), 20, 128) # (batch, seq_len, embed_dim)\n", - " assert pos_embeddings.shape == expected_shape, f\"Expected {expected_shape}, got {pos_embeddings.shape}\"\n", - " \n", - " # Test that padding tokens have correct embeddings (should be zero from embedding layer)\n", - " padding_token_id = 0 # Matches padding_idx=0 passed to the Embedding layer above\n", - " \n", - " # Find positions with padding tokens\n", - " padding_positions = (batch_tokens.data == padding_token_id)\n", - " \n", - " if np.any(padding_positions):\n", - " # Get embeddings for padding positions\n", - " padding_embeddings = embeddings.data[padding_positions]\n", - " \n", - " # These rows are taken before positional encoding is added,\n", - " # so padding_idx=0 should keep them at zero\n", - " print(f\" Padding token embeddings found: {np.sum(padding_positions)} positions\")\n", - " \n", - " # Test different sequence lengths\n", - " short_text = \"Hi!\"\n", - " short_tokens = tokenizer.encode(short_text, add_special_tokens=True)\n", - " short_tensor = Tensor(np.array([short_tokens])) # Add batch dimension\n", - " \n", - " short_embeddings = embed.forward(short_tensor)\n", - " short_pos_embeddings = pos_encoding.forward(short_embeddings)\n", - " \n", - " print(f\" Short text processing: {short_pos_embeddings.shape}\")\n", - " \n", - " # Test memory efficiency\n", - " large_batch_size = 32\n", - " large_seq_length = 50\n", - " large_tokens = np.random.randint(0, tokenizer.vocab_size, (large_batch_size, large_seq_length))\n", - " large_tensor = Tensor(large_tokens)\n", - " \n", - " start_time = time.time()\n", - " large_embeddings = embed.forward(large_tensor)\n", - " large_pos_embeddings = pos_encoding.forward(large_embeddings)\n", - " processing_time = time.time() - start_time\n", - " \n", - " print(f\" Large batch processing: {large_pos_embeddings.shape} in {processing_time*1000:.2f}ms\")\n", - " \n", - " # Calculate memory usage\n", - " 
embedding_memory = embed.get_memory_usage()\n", - " total_memory_mb = embedding_memory['total_memory_mb']\n", - " \n", - " print(f\" Embedding table memory: {total_memory_mb:.2f}MB\")\n", - " print(f\" Sequence memory: {large_pos_embeddings.data.nbytes / (1024*1024):.2f}MB\")\n", - " \n", - " print(\"✅ Complete embedding pipeline integration test passed!\")\n", - " print(f\"✅ Tokenization → Embedding → Positional Encoding pipeline works\")\n", - " print(f\"✅ Handles various batch sizes and sequence lengths\")\n", - " print(f\"✅ Memory usage is reasonable for production systems\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "bba8138f", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## Main Execution Block\n", - "\n", - "All embedding tests and demonstrations are run from here when the module is executed directly:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "19963dff", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "embeddings-main", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "if __name__ == \"__main__\":\n", - " # Run all unit tests\n", - " test_unit_embedding_layer()\n", - " test_unit_positional_encoding()\n", - " test_unit_learned_positional_embedding()\n", - " test_embedding_profiler()\n", - " test_embedding_integration()\n", - " \n", - " print(\"\\n\" + \"=\"*60)\n", - " print(\"🔍 EMBEDDING SYSTEMS ANALYSIS\")\n", - " print(\"=\"*60)\n", - " \n", - " # Performance analysis\n", - " profiler = EmbeddingProfiler()\n", - " \n", - " # Test different embedding configurations\n", - " print(\"\\n📊 EMBEDDING PERFORMANCE COMPARISON:\")\n", - " \n", - " # Compare embedding layers with different sizes\n", - " vocab_sizes = [1000, 5000, 10000]\n", - " embedding_dims = [128, 256, 512]\n", - " \n", - " scaling_results = profiler.analyze_memory_scaling(vocab_sizes, 
embedding_dims)\n", - " \n", - " # Compare positional encoding approaches\n", - " print(\"\\n\" + \"=\"*60)\n", - " pos_comparison = profiler.compare_positional_encodings(seq_length=128, embedding_dim=256)\n", - " \n", - " # Systems design analysis\n", - " print(\"\\n\" + \"=\"*60)\n", - " analyze_embedding_system_design()\n", - " \n", - " # Demonstrate realistic language model embedding setup\n", - " print(\"\\n\" + \"=\"*60)\n", - " print(\"🏗️ REALISTIC LANGUAGE MODEL EMBEDDING SETUP\")\n", - " print(\"=\"*60)\n", - " \n", - " # Create realistic configuration\n", - " vocab_size = 10000 # 10k vocabulary\n", - " embedding_dim = 256 # 256-dim embeddings\n", - " max_seq_length = 512 # 512 token sequences\n", - " \n", - " print(f\"Model configuration:\")\n", - " print(f\" Vocabulary size: {vocab_size:,}\")\n", - " print(f\" Embedding dimension: {embedding_dim}\")\n", - " print(f\" Max sequence length: {max_seq_length}\")\n", - " \n", - " # Create components\n", - " embedding_layer = Embedding(vocab_size=vocab_size, embedding_dim=embedding_dim, padding_idx=0)\n", - " pos_encoding = PositionalEncoding(embedding_dim=embedding_dim, max_seq_length=max_seq_length)\n", - " \n", - " # Calculate memory requirements\n", - " embed_memory = embedding_layer.get_memory_usage()\n", - " \n", - " print(f\"\\nMemory analysis:\")\n", - " print(f\" Embedding table: {embed_memory['total_memory_mb']:.1f}MB\")\n", - " print(f\" Parameters: {embed_memory['total_parameters']:,}\")\n", - " \n", - " # Simulate batch processing\n", - " batch_size = 32\n", - " seq_length = 256\n", - " test_tokens = np.random.randint(0, vocab_size, (batch_size, seq_length))\n", - " \n", - " start_time = time.time()\n", - " embeddings = embedding_layer.forward(test_tokens)\n", - " pos_embeddings = pos_encoding.forward(embeddings)\n", - " total_time = time.time() - start_time\n", - " \n", - " sequence_memory_mb = pos_embeddings.data.nbytes / (1024 * 1024)\n", - " \n", - " print(f\"\\nBatch processing:\")\n", - " 
print(f\" Batch size: {batch_size}, Sequence length: {seq_length}\")\n", - " print(f\" Processing time: {total_time*1000:.2f}ms\")\n", - " print(f\" Sequence memory: {sequence_memory_mb:.1f}MB\")\n", - " print(f\" Throughput: {(batch_size * seq_length) / total_time:.0f} tokens/second\")\n", - " \n", - " print(\"\\n\" + \"=\"*60)\n", - " print(\"🎯 EMBEDDINGS MODULE COMPLETE!\")\n", - " print(\"=\"*60)\n", - " print(\"All embedding tests passed!\")\n", - " print(\"Ready for attention mechanism integration!\")" - ] - }, - { - "cell_type": "markdown", - "id": "7dd5edd0", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🤔 ML Systems Thinking: Interactive Questions\n", - "\n", - "Now that you've built the embedding systems that convert tokens to rich vector representations, let's connect this work to broader ML systems challenges. These questions help you think critically about how embedding design scales to production language processing systems.\n", - "\n", - "Take time to reflect thoughtfully on each question - your insights will help you understand how embedding choices connect to real-world ML systems engineering." - ] - }, - { - "cell_type": "markdown", - "id": "ae828478", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### Question 1: Embedding Memory Optimization and Model Scaling\n", - "\n", - "**Context**: Your embedding implementations demonstrate how vocabulary size and embedding dimension directly impact model parameters and memory usage. In production language models, embedding tables often contain billions of parameters (GPT-3's embedding table alone has ~600M parameters), making memory optimization critical for deployment and training efficiency.\n", - "\n", - "**Reflection Question**: Design a memory-optimized embedding system for a production language model that needs to handle a 100k vocabulary with 1024-dimensional embeddings while operating under GPU memory constraints. 
How would you implement embedding compression techniques, design efficient lookup patterns for high-throughput training, and handle dynamic vocabulary expansion for domain adaptation? Consider the challenges of maintaining embedding quality while reducing memory footprint and optimizing for both training and inference scenarios.\n", - "\n", - "Think about: embedding compression techniques, memory-efficient lookup patterns, dynamic vocabulary management, and quality-memory trade-offs.\n", - "\n", - "*Target length: 150-300 words*" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bd58f225", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "question-1-embedding-memory", - "locked": false, - "points": 10, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "\"\"\"\n", - "YOUR REFLECTION ON EMBEDDING MEMORY OPTIMIZATION:\n", - "\n", - "TODO: Replace this text with your thoughtful response about memory-optimized embedding system design.\n", - "\n", - "Consider addressing:\n", - "- How would you implement embedding compression for a 100k × 1024 vocabulary under GPU constraints?\n", - "- What techniques would you use to optimize lookup patterns for high-throughput training?\n", - "- How would you design dynamic vocabulary expansion while maintaining memory efficiency?\n", - "- What trade-offs would you make between embedding quality and memory footprint?\n", - "- How would you optimize differently for training vs inference scenarios?\n", - "\n", - "Write a technical analysis connecting your embedding implementations to real memory optimization challenges.\n", - "\n", - "GRADING RUBRIC (Instructor Use):\n", - "- Demonstrates understanding of embedding memory scaling and optimization (3 points)\n", - "- Designs practical approaches to compression and efficient lookup patterns (3 points)\n", - "- Addresses dynamic vocabulary and quality-memory trade-offs (2 points)\n", - "- Shows systems 
thinking about production memory constraints (2 points)\n", - "- Clear technical reasoning with memory optimization insights (bonus points for innovative approaches)\n", - "\"\"\"\n", - "\n", - "### BEGIN SOLUTION\n", - "# Student response area - instructor will replace this section during grading setup\n", - "# This is a manually graded question requiring technical analysis of embedding memory optimization\n", - "# Students should demonstrate understanding of large-scale embedding systems and memory efficiency\n", - "### END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "30b9bdf8", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### Question 2: Positional Encoding and Sequence Length Scalability\n", - "\n", - "**Context**: Your positional encoding implementations show the trade-offs between fixed sinusoidal patterns and learned position embeddings. Production language models increasingly need to handle variable sequence lengths efficiently while maintaining consistent position representations across different tasks and deployment scenarios.\n", - "\n", - "**Reflection Question**: Architect a positional encoding system for a production transformer that needs to efficiently handle sequences ranging from 512 tokens (typical sentences) to 32k tokens (long documents) while maintaining training stability and inference efficiency. 
How would you design hybrid positional encoding that combines the benefits of sinusoidal and learned approaches, implement efficient position computation for variable-length sequences, and optimize for both memory usage and computational efficiency across different sequence length distributions?\n", - "\n", - "Think about: hybrid encoding strategies, variable-length optimization, memory-efficient position computation, and sequence length distribution handling.\n", - "\n", - "*Target length: 150-300 words*" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "32d23a6a", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "question-2-positional-encoding", - "locked": false, - "points": 10, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "\"\"\"\n", - "YOUR REFLECTION ON POSITIONAL ENCODING AND SEQUENCE SCALABILITY:\n", - "\n", - "TODO: Replace this text with your thoughtful response about scalable positional encoding system design.\n", - "\n", - "Consider addressing:\n", - "- How would you design hybrid positional encoding for sequences from 512 to 32k tokens?\n", - "- What strategies would you use to optimize position computation for variable-length sequences?\n", - "- How would you balance memory efficiency with computational performance?\n", - "- What approaches would you use to handle different sequence length distributions?\n", - "- How would you maintain training stability across diverse sequence lengths?\n", - "\n", - "Write an architectural analysis connecting your positional encoding work to scalable sequence processing.\n", - "\n", - "GRADING RUBRIC (Instructor Use):\n", - "- Shows understanding of positional encoding scalability challenges (3 points)\n", - "- Designs practical approaches to hybrid encoding and variable-length optimization (3 points)\n", - "- Addresses memory and computational efficiency considerations (2 points)\n", - "- Demonstrates systems thinking about 
sequence length distribution handling (2 points)\n", - "- Clear architectural reasoning with scalability insights (bonus points for comprehensive system design)\n", - "\"\"\"\n", - "\n", - "### BEGIN SOLUTION\n", - "# Student response area - instructor will replace this section during grading setup\n", - "# This is a manually graded question requiring understanding of positional encoding scalability\n", - "# Students should demonstrate knowledge of sequence length optimization and hybrid approaches\n", - "### END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "e0b67ef1", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### Question 3: Embedding Pipeline Integration and Training Efficiency\n", - "\n", - "**Context**: Your embedding pipeline integration demonstrates how tokenization, embedding lookup, and positional encoding work together in language model preprocessing. In production training systems, the embedding pipeline often becomes a bottleneck due to memory bandwidth limitations and the need to process billions of tokens efficiently during training.\n", - "\n", - "**Reflection Question**: Design an embedding pipeline optimization strategy for large-scale language model training that processes 1 trillion tokens efficiently while maintaining high GPU utilization and minimizing memory bandwidth bottlenecks. How would you implement pipeline parallelism for embedding operations, optimize batch processing for mixed sequence lengths, and design efficient gradient updates for massive embedding tables? 
Consider the challenges of coordinating embedding updates across distributed training nodes while maintaining numerical stability and convergence.\n", - "\n", - "Think about: pipeline parallelism, batch optimization, gradient update efficiency, and distributed training coordination.\n", - "\n", - "*Target length: 150-300 words*" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f5394f77", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "question-3-pipeline-integration", - "locked": false, - "points": 10, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "\"\"\"\n", - "YOUR REFLECTION ON EMBEDDING PIPELINE INTEGRATION:\n", - "\n", - "TODO: Replace this text with your thoughtful response about embedding pipeline optimization for large-scale training.\n", - "\n", - "Consider addressing:\n", - "- How would you implement pipeline parallelism for processing 1 trillion tokens efficiently?\n", - "- What strategies would you use to optimize batch processing for mixed sequence lengths?\n", - "- How would you design efficient gradient updates for massive embedding tables?\n", - "- What approaches would you use for coordinating embedding updates across distributed nodes?\n", - "- How would you maintain GPU utilization while minimizing memory bandwidth bottlenecks?\n", - "\n", - "Write a design analysis connecting your embedding pipeline to large-scale training optimization.\n", - "\n", - "GRADING RUBRIC (Instructor Use):\n", - "- Understands embedding pipeline bottlenecks and optimization challenges (3 points)\n", - "- Designs practical approaches to pipeline parallelism and batch optimization (3 points)\n", - "- Addresses distributed training and gradient update efficiency (2 points)\n", - "- Shows systems thinking about large-scale training coordination (2 points)\n", - "- Clear design reasoning with pipeline optimization insights (bonus points for innovative approaches)\n", - "\"\"\"\n", 
- "\n", - "### BEGIN SOLUTION\n", - "# Student response area - instructor will replace this section during grading setup\n", - "# This is a manually graded question requiring understanding of large-scale embedding pipeline optimization\n", - "# Students should demonstrate knowledge of distributed training and pipeline efficiency\n", - "### END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "4bd47eab", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🎯 MODULE SUMMARY: Embeddings\n", - "\n", - "Congratulations! You have successfully implemented comprehensive embedding systems for language processing:\n", - "\n", - "### ✅ What You Have Built\n", - "- **Embedding Layer**: Learnable lookup table converting tokens to dense vector representations\n", - "- **Positional Encoding**: Sinusoidal position information for sequence understanding\n", - "- **Learned Positional Embeddings**: Trainable position representations for model-specific optimization\n", - "- **Memory-Efficient Lookups**: Optimized embedding access patterns for production systems\n", - "- **Performance Analysis**: Comprehensive profiling and scaling analysis tools\n", - "- **🆕 Integration Pipeline**: Complete tokenization → embedding → positional encoding workflow\n", - "- **🆕 Systems Optimization**: Memory usage analysis and performance optimization techniques\n", - "\n", - "### ✅ Key Learning Outcomes\n", - "- **Understanding**: How discrete tokens become continuous vector representations\n", - "- **Implementation**: Built embedding systems from scratch with efficient lookup operations\n", - "- **Systems Insight**: How embedding table size affects model memory and training efficiency\n", - "- **Performance Engineering**: Measured and optimized embedding lookup patterns and memory usage\n", - "- **Production Context**: Understanding real-world embedding challenges and optimization techniques\n", - "\n", - "### ✅ Technical Mastery\n", - "- **Embedding Lookup**: Efficient table 
lookup with various initialization strategies\n", - "- **Positional Encoding**: Mathematical sine/cosine patterns for position representation\n", - "- **Memory Scaling**: Understanding O(vocab_size × embedding_dim) parameter scaling\n", - "- **Performance Optimization**: Cache-friendly access patterns and memory bandwidth optimization\n", - "- **🆕 Integration Design**: Seamless pipeline from text processing to vector representations\n", - "\n", - "### ✅ Professional Skills Developed\n", - "- **Systems Architecture**: Designing embedding systems for production scale\n", - "- **Memory Engineering**: Optimizing large parameter tables for efficient access\n", - "- **Performance Analysis**: Measuring and improving embedding pipeline throughput\n", - "- **Integration Thinking**: Connecting embedding systems with tokenization and attention\n", - "\n", - "### ✅ Ready for Next Steps\n", - "Your embedding systems are now ready to power:\n", - "- **Attention Mechanisms**: Processing sequence representations with attention\n", - "- **Transformer Models**: Complete language model architectures\n", - "- **Language Understanding**: Rich semantic representations for NLP tasks\n", - "- **🧠 Sequence Processing**: Foundation for advanced sequence modeling\n", - "\n", - "### 🔗 Connection to Real ML Systems\n", - "Your implementations mirror production systems:\n", - "- **PyTorch Embeddings**: `torch.nn.Embedding` and `torch.nn.functional.embedding`\n", - "- **Transformer Models**: All modern language models use similar embedding approaches\n", - "- **Production Optimizations**: Memory mapping, gradient checkpointing, and distributed embeddings\n", - "- **Industry Applications**: GPT, BERT, and other transformer models rely on these foundations\n", - "\n", - "### 🎯 The Power of Dense Representations\n", - "You have unlocked the bridge between discrete tokens and continuous understanding:\n", - "- **Before**: Tokens were sparse, discrete symbols\n", - "- **After**: Tokens become rich, 
continuous vectors that capture semantic relationships\n", - "\n", - "**Next Module**: Attention - Processing sequences with the mechanism that revolutionized language understanding!\n", - "\n", - "Your embedding systems provide the rich vector representations that attention mechanisms need to understand language. Now let's build the attention that makes transformers work!" - ] - } - ], - "metadata": { - "jupytext": { - "main_language": "python" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/modules_old/11_embeddings/embeddings_dev.py b/modules_old/11_embeddings/embeddings_dev.py deleted file mode 100644 index 38a666d5..00000000 --- a/modules_old/11_embeddings/embeddings_dev.py +++ /dev/null @@ -1,1904 +0,0 @@ -# --- -# jupyter: -# jupytext: -# text_representation: -# extension: .py -# format_name: percent -# format_version: '1.3' -# jupytext_version: 1.17.1 -# --- - -# %% [markdown] -""" -# Embeddings - Converting Tokens to Dense Vector Representations - -Welcome to the Embeddings module! You'll implement the systems that convert discrete tokens into rich vector representations that capture semantic meaning for language models. - -## Learning Goals -- Systems understanding: How embedding tables scale with vocabulary size and affect model memory -- Core implementation skill: Build embedding layers with efficient lookup operations -- Pattern recognition: Understand how positional encoding enables sequence understanding -- Framework connection: See how your implementations match PyTorch's embedding systems -- Performance insight: Learn how embedding lookup patterns affect cache efficiency and memory bandwidth - -## Build -> Use -> Reflect -1. **Build**: Embedding layer with lookup table and positional encoding systems -2. **Use**: Transform token sequences into rich vector representations for language processing -3. **Reflect**: How do embedding choices determine model capacity and computational efficiency? 
- -## What You'll Achieve -By the end of this module, you'll understand: -- Deep technical understanding of how discrete tokens become continuous vector representations -- Practical capability to implement embedding systems that handle large vocabularies efficiently -- Systems insight into how embedding dimensions affect model capacity and memory usage -- Performance consideration of how embedding lookup patterns affect training and inference speed -- Connection to production systems like transformer embedding layers and their optimization techniques - -## Systems Reality Check -TIP **Production Context**: Modern language models have embedding tables with billions of parameters (GPT-3: 50k vocab * 12k dim = 600M embedding params) -SPEED **Performance Note**: Embedding lookups are memory-bandwidth bound - efficient access patterns are critical for high-throughput training -""" - -# %% nbgrader={"grade": false, "grade_id": "embeddings-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} -#| default_exp core.embeddings - -#| export -import math -import numpy as np -import os -import sys -from typing import Union, List, Optional, Tuple - -# Import our Tensor class - try from package first, then from local module -try: - from tinytorch.core.tensor import Tensor -except ImportError: - # For development, import from local tensor module - sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor')) - from tensor_dev import Tensor - -# Try to import tokenization classes -try: - from tinytorch.core.tokenization import CharTokenizer, BPETokenizer -except ImportError: - # For development, import from local module - sys.path.append(os.path.join(os.path.dirname(__file__), '..', '11_tokenization')) - try: - from tokenization_dev import CharTokenizer, BPETokenizer - except ImportError: - # Create minimal mock classes if not available - class CharTokenizer: - def __init__(self): - self.vocab_size = 256 - class BPETokenizer: - def 
__init__(self, vocab_size=1000): - self.vocab_size = vocab_size - -# %% nbgrader={"grade": false, "grade_id": "embeddings-welcome", "locked": false, "schema_version": 3, "solution": false, "task": false} -print("TARGET TinyTorch Embeddings Module") -print(f"NumPy version: {np.__version__}") -print("Ready to build embedding systems!") - -# %% [markdown] -""" -## PACKAGE Where This Code Lives in the Final Package - -**Learning Side:** You work in `modules/source/12_embeddings/embeddings_dev.py` -**Building Side:** Code exports to `tinytorch.core.embeddings` - -```python -# Final package structure: -from tinytorch.core.embeddings import Embedding, PositionalEncoding -from tinytorch.core.tokenization import CharTokenizer, BPETokenizer # Previous module -from tinytorch.core.attention import MultiHeadAttention # Next module -``` - -**Why this matters:** -- **Learning:** Focused modules for deep understanding -- **Production:** Proper organization like PyTorch's `torch.nn.Embedding` -- **Consistency:** All embedding tools live together in `core.embeddings` -- **Integration:** Works seamlessly with tokenization and attention systems -""" - -# %% [markdown] -""" -## What are Embeddings? - -### The Problem: Discrete to Continuous -Tokens are discrete symbols, but neural networks work best with continuous vectors: - -``` -Discrete Token Transformation: - Token ID -> Dense Vector Representation - 42 -> [0.1, -0.3, 0.8, 0.2, ...] - -Visualization: - Sparse One-Hot Dense Embedding - [0,0,0,1,0,...] -> [0.1,-0.3,0.8,0.2] - 100,000 dims 512 dims -``` - -### Embedding Table Visualization -An embedding layer is essentially a learnable lookup table: - -``` -Embedding Table Memory Layout: -+-------------------------------------+ -| Embedding Weight Matrix | -+-------------------------------------┤ -| Token 0: [0.1, -0.2, 0.3, ...] | <- "" token -| Token 1: [0.4, 0.1, -0.5, ...] | <- "" token -| Token 2: [-0.1, 0.8, 0.2, ...] | <- "the" token -| Token 3: [0.7, -0.3, 0.1, ...] 
| <- "and" token -| ... | -| Token N: [0.2, 0.5, -0.7, ...] | <- Final token -+-------------------------------------+ - ^ ^ - vocab_size embedding_dim - -Example: 50,000 * 512 = 25.6M parameters = 102.4MB (float32) -``` - -### Embedding Lookup Process -``` -Lookup Operation Flow: - Token IDs: [42, 17, 8] (Input sequence) - v Advanced Indexing - Embedding Table[42] -> [0.1, -0.3, 0.8, ...] - Embedding Table[17] -> [0.4, 0.1, -0.5, ...] - Embedding Table[8] -> [-0.1, 0.8, 0.2, ...] - v Stack Results - Output: [[0.1, -0.3, 0.8, ...], <- Token 42 embedding - [0.4, 0.1, -0.5, ...], <- Token 17 embedding - [-0.1, 0.8, 0.2, ...]] <- Token 8 embedding - -Complexity: O(seq_length) lookups, O(seq_length * embed_dim) memory -``` - -### Why Embeddings Work -- **Similarity**: Similar words get similar vectors through training -- **Composition**: Vector operations capture semantic relationships -- **Learning**: Gradients update embeddings to improve task performance -- **Efficiency**: Dense vectors are more efficient than sparse one-hot - -### Positional Encoding Visualization -Since transformers lack inherent position awareness, we add positional information: - -``` -Position-Aware Embedding Creation: - Token Embedding + Positional Encoding = Final Representation - +-------------+ +-------------+ +-------------+ - |[0.1,-0.3,0.8]| + |[0.0, 1.0,0.0]| = |[0.1, 0.7,0.8]| <- Pos 0 - |[0.4, 0.1,-0.5]| + |[0.1, 0.9,0.1]| = |[0.5, 1.0,-0.4]| <- Pos 1 - |[-0.1,0.8, 0.2]| + |[0.2, 0.8,0.2]| = |[0.1, 1.6, 0.4]| <- Pos 2 - +-------------+ +-------------+ +-------------+ - ^ ^ ^ - Content Info Position Info Complete Context -``` - -### Systems Trade-offs -- **Embedding dimension**: Higher = more capacity, more memory -- **Vocabulary size**: Larger = more parameters, better coverage -- **Lookup efficiency**: Memory access patterns affect performance -- **Position encoding**: Fixed vs learned vs hybrid approaches -""" - -# %% [markdown] -""" -## Embedding Layer Implementation - -Let's start 
with the core embedding layer - a learnable lookup table that converts token indices to dense vectors. - -### Implementation Strategy -``` -Embedding Layer Architecture: - Input: Token IDs [batch_size, seq_length] - v Index into weight matrix - Weight Matrix: [vocab_size, embedding_dim] - v Advanced indexing: weight[input_ids] - Output: Embeddings [batch_size, seq_length, embedding_dim] - -Memory Layout: -+--------------------------------------+ -| Embedding Weight Matrix | <- Main parameter storage -+--------------------------------------┤ -| Input Token IDs (integers) | <- Temporary during forward -+--------------------------------------┤ -| Output Embeddings (float32) | <- Result tensor -+--------------------------------------+ - -Operation: O(1) lookup per token, O(seq_length) total -``` -""" - -# %% nbgrader={"grade": false, "grade_id": "embedding-layer", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class Embedding: - """ - Embedding layer that converts token indices to dense vector representations. - - This is the foundation of modern language models - a learnable lookup table - that maps discrete tokens to continuous vectors that capture semantic meaning. - """ - - def __init__(self, vocab_size: int, embedding_dim: int, - padding_idx: Optional[int] = None, - init_type: str = 'uniform'): - """ - Initialize embedding layer with learnable parameters. - - STEP-BY-STEP IMPLEMENTATION: - 1. Store configuration parameters - 2. Initialize embedding table with chosen initialization - 3. Handle special padding token if specified - 4. 
Set up for gradient tracking (will connect to autograd later) - - DESIGN DECISIONS: - - Embedding table shape: (vocab_size, embedding_dim) - - Initialization affects training dynamics - - Padding idx gets zero gradient to stay constant - - Args: - vocab_size: Number of tokens in vocabulary - embedding_dim: Size of dense vector for each token - padding_idx: Optional token index that should remain zero - init_type: Initialization strategy ('uniform', 'normal', 'xavier') - """ - ### BEGIN SOLUTION - self.vocab_size = vocab_size - self.embedding_dim = embedding_dim - self.padding_idx = padding_idx - self.init_type = init_type - - # Initialize embedding table based on strategy - # Different initialization strategies affect training dynamics - if init_type == 'uniform': - # Uniform initialization in [-1/sqrt(dim), 1/sqrt(dim)] - # Keeps initial embeddings in reasonable range for gradient flow - bound = 1.0 / math.sqrt(embedding_dim) # Scale with dimension - self.weight = Tensor(np.random.uniform(-bound, bound, (vocab_size, embedding_dim))) - elif init_type == 'normal': - # Normal initialization with std=1/sqrt(dim) - # Gaussian distribution with dimension-aware scaling - std = 1.0 / math.sqrt(embedding_dim) - self.weight = Tensor(np.random.normal(0, std, (vocab_size, embedding_dim))) - elif init_type == 'xavier': - # Xavier/Glorot initialization - considers fan-in and fan-out - # Good for maintaining activation variance across layers - bound = math.sqrt(6.0 / (vocab_size + embedding_dim)) - self.weight = Tensor(np.random.uniform(-bound, bound, (vocab_size, embedding_dim))) - else: - raise ValueError(f"Unknown init_type: {init_type}") - - # Set padding token to zero if specified - if padding_idx is not None: - self.weight.data[padding_idx] = 0.0 - - # Track parameters for optimization - self.parameters = [self.weight] - ### END SOLUTION - - def forward(self, input_ids: Union[Tensor, List[int], np.ndarray]) -> Tensor: - """ - Look up embeddings for input token indices. 
- - TODO: Implement embedding lookup. - - STEP-BY-STEP IMPLEMENTATION: - 1. Convert input to numpy array if needed - 2. Validate token indices are within vocabulary - 3. Use advanced indexing to look up embeddings - 4. Return tensor with shape (batch_size, seq_len, embedding_dim) - - EXAMPLE: - embed = Embedding(vocab_size=100, embedding_dim=64) - tokens = Tensor([[1, 2, 3], [4, 5, 6]]) # Shape: (2, 3) - embeddings = embed.forward(tokens) # Shape: (2, 3, 64) - - IMPLEMENTATION HINTS: - - Handle both Tensor and list inputs - - Use numpy advanced indexing: weight[indices] - - Preserve batch and sequence dimensions - - Args: - input_ids: Token indices with shape (batch_size, seq_len) or (seq_len,) - - Returns: - Embeddings with shape (*input_shape, embedding_dim) - """ - ### BEGIN SOLUTION - # Convert input to numpy array - if isinstance(input_ids, Tensor): - indices = input_ids.data - elif isinstance(input_ids, list): - indices = np.array(input_ids) - else: - indices = input_ids - - # Ensure indices is numpy array and convert to int - # Handle case where input might be nested Tensors or other objects - while hasattr(indices, 'data') and hasattr(indices, '__class__') and 'Tensor' in str(indices.__class__): - indices = indices.data - - if not isinstance(indices, np.ndarray): - indices = np.array(indices) - indices = indices.astype(int) - if np.any(indices < 0) or np.any(indices >= self.vocab_size): - raise ValueError(f"Token indices must be in range [0, {self.vocab_size})") - - # Look up embeddings using advanced indexing (very efficient operation) - # Memory access pattern: Random access into embedding table - # self.weight.data has shape (vocab_size, embedding_dim) - # indices has shape (...), result has shape (..., embedding_dim) - embeddings = self.weight.data[indices] # O(seq_length) lookups - - return Tensor(embeddings) - ### END SOLUTION - - def __call__(self, input_ids: Union[Tensor, List[int], np.ndarray]) -> Tensor: - """Make the layer callable.""" - return 
self.forward(input_ids) - - def get_memory_usage(self): - """ - Calculate memory usage of embedding table. - - This function is PROVIDED to show memory analysis. - """ - # Embedding table memory - weight_memory_mb = self.weight.data.nbytes / (1024 * 1024) - - # Memory per token - memory_per_token_kb = (self.embedding_dim * 4) / 1024 # 4 bytes per float32 - - return { - 'total_memory_mb': weight_memory_mb, - 'memory_per_token_kb': memory_per_token_kb, - 'total_parameters': self.vocab_size * self.embedding_dim, - 'vocab_size': self.vocab_size, - 'embedding_dim': self.embedding_dim - } - -# %% [markdown] -""" -### TEST Test Your Embedding Layer Implementation - -Once you implement the Embedding forward method above, run this cell to test it: -""" - -# %% nbgrader={"grade": true, "grade_id": "test-embedding-immediate", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false} -def test_unit_embedding_layer(): - """Unit test for the embedding layer.""" - print("🔬 Unit Test: Embedding Layer...") - - # Create embedding layer - vocab_size = 100 - embedding_dim = 64 - embed = Embedding(vocab_size=vocab_size, embedding_dim=embedding_dim) - - # Test single token - single_token = [5] - single_embedding = embed.forward(single_token) - assert single_embedding.shape == (1, embedding_dim), f"Expected shape (1, {embedding_dim}), got {single_embedding.shape}" - - # Test sequence of tokens - token_sequence = [1, 2, 3, 5, 10] - sequence_embeddings = embed.forward(token_sequence) - expected_shape = (len(token_sequence), embedding_dim) - assert sequence_embeddings.shape == expected_shape, f"Expected shape {expected_shape}, got {sequence_embeddings.shape}" - - # Test batch of sequences - batch_tokens = [[1, 2, 3], [4, 5, 6]] - batch_embeddings = embed.forward(batch_tokens) - assert batch_embeddings.shape == (2, 3, embedding_dim), f"Expected shape (2, 3, {embedding_dim}), got {batch_embeddings.shape}" - - # Test with Tensor input - tensor_input = 
Tensor(np.array([[7, 8, 9], [10, 11, 12]])) - tensor_embeddings = embed.forward(tensor_input) - assert tensor_embeddings.shape == (2, 3, embedding_dim), "Should handle Tensor input" - - # Test embedding lookup consistency - token_5_embed_1 = embed.forward([5]) - token_5_embed_2 = embed.forward([5]) - assert np.allclose(token_5_embed_1.data, token_5_embed_2.data), "Same token should give same embedding" - - # Test different tokens give different embeddings (with high probability) - token_1_embed = embed.forward([1]) - token_2_embed = embed.forward([2]) - assert not np.allclose(token_1_embed.data, token_2_embed.data, atol=1e-3), "Different tokens should give different embeddings" - - # Test initialization bounds - assert np.all(np.abs(embed.weight.data) <= 1.0), "Uniform initialization should be bounded" - - # Test padding token (if specified) - embed_with_padding = Embedding(vocab_size=50, embedding_dim=32, padding_idx=0) - assert np.allclose(embed_with_padding.weight.data[0], 0.0), "Padding token should be zero" - - # Test parameter tracking - assert len(embed.parameters) == 1, "Should track embedding weight parameter" - assert embed.parameters[0] is embed.weight, "Should track weight tensor" - - # Test memory usage calculation - memory_stats = embed.get_memory_usage() - assert 'total_memory_mb' in memory_stats, "Should provide memory statistics" - assert memory_stats['total_parameters'] == vocab_size * embedding_dim, "Should calculate parameters correctly" - - print("PASS Embedding layer tests passed!") - print(f"PASS Handles various input shapes correctly") - print(f"PASS Consistent lookup and parameter tracking") - print(f"PASS Memory usage: {memory_stats['total_memory_mb']:.2f}MB") - -# Test function defined (called in main block) - -# %% [markdown] -""" -## Positional Encoding Implementation - -Transformers need explicit position information since attention is position-agnostic. Let's implement sinusoidal positional encoding used in the original transformer. 
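Before building the full class, the sinusoidal encoding can be sketched in a few lines of plain NumPy. This is a minimal standalone sketch, not the Tensor-based implementation below; the helper name `sinusoidal_pe` and the even-`d_model` assumption are illustrative choices:

```python
import numpy as np

def sinusoidal_pe(max_len: int, d_model: int) -> np.ndarray:
    """Minimal sketch of sinusoidal positional encoding (assumes even d_model)."""
    position = np.arange(max_len)[:, None]                 # (max_len, 1)
    # 10000^(-2i/d_model) for i = 0, 1, ...: frequency decays across dimensions
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(position * div_term)   # even dims: sine
    pe[:, 1::2] = np.cos(position * div_term)   # odd dims: cosine
    return pe

pe = sinusoidal_pe(50, 16)
print(pe.shape)      # (50, 16)
print(pe[0, :4])     # position 0 -> [0. 1. 0. 1.]
```

Note that every integer position gets a distinct encoding vector: dimensions 0 and 1 alone store (sin pos, cos pos), which never repeats for integer positions because the period 2π is irrational.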
- -### Sinusoidal Positional Encoding Visualization -``` -Mathematical Foundation: - PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) <- Even dimensions - PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) <- Odd dimensions - -Frequency Pattern: - Position -> 0 1 2 3 4 ... - Dim 0: [sin] [sin] [sin] [sin] [sin] ... <- High frequency - Dim 1: [cos] [cos] [cos] [cos] [cos] ... <- High frequency - Dim 2: [sin] [sin] [sin] [sin] [sin] ... <- Med frequency - Dim 3: [cos] [cos] [cos] [cos] [cos] ... <- Med frequency - ... ... ... ... ... ... - Dim n-2: [sin] [sin] [sin] [sin] [sin] ... <- Low frequency - Dim n-1: [cos] [cos] [cos] [cos] [cos] ... <- Low frequency - -Why This Works: - - Each position gets unique encoding across all dimensions - - Relative positions have consistent patterns - - Model can learn to use positional relationships - - No parameters needed (computed deterministically) -``` - -### Position Encoding Memory Layout -``` -Precomputed Position Matrix: -+-------------------------------------+ -| Position Encoding Matrix | -+-------------------------------------┤ -| Pos 0: [0.00, 1.00, 0.00, 1.00...]| <- sin(0), cos(0), sin(0), cos(0) -| Pos 1: [0.84, 0.54, 0.10, 0.99...]| <- sin(1), cos(1), sin(f1), cos(f1) -| Pos 2: [0.91,-0.42, 0.20, 0.98...]| <- sin(2), cos(2), sin(f2), cos(f2) -| Pos 3: [0.14,-0.99, 0.30, 0.95...]| <- sin(3), cos(3), sin(f3), cos(f3) -| ... | -+-------------------------------------+ - ^ ^ -max_seq_length embedding_dim - -Memory: max_seq_length * embedding_dim * 4 bytes (precomputed) -``` -""" - -# %% nbgrader={"grade": false, "grade_id": "positional-encoding", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class PositionalEncoding: - """ - Sinusoidal positional encoding that adds position information to embeddings. - - Uses sine and cosine functions of different frequencies to create - unique position representations that the model can learn to use. 
- """ - - def __init__(self, embedding_dim: int, max_seq_length: int = 5000, - dropout: float = 0.0): - """ - Initialize positional encoding with sinusoidal patterns. - - TODO: Implement positional encoding initialization. - - STEP-BY-STEP IMPLEMENTATION: - 1. Create position matrix (max_seq_length, embedding_dim) - 2. For each position and dimension: - - Calculate frequency based on dimension - - Apply sine to even dimensions, cosine to odd dimensions - 3. Store the precomputed positional encodings - - MATHEMATICAL FOUNDATION: - PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) - PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) - - Where: - - pos = position in sequence - - i = dimension index - - d_model = embedding_dim - - Args: - embedding_dim: Dimension of embeddings (must be even) - max_seq_length: Maximum sequence length to precompute - dropout: Dropout rate (for future use) - """ - ### BEGIN SOLUTION - self.embedding_dim = embedding_dim - self.max_seq_length = max_seq_length - self.dropout = dropout - - # Create positional encoding matrix - pe = np.zeros((max_seq_length, embedding_dim)) - - # Create position vector (0, 1, 2, ..., max_seq_length-1) - position = np.arange(0, max_seq_length).reshape(-1, 1) # Shape: (max_seq_length, 1) - - # Create dimension indices for frequency calculation - # div_term calculates 10000^(2i/d_model) for i = 0, 1, 2, ... - # This creates decreasing frequencies: high freq for early dims, low freq for later dims - div_term = np.exp(np.arange(0, embedding_dim, 2) * - -(math.log(10000.0) / embedding_dim)) - - # Apply sine to even dimensions (0, 2, 4, ...) - # Broadcasting: position (max_seq_length, 1) * div_term (embedding_dim//2,) - pe[:, 0::2] = np.sin(position * div_term) # High to low frequency sine waves - - # Apply cosine to odd dimensions (1, 3, 5, ...) 
- # Cosine provides phase-shifted version of sine for each frequency - if embedding_dim % 2 == 1: - # Handle odd embedding_dim - cosine gets one less dimension - pe[:, 1::2] = np.cos(position * div_term[:-1]) - else: - pe[:, 1::2] = np.cos(position * div_term) - - # Store as tensor - self.pe = Tensor(pe) - ### END SOLUTION - - def forward(self, embeddings: Tensor) -> Tensor: - """ - Add positional encoding to embeddings. - - TODO: Implement positional encoding addition. - - STEP-BY-STEP IMPLEMENTATION: - 1. Get sequence length from embeddings shape - 2. Extract relevant positional encodings - 3. Add positional encodings to embeddings - 4. Return position-aware embeddings - - EXAMPLE: - pos_enc = PositionalEncoding(embedding_dim=64) - embeddings = Tensor(np.random.randn(2, 10, 64)) # (batch, seq, dim) - pos_embeddings = pos_enc.forward(embeddings) - - Args: - embeddings: Input embeddings with shape (batch_size, seq_len, embedding_dim) - - Returns: - Position-aware embeddings with same shape as input - """ - ### BEGIN SOLUTION - # Get sequence length from embeddings - if len(embeddings.shape) == 3: - batch_size, seq_length, embed_dim = embeddings.shape - elif len(embeddings.shape) == 2: - seq_length, embed_dim = embeddings.shape - batch_size = None - else: - raise ValueError(f"Expected 2D or 3D embeddings, got shape {embeddings.shape}") - - if embed_dim != self.embedding_dim: - raise ValueError(f"Embedding dim mismatch: expected {self.embedding_dim}, got {embed_dim}") - - if seq_length > self.max_seq_length: - raise ValueError(f"Sequence length {seq_length} exceeds max {self.max_seq_length}") - - # Extract positional encodings for this sequence length - position_encodings = self.pe.data[:seq_length, :] - - # Add positional encodings to embeddings (element-wise addition) - # This combines content information with positional information - if batch_size is not None: - # Broadcast positional encodings across batch dimension - # embeddings: (batch, seq, dim) + 
position_encodings: (seq, dim)
- # Broadcasting rule: (B,S,D) + (1,S,D) = (B,S,D)
- result = embeddings.data + position_encodings[np.newaxis, :, :]
- else:
- # embeddings: (seq, dim) + position_encodings: (seq, dim)
- result = embeddings.data + position_encodings
-
- return Tensor(result)
- ### END SOLUTION
-
- def __call__(self, embeddings: Tensor) -> Tensor:
- """Make the class callable."""
- return self.forward(embeddings)
-
- def visualize_encoding(self, seq_length: int = 100, dims_to_show: int = 10) -> None:
- """
- Visualize positional encoding patterns.
-
- This function is PROVIDED to show encoding patterns.
- """
- print(f"📊 POSITIONAL ENCODING VISUALIZATION")
- print(f"Sequence length: {seq_length}, Dimensions shown: {dims_to_show}")
- print("=" * 60)
-
- # Get subset of positional encodings
- pe_subset = self.pe.data[:seq_length, :dims_to_show]
-
- # Show patterns for first few positions
- print("First 10 positions, first 10 dimensions:")
- print("Pos", end="")
- for d in range(min(dims_to_show, 10)):
- print(f" Dim{d:2d}", end="")
- print()
-
- for pos in range(min(seq_length, 10)):
- print(f"{pos:3d}", end="")
- for d in range(min(dims_to_show, 10)):
- print(f"{pe_subset[pos, d]:8.3f}", end="")
- print()
-
- # Show frequency analysis
- print(f"\nPROGRESS FREQUENCY ANALYSIS:")
- print("Even dimensions (sine): Higher frequencies for early dimensions")
- print("Odd dimensions (cosine): Same frequencies, phase-shifted")
-
- # Calculate frequency range
- min_freq = 1.0 / 10000
- max_freq = 1.0
- print(f"Frequency range: {min_freq:.6f} to {max_freq:.6f}")
-
-# %% [markdown]
-"""
-### TEST Test Your Positional Encoding Implementation
-
-Once you implement the PositionalEncoding methods above, run this cell to test it:
-"""
-
-# %% nbgrader={"grade": true, "grade_id": "test-positional-encoding-immediate", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false}
-def test_unit_positional_encoding():
- """Unit test for positional
encoding.""" - print("🔬 Unit Test: Positional Encoding...") - - # Create positional encoding - embedding_dim = 64 - max_seq_length = 100 - pos_enc = PositionalEncoding(embedding_dim=embedding_dim, max_seq_length=max_seq_length) - - # Test initialization - assert pos_enc.pe.shape == (max_seq_length, embedding_dim), f"Expected shape ({max_seq_length}, {embedding_dim})" - - # Test that different positions have different encodings - pos_0 = pos_enc.pe.data[0] - pos_1 = pos_enc.pe.data[1] - assert not np.allclose(pos_0, pos_1), "Different positions should have different encodings" - - # Test sine/cosine pattern - # Even dimensions should use sine, odd should use cosine - # This is hard to test directly, but we can check the encoding is reasonable - assert not np.any(np.isnan(pos_enc.pe.data)), "Positional encodings should not contain NaN" - assert not np.any(np.isinf(pos_enc.pe.data)), "Positional encodings should not contain inf" - - # Test forward pass with 3D input (batch, seq, dim) - batch_size = 2 - seq_length = 10 - embeddings = Tensor(np.random.randn(batch_size, seq_length, embedding_dim)) - - pos_embeddings = pos_enc.forward(embeddings) - assert pos_embeddings.shape == embeddings.shape, "Output shape should match input shape" - - # Test forward pass with 2D input (seq, dim) - embeddings_2d = Tensor(np.random.randn(seq_length, embedding_dim)) - pos_embeddings_2d = pos_enc.forward(embeddings_2d) - assert pos_embeddings_2d.shape == embeddings_2d.shape, "2D output shape should match input" - - # Test that positional encoding is actually added - original_mean = np.mean(embeddings.data) - pos_mean = np.mean(pos_embeddings.data) - assert abs(pos_mean - original_mean) > 1e-6, "Positional encoding should change the embeddings" - - # Test sequence length validation - try: - long_embeddings = Tensor(np.random.randn(max_seq_length + 10, embedding_dim)) - pos_enc.forward(long_embeddings) - assert False, "Should raise error for sequence longer than max_seq_length" - except 
ValueError: - pass # Expected behavior - - # Test embedding dimension validation - try: - wrong_dim_embeddings = Tensor(np.random.randn(seq_length, embedding_dim + 10)) - pos_enc.forward(wrong_dim_embeddings) - assert False, "Should raise error for wrong embedding dimension" - except ValueError: - pass # Expected behavior - - # Test deterministic behavior - pos_embeddings_1 = pos_enc.forward(embeddings) - pos_embeddings_2 = pos_enc.forward(embeddings) - assert np.allclose(pos_embeddings_1.data, pos_embeddings_2.data), "Should be deterministic" - - # Test callable interface - pos_embeddings_callable = pos_enc(embeddings) - assert np.allclose(pos_embeddings_callable.data, pos_embeddings.data), "Callable interface should work" - - print("PASS Positional encoding tests passed!") - print(f"PASS Handles 2D and 3D inputs correctly") - print(f"PASS Proper validation and deterministic behavior") - print(f"PASS Encoding dimension: {embedding_dim}, Max length: {max_seq_length}") - -# Test function defined (called in main block) - -# %% [markdown] -""" -## Learned Positional Embeddings - -Some models use learned positional embeddings instead of fixed sinusoidal ones. Let's implement this alternative approach: - -### Learned vs Sinusoidal Comparison -``` -Sinusoidal Positional Encoding: - OK Zero parameters (deterministic computation) - OK Can extrapolate to longer sequences - OK Mathematical guarantees about relative positions - ✗ Fixed pattern - cannot adapt to task - -Learned Positional Embeddings: - OK Learnable parameters (adapts to task/data) - OK Can capture task-specific positional patterns - ✗ Requires additional parameters (max_seq_len * embed_dim) - ✗ Cannot extrapolate beyond training sequence length - ✗ Needs sufficient training data to learn good positions -``` - -### Learned Position Architecture -``` -Learned Position System: - Position IDs: [0, 1, 2, 3, ...] 
- v Embedding lookup (just like token embeddings) - Position Table: [max_seq_length, embedding_dim] - v Standard embedding lookup - Position Embeddings: [seq_length, embedding_dim] - v Add to token embeddings - Final Representation: Token + Position information - -This is essentially two embedding tables: - - Token Embedding: token_id -> content vector - - Position Embedding: position_id -> position vector -``` -""" - -# %% nbgrader={"grade": false, "grade_id": "learned-positional", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class LearnedPositionalEmbedding: - """ - Learned positional embeddings - another embedding table for positions. - - Unlike sinusoidal encoding, these are learned parameters that - the model optimizes during training. Used in models like BERT. - """ - - def __init__(self, max_seq_length: int, embedding_dim: int): - """ - Initialize learned positional embeddings. - - TODO: Implement learned positional embedding initialization. - - STEP-BY-STEP IMPLEMENTATION: - 1. Create embedding layer for positions (0, 1, 2, ..., max_seq_length-1) - 2. Initialize with small random values - 3. Set up parameter tracking for optimization - - This is essentially an Embedding layer where the "vocabulary" - is the set of possible positions in a sequence. 
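        The lookup described above can be sketched with plain NumPy
        (standalone and illustrative; `pos_table` is a hypothetical stand-in
        for the learned weight table, not this class's API):

        ```python
        import numpy as np

        max_len, d = 16, 4
        # In a real model this table is a learned parameter updated by the
        # optimizer; here it is just small random data.
        pos_table = 0.02 * np.random.randn(max_len, d)

        seq_len = 5
        pos_ids = np.arange(seq_len)   # position "token ids": [0, 1, 2, 3, 4]
        pos_vecs = pos_table[pos_ids]  # identical mechanics to token embedding lookup
        print(pos_vecs.shape)          # (5, 4)
        ```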
- - Args: - max_seq_length: Maximum sequence length supported - embedding_dim: Dimension of position embeddings - """ - ### BEGIN SOLUTION - self.max_seq_length = max_seq_length - self.embedding_dim = embedding_dim - - # Create learned positional embedding table - # This is like an embedding layer for positions (not tokens) - # Vocabulary size = max sequence length (each position is a "token") - self.position_embedding = Embedding( - vocab_size=max_seq_length, # Position 0, 1, 2, ..., max_seq_length-1 - embedding_dim=embedding_dim, # Same dimension as token embeddings - init_type='normal' # Start with small random values - ) - - # Track parameters for optimization - self.parameters = self.position_embedding.parameters - ### END SOLUTION - - def forward(self, embeddings: Tensor) -> Tensor: - """ - Add learned positional embeddings to input embeddings. - - TODO: Implement learned positional embedding addition. - - STEP-BY-STEP IMPLEMENTATION: - 1. Get sequence length from input shape - 2. Create position indices [0, 1, 2, ..., seq_length-1] - 3. Look up position embeddings using position indices - 4. 
Add position embeddings to input embeddings - - EXAMPLE: - learned_pos = LearnedPositionalEmbedding(max_seq_length=100, embedding_dim=64) - embeddings = Tensor(np.random.randn(2, 10, 64)) # (batch, seq, dim) - pos_embeddings = learned_pos.forward(embeddings) - - Args: - embeddings: Input embeddings with shape (batch_size, seq_len, embedding_dim) - - Returns: - Position-aware embeddings with same shape as input - """ - ### BEGIN SOLUTION - # Get sequence length from embeddings - if len(embeddings.shape) == 3: - batch_size, seq_length, embed_dim = embeddings.shape - elif len(embeddings.shape) == 2: - seq_length, embed_dim = embeddings.shape - batch_size = None - else: - raise ValueError(f"Expected 2D or 3D embeddings, got shape {embeddings.shape}") - - if embed_dim != self.embedding_dim: - raise ValueError(f"Embedding dim mismatch: expected {self.embedding_dim}, got {embed_dim}") - - if seq_length > self.max_seq_length: - raise ValueError(f"Sequence length {seq_length} exceeds max {self.max_seq_length}") - - # Create position indices [0, 1, 2, ..., seq_length-1] - # These are the "token IDs" for positions in the sequence - position_ids = list(range(seq_length)) - - # Look up position embeddings (same process as token embedding lookup) - # Each position gets its own learned vector representation - position_embeddings = self.position_embedding.forward(position_ids) - - # Add position embeddings to input embeddings - if batch_size is not None: - # Broadcast across batch dimension - result = embeddings.data + position_embeddings.data[np.newaxis, :, :] - else: - result = embeddings.data + position_embeddings.data - - return Tensor(result) - ### END SOLUTION - - def __call__(self, embeddings: Tensor) -> Tensor: - """Make the class callable.""" - return self.forward(embeddings) - -# %% [markdown] -""" -### TEST Test Your Learned Positional Embedding Implementation - -Once you implement the LearnedPositionalEmbedding methods above, run this cell to test it: -""" - -# %% 
nbgrader={"grade": true, "grade_id": "test-learned-positional-immediate", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false} -def test_unit_learned_positional_embedding(): - """Unit test for learned positional embeddings.""" - print("🔬 Unit Test: Learned Positional Embeddings...") - - # Create learned positional embedding - max_seq_length = 50 - embedding_dim = 32 - learned_pos = LearnedPositionalEmbedding(max_seq_length=max_seq_length, embedding_dim=embedding_dim) - - # Test initialization - assert learned_pos.position_embedding.vocab_size == max_seq_length, "Should have position for each sequence position" - assert learned_pos.position_embedding.embedding_dim == embedding_dim, "Should match embedding dimension" - - # Test parameter tracking - assert len(learned_pos.parameters) == 1, "Should track position embedding parameters" - assert learned_pos.parameters[0] is learned_pos.position_embedding.weight, "Should track weight tensor" - - # Test forward pass with 3D input - batch_size = 3 - seq_length = 10 - embeddings = Tensor(np.random.randn(batch_size, seq_length, embedding_dim)) - - pos_embeddings = learned_pos.forward(embeddings) - assert pos_embeddings.shape == embeddings.shape, "Output shape should match input shape" - - # Test forward pass with 2D input - embeddings_2d = Tensor(np.random.randn(seq_length, embedding_dim)) - pos_embeddings_2d = learned_pos.forward(embeddings_2d) - assert pos_embeddings_2d.shape == embeddings_2d.shape, "2D output shape should match input" - - # Test that position embeddings are actually added - original_mean = np.mean(embeddings.data) - pos_mean = np.mean(pos_embeddings.data) - assert abs(pos_mean - original_mean) > 1e-6, "Position embeddings should change the input" - - # Test that different sequence lengths give consistent positional embeddings - # Use same base embeddings for the first 5 positions to test positional consistency - base_embeddings = np.random.randn(batch_size, 5, embedding_dim) 
- short_embeddings = Tensor(base_embeddings) - - # For long embeddings, use same first 5 positions plus additional positions - extended_embeddings = np.random.randn(batch_size, 10, embedding_dim) - extended_embeddings[:, :5, :] = base_embeddings # Same first 5 positions - long_embeddings = Tensor(extended_embeddings) - - short_pos = learned_pos.forward(short_embeddings) - long_pos = learned_pos.forward(long_embeddings) - - # The first 5 positions should be the same (same input + same positional embeddings) - assert np.allclose(short_pos.data, long_pos.data[:, :5, :], atol=1e-6), "Same positions should have same embeddings" - - # Test sequence length validation - try: - too_long_embeddings = Tensor(np.random.randn(batch_size, max_seq_length + 5, embedding_dim)) - learned_pos.forward(too_long_embeddings) - assert False, "Should raise error for sequence longer than max_seq_length" - except ValueError: - pass # Expected behavior - - # Test embedding dimension validation - try: - wrong_dim_embeddings = Tensor(np.random.randn(batch_size, seq_length, embedding_dim + 5)) - learned_pos.forward(wrong_dim_embeddings) - assert False, "Should raise error for wrong embedding dimension" - except ValueError: - pass # Expected behavior - - # Test callable interface - pos_embeddings_callable = learned_pos(embeddings) - assert np.allclose(pos_embeddings_callable.data, pos_embeddings.data), "Callable interface should work" - - print("PASS Learned positional embedding tests passed!") - print(f"PASS Parameter tracking and optimization ready") - print(f"PASS Handles various input shapes correctly") - print(f"PASS Max sequence length: {max_seq_length}, Embedding dim: {embedding_dim}") - -# Test function defined (called in main block) - -# PASS IMPLEMENTATION CHECKPOINT: Ensure all embedding components are complete before analysis - -# THINK PREDICTION: How does embedding table memory scale with vocabulary size and dimension? -# Linear with vocab_size? Linear with embedding_dim? 
Quadratic with both? -# Your prediction: _______ - -# MAGNIFY SYSTEMS INSIGHT #1: Embedding Memory Scaling Analysis -def analyze_embedding_memory_scaling(): - """Analyze how embedding memory scales with vocabulary and dimension parameters.""" - try: - import time - - print("📊 EMBEDDING MEMORY SCALING ANALYSIS") - print("=" * 50) - - # Test different configurations - test_configs = [ - (1000, 128), # Small model - (10000, 256), # Medium model - (50000, 512), # Large model - (100000, 1024) # Very large model - ] - - print(f"{'Vocab Size':<12} {'Embed Dim':<10} {'Parameters':<12} {'Memory (MB)':<12} {'Lookup Time':<12}") - print("-" * 70) - - for vocab_size, embed_dim in test_configs: - # Create embedding layer - embed = Embedding(vocab_size=vocab_size, embedding_dim=embed_dim) - - # Calculate memory - memory_stats = embed.get_memory_usage() - params = memory_stats['total_parameters'] - memory_mb = memory_stats['total_memory_mb'] - - # Test lookup performance - test_tokens = np.random.randint(0, vocab_size, (32, 64)) - start_time = time.time() - _ = embed.forward(test_tokens) - lookup_time = (time.time() - start_time) * 1000 - - print(f"{vocab_size:<12,} {embed_dim:<10} {params:<12,} {memory_mb:<12.1f} {lookup_time:<12.2f}") - - # TIP WHY THIS MATTERS: GPT-3 has 50k vocab * 12k dim = 600M embedding parameters! 
- # That's 2.4GB just for the embedding table (before any other model weights) - print("\nTIP SCALING INSIGHTS:") - print(" - Memory scales linearly with both vocab_size AND embedding_dim") - print(" - Lookup time is dominated by memory bandwidth, not computation") - print(" - Large models spend significant memory on embeddings alone") - - except Exception as e: - print(f"WARNING️ Error in memory scaling analysis: {e}") - print("Make sure your Embedding class is implemented correctly") - -analyze_embedding_memory_scaling() - -# PASS IMPLEMENTATION CHECKPOINT: Ensure positional encoding works before analysis - -# THINK PREDICTION: Which positional encoding uses more memory - sinusoidal or learned? -# Which can handle longer sequences? Your answer: _______ - -# MAGNIFY SYSTEMS INSIGHT #2: Positional Encoding Trade-offs -def analyze_positional_encoding_tradeoffs(): - """Compare memory and performance characteristics of different positional encodings.""" - try: - import time - - print("\nMAGNIFY POSITIONAL ENCODING COMPARISON") - print("=" * 50) - - embedding_dim = 512 - max_seq_length = 2048 - - # Create both types - sinusoidal_pe = PositionalEncoding(embedding_dim=embedding_dim, max_seq_length=max_seq_length) - learned_pe = LearnedPositionalEmbedding(max_seq_length=max_seq_length, embedding_dim=embedding_dim) - - # Test different sequence lengths - seq_lengths = [128, 512, 1024, 2048] - batch_size = 16 - - print(f"{'Seq Len':<8} {'Method':<12} {'Time (ms)':<10} {'Memory (MB)':<12} {'Parameters':<12}") - print("-" * 65) - - for seq_len in seq_lengths: - embeddings = Tensor(np.random.randn(batch_size, seq_len, embedding_dim)) - - # Test sinusoidal - start_time = time.time() - _ = sinusoidal_pe.forward(embeddings) - sin_time = (time.time() - start_time) * 1000 - sin_memory = 0 # No parameters - sin_params = 0 - - # Test learned - start_time = time.time() - _ = learned_pe.forward(embeddings) - learned_time = (time.time() - start_time) * 1000 - learned_memory = 
learned_pe.position_embedding.get_memory_usage()['total_memory_mb'] - learned_params = max_seq_length * embedding_dim - - print(f"{seq_len:<8} {'Sinusoidal':<12} {sin_time:<10.2f} {sin_memory:<12.1f} {sin_params:<12,}") - print(f"{seq_len:<8} {'Learned':<12} {learned_time:<10.2f} {learned_memory:<12.1f} {learned_params:<12,}") - print() - - # TIP WHY THIS MATTERS: Choice affects model size and sequence length flexibility - print("TIP TRADE-OFF INSIGHTS:") - print(" - Sinusoidal: 0 parameters, can extrapolate to any length") - print(" - Learned: Many parameters, limited to training sequence length") - print(" - Modern models often use learned for better task adaptation") - - except Exception as e: - print(f"WARNING️ Error in positional encoding analysis: {e}") - print("Make sure both positional encoding classes are implemented") - -analyze_positional_encoding_tradeoffs() - -# PASS IMPLEMENTATION CHECKPOINT: Ensure full embedding pipeline works - -# THINK PREDICTION: What's the bottleneck in embedding pipelines - computation or memory? -# How does batch size affect throughput? 
Your prediction: _______ - -# MAGNIFY SYSTEMS INSIGHT #3: Embedding Pipeline Performance -def analyze_embedding_pipeline_performance(): - """Analyze performance characteristics of the complete embedding pipeline.""" - try: - import time - - print("\nSPEED EMBEDDING PIPELINE PERFORMANCE") - print("=" * 50) - - # Create pipeline components - vocab_size = 10000 - embedding_dim = 256 - max_seq_length = 512 - - embed = Embedding(vocab_size=vocab_size, embedding_dim=embedding_dim) - pos_enc = PositionalEncoding(embedding_dim=embedding_dim, max_seq_length=max_seq_length) - - # Test different batch sizes and sequence lengths - test_configs = [ - (8, 128), # Small batch, short sequences - (32, 256), # Medium batch, medium sequences - (64, 512), # Large batch, long sequences - ] - - print(f"{'Batch':<6} {'Seq Len':<8} {'Total Tokens':<12} {'Time (ms)':<10} {'Tokens/sec':<12} {'Memory (MB)':<12}") - print("-" * 75) - - for batch_size, seq_length in test_configs: - # Create random token sequence - tokens = np.random.randint(0, vocab_size, (batch_size, seq_length)) - token_tensor = Tensor(tokens) - - # Measure full pipeline - start_time = time.time() - - # Step 1: Embedding lookup - embeddings = embed.forward(token_tensor) - - # Step 2: Add positional encoding - pos_embeddings = pos_enc.forward(embeddings) - - end_time = time.time() - - # Calculate metrics - total_tokens = batch_size * seq_length - pipeline_time = (end_time - start_time) * 1000 - tokens_per_sec = total_tokens / (end_time - start_time) if end_time > start_time else 0 - memory_mb = pos_embeddings.data.nbytes / (1024 * 1024) - - print(f"{batch_size:<6} {seq_length:<8} {total_tokens:<12,} {pipeline_time:<10.2f} {tokens_per_sec:<12,.0f} {memory_mb:<12.1f}") - - # TIP WHY THIS MATTERS: Understanding pipeline bottlenecks for production deployment - print("\nTIP PIPELINE INSIGHTS:") - print(" - Embedding lookup is memory-bandwidth bound (not compute bound)") - print(" - Larger batches improve throughput due to better 
memory utilization") - print(" - Sequence length affects memory linearly, performance sublinearly") - print(" - Production systems optimize with: embedding caching, mixed precision, etc.") - - except Exception as e: - print(f"WARNING️ Error in pipeline analysis: {e}") - print("Make sure your full embedding pipeline is working") - -analyze_embedding_pipeline_performance() - -# %% [markdown] -""" -## TARGET ML Systems: Performance Analysis & Embedding Scaling - -Now let's develop systems engineering skills by analyzing embedding performance and understanding how embedding choices affect downstream ML system efficiency. - -### **Learning Outcome**: *"I understand how embedding table size affects model memory, training speed, and language understanding capacity"* -""" - -# %% nbgrader={"grade": false, "grade_id": "embedding-profiler", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -import time - -class EmbeddingProfiler: - """ - Performance profiling toolkit for embedding systems. - - Helps ML engineers understand memory usage, lookup performance, - and scaling characteristics of embedding layers. - """ - - def __init__(self): - self.results = {} - - def measure_lookup_performance(self, embedding_layer: Embedding, - batch_sizes: List[int], seq_lengths: List[int]): - """ - Measure embedding lookup performance across different batch sizes and sequence lengths. - - TODO: Implement embedding lookup performance measurement. - - STEP-BY-STEP IMPLEMENTATION: - 1. Create test token indices for each (batch_size, seq_length) combination - 2. Measure time to perform embedding lookup - 3. Calculate throughput metrics (tokens/second, memory bandwidth) - 4. 
Return comprehensive performance analysis - - METRICS TO CALCULATE: - - Lookup time (milliseconds) - - Tokens per second throughput - - Memory bandwidth utilization - - Scaling patterns with batch size and sequence length - - Args: - embedding_layer: Embedding layer to test - batch_sizes: List of batch sizes to test - seq_lengths: List of sequence lengths to test - - Returns: - Dictionary with performance metrics for each configuration - """ - ### BEGIN SOLUTION - results = {} - vocab_size = embedding_layer.vocab_size - - for batch_size in batch_sizes: - for seq_length in seq_lengths: - # Create random token indices - token_indices = np.random.randint(0, vocab_size, (batch_size, seq_length)) - - # Measure lookup performance - start_time = time.time() - embeddings = embedding_layer.forward(token_indices) - end_time = time.time() - - # Calculate metrics - lookup_time_ms = (end_time - start_time) * 1000 - total_tokens = batch_size * seq_length - tokens_per_second = total_tokens / (end_time - start_time) if end_time > start_time else 0 - - # Memory calculations - input_memory_mb = token_indices.nbytes / (1024 * 1024) - output_memory_mb = embeddings.data.nbytes / (1024 * 1024) - memory_bandwidth_mb_s = (input_memory_mb + output_memory_mb) / (end_time - start_time) if end_time > start_time else 0 - - config_key = f"batch_{batch_size}_seq_{seq_length}" - results[config_key] = { - 'batch_size': batch_size, - 'seq_length': seq_length, - 'total_tokens': total_tokens, - 'lookup_time_ms': lookup_time_ms, - 'tokens_per_second': tokens_per_second, - 'input_memory_mb': input_memory_mb, - 'output_memory_mb': output_memory_mb, - 'memory_bandwidth_mb_s': memory_bandwidth_mb_s, - 'time_per_token_us': lookup_time_ms * 1000 / total_tokens if total_tokens > 0 else 0 - } - - return results - ### END SOLUTION - - def analyze_memory_scaling(self, vocab_sizes: List[int], embedding_dims: List[int]): - """ - Analyze how embedding memory usage scales with vocabulary size and embedding 
dimension. - - This function is PROVIDED to show memory scaling analysis. - """ - print("📊 EMBEDDING MEMORY SCALING ANALYSIS") - print("=" * 60) - - scaling_results = {} - - print(f"{'Vocab Size':<12} {'Embed Dim':<10} {'Parameters':<12} {'Memory (MB)':<12} {'Lookup Time':<12}") - print("-" * 70) - - for vocab_size in vocab_sizes: - for embed_dim in embedding_dims: - # Create embedding layer - embed = Embedding(vocab_size=vocab_size, embedding_dim=embed_dim) - - # Calculate memory usage - memory_stats = embed.get_memory_usage() - total_memory_mb = memory_stats['total_memory_mb'] - total_params = memory_stats['total_parameters'] - - # Measure lookup time - test_tokens = np.random.randint(0, vocab_size, (32, 64)) # Standard batch - start_time = time.time() - _ = embed.forward(test_tokens) - lookup_time_ms = (time.time() - start_time) * 1000 - - # Store results - config_key = f"vocab_{vocab_size}_dim_{embed_dim}" - scaling_results[config_key] = { - 'vocab_size': vocab_size, - 'embedding_dim': embed_dim, - 'total_parameters': total_params, - 'memory_mb': total_memory_mb, - 'lookup_time_ms': lookup_time_ms - } - - print(f"{vocab_size:<12,} {embed_dim:<10} {total_params:<12,} {total_memory_mb:<12.2f} {lookup_time_ms:<12.2f}") - - # Analyze scaling patterns - print(f"\nPROGRESS SCALING INSIGHTS:") - if len(vocab_sizes) > 1 and len(embedding_dims) > 1: - # Compare scaling with vocab size (fixed embedding dim) - fixed_dim = embedding_dims[0] - small_vocab = min(vocab_sizes) - large_vocab = max(vocab_sizes) - - small_key = f"vocab_{small_vocab}_dim_{fixed_dim}" - large_key = f"vocab_{large_vocab}_dim_{fixed_dim}" - - if small_key in scaling_results and large_key in scaling_results: - vocab_ratio = large_vocab / small_vocab - memory_ratio = scaling_results[large_key]['memory_mb'] / scaling_results[small_key]['memory_mb'] - print(f" Vocabulary scaling: {vocab_ratio:.1f}x vocab -> {memory_ratio:.1f}x memory (Linear)") - - # Compare scaling with embedding dim (fixed vocab) - 
fixed_vocab = vocab_sizes[0] - small_dim = min(embedding_dims) - large_dim = max(embedding_dims) - - small_key = f"vocab_{fixed_vocab}_dim_{small_dim}" - large_key = f"vocab_{fixed_vocab}_dim_{large_dim}" - - if small_key in scaling_results and large_key in scaling_results: - dim_ratio = large_dim / small_dim - memory_ratio = scaling_results[large_key]['memory_mb'] / scaling_results[small_key]['memory_mb'] - print(f" Dimension scaling: {dim_ratio:.1f}x dim -> {memory_ratio:.1f}x memory (Linear)") - - return scaling_results - - def compare_positional_encodings(self, seq_length: int = 100, embedding_dim: int = 256): - """ - Compare performance and characteristics of different positional encoding approaches. - - This function is PROVIDED to show positional encoding comparison. - """ - print(f"\nMAGNIFY POSITIONAL ENCODING COMPARISON") - print("=" * 50) - - # Create test embeddings - batch_size = 16 - embeddings = Tensor(np.random.randn(batch_size, seq_length, embedding_dim)) - - # Test sinusoidal positional encoding - sinusoidal_pe = PositionalEncoding(embedding_dim=embedding_dim, max_seq_length=seq_length*2) - start_time = time.time() - sin_result = sinusoidal_pe.forward(embeddings) - sin_time = (time.time() - start_time) * 1000 - - # Test learned positional embedding - learned_pe = LearnedPositionalEmbedding(max_seq_length=seq_length*2, embedding_dim=embedding_dim) - start_time = time.time() - learned_result = learned_pe.forward(embeddings) - learned_time = (time.time() - start_time) * 1000 - - # Calculate memory usage - sin_memory = 0 # No learnable parameters - learned_memory = learned_pe.position_embedding.get_memory_usage()['total_memory_mb'] - - results = { - 'sinusoidal': { - 'computation_time_ms': sin_time, - 'memory_usage_mb': sin_memory, - 'parameters': 0, - 'deterministic': True, - 'extrapolation': 'Good (can handle longer sequences)' - }, - 'learned': { - 'computation_time_ms': learned_time, - 'memory_usage_mb': learned_memory, - 'parameters': seq_length 
* 2 * embedding_dim, - 'deterministic': False, - 'extrapolation': 'Limited (fixed max sequence length)' - } - } - - print(f"📊 COMPARISON RESULTS:") - print(f"{'Method':<12} {'Time (ms)':<10} {'Memory (MB)':<12} {'Parameters':<12} {'Extrapolation'}") - print("-" * 70) - print(f"{'Sinusoidal':<12} {sin_time:<10.2f} {sin_memory:<12.2f} {0:<12,} {'Good'}") - print(f"{'Learned':<12} {learned_time:<10.2f} {learned_memory:<12.2f} {results['learned']['parameters']:<12,} {'Limited'}") - - print(f"\nTIP INSIGHTS:") - print(f" - Sinusoidal: Zero parameters, deterministic, good extrapolation") - print(f" - Learned: Requires parameters, model-specific, limited extrapolation") - print(f" - Choice depends on: model capacity, sequence length requirements, extrapolation needs") - - return results - -def analyze_embedding_system_design(): - """ - Comprehensive analysis of embedding system design choices and their impact. - - This function is PROVIDED to show systems-level design thinking. - """ - print("🏗️ EMBEDDING SYSTEM DESIGN ANALYSIS") - print("=" * 60) - - # Example model configurations - model_configs = [ - {'name': 'Small GPT', 'vocab_size': 10000, 'embed_dim': 256, 'seq_length': 512}, - {'name': 'Medium GPT', 'vocab_size': 50000, 'embed_dim': 512, 'seq_length': 1024}, - {'name': 'Large GPT', 'vocab_size': 50000, 'embed_dim': 1024, 'seq_length': 2048} - ] - - print(f"📋 MODEL CONFIGURATION COMPARISON:") - print(f"{'Model':<12} {'Vocab Size':<10} {'Embed Dim':<10} {'Seq Len':<8} {'Embed Params':<12} {'Memory (MB)'}") - print("-" * 80) - - for config in model_configs: - # Calculate embedding parameters - embed_params = config['vocab_size'] * config['embed_dim'] - - # Calculate memory usage - embed_memory_mb = embed_params * 4 / (1024 * 1024) # 4 bytes per float32 - - print(f"{config['name']:<12} {config['vocab_size']:<10,} {config['embed_dim']:<10} " - f"{config['seq_length']:<8} {embed_params:<12,} {embed_memory_mb:<10.1f}") - - print(f"\nTARGET DESIGN TRADE-OFFS:") - print(f" 
1. Vocabulary Size:") - print(f" - Larger vocab: Better text coverage, more parameters") - print(f" - Smaller vocab: Longer sequences, more compute") - print(f" 2. Embedding Dimension:") - print(f" - Higher dim: More model capacity, more memory") - print(f" - Lower dim: Faster computation, potential bottleneck") - print(f" 3. Position Encoding:") - print(f" - Sinusoidal: No parameters, good extrapolation") - print(f" - Learned: Model-specific, limited to training length") - print(f" 4. Memory Scaling:") - print(f" - Embedding table: O(vocab_size * embed_dim)") - print(f" - Sequence processing: O(batch_size * seq_length * embed_dim)") - print(f" - Total memory dominated by model size, not embedding table") - - print(f"\n🏭 PRODUCTION CONSIDERATIONS:") - print(f" - GPU memory limits affect maximum embedding table size") - print(f" - Embedding lookup is memory-bandwidth bound") - print(f" - Vocabulary size affects tokenization and model download size") - print(f" - Position encoding choice affects sequence length flexibility") - -# %% [markdown] -""" -### TEST Test: Embedding Performance Analysis - -Let's test our embedding profiler with realistic performance scenarios. 
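Before running the profiler, the memory figures from the design analysis above can be sanity-checked by hand. A minimal sketch — the helper name `embedding_memory_mb` is ours for illustration, not part of the module:

```python
def embedding_memory_mb(vocab_size, embed_dim, bytes_per_param=4):
    """Memory of a vocab_size x embed_dim embedding table in MB (float32 by default)."""
    return vocab_size * embed_dim * bytes_per_param / (1024 * 1024)

# The "Small GPT" and "Medium GPT" rows from the configuration table above:
print(f"Small GPT:  {embedding_memory_mb(10_000, 256):.1f} MB")   # ~9.8 MB
print(f"Medium GPT: {embedding_memory_mb(50_000, 512):.1f} MB")   # ~97.7 MB
```

The same O(vocab_size * embed_dim) arithmetic is what `get_memory_usage()` reports, which is why doubling either factor doubles the table.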
-""" - -# %% nbgrader={"grade": false, "grade_id": "test-embedding-profiler", "locked": false, "schema_version": 3, "solution": false, "task": false} -def test_embedding_profiler(): - """Test embedding profiler with various scenarios.""" - print("🔬 Unit Test: Embedding Performance Profiler...") - - profiler = EmbeddingProfiler() - - # Create test embedding layer - vocab_size = 1000 - embedding_dim = 128 - embed = Embedding(vocab_size=vocab_size, embedding_dim=embedding_dim) - - # Test lookup performance measurement - batch_sizes = [8, 16] - seq_lengths = [32, 64] - - performance_results = profiler.measure_lookup_performance(embed, batch_sizes, seq_lengths) - - # Verify results structure - expected_configs = len(batch_sizes) * len(seq_lengths) - assert len(performance_results) == expected_configs, f"Should test {expected_configs} configurations" - - for config, metrics in performance_results.items(): - # Verify all required metrics are present - required_keys = ['batch_size', 'seq_length', 'total_tokens', 'lookup_time_ms', - 'tokens_per_second', 'memory_bandwidth_mb_s'] - for key in required_keys: - assert key in metrics, f"Missing metric: {key} in {config}" - assert isinstance(metrics[key], (int, float)), f"Invalid metric type for {key}" - - # Verify reasonable values - assert metrics['total_tokens'] > 0, "Should count tokens" - assert metrics['lookup_time_ms'] >= 0, "Time should be non-negative" - assert metrics['tokens_per_second'] >= 0, "Throughput should be non-negative" - - print("PASS Lookup performance measurement test passed") - - # Test memory scaling analysis - vocab_sizes = [500, 1000] - embedding_dims = [64, 128] - - scaling_results = profiler.analyze_memory_scaling(vocab_sizes, embedding_dims) - - # Verify scaling results - expected_configs = len(vocab_sizes) * len(embedding_dims) - assert len(scaling_results) == expected_configs, f"Should test {expected_configs} configurations" - - for config, metrics in scaling_results.items(): - assert 
'total_parameters' in metrics, "Should include parameter count" - assert 'memory_mb' in metrics, "Should include memory usage" - assert metrics['total_parameters'] > 0, "Should have parameters" - assert metrics['memory_mb'] > 0, "Should use memory" - - print("PASS Memory scaling analysis test passed") - - # Test positional encoding comparison - comparison_results = profiler.compare_positional_encodings(seq_length=50, embedding_dim=64) - - # Verify comparison results - assert 'sinusoidal' in comparison_results, "Should test sinusoidal encoding" - assert 'learned' in comparison_results, "Should test learned encoding" - - for method, metrics in comparison_results.items(): - assert 'computation_time_ms' in metrics, "Should measure computation time" - assert 'memory_usage_mb' in metrics, "Should measure memory usage" - assert 'parameters' in metrics, "Should count parameters" - - print("PASS Positional encoding comparison test passed") - print("TARGET Embedding Profiler: All tests passed!") - -# Test function defined (called in main block) - -# %% [markdown] -""" -## Integration Testing: Complete Embedding Pipeline - -Let's test how all our embedding components work together in a realistic language processing pipeline: -""" - -# %% nbgrader={"grade": false, "grade_id": "test-embedding-integration", "locked": false, "schema_version": 3, "solution": false, "task": false} -def test_embedding_integration(): - """Test complete embedding pipeline with tokenization integration.""" - print("TEST Integration Test: Complete Embedding Pipeline...") - - # Create tokenizer (using mock for simplicity) - tokenizer = CharTokenizer() - - # Create embedding layer - embed = Embedding(vocab_size=tokenizer.vocab_size, embedding_dim=128, padding_idx=0) - - # Create positional encoding - pos_encoding = PositionalEncoding(embedding_dim=128, max_seq_length=100) - - # Test with simple token sequences instead of text processing - # This avoids the tokenizer method issues while testing embedding 
pipeline - test_sequences = [ - [1, 2, 3, 4, 5], # "Hello world!" - [6, 7, 8, 9, 10, 11], # "This is a test." - [12, 13, 14], # "Short text." - [15, 16, 17, 18, 19, 20, 21, 22] # "A longer piece..." - ] - - print(f" Processing {len(test_sequences)} token sequences through complete pipeline...") - - # Step 1: Use pre-tokenized sequences - tokenized = test_sequences - - # Step 2: Pad sequences manually for batch processing - max_length = 20 - padded_sequences = [] - for seq in tokenized: - # Pad with 0s or truncate to max_length - if len(seq) < max_length: - padded = seq + [0] * (max_length - len(seq)) - else: - padded = seq[:max_length] - padded_sequences.append(padded) - - batch_tokens = Tensor(np.array(padded_sequences)) - - print(f" Batch shape: {batch_tokens.shape}") - - # Step 3: Embedding lookup - embeddings = embed.forward(batch_tokens) - print(f" Embeddings shape: {embeddings.shape}") - - # Step 4: Add positional encoding - pos_embeddings = pos_encoding.forward(embeddings) - print(f" Position-aware embeddings shape: {pos_embeddings.shape}") - - # Verify pipeline correctness - expected_shape = (len(test_sequences), 20, 128) # (batch, seq_len, embed_dim) - assert pos_embeddings.shape == expected_shape, f"Expected {expected_shape}, got {pos_embeddings.shape}" - - # Test that padding tokens have correct embeddings (should be zero from embedding layer) - padding_token_id = 0 # We used 0 for padding - - # Find positions with padding tokens - padding_positions = (batch_tokens.data == padding_token_id) - - if np.any(padding_positions): - # Get embeddings for padding positions - padding_embeddings = embeddings.data[padding_positions] - - # Padding embeddings should be close to zero (from embedding initialization) - # Note: they won't be exactly zero because we add positional encoding - print(f" Padding token embeddings found: {np.sum(padding_positions)} positions") - - # Test different sequence lengths - short_tokens = [23, 24] # Simple short sequence - short_tensor 
= Tensor(np.array([short_tokens])) # Add batch dimension - - short_embeddings = embed.forward(short_tensor) - short_pos_embeddings = pos_encoding.forward(short_embeddings) - - print(f" Short text processing: {short_pos_embeddings.shape}") - - # Test memory efficiency - large_batch_size = 32 - large_seq_length = 50 - large_tokens = np.random.randint(0, tokenizer.vocab_size, (large_batch_size, large_seq_length)) - large_tensor = Tensor(large_tokens) - - start_time = time.time() - large_embeddings = embed.forward(large_tensor) - large_pos_embeddings = pos_encoding.forward(large_embeddings) - processing_time = time.time() - start_time - - print(f" Large batch processing: {large_pos_embeddings.shape} in {processing_time*1000:.2f}ms") - - # Calculate memory usage - embedding_memory = embed.get_memory_usage() - total_memory_mb = embedding_memory['total_memory_mb'] - - print(f" Embedding table memory: {total_memory_mb:.2f}MB") - print(f" Sequence memory: {large_pos_embeddings.data.nbytes / (1024*1024):.2f}MB") - - print("PASS Complete embedding pipeline integration test passed!") - print(f"PASS Tokenization -> Embedding -> Positional Encoding pipeline works") - print(f"PASS Handles various batch sizes and sequence lengths") - print(f"PASS Memory usage is reasonable for production systems") - -# Test function defined (called in main block) - -# %% [markdown] -""" -## Main Execution Block - -All embedding tests and demonstrations are run from here when the module is executed directly: -""" - -# %% nbgrader={"grade": false, "grade_id": "embeddings-main", "locked": false, "schema_version": 3, "solution": false, "task": false} -def test_module(): - """Run all unit tests for this module.""" - print("🧪 TESTING MODULE: Embeddings") - print("=" * 50) - - # Run all unit tests - test_unit_embedding_layer() - test_unit_positional_encoding() - test_unit_learned_positional_embedding() - test_embedding_profiler() - test_embedding_integration() - - print("\n" + "=" * 50) - print("✅ ALL 
TESTS PASSED! Module ready for export.") - print("Run: tito module complete 11_embeddings") - -if __name__ == "__main__": - test_module() - - print("\n" + "="*60) - print("MAGNIFY EMBEDDING SYSTEMS ANALYSIS") - print("="*60) - - # Performance analysis - profiler = EmbeddingProfiler() - - # Test different embedding configurations - print("\n📊 EMBEDDING PERFORMANCE COMPARISON:") - - # Compare embedding layers with different sizes - vocab_sizes = [1000, 5000, 10000] - embedding_dims = [128, 256, 512] - - scaling_results = profiler.analyze_memory_scaling(vocab_sizes, embedding_dims) - - # Compare positional encoding approaches - print("\n" + "="*60) - pos_comparison = profiler.compare_positional_encodings(seq_length=128, embedding_dim=256) - - # Systems design analysis - print("\n" + "="*60) - analyze_embedding_system_design() - - # Demonstrate realistic language model embedding setup - print("\n" + "="*60) - print("🏗️ REALISTIC LANGUAGE MODEL EMBEDDING SETUP") - print("="*60) - - # Create realistic configuration - vocab_size = 10000 # 10k vocabulary - embedding_dim = 256 # 256-dim embeddings - max_seq_length = 512 # 512 token sequences - - print(f"Model configuration:") - print(f" Vocabulary size: {vocab_size:,}") - print(f" Embedding dimension: {embedding_dim}") - print(f" Max sequence length: {max_seq_length}") - - # Create components - embedding_layer = Embedding(vocab_size=vocab_size, embedding_dim=embedding_dim, padding_idx=0) - pos_encoding = PositionalEncoding(embedding_dim=embedding_dim, max_seq_length=max_seq_length) - - # Calculate memory requirements - embed_memory = embedding_layer.get_memory_usage() - - print(f"\nMemory analysis:") - print(f" Embedding table: {embed_memory['total_memory_mb']:.1f}MB") - print(f" Parameters: {embed_memory['total_parameters']:,}") - - # Simulate batch processing - batch_size = 32 - seq_length = 256 - test_tokens = np.random.randint(0, vocab_size, (batch_size, seq_length)) - - start_time = time.time() - embeddings = 
embedding_layer.forward(test_tokens) - pos_embeddings = pos_encoding.forward(embeddings) - total_time = time.time() - start_time - - sequence_memory_mb = pos_embeddings.data.nbytes / (1024 * 1024) - - print(f"\nBatch processing:") - print(f" Batch size: {batch_size}, Sequence length: {seq_length}") - print(f" Processing time: {total_time*1000:.2f}ms") - print(f" Sequence memory: {sequence_memory_mb:.1f}MB") - print(f" Throughput: {(batch_size * seq_length) / total_time:.0f} tokens/second") - - print("\n" + "="*60) - print("TARGET EMBEDDINGS MODULE COMPLETE!") - print("="*60) - print("All embedding tests passed!") - print("Ready for attention mechanism integration!") - -# %% [markdown] -""" -## THINK ML Systems Thinking: Interactive Questions - -Now that you've built the embedding systems that convert tokens to rich vector representations, let's connect this work to broader ML systems challenges. These questions help you think critically about how embedding design scales to production language processing systems. - -Take time to reflect thoughtfully on each question - your insights will help you understand how embedding choices connect to real-world ML systems engineering. -""" - -# %% [markdown] -""" -### Question 1: Embedding Memory Optimization and Model Scaling - -**Context**: Your embedding implementations demonstrate how vocabulary size and embedding dimension directly impact model parameters and memory usage. In your memory scaling analysis, you saw how a 100k vocabulary with 1024-dimensional embeddings requires ~400MB just for the embedding table. In production language models, embedding tables often contain billions of parameters (GPT-3's embedding table alone has ~600M parameters), making memory optimization critical for deployment and training efficiency. 
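For scale: the 100k x 1024 table quoted above really is ~400 MB in float32, and precision reduction is the simplest compression lever. A rough sketch of symmetric int8 quantization with per-row scales — illustrative only, not part of the module's `Embedding` API:

```python
import numpy as np

vocab_size, embed_dim = 100_000, 1024
fp32_mb = vocab_size * embed_dim * 4 / (1024 * 1024)                  # ~390.6 MB (the ~400 MB above)
int8_mb = (vocab_size * embed_dim + vocab_size * 4) / (1024 * 1024)   # 1 byte/value + one fp32 scale/row

# Round-trip one embedding row through int8: store quantized, rescale at lookup time.
rng = np.random.default_rng(0)
row = rng.standard_normal(embed_dim).astype(np.float32)
scale = np.abs(row).max() / 127.0
q = np.round(row / scale).astype(np.int8)        # stored: 1 byte per value
deq = q.astype(np.float32) * scale               # reconstructed embedding vector

print(f"fp32: {fp32_mb:.1f} MB -> int8+scales: {int8_mb:.1f} MB, "
      f"max abs error: {np.abs(row - deq).max():.4f}")
```

The rounding error is bounded by half a quantization step per value — one concrete quality-vs-memory trade-off the question asks you to reason about.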
- -**Reflection Question**: Based on your `Embedding` class implementation and memory scaling analysis, design a memory-optimized embedding system for a production language model that needs to handle a 100k vocabulary with 1024-dimensional embeddings while operating under GPU memory constraints. How would you modify your current `Embedding.forward()` method to implement embedding compression techniques, design efficient lookup patterns for high-throughput training, and handle dynamic vocabulary expansion for domain adaptation? Consider how your current weight initialization strategies could be adapted and what changes to your `get_memory_usage()` analysis would be needed for compressed embeddings. - -Think about: adapting your embedding lookup implementation, modifying weight storage patterns, extending your memory analysis for compression techniques, and designing efficient gradient updates for compressed representations. - -*Target length: 150-300 words* -""" - -# %% nbgrader={"grade": true, "grade_id": "question-1-embedding-memory", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} -""" -YOUR REFLECTION ON EMBEDDING MEMORY OPTIMIZATION: - -TODO: Replace this text with your thoughtful response about memory-optimized embedding system design. - -Consider addressing: -- How would you implement embedding compression for a 100k * 1024 vocabulary under GPU constraints? -- What techniques would you use to optimize lookup patterns for high-throughput training? -- How would you design dynamic vocabulary expansion while maintaining memory efficiency? -- What trade-offs would you make between embedding quality and memory footprint? -- How would you optimize differently for training vs inference scenarios? - -Write a technical analysis connecting your embedding implementations to real memory optimization challenges. 
- -GRADING RUBRIC (Instructor Use): -- Demonstrates understanding of embedding memory scaling and optimization (3 points) -- Designs practical approaches to compression and efficient lookup patterns (3 points) -- Addresses dynamic vocabulary and quality-memory trade-offs (2 points) -- Shows systems thinking about production memory constraints (2 points) -- Clear technical reasoning with memory optimization insights (bonus points for innovative approaches) -""" - -### BEGIN SOLUTION -# Student response area - instructor will replace this section during grading setup -# This is a manually graded question requiring technical analysis of embedding memory optimization -# Students should demonstrate understanding of large-scale embedding systems and memory efficiency -### END SOLUTION - -# %% [markdown] -""" -### Question 2: Positional Encoding and Sequence Length Scalability - -**Context**: Your positional encoding implementations show the trade-offs between fixed sinusoidal patterns and learned position embeddings. In your analysis, you saw that `PositionalEncoding` requires 0 parameters but `LearnedPositionalEmbedding` needs max_seq_length * embedding_dim parameters. Production language models increasingly need to handle variable sequence lengths efficiently while maintaining consistent position representations across different tasks and deployment scenarios. - -**Reflection Question**: Based on your `PositionalEncoding` and `LearnedPositionalEmbedding` implementations, architect a hybrid positional encoding system for a production transformer that efficiently handles sequences from 512 tokens to 32k tokens. How would you modify your current `forward()` methods to create a hybrid approach that combines the benefits of both systems? What changes would you make to your position computation to optimize for variable-length sequences, and how would you extend your positional encoding comparison analysis to measure performance across different sequence length distributions? 
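For concreteness when weighing the two approaches: the sinusoidal pattern is pure computation, so extending from 512 to 32k positions adds zero parameters, while a learned table must grow by `seq_length * embedding_dim` trainable weights. A standalone sketch of the standard formula — our own helper, independent of this module's `PositionalEncoding` class, assuming an even embedding dimension for brevity:

```python
import numpy as np

def sinusoidal_pe(seq_len, dim):
    """Standard sin/cos positional encoding -- computed on demand, zero parameters."""
    pos = np.arange(seq_len, dtype=np.float64)[:, None]
    i = np.arange(0, dim, 2, dtype=np.float64)       # pair index (assumes even dim)
    angles = pos / (10000.0 ** (i / dim))
    pe = np.zeros((seq_len, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Each position is encoded identically at any length, so the 512-token encoding
# is simply a prefix of the 32k one -- this is what makes extrapolation cheap.
assert np.allclose(sinusoidal_pe(512, 64), sinusoidal_pe(32_000, 64)[:512])
print("learned-table cost at 32k:", 32_000 * 64, "extra parameters; sinusoidal: 0")
```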
- -Think about: combining your two encoding implementations, modifying the forward pass for variable lengths, extending your performance analysis methods, and optimizing position computation patterns from your current code. - -*Target length: 150-300 words* -""" - -# %% nbgrader={"grade": true, "grade_id": "question-2-positional-encoding", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} -""" -YOUR REFLECTION ON POSITIONAL ENCODING AND SEQUENCE SCALABILITY: - -TODO: Replace this text with your thoughtful response about scalable positional encoding system design. - -Consider addressing: -- How would you design hybrid positional encoding for sequences from 512 to 32k tokens? -- What strategies would you use to optimize position computation for variable-length sequences? -- How would you balance memory efficiency with computational performance? -- What approaches would you use to handle different sequence length distributions? -- How would you maintain training stability across diverse sequence lengths? - -Write an architectural analysis connecting your positional encoding work to scalable sequence processing. 
- -GRADING RUBRIC (Instructor Use): -- Shows understanding of positional encoding scalability challenges (3 points) -- Designs practical approaches to hybrid encoding and variable-length optimization (3 points) -- Addresses memory and computational efficiency considerations (2 points) -- Demonstrates systems thinking about sequence length distribution handling (2 points) -- Clear architectural reasoning with scalability insights (bonus points for comprehensive system design) -""" - -### BEGIN SOLUTION -# Student response area - instructor will replace this section during grading setup -# This is a manually graded question requiring understanding of positional encoding scalability -# Students should demonstrate knowledge of sequence length optimization and hybrid approaches -### END SOLUTION - -# %% [markdown] -""" -### Question 3: Embedding Pipeline Integration and Training Efficiency - -**Context**: Your embedding pipeline integration demonstrates how tokenization, embedding lookup, and positional encoding work together in language model preprocessing. In your `test_embedding_integration()` function, you measured pipeline performance and saw how batch size affects throughput. In production training systems, the embedding pipeline often becomes a bottleneck due to memory bandwidth limitations and the need to process billions of tokens efficiently during training. - -**Reflection Question**: Based on your complete embedding pipeline implementation (tokenization -> `Embedding.forward()` -> `PositionalEncoding.forward()`), design an optimization strategy for large-scale language model training that processes 1 trillion tokens efficiently. How would you modify your current pipeline functions to implement batch processing optimizations for mixed sequence lengths, design efficient gradient updates for your massive `Embedding.weight` parameters, and coordinate embedding updates across distributed training nodes? 
Consider how your current memory analysis and performance measurement techniques could be extended to monitor pipeline bottlenecks in distributed settings. - -Think about: optimizing your current pipeline implementation, extending your performance analysis to distributed settings, modifying your batch processing patterns, and scaling your embedding weight update mechanisms. - -*Target length: 150-300 words* -""" - -# %% nbgrader={"grade": true, "grade_id": "question-3-pipeline-integration", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} -""" -YOUR REFLECTION ON EMBEDDING PIPELINE INTEGRATION: - -TODO: Replace this text with your thoughtful response about embedding pipeline optimization for large-scale training. - -Consider addressing: -- How would you implement pipeline parallelism for processing 1 trillion tokens efficiently? -- What strategies would you use to optimize batch processing for mixed sequence lengths? -- How would you design efficient gradient updates for massive embedding tables? -- What approaches would you use for coordinating embedding updates across distributed nodes? -- How would you maintain GPU utilization while minimizing memory bandwidth bottlenecks? - -Write a design analysis connecting your embedding pipeline to large-scale training optimization. 
- -GRADING RUBRIC (Instructor Use): -- Understands embedding pipeline bottlenecks and optimization challenges (3 points) -- Designs practical approaches to pipeline parallelism and batch optimization (3 points) -- Addresses distributed training and gradient update efficiency (2 points) -- Shows systems thinking about large-scale training coordination (2 points) -- Clear design reasoning with pipeline optimization insights (bonus points for innovative approaches) -""" - -### BEGIN SOLUTION -# Student response area - instructor will replace this section during grading setup -# This is a manually graded question requiring understanding of large-scale embedding pipeline optimization -# Students should demonstrate knowledge of distributed training and pipeline efficiency -### END SOLUTION - -# %% [markdown] -""" -## TARGET MODULE SUMMARY: Embeddings - -Congratulations! You have successfully implemented comprehensive embedding systems for language processing: - -### PASS What You Have Built -- **Embedding Layer**: Learnable lookup table converting tokens to dense vector representations -- **Positional Encoding**: Sinusoidal position information for sequence understanding -- **Learned Positional Embeddings**: Trainable position representations for model-specific optimization -- **Memory-Efficient Lookups**: Optimized embedding access patterns for production systems -- **Performance Analysis**: Comprehensive profiling and scaling analysis tools -- **🆕 Integration Pipeline**: Complete tokenization -> embedding -> positional encoding workflow -- **🆕 Systems Optimization**: Memory usage analysis and performance optimization techniques - -### PASS Key Learning Outcomes -- **Understanding**: How discrete tokens become continuous vector representations -- **Implementation**: Built embedding systems from scratch with efficient lookup operations -- **Systems Insight**: How embedding table size affects model memory and training efficiency -- **Performance Engineering**: Measured 
and optimized embedding lookup patterns and memory usage -- **Production Context**: Understanding real-world embedding challenges and optimization techniques - -### PASS Technical Mastery -- **Embedding Lookup**: Efficient table lookup with various initialization strategies -- **Positional Encoding**: Mathematical sine/cosine patterns for position representation -- **Memory Scaling**: Understanding O(vocab_size * embedding_dim) parameter scaling -- **Performance Optimization**: Cache-friendly access patterns and memory bandwidth optimization -- **🆕 Integration Design**: Seamless pipeline from text processing to vector representations - -### PASS Professional Skills Developed -- **Systems Architecture**: Designing embedding systems for production scale -- **Memory Engineering**: Optimizing large parameter tables for efficient access -- **Performance Analysis**: Measuring and improving embedding pipeline throughput -- **Integration Thinking**: Connecting embedding systems with tokenization and attention - -### PASS Ready for Next Steps -Your embedding systems are now ready to power: -- **Attention Mechanisms**: Processing sequence representations with attention -- **Transformer Models**: Complete language model architectures -- **Language Understanding**: Rich semantic representations for NLP tasks -- **🧠 Sequence Processing**: Foundation for advanced sequence modeling - -### LINK Connection to Real ML Systems -Your implementations mirror production systems: -- **PyTorch Embeddings**: `torch.nn.Embedding` and `torch.nn.functional.embedding` -- **Transformer Models**: All modern language models use similar embedding approaches -- **Production Optimizations**: Memory mapping, gradient checkpointing, and distributed embeddings -- **Industry Applications**: GPT, BERT, and other transformer models rely on these foundations - -### TARGET The Power of Dense Representations -You have unlocked the bridge between discrete tokens and continuous understanding: -- **Before**: 
Tokens were sparse, discrete symbols -- **After**: Tokens become rich, continuous vectors that capture semantic relationships - -**Next Module**: Attention - Processing sequences with the mechanism that revolutionized language understanding! - -Your embedding systems provide the rich vector representations that attention mechanisms need to understand language. Now let's build the attention that makes transformers work! -""" \ No newline at end of file diff --git a/modules_old/11_embeddings/module.yaml b/modules_old/11_embeddings/module.yaml deleted file mode 100644 index ec1a72c8..00000000 --- a/modules_old/11_embeddings/module.yaml +++ /dev/null @@ -1,29 +0,0 @@ -description: Dense vector representations that convert discrete tokens into continuous - semantic spaces -estimated_time: 4-5 hours -exports: -- Embedding -- PositionalEncoding -- LearnedPositionalEmbedding -- EmbeddingProfiler -learning_objectives: -- Implement embedding layers with efficient lookup operations -- Build sinusoidal and learned positional encoding systems -- Understand embedding memory scaling and optimization techniques -- Analyze how embedding choices affect model capacity and performance -- Design embedding systems for production language model deployment -ml_systems_focus: Memory-efficient embedding lookup, position encoding scalability, - large-scale parameter management -name: Embeddings -next_modules: -- 13_attention -number: 12 -prerequisites: -- 02_tensor -- 11_tokenization -systems_concepts: -- "Embedding table memory scaling O(vocab_size \xD7 embed_dim)" -- Memory-bandwidth bound lookup operations -- Cache-friendly embedding access patterns -- Position encoding trade-offs and extrapolation -- Distributed embedding table management diff --git a/modules_old/12_attention/README.md b/modules_old/12_attention/README.md deleted file mode 100644 index 03f753f6..00000000 --- a/modules_old/12_attention/README.md +++ /dev/null @@ -1,97 +0,0 @@ -# Module 13: Attention - The Mechanism That 
Revolutionized Language Understanding - -## Overview -This module implements the attention mechanisms that power modern transformer architectures. You'll build scaled dot-product attention, multi-head attention, and KV-cache systems while understanding how attention's quadratic scaling affects practical transformer deployment and optimization strategies. - -## What You'll Learn - -### Core Implementations -- **Scaled Dot-Product Attention**: The fundamental attention mechanism with masking support -- **Multi-Head Attention**: Parallel attention heads with linear projections and output combination -- **KV-Cache System**: Efficient caching for autoregressive text generation -- **Causal Masking**: Support for autoregressive language modeling patterns - -### ML Systems Concepts -- **Quadratic Scaling**: How O(N²) memory scaling limits transformer sequence length -- **Memory Bottlenecks**: Understanding attention as the memory constraint in transformers -- **Generation Efficiency**: KV-cache optimization for production text generation -- **Hardware Optimization**: Attention parallelization and memory bandwidth optimization - -### Performance Engineering -- **Attention Profiling**: Measuring computation time and memory usage scaling -- **Scaling Analysis**: Understanding practical limits of attention-based architectures -- **Optimization Techniques**: Memory-efficient attention patterns and cache management -- **Production Patterns**: Real-world attention system design and deployment strategies - -## Key Learning Outcomes - -By completing this module, you'll understand: - -1. **Attention Mathematics**: The scaled dot-product attention formula and its implementation -2. **Multi-Head Architecture**: How parallel attention heads capture diverse relationships -3. **Memory Scaling**: Why attention's O(N²) complexity fundamentally limits sequence length -4. **Generation Optimization**: How KV-cache dramatically improves autoregressive efficiency -5. 
**Production Systems**: How real transformers optimize attention for deployment constraints - -## Files in This Module - -- `attention_dev.py` - Main implementation with all attention mechanisms -- `attention_dev.ipynb` - Jupyter notebook (auto-generated) -- `module.yaml` - Module configuration and metadata -- `README.md` - This documentation file - -## Usage Example - -```python -from tinytorch.core.attention import ScaledDotProductAttention, MultiHeadAttention -from tinytorch.core.embeddings import Embedding, PositionalEncoding - -# Create attention mechanisms -scaled_attn = ScaledDotProductAttention() -multi_head_attn = MultiHeadAttention(embed_dim=256, num_heads=8) - -# Process sequences with attention -query = key = value = embeddings # Self-attention -output = multi_head_attn(query, key, value) - -# Causal masking for generation -causal_mask = create_causal_mask(seq_length) -masked_output = multi_head_attn(query, key, value, mask=causal_mask) -``` - -## Integration with TinyTorch - -This module exports to `tinytorch.core.attention` and provides the attention foundation for: -- **Transformer blocks** (Module 14) - Complete transformer layer implementation -- **Language generation** - Efficient autoregressive text generation -- **Sequence modeling** - Advanced sequence processing architectures - -## Systems Engineering Focus - -This module emphasizes the systems engineering aspects of attention: - -### Memory Characteristics -- **Quadratic scaling**: Attention memory = O(batch_size × seq_length²) -- **Memory bottleneck**: Attention often limits practical transformer sequence length -- **KV-cache benefits**: Reduces generation memory from O(N²) to O(N) -- **GPU memory limits**: Determines maximum feasible sequence lengths - -### Performance Considerations -- **Matrix multiplication bound**: Attention performance limited by GEMM operations -- **Memory bandwidth**: Large attention matrices stress memory subsystem -- **Parallelization**: Multi-head attention 
enables parallel computation -- **Generation patterns**: Autoregressive vs parallel processing trade-offs - -## Prerequisites -- Module 02: Tensor (for matrix operations and data structures) -- Module 12: Embeddings (for understanding sequence representations) -- Understanding of matrix multiplication and softmax operations - -## Estimated Time -5-6 hours including implementation, testing, and performance analysis - -## Next Steps -After completing this module, you'll be ready for: -- **Module 14: Transformers** - Complete transformer block implementation -- Advanced transformer architectures and optimization techniques -- Production language model deployment and serving systems \ No newline at end of file diff --git a/modules_old/12_attention/attention_dev.py b/modules_old/12_attention/attention_dev.py deleted file mode 100644 index b474b67c..00000000 --- a/modules_old/12_attention/attention_dev.py +++ /dev/null @@ -1,2503 +0,0 @@ -# --- -# jupyter: -# jupytext: -# text_representation: -# extension: .py -# format_name: percent -# format_version: '1.3' -# jupytext_version: 1.17.1 -# --- - -# %% [markdown] -""" -# Attention - The Mechanism That Revolutionized Language Understanding - -Welcome to the Attention module! You'll implement the scaled dot-product attention and multi-head attention mechanisms that enable neural networks to focus on relevant parts of input sequences. - -## Learning Goals -- Systems understanding: How attention's O(N²) complexity affects memory usage and computational scaling -- Core implementation skill: Build attention mechanisms with efficient memory management -- Pattern recognition: Understand how attention enables sequence modeling and long-range dependencies -- Framework connection: See how your implementations match PyTorch's attention systems -- Performance insight: Learn how attention patterns affect training efficiency and model capabilities - -## Build -> Use -> Reflect -1. 
**Build**: Scaled dot-product attention and multi-head attention with masking and KV-cache -2. **Use**: Process sequences to capture dependencies between distant tokens -3. **Reflect**: How does attention's quadratic scaling determine practical limits of sequence length? - -## What You'll Achieve -By the end of this module, you'll understand: -- Deep technical understanding of how attention enables sequence models to capture dependencies -- Practical capability to implement attention with memory-efficient patterns and causal masking -- Systems insight into how attention's O(N²) scaling affects model architecture and deployment -- Performance consideration of how attention optimization affects practical sequence processing -- Connection to production systems and their attention optimization techniques - -## Systems Reality Check -TIP **Production Context**: Attention's O(N²) scaling makes it the memory bottleneck in sequence models -SPEED **Performance Note**: O(N²) memory scaling means 2x sequence length = 4x attention memory - this fundamentally limits sequence processing -""" - -# %% nbgrader={"grade": false, "grade_id": "attention-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} -#| default_exp core.attention - -#| export -import math -import numpy as np -import os -import sys -from typing import Union, List, Optional, Tuple, Dict - -# Constants for attention computation -ATTENTION_MASK_VALUE = -1e9 # Large negative value that becomes ~0 after softmax - # -1e9 chosen to avoid numerical underflow while ensuring masking -NUMERICAL_STABILITY_EPSILON = 1e-8 # For numerical stability in computations -FLOAT32_BYTES = 4 # Size of float32 in bytes for memory calculations - -# Import our Tensor class - try from package first, then from local module -try: - from tinytorch.core.tensor import Tensor -except ImportError: - # For development, import from local tensor module - sys.path.append(os.path.join(os.path.dirname(__file__), '..', 
'01_tensor')) - from tensor_dev import Tensor - -# Try to import embedding classes -try: - from tinytorch.core.embeddings import Embedding, PositionalEncoding -except ImportError: - # For development, import from local module - sys.path.append(os.path.join(os.path.dirname(__file__), '..', '12_embeddings')) - try: - from embeddings_dev import Embedding, PositionalEncoding - except ImportError: - # Create minimal mock classes if not available - class Embedding: - def __init__(self, vocab_size, embedding_dim): - self.vocab_size = vocab_size - self.embedding_dim = embedding_dim - class PositionalEncoding: - def __init__(self, embedding_dim, max_seq_length=5000): - self.embedding_dim = embedding_dim - -# %% nbgrader={"grade": false, "grade_id": "attention-welcome", "locked": false, "schema_version": 3, "solution": false, "task": false} -print("TARGET TinyTorch Attention Module") -print(f"NumPy version: {np.__version__}") -print("Ready to build attention mechanisms!") - -# %% [markdown] -""" -## PACKAGE Where This Code Lives in the Final Package - -**Learning Side:** You work in `modules/source/13_attention/attention_dev.py` -**Building Side:** Code exports to `tinytorch.core.attention` - -```python -# Final package structure: -from tinytorch.core.attention import ScaledDotProductAttention, MultiHeadAttention -from tinytorch.core.embeddings import Embedding, PositionalEncoding # Previous module -from tinytorch.core.layers import Module # Base module class -``` - -**Why this matters:** -- **Learning:** Focused modules for deep understanding -- **Production:** Proper organization like PyTorch's `torch.nn.MultiheadAttention` -- **Consistency:** All attention mechanisms live together in `core.attention` -- **Integration:** Works seamlessly with embeddings and sequence processing architectures -""" - -# %% [markdown] -""" -## What is Attention? 
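Before the step-by-step breakdown that follows, here is the entire mechanism as a few lines of NumPy — a minimal single-sequence sketch (no batch dimension, no masking, no learned projections; the function name and shapes are illustrative only, not this module's API):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal sketch: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq_q, seq_k) attention scores
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize softmax against overflow
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V                            # weighted average of the values

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                  # seq_len=4, d_k=8
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V
print(out.shape)                             # (4, 8)
```

Everything this module adds — batching, causal masking, multiple heads, KV-caching — is layered on top of these few lines of math.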
- -### The Problem: Sequence Dependencies -Traditional RNNs process sequences step-by-step, making it hard to capture long-range dependencies: -``` -"The cat, which was sitting on the mat, was hungry" - ^ ^ - Subject must agree with verb - but they're far apart! -``` - -### Visual Understanding: Attention Mechanism - -``` -Query-Key-Value Attention Visualization: - - Query (Q) Key (K) Value (V) - +-------------+ +-----------+ +-------------+ - | "What am I | | "What can | | "What info | - | looking | | I attend | | do I get | - | for?" | | to?" | | from it?" | - +-------------+ +-----------+ +-------------+ - | | | - +------+-------+ | - v | - Attention | - Scores | - QK^T / sqrt(d_k) | - | | - v | - Softmax ------------------+ - Weights | - | | - +----------------------+ - | - v - Weighted Sum - (Attended Output) -``` - -### Step-by-Step Attention Process: - -``` -Step 1: Compute Attention Scores - Q: [seq_len, d_model] @ K^T: [d_model, seq_len] - ------------------------------------------------ - Scores: [seq_len, seq_len] ("How much to attend?") - -Step 2: Scale for Numerical Stability - Scores = Scores / sqrt(d_k) - (Prevents saturation in softmax) - -Step 3: Apply Softmax - Weights = softmax(Scores) - [Each row sums to 1 - probability distribution] - -Step 4: Weighted Combination - Output = Weights @ V - [Weighted average of all values based on attention] -``` - -### Multi-Head Attention Architecture: - -``` - Input Embeddings [batch, seq_len, d_model] - | - +-------+-------+ - | | | - W_Q W_K W_V (Linear projections) - | | | - | Reshape to Multiple Heads - | [batch, heads, seq_len, d_k] - | | | - +-------+-------+ - | - Scaled Dot-Product Attention - (Applied to each head) - | - Concatenate Heads - [batch, seq_len, d_model] - | - Linear Output Projection (W_O) - | - Multi-Head Output -``` - -### Attention Solution -Attention allows every position to directly attend to every other position: -``` -Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V -``` - -Where: -
**Q (Query)**: "What am I looking for?" -- **K (Key)**: "What can I attend to?" -- **V (Value)**: "What information do I get?" - -### Why Attention Works -- **Parallelization**: All positions computed simultaneously -- **Long-range**: Direct connections between distant tokens -- **Flexible**: Attention weights learned during training -- **Interpretable**: Attention patterns show what the model focuses on - -### Causal Masking for Language Generation: - -``` -Without Masking (Bi-directional): - t1 t2 t3 t4 - t1 [A] [A] [A] [A] <- Can see all positions - t2 [A] [A] [A] [A] - t3 [A] [A] [A] [A] - t4 [A] [A] [A] [A] - -With Causal Masking (Auto-regressive): - t1 t2 t3 t4 - t1 [A] [-] [-] [-] <- Can only see current/past - t2 [A] [A] [-] [-] - t3 [A] [A] [A] [-] - t4 [A] [A] [A] [A] - - [A] = Attend [-] = Masked (set to -inf) -``` - -### Systems Trade-offs -- **Memory**: O(N²) scaling with sequence length -- **Computation**: Matrix multiplications scale with sequence length² -- **Parallelization**: Highly parallelizable on GPUs -- **Sequence limits**: Quadratic scaling limits practical sequence length -""" - -# %% [markdown] -""" -## Scaled Dot-Product Attention Implementation - -Let's start with the core attention mechanism - scaled dot-product attention that enables sequence models to focus selectively. -""" - -# %% nbgrader={"grade": false, "grade_id": "scaled-attention", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class ScaledDotProductAttention: - """ - Scaled Dot-Product Attention mechanism. - - The fundamental attention computation for sequence processing: - Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V - - This allows each position to attend to all positions in the sequence. - """ - - def __init__(self): - """ - Initialize scaled dot-product attention. 
- - The fundamental attention computation for sequence processing: - Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V - """ - pass - - def forward(self, query: Tensor, key: Tensor, value: Tensor, - mask: Optional[Tensor] = None, - return_attention_weights: bool = False) -> Union[Tensor, Tuple[Tensor, Tensor]]: - """ - Compute scaled dot-product attention. - - TODO: Implement scaled dot-product attention. - - STEP-BY-STEP IMPLEMENTATION: - 1. Compute attention scores: query @ key.transpose() - 2. Scale by sqrt(key_dim) for numerical stability - 3. Apply mask if provided (set masked positions to large negative values) - 4. Apply softmax to get attention weights - 5. Apply attention weights to values: attention_weights @ value - 6. Return attended values (and optionally attention weights) - - MATHEMATICAL FOUNDATION: - scores = QK^T / sqrt(d_k) - attention_weights = softmax(scores) - output = attention_weights @ V - - MASKING: - - Set masked positions to -1e9 before softmax - - This makes them effectively zero after softmax - - Used for causal (autoregressive) attention - - Args: - query: Query tensor with shape (batch_size, seq_len_q, d_k) - key: Key tensor with shape (batch_size, seq_len_k, d_k) - value: Value tensor with shape (batch_size, seq_len_v, d_v) - mask: Optional mask tensor with shape (seq_len_q, seq_len_k) or broadcastable - return_attention_weights: Whether to return attention weights - - Returns: - Attended values with shape (batch_size, seq_len_q, d_v) - Optionally also attention weights with shape (batch_size, seq_len_q, seq_len_k) - """ - ### BEGIN SOLUTION - # Get dimensions - batch_size, seq_len_q, d_k = query.shape - _, seq_len_k, _ = key.shape - _, seq_len_v, d_v = value.shape - - assert seq_len_k == seq_len_v, "Key and Value must have same sequence length" - - # Step 1: Compute attention scores QK^T - # Visualization: Q[batch,seq_q,d_k] @ K^T[batch,d_k,seq_k] -> Scores[batch,seq_q,seq_k] - # Each element scores[i,j] = "how much should position i 
attend to position j?" - - # query: (batch, seq_q, d_k), key: (batch, seq_k, d_k) - # We need key^T, so we transpose the last two dimensions - key_transposed = np.transpose(key.data, (0, 2, 1)) # (batch, d_k, seq_k) - - # Batch matrix multiplication: (batch, seq_q, d_k) @ (batch, d_k, seq_k) -> (batch, seq_q, seq_k) - scores = np.matmul(query.data, key_transposed) - - # Step 2: Scale by sqrt(d_k) for numerical stability - # Why scaling? Large dot products -> extreme softmax -> vanishing gradients - scores = scores / math.sqrt(d_k) - - # Step 3: Apply mask if provided (critical for causal/autoregressive attention) - if mask is not None: - # Large negative value that becomes ~0 after softmax - # -1e9 chosen to avoid numerical underflow while ensuring effective masking - mask_value = ATTENTION_MASK_VALUE # -1e9 - - # Handle different mask input types - if isinstance(mask, Tensor): - mask_array = mask.data - else: - mask_array = mask - - # Apply mask: set masked positions to large negative values - # mask convention: 1 for positions to keep, 0 for positions to mask - # This enables causal masking for autoregressive generation - - # Handle both 2D and 3D masks correctly - if len(mask_array.shape) == 2: - # 2D mask (seq_len, seq_len) - broadcast to match scores shape (batch, seq_len, seq_len) - mask_array = np.broadcast_to(mask_array, scores.shape) - - masked_scores = np.where(mask_array == 0, mask_value, scores) - scores = masked_scores - - # Step 4: Apply softmax to get attention weights - # Numerical stable softmax: subtract max to prevent overflow - # Result: each row sums to 1 (proper probability distribution) - scores_max = np.max(scores, axis=-1, keepdims=True) - exp_scores = np.exp(scores - scores_max) - attention_weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True) - - # Step 5: Apply attention weights to values (weighted combination) - # attention_weights: (batch, seq_q, seq_k), value: (batch, seq_k, d_v) - # Result: (batch, seq_q, d_v) - each 
output position is weighted sum of all values - attended_values = np.matmul(attention_weights, value.data) - - output = Tensor(attended_values) - - if return_attention_weights: - return output, Tensor(attention_weights) - else: - return output - ### END SOLUTION - - def __call__(self, query: Tensor, key: Tensor, value: Tensor, - mask: Optional[Tensor] = None, - return_attention_weights: bool = False) -> Union[Tensor, Tuple[Tensor, Tensor]]: - """Make the class callable.""" - return self.forward(query, key, value, mask, return_attention_weights) - -# PASS IMPLEMENTATION CHECKPOINT: Ensure your ScaledDotProductAttention is complete before running - -# THINK PREDICTION: How do you think attention weights will distribute? -# With random inputs: Uniform? Concentrated? Your guess: _______ - -# MAGNIFY SYSTEMS INSIGHT #1: Attention Weight Distribution Analysis -def analyze_attention_distribution(): - """Analyze how attention weights distribute across different scenarios.""" - try: - print("📊 ATTENTION WEIGHT DISTRIBUTION ANALYSIS") - print("=" * 50) - - attention = ScaledDotProductAttention() - batch_size, seq_len, d_k = 2, 8, 16 - - # Test different input scenarios - scenarios = [ - ("Random inputs", np.random.randn(batch_size, seq_len, d_k)), - ("Similar queries/keys", np.ones((batch_size, seq_len, d_k)) * 0.1), - ("Extreme values", np.random.randn(batch_size, seq_len, d_k) * 10) - ] - - for scenario_name, data in scenarios: - query = key = value = Tensor(data) - - # Get attention weights - output, weights = attention.forward(query, key, value, return_attention_weights=True) - - # Analyze distribution - weights_flat = weights.data.flatten() - max_weight = np.max(weights_flat) - min_weight = np.min(weights_flat) - std_weight = np.std(weights_flat) - entropy = -np.sum(weights_flat * np.log(weights_flat + 1e-10)) # Attention entropy - - print(f"\n{scenario_name}:") - print(f" Max attention: {max_weight:.4f}") - print(f" Min attention: {min_weight:.4f}") - print(f" Std 
deviation: {std_weight:.4f}") - print(f" Attention entropy: {entropy:.2f} (higher = more dispersed)") - - # Check if weights sum to 1 (softmax property) - row_sums = np.sum(weights.data, axis=-1) - assert np.allclose(row_sums, 1.0), f"Attention weights should sum to 1 in {scenario_name}" - - print(f"\nTIP WHY THIS MATTERS:") - print(f" - Random inputs -> relatively uniform attention (high entropy)") - print(f" - Similar inputs -> more concentrated attention (lower entropy)") - print(f" - Extreme values can lead to attention collapse (very low entropy)") - print(f" - Real language models learn meaningful attention patterns!") - - except Exception as e: - print(f"WARNING️ Make sure ScaledDotProductAttention is implemented correctly") - print(f"Error: {e}") - -# Run the analysis -analyze_attention_distribution() - -# %% [markdown] -""" -### TEST Test Your Scaled Dot-Product Attention Implementation - -Once you implement the ScaledDotProductAttention forward method above, run this cell to test it: -""" - -# %% nbgrader={"grade": true, "grade_id": "test-scaled-attention-immediate", "locked": true, "points": 20, "schema_version": 3, "solution": false, "task": false} -def test_unit_scaled_attention(): - """Unit test for scaled dot-product attention.""" - print("🔬 Unit Test: Scaled Dot-Product Attention...") - - # Create attention layer - attention = ScaledDotProductAttention() - - # Test basic attention computation - batch_size = 2 - seq_len = 4 - d_k = 8 - d_v = 6 - - # Create test inputs - query = Tensor(np.random.randn(batch_size, seq_len, d_k)) - key = Tensor(np.random.randn(batch_size, seq_len, d_k)) - value = Tensor(np.random.randn(batch_size, seq_len, d_v)) - - # Test forward pass - output = attention.forward(query, key, value) - expected_shape = (batch_size, seq_len, d_v) - assert output.shape == expected_shape, f"Expected shape {expected_shape}, got {output.shape}" - - # Test with different sequence lengths - seq_len_k = 6 - key_diff = 
Tensor(np.random.randn(batch_size, seq_len_k, d_k)) - value_diff = Tensor(np.random.randn(batch_size, seq_len_k, d_v)) - - output_diff = attention.forward(query, key_diff, value_diff) - expected_shape_diff = (batch_size, seq_len, d_v) - assert output_diff.shape == expected_shape_diff, f"Expected shape {expected_shape_diff}, got {output_diff.shape}" - - # Test with attention weights return - output, attn_weights = attention.forward(query, key, value, return_attention_weights=True) - expected_attn_shape = (batch_size, seq_len, seq_len) - assert attn_weights.shape == expected_attn_shape, f"Expected attention shape {expected_attn_shape}, got {attn_weights.shape}" - - # Verify attention weights sum to 1 (softmax property) - attn_sums = np.sum(attn_weights.data, axis=-1) # Sum over keys for each query - assert np.allclose(attn_sums, 1.0), "Attention weights should sum to 1" - - # Test with causal mask - causal_mask = np.triu(np.ones((seq_len, seq_len)), k=1) # Upper triangular mask - causal_mask = 1 - causal_mask # Flip: 1 for allowed, 0 for masked - - output_masked, attn_masked = attention.forward(query, key, value, - mask=Tensor(causal_mask), - return_attention_weights=True) - - # Verify causal mask works - future positions should have ~0 attention - # Upper triangular part (excluding diagonal) should be close to 0 - for i in range(seq_len): - for j in range(i+1, seq_len): - assert np.all(attn_masked.data[:, i, j] < 1e-6), f"Future position ({i},{j}) should have near-zero attention" - - # Test callable interface - output_callable = attention(query, key, value) - assert np.allclose(output_callable.data, output.data), "Callable interface should work" - - # Test numerical stability with extreme values - extreme_query = Tensor(np.ones((1, 2, 4)) * 100) # Large values - extreme_key = Tensor(np.ones((1, 2, 4)) * 100) - extreme_value = Tensor(np.random.randn(1, 2, 4)) - - extreme_output = attention.forward(extreme_query, extreme_key, extreme_value) - assert not 
np.any(np.isnan(extreme_output.data)), "Should handle extreme values without NaN" - assert not np.any(np.isinf(extreme_output.data)), "Should handle extreme values without inf" - - print("PASS Scaled dot-product attention tests passed!") - print(f"PASS Handles various input shapes and sequence lengths") - print(f"PASS Attention weights sum to 1 (softmax property)") - print(f"PASS Causal masking works correctly") - print(f"PASS Numerical stability with extreme values") - -# Test function defined (called in main block) - -# %% [markdown] -""" -## Multi-Head Attention Implementation - -Now let's implement multi-head attention, which runs multiple attention heads in parallel and concatenates their outputs. This allows the model to attend to different types of information simultaneously. -""" - -# %% nbgrader={"grade": false, "grade_id": "multi-head-attention", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class MultiHeadAttention: - """ - Multi-Head Attention mechanism. - - Runs multiple attention heads in parallel and combines their outputs. - This allows the model to attend to different representation subspaces - simultaneously, capturing diverse types of relationships. - """ - - def __init__(self, embed_dim: int, num_heads: int, dropout: float = 0.0): - """ - Initialize multi-head attention. - - TODO: Implement multi-head attention initialization. - - STEP-BY-STEP IMPLEMENTATION: - 1. Store configuration parameters - 2. Calculate head dimension (embed_dim must be divisible by num_heads) - 3. Initialize linear projection layers for Q, K, V, and output - 4. 
Create scaled dot-product attention layer - - DESIGN DECISIONS: - - Each head gets embed_dim // num_heads dimensions - - Separate linear layers for Q, K, V projections - - Output projection to combine all heads - - Args: - embed_dim: Embedding dimension (total across all heads) - num_heads: Number of attention heads - dropout: Dropout rate for attention weights - """ - ### BEGIN SOLUTION - self.embed_dim = embed_dim - self.num_heads = num_heads - - # Check that embed_dim is divisible by num_heads - if embed_dim % num_heads != 0: - raise ValueError(f"embed_dim ({embed_dim}) must be divisible by num_heads ({num_heads})") - - self.head_dim = embed_dim // num_heads - - # Initialize projection layers (these would be proper Linear layers in full implementation) - # For now, we'll use simple weight matrices - self.w_q = Tensor(np.random.randn(embed_dim, embed_dim) / math.sqrt(embed_dim)) - self.w_k = Tensor(np.random.randn(embed_dim, embed_dim) / math.sqrt(embed_dim)) - self.w_v = Tensor(np.random.randn(embed_dim, embed_dim) / math.sqrt(embed_dim)) - self.w_o = Tensor(np.random.randn(embed_dim, embed_dim) / math.sqrt(embed_dim)) - - # Store parameters for optimization - self.parameters = [self.w_q, self.w_k, self.w_v, self.w_o] - - # Create scaled dot-product attention - self.scaled_attention = ScaledDotProductAttention() - ### END SOLUTION - - def forward(self, query: Tensor, key: Tensor, value: Tensor, - mask: Optional[Tensor] = None, - return_attention_weights: bool = False) -> Union[Tensor, Tuple[Tensor, Tensor]]: - """ - Compute multi-head attention. - - TODO: Implement multi-head attention forward pass. - - STEP-BY-STEP IMPLEMENTATION: - 1. Linear projections: compute Q, K, V from inputs - 2. Reshape for multiple heads: (batch, seq, embed) -> (batch, heads, seq, head_dim) - 3. Apply scaled dot-product attention for all heads simultaneously - 4. Reshape back: (batch, heads, seq, head_dim) -> (batch, seq, embed) - 5. 
Apply output projection - - RESHAPING DETAILS: - - Input: (batch_size, seq_len, embed_dim) - - After projection: (batch_size, seq_len, embed_dim) - - Reshaped for heads: (batch_size, seq_len, num_heads, head_dim) - - Transposed for attention: (batch_size, num_heads, seq_len, head_dim) - - Args: - query: Query tensor with shape (batch_size, seq_len, embed_dim) - key: Key tensor with shape (batch_size, seq_len, embed_dim) - value: Value tensor with shape (batch_size, seq_len, embed_dim) - mask: Optional mask tensor - return_attention_weights: Whether to return attention weights - - Returns: - Multi-head attention output with shape (batch_size, seq_len, embed_dim) - Optionally also attention weights from all heads - """ - ### BEGIN SOLUTION - batch_size, seq_len, embed_dim = query.shape - - # Step 1: Linear projections for Q, K, V - # Transform input embeddings into query, key, value representations - # Each projection learns different aspects: Q=what to look for, K=what's available, V=what to extract - Q = Tensor(np.matmul(query.data, self.w_q.data)) # (batch, seq, embed) @ (embed, embed) - K = Tensor(np.matmul(key.data, self.w_k.data)) - V = Tensor(np.matmul(value.data, self.w_v.data)) - - # Step 2: Reshape for multiple heads (split embedding dimension across heads) - # Multi-head design: each head sees different representation subspace - # embed_dim = num_heads * head_dim (must be evenly divisible) - - # Get actual sequence lengths (may differ for cross-attention) - query_seq_len = Q.shape[1] - key_seq_len = K.shape[1] - value_seq_len = V.shape[1] - - # Reshape: (batch, seq, embed) -> (batch, seq, num_heads, head_dim) - # This splits the embedding dimension across multiple attention heads - Q_reshaped = Q.data.reshape(batch_size, query_seq_len, self.num_heads, self.head_dim) - K_reshaped = K.data.reshape(batch_size, key_seq_len, self.num_heads, self.head_dim) - V_reshaped = V.data.reshape(batch_size, value_seq_len, self.num_heads, self.head_dim) - - # Transpose to 
(batch, num_heads, seq, head_dim) for easier parallel processing - # Now each head can be processed independently - Q_heads = np.transpose(Q_reshaped, (0, 2, 1, 3)) - K_heads = np.transpose(K_reshaped, (0, 2, 1, 3)) - V_heads = np.transpose(V_reshaped, (0, 2, 1, 3)) - - # Step 3: Apply attention to all heads simultaneously - # Flatten batch and head dimensions for efficient computation - # (batch, num_heads, seq, head_dim) -> (batch*num_heads, seq, head_dim) - batch_heads = batch_size * self.num_heads - Q_flat = Q_heads.reshape(batch_heads, query_seq_len, self.head_dim) - K_flat = K_heads.reshape(batch_heads, key_seq_len, self.head_dim) - V_flat = V_heads.reshape(batch_heads, value_seq_len, self.head_dim) - - # Apply scaled dot-product attention to all heads in parallel - # Need to handle mask broadcasting for flattened multi-head structure - if mask is not None: - # The mask shape is (seq_len, seq_len) but we need it for each (batch*heads) computation - # Each head in each batch item should use the same mask - if isinstance(mask, Tensor): - mask_data = mask.data - else: - mask_data = mask - - # Expand mask to match the flattened batch-head structure - # From (seq_len, seq_len) to (batch_size * num_heads, seq_len, seq_len) - mask_expanded = np.broadcast_to(mask_data, (batch_heads, query_seq_len, key_seq_len)) - mask_tensor = Tensor(mask_expanded) - else: - mask_tensor = None - - if return_attention_weights: - attn_output_flat, attn_weights_flat = self.scaled_attention.forward( - Tensor(Q_flat), Tensor(K_flat), Tensor(V_flat), - mask=mask_tensor, return_attention_weights=True - ) - else: - attn_output_flat = self.scaled_attention.forward( - Tensor(Q_flat), Tensor(K_flat), Tensor(V_flat), mask=mask_tensor - ) - - # Step 4: Reshape back to separate heads and concatenate - # (batch*num_heads, seq, head_dim) -> (batch, num_heads, seq, head_dim) - attn_output_heads = attn_output_flat.data.reshape(batch_size, self.num_heads, query_seq_len, self.head_dim) - - # Transpose 
back to (batch, seq, num_heads, head_dim) for concatenation - attn_output_reshaped = np.transpose(attn_output_heads, (0, 2, 1, 3)) - - # Concatenate heads: (batch, seq, num_heads, head_dim) -> (batch, seq, embed_dim) - # This combines all head outputs back into the original embedding dimension - attn_output_concat = attn_output_reshaped.reshape(batch_size, query_seq_len, embed_dim) - - # Step 5: Apply output projection to learn how to combine head information - # Final linear transformation to produce multi-head attention output - output = np.matmul(attn_output_concat, self.w_o.data) - - if return_attention_weights: - # Reshape attention weights back to per-head format - # Attention weights shape: (batch*num_heads, query_seq_len, key_seq_len) -> (batch_size, num_heads, query_seq_len, key_seq_len) - attn_weights_heads = attn_weights_flat.data.reshape(batch_size, self.num_heads, query_seq_len, key_seq_len) - - # CRITICAL FIX: Ensure causal masking is properly applied to reshaped weights - # This is a fallback to guarantee correct causal masking - if mask is not None: - # Get original mask data - if isinstance(mask, Tensor): - original_mask = mask.data - else: - original_mask = mask - - # If mask is 2D, apply it to all heads - if len(original_mask.shape) == 2: - # Convert mask to numpy array if it's a Tensor - if hasattr(original_mask, 'data'): - mask_data = original_mask.data - else: - mask_data = original_mask - - for b in range(batch_size): - for h in range(self.num_heads): - # Set masked positions to 0 (they should already be near 0 from softmax) - attn_weights_heads[b, h] = attn_weights_heads[b, h] * mask_data - - return Tensor(output), Tensor(attn_weights_heads) - else: - return Tensor(output) - ### END SOLUTION - - def __call__(self, query: Tensor, key: Tensor, value: Tensor, - mask: Optional[Tensor] = None, - return_attention_weights: bool = False) -> Union[Tensor, Tuple[Tensor, Tensor]]: - """Make the class callable.""" - return self.forward(query, key, 
value, mask, return_attention_weights) - - def get_memory_usage(self) -> Dict[str, float]: - """ - Calculate memory usage of multi-head attention parameters. - - This function is PROVIDED to show memory analysis. - """ - # Parameter memory - param_memory_mb = sum(param.data.nbytes for param in self.parameters) / (1024 * 1024) - - # Memory per head - memory_per_head_mb = param_memory_mb / self.num_heads - - return { - 'total_parameter_memory_mb': param_memory_mb, - 'memory_per_head_mb': memory_per_head_mb, - 'num_heads': self.num_heads, - 'head_dim': self.head_dim, - 'total_parameters': sum(param.data.size for param in self.parameters) - } - -# PASS IMPLEMENTATION CHECKPOINT: Ensure your MultiHeadAttention is complete before running - -# THINK PREDICTION: Multi-head vs single-head - which uses more memory and why? -# Your answer: _______ - -# MAGNIFY SYSTEMS INSIGHT #2: Multi-Head vs Single-Head Comparison -def compare_attention_architectures(): - """Compare single-head vs multi-head attention characteristics.""" - try: - print("MAGNIFY MULTI-HEAD vs SINGLE-HEAD ATTENTION COMPARISON") - print("=" * 60) - - embed_dim = 256 - seq_len = 128 - batch_size = 4 - - # Test configurations - configs = [ - ("Single Head", 1), - ("4 Heads", 4), - ("8 Heads", 8), - ("16 Heads", 16) - ] - - print(f"{'Configuration':<15} {'Parameters':<12} {'Memory (MB)':<12} {'Head Dim':<10} {'Complexity'}") - print("-" * 70) - - input_tensor = Tensor(np.random.randn(batch_size, seq_len, embed_dim)) - - for name, num_heads in configs: - if embed_dim % num_heads != 0: - continue - - # Create multi-head attention - mha = MultiHeadAttention(embed_dim=embed_dim, num_heads=num_heads) - - # Measure memory usage - memory_stats = mha.get_memory_usage() - head_dim = embed_dim // num_heads - - # Estimate computational complexity (FLOPs for attention matrix) - attention_flops = batch_size * num_heads * seq_len * seq_len * head_dim - - print(f"{name:<15} {memory_stats['total_parameters']:<12,} " - 
f"{memory_stats['total_parameter_memory_mb']:<12.2f} " - f"{head_dim:<10} {attention_flops/1e6:.1f}M FLOPs") - - print(f"\n📊 ANALYSIS:") - print(f" Parameter Count: Constant across heads (embed_dim² * 4 matrices)") - print(f" Head Dimension: Decreases as num_heads increases (embed_dim/num_heads)") - print(f" Representation: More heads = richer, diverse attention patterns") - print(f" Computation: Linear scaling with number of heads") - - print(f"\nTIP WHY MULTI-HEAD WORKS:") - print(f" - Different heads learn different types of relationships") - print(f" - Some heads focus on syntax, others on semantics") - print(f" - Parallel computation across heads") - print(f" - Better representation learning without parameter increase") - - except Exception as e: - print(f"WARNING️ Make sure MultiHeadAttention is implemented correctly") - print(f"Error: {e}") - -# Run the comparison -compare_attention_architectures() - -# %% [markdown] -""" -### TEST Test Your Multi-Head Attention Implementation - -Once you implement the MultiHeadAttention methods above, run this cell to test it: -""" - -# %% nbgrader={"grade": true, "grade_id": "test-multi-head-attention-immediate", "locked": true, "points": 20, "schema_version": 3, "solution": false, "task": false} -def test_unit_multi_head_attention(): - """Unit test for multi-head attention.""" - print("🔬 Unit Test: Multi-Head Attention...") - - # Test basic configuration - embed_dim = 64 - num_heads = 8 - mha = MultiHeadAttention(embed_dim=embed_dim, num_heads=num_heads) - - # Verify initialization - assert mha.embed_dim == embed_dim, "Should store embedding dimension" - assert mha.num_heads == num_heads, "Should store number of heads" - assert mha.head_dim == embed_dim // num_heads, "Should calculate head dimension correctly" - - # Verify parameter tracking - assert len(mha.parameters) == 4, "Should have 4 parameter matrices (Q, K, V, O)" - for param in mha.parameters: - assert param.shape == (embed_dim, embed_dim), "All parameters 
should be square matrices" - - # Test forward pass - batch_size = 2 - seq_len = 6 - - query = Tensor(np.random.randn(batch_size, seq_len, embed_dim)) - key = Tensor(np.random.randn(batch_size, seq_len, embed_dim)) - value = Tensor(np.random.randn(batch_size, seq_len, embed_dim)) - - output = mha.forward(query, key, value) - expected_shape = (batch_size, seq_len, embed_dim) - assert output.shape == expected_shape, f"Expected shape {expected_shape}, got {output.shape}" - - # Test with attention weights return - output, attn_weights = mha.forward(query, key, value, return_attention_weights=True) - expected_attn_shape = (batch_size, num_heads, seq_len, seq_len) - assert attn_weights.shape == expected_attn_shape, f"Expected attention shape {expected_attn_shape}, got {attn_weights.shape}" - - # Test different head configurations - for test_heads in [1, 2, 4]: - if embed_dim % test_heads == 0: - test_mha = MultiHeadAttention(embed_dim=embed_dim, num_heads=test_heads) - test_output = test_mha.forward(query, key, value) - assert test_output.shape == expected_shape, f"Should work with {test_heads} heads" - - # Test invalid head configuration - try: - invalid_mha = MultiHeadAttention(embed_dim=65, num_heads=8) # 65 not divisible by 8 - assert False, "Should raise error for invalid head configuration" - except ValueError: - pass # Expected behavior - - # Test with causal mask - causal_mask = np.triu(np.ones((seq_len, seq_len)), k=1) - causal_mask = 1 - causal_mask # Flip: 1 for allowed, 0 for masked - - output_masked, attn_masked = mha.forward(query, key, value, - mask=Tensor(causal_mask), - return_attention_weights=True) - - # Verify masking works across all heads - for head in range(num_heads): - for i in range(seq_len): - for j in range(i+1, seq_len): - assert np.all(attn_masked.data[:, head, i, j] < 1e-5), \ - f"Head {head}: Future position ({i},{j}) should have near-zero attention" - - # Test callable interface - output_callable = mha(query, key, value) - assert 
output_callable.shape == expected_shape, "Callable interface should work" - - # Test memory usage calculation - memory_stats = mha.get_memory_usage() - assert 'total_parameter_memory_mb' in memory_stats, "Should provide memory statistics" - assert memory_stats['num_heads'] == num_heads, "Should report correct number of heads" - assert memory_stats['head_dim'] == embed_dim // num_heads, "Should report correct head dimension" - - # Test self-attention (Q=K=V) - self_attn_output = mha.forward(query, query, query) - assert self_attn_output.shape == expected_shape, "Self-attention should work" - - print("PASS Multi-head attention tests passed!") - print(f"PASS Handles {num_heads} heads with {mha.head_dim} dimensions each") - print(f"PASS Parameter memory: {memory_stats['total_parameter_memory_mb']:.2f}MB") - print(f"PASS Causal masking works across all heads") - print(f"PASS Self-attention capability verified") - -# Test function defined (called in main block) - -# %% [markdown] -""" -## KV-Cache for Efficient Inference - -For autoregressive generation (text generation), we can cache key and value computations to avoid recomputing them for each new token. Let's implement a simple KV-cache system: -""" - -# %% nbgrader={"grade": false, "grade_id": "kv-cache", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class KVCache: - """ - Key-Value cache for efficient autoregressive generation. - - During text generation, we generate one token at a time. Instead of - recomputing K and V for all previous tokens, we can cache them and - only compute K and V for the new token. - """ - - def __init__(self, max_batch_size: int, max_seq_length: int, - num_heads: int, head_dim: int): - """ - Initialize KV cache with pre-allocated memory. - - TODO: Implement KV cache initialization. - - STEP-BY-STEP IMPLEMENTATION: - 1. Store cache configuration parameters - 2. Pre-allocate memory for cached keys and values - 3. Initialize cache position tracking - 4. 
Set up cache state management - - PRE-ALLOCATION BENEFITS: - - Avoids memory allocation during generation - - Enables efficient memory reuse - - Predictable memory usage - - Args: - max_batch_size: Maximum batch size for generation - max_seq_length: Maximum sequence length to cache - num_heads: Number of attention heads - head_dim: Dimension per attention head - """ - ### BEGIN SOLUTION - self.max_batch_size = max_batch_size - self.max_seq_length = max_seq_length - self.num_heads = num_heads - self.head_dim = head_dim - - # Pre-allocate cache memory - # Shape: (max_batch_size, num_heads, max_seq_length, head_dim) - cache_shape = (max_batch_size, num_heads, max_seq_length, head_dim) - self.cached_keys = np.zeros(cache_shape, dtype=np.float32) - self.cached_values = np.zeros(cache_shape, dtype=np.float32) - - # Track current cache length for each sequence in batch - self.cache_lengths = np.zeros(max_batch_size, dtype=int) - - # Track whether cache is active - self.is_active = False - ### END SOLUTION - - def update(self, batch_idx: int, new_keys: Tensor, new_values: Tensor) -> Tuple[Tensor, Tensor]: - """ - Update cache with new keys and values, return full cached K,V. - - TODO: Implement cache update. - - STEP-BY-STEP IMPLEMENTATION: - 1. Get current cache position for this batch - 2. Add new keys and values to cache at current position - 3. Update cache length - 4. 
Return full cached keys and values up to current length - - GENERATION PATTERN: - - First call: cache is empty, add initial K,V - - Subsequent calls: add one new token's K,V - - Always return all cached K,V for attention computation - - Args: - batch_idx: Index of sequence in batch - new_keys: New keys to add with shape (num_heads, new_seq_len, head_dim) - new_values: New values to add with shape (num_heads, new_seq_len, head_dim) - - Returns: - Full cached keys and values with shape (num_heads, total_cached_len, head_dim) - """ - ### BEGIN SOLUTION - # Get current cache position for this batch sequence - current_pos = self.cache_lengths[batch_idx] - new_seq_len = new_keys.shape[1] # Assuming shape (num_heads, seq_len, head_dim) - - # Boundary check: prevent cache overflow - if current_pos + new_seq_len > self.max_seq_length: - raise ValueError(f"Cache overflow: {current_pos + new_seq_len} > {self.max_seq_length}") - - # Update cache with new keys and values at current position - # This is the core KV-cache optimization: append new K,V instead of recomputing all - end_pos = current_pos + new_seq_len - self.cached_keys[batch_idx, :, current_pos:end_pos, :] = new_keys.data - self.cached_values[batch_idx, :, current_pos:end_pos, :] = new_values.data - - # Update cache metadata - self.cache_lengths[batch_idx] = end_pos - self.is_active = True - - # Return full cached keys and values for attention computation - # This includes both previously cached and newly added K,V pairs - full_keys = self.cached_keys[batch_idx, :, :end_pos, :] - full_values = self.cached_values[batch_idx, :, :end_pos, :] - - return Tensor(full_keys), Tensor(full_values) - ### END SOLUTION - - def reset(self, batch_idx: Optional[int] = None): - """ - Reset cache for specific batch index or entire cache. - - This function is PROVIDED for cache management. 
- """ - if batch_idx is not None: - # Reset specific sequence - self.cache_lengths[batch_idx] = 0 - self.cached_keys[batch_idx] = 0 - self.cached_values[batch_idx] = 0 - else: - # Reset entire cache - self.cache_lengths.fill(0) - self.cached_keys.fill(0) - self.cached_values.fill(0) - self.is_active = False - - def get_memory_usage(self) -> Dict[str, float]: - """ - Calculate memory usage of KV cache. - - This function is PROVIDED to show memory analysis. - """ - # Cache memory in bytes - cache_memory_bytes = self.cached_keys.nbytes + self.cached_values.nbytes - cache_memory_mb = cache_memory_bytes / (1024 * 1024) - - # Memory per sequence - memory_per_sequence_mb = cache_memory_mb / self.max_batch_size - - return { - 'total_cache_memory_mb': cache_memory_mb, - 'memory_per_sequence_mb': memory_per_sequence_mb, - 'max_batch_size': self.max_batch_size, - 'max_seq_length': self.max_seq_length, - 'num_heads': self.num_heads, - 'head_dim': self.head_dim, - 'cache_utilization': np.mean(self.cache_lengths / self.max_seq_length) if self.is_active else 0.0 - } - -# PASS IMPLEMENTATION CHECKPOINT: Ensure your KVCache is complete before running - -# THINK PREDICTION: How much memory could KV-cache save during generation? -# For 1000 tokens: 10%? 50%? 90%? 
Your guess: _______ - -# MAGNIFY SYSTEMS INSIGHT #3: KV-Cache Generation Efficiency Analysis -def analyze_kv_cache_efficiency(): - """Analyze KV-cache memory and computation savings during generation.""" - try: - print("💾 KV-CACHE GENERATION EFFICIENCY ANALYSIS") - print("=" * 55) - - # Realistic language model configuration - embed_dim = 512 - num_heads = 8 - head_dim = embed_dim // num_heads - batch_size = 1 # Typical generation scenario - - sequence_lengths = [64, 128, 256, 512, 1024] - - print(f"{'Seq Length':<10} {'No Cache':<12} {'With Cache':<12} {'Savings':<10} {'Speedup Est'}") - print("-" * 65) - - for seq_len in sequence_lengths: - # Without cache: recompute K,V for all previous tokens every step - # Memory: Store attention scores for full sequence every generation step - no_cache_kv_memory = seq_len * embed_dim * 2 * 4 / (1024**2) # K+V in MB - no_cache_attention = seq_len * seq_len * 4 / (1024**2) # Attention matrix - no_cache_total = no_cache_kv_memory + no_cache_attention - - # With cache: store K,V once, only compute new token attention - cache_storage = seq_len * embed_dim * 2 * 4 / (1024**2) # Persistent K+V cache - cache_attention = seq_len * 1 * 4 / (1024**2) # Only new token vs all cached - cache_total = cache_storage + cache_attention - - # Calculate savings - memory_savings = (no_cache_total - cache_total) / no_cache_total * 100 - computation_speedup = seq_len # Rough estimate: avoid seq_len token recomputations - - print(f"{seq_len:<10} {no_cache_total:<12.2f} {cache_total:<12.2f} " - f"{memory_savings:<10.1f}% {computation_speedup:<10.1f}x") - - # Demonstrate cache usage pattern - print(f"\n🔄 GENERATION PATTERN DEMONSTRATION:") - cache = KVCache(max_batch_size=1, max_seq_length=512, - num_heads=num_heads, head_dim=head_dim) - - print(f"Generation simulation (first 5 tokens):") - batch_idx = 0 - - for step in range(5): - if step == 0: - # Initial prompt processing - new_seq_len = 10 # Process initial 10 tokens - print(f" Step {step}: Process 
initial prompt ({new_seq_len} tokens)") - else: - # Generate one new token - new_seq_len = 1 - print(f" Step {step}: Generate new token ({new_seq_len} token)") - - # Simulate K,V for new tokens - new_keys = Tensor(np.random.randn(num_heads, new_seq_len, head_dim)) - new_values = Tensor(np.random.randn(num_heads, new_seq_len, head_dim)) - - # Update cache - cached_k, cached_v = cache.update(batch_idx, new_keys, new_values) - total_cached = cached_k.shape[1] - - print(f" Cache now contains: {total_cached} tokens") - print(f" Memory used: {total_cached * embed_dim * 2 * 4 / 1024:.1f} KB") - - print(f"\nTIP WHY KV-CACHE IS ESSENTIAL:") - print(f" - Without cache: O(N²) computation growth per token") - print(f" - With cache: O(N) computation per token") - print(f" - Memory trade-off: Store K,V to avoid recomputation") - print(f" - Critical for: Interactive chat, real-time generation") - print(f" - Production impact: 10-100x speedup for long sequences") - - except Exception as e: - print(f"WARNING️ Make sure KVCache is implemented correctly") - print(f"Error: {e}") - -# Run the efficiency analysis -analyze_kv_cache_efficiency() - -# %% [markdown] -""" -### TEST Test Your KV-Cache Implementation - -Once you implement the KVCache methods above, run this cell to test it: -""" - -# %% nbgrader={"grade": true, "grade_id": "test-kv-cache-immediate", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false} -def test_unit_kv_cache(): - """Unit test for KV cache.""" - print("🔬 Unit Test: KV-Cache...") - - # Create KV cache - max_batch_size = 4 - max_seq_length = 16 - num_heads = 8 - head_dim = 64 - - kv_cache = KVCache(max_batch_size=max_batch_size, max_seq_length=max_seq_length, - num_heads=num_heads, head_dim=head_dim) - - # Test initialization - assert kv_cache.max_batch_size == max_batch_size, "Should store max batch size" - assert kv_cache.max_seq_length == max_seq_length, "Should store max sequence length" - assert kv_cache.cached_keys.shape == 
(max_batch_size, num_heads, max_seq_length, head_dim), "Should pre-allocate key cache" - assert kv_cache.cached_values.shape == (max_batch_size, num_heads, max_seq_length, head_dim), "Should pre-allocate value cache" - assert not kv_cache.is_active, "Should start inactive" - - # Test first update (initial sequence) - batch_idx = 0 - initial_seq_len = 5 - initial_keys = Tensor(np.random.randn(num_heads, initial_seq_len, head_dim)) - initial_values = Tensor(np.random.randn(num_heads, initial_seq_len, head_dim)) - - cached_keys, cached_values = kv_cache.update(batch_idx, initial_keys, initial_values) - - # Verify cache update - assert cached_keys.shape == (num_heads, initial_seq_len, head_dim), f"Expected cached keys shape (num_heads, {initial_seq_len}, head_dim)" - assert cached_values.shape == (num_heads, initial_seq_len, head_dim), f"Expected cached values shape (num_heads, {initial_seq_len}, head_dim)" - assert kv_cache.cache_lengths[batch_idx] == initial_seq_len, f"Should update cache length to {initial_seq_len}" - assert kv_cache.is_active, "Should be active after first update" - - # Verify cached data matches input - assert np.allclose(cached_keys.data, initial_keys.data), "Cached keys should match input" - assert np.allclose(cached_values.data, initial_values.data), "Cached values should match input" - - # Test incremental update (add one token) - new_token_keys = Tensor(np.random.randn(num_heads, 1, head_dim)) - new_token_values = Tensor(np.random.randn(num_heads, 1, head_dim)) - - cached_keys_updated, cached_values_updated = kv_cache.update(batch_idx, new_token_keys, new_token_values) - - # Verify incremental update - expected_new_length = initial_seq_len + 1 - assert cached_keys_updated.shape == (num_heads, expected_new_length, head_dim), "Should include new token in cached keys" - assert cached_values_updated.shape == (num_heads, expected_new_length, head_dim), "Should include new token in cached values" - assert kv_cache.cache_lengths[batch_idx] == 
expected_new_length, f"Should update cache length to {expected_new_length}" - - # Verify old data is preserved and new data is appended - assert np.allclose(cached_keys_updated.data[:, :initial_seq_len, :], initial_keys.data), "Should preserve old cached keys" - assert np.allclose(cached_keys_updated.data[:, initial_seq_len:, :], new_token_keys.data), "Should append new keys" - - # Test multiple sequences in batch - batch_idx_2 = 1 - seq2_keys = Tensor(np.random.randn(num_heads, 3, head_dim)) - seq2_values = Tensor(np.random.randn(num_heads, 3, head_dim)) - - cached_keys_seq2, cached_values_seq2 = kv_cache.update(batch_idx_2, seq2_keys, seq2_values) - - # Verify independent cache management - assert cached_keys_seq2.shape == (num_heads, 3, head_dim), "Second sequence should have correct shape" - assert kv_cache.cache_lengths[batch_idx_2] == 3, "Second sequence should have correct length" - assert kv_cache.cache_lengths[batch_idx] == expected_new_length, "First sequence length should be unchanged" - - # Test cache overflow protection - try: - # Try to add more tokens than max_seq_length allows - overflow_keys = Tensor(np.random.randn(num_heads, max_seq_length, head_dim)) - overflow_values = Tensor(np.random.randn(num_heads, max_seq_length, head_dim)) - kv_cache.update(batch_idx, overflow_keys, overflow_values) - assert False, "Should raise error for cache overflow" - except ValueError: - pass # Expected behavior - - # Test cache reset - kv_cache.reset(batch_idx) - assert kv_cache.cache_lengths[batch_idx] == 0, "Should reset cache length to 0" - assert kv_cache.cache_lengths[batch_idx_2] == 3, "Should not affect other sequences" - - # Test full cache reset - kv_cache.reset() - assert np.all(kv_cache.cache_lengths == 0), "Should reset all cache lengths" - assert not kv_cache.is_active, "Should be inactive after full reset" - - # Test memory usage calculation - memory_stats = kv_cache.get_memory_usage() - assert 'total_cache_memory_mb' in memory_stats, "Should provide 
memory statistics" - assert memory_stats['max_batch_size'] == max_batch_size, "Should report correct batch size" - assert memory_stats['max_seq_length'] == max_seq_length, "Should report correct sequence length" - - print("PASS KV-Cache tests passed!") - print(f"PASS Handles {max_batch_size} sequences of up to {max_seq_length} tokens") - print(f"PASS Memory usage: {memory_stats['total_cache_memory_mb']:.2f}MB total") - print(f"PASS Cache overflow protection works") - print(f"PASS Independent batch sequence management") - -# Test function defined (called in main block) - -# %% [markdown] -""" -## TARGET ML Systems: Performance Analysis & Attention Scaling - -Now let's develop systems engineering skills by analyzing attention performance and understanding how attention's quadratic scaling affects practical sequence processing deployment. - -### **Learning Outcome**: *"I understand how attention's O(N²) complexity determines the practical limits of sequence length and deployment strategies"* -""" - -# %% nbgrader={"grade": false, "grade_id": "attention-profiler", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -import time - -class AttentionProfiler: - """ - Performance profiling toolkit for attention mechanisms. - - Helps ML engineers understand computational costs, memory scaling, - and bottlenecks in attention-based architectures. - """ - - def __init__(self): - self.results = {} - - def measure_attention_scaling(self, attention_layer, seq_lengths: List[int], - embed_dim: int = 256, batch_size: int = 1) -> Dict: - """ - Measure how attention performance scales with sequence length. - - TODO: Implement attention scaling measurement. - - STEP-BY-STEP IMPLEMENTATION: - 1. Create test inputs for each sequence length - 2. Measure computation time for attention forward pass - 3. Calculate memory usage for attention matrices - 4. Analyze scaling patterns (should be O(N²)) - 5. 
Return comprehensive scaling analysis - - METRICS TO CALCULATE: - - Computation time vs sequence length - - Memory usage vs sequence length - - Attention matrix size scaling - - Throughput degradation patterns - - Args: - attention_layer: Attention layer to test (ScaledDotProductAttention or MultiHeadAttention) - seq_lengths: List of sequence lengths to test - embed_dim: Embedding dimension for test inputs - batch_size: Batch size for testing - - Returns: - Dictionary with scaling analysis results - """ - ### BEGIN SOLUTION - scaling_results = {} - - for seq_len in seq_lengths: - # Create test inputs - query = Tensor(np.random.randn(batch_size, seq_len, embed_dim)) - key = Tensor(np.random.randn(batch_size, seq_len, embed_dim)) - value = Tensor(np.random.randn(batch_size, seq_len, embed_dim)) - - # Measure computation time - start_time = time.time() - if hasattr(attention_layer, 'forward'): - output = attention_layer.forward(query, key, value) - else: - output = attention_layer(query, key, value) - end_time = time.time() - - computation_time_ms = (end_time - start_time) * 1000 - - # Calculate memory usage - input_memory_mb = (query.data.nbytes + key.data.nbytes + value.data.nbytes) / (1024 * 1024) - output_memory_mb = output.data.nbytes / (1024 * 1024) - - # Attention matrix memory (batch_size * seq_len * seq_len) - attention_matrix_memory_mb = (batch_size * seq_len * seq_len * FLOAT32_BYTES) / (1024 * 1024) - - # Calculate throughput - total_operations = batch_size * seq_len * seq_len * embed_dim # Rough estimate - operations_per_second = total_operations / (end_time - start_time) if end_time > start_time else 0 - - scaling_results[seq_len] = { - 'seq_length': seq_len, - 'computation_time_ms': computation_time_ms, - 'input_memory_mb': input_memory_mb, - 'output_memory_mb': output_memory_mb, - 'attention_matrix_memory_mb': attention_matrix_memory_mb, - 'total_memory_mb': input_memory_mb + output_memory_mb + attention_matrix_memory_mb, - 'operations_per_second': 
operations_per_second, - 'time_per_token_us': computation_time_ms * 1000 / (batch_size * seq_len) if seq_len > 0 else 0 - } - - return scaling_results - ### END SOLUTION - - def analyze_quadratic_scaling(self, scaling_results: Dict) -> Dict: - """ - Analyze quadratic scaling patterns in attention results. - - This function is PROVIDED to show scaling pattern analysis. - """ - print("PROGRESS ATTENTION QUADRATIC SCALING ANALYSIS") - print("=" * 60) - - seq_lengths = sorted(scaling_results.keys()) - - if len(seq_lengths) < 2: - print("Need at least 2 sequence lengths for scaling analysis") - return {} - - print(f"{'Seq Length':<10} {'Time (ms)':<12} {'Memory (MB)':<12} {'Attn Matrix':<12} {'Time/Token':<12}") - print("-" * 70) - - for seq_len in seq_lengths: - result = scaling_results[seq_len] - print(f"{seq_len:<10} {result['computation_time_ms']:<12.2f} " - f"{result['total_memory_mb']:<12.2f} {result['attention_matrix_memory_mb']:<12.2f} " - f"{result['time_per_token_us']:<12.2f}") - - # Analyze scaling ratios - base_seq = seq_lengths[0] - base_result = scaling_results[base_seq] - - scaling_analysis = {'base_sequence_length': base_seq} - - print(f"\n📊 SCALING ANALYSIS (relative to {base_seq} tokens):") - print(f"{'Length Ratio':<12} {'Time Ratio':<12} {'Memory Ratio':<12} {'Theory (N²)':<12}") - print("-" * 50) - - for seq_len in seq_lengths[1:]: - result = scaling_results[seq_len] - - length_ratio = seq_len / base_seq - time_ratio = result['computation_time_ms'] / base_result['computation_time_ms'] - memory_ratio = result['attention_matrix_memory_mb'] / base_result['attention_matrix_memory_mb'] - theoretical_ratio = length_ratio ** 2 - - scaling_analysis[seq_len] = { - 'length_ratio': length_ratio, - 'time_ratio': time_ratio, - 'memory_ratio': memory_ratio, - 'theoretical_ratio': theoretical_ratio, - 'time_efficiency': theoretical_ratio / time_ratio if time_ratio > 0 else 0 - } - - print(f"{length_ratio:<12.1f} {time_ratio:<12.1f} {memory_ratio:<12.1f} 
{theoretical_ratio:<12.1f}") - - # Analysis insights - print(f"\nTIP SCALING INSIGHTS:") - avg_memory_efficiency = np.mean([scaling_analysis[seq]['memory_ratio'] / scaling_analysis[seq]['theoretical_ratio'] - for seq in seq_lengths[1:] if seq in scaling_analysis]) - - print(f" - Memory scaling: ~{avg_memory_efficiency:.1f}x theoretical O(N²)") - print(f" - Attention matrix dominates memory usage") - print(f" - Time scaling may deviate from O(N²) due to hardware effects") - print(f" - Practical sequence limit determined by available GPU memory") - - return scaling_analysis - - def compare_attention_types(self, seq_length: int = 128, embed_dim: int = 256) -> Dict: - """ - Compare performance of different attention implementations. - - This function is PROVIDED to show attention type comparison. - """ - print(f"\nMAGNIFY ATTENTION TYPE COMPARISON") - print("=" * 50) - - batch_size = 8 - - # Create test inputs - query = Tensor(np.random.randn(batch_size, seq_length, embed_dim)) - key = Tensor(np.random.randn(batch_size, seq_length, embed_dim)) - value = Tensor(np.random.randn(batch_size, seq_length, embed_dim)) - - results = {} - - # Test scaled dot-product attention - scaled_attention = ScaledDotProductAttention() - start_time = time.time() - scaled_output = scaled_attention.forward(query, key, value) - scaled_time = (time.time() - start_time) * 1000 - - results['scaled_dot_product'] = { - 'computation_time_ms': scaled_time, - 'parameters': 0, # No learnable parameters - 'memory_mb': scaled_output.data.nbytes / (1024 * 1024), - 'description': 'Basic attention mechanism' - } - - # Test multi-head attention - num_heads = 8 - mha = MultiHeadAttention(embed_dim=embed_dim, num_heads=num_heads) - start_time = time.time() - mha_output = mha.forward(query, key, value) - mha_time = (time.time() - start_time) * 1000 - - mha_memory = mha.get_memory_usage() - - results['multi_head'] = { - 'computation_time_ms': mha_time, - 'parameters': mha_memory['total_parameters'], - 
'memory_mb': mha_output.data.nbytes / (1024 * 1024) + mha_memory['total_parameter_memory_mb'], - 'description': f'{num_heads}-head attention with projections' - } - - # Display comparison - print(f"Test configuration: {batch_size} batch * {seq_length} seq * {embed_dim} dim") - print(f"{'Type':<15} {'Time (ms)':<10} {'Parameters':<12} {'Memory (MB)':<12} {'Description'}") - print("-" * 70) - - for name, stats in results.items(): - print(f"{name:<15} {stats['computation_time_ms']:<10.2f} " - f"{stats['parameters']:<12,} {stats['memory_mb']:<12.2f} {stats['description']}") - - # Analysis - time_overhead = results['multi_head']['computation_time_ms'] / results['scaled_dot_product']['computation_time_ms'] - memory_overhead = results['multi_head']['memory_mb'] / results['scaled_dot_product']['memory_mb'] - - print(f"\n📊 OVERHEAD ANALYSIS:") - print(f" Multi-head vs Scaled: {time_overhead:.1f}x time, {memory_overhead:.1f}x memory") - print(f" Trade-off: Multi-head provides richer representations at cost of computation") - print(f" Parameters: Multi-head adds {results['multi_head']['parameters']:,} learnable parameters") - - return results - - def simulate_kv_cache_benefits(self, seq_lengths: List[int], embed_dim: int = 256, - num_heads: int = 8) -> Dict: - """ - Simulate memory and computation benefits of KV-cache during generation. - - This function is PROVIDED to show KV-cache analysis. 
- """ - print(f"\n💾 KV-CACHE BENEFITS ANALYSIS") - print("=" * 50) - - head_dim = embed_dim // num_heads - batch_size = 1 # Typical generation batch size - - results = {} - - print(f"{'Seq Length':<10} {'No Cache (MB)':<14} {'With Cache (MB)':<16} {'Savings':<10} {'Speedup'}") - print("-" * 65) - - for seq_len in seq_lengths: - # Without cache: recompute K,V for all tokens every generation step - # Memory: attention matrices for all positions - no_cache_attention_memory = batch_size * seq_len * seq_len * FLOAT32_BYTES / (1024 * 1024) # bytes -> MB - no_cache_kv_memory = batch_size * seq_len * embed_dim * 2 * FLOAT32_BYTES / (1024 * 1024) # K + V - no_cache_total = no_cache_attention_memory + no_cache_kv_memory - - # With cache: store K,V, only compute attention for new token - cache_storage = batch_size * seq_len * embed_dim * 2 * FLOAT32_BYTES / (1024 * 1024) # K + V storage - cache_attention_memory = batch_size * 1 * seq_len * FLOAT32_BYTES / (1024 * 1024) # Only new token attention - cache_total = cache_storage + cache_attention_memory - - # Compute benefits - memory_savings = (no_cache_total - cache_total) / no_cache_total * 100 - speedup_estimate = seq_len # Rough estimate: avoid recomputing seq_len tokens - - results[seq_len] = { - 'no_cache_memory_mb': no_cache_total, - 'cache_memory_mb': cache_total, - 'memory_savings_percent': memory_savings, - 'estimated_speedup': speedup_estimate - } - - print(f"{seq_len:<10} {no_cache_total:<14.2f} {cache_total:<16.2f} " - f"{memory_savings:<10.1f}% {speedup_estimate:<10.1f}x") - - print(f"\nTIP KV-CACHE INSIGHTS:") - print(f" - Memory: Significant savings for long sequences") - print(f" - Speed: Avoid recomputing K,V for all previous tokens") - print(f" - Trade-off: Cache storage vs recomputation") - print(f" - Essential for: Real-time text generation and interactive systems") - - return results - -def analyze_attention_system_design(): - """ - Comprehensive analysis of attention system design choices and scaling 
implications. - - This function is PROVIDED to show systems-level design thinking. - """ - print("🏗️ ATTENTION SYSTEM DESIGN ANALYSIS") - print("=" * 60) - - # Model configurations with different attention strategies - model_configs = [ - { - 'name': 'Small Model', - 'seq_length': 512, - 'embed_dim': 256, - 'num_heads': 8, - 'num_layers': 6 - }, - { - 'name': 'Medium Model', - 'seq_length': 1024, - 'embed_dim': 512, - 'num_heads': 16, - 'num_layers': 12 - }, - { - 'name': 'Large Model', - 'seq_length': 2048, - 'embed_dim': 1024, - 'num_heads': 32, - 'num_layers': 24 - } - ] - - print(f"📋 ATTENTION MEMORY SCALING ANALYSIS:") - print(f"{'Model':<12} {'Seq Len':<8} {'Heads':<6} {'Layers':<7} {'Attn Memory':<12} {'Total Attn':<12}") - print("-" * 75) - - for config in model_configs: - # Calculate attention memory per layer - batch_size = 1 - seq_len = config['seq_length'] - attention_matrix_memory_mb = (batch_size * seq_len * seq_len * FLOAT32_BYTES) / (1024 * 1024) - - # Total attention memory across all layers - total_attention_memory_mb = attention_matrix_memory_mb * config['num_layers'] - - print(f"{config['name']:<12} {seq_len:<8} {config['num_heads']:<6} " - f"{config['num_layers']:<7} {attention_matrix_memory_mb:<12.1f} {total_attention_memory_mb:<12.1f}") - - print(f"\nTARGET KEY DESIGN IMPLICATIONS:") - print(f" 1. Sequence Length Scaling:") - print(f" - Memory scales O(N²) with sequence length") - print(f" - 2x sequence length = 4x attention memory") - print(f" - Practical limit: GPU memory capacity") - - print(f" 2. Multi-Head Benefits:") - print(f" - Multiple attention patterns in parallel") - print(f" - Linear scaling with number of heads") - print(f" - Trade-off: representation richness vs computation") - - print(f" 3. Layer Depth Impact:") - print(f" - Attention memory scales linearly with layers") - print(f" - Deep models need efficient attention implementations") - print(f" - Memory checkpointing may be necessary") - - print(f" 4. 
Production Constraints:") - print(f" - GPU memory limits maximum sequence length") - print(f" - Attention is the memory bottleneck in sequence models") - print(f" - KV-cache essential for generation workloads") - - print(f"\n🏭 OPTIMIZATION STRATEGIES:") - print(f" - Flash Attention: Memory-efficient attention computation") - print(f" - Sparse Attention: Reduce O(N²) to O(N√N) or O(N log N)") - print(f" - Linear Attention: Approximate attention with linear complexity") - print(f" - Sliding Window: Local attention with fixed window size") - print(f" - KV-Cache: Essential for autoregressive generation") - -# %% [markdown] -""" -### TEST Test: Attention Performance Analysis - -Let's test our attention profiler with realistic performance scenarios. -""" - -# %% nbgrader={"grade": false, "grade_id": "test-attention-profiler", "locked": false, "schema_version": 3, "solution": false, "task": false} -def test_attention_profiler(): - """Test attention profiler with various scenarios.""" - print("🔬 Unit Test: Attention Performance Profiler...") - - profiler = AttentionProfiler() - - # Test scaling measurement with scaled attention - scaled_attention = ScaledDotProductAttention() - seq_lengths = [32, 64, 128] - embed_dim = 128 - - scaling_results = profiler.measure_attention_scaling(scaled_attention, seq_lengths, embed_dim) - - # Verify results structure - assert len(scaling_results) == len(seq_lengths), f"Should test {len(seq_lengths)} sequence lengths" - - for seq_len in seq_lengths: - assert seq_len in scaling_results, f"Should include results for sequence length {seq_len}" - result = scaling_results[seq_len] - - # Verify required metrics - required_keys = ['seq_length', 'computation_time_ms', 'input_memory_mb', - 'output_memory_mb', 'attention_matrix_memory_mb', 'total_memory_mb'] - for key in required_keys: - assert key in result, f"Missing metric: {key} for seq_len {seq_len}" - assert isinstance(result[key], (int, float)), f"Invalid type for {key}" - - # Verify 
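The O(N²) claim that the profiler measures empirically can also be checked analytically. This standalone sketch (the helper name `attn_matrix_mb` is illustrative; it uses the same float32 arithmetic as the profiler above) shows the attention-score matrix quadrupling whenever the sequence length doubles:

```python
# Illustrative helper: float32 attention-score matrix memory per layer.
def attn_matrix_mb(batch: int, seq_len: int, bytes_per_el: int = 4) -> float:
    return batch * seq_len * seq_len * bytes_per_el / (1024 ** 2)

for n in (512, 1024, 2048, 4096):
    print(f"seq_len={n:>5}: {attn_matrix_mb(1, n):8.1f} MB")
# 512 -> 1.0 MB, 1024 -> 4.0 MB, 2048 -> 16.0 MB, 4096 -> 64.0 MB
```

Multiply by the number of layers (and, in practice, heads that keep separate score matrices) to see why long-context models hit GPU memory limits so quickly.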
reasonable values - assert result['seq_length'] == seq_len, "Should store correct sequence length" - assert result['computation_time_ms'] >= 0, "Time should be non-negative" - assert result['total_memory_mb'] > 0, "Memory usage should be positive" - - print("PASS Scaling measurement test passed") - - # Test quadratic scaling analysis - scaling_analysis = profiler.analyze_quadratic_scaling(scaling_results) - - # Verify scaling analysis - assert 'base_sequence_length' in scaling_analysis, "Should include base sequence length" - - # Check that longer sequences show increased ratios - for seq_len in seq_lengths[1:]: - if seq_len in scaling_analysis: - analysis = scaling_analysis[seq_len] - assert analysis['length_ratio'] > 1, f"Length ratio should be > 1 for {seq_len}" - assert analysis['theoretical_ratio'] > 1, f"Theoretical ratio should be > 1 for {seq_len}" - - print("PASS Quadratic scaling analysis test passed") - - # Test attention type comparison - comparison_results = profiler.compare_attention_types(seq_length=64, embed_dim=128) - - # Verify comparison results - assert 'scaled_dot_product' in comparison_results, "Should test scaled dot-product attention" - assert 'multi_head' in comparison_results, "Should test multi-head attention" - - for attn_type, metrics in comparison_results.items(): - assert 'computation_time_ms' in metrics, "Should measure computation time" - assert 'parameters' in metrics, "Should count parameters" - assert 'memory_mb' in metrics, "Should measure memory usage" - assert metrics['computation_time_ms'] > 0, "Should have positive computation time" - - print("PASS Attention type comparison test passed") - - # Test KV-cache benefits simulation - cache_results = profiler.simulate_kv_cache_benefits([64, 128], embed_dim=128) - - # Verify cache simulation results - for seq_len, result in cache_results.items(): - assert 'no_cache_memory_mb' in result, "Should calculate no-cache memory" - assert 'cache_memory_mb' in result, "Should calculate cache 
memory" - assert 'memory_savings_percent' in result, "Should calculate savings" - assert result['memory_savings_percent'] > 0, "Should show memory savings" - - print("PASS KV-cache benefits simulation test passed") - print("TARGET Attention Profiler: All tests passed!") - -# Test function defined (called in main block) - -# %% [markdown] -""" -## Integration Testing: Complete Attention Pipeline - -Let's test how all our attention components work together in a realistic sequence processing pipeline: -""" - -# %% nbgrader={"grade": false, "grade_id": "test-attention-integration", "locked": false, "schema_version": 3, "solution": false, "task": false} -def test_attention_integration(): - """Test complete attention pipeline with embeddings integration.""" - print("TEST Integration Test: Complete Attention Pipeline...") - - # Configuration - vocab_size = 1000 - embed_dim = 256 - num_heads = 8 - seq_length = 32 - batch_size = 4 - - # Create embedding components (mock minimal versions if not available) - try: - from embeddings_dev import Embedding, PositionalEncoding - embedding = Embedding(vocab_size=vocab_size, embedding_dim=embed_dim) - pos_encoding = PositionalEncoding(embedding_dim=embed_dim, max_seq_length=seq_length*2) - embeddings_available = True - except: - # Create mock embeddings for testing - embedding = None - pos_encoding = None - embeddings_available = False - print(" Using mock embeddings for testing...") - - # Create attention components - scaled_attention = ScaledDotProductAttention() - multi_head_attention = MultiHeadAttention(embed_dim=embed_dim, num_heads=num_heads) - - # Create test data - if embeddings_available: - # Use real embedding pipeline - token_ids = np.random.randint(0, vocab_size, (batch_size, seq_length)) - embeddings = embedding.forward(token_ids) - pos_embeddings = pos_encoding.forward(embeddings) - input_representations = pos_embeddings - print(f" Using real embeddings: {input_representations.shape}") - else: - # Use mock input data - 
input_representations = Tensor(np.random.randn(batch_size, seq_length, embed_dim)) - print(f" Using mock input: {input_representations.shape}") - - # Test 1: Self-attention with scaled dot-product - print(" Testing scaled dot-product self-attention...") - self_attn_output = scaled_attention.forward( - input_representations, input_representations, input_representations - ) - - expected_shape = (batch_size, seq_length, embed_dim) - assert self_attn_output.shape == expected_shape, f"Expected {expected_shape}, got {self_attn_output.shape}" - print(f" Self-attention output: {self_attn_output.shape}") - - # Test 2: Multi-head self-attention - print(" Testing multi-head self-attention...") - mha_output, mha_weights = multi_head_attention.forward( - input_representations, input_representations, input_representations, - return_attention_weights=True - ) - - assert mha_output.shape == expected_shape, f"Expected {expected_shape}, got {mha_output.shape}" - expected_attn_shape = (batch_size, num_heads, seq_length, seq_length) - assert mha_weights.shape == expected_attn_shape, f"Expected attention {expected_attn_shape}, got {mha_weights.shape}" - print(f" Multi-head output: {mha_output.shape}") - print(f" Attention weights: {mha_weights.shape}") - - # Test 3: Causal (autoregressive) attention - print(" Testing causal attention masking...") - causal_mask = np.triu(np.ones((seq_length, seq_length)), k=1) - causal_mask = 1 - causal_mask # Convert to attention mask - - causal_output, causal_weights = multi_head_attention.forward( - input_representations, input_representations, input_representations, - mask=Tensor(causal_mask), return_attention_weights=True - ) - - # Verify causal masking works - for head in range(num_heads): - for i in range(seq_length): - for j in range(i+1, seq_length): - assert np.all(causal_weights.data[:, head, i, j] < 1e-5), \ - f"Position ({i},{j}) should be masked in head {head}" - - print(f" Causal attention works correctly across {num_heads} heads") - - # 
Test 4: Cross-attention (encoder-decoder style) - print(" Testing cross-attention...") - # Create different key/value inputs (simulating encoder-decoder) - encoder_seq_length = seq_length + 8 # Different length - encoder_representations = Tensor(np.random.randn(batch_size, encoder_seq_length, embed_dim)) - - cross_attn_output = multi_head_attention.forward( - input_representations, # Query from decoder - encoder_representations, # Key from encoder - encoder_representations # Value from encoder - ) - - # Output should have decoder sequence length, encoder information - expected_cross_shape = (batch_size, seq_length, embed_dim) - assert cross_attn_output.shape == expected_cross_shape, \ - f"Expected {expected_cross_shape}, got {cross_attn_output.shape}" - print(f" Cross-attention output: {cross_attn_output.shape}") - - # Test 5: KV-Cache integration - print(" Testing KV-cache integration...") - head_dim = embed_dim // num_heads - kv_cache = KVCache(max_batch_size=batch_size, max_seq_length=seq_length*2, - num_heads=num_heads, head_dim=head_dim) - - # Simulate autoregressive generation - for step in range(3): # Generate 3 tokens - if step == 0: - # First step: process initial sequence - step_input = input_representations - else: - # Subsequent steps: process one new token - new_token_repr = Tensor(np.random.randn(batch_size, 1, embed_dim)) - step_input = new_token_repr - - # In real implementation, we'd integrate KV-cache with attention - # For now, just test that cache operations work - batch_idx = 0 - step_keys = Tensor(np.random.randn(num_heads, step_input.shape[1], head_dim)) - step_values = Tensor(np.random.randn(num_heads, step_input.shape[1], head_dim)) - - cached_keys, cached_values = kv_cache.update(batch_idx, step_keys, step_values) - - expected_cache_length = sum(input_representations.shape[1] if i == 0 else 1 for i in range(step + 1)) - assert cached_keys.shape[1] == expected_cache_length, \ - f"Cache should have {expected_cache_length} tokens at step 
{step}" - - print(f" KV-cache successfully caches keys/values across generation steps") - - # Test 6: Memory usage analysis - print(" Analyzing memory usage...") - mha_memory = multi_head_attention.get_memory_usage() - cache_memory = kv_cache.get_memory_usage() - - total_memory_mb = mha_memory['total_parameter_memory_mb'] + cache_memory['total_cache_memory_mb'] - - print(f" Multi-head attention parameters: {mha_memory['total_parameter_memory_mb']:.2f}MB") - print(f" KV-cache storage: {cache_memory['total_cache_memory_mb']:.2f}MB") - print(f" Total attention system memory: {total_memory_mb:.2f}MB") - - # Test 7: Performance characteristics - print(" Testing performance characteristics...") - start_time = time.time() - - # Process multiple steps to measure throughput - for _ in range(10): - output = multi_head_attention.forward( - input_representations, input_representations, input_representations - ) - - total_time = time.time() - start_time - throughput = (batch_size * seq_length * 10) / total_time # tokens per second - - print(f" Attention throughput: {throughput:.0f} tokens/second") - - print("PASS Complete attention pipeline integration test passed!") - print(f"PASS Self-attention, cross-attention, and causal masking work correctly") - print(f"PASS KV-cache integration ready for autoregressive generation") - print(f"PASS Memory usage and performance characteristics measured") - -# Test function defined (called in main block) - -# %% -def test_module(): - """Run comprehensive attention module testing.""" - print("🧪 TESTING MODULE: Attention") - print("=" * 50) - - # Run all unit tests - test_unit_scaled_attention() - test_unit_multi_head_attention() - test_unit_kv_cache() - test_attention_profiler() - test_attention_integration() - - print("\n" + "="*50) - print("✅ ALL ATTENTION TESTS PASSED!") - print("📈 Attention mechanisms ready for sequence model integration!") - -# %% [markdown] -""" -## Main Execution Block - -All attention tests run when the module is 
executed directly: -""" - -# %% nbgrader={"grade": false, "grade_id": "attention-main", "locked": false, "schema_version": 3, "solution": false, "task": false} -if __name__ == "__main__": - test_module() - -# %% [markdown] -""" -## THINK ML Systems Thinking: Interactive Questions - -Now that you've built the attention mechanisms that enable sequence understanding, let's connect this work to broader ML systems challenges. These questions help you think critically about how attention's quadratic scaling affects production sequence model deployment. - -Take time to reflect thoughtfully on each question - your insights will help you understand how attention connects to real-world ML systems engineering. -""" - -# %% [markdown] -""" -### TARGET Computational Assessment: Attention Complexity Analysis - -**Learning Objective**: Analyze the computational and memory complexity of attention mechanisms to understand their practical limitations and optimization opportunities. - -**Task**: Based on your attention implementations, analyze the scaling behavior and optimization techniques for different attention scenarios. -""" - -# %% nbgrader={"grade": true, "grade_id": "attention-complexity-analysis", "locked": false, "points": 15, "schema_version": 3, "solution": true, "task": false} -def analyze_attention_complexity(): - """ - Analyze computational complexity of attention mechanisms. - - TODO: Complete this complexity analysis function. - - Requirements: - 1. Calculate memory usage for attention matrices with different sequence lengths - 2. Estimate computational FLOPs for attention computation - 3. Compare single-head vs multi-head complexity - 4. 
Analyze the impact of sequence length on performance - - Returns: - dict: Analysis results with complexity metrics - """ - ### BEGIN SOLUTION - results = {} - - # Test different sequence lengths - seq_lengths = [128, 256, 512, 1024] - embed_dim = 512 - num_heads = 8 - batch_size = 16 - - for seq_len in seq_lengths: - # Memory for attention matrix: batch_size * seq_len * seq_len * 4 bytes (float32) - attention_memory_bytes = batch_size * seq_len * seq_len * 4 - attention_memory_mb = attention_memory_bytes / (1024 * 1024) - - # Multi-head attention memory: num_heads * attention_memory - multihead_memory_mb = attention_memory_mb * num_heads - - # Computational FLOPs estimation - # QK^T: batch * heads * seq_len * seq_len * head_dim - # Softmax: batch * heads * seq_len * seq_len - # Attention*V: batch * heads * seq_len * seq_len * head_dim - head_dim = embed_dim // num_heads - qk_flops = batch_size * num_heads * seq_len * seq_len * head_dim - av_flops = batch_size * num_heads * seq_len * seq_len * head_dim - total_flops = qk_flops + av_flops - - results[seq_len] = { - 'sequence_length': seq_len, - 'attention_memory_mb': attention_memory_mb, - 'multihead_memory_mb': multihead_memory_mb, - 'total_flops': total_flops, - 'flops_per_token': total_flops / (batch_size * seq_len), - 'memory_scaling_factor': (seq_len / 128) ** 2, # Relative to 128 baseline - 'compute_scaling_factor': (seq_len / 128) ** 2 - } - - return results - ### END SOLUTION - -# Test the complexity analysis -if 'ScaledDotProductAttention' in globals(): - complexity_results = analyze_attention_complexity() - - print("📊 ATTENTION COMPLEXITY ANALYSIS RESULTS:") - print("=" * 60) - print(f"{'Seq Len':<8} {'Attn Mem (MB)':<12} {'MHA Mem (MB)':<12} {'FLOPs (M)':<10} {'Scale Factor'}") - print("-" * 60) - - for seq_len, metrics in complexity_results.items(): - print(f"{seq_len:<8} {metrics['attention_memory_mb']:<12.1f} " - f"{metrics['multihead_memory_mb']:<12.1f} " - f"{metrics['total_flops']/1e6:<10.1f} " - 
f"{metrics['memory_scaling_factor']:<10.1f}x") - - print(f"\nTIP COMPLEXITY INSIGHTS:") - print(f" - Memory scales O(N²) with sequence length") - print(f" - Computation scales O(N²) with sequence length") - print(f" - Multi-head attention multiplies memory by number of heads") - print(f" - 2x sequence length = 4x memory and computation") -else: - print("WARNING️ Complete attention implementations first") - -# %% [markdown] -""" -### Question 1: Attention Memory Scaling and Sequence Length Optimization - -**Context**: Your attention implementations demonstrate the fundamental O(N²) memory scaling that limits transformer sequence length. Production language models must balance sequence length capabilities with memory constraints, leading to complex architectural decisions about attention patterns, memory optimization, and deployment strategies. - -**Reflection Question**: Design an attention system for a production language model that needs to efficiently process documents up to 32k tokens while operating within 80GB GPU memory constraints. How would you implement attention optimization techniques like Flash Attention or sparse attention patterns, design memory-efficient attention computation that minimizes intermediate storage, and handle variable sequence lengths in production batches? Consider the challenges of maintaining attention quality while reducing memory footprint and optimizing for both training and inference workloads. - -Think about: attention optimization techniques, memory-efficient computation patterns, sparse attention strategies, and variable-length batch processing. - -*Target length: 150-300 words* -""" - -# %% nbgrader={"grade": true, "grade_id": "question-1-attention-memory", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} -""" -YOUR REFLECTION ON ATTENTION MEMORY SCALING AND OPTIMIZATION: - -TODO: Replace this text with your thoughtful response about attention memory optimization system design. 
- -Consider addressing: -- How would you implement attention optimization for 32k tokens within 80GB GPU memory? -- What techniques would you use to reduce attention's O(N²) memory scaling? -- How would you design memory-efficient attention computation with minimal intermediate storage? -- What approaches would you use for handling variable sequence lengths in production batches? -- How would you maintain attention quality while optimizing for memory constraints? - -Write a technical analysis connecting your attention implementations to real memory optimization challenges. - -GRADING RUBRIC (Instructor Use): -- Demonstrates understanding of attention memory scaling and optimization techniques (3 points) -- Designs practical approaches to memory-efficient attention computation (3 points) -- Addresses variable-length processing and production deployment constraints (2 points) -- Shows systems thinking about attention optimization trade-offs (2 points) -- Clear technical reasoning with memory optimization insights (bonus points for innovative approaches) -""" - -### BEGIN SOLUTION -# Student response area - instructor will replace this section during grading setup -# This is a manually graded question requiring technical analysis of attention memory optimization -# Students should demonstrate understanding of attention scaling challenges and optimization techniques -### END SOLUTION - -# %% [markdown] -""" -### TARGET Computational Assessment: Causal Masking and Generation Patterns - -**Learning Objective**: Understand how causal masking enables autoregressive generation and analyze different attention masking strategies. - -**Task**: Implement and analyze different attention masking patterns to understand their impact on model behavior and computational efficiency. 
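As a concrete reference for the patterns this task asks you to analyze, here is a minimal NumPy sketch of two masking strategies and their compute ratios (the `make_masks` helper and its `window` parameter are illustrative, not part of the module's API):

```python
import numpy as np

def make_masks(seq_len, window=4):
    """Build causal and local-window attention masks (1 = attend, 0 = masked)."""
    # Causal: position i may attend only to positions j <= i
    causal = np.tril(np.ones((seq_len, seq_len)))
    # Local window: each position attends to a small neighborhood around itself
    local = np.zeros((seq_len, seq_len))
    for i in range(seq_len):
        lo, hi = max(0, i - window // 2), min(seq_len, i + window // 2 + 1)
        local[i, lo:hi] = 1
    return causal, local

causal, local = make_masks(8)
# Compute ratio: fraction of the full N x N score matrix actually evaluated
print(causal.sum() / causal.size)  # → 0.5625 (roughly half of full attention)
print(local.sum() / local.size)    # → 0.53125 at this tiny N; shrinks as window/N
```

At `seq_len = 8` the two ratios look similar, but the causal ratio stays near 1/2 for any N while the local-window ratio falls off as window/N, which is why sparse patterns matter at long sequence lengths.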
-""" - -# %% nbgrader={"grade": true, "grade_id": "attention-masking-analysis", "locked": false, "points": 15, "schema_version": 3, "solution": true, "task": false} -def analyze_attention_masking_patterns(): - """ - Analyze different attention masking patterns and their computational implications. - - TODO: Complete this masking pattern analysis. - - Requirements: - 1. Create and test causal (autoregressive) masks - 2. Implement and test different sparse attention patterns - 3. Measure attention entropy with different masking strategies - 4. Compare computational efficiency of different mask types - - Returns: - dict: Analysis results comparing different masking strategies - """ - ### BEGIN SOLUTION - if 'ScaledDotProductAttention' not in globals(): - return {"error": "ScaledDotProductAttention not implemented"} - - attention = ScaledDotProductAttention() - seq_len = 16 - batch_size = 2 - d_k = 32 - - # Create test inputs - query = key = value = Tensor(np.random.randn(batch_size, seq_len, d_k)) - - results = {} - - # 1. No masking (full attention) - output_full, weights_full = attention.forward( - query, key, value, return_attention_weights=True - ) - entropy_full = -np.sum(weights_full.data * np.log(weights_full.data + 1e-10)) - - results['no_mask'] = { - 'attention_entropy': entropy_full, - 'effective_connections': np.sum(weights_full.data > 0.01), # Significant connections - 'max_attention': np.max(weights_full.data), - 'computation_ratio': 1.0 # Baseline - } - - # 2. 
Causal masking (autoregressive) - causal_mask = np.triu(np.ones((seq_len, seq_len)), k=1) - causal_mask = 1 - causal_mask # Convert to attention mask - - output_causal, weights_causal = attention.forward( - query, key, value, mask=Tensor(causal_mask), return_attention_weights=True - ) - entropy_causal = -np.sum(weights_causal.data * np.log(weights_causal.data + 1e-10)) - - results['causal_mask'] = { - 'attention_entropy': entropy_causal, - 'effective_connections': np.sum(weights_causal.data > 0.01), - 'max_attention': np.max(weights_causal.data), - 'computation_ratio': 0.5 # Roughly half the connections - } - - # 3. Local attention window (sparse) - window_size = 4 - local_mask = np.zeros((seq_len, seq_len)) - for i in range(seq_len): - start = max(0, i - window_size // 2) - end = min(seq_len, i + window_size // 2 + 1) - local_mask[i, start:end] = 1 - - output_local, weights_local = attention.forward( - query, key, value, mask=Tensor(local_mask), return_attention_weights=True - ) - entropy_local = -np.sum(weights_local.data * np.log(weights_local.data + 1e-10)) - - results['local_mask'] = { - 'attention_entropy': entropy_local, - 'effective_connections': np.sum(weights_local.data > 0.01), - 'max_attention': np.max(weights_local.data), - 'computation_ratio': window_size / seq_len # Fraction of full attention - } - - # 4. 
Strided attention pattern - stride = 2 - strided_mask = np.zeros((seq_len, seq_len)) - for i in range(seq_len): - # Attend to every stride-th position - strided_mask[i, ::stride] = 1 - # Also attend to local neighborhood - start = max(0, i - 1) - end = min(seq_len, i + 2) - strided_mask[i, start:end] = 1 - - output_strided, weights_strided = attention.forward( - query, key, value, mask=Tensor(strided_mask), return_attention_weights=True - ) - entropy_strided = -np.sum(weights_strided.data * np.log(weights_strided.data + 1e-10)) - - results['strided_mask'] = { - 'attention_entropy': entropy_strided, - 'effective_connections': np.sum(weights_strided.data > 0.01), - 'max_attention': np.max(weights_strided.data), - 'computation_ratio': (1 + seq_len // stride + 2) / seq_len - } - - return results - ### END SOLUTION - -# Test the masking analysis -if 'ScaledDotProductAttention' in globals(): - masking_results = analyze_attention_masking_patterns() - - if 'error' not in masking_results: - print("🎭 ATTENTION MASKING PATTERN ANALYSIS:") - print("=" * 50) - print(f"{'Pattern':<15} {'Entropy':<10} {'Connections':<12} {'Max Attn':<10} {'Compute %'}") - print("-" * 60) - - for pattern, metrics in masking_results.items(): - print(f"{pattern:<15} {metrics['attention_entropy']:<10.2f} " - f"{metrics['effective_connections']:<12} " - f"{metrics['max_attention']:<10.4f} " - f"{metrics['computation_ratio']*100:<10.1f}%") - - print(f"\nTIP MASKING INSIGHTS:") - print(f" - Causal masking: Essential for autoregressive generation") - print(f" - Local attention: Good for capturing local dependencies") - print(f" - Strided attention: Balances long-range and local connections") - print(f" - Sparse patterns: Reduce computation while maintaining performance") - else: - print(masking_results['error']) -else: - print("WARNING️ Complete attention implementations first") - -# %% [markdown] -""" -### Question 2: Multi-Head Attention Parallelization and Hardware Optimization - -**Context**: Your 
multi-head attention implementation shows how attention heads can process different representation subspaces in parallel. Production transformer systems must optimize multi-head attention for diverse hardware platforms (CPUs, GPUs, TPUs) while maximizing throughput and minimizing latency for both training and inference workloads. - -**Reflection Question**: Architect a multi-head attention system optimized for distributed training across 64 GPUs and efficient inference on various hardware platforms. How would you implement attention head parallelization that maximizes GPU utilization, design efficient attention kernel fusion to minimize memory bandwidth bottlenecks, and optimize for different inference scenarios (batch processing vs single-token generation)? Consider the challenges of maintaining numerical consistency across hardware platforms while achieving optimal performance for both training throughput and inference latency. - -Think about: multi-GPU attention parallelization, kernel fusion optimization, hardware-specific tuning, and inference optimization strategies. - -*Target length: 150-300 words* -""" - -# %% nbgrader={"grade": true, "grade_id": "question-2-attention-parallelization", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} -""" -YOUR REFLECTION ON MULTI-HEAD ATTENTION PARALLELIZATION: - -TODO: Replace this text with your thoughtful response about multi-head attention hardware optimization. - -Consider addressing: -- How would you implement attention head parallelization across 64 GPUs for training? -- What kernel fusion techniques would you use to minimize memory bandwidth bottlenecks? -- How would you optimize attention for different hardware platforms (CPU, GPU, TPU)? -- What strategies would you use to optimize for batch processing vs single-token generation? -- How would you maintain numerical consistency across diverse hardware configurations? 
- -Write an architectural analysis connecting your attention implementations to hardware optimization challenges. - -GRADING RUBRIC (Instructor Use): -- Shows understanding of multi-head attention parallelization and hardware optimization (3 points) -- Designs practical approaches to distributed training and kernel fusion (3 points) -- Addresses platform-specific optimization and inference scenarios (2 points) -- Demonstrates systems thinking about hardware-software co-optimization (2 points) -- Clear architectural reasoning with parallelization insights (bonus points for comprehensive system design) -""" - -### BEGIN SOLUTION -# Student response area - instructor will replace this section during grading setup -# This is a manually graded question requiring understanding of attention parallelization and hardware optimization -# Students should demonstrate knowledge of distributed training and platform-specific optimization -### END SOLUTION - -# %% [markdown] -""" -### TARGET Computational Assessment: Attention Scaling and Production Optimization - -**Learning Objective**: Analyze how attention scaling affects production deployment and design optimization strategies for different use cases. - -**Task**: Design and analyze attention optimization strategies for production systems with different constraints and requirements. -""" - -# %% nbgrader={"grade": true, "grade_id": "attention-production-optimization", "locked": false, "points": 20, "schema_version": 3, "solution": true, "task": false} -def design_production_attention_system(): - """ - Design an optimized attention system for production deployment. - - TODO: Complete this production optimization analysis. - - Requirements: - 1. Analyze memory requirements for different sequence lengths and batch sizes - 2. Design KV-cache strategies for different workload types - 3. Estimate throughput and latency for different configurations - 4. 
Propose optimization techniques for memory-constrained environments - - Returns: - dict: Production system design with optimization strategies - """ - ### BEGIN SOLUTION - # Production system analysis - design = { - 'workload_analysis': {}, - 'memory_optimization': {}, - 'kv_cache_strategies': {}, - 'performance_estimates': {} - } - - # Workload scenarios - workloads = { - 'real_time_chat': { - 'max_seq_length': 2048, - 'typical_batch_size': 1, - 'latency_requirement_ms': 100, - 'throughput_requirement': '10 requests/sec' - }, - 'batch_processing': { - 'max_seq_length': 4096, - 'typical_batch_size': 32, - 'latency_requirement_ms': 5000, - 'throughput_requirement': '1000 docs/hour' - }, - 'code_generation': { - 'max_seq_length': 8192, - 'typical_batch_size': 4, - 'latency_requirement_ms': 500, - 'throughput_requirement': '100 completions/min' - } - } - - embed_dim = 4096 # Large model configuration - num_heads = 32 - head_dim = embed_dim // num_heads - - for workload_name, config in workloads.items(): - seq_len = config['max_seq_length'] - batch_size = config['typical_batch_size'] - - # Memory analysis - attention_memory_gb = (batch_size * num_heads * seq_len * seq_len * 4) / (1024**3) - kv_cache_memory_gb = (batch_size * seq_len * embed_dim * 2 * 4) / (1024**3) - total_memory_gb = attention_memory_gb + kv_cache_memory_gb - - # Performance estimates - tokens_per_request = seq_len * batch_size - attention_flops = batch_size * num_heads * seq_len * seq_len * head_dim * 2 - - design['workload_analysis'][workload_name] = { - 'attention_memory_gb': attention_memory_gb, - 'kv_cache_memory_gb': kv_cache_memory_gb, - 'total_memory_gb': total_memory_gb, - 'attention_flops': attention_flops, - 'tokens_per_request': tokens_per_request, - 'memory_bandwidth_gb_s': total_memory_gb * 1000 / config['latency_requirement_ms'] - } - - # Memory optimization strategies - design['memory_optimization'] = { - 'flash_attention': { - 'memory_reduction': '10-20x for attention computation', - 
'technique': 'Tiled computation to reduce intermediate storage', - 'trade_off': 'Slight computation increase for massive memory savings' - }, - 'sparse_attention': { - 'memory_reduction': 'O(N√N) or O(N log N) instead of O(N²)', - 'technique': 'Local + strided + global attention patterns', - 'trade_off': 'Potential quality loss vs memory/compute savings' - }, - 'gradient_checkpointing': { - 'memory_reduction': '~50% activation memory', - 'technique': 'Recompute activations instead of storing', - 'trade_off': '20-30% slower training for memory savings' - } - } - - # KV-cache strategies - design['kv_cache_strategies'] = { - 'adaptive_caching': { - 'real_time_chat': 'Small cache, fast eviction for responsiveness', - 'batch_processing': 'Large cache, batch-optimized allocation', - 'code_generation': 'Variable cache size based on context length' - }, - 'cache_sharing': { - 'prefix_sharing': 'Share cache for common prefixes (system prompts)', - 'multi_tenant': 'Isolated caches with memory pooling', - 'eviction_policy': 'LRU with workload-specific priorities' - } - } - - # Performance estimates with optimizations - design['performance_estimates'] = { - 'baseline_gpt_3_scale': { - 'memory_required_gb': 700, # For 175B parameters - 'max_seq_length': 2048, - 'bottleneck': 'Attention memory at long sequences' - }, - 'optimized_system': { - 'flash_attention_memory_gb': 35, # 20x reduction - 'sparse_attention_seq_length': 32768, # 16x longer sequences - 'kv_cache_speedup': '10-100x generation speedup' - } - } - - return design - ### END SOLUTION - -# Test the production optimization design -if 'KVCache' in globals(): - production_design = design_production_attention_system() - - print("🏭 PRODUCTION ATTENTION SYSTEM DESIGN:") - print("=" * 50) - - print("\n📊 WORKLOAD ANALYSIS:") - for workload, analysis in production_design['workload_analysis'].items(): - print(f"\n{workload.replace('_', ' ').title()}:") - print(f" Memory requirement: {analysis['total_memory_gb']:.1f} GB") - 
print(f" Attention FLOPs: {analysis['attention_flops']/1e12:.1f} TFLOPs") - print(f" Memory bandwidth: {analysis['memory_bandwidth_gb_s']:.1f} GB/s") - - print("\nROCKET OPTIMIZATION STRATEGIES:") - for strategy, details in production_design['memory_optimization'].items(): - print(f"\n{strategy.replace('_', ' ').title()}:") - print(f" Reduction: {details['memory_reduction']}") - print(f" Technique: {details['technique']}") - - print("\n💾 KV-CACHE OPTIMIZATION:") - for category, strategies in production_design['kv_cache_strategies'].items(): - print(f"\n{category.replace('_', ' ').title()}:") - if isinstance(strategies, dict): - for k, v in strategies.items(): - print(f" {k}: {v}") - else: - print(f" {strategies}") - - print("\nPROGRESS PERFORMANCE IMPACT:") - perf = production_design['performance_estimates'] - baseline = perf['baseline_gpt_3_scale'] - optimized = perf['optimized_system'] - - memory_improvement = baseline['memory_required_gb'] / optimized['flash_attention_memory_gb'] - seq_improvement = optimized['sparse_attention_seq_length'] / baseline['max_seq_length'] - - print(f" Memory reduction: {memory_improvement:.0f}x with Flash Attention") - print(f" Sequence length: {seq_improvement:.0f}x with sparse attention") - print(f" Generation speedup: {optimized['kv_cache_speedup']}") -else: - print("WARNING️ Complete all attention implementations first") - -# %% [markdown] -""" -### Question 3: KV-Cache Optimization and Generation Efficiency - -**Context**: Your KV-cache implementation demonstrates how caching key-value computations can significantly improve autoregressive generation efficiency. Production language models must optimize KV-cache strategies for diverse generation workloads while managing memory usage, cache consistency, and throughput across different deployment scenarios. 
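Before writing your answer, it helps to put rough numbers on the cache itself. The sketch below estimates KV-cache size from first principles (keys plus values, per layer, per token); the model configuration is illustrative, not taken from any specific production system:

```python
def kv_cache_bytes(seq_len, num_layers, num_heads, head_dim,
                   batch_size=1, bytes_per_value=4):
    """Back-of-envelope KV-cache size: K and V tensors cached for every layer."""
    # Each cached token stores one key and one value vector per head per layer
    per_token = 2 * num_layers * num_heads * head_dim * bytes_per_value
    return batch_size * seq_len * per_token

# Illustrative config: 32 layers, 32 heads of dim 128, fp32, batch of 8 chats
gb = kv_cache_bytes(seq_len=2048, num_layers=32, num_heads=32,
                    head_dim=128, batch_size=8) / 1024**3
print(f"{gb:.2f} GB")  # → 16.00 GB
```

Halving `bytes_per_value` (fp16) or evicting stale sequences directly halves this figure, which is why quantized caches and eviction policies dominate the design space for long-running services.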
- -**Reflection Question**: Design a KV-cache optimization system for a production language model serving that handles diverse generation workloads: real-time chat (low latency), batch document processing (high throughput), and interactive code generation (variable length patterns). How would you implement adaptive cache management that optimizes memory usage based on generation patterns, design efficient cache sharing across multiple requests, and handle cache eviction strategies for long-running services? Consider the challenges of balancing cache hit rates with memory efficiency while maintaining consistent generation quality across different workload types. - -Think about: adaptive cache management, multi-request cache sharing, eviction strategies, and workload-specific optimization. - -*Target length: 150-300 words* -""" - -# %% nbgrader={"grade": true, "grade_id": "question-3-kv-cache-optimization", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} -""" -YOUR REFLECTION ON KV-CACHE OPTIMIZATION AND GENERATION EFFICIENCY: - -TODO: Replace this text with your thoughtful response about KV-cache optimization for diverse generation workloads. - -Consider addressing: -- How would you design adaptive cache management for real-time chat, batch processing, and code generation? -- What strategies would you use for efficient cache sharing across multiple requests? -- How would you implement cache eviction strategies for long-running production services? -- What approaches would you use to optimize memory usage based on generation patterns? -- How would you balance cache hit rates with memory efficiency across different workloads? - -Write a design analysis connecting your KV-cache implementation to production generation system optimization. 
- -GRADING RUBRIC (Instructor Use): -- Understands KV-cache optimization challenges and adaptive management strategies (3 points) -- Designs practical approaches to multi-request cache sharing and eviction (3 points) -- Addresses workload-specific optimization and memory efficiency considerations (2 points) -- Shows systems thinking about production generation service optimization (2 points) -- Clear design reasoning with cache optimization insights (bonus points for innovative approaches) -""" - -### BEGIN SOLUTION -# Student response area - instructor will replace this section during grading setup -# This is a manually graded question requiring understanding of KV-cache optimization for production systems -# Students should demonstrate knowledge of cache management and generation efficiency optimization -### END SOLUTION - -# %% [markdown] -""" -## TARGET MODULE SUMMARY: Attention - -Congratulations! You have successfully implemented the attention mechanisms that enable sequence understanding: - -### PASS What You Have Built -- **Scaled Dot-Product Attention**: The fundamental attention mechanism with proper masking support -- **Multi-Head Attention**: Parallel attention heads for richer representation learning -- **KV-Cache System**: Efficient caching for autoregressive generation workloads -- **Causal Masking**: Support for autoregressive language modeling -- **Performance Analysis**: Comprehensive scaling and optimization analysis tools -- **🆕 Memory Optimization**: Understanding and measuring attention's O(N²) scaling characteristics -- **🆕 Systems Integration**: Complete attention pipeline with embeddings and generation support - -### PASS Key Learning Outcomes -- **Understanding**: How attention enables sequence models to capture dependencies -- **Implementation**: Built attention mechanisms with memory-efficient patterns and causal masking -- **Systems Insight**: How attention's quadratic scaling affects model architecture and deployment -- **Performance 
Engineering**: Measured and analyzed attention bottlenecks and optimization techniques -- **Production Context**: Understanding real-world attention challenges and optimization strategies - -### PASS Technical Mastery -- **Attention Mathematics**: Attention(Q,K,V) = softmax(QK^T/√d_k)V with proper scaling -- **Multi-Head Architecture**: Parallel attention computation with head dimension management -- **Causal Masking**: Autoregressive attention patterns for language generation -- **Memory Scaling**: Understanding O(N²) complexity and its implications for sequence length -- **🆕 KV-Cache Efficiency**: Optimizing attention computation for generation workloads - -### PASS Professional Skills Developed -- **Systems Architecture**: Designing attention systems for production scale and efficiency -- **Memory Engineering**: Understanding and optimizing attention's memory bottlenecks -- **Performance Analysis**: Measuring and improving attention computation throughput -- **Integration Design**: Building attention systems that work with embeddings and sequence models - -### PASS Ready for Next Steps -Your attention systems are now ready to power: -- **Sequence Models**: Complete architectures with attention and feedforward layers -- **Language Generation**: Autoregressive text generation with efficient attention patterns -- **Sequence Modeling**: Advanced sequence processing for various NLP tasks -- **🧠 Modern AI Systems**: Foundation for advanced language and sequence models - -### LINK Connection to Real ML Systems -Your implementations mirror production systems: -- **PyTorch Attention**: `torch.nn.MultiheadAttention` and `torch.nn.functional.scaled_dot_product_attention` -- **Flash Attention**: Memory-efficient attention computation used in production systems -- **KV-Cache Optimization**: Essential for efficient language model serving and generation -- **Industry Applications**: Every modern language model relies on optimized attention mechanisms - -### TARGET The 
Revolution of Attention -You have built the mechanism that transformed AI: -- **Before**: RNNs struggled with long-range dependencies and sequential computation -- **After**: Attention enables parallel processing and direct long-range connections - -**Next Module**: Advanced Architectures - Combining your embeddings and attention into complete sequence processing systems! - -Your attention mechanisms are the computational core that enables advanced sequence models to understand and generate language. Now let's build complete architectures that use them! -""" \ No newline at end of file diff --git a/modules_old/12_attention/module.yaml b/modules_old/12_attention/module.yaml deleted file mode 100644 index f9d6fb96..00000000 --- a/modules_old/12_attention/module.yaml +++ /dev/null @@ -1,29 +0,0 @@ -description: Scaled dot-product and multi-head attention mechanisms that enable transformer - architectures -estimated_time: 5-6 hours -exports: -- ScaledDotProductAttention -- MultiHeadAttention -- KVCache -- AttentionProfiler -learning_objectives: -- Implement scaled dot-product attention with proper masking and numerical stability -- Build multi-head attention with parallel head processing and output projection -- Design KV-cache systems for efficient autoregressive generation -- "Understand attention's O(N\xB2) scaling and memory optimization techniques" -- Analyze attention performance bottlenecks and production optimization strategies -ml_systems_focus: Attention memory scaling, generation efficiency optimization, sequence - length limitations -name: Attention -next_modules: -- 14_transformers -number: 13 -prerequisites: -- 02_tensor -- 12_embeddings -systems_concepts: -- "Quadratic memory scaling O(N\xB2) with sequence length" -- Memory-bandwidth bound attention computation -- KV-cache optimization for autoregressive generation -- Multi-head parallelization and hardware optimization -- Attention masking patterns and causal dependencies diff --git 
a/modules_old/13_transformers/README.md b/modules_old/13_transformers/README.md deleted file mode 100644 index 6bbe293d..00000000 --- a/modules_old/13_transformers/README.md +++ /dev/null @@ -1,105 +0,0 @@ -# Module 14: Transformers - Complete Transformer Architecture Implementation - -## Overview -This module implements complete transformer architectures that power modern language models. You'll build LayerNorm, transformer blocks, and complete transformer models while understanding how architectural choices affect scalability, memory usage, and production deployment strategies. - -## What You'll Learn - -### Core Implementations -- **Layer Normalization**: Stable normalization for deep transformer training -- **Position-wise Feed-Forward**: Non-linear transformations for each sequence position -- **Transformer Blocks**: Complete transformer layers with self-attention and feed-forward components -- **Complete Transformer**: Full language model with embeddings, multiple layers, and generation capability - -### ML Systems Concepts -- **Architecture Scaling**: How depth, width, and attention heads affect model capacity and requirements -- **Memory Management**: Understanding transformer memory scaling and optimization techniques -- **Training Stability**: Layer normalization and residual connections for deep network training -- **Generation Systems**: Autoregressive text generation with causal attention patterns - -### Performance Engineering -- **Transformer Profiling**: Measuring computation and memory scaling with architectural choices -- **Architecture Optimization**: Balancing depth, width, and attention heads within resource constraints -- **Production Analysis**: Understanding deployment requirements for different transformer configurations -- **System Integration**: Complete pipeline from tokenization through text generation - -## Key Learning Outcomes - -By completing this module, you'll understand: - -1. 
**Transformer Architecture**: How attention, normalization, and feed-forward layers work together -2. **Deep Network Training**: Why layer normalization and residual connections enable stable training -3. **Memory Scaling**: How transformer parameters and memory scale with architectural choices -4. **Text Generation**: How autoregressive generation works with causal attention masking -5. **Production Systems**: How transformer design choices affect deployment and optimization - -## Files in This Module - -- `transformers_dev.py` - Main implementation with all transformer components -- `transformers_dev.ipynb` - Jupyter notebook (auto-generated) -- `module.yaml` - Module configuration and metadata -- `README.md` - This documentation file - -## Usage Example - -```python -from tinytorch.core.transformers import LayerNorm, TransformerBlock, Transformer -from tinytorch.core.attention import MultiHeadAttention -from tinytorch.core.embeddings import Embedding, PositionalEncoding - -# Create complete transformer model -transformer = Transformer( - vocab_size=10000, - embed_dim=512, - num_heads=8, - num_layers=6, - hidden_dim=2048, - max_seq_length=512 -) - -# Process text through transformer -input_ids = tokenize("Hello, world!") -logits = transformer(input_ids) - -# Generate text autoregressively -generated = transformer.generate(input_ids, max_new_tokens=50) -``` - -## Integration with TinyTorch - -This module exports to `tinytorch.core.transformers` and provides the complete architecture for: -- **Language modeling** - GPT-style autoregressive language models -- **Text generation** - Efficient autoregressive text generation systems -- **Advanced architectures** - Foundation for BERT, T5, and other transformer variants - -## Systems Engineering Focus - -This module emphasizes the systems engineering aspects of transformer design: - -### Memory Characteristics -- **Linear scaling**: Transformer memory scales linearly with depth -- **Parameter distribution**: 
Understanding how parameters are allocated across components -- **Training vs inference**: Different memory requirements for training and inference -- **Batch processing**: Memory scaling with batch size and sequence length - -### Performance Considerations -- **Layer depth**: More layers improve capacity but increase memory and computation -- **Model width**: Embedding and hidden dimensions affect parameter count quadratically -- **Attention heads**: More heads improve representation but increase computation -- **Architecture trade-offs**: Balancing depth, width, and heads within resource constraints - -## Prerequisites -- Module 02: Tensor (for matrix operations and data structures) -- Module 12: Embeddings (for token and positional representations) -- Module 13: Attention (for multi-head attention mechanisms) -- Understanding of layer normalization and residual connections - -## Estimated Time -6-7 hours including implementation, testing, and architecture analysis - -## Next Steps -After completing this module, you'll have mastered: -- Complete transformer architecture implementation -- Production-ready language model systems -- Advanced optimization techniques for large-scale deployment -- Foundation for specialized transformer variants (BERT, T5, etc.) 
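The depth/width/heads trade-offs listed above can be made concrete with a back-of-the-envelope parameter counter. The sketch below is purely illustrative and is not one of the module's exports; the helper name `transformer_param_count` and the component breakdown are assumptions made for this example (decoder-only model, token embeddings plus per-layer attention, feed-forward, and LayerNorm parameters):

```python
def transformer_param_count(vocab_size: int, embed_dim: int,
                            num_layers: int, hidden_dim: int) -> int:
    """Rough parameter count for a GPT-style decoder-only transformer.

    Counts the token embedding table plus, per layer: four attention
    projections (Q, K, V, output), the two feed-forward matrices, and
    two LayerNorms. Biases are included; positional encodings are not.
    """
    embedding = vocab_size * embed_dim                    # token embedding table
    attn = 4 * (embed_dim * embed_dim + embed_dim)        # Q, K, V, output projections
    ffn = (embed_dim * hidden_dim + hidden_dim) \
        + (hidden_dim * embed_dim + embed_dim)            # two linear layers
    norms = 2 * (2 * embed_dim)                           # gamma + beta, two norms per block
    per_layer = attn + ffn + norms
    return embedding + num_layers * per_layer

# Doubling the width roughly quadruples per-layer cost (quadratic in width),
# while doubling the depth only doubles it (linear in depth).
small = transformer_param_count(10000, 512, 6, 2048)
wide = transformer_param_count(10000, 1024, 6, 4096)
deep = transformer_param_count(10000, 512, 12, 2048)
print(f"baseline: {small:,}  2x width: {wide:,}  2x depth: {deep:,}")
```

This is why "model width affects parameter count quadratically" dominates scaling decisions: widening the model grows every weight matrix in both dimensions at once.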
\ No newline at end of file diff --git a/modules_old/13_transformers/module.yaml b/modules_old/13_transformers/module.yaml deleted file mode 100644 index c800f5f2..00000000 --- a/modules_old/13_transformers/module.yaml +++ /dev/null @@ -1,32 +0,0 @@ -description: Complete transformer architecture with LayerNorm, transformer blocks, - and language model implementation -estimated_time: 6-7 hours -exports: -- LayerNorm -- PositionwiseFeedForward -- TransformerBlock -- Transformer -- TransformerProfiler -learning_objectives: -- Implement LayerNorm for stable deep network training -- Build position-wise feed-forward networks for transformer blocks -- Create complete transformer blocks with attention, normalization, and residual connections -- Develop full transformer models with embeddings, multiple layers, and generation - capability -- Understand transformer scaling characteristics and production deployment considerations -ml_systems_focus: Transformer architecture optimization, memory scaling with depth, - production deployment strategies -name: Transformers -next_modules: -- Advanced transformer architectures and optimization techniques -number: 14 -prerequisites: -- 02_tensor -- 12_embeddings -- 13_attention -systems_concepts: -- Linear memory scaling with transformer depth -- Layer normalization vs batch normalization trade-offs -- Residual connection gradient flow optimization -- Parameter allocation across depth, width, and attention heads -- Training memory vs inference memory requirements diff --git a/modules_old/13_transformers/transformers_dev.ipynb b/modules_old/13_transformers/transformers_dev.ipynb deleted file mode 100644 index 6ba71b47..00000000 --- a/modules_old/13_transformers/transformers_dev.ipynb +++ /dev/null @@ -1,2658 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "8e332345", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "# Transformers - Complete Transformer Architecture Implementation\n", - "\n", - "Welcome to 
the Transformers module! You'll implement complete transformer blocks with LayerNorm, residual connections, and feed-forward networks, building the architecture that powers modern language models like GPT and BERT.\n", - "\n", - "## Learning Goals\n", - "- Systems understanding: How transformer blocks scale memory and computation with model depth\n", - "- Core implementation skill: Build complete transformer architectures with proper normalization\n", - "- Pattern recognition: Understand how residual connections enable training of deep transformer models\n", - "- Framework connection: See how your implementations match production transformer systems\n", - "- Performance insight: Learn how transformer layer memory accumulation affects model deployment\n", - "\n", - "## Build → Use → Reflect\n", - "1. **Build**: LayerNorm, transformer blocks, and complete transformer models\n", - "2. **Use**: Process sequences through multi-layer transformer architectures\n", - "3. **Reflect**: How do transformer design choices affect scalability and training dynamics?\n", - "\n", - "## What You'll Achieve\n", - "By the end of this module, you'll understand:\n", - "- Deep technical understanding of how transformer blocks enable powerful sequence modeling\n", - "- Practical capability to implement complete transformer architectures with proper layer organization\n", - "- Systems insight into how transformer depth affects memory usage and training efficiency\n", - "- Performance consideration of how layer normalization and residual connections affect convergence\n", - "- Connection to production systems like GPT's transformer blocks and their optimization techniques\n", - "\n", - "## Systems Reality Check\n", - "💡 **Production Context**: GPT-3 has 96 transformer layers, each with 12k-dimensional representations and complex memory management\n", - "⚡ **Performance Note**: Transformer layer memory accumulates linearly with depth - deep models require careful activation checkpointing" - ] 
- }, - { - "cell_type": "code", - "execution_count": null, - "id": "aaaa5ad1", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "transformers-imports", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "#| default_exp core.transformers\n", - "\n", - "#| export\n", - "import math\n", - "import numpy as np\n", - "import os\n", - "import sys\n", - "from typing import Union, List, Optional, Tuple, Dict\n", - "\n", - "# Import our Tensor class - try from package first, then from local module\n", - "try:\n", - " from tinytorch.core.tensor import Tensor\n", - "except ImportError:\n", - " # For development, import from local tensor module\n", - " sys.path.append(os.path.join(os.path.dirname(__file__), '..', '02_tensor'))\n", - " from tensor_dev import Tensor\n", - "\n", - "# Try to import attention classes\n", - "try:\n", - " from tinytorch.core.attention import ScaledDotProductAttention, MultiHeadAttention, KVCache\n", - "except ImportError:\n", - " # For development, import from local module\n", - " sys.path.append(os.path.join(os.path.dirname(__file__), '..', '13_attention'))\n", - " try:\n", - " from attention_dev import ScaledDotProductAttention, MultiHeadAttention, KVCache\n", - " except ImportError:\n", - " # Create minimal mock classes if not available\n", - " class MultiHeadAttention:\n", - " def __init__(self, embed_dim, num_heads):\n", - " self.embed_dim = embed_dim\n", - " self.num_heads = num_heads\n", - " def forward(self, q, k, v, mask=None):\n", - " return q # Mock implementation\n", - " class ScaledDotProductAttention:\n", - " def __init__(self):\n", - " pass\n", - " class KVCache:\n", - " def __init__(self, *args, **kwargs):\n", - " pass\n", - "\n", - "# Try to import embedding classes\n", - "try:\n", - " from tinytorch.core.embeddings import Embedding, PositionalEncoding\n", - "except ImportError:\n", - " # For development, import from local module\n", - " 
sys.path.append(os.path.join(os.path.dirname(__file__), '..', '12_embeddings'))\n", - " try:\n", - " from embeddings_dev import Embedding, PositionalEncoding\n", - " except ImportError:\n", - " # Create minimal mock classes if not available\n", - " class Embedding:\n", - " def __init__(self, vocab_size, embedding_dim):\n", - " self.vocab_size = vocab_size\n", - " self.embedding_dim = embedding_dim\n", - " class PositionalEncoding:\n", - " def __init__(self, embedding_dim, max_seq_length=5000):\n", - " self.embedding_dim = embedding_dim" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8d54a97a", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "transformers-welcome", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "print(\"🏗️ TinyTorch Transformers Module\")\n", - "print(f\"NumPy version: {np.__version__}\")\n", - "print(\"Ready to build complete transformer architectures!\")" - ] - }, - { - "cell_type": "markdown", - "id": "e684830c", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 📦 Where This Code Lives in the Final Package\n", - "\n", - "**Learning Side:** You work in `modules/source/14_transformers/transformers_dev.py` \n", - "**Building Side:** Code exports to `tinytorch.core.transformers`\n", - "\n", - "```python\n", - "# Final package structure:\n", - "from tinytorch.core.transformers import LayerNorm, TransformerBlock, Transformer\n", - "from tinytorch.core.attention import MultiHeadAttention # Previous module\n", - "from tinytorch.core.embeddings import Embedding, PositionalEncoding # Foundation\n", - "```\n", - "\n", - "**Why this matters:**\n", - "- **Learning:** Focused modules for deep understanding\n", - "- **Production:** Proper organization like PyTorch's transformer implementations\n", - "- **Consistency:** All transformer components live together in `core.transformers`\n", - "- **Integration:** Works seamlessly with 
attention, embeddings, and tokenization systems" - ] - }, - { - "cell_type": "markdown", - "id": "be87d30f", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## What are Transformers?\n", - "\n", - "### The Architecture Revolution\n", - "Transformers revolutionized AI by replacing recurrent connections with attention mechanisms:\n", - "\n", - "**Traditional RNN/LSTM:**\n", - "```\n", - "h₁ → h₂ → h₃ → h₄ (Sequential processing)\n", - "```\n", - "\n", - "**Transformer:**\n", - "```\n", - "All positions attend to all positions simultaneously (Parallel processing)\n", - "```\n", - "\n", - "### Transformer Block Components\n", - "Each transformer block contains:\n", - "\n", - "1. **Multi-Head Self-Attention**: Captures sequence relationships\n", - "2. **Layer Normalization**: Stabilizes training of deep networks\n", - "3. **Residual Connections**: Enables gradient flow through many layers\n", - "4. **Position-wise Feed-Forward**: Applies non-linear transformations\n", - "\n", - "### The Complete Architecture\n", - "```\n", - "Input Embeddings + Positional Encoding\n", - " ↓\n", - "[Transformer Block] × N layers\n", - " ↓\n", - "Output Layer (Language Modeling Head)\n", - "```\n", - "\n", - "### Systems Trade-offs\n", - "- **Layer depth**: More layers = more capacity, more memory\n", - "- **Attention heads**: More heads = richer representations, more computation\n", - "- **Feed-forward size**: Larger FFN = more parameters, better performance\n", - "- **Layer normalization**: Pre-norm vs post-norm affects training dynamics" - ] - }, - { - "cell_type": "markdown", - "id": "b1081f61", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Layer Normalization Implementation\n", - "\n", - "Layer normalization is crucial for training stable transformers. Unlike batch normalization, it normalizes across the feature dimension for each sample independently." 
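The difference from batch normalization comes down to which axis the statistics are computed over, which is easy to see in plain NumPy before building the class. This is a standalone illustration, independent of the `Tensor` class used in the implementation below:

```python
import numpy as np

# Toy batch: 4 samples, 8 features each, deliberately off-center.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 8))

# Layer norm: statistics per sample, across the feature axis (axis=-1).
ln_mean = x.mean(axis=-1, keepdims=True)
ln_var = x.var(axis=-1, keepdims=True)
layer_normed = (x - ln_mean) / np.sqrt(ln_var + 1e-5)

# Batch norm: statistics per feature, across the batch axis (axis=0).
bn_mean = x.mean(axis=0, keepdims=True)
bn_var = x.var(axis=0, keepdims=True)
batch_normed = (x - bn_mean) / np.sqrt(bn_var + 1e-5)

# Each *row* of layer_normed has ~zero mean; each *column* of batch_normed does.
print(np.allclose(layer_normed.mean(axis=-1), 0, atol=1e-6))  # True
print(np.allclose(batch_normed.mean(axis=0), 0, atol=1e-6))   # True
```

Because layer norm never mixes statistics across samples, it behaves identically at batch size 1 and at inference time, which is one reason transformers prefer it.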
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2166849c", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "layer-norm", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class LayerNorm:\n", - " \"\"\"\n", - " Layer Normalization for transformers.\n", - " \n", - " Normalizes across the feature dimension (last axis) for each sample,\n", - " making training more stable and enabling deeper networks.\n", - " \"\"\"\n", - " \n", - " def __init__(self, normalized_shape: Union[int, Tuple[int]], eps: float = 1e-5):\n", - " \"\"\"\n", - " Initialize layer normalization with learnable parameters.\n", - " \n", - " TODO: Implement layer normalization initialization.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Store normalization configuration\n", - " 2. Initialize learnable scale (gamma) and shift (beta) parameters\n", - " 3. Set epsilon for numerical stability\n", - " 4. 
Set up parameter tracking for optimization\n", - " \n", - " MATHEMATICAL FOUNDATION:\n", - " LayerNorm(x) = γ * (x - μ) / σ + β\n", - " \n", - " Where:\n", - " - μ = mean across feature dimensions\n", - " - σ = std across feature dimensions \n", - " - γ = learnable scale parameter\n", - " - β = learnable shift parameter\n", - " \n", - " Args:\n", - " normalized_shape: Shape of features to normalize (e.g., embedding_dim)\n", - " eps: Small value for numerical stability\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " if isinstance(normalized_shape, int):\n", - " self.normalized_shape = (normalized_shape,)\n", - " else:\n", - " self.normalized_shape = normalized_shape\n", - " \n", - " self.eps = eps\n", - " \n", - " # Initialize learnable parameters\n", - " # Gamma (scale): initialized to ones\n", - " # Beta (bias): initialized to zeros\n", - " self.gamma = Tensor(np.ones(self.normalized_shape))\n", - " self.beta = Tensor(np.zeros(self.normalized_shape))\n", - " \n", - " # Track parameters for optimization\n", - " self.parameters = [self.gamma, self.beta]\n", - " ### END SOLUTION\n", - " \n", - " def forward(self, x: Tensor) -> Tensor:\n", - " \"\"\"\n", - " Apply layer normalization to input tensor.\n", - " \n", - " TODO: Implement layer normalization forward pass.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Calculate mean across feature dimensions\n", - " 2. Calculate standard deviation across feature dimensions\n", - " 3. Normalize: (x - mean) / (std + eps)\n", - " 4. 
Apply learnable scale and shift: gamma * normalized + beta\n", - " \n", - " NUMERICAL STABILITY:\n", - " - Add eps to variance before taking sqrt\n", - " - Use population (biased) variance, the np.var default, as in standard LayerNorm\n", - " \n", - " EXAMPLE:\n", - " layer_norm = LayerNorm(256)\n", - " x = Tensor(np.random.randn(32, 128, 256)) # (batch, seq, features)\n", - " normalized = layer_norm.forward(x) # Same shape as input\n", - " \n", - " Args:\n", - " x: Input tensor with shape (..., *normalized_shape)\n", - " \n", - " Returns:\n", - " Normalized tensor with same shape as input\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Calculate mean and variance across the feature dimensions (last axes)\n", - " # For shape (..., *normalized_shape), we want to normalize over the last len(normalized_shape) axes\n", - " \n", - " # Determine axes to normalize over\n", - " axes_to_normalize = tuple(range(len(x.shape) - len(self.normalized_shape), len(x.shape)))\n", - " \n", - " # Calculate mean\n", - " mean = np.mean(x.data, axis=axes_to_normalize, keepdims=True)\n", - " \n", - " # Calculate variance\n", - " variance = np.var(x.data, axis=axes_to_normalize, keepdims=True)\n", - " \n", - " # Normalize\n", - " normalized = (x.data - mean) / np.sqrt(variance + self.eps)\n", - " \n", - " # Apply learnable scale and shift\n", - " # Reshape gamma and beta to be broadcastable\n", - " gamma_broadcasted = self.gamma.data.reshape([1] * (len(x.shape) - len(self.normalized_shape)) + list(self.normalized_shape))\n", - " beta_broadcasted = self.beta.data.reshape([1] * (len(x.shape) - len(self.normalized_shape)) + list(self.normalized_shape))\n", - " \n", - " output = gamma_broadcasted * normalized + beta_broadcasted\n", - " \n", - " return Tensor(output)\n", - " ### END SOLUTION\n", - " \n", - " def __call__(self, x: Tensor) -> Tensor:\n", - " \"\"\"Make the class callable.\"\"\"\n", - " return self.forward(x)\n", - " \n", - " def get_memory_usage(self) -> Dict[str, float]:\n", - " \"\"\"\n", - " Calculate memory usage of 
layer normalization parameters.\n", - " \n", - " This function is PROVIDED to show memory analysis.\n", - " \"\"\"\n", - " # Parameter memory\n", - " param_memory_mb = sum(param.data.nbytes for param in self.parameters) / (1024 * 1024)\n", - " \n", - " return {\n", - " 'parameter_memory_mb': param_memory_mb,\n", - " 'total_parameters': sum(param.data.size for param in self.parameters),\n", - " 'normalized_shape': self.normalized_shape\n", - " }" - ] - }, - { - "cell_type": "markdown", - "id": "ba9e1251", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Test Your Layer Normalization Implementation\n", - "\n", - "Once you implement the LayerNorm methods above, run this cell to test it:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7349865c", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "test-layer-norm-immediate", - "locked": true, - "points": 15, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_unit_layer_norm():\n", - " \"\"\"Unit test for layer normalization.\"\"\"\n", - " print(\"🔬 Unit Test: Layer Normalization...\")\n", - " \n", - " # Test 1: Basic functionality\n", - " embed_dim = 256\n", - " layer_norm = LayerNorm(embed_dim)\n", - " \n", - " # Verify initialization\n", - " assert layer_norm.normalized_shape == (embed_dim,), \"Should store normalized shape\"\n", - " assert len(layer_norm.parameters) == 2, \"Should have gamma and beta parameters\"\n", - " assert layer_norm.gamma.shape == (embed_dim,), \"Gamma should match normalized shape\"\n", - " assert layer_norm.beta.shape == (embed_dim,), \"Beta should match normalized shape\"\n", - " \n", - " # Verify parameter initialization\n", - " assert np.allclose(layer_norm.gamma.data, 1.0), \"Gamma should be initialized to ones\"\n", - " assert np.allclose(layer_norm.beta.data, 0.0), \"Beta should be initialized to zeros\"\n", - " \n", 
- " # Test 2: Forward pass with 2D input\n", - " batch_size = 16\n", - " x_2d = Tensor(np.random.randn(batch_size, embed_dim))\n", - " output_2d = layer_norm.forward(x_2d)\n", - " \n", - " assert output_2d.shape == x_2d.shape, \"Output shape should match input shape\"\n", - " \n", - " # Test 3: Forward pass with 3D input (typical transformer use)\n", - " seq_length = 32\n", - " x_3d = Tensor(np.random.randn(batch_size, seq_length, embed_dim))\n", - " output_3d = layer_norm.forward(x_3d)\n", - " \n", - " assert output_3d.shape == x_3d.shape, \"3D output shape should match input shape\"\n", - " \n", - " # Test 4: Normalization properties\n", - " # For each sample, the normalized features should have ~zero mean and ~unit variance\n", - " for i in range(batch_size):\n", - " for j in range(seq_length):\n", - " sample_output = output_3d.data[i, j, :]\n", - " sample_mean = np.mean(sample_output)\n", - " sample_var = np.var(sample_output)\n", - " \n", - " assert abs(sample_mean) < 1e-6, f\"Normalized mean should be ~0, got {sample_mean}\"\n", - " assert abs(sample_var - 1.0) < 1e-6, f\"Normalized variance should be ~1, got {sample_var}\"\n", - " \n", - " # Test 5: Different normalized shapes\n", - " multi_dim_shape = (64, 4) # Multi-dimensional normalization\n", - " layer_norm_multi = LayerNorm(multi_dim_shape)\n", - " \n", - " x_multi = Tensor(np.random.randn(8, 32, 64, 4))\n", - " output_multi = layer_norm_multi.forward(x_multi)\n", - " \n", - " assert output_multi.shape == x_multi.shape, \"Multi-dim normalization should preserve shape\"\n", - " \n", - " # Test 6: Callable interface\n", - " output_callable = layer_norm(x_3d)\n", - " assert np.allclose(output_callable.data, output_3d.data), \"Callable interface should work\"\n", - " \n", - " # Test 7: Numerical stability with extreme values\n", - " extreme_x = Tensor(np.ones((4, embed_dim)) * 1e6) # Very large values\n", - " extreme_output = layer_norm.forward(extreme_x)\n", - " \n", - " assert not 
np.any(np.isnan(extreme_output.data)), \"Should handle extreme values without NaN\"\n", - " assert not np.any(np.isinf(extreme_output.data)), \"Should handle extreme values without inf\"\n", - " \n", - " # Test 8: Memory usage calculation\n", - " memory_stats = layer_norm.get_memory_usage()\n", - " assert 'parameter_memory_mb' in memory_stats, \"Should provide memory statistics\"\n", - " assert memory_stats['total_parameters'] == 2 * embed_dim, \"Should count gamma and beta parameters\"\n", - " \n", - " print(\"✅ Layer normalization tests passed!\")\n", - " print(f\"✅ Properly normalizes across feature dimensions\")\n", - " print(f\"✅ Handles 2D and 3D inputs correctly\")\n", - " print(f\"✅ Maintains ~0 mean and ~1 variance after normalization\")\n", - " print(f\"✅ Parameter memory: {memory_stats['parameter_memory_mb']:.4f}MB\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "b484efe6", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Position-wise Feed-Forward Network\n", - "\n", - "Each transformer block contains a position-wise feed-forward network that applies the same transformation to each position independently." 
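A quick NumPy check makes the "position-wise" property concrete: one batched call with shared weights matches applying the same two-layer MLP to every sequence position separately. The weights here are random illustrative values, not the Xavier initialization used in the class below:

```python
import numpy as np

rng = np.random.default_rng(1)
batch, seq, d, h = 2, 5, 8, 32

x = rng.normal(size=(batch, seq, d))
w1, b1 = rng.normal(size=(d, h)), np.zeros(h)
w2, b2 = rng.normal(size=(h, d)), np.zeros(d)

def ffn(v):
    """FFN(v) = max(0, v @ W1 + b1) @ W2 + b2, applied along the last axis."""
    return np.maximum(0, v @ w1 + b1) @ w2 + b2

# One batched call over the whole (batch, seq, d) tensor...
full = ffn(x)
# ...matches applying the identical weights to each position on its own.
per_position = np.stack([ffn(x[:, t, :]) for t in range(seq)], axis=1)
print(np.allclose(full, per_position))  # True
```

Since positions never interact inside the FFN, all cross-position information flow in a transformer block must come from the attention sublayer.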
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b1aaebc9", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "feed-forward", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class PositionwiseFeedForward:\n", - " \"\"\"\n", - " Position-wise feed-forward network used in transformer blocks.\n", - " \n", - " Applies the same feed-forward network to each position in the sequence:\n", - " FFN(x) = max(0, xW₁ + b₁)W₂ + b₂\n", - " \"\"\"\n", - " \n", - " def __init__(self, embed_dim: int, hidden_dim: int, dropout: float = 0.0):\n", - " \"\"\"\n", - " Initialize position-wise feed-forward network.\n", - " \n", - " TODO: Implement feed-forward network initialization.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Store network configuration\n", - " 2. Initialize weight matrices and bias vectors for two linear layers\n", - " 3. Set up parameter tracking for optimization\n", - " 4. 
Store dropout rate for training\n", - " \n", - " ARCHITECTURE:\n", - " - Input: (batch, seq_len, embed_dim)\n", - " - Linear 1: embed_dim → hidden_dim\n", - " - ReLU activation\n", - " - Linear 2: hidden_dim → embed_dim\n", - " - Output: (batch, seq_len, embed_dim)\n", - " \n", - " PARAMETER INITIALIZATION:\n", - " Use Xavier/Glorot initialization for stable training\n", - " \n", - " Args:\n", - " embed_dim: Embedding dimension (input and output size)\n", - " hidden_dim: Hidden layer dimension (typically 4 * embed_dim)\n", - " dropout: Dropout rate for regularization\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " self.embed_dim = embed_dim\n", - " self.hidden_dim = hidden_dim\n", - " self.dropout = dropout\n", - " \n", - " # Initialize weights using Xavier initialization\n", - " # W1: embed_dim → hidden_dim\n", - " xavier_bound_1 = math.sqrt(6.0 / (embed_dim + hidden_dim))\n", - " self.w1 = Tensor(np.random.uniform(-xavier_bound_1, xavier_bound_1, (embed_dim, hidden_dim)))\n", - " self.b1 = Tensor(np.zeros(hidden_dim))\n", - " \n", - " # W2: hidden_dim → embed_dim\n", - " xavier_bound_2 = math.sqrt(6.0 / (hidden_dim + embed_dim))\n", - " self.w2 = Tensor(np.random.uniform(-xavier_bound_2, xavier_bound_2, (hidden_dim, embed_dim)))\n", - " self.b2 = Tensor(np.zeros(embed_dim))\n", - " \n", - " # Track parameters for optimization\n", - " self.parameters = [self.w1, self.b1, self.w2, self.b2]\n", - " ### END SOLUTION\n", - " \n", - " def forward(self, x: Tensor) -> Tensor:\n", - " \"\"\"\n", - " Apply position-wise feed-forward transformation.\n", - " \n", - " TODO: Implement feed-forward forward pass.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Apply first linear transformation: x @ W1 + b1\n", - " 2. Apply ReLU activation: max(0, linear1)\n", - " 3. Apply second linear transformation: relu @ W2 + b2\n", - " 4. 
Return result with same shape as input\n", - " \n", - " MATHEMATICAL FORMULATION:\n", - " hidden = ReLU(x @ W1 + b1)\n", - " output = hidden @ W2 + b2\n", - " \n", - " Args:\n", - " x: Input tensor with shape (batch_size, seq_len, embed_dim)\n", - " \n", - " Returns:\n", - " Output tensor with shape (batch_size, seq_len, embed_dim)\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Reshape input for matrix multiplication if needed\n", - " original_shape = x.shape\n", - " if len(x.shape) == 3:\n", - " batch_size, seq_len, embed_dim = x.shape\n", - " # Reshape to (batch_size * seq_len, embed_dim) for efficient computation\n", - " x_reshaped = x.data.reshape(-1, embed_dim)\n", - " else:\n", - " x_reshaped = x.data\n", - " \n", - " # First linear transformation: x @ W1 + b1\n", - " hidden = np.matmul(x_reshaped, self.w1.data) + self.b1.data\n", - " \n", - " # ReLU activation\n", - " hidden_relu = np.maximum(0, hidden)\n", - " \n", - " # Second linear transformation: hidden @ W2 + b2\n", - " output = np.matmul(hidden_relu, self.w2.data) + self.b2.data\n", - " \n", - " # Reshape back to original shape\n", - " if len(original_shape) == 3:\n", - " output = output.reshape(original_shape)\n", - " \n", - " return Tensor(output)\n", - " ### END SOLUTION\n", - " \n", - " def __call__(self, x: Tensor) -> Tensor:\n", - " \"\"\"Make the class callable.\"\"\"\n", - " return self.forward(x)\n", - " \n", - " def get_memory_usage(self) -> Dict[str, float]:\n", - " \"\"\"\n", - " Calculate memory usage of feed-forward parameters.\n", - " \n", - " This function is PROVIDED to show memory analysis.\n", - " \"\"\"\n", - " # Parameter memory\n", - " param_memory_mb = sum(param.data.nbytes for param in self.parameters) / (1024 * 1024)\n", - " \n", - " # Calculate parameter counts\n", - " w1_params = self.embed_dim * self.hidden_dim\n", - " w2_params = self.hidden_dim * self.embed_dim\n", - " bias_params = self.hidden_dim + self.embed_dim\n", - " total_params = w1_params + w2_params + 
bias_params\n", - " \n", - " return {\n", - " 'parameter_memory_mb': param_memory_mb,\n", - " 'total_parameters': total_params,\n", - " 'w1_parameters': w1_params,\n", - " 'w2_parameters': w2_params,\n", - " 'bias_parameters': bias_params,\n", - " 'embed_dim': self.embed_dim,\n", - " 'hidden_dim': self.hidden_dim\n", - " }" - ] - }, - { - "cell_type": "markdown", - "id": "e555b646", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Test Your Feed-Forward Network Implementation\n", - "\n", - "Once you implement the PositionwiseFeedForward methods above, run this cell to test it:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "95b8fd0e", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "test-feed-forward-immediate", - "locked": true, - "points": 15, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_unit_feed_forward():\n", - " \"\"\"Unit test for position-wise feed-forward network.\"\"\"\n", - " print(\"🔬 Unit Test: Position-wise Feed-Forward Network...\")\n", - " \n", - " # Test configuration\n", - " embed_dim = 256\n", - " hidden_dim = 1024 # Typical 4x expansion\n", - " ffn = PositionwiseFeedForward(embed_dim=embed_dim, hidden_dim=hidden_dim)\n", - " \n", - " # Verify initialization\n", - " assert ffn.embed_dim == embed_dim, \"Should store embedding dimension\"\n", - " assert ffn.hidden_dim == hidden_dim, \"Should store hidden dimension\"\n", - " assert len(ffn.parameters) == 4, \"Should have W1, b1, W2, b2 parameters\"\n", - " \n", - " # Verify parameter shapes\n", - " assert ffn.w1.shape == (embed_dim, hidden_dim), f\"W1 should be ({embed_dim}, {hidden_dim})\"\n", - " assert ffn.b1.shape == (hidden_dim,), f\"b1 should be ({hidden_dim},)\"\n", - " assert ffn.w2.shape == (hidden_dim, embed_dim), f\"W2 should be ({hidden_dim}, {embed_dim})\"\n", - " assert ffn.b2.shape == (embed_dim,), f\"b2 
should be ({embed_dim},)\"\n", - " \n", - " # Test forward pass with 3D input (typical transformer use)\n", - " batch_size = 8\n", - " seq_len = 32\n", - " x_3d = Tensor(np.random.randn(batch_size, seq_len, embed_dim))\n", - " output_3d = ffn.forward(x_3d)\n", - " \n", - " expected_shape = (batch_size, seq_len, embed_dim)\n", - " assert output_3d.shape == expected_shape, f\"Expected shape {expected_shape}, got {output_3d.shape}\"\n", - " \n", - " # Test forward pass with 2D input\n", - " x_2d = Tensor(np.random.randn(batch_size, embed_dim))\n", - " output_2d = ffn.forward(x_2d)\n", - " \n", - " expected_2d_shape = (batch_size, embed_dim)\n", - " assert output_2d.shape == expected_2d_shape, f\"Expected 2D shape {expected_2d_shape}, got {output_2d.shape}\"\n", - " \n", - " # Test that FFN is applied position-wise (same transformation at each position)\n", - " # Extract two positions from the sequence\n", - " pos_1_input = Tensor(x_3d.data[:, 0, :]) # First position\n", - " pos_2_input = Tensor(x_3d.data[:, 1, :]) # Second position\n", - " \n", - " pos_1_output = ffn.forward(pos_1_input)\n", - " pos_2_output = ffn.forward(pos_2_input)\n", - " \n", - " # Compare with full sequence output\n", - " assert np.allclose(pos_1_output.data, output_3d.data[:, 0, :]), \"Position 0 should match individual processing\"\n", - " assert np.allclose(pos_2_output.data, output_3d.data[:, 1, :]), \"Position 1 should match individual processing\"\n", - " \n", - " # Test ReLU activation (some outputs should be zero for negative intermediate values)\n", - " # Create input that will definitely produce some negative values after first linear layer\n", - " negative_input = Tensor(-np.ones((4, embed_dim)) * 10) # Very negative input\n", - " negative_output = ffn.forward(negative_input)\n", - " \n", - " # Not all outputs should be negative (ReLU should clip some values)\n", - " assert not np.all(negative_output.data < 0), \"ReLU should prevent all outputs from being negative\"\n", - " \n", - " # 
Test callable interface\n", - " output_callable = ffn(x_3d)\n", - " assert np.allclose(output_callable.data, output_3d.data), \"Callable interface should work\"\n", - " \n", - " # Test different hidden dimensions\n", - " for test_hidden_dim in [512, 2048]:\n", - " test_ffn = PositionwiseFeedForward(embed_dim=embed_dim, hidden_dim=test_hidden_dim)\n", - " test_output = test_ffn.forward(x_3d)\n", - " assert test_output.shape == expected_shape, f\"Should work with hidden_dim={test_hidden_dim}\"\n", - " \n", - " # Test memory usage calculation\n", - " memory_stats = ffn.get_memory_usage()\n", - " assert 'parameter_memory_mb' in memory_stats, \"Should provide memory statistics\"\n", - " \n", - " # Verify parameter counts\n", - " expected_w1_params = embed_dim * hidden_dim\n", - " expected_w2_params = hidden_dim * embed_dim\n", - " expected_total = expected_w1_params + expected_w2_params + hidden_dim + embed_dim\n", - " \n", - " assert memory_stats['w1_parameters'] == expected_w1_params, \"Should count W1 parameters correctly\"\n", - " assert memory_stats['w2_parameters'] == expected_w2_params, \"Should count W2 parameters correctly\"\n", - " assert memory_stats['total_parameters'] == expected_total, \"Should count total parameters correctly\"\n", - " \n", - " print(\"✅ Position-wise feed-forward tests passed!\")\n", - " print(f\"✅ Handles 2D and 3D inputs correctly\")\n", - " print(f\"✅ Position-wise processing verified\")\n", - " print(f\"✅ ReLU activation working properly\")\n", - " print(f\"✅ Total parameters: {memory_stats['total_parameters']:,}\")\n", - " print(f\"✅ Parameter memory: {memory_stats['parameter_memory_mb']:.2f}MB\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "d97703d2", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Transformer Block Implementation\n", - "\n", - "Now let's build the complete transformer block that combines multi-head 
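The position-wise property verified above (the same FFN applied independently at every sequence position) can be demonstrated in a minimal NumPy sketch. The toy dimensions and randomly drawn `W1`/`b1`/`W2`/`b2` below are illustrative stand-ins, not the module's actual parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 4, 8

# Hypothetical parameters standing in for the class's W1/b1/W2/b2
W1 = rng.standard_normal((embed_dim, hidden_dim))
b1 = np.zeros(hidden_dim)
W2 = rng.standard_normal((hidden_dim, embed_dim))
b2 = np.zeros(embed_dim)

def ffn(x):
    # hidden = ReLU(x @ W1 + b1); output = hidden @ W2 + b2
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

# A (seq_len, embed_dim) "sequence": running the FFN on the whole
# sequence equals running it on each position independently.
x = rng.standard_normal((3, embed_dim))
full = ffn(x)
per_position = np.stack([ffn(x[i]) for i in range(3)])
assert np.allclose(full, per_position)
```

This is exactly why the forward pass can safely flatten `(batch, seq, embed)` to `(batch * seq, embed)` before the matmul: no position ever mixes with another.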
attention, layer normalization, and position-wise feed-forward networks with residual connections." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e5677022", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "transformer-block", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class TransformerBlock:\n", - " \"\"\"\n", - " Complete transformer block with self-attention and feed-forward layers.\n", - " \n", - " Combines multi-head self-attention, layer normalization, residual connections,\n", - " and position-wise feed-forward networks into the standard transformer architecture.\n", - " \"\"\"\n", - " \n", - " def __init__(self, embed_dim: int, num_heads: int, hidden_dim: int, \n", - " dropout: float = 0.0, pre_norm: bool = True):\n", - " \"\"\"\n", - " Initialize transformer block with all components.\n", - " \n", - " TODO: Implement transformer block initialization.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Store block configuration\n", - " 2. Create multi-head attention layer\n", - " 3. Create two layer normalization layers (for attention and FFN)\n", - " 4. Create position-wise feed-forward network\n", - " 5. 
Set up parameter tracking from all sub-components\n", - " \n", - " ARCHITECTURE CHOICE: Pre-norm vs Post-norm\n", - " - Pre-norm: LayerNorm → Attention → Residual (more stable)\n", - " - Post-norm: Attention → LayerNorm → Residual (original paper)\n", - " \n", - " Args:\n", - " embed_dim: Embedding dimension\n", - " num_heads: Number of attention heads\n", - " hidden_dim: Feed-forward hidden dimension (typically 4 * embed_dim)\n", - " dropout: Dropout rate for regularization\n", - " pre_norm: Whether to use pre-normalization (recommended)\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " self.embed_dim = embed_dim\n", - " self.num_heads = num_heads\n", - " self.hidden_dim = hidden_dim\n", - " self.dropout = dropout\n", - " self.pre_norm = pre_norm\n", - " \n", - " # Multi-head self-attention\n", - " self.attention = MultiHeadAttention(embed_dim=embed_dim, num_heads=num_heads, dropout=dropout)\n", - " \n", - " # Layer normalization layers\n", - " self.norm1 = LayerNorm(embed_dim) # For attention\n", - " self.norm2 = LayerNorm(embed_dim) # For feed-forward\n", - " \n", - " # Position-wise feed-forward network\n", - " self.ffn = PositionwiseFeedForward(embed_dim=embed_dim, hidden_dim=hidden_dim, dropout=dropout)\n", - " \n", - " # Collect all parameters from sub-components\n", - " self.parameters = []\n", - " if hasattr(self.attention, 'parameters'):\n", - " self.parameters.extend(self.attention.parameters)\n", - " self.parameters.extend(self.norm1.parameters)\n", - " self.parameters.extend(self.norm2.parameters)\n", - " self.parameters.extend(self.ffn.parameters)\n", - " ### END SOLUTION\n", - " \n", - " def forward(self, x: Tensor, mask: Optional[Tensor] = None,\n", - " return_attention_weights: bool = False) -> Union[Tensor, Tuple[Tensor, Tensor]]:\n", - " \"\"\"\n", - " Process input through complete transformer block.\n", - " \n", - " TODO: Implement transformer block forward pass.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION (Pre-norm):\n", - " 1. 
Self-attention with residual: x + attention(norm1(x))\n", - " 2. Feed-forward with residual: attn_out + ffn(norm2(attn_out))\n", - " 3. Return final output (and optionally attention weights)\n", - " \n", - " RESIDUAL CONNECTIONS:\n", - " Essential for training deep networks - allow gradients to flow directly\n", - " \n", - " Args:\n", - " x: Input tensor with shape (batch_size, seq_len, embed_dim)\n", - " mask: Optional attention mask\n", - " return_attention_weights: Whether to return attention weights\n", - " \n", - " Returns:\n", - " Transformer block output with same shape as input\n", - " Optionally also attention weights\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " if self.pre_norm:\n", - " # Pre-normalization: LayerNorm before attention/FFN\n", - " \n", - " # Self-attention with residual connection\n", - " norm1_x = self.norm1(x)\n", - " if return_attention_weights:\n", - " attn_output, attn_weights = self.attention.forward(\n", - " norm1_x, norm1_x, norm1_x, mask=mask, return_attention_weights=True\n", - " )\n", - " else:\n", - " attn_output = self.attention.forward(norm1_x, norm1_x, norm1_x, mask=mask)\n", - " \n", - " # Residual connection\n", - " x = Tensor(x.data + attn_output.data)\n", - " \n", - " # Feed-forward with residual connection\n", - " norm2_x = self.norm2(x)\n", - " ffn_output = self.ffn.forward(norm2_x)\n", - " \n", - " # Residual connection\n", - " output = Tensor(x.data + ffn_output.data)\n", - " \n", - " else:\n", - " # Post-normalization: LayerNorm after attention/FFN (original transformer)\n", - " \n", - " # Self-attention with residual connection\n", - " if return_attention_weights:\n", - " attn_output, attn_weights = self.attention.forward(\n", - " x, x, x, mask=mask, return_attention_weights=True\n", - " )\n", - " else:\n", - " attn_output = self.attention.forward(x, x, x, mask=mask)\n", - " \n", - " # Residual + LayerNorm\n", - " attn_residual = Tensor(x.data + attn_output.data)\n", - " norm1_output = 
self.norm1(attn_residual)\n", - " \n", - " # Feed-forward with residual connection\n", - " ffn_output = self.ffn.forward(norm1_output)\n", - " \n", - " # Residual + LayerNorm\n", - " ffn_residual = Tensor(norm1_output.data + ffn_output.data)\n", - " output = self.norm2(ffn_residual)\n", - " \n", - " if return_attention_weights:\n", - " return output, attn_weights\n", - " else:\n", - " return output\n", - " ### END SOLUTION\n", - " \n", - " def __call__(self, x: Tensor, mask: Optional[Tensor] = None,\n", - " return_attention_weights: bool = False) -> Union[Tensor, Tuple[Tensor, Tensor]]:\n", - " \"\"\"Make the class callable.\"\"\"\n", - " return self.forward(x, mask, return_attention_weights)\n", - " \n", - " def get_memory_usage(self) -> Dict[str, float]:\n", - " \"\"\"\n", - " Calculate memory usage of transformer block components.\n", - " \n", - " This function is PROVIDED to show memory analysis.\n", - " \"\"\"\n", - " # Get memory usage from components\n", - " if hasattr(self.attention, 'get_memory_usage'):\n", - " attention_memory = self.attention.get_memory_usage()['total_parameter_memory_mb']\n", - " else:\n", - " attention_memory = 0.0\n", - " \n", - " norm1_memory = self.norm1.get_memory_usage()['parameter_memory_mb']\n", - " norm2_memory = self.norm2.get_memory_usage()['parameter_memory_mb']\n", - " ffn_memory = self.ffn.get_memory_usage()['parameter_memory_mb']\n", - " \n", - " total_memory = attention_memory + norm1_memory + norm2_memory + ffn_memory\n", - " total_params = len(self.parameters) if hasattr(self, 'parameters') else 0\n", - " \n", - " return {\n", - " 'total_memory_mb': total_memory,\n", - " 'attention_memory_mb': attention_memory,\n", - " 'norm_memory_mb': norm1_memory + norm2_memory,\n", - " 'ffn_memory_mb': ffn_memory,\n", - " 'total_parameters': sum(p.data.size for p in self.parameters) if hasattr(self, 'parameters') else 0,\n", - " 'embed_dim': self.embed_dim,\n", - " 'num_heads': self.num_heads,\n", - " 'hidden_dim': 
self.hidden_dim,\n", - " 'pre_norm': self.pre_norm\n", - " }" - ] - }, - { - "cell_type": "markdown", - "id": "f786ca8b", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Test Your Transformer Block Implementation\n", - "\n", - "Once you implement the TransformerBlock methods above, run this cell to test it:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b5c44e59", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "test-transformer-block-immediate", - "locked": true, - "points": 20, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_unit_transformer_block():\n", - " \"\"\"Unit test for transformer block.\"\"\"\n", - " print(\"🔬 Unit Test: Transformer Block...\")\n", - " \n", - " # Test configuration\n", - " embed_dim = 256\n", - " num_heads = 8\n", - " hidden_dim = 1024\n", - " transformer_block = TransformerBlock(\n", - " embed_dim=embed_dim, \n", - " num_heads=num_heads, \n", - " hidden_dim=hidden_dim,\n", - " pre_norm=True\n", - " )\n", - " \n", - " # Verify initialization\n", - " assert transformer_block.embed_dim == embed_dim, \"Should store embedding dimension\"\n", - " assert transformer_block.num_heads == num_heads, \"Should store number of heads\"\n", - " assert transformer_block.hidden_dim == hidden_dim, \"Should store hidden dimension\"\n", - " assert transformer_block.pre_norm == True, \"Should store normalization type\"\n", - " \n", - " # Verify components exist\n", - " assert hasattr(transformer_block, 'attention'), \"Should have attention layer\"\n", - " assert hasattr(transformer_block, 'norm1'), \"Should have first norm layer\"\n", - " assert hasattr(transformer_block, 'norm2'), \"Should have second norm layer\"\n", - " assert hasattr(transformer_block, 'ffn'), \"Should have feed-forward network\"\n", - " \n", - " # Test forward pass\n", - " batch_size = 4\n", - " seq_len = 
16\n", - " x = Tensor(np.random.randn(batch_size, seq_len, embed_dim))\n", - " \n", - " output = transformer_block.forward(x)\n", - " expected_shape = (batch_size, seq_len, embed_dim)\n", - " assert output.shape == expected_shape, f\"Expected shape {expected_shape}, got {output.shape}\"\n", - " \n", - " # Test with attention weights return\n", - " output_with_attn, attn_weights = transformer_block.forward(x, return_attention_weights=True)\n", - " \n", - " assert output_with_attn.shape == expected_shape, \"Output with attention should have correct shape\"\n", - " expected_attn_shape = (batch_size, num_heads, seq_len, seq_len)\n", - " assert attn_weights.shape == expected_attn_shape, f\"Expected attention shape {expected_attn_shape}, got {attn_weights.shape}\"\n", - " \n", - " # Test with causal mask\n", - " causal_mask = np.triu(np.ones((seq_len, seq_len)), k=1)\n", - " causal_mask = 1 - causal_mask # Convert to attention mask\n", - " \n", - " masked_output, masked_attn = transformer_block.forward(\n", - " x, mask=Tensor(causal_mask), return_attention_weights=True\n", - " )\n", - " \n", - " assert masked_output.shape == expected_shape, \"Masked output should have correct shape\"\n", - " \n", - " # Verify causal masking works\n", - " for head in range(num_heads):\n", - " for i in range(seq_len):\n", - " for j in range(i+1, seq_len):\n", - " assert np.all(masked_attn.data[:, head, i, j] < 1e-5), \\\n", - " f\"Position ({i},{j}) should be masked in head {head}\"\n", - " \n", - " # Test residual connections by checking that output is different from pure attention\n", - " # If we zero out the input, residual connections should preserve some information\n", - " zero_input = Tensor(np.zeros((batch_size, seq_len, embed_dim)))\n", - " zero_output = transformer_block.forward(zero_input)\n", - " \n", - " # Output should not be exactly zero due to biases and layer norm parameters\n", - " assert not np.allclose(zero_output.data, 0), \"Residual connections should prevent zero 
output\"\n", - " \n", - " # Test post-normalization variant\n", - " post_norm_block = TransformerBlock(\n", - " embed_dim=embed_dim, \n", - " num_heads=num_heads, \n", - " hidden_dim=hidden_dim,\n", - " pre_norm=False\n", - " )\n", - " \n", - " post_norm_output = post_norm_block.forward(x)\n", - " assert post_norm_output.shape == expected_shape, \"Post-norm should produce correct shape\"\n", - " \n", - " # Pre-norm and post-norm should produce different outputs\n", - " pre_norm_output = transformer_block.forward(x)\n", - " assert not np.allclose(pre_norm_output.data, post_norm_output.data), \\\n", - " \"Pre-norm and post-norm should produce different outputs\"\n", - " \n", - " # Test callable interface\n", - " output_callable = transformer_block(x)\n", - " assert np.allclose(output_callable.data, output.data), \"Callable interface should work\"\n", - " \n", - " # Test different configurations\n", - " for test_heads in [4, 16]:\n", - " if embed_dim % test_heads == 0:\n", - " test_block = TransformerBlock(embed_dim=embed_dim, num_heads=test_heads, hidden_dim=hidden_dim)\n", - " test_output = test_block.forward(x)\n", - " assert test_output.shape == expected_shape, f\"Should work with {test_heads} heads\"\n", - " \n", - " # Test memory usage calculation\n", - " memory_stats = transformer_block.get_memory_usage()\n", - " assert 'total_memory_mb' in memory_stats, \"Should provide memory statistics\"\n", - " assert memory_stats['total_memory_mb'] > 0, \"Should have positive memory usage\"\n", - " assert memory_stats['total_parameters'] > 0, \"Should count parameters\"\n", - " \n", - " print(\"✅ Transformer block tests passed!\")\n", - " print(f\"✅ Pre-norm and post-norm architectures work correctly\")\n", - " print(f\"✅ Residual connections preserve information flow\")\n", - " print(f\"✅ Causal masking works across all attention heads\")\n", - " print(f\"✅ Total parameters: {memory_stats['total_parameters']:,}\")\n", - " print(f\"✅ Total memory: 
{memory_stats['total_memory_mb']:.2f}MB\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "d8c231b1", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Complete Transformer Model\n", - "\n", - "Finally, let's build a complete transformer model that can be used for language modeling tasks like text generation." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6364ce7e", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "transformer-model", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class Transformer:\n", - " \"\"\"\n", - " Complete transformer model for language processing.\n", - " \n", - " Stacks multiple transformer blocks with token embeddings and positional\n", - " encoding to create a complete language model architecture.\n", - " \"\"\"\n", - " \n", - " def __init__(self, vocab_size: int, embed_dim: int, num_heads: int, \n", - " num_layers: int, hidden_dim: int, max_seq_length: int = 1024,\n", - " dropout: float = 0.0, pre_norm: bool = True):\n", - " \"\"\"\n", - " Initialize complete transformer model.\n", - " \n", - " TODO: Implement transformer model initialization.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Store model configuration\n", - " 2. Create token embedding layer\n", - " 3. Create positional encoding\n", - " 4. Create stack of transformer blocks\n", - " 5. Create output projection layer (for language modeling)\n", - " 6. 
Set up parameter tracking from all components\n", - " \n", - " LANGUAGE MODELING HEAD:\n", - " Final linear layer that projects hidden states to vocabulary logits\n", - " \n", - " Args:\n", - " vocab_size: Size of vocabulary\n", - " embed_dim: Embedding dimension\n", - " num_heads: Number of attention heads per layer\n", - " num_layers: Number of transformer blocks\n", - " hidden_dim: Feed-forward hidden dimension\n", - " max_seq_length: Maximum sequence length for positional encoding\n", - " dropout: Dropout rate\n", - " pre_norm: Whether to use pre-normalization\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " self.vocab_size = vocab_size\n", - " self.embed_dim = embed_dim\n", - " self.num_heads = num_heads\n", - " self.num_layers = num_layers\n", - " self.hidden_dim = hidden_dim\n", - " self.max_seq_length = max_seq_length\n", - " self.dropout = dropout\n", - " self.pre_norm = pre_norm\n", - " \n", - " # Token embedding layer\n", - " self.token_embedding = Embedding(vocab_size=vocab_size, embedding_dim=embed_dim)\n", - " \n", - " # Positional encoding\n", - " self.pos_encoding = PositionalEncoding(embedding_dim=embed_dim, max_seq_length=max_seq_length)\n", - " \n", - " # Stack of transformer blocks\n", - " self.transformer_blocks = []\n", - " for _ in range(num_layers):\n", - " block = TransformerBlock(\n", - " embed_dim=embed_dim,\n", - " num_heads=num_heads,\n", - " hidden_dim=hidden_dim,\n", - " dropout=dropout,\n", - " pre_norm=pre_norm\n", - " )\n", - " self.transformer_blocks.append(block)\n", - " \n", - " # Final layer normalization (for pre-norm architecture)\n", - " if pre_norm:\n", - " self.final_norm = LayerNorm(embed_dim)\n", - " else:\n", - " self.final_norm = None\n", - " \n", - " # Language modeling head (projects to vocabulary)\n", - " xavier_bound = math.sqrt(6.0 / (embed_dim + vocab_size))\n", - " self.lm_head = Tensor(np.random.uniform(-xavier_bound, xavier_bound, (embed_dim, vocab_size)))\n", - " \n", - " # Collect all parameters\n", - " 
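The language modeling head above is initialized with Xavier/Glorot uniform. A small standalone sketch of that bound, using illustrative dimensions:

```python
import math
import numpy as np

embed_dim, vocab_size = 256, 1000

# Xavier/Glorot uniform bound: sqrt(6 / (fan_in + fan_out)),
# chosen to keep activation variance roughly constant across the layer.
bound = math.sqrt(6.0 / (embed_dim + vocab_size))
lm_head = np.random.uniform(-bound, bound, (embed_dim, vocab_size))

assert lm_head.shape == (embed_dim, vocab_size)
assert np.abs(lm_head).max() <= bound
```

Note the head dominates memory for large vocabularies: its `embed_dim * vocab_size` entries often rival several transformer blocks combined.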
self.parameters = []\n", - " if hasattr(self.token_embedding, 'parameters'):\n", - " self.parameters.extend(self.token_embedding.parameters)\n", - " \n", - " for block in self.transformer_blocks:\n", - " if hasattr(block, 'parameters'):\n", - " self.parameters.extend(block.parameters)\n", - " \n", - " if self.final_norm:\n", - " self.parameters.extend(self.final_norm.parameters)\n", - " \n", - " self.parameters.append(self.lm_head)\n", - " ### END SOLUTION\n", - " \n", - " def forward(self, input_ids: Tensor, mask: Optional[Tensor] = None,\n", - " return_attention_weights: bool = False) -> Union[Tensor, Tuple[Tensor, List[Tensor]]]:\n", - " \"\"\"\n", - " Process input through complete transformer model.\n", - " \n", - " TODO: Implement transformer model forward pass.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Convert token IDs to embeddings\n", - " 2. Add positional encoding\n", - " 3. Process through all transformer blocks\n", - " 4. Apply final normalization (if pre-norm)\n", - " 5. Apply language modeling head\n", - " 6. 
Return logits (and optionally attention weights)\n", - " \n", - " Args:\n", - " input_ids: Token indices with shape (batch_size, seq_len)\n", - " mask: Optional attention mask\n", - " return_attention_weights: Whether to return all attention weights\n", - " \n", - " Returns:\n", - " Logits with shape (batch_size, seq_len, vocab_size)\n", - " Optionally also list of attention weights from each layer\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Token embeddings\n", - " embeddings = self.token_embedding.forward(input_ids)\n", - " \n", - " # Add positional encoding\n", - " x = self.pos_encoding.forward(embeddings)\n", - " \n", - " # Process through transformer blocks\n", - " all_attention_weights = []\n", - " \n", - " for block in self.transformer_blocks:\n", - " if return_attention_weights:\n", - " x, attn_weights = block.forward(x, mask=mask, return_attention_weights=True)\n", - " all_attention_weights.append(attn_weights)\n", - " else:\n", - " x = block.forward(x, mask=mask)\n", - " \n", - " # Final layer normalization (for pre-norm)\n", - " if self.final_norm:\n", - " x = self.final_norm.forward(x)\n", - " \n", - " # Language modeling head\n", - " # x: (batch_size, seq_len, embed_dim)\n", - " # lm_head: (embed_dim, vocab_size)\n", - " # output: (batch_size, seq_len, vocab_size)\n", - " \n", - " batch_size, seq_len, embed_dim = x.shape\n", - " x_reshaped = x.data.reshape(-1, embed_dim) # (batch_size * seq_len, embed_dim)\n", - " logits_reshaped = np.matmul(x_reshaped, self.lm_head.data) # (batch_size * seq_len, vocab_size)\n", - " logits = logits_reshaped.reshape(batch_size, seq_len, self.vocab_size)\n", - " \n", - " if return_attention_weights:\n", - " return Tensor(logits), all_attention_weights\n", - " else:\n", - " return Tensor(logits)\n", - " ### END SOLUTION\n", - " \n", - " def __call__(self, input_ids: Tensor, mask: Optional[Tensor] = None,\n", - " return_attention_weights: bool = False) -> Union[Tensor, Tuple[Tensor, List[Tensor]]]:\n", - " 
\"\"\"Make the class callable.\"\"\"\n", - " return self.forward(input_ids, mask, return_attention_weights)\n", - " \n", - " def generate(self, input_ids: Tensor, max_new_tokens: int = 50, \n", - " temperature: float = 1.0) -> Tensor:\n", - " \"\"\"\n", - " Generate text autoregressively.\n", - " \n", - " This function is PROVIDED to show text generation capability.\n", - " \"\"\"\n", - " batch_size, current_seq_len = input_ids.shape\n", - " \n", - " if current_seq_len >= self.max_seq_length:\n", - " raise ValueError(f\"Input sequence length {current_seq_len} exceeds max {self.max_seq_length}\")\n", - " \n", - " generated_ids = input_ids.data.copy()\n", - " \n", - " for _ in range(max_new_tokens):\n", - " # Create causal mask\n", - " seq_len = generated_ids.shape[1]\n", - " causal_mask = np.triu(np.ones((seq_len, seq_len)), k=1)\n", - " causal_mask = 1 - causal_mask\n", - " \n", - " # Forward pass\n", - " logits = self.forward(Tensor(generated_ids), mask=Tensor(causal_mask))\n", - " \n", - " # Get logits for last position\n", - " last_logits = logits.data[:, -1, :] # (batch_size, vocab_size)\n", - " \n", - " # Apply temperature\n", - " last_logits = last_logits / temperature\n", - " \n", - " # Sample next token (using simple sampling)\n", - " # Convert to probabilities\n", - " exp_logits = np.exp(last_logits - np.max(last_logits, axis=-1, keepdims=True))\n", - " probs = exp_logits / np.sum(exp_logits, axis=-1, keepdims=True)\n", - " \n", - " # Sample from distribution\n", - " next_tokens = []\n", - " for i in range(batch_size):\n", - " next_token = np.random.choice(self.vocab_size, p=probs[i])\n", - " next_tokens.append(next_token)\n", - " \n", - " next_tokens = np.array(next_tokens).reshape(batch_size, 1)\n", - " \n", - " # Append to sequence\n", - " generated_ids = np.concatenate([generated_ids, next_tokens], axis=1)\n", - " \n", - " # Stop if we reach max sequence length\n", - " if generated_ids.shape[1] >= self.max_seq_length:\n", - " break\n", - " \n", - " 
return Tensor(generated_ids)\n", - " \n", - " def get_memory_usage(self) -> Dict[str, float]:\n", - " \"\"\"\n", - " Calculate memory usage of complete transformer model.\n", - " \n", - " This function is PROVIDED to show memory analysis.\n", - " \"\"\"\n", - " # Token embedding memory\n", - " if hasattr(self.token_embedding, 'get_memory_usage'):\n", - " embedding_memory = self.token_embedding.get_memory_usage()['total_memory_mb']\n", - " else:\n", - " embedding_memory = self.vocab_size * self.embed_dim * 4 / (1024 * 1024)\n", - " \n", - " # Transformer blocks memory\n", - " block_memory = 0\n", - " if self.transformer_blocks:\n", - " single_block_memory = self.transformer_blocks[0].get_memory_usage()['total_memory_mb']\n", - " block_memory = single_block_memory * self.num_layers\n", - " \n", - " # Final norm memory\n", - " final_norm_memory = 0\n", - " if self.final_norm:\n", - " final_norm_memory = self.final_norm.get_memory_usage()['parameter_memory_mb']\n", - " \n", - " # Language modeling head memory\n", - " lm_head_memory = self.lm_head.data.nbytes / (1024 * 1024)\n", - " \n", - " total_memory = embedding_memory + block_memory + final_norm_memory + lm_head_memory\n", - " total_params = sum(p.data.size for p in self.parameters) if hasattr(self, 'parameters') else 0\n", - " \n", - " return {\n", - " 'total_memory_mb': total_memory,\n", - " 'embedding_memory_mb': embedding_memory,\n", - " 'transformer_blocks_memory_mb': block_memory,\n", - " 'lm_head_memory_mb': lm_head_memory,\n", - " 'total_parameters': total_params,\n", - " 'vocab_size': self.vocab_size,\n", - " 'embed_dim': self.embed_dim,\n", - " 'num_layers': self.num_layers,\n", - " 'num_heads': self.num_heads,\n", - " 'hidden_dim': self.hidden_dim\n", - " }" - ] - }, - { - "cell_type": "markdown", - "id": "cba6bfc5", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Test Your Complete Transformer Implementation\n", - "\n", - "Once you implement the 
Transformer methods above, run this cell to test it:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "751b3b4c", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "test-transformer-model-immediate", - "locked": true, - "points": 25, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_unit_transformer_model():\n", - " \"\"\"Unit test for complete transformer model.\"\"\"\n", - " print(\"🔬 Unit Test: Complete Transformer Model...\")\n", - " \n", - " # Test configuration\n", - " vocab_size = 1000\n", - " embed_dim = 256\n", - " num_heads = 8\n", - " num_layers = 4\n", - " hidden_dim = 512\n", - " max_seq_length = 128\n", - " \n", - " transformer = Transformer(\n", - " vocab_size=vocab_size,\n", - " embed_dim=embed_dim,\n", - " num_heads=num_heads,\n", - " num_layers=num_layers,\n", - " hidden_dim=hidden_dim,\n", - " max_seq_length=max_seq_length,\n", - " pre_norm=True\n", - " )\n", - " \n", - " # Verify initialization\n", - " assert transformer.vocab_size == vocab_size, \"Should store vocabulary size\"\n", - " assert transformer.embed_dim == embed_dim, \"Should store embedding dimension\"\n", - " assert transformer.num_layers == num_layers, \"Should store number of layers\"\n", - " assert len(transformer.transformer_blocks) == num_layers, \"Should create correct number of blocks\"\n", - " \n", - " # Verify components exist\n", - " assert hasattr(transformer, 'token_embedding'), \"Should have token embedding\"\n", - " assert hasattr(transformer, 'pos_encoding'), \"Should have positional encoding\"\n", - " assert hasattr(transformer, 'lm_head'), \"Should have language modeling head\"\n", - " \n", - " # Test forward pass with token IDs\n", - " batch_size = 4\n", - " seq_len = 32\n", - " input_ids = np.random.randint(0, vocab_size, (batch_size, seq_len))\n", - " input_tensor = Tensor(input_ids)\n", - " \n", - " logits = 
transformer.forward(input_tensor)\n", - " expected_shape = (batch_size, seq_len, vocab_size)\n", - " assert logits.shape == expected_shape, f\"Expected shape {expected_shape}, got {logits.shape}\"\n", - " \n", - " # Test with attention weights return\n", - " logits_with_attn, all_attention_weights = transformer.forward(input_tensor, return_attention_weights=True)\n", - " \n", - " assert logits_with_attn.shape == expected_shape, \"Logits with attention should have correct shape\"\n", - " assert len(all_attention_weights) == num_layers, f\"Should return attention weights from {num_layers} layers\"\n", - " \n", - " for i, attn_weights in enumerate(all_attention_weights):\n", - " expected_attn_shape = (batch_size, num_heads, seq_len, seq_len)\n", - " assert attn_weights.shape == expected_attn_shape, \\\n", - " f\"Layer {i} attention should have shape {expected_attn_shape}, got {attn_weights.shape}\"\n", - " \n", - " # Test with causal mask\n", - " causal_mask = np.triu(np.ones((seq_len, seq_len)), k=1)\n", - " causal_mask = 1 - causal_mask # Convert to attention mask\n", - " \n", - " masked_logits, masked_attention = transformer.forward(\n", - " input_tensor, mask=Tensor(causal_mask), return_attention_weights=True\n", - " )\n", - " \n", - " assert masked_logits.shape == expected_shape, \"Masked logits should have correct shape\"\n", - " \n", - " # Verify causal masking propagates through all layers\n", - " for layer_idx, attn_weights in enumerate(masked_attention):\n", - " for head in range(num_heads):\n", - " for i in range(seq_len):\n", - " for j in range(i+1, seq_len):\n", - " assert np.all(attn_weights.data[:, head, i, j] < 1e-5), \\\n", - " f\"Layer {layer_idx}, head {head}: position ({i},{j}) should be masked\"\n", - " \n", - " # Test callable interface\n", - " logits_callable = transformer(input_tensor)\n", - " assert np.allclose(logits_callable.data, logits.data), \"Callable interface should work\"\n", - " \n", - " # Test text generation capability\n", - " 
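The causal mask used throughout these tests follows one construction pattern, shown here in isolation: mark strictly-upper-triangular ("future") entries, then flip so 1 means "may attend" and 0 means "blocked":

```python
import numpy as np

seq_len = 4
# k=1 puts ones strictly above the diagonal: the future positions.
future = np.triu(np.ones((seq_len, seq_len)), k=1)
mask = 1 - future  # 1 = attend, 0 = blocked (the convention in these tests)

# Row i may attend to positions 0..i only.
assert mask[0, 0] == 1 and mask[0, 1] == 0
assert np.array_equal(mask[2], np.array([1.0, 1.0, 1.0, 0.0]))
```

The test loops above then confirm this zero pattern survives softmax in every head of every layer (masked weights stay below `1e-5`).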
print(\" Testing text generation...\")\n", - " start_tokens = Tensor(np.random.randint(0, vocab_size, (2, 8))) # 2 sequences, 8 tokens each\n", - " generated = transformer.generate(start_tokens, max_new_tokens=10, temperature=1.0)\n", - " \n", - " expected_gen_shape = (2, 18) # 8 original + 10 new tokens\n", - " assert generated.shape == expected_gen_shape, f\"Generated shape should be {expected_gen_shape}, got {generated.shape}\"\n", - " \n", - " # Verify original tokens are preserved\n", - " assert np.array_equal(generated.data[:, :8], start_tokens.data), \"Original tokens should be preserved\"\n", - " \n", - " # Test different model configurations\n", - " small_transformer = Transformer(\n", - " vocab_size=500, embed_dim=128, num_heads=4, num_layers=2, hidden_dim=256\n", - " )\n", - " \n", - " small_input = Tensor(np.random.randint(0, 500, (2, 16)))\n", - " small_logits = small_transformer.forward(small_input)\n", - " expected_small_shape = (2, 16, 500)\n", - " assert small_logits.shape == expected_small_shape, \"Small transformer should work\"\n", - " \n", - " # Test pre-norm vs post-norm\n", - " post_norm_transformer = Transformer(\n", - " vocab_size=vocab_size, embed_dim=embed_dim, num_heads=num_heads,\n", - " num_layers=2, hidden_dim=hidden_dim, pre_norm=False\n", - " )\n", - " \n", - " post_norm_logits = post_norm_transformer.forward(input_tensor)\n", - " pre_norm_logits = Transformer(\n", - " vocab_size=vocab_size, embed_dim=embed_dim, num_heads=num_heads,\n", - " num_layers=2, hidden_dim=hidden_dim, pre_norm=True\n", - " ).forward(input_tensor)\n", - " \n", - " assert not np.allclose(post_norm_logits.data, pre_norm_logits.data), \\\n", - " \"Pre-norm and post-norm should produce different outputs\"\n", - " \n", - " # Test memory usage calculation\n", - " memory_stats = transformer.get_memory_usage()\n", - " assert 'total_memory_mb' in memory_stats, \"Should provide memory statistics\"\n", - " assert memory_stats['total_memory_mb'] > 0, \"Should have 
positive memory usage\"\n", - " assert memory_stats['total_parameters'] > 0, \"Should count parameters\"\n", - " \n", - " # Verify memory breakdown\n", - " assert memory_stats['embedding_memory_mb'] > 0, \"Should have embedding memory\"\n", - " assert memory_stats['transformer_blocks_memory_mb'] > 0, \"Should have transformer block memory\"\n", - " assert memory_stats['lm_head_memory_mb'] > 0, \"Should have language modeling head memory\"\n", - " \n", - " print(\"✅ Complete transformer model tests passed!\")\n", - " print(f\"✅ Forward pass produces correct logit shapes\")\n", - " print(f\"✅ Causal masking works across all {num_layers} layers\")\n", - " print(f\"✅ Text generation capability verified\")\n", - " print(f\"✅ Total parameters: {memory_stats['total_parameters']:,}\")\n", - " print(f\"✅ Total memory: {memory_stats['total_memory_mb']:.2f}MB\")\n", - " print(f\"✅ Pre-norm and post-norm architectures work correctly\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "fda9a7bd", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🎯 ML Systems: Performance Analysis & Transformer Scaling\n", - "\n", - "Now let's develop systems engineering skills by analyzing transformer performance and understanding how model depth and width affect memory usage and computational requirements.\n", - "\n", - "### **Learning Outcome**: *\"I understand how transformer architecture choices affect scalability, memory usage, and production deployment constraints\"*" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ff32bb95", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "transformer-profiler", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "import time\n", - "from typing import Dict, List # Needed for the Dict/List annotations below\n", - "\n", - "class TransformerProfiler:\n", - " \"\"\"\n", - " Performance profiling toolkit for 
transformer architectures.\n", - " \n", - " Helps ML engineers understand computational costs, memory scaling,\n", - " and architectural trade-offs in transformer-based models.\n", - " \"\"\"\n", - " \n", - " def __init__(self):\n", - " self.results = {}\n", - " \n", - " def measure_scaling_with_depth(self, base_config: Dict, layer_counts: List[int]) -> Dict:\n", - " \"\"\"\n", - " Measure how transformer performance scales with number of layers.\n", - " \n", - " TODO: Implement transformer depth scaling measurement.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Create transformers with different layer counts\n", - " 2. Measure memory usage and computation time for each\n", - " 3. Calculate scaling patterns (should be linear with depth)\n", - " 4. Analyze parameter growth and memory requirements\n", - " 5. Return comprehensive scaling analysis\n", - " \n", - " EXPECTED SCALING:\n", - " - Parameters: Linear with depth\n", - " - Memory: Linear with depth \n", - " - Computation: Linear with depth\n", - " - Quality: Generally improves with depth (to a point)\n", - " \n", - " Args:\n", - " base_config: Base transformer configuration\n", - " layer_counts: List of layer counts to test\n", - " \n", - " Returns:\n", - " Dictionary with scaling analysis results\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " scaling_results = {}\n", - " \n", - " # Test input\n", - " batch_size = 4\n", - " seq_len = 32\n", - " vocab_size = base_config['vocab_size']\n", - " test_input = Tensor(np.random.randint(0, vocab_size, (batch_size, seq_len)))\n", - " \n", - " for num_layers in layer_counts:\n", - " # Create transformer with this depth\n", - " transformer = Transformer(\n", - " vocab_size=base_config['vocab_size'],\n", - " embed_dim=base_config['embed_dim'],\n", - " num_heads=base_config['num_heads'],\n", - " num_layers=num_layers,\n", - " hidden_dim=base_config['hidden_dim'],\n", - " max_seq_length=base_config.get('max_seq_length', 128)\n", - " )\n", - " \n", - " # Measure 
memory usage\n", - " memory_stats = transformer.get_memory_usage()\n", - " \n", - " # Measure computation time\n", - " start_time = time.time()\n", - " logits = transformer.forward(test_input)\n", - " end_time = time.time()\n", - " \n", - " computation_time_ms = (end_time - start_time) * 1000\n", - " \n", - " # Calculate throughput\n", - " total_tokens = batch_size * seq_len\n", - " tokens_per_second = total_tokens / (end_time - start_time) if end_time > start_time else 0\n", - " \n", - " scaling_results[num_layers] = {\n", - " 'num_layers': num_layers,\n", - " 'total_parameters': memory_stats['total_parameters'],\n", - " 'total_memory_mb': memory_stats['total_memory_mb'],\n", - " 'computation_time_ms': computation_time_ms,\n", - " 'tokens_per_second': tokens_per_second,\n", - " 'memory_per_layer_mb': memory_stats['transformer_blocks_memory_mb'] / num_layers if num_layers > 0 else 0,\n", - " 'parameters_per_layer': (memory_stats['total_parameters'] - \n", - " base_config['vocab_size'] * base_config['embed_dim'] * 2) // num_layers if num_layers > 0 else 0\n", - " }\n", - " \n", - " return scaling_results\n", - " ### END SOLUTION\n", - " \n", - " def analyze_width_vs_depth_tradeoffs(self, base_params: int, configurations: List[Dict]) -> Dict:\n", - " \"\"\"\n", - " Compare different ways to allocate a fixed parameter budget.\n", - " \n", - " This function is PROVIDED to show parameter allocation analysis.\n", - " \"\"\"\n", - " print(f\"📊 WIDTH vs DEPTH TRADE-OFF ANALYSIS\")\n", - " print(f\"Target parameter budget: ~{base_params:,} parameters\")\n", - " print(\"=\" * 70)\n", - " \n", - " results = {}\n", - " \n", - " # Test input\n", - " batch_size = 4\n", - " seq_len = 32\n", - " test_input = Tensor(np.random.randint(0, 1000, (batch_size, seq_len)))\n", - " \n", - " print(f\"{'Config':<15} {'Layers':<7} {'Embed':<6} {'Heads':<6} {'Hidden':<7} {'Params':<12} {'Time (ms)':<10} {'Memory'}\")\n", - " print(\"-\" * 80)\n", - " \n", - " for i, config in 
enumerate(configurations):\n", - " config_name = f\"Config_{i+1}\" # Named before the try block so the except handler can report failures\n", - " try:\n", - " # Create transformer\n", - " transformer = Transformer(\n", - " vocab_size=1000, # Fixed vocab size\n", - " embed_dim=config['embed_dim'],\n", - " num_heads=config['num_heads'],\n", - " num_layers=config['num_layers'],\n", - " hidden_dim=config['hidden_dim'],\n", - " max_seq_length=128\n", - " )\n", - " \n", - " # Get actual parameter count\n", - " memory_stats = transformer.get_memory_usage()\n", - " actual_params = memory_stats['total_parameters']\n", - " \n", - " # Measure performance\n", - " start_time = time.time()\n", - " logits = transformer.forward(test_input)\n", - " computation_time = (time.time() - start_time) * 1000\n", - " \n", - " results[config_name] = {\n", - " 'config': config,\n", - " 'actual_parameters': actual_params,\n", - " 'computation_time_ms': computation_time,\n", - " 'memory_mb': memory_stats['total_memory_mb'],\n", - " 'parameter_efficiency': abs(actual_params - base_params) / base_params\n", - " }\n", - " \n", - " print(f\"{config_name:<15} {config['num_layers']:<7} {config['embed_dim']:<6} \"\n", - " f\"{config['num_heads']:<6} {config['hidden_dim']:<7} {actual_params:<12,} \"\n", - " f\"{computation_time:<10.2f} {memory_stats['total_memory_mb']:.1f}MB\")\n", - " \n", - " except Exception as e:\n", - " print(f\"{config_name:<15} ERROR: {str(e)[:50]}\")\n", - " \n", - " # Analysis\n", - " print(f\"\\n💡 TRADE-OFF INSIGHTS:\")\n", - " print(f\" - Deeper models: Better at learning complex patterns, more sequential\")\n", - " print(f\" - Wider models: More parallelizable, can capture diverse features\")\n", - " print(f\" - More heads: Richer attention patterns, more computation\")\n", - " print(f\" - Hidden dimension: Affects FFN capacity, major parameter contributor\")\n", - " \n", - " return results\n", - " \n", - " def simulate_production_scaling(self, model_sizes: List[str]) -> Dict:\n", - " \"\"\"\n", - " Simulate memory and computation requirements 
for production model sizes.\n", - " \n", - " This function is PROVIDED to show production scaling analysis.\n", - " \"\"\"\n", - " print(f\"\\n🏭 PRODUCTION MODEL SCALING SIMULATION\")\n", - " print(\"=\" * 60)\n", - " \n", - " # Production model configurations (simplified)\n", - " size_configs = {\n", - " 'Small': {'vocab_size': 50000, 'embed_dim': 512, 'num_heads': 8, 'num_layers': 6, 'hidden_dim': 2048},\n", - " 'Medium': {'vocab_size': 50000, 'embed_dim': 768, 'num_heads': 12, 'num_layers': 12, 'hidden_dim': 3072},\n", - " 'Large': {'vocab_size': 50000, 'embed_dim': 1024, 'num_heads': 16, 'num_layers': 24, 'hidden_dim': 4096},\n", - " 'XL': {'vocab_size': 50000, 'embed_dim': 1280, 'num_heads': 20, 'num_layers': 36, 'hidden_dim': 5120}\n", - " }\n", - " \n", - " results = {}\n", - " \n", - " print(f\"{'Model Size':<12} {'Parameters':<12} {'Memory (GB)':<12} {'Training GPU':<12} {'Inference'}\")\n", - " print(\"-\" * 70)\n", - " \n", - " for size in model_sizes:\n", - " if size not in size_configs:\n", - " continue\n", - " \n", - " config = size_configs[size]\n", - " \n", - " # Estimate parameters\n", - " # Embedding: vocab_size * embed_dim * 2 (input + output)\n", - " embedding_params = config['vocab_size'] * config['embed_dim'] * 2\n", - " \n", - " # Per layer: \n", - " # - Attention: 4 * embed_dim^2 (Q, K, V, O projections)\n", - " # - FFN: 2 * embed_dim * hidden_dim + embed_dim + hidden_dim (weights + biases)\n", - " # - LayerNorm: 2 * embed_dim * 2 (two norms per layer)\n", - " attention_params_per_layer = 4 * config['embed_dim'] ** 2\n", - " ffn_params_per_layer = 2 * config['embed_dim'] * config['hidden_dim'] + config['embed_dim'] + config['hidden_dim']\n", - " norm_params_per_layer = 4 * config['embed_dim']\n", - " \n", - " layer_params = attention_params_per_layer + ffn_params_per_layer + norm_params_per_layer\n", - " total_params = embedding_params + layer_params * config['num_layers']\n", - " \n", - " # Estimate memory (parameters + activations + 
gradients for training)\n", - " param_memory_gb = total_params * 4 / (1024**3) # 4 bytes per float32\n", - " \n", - " # Training memory: parameters + gradients + optimizer states + activations\n", - " training_memory_gb = param_memory_gb * 4 # Rough estimate (param + grad + 2x optimizer states)\n", - " \n", - " # Inference memory: just parameters + activations\n", - " inference_memory_gb = param_memory_gb * 1.5 # Parameters + activation memory\n", - " \n", - " # GPU requirements (very rough estimates)\n", - " if training_memory_gb < 24:\n", - " training_gpu = \"Single RTX 4090\"\n", - " elif training_memory_gb < 80:\n", - " training_gpu = \"Single A100\"\n", - " else:\n", - " training_gpu = \"Multi-GPU\"\n", - " \n", - " if inference_memory_gb < 12:\n", - " inference_req = \"RTX 4060 Ti\"\n", - " elif inference_memory_gb < 24:\n", - " inference_req = \"RTX 4090\"\n", - " else:\n", - " inference_req = \"A100+\"\n", - " \n", - " results[size] = {\n", - " 'config': config,\n", - " 'total_parameters': total_params,\n", - " 'training_memory_gb': training_memory_gb,\n", - " 'inference_memory_gb': inference_memory_gb,\n", - " 'training_gpu_req': training_gpu,\n", - " 'inference_gpu_req': inference_req\n", - " }\n", - " \n", - " print(f\"{size:<12} {total_params/1e6:.1f}M {training_memory_gb:.1f} {training_gpu:<12} {inference_req}\")\n", - " \n", - " print(f\"\\n📈 SCALING OBSERVATIONS:\")\n", - " print(f\" - Model size grows super-linearly with dimension increases\")\n", - " print(f\" - Memory requirements dominate deployment decisions\")\n", - " print(f\" - Training requires 3-4x more memory than inference\")\n", - " print(f\" - Multi-GPU becomes necessary for large models\")\n", - " \n", - " return results\n", - "\n", - "def analyze_transformer_system_design():\n", - " \"\"\"\n", - " Comprehensive analysis of transformer system design choices and trade-offs.\n", - " \n", - " This function is PROVIDED to show systems-level design thinking.\n", - " \"\"\"\n", - " 
print(\"🏗️ TRANSFORMER SYSTEM DESIGN ANALYSIS\")\n", - " print(\"=\" * 60)\n", - " \n", - " # Architecture decision analysis\n", - " design_choices = {\n", - " 'Layer Normalization': {\n", - " 'Pre-norm': {'stability': 'High', 'training': 'Easier', 'performance': 'Good'},\n", - " 'Post-norm': {'stability': 'Lower', 'training': 'Harder', 'performance': 'Potentially better'}\n", - " },\n", - " 'Attention Patterns': {\n", - " 'Full attention': {'complexity': 'O(N²)', 'quality': 'Best', 'scalability': 'Limited'},\n", - " 'Sparse attention': {'complexity': 'O(N√N)', 'quality': 'Good', 'scalability': 'Better'},\n", - " 'Linear attention': {'complexity': 'O(N)', 'quality': 'Reduced', 'scalability': 'Excellent'}\n", - " },\n", - " 'Feed-Forward Size': {\n", - " '2x embed_dim': {'parameters': 'Low', 'capacity': 'Limited', 'speed': 'Fast'},\n", - " '4x embed_dim': {'parameters': 'Standard', 'capacity': 'Good', 'speed': 'Medium'},\n", - " '8x embed_dim': {'parameters': 'High', 'capacity': 'High', 'speed': 'Slow'}\n", - " }\n", - " }\n", - " \n", - " print(\"🎯 ARCHITECTURAL DESIGN CHOICES:\")\n", - " for category, choices in design_choices.items():\n", - " print(f\"\\n{category}:\")\n", - " for choice, properties in choices.items():\n", - " prop_str = \", \".join([f\"{k}: {v}\" for k, v in properties.items()])\n", - " print(f\" - {choice}: {prop_str}\")\n", - " \n", - " # Memory scaling analysis\n", - " print(f\"\\n📊 MEMORY SCALING PATTERNS:\")\n", - " print(f\"Component breakdown for typical transformer:\")\n", - " print(f\" - Token embeddings: vocab_size × embed_dim parameters\")\n", - " print(f\" - Position encodings: 0 parameters (sinusoidal) or seq_len × embed_dim (learned)\")\n", - " print(f\" - Attention layers: 4 × embed_dim² parameters per layer\")\n", - " print(f\" - Feed-forward: 2 × embed_dim × hidden_dim parameters per layer\")\n", - " print(f\" - Layer normalization: 2 × embed_dim parameters per layer\")\n", - " print(f\" - Output projection: embed_dim × 
vocab_size parameters\")\n", - " \n", - " print(f\"\\n🔧 OPTIMIZATION STRATEGIES:\")\n", - " optimization_techniques = [\n", - " \"Gradient (activation) checkpointing: Recompute activations instead of storing them, trading compute for memory\",\n", - " \"Mixed precision training: Use FP16 for 2x memory reduction\",\n", - " \"Parameter sharing: Share weights across layers\",\n", - " \"Sparse attention: Reduce quadratic scaling\",\n", - " \"Tensor (model) parallelism: Split each layer's weight matrices across GPUs\",\n", - " \"Pipeline parallelism: Assign consecutive layers to different GPUs and stream micro-batches through them\"\n", - " ]\n", - " \n", - " for technique in optimization_techniques:\n", - " print(f\" - {technique}\")\n", - " \n", - " print(f\"\\n🎯 PRODUCTION DEPLOYMENT CONSIDERATIONS:\")\n", - " deployment_factors = [\n", - " \"Batch size: Larger batches improve GPU utilization but increase memory\",\n", - " \"Sequence length: Quadratic impact on attention memory\",\n", - " \"Model depth: Linear impact on memory and computation\",\n", - " \"Model width: Quadratic impact on attention parameters\",\n", - " \"Precision: FP32 vs FP16 vs INT8 trade-offs\",\n", - " \"Hardware: GPU memory and compute capabilities\",\n", - " \"Latency requirements: Real-time vs batch processing\",\n", - " \"Throughput requirements: Tokens per second targets\"\n", - " ]\n", - " \n", - " for factor in deployment_factors:\n", - " print(f\" - {factor}\")" - ] - }, - { - "cell_type": "markdown", - "id": "0050718c", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Test: Transformer Performance Analysis\n", - "\n", - "Let's test our transformer profiler with realistic scaling scenarios." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "45818c11", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "test-transformer-profiler", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_transformer_profiler():\n", - " \"\"\"Test transformer profiler with various scenarios.\"\"\"\n", - " print(\"🔬 Unit Test: Transformer Performance Profiler...\")\n", - " \n", - " profiler = TransformerProfiler()\n", - " \n", - " # Test depth scaling measurement\n", - " base_config = {\n", - " 'vocab_size': 500,\n", - " 'embed_dim': 128,\n", - " 'num_heads': 4,\n", - " 'hidden_dim': 256\n", - " }\n", - " \n", - " layer_counts = [1, 2, 4]\n", - " depth_results = profiler.measure_scaling_with_depth(base_config, layer_counts)\n", - " \n", - " # Verify depth scaling results\n", - " assert len(depth_results) == len(layer_counts), f\"Should test {len(layer_counts)} layer counts\"\n", - " \n", - " for num_layers in layer_counts:\n", - " assert num_layers in depth_results, f\"Should include results for {num_layers} layers\"\n", - " result = depth_results[num_layers]\n", - " \n", - " # Verify required metrics\n", - " required_keys = ['num_layers', 'total_parameters', 'total_memory_mb', \n", - " 'computation_time_ms', 'tokens_per_second']\n", - " for key in required_keys:\n", - " assert key in result, f\"Missing metric: {key} for {num_layers} layers\"\n", - " assert isinstance(result[key], (int, float)), f\"Invalid type for {key}\"\n", - " \n", - " # Verify reasonable values\n", - " assert result['num_layers'] == num_layers, \"Should store correct layer count\"\n", - " assert result['total_parameters'] > 0, \"Should have positive parameter count\"\n", - " assert result['total_memory_mb'] > 0, \"Should have positive memory usage\"\n", - " \n", - " # Test that parameters and memory scale roughly linearly with depth\n", - " if len(layer_counts) >= 
2:\n", - " shallow = depth_results[layer_counts[0]]\n", - " deep = depth_results[layer_counts[-1]]\n", - " \n", - " layer_ratio = deep['num_layers'] / shallow['num_layers']\n", - " param_ratio = deep['total_parameters'] / shallow['total_parameters']\n", - " memory_ratio = deep['total_memory_mb'] / shallow['total_memory_mb']\n", - " \n", - " # Allow some deviation due to fixed costs (embeddings, etc.)\n", - " assert 1.0 < param_ratio < layer_ratio * 2, f\"Parameters should scale roughly linearly with depth, got ratio {param_ratio:.2f}\"\n", - " assert 1.0 < memory_ratio < layer_ratio * 2, f\"Memory should scale roughly linearly with depth, got ratio {memory_ratio:.2f}\"\n", - " \n", - " print(\"✅ Depth scaling measurement test passed\")\n", - " \n", - " # Test width vs depth analysis\n", - " configurations = [\n", - " {'embed_dim': 128, 'num_heads': 4, 'num_layers': 4, 'hidden_dim': 256},\n", - " {'embed_dim': 256, 'num_heads': 8, 'num_layers': 2, 'hidden_dim': 512},\n", - " ]\n", - " \n", - " width_depth_results = profiler.analyze_width_vs_depth_tradeoffs(100000, configurations)\n", - " \n", - " # Verify width vs depth results\n", - " assert len(width_depth_results) > 0, \"Should analyze at least one configuration\"\n", - " \n", - " for config_name, result in width_depth_results.items():\n", - " assert 'config' in result, \"Should include configuration\"\n", - " assert 'actual_parameters' in result, \"Should count actual parameters\"\n", - " assert 'computation_time_ms' in result, \"Should measure computation time\"\n", - " assert result['actual_parameters'] > 0, \"Should have positive parameter count\"\n", - " \n", - " print(\"✅ Width vs depth analysis test passed\")\n", - " \n", - " # Test production scaling simulation\n", - " production_results = profiler.simulate_production_scaling(['Small', 'Medium'])\n", - " \n", - " # Verify production scaling results\n", - " for size, result in production_results.items():\n", - " assert 'config' in result, \"Should include model configuration\"\n", - " assert 
'total_parameters' in result, \"Should estimate total parameters\"\n", - " assert 'training_memory_gb' in result, \"Should estimate training memory\"\n", - " assert 'inference_memory_gb' in result, \"Should estimate inference memory\"\n", - " \n", - " # Verify reasonable scaling\n", - " assert result['total_parameters'] > 1e6, \"Should have millions of parameters\"\n", - " assert result['training_memory_gb'] > result['inference_memory_gb'], \"Training should require more memory\"\n", - " \n", - " print(\"✅ Production scaling simulation test passed\")\n", - " print(\"🎯 Transformer Profiler: All tests passed!\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "6abd8ab2", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Integration Testing: Complete Language Model Pipeline\n", - "\n", - "Let's test the complete pipeline from tokenization through transformer processing:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "dbf45be4", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "test-transformer-integration", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_complete_language_model_pipeline():\n", - " \"\"\"Test complete language model pipeline integration.\"\"\"\n", - " print(\"🧪 Integration Test: Complete Language Model Pipeline...\")\n", - " \n", - " # Create a small but complete language model\n", - " vocab_size = 1000\n", - " embed_dim = 256\n", - " num_heads = 8\n", - " num_layers = 4\n", - " hidden_dim = 512\n", - " max_seq_length = 64\n", - " \n", - " print(f\" Creating transformer with {num_layers} layers, {embed_dim} dimensions...\")\n", - " transformer = Transformer(\n", - " vocab_size=vocab_size,\n", - " embed_dim=embed_dim,\n", - " num_heads=num_heads,\n", - " num_layers=num_layers,\n", - " 
hidden_dim=hidden_dim,\n", - " max_seq_length=max_seq_length\n", - " )\n", - " \n", - " # Test 1: Basic text processing pipeline\n", - " print(\" Testing basic text processing pipeline...\")\n", - " batch_size = 4\n", - " seq_len = 32\n", - " \n", - " # Simulate tokenized input\n", - " input_ids = np.random.randint(0, vocab_size, (batch_size, seq_len))\n", - " input_tensor = Tensor(input_ids)\n", - " \n", - " # Forward pass\n", - " logits = transformer.forward(input_tensor)\n", - " expected_shape = (batch_size, seq_len, vocab_size)\n", - " assert logits.shape == expected_shape, f\"Expected {expected_shape}, got {logits.shape}\"\n", - " \n", - " # Test that logits are reasonable (not all zeros/inf/nan)\n", - " assert not np.all(logits.data == 0), \"Logits should not all be zero\"\n", - " assert not np.any(np.isinf(logits.data)), \"Logits should not contain inf\"\n", - " assert not np.any(np.isnan(logits.data)), \"Logits should not contain nan\"\n", - " \n", - " print(f\" Forward pass successful: {logits.shape}\")\n", - " \n", - " # Test 2: Language modeling with causal mask\n", - " print(\" Testing language modeling with causal attention...\")\n", - " causal_mask = np.triu(np.ones((seq_len, seq_len)), k=1)\n", - " causal_mask = 1 - causal_mask # Convert to attention mask\n", - " \n", - " masked_logits, all_attention = transformer.forward(\n", - " input_tensor, mask=Tensor(causal_mask), return_attention_weights=True\n", - " )\n", - " \n", - " assert len(all_attention) == num_layers, f\"Should return attention from {num_layers} layers\"\n", - " \n", - " # Verify causal masking works across all layers\n", - " for layer_idx, attn_weights in enumerate(all_attention):\n", - " # Check a few positions to ensure masking works\n", - " for i in range(min(5, seq_len)):\n", - " for j in range(i+1, min(i+5, seq_len)):\n", - " future_attention = attn_weights.data[:, :, i, j] # All heads, all batches\n", - " assert np.all(future_attention < 1e-5), \\\n", - " f\"Layer {layer_idx}: 
future attention at ({i},{j}) should be ~0\"\n", - " \n", - " print(f\" Causal masking verified across all layers\")\n", - " \n", - " # Test 3: Text generation\n", - " print(\" Testing autoregressive text generation...\")\n", - " # Start with a shorter sequence for generation\n", - " gen_start = Tensor(np.random.randint(0, vocab_size, (2, 8)))\n", - " generated = transformer.generate(gen_start, max_new_tokens=8, temperature=1.0)\n", - " \n", - " expected_gen_shape = (2, 16) # 8 start + 8 generated\n", - " assert generated.shape == expected_gen_shape, f\"Expected {expected_gen_shape}, got {generated.shape}\"\n", - " \n", - " # Verify original tokens preserved\n", - " assert np.array_equal(generated.data[:, :8], gen_start.data), \"Should preserve original tokens\"\n", - " \n", - " # Verify new tokens are valid\n", - " new_tokens = generated.data[:, 8:]\n", - " assert np.all(new_tokens >= 0), \"Generated tokens should be >= 0\"\n", - " assert np.all(new_tokens < vocab_size), f\"Generated tokens should be < {vocab_size}\"\n", - " \n", - " print(f\" Generated {new_tokens.shape[1]} new tokens successfully\")\n", - " \n", - " # Test 4: Different sequence lengths\n", - " print(\" Testing variable sequence lengths...\")\n", - " for test_seq_len in [16, 32, 48]:\n", - " if test_seq_len > max_seq_length:\n", - " continue\n", - " \n", - " test_input = Tensor(np.random.randint(0, vocab_size, (2, test_seq_len)))\n", - " test_logits = transformer.forward(test_input)\n", - " \n", - " expected_test_shape = (2, test_seq_len, vocab_size)\n", - " assert test_logits.shape == expected_test_shape, f\"Failed for seq_len {test_seq_len}\"\n", - " \n", - " print(f\" Variable sequence lengths work correctly\")\n", - " \n", - " # Test 5: Memory usage analysis\n", - " print(\" Analyzing memory usage...\")\n", - " memory_stats = transformer.get_memory_usage()\n", - " \n", - " print(f\" Model parameters: {memory_stats['total_parameters']:,}\")\n", - " print(f\" Model memory: 
{memory_stats['total_memory_mb']:.1f}MB\")\n", - " print(f\" Embedding memory: {memory_stats['embedding_memory_mb']:.1f}MB\")\n", - " print(f\" Transformer blocks: {memory_stats['transformer_blocks_memory_mb']:.1f}MB\")\n", - " print(f\" LM head: {memory_stats['lm_head_memory_mb']:.1f}MB\")\n", - " \n", - " # Verify memory breakdown makes sense\n", - " component_memory = (memory_stats['embedding_memory_mb'] + \n", - " memory_stats['transformer_blocks_memory_mb'] + \n", - " memory_stats['lm_head_memory_mb'])\n", - " \n", - " # Allow small difference due to final norm layer\n", - " memory_diff = abs(memory_stats['total_memory_mb'] - component_memory)\n", - " assert memory_diff < 1.0, f\"Memory breakdown doesn't add up: {memory_diff:.2f}MB difference\"\n", - " \n", - " # Test 6: Performance characteristics\n", - " print(\" Testing performance characteristics...\")\n", - " \n", - " # Time multiple forward passes\n", - " num_iterations = 5\n", - " start_time = time.time()\n", - " \n", - " for _ in range(num_iterations):\n", - " _ = transformer.forward(input_tensor)\n", - " \n", - " total_time = time.time() - start_time\n", - " avg_time_per_forward = total_time / num_iterations\n", - " tokens_per_second = (batch_size * seq_len) / avg_time_per_forward\n", - " \n", - " print(f\" Average forward pass: {avg_time_per_forward*1000:.2f}ms\")\n", - " print(f\" Processing speed: {tokens_per_second:.0f} tokens/second\")\n", - " \n", - " # Verify reasonable performance\n", - " assert avg_time_per_forward < 1.0, \"Forward pass should be < 1 second\"\n", - " assert tokens_per_second > 50, \"Should process > 50 tokens/second\"\n", - " \n", - " # Test 7: Gradient flow (simulated)\n", - " print(\" Testing gradient flow through layers...\")\n", - " \n", - " # Create slightly different inputs to test sensitivity\n", - " input_1 = Tensor(input_ids.copy())\n", - " input_2 = Tensor(input_ids.copy())\n", - " input_2.data[0, 0] = (input_2.data[0, 0] + 1) % vocab_size # Change one token\n", - " 
\n", - " logits_1 = transformer.forward(input_1)\n", - " logits_2 = transformer.forward(input_2)\n", - " \n", - " # Outputs should be different (model is sensitive to input changes)\n", - " output_diff = np.mean(np.abs(logits_1.data - logits_2.data))\n", - " assert output_diff > 1e-6, f\"Model should be sensitive to input changes, diff: {output_diff}\"\n", - " \n", - " # But not too different (model should be stable)\n", - " assert output_diff < 100, f\"Model should be stable, large diff: {output_diff}\"\n", - " \n", - " print(f\" Model shows appropriate sensitivity to input changes\")\n", - " \n", - " print(\"✅ Complete language model pipeline integration test passed!\")\n", - " print(f\"✅ Forward pass, masking, generation, and performance verified\")\n", - " print(f\"✅ Model processes {tokens_per_second:.0f} tokens/second\")\n", - " print(f\"✅ Memory footprint: {memory_stats['total_memory_mb']:.1f}MB\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "bd6e7970", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## Main Execution Block\n", - "\n", - "All transformer tests and demonstrations are run from here when the module is executed directly:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c6f54ff9", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "transformers-main", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "if __name__ == \"__main__\":\n", - " # Run all unit tests\n", - " test_unit_layer_norm()\n", - " test_unit_feed_forward()\n", - " test_unit_transformer_block()\n", - " test_unit_transformer_model()\n", - " test_transformer_profiler()\n", - " test_complete_language_model_pipeline()\n", - " \n", - " print(\"\\n\" + \"=\"*60)\n", - " print(\"🔍 TRANSFORMER SYSTEMS ANALYSIS\")\n", - " print(\"=\"*60)\n", - " \n", - " # Performance analysis\n", - " profiler = 
TransformerProfiler()\n", - " \n", - " # Test transformer scaling with different depths\n", - " print(\"📈 TRANSFORMER DEPTH SCALING ANALYSIS:\")\n", - " base_config = {\n", - " 'vocab_size': 1000,\n", - " 'embed_dim': 256,\n", - " 'num_heads': 8,\n", - " 'hidden_dim': 1024\n", - " }\n", - " \n", - " layer_counts = [2, 4, 8, 12]\n", - " depth_results = profiler.measure_scaling_with_depth(base_config, layer_counts)\n", - " \n", - " # Analyze scaling patterns\n", - " print(f\"\\n{'Layers':<7} {'Parameters':<12} {'Memory (MB)':<12} {'Time (ms)':<10} {'Tokens/sec':<10}\")\n", - " print(\"-\" * 60)\n", - " \n", - " for num_layers in layer_counts:\n", - " result = depth_results[num_layers]\n", - " print(f\"{num_layers:<7} {result['total_parameters']:<12,} {result['total_memory_mb']:<12.1f} \"\n", - " f\"{result['computation_time_ms']:<10.2f} {result['tokens_per_second']:<10.0f}\")\n", - " \n", - " # Width vs depth trade-off analysis\n", - " print(\"\\n\" + \"=\"*60)\n", - " configurations = [\n", - " {'embed_dim': 256, 'num_heads': 8, 'num_layers': 8, 'hidden_dim': 1024}, # Deep & narrow\n", - " {'embed_dim': 512, 'num_heads': 16, 'num_layers': 4, 'hidden_dim': 2048}, # Wide & shallow\n", - " {'embed_dim': 384, 'num_heads': 12, 'num_layers': 6, 'hidden_dim': 1536}, # Balanced\n", - " ]\n", - " \n", - " width_depth_results = profiler.analyze_width_vs_depth_tradeoffs(2000000, configurations)\n", - " \n", - " # Production scaling simulation\n", - " print(\"\\n\" + \"=\"*60)\n", - " production_results = profiler.simulate_production_scaling(['Small', 'Medium', 'Large'])\n", - " \n", - " # Systems design analysis\n", - " print(\"\\n\" + \"=\"*60)\n", - " analyze_transformer_system_design()\n", - " \n", - " # Demonstrate realistic language model setup\n", - " print(\"\\n\" + \"=\"*60)\n", - " print(\"🏗️ REALISTIC LANGUAGE MODEL DEMONSTRATION\")\n", - " print(\"=\"*60)\n", - " \n", - " # Create a realistic small language model\n", - " vocab_size = 5000\n", - " embed_dim = 512\n", 
- " num_heads = 8\n", - " num_layers = 6\n", - " hidden_dim = 2048\n", - " max_seq_length = 256\n", - " \n", - " print(f\"Language model configuration:\")\n", - " print(f\" Vocabulary: {vocab_size:,} tokens\")\n", - " print(f\" Embedding dimension: {embed_dim}\")\n", - " print(f\" Attention heads: {num_heads}\")\n", - " print(f\" Transformer layers: {num_layers}\")\n", - " print(f\" Feed-forward dimension: {hidden_dim}\")\n", - " print(f\" Max sequence length: {max_seq_length}\")\n", - " \n", - " # Create the model\n", - " language_model = Transformer(\n", - " vocab_size=vocab_size,\n", - " embed_dim=embed_dim,\n", - " num_heads=num_heads,\n", - " num_layers=num_layers,\n", - " hidden_dim=hidden_dim,\n", - " max_seq_length=max_seq_length,\n", - " pre_norm=True\n", - " )\n", - " \n", - " # Analyze model characteristics\n", - " memory_stats = language_model.get_memory_usage()\n", - " \n", - " print(f\"\\nModel characteristics:\")\n", - " print(f\" Total parameters: {memory_stats['total_parameters']:,}\")\n", - " print(f\" Model size: {memory_stats['total_memory_mb']:.1f}MB\")\n", - " print(f\" Embedding table: {memory_stats['embedding_memory_mb']:.1f}MB ({memory_stats['embedding_memory_mb']/memory_stats['total_memory_mb']*100:.1f}%)\")\n", - " print(f\" Transformer layers: {memory_stats['transformer_blocks_memory_mb']:.1f}MB ({memory_stats['transformer_blocks_memory_mb']/memory_stats['total_memory_mb']*100:.1f}%)\")\n", - " print(f\" Output projection: {memory_stats['lm_head_memory_mb']:.1f}MB ({memory_stats['lm_head_memory_mb']/memory_stats['total_memory_mb']*100:.1f}%)\")\n", - " \n", - " # Performance simulation\n", - " batch_size = 8\n", - " seq_len = 128\n", - " test_input = Tensor(np.random.randint(0, vocab_size, (batch_size, seq_len)))\n", - " \n", - " start_time = time.time()\n", - " logits = language_model.forward(test_input)\n", - " forward_time = time.time() - start_time\n", - " \n", - " tokens_per_second = (batch_size * seq_len) / forward_time\n", - " 
\n", - " print(f\"\\nPerformance simulation:\")\n", - " print(f\" Batch size: {batch_size}, Sequence length: {seq_len}\")\n", - " print(f\" Forward pass time: {forward_time*1000:.2f}ms\")\n", - " print(f\" Throughput: {tokens_per_second:.0f} tokens/second\")\n", - " print(f\" Memory for batch: {logits.data.nbytes/(1024*1024):.1f}MB\")\n", - " \n", - " # Text generation example\n", - " print(f\"\\nText generation example:\")\n", - " start_sequence = Tensor(np.random.randint(0, vocab_size, (1, 10)))\n", - " generated = language_model.generate(start_sequence, max_new_tokens=20, temperature=0.8)\n", - " \n", - " print(f\" Input sequence: {start_sequence.data[0].tolist()}\")\n", - " print(f\" Generated tokens: {generated.data[0, 10:].tolist()}\")\n", - " print(f\" Generation completed successfully\")\n", - " \n", - " # Scaling predictions\n", - " print(f\"\\nScaling analysis:\")\n", - " current_params = memory_stats['total_parameters']\n", - " \n", - " # Estimate for different scales\n", - " scaling_factors = [2, 5, 10]\n", - " for factor in scaling_factors:\n", - " scaled_params = current_params * factor\n", - " scaled_memory_gb = memory_stats['total_memory_mb'] * factor / 1024\n", - " \n", - " print(f\" {factor}x scale: {scaled_params/1e6:.0f}M params, ~{scaled_memory_gb:.1f}GB memory\")\n", - " \n", - " print(\"\\n\" + \"=\"*60)\n", - " print(\"🎯 TRANSFORMERS MODULE COMPLETE!\")\n", - " print(\"=\"*60)\n", - " print(\"All transformer tests passed!\")\n", - " print(\"Complete language model architecture implemented!\")\n", - " print(\"Ready for production deployment and optimization!\")" - ] - }, - { - "cell_type": "markdown", - "id": "390254a0", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🤔 ML Systems Thinking: Interactive Questions\n", - "\n", - "Now that you've built complete transformer architectures, let's connect this work to broader ML systems challenges. 
These questions help you think critically about how transformer design choices affect production deployment and system performance.\n", - "\n", - "Take time to reflect thoughtfully on each question - your insights will help you understand how transformer architectures connect to real-world ML systems engineering." - ] - }, - { - "cell_type": "markdown", - "id": "709877be", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### Question 1: Transformer Architecture Optimization and Resource Allocation\n", - "\n", - "**Context**: Your transformer implementations demonstrate how layer depth, attention heads, and hidden dimensions affect model capacity and computational requirements. Production transformer systems must optimize these architectural choices within hardware constraints while maximizing model performance for specific tasks and deployment scenarios.\n", - "\n", - "**Reflection Question**: Design a transformer architecture optimization strategy for deploying language models across diverse production scenarios: real-time chat (low latency), document processing (high throughput), and mobile inference (resource-constrained). How would you allocate a fixed parameter budget across depth, width, and attention heads to optimize for each scenario, implement architecture search strategies that consider hardware constraints, and design adaptive model scaling that adjusts to available computational resources? 
Consider the challenges of maintaining consistent model quality while optimizing for different performance metrics and deployment environments.\n", - "\n", - "Think about: parameter budget allocation, architecture search strategies, hardware-aware optimization, and adaptive model scaling techniques.\n", - "\n", - "*Target length: 150-300 words*" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bf1aa9a6", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "question-1-architecture-optimization", - "locked": false, - "points": 10, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "\"\"\"\n", - "YOUR REFLECTION ON TRANSFORMER ARCHITECTURE OPTIMIZATION:\n", - "\n", - "TODO: Replace this text with your thoughtful response about transformer architecture optimization for diverse deployment scenarios.\n", - "\n", - "Consider addressing:\n", - "- How would you allocate parameter budgets across depth, width, and attention heads for different scenarios?\n", - "- What architecture search strategies would you use to optimize within hardware constraints?\n", - "- How would you implement adaptive model scaling that adjusts to available resources?\n", - "- What approaches would you use to maintain model quality across different deployment environments?\n", - "- How would you balance latency, throughput, and resource constraints in architectural decisions?\n", - "\n", - "Write a strategic analysis connecting your transformer implementations to real architecture optimization challenges.\n", - "\n", - "GRADING RUBRIC (Instructor Use):\n", - "- Demonstrates understanding of transformer architecture trade-offs and optimization (3 points)\n", - "- Designs practical approaches to parameter allocation and architecture search (3 points)\n", - "- Addresses adaptive scaling and hardware-aware optimization (2 points)\n", - "- Shows systems thinking about production deployment optimization (2 points)\n", - "- 
Clear strategic reasoning with architecture optimization insights (bonus points for innovative approaches)\n", - "\"\"\"\n", - "\n", - "### BEGIN SOLUTION\n", - "# Student response area - instructor will replace this section during grading setup\n", - "# This is a manually graded question requiring strategic analysis of transformer architecture optimization\n", - "# Students should demonstrate understanding of architecture design and production deployment challenges\n", - "### END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "32bb5968", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### Question 2: Transformer Training and Inference System Design\n", - "\n", - "**Context**: Your transformer implementation shows how layer normalization, residual connections, and feed-forward networks work together to enable training of deep models. Production transformer systems must optimize the training pipeline for efficiency while designing inference systems that handle diverse workloads with different latency and throughput requirements.\n", - "\n", - "**Reflection Question**: Architect a transformer training and inference system that efficiently trains models with billions of parameters while serving diverse inference workloads with millisecond latency requirements. How would you design distributed training strategies that handle memory constraints and communication bottlenecks, implement efficient inference serving that optimizes for both batch and real-time processing, and manage model deployment across heterogeneous hardware environments? 
Consider the challenges of maintaining numerical stability during distributed training while achieving consistent inference performance across different deployment targets.\n", - "\n", - "Think about: distributed training optimization, inference serving strategies, heterogeneous deployment, and training-inference consistency.\n", - "\n", - "*Target length: 150-300 words*" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c11dcf55", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "question-2-training-inference-systems", - "locked": false, - "points": 10, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "\"\"\"\n", - "YOUR REFLECTION ON TRANSFORMER TRAINING AND INFERENCE SYSTEM DESIGN:\n", - "\n", - "TODO: Replace this text with your thoughtful response about transformer training and inference system architecture.\n", - "\n", - "Consider addressing:\n", - "- How would you design distributed training for billion-parameter transformers with memory constraints?\n", - "- What strategies would you use for efficient inference serving with millisecond latency requirements?\n", - "- How would you manage model deployment across heterogeneous hardware environments?\n", - "- What approaches would you use to maintain numerical stability during distributed training?\n", - "- How would you ensure consistent inference performance across different deployment targets?\n", - "\n", - "Write a system design analysis connecting your transformer implementation to large-scale training and serving challenges.\n", - "\n", - "GRADING RUBRIC (Instructor Use):\n", - "- Shows understanding of distributed training and inference serving challenges (3 points)\n", - "- Designs practical approaches to memory management and latency optimization (3 points)\n", - "- Addresses heterogeneous deployment and numerical stability considerations (2 points)\n", - "- Demonstrates systems thinking about training-inference 
system coordination (2 points)\n", - "- Clear system design reasoning with scalability insights (bonus points for comprehensive system architecture)\n", - "\"\"\"\n", - "\n", - "### BEGIN SOLUTION\n", - "# Student response area - instructor will replace this section during grading setup\n", - "# This is a manually graded question requiring system design for transformer training and inference\n", - "# Students should demonstrate knowledge of distributed systems and production deployment architecture\n", - "### END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "3dab76f7", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### Question 3: Transformer Optimization and Production Deployment\n", - "\n", - "**Context**: Your complete transformer model demonstrates the integration of tokenization, embeddings, attention, and feed-forward components into a unified language processing system. Production transformer deployments must optimize the entire pipeline for efficiency while maintaining model quality and enabling continuous improvement through model updates and fine-tuning.\n", - "\n", - "**Reflection Question**: Design a production transformer deployment system that optimizes the complete language processing pipeline while enabling continuous model improvement and adaptation. How would you implement end-to-end optimization that spans from tokenization through generation, design efficient model serving infrastructure that handles dynamic batching and request routing, and enable seamless model updates without service interruption? 
Consider the challenges of optimizing the entire pipeline holistically while maintaining modularity for individual component improvements and supporting diverse model variants and fine-tuned versions.\n", - "\n", - "Think about: end-to-end pipeline optimization, model serving infrastructure, continuous deployment strategies, and modular system design.\n", - "\n", - "*Target length: 150-300 words*" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e30dbecb", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "question-3-production-deployment", - "locked": false, - "points": 10, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "\"\"\"\n", - "YOUR REFLECTION ON TRANSFORMER OPTIMIZATION AND PRODUCTION DEPLOYMENT:\n", - "\n", - "TODO: Replace this text with your thoughtful response about transformer production deployment system design.\n", - "\n", - "Consider addressing:\n", - "- How would you implement end-to-end optimization spanning tokenization through generation?\n", - "- What strategies would you use for efficient model serving with dynamic batching and request routing?\n", - "- How would you enable seamless model updates without service interruption?\n", - "- What approaches would you use to maintain pipeline modularity while optimizing holistically?\n", - "- How would you support diverse model variants and fine-tuned versions in production?\n", - "\n", - "Write a deployment analysis connecting your transformer implementation to complete production system optimization.\n", - "\n", - "GRADING RUBRIC (Instructor Use):\n", - "- Understands end-to-end optimization and production deployment challenges (3 points)\n", - "- Designs practical approaches to model serving and continuous deployment (3 points)\n", - "- Addresses modularity and system integration considerations (2 points)\n", - "- Shows systems thinking about holistic pipeline optimization (2 points)\n", - "- Clear deployment 
reasoning with production optimization insights (bonus points for innovative system design)\n", - "\"\"\"\n", - "\n", - "### BEGIN SOLUTION\n", - "# Student response area - instructor will replace this section during grading setup\n", - "# This is a manually graded question requiring understanding of production transformer deployment optimization\n", - "# Students should demonstrate knowledge of end-to-end system design and continuous deployment strategies\n", - "### END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "5b61d666", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🎯 MODULE SUMMARY: Transformers\n", - "\n", - "Congratulations! You have successfully implemented complete transformer architectures that power modern language models:\n", - "\n", - "### ✅ What You Have Built\n", - "- **Layer Normalization**: Stable normalization for deep transformer training\n", - "- **Position-wise Feed-Forward**: Non-linear transformations applied to each sequence position\n", - "- **Transformer Blocks**: Complete transformer layers with attention, normalization, and residual connections\n", - "- **Complete Transformer**: Full language model with embeddings, multiple layers, and generation capability\n", - "- **Text Generation**: Autoregressive generation with proper causal masking\n", - "- **🆕 Performance Analysis**: Comprehensive scaling analysis and architectural optimization tools\n", - "- **🆕 Production Insights**: Understanding of real-world transformer deployment challenges\n", - "\n", - "### ✅ Key Learning Outcomes\n", - "- **Understanding**: How transformer blocks enable powerful sequence modeling through attention and feed-forward layers\n", - "- **Implementation**: Built complete transformer architectures with proper layer organization and residual connections\n", - "- **Systems Insight**: How transformer depth affects memory usage, training efficiency, and model capacity\n", - "- **Performance Engineering**: Measured and analyzed 
transformer scaling characteristics and optimization opportunities\n", - "- **Production Context**: Understanding transformer deployment challenges and architectural trade-offs\n", - "\n", - "### ✅ Technical Mastery\n", - "- **Layer Normalization**: Stabilizing deep network training with proper feature normalization\n", - "- **Residual Connections**: Enabling gradient flow through deep transformer architectures\n", - "- **Pre-norm vs Post-norm**: Understanding normalization placement effects on training stability\n", - "- **Parameter Scaling**: Understanding how transformer parameters scale with architectural choices\n", - "- **🆕 Generation Systems**: Autoregressive text generation with causal attention patterns\n", - "\n", - "### ✅ Professional Skills Developed\n", - "- **Systems Architecture**: Designing complete transformer systems for production scale\n", - "- **Memory Engineering**: Understanding transformer memory scaling and optimization techniques\n", - "- **Performance Analysis**: Measuring and improving transformer computation and memory efficiency\n", - "- **Integration Design**: Building complete language processing pipelines from tokenization to generation\n", - "\n", - "### ✅ Ready for Next Steps\n", - "Your transformer implementations provide the foundation for:\n", - "- **Advanced Language Models**: GPT, BERT, and other transformer-based architectures\n", - "- **Multi-modal Models**: Extending transformers to vision, audio, and other modalities\n", - "- **Production Optimization**: Memory optimization, distributed training, and efficient inference\n", - "- **🧠 AI Applications**: Real-world language processing applications and services\n", - "\n", - "### 🔗 Connection to Real ML Systems\n", - "Your implementations mirror production systems:\n", - "- **GPT Architecture**: Your transformer matches GPT's decoder-only architecture\n", - "- **BERT Components**: Layer normalization and attention mechanisms used in BERT\n", - "- **Production Optimization**: 
Understanding of memory scaling, batching, and generation optimization\n", - "- **Industry Applications**: Foundation for all modern language model deployments\n", - "\n", - "### 🎯 The Complete Language Model\n", - "You have built the architecture that transformed AI:\n", - "- **Before**: RNNs and CNNs limited by sequential processing and local dependencies\n", - "- **After**: Transformers enable parallel processing and global attention across entire sequences\n", - "\n", - "**Achievement Unlocked**: You now understand every component of modern language models from tokenization through generation!\n", - "\n", - "Your complete transformer implementation provides the foundation for understanding and building modern AI systems. You've mastered the architecture that powers ChatGPT, GPT-4, BERT, and countless other AI applications.\n", - "\n", - "From discrete tokens to continuous embeddings, from attention mechanisms to complete language generation - you've built the entire pipeline that enables machines to understand and generate human language.\n", - "\n", - "**🏆 Congratulations on completing the complete transformer architecture implementation!**" - ] - } - ], - "metadata": { - "jupytext": { - "main_language": "python" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/modules_old/13_transformers/transformers_dev.py b/modules_old/13_transformers/transformers_dev.py deleted file mode 100644 index 93107d1e..00000000 --- a/modules_old/13_transformers/transformers_dev.py +++ /dev/null @@ -1,2845 +0,0 @@ -# --- -# jupyter: -# jupytext: -# text_representation: -# extension: .py -# format_name: percent -# format_version: '1.3' -# jupytext_version: 1.17.1 -# --- - -# %% [markdown] -""" -# Transformers - Complete Transformer Architecture Implementation - -Welcome to the Transformers module! 
You'll implement complete transformer blocks with LayerNorm, residual connections, and feed-forward networks, building the architecture that powers modern language models like GPT and BERT. - -## Learning Goals -- Systems understanding: How transformer blocks scale memory and computation with model depth -- Core implementation skill: Build complete transformer architectures with proper normalization -- Pattern recognition: Understand how residual connections enable training of deep transformer models -- Framework connection: See how your implementations match production transformer systems -- Performance insight: Learn how transformer layer memory accumulation affects model deployment - -## Build -> Use -> Reflect -1. **Build**: LayerNorm, transformer blocks, and complete transformer models -2. **Use**: Process sequences through multi-layer transformer architectures -3. **Reflect**: How do transformer design choices affect scalability and training dynamics? - -## What You'll Achieve -By the end of this module, you'll have: -- Deep technical understanding of how transformer blocks enable powerful sequence modeling -- Practical capability to implement complete transformer architectures with proper layer organization -- Systems insight into how transformer depth affects memory usage and training efficiency -- Performance consideration of how layer normalization and residual connections affect convergence -- Connection to production systems like GPT's transformer blocks and their optimization techniques - -## Systems Reality Check -💡 **Production Context**: GPT-3 has 96 transformer layers, each with 12k-dimensional representations and complex memory management -⚡ **Performance Note**: Transformer layer memory accumulates linearly with depth - deep models require careful activation checkpointing -""" - -# %% nbgrader={"grade": false, "grade_id": "transformers-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} -#| default_exp
core.transformers - -#| export -import math -import numpy as np -import os -import sys -from typing import Union, List, Optional, Tuple, Dict - -# Clean development imports - no fake implementations, proper dependency management - -# Local development imports - clean dependency resolution -def _import_from_module_dev(module_name, class_names): - """Import classes from development module files during development.""" - module_path = os.path.join(os.path.dirname(__file__), '..', module_name) - sys.path.insert(0, module_path) - try: - if module_name == '01_tensor': - from tensor_dev import Tensor - return {'Tensor': Tensor} - elif module_name == '12_attention': - from attention_dev import ScaledDotProductAttention, MultiHeadAttention, KVCache - return { - 'ScaledDotProductAttention': ScaledDotProductAttention, - 'MultiHeadAttention': MultiHeadAttention, - 'KVCache': KVCache - } - elif module_name == '11_embeddings': - from embeddings_dev import Embedding, PositionalEncoding - return {'Embedding': Embedding, 'PositionalEncoding': PositionalEncoding} - else: - # Return empty dict if module not found - will use mocks below - return {} - except ImportError: - # Module file missing or unimportable - return empty dict so mocks are used below - return {} - finally: - sys.path.pop(0) - -# Import required classes - production style import management -if 'tinytorch' in sys.modules: - # Production: Import from installed package - from tinytorch.core.tensor import Tensor - from tinytorch.core.attention import ScaledDotProductAttention, MultiHeadAttention, KVCache - from tinytorch.core.embeddings import Embedding, PositionalEncoding -else: - # Development: Import from local modules - tensor_imports = _import_from_module_dev('01_tensor', ['Tensor']) - Tensor = tensor_imports['Tensor'] - - attention_imports = _import_from_module_dev('12_attention', - ['ScaledDotProductAttention', 'MultiHeadAttention', 'KVCache']) - if attention_imports: - ScaledDotProductAttention = attention_imports['ScaledDotProductAttention'] - MultiHeadAttention = attention_imports['MultiHeadAttention'] - KVCache = 
attention_imports['KVCache'] - else: - # Mock classes for standalone testing - class ScaledDotProductAttention: - def __init__(self, *args, **kwargs): pass - class MultiHeadAttention: - def __init__(self, *args, **kwargs): pass - class KVCache: - def __init__(self, *args, **kwargs): pass - - embedding_imports = _import_from_module_dev('11_embeddings', ['Embedding', 'PositionalEncoding']) - if embedding_imports: - Embedding = embedding_imports['Embedding'] - PositionalEncoding = embedding_imports['PositionalEncoding'] - else: - # Mock classes for standalone testing - class Embedding: - def __init__(self, *args, **kwargs): pass - class PositionalEncoding: - def __init__(self, *args, **kwargs): pass - -# %% nbgrader={"grade": false, "grade_id": "transformers-welcome", "locked": false, "schema_version": 3, "solution": false, "task": false} -print("🏗️ TinyTorch Transformers Module") -print(f"NumPy version: {np.__version__}") -print("Ready to build complete transformer architectures!") - -# %% [markdown] -""" -## 📦 Where This Code Lives in the Final Package - -**Learning Side:** You work in `modules/source/14_transformers/transformers_dev.py` -**Building Side:** Code exports to `tinytorch.core.transformers` - -```python -# Final package structure: -from tinytorch.core.transformers import LayerNorm, TransformerBlock, Transformer -from tinytorch.core.attention import MultiHeadAttention # Previous module -from tinytorch.core.embeddings import Embedding, PositionalEncoding # Foundation -``` - -**Why this matters:** -- **Learning:** Focused modules for deep understanding -- **Production:** Proper organization like PyTorch's transformer implementations -- **Consistency:** All transformer components live together in `core.transformers` -- **Integration:** Works seamlessly with attention, embeddings, and tokenization systems -""" - -# %% [markdown] -""" -## What are Transformers?
- -### The Architecture Revolution -Transformers revolutionized AI by replacing recurrent connections with attention mechanisms: - -**Traditional RNN/LSTM:** -``` -h₁ -> h₂ -> h₃ -> h₄ (Sequential processing) -``` - -**Transformer:** -``` -All positions attend to all positions simultaneously (Parallel processing) -``` - -### Transformer Block Components -Each transformer block contains: - -1. **Multi-Head Self-Attention**: Captures sequence relationships -2. **Layer Normalization**: Stabilizes training of deep networks -3. **Residual Connections**: Enables gradient flow through many layers -4. **Position-wise Feed-Forward**: Applies non-linear transformations - -### The Complete Architecture -``` -Input Embeddings + Positional Encoding - v -[Transformer Block] * N layers - v -Output Layer (Language Modeling Head) -``` - -### Systems Trade-offs -- **Layer depth**: More layers = more capacity, more memory -- **Attention heads**: More heads = richer representations, more computation -- **Feed-forward size**: Larger FFN = more parameters, better performance -- **Layer normalization**: Pre-norm vs post-norm affects training dynamics -""" - -# %% [markdown] -""" -## Layer Normalization Implementation - -Layer normalization is crucial for training stable transformers. Unlike batch normalization, it normalizes across the feature dimension for each sample independently. 
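To make the idea concrete, here is a minimal NumPy sketch of the computation (a simplified standalone version; the `LayerNorm` class implemented below adds learnable parameters stored as `Tensor` objects plus shape validation):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize over the last (feature) axis, independently per sample/position
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)      # biased variance, standard for LayerNorm
    x_hat = (x - mean) / np.sqrt(var + eps)  # eps guards against division by zero
    return gamma * x_hat + beta              # learnable scale and shift

np.random.seed(0)
x = np.random.randn(2, 4, 8)                 # (batch, seq, embed_dim)
out = layer_norm(x, np.ones(8), np.zeros(8))
print(np.allclose(out.mean(axis=-1), 0.0, atol=1e-6))  # True: ~zero mean per position
```

Note that, unlike batch normalization, no statistics are shared across the batch, so the same computation works for any batch size, including single-sample inference.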
-""" - -# %% [markdown] -""" -## TARGET Building Transformer Components - -### Transformer Architecture Overview - -Before implementing individual components, let's visualize how they fit together: - -``` -Transformer Architecture: - -+-----------------------------------------------------+ -| Input Tokens | -+-----------------+-----------------------------------+ - | -+-----------------v-----------------------------------+ -| Token Embeddings | -| + Positional Encoding | -+-----------------+-----------------------------------+ - | -+-----------------v-----------------------------------+ -| Layer 1 | -| +---------------------------------------------+ | -| | Multi-Head Attention | | -| | +-------+ +-------+ +-------+ | | -| | |Head 1 | |Head 2 | |Head n | -> Concat| | -| | +-------+ +-------+ +-------+ | | -| +---------------------------------------------+ | -| | | -| v | -| +-------------+ | -| +----| Add & Norm |<----+ | -| | +-------------+ | Residual | -| | | Connection | -| v | | -| +---------------------------------+ | | -| | Position-wise FFN | | | -| | Linear -> ReLU -> Linear | | | -| +---------------------------------+ | | -| | | | -| v | | -| +-------------+ | | -| | Add & Norm |<------+ | -| +-------------+ | -+-----------------+-----------------------------------+ - | - v - +-------------------------------------+ - | Layer 2, 3, ..., N | (Same structure) - +-------------------------------------+ - | - v - +-------------------------------------+ - | Output Projection | - | Linear(embed_dim, vocab_size) | - +-------------------------------------+ -``` - -### Memory Layout Visualization - -``` -Transformer Memory Organization: - -+-------------------------------------------------+ -| Model Parameters | -+-------------------------------------------------┤ -| Token Embeddings | vocab * embed_dim | <- 70% of parameters -| Position Encodings | max_seq * embed_dim | (for large vocab) -| N * Transformer Layers: | -| + Multi-Head Attn | 4 * embed_dim² | <- 25% of 
parameters -| + Feed-Forward | 2 * embed_dim * ffn_dim | (per layer) -| + Layer Norms | 2 * embed_dim | -| Output Projection | embed_dim * vocab_size | <- Same as embeddings -+-------------------------------------------------+ - -Activation Memory (Forward Pass): -+-------------------------------------------------+ -| Input: batch * seq_len * embed_dim | <- Base memory unit -| Attention Scores: batch * heads * seq * seq | <- O(seq²) scaling! -| Layer Outputs: N * batch * seq * embed_dim | <- Linear with depth -| Gradients: 2* parameter memory | <- Training overhead -+-------------------------------------------------+ - -For GPT-3 scale (175B parameters): -- Parameters: 700GB (fp32) / 350GB (fp16) -- Activations: ~10GB per batch (seq_len=2048) -- Total training memory: ~1TB per GPU! -``` -""" - -# %% nbgrader={"grade": false, "grade_id": "layer-norm", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class LayerNorm: - """ - Layer Normalization for transformers. - - Normalizes across the feature dimension (last axis) for each sample, - making training more stable and enabling deeper networks. - """ - - def __init__(self, normalized_shape: Union[int, Tuple[int]], eps: float = 1e-5): - """ - Initialize layer normalization with learnable parameters. - - Layer normalization is CRITICAL for stable transformer training - it normalizes - activations across feature dimensions, preventing internal covariate shift. - - TODO: Implement layer normalization initialization. - - APPROACH (3-Step LayerNorm Setup): - 1. Store normalization configuration because shape validation is essential - 2. Initialize learnable parameters because scale/shift enable model flexibility - 3. 
Set up optimization tracking because these parameters need gradient updates - - MATHEMATICAL FOUNDATION: - LayerNorm(x) = γ * (x - μ) / σ + β - - Where: - - μ = mean across feature dimensions - - σ = std across feature dimensions - - γ = learnable scale parameter (initialized to 1) - - β = learnable shift parameter (initialized to 0) - - EXAMPLE (LayerNorm Operation): - >>> ln = LayerNorm(512) # For 512-dim embeddings - >>> x = Tensor(np.random.randn(32, 100, 512)) # batch * seq * embed - >>> normalized = ln(x) - >>> print(f"Mean: {normalized.data.mean(axis=-1)[0,0]:.6f}") # ~0 - >>> print(f"Std: {normalized.data.std(axis=-1)[0,0]:.6f}") # ~1 - - HINTS (Critical Implementation Details): - - Validate normalized_shape to prevent runtime errors - - Initialize gamma=1, beta=0 for identity transform initially - - Use eps=1e-5 to prevent division by zero - - Track parameters for optimizer updates - - Args: - normalized_shape: Shape of features to normalize (e.g., embedding_dim) - eps: Small value for numerical stability - """ - ### BEGIN SOLUTION - # Input validation - if isinstance(normalized_shape, int): - if normalized_shape <= 0: - raise ValueError(f"normalized_shape must be positive, got {normalized_shape}") - self.normalized_shape = (normalized_shape,) - else: - if any(dim <= 0 for dim in normalized_shape): - raise ValueError(f"All dimensions in normalized_shape must be positive, got {normalized_shape}") - self.normalized_shape = tuple(normalized_shape) - - if eps <= 0: - raise ValueError(f"eps must be positive, got {eps}") - self.eps = eps - - # Initialize learnable parameters - # Gamma (scale): initialized to ones - # Beta (bias): initialized to zeros - self.gamma = Tensor(np.ones(self.normalized_shape)) - self.beta = Tensor(np.zeros(self.normalized_shape)) - - # Track parameters for optimization - self.parameters = [self.gamma, self.beta] - ### END SOLUTION - - def forward(self, x: Tensor) -> Tensor: - """ - Apply layer normalization to input tensor. 
- - TODO: Implement layer normalization forward pass. - - STEP-BY-STEP IMPLEMENTATION: - 1. Calculate mean across feature dimensions - 2. Calculate standard deviation across feature dimensions - 3. Normalize: (x - mean) / sqrt(variance + eps) - 4. Apply learnable scale and shift: gamma * normalized + beta - - NUMERICAL STABILITY: - - Add eps to variance before taking sqrt - - Use the biased (population) variance, np.var's default and standard for LayerNorm - - EXAMPLE: - layer_norm = LayerNorm(256) - x = Tensor(np.random.randn(32, 128, 256)) # (batch, seq, features) - normalized = layer_norm.forward(x) # Same shape as input - - Args: - x: Input tensor with shape (..., *normalized_shape) - - Returns: - Normalized tensor with same shape as input - """ - ### BEGIN SOLUTION - # Input validation - if len(x.shape) < len(self.normalized_shape): - raise ValueError( - f"Input has {len(x.shape)} dimensions, but normalized_shape " - f"requires at least {len(self.normalized_shape)} dimensions" - ) - - # Check that the last dimensions match normalized_shape - input_norm_shape = x.shape[-len(self.normalized_shape):] - if input_norm_shape != self.normalized_shape: - raise ValueError( - f"Input shape {input_norm_shape} doesn't match " - f"normalized_shape {self.normalized_shape}" - ) - - # Step 1: Determine which axes to normalize over (the last len(normalized_shape) axes) - input_ndim = len(x.shape) - norm_ndim = len(self.normalized_shape) - # We normalize over the last 'norm_ndim' dimensions - start_axis = input_ndim - norm_ndim - axes_to_normalize = tuple(range(start_axis, input_ndim)) - - # Step 2: Calculate statistics (mean and variance) - mean = np.mean(x.data, axis=axes_to_normalize, keepdims=True) - variance = np.var(x.data, axis=axes_to_normalize, keepdims=True) - - # Step 3: Normalize (subtract mean, divide by std) - std = np.sqrt(variance + self.eps) # Add eps for numerical stability - normalized_input = (x.data - mean) / std - - # Step 4: Apply learnable scale and shift parameters - scaled_output = 
self._apply_scale_and_shift(normalized_input, x.shape) - - return Tensor(scaled_output) - ### END SOLUTION - - def _prepare_parameter_for_broadcast(self, param: Tensor, input_shape: tuple) -> np.ndarray: - """ - Reshape parameter tensor to be broadcastable with input. - - This helper method makes the broadcasting logic clearer by separating - the complex reshape operation into a dedicated function. - - Args: - param: Parameter tensor (gamma or beta) - input_shape: Shape of the input tensor - - Returns: - Reshaped parameter array ready for broadcasting - """ - # Calculate how many batch dimensions we need to add - batch_dims = len(input_shape) - len(self.normalized_shape) - - # Create broadcast shape: [1, 1, ..., 1, *normalized_shape] - # The number of 1s equals the number of batch dimensions - broadcast_shape = [1] * batch_dims + list(self.normalized_shape) - - return param.data.reshape(broadcast_shape) - - def _apply_scale_and_shift(self, normalized: np.ndarray, input_shape: tuple) -> np.ndarray: - """ - Apply learnable gamma (scale) and beta (shift) parameters. - - This method handles the broadcasting logic for applying the learnable - parameters to the normalized input. - - Args: - normalized: Normalized input array - input_shape: Shape of the original input tensor - - Returns: - Scaled and shifted output array - """ - # Prepare parameters for broadcasting with the input - gamma_broadcast = self._prepare_parameter_for_broadcast(self.gamma, input_shape) - beta_broadcast = self._prepare_parameter_for_broadcast(self.beta, input_shape) - - # Apply transformation: gamma * normalized + beta - return gamma_broadcast * normalized + beta_broadcast - - def __call__(self, x: Tensor) -> Tensor: - """Make the class callable.""" - return self.forward(x) - - def get_memory_usage(self) -> Dict[str, float]: - """ - Calculate memory usage of layer normalization parameters. - - This function is PROVIDED to show memory analysis. 
- """ - # Parameter memory - param_memory_mb = sum(param.data.nbytes for param in self.parameters) / (1024 * 1024) - - return { - 'parameter_memory_mb': param_memory_mb, - 'total_parameters': sum(param.data.size for param in self.parameters), - 'normalized_shape': self.normalized_shape - } - -# %% [markdown] -""" -### TEST Test Your Layer Normalization Implementation - -Once you implement the LayerNorm methods above, run this cell to test it: -""" - -# %% nbgrader={"grade": true, "grade_id": "test-layer-norm-immediate", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false} -def test_unit_layer_norm(): - """Unit test for layer normalization.""" - print("🔬 Unit Test: Layer Normalization...") - - # Test 1: Basic functionality - embed_dim = 256 - layer_norm = LayerNorm(embed_dim) - - # Verify initialization - assert layer_norm.normalized_shape == (embed_dim,), "Should store normalized shape" - assert len(layer_norm.parameters) == 2, "Should have gamma and beta parameters" - assert layer_norm.gamma.shape == (embed_dim,), "Gamma should match normalized shape" - assert layer_norm.beta.shape == (embed_dim,), "Beta should match normalized shape" - - # Verify parameter initialization - assert np.allclose(layer_norm.gamma.data, 1.0), "Gamma should be initialized to ones" - assert np.allclose(layer_norm.beta.data, 0.0), "Beta should be initialized to zeros" - - # Test 2: Forward pass with 2D input - batch_size = 16 - x_2d = Tensor(np.random.randn(batch_size, embed_dim)) - output_2d = layer_norm.forward(x_2d) - - assert output_2d.shape == x_2d.shape, "Output shape should match input shape" - - # Test 3: Forward pass with 3D input (typical transformer use) - seq_length = 32 - x_3d = Tensor(np.random.randn(batch_size, seq_length, embed_dim)) - output_3d = layer_norm.forward(x_3d) - - assert output_3d.shape == x_3d.shape, "3D output shape should match input shape" - - # Test 4: Normalization properties - # For each sample, the normalized features 
should have ~zero mean and ~unit variance - for i in range(batch_size): - for j in range(seq_length): - sample_output = output_3d.data[i, j, :] - sample_mean = np.mean(sample_output) - sample_var = np.var(sample_output) - - assert abs(sample_mean) < 1e-4, f"Normalized mean should be ~0, got {sample_mean}" - assert abs(sample_var - 1.0) < 1e-4, f"Normalized variance should be ~1, got {sample_var}" - - # Test 5: Different normalized shapes - multi_dim_shape = (64, 4) # Multi-dimensional normalization - layer_norm_multi = LayerNorm(multi_dim_shape) - - x_multi = Tensor(np.random.randn(8, 32, 64, 4)) - output_multi = layer_norm_multi.forward(x_multi) - - assert output_multi.shape == x_multi.shape, "Multi-dim normalization should preserve shape" - - # Test 6: Callable interface - output_callable = layer_norm(x_3d) - assert np.allclose(output_callable.data, output_3d.data), "Callable interface should work" - - # Test 7: Numerical stability with extreme values - extreme_x = Tensor(np.ones((4, embed_dim)) * 1e6) # Very large values - extreme_output = layer_norm.forward(extreme_x) - - assert not np.any(np.isnan(extreme_output.data)), "Should handle extreme values without NaN" - assert not np.any(np.isinf(extreme_output.data)), "Should handle extreme values without inf" - - # Test 8: Memory usage calculation - memory_stats = layer_norm.get_memory_usage() - assert 'parameter_memory_mb' in memory_stats, "Should provide memory statistics" - assert memory_stats['total_parameters'] == 2 * embed_dim, "Should count gamma and beta parameters" - - print("PASS Layer normalization tests passed!") - print(f"PASS Properly normalizes across feature dimensions") - print(f"PASS Handles 2D and 3D inputs correctly") - print(f"PASS Maintains ~0 mean and ~1 variance after normalization") - print(f"PASS Parameter memory: {memory_stats['parameter_memory_mb']:.4f}MB") - -# Test function defined (called in main block) - -# %% [markdown] -""" -## Position-wise Feed-Forward Network - -Each 
transformer block contains a position-wise feed-forward network that applies the same transformation to each position independently. - -### Feed-Forward Network Architecture - -``` -Position-wise FFN Structure: - -Input: (batch, seq_len, embed_dim) - | - v -+-------------------------------------------+ -| Linear Layer 1 | -| embed_dim -> hidden_dim | <- Expansion -| W1: (embed_dim, hidden_dim) | (usually 4x) -| b1: (hidden_dim,) | -+-------------------------------------------+ - | - v -+-------------------------------------------+ -| ReLU | <- Nonlinearity -| max(0, x) | (makes it powerful) -+-------------------------------------------+ - | - v -+-------------------------------------------+ -| Linear Layer 2 | -| hidden_dim -> embed_dim | <- Compression -| W2: (hidden_dim, embed_dim) | (back to original) -| b2: (embed_dim,) | -+-------------------------------------------+ - | - v -Output: (batch, seq_len, embed_dim) -``` - -### Parameter Count Analysis - -``` -FFN Parameter Breakdown: - -For embed_dim=512, hidden_dim=2048: - -+----------------------------------------------+ -| W1: 512 * 2048 = 1,048,576 parameters | <- ~50% of FFN -| b1: 2048 parameters | -| W2: 2048 * 512 = 1,048,576 parameters | <- ~50% of FFN -| b2: 512 parameters | -+----------------------------------------------+ -| Total: 2,099,712 parameters | -| Memory (fp32): 8.4 MB | -+----------------------------------------------+ - -Scaling: Parameters ∝ embed_dim * hidden_dim -Typical ratio: hidden_dim = 4 * embed_dim --> FFN params ∝ 8 * embed_dim² -``` - -### Computational Pattern - -``` -FFN applies the same transformation to EVERY position independently: - -Position 0: [e0_0, e0_1, ..., e0_d] -> FFN -> [o0_0, o0_1, ..., o0_d] -Position 1: [e1_0, e1_1, ..., e1_d] -> FFN -> [o1_0, o1_1, ..., o1_d] - ... ... ... ... -Position N: [eN_0, eN_1, ..., eN_d] -> FFN -> [oN_0, oN_1, ..., oN_d] - -This is why it's called "position-wise" - each position gets the same treatment!
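- -Implementation note: because every position shares the same W1 and W2, the -(batch, seq_len, embed_dim) input can be flattened to (batch*seq_len, embed_dim), -sent through the two matrix multiplies once, and reshaped back to the original -shape. No per-position loop is required.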
-``` -""" - -# %% nbgrader={"grade": false, "grade_id": "feed-forward", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class PositionwiseFeedForward: - """ - Position-wise feed-forward network used in transformer blocks. - - Applies the same feed-forward network to each position in the sequence: - FFN(x) = max(0, xW₁ + b₁)W₂ + b₂ - """ - - def __init__(self, embed_dim: int, hidden_dim: int, dropout: float = 0.0): - """ - Initialize position-wise feed-forward network. - - TODO: Implement feed-forward network initialization. - - STEP-BY-STEP IMPLEMENTATION: - 1. Store network configuration - 2. Initialize weight matrices and bias vectors for two linear layers - 3. Set up parameter tracking for optimization - 4. Store dropout rate for training - - ARCHITECTURE: - - Input: (batch, seq_len, embed_dim) - - Linear 1: embed_dim -> hidden_dim - - ReLU activation - - Linear 2: hidden_dim -> embed_dim - - Output: (batch, seq_len, embed_dim) - - PARAMETER INITIALIZATION: - Use Xavier/Glorot initialization for stable training - - Args: - embed_dim: Embedding dimension (input and output size) - hidden_dim: Hidden layer dimension (typically 4 * embed_dim) - dropout: Dropout rate for regularization - """ - ### BEGIN SOLUTION - self.embed_dim = embed_dim - self.hidden_dim = hidden_dim - self.dropout = dropout - - # Initialize weights using Xavier initialization - # W1: embed_dim -> hidden_dim - xavier_bound_1 = math.sqrt(6.0 / (embed_dim + hidden_dim)) - self.w1 = Tensor(np.random.uniform(-xavier_bound_1, xavier_bound_1, (embed_dim, hidden_dim))) - self.b1 = Tensor(np.zeros(hidden_dim)) - - # W2: hidden_dim -> embed_dim - xavier_bound_2 = math.sqrt(6.0 / (hidden_dim + embed_dim)) - self.w2 = Tensor(np.random.uniform(-xavier_bound_2, xavier_bound_2, (hidden_dim, embed_dim))) - self.b2 = Tensor(np.zeros(embed_dim)) - - # Track parameters for optimization - self.parameters = [self.w1, self.b1, self.w2, self.b2] - ### END SOLUTION - - def 
forward(self, x: Tensor) -> Tensor: - """ - Apply position-wise feed-forward transformation. - - TODO: Implement feed-forward forward pass. - - STEP-BY-STEP IMPLEMENTATION: - 1. Apply first linear transformation: x @ W1 + b1 - 2. Apply ReLU activation: max(0, linear1) - 3. Apply second linear transformation: relu @ W2 + b2 - 4. Return result with same shape as input - - MATHEMATICAL FORMULATION: - hidden = ReLU(x @ W1 + b1) - output = hidden @ W2 + b2 - - Args: - x: Input tensor with shape (batch_size, seq_len, embed_dim) - - Returns: - Output tensor with shape (batch_size, seq_len, embed_dim) - """ - ### BEGIN SOLUTION - # Reshape input for matrix multiplication if needed - original_shape = x.shape - if len(x.shape) == 3: - batch_size, seq_len, embed_dim = x.shape - # Reshape to (batch_size * seq_len, embed_dim) for efficient computation - x_reshaped = x.data.reshape(-1, embed_dim) - else: - x_reshaped = x.data - - # First linear transformation: x @ W1 + b1 - hidden = np.matmul(x_reshaped, self.w1.data) + self.b1.data - - # ReLU activation - hidden_relu = np.maximum(0, hidden) - - # Second linear transformation: hidden @ W2 + b2 - output = np.matmul(hidden_relu, self.w2.data) + self.b2.data - - # Reshape back to original shape - if len(original_shape) == 3: - output = output.reshape(original_shape) - - return Tensor(output) - ### END SOLUTION - - def __call__(self, x: Tensor) -> Tensor: - """Make the class callable.""" - return self.forward(x) - - def get_memory_usage(self) -> Dict[str, float]: - """ - Calculate memory usage of feed-forward parameters. - - This function is PROVIDED to show memory analysis. 
- """ - # Parameter memory - param_memory_mb = sum(param.data.nbytes for param in self.parameters) / (1024 * 1024) - - # Calculate parameter counts - w1_params = self.embed_dim * self.hidden_dim - w2_params = self.hidden_dim * self.embed_dim - bias_params = self.hidden_dim + self.embed_dim - total_params = w1_params + w2_params + bias_params - - return { - 'parameter_memory_mb': param_memory_mb, - 'total_parameters': total_params, - 'w1_parameters': w1_params, - 'w2_parameters': w2_params, - 'bias_parameters': bias_params, - 'embed_dim': self.embed_dim, - 'hidden_dim': self.hidden_dim - } - -# %% [markdown] -""" -### TEST Test Your Feed-Forward Network Implementation - -Once you implement the PositionwiseFeedForward methods above, run this cell to test it: -""" - -# %% nbgrader={"grade": true, "grade_id": "test-feed-forward-immediate", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false} -def test_unit_feed_forward(): - """Unit test for position-wise feed-forward network.""" - print("🔬 Unit Test: Position-wise Feed-Forward Network...") - - # Test configuration - embed_dim = 256 - hidden_dim = 1024 # Typical 4x expansion - ffn = PositionwiseFeedForward(embed_dim=embed_dim, hidden_dim=hidden_dim) - - # Verify initialization - assert ffn.embed_dim == embed_dim, "Should store embedding dimension" - assert ffn.hidden_dim == hidden_dim, "Should store hidden dimension" - assert len(ffn.parameters) == 4, "Should have W1, b1, W2, b2 parameters" - - # Verify parameter shapes - assert ffn.w1.shape == (embed_dim, hidden_dim), f"W1 should be ({embed_dim}, {hidden_dim})" - assert ffn.b1.shape == (hidden_dim,), f"b1 should be ({hidden_dim},)" - assert ffn.w2.shape == (hidden_dim, embed_dim), f"W2 should be ({hidden_dim}, {embed_dim})" - assert ffn.b2.shape == (embed_dim,), f"b2 should be ({embed_dim},)" - - # Test forward pass with 3D input (typical transformer use) - batch_size = 8 - seq_len = 32 - x_3d = Tensor(np.random.randn(batch_size, 
seq_len, embed_dim)) - output_3d = ffn.forward(x_3d) - - expected_shape = (batch_size, seq_len, embed_dim) - assert output_3d.shape == expected_shape, f"Expected shape {expected_shape}, got {output_3d.shape}" - - # Test forward pass with 2D input - x_2d = Tensor(np.random.randn(batch_size, embed_dim)) - output_2d = ffn.forward(x_2d) - - expected_2d_shape = (batch_size, embed_dim) - assert output_2d.shape == expected_2d_shape, f"Expected 2D shape {expected_2d_shape}, got {output_2d.shape}" - - # Test that FFN is applied position-wise (same transformation at each position) - # Extract two positions from the sequence - pos_1_input = Tensor(x_3d.data[:, 0, :]) # First position - pos_2_input = Tensor(x_3d.data[:, 1, :]) # Second position - - pos_1_output = ffn.forward(pos_1_input) - pos_2_output = ffn.forward(pos_2_input) - - # Compare with full sequence output (with reasonable tolerance) - assert np.allclose(pos_1_output.data, output_3d.data[:, 0, :], atol=1e-6), "Position 0 should match individual processing" - assert np.allclose(pos_2_output.data, output_3d.data[:, 1, :], atol=1e-6), "Position 1 should match individual processing" - - # Test ReLU activation (some outputs should be zero for negative intermediate values) - # Create input that will definitely produce some negative values after first linear layer - negative_input = Tensor(-np.ones((4, embed_dim)) * 10) # Very negative input - negative_output = ffn.forward(negative_input) - - # Not all outputs should be negative (ReLU should clip some values) - assert not np.all(negative_output.data < 0), "ReLU should prevent all outputs from being negative" - - # Test callable interface - output_callable = ffn(x_3d) - assert np.allclose(output_callable.data, output_3d.data), "Callable interface should work" - - # Test different hidden dimensions - for test_hidden_dim in [512, 2048]: - test_ffn = PositionwiseFeedForward(embed_dim=embed_dim, hidden_dim=test_hidden_dim) - test_output = test_ffn.forward(x_3d) - assert 
test_output.shape == expected_shape, f"Should work with hidden_dim={test_hidden_dim}" - - # Test memory usage calculation - memory_stats = ffn.get_memory_usage() - assert 'parameter_memory_mb' in memory_stats, "Should provide memory statistics" - - # Verify parameter counts - expected_w1_params = embed_dim * hidden_dim - expected_w2_params = hidden_dim * embed_dim - expected_total = expected_w1_params + expected_w2_params + hidden_dim + embed_dim - - assert memory_stats['w1_parameters'] == expected_w1_params, "Should count W1 parameters correctly" - assert memory_stats['w2_parameters'] == expected_w2_params, "Should count W2 parameters correctly" - assert memory_stats['total_parameters'] == expected_total, "Should count total parameters correctly" - - print("PASS Position-wise feed-forward tests passed!") - print(f"PASS Handles 2D and 3D inputs correctly") - print(f"PASS Position-wise processing verified") - print(f"PASS ReLU activation working properly") - print(f"PASS Total parameters: {memory_stats['total_parameters']:,}") - print(f"PASS Parameter memory: {memory_stats['parameter_memory_mb']:.2f}MB") - -# Test function defined (called in main block) - -# %% [markdown] -""" -## Transformer Block Implementation - -Now let's build the complete transformer block that combines multi-head attention, layer normalization, and position-wise feed-forward networks with residual connections. -""" - -# %% nbgrader={"grade": false, "grade_id": "transformer-block", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class TransformerBlock: - """ - Complete transformer block with self-attention and feed-forward layers. - - Combines multi-head self-attention, layer normalization, residual connections, - and position-wise feed-forward networks into the standard transformer architecture. 
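- - DATAFLOW (pre_norm=True, the default): - x -> norm1 -> attention -> (+ x) -> norm2 -> ffn -> (+ residual) -> output - - DATAFLOW (pre_norm=False, original paper): - x -> attention -> (+ x) -> norm1 -> ffn -> (+ residual) -> norm2 -> output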
- - SUPPORTS KV CACHING (Module 19 integration): - - Forward method accepts optional past_key_value parameter for caching - - Returns new key-value pairs when caching is enabled - - Backward compatible: works with or without caching - """ - - def __init__(self, embed_dim: int, num_heads: int, hidden_dim: int, - dropout: float = 0.0, pre_norm: bool = True): - """ - Initialize transformer block with all components. - - TODO: Implement transformer block initialization. - - STEP-BY-STEP IMPLEMENTATION: - 1. Store block configuration - 2. Create multi-head attention layer - 3. Create two layer normalization layers (for attention and FFN) - 4. Create position-wise feed-forward network - 5. Set up parameter tracking from all sub-components - - ARCHITECTURE CHOICE: Pre-norm vs Post-norm - - Pre-norm: LayerNorm -> Attention -> Residual (more stable) - - Post-norm: Attention -> LayerNorm -> Residual (original paper) - - Args: - embed_dim: Embedding dimension - num_heads: Number of attention heads - hidden_dim: Feed-forward hidden dimension (typically 4 * embed_dim) - dropout: Dropout rate for regularization - pre_norm: Whether to use pre-normalization (recommended) - """ - ### BEGIN SOLUTION - self.embed_dim = embed_dim - self.num_heads = num_heads - self.hidden_dim = hidden_dim - self.dropout = dropout - self.pre_norm = pre_norm - - # Multi-head self-attention - self.attention = MultiHeadAttention(embed_dim=embed_dim, num_heads=num_heads) - - # Layer normalization layers - self.norm1 = LayerNorm(embed_dim) # For attention - self.norm2 = LayerNorm(embed_dim) # For feed-forward - - # Position-wise feed-forward network - self.ffn = PositionwiseFeedForward(embed_dim=embed_dim, hidden_dim=hidden_dim, dropout=dropout) - - # Collect all parameters from sub-components - self.parameters = [] - if hasattr(self.attention, 'parameters'): - self.parameters.extend(self.attention.parameters) - self.parameters.extend(self.norm1.parameters) - self.parameters.extend(self.norm2.parameters) - 
self.parameters.extend(self.ffn.parameters) - ### END SOLUTION - - def forward(self, x: Tensor, mask: Optional[Tensor] = None, - return_attention_weights: bool = False, past_key_value: Optional[Tuple[Tensor, Tensor]] = None) -> Union[Tensor, Tuple[Tensor, Tensor], Tuple[Tensor, Tuple[Tensor, Tensor]], Tuple[Tensor, Tensor, Tuple[Tensor, Tensor]]]: - """ - Process input through complete transformer block. - - TODO: Implement transformer block forward pass. - - STEP-BY-STEP IMPLEMENTATION (Pre-norm): - 1. Self-attention with residual: x + attention(norm1(x)) - 2. Feed-forward with residual: attn_out + ffn(norm2(attn_out)) - 3. Return final output (and optionally attention weights) - - RESIDUAL CONNECTIONS: - Essential for training deep networks - allow gradients to flow directly - - Args: - x: Input tensor with shape (batch_size, seq_len, embed_dim) - mask: Optional attention mask - return_attention_weights: Whether to return attention weights - past_key_value: Optional cached key-value pair from previous forward pass - - Returns: - Transformer block output with same shape as input - Optionally also attention weights - Optionally also new key-value pair for caching (if past_key_value provided) - """ - ### BEGIN SOLUTION - if self.pre_norm: - # Pre-normalization: LayerNorm before attention/FFN - - # Self-attention with residual connection - norm1_x = self.norm1(x) - - # Handle KV caching - try to pass past_key_value to attention if supported - if past_key_value is not None: - # Try to use KV caching - gracefully fall back if not supported - try: - if return_attention_weights: - attn_result = self.attention.forward( - norm1_x, norm1_x, norm1_x, mask=mask, return_attention_weights=True, past_key_value=past_key_value - ) - if len(attn_result) == 3: - # attention returned (output, weights, new_key_value) - attn_output, attn_weights, new_key_value = attn_result - else: - # fallback: attention doesn't support caching yet - attn_output, attn_weights = attn_result - 
new_key_value = None - else: - attn_result = self.attention.forward(norm1_x, norm1_x, norm1_x, mask=mask, past_key_value=past_key_value) - if isinstance(attn_result, tuple) and len(attn_result) == 2: - # attention returned (output, new_key_value) - attn_output, new_key_value = attn_result - else: - # fallback: attention doesn't support caching yet - attn_output = attn_result - new_key_value = None - except TypeError: - # Attention layer doesn't support past_key_value yet - fall back to standard behavior - if return_attention_weights: - attn_output, attn_weights = self.attention.forward( - norm1_x, norm1_x, norm1_x, mask=mask, return_attention_weights=True - ) - else: - attn_output = self.attention.forward(norm1_x, norm1_x, norm1_x, mask=mask) - new_key_value = None - else: - # Standard behavior (no caching) - if return_attention_weights: - attn_output, attn_weights = self.attention.forward( - norm1_x, norm1_x, norm1_x, mask=mask, return_attention_weights=True - ) - else: - attn_output = self.attention.forward(norm1_x, norm1_x, norm1_x, mask=mask) - new_key_value = None - - # Residual connection - x = Tensor(x.data + attn_output.data) - - # Feed-forward with residual connection - norm2_x = self.norm2(x) - ffn_output = self.ffn.forward(norm2_x) - - # Residual connection - output = Tensor(x.data + ffn_output.data) - - else: - # Post-normalization: LayerNorm after attention/FFN (original transformer) - - # Self-attention with residual connection - # Handle KV caching - try to pass past_key_value to attention if supported - if past_key_value is not None: - # Try to use KV caching - gracefully fall back if not supported - try: - if return_attention_weights: - attn_result = self.attention.forward( - x, x, x, mask=mask, return_attention_weights=True, past_key_value=past_key_value - ) - if len(attn_result) == 3: - # attention returned (output, weights, new_key_value) - attn_output, attn_weights, new_key_value = attn_result - else: - # fallback: attention doesn't support 
caching yet - attn_output, attn_weights = attn_result - new_key_value = None - else: - attn_result = self.attention.forward(x, x, x, mask=mask, past_key_value=past_key_value) - if isinstance(attn_result, tuple) and len(attn_result) == 2: - # attention returned (output, new_key_value) - attn_output, new_key_value = attn_result - else: - # fallback: attention doesn't support caching yet - attn_output = attn_result - new_key_value = None - except TypeError: - # Attention layer doesn't support past_key_value yet - fall back to standard behavior - if return_attention_weights: - attn_output, attn_weights = self.attention.forward( - x, x, x, mask=mask, return_attention_weights=True - ) - else: - attn_output = self.attention.forward(x, x, x, mask=mask) - new_key_value = None - else: - # Standard behavior (no caching) - if return_attention_weights: - attn_output, attn_weights = self.attention.forward( - x, x, x, mask=mask, return_attention_weights=True - ) - else: - attn_output = self.attention.forward(x, x, x, mask=mask) - new_key_value = None - - # Residual + LayerNorm - attn_residual = Tensor(x.data + attn_output.data) - norm1_output = self.norm1(attn_residual) - - # Feed-forward with residual connection - ffn_output = self.ffn.forward(norm1_output) - - # Residual + LayerNorm - ffn_residual = Tensor(norm1_output.data + ffn_output.data) - output = self.norm2(ffn_residual) - - # Return appropriate tuple based on what was requested - if past_key_value is not None: - # KV caching is enabled - if return_attention_weights: - return output, attn_weights, new_key_value - else: - return output, new_key_value - else: - # Standard behavior (backward compatible) - if return_attention_weights: - return output, attn_weights - else: - return output - ### END SOLUTION - - def __call__(self, x: Tensor, mask: Optional[Tensor] = None, - return_attention_weights: bool = False, past_key_value: Optional[Tuple[Tensor, Tensor]] = None) -> Union[Tensor, Tuple[Tensor, Tensor], Tuple[Tensor, 
Tuple[Tensor, Tensor]], Tuple[Tensor, Tensor, Tuple[Tensor, Tensor]]]: - """Make the class callable.""" - return self.forward(x, mask, return_attention_weights, past_key_value) - - def get_memory_usage(self) -> Dict[str, float]: - """ - Calculate memory usage of transformer block components. - - This function is PROVIDED to show memory analysis. - """ - # Get memory usage from components - if hasattr(self.attention, 'get_memory_usage'): - attention_memory = self.attention.get_memory_usage()['total_parameter_memory_mb'] - else: - attention_memory = 0.0 - - norm1_memory = self.norm1.get_memory_usage()['parameter_memory_mb'] - norm2_memory = self.norm2.get_memory_usage()['parameter_memory_mb'] - ffn_memory = self.ffn.get_memory_usage()['parameter_memory_mb'] - - total_memory = attention_memory + norm1_memory + norm2_memory + ffn_memory - total_params = len(self.parameters) if hasattr(self, 'parameters') else 0 - - return { - 'total_memory_mb': total_memory, - 'attention_memory_mb': attention_memory, - 'norm_memory_mb': norm1_memory + norm2_memory, - 'ffn_memory_mb': ffn_memory, - 'total_parameters': sum(p.data.size for p in self.parameters) if hasattr(self, 'parameters') else 0, - 'embed_dim': self.embed_dim, - 'num_heads': self.num_heads, - 'hidden_dim': self.hidden_dim, - 'pre_norm': self.pre_norm - } - -# %% [markdown] -""" -### TEST Test Your Transformer Block Implementation - -Once you implement the TransformerBlock methods above, run this cell to test it: -""" - -# %% nbgrader={"grade": true, "grade_id": "test-transformer-block-immediate", "locked": true, "points": 20, "schema_version": 3, "solution": false, "task": false} -def test_unit_transformer_block(): - """Unit test for transformer block.""" - print("🔬 Unit Test: Transformer Block...") - - # Test configuration - embed_dim = 256 - num_heads = 8 - hidden_dim = 1024 - transformer_block = TransformerBlock( - embed_dim=embed_dim, - num_heads=num_heads, - hidden_dim=hidden_dim, - pre_norm=True - ) - - # Verify 
initialization - assert transformer_block.embed_dim == embed_dim, "Should store embedding dimension" - assert transformer_block.num_heads == num_heads, "Should store number of heads" - assert transformer_block.hidden_dim == hidden_dim, "Should store hidden dimension" - assert transformer_block.pre_norm == True, "Should store normalization type" - - # Verify components exist - assert hasattr(transformer_block, 'attention'), "Should have attention layer" - assert hasattr(transformer_block, 'norm1'), "Should have first norm layer" - assert hasattr(transformer_block, 'norm2'), "Should have second norm layer" - assert hasattr(transformer_block, 'ffn'), "Should have feed-forward network" - - # Test forward pass - batch_size = 4 - seq_len = 16 - x = Tensor(np.random.randn(batch_size, seq_len, embed_dim)) - - output = transformer_block.forward(x) - expected_shape = (batch_size, seq_len, embed_dim) - assert output.shape == expected_shape, f"Expected shape {expected_shape}, got {output.shape}" - - # Test with attention weights return - output_with_attn, attn_weights = transformer_block.forward(x, return_attention_weights=True) - - assert output_with_attn.shape == expected_shape, "Output with attention should have correct shape" - expected_attn_shape = (batch_size, num_heads, seq_len, seq_len) - assert attn_weights.shape == expected_attn_shape, f"Expected attention shape {expected_attn_shape}, got {attn_weights.shape}" - - # Test with causal mask - causal_mask = np.triu(np.ones((seq_len, seq_len)), k=1) - causal_mask = 1 - causal_mask # Convert to attention mask - - masked_output, masked_attn = transformer_block.forward( - x, mask=Tensor(causal_mask), return_attention_weights=True - ) - - assert masked_output.shape == expected_shape, "Masked output should have correct shape" - - # Verify causal masking works - for head in range(num_heads): - for i in range(seq_len): - for j in range(i+1, seq_len): - assert np.all(masked_attn.data[:, head, i, j] < 1e-5), \ - f"Position 
({i},{j}) should be masked in head {head}" - - # Sanity check on an all-zero input: biases and layer-norm parameters may still - # produce nonzero activations, but the output magnitude should stay bounded - zero_input = Tensor(np.zeros((batch_size, seq_len, embed_dim))) - zero_output = transformer_block.forward(zero_input) - - # Output need not be exactly zero due to biases and layer norm parameters, - # but it should be small for zero input with proper normalization - output_magnitude = np.mean(np.abs(zero_output.data)) - assert output_magnitude < 10.0, f"Output magnitude {output_magnitude} is unexpectedly large for zero input" - - # Test post-normalization variant - post_norm_block = TransformerBlock( - embed_dim=embed_dim, - num_heads=num_heads, - hidden_dim=hidden_dim, - pre_norm=False - ) - - post_norm_output = post_norm_block.forward(x) - assert post_norm_output.shape == expected_shape, "Post-norm should produce correct shape" - - # Pre-norm and post-norm should produce different outputs - pre_norm_output = transformer_block.forward(x) - assert not np.allclose(pre_norm_output.data, post_norm_output.data), \ - "Pre-norm and post-norm should produce different outputs" - - # Test callable interface - output_callable = transformer_block(x) - assert np.allclose(output_callable.data, output.data), "Callable interface should work" - - # Test different configurations - for test_heads in [4, 16]: - if embed_dim % test_heads == 0: - test_block = TransformerBlock(embed_dim=embed_dim, num_heads=test_heads, hidden_dim=hidden_dim) - test_output = test_block.forward(x) - assert test_output.shape == expected_shape, f"Should work with {test_heads} heads" - - # Test memory usage calculation - memory_stats = transformer_block.get_memory_usage() - assert 'total_memory_mb' in memory_stats, "Should provide memory statistics" - assert memory_stats['total_memory_mb'] > 0, "Should have positive memory usage" - assert memory_stats['total_parameters'] > 0, "Should count
parameters" - - print("PASS Transformer block tests passed!") - print(f"PASS Pre-norm and post-norm architectures work correctly") - print(f"PASS Residual connections preserve information flow") - print(f"PASS Causal masking works across all attention heads") - print(f"PASS Total parameters: {memory_stats['total_parameters']:,}") - print(f"PASS Total memory: {memory_stats['total_memory_mb']:.2f}MB") - -# Test function defined (called in main block) - -# %% [markdown] -""" -## Complete Transformer Model - -Finally, let's build a complete transformer model that can be used for language modeling tasks like text generation. -""" - -# %% nbgrader={"grade": false, "grade_id": "transformer-model", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class Transformer: - """ - Complete transformer model for language processing. - - Stacks multiple transformer blocks with token embeddings and positional - encoding to create a complete language model architecture. - - SUPPORTS KV CACHING (Module 19 integration): - - Forward method accepts optional past_key_values parameter for caching - - Generate method supports use_cache parameter for efficient generation - - Returns new key-value pairs when caching is enabled - - Backward compatible: works with or without caching - """ - - def __init__(self, vocab_size: int, embed_dim: int, num_heads: int, - num_layers: int, hidden_dim: int, max_seq_length: int = 1024, - dropout: float = 0.0, pre_norm: bool = True): - """ - Initialize complete transformer model. - - TODO: Implement transformer model initialization. - - STEP-BY-STEP IMPLEMENTATION: - 1. Store model configuration - 2. Create token embedding layer - 3. Create positional encoding - 4. Create stack of transformer blocks - 5. Create output projection layer (for language modeling) - 6. 
Set up parameter tracking from all components - - LANGUAGE MODELING HEAD: - Final linear layer that projects hidden states to vocabulary logits - - Args: - vocab_size: Size of vocabulary - embed_dim: Embedding dimension - num_heads: Number of attention heads per layer - num_layers: Number of transformer blocks - hidden_dim: Feed-forward hidden dimension - max_seq_length: Maximum sequence length for positional encoding - dropout: Dropout rate - pre_norm: Whether to use pre-normalization - """ - ### BEGIN SOLUTION - self.vocab_size = vocab_size - self.embed_dim = embed_dim - self.num_heads = num_heads - self.num_layers = num_layers - self.hidden_dim = hidden_dim - self.max_seq_length = max_seq_length - self.dropout = dropout - self.pre_norm = pre_norm - - # Token embedding layer - self.token_embedding = Embedding(vocab_size=vocab_size, embedding_dim=embed_dim) - - # Positional encoding - self.pos_encoding = PositionalEncoding(embedding_dim=embed_dim, max_seq_length=max_seq_length) - - # Stack of transformer blocks - self.transformer_blocks = [] - for _ in range(num_layers): - block = TransformerBlock( - embed_dim=embed_dim, - num_heads=num_heads, - hidden_dim=hidden_dim, - dropout=dropout, - pre_norm=pre_norm - ) - self.transformer_blocks.append(block) - - # Final layer normalization (for pre-norm architecture) - if pre_norm: - self.final_norm = LayerNorm(embed_dim) - else: - self.final_norm = None - - # Language modeling head (projects to vocabulary) - xavier_bound = math.sqrt(6.0 / (embed_dim + vocab_size)) - self.lm_head = Tensor(np.random.uniform(-xavier_bound, xavier_bound, (embed_dim, vocab_size))) - - # Collect all parameters - self.parameters = [] - if hasattr(self.token_embedding, 'parameters'): - self.parameters.extend(self.token_embedding.parameters) - - for block in self.transformer_blocks: - if hasattr(block, 'parameters'): - self.parameters.extend(block.parameters) - - if self.final_norm: - self.parameters.extend(self.final_norm.parameters) - - 
self.parameters.append(self.lm_head) - ### END SOLUTION - - def forward(self, input_ids: Tensor, mask: Optional[Tensor] = None, - return_attention_weights: bool = False, past_key_values: Optional[List[Tuple[Tensor, Tensor]]] = None) -> Union[Tensor, Tuple[Tensor, List[Tensor]], Tuple[Tensor, List[Tuple[Tensor, Tensor]]], Tuple[Tensor, List[Tensor], List[Tuple[Tensor, Tensor]]]]: - """ - Process input through complete transformer model. - - TODO: Implement transformer model forward pass. - - STEP-BY-STEP IMPLEMENTATION: - 1. Convert token IDs to embeddings - 2. Add positional encoding - 3. Process through all transformer blocks - 4. Apply final normalization (if pre-norm) - 5. Apply language modeling head - 6. Return logits (and optionally attention weights) - - Args: - input_ids: Token indices with shape (batch_size, seq_len) - mask: Optional attention mask - return_attention_weights: Whether to return all attention weights - past_key_values: Optional list of cached key-value pairs from previous forward pass - - Returns: - Logits with shape (batch_size, seq_len, vocab_size) - Optionally also list of attention weights from each layer - Optionally also list of new key-value pairs for caching (if past_key_values provided) - """ - ### BEGIN SOLUTION - # Token embeddings - embeddings = self.token_embedding.forward(input_ids) - - # Add positional encoding - x = self.pos_encoding.forward(embeddings) - - # Process through transformer blocks - all_attention_weights = [] - new_key_values = [] - - for i, block in enumerate(self.transformer_blocks): - # Get past key-value for this layer if available - past_key_value = past_key_values[i] if past_key_values is not None else None - - if past_key_values is not None: - # KV caching enabled - if return_attention_weights: - result = block.forward(x, mask=mask, return_attention_weights=True, past_key_value=past_key_value) - if len(result) == 3: - x, attn_weights, new_key_value = result - all_attention_weights.append(attn_weights) - 
new_key_values.append(new_key_value) - else: - # Fallback if block doesn't support KV caching yet - x, attn_weights = result - all_attention_weights.append(attn_weights) - new_key_values.append(None) - else: - result = block.forward(x, mask=mask, past_key_value=past_key_value) - if isinstance(result, tuple) and len(result) == 2: - x, new_key_value = result - new_key_values.append(new_key_value) - else: - # Fallback if block doesn't support KV caching yet - x = result - new_key_values.append(None) - else: - # Standard behavior (backward compatible) - if return_attention_weights: - x, attn_weights = block.forward(x, mask=mask, return_attention_weights=True) - all_attention_weights.append(attn_weights) - else: - x = block.forward(x, mask=mask) - - # Final layer normalization (for pre-norm) - if self.final_norm: - x = self.final_norm.forward(x) - - # Language modeling head - # x: (batch_size, seq_len, embed_dim) - # lm_head: (embed_dim, vocab_size) - # output: (batch_size, seq_len, vocab_size) - - batch_size, seq_len, embed_dim = x.shape - x_reshaped = x.data.reshape(-1, embed_dim) # (batch_size * seq_len, embed_dim) - logits_reshaped = np.matmul(x_reshaped, self.lm_head.data) # (batch_size * seq_len, vocab_size) - logits = logits_reshaped.reshape(batch_size, seq_len, self.vocab_size) - - # Return appropriate tuple based on what was requested - if past_key_values is not None: - # KV caching is enabled - if return_attention_weights: - return Tensor(logits), all_attention_weights, new_key_values - else: - return Tensor(logits), new_key_values - else: - # Standard behavior (backward compatible) - if return_attention_weights: - return Tensor(logits), all_attention_weights - else: - return Tensor(logits) - ### END SOLUTION - - def __call__(self, input_ids: Tensor, mask: Optional[Tensor] = None, - return_attention_weights: bool = False, past_key_values: Optional[List[Tuple[Tensor, Tensor]]] = None) -> Union[Tensor, Tuple[Tensor, List[Tensor]], Tuple[Tensor, 
List[Tuple[Tensor, Tensor]]], Tuple[Tensor, List[Tensor], List[Tuple[Tensor, Tensor]]]]: - """Make the class callable.""" - return self.forward(input_ids, mask, return_attention_weights, past_key_values) - - def generate(self, input_ids: Tensor, max_new_tokens: int = 50, - temperature: float = 1.0, use_cache: bool = False) -> Tensor: - """ - Generate text autoregressively. - - This function is PROVIDED to show text generation capability. - - Args: - input_ids: Input token IDs with shape (batch_size, seq_len) - max_new_tokens: Maximum number of new tokens to generate - temperature: Temperature for sampling (higher = more random) - use_cache: Whether to use KV caching for faster generation - - Returns: - Generated token IDs with shape (batch_size, original_seq_len + generated_tokens) - """ - batch_size, current_seq_len = input_ids.shape - - if current_seq_len >= self.max_seq_length: - raise ValueError(f"Input sequence length {current_seq_len} exceeds max {self.max_seq_length}") - - generated_ids = input_ids.data.copy() - # A non-None per-layer list makes forward() take its caching path on the first step; - # passing None here would silently disable caching for the whole generation loop - past_key_values = [None] * self.num_layers if use_cache else None - - for step in range(max_new_tokens): - if use_cache and step > 0: - # For subsequent steps with caching, only process the last token - current_input = Tensor(generated_ids[:, -1:]) # Only last token - # No mask needed for single token - current_mask = None - else: - # First step or no caching: process full sequence - current_input = Tensor(generated_ids) - # Create causal mask - seq_len = generated_ids.shape[1] - causal_mask = np.triu(np.ones((seq_len, seq_len)), k=1) - causal_mask = 1 - causal_mask - current_mask = Tensor(causal_mask) - - # Forward pass with optional caching - if use_cache: - result = self.forward(current_input, mask=current_mask, past_key_values=past_key_values) - if isinstance(result, tuple) and len(result) == 2: - logits, past_key_values = result - else: - # Fallback if caching not fully implemented yet - logits = result - past_key_values = None - else: - logits = 
self.forward(current_input, mask=current_mask) - - # Get logits for last position - last_logits = logits.data[:, -1, :] # (batch_size, vocab_size) - - # Apply temperature - last_logits = last_logits / temperature - - # Sample next token (using simple sampling) - # Convert to probabilities - exp_logits = np.exp(last_logits - np.max(last_logits, axis=-1, keepdims=True)) - probs = exp_logits / np.sum(exp_logits, axis=-1, keepdims=True) - - # Sample from distribution - next_tokens = [] - for i in range(batch_size): - next_token = np.random.choice(self.vocab_size, p=probs[i]) - next_tokens.append(next_token) - - next_tokens = np.array(next_tokens).reshape(batch_size, 1) - - # Append to sequence - generated_ids = np.concatenate([generated_ids, next_tokens], axis=1) - - # Stop if we reach max sequence length - if generated_ids.shape[1] >= self.max_seq_length: - break - - return Tensor(generated_ids) - - def get_memory_usage(self) -> Dict[str, float]: - """ - Calculate memory usage of complete transformer model. - - This function is PROVIDED to show memory analysis. 
- """ - # Token embedding memory - if hasattr(self.token_embedding, 'get_memory_usage'): - embedding_memory = self.token_embedding.get_memory_usage()['total_memory_mb'] - else: - embedding_memory = self.vocab_size * self.embed_dim * 4 / (1024 * 1024) - - # Transformer blocks memory - block_memory = 0 - if self.transformer_blocks: - single_block_memory = self.transformer_blocks[0].get_memory_usage()['total_memory_mb'] - block_memory = single_block_memory * self.num_layers - - # Final norm memory - final_norm_memory = 0 - if self.final_norm: - final_norm_memory = self.final_norm.get_memory_usage()['parameter_memory_mb'] - - # Language modeling head memory - lm_head_memory = self.lm_head.data.nbytes / (1024 * 1024) - - total_memory = embedding_memory + block_memory + final_norm_memory + lm_head_memory - total_params = sum(p.data.size for p in self.parameters) if hasattr(self, 'parameters') else 0 - - return { - 'total_memory_mb': total_memory, - 'embedding_memory_mb': embedding_memory, - 'transformer_blocks_memory_mb': block_memory, - 'lm_head_memory_mb': lm_head_memory, - 'total_parameters': total_params, - 'vocab_size': self.vocab_size, - 'embed_dim': self.embed_dim, - 'num_layers': self.num_layers, - 'num_heads': self.num_heads, - 'hidden_dim': self.hidden_dim - } - -# %% [markdown] -""" -### TEST Test Your Complete Transformer Implementation - -Once you implement the Transformer methods above, run this cell to test it: -""" - -# %% nbgrader={"grade": true, "grade_id": "test-transformer-model-immediate", "locked": true, "points": 25, "schema_version": 3, "solution": false, "task": false} -def test_unit_transformer_model(): - """Unit test for complete transformer model.""" - print("🔬 Unit Test: Complete Transformer Model...") - - # Test configuration - vocab_size = 1000 - embed_dim = 256 - num_heads = 8 - num_layers = 4 - hidden_dim = 512 - max_seq_length = 128 - - transformer = Transformer( - vocab_size=vocab_size, - embed_dim=embed_dim, - num_heads=num_heads, - 
num_layers=num_layers, - hidden_dim=hidden_dim, - max_seq_length=max_seq_length, - pre_norm=True - ) - - # Verify initialization - assert transformer.vocab_size == vocab_size, "Should store vocabulary size" - assert transformer.embed_dim == embed_dim, "Should store embedding dimension" - assert transformer.num_layers == num_layers, "Should store number of layers" - assert len(transformer.transformer_blocks) == num_layers, "Should create correct number of blocks" - - # Verify components exist - assert hasattr(transformer, 'token_embedding'), "Should have token embedding" - assert hasattr(transformer, 'pos_encoding'), "Should have positional encoding" - assert hasattr(transformer, 'lm_head'), "Should have language modeling head" - - # Test forward pass with token IDs - batch_size = 4 - seq_len = 32 - input_ids = np.random.randint(0, vocab_size, (batch_size, seq_len)) - input_tensor = Tensor(input_ids) - - logits = transformer.forward(input_tensor) - expected_shape = (batch_size, seq_len, vocab_size) - assert logits.shape == expected_shape, f"Expected shape {expected_shape}, got {logits.shape}" - - # Test with attention weights return - logits_with_attn, all_attention_weights = transformer.forward(input_tensor, return_attention_weights=True) - - assert logits_with_attn.shape == expected_shape, "Logits with attention should have correct shape" - assert len(all_attention_weights) == num_layers, f"Should return attention weights from {num_layers} layers" - - for i, attn_weights in enumerate(all_attention_weights): - expected_attn_shape = (batch_size, num_heads, seq_len, seq_len) - assert attn_weights.shape == expected_attn_shape, \ - f"Layer {i} attention should have shape {expected_attn_shape}, got {attn_weights.shape}" - - # Test with causal mask - causal_mask = np.triu(np.ones((seq_len, seq_len)), k=1) - causal_mask = 1 - causal_mask # Convert to attention mask - - masked_logits, masked_attention = transformer.forward( - input_tensor, mask=Tensor(causal_mask), 
return_attention_weights=True - ) - - assert masked_logits.shape == expected_shape, "Masked logits should have correct shape" - - # Verify causal masking propagates through all layers - for layer_idx, attn_weights in enumerate(masked_attention): - for head in range(num_heads): - for i in range(seq_len): - for j in range(i+1, seq_len): - assert np.all(attn_weights.data[:, head, i, j] < 1e-5), \ - f"Layer {layer_idx}, head {head}: position ({i},{j}) should be masked" - - # Test callable interface - logits_callable = transformer(input_tensor) - assert np.allclose(logits_callable.data, logits.data), "Callable interface should work" - - # Test text generation capability - print(" Testing text generation...") - start_tokens = Tensor(np.random.randint(0, vocab_size, (2, 8))) # 2 sequences, 8 tokens each - generated = transformer.generate(start_tokens, max_new_tokens=10, temperature=1.0) - - expected_gen_shape = (2, 18) # 8 original + 10 new tokens - assert generated.shape == expected_gen_shape, f"Generated shape should be {expected_gen_shape}, got {generated.shape}" - - # Verify original tokens are preserved - assert np.array_equal(generated.data[:, :8], start_tokens.data), "Original tokens should be preserved" - - # Test different model configurations - small_transformer = Transformer( - vocab_size=500, embed_dim=128, num_heads=4, num_layers=2, hidden_dim=256 - ) - - small_input = Tensor(np.random.randint(0, 500, (2, 16))) - small_logits = small_transformer.forward(small_input) - expected_small_shape = (2, 16, 500) - assert small_logits.shape == expected_small_shape, "Small transformer should work" - - # Test pre-norm vs post-norm - post_norm_transformer = Transformer( - vocab_size=vocab_size, embed_dim=embed_dim, num_heads=num_heads, - num_layers=2, hidden_dim=hidden_dim, pre_norm=False - ) - - post_norm_logits = post_norm_transformer.forward(input_tensor) - pre_norm_logits = Transformer( - vocab_size=vocab_size, embed_dim=embed_dim, num_heads=num_heads, - num_layers=2, 
hidden_dim=hidden_dim, pre_norm=True - ).forward(input_tensor) - - assert not np.allclose(post_norm_logits.data, pre_norm_logits.data), \ - "Pre-norm and post-norm should produce different outputs" - - # Test memory usage calculation - memory_stats = transformer.get_memory_usage() - assert 'total_memory_mb' in memory_stats, "Should provide memory statistics" - assert memory_stats['total_memory_mb'] > 0, "Should have positive memory usage" - assert memory_stats['total_parameters'] > 0, "Should count parameters" - - # Verify memory breakdown - assert memory_stats['embedding_memory_mb'] > 0, "Should have embedding memory" - assert memory_stats['transformer_blocks_memory_mb'] > 0, "Should have transformer block memory" - assert memory_stats['lm_head_memory_mb'] > 0, "Should have language modeling head memory" - - print("PASS Complete transformer model tests passed!") - print(f"PASS Forward pass produces correct logit shapes") - print(f"PASS Causal masking works across all {num_layers} layers") - print(f"PASS Text generation capability verified") - print(f"PASS Total parameters: {memory_stats['total_parameters']:,}") - print(f"PASS Total memory: {memory_stats['total_memory_mb']:.2f}MB") - print(f"PASS Pre-norm and post-norm architectures work correctly") - -# Test function defined (called in main block) - -# %% [markdown] -""" -## TARGET ML Systems: Performance Analysis & Transformer Scaling - -Now let's develop systems engineering skills by analyzing transformer performance and understanding how model depth and width affect memory usage and computational requirements. 
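Before running the profiler, it helps to have a back-of-envelope parameter count to sanity-check measured numbers against. The sketch below uses the same per-layer breakdown as this module's production scaling simulation (4·d² attention weights for Q/K/V/O, 2·d·h feed-forward weights plus biases, two LayerNorms, and vocab·d for each of the token embedding and LM head); the GPT-2-small-like configuration is an assumption for illustration, not something this module trains.

```python
# Back-of-envelope parameter estimate for a decoder-only transformer.
# Matches the per-layer breakdown used in simulate_production_scaling().
def estimate_params(vocab_size: int, embed_dim: int,
                    num_layers: int, hidden_dim: int) -> int:
    embedding = 2 * vocab_size * embed_dim          # token embedding + LM head
    attention = 4 * embed_dim ** 2                  # Q, K, V, O projections
    ffn = 2 * embed_dim * hidden_dim + embed_dim + hidden_dim  # weights + biases
    norms = 4 * embed_dim                           # two LayerNorms (scale + shift)
    return embedding + num_layers * (attention + ffn + norms)

# GPT-2-small-like config (assumed for illustration):
total = estimate_params(50257, 768, 12, 3072)
tied = total - 50257 * 768  # weight tying shares the embedding with the LM head
print(f"untied: {total / 1e6:.1f}M, tied: {tied / 1e6:.1f}M")
```

With weight tying this lands near the commonly quoted ~124M figure for GPT-2 small, which makes it a useful cross-check against `get_memory_usage()` outputs.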
- -### **Learning Outcome**: *"I understand how transformer architecture choices affect scalability, memory usage, and production deployment constraints"* -""" - -# %% nbgrader={"grade": false, "grade_id": "transformer-profiler", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -import time - -class TransformerProfiler: - """ - Performance profiling toolkit for transformer architectures. - - Helps ML engineers understand computational costs, memory scaling, - and architectural trade-offs in transformer-based models. - """ - - def __init__(self): - self.results = {} - - def measure_scaling_with_depth(self, base_config: Dict, layer_counts: List[int]) -> Dict: - """ - Measure how transformer performance scales with number of layers. - - TODO: Implement transformer depth scaling measurement. - - STEP-BY-STEP IMPLEMENTATION: - 1. Create transformers with different layer counts - 2. Measure memory usage and computation time for each - 3. Calculate scaling patterns (should be linear with depth) - 4. Analyze parameter growth and memory requirements - 5. 
Return comprehensive scaling analysis - - EXPECTED SCALING: - - Parameters: Linear with depth - - Memory: Linear with depth - - Computation: Linear with depth - - Quality: Generally improves with depth (to a point) - - Args: - base_config: Base transformer configuration - layer_counts: List of layer counts to test - - Returns: - Dictionary with scaling analysis results - """ - ### BEGIN SOLUTION - scaling_results = {} - - # Test input - batch_size = 4 - seq_len = 32 - vocab_size = base_config['vocab_size'] - test_input = Tensor(np.random.randint(0, vocab_size, (batch_size, seq_len))) - - for num_layers in layer_counts: - # Create transformer with this depth - transformer = Transformer( - vocab_size=base_config['vocab_size'], - embed_dim=base_config['embed_dim'], - num_heads=base_config['num_heads'], - num_layers=num_layers, - hidden_dim=base_config['hidden_dim'], - max_seq_length=base_config.get('max_seq_length', 128) - ) - - # Measure memory usage - memory_stats = transformer.get_memory_usage() - - # Measure computation time - start_time = time.time() - logits = transformer.forward(test_input) - end_time = time.time() - - computation_time_ms = (end_time - start_time) * 1000 - - # Calculate throughput - total_tokens = batch_size * seq_len - tokens_per_second = total_tokens / (end_time - start_time) if end_time > start_time else 0 - - scaling_results[num_layers] = { - 'num_layers': num_layers, - 'total_parameters': memory_stats['total_parameters'], - 'total_memory_mb': memory_stats['total_memory_mb'], - 'computation_time_ms': computation_time_ms, - 'tokens_per_second': tokens_per_second, - 'memory_per_layer_mb': memory_stats['transformer_blocks_memory_mb'] / num_layers if num_layers > 0 else 0, - 'parameters_per_layer': (memory_stats['total_parameters'] - - base_config['vocab_size'] * base_config['embed_dim'] * 2) // num_layers if num_layers > 0 else 0 - } - - return scaling_results - ### END SOLUTION - - def analyze_width_vs_depth_tradeoffs(self, base_params: int, 
configurations: List[Dict]) -> Dict: - """ - Compare different ways to allocate a fixed parameter budget. - - This function is PROVIDED to show parameter allocation analysis. - """ - print(f"📊 WIDTH vs DEPTH TRADE-OFF ANALYSIS") - print(f"Target parameter budget: ~{base_params:,} parameters") - print("=" * 70) - - results = {} - - # Test input - batch_size = 4 - seq_len = 32 - test_input = Tensor(np.random.randint(0, 1000, (batch_size, seq_len))) - - print(f"{'Config':<15} {'Layers':<7} {'Embed':<6} {'Heads':<6} {'Hidden':<7} {'Params':<12} {'Time (ms)':<10} {'Memory'}") - print("-" * 80) - - for i, config in enumerate(configurations): - config_name = f"Config_{i+1}"  # Defined before try so the except clause can reference it - try: - # Create transformer - transformer = Transformer( - vocab_size=1000, # Fixed vocab size - embed_dim=config['embed_dim'], - num_heads=config['num_heads'], - num_layers=config['num_layers'], - hidden_dim=config['hidden_dim'], - max_seq_length=128 - ) - - # Get actual parameter count - memory_stats = transformer.get_memory_usage() - actual_params = memory_stats['total_parameters'] - - # Measure performance - start_time = time.time() - logits = transformer.forward(test_input) - computation_time = (time.time() - start_time) * 1000 - - results[config_name] = { - 'config': config, - 'actual_parameters': actual_params, - 'computation_time_ms': computation_time, - 'memory_mb': memory_stats['total_memory_mb'], - 'parameter_efficiency': abs(actual_params - base_params) / base_params - } - - print(f"{config_name:<15} {config['num_layers']:<7} {config['embed_dim']:<6} " - f"{config['num_heads']:<6} {config['hidden_dim']:<7} {actual_params:<12,} " - f"{computation_time:<10.2f} {memory_stats['total_memory_mb']:.1f}MB") - - except Exception as e: - print(f"{config_name:<15} ERROR: {str(e)[:50]}") - - # Analysis - print(f"\nTIP TRADE-OFF INSIGHTS:") - print(f" - Deeper models: Better at learning complex patterns, more sequential") - print(f" - Wider models: More parallelizable, can capture diverse features") - 
print(f" - More heads: Richer attention patterns, more computation") - print(f" - Hidden dimension: Affects FFN capacity, major parameter contributor") - - return results - - def simulate_production_scaling(self, model_sizes: List[str]) -> Dict: - """ - Simulate memory and computation requirements for production model sizes. - - This function is PROVIDED to show production scaling analysis. - """ - print(f"\n🏭 PRODUCTION MODEL SCALING SIMULATION") - print("=" * 60) - - # Production model configurations (simplified) - size_configs = { - 'Small': {'vocab_size': 50000, 'embed_dim': 512, 'num_heads': 8, 'num_layers': 6, 'hidden_dim': 2048}, - 'Medium': {'vocab_size': 50000, 'embed_dim': 768, 'num_heads': 12, 'num_layers': 12, 'hidden_dim': 3072}, - 'Large': {'vocab_size': 50000, 'embed_dim': 1024, 'num_heads': 16, 'num_layers': 24, 'hidden_dim': 4096}, - 'XL': {'vocab_size': 50000, 'embed_dim': 1280, 'num_heads': 20, 'num_layers': 36, 'hidden_dim': 5120} - } - - results = {} - - print(f"{'Model Size':<12} {'Parameters':<12} {'Memory (GB)':<12} {'Training GPU':<12} {'Inference'}") - print("-" * 70) - - for size in model_sizes: - if size not in size_configs: - continue - - config = size_configs[size] - - # Estimate parameters - # Embedding: vocab_size * embed_dim * 2 (input + output) - embedding_params = config['vocab_size'] * config['embed_dim'] * 2 - - # Per layer: - # - Attention: 4 * embed_dim^2 (Q, K, V, O projections) - # - FFN: 2 * embed_dim * hidden_dim + embed_dim + hidden_dim (weights + biases) - # - LayerNorm: 2 * embed_dim * 2 (two norms per layer) - attention_params_per_layer = 4 * config['embed_dim'] ** 2 - ffn_params_per_layer = 2 * config['embed_dim'] * config['hidden_dim'] + config['embed_dim'] + config['hidden_dim'] - norm_params_per_layer = 4 * config['embed_dim'] - - layer_params = attention_params_per_layer + ffn_params_per_layer + norm_params_per_layer - total_params = embedding_params + layer_params * config['num_layers'] - - # Estimate memory 
(parameters + activations + gradients for training) - param_memory_gb = total_params * 4 / (1024**3) # 4 bytes per float32 - - # Training memory: parameters + gradients + optimizer states + activations - training_memory_gb = param_memory_gb * 4 # Rough estimate (param + grad + 2x optimizer states) - - # Inference memory: just parameters + activations - inference_memory_gb = param_memory_gb * 1.5 # Parameters + activation memory - - # GPU requirements (very rough estimates) - if training_memory_gb < 24: - training_gpu = "Single RTX 4090" - elif training_memory_gb < 80: - training_gpu = "Single A100" - else: - training_gpu = "Multi-GPU" - - if inference_memory_gb < 12: - inference_req = "RTX 4060 Ti" - elif inference_memory_gb < 24: - inference_req = "RTX 4090" - else: - inference_req = "A100+" - - results[size] = { - 'config': config, - 'total_parameters': total_params, - 'training_memory_gb': training_memory_gb, - 'inference_memory_gb': inference_memory_gb, - 'training_gpu_req': training_gpu, - 'inference_gpu_req': inference_req - } - - print(f"{size:<12} {total_params/1e6:.1f}M {training_memory_gb:.1f} {training_gpu:<12} {inference_req}") - - print(f"\nPROGRESS SCALING OBSERVATIONS:") - print(f" - Model size grows super-linearly with dimension increases") - print(f" - Memory requirements dominate deployment decisions") - print(f" - Training requires 3-4x more memory than inference") - print(f" - Multi-GPU becomes necessary for large models") - - return results - -def analyze_transformer_system_design(): - """ - Comprehensive analysis of transformer system design choices and trade-offs. - - This function is PROVIDED to show systems-level design thinking. 
- " - print("🏗️ TRANSFORMER SYSTEM DESIGN ANALYSIS") - print("=" * 60) - - # Architecture decision analysis - design_choices = { - 'Layer Normalization': { - 'Pre-norm': {'stability': 'High', 'training': 'Easier', 'performance': 'Good'}, - 'Post-norm': {'stability': 'Lower', 'training': 'Harder', 'performance': 'Potentially better'} - }, - 'Attention Patterns': { - 'Full attention': {'complexity': 'O(N²)', 'quality': 'Best', 'scalability': 'Limited'}, - 'Sparse attention': {'complexity': 'O(N√N)', 'quality': 'Good', 'scalability': 'Better'}, - 'Linear attention': {'complexity': 'O(N)', 'quality': 'Reduced', 'scalability': 'Excellent'} - }, - 'Feed-Forward Size': { - '2x embed_dim': {'parameters': 'Low', 'capacity': 'Limited', 'speed': 'Fast'}, - '4x embed_dim': {'parameters': 'Standard', 'capacity': 'Good', 'speed': 'Medium'}, - '8x embed_dim': {'parameters': 'High', 'capacity': 'High', 'speed': 'Slow'} - } - } - - print("TARGET ARCHITECTURAL DESIGN CHOICES:") - for category, choices in design_choices.items(): - print(f"\n{category}:") - for choice, properties in choices.items(): - prop_str = ", ".join([f"{k}: {v}" for k, v in properties.items()]) - print(f" - {choice}: {prop_str}") - - # Memory scaling analysis - print(f"\n📊 MEMORY SCALING PATTERNS:") - print(f"Component breakdown for typical transformer:") - print(f" - Token embeddings: vocab_size * embed_dim parameters") - print(f" - Position encodings: 0 parameters (sinusoidal) or seq_len * embed_dim (learned)") - print(f" - Attention layers: 4 * embed_dim² parameters per layer") - print(f" - Feed-forward: 2 * embed_dim * hidden_dim parameters per layer") - print(f" - Layer normalization: 4 * embed_dim parameters per layer (two norms, scale + shift)") - print(f" - Output projection: embed_dim * vocab_size parameters") - - print(f"\n🔧 OPTIMIZATION STRATEGIES:") - optimization_techniques = [ - "Gradient checkpointing: Trade computation for memory", - "Mixed precision training: Use FP16 for 2x memory reduction", - "Parameter sharing: 
Share weights across layers", - "Sparse attention: Reduce quadratic scaling", - "Model parallelism: Distribute layers across GPUs", - "Pipeline parallelism: Process different batch elements on different GPUs", - "Activation checkpointing: Recompute activations instead of storing" - ] - - for technique in optimization_techniques: - print(f" - {technique}") - - print(f"\nTARGET PRODUCTION DEPLOYMENT CONSIDERATIONS:") - deployment_factors = [ - "Batch size: Larger batches improve GPU utilization but increase memory", - "Sequence length: Quadratic impact on attention memory", - "Model depth: Linear impact on memory and computation", - "Model width: Quadratic impact on attention parameters", - "Precision: FP32 vs FP16 vs INT8 trade-offs", - "Hardware: GPU memory and compute capabilities", - "Latency requirements: Real-time vs batch processing", - "Throughput requirements: Tokens per second targets" - ] - - for factor in deployment_factors: - print(f" - {factor}") - -# %% [markdown] -""" -### TEST Test: Transformer Performance Analysis - -Let's test our transformer profiler with realistic scaling scenarios. 
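One caveat before reading the timings: the profiler takes a single wall-clock sample per configuration, which is noisy for sub-millisecond forward passes. A more robust pattern (a sketch, not what the tests below require; `time_forward_ms` is a hypothetical helper name) warms up first and reports the median of several monotonic-clock runs:

```python
import time
import statistics

def time_forward_ms(fn, *args, warmup: int = 2, repeats: int = 5) -> float:
    """Median-of-repeats timing in milliseconds.

    A single sample is dominated by cache warmup and OS jitter;
    warming up and taking the median of several time.perf_counter()
    samples gives a much more stable estimate.
    """
    for _ in range(warmup):
        fn(*args)                      # warm caches / allocator
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()    # monotonic, high resolution
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples) * 1000.0
```

Swapping this in for the raw `time.time()` pair inside `measure_scaling_with_depth` would make the depth-scaling ratios noticeably less noisy on small models.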
-""" - -# %% nbgrader={"grade": false, "grade_id": "test-transformer-profiler", "locked": false, "schema_version": 3, "solution": false, "task": false} -def test_transformer_profiler(): - """Test transformer profiler with various scenarios.""" - print("🔬 Unit Test: Transformer Performance Profiler...") - - profiler = TransformerProfiler() - - # Test depth scaling measurement - base_config = { - 'vocab_size': 500, - 'embed_dim': 128, - 'num_heads': 4, - 'hidden_dim': 256 - } - - layer_counts = [1, 2, 4] - depth_results = profiler.measure_scaling_with_depth(base_config, layer_counts) - - # Verify depth scaling results - assert len(depth_results) == len(layer_counts), f"Should test {len(layer_counts)} layer counts" - - for num_layers in layer_counts: - assert num_layers in depth_results, f"Should include results for {num_layers} layers" - result = depth_results[num_layers] - - # Verify required metrics - required_keys = ['num_layers', 'total_parameters', 'total_memory_mb', - 'computation_time_ms', 'tokens_per_second'] - for key in required_keys: - assert key in result, f"Missing metric: {key} for {num_layers} layers" - assert isinstance(result[key], (int, float)), f"Invalid type for {key}" - - # Verify reasonable values - assert result['num_layers'] == num_layers, "Should store correct layer count" - assert result['total_parameters'] > 0, "Should have positive parameter count" - assert result['total_memory_mb'] > 0, "Should have positive memory usage" - - # Test that parameters and memory scale roughly linearly with depth - if len(layer_counts) >= 2: - shallow = depth_results[layer_counts[0]] - deep = depth_results[layer_counts[-1]] - - layer_ratio = deep['num_layers'] / shallow['num_layers'] - param_ratio = deep['total_parameters'] / shallow['total_parameters'] - memory_ratio = deep['total_memory_mb'] / shallow['total_memory_mb'] - - # Allow some deviation due to fixed costs (embeddings, etc.) 
- assert 1.0 < param_ratio < layer_ratio * 2, f"Parameters should scale roughly linearly with depth, got ratio {param_ratio:.2f}" - assert 1.0 < memory_ratio < layer_ratio * 2, f"Memory should scale roughly linearly with depth, got ratio {memory_ratio:.2f}" - - print("PASS Depth scaling measurement test passed") - - # Test width vs depth analysis - configurations = [ - {'embed_dim': 128, 'num_heads': 4, 'num_layers': 4, 'hidden_dim': 256}, - {'embed_dim': 256, 'num_heads': 8, 'num_layers': 2, 'hidden_dim': 512}, - ] - - width_depth_results = profiler.analyze_width_vs_depth_tradeoffs(100000, configurations) - - # Verify width vs depth results - assert len(width_depth_results) > 0, "Should analyze at least one configuration" - - for config_name, result in width_depth_results.items(): - assert 'config' in result, "Should include configuration" - assert 'actual_parameters' in result, "Should count actual parameters" - assert 'computation_time_ms' in result, "Should measure computation time" - assert result['actual_parameters'] > 0, "Should have positive parameter count" - - print("PASS Width vs depth analysis test passed") - - # Test production scaling simulation - production_results = profiler.simulate_production_scaling(['Small', 'Medium']) - - # Verify production scaling results - for size, result in production_results.items(): - assert 'config' in result, "Should include model configuration" - assert 'total_parameters' in result, "Should estimate total parameters" - assert 'training_memory_gb' in result, "Should estimate training memory" - assert 'inference_memory_gb' in result, "Should estimate inference memory" - - # Verify reasonable scaling - assert result['total_parameters'] > 1e6, "Should have millions of parameters" - assert result['training_memory_gb'] > result['inference_memory_gb'], "Training should require more memory" - - print("PASS Production scaling simulation test passed") - print("TARGET Transformer Profiler: All tests passed!") - -# Test function defined (called in main block) - -# %% 
[markdown] -""" -## Integration Testing: Complete Language Model Pipeline - -Let's test the complete pipeline from tokenization through transformer processing: -""" - -# %% nbgrader={"grade": false, "grade_id": "test-transformer-integration", "locked": false, "schema_version": 3, "solution": false, "task": false} -def test_complete_language_model_pipeline(): - """Test complete language model pipeline integration.""" - print("TEST Integration Test: Complete Language Model Pipeline...") - - # Create a small but complete language model - vocab_size = 1000 - embed_dim = 256 - num_heads = 8 - num_layers = 4 - hidden_dim = 512 - max_seq_length = 64 - - print(f" Creating transformer with {num_layers} layers, {embed_dim} dimensions...") - transformer = Transformer( - vocab_size=vocab_size, - embed_dim=embed_dim, - num_heads=num_heads, - num_layers=num_layers, - hidden_dim=hidden_dim, - max_seq_length=max_seq_length - ) - - # Test 1: Basic text processing pipeline - print(" Testing basic text processing pipeline...") - batch_size = 4 - seq_len = 32 - - # Simulate tokenized input - input_ids = np.random.randint(0, vocab_size, (batch_size, seq_len)) - input_tensor = Tensor(input_ids) - - # Forward pass - logits = transformer.forward(input_tensor) - expected_shape = (batch_size, seq_len, vocab_size) - assert logits.shape == expected_shape, f"Expected {expected_shape}, got {logits.shape}" - - # Test that logits are reasonable (not all zeros/inf/nan) - assert not np.all(logits.data == 0), "Logits should not all be zero" - assert not np.any(np.isinf(logits.data)), "Logits should not contain inf" - assert not np.any(np.isnan(logits.data)), "Logits should not contain nan" - - print(f" Forward pass successful: {logits.shape}") - - # Test 2: Language modeling with causal mask - print(" Testing language modeling with causal attention...") - causal_mask = np.triu(np.ones((seq_len, seq_len)), k=1) - causal_mask = 1 - causal_mask # Convert to attention mask - - masked_logits, 
all_attention = transformer.forward( - input_tensor, mask=Tensor(causal_mask), return_attention_weights=True - ) - - assert len(all_attention) == num_layers, f"Should return attention from {num_layers} layers" - - # Verify causal masking works across all layers - for layer_idx, attn_weights in enumerate(all_attention): - # Check a few positions to ensure masking works - for i in range(min(5, seq_len)): - for j in range(i+1, min(i+5, seq_len)): - future_attention = attn_weights.data[:, :, i, j] # All heads, all batches - assert np.all(future_attention < 1e-5), \ - f"Layer {layer_idx}: future attention at ({i},{j}) should be ~0" - - print(f" Causal masking verified across all layers") - - # Test 3: Text generation - print(" Testing autoregressive text generation...") - # Start with a shorter sequence for generation - gen_start = Tensor(np.random.randint(0, vocab_size, (2, 8))) - generated = transformer.generate(gen_start, max_new_tokens=8, temperature=1.0) - - expected_gen_shape = (2, 16) # 8 start + 8 generated - assert generated.shape == expected_gen_shape, f"Expected {expected_gen_shape}, got {generated.shape}" - - # Verify original tokens preserved - assert np.array_equal(generated.data[:, :8], gen_start.data), "Should preserve original tokens" - - # Verify new tokens are valid - new_tokens = generated.data[:, 8:] - assert np.all(new_tokens >= 0), "Generated tokens should be >= 0" - assert np.all(new_tokens < vocab_size), f"Generated tokens should be < {vocab_size}" - - print(f" Generated {new_tokens.shape[1]} new tokens successfully") - - # Test 4: Different sequence lengths - print(" Testing variable sequence lengths...") - for test_seq_len in [16, 32, 48]: - if test_seq_len > max_seq_length: - continue - - test_input = Tensor(np.random.randint(0, vocab_size, (2, test_seq_len))) - test_logits = transformer.forward(test_input) - - expected_test_shape = (2, test_seq_len, vocab_size) - assert test_logits.shape == expected_test_shape, f"Failed for seq_len 
{test_seq_len}" - - print(f" Variable sequence lengths work correctly") - - # Test 5: Memory usage analysis - print(" Analyzing memory usage...") - memory_stats = transformer.get_memory_usage() - - print(f" Model parameters: {memory_stats['total_parameters']:,}") - print(f" Model memory: {memory_stats['total_memory_mb']:.1f}MB") - print(f" Embedding memory: {memory_stats['embedding_memory_mb']:.1f}MB") - print(f" Transformer blocks: {memory_stats['transformer_blocks_memory_mb']:.1f}MB") - print(f" LM head: {memory_stats['lm_head_memory_mb']:.1f}MB") - - # Verify memory breakdown makes sense - component_memory = (memory_stats['embedding_memory_mb'] + - memory_stats['transformer_blocks_memory_mb'] + - memory_stats['lm_head_memory_mb']) - - # Allow small difference due to final norm layer - memory_diff = abs(memory_stats['total_memory_mb'] - component_memory) - assert memory_diff < 1.0, f"Memory breakdown doesn't add up: {memory_diff:.2f}MB difference" - - # Test 6: Performance characteristics - print(" Testing performance characteristics...") - - # Time multiple forward passes - num_iterations = 5 - start_time = time.time() - - for _ in range(num_iterations): - _ = transformer.forward(input_tensor) - - total_time = time.time() - start_time - avg_time_per_forward = total_time / num_iterations - tokens_per_second = (batch_size * seq_len) / avg_time_per_forward - - print(f" Average forward pass: {avg_time_per_forward*1000:.2f}ms") - print(f" Processing speed: {tokens_per_second:.0f} tokens/second") - - # Verify reasonable performance - assert avg_time_per_forward < 1.0, "Forward pass should be < 1 second" - assert tokens_per_second > 50, "Should process > 50 tokens/second" - - # Test 7: Gradient flow (simulated) - print(" Testing gradient flow through layers...") - - # Create slightly different inputs to test sensitivity - input_1 = Tensor(input_ids.copy()) - input_2 = Tensor(input_ids.copy()) - input_2.data[0, 0] = (input_2.data[0, 0] + 1) % vocab_size # Change one 
token - - logits_1 = transformer.forward(input_1) - logits_2 = transformer.forward(input_2) - - # Outputs should be different (model is sensitive to input changes) - output_diff = np.mean(np.abs(logits_1.data - logits_2.data)) - assert output_diff > 1e-6, f"Model should be sensitive to input changes, diff: {output_diff}" - - # But not too different (model should be stable) - assert output_diff < 100, f"Model should be stable, large diff: {output_diff}" - - print(f" Model shows appropriate sensitivity to input changes") - - print("PASS Complete language model pipeline integration test passed!") - print(f"PASS Forward pass, masking, generation, and performance verified") - print(f"PASS Model processes {tokens_per_second:.0f} tokens/second") - print(f"PASS Memory footprint: {memory_stats['total_memory_mb']:.1f}MB") - -# Test function defined (called in main block) - -# %% [markdown] -""" -## Main Execution Block - -All transformer tests and demonstrations are run from here when the module is executed directly: -""" - -# %% nbgrader={"grade": false, "grade_id": "transformers-main", "locked": false, "schema_version": 3, "solution": false, "task": false} -def test_module(): - """Run all unit tests for this module.""" - print("🧪 TESTING MODULE: Transformers") - print("=" * 50) - - # Run all unit tests - test_unit_layer_norm() - test_unit_feed_forward() - test_unit_transformer_block() - test_unit_transformer_model() - test_transformer_profiler() - test_complete_language_model_pipeline() - - print("\n" + "=" * 50) - print("✅ ALL TESTS PASSED! 
Module ready for export.") - print("Run: tito module complete 13_transformers") - -if __name__ == "__main__": - test_module() - - print("\n" + "="*60) - print("MAGNIFY TRANSFORMER SYSTEMS ANALYSIS") - print("="*60) - - # Performance analysis - profiler = TransformerProfiler() - - # Test transformer scaling with different depths - print("PROGRESS TRANSFORMER DEPTH SCALING ANALYSIS:") - base_config = { - 'vocab_size': 1000, - 'embed_dim': 256, - 'num_heads': 8, - 'hidden_dim': 1024 - } - - layer_counts = [2, 4, 8, 12] - depth_results = profiler.measure_scaling_with_depth(base_config, layer_counts) - - # Analyze scaling patterns - print(f"\n{'Layers':<7} {'Parameters':<12} {'Memory (MB)':<12} {'Time (ms)':<10} {'Tokens/sec':<10}") - print("-" * 60) - - for num_layers in layer_counts: - result = depth_results[num_layers] - print(f"{num_layers:<7} {result['total_parameters']:<12,} {result['total_memory_mb']:<12.1f} " - f"{result['computation_time_ms']:<10.2f} {result['tokens_per_second']:<10.0f}") - - # Width vs depth trade-off analysis - print("\n" + "="*60) - configurations = [ - {'embed_dim': 256, 'num_heads': 8, 'num_layers': 8, 'hidden_dim': 1024}, # Deep & narrow - {'embed_dim': 512, 'num_heads': 16, 'num_layers': 4, 'hidden_dim': 2048}, # Wide & shallow - {'embed_dim': 384, 'num_heads': 12, 'num_layers': 6, 'hidden_dim': 1536}, # Balanced - ] - - width_depth_results = profiler.analyze_width_vs_depth_tradeoffs(2000000, configurations) - - # Production scaling simulation - print("\n" + "="*60) - production_results = profiler.simulate_production_scaling(['Small', 'Medium', 'Large']) - - # Systems design analysis - print("\n" + "="*60) - analyze_transformer_system_design() - - # Demonstrate realistic language model setup - print("\n" + "="*60) - print("🏗️ REALISTIC LANGUAGE MODEL DEMONSTRATION") - print("="*60) - - # Create a realistic small language model - vocab_size = 5000 - embed_dim = 512 - num_heads = 8 - num_layers = 6 - hidden_dim = 2048 - max_seq_length = 
256 - - print(f"Language model configuration:") - print(f" Vocabulary: {vocab_size:,} tokens") - print(f" Embedding dimension: {embed_dim}") - print(f" Attention heads: {num_heads}") - print(f" Transformer layers: {num_layers}") - print(f" Feed-forward dimension: {hidden_dim}") - print(f" Max sequence length: {max_seq_length}") - - # Create the model - language_model = Transformer( - vocab_size=vocab_size, - embed_dim=embed_dim, - num_heads=num_heads, - num_layers=num_layers, - hidden_dim=hidden_dim, - max_seq_length=max_seq_length, - pre_norm=True - ) - - # Analyze model characteristics - memory_stats = language_model.get_memory_usage() - - print(f"\nModel characteristics:") - print(f" Total parameters: {memory_stats['total_parameters']:,}") - print(f" Model size: {memory_stats['total_memory_mb']:.1f}MB") - print(f" Embedding table: {memory_stats['embedding_memory_mb']:.1f}MB ({memory_stats['embedding_memory_mb']/memory_stats['total_memory_mb']*100:.1f}%)") - print(f" Transformer layers: {memory_stats['transformer_blocks_memory_mb']:.1f}MB ({memory_stats['transformer_blocks_memory_mb']/memory_stats['total_memory_mb']*100:.1f}%)") - print(f" Output projection: {memory_stats['lm_head_memory_mb']:.1f}MB ({memory_stats['lm_head_memory_mb']/memory_stats['total_memory_mb']*100:.1f}%)") - - # Performance simulation - batch_size = 8 - seq_len = 128 - test_input = Tensor(np.random.randint(0, vocab_size, (batch_size, seq_len))) - - start_time = time.time() - logits = language_model.forward(test_input) - forward_time = time.time() - start_time - - tokens_per_second = (batch_size * seq_len) / forward_time - - print(f"\nPerformance simulation:") - print(f" Batch size: {batch_size}, Sequence length: {seq_len}") - print(f" Forward pass time: {forward_time*1000:.2f}ms") - print(f" Throughput: {tokens_per_second:.0f} tokens/second") - print(f" Memory for batch: {logits.data.nbytes/(1024*1024):.1f}MB") - - # Text generation example - print(f"\nText generation example:") - 
start_sequence = Tensor(np.random.randint(0, vocab_size, (1, 10))) - generated = language_model.generate(start_sequence, max_new_tokens=20, temperature=0.8) - - print(f" Input sequence: {start_sequence.data[0].tolist()}") - print(f" Generated tokens: {generated.data[0, 10:].tolist()}") - print(f" Generation completed successfully") - - # Scaling predictions - print(f"\nScaling analysis:") - current_params = memory_stats['total_parameters'] - - # Estimate for different scales - scaling_factors = [2, 5, 10] - for factor in scaling_factors: - scaled_params = current_params * factor - scaled_memory_gb = memory_stats['total_memory_mb'] * factor / 1024 - - print(f" {factor}x scale: {scaled_params/1e6:.0f}M params, ~{scaled_memory_gb:.1f}GB memory") - -# MAGNIFY SYSTEMS INSIGHT: Final Transformer Memory Scaling Analysis -def analyze_transformer_memory_scaling_final(): - """Comprehensive analysis of transformer memory scaling patterns.""" - try: - print("\n" + "="*70) - print("PROGRESS TRANSFORMER MEMORY SCALING ANALYSIS") - print("="*70) - - # Test sequence length scaling (the quadratic bottleneck) - print("MAGNIFY SEQUENCE LENGTH SCALING (Quadratic Alert!)") - embed_dim = 512 - num_heads = 8 - - # Create attention mechanism for scaling analysis - attention = MultiHeadAttention(embed_dim=embed_dim, num_heads=num_heads) - - seq_lengths = [128, 256, 512, 1024] - batch_size = 8 - - print(f"{'Seq Length':<12} {'Memory (MB)':<12} {'Time (ms)':<12} {'Memory/Token':<15}") - print("-" * 60) - - for seq_len in seq_lengths: - # Create dummy input - input_tensor = Tensor(np.random.randn(batch_size, seq_len, embed_dim)) - - # Measure memory and time - import time - start_time = time.time() - - # Forward pass - output = attention.forward(input_tensor, input_tensor, input_tensor) - - end_time = time.time() - - # Calculate metrics - memory_mb = output.data.nbytes / (1024 * 1024) - time_ms = (end_time - start_time) * 1000 - memory_per_token = memory_mb / (batch_size * seq_len) * 1024 # 
KB per token - - print(f"{seq_len:<12} {memory_mb:<12.2f} {time_ms:<12.2f} {memory_per_token:<15.2f}") - - # Break early if too slow - if time_ms > 5000: # 5 seconds - print("⚠️ Stopping analysis - sequence too long for this demo") - break - - # Model size scaling analysis - print(f"\nTARGET MODEL SIZE SCALING:") - configs = [ - ("Small", 128, 4, 4), - ("Medium", 256, 8, 6), - ("Large", 512, 16, 12), - ("XL", 1024, 32, 24) - ] - - print(f"{'Model':<8} {'Embed Dim':<10} {'Heads':<6} {'Layers':<8} {'Parameters':<12} {'Memory (GB)':<12}") - print("-" * 70) - - for name, embed_dim, num_heads, num_layers in configs: - # Estimate parameters - attention_params = num_layers * 4 * embed_dim * embed_dim # Q, K, V, O projections - ffn_params = num_layers * 2 * embed_dim * (4 * embed_dim) # Up and down projections - embed_params = 5000 * embed_dim # Vocabulary embeddings - norm_params = num_layers * 2 * embed_dim # Layer norms - - total_params = attention_params + ffn_params + embed_params + norm_params - memory_gb = total_params * 4 / (1024**3) # 4 bytes per parameter - - print(f"{name:<8} {embed_dim:<10} {num_heads:<6} {num_layers:<8} {total_params:<12,} {memory_gb:<12.2f}") - - print(f"\nTIP SCALING INSIGHTS:") - print(f" - Attention memory scales O(N²) with sequence length") - print(f" - Model parameters scale O(embed_dim²) for attention layers") - print(f" - FFN parameters scale O(embed_dim * ffn_dim) - usually dominant") - print(f" - Activation memory depends on batch size and sequence length") - print(f" - Training requires ~3x more memory than inference") - - except Exception as e: - print(f"⚠️ Error in memory scaling analysis: {e}") - - print("\n" + "="*60) - print("TARGET TRANSFORMERS MODULE COMPLETE!") - print("="*60) - print("All transformer tests passed!") - print("Complete language model architecture implemented!") - print("Ready for production deployment and optimization!") - -def analyze_transformer_memory_scaling_final_placeholder(): - """Comprehensive 
analysis of transformer memory scaling patterns.""" - try: - print("\n" + "="*70) - print("PROGRESS TRANSFORMER MEMORY SCALING ANALYSIS") - print("="*70) - - # Test sequence length scaling (the quadratic bottleneck) - print("MAGNIFY SEQUENCE LENGTH SCALING (Quadratic Alert!)") - embed_dim = 512 - num_heads = 8 - batch_size = 16 - - seq_lengths = [128, 256, 512, 1024, 2048] - - print(f"{'Seq Len':<8} {'Input (MB)':<11} {'Attention (MB)':<14} {'Total (MB)':<11} {'Scale Factor':<12}") - print("-" * 65) - - base_memory = None - for seq_len in seq_lengths: - # Input activation memory: batch * seq * embed - input_memory = batch_size * seq_len * embed_dim * 4 / (1024**2) - - # Attention matrix memory: batch * heads * seq * seq (the killer!) - attention_memory = batch_size * num_heads * seq_len * seq_len * 4 / (1024**2) - - total_memory = input_memory + attention_memory - - if base_memory is None: - base_memory = total_memory - scale_factor = 1.0 - else: - scale_factor = total_memory / base_memory - - print(f"{seq_len:<8} {input_memory:<11.2f} {attention_memory:<14.2f} {total_memory:<11.2f} {scale_factor:<12.2f}") - - print(f"\nWARNING️ QUADRATIC SCALING ALERT: 2* sequence = 4* attention memory!") - - # Model size comparison - print(f"\nMAGNIFY MODEL SIZE COMPARISON (Parameter Count)") - configs = [ - ("GPT-2 Small", 50257, 768, 12, 12, 3072), - ("GPT-2 Medium", 50257, 1024, 24, 16, 4096), - ("GPT-2 Large", 50257, 1280, 36, 20, 5120), - ("GPT-3", 50257, 12288, 96, 96, 49152), - ] - - print(f"{'Model':<12} {'Embed':<6} {'Layers':<7} {'Params':<12} {'Memory (GB)':<12}") - print("-" * 60) - - for name, vocab, embed, layers, heads, hidden in configs: - # Rough parameter calculation - # Embeddings: vocab * embed + output projection (often tied) - embedding_params = vocab * embed - # Transformer blocks: roughly 12 * embed^2 per block - block_params = layers * 12 * embed * embed - total_params = embedding_params + block_params - memory_gb = total_params * 4 / (1024**3) # fp32 - - 
params_str = f"{total_params/1e9:.1f}B" if total_params > 1e9 else f"{total_params/1e6:.0f}M" - print(f"{name:<12} {embed:<6} {layers:<7} {params_str:<12} {memory_gb:<12.1f}") - - print(f"\n📊 SCALING INSIGHTS:") - print(f" - Sequence length: O(N²) scaling due to attention matrices") - print(f" - Model parameters: O(embed_dim²) dominates for transformer blocks") - print(f" - Vocabulary size: O(vocab_size) can dominate total parameters") - print(f" - Training memory: 4-16* parameter memory (gradients + optimizer)") - - print(f"\nTIP PRODUCTION IMPLICATIONS:") - print(f" - Attention memory limits sequence length in practice") - print(f" - Large vocabularies dominate parameter count") - print(f" - Deep models need careful memory management") - print(f" - Modern techniques address these bottlenecks:") - print(f" • Sparse/Linear attention for long sequences") - print(f" • Gradient checkpointing for deep models") - print(f" • Model parallelism for large parameters") - print(f" • Mixed precision for memory efficiency") - - except Exception as e: - print(f"WARNING️ Error in scaling analysis: {e}") - -# %% [markdown] -""" -## THINK ML Systems Thinking: Interactive Questions - -Now that you've built complete transformer architectures, let's connect this work to broader ML systems challenges. These questions help you think critically about how transformer design choices affect production deployment and system performance. - -Take time to reflect thoughtfully on each question - your insights will help you understand how transformer architectures connect to real-world ML systems engineering. -""" - -# %% [markdown] -""" -### Question 1: Transformer Memory and Performance Trade-offs - -**Context**: Your transformer implementations reveal how architectural choices affect memory usage and computational complexity. In your TransformerBlock implementation, you saw how FFN parameters dominate (67% of block parameters), while attention creates O(N²) memory scaling with sequence length. 
Your memory scaling analysis showed quadratic growth with sequence length. - -**Reflection Question**: Analyze the memory and performance trade-offs in your transformer architecture. Based on your parameter counting and memory analysis, how would you modify your TransformerBlock implementation to handle sequences 4* longer while staying within the same memory budget? Consider the attention matrix scaling you observed (quadratic with sequence length) and the FFN parameter dominance you measured. What specific changes to your MultiHeadAttention and PositionwiseFeedForward classes would enable more efficient long-sequence processing, and how would these modifications affect the residual connections and layer normalization in your transformer blocks? - -Think about: attention matrix memory scaling, FFN parameter reduction strategies, efficient residual connection patterns, and layer normalization placement optimization. - -*Target length: 150-300 words* -""" - -# %% nbgrader={"grade": true, "grade_id": "question-1-architecture-optimization", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} -""" -YOUR REFLECTION ON TRANSFORMER ARCHITECTURE OPTIMIZATION: - -TODO: Replace this text with your thoughtful response about transformer architecture optimization for diverse deployment scenarios. - -Consider addressing: -- How would you allocate parameter budgets across depth, width, and attention heads for different scenarios? -- What architecture search strategies would you use to optimize within hardware constraints? -- How would you implement adaptive model scaling that adjusts to available resources? -- What approaches would you use to maintain model quality across different deployment environments? -- How would you balance latency, throughput, and resource constraints in architectural decisions? - -Write a strategic analysis connecting your transformer implementations to real architecture optimization challenges. 
- -GRADING RUBRIC (Instructor Use): -- Demonstrates understanding of transformer architecture trade-offs and optimization (3 points) -- Designs practical approaches to parameter allocation and architecture search (3 points) -- Addresses adaptive scaling and hardware-aware optimization (2 points) -- Shows systems thinking about production deployment optimization (2 points) -- Clear strategic reasoning with architecture optimization insights (bonus points for innovative approaches) -""" - -### BEGIN SOLUTION -# Student response area - instructor will replace this section during grading setup -# This is a manually graded question requiring strategic analysis of transformer architecture optimization -# Students should demonstrate understanding of architecture design and production deployment challenges -### END SOLUTION - -# %% [markdown] -""" -### Question 2: Transformer Block Stacking and Gradient Flow - -**Context**: Your Transformer class demonstrates how multiple TransformerBlock instances are stacked to create deep language models. Your implementation uses pre-norm layer normalization and residual connections in each block. The parameter breakdown you analyzed shows how memory scales linearly with depth, but training dynamics become more complex with deeper stacks. - -**Reflection Question**: Examine the gradient flow implications of your transformer block stacking approach. In your TransformerBlock.forward() implementation, you use pre-norm style (LayerNorm before sublayers) with residual connections. How does this design choice affect gradient flow compared to post-norm alternatives? If you needed to stack 96 transformer blocks (GPT-3 scale) using your current implementation, what modifications to your layer normalization placement and residual connection patterns would ensure stable training? Analyze how the "Add & Norm" pattern in your implementation enables or constrains very deep transformer training. 
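Before answering, it can help to see the two normalization placements side by side. The sketch below is a standalone NumPy toy, not the module's `TransformerBlock`: a random linear map stands in for the attention/FFN sublayer, and 24 blocks are stacked each way to compare how the residual stream behaves.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def sublayer(x, W):
    # Toy stand-in for an attention or feed-forward sublayer.
    return x @ W

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 64))
weights = [rng.normal(scale=0.5, size=(64, 64)) for _ in range(24)]

# Pre-norm: x + sublayer(norm(x)). The residual path is an identity,
# so gradients flow straight through every block, but activations
# accumulate along the stream.
pre = x.copy()
for W in weights:
    pre = pre + sublayer(layer_norm(pre), W)

# Post-norm: norm(x + sublayer(x)). Every block renormalizes the
# stream, keeping activations bounded but interposing a LayerNorm on
# every gradient path through the stack.
post = x.copy()
for W in weights:
    post = layer_norm(post + sublayer(post, W))

print(f"pre-norm activation scale:  {np.abs(pre).mean():.1f}")
print(f"post-norm activation scale: {np.abs(post).mean():.1f}")
```

The pre-norm stream grows with depth while the post-norm stream stays at unit scale; the training-stability story is the mirror image, which is why very deep stacks tend to favor pre-norm plus a single final LayerNorm.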
- -Think about: gradient flow through deep stacks, pre-norm vs post-norm trade-offs, residual connection effectiveness, and layer normalization stability patterns. - -*Target length: 150-300 words* -""" - -# %% nbgrader={"grade": true, "grade_id": "question-2-training-inference-systems", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} -""" -YOUR REFLECTION ON TRANSFORMER TRAINING AND INFERENCE SYSTEM DESIGN: - -TODO: Replace this text with your thoughtful response about transformer training and inference system architecture. - -Consider addressing: -- How would you design distributed training for billion-parameter transformers with memory constraints? -- What strategies would you use for efficient inference serving with millisecond latency requirements? -- How would you manage model deployment across heterogeneous hardware environments? -- What approaches would you use to maintain numerical stability during distributed training? -- How would you ensure consistent inference performance across different deployment targets? - -Write a system design analysis connecting your transformer implementation to large-scale training and serving challenges. 
- -GRADING RUBRIC (Instructor Use): -- Shows understanding of distributed training and inference serving challenges (3 points) -- Designs practical approaches to memory management and latency optimization (3 points) -- Addresses heterogeneous deployment and numerical stability considerations (2 points) -- Demonstrates systems thinking about training-inference system coordination (2 points) -- Clear system design reasoning with scalability insights (bonus points for comprehensive system architecture) -""" - -### BEGIN SOLUTION -# Student response area - instructor will replace this section during grading setup -# This is a manually graded question requiring system design for transformer training and inference -# Students should demonstrate knowledge of distributed systems and production deployment architecture -### END SOLUTION - -# %% [markdown] -""" -### Question 3: Complete Transformer Memory Optimization - -**Context**: Your complete Transformer model integrates token embeddings, positional encoding, stacked transformer blocks, and output projection. Your parameter breakdown analysis revealed that token embeddings often dominate parameter count (70%+ for large vocabularies), while activation memory scales with both model depth and sequence length. - -**Reflection Question**: Design memory optimization strategies for your complete transformer implementation. Based on your parameter breakdown showing embedding dominance and memory scaling analysis revealing quadratic attention costs, how would you modify your Transformer class to support models with 100K vocabulary and 4K sequence lengths within limited memory? Consider the token embedding weight sharing you implemented, the attention matrix memory scaling you measured, and the activation checkpointing opportunities in your transformer block stack. What specific changes to your forward() method and parameter organization would enable efficient training and inference at scale? 
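As a starting point for this question, the memory arithmetic can be sketched directly. The numbers below are illustrative assumptions (fp32, 100K vocabulary, 4K context, a hypothetical 1024-dimension, 16-head model), not measurements from the module's `Transformer` class:

```python
# Back-of-envelope memory math for the scenario in the question.
# All figures assume fp32 (4 bytes); attention scores are per batch element.
vocab_size, embed_dim, seq_len, num_heads, bytes_fp32 = 100_000, 1024, 4096, 16, 4

# Untied input embedding + output projection vs. one shared (tied) matrix.
untied = 2 * vocab_size * embed_dim * bytes_fp32
tied = vocab_size * embed_dim * bytes_fp32
print(f"embedding params: untied {untied / 2**20:.0f} MB, tied {tied / 2**20:.0f} MB")

# Materializing the full attention score matrix vs. processing queries
# in chunks of 256 rows at a time (recompute instead of store).
full_scores = num_heads * seq_len * seq_len * bytes_fp32
chunked_scores = num_heads * 256 * seq_len * bytes_fp32
print(f"attention scores: full {full_scores / 2**20:.0f} MB, chunked {chunked_scores / 2**20:.0f} MB")
```

At these sizes, tying the LM head to the input embedding (as GPT-2 does) halves roughly 0.8 GB of embedding weights, and chunked attention shrinks per-example score memory from about 1 GB to 64 MB per layer; activation checkpointing applies the same store-vs-recompute trade across the block stack.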
- -Think about: embedding compression techniques, attention memory reduction, activation checkpointing strategies, and parameter sharing optimization. - -*Target length: 150-300 words* -""" - -# %% nbgrader={"grade": true, "grade_id": "question-3-production-deployment", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} -""" -YOUR REFLECTION ON TRANSFORMER OPTIMIZATION AND PRODUCTION DEPLOYMENT: - -TODO: Replace this text with your thoughtful response about transformer production deployment system design. - -Consider addressing: -- How would you implement end-to-end optimization spanning tokenization through generation? -- What strategies would you use for efficient model serving with dynamic batching and request routing? -- How would you enable seamless model updates without service interruption? -- What approaches would you use to maintain pipeline modularity while optimizing holistically? -- How would you support diverse model variants and fine-tuned versions in production? - -Write a deployment analysis connecting your transformer implementation to complete production system optimization. 
- -GRADING RUBRIC (Instructor Use): -- Understands end-to-end optimization and production deployment challenges (3 points) -- Designs practical approaches to model serving and continuous deployment (3 points) -- Addresses modularity and system integration considerations (2 points) -- Shows systems thinking about holistic pipeline optimization (2 points) -- Clear deployment reasoning with production optimization insights (bonus points for innovative system design) -""" - -### BEGIN SOLUTION -# Student response area - instructor will replace this section during grading setup -# This is a manually graded question requiring understanding of production transformer deployment optimization -# Students should demonstrate knowledge of end-to-end system design and continuous deployment strategies -### END SOLUTION - -# %% [markdown] -""" -## TARGET MODULE SUMMARY: Transformers - -Congratulations! You have successfully implemented complete transformer architectures that power modern language models: - -### PASS What You Have Built -- **Layer Normalization**: Stable normalization for deep transformer training -- **Position-wise Feed-Forward**: Non-linear transformations applied to each sequence position -- **Transformer Blocks**: Complete transformer layers with attention, normalization, and residual connections -- **Complete Transformer**: Full language model with embeddings, multiple layers, and generation capability -- **Text Generation**: Autoregressive generation with proper causal masking -- **🆕 Performance Analysis**: Comprehensive scaling analysis and architectural optimization tools -- **🆕 Production Insights**: Understanding of real-world transformer deployment challenges - -### PASS Key Learning Outcomes -- **Understanding**: How transformer blocks enable powerful sequence modeling through attention and feed-forward layers -- **Implementation**: Built complete transformer architectures with proper layer organization and residual connections -- **Systems Insight**: 
How transformer depth affects memory usage, training efficiency, and model capacity -- **Performance Engineering**: Measured and analyzed transformer scaling characteristics and optimization opportunities -- **Production Context**: Understanding transformer deployment challenges and architectural trade-offs - -### PASS Technical Mastery -- **Layer Normalization**: Stabilizing deep network training with proper feature normalization -- **Residual Connections**: Enabling gradient flow through deep transformer architectures -- **Pre-norm vs Post-norm**: Understanding normalization placement effects on training stability -- **Parameter Scaling**: Understanding how transformer parameters scale with architectural choices -- **🆕 Generation Systems**: Autoregressive text generation with causal attention patterns - -### PASS Professional Skills Developed -- **Systems Architecture**: Designing complete transformer systems for production scale -- **Memory Engineering**: Understanding transformer memory scaling (O(N²) attention, parameter distribution) -- **Computational Assessment**: Parameter counting, memory analysis, and production-scale calculations -- **Performance Analysis**: Measuring and improving transformer computation and memory efficiency -- **Integration Design**: Building complete language processing pipelines from tokenization to generation - -### PASS Ready for Next Steps -Your transformer implementations and analysis provide the foundation for: -- **Advanced Language Models**: GPT, BERT, and other transformer-based architectures -- **Multi-modal Models**: Extending transformers to vision, audio, and other modalities -- **Production Optimization**: Memory optimization, distributed training, and efficient inference -- **Scale Analysis**: Understanding memory bottlenecks from small models to GPT-3 scale (175B parameters) -- **🧠 AI Applications**: Real-world language processing applications and services - -### LINK Connection to Real ML Systems -Your 
implementations mirror production systems: -- **GPT Architecture**: Your transformer matches GPT's decoder-only architecture -- **BERT Components**: Layer normalization and attention mechanisms used in BERT -- **Production Optimization**: Understanding of memory scaling, batching, and generation optimization -- **Industry Applications**: Foundation for all modern language model deployments - -### TARGET The Complete Language Model -You have built the architecture that transformed AI: -- **Before**: RNNs and CNNs limited by sequential processing and local dependencies -- **After**: Transformers enable parallel processing and global attention across entire sequences - -**Achievement Unlocked**: You now understand every component of modern language models from tokenization through generation, plus the computational trade-offs that determine their deployment constraints! - -Your complete transformer implementation provides the foundation for understanding and building modern AI systems. You've mastered the architecture that powers ChatGPT, GPT-4, BERT, and countless other AI applications - and the computational analysis skills to deploy them efficiently. - -From discrete tokens to continuous embeddings, from attention mechanisms to complete language generation - you've built the entire pipeline that enables machines to understand and generate human language. - -**🏆 Congratulations on mastering transformer architecture and computational analysis!** -""" \ No newline at end of file diff --git a/modules_old/14_profiling/README.md b/modules_old/14_profiling/README.md deleted file mode 100644 index 5054b7e0..00000000 --- a/modules_old/14_profiling/README.md +++ /dev/null @@ -1,100 +0,0 @@ -# Module 15: Profiling - Performance Detective Work - -## Overview -Become a performance detective! You just built MLPs, CNNs, and Transformers - but why is your transformer 100x slower than PyTorch? 
Build professional profiling infrastructure to reveal bottlenecks and guide optimization decisions. - -## What You'll Build -- **Timer Class**: Statistical timing with warmup runs and percentile reporting -- **Memory Profiler**: Track allocations, peak usage, and memory patterns -- **FLOP Counter**: Count operations and analyze computational complexity -- **Profiler Context**: Comprehensive profiling manager combining all tools -- **Performance Analysis**: Complete bottleneck detection and optimization guidance - -## Learning Objectives -1. **Statistical Timing**: Build robust timing infrastructure with confidence intervals -2. **Memory Analysis**: Track allocations and identify memory bottlenecks -3. **Computational Complexity**: Count FLOPs and understand scaling behavior -4. **Bottleneck Detection**: Use Amdahl's Law to identify optimization targets -5. **Systems Thinking**: Connect profiling insights to production decisions - -## Prerequisites -- Module 14: Transformers (need models to profile) -- Understanding of basic complexity analysis (O(n), O(n²)) - -## Key Concepts - -### Professional Timing Infrastructure -```python -timer = Timer() -stats = timer.measure(model.forward, warmup=3, runs=100) -# Returns: mean, std, p50, p95, p99 with confidence intervals -``` - -### Memory Profiling with tracemalloc -```python -profiler = MemoryProfiler() -stats = profiler.profile(expensive_operation) -# Tracks: baseline, peak, allocated, memory patterns -``` - -### FLOP Analysis for Architecture Comparison -```python -counter = FLOPCounter() -flops = counter.count_attention(seq_len=128, d_model=512) -# Reveals: O(n²) scaling, computational bottlenecks -``` - -### Comprehensive Profiling Context -```python -with ProfilerContext("MyModel") as profiler: - result = profiler.profile_function(model.forward, args=(input,)) -# Automatic report: timing + memory + FLOPs + insights -``` - -## Performance Insights -- **MLPs**: Linear scaling, memory efficient, excellent for 
classification -- **CNNs**: Moderate speed, vectorizable, great for spatial data -- **Transformers**: O(n²) attention scaling, memory hungry, powerful but expensive - -## Real-World Applications -- **Bottleneck Identification**: Find the 20% of code using 80% of time -- **Hardware Selection**: Use profiling data to choose CPU vs GPU -- **Cost Prediction**: Estimate infrastructure costs from FLOP counts -- **Optimization ROI**: Amdahl's Law guides where to optimize first - -## Module Structure -1. **Timer Class**: Statistical timing with warmup and confidence intervals -2. **Memory Profiler**: Allocation tracking and peak usage analysis -3. **FLOP Counter**: Operation counting for different layer types -4. **Profiler Context**: Integrated profiling with automatic reporting -5. **Architecture Comparison**: MLP vs CNN vs Transformer analysis -6. **Bottleneck Detection**: Complete model profiling and optimization guidance -7. **Systems Analysis**: Connect profiling insights to production decisions - -## Hands-On Detective Work -```python -# Reveal the transformer bottleneck -with ProfilerContext("Transformer Analysis") as profiler: - output = profiler.profile_function(transformer.forward, args=(tokens,)) - -# Result: Attention consumes 73% of compute time! 
-# Next: Optimize attention in Module 16 (Acceleration) -``` - -## Success Criteria -- ✅ Build timer with statistical rigor (warmup, percentiles, confidence intervals) -- ✅ Implement memory profiler tracking allocations and peak usage -- ✅ Create FLOP counter analyzing computational complexity -- ✅ Develop integrated profiling context for comprehensive analysis -- ✅ Identify bottlenecks using data-driven analysis - -## Systems Insights -- **Attention is O(n²)**: 2x sequence length = 4x computation -- **Memory bandwidth matters**: Large models are memory-bound, not compute-bound -- **Amdahl's Law rules**: Optimize the bottleneck first for maximum impact -- **Profiling drives decisions**: Every major ML optimization started with profiling - -## ML Systems Focus -This module teaches performance analysis as the foundation of all optimization work. You'll build the same profiling tools used to optimize GPT, BERT, and every production ML system. Understanding performance through measurement is the first step toward building efficient ML systems. - -The detective work you do here reveals the bottlenecks that Module 16 (Acceleration) will fix! \ No newline at end of file diff --git a/modules_old/14_profiling/module.yaml b/modules_old/14_profiling/module.yaml deleted file mode 100644 index 3521eb96..00000000 --- a/modules_old/14_profiling/module.yaml +++ /dev/null @@ -1,24 +0,0 @@ -description: "Build professional profiling infrastructure to measure and analyze performance.\n\ - Students learn to create timing, memory, and operation profilers that reveal\nbottlenecks\ - \ and guide optimization decisions. 
Performance detective work that \nmakes optimization\ - \ exciting through data-driven insights.\n" -difficulty: advanced -estimated_hours: 8-10 -exports: -- tinytorch.profiling -learning_objectives: -- Build accurate timing infrastructure with statistical rigor -- Implement memory profiling and allocation tracking -- Create FLOP counting for computational analysis -- Master profiling methodology for bottleneck identification -- Connect profiling insights to ML systems optimization decisions -name: Profiling -number: 15 -prerequisites: -- Module 14: Transformers (need models to profile) -skills_developed: -- Performance measurement -- Bottleneck identification -- Profiling tool development -- Statistical analysis -type: systems diff --git a/modules_old/14_profiling/profiling_dev.ipynb b/modules_old/14_profiling/profiling_dev.ipynb deleted file mode 100644 index db2f772d..00000000 --- a/modules_old/14_profiling/profiling_dev.ipynb +++ /dev/null @@ -1,2001 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "3db2fb08", - "metadata": {}, - "source": [ - "\"\"\"\n", - "Module 15: Profiling - Performance Detective Work\n", - "\n", - "Welcome to the most eye-opening module in TinyTorch! You just built MLPs, CNNs, and Transformers. \n", - "But here's the million-dollar question: **Why is your transformer 100x slower than PyTorch?**\n", - "\n", - "Time to become a performance detective and find out what's really happening under the hood.\n", - "\n", - "# 🔍 What You'll Discover\n", - "\n", - "Ever wonder why your models feel sluggish? We're about to reveal the culprits:\n", - "- Which operations are eating your CPU cycles\n", - "- Where your memory is disappearing \n", - "- How many arithmetic operations you're really doing\n", - "- The shocking performance differences between architectures\n", - "\n", - "**Spoiler Alert**: The results might surprise you. That \"simple\" attention mechanism? 
\n", - "It's probably consuming 73% of your compute time!\n", - "\n", - "# 🎯 Learning Objectives\n", - "\n", - "By the end of this module, you'll be able to:\n", - "1. **Build Professional Profilers**: Create timing, memory, and FLOP counters\n", - "2. **Identify Bottlenecks**: Find exactly what's slowing your models down\n", - "3. **Compare Architectures**: See why transformers are slow but powerful\n", - "4. **Guide Optimizations**: Use data to make smart performance decisions\n", - "\n", - "The tools you build here will be essential for Module 16 (Acceleration) when you actually fix the problems you discover.\n", - "\"\"\"\n", - "\n", - "| default_exp optimization.profiling" - ] - }, - { - "cell_type": "markdown", - "id": "78436ef4", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## Part 1: The Timer - Your First Detective Tool\n", - "\n", - "Every performance investigation starts with one question: \"How long does this actually take?\"\n", - "But timing is trickier than just `time.time()` - you need statistical rigor.\n", - "\n", - "### Why Simple Timing Fails\n", - "```python\n", - "import time\n", - "start = time.time()\n", - "result = my_function()\n", - "end = time.time()\n", - "print(f\"Took {end - start:.2f}s\") # ❌ Unreliable!\n", - "```\n", - "\n", - "**Problems:**\n", - "- First run includes \"cold start\" costs (loading code into cache) \n", - "- Single measurement captures noise, not true performance\n", - "- No confidence intervals or percentiles\n", - "- Different timing APIs have different precision" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "37bdfd3f", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "import time\n", - "import gc\n", - "import tracemalloc\n", - "from typing import Dict, List, Callable, Any, Tuple, Optional\n", - "from contextlib import contextmanager\n", - "import statistics\n", - "import sys\n", - "\n", - "# Mock imports for development\n", - "try:\n", 
- " from tinytorch.core.tensor import Tensor\n", - " from tinytorch.core.layers import Linear, ReLU, Softmax\n", - " from tinytorch.core.spatial import Conv2d, MaxPool2d\n", - " from tinytorch.core.transformers import Transformer\n", - "except ImportError:\n", - " print(\"⚠️ TinyTorch modules not available - using mocks for development\")\n", - " \n", - " class Tensor:\n", - " def __init__(self, data):\n", - " if isinstance(data, list):\n", - " self.data = data\n", - " self.shape = self._get_shape(data)\n", - " else:\n", - " self.data = [[data]]\n", - " self.shape = (1, 1)\n", - " \n", - " def _get_shape(self, data):\n", - " if not isinstance(data[0], list):\n", - " return (len(data),)\n", - " return (len(data), len(data[0]))\n", - " \n", - " class Linear:\n", - " def __init__(self, in_features, out_features):\n", - " self.weight = Tensor([[0.1] * in_features for _ in range(out_features)])\n", - " \n", - " def forward(self, x):\n", - " # Simple mock forward pass\n", - " time.sleep(0.001) # Simulate computation\n", - " return x\n", - " \n", - " class Conv2d:\n", - " def __init__(self, in_channels, out_channels, kernel_size):\n", - " self.weight = Tensor([[0.1] * in_channels for _ in range(out_channels)])\n", - " \n", - " def forward(self, x):\n", - " time.sleep(0.005) # Simulate heavier computation\n", - " return x\n", - " \n", - " class Transformer:\n", - " def __init__(self, vocab_size, d_model, n_heads, n_layers):\n", - " self.layers = [Linear(d_model, d_model) for _ in range(n_layers)]\n", - " \n", - " def forward(self, x):\n", - " time.sleep(0.02) # Simulate expensive attention\n", - " return x\n", - "\n", - "class Timer:\n", - " \"\"\"\n", - " Professional timing infrastructure with statistical rigor.\n", - " \n", - " Features:\n", - " - Warmup runs to eliminate cold start effects\n", - " - Multiple measurements for statistical confidence \n", - " - Garbage collection control to reduce noise\n", - " - Percentile reporting (p50, p95, p99)\n", - " - 
High-precision timing with best available clock\n", - " \"\"\"\n", - " \n", - " def __init__(self):\n", - " # Use the most precise timer available\n", - " self.timer_func = time.perf_counter\n", - " self.measurements = []\n", - " \n", - " def measure(self, func: Callable, warmup: int = 3, runs: int = 100, \n", - " args: tuple = (), kwargs: dict = None) -> Dict[str, float]:\n", - " \"\"\"\n", - " Measure function execution time with statistical rigor.\n", - " \n", - " Args:\n", - " func: Function to measure\n", - " warmup: Number of warmup runs (eliminate cold start)\n", - " runs: Number of measurement runs\n", - " args: Arguments to pass to function\n", - " kwargs: Keyword arguments to pass to function\n", - " \n", - " Returns:\n", - " Dict with timing statistics (mean, std, percentiles)\n", - " \"\"\"\n", - " if kwargs is None:\n", - " kwargs = {}\n", - " \n", - " self.measurements = []\n", - " \n", - " # Warmup runs to get code in CPU cache\n", - " print(f\"🔥 Running {warmup} warmup iterations...\")\n", - " for _ in range(warmup):\n", - " _ = func(*args, **kwargs)\n", - " \n", - " # Force garbage collection before timing\n", - " gc.collect()\n", - " \n", - " print(f\"⏱️ Measuring {runs} timed runs...\")\n", - " \n", - " # Actual measurements\n", - " for i in range(runs):\n", - " # Disable GC during measurement for consistency\n", - " gc_was_enabled = gc.isenabled()\n", - " gc.disable()\n", - " \n", - " try:\n", - " start_time = self.timer_func()\n", - " result = func(*args, **kwargs)\n", - " end_time = self.timer_func()\n", - " \n", - " execution_time = end_time - start_time\n", - " self.measurements.append(execution_time)\n", - " \n", - " finally:\n", - " # Restore GC state\n", - " if gc_was_enabled:\n", - " gc.enable()\n", - " \n", - " # Progress indicator for long measurements (check runs > 20 first so runs // 10 can never be zero)\n", - " if runs > 20 and i % (runs // 10) == 0:\n", - " print(f\" Progress: {i}/{runs} ({i/runs*100:.0f}%)\")\n", - " \n", - " # Calculate statistics\n", - " return 
self._compute_stats()\n", - " \n", - " def _compute_stats(self) -> Dict[str, float]:\n", - " \"\"\"Compute comprehensive timing statistics.\"\"\"\n", - " if not self.measurements:\n", - " return {}\n", - " \n", - " measurements_ms = [t * 1000 for t in self.measurements] # Convert to ms\n", - " \n", - " stats = {\n", - " 'mean_ms': statistics.mean(measurements_ms),\n", - " 'std_ms': statistics.stdev(measurements_ms) if len(measurements_ms) > 1 else 0,\n", - " 'min_ms': min(measurements_ms),\n", - " 'max_ms': max(measurements_ms),\n", - " 'p50_ms': statistics.median(measurements_ms),\n", - " 'p95_ms': self._percentile(measurements_ms, 95),\n", - " 'p99_ms': self._percentile(measurements_ms, 99),\n", - " 'runs': len(measurements_ms)\n", - " }\n", - " \n", - " return stats\n", - " \n", - " def _percentile(self, data: List[float], percentile: float) -> float:\n", - " \"\"\"Calculate percentile of data.\"\"\"\n", - " sorted_data = sorted(data)\n", - " k = (len(sorted_data) - 1) * percentile / 100\n", - " f = int(k)\n", - " c = k - f\n", - " \n", - " if f + 1 < len(sorted_data):\n", - " return sorted_data[f] * (1 - c) + sorted_data[f + 1] * c\n", - " else:\n", - " return sorted_data[f]\n", - " \n", - " def print_report(self, name: str = \"Function\"):\n", - " \"\"\"Print a formatted timing report.\"\"\"\n", - " if not self.measurements:\n", - " print(f\"❌ No measurements available for {name}\")\n", - " return\n", - " \n", - " stats = self._compute_stats()\n", - " \n", - " print(f\"\\n📊 TIMING REPORT: {name}\")\n", - " print(\"=\" * 50)\n", - " print(f\"Runs: {stats['runs']}\")\n", - " print(f\"Mean: {stats['mean_ms']:.3f} ms ± {stats['std_ms']:.3f} ms\")\n", - " print(f\"Range: {stats['min_ms']:.3f} ms → {stats['max_ms']:.3f} ms\")\n", - " print(f\"P50: {stats['p50_ms']:.3f} ms\")\n", - " print(f\"P95: {stats['p95_ms']:.3f} ms\") \n", - " print(f\"P99: {stats['p99_ms']:.3f} ms\")\n", - " \n", - " # Helpful interpretation\n", - " if stats['std_ms'] / stats['mean_ms'] > 
0.1:\n", - " print(\"⚠️ High variability - consider more warmup runs\")\n", - " else:\n", - " print(\"✅ Stable timing measurements\")" - ] - }, - { - "cell_type": "markdown", - "id": "69af65cc", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Test the Timer\n", - "\n", - "Let's test our timer on different types of operations to see the statistical rigor in action." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "90a3fbd7", - "metadata": {}, - "outputs": [], - "source": [ - "def test_timer():\n", - " \"\"\"Test the Timer class with different operation types.\"\"\"\n", - " timer = Timer()\n", - " \n", - " print(\"🔬 TIMER TESTING: Performance Detective Work\")\n", - " print(\"=\" * 60)\n", - " \n", - " # Test 1: Fast operation (should be sub-millisecond)\n", - " def fast_operation():\n", - " return sum(range(1000))\n", - " \n", - " print(\"\\n1️⃣ Fast CPU Operation (sum 1000 numbers)\")\n", - " stats = timer.measure(fast_operation, warmup=5, runs=200)\n", - " timer.print_report(\"Fast CPU Sum\")\n", - " \n", - " # Test 2: Memory allocation (intermediate speed) \n", - " def memory_operation():\n", - " data = [i * 2 for i in range(10000)]\n", - " return len(data)\n", - " \n", - " print(\"\\n2️⃣ Memory Allocation (10k list creation)\")\n", - " stats = timer.measure(memory_operation, warmup=3, runs=100)\n", - " timer.print_report(\"Memory Allocation\")\n", - " \n", - " # Test 3: Mock ML operation (slow)\n", - " linear_layer = Linear(64, 32)\n", - " mock_input = Tensor([[0.1] * 64])\n", - " \n", - " def ml_operation():\n", - " return linear_layer.forward(mock_input)\n", - " \n", - " print(\"\\n3️⃣ ML Operation (Linear layer forward pass)\")\n", - " stats = timer.measure(ml_operation, warmup=2, runs=50)\n", - " timer.print_report(\"Linear Layer Forward\")\n", - " \n", - " print(\"\\n🎯 KEY INSIGHT: Notice the different scales!\")\n", - " print(\" - CPU operations: microseconds (< 1ms)\")\n", - " 
print(\" - Memory operations: low milliseconds\") \n", - " print(\" - ML operations: higher milliseconds\")\n", - " print(\" This is why transformers feel slow!\")\n", - "\n", - "# Run the test\n", - "if __name__ == \"__main__\":\n", - " test_timer()" - ] - }, - { - "cell_type": "markdown", - "id": "bc71f289", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Part 2: Memory Profiler - The Memory Detective\n", - "\n", - "Now that we can measure time, let's track memory usage. Memory leaks and unexpected \n", - "allocations are common culprits in slow ML code.\n", - "\n", - "### Why Memory Matters for Performance\n", - "\n", - "- **Cache efficiency**: Small working sets stay in L1/L2 cache (fast)\n", - "- **Memory bandwidth**: Large transfers saturate memory bus (slow) \n", - "- **Garbage collection**: Excessive allocations trigger GC pauses\n", - "- **Swap thrashing**: Out of RAM = disk access = 1000x slower\n", - "\n", - "The memory profiler will reveal surprising allocation patterns in your models." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d1ebc725", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "class MemoryProfiler:\n", - " \"\"\"\n", - " Memory usage profiler with allocation tracking.\n", - " \n", - " Features:\n", - " - Peak memory usage during execution\n", - " - Memory allocation tracking with tracemalloc\n", - " - Memory leak detection\n", - " - Growth pattern analysis\n", - " \"\"\"\n", - " \n", - " def __init__(self):\n", - " self.baseline_memory = 0\n", - " self.peak_memory = 0\n", - " self.allocations = []\n", - " \n", - " def profile(self, func: Callable, args: tuple = (), kwargs: dict = None) -> Dict[str, Any]:\n", - " \"\"\"\n", - " Profile memory usage during function execution.\n", - " \n", - " Args:\n", - " func: Function to profile\n", - " args: Arguments to pass to function\n", - " kwargs: Keyword arguments\n", - " \n", - " Returns:\n", - " Dict with memory usage statistics\n", - " \"\"\"\n", - " if kwargs is None:\n", - " kwargs = {}\n", - " \n", - " # Start memory tracing\n", - " tracemalloc.start()\n", - " \n", - " # Record baseline\n", - " baseline_snapshot = tracemalloc.take_snapshot()\n", - " baseline_stats = baseline_snapshot.statistics('filename')\n", - " baseline_size = sum(stat.size for stat in baseline_stats)\n", - " \n", - " try:\n", - " # Execute function\n", - " result = func(*args, **kwargs)\n", - " \n", - " # Take final snapshot\n", - " final_snapshot = tracemalloc.take_snapshot()\n", - " final_stats = final_snapshot.statistics('filename')\n", - " final_size = sum(stat.size for stat in final_stats)\n", - " \n", - " # Get peak memory\n", - " current, peak = tracemalloc.get_traced_memory()\n", - " \n", - " # Stop tracing\n", - " tracemalloc.stop()\n", - " \n", - " # Compute memory statistics\n", - " memory_stats = {\n", - " 'baseline_mb': baseline_size / (1024 * 1024),\n", - " 'final_mb': final_size / (1024 * 1024), \n", - " 'peak_mb': peak / (1024 * 1024),\n", - 
" 'allocated_mb': (final_size - baseline_size) / (1024 * 1024),\n", - " 'result': result\n", - " }\n", - " \n", - " return memory_stats\n", - " \n", - " except Exception as e:\n", - " tracemalloc.stop()\n", - " raise e\n", - " \n", - " def print_report(self, stats: Dict[str, Any], name: str = \"Function\"):\n", - " \"\"\"Print formatted memory usage report.\"\"\"\n", - " print(f\"\\n🧠 MEMORY REPORT: {name}\")\n", - " print(\"=\" * 50)\n", - " print(f\"Baseline: {stats['baseline_mb']:.2f} MB\")\n", - " print(f\"Final: {stats['final_mb']:.2f} MB\")\n", - " print(f\"Peak: {stats['peak_mb']:.2f} MB\")\n", - " print(f\"Allocated: {stats['allocated_mb']:.2f} MB\")\n", - " \n", - " # Memory efficiency insights\n", - " if stats['allocated_mb'] > stats['peak_mb'] * 0.5:\n", - " print(\"⚠️ High memory allocation - check for copies\")\n", - " elif stats['allocated_mb'] < 0:\n", - " print(\"✅ Memory efficient - some cleanup occurred\")\n", - " else:\n", - " print(\"✅ Reasonable memory usage\")\n", - " \n", - " # Peak vs final analysis\n", - " peak_vs_final_ratio = stats['peak_mb'] / max(stats['final_mb'], 0.001)\n", - " if peak_vs_final_ratio > 2.0:\n", - " print(f\"💡 Peak was {peak_vs_final_ratio:.1f}x final - temporary allocations detected\")" - ] - }, - { - "cell_type": "markdown", - "id": "f9856ad4", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Test Memory Profiler\n", - "\n", - "Let's test the memory profiler on operations that have different memory patterns." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7aff4be4", - "metadata": {}, - "outputs": [], - "source": [ - "def test_memory_profiler():\n", - " \"\"\"Test memory profiling on different operation patterns.\"\"\"\n", - " profiler = MemoryProfiler()\n", - " \n", - " print(\"🧠 MEMORY PROFILER TESTING\")\n", - " print(\"=\" * 60)\n", - " \n", - " # Test 1: Small allocation\n", - " def small_allocation():\n", - " return [i for i in range(1000)]\n", - " \n", - " print(\"\\n1️⃣ Small List Creation (1k integers)\")\n", - " stats = profiler.profile(small_allocation)\n", - " profiler.print_report(stats, \"Small Allocation\")\n", - " \n", - " # Test 2: Large allocation \n", - " def large_allocation():\n", - " # Create a \"large\" tensor-like structure\n", - " return [[float(i * j) for j in range(100)] for i in range(100)]\n", - " \n", - " print(\"\\n2️⃣ Large 2D Array (100x100 floats)\")\n", - " stats = profiler.profile(large_allocation)\n", - " profiler.print_report(stats, \"Large Allocation\")\n", - " \n", - " # Test 3: Memory copying pattern\n", - " def copying_operation():\n", - " original = [i for i in range(5000)]\n", - " copy1 = original.copy()\n", - " copy2 = copy1.copy()\n", - " copy3 = copy2.copy()\n", - " return copy3\n", - " \n", - " print(\"\\n3️⃣ Memory Copying (multiple copies)\")\n", - " stats = profiler.profile(copying_operation) \n", - " profiler.print_report(stats, \"Copying Operation\")\n", - " \n", - " print(\"\\n🎯 KEY INSIGHT: Memory patterns reveal optimization opportunities!\")\n", - " print(\" - Small allocations: Usually efficient\")\n", - " print(\" - Large allocations: Watch for memory bandwidth limits\")\n", - " print(\" - Copying operations: Major performance killers\")\n", - "\n", - "# Run the test \n", - "if __name__ == \"__main__\":\n", - " test_memory_profiler()" - ] - }, - { - "cell_type": "markdown", - "id": "08ab4188", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## 
Part 3: FLOP Counter - Operation Detective\n", - "\n", - "How many arithmetic operations is your model actually doing? FLOPs (Floating Point \n", - "Operations) give you the raw computational cost independent of hardware.\n", - "\n", - "### Why Count FLOPs?\n", - "\n", - "- **Hardware comparison**: Same FLOPs = same work, regardless of CPU/GPU\n", - "- **Architecture analysis**: Compare MLP vs CNN vs Transformer efficiency \n", - "- **Scaling prediction**: Double the model = how many more FLOPs?\n", - "- **Optimization targeting**: Focus on high-FLOP operations first\n", - "\n", - "**The shocking truth**: Attention is O(n²) - a 2x longer sequence needs 4x more FLOPs!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c845e656", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "class FLOPCounter:\n", - " \"\"\"\n", - " Count floating point operations (FLOPs) in neural network operations.\n", - " \n", - " Features:\n", - " - Track multiply-accumulate (MAC) operations\n", - " - Handle different layer types (Linear, Conv2d, Attention)\n", - " - Provide operation breakdown by type\n", - " - Compare theoretical vs practical complexity\n", - " \"\"\"\n", - " \n", - " def __init__(self):\n", - " self.operation_counts = {\n", - " 'multiply': 0,\n", - " 'add': 0,\n", - " 'total_flops': 0\n", - " }\n", - " self.layer_breakdown = {}\n", - " \n", - " def reset(self):\n", - " \"\"\"Reset all counters.\"\"\"\n", - " self.operation_counts = {\n", - " 'multiply': 0,\n", - " 'add': 0, \n", - " 'total_flops': 0\n", - " }\n", - " self.layer_breakdown = {}\n", - " \n", - " def count_linear(self, input_features: int, output_features: int, batch_size: int = 1) -> int:\n", - " \"\"\"\n", - " Count FLOPs for linear layer: y = xW + b\n", - " \n", - " Args:\n", - " input_features: Number of input features\n", - " output_features: Number of output neurons\n", - " batch_size: Batch size\n", - " \n", - " Returns:\n", - " Total FLOPs for this 
operation\n", - " \"\"\"\n", - " # Matrix multiplication: (batch, in) × (in, out) = batch * in * out multiplications\n", - " multiply_ops = batch_size * input_features * output_features\n", - " \n", - " # Addition for bias: batch * out additions \n", - " add_ops = batch_size * output_features\n", - " \n", - " total_flops = multiply_ops + add_ops\n", - " \n", - " self.operation_counts['multiply'] += multiply_ops\n", - " self.operation_counts['add'] += add_ops\n", - " self.operation_counts['total_flops'] += total_flops\n", - " \n", - " self.layer_breakdown['linear'] = self.layer_breakdown.get('linear', 0) + total_flops\n", - " \n", - " return total_flops\n", - " \n", - " def count_conv2d(self, input_height: int, input_width: int, input_channels: int,\n", - " output_channels: int, kernel_size: int, batch_size: int = 1) -> int:\n", - " \"\"\"\n", - " Count FLOPs for 2D convolution.\n", - " \n", - " Args:\n", - " input_height: Input height\n", - " input_width: Input width \n", - " input_channels: Number of input channels\n", - " output_channels: Number of output channels\n", - " kernel_size: Kernel size (assumed square)\n", - " batch_size: Batch size\n", - " \n", - " Returns:\n", - " Total FLOPs for convolution\n", - " \"\"\"\n", - " # Output dimensions (assuming no padding/stride)\n", - " output_height = input_height - kernel_size + 1\n", - " output_width = input_width - kernel_size + 1\n", - " \n", - " # Each output pixel requires kernel_size² × input_channels multiplications\n", - " multiply_ops = (batch_size * output_height * output_width * \n", - " output_channels * kernel_size * kernel_size * input_channels)\n", - " \n", - " # Bias addition: one per output pixel\n", - " add_ops = batch_size * output_height * output_width * output_channels\n", - " \n", - " total_flops = multiply_ops + add_ops\n", - " \n", - " self.operation_counts['multiply'] += multiply_ops\n", - " self.operation_counts['add'] += add_ops \n", - " self.operation_counts['total_flops'] += 
total_flops\n", - " \n", - " self.layer_breakdown['conv2d'] = self.layer_breakdown.get('conv2d', 0) + total_flops\n", - " \n", - " return total_flops\n", - " \n", - " def count_attention(self, sequence_length: int, d_model: int, batch_size: int = 1) -> int:\n", - " \"\"\"\n", - " Count FLOPs for self-attention mechanism.\n", - " \n", - " Args:\n", - " sequence_length: Length of input sequence\n", - " d_model: Model dimension\n", - " batch_size: Batch size\n", - " \n", - " Returns:\n", - " Total FLOPs for attention\n", - " \"\"\"\n", - " # Q, K, V projections: 3 linear layers\n", - " qkv_flops = 3 * self.count_linear(d_model, d_model, batch_size)\n", - " \n", - " # Attention scores: Q @ K^T = (seq, d) @ (d, seq) = seq² * d\n", - " score_multiply = batch_size * sequence_length * sequence_length * d_model\n", - " \n", - " # Attention weights: softmax is approximately free compared to matmul\n", - " \n", - " # Weighted values: attention @ V = (seq, seq) @ (seq, d) = seq² * d\n", - " weighted_multiply = batch_size * sequence_length * sequence_length * d_model\n", - " \n", - " # Output projection: another linear layer\n", - " output_flops = self.count_linear(d_model, d_model, batch_size)\n", - " \n", - " attention_specific_flops = score_multiply + weighted_multiply\n", - " \n", - " self.operation_counts['multiply'] += attention_specific_flops\n", - " self.operation_counts['total_flops'] += attention_specific_flops\n", - " \n", - " total_attention_flops = attention_specific_flops + qkv_flops + output_flops\n", - " self.layer_breakdown['attention'] = self.layer_breakdown.get('attention', 0) + total_attention_flops\n", - " \n", - " return total_attention_flops\n", - " \n", - " def count_model_forward(self, model, input_shape: tuple) -> int:\n", - " \"\"\"\n", - " Estimate FLOPs for a complete model forward pass.\n", - " \n", - " Args:\n", - " model: Model to analyze\n", - " input_shape: Shape of input (batch_size, ...)\n", - " \n", - " Returns:\n", - " Total estimated 
FLOPs\n", - " \"\"\"\n", - " self.reset()\n", - " \n", - " # Simple mock analysis - in practice you'd traverse the model\n", - " if isinstance(model, Linear):\n", - " batch_size = input_shape[0] if len(input_shape) > 1 else 1\n", - " input_features = input_shape[-1] if len(input_shape) > 1 else input_shape[0]\n", - " output_features = 32 # Mock output size\n", - " return self.count_linear(input_features, output_features, batch_size)\n", - " \n", - " elif isinstance(model, Conv2d):\n", - " batch_size = input_shape[0] if len(input_shape) > 3 else 1\n", - " _, input_channels, height, width = (1, 3, 32, 32) if len(input_shape) < 4 else input_shape\n", - " return self.count_conv2d(height, width, input_channels, 16, 3, batch_size)\n", - " \n", - " elif isinstance(model, Transformer):\n", - " batch_size = input_shape[0] if len(input_shape) > 2 else 1 \n", - " seq_length = input_shape[1] if len(input_shape) > 2 else input_shape[0]\n", - " d_model = 128 # Mock model dimension\n", - " return self.count_attention(seq_length, d_model, batch_size)\n", - " \n", - " else:\n", - " # Generic estimation\n", - " return 1000000 # 1M FLOPs as placeholder\n", - " \n", - " def print_report(self, name: str = \"Model\"):\n", - " \"\"\"Print detailed FLOP analysis report.\"\"\"\n", - " print(f\"\\n🔢 FLOP ANALYSIS: {name}\")\n", - " print(\"=\" * 50)\n", - " \n", - " total_flops = self.operation_counts['total_flops']\n", - " if total_flops == 0:\n", - " print(\"❌ No FLOPs counted\")\n", - " return\n", - " \n", - " print(f\"Total FLOPs: {total_flops:,}\")\n", - " print(f\" - Multiplies: {self.operation_counts['multiply']:,}\")\n", - " print(f\" - Additions: {self.operation_counts['add']:,}\")\n", - " \n", - " # Convert to common units\n", - " if total_flops > 1e9:\n", - " print(f\" = {total_flops / 1e9:.2f} GFLOPs\")\n", - " elif total_flops > 1e6:\n", - " print(f\" = {total_flops / 1e6:.2f} MFLOPs\")\n", - " elif total_flops > 1e3:\n", - " print(f\" = {total_flops / 1e3:.2f} KFLOPs\")\n", - 
" \n", - " # Breakdown by layer type\n", - " if self.layer_breakdown:\n", - " print(\"\\nBreakdown by operation:\")\n", - " for op_type, flops in self.layer_breakdown.items():\n", - " percentage = (flops / total_flops) * 100\n", - " print(f\" {op_type:12s}: {flops:,} ({percentage:.1f}%)\")" - ] - }, - { - "cell_type": "markdown", - "id": "2af33c85", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Test FLOP Counter\n", - "\n", - "Let's count operations for different architectures and see the scaling differences." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a55678a9", - "metadata": {}, - "outputs": [], - "source": [ - "def test_flop_counter():\n", - " \"\"\"Test FLOP counting on different architectures.\"\"\"\n", - " counter = FLOPCounter()\n", - " \n", - " print(\"🔢 FLOP COUNTER TESTING - Architecture Comparison\")\n", - " print(\"=\" * 65)\n", - " \n", - " # Test 1: Simple Linear Layer (MLP building block)\n", - " print(\"\\n1️⃣ Linear Layer (64 → 32, batch=10)\")\n", - " flops = counter.count_linear(input_features=64, output_features=32, batch_size=10)\n", - " counter.print_report(\"Linear Layer\")\n", - " \n", - " # Test 2: Convolutional Layer \n", - " counter.reset()\n", - " print(\"\\n2️⃣ Conv2D Layer (32×32×3 → 16 channels, 3×3 kernel)\")\n", - " flops = counter.count_conv2d(input_height=32, input_width=32, input_channels=3,\n", - " output_channels=16, kernel_size=3, batch_size=1)\n", - " counter.print_report(\"Conv2D Layer\")\n", - " \n", - " # Test 3: Attention Mechanism\n", - " counter.reset()\n", - " print(\"\\n3️⃣ Self-Attention (seq_len=50, d_model=128)\")\n", - " flops = counter.count_attention(sequence_length=50, d_model=128, batch_size=1)\n", - " counter.print_report(\"Self-Attention\")\n", - " \n", - " # Test 4: Scaling Analysis - The Eye-Opener!\n", - " print(\"\\n4️⃣ SCALING ANALYSIS - Why Transformers Are Expensive\")\n", - " print(\"-\" * 60)\n", - " \n", - " 
sequence_lengths = [10, 50, 100, 200]\n", - " d_model = 128\n", - " \n", - " for seq_len in sequence_lengths:\n", - " counter.reset()\n", - " flops = counter.count_attention(seq_len, d_model)\n", - " mflops = flops / 1e6\n", - " print(f\"Seq Length {seq_len:3d}: {mflops:6.1f} MFLOPs\")\n", - " \n", - " print(\"\\n🚨 SHOCKING INSIGHT: Attention scales O(n²)!\")\n", - " print(\" - 2x sequence length = 4x FLOPs\")\n", - " print(\" - This is why long documents are expensive\")\n", - " print(\" - CNNs scale O(n) - much more efficient for images\")\n", - "\n", - "# Run the test\n", - "if __name__ == \"__main__\":\n", - " test_flop_counter()" - ] - }, - { - "cell_type": "markdown", - "id": "a823f150", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Part 4: Profiler Context - The Ultimate Detective Tool\n", - "\n", - "Now let's combine all our profiling tools into one easy-to-use context manager.\n", - "This is your go-to tool for comprehensive performance analysis.\n", - "\n", - "### The Complete Picture\n", - "\n", - "The context manager will give you:\n", - "- **Timing**: How long did it take?\n", - "- **Memory**: How much RAM was used?\n", - "- **FLOPs**: How much computation was done?\n", - "- **Efficiency**: FLOPs per second, memory per FLOP\n", - "\n", - "This is what you'll use to profile entire model forward passes and identify bottlenecks." 
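Before wiring the full class together, the core mechanic — one `with` block that records both wall-clock time and peak memory — can be sketched with the standard library alone. This is an illustrative sketch (the name `mini_profile` is ours, not part of the module); the `ProfilerContext` class it previews adds repeated timing runs, FLOP accounting, and report generation:

```python
import time
import tracemalloc
from contextlib import contextmanager

@contextmanager
def mini_profile(name: str = "operation"):
    """Record wall-clock time and peak memory for the enclosed block."""
    stats = {"name": name}
    # Only start/stop tracemalloc if nobody else is already tracing
    started_tracing = not tracemalloc.is_tracing()
    if started_tracing:
        tracemalloc.start()
    t0 = time.perf_counter()
    try:
        yield stats  # caller reads results from this dict after the block
    finally:
        stats["elapsed_ms"] = (time.perf_counter() - t0) * 1000
        _, peak_bytes = tracemalloc.get_traced_memory()
        stats["peak_mb"] = peak_bytes / (1024 ** 2)
        if started_tracing:
            tracemalloc.stop()

# Usage: profile a throwaway allocation
with mini_profile("list build") as stats:
    data = [i * i for i in range(100_000)]
print(f"{stats['name']}: {stats['elapsed_ms']:.2f} ms, peak {stats['peak_mb']:.2f} MB")
```

Everything `ProfilerContext` does is an elaboration of this pattern: the `__enter__`/`__exit__` pair brackets the work, and the results live on the context object.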
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f9791045", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "class ProfilerContext:\n", - " \"\"\"\n", - " Comprehensive profiling context manager.\n", - " \n", - " Combines timing, memory, and FLOP analysis into a single tool.\n", - " Perfect for profiling model forward passes and identifying bottlenecks.\n", - " \n", - " Usage:\n", - " with ProfilerContext(\"MyModel\") as profiler:\n", - " result = model.forward(input)\n", - " # Automatic report generation\n", - " \"\"\"\n", - " \n", - " def __init__(self, name: str = \"Operation\", \n", - " timing_runs: int = 10, \n", - " timing_warmup: int = 2,\n", - " enable_memory: bool = True,\n", - " enable_flops: bool = False):\n", - " \"\"\"\n", - " Initialize profiling context.\n", - " \n", - " Args:\n", - " name: Name for the operation being profiled\n", - " timing_runs: Number of timing measurements\n", - " timing_warmup: Number of warmup runs\n", - " enable_memory: Whether to profile memory usage\n", - " enable_flops: Whether to count FLOPs (manual)\n", - " \"\"\"\n", - " self.name = name\n", - " self.timing_runs = timing_runs\n", - " self.timing_warmup = timing_warmup\n", - " self.enable_memory = enable_memory\n", - " self.enable_flops = enable_flops\n", - " \n", - " # Profiling tools\n", - " self.timer = Timer()\n", - " self.memory_profiler = MemoryProfiler() if enable_memory else None\n", - " self.flop_counter = FLOPCounter() if enable_flops else None\n", - " \n", - " # Results storage\n", - " self.timing_stats = {}\n", - " self.memory_stats = {}\n", - " self.results = {}\n", - " \n", - " def __enter__(self):\n", - " \"\"\"Start profiling context.\"\"\"\n", - " print(f\"🔍 PROFILING: {self.name}\")\n", - " print(\"=\" * (len(self.name) + 12))\n", - " \n", - " if self.enable_memory:\n", - " # Start memory tracing\n", - " if not tracemalloc.is_tracing():\n", - " tracemalloc.start()\n", - " \n", - " return self\n", - 
" \n", - " def __exit__(self, exc_type, exc_val, exc_tb):\n", - " \"\"\"End profiling and generate report.\"\"\"\n", - " if exc_type is not None:\n", - " print(f\"❌ Error during profiling: {exc_val}\")\n", - " return False\n", - " \n", - " self.generate_report()\n", - " return False\n", - " \n", - " def profile_function(self, func: Callable, args: tuple = (), kwargs: dict = None):\n", - " \"\"\"\n", - " Profile a function call within the context.\n", - " \n", - " Args:\n", - " func: Function to profile \n", - " args: Function arguments\n", - " kwargs: Function keyword arguments\n", - " \n", - " Returns:\n", - " Function result\n", - " \"\"\"\n", - " if kwargs is None:\n", - " kwargs = {}\n", - " \n", - " # Memory profiling (if enabled)\n", - " if self.memory_profiler:\n", - " self.memory_stats = self.memory_profiler.profile(func, args, kwargs)\n", - " result = self.memory_stats['result']\n", - " else:\n", - " result = func(*args, **kwargs)\n", - " \n", - " # Timing profiling\n", - " self.timing_stats = self.timer.measure(\n", - " func, warmup=self.timing_warmup, runs=self.timing_runs,\n", - " args=args, kwargs=kwargs\n", - " )\n", - " \n", - " return result\n", - " \n", - " def add_flop_count(self, flops: int, breakdown: dict = None):\n", - " \"\"\"\n", - " Manually add FLOP count (since automatic counting is complex).\n", - " \n", - " Args:\n", - " flops: Total FLOP count\n", - " breakdown: Optional breakdown by operation type\n", - " \"\"\"\n", - " if self.flop_counter:\n", - " self.flop_counter.operation_counts['total_flops'] = flops\n", - " if breakdown:\n", - " self.flop_counter.layer_breakdown.update(breakdown)\n", - " \n", - " def generate_report(self):\n", - " \"\"\"Generate comprehensive profiling report.\"\"\"\n", - " print(f\"\\n📊 COMPREHENSIVE PROFILE REPORT: {self.name}\")\n", - " print(\"=\" * 70)\n", - " \n", - " # Timing report\n", - " if self.timing_stats:\n", - " mean_ms = self.timing_stats.get('mean_ms', 0)\n", - " std_ms = 
self.timing_stats.get('std_ms', 0)\n", - " print(f\"⏱️ TIMING:\")\n", - " print(f\" Average: {mean_ms:.3f} ms ± {std_ms:.3f} ms\")\n", - " print(f\" P95: {self.timing_stats.get('p95_ms', 0):.3f} ms\")\n", - " print(f\" Throughput: {1000/max(mean_ms, 0.001):.1f} ops/sec\")\n", - " \n", - " # Memory report \n", - " if self.memory_stats:\n", - " print(f\"\\n🧠 MEMORY:\")\n", - " print(f\" Peak usage: {self.memory_stats.get('peak_mb', 0):.2f} MB\")\n", - " print(f\" Allocated: {self.memory_stats.get('allocated_mb', 0):.2f} MB\")\n", - " \n", - " # FLOP report\n", - " if self.flop_counter and self.flop_counter.operation_counts['total_flops'] > 0:\n", - " total_flops = self.flop_counter.operation_counts['total_flops']\n", - " print(f\"\\n🔢 COMPUTATION:\")\n", - " print(f\" Total FLOPs: {total_flops:,}\")\n", - " \n", - " if self.timing_stats and self.timing_stats.get('mean_ms', 0) > 0:\n", - " mean_seconds = self.timing_stats['mean_ms'] / 1000\n", - " gflops_per_sec = (total_flops / 1e9) / mean_seconds\n", - " print(f\" Performance: {gflops_per_sec:.2f} GFLOPS/sec\")\n", - " \n", - " # Efficiency insights\n", - " self._print_insights()\n", - " \n", - " def _print_insights(self):\n", - " \"\"\"Print performance insights and recommendations.\"\"\"\n", - " print(f\"\\n💡 PERFORMANCE INSIGHTS:\")\n", - " \n", - " insights = []\n", - " \n", - " # Timing insights\n", - " if self.timing_stats:\n", - " mean_ms = self.timing_stats.get('mean_ms', 0)\n", - " std_ms = self.timing_stats.get('std_ms', 0)\n", - " \n", - " if mean_ms < 0.1:\n", - " insights.append(\"⚡ Very fast operation (< 0.1ms)\")\n", - " elif mean_ms < 1:\n", - " insights.append(\"✅ Fast operation (< 1ms)\") \n", - " elif mean_ms < 10:\n", - " insights.append(\"⚠️ Moderate speed (1-10ms)\")\n", - " else:\n", - " insights.append(\"🐌 Slow operation (> 10ms) - optimization target\")\n", - " \n", - " if std_ms / max(mean_ms, 0.001) > 0.2:\n", - " insights.append(\"📊 High timing variance - inconsistent performance\")\n", - 
" \n", - " # Memory insights\n", - " if self.memory_stats:\n", - " allocated_mb = self.memory_stats.get('allocated_mb', 0)\n", - " peak_mb = self.memory_stats.get('peak_mb', 0)\n", - " \n", - " if peak_mb > allocated_mb * 2:\n", - " insights.append(\"🗑️ High temporary memory usage - check for copies\")\n", - " \n", - " if allocated_mb < 0:\n", - " insights.append(\"♻️ Memory cleanup detected - good garbage collection\")\n", - " \n", - " # FLOP insights\n", - " if self.flop_counter and self.flop_counter.operation_counts['total_flops'] > 0:\n", - " if self.timing_stats:\n", - " mean_seconds = self.timing_stats.get('mean_ms', 1) / 1000\n", - " gflops_per_sec = (self.flop_counter.operation_counts['total_flops'] / 1e9) / mean_seconds\n", - " \n", - " if gflops_per_sec > 10:\n", - " insights.append(\"🚀 Excellent computational efficiency\")\n", - " elif gflops_per_sec > 1:\n", - " insights.append(\"✅ Good computational efficiency\")\n", - " else:\n", - " insights.append(\"⚠️ Low efficiency - check for bottlenecks\")\n", - " \n", - " # Print insights\n", - " for insight in insights:\n", - " print(f\" {insight}\")\n", - " \n", - " if not insights:\n", - " print(\" 📈 Run with more profiling options for insights\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4d82ca61", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "#| export\n", - "class SimpleProfiler:\n", - " \"\"\"\n", - " Simple profiler interface expected by benchmarking module.\n", - " Wrapper around the comprehensive ProfilerContext for easy use.\n", - " \"\"\"\n", - " \n", - " def __init__(self, track_memory=True, track_cpu=True):\n", - " self.track_memory = track_memory\n", - " self.track_cpu = track_cpu\n", - " self.timer = Timer()\n", - " self.memory_profiler = MemoryProfiler() if track_memory else None\n", - " \n", - " def profile(self, func, *args, name=\"operation\", warmup=True):\n", - " \"\"\"Profile a function call and return comprehensive 
results.\"\"\"\n", - "        # Time the operation (Timer.measure already performs warmup runs,\n", - "        # so honor warmup=False by skipping them instead of warming up twice)\n", - "        timing_stats = self.timer.measure(func, warmup=2 if warmup else 0, runs=10, args=args)\n", - "        \n", - "        result_dict = {\n", - "            'wall_time': timing_stats['mean_ms'] / 1000,  # Convert to seconds\n", - "            'cpu_time': timing_stats['mean_ms'] / 1000,   # Simplified\n", - "            'cpu_efficiency': 0.85,  # Mock reasonable value\n", - "            'name': name\n", - "        }\n", - "        \n", - "        # Add memory stats if enabled\n", - "        if self.memory_profiler:\n", - "            memory_stats = self.memory_profiler.profile(func, args)\n", - "            result_dict.update({\n", - "                'memory_delta_mb': memory_stats.get('allocated_mb', 0),\n", - "                'peak_memory_mb': memory_stats.get('peak_mb', 0),\n", - "                'result_size_mb': 0.1  # Mock value\n", - "            })\n", - "        \n", - "        return result_dict\n", - "\n", - "#| export \n", - "def profile_function(func, *args, **kwargs):\n", - "    \"\"\"Simple function profiler decorator/utility.\"\"\"\n", - "    profiler = SimpleProfiler()\n", - "    return profiler.profile(func, *args, **kwargs)" - ] - }, - { - "cell_type": "markdown", - "id": "7a3229c6", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### 🧪 Test Comprehensive Profiling\n", - "\n", - "Now let's use the complete profiler to analyze different model architectures. \n", - "This is where the detective work pays off - you'll see exactly why some models are fast and others are slow!"
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "369f4812", - "metadata": {}, - "outputs": [], - "source": [ - "def test_comprehensive_profiling():\n", - " \"\"\"Test comprehensive profiling on different model types.\"\"\"\n", - " \n", - " print(\"🔍 COMPREHENSIVE PROFILING - Architecture Detective Work\")\n", - " print(\"=\" * 80)\n", - " \n", - " # Test 1: Simple Linear Model (MLP)\n", - " print(\"\\n\" + \"=\"*50)\n", - " print(\"TEST 1: Multi-Layer Perceptron (MLP)\")\n", - " print(\"=\"*50)\n", - " \n", - " linear_model = Linear(128, 64)\n", - " mock_input = Tensor([[0.1] * 128 for _ in range(32)]) # Batch of 32\n", - " \n", - " with ProfilerContext(\"MLP Forward Pass\", timing_runs=50, enable_memory=True) as profiler:\n", - " result = profiler.profile_function(linear_model.forward, args=(mock_input,))\n", - " # Add manual FLOP count for this operation\n", - " flops = 32 * 128 * 64 # batch_size * input_features * output_features\n", - " profiler.add_flop_count(flops, {'linear': flops})\n", - " \n", - " # Test 2: Convolutional Model (CNN) \n", - " print(\"\\n\" + \"=\"*50)\n", - " print(\"TEST 2: Convolutional Neural Network (CNN)\")\n", - " print(\"=\"*50)\n", - " \n", - " conv_model = Conv2d(3, 16, 3)\n", - " # Mock 32x32 RGB image batch\n", - " conv_input = Tensor([[[0.1] * 32 for _ in range(32)] for _ in range(3)])\n", - " \n", - " with ProfilerContext(\"CNN Forward Pass\", timing_runs=30, enable_memory=True) as profiler:\n", - " result = profiler.profile_function(conv_model.forward, args=(conv_input,))\n", - " # FLOP count for convolution: output_pixels * kernel_ops * channels\n", - " output_size = 30 * 30 # 32-3+1 = 30\n", - " flops = output_size * 3 * 3 * 3 * 16 # output_h * output_w * kernel_h * kernel_w * in_ch * out_ch\n", - " profiler.add_flop_count(flops, {'conv2d': flops})\n", - " \n", - " # Test 3: Transformer Model\n", - " print(\"\\n\" + \"=\"*50)\n", - " print(\"TEST 3: Transformer (Attention-Based)\")\n", - " 
print(\"=\"*50)\n", - " \n", - " transformer_model = Transformer(vocab_size=1000, d_model=128, n_heads=8, n_layers=4)\n", - " # Mock sequence of tokens\n", - " seq_input = Tensor([[i] for i in range(32)]) # Sequence length 32\n", - " \n", - " with ProfilerContext(\"Transformer Forward Pass\", timing_runs=20, enable_memory=True) as profiler:\n", - " result = profiler.profile_function(transformer_model.forward, args=(seq_input,))\n", - " # Attention FLOP count: approximately seq_len² * d_model * n_heads * n_layers\n", - " attention_flops = 32 * 32 * 128 * 8 * 4 # Quadratic in sequence length!\n", - " linear_flops = 4 * (128 * 128 + 128 * 512 + 512 * 128) # Linear layers in transformer\n", - " total_flops = attention_flops + linear_flops\n", - " profiler.add_flop_count(total_flops, {\n", - " 'attention': attention_flops,\n", - " 'linear': linear_flops\n", - " })\n", - " \n", - " # Comparative Analysis\n", - " print(\"\\n\" + \"🏁\"*25)\n", - " print(\"COMPARATIVE ANALYSIS - The Big Reveal!\")\n", - " print(\"🏁\"*25)\n", - " print(\"\"\"\n", - "🎯 KEY DISCOVERIES:\n", - "\n", - "1️⃣ MLP (Linear): \n", - " - Fastest for small inputs\n", - " - Linear scaling: O(input_size × output_size)\n", - " - Excellent for final classification layers\n", - "\n", - "2️⃣ CNN (Convolutional):\n", - " - Moderate speed, excellent for spatial data \n", - " - Scaling: O(input_pixels × kernel_size)\n", - " - Hardware-friendly (vectorizable)\n", - "\n", - "3️⃣ Transformer (Attention):\n", - " - Slowest but most powerful\n", - " - Quadratic scaling: O(sequence_length²)\n", - " - Memory hungry due to attention matrices\n", - "\n", - "🚨 PERFORMANCE BOTTLENECK REVEALED:\n", - " Attention is the culprit! 
The O(n²) complexity means:\n", - "   - 2x longer sequence = 4x computation\n", - "   - 10x longer sequence = 100x computation\n", - "   - This is why GPT models are expensive to run!\n", - "\n", - "💡 OPTIMIZATION STRATEGIES:\n", - "   - MLPs: Focus on batch processing\n", - "   - CNNs: Use optimized convolution libraries \n", - "   - Transformers: Implement attention optimizations (next module!)\n", - "\"\"\")\n", - "\n", - "# Run the comprehensive test\n", - "if __name__ == \"__main__\":\n", - "    test_comprehensive_profiling()" - ] - }, - { - "cell_type": "markdown", - "id": "af66f3c0", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Part 5: Real-World Profiling - Bottleneck Detection\n", - "\n", - "Let's simulate profiling a complete neural network to see where the bottlenecks really are.\n", - "This is the kind of analysis that guides optimization decisions in production ML systems.\n", - "\n", - "### Performance Detective Workflow\n", - "\n", - "1. **Profile the whole model** - get the big picture\n", - "2. **Identify the bottleneck** - which layer is slowest?\n", - "3. **Drill down into that layer** - why is it slow?\n", - "4. **Predict optimization impact** - fix this layer = how much speedup?\n", - "\n", - "This is exactly what PyTorch's profiler and NVIDIA's Nsight do for production models."
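Step 4 of this workflow is just Amdahl's law applied to a per-layer timing profile. A minimal sketch (the timings below are hypothetical, chosen to mirror the kind of profile this section produces):

```python
def predicted_speedup(layer_times_ms: dict, target: str, factor: float) -> float:
    """Amdahl's law on a per-layer profile: overall speedup if `target`
    alone becomes `factor`x faster while every other layer is unchanged."""
    total = sum(layer_times_ms.values())
    new_total = total - layer_times_ms[target] + layer_times_ms[target] / factor
    return total / new_total

# Hypothetical per-layer forward-pass timings (ms)
times = {"input": 0.5, "conv1": 2.0, "conv2": 4.0, "attention": 15.0, "output": 0.3}

bottleneck = max(times, key=times.get)  # -> 'attention', the layer worth fixing first
print(f"Bottleneck: {bottleneck}")
print(f"4x faster {bottleneck} -> {predicted_speedup(times, bottleneck, 4):.2f}x overall")
print(f"10x faster output    -> {predicted_speedup(times, 'output', 10):.2f}x overall")
```

Notice how even a 10x speedup of a non-bottleneck layer barely moves the total, while a 4x speedup of the dominant layer roughly doubles throughput: optimization effort only pays off where the time actually goes.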
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "431d5fe8", - "metadata": {}, - "outputs": [], - "source": [ - "def simulate_complete_model_profiling():\n", - " \"\"\"\n", - " Simulate profiling a complete neural network to identify bottlenecks.\n", - " This shows the detective process used in real ML systems optimization.\n", - " \"\"\"\n", - " \n", - " print(\"🕵️ PERFORMANCE DETECTIVE: Complete Model Analysis\")\n", - " print(\"=\" * 80)\n", - " print(\"\"\"\n", - "🎯 MISSION: Find the bottleneck in our neural network\n", - "\n", - "We have a model with:\n", - "- Input processing (Linear layer)\n", - "- Feature extraction (CNN layers) \n", - "- Sequence modeling (Transformer)\n", - "- Output classification (Linear layer)\n", - "\n", - "Which component is slowing us down?\n", - "\"\"\")\n", - " \n", - " # Simulate different components with realistic timing\n", - " components = [\n", - " (\"Input Processing\", Linear(784, 256), 0.5), # Fast \n", - " (\"Conv Layer 1\", Conv2d(1, 32, 3), 2.0), # Moderate\n", - " (\"Conv Layer 2\", Conv2d(32, 64, 3), 4.0), # Slower\n", - " (\"Attention Layer\", Transformer(1000, 128, 8, 2), 15.0), # Bottleneck!\n", - " (\"Output Layer\", Linear(128, 10), 0.3) # Fast\n", - " ]\n", - " \n", - " timing_results = []\n", - " total_time = 0\n", - " \n", - " print(\"\\n📊 LAYER-BY-LAYER TIMING ANALYSIS:\")\n", - " print(\"-\" * 60)\n", - " \n", - " for name, model, base_time_ms in components:\n", - " # Simulate timing measurement with some noise\n", - " import random\n", - " measured_time = base_time_ms + random.uniform(-0.2, 0.2)\n", - " \n", - " timing_results.append((name, measured_time))\n", - " total_time += measured_time\n", - " \n", - " print(f\"{name:20s}: {measured_time:6.2f} ms\")\n", - " \n", - " print(f\"{'='*20}: {'='*6}\")\n", - " print(f\"{'TOTAL':<20s}: {total_time:6.2f} ms\")\n", - " \n", - " # Bottleneck analysis\n", - " print(f\"\\n🔍 BOTTLENECK ANALYSIS:\")\n", - " print(\"-\" * 40)\n", - " \n", - " # 
Find the slowest component\n", - " slowest_name, slowest_time = max(timing_results, key=lambda x: x[1])\n", - " bottleneck_percentage = (slowest_time / total_time) * 100\n", - " \n", - " print(f\"🚨 Primary bottleneck: {slowest_name}\")\n", - " print(f\" Time: {slowest_time:.2f} ms ({bottleneck_percentage:.1f}% of total)\")\n", - " \n", - " # Calculate optimization impact\n", - " print(f\"\\n💡 OPTIMIZATION IMPACT ANALYSIS:\")\n", - " print(\"-\" * 40)\n", - " \n", - " # If we optimize the bottleneck by different amounts\n", - " optimization_factors = [0.5, 0.25, 0.1] # 2x, 4x, 10x faster\n", - " \n", - " for factor in optimization_factors:\n", - " speedup_factor = 1 / factor\n", - " new_bottleneck_time = slowest_time * factor\n", - " new_total_time = total_time - slowest_time + new_bottleneck_time\n", - " overall_speedup = total_time / new_total_time\n", - " \n", - " print(f\"If {slowest_name} is {speedup_factor:.0f}x faster:\")\n", - " print(f\" New total time: {new_total_time:.2f} ms\")\n", - " print(f\" Overall speedup: {overall_speedup:.2f}x\")\n", - " print()\n", - " \n", - " # Memory analysis\n", - " print(\"🧠 MEMORY USAGE BREAKDOWN:\")\n", - " print(\"-\" * 40)\n", - " \n", - " memory_usage = {\n", - " \"Input Processing\": 0.5,\n", - " \"Conv Layer 1\": 2.1,\n", - " \"Conv Layer 2\": 8.4, \n", - " \"Attention Layer\": 45.2, # Memory hungry!\n", - " \"Output Layer\": 0.1\n", - " }\n", - " \n", - " total_memory = sum(memory_usage.values())\n", - " \n", - " for component, memory_mb in memory_usage.items():\n", - " percentage = (memory_mb / total_memory) * 100\n", - " print(f\"{component:20s}: {memory_mb:5.1f} MB ({percentage:4.1f}%)\")\n", - " \n", - " print(f\"{'='*20}: {'='*5}\")\n", - " print(f\"{'TOTAL':<20s}: {total_memory:5.1f} MB\")\n", - " \n", - " # Key insights\n", - " print(f\"\\n🎯 KEY PERFORMANCE INSIGHTS:\")\n", - " print(\"=\" * 50)\n", - " print(f\"\"\"\n", - "1️⃣ BOTTLENECK IDENTIFIED: {slowest_name}\n", - " - Consumes 
{bottleneck_percentage:.0f}% of execution time\n", - " - This is your #1 optimization target\n", - " \n", - "2️⃣ MEMORY HOTSPOT: Attention Layer \n", - " - Uses 80%+ of total memory\n", - " - Memory bandwidth likely limiting factor\n", - " \n", - "3️⃣ OPTIMIZATION STRATEGY:\n", - " - Focus on attention optimization first\n", - " - 4x attention speedup = {total_time / (total_time - slowest_time + slowest_time*0.25):.1f}x overall speedup\n", - " - Consider: Flash Attention, KV caching, quantization\n", - " \n", - "4️⃣ AMDAHL'S LAW IN ACTION:\n", - " - Optimizing non-bottleneck layers has minimal impact\n", - " - {slowest_name} dominates performance profile\n", - " \n", - "5️⃣ PRODUCTION IMPLICATIONS:\n", - " - Batch size limited by attention memory usage\n", - " - Inference latency dominated by attention computation \n", - " - This is why transformer serving is expensive!\n", - "\"\"\")\n", - "\n", - "# Run the bottleneck detection\n", - "if __name__ == \"__main__\":\n", - " simulate_complete_model_profiling()" - ] - }, - { - "cell_type": "markdown", - "id": "af3bbd22", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Part 6: Systems Analysis - Memory and Performance Deep Dive\n", - "\n", - "Now let's analyze the systems implications of what we've discovered. This is where profiling \n", - "becomes actionable intelligence for ML systems engineers.\n", - "\n", - "### Memory vs Computation Trade-offs\n", - "\n", - "What we've learned through profiling:\n", - "- **Attention**: High memory, high computation (O(n²) for both)\n", - "- **Convolution**: Moderate memory, moderate computation \n", - "- **Linear layers**: Low memory, low computation\n", - "\n", - "These patterns drive real-world architectural decisions." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6757febe", - "metadata": {}, - "outputs": [], - "source": [ - "def analyze_systems_implications():\n", - " \"\"\"\n", - " Analyze the systems implications of our profiling discoveries.\n", - " This connects profiling data to real-world ML systems decisions.\n", - " \"\"\"\n", - " \n", - " print(\"🏗️ SYSTEMS ANALYSIS: From Profiling to Production Decisions\")\n", - " print(\"=\" * 80)\n", - " \n", - " print(\"\"\"\n", - "🎯 PROFILING INSIGHTS → SYSTEMS DECISIONS\n", - "\n", - "Our performance detective work revealed several critical patterns.\n", - "Let's trace how these insights drive production ML systems:\n", - "\"\"\")\n", - " \n", - " # Memory scaling analysis\n", - " print(\"\\n📈 MEMORY SCALING ANALYSIS:\")\n", - " print(\"-\" * 50)\n", - " \n", - " sequence_lengths = [128, 512, 1024, 2048, 4096]\n", - " d_model = 768 # GPT-like model\n", - " \n", - " print(\"Attention Memory Usage by Sequence Length:\")\n", - " print(\"Seq Length | Memory (GB) | Notes\")\n", - " print(\"-\" * 50)\n", - " \n", - " for seq_len in sequence_lengths:\n", - " # Attention matrices: Q, K, V projections + attention scores + weighted values\n", - " qkv_memory = 3 * seq_len * d_model * 4 / (1024**3) # 4 bytes per float32\n", - " attention_scores = seq_len * seq_len * 4 / (1024**3) # O(n²) memory!\n", - " \n", - " total_memory_gb = (qkv_memory + attention_scores) * 2 # Forward + backward\n", - " \n", - " if seq_len <= 512:\n", - " note = \"✅ Practical\"\n", - " elif seq_len <= 1024:\n", - " note = \"⚠️ Expensive\"\n", - " else:\n", - " note = \"🚨 Prohibitive\"\n", - " \n", - " print(f\"{seq_len:8d} | {total_memory_gb:8.2f} | {note}\")\n", - " \n", - " print(\"\\n💡 KEY INSIGHT: Memory grows O(n²) - this is why context length is limited!\")\n", - " \n", - " # Compute scaling analysis \n", - " print(\"\\n⚡ COMPUTE SCALING ANALYSIS:\")\n", - " print(\"-\" * 50)\n", - " \n", - " print(\"FLOPs Required by Architecture (1M 
input features):\")\n", - " print(\"Architecture | FLOPs | Scaling | Use Case\")\n", - " print(\"-\" * 60)\n", - " \n", - " architectures = [\n", - " (\"Linear (MLP)\", \"1B\", \"O(n)\", \"Fast classification\"),\n", - " (\"Conv2D\", \"10B\", \"O(n)\", \"Image processing\"), \n", - " (\"Attention\", \"1T\", \"O(n²)\", \"Sequence modeling\"),\n", - " (\"Sparse Attention\", \"100B\", \"O(n log n)\", \"Long sequences\")\n", - " ]\n", - " \n", - " for arch, flops, scaling, use_case in architectures:\n", - " print(f\"{arch:12s} | {flops:8s} | {scaling:8s} | {use_case}\")\n", - " \n", - " print(\"\\n💡 INSIGHT: Attention is 1000x more expensive than linear layers!\")\n", - " \n", - " # Hardware implications\n", - " print(\"\\n🔧 HARDWARE IMPLICATIONS:\")\n", - " print(\"-\" * 40)\n", - " \n", - " print(\"\"\"\n", - "From Profiling Data → Hardware Decisions:\n", - "\n", - "1️⃣ CPU vs GPU Choice:\n", - " - Linear layers: CPU fine (low parallelism)\n", - " - Convolutions: GPU preferred (high parallelism) \n", - " - Attention: GPU essential (massive parallelism)\n", - "\n", - "2️⃣ Memory Hierarchy:\n", - " - Small models: Fit in GPU memory (fast)\n", - " - Large models: CPU-GPU transfers (slow)\n", - " - Huge models: Model sharding required\n", - "\n", - "3️⃣ Batch Size Limits:\n", - " - Memory-bound: Attention limits batch size\n", - " - Compute-bound: Can increase batch size\n", - " - Our profiling shows attention is memory-bound\n", - "\n", - "4️⃣ Inference Serving:\n", - " - MLPs: High throughput possible\n", - " - CNNs: Moderate throughput\n", - " - Transformers: Low throughput, high latency\n", - "\"\"\")\n", - " \n", - " # Real-world examples\n", - " print(\"\\n🌍 REAL-WORLD EXAMPLES:\")\n", - " print(\"-\" * 30)\n", - " \n", - " print(\"\"\"\n", - "How Our Profiling Insights Play Out in Production:\n", - "\n", - "📱 MOBILE DEPLOYMENT:\n", - " - Profiling shows: Attention uses 80% memory\n", - " - Decision: Use distilled models (smaller attention)\n", - " - Result: 10x 
memory reduction, 3x speedup\n", - "\n", - "🏢 DATACENTER SERVING: \n", - " - Profiling shows: Attention is compute bottleneck\n", - " - Decision: Use tensor parallelism across GPUs\n", - " - Result: Split attention computation, linear speedup\n", - "\n", - "⚡ EDGE DEVICES:\n", - " - Profiling shows: Memory bandwidth limited\n", - " - Decision: Quantize to INT8, cache frequent patterns\n", - " - Result: 4x memory reduction, 2x speedup\n", - "\n", - "🎯 KEY TAKEAWAY:\n", - " Profiling isn't academic - it drives billion-dollar infrastructure decisions!\n", - " Every major ML system (GPT, BERT, ResNet) was optimized using these techniques.\n", - "\"\"\")\n", - "\n", - "# Run the systems analysis\n", - "if __name__ == \"__main__\":\n", - " analyze_systems_implications()" - ] - }, - { - "cell_type": "markdown", - "id": "6cea7d76", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Part 7: Integration Testing - Putting It All Together\n", - "\n", - "Let's test our complete profiling infrastructure by analyzing a realistic neural network scenario.\n", - "This integration test validates that all our profiling tools work together seamlessly." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fce09fbd", - "metadata": {}, - "outputs": [], - "source": [ - "def integration_test_profiling_suite():\n", - " \"\"\"\n", - " Integration test for the complete profiling suite.\n", - " Tests all components working together on a realistic model.\n", - " \"\"\"\n", - " \n", - " print(\"🧪 INTEGRATION TEST: Complete Profiling Suite\")\n", - " print(\"=\" * 70)\n", - " \n", - " # Test all profilers working together\n", - " print(\"\\n1️⃣ Testing Individual Components:\")\n", - " print(\"-\" * 40)\n", - " \n", - " # Timer test\n", - " timer = Timer()\n", - " \n", - " def sample_computation():\n", - " return sum(i*i for i in range(10000))\n", - " \n", - " timing_stats = timer.measure(sample_computation, warmup=2, runs=50)\n", - " assert timing_stats['runs'] == 50\n", - " assert timing_stats['mean_ms'] > 0\n", - " print(\"✅ Timer: Working correctly\")\n", - " \n", - " # Memory profiler test\n", - " memory_profiler = MemoryProfiler()\n", - " \n", - " def memory_intensive_task():\n", - " return [i for i in range(100000)]\n", - " \n", - " memory_stats = memory_profiler.profile(memory_intensive_task)\n", - " assert memory_stats['peak_mb'] > 0\n", - " print(\"✅ Memory Profiler: Working correctly\")\n", - " \n", - " # FLOP counter test\n", - " flop_counter = FLOPCounter()\n", - " flops = flop_counter.count_linear(100, 50, batch_size=32)\n", - " assert flops == 32 * 100 * 50 + 32 * 50 # multiply + add operations\n", - " print(\"✅ FLOP Counter: Working correctly\")\n", - " \n", - " # Context manager test\n", - " print(\"\\n2️⃣ Testing Profiler Context Integration:\")\n", - " print(\"-\" * 40)\n", - " \n", - " def complex_model_simulation():\n", - " \"\"\"Simulate a complex model with multiple operations.\"\"\"\n", - " # Simulate different types of computation\n", - " linear_result = sum(i*j for i in range(100) for j in range(100)) # O(n²)\n", - " conv_result = [sum(row) for row in [[i*j for j in range(50)] for 
i in range(50)]] # Simulate convolution\n", - " attention_result = sum(i*j*k for i in range(20) for j in range(20) for k in range(20)) # O(n³) - expensive!\n", - " return linear_result + sum(conv_result) + attention_result\n", - " \n", - " with ProfilerContext(\"Complex Model Simulation\", timing_runs=20) as profiler:\n", - " result = profiler.profile_function(complex_model_simulation)\n", - " \n", - " # Add FLOP count for analysis\n", - " estimated_flops = (\n", - " 100 * 100 + # Linear operations \n", - " 50 * 50 * 10 + # Conv-like operations\n", - " 20 * 20 * 20 * 5 # Attention-like operations (expensive!)\n", - " )\n", - " profiler.add_flop_count(estimated_flops)\n", - " \n", - " print(\"✅ Profiler Context: Integration successful\")\n", - " \n", - " # Test performance comparison\n", - " print(\"\\n3️⃣ Performance Comparison Test:\")\n", - " print(\"-\" * 40)\n", - " \n", - " operations = [\n", - " (\"Fast Linear\", lambda: sum(range(1000))),\n", - " (\"Moderate Conv\", lambda: [[i*j for j in range(100)] for i in range(100)]),\n", - " (\"Slow Attention\", lambda: [[[i*j*k for k in range(10)] for j in range(10)] for i in range(10)])\n", - " ]\n", - " \n", - " results = []\n", - " \n", - " for name, operation in operations:\n", - " with ProfilerContext(name, timing_runs=30) as profiler:\n", - " profiler.profile_function(operation)\n", - " \n", - " results.append(name)\n", - " \n", - " print(\"✅ Performance Comparison: All operations profiled successfully\")\n", - " \n", - " # Validate profiling accuracy\n", - " print(\"\\n4️⃣ Profiling Accuracy Validation:\")\n", - " print(\"-\" * 40)\n", - " \n", - " # Test that timing is consistent\n", - " consistent_operation = lambda: time.sleep(0.01) # Should be ~10ms\n", - " \n", - " timing_stats = timer.measure(consistent_operation, warmup=1, runs=10)\n", - " mean_ms = timing_stats['mean_ms']\n", - " expected_ms = 10.0\n", - " \n", - " # Allow 30% tolerance for timing variability (system dependent)\n", - " tolerance = 
0.3\n", - "    relative_error = abs(mean_ms - expected_ms) / expected_ms\n", - "    if relative_error > tolerance:\n", - "        print(f\"⚠️  Timing variance higher than expected: {mean_ms:.2f}ms vs expected {expected_ms:.2f}ms (tolerance: {tolerance*100}%)\")\n", - "        print(\"   This is normal for mock operations and system-dependent timing\")\n", - "    else:\n", - "        print(\"✅ Timing Accuracy: Within acceptable tolerance\")\n", - "    \n", - "    # Test memory tracking accuracy\n", - "    def known_memory_allocation():\n", - "        # The list's pointer array alone is ~1MB (125k * 8 bytes);\n", - "        # the int objects add a few MB more, which the asserts below allow for\n", - "        return [i for i in range(125000)]\n", - "    \n", - "    memory_stats = memory_profiler.profile(known_memory_allocation)\n", - "    allocated_mb = memory_stats.get('allocated_mb', 0)\n", - "    \n", - "    # Memory allocation should be positive and reasonable\n", - "    assert allocated_mb > 0.5, f\"Memory tracking issue: {allocated_mb:.2f}MB seems too low\"\n", - "    assert allocated_mb < 10, f\"Memory tracking issue: {allocated_mb:.2f}MB seems too high\"\n", - "    print(\"✅ Memory Tracking: Reasonable accuracy\")\n", - "    \n", - "    # Final integration validation\n", - "    print(\"\\n5️⃣  End-to-End Integration Test:\")\n", - "    print(\"-\" * 40)\n", - "    \n", - "    # Simulate complete ML model profiling workflow\n", - "    class MockMLModel:\n", - "        def __init__(self):\n", - "            self.layers = [\"embedding\", \"attention\", \"mlp\", \"output\"]\n", - "        \n", - "        def forward(self, input_data):\n", - "            # Simulate different computational costs with sleeps\n", - "            time.sleep(0.001)  # Embedding: fast\n", - "            time.sleep(0.010)  # Attention: slow (the bottleneck)\n", - "            time.sleep(0.002)  # MLP: moderate\n", - "            time.sleep(0.001)  # Output: fast\n", - "            return \"model_output\"\n", - "    \n", - "    model = MockMLModel()\n", - "    mock_input = \"input_tokens\"\n", - "    \n", - "    # Profile the complete model\n", - "    with ProfilerContext(\"Complete ML Model\", timing_runs=20, enable_memory=True) as profiler:\n", - "        output = 
profiler.profile_function(model.forward, args=(mock_input,))\n", - " \n", - " # Add realistic FLOP counts\n", - " model_flops = {\n", - " 'embedding': 1000000, # 1M FLOPs\n", - " 'attention': 50000000, # 50M FLOPs (bottleneck!)\n", - " 'mlp': 10000000, # 10M FLOPs \n", - " 'output': 500000 # 0.5M FLOPs\n", - " }\n", - " \n", - " total_flops = sum(model_flops.values())\n", - " profiler.add_flop_count(total_flops, model_flops)\n", - " \n", - " print(\"✅ End-to-End: Complete workflow successful\")\n", - " \n", - " # Test SimpleProfiler interface (for Module 20 compatibility)\n", - " print(\"\\n6️⃣ SimpleProfiler Interface Test:\")\n", - " print(\"-\" * 40)\n", - " \n", - " # Test SimpleProfiler\n", - " simple_profiler = SimpleProfiler()\n", - " \n", - " def sample_computation():\n", - " import numpy as np\n", - " return np.random.randn(100, 100) @ np.random.randn(100, 100)\n", - " \n", - " try:\n", - " # Try with numpy - if available\n", - " result = simple_profiler.profile(sample_computation, name=\"Matrix Multiply\")\n", - " print(f\"SimpleProfiler result keys: {list(result.keys())}\")\n", - " assert 'wall_time' in result\n", - " assert 'cpu_time' in result\n", - " assert 'name' in result\n", - " print(\"✅ SimpleProfiler: Full functionality working\")\n", - " except ImportError:\n", - " # Fall back to simple computation if numpy not available\n", - " def simple_computation():\n", - " return sum(i*i for i in range(1000))\n", - " \n", - " result = simple_profiler.profile(simple_computation, name=\"Sum of Squares\")\n", - " print(f\"SimpleProfiler result keys: {list(result.keys())}\")\n", - " assert 'wall_time' in result\n", - " assert 'cpu_time' in result\n", - " assert 'name' in result\n", - " print(\"✅ SimpleProfiler: Basic functionality working\")\n", - " \n", - " # Test profile_function utility\n", - " try:\n", - " func_result = profile_function(sample_computation)\n", - " assert 'wall_time' in func_result\n", - " print(\"✅ profile_function utility: Working 
correctly\")\n", - " except ImportError:\n", - " def simple_computation():\n", - " return sum(i*i for i in range(1000))\n", - " func_result = profile_function(simple_computation)\n", - " assert 'wall_time' in func_result\n", - " print(\"✅ profile_function utility: Working correctly (fallback)\")\n", - " \n", - " # Success summary\n", - " print(f\"\\n🎉 INTEGRATION TEST RESULTS:\")\n", - " print(\"=\" * 50)\n", - " print(\"\"\"\n", - "✅ All profiling components working correctly\n", - "✅ Context manager integration successful \n", - "✅ Timing accuracy within acceptable range\n", - "✅ Memory tracking functioning properly\n", - "✅ FLOP counting calculations correct\n", - "✅ End-to-end workflow validated\n", - "✅ SimpleProfiler interface ready for Module 20\n", - "\n", - "🚀 PROFILING SUITE READY FOR PRODUCTION USE!\n", - "\n", - "Your profiling tools are now ready to:\n", - "- Identify bottlenecks in real models\n", - "- Guide optimization decisions\n", - "- Validate performance improvements \n", - "- Support Module 16 (Acceleration) development\n", - "- Provide SimpleProfiler interface for Module 20 (Benchmarking)\n", - "\n", - "Next step: Use these tools to profile YOUR models and find the bottlenecks!\n", - "\"\"\")\n", - "\n", - "# Run the integration test\n", - "if __name__ == \"__main__\":\n", - " integration_test_profiling_suite()" - ] - }, - { - "cell_type": "markdown", - "id": "02897c99", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🤔 ML Systems Thinking: Interactive Questions\n", - "\n", - "Now that you've built a complete profiling suite, let's think about how this applies to real ML systems engineering." - ] - }, - { - "cell_type": "markdown", - "id": "1107224a", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### Question 1: Bottleneck Analysis Strategy\n", - "\n", - "You're optimizing a production transformer model that serves 1M requests/day. 
Your profiling reveals:\n", - "- Attention computation: 45ms (70% of total time)\n", - "- Linear layers: 10ms (15% of total time) \n", - "- Activation functions: 5ms (8% of total time)\n", - "- I/O overhead: 5ms (7% of total time)\n", - "\n", - "If you can only optimize ONE component this quarter, which would you choose and why? What's the maximum theoretical speedup you could achieve?\n", - "\n", - "*Think about Amdahl's Law and real-world optimization constraints.*" - ] - }, - { - "cell_type": "markdown", - "id": "f3bac1f3", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### Question 2: Memory vs Compute Trade-offs\n", - "\n", - "Your profiling shows that a CNN model uses:\n", - "- 2GB memory with 50ms inference time on CPU\n", - "- 0.5GB memory with 200ms inference time on mobile chip\n", - "\n", - "A customer wants to deploy on mobile devices with 1GB total RAM and requires <100ms inference. \n", - "\n", - "Design an optimization strategy using your profiling insights. What techniques would you try, and in what order?\n", - "\n", - "*Consider quantization, pruning, architecture changes, and caching strategies.*" - ] - }, - { - "cell_type": "markdown", - "id": "50687569", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### Question 3: Scaling Prediction\n", - "\n", - "Your profiling reveals that attention computation scales as O(n²) with sequence length. You measured:\n", - "- 128 tokens: 10ms\n", - "- 256 tokens: 40ms \n", - "- 512 tokens: 160ms\n", - "\n", - "If you need to support 2048 tokens, predict the inference time. 
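Before reading on, the arithmetic in Question 3 can be checked mechanically. This editor-added sketch (not part of the module code) fits the constant in t = c·n² from the three measurements and extrapolates to 2048 tokens:

```python
# Measurements from the question above: tokens -> inference time in ms.
measurements = {128: 10.0, 256: 40.0, 512: 160.0}

# If t = c * n^2, every measurement yields the same constant c.
constants = [t / (n ** 2) for n, t in measurements.items()]
c = constants[0]
assert all(abs(k - c) < 1e-12 for k in constants)  # the data is exactly quadratic

predicted_ms = c * 2048 ** 2  # extrapolate to 2048 tokens
print(predicted_ms)  # 2560.0 -- 16x the tokens of the 128 case costs 256x the time
```

A purely quadratic extrapolation therefore predicts roughly 2.56 s per inference at 2048 tokens, which is exactly what motivates the sub-quadratic attention variants the question asks about.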
What optimization techniques could break this quadratic scaling?\n", - "\n", - "*Think about the mathematical relationship and alternative attention mechanisms.*" - ] - }, - { - "cell_type": "markdown", - "id": "9fabc277", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### Question 4: Production Profiling Strategy\n", - "\n", - "You're building a profiling system for a production ML platform that serves 100 different models. Your Timer class works great for development, but production has different constraints:\n", - "\n", - "- Can't add 100ms of profiling overhead per request\n", - "- Need continuous monitoring, not batch measurements\n", - "- Must handle concurrent requests and GPU operations\n", - "- Need automatic anomaly detection\n", - "\n", - "How would you modify your profiling approach for production? What are the key design trade-offs?\n", - "\n", - "*Consider sampling strategies, async profiling, and monitoring infrastructure.*" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "02726380", - "metadata": {}, - "outputs": [], - "source": [ - "if __name__ == \"__main__\":\n", - " print(\"🤔 ML Systems Thinking Questions\")\n", - " print(\"=\" * 50)\n", - " print(\"\"\"\n", - "Complete the interactive questions above to deepen your understanding of:\n", - "\n", - "1️⃣ Bottleneck Analysis Strategy\n", - " - Applying Amdahl's Law to optimization decisions\n", - " - Understanding the ROI of different optimization targets\n", - "\n", - "2️⃣ Memory vs Compute Trade-offs \n", - " - Balancing memory constraints with performance requirements\n", - " - Designing optimization strategies for resource-limited devices\n", - "\n", - "3️⃣ Scaling Prediction\n", - " - Using profiling data to predict performance at scale\n", - " - Understanding algorithmic complexity implications\n", - "\n", - "4️⃣ Production Profiling Strategy\n", - " - Adapting development tools for production constraints\n", - " - Building monitoring systems for ML 
performance\n", - "\n", - "These questions connect your profiling implementations to real-world ML systems challenges.\n", - "Answer them to master performance analysis thinking!\n", - "\"\"\")" - ] - }, - { - "cell_type": "markdown", - "id": "a1cda0e7", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🎯 MODULE SUMMARY: Profiling - Performance Detective Work\n", - "\n", - "Congratulations! You've built a comprehensive profiling suite that reveals the performance secrets of neural networks.\n", - "\n", - "### 🏆 What You Accomplished\n", - "\n", - "**1. Professional Timing Infrastructure**\n", - "- Built `Timer` class with statistical rigor\n", - "- Implemented warmup runs and percentile reporting\n", - "- Eliminated cold start effects and measurement noise\n", - "- Created reproducible performance measurements\n", - "\n", - "**2. Memory Analysis Tools**\n", - "- Developed `MemoryProfiler` with allocation tracking \n", - "- Implemented peak memory usage monitoring\n", - "- Built memory leak detection capabilities\n", - "- Connected memory patterns to performance implications\n", - "\n", - "**3. Computational Analysis**\n", - "- Created `FLOPCounter` for operation counting\n", - "- Analyzed different layer types (Linear, Conv2d, Attention)\n", - "- Revealed the O(n²) scaling problem in transformers\n", - "- Connected FLOPs to hardware efficiency\n", - "\n", - "**4. 
Integrated Profiling Context**\n", - "- Built `ProfilerContext` manager combining all tools\n", - "- Created comprehensive performance reports\n", - "- Implemented automatic insight generation\n", - "- Developed production-ready profiling workflow\n", - "\n", - "### 🔍 Key Discoveries Made\n", - "\n", - "**Architecture Performance Profiles:**\n", - "- **MLPs**: Fast, linear scaling, memory efficient\n", - "- **CNNs**: Moderate speed, excellent for spatial data\n", - "- **Transformers**: Slow but powerful, memory hungry, O(n²) scaling\n", - "\n", - "**Bottleneck Identification:**\n", - "- Attention mechanisms consume 70%+ of computation time\n", - "- Memory bandwidth often limits performance more than raw FLOPs\n", - "- O(n²) scaling makes long sequences prohibitively expensive\n", - "\n", - "**Systems Implications:**\n", - "- Profiling data drives hardware selection (CPU vs GPU)\n", - "- Memory constraints limit batch sizes in attention models\n", - "- Optimization ROI follows Amdahl's Law patterns\n", - "\n", - "### 🚀 Real-World Applications\n", - "\n", - "Your profiling tools enable:\n", - "- **Bottleneck identification** in production models\n", - "- **Optimization targeting** for maximum impact\n", - "- **Hardware selection** based on performance characteristics \n", - "- **Cost prediction** for scaling ML systems\n", - "- **Performance regression** detection in CI/CD\n", - "\n", - "### 🎯 What's Next\n", - "\n", - "Module 16 (Acceleration) will use these profiling insights to:\n", - "- Implement attention optimizations (Flash Attention patterns)\n", - "- Build efficient kernels for bottleneck operations\n", - "- Create caching strategies for memory optimization\n", - "- Develop quantization techniques for inference speedup\n", - "\n", - "**Your profiling detective work laid the foundation - now we'll fix the problems you discovered!**\n", - "\n", - "### 🏅 Systems Engineering Skills Mastered\n", - "\n", - "- **Performance measurement methodology** with 
statistical rigor\n", - "- **Bottleneck analysis** using Amdahl's Law principles \n", - "- **Memory profiling** and allocation pattern analysis\n", - "- **Computational complexity** analysis through FLOP counting\n", - "- **Production profiling** strategy design\n", - "- **Data-driven optimization** decision making\n", - "\n", - "You now have the tools to analyze any neural network and understand exactly why it's fast or slow. These are the same techniques used to optimize GPT, BERT, and every other production ML system.\n", - "\n", - "**Welcome to the ranks of ML systems performance engineers!** 🎉" - ] - } - ], - "metadata": { - "jupytext": { - "cell_metadata_filter": "-all", - "main_language": "python", - "notebook_metadata_filter": "-all" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/modules_old/14_profiling/profiling_dev.py b/modules_old/14_profiling/profiling_dev.py deleted file mode 100644 index f7bbda2f..00000000 --- a/modules_old/14_profiling/profiling_dev.py +++ /dev/null @@ -1,1821 +0,0 @@ -# %% [markdown] -""" -# Module 15: Profiling - Performance Detective Work - -Welcome to the most eye-opening module in TinyTorch! You just built MLPs, CNNs, and Transformers. -But here's the million-dollar question: **Why is your transformer 100x slower than PyTorch?** - -Time to become a performance detective and find out what's really happening under the hood. - -## MAGNIFY What You'll Discover - -Ever wonder why your models feel sluggish? We're about to reveal the culprits: -- Which operations are eating your CPU cycles -- Where your memory is disappearing -- How many arithmetic operations you're really doing -- The shocking performance differences between architectures - -**Spoiler Alert**: The results might surprise you. That "simple" attention mechanism? -It's probably consuming 73% of your compute time! - -## TARGET Learning Objectives - -By the end of this module, you'll be able to: -1. 
**Build Professional Profilers**: Create timing, memory, and FLOP counters -2. **Identify Bottlenecks**: Find exactly what's slowing your models down -3. **Compare Architectures**: See why transformers are slow but powerful -4. **Guide Optimizations**: Use data to make smart performance decisions - -The tools you build here will be essential for Module 16 (Acceleration) when you actually fix the problems you discover. -""" - -#| default_exp profiler - -# %% [markdown] -""" -## Part 1: The Timer - Your First Detective Tool - -Every performance investigation starts with one question: "How long does this actually take?" -But timing is trickier than just `time.time()` - you need statistical rigor. - -### Why Simple Timing Fails -```python -import time -start = time.time() -result = my_function() -end = time.time() -print(f"Took {end - start:.2f}s") # FAIL Unreliable! -``` - -**Problems:** -- First run includes "cold start" costs (loading code into cache) -- Single measurement captures noise, not true performance -- No confidence intervals or percentiles -- Different timing APIs have different precision -""" - -# %% -import time -import gc -import tracemalloc -from typing import Dict, List, Callable, Any, Tuple, Optional -from contextlib import contextmanager -import statistics -import sys - -# Mock imports for development -try: - from tinytorch.core.tensor import Tensor - from tinytorch.core.layers import Linear, ReLU, Softmax - from tinytorch.core.spatial import Conv2d, MaxPool2d - from tinytorch.core.transformers import Transformer -except ImportError: - print("WARNING️ TinyTorch modules not available - using mocks for development") - - class Tensor: - def __init__(self, data): - if isinstance(data, list): - self.data = data - self.shape = self._get_shape(data) - else: - self.data = [[data]] - self.shape = (1, 1) - - def _get_shape(self, data): - if not isinstance(data[0], list): - return (len(data),) - return (len(data), len(data[0])) - - class Linear: - def 
__init__(self, in_features, out_features): - self.weight = Tensor([[0.1] * in_features for _ in range(out_features)]) - - def forward(self, x): - # Simple mock forward pass - time.sleep(0.001) # Simulate computation - return x - - class Conv2d: - def __init__(self, in_channels, out_channels, kernel_size): - self.weight = Tensor([[0.1] * in_channels for _ in range(out_channels)]) - - def forward(self, x): - time.sleep(0.005) # Simulate heavier computation - return x - - class Transformer: - def __init__(self, vocab_size, d_model, n_heads, n_layers): - self.layers = [Linear(d_model, d_model) for _ in range(n_layers)] - - def forward(self, x): - time.sleep(0.02) # Simulate expensive attention - return x - -class Timer: - """ - Professional timing infrastructure with statistical rigor. - - Features: - - Warmup runs to eliminate cold start effects - - Multiple measurements for statistical confidence - - Garbage collection control to reduce noise - - Percentile reporting (p50, p95, p99) - - High-precision timing with best available clock - """ - - def __init__(self): - # Use the most precise timer available - self.timer_func = time.perf_counter - self.measurements = [] - - def measure(self, func: Callable, warmup: int = 3, runs: int = 100, - args: tuple = (), kwargs: dict = None) -> Dict[str, float]: - """ - Measure function execution time with statistical rigor. 
- - Args: - func: Function to measure - warmup: Number of warmup runs (eliminate cold start) - runs: Number of measurement runs - args: Arguments to pass to function - kwargs: Keyword arguments to pass to function - - Returns: - Dict with timing statistics (mean, std, percentiles) - """ - if kwargs is None: - kwargs = {} - - self.measurements = [] - - # Warmup runs to get code in CPU cache - print(f"FIRE Running {warmup} warmup iterations...") - for _ in range(warmup): - _ = func(*args, **kwargs) - - # Force garbage collection before timing - gc.collect() - - print(f"⏱️ Measuring {runs} timed runs...") - - # Actual measurements - for i in range(runs): - # Disable GC during measurement for consistency - gc_was_enabled = gc.isenabled() - gc.disable() - - try: - start_time = self.timer_func() - result = func(*args, **kwargs) - end_time = self.timer_func() - - execution_time = end_time - start_time - self.measurements.append(execution_time) - - finally: - # Restore GC state - if gc_was_enabled: - gc.enable() - - # Progress indicator for long measurements - # (check runs > 20 first: it short-circuits, so runs // 10 can never be zero) - if runs > 20 and i % (runs // 10) == 0: - print(f" Progress: {i}/{runs} ({i/runs*100:.0f}%)") - - # Calculate statistics - return self._compute_stats() - - def _compute_stats(self) -> Dict[str, float]: - """Compute comprehensive timing statistics.""" - if not self.measurements: - return {} - - measurements_ms = [t * 1000 for t in self.measurements] # Convert to ms - - stats = { - 'mean_ms': statistics.mean(measurements_ms), - 'std_ms': statistics.stdev(measurements_ms) if len(measurements_ms) > 1 else 0, - 'min_ms': min(measurements_ms), - 'max_ms': max(measurements_ms), - 'p50_ms': statistics.median(measurements_ms), - 'p95_ms': self._percentile(measurements_ms, 95), - 'p99_ms': self._percentile(measurements_ms, 99), - 'runs': len(measurements_ms) - } - - return stats - - def _percentile(self, data: List[float], percentile: float) -> float: - """Calculate percentile of data.""" - sorted_data = sorted(data) - k = 
(len(sorted_data) - 1) * percentile / 100 - f = int(k) - c = k - f - - if f + 1 < len(sorted_data): - return sorted_data[f] * (1 - c) + sorted_data[f + 1] * c - else: - return sorted_data[f] - - def print_report(self, name: str = "Function"): - """Print a formatted timing report.""" - if not self.measurements: - print(f"FAIL No measurements available for {name}") - return - - stats = self._compute_stats() - - print(f"\n📊 TIMING REPORT: {name}") - print("=" * 50) - print(f"Runs: {stats['runs']}") - print(f"Mean: {stats['mean_ms']:.3f} ms ± {stats['std_ms']:.3f} ms") - print(f"Range: {stats['min_ms']:.3f} ms -> {stats['max_ms']:.3f} ms") - print(f"P50: {stats['p50_ms']:.3f} ms") - print(f"P95: {stats['p95_ms']:.3f} ms") - print(f"P99: {stats['p99_ms']:.3f} ms") - - # Helpful interpretation - if stats['std_ms'] / stats['mean_ms'] > 0.1: - print("WARNING️ High variability - consider more warmup runs") - else: - print("PASS Stable timing measurements") - -# %% [markdown] -""" -### TEST Test the Timer - -Let's test our timer on different types of operations to see the statistical rigor in action. 
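The `_percentile` helper above interpolates linearly between the two nearest order statistics. A standalone sketch of the same computation (editor-added, outside the Timer class) makes it easy to spot-check:

```python
# Linear-interpolation percentile, mirroring Timer._percentile above.
def percentile(data, q):
    s = sorted(data)
    k = (len(s) - 1) * q / 100       # fractional index into the sorted sample
    f = int(k)                       # lower neighbour
    c = k - f                        # interpolation weight toward the upper neighbour
    if f + 1 < len(s):
        return s[f] * (1 - c) + s[f + 1] * c
    return s[f]

print(percentile([1, 2, 3, 4, 5], 50))  # 3.0 (the median)
print(percentile([1, 2, 3, 4, 5], 95))  # ~4.8 (between the 4 and the 5)
```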
-""" - -# %% -def test_timer(): - """Test the Timer class with different operation types.""" - timer = Timer() - - print("🔬 TIMER TESTING: Performance Detective Work") - print("=" * 60) - - # Test 1: Fast operation (should be sub-millisecond) - def fast_operation(): - return sum(range(1000)) - - print("\n1️⃣ Fast CPU Operation (sum 1000 numbers)") - stats = timer.measure(fast_operation, warmup=5, runs=200) - timer.print_report("Fast CPU Sum") - - # Test 2: Memory allocation (intermediate speed) - def memory_operation(): - data = [i * 2 for i in range(10000)] - return len(data) - - print("\n2️⃣ Memory Allocation (10k list creation)") - stats = timer.measure(memory_operation, warmup=3, runs=100) - timer.print_report("Memory Allocation") - - # Test 3: Mock ML operation (slow) - linear_layer = Linear(64, 32) - mock_input = Tensor([[0.1] * 64]) - - def ml_operation(): - return linear_layer.forward(mock_input) - - print("\n3️⃣ ML Operation (Linear layer forward pass)") - stats = timer.measure(ml_operation, warmup=2, runs=50) - timer.print_report("Linear Layer Forward") - - print("\nTARGET KEY INSIGHT: Notice the different scales!") - print(" - CPU operations: microseconds (< 1ms)") - print(" - Memory operations: low milliseconds") - print(" - ML operations: higher milliseconds") - print(" This is why transformers feel slow!") - -# Run the test -if __name__ == "__main__": - test_timer() - -# %% [markdown] -""" -## Part 2: Memory Profiler - The Memory Detective - -Now that we can measure time, let's track memory usage. Memory leaks and unexpected -allocations are common culprits in slow ML code. 
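The profiler below is built on the standard library's `tracemalloc`; stripped to its core, the pattern it wraps looks like this (an editor's sketch, with an arbitrary ~800 KB allocation as the workload):

```python
import tracemalloc

tracemalloc.start()
data = [0.0] * 100_000                           # one list of ~100k pointer slots, roughly 800 KB
current, peak = tracemalloc.get_traced_memory()  # bytes allocated since start()
tracemalloc.stop()

assert peak >= current   # the high-water mark is at least the currently live size
print(f"current={current / 1e6:.2f} MB, peak={peak / 1e6:.2f} MB")
```

The MemoryProfiler below adds snapshots and per-call bookkeeping around this same start/measure/stop cycle.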
- -### Why Memory Matters for Performance - -- **Cache efficiency**: Small working sets stay in L1/L2 cache (fast) -- **Memory bandwidth**: Large transfers saturate memory bus (slow) -- **Garbage collection**: Excessive allocations trigger GC pauses -- **Swap thrashing**: Out of RAM = disk access = 1000x slower - -The memory profiler will reveal surprising allocation patterns in your models. -""" - -# %% -class MemoryProfiler: - """ - Memory usage profiler with allocation tracking. - - Features: - - Peak memory usage during execution - - Memory allocation tracking with tracemalloc - - Memory leak detection - - Growth pattern analysis - """ - - def __init__(self): - self.baseline_memory = 0 - self.peak_memory = 0 - self.allocations = [] - - def profile(self, func: Callable, args: tuple = (), kwargs: dict = None) -> Dict[str, Any]: - """ - Profile memory usage during function execution. - - Args: - func: Function to profile - args: Arguments to pass to function - kwargs: Keyword arguments - - Returns: - Dict with memory usage statistics - """ - if kwargs is None: - kwargs = {} - - # Start memory tracing - tracemalloc.start() - - # Record baseline - baseline_snapshot = tracemalloc.take_snapshot() - baseline_stats = baseline_snapshot.statistics('filename') - baseline_size = sum(stat.size for stat in baseline_stats) - - try: - # Execute function - result = func(*args, **kwargs) - - # Take final snapshot - final_snapshot = tracemalloc.take_snapshot() - final_stats = final_snapshot.statistics('filename') - final_size = sum(stat.size for stat in final_stats) - - # Get peak memory - current, peak = tracemalloc.get_traced_memory() - - # Stop tracing - tracemalloc.stop() - - # Compute memory statistics - memory_stats = { - 'baseline_mb': baseline_size / (1024 * 1024), - 'final_mb': final_size / (1024 * 1024), - 'peak_mb': peak / (1024 * 1024), - 'allocated_mb': (final_size - baseline_size) / (1024 * 1024), - 'result': result - } - - return memory_stats - - except Exception as 
e: - tracemalloc.stop() - raise e - - def print_report(self, stats: Dict[str, Any], name: str = "Function"): - """Print formatted memory usage report.""" - print(f"\n🧠 MEMORY REPORT: {name}") - print("=" * 50) - print(f"Baseline: {stats['baseline_mb']:.2f} MB") - print(f"Final: {stats['final_mb']:.2f} MB") - print(f"Peak: {stats['peak_mb']:.2f} MB") - print(f"Allocated: {stats['allocated_mb']:.2f} MB") - - # Memory efficiency insights - if stats['allocated_mb'] > stats['peak_mb'] * 0.5: - print("WARNING️ High memory allocation - check for copies") - elif stats['allocated_mb'] < 0: - print("PASS Memory efficient - some cleanup occurred") - else: - print("PASS Reasonable memory usage") - - # Peak vs final analysis - peak_vs_final_ratio = stats['peak_mb'] / max(stats['final_mb'], 0.001) - if peak_vs_final_ratio > 2.0: - print(f"TIP Peak was {peak_vs_final_ratio:.1f}x final - temporary allocations detected") - -# %% [markdown] -""" -### TEST Test Memory Profiler - -Let's test the memory profiler on operations that have different memory patterns. 
-""" - -# %% -def test_memory_profiler(): - """Test memory profiling on different operation patterns.""" - profiler = MemoryProfiler() - - print("🧠 MEMORY PROFILER TESTING") - print("=" * 60) - - # Test 1: Small allocation - def small_allocation(): - return [i for i in range(1000)] - - print("\n1️⃣ Small List Creation (1k integers)") - stats = profiler.profile(small_allocation) - profiler.print_report(stats, "Small Allocation") - - # Test 2: Large allocation - def large_allocation(): - # Create a "large" tensor-like structure - return [[float(i * j) for j in range(100)] for i in range(100)] - - print("\n2️⃣ Large 2D Array (100x100 floats)") - stats = profiler.profile(large_allocation) - profiler.print_report(stats, "Large Allocation") - - # Test 3: Memory copying pattern - def copying_operation(): - original = [i for i in range(5000)] - copy1 = original.copy() - copy2 = copy1.copy() - copy3 = copy2.copy() - return copy3 - - print("\n3️⃣ Memory Copying (multiple copies)") - stats = profiler.profile(copying_operation) - profiler.print_report(stats, "Copying Operation") - - print("\nTARGET KEY INSIGHT: Memory patterns reveal optimization opportunities!") - print(" - Small allocations: Usually efficient") - print(" - Large allocations: Watch for memory bandwidth limits") - print(" - Copying operations: Major performance killers") - -# Run the test -if __name__ == "__main__": - test_memory_profiler() - -# %% [markdown] -""" -## Part 3: FLOP Counter - Operation Detective - -How many arithmetic operations is your model actually doing? FLOPs (Floating Point -Operations) give you the raw computational cost independent of hardware. - -### Why Count FLOPs? - -- **Hardware comparison**: Same FLOPs = same work, regardless of CPU/GPU -- **Architecture analysis**: Compare MLP vs CNN vs Transformer efficiency -- **Scaling prediction**: Double the model = how many more FLOPs? 
-- **Optimization targeting**: Focus on high-FLOP operations first - -**The shocking truth**: Attention is O(n²) - a 2x longer sequence needs 4x more FLOPs! -""" - -# %% -class FLOPCounter: - """ - Count floating point operations (FLOPs) in neural network operations. - - Features: - - Track multiply-accumulate (MAC) operations - - Handle different layer types (Linear, Conv2d, Attention) - - Provide operation breakdown by type - - Compare theoretical vs practical complexity - """ - - def __init__(self): - self.operation_counts = { - 'multiply': 0, - 'add': 0, - 'total_flops': 0 - } - self.layer_breakdown = {} - - def reset(self): - """Reset all counters.""" - self.operation_counts = { - 'multiply': 0, - 'add': 0, - 'total_flops': 0 - } - self.layer_breakdown = {} - - def count_linear(self, input_features: int, output_features: int, batch_size: int = 1) -> int: - """ - Count FLOPs for linear layer: y = xW + b - - Args: - input_features: Number of input features - output_features: Number of output neurons - batch_size: Batch size - - Returns: - Total FLOPs for this operation - """ - # Matrix multiplication: (batch, in) * (in, out) = batch * in * out multiplications - multiply_ops = batch_size * input_features * output_features - - # Addition for bias: batch * out additions - add_ops = batch_size * output_features - - total_flops = multiply_ops + add_ops - - self.operation_counts['multiply'] += multiply_ops - self.operation_counts['add'] += add_ops - self.operation_counts['total_flops'] += total_flops - - self.layer_breakdown['linear'] = self.layer_breakdown.get('linear', 0) + total_flops - - return total_flops - - def count_conv2d(self, input_height: int, input_width: int, input_channels: int, - output_channels: int, kernel_size: int, batch_size: int = 1) -> int: - """ - Count FLOPs for 2D convolution. 
- - Args: - input_height: Input height - input_width: Input width - input_channels: Number of input channels - output_channels: Number of output channels - kernel_size: Kernel size (assumed square) - batch_size: Batch size - - Returns: - Total FLOPs for convolution - """ - # Output dimensions (assuming no padding/stride) - output_height = input_height - kernel_size + 1 - output_width = input_width - kernel_size + 1 - - # Each output pixel requires kernel_size² * input_channels multiplications - multiply_ops = (batch_size * output_height * output_width * - output_channels * kernel_size * kernel_size * input_channels) - - # Bias addition: one per output pixel - add_ops = batch_size * output_height * output_width * output_channels - - total_flops = multiply_ops + add_ops - - self.operation_counts['multiply'] += multiply_ops - self.operation_counts['add'] += add_ops - self.operation_counts['total_flops'] += total_flops - - self.layer_breakdown['conv2d'] = self.layer_breakdown.get('conv2d', 0) + total_flops - - return total_flops - - def count_attention(self, sequence_length: int, d_model: int, batch_size: int = 1) -> int: - """ - Count FLOPs for self-attention mechanism. 
- - Args: - sequence_length: Length of input sequence - d_model: Model dimension - batch_size: Batch size - - Returns: - Total FLOPs for attention - """ - # Q, K, V projections: three separate linear layers. Call count_linear once - # per projection so the running counters record all three (3 * a single call - # would return the right total but only count one projection). - qkv_flops = sum(self.count_linear(d_model, d_model, batch_size) for _ in range(3)) - - # Attention scores: Q @ K^T = (seq, d) @ (d, seq) = seq² * d - score_multiply = batch_size * sequence_length * sequence_length * d_model - - # Attention weights: softmax is approximately free compared to matmul - - # Weighted values: attention @ V = (seq, seq) @ (seq, d) = seq² * d - weighted_multiply = batch_size * sequence_length * sequence_length * d_model - - # Output projection: another linear layer - output_flops = self.count_linear(d_model, d_model, batch_size) - - attention_specific_flops = score_multiply + weighted_multiply - - self.operation_counts['multiply'] += attention_specific_flops - self.operation_counts['total_flops'] += attention_specific_flops - - total_attention_flops = attention_specific_flops + qkv_flops + output_flops - self.layer_breakdown['attention'] = self.layer_breakdown.get('attention', 0) + total_attention_flops - - return total_attention_flops - - def count_model_forward(self, model, input_shape: tuple) -> int: - """ - Estimate FLOPs for a complete model forward pass. - - Args: - model: Model to analyze - input_shape: Shape of input (batch_size, ...) 
- - Returns: - Total estimated FLOPs - """ - self.reset() - - # Simple mock analysis - in practice you'd traverse the model - if isinstance(model, Linear): - batch_size = input_shape[0] if len(input_shape) > 1 else 1 - input_features = input_shape[-1] if len(input_shape) > 1 else input_shape[0] - output_features = 32 # Mock output size - return self.count_linear(input_features, output_features, batch_size) - - elif isinstance(model, Conv2d): - batch_size = input_shape[0] if len(input_shape) > 3 else 1 - _, input_channels, height, width = (1, 3, 32, 32) if len(input_shape) < 4 else input_shape - return self.count_conv2d(height, width, input_channels, 16, 3, batch_size) - - elif isinstance(model, Transformer): - batch_size = input_shape[0] if len(input_shape) > 2 else 1 - seq_length = input_shape[1] if len(input_shape) > 2 else input_shape[0] - d_model = 128 # Mock model dimension - return self.count_attention(seq_length, d_model, batch_size) - - else: - # Generic estimation - return 1000000 # 1M FLOPs as placeholder - - def print_report(self, name: str = "Model"): - """Print detailed FLOP analysis report.""" - print(f"\n🔢 FLOP ANALYSIS: {name}") - print("=" * 50) - - total_flops = self.operation_counts['total_flops'] - if total_flops == 0: - print("FAIL No FLOPs counted") - return - - print(f"Total FLOPs: {total_flops:,}") - print(f" - Multiplies: {self.operation_counts['multiply']:,}") - print(f" - Additions: {self.operation_counts['add']:,}") - - # Convert to common units - if total_flops > 1e9: - print(f" = {total_flops / 1e9:.2f} GFLOPs") - elif total_flops > 1e6: - print(f" = {total_flops / 1e6:.2f} MFLOPs") - elif total_flops > 1e3: - print(f" = {total_flops / 1e3:.2f} KFLOPs") - - # Breakdown by layer type - if self.layer_breakdown: - print("\nBreakdown by operation:") - for op_type, flops in self.layer_breakdown.items(): - percentage = (flops / total_flops) * 100 - print(f" {op_type:12s}: {flops:,} ({percentage:.1f}%)") - -# %% [markdown] -""" -### TEST Test 
FLOP Counter - -Let's count operations for different architectures and see the scaling differences. -""" - -# %% -def test_flop_counter(): - """Test FLOP counting on different architectures.""" - counter = FLOPCounter() - - print("🔢 FLOP COUNTER TESTING - Architecture Comparison") - print("=" * 65) - - # Test 1: Simple Linear Layer (MLP building block) - print("\n1️⃣ Linear Layer (64 -> 32, batch=10)") - flops = counter.count_linear(input_features=64, output_features=32, batch_size=10) - counter.print_report("Linear Layer") - - # Test 2: Convolutional Layer - counter.reset() - print("\n2️⃣ Conv2D Layer (32*32*3 -> 16 channels, 3*3 kernel)") - flops = counter.count_conv2d(input_height=32, input_width=32, input_channels=3, - output_channels=16, kernel_size=3, batch_size=1) - counter.print_report("Conv2D Layer") - - # Test 3: Attention Mechanism - counter.reset() - print("\n3️⃣ Self-Attention (seq_len=50, d_model=128)") - flops = counter.count_attention(sequence_length=50, d_model=128, batch_size=1) - counter.print_report("Self-Attention") - - # Test 4: Scaling Analysis - The Eye-Opener! - print("\n4️⃣ SCALING ANALYSIS - Why Transformers Are Expensive") - print("-" * 60) - - sequence_lengths = [10, 50, 100, 200] - d_model = 128 - - for seq_len in sequence_lengths: - counter.reset() - flops = counter.count_attention(seq_len, d_model) - mflops = flops / 1e6 - print(f"Seq Length {seq_len:3d}: {mflops:6.1f} MFLOPs") - - print("\n🚨 SHOCKING INSIGHT: Attention scales O(n²)!") - print(" - 2x sequence length = 4x FLOPs") - print(" - This is why long documents are expensive") - print(" - CNNs scale O(n) - much more efficient for images") - -# Run the test -if __name__ == "__main__": - test_flop_counter() - -# %% [markdown] -""" -## Part 4: Profiler Context - The Ultimate Detective Tool - -Now let's combine all our profiling tools into one easy-to-use context manager. -This is your go-to tool for comprehensive performance analysis. 
- -### The Complete Picture - -The context manager will give you: -- **Timing**: How long did it take? -- **Memory**: How much RAM was used? -- **FLOPs**: How much computation was done? -- **Efficiency**: FLOPs per second, memory per FLOP - -This is what you'll use to profile entire model forward passes and identify bottlenecks. -""" - -# %% -class ProfilerContext: - """ - Comprehensive profiling context manager. - - Combines timing, memory, and FLOP analysis into a single tool. - Perfect for profiling model forward passes and identifying bottlenecks. - - Usage: - with ProfilerContext("MyModel") as profiler: - result = model.forward(input) - # Automatic report generation - """ - - def __init__(self, name: str = "Operation", - timing_runs: int = 10, - timing_warmup: int = 2, - enable_memory: bool = True, - enable_flops: bool = False): - """ - Initialize profiling context. - - Args: - name: Name for the operation being profiled - timing_runs: Number of timing measurements - timing_warmup: Number of warmup runs - enable_memory: Whether to profile memory usage - enable_flops: Whether to count FLOPs (manual) - """ - self.name = name - self.timing_runs = timing_runs - self.timing_warmup = timing_warmup - self.enable_memory = enable_memory - self.enable_flops = enable_flops - - # Profiling tools - self.timer = Timer() - self.memory_profiler = MemoryProfiler() if enable_memory else None - self.flop_counter = FLOPCounter() if enable_flops else None - - # Results storage - self.timing_stats = {} - self.memory_stats = {} - self.results = {} - - def __enter__(self): - """Start profiling context.""" - print(f"MAGNIFY PROFILING: {self.name}") - print("=" * (len(self.name) + 12)) - - if self.enable_memory: - # Start memory tracing - if not tracemalloc.is_tracing(): - tracemalloc.start() - - return self - - def __exit__(self, exc_type, exc_val, exc_tb): - """End profiling and generate report.""" - if exc_type is not None: - print(f"FAIL Error during profiling: {exc_val}") - return 
False - - self.generate_report() - return False - - def profile_function(self, func: Callable, args: tuple = (), kwargs: dict = None): - """ - Profile a function call within the context. - - Args: - func: Function to profile - args: Function arguments - kwargs: Function keyword arguments - - Returns: - Function result - """ - if kwargs is None: - kwargs = {} - - # Memory profiling (if enabled) - if self.memory_profiler: - self.memory_stats = self.memory_profiler.profile(func, args, kwargs) - result = self.memory_stats['result'] - else: - result = func(*args, **kwargs) - - # Timing profiling - self.timing_stats = self.timer.measure( - func, warmup=self.timing_warmup, runs=self.timing_runs, - args=args, kwargs=kwargs - ) - - return result - - def add_flop_count(self, flops: int, breakdown: dict = None): - """ - Manually add FLOP count (since automatic counting is complex). - - Args: - flops: Total FLOP count - breakdown: Optional breakdown by operation type - """ - if self.flop_counter: - self.flop_counter.operation_counts['total_flops'] = flops - if breakdown: - self.flop_counter.layer_breakdown.update(breakdown) - - def generate_report(self): - """Generate comprehensive profiling report.""" - print(f"\n📊 COMPREHENSIVE PROFILE REPORT: {self.name}") - print("=" * 70) - - # Timing report - if self.timing_stats: - mean_ms = self.timing_stats.get('mean_ms', 0) - std_ms = self.timing_stats.get('std_ms', 0) - print(f"⏱️ TIMING:") - print(f" Average: {mean_ms:.3f} ms ± {std_ms:.3f} ms") - print(f" P95: {self.timing_stats.get('p95_ms', 0):.3f} ms") - print(f" Throughput: {1000/max(mean_ms, 0.001):.1f} ops/sec") - - # Memory report - if self.memory_stats: - print(f"\n🧠 MEMORY:") - print(f" Peak usage: {self.memory_stats.get('peak_mb', 0):.2f} MB") - print(f" Allocated: {self.memory_stats.get('allocated_mb', 0):.2f} MB") - - # FLOP report - if self.flop_counter and self.flop_counter.operation_counts['total_flops'] > 0: - total_flops = 
self.flop_counter.operation_counts['total_flops'] - print(f"\n🔢 COMPUTATION:") - print(f" Total FLOPs: {total_flops:,}") - - if self.timing_stats and self.timing_stats.get('mean_ms', 0) > 0: - mean_seconds = self.timing_stats['mean_ms'] / 1000 - gflops_per_sec = (total_flops / 1e9) / mean_seconds - print(f" Performance: {gflops_per_sec:.2f} GFLOPS/sec") - - # Efficiency insights - self._print_insights() - - def _print_insights(self): - """Print performance insights and recommendations.""" - print(f"\nTIP PERFORMANCE INSIGHTS:") - - insights = [] - - # Timing insights - if self.timing_stats: - mean_ms = self.timing_stats.get('mean_ms', 0) - std_ms = self.timing_stats.get('std_ms', 0) - - if mean_ms < 0.1: - insights.append("SPEED Very fast operation (< 0.1ms)") - elif mean_ms < 1: - insights.append("PASS Fast operation (< 1ms)") - elif mean_ms < 10: - insights.append("WARNING️ Moderate speed (1-10ms)") - else: - insights.append("🐌 Slow operation (> 10ms) - optimization target") - - if std_ms / max(mean_ms, 0.001) > 0.2: - insights.append("📊 High timing variance - inconsistent performance") - - # Memory insights - if self.memory_stats: - allocated_mb = self.memory_stats.get('allocated_mb', 0) - peak_mb = self.memory_stats.get('peak_mb', 0) - - if peak_mb > allocated_mb * 2: - insights.append("🗑️ High temporary memory usage - check for copies") - - if allocated_mb < 0: - insights.append("♻️ Memory cleanup detected - good garbage collection") - - # FLOP insights - if self.flop_counter and self.flop_counter.operation_counts['total_flops'] > 0: - if self.timing_stats: - mean_seconds = self.timing_stats.get('mean_ms', 1) / 1000 - gflops_per_sec = (self.flop_counter.operation_counts['total_flops'] / 1e9) / mean_seconds - - if gflops_per_sec > 10: - insights.append("ROCKET Excellent computational efficiency") - elif gflops_per_sec > 1: - insights.append("PASS Good computational efficiency") - else: - insights.append("WARNING️ Low efficiency - check for bottlenecks") - - # 
Print insights - for insight in insights: - print(f" {insight}") - - if not insights: - print(" PROGRESS Run with more profiling options for insights") - -# %% -#| export -class SimpleProfiler: - """ - Simple profiler interface expected by benchmarking module. - Wrapper around the comprehensive ProfilerContext for easy use. - """ - - def __init__(self, track_memory=True, track_cpu=True): - self.track_memory = track_memory - self.track_cpu = track_cpu - self.timer = Timer() - self.memory_profiler = MemoryProfiler() if track_memory else None - - def profile(self, func, *args, name="operation", warmup=True): - """Profile a function call and return comprehensive results.""" - if warmup: - # Warmup run - _ = func(*args) - - # Time the operation - timing_stats = self.timer.measure(func, warmup=2, runs=10, args=args) - - result_dict = { - 'wall_time': timing_stats['mean_ms'] / 1000, # Convert to seconds - 'cpu_time': timing_stats['mean_ms'] / 1000, # Simplified - 'cpu_efficiency': 0.85, # Mock reasonable value - 'name': name - } - - # Add memory stats if enabled - if self.memory_profiler: - memory_stats = self.memory_profiler.profile(func, args) - result_dict.update({ - 'memory_delta_mb': memory_stats.get('allocated_mb', 0), - 'peak_memory_mb': memory_stats.get('peak_mb', 0), - 'result_size_mb': 0.1 # Mock value - }) - - return result_dict - -#| export -def profile_function(func, *args, **kwargs): - """Simple function profiler decorator/utility.""" - profiler = SimpleProfiler() - return profiler.profile(func, *args, **kwargs) - -# %% [markdown] -""" -### TEST Test Comprehensive Profiling - -Now let's use the complete profiler to analyze different model architectures. -This is where the detective work pays off - you'll see exactly why some models are fast and others are slow! 
-""" - -# %% -def test_comprehensive_profiling(): - """Test comprehensive profiling on different model types.""" - - print("MAGNIFY COMPREHENSIVE PROFILING - Architecture Detective Work") - print("=" * 80) - - # Test 1: Simple Linear Model (MLP) - print("\n" + "="*50) - print("TEST 1: Multi-Layer Perceptron (MLP)") - print("="*50) - - linear_model = Linear(128, 64) - mock_input = Tensor([[0.1] * 128 for _ in range(32)]) # Batch of 32 - - with ProfilerContext("MLP Forward Pass", timing_runs=50, enable_memory=True) as profiler: - result = profiler.profile_function(linear_model.forward, args=(mock_input,)) - # Add manual FLOP count for this operation - flops = 32 * 128 * 64 # batch_size * input_features * output_features - profiler.add_flop_count(flops, {'linear': flops}) - - # Test 2: Convolutional Model (CNN) - print("\n" + "="*50) - print("TEST 2: Convolutional Neural Network (CNN)") - print("="*50) - - conv_model = Conv2d(3, 16, 3) - # Mock 32x32 RGB image batch - conv_input = Tensor([[[0.1] * 32 for _ in range(32)] for _ in range(3)]) - - with ProfilerContext("CNN Forward Pass", timing_runs=30, enable_memory=True) as profiler: - result = profiler.profile_function(conv_model.forward, args=(conv_input,)) - # FLOP count for convolution: output_pixels * kernel_ops * channels - output_size = 30 * 30 # 32-3+1 = 30 - flops = output_size * 3 * 3 * 3 * 16 # output_h * output_w * kernel_h * kernel_w * in_ch * out_ch - profiler.add_flop_count(flops, {'conv2d': flops}) - - # Test 3: Transformer Model - print("\n" + "="*50) - print("TEST 3: Transformer (Attention-Based)") - print("="*50) - - transformer_model = Transformer(vocab_size=1000, d_model=128, n_heads=8, n_layers=4) - # Mock sequence of tokens - seq_input = Tensor([[i] for i in range(32)]) # Sequence length 32 - - with ProfilerContext("Transformer Forward Pass", timing_runs=20, enable_memory=True) as profiler: - result = profiler.profile_function(transformer_model.forward, args=(seq_input,)) - # Attention FLOP count: 
approximately seq_len² * d_model * n_heads * n_layers - attention_flops = 32 * 32 * 128 * 8 * 4 # Quadratic in sequence length! - linear_flops = 4 * (128 * 128 + 128 * 512 + 512 * 128) # Linear layers in transformer - total_flops = attention_flops + linear_flops - profiler.add_flop_count(total_flops, { - 'attention': attention_flops, - 'linear': linear_flops - }) - - # Comparative Analysis - print("\n" + "🏁"*25) - print("COMPARATIVE ANALYSIS - The Big Reveal!") - print("🏁"*25) - print(""" -TARGET KEY DISCOVERIES: - -1️⃣ MLP (Linear): - - Fastest for small inputs - - Linear scaling: O(input_size * output_size) - - Excellent for final classification layers - -2️⃣ CNN (Convolutional): - - Moderate speed, excellent for spatial data - - Scaling: O(input_pixels * kernel_size) - - Hardware-friendly (vectorizable) - -3️⃣ Transformer (Attention): - - Slowest but most powerful - - Quadratic scaling: O(sequence_length²) - - Memory hungry due to attention matrices - -🚨 PERFORMANCE BOTTLENECK REVEALED: - Attention is the culprit! The O(n²) complexity means: - - 2x longer sequence = 4x computation - - 10x longer sequence = 100x computation - - This is why GPT models are expensive to run! - -TIP OPTIMIZATION STRATEGIES: - - MLPs: Focus on batch processing - - CNNs: Use optimized convolution libraries - - Transformers: Implement attention optimizations (next module!) -""") - -# Run the comprehensive test -if __name__ == "__main__": - test_comprehensive_profiling() - -# %% [markdown] -""" -## Part 5: Real-World Profiling - Bottleneck Detection - -Let's simulate profiling a complete neural network to see where the bottlenecks really are. -This is the kind of analysis that guides optimization decisions in production ML systems. - -### Performance Detective Workflow - -1. **Profile the whole model** - get the big picture -2. **Identify the bottleneck** - which layer is slowest? -3. **Drill down into that layer** - why is it slow? -4. 
**Predict optimization impact** - fix this layer = how much speedup? - -This is exactly what PyTorch's profiler and NVIDIA's NSight do for production models. -""" - -# %% -def simulate_complete_model_profiling(): - """ - Simulate profiling a complete neural network to identify bottlenecks. - This shows the detective process used in real ML systems optimization. - """ - - print("🕵️ PERFORMANCE DETECTIVE: Complete Model Analysis") - print("=" * 80) - print(""" -TARGET MISSION: Find the bottleneck in our neural network - -We have a model with: -- Input processing (Linear layer) -- Feature extraction (CNN layers) -- Sequence modeling (Transformer) -- Output classification (Linear layer) - -Which component is slowing us down? -""") - - # Simulate different components with realistic timing - components = [ - ("Input Processing", Linear(784, 256), 0.5), # Fast - ("Conv Layer 1", Conv2d(1, 32, 3), 2.0), # Moderate - ("Conv Layer 2", Conv2d(32, 64, 3), 4.0), # Slower - ("Attention Layer", Transformer(1000, 128, 8, 2), 15.0), # Bottleneck! 
- ("Output Layer", Linear(128, 10), 0.3) # Fast - ] - - timing_results = [] - total_time = 0 - - print("\n📊 LAYER-BY-LAYER TIMING ANALYSIS:") - print("-" * 60) - - import random # noise source for the simulated measurements - - for name, model, base_time_ms in components: - # Simulate timing measurement with some noise - measured_time = base_time_ms + random.uniform(-0.2, 0.2) - - timing_results.append((name, measured_time)) - total_time += measured_time - - print(f"{name:20s}: {measured_time:6.2f} ms") - - print(f"{'='*20}: {'='*6}") - print(f"{'TOTAL':<20s}: {total_time:6.2f} ms") - - # Bottleneck analysis - print(f"\nMAGNIFY BOTTLENECK ANALYSIS:") - print("-" * 40) - - # Find the slowest component - slowest_name, slowest_time = max(timing_results, key=lambda x: x[1]) - bottleneck_percentage = (slowest_time / total_time) * 100 - - print(f"🚨 Primary bottleneck: {slowest_name}") - print(f" Time: {slowest_time:.2f} ms ({bottleneck_percentage:.1f}% of total)") - - # Calculate optimization impact - print(f"\nTIP OPTIMIZATION IMPACT ANALYSIS:") - print("-" * 40) - - # If we optimize the bottleneck by different amounts - optimization_factors = [0.5, 0.25, 0.1] # 2x, 4x, 10x faster - - for factor in optimization_factors: - speedup_factor = 1 / factor - new_bottleneck_time = slowest_time * factor - new_total_time = total_time - slowest_time + new_bottleneck_time - overall_speedup = total_time / new_total_time - - print(f"If {slowest_name} is {speedup_factor:.0f}x faster:") - print(f" New total time: {new_total_time:.2f} ms") - print(f" Overall speedup: {overall_speedup:.2f}x") - print() - - # Memory analysis - print("🧠 MEMORY USAGE BREAKDOWN:") - print("-" * 40) - - memory_usage = { - "Input Processing": 0.5, - "Conv Layer 1": 2.1, - "Conv Layer 2": 8.4, - "Attention Layer": 45.2, # Memory hungry! 
- "Output Layer": 0.1 - } - - total_memory = sum(memory_usage.values()) - - for component, memory_mb in memory_usage.items(): - percentage = (memory_mb / total_memory) * 100 - print(f"{component:20s}: {memory_mb:5.1f} MB ({percentage:4.1f}%)") - - print(f"{'='*20}: {'='*5}") - print(f"{'TOTAL':<20s}: {total_memory:5.1f} MB") - - # Key insights - print(f"\nTARGET KEY PERFORMANCE INSIGHTS:") - print("=" * 50) - print(f""" -1️⃣ BOTTLENECK IDENTIFIED: {slowest_name} - - Consumes {bottleneck_percentage:.0f}% of execution time - - This is your #1 optimization target - -2️⃣ MEMORY HOTSPOT: Attention Layer - - Uses 80%+ of total memory - - Memory bandwidth likely limiting factor - -3️⃣ OPTIMIZATION STRATEGY: - - Focus on attention optimization first - - 4x attention speedup = {total_time / (total_time - slowest_time + slowest_time*0.25):.1f}x overall speedup - - Consider: Flash Attention, KV caching, quantization - -4️⃣ AMDAHL'S LAW IN ACTION: - - Optimizing non-bottleneck layers has minimal impact - - {slowest_name} dominates performance profile - -5️⃣ PRODUCTION IMPLICATIONS: - - Batch size limited by attention memory usage - - Inference latency dominated by attention computation - - This is why transformer serving is expensive! -""") - -# Run the bottleneck detection -if __name__ == "__main__": - simulate_complete_model_profiling() - -# %% [markdown] -""" -## Part 6: Systems Analysis - Memory and Performance Deep Dive - -Now let's analyze the systems implications of what we've discovered. This is where profiling -becomes actionable intelligence for ML systems engineers. - -### Memory vs Computation Trade-offs - -What we've learned through profiling: -- **Attention**: High memory, high computation (O(n²) for both) -- **Convolution**: Moderate memory, moderate computation -- **Linear layers**: Low memory, low computation - -These patterns drive real-world architectural decisions. 
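
The O(n²) vs O(n) claim above can be checked with a back-of-envelope sketch. This assumes float32 (4 bytes per element) and an illustrative channel count; the numbers are rough estimates, not measurements from our models:

```python
# Extra activation memory as input size grows (assumption: float32 = 4 bytes;
# the 64-channel figure is illustrative, not measured).
def attention_scores_mb(seq_len):
    return 4 * seq_len * seq_len / 1e6      # O(n^2) attention score matrix

def conv_activations_mb(pixels, channels=64):
    return 4 * pixels * channels / 1e6      # O(n) feature maps

for n in (256, 512, 1024):
    print(f"n={n:5d}: attention {attention_scores_mb(n):6.2f} MB, "
          f"conv {conv_activations_mb(n):6.3f} MB")
# Doubling n quadruples attention memory but only doubles conv memory.
```

The same quadratic term is what the sequence-length table in the analysis below makes concrete.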
-""" - -# %% -def analyze_systems_implications(): - """ - Analyze the systems implications of our profiling discoveries. - This connects profiling data to real-world ML systems decisions. - """ - - print("🏗️ SYSTEMS ANALYSIS: From Profiling to Production Decisions") - print("=" * 80) - - print(""" -TARGET PROFILING INSIGHTS -> SYSTEMS DECISIONS - -Our performance detective work revealed several critical patterns. -Let's trace how these insights drive production ML systems: -""") - - # Memory scaling analysis - print("\nPROGRESS MEMORY SCALING ANALYSIS:") - print("-" * 50) - - sequence_lengths = [128, 512, 1024, 2048, 4096] - d_model = 768 # GPT-like model - - print("Attention Memory Usage by Sequence Length:") - print("Seq Length | Memory (GB) | Notes") - print("-" * 50) - - for seq_len in sequence_lengths: - # Attention matrices: Q, K, V projections + attention scores + weighted values - qkv_memory = 3 * seq_len * d_model * 4 / (1024**3) # 4 bytes per float32 - attention_scores = seq_len * seq_len * 4 / (1024**3) # O(n²) memory! 
- - total_memory_gb = (qkv_memory + attention_scores) * 2 # Forward + backward - - if seq_len <= 512: - note = "PASS Practical" - elif seq_len <= 1024: - note = "WARNING️ Expensive" - else: - note = "🚨 Prohibitive" - - print(f"{seq_len:8d} | {total_memory_gb:8.2f} | {note}") - - print("\nTIP KEY INSIGHT: Memory grows O(n²) - this is why context length is limited!") - - # Compute scaling analysis - print("\nSPEED COMPUTE SCALING ANALYSIS:") - print("-" * 50) - - print("FLOPs Required by Architecture (1M input features):") - print("Architecture | FLOPs | Scaling | Use Case") - print("-" * 60) - - architectures = [ - ("Linear (MLP)", "1B", "O(n)", "Fast classification"), - ("Conv2D", "10B", "O(n)", "Image processing"), - ("Attention", "1T", "O(n²)", "Sequence modeling"), - ("Sparse Attention", "100B", "O(n log n)", "Long sequences") - ] - - for arch, flops, scaling, use_case in architectures: - print(f"{arch:12s} | {flops:8s} | {scaling:8s} | {use_case}") - - print("\nTIP INSIGHT: Attention is 1000x more expensive than linear layers!") - - # Hardware implications - print("\n🔧 HARDWARE IMPLICATIONS:") - print("-" * 40) - - print(""" -From Profiling Data -> Hardware Decisions: - -1️⃣ CPU vs GPU Choice: - - Linear layers: CPU fine (low parallelism) - - Convolutions: GPU preferred (high parallelism) - - Attention: GPU essential (massive parallelism) - -2️⃣ Memory Hierarchy: - - Small models: Fit in GPU memory (fast) - - Large models: CPU-GPU transfers (slow) - - Huge models: Model sharding required - -3️⃣ Batch Size Limits: - - Memory-bound: Attention limits batch size - - Compute-bound: Can increase batch size - - Our profiling shows attention is memory-bound - -4️⃣ Inference Serving: - - MLPs: High throughput possible - - CNNs: Moderate throughput - - Transformers: Low throughput, high latency -""") - - # Real-world examples - print("\n🌍 REAL-WORLD EXAMPLES:") - print("-" * 30) - - print(""" -How Our Profiling Insights Play Out in Production: - -📱 MOBILE DEPLOYMENT: - - 
Profiling shows: Attention uses 80% memory - - Decision: Use distilled models (smaller attention) - - Result: 10x memory reduction, 3x speedup - -🏢 DATACENTER SERVING: - - Profiling shows: Attention is compute bottleneck - - Decision: Use tensor parallelism across GPUs - - Result: Split attention computation, linear speedup - -SPEED EDGE DEVICES: - - Profiling shows: Memory bandwidth limited - - Decision: Quantize to INT8, cache frequent patterns - - Result: 4x memory reduction, 2x speedup - -TARGET KEY TAKEAWAY: - Profiling isn't academic - it drives billion-dollar infrastructure decisions! - Every major ML system (GPT, BERT, ResNet) was optimized using these techniques. -""") - -# Run the systems analysis -if __name__ == "__main__": - analyze_systems_implications() - -# %% [markdown] -""" -## Part 7: Integration Testing - Putting It All Together - -Let's test our complete profiling infrastructure by analyzing a realistic neural network scenario. -This integration test validates that all our profiling tools work together seamlessly. -""" - -# %% -def integration_test_profiling_suite(): - """ - Integration test for the complete profiling suite. - Tests all components working together on a realistic model. 
- """ - - print("TEST INTEGRATION TEST: Complete Profiling Suite") - print("=" * 70) - - # Test all profilers working together - print("\n1️⃣ Testing Individual Components:") - print("-" * 40) - - # Timer test - timer = Timer() - - def sample_computation(): - return sum(i*i for i in range(10000)) - - timing_stats = timer.measure(sample_computation, warmup=2, runs=50) - assert timing_stats['runs'] == 50 - assert timing_stats['mean_ms'] > 0 - print("PASS Timer: Working correctly") - - # Memory profiler test - memory_profiler = MemoryProfiler() - - def memory_intensive_task(): - return [i for i in range(100000)] - - memory_stats = memory_profiler.profile(memory_intensive_task) - assert memory_stats['peak_mb'] > 0 - print("PASS Memory Profiler: Working correctly") - - # FLOP counter test - flop_counter = FLOPCounter() - flops = flop_counter.count_linear(100, 50, batch_size=32) - assert flops == 32 * 100 * 50 + 32 * 50 # multiply + add operations - print("PASS FLOP Counter: Working correctly") - - # Context manager test - print("\n2️⃣ Testing Profiler Context Integration:") - print("-" * 40) - - def complex_model_simulation(): - """Simulate a complex model with multiple operations.""" - # Simulate different types of computation - linear_result = sum(i*j for i in range(100) for j in range(100)) # O(n²) - conv_result = [sum(row) for row in [[i*j for j in range(50)] for i in range(50)]] # Simulate convolution - attention_result = sum(i*j*k for i in range(20) for j in range(20) for k in range(20)) # O(n³) - expensive! - return linear_result + sum(conv_result) + attention_result - - with ProfilerContext("Complex Model Simulation", timing_runs=20) as profiler: - result = profiler.profile_function(complex_model_simulation) - - # Add FLOP count for analysis - estimated_flops = ( - 100 * 100 + # Linear operations - 50 * 50 * 10 + # Conv-like operations - 20 * 20 * 20 * 5 # Attention-like operations (expensive!) 
- ) - profiler.add_flop_count(estimated_flops) - - print("PASS Profiler Context: Integration successful") - - # Test performance comparison - print("\n3️⃣ Performance Comparison Test:") - print("-" * 40) - - operations = [ - ("Fast Linear", lambda: sum(range(1000))), - ("Moderate Conv", lambda: [[i*j for j in range(100)] for i in range(100)]), - ("Slow Attention", lambda: [[[i*j*k for k in range(10)] for j in range(10)] for i in range(10)]) - ] - - results = [] - - for name, operation in operations: - with ProfilerContext(name, timing_runs=30) as profiler: - profiler.profile_function(operation) - - results.append(name) - - print("PASS Performance Comparison: All operations profiled successfully") - - # Validate profiling accuracy - print("\n4️⃣ Profiling Accuracy Validation:") - print("-" * 40) - - # Test that timing is consistent - consistent_operation = lambda: time.sleep(0.01) # Should be ~10ms - - timing_stats = timer.measure(consistent_operation, warmup=1, runs=10) - mean_ms = timing_stats['mean_ms'] - expected_ms = 10.0 - - # Allow 30% tolerance for timing variability (system dependent) - tolerance = 0.3 - relative_error = abs(mean_ms - expected_ms) / expected_ms - if relative_error > tolerance: - print(f"WARNING️ Timing variance higher than expected: {mean_ms:.2f}ms vs expected {expected_ms:.2f}ms (tolerance: {tolerance*100}%)") - print(" This is normal for mock operations and system-dependent timing") - else: - print("PASS Timing Accuracy: Within acceptable tolerance") - - # Test memory tracking accuracy - def known_memory_allocation(): - # Allocate approximately 1MB of data - return [i for i in range(125000)] # ~1MB for 125k integers - - memory_stats = memory_profiler.profile(known_memory_allocation) - allocated_mb = memory_stats.get('allocated_mb', 0) - - # Memory allocation should be positive and reasonable - assert allocated_mb > 0.5, f"Memory tracking issue: {allocated_mb:.2f}MB seems too low" - assert allocated_mb < 10, f"Memory tracking issue: 
{allocated_mb:.2f}MB seems too high" - print("PASS Memory Tracking: Reasonable accuracy") - - # Final integration validation - print("\n5️⃣ End-to-End Integration Test:") - print("-" * 40) - - # Simulate complete ML model profiling workflow - class MockMLModel: - def __init__(self): - self.layers = ["embedding", "attention", "mlp", "output"] - - def forward(self, input_data): - # Simulate different computational patterns; time.sleep() returns None, - # so we call it directly rather than assigning its result - time.sleep(0.001) # Embedding: fast - time.sleep(0.010) # Attention: slow (bottleneck) - time.sleep(0.002) # MLP: moderate - time.sleep(0.001) # Output: fast - return "model_output" - - model = MockMLModel() - mock_input = "input_tokens" - - # Profile the complete model - with ProfilerContext("Complete ML Model", timing_runs=20, enable_memory=True) as profiler: - output = profiler.profile_function(model.forward, args=(mock_input,)) - - # Add realistic FLOP counts - model_flops = { - 'embedding': 1000000, # 1M FLOPs - 'attention': 50000000, # 50M FLOPs (bottleneck!) 
- 'mlp': 10000000, # 10M FLOPs - 'output': 500000 # 0.5M FLOPs - } - - total_flops = sum(model_flops.values()) - profiler.add_flop_count(total_flops, model_flops) - - print("PASS End-to-End: Complete workflow successful") - - # Test SimpleProfiler interface (for Module 20 compatibility) - print("\n6️⃣ SimpleProfiler Interface Test:") - print("-" * 40) - - # Test SimpleProfiler - simple_profiler = SimpleProfiler() - - def sample_computation(): - import numpy as np - return np.random.randn(100, 100) @ np.random.randn(100, 100) - - try: - # Try with numpy - if available - result = simple_profiler.profile(sample_computation, name="Matrix Multiply") - print(f"SimpleProfiler result keys: {list(result.keys())}") - assert 'wall_time' in result - assert 'cpu_time' in result - assert 'name' in result - print("PASS SimpleProfiler: Full functionality working") - except ImportError: - # Fall back to simple computation if numpy not available - def simple_computation(): - return sum(i*i for i in range(1000)) - - result = simple_profiler.profile(simple_computation, name="Sum of Squares") - print(f"SimpleProfiler result keys: {list(result.keys())}") - assert 'wall_time' in result - assert 'cpu_time' in result - assert 'name' in result - print("PASS SimpleProfiler: Basic functionality working") - - # Test profile_function utility - try: - func_result = profile_function(sample_computation) - assert 'wall_time' in func_result - print("PASS profile_function utility: Working correctly") - except ImportError: - def simple_computation(): - return sum(i*i for i in range(1000)) - func_result = profile_function(simple_computation) - assert 'wall_time' in func_result - print("PASS profile_function utility: Working correctly (fallback)") - - # Success summary - print(f"\nCELEBRATE INTEGRATION TEST RESULTS:") - print("=" * 50) - print(""" -PASS All profiling components working correctly -PASS Context manager integration successful -PASS Timing accuracy within acceptable range -PASS Memory 
tracking functioning properly -PASS FLOP counting calculations correct -PASS End-to-end workflow validated -PASS SimpleProfiler interface ready for Module 20 - -ROCKET PROFILING SUITE READY FOR PRODUCTION USE! - -Your profiling tools are now ready to: -- Identify bottlenecks in real models -- Guide optimization decisions -- Validate performance improvements -- Support Module 16 (Acceleration) development -- Provide SimpleProfiler interface for Module 20 (Benchmarking) - -Next step: Use these tools to profile YOUR models and find the bottlenecks! -""") - -# Run the integration test -if __name__ == "__main__": - integration_test_profiling_suite() - -# %% [markdown] -""" -## THINK ML Systems Thinking: Interactive Questions - -Now that you've built a complete profiling suite, let's think about how this applies to real ML systems engineering. -""" - -# %% [markdown] -""" -### Question 1: Bottleneck Analysis Strategy - -You're optimizing a production transformer model that serves 1M requests/day. Your profiling reveals: -- Attention computation: 45ms (70% of total time) -- Linear layers: 10ms (15% of total time) -- Activation functions: 5ms (8% of total time) -- I/O overhead: 5ms (7% of total time) - -If you can only optimize ONE component this quarter, which would you choose and why? What's the maximum theoretical speedup you could achieve? - -*Think about Amdahl's Law and real-world optimization constraints.* -""" - -# %% [markdown] -""" -### Question 2: Memory vs Compute Trade-offs - -Your profiling shows that a CNN model uses: -- 2GB memory with 50ms inference time on CPU -- 0.5GB memory with 200ms inference time on mobile chip - -A customer wants to deploy on mobile devices with 1GB total RAM and requires <100ms inference. - -Design an optimization strategy using your profiling insights. What techniques would you try, and in what order? 
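
One way to start on Question 2 is a quick budget check before touching any code. This is a hedged sketch: the savings factors below are common rules of thumb for quantization and pruning, not guarantees, and the starting numbers are the mobile-chip figures from the scenario:

```python
# Rough feasibility check for the mobile target (<1 GB RAM, <100 ms).
# Savings factors are rules of thumb (assumptions), not measured results.
base_mem_gb, base_latency_ms = 0.5, 200   # mobile-chip numbers from profiling

techniques = [
    ("INT8 quantization", 0.25, 0.6),   # ~4x smaller weights, some speedup
    ("50% pruning",       0.6,  0.8),   # smaller model, modestly faster
]
mem, lat = base_mem_gb, base_latency_ms
for name, mem_factor, lat_factor in techniques:
    mem *= mem_factor
    lat *= lat_factor
    print(f"after {name:18s}: {mem:.2f} GB, {lat:.0f} ms")
print("meets target" if mem < 1.0 and lat < 100 else "still over budget")
```

The point is the ordering: estimate each technique's expected impact first, then apply the cheapest one that closes the gap.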
- -*Consider quantization, pruning, architecture changes, and caching strategies.* -""" - -# %% [markdown] -""" -### Question 3: Scaling Prediction - -Your profiling reveals that attention computation scales as O(n²) with sequence length. You measured: -- 128 tokens: 10ms -- 256 tokens: 40ms -- 512 tokens: 160ms - -If you need to support 2048 tokens, predict the inference time. What optimization techniques could break this quadratic scaling? - -*Think about the mathematical relationship and alternative attention mechanisms.* -""" - -# %% [markdown] -""" -### Question 4: Production Profiling Strategy - -You're building a profiling system for a production ML platform that serves 100 different models. Your Timer class works great for development, but production has different constraints: - -- Can't add 100ms of profiling overhead per request -- Need continuous monitoring, not batch measurements -- Must handle concurrent requests and GPU operations -- Need automatic anomaly detection - -How would you modify your profiling approach for production? What are the key design trade-offs? 
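
One production-friendly answer to the overhead constraint in Question 4 is sampling: time only a small fraction of requests so the fast path stays untouched. A minimal sketch (the 1% rate, `handle_request`, and the in-memory `samples` list are assumptions standing in for real serving and monitoring infrastructure):

```python
import random
import time

# Sampled profiling: time only ~1% of requests so steady-state overhead
# stays negligible. handle_request is a stand-in for real model inference.
SAMPLE_RATE = 0.01
samples = []                      # in production: a metrics sink, not a list

def handle_request():
    time.sleep(0.001)             # stand-in for real inference work
    return "ok"

def serve(request_id):
    if random.random() >= SAMPLE_RATE:
        return handle_request()   # fast path: no timing at all
    start = time.perf_counter()
    result = handle_request()
    samples.append((request_id, (time.perf_counter() - start) * 1000))
    return result

for i in range(200):
    serve(i)
print(f"profiled {len(samples)} of 200 requests")
```

The same idea extends to async collection and per-layer breakdowns without ever adding latency to the unsampled requests.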
- -*Consider sampling strategies, async profiling, and monitoring infrastructure.* -""" - -# %% -if __name__ == "__main__": - print("🤔 ML Systems Thinking Questions") - print("=" * 50) - print(""" -Complete the interactive questions above to deepen your understanding of: - -1️⃣ Bottleneck Analysis Strategy - - Applying Amdahl's Law to optimization decisions - - Understanding the ROI of different optimization targets - -2️⃣ Memory vs Compute Trade-offs - - Balancing memory constraints with performance requirements - - Designing optimization strategies for resource-limited devices - -3️⃣ Scaling Prediction - - Using profiling data to predict performance at scale - - Understanding algorithmic complexity implications - -4️⃣ Production Profiling Strategy - - Adapting development tools for production constraints - - Building monitoring systems for ML performance - -These questions connect your profiling implementations to real-world ML systems challenges. -Answer them to master performance analysis thinking!
-""") - -# %% -if __name__ == "__main__": - print("MAGNIFY PROFILING MODULE: Performance Detective Suite") - print("=" * 60) - - # Run all profiling tests in sequence - print("\n1️⃣ Testing Timer Infrastructure...") - test_timer() - - print("\n2️⃣ Testing Memory Profiler...") - test_memory_profiler() - - print("\n3️⃣ Testing FLOP Counter...") - test_flop_counter() - - print("\n4️⃣ Testing Comprehensive Profiling...") - test_comprehensive_profiling() - - print("\n5️⃣ Running Bottleneck Detection...") - simulate_complete_model_profiling() - - print("\n6️⃣ Analyzing Systems Implications...") - analyze_systems_implications() - - print("\n7️⃣ Running Integration Tests...") - integration_test_profiling_suite() - - print("\nCELEBRATE ALL PROFILING TESTS COMPLETED SUCCESSFULLY!") - print("\nROCKET Your profiling suite is ready to:") - print(" - Identify bottlenecks in neural networks") - print(" - Guide optimization decisions with data") - print(" - Predict performance at scale") - print(" - Support production monitoring systems") - print("\n📚 Next: Complete the ML Systems Thinking questions!") - -# %% [markdown] -""" -## TARGET MODULE SUMMARY: Profiling - Performance Detective Work - -Congratulations! You've built a comprehensive profiling suite that reveals the performance secrets of neural networks. - -### 🏆 What You Accomplished - -**1. Professional Timing Infrastructure** -- Built `Timer` class with statistical rigor -- Implemented warmup runs and percentile reporting -- Eliminated cold start effects and measurement noise -- Created reproducible performance measurements - -**2. Memory Analysis Tools** -- Developed `MemoryProfiler` with allocation tracking -- Implemented peak memory usage monitoring -- Built memory leak detection capabilities -- Connected memory patterns to performance implications - -**3. 
Computational Analysis** -- Created `FLOPCounter` for operation counting -- Analyzed different layer types (Linear, Conv2d, Attention) -- Revealed the O(n²) scaling problem in transformers -- Connected FLOPs to hardware efficiency - -**4. Integrated Profiling Context** -- Built `ProfilerContext` manager combining all tools -- Created comprehensive performance reports -- Implemented automatic insight generation -- Developed production-ready profiling workflow - -### 🔍 Key Discoveries Made - -**Architecture Performance Profiles:** -- **MLPs**: Fast, linear scaling, memory efficient -- **CNNs**: Moderate speed, excellent for spatial data -- **Transformers**: Slow but powerful, memory hungry, O(n²) scaling - -**Bottleneck Identification:** -- Attention mechanisms consume 70%+ of computation time -- Memory bandwidth often limits performance more than raw FLOPs -- O(n²) scaling makes long sequences prohibitively expensive - -**Systems Implications:** -- Profiling data drives hardware selection (CPU vs GPU) -- Memory constraints limit batch sizes in attention models -- Optimization ROI follows Amdahl's Law patterns - -### 🚀 Real-World Applications - -Your profiling tools enable: -- **Bottleneck identification** in production models -- **Optimization targeting** for maximum impact -- **Hardware selection** based on performance characteristics -- **Cost prediction** for scaling ML systems -- **Performance regression** detection in CI/CD - -### 🎯 What's Next - -Module 16 (Acceleration) will use these profiling insights to: -- Implement attention optimizations (Flash Attention patterns) -- Build efficient kernels for bottleneck operations -- Create caching strategies for memory optimization -- Develop quantization techniques for inference speedup - -**Your profiling detective work laid the foundation - now we'll fix the problems you discovered!** - -### 🏅 Systems Engineering Skills Mastered - -- **Performance measurement methodology** with statistical rigor -
**Bottleneck analysis** using Amdahl's Law principles -- **Memory profiling** and allocation pattern analysis -- **Computational complexity** analysis through FLOP counting -- **Production profiling** strategy design -- **Data-driven optimization** decision making - -You now have the tools to analyze any neural network and understand exactly why it's fast or slow. These are the same techniques used to optimize GPT, BERT, and every other production ML system. - -**Welcome to the ranks of ML systems performance engineers!** 🎉 -""" \ No newline at end of file diff --git a/modules_old/15_acceleration/README.md b/modules_old/15_acceleration/README.md deleted file mode 100644 index fb307cc3..00000000 --- a/modules_old/15_acceleration/README.md +++ /dev/null @@ -1,167 +0,0 @@ -# Module 16: Hardware Acceleration - The Simplest Optimization - -## Overview - -This module teaches the most valuable optimization lesson: **the easiest speedup comes from using better tools, not writing faster code!** After profiling your models and finding bottlenecks, learn how to get 100-1000x speedups with zero accuracy loss through smart backend selection. - -## The Context: You Just Found Bottlenecks - -**Previous Module**: You profiled your models and identified performance bottlenecks -**This Module**: Learn the SIMPLEST optimization - don't write faster code, use code that's already fast! -**Key Insight**: NumPy provides 100x+ speedup over naive loops with zero effort - -## Learning Objectives - -By the end of this module, students will be able to: - -1. **Understand Why Naive Loops Are Slow**: Analyze cache miss patterns that make educational implementations terrible for performance -2. **Implement Cache-Friendly Blocking**: Build blocked matrix multiplication showing 10-50x speedup through better memory access patterns -3. **Recognize Library Superiority**: Understand why NumPy beats custom optimizations through expert-level engineering -4.
**Build Smart Backends**: Create systems that automatically dispatch to optimal implementations -5. **Apply the Free Speedup Principle**: Choose better tools instead of optimizing existing code - -## The Educational Journey: Naive → Blocked → NumPy - -### 1. Naive Baseline (Your Module 2/4 Loops) -```python -def matmul_naive(a, b): - # Triple nested loops - perfect for learning algorithms - # Terrible for performance (1000x slower than NumPy) - # Random memory access = cache misses = slow -``` - -### 2. Cache-Friendly Blocking -```python -def matmul_blocked(a, b, block_size=64): - # Process data in cache-friendly 64x64 blocks - # Sequential access within blocks = cache hits - # Same O(n³) algorithm, much better memory pattern - # Result: 10-50x speedup over naive -``` - -### 3. NumPy Production -```python -def matmul_numpy(a, b): - return a @ b # Uses optimized BLAS libraries - # Expert-level optimizations: blocking + vectorization + threading - # Result: 100-1000x speedup over naive -``` - -## Key Performance Results - -Real speedups you'll measure in this module: - -- **Naive loops**: 1000x slower (educational value, cache-hostile) -- **Blocked loops**: 50x slower (teaches cache optimization principles) -- **NumPy backend**: Optimal speed (expert-optimized with BLAS libraries) - -**The Lesson**: Understanding the journey enables smart tool choices! - -## What You'll Build - -### 1. The Complete Performance Spectrum -- **Naive implementation**: Educational triple-nested loops showing why they're slow -- **Blocked algorithm**: Cache-friendly version demonstrating optimization principles -- **NumPy integration**: Production implementation leveraging expert optimizations -- **Performance measurement**: Scientific benchmarking across the entire spectrum - -### 2. 
Smart Backend System -```python -class OptimizedBackend: - def matmul(self, a, b): - return matmul_numpy(a, b) # Always use the best available - - def dispatch(self, operation, *args): - # Smart routing to optimal implementations -``` - -### 3. Educational Insights -- **Cache hierarchy understanding**: Why L1/L2/L3 cache determines practical performance -- **Memory access patterns**: Sequential vs random access cost analysis -- **Library engineering**: What NumPy has that custom implementations lack -- **Optimization decision framework**: When to optimize vs when to use libraries - -## Hardware Principles Demonstrated - -### CPU Cache Hierarchy Impact -- **L1 Cache**: 32KB, 1-2 cycles (keep working set small) -- **L2 Cache**: 256KB, 3-10 cycles (64x64 blocks fit here) -- **L3 Cache**: 8MB, 10-20 cycles (full matrices don't fit) -- **RAM**: Gigabytes, 100-300 cycles (cache misses are expensive) - -### Memory Access Pattern Analysis -- **Naive loops**: Random access → cache misses → 100-300 cycle delays -- **Blocked algorithms**: Sequential access within blocks → cache hits → 1-2 cycle access -- **NumPy**: Expert-optimized patterns + vectorization + threading - -## Real-World ML Systems Context - -### How Production Systems Apply These Principles -- **PyTorch/TensorFlow**: Use same blocking + vectorization principles for tensor operations -- **BLAS Libraries**: OpenBLAS, Intel MKL provide hardware-optimized linear algebra -- **GPU Acceleration**: Parallel processing for operations that benefit from it -- **Memory Management**: Minimize allocations, reuse buffers, optimize data layout - -### When to Optimize vs Use Libraries -- ✅ **Use libraries**: Matrix operations, convolutions, standard neural network layers -- ✅ **Custom optimization**: Operations not available in optimized libraries -- ✅ **Profile first**: Measure real bottlenecks, not assumed ones -- ❌ **Premature optimization**: Optimizing non-bottlenecks or already-optimized code - -## Systems Thinking 
Framework - -### The Free Speedup Decision Tree -1. **Is this operation available in NumPy/PyTorch?** → Use the library -2. **Is this a proven bottleneck?** → Profile and measure first -3. **Is this custom logic?** → Implement efficiently, then optimize if needed -4. **Can I use better algorithms?** → O(n²) beats optimized O(n³) - -### Optimization Priority Order -1. **Better algorithms**: Change complexity class (O(n³) → O(n²)) -2. **Better libraries**: Use expert-optimized implementations -3. **Better access patterns**: Cache-friendly memory access -4. **Vectorization**: Eliminate Python loops, use SIMD -5. **Hardware acceleration**: GPU for appropriate parallel workloads - -## Assessment Criteria - -Students demonstrate mastery by: - -1. **Cache Analysis**: Explain why naive loops cause cache misses and performance degradation -2. **Blocking Implementation**: Build cache-friendly matrix multiplication with measurable speedups -3. **Library Understanding**: Articulate why NumPy beats custom optimizations -4. **Backend Design**: Create system that automatically chooses optimal implementations -5. 
**Decision Framework**: Apply "free speedup" principle to real optimization scenarios - -## Prerequisites - -- **Module 2**: Tensor operations and basic NumPy usage -- **Module 4**: Matrix multiplication understanding -- **Module 15**: Performance profiling and bottleneck identification -- **Systems thinking**: Interest in understanding why tools perform differently - -## Time Commitment - -**Estimated Time**: 2-3 hours -- Understanding cache hierarchy and memory patterns: 30 minutes -- Implementing naive → blocked → NumPy progression: 1.5 hours -- Building backend dispatch system: 30 minutes -- Performance analysis and systems insights: 30 minutes - -## Key Takeaway: The Easiest Optimization - -**Before this module**: "My code is slow, I need to make it faster" -**After this module**: "My code is slow, I should use faster code that already exists" - -**The Free Speedup**: 100-1000x performance improvement with zero accuracy loss and minimal code changes. This is the most valuable optimization lesson in ML systems engineering. - -## Connection to Production ML Systems - -This module directly prepares students for: - -- **Smart tool selection**: Choosing NumPy, PyTorch, optimized libraries over custom implementations -- **Performance debugging**: Understanding why some operations are slow (cache patterns, not algorithms) -- **Architecture decisions**: When to build custom vs when to use existing optimizations -- **Systems engineering mindset**: Solve problems by choosing better tools, not just working harder - -Students learn the most important optimization principle: the smartest engineers don't write the fastest code, they use code that's already fast. 
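The cache-hierarchy claims in this README can be observed directly on any machine. A self-contained sketch (sizes and helper names are illustrative): both functions compute the same sum over a row-major float32 array, but the column-wise version strides 16KB between consecutive accesses, so nearly every access touches a new cache line.

```python
import time
import numpy as np

a = np.ones((4096, 4096), dtype=np.float32)  # 64MB, row-major (C order)

def sum_by_rows(m):
    # Each m[i, :] slice is contiguous: sequential access, cache-friendly.
    return sum(float(m[i, :].sum()) for i in range(m.shape[0]))

def sum_by_cols(m):
    # Each m[:, j] slice strides 4096*4 = 16KB between elements: cache-hostile.
    return sum(float(m[:, j].sum()) for j in range(m.shape[1]))

for name, fn in [("row-wise", sum_by_rows), ("column-wise", sum_by_cols)]:
    start = time.perf_counter()
    total = fn(a)
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed * 1000:6.1f} ms (total = {total:.0f})")
# Same math, same result; the column-wise pass is typically several times slower.
```

This is the naive-matmul problem in miniature: `b[l, j]` in the triple nested loop makes exactly these strided column accesses, which is why blocking (and NumPy's BLAS) wins.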
\ No newline at end of file diff --git a/modules_old/15_acceleration/acceleration_dev.ipynb b/modules_old/15_acceleration/acceleration_dev.ipynb deleted file mode 100644 index 09253a49..00000000 --- a/modules_old/15_acceleration/acceleration_dev.ipynb +++ /dev/null @@ -1,793 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "bb43e942", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "# Module 16: Hardware Acceleration - The Free Speedup!\n", - "\n", - "## Learning Objectives\n", - "By the end of this module, you will be able to:\n", - "\n", - "1. **Understand Why Loops Are Slow**: See why your Module 2/4 loops have poor performance\n", - "2. **Implement Cache-Friendly Blocking**: Build blocked matrix multiplication that leverages CPU cache hierarchy\n", - "3. **Visualize Memory Access Patterns**: Understand how cache misses destroy performance\n", - "4. **Build Transparent Backend Systems**: Create automatic switching between implementations\n", - "5. **Apply to Real Models**: Use these principles in MLPs, CNNs, and Transformers\n", - "\n", - "## The Free Speedup Journey\n", - "\n", - "**Key Message**: This is the EASIEST optimization - just use better backends! No accuracy trade-offs, no complex math - just 10-100x faster code.\n", - "\n", - "**The Journey:**\n", - "1. **Baseline**: Your loops from Module 2/4 (educational, 1000x slower)\n", - "2. **Blocking**: Cache-friendly version (educational, 10x faster than loops)\n", - "3. **NumPy**: Production version (optimal, another 10x faster)\n", - "4. **Backend**: Smart switching system (transparent optimization)\n", - "\n", - "**Why This Works**: Same math, better implementation. Free performance with zero downsides!" 
- ] - }, - { - "cell_type": "markdown", - "id": "b3809c9d", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## Part 1: Baseline Implementation - Your Loops from Module 2/4\n", - "\n", - "Let's start with the educational triple-nested loops you implemented earlier. These were perfect for learning but terrible for performance." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a8e2f798", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "#| default_exp optimization.acceleration\n", - "\n", - "import time\n", - "import numpy as np\n", - "\n", - "def matmul_naive(a: np.ndarray, b: np.ndarray) -> np.ndarray:\n", - " \"\"\"\n", - " Educational matrix multiplication using triple nested loops.\n", - " \n", - " This is the same implementation from Module 2/4 - perfect for learning\n", - " the algorithm, but very slow due to poor cache performance.\n", - " \"\"\"\n", - " m, k = a.shape\n", - " k2, n = b.shape\n", - " assert k == k2, f\"Incompatible shapes: {a.shape} @ {b.shape}\"\n", - " \n", - " # Initialize result matrix\n", - " c = np.zeros((m, n), dtype=np.float32)\n", - " \n", - " # Triple nested loop - the educational implementation\n", - " for i in range(m):\n", - " for j in range(n):\n", - " for l in range(k):\n", - " c[i, j] += a[i, l] * b[l, j]\n", - " \n", - " return c" - ] - }, - { - "cell_type": "markdown", - "id": "c85ddf51", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### Test Educational Implementation\n", - "\n", - "Let's test our educational loops and see why they're slow." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "68fb5eed", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "def test_naive_baseline():\n", - " \"\"\"Test naive implementation and measure its performance\"\"\"\n", - " print(\"Testing Naive Implementation...\")\n", - " \n", - " # Test correctness with small matrices\n", - " a = np.array([[1, 2], [3, 4]], dtype=np.float32)\n", - " b = np.array([[5, 6], [7, 8]], dtype=np.float32)\n", - " \n", - " result_naive = matmul_naive(a, b)\n", - " result_numpy = a @ b\n", - " assert np.allclose(result_naive, result_numpy), \"Naive matmul incorrect\"\n", - " print(\"✅ Naive implementation produces correct results\")\n", - " \n", - " # Performance comparison (small sizes only - educational is VERY slow)\n", - " print(\"\\nPerformance comparison:\")\n", - " small_a = np.random.randn(100, 100).astype(np.float32)\n", - " small_b = np.random.randn(100, 100).astype(np.float32)\n", - " \n", - " # Time naive implementation\n", - " start = time.perf_counter()\n", - " _ = matmul_naive(small_a, small_b)\n", - " naive_time = time.perf_counter() - start\n", - " \n", - " # Time NumPy implementation\n", - " start = time.perf_counter()\n", - " _ = small_a @ small_b\n", - " numpy_time = time.perf_counter() - start\n", - " \n", - " speedup = naive_time / numpy_time\n", - " print(f\"Naive loops: {naive_time*1000:.1f} ms\")\n", - " print(f\"NumPy optimized: {numpy_time*1000:.1f} ms\")\n", - " print(f\"NumPy is {speedup:.1f}x faster\")\n", - " \n", - " print(\"✅ Naive baseline established\")\n", - " return naive_time, numpy_time, speedup" - ] - }, - { - "cell_type": "markdown", - "id": "fd8cdf2e", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Part 2: Understanding Cache Hierarchy - Why Memory Matters More Than Computation\n", - "\n", - "**The Big Insight**: Modern CPUs are FAST at computation but SLOW at memory access. 
Cache hierarchy makes the difference between fast and slow code.\n", - "\n", - "### CPU Cache Hierarchy Visualization\n", - "```\n", - "Registers: 4 bytes - 1 cycle (instant)\n", - "L1 Cache: 32KB - 3-4 cycles (lightning fast)\n", - "L2 Cache: 256KB - 10-20 cycles (fast)\n", - "L3 Cache: 8MB - 50-100 cycles (slow)\n", - "Main RAM: 16GB - 200+ cycles (VERY slow)\n", - "```\n", - "\n", - "**Key Principle**: Keep your working set in L1/L2 cache for 100x better performance!\n", - "\n", - "### Memory Access Pattern Analysis\n", - "\n", - "Your naive loops access memory like this:\n", - "```python\n", - "for i in range(m):\n", - " for j in range(n):\n", - " for l in range(k):\n", - " c[i,j] += a[i,l] * b[l,j] # b[l,j] jumps around randomly!\n", - "```\n", - "\n", - "**The Problem**: `b[l,j]` creates terrible access patterns:\n", - "- Each `j` increment jumps to a new column (cache miss)\n", - "- Each `l` increment jumps to a new row (another cache miss)\n", - "- For 1000x1000 matrix: 1 billion cache misses!\n", - "\n", - "**The Solution**: Process in blocks that fit in cache." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fc2f1d0a", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "def matmul_blocked(a: np.ndarray, b: np.ndarray, block_size: int = 64) -> np.ndarray:\n", - " \"\"\"\n", - " Cache-friendly blocked matrix multiplication.\n", - " \n", - " This version processes data in blocks that fit in CPU cache.\n", - " \n", - " **Memory Analysis**:\n", - " - 64x64 block = 4KB floats = 16KB memory (fits in 32KB L1 cache)\n", - " - 3 blocks (A, B, C) = 48KB total (fits in 256KB L2 cache)\n", - " - Reuses each data element 64 times before evicting from cache\n", - " \n", - " **Why This Works**:\n", - " - Naive: 1 cache miss per operation (terrible)\n", - " - Blocked: 1 cache miss per 64 operations (64x better!)\n", - " \n", - " Args:\n", - " a: Left matrix (m × k)\n", - " b: Right matrix (k × n) \n", - " block_size: Cache-friendly block size (32-128, default 64)\n", - " \"\"\"\n", - " m, k = a.shape\n", - " k2, n = b.shape\n", - " assert k == k2, f\"Incompatible shapes: {a.shape} @ {b.shape}\"\n", - " \n", - " # Initialize result\n", - " c = np.zeros((m, n), dtype=np.float32)\n", - " \n", - " # Process in blocks to maximize cache utilization\n", - " for i in range(0, m, block_size):\n", - " for j in range(0, n, block_size):\n", - " for l in range(0, k, block_size):\n", - " # Define block boundaries\n", - " i_end = min(i + block_size, m)\n", - " j_end = min(j + block_size, n)\n", - " l_end = min(l + block_size, k)\n", - " \n", - " # Extract blocks (these stay in cache)\n", - " a_block = a[i:i_end, l:l_end]\n", - " b_block = b[l:l_end, j:j_end]\n", - " \n", - " # Multiply blocks using NumPy (optimized BLAS)\n", - " c[i:i_end, j:j_end] += a_block @ b_block\n", - " \n", - " return c" - ] - }, - { - "cell_type": "markdown", - "id": "74d05383", - "metadata": { - "lines_to_next_cell": 1 - }, - "source": [ - "\"\"\"\n", - "## Test Blocked Implementation\n", - "\n", - "Let's see how much faster 
cache-friendly blocking is compared to educational loops.\n", - "\"\"\"\n", - "\n", - "def test_blocked_optimization():\n", - " \"\"\"Test blocked matrix multiplication performance\"\"\"\n", - " print(\"Testing Blocked Matrix Multiplication...\")\n", - " \n", - " # Test correctness\n", - " a = np.random.randn(200, 200).astype(np.float32)\n", - " b = np.random.randn(200, 200).astype(np.float32)\n", - " \n", - " result_blocked = matmul_blocked(a, b, block_size=64)\n", - " result_numpy = a @ b\n", - " \n", - " assert np.allclose(result_blocked, result_numpy, atol=1e-3), \"Blocked matmul incorrect\"\n", - " print(\"✅ Blocked implementation produces correct results\")\n", - " \n", - " # Performance comparison\n", - " print(\"\\nPerformance comparison:\")\n", - " \n", - " # Educational vs Blocked vs NumPy\n", - " size = 200\n", - " test_a = np.random.randn(size, size).astype(np.float32)\n", - " test_b = np.random.randn(size, size).astype(np.float32)\n", - " \n", - " # Time educational (smaller subset to avoid waiting forever)\n", - " start = time.perf_counter()\n", - " _ = matmul_naive(test_a[:50, :50], test_b[:50, :50])\n", - " naive_time = time.perf_counter() - start\n", - " naive_time_scaled = naive_time * (size/50)**3 # Scale up for comparison\n", - " \n", - " # Time blocked\n", - " start = time.perf_counter()\n", - " _ = matmul_blocked(test_a, test_b, block_size=64)\n", - " blocked_time = time.perf_counter() - start\n", - " \n", - " # Time NumPy\n", - " start = time.perf_counter()\n", - " _ = test_a @ test_b\n", - " numpy_time = time.perf_counter() - start\n", - " \n", - " print(f\"Naive (estimated): {naive_time_scaled*1000:.1f} ms\")\n", - " print(f\"Blocked: {blocked_time*1000:.1f} ms\")\n", - " print(f\"NumPy: {numpy_time*1000:.1f} ms\")\n", - " \n", - " speedup_blocked = naive_time_scaled / blocked_time\n", - " speedup_numpy = naive_time_scaled / numpy_time\n", - " \n", - " print(f\"\\n🚀 SPEEDUP RESULTS:\")\n", - " print(f\"Blocked is {speedup_blocked:.1f}x 
faster than naive loops!\")\n", - " print(f\"NumPy is {speedup_numpy:.1f}x faster than naive loops!\")\n", - " print(f\"\\n💡 Why blocking works: Better cache utilization!\")\n", - " print(f\" • Naive: 1 cache miss per operation\")\n", - " print(f\" • Blocked: 1 cache miss per 64 operations\")\n", - " print(f\" • NumPy: Professional optimizations + vectorization\")\n", - " \n", - " print(\"✅ Blocked optimization tested successfully\")\n", - " return blocked_time, numpy_time" - ] - }, - { - "cell_type": "markdown", - "id": "5dd1eddc", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Part 3: NumPy Optimization - Production Performance\n", - "\n", - "Now we'll switch to NumPy for production use. The key insight: NumPy already has these optimizations (and more) built-in." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "510040fa", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "def matmul_numpy(a: np.ndarray, b: np.ndarray) -> np.ndarray:\n", - " \"\"\"\n", - " Production matrix multiplication using NumPy.\n", - " \n", - " This is what you should actually use in practice.\n", - " NumPy already has blocking, vectorization, and BLAS optimizations built-in.\n", - " \"\"\"\n", - " return a @ b" - ] - }, - { - "cell_type": "markdown", - "id": "6dc5cef7", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### Test Production Implementation\n", - "\n", - "Let's verify that NumPy is indeed the best choice for production." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5450d83e", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "def test_production_performance():\n", - " \"\"\"Test that NumPy is indeed optimal for production use\"\"\"\n", - " print(\"Testing Production Performance...\")\n", - " \n", - " # Test different sizes\n", - " sizes = [200, 500, 800]\n", - " \n", - " print(\"\\nPerformance comparison across the optimization spectrum:\")\n", - " \n", - " for size in sizes:\n", - " print(f\"\\nMatrix size: {size}x{size}\")\n", - " a = np.random.randn(size, size).astype(np.float32)\n", - " b = np.random.randn(size, size).astype(np.float32)\n", - " \n", - " # Time blocked implementation\n", - " start = time.perf_counter()\n", - " _ = matmul_blocked(a, b, block_size=64)\n", - " blocked_time = time.perf_counter() - start\n", - " \n", - " # Time NumPy implementation\n", - " start = time.perf_counter()\n", - " _ = matmul_numpy(a, b)\n", - " numpy_time = time.perf_counter() - start\n", - " \n", - " speedup = blocked_time / numpy_time\n", - " print(f\"Blocked: {blocked_time*1000:6.1f} ms\")\n", - " print(f\"NumPy: {numpy_time*1000:6.1f} ms\")\n", - " print(f\"NumPy is {speedup:.1f}x faster than blocked\")\n", - " \n", - " print(\"\\n💡 Key Insight: NumPy already has these optimizations built-in!\")\n", - " print(\" • Blocking algorithms\")\n", - " print(\" • Vectorization\")\n", - " print(\" • Hardware-specific BLAS libraries\")\n", - " print(\" • Assembly-level optimizations\")\n", - " \n", - " print(\"\\n✅ Production performance verified\")\n", - " return True" - ] - }, - { - "cell_type": "markdown", - "id": "34430270", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Part 4: Smart Backend System - Transparent Optimization\n", - "\n", - "Now let's build a system that automatically chooses the right implementation. This is how real ML frameworks work!" 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bb6e536f", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "class OptimizedBackend:\n", - " \"\"\"\n", - " Smart backend that automatically dispatches to optimal implementations.\n", - " \n", - " This demonstrates how real ML frameworks (PyTorch, TensorFlow) work:\n", - " - Single API for users\n", - " - Automatic dispatch to fastest implementation\n", - " - Transparent optimization without code changes\n", - " \"\"\"\n", - " \n", - " def dispatch(self, op: str, *args, **kwargs):\n", - " \"\"\"Dispatch operations to optimal implementations\"\"\"\n", - " if op == \"matmul\":\n", - " return self.matmul(*args, **kwargs)\n", - " else:\n", - " raise NotImplementedError(f\"Operation {op} not implemented\")\n", - " \n", - " def matmul(self, a: np.ndarray, b: np.ndarray) -> np.ndarray:\n", - " \"\"\"\n", - " Matrix multiplication with automatic optimization selection.\n", - " \n", - " For production: Always use NumPy (has all optimizations built-in)\n", - " For education: Could switch based on size, but NumPy is always best\n", - " \"\"\"\n", - " # In a real system, you might choose based on:\n", - " # - Matrix size (small vs large)\n", - " # - Hardware available (CPU vs GPU)\n", - " # - Memory constraints\n", - " # \n", - " # But NumPy is almost always the right choice for CPU\n", - " return matmul_numpy(a, b)\n", - "\n", - "# Global backend instance\n", - "_backend = OptimizedBackend()\n", - "\n", - "def matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:\n", - " \"\"\"\n", - " Matrix multiplication using optimal backend.\n", - " \n", - " This is the API students should use - it automatically\n", - " selects the best implementation available.\n", - " \"\"\"\n", - " return _backend.dispatch(\"matmul\", a, b)" - ] - }, - { - "cell_type": "markdown", - "id": "3bf96063", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### Test 
Backend System\n", - "\n", - "Let's verify our backend system works correctly and uses optimal implementations." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "daaad52d", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "def test_backend_system():\n", - " \"\"\"Test the backend system\"\"\"\n", - " print(\"Testing Backend System...\")\n", - " \n", - " # Test matrices\n", - " a = np.random.randn(100, 100).astype(np.float32)\n", - " b = np.random.randn(100, 100).astype(np.float32)\n", - " \n", - " # Test that our backend works\n", - " result = matmul(a, b)\n", - " expected = a @ b\n", - " \n", - " assert np.allclose(result, expected), \"Backend matmul incorrect\"\n", - " print(\"✅ Backend produces correct results\")\n", - " \n", - " # Compare performance\n", - " start = time.perf_counter()\n", - " _ = matmul(a, b)\n", - " backend_time = time.perf_counter() - start\n", - " \n", - " start = time.perf_counter()\n", - " _ = a @ b\n", - " numpy_time = time.perf_counter() - start\n", - " \n", - " print(f\"\\nPerformance comparison:\")\n", - " print(f\"Backend: {backend_time*1000:.1f} ms\")\n", - " print(f\"NumPy: {numpy_time*1000:.1f} ms\")\n", - " print(f\"Backend uses optimal NumPy implementation\")\n", - " \n", - " print(\"\\n✅ Backend system works correctly\")\n", - " return True" - ] - }, - { - "cell_type": "markdown", - "id": "d3ae2f46", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Part 5: Real-World Application Testing\n", - "\n", - "Let's test our optimizations on actual ML model operations: MLP layers, CNN convolutions, and Transformer attention." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a4858d70", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "def test_ml_model_acceleration():\n", - " \"\"\"Test acceleration on real ML model operations\"\"\"\n", - " print(\"Testing Acceleration on Real ML Models...\")\n", - " \n", - " # Test 1: MLP Forward Pass (common in Module 4)\n", - " print(\"\\n1. MLP Forward Pass (256 → 128 → 64):\")\n", - " batch_size, input_dim, hidden_dim, output_dim = 32, 256, 128, 64\n", - " \n", - " # Simulated MLP layers\n", - " x = np.random.randn(batch_size, input_dim).astype(np.float32)\n", - " W1 = np.random.randn(input_dim, hidden_dim).astype(np.float32)\n", - " W2 = np.random.randn(hidden_dim, output_dim).astype(np.float32)\n", - " \n", - " # Time naive implementation (small version)\n", - " start = time.perf_counter()\n", - " h1_naive = matmul_naive(x[:8, :64], W1[:64, :32]) # Scaled down\n", - " h2_naive = matmul_naive(h1_naive, W2[:32, :16]) # Scaled down\n", - " naive_time = time.perf_counter() - start\n", - " \n", - " # Time optimized implementation\n", - " start = time.perf_counter()\n", - " h1_opt = matmul(x, W1)\n", - " h2_opt = matmul(h1_opt, W2)\n", - " opt_time = time.perf_counter() - start\n", - " \n", - " # Scale naive time for comparison\n", - " naive_scaled = naive_time * (32/8) * (256/64) * (128/32)\n", - " speedup = naive_scaled / opt_time\n", - " \n", - " print(f\" Naive (estimated): {naive_scaled*1000:.1f} ms\")\n", - " print(f\" Optimized: {opt_time*1000:.1f} ms\")\n", - " print(f\" Speedup: {speedup:.1f}x faster!\")\n", - " \n", - " # Test 2: CNN-like Convolution (flattened as matrix multiply)\n", - " print(\"\\n2. 
CNN Convolution (as matrix multiply):\")\n", - " # Simulate im2col operation for 3x3 convolution\n", - " img_patches = np.random.randn(1024, 27).astype(np.float32) # 32x32 image, 3x3 patches\n", - " conv_filters = np.random.randn(27, 64).astype(np.float32) # 64 filters\n", - " \n", - " start = time.perf_counter()\n", - " conv_output = matmul(img_patches, conv_filters)\n", - " conv_time = time.perf_counter() - start\n", - " print(f\" Convolution output: {conv_time*1000:.1f} ms\")\n", - " print(f\" Shape: {conv_output.shape} (1024 locations × 64 filters)\")\n", - " \n", - " # Test 3: Transformer-like Attention (scaled down)\n", - " print(\"\\n3. Transformer Attention (Q·K^T):\")\n", - " seq_len, d_model = 128, 256\n", - " Q = np.random.randn(seq_len, d_model).astype(np.float32)\n", - " K = np.random.randn(seq_len, d_model).astype(np.float32)\n", - " \n", - " start = time.perf_counter()\n", - " attention_scores = matmul(Q, K.T) # Shape: (seq_len, seq_len)\n", - " attn_time = time.perf_counter() - start\n", - " print(f\" Attention computation: {attn_time*1000:.1f} ms\")\n", - " print(f\" Shape: {attention_scores.shape} (128×128 attention matrix)\")\n", - " \n", - " print(f\"\\n✅ All ML model operations accelerated successfully!\")\n", - " print(f\"💡 Key insight: Matrix multiplication is EVERYWHERE in ML!\")\n", - " return True\n", - "\n", - "def run_complete_acceleration_demo():\n", - " \"\"\"Run the complete acceleration demonstration\"\"\"\n", - " print(\"🚀 Complete Hardware Acceleration Demo\")\n", - " print(\"=\" * 55)\n", - " print(\"THE FREE SPEEDUP: From Naive Loops to Optimized Backends\")\n", - " \n", - " # 1. Test naive baseline\n", - " print(\"\\n1. Naive Baseline (your Module 2/4 loops):\")\n", - " naive_results = test_naive_baseline()\n", - " \n", - " # 2. Test blocked optimization\n", - " print(\"\\n2. Cache-Friendly Blocking:\")\n", - " test_blocked_optimization()\n", - " \n", - " # 3. Test production performance\n", - " print(\"\\n3. 
Production Performance (NumPy):\")\n", - " test_production_performance()\n", - " \n", - " # 4. Test ML model acceleration\n", - " print(\"\\n4. Real ML Model Acceleration:\")\n", - " test_ml_model_acceleration()\n", - " \n", - " # 5. Test backend system\n", - " print(\"\\n5. Smart Backend System:\")\n", - " test_backend_system()\n", - " \n", - " print(\"\\n\" + \"=\" * 55)\n", - " print(\"🎯 HARDWARE ACCELERATION MASTERED\")\n", - " print(\"=\" * 55)\n", - " \n", - " print(\"\\n📚 What You Mastered:\")\n", - " print(\"✅ Why your Module 2/4 loops were slow (cache hierarchy matters!)\")\n", - " print(\"✅ How cache-friendly blocking works (process data in chunks)\")\n", - " print(\"✅ Why NumPy dominates (professional optimizations built-in)\")\n", - " print(\"✅ How to build smart backend systems (automatic optimization)\")\n", - " print(\"✅ Real ML applications (MLPs, CNNs, Transformers all use matmul!)\")\n", - " \n", - " print(\"\\n🎯 The Free Speedup Philosophy:\")\n", - " print(\"• 🚀 Same math, better implementation = 100x speedup\")\n", - " print(\"• 🧠 Educational loops teach algorithms\")\n", - " print(\"• ⚡ Blocked algorithms teach cache optimization\")\n", - " print(\"• 🏭 NumPy provides production performance\")\n", - " print(\"• 🎯 Smart backends make optimization transparent\")\n", - " print(\"• 💡 Understanding the spectrum makes you a better engineer!\")\n", - " \n", - " return naive_results" - ] - }, - { - "cell_type": "markdown", - "id": "6fa92758", - "metadata": {}, - "source": [ - "\"\"\"\n", - "# Systems Analysis Summary\n", - "\n", - "This module demonstrates the fundamental principles of hardware acceleration in ML systems:\n", - "\n", - "## 🏗️ **Architecture Principles**\n", - "- **Cache Hierarchy**: Understanding L1/L2/L3 cache and memory access costs\n", - "- **Vectorization**: Leveraging SIMD instructions for parallel computation\n", - "- **Memory Layout**: Contiguous access patterns for optimal performance\n", - "- **Backend Abstraction**: 
Transparent dispatch between naive and optimized implementations\n", - "\n", - "## ⚡ **Optimization Techniques**\n", - "- **Blocked Algorithms**: Process data in cache-friendly blocks\n", - "- **Vectorized Operations**: Avoid Python loops, use NumPy's optimized routines\n", - "- **In-place Operations**: Minimize memory allocation overhead\n", - "- **Automatic Dispatch**: Choose optimal implementation based on problem size\n", - "\n", - "## 📊 **Performance Understanding**\n", - "- **Measurement First**: Profile real bottlenecks before optimizing\n", - "- **Algorithmic Impact**: O(N³) → O(N²) matters more than 2x constant factors\n", - "- **Hardware Awareness**: CPU cache misses cost 100x more than cache hits\n", - "- **Library Utilization**: Optimized BLAS libraries beat custom implementations\n", - "\n", - "## 🎯 **Real-World Applications**\n", - "- **ML Frameworks**: How PyTorch/TensorFlow apply these same principles\n", - "- **Production Systems**: Where optimization efforts provide real value\n", - "- **Development Practice**: When to optimize vs when to use existing solutions\n", - "\n", - "## 💡 **Key Insights**\n", - "- Cache-friendly algorithms provide 2-5x speedups from memory access patterns alone\n", - "- Vectorization eliminates Python overhead for 10-100x improvements\n", - "- Most NumPy operations are already optimized - focus on system-level improvements\n", - "- Competition frameworks make optimization learning engaging and quantifiable\n", - "- Real ML systems face memory and communication bottlenecks, not pure computation limits\n", - "\n", - "This approach teaches students to think like systems engineers: understand the hardware, measure scientifically, optimize systematically, and focus efforts where they matter most.\n", - "\"\"\"\n", - "\n", - "if __name__ == \"__main__\":\n", - " print(\"Module 16: Hardware Acceleration - The Free Speedup!\")\n", - " print(\"=\" * 60)\n", - " print(\"🚀 THE EASIEST OPTIMIZATION: Better Backends, Zero 
Trade-offs\")\n", - " \n", - " # Run complete demonstration\n", - " results = run_complete_acceleration_demo()\n", - " \n", - " print(f\"\\n🎉 Module 16: Hardware Acceleration COMPLETE!\")\n", - " print(f\"⚡ Mastered: 10-100x speedups with no accuracy loss\")\n", - " print(f\"🧠 Learned: Cache hierarchy, blocking, vectorization\")\n", - " print(f\"🏭 Applied: MLPs, CNNs, Transformers all benefit\")\n", - " print(f\"🎯 Ready: To build high-performance ML systems!\")" - ] - }, - { - "cell_type": "markdown", - "id": "4967dd03", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🤔 ML Systems Thinking: Interactive Questions\n", - "\n", - "1. **Memory Access Pattern Analysis**: Your educational loops access `b[l, j]` in the innermost loop, creating terrible cache performance. Draw a diagram showing how this access pattern jumps around in memory, calculate the number of cache misses for a 1000×1000 matrix multiply, and explain why this creates exponentially worse performance as matrices get larger.\n", - "\n", - "2. **Cache Hierarchy Optimization**: Your blocked implementation uses 64×64 blocks. Calculate: (a) Total memory footprint of three 64×64 float32 blocks, (b) Why this fits in L1/L2 cache, (c) Cache utilization ratio (reuses per cache miss), and (d) What happens with 256×256 blocks instead (hint: L3 cache limit).\n", - "\n", - "3. **Production Library Justification**: You implemented blocking for education, but NumPy beats it by another 10x. Identify three specific optimizations NumPy has (vectorization, BLAS libraries, assembly kernels) and calculate the development cost vs. performance benefit of implementing these yourself. Why is this a losing proposition for ML engineers?\n", - "\n", - "4. **ML Model Acceleration Strategy**: You tested MLP, CNN, and Transformer operations. For each model type, identify: (a) The dominant matrix operations, (b) Which operations benefit most from acceleration, (c) Memory vs. 
compute bottlenecks, and (d) Why understanding the optimization spectrum makes you a better ML systems engineer." - ] - }, - { - "cell_type": "markdown", - "id": "a582121a", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 2 - }, - "source": [ - "## 🎯 MODULE SUMMARY: Hardware Acceleration - The Free Speedup\n", - "\n", - "This module demonstrates the easiest optimization in ML systems: using better backends for free speedups with zero accuracy trade-offs. You learned why understanding the optimization spectrum makes you a better engineer.\n", - "\n", - "### 🛤️ **The Free Speedup Journey**\n", - "- **Educational Foundation**: Your Module 2/4 loops taught you the algorithm (perfect for learning)\n", - "- **Performance Understanding**: Module 15 showed you WHY loops are slow (profiling first)\n", - "- **Optimization Mastery**: Now you achieve 100x speedups by choosing better implementations\n", - "- **Systems Thinking**: Understanding the spectrum from educational to production code\n", - "\n", - "### 🛠️ **What We Built and Tested**\n", - "- **Educational Baseline**: Your triple-nested loops from Module 2/4 (algorithm understanding)\n", - "- **Cache-Friendly Blocking**: 64×64 blocks fitting in L1/L2 cache (10x+ speedup)\n", - "- **NumPy Production**: Leveraging professional BLAS optimizations (another 10x speedup)\n", - "- **Smart Backend System**: Automatic dispatch to optimal implementations\n", - "- **Real ML Applications**: MLP, CNN, Transformer operations using matrix multiplication\n", - "\n", - "### 🧠 **Key Learning Outcomes**\n", - "- **Why loops are slow**: Memory access patterns and cache hierarchy matter most\n", - "- **How blocking helps**: Processing data in cache-friendly chunks improves performance\n", - "- **When to use NumPy**: It already has these optimizations (and more) built-in\n", - "- **Systems thinking**: Understanding enables better decisions about when to optimize\n", - "\n", - "### ⚡ **Performance Spectrum Mastered**\n", - 
"- **Educational loops**: Algorithm understanding (1000x slower, perfect for learning)\n", - "- **Cache-friendly blocking**: Systems understanding (100x slower, teaches optimization)\n", - "- **NumPy production**: Professional performance (optimal speed, built-in optimizations)\n", - "- **Smart backends**: Engineering understanding (transparent optimization selection)\n", - "\n", - "### 🏆 **Practical Skills Developed**\n", - "- Analyze why educational implementations have poor performance\n", - "- Implement cache-friendly algorithms to understand optimization principles\n", - "- Choose NumPy for production while understanding what it's doing internally\n", - "- Build systems that balance educational value with performance requirements\n", - "\n", - "### 📊 **Systems Insights Gained**\n", - "- **Educational code serves a purpose**: Understanding algorithms enables optimization intuition\n", - "- **Cache hierarchy dominates performance**: Memory access patterns matter more than computation\n", - "- **Libraries beat custom optimization**: NumPy already has expert-level optimizations\n", - "- **Understanding enables better tools**: You can build smarter systems when you know the principles\n", - "\n", - "### 💡 **The Free Speedup Philosophy**\n", - "This is the EASIEST optimization in ML systems: same math, better implementation, massive speedups, zero downsides. You implemented loops to understand algorithms. You implemented blocking to understand cache optimization. Now you use NumPy because it has all optimizations built-in. Understanding this spectrum - from educational to production - makes you a superior ML systems engineer who can make informed optimization decisions." 
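The summary's claim that 64×64 blocks fit in cache comes down to a few lines of arithmetic, sketched here (the 256 KB L2 figure is a typical size for modern desktop CPUs, not a guarantee for every chip):

```python
block = 64                                        # block edge used in this module
bytes_per_float32 = 4
tile_bytes = block * block * bytes_per_float32    # one 64x64 float32 tile = 16 KB
working_set = 3 * tile_bytes                      # A tile + B tile + C tile

print(f"working set: {working_set // 1024} KB")   # 48 KB
print("fits in a 256 KB L2:", working_set <= 256 * 1024)
```

Doubling the block edge to 128 quadruples each tile (64 KB), pushing the three-tile working set to 192 KB — still inside a 256 KB L2, but with little headroom left for anything else, which is why 64 is the sweet spot the module lands on.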
- ] - } - ], - "metadata": { - "jupytext": { - "cell_metadata_filter": "-all", - "main_language": "python", - "notebook_metadata_filter": "-all" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/modules_old/15_acceleration/acceleration_dev.py b/modules_old/15_acceleration/acceleration_dev.py deleted file mode 100644 index 51346675..00000000 --- a/modules_old/15_acceleration/acceleration_dev.py +++ /dev/null @@ -1,1532 +0,0 @@ -# %% [markdown] -""" -# Module 16: Hardware Acceleration - The Free Speedup! - -Welcome to Hardware Acceleration! You'll discover the easiest optimization in ML systems - getting 100x speedups with zero code changes! - -## LINK Building on Previous Learning -**What You Built Before**: -- Module 02 (Tensor): Triple-nested loops for matrix operations -- Module 04 (Layers): Forward pass implementations -- Module 15 (Profiling): Performance measurement and bottleneck identification - -**What's Working**: You can implement any matrix operation correctly using educational loops! - -**The Gap**: Your educational loops are 1000x slower than production code, limiting real ML applications. - -**This Module's Solution**: Learn the optimization spectrum from educational to production performance. - -**Connection Map**: -``` -Profiling -> Acceleration -> Production ML -(identify) (optimize) (deploy at scale) -``` - -## Learning Goals (Systems-Focused Framework) -- **Systems understanding**: CPU cache hierarchy and memory access patterns -- **Core implementation skill**: Cache-friendly blocking algorithms -- **Pattern/abstraction mastery**: Backend abstraction and automatic dispatch -- **Framework connections**: How PyTorch/TensorFlow achieve performance -- **Optimization trade-offs**: Educational clarity vs production speed - -## Build -> Use -> Reflect -1. **Build**: Cache-friendly blocked matrix multiplication from scratch -2. **Use**: Apply acceleration to real ML model operations (MLP, CNN, Attention) -3. 
**Reflect**: Analyze the educational-to-production optimization spectrum - -## Systems Reality Check -TIP **Production Context**: ML frameworks use these exact principles for 100x speedups -SPEED **Performance Insight**: Memory access patterns matter more than raw computation speed - -## The Free Speedup Journey - -**Key Message**: This is the EASIEST optimization - just use better backends! No accuracy trade-offs, no complex math - just 10-100x faster code. - -``` -Educational Loops -> Cache Blocking -> NumPy/BLAS -> Smart Backends - (learning) (understanding) (production) (automation) - 1000x slower 100x slower optimal speed transparent -``` - -**Visual Performance Spectrum**: -``` -Performance: [████████████████████████████████████████] 100% NumPy - [████] 4% Blocked - [▌] 0.1% Naive -``` - -**Why This Works**: Same math, better implementation. Free performance with zero downsides! -""" - -# %% [markdown] -""" -## Part 1: Baseline Implementation - Your Loops from Module 2/4 - -Let's start with the educational triple-nested loops you implemented earlier. These were perfect for learning but terrible for performance. - -### CPU vs GPU Architecture Fundamentals - -``` -CPU Architecture (Optimized for Sequential): GPU Architecture (Optimized for Parallel): -+---------------------------------------------+ +---------------------------------------------+ -| Complex Control Unit | | Simple Control Units | -| +---------+ Large Caches | | +-+ +-+ +-+ +-+ Small Caches | -| | Core 1 | +--------------------------+ | | |C| |C| |C| |C| +----------------------+ | -| +---------+ | L3 Cache (8MB) | | | +-+ +-+ +-+ +-+ | Shared Memory (48KB) | | -| +---------+ | | | | +-+ +-+ +-+ +-+ | | | -| | Core 2 | +--------------------------+ | | |C| |C| |C| |C| +----------------------+ | -| +---------+ | | +-+ +-+ +-+ +-+ ... 
(thousands of cores) | -| +---------+ Main Memory (16GB) | | | -| | Core 4 | +--------------------------+ | | High Bandwidth Memory (HBM) | -| +---------+ | 200+ cycle latency | | | +--------------------------------------+ | -| +--------------------------+ | | | 1000+ GB/s bandwidth | | -+---------------------------------------------+ +---------------------------------------------+ - -CPU: Few cores, complex, optimized for latency GPU: Many cores, simple, optimized for throughput -Best for: Sequential algorithms, complex logic Best for: Parallel algorithms, simple operations -``` - -### Memory Hierarchy Deep Dive - -``` -Memory Hierarchy (Latency and Size Trade-offs): - -Registers: 4 bytes | 1 cycle | ██████████ Speed -L1 Cache: 32KB | 3-4 cycles | ████████▒▒ -L2 Cache: 256KB | 10-20 cycles | ██████▒▒▒▒ -L3 Cache: 8MB | 50-100 cycles| ████▒▒▒▒▒▒ -Main RAM: 16GB | 200+ cycles | ██▒▒▒▒▒▒▒▒ -SSD Storage: 1TB | 100,000+ cyc | ▒▒▒▒▒▒▒▒▒▒ - ^ ^ - Size Speed -``` - -**The Cache Miss Problem**: -- Cache hit: Data found in L1 -> 1 cycle -- Cache miss: Must fetch from RAM -> 200+ cycles -- 200x slowdown for every cache miss! -""" - -# %% -#| default_exp backends.acceleration - -import time -import numpy as np -import matplotlib.pyplot as plt -from typing import Tuple, Dict, List - -def matmul_naive(a: np.ndarray, b: np.ndarray) -> np.ndarray: - """ - Educational matrix multiplication using triple nested loops. - - This is the same implementation from Module 2/4 - perfect for learning - the algorithm, but very slow due to poor cache performance. - - Memory Access Pattern Analysis: - ``` - Inner loop accesses: - a[i, k] -> Sequential access (cache-friendly) - b[k, j] -> Strided access (cache-hostile!) - - For 1000*1000 matrices: - - a[i,k]: 1000 sequential reads per row (good) - - b[k,j]: 1000 random column reads (terrible!) - - Total cache misses: ~1 billion! 
- ``` - """ - m, k = a.shape - k2, n = b.shape - assert k == k2, f"Incompatible shapes: {a.shape} @ {b.shape}" - - # Initialize result matrix (contiguous memory allocation) - c = np.zeros((m, n), dtype=np.float32) - - # Triple nested loop - the educational implementation - # i loop: iterates over output rows - # j loop: iterates over output columns - # k loop: performs dot product computation - for i in range(m): # Output row - for j in range(n): # Output column - for k_idx in range(k): # Dot product accumulation - # Cache analysis: a[i,k_idx] = sequential (good) - # b[k_idx,j] = strided (bad!) - c[i, j] += a[i, k_idx] * b[k_idx, j] - - return c - -# MAGNIFY SYSTEMS INSIGHT: Memory Access Pattern Analysis -def analyze_memory_access_patterns(): - """ - Visualize why naive loops create terrible cache performance. - - This analysis shows the fundamental problem with nested loops: - cache-hostile memory access patterns that destroy performance. - """ - try: - print("📊 Memory Access Pattern Analysis") - print("=" * 45) - - # Simulate memory access for small matrix - size = 4 - print(f"\nAnalyzing {size}x{size} matrix multiplication:") - print("\nMatrix A (row-major layout):") - print("Memory: [a00 a01 a02 a03 | a10 a11 a12 a13 | a20 a21 a22 a23 | a30 a31 a32 a33]") - print("\nMatrix B (row-major layout):") - print("Memory: [b00 b01 b02 b03 | b10 b11 b12 b13 | b20 b21 b22 b23 | b30 b31 b32 b33]") - - print("\n🔴 PROBLEM: Computing C[0,0] = sum(A[0,k] * B[k,0])") - print("A[0,k] accesses: a00, a01, a02, a03 (sequential OK)") - print("B[k,0] accesses: b00, b10, b20, b30 (every 4th element FAIL)") - - print("\n📊 Cache Miss Analysis:") - cache_line_size = 64 # bytes - float_size = 4 # bytes - elements_per_line = cache_line_size // float_size # 16 elements - - print(f"Cache line size: {cache_line_size} bytes = {elements_per_line} float32s") - print(f"Sequential access (A): 1 cache miss per {elements_per_line} elements") - print(f"Strided access (B): 1 cache miss per element 
(worst case)") - - # Calculate for realistic size - n = 1000 - sequential_misses = n // elements_per_line - strided_misses = n - total_operations = n * n * n - total_misses = total_operations * (sequential_misses + strided_misses) // n - - print(f"\n📊 Scaling to {n}x{n} matrices:") - print(f"Total operations: {total_operations:,}") - print(f"Estimated cache misses: {total_misses:,}") - print(f"Cache miss rate: {total_misses/total_operations:.1%}") - print(f"\n📊 Why this kills performance:") - print(f"Cache hit: 1 cycle") - print(f"Cache miss: 200+ cycles") - print(f"Performance penalty: 200x slower!") - - except Exception as e: - print(f"WARNING️ Error in memory analysis: {e}") - print("Make sure numpy is available") - -# Run the analysis -analyze_memory_access_patterns() - -# %% [markdown] -""" -### TEST Unit Test: Educational Implementation - -Let's test our educational loops and measure their performance characteristics. -""" - -# PASS IMPLEMENTATION CHECKPOINT: Naive matrix multiplication complete - -# THINK PREDICTION: How much slower are educational loops vs NumPy? -# Your guess: ___x slower for 100x100 matrices - -# MAGNIFY SYSTEMS INSIGHT #1: Why Educational Loops Are Slow -def analyze_educational_loop_performance(): - """ - Measure and understand why educational loops create performance problems. - - This analysis reveals the fundamental performance characteristics - that students experience when implementing algorithms from scratch. 
- """ - try: - print("📊 Educational Loop Performance Analysis") - print("=" * 50) - - # Test progressively larger matrices to show scaling - sizes = [50, 100, 200] - - print("\nPerformance scaling with matrix size:") - print("Size | Naive Time | NumPy Time | Slowdown | O(N³) Theory") - print("-" * 60) - - baseline_naive = None - baseline_numpy = None - - for size in sizes: - # Create test matrices - a = np.random.randn(size, size).astype(np.float32) - b = np.random.randn(size, size).astype(np.float32) - - # Time naive implementation - start = time.perf_counter() - _ = matmul_naive(a, b) - naive_time = time.perf_counter() - start - - # Time NumPy implementation - start = time.perf_counter() - _ = a @ b - numpy_time = time.perf_counter() - start - - # Calculate slowdown - slowdown = naive_time / numpy_time - - # Calculate theoretical scaling (O(N³)) - if baseline_naive is None: - baseline_naive = naive_time - baseline_numpy = numpy_time - theory_scale = 1.0 - else: - theory_scale = (size / sizes[0]) ** 3 - - print(f"{size:4d} | {naive_time*1000:9.1f}ms | {numpy_time*1000:9.1f}ms | {slowdown:7.0f}x | {theory_scale:8.1f}x") - - print(f"\n📊 Key Performance Insights:") - print(f"• Educational loops: Perfect for learning algorithms") - print(f"• Scaling follows O(N³): doubling size = 8x operations") - print(f"• Cache misses make large matrices exponentially slower") - print(f"• NumPy: Professional optimizations give 100-1000x speedup") - - print(f"\nTIP Why This Matters for ML Systems:") - print(f"• Understanding algorithms != performance optimization") - print(f"• Educational clarity vs production speed trade-off") - print(f"• Memory access patterns dominate performance") - print(f"• Library choice impacts application feasibility") - - except Exception as e: - print(f"WARNING️ Error in performance analysis: {e}") - print("Make sure matrices are small enough for educational timing") - -# Run the educational performance analysis -analyze_educational_loop_performance() - -# 
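One caveat about the measurements above: a single `time.perf_counter()` pair is noisy for sub-millisecond operations. A best-of-N wrapper — shown here as a sketch, not something the module itself uses (it deliberately keeps single-shot timing for simplicity) — reduces that noise:

```python
import time
import numpy as np

def best_of(fn, repeats=5):
    """Return the minimum wall-clock time over several runs (least noise-inflated)."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return min(times)

a = np.random.randn(100, 100).astype(np.float32)
b = np.random.randn(100, 100).astype(np.float32)
t = best_of(lambda: a @ b)
print(f"NumPy 100x100 matmul: {t * 1e6:.0f} us (best of 5)")
```

Taking the minimum (rather than the mean) is the conventional choice for micro-benchmarks, since interference from the OS can only make a run slower, never faster.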
%% -def test_naive_baseline(): - """ - Test naive implementation and measure its performance characteristics. - - This test validates correctness and demonstrates the performance gap - between educational loops and optimized implementations. - """ - print("TEST Testing Naive Implementation...") - - # Test correctness with small matrices first - a = np.array([[1, 2], [3, 4]], dtype=np.float32) - b = np.array([[5, 6], [7, 8]], dtype=np.float32) - - result_naive = matmul_naive(a, b) - result_numpy = a @ b - expected = np.array([[19, 22], [43, 50]], dtype=np.float32) - - assert np.allclose(result_naive, result_numpy), "Naive matmul incorrect vs NumPy" - assert np.allclose(result_naive, expected), "Naive matmul incorrect vs expected" - print("PASS Naive implementation produces correct results") - - # Performance comparison (small sizes only - educational is VERY slow) - print("\n📊 Performance comparison:") - small_a = np.random.randn(100, 100).astype(np.float32) - small_b = np.random.randn(100, 100).astype(np.float32) - - # Time naive implementation (limit size to avoid excessive wait) - start = time.perf_counter() - _ = matmul_naive(small_a, small_b) - naive_time = time.perf_counter() - start - - # Time NumPy implementation - start = time.perf_counter() - _ = small_a @ small_b - numpy_time = time.perf_counter() - start - - speedup = naive_time / numpy_time - - print(f"Naive loops: {naive_time*1000:8.1f} ms") - print(f"NumPy optimized: {numpy_time*1000:8.1f} ms") - print(f"Speedup: {speedup:8.1f}x faster") - - # Estimate scaling behavior - print(f"\n📊 Scaling Analysis (100x100 baseline):") - print(f"For 500x500 matrix: naive time grows ~125x") # (500/100)^3 = 125x more operations - print(f"For 1000x1000 matrix: naive time grows ~1000x") # (1000/100)^3 = 1000x more operations - print(f"\nTIP Why: O(N³) work growth, plus cache misses that widen the gap to NumPy even further") - - print("PASS Naive baseline established") - return naive_time, numpy_time, speedup - -# Execute the test
-test_naive_baseline() - -# %% [markdown] -""" -## Part 2: Understanding Cache Hierarchy - Why Memory Matters More Than Computation - -**The Big Insight**: Modern CPUs are FAST at computation but SLOW at memory access. Cache hierarchy makes the difference between fast and slow code. - -### CPU Cache Hierarchy Visualization -``` -CPU Cache Hierarchy (Latency vs Capacity Trade-off): - -+----------------------------------------------------------------------------------+ -Register: 4 bytes | ██████████ 1 cycle (instant access) | -L1 Cache: 32KB | ████████▒▒ 3-4 cycles (lightning fast) | -L2 Cache: 256KB | ██████▒▒▒▒ 10-20 cycles (fast) | -L3 Cache: 8MB | ████▒▒▒▒▒▒ 50-100 cycles(slow) | -Main RAM: 16GB | ██▒▒▒▒▒▒▒▒ 200+ cycles (VERY slow) | -SSD: 1TB | ▒▒▒▒▒▒▒▒▒▒ 100,000+ cyc (glacial) | -+----------------------------------------------------------------------------------+ - Size Speed Characteristics -``` - -**Key Principle**: Keep your working set in L1/L2 cache for 100x better performance! - -### Vectorization vs Parallelization Concepts - -``` -Vectorization (SIMD - Single Instruction, Multiple Data): -+--------------------------------------------------+ -| Scalar: for i in range(4): c[i] = a[i] + b[i] | -| ADD a[0], b[0] -> c[0] (4 operations) | -| ADD a[1], b[1] -> c[1] | -| ADD a[2], b[2] -> c[2] | -| ADD a[3], b[3] -> c[3] | -| | -| Vector: c = a + b (NumPy/BLAS) | -| VADD [a0,a1,a2,a3], [b0,b1,b2,b3] | -| -> [c0,c1,c2,c3] (1 operation!) 
| -+--------------------------------------------------+ - -Parallelization (Multiple cores working simultaneously): -+--------------------------------------------------+ -| Core 1: Computes rows 0-249 of result matrix | -| Core 2: Computes rows 250-499 of result matrix | -| Core 3: Computes rows 500-749 of result matrix | -| Core 4: Computes rows 750-999 of result matrix | -| | -| 4x speedup (ideal) if no synchronization costs | -+--------------------------------------------------+ -``` - -### Memory Access Pattern Analysis - -Your naive loops access memory like this: -```python -for i in range(m): # Loop over output rows - for j in range(n): # Loop over output columns - for k in range(k): # Loop over dot product - c[i,j] += a[i,k] * b[k,j] # b[k,j] creates cache misses! -``` - -**The Problem**: `b[k,j]` creates terrible access patterns: -- Each `j` increment jumps to a new column (cache miss) -- Each `k` increment jumps to a new row (another cache miss) -- For 1000*1000 matrix: 1 billion cache misses! - -**Visualization of Memory Access**: -``` -Matrix B in memory (row-major): -[b00 b01 b02 b03 | b10 b11 b12 b13 | b20 b21 b22 b23 | ...] - -Accessing column 0: b00, b10, b20, b30, ... - | | | | - 4 4 4 4 elements apart = strided access - 🔴 🔴 🔴 🔴 cache misses! -``` - -**The Solution**: Process in blocks that fit in cache. -""" - -# %% -def matmul_blocked(a: np.ndarray, b: np.ndarray, block_size: int = 64) -> np.ndarray: - """ - Cache-friendly blocked matrix multiplication. - - This version processes data in blocks that fit in CPU cache, - dramatically reducing cache misses and improving performance. 
- - **Memory Analysis (Quantitative)**: - - 64*64 float32 block = 4096 * 4 bytes = 16KB per block - - 3 blocks (A_block, B_block, C_block) = 48KB total - - Fits comfortably in 256KB L2 cache with room for other data - - Reuses each data element 64 times before evicting from cache - - **Why This Works**: - - Naive: 1 cache miss per operation (terrible) - - Blocked: 1 cache miss per 64 operations (64x better!) - - **Blocking Visualization**: - ``` - Large Matrix Multiplication: - A (1000x1000) * B (1000x1000) = C (1000x1000) - - Blocked Approach: - +----------------+ +----------------+ +----------------+ - | 64x64| | | 64x64| | | 64x64| | - | block | A | * | block | B | = | block | C | - | | | | | | | | | - +----------------+ +----------------+ +----------------+ - - Each 64x64 block fits in L1/L2 cache! - ``` - - Args: - a: Left matrix (m * k) - b: Right matrix (k * n) - block_size: Cache-friendly block size (64 = 16KB fits in L2 cache) - """ - m, k = a.shape - k2, n = b.shape - assert k == k2, f"Incompatible shapes: {a.shape} @ {b.shape}" - - # Initialize result matrix with zeros - c = np.zeros((m, n), dtype=np.float32) - - # Process in blocks to maximize cache utilization - # Outer loops: iterate over blocks - for i in range(0, m, block_size): # Block rows in A and C - for j in range(0, n, block_size): # Block columns in B and C - for k_idx in range(0, k, block_size): # Block columns in A, rows in B - - # Define block boundaries (handle edge cases) - i_end = min(i + block_size, m) - j_end = min(j + block_size, n) - k_end = min(k_idx + block_size, k) - - # Extract blocks that fit in cache - # These slices create views, not copies (memory efficient) - a_block = a[i:i_end, k_idx:k_end] # Shape: (<=64, <=64) - b_block = b[k_idx:k_end, j:j_end] # Shape: (<=64, <=64) - - # Multiply blocks using optimized NumPy BLAS - # This operates on cache-resident data - c[i:i_end, j:j_end] += a_block @ b_block - - return c - -def calculate_cache_footprint(block_size: int) -> dict: - 
""" - Calculate memory footprint for educational purposes. - - This helps students understand why different block sizes work better or worse. - Block size optimization is crucial for cache performance. - """ - bytes_per_float = 4 # float32 size - elements_per_block = block_size * block_size - bytes_per_block = elements_per_block * bytes_per_float - total_blocks = 3 # A_block, B_block, C_block - total_bytes = bytes_per_block * total_blocks - - # Cache size thresholds (typical modern CPU) - l1_cache_size = 32 * 1024 # 32KB L1 data cache - l2_cache_size = 256 * 1024 # 256KB L2 cache - l3_cache_size = 8 * 1024 * 1024 # 8MB L3 cache - - return { - "block_size": block_size, - "elements_per_block": elements_per_block, - "bytes_per_block": bytes_per_block, - "total_bytes": total_bytes, - "total_kb": total_bytes / 1024, - "fits_in_l1": total_bytes <= l1_cache_size, - "fits_in_l2": total_bytes <= l2_cache_size, - "fits_in_l3": total_bytes <= l3_cache_size, - "cache_level": ( - "L1" if total_bytes <= l1_cache_size else - "L2" if total_bytes <= l2_cache_size else - "L3" if total_bytes <= l3_cache_size else - "RAM" - ) - } - -# MAGNIFY SYSTEMS INSIGHT: Cache Optimization Analysis -def analyze_cache_optimization(): - """ - Analyze how different block sizes affect cache performance. - - This demonstrates the trade-off between cache utilization - and computational efficiency in blocked algorithms. 
- """ - try: - print("📊 Cache Optimization Analysis") - print("=" * 40) - - # Test different block sizes - block_sizes = [16, 32, 64, 128, 256] - - print("\nBlock Size Analysis:") - print("Size | Elements | Memory | Cache Level | Efficiency") - print("-" * 55) - - for block_size in block_sizes: - footprint = calculate_cache_footprint(block_size) - - # Calculate computational efficiency - # Smaller blocks = more overhead, larger blocks = cache misses - if footprint["fits_in_l1"]: - efficiency = "Excellent" - elif footprint["fits_in_l2"]: - efficiency = "Good" - elif footprint["fits_in_l3"]: - efficiency = "Fair" - else: - efficiency = "Poor" - - print(f"{block_size:4d} | {footprint['elements_per_block']:8d} | {footprint['total_kb']:6.1f}KB | {footprint['cache_level']:10s} | {efficiency}") - - print("\n📊 Optimal Block Size Analysis:") - optimal = calculate_cache_footprint(64) - print(f"64x64 blocks use {optimal['total_kb']:.1f}KB") - print(f"Fits in: {optimal['cache_level']} cache") - print(f"Reuse factor: Each element used 64 times") - print(f"Cache efficiency: 64x better than naive") - - print("\nTIP Key Insights:") - print("• Blocks too small: High loop overhead") - print("• Blocks too large: Cache misses") - print("• Sweet spot: 64x64 fits in L2 cache") - print("• Modern CPUs: Designed for this pattern!") - - except Exception as e: - print(f"WARNING️ Error in cache analysis: {e}") - -# Run the cache analysis -analyze_cache_optimization() - -# %% [markdown] -""" -### TEST Unit Test: Blocked Implementation - -Let's see how much faster cache-friendly blocking is compared to educational loops. -""" - -# PASS IMPLEMENTATION CHECKPOINT: Cache-friendly blocking complete - -# THINK PREDICTION: How much speedup does cache blocking provide? -# Your guess: ___x faster than educational loops - -# MAGNIFY SYSTEMS INSIGHT #2: Cache Blocking Effectiveness -def analyze_cache_blocking_effectiveness(): - """ - Measure how cache-friendly blocking improves performance. 
- - This demonstrates the practical impact of designing algorithms - that work with CPU cache hierarchy instead of against it. - """ - try: - print("📊 Cache Blocking Effectiveness Analysis") - print("=" * 45) - - # Test different block sizes to show optimal choice - matrix_size = 300 - block_sizes = [32, 64, 128, 256] - - # Create test matrices - a = np.random.randn(matrix_size, matrix_size).astype(np.float32) - b = np.random.randn(matrix_size, matrix_size).astype(np.float32) - - print(f"\nBlock Size Optimization (Matrix: {matrix_size}x{matrix_size}):") - print("Block | Time (ms) | Cache Fit | Efficiency") - print("-" * 45) - - best_time = float('inf') - best_block = 64 - - for block_size in block_sizes: - # Time blocked implementation - start = time.perf_counter() - _ = matmul_blocked(a, b, block_size=block_size) - blocked_time = time.perf_counter() - start - - # Calculate cache footprint - footprint = calculate_cache_footprint(block_size) - - # Determine efficiency - if blocked_time < best_time: - best_time = blocked_time - best_block = block_size - efficiency = "Optimal" - elif blocked_time < best_time * 1.2: - efficiency = "Good" - else: - efficiency = "Suboptimal" - - print(f"{block_size:5d} | {blocked_time*1000:8.1f} | {footprint['cache_level']:8s} | {efficiency}") - - # Compare with naive and NumPy - print(f"\n📊 Performance Comparison:") - - # Time naive (small subset) - start = time.perf_counter() - _ = matmul_naive(a[:50, :50], b[:50, :50]) - naive_time = time.perf_counter() - start - naive_scaled = naive_time * (matrix_size / 50) ** 3 - - # Time NumPy - start = time.perf_counter() - _ = a @ b - numpy_time = time.perf_counter() - start - - print(f"Naive (estimated): {naive_scaled*1000:8.1f}ms") - print(f"Blocked (optimal): {best_time*1000:8.1f}ms") - print(f"NumPy (production): {numpy_time*1000:8.1f}ms") - - speedup_blocked = naive_scaled / best_time - speedup_numpy = naive_scaled / numpy_time - - print(f"\nROCKET Speedup Results:") - print(f"Blocking: 
{speedup_blocked:.0f}x faster than naive") - print(f"NumPy: {speedup_numpy:.0f}x faster than naive") - print(f"Block size {best_block}: Optimal for this matrix size") - - print(f"\nTIP Key Cache Insights:") - print(f"• 64x64 blocks typically optimal (fits L2 cache)") - print(f"• Too small: High loop overhead") - print(f"• Too large: Cache misses return") - print(f"• Cache hierarchy shapes algorithm design") - - except Exception as e: - print(f"WARNING️ Error in blocking analysis: {e}") - print("Make sure all blocking functions are implemented correctly") - -# Run the cache blocking analysis -analyze_cache_blocking_effectiveness() - -def test_blocked_optimization(): - """Test blocked matrix multiplication performance""" - print("Testing Blocked Matrix Multiplication...") - - # Test correctness - a = np.random.randn(200, 200).astype(np.float32) - b = np.random.randn(200, 200).astype(np.float32) - - result_blocked = matmul_blocked(a, b, block_size=64) - result_numpy = a @ b - - assert np.allclose(result_blocked, result_numpy, atol=1e-3), "Blocked matmul incorrect" - print("PASS Blocked implementation produces correct results") - - # Performance comparison - print("\nPerformance comparison:") - - # Educational vs Blocked vs NumPy - size = 200 - test_a = np.random.randn(size, size).astype(np.float32) - test_b = np.random.randn(size, size).astype(np.float32) - - # Time educational (smaller subset to avoid waiting forever) - start = time.perf_counter() - _ = matmul_naive(test_a[:50, :50], test_b[:50, :50]) - naive_time = time.perf_counter() - start - # Scale cubic complexity: (200/50)³ = 4³ = 64x operations - scaling_factor = (size / 50) ** 3 - naive_time_scaled = naive_time * scaling_factor - - # Time blocked - start = time.perf_counter() - _ = matmul_blocked(test_a, test_b, block_size=64) - blocked_time = time.perf_counter() - start - - # Time NumPy - start = time.perf_counter() - _ = test_a @ test_b - numpy_time = time.perf_counter() - start - - print(f"Naive 
(estimated): {naive_time_scaled*1000:.1f} ms") - print(f"Blocked: {blocked_time*1000:.1f} ms") - print(f"NumPy: {numpy_time*1000:.1f} ms") - - speedup_blocked = naive_time_scaled / blocked_time - speedup_numpy = naive_time_scaled / numpy_time - - print(f"\nROCKET SPEEDUP RESULTS:") - print(f"Blocked is {speedup_blocked:.1f}x faster than naive loops!") - print(f"NumPy is {speedup_numpy:.1f}x faster than naive loops!") - print(f"\nTIP Why blocking works: Better cache utilization!") - print(f" • Naive: 1 cache miss per operation") - print(f" • Blocked: 1 cache miss per 64 operations") - print(f" • NumPy: Professional optimizations + vectorization") - - print("PASS Blocked optimization tested successfully") - return blocked_time, numpy_time - -# Execute the blocked test -test_blocked_optimization() - -# %% [markdown] -""" -## Part 3: NumPy Optimization - Production Performance - -Now we'll switch to NumPy for production use. The key insight: NumPy already has these optimizations (and more) built-in. -""" - -# %% -def matmul_numpy(a: np.ndarray, b: np.ndarray) -> np.ndarray: - """ - Production matrix multiplication using NumPy. - - This is what you should actually use in practice. - NumPy already has blocking, vectorization, and BLAS optimizations built-in. - """ - return a @ b - -# %% [markdown] -""" -### TEST Unit Test: Production Implementation - -Let's verify that NumPy is indeed the best choice for production. -""" - -# PASS IMPLEMENTATION CHECKPOINT: Production backend system complete - -# THINK PREDICTION: What makes NumPy faster than our blocking algorithm? -# Your answer: ___ (vectorization, BLAS, assembly, etc.) - -# MAGNIFY SYSTEMS INSIGHT #3: Production Optimization Analysis -def analyze_production_optimization_stack(): - """ - Analyze the complete optimization stack that makes NumPy so fast. - - This reveals why production libraries beat custom implementations - and what optimizations are built into professional ML frameworks. 
- """ - try: - print("📊 Production Optimization Stack Analysis") - print("=" * 50) - - # Test across range of sizes to show scaling characteristics - sizes = [100, 300, 500, 1000] - - print("\nOptimization Stack Performance:") - print("Size | Naive Est | Blocked | NumPy | Block->NumPy | Total Speedup") - print("-" * 70) - - for size in sizes: - # Create test matrices - a = np.random.randn(size, size).astype(np.float32) - b = np.random.randn(size, size).astype(np.float32) - - # Time blocked implementation - start = time.perf_counter() - _ = matmul_blocked(a, b, block_size=64) - blocked_time = time.perf_counter() - start - - # Time NumPy implementation - start = time.perf_counter() - _ = a @ b - numpy_time = time.perf_counter() - start - - # Estimate naive time (from small sample) - if size <= 200: - start = time.perf_counter() - _ = matmul_naive(a[:50, :50], b[:50, :50]) - naive_small = time.perf_counter() - start - naive_estimated = naive_small * (size / 50) ** 3 - else: - # Use previous scaling for larger matrices - naive_estimated = naive_time * (size / 200) ** 3 if 'naive_time' in locals() else blocked_time * 100 - - # Calculate speedups - block_speedup = naive_estimated / blocked_time - numpy_speedup = blocked_time / numpy_time - total_speedup = naive_estimated / numpy_time - - print(f"{size:4d} | {naive_estimated*1000:8.0f}ms | {blocked_time*1000:6.1f}ms | {numpy_time*1000:4.1f}ms | {numpy_speedup:9.1f}x | {total_speedup:11.0f}x") - - if size == 200: # Store for scaling estimation - naive_time = naive_estimated - - print(f"\n📊 NumPy's Optimization Stack:") - print(f"🔧 1. Cache Blocking: Process data in cache-friendly chunks") - print(f"🔧 2. Vectorization: SIMD instructions (4-8x speedup)") - print(f"🔧 3. BLAS Libraries: Hand-optimized linear algebra (Intel MKL, OpenBLAS)") - print(f"🔧 4. Assembly Kernels: CPU-specific optimizations") - print(f"🔧 5. Memory Layout: Optimal data structure organization") - print(f"🔧 6. 
Threading: Automatic parallelization for large matrices") - - print(f"\n📊 Development Cost vs Performance Benefit:") - print(f"• Custom blocking: 1 week implementation -> 10-50x speedup") - print(f"• BLAS integration: 1 month implementation -> additional 5-10x") - print(f"• Assembly optimization: 6+ months -> additional 2-5x") - print(f"• NumPy: 0 development time -> all optimizations included") - - print(f"\nTIP ML Systems Engineering Insight:") - print(f"• Focus on system architecture, not micro-optimizations") - print(f"• Leverage existing optimized libraries (NumPy, PyTorch, TensorFlow)") - print(f"• Understanding principles enables better system design") - print(f"• Build on foundations, don't reinvent optimized wheels") - - except Exception as e: - print(f"WARNING️ Error in production analysis: {e}") - print("Make sure all performance functions are implemented correctly") - -# Run the production optimization analysis -analyze_production_optimization_stack() - -# %% -def test_production_performance(): - """Test that NumPy is indeed optimal for production use""" - print("Testing Production Performance...") - - # Test different sizes - sizes = [200, 500, 800] - - print("\nPerformance comparison across the optimization spectrum:") - - for size in sizes: - print(f"\nMatrix size: {size}x{size}") - a = np.random.randn(size, size).astype(np.float32) - b = np.random.randn(size, size).astype(np.float32) - - # Time blocked implementation - start = time.perf_counter() - _ = matmul_blocked(a, b, block_size=64) - blocked_time = time.perf_counter() - start - - # Time NumPy implementation - start = time.perf_counter() - _ = matmul_numpy(a, b) - numpy_time = time.perf_counter() - start - - speedup = blocked_time / numpy_time - print(f"Blocked: {blocked_time*1000:6.1f} ms") - print(f"NumPy: {numpy_time*1000:6.1f} ms") - print(f"NumPy is {speedup:.1f}x faster than blocked") - - print("\nTIP Key Insight: NumPy already has these optimizations built-in!") - print(" • Blocking 
algorithms") - print(" • Vectorization") - print(" • Hardware-specific BLAS libraries") - print(" • Assembly-level optimizations") - - print("\nPASS Production performance verified") - return True - -# Execute the production test -test_production_performance() - -# %% [markdown] -""" -## Part 4: Smart Backend System - Transparent Optimization - -Now let's build a system that automatically chooses the right implementation. This is how real ML frameworks work! -""" - -# %% -class OptimizedBackend: - """ - Smart backend that automatically dispatches to optimal implementations. - - This demonstrates how real ML frameworks (PyTorch, TensorFlow) work: - - Single API for users - - Automatic dispatch to fastest implementation - - Transparent optimization without code changes - """ - - def dispatch(self, op: str, *args, **kwargs): - """Dispatch operations to optimal implementations""" - if op == "matmul": - return self.matmul(*args, **kwargs) - else: - raise NotImplementedError(f"Operation {op} not implemented") - - def matmul(self, a: np.ndarray, b: np.ndarray) -> np.ndarray: - """ - Matrix multiplication with automatic optimization selection. - - For production: Always use NumPy (has all optimizations built-in) - For education: Could switch based on size, but NumPy is always best - """ - # In a real system, you might choose based on: - # - Matrix size (small vs large) - # - Hardware available (CPU vs GPU) - # - Memory constraints - # - # But NumPy is almost always the right choice for CPU - return matmul_numpy(a, b) - -# Global backend instance -_backend = OptimizedBackend() - -def matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray: - """ - Matrix multiplication using optimal backend. - - This is the API students should use - it automatically - selects the best implementation available. 
- """ - return _backend.dispatch("matmul", a, b) - -# %% [markdown] -""" -### TEST Unit Test: Backend System - -Let's verify our backend system works correctly and uses optimal implementations. -""" - -# %% -def test_backend_system(): - """Test the backend system""" - print("Testing Backend System...") - - # Test matrices - a = np.random.randn(100, 100).astype(np.float32) - b = np.random.randn(100, 100).astype(np.float32) - - # Test that our backend works - result = matmul(a, b) - expected = a @ b - - assert np.allclose(result, expected), "Backend matmul incorrect" - print("PASS Backend produces correct results") - - # Compare performance - start = time.perf_counter() - _ = matmul(a, b) - backend_time = time.perf_counter() - start - - start = time.perf_counter() - _ = a @ b - numpy_time = time.perf_counter() - start - - print(f"\nPerformance comparison:") - print(f"Backend: {backend_time*1000:.1f} ms") - print(f"NumPy: {numpy_time*1000:.1f} ms") - print(f"Backend uses optimal NumPy implementation") - - print("\nPASS Backend system works correctly") - return True - -# Execute the backend test -test_backend_system() - -# %% [markdown] -""" -## TARGET Computational Assessment Questions - -Practice your understanding of hardware acceleration concepts with these NBGrader-compatible questions. - -These questions test your ability to analyze performance characteristics, optimize for cache hierarchy, and understand the engineering trade-offs in hardware acceleration. They're grounded in the actual implementations you just built and tested. -""" - -# %% nbgrader={"grade": false, "grade_id": "acceleration-q1", "locked": false, "schema_version": 3, "solution": true, "task": false} -def calculate_cache_efficiency(matrix_size: int, block_size: int) -> Tuple[int, float]: - """ - Calculate the cache efficiency improvement of blocked vs naive matrix multiplication. - - For a matrix_size * matrix_size multiplication using block_size * block_size blocks: - 1. 
Calculate total number of cache misses for naive implementation - 2. Calculate total number of cache misses for blocked implementation - 3. Return (total_operations, efficiency_ratio) - - Assumptions: - - Cache line = 64 bytes = 16 float32 elements - - Naive: Every B[k,j] access is a cache miss (column-major access) - - Blocked: 1 cache miss per block load, then block stays in cache - - Args: - matrix_size: Size of square matrices (N*N) - block_size: Size of blocks for blocked algorithm - - Returns: - Tuple[int, float]: (total_operations, cache_efficiency_ratio) - - TODO: Implement cache efficiency calculation for blocked matrix multiplication - - HINTS: - - Total operations = matrix_size³ - - Naive cache misses ~= matrix_size³ (every B access misses) - - Blocked cache misses = (matrix_size/block_size)³ * block_size² - - Efficiency ratio = naive_misses / blocked_misses - """ - ### BEGIN SOLUTION - # Total operations for matrix multiplication - total_operations = matrix_size ** 3 - - # Naive implementation cache misses - # Every access to B[k,j] causes a cache miss due to column-major access - naive_cache_misses = total_operations - - # Blocked implementation cache misses - # Number of blocks in each dimension - blocks_per_dim = (matrix_size + block_size - 1) // block_size # Ceiling division - total_blocks = blocks_per_dim ** 3 - - # Each block is loaded once, then all elements accessed from cache - blocked_cache_misses = total_blocks * block_size ** 2 - - # Cache efficiency ratio - efficiency_ratio = naive_cache_misses / blocked_cache_misses if blocked_cache_misses > 0 else 1.0 - - return total_operations, efficiency_ratio - ### END SOLUTION - -# %% nbgrader={"grade": false, "grade_id": "acceleration-q2", "locked": false, "schema_version": 3, "solution": true, "task": false} -def analyze_vectorization_speedup(array_size: int, vector_width: int) -> Tuple[int, int, float]: - """ - Analyze the theoretical speedup from vectorization (SIMD instructions). 
- - Calculate: - 1. Number of scalar operations needed - 2. Number of vector operations needed - 3. Theoretical speedup ratio - - Args: - array_size: Number of elements to process - vector_width: Number of elements processed per vector instruction - - Returns: - Tuple[int, int, float]: (scalar_ops, vector_ops, speedup_ratio) - - TODO: Calculate vectorization speedup for array operations - - APPROACH: - 1. Scalar: One operation per element - 2. Vector: One operation per vector_width elements (with remainder) - 3. Speedup: scalar_ops / vector_ops - - EXAMPLE: - >>> scalar_ops, vector_ops, speedup = analyze_vectorization_speedup(1000, 4) - >>> print(f"Scalar: {scalar_ops}, Vector: {vector_ops}, Speedup: {speedup:.1f}x") - Scalar: 1000, Vector: 250, Speedup: 4.0x - """ - ### BEGIN SOLUTION - # Scalar operations: one per element - scalar_ops = array_size - - # Vector operations: ceiling division to handle remainder - vector_ops = (array_size + vector_width - 1) // vector_width - - # Theoretical speedup (ignores overhead, assumes perfect vectorization) - speedup_ratio = scalar_ops / vector_ops if vector_ops > 0 else 1.0 - - return scalar_ops, vector_ops, speedup_ratio - ### END SOLUTION - -# %% nbgrader={"grade": false, "grade_id": "acceleration-q3", "locked": false, "schema_version": 3, "solution": true, "task": false} -def optimize_block_size(matrix_size: int, cache_sizes: Dict[str, int]) -> Tuple[int, str, float]: - """ - Find the optimal block size for a given matrix size and cache hierarchy. - - Test block sizes [16, 32, 64, 128, 256] and select the largest that fits in L2 cache. - - Args: - matrix_size: Size of square matrix to multiply - cache_sizes: Dictionary with cache sizes in bytes, e.g., {"L1": 32768, "L2": 262144} - - Returns: - Tuple[int, str, float]: (optimal_block_size, cache_level, memory_utilization) - - TODO: Find optimal block size based on cache constraints - - APPROACH: - 1. For each candidate block size, calculate memory footprint - 2. 
Check which cache level it fits in (3 blocks * block_size² * 4 bytes)
-     3. Select largest block size that fits in L2 cache
-     4. Calculate memory utilization = footprint / cache_size
-
-     EXAMPLE:
-     >>> cache_sizes = {"L1": 32768, "L2": 262144}
-     >>> block_size, level, util = optimize_block_size(1000, cache_sizes)
-     >>> print(f"Optimal: {block_size}x{block_size}, fits in {level}, {util:.1%} utilization")
-     Optimal: 128x128, fits in L2, 75.0% utilization
-     """
-     ### BEGIN SOLUTION
-     candidate_sizes = [16, 32, 64, 128, 256]
-     bytes_per_float = 4
-     blocks_needed = 3  # A_block, B_block, C_block
-
-     optimal_block_size = 16  # Default fallback
-     cache_level = "RAM"
-     memory_utilization = 0.0
-
-     # Test each candidate size (ascending, so the largest fit wins)
-     for block_size in candidate_sizes:
-         # Calculate memory footprint
-         elements_per_block = block_size * block_size
-         bytes_per_block = elements_per_block * bytes_per_float
-         total_footprint = bytes_per_block * blocks_needed
-
-         # Record the largest block size that still fits in cache
-         if total_footprint <= cache_sizes.get("L1", 0):
-             if block_size >= optimal_block_size:
-                 optimal_block_size = block_size
-                 cache_level = "L1"
-                 memory_utilization = total_footprint / cache_sizes["L1"]
-         elif total_footprint <= cache_sizes.get("L2", 0):
-             # L2 is the sweet spot for most cases
-             if block_size >= optimal_block_size:
-                 optimal_block_size = block_size
-                 cache_level = "L2"
-                 memory_utilization = total_footprint / cache_sizes["L2"]
-
-     return optimal_block_size, cache_level, memory_utilization
-     ### END SOLUTION
-
- # %% nbgrader={"grade": false, "grade_id": "acceleration-q4", "locked": false, "schema_version": 3, "solution": true, "task": false}
- def compare_acceleration_techniques(matrix_size: int) -> Dict[str, float]:
-     """
-     Compare the theoretical speedup of different acceleration techniques.
-
-     Calculate expected speedup for:
-     1. "cache_blocking": Blocked algorithm (64x64 blocks)
-     2. 
"vectorization": SIMD with 8-wide vectors - 3. "parallelization": 4-core CPU parallelization - 4. "combined": All techniques together - - Args: - matrix_size: Size of square matrices - - Returns: - Dict[str, float]: Speedup factors for each technique - - TODO: Calculate theoretical speedups for different acceleration techniques - - APPROACH: - 1. Cache blocking: Use previous cache efficiency calculation - 2. Vectorization: 8-wide SIMD operations - 3. Parallelization: 4 cores working in parallel - 4. Combined: Multiply individual speedups (idealized) - - ASSUMPTIONS: - - Perfect scaling (no overhead) - - Cache blocking gives efficiency_ratio improvement - - Vectorization gives 8x speedup - - Parallelization gives 4x speedup - """ - ### BEGIN SOLUTION - # Cache blocking speedup (using 64x64 blocks) - block_size = 64 - _, cache_speedup = calculate_cache_efficiency(matrix_size, block_size) - - # Vectorization speedup (8-wide SIMD) - vector_width = 8 - _, _, vectorization_speedup = analyze_vectorization_speedup(matrix_size ** 3, vector_width) - - # Parallelization speedup (4 cores) - parallelization_speedup = 4.0 - - # Combined speedup (multiplicative - idealized) - combined_speedup = cache_speedup * vectorization_speedup * parallelization_speedup - - return { - "cache_blocking": cache_speedup, - "vectorization": vectorization_speedup, - "parallelization": parallelization_speedup, - "combined": combined_speedup - } - ### END SOLUTION - -# %% [markdown] -""" -## Part 5: Real-World Application Testing - -Let's test our optimizations on actual ML model operations: MLP layers, CNN convolutions, and Transformer attention. -""" - -# %% -def test_ml_model_acceleration(): - """Test acceleration on real ML model operations""" - print("Testing Acceleration on Real ML Models...") - - # Test 1: MLP Forward Pass (common in Module 4) - print("\n1. 
MLP Forward Pass (256 -> 128 -> 64):") - batch_size, input_dim, hidden_dim, output_dim = 32, 256, 128, 64 - - # Simulated MLP layers - x = np.random.randn(batch_size, input_dim).astype(np.float32) - W1 = np.random.randn(input_dim, hidden_dim).astype(np.float32) - W2 = np.random.randn(hidden_dim, output_dim).astype(np.float32) - - # Time naive implementation (small version) - start = time.perf_counter() - h1_naive = matmul_naive(x[:8, :64], W1[:64, :32]) # Scaled down - h2_naive = matmul_naive(h1_naive, W2[:32, :16]) # Scaled down - naive_time = time.perf_counter() - start - - # Time optimized implementation - start = time.perf_counter() - h1_opt = matmul(x, W1) - h2_opt = matmul(h1_opt, W2) - opt_time = time.perf_counter() - start - - # Scale for: batch_size (32/8) * input_dim (256/64) * hidden_dim (128/32) - batch_scale = 32/8 # 4x more samples - input_scale = 256/64 # 4x larger input - hidden_scale = 128/32 # 4x larger hidden layer - naive_scaled = naive_time * batch_scale * input_scale * hidden_scale - speedup = naive_scaled / opt_time - - print(f" Naive (estimated): {naive_scaled*1000:.1f} ms") - print(f" Optimized: {opt_time*1000:.1f} ms") - print(f" Speedup: {speedup:.1f}x faster!") - - # Test 2: CNN-like Convolution (flattened as matrix multiply) - print("\n2. CNN Convolution (as matrix multiply):") - # Simulate im2col operation for 3x3 convolution - img_patches = np.random.randn(1024, 27).astype(np.float32) # 32x32 image, 3x3 patches - conv_filters = np.random.randn(27, 64).astype(np.float32) # 64 filters - - start = time.perf_counter() - conv_output = matmul(img_patches, conv_filters) - conv_time = time.perf_counter() - start - print(f" Convolution output: {conv_time*1000:.1f} ms") - print(f" Shape: {conv_output.shape} (1024 locations * 64 filters)") - - # Test 3: Transformer-like Attention (scaled down) - print("\n3. 
Transformer Attention (Q·K^T):") - seq_len, d_model = 128, 256 - Q = np.random.randn(seq_len, d_model).astype(np.float32) - K = np.random.randn(seq_len, d_model).astype(np.float32) - - start = time.perf_counter() - attention_scores = matmul(Q, K.T) # Shape: (seq_len, seq_len) - attn_time = time.perf_counter() - start - print(f" Attention computation: {attn_time*1000:.1f} ms") - print(f" Shape: {attention_scores.shape} (128*128 attention matrix)") - - print(f"\nPASS All ML model operations accelerated successfully!") - print(f"TIP Key insight: Matrix multiplication is EVERYWHERE in ML!") - return True - -# Execute the ML model test -test_ml_model_acceleration() - -# MAGNIFY SYSTEMS INSIGHT: Acceleration Scaling Analysis -def analyze_acceleration_scaling(): - """ - Analyze how different acceleration techniques scale with problem size. - - This demonstrates the performance characteristics of optimization - techniques across a range of matrix sizes typical in ML workloads. - """ - try: - print("📊 Acceleration Scaling Analysis") - print("=" * 45) - - # Test different matrix sizes (typical ML workloads) - matrix_sizes = [100, 200, 500, 1000, 2000] - - print("\nScaling Analysis Across Matrix Sizes:") - print("Size | Cache Block | Vectorization | Parallelization | Combined") - print("-" * 65) - - for size in matrix_sizes: - # Calculate speedups for this matrix size - speedups = compare_acceleration_techniques(size) - - print(f"{size:4d} | {speedups['cache_blocking']:10.1f} | {speedups['vectorization']:12.1f} | {speedups['parallelization']:14.1f} | {speedups['combined']:7.0f}") - - print(f"\n📊 Key Scaling Insights:") - - # Analyze cache blocking scaling - small_speedup = compare_acceleration_techniques(100)['cache_blocking'] - large_speedup = compare_acceleration_techniques(2000)['cache_blocking'] - - print(f"• Cache blocking: {small_speedup:.1f}x -> {large_speedup:.1f}x (scales with cache misses)") - print(f"• Vectorization: 8.0x constant (independent of matrix size)") - 
print(f"• Parallelization: 4.0x constant (perfect scaling assumed)") - print(f"• Combined: Multiplicative effect = cache * vector * parallel") - - print(f"\n📊 Real-World Performance Expectations:") - realistic_combined = large_speedup * 4.0 * 4.0 # Conservative vectorization - print(f"• Realistic combined speedup: ~{realistic_combined:.0f}x") - print(f"• Why not perfect: Memory bandwidth limits, overhead, synchronization") - print(f"• Production systems: Focus on cache + vectorization first") - - print(f"\nTIP ML Systems Implications:") - print(f"• Small models (<=500): Vectorization dominates") - print(f"• Large models (>=1000): Cache optimization critical") - print(f"• Production: Memory bandwidth becomes bottleneck") - print(f"• GPU: Different scaling - thousands of cores, different cache hierarchy") - - except Exception as e: - print(f"WARNING️ Error in scaling analysis: {e}") - print("Make sure all analysis functions are implemented correctly") - -# Run the scaling analysis -analyze_acceleration_scaling() - -def run_complete_acceleration_demo(): - """Run the complete acceleration demonstration""" - print("ROCKET Complete Hardware Acceleration Demo") - print("=" * 55) - print("THE FREE SPEEDUP: From Naive Loops to Optimized Backends") - - # 1. Test naive baseline - print("\n1. Naive Baseline (your Module 2/4 loops):") - naive_results = test_naive_baseline() - - # 2. Test blocked optimization - print("\n2. Cache-Friendly Blocking:") - test_blocked_optimization() - - # 3. Test production performance - print("\n3. Production Performance (NumPy):") - test_production_performance() - - # 4. Test ML model acceleration - print("\n4. Real ML Model Acceleration:") - test_ml_model_acceleration() - - # 5. Test backend system - print("\n5. 
Smart Backend System:") - test_backend_system() - - print("\n" + "=" * 55) - print("TARGET HARDWARE ACCELERATION MASTERED") - print("=" * 55) - - print("\n📚 What You Mastered:") - print("PASS Why your Module 2/4 loops were slow (cache hierarchy matters!)") - print("PASS How cache-friendly blocking works (process data in chunks)") - print("PASS Why NumPy dominates (professional optimizations built-in)") - print("PASS How to build smart backend systems (automatic optimization)") - print("PASS Real ML applications (MLPs, CNNs, Transformers all use matmul!)") - - print("\nTARGET The Free Speedup Philosophy:") - print("• ROCKET Same math, better implementation = 100x speedup") - print("• 🧠 Educational loops teach algorithms") - print("• SPEED Blocked algorithms teach cache optimization") - print("• 🏭 NumPy provides production performance") - print("• TARGET Smart backends make optimization transparent") - print("• TIP Understanding the spectrum makes you a better engineer!") - - return naive_results - -# %% [markdown] -""" -## Systems Analysis Summary - -This module demonstrates the fundamental principles of hardware acceleration in ML systems: - -### 🏗️ **Architecture Principles** -- **Cache Hierarchy**: Understanding L1/L2/L3 cache and memory access costs -- **Vectorization**: Leveraging SIMD instructions for parallel computation -- **Memory Layout**: Contiguous access patterns for optimal performance -- **Backend Abstraction**: Transparent dispatch between naive and optimized implementations - -### SPEED **Optimization Techniques** -- **Blocked Algorithms**: Process data in cache-friendly blocks -- **Vectorized Operations**: Avoid Python loops, use NumPy's optimized routines -- **In-place Operations**: Minimize memory allocation overhead -- **Automatic Dispatch**: Choose optimal implementation based on problem size - -### 📊 **Performance Understanding** -- **Measurement First**: Profile real bottlenecks before optimizing -- **Algorithmic Impact**: O(N³) -> O(N²) 
matters more than 2x constant factors -- **Hardware Awareness**: CPU cache misses cost 100x more than cache hits -- **Library Utilization**: Optimized BLAS libraries beat custom implementations - -### TARGET **Real-World Applications** -- **ML Frameworks**: How PyTorch/TensorFlow apply these same principles -- **Production Systems**: Where optimization efforts provide real value -- **Development Practice**: When to optimize vs when to use existing solutions - -### TIP **Key Insights** -- Cache-friendly algorithms provide 2-5x speedups from memory access patterns alone -- Vectorization eliminates Python overhead for 10-100x improvements -- Most NumPy operations are already optimized - focus on system-level improvements -- Competition frameworks make optimization learning engaging and quantifiable -- Real ML systems face memory and communication bottlenecks, not pure computation limits - -This approach teaches students to think like systems engineers: understand the hardware, measure scientifically, optimize systematically, and focus efforts where they matter most. -""" - -def test_unit_all(): - """Run all unit tests for the acceleration module.""" - print("TEST Running all Hardware Acceleration tests...") - print("=" * 55) - - try: - # Test educational baseline - print("\n1. Testing educational baseline...") - test_naive_baseline() - - # Test cache blocking optimization - print("\n2. Testing cache blocking...") - test_blocked_optimization() - - # Test production performance - print("\n3. Testing production performance...") - test_production_performance() - - # Test backend system - print("\n4. Testing backend system...") - test_backend_system() - - # Test ML model acceleration - print("\n5. 
Testing ML model acceleration...") - test_ml_model_acceleration() - - print("\n" + "=" * 55) - print("PASS All Hardware Acceleration tests passed!") - print("ROCKET Module ready for production ML systems.") - - except Exception as e: - print(f"FAIL Test failed: {e}") - raise - -if __name__ == "__main__": - print("Module 16: Hardware Acceleration - The Free Speedup!") - print("=" * 60) - print("ROCKET THE EASIEST OPTIMIZATION: Better Backends, Zero Trade-offs") - - # Run complete testing suite - test_unit_all() - - print(f"\nCELEBRATE Module 16: Hardware Acceleration COMPLETE!") - print(f"SPEED Mastered: 10-100x speedups with no accuracy loss") - print(f"🧠 Learned: Cache hierarchy, blocking, vectorization") - print(f"🏭 Applied: MLPs, CNNs, Transformers all benefit") - print(f"TARGET Ready: To build high-performance ML systems!") - -# %% [markdown] -""" -## THINK ML Systems Thinking: Interactive Questions - -1. **Memory Access Pattern Analysis**: In your `matmul_naive()` implementation, the innermost loop accesses `a[i, k]` sequentially but `b[k, j]` with large strides. When you tested 200*200 matrices, you saw dramatic slowdowns. Analyze why: (a) Calculate cache misses for both access patterns, (b) Explain why `b[k, j]` creates O(N²) cache misses, (c) Show how this scales to 1000*1000 matrices, and (d) Design a memory layout that would eliminate strided access. - -2. **Cache Blocking Optimization**: Your `matmul_blocked()` function uses 64*64 blocks and showed significant speedups over naive loops. Analyze the cache efficiency: (a) Calculate total memory footprint (3 blocks * 64² * 4 bytes), (b) Verify it fits in L2 cache (256KB), (c) Compute cache reuse factor (64 operations per cache line), (d) Predict performance change with 128*128 blocks, and (e) Explain why your cache analysis function showed 64*64 as optimal. - -3. **Production Stack Engineering**: You measured that NumPy beats your blocked implementation by 5-10x. 
Analyze the engineering trade-offs: (a) List three specific optimizations NumPy includes (BLAS, vectorization, threading), (b) Calculate development time vs. performance gain for each, (c) Explain why custom optimization rarely beats production libraries, and (d) Determine when custom optimization is justified in ML systems. - -4. **ML Acceleration Architecture**: Your tests showed acceleration benefits for MLP, CNN, and Transformer operations. Design an acceleration strategy: (a) Rank these operations by matrix multiplication density, (b) Identify memory bandwidth vs. compute bottlenecks for each, (c) Predict how GPU acceleration would change the rankings, and (d) Explain why understanding this spectrum enables better ML systems engineering decisions. -""" - -# %% [markdown] -""" -## 🎯 MODULE SUMMARY: Hardware Acceleration - The Free Speedup - -This module demonstrates the easiest optimization in ML systems: using better backends for free speedups with zero accuracy trade-offs. You learned why understanding the optimization spectrum makes you a better engineer.
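The `matmul_blocked()` tiling interrogated in the questions above can be sketched briefly. This is a minimal illustration of the blocking idea, not the module's graded implementation (which writes out the inner loops explicitly); the 64-element default mirrors the 64×64 blocks discussed here, and the per-tile product is delegated to NumPy for brevity:

```python
import numpy as np

def matmul_blocked(a, b, block=64):
    """Cache-blocked matrix multiply: process the matrices tile by tile
    so each working set of three tiles stays cache-resident."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((n, m), dtype=np.result_type(a, b))
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for k0 in range(0, k, block):
                # Accumulate the contribution of one (i0, k0) x (k0, j0) tile pair.
                # NumPy slicing clips at the edges, so ragged trailing tiles work.
                out[i0:i0+block, j0:j0+block] += (
                    a[i0:i0+block, k0:k0+block] @ b[k0:k0+block, j0:j0+block]
                )
    return out
```

A quick sanity check is to compare against `a @ b` with `np.allclose`, including matrix sizes that are not multiples of the block size.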
- -### 🛤️ **The Free Speedup Journey** -- **Educational Foundation**: Your Module 2/4 loops taught you the algorithm (perfect for learning) -- **Performance Understanding**: Module 15 showed you WHY loops are slow (profiling first) -- **Optimization Mastery**: Now you achieve 100x speedups by choosing better implementations -- **Systems Thinking**: Understanding the spectrum from educational to production code - -### 🛠️ **What We Built and Tested** -- **Educational Baseline**: Your triple-nested loops from Module 2/4 (algorithm understanding) -- **Cache-Friendly Blocking**: 64×64 blocks fitting in L1/L2 cache (10x+ speedup) -- **NumPy Production**: Leveraging professional BLAS optimizations (another 10x speedup) -- **Smart Backend System**: Automatic dispatch to optimal implementations -- **Real ML Applications**: MLP, CNN, Transformer operations using matrix multiplication - -### 🧠 **Key Learning Outcomes** -- **Why loops are slow**: Memory access patterns and cache hierarchy matter most -- **How blocking helps**: Processing data in cache-friendly chunks improves performance -- **When to use NumPy**: It already has these optimizations (and more) built-in -- **Systems thinking**: Understanding enables better decisions about when to optimize - -### ⚡ **Performance Spectrum Mastered** -- **Educational loops**: Algorithm understanding (1000x slower, perfect for learning) -- **Cache-friendly blocking**: Systems understanding (100x slower, teaches optimization) -- **NumPy production**: Professional performance (optimal speed, built-in optimizations) -- **Smart backends**: Engineering understanding (transparent optimization selection) - -### 🏆 **Practical Skills Developed** -- Analyze why educational implementations have poor performance -- Implement cache-friendly algorithms to understand optimization principles -- Choose NumPy for production while understanding what it's doing internally -- Build systems that balance educational value with performance requirements
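The smart backend system summarized above can be made concrete. The module exports `matmul` and `set_backend`, but the registry below is only a hypothetical sketch of how such transparent dispatch might be wired, not the actual `OptimizedBackend` implementation:

```python
import numpy as np

def _matmul_naive(a, b):
    """Triple-loop reference implementation (the educational baseline)."""
    n, k = a.shape
    _, m = b.shape
    out = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for p in range(k):
                out[i, j] += a[i, p] * b[p, j]
    return out

# Registry of interchangeable implementations: same math, different speed.
_BACKENDS = {"naive": _matmul_naive, "numpy": lambda a, b: a @ b}
_active = "numpy"  # production default: dispatch to the fastest backend

def set_backend(name):
    """Select which implementation `matmul` dispatches to."""
    if name not in _BACKENDS:
        raise ValueError(f"unknown backend: {name}")
    global _active
    _active = name

def matmul(a, b):
    """Callers see one API; only the implementation behind it changes."""
    return _BACKENDS[_active](a, b)
```

The key design point is that switching backends changes performance, never results: every registered implementation must agree with `a @ b` to within floating-point tolerance.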
- -### 📊 **Systems Insights Gained** -- **Educational code serves a purpose**: Understanding algorithms enables optimization intuition -- **Cache hierarchy dominates performance**: Memory access patterns matter more than computation -- **Libraries beat custom optimization**: NumPy already has expert-level optimizations -- **Understanding enables better tools**: You can build smarter systems when you know the principles - -### 💡 **The Free Speedup Philosophy** -This is the EASIEST optimization in ML systems: same math, better implementation, massive speedups, zero downsides. You implemented loops to understand algorithms. You implemented blocking to understand cache optimization. Now you use NumPy because it has all optimizations built-in. Understanding this spectrum - from educational to production - makes you a superior ML systems engineer who can make informed optimization decisions. -""" - diff --git a/modules_old/15_acceleration/module.yaml b/modules_old/15_acceleration/module.yaml deleted file mode 100644 index 8f1c6a56..00000000 --- a/modules_old/15_acceleration/module.yaml +++ /dev/null @@ -1,40 +0,0 @@ -assessment: -- Understand why naive loops have poor cache performance -- Implement cache-friendly blocked matrix multiplication showing 10-50x speedups -- Recognize why NumPy provides 100x+ speedups over custom implementations -- Build backend system that automatically chooses optimal implementations -- 'Apply the ''free speedup'' principle: use better tools, don''t write faster code' -description: 'Master the easiest optimization: using better backends! Learn why naive - loops are slow, how cache-friendly blocking helps, and why NumPy provides 100x+ - speedups.'
-difficulty: Advanced -estimated_time: 3-4 hours -exports: -- matmul_naive -- matmul_blocked -- matmul_numpy -- OptimizedBackend -- matmul -- set_backend -learning_objectives: -- Understand CPU cache hierarchy and memory access performance bottlenecks -- Implement cache-friendly blocked matrix multiplication algorithms -- Build vectorized operations with optimized memory access patterns -- Design transparent backend systems for automatic optimization selection -- Measure and quantify real performance improvements scientifically -- Apply systems thinking to optimization decisions in ML workflows -name: acceleration -prerequisites: -- 'Module 2: Tensor operations and NumPy fundamentals' -- 'Module 4: Linear layers and matrix multiplication' -- Understanding of basic algorithmic complexity (O notation) -tags: -- performance -- optimization -- systems -- hardware -- acceleration -- cache -- vectorization -- backends -title: Hardware Acceleration - The Simplest Optimization diff --git a/modules_old/16_quantization/module.yaml b/modules_old/16_quantization/module.yaml deleted file mode 100644 index a37e3442..00000000 --- a/modules_old/16_quantization/module.yaml +++ /dev/null @@ -1,28 +0,0 @@ -description: 'Precision optimization through INT8 quantization. Students learn to - reduce model size - - and accelerate inference by using lower precision arithmetic while maintaining accuracy. - - Especially powerful for CNN convolutions and edge deployment. 
- - ' -difficulty: advanced -estimated_hours: 6-8 -exports: -- tinytorch.quantization -learning_objectives: -- Understand precision vs performance trade-offs -- Implement INT8 quantization for neural networks -- Build calibration-based quantization systems -- Optimize CNN inference for mobile deployment -name: Quantization -number: 17 -prerequisites: -- Module 09: Spatial (CNNs) -- Module 16: Acceleration -skills_developed: -- Quantization techniques and mathematics -- Post-training optimization strategies -- Hardware-aware optimization -- Mobile and edge deployment patterns -type: optimization diff --git a/modules_old/16_quantization/quantization_dev.ipynb b/modules_old/16_quantization/quantization_dev.ipynb deleted file mode 100644 index 0bd5b6ed..00000000 --- a/modules_old/16_quantization/quantization_dev.ipynb +++ /dev/null @@ -1,2506 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "3a02901d", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "# Module 17: Quantization - Trading Precision for Speed\n", - "\n", - "Welcome to the Quantization module! After Module 16 showed you how to get free speedups through better algorithms, now we make our **first trade-off**: reduce precision for speed. You'll implement INT8 quantization to achieve 4× speedup with <1% accuracy loss.\n", - "\n", - "## Connection from Module 16: Acceleration → Quantization\n", - "\n", - "Module 16 taught you to accelerate computations through better algorithms and hardware utilization - these were \"free\" optimizations. Now we enter the world of **trade-offs**: sacrificing precision to gain speed. 
This is especially powerful for CNN inference where INT8 operations are much faster than FP32.\n", - "\n", - "## Learning Goals\n", - "\n", - "- **Systems understanding**: Memory vs precision tradeoffs and when quantization provides dramatic benefits\n", - "- **Core implementation skill**: Build INT8 quantization systems for CNN weights and activations \n", - "- **Pattern recognition**: Understand calibration-based quantization for post-training optimization\n", - "- **Framework connection**: See how production systems use quantization for edge deployment and mobile inference\n", - "- **Performance insight**: Achieve 4× speedup with <1% accuracy loss through precision optimization\n", - "\n", - "## Build → Profile → Optimize\n", - "\n", - "1. **Build**: Start with FP32 CNN inference (baseline)\n", - "2. **Profile**: Measure memory usage and computational cost of FP32 operations\n", - "3. **Optimize**: Implement INT8 quantization to achieve 4× speedup with minimal accuracy loss\n", - "\n", - "## What You'll Achieve\n", - "\n", - "By the end of this module, you'll understand:\n", - "- **Deep technical understanding**: How INT8 quantization reduces precision while maintaining model quality\n", - "- **Practical capability**: Implement production-grade quantization for CNN inference acceleration \n", - "- **Systems insight**: Memory vs precision tradeoffs in ML systems optimization\n", - "- **Performance mastery**: Achieve 4× speedup (50ms → 12ms inference) with <1% accuracy loss\n", - "- **Connection to edge deployment**: How mobile and edge devices use quantization for efficient AI\n", - "\n", - "## Systems Reality Check\n", - "\n", - "💡 **Production Context**: TensorFlow Lite and PyTorch Mobile use INT8 quantization for mobile deployment \n", - "⚡ **Performance Note**: CNN inference: FP32 = 50ms, INT8 = 12ms (4× faster) with 98% → 97.5% accuracy \n", - "🧠 **Memory Tradeoff**: INT8 uses 4× less memory and enables much faster integer arithmetic" - ] - }, - { - 
"cell_type": "code", - "execution_count": null, - "id": "4aee03f0", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "quantization-imports", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "#| default_exp optimization.quantize\n", - "\n", - "#| export\n", - "import math\n", - "import time\n", - "import numpy as np\n", - "import sys\n", - "import os\n", - "from typing import Union, List, Optional, Tuple, Dict, Any\n", - "\n", - "# Import our Tensor and CNN classes\n", - "try:\n", - " from tinytorch.core.tensor import Tensor\n", - " from tinytorch.core.spatial import Conv2d, MaxPool2D\n", - " MaxPool2d = MaxPool2D # Alias for consistent naming\n", - "except ImportError:\n", - " # For development, import from local modules\n", - " sys.path.append(os.path.join(os.path.dirname(__file__), '..', '02_tensor'))\n", - " sys.path.append(os.path.join(os.path.dirname(__file__), '..', '06_spatial'))\n", - " try:\n", - " from tensor_dev import Tensor\n", - " from spatial_dev import Conv2d, MaxPool2D\n", - " MaxPool2d = MaxPool2D # Alias for consistent naming\n", - " except ImportError:\n", - " # Create minimal mock classes if not available\n", - " class Tensor:\n", - " def __init__(self, data):\n", - " self.data = np.array(data)\n", - " self.shape = self.data.shape\n", - " class Conv2d:\n", - " def __init__(self, in_channels, out_channels, kernel_size):\n", - " self.weight = np.random.randn(out_channels, in_channels, kernel_size, kernel_size)\n", - " class MaxPool2d:\n", - " def __init__(self, kernel_size):\n", - " self.kernel_size = kernel_size" - ] - }, - { - "cell_type": "markdown", - "id": "c6c40d19", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Part 1: Understanding Quantization - The Precision vs Speed Trade-off\n", - "\n", - "Let's start by understanding what quantization means and why it provides such dramatic speedups. 
We'll build a baseline FP32 CNN and measure its computational cost.\n", - "\n", - "### The Quantization Concept\n", - "\n", - "Quantization converts high-precision floating-point numbers (FP32: 32 bits) to low-precision integers (INT8: 8 bits):\n", - "- **Memory**: 4× reduction (32 bits → 8 bits)\n", - "- **Compute**: Integer arithmetic is much faster than floating-point \n", - "- **Hardware**: Specialized INT8 units on modern CPUs and mobile processors\n", - "- **Trade-off**: Small precision loss for large speed gain" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4310bcbe", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "baseline-cnn", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class BaselineCNN:\n", - " \"\"\"\n", - " Baseline FP32 CNN for comparison with quantized version.\n", - " \n", - " This implementation uses standard floating-point arithmetic\n", - " to establish performance and accuracy baselines.\n", - " \"\"\"\n", - " \n", - " def __init__(self, input_channels: int = 3, num_classes: int = 10):\n", - " \"\"\"\n", - " Initialize baseline CNN with FP32 weights.\n", - " \n", - " TODO: Implement baseline CNN initialization.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Create convolutional layers with FP32 weights\n", - " 2. Create fully connected layer for classification\n", - " 3. Initialize weights with proper scaling\n", - " 4. 
Set up activation functions and pooling\n", - " \n", - " Args:\n", - " input_channels: Number of input channels (e.g., 3 for RGB)\n", - " num_classes: Number of output classes\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " self.input_channels = input_channels\n", - " self.num_classes = num_classes\n", - " \n", - " # Initialize FP32 convolutional weights\n", - " # Conv1: input_channels -> 32, kernel 3x3\n", - " self.conv1_weight = np.random.randn(32, input_channels, 3, 3) * 0.02\n", - " self.conv1_bias = np.zeros(32)\n", - " \n", - " # Conv2: 32 -> 64, kernel 3x3 \n", - " self.conv2_weight = np.random.randn(64, 32, 3, 3) * 0.02\n", - " self.conv2_bias = np.zeros(64)\n", - " \n", - " # Pooling (no parameters)\n", - " self.pool_size = 2\n", - " \n", - " # Fully connected layer (assuming 32x32 input -> 6x6 after convs+pools)\n", - " self.fc_input_size = 64 * 6 * 6 # 64 channels, 6x6 spatial\n", - " self.fc = np.random.randn(self.fc_input_size, num_classes) * 0.02\n", - " \n", - " print(f\"✅ BaselineCNN initialized: {self._count_parameters()} parameters\")\n", - " ### END SOLUTION\n", - " \n", - " def _count_parameters(self) -> int:\n", - " \"\"\"Count total parameters in the model.\"\"\"\n", - " conv1_params = 32 * self.input_channels * 3 * 3 + 32 # weights + bias\n", - " conv2_params = 64 * 32 * 3 * 3 + 64\n", - " fc_params = self.fc_input_size * self.num_classes\n", - " return conv1_params + conv2_params + fc_params\n", - " \n", - " def forward(self, x: np.ndarray) -> np.ndarray:\n", - " \"\"\"\n", - " Forward pass through baseline CNN.\n", - " \n", - " TODO: Implement FP32 CNN forward pass.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Apply first convolution + ReLU + pooling\n", - " 2. Apply second convolution + ReLU + pooling \n", - " 3. Flatten for fully connected layer\n", - " 4. Apply fully connected layer\n", - " 5. 
Return logits\n", - " \n", - " PERFORMANCE NOTE: This uses FP32 arithmetic throughout.\n", - " \n", - " Args:\n", - " x: Input tensor with shape (batch, channels, height, width)\n", - " \n", - " Returns:\n", - " Output logits with shape (batch, num_classes)\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " batch_size = x.shape[0]\n", - " \n", - " # Conv1 + ReLU + Pool\n", - " conv1_out = self._conv2d_forward(x, self.conv1_weight, self.conv1_bias)\n", - " conv1_relu = np.maximum(0, conv1_out)\n", - " pool1_out = self._maxpool2d_forward(conv1_relu, self.pool_size)\n", - " \n", - " # Conv2 + ReLU + Pool \n", - " conv2_out = self._conv2d_forward(pool1_out, self.conv2_weight, self.conv2_bias)\n", - " conv2_relu = np.maximum(0, conv2_out)\n", - " pool2_out = self._maxpool2d_forward(conv2_relu, self.pool_size)\n", - " \n", - " # Flatten\n", - " flattened = pool2_out.reshape(batch_size, -1)\n", - " \n", - " # Fully connected\n", - " logits = flattened @ self.fc\n", - " \n", - " return logits\n", - " ### END SOLUTION\n", - " \n", - " def _conv2d_forward(self, x: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:\n", - " \"\"\"Simple convolution implementation with bias (optimized for speed).\"\"\"\n", - " batch, in_ch, in_h, in_w = x.shape\n", - " out_ch, in_ch_w, kh, kw = weight.shape\n", - " \n", - " out_h = in_h - kh + 1\n", - " out_w = in_w - kw + 1\n", - " \n", - " output = np.zeros((batch, out_ch, out_h, out_w))\n", - " \n", - " # Optimized convolution using vectorized operations where possible\n", - " for b in range(batch):\n", - " for oh in range(out_h):\n", - " for ow in range(out_w):\n", - " # Extract input patch\n", - " patch = x[b, :, oh:oh+kh, ow:ow+kw] # (in_ch, kh, kw)\n", - " # Compute convolution for all output channels at once\n", - " for oc in range(out_ch):\n", - " output[b, oc, oh, ow] = np.sum(patch * weight[oc]) + bias[oc]\n", - " \n", - " return output\n", - " \n", - " def _maxpool2d_forward(self, x: np.ndarray, pool_size: int) -> 
np.ndarray:\n", - " \"\"\"Simple max pooling implementation.\"\"\"\n", - " batch, ch, in_h, in_w = x.shape\n", - " out_h = in_h // pool_size\n", - " out_w = in_w // pool_size\n", - " \n", - " output = np.zeros((batch, ch, out_h, out_w))\n", - " \n", - " for b in range(batch):\n", - " for c in range(ch):\n", - " for oh in range(out_h):\n", - " for ow in range(out_w):\n", - " h_start = oh * pool_size\n", - " w_start = ow * pool_size\n", - " pool_region = x[b, c, h_start:h_start+pool_size, w_start:w_start+pool_size]\n", - " output[b, c, oh, ow] = np.max(pool_region)\n", - " \n", - " return output\n", - " \n", - " def predict(self, x: np.ndarray) -> np.ndarray:\n", - " \"\"\"Make predictions with the model.\"\"\"\n", - " logits = self.forward(x)\n", - " return np.argmax(logits, axis=1)" - ] - }, - { - "cell_type": "markdown", - "id": "273c86f5", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### Test Baseline CNN Performance\n", - "\n", - "Let's test our baseline CNN to establish performance and accuracy baselines:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8fec5cc7", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "test-baseline-cnn", - "locked": false, - "points": 2, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_baseline_cnn():\n", - " \"\"\"Test baseline CNN implementation and measure performance.\"\"\"\n", - " print(\"🔍 Testing Baseline FP32 CNN...\")\n", - " print(\"=\" * 60)\n", - " \n", - " # Create baseline model\n", - " model = BaselineCNN(input_channels=3, num_classes=10)\n", - " \n", - " # Test forward pass\n", - " batch_size = 4\n", - " input_data = np.random.randn(batch_size, 3, 32, 32)\n", - " \n", - " print(f\"Testing with input shape: {input_data.shape}\")\n", - " \n", - " # Measure inference time\n", - " start_time = time.time()\n", - " logits = 
model.forward(input_data)\n", - " inference_time = time.time() - start_time\n", - " \n", - " # Validate output\n", - " assert logits.shape == (batch_size, 10), f\"Expected (4, 10), got {logits.shape}\"\n", - " print(f\"✅ Forward pass works: {logits.shape}\")\n", - " \n", - " # Test predictions\n", - " predictions = model.predict(input_data)\n", - " assert predictions.shape == (batch_size,), f\"Expected (4,), got {predictions.shape}\"\n", - " assert all(0 <= p < 10 for p in predictions), \"All predictions should be valid class indices\"\n", - " print(f\"✅ Predictions work: {predictions}\")\n", - " \n", - " # Performance baseline\n", - " print(f\"\\n📊 Performance Baseline:\")\n", - " print(f\" Inference time: {inference_time*1000:.2f}ms for batch of {batch_size}\")\n", - " print(f\" Per-sample time: {inference_time*1000/batch_size:.2f}ms\")\n", - " print(f\" Parameters: {model._count_parameters()} (all FP32)\")\n", - " print(f\" Memory usage: ~{model._count_parameters() * 4 / 1024:.1f}KB for weights\")\n", - " \n", - " print(\"✅ Baseline CNN tests passed!\")\n", - " print(\"💡 Ready to implement INT8 quantization for 4× speedup...\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "237858c6", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Part 2: INT8 Quantization Theory and Implementation\n", - "\n", - "Now let's implement the core quantization algorithms. 
We'll use **affine quantization** with scale and zero-point parameters to map FP32 values to INT8 range.\n", - "\n", - "### Quantization Mathematics\n", - "\n", - "The key insight is mapping continuous FP32 values to discrete INT8 values:\n", - "- **Quantization**: `int8_value = clip(round(fp32_value / scale + zero_point), -128, 127)`\n", - "- **Dequantization**: `fp32_value = (int8_value - zero_point) * scale`\n", - "- **Scale**: Controls the range of values that can be represented\n", - "- **Zero Point**: Ensures zero maps exactly to zero in quantized space" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b5b293fb", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "int8-quantizer", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class INT8Quantizer:\n", - " \"\"\"\n", - " INT8 quantizer for neural network weights and activations.\n", - " \n", - " This quantizer converts FP32 tensors to INT8 representation\n", - " using scale and zero-point parameters for maximum precision.\n", - " \"\"\"\n", - " \n", - " def __init__(self):\n", - " \"\"\"Initialize the quantizer.\"\"\"\n", - " self.calibration_stats = {}\n", - " \n", - " def compute_quantization_params(self, tensor: np.ndarray, \n", - " symmetric: bool = True) -> Tuple[float, int]:\n", - " \"\"\"\n", - " Compute quantization scale and zero point for a tensor.\n", - " \n", - " TODO: Implement quantization parameter computation.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Find min and max values in the tensor\n", - " 2. For symmetric quantization, use max(abs(min), abs(max))\n", - " 3. For asymmetric, use the full min/max range\n", - " 4. Compute scale to map FP32 range to INT8 range [-128, 127]\n", - " 5. 
Compute zero point to ensure accurate zero representation\n", - " \n", - " Args:\n", - " tensor: Input tensor to quantize\n", - " symmetric: Whether to use symmetric quantization (zero_point=0)\n", - " \n", - " Returns:\n", - " Tuple of (scale, zero_point)\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Find tensor range\n", - " tensor_min = float(np.min(tensor))\n", - " tensor_max = float(np.max(tensor))\n", - " \n", - " if symmetric:\n", - " # Symmetric quantization: use max absolute value\n", - " max_abs = max(abs(tensor_min), abs(tensor_max))\n", - " tensor_min = -max_abs\n", - " tensor_max = max_abs\n", - " zero_point = 0\n", - " else:\n", - " # Asymmetric quantization: use full range\n", - " zero_point = 0 # We'll compute this below\n", - " \n", - " # INT8 range is [-128, 127] = 255 values\n", - " int8_min = -128\n", - " int8_max = 127\n", - " int8_range = int8_max - int8_min\n", - " \n", - " # Compute scale\n", - " tensor_range = tensor_max - tensor_min\n", - " if tensor_range == 0:\n", - " scale = 1.0\n", - " else:\n", - " scale = tensor_range / int8_range\n", - " \n", - " if not symmetric:\n", - " # Compute zero point for asymmetric quantization\n", - " zero_point_fp = int8_min - tensor_min / scale\n", - " zero_point = int(round(np.clip(zero_point_fp, int8_min, int8_max)))\n", - " \n", - " return scale, zero_point\n", - " ### END SOLUTION\n", - " \n", - " def quantize_tensor(self, tensor: np.ndarray, scale: float, \n", - " zero_point: int) -> np.ndarray:\n", - " \"\"\"\n", - " Quantize FP32 tensor to INT8.\n", - " \n", - " TODO: Implement tensor quantization.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Apply quantization formula: q = fp32 / scale + zero_point\n", - " 2. Round to nearest integer\n", - " 3. Clip to INT8 range [-128, 127]\n", - " 4. 
Convert to INT8 data type\n", - " \n", - " Args:\n", - " tensor: FP32 tensor to quantize\n", - " scale: Quantization scale parameter\n", - " zero_point: Quantization zero point parameter\n", - " \n", - " Returns:\n", - " Quantized INT8 tensor\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " # Apply quantization formula\n", - " quantized_fp = tensor / scale + zero_point\n", - " \n", - " # Round and clip to INT8 range\n", - " quantized_int = np.round(quantized_fp)\n", - " quantized_int = np.clip(quantized_int, -128, 127)\n", - " \n", - " # Convert to INT8\n", - " quantized = quantized_int.astype(np.int8)\n", - " \n", - " return quantized\n", - " ### END SOLUTION\n", - " \n", - " def dequantize_tensor(self, quantized_tensor: np.ndarray, scale: float,\n", - " zero_point: int) -> np.ndarray:\n", - " \"\"\"\n", - " Dequantize INT8 tensor back to FP32.\n", - " \n", - " This function is PROVIDED for converting back to FP32.\n", - " \n", - " Args:\n", - " quantized_tensor: INT8 tensor\n", - " scale: Original quantization scale\n", - " zero_point: Original quantization zero point\n", - " \n", - " Returns:\n", - " Dequantized FP32 tensor\n", - " \"\"\"\n", - " # Convert to FP32 and apply dequantization formula\n", - " fp32_tensor = (quantized_tensor.astype(np.float32) - zero_point) * scale\n", - " return fp32_tensor\n", - " \n", - " def quantize_weights(self, weights: np.ndarray, \n", - " calibration_data: Optional[List[np.ndarray]] = None) -> Dict[str, Any]:\n", - " \"\"\"\n", - " Quantize neural network weights with optimal parameters.\n", - " \n", - " TODO: Implement weight quantization with calibration.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Compute quantization parameters for weight tensor\n", - " 2. Apply quantization to create INT8 weights\n", - " 3. Store quantization parameters for runtime dequantization\n", - " 4. Compute quantization error metrics\n", - " 5. 
Return quantized weights and metadata\n", - " \n", - " NOTE: For weights, we can use the full weight distribution\n", - " without needing separate calibration data.\n", - " \n", - " Args:\n", - " weights: FP32 weight tensor\n", - " calibration_data: Optional calibration data (unused for weights)\n", - " \n", - " Returns:\n", - " Dictionary containing quantized weights and parameters\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " print(f\"Quantizing weights with shape {weights.shape}...\")\n", - " \n", - " # Compute quantization parameters\n", - " scale, zero_point = self.compute_quantization_params(weights, symmetric=True)\n", - " \n", - " # Quantize weights\n", - " quantized_weights = self.quantize_tensor(weights, scale, zero_point)\n", - " \n", - " # Dequantize for error analysis\n", - " dequantized_weights = self.dequantize_tensor(quantized_weights, scale, zero_point)\n", - " \n", - " # Compute quantization error\n", - " quantization_error = np.mean(np.abs(weights - dequantized_weights))\n", - " max_error = np.max(np.abs(weights - dequantized_weights))\n", - " \n", - " # Memory savings\n", - " original_size = weights.nbytes\n", - " quantized_size = quantized_weights.nbytes\n", - " compression_ratio = original_size / quantized_size\n", - " \n", - " print(f\" Scale: {scale:.6f}, Zero point: {zero_point}\")\n", - " print(f\" Quantization error: {quantization_error:.6f} (max: {max_error:.6f})\")\n", - " print(f\" Compression: {compression_ratio:.1f}× ({original_size//1024}KB → {quantized_size//1024}KB)\")\n", - " \n", - " return {\n", - " 'quantized_weights': quantized_weights,\n", - " 'scale': scale,\n", - " 'zero_point': zero_point,\n", - " 'quantization_error': quantization_error,\n", - " 'compression_ratio': compression_ratio,\n", - " 'original_shape': weights.shape\n", - " }\n", - " ### END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "1264c1b2", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### Test 
INT8 Quantizer Implementation\n", - "\n", - "Let's test our quantizer to verify it works correctly:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6bb00459", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "test-quantizer", - "locked": false, - "points": 3, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_int8_quantizer():\n", - " \"\"\"Test INT8 quantizer implementation.\"\"\"\n", - " print(\"🔍 Testing INT8 Quantizer...\")\n", - " print(\"=\" * 60)\n", - " \n", - " quantizer = INT8Quantizer()\n", - " \n", - " # Test quantization parameters\n", - " test_tensor = np.random.randn(100, 100) * 2.0 # Range roughly [-6, 6]\n", - " scale, zero_point = quantizer.compute_quantization_params(test_tensor)\n", - " \n", - " print(f\"Test tensor range: [{np.min(test_tensor):.3f}, {np.max(test_tensor):.3f}]\")\n", - " print(f\"Quantization params: scale={scale:.6f}, zero_point={zero_point}\")\n", - " \n", - " # Test quantization/dequantization\n", - " quantized = quantizer.quantize_tensor(test_tensor, scale, zero_point)\n", - " dequantized = quantizer.dequantize_tensor(quantized, scale, zero_point)\n", - " \n", - " # Verify quantized tensor is INT8\n", - " assert quantized.dtype == np.int8, f\"Expected int8, got {quantized.dtype}\"\n", - " assert np.all(quantized >= -128) and np.all(quantized <= 127), \"Quantized values outside INT8 range\"\n", - " print(\"✅ Quantization produces valid INT8 values\")\n", - " \n", - " # Verify round-trip error is reasonable\n", - " quantization_error = np.mean(np.abs(test_tensor - dequantized))\n", - " max_error = np.max(np.abs(test_tensor - dequantized))\n", - " \n", - " assert quantization_error < 0.1, f\"Quantization error too high: {quantization_error}\"\n", - " print(f\"✅ Round-trip error acceptable: {quantization_error:.6f} (max: {max_error:.6f})\")\n", - " \n", - " # Test weight quantization\n", - " 
weight_tensor = np.random.randn(64, 32, 3, 3) * 0.1 # Typical conv weight range\n", - " weight_result = quantizer.quantize_weights(weight_tensor)\n", - " \n", - " # Verify weight quantization results\n", - " assert 'quantized_weights' in weight_result, \"Should return quantized weights\"\n", - " assert 'scale' in weight_result, \"Should return scale parameter\"\n", - " assert 'quantization_error' in weight_result, \"Should return error metrics\"\n", - " assert weight_result['compression_ratio'] > 3.5, \"Should achieve good compression\"\n", - " \n", - " print(f\"✅ Weight quantization: {weight_result['compression_ratio']:.1f}× compression\")\n", - " print(f\"✅ Weight quantization error: {weight_result['quantization_error']:.6f}\")\n", - " \n", - " print(\"✅ INT8 quantizer tests passed!\")\n", - " print(\"💡 Ready to build quantized CNN...\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "140e0e71", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Part 3: Quantized CNN Implementation\n", - "\n", - "Now let's create a quantized version of our CNN that uses INT8 weights while maintaining accuracy. We'll implement quantized convolution that's much faster than FP32.\n", - "\n", - "### Quantized Operations Strategy\n", - "\n", - "For maximum performance, we need to:\n", - "1. **Store weights in INT8** format (4× memory savings)\n", - "2. **Compute convolutions with INT8** arithmetic (faster)\n", - "3. **Dequantize only when necessary** for activation functions\n", - "4. 
**Calibrate quantization** using representative data" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7cdae5ea", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "quantized-conv2d", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class QuantizedConv2d:\n", - " \"\"\"\n", - " Quantized 2D convolution layer using INT8 weights.\n", - " \n", - " This layer stores weights in INT8 format and performs\n", - " optimized integer arithmetic for fast inference.\n", - " \"\"\"\n", - " \n", - " def __init__(self, in_channels: int, out_channels: int, kernel_size: int):\n", - " \"\"\"\n", - " Initialize quantized convolution layer.\n", - " \n", - " Args:\n", - " in_channels: Number of input channels\n", - " out_channels: Number of output channels \n", - " kernel_size: Size of convolution kernel\n", - " \"\"\"\n", - " self.in_channels = in_channels\n", - " self.out_channels = out_channels\n", - " self.kernel_size = kernel_size\n", - " \n", - " # Initialize FP32 weights (will be quantized during calibration)\n", - " weight_shape = (out_channels, in_channels, kernel_size, kernel_size)\n", - " self.weight_fp32 = np.random.randn(*weight_shape) * 0.02\n", - " self.bias = np.zeros(out_channels)\n", - " \n", - " # Quantization parameters (set during quantization)\n", - " self.weight_quantized = None\n", - " self.weight_scale = None\n", - " self.weight_zero_point = None\n", - " self.is_quantized = False\n", - " \n", - " def quantize_weights(self, quantizer: INT8Quantizer):\n", - " \"\"\"\n", - " Quantize the layer weights using the provided quantizer.\n", - " \n", - " TODO: Implement weight quantization for the layer.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Use quantizer to quantize the FP32 weights\n", - " 2. Store quantized weights and quantization parameters\n", - " 3. Mark layer as quantized\n", - " 4. 
Print quantization statistics\n", - " \n", - " Args:\n", - " quantizer: INT8Quantizer instance\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " print(f\"Quantizing Conv2d({self.in_channels}, {self.out_channels}, {self.kernel_size})\")\n", - " \n", - " # Quantize weights\n", - " result = quantizer.quantize_weights(self.weight_fp32)\n", - " \n", - " # Store quantized parameters\n", - " self.weight_quantized = result['quantized_weights']\n", - " self.weight_scale = result['scale']\n", - " self.weight_zero_point = result['zero_point']\n", - " self.is_quantized = True\n", - " \n", - " print(f\" Quantized: {result['compression_ratio']:.1f}× compression, \"\n", - " f\"{result['quantization_error']:.6f} error\")\n", - " ### END SOLUTION\n", - " \n", - " def forward(self, x: np.ndarray) -> np.ndarray:\n", - " \"\"\"\n", - " Forward pass with quantized weights.\n", - " \n", - " TODO: Implement quantized convolution forward pass.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Check if weights are quantized, use appropriate version\n", - " 2. For quantized: dequantize weights just before computation\n", - " 3. Perform convolution (same algorithm as baseline)\n", - " 4. 
Return result\n",
-    "        \n",
-    "        OPTIMIZATION NOTE: In production, this would use optimized INT8 kernels\n",
-    "        \n",
-    "        Args:\n",
-    "            x: Input tensor with shape (batch, channels, height, width)\n",
-    "        \n",
-    "        Returns:\n",
-    "            Output tensor\n",
-    "        \"\"\"\n",
-    "        ### BEGIN SOLUTION\n",
-    "        # Choose weights to use\n",
-    "        if self.is_quantized:\n",
-    "            # Dequantize weights for computation\n",
-    "            weights = self.weight_scale * (self.weight_quantized.astype(np.float32) - self.weight_zero_point)\n",
-    "        else:\n",
-    "            weights = self.weight_fp32\n",
-    "        \n",
-    "        # Perform convolution (valid padding, stride 1)\n",
-    "        batch, in_ch, in_h, in_w = x.shape\n",
-    "        out_ch, _, kh, kw = weights.shape\n",
-    "        \n",
-    "        out_h = in_h - kh + 1\n",
-    "        out_w = in_w - kw + 1\n",
-    "        \n",
-    "        output = np.zeros((batch, out_ch, out_h, out_w))\n",
-    "        \n",
-    "        # Loop over spatial positions; the channel dimension is vectorized\n",
-    "        for b in range(batch):\n",
-    "            for oh in range(out_h):\n",
-    "                for ow in range(out_w):\n",
-    "                    # Extract input patch\n",
-    "                    patch = x[b, :, oh:oh+kh, ow:ow+kw]  # (in_ch, kh, kw)\n",
-    "                    # Compute convolution for all output channels at once\n",
-    "                    output[b, :, oh, ow] = np.tensordot(weights, patch, axes=([1, 2, 3], [0, 1, 2])) + self.bias\n",
-    "        return output\n",
-    "        ### END SOLUTION"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "f2ca5b6c",
-   "metadata": {
-    "lines_to_next_cell": 1,
-    "nbgrader": {
-     "grade": false,
-     "grade_id": "quantized-cnn",
-     "locked": false,
-     "schema_version": 3,
-     "solution": true,
-     "task": false
-    }
-   },
-   "outputs": [],
-   "source": [
-    "#| export\n",
-    "class QuantizedCNN:\n",
-    "    \"\"\"\n",
-    "    CNN with INT8 quantized weights for fast inference.\n",
-    "    \n",
-    "    This model demonstrates how quantization can achieve 4× speedup\n",
-    "    with minimal accuracy loss through precision optimization.\n",
-    "    \"\"\"\n",
-    "    \n",
-    "    def __init__(self, input_channels: int = 3, num_classes: int = 10):\n",
-    "        \"\"\"\n",
- " Initialize quantized CNN.\n", - " \n", - " TODO: Implement quantized CNN initialization.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Create quantized convolutional layers\n", - " 2. Create fully connected layer (can be quantized later)\n", - " 3. Initialize quantizer for the model\n", - " 4. Set up pooling layers (unchanged)\n", - " \n", - " Args:\n", - " input_channels: Number of input channels\n", - " num_classes: Number of output classes\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " self.input_channels = input_channels\n", - " self.num_classes = num_classes\n", - " \n", - " # Quantized convolutional layers\n", - " self.conv1 = QuantizedConv2d(input_channels, 32, kernel_size=3)\n", - " self.conv2 = QuantizedConv2d(32, 64, kernel_size=3)\n", - " \n", - " # Pooling (unchanged) - we'll implement our own pooling\n", - " self.pool_size = 2\n", - " \n", - " # Fully connected (kept as FP32 for simplicity)\n", - " self.fc_input_size = 64 * 6 * 6\n", - " self.fc = np.random.randn(self.fc_input_size, num_classes) * 0.02\n", - " \n", - " # Quantizer\n", - " self.quantizer = INT8Quantizer()\n", - " self.is_quantized = False\n", - " \n", - " print(f\"✅ QuantizedCNN initialized: {self._count_parameters()} parameters\")\n", - " ### END SOLUTION\n", - " \n", - " def _count_parameters(self) -> int:\n", - " \"\"\"Count total parameters in the model.\"\"\"\n", - " conv1_params = 32 * self.input_channels * 3 * 3 + 32\n", - " conv2_params = 64 * 32 * 3 * 3 + 64 \n", - " fc_params = self.fc_input_size * self.num_classes\n", - " return conv1_params + conv2_params + fc_params\n", - " \n", - " def calibrate_and_quantize(self, calibration_data: List[np.ndarray]):\n", - " \"\"\"\n", - " Calibrate quantization parameters using representative data.\n", - " \n", - " TODO: Implement model quantization with calibration.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Process calibration data through model to collect statistics\n", - " 2. 
Quantize each layer using the calibration statistics\n", - " 3. Mark model as quantized\n", - " 4. Report quantization results\n", - " \n", - " Args:\n", - " calibration_data: List of representative input samples\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " print(\"🔧 Calibrating and quantizing model...\")\n", - " print(\"=\" * 50)\n", - " \n", - " # Quantize convolutional layers\n", - " self.conv1.quantize_weights(self.quantizer)\n", - " self.conv2.quantize_weights(self.quantizer)\n", - " \n", - " # Mark as quantized\n", - " self.is_quantized = True\n", - " \n", - " # Compute memory savings\n", - " original_conv_memory = (\n", - " self.conv1.weight_fp32.nbytes + \n", - " self.conv2.weight_fp32.nbytes\n", - " )\n", - " quantized_conv_memory = (\n", - " self.conv1.weight_quantized.nbytes + \n", - " self.conv2.weight_quantized.nbytes\n", - " )\n", - " \n", - " compression_ratio = original_conv_memory / quantized_conv_memory\n", - " \n", - " print(f\"✅ Quantization complete:\")\n", - " print(f\" Conv layers: {original_conv_memory//1024}KB → {quantized_conv_memory//1024}KB\")\n", - " print(f\" Compression: {compression_ratio:.1f}× memory savings\")\n", - " print(f\" Model ready for fast inference!\")\n", - " ### END SOLUTION\n", - " \n", - " def forward(self, x: np.ndarray) -> np.ndarray:\n", - " \"\"\"\n", - " Forward pass through quantized CNN.\n", - " \n", - " This function is PROVIDED - uses quantized layers.\n", - " \n", - " Args:\n", - " x: Input tensor\n", - " \n", - " Returns: \n", - " Output logits\n", - " \"\"\"\n", - " batch_size = x.shape[0]\n", - " \n", - " # Conv1 + ReLU + Pool (quantized)\n", - " conv1_out = self.conv1.forward(x)\n", - " conv1_relu = np.maximum(0, conv1_out)\n", - " pool1_out = self._maxpool2d_forward(conv1_relu, self.pool_size)\n", - " \n", - " # Conv2 + ReLU + Pool (quantized)\n", - " conv2_out = self.conv2.forward(pool1_out)\n", - " conv2_relu = np.maximum(0, conv2_out)\n", - " pool2_out = self._maxpool2d_forward(conv2_relu, 
self.pool_size)\n", - " \n", - " # Flatten and FC\n", - " flattened = pool2_out.reshape(batch_size, -1)\n", - " logits = flattened @ self.fc\n", - " \n", - " return logits\n", - " \n", - " def _maxpool2d_forward(self, x: np.ndarray, pool_size: int) -> np.ndarray:\n", - " \"\"\"Simple max pooling implementation.\"\"\"\n", - " batch, ch, in_h, in_w = x.shape\n", - " out_h = in_h // pool_size\n", - " out_w = in_w // pool_size\n", - " \n", - " output = np.zeros((batch, ch, out_h, out_w))\n", - " \n", - " for b in range(batch):\n", - " for c in range(ch):\n", - " for oh in range(out_h):\n", - " for ow in range(out_w):\n", - " h_start = oh * pool_size\n", - " w_start = ow * pool_size\n", - " pool_region = x[b, c, h_start:h_start+pool_size, w_start:w_start+pool_size]\n", - " output[b, c, oh, ow] = np.max(pool_region)\n", - " \n", - " return output\n", - " \n", - " def predict(self, x: np.ndarray) -> np.ndarray:\n", - " \"\"\"Make predictions with the quantized model.\"\"\"\n", - " logits = self.forward(x)\n", - " return np.argmax(logits, axis=1)" - ] - }, - { - "cell_type": "markdown", - "id": "ab99a4a9", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### Test Quantized CNN Implementation\n", - "\n", - "Let's test our quantized CNN and verify it maintains accuracy:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fc27c225", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "test-quantized-cnn", - "locked": false, - "points": 4, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_quantized_cnn():\n", - " \"\"\"Test quantized CNN implementation.\"\"\"\n", - " print(\"🔍 Testing Quantized CNN...\")\n", - " print(\"=\" * 60)\n", - " \n", - " # Create quantized model\n", - " model = QuantizedCNN(input_channels=3, num_classes=10)\n", - " \n", - " # Generate calibration data\n", - " calibration_data = 
[np.random.randn(1, 3, 32, 32) for _ in range(10)]\n", - " \n", - " # Test before quantization\n", - " test_input = np.random.randn(2, 3, 32, 32)\n", - " logits_before = model.forward(test_input)\n", - " print(f\"✅ Forward pass before quantization: {logits_before.shape}\")\n", - " \n", - " # Calibrate and quantize\n", - " model.calibrate_and_quantize(calibration_data)\n", - " assert model.is_quantized, \"Model should be marked as quantized\"\n", - " assert model.conv1.is_quantized, \"Conv1 should be quantized\"\n", - " assert model.conv2.is_quantized, \"Conv2 should be quantized\"\n", - " print(\"✅ Model quantization successful\")\n", - " \n", - " # Test after quantization\n", - " logits_after = model.forward(test_input)\n", - " assert logits_after.shape == logits_before.shape, \"Output shape should be unchanged\"\n", - " print(f\"✅ Forward pass after quantization: {logits_after.shape}\")\n", - " \n", - " # Check predictions still work\n", - " predictions = model.predict(test_input)\n", - " assert predictions.shape == (2,), f\"Expected (2,), got {predictions.shape}\"\n", - " assert all(0 <= p < 10 for p in predictions), \"All predictions should be valid\"\n", - " print(f\"✅ Predictions work: {predictions}\")\n", - " \n", - " # Verify quantization maintains reasonable accuracy\n", - " output_diff = np.mean(np.abs(logits_before - logits_after))\n", - " max_diff = np.max(np.abs(logits_before - logits_after))\n", - " print(f\"✅ Quantization impact: {output_diff:.4f} mean diff, {max_diff:.4f} max diff\")\n", - " \n", - " # Should have reasonable impact but not destroy the model\n", - " assert output_diff < 2.0, f\"Quantization impact too large: {output_diff:.4f}\"\n", - " \n", - " print(\"✅ Quantized CNN tests passed!\")\n", - " print(\"💡 Ready for performance comparison...\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "198a432f", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, 
-   "source": [
-    "## Part 4: Performance Analysis - Quantifying the Speedup\n",
-    "\n",
-    "Now let's measure the performance impact of INT8 quantization. We'll compare FP32 vs INT8 memory usage and inference speed.\n",
-    "\n",
-    "### Expected Results\n",
-    "- **Memory usage**: 4× reduction for quantized weights\n",
-    "- **Inference speed**: up to 4× with true INT8 kernels (this educational NumPy build dequantizes back to FP32, so expect little speedup here)\n",
-    "- **Accuracy**: <1% degradation (98% → 97.5% typical)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "bc634e4d",
-   "metadata": {
-    "lines_to_next_cell": 1,
-    "nbgrader": {
-     "grade": false,
-     "grade_id": "performance-analyzer",
-     "locked": false,
-     "schema_version": 3,
-     "solution": true,
-     "task": false
-    }
-   },
-   "outputs": [],
-   "source": [
-    "#| export\n",
-    "class QuantizationPerformanceAnalyzer:\n",
-    "    \"\"\"\n",
-    "    Analyze the performance benefits of INT8 quantization.\n",
-    "    \n",
-    "    This analyzer measures memory usage, inference speed,\n",
-    "    and accuracy to demonstrate the quantization trade-offs.\n",
-    "    \"\"\"\n",
-    "    \n",
-    "    def __init__(self):\n",
-    "        \"\"\"Initialize the performance analyzer.\"\"\"\n",
-    "        self.results = {}\n",
-    "    \n",
-    "    def benchmark_models(self, baseline_model: BaselineCNN, quantized_model: QuantizedCNN,\n",
-    "                         test_data: np.ndarray, num_runs: int = 10) -> Dict[str, Any]:\n",
-    "        \"\"\"\n",
-    "        Comprehensive benchmark of baseline vs quantized models.\n",
-    "        \n",
-    "        TODO: Implement comprehensive model benchmarking.\n",
-    "        \n",
-    "        STEP-BY-STEP IMPLEMENTATION:\n",
-    "        1. Measure memory usage for both models\n",
-    "        2. Benchmark inference speed over multiple runs\n",
-    "        3. Compare model outputs for accuracy analysis\n",
-    "        4. Compute performance improvement metrics\n",
-    "        5. 
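The memory side of this comparison can be verified directly from NumPy dtypes before running any benchmark. A quick sketch, sized to match `conv2`'s 64×32×3×3 weight tensor (the symmetric scale used here is an assumption for the illustration):

```python
import numpy as np

# FP32 weights as the baseline layer would hold them
w_fp32 = (np.random.randn(64, 32, 3, 3) * 0.02).astype(np.float32)

# Symmetric INT8 quantization: one scale, no zero point
scale = np.abs(w_fp32).max() / 127.0
w_int8 = np.clip(np.round(w_fp32 / scale), -127, 127).astype(np.int8)

print(f"FP32: {w_fp32.nbytes // 1024}KB, INT8: {w_int8.nbytes // 1024}KB")  # 72KB vs 18KB
print(w_fp32.nbytes / w_int8.nbytes)  # → 4.0
```

The 4× storage ratio is exact (4 bytes vs 1 byte per weight); the speed ratio is not, because it depends on whether the hardware actually executes INT8 arithmetic rather than dequantizing first.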
Return comprehensive results\n", - " \n", - " Args:\n", - " baseline_model: FP32 baseline CNN\n", - " quantized_model: INT8 quantized CNN\n", - " test_data: Test input data\n", - " num_runs: Number of benchmark runs\n", - " \n", - " Returns:\n", - " Dictionary containing benchmark results\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " print(f\"🔬 Benchmarking Models ({num_runs} runs)...\")\n", - " print(\"=\" * 50)\n", - " \n", - " batch_size = test_data.shape[0]\n", - " \n", - " # Memory Analysis\n", - " baseline_memory = self._calculate_memory_usage(baseline_model)\n", - " quantized_memory = self._calculate_memory_usage(quantized_model)\n", - " memory_reduction = baseline_memory / quantized_memory\n", - " \n", - " print(f\"📊 Memory Analysis:\")\n", - " print(f\" Baseline: {baseline_memory:.1f}KB\") \n", - " print(f\" Quantized: {quantized_memory:.1f}KB\")\n", - " print(f\" Reduction: {memory_reduction:.1f}×\")\n", - " \n", - " # Inference Speed Benchmark\n", - " print(f\"\\n⏱️ Speed Benchmark ({num_runs} runs):\")\n", - " \n", - " # Baseline timing\n", - " baseline_times = []\n", - " for run in range(num_runs):\n", - " start_time = time.time()\n", - " baseline_output = baseline_model.forward(test_data)\n", - " run_time = time.time() - start_time\n", - " baseline_times.append(run_time)\n", - " \n", - " baseline_avg_time = np.mean(baseline_times)\n", - " baseline_std_time = np.std(baseline_times)\n", - " \n", - " # Quantized timing \n", - " quantized_times = []\n", - " for run in range(num_runs):\n", - " start_time = time.time()\n", - " quantized_output = quantized_model.forward(test_data)\n", - " run_time = time.time() - start_time\n", - " quantized_times.append(run_time)\n", - " \n", - " quantized_avg_time = np.mean(quantized_times)\n", - " quantized_std_time = np.std(quantized_times)\n", - " \n", - " # Calculate speedup\n", - " speedup = baseline_avg_time / quantized_avg_time\n", - " \n", - " print(f\" Baseline: {baseline_avg_time*1000:.2f}ms ± 
{baseline_std_time*1000:.2f}ms\")\n", - " print(f\" Quantized: {quantized_avg_time*1000:.2f}ms ± {quantized_std_time*1000:.2f}ms\")\n", - " print(f\" Speedup: {speedup:.1f}×\")\n", - " \n", - " # Accuracy Analysis\n", - " output_diff = np.mean(np.abs(baseline_output - quantized_output))\n", - " max_diff = np.max(np.abs(baseline_output - quantized_output))\n", - " \n", - " # Prediction agreement\n", - " baseline_preds = np.argmax(baseline_output, axis=1)\n", - " quantized_preds = np.argmax(quantized_output, axis=1)\n", - " agreement = np.mean(baseline_preds == quantized_preds)\n", - " \n", - " print(f\"\\n🎯 Accuracy Analysis:\")\n", - " print(f\" Output difference: {output_diff:.4f} (max: {max_diff:.4f})\")\n", - " print(f\" Prediction agreement: {agreement:.1%}\")\n", - " \n", - " # Store results\n", - " results = {\n", - " 'memory_baseline_kb': baseline_memory,\n", - " 'memory_quantized_kb': quantized_memory,\n", - " 'memory_reduction': memory_reduction,\n", - " 'speed_baseline_ms': baseline_avg_time * 1000,\n", - " 'speed_quantized_ms': quantized_avg_time * 1000,\n", - " 'speedup': speedup,\n", - " 'output_difference': output_diff,\n", - " 'prediction_agreement': agreement,\n", - " 'batch_size': batch_size\n", - " }\n", - " \n", - " self.results = results\n", - " return results\n", - " ### END SOLUTION\n", - " \n", - " def _calculate_memory_usage(self, model) -> float:\n", - " \"\"\"\n", - " Calculate model memory usage in KB.\n", - " \n", - " This function is PROVIDED to estimate memory usage.\n", - " \"\"\"\n", - " total_memory = 0\n", - " \n", - " # Handle BaselineCNN\n", - " if hasattr(model, 'conv1_weight'):\n", - " total_memory += model.conv1_weight.nbytes + model.conv1_bias.nbytes\n", - " total_memory += model.conv2_weight.nbytes + model.conv2_bias.nbytes\n", - " total_memory += model.fc.nbytes\n", - " # Handle QuantizedCNN\n", - " elif hasattr(model, 'conv1'):\n", - " # Conv1 memory\n", - " if hasattr(model.conv1, 'weight_quantized') and 
model.conv1.is_quantized:\n",
-    "            total_memory += model.conv1.weight_quantized.nbytes\n",
-    "        else:\n",
-    "            total_memory += model.conv1.weight_fp32.nbytes\n",
-    "        \n",
-    "        # Conv2 memory\n",
-    "        if hasattr(model.conv2, 'weight_quantized') and model.conv2.is_quantized:\n",
-    "            total_memory += model.conv2.weight_quantized.nbytes\n",
-    "        else:\n",
-    "            total_memory += model.conv2.weight_fp32.nbytes\n",
-    "        \n",
-    "        # FC layer (kept as FP32)\n",
-    "        if hasattr(model, 'fc'):\n",
-    "            total_memory += model.fc.nbytes\n",
-    "        \n",
-    "        return total_memory / 1024  # Convert to KB\n",
-    "    \n",
-    "    def print_performance_summary(self, results: Dict[str, Any]):\n",
-    "        \"\"\"\n",
-    "        Print a comprehensive performance summary.\n",
-    "        \n",
-    "        This function is PROVIDED to display results clearly.\n",
-    "        \"\"\"\n",
-    "        print(\"\\n🚀 QUANTIZATION PERFORMANCE SUMMARY\")\n",
-    "        print(\"=\" * 60)\n",
-    "        print(f\"📊 Memory Optimization:\")\n",
-    "        print(f\"   • FP32 Model: {results['memory_baseline_kb']:.1f}KB\")\n",
-    "        print(f\"   • INT8 Model: {results['memory_quantized_kb']:.1f}KB\")\n",
-    "        print(f\"   • Memory savings: {results['memory_reduction']:.1f}× reduction\")\n",
-    "        print(f\"   • Storage efficiency: {(1 - 1/results['memory_reduction'])*100:.1f}% less memory\")\n",
-    "        \n",
-    "        print(f\"\\n⚡ Speed Optimization:\")\n",
-    "        print(f\"   • FP32 Inference: {results['speed_baseline_ms']:.1f}ms\")\n",
-    "        print(f\"   • INT8 Inference: {results['speed_quantized_ms']:.1f}ms\")\n",
-    "        print(f\"   • Speed improvement: {results['speedup']:.1f}× faster\")\n",
-    "        print(f\"   • Latency reduction: {(1 - 1/results['speedup'])*100:.1f}% faster\")\n",
-    "        \n",
-    "        print(f\"\\n🎯 Accuracy Trade-off:\")\n",
-    "        print(f\"   • Mean output difference: {results['output_difference']:.4f} (logits)\")\n",
-    "        print(f\"   • Prediction agreement: {results['prediction_agreement']:.1%}\")\n",
-    "        print(f\"   • Quality maintained with {results['speedup']:.1f}× speedup!\")\n",
-    "        \n",
-    "        # Overall assessment\n",
-    "        
efficiency_score = results['speedup'] * results['memory_reduction']\n", - " print(f\"\\n🏆 Overall Efficiency:\")\n", - " print(f\" • Combined benefit: {efficiency_score:.1f}× (speed × memory)\")\n", - " print(f\" • Trade-off assessment: {'🟢 Excellent' if results['prediction_agreement'] > 0.95 else '🟡 Good'}\")" - ] - }, - { - "cell_type": "markdown", - "id": "229ec98e", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### Test Performance Analysis \n", - "\n", - "Let's run comprehensive benchmarks to see the quantization benefits:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a57a9591", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "test-performance-analysis", - "locked": false, - "points": 4, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_performance_analysis():\n", - " \"\"\"Test performance analysis of quantization benefits.\"\"\"\n", - " print(\"🔍 Testing Performance Analysis...\")\n", - " print(\"=\" * 60)\n", - " \n", - " # Create models\n", - " baseline_model = BaselineCNN(input_channels=3, num_classes=10)\n", - " quantized_model = QuantizedCNN(input_channels=3, num_classes=10)\n", - " \n", - " # Calibrate quantized model\n", - " calibration_data = [np.random.randn(1, 3, 32, 32) for _ in range(5)]\n", - " quantized_model.calibrate_and_quantize(calibration_data)\n", - " \n", - " # Create test data\n", - " test_data = np.random.randn(4, 3, 32, 32)\n", - " \n", - " # Run performance analysis\n", - " analyzer = QuantizationPerformanceAnalyzer()\n", - " results = analyzer.benchmark_models(baseline_model, quantized_model, test_data, num_runs=3)\n", - " \n", - " # Verify results structure\n", - " assert 'memory_reduction' in results, \"Should report memory reduction\"\n", - " assert 'speedup' in results, \"Should report speed improvement\"\n", - " assert 'prediction_agreement' in results, 
\"Should report accuracy preservation\"\n", - " \n", - " # Verify quantization benefits (realistic expectation: conv layers quantized, FC kept FP32)\n", - " assert results['memory_reduction'] > 1.2, f\"Should show memory reduction, got {results['memory_reduction']:.1f}×\"\n", - " assert results['speedup'] > 0.5, f\"Educational implementation without actual INT8 kernels, got {results['speedup']:.1f}×\" \n", - " assert results['prediction_agreement'] >= 0.0, f\"Prediction agreement measurement, got {results['prediction_agreement']:.1%}\"\n", - " \n", - " print(f\"✅ Memory reduction: {results['memory_reduction']:.1f}×\")\n", - " print(f\"✅ Speed improvement: {results['speedup']:.1f}×\")\n", - " print(f\"✅ Prediction agreement: {results['prediction_agreement']:.1%}\")\n", - " \n", - " # Print comprehensive summary\n", - " analyzer.print_performance_summary(results)\n", - " \n", - " print(\"✅ Performance analysis tests passed!\")\n", - " print(\"🎉 Quantization delivers significant benefits!\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "95c2fa7b", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Part 5: Production Context - How Real Systems Use Quantization\n", - "\n", - "Understanding how production ML systems implement quantization provides valuable context for mobile deployment and edge computing.\n", - "\n", - "### Production Quantization Patterns" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0614cddc", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "production-context", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "class ProductionQuantizationInsights:\n", - " \"\"\"\n", - " Insights into how production ML systems use quantization.\n", - " \n", - " This class is PROVIDED to show real-world applications of 
the\n", - " quantization techniques you've implemented.\n", - " \"\"\"\n", - " \n", - " @staticmethod\n", - " def explain_production_patterns():\n", - " \"\"\"Explain how production systems use quantization.\"\"\"\n", - " print(\"🏭 PRODUCTION QUANTIZATION PATTERNS\")\n", - " print(\"=\" * 50)\n", - " print()\n", - " \n", - " patterns = [\n", - " {\n", - " 'system': 'TensorFlow Lite (Google)',\n", - " 'technique': 'Post-training INT8 quantization with calibration',\n", - " 'benefit': 'Enables ML on mobile devices and edge hardware',\n", - " 'challenge': 'Maintaining accuracy across diverse model architectures'\n", - " },\n", - " {\n", - " 'system': 'PyTorch Mobile (Meta)', \n", - " 'technique': 'Dynamic quantization with runtime calibration',\n", - " 'benefit': 'Reduces model size by 4× for mobile deployment',\n", - " 'challenge': 'Balancing quantization overhead vs inference speedup'\n", - " },\n", - " {\n", - " 'system': 'ONNX Runtime (Microsoft)',\n", - " 'technique': 'Mixed precision with selective layer quantization',\n", - " 'benefit': 'Optimizes critical layers while preserving accuracy',\n", - " 'challenge': 'Automated selection of quantization strategies'\n", - " },\n", - " {\n", - " 'system': 'Apple Core ML',\n", - " 'technique': 'INT8 quantization with hardware acceleration',\n", - " 'benefit': 'Leverages Neural Engine for ultra-fast inference',\n", - " 'challenge': 'Platform-specific optimization for different iOS devices'\n", - " }\n", - " ]\n", - " \n", - " for pattern in patterns:\n", - " print(f\"🔧 {pattern['system']}:\")\n", - " print(f\" Technique: {pattern['technique']}\")\n", - " print(f\" Benefit: {pattern['benefit']}\")\n", - " print(f\" Challenge: {pattern['challenge']}\")\n", - " print()\n", - " \n", - " @staticmethod \n", - " def explain_advanced_techniques():\n", - " \"\"\"Explain advanced quantization techniques.\"\"\"\n", - " print(\"⚡ ADVANCED QUANTIZATION TECHNIQUES\")\n", - " print(\"=\" * 45)\n", - " print()\n", - " \n", - " 
techniques = [\n", - " \"🧠 **Mixed Precision**: Quantize some layers to INT8, keep critical layers in FP32\",\n", - " \"🔄 **Dynamic Quantization**: Quantize weights statically, activations dynamically\",\n", - " \"📦 **Block-wise Quantization**: Different quantization parameters for weight blocks\",\n", - " \"⏰ **Quantization-Aware Training**: Train model to be robust to quantization\",\n", - " \"🎯 **Channel-wise Quantization**: Separate scales for each output channel\",\n", - " \"🔀 **Adaptive Quantization**: Adjust precision based on layer importance\",\n", - " \"⚖️ **Hardware-Aware Quantization**: Optimize for specific hardware capabilities\",\n", - " \"🛡️ **Calibration-Free Quantization**: Use statistical methods without data\"\n", - " ]\n", - " \n", - " for technique in techniques:\n", - " print(f\" {technique}\")\n", - " \n", - " print()\n", - " print(\"💡 **Your Implementation Foundation**: The INT8 quantization you built\")\n", - " print(\" demonstrates the core principles behind all these optimizations!\")\n", - " \n", - " @staticmethod\n", - " def show_performance_numbers():\n", - " \"\"\"Show real performance numbers from production systems.\"\"\"\n", - " print(\"📊 PRODUCTION QUANTIZATION NUMBERS\") \n", - " print(\"=\" * 40)\n", - " print()\n", - " \n", - " print(\"🚀 **Speed Improvements**:\")\n", - " print(\" • Mobile CNNs: 2-4× faster inference with INT8\") \n", - " print(\" • BERT models: 3-5× speedup with mixed precision\")\n", - " print(\" • Edge deployment: 10× improvement with dedicated INT8 hardware\")\n", - " print(\" • Real-time vision: Enables 30fps on mobile devices\")\n", - " print()\n", - " \n", - " print(\"💾 **Memory Reduction**:\")\n", - " print(\" • Model size: 4× smaller (critical for mobile apps)\")\n", - " print(\" • Runtime memory: 2-3× less activation memory\")\n", - " print(\" • Cache efficiency: Better fit in processor caches\")\n", - " print()\n", - " \n", - " print(\"🎯 **Accuracy Preservation**:\")\n", - " print(\" • Computer 
vision: <1% accuracy loss typical\")\n", - " print(\" • Language models: 2-5% accuracy loss acceptable\")\n", - " print(\" • Recommendation systems: Minimal impact on ranking quality\")\n", - " print(\" • Speech recognition: <2% word error rate increase\")" - ] - }, - { - "cell_type": "markdown", - "id": "ecec50b3", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Part 6: Systems Analysis - Precision vs Performance Trade-offs\n", - "\n", - "Let's analyze the fundamental trade-offs in quantization systems engineering.\n", - "\n", - "### Quantization Trade-off Analysis" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f28b0809", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "systems-analysis", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class QuantizationSystemsAnalyzer:\n", - " \"\"\"\n", - " Analyze the systems engineering trade-offs in quantization.\n", - " \n", - " This analyzer helps understand the precision vs performance principles\n", - " behind the speedups achieved by INT8 quantization.\n", - " \"\"\"\n", - " \n", - " def __init__(self):\n", - " \"\"\"Initialize the systems analyzer.\"\"\"\n", - " pass\n", - " \n", - " def analyze_precision_tradeoffs(self, bit_widths: List[int] = [32, 16, 8, 4]) -> Dict[str, Any]:\n", - " \"\"\"\n", - " Analyze precision vs performance trade-offs across bit widths.\n", - " \n", - " TODO: Implement comprehensive precision trade-off analysis.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. For each bit width, calculate:\n", - " - Memory usage per parameter\n", - " - Computational complexity \n", - " - Typical accuracy preservation\n", - " - Hardware support and efficiency\n", - " 2. Show trade-off curves and sweet spots\n", - " 3. 
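The memory column of this analysis follows from the bit width alone, so it can be computed rather than estimated. A small sketch; the 25M-parameter model size is a hypothetical example for scale, not a model from this module:

```python
def memory_footprint_mb(num_params: int, bits: int) -> float:
    """Storage needed for num_params weights at the given precision, in MB."""
    return num_params * (bits / 8) / 1e6

# Hypothetical 25M-parameter model at each precision in the analysis
for bits in [32, 16, 8, 4]:
    print(f"{bits:>2}-bit: {bits / 8:.1f} B/param -> {memory_footprint_mb(25_000_000, bits):.1f} MB")
```

Memory scales linearly with bit width (100 MB at FP32 down to 25 MB at INT8), while speed and accuracy do not — that non-linearity is what the rest of this analysis quantifies.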
Identify optimal configurations for different use cases\n", - " \n", - " This analysis reveals WHY INT8 is the sweet spot for most applications.\n", - " \n", - " Args:\n", - " bit_widths: List of bit widths to analyze\n", - " \n", - " Returns:\n", - " Dictionary containing trade-off analysis results\n", - " \"\"\"\n", - " ### BEGIN SOLUTION \n", - " print(\"🔬 Analyzing Precision vs Performance Trade-offs...\")\n", - " print(\"=\" * 55)\n", - " \n", - " results = {\n", - " 'bit_widths': bit_widths,\n", - " 'memory_per_param': [],\n", - " 'compute_efficiency': [],\n", - " 'typical_accuracy_loss': [],\n", - " 'hardware_support': [],\n", - " 'use_cases': []\n", - " }\n", - " \n", - " # Analyze each bit width\n", - " for bits in bit_widths:\n", - " print(f\"\\n📊 {bits}-bit Analysis:\")\n", - " \n", - " # Memory usage (bytes per parameter) \n", - " memory = bits / 8\n", - " results['memory_per_param'].append(memory)\n", - " print(f\" Memory: {memory} bytes/param\")\n", - " \n", - " # Compute efficiency (relative to FP32)\n", - " if bits == 32:\n", - " efficiency = 1.0 # FP32 baseline\n", - " elif bits == 16: \n", - " efficiency = 1.5 # FP16 is faster but not dramatically\n", - " elif bits == 8:\n", - " efficiency = 4.0 # INT8 has specialized hardware support\n", - " elif bits == 4:\n", - " efficiency = 8.0 # Very fast but limited hardware support\n", - " else:\n", - " efficiency = 32.0 / bits # Rough approximation\n", - " \n", - " results['compute_efficiency'].append(efficiency)\n", - " print(f\" Compute efficiency: {efficiency:.1f}× faster than FP32\")\n", - " \n", - " # Typical accuracy loss (percentage points)\n", - " if bits == 32:\n", - " acc_loss = 0.0 # No loss\n", - " elif bits == 16:\n", - " acc_loss = 0.1 # Minimal loss\n", - " elif bits == 8:\n", - " acc_loss = 0.5 # Small loss \n", - " elif bits == 4:\n", - " acc_loss = 2.0 # Noticeable loss\n", - " else:\n", - " acc_loss = min(10.0, 32.0 / bits) # Higher loss for lower precision\n", - " \n", - " 
results['typical_accuracy_loss'].append(acc_loss)\n", - " print(f\" Typical accuracy loss: {acc_loss:.1f}%\")\n", - " \n", - " # Hardware support assessment\n", - " if bits == 32:\n", - " hw_support = \"Universal\"\n", - " elif bits == 16:\n", - " hw_support = \"Modern GPUs, TPUs\"\n", - " elif bits == 8:\n", - " hw_support = \"CPUs, Mobile, Edge\"\n", - " elif bits == 4:\n", - " hw_support = \"Specialized chips\"\n", - " else:\n", - " hw_support = \"Research only\"\n", - " \n", - " results['hardware_support'].append(hw_support)\n", - " print(f\" Hardware support: {hw_support}\")\n", - " \n", - " # Optimal use cases\n", - " if bits == 32:\n", - " use_case = \"Training, high-precision inference\"\n", - " elif bits == 16:\n", - " use_case = \"Large model inference, mixed precision training\"\n", - " elif bits == 8:\n", - " use_case = \"Mobile deployment, edge inference, production CNNs\"\n", - " elif bits == 4:\n", - " use_case = \"Extreme compression, research applications\"\n", - " else:\n", - " use_case = \"Experimental\"\n", - " \n", - " results['use_cases'].append(use_case)\n", - " print(f\" Best for: {use_case}\")\n", - " \n", - " return results\n", - " ### END SOLUTION\n", - " \n", - " def print_tradeoff_summary(self, analysis: Dict[str, Any]):\n", - " \"\"\"\n", - " Print comprehensive trade-off summary.\n", - " \n", - " This function is PROVIDED to show the analysis clearly.\n", - " \"\"\"\n", - " print(\"\\n🎯 PRECISION VS PERFORMANCE TRADE-OFF SUMMARY\") \n", - " print(\"=\" * 60)\n", - " print(f\"{'Bits':<6} {'Memory':<8} {'Speed':<8} {'Acc Loss':<10} {'Hardware':<20}\")\n", - " print(\"-\" * 60)\n", - " \n", - " bit_widths = analysis['bit_widths']\n", - " memory = analysis['memory_per_param']\n", - " speed = analysis['compute_efficiency']\n", - " acc_loss = analysis['typical_accuracy_loss']\n", - " hardware = analysis['hardware_support']\n", - " \n", - " for i, bits in enumerate(bit_widths):\n", - " print(f\"{bits:<6} {memory[i]:<8.1f} {speed[i]:<8.1f}× 
{acc_loss[i]:<10.1f}% {hardware[i]:<20}\")\n", - " \n", - " print()\n", - " print(\"🔍 **Key Insights**:\")\n", - " \n", - " # Find sweet spot (best speed/accuracy trade-off)\n", - " efficiency_ratios = [s / (1 + a) for s, a in zip(speed, acc_loss)]\n", - " best_idx = np.argmax(efficiency_ratios)\n", - " best_bits = bit_widths[best_idx]\n", - " \n", - " print(f\" • Sweet spot: {best_bits}-bit provides best efficiency/accuracy trade-off\")\n", - " print(f\" • Memory scaling: Linear with bit width (4× reduction FP32→INT8)\")\n", - " print(f\" • Speed scaling: Non-linear due to hardware specialization\")\n", - " print(f\" • Accuracy: Manageable loss up to 8-bit, significant below\")\n", - " \n", - " print(f\"\\n💡 **Why INT8 Dominates Production**:\")\n", - " print(f\" • Hardware support: Excellent across all platforms\")\n", - " print(f\" • Speed improvement: {speed[bit_widths.index(8)]:.1f}× faster than FP32\")\n", - " print(f\" • Memory reduction: {32/8:.1f}× smaller models\")\n", - " print(f\" • Accuracy preservation: <{acc_loss[bit_widths.index(8)]:.1f}% typical loss\")\n", - " print(f\" • Deployment friendly: Fits mobile and edge constraints\")" - ] - }, - { - "cell_type": "markdown", - "id": "e0963291", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### Test Systems Analysis\n", - "\n", - "Let's analyze the fundamental precision vs performance trade-offs:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "355f3b6e", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "test-systems-analysis", - "locked": false, - "points": 3, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_systems_analysis():\n", - " \"\"\"Test systems analysis of precision vs performance trade-offs.\"\"\"\n", - " print(\"🔍 Testing Systems Analysis...\")\n", - " print(\"=\" * 60)\n", - " \n", - " analyzer = 
QuantizationSystemsAnalyzer()\n", - " \n", - " # Analyze precision trade-offs\n", - " analysis = analyzer.analyze_precision_tradeoffs([32, 16, 8, 4])\n", - " \n", - " # Verify analysis structure\n", - " assert 'compute_efficiency' in analysis, \"Should contain compute efficiency analysis\"\n", - " assert 'typical_accuracy_loss' in analysis, \"Should contain accuracy loss analysis\"\n", - " assert len(analysis['compute_efficiency']) == 4, \"Should analyze all bit widths\"\n", - " \n", - " # Verify scaling behavior\n", - " efficiency = analysis['compute_efficiency']\n", - " memory = analysis['memory_per_param']\n", - " \n", - " # INT8 should be much more efficient than FP32\n", - " int8_idx = analysis['bit_widths'].index(8)\n", - " fp32_idx = analysis['bit_widths'].index(32)\n", - " \n", - " assert efficiency[int8_idx] > efficiency[fp32_idx], \"INT8 should be more efficient than FP32\"\n", - " assert memory[int8_idx] < memory[fp32_idx], \"INT8 should use less memory than FP32\"\n", - " \n", - " print(f\"✅ INT8 efficiency: {efficiency[int8_idx]:.1f}× vs FP32\")\n", - " print(f\"✅ INT8 memory: {memory[int8_idx]:.1f} vs {memory[fp32_idx]:.1f} bytes/param\")\n", - " \n", - " # Show comprehensive analysis\n", - " analyzer.print_tradeoff_summary(analysis)\n", - " \n", - " # Verify INT8 is identified as optimal\n", - " efficiency_ratios = [s / (1 + a) for s, a in zip(analysis['compute_efficiency'], analysis['typical_accuracy_loss'])]\n", - " best_bits = analysis['bit_widths'][np.argmax(efficiency_ratios)]\n", - " \n", - " assert best_bits == 8, f\"INT8 should be identified as optimal, got {best_bits}-bit\"\n", - " print(f\"✅ Systems analysis correctly identifies {best_bits}-bit as optimal\")\n", - " \n", - " print(\"✅ Systems analysis tests passed!\")\n", - " print(\"💡 INT8 quantization is the proven sweet spot for production!\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "c8ae3d7c", - "metadata": { - 
"cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Part 7: Comprehensive Testing and Validation\n", - "\n", - "Let's run comprehensive tests to validate our complete quantization implementation:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6c1f4a1f", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "comprehensive-tests", - "locked": false, - "points": 5, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def run_comprehensive_tests():\n", - " \"\"\"Run comprehensive tests of the entire quantization system.\"\"\"\n", - " print(\"🧪 COMPREHENSIVE QUANTIZATION SYSTEM TESTS\")\n", - " print(\"=\" * 60)\n", - " \n", - " # Test 1: Baseline CNN\n", - " print(\"1. Testing Baseline CNN...\")\n", - " test_baseline_cnn()\n", - " print()\n", - " \n", - " # Test 2: INT8 Quantizer\n", - " print(\"2. Testing INT8 Quantizer...\")\n", - " test_int8_quantizer()\n", - " print()\n", - " \n", - " # Test 3: Quantized CNN\n", - " print(\"3. Testing Quantized CNN...\")\n", - " test_quantized_cnn()\n", - " print()\n", - " \n", - " # Test 4: Performance Analysis\n", - " print(\"4. Testing Performance Analysis...\")\n", - " test_performance_analysis()\n", - " print()\n", - " \n", - " # Test 5: Systems Analysis\n", - " print(\"5. Testing Systems Analysis...\")\n", - " test_systems_analysis()\n", - " print()\n", - " \n", - " # Test 6: End-to-end validation\n", - " print(\"6. 
End-to-end Validation...\")\n", - " try:\n", - " # Create models\n", - " baseline = BaselineCNN()\n", - " quantized = QuantizedCNN()\n", - " \n", - " # Create test data\n", - " test_input = np.random.randn(2, 3, 32, 32)\n", - " calibration_data = [np.random.randn(1, 3, 32, 32) for _ in range(3)]\n", - " \n", - " # Test pipeline\n", - " baseline_pred = baseline.predict(test_input)\n", - " quantized.calibrate_and_quantize(calibration_data)\n", - " quantized_pred = quantized.predict(test_input)\n", - " \n", - " # Verify pipeline works\n", - " assert len(baseline_pred) == len(quantized_pred), \"Predictions should have same length\"\n", - " print(f\" ✅ End-to-end pipeline works\")\n", - " print(f\" ✅ Baseline predictions: {baseline_pred}\")\n", - " print(f\" ✅ Quantized predictions: {quantized_pred}\")\n", - " \n", - " except Exception as e:\n", - " print(f\" ⚠️ End-to-end test issue: {e}\")\n", - " \n", - " print(\"🎉 ALL COMPREHENSIVE TESTS PASSED!\")\n", - " print(\"✅ Quantization system is working correctly!\")\n", - " print(\"🚀 Ready for production deployment with 4× speedup!\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "markdown", - "id": "2970c508", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Part 8: Systems Analysis - Memory Profiling and Computational Complexity\n", - "\n", - "Let's analyze the systems engineering aspects of quantization with detailed memory profiling and complexity analysis.\n", - "\n", - "### Memory Usage Analysis\n", - "\n", - "Understanding exactly how quantization affects memory usage is crucial for systems deployment:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5e1ac420", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "memory-profiler", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", 
- "class QuantizationMemoryProfiler:\n", - " \"\"\"\n", - " Memory profiler for analyzing quantization memory usage and complexity.\n", - " \n", - " This profiler demonstrates the systems engineering aspects of quantization\n", - " by measuring actual memory consumption and computational complexity.\n", - " \"\"\"\n", - " \n", - " def __init__(self):\n", - " \"\"\"Initialize the memory profiler.\"\"\"\n", - " pass\n", - " \n", - " def profile_memory_usage(self, baseline_model: BaselineCNN, quantized_model: QuantizedCNN) -> Dict[str, Any]:\n", - " \"\"\"\n", - " Profile detailed memory usage of baseline vs quantized models.\n", - " \n", - " This function is PROVIDED to demonstrate systems analysis methodology.\n", - " \"\"\"\n", - " print(\"🧠 DETAILED MEMORY PROFILING\")\n", - " print(\"=\" * 50)\n", - " \n", - " # Baseline model memory breakdown\n", - " print(\"📊 Baseline FP32 Model Memory:\")\n", - " baseline_conv1_mem = baseline_model.conv1_weight.nbytes + baseline_model.conv1_bias.nbytes\n", - " baseline_conv2_mem = baseline_model.conv2_weight.nbytes + baseline_model.conv2_bias.nbytes\n", - " baseline_fc_mem = baseline_model.fc.nbytes\n", - " baseline_total = baseline_conv1_mem + baseline_conv2_mem + baseline_fc_mem\n", - " \n", - " print(f\" Conv1 weights: {baseline_conv1_mem / 1024:.1f}KB (32×3×3×3 + 32 bias)\")\n", - " print(f\" Conv2 weights: {baseline_conv2_mem / 1024:.1f}KB (64×32×3×3 + 64 bias)\")\n", - " print(f\" FC weights: {baseline_fc_mem / 1024:.1f}KB (2304×10)\")\n", - " print(f\" Total: {baseline_total / 1024:.1f}KB\")\n", - " \n", - " # Quantized model memory breakdown\n", - " print(f\"\\n📊 Quantized INT8 Model Memory:\")\n", - " quant_conv1_mem = quantized_model.conv1.weight_quantized.nbytes if quantized_model.conv1.is_quantized else baseline_conv1_mem\n", - " quant_conv2_mem = quantized_model.conv2.weight_quantized.nbytes if quantized_model.conv2.is_quantized else baseline_conv2_mem\n", - " quant_fc_mem = quantized_model.fc.nbytes # FC kept
as FP32\n", - " quant_total = quant_conv1_mem + quant_conv2_mem + quant_fc_mem\n", - " \n", - " print(f\" Conv1 weights: {quant_conv1_mem / 1024:.1f}KB (quantized INT8)\") \n", - " print(f\" Conv2 weights: {quant_conv2_mem / 1024:.1f}KB (quantized INT8)\")\n", - " print(f\" FC weights: {quant_fc_mem / 1024:.1f}KB (kept FP32)\")\n", - " print(f\" Total: {quant_total / 1024:.1f}KB\")\n", - " \n", - " # Memory savings analysis\n", - " conv_savings = (baseline_conv1_mem + baseline_conv2_mem) / (quant_conv1_mem + quant_conv2_mem)\n", - " total_savings = baseline_total / quant_total\n", - " \n", - " print(f\"\\n💾 Memory Savings Analysis:\")\n", - " print(f\" Conv layers: {conv_savings:.1f}× reduction\")\n", - " print(f\" Overall model: {total_savings:.1f}× reduction\")\n", - " print(f\" Memory saved: {(baseline_total - quant_total) / 1024:.1f}KB\")\n", - " \n", - " return {\n", - " 'baseline_total_kb': baseline_total // 1024,\n", - " 'quantized_total_kb': quant_total // 1024,\n", - " 'conv_compression': conv_savings,\n", - " 'total_compression': total_savings,\n", - " 'memory_saved_kb': (baseline_total - quant_total) // 1024\n", - " }\n", - " \n", - " def analyze_computational_complexity(self) -> Dict[str, Any]:\n", - " \"\"\"\n", - " Analyze the computational complexity of quantization operations.\n", - " \n", - " This function is PROVIDED to demonstrate complexity analysis.\n", - " \"\"\"\n", - " print(\"\\n🔬 COMPUTATIONAL COMPLEXITY ANALYSIS\")\n", - " print(\"=\" * 45)\n", - " \n", - " # Model dimensions for analysis\n", - " batch_size = 32\n", - " input_h, input_w = 32, 32\n", - " conv1_out_ch, conv2_out_ch = 32, 64\n", - " kernel_size = 3\n", - " \n", - " print(f\"📐 Model Configuration:\")\n", - " print(f\" Input: {batch_size} × 3 × {input_h} × {input_w}\")\n", - " print(f\" Conv1: 3 → {conv1_out_ch}, {kernel_size}×{kernel_size} kernel\")\n", - " print(f\" Conv2: {conv1_out_ch} → {conv2_out_ch}, {kernel_size}×{kernel_size} kernel\")\n", - " \n", - " # FP32
operations\n", - " conv1_h_out = input_h - kernel_size + 1 # 30\n", - " conv1_w_out = input_w - kernel_size + 1 # 30\n", - " pool1_h_out = conv1_h_out // 2 # 15 \n", - " pool1_w_out = conv1_w_out // 2 # 15\n", - " \n", - " conv2_h_out = pool1_h_out - kernel_size + 1 # 13\n", - " conv2_w_out = pool1_w_out - kernel_size + 1 # 13\n", - " pool2_h_out = conv2_h_out // 2 # 6\n", - " pool2_w_out = conv2_w_out // 2 # 6\n", - " \n", - " # Calculate FLOPs\n", - " conv1_flops = batch_size * conv1_out_ch * conv1_h_out * conv1_w_out * 3 * kernel_size * kernel_size\n", - " conv2_flops = batch_size * conv2_out_ch * conv2_h_out * conv2_w_out * conv1_out_ch * kernel_size * kernel_size\n", - " fc_flops = batch_size * (conv2_out_ch * pool2_h_out * pool2_w_out) * 10\n", - " total_flops = conv1_flops + conv2_flops + fc_flops\n", - " \n", - " print(f\"\\n🔢 FLOPs Analysis (per batch):\")\n", - " print(f\" Conv1: {conv1_flops:,} FLOPs\")\n", - " print(f\" Conv2: {conv2_flops:,} FLOPs\") \n", - " print(f\" FC: {fc_flops:,} FLOPs\")\n", - " print(f\" Total: {total_flops:,} FLOPs\")\n", - " \n", - " # Memory access analysis\n", - " conv1_weight_access = conv1_out_ch * 3 * kernel_size * kernel_size # weights accessed\n", - " conv2_weight_access = conv2_out_ch * conv1_out_ch * kernel_size * kernel_size\n", - " \n", - " print(f\"\\n🗄️ Memory Access Patterns:\")\n", - " print(f\" Conv1 weight access: {conv1_weight_access:,} parameters\")\n", - " print(f\" Conv2 weight access: {conv2_weight_access:,} parameters\")\n", - " print(f\" FP32 memory bandwidth: {(conv1_weight_access + conv2_weight_access) * 4:,} bytes\")\n", - " print(f\" INT8 memory bandwidth: {(conv1_weight_access + conv2_weight_access) * 1:,} bytes\")\n", - " print(f\" Bandwidth reduction: 4× (FP32 → INT8)\")\n", - " \n", - " # Theoretical speedup analysis\n", - " print(f\"\\n⚡ Theoretical Speedup Sources:\")\n", - " print(f\" Memory bandwidth: 4× improvement (32-bit → 8-bit)\")\n", - " print(f\" Cache efficiency: Better fit in L1/L2 
cache\")\n", - " print(f\" SIMD vectorization: More operations per instruction\")\n", - " print(f\" Hardware acceleration: Dedicated INT8 units on modern CPUs\")\n", - " print(f\" Expected speedup: 2-4× in production systems\")\n", - " \n", - " return {\n", - " 'total_flops': total_flops,\n", - " 'memory_bandwidth_reduction': 4.0,\n", - " 'theoretical_speedup': 3.5 # Conservative estimate\n", - " }\n", - " \n", - " def analyze_scaling_behavior(self) -> Dict[str, Any]:\n", - " \"\"\"\n", - " Analyze how quantization benefits scale with model size.\n", - " \n", - " This function is PROVIDED to demonstrate scaling analysis.\n", - " \"\"\"\n", - " print(\"\\n📈 SCALING BEHAVIOR ANALYSIS\")\n", - " print(\"=\" * 35)\n", - " \n", - " model_sizes = [\n", - " ('Small CNN', 100_000),\n", - " ('Medium CNN', 1_000_000), \n", - " ('Large CNN', 10_000_000),\n", - " ('VGG-like', 138_000_000),\n", - " ('ResNet-like', 25_000_000)\n", - " ]\n", - " \n", - " print(f\"{'Model':<15} {'FP32 Size':<12} {'INT8 Size':<12} {'Savings':<10} {'Speedup'}\")\n", - " print(\"-\" * 65)\n", - " \n", - " for name, params in model_sizes:\n", - " fp32_size_mb = params * 4 / (1024 * 1024)\n", - " int8_size_mb = params * 1 / (1024 * 1024)\n", - " savings = fp32_size_mb / int8_size_mb\n", - " \n", - " # Speedup increases with model size due to memory bottlenecks\n", - " if params < 500_000:\n", - " speedup = 2.0 # Small models: limited by overhead\n", - " elif params < 5_000_000:\n", - " speedup = 3.0 # Medium models: good balance\n", - " else:\n", - " speedup = 4.0 # Large models: memory bound, maximum benefit\n", - " \n", - " print(f\"{name:<15} {fp32_size_mb:<11.1f}MB {int8_size_mb:<11.1f}MB {savings:<9.1f}× {speedup:<7.1f}×\")\n", - " \n", - " print(f\"\\n💡 Key Scaling Insights:\")\n", - " print(f\" • Memory savings: Linear 4× reduction for all model sizes\")\n", - " print(f\" • Speed benefits: Increase with model size (memory bottleneck)\") \n", - " print(f\" • Large models: Maximum benefit from 
reduced memory pressure\")\n", - " print(f\" • Mobile deployment: Enables models that wouldn't fit in RAM\")\n", - " \n", - " return {\n", - " 'memory_savings': 4.0,\n", - " 'speedup_range': (2.0, 4.0),\n", - " 'scaling_factor': 'increases_with_size'\n", - " }" - ] - }, - { - "cell_type": "markdown", - "id": "3ad32431", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### Test Memory Profiling and Systems Analysis\n", - "\n", - "Let's run comprehensive systems analysis to understand quantization behavior:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "349d7e31", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": true, - "grade_id": "test-memory-profiling", - "locked": false, - "points": 3, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_memory_profiling():\n", - " \"\"\"Test memory profiling and systems analysis.\"\"\"\n", - " print(\"🔍 Testing Memory Profiling and Systems Analysis...\")\n", - " print(\"=\" * 60)\n", - " \n", - " # Create models for profiling\n", - " baseline = BaselineCNN(3, 10)\n", - " quantized = QuantizedCNN(3, 10)\n", - " \n", - " # Quantize the model\n", - " calibration_data = [np.random.randn(1, 3, 32, 32) for _ in range(3)]\n", - " quantized.calibrate_and_quantize(calibration_data)\n", - " \n", - " # Run memory profiling\n", - " profiler = QuantizationMemoryProfiler()\n", - " \n", - " # Test memory usage analysis\n", - " memory_results = profiler.profile_memory_usage(baseline, quantized)\n", - " assert memory_results['conv_compression'] > 3.0, \"Should show significant conv layer compression\"\n", - " print(f\"✅ Conv layer compression: {memory_results['conv_compression']:.1f}×\")\n", - " \n", - " # Test computational complexity analysis\n", - " complexity_results = profiler.analyze_computational_complexity()\n", - " assert complexity_results['total_flops'] > 0, \"Should calculate FLOPs\"\n", 
- " assert complexity_results['memory_bandwidth_reduction'] == 4.0, \"Should show 4× bandwidth reduction\"\n", - " print(f\"✅ Memory bandwidth reduction: {complexity_results['memory_bandwidth_reduction']:.1f}×\")\n", - " \n", - " # Test scaling behavior analysis\n", - " scaling_results = profiler.analyze_scaling_behavior()\n", - " assert scaling_results['memory_savings'] == 4.0, \"Should show consistent 4× memory savings\"\n", - " print(f\"✅ Memory savings scaling: {scaling_results['memory_savings']:.1f}× across all model sizes\")\n", - " \n", - " print(\"✅ Memory profiling and systems analysis tests passed!\")\n", - " print(\"🎯 Quantization systems engineering principles validated!\")\n", - "\n", - "# Test function defined (called in main block)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fb29568e", - "metadata": {}, - "outputs": [], - "source": [ - "\"\"\"\n", - "# Part 9: Comprehensive Testing and Execution\n", - "\n", - "Let's run all our tests to validate the complete implementation:\n", - "\"\"\"\n", - "\n", - "if __name__ == \"__main__\":\n", - " print(\"🚀 MODULE 17: QUANTIZATION - TRADING PRECISION FOR SPEED\")\n", - " print(\"=\" * 70)\n", - " print(\"Testing complete INT8 quantization implementation for 4× speedup...\")\n", - " print()\n", - " \n", - " try:\n", - " # Run all tests\n", - " print(\"📋 Running Comprehensive Test Suite...\")\n", - " print()\n", - " \n", - " # Individual component tests\n", - " test_baseline_cnn()\n", - " print()\n", - " \n", - " test_int8_quantizer()\n", - " print()\n", - " \n", - " test_quantized_cnn()\n", - " print()\n", - " \n", - " test_performance_analysis()\n", - " print()\n", - " \n", - " test_systems_analysis()\n", - " print()\n", - " \n", - " test_memory_profiling()\n", - " print()\n", - " \n", - " # Show production context\n", - " print(\"🏭 PRODUCTION QUANTIZATION CONTEXT...\")\n", - " ProductionQuantizationInsights.explain_production_patterns()\n", - " ProductionQuantizationInsights.explain_advanced_techniques()\n", - "
ProductionQuantizationInsights.show_performance_numbers()\n", - " print()\n", - " \n", - " print(\"🎉 SUCCESS: All quantization tests passed!\")\n", - " print(\"🏆 ACHIEVEMENT: 4× speedup through precision optimization!\")\n", - " \n", - " except Exception as e:\n", - " print(f\"❌ Error in testing: {e}\")\n", - " import traceback\n", - " traceback.print_exc()" - ] - }, - { - "cell_type": "markdown", - "id": "594c24d5", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🤔 ML Systems Thinking: Interactive Questions\n", - "\n", - "Now that you've implemented INT8 quantization and achieved 4× speedup, let's reflect on the systems engineering principles and precision trade-offs you've learned." - ] - }, - { - "cell_type": "markdown", - "id": "94373519", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "systems-thinking-1", - "locked": false, - "points": 3, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "source": [ - "\"\"\"\n", - "**Question 1: Precision vs Performance Trade-offs**\n", - "\n", - "You implemented INT8 quantization that uses 4× less memory but provides 4× speedup with <1% accuracy loss.\n", - "\n", - "a) Why is INT8 the \"sweet spot\" for production quantization rather than INT4 or INT16?\n", - "b) In what scenarios would you choose NOT to use quantization despite the performance benefits?\n", - "c) How do hardware capabilities (mobile vs server) influence quantization decisions?\n", - "\n", - "*Think about: Hardware support, accuracy requirements, deployment constraints*\n", - "\"\"\"\n", - "\n", - "YOUR ANSWER HERE:\n", - "## BEGIN SOLUTION\n", - "\"\"\"\n", - "a) Why INT8 is the sweet spot:\n", - "- Hardware support: Excellent native INT8 support in CPUs, GPUs, and mobile processors\n", - "- Accuracy preservation: Can represent 256 different values, sufficient for most weight distributions\n", - "- Speed gains: Specialized INT8 arithmetic units provide real 4× speedup (not just theoretical)\n", - "- 
Memory sweet spot: 4× reduction is significant but not so extreme as to destroy model quality\n", - "- Production proven: Extensive validation across many model types shows <1% accuracy loss\n", - "- Tool ecosystem: TensorFlow Lite, PyTorch Mobile, ONNX Runtime all optimize for INT8\n", - "\n", - "b) Scenarios to avoid quantization:\n", - "- High-precision scientific computing where accuracy is paramount\n", - "- Models already at accuracy limits where any degradation is unacceptable\n", - "- Very small models where quantization overhead > benefits\n", - "- Research/development phases where interpretability and debugging are critical\n", - "- Applications requiring uncertainty quantification (quantization can affect calibration)\n", - "- Real-time systems where the quantization/dequantization overhead matters more than compute\n", - "\n", - "c) Hardware influence on quantization decisions:\n", - "- Mobile devices: Essential for deployment, enables on-device inference\n", - "- Edge hardware: Often has specialized INT8 units (Neural Engine, TPU Edge)\n", - "- Server GPUs: Mixed precision (FP16) might be better than INT8 for throughput\n", - "- CPUs: INT8 vectorization provides significant benefits over FP32\n", - "- Memory-constrained systems: Quantization may be required just to fit the model\n", - "- Bandwidth-limited: 4× smaller models transfer faster over network\n", - "\"\"\"\n", - "## END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "e58f8715", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "systems-thinking-2", - "locked": false, - "points": 3, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "source": [ - "\"\"\"\n", - "**Question 2: Calibration and Deployment Strategies**\n", - "\n", - "Your quantization uses calibration data to compute optimal scale and zero-point parameters.\n", - "\n", - "a) How would you select representative calibration data for a production CNN model?\n", - "b) What happens if your 
deployment data distribution differs significantly from calibration data?\n", - "c) How would you design a system to detect and handle quantization-related accuracy degradation in production?\n", - "\n", - "*Think about: Data distribution, model drift, monitoring systems*\n", - "\"\"\"\n", - "\n", - "YOUR ANSWER HERE:\n", - "## BEGIN SOLUTION\n", - "\"\"\"\n", - "a) Selecting representative calibration data:\n", - "- Sample diversity: Include examples from all classes/categories the model will see\n", - "- Data distribution matching: Ensure calibration data matches deployment distribution\n", - "- Edge cases: Include challenging examples that stress the model's capabilities\n", - "- Size considerations: 100-1000 samples usually sufficient, more doesn't help much\n", - "- Real production data: Use actual deployment data when possible, not just training data\n", - "- Temporal coverage: For time-sensitive models, include recent data patterns\n", - "- Geographic/demographic coverage: Ensure representation across user populations\n", - "\n", - "b) Distribution mismatch consequences:\n", - "- Quantization parameters become suboptimal for new data patterns\n", - "- Accuracy degradation can be severe (>5% loss instead of <1%)\n", - "- Some layers may be over/under-scaled leading to clipping or poor precision\n", - "- Model confidence calibration can be significantly affected\n", - "- Solutions: Periodic re-calibration, adaptive quantization, monitoring systems\n", - "- Detection: Compare quantized vs FP32 outputs on production traffic sample\n", - "\n", - "c) Production monitoring system design:\n", - "- Dual inference: Run small percentage of traffic through both quantized and FP32 models\n", - "- Accuracy metrics: Track prediction agreement, confidence score differences\n", - "- Distribution monitoring: Detect when input data drifts from calibration distribution\n", - "- Performance alerts: Automated alerts when quantized model accuracy drops significantly\n", - "- A/B 
testing framework: Gradual rollout with automatic rollback on accuracy drops\n", - "- Model versioning: Keep FP32 backup model ready for immediate fallback\n", - "- Regular recalibration: Scheduled re-quantization with fresh production data\n", - "\"\"\"\n", - "## END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "6e90a0d7", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "systems-thinking-3", - "locked": false, - "points": 3, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "source": [ - "\"\"\"\n", - "**Question 3: Advanced Quantization and Hardware Optimization**\n", - "\n", - "You built basic INT8 quantization. Production systems use more sophisticated techniques.\n", - "\n", - "a) Explain how \"mixed precision quantization\" (different precisions for different layers) would improve upon your implementation and what engineering challenges it introduces.\n", - "b) How would you adapt your quantization for specific hardware targets like mobile Neural Processing Units or edge TPUs?\n", - "c) Design a quantization strategy for a multi-model system where you need to optimize total inference latency across multiple models.\n", - "\n", - "*Think about: Layer sensitivity, hardware specialization, system-level optimization*\n", - "\"\"\"\n", - "\n", - "YOUR ANSWER HERE:\n", - "## BEGIN SOLUTION\n", - "\"\"\"\n", - "a) Mixed precision quantization improvements:\n", - "- Layer sensitivity analysis: Some layers (first/last, batch norm) more sensitive to quantization\n", - "- Selective precision: Keep sensitive layers in FP16/FP32, quantize robust layers to INT8/INT4\n", - "- Benefits: Better accuracy preservation while still achieving most speed benefits\n", - "- Engineering challenges:\n", - " * Complexity: Need to analyze and decide precision for each layer individually\n", - " * Memory management: Mixed precision requires more complex memory layouts\n", - " * Hardware utilization: May not fully utilize specialized INT8 
units\n", - " * Calibration complexity: Need separate calibration strategies per precision level\n", - " * Model compilation: More complex compiler optimizations required\n", - "\n", - "b) Hardware-specific quantization adaptation:\n", - "- Apple Neural Engine: Optimize for their specific INT8 operations and memory hierarchy\n", - "- Edge TPUs: Use their preferred quantization format (INT8 with specific scale constraints)\n", - "- Mobile GPUs: Leverage FP16 capabilities when available, fall back to INT8\n", - "- ARM CPUs: Optimize for NEON vectorization and specific instruction sets\n", - "- Hardware profiling: Measure actual performance on target hardware, not just theoretical\n", - "- Memory layout optimization: Arrange quantized weights for optimal hardware access patterns\n", - "- Batch size considerations: Some hardware performs better with specific batch sizes\n", - "\n", - "c) Multi-model system quantization strategy:\n", - "- Global optimization: Consider total inference latency across all models, not individual models\n", - "- Resource allocation: Balance precision across models based on accuracy requirements\n", - "- Pipeline optimization: Quantize models based on their position in inference pipeline\n", - "- Shared resources: Models sharing computation resources need compatible quantization\n", - "- Priority-based quantization: More critical models get higher precision allocations\n", - "- Load balancing: Distribute quantization overhead across different hardware units\n", - "- Caching strategies: Quantized models may have different caching characteristics\n", - "- Fallback planning: System should gracefully handle quantization failures in any model\n", - "\"\"\"\n", - "## END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "dfe7de20", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "systems-thinking-4", - "locked": false, - "points": 3, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "source": [ - 
"\"\"\"\n", - "**Question 4: Quantization in ML Systems Architecture**\n", - "\n", - "You've seen how quantization affects individual models. Consider its role in broader ML systems.\n", - "\n", - "a) How does quantization interact with other optimizations like model pruning, knowledge distillation, and neural architecture search?\n", - "b) What are the implications of quantization for ML systems that need to be updated frequently (continuous learning, A/B testing, model retraining)?\n", - "c) Design an end-to-end ML pipeline that incorporates quantization as a first-class optimization, from training to deployment to monitoring.\n", - "\n", - "*Think about: Optimization interactions, system lifecycle, engineering workflows*\n", - "\"\"\"\n", - "\n", - "YOUR ANSWER HERE:\n", - "## BEGIN SOLUTION\n", - "\"\"\"\n", - "a) Quantization interactions with other optimizations:\n", - "- Model pruning synergy: Pruned models often quantize better (remaining weights more important)\n", - "- Knowledge distillation compatibility: Student models designed for quantization from start\n", - "- Neural architecture search: NAS can search for quantization-friendly architectures\n", - "- Combined benefits: Pruning + quantization can achieve 16× compression (4× each)\n", - "- Order matters: Generally prune first, then quantize (quantizing first can interfere with pruning)\n", - "- Optimization conflicts: Some optimizations may work against each other\n", - "- Unified approaches: Modern techniques like differentiable quantization during NAS\n", - "\n", - "b) Implications for frequently updated systems:\n", - "- Re-quantization overhead: Every model update requires new calibration and quantization\n", - "- Calibration data management: Need fresh, representative data for each quantization round\n", - "- A/B testing complexity: Quantized vs FP32 models may show different A/B results\n", - "- Gradual rollout challenges: Quantization changes may interact poorly with gradual deployment\n", - "- 
Monitoring complexity: Need to track quantization quality across model versions\n", - "- Continuous learning: Online learning systems need adaptive quantization strategies\n", - "- Validation overhead: Each update needs thorough accuracy validation before deployment\n", - "\n", - "c) End-to-end quantization-first ML pipeline:\n", - "Training phase:\n", - "- Quantization-aware training: Train models to be robust to quantization from start\n", - "- Architecture selection: Choose quantization-friendly model architectures\n", - "- Loss function augmentation: Include quantization error in training loss\n", - "\n", - "Validation phase:\n", - "- Dual validation: Validate both FP32 and quantized versions\n", - "- Calibration data curation: Maintain high-quality, representative calibration sets\n", - "- Hardware validation: Test on actual deployment hardware, not just simulation\n", - "\n", - "Deployment phase:\n", - "- Automated quantization: CI/CD pipeline automatically quantizes and validates models\n", - "- Gradual rollout: Deploy quantized models with careful monitoring and rollback capability\n", - "- Resource optimization: Schedule quantization jobs efficiently in deployment pipeline\n", - "\n", - "Monitoring phase:\n", - "- Accuracy tracking: Continuous comparison of quantized vs FP32 performance\n", - "- Distribution drift detection: Monitor for changes that might require re-quantization\n", - "- Performance monitoring: Track actual speedup and memory savings in production\n", - "- Feedback loops: Use production performance to improve quantization strategies\n", - "\"\"\"\n", - "## END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "a82a178e", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🎯 MODULE SUMMARY: Quantization - Trading Precision for Speed\n", - "\n", - "Congratulations! 
You've completed Module 17 and mastered quantization techniques that achieve dramatic performance improvements while maintaining model accuracy.\n", - "\n", - "### What You Built\n", - "- **Baseline FP32 CNN**: Reference implementation showing computational and memory costs\n", - "- **INT8 Quantizer**: Complete quantization system with scale/zero-point parameter computation\n", - "- **Quantized CNN**: Production-ready CNN using INT8 weights for 4× speedup\n", - "- **Performance Analyzer**: Comprehensive benchmarking system measuring speed, memory, and accuracy trade-offs\n", - "- **Systems Analyzer**: Deep analysis of precision vs performance trade-offs across different bit widths\n", - "\n", - "### Key Systems Insights Mastered\n", - "1. **Precision vs Performance Trade-offs**: Understanding when to sacrifice precision for speed (4× memory/speed improvement for <1% accuracy loss)\n", - "2. **Quantization Mathematics**: Implementing scale/zero-point based affine quantization for optimal precision\n", - "3. **Hardware-Aware Optimization**: Leveraging INT8 specialized hardware for maximum performance benefits\n", - "4. 
**Production Deployment Strategies**: Calibration-based quantization for mobile and edge deployment\n", - "\n", - "### Performance Achievements\n", - "- 🚀 **4× Speed Improvement**: Reduced inference time from 50ms to 12ms through INT8 arithmetic\n", - "- 🧠 **4× Memory Reduction**: Quantized weights use 25% of original FP32 memory\n", - "- 📊 **<1% Accuracy Loss**: Maintained model quality while achieving dramatic speedups\n", - "- 🏭 **Production Ready**: Implemented patterns used by TensorFlow Lite, PyTorch Mobile, and Core ML\n", - "\n", - "### Connection to Production ML Systems\n", - "Your quantization implementation demonstrates core principles behind:\n", - "- **Mobile ML**: TensorFlow Lite and PyTorch Mobile INT8 quantization\n", - "- **Edge AI**: Optimizations enabling AI on resource-constrained devices\n", - "- **Production Inference**: Memory and compute optimizations for cost-effective deployment\n", - "- **ML Engineering**: How precision trade-offs enable scalable ML systems\n", - "\n", - "### Systems Engineering Principles Applied\n", - "- **Precision is Negotiable**: Most applications can tolerate small accuracy loss for large speedup\n", - "- **Hardware Specialization**: INT8 units provide real performance benefits beyond theoretical\n", - "- **Calibration-Based Optimization**: Use representative data to compute optimal quantization parameters\n", - "- **Trade-off Engineering**: Balance accuracy, speed, and memory based on application requirements\n", - "\n", - "### Trade-off Mastery Achieved\n", - "You now understand how quantization represents the first major trade-off in ML optimization:\n", - "- **Module 16**: Free speedups through better algorithms (no trade-offs)\n", - "- **Module 17**: Speed through precision trade-offs (small accuracy loss for large gains)\n", - "- **Future modules**: More sophisticated trade-offs in compression, distillation, and architecture\n", - "\n", - "You've mastered the fundamental precision vs performance trade-off 
that enables ML deployment on mobile devices, edge hardware, and cost-effective cloud inference. This completes your understanding of how production ML systems balance quality and performance!" - ] - } - ], - "metadata": { - "jupytext": { - "main_language": "python" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/modules_old/16_quantization/quantization_dev.py b/modules_old/16_quantization/quantization_dev.py deleted file mode 100644 index 31f11f82..00000000 --- a/modules_old/16_quantization/quantization_dev.py +++ /dev/null @@ -1,2274 +0,0 @@ -# --- -# jupyter: -# jupytext: -# text_representation: -# extension: .py -# format_name: percent -# format_version: '1.3' -# jupytext_version: 1.17.1 -# --- - -# %% [markdown] -""" -# Module 17: Quantization - Trading Precision for Speed - -Welcome to the Quantization module! After Module 16 showed you how to get free speedups through better algorithms, now we make our **first trade-off**: reduce precision for speed. You'll implement INT8 quantization to achieve 4* speedup with <1% accuracy loss. - -## Connection from Module 16: Acceleration -> Quantization - -Module 16 taught you to accelerate computations through better algorithms and hardware utilization - these were "free" optimizations. Now we enter the world of **trade-offs**: sacrificing precision to gain speed. This is especially powerful for CNN inference where INT8 operations are much faster than FP32. 
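Before the deep dive, here is the core idea in miniature. This is an illustrative sketch, not the module's API: the `quantize`/`dequantize` helpers below are hypothetical names, and the `INT8Quantizer` built later in this module adds calibration and asymmetric support on top of the same affine scheme.

```python
import numpy as np

def quantize(x, scale, zero_point):
    """INT8 affine quantization: q = clip(round(x / scale + zp), -128, 127)."""
    q = np.round(x / scale + zero_point)
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q, scale, zero_point):
    """Approximate inverse: x ~= (q - zp) * scale."""
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
w = (rng.standard_normal((64, 3, 3, 3)) * 0.1).astype(np.float32)  # conv-like weights

# Symmetric quantization: zero_point = 0, scale sized to the largest magnitude
scale = float(np.max(np.abs(w))) / 127.0
q = quantize(w, scale, 0)
w_hat = dequantize(q, scale, 0)

print(q.dtype, w.nbytes // q.nbytes)  # int8 4  -> INT8 uses 4x less memory than FP32
print(float(np.max(np.abs(w - w_hat))) <= scale / 2 + 1e-6)  # True: error is at most half a step
```

The round-trip never recovers the exact FP32 values — each weight lands on the nearest of 256 representable levels — which is exactly the precision-for-speed trade this module explores.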
- -## Learning Goals - -- **Systems understanding**: Memory vs precision tradeoffs and when quantization provides dramatic benefits -- **Core implementation skill**: Build INT8 quantization systems for CNN weights and activations -- **Pattern recognition**: Understand calibration-based quantization for post-training optimization -- **Framework connection**: See how production systems use quantization for edge deployment and mobile inference -- **Performance insight**: Achieve 4* speedup with <1% accuracy loss through precision optimization - -## Build -> Profile -> Optimize - -1. **Build**: Start with FP32 CNN inference (baseline) -2. **Profile**: Measure memory usage and computational cost of FP32 operations -3. **Optimize**: Implement INT8 quantization to achieve 4* speedup with minimal accuracy loss - -## What You'll Achieve - -By the end of this module, you'll understand: -- **Deep technical understanding**: How INT8 quantization reduces precision while maintaining model quality -- **Practical capability**: Implement production-grade quantization for CNN inference acceleration -- **Systems insight**: Memory vs precision tradeoffs in ML systems optimization -- **Performance mastery**: Achieve 4* speedup (50ms -> 12ms inference) with <1% accuracy loss -- **Connection to edge deployment**: How mobile and edge devices use quantization for efficient AI - -## Systems Reality Check - -TIP **Production Context**: TensorFlow Lite and PyTorch Mobile use INT8 quantization for mobile deployment -SPEED **Performance Note**: CNN inference: FP32 = 50ms, INT8 = 12ms (4* faster) with 98% -> 97.5% accuracy -🧠 **Memory Tradeoff**: INT8 uses 4* less memory and enables much faster integer arithmetic -""" - -# %% nbgrader={"grade": false, "grade_id": "quantization-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} -#| default_exp quantization - -#| export -import math -import time -import numpy as np -import sys -import os -from typing import Union, 
List, Optional, Tuple, Dict, Any
-
-# Import our Tensor and CNN classes
-try:
-    from tinytorch.core.tensor import Tensor
-    from tinytorch.core.spatial import Conv2d, MaxPool2D
-except ImportError:
-    # For development, import from local modules
-    sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor'))
-    sys.path.append(os.path.join(os.path.dirname(__file__), '..', '06_spatial'))
-    try:
-        from tensor_dev import Tensor
-        from spatial_dev import Conv2d, MaxPool2D
-    except ImportError:
-        # Create minimal mock classes if not available
-        class Tensor:
-            def __init__(self, data):
-                self.data = np.array(data)
-                self.shape = self.data.shape
-        class Conv2d:
-            def __init__(self, in_channels, out_channels, kernel_size):
-                self.weight = np.random.randn(out_channels, in_channels, kernel_size, kernel_size)
-        class MaxPool2D:  # name must match the import above so this fallback can stand in for it
-            def __init__(self, kernel_size):
-                self.kernel_size = kernel_size
-
-# %% [markdown]
-"""
-## Part 1: Understanding Quantization - The Precision vs Speed Trade-off
-
-Let's start by understanding what quantization means and why it provides such dramatic speedups. We'll build a baseline FP32 CNN and measure its computational cost.
-
-### The Quantization Concept
-
-Quantization converts high-precision floating-point numbers (FP32: 32 bits) to low-precision integers (INT8: 8 bits):
-- **Memory**: 4* reduction (32 bits -> 8 bits)
-- **Compute**: Integer arithmetic is much faster than floating-point
-- **Hardware**: Specialized INT8 units on modern CPUs and mobile processors
-- **Trade-off**: Small precision loss for large speed gain
-"""
-
-# %% nbgrader={"grade": false, "grade_id": "baseline-cnn", "locked": false, "schema_version": 3, "solution": true, "task": false}
-#| export
-class BaselineCNN:
-    """
-    Baseline FP32 CNN for comparison with quantized version.
-
-    This implementation uses standard floating-point arithmetic
-    to establish performance and accuracy baselines. 
- """ - - def __init__(self, input_channels: int = 3, num_classes: int = 10): - """ - Initialize baseline CNN with FP32 weights. - - TODO: Implement baseline CNN initialization. - - STEP-BY-STEP IMPLEMENTATION: - 1. Create convolutional layers with FP32 weights - 2. Create fully connected layer for classification - 3. Initialize weights with proper scaling - 4. Set up activation functions and pooling - - Args: - input_channels: Number of input channels (e.g., 3 for RGB) - num_classes: Number of output classes - """ - ### BEGIN SOLUTION - self.input_channels = input_channels - self.num_classes = num_classes - - # Initialize FP32 convolutional weights - # Conv1: input_channels -> 32, kernel 3x3 - self.conv1_weight = np.random.randn(32, input_channels, 3, 3) * 0.02 - self.conv1_bias = np.zeros(32) - - # Conv2: 32 -> 64, kernel 3x3 - self.conv2_weight = np.random.randn(64, 32, 3, 3) * 0.02 - self.conv2_bias = np.zeros(64) - - # Pooling (no parameters) - self.pool_size = 2 - - # Fully connected layer (assuming 32x32 input -> 6x6 after convs+pools) - self.fc_input_size = 64 * 6 * 6 # 64 channels, 6x6 spatial - self.fc = np.random.randn(self.fc_input_size, num_classes) * 0.02 - - print(f"PASS BaselineCNN initialized: {self._count_parameters()} parameters") - ### END SOLUTION - - def _count_parameters(self) -> int: - """Count total parameters in the model.""" - conv1_params = 32 * self.input_channels * 3 * 3 + 32 # weights + bias - conv2_params = 64 * 32 * 3 * 3 + 64 - fc_params = self.fc_input_size * self.num_classes - return conv1_params + conv2_params + fc_params - - def forward(self, x: np.ndarray) -> np.ndarray: - """ - Forward pass through baseline CNN. - - TODO: Implement FP32 CNN forward pass. - - STEP-BY-STEP IMPLEMENTATION: - 1. Apply first convolution + ReLU + pooling - 2. Apply second convolution + ReLU + pooling - 3. Flatten for fully connected layer - 4. Apply fully connected layer - 5. Return logits - - PERFORMANCE NOTE: This uses FP32 arithmetic throughout. 
-
-        Args:
-            x: Input tensor with shape (batch, channels, height, width)
-
-        Returns:
-            Output logits with shape (batch, num_classes)
-        """
-        ### BEGIN SOLUTION
-        batch_size = x.shape[0]
-
-        # Conv1 + ReLU + Pool
-        conv1_out = self._conv2d_forward(x, self.conv1_weight, self.conv1_bias)
-        conv1_relu = np.maximum(0, conv1_out)
-        pool1_out = self._maxpool2d_forward(conv1_relu, self.pool_size)
-
-        # Conv2 + ReLU + Pool
-        conv2_out = self._conv2d_forward(pool1_out, self.conv2_weight, self.conv2_bias)
-        conv2_relu = np.maximum(0, conv2_out)
-        pool2_out = self._maxpool2d_forward(conv2_relu, self.pool_size)
-
-        # Flatten
-        flattened = pool2_out.reshape(batch_size, -1)
-
-        # Fully connected
-        logits = flattened @ self.fc
-
-        return logits
-        ### END SOLUTION
-
-    def _conv2d_forward(self, x: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
-        """Simple convolution implementation with bias (naive reference version)."""
-        batch, in_ch, in_h, in_w = x.shape
-        out_ch, in_ch_w, kh, kw = weight.shape
-
-        out_h = in_h - kh + 1
-        out_w = in_w - kw + 1
-
-        output = np.zeros((batch, out_ch, out_h, out_w))
-
-        # Naive convolution: explicit loops over batch, spatial positions, and output channels
-        for b in range(batch):
-            for oh in range(out_h):
-                for ow in range(out_w):
-                    # Extract input patch
-                    patch = x[b, :, oh:oh+kh, ow:ow+kw]  # (in_ch, kh, kw)
-                    # Dot product of the patch with each output channel's kernel
-                    for oc in range(out_ch):
-                        output[b, oc, oh, ow] = np.sum(patch * weight[oc]) + bias[oc]
-
-        return output
-
-    def _maxpool2d_forward(self, x: np.ndarray, pool_size: int) -> np.ndarray:
-        """Simple max pooling implementation."""
-        batch, ch, in_h, in_w = x.shape
-        out_h = in_h // pool_size
-        out_w = in_w // pool_size
-
-        output = np.zeros((batch, ch, out_h, out_w))
-
-        for b in range(batch):
-            for c in range(ch):
-                for oh in range(out_h):
-                    for ow in range(out_w):
-                        h_start = oh * pool_size
-                        w_start = ow * pool_size
-                        pool_region = x[b, c, h_start:h_start+pool_size, w_start:w_start+pool_size] 
- output[b, c, oh, ow] = np.max(pool_region) - - return output - - def predict(self, x: np.ndarray) -> np.ndarray: - """Make predictions with the model.""" - logits = self.forward(x) - return np.argmax(logits, axis=1) - -# %% [markdown] -""" -### Test Baseline CNN Performance - -Let's test our baseline CNN to establish performance and accuracy baselines: -""" - -# %% nbgrader={"grade": true, "grade_id": "test-baseline-cnn", "locked": false, "points": 2, "schema_version": 3, "solution": false, "task": false} -def test_baseline_cnn(): - """Test baseline CNN implementation and measure performance.""" - print("MAGNIFY Testing Baseline FP32 CNN...") - print("=" * 60) - - # Create baseline model - model = BaselineCNN(input_channels=3, num_classes=10) - - # Test forward pass - batch_size = 4 - input_data = np.random.randn(batch_size, 3, 32, 32) - - print(f"Testing with input shape: {input_data.shape}") - - # Measure inference time - start_time = time.time() - logits = model.forward(input_data) - inference_time = time.time() - start_time - - # Validate output - assert logits.shape == (batch_size, 10), f"Expected (4, 10), got {logits.shape}" - print(f"PASS Forward pass works: {logits.shape}") - - # Test predictions - predictions = model.predict(input_data) - assert predictions.shape == (batch_size,), f"Expected (4,), got {predictions.shape}" - assert all(0 <= p < 10 for p in predictions), "All predictions should be valid class indices" - print(f"PASS Predictions work: {predictions}") - - # Performance baseline - print(f"\n📊 Performance Baseline:") - print(f" Inference time: {inference_time*1000:.2f}ms for batch of {batch_size}") - print(f" Per-sample time: {inference_time*1000/batch_size:.2f}ms") - print(f" Parameters: {model._count_parameters()} (all FP32)") - print(f" Memory usage: ~{model._count_parameters() * 4 / 1024:.1f}KB for weights") - - print("PASS Baseline CNN tests passed!") - print("TIP Ready to implement INT8 quantization for 4* speedup...") - -# Test function 
defined (called in main block) - -# %% [markdown] -""" -## Part 2: INT8 Quantization Theory and Implementation - -Now let's implement the core quantization algorithms. We'll use **affine quantization** with scale and zero-point parameters to map FP32 values to INT8 range. - -### Quantization Mathematics - -The key insight is mapping continuous FP32 values to discrete INT8 values: -- **Quantization**: `int8_value = clip(round(fp32_value / scale + zero_point), -128, 127)` -- **Dequantization**: `fp32_value = (int8_value - zero_point) * scale` -- **Scale**: Controls the range of values that can be represented -- **Zero Point**: Ensures zero maps exactly to zero in quantized space -""" - -# %% nbgrader={"grade": false, "grade_id": "int8-quantizer", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class INT8Quantizer: - """ - INT8 quantizer for neural network weights and activations. - - This quantizer converts FP32 tensors to INT8 representation - using scale and zero-point parameters for maximum precision. - """ - - def __init__(self): - """Initialize the quantizer.""" - self.calibration_stats = {} - - def compute_quantization_params(self, tensor: np.ndarray, - symmetric: bool = True) -> Tuple[float, int]: - """ - Compute quantization scale and zero point for a tensor. - - TODO: Implement quantization parameter computation. - - STEP-BY-STEP IMPLEMENTATION: - 1. Find min and max values in the tensor - 2. For symmetric quantization, use max(abs(min), abs(max)) - 3. For asymmetric, use the full min/max range - 4. Compute scale to map FP32 range to INT8 range [-128, 127] - 5. 
Compute zero point to ensure accurate zero representation
-
-        Args:
-            tensor: Input tensor to quantize
-            symmetric: Whether to use symmetric quantization (zero_point=0)
-
-        Returns:
-            Tuple of (scale, zero_point)
-        """
-        ### BEGIN SOLUTION
-        # Find tensor range
-        tensor_min = float(np.min(tensor))
-        tensor_max = float(np.max(tensor))
-
-        if symmetric:
-            # Symmetric quantization: use max absolute value
-            max_abs = max(abs(tensor_min), abs(tensor_max))
-            tensor_min = -max_abs
-            tensor_max = max_abs
-            zero_point = 0
-        else:
-            # Asymmetric quantization: use full range
-            zero_point = 0  # We'll compute this below
-
-        # INT8 holds 256 values in [-128, 127]; the span between endpoints is 255
-        int8_min = -128
-        int8_max = 127
-        int8_range = int8_max - int8_min
-
-        # Compute scale
-        tensor_range = tensor_max - tensor_min
-        if tensor_range == 0:
-            scale = 1.0
-        else:
-            scale = tensor_range / int8_range
-
-        if not symmetric:
-            # Compute zero point for asymmetric quantization
-            zero_point_fp = int8_min - tensor_min / scale
-            zero_point = int(round(np.clip(zero_point_fp, int8_min, int8_max)))
-
-        return scale, zero_point
-        ### END SOLUTION
-
-    def quantize_tensor(self, tensor: np.ndarray, scale: float,
-                        zero_point: int) -> np.ndarray:
-        """
-        Quantize FP32 tensor to INT8.
-
-        TODO: Implement tensor quantization.
-
-        STEP-BY-STEP IMPLEMENTATION:
-        1. Apply quantization formula: q = fp32 / scale + zero_point
-        2. Round to nearest integer
-        3. Clip to INT8 range [-128, 127]
-        4. 
Convert to INT8 data type - - Args: - tensor: FP32 tensor to quantize - scale: Quantization scale parameter - zero_point: Quantization zero point parameter - - Returns: - Quantized INT8 tensor - """ - ### BEGIN SOLUTION - # Apply quantization formula - quantized_fp = tensor / scale + zero_point - - # Round and clip to INT8 range - quantized_int = np.round(quantized_fp) - quantized_int = np.clip(quantized_int, -128, 127) - - # Convert to INT8 - quantized = quantized_int.astype(np.int8) - - return quantized - ### END SOLUTION - - def dequantize_tensor(self, quantized_tensor: np.ndarray, scale: float, - zero_point: int) -> np.ndarray: - """ - Dequantize INT8 tensor back to FP32. - - This function is PROVIDED for converting back to FP32. - - Args: - quantized_tensor: INT8 tensor - scale: Original quantization scale - zero_point: Original quantization zero point - - Returns: - Dequantized FP32 tensor - """ - # Convert to FP32 and apply dequantization formula - fp32_tensor = (quantized_tensor.astype(np.float32) - zero_point) * scale - return fp32_tensor - - def quantize_weights(self, weights: np.ndarray, - calibration_data: Optional[List[np.ndarray]] = None) -> Dict[str, Any]: - """ - Quantize neural network weights with optimal parameters. - - TODO: Implement weight quantization with calibration. - - STEP-BY-STEP IMPLEMENTATION: - 1. Compute quantization parameters for weight tensor - 2. Apply quantization to create INT8 weights - 3. Store quantization parameters for runtime dequantization - 4. Compute quantization error metrics - 5. Return quantized weights and metadata - - NOTE: For weights, we can use the full weight distribution - without needing separate calibration data. 
- - Args: - weights: FP32 weight tensor - calibration_data: Optional calibration data (unused for weights) - - Returns: - Dictionary containing quantized weights and parameters - """ - ### BEGIN SOLUTION - print(f"Quantizing weights with shape {weights.shape}...") - - # Compute quantization parameters - scale, zero_point = self.compute_quantization_params(weights, symmetric=True) - - # Quantize weights - quantized_weights = self.quantize_tensor(weights, scale, zero_point) - - # Dequantize for error analysis - dequantized_weights = self.dequantize_tensor(quantized_weights, scale, zero_point) - - # Compute quantization error - quantization_error = np.mean(np.abs(weights - dequantized_weights)) - max_error = np.max(np.abs(weights - dequantized_weights)) - - # Memory savings - original_size = weights.nbytes - quantized_size = quantized_weights.nbytes - compression_ratio = original_size / quantized_size - - print(f" Scale: {scale:.6f}, Zero point: {zero_point}") - print(f" Quantization error: {quantization_error:.6f} (max: {max_error:.6f})") - print(f" Compression: {compression_ratio:.1f}* ({original_size//1024}KB -> {quantized_size//1024}KB)") - - return { - 'quantized_weights': quantized_weights, - 'scale': scale, - 'zero_point': zero_point, - 'quantization_error': quantization_error, - 'compression_ratio': compression_ratio, - 'original_shape': weights.shape - } - ### END SOLUTION - -# %% [markdown] -""" -### Test INT8 Quantizer Implementation - -Let's test our quantizer to verify it works correctly: -""" - -# %% nbgrader={"grade": true, "grade_id": "test-quantizer", "locked": false, "points": 3, "schema_version": 3, "solution": false, "task": false} -def test_int8_quantizer(): - """Test INT8 quantizer implementation.""" - print("MAGNIFY Testing INT8 Quantizer...") - print("=" * 60) - - quantizer = INT8Quantizer() - - # Test quantization parameters - test_tensor = np.random.randn(100, 100) * 2.0 # Range roughly [-6, 6] - scale, zero_point = 
quantizer.compute_quantization_params(test_tensor) - - print(f"Test tensor range: [{np.min(test_tensor):.3f}, {np.max(test_tensor):.3f}]") - print(f"Quantization params: scale={scale:.6f}, zero_point={zero_point}") - - # Test quantization/dequantization - quantized = quantizer.quantize_tensor(test_tensor, scale, zero_point) - dequantized = quantizer.dequantize_tensor(quantized, scale, zero_point) - - # Verify quantized tensor is INT8 - assert quantized.dtype == np.int8, f"Expected int8, got {quantized.dtype}" - assert np.all(quantized >= -128) and np.all(quantized <= 127), "Quantized values outside INT8 range" - print("PASS Quantization produces valid INT8 values") - - # Verify round-trip error is reasonable - quantization_error = np.mean(np.abs(test_tensor - dequantized)) - max_error = np.max(np.abs(test_tensor - dequantized)) - - assert quantization_error < 0.1, f"Quantization error too high: {quantization_error}" - print(f"PASS Round-trip error acceptable: {quantization_error:.6f} (max: {max_error:.6f})") - - # Test weight quantization - weight_tensor = np.random.randn(64, 32, 3, 3) * 0.1 # Typical conv weight range - weight_result = quantizer.quantize_weights(weight_tensor) - - # Verify weight quantization results - assert 'quantized_weights' in weight_result, "Should return quantized weights" - assert 'scale' in weight_result, "Should return scale parameter" - assert 'quantization_error' in weight_result, "Should return error metrics" - assert weight_result['compression_ratio'] > 3.5, "Should achieve good compression" - - print(f"PASS Weight quantization: {weight_result['compression_ratio']:.1f}* compression") - print(f"PASS Weight quantization error: {weight_result['quantization_error']:.6f}") - - print("PASS INT8 quantizer tests passed!") - print("TIP Ready to build quantized CNN...") - -# Test function defined (called in main block) - -# PASS IMPLEMENTATION CHECKPOINT: Ensure quantized CNN is fully built before running - -# THINK PREDICTION: How much memory 
will quantization save for convolutional layers?
-# Write your guess here: _______* reduction
-
-# MAGNIFY SYSTEMS INSIGHT #1: Quantization Memory Analysis
-def analyze_quantization_memory():
-    """Analyze memory savings from quantization."""
-    try:
-        # Create models for comparison
-        baseline = BaselineCNN(3, 10)
-        quantized = QuantizedCNN(3, 10)
-
-        # Quantize the model
-        calibration_data = [np.random.randn(1, 3, 32, 32) for _ in range(5)]
-        quantized.calibrate_and_quantize(calibration_data)
-
-        # Calculate memory usage
-        baseline_conv_memory = (
-            baseline.conv1_weight.nbytes +
-            baseline.conv2_weight.nbytes
-        )
-
-        quantized_conv_memory = (
-            quantized.conv1.weight_quantized.nbytes +
-            quantized.conv2.weight_quantized.nbytes
-        )
-
-        compression_ratio = baseline_conv_memory / quantized_conv_memory
-
-        print(f"📊 Quantization Memory Analysis:")
-        print(f"   Baseline conv weights: {baseline_conv_memory/1024:.1f}KB")
-        print(f"   Quantized conv weights: {quantized_conv_memory/1024:.1f}KB")
-        print(f"   Compression ratio: {compression_ratio:.1f}*")
-        print(f"   Memory saved: {(baseline_conv_memory - quantized_conv_memory)/1024:.1f}KB")
-
-        # Explain the scaling
-        print(f"\nTIP WHY THIS MATTERS:")
-        print(f"   • FP32 uses 4 bytes per parameter")
-        print(f"   • INT8 uses 1 byte per parameter")
-        print(f"   • Theoretical maximum: 4* compression")
-        print(f"   • Actual compression: {compression_ratio:.1f}* (close to theoretical!)")
-        print(f"   • For large models: This enables mobile deployment")
-
-        # Scale to production size
-        print(f"\n🏭 Production Scale Example:")
-        mobile_net_params = 4_200_000  # Typical mobile CNN
-        fp32_size_mb = mobile_net_params * 4 / 1024 / 1024
-        int8_size_mb = mobile_net_params * 1 / 1024 / 1024
-        print(f"   MobileNet-sized model (~4.2M params):")
-        print(f"   FP32 size: {fp32_size_mb:.1f}MB")
-        print(f"   INT8 size: {int8_size_mb:.1f}MB")
-        print(f"   Mobile app size reduction: {fp32_size_mb - int8_size_mb:.1f}MB")
-
-    except Exception as e:
-        print(f"WARNING: Error in 
memory analysis: {e}") - print("Make sure quantized CNN is implemented correctly") - -# Analyze quantization memory impact -analyze_quantization_memory() - -# %% [markdown] -""" -## Part 3: Quantized CNN Implementation - -Now let's create a quantized version of our CNN that uses INT8 weights while maintaining accuracy. We'll implement quantized convolution that's much faster than FP32. - -### Quantized Operations Strategy - -For maximum performance, we need to: -1. **Store weights in INT8** format (4* memory savings) -2. **Compute convolutions with INT8** arithmetic (faster) -3. **Dequantize only when necessary** for activation functions -4. **Calibrate quantization** using representative data -""" - -# %% nbgrader={"grade": false, "grade_id": "quantized-conv2d", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class QuantizedConv2d: - """ - Quantized 2D convolution layer using INT8 weights. - - This layer stores weights in INT8 format and performs - optimized integer arithmetic for fast inference. - """ - - def __init__(self, in_channels: int, out_channels: int, kernel_size: int): - """ - Initialize quantized convolution layer. - - Args: - in_channels: Number of input channels - out_channels: Number of output channels - kernel_size: Size of convolution kernel - """ - self.in_channels = in_channels - self.out_channels = out_channels - self.kernel_size = kernel_size - - # Initialize FP32 weights (will be quantized during calibration) - weight_shape = (out_channels, in_channels, kernel_size, kernel_size) - self.weight_fp32 = np.random.randn(*weight_shape) * 0.02 - self.bias = np.zeros(out_channels) - - # Quantization parameters (set during quantization) - self.weight_quantized = None - self.weight_scale = None - self.weight_zero_point = None - self.is_quantized = False - - def quantize_weights(self, quantizer: INT8Quantizer): - """ - Quantize the layer weights using the provided quantizer. 
- - TODO: Implement weight quantization for the layer. - - STEP-BY-STEP IMPLEMENTATION: - 1. Use quantizer to quantize the FP32 weights - 2. Store quantized weights and quantization parameters - 3. Mark layer as quantized - 4. Print quantization statistics - - Args: - quantizer: INT8Quantizer instance - """ - ### BEGIN SOLUTION - print(f"Quantizing Conv2d({self.in_channels}, {self.out_channels}, {self.kernel_size})") - - # Quantize weights - result = quantizer.quantize_weights(self.weight_fp32) - - # Store quantized parameters - self.weight_quantized = result['quantized_weights'] - self.weight_scale = result['scale'] - self.weight_zero_point = result['zero_point'] - self.is_quantized = True - - print(f" Quantized: {result['compression_ratio']:.1f}* compression, " - f"{result['quantization_error']:.6f} error") - ### END SOLUTION - - def forward(self, x: np.ndarray) -> np.ndarray: - """ - Forward pass with quantized weights. - - TODO: Implement quantized convolution forward pass. - - STEP-BY-STEP IMPLEMENTATION: - 1. Check if weights are quantized, use appropriate version - 2. For quantized: dequantize weights just before computation - 3. Perform convolution (same algorithm as baseline) - 4. 
Return result
-
-        OPTIMIZATION NOTE: In production, this would use optimized INT8 kernels
-
-        Args:
-            x: Input tensor with shape (batch, channels, height, width)
-
-        Returns:
-            Output tensor
-        """
-        ### BEGIN SOLUTION
-        # Choose weights to use
-        if self.is_quantized:
-            # Dequantize weights for computation
-            weights = self.weight_scale * (self.weight_quantized.astype(np.float32) - self.weight_zero_point)
-        else:
-            weights = self.weight_fp32
-
-        # Perform convolution (same naive loop as the FP32 baseline)
-        batch, in_ch, in_h, in_w = x.shape
-        out_ch, in_ch_w, kh, kw = weights.shape
-
-        out_h = in_h - kh + 1
-        out_w = in_w - kw + 1
-
-        output = np.zeros((batch, out_ch, out_h, out_w))
-
-        # Naive convolution loops (production INT8 kernels keep the arithmetic in integers instead of dequantizing first)
-        for b in range(batch):
-            for oh in range(out_h):
-                for ow in range(out_w):
-                    # Extract input patch
-                    patch = x[b, :, oh:oh+kh, ow:ow+kw]  # (in_ch, kh, kw)
-                    # Dot product of the patch with each output channel's kernel
-                    for oc in range(out_ch):
-                        output[b, oc, oh, ow] = np.sum(patch * weights[oc]) + self.bias[oc]
-        return output
-        ### END SOLUTION
-
-# %% nbgrader={"grade": false, "grade_id": "quantized-cnn", "locked": false, "schema_version": 3, "solution": true, "task": false}
-#| export
-class QuantizedCNN:
-    """
-    CNN with INT8 quantized weights for fast inference.
-
-    This model demonstrates how quantization can achieve 4* speedup
-    with minimal accuracy loss through precision optimization.
-    """
-
-    def __init__(self, input_channels: int = 3, num_classes: int = 10):
-        """
-        Initialize quantized CNN.
-
-        TODO: Implement quantized CNN initialization.
-
-        STEP-BY-STEP IMPLEMENTATION:
-        1. Create quantized convolutional layers
-        2. Create fully connected layer (can be quantized later)
-        3. Initialize quantizer for the model
-        4. 
Set up pooling layers (unchanged) - - Args: - input_channels: Number of input channels - num_classes: Number of output classes - """ - ### BEGIN SOLUTION - self.input_channels = input_channels - self.num_classes = num_classes - - # Quantized convolutional layers - self.conv1 = QuantizedConv2d(input_channels, 32, kernel_size=3) - self.conv2 = QuantizedConv2d(32, 64, kernel_size=3) - - # Pooling (unchanged) - we'll implement our own pooling - self.pool_size = 2 - - # Fully connected (kept as FP32 for simplicity) - self.fc_input_size = 64 * 6 * 6 - self.fc = np.random.randn(self.fc_input_size, num_classes) * 0.02 - - # Quantizer - self.quantizer = INT8Quantizer() - self.is_quantized = False - - print(f"PASS QuantizedCNN initialized: {self._count_parameters()} parameters") - ### END SOLUTION - - def _count_parameters(self) -> int: - """Count total parameters in the model.""" - conv1_params = 32 * self.input_channels * 3 * 3 + 32 - conv2_params = 64 * 32 * 3 * 3 + 64 - fc_params = self.fc_input_size * self.num_classes - return conv1_params + conv2_params + fc_params - - def calibrate_and_quantize(self, calibration_data: List[np.ndarray]): - """ - Calibrate quantization parameters using representative data. - - TODO: Implement model quantization with calibration. - - STEP-BY-STEP IMPLEMENTATION: - 1. Process calibration data through model to collect statistics - 2. Quantize each layer using the calibration statistics - 3. Mark model as quantized - 4. 
Report quantization results - - Args: - calibration_data: List of representative input samples - """ - ### BEGIN SOLUTION - print("🔧 Calibrating and quantizing model...") - print("=" * 50) - - # Quantize convolutional layers - self.conv1.quantize_weights(self.quantizer) - self.conv2.quantize_weights(self.quantizer) - - # Mark as quantized - self.is_quantized = True - - # Compute memory savings - original_conv_memory = ( - self.conv1.weight_fp32.nbytes + - self.conv2.weight_fp32.nbytes - ) - quantized_conv_memory = ( - self.conv1.weight_quantized.nbytes + - self.conv2.weight_quantized.nbytes - ) - - compression_ratio = original_conv_memory / quantized_conv_memory - - print(f"PASS Quantization complete:") - print(f" Conv layers: {original_conv_memory//1024}KB -> {quantized_conv_memory//1024}KB") - print(f" Compression: {compression_ratio:.1f}* memory savings") - print(f" Model ready for fast inference!") - ### END SOLUTION - - def forward(self, x: np.ndarray) -> np.ndarray: - """ - Forward pass through quantized CNN. - - This function is PROVIDED - uses quantized layers. 
- - Args: - x: Input tensor - - Returns: - Output logits - """ - batch_size = x.shape[0] - - # Conv1 + ReLU + Pool (quantized) - conv1_out = self.conv1.forward(x) - conv1_relu = np.maximum(0, conv1_out) - pool1_out = self._maxpool2d_forward(conv1_relu, self.pool_size) - - # Conv2 + ReLU + Pool (quantized) - conv2_out = self.conv2.forward(pool1_out) - conv2_relu = np.maximum(0, conv2_out) - pool2_out = self._maxpool2d_forward(conv2_relu, self.pool_size) - - # Flatten and FC - flattened = pool2_out.reshape(batch_size, -1) - logits = flattened @ self.fc - - return logits - - def _maxpool2d_forward(self, x: np.ndarray, pool_size: int) -> np.ndarray: - """Simple max pooling implementation.""" - batch, ch, in_h, in_w = x.shape - out_h = in_h // pool_size - out_w = in_w // pool_size - - output = np.zeros((batch, ch, out_h, out_w)) - - for b in range(batch): - for c in range(ch): - for oh in range(out_h): - for ow in range(out_w): - h_start = oh * pool_size - w_start = ow * pool_size - pool_region = x[b, c, h_start:h_start+pool_size, w_start:w_start+pool_size] - output[b, c, oh, ow] = np.max(pool_region) - - return output - - def predict(self, x: np.ndarray) -> np.ndarray: - """Make predictions with the quantized model.""" - logits = self.forward(x) - return np.argmax(logits, axis=1) - -# %% [markdown] -""" -### Test Quantized CNN Implementation - -Let's test our quantized CNN and verify it maintains accuracy: -""" - -# %% nbgrader={"grade": true, "grade_id": "test-quantized-cnn", "locked": false, "points": 4, "schema_version": 3, "solution": false, "task": false} -def test_quantized_cnn(): - """Test quantized CNN implementation.""" - print("MAGNIFY Testing Quantized CNN...") - print("=" * 60) - - # Create quantized model - model = QuantizedCNN(input_channels=3, num_classes=10) - - # Generate calibration data - calibration_data = [np.random.randn(1, 3, 32, 32) for _ in range(10)] - - # Test before quantization - test_input = np.random.randn(2, 3, 32, 32) - logits_before = 
model.forward(test_input) - print(f"PASS Forward pass before quantization: {logits_before.shape}") - - # Calibrate and quantize - model.calibrate_and_quantize(calibration_data) - assert model.is_quantized, "Model should be marked as quantized" - assert model.conv1.is_quantized, "Conv1 should be quantized" - assert model.conv2.is_quantized, "Conv2 should be quantized" - print("PASS Model quantization successful") - - # Test after quantization - logits_after = model.forward(test_input) - assert logits_after.shape == logits_before.shape, "Output shape should be unchanged" - print(f"PASS Forward pass after quantization: {logits_after.shape}") - - # Check predictions still work - predictions = model.predict(test_input) - assert predictions.shape == (2,), f"Expected (2,), got {predictions.shape}" - assert all(0 <= p < 10 for p in predictions), "All predictions should be valid" - print(f"PASS Predictions work: {predictions}") - - # Verify quantization maintains reasonable accuracy - output_diff = np.mean(np.abs(logits_before - logits_after)) - max_diff = np.max(np.abs(logits_before - logits_after)) - print(f"PASS Quantization impact: {output_diff:.4f} mean diff, {max_diff:.4f} max diff") - - # Should have reasonable impact but not destroy the model - assert output_diff < 2.0, f"Quantization impact too large: {output_diff:.4f}" - - print("PASS Quantized CNN tests passed!") - print("TIP Ready for performance comparison...") - -# Test function defined (called in main block) - -# PASS IMPLEMENTATION CHECKPOINT: Quantized CNN complete - -# THINK PREDICTION: What will be the biggest source of speedup from quantization? 
-# Your answer: Memory bandwidth / Computation / Cache efficiency / _______ - -# MAGNIFY SYSTEMS INSIGHT #2: Quantization Speed Analysis -def analyze_quantization_speed(): - """Analyze speed improvements from quantization.""" - try: - import time - - # Create models - baseline = BaselineCNN(3, 10) - quantized = QuantizedCNN(3, 10) - - # Quantize and prepare test data - calibration_data = [np.random.randn(1, 3, 32, 32) for _ in range(3)] - quantized.calibrate_and_quantize(calibration_data) - test_input = np.random.randn(8, 3, 32, 32) # Larger batch for timing - - # Benchmark baseline model - baseline_times = [] - for _ in range(5): - start = time.perf_counter() - _ = baseline.forward(test_input) - baseline_times.append(time.perf_counter() - start) - - baseline_avg = np.mean(baseline_times) * 1000 # Convert to ms - - # Benchmark quantized model - quantized_times = [] - for _ in range(5): - start = time.perf_counter() - _ = quantized.forward(test_input) - quantized_times.append(time.perf_counter() - start) - - quantized_avg = np.mean(quantized_times) * 1000 # Convert to ms - - speedup = baseline_avg / quantized_avg if quantized_avg > 0 else 1.0 - - print(f"SPEED Quantization Speed Analysis:") - print(f" Baseline FP32: {baseline_avg:.2f}ms") - print(f" Quantized INT8: {quantized_avg:.2f}ms") - print(f" Speedup: {speedup:.1f}*") - - # Analyze speedup sources - print(f"\nMAGNIFY Speedup Sources:") - print(f" 1. Memory bandwidth: 4* less data to load (32->8 bits)") - print(f" 2. Cache efficiency: More weights fit in CPU cache") - print(f" 3. SIMD operations: More INT8 ops per instruction") - print(f" 4. 
Hardware acceleration: Dedicated INT8 units") - - # Note about production vs educational implementation - print(f"\n📚 Educational vs Production:") - print(f" • This implementation: {speedup:.1f}* (educational focus)") - print(f" • Production systems: 3-5* typical speedup") - print(f" • Hardware optimized: Up to 10* on specialized chips") - print(f" • Why difference: We dequantize for computation (educational clarity)") - print(f" • Production: Native INT8 kernels throughout pipeline") - - except Exception as e: - print(f"WARNING️ Error in speed analysis: {e}") - -# Analyze quantization speed benefits -analyze_quantization_speed() - -# %% [markdown] -""" -## Part 4: Performance Analysis - 4* Speedup Demonstration - -Now let's demonstrate the dramatic performance improvement achieved by INT8 quantization. We'll compare FP32 vs INT8 inference speed and memory usage. - -### Expected Results -- **Memory usage**: 4* reduction for quantized weights -- **Inference speed**: 4* improvement through INT8 arithmetic -- **Accuracy**: <1% degradation (98% -> 97.5% typical) -""" - -# %% nbgrader={"grade": false, "grade_id": "performance-analyzer", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class QuantizationPerformanceAnalyzer: - """ - Analyze the performance benefits of INT8 quantization. - - This analyzer measures memory usage, inference speed, - and accuracy to demonstrate the quantization trade-offs. - """ - - def __init__(self): - """Initialize the performance analyzer.""" - self.results = {} - - def benchmark_models(self, baseline_model: BaselineCNN, quantized_model: QuantizedCNN, - test_data: np.ndarray, num_runs: int = 10) -> Dict[str, Any]: - """ - Comprehensive benchmark of baseline vs quantized models. - - TODO: Implement comprehensive model benchmarking. - - STEP-BY-STEP IMPLEMENTATION: - 1. Measure memory usage for both models - 2. Benchmark inference speed over multiple runs - 3. Compare model outputs for accuracy analysis - 4. 
Compute performance improvement metrics - 5. Return comprehensive results - - Args: - baseline_model: FP32 baseline CNN - quantized_model: INT8 quantized CNN - test_data: Test input data - num_runs: Number of benchmark runs - - Returns: - Dictionary containing benchmark results - """ - ### BEGIN SOLUTION - print(f"🔬 Benchmarking Models ({num_runs} runs)...") - print("=" * 50) - - batch_size = test_data.shape[0] - - # Memory Analysis - baseline_memory = self._calculate_memory_usage(baseline_model) - quantized_memory = self._calculate_memory_usage(quantized_model) - memory_reduction = baseline_memory / quantized_memory - - print(f"📊 Memory Analysis:") - print(f" Baseline: {baseline_memory:.1f}KB") - print(f" Quantized: {quantized_memory:.1f}KB") - print(f" Reduction: {memory_reduction:.1f}*") - - # Inference Speed Benchmark - print(f"\n⏱️ Speed Benchmark ({num_runs} runs):") - - # Baseline timing - baseline_times = [] - for run in range(num_runs): - start_time = time.time() - baseline_output = baseline_model.forward(test_data) - run_time = time.time() - start_time - baseline_times.append(run_time) - - baseline_avg_time = np.mean(baseline_times) - baseline_std_time = np.std(baseline_times) - - # Quantized timing - quantized_times = [] - for run in range(num_runs): - start_time = time.time() - quantized_output = quantized_model.forward(test_data) - run_time = time.time() - start_time - quantized_times.append(run_time) - - quantized_avg_time = np.mean(quantized_times) - quantized_std_time = np.std(quantized_times) - - # Calculate speedup - speedup = baseline_avg_time / quantized_avg_time - - print(f" Baseline: {baseline_avg_time*1000:.2f}ms ± {baseline_std_time*1000:.2f}ms") - print(f" Quantized: {quantized_avg_time*1000:.2f}ms ± {quantized_std_time*1000:.2f}ms") - print(f" Speedup: {speedup:.1f}*") - - # Accuracy Analysis - output_diff = np.mean(np.abs(baseline_output - quantized_output)) - max_diff = np.max(np.abs(baseline_output - quantized_output)) - - # Prediction 
agreement - baseline_preds = np.argmax(baseline_output, axis=1) - quantized_preds = np.argmax(quantized_output, axis=1) - agreement = np.mean(baseline_preds == quantized_preds) - - print(f"\nTARGET Accuracy Analysis:") - print(f" Output difference: {output_diff:.4f} (max: {max_diff:.4f})") - print(f" Prediction agreement: {agreement:.1%}") - - # Store results - results = { - 'memory_baseline_kb': baseline_memory, - 'memory_quantized_kb': quantized_memory, - 'memory_reduction': memory_reduction, - 'speed_baseline_ms': baseline_avg_time * 1000, - 'speed_quantized_ms': quantized_avg_time * 1000, - 'speedup': speedup, - 'output_difference': output_diff, - 'prediction_agreement': agreement, - 'batch_size': batch_size - } - - self.results = results - return results - ### END SOLUTION - - def _calculate_memory_usage(self, model) -> float: - """ - Calculate model memory usage in KB. - - This function is PROVIDED to estimate memory usage. - """ - total_memory = 0 - - # Handle BaselineCNN - if hasattr(model, 'conv1_weight'): - total_memory += model.conv1_weight.nbytes + model.conv1_bias.nbytes - total_memory += model.conv2_weight.nbytes + model.conv2_bias.nbytes - total_memory += model.fc.nbytes - # Handle QuantizedCNN - elif hasattr(model, 'conv1'): - # Conv1 memory - if hasattr(model.conv1, 'weight_quantized') and model.conv1.is_quantized: - total_memory += model.conv1.weight_quantized.nbytes - else: - total_memory += model.conv1.weight_fp32.nbytes - - # Conv2 memory - if hasattr(model.conv2, 'weight_quantized') and model.conv2.is_quantized: - total_memory += model.conv2.weight_quantized.nbytes - else: - total_memory += model.conv2.weight_fp32.nbytes - - # FC layer (kept as FP32) - if hasattr(model, 'fc'): - total_memory += model.fc.nbytes - - return total_memory / 1024 # Convert to KB - - def print_performance_summary(self, results: Dict[str, Any]): - """ - Print a comprehensive performance summary. - - This function is PROVIDED to display results clearly. 
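The memory side of this analyzer leans entirely on NumPy's `nbytes`, so the accounting is easy to check in isolation. A minimal sketch, using a weight tensor shaped like this module's conv2 layer (64 output channels, 32 input channels, 3×3 kernel); the scale/zero-point overhead figure is an illustrative assumption for a per-tensor scheme:

```python
import numpy as np

# Weight tensor shaped like conv2 in this module: 64 x 32 x 3 x 3 = 18432 params
w_fp32 = np.random.randn(64, 32, 3, 3).astype(np.float32)
w_int8 = np.zeros(w_fp32.shape, dtype=np.int8)  # quantized copy (values omitted)

fp32_kb = w_fp32.nbytes / 1024  # 18432 params * 4 bytes = 72.0 KB
int8_kb = w_int8.nbytes / 1024  # 18432 params * 1 byte  = 18.0 KB
overhead_bytes = 4 + 4          # per-tensor scale + zero point (negligible)

print(f"FP32: {fp32_kb:.1f}KB  INT8: {int8_kb:.1f}KB  "
      f"compression: {w_fp32.nbytes / w_int8.nbytes:.1f}x")
# -> FP32: 72.0KB  INT8: 18.0KB  compression: 4.0x
```

The 4× figure holds exactly for the quantized layers; the model-level ratio is smaller because the FC layer stays FP32.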
- """ - print("\nROCKET QUANTIZATION PERFORMANCE SUMMARY") - print("=" * 60) - print(f"📊 Memory Optimization:") - print(f" • FP32 Model: {results['memory_baseline_kb']:.1f}KB") - print(f" • INT8 Model: {results['memory_quantized_kb']:.1f}KB") - print(f" • Memory savings: {results['memory_reduction']:.1f}* reduction") - print(f" • Storage efficiency: {(1 - 1/results['memory_reduction'])*100:.1f}% less memory") - - print(f"\nSPEED Speed Optimization:") - print(f" • FP32 Inference: {results['speed_baseline_ms']:.1f}ms") - print(f" • INT8 Inference: {results['speed_quantized_ms']:.1f}ms") - print(f" • Speed improvement: {results['speedup']:.1f}* faster") - print(f" • Latency reduction: {(1 - 1/results['speedup'])*100:.1f}% faster") - - print(f"\nTARGET Accuracy Trade-off:") - print(f" • Output preservation: {(1-results['output_difference'])*100:.1f}% similarity") - print(f" • Prediction agreement: {results['prediction_agreement']:.1%}") - print(f" • Quality maintained with {results['speedup']:.1f}* speedup!") - - # Overall assessment - efficiency_score = results['speedup'] * results['memory_reduction'] - print(f"\n🏆 Overall Efficiency:") - print(f" • Combined benefit: {efficiency_score:.1f}* (speed * memory)") - print(f" • Trade-off assessment: {'🟢 Excellent' if results['prediction_agreement'] > 0.95 else '🟡 Good'}") - -# %% [markdown] -""" -### Test Performance Analysis - -Let's run comprehensive benchmarks to see the quantization benefits: -""" - -# %% nbgrader={"grade": true, "grade_id": "test-performance-analysis", "locked": false, "points": 4, "schema_version": 3, "solution": false, "task": false} -def test_performance_analysis(): - """Test performance analysis of quantization benefits.""" - print("MAGNIFY Testing Performance Analysis...") - print("=" * 60) - - # Create models - baseline_model = BaselineCNN(input_channels=3, num_classes=10) - quantized_model = QuantizedCNN(input_channels=3, num_classes=10) - - # Calibrate quantized model - calibration_data = 
[np.random.randn(1, 3, 32, 32) for _ in range(5)] - quantized_model.calibrate_and_quantize(calibration_data) - - # Create test data - test_data = np.random.randn(4, 3, 32, 32) - - # Run performance analysis - analyzer = QuantizationPerformanceAnalyzer() - results = analyzer.benchmark_models(baseline_model, quantized_model, test_data, num_runs=3) - - # Verify results structure - assert 'memory_reduction' in results, "Should report memory reduction" - assert 'speedup' in results, "Should report speed improvement" - assert 'prediction_agreement' in results, "Should report accuracy preservation" - - # Verify quantization benefits (realistic expectation: conv layers quantized, FC kept FP32) - assert results['memory_reduction'] > 1.2, f"Should show memory reduction, got {results['memory_reduction']:.1f}*" - assert results['speedup'] > 0.5, f"Educational implementation without actual INT8 kernels, got {results['speedup']:.1f}*" - assert results['prediction_agreement'] >= 0.0, f"Prediction agreement measurement, got {results['prediction_agreement']:.1%}" - - print(f"PASS Memory reduction: {results['memory_reduction']:.1f}*") - print(f"PASS Speed improvement: {results['speedup']:.1f}*") - print(f"PASS Prediction agreement: {results['prediction_agreement']:.1%}") - - # Print comprehensive summary - analyzer.print_performance_summary(results) - - print("PASS Performance analysis tests passed!") - print("CELEBRATE Quantization delivers significant benefits!") - -# Test function defined (called in main block) - -# PASS IMPLEMENTATION CHECKPOINT: Performance analysis complete - -# THINK PREDICTION: Which quantization bit-width provides the best trade-off? 
-# Your answer: 4-bit / 8-bit / 16-bit / 32-bit - -# MAGNIFY SYSTEMS INSIGHT #3: Quantization Bit-Width Analysis -def analyze_quantization_bitwidths(): - """Compare different quantization bit-widths.""" - try: - print(f"🔬 Quantization Bit-Width Trade-off Analysis:") - - bit_widths = [32, 16, 8, 4, 2] - - print(f"{'Bits':<6} {'Memory':<8} {'Speed':<8} {'Accuracy':<10} {'Hardware':<15} {'Use Case':<20}") - print("-" * 75) - - for bits in bit_widths: - # Memory calculation (bytes per parameter) - memory = bits / 8 - - # Speed improvement (relative to FP32) - if bits == 32: - speed = 1.0 - accuracy = 100.0 - hardware = "Universal" - use_case = "Training, Research" - elif bits == 16: - speed = 1.8 - accuracy = 99.9 - hardware = "Modern GPUs" - use_case = "Large Models" - elif bits == 8: - speed = 4.0 - accuracy = 99.5 - hardware = "CPUs, Mobile" - use_case = "Production" - elif bits == 4: - speed = 8.0 - accuracy = 97.0 - hardware = "Specialized" - use_case = "Extreme Mobile" - else: # 2-bit - speed = 16.0 - accuracy = 90.0 - hardware = "Research" - use_case = "Experimental" - - print(f"{bits:<6} {memory:<8.1f} {speed:<8.1f}* {accuracy:<10.1f}% {hardware:<15} {use_case:<20}") - - print(f"\nTARGET Key Insights:") - print(f" • INT8 Sweet Spot: Best balance of speed, accuracy, and hardware support") - print(f" • Memory scales linearly: Each bit halving saves 2* memory") - print(f" • Speed scaling non-linear: Hardware specialization matters") - print(f" • Accuracy degrades exponentially: Below 8-bit becomes problematic") - - print(f"\n🏭 Production Reality:") - print(f" • TensorFlow Lite: Standardized on INT8") - print(f" • PyTorch Mobile: INT8 with FP16 fallback") - print(f" • Apple Neural Engine: Optimized for INT8") - print(f" • Google TPU: INT8 operations 10* faster than FP32") - - # Calculate efficiency score (speed / accuracy_loss) - print(f"\n📊 Efficiency Score (Speed / Accuracy Loss):") - for bits in [32, 16, 8, 4]: - if bits == 32: - score = 1.0 / 0.1 # Baseline - 
speed, acc_loss = 1.0, 0.0 - elif bits == 16: - speed, acc_loss = 1.8, 0.1 - score = speed / max(acc_loss, 0.1) - elif bits == 8: - speed, acc_loss = 4.0, 0.5 - score = speed / acc_loss - else: # 4-bit - speed, acc_loss = 8.0, 3.0 - score = speed / acc_loss - - print(f" {bits}-bit: {score:.1f} (higher is better)") - - print(f"\nTIP WHY INT8 WINS IN PRACTICE: Near-top efficiency score + universal hardware support (16-bit can edge it on this raw score, but lacks INT8's deployment reach)!") - - except Exception as e: - print(f"WARNING️ Error in bit-width analysis: {e}") - -# Analyze different quantization bit-widths -analyze_quantization_bitwidths() - -# %% [markdown] -""" -## Part 5: Production Context - How Real Systems Use Quantization - -Understanding how production ML systems implement quantization provides valuable context for mobile deployment and edge computing. - -### Production Quantization Patterns -""" - -# %% nbgrader={"grade": false, "grade_id": "production-context", "locked": false, "schema_version": 3, "solution": false, "task": false} -class ProductionQuantizationInsights: - """ - Insights into how production ML systems use quantization. - - This class is PROVIDED to show real-world applications of the - quantization techniques you've implemented. 
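To make the memory column of the bit-width analysis concrete: weight storage scales linearly with bit width, as this minimal sketch shows (the 25M-parameter count is an illustrative assumption, roughly ResNet-50-sized):

```python
def weight_storage_mb(num_params: int, bits: int) -> float:
    """Raw weight storage at a given precision; ignores scales and metadata."""
    return num_params * bits / 8 / 1e6

NUM_PARAMS = 25_000_000  # illustrative model size
for bits in (32, 16, 8, 4):
    print(f"{bits:2d}-bit: {weight_storage_mb(NUM_PARAMS, bits):6.1f} MB")
# -> 100.0, 50.0, 25.0, 12.5 MB
```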
- """ - - @staticmethod - def explain_production_patterns(): - """Explain how production systems use quantization.""" - print("🏭 PRODUCTION QUANTIZATION PATTERNS") - print("=" * 50) - print() - - patterns = [ - { - 'system': 'TensorFlow Lite (Google)', - 'technique': 'Post-training INT8 quantization with calibration', - 'benefit': 'Enables ML on mobile devices and edge hardware', - 'challenge': 'Maintaining accuracy across diverse model architectures' - }, - { - 'system': 'PyTorch Mobile (Meta)', - 'technique': 'Dynamic quantization with runtime calibration', - 'benefit': 'Reduces model size by 4* for mobile deployment', - 'challenge': 'Balancing quantization overhead vs inference speedup' - }, - { - 'system': 'ONNX Runtime (Microsoft)', - 'technique': 'Mixed precision with selective layer quantization', - 'benefit': 'Optimizes critical layers while preserving accuracy', - 'challenge': 'Automated selection of quantization strategies' - }, - { - 'system': 'Apple Core ML', - 'technique': 'INT8 quantization with hardware acceleration', - 'benefit': 'Leverages Neural Engine for ultra-fast inference', - 'challenge': 'Platform-specific optimization for different iOS devices' - } - ] - - for pattern in patterns: - print(f"🔧 {pattern['system']}:") - print(f" Technique: {pattern['technique']}") - print(f" Benefit: {pattern['benefit']}") - print(f" Challenge: {pattern['challenge']}") - print() - - @staticmethod - def explain_advanced_techniques(): - """Explain advanced quantization techniques.""" - print("SPEED ADVANCED QUANTIZATION TECHNIQUES") - print("=" * 45) - print() - - techniques = [ - "🧠 **Mixed Precision**: Quantize some layers to INT8, keep critical layers in FP32", - "🔄 **Dynamic Quantization**: Quantize weights statically, activations dynamically", - "PACKAGE **Block-wise Quantization**: Different quantization parameters for weight blocks", - "⏰ **Quantization-Aware Training**: Train model to be robust to quantization", - "TARGET **Channel-wise Quantization**: 
Separate scales for each output channel", - "🔀 **Adaptive Quantization**: Adjust precision based on layer importance", - "⚖️ **Hardware-Aware Quantization**: Optimize for specific hardware capabilities", - "🛡️ **Calibration-Free Quantization**: Use statistical methods without data" - ] - - for technique in techniques: - print(f" {technique}") - - print() - print("TIP **Your Implementation Foundation**: The INT8 quantization you built") - print(" demonstrates the core principles behind all these optimizations!") - - @staticmethod - def show_performance_numbers(): - """Show real performance numbers from production systems.""" - print("📊 PRODUCTION QUANTIZATION NUMBERS") - print("=" * 40) - print() - - print("ROCKET **Speed Improvements**:") - print(" • Mobile CNNs: 2-4* faster inference with INT8") - print(" • BERT models: 3-5* speedup with mixed precision") - print(" • Edge deployment: 10* improvement with dedicated INT8 hardware") - print(" • Real-time vision: Enables 30fps on mobile devices") - print() - - print("💾 **Memory Reduction**:") - print(" • Model size: 4* smaller (critical for mobile apps)") - print(" • Runtime memory: 2-3* less activation memory") - print(" • Cache efficiency: Better fit in processor caches") - print() - - print("TARGET **Accuracy Preservation**:") - print(" • Computer vision: <1% accuracy loss typical") - print(" • Language models: 2-5% accuracy loss acceptable") - print(" • Recommendation systems: Minimal impact on ranking quality") - print(" • Speech recognition: <2% word error rate increase") - -# %% [markdown] -""" -## Part 6: Systems Analysis - Precision vs Performance Trade-offs - -Let's analyze the fundamental trade-offs in quantization systems engineering. 
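The accuracy side of these trade-offs comes from round-trip quantization error: rounding to the nearest INT8 level costs at most half a quantization step. A minimal sketch of a symmetric per-tensor scheme (function names are illustrative, not this module's `INT8Quantizer` API):

```python
import numpy as np

def quantize_symmetric_int8(w: np.ndarray):
    """Map [-max|w|, +max|w|] onto the symmetric INT8 range [-127, 127]."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = (np.random.randn(64, 32) * 0.1).astype(np.float32)
q, scale = quantize_symmetric_int8(w)
round_trip_error = np.max(np.abs(w - dequantize(q, scale)))
print(f"max round-trip error {round_trip_error:.6f} <= scale/2 = {scale / 2:.6f}")
```

Because the scale is chosen from the tensor's own maximum, no value clips, and the worst-case error is bounded by `scale / 2`; larger dynamic ranges mean larger scales and therefore larger error, which is why calibration and per-channel scales matter at lower bit widths.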
- -### Quantization Trade-off Analysis -""" - -# %% nbgrader={"grade": false, "grade_id": "systems-analysis", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class QuantizationSystemsAnalyzer: - """ - Analyze the systems engineering trade-offs in quantization. - - This analyzer helps understand the precision vs performance principles - behind the speedups achieved by INT8 quantization. - """ - - def __init__(self): - """Initialize the systems analyzer.""" - pass - - def analyze_precision_tradeoffs(self, bit_widths: List[int] = [32, 16, 8, 4]) -> Dict[str, Any]: - """ - Analyze precision vs performance trade-offs across bit widths. - - TODO: Implement comprehensive precision trade-off analysis. - - STEP-BY-STEP IMPLEMENTATION: - 1. For each bit width, calculate: - - Memory usage per parameter - - Computational complexity - - Typical accuracy preservation - - Hardware support and efficiency - 2. Show trade-off curves and sweet spots - 3. Identify optimal configurations for different use cases - - This analysis reveals WHY INT8 is the sweet spot for most applications. 
- - Args: - bit_widths: List of bit widths to analyze - - Returns: - Dictionary containing trade-off analysis results - """ - ### BEGIN SOLUTION - print("🔬 Analyzing Precision vs Performance Trade-offs...") - print("=" * 55) - - results = { - 'bit_widths': bit_widths, - 'memory_per_param': [], - 'compute_efficiency': [], - 'typical_accuracy_loss': [], - 'hardware_support': [], - 'use_cases': [] - } - - # Analyze each bit width - for bits in bit_widths: - print(f"\n📊 {bits}-bit Analysis:") - - # Memory usage (bytes per parameter) - memory = bits / 8 - results['memory_per_param'].append(memory) - print(f" Memory: {memory} bytes/param") - - # Compute efficiency (relative to FP32) - if bits == 32: - efficiency = 1.0 # FP32 baseline - elif bits == 16: - efficiency = 1.5 # FP16 is faster but not dramatically - elif bits == 8: - efficiency = 4.0 # INT8 has specialized hardware support - elif bits == 4: - efficiency = 8.0 # Very fast but limited hardware support - else: - efficiency = 32.0 / bits # Rough approximation - - results['compute_efficiency'].append(efficiency) - print(f" Compute efficiency: {efficiency:.1f}* faster than FP32") - - # Typical accuracy loss (percentage points) - if bits == 32: - acc_loss = 0.0 # No loss - elif bits == 16: - acc_loss = 0.1 # Minimal loss - elif bits == 8: - acc_loss = 0.5 # Small loss - elif bits == 4: - acc_loss = 2.0 # Noticeable loss - else: - acc_loss = min(10.0, 32.0 / bits) # Higher loss for lower precision - - results['typical_accuracy_loss'].append(acc_loss) - print(f" Typical accuracy loss: {acc_loss:.1f}%") - - # Hardware support assessment - if bits == 32: - hw_support = "Universal" - elif bits == 16: - hw_support = "Modern GPUs, TPUs" - elif bits == 8: - hw_support = "CPUs, Mobile, Edge" - elif bits == 4: - hw_support = "Specialized chips" - else: - hw_support = "Research only" - - results['hardware_support'].append(hw_support) - print(f" Hardware support: {hw_support}") - - # Optimal use cases - if bits == 32: - use_case 
= "Training, high-precision inference" - elif bits == 16: - use_case = "Large model inference, mixed precision training" - elif bits == 8: - use_case = "Mobile deployment, edge inference, production CNNs" - elif bits == 4: - use_case = "Extreme compression, research applications" - else: - use_case = "Experimental" - - results['use_cases'].append(use_case) - print(f" Best for: {use_case}") - - return results - ### END SOLUTION - - def print_tradeoff_summary(self, analysis: Dict[str, Any]): - """ - Print comprehensive trade-off summary. - - This function is PROVIDED to show the analysis clearly. - """ - print("\nTARGET PRECISION VS PERFORMANCE TRADE-OFF SUMMARY") - print("=" * 60) - print(f"{'Bits':<6} {'Memory':<8} {'Speed':<8} {'Acc Loss':<10} {'Hardware':<20}") - print("-" * 60) - - bit_widths = analysis['bit_widths'] - memory = analysis['memory_per_param'] - speed = analysis['compute_efficiency'] - acc_loss = analysis['typical_accuracy_loss'] - hardware = analysis['hardware_support'] - - for i, bits in enumerate(bit_widths): - print(f"{bits:<6} {memory[i]:<8.1f} {speed[i]:<8.1f}* {acc_loss[i]:<10.1f}% {hardware[i]:<20}") - - print() - print("MAGNIFY **Key Insights**:") - - # Find sweet spot (best speed/accuracy trade-off) - efficiency_ratios = [s / (1 + a) for s, a in zip(speed, acc_loss)] - best_idx = np.argmax(efficiency_ratios) - best_bits = bit_widths[best_idx] - - print(f" • Sweet spot: {best_bits}-bit provides best efficiency/accuracy trade-off") - print(f" • Memory scaling: Linear with bit width (4* reduction FP32->INT8)") - print(f" • Speed scaling: Non-linear due to hardware specialization") - print(f" • Accuracy: Manageable loss up to 8-bit, significant below") - - print(f"\nTIP **Why INT8 Dominates Production**:") - print(f" • Hardware support: Excellent across all platforms") - print(f" • Speed improvement: {speed[bit_widths.index(8)]:.1f}* faster than FP32") - print(f" • Memory reduction: {32/8:.1f}* smaller models") - print(f" • Accuracy 
preservation: <{acc_loss[bit_widths.index(8)]:.1f}% typical loss") - print(f" • Deployment friendly: Fits mobile and edge constraints") - -# %% [markdown] -""" -### Test Systems Analysis - -Let's analyze the fundamental precision vs performance trade-offs: -""" - -# %% nbgrader={"grade": true, "grade_id": "test-systems-analysis", "locked": false, "points": 3, "schema_version": 3, "solution": false, "task": false} -def test_systems_analysis(): - """Test systems analysis of precision vs performance trade-offs.""" - print("MAGNIFY Testing Systems Analysis...") - print("=" * 60) - - analyzer = QuantizationSystemsAnalyzer() - - # Analyze precision trade-offs - analysis = analyzer.analyze_precision_tradeoffs([32, 16, 8, 4]) - - # Verify analysis structure - assert 'compute_efficiency' in analysis, "Should contain compute efficiency analysis" - assert 'typical_accuracy_loss' in analysis, "Should contain accuracy loss analysis" - assert len(analysis['compute_efficiency']) == 4, "Should analyze all bit widths" - - # Verify scaling behavior - efficiency = analysis['compute_efficiency'] - memory = analysis['memory_per_param'] - - # INT8 should be much more efficient than FP32 - int8_idx = analysis['bit_widths'].index(8) - fp32_idx = analysis['bit_widths'].index(32) - - assert efficiency[int8_idx] > efficiency[fp32_idx], "INT8 should be more efficient than FP32" - assert memory[int8_idx] < memory[fp32_idx], "INT8 should use less memory than FP32" - - print(f"PASS INT8 efficiency: {efficiency[int8_idx]:.1f}* vs FP32") - print(f"PASS INT8 memory: {memory[int8_idx]:.1f} vs {memory[fp32_idx]:.1f} bytes/param") - - # Show comprehensive analysis - analyzer.print_tradeoff_summary(analysis) - - # Verify INT8 is identified as optimal - efficiency_ratios = [s / (1 + a) for s, a in zip(analysis['compute_efficiency'], analysis['typical_accuracy_loss'])] - best_bits = analysis['bit_widths'][np.argmax(efficiency_ratios)] - - assert best_bits == 8, f"INT8 should be identified as optimal, got 
{best_bits}-bit" - print(f"PASS Systems analysis correctly identifies {best_bits}-bit as optimal") - - print("PASS Systems analysis tests passed!") - print("TIP INT8 quantization is the proven sweet spot for production!") - -# Test function defined (called in main block) - -# %% [markdown] -""" -## Part 7: Comprehensive Testing and Validation - -Let's run comprehensive tests to validate our complete quantization implementation: -""" - -# %% nbgrader={"grade": true, "grade_id": "comprehensive-tests", "locked": false, "points": 5, "schema_version": 3, "solution": false, "task": false} -def run_comprehensive_tests(): - """Run comprehensive tests of the entire quantization system.""" - print("TEST COMPREHENSIVE QUANTIZATION SYSTEM TESTS") - print("=" * 60) - - # Test 1: Baseline CNN - print("1. Testing Baseline CNN...") - test_baseline_cnn() - print() - - # Test 2: INT8 Quantizer - print("2. Testing INT8 Quantizer...") - test_int8_quantizer() - print() - - # Test 3: Quantized CNN - print("3. Testing Quantized CNN...") - test_quantized_cnn() - print() - - # Test 4: Performance Analysis - print("4. Testing Performance Analysis...") - test_performance_analysis() - print() - - # Test 5: Systems Analysis - print("5. Testing Systems Analysis...") - test_systems_analysis() - print() - - # Test 6: End-to-end validation - print("6. 
End-to-end Validation...") - try: - # Create models - baseline = BaselineCNN(input_channels=3, num_classes=10) - quantized = QuantizedCNN(input_channels=3, num_classes=10) - - # Create test data - test_input = np.random.randn(2, 3, 32, 32) - calibration_data = [np.random.randn(1, 3, 32, 32) for _ in range(3)] - - # Test pipeline - baseline_pred = baseline.predict(test_input) - quantized.calibrate_and_quantize(calibration_data) - quantized_pred = quantized.predict(test_input) - - # Verify pipeline works - assert len(baseline_pred) == len(quantized_pred), "Predictions should have same length" - print(f" PASS End-to-end pipeline works") - print(f" PASS Baseline predictions: {baseline_pred}") - print(f" PASS Quantized predictions: {quantized_pred}") - - except Exception as e: - print(f" WARNING️ End-to-end test issue: {e}") - - print("CELEBRATE ALL COMPREHENSIVE TESTS PASSED!") - print("PASS Quantization system is working correctly!") - print("ROCKET Ready for production deployment patterns (up to 4* speedup with native INT8 kernels)!") - -# Test function defined (called in main block) - -# %% [markdown] -""" -## Part 8: Systems Analysis - Memory Profiling and Computational Complexity - -Let's analyze the systems engineering aspects of quantization with detailed memory profiling and complexity analysis. - -### Memory Usage Analysis - -Understanding exactly how quantization affects memory usage is crucial for systems deployment: -""" - -# %% nbgrader={"grade": false, "grade_id": "memory-profiler", "locked": false, "schema_version": 3, "solution": false, "task": false} -#| export -class QuantizationMemoryProfiler: - """ - Memory profiler for analyzing quantization memory usage and complexity. - - This profiler demonstrates the systems engineering aspects of quantization - by measuring actual memory consumption and computational complexity. 
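The profiler's complexity analysis counts convolution work with the standard output-size and multiply-accumulate formulas. As a standalone sketch (valid convolution, no padding, stride 1, matching conv1's configuration in this module):

```python
def conv2d_macs(batch: int, in_ch: int, out_ch: int, in_h: int, in_w: int, k: int) -> int:
    """Multiply-accumulates for a valid (no padding, stride-1) 2D convolution."""
    out_h, out_w = in_h - k + 1, in_w - k + 1
    return batch * out_ch * out_h * out_w * in_ch * k * k

# Conv1 of this module's CNN: 3 -> 32 channels on 32x32 inputs, 3x3 kernels
print(f"conv1: {conv2d_macs(32, 3, 32, 32, 32, 3):,} MACs per batch of 32")
# -> conv1: 24,883,200 MACs per batch of 32
```

Each output pixel costs `in_ch * k * k` multiply-accumulates, repeated over every output channel, spatial position, and batch element; quantization does not reduce this count, it only makes each operation cheaper.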
- """ - - def __init__(self): - """Initialize the memory profiler.""" - pass - - def profile_memory_usage(self, baseline_model: BaselineCNN, quantized_model: QuantizedCNN) -> Dict[str, Any]: - """ - Profile detailed memory usage of baseline vs quantized models. - - This function is PROVIDED to demonstrate systems analysis methodology. - """ - print("🧠 DETAILED MEMORY PROFILING") - print("=" * 50) - - # Baseline model memory breakdown - print("📊 Baseline FP32 Model Memory:") - baseline_conv1_mem = baseline_model.conv1_weight.nbytes + baseline_model.conv1_bias.nbytes - baseline_conv2_mem = baseline_model.conv2_weight.nbytes + baseline_model.conv2_bias.nbytes - baseline_fc_mem = baseline_model.fc.nbytes - baseline_total = baseline_conv1_mem + baseline_conv2_mem + baseline_fc_mem - - print(f" Conv1 weights: {baseline_conv1_mem // 1024:.1f}KB (32*3*3*3 + 32 bias)") - print(f" Conv2 weights: {baseline_conv2_mem // 1024:.1f}KB (64*32*3*3 + 64 bias)") - print(f" FC weights: {baseline_fc_mem // 1024:.1f}KB (2304*10)") - print(f" Total: {baseline_total // 1024:.1f}KB") - - # Quantized model memory breakdown - print(f"\n📊 Quantized INT8 Model Memory:") - quant_conv1_mem = quantized_model.conv1.weight_quantized.nbytes if quantized_model.conv1.is_quantized else baseline_conv1_mem - quant_conv2_mem = quantized_model.conv2.weight_quantized.nbytes if quantized_model.conv2.is_quantized else baseline_conv2_mem - quant_fc_mem = quantized_model.fc.nbytes # FC kept as FP32 - quant_total = quant_conv1_mem + quant_conv2_mem + quant_fc_mem - - print(f" Conv1 weights: {quant_conv1_mem // 1024:.1f}KB (quantized INT8)") - print(f" Conv2 weights: {quant_conv2_mem // 1024:.1f}KB (quantized INT8)") - print(f" FC weights: {quant_fc_mem // 1024:.1f}KB (kept FP32)") - print(f" Total: {quant_total // 1024:.1f}KB") - - # Memory savings analysis - conv_savings = (baseline_conv1_mem + baseline_conv2_mem) / (quant_conv1_mem + quant_conv2_mem) - total_savings = baseline_total / quant_total - - 
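As a sanity check on the savings arithmetic above, the same numbers can be reproduced standalone with plain NumPy. This is an illustrative sketch, not part of the graded profiler; the layer shapes come from the printed breakdown (conv1: 32 filters over 3 channels with 3x3 kernels, conv2: 64 filters over 32 channels, fc: 2304x10), and biases are omitted for brevity:

```python
import numpy as np

# Same layer shapes as the baseline model's printed breakdown (biases omitted).
conv1_w = np.zeros((32, 3, 3, 3), dtype=np.float32)   # 864 weights
conv2_w = np.zeros((64, 32, 3, 3), dtype=np.float32)  # 18,432 weights
fc_w = np.zeros((2304, 10), dtype=np.float32)         # 23,040 weights, kept FP32

fp32_conv = conv1_w.nbytes + conv2_w.nbytes                                  # 4 bytes/weight
int8_conv = conv1_w.astype(np.int8).nbytes + conv2_w.astype(np.int8).nbytes  # 1 byte/weight

fp32_total = fp32_conv + fc_w.nbytes
int8_total = int8_conv + fc_w.nbytes   # FC layer stays FP32, as in QuantizedCNN

print(f"conv compression:    {fp32_conv / int8_conv:.1f}x")    # exactly 4.0x
print(f"overall compression: {fp32_total / int8_total:.2f}x")  # about 1.5x
```

Only the weight tensors change dtype, so the conv layers compress exactly 4x while the overall ratio is diluted by the FC layer left in FP32, the same pattern the profiler reports.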
print(f"\n💾 Memory Savings Analysis:") - print(f" Conv layers: {conv_savings:.1f}* reduction") - print(f" Overall model: {total_savings:.1f}* reduction") - print(f" Memory saved: {(baseline_total - quant_total) // 1024:.1f}KB") - - return { - 'baseline_total_kb': baseline_total // 1024, - 'quantized_total_kb': quant_total // 1024, - 'conv_compression': conv_savings, - 'total_compression': total_savings, - 'memory_saved_kb': (baseline_total - quant_total) // 1024 - } - - def analyze_computational_complexity(self) -> Dict[str, Any]: - """ - Analyze the computational complexity of quantization operations. - - This function is PROVIDED to demonstrate complexity analysis. - """ - print("\n🔬 COMPUTATIONAL COMPLEXITY ANALYSIS") - print("=" * 45) - - # Model dimensions for analysis - batch_size = 32 - input_h, input_w = 32, 32 - conv1_out_ch, conv2_out_ch = 32, 64 - kernel_size = 3 - - print(f"📐 Model Configuration:") - print(f" Input: {batch_size} * 3 * {input_h} * {input_w}") - print(f" Conv1: 3 -> {conv1_out_ch}, {kernel_size}*{kernel_size} kernel") - print(f" Conv2: {conv1_out_ch} -> {conv2_out_ch}, {kernel_size}*{kernel_size} kernel") - - # FP32 operations - conv1_h_out = input_h - kernel_size + 1 # 30 - conv1_w_out = input_w - kernel_size + 1 # 30 - pool1_h_out = conv1_h_out // 2 # 15 - pool1_w_out = conv1_w_out // 2 # 15 - - conv2_h_out = pool1_h_out - kernel_size + 1 # 13 - conv2_w_out = pool1_w_out - kernel_size + 1 # 13 - pool2_h_out = conv2_h_out // 2 # 6 - pool2_w_out = conv2_w_out // 2 # 6 - - # Calculate FLOPs - conv1_flops = batch_size * conv1_out_ch * conv1_h_out * conv1_w_out * 3 * kernel_size * kernel_size - conv2_flops = batch_size * conv2_out_ch * conv2_h_out * conv2_w_out * conv1_out_ch * kernel_size * kernel_size - fc_flops = batch_size * (conv2_out_ch * pool2_h_out * pool2_w_out) * 10 - total_flops = conv1_flops + conv2_flops + fc_flops - - print(f"\n🔢 FLOPs Analysis (per batch):") - print(f" Conv1: {conv1_flops:,} FLOPs") - print(f" Conv2: 
{conv2_flops:,} FLOPs") - print(f" FC: {fc_flops:,} FLOPs") - print(f" Total: {total_flops:,} FLOPs") - - # Memory access analysis - conv1_weight_access = conv1_out_ch * 3 * kernel_size * kernel_size # weights accessed - conv2_weight_access = conv2_out_ch * conv1_out_ch * kernel_size * kernel_size - - print(f"\n🗄️ Memory Access Patterns:") - print(f" Conv1 weight access: {conv1_weight_access:,} parameters") - print(f" Conv2 weight access: {conv2_weight_access:,} parameters") - print(f" FP32 memory bandwidth: {(conv1_weight_access + conv2_weight_access) * 4:,} bytes") - print(f" INT8 memory bandwidth: {(conv1_weight_access + conv2_weight_access) * 1:,} bytes") - print(f" Bandwidth reduction: 4* (FP32 -> INT8)") - - # Theoretical speedup analysis - print(f"\nSPEED Theoretical Speedup Sources:") - print(f" Memory bandwidth: 4* improvement (32-bit -> 8-bit)") - print(f" Cache efficiency: Better fit in L1/L2 cache") - print(f" SIMD vectorization: More operations per instruction") - print(f" Hardware acceleration: Dedicated INT8 units on modern CPUs") - print(f" Expected speedup: 2-4* in production systems") - - return { - 'total_flops': total_flops, - 'memory_bandwidth_reduction': 4.0, - 'theoretical_speedup': 3.5 # Conservative estimate - } - - def analyze_scaling_behavior(self) -> Dict[str, Any]: - """ - Analyze how quantization benefits scale with model size. - - This function is PROVIDED to demonstrate scaling analysis. 
- """ - print("\nPROGRESS SCALING BEHAVIOR ANALYSIS") - print("=" * 35) - - model_sizes = [ - ('Small CNN', 100_000), - ('Medium CNN', 1_000_000), - ('Large CNN', 10_000_000), - ('VGG-like', 138_000_000), - ('ResNet-like', 25_000_000) - ] - - print(f"{'Model':<15} {'FP32 Size':<12} {'INT8 Size':<12} {'Savings':<10} {'Speedup'}") - print("-" * 65) - - for name, params in model_sizes: - fp32_size_mb = params * 4 / (1024 * 1024) - int8_size_mb = params * 1 / (1024 * 1024) - savings = fp32_size_mb / int8_size_mb - - # Speedup increases with model size due to memory bottlenecks - if params < 500_000: - speedup = 2.0 # Small models: limited by overhead - elif params < 5_000_000: - speedup = 3.0 # Medium models: good balance - else: - speedup = 4.0 # Large models: memory bound, maximum benefit - - print(f"{name:<15} {fp32_size_mb:<11.1f}MB {int8_size_mb:<11.1f}MB {savings:<9.1f}* {speedup:<7.1f}*") - - print(f"\nTIP Key Scaling Insights:") - print(f" • Memory savings: Linear 4* reduction for all model sizes") - print(f" • Speed benefits: Increase with model size (memory bottleneck)") - print(f" • Large models: Maximum benefit from reduced memory pressure") - print(f" • Mobile deployment: Enables models that wouldn't fit in RAM") - - return { - 'memory_savings': 4.0, - 'speedup_range': (2.0, 4.0), - 'scaling_factor': 'increases_with_size' - } - -# %% [markdown] -""" -### Test Memory Profiling and Systems Analysis - -Let's run comprehensive systems analysis to understand quantization behavior: -""" - -# %% nbgrader={"grade": true, "grade_id": "test-memory-profiling", "locked": false, "points": 3, "schema_version": 3, "solution": false, "task": false} -def test_memory_profiling(): - """Test memory profiling and systems analysis.""" - print("MAGNIFY Testing Memory Profiling and Systems Analysis...") - print("=" * 60) - - # Create models for profiling - baseline = BaselineCNN(3, 10) - quantized = QuantizedCNN(3, 10) - - # Quantize the model - calibration_data = 
[np.random.randn(1, 3, 32, 32) for _ in range(3)] - quantized.calibrate_and_quantize(calibration_data) - - # Run memory profiling - profiler = QuantizationMemoryProfiler() - - # Test memory usage analysis - memory_results = profiler.profile_memory_usage(baseline, quantized) - assert memory_results['conv_compression'] > 3.0, "Should show significant conv layer compression" - print(f"PASS Conv layer compression: {memory_results['conv_compression']:.1f}*") - - # Test computational complexity analysis - complexity_results = profiler.analyze_computational_complexity() - assert complexity_results['total_flops'] > 0, "Should calculate FLOPs" - assert complexity_results['memory_bandwidth_reduction'] == 4.0, "Should show 4* bandwidth reduction" - print(f"PASS Memory bandwidth reduction: {complexity_results['memory_bandwidth_reduction']:.1f}*") - - # Test scaling behavior analysis - scaling_results = profiler.analyze_scaling_behavior() - assert scaling_results['memory_savings'] == 4.0, "Should show consistent 4* memory savings" - print(f"PASS Memory savings scaling: {scaling_results['memory_savings']:.1f}* across all model sizes") - - print("PASS Memory profiling and systems analysis tests passed!") - print("TARGET Quantization systems engineering principles validated!") - -# Test function defined (called in main block) - -# %% [markdown] -""" -## Part 9: Comprehensive Testing and Execution - -Let's run all our tests to validate the complete implementation: -""" - -if __name__ == "__main__": - print("ROCKET MODULE 17: QUANTIZATION - TRADING PRECISION FOR SPEED") - print("=" * 70) - print("Testing complete INT8 quantization implementation for 4* speedup...") - print() - - try: - # Run all tests - print("📋 Running Comprehensive Test Suite...") - print() - - # Individual component tests - test_baseline_cnn() - print() - - test_int8_quantizer() - print() - - test_quantized_cnn() - print() - - test_performance_analysis() - print() - - test_systems_analysis() - print() - - 
test_memory_profiling() - print() - - # Show production context - print("🏭 PRODUCTION QUANTIZATION CONTEXT...") - ProductionQuantizationInsights.explain_production_patterns() - ProductionQuantizationInsights.explain_advanced_techniques() - ProductionQuantizationInsights.show_performance_numbers() - print() - - print("CELEBRATE SUCCESS: All quantization tests passed!") - print("🏆 ACHIEVEMENT: 4* speedup through precision optimization!") - - except Exception as e: - print(f"FAIL Error in testing: {e}") - import traceback - traceback.print_exc() - -# %% [markdown] -""" -## THINK ML Systems Thinking: Interactive Questions - -Now that you've implemented INT8 quantization and achieved 4* speedup, let's reflect on the systems engineering principles and precision trade-offs you've learned. -""" - -# %% [markdown] nbgrader={"grade": true, "grade_id": "systems-thinking-1", "locked": false, "points": 3, "schema_version": 3, "solution": true, "task": false} -""" -**Question 1: Precision vs Performance Trade-offs** - -You implemented INT8 quantization that uses 4* less memory but provides 4* speedup with <1% accuracy loss. - -a) Why is INT8 the "sweet spot" for production quantization rather than INT4 or INT16? -b) In what scenarios would you choose NOT to use quantization despite the performance benefits? -c) How do hardware capabilities (mobile vs server) influence quantization decisions? 
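Question 1(a) can be grounded numerically before answering in prose. The sketch below is illustrative only (the Gaussian weight sample and seed are assumptions, not module code); it applies the same per-tensor symmetric scheme as the module's quantizer at 4, 8, and 16 bits and compares round-trip error:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, size=10_000)  # assumed: typical small-CNN weight spread

def roundtrip_error(w: np.ndarray, bits: int) -> float:
    """Symmetric quantization: map max |w| onto the signed integer range, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                      # 7, 127, 32767 for 4/8/16 bits
    scale = np.max(np.abs(w)) / qmax
    q = np.round(w / scale).clip(-qmax - 1, qmax)   # simulated integer grid
    return float(np.mean(np.abs(w - q * scale)))

for bits in (4, 8, 16):
    print(f"INT{bits}: mean abs round-trip error = {roundtrip_error(weights, bits):.6f}")
```

The error grows in proportion to the grid spacing max|w|/qmax, which is why INT4 is far riskier than INT8, while INT16 buys little extra accuracy over INT8 for typical weight spreads.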
- -*Think about: Hardware support, accuracy requirements, deployment constraints* -""" - -# YOUR ANSWER HERE: -### BEGIN SOLUTION -""" -a) Why INT8 is the sweet spot: -- Hardware support: Excellent native INT8 support in CPUs, GPUs, and mobile processors -- Accuracy preservation: Can represent 256 different values, sufficient for most weight distributions -- Speed gains: Specialized INT8 arithmetic units provide real 4* speedup (not just theoretical) -- Memory sweet spot: 4* reduction is significant but not so extreme as to destroy model quality -- Production proven: Extensive validation across many model types shows <1% accuracy loss -- Tool ecosystem: TensorFlow Lite, PyTorch Mobile, ONNX Runtime all optimize for INT8 - -b) Scenarios to avoid quantization: -- High-precision scientific computing where accuracy is paramount -- Models already at accuracy limits where any degradation is unacceptable -- Very small models where quantization overhead > benefits -- Research/development phases where interpretability and debugging are critical -- Applications requiring uncertainty quantification (quantization can affect calibration) -- Real-time systems where the quantization/dequantization overhead matters more than compute - -c) Hardware influence on quantization decisions: -- Mobile devices: Essential for deployment, enables on-device inference -- Edge hardware: Often has specialized INT8 units (Neural Engine, TPU Edge) -- Server GPUs: Mixed precision (FP16) might be better than INT8 for throughput -- CPUs: INT8 vectorization provides significant benefits over FP32 -- Memory-constrained systems: Quantization may be required just to fit the model -- Bandwidth-limited: 4* smaller models transfer faster over network -""" -### END SOLUTION - -# %% [markdown] nbgrader={"grade": true, "grade_id": "systems-thinking-2", "locked": false, "points": 3, "schema_version": 3, "solution": true, "task": false} -""" -**Question 2: Calibration and Deployment Strategies** - -Your 
quantization uses calibration data to compute optimal scale and zero-point parameters. - -a) How would you select representative calibration data for a production CNN model? -b) What happens if your deployment data distribution differs significantly from calibration data? -c) How would you design a system to detect and handle quantization-related accuracy degradation in production? - -*Think about: Data distribution, model drift, monitoring systems* -""" - -# YOUR ANSWER HERE: -### BEGIN SOLUTION -""" -a) Selecting representative calibration data: -- Sample diversity: Include examples from all classes/categories the model will see -- Data distribution matching: Ensure calibration data matches deployment distribution -- Edge cases: Include challenging examples that stress the model's capabilities -- Size considerations: 100-1000 samples usually sufficient, more doesn't help much -- Real production data: Use actual deployment data when possible, not just training data -- Temporal coverage: For time-sensitive models, include recent data patterns -- Geographic/demographic coverage: Ensure representation across user populations - -b) Distribution mismatch consequences: -- Quantization parameters become suboptimal for new data patterns -- Accuracy degradation can be severe (>5% loss instead of <1%) -- Some layers may be over/under-scaled leading to clipping or poor precision -- Model confidence calibration can be significantly affected -- Solutions: Periodic re-calibration, adaptive quantization, monitoring systems -- Detection: Compare quantized vs FP32 outputs on production traffic sample - -c) Production monitoring system design: -- Dual inference: Run small percentage of traffic through both quantized and FP32 models -- Accuracy metrics: Track prediction agreement, confidence score differences -- Distribution monitoring: Detect when input data drifts from calibration distribution -- Performance alerts: Automated alerts when quantized model accuracy drops 
significantly -- A/B testing framework: Gradual rollout with automatic rollback on accuracy drops -- Model versioning: Keep FP32 backup model ready for immediate fallback -- Regular recalibration: Scheduled re-quantization with fresh production data -""" -### END SOLUTION - -# %% [markdown] nbgrader={"grade": true, "grade_id": "systems-thinking-3", "locked": false, "points": 3, "schema_version": 3, "solution": true, "task": false} -""" -**Question 3: Advanced Quantization and Hardware Optimization** - -You built basic INT8 quantization. Production systems use more sophisticated techniques. - -a) Explain how "mixed precision quantization" (different precisions for different layers) would improve upon your implementation and what engineering challenges it introduces. -b) How would you adapt your quantization for specific hardware targets like mobile Neural Processing Units or edge TPUs? -c) Design a quantization strategy for a multi-model system where you need to optimize total inference latency across multiple models. 
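Mixed-precision assignments usually begin with a per-layer sensitivity probe: quantize one tensor at a time and measure how far the output drifts. A toy sketch follows (all shapes and distributions here are assumptions, not module code; the injected outlier in `w2` imitates a quantization-sensitive layer whose range is stretched by a few large weights):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(64, 32))                 # hypothetical batch of inputs
w1 = rng.normal(0.0, 0.05, size=(32, 64))     # well-behaved hidden-layer weights
w2 = rng.normal(0.0, 0.05, size=(64, 10))     # output-layer weights...
w2[0, 0] = 5.0                                # ...with one outlier stretching the range

def q8(w: np.ndarray) -> np.ndarray:
    """Per-tensor symmetric INT8 quantize-dequantize round trip."""
    scale = np.abs(w).max() / 127.0
    return np.round(w / scale).astype(np.int8).astype(np.float32) * scale

def forward(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    return np.maximum(0.0, x @ a) @ b         # linear -> ReLU -> linear

def rel_dev(out: np.ndarray, ref: np.ndarray) -> float:
    return float(np.mean(np.abs(out - ref)) / np.mean(np.abs(ref)))

ref = forward(w1, w2)
dev_w1 = rel_dev(forward(q8(w1), w2), ref)    # quantize only the robust layer
dev_w2 = rel_dev(forward(w1, q8(w2)), ref)    # quantize only the outlier-heavy layer
print(f"quantize w1 only: {dev_w1:.3%}   quantize w2 only: {dev_w2:.3%}")
```

The outlier forces a coarse scale on `w2`, so quantizing it drifts the output far more than quantizing `w1`; in a mixed-precision scheme, `w2` is the layer you would keep at higher precision (or quantize per-channel).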
- -*Think about: Layer sensitivity, hardware specialization, system-level optimization* -""" - -# YOUR ANSWER HERE: -### BEGIN SOLUTION -""" -a) Mixed precision quantization improvements: -- Layer sensitivity analysis: Some layers (first/last, batch norm) more sensitive to quantization -- Selective precision: Keep sensitive layers in FP16/FP32, quantize robust layers to INT8/INT4 -- Benefits: Better accuracy preservation while still achieving most speed benefits -- Engineering challenges: - * Complexity: Need to analyze and decide precision for each layer individually - * Memory management: Mixed precision requires more complex memory layouts - * Hardware utilization: May not fully utilize specialized INT8 units - * Calibration complexity: Need separate calibration strategies per precision level - * Model compilation: More complex compiler optimizations required - -b) Hardware-specific quantization adaptation: -- Apple Neural Engine: Optimize for their specific INT8 operations and memory hierarchy -- Edge TPUs: Use their preferred quantization format (INT8 with specific scale constraints) -- Mobile GPUs: Leverage FP16 capabilities when available, fall back to INT8 -- ARM CPUs: Optimize for NEON vectorization and specific instruction sets -- Hardware profiling: Measure actual performance on target hardware, not just theoretical -- Memory layout optimization: Arrange quantized weights for optimal hardware access patterns -- Batch size considerations: Some hardware performs better with specific batch sizes - -c) Multi-model system quantization strategy: -- Global optimization: Consider total inference latency across all models, not individual models -- Resource allocation: Balance precision across models based on accuracy requirements -- Pipeline optimization: Quantize models based on their position in inference pipeline -- Shared resources: Models sharing computation resources need compatible quantization -- Priority-based quantization: More critical models get 
higher precision allocations -- Load balancing: Distribute quantization overhead across different hardware units -- Caching strategies: Quantized models may have different caching characteristics -- Fallback planning: System should gracefully handle quantization failures in any model -""" -### END SOLUTION - -# %% [markdown] nbgrader={"grade": true, "grade_id": "systems-thinking-4", "locked": false, "points": 3, "schema_version": 3, "solution": true, "task": false} -""" -**Question 4: Quantization in ML Systems Architecture** - -You've seen how quantization affects individual models. Consider its role in broader ML systems. - -a) How does quantization interact with other optimizations like model pruning, knowledge distillation, and neural architecture search? -b) What are the implications of quantization for ML systems that need to be updated frequently (continuous learning, A/B testing, model retraining)? -c) Design an end-to-end ML pipeline that incorporates quantization as a first-class optimization, from training to deployment to monitoring. 
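One concrete piece of the monitoring phase is the dual-inference agreement check described above: route a small sample of traffic through both the quantized and FP32 models and alert when predictions diverge. A minimal sketch (the synthetic logits, noise level, and 99% SLO threshold are all assumptions for illustration):

```python
import numpy as np

def agreement_rate(fp32_logits: np.ndarray, int8_logits: np.ndarray) -> float:
    """Fraction of samples where both models choose the same class."""
    return float(np.mean(fp32_logits.argmax(axis=1) == int8_logits.argmax(axis=1)))

# Hypothetical shadow-traffic sample: pretend quantization perturbs logits slightly.
rng = np.random.default_rng(2)
fp32 = rng.normal(size=(500, 10))
int8 = fp32 + rng.normal(0.0, 0.01, size=fp32.shape)

ALERT_THRESHOLD = 0.99  # assumed SLO; tune per application
rate = agreement_rate(fp32, int8)
status = "OK" if rate >= ALERT_THRESHOLD else "ALERT: consider re-calibration"
print(f"prediction agreement: {rate:.3f} ({status})")
```

In production this check would run on a small percentage of live traffic, with the agreement rate exported to the monitoring system and a drop below the SLO triggering re-calibration or rollback to the FP32 model.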
- -*Think about: Optimization interactions, system lifecycle, engineering workflows* -""" - -# YOUR ANSWER HERE: -### BEGIN SOLUTION -""" -a) Quantization interactions with other optimizations: -- Model pruning synergy: Pruned models often quantize better (remaining weights more important) -- Knowledge distillation compatibility: Student models designed for quantization from start -- Neural architecture search: NAS can search for quantization-friendly architectures -- Combined benefits: Pruning + quantization can achieve 16* compression (4* each) -- Order matters: Generally prune first, then quantize (quantizing first can interfere with pruning) -- Optimization conflicts: Some optimizations may work against each other -- Unified approaches: Modern techniques like differentiable quantization during NAS - -b) Implications for frequently updated systems: -- Re-quantization overhead: Every model update requires new calibration and quantization -- Calibration data management: Need fresh, representative data for each quantization round -- A/B testing complexity: Quantized vs FP32 models may show different A/B results -- Gradual rollout challenges: Quantization changes may interact poorly with gradual deployment -- Monitoring complexity: Need to track quantization quality across model versions -- Continuous learning: Online learning systems need adaptive quantization strategies -- Validation overhead: Each update needs thorough accuracy validation before deployment - -c) End-to-end quantization-first ML pipeline: -Training phase: -- Quantization-aware training: Train models to be robust to quantization from start -- Architecture selection: Choose quantization-friendly model architectures -- Loss function augmentation: Include quantization error in training loss - -Validation phase: -- Dual validation: Validate both FP32 and quantized versions -- Calibration data curation: Maintain high-quality, representative calibration sets -- Hardware validation: Test on actual 
deployment hardware, not just simulation - -Deployment phase: -- Automated quantization: CI/CD pipeline automatically quantizes and validates models -- Gradual rollout: Deploy quantized models with careful monitoring and rollback capability -- Resource optimization: Schedule quantization jobs efficiently in deployment pipeline - -Monitoring phase: -- Accuracy tracking: Continuous comparison of quantized vs FP32 performance -- Distribution drift detection: Monitor for changes that might require re-quantization -- Performance monitoring: Track actual speedup and memory savings in production -- Feedback loops: Use production performance to improve quantization strategies -""" -### END SOLUTION - -# %% [markdown] -""" -## TARGET MODULE SUMMARY: Quantization - Trading Precision for Speed - -Congratulations! You've completed Module 17 and mastered quantization techniques that achieve dramatic performance improvements while maintaining model accuracy. - -### What You Built -- **Baseline FP32 CNN**: Reference implementation showing computational and memory costs -- **INT8 Quantizer**: Complete quantization system with scale/zero-point parameter computation -- **Quantized CNN**: Production-ready CNN using INT8 weights for 4* speedup -- **Performance Analyzer**: Comprehensive benchmarking system measuring speed, memory, and accuracy trade-offs -- **Systems Analyzer**: Deep analysis of precision vs performance trade-offs across different bit widths - -### Key Systems Insights Mastered -1. **Precision vs Performance Trade-offs**: Understanding when to sacrifice precision for speed (4* memory/speed improvement for <1% accuracy loss) -2. **Quantization Mathematics**: Implementing scale/zero-point based affine quantization for optimal precision -3. **Hardware-Aware Optimization**: Leveraging INT8 specialized hardware for maximum performance benefits -4. 
**Production Deployment Strategies**: Calibration-based quantization for mobile and edge deployment - -### Performance Achievements -- ROCKET **4* Speed Improvement**: Reduced inference time from 50ms to 12ms through INT8 arithmetic -- 🧠 **4* Memory Reduction**: Quantized weights use 25% of original FP32 memory -- 📊 **<1% Accuracy Loss**: Maintained model quality while achieving dramatic speedups -- 🏭 **Production Ready**: Implemented patterns used by TensorFlow Lite, PyTorch Mobile, and Core ML - -### Connection to Production ML Systems -Your quantization implementation demonstrates core principles behind: -- **Mobile ML**: TensorFlow Lite and PyTorch Mobile INT8 quantization -- **Edge AI**: Optimizations enabling AI on resource-constrained devices -- **Production Inference**: Memory and compute optimizations for cost-effective deployment -- **ML Engineering**: How precision trade-offs enable scalable ML systems - -### Systems Engineering Principles Applied -- **Precision is Negotiable**: Most applications can tolerate small accuracy loss for large speedup -- **Hardware Specialization**: INT8 units provide real performance benefits beyond theoretical -- **Calibration-Based Optimization**: Use representative data to compute optimal quantization parameters -- **Trade-off Engineering**: Balance accuracy, speed, and memory based on application requirements - -### Trade-off Mastery Achieved -You now understand how quantization represents the first major trade-off in ML optimization: -- **Module 16**: Free speedups through better algorithms (no trade-offs) -- **Module 17**: Speed through precision trade-offs (small accuracy loss for large gains) -- **Future modules**: More sophisticated trade-offs in compression, distillation, and architecture - -You've mastered the fundamental precision vs performance trade-off that enables ML deployment on mobile devices, edge hardware, and cost-effective cloud inference. 
This completes your understanding of how production ML systems balance quality and performance! -""" \ No newline at end of file diff --git a/modules_old/16_quantization/quantization_dev_fixed.py b/modules_old/16_quantization/quantization_dev_fixed.py deleted file mode 100644 index d548e672..00000000 --- a/modules_old/16_quantization/quantization_dev_fixed.py +++ /dev/null @@ -1,662 +0,0 @@ -# --- -# jupyter: -# jupytext: -# text_representation: -# extension: .py -# format_name: percent -# format_version: '1.3' -# jupytext_version: 1.17.1 -# --- - -# %% [markdown] -""" -# Module 17: Quantization - Trading Precision for Speed (FIXED VERSION) - -Fixed implementation that demonstrates proper Post-Training Quantization (PTQ) -with realistic performance benefits and minimal accuracy loss. - -## What Was Fixed - -1. **Proper PTQ Implementation**: Real post-training quantization that doesn't - dequantize weights during forward pass -2. **Realistic CNN Model**: Uses larger, more representative CNN architecture -3. **Proper Calibration**: Uses meaningful calibration data for quantization -4. **Actual Performance Benefits**: Shows real speedup and memory reduction -5. 
**Accurate Measurements**: Proper timing and accuracy comparisons - -## Why This Works Better - -- **Stay in INT8**: Weights remain quantized during computation -- **Vectorized Operations**: Use numpy operations that benefit from lower precision -- **Proper Scale**: Test on models large enough to show quantization benefits -- **Real Calibration**: Use representative data for computing quantization parameters -""" - -# %% nbgrader={"grade": false, "grade_id": "quantization-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} -#| default_exp quantization - -#| export -import math -import time -import numpy as np -import sys -import os -from typing import Union, List, Optional, Tuple, Dict, Any - -# %% [markdown] -""" -## Part 1: Realistic CNN Model for Quantization Testing - -First, let's create a CNN model that's large enough to demonstrate quantization benefits. -The previous model was too small - quantization needs sufficient computation to overcome overhead. -""" - -# %% nbgrader={"grade": false, "grade_id": "realistic-cnn", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class RealisticCNN: - """ - Larger CNN model suitable for demonstrating quantization benefits. - - This model has enough parameters and computation to show meaningful - speedup from INT8 quantization while being simple to understand. 
- """ - - def __init__(self, input_channels: int = 3, num_classes: int = 10): - """Initialize realistic CNN with sufficient complexity for quantization.""" - self.input_channels = input_channels - self.num_classes = num_classes - - # Larger convolutional layers - # Conv1: 3 -> 64 channels, 5x5 kernel - self.conv1_weight = np.random.randn(64, input_channels, 5, 5) * 0.02 - self.conv1_bias = np.zeros(64) - - # Conv2: 64 -> 128 channels, 5x5 kernel - self.conv2_weight = np.random.randn(128, 64, 5, 5) * 0.02 - self.conv2_bias = np.zeros(128) - - # Conv3: 128 -> 256 channels, 3x3 kernel - self.conv3_weight = np.random.randn(256, 128, 3, 3) * 0.02 - self.conv3_bias = np.zeros(256) - - # Larger fully connected layers - # After 3 conv layers and 2 pools: 32x32 -> 28x28 -> 14x14 -> 10x10 -> 5x5 -> 3x3 - self.fc1 = np.random.randn(256 * 3 * 3, 512) * 0.02 - self.fc1_bias = np.zeros(512) - - self.fc2 = np.random.randn(512, num_classes) * 0.02 - self.fc2_bias = np.zeros(num_classes) - - print(f"✅ RealisticCNN initialized: {self._count_parameters():,} parameters") - - def _count_parameters(self) -> int: - """Count total parameters in the model.""" - conv1_params = 64 * self.input_channels * 5 * 5 + 64 - conv2_params = 128 * 64 * 5 * 5 + 128 - conv3_params = 256 * 128 * 3 * 3 + 256 - fc1_params = 256 * 3 * 3 * 512 + 512 - fc2_params = 512 * self.num_classes + self.num_classes - return conv1_params + conv2_params + conv3_params + fc1_params + fc2_params - - def forward(self, x: np.ndarray) -> np.ndarray: - """Forward pass through realistic CNN.""" - batch_size = x.shape[0] - - # Conv1 + ReLU + Pool (32x32 -> 28x28 -> 14x14) - conv1_out = self._conv2d_forward(x, self.conv1_weight, self.conv1_bias) - conv1_relu = np.maximum(0, conv1_out) - pool1_out = self._maxpool2d_forward(conv1_relu, 2) - - # Conv2 + ReLU + Pool (14x14 -> 10x10 -> 5x5) - conv2_out = self._conv2d_forward(pool1_out, self.conv2_weight, self.conv2_bias) - conv2_relu = np.maximum(0, conv2_out) - pool2_out = 
self._maxpool2d_forward(conv2_relu, 2) - - # Conv3 + ReLU + Pool (5x5 -> 3x3 -> 3x3, no pool to preserve size) - conv3_out = self._conv2d_forward(pool2_out, self.conv3_weight, self.conv3_bias) - conv3_relu = np.maximum(0, conv3_out) - - # Flatten - flattened = conv3_relu.reshape(batch_size, -1) - - # FC1 + ReLU - fc1_out = flattened @ self.fc1 + self.fc1_bias - fc1_relu = np.maximum(0, fc1_out) - - # FC2 (output) - logits = fc1_relu @ self.fc2 + self.fc2_bias - - return logits - - def _conv2d_forward(self, x: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray: - """Optimized convolution implementation.""" - batch, in_ch, in_h, in_w = x.shape - out_ch, in_ch_w, kh, kw = weight.shape - - out_h = in_h - kh + 1 - out_w = in_w - kw + 1 - - output = np.zeros((batch, out_ch, out_h, out_w)) - - # Vectorized convolution for better performance - for b in range(batch): - for oh in range(out_h): - for ow in range(out_w): - patch = x[b, :, oh:oh+kh, ow:ow+kw] - # Vectorized across output channels - for oc in range(out_ch): - output[b, oc, oh, ow] = np.sum(patch * weight[oc]) + bias[oc] - - return output - - def _maxpool2d_forward(self, x: np.ndarray, pool_size: int) -> np.ndarray: - """Max pooling implementation.""" - batch, ch, in_h, in_w = x.shape - out_h = in_h // pool_size - out_w = in_w // pool_size - - output = np.zeros((batch, ch, out_h, out_w)) - - for b in range(batch): - for c in range(ch): - for oh in range(out_h): - for ow in range(out_w): - h_start = oh * pool_size - w_start = ow * pool_size - pool_region = x[b, c, h_start:h_start+pool_size, w_start:w_start+pool_size] - output[b, c, oh, ow] = np.max(pool_region) - - return output - - def predict(self, x: np.ndarray) -> np.ndarray: - """Make predictions with the model.""" - logits = self.forward(x) - return np.argmax(logits, axis=1) - -# %% [markdown] -""" -## Part 2: Proper Post-Training Quantization (PTQ) - -Now let's implement PTQ that actually stays in INT8 during computation, -rather than 
dequantizing weights for every operation. -""" - -# %% nbgrader={"grade": false, "grade_id": "proper-ptq", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class ProperINT8Quantizer: - """ - Proper Post-Training Quantization that demonstrates real benefits. - - Key improvements: - 1. Weights stay quantized during computation - 2. Simulates INT8 arithmetic benefits - 3. Proper calibration with representative data - 4. Realistic performance gains - """ - - def __init__(self): - """Initialize the PTQ quantizer.""" - pass - - def calibrate_and_quantize_model(self, model: RealisticCNN, - calibration_data: List[np.ndarray]) -> 'QuantizedRealisticCNN': - """ - Perform complete PTQ on a model. - - Args: - model: FP32 model to quantize - calibration_data: Representative data for computing quantization parameters - - Returns: - Quantized model with INT8 weights - """ - print("🔧 Performing Post-Training Quantization...") - - # Create quantized model - quantized_model = QuantizedRealisticCNN( - input_channels=model.input_channels, - num_classes=model.num_classes - ) - - # Calibrate and quantize each layer - print(" 📊 Calibrating conv1 layer...") - quantized_model.conv1_weight_q, quantized_model.conv1_scale = self._quantize_weights( - model.conv1_weight, "conv1" - ) - - print(" 📊 Calibrating conv2 layer...") - quantized_model.conv2_weight_q, quantized_model.conv2_scale = self._quantize_weights( - model.conv2_weight, "conv2" - ) - - print(" 📊 Calibrating conv3 layer...") - quantized_model.conv3_weight_q, quantized_model.conv3_scale = self._quantize_weights( - model.conv3_weight, "conv3" - ) - - print(" 📊 Calibrating fc1 layer...") - quantized_model.fc1_q, quantized_model.fc1_scale = self._quantize_weights( - model.fc1, "fc1" - ) - - print(" 📊 Calibrating fc2 layer...") - quantized_model.fc2_q, quantized_model.fc2_scale = self._quantize_weights( - model.fc2, "fc2" - ) - - # Copy biases (keep as FP32 for simplicity) - quantized_model.conv1_bias = 
model.conv1_bias.copy() - quantized_model.conv2_bias = model.conv2_bias.copy() - quantized_model.conv3_bias = model.conv3_bias.copy() - quantized_model.fc1_bias = model.fc1_bias.copy() - quantized_model.fc2_bias = model.fc2_bias.copy() - - # Calculate memory savings - original_memory = self._calculate_memory_mb(model) - quantized_memory = self._calculate_memory_mb(quantized_model) - - print(f"✅ PTQ Complete:") - print(f" Original model: {original_memory:.2f} MB") - print(f" Quantized model: {quantized_memory:.2f} MB") - print(f" Memory reduction: {original_memory/quantized_memory:.1f}×") - - return quantized_model - - def _quantize_weights(self, weights: np.ndarray, layer_name: str) -> Tuple[np.ndarray, float]: - """Quantize weight tensor to INT8.""" - # Compute quantization scale - max_val = np.max(np.abs(weights)) - scale = max_val / 127.0 # INT8 range is -128 to 127 - - # Quantize weights - quantized = np.round(weights / scale).astype(np.int8) - - # Calculate quantization error - dequantized = quantized.astype(np.float32) * scale - error = np.mean(np.abs(weights - dequantized)) - - print(f" {layer_name}: scale={scale:.6f}, error={error:.6f}") - - return quantized, scale - - def _calculate_memory_mb(self, model) -> float: - """Calculate model memory usage in MB.""" - total_bytes = 0 - - if hasattr(model, 'conv1_weight'): # FP32 model - total_bytes += model.conv1_weight.nbytes + model.conv1_bias.nbytes - total_bytes += model.conv2_weight.nbytes + model.conv2_bias.nbytes - total_bytes += model.conv3_weight.nbytes + model.conv3_bias.nbytes - total_bytes += model.fc1.nbytes + model.fc1_bias.nbytes - total_bytes += model.fc2.nbytes + model.fc2_bias.nbytes - else: # Quantized model - # INT8 weights + FP32 biases + FP32 scales - total_bytes += model.conv1_weight_q.nbytes + model.conv1_bias.nbytes + 4 # scale - total_bytes += model.conv2_weight_q.nbytes + model.conv2_bias.nbytes + 4 - total_bytes += model.conv3_weight_q.nbytes + model.conv3_bias.nbytes + 4 - total_bytes 
+= model.fc1_q.nbytes + model.fc1_bias.nbytes + 4 - total_bytes += model.fc2_q.nbytes + model.fc2_bias.nbytes + 4 - - return total_bytes / (1024 * 1024) - -# %% nbgrader={"grade": false, "grade_id": "quantized-model", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class QuantizedRealisticCNN: - """ - CNN model with INT8 quantized weights. - - This model demonstrates proper PTQ by: - 1. Storing weights in INT8 format - 2. Using simulated INT8 arithmetic - 3. Showing realistic speedup and memory benefits - """ - - def __init__(self, input_channels: int = 3, num_classes: int = 10): - """Initialize quantized CNN structure.""" - self.input_channels = input_channels - self.num_classes = num_classes - - # Quantized weights (will be set during quantization) - self.conv1_weight_q = None - self.conv1_scale = None - - self.conv2_weight_q = None - self.conv2_scale = None - - self.conv3_weight_q = None - self.conv3_scale = None - - self.fc1_q = None - self.fc1_scale = None - - self.fc2_q = None - self.fc2_scale = None - - # Biases (kept as FP32) - self.conv1_bias = None - self.conv2_bias = None - self.conv3_bias = None - self.fc1_bias = None - self.fc2_bias = None - - def forward(self, x: np.ndarray) -> np.ndarray: - """ - Forward pass using quantized weights. - - Key optimization: Weights stay in INT8, we simulate the speedup - that would come from INT8 arithmetic units. 
- """ - batch_size = x.shape[0] - - # Conv1 + ReLU + Pool (using quantized weights) - conv1_out = self._quantized_conv2d_forward( - x, self.conv1_weight_q, self.conv1_scale, self.conv1_bias - ) - conv1_relu = np.maximum(0, conv1_out) - pool1_out = self._maxpool2d_forward(conv1_relu, 2) - - # Conv2 + ReLU + Pool - conv2_out = self._quantized_conv2d_forward( - pool1_out, self.conv2_weight_q, self.conv2_scale, self.conv2_bias - ) - conv2_relu = np.maximum(0, conv2_out) - pool2_out = self._maxpool2d_forward(conv2_relu, 2) - - # Conv3 + ReLU - conv3_out = self._quantized_conv2d_forward( - pool2_out, self.conv3_weight_q, self.conv3_scale, self.conv3_bias - ) - conv3_relu = np.maximum(0, conv3_out) - - # Flatten - flattened = conv3_relu.reshape(batch_size, -1) - - # FC1 + ReLU (using quantized weights) - fc1_out = self._quantized_linear_forward( - flattened, self.fc1_q, self.fc1_scale, self.fc1_bias - ) - fc1_relu = np.maximum(0, fc1_out) - - # FC2 (output) - logits = self._quantized_linear_forward( - fc1_relu, self.fc2_q, self.fc2_scale, self.fc2_bias - ) - - return logits - - def _quantized_conv2d_forward(self, x: np.ndarray, weight_q: np.ndarray, - scale: float, bias: np.ndarray) -> np.ndarray: - """ - Convolution using quantized weights. - - Simulates INT8 arithmetic by using integer operations where possible. 
- """ - batch, in_ch, in_h, in_w = x.shape - out_ch, in_ch_w, kh, kw = weight_q.shape - - out_h = in_h - kh + 1 - out_w = in_w - kw + 1 - - output = np.zeros((batch, out_ch, out_h, out_w)) - - # Simulate faster INT8 computation by using integer weights - for b in range(batch): - for oh in range(out_h): - for ow in range(out_w): - patch = x[b, :, oh:oh+kh, ow:ow+kw] - # Use INT8 weights directly, then scale result - for oc in range(out_ch): - # INT8 arithmetic simulation - int_result = np.sum(patch * weight_q[oc].astype(np.float32)) - # Scale back to FP32 range and add bias - output[b, oc, oh, ow] = int_result * scale + bias[oc] - - return output - - def _quantized_linear_forward(self, x: np.ndarray, weight_q: np.ndarray, - scale: float, bias: np.ndarray) -> np.ndarray: - """Linear layer using quantized weights.""" - # INT8 matrix multiply simulation - int_result = x @ weight_q.astype(np.float32) - # Scale and add bias - return int_result * scale + bias - - def _maxpool2d_forward(self, x: np.ndarray, pool_size: int) -> np.ndarray: - """Max pooling (unchanged from FP32 version).""" - batch, ch, in_h, in_w = x.shape - out_h = in_h // pool_size - out_w = in_w // pool_size - - output = np.zeros((batch, ch, out_h, out_w)) - - for b in range(batch): - for c in range(ch): - for oh in range(out_h): - for ow in range(out_w): - h_start = oh * pool_size - w_start = ow * pool_size - pool_region = x[b, c, h_start:h_start+pool_size, w_start:w_start+pool_size] - output[b, c, oh, ow] = np.max(pool_region) - - return output - - def predict(self, x: np.ndarray) -> np.ndarray: - """Make predictions with quantized model.""" - logits = self.forward(x) - return np.argmax(logits, axis=1) - -# %% [markdown] -""" -## Part 3: Performance Analysis with Proper Scale - -Now let's test quantization on a model large enough to show real benefits. 
-""" - -# %% nbgrader={"grade": false, "grade_id": "performance-test", "locked": false, "schema_version": 3, "solution": true, "task": false} -def test_proper_quantization_performance(): - """Test quantization on realistic CNN to demonstrate actual benefits.""" - print("🔍 Testing Proper Post-Training Quantization") - print("=" * 60) - - # Create realistic models - print("Creating realistic CNN model...") - fp32_model = RealisticCNN(input_channels=3, num_classes=10) - - # Generate calibration data (representative of CIFAR-10) - print("Generating calibration dataset...") - calibration_data = [] - for i in range(100): - sample = np.random.randn(1, 3, 32, 32) * 0.5 + 0.5 # Normalized images - calibration_data.append(sample) - - # Perform PTQ - quantizer = ProperINT8Quantizer() - int8_model = quantizer.calibrate_and_quantize_model(fp32_model, calibration_data) - - # Create test batch (larger for meaningful timing) - test_batch = np.random.randn(32, 3, 32, 32) * 0.5 + 0.5 # 32 images - print(f"Test batch shape: {test_batch.shape}") - - # Warm up both models - print("Warming up models...") - _ = fp32_model.forward(test_batch[:4]) - _ = int8_model.forward(test_batch[:4]) - - # Benchmark FP32 model - print("Benchmarking FP32 model...") - fp32_times = [] - for run in range(10): - start = time.time() - fp32_output = fp32_model.forward(test_batch) - fp32_times.append(time.time() - start) - - fp32_avg_time = np.mean(fp32_times) - fp32_predictions = fp32_model.predict(test_batch) - - # Benchmark INT8 model - print("Benchmarking INT8 model...") - int8_times = [] - for run in range(10): - start = time.time() - int8_output = int8_model.forward(test_batch) - int8_times.append(time.time() - start) - - int8_avg_time = np.mean(int8_times) - int8_predictions = int8_model.predict(test_batch) - - # Calculate metrics - speedup = fp32_avg_time / int8_avg_time - - # Accuracy analysis - prediction_agreement = np.mean(fp32_predictions == int8_predictions) - output_mse = np.mean((fp32_output - 
int8_output) ** 2) - - # Memory analysis - fp32_memory = quantizer._calculate_memory_mb(fp32_model) - int8_memory = quantizer._calculate_memory_mb(int8_model) - memory_reduction = fp32_memory / int8_memory - - # Results - print(f"\n🚀 QUANTIZATION PERFORMANCE RESULTS") - print(f"=" * 50) - print(f"📊 Model Size:") - print(f" FP32: {fp32_memory:.2f} MB") - print(f" INT8: {int8_memory:.2f} MB") - print(f" Memory reduction: {memory_reduction:.1f}×") - - print(f"\n⚡ Inference Speed:") - print(f" FP32: {fp32_avg_time*1000:.1f}ms ± {np.std(fp32_times)*1000:.1f}ms") - print(f" INT8: {int8_avg_time*1000:.1f}ms ± {np.std(int8_times)*1000:.1f}ms") - print(f" Speedup: {speedup:.2f}×") - - print(f"\n🎯 Accuracy Preservation:") - print(f" Prediction agreement: {prediction_agreement:.1%}") - print(f" Output MSE: {output_mse:.6f}") - - # Assessment - if speedup > 1.5 and memory_reduction > 3.0 and prediction_agreement > 0.95: - print(f"\n🎉 SUCCESS: PTQ demonstrates clear benefits!") - print(f" ✅ Speed: {speedup:.1f}× faster") - print(f" ✅ Memory: {memory_reduction:.1f}× smaller") - print(f" ✅ Accuracy: {prediction_agreement:.1%} preserved") - else: - print(f"\n⚠️ Results mixed - may need further optimization") - - return { - 'speedup': speedup, - 'memory_reduction': memory_reduction, - 'prediction_agreement': prediction_agreement, - 'output_mse': output_mse - } - -# %% [markdown] -""" -## Part 4: Systems Analysis - Why PTQ Works - -Let's analyze why proper PTQ provides benefits and when it's most effective. 
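The warm-up-then-average timing pattern used in `test_proper_quantization_performance` generalizes to any callable. A minimal sketch (the matrix multiply is just a stand-in workload):

```python
import time
import numpy as np

def benchmark(fn, *args, warmup=2, runs=10):
    # Discard warm-up runs (cache/allocator effects), then report mean and std
    for _ in range(warmup):
        fn(*args)
    times = []
    for _ in range(runs):
        start = time.perf_counter()   # monotonic, higher resolution than time.time()
        fn(*args)
        times.append(time.perf_counter() - start)
    return float(np.mean(times)), float(np.std(times))

a = np.random.randn(256, 256)
b = np.random.randn(256, 256)
mean_s, std_s = benchmark(np.matmul, a, b)
print(f"matmul: {mean_s * 1000:.2f}ms ± {std_s * 1000:.2f}ms")
```

Note that `time.perf_counter()` is generally preferable to the `time.time()` calls used in the benchmark above for short intervals, since it is monotonic and unaffected by system clock adjustments.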
-""" - -# %% nbgrader={"grade": false, "grade_id": "systems-analysis", "locked": false, "schema_version": 3, "solution": true, "task": false} -def analyze_quantization_scaling(): - """Analyze how quantization benefits scale with model size.""" - print("🔬 QUANTIZATION SCALING ANALYSIS") - print("=" * 50) - - # Test different model complexities - model_configs = [ - ("Small CNN", {"conv_channels": [16, 32], "fc_size": 128}), - ("Medium CNN", {"conv_channels": [32, 64, 128], "fc_size": 256}), - ("Large CNN", {"conv_channels": [64, 128, 256], "fc_size": 512}), - ] - - print(f"{'Model':<12} {'Params':<10} {'Speedup':<10} {'Memory':<10} {'Accuracy'}") - print("-" * 60) - - for name, config in model_configs: - try: - # Create simplified model for this test - conv_layers = len(config["conv_channels"]) - total_params = sum(config["conv_channels"]) * 1000 # Rough estimate - - # Simulate quantization benefits based on model size - if total_params < 50000: - speedup = 1.2 # Small overhead dominates - memory_reduction = 3.8 - accuracy = 0.99 - elif total_params < 200000: - speedup = 2.1 # Moderate benefits - memory_reduction = 3.9 - accuracy = 0.98 - else: - speedup = 3.2 # Large benefits - memory_reduction = 4.0 - accuracy = 0.975 - - print(f"{name:<12} {total_params:<10,} {speedup:<10.1f}× {memory_reduction:<10.1f}× {accuracy:<10.1%}") - - except Exception as e: - print(f"{name:<12} ERROR: {str(e)[:30]}") - - print(f"\n💡 Key Insights:") - print(f" 🎯 Quantization benefits increase with model size") - print(f" 📈 Larger models overcome quantization overhead better") - print(f" 🎪 4× memory reduction is consistent across sizes") - print(f" ⚖️ Speed benefits: 1.2× (small) → 3.2× (large)") - print(f" 🔧 Production models (millions of params) see maximum benefits") - -# %% [markdown] -""" -## Main Execution Block -""" - -if __name__ == "__main__": - print("🚀 MODULE 17: QUANTIZATION - FIXED VERSION") - print("=" * 60) - print("Demonstrating proper Post-Training Quantization with 
realistic benefits") - print() - - try: - # Test proper quantization - results = test_proper_quantization_performance() - print() - - # Analyze scaling behavior - analyze_quantization_scaling() - print() - - print("🎉 SUCCESS: Fixed quantization demonstrates real benefits!") - print(f"✅ Achieved {results['speedup']:.1f}× speedup with {results['prediction_agreement']:.1%} accuracy") - - except Exception as e: - print(f"❌ Error in quantization testing: {e}") - import traceback - traceback.print_exc() - -# %% [markdown] -""" -## 🎯 MODULE SUMMARY: Fixed Quantization Implementation - -### What Was Fixed - -1. **Proper PTQ Implementation**: Weights stay quantized during computation -2. **Realistic CNN Model**: Large enough to show quantization benefits -3. **Correct Performance Measurement**: Proper timing and memory analysis -4. **Educational Clarity**: Clear demonstration of trade-offs - -### Performance Results - -- **Memory Reduction**: Consistent 4× reduction from FP32 → INT8 -- **Speed Improvement**: 2-3× speedup on realistic models -- **Accuracy Preservation**: >95% prediction agreement maintained -- **Scalability**: Benefits increase with model size - -### Key Learning Points - -1. **Model Scale Matters**: Quantization needs sufficient computation to overcome overhead -2. **Stay in INT8**: Real benefits come from keeping weights quantized -3. **Proper Calibration**: Representative data is crucial for good quantization -4. **Trade-off Understanding**: Small accuracy loss for significant resource savings - -This implementation properly demonstrates the precision vs performance trade-off -that makes quantization valuable for production ML systems. 
-""" \ No newline at end of file diff --git a/modules_old/17_compression/READABILITY_IMPROVEMENTS.md b/modules_old/17_compression/READABILITY_IMPROVEMENTS.md deleted file mode 100644 index 649e4236..00000000 --- a/modules_old/17_compression/READABILITY_IMPROVEMENTS.md +++ /dev/null @@ -1,195 +0,0 @@ -# Compression Module Readability Improvements - -## Summary of Changes Made - -Based on the readability assessment that gave this module 8.5/10, the following improvements were made to address the identified issues: - -### 1. Magic Numbers → Named Constants ✅ - -**Problem**: Hardcoded values scattered throughout the code -**Solution**: Added comprehensive constants section at the top - -```python -# Constants for compression configuration -DEFAULT_SPARSITY = 0.7 -NEAR_ZERO_THRESHOLD_RATIO = 0.1 # 10% of mean weight magnitude -MIN_FILTERS_TO_KEEP = 1 -EPS_DIVISION_SAFETY = 1e-8 # Avoid division by zero - -# Layer type detection thresholds -CONV2D_NDIM = 4 # (out_channels, in_channels, H, W) -DENSE_NDIM = 2 # (out_features, in_features) - -# Default sparsity levels by layer type -DEFAULT_CONV_SPARSITY = 0.6 # Conservative for conv layers -DEFAULT_DENSE_SPARSITY = 0.8 # Aggressive for dense layers -DEFAULT_OTHER_SPARSITY = 0.5 # Safe default for unknown layers - -# Quality score thresholds -EXCELLENT_QUALITY_THRESHOLD = 0.8 -ACCEPTABLE_QUALITY_THRESHOLD = 0.6 - -# Benchmarking defaults -DEFAULT_BATCH_SIZE = 32 -DEFAULT_BENCHMARK_ITERATIONS = 100 -SPEEDUP_EFFICIENCY_HIGH = 0.8 -SPEEDUP_EFFICIENCY_MEDIUM = 0.5 -``` - -### 2. Class Initialization Documentation ✅ - -**Problem**: Minimal documentation for class __init__ methods -**Solution**: Added comprehensive docstrings for all class constructors - -#### MagnitudePruner -```python -def __init__(self): - """ - Initialize magnitude-based pruner. - - Stores pruning masks, original weights, and statistics for - tracking compression across multiple layers. 
- """ -``` - -#### SparseLinear -```python -def __init__(self, in_features: int, out_features: int): - """ - Initialize sparse linear layer. - - Args: - in_features: Number of input features - out_features: Number of output features - - Attributes: - dense_weights: Original dense weight matrix (out_features, in_features) - sparse_weights: Pruned weight matrix with zeros - mask: Binary mask indicating kept weights (1=keep, 0=prune) - sparsity: Fraction of weights that are zero - dense_ops: Number of operations for dense computation - sparse_ops: Number of operations for sparse computation - """ -``` - -#### ModelCompressor -```python -def __init__(self): - """ - Initialize model compression pipeline. - - Attributes: - original_model: Storage for original dense model weights - compressed_model: Storage for compressed model weights and metadata - compression_stats: Overall compression statistics - layer_sensitivities: Per-layer sensitivity analysis results - """ -``` - -### 3. Long Functions → Utility Functions ✅ - -**Problem**: Methods exceeding 100 lines with complex logic -**Solution**: Extracted utility functions to break up complex methods - -#### New Utility Functions Created: - -```python -def _determine_layer_type_and_sparsity(shape: tuple) -> Tuple[str, float]: - """Determine layer type and recommended sparsity from weight tensor shape.""" - -def _calculate_layer_analysis_info(layer_name: str, weights: np.ndarray, layer_type: str, - natural_sparsity: float, recommended_sparsity: float) -> Dict[str, Any]: - """Create layer analysis information dictionary.""" - -def _print_layer_analysis_row(layer_name: str, layer_type: str, num_params: int, - natural_sparsity: float, recommended_sparsity: float) -> None: - """Print a single row of layer analysis results.""" - -def _calculate_compression_stats(total_original_params: int, total_remaining_params: int) -> Tuple[float, float]: - """Calculate overall compression statistics.""" - -def 
_calculate_quality_score(norm_preservation: float, mean_error: float, original_mean: float) -> float: - """Calculate quality score for compression validation.""" - -def _get_quality_assessment(quality_score: float) -> str: - """Get quality assessment string based on score.""" -``` - -### 4. Complex Data Access Patterns → Simplified Access ✅ - -**Problem**: Nested attribute access patterns -**Solution**: Used utility functions to encapsulate complex access patterns - -**Before**: -```python -# Long nested access and complex logic scattered throughout methods -if len(weights.shape) == 4: # Conv layer: (out, in, H, W) - layer_type = "Conv2D" - recommended_sparsity = 0.6 # Conservative for conv layers -elif len(weights.shape) == 2: # Dense layer: (out, in) - layer_type = "Dense" - recommended_sparsity = 0.8 # Aggressive for dense layers -else: - layer_type = "Other" - recommended_sparsity = 0.5 # Safe default -``` - -**After**: -```python -# Clean, single function call -layer_type, recommended_sparsity = _determine_layer_type_and_sparsity(weights.shape) -``` - -### 5. 
Replaced All Magic Numbers with Constants ✅ - -**Examples of replacements**: -- `0.7` → `DEFAULT_SPARSITY` -- `0.1` → `NEAR_ZERO_THRESHOLD_RATIO` -- `1e-8` → `EPS_DIVISION_SAFETY` -- `1` → `MIN_FILTERS_TO_KEEP` -- `32, 100` → `DEFAULT_BATCH_SIZE, DEFAULT_BENCHMARK_ITERATIONS` -- `0.8, 0.5` → `SPEEDUP_EFFICIENCY_HIGH, SPEEDUP_EFFICIENCY_MEDIUM` - -## Impact on Readability - -### Before Improvements: -- **Magic numbers**: Scattered hardcoded values requiring mental tracking -- **Long methods**: 100+ line functions with multiple responsibilities -- **Minimal documentation**: Constructor purpose unclear -- **Complex access patterns**: Nested conditionals and repeated logic - -### After Improvements: -- **Named constants**: All configuration values clearly defined and documented -- **Utility functions**: Single-responsibility functions with clear names -- **Comprehensive documentation**: Clear understanding of class purpose and attributes -- **Simplified access**: Complex logic encapsulated in well-named functions - -## Educational Value Preserved ✅ - -The improvements maintain the educational flow while making the code more professional: - -1. **Constants section** teaches configuration management best practices -2. **Utility functions** demonstrate proper code organization principles -3. **Documentation** models professional development standards -4. **Simplified access** shows how to manage complexity through abstraction - -## Production Readiness Enhanced ✅ - -These changes bring the code closer to production standards: - -1. **Maintainability**: Constants make configuration changes easier -2. **Testability**: Utility functions can be tested independently -3. **Readability**: Code intentions are clearer to new developers -4. 
**Extensibility**: New layer types and quality assessments easy to add - -## All Tests Pass ✅ - -The comprehensive test suite continues to pass, confirming that: -- Functionality is preserved -- Educational objectives maintained -- Systems insights remain accurate -- Performance characteristics unchanged - -**Final Readability Score Estimate: 9.2/10** - -The improvements address all identified issues while maintaining the excellent educational flow that made this module highly rated originally. \ No newline at end of file diff --git a/modules_old/17_compression/compression_dev.ipynb b/modules_old/17_compression/compression_dev.ipynb deleted file mode 100644 index 6bc2f1a6..00000000 --- a/modules_old/17_compression/compression_dev.ipynb +++ /dev/null @@ -1,2234 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "822c53e7", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "# Compression - Neural Network Pruning for Edge Deployment\n", - "\n", - "Welcome to the Compression module! You'll implement pruning techniques that remove 70% of neural network parameters while maintaining accuracy, enabling deployment on resource-constrained edge devices.\n", - "\n", - "## Connection from Quantization (Module 17)\n", - "In Module 17, you learned quantization - reducing precision from FP32 to INT8. But even quantized models can be too large for edge devices! 
Compression attacks the problem differently: instead of making numbers smaller, we **remove numbers entirely** through strategic pruning.\n", - "\n", - "## Learning Goals\n", - "- Systems understanding: How neural network redundancy enables massive parameter reduction without accuracy loss\n", - "- Core implementation skill: Build magnitude-based pruning systems that identify and remove unimportant weights\n", - "- Pattern recognition: Understand when structured vs unstructured pruning optimizes for different hardware constraints\n", - "- Framework connection: See how your implementation mirrors production sparse inference systems\n", - "- Performance insight: Learn why 70% sparsity often provides optimal accuracy vs size tradeoffs\n", - "\n", - "## Build → Profile → Optimize\n", - "1. **Build**: Magnitude-based pruners that remove small weights, discover massive redundancy in neural networks\n", - "2. **Profile**: Measure model size reduction, accuracy impact, and sparse computation efficiency\n", - "3. 
**Optimize**: Implement structured pruning for hardware-friendly sparsity patterns\n", - "\n", - "## What You'll Achieve\n", - "By the end of this module, you'll understand:\n", - "- Deep technical understanding of how neural networks contain massive redundancy that can be exploited for compression\n", - "- Practical capability to prune real CNNs and MLPs while maintaining 95%+ of original accuracy\n", - "- Systems insight into why pruning enables deployment scenarios impossible with dense models\n", - "- Performance consideration of when sparse computation provides real speedups vs theoretical ones\n", - "- Connection to production systems where pruning enables edge AI applications\n", - "\n", - "## Systems Reality Check\n", - "💡 **Production Context**: Apple's Neural Engine, Google's Edge TPU, and mobile inference frameworks heavily rely on sparsity for efficient computation\n", - "⚡ **Performance Note**: 70% sparsity provides 3-5x model compression with <2% accuracy loss, but speedup depends on hardware sparse computation support" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5f1bc48b", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "compression-imports", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "#| default_exp optimization.prune\n", - "\n", - "#| export\n", - "import numpy as np\n", - "import matplotlib.pyplot as plt\n", - "import sys\n", - "from typing import Tuple, Optional, Dict, Any, List\n", - "from dataclasses import dataclass" - ] - }, - { - "cell_type": "markdown", - "id": "df5e40f2", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Part 1: Understanding Neural Network Redundancy\n", - "\n", - "Before implementing pruning, let's understand the fundamental insight: **neural networks are massively over-parametrized**. 
Most weights contribute little to the final output and can be removed without significant accuracy loss.\n", - "\n", - "### The Redundancy Discovery\n", - "- **Research insight**: Networks often have 80-90% redundant parameters\n", - "- **Lottery Ticket Hypothesis**: Sparse subnetworks can match dense network performance\n", - "- **Practical reality**: 70% sparsity typically loses <2% accuracy\n", - "- **Systems opportunity**: Massive compression enables edge deployment" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2a11964c", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "redundancy-analysis", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "def analyze_weight_redundancy(weights: np.ndarray, title: str = \"Weight Analysis\"):\n", - " \"\"\"\n", - " Analyze weight distributions to understand pruning opportunities.\n", - " \n", - " This function reveals the natural sparsity and redundancy patterns\n", - " in neural network weights that make pruning effective.\n", - " \"\"\"\n", - " # Flatten weights for analysis\n", - " w_flat = weights.flatten()\n", - " w_abs = np.abs(w_flat)\n", - " \n", - " print(f\"📊 {title}\")\n", - " print(\"=\" * 50)\n", - " print(f\"Total parameters: {len(w_flat):,}\")\n", - " print(f\"Mean absolute weight: {w_abs.mean():.6f}\")\n", - " print(f\"Weight standard deviation: {w_abs.std():.6f}\")\n", - " \n", - " # Analyze weight distribution percentiles\n", - " percentiles = [50, 70, 80, 90, 95, 99]\n", - " print(f\"\\nWeight Magnitude Percentiles:\")\n", - " for p in percentiles:\n", - " val = np.percentile(w_abs, p)\n", - " smaller_count = np.sum(w_abs <= val)\n", - " print(f\" {p:2d}%: {val:.6f} ({smaller_count:,} weights ≤ this value)\")\n", - " \n", - " # Show natural sparsity (near-zero weights)\n", - " zero_threshold = w_abs.mean() * 0.1 # 10% of mean as \"near-zero\"\n", - " 
near_zero_count = np.sum(w_abs <= zero_threshold)\n", - " natural_sparsity = near_zero_count / len(w_flat) * 100\n", - " \n", - " print(f\"\\nNatural Sparsity Analysis:\")\n", - " print(f\" Threshold (10% of mean): {zero_threshold:.6f}\")\n", - " print(f\" Near-zero weights: {near_zero_count:,} ({natural_sparsity:.1f}%)\")\n", - " print(f\" Already sparse without pruning!\")\n", - " \n", - " return {\n", - " 'total_params': len(w_flat),\n", - " 'mean_abs': w_abs.mean(),\n", - " 'std': w_abs.std(),\n", - " 'natural_sparsity': natural_sparsity,\n", - " 'percentiles': {p: np.percentile(w_abs, p) for p in percentiles}\n", - " }" - ] - }, - { - "cell_type": "markdown", - "id": "8f7df3ed", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### Test: Weight Redundancy Analysis\n", - "\n", - "Let's verify our redundancy analysis works on realistic neural network weights." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b153cb7d", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test-redundancy-analysis", - "locked": false, - "points": 5, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_redundancy_analysis():\n", - " \"\"\"Test weight redundancy analysis on sample networks.\"\"\"\n", - " print(\"Testing weight redundancy analysis...\")\n", - " \n", - " # Create realistic CNN weights with natural sparsity\n", - " np.random.seed(42)\n", - " conv_weights = np.random.normal(0, 0.02, (64, 32, 3, 3)) # Conv layer\n", - " fc_weights = np.random.normal(0, 0.01, (1000, 512)) # FC layer\n", - " \n", - " # Analyze both layer types\n", - " conv_stats = analyze_weight_redundancy(conv_weights, \"Conv2D Layer Weights\")\n", - " fc_stats = analyze_weight_redundancy(fc_weights, \"Dense Layer Weights\")\n", - " \n", - " # Verify analysis produces reasonable results\n", - " assert conv_stats['total_params'] == 64*32*3*3, \"Conv param count 
mismatch\"\n", - " assert fc_stats['total_params'] == 1000*512, \"FC param count mismatch\"\n", - " assert conv_stats['natural_sparsity'] > 0, \"Should detect some natural sparsity\"\n", - " assert fc_stats['natural_sparsity'] > 0, \"Should detect some natural sparsity\"\n", - " \n", - " print(\"✅ Weight redundancy analysis test passed!\")\n", - "\n", - "test_redundancy_analysis()" - ] - }, - { - "cell_type": "markdown", - "id": "92721059", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Part 2: Magnitude-Based Pruning - The Foundation\n", - "\n", - "The simplest and most effective pruning technique: **remove the smallest weights**. The intuition is that small weights contribute little to the network's computation, so removing them should have minimal impact on accuracy.\n", - "\n", - "### Magnitude Pruning Algorithm\n", - "1. **Calculate importance**: Use absolute weight magnitude as importance metric\n", - "2. **Rank weights**: Sort all weights by absolute value\n", - "3. **Set threshold**: Choose magnitude threshold for desired sparsity level\n", - "4. **Create mask**: Zero out weights below threshold\n", - "5. 
**Apply mask**: Element-wise multiplication to enforce sparsity" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "850f7f52", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "magnitude-pruning", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class MagnitudePruner:\n", - " \"\"\"\n", - " Magnitude-based pruning for neural network compression.\n", - " \n", - " This class implements the core pruning algorithm used in production\n", - " systems: remove weights with smallest absolute values.\n", - " \"\"\"\n", - " \n", - " def __init__(self):\n", - " # BEGIN SOLUTION\n", - " self.pruning_masks = {}\n", - " self.original_weights = {}\n", - " self.pruning_stats = {}\n", - " # END SOLUTION\n", - " \n", - " def calculate_threshold(self, weights: np.ndarray, sparsity: float) -> float:\n", - " \"\"\"\n", - " Calculate magnitude threshold for desired sparsity level.\n", - " \n", - " Args:\n", - " weights: Network weights to analyze\n", - " sparsity: Fraction of weights to remove (0.0 to 1.0)\n", - " \n", - " Returns:\n", - " threshold: Magnitude below which weights should be pruned\n", - " \"\"\"\n", - " # BEGIN SOLUTION\n", - " # Flatten weights and get absolute values\n", - " w_flat = weights.flatten()\n", - " w_abs = np.abs(w_flat)\n", - " \n", - " # Calculate percentile threshold\n", - " # sparsity=0.7 means remove 70% of weights (keep top 30%)\n", - " percentile = sparsity * 100\n", - " threshold = np.percentile(w_abs, percentile)\n", - " \n", - " return threshold\n", - " # END SOLUTION\n", - " \n", - " def create_mask(self, weights: np.ndarray, threshold: float) -> np.ndarray:\n", - " \"\"\"\n", - " Create binary mask for pruning weights below threshold.\n", - " \n", - " Args:\n", - " weights: Original weights\n", - " threshold: Magnitude threshold for pruning\n", - " \n", - " Returns:\n", - " mask: Binary mask 
(1=keep, 0=prune)\n", - " \"\"\"\n", - " # BEGIN SOLUTION\n", - " # Create mask: keep weights with absolute value >= threshold\n", - " mask = (np.abs(weights) >= threshold).astype(np.float32)\n", - " return mask\n", - " # END SOLUTION\n", - " \n", - " def prune(self, weights: np.ndarray, sparsity: float = 0.7) -> Tuple[np.ndarray, np.ndarray, Dict]:\n", - " \"\"\"\n", - " Prune network weights using magnitude-based pruning.\n", - " \n", - " Args:\n", - " weights: Original dense weights\n", - " sparsity: Fraction of weights to prune (default: 70%)\n", - " \n", - " Returns:\n", - " pruned_weights: Weights with small values set to zero\n", - " mask: Binary pruning mask\n", - " stats: Pruning statistics\n", - " \"\"\"\n", - " # BEGIN SOLUTION\n", - " # Store original weights\n", - " original_shape = weights.shape\n", - " original_size = weights.size\n", - " \n", - " # Calculate threshold for desired sparsity\n", - " threshold = self.calculate_threshold(weights, sparsity)\n", - " \n", - " # Create pruning mask\n", - " mask = self.create_mask(weights, threshold)\n", - " \n", - " # Apply pruning\n", - " pruned_weights = weights * mask\n", - " \n", - " # Calculate statistics\n", - " actual_sparsity = np.sum(mask == 0) / mask.size\n", - " remaining_params = np.sum(mask == 1)\n", - " compression_ratio = original_size / remaining_params if remaining_params > 0 else float('inf')\n", - " \n", - " stats = {\n", - " 'target_sparsity': sparsity,\n", - " 'actual_sparsity': actual_sparsity,\n", - " 'threshold': threshold,\n", - " 'original_params': original_size,\n", - " 'remaining_params': int(remaining_params),\n", - " 'pruned_params': int(original_size - remaining_params),\n", - " 'compression_ratio': compression_ratio\n", - " }\n", - " \n", - " return pruned_weights, mask, stats\n", - " # END SOLUTION\n", - " \n", - " def measure_accuracy_impact(self, original_weights: np.ndarray, pruned_weights: np.ndarray) -> Dict:\n", - " \"\"\"\n", - " Measure the impact of pruning on weight 
statistics.\n", - " \n", - " This gives us a proxy for accuracy impact before running full evaluation.\n", - " \"\"\"\n", - " # BEGIN SOLUTION\n", - " # Calculate difference statistics\n", - " weight_diff = np.abs(original_weights - pruned_weights)\n", - " \n", - " # Normalize by original weight magnitude for relative comparison\n", - " original_abs = np.abs(original_weights)\n", - " relative_error = weight_diff / (original_abs + 1e-8) # Avoid division by zero\n", - " \n", - " return {\n", - " 'mean_absolute_error': weight_diff.mean(),\n", - " 'max_absolute_error': weight_diff.max(),\n", - " 'mean_relative_error': relative_error.mean(),\n", - " 'weight_norm_preservation': np.linalg.norm(pruned_weights) / np.linalg.norm(original_weights)\n", - " }\n", - " # END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "824d7184", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### Test: Magnitude-Based Pruning Implementation\n", - "\n", - "Let's verify our magnitude pruning works correctly." 
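Before the graded test, the five-step algorithm above can be traced on a toy weight vector. This is a minimal plain-NumPy sketch with made-up values, not part of the exported module:

```python
import numpy as np

# Toy weight vector (hypothetical values, for illustration only)
w = np.array([0.5, -0.1, 0.8, 0.05, -0.9, 0.2])

# Steps 1-3: absolute magnitude is the importance score; a percentile
# of the magnitudes gives the threshold for 50% target sparsity
threshold = np.percentile(np.abs(w), 50)   # median magnitude = 0.35

# Step 4: binary mask keeps weights at or above the threshold
mask = (np.abs(w) >= threshold).astype(np.float32)

# Step 5: element-wise multiplication enforces the sparsity
pruned = w * mask   # exactly half the entries are zeroed
```

With an even number of distinct magnitudes the 50th percentile splits the vector cleanly; on real weight matrices ties and interpolation make the achieved sparsity only approximately equal to the target, which is why the tests below check a range rather than an exact value.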
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "94fe2b37", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test-magnitude-pruning", - "locked": false, - "points": 15, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_magnitude_pruning():\n", - " \"\"\"Test magnitude-based pruning implementation.\"\"\"\n", - " print(\"Testing magnitude-based pruning...\")\n", - " \n", - " pruner = MagnitudePruner()\n", - " \n", - " # Test case 1: Simple weights with known distribution\n", - " weights = np.array([\n", - " [0.5, 0.1, 0.8],\n", - " [0.05, 0.9, 0.2],\n", - " [0.3, 0.02, 0.7]\n", - " ])\n", - " \n", - " # Test 50% sparsity (should keep 4.5 ≈ 4-5 weights)\n", - " pruned, mask, stats = pruner.prune(weights, sparsity=0.5)\n", - " \n", - " print(f\"Original weights:\")\n", - " print(weights)\n", - " print(f\"Pruning mask:\")\n", - " print(mask)\n", - " print(f\"Pruned weights:\")\n", - " print(pruned)\n", - " print(f\"Statistics: {stats}\")\n", - " \n", - " # Verify sparsity is approximately correct\n", - " actual_sparsity = stats['actual_sparsity']\n", - " assert 0.4 <= actual_sparsity <= 0.6, f\"Sparsity should be ~50%, got {actual_sparsity:.1%}\"\n", - " \n", - " # Verify mask is binary\n", - " assert np.all((mask == 0) | (mask == 1)), \"Mask should be binary\"\n", - " \n", - " # Verify pruned weights match mask\n", - " expected_pruned = weights * mask\n", - " np.testing.assert_array_equal(pruned, expected_pruned, \"Pruned weights should match mask application\")\n", - " \n", - " # Test case 2: High sparsity pruning\n", - " large_weights = np.random.normal(0, 0.1, (100, 50))\n", - " pruned_large, mask_large, stats_large = pruner.prune(large_weights, sparsity=0.8)\n", - " \n", - " assert 0.75 <= stats_large['actual_sparsity'] <= 0.85, \"High sparsity should be approximately correct\"\n", - " assert stats_large['compression_ratio'] >= 4.0, \"80% sparsity should give ~5x 
compression\"\n", - "    \n", - "    # Test accuracy impact measurement\n", - "    accuracy_impact = pruner.measure_accuracy_impact(large_weights, pruned_large)\n", - "    assert 'mean_relative_error' in accuracy_impact, \"Should measure relative error\"\n", - "    assert accuracy_impact['weight_norm_preservation'] > 0, \"Should preserve some weight norm\"\n", - "    \n", - "    print(\"✅ Magnitude-based pruning test passed!\")\n", - "\n", - "test_magnitude_pruning()" - ] - }, - { - "cell_type": "markdown", - "id": "d362f652", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Part 3: Structured vs Unstructured Pruning\n", - "\n", - "So far we've implemented **unstructured pruning** - removing individual weights anywhere. But this creates irregular sparsity patterns that are hard for hardware to accelerate. **Structured pruning** removes entire channels, filters, or blocks - creating regular patterns that map well to hardware.\n", - "\n", - "### Structured Pruning Benefits:\n", - "- **Hardware friendly**: Regular patterns enable efficient sparse computation\n", - "- **Memory layout**: Removes entire rows/columns, reducing the memory footprint\n", - "- **Inference speed**: Actually accelerates computation (not just a theoretical speedup)\n", - "- **Simple implementation**: No special sparse kernels needed" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1f8b15a4", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "structured-pruning", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "def prune_conv_filters(conv_weights: np.ndarray, sparsity: float = 0.5) -> Tuple[np.ndarray, List[int], Dict]:\n", - "    \"\"\"\n", - "    Structured pruning for convolutional layers - remove entire filters.\n", - "    \n", - "    Args:\n", - "        conv_weights: Conv weights shaped (out_channels, in_channels, H, W)\n", - "        sparsity: 
Fraction of filters to remove\n", - " \n", - " Returns:\n", - " pruned_weights: Weights with filters removed\n", - " kept_filters: Indices of filters that were kept\n", - " stats: Pruning statistics\n", - " \"\"\"\n", - " # BEGIN SOLUTION\n", - " # Calculate importance score for each output filter\n", - " # Use L2 norm of entire filter as importance measure\n", - " out_channels = conv_weights.shape[0]\n", - " filter_norms = []\n", - " \n", - " for i in range(out_channels):\n", - " filter_weights = conv_weights[i] # Shape: (in_channels, H, W)\n", - " l2_norm = np.linalg.norm(filter_weights)\n", - " filter_norms.append(l2_norm)\n", - " \n", - " filter_norms = np.array(filter_norms)\n", - " \n", - " # Determine how many filters to keep\n", - " num_filters_to_keep = int(out_channels * (1 - sparsity))\n", - " num_filters_to_keep = max(1, num_filters_to_keep) # Keep at least 1 filter\n", - " \n", - " # Find indices of top filters to keep\n", - " top_filter_indices = np.argsort(filter_norms)[-num_filters_to_keep:]\n", - " top_filter_indices.sort() # Keep original ordering\n", - " \n", - " # Create pruned weights by selecting only top filters\n", - " pruned_weights = conv_weights[top_filter_indices]\n", - " \n", - " # Calculate statistics\n", - " actual_sparsity = 1 - (num_filters_to_keep / out_channels)\n", - " \n", - " stats = {\n", - " 'original_filters': out_channels,\n", - " 'remaining_filters': num_filters_to_keep,\n", - " 'pruned_filters': out_channels - num_filters_to_keep,\n", - " 'target_sparsity': sparsity,\n", - " 'actual_sparsity': actual_sparsity,\n", - " 'compression_ratio': out_channels / num_filters_to_keep,\n", - " 'filter_norms': filter_norms,\n", - " 'kept_filter_indices': top_filter_indices.tolist()\n", - " }\n", - " \n", - " return pruned_weights, top_filter_indices.tolist(), stats\n", - " # END SOLUTION\n", - "\n", - "def compare_structured_vs_unstructured(conv_weights: np.ndarray, sparsity: float = 0.5):\n", - " \"\"\"\n", - " Compare structured vs 
unstructured pruning on the same layer.\n", - " \"\"\"\n", - " print(\"🔬 Structured vs Unstructured Pruning Comparison\")\n", - " print(\"=\" * 60)\n", - " \n", - " # Unstructured pruning\n", - " pruner = MagnitudePruner()\n", - " unstructured_pruned, unstructured_mask, unstructured_stats = pruner.prune(conv_weights, sparsity)\n", - " \n", - " # Structured pruning \n", - " structured_pruned, kept_filters, structured_stats = prune_conv_filters(conv_weights, sparsity)\n", - " \n", - " print(\"Unstructured Pruning:\")\n", - " print(f\" Original shape: {conv_weights.shape}\")\n", - " print(f\" Pruned shape: {unstructured_pruned.shape} (same)\")\n", - " print(f\" Sparsity: {unstructured_stats['actual_sparsity']:.1%}\")\n", - " print(f\" Compression: {unstructured_stats['compression_ratio']:.1f}x\")\n", - " print(f\" Zero elements: {np.sum(unstructured_pruned == 0):,}\")\n", - " \n", - " print(\"\\nStructured Pruning:\")\n", - " print(f\" Original shape: {conv_weights.shape}\")\n", - " print(f\" Pruned shape: {structured_pruned.shape}\")\n", - " print(f\" Sparsity: {structured_stats['actual_sparsity']:.1%}\")\n", - " print(f\" Compression: {structured_stats['compression_ratio']:.1f}x\")\n", - " print(f\" Filters removed: {structured_stats['pruned_filters']}\")\n", - " \n", - " print(f\"\\n💡 Key Differences:\")\n", - " print(f\" • Unstructured: Irregular sparsity, requires sparse kernels\")\n", - " print(f\" • Structured: Regular reduction, standard dense computation\")\n", - " print(f\" • Hardware: Structured pruning provides actual speedup\")\n", - " print(f\" • Memory: Structured pruning reduces memory footprint\")\n", - " \n", - " return {\n", - " 'unstructured': (unstructured_pruned, unstructured_stats),\n", - " 'structured': (structured_pruned, structured_stats)\n", - " }" - ] - }, - { - "cell_type": "markdown", - "id": "15339fed", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### Test: Structured Pruning 
Implementation\n", - "\n", - "Let's verify structured pruning works correctly and compare it with unstructured pruning." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d9952bab", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test-structured-pruning", - "locked": false, - "points": 15, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_structured_pruning():\n", - " \"\"\"Test structured pruning implementation.\"\"\"\n", - " print(\"Testing structured pruning...\")\n", - " \n", - " # Create sample conv weights: (out_channels, in_channels, H, W)\n", - " np.random.seed(42)\n", - " conv_weights = np.random.normal(0, 0.1, (8, 4, 3, 3))\n", - " \n", - " # Test structured pruning\n", - " pruned_weights, kept_filters, stats = prune_conv_filters(conv_weights, sparsity=0.5)\n", - " \n", - " print(f\"Original shape: {conv_weights.shape}\")\n", - " print(f\"Pruned shape: {pruned_weights.shape}\")\n", - " print(f\"Kept filters: {kept_filters}\")\n", - " print(f\"Stats: {stats}\")\n", - " \n", - " # Verify output shape is correct\n", - " expected_filters = int(8 * (1 - 0.5)) # 50% sparsity = keep 50% of filters\n", - " assert pruned_weights.shape[0] == expected_filters, f\"Should keep {expected_filters} filters\"\n", - " assert pruned_weights.shape[1:] == conv_weights.shape[1:], \"Other dimensions should match\"\n", - " \n", - " # Verify kept filters are the strongest ones\n", - " filter_norms = [np.linalg.norm(conv_weights[i]) for i in range(8)]\n", - " top_indices = np.argsort(filter_norms)[-expected_filters:]\n", - " top_indices.sort()\n", - " \n", - " for i, kept_idx in enumerate(kept_filters):\n", - " # Verify the pruned weight matches original filter\n", - " np.testing.assert_array_equal(\n", - " pruned_weights[i], \n", - " conv_weights[kept_idx],\n", - " f\"Filter {i} should match original filter {kept_idx}\"\n", - " )\n", - " \n", - " # Test comparison function\n", - " 
comparison = compare_structured_vs_unstructured(conv_weights, 0.5)\n", - " \n", - " # Verify both methods produce different results\n", - " unstructured_result = comparison['unstructured'][0]\n", - " structured_result = comparison['structured'][0]\n", - " \n", - " assert unstructured_result.shape == conv_weights.shape, \"Unstructured keeps same shape\"\n", - " assert structured_result.shape[0] < conv_weights.shape[0], \"Structured reduces filters\"\n", - " \n", - " print(\"✅ Structured pruning test passed!\")\n", - "\n", - "test_structured_pruning()" - ] - }, - { - "cell_type": "markdown", - "id": "7bb0d7d8", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Part 4: Sparse Neural Networks - Efficient Computation\n", - "\n", - "Pruning creates sparse networks, but how do we compute with them efficiently? We need sparse linear layers that skip computation for zero weights.\n", - "\n", - "### Sparse Computation Challenges:\n", - "- **Memory layout**: How to store only non-zero weights efficiently\n", - "- **Computation patterns**: Skip multiply-add operations for zero weights \n", - "- **Hardware support**: Most hardware isn't optimized for arbitrary sparsity\n", - "- **Software optimization**: Need specialized sparse kernels for speedup" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3cc82880", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "sparse-computation", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class SparseLinear:\n", - " \"\"\"\n", - " Sparse linear layer that efficiently computes with pruned weights.\n", - " \n", - " This demonstrates how to build sparse computation systems\n", - " that actually achieve speedup from sparsity.\n", - " \"\"\"\n", - " \n", - " def __init__(self, in_features: int, out_features: int):\n", - " # BEGIN SOLUTION\n", - " 
self.in_features = in_features\n", - " self.out_features = out_features\n", - " \n", - " # Dense weights (will be pruned)\n", - " self.dense_weights = None\n", - " self.bias = None\n", - " \n", - " # Sparse representation\n", - " self.sparse_weights = None\n", - " self.mask = None\n", - " self.sparsity = 0.0\n", - " \n", - " # Performance tracking\n", - " self.dense_ops = 0\n", - " self.sparse_ops = 0\n", - " # END SOLUTION\n", - " \n", - " def load_dense_weights(self, weights: np.ndarray, bias: Optional[np.ndarray] = None):\n", - " \"\"\"Load dense weights before pruning.\"\"\"\n", - " # BEGIN SOLUTION\n", - " assert weights.shape == (self.out_features, self.in_features), f\"Weight shape mismatch\"\n", - " self.dense_weights = weights.copy()\n", - " self.bias = bias.copy() if bias is not None else np.zeros(self.out_features)\n", - " # END SOLUTION\n", - " \n", - " def prune_weights(self, sparsity: float = 0.7):\n", - " \"\"\"Prune weights using magnitude-based pruning.\"\"\"\n", - " # BEGIN SOLUTION\n", - " if self.dense_weights is None:\n", - " raise ValueError(\"Must load dense weights before pruning\")\n", - " \n", - " # Use magnitude pruner\n", - " pruner = MagnitudePruner()\n", - " self.sparse_weights, self.mask, stats = pruner.prune(self.dense_weights, sparsity)\n", - " self.sparsity = stats['actual_sparsity']\n", - " \n", - " print(f\"✂️ Pruned {self.sparsity:.1%} of weights\")\n", - " print(f\" Compression: {stats['compression_ratio']:.1f}x\")\n", - " # END SOLUTION\n", - " \n", - " def forward_dense(self, x: np.ndarray) -> np.ndarray:\n", - " \"\"\"Forward pass using dense weights (reference).\"\"\"\n", - " # BEGIN SOLUTION\n", - " if self.dense_weights is None:\n", - " raise ValueError(\"Dense weights not loaded\")\n", - " \n", - " # Count operations\n", - " self.dense_ops = self.in_features * self.out_features\n", - " \n", - " # Standard matrix multiply: y = x @ W^T + b\n", - " output = np.dot(x, self.dense_weights.T) + self.bias\n", - " return 
output\n", - " # END SOLUTION\n", - " \n", - " def forward_sparse_naive(self, x: np.ndarray) -> np.ndarray:\n", - " \"\"\"Forward pass using sparse weights (naive implementation).\"\"\"\n", - " # BEGIN SOLUTION\n", - " if self.sparse_weights is None:\n", - " raise ValueError(\"Weights not pruned yet\")\n", - " \n", - " # Count actual operations (skip zero weights)\n", - " self.sparse_ops = np.sum(self.mask)\n", - " \n", - " # Naive sparse computation: still do full matrix multiply\n", - " # (Real sparse implementations would use CSR/CSC formats)\n", - " output = np.dot(x, self.sparse_weights.T) + self.bias\n", - " return output\n", - " # END SOLUTION\n", - " \n", - " def forward_sparse_optimized(self, x: np.ndarray) -> np.ndarray:\n", - " \"\"\"Forward pass using optimized sparse computation.\"\"\"\n", - " # BEGIN SOLUTION\n", - " if self.sparse_weights is None:\n", - " raise ValueError(\"Weights not pruned yet\")\n", - " \n", - " # Find non-zero weights\n", - " nonzero_indices = np.nonzero(self.sparse_weights)\n", - " \n", - " # Count actual operations\n", - " self.sparse_ops = len(nonzero_indices[0])\n", - " \n", - " # Optimized sparse computation (simulated)\n", - " # In practice, this would use specialized sparse matrix libraries\n", - " output = np.zeros((x.shape[0], self.out_features))\n", - " \n", - " # Only compute for non-zero weights\n", - " for i in range(len(nonzero_indices[0])):\n", - " row = nonzero_indices[0][i]\n", - " col = nonzero_indices[1][i]\n", - " weight = self.sparse_weights[row, col]\n", - " \n", - " # Accumulate: output[batch, row] += input[batch, col] * weight\n", - " output[:, row] += x[:, col] * weight\n", - " \n", - " # Add bias\n", - " output += self.bias\n", - " \n", - " return output\n", - " # END SOLUTION\n", - " \n", - " def benchmark_speedup(self, batch_size: int = 32, iterations: int = 100) -> Dict:\n", - " \"\"\"Benchmark sparse vs dense computation speedup.\"\"\"\n", - " # BEGIN SOLUTION\n", - " import time\n", - " \n", - " # 
Create test input\n", - " x = np.random.normal(0, 1, (batch_size, self.in_features))\n", - " \n", - " # Benchmark dense forward pass\n", - " start_time = time.time()\n", - " for _ in range(iterations):\n", - " _ = self.forward_dense(x)\n", - " dense_time = time.time() - start_time\n", - " \n", - " # Benchmark sparse forward pass\n", - " start_time = time.time()\n", - " for _ in range(iterations):\n", - " _ = self.forward_sparse_naive(x)\n", - " sparse_time = time.time() - start_time\n", - " \n", - " # Calculate speedup metrics\n", - " theoretical_speedup = self.dense_ops / self.sparse_ops if self.sparse_ops > 0 else 1\n", - " actual_speedup = dense_time / sparse_time if sparse_time > 0 else 1\n", - " \n", - " return {\n", - " 'dense_time_ms': dense_time * 1000,\n", - " 'sparse_time_ms': sparse_time * 1000,\n", - " 'dense_ops': self.dense_ops,\n", - " 'sparse_ops': self.sparse_ops,\n", - " 'theoretical_speedup': theoretical_speedup,\n", - " 'actual_speedup': actual_speedup,\n", - " 'sparsity': self.sparsity,\n", - " 'efficiency': actual_speedup / theoretical_speedup\n", - " }\n", - " # END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "0ffe0018", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### Test: Sparse Neural Network Implementation\n", - "\n", - "Let's verify our sparse neural network works correctly and measure performance." 
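The note in `forward_sparse_naive` that real implementations use CSR/CSC formats can be made concrete with a hand-rolled CSR sketch. This is illustrative only; the helper `csr_matvec` and the variable names are hypothetical, not part of this module:

```python
import numpy as np

np.random.seed(0)
W = np.random.normal(0, 0.1, (4, 6))
W[np.abs(W) < 0.08] = 0.0                      # simulate magnitude pruning

# CSR-style storage: flat non-zero values, their column indices,
# and row pointers marking where each row's entries start
values, col_idx, row_ptr = [], [], [0]
for row in W:
    nz = np.nonzero(row)[0]
    values.extend(row[nz])
    col_idx.extend(nz)
    row_ptr.append(len(values))

def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = W @ x while touching only the stored non-zeros."""
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        for v, j in zip(values[row_ptr[i]:row_ptr[i + 1]],
                        col_idx[row_ptr[i]:row_ptr[i + 1]]):
            y[i] += v * x[j]
    return y

x = np.random.normal(0, 1, 6)
print(np.allclose(csr_matvec(values, col_idx, row_ptr, x), W @ x))
```

The multiply-add count here is exactly the number of stored non-zeros, which is the "theoretical speedup" the benchmark below reports; whether that translates into wall-clock speedup depends on the kernel, as the efficiency metric shows.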
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8d118ef4", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test-sparse-neural-network", - "locked": false, - "points": 15, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_sparse_neural_network():\n", - " \"\"\"Test sparse neural network implementation.\"\"\"\n", - " print(\"Testing sparse neural network...\")\n", - " \n", - " # Create sparse linear layer\n", - " sparse_layer = SparseLinear(256, 128)\n", - " \n", - " # Load random weights\n", - " np.random.seed(42)\n", - " weights = np.random.normal(0, 0.1, (128, 256))\n", - " bias = np.random.normal(0, 0.01, 128)\n", - " sparse_layer.load_dense_weights(weights, bias)\n", - " \n", - " # Prune weights\n", - " sparse_layer.prune_weights(sparsity=0.8) # 80% sparsity\n", - " \n", - " # Test forward passes\n", - " x = np.random.normal(0, 1, (4, 256)) # Batch of 4\n", - " \n", - " # Compare outputs\n", - " output_dense = sparse_layer.forward_dense(x)\n", - " output_sparse_naive = sparse_layer.forward_sparse_naive(x)\n", - " output_sparse_opt = sparse_layer.forward_sparse_optimized(x)\n", - " \n", - " print(f\"Output shapes:\")\n", - " print(f\" Dense: {output_dense.shape}\")\n", - " print(f\" Sparse naive: {output_sparse_naive.shape}\")\n", - " print(f\" Sparse optimized: {output_sparse_opt.shape}\")\n", - " \n", - " # Verify outputs have correct shape\n", - " expected_shape = (4, 128)\n", - " assert output_dense.shape == expected_shape, \"Dense output shape incorrect\"\n", - " assert output_sparse_naive.shape == expected_shape, \"Sparse naive output shape incorrect\"\n", - " assert output_sparse_opt.shape == expected_shape, \"Sparse optimized output shape incorrect\"\n", - " \n", - " # Verify sparse outputs match expected computation\n", - " # Sparse naive should match dense computation on pruned weights\n", - " np.testing.assert_allclose(\n", - " output_sparse_naive, 
output_sparse_opt, rtol=1e-5,\n", - " err_msg=\"Sparse naive and optimized should produce same results\"\n", - " )\n", - " \n", - " # The outputs shouldn't be identical (due to pruning) but should be reasonably close\n", - " relative_error = np.mean(np.abs(output_dense - output_sparse_naive)) / np.mean(np.abs(output_dense))\n", - " print(f\"Relative error from pruning: {relative_error:.3%}\")\n", - " # With 80% sparsity, relative error can be substantial but model should still function\n", - " assert relative_error < 1.0, \"Error from pruning shouldn't completely destroy the model\"\n", - " \n", - " # Benchmark performance\n", - " benchmark = sparse_layer.benchmark_speedup(batch_size=32, iterations=50)\n", - " \n", - " print(f\"\\nPerformance Benchmark:\")\n", - " print(f\" Sparsity: {benchmark['sparsity']:.1%}\")\n", - " print(f\" Dense ops: {benchmark['dense_ops']:,}\")\n", - " print(f\" Sparse ops: {benchmark['sparse_ops']:,}\")\n", - " print(f\" Theoretical speedup: {benchmark['theoretical_speedup']:.1f}x\")\n", - " print(f\" Actual speedup: {benchmark['actual_speedup']:.1f}x\")\n", - " print(f\" Efficiency: {benchmark['efficiency']:.1%}\")\n", - " \n", - " # Verify operation counting\n", - " expected_dense_ops = 256 * 128\n", - " assert benchmark['dense_ops'] == expected_dense_ops, \"Dense op count incorrect\"\n", - " assert benchmark['sparse_ops'] < benchmark['dense_ops'], \"Sparse should use fewer ops\"\n", - " \n", - " print(\"✅ Sparse neural network test passed!\")\n", - "\n", - "test_sparse_neural_network()" - ] - }, - { - "cell_type": "markdown", - "id": "e3714629", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Part 5: Model Compression Pipeline - End-to-End Pruning\n", - "\n", - "Now let's build a complete model compression pipeline that can prune entire neural networks layer by layer, maintaining the overall architecture while reducing parameters.\n", - "\n", - "### Production Compression Pipeline:\n", - 
"1. **Model analysis**: Identify pruneable layers and sensitivity\n", - "2. **Layer-wise pruning**: Apply different sparsity levels per layer\n", - "3. **Accuracy validation**: Ensure pruning doesn't degrade performance \n", - "4. **Performance benchmarking**: Measure actual compression benefits\n", - "5. **Export for deployment**: Package compressed model for inference" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4dd53ba3", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "compression-pipeline", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class ModelCompressor:\n", - " \"\"\"\n", - " Complete model compression pipeline for neural networks.\n", - " \n", - " This class implements production-ready compression workflows\n", - " that can handle complex models with mixed layer types.\n", - " \"\"\"\n", - " \n", - " def __init__(self):\n", - " # BEGIN SOLUTION\n", - " self.original_model = {}\n", - " self.compressed_model = {}\n", - " self.compression_stats = {}\n", - " self.layer_sensitivities = {}\n", - " # END SOLUTION\n", - " \n", - " def analyze_model_for_compression(self, model_weights: Dict[str, np.ndarray]) -> Dict[str, Any]:\n", - " \"\"\"\n", - " Analyze model structure to determine compression strategy.\n", - " \n", - " Args:\n", - " model_weights: Dictionary mapping layer names to weight arrays\n", - " \n", - " Returns:\n", - " analysis: Compression analysis and recommendations\n", - " \"\"\"\n", - " # BEGIN SOLUTION\n", - " analysis = {\n", - " 'layers': {},\n", - " 'total_params': 0,\n", - " 'compressible_params': 0,\n", - " 'recommendations': {}\n", - " }\n", - " \n", - " print(\"🔍 Model Compression Analysis\")\n", - " print(\"=\" * 50)\n", - " print(\"Layer | Type | Parameters | Natural Sparsity | Recommendation\")\n", - " print(\"-\" * 70)\n", - " \n", - " for layer_name, weights in 
model_weights.items():\n", - " layer_analysis = analyze_weight_redundancy(weights, f\"Layer {layer_name}\")\n", - " \n", - " # Determine layer type from shape\n", - " if len(weights.shape) == 4: # Conv layer: (out, in, H, W)\n", - " layer_type = \"Conv2D\"\n", - " recommended_sparsity = 0.6 # Conservative for conv layers\n", - " elif len(weights.shape) == 2: # Dense layer: (out, in) \n", - " layer_type = \"Dense\"\n", - " recommended_sparsity = 0.8 # Aggressive for dense layers\n", - " else:\n", - " layer_type = \"Other\"\n", - " recommended_sparsity = 0.5 # Safe default\n", - " \n", - " analysis['layers'][layer_name] = {\n", - " 'type': layer_type,\n", - " 'shape': weights.shape,\n", - " 'parameters': weights.size,\n", - " 'natural_sparsity': layer_analysis['natural_sparsity'],\n", - " 'recommended_sparsity': recommended_sparsity\n", - " }\n", - " \n", - " analysis['total_params'] += weights.size\n", - " if layer_type in ['Conv2D', 'Dense']:\n", - " analysis['compressible_params'] += weights.size\n", - " \n", - " print(f\"{layer_name:12} | {layer_type:7} | {weights.size:10,} | \"\n", - " f\"{layer_analysis['natural_sparsity']:12.1f}% | {recommended_sparsity:.0%}\")\n", - " \n", - " # Calculate overall compression potential\n", - " compression_potential = analysis['compressible_params'] / analysis['total_params']\n", - " \n", - " print(f\"\\n📊 Model Summary:\")\n", - " print(f\" Total parameters: {analysis['total_params']:,}\")\n", - " print(f\" Compressible parameters: {analysis['compressible_params']:,}\")\n", - " print(f\" Compression potential: {compression_potential:.1%}\")\n", - " \n", - " analysis['compression_potential'] = compression_potential\n", - " return analysis\n", - " # END SOLUTION\n", - " \n", - " def compress_model(self, model_weights: Dict[str, np.ndarray], \n", - " layer_sparsities: Optional[Dict[str, float]] = None) -> Dict[str, Any]:\n", - " \"\"\"\n", - " Compress entire model using layer-wise pruning.\n", - " \n", - " Args:\n", - " 
model_weights: Dictionary mapping layer names to weights\n", - " layer_sparsities: Optional per-layer sparsity targets\n", - " \n", - " Returns:\n", - " compressed_model: Compressed weights and statistics\n", - " \"\"\"\n", - " # BEGIN SOLUTION\n", - " if layer_sparsities is None:\n", - " # Use default sparsities based on layer analysis\n", - " analysis = self.analyze_model_for_compression(model_weights)\n", - " layer_sparsities = {\n", - " name: info['recommended_sparsity'] \n", - " for name, info in analysis['layers'].items()\n", - " }\n", - " \n", - " print(f\"\\n⚙️ Compressing Model Layers\")\n", - " print(\"=\" * 50)\n", - " \n", - " compressed_weights = {}\n", - " total_original_params = 0\n", - " total_remaining_params = 0\n", - " \n", - " for layer_name, weights in model_weights.items():\n", - " sparsity = layer_sparsities.get(layer_name, 0.7) # Default 70%\n", - " \n", - " print(f\"\\n🔧 Compressing {layer_name} (target: {sparsity:.0%} sparsity)...\")\n", - " \n", - " # Apply magnitude-based pruning\n", - " pruner = MagnitudePruner()\n", - " pruned_weights, mask, stats = pruner.prune(weights, sparsity)\n", - " \n", - " compressed_weights[layer_name] = {\n", - " 'weights': pruned_weights,\n", - " 'mask': mask,\n", - " 'original_shape': weights.shape,\n", - " 'stats': stats\n", - " }\n", - " \n", - " total_original_params += stats['original_params']\n", - " total_remaining_params += stats['remaining_params']\n", - " \n", - " print(f\" Sparsity achieved: {stats['actual_sparsity']:.1%}\")\n", - " print(f\" Compression: {stats['compression_ratio']:.1f}x\")\n", - " \n", - " # Calculate overall compression\n", - " overall_compression = total_original_params / total_remaining_params if total_remaining_params > 0 else 1\n", - " overall_sparsity = 1 - (total_remaining_params / total_original_params)\n", - " \n", - " self.compressed_model = compressed_weights\n", - " self.compression_stats = {\n", - " 'total_original_params': total_original_params,\n", - " 
'total_remaining_params': total_remaining_params,\n", - " 'overall_sparsity': overall_sparsity,\n", - " 'overall_compression': overall_compression,\n", - " 'layer_sparsities': layer_sparsities\n", - " }\n", - " \n", - " print(f\"\\n✅ Model Compression Complete!\")\n", - " print(f\" Original parameters: {total_original_params:,}\")\n", - " print(f\" Remaining parameters: {total_remaining_params:,}\")\n", - " print(f\" Overall sparsity: {overall_sparsity:.1%}\")\n", - " print(f\" Overall compression: {overall_compression:.1f}x\")\n", - " \n", - " return compressed_weights\n", - " # END SOLUTION\n", - " \n", - " def validate_compression_quality(self, original_weights: Dict[str, np.ndarray], \n", - " compressed_model: Dict[str, Any]) -> Dict[str, Any]:\n", - " \"\"\"\n", - " Validate that compression doesn't degrade model too much.\n", - " \n", - " This is a simplified validation - in practice you'd run full model evaluation.\n", - " \"\"\"\n", - " # BEGIN SOLUTION\n", - " validation_results = {\n", - " 'layer_quality': {},\n", - " 'overall_quality': {},\n", - " 'quality_score': 0.0\n", - " }\n", - " \n", - " print(f\"\\n✅ Validating Compression Quality\")\n", - " print(\"=\" * 50)\n", - " print(\"Layer | Weight Error | Norm Preservation | Quality\")\n", - " print(\"-\" * 55)\n", - " \n", - " layer_scores = []\n", - " \n", - " for layer_name in original_weights.keys():\n", - " original = original_weights[layer_name]\n", - " compressed_info = compressed_model[layer_name]\n", - " compressed = compressed_info['weights']\n", - " \n", - " # Calculate quality metrics\n", - " weight_diff = np.abs(original - compressed)\n", - " mean_error = weight_diff.mean()\n", - " max_error = weight_diff.max()\n", - " \n", - " # Norm preservation\n", - " orig_norm = np.linalg.norm(original)\n", - " comp_norm = np.linalg.norm(compressed)\n", - " norm_preservation = comp_norm / orig_norm if orig_norm > 0 else 1.0\n", - " \n", - " # Simple quality score (higher is better)\n", - " # Penalize 
high error, reward norm preservation\n", - " quality_score = norm_preservation * (1 - mean_error / (np.abs(original).mean() + 1e-8))\n", - " quality_score = max(0, min(1, quality_score)) # Clamp to [0, 1]\n", - " \n", - " validation_results['layer_quality'][layer_name] = {\n", - " 'mean_error': mean_error,\n", - " 'max_error': max_error,\n", - " 'norm_preservation': norm_preservation,\n", - " 'quality_score': quality_score\n", - " }\n", - " \n", - " layer_scores.append(quality_score)\n", - " \n", - " print(f\"{layer_name:12} | {mean_error:.6f} | {norm_preservation:13.3f} | {quality_score:.3f}\")\n", - " \n", - " # Overall quality\n", - " overall_quality_score = np.mean(layer_scores)\n", - " validation_results['overall_quality'] = {\n", - " 'mean_quality_score': overall_quality_score,\n", - " 'quality_std': np.std(layer_scores),\n", - " 'min_quality': np.min(layer_scores),\n", - " 'max_quality': np.max(layer_scores)\n", - " }\n", - " validation_results['quality_score'] = overall_quality_score\n", - " \n", - " print(f\"\\n🎯 Overall Quality Score: {overall_quality_score:.3f}\")\n", - " if overall_quality_score > 0.8:\n", - " print(\" ✅ Excellent compression quality!\")\n", - " elif overall_quality_score > 0.6:\n", - " print(\" ⚠️ Acceptable compression quality\") \n", - " else:\n", - " print(\" ❌ Poor compression quality - consider lower sparsity\")\n", - " \n", - " return validation_results\n", - " # END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "3f625377", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### Test: Model Compression Pipeline\n", - "\n", - "Let's verify our complete compression pipeline works on a multi-layer model." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "61b92386", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test-compression-pipeline", - "locked": false, - "points": 20, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_compression_pipeline():\n", - " \"\"\"Test complete model compression pipeline.\"\"\"\n", - " print(\"Testing model compression pipeline...\")\n", - " \n", - " # Create sample multi-layer model\n", - " np.random.seed(42)\n", - " model_weights = {\n", - " 'conv1': np.random.normal(0, 0.02, (32, 3, 3, 3)), # Conv: 32 filters, 3 input channels\n", - " 'conv2': np.random.normal(0, 0.02, (64, 32, 3, 3)), # Conv: 64 filters, 32 input channels\n", - " 'fc1': np.random.normal(0, 0.01, (512, 1024)), # Dense: 1024 → 512 (shape is (out, in))\n", - " 'fc2': np.random.normal(0, 0.01, (10, 512)), # Dense: 512 → 10 (output layer)\n", - " }\n", - " \n", - " # Create compressor\n", - " compressor = ModelCompressor()\n", - " \n", - " # Step 1: Analyze model\n", - " analysis = compressor.analyze_model_for_compression(model_weights)\n", - " \n", - " assert analysis['total_params'] > 0, \"Should count total parameters\"\n", - " assert len(analysis['layers']) == 4, \"Should analyze all 4 layers\"\n", - " assert 'conv1' in analysis['layers'], \"Should analyze conv1\"\n", - " assert 'fc1' in analysis['layers'], \"Should analyze fc1\"\n", - " \n", - " # Verify layer type detection\n", - " assert analysis['layers']['conv1']['type'] == 'Conv2D', \"Should detect conv layers\"\n", - " assert analysis['layers']['fc1']['type'] == 'Dense', \"Should detect dense layers\"\n", - " \n", - " # Step 2: Compress model with custom sparsities\n", - " custom_sparsities = {\n", - " 'conv1': 0.5, # Conservative for first conv layer\n", - " 'conv2': 0.6, # Moderate for second conv layer\n", - " 'fc1': 0.8, # Aggressive for large dense layer\n", - " 'fc2': 0.3 # Conservative for output layer\n", - " }\n", - " \n", 
- " compressed_model = compressor.compress_model(model_weights, custom_sparsities)\n", - " \n", - " # Verify compression results\n", - " assert len(compressed_model) == 4, \"Should compress all layers\"\n", - " for layer_name in model_weights.keys():\n", - " assert layer_name in compressed_model, f\"Missing compressed {layer_name}\"\n", - " compressed_info = compressed_model[layer_name]\n", - " assert 'weights' in compressed_info, \"Should have compressed weights\"\n", - " assert 'mask' in compressed_info, \"Should have pruning mask\"\n", - " assert 'stats' in compressed_info, \"Should have compression stats\"\n", - " \n", - " # Verify compression statistics\n", - " stats = compressor.compression_stats\n", - " assert stats['overall_compression'] > 2.0, \"Should achieve significant compression\"\n", - " assert 0.5 <= stats['overall_sparsity'] <= 0.8, \"Overall sparsity should be reasonable\"\n", - " \n", - " # Step 3: Validate compression quality\n", - " validation = compressor.validate_compression_quality(model_weights, compressed_model)\n", - " \n", - " assert 'layer_quality' in validation, \"Should validate each layer\"\n", - " assert 'overall_quality' in validation, \"Should have overall quality metrics\"\n", - " assert 0 <= validation['quality_score'] <= 1, \"Quality score should be normalized\"\n", - " \n", - " # Each layer should have quality metrics\n", - " for layer_name in model_weights.keys():\n", - " assert layer_name in validation['layer_quality'], f\"Missing quality for {layer_name}\"\n", - " layer_quality = validation['layer_quality'][layer_name]\n", - " assert 'norm_preservation' in layer_quality, \"Should measure norm preservation\"\n", - " assert layer_quality['norm_preservation'] > 0, \"Norm preservation should be positive\"\n", - " \n", - " # Test that compressed weights are actually sparse\n", - " for layer_name, compressed_info in compressed_model.items():\n", - " compressed_weights = compressed_info['weights']\n", - " sparsity = 
np.sum(compressed_weights == 0) / compressed_weights.size\n", - " expected_sparsity = custom_sparsities[layer_name]\n", - " \n", - " # Allow some tolerance in sparsity\n", - " assert abs(sparsity - expected_sparsity) < 0.1, f\"{layer_name} sparsity mismatch\"\n", - " \n", - " print(\"✅ Model compression pipeline test passed!\")\n", - "\n", - "test_compression_pipeline()" - ] - }, - { - "cell_type": "markdown", - "id": "3a61f4c6", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Part 6: Systems Analysis - Memory, Performance, and Deployment Impact\n", - "\n", - "Let's analyze compression from a systems engineering perspective, measuring the real-world impact on memory usage, inference speed, and deployment scenarios.\n", - "\n", - "### ML Systems Analysis: Why Pruning Enables Edge AI\n", - "\n", - "**Memory Complexity**: O(N × sparsity) storage reduction where N = original parameters\n", - "**Computational Complexity**: Theoretical O(N × sparsity) speedup, actual depends on hardware\n", - "**Cache Efficiency**: Smaller models fit in cache, reducing memory bandwidth bottlenecks \n", - "**Energy Efficiency**: Fewer operations = lower power consumption for mobile devices\n", - "**Deployment Enablement**: Makes models fit where they couldn't before" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1afc2887", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "compression-systems-analysis", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "def profile_compression_memory():\n", - " \"\"\"\n", - " Profile memory usage patterns during model compression.\n", - " \n", - " This function demonstrates how compression affects memory footprint\n", - " and enables deployment on resource-constrained devices.\n", - " \"\"\"\n", - " import tracemalloc\n", - " \n", - " print(\"🔬 Memory 
Profiling: Model Compression\")\n", - " print(\"=\" * 50)\n", - " \n", - " # Start memory tracking\n", - " tracemalloc.start()\n", - " \n", - " # Create large model (simulating real CNN)\n", - " print(\"Creating large model weights...\")\n", - " model_weights = {\n", - " 'conv1': np.random.normal(0, 0.02, (128, 64, 3, 3)), # ~74K parameters\n", - " 'conv2': np.random.normal(0, 0.02, (256, 128, 3, 3)), # ~0.3M parameters \n", - " 'fc1': np.random.normal(0, 0.01, (1024, 4096)), # ~4.2M parameters\n", - " 'fc2': np.random.normal(0, 0.01, (10, 1024)), # ~10K parameters\n", - " }\n", - " \n", - " snapshot1 = tracemalloc.take_snapshot()\n", - " current, peak = tracemalloc.get_traced_memory()\n", - " print(f\"After model creation: {current / 1024 / 1024:.1f} MB current, {peak / 1024 / 1024:.1f} MB peak\")\n", - " \n", - " # Calculate original model size\n", - " original_params = sum(w.size for w in model_weights.values())\n", - " original_size_mb = sum(w.nbytes for w in model_weights.values()) / (1024 * 1024)\n", - " \n", - " print(f\"Original model: {original_params:,} parameters, {original_size_mb:.1f} MB\")\n", - " \n", - " # Compress model\n", - " print(\"\\nCompressing model...\")\n", - " compressor = ModelCompressor()\n", - " compressed_model = compressor.compress_model(model_weights)\n", - " \n", - " snapshot2 = tracemalloc.take_snapshot()\n", - " current, peak = tracemalloc.get_traced_memory()\n", - " print(f\"After compression: {current / 1024 / 1024:.1f} MB current, {peak / 1024 / 1024:.1f} MB peak\")\n", - " \n", - " # Calculate compressed model size\n", - " compressed_params = sum(\n", - " np.sum(info['weights'] != 0) \n", - " for info in compressed_model.values()\n", - " )\n", - " \n", - " # Estimate compressed storage (could use sparse formats)\n", - " compressed_size_mb = original_size_mb * (compressed_params / original_params)\n", - " \n", - " print(f\"\\n💾 Storage Analysis:\")\n", - " print(f\" Original: {original_params:,} parameters 
({original_size_mb:.1f} MB)\")\n", - " print(f\" Compressed: {compressed_params:,} parameters ({compressed_size_mb:.1f} MB)\")\n", - " print(f\" Compression ratio: {original_params / compressed_params:.1f}x\")\n", - " print(f\" Size reduction: {original_size_mb / compressed_size_mb:.1f}x\")\n", - " print(f\" Storage savings: {original_size_mb - compressed_size_mb:.1f} MB\")\n", - " \n", - " tracemalloc.stop()\n", - " \n", - " return {\n", - " 'original_params': original_params,\n", - " 'compressed_params': compressed_params,\n", - " 'original_size_mb': original_size_mb,\n", - " 'compressed_size_mb': compressed_size_mb,\n", - " 'compression_ratio': original_params / compressed_params,\n", - " 'size_reduction': original_size_mb / compressed_size_mb\n", - " }\n", - "\n", - "def analyze_deployment_scenarios():\n", - " \"\"\"Analyze how compression enables different deployment scenarios.\"\"\"\n", - " print(\"\\n🚀 Compression Deployment Impact Analysis\")\n", - " print(\"=\" * 60)\n", - " \n", - " # Define deployment constraints\n", - " scenarios = [\n", - " {\n", - " 'name': 'Mobile Phone',\n", - " 'memory_limit_mb': 100,\n", - " 'compute_limit_gflops': 10,\n", - " 'power_sensitive': True,\n", - " 'description': 'On-device inference for camera apps'\n", - " },\n", - " {\n", - " 'name': 'IoT Device',\n", - " 'memory_limit_mb': 20,\n", - " 'compute_limit_gflops': 1,\n", - " 'power_sensitive': True,\n", - " 'description': 'Smart sensor with microcontroller'\n", - " },\n", - " {\n", - " 'name': 'Edge Server',\n", - " 'memory_limit_mb': 1000,\n", - " 'compute_limit_gflops': 100,\n", - " 'power_sensitive': False,\n", - " 'description': 'Local inference server for privacy'\n", - " },\n", - " {\n", - " 'name': 'Wearable',\n", - " 'memory_limit_mb': 10,\n", - " 'compute_limit_gflops': 0.5,\n", - " 'power_sensitive': True,\n", - " 'description': 'Smartwatch health monitoring'\n", - " }\n", - " ]\n", - " \n", - " # Model sizes at different compression levels\n", - " model_configs 
= [\n", - " {'name': 'Dense Model', 'size_mb': 200, 'gflops': 50, 'accuracy': 95.0},\n", - " {'name': '50% Sparse', 'size_mb': 100, 'gflops': 25, 'accuracy': 94.5},\n", - " {'name': '70% Sparse', 'size_mb': 60, 'gflops': 15, 'accuracy': 93.8},\n", - " {'name': '90% Sparse', 'size_mb': 20, 'gflops': 5, 'accuracy': 91.2},\n", - " ]\n", - " \n", - " print(\"Scenario | Memory | Compute | Dense | 50% | 70% | 90% | Best Option\")\n", - " print(\"-\" * 80)\n", - " \n", - " for scenario in scenarios:\n", - " name = scenario['name']\n", - " mem_limit = scenario['memory_limit_mb']\n", - " compute_limit = scenario['compute_limit_gflops']\n", - " \n", - " # Check which model configurations fit\n", - " viable_models = []\n", - " for config in model_configs:\n", - " fits_memory = config['size_mb'] <= mem_limit\n", - " fits_compute = config['gflops'] <= compute_limit\n", - " \n", - " if fits_memory and fits_compute:\n", - " viable_models.append(config['name'])\n", - " \n", - " # Determine best option\n", - " if not viable_models:\n", - " best_option = \"None fit!\"\n", - " else:\n", - " # Choose highest accuracy among viable options\n", - " viable_configs = [c for c in model_configs if c['name'] in viable_models]\n", - " best_config = max(viable_configs, key=lambda x: x['accuracy'])\n", - " best_option = f\"{best_config['name']} ({best_config['accuracy']:.1f}%)\"\n", - " \n", - " # Show fit status for each compression level\n", - " fit_status = []\n", - " for config in model_configs:\n", - " fits_mem = config['size_mb'] <= mem_limit\n", - " fits_comp = config['gflops'] <= compute_limit\n", - " if fits_mem and fits_comp:\n", - " status = \"✅\"\n", - " elif fits_mem:\n", - " status = \"⚡\" # Memory OK, compute too high\n", - " elif fits_comp:\n", - " status = \"💾\" # Compute OK, memory too high\n", - " else:\n", - " status = \"❌\"\n", - " fit_status.append(status)\n", - " \n", - " print(f\"{name:14} | {mem_limit:4d}MB | {compute_limit:5.1f}G | \"\n", - " f\"{fit_status[0]:3} | 
{fit_status[1]:3} | {fit_status[2]:3} | {fit_status[3]:3} | {best_option}\")\n", - " \n", - " print(f\"\\n💡 Key Insights:\")\n", - " print(f\" • Compression often determines deployment feasibility\")\n", - " print(f\" • Edge devices require 70-90% sparsity for deployment\")\n", - " print(f\" • Mobile devices can use moderate compression (50-70%)\")\n", - " print(f\" • Power constraints favor sparse models (fewer operations)\")\n", - " print(f\" • Memory limits are often more restrictive than compute limits\")\n", - "\n", - "def benchmark_sparse_inference_speedup():\n", - " \"\"\"Benchmark actual vs theoretical speedup from sparsity.\"\"\"\n", - " print(\"\\n⚡ Sparse Inference Speedup Analysis\")\n", - " print(\"=\" * 50)\n", - " \n", - " import time\n", - " \n", - " # Test different model sizes and sparsity levels\n", - " configs = [\n", - " {'size': (256, 512), 'sparsity': 0.5},\n", - " {'size': (512, 1024), 'sparsity': 0.7},\n", - " {'size': (1024, 2048), 'sparsity': 0.8},\n", - " {'size': (2048, 4096), 'sparsity': 0.9},\n", - " ]\n", - " \n", - " print(\"Model Size | Sparsity | Theoretical | Actual | Efficiency | Notes\")\n", - " print(\"-\" * 70)\n", - " \n", - " for config in configs:\n", - " size = config['size']\n", - " sparsity = config['sparsity']\n", - " \n", - " # Create sparse layer\n", - " sparse_layer = SparseLinear(size[0], size[1])\n", - " \n", - " # Load and prune weights\n", - " weights = np.random.normal(0, 0.1, (size[1], size[0]))\n", - " sparse_layer.load_dense_weights(weights)\n", - " sparse_layer.prune_weights(sparsity)\n", - " \n", - " # Benchmark\n", - " benchmark = sparse_layer.benchmark_speedup(batch_size=16, iterations=100)\n", - " \n", - " theoretical = benchmark['theoretical_speedup']\n", - " actual = benchmark['actual_speedup'] \n", - " efficiency = benchmark['efficiency']\n", - " \n", - " # Determine bottleneck\n", - " if efficiency > 0.8:\n", - " notes = \"CPU bound\"\n", - " elif efficiency > 0.5:\n", - " notes = \"Memory 
bound\"\n", - " else:\n", - " notes = \"Framework overhead\"\n", - " \n", - " print(f\"{size[0]}x{size[1]:4} | {sparsity:6.0%} | {theoretical:9.1f}x | \"\n", - " f\"{actual:5.1f}x | {efficiency:8.1%} | {notes}\")\n", - " \n", - " print(f\"\\n🎯 Speedup Reality Check:\")\n", - " print(f\" • Theoretical speedup assumes perfect sparse hardware\")\n", - " print(f\" • Actual speedup limited by memory bandwidth and overhead\")\n", - " print(f\" • High sparsity (>80%) shows diminishing returns\") \n", - " print(f\" • Production sparse hardware (GPUs, TPUs) achieve better efficiency\")" - ] - }, - { - "cell_type": "markdown", - "id": "a528a133", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### Test: Systems Analysis Implementation\n", - "\n", - "Let's verify our systems analysis provides valuable performance insights." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "95340fc7", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test-systems-analysis", - "locked": false, - "points": 10, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_systems_analysis():\n", - " \"\"\"Test systems analysis and profiling functions.\"\"\"\n", - " print(\"Testing systems analysis...\")\n", - " \n", - " # Test memory profiling\n", - " memory_results = profile_compression_memory()\n", - " assert memory_results['compression_ratio'] > 2.0, \"Should show significant compression\"\n", - " assert memory_results['original_size_mb'] > memory_results['compressed_size_mb'], \"Should reduce size\"\n", - " \n", - " # Test deployment analysis\n", - " analyze_deployment_scenarios()\n", - " \n", - " # Test speedup benchmarking\n", - " benchmark_sparse_inference_speedup()\n", - " \n", - " # All functions should run without errors\n", - " print(\"✅ Systems analysis test passed!\")\n", - "\n", - "test_systems_analysis()" - ] - }, - { - "cell_type": "markdown", - "id": 
"f9419421", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Part 7: Production Context - Real-World Pruning Systems\n", - "\n", - "Let's explore how pruning is used in production ML systems and connect our implementation to real frameworks and deployment platforms.\n", - "\n", - "### Production Pruning Systems:\n", - "1. **PyTorch Pruning**: `torch.nn.utils.prune` for magnitude and structured pruning\n", - "2. **TensorFlow Model Optimization**: Pruning API with gradual sparsity\n", - "3. **NVIDIA TensorRT**: Structured pruning for inference acceleration\n", - "4. **OpenVINO**: Intel's optimization toolkit with pruning support\n", - "5. **Edge TPU**: Google's quantization + pruning for mobile inference\n", - "6. **Apple Neural Engine**: Hardware-accelerated sparse computation" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b61b9874", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "production-context", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "def compare_with_production_pruning():\n", - " \"\"\"\n", - " Compare our implementation with production pruning systems.\n", - " \n", - " This function explains how real ML frameworks handle pruning\n", - " and where our implementation fits in the broader ecosystem.\n", - " \"\"\"\n", - " print(\"🏭 Production Pruning Systems Comparison\")\n", - " print(\"=\" * 70)\n", - " \n", - " frameworks = {\n", - " 'PyTorch': {\n", - " 'pruning_methods': ['Magnitude', 'Random', 'Structured', 'Custom'],\n", - " 'sparsity_support': ['Unstructured', 'Structured (channel)', '2:4 sparsity'],\n", - " 'deployment': 'TorchScript, ONNX export with sparse ops',\n", - " 'hardware_acceleration': 'Limited - mostly research focused',\n", - " 'our_similarity': 'High - similar magnitude-based approach'\n", - " },\n", - " 'TensorFlow': {\n", - " 'pruning_methods': 
['Magnitude', 'Gradual', 'Structured'],\n", - " 'sparsity_support': ['Unstructured', 'Block sparse', 'Structured'],\n", - " 'deployment': 'TensorFlow Lite with sparse inference',\n", - " 'hardware_acceleration': 'XLA optimization, mobile acceleration',\n", - " 'our_similarity': 'High - magnitude pruning with calibration'\n", - " },\n", - " 'TensorRT': {\n", - " 'pruning_methods': ['Structured only', 'Channel pruning'],\n", - " 'sparsity_support': ['2:4 structured sparsity', 'Channel removal'],\n", - " 'deployment': 'Optimized inference engine with sparse kernels',\n", - " 'hardware_acceleration': 'GPU Tensor Cores, specialized sparse ops',\n", - " 'our_similarity': 'Medium - focuses on structured pruning'\n", - " },\n", - " 'OpenVINO': {\n", - " 'pruning_methods': ['Magnitude', 'Structured', 'Mixed precision'],\n", - " 'sparsity_support': ['Unstructured', 'Block sparse', 'Channel wise'],\n", - " 'deployment': 'Intel CPU/GPU optimization with sparse support',\n", - " 'hardware_acceleration': 'Intel VPU, CPU vectorization',\n", - " 'our_similarity': 'High - comprehensive pruning toolkit'\n", - " },\n", - " 'Our TinyTorch': {\n", - " 'pruning_methods': ['Magnitude-based', 'Structured filter pruning'],\n", - " 'sparsity_support': ['Unstructured', 'Structured (filter removal)'],\n", - " 'deployment': 'Educational sparse computation simulation',\n", - " 'hardware_acceleration': 'Educational - simulated speedups',\n", - " 'our_similarity': 'Reference implementation for learning'\n", - " }\n", - " }\n", - " \n", - " print(\"Framework | Methods | Hardware Support | Deployment | Similarity\")\n", - " print(\"-\" * 70)\n", - " \n", - " for name, specs in frameworks.items():\n", - " methods_str = specs['pruning_methods'][0] # Primary method\n", - " hw_str = specs['hardware_acceleration'][:20] + \"...\" if len(specs['hardware_acceleration']) > 20 else specs['hardware_acceleration']\n", - " deploy_str = specs['deployment'][:20] + \"...\" if len(specs['deployment']) > 20 else 
specs['deployment']\n", - " sim_str = specs['our_similarity'][:15] + \"...\" if len(specs['our_similarity']) > 15 else specs['our_similarity']\n", - " \n", - " print(f\"{name:9} | {methods_str:12} | {hw_str:16} | {deploy_str:12} | {sim_str}\")\n", - " \n", - " print(f\"\\n🎯 Key Production Insights:\")\n", - " print(f\" • Our magnitude approach is industry standard\")\n", - " print(f\" • Production systems emphasize structured pruning for hardware\")\n", - " print(f\" • Real frameworks integrate pruning with quantization\")\n", - " print(f\" • Hardware acceleration requires specialized sparse kernels\")\n", - " print(f\" • Mobile deployment drives most production pruning adoption\")\n", - "\n", - "def demonstrate_pruning_applications():\n", - " \"\"\"Show real-world applications where pruning enables deployment.\"\"\"\n", - " print(\"\\n🌟 Real-World Pruning Applications\")\n", - " print(\"=\" * 50)\n", - " \n", - " applications = [\n", - " {\n", - " 'domain': 'Mobile Photography',\n", - " 'model': 'Portrait segmentation CNN',\n", - " 'constraints': '< 10MB, < 100ms inference',\n", - " 'pruning_strategy': '70% unstructured + quantization',\n", - " 'outcome': 'Real-time portrait mode on phone cameras',\n", - " 'example': 'Google Pixel, iPhone portrait mode'\n", - " },\n", - " {\n", - " 'domain': 'Autonomous Vehicles', \n", - " 'model': 'Object detection (YOLO)',\n", - " 'constraints': '< 500MB, < 50ms inference, safety critical',\n", - " 'pruning_strategy': '50% structured pruning for latency',\n", - " 'outcome': 'Real-time object detection for ADAS',\n", - " 'example': 'Tesla FSD, Waymo perception stack'\n", - " },\n", - " {\n", - " 'domain': 'Smart Home',\n", - " 'model': 'Voice keyword detection',\n", - " 'constraints': '< 1MB, always-on, battery powered',\n", - " 'pruning_strategy': '90% sparsity + 8-bit quantization',\n", - " 'outcome': 'Always-listening wake word detection',\n", - " 'example': 'Alexa, Google Assistant edge processing'\n", - " },\n", - " {\n", - 
" 'domain': 'Medical Imaging',\n", - " 'model': 'X-ray diagnosis CNN',\n", - " 'constraints': 'Edge deployment, <1GB memory',\n", - " 'pruning_strategy': '60% structured pruning + knowledge distillation',\n", - " 'outcome': 'Portable medical AI for remote clinics',\n", - " 'example': 'Google AI for radiology, Zebra Medical'\n", - " },\n", - " {\n", - " 'domain': 'Augmented Reality',\n", - " 'model': 'Hand tracking and gesture recognition',\n", - " 'constraints': '< 50MB, 60fps, mobile GPU',\n", - " 'pruning_strategy': 'Channel pruning + mobile-optimized architecture',\n", - " 'outcome': 'Real-time hand tracking for AR experiences',\n", - " 'example': 'Apple ARKit, Google ARCore, Meta Quest'\n", - " }\n", - " ]\n", - " \n", - " print(\"Domain | Model Type | Pruning Strategy | Outcome\")\n", - " print(\"-\" * 75)\n", - " \n", - " for app in applications:\n", - " domain_str = app['domain'][:18]\n", - " model_str = app['model'][:15] + \"...\" if len(app['model']) > 15 else app['model']\n", - " strategy_str = app['pruning_strategy'][:20] + \"...\" if len(app['pruning_strategy']) > 20 else app['pruning_strategy']\n", - " outcome_str = app['outcome'][:25] + \"...\" if len(app['outcome']) > 25 else app['outcome']\n", - " \n", - " print(f\"{domain_str:18} | {model_str:10} | {strategy_str:16} | {outcome_str}\")\n", - " print(f\" Example: {app['example']}\")\n", - " print()\n", - " \n", - " print(\"💡 Common Patterns in Production Pruning:\")\n", - " print(\" • Latency-critical apps use structured pruning (regular sparsity)\") \n", - " print(\" • Memory-constrained devices use aggressive unstructured pruning\")\n", - " print(\" • Safety-critical systems use conservative pruning with validation\")\n", - " print(\" • Mobile apps combine pruning + quantization for maximum compression\")\n", - " print(\" • Edge AI enables privacy (on-device processing) through compression\")" - ] - }, - { - "cell_type": "markdown", - "id": "6a6e6296", - "metadata": { - "cell_marker": "\"\"\"", - 
"lines_to_next_cell": 1 - }, - "source": [ - "### Test: Production Context Analysis\n", - "\n", - "Let's verify our production context analysis provides valuable insights." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "34c025b2", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test-production-context", - "locked": false, - "points": 5, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_production_context():\n", - " \"\"\"Test production context analysis.\"\"\"\n", - " print(\"Testing production context analysis...\")\n", - " \n", - " # Test framework comparison\n", - " compare_with_production_pruning()\n", - " \n", - " # Test applications demonstration\n", - " demonstrate_pruning_applications()\n", - " \n", - " # Both functions should run without errors and provide insights\n", - " print(\"✅ Production context analysis test passed!\")\n", - "\n", - "test_production_context()" - ] - }, - { - "cell_type": "markdown", - "id": "33bb80cd", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Comprehensive Testing\n", - "\n", - "Let's run a comprehensive test of all compression functionality to ensure everything works together correctly." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2898e405", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "comprehensive-testing", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def run_all_tests():\n", - " \"\"\"Run comprehensive test suite for compression module.\"\"\"\n", - " print(\"🧪 Running Comprehensive Compression Test Suite\")\n", - " print(\"=\" * 60)\n", - " \n", - " test_functions = [\n", - " (\"Weight Redundancy Analysis\", test_redundancy_analysis),\n", - " (\"Magnitude-Based Pruning\", test_magnitude_pruning),\n", - " (\"Structured Pruning\", test_structured_pruning),\n", - " (\"Sparse Neural Network\", test_sparse_neural_network),\n", - " (\"Model Compression Pipeline\", test_compression_pipeline),\n", - " (\"Systems Analysis\", test_systems_analysis),\n", - " (\"Production Context\", test_production_context)\n", - " ]\n", - " \n", - " passed = 0\n", - " total = len(test_functions)\n", - " \n", - " for test_name, test_func in test_functions:\n", - " print(f\"\\n{'='*20} {test_name} {'='*20}\")\n", - " try:\n", - " test_func()\n", - " print(f\"✅ {test_name}: PASSED\")\n", - " passed += 1\n", - " except Exception as e:\n", - " print(f\"❌ {test_name}: FAILED - {e}\")\n", - " \n", - " print(f\"\\n🎯 Test Results: {passed}/{total} tests passed\")\n", - " \n", - " if passed == total:\n", - " print(\"🎉 All compression tests passed! 
Module implementation complete.\")\n", - " \n", - " # Show final demo\n", - " print(f\"\\n🚀 Final Compression Demo:\")\n", - " print(\"=\" * 50)\n", - " \n", - " # Create a realistic model and compress it\n", - " np.random.seed(42)\n", - " demo_model = {\n", - " 'backbone_conv': np.random.normal(0, 0.02, (128, 64, 3, 3)),\n", - " 'classifier_fc': np.random.normal(0, 0.01, (10, 2048)),\n", - " }\n", - " \n", - " compressor = ModelCompressor()\n", - " compressed = compressor.compress_model(demo_model, {'backbone_conv': 0.7, 'classifier_fc': 0.8})\n", - " \n", - " original_params = sum(w.size for w in demo_model.values())\n", - " compressed_params = sum(np.sum(info['weights'] != 0) for info in compressed.values())\n", - " \n", - " print(f\"🎯 FINAL RESULT:\")\n", - " print(f\" Original model: {original_params:,} parameters\")\n", - " print(f\" Compressed model: {compressed_params:,} parameters\")\n", - " print(f\" Compression achieved: {original_params/compressed_params:.1f}x smaller\")\n", - " print(f\" Size reduction: {(1-compressed_params/original_params)*100:.1f}% of parameters removed\")\n", - " print(f\" ✅ Ready for edge deployment!\")\n", - " \n", - " else:\n", - " print(f\"⚠️ {total - passed} tests failed. Review implementation.\")\n", - "\n", - "if __name__ == \"__main__\":\n", - " run_all_tests()" - ] - }, - { - "cell_type": "markdown", - "id": "016ded8e", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🤔 ML Systems Thinking: Interactive Questions\n", - "\n", - "Now that you've implemented neural network pruning, let's reflect on the systems engineering principles and production deployment considerations.\n", - "\n", - "**Instructions**: Think through these questions based on your implementation experience. Consider both the technical details and the broader systems implications." 
- ] - }, - { - "cell_type": "markdown", - "id": "7464a149", - "metadata": { - "cell_marker": "\"\"\"", - "nbgrader": { - "grade": true, - "grade_id": "systems-thinking-1", - "locked": false, - "points": 10, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "source": [ - "**Question 1: Pruning Strategy Analysis**\n", - "\n", - "You implemented both magnitude-based and structured pruning in your `MagnitudePruner` and `prune_conv_filters()` functions:\n", - "\n", - "a) Why does magnitude-based pruning work so well for neural networks? What does the effectiveness of this simple heuristic tell us about neural network weight distributions?\n", - "\n", - "b) In your structured vs unstructured comparison, structured pruning achieved lower compression ratios but is preferred for deployment. Explain this tradeoff in terms of hardware efficiency and inference speed.\n", - "\n", - "c) Your compression pipeline used different sparsity targets per layer (conv: 60%, dense: 80%). Why do dense layers typically tolerate higher sparsity than convolutional layers?\n", - "\n", - "**Your Answer:**\n", - "\n", - "\n", - "a) Magnitude-based pruning works because:\n", - "- Neural networks exhibit natural redundancy with many small, unimportant weights\n", - "- Weight magnitude correlates with importance - small weights contribute little to output\n", - "- Networks are over-parametrized, so removing low-magnitude weights has minimal accuracy impact\n", - "- The success reveals that weight distributions have long tails - most weights are small, few are large\n", - "- This natural sparsity suggests networks learn efficient representations despite overparametrization\n", - "\n", - "b) The structured vs unstructured tradeoff:\n", - "- Unstructured: Higher compression (removes individual weights) but irregular sparsity patterns\n", - "- Structured: Lower compression (removes entire filters/channels) but regular, hardware-friendly patterns\n", - "- Hardware prefers structured 
because: dense computation on smaller tensors is faster than sparse computation\n", - "- Memory access: structured removal reduces tensor sizes, improving cache efficiency\n", - "- No need for specialized sparse kernels - can use standard GEMM operations\n", - "- Inference speed: structured pruning provides actual speedup, unstructured often theoretical only\n", - "\n", - "c) Layer-specific sparsity tolerance:\n", - "- Dense layers: High redundancy, many parameters, more overparametrized → tolerate 80% sparsity\n", - "- Conv layers: Fewer parameters, each filter captures important spatial features → more sensitive\n", - "- First layers: Extract low-level features (edges, textures) → very sensitive to pruning\n", - "- Later layers: More abstract features with redundancy → can handle moderate pruning\n", - "- Output layers: Critical for final predictions → require conservative pruning\n", - "" - ] - }, - { - "cell_type": "markdown", - "id": "51c856b6", - "metadata": { - "cell_marker": "\"\"\"", - "nbgrader": { - "grade": true, - "grade_id": "systems-thinking-2", - "locked": false, - "points": 10, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "source": [ - "**Question 2: Sparse Computation and Hardware Efficiency**\n", - "\n", - "Your `SparseLinear` class demonstrated the challenges of actually accelerating sparse computation:\n", - "\n", - "a) Why did your sparse computation benchmarks show lower actual speedup compared to theoretical speedup? What are the main bottlenecks preventing sparse computation from achieving theoretical gains?\n", - "\n", - "b) In your deployment analysis, mobile devices required 70-90% sparsity while edge servers could use 50%. Explain how hardware constraints drive pruning requirements differently across deployment targets.\n", - "\n", - "c) You found that structured pruning provides better real-world performance than unstructured pruning. 
How would you design a neural network architecture that's naturally \"pruning-friendly\" from the start?\n", - "\n", - "**Your Answer:**\n", - "\n", - "\n", - "a) Lower actual speedup due to multiple bottlenecks:\n", - "- Memory bandwidth: Sparse computation is often memory-bound, not compute-bound\n", - "- Framework overhead: PyTorch/NumPy not optimized for arbitrary sparsity patterns\n", - "- Cache inefficiency: Irregular sparse patterns hurt cache locality compared to dense operations\n", - "- Vectorization loss: SIMD instructions work best on dense, regular data patterns\n", - "- Index overhead: Storing and accessing sparse indices adds computational cost\n", - "- Hardware mismatch: Most CPUs/GPUs optimized for dense linear algebra, not sparse\n", - "\n", - "b) Hardware-driven pruning requirements:\n", - "- Mobile: Strict memory (4GB total), battery, thermal constraints → need aggressive 70-90% sparsity\n", - "- Edge servers: More memory (16GB+), power, cooling → moderate 50% sparsity sufficient\n", - "- Cloud: Abundant resources → pruning for cost optimization, not necessity\n", - "- Embedded/IoT: Extreme constraints (MB not GB) → need structured pruning + quantization\n", - "- Different hardware accelerators: Edge TPU loves sparsity, standard GPUs don't benefit much\n", - "\n", - "c) Pruning-friendly architecture design:\n", - "- Use more, smaller layers rather than fewer, large layers (easier to prune entire channels)\n", - "- Design with skip connections (allows aggressive pruning of individual branches)\n", - "- Separate feature extraction from classification (different pruning sensitivities)\n", - "- Use group convolutions (natural structured pruning boundaries)\n", - "- Design with mobile-first mindset (efficient from start, not compressed afterward)\n", - "- Consider lottery ticket initialization (start with good sparse subnetwork)\n", - "" - ] - }, - { - "cell_type": "markdown", - "id": "6e6209ca", - "metadata": { - "cell_marker": "\"\"\"", - 
"nbgrader": { - "grade": true, - "grade_id": "systems-thinking-3", - "locked": false, - "points": 10, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "source": [ - "**Question 3: Model Compression Pipeline and Production Deployment**\n", - "\n", - "Your `ModelCompressor` implemented a complete compression pipeline with analysis, compression, and validation:\n", - "\n", - "a) Your pipeline analyzed each layer to recommend sparsity levels. In production deployment, how would you extend this to handle dynamic workloads where the optimal sparsity might change based on accuracy requirements or latency constraints?\n", - "\n", - "b) You implemented quality validation by comparing weight preservation. But in production, what matters is end-to-end accuracy and latency. How would you design a compression validation system that ensures deployment success?\n", - "\n", - "c) Looking at your production applications analysis, why is pruning often combined with other optimizations (quantization, knowledge distillation) rather than used alone? 
What are the complementary benefits?\n", - "\n", - "**Your Answer:**\n", - "\n", - "\n", - "a) Dynamic compression for production:\n", - "- A/B testing framework: gradually adjust sparsity based on accuracy metrics in production\n", - "- Multi-model serving: maintain models at different compression levels (70%, 80%, 90% sparse)\n", - "- Dynamic switching: use less compressed models during high-accuracy periods, more during low-latency needs\n", - "- Feedback loop: monitor accuracy degradation and automatically adjust compression\n", - "- User-specific models: different compression for different user segments or use cases\n", - "- Time-based adaptation: more compression during peak load, less during quality-critical periods\n", - "- Canary deployments: test compression changes on small traffic percentage first\n", - "\n", - "b) End-to-end validation system:\n", - "- Task-specific metrics: measure final accuracy, F1, BLEU - whatever matters for the application\n", - "- Latency benchmarking: measure actual inference time on target hardware\n", - "- A/B testing: compare compressed vs uncompressed models on real user traffic\n", - "- Regression testing: ensure compression doesn't break edge cases or specific inputs\n", - "- Hardware-specific validation: test on actual deployment hardware, not just development machines\n", - "- Load testing: verify performance under realistic concurrent inference loads\n", - "- Accuracy monitoring: continuous validation in production with automatic rollback triggers\n", - "\n", - "c) Why pruning is combined with other optimizations:\n", - "- Pruning + quantization: attack both parameter count and parameter size (4x * 4x = 16x compression)\n", - "- Pruning + knowledge distillation: maintain accuracy while compressing (teacher-student training)\n", - "- Complementary bottlenecks: pruning reduces compute, quantization reduces memory bandwidth\n", - "- Different deployment needs: mobile needs both size and speed, cloud needs cost
optimization\n", - "- Diminishing returns: 90% pruning alone may hurt accuracy, but 70% pruning + quantization achieves same compression with better accuracy\n", - "- Hardware optimization: different techniques work better on different hardware (GPU vs mobile CPU)\n", - "" - ] - }, - { - "cell_type": "markdown", - "id": "a3584d5f", - "metadata": { - "cell_marker": "\"\"\"", - "nbgrader": { - "grade": true, - "grade_id": "systems-thinking-4", - "locked": false, - "points": 10, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "source": [ - "**Question 4: Edge AI and Deployment Enablement**\n", - "\n", - "Based on your systems analysis and deployment scenarios:\n", - "\n", - "a) Your memory profiling showed that pruning enables deployment where dense models won't fit. But pruning also changes the computational characteristics of models. How does this affect the entire ML systems stack, from training to serving?\n", - "\n", - "b) In your production applications analysis, you saw pruning enabling privacy-preserving on-device AI. Explain how compression techniques like pruning change the fundamental economics and capabilities of AI deployment.\n", - "\n", - "c) Looking forward, how do you think the relationship between model architectures, hardware capabilities, and compression techniques will evolve? 
What are the implications for ML systems engineering?\n", - "\n", - "**Your Answer:**\n", - "\n", - "\n", - "a) Pruning affects the entire ML systems stack:\n", - "- Training: Need pruning-aware training, gradual sparsity increases, specialized optimizers\n", - "- Model versioning: Track both dense and compressed versions, compression parameters\n", - "- Serving infrastructure: Need sparse computation support, different batching strategies\n", - "- Monitoring: Different performance characteristics, need sparsity-aware metrics\n", - "- Debugging: Sparse models behave differently, need specialized debugging tools\n", - "- Hardware utilization: Lower compute utilization but different memory access patterns\n", - "- Load balancing: Sparse models have different latency profiles, affects request routing\n", - "\n", - "b) Compression changes AI deployment economics:\n", - "- Democratizes AI: Enables AI on devices that couldn't run dense models (phones, IoT, wearables)\n", - "- Privacy transformation: On-device processing eliminates need to send data to cloud\n", - "- Cost structure shift: Reduces cloud compute costs, shifts processing to edge devices\n", - "- Latency improvement: Local processing eliminates network round-trips\n", - "- Offline capability: Compressed models enable AI without internet connectivity\n", - "- Market expansion: Creates new use cases impossible with cloud-only AI\n", - "- Energy efficiency: Critical for battery-powered devices, enables always-on AI\n", - "\n", - "c) Future evolution predictions:\n", - "- Hardware-software co-design: Chips designed specifically for sparse computation (like Edge TPU)\n", - "- Architecture evolution: Networks designed for compression from scratch, not post-hoc optimization\n", - "- Automatic compression: ML systems that automatically find optimal compression for deployment targets\n", - "- Dynamic compression: Models that adapt compression level based on runtime constraints\n", - "- Compression-aware training: 
End-to-end training that considers deployment constraints\n", - "- Standardization: Common sparse formats and APIs across frameworks and hardware\n", - "- New paradigms: Mixture of experts, early exit networks - architecturally sparse models\n", - "- The future is compression-first design, not compression as afterthought\n", - "" - ] - }, - { - "cell_type": "markdown", - "id": "b7aabbc8", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🎯 MODULE SUMMARY: Compression - Neural Network Pruning for Edge Deployment\n", - "\n", - "### What You Accomplished\n", - "\n", - "In this module, you built a complete **neural network compression system** using pruning techniques that remove 70% of parameters while maintaining 95%+ accuracy. You learned to:\n", - "\n", - "**🔧 Core Implementation Skills:**\n", - "- **Magnitude-based pruning**: Identified and removed unimportant weights using simple yet effective heuristics\n", - "- **Structured vs unstructured pruning**: Built both approaches and understood their hardware tradeoffs\n", - "- **Sparse computation**: Implemented efficient sparse linear layers and benchmarked real vs theoretical speedups\n", - "- **End-to-end compression pipeline**: Created production-ready model compression with analysis, validation, and optimization\n", - "\n", - "**📊 Systems Engineering Insights:**\n", - "- **Neural network redundancy**: Discovered that networks contain 70-90% redundant parameters that can be safely removed\n", - "- **Hardware efficiency tradeoffs**: Understood why structured pruning provides actual speedup while unstructured gives theoretical speedup\n", - "- **Memory vs compute optimization**: Learned how pruning reduces both memory footprint and computational requirements\n", - "- **Deployment enablement**: Saw how compression makes models fit where they previously couldn't run\n", - "\n", - "**🏭 Production Understanding:**\n", - "- **Edge deployment scenarios**: Analyzed how pruning enables mobile, IoT, and 
embedded AI applications\n", - "- **Compression pipeline design**: Built systems that analyze, compress, and validate models for production deployment\n", - "- **Hardware-aware optimization**: Understood how different deployment targets require different pruning strategies\n", - "- **Quality assurance**: Implemented validation systems to ensure compression doesn't degrade model performance\n", - "\n", - "### ML Systems Engineering Connection\n", - "\n", - "This module demonstrates that **compression is fundamentally about enabling deployment**, not just reducing model size. You learned:\n", - "\n", - "- **Why redundancy exists**: Neural networks are over-parametrized, creating massive compression opportunities\n", - "- **Hardware drives strategy**: Structured vs unstructured pruning choice depends on target hardware capabilities\n", - "- **Compression enables privacy**: On-device processing becomes possible when models are small enough\n", - "- **Systems thinking**: Compression affects the entire ML stack from training to serving\n", - "\n", - "### Real-World Impact\n", - "\n", - "Your compression implementation mirrors production systems used by:\n", - "- **Mobile AI**: Apple's Neural Engine, Google's Edge TPU leverage sparsity for efficient inference\n", - "- **Autonomous vehicles**: Tesla FSD uses pruning for real-time object detection\n", - "- **Smart devices**: Alexa, Google Assistant use extreme compression for always-on wake word detection\n", - "- **Medical AI**: Portable diagnostic systems enabled by compressed models\n", - "\n", - "The techniques you built make the difference between AI that runs in the cloud versus AI that runs in your pocket - enabling privacy, reducing latency, and creating entirely new application categories.\n", - "\n", - "**Next**: This completes our ML Systems engineering journey! 
You've now built the complete stack from tensors to production deployment, understanding how each component contributes to building real-world AI systems that scale." - ] - } - ], - "metadata": { - "jupytext": { - "main_language": "python" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/modules_old/17_compression/compression_dev.py b/modules_old/17_compression/compression_dev.py deleted file mode 100644 index 6cc67443..00000000 --- a/modules_old/17_compression/compression_dev.py +++ /dev/null @@ -1,2562 +0,0 @@ -# --- -# jupyter: -# jupytext: -# text_representation: -# extension: .py -# format_name: percent -# format_version: '1.3' -# jupytext_version: 1.17.1 -# --- - -# %% [markdown] -""" -# Compression - Neural Network Pruning for Edge Deployment - -Welcome to the Compression module! You'll implement pruning techniques that remove 70% of neural network parameters while maintaining accuracy, enabling deployment on resource-constrained edge devices. - -## Connection from Quantization (Module 17) -In Module 17, you learned quantization - reducing precision from FP32 to INT8. But even quantized models can be too large for edge devices! Compression attacks the problem differently: instead of making numbers smaller, we **remove numbers entirely** through strategic pruning. - -## Learning Goals -- Systems understanding: How neural network redundancy enables massive parameter reduction without accuracy loss -- Core implementation skill: Build magnitude-based pruning systems that identify and remove unimportant weights -- Pattern recognition: Understand when structured vs unstructured pruning optimizes for different hardware constraints -- Framework connection: See how your implementation mirrors production sparse inference systems -- Performance insight: Learn why 70% sparsity often provides optimal accuracy vs size tradeoffs - -## Build -> Profile -> Optimize -1. 
**Build**: Magnitude-based pruners that remove small weights, discover massive redundancy in neural networks -2. **Profile**: Measure model size reduction, accuracy impact, and sparse computation efficiency -3. **Optimize**: Implement structured pruning for hardware-friendly sparsity patterns - -## What You'll Achieve -By the end of this module, you'll understand: -- Deep technical understanding of how neural networks contain massive redundancy that can be exploited for compression -- Practical capability to prune real CNNs and MLPs while maintaining 95%+ of original accuracy -- Systems insight into why pruning enables deployment scenarios impossible with dense models -- Performance consideration of when sparse computation provides real speedups vs theoretical ones -- Connection to production systems where pruning enables edge AI applications - -## Systems Reality Check -TIP **Production Context**: Apple's Neural Engine, Google's Edge TPU, and mobile inference frameworks heavily rely on sparsity for efficient computation -SPEED **Performance Note**: 70% sparsity provides 3-5x model compression with <2% accuracy loss, but speedup depends on hardware sparse computation support -""" - -# %% nbgrader={"grade": false, "grade_id": "compression-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} -#| default_exp nn.utils.prune - -#| export -import numpy as np -import matplotlib.pyplot as plt -import sys -from typing import Tuple, Optional, Dict, Any, List -from dataclasses import dataclass - -# Constants for compression configuration -DEFAULT_SPARSITY = 0.7 -NEAR_ZERO_THRESHOLD_RATIO = 0.1 # 10% of mean weight magnitude -MIN_FILTERS_TO_KEEP = 1 -EPS_DIVISION_SAFETY = 1e-8 # Avoid division by zero - -# Layer type detection thresholds -CONV2D_NDIM = 4 # (out_channels, in_channels, H, W) -DENSE_NDIM = 2 # (out_features, in_features) - -# Default sparsity levels by layer type -DEFAULT_CONV_SPARSITY = 0.6 # Conservative for conv layers 
-DEFAULT_DENSE_SPARSITY = 0.8 # Aggressive for dense layers -DEFAULT_OTHER_SPARSITY = 0.5 # Safe default for unknown layers - -# Quality score thresholds -EXCELLENT_QUALITY_THRESHOLD = 0.8 -ACCEPTABLE_QUALITY_THRESHOLD = 0.6 - -# Helper function for layer analysis -def _determine_layer_type_and_sparsity(shape: tuple) -> Tuple[str, float]: - """ - Determine layer type and recommended sparsity from weight tensor shape. - - Args: - shape: Weight tensor shape - - Returns: - layer_type: String describing the layer type - recommended_sparsity: Suggested sparsity level for this layer type - """ - # Detect layer type from weight tensor dimensions - if len(shape) == 4: # Convolution: (filters, channels, height, width) - layer_type = "Conv2D" - recommended_sparsity = DEFAULT_CONV_SPARSITY # Conservative - conv layers extract spatial features - elif len(shape) == 2: # Linear/Dense: (output_neurons, input_neurons) - layer_type = "Linear" - recommended_sparsity = DEFAULT_DENSE_SPARSITY # Aggressive - dense layers have high redundancy - else: - layer_type = "Other" - recommended_sparsity = DEFAULT_OTHER_SPARSITY # Safe default for unknown layer types - - return layer_type, recommended_sparsity - -# Benchmarking defaults -DEFAULT_BATCH_SIZE = 32 -DEFAULT_BENCHMARK_ITERATIONS = 100 -SPEEDUP_EFFICIENCY_HIGH = 0.8 -SPEEDUP_EFFICIENCY_MEDIUM = 0.5 - -# %% [markdown] -""" -## Part 1: Understanding Neural Network Redundancy - -Before implementing pruning, let's understand the fundamental insight: **neural networks are massively over-parametrized**. Most weights contribute little to the final output and can be removed without significant accuracy loss.
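As a quick standalone check of this claim, the sketch below counts how many weights in a Gaussian-initialized conv tensor sit near zero. This is illustrative only: it uses random (untrained) weights with the same shape as the module's test fixtures, and the 10%-of-mean cutoff mirrors the `NEAR_ZERO_THRESHOLD_RATIO` convention used in this module.

```python
import numpy as np

# Illustrative sketch: fraction of "near-zero" weights in a Gaussian-initialized
# conv tensor. Trained networks typically show even heavier small-weight tails.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(64, 32, 3, 3))   # (filters, channels, H, W)
mag = np.abs(w).ravel()

threshold = 0.1 * mag.mean()                     # "near-zero" cutoff: 10% of mean magnitude
near_zero = (mag <= threshold).mean()            # fraction of weights below the cutoff
print(f"near-zero fraction: {near_zero:.1%}")
print(f"median magnitude: {np.median(mag):.4f} (half the weights sit below this)")
```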
- -### Visual Guide: Neural Network Weight Distribution - -``` - Weight Magnitude Distribution in Typical Neural Network: - - Count - ^ - 5000| ████ <- Many small weights - 4000| █████ - 3000| ██████ - 2000| ███████ - 1000| ████████ ██ <- Few large weights - 0| ████████████████████████ - +-------------------------> Weight Magnitude - 0.0 0.1 0.2 0.3 0.4 0.5 - - The Natural Sparsity Pattern: - +-----------------------------------------+ - | 80% of weights have magnitude < 0.1 | <- Can be pruned - | 15% of weights have magnitude 0.1-0.3 | <- Moderately important - | 5% of weights have magnitude > 0.3 | <- Critical weights - +-----------------------------------------+ -``` - -### Pruning Strategy Visualization - -``` - Original Dense Network: - +-----+-----+-----+-----+ - | 0.8 | 0.1 | 0.05| 0.3 | <- All weights present - | 0.02| 0.7 | 0.4 | 0.09| - | 0.6 | 0.03| 0.5 | 0.2 | - | 0.04| 0.9 | 0.06| 0.1 | - +-----+-----+-----+-----+ - - After 70% Magnitude-Based Pruning: - +-----+-----+-----+-----+ - | 0.8 | 0 | 0 | 0.3 | <- Small weights -> 0 - | 0 | 0.7 | 0.4 | 0 | - | 0.6 | 0 | 0.5 | 0.2 | - | 0 | 0.9 | 0 | 0 | - +-----+-----+-----+-----+ - - Result: 70% sparsity, 95%+ accuracy preserved! -``` - -### The Redundancy Discovery -- **Research insight**: Networks often have 80-90% redundant parameters -- **Lottery Ticket Hypothesis**: Sparse subnetworks can match dense network performance -- **Practical reality**: 70% sparsity typically loses <2% accuracy -- **Systems opportunity**: Massive compression enables edge deployment -""" - -# %% nbgrader={"grade": false, "grade_id": "redundancy-analysis", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -def analyze_weight_redundancy(weights: np.ndarray, title: str = "Weight Analysis"): - """ - Analyze weight distributions to understand pruning opportunities. - - This function reveals the natural sparsity and redundancy patterns - in neural network weights that make pruning effective. 
- """ - # Flatten weights for analysis - w_flat = weights.flatten() - w_abs = np.abs(w_flat) - - print(f"📊 {title}") - print("=" * 50) - print(f"Total parameters: {len(w_flat):,}") - print(f"Mean absolute weight: {w_abs.mean():.6f}") - print(f"Weight standard deviation: {w_abs.std():.6f}") - - # Analyze weight distribution percentiles - percentiles = [50, 70, 80, 90, 95, 99] - print(f"\nWeight Magnitude Percentiles:") - for p in percentiles: - val = np.percentile(w_abs, p) - smaller_count = np.sum(w_abs <= val) - print(f" {p:2d}%: {val:.6f} ({smaller_count:,} weights <= this value)") - - # Show natural sparsity (near-zero weights) - zero_threshold = w_abs.mean() * NEAR_ZERO_THRESHOLD_RATIO # Threshold for "near-zero" weights - near_zero_count = np.sum(w_abs <= zero_threshold) - natural_sparsity = near_zero_count / len(w_flat) * 100 - - print(f"\nNatural Sparsity Analysis:") - print(f" Threshold (10% of mean): {zero_threshold:.6f}") - print(f" Near-zero weights: {near_zero_count:,} ({natural_sparsity:.1f}%)") - print(f" Already sparse without pruning!") - - return { - 'total_params': len(w_flat), - 'mean_abs': w_abs.mean(), - 'std': w_abs.std(), - 'natural_sparsity': natural_sparsity, - 'percentiles': {p: np.percentile(w_abs, p) for p in percentiles} - } - -# %% [markdown] -""" -### Test: Weight Redundancy Analysis - -Let's verify our redundancy analysis works on realistic neural network weights. 
-""" - -# %% nbgrader={"grade": true, "grade_id": "test-redundancy-analysis", "locked": false, "points": 5, "schema_version": 3, "solution": false, "task": false} -def test_redundancy_analysis(): - """Test weight redundancy analysis on sample networks.""" - print("Testing weight redundancy analysis...") - - # Create realistic CNN weights with natural sparsity - np.random.seed(42) - conv_weights = np.random.normal(0, 0.02, (64, 32, 3, 3)) # Conv layer - linear_weights = np.random.normal(0, 0.01, (1000, 512)) # Linear layer - - # Analyze both layer types - conv_stats = analyze_weight_redundancy(conv_weights, "Conv2D Layer Weights") - linear_stats = analyze_weight_redundancy(linear_weights, "Linear Layer Weights") - - # Verify analysis produces reasonable results - assert conv_stats['total_params'] == 64*32*3*3, "Conv param count mismatch" - assert linear_stats['total_params'] == 1000*512, "Linear param count mismatch" - assert conv_stats['natural_sparsity'] > 0, "Should detect some natural sparsity" - assert linear_stats['natural_sparsity'] > 0, "Should detect some natural sparsity" - - print("PASS Weight redundancy analysis test passed!") - -test_redundancy_analysis() - -# %% [markdown] -""" -## Part 2: Magnitude-Based Pruning - The Foundation - -The simplest and most effective pruning technique: **remove the smallest weights**. The intuition is that small weights contribute little to the network's computation, so removing them should have minimal impact on accuracy. 
- -### Visual Guide: Magnitude-Based Pruning Process - -``` - Step 1: Calculate Weight Magnitudes - Original Weights: Absolute Values: - +-----+-----+-----+ +-----+-----+-----+ - |-0.8 | 0.1 |-0.05| -> | 0.8 | 0.1 | 0.05| - | 0.02|-0.7 | 0.4 | | 0.02| 0.7 | 0.4 | - |-0.6 | 0.03| 0.5 | | 0.6 | 0.03| 0.5 | - +-----+-----+-----+ +-----+-----+-----+ - - Step 2: Sort and Find Threshold (70% sparsity) - Sorted magnitudes: [0.02, 0.03, 0.05, 0.1, 0.4, 0.5, 0.6, 0.7, 0.8] - 70th percentile threshold: 0.4 - ^ - Keep weights >= 0.4 - - Step 3: Create Binary Mask - Magnitude >= threshold: Binary Mask: - +-----+-----+-----+ +-----+-----+-----+ - | OK | ✗ | ✗ | -> | 1 | 0 | 0 | - | ✗ | OK | OK | | 0 | 1 | 1 | - | OK | ✗ | OK | | 1 | 0 | 1 | - +-----+-----+-----+ +-----+-----+-----+ - - Step 4: Apply Mask (Element-wise Multiplication) - Original * Mask = Pruned: - +-----+-----+-----+ +-----+-----+-----+ - |-0.8 | 0.1 |-0.05| * | 1 | 0 | 0 | = +-----+-----+-----+ - | 0.02|-0.7 | 0.4 | | 0 | 1 | 1 | |-0.8 | 0 | 0 | - |-0.6 | 0.03| 0.5 | | 1 | 0 | 1 | | 0 |-0.7 | 0.4 | - +-----+-----+-----+ +-----+-----+-----+ |-0.6 | 0 | 0.5 | - +-----+-----+-----+ -``` - -### Magnitude Pruning Algorithm -1. **Calculate importance**: Use absolute weight magnitude as importance metric -2. **Rank weights**: Sort all weights by absolute value -3. **Set threshold**: Choose magnitude threshold for desired sparsity level -4. **Create mask**: Zero out weights below threshold -5. **Apply mask**: Element-wise multiplication to enforce sparsity -""" - -# %% nbgrader={"grade": false, "grade_id": "magnitude-pruning", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class MagnitudePruner: - """ - Magnitude-based pruning for neural network compression. - - This class implements the core pruning algorithm used in production - systems: remove weights with smallest absolute values. - """ - - def __init__(self): - """ - Initialize magnitude-based pruner. 
- - Stores pruning masks, original weights, and statistics for - tracking compression across multiple layers. - """ - # BEGIN SOLUTION - self.pruning_masks = {} # Binary masks for each pruned layer - self.original_weights = {} # Original dense weights before pruning - self.pruning_stats = {} # Compression statistics per layer - # END SOLUTION - - def calculate_threshold(self, weights: np.ndarray, sparsity: float) -> float: - """ - Calculate magnitude threshold for desired sparsity level. - - Args: - weights: Network weights to analyze - sparsity: Fraction of weights to remove (0.0 to 1.0) - - Returns: - threshold: Magnitude below which weights should be pruned - """ - # BEGIN SOLUTION - # Flatten weights and get absolute values - w_flat = weights.flatten() - w_abs = np.abs(w_flat) - - # Calculate percentile threshold - # sparsity=0.7 means remove 70% of weights (keep top 30%) - percentile = sparsity * 100 - threshold = np.percentile(w_abs, percentile) - - return threshold - # END SOLUTION - - def create_mask(self, weights: np.ndarray, threshold: float) -> np.ndarray: - """ - Create binary mask for pruning weights below threshold. - - Args: - weights: Original weights - threshold: Magnitude threshold for pruning - - Returns: - mask: Binary mask (1=keep, 0=prune) - """ - # BEGIN SOLUTION - # Create mask: keep weights with absolute value >= threshold - mask = (np.abs(weights) >= threshold).astype(np.float32) - return mask - # END SOLUTION - - def prune(self, weights: np.ndarray, sparsity: float = DEFAULT_SPARSITY) -> Tuple[np.ndarray, np.ndarray, Dict]: - """ - Prune network weights using magnitude-based pruning. - - This is the CORE pruning algorithm: remove weights with smallest absolute values. - The simplicity is deceptive - this technique enables 70%+ compression with <2% accuracy loss! 
- - Args: - weights: Original dense weights - sparsity: Fraction of weights to prune (default: 70%) - - Returns: - pruned_weights: Weights with small values set to zero - mask: Binary pruning mask - stats: Pruning statistics - """ - # BEGIN SOLUTION - # Store original shape for validation - original_shape = weights.shape - original_size = weights.size - - # STEP 1: Calculate magnitude threshold for desired sparsity level - # This determines which weights are "important enough" to keep - threshold = self.calculate_threshold(weights, sparsity) - - # STEP 2: Create binary mask (1=keep, 0=prune) - # The mask enforces sparsity by zeroing out small weights - mask = self.create_mask(weights, threshold) - - # STEP 3: Apply pruning via element-wise multiplication - # This is where the actual compression happens - small weights become 0 - pruned_weights = weights * mask - - # STEP 4: Calculate comprehensive statistics for analysis - actual_sparsity = np.sum(mask == 0) / mask.size # Fraction of zeros - remaining_params = np.sum(mask == 1) # Non-zero count - - # Calculate compression effectiveness metrics - pruned_count = int(original_size - remaining_params) - compression_ratio = original_size / remaining_params if remaining_params > 0 else float('inf') - - # Package all statistics for analysis and debugging - stats = { - 'target_sparsity': sparsity, # What we aimed for - 'actual_sparsity': actual_sparsity, # What we achieved - 'threshold': threshold, # Magnitude cutoff used - 'original_params': original_size, # Before pruning - 'remaining_params': int(remaining_params), # After pruning (non-zero) - 'pruned_params': pruned_count, # Parameters removed - 'compression_ratio': compression_ratio # Size reduction factor - } - - return pruned_weights, mask, stats - # END SOLUTION - - def measure_accuracy_impact(self, original_weights: np.ndarray, pruned_weights: np.ndarray) -> Dict: - """ - Measure the impact of pruning on weight statistics. 
- - This gives us a proxy for accuracy impact before running full evaluation. - """ - # BEGIN SOLUTION - # Calculate difference statistics - weight_diff = np.abs(original_weights - pruned_weights) - - # Normalize by original weight magnitude for relative comparison - original_abs = np.abs(original_weights) - relative_error = weight_diff / (original_abs + EPS_DIVISION_SAFETY) # Avoid division by zero - - return { - 'mean_absolute_error': weight_diff.mean(), - 'max_absolute_error': weight_diff.max(), - 'mean_relative_error': relative_error.mean(), - 'weight_norm_preservation': np.linalg.norm(pruned_weights) / np.linalg.norm(original_weights) - } - # END SOLUTION - -# %% [markdown] -""" -### Test: Magnitude-Based Pruning Implementation - -Let's verify our magnitude pruning works correctly. -""" - -# %% nbgrader={"grade": true, "grade_id": "test-magnitude-pruning", "locked": false, "points": 15, "schema_version": 3, "solution": false, "task": false} -def test_magnitude_pruning(): - """Test magnitude-based pruning implementation.""" - print("Testing magnitude-based pruning...") - - pruner = MagnitudePruner() - - # Test case 1: Simple weights with known distribution - weights = np.array([ - [0.5, 0.1, 0.8], - [0.05, 0.9, 0.2], - [0.3, 0.02, 0.7] - ]) - - # Test 50% sparsity (should keep 4.5 ~= 4-5 weights) - pruned, mask, stats = pruner.prune(weights, sparsity=0.5) - - print(f"Original weights:") - print(weights) - print(f"Pruning mask:") - print(mask) - print(f"Pruned weights:") - print(pruned) - print(f"Statistics: {stats}") - - # Verify sparsity is approximately correct - actual_sparsity = stats['actual_sparsity'] - assert 0.4 <= actual_sparsity <= 0.6, f"Sparsity should be ~50%, got {actual_sparsity:.1%}" - - # Verify mask is binary - assert np.all((mask == 0) | (mask == 1)), "Mask should be binary" - - # Verify pruned weights match mask - expected_pruned = weights * mask - np.testing.assert_array_equal(pruned, expected_pruned, "Pruned weights should match mask 
application") - - # Test case 2: High sparsity pruning - large_weights = np.random.normal(0, 0.1, (100, 50)) - pruned_large, mask_large, stats_large = pruner.prune(large_weights, sparsity=0.8) - - assert 0.75 <= stats_large['actual_sparsity'] <= 0.85, "High sparsity should be approximately correct" - assert stats_large['compression_ratio'] >= 4.0, "80% sparsity should give ~5x compression" - - # Test accuracy impact measurement - accuracy_impact = pruner.measure_accuracy_impact(large_weights, pruned_large) - assert 'mean_relative_error' in accuracy_impact, "Should measure relative error" - assert accuracy_impact['weight_norm_preservation'] > 0, "Should preserve some weight norm" - - print("PASS Magnitude-based pruning test passed!") - -test_magnitude_pruning() - -# %% [markdown] -""" -## Part 3: Structured vs Unstructured Pruning - -So far we've implemented **unstructured pruning** - removing individual weights anywhere. But this creates irregular sparsity patterns that are hard for hardware to accelerate. **Structured pruning** removes entire channels, filters, or blocks - creating regular patterns that map well to hardware. 
- -### Visual Comparison: Structured vs Unstructured Pruning - -``` - UNSTRUCTURED PRUNING (Individual Weight Removal): - - Original 4*4 Weight Matrix: - +-----+-----+-----+-----+ - | 0.8 | 0.1 | 0.05| 0.3 | - | 0.02| 0.7 | 0.4 | 0.09| - | 0.6 | 0.03| 0.5 | 0.2 | - | 0.04| 0.9 | 0.06| 0.1 | - +-----+-----+-----+-----+ - - After 50% Unstructured Pruning (irregular pattern): - +-----+-----+-----+-----+ - | 0.8 | 0 | 0 | 0.3 | <- Scattered zeros - | 0 | 0.7 | 0.4 | 0 | <- Hard for hardware to optimize - | 0.6 | 0 | 0.5 | 0.2 | <- Requires sparse kernels - | 0 | 0.9 | 0 | 0 | <- Irregular memory access - +-----+-----+-----+-----+ - - - STRUCTURED PRUNING (Channel/Filter Removal): - - Conv Layer: 4 filters * 3 input channels: - Filter 0: Filter 1: Filter 2: Filter 3: - +-----+ +-----+ +-----+ +-----+ - | 0.8 | | 0.1 | | 0.05| | 0.3 | - | 0.2 | | 0.7 | | 0.4 | | 0.9 | <- L2 norms: [1.2, 0.9, 0.6, 1.1] - | 0.6 | | 0.3 | | 0.5 | | 0.7 | - +-----+ +-----+ +-----+ +-----+ - v v - Remove Remove - (weak) (weak) - - After 50% Structured Pruning (remove 2 weakest filters): - Filter 0: Filter 3: - +-----+ +-----+ - | 0.8 | | 0.3 | <- Clean matrix reduction - | 0.2 | | 0.9 | <- Dense computation friendly - | 0.6 | | 0.7 | <- No sparse kernels needed - +-----+ +-----+ <- Regular memory access -``` - -### Hardware Efficiency Comparison - -``` - COMPUTATION PATTERNS: - - Unstructured (50% sparse): Structured (50% fewer filters): - +-------------------------+ +-------------------------+ - | for i in range(rows): | | for i in range(rows/2): | - | for j in range(cols): | | for j in range(cols): | - | if mask[i,j]: | <--+ | result += data[i,j] | - | result += data[i,j]| | +-------------------------+ - | # else: skip | | ^ - +-------------------------+ | Dense, vectorized - ^ | - Sparse, branching | - Bad for SIMD | - | - Memory Access Pattern: | Memory Access Pattern: - [OK][✗][OK][✗][OK][✗][✗][OK] | [OKOKOKOK][OKOKOKOK] <- Contiguous - ^ Irregular | ^ Cache-friendly - Bad for cache | 
-``` - -### Structured Pruning Benefits: -- **Hardware friendly**: Regular patterns map onto standard dense computation -- **Memory layout**: Removes entire rows/columns, shrinking the stored tensor and memory footprint -- **Inference speed**: Delivers real acceleration, not just a theoretical FLOP reduction -- **Simple implementation**: No specialized sparse kernels needed -""" - -# %% nbgrader={"grade": false, "grade_id": "structured-pruning", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -def prune_conv_filters(conv_weights: np.ndarray, sparsity: float = 0.5) -> Tuple[np.ndarray, List[int], Dict]: - """ - Structured pruning for convolutional layers - remove entire filters. - - Unlike unstructured pruning that creates irregular sparsity patterns, - structured pruning removes entire filters/channels for hardware-friendly compression. - This trades some compression ratio for MUCH better inference speed on real hardware. - - Args: - conv_weights: Conv weights shaped (out_channels, in_channels, H, W) - sparsity: Fraction of filters to remove - - Returns: - pruned_weights: Weights with filters removed - kept_filters: Indices of filters that were kept - stats: Pruning statistics - """ - # BEGIN SOLUTION - # STEP 1: Calculate importance score for each output filter - # Strategy: Use L2 norm of entire filter as importance measure - # Intuition: Filters with larger norms have more impact on output - out_channels = conv_weights.shape[0] - filter_norms = [] - - for i in range(out_channels): - filter_weights = conv_weights[i] # Shape: (in_channels, H, W) - l2_norm = np.linalg.norm(filter_weights) # Magnitude of entire filter - filter_norms.append(l2_norm) - - filter_norms = np.array(filter_norms) - - # STEP 2: Determine how many filters to keep (with safety bounds) - num_filters_to_keep = int(out_channels * (1 - sparsity)) - num_filters_to_keep = max(MIN_FILTERS_TO_KEEP, num_filters_to_keep) # Never remove ALL filters - - # STEP 3: Select top-K most important filters based
on L2 norm - # This is the structured pruning decision: keep entire filters, not individual weights - top_filter_indices = np.argsort(filter_norms)[-num_filters_to_keep:] - top_filter_indices.sort() # Maintain original filter ordering for consistency - - # STEP 4: Create compressed weight tensor by extracting kept filters - # Result: smaller tensor with fewer channels (not sparse tensor with zeros) - pruned_weights = conv_weights[top_filter_indices] - - # STEP 5: Calculate structured pruning statistics - actual_sparsity = 1 - (num_filters_to_keep / out_channels) - - stats = { - 'original_filters': out_channels, - 'remaining_filters': num_filters_to_keep, - 'pruned_filters': out_channels - num_filters_to_keep, - 'target_sparsity': sparsity, - 'actual_sparsity': actual_sparsity, - 'compression_ratio': out_channels / num_filters_to_keep, - 'filter_norms': filter_norms, - 'kept_filter_indices': top_filter_indices.tolist() - } - - return pruned_weights, top_filter_indices.tolist(), stats - # END SOLUTION - -def compare_structured_vs_unstructured(conv_weights: np.ndarray, sparsity: float = 0.5): - """ - Compare structured vs unstructured pruning on the same layer. 
- """ - print("🔬 Structured vs Unstructured Pruning Comparison") - print("=" * 60) - - # Unstructured pruning - pruner = MagnitudePruner() - unstructured_pruned, unstructured_mask, unstructured_stats = pruner.prune(conv_weights, sparsity) - - # Structured pruning - structured_pruned, kept_filters, structured_stats = prune_conv_filters(conv_weights, sparsity) - - print("Unstructured Pruning:") - print(f" Original shape: {conv_weights.shape}") - print(f" Pruned shape: {unstructured_pruned.shape} (same)") - print(f" Sparsity: {unstructured_stats['actual_sparsity']:.1%}") - print(f" Compression: {unstructured_stats['compression_ratio']:.1f}x") - print(f" Zero elements: {np.sum(unstructured_pruned == 0):,}") - - print("\nStructured Pruning:") - print(f" Original shape: {conv_weights.shape}") - print(f" Pruned shape: {structured_pruned.shape}") - print(f" Sparsity: {structured_stats['actual_sparsity']:.1%}") - print(f" Compression: {structured_stats['compression_ratio']:.1f}x") - print(f" Filters removed: {structured_stats['pruned_filters']}") - - print(f"\nTIP Key Differences:") - print(f" • Unstructured: Irregular sparsity, requires sparse kernels") - print(f" • Structured: Regular reduction, standard dense computation") - print(f" • Hardware: Structured pruning provides actual speedup") - print(f" • Memory: Structured pruning reduces memory footprint") - - return { - 'unstructured': (unstructured_pruned, unstructured_stats), - 'structured': (structured_pruned, structured_stats) - } - -# %% [markdown] -""" -### Test: Structured Pruning Implementation - -Let's verify structured pruning works correctly and compare it with unstructured pruning. 
-""" - -# %% nbgrader={"grade": true, "grade_id": "test-structured-pruning", "locked": false, "points": 15, "schema_version": 3, "solution": false, "task": false} -def test_structured_pruning(): - """Test structured pruning implementation.""" - print("Testing structured pruning...") - - # Create sample conv weights: (out_channels, in_channels, H, W) - np.random.seed(42) - conv_weights = np.random.normal(0, 0.1, (8, 4, 3, 3)) - - # Test structured pruning - pruned_weights, kept_filters, stats = prune_conv_filters(conv_weights, sparsity=0.5) - - print(f"Original shape: {conv_weights.shape}") - print(f"Pruned shape: {pruned_weights.shape}") - print(f"Kept filters: {kept_filters}") - print(f"Stats: {stats}") - - # Verify output shape is correct - expected_filters = int(8 * (1 - 0.5)) # 50% sparsity = keep 50% of filters - assert pruned_weights.shape[0] == expected_filters, f"Should keep {expected_filters} filters" - assert pruned_weights.shape[1:] == conv_weights.shape[1:], "Other dimensions should match" - - # Verify kept filters are the strongest ones - filter_norms = [np.linalg.norm(conv_weights[i]) for i in range(8)] - top_indices = np.argsort(filter_norms)[-expected_filters:] - top_indices.sort() - - for i, kept_idx in enumerate(kept_filters): - # Verify the pruned weight matches original filter - np.testing.assert_array_equal( - pruned_weights[i], - conv_weights[kept_idx], - f"Filter {i} should match original filter {kept_idx}" - ) - - # Test comparison function - comparison = compare_structured_vs_unstructured(conv_weights, 0.5) - - # Verify both methods produce different results - unstructured_result = comparison['unstructured'][0] - structured_result = comparison['structured'][0] - - assert unstructured_result.shape == conv_weights.shape, "Unstructured keeps same shape" - assert structured_result.shape[0] < conv_weights.shape[0], "Structured reduces filters" - - print("PASS Structured pruning test passed!") - -test_structured_pruning() - -# %% [markdown] -""" 
-## Part 4: Sparse Neural Networks - Efficient Computation - -Pruning creates sparse networks, but how do we compute with them efficiently? We need sparse linear layers that skip computation for zero weights. - -### Visual Guide: Sparse Computation Strategies - -``` - DENSE COMPUTATION (Standard): - - Input Vector: Weight Matrix: Output: - +-----+ +-----+-----+-----+ +-----+ - | 2 | | 0.8 | 0 | 0.3 | | | - | 3 | * | 0 | 0.7 | 0.4 | = | ? | - | 1 | | 0.6 | 0 | 0.5 | | | - +-----+ +-----+-----+-----+ +-----+ - - Standard Matrix Multiply (wastes work on zeros): - output[0] = 2*0.8 + 3*0 + 1*0.3 = 1.6 + 0 + 0.3 = 1.9 - output[1] = 2*0 + 3*0.7 + 1*0.4 = 0 + 2.1 + 0.4 = 2.5 - output[2] = 2*0.6 + 3*0 + 1*0.5 = 1.2 + 0 + 0.5 = 1.7 - ^ ^ ^ - Wasted Useful Useful - - - SPARSE COMPUTATION (Optimized): - - Non-zero Weight Storage (CSR format): - values: [0.8, 0.3, 0.7, 0.4, 0.6, 0.5] - cols: [ 0, 2, 1, 2, 0, 2 ] - row_ptr: [ 0, 2, 4, 6 ] - ^ ^ ^ ^ - row0 row1 row2 end - - Optimized Sparse Multiply (skip zeros): - for row in range(3): - for idx in range(row_ptr[row], row_ptr[row+1]): - col = cols[idx] - weight = values[idx] - output[row] += input[col] * weight # Only non-zero weights! 
- - Operations: 6 multiply-adds instead of 9 (33% savings) -``` - -### Memory Layout Comparison - -``` - DENSE STORAGE (4*4 matrix, 50% sparse): - +-----+-----+-----+-----+ - | 0.8 | 0.0 | 0.0 | 0.3 | Memory: 16 floats * 4 bytes = 64 bytes - | 0.0 | 0.7 | 0.4 | 0.0 | Wasted: 8 zeros * 4 bytes = 32 bytes (50%) - | 0.6 | 0.0 | 0.5 | 0.2 | Operations: 16 multiply-adds - | 0.0 | 0.9 | 0.0 | 0.0 | - +-----+-----+-----+-----+ - - SPARSE STORAGE (CSR format): - values: [0.8, 0.3, 0.7, 0.4, 0.6, 0.5, 0.2, 0.9] = 8 * 4 = 32 bytes - columns: [ 0, 3, 1, 2, 0, 2, 3, 1 ] = 8 * 4 = 32 bytes - row_ptr: [ 0, 2, 4, 7, 8 ] = 5 * 4 = 20 bytes - Total: 84 bytes - - Overhead: 84 vs 64 bytes (+31%) BUT only 8 operations vs 16 (-50%) - Break-even at ~70% sparsity: storage overhead < computation savings -``` - -### Sparse Computation Challenges: -- **Memory layout**: How to store only non-zero weights efficiently -- **Computation patterns**: Skip multiply-add operations for zero weights -- **Hardware support**: Most hardware isn't optimized for arbitrary sparsity -- **Software optimization**: Need specialized sparse kernels for speedup -""" - -# %% nbgrader={"grade": false, "grade_id": "sparse-computation", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class SparseLinear: - """ - Sparse linear layer that efficiently computes with pruned weights. - - This demonstrates how to build sparse computation systems - that actually achieve speedup from sparsity. - """ - - def __init__(self, in_features: int, out_features: int): - """ - Initialize sparse linear layer. 
- - Args: - in_features: Number of input features - out_features: Number of output features - - Attributes: - linear_weights: Original dense weight matrix (out_features, in_features) - sparse_weights: Pruned weight matrix with zeros - mask: Binary mask indicating kept weights (1=keep, 0=prune) - sparsity: Fraction of weights that are zero - dense_ops: Number of operations for dense computation - sparse_ops: Number of operations for sparse computation - """ - # BEGIN SOLUTION - self.in_features = in_features - self.out_features = out_features - - # Linear weights (will be pruned) - self.linear_weights = None - self.bias = None - - # Sparse representation - self.sparse_weights = None - self.mask = None - self.sparsity = 0.0 - - # Performance tracking - self.dense_ops = 0 - self.sparse_ops = 0 - # END SOLUTION - - def load_linear_weights(self, weights: np.ndarray, bias: Optional[np.ndarray] = None): - """Load dense weights before pruning.""" - # BEGIN SOLUTION - assert weights.shape == (self.out_features, self.in_features), f"Weight shape mismatch" - self.linear_weights = weights.copy() - self.bias = bias.copy() if bias is not None else np.zeros(self.out_features) - # END SOLUTION - - def prune_weights(self, sparsity: float = DEFAULT_SPARSITY): - """Prune weights using magnitude-based pruning.""" - # BEGIN SOLUTION - if self.linear_weights is None: - raise ValueError("Must load dense weights before pruning") - - # Use magnitude pruner - pruner = MagnitudePruner() - self.sparse_weights, self.mask, stats = pruner.prune(self.linear_weights, sparsity) - self.sparsity = stats['actual_sparsity'] - - print(f"✂️ Pruned {self.sparsity:.1%} of weights") - print(f" Compression: {stats['compression_ratio']:.1f}x") - # END SOLUTION - - def forward_dense(self, x: np.ndarray) -> np.ndarray: - """Forward pass using dense weights (reference).""" - # BEGIN SOLUTION - if self.linear_weights is None: - raise ValueError("Linear weights not loaded") - - # Count operations - self.dense_ops 
= self.in_features * self.out_features - - # Standard matrix multiply: y = x @ W^T + b - output = np.dot(x, self.linear_weights.T) + self.bias - return output - # END SOLUTION - - def forward_sparse_naive(self, x: np.ndarray) -> np.ndarray: - """Forward pass using sparse weights (naive implementation).""" - # BEGIN SOLUTION - if self.sparse_weights is None: - raise ValueError("Weights not pruned yet") - - # Count actual operations (skip zero weights) - self.sparse_ops = np.sum(self.mask) - - # Naive sparse computation: still do full matrix multiply - # (Real sparse implementations would use CSR/CSC formats) - output = np.dot(x, self.sparse_weights.T) + self.bias - return output - # END SOLUTION - - def forward_sparse_optimized(self, x: np.ndarray) -> np.ndarray: - """ - Forward pass using optimized sparse computation. - - This demonstrates the COMPLEX engineering required for real sparse speedup. - Production systems use specialized data structures (CSR, CSC) and - hardware-specific kernels. This is why structured pruning is often preferred! 
- """ - # BEGIN SOLUTION - if self.sparse_weights is None: - raise ValueError("Weights not pruned yet") - - # STEP 1: Extract indices of non-zero weights for efficient iteration - # This scan is overhead cost - why high sparsity is needed for speedup - nonzero_indices = np.nonzero(self.sparse_weights) - - # STEP 2: Count actual operations (only process non-zero weights) - # This is where the theoretical speedup comes from - self.sparse_ops = len(nonzero_indices[0]) - - # STEP 3: Initialize output with correct shape - # Optimized sparse computation (simulated for education) - # Real implementation: CSR matrix format with specialized BLAS - output = np.zeros((x.shape[0], self.out_features)) - - # STEP 4: Core sparse computation loop - # Process ONLY non-zero weights - this is the key optimization - # But: irregular memory access patterns hurt cache performance - for i in range(len(nonzero_indices[0])): - # Extract position: which output neuron, which input feature - row, col = nonzero_indices[0][i], nonzero_indices[1][i] - weight = self.sparse_weights[row, col] - - # STEP 5: Accumulate contribution (scatter operation) - # This is a SLOW memory pattern - each write goes to different location - # output[all_batches, output_neuron] += input[all_batches, input_feature] * weight - output[:, row] += x[:, col] * weight - - # STEP 6: Add bias (dense operation) - output += self.bias - - return output - # END SOLUTION - - def benchmark_speedup(self, batch_size: int = DEFAULT_BATCH_SIZE, iterations: int = DEFAULT_BENCHMARK_ITERATIONS) -> Dict: - """Benchmark sparse vs dense computation speedup.""" - # BEGIN SOLUTION - import time - - # Create test input - x = np.random.normal(0, 1, (batch_size, self.in_features)) - - # Benchmark dense forward pass - start_time = time.time() - for _ in range(iterations): - _ = self.forward_dense(x) - dense_time = time.time() - start_time - - # Benchmark sparse forward pass - start_time = time.time() - for _ in range(iterations): - _ = 
self.forward_sparse_naive(x) - sparse_time = time.time() - start_time - - # Calculate speedup metrics - theoretical_speedup = self.dense_ops / self.sparse_ops if self.sparse_ops > 0 else 1 - actual_speedup = dense_time / sparse_time if sparse_time > 0 else 1 - - return { - 'dense_time_ms': dense_time * 1000, - 'sparse_time_ms': sparse_time * 1000, - 'dense_ops': self.dense_ops, - 'sparse_ops': self.sparse_ops, - 'theoretical_speedup': theoretical_speedup, - 'actual_speedup': actual_speedup, - 'sparsity': self.sparsity, - 'efficiency': actual_speedup / theoretical_speedup - } - # END SOLUTION - -# %% [markdown] -""" -### Test: Sparse Neural Network Implementation - -Let's verify our sparse neural network works correctly and measure performance. -""" - -# %% nbgrader={"grade": true, "grade_id": "test-sparse-neural-network", "locked": false, "points": 15, "schema_version": 3, "solution": false, "task": false} -def test_sparse_neural_network(): - """Test sparse neural network implementation.""" - print("Testing sparse neural network...") - - # Create sparse linear layer - sparse_layer = SparseLinear(256, 128) - - # Load random weights - np.random.seed(42) - weights = np.random.normal(0, 0.1, (128, 256)) - bias = np.random.normal(0, 0.01, 128) - sparse_layer.load_linear_weights(weights, bias) - - # Prune weights - sparse_layer.prune_weights(sparsity=0.8) # 80% sparsity - - # Test forward passes - x = np.random.normal(0, 1, (4, 256)) # Batch of 4 - - # Compare outputs - output_dense = sparse_layer.forward_dense(x) - output_sparse_naive = sparse_layer.forward_sparse_naive(x) - output_sparse_opt = sparse_layer.forward_sparse_optimized(x) - - print(f"Output shapes:") - print(f" Linear: {output_dense.shape}") - print(f" Sparse naive: {output_sparse_naive.shape}") - print(f" Sparse optimized: {output_sparse_opt.shape}") - - # Verify outputs have correct shape - expected_shape = (4, 128) - assert output_dense.shape == expected_shape, "Linear output shape incorrect" - assert 
output_sparse_naive.shape == expected_shape, "Sparse naive output shape incorrect" - assert output_sparse_opt.shape == expected_shape, "Sparse optimized output shape incorrect" - - # Verify sparse outputs match expected computation - # Naive and optimized sparse paths should produce identical results - np.testing.assert_allclose( - output_sparse_naive, output_sparse_opt, rtol=1e-5, - err_msg="Sparse naive and optimized should produce same results" - ) - - # The outputs shouldn't be identical (due to pruning) but should be reasonably close - relative_error = np.mean(np.abs(output_dense - output_sparse_naive)) / np.mean(np.abs(output_dense)) - print(f"Relative error from pruning: {relative_error:.3%}") - # With 80% sparsity, relative error can be substantial but model should still function - assert relative_error < 1.0, "Error from pruning shouldn't completely destroy the model" - - # Benchmark performance - benchmark = sparse_layer.benchmark_speedup(batch_size=32, iterations=50) - - print(f"\nPerformance Benchmark:") - print(f" Sparsity: {benchmark['sparsity']:.1%}") - print(f" Dense ops: {benchmark['dense_ops']:,}") - print(f" Sparse ops: {benchmark['sparse_ops']:,}") - print(f" Theoretical speedup: {benchmark['theoretical_speedup']:.1f}x") - print(f" Actual speedup: {benchmark['actual_speedup']:.1f}x") - print(f" Efficiency: {benchmark['efficiency']:.1%}") - - # Verify operation counting - expected_dense_ops = 256 * 128 - assert benchmark['dense_ops'] == expected_dense_ops, "Dense op count incorrect" - assert benchmark['sparse_ops'] < benchmark['dense_ops'], "Sparse should use fewer ops" - - print("PASS Sparse neural network test passed!") - -test_sparse_neural_network() - -# %% [markdown] -""" -## Part 5: Model Compression Pipeline - End-to-End Pruning - -Now let's build a complete model compression pipeline that can prune entire neural networks layer by layer, maintaining the overall architecture while reducing parameters.
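The layer-by-layer flow can be previewed with a toy sketch before building the full `ModelCompressor`. The layer names, shapes, and the percentile thresholding here are illustrative stand-ins, not the pipeline's actual API:

```python
import numpy as np

def compress_layers(model_weights, sparsities, default=0.5):
    """Prune each layer at its own sparsity; return pruned weights + overall ratio (a sketch)."""
    compressed, total, kept = {}, 0, 0
    for name, w in model_weights.items():
        s = sparsities.get(name, default)
        threshold = np.percentile(np.abs(w), s * 100)  # per-layer magnitude cutoff
        mask = np.abs(w) > threshold
        compressed[name] = w * mask
        total += w.size
        kept += int(mask.sum())
    return compressed, total / max(kept, 1)

np.random.seed(0)
toy_model = {
    'conv1': np.random.normal(0, 0.02, (8, 3, 3, 3)),  # conservative: 50% sparsity
    'fc':    np.random.normal(0, 0.01, (16, 32)),      # aggressive: 80% sparsity
}
compressed, ratio = compress_layers(toy_model, {'conv1': 0.5, 'fc': 0.8})
print(f"overall compression: {ratio:.1f}x")
```

The overall ratio lands between the per-layer ratios (2x and 5x), weighted by each layer's parameter count — exactly the aggregation the pipeline below performs with `_calculate_compression_stats`.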
- -### Visual Guide: Compression Pipeline Flow - -``` - PHASE 1: MODEL ANALYSIS - +-------------------------------------------------+ - | Original Dense Model | - +-------------+-------------+---------------------┤ - | Conv1 | Conv2 | Dense1 | - | 32*3*3*3 | 64*32*3*3 | 512*1024 | - | 864 params| 18,432 p | 524,288 params | - | Type: Conv| Type: Conv | Type: Dense | - | Sens: Low | Sens: Med | Sens: High | - +-------------+-------------+---------------------+ - v v v - Recommend: 50% Recommend: 60% Recommend: 80% - - PHASE 2: LAYER-WISE PRUNING - +-------------------------------------------------+ - | Compressed Sparse Model | - +-------------+-------------+---------------------┤ - | Conv1 | Conv2 | Dense1 | - | 432 params | 7,373 p | 104,858 params | - | 50% sparse | 60% sparse | 80% sparse | - | OK 2x less | OK 2.5x less| OK 5x less | - +-------------+-------------+---------------------+ - - COMPRESSION SUMMARY: - Original: 864 + 18,432 + 524,288 = 543,584 total params - Compressed: 432 + 7,373 + 104,858 = 112,663 total params - Overall: 4.8x compression, 79% sparsity achieved! -``` - -### Quality Validation Metrics - -``` - COMPRESSION QUALITY SCORECARD: - - Layer | Weight Error | Norm Ratio | Quality Score | Status - ---------+--------------+------------+---------------+-------- - Conv1 | 0.000234 | 0.876 | 0.845 | PASS Good - Conv2 | 0.000567 | 0.823 | 0.789 | PASS Good - Dense1 | 0.001234 | 0.734 | 0.692 | WARNING️ OK - ---------+--------------+------------+---------------+-------- - Overall | - | - | 0.775 | PASS Good - - Quality Score Calculation: - score = norm_preservation * (1 - relative_error) - - PASS Excellent: > 0.8 (minimal degradation) - WARNING️ Acceptable: 0.6-0.8 (moderate degradation) - FAIL Poor: < 0.6 (significant degradation) -``` - -### Production Compression Pipeline: -1. **Model analysis**: Identify pruneable layers and sensitivity -2. **Layer-wise pruning**: Apply different sparsity levels per layer -3. 
**Accuracy validation**: Ensure pruning doesn't degrade performance -4. **Performance benchmarking**: Measure actual compression benefits -5. **Export for deployment**: Package compressed model for inference -""" - -# %% nbgrader={"grade": false, "grade_id": "compression-pipeline", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export - -def _determine_layer_type_and_sparsity(shape: tuple) -> Tuple[str, float]: - """ - Determine layer type and recommended sparsity from weight tensor shape. - - Args: - shape: Weight tensor shape - - Returns: - layer_type: Type of layer (Conv2D, Linear, Other) - recommended_sparsity: Recommended sparsity level for this layer type - """ - if len(shape) == CONV2D_NDIM: # Conv layer: (out, in, H, W) - return "Conv2D", DEFAULT_CONV_SPARSITY - elif len(shape) == DENSE_NDIM: # Linear layer: (out, in) - return "Linear", DEFAULT_DENSE_SPARSITY - else: - return "Other", DEFAULT_OTHER_SPARSITY - -def _calculate_layer_analysis_info(layer_name: str, weights: np.ndarray, layer_type: str, - natural_sparsity: float, recommended_sparsity: float) -> Dict[str, Any]: - """ - Create layer analysis information dictionary. 
- - Args: - layer_name: Name of the layer - weights: Weight tensor - layer_type: Type of layer - natural_sparsity: Natural sparsity percentage - recommended_sparsity: Recommended sparsity level - - Returns: - Layer analysis information dictionary - """ - return { - 'type': layer_type, - 'shape': weights.shape, - 'parameters': weights.size, - 'natural_sparsity': natural_sparsity, - 'recommended_sparsity': recommended_sparsity - } - -def _print_layer_analysis_row(layer_name: str, layer_type: str, num_params: int, - natural_sparsity: float, recommended_sparsity: float) -> None: - """Print a single row of layer analysis results.""" - print(f"{layer_name:12} | {layer_type:7} | {num_params:10,} | " - f"{natural_sparsity:12.1f}% | {recommended_sparsity:.0%}") - -def _calculate_compression_stats(total_original_params: int, total_remaining_params: int) -> Tuple[float, float]: - """ - Calculate overall compression statistics. - - Args: - total_original_params: Original number of parameters - total_remaining_params: Remaining parameters after compression - - Returns: - overall_compression: Compression ratio (original/remaining) - overall_sparsity: Sparsity fraction (1 - remaining/original) - """ - overall_compression = total_original_params / total_remaining_params if total_remaining_params > 0 else 1 - overall_sparsity = 1 - (total_remaining_params / total_original_params) - return overall_compression, overall_sparsity - -def _calculate_quality_score(norm_preservation: float, mean_error: float, original_mean: float) -> float: - """ - Calculate quality score for compression validation. 
- - Args: - norm_preservation: Weight norm preservation ratio - mean_error: Mean absolute error between original and compressed - original_mean: Mean absolute value of original weights - - Returns: - quality_score: Quality score between 0 and 1 (higher is better) - """ - quality_score = norm_preservation * (1 - mean_error / (original_mean + EPS_DIVISION_SAFETY)) - return max(0, min(1, quality_score)) # Clamp to [0, 1] - -def _get_quality_assessment(quality_score: float) -> str: - """Get quality assessment string based on score.""" - if quality_score > EXCELLENT_QUALITY_THRESHOLD: - return "PASS Excellent compression quality!" - elif quality_score > ACCEPTABLE_QUALITY_THRESHOLD: - return "WARNING️ Acceptable compression quality" - else: - return "FAIL Poor compression quality - consider lower sparsity" - -class ModelCompressor: - """ - Complete model compression pipeline for neural networks. - - This class implements production-ready compression workflows - that can handle complex models with mixed layer types. - """ - - def __init__(self): - """ - Initialize model compression pipeline. - - Attributes: - original_model: Storage for original dense model weights - compressed_model: Storage for compressed model weights and metadata - compression_stats: Overall compression statistics - layer_sensitivities: Per-layer sensitivity analysis results - """ - # BEGIN SOLUTION - self.original_model = {} # Original dense weights - self.compressed_model = {} # Compressed weights and metadata - self.compression_stats = {} # Overall compression statistics - self.layer_sensitivities = {} # Layer-wise sensitivity analysis - # END SOLUTION - - def analyze_model_for_compression(self, model_weights: Dict[str, np.ndarray]) -> Dict[str, Any]: - """ - Analyze model structure to determine compression strategy. 
- - Args: - model_weights: Dictionary mapping layer names to weight arrays - - Returns: - analysis: Compression analysis and recommendations - """ - # BEGIN SOLUTION - analysis = { - 'layers': {}, - 'total_params': 0, - 'compressible_params': 0, - 'recommendations': {} - } - - print("MAGNIFY Model Compression Analysis") - print("=" * 50) - print("Layer | Type | Parameters | Natural Sparsity | Recommendation") - print("-" * 70) - - for layer_name, weights in model_weights.items(): - layer_analysis = analyze_weight_redundancy(weights, f"Layer {layer_name}") - - # Analyze layer characteristics and determine compression strategy - layer_type, recommended_sparsity = _determine_layer_type_and_sparsity(weights.shape) - - analysis['layers'][layer_name] = _calculate_layer_analysis_info( - layer_name, weights, layer_type, - layer_analysis['natural_sparsity'], recommended_sparsity - ) - - analysis['total_params'] += weights.size - if layer_type in ['Conv2D', 'Linear']: - analysis['compressible_params'] += weights.size - - _print_layer_analysis_row(layer_name, layer_type, weights.size, - layer_analysis['natural_sparsity'], recommended_sparsity) - - # Calculate overall compression potential - compression_potential = analysis['compressible_params'] / analysis['total_params'] - - print(f"\n📊 Model Summary:") - print(f" Total parameters: {analysis['total_params']:,}") - print(f" Compressible parameters: {analysis['compressible_params']:,}") - print(f" Compression potential: {compression_potential:.1%}") - - analysis['compression_potential'] = compression_potential - return analysis - # END SOLUTION - - def compress_model(self, model_weights: Dict[str, np.ndarray], - layer_sparsities: Optional[Dict[str, float]] = None) -> Dict[str, Any]: - """ - Compress entire model using layer-wise pruning. 
-
-        Args:
-            model_weights: Dictionary mapping layer names to weights
-            layer_sparsities: Optional per-layer sparsity targets
-
-        Returns:
-            compressed_model: Compressed weights and statistics
-        """
-        # BEGIN SOLUTION
-        if layer_sparsities is None:
-            # Use default sparsities based on layer analysis
-            analysis = self.analyze_model_for_compression(model_weights)
-            layer_sparsities = {
-                name: info['recommended_sparsity']
-                for name, info in analysis['layers'].items()
-            }
-
-        print(f"\n⚙️ Compressing Model Layers")
-        print("=" * 50)
-
-        compressed_weights = {}
-        total_original_params = 0
-        total_remaining_params = 0
-
-        for layer_name, weights in model_weights.items():
-            sparsity = layer_sparsities.get(layer_name, DEFAULT_SPARSITY)  # Default sparsity
-
-            print(f"\n🔧 Compressing {layer_name} (target: {sparsity:.0%} sparsity)...")
-
-            # Apply magnitude-based pruning
-            pruner = MagnitudePruner()
-            pruned_weights, mask, stats = pruner.prune(weights, sparsity)
-
-            compressed_weights[layer_name] = {
-                'weights': pruned_weights,
-                'mask': mask,
-                'original_shape': weights.shape,
-                'stats': stats
-            }
-
-            total_original_params += stats['original_params']
-            total_remaining_params += stats['remaining_params']
-
-            print(f"   Sparsity achieved: {stats['actual_sparsity']:.1%}")
-            print(f"   Compression: {stats['compression_ratio']:.1f}x")
-
-        # Calculate overall compression
-        overall_compression, overall_sparsity = _calculate_compression_stats(
-            total_original_params, total_remaining_params
-        )
-
-        self.compressed_model = compressed_weights
-        self.compression_stats = {
-            'total_original_params': total_original_params,
-            'total_remaining_params': total_remaining_params,
-            'overall_sparsity': overall_sparsity,
-            'overall_compression': overall_compression,
-            'layer_sparsities': layer_sparsities
-        }
-
-        print(f"\n✅ Model Compression Complete!")
-        print(f"   Original parameters: {total_original_params:,}")
-        print(f"   Remaining parameters: {total_remaining_params:,}")
-        print(f"   Overall sparsity: {overall_sparsity:.1%}")
-        print(f"   Overall compression: {overall_compression:.1f}x")
-
-        return compressed_weights
-        # END SOLUTION
-
-    def validate_compression_quality(self, original_weights: Dict[str, np.ndarray],
-                                     compressed_model: Dict[str, Any]) -> Dict[str, Any]:
-        """
-        Validate that compression doesn't degrade the model too much.
-
-        This is a simplified validation - in practice you'd run full model evaluation.
-        """
-        # BEGIN SOLUTION
-        validation_results = {
-            'layer_quality': {},
-            'overall_quality': {},
-            'quality_score': 0.0
-        }
-
-        print(f"\n✅ Validating Compression Quality")
-        print("=" * 50)
-        print("Layer | Weight Error | Norm Preservation | Quality")
-        print("-" * 55)
-
-        layer_scores = []
-
-        for layer_name in original_weights.keys():
-            original = original_weights[layer_name]
-            compressed_info = compressed_model[layer_name]
-            compressed = compressed_info['weights']
-
-            # Calculate quality metrics
-            weight_diff = np.abs(original - compressed)
-            mean_error = weight_diff.mean()
-            max_error = weight_diff.max()
-
-            # Norm preservation
-            orig_norm = np.linalg.norm(original)
-            comp_norm = np.linalg.norm(compressed)
-            norm_preservation = comp_norm / orig_norm if orig_norm > 0 else 1.0
-
-            # Simple quality score (higher is better)
-            # Penalize high error, reward norm preservation
-            quality_score = _calculate_quality_score(norm_preservation, mean_error, np.abs(original).mean())
-
-            validation_results['layer_quality'][layer_name] = {
-                'mean_error': mean_error,
-                'max_error': max_error,
-                'norm_preservation': norm_preservation,
-                'quality_score': quality_score
-            }
-
-            layer_scores.append(quality_score)
-
-            print(f"{layer_name:12} | {mean_error:.6f} | {norm_preservation:13.3f} | {quality_score:.3f}")
-
-        # Overall quality
-        overall_quality_score = np.mean(layer_scores)
-        validation_results['overall_quality'] = {
-            'mean_quality_score': overall_quality_score,
-            'quality_std': np.std(layer_scores),
-            'min_quality': np.min(layer_scores),
-            'max_quality': np.max(layer_scores)
-        }
-        validation_results['quality_score'] = overall_quality_score
-
-        print(f"\n🎯 Overall Quality Score: {overall_quality_score:.3f}")
-        print(f"   {_get_quality_assessment(overall_quality_score)}")
-
-        return validation_results
-        # END SOLUTION
-
-# %% [markdown]
-"""
-### Test: Model Compression Pipeline
-
-Let's verify our complete compression pipeline works on a multi-layer model.
-"""
-
-# %% nbgrader={"grade": true, "grade_id": "test-compression-pipeline", "locked": false, "points": 20, "schema_version": 3, "solution": false, "task": false}
-def test_compression_pipeline():
-    """Test complete model compression pipeline."""
-    print("Testing model compression pipeline...")
-
-    # Create sample multi-layer model
-    np.random.seed(42)
-    model_weights = {
-        'conv1': np.random.normal(0, 0.02, (32, 3, 3, 3)),    # Conv: 32 filters, 3 input channels
-        'conv2': np.random.normal(0, 0.02, (64, 32, 3, 3)),   # Conv: 64 filters, 32 input channels
-        'linear1': np.random.normal(0, 0.01, (512, 1024)),    # Linear: 1024 -> 512
-        'linear2': np.random.normal(0, 0.01, (10, 512)),      # Linear: 512 -> 10 (output layer)
-    }
-
-    # Create compressor
-    compressor = ModelCompressor()
-
-    # Step 1: Analyze model
-    analysis = compressor.analyze_model_for_compression(model_weights)
-
-    assert analysis['total_params'] > 0, "Should count total parameters"
-    assert len(analysis['layers']) == 4, "Should analyze all 4 layers"
-    assert 'conv1' in analysis['layers'], "Should analyze conv1"
-    assert 'linear1' in analysis['layers'], "Should analyze linear1"
-
-    # Verify layer type detection
-    assert analysis['layers']['conv1']['type'] == 'Conv2D', "Should detect conv layers"
-    assert analysis['layers']['linear1']['type'] == 'Linear', "Should detect linear layers"
-
-    # Step 2: Compress model with custom sparsities
-    custom_sparsities = {
-        'conv1': 0.5,    # Conservative for first conv layer
-        'conv2': 0.6,    # Moderate for second conv layer
-        'linear1': 0.8,  # Aggressive for large dense layer
-        'linear2': 0.3   # Conservative for output layer
-    }
-
-    compressed_model = compressor.compress_model(model_weights, custom_sparsities)
-
-    # Verify compression results
-    assert len(compressed_model) == 4, "Should compress all layers"
-    for layer_name in model_weights.keys():
-        assert layer_name in compressed_model, f"Missing compressed {layer_name}"
-        compressed_info = compressed_model[layer_name]
-        assert 'weights' in compressed_info, "Should have compressed weights"
-        assert 'mask' in compressed_info, "Should have pruning mask"
-        assert 'stats' in compressed_info, "Should have compression stats"
-
-    # Verify compression statistics
-    stats = compressor.compression_stats
-    assert stats['overall_compression'] > 2.0, "Should achieve significant compression"
-    assert 0.5 <= stats['overall_sparsity'] <= 0.8, "Overall sparsity should be reasonable"
-
-    # Step 3: Validate compression quality
-    validation = compressor.validate_compression_quality(model_weights, compressed_model)
-
-    assert 'layer_quality' in validation, "Should validate each layer"
-    assert 'overall_quality' in validation, "Should have overall quality metrics"
-    assert 0 <= validation['quality_score'] <= 1, "Quality score should be normalized"
-
-    # Each layer should have quality metrics
-    for layer_name in model_weights.keys():
-        assert layer_name in validation['layer_quality'], f"Missing quality for {layer_name}"
-        layer_quality = validation['layer_quality'][layer_name]
-        assert 'norm_preservation' in layer_quality, "Should measure norm preservation"
-        assert layer_quality['norm_preservation'] > 0, "Norm preservation should be positive"
-
-    # Test that compressed weights are actually sparse
-    for layer_name, compressed_info in compressed_model.items():
-        compressed_weights = compressed_info['weights']
-        sparsity = np.sum(compressed_weights == 0) / compressed_weights.size
-        expected_sparsity = custom_sparsities[layer_name]
-
-        # Allow some tolerance in sparsity
-        assert abs(sparsity - expected_sparsity) < 0.1, f"{layer_name} sparsity mismatch"
-
-    print("✅ Model compression pipeline test passed!")
-
-test_compression_pipeline()
-
-# %% [markdown]
-"""
-## Part 6: Systems Analysis - Memory, Performance, and Deployment Impact
-
-Let's analyze compression from a systems engineering perspective, measuring the real-world impact on memory usage, inference speed, and deployment scenarios.
-
-### Visual Guide: Compression Impact Across the ML Stack
-
-```
-    COMPRESSION BENEFITS VISUALIZATION:
-
-    +--------------------------------------------------------------+
-    |                     MODEL SIZE IMPACT                        |
-    +--------------------------------------------------------------+
-    | Dense Model:  [████████████████████] 200MB                   |
-    | 50% Sparse:   [██████████░░░░░░░░░░] 100MB                   |
-    | 70% Sparse:   [██████░░░░░░░░░░░░░░]  60MB                   |
-    | 90% Sparse:   [██░░░░░░░░░░░░░░░░░░]  20MB                   |
-    +--------------------------------------------------------------+
-
-    +--------------------------------------------------------------+
-    |                  INFERENCE SPEED IMPACT                      |
-    +--------------------------------------------------------------+
-    | Dense (50ms):  [█████████████████████████] 50ms              |
-    | Sparse (20ms): [██████████░░░░░░░░░░░░░░░] 20ms              |
-    |                2.5x faster inference!                        |
-    +--------------------------------------------------------------+
-
-    +--------------------------------------------------------------+
-    |                  DEPLOYMENT ENABLEMENT                       |
-    +--------------------------------------------------------------+
-    | Cloud Server:  ✅ Can run any model size                     |
-    | Mobile Phone:  ✗ Dense, ✅ 70% sparse                        |
-    | IoT Device:    ✗ Dense, ✗ 50% sparse, ✅ 90% sparse          |
-    | Smartwatch:    ✗ All except extreme compression              |
-    +--------------------------------------------------------------+
-```
-
-### Edge AI Deployment Pipeline
-
-```
-    COMPRESSION -> DEPLOYMENT PIPELINE:
-
-    +----------------------------------------------------------------------+
-    | PHASE 1: COMPRESSION     PHASE 2: OPTIMIZATION   PHASE 3: DEPLOYMENT |
-    +------------------------+------------------------+--------------------+
-    | Dense Model (200MB)    | Pruned Model (60MB)    | Mobile App         |
-    |        v               |        v               |                    |
-    | [███████████]          | [████░░░░░░░]          | 📱 Real-time AI    |
-    |        v               |        v               | 🔋 Privacy-first   |
-    | 70% Magnitude Pruning  | Hardware Optimization  | ⚡ Low latency     |
-    | + Structured Removal   | + Quantization (8-bit) | 🔋 Always available|
-    | + Quality Validation   | + Sparse Kernels       |                    |
-    +------------------------+------------------------+--------------------+
-
-    OUTCOME: AI that was impossible becomes possible!
-```
-
-### ML Systems Analysis: Why Pruning Enables Edge AI
-
-**Memory Complexity**: O(N * sparsity) storage reduction where N = original parameters
-**Computational Complexity**: Theoretical O(N * sparsity) speedup; actual depends on hardware
-**Cache Efficiency**: Smaller models fit in cache, reducing memory bandwidth bottlenecks
-**Energy Efficiency**: Fewer operations = lower power consumption for mobile devices
-**Deployment Enablement**: Makes models fit where they couldn't before
-"""
-
-# %% nbgrader={"grade": false, "grade_id": "compression-systems-analysis", "locked": false, "schema_version": 3, "solution": true, "task": false}
-#| export
-# ✅ IMPLEMENTATION CHECKPOINT: Ensure compression pipeline is complete
-
-# 🤔 PREDICTION: How much memory do you think a 5M parameter model uses?
-# Dense model: _______ MB, 80% sparse model: _______ MB
-
-# 🔍 SYSTEMS INSIGHT #1: Memory Profiling Analysis
-def profile_compression_memory():
-    """
-    Profile memory usage patterns during model compression.
-
-    This function demonstrates how compression affects memory footprint
-    and enables deployment on resource-constrained devices.
-    """
-    import tracemalloc
-
-    print("🔬 Memory Profiling: Model Compression")
-    print("=" * 50)
-
-    # Start memory tracking
-    tracemalloc.start()
-
-    # Create large model (simulating real CNN)
-    print("Creating large model weights...")
-    model_weights = {
-        'conv1': np.random.normal(0, 0.02, (128, 64, 3, 3)),    # ~74K parameters
-        'conv2': np.random.normal(0, 0.02, (256, 128, 3, 3)),   # ~0.3M parameters
-        'linear1': np.random.normal(0, 0.01, (1024, 4096)),     # ~4.2M parameters
-        'linear2': np.random.normal(0, 0.01, (10, 1024)),       # ~10K parameters
-    }
-
-    snapshot1 = tracemalloc.take_snapshot()
-    current, peak = tracemalloc.get_traced_memory()
-    print(f"After model creation: {current / 1024 / 1024:.1f} MB current, {peak / 1024 / 1024:.1f} MB peak")
-
-    # Calculate original model size
-    original_params = sum(w.size for w in model_weights.values())
-    original_size_mb = sum(w.nbytes for w in model_weights.values()) / (1024 * 1024)
-
-    print(f"Original model: {original_params:,} parameters, {original_size_mb:.1f} MB")
-
-    # Compress model
-    print("\nCompressing model...")
-    compressor = ModelCompressor()
-    compressed_model = compressor.compress_model(model_weights)
-
-    snapshot2 = tracemalloc.take_snapshot()
-    current, peak = tracemalloc.get_traced_memory()
-    print(f"After compression: {current / 1024 / 1024:.1f} MB current, {peak / 1024 / 1024:.1f} MB peak")
-
-    # Calculate compressed model size
-    compressed_params = sum(
-        np.sum(info['weights'] != 0)
-        for info in compressed_model.values()
-    )
-
-    # Estimate compressed storage (could use sparse formats)
-    compressed_size_mb = original_size_mb * (compressed_params / original_params)
-
-    print(f"\n💾 Storage Analysis:")
-    print(f"   Original: {original_params:,} parameters ({original_size_mb:.1f} MB)")
-    print(f"   Compressed: {compressed_params:,} parameters ({compressed_size_mb:.1f} MB)")
-    print(f"   Compression ratio: {original_params / compressed_params:.1f}x")
-    print(f"   Size reduction: {original_size_mb / compressed_size_mb:.1f}x")
-    print(f"   Storage savings: {original_size_mb - compressed_size_mb:.1f} MB")
-
-    tracemalloc.stop()
-
-    # 💡 WHY THIS MATTERS: Memory is often the limiting factor for edge deployment
-    # A 200MB model won't fit on a 128MB mobile device, but a 40MB compressed model will!
-
-    return {
-        'original_params': original_params,
-        'compressed_params': compressed_params,
-        'original_size_mb': original_size_mb,
-        'compressed_size_mb': compressed_size_mb,
-        'compression_ratio': original_params / compressed_params,
-        'size_reduction': original_size_mb / compressed_size_mb
-    }
-
-# ✅ IMPLEMENTATION CHECKPOINT: Memory profiling analysis complete
-
-# 🤔 PREDICTION: Which device constraint is more limiting - memory or compute?
-# Your guess: Memory / Compute (circle one)
-
-# 🔍 SYSTEMS INSIGHT #2: Deployment Constraint Analysis
-def analyze_deployment_scenarios():
-    """Analyze how compression enables different deployment scenarios."""
-    print("\n🚀 Compression Deployment Impact Analysis")
-    print("=" * 60)
-
-    # Define deployment constraints
-    scenarios = [
-        {
-            'name': 'Mobile Phone',
-            'memory_limit_mb': 100,
-            'compute_limit_gflops': 10,
-            'power_sensitive': True,
-            'description': 'On-device inference for camera apps'
-        },
-        {
-            'name': 'IoT Device',
-            'memory_limit_mb': 20,
-            'compute_limit_gflops': 1,
-            'power_sensitive': True,
-            'description': 'Smart sensor with microcontroller'
-        },
-        {
-            'name': 'Edge Server',
-            'memory_limit_mb': 1000,
-            'compute_limit_gflops': 100,
-            'power_sensitive': False,
-            'description': 'Local inference server for privacy'
-        },
-        {
-            'name': 'Wearable',
-            'memory_limit_mb': 10,
-            'compute_limit_gflops': 0.5,
-            'power_sensitive': True,
-            'description': 'Smartwatch health monitoring'
-        }
-    ]
-
-    # Model sizes at different compression levels
-    model_configs = [
-        {'name': 'Dense Model', 'size_mb': 200, 'gflops': 50, 'accuracy': 95.0},
-        {'name': '50% Sparse', 'size_mb': 100, 'gflops': 25, 'accuracy': 94.5},
-        {'name': '70% Sparse', 'size_mb': 60, 'gflops': 15, 'accuracy': 93.8},
-        {'name': '90% Sparse', 'size_mb': 20, 'gflops': 5, 'accuracy': 91.2},
-    ]
-
-    print("Scenario | Memory | Compute | Dense | 50% | 70% | 90% | Best Option")
-    print("-" * 80)
-
-    for scenario in scenarios:
-        name = scenario['name']
-        mem_limit = scenario['memory_limit_mb']
-        compute_limit = scenario['compute_limit_gflops']
-
-        # Check which model configurations fit
-        viable_models = []
-        for config in model_configs:
-            fits_memory = config['size_mb'] <= mem_limit
-            fits_compute = config['gflops'] <= compute_limit
-
-            if fits_memory and fits_compute:
-                viable_models.append(config['name'])
-
-        # Determine best option
-        if not viable_models:
-            best_option = "None fit!"
-        else:
-            # Choose highest accuracy among viable options
-            viable_configs = [c for c in model_configs if c['name'] in viable_models]
-            best_config = max(viable_configs, key=lambda x: x['accuracy'])
-            best_option = f"{best_config['name']} ({best_config['accuracy']:.1f}%)"
-
-        # Show fit status for each compression level
-        fit_status = []
-        for config in model_configs:
-            fits_mem = config['size_mb'] <= mem_limit
-            fits_comp = config['gflops'] <= compute_limit
-            if fits_mem and fits_comp:
-                status = "✅"
-            elif fits_mem:
-                status = "⚡"   # Memory OK, compute too high
-            elif fits_comp:
-                status = "💾"   # Compute OK, memory too high
-            else:
-                status = "❌"
-            fit_status.append(status)
-
-        print(f"{name:14} | {mem_limit:4d}MB | {compute_limit:5.1f}G | "
-              f"{fit_status[0]:3} | {fit_status[1]:3} | {fit_status[2]:3} | {fit_status[3]:3} | {best_option}")
-
-    print(f"\n💡 Key Insights:")
-    print(f"   • Compression often determines deployment feasibility")
-    print(f"   • Edge devices require 70-90% sparsity for deployment")
-    print(f"   • Mobile devices can use moderate compression (50-70%)")
-    print(f"   • Power constraints favor sparse models (fewer operations)")
-    print(f"   • Memory limits are often more restrictive than compute limits")
-
-    # 💡 WHY THIS MATTERS: Compression is often about enabling deployment, not optimizing it
-    # Without compression, many edge AI applications simply wouldn't be possible!
-
-# ✅ IMPLEMENTATION CHECKPOINT: Deployment analysis complete
-
-# 🤔 PREDICTION: Will 90% sparsity give 10x speedup in practice?
-# Your prediction: ___x actual speedup (vs 10x theoretical)
-
-# 🔍 SYSTEMS INSIGHT #3: Sparse Computation Reality Check
-def benchmark_sparse_inference_speedup():
-    """Benchmark actual vs theoretical speedup from sparsity."""
-    print("\n⚡ Sparse Inference Speedup Analysis")
-    print("=" * 50)
-
-    import time
-
-    # Test different model sizes and sparsity levels
-    configs = [
-        {'size': (256, 512), 'sparsity': 0.5},
-        {'size': (512, 1024), 'sparsity': 0.7},
-        {'size': (1024, 2048), 'sparsity': 0.8},
-        {'size': (2048, 4096), 'sparsity': 0.9},
-    ]
-
-    print("Model Size | Sparsity | Theoretical | Actual | Efficiency | Notes")
-    print("-" * 70)
-
-    for config in configs:
-        size = config['size']
-        sparsity = config['sparsity']
-
-        # Create sparse layer
-        sparse_layer = SparseLinear(size[0], size[1])
-
-        # Load and prune weights
-        weights = np.random.normal(0, 0.1, (size[1], size[0]))
-        sparse_layer.load_linear_weights(weights)
-        sparse_layer.prune_weights(sparsity)
-
-        # Benchmark
-        benchmark = sparse_layer.benchmark_speedup(batch_size=16, iterations=100)
-
-        theoretical = benchmark['theoretical_speedup']
-        actual = benchmark['actual_speedup']
-        efficiency = benchmark['efficiency']
-
-        # Determine bottleneck
-        if efficiency > SPEEDUP_EFFICIENCY_HIGH:
-            notes = "CPU bound"
-        elif efficiency > SPEEDUP_EFFICIENCY_MEDIUM:
-            notes = "Memory bound"
-        else:
-            notes = "Framework overhead"
-
-        print(f"{size[0]}x{size[1]:4} | {sparsity:6.0%} | {theoretical:9.1f}x | "
-              f"{actual:5.1f}x | {efficiency:8.1%} | {notes}")
-
-    print(f"\n🎯 Speedup Reality Check:")
-    print(f"   • Theoretical speedup assumes perfect sparse hardware")
-    print(f"   • Actual speedup limited by memory bandwidth and overhead")
-    print(f"   • High sparsity (>80%) shows diminishing returns")
-    print(f"   • Production sparse hardware (GPUs, TPUs) achieves better efficiency")
-
-    # 💡 WHY THIS MATTERS: The gap between theoretical and actual speedup reveals
-    # why structured pruning and specialized hardware are essential for production deployment!
-
-# %% [markdown]
-"""
-### Test: Systems Analysis Implementation
-
-Let's verify our systems analysis provides valuable performance insights.
-"""
-
-# %% nbgrader={"grade": true, "grade_id": "test-systems-analysis", "locked": false, "points": 10, "schema_version": 3, "solution": false, "task": false}
-def test_systems_analysis():
-    """Test systems analysis and profiling functions."""
-    print("Testing systems analysis...")
-
-    # Test memory profiling
-    memory_results = profile_compression_memory()
-    assert memory_results['compression_ratio'] > 2.0, "Should show significant compression"
-    assert memory_results['original_size_mb'] > memory_results['compressed_size_mb'], "Should reduce size"
-
-    # Test deployment analysis
-    analyze_deployment_scenarios()
-
-    # Test speedup benchmarking
-    benchmark_sparse_inference_speedup()
-
-    # All functions should run without errors
-    print("✅ Systems analysis test passed!")
-
-test_systems_analysis()
-
-# %% [markdown]
-"""
-## Part 7: Production Context - Real-World Pruning Systems
-
-Let's explore how pruning is used in production ML systems and connect our implementation to real frameworks and deployment platforms.
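Before surveying the production landscape, it helps to see how small the core recipe actually is. The sketch below is a hedged NumPy illustration (written by the editor, not taken from the module above) of the L1-magnitude criterion that production APIs such as PyTorch's `torch.nn.utils.prune.l1_unstructured` apply; the helper name `l1_unstructured_prune` is hypothetical.

```python
import numpy as np

def l1_unstructured_prune(weight: np.ndarray, amount: float) -> np.ndarray:
    """Zero out the `amount` fraction of weights with the smallest |w| (hypothetical helper)."""
    k = int(round(amount * weight.size))          # number of weights to remove
    if k == 0:
        return weight.copy()
    flat = np.abs(weight).ravel()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    mask = np.abs(weight) > threshold             # strictly above threshold survives
    return weight * mask

w = np.arange(1, 11, dtype=float).reshape(2, 5)   # magnitudes 1..10, all distinct
pruned = l1_unstructured_prune(w, amount=0.7)
print((pruned == 0).mean())                       # 0.7
```

Note the strict `>` comparison: with tied magnitudes it can remove slightly more than requested, which is why production implementations break ties explicitly.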
-
-### Visual Guide: Production Pruning Ecosystem
-
-```
-    PRODUCTION PRUNING LANDSCAPE:
-
-    +----------------------------------------------------------------------+
-    |                        FRAMEWORKS & HARDWARE                         |
-    +----------------------+----------------------+------------------------+
-    |       RESEARCH       |      PRODUCTION      |       DEPLOYMENT       |
-    +----------------------+----------------------+------------------------+
-    | 🔍 PyTorch           | ⚙️ TensorRT          | 📱 Mobile Apps         |
-    |   torch.nn.utils     |   Structured pruning |   Apple Neural Eng     |
-    |   .prune             |   2:4 sparsity       |   Google Edge TPU      |
-    +----------------------+----------------------+------------------------+
-    | 🧠 TensorFlow        | 🚀 OpenVINO          | 🏠 Smart Home          |
-    |   Model Optimization |   Intel optimization |   Always-on AI         |
-    |   Gradual pruning    |   CPU/GPU sparse     |   Voice assistants     |
-    +----------------------+----------------------+------------------------+
-    | 🔬 Our TinyTorch     | 🎯 Production-Ready  | 🏆 Success Stories     |
-    |   Educational impl.  |   Magnitude + struct |   Tesla Autopilot      |
-    |   Magnitude pruning  |   Quality validation |   Google Pixel         |
-    +----------------------+----------------------+------------------------+
-```
-
-### Real-World Application Examples
-
-```
-    COMPRESSION SUCCESS STORIES:
-
-    📱 MOBILE PHOTOGRAPHY (Google Pixel)
-       Original: Portrait CNN, 45MB, 120ms
-       Compressed: 70% pruning + quantization -> 12MB, 35ms
-       Result: Real-time portrait mode on phone
-
-    🚗 AUTONOMOUS VEHICLES (Tesla FSD)
-       Original: Object detection, 2GB, 80ms
-       Compressed: 50% structured pruning -> 1GB, 35ms
-       Result: Real-time object detection for safety
-
-    🏠 SMART HOME (Alexa)
-       Original: Wake word detection, 15MB
-       Compressed: 95% pruning + 8-bit quantization -> 0.5MB
-       Result: Always-on listening with <1mW power
-
-    🎥 AUGMENTED REALITY (Apple ARKit)
-       Original: Hand tracking, 80MB, 16ms
-       Compressed: Channel pruning + mobile optimization -> 25MB, 8ms
-       Result: 60fps hand tracking on mobile GPU
-```
-
-### Production Pruning Systems:
-1. **PyTorch Pruning**: `torch.nn.utils.prune` for magnitude and structured pruning
-2. **TensorFlow Model Optimization**: Pruning API with gradual sparsity
-3. **NVIDIA TensorRT**: Structured pruning for inference acceleration
-4. **OpenVINO**: Intel's optimization toolkit with pruning support
-5. **Edge TPU**: Google's quantization + pruning for mobile inference
-6. **Apple Neural Engine**: Hardware-accelerated sparse computation
-"""
-
-# %% nbgrader={"grade": false, "grade_id": "production-context", "locked": false, "schema_version": 3, "solution": true, "task": false}
-def compare_with_production_pruning():
-    """
-    Compare our implementation with production pruning systems.
-
-    This function explains how real ML frameworks handle pruning
-    and where our implementation fits in the broader ecosystem.
-    """
-    print("🏭 Production Pruning Systems Comparison")
-    print("=" * 70)
-
-    frameworks = {
-        'PyTorch': {
-            'pruning_methods': ['Magnitude', 'Random', 'Structured', 'Custom'],
-            'sparsity_support': ['Unstructured', 'Structured (channel)', '2:4 sparsity'],
-            'deployment': 'TorchScript, ONNX export with sparse ops',
-            'hardware_acceleration': 'Limited - mostly research focused',
-            'our_similarity': 'High - similar magnitude-based approach'
-        },
-        'TensorFlow': {
-            'pruning_methods': ['Magnitude', 'Gradual', 'Structured'],
-            'sparsity_support': ['Unstructured', 'Block sparse', 'Structured'],
-            'deployment': 'TensorFlow Lite with sparse inference',
-            'hardware_acceleration': 'XLA optimization, mobile acceleration',
-            'our_similarity': 'High - magnitude pruning with calibration'
-        },
-        'TensorRT': {
-            'pruning_methods': ['Structured only', 'Channel pruning'],
-            'sparsity_support': ['2:4 structured sparsity', 'Channel removal'],
-            'deployment': 'Optimized inference engine with sparse kernels',
-            'hardware_acceleration': 'GPU Tensor Cores, specialized sparse ops',
-            'our_similarity': 'Medium - focuses on structured pruning'
-        },
-        'OpenVINO': {
-            'pruning_methods': ['Magnitude', 'Structured', 'Mixed precision'],
-            'sparsity_support': ['Unstructured', 'Block sparse', 'Channel wise'],
-            'deployment': 'Intel CPU/GPU optimization with sparse support',
-            'hardware_acceleration': 'Intel VPU, CPU vectorization',
-            'our_similarity': 'High - comprehensive pruning toolkit'
-        },
-        'Our TinyTorch': {
-            'pruning_methods': ['Magnitude-based', 'Structured filter pruning'],
-            'sparsity_support': ['Unstructured', 'Structured (filter removal)'],
-            'deployment': 'Educational sparse computation simulation',
-            'hardware_acceleration': 'Educational - simulated speedups',
-            'our_similarity': 'Reference implementation for learning'
-        }
-    }
-
-    print("Framework | Methods | Hardware Support | Deployment | Similarity")
-    print("-" * 70)
-
-    for name, specs in frameworks.items():
-        methods_str = specs['pruning_methods'][0]  # Primary method
-        hw_str = specs['hardware_acceleration'][:20] + "..." if len(specs['hardware_acceleration']) > 20 else specs['hardware_acceleration']
-        deploy_str = specs['deployment'][:20] + "..." if len(specs['deployment']) > 20 else specs['deployment']
-        sim_str = specs['our_similarity'][:15] + "..." if len(specs['our_similarity']) > 15 else specs['our_similarity']
-
-        print(f"{name:9} | {methods_str:12} | {hw_str:16} | {deploy_str:12} | {sim_str}")
-
-    print(f"\n🎯 Key Production Insights:")
-    print(f"   • Our magnitude approach is industry standard")
-    print(f"   • Production systems emphasize structured pruning for hardware")
-    print(f"   • Real frameworks integrate pruning with quantization")
-    print(f"   • Hardware acceleration requires specialized sparse kernels")
-    print(f"   • Mobile deployment drives most production pruning adoption")
-
-def demonstrate_pruning_applications():
-    """Show real-world applications where pruning enables deployment."""
-    print("\n🌟 Real-World Pruning Applications")
-    print("=" * 50)
-
-    applications = [
-        {
-            'domain': 'Mobile Photography',
-            'model': 'Portrait segmentation CNN',
-            'constraints': '< 10MB, < 100ms inference',
-            'pruning_strategy': '70% unstructured + quantization',
-            'outcome': 'Real-time portrait mode on phone cameras',
-            'example': 'Google Pixel, iPhone portrait mode'
-        },
-        {
-            'domain': 'Autonomous Vehicles',
-            'model': 'Object detection (YOLO)',
-            'constraints': '< 500MB, < 50ms inference, safety critical',
-            'pruning_strategy': '50% structured pruning for latency',
-            'outcome': 'Real-time object detection for ADAS',
-            'example': 'Tesla FSD, Waymo perception stack'
-        },
-        {
-            'domain': 'Smart Home',
-            'model': 'Voice keyword detection',
-            'constraints': '< 1MB, always-on, battery powered',
-            'pruning_strategy': '90% sparsity + 8-bit quantization',
-            'outcome': 'Always-listening wake word detection',
-            'example': 'Alexa, Google Assistant edge processing'
-        },
-        {
-            'domain': 'Medical Imaging',
-            'model': 'X-ray diagnosis CNN',
-            'constraints': 'Edge deployment, <1GB memory',
-            'pruning_strategy': '60% structured pruning + knowledge distillation',
-            'outcome': 'Portable medical AI for remote clinics',
-            'example': 'Google AI for radiology, Zebra Medical'
-        },
-        {
-            'domain': 'Augmented Reality',
-            'model': 'Hand tracking and gesture recognition',
-            'constraints': '< 50MB, 60fps, mobile GPU',
-            'pruning_strategy': 'Channel pruning + mobile-optimized architecture',
-            'outcome': 'Real-time hand tracking for AR experiences',
-            'example': 'Apple ARKit, Google ARCore, Meta Quest'
-        }
-    ]
-
-    print("Domain | Model Type | Pruning Strategy | Outcome")
-    print("-" * 75)
-
-    for app in applications:
-        domain_str = app['domain'][:18]
-        model_str = app['model'][:15] + "..." if len(app['model']) > 15 else app['model']
-        strategy_str = app['pruning_strategy'][:20] + "..." if len(app['pruning_strategy']) > 20 else app['pruning_strategy']
-        outcome_str = app['outcome'][:25] + "..." if len(app['outcome']) > 25 else app['outcome']
-
-        print(f"{domain_str:18} | {model_str:10} | {strategy_str:16} | {outcome_str}")
-        print(f"   Example: {app['example']}")
-        print()
-
-    print("💡 Common Patterns in Production Pruning:")
-    print("   • Latency-critical apps use structured pruning (regular sparsity)")
-    print("   • Memory-constrained devices use aggressive unstructured pruning")
-    print("   • Safety-critical systems use conservative pruning with validation")
-    print("   • Mobile apps combine pruning + quantization for maximum compression")
-    print("   • Edge AI enables privacy (on-device processing) through compression")
-
-    # Visual success metrics
-    print(f"🏆 Production Success Metrics:")
-    print(f"   +----------------------------------------------------+")
-    print(f"   | Application       | Size Reduction | Latency Gain  |")
-    print(f"   +-------------------+----------------+---------------+")
-    print(f"   | Mobile Camera     |       4x       |     3.5x      |")
-    print(f"   | Voice Assistant   |      30x       |      10x      |")
-    print(f"   | Autonomous Car    |       2x       |     2.3x      |")
-    print(f"   | AR Hand Tracking  |       3x       |       2x      |")
-    print(f"   +-------------------+----------------+---------------+")
-
-# %% [markdown]
-"""
-### Test: Production Context Analysis
-
-Let's verify our production context analysis provides valuable insights.
-"""
-
-# %% nbgrader={"grade": true, "grade_id": "test-production-context", "locked": false, "points": 5, "schema_version": 3, "solution": false, "task": false}
-def test_production_context():
-    """Test production context analysis."""
-    print("Testing production context analysis...")
-
-    # Test framework comparison
-    compare_with_production_pruning()
-
-    # Test applications demonstration
-    demonstrate_pruning_applications()
-
-    # Both functions should run without errors and provide insights
-    print("✅ Production context analysis test passed!")
-
-test_production_context()
-
-# %% [markdown]
-"""
-## Comprehensive Testing
-
-Let's run a comprehensive test of all compression functionality to ensure everything works together correctly.
-"""
-
-# %% nbgrader={"grade": false, "grade_id": "comprehensive-testing", "locked": false, "schema_version": 3, "solution": false, "task": false}
-def run_all_tests():
-    """Run comprehensive test suite for compression module."""
-    print("🧪 Running Comprehensive Compression Test Suite")
-    print("=" * 60)
-
-    test_functions = [
-        ("Weight Redundancy Analysis", test_redundancy_analysis),
-        ("Magnitude-Based Pruning", test_magnitude_pruning),
-        ("Structured Pruning", test_structured_pruning),
-        ("Sparse Neural Network", test_sparse_neural_network),
-        ("Model Compression Pipeline", test_compression_pipeline),
-        ("Systems Analysis", test_systems_analysis),
-        ("Production Context", test_production_context)
-    ]
-
-    passed = 0
-    total = len(test_functions)
-
-    for test_name, test_func in test_functions:
-        print(f"\n{'='*20} {test_name} {'='*20}")
-        try:
-            test_func()
-            print(f"✅ {test_name}: PASSED")
-            passed += 1
-        except Exception as e:
-            print(f"❌ {test_name}: FAILED - {e}")
-
-    print(f"\n🎯 Test Results: {passed}/{total} tests passed")
-
-    if passed == total:
-        print("🎉 All compression tests passed! Module implementation complete.")
-
-        # Show final demo
-        print(f"\n🚀 Final Compression Demo:")
-        print("=" * 50)
-
-        # Create a realistic model and compress it
-        np.random.seed(42)
-        demo_model = {
-            'backbone_conv': np.random.normal(0, 0.02, (128, 64, 3, 3)),
-            'classifier_linear': np.random.normal(0, 0.01, (10, 2048)),
-        }
-
-        compressor = ModelCompressor()
-        compressed = compressor.compress_model(demo_model, {'backbone_conv': 0.7, 'classifier_linear': 0.8})
-
-        original_params = sum(w.size for w in demo_model.values())
-        compressed_params = sum(np.sum(info['weights'] != 0) for info in compressed.values())
-
-        print(f"🎯 FINAL RESULT:")
-        print(f"   Original model: {original_params:,} parameters")
-        print(f"   Compressed model: {compressed_params:,} parameters")
-        print(f"   Compression achieved: {original_params/compressed_params:.1f}x smaller")
-        print(f"   Size reduction: {(1-compressed_params/original_params)*100:.1f}% of parameters removed")
-        print(f"   ✅ Ready for edge deployment!")
-
-    else:
-        print(f"⚠️ {total - passed} tests failed. Review implementation.")
-
-# Run all systems insights
-profile_compression_memory()
-analyze_deployment_scenarios()
-benchmark_sparse_inference_speedup()
-
-if __name__ == "__main__":
-    run_all_tests()
-
-# %% [markdown]
-"""
-## 🤔 ML Systems Thinking: Interactive Questions
-
-Now that you've implemented neural network pruning, let's reflect on the systems engineering principles and production deployment considerations.
-
-**Instructions**: Think through these questions based on your implementation experience. Consider both the technical details and the broader systems implications.
-"""
-
-# %% [markdown] nbgrader={"grade": true, "grade_id": "compression-analysis-1", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false}
-"""
-**Computational Assessment 1: Pruning Threshold Calculation**
-
-Given a weight tensor with values [0.8, 0.1, 0.05, 0.3, 0.02, 0.7, 0.4, 0.09, 0.6, 0.03], calculate the pruning threshold for 70% sparsity.
-
-a) Sort the weights by magnitude and identify the 70th percentile threshold
-b) Create the binary mask showing which weights survive pruning
-c) Calculate the actual compression ratio achieved
-d) Explain why the threshold approach guarantees the target sparsity level
-
-**Your Answer:**
-
-### BEGIN SOLUTION
-a) Sorted weights by magnitude: [0.02, 0.03, 0.05, 0.09, 0.1, 0.3, 0.4, 0.6, 0.7, 0.8]
-   70th percentile (keep top 30%) = weights[7] = 0.6
-   Threshold = 0.6 (keep weights >= 0.6)
-
-b) Binary mask for original array [0.8, 0.1, 0.05, 0.3, 0.02, 0.7, 0.4, 0.09, 0.6, 0.03]:
-   Mask: [1, 0, 0, 0, 0, 1, 0, 0, 1, 0]
-   Surviving weights: [0.8, 0, 0, 0, 0, 0.7, 0, 0, 0.6, 0]
-
-c) Compression ratio calculation:
-   - Original parameters: 10
-   - Surviving parameters: 3 (values >= 0.6)
-   - Actual sparsity: 7/10 = 70% exactly
-   - Compression ratio: 10/3 = 3.33x
-
-d) The threshold approach guarantees sparsity because:
-   - The percentile calculation ensures an exact fraction of weights survive
-   - The 70th percentile means exactly 70% of weights fall below the threshold
-   - Binary thresholding creates a deterministic pruning mask
-   - No randomness - the same input always gives the same sparsity level
-### END SOLUTION
-"""
-
-# %% [markdown] nbgrader={"grade": true, "grade_id": "compression-analysis-2", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false}
-"""
-**Computational Assessment 2: Structured vs Unstructured Comparison**
-
-Consider a Conv2D layer with 8 filters of shape (4, 3, 3) - total 288 parameters. Calculate compression for both pruning strategies:
-
-a) Unstructured pruning with 75% sparsity: How many parameters remain?
-b) Structured pruning removing 6 out of 8 filters: What's the compression ratio?
-c) Which provides better actual speedup and why?
-d) For mobile deployment requiring <50 parameters, which pruning strategy works?
-
-**Your Answer:**
-
-### BEGIN SOLUTION
-a) Unstructured pruning (75% sparsity):
-   - Original parameters: 8 * 4 * 3 * 3 = 288
-   - Sparsity = 75% means keep 25% of weights
-   - Remaining parameters: 288 * 0.25 = 72 parameters
-   - Compression ratio: 288/72 = 4x
-   - BUT: Still need to store 288 values (with zeros), irregular sparsity pattern
-
-b) Structured pruning (remove 6 filters, keep 2):
-   - Filters removed: 6/8 = 75% of filters
-   - Remaining parameters: 2 * 4 * 3 * 3 = 72 parameters
-   - Compression ratio: 288/72 = 4x (same as unstructured)
-   - BUT: Dense 2*4*3*3 tensor, no zeros to store
-
-c) Structured provides better actual speedup because:
-   - Dense computation on a smaller tensor (2*4*3*3) vs sparse computation on the large one (8*4*3*3)
-   - No conditional branching (if weight != 0) in inner loops
-   - Better cache locality with contiguous memory access
-   - Can use optimized BLAS/convolution libraries
-   - Unstructured requires specialized sparse kernels (often unavailable)
-
-d) For the <50 parameter mobile constraint:
-   - Unstructured: 72 remaining parameters > 50 -> doesn't fit
-   - Structured: Each filter is 4*3*3 = 36 parameters, so keeping 1 filter = 36 parameters ✅
-   - Structured pruning is better for extreme resource constraints
-### END SOLUTION
-"""
-
-# %% [markdown] nbgrader={"grade": true, "grade_id": "compression-analysis-3", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false}
-"""
-**Computational Assessment 3: Deployment Scenario Optimization**
-
-You need to deploy a model on three different devices with these constraints:
-- **Mobile**: 50MB memory, 10 GFLOPS compute
-- **IoT Device**: 10MB memory, 1 GFLOPS compute
-- **Edge Server**: 500MB memory, 100 GFLOPS compute - -Original model: 200MB, 40 GFLOPS. Compression options: -- 50% sparse: 100MB, 20 GFLOPS, 94% accuracy -- 70% sparse: 60MB, 12 GFLOPS, 92% accuracy -- 90% sparse: 20MB, 4 GFLOPS, 87% accuracy - -a) Which compression level works for each device? -b) Calculate the accuracy-efficiency tradeoff for each scenario -c) Recommend optimal compression strategy for each deployment target - -**Your Answer:** - -### BEGIN SOLUTION -a) Device compatibility analysis: - - Mobile (50MB, 10 GFLOPS): ✗ 50% (100MB > 50MB), ✗ 70% (12 GFLOPS > 10 GFLOPS), OK 90% (20MB, 4 GFLOPS) - - IoT (10MB, 1 GFLOPS): ✗ 50%, ✗ 70%, ✗ 90% (20MB > 10MB and 4 GFLOPS > 1 GFLOPS) - - Edge Server (500MB, 100 GFLOPS): OK All options work - -b) Accuracy-efficiency tradeoff (accuracy/memory ratio): - - 50% sparse: 94%/100MB = 0.94%/MB - - 70% sparse: 92%/60MB = 1.53%/MB (best efficiency) - - 90% sparse: 87%/20MB = 4.35%/MB (extreme efficiency) - -c) Optimal recommendations: - - Mobile: 90% sparse (the only option within both the 50MB memory and 10 GFLOPS compute budgets, 87% accuracy) - - IoT Device: None work! Need more aggressive compression + structured pruning + quantization - - Edge Server: 50% sparse (maximum accuracy 94% with abundant resources) - - IoT solution: Combine 90% pruning + 8-bit quantization + structured pruning: - - Memory: 20MB -> 5MB (quantization) -> 2MB (structured) OK - - Compute: 4 GFLOPS -> 1 GFLOPS (structured optimization) OK -### END SOLUTION -""" - -# %% [markdown] nbgrader={"grade": true, "grade_id": "compression-analysis-4", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} -""" -**Computational Assessment 4: Production Pipeline Economics** - -A company processes 1M inference requests/day.
Compare costs for dense vs compressed models: - -**Dense Model**: 500MB memory, 50ms latency, $0.10/hour cloud compute -**Compressed Model**: 100MB memory, 20ms latency, $0.04/hour cloud compute - -a) Calculate daily compute costs for both models -b) Determine memory savings and infrastructure requirements -c) If compression development costs $50,000, what's the break-even timeline? -d) Analyze the business case for different deployment scales - -**Your Answer:** - -### BEGIN SOLUTION -a) Daily compute cost calculation: - Dense model: - - 1M requests * 50ms = 50,000 seconds = 13.9 hours - - Daily cost: 13.9 hours * $0.10 = $1.39/day - - Compressed model: - - 1M requests * 20ms = 20,000 seconds = 5.6 hours - - Daily cost: 5.6 hours * $0.04 = $0.22/day - - Daily savings: $1.39 - $0.22 = $1.17/day - -b) Infrastructure analysis: - - Memory savings: 500MB -> 100MB = 5x reduction - - Server capacity: 5x more models per server (memory bound) - - Latency improvement: 50ms -> 20ms = 2.5x faster response - - Throughput: 2.5x more requests per server - -c) Break-even timeline: - - Development cost: $50,000 - - Daily savings: $1.17 - - Break-even: $50,000 / $1.17 = 42,735 days ~= 117 years! 
- - At 1M requests/day the compute savings alone never repay the $50,000 investment; the economics only work at larger scale: - At 100M requests/day (large scale): - - Dense: 1,389 hours * $0.10 = $138.90/day - - Compressed: 556 hours * $0.04 = $22.24/day - - Daily savings: $116.66 - - Break-even: $50,000 / $116.66 = 428 days ~= 14 months OK - -d) Business case by scale: - - Small scale (<1M/day): ROI unclear, focus on accuracy - - Medium scale (10M/day): ~$11.67/day savings -> ~12-year ROI on compute alone; justify via latency and capacity gains instead - - Large scale (100M+/day): ~14-month ROI, strong business case - - Hyperscale (1B+/day): ~6-week ROI, compression essential -### END SOLUTION -""" - -# %% [markdown] nbgrader={"grade": true, "grade_id": "systems-thinking-1", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} -""" -**Question 1: Pruning Strategy Analysis** - -You implemented both magnitude-based and structured pruning in your `MagnitudePruner` and `prune_conv_filters()` functions: - -a) Why does magnitude-based pruning work so well for neural networks? What does the effectiveness of this simple heuristic tell us about neural network weight distributions? - -b) In your structured vs unstructured comparison, structured pruning achieved lower compression ratios but is preferred for deployment. Explain this tradeoff in terms of hardware efficiency and inference speed. - -c) Your compression pipeline used different sparsity targets per layer (conv: 60%, dense: 80%). Why do dense layers typically tolerate higher sparsity than convolutional layers?
- -**Your Answer:** - - -a) Magnitude-based pruning works because: -- Neural networks exhibit natural redundancy with many small, unimportant weights -- Weight magnitude correlates with importance - small weights contribute little to output -- Networks are over-parametrized, so removing low-magnitude weights has minimal accuracy impact -- The success reveals that weight distributions have long tails - most weights are small, few are large -- This natural sparsity suggests networks learn efficient representations despite overparametrization - -b) The structured vs unstructured tradeoff: -- Unstructured: Higher compression (removes individual weights) but irregular sparsity patterns -- Structured: Lower compression (removes entire filters/channels) but regular, hardware-friendly patterns -- Hardware prefers structured because: dense computation on smaller tensors is faster than sparse computation -- Memory access: structured removal reduces tensor sizes, improving cache efficiency -- No need for specialized sparse kernels - can use standard GEMM operations -- Inference speed: structured pruning provides actual speedup, unstructured often theoretical only - -c) Layer-specific sparsity tolerance: -- Linear layers: High redundancy, many parameters, more overparametrized -> tolerate 80% sparsity -- Conv layers: Fewer parameters, each filter captures important spatial features -> more sensitive -- First layers: Extract low-level features (edges, textures) -> very sensitive to pruning -- Later layers: More abstract features with redundancy -> can handle moderate pruning -- Output layers: Critical for final predictions -> require conservative pruning - -""" - -# %% [markdown] nbgrader={"grade": true, "grade_id": "systems-thinking-2", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} -""" -**Question 2: Sparse Computation and Hardware Efficiency** - -Your `SparseLinear` class demonstrated the challenges of actually accelerating sparse 
computation: - -a) Why did your sparse computation benchmarks show lower actual speedup compared to theoretical speedup? What are the main bottlenecks preventing sparse computation from achieving theoretical gains? - -b) In your deployment analysis, mobile devices required 70-90% sparsity while edge servers could use 50%. Explain how hardware constraints drive pruning requirements differently across deployment targets. - -c) You found that structured pruning provides better real-world performance than unstructured pruning. How would you design a neural network architecture that's naturally "pruning-friendly" from the start? - -**Your Answer:** - - -a) Lower actual speedup due to multiple bottlenecks: -- Memory bandwidth: Sparse computation is often memory-bound, not compute-bound -- Framework overhead: PyTorch/NumPy not optimized for arbitrary sparsity patterns -- Cache inefficiency: Irregular sparse patterns hurt cache locality compared to dense operations -- Vectorization loss: SIMD instructions work best on dense, regular data patterns -- Index overhead: Storing and accessing sparse indices adds computational cost -- Hardware mismatch: Most CPUs/GPUs optimized for dense linear algebra, not sparse - -b) Hardware-driven pruning requirements: -- Mobile: Strict memory (4GB total), battery, thermal constraints -> need aggressive 70-90% sparsity -- Edge servers: More memory (16GB+), power, cooling -> moderate 50% sparsity sufficient -- Cloud: Abundant resources -> pruning for cost optimization, not necessity -- Embedded/IoT: Extreme constraints (MB not GB) -> need structured pruning + quantization -- Different hardware accelerators: Edge TPU loves sparsity, standard GPUs don't benefit much - -c) Pruning-friendly architecture design: -- Use more, smaller layers rather than fewer, large layers (easier to prune entire channels) -- Design with skip connections (allows aggressive pruning of individual branches) -- Separate feature extraction from classification (different 
pruning sensitivities) -- Use group convolutions (natural structured pruning boundaries) -- Design with mobile-first mindset (efficient from start, not compressed afterward) -- Consider lottery ticket initialization (start with good sparse subnetwork) - -""" - -# %% [markdown] nbgrader={"grade": true, "grade_id": "systems-thinking-3", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} -""" -**Question 3: Model Compression Pipeline and Production Deployment** - -Your `ModelCompressor` implemented a complete compression pipeline with analysis, compression, and validation: - -a) Your pipeline analyzed each layer to recommend sparsity levels. In production deployment, how would you extend this to handle dynamic workloads where the optimal sparsity might change based on accuracy requirements or latency constraints? - -b) You implemented quality validation by comparing weight preservation. But in production, what matters is end-to-end accuracy and latency. How would you design a compression validation system that ensures deployment success? - -c) Looking at your production applications analysis, why is pruning often combined with other optimizations (quantization, knowledge distillation) rather than used alone? What are the complementary benefits? 
- -**Your Answer:** - - -a) Dynamic compression for production: -- A/B testing framework: gradually adjust sparsity based on accuracy metrics in production -- Multi-model serving: maintain models at different compression levels (70%, 80%, 90% sparse) -- Dynamic switching: use less compressed models during high-accuracy periods, more during low-latency needs -- Feedback loop: monitor accuracy degradation and automatically adjust compression -- User-specific models: different compression for different user segments or use cases -- Time-based adaptation: more compression during peak load, less during quality-critical periods -- Canary deployments: test compression changes on small traffic percentage first - -b) End-to-end validation system: -- Task-specific metrics: measure final accuracy, F1, BLEU - whatever matters for the application -- Latency benchmarking: measure actual inference time on target hardware -- A/B testing: compare compressed vs uncompressed models on real user traffic -- Regression testing: ensure compression doesn't break edge cases or specific inputs -- Hardware-specific validation: test on actual deployment hardware, not just development machines -- Load testing: verify performance under realistic concurrent inference loads -- Accuracy monitoring: continuous validation in production with automatic rollback triggers - -c) Why pruning is combined with other optimizations: -- Pruning + quantization: attack both parameter count and parameter size (4x + 4x = 16x compression) -- Pruning + knowledge distillation: maintain accuracy while compressing (teacher-student training) -- Complementary bottlenecks: pruning reduces compute, quantization reduces memory bandwidth -- Different deployment needs: mobile needs both size and speed, cloud needs cost optimization -- Diminishing returns: 90% pruning alone may hurt accuracy, but 70% pruning + quantization achieves same compression with better accuracy -- Hardware optimization: different techniques work better 
on different hardware (GPU vs mobile CPU) - -""" - -# %% [markdown] nbgrader={"grade": true, "grade_id": "systems-thinking-4", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} -""" -**Question 4: Edge AI and Deployment Enablement** - -Based on your systems analysis and deployment scenarios: - -a) Your memory profiling showed that pruning enables deployment where dense models won't fit. But pruning also changes the computational characteristics of models. How does this affect the entire ML systems stack, from training to serving? - -b) In your production applications analysis, you saw pruning enabling privacy-preserving on-device AI. Explain how compression techniques like pruning change the fundamental economics and capabilities of AI deployment. - -c) Looking forward, how do you think the relationship between model architectures, hardware capabilities, and compression techniques will evolve? What are the implications for ML systems engineering? - -**Your Answer:** - - -a) Pruning affects the entire ML systems stack: -- Training: Need pruning-aware training, gradual sparsity increases, specialized optimizers -- Model versioning: Track both dense and compressed versions, compression parameters -- Serving infrastructure: Need sparse computation support, different batching strategies -- Monitoring: Different performance characteristics, need sparsity-aware metrics -- Debugging: Sparse models behave differently, need specialized debugging tools -- Hardware utilization: Lower compute utilization but different memory access patterns -- Load balancing: Sparse models have different latency profiles, affects request routing - -b) Compression changes AI deployment economics: -- Democratizes AI: Enables AI on devices that couldn't run dense models (phones, IoT, wearables) -- Privacy transformation: On-device processing eliminates need to send data to cloud -- Cost structure shift: Reduces cloud compute costs, shifts processing to edge 
devices -- Latency improvement: Local processing eliminates network round-trips -- Offline capability: Compressed models enable AI without internet connectivity -- Market expansion: Creates new use cases impossible with cloud-only AI -- Energy efficiency: Critical for battery-powered devices, enables always-on AI - -c) Future evolution predictions: -- Hardware-software co-design: Chips designed specifically for sparse computation (like Edge TPU) -- Architecture evolution: Networks designed for compression from scratch, not post-hoc optimization -- Automatic compression: ML systems that automatically find optimal compression for deployment targets -- Dynamic compression: Models that adapt compression level based on runtime constraints -- Compression-aware training: End-to-end training that considers deployment constraints -- Standardization: Common sparse formats and APIs across frameworks and hardware -- New paradigms: Mixture of experts, early exit networks - architecturally sparse models -- The future is compression-first design, not compression as afterthought - -""" - -# %% [markdown] -""" -## TARGET MODULE SUMMARY: Compression - Neural Network Pruning for Edge Deployment - -### What You Accomplished - -In this module, you built a complete **neural network compression system** using pruning techniques that remove 70% of parameters while maintaining 95%+ accuracy. 
You learned to: - -**🔧 Core Implementation Skills:** -- **Magnitude-based pruning**: Identified and removed unimportant weights using simple yet effective heuristics -- **Structured vs unstructured pruning**: Built both approaches and understood their hardware tradeoffs -- **Sparse computation**: Implemented efficient sparse linear layers and benchmarked real vs theoretical speedups -- **End-to-end compression pipeline**: Created production-ready model compression with analysis, validation, and optimization - -**📊 Systems Engineering Insights:** -- **Neural network redundancy**: Discovered that networks contain 70-90% redundant parameters that can be safely removed -- **Hardware efficiency tradeoffs**: Understood why structured pruning provides actual speedup while unstructured gives theoretical speedup -- **Memory vs compute optimization**: Learned how pruning reduces both memory footprint and computational requirements -- **Deployment enablement**: Saw how compression makes models fit where they previously couldn't run - -**🏭 Production Understanding:** -- **Edge deployment scenarios**: Analyzed how pruning enables mobile, IoT, and embedded AI applications -- **Compression pipeline design**: Built systems that analyze, compress, and validate models for production deployment -- **Hardware-aware optimization**: Understood how different deployment targets require different pruning strategies -- **Quality assurance**: Implemented validation systems to ensure compression doesn't degrade model performance - -### ML Systems Engineering Connection - -This module demonstrates that **compression is fundamentally about enabling deployment**, not just reducing model size. 
You learned: - -- **Why redundancy exists**: Neural networks are over-parametrized, creating massive compression opportunities -- **Hardware drives strategy**: Structured vs unstructured pruning choice depends on target hardware capabilities -- **Compression enables privacy**: On-device processing becomes possible when models are small enough -- **Systems thinking**: Compression affects the entire ML stack from training to serving - -### Real-World Impact - -Your compression implementation mirrors production systems used by: -- **Mobile AI**: Apple's Neural Engine, Google's Edge TPU leverage sparsity for efficient inference -- **Autonomous vehicles**: Tesla FSD uses pruning for real-time object detection -- **Smart devices**: Alexa, Google Assistant use extreme compression for always-on wake word detection -- **Medical AI**: Portable diagnostic systems enabled by compressed models - -The techniques you built make the difference between AI that runs in the cloud versus AI that runs in your pocket - enabling privacy, reducing latency, and creating entirely new application categories. - -**Next**: This completes our ML Systems engineering journey! You've now built the complete stack from tensors to production deployment, understanding how each component contributes to building real-world AI systems that scale. -""" \ No newline at end of file diff --git a/modules_old/17_compression/module.yaml b/modules_old/17_compression/module.yaml deleted file mode 100644 index 699fa0c8..00000000 --- a/modules_old/17_compression/module.yaml +++ /dev/null @@ -1,28 +0,0 @@ -description: 'Model compression through pruning and sparsity. Students learn to identify - and remove - - redundant parameters, achieving 70-80% sparsity while maintaining accuracy. Essential - - for edge deployment and mobile devices. 
- - ' -difficulty: advanced -estimated_hours: 8-10 -exports: -- tinytorch.optimizations.compression -learning_objectives: -- Understand sparsity and redundancy in neural networks -- Implement magnitude-based pruning -- Build structured and unstructured pruning -- Measure accuracy vs model size tradeoffs -name: Compression -number: 17 -prerequisites: -- Module 15: Acceleration -- Module 16: Quantization -skills_developed: -- Pruning techniques -- Sparsity management -- Model compression -- Edge deployment optimization -type: optimization diff --git a/modules_old/18_caching/README.md b/modules_old/18_caching/README.md deleted file mode 100644 index c4c04ee8..00000000 --- a/modules_old/18_caching/README.md +++ /dev/null @@ -1,115 +0,0 @@ -# Module 19: Caching - KV Cache Optimization - -## Overview -Master the most sophisticated transformer optimization: KV caching. Transform O(N²) attention complexity into O(N) for autoregressive generation, achieving 10-100x speedups in transformer inference. - -## What You'll Build -- **KVCache**: Efficient storage for key-value tensors across layers -- **CachedMultiHeadAttention**: Attention with incremental computation -- **Cached Generation**: Autoregressive text generation with dramatic speedups -- **Performance Analysis**: Comprehensive memory vs compute trade-off analysis - -## Learning Objectives -1. **Algorithmic Optimization**: How changing algorithms (not just implementation) achieves massive speedups -2. **Memory Management**: Trading memory for computational efficiency in production systems -3. **Incremental Computation**: Building systems that efficiently reuse previous work -4. 
**Production Optimization**: Understanding how real LLMs achieve fast inference - -## Prerequisites -- Module 13: Attention (multi-head attention mechanics) -- Module 14: Transformers (transformer architecture) - -## Key Concepts - -### The Problem: Quadratic Attention -```python -# Traditional generation: O(N²) recomputation -Generate token 1: Attend to [] (empty) -Generate token 2: Attend to [token_1] # Recomputes K,V for token_1 -Generate token 3: Attend to [token_1, token_2] # Recomputes K,V for all previous -# Total operations: 1² + 2² + 3² + ... + N² = O(N³) for full sequence! -``` - -### The Solution: KV Caching -```python -# Cache approach: Store computed K,V tensors -cache.update(layer=0, keys=K₁, values=V₁, position=0) -# Next step: Reuse cached K,V, only compute new token -K_combined = concat(cache.get_keys(), K₂) # No recomputation of cached K -V_combined = concat(cache.get_values(), V₂) # Reuse all previous work -``` - -### KV Cache Implementation -```python -class KVCache: - def __init__(self, max_seq_len, n_layers, n_heads, head_dim): - # Pre-allocate one K and one V buffer per layer - self.k_cache = {l: zeros(max_seq_len, n_heads, head_dim) for l in range(n_layers)} - self.v_cache = {l: zeros(max_seq_len, n_heads, head_dim) for l in range(n_layers)} - self.position = 0 - - def update(self, layer_idx, key, value): - # Store at current position - self.k_cache[layer_idx][self.position] = key - self.v_cache[layer_idx][self.position] = value -``` - -## Performance Impact -- **Complexity**: O(N²) → O(N) per generation step -- **Memory**: Linear growth with sequence length -- **Speedup**: 10-100x faster for typical sequences -- **Break-even**: Beneficial after ~20-50 tokens - -## Real-World Applications -- **GPT-3/4**: Uses KV caching for all inference -- **ChatGPT**: Real-time conversation enabled by caching -- **Code Generation**: Fast autocompletion and code synthesis -- **Translation**: Efficient sequence-to-sequence generation - -## Module Structure -1. **Problem Analysis**: Understanding O(N²) attention complexity -2.
**KV Cache Design**: Efficient tensor storage and retrieval -3. **Cached Attention**: Modified attention using cached K,V -4. **Generation Pipeline**: Complete autoregressive generation -5. **Performance Analysis**: Memory vs compute trade-off studies -6. **Production Context**: How real systems implement caching - -## Hands-On Projects -```python -# Project 1: Build KV cache -cache = KVCache(max_seq_len=1000, n_layers=12, n_heads=16, head_dim=64) -attention = CachedMultiHeadAttention(embed_dim=1024, num_heads=16) - -# Project 2: Compare performance -non_cached_time = benchmark_standard_generation(prompt, 100) -cached_time = benchmark_cached_generation(prompt, 100, cache) -speedup = non_cached_time / cached_time -print(f"Speedup: {speedup:.1f}x faster!") - -# Project 3: Memory analysis -memory_usage = cache.get_memory_usage() -print(f"Cache size: {memory_usage['total_cache_size_mb']:.1f} MB") -print(f"Memory efficiency: {memory_usage['utilization']:.2f}") -``` - -## Systems Insights -- **Memory Pattern**: Cache grows linearly but saves quadratic computation -- **Production Trade-offs**: 1-10GB cache memory for real-time conversation -- **Scaling Behavior**: Essential for long-context models (4K, 8K, 32K tokens) -- **Hardware Impact**: Memory bandwidth becomes the limiting factor - -## Success Criteria -- ✅ Implement working KV cache with proper memory management -- ✅ Achieve 10x+ speedup for 100+ token generation -- ✅ Understand memory vs compute trade-offs -- ✅ Connect to production transformer optimization strategies - -## Performance Benchmarks -``` -Sequence Length | Memory Usage | Speedup | Efficiency -10 tokens | 0.02 MB | 1.5x | Good -50 tokens | 0.10 MB | 2.0x | Better -100 tokens | 0.20 MB | 4.0x | Excellent -200 tokens | 0.39 MB | 13x | Outstanding -``` - -**This is the optimization that makes modern LLMs practical for real-time applications!** \ No newline at end of file diff --git a/modules_old/18_caching/caching_dev.ipynb 
b/modules_old/18_caching/caching_dev.ipynb deleted file mode 100644 index da77ce81..00000000 --- a/modules_old/18_caching/caching_dev.ipynb +++ /dev/null @@ -1,1619 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "2015213e", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "# KV Caching - The Most Sophisticated Optimization: Changing the Algorithm!\n", - "\n", - "Welcome to the KV Caching module! You'll implement the key-value cache optimization that transforms transformer inference from O(N²) to O(N) complexity for autoregressive generation. This is how GPT actually achieves fast text generation!\n", - "\n", - "## Learning Goals\n", - "- Algorithm transformation: Understand how caching changes fundamental complexity\n", - "- Memory vs compute trade-offs: Store K,V tensors to avoid recomputation\n", - "- Production optimization: Learn the optimization that makes GPT fast in practice\n", - "- Systems insight: How memory management enables dramatic speedups\n", - "- Incremental computation: Build systems that efficiently reuse previous work\n", - "\n", - "## Build → Profile → Optimize\n", - "1. **Build**: Implement KV caching for multi-head attention with incremental generation\n", - "2. **Profile**: Compare O(N²) vs O(N) performance and memory usage patterns\n", - "3. 
**Optimize**: Apply caching to complete transformer inference pipeline\n", - "\n", - "## What You'll Achieve\n", - "By the end of this module, you'll understand:\n", - "- Deep technical mastery of how KV caching transforms attention complexity\n", - "- Practical capability to implement production-grade transformer inference optimization\n", - "- Systems insight into memory-compute trade-offs that determine real-world performance\n", - "- Performance understanding of how algorithmic changes achieve dramatic speedups\n", - "- Connection to how ChatGPT, GPT-4, and other LLMs achieve fast response times\n", - "\n", - "## Systems Reality Check\n", - "💡 **Production Context**: GPT-4 uses KV caching for all inference - without it, generating 100 tokens would take minutes instead of seconds\n", - "⚡ **Performance Note**: KV caching is the difference between research models and production LLMs\n", - "🔥 **Memory Trade-off**: Cache grows with sequence length but saves quadratic recomputation" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6e03e2eb", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "caching-imports", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "#| default_exp optimization.kv_cache\n", - "\n", - "#| export\n", - "import math\n", - "import numpy as np\n", - "import os\n", - "import sys\n", - "import time\n", - "import tracemalloc\n", - "from typing import Union, List, Optional, Tuple, Dict, Any\n", - "\n", - "# Import our Tensor class\n", - "try:\n", - " from tinytorch.core.tensor import Tensor\n", - "except ImportError:\n", - " # For development, import from local tensor module\n", - " sys.path.append(os.path.join(os.path.dirname(__file__), '..', '02_tensor'))\n", - " from tensor_dev import Tensor\n", - "\n", - "# Try to import attention classes\n", - "try:\n", - " from tinytorch.core.attention import MultiHeadAttention, 
ScaledDotProductAttention\n", - "except ImportError:\n", - " # For development, import from local module\n", - " sys.path.append(os.path.join(os.path.dirname(__file__), '..', '13_attention'))\n", - " try:\n", - " from attention_dev import MultiHeadAttention, ScaledDotProductAttention\n", - " except ImportError:\n", - " # Create minimal mock classes if not available\n", - " class MultiHeadAttention:\n", - " def __init__(self, embed_dim, num_heads, dropout=0.0):\n", - " self.embed_dim = embed_dim\n", - " self.num_heads = num_heads\n", - " self.head_dim = embed_dim // num_heads\n", - " def forward(self, q, k, v, mask=None):\n", - " return q # Mock implementation\n", - " class ScaledDotProductAttention:\n", - " def __init__(self, dropout=0.0):\n", - " self.dropout = dropout" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "cb57f291", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "caching-welcome", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "print(\"🚀 TinyTorch KV Caching Module\")\n", - "print(f\"NumPy version: {np.__version__}\")\n", - "print(\"Ready to implement the most sophisticated optimization!\")" - ] - }, - { - "cell_type": "markdown", - "id": "0b52091a", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 📦 Where This Code Lives in the Final Package\n", - "\n", - "**Learning Side:** You work in `modules/source/19_caching/caching_dev.py` \n", - "**Building Side:** Code exports to `tinytorch.core.caching`\n", - "\n", - "```python\n", - "# Final package structure:\n", - "from tinytorch.core.caching import KVCache, CachedMultiHeadAttention, CachedTransformer\n", - "from tinytorch.core.attention import MultiHeadAttention # Previous module\n", - "from tinytorch.core.transformers import TransformerBlock # Dependencies\n", - "```\n", - "\n", - "**Why this matters:**\n", - "- **Learning:** Understand algorithmic transformation through 
implementation\n", - "- **Production:** This is how real LLMs achieve fast inference\n", - "- **Consistency:** All caching optimizations live together in `core.caching`\n", - "- **Integration:** Works seamlessly with existing attention and transformer systems" - ] - }, - { - "cell_type": "markdown", - "id": "407fb6b8", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## The Problem: Attention's Quadratic Complexity\n", - "\n", - "### Traditional Attention: O(N²) Recomputation\n", - "In autoregressive generation, we generate tokens one by one:\n", - "\n", - "```\n", - "Generate token 1: Attend to [] (empty context)\n", - "Generate token 2: Attend to [token_1] \n", - "Generate token 3: Attend to [token_1, token_2]\n", - "Generate token 4: Attend to [token_1, token_2, token_3]\n", - "...\n", - "Generate token N: Attend to [token_1, ..., token_{N-1}]\n", - "```\n", - "\n", - "**The inefficiency:** Each step recomputes attention for ALL previous tokens!\n", - "\n", - "### Memory and Compute Analysis\n", - "For each new token, traditional attention:\n", - "1. **Recomputes K,V** for all previous tokens (wasted computation)\n", - "2. **Attention matrix** grows: 1×1, 2×2, 3×3, ..., N×N (quadratic memory)\n", - "3. **Total operations**: 1² + 2² + 3² + ... 
+ N² = O(N³) for full sequence!\n", - "\n", - "**This is why naive transformer generation is impossibly slow for long sequences.**" - ] - }, - { - "cell_type": "markdown", - "id": "39bdb2d4", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## The Solution: Key-Value Caching\n", - "\n", - "### Core Insight: Cache Past Computations\n", - "KV caching stores the key (K) and value (V) tensors from previous tokens:\n", - "\n", - "```python\n", - "# Step 1: Generate first token\n", - "cache.store(layer=0, keys=K₁, values=V₁, position=0)\n", - "\n", - "# Step 2: Generate second token \n", - "K_past, V_past = cache.get(layer=0, positions=[0])\n", - "K_combined = concat(K_past, K₂) # Reuse K₁, add K₂\n", - "V_combined = concat(V_past, V₂) # Reuse V₁, add V₂\n", - "```\n", - "\n", - "### Complexity Transformation\n", - "- **Without cache**: O(N²) memory, O(N³) total ops for generation\n", - "- **With cache**: O(N) memory per step, O(N²) total ops for generation\n", - "- **Speedup**: 10-100x faster for typical sequence lengths!" - ] - }, - { - "cell_type": "markdown", - "id": "c3962a04", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## KVCache Implementation\n", - "\n", - "The foundation of all transformer inference optimization." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a91cc9c8", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "kv-cache", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class KVCache:\n", - " \"\"\"\n", - " Key-Value cache for efficient transformer inference.\n", - " \n", - " Stores past key and value tensors to avoid recomputation during\n", - " autoregressive generation. 
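The complexity transformation claimed above is easy to sanity-check numerically. A standalone back-of-the-envelope sketch (not part of the module's exports): without a cache, step `t` re-attends over all `t` tokens (cost ~t²), while with a cache step `t` only attends from the one new token (cost ~t).

```python
# Total attention "operations" for generating N tokens, with and without a KV cache.
def total_ops(n):
    uncached = sum(t * t for t in range(1, n + 1))  # ~ N^3 / 3
    cached = sum(t for t in range(1, n + 1))        # ~ N^2 / 2
    return uncached, cached

for n in (10, 100, 1000):
    u, c = total_ops(n)
    print(f"N={n:>4}: uncached={u:>12,}  cached={c:>9,}  ratio={u / c:.1f}x")
```

The ratio grows roughly linearly in N, which is exactly why the speedup from caching gets larger for longer sequences.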
This transforms O(N²) attention into\n", - " O(N) attention for incremental token generation.\n", - " \"\"\"\n", - " \n", - " def __init__(self, max_seq_len: int, n_layers: int, n_heads: int, head_dim: int):\n", - " \"\"\"\n", - " Initialize KV cache with fixed capacity.\n", - " \n", - " TODO: Implement KV cache initialization.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Store cache configuration parameters\n", - " 2. Initialize empty cache storage for each layer\n", - " 3. Track current sequence position\n", - " 4. Set up memory-efficient storage format\n", - " \n", - " MEMORY LAYOUT:\n", - " - Cache per layer: keys[seq_len, n_heads, head_dim]\n", - " - Cache per layer: values[seq_len, n_heads, head_dim]\n", - " - Total memory: 2 × n_layers × max_seq_len × n_heads × head_dim\n", - " \n", - " Args:\n", - " max_seq_len: Maximum sequence length to cache\n", - " n_layers: Number of transformer layers\n", - " n_heads: Number of attention heads\n", - " head_dim: Dimension per attention head\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " self.max_seq_len = max_seq_len\n", - " self.n_layers = n_layers\n", - " self.n_heads = n_heads\n", - " self.head_dim = head_dim\n", - " \n", - " # Initialize cache storage for each layer\n", - " # Shape: (max_seq_len, n_heads, head_dim)\n", - " self.k_cache = {}\n", - " self.v_cache = {}\n", - " \n", - " for layer_idx in range(n_layers):\n", - " # Pre-allocate cache tensors for efficiency\n", - " self.k_cache[layer_idx] = Tensor(np.zeros((max_seq_len, n_heads, head_dim)))\n", - " self.v_cache[layer_idx] = Tensor(np.zeros((max_seq_len, n_heads, head_dim)))\n", - " \n", - " # Track current position in sequence\n", - " self.current_position = 0\n", - " ### END SOLUTION\n", - " \n", - " def update(self, layer_idx: int, key: Tensor, value: Tensor) -> None:\n", - " \"\"\"\n", - " Store new key and value tensors at current position.\n", - " \n", - " TODO: Implement cache update mechanism.\n", - " \n", - " STEP-BY-STEP 
IMPLEMENTATION:\n", - " 1. Validate inputs and position bounds\n", - " 2. Store key tensor at current position\n", - " 3. Store value tensor at current position\n", - " 4. Handle incremental position tracking\n", - " \n", - " EFFICIENCY CONSIDERATIONS:\n", - " - In-place updates to avoid memory allocation\n", - " - Position-based indexing for O(1) access\n", - " - Bounds checking for cache overflow\n", - " \n", - " Args:\n", - " layer_idx: Which transformer layer this cache belongs to\n", - " key: Key tensor to store, shape (n_heads, head_dim)\n", - " value: Value tensor to store, shape (n_heads, head_dim)\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " if layer_idx not in self.k_cache:\n", - " raise ValueError(f\"Layer {layer_idx} not found in cache\")\n", - " \n", - " if self.current_position >= self.max_seq_len:\n", - " raise ValueError(f\"Cache overflow: position {self.current_position} >= max {self.max_seq_len}\")\n", - " \n", - " # Store key and value at current position\n", - " # key/value shape: (n_heads, head_dim)\n", - " # Cache shape: (max_seq_len, n_heads, head_dim)\n", - " self.k_cache[layer_idx].data[self.current_position] = key.data\n", - " self.v_cache[layer_idx].data[self.current_position] = value.data\n", - " ### END SOLUTION\n", - " \n", - " def get(self, layer_idx: int, seq_len: int) -> Tuple[Tensor, Tensor]:\n", - " \"\"\"\n", - " Retrieve cached keys and values up to specified sequence length.\n", - " \n", - " TODO: Implement cache retrieval mechanism.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Validate layer and sequence length\n", - " 2. Extract keys from position 0 to seq_len\n", - " 3. Extract values from position 0 to seq_len\n", - " 4. 
Return as tensors ready for attention computation\n", - " \n", - " MEMORY EFFICIENCY:\n", - " - Return views/slices instead of copies when possible\n", - " - Handle different sequence lengths efficiently\n", - " \n", - " Args:\n", - " layer_idx: Which transformer layer to retrieve cache for\n", - " seq_len: How many positions to retrieve (1 to current_position)\n", - " \n", - " Returns:\n", - " Tuple of (keys, values) tensors with shape (seq_len, n_heads, head_dim)\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " if layer_idx not in self.k_cache:\n", - " raise ValueError(f\"Layer {layer_idx} not found in cache\")\n", - " \n", - " if seq_len > self.current_position:\n", - " raise ValueError(f\"Requested seq_len {seq_len} > current position {self.current_position}\")\n", - " \n", - " # Extract the relevant portion of the cache\n", - " # Cache shape: (max_seq_len, n_heads, head_dim)\n", - " # Output shape: (seq_len, n_heads, head_dim)\n", - " cached_keys = Tensor(self.k_cache[layer_idx].data[:seq_len])\n", - " cached_values = Tensor(self.v_cache[layer_idx].data[:seq_len])\n", - " \n", - " return cached_keys, cached_values\n", - " ### END SOLUTION\n", - " \n", - " def advance_position(self) -> None:\n", - " \"\"\"\n", - " Move to next sequence position after storing current token.\n", - " \n", - " This should be called after update() to prepare for next token.\n", - " \"\"\"\n", - " self.current_position += 1\n", - " \n", - " def reset(self) -> None:\n", - " \"\"\"Reset cache to empty state for new sequence.\"\"\"\n", - " self.current_position = 0\n", - " # Note: We don't need to zero out the cache data, just reset position\n", - " \n", - " def get_memory_usage(self) -> Dict[str, Any]:\n", - " \"\"\"Analyze current cache memory usage.\"\"\"\n", - " total_elements = 2 * self.n_layers * self.max_seq_len * self.n_heads * self.head_dim\n", - " used_elements = 2 * self.n_layers * self.current_position * self.n_heads * self.head_dim\n", - " \n", - " return {\n", - " 
'total_cache_size_mb': total_elements * 4 / (1024 * 1024), # Assuming float32\n", - " 'used_cache_size_mb': used_elements * 4 / (1024 * 1024),\n", - " 'utilization': used_elements / total_elements if total_elements > 0 else 0,\n", - " 'current_position': self.current_position,\n", - " 'max_seq_len': self.max_seq_len\n", - " }" - ] - }, - { - "cell_type": "markdown", - "id": "f856a059", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### Testing KV Cache Functionality\n", - "\n", - "Let's verify our cache works correctly and understand its memory characteristics." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d254a871", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test-kv-cache", - "locked": false, - "points": 10, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_kv_cache():\n", - " \"\"\"Test KV cache functionality and memory management.\"\"\"\n", - " print(\"Testing KV Cache...\")\n", - " \n", - " # Create cache for small transformer\n", - " max_seq_len = 10\n", - " n_layers = 2\n", - " n_heads = 4\n", - " head_dim = 8\n", - " \n", - " cache = KVCache(max_seq_len, n_layers, n_heads, head_dim)\n", - " \n", - " # Test 1: Initial state\n", - " assert cache.current_position == 0, \"Cache should start at position 0\"\n", - " \n", - " # Test 2: Store first token\n", - " k1 = Tensor(np.random.randn(n_heads, head_dim))\n", - " v1 = Tensor(np.random.randn(n_heads, head_dim))\n", - " \n", - " cache.update(layer_idx=0, key=k1, value=v1)\n", - " cache.advance_position()\n", - " \n", - " assert cache.current_position == 1, \"Position should advance after update\"\n", - " \n", - " # Test 3: Retrieve cached values\n", - " cached_k, cached_v = cache.get(layer_idx=0, seq_len=1)\n", - " \n", - " assert cached_k.shape == (1, n_heads, head_dim), f\"Expected shape (1, {n_heads}, {head_dim}), got {cached_k.shape}\"\n", - " assert 
cached_v.shape == (1, n_heads, head_dim), f\"Expected shape (1, {n_heads}, {head_dim}), got {cached_v.shape}\"\n", - " \n", - " # Verify data integrity\n", - " np.testing.assert_array_equal(cached_k.data[0], k1.data, \"Cached key should match stored key\")\n", - " np.testing.assert_array_equal(cached_v.data[0], v1.data, \"Cached value should match stored value\")\n", - " \n", - " # Test 4: Add second token\n", - " k2 = Tensor(np.random.randn(n_heads, head_dim))\n", - " v2 = Tensor(np.random.randn(n_heads, head_dim))\n", - " \n", - " cache.update(layer_idx=0, key=k2, value=v2)\n", - " cache.advance_position()\n", - " \n", - " # Test 5: Retrieve both tokens\n", - " cached_k, cached_v = cache.get(layer_idx=0, seq_len=2)\n", - " \n", - " assert cached_k.shape == (2, n_heads, head_dim), \"Should retrieve both tokens\"\n", - " np.testing.assert_array_equal(cached_k.data[0], k1.data, \"First token key should be preserved\")\n", - " np.testing.assert_array_equal(cached_k.data[1], k2.data, \"Second token key should be stored\")\n", - " \n", - " # Test 6: Memory usage analysis\n", - " memory_info = cache.get_memory_usage()\n", - " expected_total = 2 * n_layers * max_seq_len * n_heads * head_dim * 4 / (1024 * 1024)\n", - " \n", - " assert abs(memory_info['total_cache_size_mb'] - expected_total) < 0.01, \"Memory calculation should be accurate\"\n", - " assert memory_info['current_position'] == 2, \"Should track position correctly\"\n", - " \n", - " # Test 7: Reset functionality\n", - " cache.reset()\n", - " assert cache.current_position == 0, \"Reset should return to position 0\"\n", - " \n", - " print(\"✅ KV Cache tests passed!\")\n", - " print(f\" Cache capacity: {memory_info['total_cache_size_mb']:.2f} MB\")\n", - " print(f\" Memory efficiency: O(L × N × H × D) scaling\")\n", - "\n", - "# Run the test\n", - "test_kv_cache()" - ] - }, - { - "cell_type": "markdown", - "id": "ae5064ab", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - 
"## Cached Multi-Head Attention\n", - "\n", - "Now let's implement attention that can use the KV cache for efficient inference." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "350c1d63", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "cached-attention", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "class CachedMultiHeadAttention:\n", - " \"\"\"\n", - " Multi-head attention with KV caching support.\n", - " \n", - " This is the key optimization that makes transformer inference practical.\n", - " During autoregressive generation, we only compute attention for the\n", - " new token while reusing cached K,V from all previous tokens.\n", - " \"\"\"\n", - " \n", - " def __init__(self, embed_dim: int, num_heads: int, dropout: float = 0.0):\n", - " \"\"\"\n", - " Initialize cached multi-head attention.\n", - " \n", - " TODO: Implement cached attention initialization.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Store standard multi-head attention configuration\n", - " 2. Initialize weight matrices for Q, K, V projections\n", - " 3. Set up attention computation components\n", - " 4. 
Prepare for cache integration\n", - " \n", - " Args:\n", - " embed_dim: Total embedding dimension\n", - " num_heads: Number of attention heads\n", - " dropout: Dropout rate (for training)\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " self.embed_dim = embed_dim\n", - " self.num_heads = num_heads\n", - " self.dropout = dropout\n", - " \n", - " # Check divisibility\n", - " if embed_dim % num_heads != 0:\n", - " raise ValueError(f\"embed_dim ({embed_dim}) must be divisible by num_heads ({num_heads})\")\n", - " \n", - " self.head_dim = embed_dim // num_heads\n", - " \n", - " # Initialize projection weights\n", - " scale = 1.0 / math.sqrt(embed_dim)\n", - " self.w_q = Tensor(np.random.randn(embed_dim, embed_dim) * scale)\n", - " self.w_k = Tensor(np.random.randn(embed_dim, embed_dim) * scale)\n", - " self.w_v = Tensor(np.random.randn(embed_dim, embed_dim) * scale)\n", - " self.w_o = Tensor(np.random.randn(embed_dim, embed_dim) * scale)\n", - " \n", - " self.parameters = [self.w_q, self.w_k, self.w_v, self.w_o]\n", - " ### END SOLUTION\n", - " \n", - " def forward(self, \n", - " query: Tensor, \n", - " key: Optional[Tensor] = None, \n", - " value: Optional[Tensor] = None,\n", - " cache: Optional[KVCache] = None,\n", - " layer_idx: int = 0,\n", - " use_cache: bool = False,\n", - " advance_cache: bool = True) -> Tuple[Tensor, Optional[KVCache]]:\n", - " \"\"\"\n", - " Compute attention with optional KV caching.\n", - " \n", - " TODO: Implement cached attention forward pass.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Handle input defaults (key=query, value=query for self-attention)\n", - " 2. Compute Q, K, V projections for current token\n", - " 3. If using cache, retrieve past K, V and combine with current\n", - " 4. Compute scaled dot-product attention\n", - " 5. Update cache with current K, V if requested\n", - " 6. 
Return attention output and updated cache\n", - "        \n", - "        CACHING LOGIC:\n", - "        - Without cache: Standard attention on full sequence\n", - "        - With cache: Combine past K,V with current K,V, attend from current Q\n", - "        \n", - "        Args:\n", - "            query: Current token query, shape (batch_size, 1, embed_dim) or (batch_size, seq_len, embed_dim)\n", - "            key: Key tensor (defaults to query)\n", - "            value: Value tensor (defaults to query) \n", - "            cache: KV cache to use and update\n", - "            layer_idx: Which layer this attention belongs to\n", - "            use_cache: Whether to update cache with current K,V\n", - "            advance_cache: Whether to advance the cache position after storing K,V (set False when several layers share one position)\n", - "        \n", - "        Returns:\n", - "            Tuple of (attention_output, updated_cache)\n", - "        \"\"\"\n", - "        ### BEGIN SOLUTION\n", - "        # Handle defaults\n", - "        if key is None:\n", - "            key = query\n", - "        if value is None:\n", - "            value = query\n", - "        \n", - "        batch_size = query.shape[0]\n", - "        query_seq_len = query.shape[1]\n", - "        \n", - "        # Compute Q, K, V projections\n", - "        Q = Tensor(np.matmul(query.data, self.w_q.data))\n", - "        K = Tensor(np.matmul(key.data, self.w_k.data))\n", - "        V = Tensor(np.matmul(value.data, self.w_v.data))\n", - "        \n", - "        # Reshape for multi-head attention\n", - "        # (batch, seq_len, embed_dim) -> (batch, seq_len, num_heads, head_dim)\n", - "        Q = Q.data.reshape(batch_size, query_seq_len, self.num_heads, self.head_dim)\n", - "        K = K.data.reshape(batch_size, query_seq_len, self.num_heads, self.head_dim)\n", - "        V = V.data.reshape(batch_size, query_seq_len, self.num_heads, self.head_dim)\n", - "        \n", - "        # Transpose to (batch, num_heads, seq_len, head_dim)\n", - "        Q = np.transpose(Q, (0, 2, 1, 3))\n", - "        K = np.transpose(K, (0, 2, 1, 3))\n", - "        V = np.transpose(V, (0, 2, 1, 3))\n", - "        \n", - "        if cache is not None and cache.current_position > 0:\n", - "            # Retrieve cached K, V and combine with current\n", - "            cached_K, cached_V = cache.get(layer_idx, cache.current_position)\n", - "            \n", - "            # Reshape cached tensors to match multi-head format\n", - "            # cached 
shape: (seq_len, num_heads, head_dim)\n", - " # target shape: (batch, num_heads, seq_len, head_dim)\n", - " cached_K = cached_K.data.transpose(1, 0, 2)[None, ...] # Add batch dimension\n", - " cached_V = cached_V.data.transpose(1, 0, 2)[None, ...]\n", - " \n", - " # Concatenate past and current K, V\n", - " K_combined = np.concatenate([cached_K, K], axis=2) # Concat along seq dimension\n", - " V_combined = np.concatenate([cached_V, V], axis=2)\n", - " else:\n", - " K_combined = K\n", - " V_combined = V\n", - " \n", - " # Compute scaled dot-product attention\n", - " # Q: (batch, num_heads, query_len, head_dim)\n", - " # K: (batch, num_heads, total_seq_len, head_dim)\n", - " # V: (batch, num_heads, total_seq_len, head_dim)\n", - " \n", - " scores = np.matmul(Q, np.transpose(K_combined, (0, 1, 3, 2))) # (batch, heads, query_len, total_seq_len)\n", - " scores = scores / math.sqrt(self.head_dim)\n", - " \n", - " # Apply softmax\n", - " scores_exp = np.exp(scores - np.max(scores, axis=-1, keepdims=True))\n", - " attention_weights = scores_exp / np.sum(scores_exp, axis=-1, keepdims=True)\n", - " \n", - " # Apply attention to values\n", - " attention_output = np.matmul(attention_weights, V_combined) # (batch, heads, query_len, head_dim)\n", - " \n", - " # Reshape back to original format\n", - " # (batch, heads, query_len, head_dim) -> (batch, query_len, heads, head_dim)\n", - " attention_output = np.transpose(attention_output, (0, 2, 1, 3))\n", - " # -> (batch, query_len, embed_dim)\n", - " attention_output = attention_output.reshape(batch_size, query_seq_len, self.embed_dim)\n", - " \n", - " # Apply output projection\n", - " output = Tensor(np.matmul(attention_output, self.w_o.data))\n", - " \n", - " # Update cache if requested\n", - " updated_cache = cache\n", - " if use_cache and cache is not None:\n", - " # Store current K, V in cache\n", - " # We need to store per-head K, V with shape (num_heads, head_dim)\n", - " # Current K, V have shape (batch, num_heads, 1, 
head_dim) for single token\n", - " if query_seq_len == 1: # Only cache when generating single tokens\n", - " current_K = Tensor(K[0, :, 0, :]) # (num_heads, head_dim)\n", - " current_V = Tensor(V[0, :, 0, :]) # (num_heads, head_dim)\n", - " cache.update(layer_idx, current_K, current_V)\n", - " if advance_cache: # Only advance position when requested\n", - " cache.advance_position()\n", - " \n", - " return output, updated_cache\n", - " ### END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "57221d2c", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### Testing Cached Attention\n", - "\n", - "Let's verify our cached attention works and provides the expected speedup." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b7555a66", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test-cached-attention", - "locked": false, - "points": 15, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_cached_attention():\n", - " \"\"\"Test cached attention functionality and performance.\"\"\"\n", - " print(\"Testing Cached Multi-Head Attention...\")\n", - " \n", - " embed_dim = 64\n", - " num_heads = 8\n", - " head_dim = embed_dim // num_heads\n", - " batch_size = 1\n", - " \n", - " # Create attention layer\n", - " attention = CachedMultiHeadAttention(embed_dim, num_heads)\n", - " \n", - " # Create cache\n", - " max_seq_len = 10\n", - " n_layers = 1\n", - " cache = KVCache(max_seq_len, n_layers, num_heads, head_dim)\n", - " \n", - " # Test 1: Single token attention (like generation start)\n", - " token1 = Tensor(np.random.randn(batch_size, 1, embed_dim))\n", - " \n", - " output1, updated_cache = attention.forward(\n", - " query=token1, \n", - " cache=cache, \n", - " layer_idx=0, \n", - " use_cache=True\n", - " )\n", - " \n", - " assert output1.shape == (batch_size, 1, embed_dim), f\"Expected output shape {(batch_size, 1, embed_dim)}, 
got {output1.shape}\"\n", - " assert updated_cache.current_position == 1, \"Cache should advance after first token\"\n", - " \n", - " # Test 2: Second token with cache\n", - " token2 = Tensor(np.random.randn(batch_size, 1, embed_dim))\n", - " \n", - " output2, updated_cache = attention.forward(\n", - " query=token2,\n", - " cache=updated_cache,\n", - " layer_idx=0,\n", - " use_cache=True\n", - " )\n", - " \n", - " assert output2.shape == (batch_size, 1, embed_dim), \"Second token output should have correct shape\"\n", - " assert updated_cache.current_position == 2, \"Cache should advance after second token\"\n", - " \n", - " # Test 3: Compare with non-cached version\n", - " # For verification, run attention on full sequence without cache\n", - " full_sequence = Tensor(np.concatenate([token1.data, token2.data], axis=1)) # (batch, 2, embed_dim)\n", - " \n", - " fresh_attention = CachedMultiHeadAttention(embed_dim, num_heads)\n", - " fresh_attention.w_q = attention.w_q # Use same weights\n", - " fresh_attention.w_k = attention.w_k\n", - " fresh_attention.w_v = attention.w_v\n", - " fresh_attention.w_o = attention.w_o\n", - " \n", - " full_output, _ = fresh_attention.forward(query=full_sequence, cache=None, use_cache=False)\n", - " \n", - " # The outputs should be similar (not exactly equal due to different computation paths)\n", - " assert full_output.shape == (batch_size, 2, embed_dim), \"Full sequence output should have correct shape\"\n", - " \n", - " print(\"✅ Cached Attention tests passed!\")\n", - " print(f\" Memory saved: {cache.get_memory_usage()['used_cache_size_mb']:.2f} MB cache vs full recomputation\")\n", - " print(f\" Cache position: {cache.current_position}\")\n", - "\n", - "# Run the test\n", - "test_cached_attention()" - ] - }, - { - "cell_type": "markdown", - "id": "38da63bd", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Autoregressive Generation with KV Cache\n", - "\n", - "Now let's implement the 
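The test above only checks shapes, but the equivalence can be verified exactly: concatenating cached K,V reproduces full recomputation for the newest token (earlier tokens differ here because without a causal mask the full pass lets token 1 attend to token 2, while the cached run did not). A single-head NumPy sketch, independent of the classes above:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((3, d))            # a 3-token sequence
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Full recompute: last token's attention output over all 3 tokens.
Q, K, V = x @ w_q, x @ w_k, x @ w_v
full_last = softmax(Q[2:] @ K.T / np.sqrt(d)) @ V

# Cached: K,V for tokens 0-1 come from the "cache"; only token 2 is projected.
k_cache, v_cache = x[:2] @ w_k, x[:2] @ w_v
K_inc = np.concatenate([k_cache, x[2:] @ w_k])
V_inc = np.concatenate([v_cache, x[2:] @ w_v])
cached_last = softmax((x[2:] @ w_q) @ K_inc.T / np.sqrt(d)) @ V_inc

print(np.allclose(full_last, cached_last))  # True
```

The cached path performs strictly fewer projections per step, yet the newest token's output is identical — that is the entire correctness argument for KV caching.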
complete generation function that uses KV caching for dramatic speedups." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4e7011cc", - "metadata": { - "lines_to_next_cell": 1, - "nbgrader": { - "grade": false, - "grade_id": "cached-generation", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "#| export\n", - "def generate_with_cache(model_func, \n", - " initial_tokens: Tensor, \n", - " max_new_tokens: int = 50,\n", - " embed_dim: int = 64,\n", - " num_heads: int = 8,\n", - " num_layers: int = 4) -> Tensor:\n", - " \"\"\"\n", - " Generate tokens autoregressively using KV caching.\n", - " \n", - " This demonstrates the key optimization that makes modern LLMs practical.\n", - " Instead of recomputing attention for all previous tokens at each step,\n", - " we cache the key and value tensors and incrementally build the sequence.\n", - " \n", - " TODO: Implement cached autoregressive generation.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Initialize KV cache for all layers\n", - " 2. Process initial tokens to populate cache\n", - " 3. For each new token to generate:\n", - " a. Compute attention using cache (O(N) instead of O(N²))\n", - " b. Generate next token prediction\n", - " c. Update cache with new K,V\n", - " d. Add new token to sequence\n", - " 4. 
Return complete generated sequence\n", - " \n", - " COMPLEXITY ANALYSIS:\n", - " - Without cache: O(N²) per token, O(N³) total\n", - " - With cache: O(N) per token, O(N²) total\n", - " \n", - " Args:\n", - " model_func: Function that predicts next token given current sequence\n", - " initial_tokens: Starting tokens, shape (batch_size, seq_len, embed_dim)\n", - " max_new_tokens: How many new tokens to generate\n", - " embed_dim: Model embedding dimension\n", - " num_heads: Number of attention heads\n", - " num_layers: Number of transformer layers\n", - " \n", - " Returns:\n", - " Complete sequence including initial and generated tokens\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " batch_size, initial_seq_len, _ = initial_tokens.shape\n", - " head_dim = embed_dim // num_heads\n", - " max_seq_len = initial_seq_len + max_new_tokens\n", - " \n", - " # Initialize KV cache\n", - " cache = KVCache(max_seq_len, num_layers, num_heads, head_dim)\n", - " # Initialize cached attention layers for each layer\n", - " attention_layers = []\n", - " for layer_idx in range(num_layers):\n", - " attention_layers.append(CachedMultiHeadAttention(embed_dim, num_heads))\n", - " \n", - " # Start with initial tokens\n", - " generated_sequence = [initial_tokens]\n", - " current_tokens = initial_tokens\n", - " \n", - " # Process initial tokens to populate cache\n", - " for pos in range(initial_seq_len):\n", - " # Extract K,V for this position and store in cache for each layer\n", - " token_slice = Tensor(current_tokens.data[:, pos:pos+1, :]) # (batch, 1, embed_dim)\n", - " \n", - " for layer_idx, attention_layer in enumerate(attention_layers):\n", - " # Compute K, V for this token\n", - " K = Tensor(np.matmul(token_slice.data, attention_layer.w_k.data))\n", - " V = Tensor(np.matmul(token_slice.data, attention_layer.w_v.data))\n", - " \n", - " # Reshape to (num_heads, head_dim)\n", - " K_reshaped = K.data.reshape(1, num_heads, head_dim)[0] # Remove batch dim\n", - " V_reshaped = 
V.data.reshape(1, num_heads, head_dim)[0]\n", - " \n", - " cache.update(layer_idx, Tensor(K_reshaped), Tensor(V_reshaped))\n", - " \n", - " # Advance cache position once per token (shared across all layers)\n", - " cache.advance_position()\n", - " \n", - " # Generate new tokens one by one\n", - " for step in range(max_new_tokens):\n", - " # Use the last token as query for next prediction\n", - " last_token = Tensor(current_tokens.data[:, -1:, :]) # (batch, 1, embed_dim)\n", - " \n", - " # Process through all attention layers with caching\n", - " layer_input = last_token\n", - " for layer_idx, attention_layer in enumerate(attention_layers):\n", - " # Don't advance cache in forward method - we'll do it once at the end\n", - " layer_output, cache = attention_layer.forward(\n", - " query=layer_input,\n", - " cache=cache,\n", - " layer_idx=layer_idx,\n", - " use_cache=True,\n", - " advance_cache=False # Don't advance yet\n", - " )\n", - " layer_input = layer_output\n", - " \n", - " # Advance cache position once after processing all layers\n", - " cache.advance_position()\n", - " \n", - " # Simulate next token generation (in real implementation, this would be a language model head)\n", - " # For demo, we'll just add some variation to continue the pattern\n", - " next_token = Tensor(layer_output.data + np.random.randn(*layer_output.shape) * 0.1)\n", - " \n", - " # Add to sequence\n", - " generated_sequence.append(next_token)\n", - " \n", - " # Update current tokens (in practice, you'd convert logits to tokens)\n", - " current_tokens = Tensor(np.concatenate([current_tokens.data, next_token.data], axis=1))\n", - " \n", - " # Combine all tokens\n", - " final_sequence = Tensor(np.concatenate([seq.data for seq in generated_sequence], axis=1))\n", - " return final_sequence\n", - " ### END SOLUTION" - ] - }, - { - "cell_type": "markdown", - "id": "6529e5b9", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### Testing Cached 
Generation\n", - "\n", - "Let's compare the performance of cached vs non-cached generation." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f2ad7842", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "test-cached-generation", - "locked": false, - "points": 15, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def test_cached_generation():\n", - " \"\"\"Test and benchmark cached generation.\"\"\"\n", - " print(\"Testing Cached Generation...\")\n", - " \n", - " # Test parameters\n", - " batch_size = 1\n", - " embed_dim = 32 # Smaller for faster testing\n", - " num_heads = 4\n", - " num_layers = 2\n", - " initial_seq_len = 5\n", - " max_new_tokens = 5 # Reduced for debugging\n", - " \n", - " # Create initial tokens\n", - " initial_tokens = Tensor(np.random.randn(batch_size, initial_seq_len, embed_dim))\n", - " \n", - " # Simple model function for testing\n", - " def simple_model(tokens):\n", - " return tokens # Identity for testing\n", - " \n", - " # Test cached generation\n", - " start_time = time.time()\n", - " \n", - " generated_sequence = generate_with_cache(\n", - " model_func=simple_model,\n", - " initial_tokens=initial_tokens,\n", - " max_new_tokens=max_new_tokens,\n", - " embed_dim=embed_dim,\n", - " num_heads=num_heads,\n", - " num_layers=num_layers\n", - " )\n", - " \n", - " cached_time = time.time() - start_time\n", - " \n", - " # Verify output shape\n", - " expected_seq_len = initial_seq_len + max_new_tokens\n", - " assert generated_sequence.shape == (batch_size, expected_seq_len, embed_dim), \\\n", - " f\"Expected shape {(batch_size, expected_seq_len, embed_dim)}, got {generated_sequence.shape}\"\n", - " \n", - " # Verify initial tokens are preserved\n", - " np.testing.assert_array_equal(\n", - " generated_sequence.data[:, :initial_seq_len, :],\n", - " initial_tokens.data,\n", - " \"Initial tokens should be preserved in output\"\n", - " )\n", - " \n", - " print(\"✅ 
Cached Generation tests passed!\")\n", - " print(f\" Generated sequence length: {generated_sequence.shape[1]}\")\n", - " print(f\" Processing time: {cached_time:.3f}s\")\n", - " print(f\" Memory efficiency: O(N) per step instead of O(N²)\")\n", - "\n", - "# Run the test\n", - "test_cached_generation()" - ] - }, - { - "cell_type": "markdown", - "id": "aa6ba968", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Systems Analysis: Memory vs Compute Trade-off\n", - "\n", - "Let's analyze the memory and computational characteristics of KV caching." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9152d089", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "kv-cache-analysis", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "def analyze_kv_cache_performance():\n", - " \"\"\"\n", - " Comprehensive analysis of KV cache memory and performance characteristics.\n", - " \n", - " TODO: Implement performance analysis for KV caching.\n", - " \n", - " STEP-BY-STEP IMPLEMENTATION:\n", - " 1. Set up test scenarios with different sequence lengths\n", - " 2. Measure memory usage with and without caching\n", - " 3. Benchmark computation time for both approaches\n", - " 4. Analyze scaling behavior as sequence length increases\n", - " 5. 
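The cache trades memory for compute, and the memory bill grows linearly with context length. A rough footprint check for a GPT-2-small-like configuration (illustrative numbers, float32 as assumed in `get_memory_usage`):

```python
def kv_cache_mb(n_layers, max_seq_len, n_heads, head_dim, bytes_per_elem=4):
    # 2 buffers (K and V) x layers x positions x heads x head_dim
    return 2 * n_layers * max_seq_len * n_heads * head_dim * bytes_per_elem / 2**20

# 12 layers, 12 heads of 64 dims, 1024-token context
print(f"{kv_cache_mb(12, 1024, 12, 64):.0f} MB")    # 72 MB per sequence
# Same model at a 32k context: linear growth in sequence length
print(f"{kv_cache_mb(12, 32768, 12, 64):.0f} MB")   # 2304 MB
```

This linear-in-N memory cost is why production serving systems care so much about cache management: at long contexts the KV cache, not the weights, can dominate GPU memory per concurrent request.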
Calculate the break-even point where caching becomes beneficial\n", - " \n", - " ANALYSIS DIMENSIONS:\n", - " - Memory usage: How much RAM does caching consume?\n", - " - Computation time: How much faster is cached generation?\n", - " - Scaling behavior: How does performance change with sequence length?\n", - " - Break-even analysis: When is caching worth the memory cost?\n", - " \"\"\"\n", - " ### BEGIN SOLUTION\n", - " print(\"🔍 Analyzing KV Cache Performance Characteristics...\")\n", - " \n", - " # Test configuration\n", - " embed_dim = 64\n", - " num_heads = 8\n", - " head_dim = embed_dim // num_heads\n", - " num_layers = 4\n", - " batch_size = 1\n", - " \n", - " sequence_lengths = [10, 25, 50, 100, 200]\n", - " results = []\n", - " \n", - " for seq_len in sequence_lengths:\n", - " print(f\"\\n📊 Testing sequence length: {seq_len}\")\n", - " \n", - " # Memory analysis\n", - " cache = KVCache(seq_len, num_layers, num_heads, head_dim)\n", - " memory_info = cache.get_memory_usage()\n", - " \n", - " # Simulate cache usage\n", - " attention = CachedMultiHeadAttention(embed_dim, num_heads)\n", - " \n", - " # Benchmark cached vs non-cached attention\n", - " token = Tensor(np.random.randn(batch_size, 1, embed_dim))\n", - " full_sequence = Tensor(np.random.randn(batch_size, seq_len, embed_dim))\n", - " \n", - " # Time cached approach (simulating incremental generation)\n", - " start_time = time.time()\n", - " for pos in range(seq_len):\n", - " output, cache = attention.forward(\n", - " query=token, \n", - " cache=cache, \n", - " layer_idx=0, \n", - " use_cache=True\n", - " )\n", - " cached_time = time.time() - start_time\n", - " \n", - " # Time non-cached approach (full sequence each time)\n", - " start_time = time.time()\n", - " for pos in range(seq_len):\n", - " # Simulate recomputing attention for growing sequence\n", - " subseq = Tensor(full_sequence.data[:, :pos+1, :])\n", - " output, _ = attention.forward(query=subseq, cache=None, use_cache=False)\n", - " 
non_cached_time = time.time() - start_time\n", - " \n", - " # Calculate theoretical operation counts\n", - " # Cached: O(N) operations per step, O(N²) total\n", - " cached_ops = seq_len * seq_len # Simplified model\n", - " \n", - " # Non-cached: O(N²) operations per step, O(N³) total \n", - " non_cached_ops = sum(i*i for i in range(1, seq_len+1))\n", - " \n", - " speedup = non_cached_time / cached_time if cached_time > 0 else 0\n", - " theoretical_speedup = non_cached_ops / cached_ops if cached_ops > 0 else 0\n", - " \n", - " results.append({\n", - " 'seq_len': seq_len,\n", - " 'cache_memory_mb': memory_info['total_cache_size_mb'],\n", - " 'cached_time': cached_time,\n", - " 'non_cached_time': non_cached_time,\n", - " 'actual_speedup': speedup,\n", - " 'theoretical_speedup': theoretical_speedup,\n", - " 'cached_ops': cached_ops,\n", - " 'non_cached_ops': non_cached_ops\n", - " })\n", - " \n", - " print(f\" Cache memory: {memory_info['total_cache_size_mb']:.2f} MB\")\n", - " print(f\" Cached time: {cached_time:.4f}s\")\n", - " print(f\" Non-cached time: {non_cached_time:.4f}s\") \n", - " print(f\" Actual speedup: {speedup:.2f}x\")\n", - " print(f\" Theoretical speedup: {theoretical_speedup:.2f}x\")\n", - " \n", - " # Summary analysis\n", - " print(f\"\\n📈 Performance Summary:\")\n", - " print(f\"{'Seq Len':<8} {'Memory(MB)':<12} {'Speedup':<10} {'Memory/Speedup':<15}\")\n", - " print(\"-\" * 50)\n", - " \n", - " for result in results:\n", - " efficiency = result['cache_memory_mb'] / result['actual_speedup'] if result['actual_speedup'] > 0 else float('inf')\n", - " print(f\"{result['seq_len']:<8} {result['cache_memory_mb']:<12.2f} {result['actual_speedup']:<10.2f} {efficiency:<15.2f}\")\n", - " \n", - " # Key insights\n", - " print(f\"\\n🎯 Key Insights:\")\n", - " print(f\" • Memory scales as O(L × N × H × D) where L=layers, N=seq_len, H=heads, D=head_dim\")\n", - " print(f\" • Computation scales as O(N²) with cache vs O(N³) without\")\n", - " print(f\" • Break-even 
point: ~{sequence_lengths[1]} tokens for this configuration\")\n", - " print(f\" • Memory-efficiency trade-off: more cache memory for better performance\")\n", - " \n", - " return results\n", - " ### END SOLUTION\n", - "\n", - "# Run the analysis\n", - "performance_results = analyze_kv_cache_performance()" - ] - }, - { - "cell_type": "markdown", - "id": "5687d9a6", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Production Context: How Real Systems Use KV Caching\n", - "\n", - "Understanding how KV caching is implemented in production systems." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bd07055b", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "production-context", - "locked": false, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def explore_production_kv_caching():\n", - " \"\"\"\n", - " Explore how KV caching is used in production transformer systems.\n", - " \n", - " This function demonstrates the connection between our implementation\n", - " and real-world systems like GPT, BERT, and other transformer models.\n", - " \"\"\"\n", - " print(\"🏭 Production KV Caching Systems Analysis\")\n", - " print(\"=\" * 60)\n", - " \n", - " # Production system examples\n", - " systems = [\n", - " {\n", - " 'name': 'GPT-3',\n", - " 'layers': 96,\n", - " 'heads': 96,\n", - " 'head_dim': 128,\n", - " 'max_context': 2048,\n", - " 'use_case': 'Text generation'\n", - " },\n", - " {\n", - " 'name': 'GPT-4',\n", - " 'layers': 120, # Estimated\n", - " 'heads': 128, # Estimated \n", - " 'head_dim': 128,\n", - " 'max_context': 8192,\n", - " 'use_case': 'Conversation'\n", - " },\n", - " {\n", - " 'name': 'CodeT5',\n", - " 'layers': 12,\n", - " 'heads': 12,\n", - " 'head_dim': 64,\n", - " 'max_context': 512,\n", - " 'use_case': 'Code generation'\n", - " },\n", - " {\n", - " 'name': 'Local 7B Model',\n", - " 'layers': 32,\n", - " 'heads': 
32,\n", - " 'head_dim': 128,\n", - " 'max_context': 4096,\n", - " 'use_case': 'Local inference'\n", - " }\n", - " ]\n", - " \n", - " print(f\"{'System':<15} {'Cache Size':<12} {'Max Tokens':<12} {'Use Case':<15}\")\n", - " print(\"-\" * 60)\n", - " \n", - " for system in systems:\n", - " # Calculate cache memory requirements\n", - " # 2 (K + V) × layers × max_context × heads × head_dim × 4 bytes (float32)\n", - " cache_size_bytes = (2 * system['layers'] * system['max_context'] * \n", - " system['heads'] * system['head_dim'] * 4)\n", - " cache_size_gb = cache_size_bytes / (1024**3)\n", - " \n", - " print(f\"{system['name']:<15} {cache_size_gb:<12.2f}GB {system['max_context']:<12} {system['use_case']:<15}\")\n", - " \n", - " print(f\"\\n💡 Production Optimizations:\")\n", - " print(f\" • Memory pooling: Reuse cache memory across requests\")\n", - " print(f\" • Batch processing: Share cache computation across multiple queries\")\n", - " print(f\" • Attention masks: Skip computation for padded tokens\")\n", - " print(f\" • Gradient checkpointing: Trade memory for compute during training\")\n", - " print(f\" • Mixed precision: Use FP16/INT8 to reduce cache memory\")\n", - " print(f\" • Flash Attention: Optimize memory access patterns\")\n", - " \n", - " print(f\"\\n⚡ Real-World Performance Impact:\")\n", - " print(f\" • Without KV cache: GPT would take minutes to generate short responses\")\n", - " print(f\" • With KV cache: Real-time conversation becomes possible\")\n", - " print(f\" • Memory cost: 1-10GB RAM per conversation depending on model size\")\n", - " print(f\" • Speedup: 10-100x faster generation for typical use cases\")\n", - " \n", - " print(f\"\\n🎯 Why This Matters for ML Engineers:\")\n", - " print(f\" • KV caching is THE optimization that makes LLMs practical\")\n", - " print(f\" • Memory management becomes critical at scale\")\n", - " print(f\" • Understanding trade-offs helps design better systems\")\n", - " print(f\" • This optimization enables 
real-time AI applications\")\n", - "\n", - "# Explore production systems\n", - "explore_production_kv_caching()" - ] - }, - { - "cell_type": "markdown", - "id": "830f9a00", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Comprehensive Testing\n", - "\n", - "Complete validation of our KV caching implementation." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b965df6b", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "comprehensive-tests", - "locked": false, - "points": 20, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "def run_comprehensive_tests():\n", - " \"\"\"Run all tests to validate KV caching implementation.\"\"\"\n", - " print(\"🧪 Running Comprehensive KV Caching Tests\")\n", - " print(\"=\" * 50)\n", - " \n", - " # Test 1: Cache capacity and bounds checking\n", - " print(\"Test 1: Cache Capacity...\")\n", - " cache = KVCache(max_seq_len=3, n_layers=1, n_heads=2, head_dim=4)\n", - " \n", - " # Fill cache to capacity\n", - " for i in range(3):\n", - " k = Tensor(np.ones((2, 4)) * i) # Different values for each position\n", - " v = Tensor(np.ones((2, 4)) * i)\n", - " cache.update(0, k, v)\n", - " cache.advance_position()\n", - " \n", - " # Verify capacity reached\n", - " assert cache.current_position == 3, \"Cache should be at capacity\"\n", - " \n", - " # Test overflow protection\n", - " try:\n", - " cache.update(0, Tensor(np.ones((2, 4))), Tensor(np.ones((2, 4))))\n", - " assert False, \"Should raise overflow error\"\n", - " except ValueError:\n", - " pass # Expected\n", - " \n", - " print(\" ✅ Capacity management works\")\n", - " \n", - " # Test 2: Multi-layer cache consistency\n", - " print(\"Test 2: Multi-layer Consistency...\")\n", - " multi_cache = KVCache(max_seq_len=5, n_layers=3, n_heads=2, head_dim=4)\n", - " \n", - " # Add different data to each layer\n", - " for layer in range(3):\n", - " k = 
Tensor(np.ones((2, 4)) * layer)\n", - " v = Tensor(np.ones((2, 4)) * layer * 10)\n", - " multi_cache.update(layer, k, v)\n", - " \n", - " multi_cache.advance_position()\n", - " \n", - " # Verify each layer has correct data\n", - " for layer in range(3):\n", - " cached_k, cached_v = multi_cache.get(layer, 1)\n", - " expected_k = np.ones((1, 2, 4)) * layer\n", - " expected_v = np.ones((1, 2, 4)) * layer * 10\n", - " \n", - " np.testing.assert_array_equal(cached_k.data, expected_k, f\"Layer {layer} keys incorrect\")\n", - " np.testing.assert_array_equal(cached_v.data, expected_v, f\"Layer {layer} values incorrect\")\n", - " \n", - " print(\" ✅ Multi-layer consistency works\")\n", - " \n", - " # Test 3: Attention output consistency\n", - " print(\"Test 3: Attention Consistency...\")\n", - " embed_dim = 16\n", - " num_heads = 4\n", - " \n", - " attention = CachedMultiHeadAttention(embed_dim, num_heads)\n", - " cache = KVCache(max_seq_len=5, n_layers=1, n_heads=num_heads, head_dim=embed_dim//num_heads)\n", - " \n", - " # Generate sequence token by token with cache\n", - " tokens = [Tensor(np.random.randn(1, 1, embed_dim)) for _ in range(3)]\n", - " cached_outputs = []\n", - " \n", - " for i, token in enumerate(tokens):\n", - " output, cache = attention.forward(token, cache=cache, layer_idx=0, use_cache=True)\n", - " cached_outputs.append(output.data)\n", - " \n", - " # Generate same sequence all at once (no cache)\n", - " full_sequence = Tensor(np.concatenate([t.data for t in tokens], axis=1))\n", - " attention_fresh = CachedMultiHeadAttention(embed_dim, num_heads)\n", - " \n", - " # Use same weights for fair comparison\n", - " attention_fresh.w_q = attention.w_q\n", - " attention_fresh.w_k = attention.w_k \n", - " attention_fresh.w_v = attention.w_v\n", - " attention_fresh.w_o = attention.w_o\n", - " \n", - " full_output, _ = attention_fresh.forward(full_sequence, cache=None, use_cache=False)\n", - " \n", - " # Last cached output should be similar to last position of 
full output\n", - " # (Note: might not be exactly equal due to different computation paths)\n", - " diff = np.abs(cached_outputs[-1] - full_output.data[:, -1:, :]).mean()\n", - " assert diff < 1.0, f\"Cached and non-cached outputs too different: {diff}\"\n", - " \n", - " print(\" ✅ Attention consistency acceptable\")\n", - " \n", - " # Test 4: Memory profiling\n", - " print(\"Test 4: Memory Profiling...\")\n", - " \n", - " tracemalloc.start()\n", - " \n", - " # Create large cache\n", - " large_cache = KVCache(max_seq_len=100, n_layers=12, n_heads=16, head_dim=64)\n", - " \n", - " current, peak = tracemalloc.get_traced_memory()\n", - " tracemalloc.stop()\n", - " \n", - " # Verify memory usage is reasonable\n", - " memory_mb = peak / (1024 * 1024)\n", - " theoretical_mb = large_cache.get_memory_usage()['total_cache_size_mb']\n", - " \n", - " print(f\" Actual memory usage: {memory_mb:.2f} MB\")\n", - " print(f\" Theoretical cache size: {theoretical_mb:.2f} MB\")\n", - " print(\" ✅ Memory usage within expected range\")\n", - " \n", - " print(\"\\n🎉 All Comprehensive Tests Passed!\")\n", - " print(\"KV caching implementation is working correctly!\")\n", - "\n", - "# Run comprehensive tests\n", - "run_comprehensive_tests()" - ] - }, - { - "cell_type": "markdown", - "id": "43511800", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## Main Execution Block\n", - "\n", - "Consolidate all test execution for when the module is run directly." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2bc43e23", - "metadata": {}, - "outputs": [], - "source": [ - "if __name__ == \"__main__\":\n", - " print(\"🚀 TinyTorch KV Caching Module - Complete Test Suite\")\n", - " print(\"=\" * 60)\n", - " \n", - " # Run all tests in sequence\n", - " test_kv_cache()\n", - " print()\n", - " \n", - " test_cached_attention() \n", - " print()\n", - " \n", - " test_cached_generation()\n", - " print()\n", - " \n", - " performance_results = analyze_kv_cache_performance()\n", - " print()\n", - " \n", - " explore_production_kv_caching()\n", - " print()\n", - " \n", - " run_comprehensive_tests()\n", - " \n", - " print(\"\\n\" + \"=\" * 60)\n", - " print(\"🎯 MODULE COMPLETE: KV Caching Implementation\")\n", - " print(\"=\" * 60)\n", - " print(\"✅ All tests passed!\")\n", - " print(\"✅ Performance analysis complete!\")\n", - " print(\"✅ Production context understood!\")\n", - " print(\"\\nYou now understand the most sophisticated transformer optimization!\")" - ] - }, - { - "cell_type": "markdown", - "id": "990b104d", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🤔 ML Systems Thinking: Interactive Questions\n", - "\n", - "Reflect on how KV caching transforms transformer systems and enables production deployments." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b4f04b20", - "metadata": { - "lines_to_next_cell": 0, - "nbgrader": { - "grade": true, - "grade_id": "kv-cache-reflection", - "locked": false, - "points": 10, - "schema_version": 3, - "solution": false, - "task": true - } - }, - "outputs": [], - "source": [] - }, - { - "cell_type": "markdown", - "id": "f933c864", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### Question 1: Algorithmic Complexity Analysis\n", - "**Prompt**: You're optimizing a transformer for generating 1000-token stories. Without KV caching, each token generation requires computing attention for all previous tokens. 
\n", - "\n", - "**Question**: Calculate the total number of attention operations needed with and without KV caching. At what sequence length does the memory cost of caching equal the computational savings? How would you design a hybrid approach that balances memory and compute?\n", - "\n", - "**Your Analysis**:\n", - "[Provide detailed complexity analysis, break-even calculations, and hybrid system design]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d31fb4e9", - "metadata": { - "lines_to_next_cell": 0, - "nbgrader": { - "grade": true, - "grade_id": "memory-compute-tradeoff", - "locked": false, - "points": 10, - "schema_version": 3, - "solution": false, - "task": true - } - }, - "outputs": [], - "source": [] - }, - { - "cell_type": "markdown", - "id": "19d9b1b1", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "### Question 2: Production Memory Management\n", - "**Prompt**: You're deploying a chatbot service that handles 1000 concurrent conversations, each potentially 4096 tokens long. Each conversation needs its own KV cache.\n", - "\n", - "**Question**: Calculate total memory requirements for a 7B parameter model with 32 layers and 32 heads. How would you implement cache eviction, memory pooling, and batch processing to optimize resource usage? 
What happens when cache memory exceeds available RAM?\n", - "\n", - "**Your Analysis**: \n", - "[Provide memory calculations, architecture design, and resource management strategies]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a88ef0f2", - "metadata": { - "lines_to_next_cell": 0, - "nbgrader": { - "grade": true, - "grade_id": "optimization-techniques", - "locked": false, - "points": 10, - "schema_version": 3, - "solution": false, - "task": true - } - }, - "outputs": [], - "source": [] - }, - { - "cell_type": "markdown", - "id": "e05d70cf", - "metadata": {}, - "source": [ - " \n", - "### Question 3: Advanced Optimization Techniques\n", - "**Prompt**: Modern systems combine KV caching with other optimizations: Flash Attention (memory-efficient attention), mixed precision (FP16/INT8), and attention distillation (smaller attention matrices).\n", - "\n", - "**Question**: How would you modify your KV cache implementation to support these optimizations? What are the trade-offs between cache compression (storing compressed K,V) and cache accuracy? 
Design a system that adaptively chooses optimization strategies based on sequence length and available memory.\n", - "\n", - "**Your Analysis**:\n", - "[Provide optimization integration design, compression trade-offs, and adaptive system architecture]" - ] - }, - { - "cell_type": "markdown", - "id": "bdb14c9a", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🎯 MODULE SUMMARY: KV Caching - The Most Sophisticated Optimization\n", - "\n", - "### What We Built\n", - "- **KVCache Class**: Efficient storage and retrieval of key-value tensors across transformer layers\n", - "- **CachedMultiHeadAttention**: Attention mechanism that leverages cached K,V for O(N) complexity\n", - "- **Cached Generation Pipeline**: Complete autoregressive generation with dramatic performance improvements\n", - "- **Performance Analysis Tools**: Comprehensive benchmarking and memory profiling capabilities\n", - "\n", - "### Systems Insights Gained\n", - "- **Algorithmic Transformation**: How changing the algorithm (not just implementation) achieves orders-of-magnitude speedups\n", - "- **Memory-Compute Trade-offs**: Understanding when storing intermediate results pays off vs recomputation\n", - "- **Production Optimization**: How real LLMs like GPT achieve fast inference through sophisticated caching\n", - "- **Scaling Analysis**: How O(N²) → O(N) complexity changes enable practical long-context models\n", - "\n", - "### Performance Characteristics\n", - "- **Complexity**: O(N) attention per token vs O(N²) without caching\n", - "- **Memory**: Linear growth with sequence length, bounded by cache capacity\n", - "- **Speedup**: 10-100x faster generation for typical sequence lengths\n", - "- **Break-even**: Caching becomes beneficial around 20-50 tokens depending on model size\n", - "\n", - "### Production Impact\n", - "- **Real-world Necessity**: KV caching is essential for any practical transformer deployment\n", - "- **Memory Management**: Production systems require 
sophisticated cache management and memory pooling\n", - "- **User Experience**: This optimization enables real-time conversation and interactive AI applications\n", - "- **Cost Efficiency**: Reduces computational costs by orders of magnitude for inference workloads\n", - "\n", - "### Connection to Broader ML Systems\n", - "KV caching exemplifies the most sophisticated type of optimization - **changing the algorithm itself**. Unlike lower-level optimizations (vectorization, memory layout), this requires deep understanding of the mathematical structure and transforms the fundamental complexity of the operation.\n", - "\n", - "**You now understand the optimization that makes modern LLMs practical!** 🚀\n", - "\n", - "This completes your journey through transformer optimization techniques - from basic implementations to the algorithmic innovations that power production AI systems." - ] - } - ], - "metadata": { - "jupytext": { - "main_language": "python" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/modules_old/18_caching/caching_dev.py b/modules_old/18_caching/caching_dev.py deleted file mode 100644 index 2dea1b0b..00000000 --- a/modules_old/18_caching/caching_dev.py +++ /dev/null @@ -1,1706 +0,0 @@ -# --- -# jupyter: -# jupytext: -# text_representation: -# extension: .py -# format_name: percent -# format_version: '1.3' -# jupytext_version: 1.17.1 -# --- - -# %% [markdown] -""" -# KV Caching - The Most Sophisticated Optimization: Changing the Algorithm! - -Welcome to the KV Caching module! You'll implement the key-value cache optimization that transforms transformer inference from O(N²) to O(N) complexity for autoregressive generation. This is how GPT actually achieves fast text generation! 
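The complexity claim above is easy to sanity-check with plain arithmetic. A quick illustrative sketch (the helper names here are ours, not part of the module's API):

```python
# Back-of-envelope check: total attention score computations to generate
# N tokens one at a time, with and without a KV cache.

def naive_total_ops(n: int) -> int:
    # Without caching, step t recomputes attention over all t tokens so far:
    # roughly t * t score computations per step -> the sum grows like N^3 / 3.
    return sum(t * t for t in range(1, n + 1))

def cached_total_ops(n: int) -> int:
    # With a KV cache, step t scores one new query against t cached keys:
    # roughly t computations per step -> the sum grows like N^2 / 2.
    return sum(t for t in range(1, n + 1))

for n in (10, 100, 1000):
    speedup = naive_total_ops(n) / cached_total_ops(n)
    print(f"N={n:>4}: naive={naive_total_ops(n):>11,} "
          f"cached={cached_total_ops(n):>7,} speedup={speedup:6.1f}x")
```

The ratio works out to exactly (2N + 1) / 3, so the benefit keeps growing with sequence length — which is why caching is non-negotiable for long generations.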
- -## Learning Goals -- Algorithm transformation: Understand how caching changes fundamental complexity -- Memory vs compute trade-offs: Store K,V tensors to avoid recomputation -- Production optimization: Learn the optimization that makes GPT fast in practice -- Systems insight: How memory management enables dramatic speedups -- Incremental computation: Build systems that efficiently reuse previous work - -## Build -> Profile -> Optimize -1. **Build**: Implement KV caching for multi-head attention with incremental generation -2. **Profile**: Compare O(N²) vs O(N) performance and memory usage patterns -3. **Optimize**: Apply caching to complete transformer inference pipeline - -## What You'll Achieve -By the end of this module, you'll understand: -- Deep technical mastery of how KV caching transforms attention complexity -- Practical capability to implement production-grade transformer inference optimization -- Systems insight into memory-compute trade-offs that determine real-world performance -- Performance understanding of how algorithmic changes achieve dramatic speedups -- Connection to how ChatGPT, GPT-4, and other LLMs achieve fast response times - -## Systems Reality Check -💡 **Production Context**: GPT-4 uses KV caching for all inference - without it, generating 100 tokens would take minutes instead of seconds -⚡ **Performance Note**: KV caching is the difference between research models and production LLMs -🔥 **Memory Trade-off**: Cache grows with sequence length but saves quadratic recomputation -""" - -# %% nbgrader={"grade": false, "grade_id": "caching-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} -#| default_exp experimental.kv_cache - -#| export -import math -import numpy as np -import os -import sys -import time -import tracemalloc -from typing import Union, List, Optional, Tuple, Dict, Any - -# Import our Tensor class -try: - from tinytorch.core.tensor import Tensor -except ImportError: - # For 
development, import from local tensor module - sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor')) - from tensor_dev import Tensor - -# Try to import attention classes -try: - from tinytorch.core.attention import MultiHeadAttention, ScaledDotProductAttention -except ImportError: - # For development, import from local module - sys.path.append(os.path.join(os.path.dirname(__file__), '..', '13_attention')) - try: - from attention_dev import MultiHeadAttention, ScaledDotProductAttention - except ImportError: - # Create minimal mock classes if not available - class MultiHeadAttention: - def __init__(self, embed_dim, num_heads, dropout=0.0): - self.embed_dim = embed_dim - self.num_heads = num_heads - self.head_dim = embed_dim // num_heads - def forward(self, q, k, v, mask=None): - return q # Mock implementation - class ScaledDotProductAttention: - def __init__(self, dropout=0.0): - self.dropout = dropout - -# %% nbgrader={"grade": false, "grade_id": "caching-welcome", "locked": false, "schema_version": 3, "solution": false, "task": false} -print("🚀 TinyTorch KV Caching Module") -print(f"NumPy version: {np.__version__}") -print("Ready to implement the most sophisticated optimization!") - -# %% [markdown] -""" -## 📦 Where This Code Lives in the Final Package - -**Learning Side:** You work in `modules/source/19_caching/caching_dev.py` -**Building Side:** Code exports to `tinytorch.core.caching` - -```python -# Final package structure: -from tinytorch.core.caching import KVCache, CachedMultiHeadAttention, CachedTransformer -from tinytorch.core.attention import MultiHeadAttention # Previous module -from tinytorch.core.transformers import TransformerBlock # Dependencies -``` - -**Why this matters:** -- **Learning:** Understand algorithmic transformation through implementation -- **Production:** This is how real LLMs achieve fast inference -- **Consistency:** All caching optimizations live together in `core.caching` -- **Integration:** Works 
seamlessly with existing attention and transformer systems -""" - -# %% [markdown] -""" -## The Problem: Attention's Quadratic Complexity - -### Traditional Attention: O(N²) Recomputation -In autoregressive generation, we generate tokens one by one: - -``` -Generate token 1: Attend to [] (empty context) -Generate token 2: Attend to [token_1] -Generate token 3: Attend to [token_1, token_2] -Generate token 4: Attend to [token_1, token_2, token_3] -... -Generate token N: Attend to [token_1, ..., token_{N-1}] -``` - -**The inefficiency:** Each step recomputes attention for ALL previous tokens! - -### Memory and Compute Analysis -For each new token, traditional attention: -1. **Recomputes K,V** for all previous tokens (wasted computation) -2. **Attention matrix** grows: 1*1, 2*2, 3*3, ..., N*N (quadratic memory) -3. **Total operations**: 1² + 2² + 3² + ... + N² = O(N³) for full sequence! - -**This is why naive transformer generation is impossibly slow for long sequences.** -""" - -# %% [markdown] -""" -## The Solution: Key-Value Caching - -### Core Insight: Cache Past Computations -KV caching stores the key (K) and value (V) tensors from previous tokens: - -```python -# Step 1: Generate first token -cache.store(layer=0, keys=K₁, values=V₁, position=0) - -# Step 2: Generate second token -K_past, V_past = cache.get(layer=0, positions=[0]) -K_combined = concat(K_past, K₂) # Reuse K₁, add K₂ -V_combined = concat(V_past, V₂) # Reuse V₁, add V₂ -``` - -### Complexity Transformation -- **Without cache**: O(N²) memory, O(N³) total ops for generation -- **With cache**: O(N) memory per step, O(N²) total ops for generation -- **Speedup**: 10-100x faster for typical sequence lengths! -""" - -# %% [markdown] -""" -## KVCache Implementation - -The foundation of all transformer inference optimization. 
-""" - -# %% nbgrader={"grade": false, "grade_id": "kv-cache", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class KVCache: - """ - Key-Value cache for efficient transformer inference. - - Stores past key and value tensors to avoid recomputation during - autoregressive generation. This transforms O(N²) attention into - O(N) attention for incremental token generation. - """ - - def __init__(self, max_seq_len: int, n_layers: int, n_heads: int, head_dim: int): - """ - Initialize KV cache with fixed capacity. - - TODO: Implement KV cache initialization. - - STEP-BY-STEP IMPLEMENTATION: - 1. Store cache configuration parameters - 2. Initialize empty cache storage for each layer - 3. Track current sequence position - 4. Set up memory-efficient storage format - - MEMORY LAYOUT: - - Cache per layer: keys[seq_len, n_heads, head_dim] - - Cache per layer: values[seq_len, n_heads, head_dim] - - Total memory: 2 * n_layers * max_seq_len * n_heads * head_dim - - Args: - max_seq_len: Maximum sequence length to cache - n_layers: Number of transformer layers - n_heads: Number of attention heads - head_dim: Dimension per attention head - """ - ### BEGIN SOLUTION - self.max_seq_len = max_seq_len - self.n_layers = n_layers - self.n_heads = n_heads - self.head_dim = head_dim - - # Initialize cache storage for each layer - # Shape: (max_seq_len, n_heads, head_dim) - self.k_cache = {} - self.v_cache = {} - - for layer_idx in range(n_layers): - # Pre-allocate cache tensors for efficiency - self.k_cache[layer_idx] = Tensor(np.zeros((max_seq_len, n_heads, head_dim))) - self.v_cache[layer_idx] = Tensor(np.zeros((max_seq_len, n_heads, head_dim))) - - # Track current position in sequence - self.current_position = 0 - ### END SOLUTION - - def update(self, layer_idx: int, key: Tensor, value: Tensor) -> None: - """ - Store new key and value tensors at current position. - - TODO: Implement cache update mechanism. - - STEP-BY-STEP IMPLEMENTATION: - 1. 
Validate inputs and position bounds - 2. Store key tensor at current position - 3. Store value tensor at current position - 4. Handle incremental position tracking - - EFFICIENCY CONSIDERATIONS: - - In-place updates to avoid memory allocation - - Position-based indexing for O(1) access - - Bounds checking for cache overflow - - Args: - layer_idx: Which transformer layer this cache belongs to - key: Key tensor to store, shape (n_heads, head_dim) - value: Value tensor to store, shape (n_heads, head_dim) - """ - ### BEGIN SOLUTION - if layer_idx not in self.k_cache: - raise ValueError(f"Layer {layer_idx} not found in cache") - - if self.current_position >= self.max_seq_len: - # This prevents cache overflow which would cause memory corruption - raise ValueError(f"Cache overflow: position {self.current_position} >= max {self.max_seq_len}") - - # Store key and value at current position - # key/value shape: (n_heads, head_dim) - # Cache shape: (max_seq_len, n_heads, head_dim) - self.k_cache[layer_idx].data[self.current_position] = key.data - self.v_cache[layer_idx].data[self.current_position] = value.data - ### END SOLUTION - - def get(self, layer_idx: int, seq_len: int) -> Tuple[Tensor, Tensor]: - """ - Retrieve cached keys and values up to specified sequence length. - - TODO: Implement cache retrieval mechanism. - - STEP-BY-STEP IMPLEMENTATION: - 1. Validate layer and sequence length - 2. Extract keys from position 0 to seq_len - 3. Extract values from position 0 to seq_len - 4. 
Return as tensors ready for attention computation - - MEMORY EFFICIENCY: - - Return views/slices instead of copies when possible - - Handle different sequence lengths efficiently - - Args: - layer_idx: Which transformer layer to retrieve cache for - seq_len: How many positions to retrieve (1 to current_position) - - Returns: - Tuple of (keys, values) tensors with shape (seq_len, n_heads, head_dim) - """ - ### BEGIN SOLUTION - if layer_idx not in self.k_cache: - raise ValueError(f"Layer {layer_idx} not found in cache") - - if seq_len > self.current_position: - raise ValueError(f"Requested seq_len {seq_len} > current position {self.current_position}") - - # Extract the relevant portion of the cache - # Cache shape: (max_seq_len, n_heads, head_dim) - # Output shape: (seq_len, n_heads, head_dim) - cached_keys = Tensor(self.k_cache[layer_idx].data[:seq_len]) - cached_values = Tensor(self.v_cache[layer_idx].data[:seq_len]) - - return cached_keys, cached_values - ### END SOLUTION - - def advance_position(self) -> None: - """ - Move to next sequence position after storing current token. - - This should be called after update() to prepare for next token. 
- """ - self.current_position += 1 - - def reset(self) -> None: - """Reset cache to empty state for new sequence.""" - self.current_position = 0 - # Note: We don't need to zero out the cache data, just reset position - - def get_memory_usage(self) -> Dict[str, Any]: - """Analyze current cache memory usage.""" - total_elements = 2 * self.n_layers * self.max_seq_len * self.n_heads * self.head_dim - used_elements = 2 * self.n_layers * self.current_position * self.n_heads * self.head_dim - - return { - 'total_cache_size_mb': total_elements * 4 / (1024 * 1024), # Assuming float32 - 'used_cache_size_mb': used_elements * 4 / (1024 * 1024), - 'utilization': used_elements / total_elements if total_elements > 0 else 0, - 'current_position': self.current_position, - 'max_seq_len': self.max_seq_len - } - -# %% [markdown] -""" -### Testing KV Cache Functionality - -Let's verify our cache works correctly and understand its memory characteristics. -""" - -# %% nbgrader={"grade": true, "grade_id": "test-kv-cache", "locked": false, "points": 10, "schema_version": 3, "solution": false, "task": false} -def test_kv_cache(): - """Test KV cache functionality and memory management.""" - print("Testing KV Cache...") - - # Create cache for small transformer - max_seq_len = 10 - n_layers = 2 - n_heads = 4 - head_dim = 8 - - cache = KVCache(max_seq_len, n_layers, n_heads, head_dim) - - # Test 1: Initial state - assert cache.current_position == 0, "Cache should start at position 0" - - # Test 2: Store first token - k1 = Tensor(np.random.randn(n_heads, head_dim)) - v1 = Tensor(np.random.randn(n_heads, head_dim)) - - cache.update(layer_idx=0, key=k1, value=v1) - cache.advance_position() - - assert cache.current_position == 1, "Position should advance after update" - - # Test 3: Retrieve cached values - cached_k, cached_v = cache.get(layer_idx=0, seq_len=1) - - assert cached_k.shape == (1, n_heads, head_dim), f"Expected shape (1, {n_heads}, {head_dim}), got {cached_k.shape}" - assert 
cached_v.shape == (1, n_heads, head_dim), f"Expected shape (1, {n_heads}, {head_dim}), got {cached_v.shape}" - - # Verify data integrity - np.testing.assert_array_equal(cached_k.data[0], k1.data, "Cached key should match stored key") - np.testing.assert_array_equal(cached_v.data[0], v1.data, "Cached value should match stored value") - - # Test 4: Add second token - k2 = Tensor(np.random.randn(n_heads, head_dim)) - v2 = Tensor(np.random.randn(n_heads, head_dim)) - - cache.update(layer_idx=0, key=k2, value=v2) - cache.advance_position() - - # Test 5: Retrieve both tokens - cached_k, cached_v = cache.get(layer_idx=0, seq_len=2) - - assert cached_k.shape == (2, n_heads, head_dim), "Should retrieve both tokens" - np.testing.assert_array_equal(cached_k.data[0], k1.data, "First token key should be preserved") - np.testing.assert_array_equal(cached_k.data[1], k2.data, "Second token key should be stored") - - # Test 6: Memory usage analysis - memory_info = cache.get_memory_usage() - expected_total = 2 * n_layers * max_seq_len * n_heads * head_dim * 4 / (1024 * 1024) - - assert abs(memory_info['total_cache_size_mb'] - expected_total) < 0.01, "Memory calculation should be accurate" - assert memory_info['current_position'] == 2, "Should track position correctly" - - # Test 7: Reset functionality - cache.reset() - assert cache.current_position == 0, "Reset should return to position 0" - - print("✅ KV Cache tests passed!") - print(f" Cache capacity: {memory_info['total_cache_size_mb']:.2f} MB") - print(f" Memory efficiency: O(L * N * H * D) scaling") - -# Run the test -test_kv_cache() - -# ✅ IMPLEMENTATION CHECKPOINT: Basic KV Cache complete - -# 🤔 PREDICTION: How much memory would a KV cache use for GPT-3? 
-# GPT-3: 96 layers, 96 heads, 128 head_dim, 2048 max tokens -# Your guess: _____ GB - -# MAGNIFY SYSTEMS INSIGHT #1: Cache Memory Scaling Analysis -def analyze_cache_memory_scaling(): - """Analyze how KV cache memory scales with model and sequence parameters.""" - try: - print("\n🧠 KV Cache Memory Scaling Analysis") - print("=" * 45) - - # Test different model configurations - configs = [ - {'name': 'Small Model', 'layers': 6, 'heads': 6, 'head_dim': 64, 'max_seq': 512}, - {'name': 'Medium Model', 'layers': 12, 'heads': 12, 'head_dim': 64, 'max_seq': 1024}, - {'name': 'Large Model', 'layers': 24, 'heads': 16, 'head_dim': 64, 'max_seq': 2048}, - {'name': 'GPT-3 Scale', 'layers': 96, 'heads': 96, 'head_dim': 128, 'max_seq': 2048}, - {'name': 'GPT-4 Scale', 'layers': 120, 'heads': 128, 'head_dim': 128, 'max_seq': 8192} - ] - - print(f"{'Model':<15} {'Layers':<8} {'Memory':<12} {'Per Token':<12}") - print("-" * 50) - - for config in configs: - # Create cache to get accurate memory calculation - cache = KVCache( - max_seq_len=config['max_seq'], - n_layers=config['layers'], - n_heads=config['heads'], - head_dim=config['head_dim'] - ) - - memory_info = cache.get_memory_usage() - total_mb = memory_info['total_cache_size_mb'] - per_token_kb = (total_mb * 1024) / config['max_seq'] - - print(f"{config['name']:<15} {config['layers']:<8} {total_mb:<12.1f}MB {per_token_kb:<12.1f}KB") - - print(f"\nMAGNIFY Key Insights:") - print(f" • Memory scales as: O(Layers * Heads * HeadDim * SeqLen)") - print(f" • Each token adds: 2 * Layers * Heads * HeadDim * 4 bytes") - print(f" • GPT-3 cache: ~18GB for a full 2048-token context at float32!") - print(f" • Trade-off: Large memory cost but eliminates O(N²) recomputation") - - # TIP WHY THIS MATTERS: Understanding memory scaling helps design - # systems that can handle large models and long sequences efficiently. - # Real inference servers must budget memory for multiple concurrent caches!
- - except Exception as e: - print(f"WARNING️ Error in memory analysis: {e}") - print("Make sure KVCache class is implemented correctly") - -# Analyze cache memory scaling -analyze_cache_memory_scaling() - -# %% [markdown] -""" -## Cached Multi-Head Attention - -Now let's implement attention that can use the KV cache for efficient inference. -""" - -# %% nbgrader={"grade": false, "grade_id": "cached-attention", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class CachedMultiHeadAttention: - """ - Multi-head attention with KV caching support. - - This is the key optimization that makes transformer inference practical. - During autoregressive generation, we only compute attention for the - new token while reusing cached K,V from all previous tokens. - """ - - def __init__(self, embed_dim: int, num_heads: int, dropout: float = 0.0): - """ - Initialize cached multi-head attention. - - TODO: Implement cached attention initialization. - - STEP-BY-STEP IMPLEMENTATION: - 1. Store standard multi-head attention configuration - 2. Initialize weight matrices for Q, K, V projections - 3. Set up attention computation components - 4. 
Prepare for cache integration - - Args: - embed_dim: Total embedding dimension - num_heads: Number of attention heads - dropout: Dropout rate (for training) - """ - ### BEGIN SOLUTION - self.embed_dim = embed_dim - self.num_heads = num_heads - self.dropout = dropout - - # Check divisibility - if embed_dim % num_heads != 0: - raise ValueError(f"embed_dim ({embed_dim}) must be divisible by num_heads ({num_heads})") - - self.head_dim = embed_dim // num_heads - - # Initialize projection weights - scale = 1.0 / math.sqrt(embed_dim) - self.w_q = Tensor(np.random.randn(embed_dim, embed_dim) * scale) - self.w_k = Tensor(np.random.randn(embed_dim, embed_dim) * scale) - self.w_v = Tensor(np.random.randn(embed_dim, embed_dim) * scale) - self.w_o = Tensor(np.random.randn(embed_dim, embed_dim) * scale) - - self.parameters = [self.w_q, self.w_k, self.w_v, self.w_o] - ### END SOLUTION - - def forward(self, - query: Tensor, - key: Optional[Tensor] = None, - value: Optional[Tensor] = None, - cache: Optional[KVCache] = None, - layer_idx: int = 0, - use_cache: bool = False, - advance_cache: bool = True) -> Tuple[Tensor, Optional[KVCache]]: - """ - Compute attention with optional KV caching. - - TODO: Implement cached attention forward pass. - - STEP-BY-STEP IMPLEMENTATION: - 1. Handle input defaults (key=query, value=query for self-attention) - 2. Compute Q, K, V projections for current token - 3. If using cache, retrieve past K, V and combine with current - 4. Compute scaled dot-product attention - 5. Update cache with current K, V if requested - 6. 
Return attention output and updated cache - - CACHING LOGIC: - - Without cache: Standard attention on full sequence - - With cache: Combine past K,V with current K,V, attend from current Q - - Args: - query: Current token query, shape (batch_size, 1, embed_dim) or (batch_size, seq_len, embed_dim) - key: Key tensor (defaults to query) - value: Value tensor (defaults to query) - cache: KV cache to use and update - layer_idx: Which layer this attention belongs to - use_cache: Whether to update cache with current K,V - - Returns: - Tuple of (attention_output, updated_cache) - """ - ### BEGIN SOLUTION - # Handle input defaults - if key is None: - key = query - if value is None: - value = query - - batch_size, query_seq_len = query.shape[0], query.shape[1] - - # Step 1: Project query, key, value with descriptive names - query_projected, key_projected, value_projected = self._compute_qkv_projections(query, key, value) - - # Step 2: Reshape for multi-head attention - query_multihead, key_multihead, value_multihead = self._reshape_for_multihead( - query_projected, key_projected, value_projected, batch_size, query_seq_len - ) - - # Step 3: Combine with cached K,V if available - keys_combined, values_combined = self._combine_with_cache( - cache, layer_idx, key_multihead, value_multihead - ) - - # Step 4: Compute attention output - attention_output = self._compute_attention( - query_multihead, keys_combined, values_combined, batch_size, query_seq_len - ) - - # Step 5: Update cache if requested - updated_cache = self._update_cache_if_needed( - cache, use_cache, advance_cache, layer_idx, key_multihead, value_multihead, query_seq_len - ) - - return attention_output, updated_cache - ### END SOLUTION - - def _compute_qkv_projections(self, query: Tensor, key: Tensor, value: Tensor) -> Tuple[Tensor, Tensor, Tensor]: - """Compute Q, K, V projections with descriptive variable names.""" - query_projected = Tensor(np.matmul(query.data, self.w_q.data)) - key_projected = 
Tensor(np.matmul(key.data, self.w_k.data)) - value_projected = Tensor(np.matmul(value.data, self.w_v.data)) - return query_projected, key_projected, value_projected - - def _reshape_for_multihead(self, query_proj: Tensor, key_proj: Tensor, value_proj: Tensor, - batch_size: int, seq_len: int) -> Tuple[np.ndarray, np.ndarray, np.ndarray]: - """Reshape tensors for multi-head attention computation.""" - # Reshape: (batch, seq_len, embed_dim) -> (batch, seq_len, num_heads, head_dim) - query_heads = query_proj.data.reshape(batch_size, seq_len, self.num_heads, self.head_dim) - key_heads = key_proj.data.reshape(batch_size, seq_len, self.num_heads, self.head_dim) - value_heads = value_proj.data.reshape(batch_size, seq_len, self.num_heads, self.head_dim) - - # Transpose to (batch, num_heads, seq_len, head_dim) for attention computation - query_multihead = np.transpose(query_heads, (0, 2, 1, 3)) - key_multihead = np.transpose(key_heads, (0, 2, 1, 3)) - value_multihead = np.transpose(value_heads, (0, 2, 1, 3)) - - return query_multihead, key_multihead, value_multihead - - def _combine_with_cache(self, cache: Optional[KVCache], layer_idx: int, - current_keys: np.ndarray, current_values: np.ndarray) -> Tuple[np.ndarray, np.ndarray]: - """Combine current K,V with cached K,V if cache is available.""" - if cache is not None and cache.current_position > 0: - # Retrieve cached K, V tensors - cached_keys, cached_values = cache.get(layer_idx, cache.current_position) - - # Transform cached tensors to match current format - cached_keys_formatted = self._format_cached_tensors(cached_keys) - cached_values_formatted = self._format_cached_tensors(cached_values) - - # Concatenate past and current along sequence dimension (axis=2) - keys_combined = np.concatenate([cached_keys_formatted, current_keys], axis=2) - values_combined = np.concatenate([cached_values_formatted, current_values], axis=2) - else: - keys_combined = current_keys - values_combined = current_values - - return keys_combined, 
values_combined - - def _format_cached_tensors(self, cached_tensor: Tensor) -> np.ndarray: - """Format cached tensors for concatenation with current tensors.""" - # cached shape: (seq_len, num_heads, head_dim) - # Step 1: Transpose to (num_heads, seq_len, head_dim) - tensor_transposed = cached_tensor.data.transpose(1, 0, 2) - # Step 2: Add batch dimension -> (batch=1, num_heads, seq_len, head_dim) - tensor_batched = tensor_transposed[None, ...] - return tensor_batched - - def _compute_attention(self, query_multihead: np.ndarray, keys_combined: np.ndarray, - values_combined: np.ndarray, batch_size: int, query_seq_len: int) -> Tensor: - """Compute scaled dot-product attention with clear variable names.""" - # Calculate attention scores: Q @ K^T - keys_transposed = np.transpose(keys_combined, (0, 1, 3, 2)) # Transpose last two dims - attention_scores = np.matmul(query_multihead, keys_transposed) - scaled_scores = attention_scores / math.sqrt(self.head_dim) - - # Apply softmax to get attention weights - attention_weights = self._apply_softmax(scaled_scores) - - # Apply attention weights to values: weights @ V - attention_output = np.matmul(attention_weights, values_combined) - - # Reshape back to original format and apply output projection - final_output = self._reshape_attention_output(attention_output, batch_size, query_seq_len) - - return Tensor(np.matmul(final_output, self.w_o.data)) - - def _apply_softmax(self, scores: np.ndarray) -> np.ndarray: - """Apply numerically stable softmax to attention scores.""" - scores_shifted = scores - np.max(scores, axis=-1, keepdims=True) - scores_exp = np.exp(scores_shifted) - attention_weights = scores_exp / np.sum(scores_exp, axis=-1, keepdims=True) - return attention_weights - - def _reshape_attention_output(self, attention_output: np.ndarray, batch_size: int, seq_len: int) -> np.ndarray: - """Reshape attention output back to original format.""" - # (batch, heads, seq_len, head_dim) -> (batch, seq_len, heads, head_dim) - 
output_transposed = np.transpose(attention_output, (0, 2, 1, 3)) - # -> (batch, seq_len, embed_dim) - output_reshaped = output_transposed.reshape(batch_size, seq_len, self.embed_dim) - return output_reshaped - - def _update_cache_if_needed(self, cache: Optional[KVCache], use_cache: bool, advance_cache: bool, - layer_idx: int, key_multihead: np.ndarray, value_multihead: np.ndarray, - query_seq_len: int) -> Optional[KVCache]: - """Update cache with current K,V if caching is enabled.""" - if use_cache and cache is not None and query_seq_len == 1: - # Extract single token's K, V for cache storage (remove batch and sequence dims) - current_key_for_cache = Tensor(key_multihead[0, :, 0, :]) # (num_heads, head_dim) - current_value_for_cache = Tensor(value_multihead[0, :, 0, :]) # (num_heads, head_dim) - - cache.update(layer_idx, current_key_for_cache, current_value_for_cache) - - if advance_cache: - cache.advance_position() - - return cache - -# %% [markdown] -""" -### Testing Cached Attention - -Let's verify our cached attention works and provides the expected speedup. 
-""" - -# %% nbgrader={"grade": true, "grade_id": "test-cached-attention", "locked": false, "points": 15, "schema_version": 3, "solution": false, "task": false} -def test_cached_attention(): - """Test cached attention functionality and performance.""" - print("Testing Cached Multi-Head Attention...") - - embed_dim = 64 - num_heads = 8 - head_dim = embed_dim // num_heads - batch_size = 1 - - # Create attention layer - attention = CachedMultiHeadAttention(embed_dim, num_heads) - - # Create cache - max_seq_len = 10 - n_layers = 1 - cache = KVCache(max_seq_len, n_layers, num_heads, head_dim) - - # Test 1: Single token attention (like generation start) - token1 = Tensor(np.random.randn(batch_size, 1, embed_dim)) - - output1, updated_cache = attention.forward( - query=token1, - cache=cache, - layer_idx=0, - use_cache=True - ) - - assert output1.shape == (batch_size, 1, embed_dim), f"Expected output shape {(batch_size, 1, embed_dim)}, got {output1.shape}" - assert updated_cache.current_position == 1, "Cache should advance after first token" - - # Test 2: Second token with cache - token2 = Tensor(np.random.randn(batch_size, 1, embed_dim)) - - output2, updated_cache = attention.forward( - query=token2, - cache=updated_cache, - layer_idx=0, - use_cache=True - ) - - assert output2.shape == (batch_size, 1, embed_dim), "Second token output should have correct shape" - assert updated_cache.current_position == 2, "Cache should advance after second token" - - # Test 3: Compare with non-cached version - # For verification, run attention on full sequence without cache - full_sequence = Tensor(np.concatenate([token1.data, token2.data], axis=1)) # (batch, 2, embed_dim) - - fresh_attention = CachedMultiHeadAttention(embed_dim, num_heads) - fresh_attention.w_q = attention.w_q # Use same weights - fresh_attention.w_k = attention.w_k - fresh_attention.w_v = attention.w_v - fresh_attention.w_o = attention.w_o - - full_output, _ = fresh_attention.forward(query=full_sequence, cache=None, 
use_cache=False) - - # The outputs should be similar (not exactly equal due to different computation paths) - assert full_output.shape == (batch_size, 2, embed_dim), "Full sequence output should have correct shape" - - print("PASS Cached Attention tests passed!") - print(f" Memory saved: {cache.get_memory_usage()['used_cache_size_mb']:.2f} MB cache vs full recomputation") - print(f" Cache position: {cache.current_position}") - -# Run the test -test_cached_attention() - -# PASS IMPLEMENTATION CHECKPOINT: Cached Attention complete - -# THINK PREDICTION: How much faster is cached vs non-cached attention for 100 tokens? -# Your guess: ___x faster - -# MAGNIFY SYSTEMS INSIGHT #2: Attention Performance Comparison -def analyze_attention_performance_scaling(): - """Compare cached vs non-cached attention across different sequence lengths.""" - try: - print("\nSPEED Attention Performance Scaling Analysis") - print("=" * 45) - - embed_dim = 64 - num_heads = 8 - batch_size = 1 - test_lengths = [10, 25, 50, 100, 200] - - print(f"{'Seq Len':<10} {'Cached (ms)':<12} {'No Cache (ms)':<15} {'Speedup':<10}") - print("-" * 50) - - for seq_len in test_lengths: - # Set up test components - attention = CachedMultiHeadAttention(embed_dim, num_heads) - cache = KVCache(seq_len, 1, num_heads, embed_dim // num_heads) - - # Create test data - single_token = Tensor(np.random.randn(batch_size, 1, embed_dim)) - full_sequence = Tensor(np.random.randn(batch_size, seq_len, embed_dim)) - - # Time cached attention (incremental generation) - import time - start = time.perf_counter() - for pos in range(seq_len): - output, cache = attention.forward( - query=single_token, cache=cache, layer_idx=0, use_cache=True - ) - cached_time = (time.perf_counter() - start) * 1000 # Convert to ms - - # Time non-cached attention (full recomputation each step) - start = time.perf_counter() - for pos in range(seq_len): - subseq = Tensor(full_sequence.data[:, :pos+1, :]) - output, _ = attention.forward(query=subseq, 
cache=None, use_cache=False) - non_cached_time = (time.perf_counter() - start) * 1000 - - speedup = non_cached_time / cached_time if cached_time > 0 else float('inf') - - print(f"{seq_len:<10} {cached_time:<12.2f} {non_cached_time:<15.2f} {speedup:<10.2f}x") - - print(f"\nMAGNIFY Key Insights:") - print(f" • Speedup increases with sequence length (more reuse!)") - print(f" • Cached: O(N) complexity per token") - print(f" • Non-cached: O(N²) complexity per token") - print(f" • Break-even typically around 20-50 tokens") - print(f" • Memory cost: Linear cache vs quadratic recomputation") - - # TIP WHY THIS MATTERS: This analysis shows why KV caching is essential - # for any practical transformer deployment. The speedup becomes dramatic - # for longer sequences that are common in real applications! - - except Exception as e: - print(f"WARNING️ Error in performance analysis: {e}") - print("Make sure cached attention is implemented correctly") - -# Analyze attention performance scaling -analyze_attention_performance_scaling() - -# %% [markdown] -""" -## Autoregressive Generation with KV Cache - -Now let's implement the complete generation function that uses KV caching for dramatic speedups. -""" - -# %% nbgrader={"grade": false, "grade_id": "cached-generation", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export - -def generate_with_cache(model_func, - initial_tokens: Tensor, - max_new_tokens: int = 50, - embed_dim: int = 64, - num_heads: int = 8, - num_layers: int = 4) -> Tensor: - """ - Generate tokens autoregressively using KV caching. - - This demonstrates the key optimization that makes modern LLMs practical. - Instead of recomputing attention for all previous tokens at each step, - we cache the key and value tensors and incrementally build the sequence. - - TODO: Implement cached autoregressive generation. - - STEP-BY-STEP IMPLEMENTATION: - 1. Initialize KV cache for all layers - 2. Process initial tokens to populate cache - 3. 
For each new token to generate: - a. Compute attention using cache (O(N) instead of O(N²)) - b. Generate next token prediction - c. Update cache with new K,V - d. Add new token to sequence - 4. Return complete generated sequence - - COMPLEXITY ANALYSIS: - - Without cache: O(N²) per token, O(N³) total - - With cache: O(N) per token, O(N²) total - - Args: - model_func: Function that predicts next token given current sequence - initial_tokens: Starting tokens, shape (batch_size, seq_len, embed_dim) - max_new_tokens: How many new tokens to generate - embed_dim: Model embedding dimension - num_heads: Number of attention heads - num_layers: Number of transformer layers - - Returns: - Complete sequence including initial and generated tokens - """ - ### BEGIN SOLUTION - # Initialize generation components - cache, attention_layers = _initialize_generation_components( - initial_tokens, max_new_tokens, embed_dim, num_heads, num_layers - ) - - # Populate cache with initial tokens - _populate_cache_with_initial_tokens(initial_tokens, attention_layers, cache) - - # Generate new tokens iteratively - generated_sequence = _generate_tokens_iteratively( - initial_tokens, attention_layers, cache, max_new_tokens - ) - - return generated_sequence - ### END SOLUTION - -def _initialize_generation_components(initial_tokens: Tensor, max_new_tokens: int, - embed_dim: int, num_heads: int, num_layers: int) -> Tuple[KVCache, List]: - """Initialize KV cache and attention layers for generation.""" - batch_size, initial_seq_len, _ = initial_tokens.shape - head_dim = embed_dim // num_heads - max_seq_len = initial_seq_len + max_new_tokens - - # Initialize KV cache - cache = KVCache(max_seq_len, num_layers, num_heads, head_dim) - - # Initialize attention layers for each transformer layer - attention_layers = [] - for layer_idx in range(num_layers): - attention_layers.append(CachedMultiHeadAttention(embed_dim, num_heads)) - - return cache, attention_layers - -def 
_populate_cache_with_initial_tokens(initial_tokens: Tensor, attention_layers: List, cache: KVCache) -> None: - """Populate cache with initial tokens to prepare for generation.""" - batch_size, initial_seq_len, embed_dim = initial_tokens.shape - num_heads = attention_layers[0].num_heads - head_dim = attention_layers[0].head_dim - - # Process each initial token position - for token_position in range(initial_seq_len): - # Extract single token: (batch, 1, embed_dim) - current_token = Tensor(initial_tokens.data[:, token_position:token_position+1, :]) - - # Store K,V for this token across all layers - for layer_idx, attention_layer in enumerate(attention_layers): - key_for_cache, value_for_cache = _compute_and_format_kv_for_cache( - current_token, attention_layer, num_heads, head_dim - ) - cache.update(layer_idx, key_for_cache, value_for_cache) - - # Advance cache position once per token (shared across all layers) - cache.advance_position() - -def _compute_and_format_kv_for_cache(token: Tensor, attention_layer, num_heads: int, head_dim: int) -> Tuple[Tensor, Tensor]: - """Compute K,V projections for a token and format for cache storage.""" - # Compute K, V projections - token_key_projection = Tensor(np.matmul(token.data, attention_layer.w_k.data)) - token_value_projection = Tensor(np.matmul(token.data, attention_layer.w_v.data)) - - # Reshape to (num_heads, head_dim) for cache storage - key_for_cache = token_key_projection.data.reshape(1, num_heads, head_dim)[0] # Remove batch dim - value_for_cache = token_value_projection.data.reshape(1, num_heads, head_dim)[0] - - return Tensor(key_for_cache), Tensor(value_for_cache) - -def _generate_tokens_iteratively(initial_tokens: Tensor, attention_layers: List, - cache: KVCache, max_new_tokens: int) -> Tensor: - """Generate new tokens one by one using cached attention.""" - generated_sequence = [initial_tokens] - current_sequence = initial_tokens - - for generation_step in range(max_new_tokens): - # Get the most recent token as 
query - last_token = Tensor(current_sequence.data[:, -1:, :]) # (batch, 1, embed_dim) - - # Process through all attention layers with caching - next_token = _process_token_through_layers(last_token, attention_layers, cache) - - # Add generated token to sequence - generated_sequence.append(next_token) - - # Update current sequence for next iteration - current_sequence = Tensor(np.concatenate([current_sequence.data, next_token.data], axis=1)) - - # Combine all tokens into final sequence - final_sequence = Tensor(np.concatenate([seq.data for seq in generated_sequence], axis=1)) - return final_sequence - -def _process_token_through_layers(input_token: Tensor, attention_layers: List, cache: KVCache) -> Tensor: - """Process a token through all attention layers with caching.""" - layer_input = input_token - - # Pass through each attention layer - for layer_idx, attention_layer in enumerate(attention_layers): - layer_output, cache = attention_layer.forward( - query=layer_input, - cache=cache, - layer_idx=layer_idx, - use_cache=True, - advance_cache=False # Don't advance yet - will do once at the end - ) - layer_input = layer_output - - # Advance cache position once after processing all layers - cache.advance_position() - - # Simulate next token generation with demo logic - # DEMO ONLY: In real systems, this would be: - # logits = language_model_head(layer_output) - # next_token_id = sample_from_logits(logits) - # next_token = embedding_lookup(next_token_id) - next_token = Tensor(layer_output.data + np.random.randn(*layer_output.shape) * 0.1) - - return next_token - -# %% [markdown] -""" -### Testing Cached Generation - -Let's compare the performance of cached vs non-cached generation. 
-""" - -# %% nbgrader={"grade": true, "grade_id": "test-cached-generation", "locked": false, "points": 15, "schema_version": 3, "solution": false, "task": false} -def test_cached_generation(): - """Test and benchmark cached generation.""" - print("Testing Cached Generation...") - - # Test configuration - optimized for clarity and testing speed - test_config = { - 'batch_size': 1, - 'embed_dim': 32, # Smaller embedding for faster testing - 'num_heads': 4, # Fewer heads for simpler debugging - 'num_layers': 2, # Fewer layers for faster execution - 'initial_seq_len': 5, # Short initial sequence for quick setup - 'max_new_tokens': 5 # Limited generation for testing focus - } - - batch_size = test_config['batch_size'] - embed_dim = test_config['embed_dim'] - num_heads = test_config['num_heads'] - num_layers = test_config['num_layers'] - initial_seq_len = test_config['initial_seq_len'] - max_new_tokens = test_config['max_new_tokens'] - - # Create initial tokens - initial_tokens = Tensor(np.random.randn(batch_size, initial_seq_len, embed_dim)) - - # Simple model function for testing - def simple_model(tokens): - return tokens # Identity for testing - - # Test cached generation - start_time = time.time() - - generated_sequence = generate_with_cache( - model_func=simple_model, - initial_tokens=initial_tokens, - max_new_tokens=max_new_tokens, - embed_dim=embed_dim, - num_heads=num_heads, - num_layers=num_layers - ) - - cached_time = time.time() - start_time - - # Verify output shape - expected_seq_len = initial_seq_len + max_new_tokens - assert generated_sequence.shape == (batch_size, expected_seq_len, embed_dim), \ - f"Expected shape {(batch_size, expected_seq_len, embed_dim)}, got {generated_sequence.shape}" - - # Verify initial tokens are preserved - np.testing.assert_array_equal( - generated_sequence.data[:, :initial_seq_len, :], - initial_tokens.data, - "Initial tokens should be preserved in output" - ) - - print("PASS Cached Generation tests passed!") - print(f" 
Generated sequence length: {generated_sequence.shape[1]}") - print(f" Processing time: {cached_time:.3f}s") - print(f" Memory efficiency: O(N) per step instead of O(N²)") - -# Run the test -test_cached_generation() - -# PASS IMPLEMENTATION CHECKPOINT: Cached Generation complete - -# THINK PREDICTION: For a 1000-token story, how many fewer operations does caching save? -# Without cache: ~333 million operations, With cache: ~1 million operations -# Your calculation: _____ million operations saved - -# MAGNIFY SYSTEMS INSIGHT #3: Generation Efficiency Analysis -def analyze_generation_efficiency(): - """Analyze the computational savings from KV caching in text generation.""" - try: - print("\nROCKET Text Generation Efficiency Analysis") - print("=" * 45) - - # Analyze different generation scenarios - scenarios = [ - {'name': 'Short Response', 'tokens': 50}, - {'name': 'Paragraph', 'tokens': 200}, - {'name': 'Article', 'tokens': 1000}, - {'name': 'Long Document', 'tokens': 4000} - ] - - print(f"{'Scenario':<15} {'Tokens':<8} {'Ops w/o Cache':<15} {'Ops w/ Cache':<12} {'Reduction':<12}") - print("-" * 70) - - for scenario in scenarios: - n = scenario['tokens'] - - # Operations without cache: sum of i² for i=1 to N (quadratic growth) - ops_without_cache = sum(i*i for i in range(1, n+1)) - - # Operations with cache: N operations (linear growth) - ops_with_cache = n - - # Calculate reduction factor - reduction = ops_without_cache / ops_with_cache if ops_with_cache > 0 else 0 - - # Format large numbers for readability - ops_without_str = f"{ops_without_cache/1e6:.1f}M" if ops_without_cache > 1e6 else f"{ops_without_cache/1e3:.1f}K" - ops_with_str = f"{ops_with_cache/1e3:.1f}K" if ops_with_cache > 1e3 else str(ops_with_cache) - - print(f"{scenario['name']:<15} {n:<8} {ops_without_str:<15} {ops_with_str:<12} {reduction:<12.0f}x") - - print(f"\nMAGNIFY Computational Complexity:") - print(f" • Without Cache: O(N³) total operations for N-token generation") - print(f" • With 
Cache: O(N²) total operations for N-token generation") - print(f" • Memory Trade-off: O(L*H*D*N) cache vs O(N³) recomputation") - print(f" • Real Impact: Makes GPT-style models practical for generation") - - # Test actual generation performance - print(f"\n⏱️ Real Performance Test:") - embed_dim, num_heads, num_layers = 32, 4, 2 - initial_tokens = Tensor(np.random.randn(1, 5, embed_dim)) - - start_time = time.time() - result = generate_with_cache( - model_func=lambda x: x, - initial_tokens=initial_tokens, - max_new_tokens=20, - embed_dim=embed_dim, - num_heads=num_heads, - num_layers=num_layers - ) - generation_time = time.time() - start_time - - print(f" Generated {result.shape[1]} tokens in {generation_time:.3f}s") - print(f" Rate: {result.shape[1]/generation_time:.1f} tokens/second") - print(f" This enables real-time conversational AI!") - - # TIP WHY THIS MATTERS: This dramatic computational savings is what - # makes conversational AI possible. Without KV caching, chatbots would - # take minutes to generate simple responses! - - except Exception as e: - print(f"WARNING️ Error in efficiency analysis: {e}") - print("Make sure generation functions are implemented correctly") - -# Analyze generation efficiency -analyze_generation_efficiency() - -# %% [markdown] -""" -## Systems Analysis: Memory vs Compute Trade-off - -Let's analyze the memory and computational characteristics of KV caching. 
-""" - - # %% nbgrader={"grade": false, "grade_id": "kv-cache-analysis", "locked": false, "schema_version": 3, "solution": true, "task": false} -def benchmark_cached_attention(seq_len: int, attention: CachedMultiHeadAttention, - cache: KVCache, token: Tensor) -> float: - """Benchmark cached attention performance for a given sequence length.""" - start_time = time.time() - for pos in range(seq_len): - output, cache = attention.forward( - query=token, - cache=cache, - layer_idx=0, - use_cache=True - ) - return time.time() - start_time - -def benchmark_non_cached_attention(seq_len: int, attention: CachedMultiHeadAttention, - full_sequence: Tensor) -> float: - """Benchmark non-cached attention performance for a given sequence length.""" - start_time = time.time() - for pos in range(seq_len): - # Simulate recomputing attention for growing sequence - subseq = Tensor(full_sequence.data[:, :pos+1, :]) - output, _ = attention.forward(query=subseq, cache=None, use_cache=False) - return time.time() - start_time - -def calculate_theoretical_speedup(seq_len: int) -> Dict[str, float]: - """Calculate theoretical operation counts and speedup for cached vs non-cached approaches.""" - # Cached: O(N) operations per step, O(N²) total - cached_ops = seq_len * seq_len # Simplified model - - # Non-cached: O(N²) operations per step, O(N³) total - non_cached_ops = sum(i*i for i in range(1, seq_len+1)) - - return { - 'cached_ops': cached_ops, - 'non_cached_ops': non_cached_ops, - 'theoretical_speedup': non_cached_ops / cached_ops if cached_ops > 0 else 0 - } - -def format_performance_results(results: List[Dict[str, Any]]) -> None: - """Format and display performance analysis results in a readable table.""" - print(f"\nPROGRESS Performance Summary:") - print(f"{'Seq Len':<8} {'Memory(MB)':<12} {'Speedup':<10} {'Memory/Speedup':<15}") - print("-" * 50) - - for result in results: - efficiency = result['cache_memory_mb'] / result['actual_speedup'] if result['actual_speedup'] > 0 else float('inf') -
print(f"{result['seq_len']:<8} {result['cache_memory_mb']:<12.2f} {result['actual_speedup']:<10.2f} {efficiency:<15.2f}") - -def analyze_kv_cache_performance(): - """ - Comprehensive analysis of KV cache memory and performance characteristics. - - This function has been refactored into smaller, focused helper functions - for better readability and maintainability. - """ - print("MAGNIFY Analyzing KV Cache Performance Characteristics...") - - # Define test configuration (reduced for faster testing) - test_config = { - 'embed_dim': 32, - 'num_heads': 4, - 'num_layers': 2, - 'batch_size': 1, - 'sequence_lengths': [4, 8] # Very small for fast testing - } - - # Run performance analysis across different sequence lengths - results = _run_performance_analysis_across_lengths(test_config) - - # Display formatted summary and insights - _display_analysis_summary(results, test_config['sequence_lengths']) - - return results - -def _run_performance_analysis_across_lengths(config: Dict[str, Any]) -> List[Dict[str, Any]]: - """Run performance analysis across different sequence lengths.""" - results = [] - head_dim = config['embed_dim'] // config['num_heads'] - - for seq_len in config['sequence_lengths']: - print(f"\n📊 Testing sequence length: {seq_len}") - - # Analyze memory and performance for this sequence length - result = _analyze_single_sequence_length( - seq_len, config['embed_dim'], config['num_heads'], - config['num_layers'], config['batch_size'], head_dim - ) - - results.append(result) - _display_individual_results(result) - - return results - -def _analyze_single_sequence_length(seq_len: int, embed_dim: int, num_heads: int, - num_layers: int, batch_size: int, head_dim: int) -> Dict[str, Any]: - """Analyze memory and performance for a single sequence length.""" - # Set up test components - cache = KVCache(seq_len, num_layers, num_heads, head_dim) - memory_info = cache.get_memory_usage() - - attention = CachedMultiHeadAttention(embed_dim, num_heads) - single_token = 
Tensor(np.random.randn(batch_size, 1, embed_dim)) - full_sequence = Tensor(np.random.randn(batch_size, seq_len, embed_dim)) - - # Benchmark performance - cached_time = benchmark_cached_attention(seq_len, attention, cache, single_token) - non_cached_time = benchmark_non_cached_attention(seq_len, attention, full_sequence) - - # Calculate metrics - theoretical_metrics = calculate_theoretical_speedup(seq_len) - actual_speedup = non_cached_time / cached_time if cached_time > 0 else 0 - - return { - 'seq_len': seq_len, - 'cache_memory_mb': memory_info['total_cache_size_mb'], - 'cached_time': cached_time, - 'non_cached_time': non_cached_time, - 'actual_speedup': actual_speedup, - 'theoretical_speedup': theoretical_metrics['theoretical_speedup'], - 'cached_ops': theoretical_metrics['cached_ops'], - 'non_cached_ops': theoretical_metrics['non_cached_ops'] - } - -def _display_individual_results(result: Dict[str, Any]) -> None: - """Display results for a single sequence length test.""" - print(f" Cache memory: {result['cache_memory_mb']:.2f} MB") - print(f" Cached time: {result['cached_time']:.4f}s") - print(f" Non-cached time: {result['non_cached_time']:.4f}s") - print(f" Actual speedup: {result['actual_speedup']:.2f}x") - print(f" Theoretical speedup: {result['theoretical_speedup']:.2f}x") - -def _display_analysis_summary(results: List[Dict[str, Any]], sequence_lengths: List[int]) -> None: - """Display formatted summary and key insights.""" - format_performance_results(results) - - print(f"\nTARGET Key Insights:") - print(f" • Memory scales as O(L * N * H * D) where L=layers, N=seq_len, H=heads, D=head_dim") - print(f" • Computation scales as O(N²) with cache vs O(N³) without") - print(f" • Break-even point: ~{sequence_lengths[1]} tokens for this configuration") - print(f" • Memory-efficiency trade-off: more cache memory for better performance") - -# Run the analysis -performance_results = analyze_kv_cache_performance() - -# %% [markdown] -""" -## Production Context: How 
Real Systems Use KV Caching - -Understanding how KV caching is implemented in production systems. -""" - -# %% nbgrader={"grade": false, "grade_id": "production-context", "locked": false, "schema_version": 3, "solution": false, "task": false} -def explore_production_kv_caching(): - """ - Explore how KV caching is used in production transformer systems. - - This function demonstrates the connection between our implementation - and real-world systems like GPT, BERT, and other transformer models. - """ - print("🏭 Production KV Caching Systems Analysis") - print("=" * 60) - - # Production system examples - systems = [ - { - 'name': 'GPT-3', - 'layers': 96, - 'heads': 96, - 'head_dim': 128, - 'max_context': 2048, - 'use_case': 'Text generation' - }, - { - 'name': 'GPT-4', - 'layers': 120, # Estimated - 'heads': 128, # Estimated - 'head_dim': 128, - 'max_context': 8192, - 'use_case': 'Conversation' - }, - { - 'name': 'CodeT5', - 'layers': 12, - 'heads': 12, - 'head_dim': 64, - 'max_context': 512, - 'use_case': 'Code generation' - }, - { - 'name': 'Local 7B Model', - 'layers': 32, - 'heads': 32, - 'head_dim': 128, - 'max_context': 4096, - 'use_case': 'Local inference' - } - ] - - print(f"{'System':<15} {'Cache Size':<12} {'Max Tokens':<12} {'Use Case':<15}") - print("-" * 60) - - for system in systems: - # Calculate cache memory requirements - # 2 (K + V) * layers * max_context * heads * head_dim * 4 bytes (float32) - cache_size_bytes = (2 * system['layers'] * system['max_context'] * - system['heads'] * system['head_dim'] * 4) - cache_size_gb = cache_size_bytes / (1024**3) - - print(f"{system['name']:<15} {cache_size_gb:<12.2f}GB {system['max_context']:<12} {system['use_case']:<15}") - - print(f"\nTIP Production Optimizations:") - print(f" • Memory pooling: Reuse cache memory across requests") - print(f" • Batch processing: Share cache computation across multiple queries") - print(f" • Attention masks: Skip computation for padded tokens") - print(f" • Gradient 
checkpointing: Trade memory for compute during training") - print(f" • Mixed precision: Use FP16/INT8 to reduce cache memory") - print(f" • Flash Attention: Optimize memory access patterns") - - print(f"\nSPEED Real-World Performance Impact:") - print(f" • Without KV cache: GPT would take minutes to generate short responses") - print(f" • With KV cache: Real-time conversation becomes possible") - print(f" • Memory cost: 1-10GB RAM per conversation depending on model size") - print(f" • Speedup: 10-100x faster generation for typical use cases") - - print(f"\nTARGET Why This Matters for ML Engineers:") - print(f" • KV caching is THE optimization that makes LLMs practical") - print(f" • Memory management becomes critical at scale") - print(f" • Understanding trade-offs helps design better systems") - print(f" • This optimization enables real-time AI applications") - -# Explore production systems -explore_production_kv_caching() - -# %% [markdown] -""" -## Comprehensive Testing - -Complete validation of our KV caching implementation. 
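Before running the tests, the production cache sizes from the previous section can be sanity-checked in a few lines. The arithmetic is the same `2 × layers × max_context × heads × head_dim × 4 bytes` formula used in `explore_production_kv_caching()`; the helper name `kv_cache_bytes` below is ours, and the GPT-3 numbers are the same illustrative figures from the table above:

```python
def kv_cache_bytes(layers: int, max_context: int, heads: int,
                   head_dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to cache K and V for every layer and every position."""
    return 2 * layers * max_context * heads * head_dim * bytes_per_value

# GPT-3-scale configuration from the table above: 96 layers, 96 heads,
# head_dim 128, 2048-token context, float32 (4 bytes per value).
gpt3_gb = kv_cache_bytes(96, 2048, 96, 128) / (1024 ** 3)
print(f"GPT-3-scale cache: {gpt3_gb:.1f} GB per sequence")  # 18.0 GB
```

Switching `bytes_per_value` to 2 (FP16) or 1 (INT8) halves or quarters the footprint, which is why mixed precision appears in the production-optimization list above.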
-""" - -# %% nbgrader={"grade": true, "grade_id": "comprehensive-tests", "locked": false, "points": 20, "schema_version": 3, "solution": false, "task": false} -def run_comprehensive_tests(): - """Run all tests to validate KV caching implementation.""" - print("TEST Running Comprehensive KV Caching Tests") - print("=" * 50) - - # Test 1: Cache capacity and bounds checking - print("Test 1: Cache Capacity...") - cache = KVCache(max_seq_len=3, n_layers=1, n_heads=2, head_dim=4) - - # Fill cache to capacity - for i in range(3): - k = Tensor(np.ones((2, 4)) * i) # Different values for each position - v = Tensor(np.ones((2, 4)) * i) - cache.update(0, k, v) - cache.advance_position() - - # Verify capacity reached - assert cache.current_position == 3, "Cache should be at capacity" - - # Test overflow protection - try: - cache.update(0, Tensor(np.ones((2, 4))), Tensor(np.ones((2, 4)))) - assert False, "Should raise overflow error" - except ValueError: - pass # Expected - - print(" PASS Capacity management works") - - # Test 2: Multi-layer cache consistency - print("Test 2: Multi-layer Consistency...") - multi_cache = KVCache(max_seq_len=5, n_layers=3, n_heads=2, head_dim=4) - - # Add different data to each layer - for layer in range(3): - k = Tensor(np.ones((2, 4)) * layer) - v = Tensor(np.ones((2, 4)) * layer * 10) - multi_cache.update(layer, k, v) - - multi_cache.advance_position() - - # Verify each layer has correct data - for layer in range(3): - cached_k, cached_v = multi_cache.get(layer, 1) - expected_k = np.ones((1, 2, 4)) * layer - expected_v = np.ones((1, 2, 4)) * layer * 10 - - np.testing.assert_array_equal(cached_k.data, expected_k, f"Layer {layer} keys incorrect") - np.testing.assert_array_equal(cached_v.data, expected_v, f"Layer {layer} values incorrect") - - print(" PASS Multi-layer consistency works") - - # Test 3: Attention output consistency - print("Test 3: Attention Consistency...") - embed_dim = 16 - num_heads = 4 - - attention = 
CachedMultiHeadAttention(embed_dim, num_heads) - cache = KVCache(max_seq_len=5, n_layers=1, n_heads=num_heads, head_dim=embed_dim//num_heads) - - # Generate sequence token by token with cache - tokens = [Tensor(np.random.randn(1, 1, embed_dim)) for _ in range(3)] - cached_outputs = [] - - for i, token in enumerate(tokens): - output, cache = attention.forward(token, cache=cache, layer_idx=0, use_cache=True) - cached_outputs.append(output.data) - - # Generate same sequence all at once (no cache) - full_sequence = Tensor(np.concatenate([t.data for t in tokens], axis=1)) - attention_fresh = CachedMultiHeadAttention(embed_dim, num_heads) - - # Use same weights for fair comparison - attention_fresh.w_q = attention.w_q - attention_fresh.w_k = attention.w_k - attention_fresh.w_v = attention.w_v - attention_fresh.w_o = attention.w_o - - full_output, _ = attention_fresh.forward(full_sequence, cache=None, use_cache=False) - - # Last cached output should be similar to last position of full output - # (Note: might not be exactly equal due to different computation paths) - diff = np.abs(cached_outputs[-1] - full_output.data[:, -1:, :]).mean() - assert diff < 1.0, f"Cached and non-cached outputs too different: {diff}" - - print(" PASS Attention consistency acceptable") - - # Test 4: Memory profiling - print("Test 4: Memory Profiling...") - - tracemalloc.start() - - # Create large cache - large_cache = KVCache(max_seq_len=100, n_layers=12, n_heads=16, head_dim=64) - - current, peak = tracemalloc.get_traced_memory() - tracemalloc.stop() - - # Verify memory usage is reasonable - memory_mb = peak / (1024 * 1024) - theoretical_mb = large_cache.get_memory_usage()['total_cache_size_mb'] - - print(f" Actual memory usage: {memory_mb:.2f} MB") - print(f" Theoretical cache size: {theoretical_mb:.2f} MB") - print(" PASS Memory usage within expected range") - - print("\nCELEBRATE All Comprehensive Tests Passed!") - print("KV caching implementation is working correctly!") - -# Run 
comprehensive tests -run_comprehensive_tests() - -# %% [markdown] -""" -## Main Execution Block - -Consolidate all test execution for when the module is run directly. -""" - -# %% -if __name__ == "__main__": - print("ROCKET TinyTorch KV Caching Module - Complete Test Suite") - print("=" * 60) - - # Run all tests in sequence - test_kv_cache() - print() - - test_cached_attention() - print() - - test_cached_generation() - print() - - performance_results = analyze_kv_cache_performance() - print() - - explore_production_kv_caching() - print() - - run_comprehensive_tests() - - print("\n" + "=" * 60) - print("TARGET MODULE COMPLETE: KV Caching Implementation") - print("=" * 60) - print("PASS All tests passed!") - print("PASS Performance analysis complete!") - print("PASS Production context understood!") - print("\nYou now understand the most sophisticated transformer optimization!") - -# %% [markdown] -""" -## THINK ML Systems Thinking: Interactive Questions - -Reflect on how KV caching transforms transformer systems and enables production deployments. -""" - -# %% nbgrader={"grade": true, "grade_id": "kv-cache-reflection", "locked": false, "points": 10, "schema_version": 3, "solution": false, "task": true} -# %% [markdown] -""" -### Question 1: Algorithmic Complexity Analysis -**Prompt**: You're optimizing a transformer for generating 1000-token stories. Without KV caching, each token generation requires computing attention for all previous tokens. - -**Question**: Calculate the total number of attention operations needed with and without KV caching. At what sequence length does the memory cost of caching equal the computational savings? How would you design a hybrid approach that balances memory and compute? 
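
As a starting point (not a full answer), the raw operation counts can be tabulated under a simplified cost model: one "op" per query-key pair, a slightly finer-grained count than the `seq_len²` simplification in `calculate_theoretical_speedup()`. The helper below is a sketch, not part of the module:

```python
def generation_ops(seq_len: int) -> dict:
    # Without cache: generating token i recomputes attention for all i
    # queries over all i positions -> i^2 ops per step, O(N^3) total.
    no_cache = sum(i * i for i in range(1, seq_len + 1))
    # With cache: token i is a single query over i cached positions
    # -> i ops per step, O(N^2) total.
    with_cache = sum(range(1, seq_len + 1))
    return {"no_cache": no_cache, "with_cache": with_cache,
            "speedup": no_cache / with_cache}

ops = generation_ops(1000)  # the 1000-token story in the prompt
print(ops)
```

Under this model the speedup simplifies to (2N + 1) / 3, so the memory-for-compute question reduces to comparing that ratio against the cache bytes reported by `get_memory_usage()`.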
- -**Your Analysis**: -[Provide detailed complexity analysis, break-even calculations, and hybrid system design] -""" - -# %% nbgrader={"grade": true, "grade_id": "memory-compute-tradeoff", "locked": false, "points": 10, "schema_version": 3, "solution": false, "task": true} -# %% [markdown] -""" -### Question 2: Production Memory Management -**Prompt**: You're deploying a chatbot service that handles 1000 concurrent conversations, each potentially 4096 tokens long. Each conversation needs its own KV cache. - -**Question**: Calculate total memory requirements for a 7B parameter model with 32 layers and 32 heads. How would you implement cache eviction, memory pooling, and batch processing to optimize resource usage? What happens when cache memory exceeds available RAM? - -**Your Analysis**: -[Provide memory calculations, architecture design, and resource management strategies] -""" - -# %% nbgrader={"grade": true, "grade_id": "optimization-techniques", "locked": false, "points": 10, "schema_version": 3, "solution": false, "task": true} -# %% [markdown] -""" -### Question 3: Cache Optimization Integration - -**Context**: Your KVCache and CachedMultiHeadAttention work with float32 tensors in full precision. Production systems combine KV caching with Flash Attention, mixed precision (FP16/INT8), and cache compression. - -**Question**: Extend your implementation to support advanced optimizations: - -1. **Mixed Precision**: Modify your `update()` method to store K,V in FP16 while maintaining accuracy -2. **Cache Compression**: Design a compression scheme for your cache storage that reduces memory by 50% -3. **Adaptive Strategy**: Create a decision system that chooses between full-cache, compressed-cache, or no-cache based on: - - Available memory (use your `get_memory_usage()` calculations) - - Sequence length (from your performance analysis) - - Accuracy requirements -4. 
**Flash Attention Integration**: How would you modify your `_compute_attention()` method to work with tiled attention computation? - -**Think about**: -- Precision trade-offs in your current tensor operations -- Compression techniques that maintain attention accuracy -- Memory-performance decision trees -- Integration points in your existing code - -### BEGIN SOLUTION -[Student provides optimization integration design, precision analysis, and adaptive system modifications to their implementation] -### END SOLUTION -""" - -# %% nbgrader={"grade": true, "grade_id": "cache-scaling-analysis", "locked": false, "points": 10, "schema_version": 3, "solution": false, "task": true} -# %% [markdown] -""" -### Question 4: Real-World Cache Scaling - -**Context**: Your implementation handles single-layer attention, but real transformers have dozens of layers. You tested configurations up to GPT-3 scale in your analysis functions. - -**Question**: Analyze how your KV caching scales in real deployment scenarios: - -1. **Multi-Layer Scaling**: Your KVCache supports multiple layers - analyze the memory growth pattern as you scale from 6 layers (small) to 96 layers (GPT-3) -2. **Concurrent User Impact**: If your cached attention serves 100 simultaneous users, each with different conversation lengths (50-2000 tokens), calculate total system memory requirements -3. **Cache Efficiency**: Based on your performance measurements, at what point does cache memory cost exceed the computational savings? Design a cache size limit policy. -4. **Production Failure Modes**: What happens when your `advance_position()` reaches max_seq_len? How would you handle cache overflow in production? 
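
For the overflow question, one common production answer is a sliding-window (ring-buffer) cache that evicts the oldest positions instead of raising when full. A minimal sketch, independent of the `KVCache` class above (the class and method names here are ours, and only K is shown; V is handled identically):

```python
import numpy as np

class SlidingWindowCache:
    """Keeps only the most recent `window` positions of K."""
    def __init__(self, window: int, n_heads: int, head_dim: int):
        self.window = window
        self.k = np.zeros((window, n_heads, head_dim), dtype=np.float32)
        self.filled = 0      # number of valid positions so far
        self.next_slot = 0   # ring-buffer write index

    def append(self, k_new: np.ndarray) -> None:
        self.k[self.next_slot] = k_new  # silently overwrites oldest when full
        self.next_slot = (self.next_slot + 1) % self.window
        self.filled = min(self.filled + 1, self.window)

    def keys_in_order(self) -> np.ndarray:
        if self.filled < self.window:
            return self.k[:self.filled]
        # Once wrapped, the oldest surviving entry sits at next_slot.
        return np.roll(self.k, -self.next_slot, axis=0)

cache = SlidingWindowCache(window=3, n_heads=2, head_dim=4)
for i in range(5):                        # 5 tokens into a 3-slot window
    cache.append(np.full((2, 4), float(i)))
print(cache.keys_in_order()[:, 0, 0])     # positions 2, 3, 4 survive
```

The trade-off: generation never halts at `max_seq_len`, but attention can no longer see evicted positions, so accuracy on long-range dependencies degrades.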
- -**Think about**: -- Your `get_memory_usage()` calculations across different scales -- The performance trade-offs you measured -- System reliability when caches fill up -- Real-world memory constraints - -### BEGIN SOLUTION -[Student provides scaling analysis, memory calculations, and production failure handling strategies] -### END SOLUTION -""" - -# %% [markdown] -""" -## TARGET MODULE SUMMARY: KV Caching - The Most Sophisticated Optimization - -### What You've Accomplished -PASS **KVCache Implementation**: 200+ lines of sophisticated cache management with memory-efficient storage and retrieval -PASS **CachedMultiHeadAttention**: Complete attention mechanism with O(N) complexity instead of O(N²) -PASS **Autoregressive Generation**: Full text generation pipeline with dramatic performance improvements -PASS **Systems Analysis**: Comprehensive memory profiling and performance benchmarking across model scales -PASS **Production Context**: Understanding of real-world deployment challenges and optimization strategies - -### Key Learning Outcomes -- **Algorithmic Transformation**: Mastered how changing the algorithm (not just implementation) achieves orders-of-magnitude speedups -- **Memory-Compute Trade-offs**: Deep understanding of when storing intermediate results pays off vs recomputation -- **Production Optimization**: Learned how real LLMs like GPT achieve fast inference through sophisticated caching -- **Systems Engineering**: Gained insight into memory management, cache eviction, and resource optimization at scale - -### Mathematical Foundations Mastered -- **Complexity Analysis**: O(N³) -> O(N²) total operations transformation for sequence generation -- **Memory Scaling**: O(L * N * H * D) cache memory requirements across layers, sequence length, heads, and dimensions -- **Performance Metrics**: Break-even analysis between cache memory cost and computational savings - -### Professional Skills Developed -- **Cache Architecture**: Designed efficient storage 
systems with position-based indexing and multi-layer support
-- **Performance Optimization**: Implemented and measured algorithmic improvements with quantified speedups
-- **Production Thinking**: Analyzed real-world constraints like memory limits, concurrent users, and system reliability
-
-### Visual Understanding Gained
-```
-Complexity Transformation Achieved:
-
-Without KV Cache (O(N³) total):
-Token 1: [■]                       <- 1 op
-Token 2: [■]---[■]                 <- 4 ops (recompute all)
-Token 3: [■]---[■]---[■]           <- 9 ops (recompute all)
-Token 4: [■]---[■]---[■]---[■]     <- 16 ops (recompute all)
-...
-Total: 1 + 4 + 9 + 16 + ... = O(N³) scaling
-
-With KV Cache (O(N²) total):
-Token 1: [■] -> Cache              <- 1 op + store
-Token 2: [C]---[■] -> Cache        <- 2 ops (reuse cache)
-Token 3: [C]---[C]---[■]           <- 3 ops (reuse cache)
-Token 4: [C]---[C]---[C]---[■]     <- 4 ops (reuse cache)
-...
-Total: 1 + 2 + 3 + 4 + ... = O(N) per token, O(N²) total
-
-Memory Layout You Implemented:
-+--------------------------------------------------+
-| KVCache: Multi-Layer Storage System              |
-+--------------------------------------------------+
-| Layer 0: K[seq_len, heads, head_dim]             |
-|          V[seq_len, heads, head_dim]             |
-+--------------------------------------------------+
-| Layer 1: K[seq_len, heads, head_dim]             |
-|          V[seq_len, heads, head_dim]             |
-+--------------------------------------------------+
-Position Tracking: current_position -> shared across layers
-```
-
-### Ready for Advanced Applications
-Your KV caching implementation now enables:
-- **Real-time Generation**: 10-100x faster than naive approaches for typical sequence lengths
-- **Production Deployment**: Understanding of memory management and resource optimization
-- **Advanced Optimizations**: Foundation for Flash Attention, mixed precision, and cache compression
-
-### Connection to Real ML Systems
-Your implementation mirrors production systems:
-- **PyTorch**: `torch.nn.functional.multi_head_attention_forward` with cache support
-- **Transformers**: Hugging Face's `past_key_values` mechanism 
in GPT models -- **Production APIs**: OpenAI API, ChatGPT, and other LLMs rely on this exact optimization - -### Systems Impact Delivered -- **Computational Savings**: Reduced O(N³) to O(N²) complexity for autoregressive generation -- **Memory Efficiency**: Linear cache growth vs quadratic recomputation costs -- **Production Readiness**: Understanding of real-world deployment constraints and optimization strategies -- **Engineering Excellence**: Built maintainable, testable cache systems with comprehensive error handling - -### Next Steps -1. **Export your module**: `tito module complete 19_caching` -2. **Validate integration**: `tito test --module caching` -3. **Explore advanced features**: Multi-precision caching, Flash Attention integration -4. **Ready for Production**: Apply these techniques to real transformer deployments - -**Congratulations!** Your KV caching implementation represents the pinnacle of transformer optimization - the algorithmic innovation that makes conversational AI possible. You've mastered the most sophisticated optimization in modern ML systems! ROCKET - -This completes your journey through transformer optimization techniques - from basic implementations to the algorithmic innovations that power production AI systems. -""" \ No newline at end of file diff --git a/modules_old/18_caching/module.yaml b/modules_old/18_caching/module.yaml deleted file mode 100644 index 9b175247..00000000 --- a/modules_old/18_caching/module.yaml +++ /dev/null @@ -1,23 +0,0 @@ -description: "Memory optimization through KV caching for transformer inference. 
Students\ - \ learn to\ntransform O(N\xB2) attention complexity into O(N) for autoregressive\ - \ generation, achieving\ndramatic speedups in transformer inference.\n" -difficulty: advanced -estimated_hours: 8-10 -exports: -- tinytorch.optimizations.caching -learning_objectives: -- Understand attention memory complexity -- Implement KV caching for transformers -- Build incremental computation patterns -- Optimize autoregressive generation -name: Caching -number: 18 -prerequisites: -- Module 14: Transformers -- Module 17: Compression -skills_developed: -- KV caching implementation -- Memory-computation tradeoffs -- Incremental computation -- Production inference patterns -type: optimization diff --git a/modules_old/19_benchmarking/COMPREHENSIVE_QA_AUDIT_REPORT.md b/modules_old/19_benchmarking/COMPREHENSIVE_QA_AUDIT_REPORT.md deleted file mode 100644 index 98c622a4..00000000 --- a/modules_old/19_benchmarking/COMPREHENSIVE_QA_AUDIT_REPORT.md +++ /dev/null @@ -1,164 +0,0 @@ -# 🔬 COMPREHENSIVE QUALITY ASSURANCE AUDIT REPORT -**Date**: 2025-09-26 -**Auditor**: Quality Assurance Agent (Dr. Priya Sharma) -**Scope**: Complete TinyTorch Module System (21 modules) - -## 📊 EXECUTIVE SUMMARY - -**Overall Status**: ✅ **HIGHLY SUCCESSFUL** -- **21 modules discovered** (01-21, module 18_pruning deleted as planned) -- **21/21 modules compile successfully** (100% compilation rate) -- **19/21 modules execute without critical errors** (90% execution success) -- **2 modules have minor issues** requiring attention - -## 🏗️ COMPLETE MODULE INVENTORY - -### Core Foundation Modules (01-10) - ✅ ALL FUNCTIONAL -1. **01_setup** - ✅ PERFECT - Complete environment setup with systems analysis -2. **02_tensor** - ✅ PERFECT - Tensor operations with NumPy integration -3. **03_activations** - ✅ PERFECT - Activation functions compilation -4. **04_layers** - ⚠️ MINOR ISSUE - `__file__` undefined in execution context -5. **05_losses** - ✅ PERFECT - Loss functions with comprehensive testing -6. 
**06_autograd** - ✅ PERFECT - Automatic differentiation compilation -7. **07_optimizers** - ✅ PERFECT - Optimization algorithms compilation -8. **08_training** - ✅ PERFECT - Training loop implementation compilation -9. **09_spatial** - ✅ PERFECT - CNN operations with extensive testing -10. **10_dataloader** - ✅ PERFECT - Data loading and preprocessing compilation - -### Advanced Modules (11-15) - ✅ STRONG PERFORMANCE -11. **11_tokenization** - ❌ BPE TEST FAILURE - Assertion error in merge function -12. **12_embeddings** - ✅ PERFECT - Word embeddings compilation -13. **13_attention** - ✅ PERFECT - Attention mechanisms compilation -14. **14_transformers** - ✅ PERFECT - Transformer architecture compilation -15. **15_profiling** - ✅ PERFECT - Performance profiling execution validated - -### Specialized Modules (16-21) - ✅ COMPLETE COVERAGE -16. **16_acceleration** - ✅ PERFECT - Hardware acceleration compilation -17. **17_quantization** - ✅ PERFECT - Model quantization compilation -18. **18_compression** - ✅ PERFECT - Model compression compilation -19. **19_caching** - ✅ PERFECT - Caching strategies compilation -20. **20_benchmarking** - ✅ PERFECT - Benchmarking systems execution validated -21. 
**21_mlops** - ✅ PERFECT - MLOps deployment compilation - -## 🔍 DETAILED TEST RESULTS - -### Compilation Testing (21/21 PASS) -``` -✅ ALL 21 MODULES COMPILE SUCCESSFULLY -- No syntax errors detected -- All imports resolve correctly -- NBGrader metadata properly formatted -- Module structure compliant -``` - -### Execution Testing (19/21 PASS) -**Successful Executions:** -- **setup**: Full test suite execution with systems analysis ✅ -- **tensor**: Complete tensor operations with NumPy integration ✅ -- **losses**: Comprehensive loss function testing ✅ -- **profiling**: Performance profiling systems ✅ -- **benchmarking**: Benchmarking framework execution ✅ - -**Issues Identified:** -- **layers**: `__file__` undefined in execution context (minor) -- **tokenization**: BPE merge function test assertion failure (fixable) - -### Systems Analysis Validation -**EXCELLENT**: All tested modules include proper: -- Memory profiling and complexity analysis -- Performance benchmarking capabilities -- Scaling behavior documentation -- Production context references -- Integration with larger systems - -## 🚨 CRITICAL ISSUES IDENTIFIED - -### 1. Tokenization Module BPE Test Failure -**Module**: `modules/11_tokenization/tokenization_dev.py` -**Issue**: `assert merged[0].count('l') == 1, "Should have only one 'l' left after merge"` -**Severity**: MEDIUM - Test logic error in BPE implementation -**Action Required**: Fix BPE merge function test expectations - -### 2. 
Layers Module Execution Context Issue -**Module**: `modules/04_layers/layers_dev.py` -**Issue**: `name '__file__' is not defined` -**Severity**: LOW - Execution context issue, doesn't affect core functionality -**Action Required**: Remove dependency on `__file__` variable in test context - -## ✅ QUALITY ASSURANCE VALIDATION - -### ML Systems Teaching Standards - EXCELLENT -- ✅ **Memory Analysis**: All tested modules include explicit memory profiling -- ✅ **Performance Characteristics**: Computational complexity documented -- ✅ **Scaling Behavior**: Large input performance analysis present -- ✅ **Production Context**: Real-world system references (PyTorch, TensorFlow) -- ✅ **Hardware Implications**: Cache behavior and vectorization considerations - -### Test Structure Compliance - VERY GOOD -- ✅ **Immediate Testing**: Tests follow implementation in proper sequence -- ✅ **Unit Test Functions**: Proper `test_unit_*()` function naming -- ✅ **Main Block Structure**: `if __name__ == "__main__":` blocks present -- ✅ **Comprehensive Testing**: Integration and edge case coverage -- ✅ **Educational Assertions**: Clear error messages that teach concepts - -### NBGrader Integration - VALIDATED -- ✅ **Metadata Complete**: All cells have proper NBGrader metadata -- ✅ **Schema Version**: Consistent schema version 3 usage -- ✅ **Solution Blocks**: BEGIN/END SOLUTION properly implemented -- ✅ **Grade IDs**: Unique identifiers across modules -- ✅ **Student Scaffolding**: Clear TODO comments and implementation hints - -## 📈 PERFORMANCE METRICS - -### Compilation Success Rate: 100% (21/21) -### Execution Success Rate: 90% (19/21) -### Critical Issues: 0 -### Medium Issues: 1 (Tokenization BPE test) -### Minor Issues: 1 (Layers execution context) - -## 🎯 RECOMMENDATIONS - -### Immediate Actions Required: -1. **Fix tokenization BPE merge test** - Update assertion logic to match implementation -2. 
**Resolve layers module execution** - Remove `__file__` dependency in test context - -### Quality Improvements: -1. **Add automated testing pipeline** - Implement CI/CD for module validation -2. **Expand integration testing** - Test cross-module dependencies -3. **Performance regression testing** - Monitor computational complexity over time - -## 🏆 OVERALL ASSESSMENT - -**GRADE: A- (EXCELLENT WITH MINOR FIXES NEEDED)** - -### Strengths: -- **Outstanding compilation rate** (100%) -- **Strong execution success** (90%) -- **Excellent ML systems focus** throughout all modules -- **Comprehensive testing frameworks** in place -- **Professional NBGrader integration** ready for classroom use -- **Real-world production context** consistently provided - -### Areas for Improvement: -- **Fix 2 specific module issues** (tokenization BPE, layers execution) -- **Implement automated testing** to prevent regressions -- **Add cross-module integration testing** for complex workflows - -## 🚀 PRODUCTION READINESS - -**STATUS**: ✅ **READY FOR DEPLOYMENT WITH MINOR FIXES** - -The TinyTorch module system demonstrates excellent quality across all tested dimensions: -- Technical implementation is sound and complete -- Educational design follows ML systems engineering principles -- NBGrader integration supports instructor workflows -- Students will have positive learning experiences with proper scaffolding -- Professional software development practices are maintained throughout - -**RECOMMENDATION**: Approve for production use after fixing the 2 identified issues. - ---- - -**Audit Completed**: 2025-09-26 -**Quality Assurance Agent**: Dr. 
Priya Sharma -**Next Review Date**: Upon issue resolution and before major releases \ No newline at end of file diff --git a/modules_old/19_benchmarking/README.md b/modules_old/19_benchmarking/README.md deleted file mode 100644 index 537d565c..00000000 --- a/modules_old/19_benchmarking/README.md +++ /dev/null @@ -1,194 +0,0 @@ -# Module 20: TinyMLPerf - The Ultimate ML Systems Competition - -**The Olympics of ML Systems Optimization!** 🏆 - -## Overview - -Module 20 creates TinyMLPerf, an exciting competition framework where students benchmark all their optimizations from Modules 16-19 in three thrilling events. This is the grand finale that proves optimization mastery through measurable, competitive performance improvements. - -## Learning Objectives - -By completing this module, students will: - -1. **Build Competition Benchmarking Infrastructure**: Create standardized TinyMLPerf benchmark suite for fair competition -2. **Use Profiling Tools for Systematic Measurement**: Apply Module 15's profiler to measure real performance gains -3. **Compete Across Multiple Categories**: Optimize for speed, memory, model size, and innovation simultaneously -4. **Calculate Relative Performance Improvements**: Show speedup ratios independent of hardware differences -5. 
**Drive Innovation Through Competition**: Use competitive pressure to discover new optimization techniques - -## The Three Competition Events - -### 🏃 MLP Sprint - Fastest Feedforward Network -- **Challenge**: Optimize feedforward neural network inference for maximum speed -- **Benchmark**: 3-layer MLP (784→128→64→10) on MNIST-like data -- **Victory Condition**: Fastest inference time while maintaining accuracy -- **Techniques**: Quantization, pruning, custom kernels, architecture optimization - -### 🏃‍♂️ CNN Marathon - Efficient Convolutions -- **Challenge**: Optimize convolutional neural network processing for efficiency -- **Benchmark**: CNN model on 28×28×1 image data -- **Victory Condition**: Best balance of speed, memory usage, and accuracy -- **Techniques**: Convolution optimization, memory layout, spatial locality - -### 🏃‍♀️ Transformer Decathlon - Ultimate Attention Optimization -- **Challenge**: Optimize attention mechanisms and sequence processing -- **Benchmark**: Self-attention model on 64-token sequences -- **Victory Condition**: Complete optimization across all attention components -- **Techniques**: Attention optimization, memory management, sequence processing - -## Key Features - -### 🔧 TinyMLPerf Benchmark Suite -```python -from tinytorch.core.benchmarking import TinyMLPerf - -# Load standard competition benchmarks -tinyperf = TinyMLPerf() -mlp_model, mlp_dataset = tinyperf.load_benchmark('mlp_sprint') -cnn_model, cnn_dataset = tinyperf.load_benchmark('cnn_marathon') -transformer_model, transformer_dataset = tinyperf.load_benchmark('transformer_decathlon') -``` - -### ⚡ Competition Profiling with Module 15 Integration -```python -from tinytorch.core.benchmarking import CompetitionProfiler - -# Rigorous benchmarking using Module 15's profiler -profiler = CompetitionProfiler(warmup_runs=3, timing_runs=10) -results = profiler.benchmark_model(optimized_model, dataset, baseline_model) - -print(f"Speedup: {results['speedup_vs_baseline']:.2f}x 
faster!") -``` - -### 🏆 Competition Framework with Leaderboards -```python -from tinytorch.core.benchmarking import TinyMLPerfCompetitionPlus - -# Submit to competition -competition = TinyMLPerfCompetitionPlus() -submission = competition.submit_entry( - team_name="Speed Demons", - event_name="mlp_sprint", - optimized_model=my_optimized_mlp, - optimization_description="INT8 quantization + custom SIMD kernels", - github_url="https://github.com/team/optimization-repo" -) - -# View leaderboards -competition.display_all_enhanced_leaderboards() -``` - -### 🔬 Innovation Detection and Advanced Scoring -```python -# Automatic technique detection -innovation_analysis = competition.innovation_detector.analyze_innovation( - model=optimized_model, - optimization_description="Quantization + pruning + knowledge distillation" -) - -print(f"Innovation Score: {innovation_analysis['innovation_score']:.3f}") -print(f"Detected: {innovation_analysis['detected_techniques']}") -``` - -## Competition Scoring - -### Hardware-Independent Relative Scoring -- **Speedup Ratio**: `baseline_time / optimized_time` (3x faster = 3.0 score) -- **Innovation Score**: Automatic detection of optimization techniques (0.0 - 1.0) -- **Composite Score**: 70% speed + 30% innovation for balanced optimization - -### Multiple Leaderboards -1. **Speed Leaderboard**: Pure performance ranking by inference time -2. **Innovation Leaderboard**: Most creative optimization techniques -3. 
**Composite Leaderboard**: Best overall balance of speed and innovation - -## Innovation Technique Detection - -The system automatically detects and rewards: -- **Quantization**: INT8, INT16, low-precision techniques -- **Pruning**: Structured pruning, sparsity, weight removal -- **Distillation**: Knowledge transfer, teacher-student models -- **Custom Kernels**: SIMD, vectorization, hardware optimization -- **Memory Optimization**: In-place operations, gradient checkpointing -- **Compression**: Weight sharing, parameter compression - -## Example Competition Workflow - -```python -# 1. Load TinyMLPerf benchmark -tinyperf = TinyMLPerf() -model, dataset = tinyperf.load_benchmark('mlp_sprint') - -# 2. Apply your optimizations (from Modules 16-19) -optimized_model = apply_quantization(model) # Module 17 -optimized_model = apply_pruning(optimized_model) # Module 18 -optimized_model = add_custom_kernels(optimized_model) # Module 16 - -# 3. Submit to competition -competition = TinyMLPerfCompetitionPlus() -submission = competition.submit_entry( - team_name="Your Team Name", - event_name="mlp_sprint", - optimized_model=optimized_model, - optimization_description="Quantization + structured pruning + vectorized kernels", - github_url="https://github.com/yourteam/optimization-repo" -) - -# 4. View results and leaderboards -competition.display_leaderboard('mlp_sprint') -competition.display_innovation_leaderboard('mlp_sprint') -competition.display_composite_leaderboard('mlp_sprint') -``` - -## Systems Engineering Insights - -### 🏗️ **Professional Benchmarking Practices** -- **Statistical Reliability**: Multiple timing runs with warmup periods -- **Controlled Conditions**: Consistent test environments and data -- **Memory Profiling**: Resource usage analysis beyond timing -- **Evidence Requirements**: GitHub links and reproducibility - -### ⚡ **Multi-Dimensional Optimization** -- **Speed vs. 
Innovation Balance**: Composite scoring prevents tunnel vision -- **Hardware Independence**: Relative metrics work across platforms -- **Technique Diversity**: Innovation rewards encourage exploration -- **Production Relevance**: Real-world optimization constraints - -### 📊 **Competition-Driven Learning** -- **Concrete Motivation**: Leaderboard rankings drive engagement -- **Peer Learning**: See techniques used by other competitors -- **Iterative Improvement**: Multiple submissions encourage refinement -- **Evidence-Based Claims**: Reproducible performance reporting - -## Prerequisites - -- **Module 15**: Profiling infrastructure for performance measurement -- **Modules 16-19**: Optimization techniques to apply competitively -- **All Previous Modules**: Complete ML systems stack for comprehensive optimization - -## Success Metrics - -Students successfully complete this module when they can: - -1. **Submit Competitive Entries**: Use TinyMLPerf to benchmark optimized models -2. **Achieve Measurable Speedups**: Demonstrate concrete performance improvements -3. **Apply Multiple Techniques**: Combine quantization, pruning, acceleration, memory optimization -4. **Interpret Competition Results**: Understand relative scoring and leaderboard rankings -5. **Drive Innovation**: Explore creative optimization approaches for competitive advantage - -## Real-World Applications - -- **ML Competition Platforms**: Kaggle-style optimization competitions -- **Production Deployment**: Resource-constrained optimization for real systems -- **Research Evaluation**: Systematic comparison of optimization techniques -- **Industry Benchmarking**: Performance evaluation standards for ML systems - -## The Ultimate Achievement - -Module 20 represents the culmination of your ML systems optimization journey. 
Through competitive pressure in TinyMLPerf's three exciting events, you'll apply everything learned from quantization to custom kernels, proving you can optimize ML systems like a professional engineer. - -**Ready to compete? Load your optimized models and prove your mastery in the Olympics of ML Systems Optimization!** 🏆🚀 - ---- - -*This module completes your transformation from ML beginner to systems optimization expert through the power of competitive achievement.* \ No newline at end of file diff --git a/modules_old/19_benchmarking/benchmarking_dev.ipynb b/modules_old/19_benchmarking/benchmarking_dev.ipynb deleted file mode 100644 index 963ceed2..00000000 --- a/modules_old/19_benchmarking/benchmarking_dev.ipynb +++ /dev/null @@ -1,1534 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "ead5731b", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "# Module 20: TinyMLPerf - The Ultimate ML Systems Competition\n", - "\n", - "## Learning Objectives\n", - "By the end of this module, you will be able to:\n", - "\n", - "1. **Build Competition Benchmarking Infrastructure**: Create standardized TinyMLPerf benchmark suite for fair competition\n", - "2. **Use Profiling Tools for Systematic Measurement**: Apply Module 15's profiler to measure real performance gains\n", - "3. **Compete Across Multiple Categories**: Optimize for speed, memory, model size, and innovation simultaneously\n", - "4. **Calculate Relative Performance Improvements**: Show speedup ratios independent of hardware differences\n", - "5. **Drive Innovation Through Competition**: Use competitive pressure to discover new optimization techniques\n", - "\n", - "## The TinyMLPerf Vision\n", - "\n", - "**Key Message**: Competition proves optimization mastery by measuring concrete performance improvements across all your TinyTorch implementations!\n", - "\n", - "**The TinyMLPerf Journey:**\n", - "1. 
**Benchmark Suite**: Load standard models (MLP, CNN, Transformer) as competition workloads\n", - "2. **Profiling Integration**: Use your Module 15 profiler for rigorous performance measurement\n", - "3. **Competition Categories**: Three exciting events - MLP Sprint, CNN Marathon, Transformer Decathlon\n", - "4. **Relative Scoring**: Hardware-independent speedup measurements (3x faster = 3.0 score)\n", - "5. **Leaderboard Glory**: Track innovations and celebrate optimization achievements" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f36cf4db", - "metadata": {}, - "outputs": [], - "source": [ - "#| default_exp utils.benchmark\n", - "\n", - "import time\n", - "import json\n", - "import hashlib\n", - "import tracemalloc\n", - "from datetime import datetime\n", - "from pathlib import Path\n", - "from typing import Dict, Any, List, Optional, Tuple, Union, Callable\n", - "import numpy as np\n", - "import pickle\n", - "\n", - "# Import TinyTorch profiler from Module 15\n", - "try:\n", - " from tinytorch.utils.profiler import SimpleProfiler, profile_function\n", - " HAS_PROFILER = True\n", - "except ImportError:\n", - " print(\"Warning: TinyTorch profiler not available. Using basic timing.\")\n", - " HAS_PROFILER = False" - ] - }, - { - "cell_type": "markdown", - "id": "242db3f2", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Part 1: TinyMLPerf Benchmark Suite - Standard Competition Models\n", - "\n", - "Let's build the TinyMLPerf benchmark suite with three exciting competition events using standard models." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "454686b7", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "class TinyMLPerf:\n", - " \"\"\"\n", - " TinyMLPerf benchmark suite - The Olympics of ML Systems Optimization!\n", - " \n", - " Provides three standard competition events:\n", - " - MLP Sprint: Fastest feedforward inference\n", - " - CNN Marathon: Efficient convolution operations \n", - " - Transformer Decathlon: Complete attention-based model performance\n", - " \n", - " Each event uses standardized models and datasets for fair competition.\n", - " \"\"\"\n", - " \n", - " def __init__(self, profiler_warmup_runs: int = 3, profiler_timing_runs: int = 10):\n", - " \"\"\"\n", - " Initialize TinyMLPerf benchmark suite.\n", - " \n", - " Args:\n", - " profiler_warmup_runs: Number of warmup runs for stable measurements\n", - " profiler_timing_runs: Number of timing runs for statistical reliability\n", - " \"\"\"\n", - " self.warmup_runs = profiler_warmup_runs\n", - " self.timing_runs = profiler_timing_runs\n", - " self.benchmark_models = {}\n", - " self.benchmark_datasets = {}\n", - " \n", - " print(\"🏆 TinyMLPerf Competition Suite Initialized!\")\n", - " print(\"🎯 Three Events: MLP Sprint, CNN Marathon, Transformer Decathlon\")\n", - " \n", - " # Load standard benchmark models\n", - " self._load_benchmark_models()\n", - " self._load_benchmark_datasets()\n", - " \n", - " def _load_benchmark_models(self):\n", - " \"\"\"Load standard benchmark models for each competition event\"\"\"\n", - " print(\"📥 Loading TinyMLPerf Benchmark Models...\")\n", - " \n", - " # MLP Sprint - Simple feedforward model\n", - " class MLPBenchmark:\n", - " def __init__(self):\n", - " self.weights1 = np.random.randn(784, 128).astype(np.float32) * 0.1\n", - " self.bias1 = np.random.randn(128).astype(np.float32) * 0.1\n", - " self.weights2 = np.random.randn(128, 64).astype(np.float32) * 0.1\n", - " self.bias2 = 
np.random.randn(64).astype(np.float32) * 0.1 \n", - " self.weights3 = np.random.randn(64, 10).astype(np.float32) * 0.1\n", - " self.bias3 = np.random.randn(10).astype(np.float32) * 0.1\n", - " \n", - " def forward(self, x):\n", - " # 3-layer MLP with ReLU activations\n", - " h1 = np.maximum(0, x @ self.weights1 + self.bias1) # ReLU\n", - " h2 = np.maximum(0, h1 @ self.weights2 + self.bias2) # ReLU \n", - " return h2 @ self.weights3 + self.bias3 # Output layer\n", - " \n", - " def predict(self, x):\n", - " return self.forward(x)\n", - " \n", - " # CNN Marathon - Convolutional model\n", - " class CNNBenchmark:\n", - " def __init__(self):\n", - " # Simplified CNN weights (real CNN would need proper conv operations)\n", - " self.conv1_weights = np.random.randn(3, 3, 1, 32).astype(np.float32) * 0.1\n", - " self.conv2_weights = np.random.randn(3, 3, 32, 64).astype(np.float32) * 0.1\n", - " self.fc_weights = np.random.randn(1600, 10).astype(np.float32) * 0.1 # Flattened size\n", - " self.fc_bias = np.random.randn(10).astype(np.float32) * 0.1\n", - " \n", - " def forward(self, x):\n", - " # Simplified CNN (students will optimize real convolutions)\n", - " batch_size = x.shape[0] \n", - " # Simulate conv + pooling by flattening and projecting\n", - " x_flat = x.reshape(batch_size, -1) # Flatten input\n", - " if x_flat.shape[1] != 1600:\n", - " # Adjust to expected size\n", - " x_flat = x_flat[:, :1600] if x_flat.shape[1] > 1600 else np.pad(x_flat, ((0, 0), (0, 1600 - x_flat.shape[1])), 'constant')\n", - " return x_flat @ self.fc_weights + self.fc_bias\n", - " \n", - " def predict(self, x):\n", - " return self.forward(x)\n", - " \n", - " # Transformer Decathlon - Attention-based model \n", - " class TransformerBenchmark:\n", - " def __init__(self, d_model=128, n_heads=8, seq_len=64):\n", - " self.d_model = d_model\n", - " self.n_heads = n_heads\n", - " self.seq_len = seq_len\n", - " self.head_dim = d_model // n_heads\n", - " \n", - " # Multi-head attention weights\n", - " 
self.wq = np.random.randn(d_model, d_model).astype(np.float32) * 0.1\n", - " self.wk = np.random.randn(d_model, d_model).astype(np.float32) * 0.1 \n", - " self.wv = np.random.randn(d_model, d_model).astype(np.float32) * 0.1\n", - " self.wo = np.random.randn(d_model, d_model).astype(np.float32) * 0.1\n", - " \n", - " # Feed forward weights\n", - " self.ff1 = np.random.randn(d_model, d_model * 4).astype(np.float32) * 0.1\n", - " self.ff2 = np.random.randn(d_model * 4, d_model).astype(np.float32) * 0.1\n", - " \n", - " def forward(self, x):\n", - " # Simplified transformer block (students will optimize real attention)\n", - " batch_size, seq_len, d_model = x.shape\n", - " \n", - " # Self-attention (simplified)\n", - " q = x @ self.wq # [batch, seq, d_model]\n", - " k = x @ self.wk\n", - " v = x @ self.wv\n", - " \n", - " # Simplified attention computation (real would be multi-head)\n", - " scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_model) # [batch, seq, seq]\n", - " attn = np.exp(scores) / (np.sum(np.exp(scores), axis=-1, keepdims=True) + 1e-8)\n", - " out = attn @ v # [batch, seq, d_model]\n", - " \n", - " # Skip connection + layer norm (simplified)\n", - " out = out + x # Residual connection\n", - " \n", - " # Feed forward network\n", - " ff_out = np.maximum(0, out @ self.ff1) # ReLU\n", - " ff_out = ff_out @ self.ff2\n", - " \n", - " # Another skip connection\n", - " out = ff_out + out\n", - " \n", - " # Global average pooling for classification\n", - " return np.mean(out, axis=1) # [batch, d_model]\n", - " \n", - " def predict(self, x):\n", - " return self.forward(x)\n", - " \n", - " # Store benchmark models\n", - " self.benchmark_models = {\n", - " 'mlp_sprint': MLPBenchmark(),\n", - " 'cnn_marathon': CNNBenchmark(), \n", - " 'transformer_decathlon': TransformerBenchmark()\n", - " }\n", - " \n", - " print(\"✅ Benchmark models loaded successfully!\")\n", - " for event, model in self.benchmark_models.items():\n", - " print(f\" 📋 {event.title()}: 
{type(model).__name__}\")\n", - " \n", - " def _load_benchmark_datasets(self):\n", - " \"\"\"Load standard benchmark datasets for each competition event\"\"\"\n", - " print(\"📊 Loading TinyMLPerf Benchmark Datasets...\")\n", - " \n", - " # MLP Sprint dataset - MNIST-like flattened images\n", - " mlp_data = {\n", - " 'inputs': np.random.randn(100, 784).astype(np.float32), # Batch of 100 samples\n", - " 'targets': np.eye(10)[np.random.randint(0, 10, 100)], # One-hot labels\n", - " 'event': 'MLP Sprint',\n", - " 'description': 'Feedforward inference on flattened 28x28 images'\n", - " }\n", - " \n", - " # CNN Marathon dataset - Image-like data\n", - " cnn_data = {\n", - " 'inputs': np.random.randn(50, 28, 28, 1).astype(np.float32), # Batch of 50 images\n", - " 'targets': np.eye(10)[np.random.randint(0, 10, 50)],\n", - " 'event': 'CNN Marathon', \n", - " 'description': 'Convolutional inference on 28x28x1 images'\n", - " }\n", - " \n", - " # Transformer Decathlon dataset - Sequence data\n", - " transformer_data = {\n", - " 'inputs': np.random.randn(32, 64, 128).astype(np.float32), # Batch of 32 sequences\n", - " 'targets': np.eye(10)[np.random.randint(0, 10, 32)],\n", - " 'event': 'Transformer Decathlon',\n", - " 'description': 'Self-attention inference on 64-token sequences'\n", - " }\n", - " \n", - " self.benchmark_datasets = {\n", - " 'mlp_sprint': mlp_data,\n", - " 'cnn_marathon': cnn_data,\n", - " 'transformer_decathlon': transformer_data\n", - " }\n", - " \n", - " print(\"✅ Benchmark datasets loaded successfully!\")\n", - " for event, data in self.benchmark_datasets.items():\n", - " print(f\" 🎯 {data['event']}: {data['inputs'].shape} -> {data['targets'].shape}\")\n", - " \n", - " def load_benchmark(self, event_name: str) -> Tuple[Any, Dict[str, Any]]:\n", - " \"\"\"\n", - " Load a specific benchmark model and dataset.\n", - " \n", - " Args:\n", - " event_name: Name of competition event ('mlp_sprint', 'cnn_marathon', 'transformer_decathlon')\n", - " \n", - " 
Returns:\n", - " Tuple of (model, dataset) for the specified event\n", - " \"\"\"\n", - " if event_name not in self.benchmark_models:\n", - " available = list(self.benchmark_models.keys())\n", - " raise ValueError(f\"Event '{event_name}' not found. Available: {available}\")\n", - " \n", - " model = self.benchmark_models[event_name]\n", - " dataset = self.benchmark_datasets[event_name]\n", - " \n", - " print(f\"📋 Loaded benchmark: {dataset['event']}\")\n", - " print(f\" Model: {type(model).__name__}\")\n", - " print(f\" Data: {dataset['description']}\")\n", - " \n", - " return model, dataset\n", - " \n", - " def get_available_events(self) -> Dict[str, str]:\n", - " \"\"\"Get list of available competition events with descriptions\"\"\"\n", - " return {\n", - " 'mlp_sprint': 'Fastest feedforward neural network inference',\n", - " 'cnn_marathon': 'Efficient convolutional neural network processing',\n", - " 'transformer_decathlon': 'Complete attention mechanism optimization'\n", - " }" - ] - }, - { - "cell_type": "markdown", - "id": "3676ceeb", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### Test TinyMLPerf Benchmark Suite\n", - "\n", - "Let's test the benchmark suite to ensure all models and datasets load correctly." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "919f5680", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "def test_tinymlperf_benchmark_suite():\n", - " \"\"\"Test the TinyMLPerf benchmark suite\"\"\"\n", - " print(\"Testing TinyMLPerf Benchmark Suite...\")\n", - " \n", - " # Initialize benchmark suite\n", - " tinyperf = TinyMLPerf(profiler_warmup_runs=2, profiler_timing_runs=3)\n", - " \n", - " # Test each event\n", - " events = tinyperf.get_available_events()\n", - " print(f\"\\n🏆 Available Events: {len(events)}\")\n", - " \n", - " for event_name, description in events.items():\n", - " print(f\"\\n📋 Testing {event_name}...\")\n", - " model, dataset = tinyperf.load_benchmark(event_name)\n", - " \n", - " # Test model inference\n", - " inputs = dataset['inputs']\n", - " outputs = model.predict(inputs)\n", - " \n", - " print(f\" ✅ Inference successful: {inputs.shape} -> {outputs.shape}\")\n", - " \n", - " # Verify output shape makes sense\n", - " batch_size = inputs.shape[0]\n", - " assert outputs.shape[0] == batch_size, f\"Batch size mismatch: {outputs.shape[0]} != {batch_size}\"\n", - " print(f\" ✅ Output shape verified\")\n", - " \n", - " print(f\"\\n✅ TinyMLPerf benchmark suite test complete!\")\n", - " return tinyperf" - ] - }, - { - "cell_type": "markdown", - "id": "35b18f42", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Part 2: Performance Benchmarking Using Module 15's Profiler\n", - "\n", - "Now let's build the core benchmarking infrastructure that uses the profiler from Module 15 to measure performance." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f89d870e", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "class CompetitionProfiler:\n", - " \"\"\"\n", - " Competition profiling infrastructure using TinyTorch's Module 15 profiler.\n", - " \n", - " Provides rigorous performance measurement for fair competition by:\n", - " - Using standardized profiling from Module 15\n", - " - Multiple timing runs with statistical analysis\n", - " - Memory usage tracking and analysis\n", - " - Hardware-independent relative scoring\n", - " \"\"\"\n", - " \n", - " def __init__(self, warmup_runs: int = 3, timing_runs: int = 10):\n", - " \"\"\"\n", - " Initialize competition profiler.\n", - " \n", - " Args:\n", - " warmup_runs: Number of warmup runs to stabilize performance\n", - " timing_runs: Number of timing runs for statistical reliability \n", - " \"\"\"\n", - " self.warmup_runs = warmup_runs\n", - " self.timing_runs = timing_runs\n", - " self.has_profiler = HAS_PROFILER\n", - " \n", - " if not self.has_profiler:\n", - " print(\"⚠️ Warning: Advanced profiling unavailable, using basic timing\")\n", - " else:\n", - " print(\"✅ Using TinyTorch Module 15 profiler for advanced metrics\")\n", - " \n", - " def benchmark_model(self, model, dataset: Dict[str, Any], \n", - " baseline_model=None, baseline_time: Optional[float] = None) -> Dict[str, Any]:\n", - " \"\"\"\n", - " Benchmark a model using rigorous profiling methodology.\n", - " \n", - " Args:\n", - " model: Model to benchmark (must have predict() or forward() method)\n", - " dataset: Dataset dictionary with 'inputs' key\n", - " baseline_model: Optional baseline model for speedup calculation\n", - " baseline_time: Optional baseline time for speedup calculation\n", - " \n", - " Returns:\n", - " Comprehensive benchmarking results with performance metrics\n", - " \"\"\"\n", - " print(f\"🏁 Benchmarking {dataset.get('event', 'Model')}...\")\n", - " \n", - " inputs = 
dataset['inputs']\n", - " results = {\n", - " 'event': dataset.get('event', 'Unknown'),\n", - " 'model_type': type(model).__name__,\n", - " 'input_shape': inputs.shape,\n", - " 'benchmark_timestamp': datetime.now().isoformat()\n", - " }\n", - " \n", - " if self.has_profiler:\n", - " # Use advanced profiling from Module 15\n", - " results.update(self._profile_with_tinytorch_profiler(model, inputs))\n", - " else:\n", - " # Fallback to basic timing\n", - " results.update(self._profile_basic_timing(model, inputs))\n", - " \n", - " # Calculate speedup if baseline provided\n", - " if baseline_model is not None:\n", - " baseline_results = self.benchmark_model(baseline_model, dataset)\n", - " speedup = baseline_results['mean_inference_time'] / results['mean_inference_time']\n", - " results['speedup_vs_baseline'] = speedup\n", - " elif baseline_time is not None:\n", - " speedup = baseline_time / results['mean_inference_time'] \n", - " results['speedup_vs_baseline'] = speedup\n", - " \n", - " self._print_benchmark_results(results)\n", - " return results\n", - " \n", - " def _profile_with_tinytorch_profiler(self, model, inputs: np.ndarray) -> Dict[str, Any]:\n", - " \"\"\"Profile using Module 15's advanced profiler\"\"\"\n", - " profiler = SimpleProfiler(track_memory=True, track_cpu=True)\n", - " \n", - " # Run multiple profiling sessions for statistical reliability\n", - " profile_results = []\n", - " \n", - " for run in range(self.timing_runs):\n", - " # Each profiling session includes warmup\n", - " result = profiler.profile(\n", - " model.predict, inputs, \n", - " name=f\"inference_run_{run}\",\n", - " warmup=True # Profiler handles warmup\n", - " )\n", - " profile_results.append(result)\n", - " \n", - " # Aggregate statistics across runs\n", - " wall_times = [r['wall_time'] for r in profile_results]\n", - " cpu_times = [r['cpu_time'] for r in profile_results]\n", - " \n", - " aggregated = {\n", - " 'mean_inference_time': np.mean(wall_times),\n", - " 'std_inference_time': 
np.std(wall_times),\n", - " 'min_inference_time': np.min(wall_times), \n", - " 'max_inference_time': np.max(wall_times),\n", - " 'p95_inference_time': np.percentile(wall_times, 95),\n", - " 'mean_cpu_time': np.mean(cpu_times),\n", - " 'cpu_efficiency': np.mean([r['cpu_efficiency'] for r in profile_results]),\n", - " 'profiling_method': 'TinyTorch Module 15 Profiler'\n", - " }\n", - " \n", - " # Add memory metrics from last run (most representative)\n", - " last_result = profile_results[-1]\n", - " if 'memory_delta_mb' in last_result:\n", - " aggregated.update({\n", - " 'memory_delta_mb': last_result['memory_delta_mb'],\n", - " 'peak_memory_mb': last_result['peak_memory_mb'],\n", - " 'result_size_mb': last_result.get('result_size_mb', 0)\n", - " })\n", - " \n", - " return aggregated\n", - " \n", - " def _profile_basic_timing(self, model, inputs: np.ndarray) -> Dict[str, Any]:\n", - " \"\"\"Fallback basic timing without advanced profiling\"\"\"\n", - " \n", - " # Warmup runs\n", - " for _ in range(self.warmup_runs):\n", - " _ = model.predict(inputs)\n", - " \n", - " # Timing runs \n", - " times = []\n", - " for _ in range(self.timing_runs):\n", - " start = time.perf_counter()\n", - " _ = model.predict(inputs)\n", - " end = time.perf_counter()\n", - " times.append(end - start)\n", - " \n", - " return {\n", - " 'mean_inference_time': np.mean(times),\n", - " 'std_inference_time': np.std(times),\n", - " 'min_inference_time': np.min(times),\n", - " 'max_inference_time': np.max(times),\n", - " 'p95_inference_time': np.percentile(times, 95),\n", - " 'profiling_method': 'Basic Timing'\n", - " }\n", - " \n", - " def _print_benchmark_results(self, results: Dict[str, Any]):\n", - " \"\"\"Print formatted benchmark results\"\"\"\n", - " print(f\"\\n📊 {results['event']} Benchmark Results:\")\n", - " print(f\" Model: {results['model_type']}\")\n", - " print(f\" Input: {results['input_shape']}\")\n", - " print(f\" Mean Time: {results['mean_inference_time']*1000:.2f} ± 
{results['std_inference_time']*1000:.2f} ms\")\n", - " print(f\" Best Time: {results['min_inference_time']*1000:.2f} ms\")\n", - " print(f\" P95 Time: {results['p95_inference_time']*1000:.2f} ms\")\n", - " \n", - " if 'speedup_vs_baseline' in results:\n", - " print(f\" 🚀 Speedup: {results['speedup_vs_baseline']:.2f}x faster\")\n", - " \n", - " if 'memory_delta_mb' in results:\n", - " print(f\" 💾 Memory: {results['memory_delta_mb']:.2f} MB delta, {results['peak_memory_mb']:.2f} MB peak\")\n", - " \n", - " print(f\" 📏 Method: {results['profiling_method']}\")" - ] - }, - { - "cell_type": "markdown", - "id": "7ea6de0e", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### Test Competition Profiler\n", - "\n", - "Let's test the competition profiler with TinyMLPerf benchmark models." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4291ee9d", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "def test_competition_profiler():\n", - " \"\"\"Test the competition profiler with benchmark models\"\"\"\n", - " print(\"Testing Competition Profiler...\")\n", - " \n", - " # Initialize TinyMLPerf and profiler\n", - " tinyperf = TinyMLPerf(profiler_warmup_runs=2, profiler_timing_runs=3)\n", - " profiler = CompetitionProfiler(warmup_runs=2, timing_runs=3)\n", - " \n", - " # Test MLP Sprint profiling\n", - " mlp_model, mlp_dataset = tinyperf.load_benchmark('mlp_sprint')\n", - " mlp_results = profiler.benchmark_model(mlp_model, mlp_dataset)\n", - " \n", - " # Test CNN Marathon profiling\n", - " cnn_model, cnn_dataset = tinyperf.load_benchmark('cnn_marathon') \n", - " cnn_results = profiler.benchmark_model(cnn_model, cnn_dataset)\n", - " \n", - " # Test speedup calculation with baseline\n", - " print(f\"\\n🏃 Testing Speedup Calculation...\")\n", - " cnn_speedup_results = profiler.benchmark_model(\n", - " cnn_model, cnn_dataset, \n", - " baseline_time=mlp_results['mean_inference_time'] # Use 
MLP as baseline\n", - " )\n", - " \n", - " print(f\"\\n✅ Competition profiler test complete!\")\n", - " return profiler, mlp_results, cnn_results" - ] - }, - { - "cell_type": "markdown", - "id": "982f40f9", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Part 3: Competition Framework - Leaderboards and Scoring\n", - "\n", - "Now let's build the exciting competition framework with leaderboards, relative scoring, and multiple categories." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "016b4cc6", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "class TinyMLPerfCompetition:\n", - " \"\"\"\n", - " TinyMLPerf Competition Framework - The Olympics of ML Optimization!\n", - " \n", - " Manages three exciting competition events:\n", - " - MLP Sprint: Fastest feedforward network\n", - " - CNN Marathon: Most efficient convolutions \n", - " - Transformer Decathlon: Ultimate attention optimization\n", - " \n", - " Features hardware-independent relative scoring and transparent leaderboards.\n", - " \"\"\"\n", - " \n", - " def __init__(self, results_dir: str = \"tinymlperf_results\"):\n", - " \"\"\"\n", - " Initialize TinyMLPerf competition.\n", - " \n", - " Args:\n", - " results_dir: Directory to store competition results and leaderboards\n", - " \"\"\"\n", - " self.results_dir = Path(results_dir)\n", - " self.results_dir.mkdir(exist_ok=True)\n", - " \n", - " self.tinyperf = TinyMLPerf()\n", - " self.profiler = CompetitionProfiler(warmup_runs=3, timing_runs=5)\n", - " \n", - " # Load baseline models for relative scoring\n", - " self.baselines = self._establish_baselines()\n", - " \n", - " print(\"🏆 TinyMLPerf Competition Initialized!\")\n", - " print(\"🎯 Three Events Ready for Competition!\")\n", - " \n", - " def _establish_baselines(self) -> Dict[str, float]:\n", - " \"\"\"Establish baseline performance for relative scoring\"\"\"\n", - " print(\"📏 Establishing baseline 
performance for relative scoring...\")\n", - " \n", - " baselines = {}\n", - " events = ['mlp_sprint', 'cnn_marathon', 'transformer_decathlon']\n", - " \n", - " for event in events:\n", - " model, dataset = self.tinyperf.load_benchmark(event)\n", - " results = self.profiler.benchmark_model(model, dataset)\n", - " baselines[event] = results['mean_inference_time']\n", - " print(f\" {event}: {baselines[event]*1000:.2f} ms baseline\")\n", - " \n", - " return baselines\n", - " \n", - " def submit_entry(self, team_name: str, event_name: str, optimized_model, \n", - " optimization_description: str = \"\", github_url: str = \"\") -> Dict[str, Any]:\n", - " \"\"\"\n", - " Submit an optimized model to TinyMLPerf competition.\n", - " \n", - " Args:\n", - " team_name: Name of the competing team\n", - " event_name: Competition event ('mlp_sprint', 'cnn_marathon', 'transformer_decathlon')\n", - " optimized_model: The optimized model to submit\n", - " optimization_description: Description of optimization techniques used\n", - " github_url: Link to code repository (for transparency)\n", - " \n", - " Returns:\n", - " Submission results with performance metrics and scoring\n", - " \"\"\"\n", - " if event_name not in self.baselines:\n", - " available = list(self.baselines.keys())\n", - " raise ValueError(f\"Event '{event_name}' not available. 
Choose from: {available}\")\n", - " \n", - " print(f\"🚀 TINYMLPERF SUBMISSION\")\n", - " print(f\"🏆 Event: {event_name.replace('_', ' ').title()}\")\n", - " print(f\"👥 Team: {team_name}\")\n", - " print(\"-\" * 60)\n", - " \n", - " # Load benchmark dataset for this event\n", - " _, dataset = self.tinyperf.load_benchmark(event_name)\n", - " \n", - " # Benchmark the submitted model\n", - " results = self.profiler.benchmark_model(\n", - " optimized_model, dataset,\n", - " baseline_time=self.baselines[event_name]\n", - " )\n", - " \n", - " # Calculate competition score (relative speedup)\n", - " baseline_time = self.baselines[event_name]\n", - " submission_time = results['mean_inference_time']\n", - " speedup_score = baseline_time / submission_time\n", - " \n", - " # Create submission record\n", - " submission = {\n", - " 'submission_id': self._generate_submission_id(team_name, event_name),\n", - " 'timestamp': datetime.now().isoformat(),\n", - " 'team_name': team_name,\n", - " 'event_name': event_name,\n", - " 'optimization_description': optimization_description,\n", - " 'github_url': github_url,\n", - " 'performance_metrics': results,\n", - " 'speedup_score': speedup_score,\n", - " 'baseline_time_ms': baseline_time * 1000,\n", - " 'submission_time_ms': submission_time * 1000\n", - " }\n", - " \n", - " # Save submission\n", - " self._save_submission(submission)\n", - " \n", - " # Display results\n", - " self._display_submission_results(submission)\n", - " \n", - " return submission\n", - " \n", - " def _generate_submission_id(self, team_name: str, event_name: str) -> str:\n", - " \"\"\"Generate unique submission ID\"\"\"\n", - " timestamp = datetime.now().strftime(\"%Y%m%d_%H%M%S\")\n", - " team_hash = hashlib.md5(team_name.encode()).hexdigest()[:6]\n", - " return f\"{event_name}_{team_hash}_{timestamp}\"\n", - " \n", - " def _save_submission(self, submission: Dict[str, Any]):\n", - " \"\"\"Save submission to results directory\"\"\"\n", - " filename = 
f\"{submission['submission_id']}.json\"\n", - " filepath = self.results_dir / filename\n", - " \n", - " with open(filepath, 'w') as f:\n", - " json.dump(submission, f, indent=2, default=str)\n", - " \n", - " print(f\"💾 Submission saved: {filepath}\")\n", - " \n", - " def _display_submission_results(self, submission: Dict[str, Any]):\n", - " \"\"\"Display formatted submission results\"\"\"\n", - " metrics = submission['performance_metrics']\n", - " speedup = submission['speedup_score']\n", - " \n", - " print(f\"\\n🏆 SUBMISSION RESULTS\")\n", - " print(f\"=\" * 50)\n", - " print(f\"Team: {submission['team_name']}\")\n", - " print(f\"Event: {submission['event_name'].replace('_', ' ').title()}\")\n", - " \n", - " print(f\"\\n⏱️ Performance:\")\n", - " print(f\" Your Time: {submission['submission_time_ms']:.2f} ms\")\n", - " print(f\" Baseline: {submission['baseline_time_ms']:.2f} ms\")\n", - " print(f\" 🚀 Speedup: {speedup:.2f}x {'FASTER' if speedup > 1.0 else 'slower'}\")\n", - " \n", - " if 'memory_delta_mb' in metrics:\n", - " print(f\" 💾 Memory: {metrics['memory_delta_mb']:.2f} MB\")\n", - " \n", - " # Award celebration for good performance\n", - " if speedup >= 3.0:\n", - " print(f\"\\n🎉 AMAZING! 3x+ speedup achieved!\")\n", - " elif speedup >= 2.0:\n", - " print(f\"\\n🏆 EXCELLENT! 2x+ speedup!\")\n", - " elif speedup >= 1.5:\n", - " print(f\"\\n⭐ GREAT! 
50%+ speedup!\")\n", - " elif speedup >= 1.1:\n", - " print(f\"\\n✅ Good optimization!\")\n", - " else:\n", - " print(f\"\\n🤔 Keep optimizing - you can do better!\")\n", - " \n", - " if submission['optimization_description']:\n", - " print(f\"\\n💡 Techniques Used:\")\n", - " print(f\" {submission['optimization_description']}\")\n", - " \n", - " def display_leaderboard(self, event_name: str, top_n: int = 10) -> List[Dict[str, Any]]:\n", - " \"\"\"\n", - " Display leaderboard for a specific event.\n", - " \n", - " Args:\n", - " event_name: Event to show leaderboard for\n", - " top_n: Number of top entries to display\n", - " \n", - " Returns:\n", - " List of top submissions\n", - " \"\"\"\n", - " submissions = self._load_event_submissions(event_name)\n", - " \n", - " if not submissions:\n", - " print(f\"🏆 {event_name.replace('_', ' ').title()} Leaderboard\")\n", - " print(\"No submissions yet! Be the first to compete!\")\n", - " return []\n", - " \n", - " # Sort by speedup score (highest first)\n", - " submissions.sort(key=lambda s: s['speedup_score'], reverse=True)\n", - " top_submissions = submissions[:top_n]\n", - " \n", - " print(f\"\\n🏆 TINYMLPERF LEADERBOARD - {event_name.replace('_', ' ').title()}\")\n", - " print(\"=\" * 80)\n", - " print(f\"{'Rank':<6} {'Team':<20} {'Speedup':<10} {'Time (ms)':<12} {'Techniques':<25}\")\n", - " print(\"-\" * 80)\n", - " \n", - " for i, submission in enumerate(top_submissions):\n", - " rank = i + 1\n", - " team = submission['team_name'][:19]\n", - " speedup = f\"{submission['speedup_score']:.2f}x\"\n", - " time_ms = f\"{submission['submission_time_ms']:.2f}\"\n", - " techniques = submission['optimization_description'][:24] + \"...\" if len(submission['optimization_description']) > 24 else submission['optimization_description']\n", - " \n", - " print(f\"{rank:<6} {team:<20} {speedup:<10} {time_ms:<12} {techniques:<25}\")\n", - " \n", - " print(\"-\" * 80)\n", - " print(f\"Showing top {len(top_submissions)} of {len(submissions)} 
submissions\")\n", - " \n", - " return top_submissions\n", - " \n", - " def display_all_leaderboards(self):\n", - " \"\"\"Display leaderboards for all events\"\"\"\n", - " events = ['mlp_sprint', 'cnn_marathon', 'transformer_decathlon']\n", - " \n", - " for event in events:\n", - " self.display_leaderboard(event, top_n=5)\n", - " print()\n", - " \n", - " def _load_event_submissions(self, event_name: str) -> List[Dict[str, Any]]:\n", - " \"\"\"Load all submissions for a specific event\"\"\"\n", - " submissions = []\n", - " \n", - " for filepath in self.results_dir.glob(f\"{event_name}_*.json\"):\n", - " try:\n", - " with open(filepath, 'r') as f:\n", - " submission = json.load(f)\n", - " submissions.append(submission)\n", - " except Exception as e:\n", - " print(f\"Warning: Could not load {filepath}: {e}\")\n", - " \n", - " return submissions\n", - " \n", - " def get_team_progress(self, team_name: str) -> Dict[str, List[Dict[str, Any]]]:\n", - " \"\"\"Get all submissions from a specific team across all events\"\"\"\n", - " all_files = list(self.results_dir.glob(\"*.json\"))\n", - " team_submissions = {'mlp_sprint': [], 'cnn_marathon': [], 'transformer_decathlon': []}\n", - " \n", - " for filepath in all_files:\n", - " try:\n", - " with open(filepath, 'r') as f:\n", - " submission = json.load(f)\n", - " if submission['team_name'] == team_name:\n", - " event = submission['event_name']\n", - " if event in team_submissions:\n", - " team_submissions[event].append(submission)\n", - " except Exception as e:\n", - " continue\n", - " \n", - " # Sort by timestamp\n", - " for event in team_submissions:\n", - " team_submissions[event].sort(key=lambda s: s['timestamp'])\n", - " \n", - " return team_submissions" - ] - }, - { - "cell_type": "markdown", - "id": "c164bce1", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### Test TinyMLPerf Competition Framework\n", - "\n", - "Let's test the competition framework with multiple team 
submissions and leaderboards." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "64308dff", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "def test_tinymlperf_competition():\n", - " \"\"\"Test the TinyMLPerf competition framework\"\"\"\n", - " print(\"Testing TinyMLPerf Competition Framework...\")\n", - " \n", - " # Initialize competition\n", - " competition = TinyMLPerfCompetition()\n", - " \n", - " # Create some test optimized models\n", - " class FastMLPModel:\n", - " \"\"\"Simulated optimized MLP - smaller and faster\"\"\"\n", - " def __init__(self):\n", - " # Smaller model for speed\n", - " self.weights1 = np.random.randn(784, 64).astype(np.float32) * 0.1\n", - " self.bias1 = np.random.randn(64).astype(np.float32) * 0.1\n", - " self.weights2 = np.random.randn(64, 10).astype(np.float32) * 0.1 \n", - " self.bias2 = np.random.randn(10).astype(np.float32) * 0.1\n", - " \n", - " def predict(self, x):\n", - " h1 = np.maximum(0, x @ self.weights1 + self.bias1)\n", - " return h1 @ self.weights2 + self.bias2\n", - " \n", - " class EfficientCNNModel:\n", - " \"\"\"Simulated optimized CNN\"\"\"\n", - " def __init__(self):\n", - " # Optimized weights\n", - " self.fc_weights = np.random.randn(1600, 10).astype(np.float32) * 0.05\n", - " self.fc_bias = np.random.randn(10).astype(np.float32) * 0.05\n", - " \n", - " def predict(self, x):\n", - " batch_size = x.shape[0]\n", - " x_flat = x.reshape(batch_size, -1)\n", - " if x_flat.shape[1] != 1600:\n", - " x_flat = x_flat[:, :1600] if x_flat.shape[1] > 1600 else np.pad(x_flat, ((0, 0), (0, 1600 - x_flat.shape[1])), 'constant')\n", - " return x_flat @ self.fc_weights + self.fc_bias\n", - " \n", - " # Submit optimized models to competition\n", - " print(\"\\n🚀 Submitting Competition Entries...\")\n", - " \n", - " # MLP Sprint submissions\n", - " mlp_submission1 = competition.submit_entry(\n", - " team_name=\"Speed Demons\",\n", - " event_name=\"mlp_sprint\",\n", - " 
optimized_model=FastMLPModel(),\n", - " optimization_description=\"Reduced hidden layer size for 2x speedup\",\n", - " github_url=\"https://github.com/speed-demons/fast-mlp\"\n", - " )\n", - " \n", - " mlp_submission2 = competition.submit_entry(\n", - " team_name=\"Lightning Fast\", \n", - " event_name=\"mlp_sprint\",\n", - " optimized_model=FastMLPModel(),\n", - " optimization_description=\"Quantization + kernel optimization\",\n", - " github_url=\"https://github.com/lightning-fast/mlp-opt\"\n", - " )\n", - " \n", - " # CNN Marathon submission\n", - " cnn_submission = competition.submit_entry(\n", - " team_name=\"CNN Champions\",\n", - " event_name=\"cnn_marathon\", \n", - " optimized_model=EfficientCNNModel(),\n", - " optimization_description=\"Custom convolution kernels + memory optimization\",\n", - " github_url=\"https://github.com/cnn-champions/efficient-cnn\"\n", - " )\n", - " \n", - " # Display leaderboards\n", - " print(\"\\n📊 Competition Leaderboards:\")\n", - " competition.display_all_leaderboards()\n", - " \n", - " print(\"\\n✅ TinyMLPerf competition framework test complete!\")\n", - " return competition" - ] - }, - { - "cell_type": "markdown", - "id": "e89abe4e", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Part 4: Innovation Tracking and Advanced Scoring\n", - "\n", - "Let's add innovation detection and advanced scoring to reward creative optimization techniques." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "39a4324b", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "class InnovationDetector:\n", - " \"\"\"\n", - " Detect and score innovative optimization techniques in submitted models.\n", - " \n", - " Rewards creativity by analyzing models for advanced optimization patterns:\n", - " - Quantization techniques\n", - " - Pruning strategies \n", - " - Knowledge distillation\n", - " - Custom kernel implementations\n", - " - Novel architectural innovations\n", - " \"\"\"\n", - " \n", - " def __init__(self):\n", - " \"\"\"Initialize innovation detector\"\"\"\n", - " self.innovation_patterns = {\n", - " 'quantization': ['quantized', 'int8', 'int16', 'low_precision', 'quantize'],\n", - " 'pruning': ['pruned', 'sparse', 'sparsity', 'prune', 'structured_pruning'],\n", - " 'distillation': ['distilled', 'teacher', 'student', 'knowledge_distillation', 'kd'],\n", - " 'custom_kernels': ['custom_kernel', 'optimized_kernel', 'cuda', 'vectorized', 'simd'],\n", - " 'memory_optimization': ['memory_pool', 'in_place', 'gradient_checkpointing', 'memory_efficient'],\n", - " 'compression': ['compressed', 'huffman', 'lz4', 'weight_sharing', 'parameter_sharing']\n", - " }\n", - " \n", - " def analyze_innovation(self, model, optimization_description: str) -> Dict[str, Any]:\n", - " \"\"\"\n", - " Analyze a model for innovative optimization techniques.\n", - " \n", - " Args:\n", - " model: The optimized model to analyze\n", - " optimization_description: Text description of optimizations\n", - " \n", - " Returns:\n", - " Innovation analysis with detected techniques and scores\n", - " \"\"\"\n", - " innovation_score = 0.0\n", - " detected_techniques = []\n", - " \n", - " # Analyze optimization description\n", - " desc_lower = optimization_description.lower()\n", - " \n", - " for technique, patterns in self.innovation_patterns.items():\n", - " for pattern in patterns:\n", - " if pattern in 
desc_lower:\n", - " detected_techniques.append(technique)\n", - " innovation_score += 0.2\n", - " break # Only count each technique once\n", - " \n", - " # Analyze model attributes for innovation markers\n", - " model_innovation = self._analyze_model_attributes(model)\n", - " detected_techniques.extend(model_innovation['techniques'])\n", - " innovation_score += model_innovation['score']\n", - " \n", - " # Bonus for multiple techniques (creativity reward)\n", - " if len(detected_techniques) >= 3:\n", - " innovation_score += 0.3 # Combination bonus\n", - " \n", - " # Cap innovation score\n", - " innovation_score = min(innovation_score, 1.0)\n", - " \n", - " return {\n", - " 'innovation_score': innovation_score,\n", - " 'detected_techniques': list(set(detected_techniques)), # Remove duplicates\n", - " 'num_techniques': len(set(detected_techniques)),\n", - " 'creativity_bonus': len(detected_techniques) >= 3\n", - " }\n", - " \n", - " def _analyze_model_attributes(self, model) -> Dict[str, Any]:\n", - " \"\"\"Analyze model object for innovation attributes\"\"\"\n", - " techniques = []\n", - " score = 0.0\n", - " \n", - " # Check for common optimization attributes\n", - " optimization_attributes = [\n", - " ('quantized', 'quantization'),\n", - " ('pruned', 'pruning'),\n", - " ('distilled', 'distillation'),\n", - " ('compressed', 'compression'),\n", - " ('memory_optimized', 'memory_optimization'),\n", - " ('custom_kernels', 'custom_kernels')\n", - " ]\n", - " \n", - " for attr, technique in optimization_attributes:\n", - " if hasattr(model, attr) and getattr(model, attr):\n", - " techniques.append(technique)\n", - " score += 0.15\n", - " \n", - " # Check for unusual model architectures (creativity indicator)\n", - " if hasattr(model, 'innovative_architecture') and getattr(model, 'innovative_architecture'):\n", - " techniques.append('novel_architecture')\n", - " score += 0.25\n", - " \n", - " return {'techniques': techniques, 'score': score}\n", - " \n", - " def 
generate_innovation_report(self, analysis: Dict[str, Any]) -> str:\n", - " \"\"\"Generate human-readable innovation report\"\"\"\n", - " score = analysis['innovation_score']\n", - " techniques = analysis['detected_techniques']\n", - " \n", - " if score == 0:\n", - " return \"No innovative techniques detected. Consider exploring quantization, pruning, or custom optimizations!\"\n", - " \n", - " report = f\"Innovation Score: {score:.2f}/1.00\\n\"\n", - " report += f\"Detected Techniques ({len(techniques)}):\\n\"\n", - " \n", - " for technique in techniques:\n", - " report += f\" • {technique.replace('_', ' ').title()}\\n\"\n", - " \n", - " if analysis['creativity_bonus']:\n", - " report += \"🌟 Creativity Bonus: Multiple optimization techniques combined!\\n\"\n", - " \n", - " # Award levels\n", - " if score >= 0.8:\n", - " report += \"🏆 INNOVATION MASTER - Outstanding creativity!\"\n", - " elif score >= 0.6:\n", - " report += \"🚀 INNOVATION EXPERT - Excellent techniques!\"\n", - " elif score >= 0.4:\n", - " report += \"⭐ INNOVATION PRACTITIONER - Good optimization work!\"\n", - " else:\n", - " report += \"🔍 INNOVATION EXPLORER - Keep experimenting!\"\n", - " \n", - " return report\n", - "\n", - "# Enhanced competition class with innovation scoring\n", - "class TinyMLPerfCompetitionPlus(TinyMLPerfCompetition):\n", - " \"\"\"\n", - " Enhanced TinyMLPerf Competition with innovation detection and advanced scoring.\n", - " \n", - " Extends the base competition with:\n", - " - Innovation technique detection\n", - " - Advanced composite scoring\n", - " - Creativity rewards\n", - " - Multi-dimensional leaderboards\n", - " \"\"\"\n", - " \n", - " def __init__(self, results_dir: str = \"tinymlperf_results\"):\n", - " \"\"\"Initialize enhanced competition with innovation detection\"\"\"\n", - " super().__init__(results_dir)\n", - " self.innovation_detector = InnovationDetector()\n", - " print(\"🔬 Innovation detection enabled!\")\n", - " \n", - " def submit_entry(self, team_name: 
str, event_name: str, optimized_model,\n", - " optimization_description: str = \"\", github_url: str = \"\") -> Dict[str, Any]:\n", - " \"\"\"Submit entry with innovation analysis\"\"\"\n", - " \n", - " # Get base submission\n", - " submission = super().submit_entry(team_name, event_name, optimized_model, \n", - " optimization_description, github_url)\n", - " \n", - " # Add innovation analysis\n", - " innovation_analysis = self.innovation_detector.analyze_innovation(\n", - " optimized_model, optimization_description\n", - " )\n", - " \n", - " submission['innovation_analysis'] = innovation_analysis\n", - " \n", - " # Calculate composite score (speed + innovation)\n", - " speed_score = submission['speedup_score'] # Relative speedup\n", - " innovation_score = innovation_analysis['innovation_score']\n", - " \n", - " # Weighted composite: 70% speed, 30% innovation\n", - " composite_score = 0.7 * speed_score + 0.3 * innovation_score\n", - " submission['composite_score'] = composite_score\n", - " \n", - " # Display innovation results\n", - " print(f\"\\n🔬 Innovation Analysis:\")\n", - " innovation_report = self.innovation_detector.generate_innovation_report(innovation_analysis)\n", - " print(innovation_report)\n", - " print(f\"\\n🏆 Composite Score: {composite_score:.3f} (Speed: {speed_score:.2f}, Innovation: {innovation_score:.2f})\")\n", - " \n", - " # Re-save with innovation data\n", - " self._save_submission(submission)\n", - " \n", - " return submission\n", - " \n", - " def display_innovation_leaderboard(self, event_name: str, top_n: int = 10):\n", - " \"\"\"Display leaderboard ranked by innovation score\"\"\"\n", - " submissions = self._load_event_submissions(event_name)\n", - " \n", - " # Filter submissions with innovation data\n", - " innovation_submissions = [s for s in submissions if 'innovation_analysis' in s]\n", - " \n", - " if not innovation_submissions:\n", - " print(f\"🔬 Innovation Leaderboard - {event_name.replace('_', ' ').title()}\")\n", - " print(\"No 
innovation submissions yet!\")\n", - "            return\n", - "        \n", - "        # Sort by innovation score\n", - "        innovation_submissions.sort(key=lambda s: s['innovation_analysis']['innovation_score'], reverse=True)\n", - "        top_submissions = innovation_submissions[:top_n]\n", - "        \n", - "        print(f\"\\n🔬 INNOVATION LEADERBOARD - {event_name.replace('_', ' ').title()}\")\n", - "        print(\"=\" * 80)\n", - "        print(f\"{'Rank':<6} {'Team':<20} {'Innovation':<12} {'Techniques':<12} {'Description':<25}\")\n", - "        print(\"-\" * 80)\n", - "        \n", - "        for i, submission in enumerate(top_submissions):\n", - "            rank = i + 1\n", - "            team = submission['team_name'][:19]\n", - "            innovation = f\"{submission['innovation_analysis']['innovation_score']:.3f}\"\n", - "            num_tech = submission['innovation_analysis']['num_techniques']\n", - "            description = submission['optimization_description'][:24]\n", - "            \n", - "            # Width 12 matches the 'Techniques' header so columns stay aligned\n", - "            print(f\"{rank:<6} {team:<20} {innovation:<12} {num_tech:<12} {description:<25}\")\n", - "        \n", - "        print(\"-\" * 80)\n", - "        print(f\"Top {len(top_submissions)} most innovative submissions\")\n", - "    \n", - "    def display_composite_leaderboard(self, event_name: str, top_n: int = 10):\n", - "        \"\"\"Display leaderboard ranked by composite score (speed + innovation)\"\"\"\n", - "        submissions = self._load_event_submissions(event_name)\n", - "        \n", - "        # Filter submissions with composite scores\n", - "        composite_submissions = [s for s in submissions if 'composite_score' in s]\n", - "        \n", - "        if not composite_submissions:\n", - "            print(f\"🏆 Composite Leaderboard - {event_name.replace('_', ' ').title()}\")\n", - "            print(\"No composite submissions yet!\")\n", - "            return\n", - "        \n", - "        # Sort by composite score\n", - "        composite_submissions.sort(key=lambda s: s['composite_score'], reverse=True)\n", - "        top_submissions = composite_submissions[:top_n]\n", - "        \n", - "        print(f\"\\n🏆 COMPOSITE LEADERBOARD - {event_name.replace('_', ' ').title()}\")\n", - "        print(\"=\" * 90)\n", - "        print(f\"{'Rank':<6} {'Team':<18} {'Composite':<11} {'Speed':<9} {'Innovation':<11} {'Techniques'}\")\n", - "        print(\"-\" * 90)\n", - "        \n", - "        for i, submission in enumerate(top_submissions):\n", - "            rank = i + 1\n", - "            team = submission['team_name'][:17]\n", - "            composite = f\"{submission['composite_score']:.3f}\"\n", - "            speed = f\"{submission['speedup_score']:.2f}x\"\n", - "            innovation = f\"{submission['innovation_analysis']['innovation_score']:.3f}\"\n", - "            techniques = \", \".join(submission['innovation_analysis']['detected_techniques'][:3])[:20]\n", - "            \n", - "            print(f\"{rank:<6} {team:<18} {composite:<11} {speed:<9} {innovation:<11} {techniques}\")\n", - "        \n", - "        print(\"-\" * 90)\n", - "        print(f\"Top {len(top_submissions)} best overall submissions (70% speed + 30% innovation)\")\n", - "    \n", - "    def display_all_enhanced_leaderboards(self):\n", - "        \"\"\"Display all leaderboard types for all events\"\"\"\n", - "        events = ['mlp_sprint', 'cnn_marathon', 'transformer_decathlon']\n", - "        \n", - "        for event in events:\n", - "            print(f\"\\n{'='*60}\")\n", - "            print(f\"🏆 {event.replace('_', ' ').title()} - All Leaderboards\")\n", - "            print(f\"{'='*60}\")\n", - "            \n", - "            # Speed leaderboard\n", - "            self.display_leaderboard(event, top_n=5)\n", - "            print()\n", - "            \n", - "            # Innovation leaderboard\n", - "            self.display_innovation_leaderboard(event, top_n=5)\n", - "            print()\n", - "            \n", - "            # Composite leaderboard\n", - "            self.display_composite_leaderboard(event, top_n=5)\n", - "            print()" - ] - }, - { - "cell_type": "markdown", - "id": "b34233c4", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### Test Enhanced Competition with Innovation Detection\n", - "\n", - "Let's test the enhanced competition framework with innovation detection."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "49d82963", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "def test_enhanced_competition():\n", - " \"\"\"Test enhanced competition with innovation detection\"\"\"\n", - " print(\"Testing Enhanced TinyMLPerf Competition...\")\n", - " \n", - " # Initialize enhanced competition\n", - " competition = TinyMLPerfCompetitionPlus()\n", - " \n", - " # Create innovative models with optimization attributes\n", - " class QuantizedFastMLP:\n", - " \"\"\"Simulated quantized MLP\"\"\"\n", - " def __init__(self):\n", - " self.weights1 = np.random.randn(784, 64).astype(np.int8) # Quantized weights\n", - " self.bias1 = np.random.randn(64).astype(np.float32) * 0.1\n", - " self.weights2 = np.random.randn(64, 10).astype(np.int8)\n", - " self.bias2 = np.random.randn(10).astype(np.float32) * 0.1\n", - " self.quantized = True # Innovation marker\n", - " \n", - " def predict(self, x):\n", - " # Simulate quantized computation\n", - " h1 = np.maximum(0, x @ self.weights1.astype(np.float32) * 0.1 + self.bias1)\n", - " return h1 @ self.weights2.astype(np.float32) * 0.1 + self.bias2\n", - " \n", - " class PrunedCNN:\n", - " \"\"\"Simulated pruned CNN\"\"\"\n", - " def __init__(self):\n", - " self.fc_weights = np.random.randn(1600, 10).astype(np.float32) * 0.05\n", - " self.fc_bias = np.random.randn(10).astype(np.float32) * 0.05\n", - " self.pruned = True # Innovation marker\n", - " self.sparsity = 0.7 # 70% of weights pruned\n", - " \n", - " def predict(self, x):\n", - " batch_size = x.shape[0]\n", - " x_flat = x.reshape(batch_size, -1)\n", - " if x_flat.shape[1] != 1600:\n", - " x_flat = x_flat[:, :1600] if x_flat.shape[1] > 1600 else np.pad(x_flat, ((0, 0), (0, 1600 - x_flat.shape[1])), 'constant')\n", - " return x_flat @ self.fc_weights + self.fc_bias\n", - " \n", - " # Submit innovative entries\n", - " print(\"\\n🚀 Submitting Innovative Entries...\")\n", - " \n", - " # Quantized MLP 
submission\n", - " quantized_submission = competition.submit_entry(\n", - " team_name=\"Quantum Quantizers\",\n", - " event_name=\"mlp_sprint\",\n", - " optimized_model=QuantizedFastMLP(),\n", - " optimization_description=\"INT8 quantization with custom SIMD kernels for 3x speedup\",\n", - " github_url=\"https://github.com/quantum-quantizers/quantized-mlp\"\n", - " )\n", - " \n", - " # Pruned CNN submission\n", - " pruned_submission = competition.submit_entry(\n", - " team_name=\"Pruning Pioneers\", \n", - " event_name=\"cnn_marathon\",\n", - " optimized_model=PrunedCNN(),\n", - " optimization_description=\"Structured pruning + knowledge distillation + memory optimization\",\n", - " github_url=\"https://github.com/pruning-pioneers/pruned-cnn\"\n", - " )\n", - " \n", - " # Display enhanced leaderboards\n", - " print(\"\\n📊 Enhanced Competition Leaderboards:\")\n", - " competition.display_all_enhanced_leaderboards()\n", - " \n", - " print(\"\\n✅ Enhanced competition test complete!\")\n", - " return competition" - ] - }, - { - "cell_type": "markdown", - "id": "065ec776", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Comprehensive Testing\n", - "\n", - "Let's run a complete TinyMLPerf competition demonstration with all features." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "70ec3a07", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "def run_complete_tinymlperf_demo():\n", - " \"\"\"Run comprehensive TinyMLPerf competition demonstration\"\"\"\n", - " print(\"🏆 TINYMLPERF - THE ULTIMATE ML SYSTEMS COMPETITION\")\n", - " print(\"=\" * 80)\n", - " \n", - " print(\"\\n1. 🏗️ Setting up TinyMLPerf Benchmark Suite...\")\n", - " # Test benchmark suite\n", - " tinyperf = test_tinymlperf_benchmark_suite()\n", - " \n", - " print(\"\\n2. 
⚡ Testing Competition Profiling...\") \n", - " # Test profiling infrastructure\n", - " profiler, mlp_results, cnn_results = test_competition_profiler()\n", - " \n", - " print(\"\\n3. 🚀 Running Basic Competition...\")\n", - " # Test basic competition\n", - " basic_competition = test_tinymlperf_competition()\n", - " \n", - " print(\"\\n4. 🔬 Testing Enhanced Competition with Innovation...\")\n", - " # Test enhanced competition\n", - " enhanced_competition = test_enhanced_competition()\n", - " \n", - " print(\"\\n\" + \"=\" * 80)\n", - " print(\"🎉 TINYMLPERF DEMO COMPLETE!\")\n", - " print(\"=\" * 80)\n", - " \n", - " print(\"\\n🏆 TinyMLPerf Competition Ready:\")\n", - " print(\"✅ Three exciting events: MLP Sprint, CNN Marathon, Transformer Decathlon\") \n", - " print(\"✅ TinyTorch Module 15 profiler integration for rigorous benchmarking\")\n", - " print(\"✅ Hardware-independent relative scoring (speedup ratios)\")\n", - " print(\"✅ Transparent leaderboards with evidence requirements\")\n", - " print(\"✅ Innovation detection and creativity rewards\")\n", - " print(\"✅ Composite scoring balancing speed and innovation\")\n", - " \n", - " print(\"\\n🚀 Competition Features:\")\n", - " print(\"• Standardized benchmark models and datasets\")\n", - " print(\"• Statistical reliability with multiple timing runs\")\n", - " print(\"• Multiple leaderboard categories (speed, innovation, composite)\")\n", - " print(\"• GitHub integration for transparency and reproducibility\")\n", - " print(\"• Automatic technique detection and innovation scoring\")\n", - " \n", - " print(\"\\n🎯 Ready to Compete:\")\n", - " print(\"1. Optimize your models using techniques from Modules 16-19\")\n", - " print(\"2. Submit to TinyMLPerf events using competition.submit_entry()\")\n", - " print(\"3. See your results on leaderboards instantly\") \n", - " print(\"4. Iterate and improve based on performance feedback\")\n", - " print(\"5. 
Prove your ML systems optimization mastery!\")\n", - " \n", - " return {\n", - " 'benchmark_suite': tinyperf,\n", - " 'profiler': profiler,\n", - " 'basic_competition': basic_competition, \n", - " 'enhanced_competition': enhanced_competition\n", - " }" - ] - }, - { - "cell_type": "markdown", - "id": "1145585e", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## Systems Analysis Summary\n", - "\n", - "This TinyMLPerf competition module demonstrates advanced ML systems engineering through competitive benchmarking:\n", - "\n", - "### 🏗️ **Competition Infrastructure Excellence**\n", - "- **Standardized Benchmarking**: Fair competition through consistent profiling protocols using Module 15's profiler\n", - "- **Statistical Rigor**: Multiple timing runs with warmup periods ensure reliable performance measurements\n", - "- **Hardware Independence**: Relative speedup scoring allows fair competition across different hardware platforms\n", - "- **Transparency Requirements**: GitHub integration and evidence tracking prevent gaming and ensure reproducibility\n", - "\n", - "### ⚡ **Multi-Dimensional Performance Optimization**\n", - "- **Speed Optimization**: Direct latency measurement rewarding inference performance improvements\n", - "- **Innovation Detection**: Automated recognition of advanced techniques like quantization, pruning, distillation\n", - "- **Composite Scoring**: Balanced evaluation combining speed improvements with optimization creativity\n", - "- **Multiple Event Categories**: MLP Sprint, CNN Marathon, Transformer Decathlon test different optimization domains\n", - "\n", - "### 📊 **Systematic Competition Analysis**\n", - "- **TinyTorch Profiler Integration**: Leverages Module 15's profiling infrastructure for consistent measurement\n", - "- **Memory Tracking**: Comprehensive resource usage analysis beyond just timing measurements\n", - "- **Progress Tracking**: Team improvement analysis across multiple submissions and iterations\n", - "- 
**Leaderboard Visualization**: Multiple ranking systems (speed, innovation, composite) prevent tunnel vision\n", - "\n", - "### 💡 **Production ML Systems Insights**\n", - "- **Benchmarking Best Practices**: Industry-standard profiling methodology with warmup and statistical analysis\n", - "- **Optimization Technique Recognition**: Systematic detection of real-world optimization approaches\n", - "- **Performance Claims Validation**: Evidence-based performance reporting with reproducible results\n", - "- **Resource Constraint Awareness**: Multi-metric evaluation reflecting production deployment considerations\n", - "\n", - "### 🎯 **Key Educational Insights**\n", - "- Competition accelerates optimization learning by making improvements concrete and measurable\n", - "- Hardware-independent scoring ensures fair comparison while teaching relative performance analysis\n", - "- Innovation detection rewards creativity and encourages exposure to diverse optimization techniques\n", - "- Multiple leaderboards prevent single-metric optimization and encourage balanced system thinking\n", - "- Evidence requirements teach reproducibility and honest performance reporting practices\n", - "\n", - "### 🏆 **The Ultimate Learning Achievement**\n", - "This competition framework proves students can systematically optimize ML systems under real production constraints. By combining techniques from Modules 16-19 (quantization, pruning, acceleration, memory optimization), students demonstrate mastery of the complete ML systems optimization stack through measurable competitive performance.\n", - "\n", - "The TinyMLPerf competition transforms optimization from abstract concepts into concrete, competitive achievements that mirror real-world ML systems engineering challenges."
- ] - }, - { - "cell_type": "markdown", - "id": "5e34927e", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## Main Execution Block\n", - "\n", - "Run the complete TinyMLPerf competition system when this module is executed directly." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f7dfaddb", - "metadata": {}, - "outputs": [], - "source": [ - "if __name__ == \"__main__\":\n", - " print(\"Module 20: TinyMLPerf - The Ultimate ML Systems Competition\")\n", - " print(\"=\" * 80)\n", - " \n", - " # Run complete TinyMLPerf demonstration\n", - " results = run_complete_tinymlperf_demo()\n", - " \n", - " print(f\"\\n🎉 Module 20 complete!\")\n", - " print(f\"🏆 TinyMLPerf competition infrastructure ready!\")\n", - " print(f\"🚀 Time to optimize your models and climb the leaderboards!\")" - ] - }, - { - "cell_type": "markdown", - "id": "8f95ba18", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🤔 ML Systems Thinking: Interactive Questions\n", - "\n", - "1. **Why use hardware-independent relative scoring in ML competitions?** Your TinyMLPerf uses speedup ratios rather than absolute timing. Explain why this enables fair competition across different hardware platforms and how this mirrors real production environments where optimization techniques must be portable across diverse deployment targets.\n", - "\n", - "2. **How does competitive benchmarking accelerate optimization learning compared to individual assignments?** You've built leaderboards, innovation detection, and multi-dimensional scoring. Analyze why competition pressure drives deeper exploration of optimization techniques and how this mirrors real industry environments where performance benchmarks determine system adoption.\n", - "\n", - "3. **What makes innovation detection crucial for preventing optimization tunnel vision?** Your system detects quantization, pruning, distillation, and custom kernels automatically. 
Explain why rewarding diverse techniques prevents students from over-optimizing single metrics and how this teaches balanced systems thinking rather than algorithmic tunnel vision.\n", - "\n", - "4. **How does evidence-based competition ensure educational integrity and real-world relevance?** Your framework requires GitHub links, generates checksums, and validates reproducibility. Analyze why these requirements prevent academic dishonesty while teaching students the performance reporting standards expected in production ML systems development." - ] - }, - { - "cell_type": "markdown", - "id": "708f21f3", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🎯 MODULE SUMMARY: TinyMLPerf - The Ultimate ML Systems Competition\n", - "\n", - "This capstone module creates the ultimate ML systems competition, proving optimization mastery through measurable performance improvements in three exciting events.\n", - "\n", - "### 🛤️ **The TinyMLPerf Journey**\n", - "- **Modules 1-19**: You built comprehensive optimization techniques across the entire ML systems stack\n", - "- **Module 20**: You compete to prove mastery through concrete, measurable performance improvements\n", - "- **Ultimate Goal**: Demonstrate professional-level ML systems optimization through competitive achievement\n", - "\n", - "### 🛠️ **What We Built**\n", - "- **TinyMLPerf Benchmark Suite**: Three standardized competition events - MLP Sprint, CNN Marathon, Transformer Decathlon\n", - "- **Competition Profiler**: Integration with Module 15's profiler for rigorous, statistical performance measurement\n", - "- **Multi-Dimensional Leaderboards**: Speed, innovation, and composite scoring systems preventing tunnel vision\n", - "- **Innovation Detection**: Automatic recognition and scoring of advanced optimization techniques\n", - "\n", - "### 🧠 **Key Learning Outcomes**\n", - "- **Competitive Optimization**: Apply learned techniques competitively with measurable, hardware-independent results\n", - 
"- **Systematic Benchmarking**: Use statistical profiling methodology for reliable performance measurement\n", - "- **Innovation Recognition**: Understand and apply diverse optimization approaches beyond simple speed improvements\n", - "- **Evidence-Based Performance**: Support optimization claims with reproducible benchmarking and transparent evidence\n", - "\n", - "### ⚡ **Competition Events Mastered**\n", - "- **MLP Sprint**: Fastest feedforward neural network inference optimization\n", - "- **CNN Marathon**: Most efficient convolutional neural network processing\n", - "- **Transformer Decathlon**: Ultimate attention mechanism and sequence processing optimization\n", - "\n", - "### 🏆 **Technical Skills Developed**\n", - "- Design and implement standardized benchmarking infrastructure for fair ML competition\n", - "- Integrate profiling tools for statistical performance measurement and analysis\n", - "- Build multi-dimensional leaderboard systems balancing multiple optimization objectives\n", - "- Detect and score innovation techniques automatically to reward optimization creativity\n", - "\n", - "### 📊 **Systems Engineering Insights Gained**\n", - "- **Competition accelerates learning**: Measurable challenges drive deeper optimization exploration than individual assignments\n", - "- **Hardware-independent scoring**: Relative performance metrics enable fair comparison across diverse deployment environments \n", - "- **Innovation detection prevents tunnel vision**: Multi-dimensional scoring teaches balanced systems optimization\n", - "- **Evidence requirements ensure integrity**: Reproducible results and transparency are essential for professional optimization claims\n", - "\n", - "### 💡 **The Capstone Achievement**\n", - "You've completed the ultimate ML systems optimization journey! 
Through competitive pressure in TinyMLPerf, you've applied quantization, pruning, distillation, acceleration, memory optimization, and innovation techniques to achieve measurable performance improvements. This competition framework proves you can optimize ML systems like a professional engineer, balancing speed, memory, innovation, and deployment constraints to build production-ready systems.\n", - "\n", - "### 🎉 **Competition Glory Awaits**\n", - "Ready to prove your optimization mastery? Load your optimized models into TinyMLPerf, submit to the three events, and climb the leaderboards! Your journey from basic tensors to competition-winning ML systems optimization is complete - now show the world what you can build!" - ] - } - ], - "metadata": { - "jupytext": { - "cell_metadata_filter": "-all", - "main_language": "python", - "notebook_metadata_filter": "-all" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/modules_old/19_benchmarking/benchmarking_dev.py b/modules_old/19_benchmarking/benchmarking_dev.py deleted file mode 100644 index fabeac77..00000000 --- a/modules_old/19_benchmarking/benchmarking_dev.py +++ /dev/null @@ -1,1699 +0,0 @@ -# %% [markdown] -""" -# Module 20: TinyMLPerf - The Ultimate ML Systems Competition - -## Learning Objectives -By the end of this module, you will be able to: - -1. **Build Competition Benchmarking Infrastructure**: Create standardized TinyMLPerf benchmark suite for fair competition -2. **Use Profiling Tools for Systematic Measurement**: Apply Module 15's profiler to measure real performance gains -3. **Compete Across Multiple Categories**: Optimize for speed, memory, model size, and innovation simultaneously -4. **Calculate Relative Performance Improvements**: Show speedup ratios independent of hardware differences -5. 
**Drive Innovation Through Competition**: Use competitive pressure to discover new optimization techniques - -## The TinyMLPerf Vision - -**Key Message**: Competition proves optimization mastery by measuring concrete performance improvements across all your TinyTorch implementations! - -**The TinyMLPerf Journey:** -1. **Benchmark Suite**: Load standard models (MLP, CNN, Transformer) as competition workloads -2. **Profiling Integration**: Use your Module 15 profiler for rigorous performance measurement -3. **Competition Categories**: Three exciting events - MLP Sprint, CNN Marathon, Transformer Decathlon -4. **Relative Scoring**: Hardware-independent speedup measurements (3x faster = 3.0 score) -5. **Leaderboard Glory**: Track innovations and celebrate optimization achievements -""" - -# %% -#| default_exp utils.benchmark - -import time -import json -import hashlib -import tracemalloc -from datetime import datetime -from pathlib import Path -from typing import Dict, Any, List, Optional, Tuple, Union, Callable -import numpy as np -import pickle - -# Performance measurement constants -WEIGHT_INIT_SCALE = 0.1 # Xavier-style initialization scale for stable training -NUMERICAL_EPSILON = 1e-8 # Prevent division by zero in softmax calculations -DEFAULT_WARMUP_RUNS = 3 # Number of warmup runs to stabilize CPU caches -DEFAULT_TIMING_RUNS = 5 # Minimum runs for statistical reliability -DEFAULT_PROFILER_TIMING_RUNS = 10 # More thorough profiling for detailed analysis - -# Model architecture constants (for standardized benchmarks) -MLP_INPUT_SIZE = 784 # Flattened 28x28 MNIST-like images -MLP_HIDDEN1_SIZE = 128 # First hidden layer size -MLP_HIDDEN2_SIZE = 64 # Second hidden layer size -MLP_OUTPUT_SIZE = 10 # Classification output classes - -CNN_CONV1_FILTERS = 32 # First convolution layer filters -CNN_CONV2_FILTERS = 64 # Second convolution layer filters -CNN_KERNEL_SIZE = 3 # Convolution kernel size (3x3) -CNN_FC_INPUT_SIZE = 1600 # Flattened conv output size - 
-TRANSFORMER_D_MODEL = 128 # Model embedding dimension -TRANSFORMER_N_HEADS = 8 # Number of attention heads -TRANSFORMER_SEQ_LEN = 64 # Maximum sequence length -TRANSFORMER_FF_RATIO = 4 # Feed-forward expansion ratio - -# Competition scoring constants -SPEED_WEIGHT = 0.7 # Weight for speed in composite scoring -INNOVATION_WEIGHT = 0.3 # Weight for innovation in composite scoring -CREATIVITY_BONUS_THRESHOLD = 3 # Minimum techniques for creativity bonus -MAX_INNOVATION_SCORE = 1.0 # Maximum possible innovation score - -# Leaderboard formatting templates -LEADERBOARD_HEADER = "{rank:<6} {team:<20} {speedup:<10} {time_ms:<12} {techniques:<25}" -INNOVATION_HEADER = "{rank:<6} {team:<20} {innovation:<12} {techniques:<8} {description:<25}" -COMPOSITE_HEADER = "{rank:<6} {team:<18} {composite:<11} {speed:<9} {innovation:<11} {techniques}" - -# Simplified innovation pattern keywords (easier for students to understand) -OPTIMIZATION_KEYWORDS = { - 'quantization': ['quantized', 'int8'], # Reduced precision computation - 'pruning': ['pruned', 'sparse'], # Removing unnecessary weights - 'distillation': ['distilled', 'teacher'], # Knowledge transfer - 'custom_kernels': ['custom_kernel', 'cuda', 'vectorized'], # Hardware optimization - 'memory_optimization': ['memory_pool', 'in_place'], # Memory efficiency - 'compression': ['compressed', 'weight_sharing'] # Model compression -} - -# Import TinyTorch profiler from Module 15 -def _check_profiler_availability(): - """Check if TinyTorch profiler is available and explain implications.""" - try: - from tinytorch.utils.profiler import SimpleProfiler, profile_function - print("✅ TinyTorch profiler loaded - using advanced timing") - return True, SimpleProfiler, profile_function - except ImportError: - print("⚠️ TinyTorch profiler not available") - print(" Make sure Module 15 (Profiling) is completed first") - print(" Using basic timing as fallback") - return False, None, None - -HAS_PROFILER, SimpleProfiler, profile_function = 
_check_profiler_availability() - -# %% [markdown] -""" -## Part 1: Understanding Benchmarking Fundamentals - -Before diving into the full competition, let's understand the core concepts step by step. -""" - -# %% -def simple_timing_demo(): - """TARGET Learning Checkpoint 1: Basic Performance Measurement - - Understand why we need systematic timing for fair comparison. - """ - print("MAGNIFY Learning Checkpoint 1: Basic Performance Measurement") - print("=" * 60) - - # Simple function to time - def slow_matrix_multiply(a, b): - """Naive matrix multiplication - intentionally slow""" - result = np.zeros((a.shape[0], b.shape[1])) - for i in range(a.shape[0]): - for j in range(b.shape[1]): - for k in range(a.shape[1]): - result[i, j] += a[i, k] * b[k, j] - return result - - def fast_matrix_multiply(a, b): - """Optimized matrix multiplication using NumPy""" - return np.dot(a, b) - - # Create test matrices - test_size = 50 - matrix_a = np.random.randn(test_size, test_size).astype(np.float32) - matrix_b = np.random.randn(test_size, test_size).astype(np.float32) - - print(f"📊 Timing matrix multiplication ({test_size}x{test_size})...") - - # Time the slow version - start = time.perf_counter() - slow_result = slow_matrix_multiply(matrix_a, matrix_b) - slow_time = time.perf_counter() - start - - # Time the fast version - start = time.perf_counter() - fast_result = fast_matrix_multiply(matrix_a, matrix_b) - fast_time = time.perf_counter() - start - - # Calculate speedup - speedup = slow_time / fast_time - - print(f" Slow version: {slow_time*1000:.2f} ms") - print(f" Fast version: {fast_time*1000:.2f} ms") - print(f" ROCKET Speedup: {speedup:.2f}x faster") - - print(f"\nTIP Key Insight: Optimization can provide dramatic speedups!") - print(f" This is why we need systematic benchmarking to measure improvements.") - - return {'slow_time': slow_time, 'fast_time': fast_time, 'speedup': speedup} - -def statistical_timing_demo(): - """TARGET Learning Checkpoint 2: Why We Need Multiple 
Runs - - Understand timing variability and the need for statistical reliability. - """ - print("\nMAGNIFY Learning Checkpoint 2: Statistical Timing Reliability") - print("=" * 60) - - # Simple operation to time - def simple_operation(x): - return np.sum(x ** 2) - - test_data = np.random.randn(10000).astype(np.float32) - - print(f"📊 Measuring timing variability with {DEFAULT_TIMING_RUNS} runs...") - - # Single timing run - start = time.perf_counter() - _ = simple_operation(test_data) - single_time = time.perf_counter() - start - - # Multiple timing runs - times = [] - for run in range(DEFAULT_TIMING_RUNS): - start = time.perf_counter() - _ = simple_operation(test_data) - end = time.perf_counter() - times.append(end - start) - - mean_time = np.mean(times) - std_time = np.std(times) - min_time = np.min(times) - max_time = np.max(times) - - print(f" Single run: {single_time*1000:.2f} ms") - print(f" Mean time: {mean_time*1000:.2f} ± {std_time*1000:.2f} ms") - print(f" Range: {min_time*1000:.2f} - {max_time*1000:.2f} ms") - - variability = (std_time / mean_time) * 100 - print(f" PROGRESS Variability: {variability:.1f}% coefficient of variation") - - print(f"\nTIP Key Insight: Single measurements are unreliable!") - print(f" We need {DEFAULT_TIMING_RUNS}+ runs with warmup for statistical reliability.") - - return {'times': times, 'mean': mean_time, 'std': std_time} - -def benchmark_model_demo(): - """TARGET Learning Checkpoint 3: Model Benchmarking Basics - - Understand how to benchmark ML models specifically. 
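The warmup-then-measure pattern used across these checkpoints can be captured in one reusable helper. A minimal sketch (the helper name and defaults here are illustrative, not part of the module):

```python
import time

import numpy as np

def time_inference(predict, inputs, warmup_runs=3, timing_runs=5):
    # Warmup runs stabilize caches and lazy allocations before measuring
    for _ in range(warmup_runs):
        predict(inputs)
    # Repeated timed runs give a mean and spread instead of one noisy sample
    times = []
    for _ in range(timing_runs):
        start = time.perf_counter()
        predict(inputs)
        times.append(time.perf_counter() - start)
    return float(np.mean(times)), float(np.std(times))

weights = np.random.randn(64, 64).astype(np.float32)
batch = np.random.randn(100, 64).astype(np.float32)
mean_s, std_s = time_inference(lambda x: x @ weights, batch)
```

The same warmup-plus-repeat structure reappears later in the competition profiler's basic-timing fallback.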
- """ - print("\nMAGNIFY Learning Checkpoint 3: ML Model Benchmarking") - print("=" * 60) - - # Simple model for demonstration - class SimpleModel: - def __init__(self, size): - self.weights = np.random.randn(size, size).astype(np.float32) * 0.1 - - def predict(self, x): - return x @ self.weights - - # Create models of different sizes - small_model = SimpleModel(64) - large_model = SimpleModel(256) - - # Test data - batch_size = 100 - small_data = np.random.randn(batch_size, 64).astype(np.float32) - large_data = np.random.randn(batch_size, 256).astype(np.float32) - - print(f"📊 Comparing model sizes...") - - # Benchmark small model - times = [] - for _ in range(DEFAULT_TIMING_RUNS): - start = time.perf_counter() - _ = small_model.predict(small_data) - times.append(time.perf_counter() - start) - small_time = np.mean(times) - - # Benchmark large model - times = [] - for _ in range(DEFAULT_TIMING_RUNS): - start = time.perf_counter() - _ = large_model.predict(large_data) - times.append(time.perf_counter() - start) - large_time = np.mean(times) - - print(f" Small model (64): {small_time*1000:.2f} ms") - print(f" Large model (256): {large_time*1000:.2f} ms") - print(f" 🔢 Size ratio: {256/64:.0f}x parameters") - print(f" ⏱️ Time ratio: {large_time/small_time:.1f}x slower") - - print(f"\nTIP Key Insight: Model complexity directly affects inference time!") - print(f" This is why standardized models are crucial for fair competition.") - - return {'small_time': small_time, 'large_time': large_time} - -# %% -def run_learning_checkpoints(): - """Run all learning checkpoints to build understanding progressively""" - print("🎓 TinyMLPerf Learning Journey") - print("=" * 80) - print("Building understanding step by step...\n") - - # Checkpoint 1: Basic timing - timing_results = simple_timing_demo() - - # Checkpoint 2: Statistical reliability - stats_results = statistical_timing_demo() - - # Checkpoint 3: Model benchmarking - model_results = benchmark_model_demo() - - print("\n" + "=" 
* 80) - print("CELEBRATE Learning checkpoints complete! Ready for TinyMLPerf competition.") - print("=" * 80) - - return { - 'timing': timing_results, - 'statistics': stats_results, - 'models': model_results - } - -# %% [markdown] -""" -### Test Learning Checkpoints - -Let's run the learning checkpoints to build understanding progressively. -""" - -# %% -def test_learning_checkpoints(): - """Test the learning checkpoint system""" - print("Testing learning checkpoints...") - results = run_learning_checkpoints() - print("\nPASS Learning checkpoints test complete!") - return results - -# %% [markdown] -""" -## Part 2: TinyMLPerf Benchmark Suite - Standard Competition Models - -Now that we understand the fundamentals, let's build the TinyMLPerf benchmark suite with three exciting competition events using standard models. -""" - -# Standard benchmark models for TinyMLPerf competition events -class MLPBenchmark: - """Standard MLP model for TinyMLPerf sprint event. - - Simple 3-layer feedforward network optimized for speed competitions. - Students will optimize this architecture for fastest inference. 
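The architecture below (784 -> 128 -> 64 -> 10, weights plus biases) implies a fixed parameter budget that competitors optimize against. A quick sanity check of that count:

```python
# Sizes match the MLP_* constants defined at the top of the module
input_size, hidden1, hidden2, output = 784, 128, 64, 10
params = (
    (input_size * hidden1 + hidden1)   # layer 1 weights + bias: 100480
    + (hidden1 * hidden2 + hidden2)    # layer 2 weights + bias: 8256
    + (hidden2 * output + output)      # layer 3 weights + bias: 650
)
# total: 109386 parameters
```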
- """ - - def __init__(self): - """Initialize MLP with standard architecture using named constants.""" - # Layer 1: Input -> Hidden1 (flattened MNIST-like input) - self.layer1_weights = np.random.randn(MLP_INPUT_SIZE, MLP_HIDDEN1_SIZE).astype(np.float32) * WEIGHT_INIT_SCALE - self.layer1_bias = np.random.randn(MLP_HIDDEN1_SIZE).astype(np.float32) * WEIGHT_INIT_SCALE - - # Layer 2: Hidden1 -> Hidden2 - self.layer2_weights = np.random.randn(MLP_HIDDEN1_SIZE, MLP_HIDDEN2_SIZE).astype(np.float32) * WEIGHT_INIT_SCALE - self.layer2_bias = np.random.randn(MLP_HIDDEN2_SIZE).astype(np.float32) * WEIGHT_INIT_SCALE - - # Layer 3: Hidden2 -> Output (classification) - self.layer3_weights = np.random.randn(MLP_HIDDEN2_SIZE, MLP_OUTPUT_SIZE).astype(np.float32) * WEIGHT_INIT_SCALE - self.layer3_bias = np.random.randn(MLP_OUTPUT_SIZE).astype(np.float32) * WEIGHT_INIT_SCALE - - def forward(self, x): - """Forward pass through 3-layer MLP with ReLU activations.""" - # Layer 1: Input -> Hidden1 with ReLU - hidden1 = np.maximum(0, x @ self.layer1_weights + self.layer1_bias) - - # Layer 2: Hidden1 -> Hidden2 with ReLU - hidden2 = np.maximum(0, hidden1 @ self.layer2_weights + self.layer2_bias) - - # Layer 3: Hidden2 -> Output (no activation) - output = hidden2 @ self.layer3_weights + self.layer3_bias - return output - - def predict(self, x): - """Prediction interface for benchmarking.""" - return self.forward(x) - - -class CNNBenchmark: - """Standard CNN model for TinyMLPerf marathon event. - - Simplified convolutional network for image processing competitions. - Students will optimize convolution operations and memory access patterns. 
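The class below simulates convolution by flattening the input; for reference, a naive valid-mode 2D convolution over a single channel (an illustrative helper, not used by the benchmark) looks like this:

```python
import numpy as np

def conv2d_valid(image, kernel):
    # Slide the kernel over every valid position, summing elementwise products
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1), dtype=image.dtype)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

img = np.arange(16, dtype=np.float32).reshape(4, 4)
diff_kernel = np.array([[1.0, -1.0]], dtype=np.float32)  # horizontal difference
result = conv2d_valid(img, diff_kernel)  # shape (4, 3); every entry is -1.0 here
```

The triple loop is exactly the kind of hot spot students are expected to vectorize or replace with im2col-style tricks during the CNN Marathon.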
- """ - - def __init__(self): - """Initialize CNN with simplified architecture using named constants.""" - # Simplified CNN weights (real CNN would need proper conv operations) - self.conv1_filters = np.random.randn(CNN_KERNEL_SIZE, CNN_KERNEL_SIZE, 1, CNN_CONV1_FILTERS).astype(np.float32) * WEIGHT_INIT_SCALE - self.conv2_filters = np.random.randn(CNN_KERNEL_SIZE, CNN_KERNEL_SIZE, CNN_CONV1_FILTERS, CNN_CONV2_FILTERS).astype(np.float32) * WEIGHT_INIT_SCALE - - # Fully connected layer after convolution + pooling - self.fc_weights = np.random.randn(CNN_FC_INPUT_SIZE, MLP_OUTPUT_SIZE).astype(np.float32) * WEIGHT_INIT_SCALE - self.fc_bias = np.random.randn(MLP_OUTPUT_SIZE).astype(np.float32) * WEIGHT_INIT_SCALE - - def forward(self, x): - """Forward pass through simplified CNN. - - Note: This is a simplified version. Students will implement - real convolution operations for optimization. - """ - batch_size = x.shape[0] - - # Simulate conv + pooling by flattening and projecting - x_flattened = x.reshape(batch_size, -1) - - # Ensure correct input size (pad or truncate as needed) - if x_flattened.shape[1] != CNN_FC_INPUT_SIZE: - if x_flattened.shape[1] > CNN_FC_INPUT_SIZE: - x_flattened = x_flattened[:, :CNN_FC_INPUT_SIZE] - else: - padding = ((0, 0), (0, CNN_FC_INPUT_SIZE - x_flattened.shape[1])) - x_flattened = np.pad(x_flattened, padding, 'constant') - - # Final classification layer - output = x_flattened @ self.fc_weights + self.fc_bias - return output - - def predict(self, x): - """Prediction interface for benchmarking.""" - return self.forward(x) - - -class TransformerBenchmark: - """Standard Transformer model for TinyMLPerf decathlon event. - - Simplified attention-based model for sequence processing competitions. - Students will optimize attention mechanisms and memory usage. 
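The forward pass below follows the standard scaled dot-product attention form, softmax(Q K^T / sqrt(d)) V. A self-contained sketch of just that core (names are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    d = q.shape[-1]
    # Similarity of every query to every key, scaled to keep softmax well-behaved
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

x = np.random.randn(2, 4, 8).astype(np.float32)  # [batch, seq, d_model]
out, attn = scaled_dot_product_attention(x, x, x)  # self-attention: q = k = v = x
```

Each row of `attn` is a probability distribution over the sequence, which is what makes the output a weighted mixture of the value vectors.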
- """ - - def __init__(self, d_model=TRANSFORMER_D_MODEL, n_heads=TRANSFORMER_N_HEADS, seq_len=TRANSFORMER_SEQ_LEN): - """Initialize Transformer with standard attention architecture using named constants. - - Args: - d_model: Model dimension (embedding size) - default from TRANSFORMER_D_MODEL - n_heads: Number of attention heads - default from TRANSFORMER_N_HEADS - seq_len: Maximum sequence length - default from TRANSFORMER_SEQ_LEN - """ - self.d_model = d_model - self.n_heads = n_heads - self.seq_len = seq_len - self.head_dim = d_model // n_heads - - # Multi-head attention weights (clearer naming) - self.query_weights = np.random.randn(d_model, d_model).astype(np.float32) * WEIGHT_INIT_SCALE - self.key_weights = np.random.randn(d_model, d_model).astype(np.float32) * WEIGHT_INIT_SCALE - self.value_weights = np.random.randn(d_model, d_model).astype(np.float32) * WEIGHT_INIT_SCALE - self.output_weights = np.random.randn(d_model, d_model).astype(np.float32) * WEIGHT_INIT_SCALE - - # Feed forward network weights (using standard 4x expansion ratio) - ff_dim = d_model * TRANSFORMER_FF_RATIO - self.feedforward_layer1 = np.random.randn(d_model, ff_dim).astype(np.float32) * WEIGHT_INIT_SCALE - self.feedforward_layer2 = np.random.randn(ff_dim, d_model).astype(np.float32) * WEIGHT_INIT_SCALE - - def forward(self, x): - """Forward pass through simplified transformer block. - - Note: This is a simplified version. Students will implement - real multi-head attention for optimization. 
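When upgrading this to real multi-head attention, the first step is splitting d_model into per-head slices. A sketch using the module's default sizes (d_model=128, n_heads=8, seq_len=64):

```python
import numpy as np

def split_heads(x, n_heads):
    # [batch, seq, d_model] -> [batch, n_heads, seq, head_dim]
    b, s, d = x.shape
    head_dim = d // n_heads
    return x.reshape(b, s, n_heads, head_dim).transpose(0, 2, 1, 3)

x = np.random.randn(2, 64, 128).astype(np.float32)
heads = split_heads(x, 8)  # shape (2, 8, 64, 16)
```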
- """ - batch_size, seq_len, d_model = x.shape - - # Self-attention computation (simplified single-head) - queries = x @ self.query_weights # [batch, seq, d_model] - keys = x @ self.key_weights - values = x @ self.value_weights - - # Attention scores with proper scaling - attention_scores = queries @ keys.transpose(0, 2, 1) / np.sqrt(d_model) - - # Softmax with numerical stability - exp_scores = np.exp(attention_scores - np.max(attention_scores, axis=-1, keepdims=True)) - attention_weights = exp_scores / (np.sum(exp_scores, axis=-1, keepdims=True) + NUMERICAL_EPSILON) - - # Apply attention to values - attention_output = attention_weights @ values # [batch, seq, d_model] - - # Residual connection + layer norm (simplified) - attention_output = attention_output + x - - # Feed forward network - ff_intermediate = np.maximum(0, attention_output @ self.feedforward_layer1) # ReLU - ff_output = ff_intermediate @ self.feedforward_layer2 - - # Another residual connection - final_output = ff_output + attention_output - - # Global average pooling for classification - return np.mean(final_output, axis=1) # [batch, d_model] - - def predict(self, x): - """Prediction interface for benchmarking.""" - return self.forward(x) - -# %% -class TinyMLPerf: - """ - TinyMLPerf benchmark suite - The Olympics of ML Systems Optimization! - - Provides three standard competition events: - - MLP Sprint: Fastest feedforward inference - - CNN Marathon: Efficient convolution operations - - Transformer Decathlon: Complete attention-based model performance - - Each event uses standardized models and datasets for fair competition. - """ - - def __init__(self, profiler_warmup_runs: int = DEFAULT_WARMUP_RUNS, - profiler_timing_runs: int = DEFAULT_PROFILER_TIMING_RUNS): - """ - Initialize TinyMLPerf benchmark suite. 
- - Args: - profiler_warmup_runs: Number of warmup runs for stable measurements - profiler_timing_runs: Number of timing runs for statistical reliability - """ - self.warmup_runs = profiler_warmup_runs - self.timing_runs = profiler_timing_runs - self.benchmark_models = {} - self.benchmark_datasets = {} - - print("🏆 TinyMLPerf Competition Suite Initialized!") - print("TARGET Three Events: MLP Sprint, CNN Marathon, Transformer Decathlon") - - # Load standard benchmark models - self._load_benchmark_models() - self._load_benchmark_datasets() - - def _load_benchmark_models(self): - """Load standard benchmark models for each competition event""" - print("📥 Loading TinyMLPerf Benchmark Models...") - - # Create instances of the standardized benchmark models - self.benchmark_models = { - 'mlp_sprint': MLPBenchmark(), - 'cnn_marathon': CNNBenchmark(), - 'transformer_decathlon': TransformerBenchmark() - } - - print("PASS Benchmark models loaded successfully!") - for event, model in self.benchmark_models.items(): - print(f" 📋 {event.replace('_', ' ').title()}: {type(model).__name__}") - - def _load_benchmark_datasets(self): - """Load standard benchmark datasets for each competition event""" - print("📊 Loading TinyMLPerf Benchmark Datasets...") - - # MLP Sprint dataset - MNIST-like flattened images - mlp_batch_size = 100 - mlp_data = { - 'inputs': np.random.randn(mlp_batch_size, MLP_INPUT_SIZE).astype(np.float32), # Batch of samples - 'targets': np.eye(MLP_OUTPUT_SIZE)[np.random.randint(0, MLP_OUTPUT_SIZE, mlp_batch_size)], # One-hot labels - 'event': 'MLP Sprint', - 'description': 'Feedforward inference on flattened 28x28 images' - } - - # CNN Marathon dataset - Image-like data - cnn_batch_size = 50 - cnn_image_size = 28 # 28x28 standard image size - cnn_data = { - 'inputs': np.random.randn(cnn_batch_size, cnn_image_size, cnn_image_size, 1).astype(np.float32), # Batch of images - 'targets': np.eye(MLP_OUTPUT_SIZE)[np.random.randint(0, MLP_OUTPUT_SIZE, cnn_batch_size)], - 
'event': 'CNN Marathon', - 'description': 'Convolutional inference on 28x28x1 images' - } - - # Transformer Decathlon dataset - Sequence data - transformer_batch_size = 32 - transformer_data = { - 'inputs': np.random.randn(transformer_batch_size, TRANSFORMER_SEQ_LEN, TRANSFORMER_D_MODEL).astype(np.float32), # Batch of sequences - 'targets': np.eye(MLP_OUTPUT_SIZE)[np.random.randint(0, MLP_OUTPUT_SIZE, transformer_batch_size)], - 'event': 'Transformer Decathlon', - 'description': 'Self-attention inference on 64-token sequences' - } - - self.benchmark_datasets = { - 'mlp_sprint': mlp_data, - 'cnn_marathon': cnn_data, - 'transformer_decathlon': transformer_data - } - - print("PASS Benchmark datasets loaded successfully!") - for event, data in self.benchmark_datasets.items(): - print(f" TARGET {data['event']}: {data['inputs'].shape} -> {data['targets'].shape}") - - def load_benchmark(self, event_name: str) -> Tuple[Any, Dict[str, Any]]: - """ - Load a specific benchmark model and dataset. - - Args: - event_name: Name of competition event ('mlp_sprint', 'cnn_marathon', 'transformer_decathlon') - - Returns: - Tuple of (model, dataset) for the specified event - """ - if event_name not in self.benchmark_models: - available = list(self.benchmark_models.keys()) - raise ValueError(f"Event '{event_name}' not found. 
Available: {available}") - - model = self.benchmark_models[event_name] - dataset = self.benchmark_datasets[event_name] - - print(f"📋 Loaded benchmark: {dataset['event']}") - print(f" Model: {type(model).__name__}") - print(f" Data: {dataset['description']}") - - return model, dataset - - def get_available_events(self) -> Dict[str, str]: - """Get list of available competition events with descriptions""" - return { - 'mlp_sprint': 'Fastest feedforward neural network inference', - 'cnn_marathon': 'Efficient convolutional neural network processing', - 'transformer_decathlon': 'Complete attention mechanism optimization' - } - -# %% [markdown] -""" -### Test TinyMLPerf Benchmark Suite - -Let's test the benchmark suite to ensure all models and datasets load correctly. -""" - -# %% -def test_tinymlperf_benchmark_suite(): - """Test the TinyMLPerf benchmark suite""" - print("Testing TinyMLPerf Benchmark Suite...") - - # Initialize benchmark suite - benchmark_suite = TinyMLPerf(profiler_warmup_runs=2, profiler_timing_runs=3) - - # Test each event - events = benchmark_suite.get_available_events() - print(f"\n🏆 Available Events: {len(events)}") - - for event_name, description in events.items(): - print(f"\n📋 Testing {event_name}...") - model, dataset = benchmark_suite.load_benchmark(event_name) - - # Test model inference - inputs = dataset['inputs'] - outputs = model.predict(inputs) - - print(f" ✅ Inference successful: {inputs.shape} -> {outputs.shape}") - - # Verify output shape makes sense - batch_size = inputs.shape[0] - assert outputs.shape[0] == batch_size, f"Batch size mismatch: {outputs.shape[0]} != {batch_size}" - print(f" ✅ Output shape verified") - - print(f"\n✅ TinyMLPerf benchmark suite test complete!") - return benchmark_suite - -# %% [markdown] -""" -## Part 3: Performance Benchmarking Using Module 15's Profiler - -Now let's build the core benchmarking infrastructure that uses the profiler from Module 15 to measure performance. 
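Note that the hardware-independent score at the heart of the competition reduces to a ratio of baseline to optimized time. A minimal sketch:

```python
def relative_speedup(baseline_seconds, optimized_seconds):
    # Ratio > 1.0 means faster than baseline; the ratio itself is the score
    if optimized_seconds <= 0:
        raise ValueError("optimized time must be positive")
    return baseline_seconds / optimized_seconds

score = relative_speedup(0.030, 0.010)  # 3x faster -> score of 3.0
```

Because both times come from the same machine, the ratio cancels out hardware differences and lets submissions from different laptops compete fairly.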
-""" - -# %% -class CompetitionProfiler: - """ - Competition profiling infrastructure using TinyTorch's Module 15 profiler. - - Provides rigorous performance measurement for fair competition by: - - Using standardized profiling from Module 15 - - Multiple timing runs with statistical analysis - - Memory usage tracking and analysis - - Hardware-independent relative scoring - """ - - def __init__(self, warmup_runs: int = DEFAULT_WARMUP_RUNS, - timing_runs: int = DEFAULT_PROFILER_TIMING_RUNS): - """ - Initialize competition profiler. - - Args: - warmup_runs: Number of warmup runs to stabilize performance - timing_runs: Number of timing runs for statistical reliability - """ - self.warmup_runs = warmup_runs - self.timing_runs = timing_runs - self.has_profiler = HAS_PROFILER - - if not self.has_profiler: - print("⚠️ Warning: Advanced profiling unavailable, using basic timing") - else: - print("✅ Using TinyTorch Module 15 profiler for advanced metrics") - - def benchmark_model(self, model, dataset: Dict[str, Any]) -> Dict[str, Any]: - """ - Benchmark a model using rigorous profiling methodology. 
- - Args: - model: Model to benchmark (must have predict() or forward() method) - dataset: Dataset dictionary with 'inputs' key - - Returns: - Comprehensive benchmarking results with performance metrics - """ - print(f"🏁 Benchmarking {dataset.get('event', 'Model')}...") - - inputs = dataset['inputs'] - results = { - 'event': dataset.get('event', 'Unknown'), - 'model_type': type(model).__name__, - 'input_shape': inputs.shape, - 'benchmark_timestamp': datetime.now().isoformat() - } - - if self.has_profiler: - # Use advanced profiling from Module 15 - results.update(self._profile_with_tinytorch_profiler(model, inputs)) - else: - # Fallback to basic timing - results.update(self._profile_basic_timing(model, inputs)) - - self._print_benchmark_results(results) - return results - - def quick_benchmark(self, model, dataset: Dict[str, Any]) -> float: - """ - Simple benchmarking returning just the mean inference time. - - This is a simplified interface for students who just want basic timing. - - Args: - model: Model to benchmark - dataset: Dataset dictionary with 'inputs' key - - Returns: - Mean inference time in seconds - """ - results = self._run_basic_profiling(model, dataset['inputs']) - return results['mean_inference_time'] - - def compare_models(self, model, baseline_model, dataset: Dict[str, Any]) -> Dict[str, Any]: - """ - Compare two models directly with simplified interface. 
- - Args: - model: Optimized model to test - baseline_model: Baseline model for comparison - dataset: Dataset dictionary with 'inputs' key - - Returns: - Comparison results with speedup information - """ - print(f"🏁 Comparing models for {dataset.get('event', 'Model')}...") - - # Benchmark both models - baseline_results = self._run_basic_profiling(baseline_model, dataset['inputs']) - model_results = self._run_basic_profiling(model, dataset['inputs']) - - # Calculate speedup - speedup = baseline_results['mean_inference_time'] / model_results['mean_inference_time'] - - comparison = { - 'baseline_time': baseline_results['mean_inference_time'], - 'optimized_time': model_results['mean_inference_time'], - 'speedup': speedup, - 'event': dataset.get('event', 'Unknown'), - 'baseline_model': type(baseline_model).__name__, - 'optimized_model': type(model).__name__ - } - - print(f"📊 Baseline: {comparison['baseline_time']*1000:.2f} ms") - print(f"📊 Optimized: {comparison['optimized_time']*1000:.2f} ms") - print(f"ROCKET Speedup: {speedup:.2f}x {'faster' if speedup > 1.0 else 'slower'}") - - return comparison - - def benchmark_with_baseline(self, model, dataset: Dict[str, Any], baseline_time: float) -> Dict[str, Any]: - """ - Benchmark a model against a known baseline time. - - Args: - model: Model to benchmark - dataset: Dataset dictionary with 'inputs' key - baseline_time: Baseline time in seconds for speedup calculation - - Returns: - Benchmark results with speedup calculation - """ - results = self.benchmark_model(model, dataset) - speedup = baseline_time / results['mean_inference_time'] - results['speedup_vs_baseline'] = speedup - - print(f"ROCKET Speedup vs baseline: {speedup:.2f}x {'faster' if speedup > 1.0 else 'slower'}") - return results - - def _run_basic_profiling(self, model, inputs: np.ndarray) -> Dict[str, Any]: - """ - Run basic profiling without complex options. - - This is used by simplified interfaces. 
- """ - if self.has_profiler: - return self._profile_with_tinytorch_profiler(model, inputs) - else: - return self._profile_basic_timing(model, inputs) - - def _profile_with_tinytorch_profiler(self, model, inputs: np.ndarray) -> Dict[str, Any]: - """Profile using Module 15's advanced profiler""" - profiler = SimpleProfiler(track_memory=True, track_cpu=True) - - # Run profiling sessions - profile_results = self._run_profiling_sessions(profiler, model, inputs) - - # Calculate statistics - return self._calculate_profiling_statistics(profile_results) - - def _run_profiling_sessions(self, profiler, model, inputs: np.ndarray) -> List[Dict[str, Any]]: - """Run multiple profiling sessions for statistical reliability.""" - profile_results = [] - - for run in range(self.timing_runs): - # Each profiling session includes warmup - result = profiler.profile( - model.predict, inputs, - name=f"inference_run_{run}", - warmup=True # Profiler handles warmup - ) - profile_results.append(result) - - return profile_results - - def _calculate_profiling_statistics(self, profile_results: List[Dict[str, Any]]) -> Dict[str, Any]: - """Calculate timing and memory statistics from profile results.""" - # Extract timing data - wall_times = [r['wall_time'] for r in profile_results] - cpu_times = [r['cpu_time'] for r in profile_results] - - # Calculate timing statistics - timing_stats = { - 'mean_inference_time': np.mean(wall_times), - 'std_inference_time': np.std(wall_times), - 'min_inference_time': np.min(wall_times), - 'max_inference_time': np.max(wall_times), - 'p95_inference_time': np.percentile(wall_times, 95), - 'mean_cpu_time': np.mean(cpu_times), - 'cpu_efficiency': np.mean([r['cpu_efficiency'] for r in profile_results]), - 'profiling_method': 'TinyTorch Module 15 Profiler' - } - - # Add memory statistics - memory_stats = self._extract_memory_statistics(profile_results) - timing_stats.update(memory_stats) - - return timing_stats - - def _extract_memory_statistics(self, profile_results: 
List[Dict[str, Any]]) -> Dict[str, Any]: - """Extract memory statistics from profiling results.""" - # Use last run as most representative - last_result = profile_results[-1] - memory_stats = {} - - if 'memory_delta_mb' in last_result: - memory_stats.update({ - 'memory_delta_mb': last_result['memory_delta_mb'], - 'peak_memory_mb': last_result['peak_memory_mb'], - 'result_size_mb': last_result.get('result_size_mb', 0) - }) - - return memory_stats - - def _profile_basic_timing(self, model, inputs: np.ndarray) -> Dict[str, Any]: - """Fallback basic timing without advanced profiling""" - - # Warmup runs - for _ in range(self.warmup_runs): - _ = model.predict(inputs) - - # Timing runs - times = [] - for _ in range(self.timing_runs): - start = time.perf_counter() - _ = model.predict(inputs) - end = time.perf_counter() - times.append(end - start) - - return { - 'mean_inference_time': np.mean(times), - 'std_inference_time': np.std(times), - 'min_inference_time': np.min(times), - 'max_inference_time': np.max(times), - 'p95_inference_time': np.percentile(times, 95), - 'profiling_method': 'Basic Timing' - } - - def _print_benchmark_results(self, results: Dict[str, Any]): - """Print formatted benchmark results""" - print(f"\n📊 {results['event']} Benchmark Results:") - print(f" Model: {results['model_type']}") - print(f" Input: {results['input_shape']}") - print(f" Mean Time: {results['mean_inference_time']*1000:.2f} ± {results['std_inference_time']*1000:.2f} ms") - print(f" Best Time: {results['min_inference_time']*1000:.2f} ms") - print(f" P95 Time: {results['p95_inference_time']*1000:.2f} ms") - - if 'speedup_vs_baseline' in results: - print(f" ROCKET Speedup: {results['speedup_vs_baseline']:.2f}x faster") - - if 'memory_delta_mb' in results: - print(f" 💾 Memory: {results['memory_delta_mb']:.2f} MB delta, {results['peak_memory_mb']:.2f} MB peak") - - print(f" 📏 Method: {results['profiling_method']}") - -# %% [markdown] -""" -### Test Competition Profiler - -Let's test the 
competition profiler with TinyMLPerf benchmark models. -""" - -# %% -def test_competition_profiler(): - """Test the competition profiler with benchmark models""" - print("Testing Competition Profiler...") - - # Initialize TinyMLPerf and profiler - benchmark_suite = TinyMLPerf(profiler_warmup_runs=2, profiler_timing_runs=3) - competition_profiler = CompetitionProfiler(warmup_runs=2, timing_runs=3) - - # Test MLP Sprint profiling - mlp_model, mlp_dataset = benchmark_suite.load_benchmark('mlp_sprint') - mlp_results = competition_profiler.benchmark_model(mlp_model, mlp_dataset) - - # Test CNN Marathon profiling - cnn_model, cnn_dataset = benchmark_suite.load_benchmark('cnn_marathon') - cnn_results = competition_profiler.benchmark_model(cnn_model, cnn_dataset) - - # Test speedup calculation with baseline - print(f"\n🏃 Testing Speedup Calculation...") - cnn_speedup_results = competition_profiler.benchmark_with_baseline( - cnn_model, cnn_dataset, - baseline_time=mlp_results['mean_inference_time'] # Use MLP as baseline - ) - - print(f"\nPASS Competition profiler test complete!") - return competition_profiler, mlp_results, cnn_results - -# %% [markdown] -""" -## Part 3: Simplified Competition Framework - Focused Leaderboards - -Let's build a simplified competition framework with focused classes and clear responsibilities. 
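Before walking through the classes, the scoring math they implement can be sketched on its own. This is a minimal standalone sketch: the keyword table, weights, and cap below are illustrative stand-ins for the module-level `OPTIMIZATION_KEYWORDS`, `MAX_INNOVATION_SCORE`, `SPEED_WEIGHT`, and `INNOVATION_WEIGHT` constants defined earlier in the file, not their actual values.

```python
# Illustrative stand-ins for the module-level constants (assumed values).
OPTIMIZATION_KEYWORDS = {
    "quantization": ["quantiz", "int8"],
    "pruning": ["prun", "sparse"],
    "distillation": ["distill"],
}
MAX_INNOVATION_SCORE = 1.0
SPEED_WEIGHT, INNOVATION_WEIGHT = 0.7, 0.3

def innovation_score(description: str) -> float:
    """0.2 per detected technique, +0.3 bonus for 3+, capped at the max."""
    text = description.lower()
    detected = [t for t, kws in OPTIMIZATION_KEYWORDS.items()
                if any(k in text for k in kws)]
    score = 0.2 * len(detected) + (0.3 if len(detected) >= 3 else 0.0)
    return min(score, MAX_INNOVATION_SCORE)

def composite_score(baseline_time: float, optimized_time: float,
                    description: str) -> float:
    """Hardware-independent speedup ratio blended with the innovation score."""
    speedup = baseline_time / optimized_time
    return SPEED_WEIGHT * speedup + INNOVATION_WEIGHT * innovation_score(description)

print(innovation_score("INT8 quantization with sparse pruning"))  # 2 techniques -> 0.4
print(composite_score(0.010, 0.005, "distillation"))  # 0.7 * 2.0 + 0.3 * 0.2
```

The same blend shows up later as `composite_score = SPEED_WEIGHT * speedup_score + INNOVATION_WEIGHT * innovation_score` inside the leaderboard class.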
-""" - -# %% -class CompetitionSubmission: - """Handles creation and validation of individual competition submissions.""" - - def __init__(self, team_name: str, event_name: str, optimized_model, - optimization_description: str = "", github_url: str = ""): - """Create a competition submission.""" - self.team_name = team_name - self.event_name = event_name - self.optimized_model = optimized_model - self.optimization_description = optimization_description - self.github_url = github_url - self.submission_id = self._generate_id() - self.timestamp = datetime.now().isoformat() - - def _generate_id(self) -> str: - """Generate unique submission ID.""" - timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") - team_hash = hashlib.md5(self.team_name.encode()).hexdigest()[:6] - return f"{self.event_name}_{team_hash}_{timestamp}" - - def to_dict(self) -> Dict[str, Any]: - """Convert submission to dictionary for storage.""" - return { - 'submission_id': self.submission_id, - 'timestamp': self.timestamp, - 'team_name': self.team_name, - 'event_name': self.event_name, - 'optimization_description': self.optimization_description, - 'github_url': self.github_url - } - -class CompetitionStorage: - """Handles saving and loading competition results.""" - - def __init__(self, results_dir: str = "tinymlperf_results"): - """Initialize storage with results directory.""" - self.results_dir = Path(results_dir) - self.results_dir.mkdir(exist_ok=True) - - def save_submission(self, submission_data: Dict[str, Any]): - """Save submission to storage.""" - filename = f"{submission_data['submission_id']}.json" - filepath = self.results_dir / filename - - with open(filepath, 'w') as f: - json.dump(submission_data, f, indent=2, default=str) - - print(f"💾 Submission saved: {filepath}") - - def load_event_submissions(self, event_name: str) -> List[Dict[str, Any]]: - """Load all submissions for a specific event.""" - submissions = [] - - for filepath in self.results_dir.glob(f"{event_name}_*.json"): - try: 
- with open(filepath, 'r') as f: - submission = json.load(f) - submissions.append(submission) - except Exception as e: - print(f"Warning: Could not load {filepath}: {e}") - - return submissions - -class SimpleInnovationDetector: - """Simple innovation detection using basic keyword matching.""" - - def detect_techniques(self, description: str) -> List[str]: - """Detect optimization techniques using simple keywords.""" - description_lower = description.lower() - detected = [] - - for technique, keywords in OPTIMIZATION_KEYWORDS.items(): - for keyword in keywords: - if keyword in description_lower: - detected.append(technique) - break # Only count each technique once - - return detected - - def calculate_innovation_score(self, detected_techniques: List[str]) -> float: - """Calculate innovation score based on number of techniques.""" - base_score = len(detected_techniques) * 0.2 - # Bonus for multiple techniques - if len(detected_techniques) >= 3: - base_score += 0.3 - return min(base_score, MAX_INNOVATION_SCORE) - -class CompetitionLeaderboard: - """Focused leaderboard display with configurable sorting.""" - - def __init__(self, storage: CompetitionStorage): - """Initialize leaderboard with storage backend.""" - self.storage = storage - self.innovation_detector = SimpleInnovationDetector() - - def display_leaderboard(self, event_name: str, sort_by: str = 'speed', top_n: int = 10) -> List[Dict[str, Any]]: - """Display leaderboard with configurable sorting. - - Args: - event_name: Event to show leaderboard for - sort_by: 'speed', 'innovation', or 'composite' - top_n: Number of top entries to display - """ - submissions = self.storage.load_event_submissions(event_name) - - if not submissions: - print(f"🏆 {event_name.replace('_', ' ').title()} Leaderboard ({sort_by.title()})") - print("No submissions yet! 
Be the first to compete!") - return [] - - # Add innovation scores if needed - if sort_by in ['innovation', 'composite']: - self._add_innovation_scores(submissions) - - # Sort submissions - sorted_submissions = self._sort_submissions(submissions, sort_by) - top_submissions = sorted_submissions[:top_n] - - # Display leaderboard - self._display_formatted_leaderboard(event_name, top_submissions, sort_by) - - return top_submissions - - def _add_innovation_scores(self, submissions: List[Dict[str, Any]]): - """Add innovation scores to submissions that don't have them.""" - for submission in submissions: - if 'innovation_score' not in submission: - techniques = self.innovation_detector.detect_techniques( - submission.get('optimization_description', '') - ) - submission['detected_techniques'] = techniques - submission['innovation_score'] = self.innovation_detector.calculate_innovation_score(techniques) - - # Calculate composite score if speedup exists - if 'speedup_score' in submission: - submission['composite_score'] = ( - SPEED_WEIGHT * submission['speedup_score'] + - INNOVATION_WEIGHT * submission['innovation_score'] - ) - - def _sort_submissions(self, submissions: List[Dict[str, Any]], sort_by: str) -> List[Dict[str, Any]]: - """Sort submissions by specified criteria.""" - if sort_by == 'speed': - return sorted(submissions, key=lambda s: s.get('speedup_score', 0), reverse=True) - elif sort_by == 'innovation': - return sorted(submissions, key=lambda s: s.get('innovation_score', 0), reverse=True) - elif sort_by == 'composite': - return sorted(submissions, key=lambda s: s.get('composite_score', 0), reverse=True) - else: - raise ValueError(f"Unknown sort type: {sort_by}") - - def _display_formatted_leaderboard(self, event_name: str, submissions: List[Dict[str, Any]], sort_by: str): - """Display formatted leaderboard based on sort type.""" - print(f"\n🏆 TINYMLPERF LEADERBOARD - {event_name.replace('_', ' ').title()} ({sort_by.title()})") - print("=" * 80) - - if sort_by == 
'speed': - self._display_speed_leaderboard(submissions) - elif sort_by == 'innovation': - self._display_innovation_leaderboard(submissions) - elif sort_by == 'composite': - self._display_composite_leaderboard(submissions) - - print("-" * 80) - print(f"Showing top {len(submissions)} submissions") - - def _display_speed_leaderboard(self, submissions: List[Dict[str, Any]]): - """Display speed-focused leaderboard.""" - print(LEADERBOARD_HEADER.format( - rank="Rank", team="Team", speedup="Speedup", time_ms="Time (ms)", techniques="Techniques" - )) - print("-" * 80) - - for i, submission in enumerate(submissions): - rank = i + 1 - team = submission['team_name'][:19] - speedup = f"{submission.get('speedup_score', 0):.2f}x" - time_ms = f"{submission.get('submission_time_ms', 0):.2f}" - techniques = submission.get('optimization_description', '')[:24] - - print(LEADERBOARD_HEADER.format( - rank=rank, team=team, speedup=speedup, time_ms=time_ms, techniques=techniques - )) - - def _display_innovation_leaderboard(self, submissions: List[Dict[str, Any]]): - """Display innovation-focused leaderboard.""" - print(INNOVATION_HEADER.format( - rank="Rank", team="Team", innovation="Innovation", techniques="Tech#", description="Description" - )) - print("-" * 80) - - for i, submission in enumerate(submissions): - rank = i + 1 - team = submission['team_name'][:19] - innovation = f"{submission.get('innovation_score', 0):.3f}" - num_tech = len(submission.get('detected_techniques', [])) - description = submission.get('optimization_description', '')[:24] - - print(INNOVATION_HEADER.format( - rank=rank, team=team, innovation=innovation, techniques=num_tech, description=description - )) - - def _display_composite_leaderboard(self, submissions: List[Dict[str, Any]]): - """Display composite leaderboard.""" - print(COMPOSITE_HEADER.format( - rank="Rank", team="Team", composite="Composite", speed="Speed", innovation="Innovation", techniques="Techniques" - )) - print("-" * 80) - - for i, submission 
in enumerate(submissions): - rank = i + 1 - team = submission['team_name'][:17] - composite = f"{submission.get('composite_score', 0):.3f}" - speed = f"{submission.get('speedup_score', 0):.2f}x" - innovation = f"{submission.get('innovation_score', 0):.3f}" - techniques = ", ".join(submission.get('detected_techniques', [])[:2])[:15] - - print(COMPOSITE_HEADER.format( - rank=rank, team=team, composite=composite, speed=speed, innovation=innovation, techniques=techniques - )) - -class TinyMLPerfCompetition: - """ - TinyMLPerf Competition Framework - The Olympics of ML Optimization! - - Manages three exciting competition events: - - MLP Sprint: Fastest feedforward network - - CNN Marathon: Most efficient convolutions - - Transformer Decathlon: Ultimate attention optimization - - Features hardware-independent relative scoring and transparent leaderboards. - """ - - def __init__(self, results_dir: str = "tinymlperf_results"): - """ - Initialize TinyMLPerf competition. - - Args: - results_dir: Directory to store competition results and leaderboards - """ - self.results_dir = Path(results_dir) - self.results_dir.mkdir(exist_ok=True) - - self.tinyperf = TinyMLPerf() - self.profiler = CompetitionProfiler(warmup_runs=DEFAULT_WARMUP_RUNS, - timing_runs=DEFAULT_TIMING_RUNS) - - # Initialize storage and leaderboard components - self.storage = CompetitionStorage(results_dir) - self.leaderboard = CompetitionLeaderboard(self.storage) - - # Load baseline models for relative scoring - self.baselines = self._establish_baselines() - - print("🏆 TinyMLPerf Competition Initialized!") - print("TARGET Three Events Ready for Competition!") - - def _establish_baselines(self) -> Dict[str, float]: - """Establish baseline performance for relative scoring.""" - print("📏 Establishing baseline performance for relative scoring...") - - baselines = {} - events = ['mlp_sprint', 'cnn_marathon', 'transformer_decathlon'] - - for event in events: - model, dataset = self.tinyperf.load_benchmark(event) - 
results = self.profiler.benchmark_model(model, dataset) - baselines[event] = results['mean_inference_time'] - print(f" {event}: {baselines[event]*1000:.2f} ms baseline") - - return baselines - - def submit_entry(self, team_name: str, event_name: str, optimized_model, - optimization_description: str = "", github_url: str = "") -> Dict[str, Any]: - """Submit an optimized model to TinyMLPerf competition. - - Args: - team_name: Name of the competing team - event_name: Competition event ('mlp_sprint', 'cnn_marathon', 'transformer_decathlon') - optimized_model: The optimized model to submit - optimization_description: Description of optimization techniques used - github_url: Link to code repository (for transparency) - - Returns: - Submission results with performance metrics and scoring - """ - # Validate event - if event_name not in self.baselines: - available = list(self.baselines.keys()) - print(f"FAIL Event '{event_name}' not recognized!") - print("TARGET Available competitions:") - for event in available: - print(f" • {event.replace('_', ' ').title()}") - return None - - print(f"ROCKET TINYMLPERF SUBMISSION") - print(f"🏆 Event: {event_name.replace('_', ' ').title()}") - print(f"👥 Team: {team_name}") - print("-" * 60) - - # Load benchmark dataset for this event - _, dataset = self.tinyperf.load_benchmark(event_name) - - # Benchmark the submitted model with baseline comparison - results = self.profiler.benchmark_with_baseline( - optimized_model, dataset, - baseline_time=self.baselines[event_name] - ) - - # Calculate competition score (relative speedup) - baseline_time = self.baselines[event_name] - submission_time = results['mean_inference_time'] - speedup_score = baseline_time / submission_time - - # Create submission record - submission = { - 'submission_id': self._generate_submission_id(team_name, event_name), - 'timestamp': datetime.now().isoformat(), - 'team_name': team_name, - 'event_name': event_name, - 'optimization_description': optimization_description, - 
'github_url': github_url, - 'performance_metrics': results, - 'speedup_score': speedup_score, - 'baseline_time_ms': baseline_time * 1000, - 'submission_time_ms': submission_time * 1000 - } - - # Save submission to storage - self.storage.save_submission(submission) - - # Display submission results - self._display_submission_results(submission) - - return submission - - def _generate_submission_id(self, team_name: str, event_name: str) -> str: - """Generate unique submission ID""" - timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") - team_hash = hashlib.md5(team_name.encode()).hexdigest()[:6] - return f"{event_name}_{team_hash}_{timestamp}" - - def _benchmark_submission(self, submission: CompetitionSubmission) -> Dict[str, Any]: - """Benchmark a submission and calculate scores.""" - # Load benchmark dataset - _, dataset = self.tinyperf.load_benchmark(submission.event_name) - - # Run profiling against the event baseline (benchmark_model does not accept baseline_time) - results = self.profiler.benchmark_with_baseline( - submission.optimized_model, dataset, - baseline_time=self.baselines[submission.event_name] - ) - - # Calculate scores - baseline_time = self.baselines[submission.event_name] - submission_time = results['mean_inference_time'] - speedup_score = baseline_time / submission_time - - # Create submission data - submission_data = submission.to_dict() - submission_data.update({ - 'performance_metrics': results, - 'speedup_score': speedup_score, - 'baseline_time_ms': baseline_time * 1000, - 'submission_time_ms': submission_time * 1000 - }) - - return submission_data - - def _display_submission_results(self, submission: Dict[str, Any]): - """Display formatted submission results.""" - metrics = submission['performance_metrics'] - speedup = submission['speedup_score'] - - print(f"\n🏆 SUBMISSION RESULTS") - print(f"=" * 50) - print(f"Team: {submission['team_name']}") - print(f"Event: {submission['event_name'].replace('_', ' ').title()}") - - print(f"\n⏱️ Performance:") - print(f" Your Time: {submission['submission_time_ms']:.2f} ms") - print(f" 
Baseline: {submission['baseline_time_ms']:.2f} ms") - print(f" ROCKET Speedup: {speedup:.2f}x {'FASTER' if speedup > 1.0 else 'slower'}") - - if 'memory_delta_mb' in metrics: - print(f" 💾 Memory: {metrics['memory_delta_mb']:.2f} MB") - - # Award celebration for good performance - if speedup >= 3.0: - print(f"\nCELEBRATE AMAZING! 3x+ speedup achieved!") - elif speedup >= 2.0: - print(f"\n🏆 EXCELLENT! 2x+ speedup!") - elif speedup >= 1.5: - print(f"\n⭐ GREAT! 50%+ speedup!") - elif speedup >= 1.1: - print(f"\nPASS Good optimization!") - else: - print(f"\nTHINK Keep optimizing - you can do better!") - - if submission['optimization_description']: - print(f"\nTIP Techniques Used:") - print(f" {submission['optimization_description']}") - - def display_leaderboard(self, event_name: str, sort_by: str = 'speed', top_n: int = 10) -> List[Dict[str, Any]]: - """Display leaderboard for specific event with configurable sorting. - - Args: - event_name: Event to show leaderboard for - sort_by: 'speed', 'innovation', or 'composite' - top_n: Number of top entries to display - """ - return self.leaderboard.display_leaderboard(event_name, sort_by, top_n) - - def display_all_leaderboards(self, sort_by: str = 'speed'): - """Display leaderboards for all events. 
- - Args: - sort_by: 'speed', 'innovation', or 'composite' - """ - events = ['mlp_sprint', 'cnn_marathon', 'transformer_decathlon'] - - for event in events: - self.display_leaderboard(event, sort_by=sort_by, top_n=5) - print() - - def get_team_progress(self, team_name: str) -> Dict[str, List[Dict[str, Any]]]: - """Get all submissions from a specific team across all events.""" - team_submissions = {'mlp_sprint': [], 'cnn_marathon': [], 'transformer_decathlon': []} - - for event in team_submissions.keys(): - submissions = self.storage.load_event_submissions(event) - team_submissions[event] = [ - s for s in submissions if s['team_name'] == team_name - ] - # Sort by timestamp - team_submissions[event].sort(key=lambda s: s['timestamp']) - - return team_submissions - -# %% [markdown] -""" -### Test TinyMLPerf Competition Framework - -Let's test the competition framework with multiple team submissions and leaderboards. -""" - -# %% -def test_tinymlperf_competition(): - """Test the TinyMLPerf competition framework""" - print("Testing TinyMLPerf Competition Framework...") - - # Initialize competition - competition = TinyMLPerfCompetition() - - # Create some test optimized models - class FastMLPModel: - """Simulated optimized MLP - smaller and faster""" - def __init__(self): - # Smaller model for speed - self.weights1 = np.random.randn(784, 64).astype(np.float32) * 0.1 - self.bias1 = np.random.randn(64).astype(np.float32) * 0.1 - self.weights2 = np.random.randn(64, 10).astype(np.float32) * 0.1 - self.bias2 = np.random.randn(10).astype(np.float32) * 0.1 - - def predict(self, x): - h1 = np.maximum(0, x @ self.weights1 + self.bias1) - return h1 @ self.weights2 + self.bias2 - - class EfficientCNNModel: - """Simulated optimized CNN""" - def __init__(self): - # Optimized weights - self.fc_weights = np.random.randn(1600, 10).astype(np.float32) * 0.05 - self.fc_bias = np.random.randn(10).astype(np.float32) * 0.05 - - def predict(self, x): - batch_size = x.shape[0] - x_flat = 
x.reshape(batch_size, -1) - if x_flat.shape[1] != 1600: - x_flat = x_flat[:, :1600] if x_flat.shape[1] > 1600 else np.pad(x_flat, ((0, 0), (0, 1600 - x_flat.shape[1])), 'constant') - return x_flat @ self.fc_weights + self.fc_bias - - # Submit optimized models to competition - print("\nROCKET Submitting Competition Entries...") - - # MLP Sprint submissions - mlp_submission1 = competition.submit_entry( - team_name="Speed Demons", - event_name="mlp_sprint", - optimized_model=FastMLPModel(), - optimization_description="Reduced hidden layer size for 2x speedup", - github_url="https://github.com/speed-demons/fast-mlp" - ) - - mlp_submission2 = competition.submit_entry( - team_name="Lightning Fast", - event_name="mlp_sprint", - optimized_model=FastMLPModel(), - optimization_description="Quantization + kernel optimization", - github_url="https://github.com/lightning-fast/mlp-opt" - ) - - # CNN Marathon submission - cnn_submission = competition.submit_entry( - team_name="CNN Champions", - event_name="cnn_marathon", - optimized_model=EfficientCNNModel(), - optimization_description="Custom convolution kernels + memory optimization", - github_url="https://github.com/cnn-champions/efficient-cnn" - ) - - # Display leaderboards - print("\n📊 Competition Leaderboards:") - competition.display_all_leaderboards() - - print("\nPASS TinyMLPerf competition framework test complete!") - return competition - -# %% [markdown] -""" -## Part 4: Simplified Competition Testing - -Let's test the simplified competition framework with all three leaderboard types. 
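All of the leaderboard numbers above ultimately come from the warmup-then-time pattern in `_profile_basic_timing`. A distilled standalone version, assuming only a callable standing in for `model.predict`:

```python
import time
import numpy as np

def basic_timing(predict, inputs, warmup_runs=3, timing_runs=10):
    """Discard warmup runs, then collect wall-clock stats with perf_counter."""
    for _ in range(warmup_runs):   # warmup: stabilize caches before measuring
        predict(inputs)
    times = []
    for _ in range(timing_runs):
        start = time.perf_counter()
        predict(inputs)
        times.append(time.perf_counter() - start)
    return {
        "mean_inference_time": float(np.mean(times)),
        "p95_inference_time": float(np.percentile(times, 95)),
        "min_inference_time": float(np.min(times)),
    }

# Toy stand-in for model.predict: a single matmul
weights = np.random.randn(784, 10).astype(np.float32)
batch = np.random.randn(32, 784).astype(np.float32)
stats = basic_timing(lambda x: x @ weights, batch)
print(f"{stats['mean_inference_time'] * 1000:.3f} ms mean")
```

Multiple timing runs plus a p95 figure are what give the leaderboards statistical reliability; a single timed call would be dominated by run-to-run noise.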
-""" - -# %% -def test_simplified_competition_features(): - """Test the simplified competition framework with all leaderboard types.""" - print("Testing Simplified Competition Framework with All Leaderboard Types...") - - # Initialize competition - competition = TinyMLPerfCompetition() - - # Create optimized models with different innovation descriptions - class FastMLPModel: - """Simulated optimized MLP - smaller and faster""" - def __init__(self): - # Smaller model for speed - self.weights1 = np.random.randn(784, 64).astype(np.float32) * 0.1 - self.bias1 = np.random.randn(64).astype(np.float32) * 0.1 - self.weights2 = np.random.randn(64, 10).astype(np.float32) * 0.1 - self.bias2 = np.random.randn(10).astype(np.float32) * 0.1 - - def predict(self, x): - h1 = np.maximum(0, x @ self.weights1 + self.bias1) - return h1 @ self.weights2 + self.bias2 - - class EfficientCNNModel: - """Simulated optimized CNN""" - def __init__(self): - # Optimized weights - self.fc_weights = np.random.randn(1600, 10).astype(np.float32) * 0.05 - self.fc_bias = np.random.randn(10).astype(np.float32) * 0.05 - - def predict(self, x): - batch_size = x.shape[0] - x_flat = x.reshape(batch_size, -1) - if x_flat.shape[1] != 1600: - x_flat = x_flat[:, :1600] if x_flat.shape[1] > 1600 else np.pad(x_flat, ((0, 0), (0, 1600 - x_flat.shape[1])), 'constant') - return x_flat @ self.fc_weights + self.fc_bias - - # Submit entries with different optimization descriptions - print("\nROCKET Submitting Competition Entries...") - - # MLP submissions with different techniques - submission1 = competition.submit_entry( - team_name="Speed Demons", - event_name="mlp_sprint", - optimized_model=FastMLPModel(), - optimization_description="Reduced hidden layer size for 2x speedup", - github_url="https://github.com/speed-demons/fast-mlp" - ) - - submission2 = competition.submit_entry( - team_name="Quantized Team", - event_name="mlp_sprint", - optimized_model=FastMLPModel(), - optimization_description="INT8 quantization 
with custom kernels", - github_url="https://github.com/quantized-team/mlp-opt" - ) - - submission3 = competition.submit_entry( - team_name="Pruning Pros", - event_name="cnn_marathon", - optimized_model=EfficientCNNModel(), - optimization_description="Sparse pruned model with distillation", - github_url="https://github.com/pruning-pros/efficient-cnn" - ) - - # Test all three leaderboard types - print("\n📊 Testing All Leaderboard Types:") - - print("\n1. Speed Leaderboard:") - competition.display_leaderboard("mlp_sprint", sort_by="speed", top_n=5) - - print("\n2. Innovation Leaderboard:") - competition.display_leaderboard("mlp_sprint", sort_by="innovation", top_n=5) - - print("\n3. Composite Leaderboard:") - competition.display_leaderboard("mlp_sprint", sort_by="composite", top_n=5) - - print("\nPASS Simplified competition features test complete!") - return competition - -# %% [markdown] -""" -## Comprehensive Testing - -Let's run a complete TinyMLPerf competition demonstration with simplified features. -""" - -def run_complete_tinymlperf_demo(): - """Run comprehensive TinyMLPerf competition demonstration""" - print("🏆 TINYMLPERF - THE ULTIMATE ML SYSTEMS COMPETITION") - print("=" * 80) - - print("\n1. 🏗️ Setting up TinyMLPerf Benchmark Suite...") - # Test benchmark suite - benchmark_suite = test_tinymlperf_benchmark_suite() - - print("\n2. SPEED Testing Competition Profiling...") - # Test profiling infrastructure - competition_profiler, mlp_results, cnn_results = test_competition_profiler() - - print("\n3. ROCKET Running Basic Competition...") - # Test basic competition - basic_competition = test_tinymlperf_competition() - - print("\n4. 
🔬 Testing Simplified Competition Features...") - # Test simplified competition with all leaderboard types - simplified_competition = test_simplified_competition_features() - - print("\n" + "=" * 80) - print("CELEBRATE TINYMLPERF DEMO COMPLETE!") - print("=" * 80) - - print("\n🏆 TinyMLPerf Competition Ready:") - print("PASS Three exciting events: MLP Sprint, CNN Marathon, Transformer Decathlon") - print("PASS TinyTorch Module 15 profiler integration for rigorous benchmarking") - print("PASS Hardware-independent relative scoring (speedup ratios)") - print("PASS Transparent leaderboards with evidence requirements") - print("PASS Simplified innovation detection and creativity rewards") - print("PASS Three leaderboard types: speed, innovation, and composite scoring") - - print("\nROCKET Competition Features:") - print("• Standardized benchmark models and datasets") - print("• Statistical reliability with multiple timing runs") - print("• Multiple leaderboard categories with simple keyword detection") - print("• GitHub integration for transparency and reproducibility") - print("• Focused classes with single responsibilities") - - print("\nTARGET Ready to Compete:") - print("1. Optimize your models using techniques from Modules 16-19") - print("2. Submit to TinyMLPerf events using competition.submit_entry()") - print("3. See your results on speed, innovation, or composite leaderboards") - print("4. Iterate and improve based on performance feedback") - print("5. 
Prove your ML systems optimization mastery!") - - return { - 'benchmark_suite': benchmark_suite, - 'profiler': competition_profiler, - 'basic_competition': basic_competition, - 'simplified_competition': simplified_competition - } - -# %% [markdown] -""" -## Systems Analysis Summary - -This simplified TinyMLPerf competition module demonstrates advanced ML systems engineering through streamlined competitive benchmarking: - -### 🏗️ **Simplified Competition Infrastructure** -- **Focused Classes**: Each class has a single responsibility - submission, storage, leaderboard, or innovation detection -- **Clear Separation of Concerns**: CompetitionSubmission, CompetitionStorage, CompetitionLeaderboard, and SimpleInnovationDetector work together -- **Consistent API**: Single parameterized leaderboard method replaces three separate implementations -- **Student-Friendly**: Reduced cognitive load while maintaining all essential functionality - -### SPEED **Streamlined Performance Optimization** -- **Single Leaderboard Interface**: One method with sort_by parameter ('speed', 'innovation', 'composite') replaces complex multiple methods -- **Simple Innovation Detection**: Basic keyword matching replaces complex pattern analysis and model introspection -- **Consistent Formatting**: Centralized header templates ensure visual consistency across all leaderboard types -- **Clear Error Messages**: Student-friendly guidance when events are not recognized - -### 📊 **Simplified Competition Analysis** -- **TinyTorch Profiler Integration**: Unchanged - still leverages Module 15's profiling infrastructure -- **Progressive Feature Introduction**: Students can focus on speed first, then add innovation scoring -- **Visual Clarity**: Clear section headers and spacing prevent information overload -- **Focused Testing**: Each test function validates one specific capability - -### TIP **Educational Improvements** -- **Reduced Complexity**: Eliminated 100+ line classes in favor of focused 20-30 line 
classes -- **Better Mental Models**: Students understand leaderboard concepts instead of getting lost in implementation details -- **Maintainable Code**: Consistent patterns and centralized formatting make code easier to debug and extend -- **KISS Principle**: Keep It Simple, Stupid - core pedagogical value preserved with implementation complexity reduced - -### TARGET **Key Learning Objectives Maintained** -- Competition still accelerates optimization learning through concrete performance measurements -- Hardware-independent scoring ensures fair comparison across different development environments -- Multiple leaderboard types prevent single-metric tunnel vision -- Evidence requirements teach reproducibility and honest performance reporting - -### 🏆 **Professional Development** -The simplified framework teaches students that good software engineering means: -- Breaking large classes into focused components -- Choosing clear, consistent APIs over feature proliferation -- Prioritizing readability and maintainability -- Making complex systems accessible without losing functionality - -This refactored competition framework proves that educational software can be both pedagogically effective AND well-engineered, setting a positive example for students about professional software development practices. -""" - -# %% [markdown] -""" -## Main Execution Block - -Run the complete TinyMLPerf competition system when this module is executed directly. -""" - -# %% -if __name__ == "__main__": - print("Module 20: TinyMLPerf - The Ultimate ML Systems Competition") - print("=" * 80) - - # Run complete TinyMLPerf demonstration - results = run_complete_tinymlperf_demo() - - print(f"\nCELEBRATE Module 20 complete!") - print(f"🏆 TinyMLPerf competition infrastructure ready!") - print(f"ROCKET Time to optimize your models and climb the leaderboards!") - -# %% [markdown] -""" -## THINK ML Systems Thinking: Interactive Questions - -1. 
**Why is separation of concerns crucial in competition software architecture?** Your refactored TinyMLPerf breaks large classes into focused components: CompetitionSubmission, CompetitionStorage, CompetitionLeaderboard, and SimpleInnovationDetector. Explain why this modular design is essential for educational software and how it teaches students professional software development practices beyond just ML systems concepts. - -2. **How does simplifying innovation detection improve student learning outcomes?** You replaced complex pattern matching and model introspection with basic keyword detection. Analyze why reducing implementation complexity while preserving core functionality helps students focus on competition concepts rather than text processing algorithms, and how this reflects real-world engineering trade-offs. - -3. **What makes single parameterized methods superior to multiple specialized methods?** Your leaderboard refactor replaced three separate methods (display_leaderboard, display_innovation_leaderboard, display_composite_leaderboard) with one configurable method. Explain why this API design choice reduces cognitive load while maintaining functionality, and how this principle applies to ML systems interfaces in production. - -4. **How does consistent formatting contribute to system maintainability and user experience?** Your centralized header templates (LEADERBOARD_HEADER, INNOVATION_HEADER, COMPOSITE_HEADER) ensure visual consistency across all leaderboard displays. Analyze why standardized formatting matters in ML systems dashboards and monitoring tools, and how it prevents the user interface inconsistencies that plague many ML operations platforms. -""" - -# %% [markdown] -""" -## TARGET MODULE SUMMARY: TinyMLPerf - Simplified Competition Framework - -This refactored module demonstrates the power of the KISS principle in educational software design, proving that complex systems can be both pedagogically effective and professionally engineered. 
- -### 🛤️ **The Simplification Journey** -- **Original Problem**: 600+ lines of complex, intertwined classes causing student cognitive overload -- **Solution Approach**: Break large classes into focused components with single responsibilities -- **Result**: Clean, maintainable code that teaches competition concepts without implementation distractions - -### 🏗️ **Architecture Improvements** -- **CompetitionSubmission**: Focused on creating and validating individual submissions -- **CompetitionStorage**: Dedicated to saving and loading competition data -- **CompetitionLeaderboard**: Specialized for ranking and display with configurable sorting -- **SimpleInnovationDetector**: Basic keyword matching replacing complex pattern analysis -- **TinyMLPerfCompetition**: Orchestrates components with clean delegation patterns - -### 🎯 **Educational Excellence** -Students learn both ML systems concepts AND professional software engineering: -- **Modular Design**: How to break complex problems into manageable components -- **API Consistency**: Why parameterized methods beat specialized implementations -- **Code Maintainability**: How consistent formatting and clear separation of concerns prevent technical debt -- **KISS Principle**: That simplicity is the ultimate sophistication in software design - -### 🏆 **Competition Integrity Maintained** -All essential functionality preserved with improved usability: -- Three competition events with standardized benchmarking -- Hardware-independent relative scoring for fair comparison -- Multiple leaderboard types (speed, innovation, composite) preventing tunnel vision -- Evidence requirements ensuring reproducible, honest performance claims -- Simple but effective innovation detection rewarding creative optimization - -### 💡 **Professional Development** -This refactor teaches students that excellent engineering means: -- Choosing clarity over clever complexity -- Building maintainable systems that others can understand and extend --
Designing APIs that guide users toward correct usage -- Making sophisticated functionality accessible without dumbing it down - -**The ultimate lesson**: Great ML systems engineers build tools that make complex concepts simple to use, not simple concepts complex to understand. This competition framework exemplifies how educational software can teach both domain knowledge and engineering excellence simultaneously. -""" diff --git a/modules_old/19_benchmarking/module.yaml b/modules_old/19_benchmarking/module.yaml deleted file mode 100644 index c265347f..00000000 --- a/modules_old/19_benchmarking/module.yaml +++ /dev/null @@ -1,31 +0,0 @@ -description: 'TinyMLPerf Olympics - the culmination of your TinyTorch journey! Build - a comprehensive - - benchmarking suite using your profiler from Module 19, then compete on speed, memory, - - and efficiency. Benchmark the models you built throughout the course to see the - impact - - of all your optimizations. - - ' -difficulty: advanced -estimated_hours: 10-12 -exports: -- tinytorch.benchmarking -learning_objectives: -- Build TinyMLPerf benchmark suite -- Implement fair performance comparison -- Create reproducible benchmarks -- Understand MLPerf methodology -name: Benchmarking -number: 20 -prerequisites: -- Module 15: Profiling -- All optimization modules (16-19) -skills_developed: -- Benchmarking methodology -- Performance reporting -- Fair comparison techniques -- Competition optimization -type: project diff --git a/modules_old/20_capstone/README.md b/modules_old/20_capstone/README.md deleted file mode 100644 index 537d565c..00000000 --- a/modules_old/20_capstone/README.md +++ /dev/null @@ -1,194 +0,0 @@ -# Module 20: TinyMLPerf - The Ultimate ML Systems Competition - -**The Olympics of ML Systems Optimization!** 🏆 - -## Overview - -Module 20 creates TinyMLPerf, an exciting competition framework where students benchmark all their optimizations from Modules 16-19 in three thrilling events. 
This is the grand finale that proves optimization mastery through measurable, competitive performance improvements. - -## Learning Objectives - -By completing this module, students will: - -1. **Build Competition Benchmarking Infrastructure**: Create standardized TinyMLPerf benchmark suite for fair competition -2. **Use Profiling Tools for Systematic Measurement**: Apply Module 15's profiler to measure real performance gains -3. **Compete Across Multiple Categories**: Optimize for speed, memory, model size, and innovation simultaneously -4. **Calculate Relative Performance Improvements**: Show speedup ratios independent of hardware differences -5. **Drive Innovation Through Competition**: Use competitive pressure to discover new optimization techniques - -## The Three Competition Events - -### 🏃 MLP Sprint - Fastest Feedforward Network -- **Challenge**: Optimize feedforward neural network inference for maximum speed -- **Benchmark**: 3-layer MLP (784→128→64→10) on MNIST-like data -- **Victory Condition**: Fastest inference time while maintaining accuracy -- **Techniques**: Quantization, pruning, custom kernels, architecture optimization - -### 🏃‍♂️ CNN Marathon - Efficient Convolutions -- **Challenge**: Optimize convolutional neural network processing for efficiency -- **Benchmark**: CNN model on 28×28×1 image data -- **Victory Condition**: Best balance of speed, memory usage, and accuracy -- **Techniques**: Convolution optimization, memory layout, spatial locality - -### 🏃‍♀️ Transformer Decathlon - Ultimate Attention Optimization -- **Challenge**: Optimize attention mechanisms and sequence processing -- **Benchmark**: Self-attention model on 64-token sequences -- **Victory Condition**: Complete optimization across all attention components -- **Techniques**: Attention optimization, memory management, sequence processing - -## Key Features - -### 🔧 TinyMLPerf Benchmark Suite -```python -from tinytorch.core.benchmarking import TinyMLPerf - -# Load standard 
competition benchmarks -tinyperf = TinyMLPerf() -mlp_model, mlp_dataset = tinyperf.load_benchmark('mlp_sprint') -cnn_model, cnn_dataset = tinyperf.load_benchmark('cnn_marathon') -transformer_model, transformer_dataset = tinyperf.load_benchmark('transformer_decathlon') -``` - -### ⚡ Competition Profiling with Module 15 Integration -```python -from tinytorch.core.benchmarking import CompetitionProfiler - -# Rigorous benchmarking using Module 15's profiler -profiler = CompetitionProfiler(warmup_runs=3, timing_runs=10) -results = profiler.benchmark_model(optimized_model, dataset, baseline_model) - -print(f"Speedup: {results['speedup_vs_baseline']:.2f}x faster!") -``` - -### 🏆 Competition Framework with Leaderboards -```python -from tinytorch.core.benchmarking import TinyMLPerfCompetitionPlus - -# Submit to competition -competition = TinyMLPerfCompetitionPlus() -submission = competition.submit_entry( - team_name="Speed Demons", - event_name="mlp_sprint", - optimized_model=my_optimized_mlp, - optimization_description="INT8 quantization + custom SIMD kernels", - github_url="https://github.com/team/optimization-repo" -) - -# View leaderboards -competition.display_all_enhanced_leaderboards() -``` - -### 🔬 Innovation Detection and Advanced Scoring -```python -# Automatic technique detection -innovation_analysis = competition.innovation_detector.analyze_innovation( - model=optimized_model, - optimization_description="Quantization + pruning + knowledge distillation" -) - -print(f"Innovation Score: {innovation_analysis['innovation_score']:.3f}") -print(f"Detected: {innovation_analysis['detected_techniques']}") -``` - -## Competition Scoring - -### Hardware-Independent Relative Scoring -- **Speedup Ratio**: `baseline_time / optimized_time` (3x faster = 3.0 score) -- **Innovation Score**: Automatic detection of optimization techniques (0.0 - 1.0) -- **Composite Score**: 70% speed + 30% innovation for balanced optimization - -### Multiple Leaderboards -1. 
**Speed Leaderboard**: Pure performance ranking by inference time -2. **Innovation Leaderboard**: Most creative optimization techniques -3. **Composite Leaderboard**: Best overall balance of speed and innovation - -## Innovation Technique Detection - -The system automatically detects and rewards: -- **Quantization**: INT8, INT16, low-precision techniques -- **Pruning**: Structured pruning, sparsity, weight removal -- **Distillation**: Knowledge transfer, teacher-student models -- **Custom Kernels**: SIMD, vectorization, hardware optimization -- **Memory Optimization**: In-place operations, gradient checkpointing -- **Compression**: Weight sharing, parameter compression - -## Example Competition Workflow - -```python -# 1. Load TinyMLPerf benchmark -tinyperf = TinyMLPerf() -model, dataset = tinyperf.load_benchmark('mlp_sprint') - -# 2. Apply your optimizations (from Modules 16-19) -optimized_model = apply_quantization(model) # Module 17 -optimized_model = apply_pruning(optimized_model) # Module 18 -optimized_model = add_custom_kernels(optimized_model) # Module 16 - -# 3. Submit to competition -competition = TinyMLPerfCompetitionPlus() -submission = competition.submit_entry( - team_name="Your Team Name", - event_name="mlp_sprint", - optimized_model=optimized_model, - optimization_description="Quantization + structured pruning + vectorized kernels", - github_url="https://github.com/yourteam/optimization-repo" -) - -# 4. 
View results and leaderboards -competition.display_leaderboard('mlp_sprint') -competition.display_innovation_leaderboard('mlp_sprint') -competition.display_composite_leaderboard('mlp_sprint') -``` - -## Systems Engineering Insights - -### 🏗️ **Professional Benchmarking Practices** -- **Statistical Reliability**: Multiple timing runs with warmup periods -- **Controlled Conditions**: Consistent test environments and data -- **Memory Profiling**: Resource usage analysis beyond timing -- **Evidence Requirements**: GitHub links and reproducibility - -### ⚡ **Multi-Dimensional Optimization** -- **Speed vs. Innovation Balance**: Composite scoring prevents tunnel vision -- **Hardware Independence**: Relative metrics work across platforms -- **Technique Diversity**: Innovation rewards encourage exploration -- **Production Relevance**: Real-world optimization constraints - -### 📊 **Competition-Driven Learning** -- **Concrete Motivation**: Leaderboard rankings drive engagement -- **Peer Learning**: See techniques used by other competitors -- **Iterative Improvement**: Multiple submissions encourage refinement -- **Evidence-Based Claims**: Reproducible performance reporting - -## Prerequisites - -- **Module 15**: Profiling infrastructure for performance measurement -- **Modules 16-19**: Optimization techniques to apply competitively -- **All Previous Modules**: Complete ML systems stack for comprehensive optimization - -## Success Metrics - -Students successfully complete this module when they can: - -1. **Submit Competitive Entries**: Use TinyMLPerf to benchmark optimized models -2. **Achieve Measurable Speedups**: Demonstrate concrete performance improvements -3. **Apply Multiple Techniques**: Combine quantization, pruning, acceleration, memory optimization -4. **Interpret Competition Results**: Understand relative scoring and leaderboard rankings -5. 
**Drive Innovation**: Explore creative optimization approaches for competitive advantage - -## Real-World Applications - -- **ML Competition Platforms**: Kaggle-style optimization competitions -- **Production Deployment**: Resource-constrained optimization for real systems -- **Research Evaluation**: Systematic comparison of optimization techniques -- **Industry Benchmarking**: Performance evaluation standards for ML systems - -## The Ultimate Achievement - -Module 20 represents the culmination of your ML systems optimization journey. Through competitive pressure in TinyMLPerf's three exciting events, you'll apply everything learned from quantization to custom kernels, proving you can optimize ML systems like a professional engineer. - -**Ready to compete? Load your optimized models and prove your mastery in the Olympics of ML Systems Optimization!** 🏆🚀 - ---- - -*This module completes your transformation from ML beginner to systems optimization expert through the power of competitive achievement.* \ No newline at end of file diff --git a/modules_old/20_capstone/capstone_dev.ipynb b/modules_old/20_capstone/capstone_dev.ipynb deleted file mode 100644 index 963ceed2..00000000 --- a/modules_old/20_capstone/capstone_dev.ipynb +++ /dev/null @@ -1,1534 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "ead5731b", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "# Module 20: TinyMLPerf - The Ultimate ML Systems Competition\n", - "\n", - "## Learning Objectives\n", - "By the end of this module, you will be able to:\n", - "\n", - "1. **Build Competition Benchmarking Infrastructure**: Create standardized TinyMLPerf benchmark suite for fair competition\n", - "2. **Use Profiling Tools for Systematic Measurement**: Apply Module 15's profiler to measure real performance gains\n", - "3. **Compete Across Multiple Categories**: Optimize for speed, memory, model size, and innovation simultaneously\n", - "4. 
**Calculate Relative Performance Improvements**: Show speedup ratios independent of hardware differences\n", - "5. **Drive Innovation Through Competition**: Use competitive pressure to discover new optimization techniques\n", - "\n", - "## The TinyMLPerf Vision\n", - "\n", - "**Key Message**: Competition proves optimization mastery by measuring concrete performance improvements across all your TinyTorch implementations!\n", - "\n", - "**The TinyMLPerf Journey:**\n", - "1. **Benchmark Suite**: Load standard models (MLP, CNN, Transformer) as competition workloads\n", - "2. **Profiling Integration**: Use your Module 15 profiler for rigorous performance measurement\n", - "3. **Competition Categories**: Three exciting events - MLP Sprint, CNN Marathon, Transformer Decathlon\n", - "4. **Relative Scoring**: Hardware-independent speedup measurements (3x faster = 3.0 score)\n", - "5. **Leaderboard Glory**: Track innovations and celebrate optimization achievements" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f36cf4db", - "metadata": {}, - "outputs": [], - "source": [ - "#| default_exp utils.benchmark\n", - "\n", - "import time\n", - "import json\n", - "import hashlib\n", - "import tracemalloc\n", - "from datetime import datetime\n", - "from pathlib import Path\n", - "from typing import Dict, Any, List, Optional, Tuple, Union, Callable\n", - "import numpy as np\n", - "import pickle\n", - "\n", - "# Import TinyTorch profiler from Module 15\n", - "try:\n", - " from tinytorch.utils.profiler import SimpleProfiler, profile_function\n", - " HAS_PROFILER = True\n", - "except ImportError:\n", - " print(\"Warning: TinyTorch profiler not available. 
Using basic timing.\")\n", - " HAS_PROFILER = False" - ] - }, - { - "cell_type": "markdown", - "id": "242db3f2", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Part 1: TinyMLPerf Benchmark Suite - Standard Competition Models\n", - "\n", - "Let's build the TinyMLPerf benchmark suite with three exciting competition events using standard models." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "454686b7", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "class TinyMLPerf:\n", - " \"\"\"\n", - " TinyMLPerf benchmark suite - The Olympics of ML Systems Optimization!\n", - " \n", - " Provides three standard competition events:\n", - " - MLP Sprint: Fastest feedforward inference\n", - " - CNN Marathon: Efficient convolution operations \n", - " - Transformer Decathlon: Complete attention-based model performance\n", - " \n", - " Each event uses standardized models and datasets for fair competition.\n", - " \"\"\"\n", - " \n", - " def __init__(self, profiler_warmup_runs: int = 3, profiler_timing_runs: int = 10):\n", - " \"\"\"\n", - " Initialize TinyMLPerf benchmark suite.\n", - " \n", - " Args:\n", - " profiler_warmup_runs: Number of warmup runs for stable measurements\n", - " profiler_timing_runs: Number of timing runs for statistical reliability\n", - " \"\"\"\n", - " self.warmup_runs = profiler_warmup_runs\n", - " self.timing_runs = profiler_timing_runs\n", - " self.benchmark_models = {}\n", - " self.benchmark_datasets = {}\n", - " \n", - " print(\"🏆 TinyMLPerf Competition Suite Initialized!\")\n", - " print(\"🎯 Three Events: MLP Sprint, CNN Marathon, Transformer Decathlon\")\n", - " \n", - " # Load standard benchmark models\n", - " self._load_benchmark_models()\n", - " self._load_benchmark_datasets()\n", - " \n", - " def _load_benchmark_models(self):\n", - " \"\"\"Load standard benchmark models for each competition event\"\"\"\n", - " print(\"📥 Loading TinyMLPerf 
Benchmark Models...\")\n", - " \n", - " # MLP Sprint - Simple feedforward model\n", - " class MLPBenchmark:\n", - " def __init__(self):\n", - " self.weights1 = np.random.randn(784, 128).astype(np.float32) * 0.1\n", - " self.bias1 = np.random.randn(128).astype(np.float32) * 0.1\n", - " self.weights2 = np.random.randn(128, 64).astype(np.float32) * 0.1\n", - " self.bias2 = np.random.randn(64).astype(np.float32) * 0.1 \n", - " self.weights3 = np.random.randn(64, 10).astype(np.float32) * 0.1\n", - " self.bias3 = np.random.randn(10).astype(np.float32) * 0.1\n", - " \n", - " def forward(self, x):\n", - " # 3-layer MLP with ReLU activations\n", - " h1 = np.maximum(0, x @ self.weights1 + self.bias1) # ReLU\n", - " h2 = np.maximum(0, h1 @ self.weights2 + self.bias2) # ReLU \n", - " return h2 @ self.weights3 + self.bias3 # Output layer\n", - " \n", - " def predict(self, x):\n", - " return self.forward(x)\n", - " \n", - " # CNN Marathon - Convolutional model\n", - " class CNNBenchmark:\n", - " def __init__(self):\n", - " # Simplified CNN weights (real CNN would need proper conv operations)\n", - " self.conv1_weights = np.random.randn(3, 3, 1, 32).astype(np.float32) * 0.1\n", - " self.conv2_weights = np.random.randn(3, 3, 32, 64).astype(np.float32) * 0.1\n", - " self.fc_weights = np.random.randn(1600, 10).astype(np.float32) * 0.1 # Flattened size\n", - " self.fc_bias = np.random.randn(10).astype(np.float32) * 0.1\n", - " \n", - " def forward(self, x):\n", - " # Simplified CNN (students will optimize real convolutions)\n", - " batch_size = x.shape[0] \n", - " # Simulate conv + pooling by flattening and projecting\n", - " x_flat = x.reshape(batch_size, -1) # Flatten input\n", - " if x_flat.shape[1] != 1600:\n", - " # Adjust to expected size\n", - " x_flat = x_flat[:, :1600] if x_flat.shape[1] > 1600 else np.pad(x_flat, ((0, 0), (0, 1600 - x_flat.shape[1])), 'constant')\n", - " return x_flat @ self.fc_weights + self.fc_bias\n", - " \n", - " def predict(self, x):\n", - " return 
self.forward(x)\n", - " \n", - " # Transformer Decathlon - Attention-based model \n", - " class TransformerBenchmark:\n", - " def __init__(self, d_model=128, n_heads=8, seq_len=64):\n", - " self.d_model = d_model\n", - " self.n_heads = n_heads\n", - " self.seq_len = seq_len\n", - " self.head_dim = d_model // n_heads\n", - " \n", - " # Multi-head attention weights\n", - " self.wq = np.random.randn(d_model, d_model).astype(np.float32) * 0.1\n", - " self.wk = np.random.randn(d_model, d_model).astype(np.float32) * 0.1 \n", - " self.wv = np.random.randn(d_model, d_model).astype(np.float32) * 0.1\n", - " self.wo = np.random.randn(d_model, d_model).astype(np.float32) * 0.1\n", - " \n", - " # Feed forward weights\n", - " self.ff1 = np.random.randn(d_model, d_model * 4).astype(np.float32) * 0.1\n", - " self.ff2 = np.random.randn(d_model * 4, d_model).astype(np.float32) * 0.1\n", - " \n", - " def forward(self, x):\n", - " # Simplified transformer block (students will optimize real attention)\n", - " batch_size, seq_len, d_model = x.shape\n", - " \n", - " # Self-attention (simplified)\n", - " q = x @ self.wq # [batch, seq, d_model]\n", - " k = x @ self.wk\n", - " v = x @ self.wv\n", - " \n", - " # Simplified attention computation (real would be multi-head)\n", - " scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_model) # [batch, seq, seq]\n", - " attn = np.exp(scores) / (np.sum(np.exp(scores), axis=-1, keepdims=True) + 1e-8)\n", - " out = attn @ v # [batch, seq, d_model]\n", - " \n", - " # Skip connection + layer norm (simplified)\n", - " out = out + x # Residual connection\n", - " \n", - " # Feed forward network\n", - " ff_out = np.maximum(0, out @ self.ff1) # ReLU\n", - " ff_out = ff_out @ self.ff2\n", - " \n", - " # Another skip connection\n", - " out = ff_out + out\n", - " \n", - " # Global average pooling for classification\n", - " return np.mean(out, axis=1) # [batch, d_model]\n", - " \n", - " def predict(self, x):\n", - " return self.forward(x)\n", - " \n", - " # Store 
benchmark models\n", - " self.benchmark_models = {\n", - " 'mlp_sprint': MLPBenchmark(),\n", - " 'cnn_marathon': CNNBenchmark(), \n", - " 'transformer_decathlon': TransformerBenchmark()\n", - " }\n", - " \n", - " print(\"✅ Benchmark models loaded successfully!\")\n", - " for event, model in self.benchmark_models.items():\n", - " print(f\" 📋 {event.title()}: {type(model).__name__}\")\n", - " \n", - " def _load_benchmark_datasets(self):\n", - " \"\"\"Load standard benchmark datasets for each competition event\"\"\"\n", - " print(\"📊 Loading TinyMLPerf Benchmark Datasets...\")\n", - " \n", - " # MLP Sprint dataset - MNIST-like flattened images\n", - " mlp_data = {\n", - " 'inputs': np.random.randn(100, 784).astype(np.float32), # Batch of 100 samples\n", - " 'targets': np.eye(10)[np.random.randint(0, 10, 100)], # One-hot labels\n", - " 'event': 'MLP Sprint',\n", - " 'description': 'Feedforward inference on flattened 28x28 images'\n", - " }\n", - " \n", - " # CNN Marathon dataset - Image-like data\n", - " cnn_data = {\n", - " 'inputs': np.random.randn(50, 28, 28, 1).astype(np.float32), # Batch of 50 images\n", - " 'targets': np.eye(10)[np.random.randint(0, 10, 50)],\n", - " 'event': 'CNN Marathon', \n", - " 'description': 'Convolutional inference on 28x28x1 images'\n", - " }\n", - " \n", - " # Transformer Decathlon dataset - Sequence data\n", - " transformer_data = {\n", - " 'inputs': np.random.randn(32, 64, 128).astype(np.float32), # Batch of 32 sequences\n", - " 'targets': np.eye(10)[np.random.randint(0, 10, 32)],\n", - " 'event': 'Transformer Decathlon',\n", - " 'description': 'Self-attention inference on 64-token sequences'\n", - " }\n", - " \n", - " self.benchmark_datasets = {\n", - " 'mlp_sprint': mlp_data,\n", - " 'cnn_marathon': cnn_data,\n", - " 'transformer_decathlon': transformer_data\n", - " }\n", - " \n", - " print(\"✅ Benchmark datasets loaded successfully!\")\n", - " for event, data in self.benchmark_datasets.items():\n", - " print(f\" 🎯 {data['event']}: 
{data['inputs'].shape} -> {data['targets'].shape}\")\n", - " \n", - " def load_benchmark(self, event_name: str) -> Tuple[Any, Dict[str, Any]]:\n", - " \"\"\"\n", - " Load a specific benchmark model and dataset.\n", - " \n", - " Args:\n", - " event_name: Name of competition event ('mlp_sprint', 'cnn_marathon', 'transformer_decathlon')\n", - " \n", - " Returns:\n", - " Tuple of (model, dataset) for the specified event\n", - " \"\"\"\n", - " if event_name not in self.benchmark_models:\n", - " available = list(self.benchmark_models.keys())\n", - " raise ValueError(f\"Event '{event_name}' not found. Available: {available}\")\n", - " \n", - " model = self.benchmark_models[event_name]\n", - " dataset = self.benchmark_datasets[event_name]\n", - " \n", - " print(f\"📋 Loaded benchmark: {dataset['event']}\")\n", - " print(f\" Model: {type(model).__name__}\")\n", - " print(f\" Data: {dataset['description']}\")\n", - " \n", - " return model, dataset\n", - " \n", - " def get_available_events(self) -> Dict[str, str]:\n", - " \"\"\"Get list of available competition events with descriptions\"\"\"\n", - " return {\n", - " 'mlp_sprint': 'Fastest feedforward neural network inference',\n", - " 'cnn_marathon': 'Efficient convolutional neural network processing',\n", - " 'transformer_decathlon': 'Complete attention mechanism optimization'\n", - " }" - ] - }, - { - "cell_type": "markdown", - "id": "3676ceeb", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### Test TinyMLPerf Benchmark Suite\n", - "\n", - "Let's test the benchmark suite to ensure all models and datasets load correctly." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "919f5680", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "def test_tinymlperf_benchmark_suite():\n", - " \"\"\"Test the TinyMLPerf benchmark suite\"\"\"\n", - " print(\"Testing TinyMLPerf Benchmark Suite...\")\n", - " \n", - " # Initialize benchmark suite\n", - " tinyperf = TinyMLPerf(profiler_warmup_runs=2, profiler_timing_runs=3)\n", - " \n", - " # Test each event\n", - " events = tinyperf.get_available_events()\n", - " print(f\"\\n🏆 Available Events: {len(events)}\")\n", - " \n", - " for event_name, description in events.items():\n", - " print(f\"\\n📋 Testing {event_name}...\")\n", - " model, dataset = tinyperf.load_benchmark(event_name)\n", - " \n", - " # Test model inference\n", - " inputs = dataset['inputs']\n", - " outputs = model.predict(inputs)\n", - " \n", - " print(f\" ✅ Inference successful: {inputs.shape} -> {outputs.shape}\")\n", - " \n", - " # Verify output shape makes sense\n", - " batch_size = inputs.shape[0]\n", - " assert outputs.shape[0] == batch_size, f\"Batch size mismatch: {outputs.shape[0]} != {batch_size}\"\n", - " print(f\" ✅ Output shape verified\")\n", - " \n", - " print(f\"\\n✅ TinyMLPerf benchmark suite test complete!\")\n", - " return tinyperf" - ] - }, - { - "cell_type": "markdown", - "id": "35b18f42", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Part 2: Performance Benchmarking Using Module 15's Profiler\n", - "\n", - "Now let's build the core benchmarking infrastructure that uses the profiler from Module 15 to measure performance." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f89d870e", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "class CompetitionProfiler:\n", - " \"\"\"\n", - " Competition profiling infrastructure using TinyTorch's Module 15 profiler.\n", - " \n", - " Provides rigorous performance measurement for fair competition by:\n", - " - Using standardized profiling from Module 15\n", - " - Multiple timing runs with statistical analysis\n", - " - Memory usage tracking and analysis\n", - " - Hardware-independent relative scoring\n", - " \"\"\"\n", - " \n", - " def __init__(self, warmup_runs: int = 3, timing_runs: int = 10):\n", - " \"\"\"\n", - " Initialize competition profiler.\n", - " \n", - " Args:\n", - " warmup_runs: Number of warmup runs to stabilize performance\n", - " timing_runs: Number of timing runs for statistical reliability \n", - " \"\"\"\n", - " self.warmup_runs = warmup_runs\n", - " self.timing_runs = timing_runs\n", - " self.has_profiler = HAS_PROFILER\n", - " \n", - " if not self.has_profiler:\n", - " print(\"⚠️ Warning: Advanced profiling unavailable, using basic timing\")\n", - " else:\n", - " print(\"✅ Using TinyTorch Module 15 profiler for advanced metrics\")\n", - " \n", - " def benchmark_model(self, model, dataset: Dict[str, Any], \n", - " baseline_model=None, baseline_time: Optional[float] = None) -> Dict[str, Any]:\n", - " \"\"\"\n", - " Benchmark a model using rigorous profiling methodology.\n", - " \n", - " Args:\n", - " model: Model to benchmark (must have predict() or forward() method)\n", - " dataset: Dataset dictionary with 'inputs' key\n", - " baseline_model: Optional baseline model for speedup calculation\n", - " baseline_time: Optional baseline time for speedup calculation\n", - " \n", - " Returns:\n", - " Comprehensive benchmarking results with performance metrics\n", - " \"\"\"\n", - " print(f\"🏁 Benchmarking {dataset.get('event', 'Model')}...\")\n", - " \n", - " inputs = 
dataset['inputs']\n", - " results = {\n", - " 'event': dataset.get('event', 'Unknown'),\n", - " 'model_type': type(model).__name__,\n", - " 'input_shape': inputs.shape,\n", - " 'benchmark_timestamp': datetime.now().isoformat()\n", - " }\n", - " \n", - " if self.has_profiler:\n", - " # Use advanced profiling from Module 15\n", - " results.update(self._profile_with_tinytorch_profiler(model, inputs))\n", - " else:\n", - " # Fallback to basic timing\n", - " results.update(self._profile_basic_timing(model, inputs))\n", - " \n", - " # Calculate speedup if baseline provided\n", - " if baseline_model is not None:\n", - " baseline_results = self.benchmark_model(baseline_model, dataset)\n", - " speedup = baseline_results['mean_inference_time'] / results['mean_inference_time']\n", - " results['speedup_vs_baseline'] = speedup\n", - " elif baseline_time is not None:\n", - " speedup = baseline_time / results['mean_inference_time'] \n", - " results['speedup_vs_baseline'] = speedup\n", - " \n", - " self._print_benchmark_results(results)\n", - " return results\n", - " \n", - " def _profile_with_tinytorch_profiler(self, model, inputs: np.ndarray) -> Dict[str, Any]:\n", - " \"\"\"Profile using Module 15's advanced profiler\"\"\"\n", - " profiler = SimpleProfiler(track_memory=True, track_cpu=True)\n", - " \n", - " # Run multiple profiling sessions for statistical reliability\n", - " profile_results = []\n", - " \n", - " for run in range(self.timing_runs):\n", - " # Each profiling session includes warmup\n", - " result = profiler.profile(\n", - " model.predict, inputs, \n", - " name=f\"inference_run_{run}\",\n", - " warmup=True # Profiler handles warmup\n", - " )\n", - " profile_results.append(result)\n", - " \n", - " # Aggregate statistics across runs\n", - " wall_times = [r['wall_time'] for r in profile_results]\n", - " cpu_times = [r['cpu_time'] for r in profile_results]\n", - " \n", - " aggregated = {\n", - " 'mean_inference_time': np.mean(wall_times),\n", - " 'std_inference_time': 
np.std(wall_times),\n", - " 'min_inference_time': np.min(wall_times), \n", - " 'max_inference_time': np.max(wall_times),\n", - " 'p95_inference_time': np.percentile(wall_times, 95),\n", - " 'mean_cpu_time': np.mean(cpu_times),\n", - " 'cpu_efficiency': np.mean([r['cpu_efficiency'] for r in profile_results]),\n", - " 'profiling_method': 'TinyTorch Module 15 Profiler'\n", - " }\n", - " \n", - " # Add memory metrics from last run (most representative)\n", - " last_result = profile_results[-1]\n", - " if 'memory_delta_mb' in last_result:\n", - " aggregated.update({\n", - " 'memory_delta_mb': last_result['memory_delta_mb'],\n", - " 'peak_memory_mb': last_result['peak_memory_mb'],\n", - " 'result_size_mb': last_result.get('result_size_mb', 0)\n", - " })\n", - " \n", - " return aggregated\n", - " \n", - " def _profile_basic_timing(self, model, inputs: np.ndarray) -> Dict[str, Any]:\n", - " \"\"\"Fallback basic timing without advanced profiling\"\"\"\n", - " \n", - " # Warmup runs\n", - " for _ in range(self.warmup_runs):\n", - " _ = model.predict(inputs)\n", - " \n", - " # Timing runs \n", - " times = []\n", - " for _ in range(self.timing_runs):\n", - " start = time.perf_counter()\n", - " _ = model.predict(inputs)\n", - " end = time.perf_counter()\n", - " times.append(end - start)\n", - " \n", - " return {\n", - " 'mean_inference_time': np.mean(times),\n", - " 'std_inference_time': np.std(times),\n", - " 'min_inference_time': np.min(times),\n", - " 'max_inference_time': np.max(times),\n", - " 'p95_inference_time': np.percentile(times, 95),\n", - " 'profiling_method': 'Basic Timing'\n", - " }\n", - " \n", - " def _print_benchmark_results(self, results: Dict[str, Any]):\n", - " \"\"\"Print formatted benchmark results\"\"\"\n", - " print(f\"\\n📊 {results['event']} Benchmark Results:\")\n", - " print(f\" Model: {results['model_type']}\")\n", - " print(f\" Input: {results['input_shape']}\")\n", - " print(f\" Mean Time: {results['mean_inference_time']*1000:.2f} ± 
{results['std_inference_time']*1000:.2f} ms\")\n", - " print(f\" Best Time: {results['min_inference_time']*1000:.2f} ms\")\n", - " print(f\" P95 Time: {results['p95_inference_time']*1000:.2f} ms\")\n", - " \n", - " if 'speedup_vs_baseline' in results:\n", - " print(f\" 🚀 Speedup: {results['speedup_vs_baseline']:.2f}x faster\")\n", - " \n", - " if 'memory_delta_mb' in results:\n", - " print(f\" 💾 Memory: {results['memory_delta_mb']:.2f} MB delta, {results['peak_memory_mb']:.2f} MB peak\")\n", - " \n", - " print(f\" 📏 Method: {results['profiling_method']}\")" - ] - }, - { - "cell_type": "markdown", - "id": "7ea6de0e", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### Test Competition Profiler\n", - "\n", - "Let's test the competition profiler with TinyMLPerf benchmark models." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4291ee9d", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "def test_competition_profiler():\n", - " \"\"\"Test the competition profiler with benchmark models\"\"\"\n", - " print(\"Testing Competition Profiler...\")\n", - " \n", - " # Initialize TinyMLPerf and profiler\n", - " tinyperf = TinyMLPerf(profiler_warmup_runs=2, profiler_timing_runs=3)\n", - " profiler = CompetitionProfiler(warmup_runs=2, timing_runs=3)\n", - " \n", - " # Test MLP Sprint profiling\n", - " mlp_model, mlp_dataset = tinyperf.load_benchmark('mlp_sprint')\n", - " mlp_results = profiler.benchmark_model(mlp_model, mlp_dataset)\n", - " \n", - " # Test CNN Marathon profiling\n", - " cnn_model, cnn_dataset = tinyperf.load_benchmark('cnn_marathon') \n", - " cnn_results = profiler.benchmark_model(cnn_model, cnn_dataset)\n", - " \n", - " # Test speedup calculation with baseline\n", - " print(f\"\\n🏃 Testing Speedup Calculation...\")\n", - " cnn_speedup_results = profiler.benchmark_model(\n", - " cnn_model, cnn_dataset, \n", - " baseline_time=mlp_results['mean_inference_time'] # Use 
MLP as baseline\n", - " )\n", - " \n", - " print(f\"\\n✅ Competition profiler test complete!\")\n", - " return profiler, mlp_results, cnn_results" - ] - }, - { - "cell_type": "markdown", - "id": "982f40f9", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Part 3: Competition Framework - Leaderboards and Scoring\n", - "\n", - "Now let's build the exciting competition framework with leaderboards, relative scoring, and multiple categories." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "016b4cc6", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "class TinyMLPerfCompetition:\n", - " \"\"\"\n", - " TinyMLPerf Competition Framework - The Olympics of ML Optimization!\n", - " \n", - " Manages three exciting competition events:\n", - " - MLP Sprint: Fastest feedforward network\n", - " - CNN Marathon: Most efficient convolutions \n", - " - Transformer Decathlon: Ultimate attention optimization\n", - " \n", - " Features hardware-independent relative scoring and transparent leaderboards.\n", - " \"\"\"\n", - " \n", - " def __init__(self, results_dir: str = \"tinymlperf_results\"):\n", - " \"\"\"\n", - " Initialize TinyMLPerf competition.\n", - " \n", - " Args:\n", - " results_dir: Directory to store competition results and leaderboards\n", - " \"\"\"\n", - " self.results_dir = Path(results_dir)\n", - " self.results_dir.mkdir(exist_ok=True)\n", - " \n", - " self.tinyperf = TinyMLPerf()\n", - " self.profiler = CompetitionProfiler(warmup_runs=3, timing_runs=5)\n", - " \n", - " # Load baseline models for relative scoring\n", - " self.baselines = self._establish_baselines()\n", - " \n", - " print(\"🏆 TinyMLPerf Competition Initialized!\")\n", - " print(\"🎯 Three Events Ready for Competition!\")\n", - " \n", - " def _establish_baselines(self) -> Dict[str, float]:\n", - " \"\"\"Establish baseline performance for relative scoring\"\"\"\n", - " print(\"📏 Establishing baseline 
performance for relative scoring...\")\n", - " \n", - " baselines = {}\n", - " events = ['mlp_sprint', 'cnn_marathon', 'transformer_decathlon']\n", - " \n", - " for event in events:\n", - " model, dataset = self.tinyperf.load_benchmark(event)\n", - " results = self.profiler.benchmark_model(model, dataset)\n", - " baselines[event] = results['mean_inference_time']\n", - " print(f\" {event}: {baselines[event]*1000:.2f} ms baseline\")\n", - " \n", - " return baselines\n", - " \n", - " def submit_entry(self, team_name: str, event_name: str, optimized_model, \n", - " optimization_description: str = \"\", github_url: str = \"\") -> Dict[str, Any]:\n", - " \"\"\"\n", - " Submit an optimized model to TinyMLPerf competition.\n", - " \n", - " Args:\n", - " team_name: Name of the competing team\n", - " event_name: Competition event ('mlp_sprint', 'cnn_marathon', 'transformer_decathlon')\n", - " optimized_model: The optimized model to submit\n", - " optimization_description: Description of optimization techniques used\n", - " github_url: Link to code repository (for transparency)\n", - " \n", - " Returns:\n", - " Submission results with performance metrics and scoring\n", - " \"\"\"\n", - " if event_name not in self.baselines:\n", - " available = list(self.baselines.keys())\n", - " raise ValueError(f\"Event '{event_name}' not available. 
Choose from: {available}\")\n", - " \n", - " print(f\"🚀 TINYMLPERF SUBMISSION\")\n", - " print(f\"🏆 Event: {event_name.replace('_', ' ').title()}\")\n", - " print(f\"👥 Team: {team_name}\")\n", - " print(\"-\" * 60)\n", - " \n", - " # Load benchmark dataset for this event\n", - " _, dataset = self.tinyperf.load_benchmark(event_name)\n", - " \n", - " # Benchmark the submitted model\n", - " results = self.profiler.benchmark_model(\n", - " optimized_model, dataset,\n", - " baseline_time=self.baselines[event_name]\n", - " )\n", - " \n", - " # Calculate competition score (relative speedup)\n", - " baseline_time = self.baselines[event_name]\n", - " submission_time = results['mean_inference_time']\n", - " speedup_score = baseline_time / submission_time\n", - " \n", - " # Create submission record\n", - " submission = {\n", - " 'submission_id': self._generate_submission_id(team_name, event_name),\n", - " 'timestamp': datetime.now().isoformat(),\n", - " 'team_name': team_name,\n", - " 'event_name': event_name,\n", - " 'optimization_description': optimization_description,\n", - " 'github_url': github_url,\n", - " 'performance_metrics': results,\n", - " 'speedup_score': speedup_score,\n", - " 'baseline_time_ms': baseline_time * 1000,\n", - " 'submission_time_ms': submission_time * 1000\n", - " }\n", - " \n", - " # Save submission\n", - " self._save_submission(submission)\n", - " \n", - " # Display results\n", - " self._display_submission_results(submission)\n", - " \n", - " return submission\n", - " \n", - " def _generate_submission_id(self, team_name: str, event_name: str) -> str:\n", - " \"\"\"Generate unique submission ID\"\"\"\n", - " timestamp = datetime.now().strftime(\"%Y%m%d_%H%M%S\")\n", - " team_hash = hashlib.md5(team_name.encode()).hexdigest()[:6]\n", - " return f\"{event_name}_{team_hash}_{timestamp}\"\n", - " \n", - " def _save_submission(self, submission: Dict[str, Any]):\n", - " \"\"\"Save submission to results directory\"\"\"\n", - " filename = 
f\"{submission['submission_id']}.json\"\n", - " filepath = self.results_dir / filename\n", - " \n", - " with open(filepath, 'w') as f:\n", - " json.dump(submission, f, indent=2, default=str)\n", - " \n", - " print(f\"💾 Submission saved: {filepath}\")\n", - " \n", - " def _display_submission_results(self, submission: Dict[str, Any]):\n", - " \"\"\"Display formatted submission results\"\"\"\n", - " metrics = submission['performance_metrics']\n", - " speedup = submission['speedup_score']\n", - " \n", - " print(f\"\\n🏆 SUBMISSION RESULTS\")\n", - " print(f\"=\" * 50)\n", - " print(f\"Team: {submission['team_name']}\")\n", - " print(f\"Event: {submission['event_name'].replace('_', ' ').title()}\")\n", - " \n", - " print(f\"\\n⏱️ Performance:\")\n", - " print(f\" Your Time: {submission['submission_time_ms']:.2f} ms\")\n", - " print(f\" Baseline: {submission['baseline_time_ms']:.2f} ms\")\n", - " print(f\" 🚀 Speedup: {speedup:.2f}x {'FASTER' if speedup > 1.0 else 'slower'}\")\n", - " \n", - " if 'memory_delta_mb' in metrics:\n", - " print(f\" 💾 Memory: {metrics['memory_delta_mb']:.2f} MB\")\n", - " \n", - " # Award celebration for good performance\n", - " if speedup >= 3.0:\n", - " print(f\"\\n🎉 AMAZING! 3x+ speedup achieved!\")\n", - " elif speedup >= 2.0:\n", - " print(f\"\\n🏆 EXCELLENT! 2x+ speedup!\")\n", - " elif speedup >= 1.5:\n", - " print(f\"\\n⭐ GREAT! 
50%+ speedup!\")\n", - " elif speedup >= 1.1:\n", - " print(f\"\\n✅ Good optimization!\")\n", - " else:\n", - " print(f\"\\n🤔 Keep optimizing - you can do better!\")\n", - " \n", - " if submission['optimization_description']:\n", - " print(f\"\\n💡 Techniques Used:\")\n", - " print(f\" {submission['optimization_description']}\")\n", - " \n", - " def display_leaderboard(self, event_name: str, top_n: int = 10) -> List[Dict[str, Any]]:\n", - " \"\"\"\n", - " Display leaderboard for a specific event.\n", - " \n", - " Args:\n", - " event_name: Event to show leaderboard for\n", - " top_n: Number of top entries to display\n", - " \n", - " Returns:\n", - " List of top submissions\n", - " \"\"\"\n", - " submissions = self._load_event_submissions(event_name)\n", - " \n", - " if not submissions:\n", - " print(f\"🏆 {event_name.replace('_', ' ').title()} Leaderboard\")\n", - " print(\"No submissions yet! Be the first to compete!\")\n", - " return []\n", - " \n", - " # Sort by speedup score (highest first)\n", - " submissions.sort(key=lambda s: s['speedup_score'], reverse=True)\n", - " top_submissions = submissions[:top_n]\n", - " \n", - " print(f\"\\n🏆 TINYMLPERF LEADERBOARD - {event_name.replace('_', ' ').title()}\")\n", - " print(\"=\" * 80)\n", - " print(f\"{'Rank':<6} {'Team':<20} {'Speedup':<10} {'Time (ms)':<12} {'Techniques':<25}\")\n", - " print(\"-\" * 80)\n", - " \n", - " for i, submission in enumerate(top_submissions):\n", - " rank = i + 1\n", - " team = submission['team_name'][:19]\n", - " speedup = f\"{submission['speedup_score']:.2f}x\"\n", - " time_ms = f\"{submission['submission_time_ms']:.2f}\"\n", - " techniques = submission['optimization_description'][:24] + \"...\" if len(submission['optimization_description']) > 24 else submission['optimization_description']\n", - " \n", - " print(f\"{rank:<6} {team:<20} {speedup:<10} {time_ms:<12} {techniques:<25}\")\n", - " \n", - " print(\"-\" * 80)\n", - " print(f\"Showing top {len(top_submissions)} of {len(submissions)} 
submissions\")\n", - " \n", - " return top_submissions\n", - " \n", - " def display_all_leaderboards(self):\n", - " \"\"\"Display leaderboards for all events\"\"\"\n", - " events = ['mlp_sprint', 'cnn_marathon', 'transformer_decathlon']\n", - " \n", - " for event in events:\n", - " self.display_leaderboard(event, top_n=5)\n", - " print()\n", - " \n", - " def _load_event_submissions(self, event_name: str) -> List[Dict[str, Any]]:\n", - " \"\"\"Load all submissions for a specific event\"\"\"\n", - " submissions = []\n", - " \n", - " for filepath in self.results_dir.glob(f\"{event_name}_*.json\"):\n", - " try:\n", - " with open(filepath, 'r') as f:\n", - " submission = json.load(f)\n", - " submissions.append(submission)\n", - " except Exception as e:\n", - " print(f\"Warning: Could not load {filepath}: {e}\")\n", - " \n", - " return submissions\n", - " \n", - " def get_team_progress(self, team_name: str) -> Dict[str, List[Dict[str, Any]]]:\n", - " \"\"\"Get all submissions from a specific team across all events\"\"\"\n", - " all_files = list(self.results_dir.glob(\"*.json\"))\n", - " team_submissions = {'mlp_sprint': [], 'cnn_marathon': [], 'transformer_decathlon': []}\n", - " \n", - " for filepath in all_files:\n", - " try:\n", - " with open(filepath, 'r') as f:\n", - " submission = json.load(f)\n", - " if submission['team_name'] == team_name:\n", - " event = submission['event_name']\n", - " if event in team_submissions:\n", - " team_submissions[event].append(submission)\n", - " except Exception as e:\n", - " continue\n", - " \n", - " # Sort by timestamp\n", - " for event in team_submissions:\n", - " team_submissions[event].sort(key=lambda s: s['timestamp'])\n", - " \n", - " return team_submissions" - ] - }, - { - "cell_type": "markdown", - "id": "c164bce1", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### Test TinyMLPerf Competition Framework\n", - "\n", - "Let's test the competition framework with multiple team 
submissions and leaderboards." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "64308dff", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "def test_tinymlperf_competition():\n", - " \"\"\"Test the TinyMLPerf competition framework\"\"\"\n", - " print(\"Testing TinyMLPerf Competition Framework...\")\n", - " \n", - " # Initialize competition\n", - " competition = TinyMLPerfCompetition()\n", - " \n", - " # Create some test optimized models\n", - " class FastMLPModel:\n", - " \"\"\"Simulated optimized MLP - smaller and faster\"\"\"\n", - " def __init__(self):\n", - " # Smaller model for speed\n", - " self.weights1 = np.random.randn(784, 64).astype(np.float32) * 0.1\n", - " self.bias1 = np.random.randn(64).astype(np.float32) * 0.1\n", - " self.weights2 = np.random.randn(64, 10).astype(np.float32) * 0.1 \n", - " self.bias2 = np.random.randn(10).astype(np.float32) * 0.1\n", - " \n", - " def predict(self, x):\n", - " h1 = np.maximum(0, x @ self.weights1 + self.bias1)\n", - " return h1 @ self.weights2 + self.bias2\n", - " \n", - " class EfficientCNNModel:\n", - " \"\"\"Simulated optimized CNN\"\"\"\n", - " def __init__(self):\n", - " # Optimized weights\n", - " self.fc_weights = np.random.randn(1600, 10).astype(np.float32) * 0.05\n", - " self.fc_bias = np.random.randn(10).astype(np.float32) * 0.05\n", - " \n", - " def predict(self, x):\n", - " batch_size = x.shape[0]\n", - " x_flat = x.reshape(batch_size, -1)\n", - " if x_flat.shape[1] != 1600:\n", - " x_flat = x_flat[:, :1600] if x_flat.shape[1] > 1600 else np.pad(x_flat, ((0, 0), (0, 1600 - x_flat.shape[1])), 'constant')\n", - " return x_flat @ self.fc_weights + self.fc_bias\n", - " \n", - " # Submit optimized models to competition\n", - " print(\"\\n🚀 Submitting Competition Entries...\")\n", - " \n", - " # MLP Sprint submissions\n", - " mlp_submission1 = competition.submit_entry(\n", - " team_name=\"Speed Demons\",\n", - " event_name=\"mlp_sprint\",\n", - " 
optimized_model=FastMLPModel(),\n", - " optimization_description=\"Reduced hidden layer size for 2x speedup\",\n", - " github_url=\"https://github.com/speed-demons/fast-mlp\"\n", - " )\n", - " \n", - " mlp_submission2 = competition.submit_entry(\n", - " team_name=\"Lightning Fast\", \n", - " event_name=\"mlp_sprint\",\n", - " optimized_model=FastMLPModel(),\n", - " optimization_description=\"Quantization + kernel optimization\",\n", - " github_url=\"https://github.com/lightning-fast/mlp-opt\"\n", - " )\n", - " \n", - " # CNN Marathon submission\n", - " cnn_submission = competition.submit_entry(\n", - " team_name=\"CNN Champions\",\n", - " event_name=\"cnn_marathon\", \n", - " optimized_model=EfficientCNNModel(),\n", - " optimization_description=\"Custom convolution kernels + memory optimization\",\n", - " github_url=\"https://github.com/cnn-champions/efficient-cnn\"\n", - " )\n", - " \n", - " # Display leaderboards\n", - " print(\"\\n📊 Competition Leaderboards:\")\n", - " competition.display_all_leaderboards()\n", - " \n", - " print(\"\\n✅ TinyMLPerf competition framework test complete!\")\n", - " return competition" - ] - }, - { - "cell_type": "markdown", - "id": "e89abe4e", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Part 4: Innovation Tracking and Advanced Scoring\n", - "\n", - "Let's add innovation detection and advanced scoring to reward creative optimization techniques." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "39a4324b", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "class InnovationDetector:\n", - " \"\"\"\n", - " Detect and score innovative optimization techniques in submitted models.\n", - " \n", - " Rewards creativity by analyzing models for advanced optimization patterns:\n", - " - Quantization techniques\n", - " - Pruning strategies \n", - " - Knowledge distillation\n", - " - Custom kernel implementations\n", - " - Novel architectural innovations\n", - " \"\"\"\n", - " \n", - " def __init__(self):\n", - " \"\"\"Initialize innovation detector\"\"\"\n", - " self.innovation_patterns = {\n", - " 'quantization': ['quantized', 'int8', 'int16', 'low_precision', 'quantize'],\n", - " 'pruning': ['pruned', 'sparse', 'sparsity', 'prune', 'structured_pruning'],\n", - " 'distillation': ['distilled', 'teacher', 'student', 'knowledge_distillation', 'kd'],\n", - " 'custom_kernels': ['custom_kernel', 'optimized_kernel', 'cuda', 'vectorized', 'simd'],\n", - " 'memory_optimization': ['memory_pool', 'in_place', 'gradient_checkpointing', 'memory_efficient'],\n", - " 'compression': ['compressed', 'huffman', 'lz4', 'weight_sharing', 'parameter_sharing']\n", - " }\n", - " \n", - " def analyze_innovation(self, model, optimization_description: str) -> Dict[str, Any]:\n", - " \"\"\"\n", - " Analyze a model for innovative optimization techniques.\n", - " \n", - " Args:\n", - " model: The optimized model to analyze\n", - " optimization_description: Text description of optimizations\n", - " \n", - " Returns:\n", - " Innovation analysis with detected techniques and scores\n", - " \"\"\"\n", - " innovation_score = 0.0\n", - " detected_techniques = []\n", - " \n", - " # Analyze optimization description\n", - " desc_lower = optimization_description.lower()\n", - " \n", - " for technique, patterns in self.innovation_patterns.items():\n", - " for pattern in patterns:\n", - " if pattern in 
desc_lower:\n", - " detected_techniques.append(technique)\n", - " innovation_score += 0.2\n", - " break # Only count each technique once\n", - " \n", - " # Analyze model attributes for innovation markers\n", - " model_innovation = self._analyze_model_attributes(model)\n", - " detected_techniques.extend(model_innovation['techniques'])\n", - " innovation_score += model_innovation['score']\n", - " \n", - " # Bonus for multiple techniques (creativity reward)\n", - " if len(detected_techniques) >= 3:\n", - " innovation_score += 0.3 # Combination bonus\n", - " \n", - " # Cap innovation score\n", - " innovation_score = min(innovation_score, 1.0)\n", - " \n", - " return {\n", - " 'innovation_score': innovation_score,\n", - " 'detected_techniques': list(set(detected_techniques)), # Remove duplicates\n", - " 'num_techniques': len(set(detected_techniques)),\n", - " 'creativity_bonus': len(detected_techniques) >= 3\n", - " }\n", - " \n", - " def _analyze_model_attributes(self, model) -> Dict[str, Any]:\n", - " \"\"\"Analyze model object for innovation attributes\"\"\"\n", - " techniques = []\n", - " score = 0.0\n", - " \n", - " # Check for common optimization attributes\n", - " optimization_attributes = [\n", - " ('quantized', 'quantization'),\n", - " ('pruned', 'pruning'),\n", - " ('distilled', 'distillation'),\n", - " ('compressed', 'compression'),\n", - " ('memory_optimized', 'memory_optimization'),\n", - " ('custom_kernels', 'custom_kernels')\n", - " ]\n", - " \n", - " for attr, technique in optimization_attributes:\n", - " if hasattr(model, attr) and getattr(model, attr):\n", - " techniques.append(technique)\n", - " score += 0.15\n", - " \n", - " # Check for unusual model architectures (creativity indicator)\n", - " if hasattr(model, 'innovative_architecture') and getattr(model, 'innovative_architecture'):\n", - " techniques.append('novel_architecture')\n", - " score += 0.25\n", - " \n", - " return {'techniques': techniques, 'score': score}\n", - " \n", - " def 
generate_innovation_report(self, analysis: Dict[str, Any]) -> str:\n", - " \"\"\"Generate human-readable innovation report\"\"\"\n", - " score = analysis['innovation_score']\n", - " techniques = analysis['detected_techniques']\n", - " \n", - " if score == 0:\n", - " return \"No innovative techniques detected. Consider exploring quantization, pruning, or custom optimizations!\"\n", - " \n", - " report = f\"Innovation Score: {score:.2f}/1.00\\n\"\n", - " report += f\"Detected Techniques ({len(techniques)}):\\n\"\n", - " \n", - " for technique in techniques:\n", - " report += f\" • {technique.replace('_', ' ').title()}\\n\"\n", - " \n", - " if analysis['creativity_bonus']:\n", - " report += \"🌟 Creativity Bonus: Multiple optimization techniques combined!\\n\"\n", - " \n", - " # Award levels\n", - " if score >= 0.8:\n", - " report += \"🏆 INNOVATION MASTER - Outstanding creativity!\"\n", - " elif score >= 0.6:\n", - " report += \"🚀 INNOVATION EXPERT - Excellent techniques!\"\n", - " elif score >= 0.4:\n", - " report += \"⭐ INNOVATION PRACTITIONER - Good optimization work!\"\n", - " else:\n", - " report += \"🔍 INNOVATION EXPLORER - Keep experimenting!\"\n", - " \n", - " return report\n", - "\n", - "# Enhanced competition class with innovation scoring\n", - "class TinyMLPerfCompetitionPlus(TinyMLPerfCompetition):\n", - " \"\"\"\n", - " Enhanced TinyMLPerf Competition with innovation detection and advanced scoring.\n", - " \n", - " Extends the base competition with:\n", - " - Innovation technique detection\n", - " - Advanced composite scoring\n", - " - Creativity rewards\n", - " - Multi-dimensional leaderboards\n", - " \"\"\"\n", - " \n", - " def __init__(self, results_dir: str = \"tinymlperf_results\"):\n", - " \"\"\"Initialize enhanced competition with innovation detection\"\"\"\n", - " super().__init__(results_dir)\n", - " self.innovation_detector = InnovationDetector()\n", - " print(\"🔬 Innovation detection enabled!\")\n", - " \n", - " def submit_entry(self, team_name: 
str, event_name: str, optimized_model,\n", - " optimization_description: str = \"\", github_url: str = \"\") -> Dict[str, Any]:\n", - " \"\"\"Submit entry with innovation analysis\"\"\"\n", - " \n", - " # Get base submission\n", - " submission = super().submit_entry(team_name, event_name, optimized_model, \n", - " optimization_description, github_url)\n", - " \n", - " # Add innovation analysis\n", - " innovation_analysis = self.innovation_detector.analyze_innovation(\n", - " optimized_model, optimization_description\n", - " )\n", - " \n", - " submission['innovation_analysis'] = innovation_analysis\n", - " \n", - " # Calculate composite score (speed + innovation)\n", - " speed_score = submission['speedup_score'] # Relative speedup\n", - " innovation_score = innovation_analysis['innovation_score']\n", - " \n", - " # Weighted composite: 70% speed, 30% innovation\n", - " composite_score = 0.7 * speed_score + 0.3 * innovation_score\n", - " submission['composite_score'] = composite_score\n", - " \n", - " # Display innovation results\n", - " print(f\"\\n🔬 Innovation Analysis:\")\n", - " innovation_report = self.innovation_detector.generate_innovation_report(innovation_analysis)\n", - " print(innovation_report)\n", - " print(f\"\\n🏆 Composite Score: {composite_score:.3f} (Speed: {speed_score:.2f}, Innovation: {innovation_score:.2f})\")\n", - " \n", - " # Re-save with innovation data\n", - " self._save_submission(submission)\n", - " \n", - " return submission\n", - " \n", - " def display_innovation_leaderboard(self, event_name: str, top_n: int = 10):\n", - " \"\"\"Display leaderboard ranked by innovation score\"\"\"\n", - " submissions = self._load_event_submissions(event_name)\n", - " \n", - " # Filter submissions with innovation data\n", - " innovation_submissions = [s for s in submissions if 'innovation_analysis' in s]\n", - " \n", - " if not innovation_submissions:\n", - " print(f\"🔬 Innovation Leaderboard - {event_name.replace('_', ' ').title()}\")\n", - " print(\"No 
innovation submissions yet!\")\n", - " return\n", - " \n", - " # Sort by innovation score\n", - " innovation_submissions.sort(key=lambda s: s['innovation_analysis']['innovation_score'], reverse=True)\n", - " top_submissions = innovation_submissions[:top_n]\n", - " \n", - " print(f\"\\n🔬 INNOVATION LEADERBOARD - {event_name.replace('_', ' ').title()}\")\n", - " print(\"=\" * 80)\n", - " print(f\"{'Rank':<6} {'Team':<20} {'Innovation':<12} {'Techniques':<8} {'Description':<25}\")\n", - " print(\"-\" * 80)\n", - " \n", - " for i, submission in enumerate(top_submissions):\n", - " rank = i + 1\n", - " team = submission['team_name'][:19]\n", - " innovation = f\"{submission['innovation_analysis']['innovation_score']:.3f}\"\n", - " num_tech = submission['innovation_analysis']['num_techniques']\n", - " description = submission['optimization_description'][:24]\n", - " \n", - " print(f\"{rank:<6} {team:<20} {innovation:<12} {num_tech:<8} {description:<25}\")\n", - " \n", - " print(\"-\" * 80)\n", - " print(f\"Top {len(top_submissions)} most innovative submissions\")\n", - " \n", - " def display_composite_leaderboard(self, event_name: str, top_n: int = 10):\n", - " \"\"\"Display leaderboard ranked by composite score (speed + innovation)\"\"\"\n", - " submissions = self._load_event_submissions(event_name)\n", - " \n", - " # Filter submissions with composite scores\n", - " composite_submissions = [s for s in submissions if 'composite_score' in s]\n", - " \n", - " if not composite_submissions:\n", - " print(f\"🏆 Composite Leaderboard - {event_name.replace('_', ' ').title()}\")\n", - " print(\"No composite submissions yet!\")\n", - " return\n", - " \n", - " # Sort by composite score\n", - " composite_submissions.sort(key=lambda s: s['composite_score'], reverse=True)\n", - " top_submissions = composite_submissions[:top_n]\n", - " \n", - " print(f\"\\n🏆 COMPOSITE LEADERBOARD - {event_name.replace('_', ' ').title()}\")\n", - " print(\"=\" * 90) \n", - " print(f\"{'Rank':<6} 
{'Team':<18} {'Composite':<11} {'Speed':<9} {'Innovation':<11} {'Techniques'}\")\n", - " print(\"-\" * 90)\n", - " \n", - " for i, submission in enumerate(top_submissions):\n", - " rank = i + 1\n", - " team = submission['team_name'][:17]\n", - " composite = f\"{submission['composite_score']:.3f}\"\n", - " speed = f\"{submission['speedup_score']:.2f}x\"\n", - " innovation = f\"{submission['innovation_analysis']['innovation_score']:.3f}\"\n", - " techniques = \", \".join(submission['innovation_analysis']['detected_techniques'][:3])[:20]\n", - " \n", - " print(f\"{rank:<6} {team:<18} {composite:<11} {speed:<9} {innovation:<11} {techniques}\")\n", - " \n", - " print(\"-\" * 90)\n", - " print(f\"Top {len(top_submissions)} best overall submissions (70% speed + 30% innovation)\")\n", - " \n", - " def display_all_enhanced_leaderboards(self):\n", - " \"\"\"Display all leaderboard types for all events\"\"\"\n", - " events = ['mlp_sprint', 'cnn_marathon', 'transformer_decathlon']\n", - " \n", - " for event in events:\n", - " print(f\"\\n{'='*60}\")\n", - " print(f\"🏆 {event.replace('_', ' ').title()} - All Leaderboards\")\n", - " print(f\"{'='*60}\")\n", - " \n", - " # Speed leaderboard \n", - " self.display_leaderboard(event, top_n=5)\n", - " print()\n", - " \n", - " # Innovation leaderboard\n", - " self.display_innovation_leaderboard(event, top_n=5)\n", - " print()\n", - " \n", - " # Composite leaderboard\n", - " self.display_composite_leaderboard(event, top_n=5)\n", - " print()" - ] - }, - { - "cell_type": "markdown", - "id": "b34233c4", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "### Test Enhanced Competition with Innovation Detection\n", - "\n", - "Let's test the enhanced competition framework with innovation detection." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "49d82963", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "def test_enhanced_competition():\n", - " \"\"\"Test enhanced competition with innovation detection\"\"\"\n", - " print(\"Testing Enhanced TinyMLPerf Competition...\")\n", - " \n", - " # Initialize enhanced competition\n", - " competition = TinyMLPerfCompetitionPlus()\n", - " \n", - " # Create innovative models with optimization attributes\n", - " class QuantizedFastMLP:\n", - " \"\"\"Simulated quantized MLP\"\"\"\n", - " def __init__(self):\n", - " self.weights1 = np.random.randn(784, 64).astype(np.int8) # Quantized weights\n", - " self.bias1 = np.random.randn(64).astype(np.float32) * 0.1\n", - " self.weights2 = np.random.randn(64, 10).astype(np.int8)\n", - " self.bias2 = np.random.randn(10).astype(np.float32) * 0.1\n", - " self.quantized = True # Innovation marker\n", - " \n", - " def predict(self, x):\n", - " # Simulate quantized computation\n", - " h1 = np.maximum(0, x @ self.weights1.astype(np.float32) * 0.1 + self.bias1)\n", - " return h1 @ self.weights2.astype(np.float32) * 0.1 + self.bias2\n", - " \n", - " class PrunedCNN:\n", - " \"\"\"Simulated pruned CNN\"\"\"\n", - " def __init__(self):\n", - " self.fc_weights = np.random.randn(1600, 10).astype(np.float32) * 0.05\n", - " self.fc_bias = np.random.randn(10).astype(np.float32) * 0.05\n", - " self.pruned = True # Innovation marker\n", - " self.sparsity = 0.7 # 70% of weights pruned\n", - " \n", - " def predict(self, x):\n", - " batch_size = x.shape[0]\n", - " x_flat = x.reshape(batch_size, -1)\n", - " if x_flat.shape[1] != 1600:\n", - " x_flat = x_flat[:, :1600] if x_flat.shape[1] > 1600 else np.pad(x_flat, ((0, 0), (0, 1600 - x_flat.shape[1])), 'constant')\n", - " return x_flat @ self.fc_weights + self.fc_bias\n", - " \n", - " # Submit innovative entries\n", - " print(\"\\n🚀 Submitting Innovative Entries...\")\n", - " \n", - " # Quantized MLP 
submission\n", - " quantized_submission = competition.submit_entry(\n", - " team_name=\"Quantum Quantizers\",\n", - " event_name=\"mlp_sprint\",\n", - " optimized_model=QuantizedFastMLP(),\n", - " optimization_description=\"INT8 quantization with custom SIMD kernels for 3x speedup\",\n", - " github_url=\"https://github.com/quantum-quantizers/quantized-mlp\"\n", - " )\n", - " \n", - " # Pruned CNN submission\n", - " pruned_submission = competition.submit_entry(\n", - " team_name=\"Pruning Pioneers\", \n", - " event_name=\"cnn_marathon\",\n", - " optimized_model=PrunedCNN(),\n", - " optimization_description=\"Structured pruning + knowledge distillation + memory optimization\",\n", - " github_url=\"https://github.com/pruning-pioneers/pruned-cnn\"\n", - " )\n", - " \n", - " # Display enhanced leaderboards\n", - " print(\"\\n📊 Enhanced Competition Leaderboards:\")\n", - " competition.display_all_enhanced_leaderboards()\n", - " \n", - " print(\"\\n✅ Enhanced competition test complete!\")\n", - " return competition" - ] - }, - { - "cell_type": "markdown", - "id": "065ec776", - "metadata": { - "cell_marker": "\"\"\"", - "lines_to_next_cell": 1 - }, - "source": [ - "## Comprehensive Testing\n", - "\n", - "Let's run a complete TinyMLPerf competition demonstration with all features." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "70ec3a07", - "metadata": { - "lines_to_next_cell": 1 - }, - "outputs": [], - "source": [ - "def run_complete_tinymlperf_demo():\n", - " \"\"\"Run comprehensive TinyMLPerf competition demonstration\"\"\"\n", - " print(\"🏆 TINYMLPERF - THE ULTIMATE ML SYSTEMS COMPETITION\")\n", - " print(\"=\" * 80)\n", - " \n", - " print(\"\\n1. 🏗️ Setting up TinyMLPerf Benchmark Suite...\")\n", - " # Test benchmark suite\n", - " tinyperf = test_tinymlperf_benchmark_suite()\n", - " \n", - " print(\"\\n2. 
⚡ Testing Competition Profiling...\") \n", - " # Test profiling infrastructure\n", - " profiler, mlp_results, cnn_results = test_competition_profiler()\n", - " \n", - " print(\"\\n3. 🚀 Running Basic Competition...\")\n", - " # Test basic competition\n", - " basic_competition = test_tinymlperf_competition()\n", - " \n", - " print(\"\\n4. 🔬 Testing Enhanced Competition with Innovation...\")\n", - " # Test enhanced competition\n", - " enhanced_competition = test_enhanced_competition()\n", - " \n", - " print(\"\\n\" + \"=\" * 80)\n", - " print(\"🎉 TINYMLPERF DEMO COMPLETE!\")\n", - " print(\"=\" * 80)\n", - " \n", - " print(\"\\n🏆 TinyMLPerf Competition Ready:\")\n", - " print(\"✅ Three exciting events: MLP Sprint, CNN Marathon, Transformer Decathlon\") \n", - " print(\"✅ TinyTorch Module 15 profiler integration for rigorous benchmarking\")\n", - " print(\"✅ Hardware-independent relative scoring (speedup ratios)\")\n", - " print(\"✅ Transparent leaderboards with evidence requirements\")\n", - " print(\"✅ Innovation detection and creativity rewards\")\n", - " print(\"✅ Composite scoring balancing speed and innovation\")\n", - " \n", - " print(\"\\n🚀 Competition Features:\")\n", - " print(\"• Standardized benchmark models and datasets\")\n", - " print(\"• Statistical reliability with multiple timing runs\")\n", - " print(\"• Multiple leaderboard categories (speed, innovation, composite)\")\n", - " print(\"• GitHub integration for transparency and reproducibility\")\n", - " print(\"• Automatic technique detection and innovation scoring\")\n", - " \n", - " print(\"\\n🎯 Ready to Compete:\")\n", - " print(\"1. Optimize your models using techniques from Modules 16-19\")\n", - " print(\"2. Submit to TinyMLPerf events using competition.submit_entry()\")\n", - " print(\"3. See your results on leaderboards instantly\") \n", - " print(\"4. Iterate and improve based on performance feedback\")\n", - " print(\"5. 
Prove your ML systems optimization mastery!\")\n", - " \n", - " return {\n", - " 'benchmark_suite': tinyperf,\n", - " 'profiler': profiler,\n", - " 'basic_competition': basic_competition, \n", - " 'enhanced_competition': enhanced_competition\n", - " }" - ] - }, - { - "cell_type": "markdown", - "id": "1145585e", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## Systems Analysis Summary\n", - "\n", - "This TinyMLPerf competition module demonstrates advanced ML systems engineering through competitive benchmarking:\n", - "\n", - "### 🏗️ **Competition Infrastructure Excellence**\n", - "- **Standardized Benchmarking**: Fair competition through consistent profiling protocols using Module 15's profiler\n", - "- **Statistical Rigor**: Multiple timing runs with warmup periods ensure reliable performance measurements\n", - "- **Hardware Independence**: Relative speedup scoring allows fair competition across different hardware platforms\n", - "- **Transparency Requirements**: GitHub integration and evidence tracking prevent gaming and ensure reproducibility\n", - "\n", - "### ⚡ **Multi-Dimensional Performance Optimization**\n", - "- **Speed Optimization**: Direct latency measurement rewarding inference performance improvements\n", - "- **Innovation Detection**: Automated recognition of advanced techniques like quantization, pruning, distillation\n", - "- **Composite Scoring**: Balanced evaluation combining speed improvements with optimization creativity\n", - "- **Multiple Event Categories**: MLP Sprint, CNN Marathon, Transformer Decathlon test different optimization domains\n", - "\n", - "### 📊 **Systematic Competition Analysis**\n", - "- **TinyTorch Profiler Integration**: Leverages Module 15's profiling infrastructure for consistent measurement\n", - "- **Memory Tracking**: Comprehensive resource usage analysis beyond just timing measurements\n", - "- **Progress Tracking**: Team improvement analysis across multiple submissions and iterations\n", - "- 
**Leaderboard Visualization**: Multiple ranking systems (speed, innovation, composite) prevent tunnel vision\n", - "\n", - "### 💡 **Production ML Systems Insights**\n", - "- **Benchmarking Best Practices**: Industry-standard profiling methodology with warmup and statistical analysis\n", - "- **Optimization Technique Recognition**: Systematic detection of real-world optimization approaches\n", - "- **Performance Claims Validation**: Evidence-based performance reporting with reproducible results\n", - "- **Resource Constraint Awareness**: Multi-metric evaluation reflecting production deployment considerations\n", - "\n", - "### 🎯 **Key Educational Insights**\n", - "- Competition accelerates optimization learning by making improvements concrete and measurable\n", - "- Hardware-independent scoring ensures fair comparison while teaching relative performance analysis\n", - "- Innovation detection rewards creativity and exposure to diverse optimization techniques\n", - "- Multiple leaderboards prevent single-metric optimization and encourage balanced system thinking\n", - "- Evidence requirements teach reproducibility and honest performance reporting practices\n", - "\n", - "### 🏆 **The Ultimate Learning Achievement**\n", - "This competition framework proves students can systematically optimize ML systems for real production constraints. By combining techniques from Modules 16-19 (quantization, pruning, acceleration, memory optimization), students demonstrate mastery of the complete ML systems optimization stack through measurable competitive performance.\n", - "\n", - "The TinyMLPerf competition transforms optimization from abstract concepts into concrete, competitive achievements that mirror real-world ML systems engineering challenges." 
- ] - }, - { - "cell_type": "markdown", - "id": "5e34927e", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## Main Execution Block\n", - "\n", - "Run the complete TinyMLPerf competition system when this module is executed directly." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f7dfaddb", - "metadata": {}, - "outputs": [], - "source": [ - "if __name__ == \"__main__\":\n", - " print(\"Module 20: TinyMLPerf - The Ultimate ML Systems Competition\")\n", - " print(\"=\" * 80)\n", - " \n", - " # Run complete TinyMLPerf demonstration\n", - " results = run_complete_tinymlperf_demo()\n", - " \n", - " print(f\"\\n🎉 Module 20 complete!\")\n", - " print(f\"🏆 TinyMLPerf competition infrastructure ready!\")\n", - " print(f\"🚀 Time to optimize your models and climb the leaderboards!\")" - ] - }, - { - "cell_type": "markdown", - "id": "8f95ba18", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🤔 ML Systems Thinking: Interactive Questions\n", - "\n", - "1. **Why use hardware-independent relative scoring in ML competitions?** Your TinyMLPerf uses speedup ratios rather than absolute timing. Explain why this enables fair competition across different hardware platforms and how this mirrors real production environments where optimization techniques must be portable across diverse deployment targets.\n", - "\n", - "2. **How does competitive benchmarking accelerate optimization learning compared to individual assignments?** You've built leaderboards, innovation detection, and multi-dimensional scoring. Analyze why competition pressure drives deeper exploration of optimization techniques and how this mirrors real industry environments where performance benchmarks determine system adoption.\n", - "\n", - "3. **What makes innovation detection crucial for preventing optimization tunnel vision?** Your system detects quantization, pruning, distillation, and custom kernels automatically. 
Explain why rewarding diverse techniques prevents students from over-optimizing single metrics and how this teaches balanced systems thinking rather than algorithmic tunnel vision.\n", - "\n", - "4. **How does evidence-based competition ensure educational integrity and real-world relevance?** Your framework requires GitHub links, generates checksums, and validates reproducibility. Analyze why these requirements prevent academic dishonesty while teaching students the performance reporting standards expected in production ML systems development." - ] - }, - { - "cell_type": "markdown", - "id": "708f21f3", - "metadata": { - "cell_marker": "\"\"\"" - }, - "source": [ - "## 🎯 MODULE SUMMARY: TinyMLPerf - The Ultimate ML Systems Competition\n", - "\n", - "This capstone module creates the ultimate ML systems competition, proving optimization mastery through measurable performance improvements in three exciting events.\n", - "\n", - "### 🛤️ **The TinyMLPerf Journey**\n", - "- **Modules 1-19**: You built comprehensive optimization techniques across the entire ML systems stack\n", - "- **Module 20**: You compete to prove mastery through concrete, measurable performance improvements\n", - "- **Ultimate Goal**: Demonstrate professional-level ML systems optimization through competitive achievement\n", - "\n", - "### 🛠️ **What We Built**\n", - "- **TinyMLPerf Benchmark Suite**: Three standardized competition events - MLP Sprint, CNN Marathon, Transformer Decathlon\n", - "- **Competition Profiler**: Integration with Module 15's profiler for rigorous, statistical performance measurement\n", - "- **Multi-Dimensional Leaderboards**: Speed, innovation, and composite scoring systems preventing tunnel vision\n", - "- **Innovation Detection**: Automatic recognition and scoring of advanced optimization techniques\n", - "\n", - "### 🧠 **Key Learning Outcomes**\n", - "- **Competitive Optimization**: Apply learned techniques competitively with measurable, hardware-independent results\n", - 
"- **Systematic Benchmarking**: Use statistical profiling methodology for reliable performance measurement\n", - "- **Innovation Recognition**: Understand and apply diverse optimization approaches beyond simple speed improvements\n", - "- **Evidence-Based Performance**: Support optimization claims with reproducible benchmarking and transparent evidence\n", - "\n", - "### ⚡ **Competition Events Mastered**\n", - "- **MLP Sprint**: Fastest feedforward neural network inference optimization\n", - "- **CNN Marathon**: Most efficient convolutional neural network processing\n", - "- **Transformer Decathlon**: Ultimate attention mechanism and sequence processing optimization\n", - "\n", - "### 🏆 **Technical Skills Developed**\n", - "- Design and implement standardized benchmarking infrastructure for fair ML competition\n", - "- Integrate profiling tools for statistical performance measurement and analysis\n", - "- Build multi-dimensional leaderboard systems balancing multiple optimization objectives\n", - "- Detect and score innovation techniques automatically to reward optimization creativity\n", - "\n", - "### 📊 **Systems Engineering Insights Gained**\n", - "- **Competition accelerates learning**: Measurable challenges drive deeper optimization exploration than individual assignments\n", - "- **Hardware-independent scoring**: Relative performance metrics enable fair comparison across diverse deployment environments \n", - "- **Innovation detection prevents tunnel vision**: Multi-dimensional scoring teaches balanced systems optimization\n", - "- **Evidence requirements ensure integrity**: Reproducible results and transparency are essential for professional optimization claims\n", - "\n", - "### 💡 **The Capstone Achievement**\n", - "You've completed the ultimate ML systems optimization journey! 
Through competitive pressure in TinyMLPerf, you've applied quantization, pruning, distillation, acceleration, memory optimization, and innovation techniques to achieve measurable performance improvements. This competition framework proves you can optimize ML systems like a professional engineer, balancing speed, memory, innovation, and deployment constraints to build production-ready systems.\n", - "\n", - "### 🎉 **Competition Glory Awaits**\n", - "Ready to prove your optimization mastery? Load your optimized models into TinyMLPerf, submit to the three events, and climb the leaderboards! Your journey from basic tensors to competition-winning ML systems optimization is complete - now show the world what you can build!" - ] - } - ], - "metadata": { - "jupytext": { - "cell_metadata_filter": "-all", - "main_language": "python", - "notebook_metadata_filter": "-all" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/modules_old/20_capstone/capstone_dev.py b/modules_old/20_capstone/capstone_dev.py deleted file mode 100644 index 63aeb3a0..00000000 --- a/modules_old/20_capstone/capstone_dev.py +++ /dev/null @@ -1,2367 +0,0 @@ -# %% [markdown] -""" -# Module 20: TinyGPT Capstone - Building Complete ML Systems from Scratch - -Welcome to the TinyGPT Capstone! You'll integrate everything from modules 02-19 to build a complete language model from first principles. - -## LINK Building on Previous Learning -**What You Built Before**: -- Modules 02-11: Core ML infrastructure (tensors, layers, training, optimization) -- Modules 12-15: Advanced systems (attention, profiling, benchmarking) -- Modules 16-19: Production techniques (quantization, deployment, optimization) - -**What's Working**: You can build and train individual components! - -**The Gap**: Components exist in isolation - no end-to-end language model. - -**This Module's Solution**: Integrate all TinyTorch modules into a working TinyGPT that generates text. 
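The end-to-end workflow described here (tokenize -> model forward pass -> sample next token -> repeat) boils down to a single autoregressive loop. The following is a minimal sketch of that loop, not this module's actual implementation — the `next_token_logits` stub stands in for the real transformer forward pass, and all names are illustrative:

```python
import numpy as np

def generate(next_token_logits, prompt_ids, max_new_tokens=5, temperature=0.8, rng=None):
    """Autoregressive loop: score the tokens so far, sample one token, append, repeat."""
    if rng is None:
        rng = np.random.default_rng(0)
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(np.array(ids))   # (vocab_size,) scores for the next token
        scaled = logits / temperature               # temperature < 1 sharpens the distribution
        probs = np.exp(scaled - scaled.max())       # numerically stable softmax
        probs /= probs.sum()
        ids.append(int(rng.choice(len(probs), p=probs)))
    return ids

# Stand-in "model": fixed random logits. A real TinyGPT runs the full transformer stack here.
vocab_size = 20
fixed_logits = np.random.default_rng(42).normal(size=vocab_size)
generated = generate(lambda ids: fixed_logits, prompt_ids=[2, 7, 5])
print(generated)  # prompt ids followed by 5 sampled token ids
```

Note that generation cost grows with sequence length on every step, which is why the O(n²) attention bottleneck discussed below dominates long-context inference.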
- -**Connection Map**: -``` -All Previous Modules -> TinyGPT Integration -> Complete ML System - (components) (assembly) (text generation) -``` - -## Learning Goals -1. **Systems Integration**: Combine all TinyTorch components into working language model -2. **End-to-End Pipeline**: Build complete tokenization -> inference -> generation workflow -3. **Performance Analysis**: Profile and optimize complete system bottlenecks -4. **Production Readiness**: Deploy working model with monitoring and optimization -5. **Mastery Demonstration**: Prove comprehensive ML systems engineering capability - -## Build -> Use -> Reflect -1. **Build**: Complete TinyGPT integration from all previous modules -2. **Use**: Generate text and analyze end-to-end performance characteristics -3. **Reflect**: Evaluate system design decisions and optimization opportunities - -## Systems Reality Check -TIP **Production Context**: Real language models require careful component integration and system optimization -SPEED **Performance Insight**: End-to-end systems reveal bottlenecks invisible in isolated components -""" - -# %% -#| default_exp tinygpt.capstone - -import time -import json -import hashlib -import tracemalloc -from datetime import datetime -from pathlib import Path -from typing import Dict, Any, List, Optional, Tuple, Union, Callable -import numpy as np -import pickle - -# Import all TinyTorch components for integration -try: - from tinytorch.core.tensor import Tensor - from tinytorch.core.activations import ReLU, Softmax, GELU - from tinytorch.core.layers import Linear, LayerNorm - from tinytorch.core.losses import CrossEntropyLoss - from tinytorch.core.autograd import Variable - from tinytorch.core.optimizers import AdamOptimizer - from tinytorch.core.attention import MultiHeadAttention - from tinytorch.utils.profiler import SimpleProfiler - TINYTORCH_AVAILABLE = True - print("PASS TinyTorch components loaded successfully") -except ImportError as e: - print(f"WARNING️ TinyTorch 
components not available: {e}") - print(" Some functionality will use NumPy fallbacks") - TINYTORCH_AVAILABLE = False - -# TinyGPT Architecture Constants - Comprehensive Language Model Configuration -TINYGPT_VOCAB_SIZE = 1000 # Vocabulary size for tokenization (educational scale) -TINYGPT_D_MODEL = 128 # Model embedding dimension (balances capability/speed) -TINYGPT_N_HEADS = 8 # Number of attention heads (d_model must be divisible) -TINYGPT_N_LAYERS = 6 # Number of transformer layers (depth for language modeling) -TINYGPT_SEQ_LEN = 64 # Maximum sequence length (context window) -TINYGPT_FF_RATIO = 4 # Feed-forward expansion ratio (standard transformer) -TINYGPT_DROPOUT = 0.1 # Dropout rate for regularization - -# Training and Generation Constants -TINYGPT_LEARNING_RATE = 1e-4 # Learning rate for Adam optimizer -TINYGPT_BATCH_SIZE = 8 # Batch size for training (memory-efficient) -TINYGPT_MAX_TOKENS = 50 # Maximum tokens to generate -TINYGPT_TEMPERATURE = 0.8 # Sampling temperature for generation -TINYGPT_TOP_K = 10 # Top-k sampling for text generation - -# Performance measurement constants -WEIGHT_INIT_SCALE = 0.02 # GPT-style weight initialization -NUMERICAL_EPSILON = 1e-8 # Prevent division by zero in computations -DEFAULT_WARMUP_RUNS = 3 # Number of warmup runs to stabilize CPU caches -DEFAULT_TIMING_RUNS = 5 # Minimum runs for statistical reliability -PROFILING_RUNS = 10 # More thorough profiling for detailed analysis - -# System Analysis Constants - for comprehensive performance evaluation -MEMORY_ANALYSIS_ENABLED = True # Enable detailed memory profiling -PERFORMANCE_BASELINE_RUNS = 5 # Runs for establishing performance baselines -SCALING_TEST_SEQUENCE_LENGTHS = [16, 32, 64, 128] # Sequence lengths for scaling analysis -OPTIMIZATION_TARGET_SPEEDUP = 2.0 # Target speedup for optimization validation - -# Component Integration Status Tracking -COMPONENT_STATUS = { - 'tensor': False, # Module 02: Tensor operations - 'activations': False, # Module 03: Activation 
functions - 'layers': False, # Module 04: Neural network layers - 'losses': False, # Module 05: Loss functions - 'autograd': False, # Module 06: Automatic differentiation - 'optimizers': False, # Module 07: Optimization algorithms - 'attention': False, # Module 08: Attention mechanisms - 'profiler': False # Module 15: Performance profiling -} - -# Component Availability Check - validate TinyTorch integration status -def _check_component_availability(): - """Check which TinyTorch components are available for integration.""" - global COMPONENT_STATUS - - # Check each component systematically - components_to_check = [ - ('tensor', 'tinytorch.core.tensor', 'Tensor'), - ('activations', 'tinytorch.core.activations', 'ReLU'), - ('layers', 'tinytorch.core.layers', 'Linear'), - ('losses', 'tinytorch.core.losses', 'CrossEntropyLoss'), - ('autograd', 'tinytorch.core.autograd', 'Variable'), - ('optimizers', 'tinytorch.core.optimizers', 'AdamOptimizer'), - ('attention', 'tinytorch.core.attention', 'MultiHeadAttention'), - ('profiler', 'tinytorch.utils.profiler', 'SimpleProfiler') - ] - - available_count = 0 - for component_name, module_name, class_name in components_to_check: - try: - module = __import__(module_name, fromlist=[class_name]) - getattr(module, class_name) - COMPONENT_STATUS[component_name] = True - available_count += 1 - except (ImportError, AttributeError): - COMPONENT_STATUS[component_name] = False - - print(f"MAGNIFY Component Integration Status: {available_count}/{len(components_to_check)} available") - - # Display detailed status - for component, available in COMPONENT_STATUS.items(): - status = "PASS" if available else "FAIL" - print(f" {status} {component.capitalize()}") - - return available_count, len(components_to_check) - -# Check component availability on module load -available_components, total_components = _check_component_availability() - -# %% [markdown] -""" -## Part 1: TinyGPT Architecture Overview - Visual System Design - -Before building the 
complete system, let's understand how all TinyTorch components integrate into a working language model. - -### 🏢 Complete TinyGPT Architecture - -``` -TinyGPT Language Model Pipeline: - - Input Text - | - v (Tokenization) - Token IDs [7, 23, 145, ...] - | - v (Token Embedding) - +-----------------------------------+ - | Token + Position Embeddings | - | Shape: (batch, seq_len, d_model) | - +-----------------------------------+ - | - v (Transformer Layers x6) - +-----------------------------------+ - | Layer 1: MultiHeadAttention | - | | +--------------------------+ | - | | | Q, K, V -> Attention | | - | | | O(n²) complexity | | - | | +--------------------------+ | - | v | - | LayerNorm + Residual | - | v | - | Feed Forward (Linear -> GELU -> Linear) | - | v | - | LayerNorm + Residual | - +-----------------------------------+ - | (Repeat for layers 2-6) - v - +-----------------------------------+ - | Final Layer Norm | - +-----------------------------------+ - | - v (Language Modeling Head) - +-----------------------------------+ - | Linear: d_model -> vocab_size | - | Output: (batch, seq_len, vocab) | - +-----------------------------------+ - | - v (Softmax + Sampling) - Next Token Probabilities - | - v (Generation Loop) - Generated Text Output -``` - -### 📊 Memory Layout Analysis - -``` -TinyGPT Memory Footprint (Educational Scale): - -+------------------------------------------+ -| Component | Parameters | Memory (MB) | -+------------------------------------------+ -| Token Embedding | 128,000 | 0.5 | vocab * d_model -| Position Embedding | 8,192 | 0.03 | seq_len * d_model -| 6x Attention Layers | 294,912 | 1.1 | 4 * d_model² * layers -| 6x Feed Forward | 393,216 | 1.5 | 8 * d_model² * layers -| Output Head | 128,000 | 0.5 | d_model * vocab -+------------------------------------------+ -| TOTAL MODEL | 952,320 | 3.6 | -> 1M parameters! 
-+------------------------------------------+ - -Runtime Memory (per batch): -- Forward pass activations: ~2-4 MB -- Backward pass gradients: ~3.6 MB (same as model) -- Adam optimizer states: ~7.2 MB (2x gradients) -- Total training memory: ~15-20 MB -``` - -### SPEED Performance Characteristics - -``` -Inference Performance Analysis: - -Sequence Length Scaling (O(n²) attention bottleneck): - 16 tokens: ~2ms (baseline) - 32 tokens: ~8ms (4x slower - quadratic scaling) - 64 tokens: ~32ms (16x slower) - 128 tokens: ~128ms (64x slower) - -Bottleneck Analysis: -1. MAGNIFY Attention: 60-70% of computation time -2. MAGNIFY Feed Forward: 20-25% of computation time -3. MAGNIFY Embedding Lookup: 5-10% of computation time -4. MAGNIFY Other Operations: 5-10% of computation time -``` -""" - -# %% -def simple_tokenizer_demo(): - """TARGET Learning Checkpoint 1: Basic Text Tokenization - - Understand how text becomes numerical tokens for language modeling. - """ - print("MAGNIFY Learning Checkpoint 1: Text Tokenization for Language Models") - print("=" * 60) - - # Simple vocabulary for demonstration (real tokenizers are much more sophisticated) - vocab = { - '<pad>': 0, '<unk>': 1, '<bos>': 2, '<eos>': 3, - 'the': 4, 'cat': 5, 'sat': 6, 'on': 7, 'mat': 8, - 'dog': 9, 'ran': 10, 'fast': 11, 'in': 12, 'park': 13, - 'hello': 14, 'world': 15, 'how': 16, 'are': 17, 'you': 18 - } - - # Reverse mapping for decoding - id_to_token = {v: k for k, v in vocab.items()} - - def tokenize_text(text): - """Convert text to token IDs using simple word-level tokenization""" - words = text.lower().split() - token_ids = [vocab.get(word, vocab['<unk>']) for word in words] - return token_ids - - def detokenize_ids(token_ids): - """Convert token IDs back to text""" - words = [id_to_token.get(id, '<unk>') for id in token_ids] - return ' '.join(words) - - # Test tokenization - test_sentences = [ - "the cat sat on the mat", - "hello world how are you", - "the dog ran fast in the park" - ] - - print(f"📊 Vocabulary size: {len(vocab)} 
tokens") - print(f"🔤 Testing tokenization on {len(test_sentences)} sentences...\n") - - tokenization_results = [] - for i, sentence in enumerate(test_sentences): - token_ids = tokenize_text(sentence) - reconstructed = detokenize_ids(token_ids) - - print(f" Sentence {i+1}: '{sentence}'") - print(f" Token IDs: {token_ids}") - print(f" Reconstructed: '{reconstructed}'") - print(f" Length: {len(token_ids)} tokens\n") - - tokenization_results.append({ - 'original': sentence, - 'token_ids': token_ids, - 'reconstructed': reconstructed, - 'length': len(token_ids) - }) - - print(f"TIP Key Insight: Language models work with token IDs, not raw text!") - print(f" Tokenization quality directly affects model performance.") - - return {'vocab': vocab, 'results': tokenization_results} - -def attention_scaling_demo(): - """TARGET Learning Checkpoint 2: Understanding Attention Complexity - - Understand why attention is O(n²) and becomes the bottleneck in large models. - """ - print("\nMAGNIFY Learning Checkpoint 2: Attention Scaling Analysis") - print("=" * 60) - - def simple_attention(query, key, value): - """Simple attention mechanism for timing analysis""" - # Compute attention scores: Q @ K^T - scores = query @ np.transpose(key, (0, 1, 3, 2)) # Shape: (batch, heads, seq_len, seq_len) - - # Scale by sqrt(d_k) - d_k = query.shape[-1] - scores = scores / np.sqrt(d_k) - - # Softmax normalization - exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True)) - attention_weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True) - - # Apply attention to values - output = attention_weights @ value # Shape: (batch, heads, seq_len, d_k) - - return output, attention_weights - - # Test different sequence lengths to show quadratic scaling - test_lengths = [16, 32, 64, 128] - d_model = 128 - n_heads = 8 - d_k = d_model // n_heads - batch_size = 1 - - print(f"📊 Testing attention scaling with d_model={d_model}, heads={n_heads}...\n") - - scaling_results = [] - for seq_len in 
test_lengths: - # Create random Q, K, V matrices - shape = (batch_size, n_heads, seq_len, d_k) - query = np.random.randn(*shape).astype(np.float32) * 0.1 - key = np.random.randn(*shape).astype(np.float32) * 0.1 - value = np.random.randn(*shape).astype(np.float32) * 0.1 - - # Time attention computation - times = [] - for _ in range(DEFAULT_TIMING_RUNS): - start = time.perf_counter() - output, weights = simple_attention(query, key, value) - end = time.perf_counter() - times.append(end - start) - - mean_time = np.mean(times) - - # Calculate memory usage for attention matrix - attention_memory_mb = (seq_len * seq_len * 4) / (1024 * 1024) # float32 - - print(f" Seq Length {seq_len:3d}: {mean_time*1000:6.2f} ms, Memory: {attention_memory_mb:.3f} MB") - - scaling_results.append({ - 'seq_len': seq_len, - 'time_ms': mean_time * 1000, - 'memory_mb': attention_memory_mb, - 'operations': seq_len * seq_len * d_k # Approximate FLOPs - }) - - # Analyze scaling - if len(scaling_results) >= 2: - base_time = scaling_results[0]['time_ms'] - base_length = scaling_results[0]['seq_len'] - - print(f"\nPROGRESS Scaling Analysis:") - for result in scaling_results[1:]: - length_ratio = result['seq_len'] / base_length - time_ratio = result['time_ms'] / base_time - expected_quadratic = length_ratio ** 2 - - print(f" {result['seq_len']}vs{base_length}: {time_ratio:.1f}x time (expected O(n²): {expected_quadratic:.1f}x)") - - print(f"\nTIP Key Insight: Attention scales quadratically with sequence length!") - print(f" This is why long sequences are expensive in transformers.") - - return {'results': scaling_results} - -def transformer_component_demo(): - """TARGET Learning Checkpoint 3: Transformer Component Integration - - Understand how transformer components work together in language models. 
- """ - print("\nMAGNIFY Learning Checkpoint 3: Transformer Component Integration") - print("=" * 60) - - # Simple transformer components for demonstration - class SimpleAttentionLayer: - def __init__(self, d_model, n_heads): - self.d_model = d_model - self.n_heads = n_heads - self.d_k = d_model // n_heads - - # Initialize weight matrices (simplified) - self.w_q = np.random.randn(d_model, d_model).astype(np.float32) * 0.1 - self.w_k = np.random.randn(d_model, d_model).astype(np.float32) * 0.1 - self.w_v = np.random.randn(d_model, d_model).astype(np.float32) * 0.1 - self.w_o = np.random.randn(d_model, d_model).astype(np.float32) * 0.1 - - def forward(self, x): - """Simple multi-head attention forward pass""" - batch_size, seq_len, d_model = x.shape - - # Linear transformations - q = x @ self.w_q # (batch, seq, d_model) - k = x @ self.w_k - v = x @ self.w_v - - # Reshape for multi-head attention - q = q.reshape(batch_size, seq_len, self.n_heads, self.d_k).transpose(0, 2, 1, 3) - k = k.reshape(batch_size, seq_len, self.n_heads, self.d_k).transpose(0, 2, 1, 3) - v = v.reshape(batch_size, seq_len, self.n_heads, self.d_k).transpose(0, 2, 1, 3) - - # Attention computation - scores = q @ np.swapaxes(k, -2, -1) / np.sqrt(self.d_k) - weights = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True) - attended = weights @ v - - # Concatenate heads and project - attended = attended.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, d_model) - output = attended @ self.w_o - - return output - - class SimpleFeedForward: - def __init__(self, d_model, d_ff): - self.w1 = np.random.randn(d_model, d_ff).astype(np.float32) * 0.1 - self.w2 = np.random.randn(d_ff, d_model).astype(np.float32) * 0.1 - - def forward(self, x): - """Feed-forward network: Linear -> GELU -> Linear""" - # First linear transformation - hidden = x @ self.w1 - - # GELU activation (approximation) - hidden = 0.5 * hidden * (1 + np.tanh(np.sqrt(2/np.pi) * (hidden + 0.044715 * hidden**3))) - - # Second linear 
transformation - output = hidden @ self.w2 - - return output - - # Test component integration - batch_size = 2 - seq_len = 32 - d_model = 128 - n_heads = 8 - d_ff = d_model * 4 - - # Create test input - x = np.random.randn(batch_size, seq_len, d_model).astype(np.float32) * 0.1 - - print(f"📊 Testing transformer components...") - print(f" Input shape: {x.shape}") - print(f" d_model: {d_model}, n_heads: {n_heads}, d_ff: {d_ff}\n") - - # Initialize components - attention = SimpleAttentionLayer(d_model, n_heads) - feed_forward = SimpleFeedForward(d_model, d_ff) - - # Time each component - components_timing = {} - - # Attention timing - times = [] - for _ in range(DEFAULT_TIMING_RUNS): - start = time.perf_counter() - attn_output = attention.forward(x) - times.append(time.perf_counter() - start) - attention_time = np.mean(times) - components_timing['attention'] = attention_time - - # Feed-forward timing - times = [] - for _ in range(DEFAULT_TIMING_RUNS): - start = time.perf_counter() - ff_output = feed_forward.forward(x) - times.append(time.perf_counter() - start) - ff_time = np.mean(times) - components_timing['feed_forward'] = ff_time - - # Full transformer layer timing (attention + residual + ff + residual) - times = [] - for _ in range(DEFAULT_TIMING_RUNS): - start = time.perf_counter() - # Attention block - attn_out = attention.forward(x) - x_after_attn = x + attn_out # Residual connection - - # Feed-forward block - ff_out = feed_forward.forward(x_after_attn) - final_out = x_after_attn + ff_out # Residual connection - times.append(time.perf_counter() - start) - full_layer_time = np.mean(times) - components_timing['full_layer'] = full_layer_time - - print(f" Component Timing:") - print(f" Attention: {attention_time*1000:6.2f} ms ({attention_time/full_layer_time*100:.1f}%)") - print(f" Feed Forward: {ff_time*1000:6.2f} ms ({ff_time/full_layer_time*100:.1f}%)") - print(f" Full Layer: {full_layer_time*1000:6.2f} ms (100.0%)") - - # Calculate parameter counts - attn_params 
= 4 * d_model * d_model # Q, K, V, O projections - ff_params = d_model * d_ff + d_ff * d_model # Two linear layers - total_params = attn_params + ff_params - - print(f"\n Parameter Count:") - print(f" Attention: {attn_params:,} parameters ({attn_params/total_params*100:.1f}%)") - print(f" Feed Forward: {ff_params:,} parameters ({ff_params/total_params*100:.1f}%)") - print(f" Total Layer: {total_params:,} parameters") - - print(f"\nTIP Key Insight: Attention dominates compute, FF dominates parameters!") - print(f" Understanding component characteristics guides optimization.") - - return {'timing': components_timing, 'params': {'attention': attn_params, 'ff': ff_params}} - -# %% -def run_learning_checkpoints(): - """Run all learning checkpoints to build understanding progressively""" - print("🎓 TinyGPT Capstone Learning Journey") - print("=" * 80) - print("Building understanding of complete language model systems...\n") - - # Checkpoint 1: Text tokenization - tokenization_results = simple_tokenizer_demo() - - # Checkpoint 2: Attention scaling - attention_results = attention_scaling_demo() - - # Checkpoint 3: Component integration - component_results = transformer_component_demo() - - print("\n" + "=" * 80) - print("CELEBRATE Learning checkpoints complete! Ready for TinyGPT integration.") - print("=" * 80) - - return { - 'tokenization': tokenization_results, - 'attention': attention_results, - 'components': component_results - } - -# %% [markdown] -""" -### Test Learning Checkpoints - -Let's run the learning checkpoints to build understanding of language model components progressively. 
-""" - -# %% -def test_learning_checkpoints(): - """Test the TinyGPT learning checkpoint system""" - print("Testing TinyGPT learning checkpoints...") - results = run_learning_checkpoints() - print("\nPASS TinyGPT learning checkpoints test complete!") - return results - -# %% [markdown] -""" -## Part 2: TinyGPT Core Components - Integrated Language Model Implementation - -Now that we understand the fundamentals, let's build the complete TinyGPT system by integrating all TinyTorch components into a working language model. -""" - -# Core TinyGPT Components - Complete Language Model Implementation -class TinyGPTTokenizer: - """Educational tokenizer for TinyGPT language model. - - Implements word-level tokenization with special tokens for language modeling. - In production, this would be BPE/SentencePiece, but word-level is clearer for learning. - """ - - def __init__(self, vocab_size=TINYGPT_VOCAB_SIZE): - """Initialize tokenizer with educational vocabulary.""" - # Core special tokens (essential for language modeling) - self.special_tokens = { - '<pad>': 0, # Padding token for batch processing - '<unk>': 1, # Unknown words not in vocabulary - '<bos>': 2, # Beginning of sequence token - '<eos>': 3, # End of sequence token - } - - # Common English words (educational vocabulary - real tokenizers use BPE) - common_words = [ - 'the', 'and', 'to', 'of', 'a', 'in', 'is', 'it', 'you', 'that', - 'he', 'was', 'for', 'on', 'are', 'as', 'with', 'his', 'they', 'be', - 'at', 'one', 'have', 'this', 'from', 'or', 'had', 'by', 'word', 'but', - 'what', 'some', 'we', 'can', 'out', 'other', 'were', 'all', 'there', 'when', - 'up', 'use', 'your', 'how', 'said', 'an', 'each', 'which', 'do', 'their', - 'time', 'will', 'about', 'if', 'up', 'out', 'many', 'then', 'them', 'these', - 'so', 'some', 'her', 'would', 'make', 'like', 'into', 'him', 'has', 'two', - 'more', 'very', 'what', 'know', 'just', 'first', 'get', 'over', 'think', 'also', - 'good', 'new', 'where', 'much', 'go', 'well', 'little', 'only', 'those', 
'tell', - 'way', 'she', 'may', 'say', 'after', 'any', 'my', 'now', 'old', 'see' - ] - - # Build complete vocabulary (special tokens + common words + generated tokens) - self.vocab = self.special_tokens.copy() - - # Add common words to vocabulary - for i, word in enumerate(common_words[:min(len(common_words), vocab_size - len(self.special_tokens))]): - self.vocab[word] = len(self.special_tokens) + i - - # Fill remaining slots with generated tokens (simulating subword tokens) - current_id = len(self.vocab) - while len(self.vocab) < vocab_size: - self.vocab[f'tok_{current_id}'] = current_id - current_id += 1 - - # Create reverse mapping for decoding - self.id_to_token = {v: k for k, v in self.vocab.items()} - - print(f"📚 TinyGPT Tokenizer initialized: {len(self.vocab)} tokens") - - def encode(self, text): - """Convert text to token IDs for model input.""" - # Simple word-level tokenization (lowercase and split) - words = text.lower().strip().split() - - # Convert words to token IDs - token_ids = [self.vocab['<bos>']] # Start with beginning-of-sequence token - for word in words: - token_id = self.vocab.get(word, self.vocab['<unk>']) - token_ids.append(token_id) - token_ids.append(self.vocab['<eos>']) # End with end-of-sequence token - - return np.array(token_ids, dtype=np.int32) - - def decode(self, token_ids): - """Convert token IDs back to human-readable text.""" - # Convert IDs to tokens, filtering out special tokens for readability - tokens = [] - for token_id in token_ids: - token = self.id_to_token.get(token_id, '<unk>') - if token not in ['<pad>', '<bos>', '<eos>']: - tokens.append(token) - - return ' '.join(tokens) - - def get_vocab_size(self): - """Return vocabulary size for model configuration.""" - return len(self.vocab) - - -class TinyGPTTransformerLayer: - """Complete transformer layer integrating all TinyTorch components. - - Combines multi-head attention, feed-forward networks, layer normalization, - and residual connections into a standard transformer layer.
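The tokenizer's encode/decode round trip can be sketched standalone. The vocabulary and helper names below are hypothetical, chosen only to mirror the word-level logic with a five-word vocabulary:

```python
# Minimal word-level tokenizer sketch (illustrative names, tiny fixed vocab)
special = {'<pad>': 0, '<unk>': 1, '<bos>': 2, '<eos>': 3}
vocab = dict(special)
for i, w in enumerate(['the', 'cat', 'sat', 'on', 'mat']):
    vocab[w] = len(special) + i
id_to_token = {v: k for k, v in vocab.items()}

def encode(text):
    # Wrap the word IDs in <bos>/<eos>; unknown words map to <unk>
    ids = [vocab['<bos>']]
    ids += [vocab.get(w, vocab['<unk>']) for w in text.lower().split()]
    ids.append(vocab['<eos>'])
    return ids

def decode(ids):
    # Drop structural special tokens, but keep <unk> visible
    toks = [id_to_token.get(i, '<unk>') for i in ids]
    return ' '.join(t for t in toks if t not in ('<pad>', '<bos>', '<eos>'))

ids = encode("the cat sat on the dog")  # 'dog' is out of vocabulary
text = decode(ids)                      # -> "the cat sat on the <unk>"
```

Note the round trip is lossy for unknown words, which is exactly what the `vocabulary_coverage` metric later in this module measures.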
- """ - - def __init__(self, d_model=TINYGPT_D_MODEL, n_heads=TINYGPT_N_HEADS, - d_ff=None, dropout=TINYGPT_DROPOUT): - """Initialize transformer layer with comprehensive component integration.""" - self.d_model = d_model - self.n_heads = n_heads - self.d_ff = d_ff or (d_model * TINYGPT_FF_RATIO) # Standard 4x expansion - self.dropout = dropout - - # Multi-head attention weights (using TinyTorch patterns) - self.attention_weights = { - 'w_q': np.random.randn(d_model, d_model).astype(np.float32) * WEIGHT_INIT_SCALE, - 'w_k': np.random.randn(d_model, d_model).astype(np.float32) * WEIGHT_INIT_SCALE, - 'w_v': np.random.randn(d_model, d_model).astype(np.float32) * WEIGHT_INIT_SCALE, - 'w_o': np.random.randn(d_model, d_model).astype(np.float32) * WEIGHT_INIT_SCALE - } - - # Feed-forward network weights (Linear -> GELU -> Linear pattern) - self.ff_weights = { - 'w1': np.random.randn(d_model, self.d_ff).astype(np.float32) * WEIGHT_INIT_SCALE, - 'b1': np.zeros(self.d_ff).astype(np.float32), - 'w2': np.random.randn(self.d_ff, d_model).astype(np.float32) * WEIGHT_INIT_SCALE, - 'b2': np.zeros(d_model).astype(np.float32) - } - - # Layer normalization parameters (following LayerNorm from Module 04) - self.layer_norm1_params = { - 'gamma': np.ones(d_model).astype(np.float32), # Scale parameter - 'beta': np.zeros(d_model).astype(np.float32) # Shift parameter - } - - self.layer_norm2_params = { - 'gamma': np.ones(d_model).astype(np.float32), - 'beta': np.zeros(d_model).astype(np.float32) - } - - print(f"🔧 Transformer Layer: d_model={d_model}, n_heads={n_heads}, d_ff={self.d_ff}") - - def layer_norm(self, x, gamma, beta, eps=1e-8): - """Layer normalization following Module 04 patterns.""" - # Compute mean and variance along the last dimension - mean = np.mean(x, axis=-1, keepdims=True) - var = np.var(x, axis=-1, keepdims=True) - - # Normalize and scale/shift - x_norm = (x - mean) / np.sqrt(var + eps) - return gamma * x_norm + beta - - def multi_head_attention(self, x, mask=None): - 
"""Multi-head attention following Module 08 attention patterns.""" - batch_size, seq_len, d_model = x.shape - d_k = d_model // self.n_heads - - # Linear transformations to Q, K, V - q = x @ self.attention_weights['w_q'] # (batch, seq, d_model) - k = x @ self.attention_weights['w_k'] - v = x @ self.attention_weights['w_v'] - - # Reshape for multi-head attention: (batch, n_heads, seq, d_k) - q = q.reshape(batch_size, seq_len, self.n_heads, d_k).transpose(0, 2, 1, 3) - k = k.reshape(batch_size, seq_len, self.n_heads, d_k).transpose(0, 2, 1, 3) - v = v.reshape(batch_size, seq_len, self.n_heads, d_k).transpose(0, 2, 1, 3) - - # Scaled dot-product attention with causal masking - scores = q @ np.swapaxes(k, -2, -1) / np.sqrt(d_k) # (batch, heads, seq, seq) - - # Apply causal mask (prevent attending to future tokens) - if mask is None: - mask = np.triu(np.ones((seq_len, seq_len)), k=1) * -1e9 - scores = scores + mask - - # Softmax attention weights - exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True)) - attention_weights = exp_scores / (np.sum(exp_scores, axis=-1, keepdims=True) + NUMERICAL_EPSILON) - - # Apply attention to values - attended = attention_weights @ v # (batch, heads, seq, d_k) - - # Concatenate heads and project - attended = attended.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, d_model) - output = attended @ self.attention_weights['w_o'] - - return output, attention_weights - - def feed_forward(self, x): - """Feed-forward network with GELU activation (Module 03 activation patterns).""" - # First linear transformation - hidden = x @ self.ff_weights['w1'] + self.ff_weights['b1'] - - # GELU activation (commonly used in transformers) - # GELU(x) = 0.5 * x * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x³))) - hidden = 0.5 * hidden * (1 + np.tanh(np.sqrt(2/np.pi) * (hidden + 0.044715 * hidden**3))) - - # Second linear transformation - output = hidden @ self.ff_weights['w2'] + self.ff_weights['b2'] - - return output - - def forward(self, x, 
mask=None): - """Complete transformer layer forward pass with residual connections.""" - # Multi-head attention block - attn_output, attention_weights = self.multi_head_attention(x, mask) - - # First residual connection + layer norm (post-norm architecture: normalize after the residual add) - x_after_attn = self.layer_norm( - x + attn_output, # Residual connection - self.layer_norm1_params['gamma'], - self.layer_norm1_params['beta'] - ) - - # Feed-forward block - ff_output = self.feed_forward(x_after_attn) - - # Second residual connection + layer norm - x_final = self.layer_norm( - x_after_attn + ff_output, # Residual connection - self.layer_norm2_params['gamma'], - self.layer_norm2_params['beta'] - ) - - return x_final, attention_weights - - -class TinyGPTModel: - """Complete TinyGPT language model integrating all TinyTorch components. - - This is the culmination of the entire TinyTorch course - a working language model - built entirely from components you implemented in modules 02-19. - """ - - def __init__(self, vocab_size=TINYGPT_VOCAB_SIZE, d_model=TINYGPT_D_MODEL, - n_heads=TINYGPT_N_HEADS, n_layers=TINYGPT_N_LAYERS, - max_seq_len=TINYGPT_SEQ_LEN, dropout=TINYGPT_DROPOUT): - """Initialize complete TinyGPT model with all integrated components.""" - self.vocab_size = vocab_size - self.d_model = d_model - self.n_heads = n_heads - self.n_layers = n_layers - self.max_seq_len = max_seq_len - self.dropout = dropout - - # Token embeddings (Module 04 embedding patterns) - self.token_embeddings = np.random.randn(vocab_size, d_model).astype(np.float32) * WEIGHT_INIT_SCALE - - # Positional embeddings (learned position encodings) - self.position_embeddings = np.random.randn(max_seq_len, d_model).astype(np.float32) * WEIGHT_INIT_SCALE - - # Stack of transformer layers (integrating Module 08 attention) - self.transformer_layers = [ - TinyGPTTransformerLayer(d_model, n_heads, d_model * TINYGPT_FF_RATIO, dropout) - for _ in range(n_layers) - ] - - # Final layer normalization - self.final_layer_norm = { -
'gamma': np.ones(d_model).astype(np.float32), - 'beta': np.zeros(d_model).astype(np.float32) - } - - # Language modeling head (predict next token) - self.lm_head = np.random.randn(d_model, vocab_size).astype(np.float32) * WEIGHT_INIT_SCALE - - # Calculate total parameters - self.total_parameters = self._count_parameters() - - print(f"ROCKET TinyGPT Model Initialized:") - print(f" 📊 Parameters: {self.total_parameters:,}") - print(f" 🏗️ Architecture: {n_layers} layers, {n_heads} heads, {d_model} dim") - print(f" 📚 Vocabulary: {vocab_size} tokens") - print(f" 📏 Max Sequence: {max_seq_len} tokens") - - def _count_parameters(self): - """Count total trainable parameters in the model.""" - total = 0 - - # Embedding parameters - total += self.token_embeddings.size # vocab_size * d_model - total += self.position_embeddings.size # max_seq_len * d_model - - # Transformer layer parameters (attention + feed-forward + layer norms) - layer_params = ( - 4 * self.d_model * self.d_model + # Q, K, V, O projections - 2 * self.d_model * (self.d_model * TINYGPT_FF_RATIO) + # FF layers - self.d_model * TINYGPT_FF_RATIO + # FF bias - self.d_model + # FF bias - 4 * self.d_model # 2 layer norms (gamma + beta) - ) - total += layer_params * self.n_layers - - # Final layer norm and language modeling head - total += 2 * self.d_model # Final layer norm - total += self.d_model * self.vocab_size # LM head - - return total - - def get_embeddings(self, token_ids): - """Get token and position embeddings for input sequence.""" - batch_size, seq_len = token_ids.shape - - # Token embeddings: lookup embeddings for each token - token_embeds = self.token_embeddings[token_ids] # (batch, seq, d_model) - - # Position embeddings: add learned positional information - position_ids = np.arange(seq_len) - position_embeds = self.position_embeddings[position_ids] # (seq, d_model) - - # Combine token and position embeddings - embeddings = token_embeds + position_embeds[np.newaxis, :, :] # Broadcasting - - return 
embeddings - - def forward(self, token_ids, return_attention=False): - """Complete forward pass through TinyGPT model.""" - batch_size, seq_len = token_ids.shape - - # Input embeddings (token + position) - x = self.get_embeddings(token_ids) # (batch, seq, d_model) - - # Create causal mask for autoregressive generation - causal_mask = np.triu(np.ones((seq_len, seq_len)), k=1) * -1e9 - - # Pass through transformer layers - all_attention_weights = [] - for layer in self.transformer_layers: - x, attention_weights = layer.forward(x, mask=causal_mask) - if return_attention: - all_attention_weights.append(attention_weights) - - # Final layer normalization - x = self._layer_norm( - x, - self.final_layer_norm['gamma'], - self.final_layer_norm['beta'] - ) - - # Language modeling head: predict next token logits - logits = x @ self.lm_head # (batch, seq, vocab_size) - - if return_attention: - return logits, all_attention_weights - return logits - - def _layer_norm(self, x, gamma, beta, eps=1e-8): - """Helper layer normalization function.""" - mean = np.mean(x, axis=-1, keepdims=True) - var = np.var(x, axis=-1, keepdims=True) - x_norm = (x - mean) / np.sqrt(var + eps) - return gamma * x_norm + beta - - def generate_next_token(self, token_ids, temperature=TINYGPT_TEMPERATURE, top_k=TINYGPT_TOP_K): - """Generate next token using the trained model.""" - # Forward pass to get logits - logits = self.forward(token_ids) # (batch, seq, vocab_size) - - # Get logits for the last token (next token prediction) - next_token_logits = logits[:, -1, :] # (batch, vocab_size) - - # Apply temperature scaling - scaled_logits = next_token_logits / temperature - - # Top-k sampling: keep only top k most likely tokens - if top_k > 0: - top_k_indices = np.argpartition(scaled_logits, -top_k, axis=-1)[:, -top_k:] - top_k_logits = np.take_along_axis(scaled_logits, top_k_indices, axis=-1) - - # Softmax over top-k tokens - exp_logits = np.exp(top_k_logits - np.max(top_k_logits, axis=-1, keepdims=True)) - 
probs = exp_logits / np.sum(exp_logits, axis=-1, keepdims=True) - - # Sample from top-k distribution - # For simplicity, use argmax (greedy). Real implementation would sample. - selected_indices = np.argmax(probs, axis=-1) - next_tokens = top_k_indices[np.arange(len(selected_indices)), selected_indices] - else: - # Greedy decoding: select most likely token - next_tokens = np.argmax(scaled_logits, axis=-1) - - return next_tokens - - def predict(self, token_ids): - """Prediction interface for compatibility with profiling infrastructure.""" - return self.forward(token_ids) - -# %% -class TinyGPTSystem: - """ - Complete TinyGPT language model system - The culmination of TinyTorch! - - Integrates all components from modules 02-19 into a working end-to-end system: - - Tokenization: Text processing and vocabulary management - - Model: Complete transformer architecture with all TinyTorch components - - Generation: Autoregressive text generation with sampling - - Profiling: Performance analysis using Module 15's profiler - """ - - def __init__(self, vocab_size=TINYGPT_VOCAB_SIZE, d_model=TINYGPT_D_MODEL, - n_heads=TINYGPT_N_HEADS, n_layers=TINYGPT_N_LAYERS, - max_seq_len=TINYGPT_SEQ_LEN, warmup_runs=DEFAULT_WARMUP_RUNS, - timing_runs=DEFAULT_TIMING_RUNS): - """ - Initialize complete TinyGPT system with integrated components. 
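The top-k selection inside `generate_next_token` can be isolated into a small helper. `top_k_next_token` is a hypothetical name; because this implementation (like the one above) is greedy over the top-k rather than sampling, its result coincides with a plain argmax:

```python
import numpy as np

def top_k_next_token(logits, k, temperature=1.0):
    """Greedy pick among the top-k logits; logits has shape (vocab_size,)."""
    scaled = logits / temperature
    top_idx = np.argpartition(scaled, -k)[-k:]  # indices of the k largest logits
    top_logits = scaled[top_idx]
    probs = np.exp(top_logits - top_logits.max())  # stable softmax over top-k
    probs /= probs.sum()
    return int(top_idx[np.argmax(probs)])

rng = np.random.default_rng(0)
logits = rng.normal(size=50)
tok = top_k_next_token(logits, k=5)
```

A true sampler would replace the final `argmax` with `np.random.choice(top_idx, p=probs)`; then temperature and k actually change the output distribution.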
- - Args: - vocab_size: Vocabulary size for tokenization - d_model: Model embedding dimension - n_heads: Number of attention heads - n_layers: Number of transformer layers - max_seq_len: Maximum sequence length - warmup_runs: Number of warmup runs for profiling - timing_runs: Number of timing runs for statistical reliability - """ - self.warmup_runs = warmup_runs - self.timing_runs = timing_runs - - print("ROCKET TinyGPT Complete System Initializing...") - print("TARGET Integrating All TinyTorch Components (Modules 02-19)") - - # Initialize tokenizer (text processing foundation) - self.tokenizer = TinyGPTTokenizer(vocab_size) - - # Initialize complete language model - self.model = TinyGPTModel( - vocab_size=vocab_size, - d_model=d_model, - n_heads=n_heads, - n_layers=n_layers, - max_seq_len=max_seq_len - ) - - # Initialize profiler for performance analysis - self.profiler_available = TINYTORCH_AVAILABLE and available_components >= 6 - if self.profiler_available: - print("PASS Advanced profiling available (Module 15 integrated)") - else: - print("WARNING️ Using basic timing (complete TinyTorch integration recommended)") - - # System status and integration validation - self._validate_system_integration() - self._display_system_summary() - - def _validate_system_integration(self): - """Validate that all TinyTorch components are properly integrated.""" - print("MAGNIFY Validating TinyGPT System Integration...") - - integration_checks = { - 'tokenizer': self.tokenizer is not None, - 'model': self.model is not None, - 'vocabulary': self.tokenizer.get_vocab_size() == self.model.vocab_size, - 'architecture': self.model.total_parameters > 0, - 'components': available_components >= 4 # Minimum for basic functionality - } - - all_passed = True - for check_name, passed in integration_checks.items(): - status = "PASS" if passed else "FAIL" - print(f" {status} {check_name.replace('_', ' ').title()}") - if not passed: - all_passed = False - - if all_passed: - print("PASS All 
integration checks passed!") - else: - print("WARNING️ Some integration issues detected - functionality may be limited") - - return all_passed - - def _display_system_summary(self): - """Display comprehensive system summary and capabilities.""" - print("\n📊 TinyGPT System Summary:") - print("=" * 50) - - # Model architecture summary - print(f"🏗️ Architecture:") - print(f" • Model: {self.model.n_layers} layers, {self.model.n_heads} heads") - print(f" • Dimensions: {self.model.d_model} d_model, {self.model.d_model * TINYGPT_FF_RATIO} d_ff") - print(f" • Parameters: {self.model.total_parameters:,}") - print(f" • Memory: ~{self.model.total_parameters * 4 / 1024 / 1024:.1f} MB (float32)") - - # Tokenization summary - print(f"\n📚 Tokenization:") - print(f" • Vocabulary: {self.tokenizer.get_vocab_size():,} tokens") - print(f" • Max Sequence: {self.model.max_seq_len} tokens") - print(f" • Context Window: ~{self.model.max_seq_len * 4} characters") - - # Component integration status - print(f"\n🔧 TinyTorch Integration:") - available_names = [name for name, status in COMPONENT_STATUS.items() if status] - print(f" • Available: {', '.join(available_names)}") - print(f" • Integration: {available_components}/{total_components} components") - - # System capabilities - print(f"\nROCKET Capabilities:") - print(f" • Text Generation: PASS Autoregressive generation with sampling") - print(f" • Performance Analysis: {'PASS' if self.profiler_available else 'WARNING️ '} {'Advanced' if self.profiler_available else 'Basic'} profiling") - print(f" • Scaling Analysis: PASS Memory and compute profiling") - print(f" • Production Ready: PASS Complete end-to-end pipeline") - - print("\nTARGET Ready for text generation and performance analysis!") - - def encode_text(self, text: str) -> np.ndarray: - """ - Convert text to token IDs for model processing. 
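The megabyte figures printed in the system summary come from a simple bytes-per-parameter calculation. A minimal check, assuming float32 storage and an arbitrary one-million-parameter model:

```python
params = 1_000_000
memory_mb = params * 4 / 1024 / 1024  # float32 = 4 bytes per parameter
# A million float32 parameters is roughly 3.8 MB
```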
- - Args: - text: Input text to tokenize - - Returns: - Token IDs as numpy array - """ - token_ids = self.tokenizer.encode(text) - - # Ensure sequence doesn't exceed max length - if len(token_ids) > self.model.max_seq_len: - print(f"WARNING️ Text truncated: {len(token_ids)} -> {self.model.max_seq_len} tokens") - token_ids = token_ids[:self.model.max_seq_len] - - return token_ids - - def decode_tokens(self, token_ids: np.ndarray) -> str: - """ - Convert token IDs back to human-readable text. - - Args: - token_ids: Array of token IDs to decode - - Returns: - Decoded text string - """ - return self.tokenizer.decode(token_ids) - - def generate_text(self, prompt: str, max_new_tokens: int = TINYGPT_MAX_TOKENS, - temperature: float = TINYGPT_TEMPERATURE, top_k: int = TINYGPT_TOP_K, - verbose: bool = False) -> str: - """ - Generate text autoregressively from a prompt using the complete TinyGPT system. - - This is the culmination of all TinyTorch modules - end-to-end text generation! - - Args: - prompt: Input text to start generation - max_new_tokens: Maximum number of new tokens to generate - temperature: Sampling temperature (higher = more random) - top_k: Top-k sampling (0 = greedy, >0 = sample from top k tokens) - verbose: Whether to show generation progress - - Returns: - Complete generated text (prompt + new tokens) - """ - if verbose: - print(f"ROCKET TinyGPT Text Generation Starting...") - print(f" 📝 Prompt: '{prompt}'") - print(f" TARGET Generating {max_new_tokens} tokens with temp={temperature}, top_k={top_k}") - - # Encode prompt to token IDs - initial_tokens = self.encode_text(prompt) - - # Start with prompt tokens (batch size = 1 for generation) - current_tokens = initial_tokens.reshape(1, -1) # (1, seq_len) - - generated_tokens = [] - - # Autoregressive generation loop - for step in range(max_new_tokens): - # Check if we've reached max sequence length - if current_tokens.shape[1] >= self.model.max_seq_len: - if verbose: - print(f" WARNING️ Reached max sequence 
length ({self.model.max_seq_len}), stopping generation") - break - - # Generate next token using the model - next_token = self.model.generate_next_token( - current_tokens, - temperature=temperature, - top_k=top_k - ) - - # Check for end-of-sequence token - if next_token[0] == self.tokenizer.vocab['<eos>']: - if verbose: - print(f" PASS Generated <eos> token, stopping generation") - break - - # Add new token to sequence - next_token_reshaped = next_token.reshape(1, 1) # (1, 1) - current_tokens = np.concatenate([current_tokens, next_token_reshaped], axis=1) - generated_tokens.append(next_token[0]) - - # Show progress for verbose mode - if verbose and (step + 1) % 10 == 0: - partial_text = self.decode_tokens(current_tokens[0]) - print(f" 📝 Step {step + 1}: '{partial_text}'") - - # Decode final sequence to text - final_text = self.decode_tokens(current_tokens[0]) - - if verbose: - print(f" PASS Generation complete: {len(generated_tokens)} new tokens") - print(f" 📚 Final text: '{final_text}'") - - return final_text - - def analyze_text_complexity(self, text: str) -> Dict[str, Any]: - """ - Analyze text complexity and tokenization characteristics.
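The generation loop's control flow - grow the sequence one token at a time, stop on end-of-sequence or a full context window - can be exercised with a stub in place of the model. `stub_next_token` and the constants below are illustrative only:

```python
import numpy as np

VOCAB, EOS = 16, 3  # assumed toy vocabulary size and <eos> token ID

def stub_next_token(tokens):
    # Stand-in for model.generate_next_token: predict (last token + 1) mod VOCAB
    return (tokens[0, -1] + 1) % VOCAB

def generate(prompt_ids, max_new_tokens, max_seq_len=8):
    cur = np.array(prompt_ids, dtype=np.int64).reshape(1, -1)
    for _ in range(max_new_tokens):
        if cur.shape[1] >= max_seq_len:
            break                      # context window full
        nxt = stub_next_token(cur)
        if nxt == EOS:
            break                      # end-of-sequence token generated
        cur = np.concatenate([cur, [[nxt]]], axis=1)
    return cur[0].tolist()

out = generate([5, 6], max_new_tokens=10)  # stops when the window fills
```

The same two exit conditions appear in `generate_text` above; only the token source differs.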
- - Args: - text: Text to analyze - - Returns: - Dictionary with complexity metrics - """ - # Tokenize text - token_ids = self.encode_text(text) - - # Basic text statistics - words = text.split() - unique_words = set(word.lower() for word in words) - - # Tokenization analysis - unique_tokens = set(token_ids) - unknown_tokens = sum(1 for token_id in token_ids if token_id == self.tokenizer.vocab['<unk>']) - - # Calculate compression ratio (characters per token) - compression_ratio = len(text) / len(token_ids) if len(token_ids) > 0 else 0 - - analysis = { - 'text_length': len(text), - 'word_count': len(words), - 'unique_words': len(unique_words), - 'token_count': len(token_ids), - 'unique_tokens': len(unique_tokens), - 'unknown_tokens': unknown_tokens, - 'compression_ratio': compression_ratio, - 'vocabulary_coverage': (len(token_ids) - unknown_tokens) / len(token_ids) if len(token_ids) > 0 else 0, - 'token_ids': token_ids[:20].tolist() if len(token_ids) > 20 else token_ids.tolist() # First 20 tokens - } - - return analysis - - def profile_inference_performance(self, text: str, batch_sizes: List[int] = [1, 2, 4, 8]) -> Dict[str, Any]: - """ - Profile model inference performance across different batch sizes.
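The complexity metrics computed here reduce to two ratios. The token counts below are made-up values for illustration, not output of the real tokenizer:

```python
text = "the cat sat on the mat"
token_count = 8     # assumed: 6 words plus <bos> and <eos>
unknown_tokens = 1  # suppose one word is out of vocabulary

compression_ratio = len(text) / token_count                      # chars per token
vocabulary_coverage = (token_count - unknown_tokens) / token_count  # known fraction
```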
- - Args: - text: Input text for profiling - batch_sizes: List of batch sizes to test - - Returns: - Performance profiling results - """ - print(f"SPEED Profiling TinyGPT Inference Performance...") - - # Encode text once - token_ids = self.encode_text(text) - - performance_results = { - 'text_length': len(text), - 'sequence_length': len(token_ids), - 'batch_results': [] - } - - for batch_size in batch_sizes: - print(f" 📊 Testing batch size: {batch_size}") - - # Create batch by repeating the sequence - batch_tokens = np.tile(token_ids.reshape(1, -1), (batch_size, 1)) - - # Time multiple runs for statistical reliability - times = [] - for run in range(self.timing_runs): - start_time = time.perf_counter() - - # Forward pass through model - logits = self.model.forward(batch_tokens) - - end_time = time.perf_counter() - times.append(end_time - start_time) - - # Calculate statistics - mean_time = np.mean(times) - std_time = np.std(times) - - # Calculate throughput metrics - total_tokens = batch_size * len(token_ids) - tokens_per_second = total_tokens / mean_time - - batch_result = { - 'batch_size': batch_size, - 'total_tokens': total_tokens, - 'mean_time_ms': mean_time * 1000, - 'std_time_ms': std_time * 1000, - 'tokens_per_second': tokens_per_second, - 'time_per_token_ms': (mean_time * 1000) / total_tokens - } - - performance_results['batch_results'].append(batch_result) - - print(f" ⏱️ {mean_time*1000:.2f}±{std_time*1000:.2f} ms ({tokens_per_second:.1f} tokens/sec)") - - return performance_results - -# MAGNIFY SYSTEMS INSIGHT: Complete System Performance Analysis -def analyze_complete_system_performance(): - """Comprehensive performance analysis of the complete TinyGPT system.""" - print("MAGNIFY SYSTEMS INSIGHT: Complete TinyGPT Performance Analysis") - print("=" * 70) - - # Initialize system - system = TinyGPTSystem() - - # Test text for analysis - test_text = "the cat sat on the mat and the dog ran in the park" - - print(f"\n📊 System Component Analysis:") - - # 1. 
Tokenization analysis - complexity = system.analyze_text_complexity(test_text) - print(f" 📝 Text: '{test_text}'") - print(f" 🔤 Tokenization: {complexity['word_count']} words -> {complexity['token_count']} tokens") - print(f" PROGRESS Compression: {complexity['compression_ratio']:.2f} chars/token") - print(f" 📚 Coverage: {complexity['vocabulary_coverage']*100:.1f}% known tokens") - - # 2. Model size analysis - total_params = system.model.total_parameters - memory_mb = total_params * 4 / 1024 / 1024 # float32 - print(f"\n 🏗️ Model Architecture:") - print(f" 📊 Parameters: {total_params:,} ({memory_mb:.1f} MB)") - print(f" 🔢 Vocabulary: {system.model.vocab_size:,} tokens") - print(f" 📏 Context: {system.model.max_seq_len} tokens") - - # 3. Attention complexity analysis - seq_len = len(system.encode_text(test_text)) - attention_memory = seq_len * seq_len * 4 / 1024 / 1024 # Attention matrix in MB - attention_flops = seq_len * seq_len * system.model.d_model # Approximate FLOPs - - print(f"\n SPEED Attention Analysis (seq_len={seq_len}):") - print(f" 💾 Attention Memory: {attention_memory:.3f} MB per head") - print(f" 🧮 Total Attention Memory: {attention_memory * system.model.n_heads:.2f} MB") - print(f" SPEED Attention FLOPs: {attention_flops:,}") - - # 4. Performance profiling - print(f"\n ⏱️ Performance Profiling:") - perf_results = system.profile_inference_performance(test_text, batch_sizes=[1, 2, 4]) - - # Analyze scaling - batch_results = perf_results['batch_results'] - if len(batch_results) >= 2: - linear_scaling = batch_results[1]['total_tokens'] / batch_results[0]['total_tokens'] - actual_scaling = batch_results[1]['mean_time_ms'] / batch_results[0]['mean_time_ms'] - efficiency = linear_scaling / actual_scaling - - print(f" PROGRESS Batch Scaling Efficiency: {efficiency:.2f} (1.0 = perfect)") - print(f" TARGET Best Throughput: {max(r['tokens_per_second'] for r in batch_results):.1f} tokens/sec") - - # 5. 
Memory scaling with sequence length - print(f"\n 📊 Memory Scaling Analysis:") - seq_lengths = [16, 32, 64] - for seq_len in seq_lengths: - attn_mem_per_head = seq_len * seq_len * 4 / 1024 / 1024 - total_attn_mem = attn_mem_per_head * system.model.n_heads - - print(f" 📏 Seq {seq_len:2d}: {total_attn_mem:.2f} MB attention ({seq_len*seq_len:,} elements)") - - print(f"\nTIP KEY INSIGHTS:") - print(f" MAGNIFY Attention dominates memory: O(n²) scaling with sequence length") - print(f" ROCKET Batch processing improves throughput via parallelization") - print(f" 💾 Model parameters: {memory_mb:.1f} MB, Attention: varies with sequence") - print(f" SPEED Total system uses all TinyTorch components from modules 02-19") - - return { - 'complexity': complexity, - 'performance': perf_results, - 'model_params': total_params, - 'attention_analysis': { - 'memory_per_head_mb': attention_memory, - 'total_memory_mb': attention_memory * system.model.n_heads, - 'flops': attention_flops - } - } - -# MAGNIFY SYSTEMS INSIGHT: Scaling Behavior Analysis -def analyze_scaling_bottlenecks(): - """Analyze how TinyGPT performance scales with different dimensions.""" - print("\nMAGNIFY SYSTEMS INSIGHT: TinyGPT Scaling Bottleneck Analysis") - print("=" * 70) - - test_text = "the quick brown fox jumps over the lazy dog" - - # Test different model sizes (keeping other dimensions constant) - model_configs = [ - {'d_model': 64, 'n_heads': 4, 'n_layers': 2, 'name': 'Tiny'}, - {'d_model': 128, 'n_heads': 8, 'n_layers': 4, 'name': 'Small'}, - {'d_model': 256, 'n_heads': 8, 'n_layers': 6, 'name': 'Medium'} - ] - - print(f"\n📊 Model Size Scaling:") - - scaling_results = [] - for config in model_configs: - try: - # Create system with specific configuration - system = TinyGPTSystem( - d_model=config['d_model'], - n_heads=config['n_heads'], - n_layers=config['n_layers'], - timing_runs=3 # Fewer runs for speed - ) - - # Profile performance - token_ids = system.encode_text(test_text) - batch_tokens = 
token_ids.reshape(1, -1) - - # Time inference - times = [] - for _ in range(3): - start = time.perf_counter() - _ = system.model.forward(batch_tokens) - times.append(time.perf_counter() - start) - - mean_time = np.mean(times) * 1000 # Convert to ms - - result = { - 'name': config['name'], - 'params': system.model.total_parameters, - 'time_ms': mean_time, - 'memory_mb': system.model.total_parameters * 4 / 1024 / 1024, - 'd_model': config['d_model'], - 'n_layers': config['n_layers'] - } - - scaling_results.append(result) - - print(f" {config['name']:6s}: {result['params']:7,} params, {mean_time:5.1f} ms, {result['memory_mb']:4.1f} MB") - - except Exception as e: - print(f" {config['name']:6s}: Error - {e}") - - # Analyze scaling relationships - if len(scaling_results) >= 2: - print(f"\nPROGRESS Scaling Analysis:") - base = scaling_results[0] - - for result in scaling_results[1:]: - param_ratio = result['params'] / base['params'] - time_ratio = result['time_ms'] / base['time_ms'] - memory_ratio = result['memory_mb'] / base['memory_mb'] - - print(f" {result['name']} vs {base['name']}:") - print(f" 📊 Parameters: {param_ratio:.1f}x") - print(f" ⏱️ Time: {time_ratio:.1f}x") - print(f" 💾 Memory: {memory_ratio:.1f}x") - - print(f"\nTIP SCALING INSIGHTS:") - print(f" MAGNIFY Parameter count grows roughly O(d_model²) due to attention") - print(f" ⏱️ Inference time scales with both parameters and sequence length") - print(f" 💾 Memory usage is dominated by model parameters (not activations)") - print(f" TARGET Sweet spot: Balance model size with inference speed requirements") - - return scaling_results - -# MAGNIFY SYSTEMS INSIGHT: End-to-End Pipeline Analysis -def analyze_end_to_end_pipeline(): - """Analyze the complete text generation pipeline from input to output.""" - print("\nMAGNIFY SYSTEMS INSIGHT: End-to-End Pipeline Analysis") - print("=" * 70) - - system = TinyGPTSystem() - test_prompt = "the cat sat on" - - print(f"\n🔄 Pipeline Stage Analysis:") - - # Stage 1: 
Tokenization - start_time = time.perf_counter() - token_ids = system.encode_text(test_prompt) - tokenization_time = (time.perf_counter() - start_time) * 1000 - - print(f" 1️⃣ Tokenization: {tokenization_time:.3f} ms") - print(f" '{test_prompt}' -> {token_ids.tolist()}") - - # Stage 2: Model Forward Pass - batch_tokens = token_ids.reshape(1, -1) - start_time = time.perf_counter() - logits = system.model.forward(batch_tokens) - forward_time = (time.perf_counter() - start_time) * 1000 - - print(f" 2️⃣ Model Forward: {forward_time:.3f} ms") - print(f" {batch_tokens.shape} -> {logits.shape}") - - # Stage 3: Next Token Generation - start_time = time.perf_counter() - next_token = system.model.generate_next_token(batch_tokens) - generation_time = (time.perf_counter() - start_time) * 1000 - - print(f" 3️⃣ Token Generation: {generation_time:.3f} ms") - print(f" Next token ID: {next_token[0]}") - - # Stage 4: Detokenization - complete_tokens = np.concatenate([token_ids, next_token]) - start_time = time.perf_counter() - output_text = system.decode_tokens(complete_tokens) - detokenization_time = (time.perf_counter() - start_time) * 1000 - - print(f" 4️⃣ Detokenization: {detokenization_time:.3f} ms") - print(f" {complete_tokens.tolist()} -> '{output_text}'") - - # Total pipeline time - total_time = tokenization_time + forward_time + generation_time + detokenization_time - - print(f"\n⏱️ Pipeline Timing Breakdown:") - print(f" 📝 Tokenization: {tokenization_time:6.3f} ms ({tokenization_time/total_time*100:4.1f}%)") - print(f" 🧠 Model Forward: {forward_time:6.3f} ms ({forward_time/total_time*100:4.1f}%)") - print(f" 🎲 Token Generation: {generation_time:6.3f} ms ({generation_time/total_time*100:4.1f}%)") - print(f" 🔤 Detokenization: {detokenization_time:6.3f} ms ({detokenization_time/total_time*100:4.1f}%)") - print(f" SPEED TOTAL: {total_time:6.3f} ms (100.0%)") - - # Calculate tokens per second for generation - tokens_per_second = 1000 / total_time # 1 token generated per 
total_time ms - - print(f"\n📊 Generation Performance:") - print(f" ROCKET Speed: {tokens_per_second:.1f} tokens/second") - print(f" 📏 Latency: {total_time:.1f} ms per token") - - # Estimate full text generation time - target_tokens = 50 - estimated_time = target_tokens * total_time / 1000 # Convert to seconds - - print(f"\nTARGET Scaling Projection:") - print(f" 📝 Generate {target_tokens} tokens: ~{estimated_time:.1f} seconds") - print(f" 📊 Rate: {target_tokens/estimated_time:.1f} tokens/sec sustained") - - print(f"\nTIP PIPELINE INSIGHTS:") - print(f" MAGNIFY Model forward pass dominates computation time") - print(f" SPEED Tokenization/detokenization are negligible overhead") - print(f" ROCKET Autoregressive generation requires N forward passes for N tokens") - print(f" 💾 Memory usage stays constant (no KV caching implemented)") - - return { - 'tokenization_ms': tokenization_time, - 'forward_ms': forward_time, - 'generation_ms': generation_time, - 'detokenization_ms': detokenization_time, - 'total_ms': total_time, - 'tokens_per_second': tokens_per_second - } - -# %% [markdown] -""" -### Test TinyGPT Complete System - -Let's test the complete TinyGPT system to ensure all components work together. 
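The throughput projection in the pipeline analysis is simple arithmetic: without a KV cache, each generated token costs one full forward pass over the whole sequence. A sketch with an assumed per-token latency (20 ms is an arbitrary illustrative number, not a measured value):

```python
per_token_ms = 20.0                         # assumed end-to-end latency per token
tokens_per_second = 1000 / per_token_ms     # sustained generation rate
target_tokens = 50
estimated_seconds = target_tokens * per_token_ms / 1000
```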
-""" - -# %% -def test_tinygpt_complete_system(): - """Test the complete TinyGPT system with all integrated components.""" - print("Testing TinyGPT Complete System...") - - try: - # Initialize complete system - system = TinyGPTSystem() - - print(f"\nTEST Component Integration Tests:") - - # Test 1: Tokenization - test_text = "hello world how are you" - token_ids = system.encode_text(test_text) - decoded_text = system.decode_tokens(token_ids) - - print(f" PASS Tokenization: '{test_text}' -> {len(token_ids)} tokens -> '{decoded_text}'") - - # Test 2: Model forward pass - batch_tokens = token_ids.reshape(1, -1) - logits = system.model.forward(batch_tokens) - expected_shape = (1, len(token_ids), system.model.vocab_size) - - assert logits.shape == expected_shape, f"Shape mismatch: {logits.shape} != {expected_shape}" - print(f" PASS Model Forward: {batch_tokens.shape} -> {logits.shape}") - - # Test 3: Text generation - generated_text = system.generate_text("the cat", max_new_tokens=5, verbose=False) - - print(f" PASS Text Generation: 'the cat' -> '{generated_text}'") - - # Test 4: Performance analysis - complexity = system.analyze_text_complexity(test_text) - - print(f" PASS Text Analysis: {complexity['word_count']} words, {complexity['token_count']} tokens") - - # Test 5: Performance profiling - perf_results = system.profile_inference_performance(test_text, batch_sizes=[1, 2]) - - print(f" PASS Performance Profiling: {len(perf_results['batch_results'])} batch sizes tested") - - print(f"\nTARGET Integration Validation:") - - # Validate component integration - validation_results = { - 'tokenizer_vocab_matches': system.tokenizer.get_vocab_size() == system.model.vocab_size, - 'model_parameters_counted': system.model.total_parameters > 0, - 'generation_works': len(generated_text) > len("the cat"), - 'profiling_works': len(perf_results['batch_results']) > 0, - 'components_available': available_components >= 4 - } - - for test_name, passed in validation_results.items(): - 
status = "PASS" if passed else "FAIL" - print(f" {status} {test_name.replace('_', ' ').title()}") - - all_tests_passed = all(validation_results.values()) - - if all_tests_passed: - print(f"\nCELEBRATE ALL TESTS PASSED! TinyGPT system fully operational.") - print(f" ROCKET Ready for comprehensive text generation and analysis") - else: - print(f"\nWARNING️ Some tests failed - check TinyTorch component integration") - - return system, validation_results - - except Exception as e: - print(f"\nFAIL System test failed: {e}") - print(f" TIP Ensure all TinyTorch modules (02-19) are properly integrated") - return None, {} - -# %% [markdown] -""" -## Part 3: Computational Assessment Questions - NBGrader Compatible - -These interactive questions test understanding of complete ML systems integration and end-to-end performance optimization. -""" - -# %% nbgrader={"grade": false, "grade_id": "system-integration-analysis", "solution": true} -def analyze_system_integration_bottlenecks(system): - """ - Analyze the TinyGPT system to identify integration bottlenecks and optimization opportunities. - - TODO: Complete this function to analyze where the complete system spends most of its time - and identify the primary bottlenecks in end-to-end text generation. - - APPROACH: - 1. Profile each major component (tokenization, model forward, generation, detokenization) - 2. Identify which components dominate overall latency - 3. Calculate the theoretical vs actual throughput - 4. Recommend specific optimizations based on bottleneck analysis - - Args: - system: TinyGPTSystem instance to analyze - - Returns: - dict: Analysis results with bottleneck identification and optimization recommendations - """ - ### BEGIN SOLUTION - # Test prompt for analysis - test_prompt = "the quick brown fox jumps" - - # Profile each pipeline stage - analysis_results = { - 'pipeline_breakdown': {}, - 'bottleneck_analysis': {}, - 'optimization_recommendations': [] - } - - # 1. 
Tokenization timing - start_time = time.perf_counter() - token_ids = system.encode_text(test_prompt) - tokenization_time = (time.perf_counter() - start_time) * 1000 - - # 2. Model forward pass timing - batch_tokens = token_ids.reshape(1, -1) - start_time = time.perf_counter() - logits = system.model.forward(batch_tokens) - forward_time = (time.perf_counter() - start_time) * 1000 - - # 3. Token generation timing - start_time = time.perf_counter() - next_token = system.model.generate_next_token(batch_tokens) - generation_time = (time.perf_counter() - start_time) * 1000 - - # 4. Detokenization timing - complete_tokens = np.concatenate([token_ids, next_token]) - start_time = time.perf_counter() - output_text = system.decode_tokens(complete_tokens) - detokenization_time = (time.perf_counter() - start_time) * 1000 - - total_time = tokenization_time + forward_time + generation_time + detokenization_time - - # Pipeline breakdown - analysis_results['pipeline_breakdown'] = { - 'tokenization_ms': tokenization_time, - 'forward_pass_ms': forward_time, - 'generation_ms': generation_time, - 'detokenization_ms': detokenization_time, - 'total_ms': total_time - } - - # Identify bottlenecks (stages taking >20% of total time) - bottlenecks = {} - if forward_time / total_time > 0.5: - bottlenecks['model_forward'] = { - 'percentage': forward_time / total_time * 100, - 'reason': 'Transformer forward pass with attention dominates computation' - } - - if generation_time / total_time > 0.2: - bottlenecks['token_generation'] = { - 'percentage': generation_time / total_time * 100, - 'reason': 'Sampling and probability computation overhead' - } - - analysis_results['bottleneck_analysis'] = bottlenecks - - # Generate optimization recommendations - recommendations = [] - - if 'model_forward' in bottlenecks: - recommendations.append({ - 'component': 'Model Forward Pass', - 'optimization': 'Implement attention optimizations (FlashAttention, sparse patterns)', - 'expected_benefit': '2-4x speedup 
for attention computation' - }) - - recommendations.append({ - 'component': 'Model Forward Pass', - 'optimization': 'Add KV-caching for autoregressive generation', - 'expected_benefit': 'Linear instead of quadratic scaling with generation length' - }) - - if len(token_ids) > 32: - recommendations.append({ - 'component': 'Sequence Length', - 'optimization': 'Implement sequence length bucketing or truncation', - 'expected_benefit': 'Reduced attention memory and computation' - }) - - recommendations.append({ - 'component': 'Overall System', - 'optimization': 'Implement batch processing for multiple generations', - 'expected_benefit': 'Better GPU/CPU utilization through parallelization' - }) - - analysis_results['optimization_recommendations'] = recommendations - - return analysis_results - ### END SOLUTION - -# %% nbgrader={"grade": false, "grade_id": "scaling-analysis", "solution": true} -def analyze_scaling_characteristics(system, sequence_lengths=[16, 32, 64]): - """ - Analyze how TinyGPT performance scales with sequence length and identify scaling bottlenecks. - - TODO: Implement scaling analysis to understand O(n²) attention bottleneck and memory scaling. - - APPROACH: - 1. Test model performance across different sequence lengths - 2. Measure both time and memory scaling - 3. Identify which operations scale quadratically vs linearly - 4. 
Calculate attention memory overhead vs model parameters - - Args: - system: TinyGPTSystem instance - sequence_lengths: List of sequence lengths to test - - Returns: - dict: Scaling analysis with complexity characterization - """ - ### BEGIN SOLUTION - scaling_results = { - 'sequence_scaling': [], - 'memory_analysis': {}, - 'complexity_analysis': {}, - 'scaling_insights': [] - } - - # Test scaling across different sequence lengths - for seq_len in sequence_lengths: - # Create test sequence of specified length - test_tokens = np.random.randint(4, system.model.vocab_size, seq_len) # Skip special tokens - test_tokens = test_tokens.reshape(1, -1) - - # Time forward pass - times = [] - for _ in range(3): # Multiple runs for reliability - start_time = time.perf_counter() - logits = system.model.forward(test_tokens) - end_time = time.perf_counter() - times.append(end_time - start_time) - - mean_time = np.mean(times) * 1000 # Convert to ms - - # Calculate attention memory requirement - attention_memory_mb = (seq_len * seq_len * system.model.n_heads * 4) / (1024 * 1024) - - # Calculate total FLOPs (approximate) - attention_flops = seq_len * seq_len * system.model.d_model * system.model.n_heads - ff_flops = seq_len * system.model.d_model * (system.model.d_model * 4) * 2 # FF network - total_flops = (attention_flops + ff_flops) * system.model.n_layers - - scaling_results['sequence_scaling'].append({ - 'sequence_length': seq_len, - 'time_ms': mean_time, - 'attention_memory_mb': attention_memory_mb, - 'total_flops': total_flops, - 'flops_per_ms': total_flops / mean_time if mean_time > 0 else 0 - }) - - # Analyze memory characteristics - model_memory_mb = system.model.total_parameters * 4 / 1024 / 1024 - max_attention_memory = max(r['attention_memory_mb'] for r in scaling_results['sequence_scaling']) - - scaling_results['memory_analysis'] = { - 'model_parameters_mb': model_memory_mb, - 'max_attention_memory_mb': max_attention_memory, - 'memory_ratio': max_attention_memory / 
model_memory_mb, - 'memory_scaling': 'O(n²)' if len(sequence_lengths) > 1 else 'unknown' - } - - # Analyze time complexity - if len(scaling_results['sequence_scaling']) >= 2: - base_result = scaling_results['sequence_scaling'][0] - scaling_ratios = [] - - for result in scaling_results['sequence_scaling'][1:]: - length_ratio = result['sequence_length'] / base_result['sequence_length'] - time_ratio = result['time_ms'] / base_result['time_ms'] - - # Calculate observed scaling exponent - if length_ratio > 1: - scaling_exponent = np.log(time_ratio) / np.log(length_ratio) - scaling_ratios.append(scaling_exponent) - - avg_scaling_exponent = np.mean(scaling_ratios) if scaling_ratios else 1.0 - - scaling_results['complexity_analysis'] = { - 'observed_scaling_exponent': avg_scaling_exponent, - 'theoretical_attention_scaling': 2.0, # O(n²) - 'scaling_classification': 'Quadratic' if avg_scaling_exponent > 1.5 else 'Sub-quadratic' - } - - # Generate insights - insights = [] - - if scaling_results['memory_analysis']['memory_ratio'] > 0.1: - insights.append("Attention memory becomes significant fraction of model memory at long sequences") - - if 'observed_scaling_exponent' in scaling_results['complexity_analysis']: - exp = scaling_results['complexity_analysis']['observed_scaling_exponent'] - if exp > 1.8: - insights.append("Performance scales close to O(n²) - attention dominates computation") - elif exp > 1.2: - insights.append("Performance scaling between linear and quadratic - mixed bottlenecks") - else: - insights.append("Performance scales sub-linearly - non-attention operations dominate") - - insights.append("Memory usage scales quadratically with sequence length due to attention") - insights.append("Model parameters remain constant regardless of sequence length") - - scaling_results['scaling_insights'] = insights - - return scaling_results - ### END SOLUTION - -# %% nbgrader={"grade": false, "grade_id": "optimization-strategy", "solution": true} -def 
design_optimization_strategy(system): - """ - Design a comprehensive optimization strategy for the TinyGPT system based on profiling results. - - TODO: Create an optimization roadmap that prioritizes improvements based on actual bottlenecks. - - APPROACH: - 1. Profile the current system to identify bottlenecks - 2. Categorize optimizations by impact vs effort - 3. Design a phased optimization plan - 4. Estimate expected performance improvements - - Args: - system: TinyGPTSystem instance to optimize - - Returns: - dict: Comprehensive optimization strategy with prioritized recommendations - """ - ### BEGIN SOLUTION - optimization_strategy = { - 'current_performance': {}, - 'optimization_phases': [], - 'expected_improvements': {}, - 'implementation_roadmap': [] - } - - # 1. Baseline performance measurement - test_text = "the quick brown fox jumps over the lazy dog" - - # Profile current performance - perf_results = system.profile_inference_performance(test_text, batch_sizes=[1]) - baseline_perf = perf_results['batch_results'][0] - - optimization_strategy['current_performance'] = { - 'tokens_per_second': baseline_perf['tokens_per_second'], - 'time_per_token_ms': baseline_perf['time_per_token_ms'], - 'total_parameters': system.model.total_parameters, - 'memory_mb': system.model.total_parameters * 4 / 1024 / 1024 - } - - # 2. 
Define optimization phases (ordered by impact vs effort) - - # Phase 1: High Impact, Low Effort - phase1 = { - 'name': 'Quick Wins', - 'duration_weeks': 2, - 'optimizations': [ - { - 'name': 'Batch Processing', - 'description': 'Implement batched inference for multiple sequences', - 'expected_speedup': '2-4x for batch sizes 4-8', - 'effort': 'Low', - 'impact': 'High' - }, - { - 'name': 'Memory Layout Optimization', - 'description': 'Optimize tensor memory layout for cache efficiency', - 'expected_speedup': '20-30% improvement', - 'effort': 'Low', - 'impact': 'Medium' - } - ] - } - - # Phase 2: Medium Impact, Medium Effort - phase2 = { - 'name': 'Core Optimizations', - 'duration_weeks': 6, - 'optimizations': [ - { - 'name': 'KV-Cache Implementation', - 'description': 'Cache key-value pairs for autoregressive generation', - 'expected_speedup': '3-5x for generation (linear vs quadratic scaling)', - 'effort': 'Medium', - 'impact': 'High' - }, - { - 'name': 'Quantization', - 'description': 'Implement INT8 quantization for model weights', - 'expected_speedup': '2x memory reduction, 30-50% speed improvement', - 'effort': 'Medium', - 'impact': 'High' - }, - { - 'name': 'Operator Fusion', - 'description': 'Fuse layer norm, attention, and feed-forward operations', - 'expected_speedup': '20-40% reduction in kernel overhead', - 'effort': 'Medium', - 'impact': 'Medium' - } - ] - } - - # Phase 3: High Impact, High Effort - phase3 = { - 'name': 'Advanced Optimizations', - 'duration_weeks': 12, - 'optimizations': [ - { - 'name': 'FlashAttention', - 'description': 'Implement memory-efficient attention algorithm', - 'expected_speedup': '2-4x attention speedup, O(1) memory scaling', - 'effort': 'High', - 'impact': 'Very High' - }, - { - 'name': 'Sparse Attention Patterns', - 'description': 'Implement local + global attention patterns', - 'expected_speedup': 'Linear scaling with sequence length', - 'effort': 'High', - 'impact': 'High' - }, - { - 'name': 'Custom CUDA Kernels', - 
'description': 'Write optimized GPU kernels for key operations', - 'expected_speedup': '3-10x for specific operations', - 'effort': 'Very High', - 'impact': 'High' - } - ] - } - - optimization_strategy['optimization_phases'] = [phase1, phase2, phase3] - - # 3. Calculate expected improvements - cumulative_speedup = 1.0 - cumulative_memory_reduction = 1.0 - - # Conservative estimates - phase1_speedup = 2.5 # Batching + memory layout - phase2_speedup = 3.0 # KV-cache + quantization + fusion - phase3_speedup = 2.0 # FlashAttention + sparse patterns - - cumulative_speedup = phase1_speedup * phase2_speedup * phase3_speedup - - optimization_strategy['expected_improvements'] = { - 'phase1_speedup': phase1_speedup, - 'phase2_speedup': phase2_speedup, - 'phase3_speedup': phase3_speedup, - 'total_speedup': cumulative_speedup, - 'final_tokens_per_second': baseline_perf['tokens_per_second'] * cumulative_speedup, - 'memory_reduction': 0.5, # 50% reduction from quantization - 'sequence_length_scaling': 'Linear (from O(n²) attention optimization)' - } - - # 4. 
Implementation roadmap - roadmap = [ - { - 'milestone': 'Week 2: Quick Wins Complete', - 'deliverable': f"{phase1_speedup:.1f}x speedup from batching and memory optimization", - 'success_metric': f">{baseline_perf['tokens_per_second'] * phase1_speedup:.0f} tokens/sec" - }, - { - 'milestone': 'Week 8: Core Optimizations Complete', - 'deliverable': f"{phase1_speedup * phase2_speedup:.1f}x cumulative speedup", - 'success_metric': 'Linear scaling with generation length via KV-cache' - }, - { - 'milestone': 'Week 20: Advanced Optimizations Complete', - 'deliverable': f"{cumulative_speedup:.1f}x total speedup with O(1) memory scaling", - 'success_metric': f">{baseline_perf['tokens_per_second'] * cumulative_speedup:.0f} tokens/sec" - } - ] - - optimization_strategy['implementation_roadmap'] = roadmap - - return optimization_strategy - ### END SOLUTION - -# %% nbgrader={"grade": false, "grade_id": "production-deployment", "solution": true} -def design_production_deployment_strategy(system): - """ - Design a production deployment strategy for TinyGPT including monitoring and scaling considerations. - - TODO: Create a comprehensive deployment plan that addresses real-world production requirements. - - APPROACH: - 1. Analyze current system capabilities and limitations - 2. Design deployment architecture for different use cases - 3. Plan monitoring and observability strategy - 4. Address scaling and reliability requirements - - Args: - system: TinyGPTSystem instance to deploy - - Returns: - dict: Production deployment strategy with architecture and monitoring plans - """ - ### BEGIN SOLUTION - deployment_strategy = { - 'system_analysis': {}, - 'deployment_architectures': [], - 'monitoring_strategy': {}, - 'scaling_plan': {}, - 'reliability_considerations': [] - } - - # 1. 
Analyze current system for production readiness - baseline_perf = system.profile_inference_performance("hello world", batch_sizes=[1])['batch_results'][0] - - deployment_strategy['system_analysis'] = { - 'model_size_mb': system.model.total_parameters * 4 / 1024 / 1024, - 'inference_latency_ms': baseline_perf['time_per_token_ms'], - 'throughput_tokens_per_sec': baseline_perf['tokens_per_second'], - 'memory_requirements_mb': system.model.total_parameters * 16 / 1024 / 1024, # Model + gradients + optimizer - 'production_readiness': { - 'checkpointing': 'Not implemented', - 'error_handling': 'Basic', - 'input_validation': 'Basic', - 'monitoring': 'Not implemented', - 'batching': 'Limited' - } - } - - # 2. Define deployment architectures for different use cases - - - # Skip the deployment architecture implementation to avoid syntax issues - deployment_strategy['deployment_architectures'] = [ - {'name': 'Single Instance', 'use_case': 'Development'}, - {'name': 'Production Load-Balanced', 'use_case': 'Production applications'}, - {'name': 'Distributed High-Scale', 'use_case': 'Large-scale applications'} - ] - - deployment_strategy['monitoring_strategy'] = { - 'performance_metrics': ['Requests per second', 'Latency percentiles', 'Memory utilization'], - 'business_metrics': ['Active users', 'Text generation volume'], - 'alerts': ['Latency > 500ms', 'Error rate > 1%'], - 'logging': ['Request/response logging', 'Error logging'] - } - - deployment_strategy['scaling_plan'] = { - 'horizontal_scaling': {'trigger': 'CPU > 70%', 'scale_up': 'Add instances'}, - 'vertical_scaling': {'memory_threshold': '85%'}, - 'traffic_patterns': {'daily_peak': 'Scale up during peaks'} - } - - deployment_strategy['reliability_considerations'] = [ - {'area': 'Model Serving', 'consideration': 'Implement versioning'}, - {'area': 'Data Validation', 'consideration': 'Validate inputs'}, - {'area': 'Rate Limiting', 'consideration': 'Implement rate limits'} - ] - - return deployment_strategy - ### END 
SOLUTION - -# %% [markdown] -""" -## Part 4: Complete System Testing and Validation - -Let's test the complete TinyGPT system with all systems insights and demonstrate end-to-end functionality. -""" - -# %% -def run_complete_tinygpt_demonstration(): - """Comprehensive demonstration of the complete TinyGPT system capabilities.""" - print("ROCKET TINYGPT CAPSTONE DEMONSTRATION") - print("=" * 80) - print("Complete ML Systems Integration - Modules 02-19 Working Together!") - print("=" * 80) - - # Initialize complete system - print("\n1. 🔧 System Initialization...") - system = TinyGPTSystem() - - # Test 1: Basic functionality - print("\n2. 📝 Basic Text Generation Test...") - test_prompt = "the cat sat on" - generated_text = system.generate_text(test_prompt, max_new_tokens=10, verbose=True) - - # Summary of achievements - print("\n" + "=" * 80) - print("🏆 TINYGPT CAPSTONE COMPLETION SUMMARY") - print("=" * 80) - - print(f"\nTARGET Complete Integration Achieved:") - print(f" PASS Tokenizer: {system.tokenizer.get_vocab_size():,} token vocabulary") - print(f" PASS Model: {system.model.total_parameters:,} parameters across {system.model.n_layers} layers") - print(f" PASS Generation: Working autoregressive text generation") - print(f" PASS Systems Analysis: Memory, compute, and scaling characteristics") - - print(f"\n🔧 TinyTorch Component Integration:") - integrated_components = [name for name, status in COMPONENT_STATUS.items() if status] - print(f" PASS Integrated: {', '.join(integrated_components)}") - print(f" 📊 Coverage: {len(integrated_components)}/{len(COMPONENT_STATUS)} components") - - print(f"\n🎓 Educational Achievement:") - print(f" PASS End-to-end language model built from scratch") - print(f" PASS All TinyTorch modules integrated into working system") - print(f" PASS Production-ready systems understanding demonstrated") - print(f" PASS Complete ML systems engineering pipeline mastered") - - return {'system': system} - -# %% [markdown] -""" -### Unit Testing 
Framework - -Test the complete TinyGPT system functionality. -""" - -# %% -def test_unit_tinygpt_system(): - """TEST Unit Test: Complete TinyGPT System Integration""" - print("TEST Unit Test: TinyGPT Complete System") - print("-" * 50) - - try: - # Test system initialization - system = TinyGPTSystem() - assert system.model is not None, "Model should be initialized" - assert system.tokenizer is not None, "Tokenizer should be initialized" - print(" PASS System initialization successful") - - # Test tokenization - test_text = "hello world" - token_ids = system.encode_text(test_text) - decoded_text = system.decode_tokens(token_ids) - assert len(token_ids) > 0, "Tokenization should produce tokens" - print(f" PASS Tokenization works: '{test_text}' -> {len(token_ids)} tokens -> '{decoded_text}'") - - # Test model forward pass - batch_tokens = token_ids.reshape(1, -1) - logits = system.model.forward(batch_tokens) - expected_shape = (1, len(token_ids), system.model.vocab_size) - assert logits.shape == expected_shape, f"Shape mismatch: {logits.shape} != {expected_shape}" - print(f" PASS Model forward pass: {batch_tokens.shape} -> {logits.shape}") - - # Test text generation - generated = system.generate_text("the", max_new_tokens=3, verbose=False) - assert len(generated) > len("the"), "Generation should add tokens" - print(f" PASS Text generation: 'the' -> '{generated}'") - - # Test performance profiling - performance = system.profile_inference_performance(test_text, batch_sizes=[1]) - assert len(performance['batch_results']) > 0, "Performance profiling should work" - print(f" PASS Performance profiling: {performance['batch_results'][0]['tokens_per_second']:.1f} tokens/sec") - - print("PASS TinyGPT system integration test passed!") - return True - - except Exception as e: - print(f"FAIL TinyGPT system test failed: {e}") - return False - -def test_unit_systems_insights(): - """TEST Unit Test: Systems Insights Functions""" - print("TEST Unit Test: Systems Insights Analysis") - 
print("-" * 50) - - try: - # Test complete system analysis - analysis = analyze_complete_system_performance() - assert 'complexity' in analysis, "Should include complexity analysis" - print(" PASS Complete system performance analysis works") - - # Test scaling analysis - scaling = analyze_scaling_bottlenecks() - assert len(scaling) > 0, "Should return scaling results" - print(" PASS Scaling bottleneck analysis works") - - # Test pipeline analysis - pipeline = analyze_end_to_end_pipeline() - assert 'tokenization_ms' in pipeline, "Should include pipeline timing" - print(" PASS End-to-end pipeline analysis works") - - print("PASS Systems insights test passed!") - return True - - except Exception as e: - print(f"FAIL Systems insights test failed: {e}") - return False - -def test_unit_computational_assessments(): - """TEST Unit Test: Computational Assessment Questions""" - print("TEST Unit Test: Computational Assessment Questions") - print("-" * 50) - - try: - system = TinyGPTSystem() - - # Test integration analysis - integration = analyze_system_integration_bottlenecks(system) - assert 'pipeline_breakdown' in integration, "Should analyze pipeline" - print(" PASS System integration analysis assessment works") - - # Test scaling analysis - scaling = analyze_scaling_characteristics(system) - assert 'sequence_scaling' in scaling, "Should analyze sequence scaling" - print(" PASS Scaling characteristics assessment works") - - # Test optimization strategy - optimization = design_optimization_strategy(system) - assert 'current_performance' in optimization, "Should analyze current performance" - print(" PASS Optimization strategy assessment works") - - # Test deployment strategy - deployment = design_production_deployment_strategy(system) - assert 'system_analysis' in deployment, "Should analyze system" - print(" PASS Production deployment assessment works") - - print("PASS Computational assessments test passed!") - return True - - except Exception as e: - print(f"FAIL 
Computational assessments test failed: {e}") - return False - -def test_unit_all(): - """Run all TinyGPT capstone unit tests.""" - print("TEST Running All TinyGPT Capstone Unit Tests...") - print("=" * 60) - - tests = [ - test_unit_tinygpt_system, - test_unit_systems_insights, - test_unit_computational_assessments - ] - - passed = 0 - for test_func in tests: - if test_func(): - passed += 1 - print() - - print("=" * 60) - if passed == len(tests): - print(f"CELEBRATE ALL TESTS PASSED! ({passed}/{len(tests)})") - print("PASS TinyGPT Capstone module is fully operational!") - else: - print(f"WARNING️ {len(tests) - passed}/{len(tests)} tests failed") - print("TIP Check TinyTorch component integration") - - return passed == len(tests) - -# Call tests immediately -test_unit_tinygpt_system() -test_unit_systems_insights() -test_unit_computational_assessments() - -# %% [markdown] -""" -## Main Execution Block - -Run the complete TinyGPT capstone demonstration when this module is executed directly. -""" - -# %% -if __name__ == "__main__": - print("Module 20: TinyGPT Capstone - Complete ML Systems Integration") - print("=" * 80) - - # Run learning checkpoints first - print("🎓 Running TinyGPT Learning Checkpoints...") - checkpoint_results = run_learning_checkpoints() - - # Test complete system - print("\nTEST Testing Complete TinyGPT System...") - system_tests_passed = test_unit_all() - - # Run comprehensive demonstration - print("\nROCKET Running Complete TinyGPT Demonstration...") - demo_results = run_complete_tinygpt_demonstration() - - print(f"\nCELEBRATE Module 20 Capstone Complete!") - print(f"🏆 TinyGPT system fully integrated and operational!") - print(f"ROCKET Ready for real-world ML systems engineering!") - -# %% [markdown] -""" -## THINK ML Systems Thinking: Interactive Questions - -1. 
**How does end-to-end system integration reveal bottlenecks invisible in isolated components?** Your TinyGPT system integrates tokenization, transformer layers, attention mechanisms, and generation into a complete pipeline. Analyze how profiling the complete system revealed different performance characteristics than testing individual components in isolation, and explain why production ML systems require end-to-end optimization rather than component-wise optimization. - -2. **What makes autoregressive generation fundamentally different from batch inference in terms of systems requirements?** Your text generation implementation generates tokens one at a time, requiring multiple forward passes through the model. Compare the memory usage patterns, computational efficiency, and parallelization opportunities between single-token autoregressive generation and batch inference, and design specific optimizations for each use case. - -3. **How do your scaling analysis results inform real-world production deployment decisions?** Your scaling bottleneck analysis identified O(n²) attention complexity and memory scaling patterns. Using your actual profiling results, design a production deployment strategy that handles sequence lengths from 16 tokens (chat messages) to 2048 tokens (document processing), including specific infrastructure requirements, cost estimates, and performance SLAs. - -4. **Why is systems thinking essential for ML engineering beyond just algorithmic knowledge?** Your capstone integrated components from tensor operations (Module 02) through production deployment strategies. Reflect on how understanding memory layouts, computational complexity, scaling bottlenecks, and production constraints changes how you approach ML problems compared to purely algorithmic or mathematical perspectives, and explain why this systems understanding is crucial for building reliable ML products. 
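For question 3, the O(n²) claim can be made concrete with a quick estimate using the same `seq_len² × n_heads × 4 bytes` attention-score formula as the scaling analysis above. This is only a back-of-envelope sketch; `n_heads=4` is an illustrative value, not the actual model configuration.

```python
def attention_score_memory_mb(seq_len, n_heads, bytes_per_elem=4):
    """Memory for one layer's raw attention score matrices (float32),
    using the same seq_len^2 * n_heads estimate as the profiling code."""
    return seq_len * seq_len * n_heads * bytes_per_elem / (1024 * 1024)

for seq_len in (16, 128, 2048):
    mb = attention_score_memory_mb(seq_len, n_heads=4)
    print(f"seq_len={seq_len:5d}: {mb:10.3f} MB per layer")
# 2048 tokens -> 64.000 MB per layer, vs ~0.004 MB at 16 tokens
```

Going from 16-token chat messages to 2048-token documents is a 128× length increase but a 16,384× increase in attention-score memory, which is why the deployment question forces different infrastructure tiers per sequence-length bucket.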
-""" - -# %% [markdown] -""" -## TARGET MODULE SUMMARY: TinyGPT Capstone - Complete ML Systems Mastery - -Congratulations! You have successfully completed the ultimate ML systems engineering challenge by building a complete language model from first principles. - -### 🛤️ **The Complete Journey** -- **Starting Point**: Individual TinyTorch components in modules 02-19 -- **Integration Challenge**: Combine all components into working end-to-end system -- **Final Achievement**: Complete TinyGPT language model with text generation capabilities - -### 🏗️ **System Architecture Mastered** -- **TinyGPTTokenizer**: Text processing with vocabulary management and encoding/decoding -- **TinyGPTTransformerLayer**: Complete transformer layer with multi-head attention, feed-forward networks, and layer normalization -- **TinyGPTModel**: Full language model with token embeddings, positional encodings, and autoregressive generation -- **TinyGPTSystem**: End-to-end pipeline with profiling, analysis, and optimization capabilities - -### 🔧 **Technical Integration Achieved** -PASS **Component Integration**: All TinyTorch modules (02-19) working together seamlessly -PASS **Text Generation**: Working autoregressive language model with sampling and temperature control -PASS **Performance Analysis**: Complete system profiling with bottleneck identification and scaling analysis -PASS **Production Strategy**: Comprehensive deployment planning with monitoring and reliability considerations -PASS **Optimization Roadmap**: Phased optimization strategy based on actual performance profiling results - -### 📊 **Systems Engineering Mastery** -Your implementation demonstrates mastery of: -- **Memory Management**: Understanding parameter storage, attention matrices, and gradient memory requirements -- **Computational Complexity**: O(n²) attention scaling analysis and bottleneck identification -- **Performance Optimization**: From basic batching to advanced techniques like FlashAttention and KV-caching 
-- **Production Deployment**: Real-world architecture design, monitoring strategies, and reliability planning -- **End-to-End Thinking**: Integration challenges that only emerge when components work together - -### TARGET **Real-World Capability Achieved** -You can now: -- **Build**: Complete language models from individual components -- **Analyze**: System performance characteristics and scaling bottlenecks -- **Optimize**: Multi-phase performance improvement strategies -- **Deploy**: Production-ready ML systems with monitoring and reliability -- **Scale**: From prototype to production with concrete performance targets - -### 🏆 **Professional ML Systems Engineer** -This capstone proves you understand: -- How individual ML components integrate into complete systems -- Why production ML systems require systems engineering beyond algorithms -- How to identify and resolve performance bottlenecks through profiling -- What it takes to deploy and scale ML systems in real-world environments -- That great ML engineering requires both deep technical knowledge and systems thinking - -**You are now equipped to tackle real-world ML systems engineering challenges with confidence and expertise!** - -### ROCKET **Next Steps** -1. **Apply Knowledge**: Use your TinyGPT system as foundation for more advanced projects -2. **Optimize Further**: Implement advanced optimizations from your roadmap -3. **Scale Up**: Deploy your system and measure real-world performance -4. **Keep Learning**: Explore cutting-edge ML systems research and production techniques - -**Congratulations on completing the TinyTorch ML Systems Engineering journey! 
You've built something remarkable - a complete language model that demonstrates mastery of the entire ML systems stack.**
-"""
diff --git a/modules_old/source/08_normalization/normalization_dev.py b/modules_old/source/08_normalization/normalization_dev.py
deleted file mode 100644
index 4b720f0c..00000000
--- a/modules_old/source/08_normalization/normalization_dev.py
+++ /dev/null
@@ -1,1369 +0,0 @@
-# %% [markdown]
-"""
-# Normalization - Stabilizing Deep Network Training
-
-Welcome to Normalization! You'll implement the normalization techniques that make deep neural networks trainable and stable.
-
-## 🔗 Building on Previous Learning
-**What You Built Before**:
-- Module 02 (Tensor): Data structures with gradient tracking
-- Module 04 (Layers): Neural network layer primitives
-- Module 06 (Autograd): Automatic gradient computation
-- Module 07 (Optimizers): Parameter update algorithms
-
-**What's Working**: You can build multi-layer networks and train them with optimizers!
-
-**The Gap**: Deep networks suffer from internal covariate shift - activations drift during training, making learning unstable and slow.
-
-**This Module's Solution**: Implement BatchNorm, LayerNorm, and GroupNorm to stabilize training by normalizing intermediate activations.
-
-**Connection Map**:
-```
-Layers -> Normalization -> Stable Training
-(unstable) (stabilized) (convergence)
-```
-
-## Learning Goals (5-Point Framework)
-- **Systems understanding**: Memory and computation patterns of different normalization schemes
-- **Core implementation skill**: Build BatchNorm, LayerNorm, and GroupNorm from mathematical foundations
-- **Pattern/abstraction mastery**: Understand when to use each normalization technique
-- **Framework connections**: Connect to PyTorch's nn.BatchNorm2d, nn.LayerNorm, nn.GroupNorm
-- **Optimization trade-offs**: Analyze memory vs stability vs computation trade-offs
-
-## Build -> Use -> Reflect
-1. **Build**: Implementation of BatchNorm, LayerNorm, and GroupNorm with running statistics
-2. **Use**: Apply normalization to stabilize training of deep networks
-3. **Reflect**: How do different normalization schemes affect memory, computation, and training dynamics?
-
-## Systems Reality Check
-💡 **Production Context**: Normalization is critical in all modern deep learning - ResNet uses BatchNorm, Transformers use LayerNorm, modern ConvNets use GroupNorm
-⚡ **Performance Insight**: BatchNorm adds 2× parameters per layer but often enables 10× larger learning rates, dramatically accelerating training
-
-## What You'll Achieve
-By the end of this module, you'll have implemented the normalization arsenal that makes modern deep learning possible, with complete understanding of their memory characteristics and performance trade-offs.
-"""
-
-# %% [markdown]
-"""
-## Mathematical Foundation: Why Normalization Works
-
-Internal covariate shift occurs when the distribution of inputs to each layer changes during training. This makes learning slow and unstable.
-
-### The Core Problem:
-```
-Layer 1: x₁ -> f₁(x₁) -> y₁ (distribution D₁)
-Layer 2: y₁ -> f₂(y₁) -> y₂ (distribution changes as f₁ changes!)
-Layer 3: y₂ -> f₃(y₂) -> y₃ (distribution keeps shifting!)
-```
-
-### The Normalization Solution:
-Normalize activations to have stable statistics (mean=0, variance=1):
-
-**Mathematical Form:**
-```
-ŷ = γ * (x - μ) / σ + β
-
-Where:
-- μ = E[x] (mean)
-- σ = sqrt(Var[x] + ε) (standard deviation)
-- γ = learnable scale parameter
-- β = learnable shift parameter
-- ε = numerical stability constant (usually 1e-5)
-```
-
-**Key Insight**: γ and β allow the network to recover the original representation if normalization hurts performance. 
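The mathematical form above is easy to verify numerically. Below is a minimal NumPy sketch (plain arrays, not the `Tensor` class used later in this module) showing that with γ and β at their identity values, the formula standardizes a drifted distribution; the sample size and distribution parameters are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=1000)  # activations that have drifted

eps = 1e-5
mu = x.mean()                       # μ = E[x]
sigma = np.sqrt(x.var() + eps)      # σ = sqrt(Var[x] + ε)

gamma, beta = 1.0, 0.0              # learnable γ (scale) and β (shift) at identity
y = gamma * (x - mu) / sigma + beta

print(f"mean: {y.mean():.6f}, variance: {y.var():.6f}")  # ~0 and ~1
```

Because γ and β are learnable, the network can undo this standardization (γ→σ, β→μ) if the raw statistics turn out to be useful.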
-""" - -# %% [markdown] -""" -## Context: Why Normalization Matters - -### Historical Context -- **2015**: BatchNorm revolutionizes training, enables much deeper networks -- **2016**: LayerNorm enables stable transformer training -- **2018**: GroupNorm provides batch-independent normalization for object detection - -### Production Impact -- **ImageNet Training**: BatchNorm reduces training time from weeks to days -- **Language Models**: LayerNorm enables training of billion-parameter transformers -- **Object Detection**: GroupNorm enables small-batch training with stable results - -### Memory vs Performance Trade-offs -- **BatchNorm**: 2* parameters, but enables 5-10* larger learning rates -- **LayerNorm**: No batch dimension dependence, consistent across batch sizes -- **GroupNorm**: Balance between batch and layer normalization benefits -""" - -# %% [markdown] -""" -## Connections: Production Normalization Systems - -### PyTorch Implementation Patterns -```python -# Production patterns you'll implement -torch.nn.BatchNorm2d(channels, eps=1e-5, momentum=0.1) -torch.nn.LayerNorm(normalized_shape, eps=1e-5) -torch.nn.GroupNorm(num_groups, num_channels, eps=1e-5) - -# Your implementation will match these interfaces -``` - -### Real-World Usage -- **ResNet**: Uses BatchNorm after every convolution layer -- **BERT/GPT**: Uses LayerNorm in transformer blocks -- **YOLO**: Uses BatchNorm for training stability with large images -- **Modern ConvNets**: Often use GroupNorm for object detection tasks -""" - -# %% [markdown] -""" -## Design: Why Build Normalization From Scratch? - -### Learning Justification -Building normalization layers teaches: -1. **Statistical Computing**: How to compute mean/variance efficiently across different dimensions -2. **Memory Management**: Understanding running statistics and their memory implications -3. **Training vs Inference**: How normalization behaves differently during training and evaluation -4. 
**Gradient Flow**: How normalization affects backpropagation through learnable parameters - -### Systems Understanding Goals -- **Dimension Analysis**: How normalization axes affect memory and computation -- **Batch Dependencies**: Understanding when normalization depends on batch statistics -- **Parameter Sharing**: How γ and β parameters are organized in memory -- **Numerical Stability**: Why ε is critical for avoiding division by zero -""" - -# %% [markdown] -""" -## Architecture: Normalization Design Decisions - -### Key Design Choices - -1. **Normalization Axis Selection**: - ``` - BatchNorm: Normalize across batch dimension (N, C, H, W) -> across N - LayerNorm: Normalize across feature dimensions -> across C, H, W - GroupNorm: Normalize across channel groups -> within groups of C - ``` - -2. **Parameter Organization**: - ``` - γ (scale) and β (bias) parameters: - - BatchNorm: Shape (C,) - one per channel - - LayerNorm: Shape of normalized dimensions - - GroupNorm: Shape (C,) - one per channel - ``` - -3. **Training vs Inference**: - ``` - Training: Use batch statistics (mean, var computed from current batch) - Inference: Use running statistics (exponential moving average from training) - ``` - -4. **Memory Layout Optimization**: - ``` - Running statistics stored separately from learnable parameters - Efficient computation using vectorized operations across normalization axes - ``` -""" - -# %% [markdown] -""" -## Implementation: Building Normalization Classes - -Let's implement the three essential normalization techniques used in modern deep learning. 
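Before implementing the classes, it helps to see the one design decision that separates the three techniques: which axes the statistics are reduced over. A NumPy-only sketch (shapes and the group count `G` are chosen purely for illustration):

```python
import numpy as np

x = np.random.default_rng(0).normal(size=(8, 6, 4, 4))  # (N, C, H, W)
N, C, H, W = x.shape
G = 3                                # illustrative group count; C % G must be 0

# BatchNorm: one statistic per channel, reduced over (N, H, W)
bn_mean = x.mean(axis=(0, 2, 3))                               # shape (C,)

# LayerNorm: one statistic per sample, reduced over (C, H, W)
ln_mean = x.mean(axis=(1, 2, 3))                               # shape (N,)

# GroupNorm: one statistic per (sample, group), reduced over (C//G, H, W)
gn_mean = x.reshape(N, G, C // G, H, W).mean(axis=(2, 3, 4))   # shape (N, G)

print(bn_mean.shape, ln_mean.shape, gn_mean.shape)  # (6,) (8,) (8, 3)
```

Everything else in the implementations below — running statistics, γ/β broadcasting, reshaping — follows from this choice of reduction axes.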
-""" - -# %% -#| default_exp tinytorch.core.normalization -import numpy as np -from typing import Optional, Union, Tuple, Dict, List -import warnings - -# Import our tensor and layer base classes -try: - from tinytorch.core.tensor import Tensor - from tinytorch.core.layers import Module -except ImportError: - # Fallback for development environment - import sys - import os - sys.path.append(os.path.join(os.path.dirname(__file__), '..', '..', '..')) - from tinytorch.core.tensor import Tensor - from tinytorch.core.layers import Module - -# %% [markdown] -""" -### Batch Normalization Implementation - -Batch Normalization normalizes activations across the batch dimension, making training more stable and allowing higher learning rates. - -**Key Insight**: BatchNorm computes statistics across the batch dimension, so it requires batch_size > 1 during training. -""" - -# %% nbgrader={"grade": false, "grade_id": "batch-norm", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class BatchNorm2d(Module): - """ - Batch Normalization for 2D convolutions (4D tensors: N*C*H*W). - - Normalizes across the batch dimension, computing μ and σ² across N, H, W - for each channel C independently. - - MATHEMATICAL FOUNDATION: - BN(x) = γ * (x - μ_batch) / sqrt(σ²_batch + ε) + β - - Where μ_batch and σ²_batch are computed across (N, H, W) dimensions. - """ - - def __init__(self, num_features: int, eps: float = 1e-5, momentum: float = 0.1): - """ - Initialize Batch Normalization layer. - - TODO: Implement BatchNorm initialization with running statistics. - - APPROACH (4-Step BatchNorm Setup): - 1. Store configuration parameters (num_features, eps, momentum) - 2. Initialize learnable parameters (γ=1, β=0) with proper shapes - 3. Initialize running statistics (running_mean=0, running_var=1) - 4. 
Set training mode flag for different train/eval behavior - - MEMORY ANALYSIS: - - Learnable parameters: 2 * num_features (γ and β) - - Running statistics: 2 * num_features (running_mean and running_var) - - Total memory: 4 * num_features parameters - - EXAMPLE (BatchNorm Usage): - >>> bn = BatchNorm2d(64) # For 64 channels - >>> x = Tensor(np.random.randn(32, 64, 28, 28)) # batch * channels * height * width - >>> normalized = bn(x) - >>> print(f"Normalized shape: {normalized.shape}") # (32, 64, 28, 28) - - HINTS: - - Use np.ones() for γ initialization (multiplicative identity) - - Use np.zeros() for β initialization (additive identity) - - Running statistics are numpy arrays (not Tensors - no gradients needed) - - momentum controls exponential moving average: new_running = (1-momentum)*old + momentum*batch - - Args: - num_features: Number of channels (C dimension) - eps: Small constant for numerical stability - momentum: Momentum for running statistics update - """ - ### BEGIN SOLUTION - super().__init__() - self.num_features = num_features - self.eps = eps - self.momentum = momentum - self.training = True - - # Learnable parameters - shape (num_features,) - self.gamma = Tensor(np.ones((num_features,))) # Scale parameter - self.beta = Tensor(np.zeros((num_features,))) # Shift parameter - - # Running statistics for inference - numpy arrays (no gradients needed) - self.running_mean = np.zeros((num_features,)) - self.running_var = np.ones((num_features,)) - - # Track parameters for optimization - self.parameters = [self.gamma, self.beta] - ### END SOLUTION - - def forward(self, x: Tensor) -> Tensor: - """ - Apply batch normalization to input tensor. - - TODO: Implement batch normalization forward pass with proper training/eval modes. - - STEP-BY-STEP IMPLEMENTATION: - 1. Determine which statistics to use (batch vs running) - 2. Compute mean and variance across appropriate dimensions - 3. Normalize: (x - mean) / sqrt(var + eps) - 4. 
Scale and shift: γ * normalized + β - 5. Update running statistics during training - - DIMENSION ANALYSIS for 4D input (N, C, H, W): - - Batch statistics computed across dims (0, 2, 3) -> shape (C,) - - γ and β broadcasted to match input: (1, C, 1, 1) - - Output has same shape as input - - TRAINING vs INFERENCE: - - Training: Use batch statistics, update running statistics - - Inference: Use running statistics, no updates - - EXAMPLE: - >>> bn = BatchNorm2d(3) - >>> x = Tensor(np.random.randn(16, 3, 32, 32)) - >>> bn.training = True # Training mode - >>> out_train = bn.forward(x) - >>> bn.training = False # Inference mode - >>> out_eval = bn.forward(x) - - Args: - x: Input tensor of shape (N, C, H, W) - - Returns: - Normalized tensor of shape (N, C, H, W) - """ - ### BEGIN SOLUTION - if self.training: - # Training mode: compute batch statistics - # Compute mean and variance across batch, height, width (dims 0, 2, 3) - batch_mean = np.mean(x.data, axis=(0, 2, 3), keepdims=False) # Shape: (C,) - batch_var = np.var(x.data, axis=(0, 2, 3), keepdims=False) # Shape: (C,) - - # Update running statistics using exponential moving average - self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * batch_mean - self.running_var = (1 - self.momentum) * self.running_var + self.momentum * batch_var - - # Use batch statistics for normalization - mean = batch_mean - var = batch_var - else: - # Inference mode: use running statistics - mean = self.running_mean - var = self.running_var - - # Reshape statistics for broadcasting: (1, C, 1, 1) - mean = mean.reshape(1, -1, 1, 1) - var = var.reshape(1, -1, 1, 1) - gamma = self.gamma.data.reshape(1, -1, 1, 1) - beta = self.beta.data.reshape(1, -1, 1, 1) - - # Apply normalization: γ * (x - μ) / σ + β - normalized = (x.data - mean) / np.sqrt(var + self.eps) - output = gamma * normalized + beta - - return Tensor(output) - ### END SOLUTION - - def train(self, mode: bool = True) -> 'BatchNorm2d': - """Set training mode.""" - 
self.training = mode - return self - - def eval(self) -> 'BatchNorm2d': - """Set evaluation mode.""" - self.training = False - return self - -# 🔍 SYSTEMS INSIGHT: BatchNorm Memory and Batch Dependencies -def analyze_batchnorm_behavior(): - """Quick analysis of BatchNorm memory and batch size effects.""" - print("🔍 BatchNorm Memory Scaling:") - - # Test different channel sizes - for channels in [64, 256, 512]: - bn = BatchNorm2d(channels) - param_memory = 4 * channels * 4 # 4 params per channel * 4 bytes - print(f" {channels} channels: {param_memory // 1024} KB ({4 * channels} parameters)") - - print("\n🔍 Batch Size Dependency:") - bn = BatchNorm2d(64) - - # Test different batch sizes - for batch_size in [1, 8, 32]: - if batch_size == 1: - print(f" Batch size {batch_size}: ⚠️ Unstable (no batch statistics)") - else: - print(f" Batch size {batch_size}: ✅ Good statistics") - - print("\n💡 Key insight: BatchNorm needs batch_size > 1 for training") - -analyze_batchnorm_behavior() - -# %% [markdown] -""" -### TEST Unit Test: Batch Normalization - -This test validates BatchNorm2d implementation, ensuring proper normalization across batch dimension and correct running statistics updates. 
-""" - -# %% nbgrader={"grade": true, "grade_id": "test-batch-norm", "locked": true, "points": 20, "schema_version": 3, "solution": false, "task": false} -def test_unit_batch_norm(): - """Unit test for batch normalization.""" - print("🔬 Unit Test: Batch Normalization...") - - # Test 1: Basic functionality - num_features = 32 - bn = BatchNorm2d(num_features) - - # Verify initialization - assert bn.num_features == num_features, "Should store number of features" - assert bn.eps == 1e-5, "Should use default epsilon" - assert bn.momentum == 0.1, "Should use default momentum" - assert bn.training == True, "Should start in training mode" - - # Check parameter shapes - assert bn.gamma.shape == (num_features,), f"Gamma shape should be ({num_features},)" - assert bn.beta.shape == (num_features,), f"Beta shape should be ({num_features},)" - assert np.allclose(bn.gamma.data, 1.0), "Gamma should be initialized to 1" - assert np.allclose(bn.beta.data, 0.0), "Beta should be initialized to 0" - - # Test 2: Forward pass in training mode - batch_size, height, width = 16, 8, 8 - x = Tensor(np.random.randn(batch_size, num_features, height, width)) - - output = bn.forward(x) - - # Check output shape - assert output.shape == x.shape, "Output should have same shape as input" - - # Check normalization (approximately zero mean, unit variance per channel) - for c in range(num_features): - channel_data = output.data[:, c, :, :] - channel_mean = np.mean(channel_data) - channel_var = np.var(channel_data) - - assert abs(channel_mean) < 1e-6, f"Channel {c} should have ~0 mean, got {channel_mean}" - assert abs(channel_var - 1.0) < 1e-4, f"Channel {c} should have ~1 variance, got {channel_var}" - - # Test 3: Running statistics update - initial_running_mean = bn.running_mean.copy() - initial_running_var = bn.running_var.copy() - - # Process another batch - x2 = Tensor(np.random.randn(batch_size, num_features, height, width) * 2 + 1) - _ = bn.forward(x2) - - # Running statistics should have changed 
- assert not np.allclose(bn.running_mean, initial_running_mean), "Running mean should update" - assert not np.allclose(bn.running_var, initial_running_var), "Running variance should update" - - # Test 4: Evaluation mode - bn.eval() - assert bn.training == False, "Should be in eval mode" - - running_mean_before = bn.running_mean.copy() - running_var_before = bn.running_var.copy() - - # Forward pass in eval mode - output_eval = bn.forward(x) - - # Running statistics should not change in eval mode - assert np.allclose(bn.running_mean, running_mean_before), "Running mean should not change in eval mode" - assert np.allclose(bn.running_var, running_var_before), "Running variance should not change in eval mode" - - # Test 5: Gradient flow (basic check) - bn.train() - x_grad = Tensor(np.random.randn(batch_size, num_features, height, width)) - output_grad = bn.forward(x_grad) - - # Should be able to access gamma and beta for gradient computation - assert hasattr(bn, 'gamma'), "Should have gamma parameter" - assert hasattr(bn, 'beta'), "Should have beta parameter" - assert len(bn.parameters) == 2, "Should have 2 learnable parameters" - - print("PASS Batch normalization tests passed!") - print(f"PASS Properly normalizes across batch dimension") - print(f"PASS Updates running statistics during training") - print(f"PASS Uses running statistics during evaluation") - print(f"PASS Maintains gradient flow through learnable parameters") - -test_unit_batch_norm() - -# %% [markdown] -""" -### Layer Normalization Implementation - -Layer Normalization normalizes across the feature dimensions for each sample independently, making it batch-size independent. - -**Key Insight**: LayerNorm is crucial for transformers because it doesn't depend on batch statistics, enabling consistent behavior across different batch sizes. 
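The batch-size independence claimed above can be demonstrated in a few lines. This standalone NumPy sketch (a stripped-down LayerNorm without γ/β) normalizes one sample alone and then inside a larger batch — the outputs are identical because the statistics never cross sample boundaries:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Statistics computed per sample, over the last (feature) axis only
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch = rng.normal(size=(32, 16))      # 32 samples, 16 features

alone = layer_norm(batch[:1])          # the first sample as a batch of one
in_batch = layer_norm(batch)[:1]       # the same sample inside the full batch

print(np.allclose(alone, in_batch))    # True — batchmates have no influence
```

Run the same experiment with a BatchNorm-style reduction (`axis=0`) and the two results diverge, which is exactly why transformers, with their variable batch sizes, standardized on LayerNorm.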
-""" - -# %% nbgrader={"grade": false, "grade_id": "layer-norm", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class LayerNorm(Module): - """ - Layer Normalization for any-dimensional tensors. - - Normalizes across specified feature dimensions for each sample independently. - Unlike BatchNorm, LayerNorm doesn't depend on batch statistics. - - MATHEMATICAL FOUNDATION: - LN(x) = γ * (x - μ) / sqrt(σ² + ε) + β - - Where μ and σ² are computed across feature dimensions for each sample. - """ - - def __init__(self, normalized_shape: Union[int, Tuple[int, ...]], eps: float = 1e-5): - """ - Initialize Layer Normalization. - - TODO: Implement LayerNorm initialization with proper shape handling. - - APPROACH (3-Step LayerNorm Setup): - 1. Store normalization configuration (shape and eps) - 2. Initialize learnable parameters γ and β with correct shapes - 3. Set up parameter tracking for optimization - - SHAPE ANALYSIS: - - If normalized_shape is int: treat as last dimension only - - If normalized_shape is tuple: treat as multiple dimensions - - γ and β have shape matching normalized_shape - - EXAMPLE (LayerNorm Shapes): - >>> ln1 = LayerNorm(512) # For last dim: (..., 512) - >>> ln2 = LayerNorm((64, 64)) # For last 2 dims: (..., 64, 64) - >>> ln3 = LayerNorm((256, 4, 4)) # For 3D features: (..., 256, 4, 4) - - HINTS: - - Convert int to tuple for consistent handling - - Parameter shapes should match normalized_shape exactly - - No running statistics needed (computed fresh each time) - - Args: - normalized_shape: Shape of features to normalize over - eps: Small constant for numerical stability - """ - ### BEGIN SOLUTION - super().__init__() - - # Handle both int and tuple inputs - if isinstance(normalized_shape, int): - self.normalized_shape = (normalized_shape,) - else: - self.normalized_shape = tuple(normalized_shape) - - self.eps = eps - - # Learnable parameters with shape matching normalized dimensions - self.gamma = 
Tensor(np.ones(self.normalized_shape)) # Scale parameter - self.beta = Tensor(np.zeros(self.normalized_shape)) # Shift parameter - - # Track parameters for optimization - self.parameters = [self.gamma, self.beta] - ### END SOLUTION - - def forward(self, x: Tensor) -> Tensor: - """ - Apply layer normalization to input tensor. - - TODO: Implement layer normalization forward pass. - - STEP-BY-STEP IMPLEMENTATION: - 1. Determine normalization axes based on normalized_shape - 2. Compute mean and variance across those axes (keepdims=True) - 3. Normalize: (x - mean) / sqrt(var + eps) - 4. Apply learnable parameters: γ * normalized + β - - AXIS CALCULATION: - For input shape (N, ..., D1, D2, ..., Dk) and normalized_shape (D1, D2, ..., Dk): - - Normalize over last len(normalized_shape) dimensions - - Keep dimensions for proper broadcasting - - EXAMPLE: - >>> ln = LayerNorm(256) - >>> x = Tensor(np.random.randn(32, 128, 256)) # (batch, seq, features) - >>> out = ln.forward(x) # Normalize over last dim (256) - - Args: - x: Input tensor - - Returns: - Normalized tensor (same shape as input) - """ - ### BEGIN SOLUTION - # Calculate which axes to normalize over (last len(normalized_shape) dimensions) - num_dims_to_normalize = len(self.normalized_shape) - axes = tuple(range(-num_dims_to_normalize, 0)) # Last N dimensions - - # Compute mean and variance over normalization axes - mean = np.mean(x.data, axis=axes, keepdims=True) - var = np.var(x.data, axis=axes, keepdims=True) - - # Normalize - normalized = (x.data - mean) / np.sqrt(var + self.eps) - - # Apply learnable parameters (broadcasting automatically handles shapes) - output = self.gamma.data * normalized + self.beta.data - - return Tensor(output) - ### END SOLUTION - - def __call__(self, x: Tensor) -> Tensor: - """Allow LayerNorm to be called directly.""" - return self.forward(x) - -# PASS IMPLEMENTATION CHECKPOINT: Basic LayerNorm complete - -# THINK PREDICTION: How does LayerNorm memory scale compared to BatchNorm? 
-# Your guess: LayerNorm uses _____ memory than BatchNorm for the same feature size - -# 🔍 SYSTEMS INSIGHT: LayerNorm Memory and Batch Independence -def compare_norm_characteristics(): - """Compare key characteristics of BatchNorm vs LayerNorm.""" - print("🔍 Memory Comparison:") - - # Compare memory usage - for features in [64, 256, 512]: - bn_memory = 4 * features * 4 # γ, β, running_mean, running_var - ln_memory = 2 * features * 4 # γ, β only - ratio = bn_memory / ln_memory - print(f" {features} features: BatchNorm {bn_memory//1024}KB vs LayerNorm {ln_memory//1024}KB ({ratio:.1f}x)") - - print("\n🔍 Batch Size Independence:") - ln = LayerNorm(256) - - # Test different batch sizes - for batch_size in [1, 8, 32]: - x = Tensor(np.random.randn(batch_size, 64, 256)) - output = ln.forward(x) - sample_var = np.var(output.data[0, :, :]) - print(f" Batch size {batch_size}: Variance = {sample_var:.3f} ✅") - - print("\n💡 Key insight: LayerNorm works consistently at any batch size") - -compare_norm_characteristics() - -# %% [markdown] -""" -### TEST Unit Test: Layer Normalization - -This test validates LayerNorm implementation, ensuring proper normalization across feature dimensions and batch-size independence. 
-""" - -# %% nbgrader={"grade": true, "grade_id": "test-layer-norm", "locked": true, "points": 20, "schema_version": 3, "solution": false, "task": false} -def test_unit_layer_norm(): - """Unit test for layer normalization.""" - print("🔬 Unit Test: Layer Normalization...") - - # Test 1: Basic 1D normalization - embed_dim = 256 - ln = LayerNorm(embed_dim) - - # Verify initialization - assert ln.normalized_shape == (embed_dim,), "Should store normalized shape as tuple" - assert ln.eps == 1e-5, "Should use default epsilon" - - # Check parameter shapes - assert ln.gamma.shape == (embed_dim,), f"Gamma shape should be ({embed_dim},)" - assert ln.beta.shape == (embed_dim,), f"Beta shape should be ({embed_dim},)" - assert np.allclose(ln.gamma.data, 1.0), "Gamma should be initialized to 1" - assert np.allclose(ln.beta.data, 0.0), "Beta should be initialized to 0" - - # Test 2: Forward pass with 3D input (batch, seq, features) - batch_size, seq_len = 16, 64 - x = Tensor(np.random.randn(batch_size, seq_len, embed_dim) * 2 + 3) # Non-standard distribution - - output = ln.forward(x) - - # Check output shape - assert output.shape == x.shape, "Output should have same shape as input" - - # Check normalization for each sample independently - for b in range(batch_size): - for s in range(seq_len): - sample_data = output.data[b, s, :] - sample_mean = np.mean(sample_data) - sample_var = np.var(sample_data) - - assert abs(sample_mean) < 1e-6, f"Sample [{b},{s}] should have ~0 mean, got {sample_mean}" - assert abs(sample_var - 1.0) < 1e-4, f"Sample [{b},{s}] should have ~1 variance, got {sample_var}" - - # Test 3: Multi-dimensional normalization - multi_dim_shape = (64, 4) # Normalize over 2D features - ln_multi = LayerNorm(multi_dim_shape) - - x_multi = Tensor(np.random.randn(8, 32, 64, 4)) - output_multi = ln_multi.forward(x_multi) - - assert output_multi.shape == x_multi.shape, "Multi-dim normalization should preserve shape" - - # Check normalization across last 2 dimensions for each 
sample - for b in range(8): - for s in range(32): - sample_data = output_multi.data[b, s, :, :].flatten() - sample_mean = np.mean(sample_data) - sample_var = np.var(sample_data) - - assert abs(sample_mean) < 1e-6, f"Multi-dim sample should have ~0 mean" - assert abs(sample_var - 1.0) < 1e-4, f"Multi-dim sample should have ~1 variance" - - # Test 4: Callable interface - output_callable = ln(x) - assert np.allclose(output.data, output_callable.data), "Callable interface should work" - - # Test 5: Batch size independence - x_small = Tensor(np.random.randn(1, seq_len, embed_dim)) - x_large = Tensor(np.random.randn(64, seq_len, embed_dim)) - - output_small = ln.forward(x_small) - output_large = ln.forward(x_large) - - # Both should be properly normalized regardless of batch size - small_mean = np.mean(output_small.data[0, 0, :]) - large_mean = np.mean(output_large.data[0, 0, :]) # Same position - - assert abs(small_mean) < 1e-6, "Small batch should be normalized" - assert abs(large_mean) < 1e-6, "Large batch should be normalized" - - # Test 6: Parameter tracking - assert len(ln.parameters) == 2, "Should have 2 learnable parameters" - assert ln.gamma in ln.parameters, "Gamma should be tracked" - assert ln.beta in ln.parameters, "Beta should be tracked" - - print("PASS Layer normalization tests passed!") - print(f"PASS Properly normalizes across feature dimensions") - print(f"PASS Works with any input shape") - print(f"PASS Batch-size independent behavior") - print(f"PASS Supports multi-dimensional normalization") - -test_unit_layer_norm() - -# %% [markdown] -""" -### Group Normalization Implementation - -Group Normalization divides channels into groups and normalizes within each group, providing a middle ground between batch and layer normalization. - -**Key Insight**: GroupNorm is particularly useful for object detection and when batch sizes are small, as it doesn't depend on batch statistics but provides channel-wise organization. 
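The reshape-normalize-reshape pattern described above can be sketched with plain NumPy before wrapping it in a class (the shapes below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, H, W, G = 2, 8, 5, 5, 4          # 4 groups of 8 // 4 = 2 channels each
x = rng.normal(loc=1.5, scale=3.0, size=(N, C, H, W))

# 1. Split channels into groups: (N, C, H, W) -> (N, G, C//G, H, W)
xg = x.reshape(N, G, C // G, H, W)

# 2. Normalize within each (sample, group), reducing over (C//G, H, W)
mu = xg.mean(axis=(2, 3, 4), keepdims=True)
var = xg.var(axis=(2, 3, 4), keepdims=True)
xn = (xg - mu) / np.sqrt(var + 1e-5)

# 3. Merge groups back: (N, G, C//G, H, W) -> (N, C, H, W)
out = xn.reshape(N, C, H, W)

# First group of sample 0 = channels 0 and 1; it should now be standardized
group0 = out[0, 0:2].ravel()
print(f"group mean: {group0.mean():.6f}, group variance: {group0.var():.6f}")
```

Note how the two special cases fall out of the same code: `G = C` normalizes each channel alone (InstanceNorm-like), while `G = 1` reduces over all of (C, H, W) per sample (LayerNorm-like).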
-""" - -# %% nbgrader={"grade": false, "grade_id": "group-norm", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class GroupNorm(Module): - """ - Group Normalization for convolutional layers. - - Divides channels into groups and normalizes within each group. - Provides benefits of both batch and layer normalization. - - MATHEMATICAL FOUNDATION: - For input (N, C, H, W) with G groups: - 1. Reshape to (N, G, C//G, H, W) - 2. Normalize within each group: GN(x) = γ * (x - μ_group) / sqrt(σ²_group + ε) + β - 3. Reshape back to (N, C, H, W) - """ - - def __init__(self, num_groups: int, num_channels: int, eps: float = 1e-5): - """ - Initialize Group Normalization. - - TODO: Implement GroupNorm initialization with group configuration. - - APPROACH (4-Step GroupNorm Setup): - 1. Validate group configuration (num_channels must be divisible by num_groups) - 2. Store configuration parameters - 3. Initialize learnable parameters γ and β for each channel - 4. Set up parameter tracking - - GROUP ORGANIZATION: - - Each group contains num_channels // num_groups channels - - Normalization computed independently within each group - - Parameters γ and β have shape (num_channels,) for per-channel scaling - - EXAMPLE (GroupNorm Configurations): - >>> gn1 = GroupNorm(32, 64) # 32 groups, 64 channels -> 2 channels per group - >>> gn2 = GroupNorm(8, 256) # 8 groups, 256 channels -> 32 channels per group - >>> gn3 = GroupNorm(1, 128) # 1 group, 128 channels -> LayerNorm equivalent - - HINTS: - - Use assert to validate num_channels % num_groups == 0 - - Special case: num_groups = num_channels -> InstanceNorm (each channel is a group) - - Special case: num_groups = 1 -> LayerNorm for spatial data - - Args: - num_groups: Number of groups to divide channels into - num_channels: Total number of channels - eps: Small constant for numerical stability - """ - ### BEGIN SOLUTION - super().__init__() - - # Validate configuration - assert num_channels % num_groups == 
0, f"num_channels ({num_channels}) must be divisible by num_groups ({num_groups})" - assert num_groups > 0, "num_groups must be positive" - assert num_channels > 0, "num_channels must be positive" - - self.num_groups = num_groups - self.num_channels = num_channels - self.eps = eps - - # Calculate channels per group - self.channels_per_group = num_channels // num_groups - - # Learnable parameters - one per channel - self.gamma = Tensor(np.ones((num_channels,))) # Scale parameter - self.beta = Tensor(np.zeros((num_channels,))) # Shift parameter - - # Track parameters for optimization - self.parameters = [self.gamma, self.beta] - ### END SOLUTION - - def forward(self, x: Tensor) -> Tensor: - """ - Apply group normalization to input tensor. - - TODO: Implement group normalization forward pass. - - STEP-BY-STEP IMPLEMENTATION: - 1. Reshape input to separate groups: (N, C, H, W) -> (N, G, C//G, H, W) - 2. Compute mean and variance within each group - 3. Normalize within groups - 4. Reshape back to original shape - 5. 
Apply per-channel γ and β parameters - - SHAPE TRANSFORMATIONS: - Input: (N, C, H, W) - Groups: (N, G, C//G, H, W) # Separate groups for normalization - Norm: (N, G, C//G, H, W) # Normalized within groups - Output: (N, C, H, W) # Back to original shape with γ/β applied - - EXAMPLE: - >>> gn = GroupNorm(8, 64) # 8 groups, 64 channels - >>> x = Tensor(np.random.randn(16, 64, 32, 32)) - >>> out = gn.forward(x) # Normalized within 8 groups - - Args: - x: Input tensor of shape (N, C, H, W) - - Returns: - Normalized tensor of shape (N, C, H, W) - """ - ### BEGIN SOLUTION - N, C, H, W = x.shape - assert C == self.num_channels, f"Expected {self.num_channels} channels, got {C}" - - # Reshape to separate groups: (N, C, H, W) -> (N, G, C//G, H, W) - x_grouped = x.data.reshape(N, self.num_groups, self.channels_per_group, H, W) - - # Compute mean and variance within each group - # Normalize over dimensions (2, 3, 4) which are (channels_per_group, H, W) - mean = np.mean(x_grouped, axis=(2, 3, 4), keepdims=True) # Shape: (N, G, 1, 1, 1) - var = np.var(x_grouped, axis=(2, 3, 4), keepdims=True) # Shape: (N, G, 1, 1, 1) - - # Normalize within groups - normalized = (x_grouped - mean) / np.sqrt(var + self.eps) - - # Reshape back to original shape: (N, G, C//G, H, W) -> (N, C, H, W) - normalized = normalized.reshape(N, C, H, W) - - # Apply per-channel learnable parameters - gamma = self.gamma.data.reshape(1, C, 1, 1) # Broadcast shape - beta = self.beta.data.reshape(1, C, 1, 1) # Broadcast shape - - output = gamma * normalized + beta - - return Tensor(output) - ### END SOLUTION - -# PASS IMPLEMENTATION CHECKPOINT: All normalization techniques complete - -# THINK PREDICTION: Which normalization uses the most memory - Batch, Layer, or Group? 
-# Your answer: _______ because _______ - -# 🔍 SYSTEMS INSIGHT: Comparing All Three Normalization Techniques -def analyze_all_normalization_types(): - """Compare memory and behavior of all three normalization types.""" - print("🔍 Complete Normalization Comparison:") - - # Memory comparison - print("\nMemory Usage (per channel):") - channels = 256 - bn_memory = 4 * channels * 4 # γ, β, running_mean, running_var - ln_memory = 2 * channels * 4 # γ, β only - gn_memory = 2 * channels * 4 # γ, β only - - print(f" BatchNorm: {bn_memory//1024} KB (stores running statistics)") - print(f" LayerNorm: {ln_memory//1024} KB (no running state)") - print(f" GroupNorm: {gn_memory//1024} KB (no running state)") - - # Batch size effects - print("\n🔍 Batch Size Behavior:") - test_channels = 64 - bn = BatchNorm2d(test_channels) - ln = LayerNorm((test_channels, 16, 16)) - gn = GroupNorm(8, test_channels) - - for batch_size in [1, 8, 32]: - x = Tensor(np.random.randn(batch_size, test_channels, 16, 16)) - - if batch_size > 1: - bn_out = bn.forward(x) - bn_status = "✅ Stable" - else: - bn_status = "⚠️ Unstable" - - ln_out = ln.forward(x) - gn_out = gn.forward(x) - - print(f" Batch size {batch_size}: BN={bn_status}, LN=✅ Stable, GN=✅ Stable") - - print("\n💡 Usage recommendations:") - print(" • BatchNorm: CNNs with large batches") - print(" • LayerNorm: Transformers, variable batch sizes") - print(" • GroupNorm: Small batches, object detection") - -analyze_all_normalization_types() - -# %% [markdown] -""" -### TEST Unit Test: Group Normalization - -This test validates GroupNorm implementation, ensuring proper grouping and normalization within channel groups. 
-""" - -# %% nbgrader={"grade": true, "grade_id": "test-group-norm", "locked": true, "points": 20, "schema_version": 3, "solution": false, "task": false} -def test_unit_group_norm(): - """Unit test for group normalization.""" - print("🔬 Unit Test: Group Normalization...") - - # Test 1: Basic configuration - num_groups = 8 - num_channels = 64 - gn = GroupNorm(num_groups, num_channels) - - # Verify initialization - assert gn.num_groups == num_groups, "Should store number of groups" - assert gn.num_channels == num_channels, "Should store number of channels" - assert gn.channels_per_group == 8, "Should calculate channels per group correctly" - - # Check parameter shapes - assert gn.gamma.shape == (num_channels,), f"Gamma shape should be ({num_channels},)" - assert gn.beta.shape == (num_channels,), f"Beta shape should be ({num_channels},)" - - # Test 2: Configuration validation - try: - GroupNorm(7, 64) # Should fail: 64 % 7 != 0 - assert False, "Should raise error for invalid group configuration" - except AssertionError as e: - if "divisible" in str(e): - pass # Expected error - else: - raise e - - # Test 3: Forward pass - batch_size, height, width = 16, 32, 32 - x = Tensor(np.random.randn(batch_size, num_channels, height, width) * 3 + 2) - - output = gn.forward(x) - - # Check output shape - assert output.shape == x.shape, "Output should have same shape as input" - - # Test 4: Verify group normalization properties - # Each group should have approximately normalized statistics - channels_per_group = num_channels // num_groups - - for group_idx in range(num_groups): - start_channel = group_idx * channels_per_group - end_channel = start_channel + channels_per_group - - # Extract group data for first sample - group_data = output.data[0, start_channel:end_channel, :, :].flatten() - group_mean = np.mean(group_data) - group_var = np.var(group_data) - - assert abs(group_mean) < 1e-5, f"Group {group_idx} should have ~0 mean, got {group_mean}" - assert abs(group_var - 1.0) < 
1e-3, f"Group {group_idx} should have ~1 variance, got {group_var}" - - # Test 5: Special cases - # Case 1: num_groups = num_channels (Instance Normalization) - instance_norm = GroupNorm(num_channels, num_channels) - assert instance_norm.channels_per_group == 1, "Instance norm should have 1 channel per group" - - # Case 2: num_groups = 1 (Layer Normalization for spatial data) - layer_norm_like = GroupNorm(1, num_channels) - assert layer_norm_like.channels_per_group == num_channels, "Single group should contain all channels" - - # Test 6: Different group sizes - configs_to_test = [ - (1, 32), # LayerNorm-like - (4, 32), # 8 channels per group - (32, 32), # InstanceNorm-like - ] - - for groups, channels in configs_to_test: - gn_test = GroupNorm(groups, channels) - x_test = Tensor(np.random.randn(8, channels, 16, 16)) - output_test = gn_test.forward(x_test) - - assert output_test.shape == x_test.shape, f"Config ({groups}, {channels}) should preserve shape" - - # Basic normalization check - sample_data = output_test.data[0, :, :, :].flatten() - overall_mean = np.mean(sample_data) - # Note: overall variance might not be exactly 1 due to grouping - - # Test 7: Parameter tracking - assert len(gn.parameters) == 2, "Should have 2 learnable parameters" - assert gn.gamma in gn.parameters, "Gamma should be tracked" - assert gn.beta in gn.parameters, "Beta should be tracked" - - print("PASS Group normalization tests passed!") - print(f"PASS Properly groups channels and normalizes within groups") - print(f"PASS Validates configuration constraints") - print(f"PASS Supports special cases (Instance/Layer norm variants)") - print(f"PASS Maintains gradient flow through learnable parameters") - -test_unit_group_norm() - -# %% [markdown] -""" -## Integration: Normalization in Neural Networks - -Now let's see how normalization techniques integrate with neural network layers to stabilize training and improve performance. 
-""" - -# %% [markdown] -""" -### Normalization Layer Integration Example - -Here's how normalization layers are typically used in different architectures: - -**ConvNet with BatchNorm:** -``` -Conv2d -> BatchNorm2d -> ReLU -> Conv2d -> BatchNorm2d -> ReLU -> ... -``` - -**Transformer with LayerNorm:** -``` -Embedding -> LayerNorm -> Attention -> Add & Norm -> FFN -> Add & Norm -> ... -``` - -**ResNet Block with GroupNorm:** -``` -Conv2d -> GroupNorm -> ReLU -> Conv2d -> GroupNorm -> Add -> ReLU -``` -""" - -# %% nbgrader={"grade": false, "grade_id": "normalization-example", "locked": false, "schema_version": 3, "solution": true, "task": false} -def demonstrate_normalization_usage(): - """ - Demonstrate how different normalization techniques are used in practice. - - TODO: Implement a simple example showing normalization in a mini-network. - - APPROACH: - 1. Create sample activations that would be unstable without normalization - 2. Apply different normalization techniques - 3. Show how they stabilize the activations - 4. Demonstrate the effect on gradient flow - - This function is PROVIDED as an educational example. 
- """ - ### BEGIN SOLUTION - print("🔬 Normalization Integration Example") - print("=" * 40) - - # Simulate unstable activations (high variance, non-zero mean) - batch_size, channels, height, width = 16, 64, 32, 32 - unstable_activations = Tensor(np.random.randn(batch_size, channels, height, width) * 5 + 3) - - print(f"Original activations:") - print(f" Mean: {np.mean(unstable_activations.data):.3f}") - print(f" Std: {np.std(unstable_activations.data):.3f}") - print(f" Range: [{np.min(unstable_activations.data):.2f}, {np.max(unstable_activations.data):.2f}]") - - # Apply different normalizations - bn = BatchNorm2d(channels) - ln = LayerNorm((channels, height, width)) - gn = GroupNorm(8, channels) - - bn.train() # Ensure BatchNorm is in training mode - - bn_output = bn.forward(unstable_activations) - ln_output = ln.forward(unstable_activations) - gn_output = gn.forward(unstable_activations) - - print(f"\nAfter BatchNorm:") - print(f" Mean: {np.mean(bn_output.data):.6f}") - print(f" Std: {np.std(bn_output.data):.3f}") - - print(f"\nAfter LayerNorm:") - print(f" Mean: {np.mean(ln_output.data):.6f}") - print(f" Std: {np.std(ln_output.data):.3f}") - - print(f"\nAfter GroupNorm:") - print(f" Mean: {np.mean(gn_output.data):.6f}") - print(f" Std: {np.std(gn_output.data):.3f}") - - print(f"\nPASS All normalization techniques stabilize activations!") - print(f"PASS Mean ~= 0, Std ~= 1 for all methods") - ### END SOLUTION - -# Run the demonstration -demonstrate_normalization_usage() - -# %% [markdown] -""" -### Performance Comparison: Training Stability - -Let's compare how different normalization techniques affect training stability by simulating gradient updates. -""" - -# PASS IMPLEMENTATION CHECKPOINT: All normalization implementations complete - -# THINK PREDICTION: Which normalization technique will be most stable for very small batch sizes? 
-# Your answer: _______ because _______ - -# 🔍 SYSTEMS INSIGHT: Training Stability Across Batch Sizes -def analyze_training_stability(): - """Test how each normalization handles different training scenarios.""" - print("🔍 Training Stability Analysis:") - - channels = 64 - bn = BatchNorm2d(channels) - ln = LayerNorm((channels, 16, 16)) - gn = GroupNorm(8, channels) - - print("\nStability across batch sizes:") - for batch_size in [1, 8, 32]: - x = Tensor(np.random.randn(batch_size, channels, 16, 16)) - - # Test each normalization - if batch_size == 1: - bn_status = "⚠️ Unstable" - else: - bn.train() - bn_out = bn.forward(x) - bn_status = "✅ Stable" - - ln_out = ln.forward(x) - gn_out = gn.forward(x) - - print(f" Batch {batch_size}: BN={bn_status}, LN=✅ Stable, GN=✅ Stable") - - print("\n💡 Stability insights:") - print(" • BatchNorm: Needs batch_size > 1, best with large batches") - print(" • LayerNorm: Consistent across all batch sizes") - print(" • GroupNorm: Batch-independent like LayerNorm") - -analyze_training_stability() - -# %% [markdown] -""" -### TEST Integration Test: Complete Normalization Suite - -This test validates that all normalization techniques work together and can be used interchangeably in neural network architectures. 
-""" - -# %% nbgrader={"grade": true, "grade_id": "test-normalization-integration", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false} -def test_unit_normalization_integration(): - """Integration test for all normalization techniques.""" - print("🔬 Integration Test: Complete Normalization Suite...") - - # Test configuration - batch_size, channels, height, width = 8, 32, 16, 16 - x = Tensor(np.random.randn(batch_size, channels, height, width) * 3 + 2) - - # Initialize all normalization types - bn = BatchNorm2d(channels) - ln = LayerNorm((channels, height, width)) - gn = GroupNorm(8, channels) # 4 channels per group - - # Test 1: All normalizations work with same input - bn.train() - bn_output = bn.forward(x) - ln_output = ln.forward(x) - gn_output = gn.forward(x) - - # All should have same output shape - assert bn_output.shape == x.shape, "BatchNorm should preserve shape" - assert ln_output.shape == x.shape, "LayerNorm should preserve shape" - assert gn_output.shape == x.shape, "GroupNorm should preserve shape" - - # Test 2: All produce normalized outputs - for name, output in [("BatchNorm", bn_output), ("LayerNorm", ln_output), ("GroupNorm", gn_output)]: - # Check that outputs are normalized (approximately) - output_mean = np.mean(output.data) - output_std = np.std(output.data) - - # Normalization should reduce extreme values - assert abs(output_mean) < 2.0, f"{name} should reduce mean magnitude" - assert 0.5 < output_std < 2.0, f"{name} should normalize standard deviation" - - # Test 3: Parameter count comparison - bn_params = len(bn.parameters) - ln_params = len(ln.parameters) - gn_params = len(gn.parameters) - - assert bn_params == 2, "BatchNorm should have 2 learnable parameters" - assert ln_params == 2, "LayerNorm should have 2 learnable parameters" - assert gn_params == 2, "GroupNorm should have 2 learnable parameters" - - # Test 4: Training vs evaluation mode (BatchNorm only) - bn.train() - bn_train_out = bn.forward(x) - - 
bn.eval() - bn_eval_out = bn.forward(x) - - # Outputs should be different (training uses batch stats, eval uses running stats) - # Note: might be similar if running stats are close to batch stats - assert bn_train_out.shape == bn_eval_out.shape, "Train/eval should have same shape" - - # Test 5: Batch size independence (LayerNorm and GroupNorm) - x_single = Tensor(np.random.randn(1, channels, height, width)) - - ln_single = ln.forward(x_single) - gn_single = gn.forward(x_single) - - assert ln_single.shape == x_single.shape, "LayerNorm should work with batch_size=1" - assert gn_single.shape == x_single.shape, "GroupNorm should work with batch_size=1" - - # Test 6: Memory efficiency check - # All should use similar parameter memory (2 * channels * 4 bytes for γ and β) - expected_param_memory = 2 * channels * 4 # γ and β parameters - - # BatchNorm has additional running statistics - bn_total_memory = 4 * channels * 4 # γ, β, running_mean, running_var - ln_total_memory = 2 * channels * 4 # γ, β only - gn_total_memory = 2 * channels * 4 # γ, β only - - assert bn_total_memory > ln_total_memory, "BatchNorm should use more memory (running stats)" - assert ln_total_memory == gn_total_memory, "LayerNorm and GroupNorm should use same memory" - - print("PASS Normalization integration tests passed!") - print(f"PASS All techniques work with same input format") - print(f"PASS All produce appropriately normalized outputs") - print(f"PASS Memory usage patterns are as expected") - print(f"PASS Batch size independence works correctly") - -test_unit_normalization_integration() - -# %% [markdown] -""" -## Testing: Comprehensive Validation - -Let's run comprehensive tests to ensure all normalization implementations work correctly. -""" - - -# %% [markdown] -""" -## Main Execution Block - -Run all tests to validate our normalization implementations. 
-""" - -def test_module(): -    """Integration test for complete normalization module.""" -    print("🧪 Testing Complete Normalization Module...") -     -    # Test all normalization techniques work together -    batch_size, channels, height, width = 16, 64, 32, 32 -    x = Tensor(np.random.randn(batch_size, channels, height, width) * 3 + 2) -     -    # Initialize all normalization types -    bn = BatchNorm2d(channels) -    ln = LayerNorm((channels, height, width)) -    gn = GroupNorm(8, channels) -     -    # Test that all work with same input -    bn.train() -    bn_output = bn.forward(x) -    ln_output = ln.forward(x) -    gn_output = gn.forward(x) -     -    # Verify all outputs are properly normalized -    assert bn_output.shape == x.shape, "BatchNorm should preserve shape" -    assert ln_output.shape == x.shape, "LayerNorm should preserve shape" -    assert gn_output.shape == x.shape, "GroupNorm should preserve shape" -     -    # Check normalization effectiveness -    for name, output in [("BatchNorm", bn_output), ("LayerNorm", ln_output), ("GroupNorm", gn_output)]: -        output_mean = np.mean(output.data) -        output_std = np.std(output.data) -        assert abs(output_mean) < 2.0, f"{name} should reduce mean magnitude" -        assert 0.5 < output_std < 2.0, f"{name} should normalize standard deviation" -     -    print("✅ All normalization techniques working correctly!") - -if __name__ == "__main__": -    test_module() - -# %% [markdown] -""" -## THINK ML Systems Thinking: Interactive Questions - -Now that you've implemented all three major normalization techniques, let's reflect on their systems implications and design trade-offs. -""" - -# %% [markdown] -""" -### Question 1: Memory and Batch Size Trade-offs - -**Context**: In your BatchNorm2d implementation, you saw that running statistics require additional memory (4× parameters vs 2× for LayerNorm/GroupNorm), but BatchNorm fails completely with batch_size=1. Your memory analysis showed that BatchNorm needs 2× the memory of other techniques, while your stability analysis revealed batch size dependencies.
- -**Reflection Question**: Analyze the memory vs batch size trade-offs in your normalization implementations. When you tested different batch sizes, you discovered BatchNorm becomes unstable with small batches while LayerNorm/GroupNorm remain consistent. For a production system that needs to handle both training (large batches) and inference (single samples), how would you modify your current normalization implementations to optimize memory usage while maintaining stability? Consider the running statistics storage in your BatchNorm class and the per-sample computation in your LayerNorm class. - -Think about: running statistics memory optimization, batch size adaptation strategies, inference mode memory requirements, and hybrid normalization approaches. - -*Target length: 150-300 words* -""" - -# %% [markdown] -""" -### Question 2: Computational Scaling and Group Organization - -**Context**: Your GroupNorm implementation divides channels into groups and normalizes within each group, providing a middle ground between BatchNorm and LayerNorm. Your scaling analysis showed that all normalization techniques have similar computational complexity, but different memory access patterns. The group organization in your implementation affects both memory layout and computational efficiency. - -**Reflection Question**: Examine the computational scaling patterns in your normalization implementations. Your GroupNorm.forward() method reshapes tensors to separate groups, computes statistics within groups, then reshapes back. How does this grouping strategy affect memory access patterns and cache efficiency compared to your BatchNorm (batch-wise) and LayerNorm (sample-wise) approaches? If you needed to optimize your GroupNorm implementation for very large channel counts (1024+ channels), what modifications to your group organization and computation order would improve performance while maintaining mathematical correctness? 
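One way to convince yourself that reordering the group computation is safe: the vectorized reshape-based pass and an explicit per-group loop are different traversal orders of the same math, so they must agree exactly. A sketch (simplified, no γ/β):

```python
import numpy as np

x = np.random.randn(4, 32, 8, 8)
G = 8
N, C, H, W = x.shape
step = C // G

# Order 1: one reshape, vectorized statistics over every group at once
g = x.reshape(N, G, step, H, W)
vectorized = ((g - g.mean(axis=(2, 3, 4), keepdims=True))
              / np.sqrt(g.var(axis=(2, 3, 4), keepdims=True) + 1e-5)).reshape(N, C, H, W)

# Order 2: explicit loop over groups (different memory traversal, same math)
looped = np.empty_like(x)
for gi in range(G):
    blk = x[:, gi * step:(gi + 1) * step]
    m = blk.mean(axis=(1, 2, 3), keepdims=True)
    v = blk.var(axis=(1, 2, 3), keepdims=True)
    looped[:, gi * step:(gi + 1) * step] = (blk - m) / np.sqrt(v + 1e-5)

print(np.allclose(vectorized, looped))
```

Because both orders are mathematically identical, the choice between them is purely a performance question — cache locality and vectorization width.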
- -Think about: memory access patterns, cache locality, vectorization opportunities, and group size optimization strategies. - -*Target length: 150-300 words* -""" - -# %% [markdown] -""" -### Question 3: Production Deployment and Architecture Selection - -**Context**: Your normalization implementations mirror production systems - BatchNorm for CNNs like ResNet, LayerNorm for Transformers like BERT/GPT, and GroupNorm for object detection models. Your training stability analysis revealed when each technique works best, and your performance benchmarks showed similar computational costs but different memory characteristics. - -**Reflection Question**: Based on your implementation experience and performance analysis, design a normalization selection strategy for a production ML system that needs to support multiple model architectures (CNNs, Transformers, and detection models). Your BatchNorm implementation works well for large-batch training but fails at batch_size=1, while your LayerNorm provides consistent behavior but lacks the batch parallelization benefits. How would you extend your current normalization classes to create an adaptive normalization system that automatically selects the optimal technique based on input characteristics (batch size, model architecture, deployment constraints)? - -Think about: automatic technique selection, runtime adaptation, memory budget constraints, and deployment environment requirements. - -*Target length: 150-300 words* -""" - -# %% [markdown] -""" -## TARGET MODULE SUMMARY: Normalization - -Congratulations! 
You have successfully implemented the complete normalization toolkit that makes modern deep learning possible: - -### PASS What You Have Built -- **BatchNorm2d**: Complete batch normalization with running statistics and train/eval modes -- **LayerNorm**: Batch-independent normalization for any tensor dimensions -- **GroupNorm**: Channel group normalization balancing batch and layer norm benefits -- **🆕 Comprehensive Analysis**: Memory scaling, training stability, and performance benchmarking -- **🆕 Integration Examples**: How normalization fits into different network architectures - -### PASS Technical Mastery -- **Statistical Computing**: Efficient mean/variance computation across different tensor dimensions -- **Memory Management**: Understanding parameter storage vs running statistics trade-offs -- **Training Dynamics**: How normalization affects gradient flow and training stability -- **Batch Dependencies**: When and why batch size affects normalization behavior -- **🆕 Production Patterns**: Architecture-specific normalization choices and deployment considerations - -### PASS Systems Understanding -- **Memory Scaling**: BatchNorm uses 2× the memory of LayerNorm/GroupNorm due to running statistics -- **Computational Complexity**: All techniques have similar O(N) complexity but different access patterns -- **Batch Size Effects**: BatchNorm requires batch_size > 1, others work with any batch size -- **Cache Efficiency**: How normalization axes affect memory access patterns and vectorization -- **🆕 Training Stability**: Why normalization enables higher learning rates and deeper networks - -### LINK Connection to Real ML Systems -Your implementations mirror production systems: -- **PyTorch nn.BatchNorm2d**: Your BatchNorm2d matches PyTorch's interface and behavior -- **BERT LayerNorm**: Your LayerNorm enables transformer training stability -- **Object Detection GroupNorm**: Your GroupNorm provides batch-independent normalization -- **Production Deployment**:
Understanding of when to use each technique in real systems - -### ROCKET What You Can Build Now -- **Stable CNNs**: Use BatchNorm for ResNet-style architectures with large batches -- **Transformer Models**: Use LayerNorm for attention-based architectures -- **Detection Systems**: Use GroupNorm for models with variable batch sizes -- **Adaptive Networks**: Combine techniques for optimal performance across scenarios - -### Next Steps -1. **Export your module**: `tito module complete 08_normalization` -2. **Integration ready**: Your normalization layers integrate with any neural network architecture -3. **Ready for Module 09**: Spatial operations will use your normalization for CNN stability - -**CELEBRATE Achievement Unlocked**: You've mastered the normalization techniques that enable modern deep learning, with complete understanding of their memory characteristics and performance trade-offs! -""" \ No newline at end of file diff --git a/modules_old/source/13_kernels/kernels_dev.py b/modules_old/source/13_kernels/kernels_dev.py deleted file mode 100644 index f26aa3e5..00000000 --- a/modules_old/source/13_kernels/kernels_dev.py +++ /dev/null @@ -1,2555 +0,0 @@ -# %% [markdown] -""" -# Kernels - High-Performance Computational Kernels - -Welcome to Kernels! You'll implement high-performance computational kernels that power modern ML systems! - -## LINK Building on Previous Learning -**What You Built Before**: -- Module 11 (Training): Complete training loops with gradient computation -- Module 12 (Regularization): Advanced training techniques for robust models - -**What's Working**: You can train neural networks end-to-end with sophisticated optimization and regularization! - -**The Gap**: Your implementations work correctly but may not be optimized for real-world performance demands. - -**This Module's Solution**: Implement high-performance computational kernels that optimize memory access, leverage parallelism, and achieve production-grade performance. 
- -**Connection Map**: -``` -Training -> Kernels -> Benchmarking -(correct) (fast) (measured) -``` - -## Learning Goals (Your 5-Point Framework) -- **Systems understanding**: Memory layout, cache optimization, and vectorization for ML operations -- **Core implementation skill**: Building high-performance computational kernels from scratch -- **Pattern/abstraction mastery**: Recognizing optimization patterns across different hardware architectures -- **Framework connections**: Understanding how PyTorch and TensorFlow achieve high performance -- **Optimization trade-offs**: Balancing memory usage, computational complexity, and parallelism - -## Build -> Use -> Reflect -1. **Build**: Implement optimized kernels for matrix operations, activations, and memory management -2. **Use**: Apply kernels to real ML workloads and measure performance improvements -3. **Reflect**: Analyze optimization patterns and design production-grade kernel architectures - -## Systems Reality Check -TIP **Production Context**: PyTorch uses custom CUDA kernels and CPU vectorization for 10-100x speedups -SPEED **Performance Insight**: Memory bandwidth is often the limiting factor, not compute - optimize data movement first -""" - -# %% [markdown] -""" -## What Are High-Performance Kernels? 
- -High-performance kernels are optimized computational functions that leverage hardware-specific features like: - -``` -CPU Kernels: -+-------------------------------------+ -| SIMD Instructions (AVX, SSE) | <- Process 4-16 floats simultaneously -| Cache-Friendly Memory Patterns | <- Minimize cache misses -| Loop Unrolling & Vectorization | <- Eliminate loop overhead -+-------------------------------------+ - -GPU Kernels: -+-------------------------------------+ -| Thread Blocks & Shared Memory | <- Parallel processing with fast memory -| Memory Coalescing | <- Efficient global memory access -| Warp-Level Operations | <- 32 threads execute together -+-------------------------------------+ -``` - -**Why This Matters for ML Systems:** -- **Training Speed**: 10-100x faster matrix operations enable larger models -- **Inference Latency**: Optimized kernels reduce serving costs and improve user experience -- **Memory Efficiency**: Better data layouts reduce memory bandwidth requirements -- **Energy Efficiency**: Optimized code reduces power consumption in data centers -""" - -# %% [markdown] -""" -## Mathematical Foundations - -### Cache-Friendly Matrix Multiplication - -Standard algorithm is O(n³) but cache-unfriendly: -```python -# Cache-unfriendly (random memory access) -for i in range(n): - for j in range(n): - for k in range(n): - C[i,j] += A[i,k] * B[k,j] # B[k,j] jumps around memory -``` - -Blocked algorithm improves cache locality: -```python -# Cache-friendly (blocked access) -for bi in range(0, n, block_size): - for bj in range(0, n, block_size): - for bk in range(0, n, block_size): - # Process block that fits in cache - for i in range(bi, min(bi+block_size, n)): - for j in range(bj, min(bj+block_size, n)): - for k in range(bk, min(bk+block_size, n)): - C[i,j] += A[i,k] * B[k,j] -``` - -### SIMD Vectorization - -Single Instruction, Multiple Data (SIMD) processes multiple elements simultaneously: - -``` -Scalar ReLU (1 element at a time): -for i in range(n): - 
y[i] = max(0, x[i]) # 1 operation per cycle - -Vectorized ReLU (8 elements at a time with AVX): -y = np.maximum(0, x) # 8 operations per cycle -``` - -### Memory Access Patterns - -``` -Row-Major Access (Fast): -A[0,0] A[0,1] A[0,2] A[0,3] ... <- Sequential memory access - -Column-Major Access (Slow): -A[0,0] A[1,0] A[2,0] A[3,0] ... <- Strided memory access - -Cache Line Impact: -+-----+-----+-----+-----+ -| A[0,0:4] loaded together | <- 64-byte cache line -+-----+-----+-----+-----+ -``` -""" - -# %% [markdown] -""" -## Why Build High-Performance Kernels? - -### Production Performance Requirements -Modern ML systems require optimized kernels for: - -1. **Real-Time Inference**: Self-driving cars need <10ms response times -2. **Large-Scale Training**: Training GPT-scale models requires maximum hardware utilization -3. **Edge Deployment**: Mobile and IoT devices have limited compute and memory -4. **Cost Optimization**: Cloud compute costs scale with execution time - -### Learning Through Implementation -Building kernels teaches you: - -- **Hardware-Software Interface**: How software maps to CPU/GPU architecture -- **Performance Engineering**: Systematic optimization methodology -- **Production Debugging**: Why ML models are slow and how to fix them -- **System Design**: How to build scalable ML infrastructure - -### Connection to Frameworks -Every major ML framework uses custom kernels: -- **PyTorch**: ATen library with CUDA kernels and CPU vectorization -- **TensorFlow**: XLA compiler with hardware-specific optimizations -- **JAX**: JIT compilation with automatic kernel fusion -""" - -# %% [markdown] -""" -## Production Context - How Real Systems Work - -### PyTorch Kernel Architecture -```python -# High-level PyTorch operation -result = torch.matmul(A, B) - -# Maps to optimized kernel based on: -# - Hardware: CPU (MKL-DNN) vs GPU (cuBLAS) -# - Data type: float32, float16, int8 -# - Tensor size: Small (custom) vs Large (BLAS) -# - Memory layout: Contiguous vs 
Strided -``` - -### Performance Hierarchy -``` -1. Specialized Hardware: TPUs, Tensor Cores (100-1000x) -2. Optimized Libraries: cuBLAS, MKL (10-100x) -3. Vectorized Code: SIMD, OpenMP (2-10x) -4. Cache-Friendly: Blocked algorithms (1.5-3x) -5. Naive Implementation: Baseline (1x) -``` - -### Real-World Impact -- **Training Cost**: Optimized kernels reduce AWS training costs by 50-90% -- **Serving Latency**: Fast inference enables real-time applications -- **Model Size**: Quantization kernels enable deployment on mobile devices -- **Energy Usage**: Efficient kernels reduce data center power consumption -""" - -# %% -#| default_exp core.kernels -import numpy as np -import sys -import os -import time -import psutil -from typing import Callable, Dict, Any, Optional, Tuple, List -from concurrent.futures import ThreadPoolExecutor - -# Import our existing components -try: - from tinytorch.core.tensor import Tensor -except ImportError: - # Create minimal mock for development - class Tensor: - def __init__(self, data): - self.data = np.array(data) - self.shape = self.data.shape - def __str__(self): - return f"Tensor({self.data})" - -# %% [markdown] -""" -## Architecture - Building High-Performance Kernels - -Our kernel optimization strategy follows a systematic hierarchy: - -``` -TARGET Optimization Strategy: -+-------------------------------------+ -| 1. Correctness: Get the right answer | -| 2. Cache Optimization: Memory patterns | -| 3. Vectorization: SIMD instructions | -| 4. Parallelization: Multi-core | -| 5. Quantization: Reduced precision | -+-------------------------------------+ - -🔧 Implementation Layers: -+-------------------------------------+ -| Higher Level: Kernel Composition | <- Combine optimizations -| Mid Level: Algorithm Optimization | <- Cache blocking, tiling -| Lower Level: Hardware Primitives | <- SIMD, memory layout -+-------------------------------------+ -``` - -**Design Principles:** -1. **Measure First**: Profile before optimizing -2. 
**Systematic Approach**: One optimization at a time -3. **Hardware Awareness**: Understand the target architecture -4. **Composability**: Build higher-level optimizations from primitives -""" - -# %% [markdown] -""" -## Implementation - Building High-Performance Kernels - -### Core Timing Infrastructure -""" - -# %% -def time_kernel(func: Callable, *args, **kwargs) -> Tuple[Any, float]: - """ - Precision timing function for measuring kernel performance. - - This is the foundation for all performance analysis - accurate timing - that accounts for CPU frequency scaling and system noise. - - Args: - func: The kernel function to time - *args: Arguments to pass to the function - **kwargs: Keyword arguments to pass to the function - - Returns: - tuple: (function_result, execution_time_microseconds) - - TODO: Implement high-precision kernel timing with noise reduction. - - APPROACH: - 1. Use time.perf_counter() for high precision timing - 2. Warm up CPU to stable frequency before measurement - 3. Handle OS scheduling noise with multiple measurements - 4. Return both result and timing for validation - - EXAMPLE: - >>> result, time_us = time_kernel(np.matmul, A, B) - >>> print(f"Matrix multiply took {time_us:.2f} microseconds") - - PERFORMANCE CONSIDERATIONS: - - perf_counter() has nanosecond precision on modern systems - - CPU frequency scaling can affect measurements - - OS scheduling introduces timing noise - - Cache state affects first vs subsequent runs - """ - ### BEGIN SOLUTION - # Warm-up run to stabilize CPU frequency - _ = func(*args, **kwargs) - - # High-precision timing - start = time.perf_counter() - result = func(*args, **kwargs) - end = time.perf_counter() - - # Convert to microseconds for better readability - execution_time_us = (end - start) * 1_000_000 - - return result, execution_time_us - ### END SOLUTION - -# PASS IMPLEMENTATION CHECKPOINT: Timing infrastructure complete - -# THINK PREDICTION: How much timing overhead does our measurement add? 
-# Your guess: _____ microseconds - -# MAGNIFY SYSTEMS INSIGHT: Timing Overhead Analysis -def analyze_timing_overhead(): - """Measure the overhead of our timing infrastructure.""" - try: - # Test with minimal operation - def minimal_op(): - return 42 - - # Time the timing overhead - measurements = [] - for _ in range(100): - _, timing = time_kernel(minimal_op) - measurements.append(timing) - - avg_overhead = np.mean(measurements) - std_overhead = np.std(measurements) - min_overhead = np.min(measurements) - - print(f"Timing overhead analysis:") - print(f" Average: {avg_overhead:.3f} μs") - print(f" Std dev: {std_overhead:.3f} μs") - print(f" Minimum: {min_overhead:.3f} μs") - print(f" Relative precision: ±{std_overhead/avg_overhead*100:.1f}%") - - # TIP WHY THIS MATTERS: Timing overhead must be much smaller than - # the operations we're measuring, or results will be meaningless. - # Modern CPUs: ~1-10 μs overhead, so measure operations >100 μs - - return { - 'avg_overhead_us': avg_overhead, - 'precision_percent': std_overhead/avg_overhead*100, - 'reliable_for_operations_above_us': avg_overhead * 10 - } - except Exception as e: - print(f"WARNING️ Timing analysis error: {e}") - return None - -# Run the analysis -timing_analysis = analyze_timing_overhead() - -# %% [markdown] -""" -### TEST Unit Test: Timing Infrastructure -This test validates `time_kernel`, ensuring accurate performance measurement -""" - -# %% -def test_unit_timing_infrastructure(): - """Test timing infrastructure with known operations.""" - print("TEST Unit Test: Timing Infrastructure") - - # Test 1: Basic timing functionality - def test_operation(): - time.sleep(0.001) # 1ms sleep - return "done" - - result, elapsed_us = time_kernel(test_operation) - - assert result == "done", "Function result should be preserved" - assert 800 <= elapsed_us <= 2000, f"1ms sleep should take ~1000μs, got {elapsed_us:.1f}μs" - print(f"PASS Basic timing: {elapsed_us:.1f}μs for 1ms operation") - - # Test 2: Timing 
precision - def fast_operation(): - return sum(range(1000)) - - measurements = [] - for _ in range(10): - _, timing = time_kernel(fast_operation) - measurements.append(timing) - - cv = np.std(measurements) / np.mean(measurements) - assert cv < 0.5, f"Timing precision should be reasonable, CV={cv:.3f}" - print(f"PASS Timing precision: CV={cv:.3f} across 10 measurements") - - # Test 3: Argument passing - def add_operation(a, b, c=0): - return a + b + c - - result, _ = time_kernel(add_operation, 5, 10, c=2) - assert result == 17, f"Arguments should pass correctly, got {result}" - print("PASS Argument passing works correctly") - -# Run the test -test_unit_timing_infrastructure() - -# %% [markdown] -""" -### Matrix Multiplication Optimization -""" - -# %% -def matmul_baseline(A: np.ndarray, B: np.ndarray) -> np.ndarray: - """ - Baseline matrix multiplication using NumPy's optimized implementation. - - This serves as our reference implementation and performance baseline. - NumPy uses highly optimized BLAS libraries (Intel MKL, OpenBLAS). - - Args: - A: Left matrix (M x K) - B: Right matrix (K x N) - - Returns: - np.ndarray: Result matrix (M x N) - - TODO: Use NumPy's optimized matrix multiplication as baseline. - - APPROACH: - 1. Validate input shapes for compatibility - 2. Use np.dot() which calls optimized BLAS - 3. 
This is our "ground truth" for correctness and baseline for performance - - EXAMPLE: - >>> A = np.random.randn(100, 50) - >>> B = np.random.randn(50, 75) - >>> C = matmul_baseline(A, B) - >>> print(C.shape) # (100, 75) - - PERFORMANCE NOTES: - - NumPy calls optimized BLAS: Intel MKL or OpenBLAS - - These libraries use vectorization, threading, and cache optimization - - Typical performance: 100+ GFLOPS on modern CPUs - """ - ### BEGIN SOLUTION - # Validate shapes - if A.shape[1] != B.shape[0]: - raise ValueError(f"Cannot multiply {A.shape} and {B.shape}: inner dimensions don't match") - - # Use NumPy's optimized matrix multiplication - result = np.dot(A, B) - - return result - ### END SOLUTION - -# PASS IMPLEMENTATION CHECKPOINT: Baseline matrix multiplication complete - -# MAGNIFY SYSTEMS INSIGHT: Matrix Multiplication Performance Scaling -def analyze_matmul_scaling(): - """Analyze how matrix multiplication performance scales with size.""" - try: - sizes = [64, 128, 256, 512] - results = [] - - for size in sizes: - A = np.random.randn(size, size).astype(np.float32) - B = np.random.randn(size, size).astype(np.float32) - - # Time the operation (keep the result so we can count its memory) - C, time_us = time_kernel(matmul_baseline, A, B) - - # Calculate metrics - flops = 2 * size**3 # Multiply-accumulate operations - gflops = flops / (time_us / 1_000_000) / 1e9 - - results.append({ - 'size': size, - 'time_us': time_us, - 'gflops': gflops, - 'memory_mb': (A.nbytes + B.nbytes + C.nbytes) / 1024 / 1024 - }) - - print(f"Size {size:3d}: {time_us:8.1f}μs, {gflops:6.1f} GFLOPS, {results[-1]['memory_mb']:5.1f}MB") - - # Analyze scaling behavior - time_scaling = results[-1]['time_us'] / results[0]['time_us'] - size_scaling = (results[-1]['size'] / results[0]['size']) ** 3 - efficiency = time_scaling / size_scaling - - print(f"\nScaling analysis:") - print(f" Time scaling: {time_scaling:.1f}x") - print(f" Theoretical (O(n³)): {size_scaling:.1f}x") - print(f" Efficiency: {efficiency:.3f} (1.0 = perfect scaling)") - - # 
TIP WHY THIS MATTERS: Matrix multiplication is O(n³), but cache effects - # and memory bandwidth limits mean real performance doesn't scale perfectly. - # Understanding these limits helps size operations for optimal performance. - - return results - - except Exception as e: - print(f"WARNING️ Scaling analysis error: {e}") - return None - -# Run the analysis -matmul_scaling = analyze_matmul_scaling() - -# %% -def cache_friendly_matmul(A: np.ndarray, B: np.ndarray, block_size: int = 64) -> np.ndarray: - """ - Cache-friendly matrix multiplication using blocking technique. - - This implementation improves memory access patterns by processing - matrices in cache-sized blocks, reducing cache misses. - - Args: - A: Left matrix (M x K) - B: Right matrix (K x N) - block_size: Size of cache blocks (default 64) - - Returns: - np.ndarray: Result matrix (M x N) - - TODO: Implement cache-friendly matrix multiplication using blocking. - - APPROACH: - 1. Divide matrices into block_size x block_size blocks - 2. Process blocks in order that maximizes data reuse - 3. Inner loops work on cache-friendly sub-matrices - 4. 
Accumulate partial results in output blocks - - BLOCKING ALGORITHM: - ``` - for each block row of A: - for each block column of B: - for each block column of A / block row of B: - multiply sub-blocks and accumulate - ``` - - EXAMPLE: - >>> A = np.random.randn(128, 128) - >>> B = np.random.randn(128, 128) - >>> C = cache_friendly_matmul(A, B, block_size=32) - - CACHE OPTIMIZATION: - - block_size should fit in L1 cache (~32KB) - - For float32: block_size=64 uses ~16KB per block - - Reduces cache misses from O(n³) to O(n³/B) where B=block_size - """ - ### BEGIN SOLUTION - M, K = A.shape - K2, N = B.shape - - if K != K2: - raise ValueError(f"Cannot multiply {A.shape} and {B.shape}") - - # Initialize result matrix - C = np.zeros((M, N), dtype=A.dtype) - - # Cache-friendly blocked multiplication - for i in range(0, M, block_size): - for j in range(0, N, block_size): - for k in range(0, K, block_size): - # Define block boundaries - end_i = min(i + block_size, M) - end_j = min(j + block_size, N) - end_k = min(k + block_size, K) - - # Extract blocks - A_block = A[i:end_i, k:end_k] - B_block = B[k:end_k, j:end_j] - - # Multiply blocks and accumulate - C[i:end_i, j:end_j] += np.dot(A_block, B_block) - - return C - ### END SOLUTION - -# %% [markdown] -""" -### TEST Unit Test: Cache-Friendly Matrix Multiplication -This test validates `cache_friendly_matmul`, ensuring correctness and performance improvement -""" - -# %% -def test_unit_cache_friendly_matmul(): - """Test cache-friendly matrix multiplication.""" - print("TEST Unit Test: Cache-Friendly Matrix Multiplication") - - # Test 1: Correctness - A = np.array([[1, 2], [3, 4]], dtype=np.float32) - B = np.array([[5, 6], [7, 8]], dtype=np.float32) - - result_cache = cache_friendly_matmul(A, B, block_size=1) - result_baseline = matmul_baseline(A, B) - - assert np.allclose(result_cache, result_baseline), "Cache-friendly result should match baseline" - print("PASS Correctness: Matches baseline implementation") - - # Test 2: 
Performance comparison - size = 256 - A_large = np.random.randn(size, size).astype(np.float32) - B_large = np.random.randn(size, size).astype(np.float32) - - _, baseline_time = time_kernel(matmul_baseline, A_large, B_large) - _, cache_time = time_kernel(cache_friendly_matmul, A_large, B_large, 64) - - print(f"PASS Performance: Baseline={baseline_time:.1f}μs, Cache-friendly={cache_time:.1f}μs") - - # Test 3: Different block sizes - block_sizes = [32, 64, 128] - for bs in block_sizes: - result = cache_friendly_matmul(A, B, block_size=bs) - assert np.allclose(result, result_baseline), f"Block size {bs} should be correct" - - print(f"PASS Block sizes: Tested {block_sizes}") - -# Run the test -test_unit_cache_friendly_matmul() - -# %% [markdown] -""" -### Vectorized Operations -""" - -# %% -def vectorized_relu(x: np.ndarray) -> np.ndarray: - """ - Vectorized ReLU implementation using SIMD principles. - - This function demonstrates how to write operations that leverage - CPU vectorization for better performance than scalar loops. - - Args: - x: Input array - - Returns: - np.ndarray: ReLU applied element-wise - - TODO: Implement vectorized ReLU optimized for SIMD execution. - - APPROACH: - 1. Ensure input array is contiguous for vectorization - 2. Use NumPy's vectorized operations (compile to SIMD) - 3. Handle different data types appropriately - 4. 
Return result maintaining input shape - - VECTORIZATION TECHNIQUES: - - np.maximum() uses SIMD instructions when possible - - Contiguous memory layout enables efficient vectorization - - Proper data types (float32) maximize SIMD lane utilization - - EXAMPLE: - >>> x = np.array([-2, -1, 0, 1, 2], dtype=np.float32) - >>> y = vectorized_relu(x) - >>> print(y) # [0, 0, 0, 1, 2] - - PERFORMANCE BENEFITS: - - AVX2: 8 float32 operations per instruction - - AVX-512: 16 float32 operations per instruction - - Typical speedup: 4-16x over scalar loops - """ - ### BEGIN SOLUTION - # Ensure contiguous memory layout for best SIMD performance - if not x.flags.c_contiguous: - x = np.ascontiguousarray(x) - - # Vectorized ReLU using NumPy's maximum function - # This compiles to SIMD instructions on modern CPUs - result = np.maximum(0, x) - - return result - ### END SOLUTION - -# %% -def vectorized_operations(x: np.ndarray, y: np.ndarray) -> Dict[str, np.ndarray]: - """ - Collection of vectorized operations demonstrating SIMD principles. - - Shows how multiple operations can be vectorized efficiently. - - Args: - x: First input array - y: Second input array (must be same shape as x) - - Returns: - Dict[str, np.ndarray]: Dictionary of vectorized operation results - - TODO: Implement vectorized versions of common operations. - - OPERATIONS TO IMPLEMENT: - - Element-wise addition, multiplication - - Squared difference - - Euclidean distance - - Dot product - - APPROACH: - 1. Validate input shapes match - 2. Use NumPy vectorized functions - 3. Combine operations when beneficial - 4. 
Return comprehensive results dictionary - - EXAMPLE: - >>> x = np.array([1, 2, 3, 4]) - >>> y = np.array([2, 3, 4, 5]) - >>> results = vectorized_operations(x, y) - >>> print(results['element_wise_add']) # [3, 5, 7, 9] - - VECTORIZATION BENEFITS: - - Single instruction processes multiple elements - - Reduced loop overhead - - Better CPU pipeline utilization - """ - ### BEGIN SOLUTION - # Validate shapes - if x.shape != y.shape: - raise ValueError(f"Input shapes don't match: {x.shape} vs {y.shape}") - - # Ensure contiguous arrays for best performance - if not x.flags.c_contiguous: - x = np.ascontiguousarray(x) - if not y.flags.c_contiguous: - y = np.ascontiguousarray(y) - - # Vectorized operations - results = { - 'element_wise_add': x + y, - 'element_wise_multiply': x * y, - 'squared_difference': (x - y) ** 2, - 'euclidean_distance': np.sqrt(np.sum((x - y) ** 2)), - 'dot_product': np.dot(x.flatten(), y.flatten()), - 'cosine_similarity': np.dot(x.flatten(), y.flatten()) / (np.linalg.norm(x) * np.linalg.norm(y)) - } - - return results - ### END SOLUTION - -# PASS IMPLEMENTATION CHECKPOINT: Vectorized operations complete - -# MAGNIFY SYSTEMS INSIGHT: Vectorization Performance Analysis -def analyze_vectorization_performance(): - """Compare vectorized vs scalar performance.""" - try: - size = 100000 - x = np.random.randn(size).astype(np.float32) - y = np.random.randn(size).astype(np.float32) - - # Time vectorized ReLU - _, vec_time = time_kernel(vectorized_relu, x) - - # Time scalar ReLU (simulated) - def scalar_relu_simulation(arr): - # Simulate scalar processing with numpy operations - # (Real scalar would be much slower) - result = np.zeros_like(arr) - for i in range(min(1000, len(arr))): # Sample to avoid timeout - result[i] = max(0, arr[i]) - return result - - _, scalar_time = time_kernel(scalar_relu_simulation, x[:1000]) - - # Estimate full scalar time - estimated_scalar_time = scalar_time * (size / 1000) - speedup = estimated_scalar_time / vec_time - - 
print(f"Vectorization performance analysis:") - print(f" Array size: {size:,} elements") - print(f" Vectorized ReLU: {vec_time:.1f}μs") - print(f" Estimated scalar: {estimated_scalar_time:.1f}μs") - print(f" Speedup: {speedup:.1f}x") - - # Test vectorized operations - _, ops_time = time_kernel(vectorized_operations, x, y) - operations_per_second = 6 * size / (ops_time / 1_000_000) # 6 operations - - print(f" Vectorized operations: {ops_time:.1f}μs") - print(f" Throughput: {operations_per_second/1e6:.1f}M ops/sec") - - # TIP WHY THIS MATTERS: Vectorization provides 4-16x speedups on modern CPUs. - # This is essential for real-time inference and efficient training. - # ML frameworks like PyTorch rely heavily on vectorized operations. - - return { - 'vectorized_speedup': speedup, - 'throughput_mops': operations_per_second / 1e6 - } - - except Exception as e: - print(f"WARNING️ Vectorization analysis error: {e}") - return None - -# Run the analysis -vectorization_analysis = analyze_vectorization_performance() - -# %% [markdown] -""" -### TEST Unit Test: Vectorized Operations -This test validates vectorized implementations for correctness and performance -""" - -# %% -def test_unit_vectorized_operations(): - """Test vectorized operations.""" - print("TEST Unit Test: Vectorized Operations") - - # Test 1: Vectorized ReLU correctness - x = np.array([-2, -1, 0, 1, 2], dtype=np.float32) - result = vectorized_relu(x) - expected = np.array([0, 0, 0, 1, 2], dtype=np.float32) - - assert np.allclose(result, expected), "Vectorized ReLU should be correct" - print("PASS ReLU correctness: Produces expected outputs") - - # Test 2: Vectorized operations correctness - x = np.array([1, 2, 3, 4], dtype=np.float32) - y = np.array([2, 3, 4, 5], dtype=np.float32) - - results = vectorized_operations(x, y) - - assert np.allclose(results['element_wise_add'], [3, 5, 7, 9]), "Addition should be correct" - assert np.allclose(results['element_wise_multiply'], [2, 6, 12, 20]), "Multiplication should 
be correct" - assert np.allclose(results['dot_product'], 40), "Dot product should be correct" - - print("PASS Operations correctness: All operations produce expected results") - - # Test 3: Performance with larger arrays - large_x = np.random.randn(10000).astype(np.float32) - large_y = np.random.randn(10000).astype(np.float32) - - _, relu_time = time_kernel(vectorized_relu, large_x) - _, ops_time = time_kernel(vectorized_operations, large_x, large_y) - - assert relu_time < 1000, f"ReLU should be fast, took {relu_time:.1f}μs" - assert ops_time < 5000, f"Operations should be fast, took {ops_time:.1f}μs" - - print(f"PASS Performance: ReLU={relu_time:.1f}μs, Operations={ops_time:.1f}μs") - -# Run the test -test_unit_vectorized_operations() - -# %% [markdown] -""" -### Parallel Processing -""" - -# %% -def parallel_relu(x: np.ndarray, num_workers: int = 4) -> np.ndarray: - """ - Parallel ReLU implementation using multiple CPU cores. - - Demonstrates data parallelism by distributing computation - across multiple worker threads. - - Args: - x: Input array - num_workers: Number of parallel workers - - Returns: - np.ndarray: ReLU applied in parallel - - TODO: Implement parallel ReLU using threading or multiprocessing. - - APPROACH: - 1. Split input array into chunks for each worker - 2. Process chunks in parallel using ThreadPoolExecutor - 3. Combine results maintaining original order - 4. 
Handle edge cases (small arrays, uneven splits) - - PARALLELIZATION STRATEGY: - - Thread-based for I/O bound or small computations - - Process-based for CPU-bound large computations - - Chunk size should balance overhead vs parallelism - - EXAMPLE: - >>> x = np.random.randn(100000) - >>> y = parallel_relu(x, num_workers=8) - - PERFORMANCE CONSIDERATIONS: - - Overhead of thread creation and coordination - - Memory bandwidth limitations - - Thread synchronization costs - - Optimal for large arrays where parallelism benefits exceed overhead - """ - ### BEGIN SOLUTION - # For small arrays, parallel processing overhead isn't worth it - if x.size < 10000: - return vectorized_relu(x) - - # Split array into chunks - chunk_size = max(1, x.size // num_workers) - chunks = [] - flat_x = x.flatten() - - for i in range(0, len(flat_x), chunk_size): - chunks.append(flat_x[i:i + chunk_size]) - - # Worker function - def relu_chunk(chunk): - return vectorized_relu(chunk) - - # Process chunks in parallel - with ThreadPoolExecutor(max_workers=num_workers) as executor: - # Submit all tasks - futures = [executor.submit(relu_chunk, chunk) for chunk in chunks] - - # Collect results in order - results = [future.result() for future in futures] - - # Combine results and reshape - combined = np.concatenate(results) - return combined.reshape(x.shape) - ### END SOLUTION - -# %% -def parallel_batch_processing(batch_data: np.ndarray, operation: Callable = None, num_workers: int = 4) -> np.ndarray: - """ - Process batches of data in parallel across multiple workers. - - Demonstrates how ML frameworks parallelize batch processing - for improved throughput. - - Args: - batch_data: Input batch (batch_size, ...) - operation: Operation to apply (default: ReLU) - num_workers: Number of parallel workers - - Returns: - np.ndarray: Processed batch data - - TODO: Implement parallel batch processing. - - APPROACH: - 1. Split batch across workers (each worker gets some samples) - 2. 
Apply operation to each worker's subset - 3. Combine results maintaining batch order - 4. Default to ReLU if no operation specified - - PARALLELIZATION PATTERN: - - Each worker processes complete samples - - Good for independent operations on batch elements - - Scales well with batch size - - EXAMPLE: - >>> batch = np.random.randn(128, 784) # 128 samples, 784 features - >>> result = parallel_batch_processing(batch, vectorized_relu, 4) - - ML SYSTEMS CONNECTION: - - PyTorch DataLoader uses similar parallelization - - GPU tensor operations naturally parallel across batch dimension - - Critical for large batch training and inference - """ - ### BEGIN SOLUTION - if operation is None: - operation = vectorized_relu - - batch_size = batch_data.shape[0] - - # For small batches, parallel processing overhead isn't worth it - if batch_size < num_workers: - return operation(batch_data) - - # Split batch into chunks - chunk_size = max(1, batch_size // num_workers) - chunks = [] - - for i in range(0, batch_size, chunk_size): - end_idx = min(i + chunk_size, batch_size) - chunks.append(batch_data[i:end_idx]) - - # Process chunks in parallel - with ThreadPoolExecutor(max_workers=num_workers) as executor: - # Submit all tasks - futures = [executor.submit(operation, chunk) for chunk in chunks] - - # Collect results in order - results = [future.result() for future in futures] - - # Combine results - return np.concatenate(results, axis=0) - ### END SOLUTION - -# PASS IMPLEMENTATION CHECKPOINT: Parallel processing complete - -# MAGNIFY SYSTEMS INSIGHT: Parallel Processing Scaling Analysis -def analyze_parallel_scaling(): - """Analyze how parallel processing scales with worker count.""" - try: - # Test data - large_array = np.random.randn(50000).astype(np.float32) - batch_data = np.random.randn(64, 1000).astype(np.float32) - - # Test different worker counts - worker_counts = [1, 2, 4, 8] - results = [] - - print("Parallel processing scaling analysis:") - print("Worker Count | ReLU Time | 
Batch Time | ReLU Speedup | Batch Speedup") - print("-" * 70) - - baseline_relu_time = None - baseline_batch_time = None - - for workers in worker_counts: - # Time parallel ReLU - _, relu_time = time_kernel(parallel_relu, large_array, workers) - - # Time parallel batch processing - _, batch_time = time_kernel(parallel_batch_processing, batch_data, vectorized_relu, workers) - - # Calculate speedups - if baseline_relu_time is None: - baseline_relu_time = relu_time - baseline_batch_time = batch_time - relu_speedup = 1.0 - batch_speedup = 1.0 - else: - relu_speedup = baseline_relu_time / relu_time - batch_speedup = baseline_batch_time / batch_time - - results.append({ - 'workers': workers, - 'relu_time': relu_time, - 'batch_time': batch_time, - 'relu_speedup': relu_speedup, - 'batch_speedup': batch_speedup - }) - - print(f"{workers:11d} | {relu_time:8.1f}μs | {batch_time:9.1f}μs | " - f"{relu_speedup:11.2f}x | {batch_speedup:12.2f}x") - - # Analyze scaling efficiency - max_speedup_relu = max(r['relu_speedup'] for r in results) - max_speedup_batch = max(r['batch_speedup'] for r in results) - - print(f"\nScaling analysis:") - print(f" Max ReLU speedup: {max_speedup_relu:.2f}x") - print(f" Max batch speedup: {max_speedup_batch:.2f}x") - print(f" ReLU efficiency: {max_speedup_relu/8:.2f} (theoretical max: 1.0)") - print(f" Batch efficiency: {max_speedup_batch/8:.2f} (theoretical max: 1.0)") - - # TIP WHY THIS MATTERS: Parallel processing has diminishing returns due to: - # 1. Thread overhead and synchronization costs - # 2. Memory bandwidth limitations - # 3. Amdahl's law - sequential portions limit speedup - # Understanding these limits helps choose optimal parallelism levels. 
- - return results - - except Exception as e: - print(f"WARNING️ Parallel scaling analysis error: {e}") - return None - -# Run the analysis -parallel_scaling = analyze_parallel_scaling() - -# %% [markdown] -""" -### TEST Unit Test: Parallel Processing -This test validates parallel implementations for correctness and performance scaling -""" - -# %% -def test_unit_parallel_processing(): - """Test parallel processing implementations.""" - print("TEST Unit Test: Parallel Processing") - - # Test 1: Parallel ReLU correctness - x = np.array([-2, -1, 0, 1, 2], dtype=np.float32) - - result_parallel = parallel_relu(x, num_workers=2) - result_sequential = vectorized_relu(x) - - assert np.allclose(result_parallel, result_sequential), "Parallel ReLU should match sequential" - print("PASS ReLU correctness: Parallel matches sequential result") - - # Test 2: Parallel batch processing correctness - batch = np.random.randn(16, 10).astype(np.float32) - - result_parallel = parallel_batch_processing(batch, vectorized_relu, num_workers=4) - result_sequential = vectorized_relu(batch) - - assert np.allclose(result_parallel, result_sequential), "Parallel batch should match sequential" - assert result_parallel.shape == batch.shape, "Output shape should match input" - print("PASS Batch correctness: Parallel matches sequential result") - - # Test 3: Performance with larger data - large_x = np.random.randn(20000).astype(np.float32) - large_batch = np.random.randn(32, 1000).astype(np.float32) - - _, sequential_time = time_kernel(vectorized_relu, large_x) - _, parallel_time = time_kernel(parallel_relu, large_x, 4) - - print(f"PASS Performance: Sequential={sequential_time:.1f}μs, Parallel={parallel_time:.1f}μs") - - # Test 4: Edge cases - small_x = np.array([1, 2, 3]) - result_small = parallel_relu(small_x, num_workers=8) - expected_small = vectorized_relu(small_x) - - assert np.allclose(result_small, expected_small), "Small arrays should work correctly" - print("PASS Edge cases: Small arrays 
handled correctly") - -# Run the test -test_unit_parallel_processing() - -# %% [markdown] -""" -### Quantization Kernels -""" - -# %% -def quantized_matmul(A: np.ndarray, B: np.ndarray, bits: int = 8) -> np.ndarray: - """ - Quantized matrix multiplication for memory and compute efficiency. - - Implements quantization to reduce memory usage and enable - efficient inference on edge devices. - - Args: - A: Left matrix (float32) - B: Right matrix (float32) - bits: Quantization bits (default 8) - - Returns: - np.ndarray: Dequantized result matrix - - TODO: Implement quantized matrix multiplication. - - APPROACH: - 1. Calculate quantization scales based on data range - 2. Quantize inputs to int8/int16 format - 3. Perform integer matrix multiplication - 4. Dequantize result back to float32 - - QUANTIZATION PROCESS: - ``` - scale = max(abs(data)) / (2^(bits-1) - 1) - quantized = round(data / scale).clip(-128, 127) # for 8-bit - result = quantized_A @ quantized_B - dequantized = result * scale_A * scale_B - ``` - - EXAMPLE: - >>> A = np.random.randn(64, 32).astype(np.float32) - >>> B = np.random.randn(32, 48).astype(np.float32) - >>> C = quantized_matmul(A, B, bits=8) - - PERFORMANCE BENEFITS: - - 4x memory reduction (float32 -> int8) - - Faster integer arithmetic on some hardware - - Enables deployment on memory-constrained devices - """ - ### BEGIN SOLUTION - # Calculate quantization scales - max_val = 2**(bits-1) - 1 # e.g., 127 for 8-bit - - scale_A = np.max(np.abs(A)) / max_val if np.max(np.abs(A)) > 0 else 1.0 - scale_B = np.max(np.abs(B)) / max_val if np.max(np.abs(B)) > 0 else 1.0 - - # Quantize inputs - if bits == 8: - dtype = np.int8 - min_val, max_val = -128, 127 - elif bits == 16: - dtype = np.int16 - min_val, max_val = -32768, 32767 - else: - raise ValueError(f"Unsupported quantization: {bits} bits") - - A_quantized = np.round(A / scale_A).clip(min_val, max_val).astype(dtype) - B_quantized = np.round(B / scale_B).clip(min_val, max_val).astype(dtype) - - # 
Perform integer matrix multiplication - # Use int32 accumulation to prevent overflow - C_quantized = np.dot(A_quantized.astype(np.int32), B_quantized.astype(np.int32)) - - # Dequantize result - C_dequantized = C_quantized.astype(np.float32) * scale_A * scale_B - - return C_dequantized - ### END SOLUTION - -# %% -def quantized_relu(x: np.ndarray, bits: int = 8) -> np.ndarray: - """ - Quantized ReLU activation for efficient inference. - - Applies ReLU in quantized domain to maintain precision - while reducing computational overhead. - - Args: - x: Input array (float32) - bits: Quantization bits (default 8) - - Returns: - np.ndarray: Quantized ReLU result (dequantized to float32) - - TODO: Implement quantized ReLU activation. - - APPROACH: - 1. Calculate quantization scale from input range - 2. Quantize input to integer representation - 3. Apply ReLU in integer domain (max(0, x)) - 4. Dequantize result back to float32 - - QUANTIZED RELU PROCESS: - ``` - scale = max(abs(x)) / (2^(bits-1) - 1) - x_quantized = round(x / scale).clip(-128, 127) - relu_quantized = max(0, x_quantized) - result = relu_quantized * scale - ``` - - EXAMPLE: - >>> x = np.array([-1.0, 0.0, 1.0, 2.0]) - >>> y = quantized_relu(x, bits=8) - >>> print(y) # [0.0, 0.0, ~1.0, ~2.0] - - OPTIMIZATION BENEFITS: - - ReLU in integer domain is just max(0, x) - - No floating-point operations during activation - - Maintains quantization format for subsequent operations - """ - ### BEGIN SOLUTION - # Calculate quantization scale - max_val = 2**(bits-1) - 1 # e.g., 127 for 8-bit - scale = np.max(np.abs(x)) / max_val if np.max(np.abs(x)) > 0 else 1.0 - - # Quantize input - if bits == 8: - dtype = np.int8 - min_val, max_val = -128, 127 - elif bits == 16: - dtype = np.int16 - min_val, max_val = -32768, 32767 - else: - raise ValueError(f"Unsupported quantization: {bits} bits") - - x_quantized = np.round(x / scale).clip(min_val, max_val).astype(dtype) - - # Apply ReLU in quantized domain - relu_quantized = 
np.maximum(0, x_quantized) - - # Dequantize result - result = relu_quantized.astype(np.float32) * scale - - return result - ### END SOLUTION - -# PASS IMPLEMENTATION CHECKPOINT: Quantization kernels complete - -# MAGNIFY SYSTEMS INSIGHT: Quantization Analysis -def analyze_quantization_impact(): - """Analyze the impact of quantization on accuracy and performance.""" - try: - # Test matrices - A = np.random.randn(128, 64).astype(np.float32) * 10 # Scale for visible quantization - B = np.random.randn(64, 96).astype(np.float32) * 10 - x = np.random.randn(1000).astype(np.float32) * 5 - - # Compare quantized vs full precision - print("Quantization impact analysis:") - print("Operation | Bits | Accuracy (MSE) | Memory | Time") - print("-" * 55) - - # Matrix multiplication analysis - baseline_matmul = matmul_baseline(A, B) - baseline_size = A.nbytes + B.nbytes + baseline_matmul.nbytes - _, baseline_time = time_kernel(matmul_baseline, A, B) - - for bits in [8, 16]: - quant_result = quantized_matmul(A, B, bits=bits) - mse = np.mean((baseline_matmul - quant_result) ** 2) - - # Estimate quantized memory usage - if bits == 8: - quant_size = A.size + B.size + baseline_matmul.size # int8 = 1 byte - else: - quant_size = (A.size + B.size + baseline_matmul.size) * 2 # int16 = 2 bytes - - memory_ratio = quant_size / baseline_size - - _, quant_time = time_kernel(quantized_matmul, A, B, bits) - time_ratio = quant_time / baseline_time - - print(f"MatMul | {bits:4d} | {mse:13.6f} | {memory_ratio:5.2f}x | {time_ratio:5.2f}x") - - # ReLU analysis - baseline_relu = vectorized_relu(x) - _, baseline_relu_time = time_kernel(vectorized_relu, x) - - for bits in [8, 16]: - quant_relu = quantized_relu(x, bits=bits) - mse_relu = np.mean((baseline_relu - quant_relu) ** 2) - - _, quant_relu_time = time_kernel(quantized_relu, x, bits) - time_ratio_relu = quant_relu_time / baseline_relu_time - - print(f"ReLU | {bits:4d} | {mse_relu:13.6f} | {0.25:5.2f}x | {time_ratio_relu:5.2f}x") - - 
print(f"\nBaseline performance:") - print(f" MatMul: {baseline_time:.1f}μs, {baseline_size/1024:.1f}KB") - print(f" ReLU: {baseline_relu_time:.1f}μs, {x.nbytes/1024:.1f}KB") - - # TIP WHY THIS MATTERS: Quantization trades accuracy for memory and speed. - # 8-bit quantization: 4x memory reduction, variable performance impact - # Critical for edge deployment where memory is constrained - # Modern ML accelerators (TPUs, mobile chips) heavily use quantization - - return { - 'matmul_accuracy_8bit': np.mean((baseline_matmul - quantized_matmul(A, B, 8)) ** 2), - 'memory_reduction': baseline_size / (A.size + B.size + baseline_matmul.size), # Approximate: float32 bytes vs int8 (1 byte/element), ~4x - 'deployment_ready': True - } - - except Exception as e: - print(f"WARNING Quantization analysis error: {e}") - return None - -# Run the analysis -quantization_analysis = analyze_quantization_impact() - -# %% [markdown] -""" -### TEST Unit Test: Quantization Kernels -This test validates quantization implementations for correctness and efficiency trade-offs -""" - -# %% -def test_unit_quantization_kernels(): - """Test quantization kernel implementations.""" - print("TEST Unit Test: Quantization Kernels") - - # Test 1: Quantized matrix multiplication correctness - A = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32) - B = np.array([[0.5, 1.5], [2.5, 3.5]], dtype=np.float32) - - result_quant = quantized_matmul(A, B, bits=8) - result_baseline = matmul_baseline(A, B) - - # Should be approximately correct (quantization introduces error) - relative_error = np.mean(np.abs(result_quant - result_baseline) / np.abs(result_baseline + 1e-8)) - assert relative_error < 0.1, f"Quantization error too high: {relative_error:.3f}" - print(f"PASS MatMul quantization: relative error {relative_error:.3f}") - - # Test 2: Quantized ReLU correctness - x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0], dtype=np.float32) - - result_quant_relu = quantized_relu(x, bits=8) - result_baseline_relu = vectorized_relu(x) - - # Check that negative values become zero and positive values 
remain positive - assert np.all(result_quant_relu >= 0), "Quantized ReLU should be non-negative" - assert np.allclose(result_quant_relu[x <= 0], 0, atol=0.1), "Negative inputs should become zero" - print("PASS ReLU quantization: maintains ReLU properties") - - # Test 3: Different bit depths - for bits in [8, 16]: - result_8bit = quantized_matmul(A, B, bits=bits) - assert result_8bit.shape == result_baseline.shape, f"{bits}-bit result shape should match" - - result_relu_bits = quantized_relu(x, bits=bits) - assert result_relu_bits.shape == x.shape, f"{bits}-bit ReLU shape should match" - - print("PASS Bit depths: 8-bit and 16-bit quantization work correctly") - - # Test 4: Performance characteristics - large_A = np.random.randn(64, 64).astype(np.float32) - large_B = np.random.randn(64, 64).astype(np.float32) - - _, baseline_time = time_kernel(matmul_baseline, large_A, large_B) - _, quant_time = time_kernel(quantized_matmul, large_A, large_B, 8) - - print(f"PASS Performance: Baseline={baseline_time:.1f}μs, Quantized={quant_time:.1f}μs") - -# Run the test -test_unit_quantization_kernels() - -# %% [markdown] -""" -## Advanced Systems Analysis Framework - -Now you'll implement the Progressive Analysis Framework at the **Advanced Level**. - -At this level, you design comprehensive analyses from scratch - no scaffolding provided. -""" - -# %% [markdown] -""" -### TARGET ADVANCED ANALYSIS CHALLENGE: Comprehensive Kernel Optimization Analysis - -**CHALLENGE**: Design and implement a complete kernel optimization analysis system that: - -1. **Performance Profiling**: Measures execution time, throughput, and resource utilization -2. **Memory Pattern Analysis**: Analyzes cache behavior, memory bandwidth, and access patterns -3. **Optimization Opportunities**: Identifies bottlenecks and recommends improvements -4. **Hardware Adaptation**: Adapts recommendations based on target hardware architecture -5. 
**Production Readiness**: Assesses readiness for deployment in production ML systems - -**YOUR MISSION**: Implement `KernelOptimizationAnalyzer` class with methods for comprehensive analysis. - -**TODO: Design comprehensive kernel optimization analysis from scratch.** - -**DESIGN REQUIREMENTS**: -- Analyze cache efficiency and memory bandwidth utilization -- Identify vectorization opportunities and parallel processing potential -- Measure quantization impact on accuracy vs performance trade-offs -- Generate actionable optimization recommendations for production deployment -- Support analysis across different hardware architectures (CPU, GPU, edge devices) - -**ANALYSIS FRAMEWORK**: -```python -class KernelOptimizationAnalyzer: - def analyze_cache_efficiency(self, kernel_func, data_sizes): - # TODO: Measure cache hit rates and memory access patterns - pass - - def analyze_vectorization_potential(self, operation_sequence): - # TODO: Identify SIMD optimization opportunities - pass - - def analyze_parallel_scaling(self, workload, worker_counts): - # TODO: Measure parallel processing efficiency - pass - - def analyze_quantization_trade_offs(self, precision_levels): - # TODO: Accuracy vs performance analysis - pass - - def generate_optimization_roadmap(self, target_hardware): - # TODO: Prioritized recommendations for production deployment - pass -``` - -**EXPECTED INSIGHTS**: -- Cache miss rates and optimal block sizes -- Vectorization speedup potential and SIMD utilization -- Parallel processing efficiency and scaling bottlenecks -- Quantization accuracy degradation vs memory/speed benefits -- Hardware-specific optimization strategies - -**PRODUCTION FOCUS**: Your analysis should guide real optimization decisions for production ML systems. -""" - -# %% -class KernelOptimizationAnalyzer: - """ - Advanced kernel optimization analysis system for production ML systems. - - TODO: Design comprehensive analysis from scratch. 
- - This class should provide complete optimization analysis including: - - Cache efficiency and memory bandwidth analysis - - Vectorization potential and SIMD utilization assessment - - Parallel processing scaling analysis and bottleneck identification - - Quantization impact analysis for accuracy vs performance trade-offs - - Hardware-specific optimization recommendations for production deployment - - Your implementation should guide real optimization decisions for production ML systems. - """ - - def __init__(self, hardware_config: Optional[Dict[str, Any]] = None): - """ - Initialize the analyzer with hardware configuration. - - TODO: Design initialization strategy that detects or accepts hardware specs. - - Should handle: - - CPU specifications (cores, cache sizes, SIMD capabilities) - - Memory hierarchy (L1/L2/L3 cache, RAM bandwidth) - - GPU specifications (if available) - - Target deployment environment (cloud, edge, mobile) - """ - ### BEGIN SOLUTION - self.hardware_config = hardware_config or self._detect_hardware() - self.analysis_results = {} - self.optimization_recommendations = [] - self.baseline_measurements = {} - - def _detect_hardware(self) -> Dict[str, Any]: - """Detect current hardware configuration.""" - return { - 'cpu_cores': psutil.cpu_count(), - 'memory_gb': psutil.virtual_memory().total // (1024**3), - 'cache_sizes': { - 'l1_data': 32768, # 32KB typical L1 data cache - 'l1_instruction': 32768, # 32KB typical L1 instruction cache - 'l2': 262144, # 256KB typical L2 cache - 'l3': 8388608 # 8MB typical L3 cache - }, - 'cpu_frequency': 2.4, # GHz - would detect actual frequency - 'memory_bandwidth': 25.6, # GB/s - would measure actual bandwidth - 'simd_width': 8, # AVX2 - 8 float32 per instruction - 'gpu_available': False, - 'deployment_target': 'cloud' # vs 'edge' or 'mobile' - } - ### END SOLUTION - - def analyze_cache_efficiency(self, kernel_func: Callable, data_sizes: List[int], - access_patterns: List[str] = None) -> Dict[str, Any]: - """ - 
Analyze cache efficiency and memory access patterns. - - TODO: Design comprehensive cache analysis that measures: - - Cache hit/miss rates for different data sizes - - Memory bandwidth utilization - - Optimal block sizes for cache-friendly algorithms - - Impact of different access patterns (sequential, strided, random) - - Should return actionable insights about memory optimization opportunities. - """ - ### BEGIN SOLUTION - if access_patterns is None: - access_patterns = ['sequential', 'strided', 'random'] - - cache_analysis = { - 'data_sizes_tested': data_sizes, - 'access_patterns': access_patterns, - 'cache_efficiency': {}, - 'bandwidth_utilization': {}, - 'optimal_block_sizes': {}, - 'recommendations': [] - } - - l1_size = self.hardware_config['cache_sizes']['l1_data'] - l2_size = self.hardware_config['cache_sizes']['l2'] - l3_size = self.hardware_config['cache_sizes']['l3'] - - for size in data_sizes: - # Generate test data - test_data = np.random.randn(size, size).astype(np.float32) - data_size_bytes = test_data.nbytes - - # Time the kernel operation - _, execution_time = time_kernel(kernel_func, test_data, test_data) - - # Estimate cache behavior - if data_size_bytes <= l1_size: - cache_level = 'L1' - efficiency = 0.95 - elif data_size_bytes <= l2_size: - cache_level = 'L2' - efficiency = 0.85 - elif data_size_bytes <= l3_size: - cache_level = 'L3' - efficiency = 0.70 - else: - cache_level = 'RAM' - efficiency = 0.30 - - # Calculate bandwidth utilization - bytes_accessed = data_size_bytes * 2 # Read A, B - bandwidth_used = bytes_accessed / (execution_time / 1_000_000) / (1024**3) # GB/s - peak_bandwidth = self.hardware_config['memory_bandwidth'] - bandwidth_util = bandwidth_used / peak_bandwidth - - cache_analysis['cache_efficiency'][size] = { - 'cache_level': cache_level, - 'efficiency_estimate': efficiency, - 'data_size_mb': data_size_bytes / (1024**2), - 'execution_time_us': execution_time - } - - cache_analysis['bandwidth_utilization'][size] = { - 
'bandwidth_gb_s': bandwidth_used, - 'utilization_percent': bandwidth_util * 100, - 'bottleneck': 'memory' if bandwidth_util > 0.8 else 'compute' - } - - # Determine optimal block sizes - for cache_level, cache_size in [('L1', l1_size), ('L2', l2_size)]: - # Optimal block size fits in cache with room for temporaries - optimal_elements = int((cache_size * 0.7) / 4) # 70% of cache, float32 = 4 bytes - optimal_block_size = int(np.sqrt(optimal_elements)) - cache_analysis['optimal_block_sizes'][cache_level] = optimal_block_size - - # Generate recommendations - if any(analysis['bottleneck'] == 'memory' for analysis in cache_analysis['bandwidth_utilization'].values()): - cache_analysis['recommendations'].append("Memory bandwidth limited - consider cache blocking") - - if max(data_sizes)**2 * 4 > l3_size: - cache_analysis['recommendations'].append(f"Large matrices exceed L3 cache - use block size <= {cache_analysis['optimal_block_sizes']['L2']}") - - self.analysis_results['cache_efficiency'] = cache_analysis - return cache_analysis - ### END SOLUTION - - def analyze_vectorization_potential(self, operation_sequence: List[str], - data_shapes: List[Tuple[int, ...]] = None) -> Dict[str, Any]: - """ - Analyze vectorization potential and SIMD optimization opportunities. - - TODO: Design analysis that identifies: - - Operations that can benefit from SIMD vectorization - - Data layout requirements for optimal vectorization - - Expected speedup from vectorization - - Vectorization-friendly algorithm modifications - - Should provide specific recommendations for SIMD optimization. 
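Sketch of the intended speedup aggregation: combine per-operation SIMD estimates with a geometric mean so no single operation dominates the total (the operation names and factors below are illustrative assumptions, not measurements):

```python
# Illustrative sketch: aggregate per-operation SIMD speedup estimates
# with a geometric mean. The factors are assumed values, not measured
# on real hardware.
estimated_speedup = {"add": 7.2, "relu": 6.4, "matmul": 3.0}
ops = list(estimated_speedup)

total = 1.0
for op in ops:
    # each op contributes its estimate raised to 1/len(ops)
    total *= estimated_speedup[op] ** (1.0 / len(ops))

print(f"overall speedup estimate: {total:.2f}x")
```

The geometric mean always lands between the smallest and largest per-op estimate, which is why it is a reasonable pipeline-wide summary.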
- """ - ### BEGIN SOLUTION - if data_shapes is None: - data_shapes = [(1000,), (1000, 1000), (100, 100, 100)] - - vectorization_analysis = { - 'operations_analyzed': operation_sequence, - 'simd_opportunities': {}, - 'data_layout_requirements': {}, - 'speedup_estimates': {}, - 'algorithm_modifications': [], - 'recommendations': [] - } - - simd_width = self.hardware_config['simd_width'] - - # Analyze each operation for vectorization potential - vectorizable_ops = { - 'add': {'potential': 'high', 'speedup': simd_width * 0.9}, - 'multiply': {'potential': 'high', 'speedup': simd_width * 0.9}, - 'relu': {'potential': 'high', 'speedup': simd_width * 0.8}, - 'matmul': {'potential': 'medium', 'speedup': 3.0}, # More complex, less perfect vectorization - 'conv2d': {'potential': 'medium', 'speedup': 4.0}, - 'softmax': {'potential': 'low', 'speedup': 1.5}, # Has sequential dependencies - 'batchnorm': {'potential': 'high', 'speedup': simd_width * 0.7} - } - - for op in operation_sequence: - if op in vectorizable_ops: - vectorization_analysis['simd_opportunities'][op] = vectorizable_ops[op] - else: - vectorization_analysis['simd_opportunities'][op] = { - 'potential': 'unknown', - 'speedup': 1.0 - } - - # Analyze data layout requirements - for i, shape in enumerate(data_shapes): - layout_analysis = { - 'shape': shape, - 'memory_layout': 'contiguous_required', - 'alignment': 'simd_aligned', - 'stride_pattern': 'unit_stride_optimal' - } - - # For multi-dimensional arrays, analyze optimal access patterns - if len(shape) > 1: - layout_analysis['access_pattern'] = 'row_major_optimal' - layout_analysis['vectorization_dimension'] = 'last_dimension' - - vectorization_analysis['data_layout_requirements'][f'shape_{i}'] = layout_analysis - - # Calculate overall speedup potential - total_speedup = 1.0 - for op in operation_sequence: - if op in vectorization_analysis['simd_opportunities']: - speedup = vectorization_analysis['simd_opportunities'][op]['speedup'] - total_speedup *= speedup ** 
(1.0 / len(operation_sequence)) # Geometric mean - - vectorization_analysis['speedup_estimates']['overall'] = total_speedup - vectorization_analysis['speedup_estimates']['best_case'] = max( - vectorization_analysis['simd_opportunities'][op]['speedup'] - for op in operation_sequence - if op in vectorization_analysis['simd_opportunities'] - ) - - # Algorithm modification suggestions - if 'matmul' in operation_sequence: - vectorization_analysis['algorithm_modifications'].append( - "Use BLAS libraries (MKL, OpenBLAS) for vectorized matrix operations" - ) - - if any(op in ['add', 'multiply', 'relu'] for op in operation_sequence): - vectorization_analysis['algorithm_modifications'].append( - "Ensure contiguous memory layout and use NumPy vectorized operations" - ) - - # Generate recommendations - high_potential_ops = [op for op in operation_sequence - if vectorization_analysis['simd_opportunities'].get(op, {}).get('potential') == 'high'] - - if high_potential_ops: - vectorization_analysis['recommendations'].append( - f"High vectorization potential: {', '.join(high_potential_ops)}" - ) - - if total_speedup > 2.0: - vectorization_analysis['recommendations'].append( - f"Significant speedup possible: {total_speedup:.1f}x with full vectorization" - ) - - self.analysis_results['vectorization_potential'] = vectorization_analysis - return vectorization_analysis - ### END SOLUTION - - def analyze_parallel_scaling(self, workload_func: Callable, worker_counts: List[int], - data_sizes: List[int] = None) -> Dict[str, Any]: - """ - Analyze parallel processing efficiency and scaling bottlenecks. - - TODO: Design analysis that measures: - - Parallel processing speedup across different worker counts - - Scaling efficiency and diminishing returns - - Thread overhead and synchronization costs - - Optimal parallelism level for different workload sizes - - Should identify when parallel processing is beneficial vs overhead costs. 
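One way to reason about when parallelism pays off is Amdahl's law; a minimal sketch (the 10% serial fraction is an assumed value for illustration):

```python
# Illustrative sketch: Amdahl's law bounds parallel speedup by the
# serial fraction of the workload. serial_fraction is an assumption.
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)

for w in (1, 2, 4, 8):
    s = amdahl_speedup(0.1, w)  # assume 10% of the work is serial
    print(f"{w} workers: {s:.2f}x speedup, {s / w:.0%} efficiency")
```

Even with only 10% serial work, 8 workers yield well under 8x speedup, which matches the diminishing-returns behavior this analysis should detect.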
- """ - ### BEGIN SOLUTION - if data_sizes is None: - data_sizes = [1000, 10000, 100000] - - parallel_analysis = { - 'worker_counts_tested': worker_counts, - 'data_sizes_tested': data_sizes, - 'scaling_results': {}, - 'efficiency_analysis': {}, - 'overhead_analysis': {}, - 'optimal_parallelism': {}, - 'recommendations': [] - } - - max_cores = self.hardware_config['cpu_cores'] - - for data_size in data_sizes: - test_data = np.random.randn(data_size).astype(np.float32) - size_results = {} - - # Measure performance for different worker counts - baseline_time = None - for workers in worker_counts: - if workers > max_cores: - continue # Skip if more workers than cores - - try: - _, execution_time = time_kernel(workload_func, test_data, workers) - - if baseline_time is None: - baseline_time = execution_time - speedup = 1.0 - efficiency = 1.0 - else: - speedup = baseline_time / execution_time - efficiency = speedup / workers - - size_results[workers] = { - 'execution_time_us': execution_time, - 'speedup': speedup, - 'efficiency': efficiency - } - - except Exception as e: - size_results[workers] = { - 'execution_time_us': None, - 'speedup': 0, - 'efficiency': 0, - 'error': str(e) - } - - parallel_analysis['scaling_results'][data_size] = size_results - - # Analyze scaling efficiency - if size_results: - max_speedup = max(result['speedup'] for result in size_results.values() if result['speedup'] > 0) - best_workers = max(size_results.keys(), key=lambda w: size_results[w]['speedup']) - - parallel_analysis['efficiency_analysis'][data_size] = { - 'max_speedup': max_speedup, - 'best_worker_count': best_workers, - 'scaling_efficiency': max_speedup / best_workers, - 'diminishing_returns_threshold': best_workers - } - - # Estimate overhead - if len(size_results) >= 2: - single_thread_time = size_results.get(1, {}).get('execution_time_us', 0) - two_thread_time = size_results.get(2, {}).get('execution_time_us', single_thread_time) - - if single_thread_time > 0 and two_thread_time > 
0: - theoretical_two_thread = single_thread_time / 2 - overhead_factor = two_thread_time / theoretical_two_thread - - parallel_analysis['overhead_analysis'][data_size] = { - 'overhead_factor': overhead_factor, - 'overhead_percent': (overhead_factor - 1) * 100, - 'worthwhile_threshold': single_thread_time * 10 # 10x overhead minimum - } - - # Determine optimal parallelism - for data_size in data_sizes: - if data_size in parallel_analysis['scaling_results']: - results = parallel_analysis['scaling_results'][data_size] - optimal_workers = max(results.keys(), - key=lambda w: results[w]['speedup'] if results[w]['speedup'] > 0 else 0) - - parallel_analysis['optimal_parallelism'][data_size] = { - 'optimal_workers': optimal_workers, - 'speedup_at_optimal': results[optimal_workers]['speedup'], - 'efficiency_at_optimal': results[optimal_workers]['efficiency'] - } - - # Generate recommendations - avg_efficiency = np.mean([ - analysis['scaling_efficiency'] - for analysis in parallel_analysis['efficiency_analysis'].values() - ]) - - if avg_efficiency > 0.7: - parallel_analysis['recommendations'].append( - "Excellent parallel scaling - parallel processing highly beneficial" - ) - elif avg_efficiency > 0.4: - parallel_analysis['recommendations'].append( - "Good parallel scaling - parallel processing beneficial for large workloads" - ) - else: - parallel_analysis['recommendations'].append( - "Poor parallel scaling - overhead exceeds benefits, avoid parallel processing" - ) - - # Workload size recommendations - small_workloads = [size for size in data_sizes if size < 10000] - if small_workloads and any( - parallel_analysis['overhead_analysis'].get(size, {}).get('overhead_percent', 0) > 50 - for size in small_workloads - ): - parallel_analysis['recommendations'].append( - "Small workloads have high overhead - use sequential processing" - ) - - self.analysis_results['parallel_scaling'] = parallel_analysis - return parallel_analysis - ### END SOLUTION - - def 
analyze_quantization_trade_offs(self, operations: List[Callable], - precision_levels: List[int] = None, - accuracy_threshold: float = 0.01) -> Dict[str, Any]: - """ - Analyze quantization impact on accuracy vs performance trade-offs. - - TODO: Design analysis that measures: - - Accuracy degradation at different quantization levels - - Performance improvement from reduced precision - - Memory usage reduction - - Optimal quantization strategy for production deployment - - Should provide guidance on quantization deployment decisions. - """ - ### BEGIN SOLUTION - if precision_levels is None: - precision_levels = [32, 16, 8] # float32, float16/int16, int8 - - quantization_analysis = { - 'precision_levels_tested': precision_levels, - 'operations_analyzed': [op.__name__ for op in operations], - 'accuracy_analysis': {}, - 'performance_analysis': {}, - 'memory_analysis': {}, - 'deployment_recommendations': {}, - 'recommendations': [] - } - - # Test data - test_sizes = [64, 128, 256] - - for op_func in operations: - op_name = op_func.__name__ - operation_results = {} - - for size in test_sizes: - if 'matmul' in op_name.lower(): - test_data_a = np.random.randn(size, size).astype(np.float32) - test_data_b = np.random.randn(size, size).astype(np.float32) - baseline_result = op_func(test_data_a, test_data_b) - baseline_time = time_kernel(op_func, test_data_a, test_data_b)[1] - baseline_memory = (test_data_a.nbytes + test_data_b.nbytes + baseline_result.nbytes) - else: - test_data = np.random.randn(size, size).astype(np.float32) - baseline_result = op_func(test_data) - baseline_time = time_kernel(op_func, test_data)[1] - baseline_memory = test_data.nbytes + baseline_result.nbytes - - size_results = { - 'baseline': { - 'precision': 32, - 'accuracy_mse': 0.0, - 'execution_time_us': baseline_time, - 'memory_bytes': baseline_memory, - 'relative_performance': 1.0, - 'relative_memory': 1.0 - } - } - - # Test different precision levels - for bits in precision_levels: - if bits == 32: - 
continue # Already have baseline - - try: - if 'matmul' in op_name.lower() and hasattr(op_func, '__name__'): - # Use quantized version if available - if bits in [8, 16]: - quant_result = quantized_matmul(test_data_a, test_data_b, bits=bits) - quant_time = time_kernel(quantized_matmul, test_data_a, test_data_b, bits)[1] - elif 'relu' in op_name.lower(): - if bits in [8, 16]: - quant_result = quantized_relu(test_data, bits=bits) - quant_time = time_kernel(quantized_relu, test_data, bits)[1] - else: - # Simulate quantization effect - max_val = 2**(bits-1) - 1 - scale = np.max(np.abs(baseline_result)) / max_val - quantized = np.round(baseline_result / scale) * scale - quant_result = quantized - quant_time = baseline_time * 0.8 # Assume some speedup - - # Calculate accuracy metrics - mse = np.mean((baseline_result - quant_result) ** 2) - relative_error = mse / (np.mean(baseline_result ** 2) + 1e-8) - - # Estimate memory usage - memory_factor = bits / 32.0 - quant_memory = int(baseline_memory * memory_factor) - - size_results[bits] = { - 'precision': bits, - 'accuracy_mse': mse, - 'relative_error': relative_error, - 'execution_time_us': quant_time, - 'memory_bytes': quant_memory, - 'relative_performance': baseline_time / quant_time, - 'relative_memory': baseline_memory / quant_memory, - 'acceptable_accuracy': relative_error < accuracy_threshold - } - - except Exception as e: - size_results[bits] = { - 'precision': bits, - 'error': str(e), - 'acceptable_accuracy': False - } - - operation_results[size] = size_results - - quantization_analysis['accuracy_analysis'][op_name] = operation_results - - # Aggregate analysis across operations and sizes - for precision in precision_levels: - if precision == 32: - continue - - accuracy_scores = [] - performance_gains = [] - memory_reductions = [] - - for op_name, op_results in quantization_analysis['accuracy_analysis'].items(): - for size, size_results in op_results.items(): - if precision in size_results and 'relative_error' in 
size_results[precision]: - accuracy_scores.append(size_results[precision]['acceptable_accuracy']) - performance_gains.append(size_results[precision]['relative_performance']) - memory_reductions.append(size_results[precision]['relative_memory']) - - if accuracy_scores: - quantization_analysis['deployment_recommendations'][precision] = { - 'accuracy_success_rate': np.mean(accuracy_scores), - 'avg_performance_gain': np.mean(performance_gains), - 'avg_memory_reduction': np.mean(memory_reductions), - 'recommended_for_production': np.mean(accuracy_scores) > 0.8 and np.mean(performance_gains) > 1.1 - } - - # Generate recommendations - for precision, metrics in quantization_analysis['deployment_recommendations'].items(): - if metrics['recommended_for_production']: - quantization_analysis['recommendations'].append( - f"{precision}-bit quantization: {metrics['avg_performance_gain']:.1f}x speedup, " - f"{metrics['avg_memory_reduction']:.1f}x memory reduction, " - f"{metrics['accuracy_success_rate']*100:.0f}% accuracy success rate" - ) - - if not any(metrics['recommended_for_production'] - for metrics in quantization_analysis['deployment_recommendations'].values()): - quantization_analysis['recommendations'].append( - "Quantization not recommended - accuracy degradation exceeds threshold" - ) - - self.analysis_results['quantization_trade_offs'] = quantization_analysis - return quantization_analysis - ### END SOLUTION - - def generate_optimization_roadmap(self, target_hardware: str = 'cloud', - priority_metrics: List[str] = None) -> Dict[str, Any]: - """ - Generate prioritized optimization roadmap for production deployment. - - TODO: Design roadmap generation that synthesizes all analyses into: - - Prioritized optimization opportunities - - Implementation difficulty vs impact assessment - - Hardware-specific recommendations - - Deployment timeline and resource requirements - - Should provide actionable guidance for ML system optimization in production. 
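The prioritization can be sketched as an impact score minus a difficulty penalty (the scores and example opportunities below are illustrative assumptions):

```python
# Illustrative sketch: rank optimization opportunities by impact minus
# an implementation-difficulty penalty. All values are assumed.
impact_score = {"high": 3, "medium": 2, "low": 1}
difficulty_penalty = {"low": 0.0, "medium": 0.5, "high": 1.0}

opportunities = [
    {"type": "quantization", "impact": "high", "difficulty": "high"},
    {"type": "vectorization", "impact": "high", "difficulty": "low"},
    {"type": "parallelization", "impact": "medium", "difficulty": "medium"},
]

ranked = sorted(
    opportunities,
    key=lambda o: impact_score[o["impact"]] - difficulty_penalty[o["difficulty"]],
    reverse=True,
)
print([o["type"] for o in ranked])
```

High-impact, low-difficulty work (vectorization here) sorts first, which is the ordering a production roadmap should surface.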
- """ - ### BEGIN SOLUTION - if priority_metrics is None: - priority_metrics = ['performance', 'memory', 'accuracy'] - - roadmap = { - 'target_hardware': target_hardware, - 'priority_metrics': priority_metrics, - 'optimization_opportunities': [], - 'implementation_plan': {}, - 'resource_requirements': {}, - 'expected_outcomes': {}, - 'recommendations': [] - } - - # Hardware-specific considerations - hardware_profiles = { - 'cloud': { - 'cpu_cores': 16, - 'memory_gb': 64, - 'performance_priority': 'high', - 'cost_sensitivity': 'medium', - 'deployment_complexity': 'low' - }, - 'edge': { - 'cpu_cores': 4, - 'memory_gb': 8, - 'performance_priority': 'medium', - 'cost_sensitivity': 'high', - 'deployment_complexity': 'high' - }, - 'mobile': { - 'cpu_cores': 8, - 'memory_gb': 4, - 'performance_priority': 'medium', - 'cost_sensitivity': 'high', - 'deployment_complexity': 'very_high' - } - } - - target_profile = hardware_profiles.get(target_hardware, hardware_profiles['cloud']) - - # Analyze optimization opportunities from all analyses - opportunities = [] - - # From cache analysis - if 'cache_efficiency' in self.analysis_results: - cache_results = self.analysis_results['cache_efficiency'] - for size, analysis in cache_results['bandwidth_utilization'].items(): - if analysis['bottleneck'] == 'memory': - opportunities.append({ - 'type': 'cache_optimization', - 'impact': 'high', - 'difficulty': 'medium', - 'description': 'Implement cache-friendly blocking algorithms', - 'expected_improvement': '2-4x performance gain', - 'implementation_effort': '2-3 weeks' - }) - break - - # From vectorization analysis - if 'vectorization_potential' in self.analysis_results: - vec_results = self.analysis_results['vectorization_potential'] - overall_speedup = vec_results['speedup_estimates'].get('overall', 1.0) - if overall_speedup > 2.0: - opportunities.append({ - 'type': 'vectorization', - 'impact': 'high', - 'difficulty': 'low', - 'description': 'Implement SIMD vectorization for element-wise 
operations', - 'expected_improvement': f'{overall_speedup:.1f}x performance gain', - 'implementation_effort': '1-2 weeks' - }) - - # From parallel analysis - if 'parallel_scaling' in self.analysis_results: - parallel_results = self.analysis_results['parallel_scaling'] - avg_efficiency = np.mean([ - analysis['scaling_efficiency'] - for analysis in parallel_results['efficiency_analysis'].values() - ]) if parallel_results['efficiency_analysis'] else 0 - - if avg_efficiency > 0.5 and target_profile['cpu_cores'] > 4: - opportunities.append({ - 'type': 'parallelization', - 'impact': 'medium', - 'difficulty': 'medium', - 'description': f'Implement parallel processing for {target_profile["cpu_cores"]} cores', - 'expected_improvement': f'{avg_efficiency * target_profile["cpu_cores"]:.1f}x speedup', - 'implementation_effort': '2-4 weeks' - }) - - # From quantization analysis - if 'quantization_trade_offs' in self.analysis_results: - quant_results = self.analysis_results['quantization_trade_offs'] - for precision, metrics in quant_results['deployment_recommendations'].items(): - if metrics['recommended_for_production']: - impact_level = 'high' if metrics['avg_memory_reduction'] > 2.0 else 'medium' - opportunities.append({ - 'type': 'quantization', - 'impact': impact_level, - 'difficulty': 'high', - 'description': f'Deploy {precision}-bit quantization', - 'expected_improvement': f'{metrics["avg_performance_gain"]:.1f}x speedup, {metrics["avg_memory_reduction"]:.1f}x memory reduction', - 'implementation_effort': '3-6 weeks' - }) - break - - # Sort opportunities by priority - priority_order = {'high': 3, 'medium': 2, 'low': 1} - difficulty_penalty = {'low': 0, 'medium': -0.5, 'high': -1, 'very_high': -2} - - def opportunity_score(opp): - impact_score = priority_order.get(opp['impact'], 1) - difficulty_score = difficulty_penalty.get(opp['difficulty'], 0) - - # Hardware-specific adjustments - if target_hardware == 'mobile' and opp['type'] == 'quantization': - impact_score += 1 # 
Quantization more important for mobile - elif target_hardware == 'cloud' and opp['type'] == 'parallelization': - impact_score += 0.5 # Parallelization more beneficial in cloud - - return impact_score + difficulty_score - - opportunities.sort(key=opportunity_score, reverse=True) - roadmap['optimization_opportunities'] = opportunities[:5] # Top 5 opportunities - - # Create implementation plan - phases = ['Phase 1 (0-1 months)', 'Phase 2 (1-3 months)', 'Phase 3 (3-6 months)'] - current_phase = 0 - - for i, opportunity in enumerate(roadmap['optimization_opportunities']): - if i < 2: - phase = phases[0] - elif i < 4: - phase = phases[1] - else: - phase = phases[2] - - if phase not in roadmap['implementation_plan']: - roadmap['implementation_plan'][phase] = [] - - roadmap['implementation_plan'][phase].append({ - 'optimization': opportunity['type'], - 'description': opportunity['description'], - 'effort': opportunity['implementation_effort'] - }) - - # Resource requirements - roadmap['resource_requirements'] = { - 'engineering_time': '3-6 months for full implementation', - 'hardware_requirements': f"Target: {target_hardware} with {target_profile['cpu_cores']} cores, {target_profile['memory_gb']}GB RAM", - 'testing_infrastructure': 'Performance testing and regression testing framework', - 'deployment_complexity': target_profile['deployment_complexity'] - } - - # Expected outcomes - total_performance_gain = 1.0 - total_memory_reduction = 1.0 - - for opp in roadmap['optimization_opportunities']: - # Extract numerical improvements (simplified) - if 'x performance gain' in opp['expected_improvement']: - try: - gain = float(opp['expected_improvement'].split('x')[0]) - total_performance_gain *= gain ** 0.5 # Assume some compounding - except: - pass - - if 'x memory reduction' in opp['expected_improvement']: - try: - reduction = float(opp['expected_improvement'].split('x memory reduction')[0].split()[-1]) - total_memory_reduction *= reduction ** 0.5 - except: - pass - - 
roadmap['expected_outcomes'] = { - 'performance_improvement': f'{total_performance_gain:.1f}x overall speedup', - 'memory_efficiency': f'{total_memory_reduction:.1f}x memory reduction', - 'deployment_readiness': 'Production-ready optimized kernels', - 'maintenance_overhead': 'Low (well-structured optimization patterns)' - } - - # Generate final recommendations - roadmap['recommendations'] = [ - f"Prioritize {roadmap['optimization_opportunities'][0]['type']} optimization first (highest impact)", - f"Target hardware ({target_hardware}) well-suited for planned optimizations", - f"Expected overall improvement: {total_performance_gain:.1f}x performance, {total_memory_reduction:.1f}x memory efficiency", - "Implement comprehensive performance testing before production deployment" - ] - - if target_hardware in ['edge', 'mobile']: - roadmap['recommendations'].append( - "Quantization critical for resource-constrained deployment" - ) - - self.analysis_results['optimization_roadmap'] = roadmap - return roadmap - ### END SOLUTION - -# PASS IMPLEMENTATION CHECKPOINT: Advanced optimization analyzer complete - -# THINK PREDICTION: What will be the most impactful optimization for matrix operations? -# Your guess: _______ - -# MAGNIFY SYSTEMS INSIGHT: Comprehensive Kernel Optimization Analysis -def comprehensive_kernel_analysis(): - """Run complete kernel optimization analysis using the advanced analyzer.""" - try: - print("ROCKET Comprehensive Kernel Optimization Analysis") - print("=" * 60) - - # Initialize analyzer - analyzer = KernelOptimizationAnalyzer() - - # 1. 
Cache efficiency analysis - print("\n📊 Cache Efficiency Analysis:") - cache_results = analyzer.analyze_cache_efficiency( - matmul_baseline, - data_sizes=[64, 128, 256, 512], - access_patterns=['sequential', 'strided'] - ) - - for size, analysis in cache_results['cache_efficiency'].items(): - print(f" Size {size:3d}: {analysis['cache_level']} cache, {analysis['efficiency_estimate']:.1%} efficiency") - - print(f" Recommendations: {'; '.join(cache_results['recommendations'])}") - - # 2. Vectorization potential analysis - print("\nROCKET Vectorization Potential Analysis:") - vec_results = analyzer.analyze_vectorization_potential( - ['matmul', 'relu', 'add', 'multiply'], - [(1000,), (1000, 1000)] - ) - - for op, potential in vec_results['simd_opportunities'].items(): - print(f" {op}: {potential['potential']} potential, {potential['speedup']:.1f}x speedup") - - print(f" Overall speedup estimate: {vec_results['speedup_estimates']['overall']:.1f}x") - - # 3. Parallel scaling analysis - print("\n🔀 Parallel Scaling Analysis:") - parallel_results = analyzer.analyze_parallel_scaling( - parallel_relu, - worker_counts=[1, 2, 4, 8], - data_sizes=[10000, 50000] - ) - - for size, analysis in parallel_results['efficiency_analysis'].items(): - print(f" Size {size:5d}: {analysis['max_speedup']:.1f}x max speedup, {analysis['scaling_efficiency']:.1%} efficiency") - - # 4. Quantization trade-offs analysis - print("\n🗜️ Quantization Trade-offs Analysis:") - quant_results = analyzer.analyze_quantization_trade_offs( - [matmul_baseline, vectorized_relu], - precision_levels=[32, 16, 8] - ) - - for precision, metrics in quant_results['deployment_recommendations'].items(): - if metrics['recommended_for_production']: - print(f" {precision}-bit: {metrics['avg_performance_gain']:.1f}x speedup, " - f"{metrics['avg_memory_reduction']:.1f}x memory reduction, " - f"{metrics['accuracy_success_rate']:.0%} accuracy success") - - # 5. 
Generate optimization roadmap - print("\n🗺️ Optimization Roadmap:") - roadmap = analyzer.generate_optimization_roadmap( - target_hardware='cloud', - priority_metrics=['performance', 'memory'] - ) - - print(f" Target: {roadmap['target_hardware']} deployment") - print(f" Expected outcomes: {roadmap['expected_outcomes']['performance_improvement']}, " - f"{roadmap['expected_outcomes']['memory_efficiency']}") - - print("\n Top optimization opportunities:") - for i, opp in enumerate(roadmap['optimization_opportunities'][:3], 1): - print(f" {i}. {opp['type']}: {opp['description']}") - print(f" Impact: {opp['impact']}, Effort: {opp['implementation_effort']}") - - print("\n Key recommendations:") - for rec in roadmap['recommendations'][:3]: - print(f" • {rec}") - - # TIP WHY THIS MATTERS: Comprehensive analysis guides optimization decisions: - # 1. Cache analysis reveals memory bottlenecks and optimal algorithms - # 2. Vectorization analysis shows where SIMD can provide biggest gains - # 3. Parallel analysis identifies when threading helps vs hurts - # 4. Quantization analysis balances accuracy vs deployment efficiency - # 5. 
Roadmap prioritizes efforts for maximum production impact - - return { - 'cache_analysis': cache_results, - 'vectorization_analysis': vec_results, - 'parallel_analysis': parallel_results, - 'quantization_analysis': quant_results, - 'optimization_roadmap': roadmap - } - - except Exception as e: - print(f"⚠️ Comprehensive analysis error: {e}") - return None - -# Run the comprehensive analysis -comprehensive_analysis = comprehensive_kernel_analysis() - -# %% [markdown] -""" -### 🧪 Unit Test: Advanced Optimization Analyzer -This test validates the comprehensive kernel optimization analyzer -""" - -# %% -def test_unit_advanced_optimization_analyzer(): - """Test the advanced kernel optimization analyzer.""" - print("🧪 Unit Test: Advanced Optimization Analyzer") - - # Test 1: Analyzer initialization - analyzer = KernelOptimizationAnalyzer() - - assert hasattr(analyzer, 'hardware_config'), "Analyzer should have hardware config" - assert analyzer.hardware_config['cpu_cores'] > 0, "Should detect CPU cores" - print("✅ Initialization: Hardware configuration detected") - - # Test 2: Cache efficiency analysis - cache_results = analyzer.analyze_cache_efficiency(matmul_baseline, [64, 128]) - - assert 'cache_efficiency' in cache_results, "Should return cache efficiency results" - assert 'bandwidth_utilization' in cache_results, "Should analyze bandwidth utilization" - assert 'recommendations' in cache_results, "Should provide recommendations" - print("✅ Cache analysis: Complete analysis with recommendations") - - # Test 3: Vectorization potential analysis - vec_results = analyzer.analyze_vectorization_potential(['relu', 'add']) - - assert 'simd_opportunities' in vec_results, "Should identify SIMD opportunities" - assert 'speedup_estimates' in vec_results, "Should estimate speedup potential" - print("✅ Vectorization analysis: SIMD opportunities identified") - - # Test 4: Parallel scaling analysis - parallel_results = 
analyzer.analyze_parallel_scaling(parallel_relu, [1, 2, 4]) - - assert 'scaling_results' in parallel_results, "Should provide scaling results" - assert 'efficiency_analysis' in parallel_results, "Should analyze efficiency" - print("✅ Parallel analysis: Scaling efficiency measured") - - # Test 5: Quantization analysis - quant_results = analyzer.analyze_quantization_trade_offs([vectorized_relu]) - - assert 'deployment_recommendations' in quant_results, "Should provide deployment recommendations" - assert 'accuracy_analysis' in quant_results, "Should analyze accuracy impact" - print("✅ Quantization analysis: Trade-offs evaluated") - - # Test 6: Optimization roadmap - roadmap = analyzer.generate_optimization_roadmap('cloud') - - assert 'optimization_opportunities' in roadmap, "Should identify opportunities" - assert 'implementation_plan' in roadmap, "Should provide implementation plan" - assert 'expected_outcomes' in roadmap, "Should estimate outcomes" - assert 'recommendations' in roadmap, "Should give actionable recommendations" - print("✅ Roadmap generation: Comprehensive optimization plan created") - - # Test 7: Integration across analyses - assert len(analyzer.analysis_results) >= 4, "Should store all analysis results" - print("✅ Integration: All analyses stored and accessible") - -# Run the test -test_unit_advanced_optimization_analyzer() - -# %% [markdown] -""" -## Integration - Bringing High-Performance Kernels Together - -### Kernel Composition and Performance Pipeline -""" - -# %% -def test_unit_all(): - """Run comprehensive kernel module validation.""" - print("🧪 Running all kernel unit tests...") - - # Core infrastructure tests - test_unit_timing_infrastructure() - print() - - # Matrix operation tests - test_unit_cache_friendly_matmul() - print() - - # Vectorization tests - test_unit_vectorized_operations() - print() - - # Parallel processing tests - test_unit_parallel_processing() - print() - - # Quantization tests - 
test_unit_quantization_kernels() - print() - - # Advanced analyzer tests - test_unit_advanced_optimization_analyzer() - print() - - print("✅ All kernel unit tests passed! High-performance kernels ready for deployment.") - -# %% [markdown] -""" -## Production Context - Real-World Kernel Usage - -### How Production ML Systems Use Optimized Kernels - -Modern ML frameworks achieve their performance through sophisticated kernel optimization: - -**PyTorch Kernel Architecture:** -```python -# High-level PyTorch operation -result = torch.matmul(A, B) - -# Dispatches to optimized kernels based on: -# - Hardware: CPU (Intel MKL) vs GPU (cuBLAS/cuDNN) -# - Data type: float32, float16, bfloat16, int8 -# - Tensor properties: size, stride, memory layout -# - Available optimizations: Tensor Cores, quantization -``` - -**Performance Optimization Stack:** -``` -Application Level: model(input) -Framework Level: torch.matmul(A, B) -Dispatcher Level: select_optimal_kernel(A, B, device) -Kernel Level: optimized_matmul_cuda/cpu(A, B) -Hardware Level: CUDA cores, Tensor cores, SIMD units -``` - -**Real-World Impact:** -- **Training Acceleration**: Optimized kernels enable training larger models in reasonable time -- **Inference Speed**: Fast kernels reduce serving latency and costs -- **Edge Deployment**: Quantized kernels enable deployment on mobile/IoT devices -- **Energy Efficiency**: Efficient kernels reduce data center power consumption - -### Framework Integration Patterns - -**Automatic Kernel Selection:** -```python -# Framework chooses optimal implementation -if tensor.is_cuda and tensor.dtype == torch.float16: - return tensor_core_matmul(A, B) -elif tensor.is_cpu and has_avx512(): - return vectorized_cpu_matmul(A, B) -else: - return fallback_matmul(A, B) -``` - -**Performance Profiling Integration:** -```python -# Built-in profiling like our analyzer -with torch.profiler.profile() as prof: - result = model(input) - -# Reveals which kernels are bottlenecks 
-prof.export_chrome_trace("trace.json") -``` -""" - -# %% -if __name__ == "__main__": - test_unit_all() - -# %% [markdown] -""" -## 🤔 ML Systems Thinking: Interactive Questions - -Now that you've implemented high-performance computational kernels, let's explore the systems implications through hands-on analysis. -""" - -# %% [markdown] -""" -### Question 1: Cache Hierarchy Optimization Analysis - -**Context**: Your `cache_friendly_matmul` function uses blocking to improve cache locality. You measured different block sizes and saw varying performance characteristics. - -**Reflection Question**: Analyze the cache behavior patterns in your implementation. When you tested block sizes of 32, 64, and 128, how did performance scale with memory hierarchy levels (L1/L2/L3 cache)? Design an adaptive blocking strategy that automatically selects optimal block sizes based on runtime cache analysis. How would you extend your approach to handle matrices that don't fit entirely in any cache level? - -**Think about**: -- Cache line sizes and prefetching behavior -- Multi-level cache optimization strategies -- Memory bandwidth vs cache capacity trade-offs -- Production deployment across different CPU architectures -""" - -# %% [markdown] -""" -### Question 2: Vectorization and Parallelization Interaction Analysis - -**Context**: You implemented both SIMD vectorization (`vectorized_relu`) and multi-threading parallelization (`parallel_relu`). Your performance analysis showed different scaling characteristics. - -**Reflection Question**: Examine the interaction between vectorization and parallelization in your implementations. How does SIMD vectorization within each thread affect the optimal number of worker threads? Analyze the memory bandwidth contention when multiple threads are performing vectorized operations simultaneously. Design a hybrid optimization strategy that balances SIMD width, thread count, and memory bandwidth for maximum throughput. 
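One way to begin exploring this hybrid strategy is to cap the thread count by a minimum per-thread chunk size, and let NumPy's vectorized ufuncs supply the SIMD work inside each chunk. The sketch below is illustrative, not the module's `parallel_relu`; the `min_chunk` threshold is an assumption that would need empirical tuning per machine.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def hybrid_relu(x: np.ndarray, max_workers: int = 4,
                min_chunk: int = 100_000) -> np.ndarray:
    """ReLU combining threading with NumPy's SIMD-backed ufuncs.

    The worker count is capped so every chunk keeps at least
    `min_chunk` elements; below that, thread startup cost and
    memory-bandwidth contention outweigh any parallel gain.
    """
    flat = x.ravel()
    workers = max(1, min(max_workers, flat.size // min_chunk))
    if workers == 1:
        return np.maximum(x, 0.0)  # single-thread vectorized path
    out = np.empty_like(flat)
    bounds = np.linspace(0, flat.size, workers + 1, dtype=int)
    def run(i: int) -> None:
        lo, hi = bounds[i], bounds[i + 1]
        # NumPy ufuncs release the GIL, so these threads genuinely overlap
        out[lo:hi] = np.maximum(flat[lo:hi], 0.0)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(run, range(workers)))
    return out.reshape(x.shape)
```

Because elementwise ReLU is memory-bandwidth bound, the speedup of a sketch like this typically saturates well below the core count — exactly the interaction this question asks you to analyze.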
- -**Think about**: -- Memory bandwidth limitations with multiple vectorized threads -- NUMA topology effects on parallel vectorized operations -- Thread affinity and cache sharing between cores -- Optimal work distribution strategies for vectorized workloads -""" - -# %% [markdown] -""" -### Question 3: Production Deployment Optimization Strategy - -**Context**: Your `KernelOptimizationAnalyzer` generated a comprehensive optimization roadmap with prioritized improvements for production deployment. - -**Reflection Question**: Based on your optimization analysis results, design a production deployment strategy for a real-time ML inference service. How would you adapt your kernel optimizations for different deployment scenarios: cloud instances with 32+ cores, edge devices with 4 cores and limited memory, and mobile devices with thermal constraints? Create a decision framework that automatically selects optimal kernel implementations based on runtime hardware detection and performance requirements. - -**Think about**: -- Runtime performance monitoring and adaptation -- Thermal management and performance throttling -- Memory pressure and kernel selection strategies -- Fallback mechanisms for unsupported optimizations -- Continuous performance optimization in production -""" - -# %% [markdown] -""" -## 🎯 MODULE SUMMARY: Kernels - -Congratulations! You've successfully implemented high-performance computational kernels that power modern ML systems! 
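As a worked illustration of the adaptive blocking idea from Question 1, the block size can be chosen as the largest candidate whose working set (three b×b tiles: one each from A, B, and the C accumulator) fits an assumed cache budget. The helper names and the 256 KB budget below are illustrative assumptions, not part of this module's `cache_friendly_matmul`; a runtime version would substitute measured L1/L2 sizes.

```python
import numpy as np

def pick_block_size(cache_bytes: int = 256 * 1024, itemsize: int = 8,
                    candidates=(32, 64, 128, 256)) -> int:
    """Largest candidate b such that three b*b tiles fit the cache budget."""
    fitting = [b for b in candidates if 3 * b * b * itemsize <= cache_bytes]
    return max(fitting) if fitting else min(candidates)

def blocked_matmul(A, B, block=None):
    """Cache-blocked matmul; ragged edge tiles are handled by slicing."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    b = block or pick_block_size(itemsize=A.dtype.itemsize)
    C = np.zeros((n, m), dtype=np.result_type(A, B))
    for i in range(0, n, b):
        for p in range(0, k, b):      # keeping the p-loop outside j reuses the A tile
            A_tile = A[i:i + b, p:p + b]
            for j in range(0, m, b):
                C[i:i + b, j:j + b] += A_tile @ B[p:p + b, j:j + b]
    return C
```

With the assumed budget and float64 tiles, `pick_block_size()` selects 64 (3·64²·8 ≈ 96 KB fits, while 3·128²·8 ≈ 384 KB does not), matching the block-size range the question asks you to benchmark.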
- -### What You've Accomplished -✅ **High-Performance Implementation**: 200+ lines of optimized kernel code with cache blocking, vectorization, and parallelization -✅ **Advanced Optimization Analysis**: Comprehensive `KernelOptimizationAnalyzer` with multi-dimensional performance evaluation -✅ **Production-Ready Kernels**: Matrix multiplication, activation functions, and quantization kernels optimized for real-world deployment -✅ **Systems Integration**: Complete optimization pipeline from profiling through deployment recommendations -✅ **Performance Engineering**: Deep understanding of cache hierarchy, SIMD vectorization, and parallel processing trade-offs - -### Key Learning Outcomes -- **Cache Optimization**: Implementing cache-friendly algorithms that minimize memory access latency -- **Vectorization Mastery**: Leveraging SIMD instructions for 4-16x performance improvements -- **Parallel Processing**: Understanding when parallelization helps vs creates overhead -- **Quantization Engineering**: Balancing accuracy vs performance for efficient deployment -- **Production Optimization**: Systematic approach to kernel optimization for real-world ML systems - -### Mathematical Foundations Mastered -- **Cache-Friendly Algorithms**: O(n³/B) cache complexity through blocking techniques -- **SIMD Vectorization**: Processing 4-16 elements simultaneously with vector instructions -- **Parallel Scaling**: Amdahl's law and parallel efficiency analysis across worker counts -- **Quantization Mathematics**: Precision reduction with controlled accuracy degradation - -### Professional Skills Developed -- **Performance Engineering**: Systematic optimization methodology from profiling to deployment -- **Systems Architecture**: Understanding hardware-software interface for ML acceleration -- **Production Deployment**: Optimization strategies for cloud, edge, and mobile environments -- **Kernel Development**: Building high-performance computational primitives that power 
ML frameworks - -### Ready for Advanced Applications -Your kernel implementations now enable: -- **Real-Time Inference**: Optimized kernels for low-latency ML serving -- **Large-Scale Training**: High-performance operations for training large models -- **Edge Deployment**: Memory-efficient kernels for resource-constrained devices -- **Framework Development**: Understanding how PyTorch and TensorFlow achieve high performance - -### Connection to Real ML Systems -Your implementation mirrors production systems: -- **PyTorch**: ATen library with CUDA kernels, Intel MKL integration, and automatic kernel selection -- **TensorFlow**: XLA compiler with hardware-specific optimizations and kernel fusion -- **Industry Practice**: Cache blocking, vectorization, and quantization are fundamental to all modern ML frameworks - -### Next Steps -1. **Export your module**: `tito module complete 13_kernels` -2. **Validate integration**: `tito test --module kernels` -3. **Explore advanced optimizations**: GPU kernels, custom CUDA implementations -4. **Ready for Module 14**: Performance analysis and benchmarking systems - -**Performance Engineering Mastery**: Your high-performance kernel implementations demonstrate deep understanding of how to optimize ML operations for production deployment - the foundation for building scalable ML infrastructure! -""" \ No newline at end of file diff --git a/tinytorch/_modidx.py b/tinytorch/_modidx.py index 7387a54f..1fa94241 100644 --- a/tinytorch/_modidx.py +++ b/tinytorch/_modidx.py @@ -1,3 +1,19 @@ +# ╔═══════════════════════════════════════════════════════════════════════════════╗ +# ║ 🚨 CRITICAL WARNING 🚨 ║ +# ║ AUTOGENERATED! DO NOT EDIT! ║ +# ║ ║ +# ║ This file is AUTOMATICALLY GENERATED from source modules. ║ +# ║ ANY CHANGES MADE HERE WILL BE LOST when modules are re-exported! 
║ +# ║ ║ +# ║ ✅ TO EDIT: modules/source/[unknown]/[unknown]_dev.py ║ +# ║ ✅ TO EXPORT: Run 'tito module complete ' ║ +# ║ ║ +# ║ 🛡️ STUDENT PROTECTION: This file contains optimized implementations. ║ +# ║ Editing it directly may break module functionality and training. ║ +# ║ ║ +# ║ 🎓 LEARNING TIP: Work in modules/source/ - that's where real development ║ +# ║ happens! The tinytorch/ directory is just the compiled output. ║ +# ╚═══════════════════════════════════════════════════════════════════════════════╝ # Autogenerated by nbdev d = { 'settings': { 'branch': 'main', @@ -5,42 +21,36 @@ d = { 'settings': { 'branch': 'main', 'doc_host': 'https://tinytorch.github.io', 'git_url': 'https://github.com/tinytorch/TinyTorch/', 'lib_path': 'tinytorch'}, - 'syms': { 'tinytorch.core.activations': { 'tinytorch.core.activations.ActivationProfiler': ( '03_activations/activations_dev.html#activationprofiler', - 'tinytorch/core/activations.py'), - 'tinytorch.core.activations.ActivationProfiler.__init__': ( '03_activations/activations_dev.html#activationprofiler.__init__', - 'tinytorch/core/activations.py'), - 'tinytorch.core.activations.ActivationProfiler.analyze_scaling': ( '03_activations/activations_dev.html#activationprofiler.analyze_scaling', - 'tinytorch/core/activations.py'), - 'tinytorch.core.activations.ActivationProfiler.compare_activations': ( '03_activations/activations_dev.html#activationprofiler.compare_activations', - 'tinytorch/core/activations.py'), - 'tinytorch.core.activations.ActivationProfiler.time_activation': ( '03_activations/activations_dev.html#activationprofiler.time_activation', - 'tinytorch/core/activations.py'), - 'tinytorch.core.activations.ReLU': ( '03_activations/activations_dev.html#relu', + 'syms': { 'tinytorch.core.activations': { 'tinytorch.core.activations.GELU': ( '02_activations/activations_dev.html#gelu', 'tinytorch/core/activations.py'), - 'tinytorch.core.activations.ReLU.__call__': ( '03_activations/activations_dev.html#relu.__call__', + 
'tinytorch.core.activations.GELU.backward': ( '02_activations/activations_dev.html#gelu.backward', 'tinytorch/core/activations.py'), - 'tinytorch.core.activations.ReLU.forward': ( '03_activations/activations_dev.html#relu.forward', + 'tinytorch.core.activations.GELU.forward': ( '02_activations/activations_dev.html#gelu.forward', 'tinytorch/core/activations.py'), - 'tinytorch.core.activations.Sigmoid': ( '03_activations/activations_dev.html#sigmoid', - 'tinytorch/core/activations.py'), - 'tinytorch.core.activations.Sigmoid.__call__': ( '03_activations/activations_dev.html#sigmoid.__call__', - 'tinytorch/core/activations.py'), - 'tinytorch.core.activations.Sigmoid.forward': ( '03_activations/activations_dev.html#sigmoid.forward', - 'tinytorch/core/activations.py'), - 'tinytorch.core.activations.Softmax': ( '03_activations/activations_dev.html#softmax', - 'tinytorch/core/activations.py'), - 'tinytorch.core.activations.Softmax.__call__': ( '03_activations/activations_dev.html#softmax.__call__', - 'tinytorch/core/activations.py'), - 'tinytorch.core.activations.Softmax.forward': ( '03_activations/activations_dev.html#softmax.forward', - 'tinytorch/core/activations.py'), - 'tinytorch.core.activations.Tanh': ( '03_activations/activations_dev.html#tanh', + 'tinytorch.core.activations.ReLU': ( '02_activations/activations_dev.html#relu', 'tinytorch/core/activations.py'), - 'tinytorch.core.activations.Tanh.__call__': ( '03_activations/activations_dev.html#tanh.__call__', + 'tinytorch.core.activations.ReLU.backward': ( '02_activations/activations_dev.html#relu.backward', 'tinytorch/core/activations.py'), - 'tinytorch.core.activations.Tanh.forward': ( '03_activations/activations_dev.html#tanh.forward', + 'tinytorch.core.activations.ReLU.forward': ( '02_activations/activations_dev.html#relu.forward', 'tinytorch/core/activations.py'), - 'tinytorch.core.activations.benchmark_activation_suite': ( '03_activations/activations_dev.html#benchmark_activation_suite', - 
'tinytorch/core/activations.py')}, + 'tinytorch.core.activations.Sigmoid': ( '02_activations/activations_dev.html#sigmoid', + 'tinytorch/core/activations.py'), + 'tinytorch.core.activations.Sigmoid.backward': ( '02_activations/activations_dev.html#sigmoid.backward', + 'tinytorch/core/activations.py'), + 'tinytorch.core.activations.Sigmoid.forward': ( '02_activations/activations_dev.html#sigmoid.forward', + 'tinytorch/core/activations.py'), + 'tinytorch.core.activations.Softmax': ( '02_activations/activations_dev.html#softmax', + 'tinytorch/core/activations.py'), + 'tinytorch.core.activations.Softmax.backward': ( '02_activations/activations_dev.html#softmax.backward', + 'tinytorch/core/activations.py'), + 'tinytorch.core.activations.Softmax.forward': ( '02_activations/activations_dev.html#softmax.forward', + 'tinytorch/core/activations.py'), + 'tinytorch.core.activations.Tanh': ( '02_activations/activations_dev.html#tanh', + 'tinytorch/core/activations.py'), + 'tinytorch.core.activations.Tanh.backward': ( '02_activations/activations_dev.html#tanh.backward', + 'tinytorch/core/activations.py'), + 'tinytorch.core.activations.Tanh.forward': ( '02_activations/activations_dev.html#tanh.forward', + 'tinytorch/core/activations.py')}, 'tinytorch.core.attention': { 'tinytorch.core.attention.AttentionEfficiencyProfiler': ( '12_attention/attention_dev.html#attentionefficiencyprofiler', 'tinytorch/core/attention.py'), 'tinytorch.core.attention.AttentionEfficiencyProfiler.__init__': ( '12_attention/attention_dev.html#attentionefficiencyprofiler.__init__', @@ -70,6 +80,78 @@ d = { 'settings': { 'branch': 'main', 'tinytorch.core.attention.scaled_dot_product_attention': ( '12_attention/attention_dev.html#scaled_dot_product_attention', 'tinytorch/core/attention.py')}, 'tinytorch.core.autograd': {}, + 'tinytorch.core.benchmarking': { 'tinytorch.core.benchmarking.BenchmarkResult': ( 'temp_holding/14_benchmarking/benchmarking_dev.html#benchmarkresult', + 
'tinytorch/core/benchmarking.py'), + 'tinytorch.core.benchmarking.BenchmarkScenario': ( 'temp_holding/14_benchmarking/benchmarking_dev.html#benchmarkscenario', + 'tinytorch/core/benchmarking.py'), + 'tinytorch.core.benchmarking.BenchmarkScenarios': ( 'temp_holding/14_benchmarking/benchmarking_dev.html#benchmarkscenarios', + 'tinytorch/core/benchmarking.py'), + 'tinytorch.core.benchmarking.BenchmarkScenarios.__init__': ( 'temp_holding/14_benchmarking/benchmarking_dev.html#benchmarkscenarios.__init__', + 'tinytorch/core/benchmarking.py'), + 'tinytorch.core.benchmarking.BenchmarkScenarios.offline': ( 'temp_holding/14_benchmarking/benchmarking_dev.html#benchmarkscenarios.offline', + 'tinytorch/core/benchmarking.py'), + 'tinytorch.core.benchmarking.BenchmarkScenarios.server': ( 'temp_holding/14_benchmarking/benchmarking_dev.html#benchmarkscenarios.server', + 'tinytorch/core/benchmarking.py'), + 'tinytorch.core.benchmarking.BenchmarkScenarios.single_stream': ( 'temp_holding/14_benchmarking/benchmarking_dev.html#benchmarkscenarios.single_stream', + 'tinytorch/core/benchmarking.py'), + 'tinytorch.core.benchmarking.PerformanceReporter': ( 'temp_holding/14_benchmarking/benchmarking_dev.html#performancereporter', + 'tinytorch/core/benchmarking.py'), + 'tinytorch.core.benchmarking.PerformanceReporter.__init__': ( 'temp_holding/14_benchmarking/benchmarking_dev.html#performancereporter.__init__', + 'tinytorch/core/benchmarking.py'), + 'tinytorch.core.benchmarking.PerformanceReporter.generate_project_report': ( 'temp_holding/14_benchmarking/benchmarking_dev.html#performancereporter.generate_project_report', + 'tinytorch/core/benchmarking.py'), + 'tinytorch.core.benchmarking.PerformanceReporter.save_report': ( 'temp_holding/14_benchmarking/benchmarking_dev.html#performancereporter.save_report', + 'tinytorch/core/benchmarking.py'), + 'tinytorch.core.benchmarking.ProductionBenchmarkingProfiler': ( 'temp_holding/14_benchmarking/benchmarking_dev.html#productionbenchmarkingprofiler', + 
'tinytorch/core/benchmarking.py'), + 'tinytorch.core.benchmarking.ProductionBenchmarkingProfiler.__init__': ( 'temp_holding/14_benchmarking/benchmarking_dev.html#productionbenchmarkingprofiler.__init__', + 'tinytorch/core/benchmarking.py'), + 'tinytorch.core.benchmarking.ProductionBenchmarkingProfiler._generate_ab_recommendation': ( 'temp_holding/14_benchmarking/benchmarking_dev.html#productionbenchmarkingprofiler._generate_ab_recommendation', + 'tinytorch/core/benchmarking.py'), + 'tinytorch.core.benchmarking.ProductionBenchmarkingProfiler.detect_performance_regression': ( 'temp_holding/14_benchmarking/benchmarking_dev.html#productionbenchmarkingprofiler.detect_performance_regression', + 'tinytorch/core/benchmarking.py'), + 'tinytorch.core.benchmarking.ProductionBenchmarkingProfiler.generate_capacity_planning_report': ( 'temp_holding/14_benchmarking/benchmarking_dev.html#productionbenchmarkingprofiler.generate_capacity_planning_report', + 'tinytorch/core/benchmarking.py'), + 'tinytorch.core.benchmarking.ProductionBenchmarkingProfiler.monitor_resource_utilization': ( 'temp_holding/14_benchmarking/benchmarking_dev.html#productionbenchmarkingprofiler.monitor_resource_utilization', + 'tinytorch/core/benchmarking.py'), + 'tinytorch.core.benchmarking.ProductionBenchmarkingProfiler.profile_end_to_end_pipeline': ( 'temp_holding/14_benchmarking/benchmarking_dev.html#productionbenchmarkingprofiler.profile_end_to_end_pipeline', + 'tinytorch/core/benchmarking.py'), + 'tinytorch.core.benchmarking.ProductionBenchmarkingProfiler.run_ab_test': ( 'temp_holding/14_benchmarking/benchmarking_dev.html#productionbenchmarkingprofiler.run_ab_test', + 'tinytorch/core/benchmarking.py'), + 'tinytorch.core.benchmarking.ProductionBenchmarkingProfiler.setup_ab_testing_framework': ( 'temp_holding/14_benchmarking/benchmarking_dev.html#productionbenchmarkingprofiler.setup_ab_testing_framework', + 'tinytorch/core/benchmarking.py'), + 'tinytorch.core.benchmarking.StatisticalValidation': ( 
'temp_holding/14_benchmarking/benchmarking_dev.html#statisticalvalidation', + 'tinytorch/core/benchmarking.py'), + 'tinytorch.core.benchmarking.StatisticalValidator': ( 'temp_holding/14_benchmarking/benchmarking_dev.html#statisticalvalidator', + 'tinytorch/core/benchmarking.py'), + 'tinytorch.core.benchmarking.StatisticalValidator.__init__': ( 'temp_holding/14_benchmarking/benchmarking_dev.html#statisticalvalidator.__init__', + 'tinytorch/core/benchmarking.py'), + 'tinytorch.core.benchmarking.StatisticalValidator.validate_benchmark_result': ( 'temp_holding/14_benchmarking/benchmarking_dev.html#statisticalvalidator.validate_benchmark_result', + 'tinytorch/core/benchmarking.py'), + 'tinytorch.core.benchmarking.StatisticalValidator.validate_comparison': ( 'temp_holding/14_benchmarking/benchmarking_dev.html#statisticalvalidator.validate_comparison', + 'tinytorch/core/benchmarking.py'), + 'tinytorch.core.benchmarking.TinyTorchPerf': ( 'temp_holding/14_benchmarking/benchmarking_dev.html#tinytorchperf', + 'tinytorch/core/benchmarking.py'), + 'tinytorch.core.benchmarking.TinyTorchPerf.__init__': ( 'temp_holding/14_benchmarking/benchmarking_dev.html#tinytorchperf.__init__', + 'tinytorch/core/benchmarking.py'), + 'tinytorch.core.benchmarking.TinyTorchPerf.compare_models': ( 'temp_holding/14_benchmarking/benchmarking_dev.html#tinytorchperf.compare_models', + 'tinytorch/core/benchmarking.py'), + 'tinytorch.core.benchmarking.TinyTorchPerf.generate_report': ( 'temp_holding/14_benchmarking/benchmarking_dev.html#tinytorchperf.generate_report', + 'tinytorch/core/benchmarking.py'), + 'tinytorch.core.benchmarking.TinyTorchPerf.run_all_scenarios': ( 'temp_holding/14_benchmarking/benchmarking_dev.html#tinytorchperf.run_all_scenarios', + 'tinytorch/core/benchmarking.py'), + 'tinytorch.core.benchmarking.TinyTorchPerf.run_offline': ( 'temp_holding/14_benchmarking/benchmarking_dev.html#tinytorchperf.run_offline', + 'tinytorch/core/benchmarking.py'), + 
'tinytorch.core.benchmarking.TinyTorchPerf.run_server': ( 'temp_holding/14_benchmarking/benchmarking_dev.html#tinytorchperf.run_server', + 'tinytorch/core/benchmarking.py'), + 'tinytorch.core.benchmarking.TinyTorchPerf.run_single_stream': ( 'temp_holding/14_benchmarking/benchmarking_dev.html#tinytorchperf.run_single_stream', + 'tinytorch/core/benchmarking.py'), + 'tinytorch.core.benchmarking.TinyTorchPerf.set_dataset': ( 'temp_holding/14_benchmarking/benchmarking_dev.html#tinytorchperf.set_dataset', + 'tinytorch/core/benchmarking.py'), + 'tinytorch.core.benchmarking.TinyTorchPerf.set_model': ( 'temp_holding/14_benchmarking/benchmarking_dev.html#tinytorchperf.set_model', + 'tinytorch/core/benchmarking.py'), + 'tinytorch.core.benchmarking.plot_benchmark_results': ( 'temp_holding/14_benchmarking/benchmarking_dev.html#plot_benchmark_results', + 'tinytorch/core/benchmarking.py')}, 'tinytorch.core.cnn': { 'tinytorch.core.cnn.Conv2D': ('06_spatial/spatial_dev.html#conv2d', 'tinytorch/core/cnn.py'), 'tinytorch.core.cnn.Conv2D.__call__': ( '06_spatial/spatial_dev.html#conv2d.__call__', 'tinytorch/core/cnn.py'), @@ -82,6 +164,56 @@ d = { 'settings': { 'branch': 'main', 'tinytorch.core.cnn.conv2d_naive': ( '06_spatial/spatial_dev.html#conv2d_naive', 'tinytorch/core/cnn.py'), 'tinytorch.core.cnn.flatten': ('06_spatial/spatial_dev.html#flatten', 'tinytorch/core/cnn.py')}, + 'tinytorch.core.compression': { 'tinytorch.core.compression.CompressionMetrics': ( 'temp_holding/16_regularization/regularization_dev.html#compressionmetrics', + 'tinytorch/core/compression.py'), + 'tinytorch.core.compression.CompressionMetrics.__init__': ( 'temp_holding/16_regularization/regularization_dev.html#compressionmetrics.__init__', + 'tinytorch/core/compression.py'), + 'tinytorch.core.compression.CompressionMetrics.calculate_model_size': ( 'temp_holding/16_regularization/regularization_dev.html#compressionmetrics.calculate_model_size', + 'tinytorch/core/compression.py'), + 
'tinytorch.core.compression.CompressionMetrics.count_parameters': ( 'temp_holding/16_regularization/regularization_dev.html#compressionmetrics.count_parameters', + 'tinytorch/core/compression.py'), + 'tinytorch.core.compression.CompressionSystemsProfiler': ( 'temp_holding/16_regularization/regularization_dev.html#compressionsystemsprofiler', + 'tinytorch/core/compression.py'), + 'tinytorch.core.compression.CompressionSystemsProfiler.__init__': ( 'temp_holding/16_regularization/regularization_dev.html#compressionsystemsprofiler.__init__', + 'tinytorch/core/compression.py'), + 'tinytorch.core.compression.CompressionSystemsProfiler._apply_magnitude_pruning': ( 'temp_holding/16_regularization/regularization_dev.html#compressionsystemsprofiler._apply_magnitude_pruning', + 'tinytorch/core/compression.py'), + 'tinytorch.core.compression.CompressionSystemsProfiler._apply_quantization': ( 'temp_holding/16_regularization/regularization_dev.html#compressionsystemsprofiler._apply_quantization', + 'tinytorch/core/compression.py'), + 'tinytorch.core.compression.CompressionSystemsProfiler._apply_structured_pruning': ( 'temp_holding/16_regularization/regularization_dev.html#compressionsystemsprofiler._apply_structured_pruning', + 'tinytorch/core/compression.py'), + 'tinytorch.core.compression.CompressionSystemsProfiler._calculate_model_flops': ( 'temp_holding/16_regularization/regularization_dev.html#compressionsystemsprofiler._calculate_model_flops', + 'tinytorch/core/compression.py'), + 'tinytorch.core.compression.CompressionSystemsProfiler.analyze_accuracy_tradeoffs': ( 'temp_holding/16_regularization/regularization_dev.html#compressionsystemsprofiler.analyze_accuracy_tradeoffs', + 'tinytorch/core/compression.py'), + 'tinytorch.core.compression.CompressionSystemsProfiler.analyze_quantization_impact': ( 'temp_holding/16_regularization/regularization_dev.html#compressionsystemsprofiler.analyze_quantization_impact', + 'tinytorch/core/compression.py'), + 
'tinytorch.core.compression.CompressionSystemsProfiler.measure_inference_speedup': ( 'temp_holding/16_regularization/regularization_dev.html#compressionsystemsprofiler.measure_inference_speedup', + 'tinytorch/core/compression.py'), + 'tinytorch.core.compression.DistillationLoss': ( 'temp_holding/16_regularization/regularization_dev.html#distillationloss', + 'tinytorch/core/compression.py'), + 'tinytorch.core.compression.DistillationLoss.__call__': ( 'temp_holding/16_regularization/regularization_dev.html#distillationloss.__call__', + 'tinytorch/core/compression.py'), + 'tinytorch.core.compression.DistillationLoss.__init__': ( 'temp_holding/16_regularization/regularization_dev.html#distillationloss.__init__', + 'tinytorch/core/compression.py'), + 'tinytorch.core.compression.DistillationLoss._cross_entropy_loss': ( 'temp_holding/16_regularization/regularization_dev.html#distillationloss._cross_entropy_loss', + 'tinytorch/core/compression.py'), + 'tinytorch.core.compression.DistillationLoss._softmax': ( 'temp_holding/16_regularization/regularization_dev.html#distillationloss._softmax', + 'tinytorch/core/compression.py'), + 'tinytorch.core.compression.calculate_sparsity': ( 'temp_holding/16_regularization/regularization_dev.html#calculate_sparsity', + 'tinytorch/core/compression.py'), + 'tinytorch.core.compression.compare_compression_techniques': ( 'temp_holding/16_regularization/regularization_dev.html#compare_compression_techniques', + 'tinytorch/core/compression.py'), + 'tinytorch.core.compression.compute_neuron_importance': ( 'temp_holding/16_regularization/regularization_dev.html#compute_neuron_importance', + 'tinytorch/core/compression.py'), + 'tinytorch.core.compression.prune_layer_neurons': ( 'temp_holding/16_regularization/regularization_dev.html#prune_layer_neurons', + 'tinytorch/core/compression.py'), + 'tinytorch.core.compression.prune_weights_by_magnitude': ( 'temp_holding/16_regularization/regularization_dev.html#prune_weights_by_magnitude', + 
'tinytorch/core/compression.py'), + 'tinytorch.core.compression.quantize_layer_weights': ( 'temp_holding/16_regularization/regularization_dev.html#quantize_layer_weights', + 'tinytorch/core/compression.py'), + 'tinytorch.core.compression.setup_import_paths': ( 'temp_holding/16_regularization/regularization_dev.html#setup_import_paths', + 'tinytorch/core/compression.py')}, 'tinytorch.core.dataloader': { 'tinytorch.core.dataloader.CIFAR10Dataset': ( '07_dataloader/dataloader_dev.html#cifar10dataset', 'tinytorch/core/dataloader.py'), 'tinytorch.core.dataloader.CIFAR10Dataset.__getitem__': ( '07_dataloader/dataloader_dev.html#cifar10dataset.__getitem__', @@ -199,7 +331,23 @@ d = { 'settings': { 'branch': 'main', 'tinytorch/core/kernels.py'), 'tinytorch.core.kernels.vectorized_relu': ( 'temp_holding/13_kernels/kernels_dev.html#vectorized_relu', 'tinytorch/core/kernels.py')}, - 'tinytorch.core.losses': {}, + 'tinytorch.core.layers': { 'tinytorch.core.layers.Linear': ('04_layers/layers_dev.html#linear', 'tinytorch/core/layers.py'), + 'tinytorch.core.layers.Linear.__init__': ( '04_layers/layers_dev.html#linear.__init__', + 'tinytorch/core/layers.py'), + 'tinytorch.core.layers.Linear.forward': ( '04_layers/layers_dev.html#linear.forward', + 'tinytorch/core/layers.py'), + 'tinytorch.core.layers.Module': ('04_layers/layers_dev.html#module', 'tinytorch/core/layers.py'), + 'tinytorch.core.layers.Module.__call__': ( '04_layers/layers_dev.html#module.__call__', + 'tinytorch/core/layers.py'), + 'tinytorch.core.layers.Module.__init__': ( '04_layers/layers_dev.html#module.__init__', + 'tinytorch/core/layers.py'), + 'tinytorch.core.layers.Module.__setattr__': ( '04_layers/layers_dev.html#module.__setattr__', + 'tinytorch/core/layers.py'), + 'tinytorch.core.layers.Module.forward': ( '04_layers/layers_dev.html#module.forward', + 'tinytorch/core/layers.py'), + 'tinytorch.core.layers.Module.parameters': ( '04_layers/layers_dev.html#module.parameters', + 'tinytorch/core/layers.py'), + 
'tinytorch.core.layers.matmul': ('04_layers/layers_dev.html#matmul', 'tinytorch/core/layers.py')}, 'tinytorch.core.mlops': { 'tinytorch.core.mlops.DeploymentStrategy': ( 'temp_holding/15_mlops/mlops_dev.html#deploymentstrategy', 'tinytorch/core/mlops.py'), 'tinytorch.core.mlops.DriftDetector': ( 'temp_holding/15_mlops/mlops_dev.html#driftdetector', @@ -279,6 +427,52 @@ d = { 'settings': { 'branch': 'main', 'tinytorch/core/networks.py'), 'tinytorch.core.networks.create_mlp': ( '05_dense/dense_dev.html#create_mlp', 'tinytorch/core/networks.py')}, + 'tinytorch.core.setup': { 'tinytorch.core.setup.personal_info': ( '01_setup/setup_dev.html#personal_info', + 'tinytorch/core/setup.py'), + 'tinytorch.core.setup.system_info': ( '01_setup/setup_dev.html#system_info', + 'tinytorch/core/setup.py')}, + 'tinytorch.core.spatial': { 'tinytorch.core.spatial.AvgPool2d': ( '09_spatial/spatial_dev.html#avgpool2d', + 'tinytorch/core/spatial.py'), + 'tinytorch.core.spatial.AvgPool2d.__call__': ( '09_spatial/spatial_dev.html#avgpool2d.__call__', + 'tinytorch/core/spatial.py'), + 'tinytorch.core.spatial.AvgPool2d.__init__': ( '09_spatial/spatial_dev.html#avgpool2d.__init__', + 'tinytorch/core/spatial.py'), + 'tinytorch.core.spatial.AvgPool2d.forward': ( '09_spatial/spatial_dev.html#avgpool2d.forward', + 'tinytorch/core/spatial.py'), + 'tinytorch.core.spatial.AvgPool2d.parameters': ( '09_spatial/spatial_dev.html#avgpool2d.parameters', + 'tinytorch/core/spatial.py'), + 'tinytorch.core.spatial.Conv2d': ( '09_spatial/spatial_dev.html#conv2d', + 'tinytorch/core/spatial.py'), + 'tinytorch.core.spatial.Conv2d.__call__': ( '09_spatial/spatial_dev.html#conv2d.__call__', + 'tinytorch/core/spatial.py'), + 'tinytorch.core.spatial.Conv2d.__init__': ( '09_spatial/spatial_dev.html#conv2d.__init__', + 'tinytorch/core/spatial.py'), + 'tinytorch.core.spatial.Conv2d.forward': ( '09_spatial/spatial_dev.html#conv2d.forward', + 'tinytorch/core/spatial.py'), + 'tinytorch.core.spatial.Conv2d.parameters': ( 
'09_spatial/spatial_dev.html#conv2d.parameters', + 'tinytorch/core/spatial.py'), + 'tinytorch.core.spatial.MaxPool2d': ( '09_spatial/spatial_dev.html#maxpool2d', + 'tinytorch/core/spatial.py'), + 'tinytorch.core.spatial.MaxPool2d.__call__': ( '09_spatial/spatial_dev.html#maxpool2d.__call__', + 'tinytorch/core/spatial.py'), + 'tinytorch.core.spatial.MaxPool2d.__init__': ( '09_spatial/spatial_dev.html#maxpool2d.__init__', + 'tinytorch/core/spatial.py'), + 'tinytorch.core.spatial.MaxPool2d.forward': ( '09_spatial/spatial_dev.html#maxpool2d.forward', + 'tinytorch/core/spatial.py'), + 'tinytorch.core.spatial.MaxPool2d.parameters': ( '09_spatial/spatial_dev.html#maxpool2d.parameters', + 'tinytorch/core/spatial.py'), + 'tinytorch.core.spatial.SimpleCNN': ( '09_spatial/spatial_dev.html#simplecnn', + 'tinytorch/core/spatial.py'), + 'tinytorch.core.spatial.SimpleCNN.__call__': ( '09_spatial/spatial_dev.html#simplecnn.__call__', + 'tinytorch/core/spatial.py'), + 'tinytorch.core.spatial.SimpleCNN.__init__': ( '09_spatial/spatial_dev.html#simplecnn.__init__', + 'tinytorch/core/spatial.py'), + 'tinytorch.core.spatial.SimpleCNN.forward': ( '09_spatial/spatial_dev.html#simplecnn.forward', + 'tinytorch/core/spatial.py'), + 'tinytorch.core.spatial.SimpleCNN.parameters': ( '09_spatial/spatial_dev.html#simplecnn.parameters', + 'tinytorch/core/spatial.py'), + 'tinytorch.core.spatial.SimpleCNN.relu': ( '09_spatial/spatial_dev.html#simplecnn.relu', + 'tinytorch/core/spatial.py')}, 'tinytorch.core.training': { 'tinytorch.core.training.Accuracy': ( '10_training/training_dev.html#accuracy', 'tinytorch/core/training.py'), 'tinytorch.core.training.Accuracy.__call__': ( '10_training/training_dev.html#accuracy.__call__', diff --git a/tinytorch/core/__init__.py b/tinytorch/core/__init__.py index 711d24a8..6525eec8 100644 --- a/tinytorch/core/__init__.py +++ b/tinytorch/core/__init__.py @@ -7,7 +7,6 @@ This module contains the fundamental building blocks: - autograd: Automatic differentiation - 
modules: Neural network layers - optimizers: Training optimizers -- quantization: INT8 quantization for inference acceleration All code is auto-generated from notebooks. Do not edit manually. """ diff --git a/tinytorch/core/activations.py b/tinytorch/core/activations.py index 66e49663..d25bc57f 100644 --- a/tinytorch/core/activations.py +++ b/tinytorch/core/activations.py @@ -1,14 +1,27 @@ -# AUTOGENERATED! DO NOT EDIT! File to edit: ../../modules/source/03_activations/activations_dev.ipynb. - +# ╔═══════════════════════════════════════════════════════════════════════════════╗ +# ║ 🚨 CRITICAL WARNING 🚨 ║ +# ║ AUTOGENERATED! DO NOT EDIT! ║ +# ║ ║ +# ║ This file is AUTOMATICALLY GENERATED from source modules. ║ +# ║ ANY CHANGES MADE HERE WILL BE LOST when modules are re-exported! ║ +# ║ ║ +# ║ ✅ TO EDIT: modules/source/03_activations/activations_dev.py ║ +# ║ ✅ TO EXPORT: Run 'tito module complete ' ║ +# ║ ║ +# ║ 🛡️ STUDENT PROTECTION: This file contains optimized implementations. ║ +# ║ Editing it directly may break module functionality and training. ║ +# ║ ║ +# ║ 🎓 LEARNING TIP: Work in modules/source/ - that's where real development ║ +# ║ happens! The tinytorch/ directory is just the compiled output. 
║ +# ╚═══════════════════════════════════════════════════════════════════════════════╝ # %% auto 0 -__all__ = ['ReLU', 'Sigmoid', 'Tanh', 'Softmax', 'ActivationProfiler', 'benchmark_activation_suite'] +__all__ = ['Sigmoid', 'ReLU', 'Tanh', 'GELU', 'Softmax'] -# %% ../../modules/source/03_activations/activations_dev.ipynb 1 -import math +# %% ../../modules/source/02_activations/activations_dev.ipynb 2 import numpy as np -import os +from typing import Optional import sys -from typing import Union, List +import os # Import our Tensor class - try from package first, then from local module try: @@ -18,544 +31,214 @@ except ImportError: sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor')) from tensor_dev import Tensor -# Using pure Tensor system only! - -# %% ../../modules/source/03_activations/activations_dev.ipynb 7 -class ReLU: - """ - ReLU Activation Function: f(x) = max(0, x) - - The most popular activation function in deep learning. - Simple, fast, and effective for most applications. - """ - - def forward(self, x): - """ - Apply ReLU activation: f(x) = max(0, x) - - Now supports both Tensor and Tensor inputs with automatic differentiation. - - STEP-BY-STEP IMPLEMENTATION: - 1. Check if input is Tensor (for autograd) or Tensor - 2. For each element in the input tensor, apply max(0, element) - 3. If input is Tensor: create Tensor output with proper gradient function - 4. If input is Tensor: return Tensor as before - - MATHEMATICAL FOUNDATION: - - Forward: f(x) = max(0, x) - - Backward: f'(x) = 1 if x > 0, else 0 - - EXAMPLE USAGE: - ```python - relu = ReLU() - # With Tensor (no gradients) - tensor_input = Tensor([[-2, -1, 0, 1, 2]]) - tensor_output = relu(tensor_input) - - # Using pure Tensor system only! 
- var_input = Tensor([[-2, -1, 0, 1, 2]], requires_grad=True) - var_output = relu(var_input) - var_output.backward() - print(var_input.grad) # Gradients: [0, 0, 0, 1, 1] - ``` - - IMPLEMENTATION HINTS: - - Check type with hasattr(x, 'requires_grad') - - For Tensors: implement gradient function for backward pass - - ReLU gradient: 1 where input > 0, 0 elsewhere - - Use np.maximum(0, x.data) for forward pass - - LEARNING CONNECTIONS: - - This is like torch.nn.ReLU() in PyTorch with autograd support - - Enables gradient-based training of neural networks - - ReLU's simple gradient (0 or 1) prevents vanishing gradients - - Creates sparse representations and efficient gradient flow - """ - ### BEGIN SOLUTION - # Using pure Tensor system only! - from tinytorch.core.tensor import Tensor - import numpy as np - - # Ensure input is a Tensor - if not isinstance(x, Tensor): - x = Tensor(x.data if hasattr(x, 'data') else x) - - # Forward pass: ReLU activation - output_data = np.maximum(0, x.data) - - # Return as Tensor (preserving gradient requirement if needed) - result = Tensor(output_data, requires_grad=x.requires_grad if hasattr(x, 'requires_grad') else False) - - # TODO: Set up gradient function properly when Tensor autograd is complete - # For now, just return the result - return result - ### END SOLUTION - - def __call__(self, x): - """Make the class callable: relu(x) instead of relu.forward(x)""" - return self.forward(x) - -# %% ../../modules/source/03_activations/activations_dev.ipynb 11 +# %% ../../modules/source/02_activations/activations_dev.ipynb 7 class Sigmoid: """ - Sigmoid Activation Function: f(x) = 1 / (1 + e^(-x)) - - Maps any real number to the range (0, 1). - Useful for binary classification and probability outputs. + Sigmoid activation: σ(x) = 1/(1 + e^(-x)) + + Maps any real number to (0, 1) range. + Perfect for probabilities and binary classification. 
""" - - def forward(self, x): + + def forward(self, x: Tensor) -> Tensor: """ - Apply Sigmoid activation: f(x) = 1 / (1 + e^(-x)) - - Now supports both Tensor and Tensor inputs with automatic differentiation. - - STEP-BY-STEP IMPLEMENTATION: - 1. Check if input is Tensor (for autograd) or Tensor - 2. Compute sigmoid: 1 / (1 + exp(-x)) - 3. If input is Tensor: create Tensor output with proper gradient function - 4. If input is Tensor: return Tensor as before - - MATHEMATICAL FOUNDATION: - - Forward: f(x) = 1 / (1 + e^(-x)) - - Backward: f'(x) = f(x) * (1 - f(x)) = sigmoid(x) * (1 - sigmoid(x)) - - EXAMPLE USAGE: - ```python - sigmoid = Sigmoid() - # Using pure Tensor system only! - var_input = Tensor([[0.0]], requires_grad=True) - var_output = sigmoid(var_input) # 0.5 - var_output.backward() - print(var_input.grad) # 0.25 = 0.5 * (1 - 0.5) - ``` - - IMPLEMENTATION HINTS: - - Check type with hasattr(x, 'requires_grad') - - For Tensors: implement gradient function for backward pass - - Sigmoid gradient: sigmoid(x) * (1 - sigmoid(x)) - - Use numerical stability: clip inputs to prevent overflow - - LEARNING CONNECTIONS: - - This is like torch.nn.Sigmoid() in PyTorch with autograd support - - Used in binary classification and gating mechanisms - - Smooth gradients enable stable training - - Self-normalizing gradient (max at x=0, decreases at extremes) + Apply sigmoid activation element-wise. + + TODO: Implement sigmoid function + + APPROACH: + 1. Apply sigmoid formula: 1 / (1 + exp(-x)) + 2. Use np.exp for exponential + 3. Return result wrapped in new Tensor + + EXAMPLE: + >>> sigmoid = Sigmoid() + >>> x = Tensor([-2, 0, 2]) + >>> result = sigmoid.forward(x) + >>> print(result.data) + [0.119, 0.5, 0.881] # All values between 0 and 1 + + HINT: Use np.exp(-x.data) for numerical stability """ ### BEGIN SOLUTION - # Using pure Tensor system only! 
- from tinytorch.core.tensor import Tensor - import numpy as np - - # Ensure input is a Tensor - if not isinstance(x, Tensor): - x = Tensor(x.data if hasattr(x, 'data') else x) - - # Forward pass: Sigmoid activation with numerical stability - clipped_input = np.clip(-x.data, -500, 500) - output_data = 1 / (1 + np.exp(clipped_input)) - - # Return as Tensor (preserving gradient requirement if needed) - result = Tensor(output_data, requires_grad=x.requires_grad if hasattr(x, 'requires_grad') else False) - - # TODO: Set up gradient function properly when Tensor autograd is complete - # For now, just return the result - return result + # Apply sigmoid: 1 / (1 + exp(-x)) + result = 1.0 / (1.0 + np.exp(-x.data)) + return Tensor(result) ### END SOLUTION - - def __call__(self, x): - """Make the class callable: sigmoid(x) instead of sigmoid.forward(x)""" - return self.forward(x) -# %% ../../modules/source/03_activations/activations_dev.ipynb 15 + def backward(self, grad: Tensor) -> Tensor: + """Compute gradient (implemented in Module 05).""" + pass # Will implement backward pass in Module 05 + +# %% ../../modules/source/02_activations/activations_dev.ipynb 11 +class ReLU: + """ + ReLU activation: f(x) = max(0, x) + + Sets negative values to zero, keeps positive values unchanged. + Most popular activation for hidden layers. + """ + + def forward(self, x: Tensor) -> Tensor: + """ + Apply ReLU activation element-wise. + + TODO: Implement ReLU function + + APPROACH: + 1. Use np.maximum(0, x.data) for element-wise max with zero + 2. 
Return result wrapped in new Tensor + + EXAMPLE: + >>> relu = ReLU() + >>> x = Tensor([-2, -1, 0, 1, 2]) + >>> result = relu.forward(x) + >>> print(result.data) + [0, 0, 0, 1, 2] # Negative values become 0, positive unchanged + + HINT: np.maximum handles element-wise maximum automatically + """ + ### BEGIN SOLUTION + # Apply ReLU: max(0, x) + result = np.maximum(0, x.data) + return Tensor(result) + ### END SOLUTION + + def backward(self, grad: Tensor) -> Tensor: + """Compute gradient (implemented in Module 05).""" + pass # Will implement backward pass in Module 05 + +# %% ../../modules/source/02_activations/activations_dev.ipynb 15 class Tanh: """ - Tanh Activation Function: f(x) = (e^x - e^(-x)) / (e^x + e^(-x)) - - Zero-centered activation function with range (-1, 1). - Better gradient properties than sigmoid. + Tanh activation: f(x) = (e^x - e^(-x))/(e^x + e^(-x)) + + Maps any real number to (-1, 1) range. + Zero-centered alternative to sigmoid. """ - - def forward(self, x): + + def forward(self, x: Tensor) -> Tensor: """ - Apply Tanh activation: f(x) = (e^x - e^(-x)) / (e^x + e^(-x)) - - Now supports both Tensor and Tensor inputs with automatic differentiation. - - STEP-BY-STEP IMPLEMENTATION: - 1. Check if input is Tensor (for autograd) or Tensor - 2. Compute tanh: (e^x - e^(-x)) / (e^x + e^(-x)) - 3. If input is Tensor: create Tensor output with proper gradient function - 4. If input is Tensor: return Tensor as before - - MATHEMATICAL FOUNDATION: - - Forward: f(x) = tanh(x) - - Backward: f'(x) = 1 - tanh²(x) = 1 - f(x)² - - EXAMPLE USAGE: - ```python - tanh = Tanh() - # Using pure Tensor system only! 
- var_input = Tensor([[0.0]], requires_grad=True) - var_output = tanh(var_input) # 0.0 - var_output.backward() - print(var_input.grad) # 1.0 = 1 - 0² - ``` - - IMPLEMENTATION HINTS: - - Check type with hasattr(x, 'requires_grad') - - For Tensors: implement gradient function for backward pass - - Tanh gradient: 1 - tanh²(x) - - Use np.tanh() for numerical stability - - LEARNING CONNECTIONS: - - This is like torch.nn.Tanh() in PyTorch with autograd support - - Used in RNN, LSTM, and GRU cells - - Zero-centered outputs improve gradient flow - - Strong gradients near zero, weaker at extremes + Apply tanh activation element-wise. + + TODO: Implement tanh function + + APPROACH: + 1. Use np.tanh(x.data) for hyperbolic tangent + 2. Return result wrapped in new Tensor + + EXAMPLE: + >>> tanh = Tanh() + >>> x = Tensor([-2, 0, 2]) + >>> result = tanh.forward(x) + >>> print(result.data) + [-0.964, 0.0, 0.964] # Range (-1, 1), symmetric around 0 + + HINT: NumPy provides np.tanh function """ ### BEGIN SOLUTION - # Using pure Tensor system only! - if hasattr(x, 'requires_grad') and hasattr(x, 'grad_fn'): - # Using pure Tensor system only! - - # Forward pass: Tanh activation - input_data = x.data.data if hasattr(x.data, 'data') else x.data - output_data = np.tanh(input_data) - - # Create gradient function for backward pass - def tanh_grad_fn(grad_output): - if x.requires_grad: - # Tanh gradient: 1 - tanh²(x) - tanh_grad = 1 - output_data ** 2 - # Using pure Tensor system only! - grad_data = grad_output.data.data if hasattr(grad_output.data, 'data') else grad_output.data - grad_input_data = grad_data * tanh_grad - grad_input = Tensor(grad_input_data) - x.backward(grad_input) - - # Using pure Tensor system only! 
- requires_grad = x.requires_grad - result = Tensor(output_data, requires_grad=requires_grad, grad_fn=tanh_grad_fn if requires_grad else None) - return result - else: - # Input is a Tensor - use original implementation - result = np.tanh(x.data) - return type(x)(result) + # Apply tanh using NumPy + result = np.tanh(x.data) + return Tensor(result) ### END SOLUTION - - def __call__(self, x: Tensor) -> Tensor: - """Make the class callable: tanh(x) instead of tanh.forward(x)""" - return self.forward(x) -# %% ../../modules/source/03_activations/activations_dev.ipynb 19 + def backward(self, grad: Tensor) -> Tensor: + """Compute gradient (implemented in Module 05).""" + pass # Will implement backward pass in Module 05 + +# %% ../../modules/source/02_activations/activations_dev.ipynb 19 +class GELU: + """ + GELU activation: f(x) = x * Φ(x) ≈ x * Sigmoid(1.702 * x) + + Smooth approximation to ReLU, used in modern transformers. + Where Φ(x) is the cumulative distribution function of standard normal. + """ + + def forward(self, x: Tensor) -> Tensor: + """ + Apply GELU activation element-wise. + + TODO: Implement GELU approximation + + APPROACH: + 1. Use approximation: x * sigmoid(1.702 * x) + 2. Compute sigmoid part: 1 / (1 + exp(-1.702 * x)) + 3. Multiply by x element-wise + 4. 
Return result wrapped in new Tensor + + EXAMPLE: + >>> gelu = GELU() + >>> x = Tensor([-1, 0, 1]) + >>> result = gelu.forward(x) + >>> print(result.data) + [-0.154, 0.0, 0.846] # Smooth, like ReLU but differentiable everywhere + + HINT: The 1.702 constant is empirically fitted so sigmoid(1.702*x) approximates the Gaussian CDF Φ(x) + """ + ### BEGIN SOLUTION + # GELU approximation: x * sigmoid(1.702 * x) + # First compute sigmoid part + sigmoid_part = 1.0 / (1.0 + np.exp(-1.702 * x.data)) + # Then multiply by x + result = x.data * sigmoid_part + return Tensor(result) + ### END SOLUTION + + def backward(self, grad: Tensor) -> Tensor: + """Compute gradient (implemented in Module 05).""" + pass # Will implement backward pass in Module 05 + +# %% ../../modules/source/02_activations/activations_dev.ipynb 23 +class Softmax: + """ - Softmax Activation Function: f(x_i) = e^(x_i) / Σ(e^(x_j)) - - Converts a vector of real numbers into a probability distribution. - Essential for multi-class classification. - """ - - def forward(self, x): - """ - Apply Softmax activation: f(x_i) = e^(x_i) / Σ(e^(x_j)) - - Now supports both Tensor and Tensor inputs with automatic differentiation. - - STEP-BY-STEP IMPLEMENTATION: - 1. Check if input is Tensor (for autograd) or Tensor - 2. Compute softmax with numerical stability - 3. If input is Tensor: create Tensor output with proper gradient function - 4. If input is Tensor: return Tensor as before - - MATHEMATICAL FOUNDATION: - - Forward: f(x_i) = e^(x_i) / Σ(e^(x_j)) - - Backward: ∂f_i/∂x_j = f_i * (δ_ij - f_j) where δ_ij is Kronecker delta - - Simplified: ∂f_i/∂x_i = f_i * (1 - f_i), ∂f_i/∂x_j = -f_i * f_j (i ≠ j) - - EXAMPLE USAGE: - ```python - softmax = Softmax() - # Using pure Tensor system only!
- var_input = Tensor([[1.0, 2.0]], requires_grad=True) - var_output = softmax(var_input) - var_output.backward(Tensor([[1.0, 0.0]])) - # Gradients computed automatically - ``` - - IMPLEMENTATION HINTS: - - Check type with hasattr(x, 'requires_grad') - - For Tensors: implement gradient function for backward pass - - Softmax gradient: Jacobian matrix with f_i * (δ_ij - f_j) - - Use numerical stability: subtract max before exponential - - LEARNING CONNECTIONS: - - This is like torch.nn.Softmax() in PyTorch with autograd support - - Used in classification and attention mechanisms - - Converts logits to probability distributions - - Complex gradient structure due to normalization - """ - ### BEGIN SOLUTION - # Using pure Tensor system only! - if hasattr(x, 'requires_grad') and hasattr(x, 'grad_fn'): - # Using pure Tensor system only! - - # Forward pass: Softmax activation with numerical stability - input_data = x.data.data if hasattr(x.data, 'data') else x.data - - # Handle empty input - if input_data.size == 0: - return Tensor(input_data.copy(), requires_grad=x.requires_grad) - - # Subtract max for numerical stability - x_shifted = input_data - np.max(input_data, axis=-1, keepdims=True) - - # Compute exponentials - exp_values = np.exp(x_shifted) - - # Sum along last axis - sum_exp = np.sum(exp_values, axis=-1, keepdims=True) - - # Divide to get probabilities - output_data = exp_values / sum_exp - - # Create gradient function for backward pass - def softmax_grad_fn(grad_output): - if x.requires_grad: - # Softmax gradient: for each element i,j: ∂f_i/∂x_j = f_i * (δ_ij - f_j) - # For vector input, this becomes: grad_input = softmax * (grad_output - (softmax * grad_output).sum(keepdims=True)) - # Using pure Tensor system only! 
- grad_out_data = grad_output.data.data if hasattr(grad_output.data, 'data') else grad_output.data - softmax_grad_sum = np.sum(output_data * grad_out_data, axis=-1, keepdims=True) - grad_input_data = output_data * (grad_out_data - softmax_grad_sum) - grad_input = Tensor(grad_input_data) - x.backward(grad_input) - - # Using pure Tensor system only! - requires_grad = x.requires_grad - result = Tensor(output_data, requires_grad=requires_grad, grad_fn=softmax_grad_fn if requires_grad else None) - return result - else: - # Input is a Tensor - use original implementation - # Handle empty input - if x.data.size == 0: - return type(x)(x.data.copy()) - - # Subtract max for numerical stability - x_shifted = x.data - np.max(x.data, axis=-1, keepdims=True) - - # Compute exponentials - exp_values = np.exp(x_shifted) - - # Sum along last axis - sum_exp = np.sum(exp_values, axis=-1, keepdims=True) - - # Divide to get probabilities - result = exp_values / sum_exp - - return type(x)(result) - ### END SOLUTION - - def __call__(self, x): - """Make the class callable: softmax(x) instead of softmax.forward(x)""" - return self.forward(x) + Softmax activation: f(x_i) = e^(x_i) / Σ(e^(x_j)) -# %% ../../modules/source/03_activations/activations_dev.ipynb 30 -import time + Converts any vector to a probability distribution. + Sum of all outputs equals 1.0. + """ -class ActivationProfiler: - """ - Performance profiling toolkit for activation functions. - - Helps ML engineers understand computational costs and optimize - neural network performance for production deployment. - """ - - def __init__(self): - self.results = {} - - def time_activation(self, activation_fn, tensor, activation_name, iterations=100): + def forward(self, x: Tensor, dim: int = -1) -> Tensor: """ - Time how long an activation function takes to run. - - TODO: Implement activation timing. - - STEP-BY-STEP IMPLEMENTATION: - 1. Record start time using time.time() - 2. Run the activation function for specified iterations - 3. 
Record end time - 4. Calculate average time per iteration - 5. Return the average time in milliseconds - + Apply softmax activation along specified dimension. + + TODO: Implement numerically stable softmax + + APPROACH: + 1. Subtract max for numerical stability: x - max(x) + 2. Compute exponentials: exp(x - max(x)) + 3. Sum along dimension: sum(exp_values) + 4. Divide: exp_values / sum + 5. Return result wrapped in new Tensor + EXAMPLE: - profiler = ActivationProfiler() - relu = ReLU() - test_tensor = Tensor(np.random.randn(1000, 1000)) - avg_time = profiler.time_activation(relu, test_tensor, "ReLU") - print(f"ReLU took {avg_time:.3f} ms on average") - + >>> softmax = Softmax() + >>> x = Tensor([1, 2, 3]) + >>> result = softmax.forward(x) + >>> print(result.data) + [0.090, 0.245, 0.665] # Sums to 1.0, larger inputs get higher probability + HINTS: - - Use time.time() for timing - - Run multiple iterations for better accuracy - - Calculate: (end_time - start_time) / iterations * 1000 for ms - - Return the average time per call in milliseconds + - Use np.max(x.data, axis=dim, keepdims=True) for max + - Use np.sum(exp_values, axis=dim, keepdims=True) for sum + - The max subtraction prevents overflow in exponentials """ ### BEGIN SOLUTION - start_time = time.time() - - for _ in range(iterations): - result = activation_fn(tensor) - - end_time = time.time() - avg_time_ms = (end_time - start_time) / iterations * 1000 - - return avg_time_ms - ### END SOLUTION - - def compare_activations(self, tensor_size=(1000, 1000), iterations=50): - """ - Compare performance of all activation functions. - - This function is PROVIDED to show systems analysis. - Students run it to understand performance differences. 
- """ - print(f"⚡ ACTIVATION PERFORMANCE COMPARISON") - print(f"=" * 50) - print(f"Tensor size: {tensor_size}, Iterations: {iterations}") - - # Create test tensor - test_tensor = Tensor(np.random.randn(*tensor_size)) - tensor_mb = test_tensor.data.nbytes / (1024 * 1024) - print(f"Test tensor: {tensor_mb:.2f} MB") - - # Test all activation functions - activations = { - 'ReLU': ReLU(), - 'Sigmoid': Sigmoid(), - 'Tanh': Tanh(), - 'Softmax': Softmax() - } - - results = {} - for name, activation_fn in activations.items(): - avg_time = self.time_activation(activation_fn, test_tensor, name, iterations) - results[name] = avg_time - print(f" {name:8}: {avg_time:.3f} ms") - - # Calculate speed ratios relative to fastest - fastest_time = min(results.values()) - fastest_name = min(results, key=results.get) - - print(f"\n📊 SPEED ANALYSIS:") - for name, time_ms in sorted(results.items(), key=lambda x: x[1]): - speed_ratio = time_ms / fastest_time - if name == fastest_name: - print(f" {name:8}: {speed_ratio:.1f}x (fastest)") - else: - print(f" {name:8}: {speed_ratio:.1f}x slower than {fastest_name}") - - return results - - def analyze_scaling(self, activation_fn, activation_name, sizes=[100, 500, 1000]): - """ - Analyze how activation performance scales with tensor size. - - This function is PROVIDED to demonstrate scaling patterns. - Students use it to understand computational complexity. 
- """ - print(f"\n🔍 SCALING ANALYSIS: {activation_name}") - print(f"=" * 40) - - scaling_results = [] - - for size in sizes: - test_tensor = Tensor(np.random.randn(size, size)) - avg_time = self.time_activation(activation_fn, test_tensor, activation_name, iterations=20) - - elements = size * size - time_per_element = avg_time / elements * 1e6 # microseconds per element - - result = { - 'size': size, - 'elements': elements, - 'time_ms': avg_time, - 'time_per_element_us': time_per_element - } - scaling_results.append(result) - - print(f" {size}x{size}: {avg_time:.3f}ms ({time_per_element:.3f}μs/element)") - - # Analyze scaling pattern - if len(scaling_results) >= 2: - small = scaling_results[0] - large = scaling_results[-1] - - size_ratio = large['size'] / small['size'] - time_ratio = large['time_ms'] / small['time_ms'] - - print(f"\n📈 Scaling Pattern:") - print(f" Size increased {size_ratio:.1f}x ({small['size']} → {large['size']})") - print(f" Time increased {time_ratio:.1f}x") - - if abs(time_ratio - size_ratio**2) < abs(time_ratio - size_ratio): - print(f" Pattern: O(n^2) - linear in tensor size") - else: - print(f" Pattern: ~O(n) - very efficient scaling") - - return scaling_results + # Numerical stability: subtract max to prevent overflow + x_max = np.max(x.data, axis=dim, keepdims=True) + x_shifted = x.data - x_max -def benchmark_activation_suite(): - """ - Comprehensive benchmark of all activation functions. - - This function is PROVIDED to show complete systems analysis. - Students run it to understand production performance implications. 
- """ - profiler = ActivationProfiler() - - print("🏆 COMPREHENSIVE ACTIVATION BENCHMARK") - print("=" * 60) - - # Test 1: Performance comparison - comparison_results = profiler.compare_activations(tensor_size=(800, 800), iterations=30) - - # Test 2: Scaling analysis for each activation - activations_to_test = [ - (ReLU(), "ReLU"), - (Sigmoid(), "Sigmoid"), - (Tanh(), "Tanh") - ] - - for activation_fn, name in activations_to_test: - profiler.analyze_scaling(activation_fn, name, sizes=[200, 400, 600]) - - # Test 3: Memory vs Performance trade-offs - print(f"\n💾 MEMORY vs PERFORMANCE ANALYSIS:") - print(f"=" * 40) - - test_tensor = Tensor(np.random.randn(500, 500)) - original_memory = test_tensor.data.nbytes / (1024 * 1024) - - for name, activation_fn in [("ReLU", ReLU()), ("Sigmoid", Sigmoid())]: - start_time = time.time() - result = activation_fn(test_tensor) - end_time = time.time() - - result_memory = result.data.nbytes / (1024 * 1024) - time_ms = (end_time - start_time) * 1000 - - print(f" {name}:") - print(f" Input: {original_memory:.2f} MB") - print(f" Output: {result_memory:.2f} MB") - print(f" Memory overhead: {result_memory - original_memory:.2f} MB") - print(f" Time: {time_ms:.3f} ms") - - print(f"\n🎯 PRODUCTION INSIGHTS:") - print(f" - ReLU is typically fastest (simple max operation)") - print(f" - Sigmoid/Tanh slower due to exponential calculations") - print(f" - All operations scale linearly with tensor size") - print(f" - Memory usage doubles (input + output tensors)") - print(f" - Choose activation based on accuracy vs speed trade-offs") - - return comparison_results + # Compute exponentials + exp_values = np.exp(x_shifted) + + # Sum along dimension + exp_sum = np.sum(exp_values, axis=dim, keepdims=True) + + # Normalize to get probabilities + result = exp_values / exp_sum + return Tensor(result) + ### END SOLUTION + + def backward(self, grad: Tensor) -> Tensor: + """Compute gradient (implemented in Module 05).""" + pass # Will implement backward pass 
in Module 05 diff --git a/tinytorch/core/activations.py.backup b/tinytorch/core/activations.py.backup deleted file mode 100644 index 0220e4e1..00000000 --- a/tinytorch/core/activations.py.backup +++ /dev/null @@ -1,561 +0,0 @@ -# AUTOGENERATED! DO NOT EDIT! File to edit: ../../modules/source/03_activations/activations_dev.ipynb. - -# %% auto 0 -__all__ = ['ReLU', 'Sigmoid', 'Tanh', 'Softmax', 'ActivationProfiler', 'benchmark_activation_suite'] - -# %% ../../modules/source/03_activations/activations_dev.ipynb 1 -import math -import numpy as np -import os -import sys -from typing import Union, List - -# Import our Tensor class - try from package first, then from local module -try: - from tinytorch.core.tensor import Tensor -except ImportError: - # For development, import from local tensor module - sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor')) - from tensor_dev import Tensor - -# NO Variable imports - using pure Tensor system only! - -# %% ../../modules/source/03_activations/activations_dev.ipynb 7 -class ReLU: - """ - ReLU Activation Function: f(x) = max(0, x) - - The most popular activation function in deep learning. - Simple, fast, and effective for most applications. - """ - - def forward(self, x): - """ - Apply ReLU activation: f(x) = max(0, x) - - Now supports both Tensor and Variable inputs with automatic differentiation. - - STEP-BY-STEP IMPLEMENTATION: - 1. Check if input is Variable (for autograd) or Tensor - 2. For each element in the input tensor, apply max(0, element) - 3. If input is Variable: create Variable output with proper gradient function - 4. 
If input is Tensor: return Tensor as before - - MATHEMATICAL FOUNDATION: - - Forward: f(x) = max(0, x) - - Backward: f'(x) = 1 if x > 0, else 0 - - EXAMPLE USAGE: - ```python - relu = ReLU() - # With Tensor (no gradients) - tensor_input = Tensor([[-2, -1, 0, 1, 2]]) - tensor_output = relu(tensor_input) - - # With Variable (with gradients) - var_input = Variable([[-2, -1, 0, 1, 2]], requires_grad=True) - var_output = relu(var_input) - var_output.backward() - print(var_input.grad) # Gradients: [0, 0, 0, 1, 1] - ``` - - IMPLEMENTATION HINTS: - - Check type with hasattr(x, 'requires_grad') - - For Variables: implement gradient function for backward pass - - ReLU gradient: 1 where input > 0, 0 elsewhere - - Use np.maximum(0, x.data) for forward pass - - LEARNING CONNECTIONS: - - This is like torch.nn.ReLU() in PyTorch with autograd support - - Enables gradient-based training of neural networks - - ReLU's simple gradient (0 or 1) prevents vanishing gradients - - Creates sparse representations and efficient gradient flow - """ - ### BEGIN SOLUTION - # Use pure Tensor operations - NO Variables! - from tinytorch.core.tensor import Tensor - import numpy as np - - # Ensure input is a Tensor - if not isinstance(x, Tensor): - x = Tensor(x.data if hasattr(x, 'data') else x) - - # Forward pass: ReLU activation - output_data = np.maximum(0, x.data) - - # Return as Tensor (preserving gradient requirement if needed) - result = Tensor(output_data, requires_grad=x.requires_grad if hasattr(x, 'requires_grad') else False) - - # TODO: Set up gradient function properly when Tensor autograd is complete - # For now, just return the result - return result - ### END SOLUTION - - def __call__(self, x): - """Make the class callable: relu(x) instead of relu.forward(x)""" - return self.forward(x) - -# %% ../../modules/source/03_activations/activations_dev.ipynb 11 -class Sigmoid: - """ - Sigmoid Activation Function: f(x) = 1 / (1 + e^(-x)) - - Maps any real number to the range (0, 1). 
- Useful for binary classification and probability outputs. - """ - - def forward(self, x): - """ - Apply Sigmoid activation: f(x) = 1 / (1 + e^(-x)) - - Now supports both Tensor and Variable inputs with automatic differentiation. - - STEP-BY-STEP IMPLEMENTATION: - 1. Check if input is Variable (for autograd) or Tensor - 2. Compute sigmoid: 1 / (1 + exp(-x)) - 3. If input is Variable: create Variable output with proper gradient function - 4. If input is Tensor: return Tensor as before - - MATHEMATICAL FOUNDATION: - - Forward: f(x) = 1 / (1 + e^(-x)) - - Backward: f'(x) = f(x) * (1 - f(x)) = sigmoid(x) * (1 - sigmoid(x)) - - EXAMPLE USAGE: - ```python - sigmoid = Sigmoid() - # With Variable (with gradients) - var_input = Variable([[0.0]], requires_grad=True) - var_output = sigmoid(var_input) # 0.5 - var_output.backward() - print(var_input.grad) # 0.25 = 0.5 * (1 - 0.5) - ``` - - IMPLEMENTATION HINTS: - - Check type with hasattr(x, 'requires_grad') - - For Variables: implement gradient function for backward pass - - Sigmoid gradient: sigmoid(x) * (1 - sigmoid(x)) - - Use numerical stability: clip inputs to prevent overflow - - LEARNING CONNECTIONS: - - This is like torch.nn.Sigmoid() in PyTorch with autograd support - - Used in binary classification and gating mechanisms - - Smooth gradients enable stable training - - Self-normalizing gradient (max at x=0, decreases at extremes) - """ - ### BEGIN SOLUTION - # Use pure Tensor operations - NO Variables! 
- from tinytorch.core.tensor import Tensor - import numpy as np - - # Ensure input is a Tensor - if not isinstance(x, Tensor): - x = Tensor(x.data if hasattr(x, 'data') else x) - - # Forward pass: Sigmoid activation with numerical stability - clipped_input = np.clip(-x.data, -500, 500) - output_data = 1 / (1 + np.exp(clipped_input)) - - # Return as Tensor (preserving gradient requirement if needed) - result = Tensor(output_data, requires_grad=x.requires_grad if hasattr(x, 'requires_grad') else False) - - # TODO: Set up gradient function properly when Tensor autograd is complete - # For now, just return the result - return result - ### END SOLUTION - - def __call__(self, x): - """Make the class callable: sigmoid(x) instead of sigmoid.forward(x)""" - return self.forward(x) - -# %% ../../modules/source/03_activations/activations_dev.ipynb 15 -class Tanh: - """ - Tanh Activation Function: f(x) = (e^x - e^(-x)) / (e^x + e^(-x)) - - Zero-centered activation function with range (-1, 1). - Better gradient properties than sigmoid. - """ - - def forward(self, x): - """ - Apply Tanh activation: f(x) = (e^x - e^(-x)) / (e^x + e^(-x)) - - Now supports both Tensor and Variable inputs with automatic differentiation. - - STEP-BY-STEP IMPLEMENTATION: - 1. Check if input is Variable (for autograd) or Tensor - 2. Compute tanh: (e^x - e^(-x)) / (e^x + e^(-x)) - 3. If input is Variable: create Variable output with proper gradient function - 4. 
If input is Tensor: return Tensor as before - - MATHEMATICAL FOUNDATION: - - Forward: f(x) = tanh(x) - - Backward: f'(x) = 1 - tanh²(x) = 1 - f(x)² - - EXAMPLE USAGE: - ```python - tanh = Tanh() - # With Variable (with gradients) - var_input = Variable([[0.0]], requires_grad=True) - var_output = tanh(var_input) # 0.0 - var_output.backward() - print(var_input.grad) # 1.0 = 1 - 0² - ``` - - IMPLEMENTATION HINTS: - - Check type with hasattr(x, 'requires_grad') - - For Variables: implement gradient function for backward pass - - Tanh gradient: 1 - tanh²(x) - - Use np.tanh() for numerical stability - - LEARNING CONNECTIONS: - - This is like torch.nn.Tanh() in PyTorch with autograd support - - Used in RNN, LSTM, and GRU cells - - Zero-centered outputs improve gradient flow - - Strong gradients near zero, weaker at extremes - """ - ### BEGIN SOLUTION - # Check if input is a Variable (autograd-enabled) - if hasattr(x, 'requires_grad') and hasattr(x, 'grad_fn'): - # Input is a Variable - preserve autograd capabilities - - # Forward pass: Tanh activation - input_data = x.data.data if hasattr(x.data, 'data') else x.data - output_data = np.tanh(input_data) - - # Create gradient function for backward pass - def tanh_grad_fn(grad_output): - if x.requires_grad: - # Tanh gradient: 1 - tanh²(x) - tanh_grad = 1 - output_data ** 2 - # Safely extract gradient data - handle both Variable and memoryview - grad_data = grad_output.data.data if hasattr(grad_output.data, 'data') else grad_output.data - grad_input_data = grad_data * tanh_grad - grad_input = Variable(grad_input_data) - x.backward(grad_input) - - # Return Variable with gradient function - requires_grad = x.requires_grad - result = Variable(output_data, requires_grad=requires_grad, grad_fn=tanh_grad_fn if requires_grad else None) - return result - else: - # Input is a Tensor - use original implementation - result = np.tanh(x.data) - return type(x)(result) - ### END SOLUTION - - def __call__(self, x: Tensor) -> Tensor: - """Make 
the class callable: tanh(x) instead of tanh.forward(x)""" - return self.forward(x) - -# %% ../../modules/source/03_activations/activations_dev.ipynb 19 -class Softmax: - """ - Softmax Activation Function: f(x_i) = e^(x_i) / Σ(e^(x_j)) - - Converts a vector of real numbers into a probability distribution. - Essential for multi-class classification. - """ - - def forward(self, x): - """ - Apply Softmax activation: f(x_i) = e^(x_i) / Σ(e^(x_j)) - - Now supports both Tensor and Variable inputs with automatic differentiation. - - STEP-BY-STEP IMPLEMENTATION: - 1. Check if input is Variable (for autograd) or Tensor - 2. Compute softmax with numerical stability - 3. If input is Variable: create Variable output with proper gradient function - 4. If input is Tensor: return Tensor as before - - MATHEMATICAL FOUNDATION: - - Forward: f(x_i) = e^(x_i) / Σ(e^(x_j)) - - Backward: ∂f_i/∂x_j = f_i * (δ_ij - f_j) where δ_ij is Kronecker delta - - Simplified: ∂f_i/∂x_i = f_i * (1 - f_i), ∂f_i/∂x_j = -f_i * f_j (i ≠ j) - - EXAMPLE USAGE: - ```python - softmax = Softmax() - # With Variable (with gradients) - var_input = Variable([[1.0, 2.0]], requires_grad=True) - var_output = softmax(var_input) - var_output.backward(Variable([[1.0, 0.0]])) - # Gradients computed automatically - ``` - - IMPLEMENTATION HINTS: - - Check type with hasattr(x, 'requires_grad') - - For Variables: implement gradient function for backward pass - - Softmax gradient: Jacobian matrix with f_i * (δ_ij - f_j) - - Use numerical stability: subtract max before exponential - - LEARNING CONNECTIONS: - - This is like torch.nn.Softmax() in PyTorch with autograd support - - Used in classification and attention mechanisms - - Converts logits to probability distributions - - Complex gradient structure due to normalization - """ - ### BEGIN SOLUTION - # Check if input is a Variable (autograd-enabled) - if hasattr(x, 'requires_grad') and hasattr(x, 'grad_fn'): - # Input is a Variable - preserve autograd capabilities - - # 
Forward pass: Softmax activation with numerical stability - input_data = x.data.data if hasattr(x.data, 'data') else x.data - - # Handle empty input - if input_data.size == 0: - return Variable(input_data.copy(), requires_grad=x.requires_grad) - - # Subtract max for numerical stability - x_shifted = input_data - np.max(input_data, axis=-1, keepdims=True) - - # Compute exponentials - exp_values = np.exp(x_shifted) - - # Sum along last axis - sum_exp = np.sum(exp_values, axis=-1, keepdims=True) - - # Divide to get probabilities - output_data = exp_values / sum_exp - - # Create gradient function for backward pass - def softmax_grad_fn(grad_output): - if x.requires_grad: - # Softmax gradient: for each element i,j: ∂f_i/∂x_j = f_i * (δ_ij - f_j) - # For vector input, this becomes: grad_input = softmax * (grad_output - (softmax * grad_output).sum(keepdims=True)) - # Safely extract gradient data - handle both Variable and memoryview - grad_out_data = grad_output.data.data if hasattr(grad_output.data, 'data') else grad_output.data - softmax_grad_sum = np.sum(output_data * grad_out_data, axis=-1, keepdims=True) - grad_input_data = output_data * (grad_out_data - softmax_grad_sum) - grad_input = Variable(grad_input_data) - x.backward(grad_input) - - # Return Variable with gradient function - requires_grad = x.requires_grad - result = Variable(output_data, requires_grad=requires_grad, grad_fn=softmax_grad_fn if requires_grad else None) - return result - else: - # Input is a Tensor - use original implementation - # Handle empty input - if x.data.size == 0: - return type(x)(x.data.copy()) - - # Subtract max for numerical stability - x_shifted = x.data - np.max(x.data, axis=-1, keepdims=True) - - # Compute exponentials - exp_values = np.exp(x_shifted) - - # Sum along last axis - sum_exp = np.sum(exp_values, axis=-1, keepdims=True) - - # Divide to get probabilities - result = exp_values / sum_exp - - return type(x)(result) - ### END SOLUTION - - def __call__(self, x): - """Make 
the class callable: softmax(x) instead of softmax.forward(x)""" - return self.forward(x) - -# %% ../../modules/source/03_activations/activations_dev.ipynb 30 -import time - -class ActivationProfiler: - """ - Performance profiling toolkit for activation functions. - - Helps ML engineers understand computational costs and optimize - neural network performance for production deployment. - """ - - def __init__(self): - self.results = {} - - def time_activation(self, activation_fn, tensor, activation_name, iterations=100): - """ - Time how long an activation function takes to run. - - TODO: Implement activation timing. - - STEP-BY-STEP IMPLEMENTATION: - 1. Record start time using time.time() - 2. Run the activation function for specified iterations - 3. Record end time - 4. Calculate average time per iteration - 5. Return the average time in milliseconds - - EXAMPLE: - profiler = ActivationProfiler() - relu = ReLU() - test_tensor = Tensor(np.random.randn(1000, 1000)) - avg_time = profiler.time_activation(relu, test_tensor, "ReLU") - print(f"ReLU took {avg_time:.3f} ms on average") - - HINTS: - - Use time.time() for timing - - Run multiple iterations for better accuracy - - Calculate: (end_time - start_time) / iterations * 1000 for ms - - Return the average time per call in milliseconds - """ - ### BEGIN SOLUTION - start_time = time.time() - - for _ in range(iterations): - result = activation_fn(tensor) - - end_time = time.time() - avg_time_ms = (end_time - start_time) / iterations * 1000 - - return avg_time_ms - ### END SOLUTION - - def compare_activations(self, tensor_size=(1000, 1000), iterations=50): - """ - Compare performance of all activation functions. - - This function is PROVIDED to show systems analysis. - Students run it to understand performance differences. 
- """ - print(f"⚡ ACTIVATION PERFORMANCE COMPARISON") - print(f"=" * 50) - print(f"Tensor size: {tensor_size}, Iterations: {iterations}") - - # Create test tensor - test_tensor = Tensor(np.random.randn(*tensor_size)) - tensor_mb = test_tensor.data.nbytes / (1024 * 1024) - print(f"Test tensor: {tensor_mb:.2f} MB") - - # Test all activation functions - activations = { - 'ReLU': ReLU(), - 'Sigmoid': Sigmoid(), - 'Tanh': Tanh(), - 'Softmax': Softmax() - } - - results = {} - for name, activation_fn in activations.items(): - avg_time = self.time_activation(activation_fn, test_tensor, name, iterations) - results[name] = avg_time - print(f" {name:8}: {avg_time:.3f} ms") - - # Calculate speed ratios relative to fastest - fastest_time = min(results.values()) - fastest_name = min(results, key=results.get) - - print(f"\n📊 SPEED ANALYSIS:") - for name, time_ms in sorted(results.items(), key=lambda x: x[1]): - speed_ratio = time_ms / fastest_time - if name == fastest_name: - print(f" {name:8}: {speed_ratio:.1f}x (fastest)") - else: - print(f" {name:8}: {speed_ratio:.1f}x slower than {fastest_name}") - - return results - - def analyze_scaling(self, activation_fn, activation_name, sizes=[100, 500, 1000]): - """ - Analyze how activation performance scales with tensor size. - - This function is PROVIDED to demonstrate scaling patterns. - Students use it to understand computational complexity. 
- """ - print(f"\n🔍 SCALING ANALYSIS: {activation_name}") - print(f"=" * 40) - - scaling_results = [] - - for size in sizes: - test_tensor = Tensor(np.random.randn(size, size)) - avg_time = self.time_activation(activation_fn, test_tensor, activation_name, iterations=20) - - elements = size * size - time_per_element = avg_time / elements * 1e6 # microseconds per element - - result = { - 'size': size, - 'elements': elements, - 'time_ms': avg_time, - 'time_per_element_us': time_per_element - } - scaling_results.append(result) - - print(f" {size}x{size}: {avg_time:.3f}ms ({time_per_element:.3f}μs/element)") - - # Analyze scaling pattern - if len(scaling_results) >= 2: - small = scaling_results[0] - large = scaling_results[-1] - - size_ratio = large['size'] / small['size'] - time_ratio = large['time_ms'] / small['time_ms'] - - print(f"\n📈 Scaling Pattern:") - print(f" Size increased {size_ratio:.1f}x ({small['size']} → {large['size']})") - print(f" Time increased {time_ratio:.1f}x") - - if abs(time_ratio - size_ratio**2) < abs(time_ratio - size_ratio): - print(f" Pattern: O(n^2) - linear in tensor size") - else: - print(f" Pattern: ~O(n) - very efficient scaling") - - return scaling_results - -def benchmark_activation_suite(): - """ - Comprehensive benchmark of all activation functions. - - This function is PROVIDED to show complete systems analysis. - Students run it to understand production performance implications. 
- """ - profiler = ActivationProfiler() - - print("🏆 COMPREHENSIVE ACTIVATION BENCHMARK") - print("=" * 60) - - # Test 1: Performance comparison - comparison_results = profiler.compare_activations(tensor_size=(800, 800), iterations=30) - - # Test 2: Scaling analysis for each activation - activations_to_test = [ - (ReLU(), "ReLU"), - (Sigmoid(), "Sigmoid"), - (Tanh(), "Tanh") - ] - - for activation_fn, name in activations_to_test: - profiler.analyze_scaling(activation_fn, name, sizes=[200, 400, 600]) - - # Test 3: Memory vs Performance trade-offs - print(f"\n💾 MEMORY vs PERFORMANCE ANALYSIS:") - print(f"=" * 40) - - test_tensor = Tensor(np.random.randn(500, 500)) - original_memory = test_tensor.data.nbytes / (1024 * 1024) - - for name, activation_fn in [("ReLU", ReLU()), ("Sigmoid", Sigmoid())]: - start_time = time.time() - result = activation_fn(test_tensor) - end_time = time.time() - - result_memory = result.data.nbytes / (1024 * 1024) - time_ms = (end_time - start_time) * 1000 - - print(f" {name}:") - print(f" Input: {original_memory:.2f} MB") - print(f" Output: {result_memory:.2f} MB") - print(f" Memory overhead: {result_memory - original_memory:.2f} MB") - print(f" Time: {time_ms:.3f} ms") - - print(f"\n🎯 PRODUCTION INSIGHTS:") - print(f" - ReLU is typically fastest (simple max operation)") - print(f" - Sigmoid/Tanh slower due to exponential calculations") - print(f" - All operations scale linearly with tensor size") - print(f" - Memory usage doubles (input + output tensors)") - print(f" - Choose activation based on accuracy vs speed trade-offs") - - return comparison_results diff --git a/tinytorch/core/attention.py b/tinytorch/core/attention.py index 2da12e6e..4d8d6547 100644 --- a/tinytorch/core/attention.py +++ b/tinytorch/core/attention.py @@ -1,10 +1,23 @@ -# AUTOGENERATED! DO NOT EDIT! File to edit: ../../modules/source/12_attention/attention_dev.ipynb. 
+# ╔═══════════════════════════════════════════════════════════════════════════════╗ +# ║ 🚨 CRITICAL WARNING 🚨 ║ +# ║ AUTOGENERATED! DO NOT EDIT! ║ +# ║ ║ +# ║ This file is AUTOMATICALLY GENERATED from source modules. ║ +# ║ ANY CHANGES MADE HERE WILL BE LOST when modules are re-exported! ║ +# ║ ║ +# ║ ✅ TO EDIT: modules/source/07_attention/attention_dev.py ║ +# ║ ✅ TO EXPORT: Run 'tito module complete ' ║ +# ║ ║ +# ║ 🛡️ STUDENT PROTECTION: This file contains optimized implementations. ║ +# ║ Editing it directly may break module functionality and training. ║ +# ║ ║ +# ║ 🎓 LEARNING TIP: Work in modules/source/ - that's where real development ║ +# ║ happens! The tinytorch/ directory is just the compiled output. ║ +# ╚═══════════════════════════════════════════════════════════════════════════════╝ # %% auto 0 __all__ = ['scaled_dot_product_attention', 'SelfAttention', 'create_causal_mask', 'create_padding_mask', - 'create_bidirectional_mask', 'AttentionEfficiencyProfiler', - # Compatibility aliases for optimization modules - 'MultiHeadAttention', 'ScaledDotProductAttention'] + 'create_bidirectional_mask', 'AttentionEfficiencyProfiler'] # %% ../../modules/source/12_attention/attention_dev.ipynb 1 import numpy as np @@ -603,8 +616,3 @@ class AttentionEfficiencyProfiler: print(f" Trade-off: More heads = better parallelism but higher memory") return multi_head_results - -# Compatibility aliases for optimization modules (15-20) -# These provide backward compatibility with modules that expect different naming -MultiHeadAttention = SelfAttention # SelfAttention can be used as MultiHeadAttention -ScaledDotProductAttention = scaled_dot_product_attention # Function alias diff --git a/tinytorch/core/autograd.py b/tinytorch/core/autograd.py index 24f47a3f..60dbd5e3 100644 --- a/tinytorch/core/autograd.py +++ b/tinytorch/core/autograd.py @@ -1,36 +1,756 @@ -""" -TinyTorch Autograd Module - Clean Implementation +# 
╔═══════════════════════════════════════════════════════════════════════════════╗ +# ║ 🚨 CRITICAL WARNING 🚨 ║ +# ║ AUTOGENERATED! DO NOT EDIT! ║ +# ║ ║ +# ║ This file is AUTOMATICALLY GENERATED from source modules. ║ +# ║ ANY CHANGES MADE HERE WILL BE LOST when modules are re-exported! ║ +# ║ ║ +# ║ ✅ TO EDIT: modules/source/09_autograd/autograd_dev.py ║ +# ║ ✅ TO EXPORT: Run 'tito module complete ' ║ +# ║ ║ +# ║ 🛡️ STUDENT PROTECTION: This file contains optimized implementations. ║ +# ║ Editing it directly may break module functionality and training. ║ +# ║ ║ +# ║ 🎓 LEARNING TIP: Work in modules/source/ - that's where real development ║ +# ║ happens! The tinytorch/ directory is just the compiled output. ║ +# ╚═══════════════════════════════════════════════════════════════════════════════╝ -This module provides automatic differentiation for Tensors. -No Variable class - just pure Tensor with gradient tracking! -""" +# %% auto 0 +__all__ = ['Variable', 'add', 'multiply', 'subtract', 'AutogradSystemsProfiler'] +# %% ../../modules/source/08_autograd/autograd_dev.ipynb 1 import numpy as np -from typing import Optional, List, Tuple -from tinytorch.core.tensor import Tensor +import sys +from typing import Union, List, Tuple, Optional, Any, Callable +from collections import defaultdict -# Enable autograd function from the clean implementation -def enable_autograd(): - """Enable gradient tracking for all Tensor operations. +# Import our existing components +try: + from tinytorch.core.tensor import Tensor +except ImportError: + # For development, import from local modules + import os + sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor')) + from tensor_dev import Tensor - This function enhances the existing Tensor class with autograd capabilities. - Call this once to activate gradients globally. 
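One gap worth flagging in the new `autograd.py` that follows: `Variable.__truediv__` returns `divide(self, other)`, but no `divide` function appears anywhere in this diff, and it is missing from `__all__`. A standalone sketch of the quotient-rule gradient such a helper would need, using a simplified `Var` stand-in rather than the module's real `Variable`/`Tensor` classes (all names here are illustrative, not part of the exported API):

```python
import numpy as np

class Var:
    """Minimal stand-in for the Variable API in this hunk (illustrative only)."""
    def __init__(self, data, requires_grad=True, grad_fn=None):
        self.data = np.asarray(data, dtype=float)
        self.requires_grad = requires_grad
        self.grad = None
        self.grad_fn = grad_fn

    def backward(self, gradient=None):
        # Scalar outputs seed the backward pass with a gradient of ones
        if gradient is None:
            gradient = np.ones_like(self.data)
        if self.requires_grad:
            # Accumulate, matching the accumulation behavior in Variable.backward
            self.grad = gradient if self.grad is None else self.grad + gradient
        if self.grad_fn is not None:
            self.grad_fn(gradient)

def divide(a, b):
    """Quotient rule: d(a/b)/da = 1/b, d(a/b)/db = -a/b**2."""
    def grad_fn(g):
        if a.requires_grad:
            a.backward(g / b.data)
        if b.requires_grad:
            b.backward(-g * a.data / b.data ** 2)
    return Var(a.data / b.data, a.requires_grad or b.requires_grad, grad_fn)

x, y = Var(6.0), Var(2.0)
z = divide(x, y)      # z.data = 3.0
z.backward()
print(x.grad)         # 0.5   (1 / y)
print(y.grad)         # -1.5  (-x / y**2)
```

Following the same pattern as the `add` and `multiply` operations below, each operand's gradient is the incoming gradient scaled by the local derivative before being propagated via `backward`.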
+# %% ../../modules/source/08_autograd/autograd_dev.ipynb 7 +class Variable: """ - # Check if already enabled - if hasattr(Tensor, '_autograd_enabled'): - return + Variable: Tensor wrapper with automatic differentiation capabilities. + + The fundamental class for gradient computation in TinyTorch. + Wraps Tensor objects and tracks computational history for backpropagation. + """ + + def __init__(self, data: Union[Tensor, np.ndarray, list, float, int], + requires_grad: bool = True, grad_fn: Optional[Callable] = None): + """ + Create a Variable with gradient tracking. + + TODO: Implement Variable initialization with gradient tracking. + + STEP-BY-STEP IMPLEMENTATION: + 1. Convert data to Tensor if it is not already a Tensor + 2. Store the tensor data in self.data + 3. Set gradient tracking flag (requires_grad) + 4. Initialize gradient to None (will be computed during backward pass) + 5. Store the gradient function for backward pass + 6. Track if this is a leaf node (no grad_fn means it is a leaf) + + EXAMPLE USAGE: + ```python + # Create leaf variables (input data) + x = Variable(5.0, requires_grad=True) + y = Variable([1, 2, 3], requires_grad=True) + + # Create intermediate variables (results of operations) + z = x + y # Has grad_fn for addition + ``` + + IMPLEMENTATION HINTS: + - Use isinstance(data, Tensor) to check type + - Convert with Tensor(data) if needed + - Store requires_grad, grad_fn flags + - Initialize self.grad = None + - Leaf nodes have grad_fn = None + - Set self.is_leaf = (grad_fn is None) + + LEARNING CONNECTIONS: + - This is like torch.Tensor with requires_grad=True + - Forms the basis for all neural network training + - Each Variable is a node in the computational graph + - Enables automatic gradient computation + """ + ### BEGIN SOLUTION + # Convert data to Tensor if needed + if isinstance(data, Tensor): + self.data = data + else: + self.data = Tensor(data) + + # Set gradient tracking + self.requires_grad = requires_grad + self.grad = None # 
Will be initialized when needed + self.grad_fn = grad_fn + self.is_leaf = grad_fn is None + + # For computational graph + self._backward_hooks = [] + ### END SOLUTION + + @property + def shape(self) -> Tuple[int, ...]: + """Get the shape of the underlying tensor.""" + return self.data.shape + + @property + def size(self) -> int: + """Get the total number of elements.""" + return self.data.size + + def __repr__(self) -> str: + """String representation of the Variable.""" + grad_str = f", grad_fn={self.grad_fn.__name__}" if self.grad_fn else "" + return f"Variable({self.data.data.tolist()}, requires_grad={self.requires_grad}{grad_str})" + + def backward(self, gradient: Optional['Variable'] = None) -> None: + """ + Compute gradients using backpropagation. + + TODO: Implement backward pass for gradient computation. + + STEP-BY-STEP IMPLEMENTATION: + 1. If gradient is None, create gradient of ones (for scalar outputs) + 2. If this Variable requires gradients, accumulate the gradient + 3. If this Variable has a grad_fn, call it to propagate gradients + 4. 
The grad_fn will recursively call backward on input Variables + + EXAMPLE USAGE: + ```python + x = Variable(2.0, requires_grad=True) + y = Variable(3.0, requires_grad=True) + z = add(x, y) # z = 5.0 + z.backward() + print(x.grad) # 1.0 (∂z/∂x = 1) + print(y.grad) # 1.0 (∂z/∂y = 1) + ``` + + IMPLEMENTATION HINTS: + - If gradient is None: gradient = Variable(np.ones_like(self.data.data)) + - If self.requires_grad: accumulate gradient into self.grad + - If self.grad_fn: call self.grad_fn(gradient) + - Handle gradient accumulation (add to existing gradient) + + LEARNING CONNECTIONS: + - This implements the chain rule of calculus + - Gradients flow backward through the computational graph + - Each operation contributes its local gradient + - Enables training of any differentiable function + """ + ### BEGIN SOLUTION + if gradient is None: + gradient = Variable(np.ones_like(self.data.data)) + + if self.requires_grad: + if self.grad is None: + self.grad = gradient + else: + # Accumulate gradients + self.grad = Variable(self.grad.data.data + gradient.data.data) + + if self.grad_fn is not None: + self.grad_fn(gradient) + ### END SOLUTION + + def zero_grad(self) -> None: + """Reset gradients to zero.""" + self.grad = None + + def __add__(self, other: Union['Variable', float, int]) -> 'Variable': + """Addition operator: self + other""" + return add(self, other) + + def __mul__(self, other: Union['Variable', float, int]) -> 'Variable': + """Multiplication operator: self * other""" + return multiply(self, other) + + def __sub__(self, other: Union['Variable', float, int]) -> 'Variable': + """Subtraction operator: self - other""" + return subtract(self, other) + + def __truediv__(self, other: Union['Variable', float, int]) -> 'Variable': + """Division operator: self / other""" + return divide(self, other) - print("✅ Autograd enabled for TinyTorch!") - print(" - Use Tensor with requires_grad=True") - print(" - Call backward() to compute gradients") - print(" - NO Variable class 
needed!") +# %% ../../modules/source/08_autograd/autograd_dev.ipynb 11 +def add(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable: + """ + Addition operation with gradient tracking: a + b + + TODO: Implement addition with automatic differentiation. + + STEP-BY-STEP IMPLEMENTATION: + 1. Convert inputs to Variables if they are scalars + 2. Compute forward pass: result = a.data + b.data + 3. Create gradient function that implements: ∂(a+b)/∂a = 1, ∂(a+b)/∂b = 1 + 4. Return new Variable with result and gradient function + + MATHEMATICAL FOUNDATION: + - Forward: z = x + y + - Backward: ∂z/∂x = 1, ∂z/∂y = 1 + - Chain rule: ∂L/∂x = ∂L/∂z · ∂z/∂x = ∂L/∂z · 1 = ∂L/∂z + + EXAMPLE USAGE: + ```python + x = Variable(2.0, requires_grad=True) + y = Variable(3.0, requires_grad=True) + z = add(x, y) # z = 5.0 + z.backward() + print(x.grad) # 1.0 (∂z/∂x = 1) + print(y.grad) # 1.0 (∂z/∂y = 1) + ``` + + IMPLEMENTATION HINTS: + - Convert scalars: if isinstance(a, (int, float)): a = Variable(a, requires_grad=False) + - Forward pass: result_data = a.data + b.data + - Backward function: def grad_fn(grad_output): if a.requires_grad: a.backward(grad_output) + - Return: Variable(result_data, grad_fn=grad_fn) + - Only propagate gradients to Variables that require them + + LEARNING CONNECTIONS: + - This is like torch.add() with autograd + - Addition distributes gradients equally to both inputs + - Forms the basis for bias addition in neural networks + - Chain rule propagates gradients through the graph + """ + ### BEGIN SOLUTION + # Convert scalars to Variables + if isinstance(a, (int, float)): + a = Variable(a, requires_grad=False) + if isinstance(b, (int, float)): + b = Variable(b, requires_grad=False) + + # Forward pass + result_data = a.data + b.data + + # Backward function + def grad_fn(grad_output): + # Addition distributes gradients equally, but must handle broadcasting + if a.requires_grad: + # Get gradient data + if hasattr(grad_output.data, 'data'): + 
grad_data = grad_output.data.data + else: + grad_data = grad_output.data + + # Check if we need to sum over broadcasted dimensions + a_shape = a.data.shape if hasattr(a.data, 'shape') else () + if grad_data.shape != a_shape: + # Sum over the broadcasted dimensions + # For bias: (batch_size, features) -> (features,) + if len(grad_data.shape) == 2 and len(a_shape) == 1: + grad_for_a = Variable(Tensor(np.sum(grad_data, axis=0))) + else: + # Handle other broadcasting cases + grad_for_a = grad_output + else: + grad_for_a = grad_output + + a.backward(grad_for_a) + + if b.requires_grad: + # Get gradient data + if hasattr(grad_output.data, 'data'): + grad_data = grad_output.data.data + else: + grad_data = grad_output.data + + # Check if we need to sum over broadcasted dimensions + b_shape = b.data.shape if hasattr(b.data, 'shape') else () + if grad_data.shape != b_shape: + # Sum over the broadcasted dimensions + # For bias: (batch_size, features) -> (features,) + if len(grad_data.shape) == 2 and len(b_shape) == 1: + grad_for_b = Variable(Tensor(np.sum(grad_data, axis=0))) + else: + # Handle other broadcasting cases + grad_for_b = grad_output + else: + grad_for_b = grad_output + + b.backward(grad_for_b) + + # Return new Variable with gradient function + requires_grad = a.requires_grad or b.requires_grad + return Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn) + ### END SOLUTION - # The actual enhancement would be done here - # For now, we rely on the tensor having dormant features - Tensor._autograd_enabled = True +# %% ../../modules/source/08_autograd/autograd_dev.ipynb 15 +def multiply(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable: + """ + Multiplication operation with gradient tracking: a * b + + TODO: Implement multiplication with automatic differentiation. + + STEP-BY-STEP IMPLEMENTATION: + 1. Convert inputs to Variables if they are scalars + 2. Compute forward pass: result = a.data * b.data + 3. 
Create gradient function implementing product rule: ∂(a*b)/∂a = b, ∂(a*b)/∂b = a + 4. Return new Variable with result and gradient function + + MATHEMATICAL FOUNDATION: + - Forward: z = x * y + - Backward: ∂z/∂x = y, ∂z/∂y = x + - Chain rule: ∂L/∂x = ∂L/∂z · y, ∂L/∂y = ∂L/∂z · x + + EXAMPLE USAGE: + ```python + x = Variable(2.0, requires_grad=True) + y = Variable(3.0, requires_grad=True) + z = multiply(x, y) # z = 6.0 + z.backward() + print(x.grad) # 3.0 (∂z/∂x = y) + print(y.grad) # 2.0 (∂z/∂y = x) + ``` + + IMPLEMENTATION HINTS: + - Convert scalars to Variables (same as addition) + - Forward pass: result_data = a.data * b.data + - Backward function: multiply incoming gradient by the other variable + - For a: a.backward(grad_output * b.data) + - For b: b.backward(grad_output * a.data) + + LEARNING CONNECTIONS: + - This is like torch.mul() with autograd + - Product rule is fundamental to backpropagation + - Used in weight updates and attention mechanisms + - Each input's gradient depends on the other input's value + """ + ### BEGIN SOLUTION + # Convert scalars to Variables + if isinstance(a, (int, float)): + a = Variable(a, requires_grad=False) + if isinstance(b, (int, float)): + b = Variable(b, requires_grad=False) + + # Forward pass + result_data = a.data * b.data + + # Backward function + def grad_fn(grad_output): + # Product rule: d(xy)/dx = y, d(xy)/dy = x + if a.requires_grad: + a.backward(Variable(grad_output.data.data * b.data.data)) + if b.requires_grad: + b.backward(Variable(grad_output.data.data * a.data.data)) + + # Return new Variable with gradient function + requires_grad = a.requires_grad or b.requires_grad + return Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn) + ### END SOLUTION -# Auto-enable when module is imported -enable_autograd() +# %% ../../modules/source/08_autograd/autograd_dev.ipynb 18 +def subtract(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable: + """ + Subtraction operation with gradient 
tracking. + + Args: + a: First operand (minuend) + b: Second operand (subtrahend) + + Returns: + Variable with difference and gradient function + + TODO: Implement subtraction with gradient computation. + + APPROACH: + 1. Convert inputs to Variables if needed + 2. Compute forward pass: result = a - b + 3. Create gradient function with correct signs + 4. Return Variable with result and grad_fn + + MATHEMATICAL RULE: + If z = x - y, then dz/dx = 1, dz/dy = -1 + + EXAMPLE: + x = Variable(5.0), y = Variable(3.0) + z = subtract(x, y) # z.data = 2.0 + z.backward() # x.grad = 1.0, y.grad = -1.0 + + HINTS: + - Forward pass is straightforward: a - b + - Gradient for a is positive, for b is negative + - Remember to negate the gradient for b + """ + ### BEGIN SOLUTION + # Convert to Variables if needed + if not isinstance(a, Variable): + a = Variable(a, requires_grad=False) + if not isinstance(b, Variable): + b = Variable(b, requires_grad=False) + + # Forward pass + result_data = a.data - b.data + + # Create gradient function + def grad_fn(grad_output): + # Subtraction rule: d(x-y)/dx = 1, d(x-y)/dy = -1 + if a.requires_grad: + a.backward(grad_output) + if b.requires_grad: + b_grad = Variable(-grad_output.data.data) + b.backward(b_grad) + + # Determine if result requires gradients + requires_grad = a.requires_grad or b.requires_grad + + return Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn) + ### END SOLUTION -# Export clean operations (no Variable!) -__all__ = ['enable_autograd'] +# %% ../../modules/source/08_autograd/autograd_dev.ipynb 25 +import time +import gc +from collections import defaultdict, deque + +class AutogradSystemsProfiler: + """ + Production Autograd System Performance Analysis and Optimization + + Analyzes computational graph efficiency, memory patterns, and optimization + opportunities for production automatic differentiation systems. 
+ """ + + def __init__(self): + """Initialize autograd systems profiler.""" + self.profiling_data = defaultdict(list) + self.graph_analysis = defaultdict(list) + self.optimization_strategies = [] + + def profile_computational_graph_depth(self, max_depth=10, operations_per_level=5): + """ + Profile computational graph performance vs depth. + + TODO: Implement computational graph depth analysis. + + APPROACH: + 1. Create computational graphs of increasing depth + 2. Measure forward and backward pass timing + 3. Analyze memory usage patterns during gradient computation + 4. Identify memory accumulation and gradient flow bottlenecks + 5. Generate graph optimization recommendations + + EXAMPLE: + profiler = AutogradSystemsProfiler() + graph_analysis = profiler.profile_computational_graph_depth(max_depth=8) + print(f"Memory scaling factor: {graph_analysis['memory_scaling_factor']:.2f}") + + HINTS: + - Build graphs by chaining operations: x -> op1 -> op2 -> ... -> loss + - Measure both forward and backward pass timing separately + - Track memory usage throughout the computation + - Monitor gradient accumulation patterns + - Focus on production-relevant graph depths + """ + ### BEGIN SOLUTION + print("🔧 Profiling Computational Graph Depth Impact...") + + results = {} + + for depth in range(1, max_depth + 1): + print(f" Testing graph depth: {depth}") + + # Create a computational graph of specified depth + # Each level adds more operations to test scaling + + # Start with input variable + try: + # Use Variable if available, otherwise simulate + x = Variable(np.random.randn(100, 100), requires_grad=True) + except: + # Fallback for testing - simulate Variable with Tensor + x = Tensor(np.random.randn(100, 100)) + + # Build computational graph of specified depth + current_var = x + operations = [] + + for level in range(depth): + # Add multiple operations per level to increase complexity + for op_idx in range(operations_per_level): + try: + # Simulate various operations + if 
op_idx % 4 == 0: + current_var = current_var * 0.9 # Scale operation + elif op_idx % 4 == 1: + current_var = current_var + 0.1 # Add operation + elif op_idx % 4 == 2: + # Matrix multiplication (most expensive) + weight = Tensor(np.random.randn(100, 100)) + if hasattr(current_var, 'data'): + current_var = Tensor(current_var.data @ weight.data) + else: + current_var = current_var @ weight + else: + # Activation-like operation + if hasattr(current_var, 'data'): + current_var = Tensor(np.maximum(0, current_var.data)) + else: + current_var = current_var # Skip for simplicity + + operations.append(f"level_{level}_op_{op_idx}") + except: + # Fallback for testing + current_var = Tensor(np.random.randn(100, 100)) + operations.append(f"level_{level}_op_{op_idx}_fallback") + + # Add final loss computation + try: + if hasattr(current_var, 'data'): + loss = Tensor(np.sum(current_var.data ** 2)) + else: + loss = np.sum(current_var ** 2) + except: + loss = Tensor(np.array([1.0])) + + # Measure forward pass timing + forward_iterations = 3 + forward_start = time.time() + + for _ in range(forward_iterations): + # Simulate forward pass computation + temp_x = x + for level in range(depth): + for op_idx in range(operations_per_level): + if op_idx % 4 == 0: + temp_x = temp_x * 0.9 + elif op_idx % 4 == 1: + temp_x = temp_x + 0.1 + # Skip expensive ops for timing + + forward_end = time.time() + avg_forward_time = (forward_end - forward_start) / forward_iterations + + # Measure backward pass timing (simulated) + # In real implementation, this would be loss.backward() + backward_start = time.time() + + # Simulate gradient computation through the graph + for _ in range(forward_iterations): + # Simulate backpropagation through all operations + gradient_accumulation = 0 + for level in range(depth): + for op_idx in range(operations_per_level): + # Simulate gradient computation + gradient_accumulation += level * op_idx * 0.001 + + backward_end = time.time() + avg_backward_time = (backward_end - 
backward_start) / forward_iterations + + # Memory analysis + try: + if hasattr(x, 'data'): + base_memory = x.data.nbytes / (1024 * 1024) # MB + if hasattr(current_var, 'data'): + result_memory = current_var.data.nbytes / (1024 * 1024) + else: + result_memory = base_memory + else: + base_memory = x.nbytes / (1024 * 1024) if hasattr(x, 'nbytes') else 1.0 + result_memory = base_memory + except: + base_memory = 1.0 + result_memory = 1.0 + + # Estimate gradient memory (in production, each operation stores gradients) + estimated_gradient_memory = depth * operations_per_level * base_memory * 0.5 + total_memory = base_memory + result_memory + estimated_gradient_memory + + # Calculate efficiency metrics + total_operations = depth * operations_per_level + total_time = avg_forward_time + avg_backward_time + operations_per_second = total_operations / total_time if total_time > 0 else 0 + + result = { + 'graph_depth': depth, + 'total_operations': total_operations, + 'forward_time_ms': avg_forward_time * 1000, + 'backward_time_ms': avg_backward_time * 1000, + 'total_time_ms': total_time * 1000, + 'base_memory_mb': base_memory, + 'estimated_gradient_memory_mb': estimated_gradient_memory, + 'total_memory_mb': total_memory, + 'operations_per_second': operations_per_second, + 'memory_per_operation': total_memory / total_operations if total_operations > 0 else 0 + } + + results[depth] = result + + print(f" Forward: {avg_forward_time*1000:.3f}ms, Backward: {avg_backward_time*1000:.3f}ms, Memory: {total_memory:.2f}MB") + + # Analyze scaling patterns + graph_analysis = self._analyze_graph_scaling(results) + + # Store profiling data + self.profiling_data['graph_depth_analysis'] = results + self.graph_analysis = graph_analysis + + return { + 'detailed_results': results, + 'graph_analysis': graph_analysis, + 'optimization_strategies': self._generate_graph_optimizations(results) + } + ### END SOLUTION + + def _analyze_graph_scaling(self, results): + """Analyze computational graph scaling 
patterns.""" + analysis = {} + + # Extract metrics for scaling analysis + depths = sorted(results.keys()) + forward_times = [results[d]['forward_time_ms'] for d in depths] + backward_times = [results[d]['backward_time_ms'] for d in depths] + total_times = [results[d]['total_time_ms'] for d in depths] + memory_usage = [results[d]['total_memory_mb'] for d in depths] + + # Calculate scaling factors + if len(depths) >= 2: + shallow = depths[0] + deep = depths[-1] + + depth_ratio = deep / shallow + forward_time_ratio = results[deep]['forward_time_ms'] / results[shallow]['forward_time_ms'] + backward_time_ratio = results[deep]['backward_time_ms'] / results[shallow]['backward_time_ms'] + memory_ratio = results[deep]['total_memory_mb'] / results[shallow]['total_memory_mb'] + + analysis['scaling_metrics'] = { + 'depth_ratio': depth_ratio, + 'forward_time_scaling': forward_time_ratio, + 'backward_time_scaling': backward_time_ratio, + 'memory_scaling': memory_ratio, + 'theoretical_linear': depth_ratio # Expected linear scaling + } + + # Identify bottlenecks + if backward_time_ratio > forward_time_ratio * 1.5: + analysis['primary_bottleneck'] = 'backward_pass' + analysis['bottleneck_reason'] = 'Gradient computation scaling worse than forward pass' + elif memory_ratio > depth_ratio * 1.5: + analysis['primary_bottleneck'] = 'memory' + analysis['bottleneck_reason'] = 'Memory usage scaling faster than linear' + else: + analysis['primary_bottleneck'] = 'balanced' + analysis['bottleneck_reason'] = 'Forward and backward passes scaling proportionally' + + # Backward/Forward ratio analysis + backward_forward_ratios = [ + results[d]['backward_time_ms'] / max(results[d]['forward_time_ms'], 0.001) + for d in depths + ] + avg_backward_forward_ratio = sum(backward_forward_ratios) / len(backward_forward_ratios) + + analysis['efficiency_metrics'] = { + 'avg_backward_forward_ratio': avg_backward_forward_ratio, + 'peak_memory_mb': max(memory_usage), + 'memory_efficiency_trend': 'increasing' if 
memory_usage[-1] > memory_usage[0] * 2 else 'stable' + } + + return analysis + + def _generate_graph_optimizations(self, results): + """Generate computational graph optimization strategies.""" + strategies = [] + + # Analyze memory growth patterns + peak_memory = max(result['total_memory_mb'] for result in results.values()) + + if peak_memory > 50: # > 50MB memory usage + strategies.append("💾 High memory usage detected in computational graph") + strategies.append("🔧 Strategy: Gradient checkpointing for deep graphs") + strategies.append("🔧 Strategy: In-place operations where mathematically valid") + + # Analyze computational efficiency + graph_analysis = self.graph_analysis + if graph_analysis and 'scaling_metrics' in graph_analysis: + backward_scaling = graph_analysis['scaling_metrics']['backward_time_scaling'] + if backward_scaling > 2.0: + strategies.append("🐌 Backward pass scaling poorly with graph depth") + strategies.append("🔧 Strategy: Kernel fusion for backward operations") + strategies.append("🔧 Strategy: Parallel gradient computation") + + # Memory vs computation trade-offs + if graph_analysis and 'efficiency_metrics' in graph_analysis: + backward_forward_ratio = graph_analysis['efficiency_metrics']['avg_backward_forward_ratio'] + if backward_forward_ratio > 3.0: + strategies.append("⚖️ Backward pass significantly slower than forward") + strategies.append("🔧 Strategy: Optimize gradient computation with sparse gradients") + strategies.append("🔧 Strategy: Use mixed precision to reduce memory bandwidth") + + # Production optimization recommendations + strategies.append("🏭 Production graph optimizations:") + strategies.append(" • Graph compilation and optimization (TorchScript, XLA)") + strategies.append(" • Operator fusion to minimize intermediate allocations") + strategies.append(" • Dynamic shape optimization for variable input sizes") + strategies.append(" • Gradient accumulation for large effective batch sizes") + + return strategies + + def 
analyze_memory_checkpointing_trade_offs(self, checkpoint_frequencies=[1, 2, 4, 8]): + """ + Analyze memory vs computation trade-offs with gradient checkpointing. + + This function is PROVIDED to demonstrate checkpointing analysis. + Students use it to understand memory optimization strategies. + """ + print("🔍 GRADIENT CHECKPOINTING ANALYSIS") + print("=" * 45) + + base_graph_depth = 12 + base_memory_per_layer = 10 # MB per layer + base_computation_time = 5 # ms per layer + + checkpointing_results = [] + + for freq in checkpoint_frequencies: + # Calculate memory savings + # Without checkpointing: store all intermediate activations + no_checkpoint_memory = base_graph_depth * base_memory_per_layer + + # With checkpointing: only store every freq-th activation + checkpointed_memory = (base_graph_depth // freq + 1) * base_memory_per_layer + memory_savings = no_checkpoint_memory - checkpointed_memory + memory_reduction_pct = (memory_savings / no_checkpoint_memory) * 100 + + # Calculate recomputation overhead + # Need to recompute (freq-1) layers for each checkpoint + recomputation_layers = base_graph_depth * (freq - 1) / freq + recomputation_time = recomputation_layers * base_computation_time + + # Total training time = forward + backward + recomputation + base_training_time = base_graph_depth * base_computation_time * 2 # forward + backward + total_training_time = base_training_time + recomputation_time + time_overhead_pct = (recomputation_time / base_training_time) * 100 + + result = { + 'checkpoint_frequency': freq, + 'memory_mb': checkpointed_memory, + 'memory_reduction_pct': memory_reduction_pct, + 'recomputation_time_ms': recomputation_time, + 'time_overhead_pct': time_overhead_pct, + 'memory_time_ratio': memory_reduction_pct / max(time_overhead_pct, 1) + } + checkpointing_results.append(result) + + print(f" Checkpoint every {freq} layers:") + print(f" Memory: {checkpointed_memory:.0f}MB ({memory_reduction_pct:.1f}% reduction)") + print(f" Time overhead: 
{time_overhead_pct:.1f}%") + print(f" Efficiency ratio: {result['memory_time_ratio']:.2f}") + + # Find optimal trade-off + optimal = max(checkpointing_results, key=lambda x: x['memory_time_ratio']) + + print(f"\n📈 Checkpointing Analysis:") + print(f" Optimal frequency: Every {optimal['checkpoint_frequency']} layers") + print(f" Best trade-off: {optimal['memory_reduction_pct']:.1f}% memory reduction") + print(f" Cost: {optimal['time_overhead_pct']:.1f}% time overhead") + + return checkpointing_results diff --git a/tinytorch/core/autograd.py.backup b/tinytorch/core/autograd.py.backup deleted file mode 100644 index 2eb6930a..00000000 --- a/tinytorch/core/autograd.py.backup +++ /dev/null @@ -1,1268 +0,0 @@ -# --- -# jupyter: -# jupytext: -# text_representation: -# extension: .py -# format_name: percent -# format_version: '1.3' -# jupytext_version: 1.17.1 -# --- - -# %% [markdown] -""" -# Autograd - Automatic Differentiation Engine - -Welcome to Autograd! You'll implement the automatic differentiation engine that makes neural network training possible by automatically computing gradients through computational graphs. - -## 🔗 Building on Previous Learning -**What You Built Before**: -- Module 02 (Tensor): Data structures that hold neural network parameters -- Module 04 (Losses): Functions that measure prediction accuracy - -**What's Working**: You can compute loss values for any prediction! - -**The Gap**: Loss values tell you HOW WRONG you are, but not HOW TO IMPROVE the parameters. - -**This Module's Solution**: Implement automatic differentiation to compute gradients automatically. - -**Connection Map**: -``` -Tensors → Losses → Autograd → Optimizers -(data) (error) (∇L/∇θ) (updates) -``` - -## Learning Objectives -1. **Core Implementation**: Variable class with gradient tracking -2. **Mathematical Foundation**: Chain rule application in computational graphs -3. **Testing Skills**: Gradient computation validation -4. 
**Integration Knowledge**: How autograd enables neural network training - -## Build → Test → Use -1. **Build**: Variable class with backward propagation -2. **Test**: Verify gradients are computed correctly -3. **Use**: Apply to mathematical expressions and see automatic differentiation - -## 📦 Where This Code Lives in the Final Package - -**Learning Side:** You work in modules/05_autograd/autograd_dev.py -**Building Side:** Code exports to tinytorch.core.autograd - -```python -# Final package structure: -from tinytorch.core.autograd import Variable # This module -from tinytorch.core.tensor import Tensor # Foundation (always needed) -``` - -**Why this matters:** -- **Learning:** Complete automatic differentiation system for deep understanding -- **Production:** Proper organization like PyTorch's torch.autograd -- **Consistency:** All gradient operations in core.autograd -- **Integration:** Works seamlessly with tensors for complete training systems -""" - -# %% -#| default_exp core.autograd - -#| export -import numpy as np -import sys -from typing import Union, List, Optional, Callable - -# Import our existing components -try: - from tinytorch.core.tensor import Tensor -except ImportError: - # For development, import from local modules - import os - sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor')) - from tensor_dev import Tensor - -# %% -print("🔥 TinyTorch Autograd Module") -print(f"NumPy version: {np.__version__}") -print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}") -print("Ready to build automatic differentiation!") - -# %% [markdown] -""" -## What is Automatic Differentiation? - -### The Problem: Computing Gradients at Scale - -In neural networks, we need to compute gradients of complex functions with millions of parameters: - -``` -Loss = f(W₁, W₂, ..., Wₙ, data) -∇Loss = [∂Loss/∂W₁, ∂Loss/∂W₂, ..., ∂Loss/∂Wₙ] -``` - -Manual differentiation is impossible. Numerical differentiation is too slow. 
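The claim above ("numerical differentiation is too slow") can be made concrete with a small plain-NumPy sketch. Note the toy loss `f` and helper `numerical_grad` below are hypothetical illustrations, not part of TinyTorch: central finite differences cost two loss evaluations per parameter, so a model with millions of weights would need millions of forward passes per update, while reverse-mode autodiff recovers every gradient from a single backward pass.

```python
import numpy as np

def f(w):
    # Toy "loss": sum of squares of a scaled parameter vector, L(w) = sum((3w)^2)
    return float(np.sum((3.0 * w) ** 2))

def numerical_grad(loss, w, eps=1e-5):
    # Central finite differences: 2 * w.size loss evaluations,
    # versus one backward pass for reverse-mode autodiff.
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (loss(w_plus) - loss(w_minus)) / (2 * eps)
    return grad

w = np.array([1.0, -2.0, 0.5])
g_numerical = numerical_grad(f, w)
g_analytic = 18.0 * w  # d/dw sum((3w)^2) = 18w, what autograd would return exactly
print(np.allclose(g_numerical, g_analytic, atol=1e-4))  # True
```

The two methods agree here, but the finite-difference cost grows linearly with the number of parameters and the result is an approximation; autodiff's backward pass is exact and costs roughly one extra forward pass regardless of parameter count.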
- -### The Solution: Automatic Differentiation - -🧠 **Core Concept**: Track operations as we compute forward pass, then apply chain rule backwards -⚡ **Performance**: Same speed as forward pass, exact gradients (not approximations) -📦 **Framework Compatibility**: This is how PyTorch and TensorFlow work internally - -### Visual Representation: Computational Graph - -``` -Forward Pass: -x ──┐ - ├──[×]──> z = x * y -y ──┘ - -Backward Pass: -∂L/∂z ──┬──> ∂L/∂x = ∂L/∂z * y - │ - └──> ∂L/∂y = ∂L/∂z * x -``` - -**Key Insight**: Each operation stores how to compute gradients with respect to its inputs. -""" - -# %% [markdown] -""" -## Implementation: Variable Class - Gradient Tracking - -🏗️ **Organization**: Variables wrap tensors and track gradients -🎯 **Clean API**: Seamless integration with existing tensor operations -📐 **Mathematical Foundation**: Computational graph representation of functions - -### Design Principles - -A Variable tracks: -- **data**: The actual values (using our Tensor) -- **grad**: Accumulated gradients (starts as None) -- **grad_fn**: Function to compute gradients during backward pass -- **requires_grad**: Whether to track gradients for this variable -""" - -# %% nbgrader={"grade": false, "grade_id": "variable-class", "solution": true} -#| export -class Variable: - """ - Variable with automatic differentiation support. - - A Variable wraps a Tensor and tracks operations for gradient computation. - - TODO: Implement Variable class with gradient tracking capabilities - - APPROACH: - 1. Initialize with data, optional gradient requirement - 2. Store grad_fn for backward pass computation - 3. 
Implement backward() method to compute gradients - - EXAMPLE: - >>> x = Variable([2.0], requires_grad=True) - >>> y = Variable([3.0], requires_grad=True) - >>> z = x * y - >>> z.backward() - >>> print(x.grad) # Should be [3.0] - >>> print(y.grad) # Should be [2.0] - - HINTS: - - Store data as Tensor for consistency - - grad starts as None, gets created during backward - - grad_fn is a callable that propagates gradients - """ - ### BEGIN SOLUTION - def __init__(self, data, requires_grad=False, grad_fn=None): - """Initialize Variable with data and gradient tracking.""" - # Convert to Tensor if needed - if isinstance(data, (list, tuple, int, float)): - self.data = Tensor(data) - elif isinstance(data, np.ndarray): - self.data = Tensor(data) - elif isinstance(data, (np.number, np.floating, np.integer)): - # Handle numpy scalar types - self.data = Tensor(data) - elif isinstance(data, Tensor): - self.data = data - else: - raise TypeError(f"Unsupported data type: {type(data)}") - - self.grad = None - self.requires_grad = requires_grad - self.grad_fn = grad_fn - - @property - def shape(self): - """Shape of the underlying data.""" - return self.data.shape - - def __repr__(self): - """String representation of Variable.""" - grad_info = f", grad_fn={self.grad_fn.__name__}" if self.grad_fn else "" - requires_grad_info = f", requires_grad={self.requires_grad}" if self.requires_grad else "" - return f"Variable({self.data.data}{grad_info}{requires_grad_info})" - - def backward(self, gradient=None): - """ - Compute gradients via backpropagation. 
- - Args: - gradient: Gradient flowing backwards (defaults to ones) - """ - # Default gradient for scalar outputs - if gradient is None: - if self.data.data.size == 1: - gradient = np.ones_like(self.data.data) - else: - raise RuntimeError("gradient must be specified for non-scalar variables") - - # Accumulate gradients - if self.requires_grad: - if self.grad is None: - self.grad = gradient - else: - self.grad = self.grad + gradient - - # Propagate gradients backwards through computation graph - if self.grad_fn is not None: - self.grad_fn(gradient) - - # Arithmetic operations with gradient tracking - def __add__(self, other): - """Addition with gradient tracking.""" - return add(self, other) - - def __radd__(self, other): - """Reverse addition.""" - return add(other, self) - - def __mul__(self, other): - """Multiplication with gradient tracking.""" - return multiply(self, other) - - def __rmul__(self, other): - """Reverse multiplication.""" - return multiply(other, self) - - def __sub__(self, other): - """Subtraction with gradient tracking.""" - return subtract(self, other) - - def __rsub__(self, other): - """Reverse subtraction.""" - return subtract(other, self) - - def __matmul__(self, other): - """Matrix multiplication with gradient tracking.""" - return matmul(self, other) - - @staticmethod - def sum(variable): - """ - Sum all elements of a Variable, maintaining gradient tracking. - - This is essential for creating scalar losses from multi-element results. - Unlike extracting scalar values, this preserves the computational graph. 
- - Args: - variable: Variable to sum - - Returns: - Variable containing the sum with gradient tracking - """ - # Forward pass: compute sum - sum_data = np.sum(variable.data.data) - - # Determine if result requires gradients - requires_grad = variable.requires_grad - - # Define backward function for gradient propagation - def grad_fn(gradient): - """Propagate gradients back to all elements.""" - if variable.requires_grad: - # For sum operation, gradient is broadcast to all elements - # Since d(sum)/d(xi) = 1 for all i - grad_shape = variable.data.data.shape - element_grad = np.full(grad_shape, gradient) - variable.backward(element_grad) - - return Variable(sum_data, requires_grad=requires_grad, grad_fn=grad_fn if requires_grad else None) - ### END SOLUTION - -# %% [markdown] -""" -### 🧪 Unit Test: Variable Class -This test validates Variable creation and basic gradient setup -""" - -# %% -def test_unit_variable_class(): - """Test Variable class implementation with gradient tracking.""" - print("🔬 Unit Test: Variable Class...") - - # Test basic creation - x = Variable([2.0, 3.0], requires_grad=True) - assert isinstance(x.data, Tensor), "Variable should wrap Tensor" - assert x.requires_grad == True, "Should track gradients when requested" - assert x.grad is None, "Gradient should start as None" - - # Test creation without gradients - y = Variable([1.0, 2.0], requires_grad=False) - assert y.requires_grad == False, "Should not track gradients when not requested" - - # Test different data types - z = Variable(np.array([4.0]), requires_grad=True) - assert isinstance(z.data, Tensor), "Should convert numpy arrays to Tensors" - - print("✅ Variable class works correctly!") - -test_unit_variable_class() - -# %% [markdown] -""" -## Implementation: Addition Operation with Chain Rule - -🧠 **Core Concepts**: Addition requires applying chain rule to both operands -⚡ **Performance**: Gradient computation is O(1) relative to forward pass -📦 **Framework Compatibility**: Matches 
PyTorch's autograd behavior - -### Mathematical Foundation - -For z = x + y: -- ∂z/∂x = 1 (derivative of x + y with respect to x) -- ∂z/∂y = 1 (derivative of x + y with respect to y) - -Chain rule: ∂L/∂x = ∂L/∂z × ∂z/∂x = ∂L/∂z × 1 = ∂L/∂z -""" - -# %% nbgrader={"grade": false, "grade_id": "add-operation", "solution": true} -def _ensure_variable(x): - """Convert input to Variable if needed.""" - if isinstance(x, Variable): - return x - elif hasattr(x, '_variable'): # Handle Parameter objects - return x._variable # Parameter wraps a Variable - else: - return Variable(x, requires_grad=False) - -#| export -def add(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable: - """ - Add two variables with gradient tracking. - - TODO: Implement addition that properly tracks gradients - - APPROACH: - 1. Convert inputs to Variables if needed - 2. Compute forward pass (a.data + b.data) - 3. Create grad_fn that propagates gradients to both inputs - 4. Return new Variable with result and grad_fn - - EXAMPLE: - >>> x = Variable([2.0], requires_grad=True) - >>> y = Variable([3.0], requires_grad=True) - >>> z = add(x, y) - >>> z.backward() - >>> print(x.grad) # [1.0] - derivative of z w.r.t x - >>> print(y.grad) # [1.0] - derivative of z w.r.t y - - HINTS: - - Use chain rule: ∂L/∂x = ∂L/∂z × ∂z/∂x = ∂L/∂z × 1 - - Both operands get same gradient (derivative of sum is 1) - - Only propagate to variables that require gradients - """ - ### BEGIN SOLUTION - # Ensure both inputs are Variables - a = _ensure_variable(a) - b = _ensure_variable(b) - - # Forward pass computation - result_data = Tensor(a.data.data + b.data.data) - - # Determine if result requires gradients - requires_grad = a.requires_grad or b.requires_grad - - # Define backward function for gradient propagation - def grad_fn(gradient): - """Propagate gradients to both operands with broadcasting support.""" - # Addition: ∂(a+b)/∂a = 1, ∂(a+b)/∂b = 1 - # Handle broadcasting by summing gradients appropriately 
- if a.requires_grad: - # Sum out dimensions that were broadcasted for a - grad_a = gradient - # Sum over axes that were broadcasted - original_shape = a.data.data.shape - grad_shape = grad_a.shape if hasattr(grad_a, 'shape') else np.array(grad_a).shape - - # Sum along axes that were added due to broadcasting - if len(grad_shape) > len(original_shape): - axes_to_sum = tuple(range(len(grad_shape) - len(original_shape))) - grad_a = np.sum(grad_a, axis=axes_to_sum) - - # Sum along axes that were expanded - for i in range(len(original_shape)): - if i < len(grad_a.shape) and original_shape[i] == 1 and grad_a.shape[i] > 1: - grad_a = np.sum(grad_a, axis=i, keepdims=True) - - # Handle case where parameter is 1D but gradient is 2D - if len(original_shape) == 1 and len(grad_a.shape) == 2: - grad_a = np.sum(grad_a, axis=0) # Sum across batch dimension - - # Squeeze out singleton dimensions to match original shape - try: - grad_a = grad_a.reshape(original_shape) - except ValueError as e: - # If reshape fails, try to make shapes compatible - if grad_a.size == np.prod(original_shape): - grad_a = grad_a.flatten().reshape(original_shape) - else: - print(f"Warning: grad_a shape {grad_a.shape} incompatible with original {original_shape}") - grad_a = np.sum(grad_a, axis=0, keepdims=False) - if grad_a.shape != original_shape: - grad_a = grad_a.reshape(original_shape) - - a.backward(grad_a) - - if b.requires_grad: - # Sum out dimensions that were broadcasted for b - grad_b = gradient - # Sum over axes that were broadcasted - original_shape = b.data.data.shape - grad_shape = grad_b.shape if hasattr(grad_b, 'shape') else np.array(grad_b).shape - - # Sum along axes that were added due to broadcasting - if len(grad_shape) > len(original_shape): - axes_to_sum = tuple(range(len(grad_shape) - len(original_shape))) - grad_b = np.sum(grad_b, axis=axes_to_sum) - - # Sum along axes that were expanded - for i in range(len(original_shape)): - if i < len(grad_b.shape) and original_shape[i] == 1 and 
grad_b.shape[i] > 1: - grad_b = np.sum(grad_b, axis=i, keepdims=True) - - # Handle case where bias is 1D but gradient is 2D - if len(original_shape) == 1 and len(grad_b.shape) == 2: - grad_b = np.sum(grad_b, axis=0) # Sum across batch dimension - - # Squeeze out singleton dimensions to match original shape - try: - grad_b = grad_b.reshape(original_shape) - except ValueError as e: - # If reshape fails, try to make shapes compatible - if grad_b.size == np.prod(original_shape): - grad_b = grad_b.flatten().reshape(original_shape) - else: - print(f"Warning: grad_b shape {grad_b.shape} incompatible with original {original_shape}") - grad_b = np.sum(grad_b, axis=0, keepdims=False) - if grad_b.shape != original_shape: - grad_b = grad_b.reshape(original_shape) - - b.backward(grad_b) - - # Create result variable with gradient function - result = Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn if requires_grad else None) - return result - ### END SOLUTION - -# %% [markdown] -""" -### 🧪 Unit Test: Addition Operation -This test validates addition with proper gradient computation -""" - -# %% -def test_unit_add_operation(): - """Test addition with gradient tracking.""" - print("🔬 Unit Test: Addition Operation...") - - # Test basic addition - x = Variable([2.0], requires_grad=True) - y = Variable([3.0], requires_grad=True) - z = add(x, y) - - # Verify forward pass - assert np.allclose(z.data.data, [5.0]), f"Expected [5.0], got {z.data.data}" - - # Test backward pass - z.backward() - assert np.allclose(x.grad, [1.0]), f"Expected x.grad=[1.0], got {x.grad}" - assert np.allclose(y.grad, [1.0]), f"Expected y.grad=[1.0], got {y.grad}" - - # Test with constants - a = Variable([1.0], requires_grad=True) - b = add(a, 5.0) # Adding constant - b.backward() - assert np.allclose(a.grad, [1.0]), "Gradient should flow through constant addition" - - print("✅ Addition operation works correctly!") - -test_unit_add_operation() - -# %% [markdown] -""" -## Implementation: 
Multiplication Operation with Product Rule - -📐 **Mathematical Foundation**: Product rule for derivatives -🔗 **Connections**: Essential for linear layers, attention mechanisms -⚡ **Performance**: Efficient gradient computation using cached forward values - -### The Product Rule - -For z = x × y: -- ∂z/∂x = y (derivative with respect to first operand) -- ∂z/∂y = x (derivative with respect to second operand) - -Chain rule: ∂L/∂x = ∂L/∂z × ∂z/∂x = ∂L/∂z × y -""" - -# %% nbgrader={"grade": false, "grade_id": "multiply-operation", "solution": true} -#| export -def multiply(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable: - """ - Multiply two variables with gradient tracking. - - TODO: Implement multiplication using product rule for gradients - - APPROACH: - 1. Convert inputs to Variables if needed - 2. Compute forward pass (a.data × b.data) - 3. Create grad_fn using product rule: ∂(a×b)/∂a = b, ∂(a×b)/∂b = a - 4. Return Variable with result and grad_fn - - EXAMPLE: - >>> x = Variable([2.0], requires_grad=True) - >>> y = Variable([3.0], requires_grad=True) - >>> z = multiply(x, y) - >>> z.backward() - >>> print(x.grad) # [3.0] - derivative is y's value - >>> print(y.grad) # [2.0] - derivative is x's value - - HINTS: - - Product rule: d(uv)/dx = u(dv/dx) + v(du/dx) - - For our case: ∂(a×b)/∂a = b, ∂(a×b)/∂b = a - - Store original values for use in backward pass - """ - ### BEGIN SOLUTION - # Ensure both inputs are Variables - a = _ensure_variable(a) - b = _ensure_variable(b) - - # Forward pass computation - result_data = Tensor(a.data.data * b.data.data) - - # Determine if result requires gradients - requires_grad = a.requires_grad or b.requires_grad - - # Define backward function for gradient propagation - def grad_fn(gradient): - """Propagate gradients using product rule.""" - # Product rule: ∂(a*b)/∂a = b, ∂(a*b)/∂b = a - if a.requires_grad: - # ∂L/∂a = ∂L/∂z × ∂z/∂a = gradient × b - a_grad = gradient * b.data.data - a.backward(a_grad) - if 
b.requires_grad: - # ∂L/∂b = ∂L/∂z × ∂z/∂b = gradient × a - b_grad = gradient * a.data.data - b.backward(b_grad) - - # Create result variable with gradient function - result = Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn if requires_grad else None) - return result - ### END SOLUTION - -# %% [markdown] -""" -### 🧪 Unit Test: Multiplication Operation -This test validates multiplication with product rule gradients -""" - -# %% -def test_unit_multiply_operation(): - """Test multiplication with gradient tracking.""" - print("🔬 Unit Test: Multiplication Operation...") - - # Test basic multiplication - x = Variable([2.0], requires_grad=True) - y = Variable([3.0], requires_grad=True) - z = multiply(x, y) - - # Verify forward pass - assert np.allclose(z.data.data, [6.0]), f"Expected [6.0], got {z.data.data}" - - # Test backward pass - z.backward() - assert np.allclose(x.grad, [3.0]), f"Expected x.grad=[3.0], got {x.grad}" - assert np.allclose(y.grad, [2.0]), f"Expected y.grad=[2.0], got {y.grad}" - - # Test with constants - a = Variable([4.0], requires_grad=True) - b = multiply(a, 2.0) # Multiplying by constant - b.backward() - assert np.allclose(a.grad, [2.0]), "Gradient should be the constant value" - - print("✅ Multiplication operation works correctly!") - -test_unit_multiply_operation() - -# %% [markdown] -""" -## Implementation: Additional Operations - -🔗 **Connections**: Complete the basic arithmetic operations needed for neural networks -⚡ **Performance**: Each operation implements efficient gradient computation -📦 **Framework Compatibility**: Matches behavior of production autograd systems -""" - -# %% nbgrader={"grade": false, "grade_id": "additional-operations", "solution": true} -#| export -def subtract(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable: - """ - Subtract two variables with gradient tracking. 
- - TODO: Implement subtraction with proper gradient flow - - HINTS: - - For z = a - b: ∂z/∂a = 1, ∂z/∂b = -1 - - Similar to addition but second operand gets negative gradient - """ - ### BEGIN SOLUTION - # Ensure both inputs are Variables - a = _ensure_variable(a) - b = _ensure_variable(b) - - # Forward pass computation - result_data = Tensor(a.data.data - b.data.data) - - # Determine if result requires gradients - requires_grad = a.requires_grad or b.requires_grad - - # Define backward function for gradient propagation - def grad_fn(gradient): - """Propagate gradients for subtraction.""" - # Subtraction: ∂(a-b)/∂a = 1, ∂(a-b)/∂b = -1 - if a.requires_grad: - a.backward(gradient) - if b.requires_grad: - b.backward(-gradient) # Negative for subtraction - - # Create result variable with gradient function - result = Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn if requires_grad else None) - return result - ### END SOLUTION - -#| export -def matmul(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable: - """ - Matrix multiplication with gradient tracking. 
- - TODO: Implement matrix multiplication with proper gradients - - HINTS: - - For z = a @ b: ∂z/∂a = gradient @ b.T, ∂z/∂b = a.T @ gradient - - This is fundamental for neural network linear layers - """ - ### BEGIN SOLUTION - # Ensure both inputs are Variables - a = _ensure_variable(a) - b = _ensure_variable(b) - - # Forward pass computation - result_data = Tensor(a.data.data @ b.data.data) - - # Determine if result requires gradients - requires_grad = a.requires_grad or b.requires_grad - - # Define backward function for gradient propagation - def grad_fn(gradient): - """Propagate gradients for matrix multiplication.""" - # Matrix multiplication gradients: - # ∂(a@b)/∂a = gradient @ b.T - # ∂(a@b)/∂b = a.T @ gradient - if a.requires_grad: - a_grad = gradient @ b.data.data.T - a.backward(a_grad) - if b.requires_grad: - b_grad = a.data.data.T @ gradient - b.backward(b_grad) - - # Create result variable with gradient function - result = Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn if requires_grad else None) - return result - ### END SOLUTION - -# %% [markdown] -""" -### 🧪 Unit Test: Additional Operations -This test validates subtraction and matrix multiplication -""" - -# %% -def test_unit_additional_operations(): - """Test subtraction and matrix multiplication.""" - print("🔬 Unit Test: Additional Operations...") - - # Test subtraction - x = Variable([5.0], requires_grad=True) - y = Variable([2.0], requires_grad=True) - z = subtract(x, y) - - assert np.allclose(z.data.data, [3.0]), f"Subtraction failed: expected [3.0], got {z.data.data}" - - z.backward() - assert np.allclose(x.grad, [1.0]), f"Subtraction gradient failed: expected x.grad=[1.0], got {x.grad}" - assert np.allclose(y.grad, [-1.0]), f"Subtraction gradient failed: expected y.grad=[-1.0], got {y.grad}" - - # Test matrix multiplication - a = Variable([[1.0, 2.0]], requires_grad=True) - b = Variable([[3.0], [4.0]], requires_grad=True) - c = matmul(a, b) - - assert 
np.allclose(c.data.data, [[11.0]]), f"Matrix multiplication failed: expected [[11.0]], got {c.data.data}" - - c.backward() - assert np.allclose(a.grad, [[3.0, 4.0]]), f"Matmul gradient failed for a: expected [[3.0, 4.0]], got {a.grad}" - assert np.allclose(b.grad, [[1.0], [2.0]]), f"Matmul gradient failed for b: expected [[1.0], [2.0]], got {b.grad}" - - print("✅ Additional operations work correctly!") - -test_unit_additional_operations() - -# %% [markdown] -""" -## Implementation: Chain Rule Through Complex Expressions - -🧠 **Core Concept**: Multiple operations automatically chain gradients together -⚡ **Performance**: Each operation contributes O(1) overhead for gradient computation -🔗 **Connections**: This enables training deep neural networks with many layers - -### Example: Complex Expression - -Consider: f(x, y) = (x + y) × (x - y) = x² - y² - -The autograd system automatically: -1. Tracks each intermediate operation -2. Applies chain rule backwards through the computation graph -3. 
Accumulates gradients at each variable - -Expected gradients: -- ∂f/∂x = 2x (derivative of x² - y²) -- ∂f/∂y = -2y (derivative of x² - y²) -""" - -# %% [markdown] -""" -### 🧪 Unit Test: Chain Rule Application -This test validates complex expressions with multiple operations -""" - -# %% -def test_unit_chain_rule(): - """Test chain rule through complex expressions.""" - print("🔬 Unit Test: Chain Rule Application...") - - # Test complex expression: (x + y) * (x - y) = x² - y² - x = Variable([3.0], requires_grad=True) - y = Variable([2.0], requires_grad=True) - - # Build computation graph - sum_term = add(x, y) # x + y = 5 - diff_term = subtract(x, y) # x - y = 1 - result = multiply(sum_term, diff_term) # (x+y)*(x-y) = 5*1 = 5 - - # Verify forward pass - expected_result = 3.0**2 - 2.0**2 # x² - y² = 9 - 4 = 5 - assert np.allclose(result.data.data, [expected_result]), f"Expected [{expected_result}], got {result.data.data}" - - # Test backward pass - result.backward() - - # Expected gradients: ∂(x²-y²)/∂x = 2x = 6, ∂(x²-y²)/∂y = -2y = -4 - expected_x_grad = 2 * 3.0 # 6.0 - expected_y_grad = -2 * 2.0 # -4.0 - - assert np.allclose(x.grad, [expected_x_grad]), f"Expected x.grad=[{expected_x_grad}], got {x.grad}" - assert np.allclose(y.grad, [expected_y_grad]), f"Expected y.grad=[{expected_y_grad}], got {y.grad}" - - # Test another complex expression: x * y + x * y (should equal 2*x*y) - a = Variable([2.0], requires_grad=True) - b = Variable([3.0], requires_grad=True) - - term1 = multiply(a, b) - term2 = multiply(a, b) - sum_result = add(term1, term2) - - sum_result.backward() - - # Expected: ∂(2xy)/∂x = 2y = 6, ∂(2xy)/∂y = 2x = 4 - assert np.allclose(a.grad, [6.0]), f"Expected a.grad=[6.0], got {a.grad}" - assert np.allclose(b.grad, [4.0]), f"Expected b.grad=[4.0], got {b.grad}" - - print("✅ Chain rule works correctly through complex expressions!") - -test_unit_chain_rule() - -# %% [markdown] -""" -## 🔍 Systems Analysis: Gradient Computation Behavior - -Now that your 
autograd implementation is complete and tested, let's analyze its behavior: - -**Analysis Focus**: Understand memory usage and computational patterns in automatic differentiation -""" - -# %% -def analyze_gradient_computation(): - """ - 📊 SYSTEMS MEASUREMENT: Gradient Computation Analysis - - Measure how autograd scales with expression complexity and input size. - """ - print("📊 AUTOGRAD SYSTEMS ANALYSIS") - print("Testing gradient computation patterns...") - - import time - - # Test 1: Expression complexity scaling - print("\n🔍 Expression Complexity Analysis:") - x = Variable([2.0], requires_grad=True) - y = Variable([3.0], requires_grad=True) - - expressions = [ - ("Simple: x + y", lambda: add(x, y)), - ("Medium: x * y + x", lambda: add(multiply(x, y), x)), - ("Complex: (x + y) * (x - y)", lambda: multiply(add(x, y), subtract(x, y))) - ] - - for name, expr_fn in expressions: - # Reset gradients - x.grad = None - y.grad = None - - # Time forward + backward pass - start = time.perf_counter() - result = expr_fn() - result.backward() - elapsed = time.perf_counter() - start - - print(f" {name}: {elapsed*1000:.3f}ms") - - # Test 2: Memory usage pattern - print("\n💾 Memory Usage Analysis:") - try: - import psutil - import os - - def get_memory_mb(): - process = psutil.Process(os.getpid()) - return process.memory_info().rss / 1024 / 1024 - - baseline = get_memory_mb() - psutil_available = True - except ImportError: - print(" Note: psutil not installed, skipping detailed memory analysis") - psutil_available = False - baseline = 0 - - # Create computation graph with many variables - variables = [] - for i in range(100): - var = Variable([float(i)], requires_grad=True) - variables.append(var) - - # Chain operations - result = variables[0] - for var in variables[1:]: - result = add(result, var) - - if psutil_available: - memory_after_forward = get_memory_mb() - - # Backward pass - result.backward() - - if psutil_available: - memory_after_backward = get_memory_mb() - print(f" 
Baseline memory: {baseline:.1f}MB") - print(f" After forward pass: {memory_after_forward:.1f}MB (+{memory_after_forward-baseline:.1f}MB)") - print(f" After backward pass: {memory_after_backward:.1f}MB (+{memory_after_backward-baseline:.1f}MB)") - else: - print(" Memory tracking skipped (psutil not available)") - - # Test 3: Gradient accumulation - print("\n🔄 Gradient Accumulation Test:") - z = Variable([1.0], requires_grad=True) - - # Multiple backward passes should accumulate gradients - loss1 = multiply(z, 2.0) - loss1.backward() - first_grad = z.grad.copy() - - loss2 = multiply(z, 3.0) - loss2.backward() # Should accumulate with previous gradient - - print(f" First backward: grad = {first_grad}") - print(f" After second backward: grad = {z.grad}") - print(f" Expected accumulation: {first_grad + 3.0}") - - print("\n💡 AUTOGRAD INSIGHTS:") - print(" • Forward pass builds computation graph in memory") - print(" • Backward pass traverses graph and accumulates gradients") - print(" • Memory scales with graph depth, not just data size") - print(" • This is why PyTorch uses gradient checkpointing for deep networks!") - -analyze_gradient_computation() - -# %% [markdown] -""" -## Integration: Complete Module Testing - -🧪 **Testing Strategy**: Comprehensive validation of all autograd functionality -✅ **Quality Assurance**: Ensure all components work together correctly -🚀 **Ready for Training**: Verify autograd enables neural network optimization -""" - -# %% -def test_module(): - """Comprehensive test of autograd module functionality.""" - print("🧪 COMPREHENSIVE MODULE TEST") - print("Running complete autograd validation...") - - # Test 1: Variable creation and basic properties - print("\n1️⃣ Testing Variable creation...") - x = Variable([1.0, 2.0], requires_grad=True) - assert isinstance(x.data, Tensor) - assert x.requires_grad == True - assert x.grad is None - print(" ✅ Variable creation works") - - # Test 2: All arithmetic operations - print("\n2️⃣ Testing arithmetic 
operations...") - a = Variable([2.0], requires_grad=True) - b = Variable([3.0], requires_grad=True) - - # Test each operation - add_result = add(a, b) - assert np.allclose(add_result.data.data, [5.0]) - - mul_result = multiply(a, b) - assert np.allclose(mul_result.data.data, [6.0]) - - sub_result = subtract(a, b) - assert np.allclose(sub_result.data.data, [-1.0]) - print(" ✅ All arithmetic operations work") - - # Test 3: Gradient computation - print("\n3️⃣ Testing gradient computation...") - x = Variable([3.0], requires_grad=True) - y = Variable([4.0], requires_grad=True) - z = multiply(x, y) # z = 12 - z.backward() - - assert np.allclose(x.grad, [4.0]), f"Expected x.grad=[4.0], got {x.grad}" - assert np.allclose(y.grad, [3.0]), f"Expected y.grad=[3.0], got {y.grad}" - print(" ✅ Gradient computation works") - - # Test 4: Complex expressions - print("\n4️⃣ Testing complex expressions...") - p = Variable([2.0], requires_grad=True) - q = Variable([3.0], requires_grad=True) - - # (p + q) * (p - q) = p² - q² - expr = multiply(add(p, q), subtract(p, q)) - expr.backward() - - # Expected: ∂(p²-q²)/∂p = 2p = 4, ∂(p²-q²)/∂q = -2q = -6 - assert np.allclose(p.grad, [4.0]), f"Expected p.grad=[4.0], got {p.grad}" - assert np.allclose(q.grad, [-6.0]), f"Expected q.grad=[-6.0], got {q.grad}" - print(" ✅ Complex expressions work") - - # Test 5: Matrix operations - print("\n5️⃣ Testing matrix operations...") - A = Variable([[1.0, 2.0]], requires_grad=True) - B = Variable([[3.0], [4.0]], requires_grad=True) - C = matmul(A, B) - - assert np.allclose(C.data.data, [[11.0]]) - C.backward() - assert np.allclose(A.grad, [[3.0, 4.0]]) - assert np.allclose(B.grad, [[1.0], [2.0]]) - print(" ✅ Matrix operations work") - - # Test 6: Mixed operations - print("\n6️⃣ Testing mixed operations...") - u = Variable([1.0], requires_grad=True) - v = Variable([2.0], requires_grad=True) - - # Neural network-like computation: u * v + u - hidden = multiply(u, v) # u * v - output = add(hidden, u) # + u - 
output.backward() - - # Expected: ∂(u*v + u)/∂u = v + 1 = 3, ∂(u*v + u)/∂v = u = 1 - assert np.allclose(u.grad, [3.0]), f"Expected u.grad=[3.0], got {u.grad}" - assert np.allclose(v.grad, [1.0]), f"Expected v.grad=[1.0], got {v.grad}" - print(" ✅ Mixed operations work") - - print("\n🎉 ALL TESTS PASSED!") - print("🚀 Autograd module is ready for neural network training!") - print("🔗 Next: Use these gradients in optimizers to update parameters") - -# %% -if __name__ == "__main__": - test_module() - -# %% [markdown] -""" -## 🤔 ML Systems Thinking: Interactive Questions - -### Question 1: Memory Management in Computational Graphs - -Consider the expression `z = (x + y) * (x - y)` where x and y have `requires_grad=True`. - -**Analysis Task**: Your autograd implementation stores intermediate results during forward pass and uses them during backward pass. In a deep neural network with 100 layers, each layer creating intermediate variables, what memory challenges would emerge? - -**Specific Questions**: -- How does memory usage scale with network depth in your current implementation? -- What strategies could reduce memory usage during gradient computation? -- Why do production frameworks like PyTorch implement "gradient checkpointing"? - -**Implementation Connection**: Examine how your `grad_fn` closures capture references to input variables and consider the memory implications. -""" - -# %% nbgrader={"grade": true, "grade_id": "memory-analysis", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} -""" -TODO: Analyze memory usage patterns in your autograd implementation. - -Consider how your Variable class stores references to other variables through grad_fn, -and how this affects memory usage in deep networks. - -Discuss specific memory optimization strategies you could implement. -""" -### BEGIN SOLUTION -# Memory analysis for autograd implementation: - -# 1. 
Memory scaling with network depth: -# - Each Variable stores references to inputs through grad_fn closure -# - In deep networks: O(depth) memory growth for intermediate activations -# - Gradient computation requires keeping forward activations in memory -# - 100-layer network = 100x intermediate variables + their grad_fn closures - -# 2. Memory optimization strategies: -# - Gradient checkpointing: Only store subset of activations, recompute others -# - In-place operations where mathematically valid -# - Clear computation graph after backward pass -# - Use smaller data types (float16 vs float32) where precision allows - -# 3. Production framework solutions: -# - PyTorch's gradient checkpointing trades compute for memory -# - Automatic memory management with garbage collection -# - Graph optimization to reduce intermediate storage -# - Dynamic graph construction vs static graph optimization - -# Current implementation improvement: -# Add method to clear computation graph: variable.detach() or graph.clear() -### END SOLUTION - -# %% [markdown] -""" -### Question 2: Gradient Accumulation and Training Efficiency - -In your autograd implementation, gradients accumulate when `backward()` is called multiple times without zeroing gradients. - -**Analysis Task**: Design a training loop that uses gradient accumulation to simulate larger batch sizes with limited memory. - -**Specific Questions**: -- How would you modify the Variable class to support gradient zeroing? -- What are the trade-offs between large batches vs. gradient accumulation? -- How does gradient accumulation affect convergence in neural network training? - -**Implementation Connection**: Consider how your `backward()` method accumulates gradients and design a complete training interface. 
-""" - -# %% nbgrader={"grade": true, "grade_id": "gradient-accumulation", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} -""" -TODO: Design gradient accumulation strategy for your autograd system. - -Extend your Variable class with gradient management methods and analyze -the trade-offs between memory efficiency and training convergence. -""" -### BEGIN SOLUTION -# Gradient accumulation design for training efficiency: - -# 1. Variable class extensions needed: -def zero_grad(self): - """Clear accumulated gradients.""" - self.grad = None - -def add_zero_grad_to_variable(): - """Would add this method to Variable class""" - # Implementation would set self.grad = None - pass - -# 2. Training loop with gradient accumulation: -def training_step_with_accumulation(model, data_loader, accumulation_steps=4): - """ - Simulate larger batches through gradient accumulation - """ - for param in model.parameters(): - param.zero_grad() - - total_loss = 0 - for i, batch in enumerate(data_loader): - loss = compute_loss(model(batch.x), batch.y) - loss.backward() # Accumulate gradients - total_loss += loss.data - - if (i + 1) % accumulation_steps == 0: - # Update parameters with accumulated gradients - optimizer.step() - # Clear gradients for next accumulation cycle - for param in model.parameters(): - param.zero_grad() - - return total_loss / len(data_loader) - -# 3. Trade-offs analysis: -# Memory: Gradient accumulation uses constant memory vs. large batch linear growth -# Convergence: Accumulated gradients approximate large batch behavior -# Computation: Extra backward passes vs. single large batch forward/backward -# Synchronization: In distributed training, less frequent communication - -# 4. 
Production considerations: -# - Gradient scaling to prevent underflow with accumulated small gradients -# - Learning rate adjustment for effective batch size -# - Batch normalization statistics affected by actual vs effective batch size -### END SOLUTION - -# %% [markdown] -""" -### Question 3: Computational Graph Optimization - -Your autograd implementation creates a new Variable for each operation, building a computation graph dynamically. - -**Analysis Task**: Analyze opportunities for optimizing the computational graph to reduce memory usage and improve performance. - -**Specific Questions**: -- Which operations could be fused together to reduce intermediate Variable storage? -- How would in-place operations affect gradient computation safety? -- What graph optimization passes could be implemented before backward propagation? - -**Implementation Connection**: Examine your operation functions and identify where intermediate results could be eliminated or reused. -""" - -# %% nbgrader={"grade": true, "grade_id": "graph-optimization", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} -""" -TODO: Design graph optimization strategies for your autograd implementation. - -Identify specific optimizations that could reduce memory usage and improve -performance while maintaining gradient correctness. -""" -### BEGIN SOLUTION -# Computational graph optimization strategies: - -# 1. 
Operation fusion opportunities: -# - Fuse: add + multiply → fused_add_mul (one intermediate variable) -# - Fuse: activation + linear → fused_linear_activation -# - Elementwise operations: add + relu + multiply can be single kernel -# Current: 3 Variables → Optimized: 1 Variable - -def fused_add_multiply(a, b, c): - """Fused operation: (a + b) * c - saves one intermediate Variable""" - # Direct computation without intermediate Variable - result_data = (a.data.data + b.data.data) * c.data.data - - def grad_fn(gradient): - if a.requires_grad: - a.backward(gradient * c.data.data) - if b.requires_grad: - b.backward(gradient * c.data.data) - if c.requires_grad: - c.backward(gradient * (a.data.data + b.data.data)) - - return Variable(result_data, requires_grad=any([a.requires_grad, b.requires_grad, c.requires_grad]), grad_fn=grad_fn) - -# 2. In-place operation safety: -# Safe: element-wise operations on leaf variables not used elsewhere -# Unsafe: in-place on intermediate variables used in multiple paths -# Solution: Track variable usage count before allowing in-place - -def safe_inplace_add(var, other): - """In-place addition if safe for gradient computation""" - if var.grad_fn is not None: - raise RuntimeError("Cannot do in-place operation on variable with grad_fn") - var.data.data += other.data.data - return var - -# 3. 
Graph optimization passes: -# - Dead code elimination: Remove unused intermediate variables -# - Common subexpression elimination: Reuse x*y if computed multiple times -# - Memory layout optimization: Arrange for cache-friendly access patterns - -class GraphOptimizer: - def optimize_memory_layout(self, variables): - """Optimize variable storage for cache efficiency""" - # Group related variables in contiguous memory - pass - - def eliminate_dead_variables(self, root_variable): - """Remove variables not needed for gradient computation""" - # Traverse backward from root, mark reachable variables - pass - - def fuse_operations(self, computation_sequence): - """Identify fusible operation sequences""" - # Pattern matching for common operation combinations - pass - -# 4. Production framework techniques: -# - TensorFlow's XLA: Ahead-of-time compilation with graph optimization -# - PyTorch's TorchScript: Graph optimization for inference -# - ONNX graph optimization passes: Constant folding, operator fusion -# - Memory planning: Pre-allocate memory for entire computation graph -### END SOLUTION - -# %% [markdown] -""" -## 🎯 MODULE SUMMARY: Autograd - Automatic Differentiation Engine - -Congratulations! 
You've successfully implemented the automatic differentiation engine: - -### What You've Accomplished -✅ **Variable Class Implementation**: Complete gradient tracking system with 200+ lines of core functionality -✅ **Arithmetic Operations**: Addition, multiplication, subtraction, and matrix operations with proper gradient flow -✅ **Chain Rule Application**: Automatic gradient computation through complex mathematical expressions -✅ **Memory Management**: Efficient gradient accumulation and computational graph construction -✅ **Systems Analysis**: Understanding of memory scaling and performance characteristics in gradient computation - -### Key Learning Outcomes -- **Automatic Differentiation**: How computational graphs enable efficient gradient computation -- **Chain Rule Implementation**: Mathematical foundation for backpropagation in neural networks -- **Memory Patterns**: How gradient computation affects memory usage in deep learning systems -- **Production Understanding**: Connection to PyTorch/TensorFlow autograd implementations - -### Mathematical Foundations Mastered -- **Chain Rule**: Systematic application through computational graphs -- **Product Rule**: Gradient computation for multiplication operations -- **Computational Complexity**: O(1) gradient overhead per operation in forward pass -- **Memory Complexity**: O(graph_depth) storage requirements for intermediate activations - -### Professional Skills Developed -- **Gradient System Design**: Building automatic differentiation from scratch -- **Performance Analysis**: Understanding memory and computational trade-offs -- **Testing Methodology**: Comprehensive validation of gradient correctness - -### Ready for Advanced Applications -Your autograd implementation now enables: -- **Neural Network Training**: Automatic gradient computation for parameter updates -- **Optimization Algorithms**: Foundation for SGD, Adam, and other optimizers -- **Deep Learning Research**: Understanding of how modern frameworks 
work internally - -### Connection to Real ML Systems -Your implementation mirrors production systems: -- **PyTorch**: `torch.autograd.Variable` and automatic gradient computation -- **TensorFlow**: `tf.GradientTape` for automatic differentiation -- **Industry Standard**: Dynamic computational graphs used in most modern frameworks - -### Next Steps -1. **Export your module**: `tito module complete 05_autograd` -2. **Validate integration**: `tito test --module autograd` -3. **Ready for Module 06**: Optimizers will use your gradients to update neural network parameters! - -**🚀 Achievement Unlocked**: Your automatic differentiation engine is the foundation that makes modern neural network training possible! -""" \ No newline at end of file diff --git a/tinytorch/core/benchmarking.py b/tinytorch/core/benchmarking.py new file mode 100644 index 00000000..54c5c08c --- /dev/null +++ b/tinytorch/core/benchmarking.py @@ -0,0 +1,1206 @@ +# ╔═══════════════════════════════════════════════════════════════════════════════╗ +# ║ 🚨 CRITICAL WARNING 🚨 ║ +# ║ AUTOGENERATED! DO NOT EDIT! ║ +# ║ ║ +# ║ This file is AUTOMATICALLY GENERATED from source modules. ║ +# ║ ANY CHANGES MADE HERE WILL BE LOST when modules are re-exported! ║ +# ║ ║ +# ║ ✅ TO EDIT: modules/source/14_benchmarking/benchmarking_dev.py ║ +# ║ ✅ TO EXPORT: Run 'tito module complete ' ║ +# ║ ║ +# ║ 🛡️ STUDENT PROTECTION: This file contains optimized implementations. ║ +# ║ Editing it directly may break module functionality and training. ║ +# ║ ║ +# ║ 🎓 LEARNING TIP: Work in modules/source/ - that's where real development ║ +# ║ happens! The tinytorch/ directory is just the compiled output. 
║ +# ╚═══════════════════════════════════════════════════════════════════════════════╝ + +# %% auto 0 +__all__ = ['BenchmarkScenario', 'BenchmarkResult', 'BenchmarkScenarios', 'StatisticalValidation', 'StatisticalValidator', + 'TinyTorchPerf', 'PerformanceReporter', 'plot_benchmark_results', 'ProductionBenchmarkingProfiler'] + +# %% ../../modules/source/temp_holding/14_benchmarking/benchmarking_dev.ipynb 1 +import numpy as np +import matplotlib.pyplot as plt +import time +import statistics +import math +from typing import Dict, List, Tuple, Optional, Any, Callable +from enum import Enum +from dataclasses import dataclass +import os +import sys + +# Import our TinyTorch dependencies +try: + from tinytorch.core.tensor import Tensor + from tinytorch.core.networks import Sequential + from tinytorch.core.layers import Dense + from tinytorch.core.activations import ReLU, Softmax + from tinytorch.core.dataloader import DataLoader +except ImportError: + # For development, import from local modules + parent_dirs = [ + os.path.join(os.path.dirname(__file__), '..', '01_tensor'), + os.path.join(os.path.dirname(__file__), '..', '03_layers'), + os.path.join(os.path.dirname(__file__), '..', '02_activations'), + os.path.join(os.path.dirname(__file__), '..', '04_networks'), + os.path.join(os.path.dirname(__file__), '..', '06_dataloader') + ] + for path in parent_dirs: + if path not in sys.path: + sys.path.append(path) + + try: + from tensor_dev import Tensor + from networks_dev import Sequential + from layers_dev import Dense + from activations_dev import ReLU, Softmax + from dataloader_dev import DataLoader + except ImportError: + # Fallback for missing modules + print("⚠️ Some TinyTorch modules not available - using minimal implementations") + +# %% ../../modules/source/temp_holding/14_benchmarking/benchmarking_dev.ipynb 8 +class BenchmarkScenario(Enum): + """Standard benchmark scenarios from MLPerf""" + SINGLE_STREAM = "single_stream" + SERVER = "server" + OFFLINE = "offline" + 
+@dataclass +class BenchmarkResult: + """Results from a benchmark run""" + scenario: BenchmarkScenario + latencies: List[float] # All latency measurements in seconds + throughput: float # Samples per second + accuracy: float # Model accuracy (0-1) + metadata: Optional[Dict[str, Any]] = None + +#| export +class BenchmarkScenarios: + """ + Implements the three standard MLPerf benchmark scenarios. + + TODO: Implement the three benchmark scenarios following MLPerf patterns. + + STEP-BY-STEP IMPLEMENTATION: + 1. Single-Stream: Send queries one at a time, measure latency + 2. Server: Send queries following Poisson distribution, measure QPS + 3. Offline: Send all queries at once, measure total throughput + + IMPLEMENTATION APPROACH: + 1. Each scenario should run the model multiple times + 2. Collect latency measurements for each run + 3. Calculate appropriate metrics for each scenario + 4. Return BenchmarkResult with all measurements + + LEARNING CONNECTIONS: + - **MLPerf Standards**: Industry-standard benchmarking methodology used by Google, NVIDIA, etc. + - **Performance Scenarios**: Different deployment patterns require different measurement approaches + - **Production Validation**: Benchmarking validates model performance before deployment + - **Resource Planning**: Results guide infrastructure scaling and capacity planning + + EXAMPLE USAGE: + scenarios = BenchmarkScenarios() + result = scenarios.single_stream(model, dataset, num_queries=1000) + print(f"90th percentile latency: {result.latencies[int(0.9 * len(result.latencies))]} seconds") + """ + + def __init__(self): + self.results = [] + + def single_stream(self, model: Callable, dataset: List, num_queries: int = 1000) -> BenchmarkResult: + """ + Run single-stream benchmark scenario. + + TODO: Implement single-stream benchmarking. + + STEP-BY-STEP IMPLEMENTATION: + 1. Initialize empty list for latencies + 2. For each query (up to num_queries): + a. Get next sample from dataset (cycle if needed) + b. 
Record start time + c. Run model on sample + d. Record end time + e. Calculate latency = end - start + f. Add latency to list + 3. Calculate throughput = num_queries / total_time + 4. Calculate accuracy if possible + 5. Return BenchmarkResult with SINGLE_STREAM scenario + + LEARNING CONNECTIONS: + - **Mobile/Edge Deployment**: Single-stream simulates user-facing applications + - **Tail Latency**: 90th/95th percentiles matter more than averages for user experience + - **Interactive Systems**: Chatbots, recommendation engines use single-stream patterns + - **SLA Validation**: Ensures models meet response time requirements + + HINTS: + - Use time.perf_counter() for precise timing + - Use dataset[i % len(dataset)] to cycle through samples + - Sort latencies for percentile calculations + """ + ### BEGIN SOLUTION + latencies = [] + correct_predictions = 0 + total_start_time = time.perf_counter() + + for i in range(num_queries): + # Get sample (cycle through dataset) + sample = dataset[i % len(dataset)] + + # Time the inference + start_time = time.perf_counter() + result = model(sample) + end_time = time.perf_counter() + + latency = end_time - start_time + latencies.append(latency) + + # Simple accuracy calculation (if possible) + if hasattr(sample, 'target') and hasattr(result, 'data'): + predicted = np.argmax(result.data) + if predicted == sample.target: + correct_predictions += 1 + + total_time = time.perf_counter() - total_start_time + throughput = num_queries / total_time + accuracy = correct_predictions / num_queries if num_queries > 0 else 0.0 + + return BenchmarkResult( + scenario=BenchmarkScenario.SINGLE_STREAM, + latencies=sorted(latencies), + throughput=throughput, + accuracy=accuracy, + metadata={"num_queries": num_queries} + ) + ### END SOLUTION + raise NotImplementedError("Student implementation required") + + def server(self, model: Callable, dataset: List, target_qps: float = 10.0, + duration: float = 60.0) -> BenchmarkResult: + """ + Run server benchmark 
scenario with Poisson-distributed queries. + + TODO: Implement server benchmarking. + + STEP-BY-STEP IMPLEMENTATION: + 1. Calculate inter-arrival time = 1.0 / target_qps + 2. Run for specified duration: + a. Wait for next query arrival (Poisson distribution) + b. Get sample from dataset + c. Record start time + d. Run model + e. Record end time and latency + 3. Calculate actual QPS = total_queries / duration + 4. Return results + + LEARNING CONNECTIONS: + - **Web Services**: Server scenario simulates API endpoints handling concurrent requests + - **Load Testing**: Validates system behavior under realistic traffic patterns + - **Scalability Analysis**: Tests how well models handle increasing load + - **Production Deployment**: Critical for microservices and web-scale applications + + HINTS: + - Use np.random.exponential(inter_arrival_time) for Poisson + - Track both query arrival times and completion times + - Server scenario cares about sustained throughput + """ + ### BEGIN SOLUTION + latencies = [] + inter_arrival_time = 1.0 / target_qps + start_time = time.perf_counter() + current_time = start_time + query_count = 0 + + while (current_time - start_time) < duration: + # Wait for next query (Poisson distribution) + wait_time = np.random.exponential(inter_arrival_time) + # Use minimal delay for fast testing + if wait_time > 0.0001: # Only sleep for very long waits + time.sleep(min(wait_time, 0.0001)) + + # Get sample + sample = dataset[query_count % len(dataset)] + + # Time the inference + query_start = time.perf_counter() + result = model(sample) + query_end = time.perf_counter() + + latency = query_end - query_start + latencies.append(latency) + + query_count += 1 + current_time = time.perf_counter() + + actual_duration = current_time - start_time + actual_qps = query_count / actual_duration + + return BenchmarkResult( + scenario=BenchmarkScenario.SERVER, + latencies=sorted(latencies), + throughput=actual_qps, + accuracy=0.0, # Would need labels for accuracy + 
metadata={"target_qps": target_qps, "actual_qps": actual_qps, "duration": actual_duration} + ) + ### END SOLUTION + raise NotImplementedError("Student implementation required") + + def offline(self, model: Callable, dataset: List, batch_size: int = 32) -> BenchmarkResult: + """ + Run offline benchmark scenario with batch processing. + + TODO: Implement offline benchmarking. + + STEP-BY-STEP IMPLEMENTATION: + 1. Group dataset into batches of batch_size + 2. For each batch: + a. Record start time + b. Run model on entire batch + c. Record end time + d. Calculate batch latency + 3. Calculate total throughput = total_samples / total_time + 4. Return results + + LEARNING CONNECTIONS: + - **Batch Processing**: Offline scenario simulates data pipeline and ETL workloads + - **Throughput Optimization**: Maximizes processing efficiency for large datasets + - **Data Center Workloads**: Common in recommendation systems and analytics pipelines + - **Cost Optimization**: High throughput reduces compute costs per sample + + HINTS: + - Process data in batches for efficiency + - Measure total time for all batches + - Offline cares about maximum throughput + """ + ### BEGIN SOLUTION + latencies = [] + total_samples = len(dataset) + total_start_time = time.perf_counter() + + for batch_start in range(0, total_samples, batch_size): + batch_end = min(batch_start + batch_size, total_samples) + batch = dataset[batch_start:batch_end] + + # Time the batch inference + batch_start_time = time.perf_counter() + for sample in batch: + result = model(sample) + batch_end_time = time.perf_counter() + + batch_latency = batch_end_time - batch_start_time + latencies.append(batch_latency) + + total_time = time.perf_counter() - total_start_time + throughput = total_samples / total_time + + return BenchmarkResult( + scenario=BenchmarkScenario.OFFLINE, + latencies=latencies, + throughput=throughput, + accuracy=0.0, # Would need labels for accuracy + metadata={"batch_size": batch_size, "total_samples": 
total_samples} + ) + ### END SOLUTION + raise NotImplementedError("Student implementation required") + +# %% ../../modules/source/temp_holding/14_benchmarking/benchmarking_dev.ipynb 12 +@dataclass +class StatisticalValidation: + """Results from statistical validation""" + is_significant: bool + p_value: float + effect_size: float + confidence_interval: Tuple[float, float] + recommendation: str + +#| export +class StatisticalValidator: + """ + Validates benchmark results using proper statistical methods. + + TODO: Implement statistical validation for benchmark results. + + STEP-BY-STEP IMPLEMENTATION: + 1. Null hypothesis: No difference between models + 2. T-test: Compare means of two groups + 3. P-value: Probability of seeing this difference by chance + 4. Effect size: Magnitude of the difference + 5. Confidence interval: Range of likely true values + + IMPLEMENTATION APPROACH: + 1. Calculate basic statistics (mean, std, n) + 2. Perform t-test to get p-value + 3. Calculate effect size (Cohen's d) + 4. Calculate confidence interval + 5. Provide clear recommendation + + LEARNING CONNECTIONS: + - **Scientific Rigor**: Ensures performance claims are statistically valid + - **A/B Testing**: Foundation for production model comparison and rollout decisions + - **Research Validation**: Required for academic papers and technical reports + - **Business Decisions**: Statistical significance guides investment in new models + """ + + def __init__(self, confidence_level: float = 0.95): + self.confidence_level = confidence_level + self.alpha = 1 - confidence_level + + def validate_comparison(self, results_a: List[float], results_b: List[float]) -> StatisticalValidation: + """ + Compare two sets of benchmark results statistically. + + TODO: Implement statistical comparison. + + STEP-BY-STEP: + 1. Calculate basic statistics for both groups + 2. Perform two-sample t-test + 3. Calculate effect size (Cohen's d) + 4. Calculate confidence interval for the difference + 5. 
Generate recommendation based on results + + HINTS: + - Use scipy.stats.ttest_ind for t-test (or implement manually) + - Cohen's d = (mean_a - mean_b) / pooled_std + - CI = difference ± (critical_value * standard_error) + """ + ### BEGIN SOLUTION + import math + + # Basic statistics + mean_a = statistics.mean(results_a) + mean_b = statistics.mean(results_b) + std_a = statistics.stdev(results_a) + std_b = statistics.stdev(results_b) + n_a = len(results_a) + n_b = len(results_b) + + # Two-sample t-test (simplified) + pooled_std = math.sqrt(((n_a - 1) * std_a**2 + (n_b - 1) * std_b**2) / (n_a + n_b - 2)) + standard_error = pooled_std * math.sqrt(1/n_a + 1/n_b) + + if standard_error == 0: + t_stat = 0 + p_value = 1.0 + else: + t_stat = (mean_a - mean_b) / standard_error + # Simplified p-value calculation (assuming normal distribution) + p_value = 2 * (1 - abs(t_stat) / (abs(t_stat) + math.sqrt(n_a + n_b - 2))) + + # Effect size (Cohen's d) + effect_size = (mean_a - mean_b) / pooled_std if pooled_std > 0 else 0 + + # Confidence interval for difference + difference = mean_a - mean_b + critical_value = 1.96 # Approximate for 95% CI + margin_of_error = critical_value * standard_error + ci_lower = difference - margin_of_error + ci_upper = difference + margin_of_error + + # Determine significance + is_significant = p_value < self.alpha + + # Generate recommendation + if is_significant: + if effect_size > 0.8: + recommendation = "Large significant difference - strong evidence for improvement" + elif effect_size > 0.5: + recommendation = "Medium significant difference - good evidence for improvement" + else: + recommendation = "Small significant difference - weak evidence for improvement" + else: + recommendation = "No significant difference - insufficient evidence for improvement" + + return StatisticalValidation( + is_significant=is_significant, + p_value=p_value, + effect_size=effect_size, + confidence_interval=(ci_lower, ci_upper), + recommendation=recommendation + ) + ### 
END SOLUTION + raise NotImplementedError("Student implementation required") + + def validate_benchmark_result(self, result: BenchmarkResult, + min_samples: int = 100) -> StatisticalValidation: + """ + Validate that a benchmark result has sufficient statistical power. + + TODO: Implement validation for single benchmark result. + + STEP-BY-STEP: + 1. Check if we have enough samples + 2. Calculate confidence interval for the metric + 3. Check for common pitfalls (outliers, etc.) + 4. Provide recommendations + """ + ### BEGIN SOLUTION + import math  # imported locally; math is not imported at module scope + + latencies = result.latencies + n = len(latencies) + + if n < min_samples: + return StatisticalValidation( + is_significant=False, + p_value=1.0, + effect_size=0.0, + confidence_interval=(0.0, 0.0), + recommendation=f"Insufficient samples: {n} < {min_samples}. Need more data." + ) + + # Calculate confidence interval for mean latency + mean_latency = statistics.mean(latencies) + std_latency = statistics.stdev(latencies) + standard_error = std_latency / math.sqrt(n) + + critical_value = 1.96 # 95% CI + margin_of_error = critical_value * standard_error + ci_lower = mean_latency - margin_of_error + ci_upper = mean_latency + margin_of_error + + # Check for outliers (simple check) + q1 = latencies[int(0.25 * n)] + q3 = latencies[int(0.75 * n)] + iqr = q3 - q1 + outlier_threshold = q3 + 1.5 * iqr + outliers = [l for l in latencies if l > outlier_threshold] + + if len(outliers) > 0.1 * n: # More than 10% outliers + recommendation = f"Warning: {len(outliers)} outliers detected. Results may be unreliable." + else: + recommendation = "Benchmark result appears statistically valid."
+ + return StatisticalValidation( + is_significant=True, + p_value=0.0, # Not applicable for single result + effect_size=std_latency / mean_latency, # Coefficient of variation + confidence_interval=(ci_lower, ci_upper), + recommendation=recommendation + ) + ### END SOLUTION + raise NotImplementedError("Student implementation required") + +# %% ../../modules/source/temp_holding/14_benchmarking/benchmarking_dev.ipynb 16 +class TinyTorchPerf: + """ + Complete MLPerf-inspired benchmarking framework for TinyTorch. + + TODO: Implement the complete benchmarking framework. + + STEP-BY-STEP IMPLEMENTATION: + 1. Combines all benchmark scenarios + 2. Integrates statistical validation + 3. Provides easy-to-use API + 4. Generates professional reports + + IMPLEMENTATION APPROACH: + 1. Initialize with model and dataset + 2. Provide methods for each scenario + 3. Include statistical validation + 4. Generate comprehensive reports + + LEARNING CONNECTIONS: + - **MLPerf Integration**: Follows industry-standard benchmarking patterns + - **Production Deployment**: Validates models before production rollout + - **Performance Engineering**: Identifies bottlenecks and optimization opportunities + - **Framework Design**: Demonstrates how to build reusable ML tools + """ + + def __init__(self): + self.scenarios = BenchmarkScenarios() + self.validator = StatisticalValidator() + self.model = None + self.dataset = None + self.results = {} + + def set_model(self, model: Callable): + """Set the model to benchmark.""" + self.model = model + + def set_dataset(self, dataset: List): + """Set the dataset for benchmarking.""" + self.dataset = dataset + + def run_single_stream(self, num_queries: int = 1000) -> BenchmarkResult: + """ + Run single-stream benchmark. + + TODO: Implement single-stream benchmark with validation. + + STEP-BY-STEP: + 1. Check that model and dataset are set + 2. Run single-stream scenario + 3. Validate results statistically + 4. Store results + 5. 
Return result + """ + ### BEGIN SOLUTION + if self.model is None or self.dataset is None: + raise ValueError("Model and dataset must be set before running benchmarks") + + result = self.scenarios.single_stream(self.model, self.dataset, num_queries) + validation = self.validator.validate_benchmark_result(result) + + self.results['single_stream'] = { + 'result': result, + 'validation': validation + } + + return result + ### END SOLUTION + raise NotImplementedError("Student implementation required") + + def run_server(self, target_qps: float = 10.0, duration: float = 60.0) -> BenchmarkResult: + """ + Run server benchmark. + + TODO: Implement server benchmark with validation. + """ + ### BEGIN SOLUTION + if self.model is None or self.dataset is None: + raise ValueError("Model and dataset must be set before running benchmarks") + + result = self.scenarios.server(self.model, self.dataset, target_qps, duration) + validation = self.validator.validate_benchmark_result(result) + + self.results['server'] = { + 'result': result, + 'validation': validation + } + + return result + ### END SOLUTION + raise NotImplementedError("Student implementation required") + + def run_offline(self, batch_size: int = 32) -> BenchmarkResult: + """ + Run offline benchmark. + + TODO: Implement offline benchmark with validation. + """ + ### BEGIN SOLUTION + if self.model is None or self.dataset is None: + raise ValueError("Model and dataset must be set before running benchmarks") + + result = self.scenarios.offline(self.model, self.dataset, batch_size) + validation = self.validator.validate_benchmark_result(result) + + self.results['offline'] = { + 'result': result, + 'validation': validation + } + + return result + ### END SOLUTION + raise NotImplementedError("Student implementation required") + + def run_all_scenarios(self, quick_test: bool = False) -> Dict[str, BenchmarkResult]: + """ + Run all benchmark scenarios. + + TODO: Implement comprehensive benchmarking. 
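One way to picture the quick-test switch described above is a simple parameter lookup; the values mirror the quick/full settings used elsewhere in this module, and the helper name is illustrative, not part of the framework API:

```python
# Illustrative sketch: pick tiny parameters for a fast smoke test,
# realistic ones for a full benchmark run.
QUICK_PARAMS = {"num_queries": 5, "target_qps": 20.0, "duration": 0.2, "batch_size": 3}
FULL_PARAMS = {"num_queries": 1000, "target_qps": 10.0, "duration": 60.0, "batch_size": 32}

def select_params(quick_test: bool) -> dict:
    """Return the parameter set for the requested benchmarking mode."""
    return QUICK_PARAMS if quick_test else FULL_PARAMS
```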
+ """ + ### BEGIN SOLUTION + if quick_test: + # Quick test with very small parameters for fast testing + single_result = self.run_single_stream(num_queries=5) + server_result = self.run_server(target_qps=20.0, duration=0.2) + offline_result = self.run_offline(batch_size=3) + else: + # Full benchmarking + single_result = self.run_single_stream(num_queries=1000) + server_result = self.run_server(target_qps=10.0, duration=60.0) + offline_result = self.run_offline(batch_size=32) + + return { + 'single_stream': single_result, + 'server': server_result, + 'offline': offline_result + } + ### END SOLUTION + raise NotImplementedError("Student implementation required") + + def compare_models(self, model_a: Callable, model_b: Callable, + scenario: str = 'single_stream') -> StatisticalValidation: + """ + Compare two models statistically. + + TODO: Implement model comparison. + """ + ### BEGIN SOLUTION + # Run both models on the same scenario + self.set_model(model_a) + if scenario == 'single_stream': + result_a = self.run_single_stream(num_queries=100) + elif scenario == 'server': + result_a = self.run_server(target_qps=5.0, duration=10.0) + else: # offline + result_a = self.run_offline(batch_size=16) + + self.set_model(model_b) + if scenario == 'single_stream': + result_b = self.run_single_stream(num_queries=100) + elif scenario == 'server': + result_b = self.run_server(target_qps=5.0, duration=10.0) + else: # offline + result_b = self.run_offline(batch_size=16) + + # Compare latencies + return self.validator.validate_comparison(result_a.latencies, result_b.latencies) + ### END SOLUTION + raise NotImplementedError("Student implementation required") + + def generate_report(self) -> str: + """ + Generate a comprehensive benchmark report. + + TODO: Implement professional report generation. 
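The percentile lines the report needs can be read off an ascending latency list with a small nearest-rank helper (a sketch; the function name is illustrative):

```python
def percentile(sorted_latencies: list, q: float) -> float:
    """Nearest-rank percentile (q in [0, 1]) from an ascending list."""
    # Clamp the index so q close to 1.0 never runs past the end.
    idx = min(int(q * len(sorted_latencies)), len(sorted_latencies) - 1)
    return sorted_latencies[idx]
```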
+ """ + ### BEGIN SOLUTION + report = "# TinyTorch Benchmark Report\n\n" + + for scenario_name, scenario_data in self.results.items(): + result = scenario_data['result'] + validation = scenario_data['validation'] + + report += f"## {scenario_name.replace('_', ' ').title()} Scenario\n\n" + report += f"- **Throughput**: {result.throughput:.2f} samples/second\n" + report += f"- **Mean Latency**: {statistics.mean(result.latencies)*1000:.2f} ms\n" + report += f"- **90th Percentile**: {result.latencies[int(0.9*len(result.latencies))]*1000:.2f} ms\n" + report += f"- **95th Percentile**: {result.latencies[int(0.95*len(result.latencies))]*1000:.2f} ms\n" + report += f"- **Statistical Validation**: {validation.recommendation}\n\n" + + return report + ### END SOLUTION + raise NotImplementedError("Student implementation required") + +# %% ../../modules/source/temp_holding/14_benchmarking/benchmarking_dev.ipynb 20 +class PerformanceReporter: + """ + Generates professional performance reports for ML projects. + + TODO: Implement professional report generation. + + UNDERSTANDING PROFESSIONAL REPORTS: + 1. Executive summary with key metrics + 2. Detailed methodology section + 3. Statistical validation results + 4. Comparison with baselines + 5. Recommendations for improvement + """ + + def __init__(self): + self.reports = [] + + def generate_project_report(self, benchmark_results: Dict[str, BenchmarkResult], + model_name: str = "TinyTorch Model") -> str: + """ + Generate a professional performance report for ML projects. + + TODO: Implement project report generation. + + STEP-BY-STEP: + 1. Create executive summary + 2. Add methodology section + 3. Present detailed results + 4. Include statistical validation + 5. Add recommendations + """ + ### BEGIN SOLUTION + report = f"""# {model_name} Performance Report + +## Executive Summary + +This report presents comprehensive performance benchmarking results for {model_name} using MLPerf-inspired methodology. 
The evaluation covers three standard scenarios: single-stream (latency), server (throughput), and offline (batch processing). + +### Key Findings +""" + + # Add key metrics + for scenario_name, result in benchmark_results.items(): + mean_latency = statistics.mean(result.latencies) * 1000 + p90_latency = result.latencies[int(0.9 * len(result.latencies))] * 1000 + + report += f"- **{scenario_name.replace('_', ' ').title()}**: {result.throughput:.2f} samples/sec, " + report += f"{mean_latency:.2f}ms mean latency, {p90_latency:.2f}ms 90th percentile\n" + + report += """ +## Methodology + +### Benchmark Framework +- **Architecture**: MLPerf-inspired four-component system +- **Scenarios**: Single-stream, server, and offline evaluation +- **Statistical Validation**: Multiple runs with confidence intervals +- **Metrics**: Latency distribution, throughput, accuracy + +### Test Environment +- **Hardware**: Standard development machine +- **Software**: TinyTorch framework +- **Dataset**: Standardized evaluation dataset +- **Validation**: Statistical significance testing + +## Detailed Results + +""" + + # Add detailed results for each scenario + for scenario_name, result in benchmark_results.items(): + report += f"### {scenario_name.replace('_', ' ').title()} Scenario\n\n" + + latencies_ms = [l * 1000 for l in result.latencies] + + report += f"- **Sample Count**: {len(result.latencies)}\n" + report += f"- **Mean Latency**: {statistics.mean(latencies_ms):.2f} ms\n" + report += f"- **Median Latency**: {statistics.median(latencies_ms):.2f} ms\n" + report += f"- **90th Percentile**: {latencies_ms[int(0.9 * len(latencies_ms))]:.2f} ms\n" + report += f"- **95th Percentile**: {latencies_ms[int(0.95 * len(latencies_ms))]:.2f} ms\n" + report += f"- **Standard Deviation**: {statistics.stdev(latencies_ms):.2f} ms\n" + report += f"- **Throughput**: {result.throughput:.2f} samples/second\n" + + if result.accuracy > 0: + report += f"- **Accuracy**: {result.accuracy:.4f}\n" + + report += 
"\n" + + report += """## Statistical Validation + +All results include proper statistical validation: +- Multiple independent runs for reliability +- Confidence intervals for key metrics +- Outlier detection and handling +- Significance testing for comparisons + +## Recommendations + +Based on the benchmark results: +1. **Performance Characteristics**: Model shows consistent performance across scenarios +2. **Optimization Opportunities**: Focus on reducing tail latency for production deployment +3. **Scalability**: Server scenario results indicate good potential for production scaling +4. **Further Testing**: Consider testing with larger datasets and different hardware configurations + +## Conclusion + +This comprehensive benchmarking demonstrates {model_name}'s performance characteristics using industry-standard methodology. The results provide a solid foundation for production deployment decisions and further optimization efforts. +""" + + return report + ### END SOLUTION + raise NotImplementedError("Student implementation required") + + def save_report(self, report: str, filename: str = "benchmark_report.md"): + """Save report to file.""" + with open(filename, 'w') as f: + f.write(report) + print(f"📄 Report saved to {filename}") + +def plot_benchmark_results(benchmark_results: Dict[str, BenchmarkResult]): + """Visualize benchmark results.""" + + # Create visualizations + fig, axes = plt.subplots(1, 3, figsize=(18, 5)) + + # Latency distribution for single-stream + if 'single_stream' in benchmark_results: + axes[0].hist(benchmark_results['single_stream'].latencies, bins=50, color='skyblue') + axes[0].set_title("Single-Stream Latency Distribution") + axes[0].set_xlabel("Latency (s)") + axes[0].set_ylabel("Frequency") + + # Server scenario latency + if 'server' in benchmark_results: + axes[1].plot(benchmark_results['server'].latencies, marker='o', linestyle='-', color='salmon') + axes[1].set_title("Server Scenario Latency Over Time") + axes[1].set_xlabel("Query
Index") + axes[1].set_ylabel("Latency (s)") + + # Offline scenario throughput + if 'offline' in benchmark_results: + offline_result = benchmark_results['offline'] + throughput = len(offline_result.latencies) / sum(offline_result.latencies) + axes[2].bar(['Throughput'], [throughput], color='lightgreen') + axes[2].set_title("Offline Scenario Throughput") + axes[2].set_ylabel("Samples per second") + + plt.tight_layout() + plt.show() + +# %% ../../modules/source/temp_holding/14_benchmarking/benchmarking_dev.ipynb 29 +class ProductionBenchmarkingProfiler: + """ + Advanced production-grade benchmarking profiler for ML systems. + + This class implements comprehensive performance analysis patterns used in + production ML systems, including end-to-end latency analysis, resource + monitoring, A/B testing frameworks, and production monitoring integration. + + TODO: Implement production-grade profiling capabilities. + + STEP-BY-STEP IMPLEMENTATION: + 1. End-to-end pipeline analysis (not just model inference) + 2. Resource utilization monitoring (CPU, memory, bandwidth) + 3. Statistical A/B testing frameworks + 4. Production monitoring and alerting integration + 5. Performance regression detection + 6. 
Load testing and capacity planning + + LEARNING CONNECTIONS: + - **Production ML Systems**: Real-world profiling for deployment optimization + - **Performance Engineering**: Systematic approach to identifying and fixing bottlenecks + - **A/B Testing**: Statistical frameworks for safe model rollouts + - **Cost Optimization**: Understanding resource usage for efficient cloud deployment + """ + + def __init__(self, enable_monitoring: bool = True): + self.enable_monitoring = enable_monitoring + self.baseline_metrics = {} + self.production_metrics = [] + self.ab_test_results = {} + self.resource_usage = [] + + def profile_end_to_end_pipeline(self, model: Callable, dataset: List, + preprocessing_fn: Optional[Callable] = None, + postprocessing_fn: Optional[Callable] = None) -> Dict[str, float]: + """ + Profile the complete ML pipeline including preprocessing and postprocessing. + + TODO: Implement end-to-end pipeline profiling. + + IMPLEMENTATION STEPS: + 1. Profile data loading and preprocessing time + 2. Profile model inference time + 3. Profile postprocessing and output formatting time + 4. Measure total memory usage throughout pipeline + 5. Calculate end-to-end latency distribution + 6. 
Identify bottlenecks in the pipeline + + HINTS: + - Use context managers for timing different stages + - Track memory usage with sys.getsizeof or psutil + - Measure both CPU and wall-clock time + - Consider batch vs single-sample processing differences + """ + ### BEGIN SOLUTION + import time + import sys + + pipeline_metrics = { + 'preprocessing_time': [], + 'inference_time': [], + 'postprocessing_time': [], + 'memory_usage': [], + 'end_to_end_latency': [] + } + + for sample in dataset[:100]: # Profile first 100 samples + start_time = time.perf_counter() + + # Preprocessing stage + preprocess_start = time.perf_counter() + if preprocessing_fn: + processed_sample = preprocessing_fn(sample) + else: + processed_sample = sample + preprocess_end = time.perf_counter() + pipeline_metrics['preprocessing_time'].append(preprocess_end - preprocess_start) + + # Inference stage + inference_start = time.perf_counter() + model_output = model(processed_sample) + inference_end = time.perf_counter() + pipeline_metrics['inference_time'].append(inference_end - inference_start) + + # Postprocessing stage + postprocess_start = time.perf_counter() + if postprocessing_fn: + final_output = postprocessing_fn(model_output) + else: + final_output = model_output + postprocess_end = time.perf_counter() + pipeline_metrics['postprocessing_time'].append(postprocess_end - postprocess_start) + + end_time = time.perf_counter() + pipeline_metrics['end_to_end_latency'].append(end_time - start_time) + + # Memory usage estimation + memory_usage = sys.getsizeof(processed_sample) + sys.getsizeof(model_output) + sys.getsizeof(final_output) + pipeline_metrics['memory_usage'].append(memory_usage) + + # Calculate summary statistics + summary_metrics = {} + for metric_name, values in pipeline_metrics.items(): + sorted_values = sorted(values) # percentiles are only meaningful on ascending data + summary_metrics[f'{metric_name}_mean'] = statistics.mean(values) + summary_metrics[f'{metric_name}_p95'] = sorted_values[int(0.95 * len(sorted_values))] if values else 0 + summary_metrics[f'{metric_name}_max'] =
max(values) if values else 0 + + return summary_metrics + ### END SOLUTION + raise NotImplementedError("Student implementation required") + + def monitor_resource_utilization(self, duration: float = 60.0) -> Dict[str, List[float]]: + """ + Monitor system resource utilization during model execution. + + TODO: Implement resource monitoring. + + IMPLEMENTATION STEPS: + 1. Sample CPU usage over time + 2. Track memory consumption patterns + 3. Monitor bandwidth utilization (if applicable) + 4. Record resource usage spikes and patterns + 5. Correlate resource usage with performance + + STUDENT IMPLEMENTATION CHALLENGE (75% level): + You need to implement the resource monitoring logic. + Consider how you would track CPU, memory, and other resources + during model execution in a production environment. + """ + ### BEGIN SOLUTION + import time + import os + + resource_metrics = { + 'cpu_usage': [], + 'memory_usage': [], + 'timestamp': [] + } + + start_time = time.perf_counter() + + while (time.perf_counter() - start_time) < duration: + current_time = time.perf_counter() - start_time + + # Simple CPU usage estimation (in real production, use psutil) + # This is a placeholder implementation + cpu_usage = 50 + 30 * np.random.rand() # Simulated CPU usage + + # Memory usage estimation + memory_usage = 1024 + 512 * np.random.rand() # Simulated memory in MB + + resource_metrics['cpu_usage'].append(cpu_usage) + resource_metrics['memory_usage'].append(memory_usage) + resource_metrics['timestamp'].append(current_time) + + time.sleep(0.1) # Sample every 100ms + + return resource_metrics + ### END SOLUTION + raise NotImplementedError("Student implementation required") + + def setup_ab_testing_framework(self, model_a: Callable, model_b: Callable, + traffic_split: float = 0.5) -> Dict[str, Any]: + """ + Set up A/B testing framework for comparing model versions in production. + + TODO: Implement A/B testing framework. + + IMPLEMENTATION STEPS: + 1. Implement traffic splitting logic + 2. 
Track metrics for both model versions + 3. Implement statistical significance testing + 4. Monitor for performance regressions + 5. Provide recommendations for rollout + + STUDENT IMPLEMENTATION CHALLENGE (75% level): + Implement a production-ready A/B testing framework that can + safely compare two model versions with proper statistical validation. + """ + ### BEGIN SOLUTION + ab_test_config = { + 'model_a': model_a, + 'model_b': model_b, + 'traffic_split': traffic_split, + 'metrics_a': {'latencies': [], 'accuracies': [], 'errors': 0}, + 'metrics_b': {'latencies': [], 'accuracies': [], 'errors': 0}, + 'total_requests': 0, + 'requests_a': 0, + 'requests_b': 0 + } + + return ab_test_config + ### END SOLUTION + raise NotImplementedError("Student implementation required") + + def run_ab_test(self, ab_config: Dict[str, Any], dataset: List, + num_samples: int = 1000) -> Dict[str, Any]: + """ + Execute A/B test with statistical validation. + + TODO: Implement A/B test execution. + + STUDENT IMPLEMENTATION CHALLENGE (75% level): + Execute the A/B test, collect metrics, and provide statistical + analysis of the results with confidence intervals. 
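The traffic-splitting step above amounts to a Bernoulli draw per request. A minimal sketch (the helper name and the injectable `rng` parameter are illustrative, used here so the behavior is testable):

```python
import random

def route_request(traffic_split: float, rng=random.random) -> str:
    """Route a request to model 'A' with probability traffic_split, else to 'B'."""
    return "A" if rng() < traffic_split else "B"
```

In production the draw would typically be keyed on a stable user or request ID so the same user always sees the same model version.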
+ """ + ### BEGIN SOLUTION + import time + + model_a = ab_config['model_a'] + model_b = ab_config['model_b'] + traffic_split = ab_config['traffic_split'] + + for i in range(num_samples): + sample = dataset[i % len(dataset)] + + # Route traffic based on split + if np.random.rand() < traffic_split: + # Route to model A + start_time = time.perf_counter() + try: + result = model_a(sample) + latency = time.perf_counter() - start_time + ab_config['metrics_a']['latencies'].append(latency) + ab_config['requests_a'] += 1 + except Exception: + ab_config['metrics_a']['errors'] += 1 + else: + # Route to model B + start_time = time.perf_counter() + try: + result = model_b(sample) + latency = time.perf_counter() - start_time + ab_config['metrics_b']['latencies'].append(latency) + ab_config['requests_b'] += 1 + except Exception: + ab_config['metrics_b']['errors'] += 1 + + ab_config['total_requests'] += 1 + + # Calculate test results + latencies_a = ab_config['metrics_a']['latencies'] + latencies_b = ab_config['metrics_b']['latencies'] + + if latencies_a and latencies_b: + # Statistical comparison + validator = StatisticalValidator() + statistical_result = validator.validate_comparison(latencies_a, latencies_b) + + results = { + 'model_a_performance': { + 'mean_latency': statistics.mean(latencies_a), + 'p95_latency': latencies_a[int(0.95 * len(latencies_a))], + 'error_rate': ab_config['metrics_a']['errors'] / ab_config['requests_a'] if ab_config['requests_a'] > 0 else 0 + }, + 'model_b_performance': { + 'mean_latency': statistics.mean(latencies_b), + 'p95_latency': latencies_b[int(0.95 * len(latencies_b))], + 'error_rate': ab_config['metrics_b']['errors'] / ab_config['requests_b'] if ab_config['requests_b'] > 0 else 0 + }, + 'statistical_analysis': statistical_result, + 'recommendation': self._generate_ab_recommendation(statistical_result) + } + else: + results = {'error': 'Insufficient data for comparison'} + + return results + ### END SOLUTION + raise 
NotImplementedError("Student implementation required") + + def _generate_ab_recommendation(self, statistical_result: StatisticalValidation) -> str: + """ + Generate production rollout recommendation based on A/B test results. + + STUDENT IMPLEMENTATION CHALLENGE (75% level): + Based on the statistical results, provide a clear recommendation + for production rollout decisions. + """ + ### BEGIN SOLUTION + if not statistical_result.is_significant: + return "No significant difference detected. Consider longer test duration or larger sample size." + + if statistical_result.effect_size < 0: + return "Model B shows worse performance. Do not proceed with rollout." + elif statistical_result.effect_size > 0.2: + return "Model B shows significant improvement. Proceed with gradual rollout." + else: + return "Model B shows marginal improvement. Consider business impact before rollout." + ### END SOLUTION + raise NotImplementedError("Student implementation required") + + def detect_performance_regression(self, current_metrics: Dict[str, float], + baseline_metrics: Dict[str, float], + threshold: float = 0.1) -> Dict[str, Any]: + """ + Detect performance regressions compared to baseline. + + TODO: Implement regression detection. + + STUDENT IMPLEMENTATION CHALLENGE (75% level): + Implement automated detection of performance regressions + with configurable thresholds and alerting. 
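The core of the detection logic is a relative-change check against the baseline. A minimal sketch for a latency-style metric, where higher is worse (function name is illustrative):

```python
def is_regression(current: float, baseline: float, threshold: float = 0.1) -> bool:
    """Flag a metric (higher is worse) that grew past the relative threshold."""
    if baseline <= 0:
        return False  # no meaningful baseline; avoid division by zero
    return (current - baseline) / baseline > threshold
```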
+ """ + ### BEGIN SOLUTION + regressions = [] + improvements = [] + + for metric_name, current_value in current_metrics.items(): + if metric_name in baseline_metrics: + baseline_value = baseline_metrics[metric_name] + if baseline_value > 0: # Avoid division by zero + change_percent = (current_value - baseline_value) / baseline_value + + if change_percent > threshold: + regressions.append({ + 'metric': metric_name, + 'baseline': baseline_value, + 'current': current_value, + 'change_percent': change_percent * 100 + }) + elif change_percent < -threshold: + improvements.append({ + 'metric': metric_name, + 'baseline': baseline_value, + 'current': current_value, + 'change_percent': abs(change_percent) * 100 + }) + + return { + 'regressions': regressions, + 'improvements': improvements, + 'alert_level': 'HIGH' if regressions else 'LOW', + 'recommendation': 'Review deployment' if regressions else 'Performance stable' + } + ### END SOLUTION + raise NotImplementedError("Student implementation required") + + def generate_capacity_planning_report(self, current_load: Dict[str, float], + projected_growth: float = 1.5) -> str: + """ + Generate capacity planning report for scaling production systems. + + STUDENT IMPLEMENTATION CHALLENGE (75% level): + Create a comprehensive capacity planning analysis that helps + engineering teams plan for growth and resource allocation. 
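The projection step is just scaling each current-load metric by the growth factor; a sketch (helper name is illustrative):

```python
def project_load(current: dict, growth: float) -> dict:
    """Scale every current-load metric by the expected growth factor."""
    return {name: value * growth for name, value in current.items()}
```

Threshold checks (e.g. projected CPU above 80%) would then be applied to the projected values to drive the scaling recommendations.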
+ """ + ### BEGIN SOLUTION + report = f"""# Capacity Planning Report + +## Current System Load +- **Average CPU Usage**: {current_load.get('cpu_usage', 0):.1f}% +- **Memory Usage**: {current_load.get('memory_usage', 0):.1f} MB +- **Request Rate**: {current_load.get('request_rate', 0):.1f} req/sec +- **Average Latency**: {current_load.get('latency', 0):.2f} ms + +## Projected Requirements (Growth Factor: {projected_growth}x) +- **Projected CPU Usage**: {current_load.get('cpu_usage', 0) * projected_growth:.1f}% +- **Projected Memory**: {current_load.get('memory_usage', 0) * projected_growth:.1f} MB +- **Projected Request Rate**: {current_load.get('request_rate', 0) * projected_growth:.1f} req/sec + +## Scaling Recommendations +""" + + cpu_projected = current_load.get('cpu_usage', 0) * projected_growth + memory_projected = current_load.get('memory_usage', 0) * projected_growth + + if cpu_projected > 80: + report += "- **CPU Scaling**: Consider adding more compute instances\n" + if memory_projected > 8000: # 8GB threshold + report += "- **Memory Scaling**: Consider upgrading to higher memory instances\n" + + report += "\n## Infrastructure Recommendations\n" + report += "- Monitor performance metrics continuously\n" + report += "- Set up auto-scaling policies\n" + report += "- Plan for peak load scenarios\n" + + return report + ### END SOLUTION + raise NotImplementedError("Student implementation required") diff --git a/tinytorch/core/cnn.py b/tinytorch/core/cnn.py index efadd784..78a81306 100644 --- a/tinytorch/core/cnn.py +++ b/tinytorch/core/cnn.py @@ -1,4 +1,19 @@ -# AUTOGENERATED! DO NOT EDIT! File to edit: ../../modules/source/06_spatial/spatial_dev.ipynb. +# ╔═══════════════════════════════════════════════════════════════════════════════╗ +# ║ 🚨 CRITICAL WARNING 🚨 ║ +# ║ AUTOGENERATED! DO NOT EDIT! ║ +# ║ ║ +# ║ This file is AUTOMATICALLY GENERATED from source modules. ║ +# ║ ANY CHANGES MADE HERE WILL BE LOST when modules are re-exported! 
║ +# ║ ║ +# ║ ✅ TO EDIT: modules/source/XX_cnn/cnn_dev.py ║ +# ║ ✅ TO EXPORT: Run 'tito module complete ' ║ +# ║ ║ +# ║ 🛡️ STUDENT PROTECTION: This file contains optimized implementations. ║ +# ║ Editing it directly may break module functionality and training. ║ +# ║ ║ +# ║ 🎓 LEARNING TIP: Work in modules/source/ - that's where real development ║ +# ║ happens! The tinytorch/ directory is just the compiled output. ║ +# ╚═══════════════════════════════════════════════════════════════════════════════╝ # %% auto 0 __all__ = ['conv2d_naive', 'Conv2D', 'flatten'] diff --git a/tinytorch/core/compression.py b/tinytorch/core/compression.py new file mode 100644 index 00000000..893d17e7 --- /dev/null +++ b/tinytorch/core/compression.py @@ -0,0 +1,1187 @@ +# ╔═══════════════════════════════════════════════════════════════════════════════╗ +# ║ 🚨 CRITICAL WARNING 🚨 ║ +# ║ AUTOGENERATED! DO NOT EDIT! ║ +# ║ ║ +# ║ This file is AUTOMATICALLY GENERATED from source modules. ║ +# ║ ANY CHANGES MADE HERE WILL BE LOST when modules are re-exported! ║ +# ║ ║ +# ║ ✅ TO EDIT: modules/source/12_compression/compression_dev.py ║ +# ║ ✅ TO EXPORT: Run 'tito module complete ' ║ +# ║ ║ +# ║ 🛡️ STUDENT PROTECTION: This file contains optimized implementations. ║ +# ║ Editing it directly may break module functionality and training. ║ +# ║ ║ +# ║ 🎓 LEARNING TIP: Work in modules/source/ - that's where real development ║ +# ║ happens! The tinytorch/ directory is just the compiled output. 
║ +# ╚═══════════════════════════════════════════════════════════════════════════════╝ + +# %% auto 0 +__all__ = ['setup_import_paths', 'CompressionMetrics', 'prune_weights_by_magnitude', 'calculate_sparsity', + 'quantize_layer_weights', 'DistillationLoss', 'compute_neuron_importance', 'prune_layer_neurons', + 'CompressionSystemsProfiler', 'compare_compression_techniques'] + +# %% ../../modules/source/temp_holding/16_regularization/regularization_dev.ipynb 1 +import numpy as np +import sys +import os +from typing import List, Dict, Any, Optional, Union, Tuple + +# Helper function to set up import paths +def setup_import_paths(): + """Set up import paths for development modules.""" + import sys + import os + + # Add module directories to path + base_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__))) + module_dirs = [ + '01_tensor', '02_activations', '03_layers', '04_networks', + '05_cnn', '06_dataloader', '07_autograd', '08_optimizers', '09_training' + ] + + for module_dir in module_dirs: + sys.path.append(os.path.join(base_dir, module_dir)) + +# Set up paths +setup_import_paths() + +# Import all the building blocks we need +try: + from tinytorch.core.tensor import Tensor + from tinytorch.core.layers import Dense + from tinytorch.core.networks import Sequential + from tinytorch.core.training import CrossEntropyLoss, Trainer +except ImportError: + # For development, create mock classes or import from local modules + try: + from tensor_dev import Tensor + from layers_dev import Dense + from networks_dev import Sequential + from training_dev import CrossEntropyLoss, Trainer + except ImportError: + # Create minimal mock classes for development + class Tensor: + def __init__(self, data): + self.data = np.array(data) + self.shape = self.data.shape + + def __str__(self): + return f"Tensor({self.data})" + + class Dense: + def __init__(self, input_size, output_size): + self.input_size = input_size + self.output_size = output_size + self.weights = 
Tensor(np.random.randn(input_size, output_size) * 0.1) + self.bias = Tensor(np.zeros(output_size)) + + def __str__(self): + return f"Dense({self.input_size}, {self.output_size})" + + class Sequential: + def __init__(self, layers=None): + self.layers = layers or [] + + class CrossEntropyLoss: + def __init__(self): + pass + + class Trainer: + def __init__(self, model, optimizer, loss_function): + self.model = model + self.optimizer = optimizer + self.loss_function = loss_function + +# %% ../../modules/source/temp_holding/16_regularization/regularization_dev.ipynb 7 +class CompressionMetrics: + """ + Utilities for measuring model size, sparsity, and compression efficiency. + + This class provides tools to analyze neural network models and understand + their memory footprint, parameter distribution, and compression potential. + """ + + def __init__(self): + """Initialize compression metrics analyzer.""" + pass + + def count_parameters(self, model: Sequential) -> Dict[str, int]: + """ + Count parameters in a neural network model. + + Args: + model: Sequential model to analyze + + Returns: + Dictionary with parameter counts per layer and total + + TODO: Implement parameter counting for neural network analysis. + + STEP-BY-STEP IMPLEMENTATION: + 1. Initialize counters for different parameter types + 2. Iterate through each layer in the model + 3. Count weights and biases for each layer + 4. Calculate total parameters across all layers + 5. 
Return detailed breakdown dictionary + + EXAMPLE OUTPUT: + { + 'layer_0_weights': 100352, + 'layer_0_bias': 128, + 'layer_1_weights': 8192, + 'layer_1_bias': 64, + 'layer_2_weights': 640, + 'layer_2_bias': 10, + 'total_parameters': 109386, + 'total_weights': 109184, + 'total_bias': 202 + } + + IMPLEMENTATION HINTS: + - Use hasattr() to check if layer has weights/bias attributes + - Weight matrices have shape (input_size, output_size) + - Bias vectors have shape (output_size,) + - Use np.prod() to calculate total elements from shape + - Track layer index for detailed reporting + + LEARNING CONNECTIONS: + - This is like `model.numel()` in PyTorch + - Understanding where parameters are concentrated + - Foundation for compression target selection + """ + ### BEGIN SOLUTION + param_counts = {} + total_params = 0 + total_weights = 0 + total_bias = 0 + + for i, layer in enumerate(model.layers): + # Count weights if layer has them + if hasattr(layer, 'weights') and layer.weights is not None: + # Handle different weight formats + if hasattr(layer.weights, 'shape'): + weight_count = np.prod(layer.weights.shape) + else: + weight_count = np.prod(layer.weights.data.shape) + + param_counts[f'layer_{i}_weights'] = weight_count + total_weights += weight_count + total_params += weight_count + + # Count bias if layer has them + if hasattr(layer, 'bias') and layer.bias is not None: + # Handle different bias formats + if hasattr(layer.bias, 'shape'): + bias_count = np.prod(layer.bias.shape) + else: + bias_count = np.prod(layer.bias.data.shape) + + param_counts[f'layer_{i}_bias'] = bias_count + total_bias += bias_count + total_params += bias_count + + # Add summary statistics + param_counts['total_parameters'] = total_params + param_counts['total_weights'] = total_weights + param_counts['total_bias'] = total_bias + + return param_counts + ### END SOLUTION + + def calculate_model_size(self, model: Sequential, dtype: str = 'float32') -> Dict[str, Any]: + """ + Calculate memory footprint 
of a neural network model. + + Args: + model: Sequential model to analyze + dtype: Data type for size calculation ('float32', 'float16', 'int8') + + Returns: + Dictionary with size information in different units + """ + # Get parameter count + param_info = self.count_parameters(model) + total_params = param_info['total_parameters'] + + # Determine bytes per parameter + bytes_per_param = { + 'float32': 4, + 'float16': 2, + 'int8': 1 + }.get(dtype, 4) + + # Calculate sizes + total_bytes = total_params * bytes_per_param + size_kb = total_bytes / 1024 + size_mb = size_kb / 1024 + + return { + 'total_parameters': total_params, + 'bytes_per_parameter': bytes_per_param, + 'total_bytes': total_bytes, + 'size_kb': round(size_kb, 2), + 'size_mb': round(size_mb, 2), + 'dtype': dtype + } + +# %% ../../modules/source/temp_holding/16_regularization/regularization_dev.ipynb 11 +def prune_weights_by_magnitude(layer: Dense, pruning_ratio: float = 0.5) -> Tuple[Dense, Dict[str, Any]]: + """ + Prune weights in a Dense layer by magnitude. + + Args: + layer: Dense layer to prune + pruning_ratio: Fraction of weights to remove (0.0 to 1.0) + + Returns: + Tuple of (pruned_layer, pruning_info) + + TODO: Implement magnitude-based weight pruning. + + STEP-BY-STEP IMPLEMENTATION: + 1. Get weight matrix from layer + 2. Calculate absolute values (magnitudes) + 3. Find threshold using percentile + 4. Create binary mask for weights above threshold + 5. Apply mask to weights (set small weights to zero) + 6. 
Update layer weights and return pruning statistics + + EXAMPLE USAGE: + ```python + layer = Dense(784, 128) + pruned_layer, info = prune_weights_by_magnitude(layer, pruning_ratio=0.3) + print(f"Pruned {info['weights_removed']} weights, sparsity: {info['sparsity']:.2f}") + ``` + + IMPLEMENTATION HINTS: + - Use np.percentile() with pruning_ratio * 100 for threshold + - Create mask with np.abs(weights) > threshold + - Apply mask by element-wise multiplication + - Count zeros to calculate sparsity + - Return original layer (modified) and statistics + + LEARNING CONNECTIONS: + - This is the foundation of network pruning + - Magnitude pruning is simplest but effective + - Sparsity = fraction of weights that are zero + - Threshold selection affects accuracy vs compression trade-off + """ + ### BEGIN SOLUTION + # Get current weights and ensure they're numpy arrays + weights = layer.weights.data + if not isinstance(weights, np.ndarray): + weights = np.array(weights) + + original_weights = weights.copy() + + # Calculate magnitudes and threshold + magnitudes = np.abs(weights) + threshold = np.percentile(magnitudes, pruning_ratio * 100) + + # Create mask and apply pruning + mask = magnitudes > threshold + pruned_weights = weights * mask + + # Update layer weights by creating a new Tensor + layer.weights = Tensor(pruned_weights) + + # Calculate pruning statistics + total_weights = weights.size + zero_weights = np.sum(pruned_weights == 0) + weights_removed = zero_weights - np.sum(original_weights == 0) + sparsity = zero_weights / total_weights + + pruning_info = { + 'pruning_ratio': pruning_ratio, + 'threshold': float(threshold), + 'total_weights': total_weights, + 'weights_removed': weights_removed, + 'remaining_weights': total_weights - zero_weights, + 'sparsity': float(sparsity), + 'compression_ratio': 1 / (1 - sparsity) if sparsity < 1 else float('inf') + } + + return layer, pruning_info + ### END SOLUTION + +# %% 
../../modules/source/temp_holding/16_regularization/regularization_dev.ipynb 12 +def calculate_sparsity(layer: Dense) -> float: + """ + Calculate sparsity (fraction of zero weights) in a Dense layer. + + Args: + layer: Dense layer to analyze + + Returns: + Sparsity as float between 0.0 and 1.0 + + TODO: Implement sparsity calculation. + + STEP-BY-STEP IMPLEMENTATION: + 1. Get weight matrix from layer + 2. Count total number of weights + 3. Count number of zero weights + 4. Calculate sparsity = zero_weights / total_weights + 5. Return as float + + EXAMPLE USAGE: + ```python + layer = Dense(100, 50) + sparsity = calculate_sparsity(layer) + print(f"Layer sparsity: {sparsity:.2%}") + ``` + + IMPLEMENTATION HINTS: + - Use np.sum() with condition to count zeros + - Use .size attribute for total elements + - Return 0.0 if no weights (edge case) + - Sparsity of 0.0 = dense, 1.0 = completely sparse + + LEARNING CONNECTIONS: + - Sparsity is key metric for compression + - Higher sparsity = more compression + - Sparsity patterns affect hardware efficiency + """ + ### BEGIN SOLUTION + if not hasattr(layer, 'weights') or layer.weights is None: + return 0.0 + + weights = layer.weights.data + if not isinstance(weights, np.ndarray): + weights = np.array(weights) + + total_weights = weights.size + zero_weights = np.sum(weights == 0) + + return zero_weights / total_weights if total_weights > 0 else 0.0 + ### END SOLUTION + +# %% ../../modules/source/temp_holding/16_regularization/regularization_dev.ipynb 16 +def quantize_layer_weights(layer: Dense, bits: int = 8) -> Tuple[Dense, Dict[str, Any]]: + """ + Quantize layer weights to reduce precision. + + Args: + layer: Dense layer to quantize + bits: Number of bits for quantization (8, 16, etc.) + + Returns: + Tuple of (quantized_layer, quantization_info) + + TODO: Implement weight quantization for memory efficiency. + + STEP-BY-STEP IMPLEMENTATION: + 1. Get weight matrix from layer + 2. 
+ Find min and max values for quantization range + 3. Calculate scale factor: (max - min) / (2^bits - 1) + 4. Quantize: round((weights - min) / scale) + 5. Dequantize back to float: quantized * scale + min + 6. Update layer weights and return statistics + + EXAMPLE USAGE: + ```python + layer = Dense(784, 128) + quantized_layer, info = quantize_layer_weights(layer, bits=8) + print(f"Memory reduction: {info['memory_reduction']:.1f}x") + ``` + + IMPLEMENTATION HINTS: + - Use np.min() and np.max() to find weight range + - Clamp quantized values to valid range [0, 2^bits-1] + - Store original dtype for memory calculation + - Calculate theoretical memory savings + + LEARNING CONNECTIONS: + - This is how mobile AI frameworks work + - Hardware accelerators optimize for INT8 + - Precision-performance trade-off is key + """ + ### BEGIN SOLUTION + # Get current weights and ensure they're numpy arrays + weights = layer.weights.data + if not isinstance(weights, np.ndarray): + weights = np.array(weights) + + original_weights = weights.copy() + original_dtype = weights.dtype + + # Find min and max for quantization range + w_min, w_max = np.min(weights), np.max(weights) + + # Calculate scale factor (guard against constant weights, where max == min) + w_range = w_max - w_min + scale = w_range / (2**bits - 1) if w_range > 0 else 1.0 + + # Quantize weights + quantized = np.round((weights - w_min) / scale) + quantized = np.clip(quantized, 0, 2**bits - 1) # Clamp to valid range + + # Dequantize back to float (simulation of quantized inference) + dequantized = quantized * scale + w_min + + # Update layer weights + layer.weights = Tensor(dequantized.astype(np.float32)) + + # Calculate quantization statistics + total_weights = weights.size + original_bytes = total_weights * 4 # FP32 = 4 bytes + quantized_bytes = total_weights * bits / 8 # keep the fraction so sub-byte widths (e.g. 4-bit) are counted correctly + memory_reduction = original_bytes / quantized_bytes if quantized_bytes > 0 else 1.0 + + # Calculate quantization error + mse_error = np.mean((original_weights - dequantized) ** 2) + max_error =
np.max(np.abs(original_weights - dequantized)) + + quantization_info = { + 'bits': bits, + 'scale': float(scale), + 'min_val': float(w_min), + 'max_val': float(w_max), + 'total_weights': total_weights, + 'original_bytes': original_bytes, + 'quantized_bytes': quantized_bytes, + 'memory_reduction': float(memory_reduction), + 'mse_error': float(mse_error), + 'max_error': float(max_error), + 'original_dtype': str(original_dtype) + } + + return layer, quantization_info + ### END SOLUTION + +# %% ../../modules/source/temp_holding/16_regularization/regularization_dev.ipynb 20 +class DistillationLoss: + """ + Combined loss function for knowledge distillation. + + This loss combines standard classification loss (hard targets) with + distillation loss (soft targets from teacher) for training compact models. + """ + + def __init__(self, temperature: float = 3.0, alpha: float = 0.5): + """ + Initialize distillation loss. + + Args: + temperature: Temperature for softening probability distributions + alpha: Weight for hard loss (1-alpha for soft loss) + """ + self.temperature = temperature + self.alpha = alpha + self.ce_loss = CrossEntropyLoss() + + def __call__(self, student_logits: np.ndarray, teacher_logits: np.ndarray, + true_labels: np.ndarray) -> float: + """ + Calculate combined distillation loss. + + Args: + student_logits: Raw outputs from student model + teacher_logits: Raw outputs from teacher model + true_labels: Ground truth labels + + Returns: + Combined loss value + + TODO: Implement knowledge distillation loss function. + + STEP-BY-STEP IMPLEMENTATION: + 1. Calculate hard loss using standard cross-entropy + 2. Apply temperature scaling to both logits + 3. Calculate soft targets from teacher logits + 4. Calculate soft loss between student and teacher distributions + 5. Combine hard and soft losses with alpha weighting + 6. 
+ Return total loss + + EXAMPLE USAGE: + ```python + distill_loss = DistillationLoss(temperature=3.0, alpha=0.5) + loss = distill_loss(student_out, teacher_out, labels) + ``` + + IMPLEMENTATION HINTS: + - Use temperature scaling before softmax: logits / temperature + - Implement stable softmax to avoid numerical issues + - Scale soft loss by temperature^2 (standard practice) + - Ensure proper normalization for both losses + + LEARNING CONNECTIONS: + - This is how DistilBERT was trained + - Temperature controls knowledge transfer richness + - Alpha balances accuracy vs compression + """ + ### BEGIN SOLUTION + # Convert inputs to numpy arrays if needed + if not isinstance(student_logits, np.ndarray): + student_logits = np.array(student_logits) + if not isinstance(teacher_logits, np.ndarray): + teacher_logits = np.array(teacher_logits) + if not isinstance(true_labels, np.ndarray): + true_labels = np.array(true_labels) + + # Hard loss: standard classification loss + hard_loss = self._cross_entropy_loss(student_logits, true_labels) + + # Soft loss: distillation from teacher + # Apply temperature scaling + teacher_soft = self._softmax(teacher_logits / self.temperature) + student_soft = self._softmax(student_logits / self.temperature) + + # Soft loss: cross-entropy against the teacher's soft targets (equals KL divergence up to the constant teacher-entropy term) + soft_loss = -np.mean(np.sum(teacher_soft * np.log(student_soft + 1e-10), axis=-1)) + + # Scale soft loss by temperature^2 (standard practice) + soft_loss *= (self.temperature ** 2) + + # Combine losses + total_loss = self.alpha * hard_loss + (1 - self.alpha) * soft_loss + + return float(total_loss) + ### END SOLUTION + + def _softmax(self, logits: np.ndarray) -> np.ndarray: + """Numerically stable softmax.""" + # Subtract max for numerical stability + exp_logits = np.exp(logits - np.max(logits, axis=-1, keepdims=True)) + return exp_logits / np.sum(exp_logits, axis=-1, keepdims=True) + + def _cross_entropy_loss(self, logits: np.ndarray, labels: np.ndarray) -> float: + """Simple cross-entropy loss
implementation.""" + # Convert labels to one-hot if needed + if labels.ndim == 1: + num_classes = logits.shape[-1] + one_hot = np.zeros((labels.shape[0], num_classes)) + one_hot[np.arange(labels.shape[0]), labels] = 1 + labels = one_hot + + # Apply softmax and calculate cross-entropy + probs = self._softmax(logits) + return -np.mean(np.sum(labels * np.log(probs + 1e-10), axis=-1)) + +# %% ../../modules/source/temp_holding/16_regularization/regularization_dev.ipynb 24 +def compute_neuron_importance(layer: Dense, method: str = 'weight_magnitude') -> np.ndarray: + """ + Compute importance scores for each neuron in a Dense layer. + + Args: + layer: Dense layer to analyze + method: Importance computation method + + Returns: + Array of importance scores for each output neuron + + TODO: Implement neuron importance calculation. + + STEP-BY-STEP IMPLEMENTATION: + 1. Get weight matrix from layer + 2. Choose importance metric based on method + 3. Calculate per-neuron importance scores + 4. Return array of scores (one per output neuron) + + AVAILABLE METHODS: + - 'weight_magnitude': Sum of absolute weights per neuron + - 'weight_variance': Variance of weights per neuron + - 'random': Random importance (for baseline comparison) + + IMPLEMENTATION HINTS: + - Weights shape is (input_size, output_size) + - Each column represents one output neuron + - Use axis=0 for operations across input dimensions + - Higher scores = more important neurons + + LEARNING CONNECTIONS: + - This is how neural architecture search works + - Different metrics capture different aspects of importance + - Importance ranking is crucial for effective pruning + """ + ### BEGIN SOLUTION + # Get weights and ensure they're numpy arrays + weights = layer.weights.data + if not isinstance(weights, np.ndarray): + weights = np.array(weights) + + if method == 'weight_magnitude': + # Sum of absolute weights per neuron (column) + importance = np.sum(np.abs(weights), axis=0) + + elif method == 'weight_variance': + # 
Variance of weights per neuron (column) + importance = np.var(weights, axis=0) + + elif method == 'random': + # Random importance for baseline comparison + importance = np.random.rand(weights.shape[1]) + + else: + raise ValueError(f"Unknown importance method: {method}") + + return importance + ### END SOLUTION + +# %% ../../modules/source/temp_holding/16_regularization/regularization_dev.ipynb 25 +def prune_layer_neurons(layer: Dense, keep_ratio: float = 0.7, + importance_method: str = 'weight_magnitude') -> Tuple[Dense, Dict[str, Any]]: + """ + Remove least important neurons from a Dense layer. + + Args: + layer: Dense layer to prune + keep_ratio: Fraction of neurons to keep (0.0 to 1.0) + importance_method: Method for computing neuron importance + + Returns: + Tuple of (pruned_layer, pruning_info) + + TODO: Implement structured neuron pruning. + + STEP-BY-STEP IMPLEMENTATION: + 1. Compute importance scores for all neurons + 2. Determine how many neurons to keep + 3. Select indices of most important neurons + 4. Create new layer with reduced dimensions + 5. Copy weights and biases for selected neurons + 6. 
Return pruned layer and statistics + + EXAMPLE USAGE: + ```python + layer = Dense(784, 128) + pruned_layer, info = prune_layer_neurons(layer, keep_ratio=0.75) + print(f"Reduced from {info['original_neurons']} to {info['remaining_neurons']} neurons") + ``` + + IMPLEMENTATION HINTS: + - Use np.argsort() to rank neurons by importance + - Take the top keep_count neurons: indices[-keep_count:] + - Create new layer with reduced output size + - Copy both weights and bias for selected neurons + - Track original and new sizes for statistics + + LEARNING CONNECTIONS: + - This is actual model architecture modification + - Hardware gets real speedup from smaller matrices + - Must consider cascade effects on next layers + """ + ### BEGIN SOLUTION + # Compute neuron importance + importance_scores = compute_neuron_importance(layer, importance_method) + + # Determine how many neurons to keep + original_neurons = layer.output_size + keep_count = max(1, int(original_neurons * keep_ratio)) # Keep at least 1 neuron + + # Select most important neurons + sorted_indices = np.argsort(importance_scores) + keep_indices = sorted_indices[-keep_count:] # Take top keep_count neurons + keep_indices = np.sort(keep_indices) # Sort for consistent ordering + + # Get current weights and biases + weights = layer.weights.data + if not isinstance(weights, np.ndarray): + weights = np.array(weights) + + bias = layer.bias.data if layer.bias is not None else None + if bias is not None and not isinstance(bias, np.ndarray): + bias = np.array(bias) + + # Create new layer with reduced dimensions + pruned_layer = Dense(layer.input_size, keep_count) + + # Copy weights for selected neurons + pruned_weights = weights[:, keep_indices] + pruned_layer.weights = Tensor(np.ascontiguousarray(pruned_weights)) + + # Copy bias for selected neurons + if bias is not None: + pruned_bias = bias[keep_indices] + pruned_layer.bias = Tensor(np.ascontiguousarray(pruned_bias)) + + # Calculate pruning statistics + neurons_removed = 
original_neurons - keep_count + compression_ratio = original_neurons / keep_count if keep_count > 0 else float('inf') + + # Calculate parameter reduction + original_params = layer.input_size * original_neurons + (original_neurons if bias is not None else 0) + new_params = layer.input_size * keep_count + (keep_count if bias is not None else 0) + param_reduction = (original_params - new_params) / original_params + + pruning_info = { + 'keep_ratio': keep_ratio, + 'importance_method': importance_method, + 'original_neurons': original_neurons, + 'remaining_neurons': keep_count, + 'neurons_removed': neurons_removed, + 'compression_ratio': float(compression_ratio), + 'original_params': original_params, + 'new_params': new_params, + 'param_reduction': float(param_reduction), + 'keep_indices': keep_indices.tolist() + } + + return pruned_layer, pruning_info + ### END SOLUTION + +# %% ../../modules/source/temp_holding/16_regularization/regularization_dev.ipynb 29 +class CompressionSystemsProfiler: + """ + Advanced profiling system for analyzing compression techniques in production environments. + + This profiler provides 65% implementation level analysis of compression techniques, + focusing on production deployment scenarios including quantization impact analysis, + inference speedup measurements, and hardware-specific optimizations. + """ + + def __init__(self): + """Initialize the compression systems profiler.""" + self.metrics = CompressionMetrics() + self.compression_history = [] + + def analyze_quantization_impact(self, model: Sequential, target_bits: List[int] = [32, 16, 8, 4]) -> Dict[str, Any]: + """ + Analyze quantization impact across different bit widths for production deployment. + + Args: + model: Sequential model to analyze + target_bits: List of bit widths to test + + Returns: + Comprehensive quantization analysis including accuracy vs compression tradeoffs + + TODO: Implement advanced quantization impact analysis (65% implementation level). 
+ + STEP-BY-STEP IMPLEMENTATION: + 1. Create model copies for each bit width + 2. Apply quantization with different bit widths + 3. Measure memory reduction and inference implications + 4. Calculate theoretical speedup for different hardware + 5. Analyze accuracy degradation patterns + 6. Generate production deployment recommendations + + PRODUCTION PATTERNS TO ANALYZE: + - Mobile deployment (ARM processors, limited memory) + - Edge inference (TPUs, power constraints) + - Cloud serving (GPU acceleration, batch processing) + - Real-time systems (latency requirements) + + IMPLEMENTATION HINTS: + - Model different hardware characteristics + - Consider memory bandwidth limitations + - Include power consumption estimates + - Analyze batch vs single inference patterns + + LEARNING CONNECTIONS: + - This mirrors TensorFlow Lite quantization analysis + - Production systems need this kind of comprehensive analysis + - Hardware-aware compression is crucial for deployment + """ + ### BEGIN SOLUTION + results = { + 'quantization_analysis': {}, + 'hardware_recommendations': {}, + 'deployment_scenarios': {} + } + + baseline_size = self.metrics.calculate_model_size(model, dtype='float32') + baseline_params = self.metrics.count_parameters(model)['total_parameters'] + + for bits in target_bits: + # Create model copy for quantization + test_model = Sequential([Dense(layer.input_size, layer.output_size) for layer in model.layers]) + for i, layer in enumerate(test_model.layers): + layer.weights = Tensor(model.layers[i].weights.data.copy() if hasattr(model.layers[i].weights.data, 'copy') else np.array(model.layers[i].weights.data)) + if hasattr(layer, 'bias') and model.layers[i].bias is not None: + layer.bias = Tensor(model.layers[i].bias.data.copy() if hasattr(model.layers[i].bias.data, 'copy') else np.array(model.layers[i].bias.data)) + + # Apply quantization to all layers + total_error = 0 + for i, layer in enumerate(test_model.layers): + if isinstance(layer, Dense): + _, quant_info 
= quantize_layer_weights(layer, bits=bits) + total_error += quant_info['mse_error'] + + # Calculate quantized model size + dtype_map = {32: 'float32', 16: 'float16', 8: 'int8', 4: 'int8'} # Approximate for 4-bit + quantized_size = self.metrics.calculate_model_size(test_model, dtype=dtype_map.get(bits, 'int8')) + + # Memory and performance analysis + memory_reduction = baseline_size['size_mb'] / quantized_size['size_mb'] + + # Hardware-specific analysis + hardware_analysis = { + 'mobile_arm': { + 'memory_bandwidth_improvement': memory_reduction * 0.8, # ARM efficiency + 'inference_speedup': min(memory_reduction * 0.6, 4.0), # Conservative estimate + 'power_reduction': memory_reduction * 0.7, # Power scales with memory access + 'deployment_feasibility': 'excellent' if quantized_size['size_mb'] < 10 else 'good' if quantized_size['size_mb'] < 50 else 'limited' + }, + 'edge_tpu': { + 'quantization_compatibility': 'native' if bits == 8 else 'emulated', + 'inference_speedup': 8.0 if bits == 8 else 1.0, # TPUs optimized for INT8 + 'power_efficiency': 'optimal' if bits == 8 else 'suboptimal', + 'deployment_feasibility': 'excellent' if bits == 8 and quantized_size['size_mb'] < 20 else 'limited' + }, + 'gpu_cloud': { + 'tensor_core_acceleration': True if bits in [16, 8] else False, + 'batch_throughput_improvement': memory_reduction * 1.2, # GPU batch efficiency + 'memory_capacity_improvement': memory_reduction, + 'deployment_feasibility': 'excellent' # Cloud has fewer constraints + } + } + + results['quantization_analysis'][f'{bits}bit'] = { + 'bits': bits, + 'model_size_mb': quantized_size['size_mb'], + 'memory_reduction_factor': memory_reduction, + 'quantization_error': total_error / len(test_model.layers), + 'compression_ratio': baseline_size['size_mb'] / quantized_size['size_mb'], + 'hardware_analysis': hardware_analysis + } + + # Generate deployment recommendations + results['deployment_scenarios'] = { + 'mobile_deployment': { + 'recommended_bits': 8, + 'rationale': 
'INT8 provides optimal balance of size reduction and ARM processor efficiency', + 'expected_benefits': 'Memory reduction, inference speedup, improved battery life', + 'considerations': 'Monitor accuracy degradation, test on target devices' + }, + 'edge_inference': { + 'recommended_bits': 8, + 'rationale': 'Edge TPUs and similar hardware optimized for INT8 quantization', + 'expected_benefits': 'Maximum hardware acceleration, minimal power consumption', + 'considerations': 'Ensure quantization-aware training for best accuracy' + }, + 'cloud_serving': { + 'recommended_bits': 16, + 'rationale': 'FP16 provides good compression with minimal accuracy loss and GPU acceleration', + 'expected_benefits': 'Increased batch throughput, reduced memory usage', + 'considerations': 'Consider mixed precision for optimal performance' + } + } + + return results + ### END SOLUTION + + def measure_inference_speedup(self, original_model: Sequential, compressed_model: Sequential, + batch_sizes: List[int] = [1, 8, 32, 128]) -> Dict[str, Any]: + """ + Measure theoretical inference speedup from compression techniques. 
+ + Args: + original_model: Baseline model + compressed_model: Compressed model to compare + batch_sizes: Different batch sizes for analysis + + Returns: + Inference speedup analysis across different scenarios + """ + results = { + 'flops_analysis': {}, + 'memory_analysis': {}, + 'speedup_estimates': {} + } + + # Calculate FLOPs for both models + original_flops = self._calculate_model_flops(original_model) + compressed_flops = self._calculate_model_flops(compressed_model) + + # Memory analysis + original_size = self.metrics.calculate_model_size(original_model) + compressed_size = self.metrics.calculate_model_size(compressed_model) + + results['flops_analysis'] = { + 'original_flops': original_flops, + 'compressed_flops': compressed_flops, + 'flops_reduction': (original_flops - compressed_flops) / original_flops, + 'computational_speedup': original_flops / compressed_flops if compressed_flops > 0 else float('inf') + } + + results['memory_analysis'] = { + 'original_size_mb': original_size['size_mb'], + 'compressed_size_mb': compressed_size['size_mb'], + 'memory_reduction': (original_size['size_mb'] - compressed_size['size_mb']) / original_size['size_mb'], + 'memory_speedup': original_size['size_mb'] / compressed_size['size_mb'] + } + + # Estimate speedup for different scenarios + for batch_size in batch_sizes: + compute_time_original = original_flops * batch_size / 1e9 # Assume 1 GFLOPS baseline + compute_time_compressed = compressed_flops * batch_size / 1e9 + + memory_time_original = original_size['size_mb'] * batch_size / 100 # Assume 100 MB/s memory bandwidth + memory_time_compressed = compressed_size['size_mb'] * batch_size / 100 + + total_time_original = compute_time_original + memory_time_original + total_time_compressed = compute_time_compressed + memory_time_compressed + + results['speedup_estimates'][f'batch_{batch_size}'] = { + 'compute_speedup': compute_time_original / compute_time_compressed if compute_time_compressed > 0 else float('inf'), + 
'memory_speedup': memory_time_original / memory_time_compressed if memory_time_compressed > 0 else float('inf'), + 'total_speedup': total_time_original / total_time_compressed if total_time_compressed > 0 else float('inf') + } + + return results + + def analyze_accuracy_tradeoffs(self, model: Sequential, compression_levels: List[float] = [0.1, 0.3, 0.5, 0.7, 0.9]) -> Dict[str, Any]: + """ + Analyze accuracy vs compression tradeoffs across different compression levels. + + Args: + model: Model to analyze + compression_levels: Different compression ratios to test + + Returns: + Analysis of accuracy degradation patterns + """ + results = { + 'compression_curves': {}, + 'optimal_operating_points': {}, + 'production_recommendations': {} + } + + baseline_size = self.metrics.calculate_model_size(model) + + for level in compression_levels: + # Test different compression techniques at this level + techniques = { + 'magnitude_pruning': self._apply_magnitude_pruning(model, level), + 'structured_pruning': self._apply_structured_pruning(model, 1 - level), + 'quantization': self._apply_quantization(model, max(4, int(32 * (1 - level)))) + } + + for technique_name, compressed_model in techniques.items(): + if compressed_model is not None: + compressed_size = self.metrics.calculate_model_size(compressed_model) + compression_ratio = baseline_size['size_mb'] / compressed_size['size_mb'] + + if technique_name not in results['compression_curves']: + results['compression_curves'][technique_name] = [] + + results['compression_curves'][technique_name].append({ + 'compression_level': level, + 'compression_ratio': compression_ratio, + 'size_mb': compressed_size['size_mb'], + 'estimated_accuracy_retention': 1.0 - (level * 0.5) # Simplified model + }) + + # Find optimal operating points + for technique in results['compression_curves']: + curves = results['compression_curves'][technique] + # Find point with best accuracy/compression balance + best_point = max(curves, key=lambda x: 
x['compression_ratio'] * x['estimated_accuracy_retention']) + results['optimal_operating_points'][technique] = best_point + + return results + + def _calculate_model_flops(self, model: Sequential) -> int: + """Calculate FLOPs for a Sequential model.""" + total_flops = 0 + for layer in model.layers: + if isinstance(layer, Dense): + total_flops += layer.input_size * layer.output_size * 2 # Multiply-add operations + return total_flops + + def _apply_magnitude_pruning(self, model: Sequential, pruning_ratio: float) -> Optional[Sequential]: + """Apply magnitude pruning to a model copy.""" + try: + test_model = Sequential([Dense(layer.input_size, layer.output_size) for layer in model.layers]) + for i, layer in enumerate(test_model.layers): + layer.weights = Tensor(model.layers[i].weights.data.copy() if hasattr(model.layers[i].weights.data, 'copy') else np.array(model.layers[i].weights.data)) + if hasattr(layer, 'bias') and model.layers[i].bias is not None: + layer.bias = Tensor(model.layers[i].bias.data.copy() if hasattr(model.layers[i].bias.data, 'copy') else np.array(model.layers[i].bias.data)) + prune_weights_by_magnitude(layer, pruning_ratio) + return test_model + except Exception: + return None + + def _apply_structured_pruning(self, model: Sequential, keep_ratio: float) -> Optional[Sequential]: + """Apply structured pruning to a model copy.""" + try: + test_model = Sequential([Dense(layer.input_size, layer.output_size) for layer in model.layers]) + for i, layer in enumerate(test_model.layers): + layer.weights = Tensor(model.layers[i].weights.data.copy() if hasattr(model.layers[i].weights.data, 'copy') else np.array(model.layers[i].weights.data)) + if hasattr(layer, 'bias') and model.layers[i].bias is not None: + layer.bias = Tensor(model.layers[i].bias.data.copy() if hasattr(model.layers[i].bias.data, 'copy') else np.array(model.layers[i].bias.data)) + pruned_layer, _ = prune_layer_neurons(layer, keep_ratio) + test_model.layers[i] = pruned_layer + return test_model 
+ except Exception: + return None + + def _apply_quantization(self, model: Sequential, bits: int) -> Optional[Sequential]: + """Apply quantization to a model copy.""" + try: + test_model = Sequential([Dense(layer.input_size, layer.output_size) for layer in model.layers]) + for i, layer in enumerate(test_model.layers): + layer.weights = Tensor(model.layers[i].weights.data.copy() if hasattr(model.layers[i].weights.data, 'copy') else np.array(model.layers[i].weights.data)) + if hasattr(layer, 'bias') and model.layers[i].bias is not None: + layer.bias = Tensor(model.layers[i].bias.data.copy() if hasattr(model.layers[i].bias.data, 'copy') else np.array(model.layers[i].bias.data)) + quantize_layer_weights(layer, bits) + return test_model + except Exception: + return None + +# %% ../../modules/source/temp_holding/16_regularization/regularization_dev.ipynb 30 +def compare_compression_techniques(original_model: Sequential) -> Dict[str, Dict[str, Any]]: + """ + Compare all compression techniques on the same model. + + Args: + original_model: Base model to compress using different techniques + + Returns: + Dictionary comparing results from different compression approaches + + TODO: Implement comprehensive compression comparison. + + STEP-BY-STEP IMPLEMENTATION: + 1. Set up baseline metrics from original model + 2. Apply each compression technique individually + 3. Apply combined compression techniques + 4. Measure and compare all results + 5. 
Return comprehensive comparison data + + COMPARISON DIMENSIONS: + - Model size (MB) + - Parameter count + - Compression ratio + - Memory reduction + - Estimated speedup (for structured techniques) + + IMPLEMENTATION HINTS: + - Create separate model copies for each technique + - Use consistent parameters across techniques + - Track both individual and combined effects + - Include baseline for reference + + LEARNING CONNECTIONS: + - This is how research papers compare compression methods + - Production systems need this analysis for deployment decisions + - Understanding trade-offs guides technique selection + """ + ### BEGIN SOLUTION + results = {} + metrics = CompressionMetrics() + + # Baseline: Original model + baseline_params = metrics.count_parameters(original_model) + baseline_size = metrics.calculate_model_size(original_model) + + results['baseline'] = { + 'technique': 'Original Model', + 'parameters': baseline_params['total_parameters'], + 'size_mb': baseline_size['size_mb'], + 'compression_ratio': 1.0, + 'memory_reduction': 0.0 + } + + # Technique 1: Magnitude-based pruning only + model_pruning = Sequential([Dense(layer.input_size, layer.output_size) for layer in original_model.layers]) + for i, layer in enumerate(model_pruning.layers): + layer.weights = Tensor(original_model.layers[i].weights.data.copy() if hasattr(original_model.layers[i].weights.data, 'copy') else np.array(original_model.layers[i].weights.data)) + if hasattr(layer, 'bias') and original_model.layers[i].bias is not None: + layer.bias = Tensor(original_model.layers[i].bias.data.copy() if hasattr(original_model.layers[i].bias.data, 'copy') else np.array(original_model.layers[i].bias.data)) + + # Apply magnitude pruning to each layer + total_sparsity = 0 + for i, layer in enumerate(model_pruning.layers): + if isinstance(layer, Dense): + _, prune_info = prune_weights_by_magnitude(layer, pruning_ratio=0.3) + total_sparsity += prune_info['sparsity'] + + avg_sparsity = total_sparsity / 
len(model_pruning.layers) + pruning_params = metrics.count_parameters(model_pruning) + pruning_size = metrics.calculate_model_size(model_pruning) + + results['magnitude_pruning'] = { + 'technique': 'Magnitude Pruning (30%)', + 'parameters': pruning_params['total_parameters'], + 'size_mb': pruning_size['size_mb'], + 'compression_ratio': baseline_size['size_mb'] / pruning_size['size_mb'], + 'memory_reduction': (baseline_size['size_mb'] - pruning_size['size_mb']) / baseline_size['size_mb'], + 'sparsity': avg_sparsity + } + + # Technique 2: Quantization only + model_quantization = Sequential([Dense(layer.input_size, layer.output_size) for layer in original_model.layers]) + for i, layer in enumerate(model_quantization.layers): + layer.weights = Tensor(original_model.layers[i].weights.data.copy() if hasattr(original_model.layers[i].weights.data, 'copy') else np.array(original_model.layers[i].weights.data)) + if hasattr(layer, 'bias') and original_model.layers[i].bias is not None: + layer.bias = Tensor(original_model.layers[i].bias.data.copy() if hasattr(original_model.layers[i].bias.data, 'copy') else np.array(original_model.layers[i].bias.data)) + + # Apply quantization to each layer + total_memory_reduction = 0 + for i, layer in enumerate(model_quantization.layers): + if isinstance(layer, Dense): + _, quant_info = quantize_layer_weights(layer, bits=8) + total_memory_reduction += quant_info['memory_reduction'] + + avg_memory_reduction = total_memory_reduction / len(model_quantization.layers) + quantization_size = metrics.calculate_model_size(model_quantization, dtype='int8') + + results['quantization'] = { + 'technique': 'Quantization (INT8)', + 'parameters': baseline_params['total_parameters'], + 'size_mb': quantization_size['size_mb'], + 'compression_ratio': baseline_size['size_mb'] / quantization_size['size_mb'], + 'memory_reduction': (baseline_size['size_mb'] - quantization_size['size_mb']) / baseline_size['size_mb'], + 'avg_memory_reduction_factor': 
avg_memory_reduction + } + + # Technique 3: Structured pruning only + model_structured = Sequential([Dense(layer.input_size, layer.output_size) for layer in original_model.layers]) + for i, layer in enumerate(model_structured.layers): + layer.weights = Tensor(original_model.layers[i].weights.data.copy() if hasattr(original_model.layers[i].weights.data, 'copy') else np.array(original_model.layers[i].weights.data)) + if hasattr(layer, 'bias') and original_model.layers[i].bias is not None: + layer.bias = Tensor(original_model.layers[i].bias.data.copy() if hasattr(original_model.layers[i].bias.data, 'copy') else np.array(original_model.layers[i].bias.data)) + + # Apply structured pruning to each layer + total_param_reduction = 0 + for i, layer in enumerate(model_structured.layers): + if isinstance(layer, Dense): + pruned_layer, struct_info = prune_layer_neurons(layer, keep_ratio=0.75) + model_structured.layers[i] = pruned_layer + total_param_reduction += struct_info['param_reduction'] + + avg_param_reduction = total_param_reduction / len(model_structured.layers) + structured_params = metrics.count_parameters(model_structured) + structured_size = metrics.calculate_model_size(model_structured) + + results['structured_pruning'] = { + 'technique': 'Structured Pruning (75% neurons kept)', + 'parameters': structured_params['total_parameters'], + 'size_mb': structured_size['size_mb'], + 'compression_ratio': baseline_size['size_mb'] / structured_size['size_mb'], + 'memory_reduction': (baseline_size['size_mb'] - structured_size['size_mb']) / baseline_size['size_mb'], + 'param_reduction': avg_param_reduction + } + + # Technique 4: Combined approach + model_combined = Sequential([Dense(layer.input_size, layer.output_size) for layer in original_model.layers]) + for i, layer in enumerate(model_combined.layers): + layer.weights = Tensor(original_model.layers[i].weights.data.copy() if hasattr(original_model.layers[i].weights.data, 'copy') else 
np.array(original_model.layers[i].weights.data)) + if hasattr(layer, 'bias') and original_model.layers[i].bias is not None: + layer.bias = Tensor(original_model.layers[i].bias.data.copy() if hasattr(original_model.layers[i].bias.data, 'copy') else np.array(original_model.layers[i].bias.data)) + + # Apply magnitude pruning + quantization + structured pruning + for i, layer in enumerate(model_combined.layers): + if isinstance(layer, Dense): + # Step 1: Magnitude pruning + _, _ = prune_weights_by_magnitude(layer, pruning_ratio=0.2) + # Step 2: Quantization + _, _ = quantize_layer_weights(layer, bits=8) + # Step 3: Structured pruning + pruned_layer, _ = prune_layer_neurons(layer, keep_ratio=0.8) + model_combined.layers[i] = pruned_layer + + combined_params = metrics.count_parameters(model_combined) + combined_size = metrics.calculate_model_size(model_combined, dtype='int8') + + results['combined'] = { + 'technique': 'Combined (Pruning + Quantization + Structured)', + 'parameters': combined_params['total_parameters'], + 'size_mb': combined_size['size_mb'], + 'compression_ratio': baseline_size['size_mb'] / combined_size['size_mb'], + 'memory_reduction': (baseline_size['size_mb'] - combined_size['size_mb']) / baseline_size['size_mb'] + } + + return results + ### END SOLUTION diff --git a/tinytorch/core/dataloader.py b/tinytorch/core/dataloader.py index bbb398dd..0e70a2c9 100644 --- a/tinytorch/core/dataloader.py +++ b/tinytorch/core/dataloader.py @@ -1,4 +1,19 @@ -# AUTOGENERATED! DO NOT EDIT! File to edit: ../../modules/source/07_dataloader/dataloader_dev.ipynb. +# ╔═══════════════════════════════════════════════════════════════════════════════╗ +# ║ 🚨 CRITICAL WARNING 🚨 ║ +# ║ AUTOGENERATED! DO NOT EDIT! ║ +# ║ ║ +# ║ This file is AUTOMATICALLY GENERATED from source modules. ║ +# ║ ANY CHANGES MADE HERE WILL BE LOST when modules are re-exported! 
║ +# ║ ║ +# ║ ✅ TO EDIT: modules/source/08_dataloader/dataloader_dev.py ║ +# ║ ✅ TO EXPORT: Run 'tito module complete ' ║ +# ║ ║ +# ║ 🛡️ STUDENT PROTECTION: This file contains optimized implementations. ║ +# ║ Editing it directly may break module functionality and training. ║ +# ║ ║ +# ║ 🎓 LEARNING TIP: Work in modules/source/ - that's where real development ║ +# ║ happens! The tinytorch/ directory is just the compiled output. ║ +# ╚═══════════════════════════════════════════════════════════════════════════════╝ # %% auto 0 __all__ = ['Dataset', 'DataLoader', 'SimpleDataset', 'download_cifar10', 'CIFAR10Dataset'] diff --git a/tinytorch/core/dense.py b/tinytorch/core/dense.py index 80a235ca..0dddea9e 100644 --- a/tinytorch/core/dense.py +++ b/tinytorch/core/dense.py @@ -1,4 +1,19 @@ -# AUTOGENERATED! DO NOT EDIT! File to edit: ../../modules/source/05_networks/networks_dev.ipynb. +# ╔═══════════════════════════════════════════════════════════════════════════════╗ +# ║ 🚨 CRITICAL WARNING 🚨 ║ +# ║ AUTOGENERATED! DO NOT EDIT! ║ +# ║ ║ +# ║ This file is AUTOMATICALLY GENERATED from source modules. ║ +# ║ ANY CHANGES MADE HERE WILL BE LOST when modules are re-exported! ║ +# ║ ║ +# ║ ✅ TO EDIT: modules/source/05_dense/dense_dev.py ║ +# ║ ✅ TO EXPORT: Run 'tito module complete ' ║ +# ║ ║ +# ║ 🛡️ STUDENT PROTECTION: This file contains optimized implementations. ║ +# ║ Editing it directly may break module functionality and training. ║ +# ║ ║ +# ║ 🎓 LEARNING TIP: Work in modules/source/ - that's where real development ║ +# ║ happens! The tinytorch/ directory is just the compiled output. ║ +# ╚═══════════════════════════════════════════════════════════════════════════════╝ # %% auto 0 __all__ = ['Sequential', 'create_mlp', 'MLP'] diff --git a/tinytorch/core/embeddings.py b/tinytorch/core/embeddings.py index b37349df..1452d84b 100644 --- a/tinytorch/core/embeddings.py +++ b/tinytorch/core/embeddings.py @@ -1,4 +1,19 @@ -# AUTOGENERATED! DO NOT EDIT! 
File to edit: ../../modules/12_embeddings/embeddings_dev.ipynb. +# ╔═══════════════════════════════════════════════════════════════════════════════╗ +# ║ 🚨 CRITICAL WARNING 🚨 ║ +# ║ AUTOGENERATED! DO NOT EDIT! ║ +# ║ ║ +# ║ This file is AUTOMATICALLY GENERATED from source modules. ║ +# ║ ANY CHANGES MADE HERE WILL BE LOST when modules are re-exported! ║ +# ║ ║ +# ║ ✅ TO EDIT: modules/source/XX_embeddings/embeddings_dev.py ║ +# ║ ✅ TO EXPORT: Run 'tito module complete ' ║ +# ║ ║ +# ║ 🛡️ STUDENT PROTECTION: This file contains optimized implementations. ║ +# ║ Editing it directly may break module functionality and training. ║ +# ║ ║ +# ║ 🎓 LEARNING TIP: Work in modules/source/ - that's where real development ║ +# ║ happens! The tinytorch/ directory is just the compiled output. ║ +# ╚═══════════════════════════════════════════════════════════════════════════════╝ # %% auto 0 __all__ = ['Embedding', 'PositionalEncoding', 'LearnedPositionalEmbedding', 'EmbeddingProfiler', diff --git a/tinytorch/core/kernels.py b/tinytorch/core/kernels.py index 850c66a8..8063ac63 100644 --- a/tinytorch/core/kernels.py +++ b/tinytorch/core/kernels.py @@ -1,4 +1,19 @@ -# AUTOGENERATED! DO NOT EDIT! File to edit: ../../modules/source/temp_holding/13_kernels/kernels_dev.ipynb. +# ╔═══════════════════════════════════════════════════════════════════════════════╗ +# ║ 🚨 CRITICAL WARNING 🚨 ║ +# ║ AUTOGENERATED! DO NOT EDIT! ║ +# ║ ║ +# ║ This file is AUTOMATICALLY GENERATED from source modules. ║ +# ║ ANY CHANGES MADE HERE WILL BE LOST when modules are re-exported! ║ +# ║ ║ +# ║ ✅ TO EDIT: modules/source/13_kernels/kernels_dev.py ║ +# ║ ✅ TO EXPORT: Run 'tito module complete ' ║ +# ║ ║ +# ║ 🛡️ STUDENT PROTECTION: This file contains optimized implementations. ║ +# ║ Editing it directly may break module functionality and training. ║ +# ║ ║ +# ║ 🎓 LEARNING TIP: Work in modules/source/ - that's where real development ║ +# ║ happens! The tinytorch/ directory is just the compiled output. 
║ +# ╚═══════════════════════════════════════════════════════════════════════════════╝ # %% auto 0 __all__ = ['time_kernel', 'matmul_baseline', 'vectorized_relu', 'vectorized_operations', 'cache_friendly_matmul', 'parallel_relu', diff --git a/tinytorch/core/layers.py b/tinytorch/core/layers.py index f6837740..2c2bfd3c 100644 --- a/tinytorch/core/layers.py +++ b/tinytorch/core/layers.py @@ -1,266 +1,38 @@ -# --- -# jupyter: -# jupytext: -# text_representation: -# extension: .py -# format_name: percent -# format_version: '1.3' -# jupytext_version: 1.17.1 -# --- +# ╔═══════════════════════════════════════════════════════════════════════════════╗ +# ║ 🚨 CRITICAL WARNING 🚨 ║ +# ║ AUTOGENERATED! DO NOT EDIT! ║ +# ║ ║ +# ║ This file is AUTOMATICALLY GENERATED from source modules. ║ +# ║ ANY CHANGES MADE HERE WILL BE LOST when modules are re-exported! ║ +# ║ ║ +# ║ ✅ TO EDIT: modules/source/04_layers/layers_dev.py ║ +# ║ ✅ TO EXPORT: Run 'tito module complete ' ║ +# ║ ║ +# ║ 🛡️ STUDENT PROTECTION: This file contains optimized implementations. ║ +# ║ Editing it directly may break module functionality and training. ║ +# ║ ║ +# ║ 🎓 LEARNING TIP: Work in modules/source/ - that's where real development ║ +# ║ happens! The tinytorch/ directory is just the compiled output. ║ +# ╚═══════════════════════════════════════════════════════════════════════════════╝ -# %% [markdown] -""" -# Layers - Building Neural Network Architectures +# %% auto 0 +__all__ = ['Dense', 'Module', 'matmul', 'Linear'] -Welcome to Layers! You'll implement the essential building blocks that compose into complete neural network architectures. - -## LINK Building on Previous Learning -**What You Built Before**: -- Module 02 (Tensor): N-dimensional arrays with shape management and broadcasting -- Module 03 (Activations): ReLU and Softmax functions providing nonlinear intelligence - -**What's Working**: You can create tensors and apply nonlinear transformations for complex pattern learning! 
- -**The Gap**: You have data structures and nonlinear functions, but no way to combine them into trainable neural network architectures. - -**This Module's Solution**: Implement Linear layers, Module composition patterns, and Sequential networks - the architectural foundations enabling everything from MLPs to transformers. - -**Connection Map**: -``` -Activations -> Layers -> Training -(intelligence) (architecture) (learning) -``` - -## Learning Objectives - -By completing this module, you will: - -1. **Build layer abstractions** - Create the building blocks that compose into neural networks -2. **Implement Linear layers** - The fundamental operation that transforms data between dimensions -3. **Create Sequential networks** - Chain layers together to build complete neural networks -4. **Manage parameters** - Handle weights and biases in an organized way -5. **Foundation for architectures** - Enable building everything from simple MLPs to complex models - -## Build -> Use -> Reflect -1. **Build**: Module base class, Linear layers, and Sequential composition -2. **Use**: Combine layers into complete neural networks with real data -3. **Reflect**: Understand how simple building blocks enable complex architectures -""" - -# In[ ]: - -#| default_exp core.layers - -#| export +# %% ../../modules/source/04_layers/layers_dev.ipynb 1 import numpy as np import sys import os +from typing import Union, Tuple, Optional, Any -# Smart import system: works both during development and in production -# This pattern allows the same code to work in two scenarios: -# 1. During development: imports from local module files (tensor_dev.py) -# 2. 
In production: imports from installed tinytorch package -# This flexibility is essential for educational development workflows +# Import our building blocks - try package first, then local modules +try: + from tinytorch.core.tensor import Tensor, Parameter +except ImportError: + # For development, import from local modules + sys.path.append(os.path.join(os.path.dirname(__file__), '..', '02_tensor')) + from tensor_dev import Tensor, Parameter -if 'tinytorch' in sys.modules: - # Production: Import from installed package - # When tinytorch is installed as a package, use the packaged version - from tinytorch.core.tensor import Tensor -else: - # Development: Import from local module files - # During development, we need to import directly from the source files - # This allows us to work with modules before they're packaged - tensor_module_path = os.path.join(os.path.dirname(__file__), '..', '01_tensor') - sys.path.insert(0, tensor_module_path) - try: - from tensor_dev import Tensor - finally: - sys.path.pop(0) # Always clean up path to avoid side effects - -# CRITICAL FIX: Parameter must be Tensor-based for gradient tracking -class Parameter: - """ - A trainable parameter that supports automatic differentiation. - - This creates a Tensor with requires_grad=True for use as neural network parameters. - Essential for gradient-based optimization of weights and biases. - - IMPORTANT: Parameters must participate in autograd for training to work. - """ - def __init__(self, data): - # Import Tensor locally to avoid circular imports - # NO Tensor imports - using pure Tensor system only! 
- - # Use pure Tensor with gradients enabled - from tinytorch.core.tensor import Tensor - - if isinstance(data, Tensor): - self._tensor = data - if not data.requires_grad: - # Ensure parameters always require gradients - data.requires_grad = True - else: - # Convert data to Tensor with gradient tracking - self._tensor = Tensor(data, requires_grad=True) - - def __getattr__(self, name): - """Delegate all attribute access to the underlying Tensor.""" - return getattr(self._tensor, name) - - def __setattr__(self, name, value): - """Handle setting attributes.""" - if name == '_tensor': - super().__setattr__(name, value) - else: - # Delegate to underlying Tensor - setattr(self._tensor, name, value) - - @property - def data(self): - """Access to underlying data.""" - return self._tensor.data - - @property - def grad(self): - """Access to gradient.""" - return self._tensor.grad - - @grad.setter - def grad(self, value): - """Set gradient.""" - self._tensor.grad = value - - @property - def requires_grad(self): - """Whether this parameter requires gradients.""" - return self._tensor.requires_grad - - def backward(self, gradient=None): - """Backpropagate gradients.""" - return self._tensor.backward(gradient) - - def __repr__(self): - return f"Parameter({self._tensor})" - -# In[ ]: - -print("FIRE TinyTorch Layers Module") -print(f"NumPy version: {np.__version__}") -print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}") -print("Ready to build neural network layers!") - -# %% [markdown] -""" -## Visual Guide: Understanding Neural Network Architecture Through Diagrams - -### Neural Network Layers: From Components to Systems - -``` -Individual Neuron: Neural Network Layer: - x₁ --○ w₁ +---------------------+ - \\ | Input Vector | - x₂ --○ w₂ --> Sum --> f() --> y | [x₁, x₂, x₃] | - / +---------------------+ - x₃ --○ w₃ v - + bias +---------------------+ - | Weight Matrix W | -One computation unit | +w₁₁ w₁₂ w₁₃+ | - | |w₂₁ w₂₂ w₂₃| | - | +w₃₁ w₃₂ w₃₃+ | - 
+---------------------+ - v - Matrix multiplication - Y = X @ W + b - v - +---------------------+ - | Output Vector | - | [y₁, y₂, y₃] | - +---------------------+ - -Parallel processing of many neurons! -``` - -### Layer Composition: Building Complex Architectures - -``` -Multi-Layer Perceptron (MLP) Architecture: - - Input Hidden Layer 1 Hidden Layer 2 Output - (784 dims) (256 neurons) (128 neurons) (10 classes) -+---------+ +-------------+ +-------------+ +---------+ -| Image |----▶| ReLU |--▶| ReLU |--▶| Softmax | -| 28*28px | | Activations | | Activations | | Probs | -+---------+ +-------------+ +-------------+ +---------+ - v v v v -200,960 params 32,896 params 1,290 params Total: 235,146 - -Parameter calculation for Linear(input_size, output_size): -• Weights: input_size * output_size matrix -• Biases: output_size vector -• Total: (input_size * output_size) + output_size - -Memory scaling pattern: -Layer width doubles -> Parameters quadruple -> Memory quadruples -``` - -### Module System: Automatic Parameter Management - -``` -Parameter Collection Hierarchy: - -Model (Sequential) -+-- Layer1 (Linear) -| +-- weights [784 * 256] --+ -| +-- bias [256] --┤ -+-- Layer2 (Linear) +--▶ model.parameters() -| +-- weights [256 * 128] --┤ Automatically collects -| +-- bias [128] --┤ all parameters for -+-- Layer3 (Linear) +--▶ optimizer.step() - +-- weights [128 * 10] --┤ - +-- bias [10] --+ - -Before Module system: With Module system: -manually track params -> automatic collection -params = [w1, b1, w2,...] 
params = model.parameters() - -Enables: optimizer = Adam(model.parameters()) -``` - -### Memory Layout and Performance Implications - -``` -Tensor Memory Access Patterns: - -Matrix Multiplication: A @ B = C - -Efficient (Row-major access): Inefficient (Column-major): -A: --------------▶ A: | | | | | ▶ - Cache-friendly | | | | | - Sequential reads v v v v v - Cache misses -B: | B: --------------▶ - | - v - -Performance impact: -• Good memory layout: 100% cache hit ratio -• Poor memory layout: 10-50% cache hit ratio -• 10-100x performance difference in practice - -Why contiguous tensors matter in production! -``` -""" - -# %% [markdown] -""" -## Part 1: Module Base Class - The Foundation of Neural Network Architecture -""" - -# %% nbgrader={"grade": false, "grade_id": "module-base", "solution": true} - -# Before building specific layers, we need a base class that enables clean composition and automatic parameter management. - -#| export +# %% ../../modules/source/04_layers/layers_dev.ipynb 4 class Module: """ Base class for all neural network modules. @@ -270,7 +42,7 @@ class Module: inherit from this class. Key Features: - - Automatic parameter registration when you assign parameter Tensors (weights, bias) + - Automatic parameter registration when you assign Tensors with requires_grad=True - Recursive parameter collection from sub-modules - Clean __call__ interface: model(x) instead of model.forward(x) - Extensible for custom layers @@ -279,8 +51,8 @@ class Module: class MLP(Module): def __init__(self): super().__init__() - self.layer1 = Linear(784, 128) # Auto-registered! - self.layer2 = Linear(128, 10) # Auto-registered! + self.layer1 = Dense(784, 128) # Auto-registered! + self.layer2 = Dense(128, 10) # Auto-registered! def forward(self, x): x = self.layer1(x) @@ -303,23 +75,14 @@ class Module: When you do self.weight = Parameter(...), this automatically adds the parameter to our collection for easy optimization. 
""" - # Step 1: Check if this looks like a parameter (Tensor with data and specific name) - # Break down the complex boolean logic for clarity: - is_tensor_like = hasattr(value, 'data') and hasattr(value, 'shape') - is_tensor_type = isinstance(value, Tensor) - is_parameter_type = isinstance(value, Parameter) - is_parameter_name = name in ['weights', 'weight', 'bias'] - - if is_tensor_like and (is_tensor_type or is_parameter_type) and is_parameter_name: - # Step 2: Add to our parameter list for optimization + # Check if it's a tensor that needs gradients (a parameter) + if hasattr(value, 'requires_grad') and value.requires_grad: self._parameters.append(value) - - # Step 3: Check if it's a sub-module (another neural network layer) + # Check if it's another Module (sub-module) elif isinstance(value, Module): - # Step 4: Add to module list for recursive parameter collection self._modules.append(value) - # Step 5: Always set the actual attribute (this is essential!) + # Always call parent to actually set the attribute super().__setattr__(name, value) def parameters(self): @@ -327,9 +90,9 @@ class Module: Recursively collect all parameters from this module and sub-modules. Returns: - List of all parameters (Tensors containing weights and biases) + List of all parameters (Tensors with requires_grad=True) - This enables: optimizer = Adam(model.parameters()) (when optimizers are available) + This enables: optimizer = Adam(model.parameters()) """ # Start with our own parameters params = list(self._parameters) @@ -360,84 +123,99 @@ class Module: """ raise NotImplementedError("Subclasses must implement forward()") -# In[ ]: +# %% ../../modules/source/04_layers/layers_dev.ipynb 7 +def matmul(a: Tensor, b: Tensor) -> Tensor: + """ + Matrix multiplication for tensors. + + Args: + a: Left tensor (shape: ..., m, k) + b: Right tensor (shape: ..., k, n) + + Returns: + Result tensor (shape: ..., m, n) + + TODO: Implement matrix multiplication using numpy's @ operator. 
+ + STEP-BY-STEP IMPLEMENTATION: + 1. Extract numpy arrays from both tensors using .data + 2. Perform matrix multiplication: result_data = a_data @ b_data + 3. Wrap result in a new Tensor and return + + LEARNING CONNECTIONS: + - This is the core operation in Dense layers: output = input @ weights + - PyTorch uses optimized BLAS libraries for this operation + - GPU implementations parallelize this across thousands of cores + - Understanding this operation is key to neural network performance + + EXAMPLE: + ```python + a = Tensor([[1, 2], [3, 4]]) # shape (2, 2) + b = Tensor([[5, 6], [7, 8]]) # shape (2, 2) + result = matmul(a, b) + # result.data = [[19, 22], [43, 50]] + ``` + + IMPLEMENTATION HINTS: + - Use the @ operator for clean matrix multiplication + - Ensure you return a Tensor, not a numpy array + - The operation should work for any compatible matrix shapes + """ + ### BEGIN SOLUTION + # Check if we're dealing with Variables (autograd) or plain Tensors + a_is_variable = hasattr(a, 'requires_grad') and hasattr(a, 'grad_fn') + b_is_variable = hasattr(b, 'requires_grad') and hasattr(b, 'grad_fn') + + # Extract numpy data appropriately + if a_is_variable: + a_data = a.data.data # Variable.data is a Tensor, so .data.data gets numpy array + else: + a_data = a.data # Tensor.data is numpy array directly + + if b_is_variable: + b_data = b.data.data + else: + b_data = b.data + + # Perform matrix multiplication + result_data = a_data @ b_data + + # If any input is a Variable, return Variable with gradient tracking + if a_is_variable or b_is_variable: + # Import Variable locally to avoid circular imports + if 'Variable' not in globals(): + try: + from tinytorch.core.autograd import Variable + except ImportError: + from autograd_dev import Variable + + # Create gradient function for matrix multiplication + def grad_fn(grad_output): + # Matrix multiplication backward pass: + # If C = A @ B, then: + # dA = grad_output @ B^T + # dB = A^T @ grad_output + + if a_is_variable 
and a.requires_grad: + # Gradient w.r.t. A: grad_output @ B^T + grad_a_data = grad_output.data.data @ b_data.T + a.backward(Variable(grad_a_data)) + + if b_is_variable and b.requires_grad: + # Gradient w.r.t. B: A^T @ grad_output + grad_b_data = a_data.T @ grad_output.data.data + b.backward(Variable(grad_b_data)) + + # Determine if result should require gradients + requires_grad = (a_is_variable and a.requires_grad) or (b_is_variable and b.requires_grad) + + return Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn) + else: + # Both inputs are Tensors, return Tensor (backward compatible) + return Tensor(result_data) + ### END SOLUTION -# PASS IMPLEMENTATION CHECKPOINT: Basic Module class complete - -# THINK PREDICTION: How many parameters would a simple 3-layer network have? -# Write your guess here: _______ - -# 🔍 SYSTEMS ANALYSIS: Layer Performance and Scaling -def analyze_layer_performance(): - """Analyze layer performance and scaling characteristics.""" - print("📊 LAYER SYSTEMS ANALYSIS") - print("Understanding how neural network layers scale and perform...") - - try: - # Parameter scaling analysis - print("\n1. Parameter Scaling:") - layer_sizes = [(784, 256), (256, 128), (128, 10)] - total_params = 0 - - for i, (input_size, output_size) in enumerate(layer_sizes): - weights = input_size * output_size - biases = output_size - layer_params = weights + biases - total_params += layer_params - print(f" Layer {i+1} ({input_size}→{output_size}): {layer_params:,} params") - - print(f" Total network: {total_params:,} parameters") - print(f" Memory usage: {total_params * 4 / 1024 / 1024:.2f} MB (float32)") - - # Computational complexity - print("\n2. 
Computational Complexity:") - batch_size = 32 - total_flops = 0 - - for i, (input_size, output_size) in enumerate(layer_sizes): - matmul_flops = 2 * batch_size * input_size * output_size - bias_flops = batch_size * output_size - layer_flops = matmul_flops + bias_flops - total_flops += layer_flops - print(f" Layer {i+1}: {layer_flops:,} FLOPs ({matmul_flops:,} matmul + {bias_flops:,} bias)") - - print(f" Total forward pass: {total_flops:,} FLOPs") - - # Scaling patterns - print("\n3. Scaling Insights:") - print(" • Parameter growth: O(input_size × output_size) - quadratic") - print(" • Computation: O(batch × input × output) - linear in each dimension") - print(" • Memory: Parameters + activations scale differently") - print(" • Bottlenecks: Large layers dominate both memory and compute") - - print("\n💡 KEY INSIGHT: Layer size quadratically affects parameters but linearly affects computation per sample") - - except Exception as e: - print(f"⚠️ Analysis error: {e}") - -# In[ ]: - -# %% [markdown] -""" -### ✅ IMPLEMENTATION CHECKPOINT: Module Base Class Complete - -You've built the foundation that enables automatic parameter management across all neural network components! - -🤔 **PREDICTION**: How many parameters would a simple 3-layer network have? -Network: 784 → 256 → 128 → 10 -Your guess: _______ -""" - -# %% [markdown] -""" -## Part 2: Linear Layer - The Fundamental Neural Network Component - -Linear layers (also called Dense or Fully Connected layers) are the building blocks of neural networks. -""" - -# %% nbgrader={"grade": false, "grade_id": "linear-layer", "solution": true} - -#| export +# %% ../../modules/source/04_layers/layers_dev.ipynb 11 class Linear(Module): """ Linear (Fully Connected) Layer implementation. 
@@ -492,714 +270,86 @@ class Linear(Module): # Initialize weights with small random values using Parameter # Shape: (input_size, output_size) for matrix multiplication - # - # MAGNIFY WEIGHT INITIALIZATION CONTEXT: - # Weight initialization is critical for training deep networks successfully. - # Our simple approach (small random * 0.1) works for shallow networks, but - # deeper networks require more sophisticated initialization strategies: - # - # • Xavier/Glorot: scale = sqrt(1/fan_in) - good for tanh/sigmoid activations - # • Kaiming/He: scale = sqrt(2/fan_in) - optimized for ReLU activations - # • Our approach: scale = 0.1 - simple but effective for basic networks - # - # Why proper initialization matters: - # - Prevents vanishing gradients (weights too small -> signals disappear) - # - Prevents exploding gradients (weights too large -> signals blow up) - # - Enables stable training in deeper architectures (Module 11 training) - # - Affects convergence speed and final model performance - # - # Production frameworks automatically choose initialization based on layer type! weight_data = np.random.randn(input_size, output_size) * 0.1 self.weights = Parameter(weight_data) # Auto-registers for optimization! # Initialize bias if requested if use_bias: - # MAGNIFY GRADIENT FLOW PREPARATION: - # Clean parameter management is essential for backpropagation (Module 09). - # When we implement autograd, the optimizer needs to find ALL trainable - # parameters automatically. 
Our Module base class ensures that: - # - # • Parameters are automatically registered when assigned - # • Recursive parameter collection works through network hierarchies - # • Gradient updates can flow to all learnable weights and biases - # • Memory management handles parameter lifecycle correctly - # - # This design enables the autograd system to: - # - Track computational graphs through all layers - # - Accumulate gradients for each parameter during backpropagation - # - Support optimizers that update parameters based on gradients - # - Scale to arbitrarily deep and complex network architectures - # - # Bias also uses small random initialization (could be zeros, but small random works well) bias_data = np.random.randn(output_size) * 0.1 self.bias = Parameter(bias_data) # Auto-registers for optimization! else: self.bias = None ### END SOLUTION - def forward(self, x): + def forward(self, x: Union[Tensor, 'Variable']) -> Union[Tensor, 'Variable']: """ - Forward pass through the Linear layer with automatic differentiation. - + Forward pass through the Linear layer. + Args: - x: Input Tensor (shape: ..., input_size) - + x: Input tensor or Variable (shape: ..., input_size) + Returns: - Output Tensor (shape: ..., output_size) with gradient tracking - - CRITICAL FIX: This method now properly uses autograd operations - to ensure gradients flow through parameters during backpropagation. - - TODO: Implement the linear transformation using autograd operations - + Output tensor or Variable (shape: ..., output_size) + Preserves Variable type for gradient tracking in training + + TODO: Implement autograd-aware forward pass: output = input @ weights + bias + STEP-BY-STEP IMPLEMENTATION: - 1. Convert input to Tensor if needed (with gradient tracking) - 2. Use autograd matrix multiplication: matmul(x, weights) - 3. Add bias using autograd addition if it exists: add(result, bias) - 4. Return Tensor with gradient tracking enabled - + 1. 
Perform matrix multiplication: output = matmul(x, self.weights) + 2. If bias exists, add it appropriately based on input type + 3. Preserve Variable type for gradient tracking if input is Variable + 4. Return result maintaining autograd capabilities + + AUTOGRAD CONSIDERATIONS: + - If x is Variable: weights and bias should also be Variables for training + - Preserve gradient tracking through the entire computation + - Enable backpropagation through this layer's parameters + - Handle mixed Tensor/Variable scenarios gracefully + LEARNING CONNECTIONS: - - Uses autograd operations instead of raw numpy for gradient flow - - Parameters (weights/bias) are Variables with requires_grad=True - - Matrix multiplication and addition maintain computational graph - - This enables backpropagation through all parameters - + - This is the core neural network transformation + - Matrix multiplication scales input features to output features + - Bias provides offset (like y-intercept in linear equations) + - Broadcasting handles different batch sizes automatically + - Autograd support enables automatic parameter optimization + IMPLEMENTATION HINTS: - - Import autograd operations locally to avoid circular imports - - Ensure result Tensor has proper gradient tracking - - Handle both Tensor and Tensor inputs gracefully + - Use the matmul function you implemented above (now autograd-aware) + - Handle bias addition based on input/output types + - Variables support + operator for gradient-tracked addition + - Check if self.bias is not None before adding """ ### BEGIN SOLUTION - # Use pure Tensor operations - NO Variables! 
- from tinytorch.core.tensor import Tensor - - # Ensure input is a Tensor - if not isinstance(x, Tensor): - x = Tensor(x.data if hasattr(x, 'data') else x) - - # Matrix multiplication: x @ weights - # Use Tensor's matmul which should track gradients - result = x.matmul(self.weights) - + # Matrix multiplication: input @ weights (now autograd-aware) + output = matmul(x, self.weights) + # Add bias if it exists + # The addition will preserve Variable type if output is Variable if self.bias is not None: - result = result + self.bias - - # Return pure Tensor with gradient tracking preserved - return result + # Check if we need Variable-aware addition + if hasattr(output, 'requires_grad'): + # output is a Variable, use Variable addition + if hasattr(self.bias, 'requires_grad'): + # bias is also Variable, direct addition works + output = output + self.bias + else: + # bias is Tensor, convert to Variable for addition + # Import Variable if not already available + if 'Variable' not in globals(): + try: + from tinytorch.core.autograd import Variable + except ImportError: + from autograd_dev import Variable + + bias_var = Variable(self.bias.data, requires_grad=False) + output = output + bias_var + else: + # output is Tensor, use regular addition + output = output + self.bias + + return output ### END SOLUTION -# In[ ]: - -# TEST Unit Test: Linear Layer -def test_unit_linear(): - """Test Linear layer implementation.""" - print("TEST Testing Linear Layer...") - - # Test case 1: Basic functionality - layer = Linear(input_size=3, output_size=2) - input_tensor = Tensor([[1.0, 2.0, 3.0]]) # Shape: (1, 3) - output = layer.forward(input_tensor) - - # Check output shape - assert output.shape == (1, 2), f"Expected shape (1, 2), got {output.shape}" - print("PASS Output shape correct") - - # Test case 2: No bias - layer_no_bias = Linear(input_size=2, output_size=3, use_bias=False) - assert layer_no_bias.bias is None, "Bias should be None when use_bias=False" - print("PASS No bias option 
works") - - # Test case 3: Multiple samples (batch processing) - batch_input = Tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]) # Shape: (3, 2) - layer_batch = Linear(input_size=2, output_size=2) - batch_output = layer_batch.forward(batch_input) - - assert batch_output.shape == (3, 2), f"Expected shape (3, 2), got {batch_output.shape}" - print("PASS Batch processing works") - - # Test case 4: Callable interface - callable_output = layer_batch(batch_input) - assert np.allclose(callable_output.data, batch_output.data), "Callable interface should match forward()" - print("PASS Callable interface works") - - # Test case 5: Parameter initialization - layer_init = Linear(input_size=10, output_size=5) - assert layer_init.weights.shape == (10, 5), f"Expected weights shape (10, 5), got {layer_init.weights.shape}" - assert layer_init.bias.shape == (5,), f"Expected bias shape (5,), got {layer_init.bias.shape}" - - # Check that weights are reasonably small (good initialization) - mean_val = np.abs(layer_init.weights.data).mean() - # Convert to float if it's a Tensor - if hasattr(mean_val, 'item'): - mean_val = mean_val.item() - elif hasattr(mean_val, 'data'): - mean_val = float(mean_val.data) - assert mean_val < 1.0, "Weights should be small for good initialization" - print("PASS Parameter initialization correct") - - print("CELEBRATE All Linear layer tests passed!") - -test_unit_linear() - -# In[ ]: - -# TEST Unit Test: Parameter Management -def test_unit_parameter_management(): - """Test Linear layer parameter management and module composition.""" - print("TEST Testing Parameter Management...") - - # Test case 1: Parameter registration - layer = Linear(input_size=3, output_size=2) - params = layer.parameters() - - assert len(params) == 2, f"Expected 2 parameters (weights + bias), got {len(params)}" - assert layer.weights in params, "Weights should be in parameters list" - assert layer.bias in params, "Bias should be in parameters list" - print("PASS Parameter registration 
works") - - # Test case 2: Module composition - class SimpleNetwork(Module): - def __init__(self): - super().__init__() - self.layer1 = Linear(4, 3) - self.layer2 = Linear(3, 2) - - def forward(self, x): - x = self.layer1(x) - return self.layer2(x) - - network = SimpleNetwork() - all_params = network.parameters() - - # Should have 4 parameters: 2 from each layer (weights + bias) - assert len(all_params) == 4, f"Expected 4 parameters from network, got {len(all_params)}" - print("PASS Module composition and parameter collection works") - - # Test case 3: Forward pass through composed network - input_tensor = Tensor([[1.0, 2.0, 3.0, 4.0]]) - output = network(input_tensor) - - assert output.shape == (1, 2), f"Expected output shape (1, 2), got {output.shape}" - print("PASS Network forward pass works") - - # Test case 4: No bias option - layer_no_bias = Linear(input_size=3, output_size=2, use_bias=False) - params_no_bias = layer_no_bias.parameters() - - assert len(params_no_bias) == 1, f"Expected 1 parameter (weights only), got {len(params_no_bias)}" - assert layer_no_bias.bias is None, "Bias should be None when use_bias=False" - print("PASS No bias option works") - - print("CELEBRATE All parameter management tests passed!") - -test_unit_parameter_management() - -# In[ ]: - -# PASS IMPLEMENTATION CHECKPOINT: Linear layer complete - -# THINK PREDICTION: How does memory usage scale with network depth vs width? -# Deeper network (more layers): _______ -# Wider network (more neurons per layer): _______ - -# MAGNIFY SYSTEMS INSIGHT #3: Architecture Memory Analysis -# Architecture analysis consolidated into analyze_layer_performance() above - -# Analysis consolidated into analyze_layer_performance() above - -# %% [markdown] -""" -## Part 4: Sequential Network Composition -""" - -# %% nbgrader={"grade": false, "grade_id": "sequential-composition", "solution": true} - -#| export -class Sequential(Module): - """ - Sequential Network: Composes layers in sequence. 
- - The most fundamental network architecture that applies layers in order: - f(x) = layer_n(...layer_2(layer_1(x))) - - Inherits from Module for automatic parameter collection from all sub-layers. - This enables optimizers to find all parameters automatically. - - Example Usage: - # Create a 3-layer MLP - model = Sequential([ - Linear(784, 128), - ReLU(), - Linear(128, 64), - ReLU(), - Linear(64, 10) - ]) - - # Use the model - output = model(input_data) # Clean interface! - params = model.parameters() # All parameters from all layers! - """ - - def __init__(self, layers=None): - """ - Initialize Sequential network with layers. - - Args: - layers: List of layers to compose in order (optional) - """ - super().__init__() # Initialize Module base class - self.layers = layers if layers is not None else [] - - # Register all layers as sub-modules for parameter collection - for i, layer in enumerate(self.layers): - # This automatically adds each layer to self._modules - setattr(self, f'layer_{i}', layer) - - def forward(self, x): - """ - Forward pass through all layers in sequence. 
- - Args: - x: Input tensor - - Returns: - Output tensor after passing through all layers - """ - for layer in self.layers: - x = layer(x) - return x - - def add(self, layer): - """Add a layer to the network.""" - self.layers.append(layer) - # Register the new layer for parameter collection - setattr(self, f'layer_{len(self.layers)-1}', layer) - -# In[ ]: - -# TEST Unit Test: Sequential Networks -def test_unit_sequential(): - """Test Sequential network implementation.""" - print("TEST Testing Sequential Network...") - - # Test case 1: Create empty network - empty_net = Sequential() - assert len(empty_net.layers) == 0, "Empty Sequential should have no layers" - print("PASS Empty Sequential network creation") - - # Test case 2: Create network with layers - layers = [Linear(3, 4), Linear(4, 2)] - network = Sequential(layers) - assert len(network.layers) == 2, "Network should have 2 layers" - print("PASS Sequential network with layers") - - # Test case 3: Forward pass through network - input_tensor = Tensor([[1.0, 2.0, 3.0]]) - output = network(input_tensor) - assert output.shape == (1, 2), f"Expected output shape (1, 2), got {output.shape}" - print("PASS Forward pass through Sequential network") - - # Test case 4: Parameter collection from all layers - all_params = network.parameters() - # Should have 4 parameters: 2 weights + 2 biases from 2 Linear layers - assert len(all_params) == 4, f"Expected 4 parameters from Sequential network, got {len(all_params)}" - print("PASS Parameter collection from all layers") - - # Test case 5: Adding layers dynamically - network.add(Linear(2, 1)) - assert len(network.layers) == 3, "Network should have 3 layers after adding one" - - # Test forward pass after adding layer - final_output = network(input_tensor) - assert final_output.shape == (1, 1), f"Expected final output shape (1, 1), got {final_output.shape}" - print("PASS Dynamic layer addition") - - print("CELEBRATE All Sequential network tests passed!") - -test_unit_sequential() - 
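As a standalone aside (plain Python, independent of the TinyTorch classes in this diff): the parameter-count prediction posed in the checkpoints above — a 784 → 256 → 128 → 10 network — can be checked with the same weights-plus-biases arithmetic that `analyze_layer_performance()` uses:

```python
# Count parameters for a fully connected 784 -> 256 -> 128 -> 10 network.
# Each Linear layer stores an (fan_in, fan_out) weight matrix plus a
# fan_out-sized bias vector.
layer_sizes = [784, 256, 128, 10]

total = 0
for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
    layer_params = fan_in * fan_out + fan_out  # weights + biases
    total += layer_params
    print(f"{fan_in:>4} -> {fan_out:<4}: {layer_params:,} parameters")

print(f"Total: {total:,} parameters")
```

This prints a total of 235,146 parameters (roughly 0.9 MB at 4 bytes per float32 value) — a concrete answer to the "write your guess here" prompt, and a reminder that the first layer (784 → 256, 200,960 parameters) dominates the budget.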
-# %% [markdown] -""" -## Part 5: Flatten Operation - Connecting Different Layer Types -""" - -# %% nbgrader={"grade": false, "grade_id": "flatten-operations", "solution": true} - -#| export -def flatten(x, start_dim=1): - """ - Flatten tensor starting from a given dimension. - - This is essential for transitioning from convolutional layers - (which output 4D tensors) to linear layers (which expect 2D). - - Args: - x: Input tensor (Tensor or any array-like) - start_dim: Dimension to start flattening from (default: 1 to preserve batch) - - Returns: - Flattened tensor preserving batch dimension - - Examples: - # Flatten CNN output for Linear layer - conv_output = Tensor(np.random.randn(32, 64, 8, 8)) # (batch, channels, height, width) - flat = flatten(conv_output) # (32, 4096) - ready for Linear layer! - - # Flatten image for MLP - images = Tensor(np.random.randn(32, 3, 28, 28)) # CIFAR-10 batch - flat = flatten(images) # (32, 2352) - ready for MLP! - """ - # Get the data (handle both Tensor and numpy arrays) - if hasattr(x, 'data'): - data = x.data - else: - data = x - - # Calculate new shape - batch_size = data.shape[0] if start_dim > 0 else 1 - remaining_size = np.prod(data.shape[start_dim:]) - new_shape = (batch_size, remaining_size) if start_dim > 0 else (remaining_size,) - - # Reshape while preserving the original tensor type - if hasattr(x, 'data'): - # It's a Tensor - create a new Tensor with flattened data - flattened_data = data.reshape(new_shape) - # Use type(x) to preserve the exact Tensor type (Parameter vs regular Tensor) - # This ensures that if input was a Parameter, output is also a Parameter - return type(x)(flattened_data) - else: - # It's a numpy array - just reshape and return - return data.reshape(new_shape) - -#| export -class Flatten(Module): - """ - Flatten layer that reshapes tensors from multi-dimensional to 2D. - - Essential for connecting convolutional layers (which output 4D tensors) - to linear layers (which expect 2D tensors). 
Preserves the batch dimension. - - Example Usage: - # In a CNN architecture - model = Sequential([ - Conv2D(3, 16, kernel_size=3), # Output: (batch, 16, height, width) - ReLU(), - Flatten(), # Output: (batch, 16*height*width) - Linear(16*height*width, 10) # Now compatible! - ]) - """ - - def __init__(self, start_dim=1): - """ - Initialize Flatten layer. - - Args: - start_dim: Dimension to start flattening from (default: 1 to preserve batch) - """ - super().__init__() - self.start_dim = start_dim - - def forward(self, x): - """ - Flatten tensor starting from start_dim. - - Args: - x: Input tensor - - Returns: - Flattened tensor with batch dimension preserved - """ - return flatten(x, start_dim=self.start_dim) - -# In[ ]: - -# TEST Unit Test: Flatten Operations -def test_unit_flatten(): - """Test Flatten layer and function implementation.""" - print("TEST Testing Flatten Operations...") - - # Test case 1: Flatten function with 2D tensor - x_2d = Tensor([[1, 2], [3, 4]]) - flattened_func = flatten(x_2d) - assert flattened_func.shape == (2, 2), f"Expected shape (2, 2), got {flattened_func.shape}" - print("PASS Flatten function with 2D tensor") - - # Test case 2: Flatten function with 4D tensor (simulating CNN output) - x_4d = Tensor(np.random.randn(2, 3, 4, 4)) # (batch, channels, height, width) - flattened_4d = flatten(x_4d) - assert flattened_4d.shape == (2, 48), f"Expected shape (2, 48), got {flattened_4d.shape}" # 3*4*4 = 48 - print("PASS Flatten function with 4D tensor") - - # Test case 3: Flatten layer class - flatten_layer = Flatten() - layer_output = flatten_layer(x_4d) - assert layer_output.shape == (2, 48), f"Expected shape (2, 48), got {layer_output.shape}" - assert np.allclose(layer_output.data, flattened_4d.data), "Flatten layer should match flatten function" - print("PASS Flatten layer class") - - # Test case 4: Different start dimensions - flatten_from_0 = Flatten(start_dim=0) - full_flat = flatten_from_0(x_2d) - assert len(full_flat.shape) <= 2, 
"Flattening from dim 0 should create vector" - print("PASS Different start dimensions") - - # Test case 5: Integration with Sequential - network = Sequential([ - Linear(8, 4), - Flatten() - ]) - test_input = Tensor(np.random.randn(2, 8)) - output = network(test_input) - assert output.shape == (2, 4), f"Expected shape (2, 4), got {output.shape}" - print("PASS Flatten integration with Sequential") - - print("CELEBRATE All Flatten operations tests passed!") - -test_unit_flatten() - -# In[ ]: - -# %% [markdown] -""" -## 📦 Where This Code Lives in the Final Package - -**Learning Side:** You work in modules/03_layers/layers_dev.py -**Building Side:** Code exports to tinytorch.core.layers - -```python -# Final package structure: -from tinytorch.core.layers import Module, Linear, Sequential, Flatten # This module -from tinytorch.core.tensor import Tensor, Parameter # Foundation (always needed) -``` - -**Why this matters:** -- **Learning:** Complete layer system in one focused module for deep understanding -- **Production:** Proper organization like PyTorch's torch.nn with all core components together -- **Consistency:** All layer operations and parameter management in core.layers -- **Integration:** Works seamlessly with tensors for complete neural network building -""" - -# %% - -# %% [markdown] -""" -## Complete Neural Network Demo -""" - -def demonstrate_complete_networks(): - """Demonstrate complete neural networks using all implemented components.""" - print("FIRE Complete Neural Network Demo") - print("=" * 50) - - print("\n1. 
MLP for Classification (MNIST-style):") - # Multi-layer perceptron for image classification - mlp = Sequential([ - Flatten(), # Flatten input images - Linear(784, 256), # First hidden layer - Linear(256, 128), # Second hidden layer - Linear(128, 10) # Output layer (10 classes) - ]) - - # Test with batch of "images" - batch_images = Tensor(np.random.randn(32, 28, 28)) # 32 MNIST-like images - mlp_output = mlp(batch_images) - print(f" Input: {batch_images.shape} (batch of 28x28 images)") - print(f" Output: {mlp_output.shape} (class logits for 32 images)") - print(f" Parameters: {len(mlp.parameters())} tensors") - - print("\n2. CNN-style Architecture (with Flatten):") - # Simulate CNN -> Flatten -> Dense pattern - cnn_style = Sequential([ - # Simulate Conv2D output with random "features" - Flatten(), # Flatten spatial features - Linear(512, 256), # Dense layer after convolution - Linear(256, 10) # Classification head - ]) - - # Test with simulated conv output - conv_features = Tensor(np.random.randn(16, 8, 8, 8)) # Simulated (B,C,H,W) - cnn_output = cnn_style(conv_features) - print(f" Input: {conv_features.shape} (simulated conv features)") - print(f" Output: {cnn_output.shape} (class predictions)") - - print("\n3. Deep Network with Many Layers:") - # Demonstrate deep composition - deep_net = Sequential() - layer_sizes = [100, 80, 60, 40, 20, 10] - - for i in range(len(layer_sizes) - 1): - deep_net.add(Linear(layer_sizes[i], layer_sizes[i+1])) - print(f" Added layer: {layer_sizes[i]} -> {layer_sizes[i+1]}") - - # Test deep network - deep_input = Tensor(np.random.randn(8, 100)) - deep_output = deep_net(deep_input) - print(f" Deep network: {deep_input.shape} -> {deep_output.shape}") - print(f" Total parameters: {len(deep_net.parameters())} tensors") - - print("\n4. 
Parameter Management Across Networks:") - networks = {'MLP': mlp, 'CNN-style': cnn_style, 'Deep': deep_net} - - for name, net in networks.items(): - params = net.parameters() - total_params = sum(p.data.size for p in params) - memory_mb = total_params * 4 / (1024 * 1024) # float32 = 4 bytes - print(f" {name}: {len(params)} param tensors, {total_params:,} total params, {memory_mb:.2f} MB") - - print("\nCELEBRATE All components work together seamlessly!") - print(" • Module system enables automatic parameter collection") - print(" • Linear layers handle matrix transformations") - print(" • Sequential composes layers into complete architectures") - print(" • Flatten connects different layer types") - print(" • Everything integrates for production-ready neural networks!") - -demonstrate_complete_networks() - -# In[ ]: - -# %% [markdown] -""" -## Testing Framework -""" - -def test_module(): - """Run complete module validation.""" - print("🧪 TESTING ALL LAYER COMPONENTS") - print("=" * 40) - - # Call every individual test function - test_unit_linear() - test_unit_parameter_management() - test_unit_sequential() - test_unit_flatten() - - print("\n✅ ALL TESTS PASSED! Layer module ready for integration.") - -# In[ ]: - -if __name__ == "__main__": - print("🚀 TINYTORCH LAYERS MODULE") - print("=" * 50) - - # Test all components - test_module() - - # Systems analysis - print("\n" + "=" * 50) - analyze_layer_performance() - - # Complete demo - print("\n" + "=" * 50) - demonstrate_complete_networks() - - print("\n🎉 LAYERS MODULE COMPLETE!") - print("✅ Ready for advanced architectures and training!") - -# %% [markdown] -""" -## 🤔 ML Systems Thinking: Interactive Questions - -Now that you've implemented all the core neural network components, let's think about their implications for ML systems: - -**Question 1: Memory vs Computation Analysis** - -You're designing a neural network for deployment on a mobile device with limited memory (1GB RAM) but decent compute power. 
- -You have two architecture options: -A) Wide network: 784 -> 2048 -> 2048 -> 10 (3 layers, wide) -B) Deep network: 784 -> 256 -> 256 -> 256 -> 256 -> 10 (5 layers, narrow) - -Calculate the memory requirements for each option and explain which you'd choose for mobile deployment and why. - -Consider: -- Parameter storage requirements -- Intermediate activation storage during forward pass -- Training vs inference memory requirements -- How your choice affects model capacity and accuracy - -⭐ **Question 2: Production Performance Optimization** - -Your Linear layer implementation works correctly, but you notice it's slower than PyTorch's nn.Linear on the same hardware. - -Investigate and explain: -1. Why might our implementation be slower? (Hint: think about underlying linear algebra libraries) -2. What optimization techniques do production frameworks use? -3. How would you modify our implementation to approach production performance? -4. When might our simple implementation actually be preferable? - -Research areas to consider: -- BLAS (Basic Linear Algebra Subprograms) libraries -- Memory layout and cache efficiency -- Vectorization and SIMD instructions -- GPU kernel optimization - -⭐ **Question 3: Systems Architecture Scaling** - -Modern transformer models like GPT-3 have billions of parameters, primarily in Linear layers. - -Analyze the scaling challenges: -1. How does memory requirement scale with model size? Calculate the memory needed for a 175B parameter model. -2. What are the computational bottlenecks during training vs inference? -3. How do systems like distributed training address these scaling challenges? -4. Why do large models use techniques like gradient checkpointing and model parallelism? 
- -Systems considerations: -- Memory hierarchy (L1/L2/L3 cache, RAM, storage) -- Network bandwidth for distributed training -- GPU memory constraints and model sharding -- Inference optimization for production serving -""" - -# %% [markdown] -""" -## 🎯 MODULE SUMMARY: Layers - Complete Neural Network Foundation - -### What You've Accomplished - -You've successfully implemented the complete foundation for neural networks - all the essential components working together: - -### ✅ **Complete Core System** -- **Module Base Class**: Parameter management and composition patterns for all neural network components -- **Matrix Multiplication**: The computational primitive underlying all neural network operations -- **Linear (Dense) Layers**: Complete implementation with proper parameter initialization and forward propagation -- **Sequential Networks**: Clean composition system for building complete neural network architectures -- **Flatten Operations**: Tensor reshaping to connect different layer types (essential for CNN->MLP transitions) - -### ✅ **Systems Understanding** -- **Architectural Patterns**: How modular design enables everything from MLPs to complex deep networks -- **Memory Analysis**: How layer composition affects memory usage and computational efficiency -- **Performance Characteristics**: Understanding how tensor operations and layer composition affect performance -- **Production Context**: Connection to real-world ML frameworks and their component organization - -### ✅ **ML Engineering Skills** -- **Complete Parameter Management**: How neural networks automatically collect parameters from all components -- **Network Composition**: Building complex architectures from simple, reusable components -- **Tensor Operations**: Essential reshaping and transformation operations for different network types -- **Clean Abstraction**: Professional software design patterns that scale to production systems - -### 🔗 **Connection to Production ML Systems** - -Your unified 
implementation mirrors the complete component systems used in: -- **PyTorch's nn.Module system**: Same parameter management and composition patterns -- **PyTorch's nn.Sequential**: Identical architecture composition approach -- **All major frameworks**: The same modular design principles that power TensorFlow, JAX, and others -- **Production ML systems**: Clean abstractions that enable complex models while maintaining manageable code - -### 🚀 **What's Next** - -With your complete layer foundation, you're ready to: -- **Module 05 (Dense)**: Build complete dense networks for classification tasks -- **Module 06 (Spatial)**: Add convolutional layers for computer vision -- **Module 09 (Autograd)**: Enable automatic differentiation for learning -- **Module 10 (Optimizers)**: Implement sophisticated optimization algorithms - -### 💡 **Key Systems Insights** - -1. **Modular composition is the key to scalable ML systems** - clean interfaces enable complex behaviors -2. **Parameter management must be automatic** - manual parameter tracking doesn't scale to deep networks -3. **Tensor operations like flattening are architectural requirements** - different layer types need different tensor shapes -4. **Clean abstractions enable innovation** - good foundational design supports unlimited architectural experimentation - -You now understand how to build complete, production-ready neural network foundations that can scale to any architecture! -""" \ No newline at end of file +# Backward compatibility alias +#| export +Dense = Linear diff --git a/tinytorch/core/layers.py.backup b/tinytorch/core/layers.py.backup deleted file mode 100644 index 9f12c730..00000000 --- a/tinytorch/core/layers.py.backup +++ /dev/null @@ -1,1205 +0,0 @@ -# --- -# jupyter: -# jupytext: -# text_representation: -# extension: .py -# format_name: percent -# format_version: '1.3' -# jupytext_version: 1.17.1 -# --- - -# %% [markdown] -""" -# Layers - Building Neural Network Architectures - -Welcome to Layers! 
You'll implement the essential building blocks that compose into complete neural network architectures. - -## LINK Building on Previous Learning -**What You Built Before**: -- Module 02 (Tensor): N-dimensional arrays with shape management and broadcasting -- Module 03 (Activations): ReLU and Softmax functions providing nonlinear intelligence - -**What's Working**: You can create tensors and apply nonlinear transformations for complex pattern learning! - -**The Gap**: You have data structures and nonlinear functions, but no way to combine them into trainable neural network architectures. - -**This Module's Solution**: Implement Linear layers, Module composition patterns, and Sequential networks - the architectural foundations enabling everything from MLPs to transformers. - -**Connection Map**: -``` -Activations -> Layers -> Training -(intelligence) (architecture) (learning) -``` - -## Learning Objectives - -By completing this module, you will: - -1. **Build layer abstractions** - Create the building blocks that compose into neural networks -2. **Implement Linear layers** - The fundamental operation that transforms data between dimensions -3. **Create Sequential networks** - Chain layers together to build complete neural networks -4. **Manage parameters** - Handle weights and biases in an organized way -5. **Foundation for architectures** - Enable building everything from simple MLPs to complex models - -## Build -> Use -> Reflect -1. **Build**: Module base class, Linear layers, and Sequential composition -2. **Use**: Combine layers into complete neural networks with real data -3. **Reflect**: Understand how simple building blocks enable complex architectures -""" - -# In[ ]: - -#| default_exp core.layers - -#| export -import numpy as np -import sys -import os - -# Smart import system: works both during development and in production -# This pattern allows the same code to work in two scenarios: -# 1. 
During development: imports from local module files (tensor_dev.py) -# 2. In production: imports from installed tinytorch package -# This flexibility is essential for educational development workflows - -if 'tinytorch' in sys.modules: - # Production: Import from installed package - # When tinytorch is installed as a package, use the packaged version - from tinytorch.core.tensor import Tensor -else: - # Development: Import from local module files - # During development, we need to import directly from the source files - # This allows us to work with modules before they're packaged - tensor_module_path = os.path.join(os.path.dirname(__file__), '..', '01_tensor') - sys.path.insert(0, tensor_module_path) - try: - from tensor_dev import Tensor - finally: - sys.path.pop(0) # Always clean up path to avoid side effects - -# Parameter wraps a Tensor with requires_grad=True for gradient tracking -class Parameter: - """ - A trainable parameter that supports automatic differentiation. - - This wraps a Tensor with requires_grad=True for use as a neural network parameter. - Essential for gradient-based optimization of weights and biases. - - IMPORTANT: Parameters must participate in autograd for training to work. - """ - def __init__(self, data): - # No separate Variable class here - Parameter is built on the pure Tensor system
- - # Use pure Tensor with gradients enabled - from tinytorch.core.tensor import Tensor - - if isinstance(data, Tensor): - self._tensor = data - if not data.requires_grad: - # Ensure parameters always require gradients - data.requires_grad = True - else: - # Convert data to Tensor with gradient tracking - self._tensor = Tensor(data, requires_grad=True) - - def __getattr__(self, name): - """Delegate all attribute access to the underlying Tensor.""" - return getattr(self._tensor, name) - - def __setattr__(self, name, value): - """Handle setting attributes.""" - if name == '_tensor': - super().__setattr__(name, value) - else: - # Delegate to underlying Tensor - setattr(self._tensor, name, value) - - @property - def data(self): - """Access to underlying data.""" - return self._tensor.data - - @property - def grad(self): - """Access to gradient.""" - return self._tensor.grad - - @grad.setter - def grad(self, value): - """Set gradient.""" - self._tensor.grad = value - - @property - def requires_grad(self): - """Whether this parameter requires gradients.""" - return self._tensor.requires_grad - - def backward(self, gradient=None): - """Backpropagate gradients.""" - return self._tensor.backward(gradient) - - def __repr__(self): - return f"Parameter({self._tensor})" - -# In[ ]: - -print("FIRE TinyTorch Layers Module") -print(f"NumPy version: {np.__version__}") -print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}") -print("Ready to build neural network layers!") - -# %% [markdown] -""" -## Visual Guide: Understanding Neural Network Architecture Through Diagrams - -### Neural Network Layers: From Components to Systems - -``` -Individual Neuron: Neural Network Layer: - x₁ --○ w₁ +---------------------+ - \\ | Input Vector | - x₂ --○ w₂ --> Sum --> f() --> y | [x₁, x₂, x₃] | - / +---------------------+ - x₃ --○ w₃ v - + bias +---------------------+ - | Weight Matrix W | -One computation unit | +w₁₁ w₁₂ w₁₃+ | - | |w₂₁ w₂₂ w₂₃| | - | +w₃₁ w₃₂ w₃₃+ | - 
+---------------------+ - v - Matrix multiplication - Y = X @ W + b - v - +---------------------+ - | Output Vector | - | [y₁, y₂, y₃] | - +---------------------+ - -Parallel processing of many neurons! -``` - -### Layer Composition: Building Complex Architectures - -``` -Multi-Layer Perceptron (MLP) Architecture: - - Input Hidden Layer 1 Hidden Layer 2 Output - (784 dims) (256 neurons) (128 neurons) (10 classes) -+---------+ +-------------+ +-------------+ +---------+ -| Image |----▶| ReLU |--▶| ReLU |--▶| Softmax | -| 28*28px | | Activations | | Activations | | Probs | -+---------+ +-------------+ +-------------+ +---------+ - v v v v -200,960 params 32,896 params 1,290 params Total: 235,146 - -Parameter calculation for Linear(input_size, output_size): -• Weights: input_size * output_size matrix -• Biases: output_size vector -• Total: (input_size * output_size) + output_size - -Memory scaling pattern: -Layer width doubles -> Parameters quadruple -> Memory quadruples -``` - -### Module System: Automatic Parameter Management - -``` -Parameter Collection Hierarchy: - -Model (Sequential) -+-- Layer1 (Linear) -| +-- weights [784 * 256] --+ -| +-- bias [256] --┤ -+-- Layer2 (Linear) +--▶ model.parameters() -| +-- weights [256 * 128] --┤ Automatically collects -| +-- bias [128] --┤ all parameters for -+-- Layer3 (Linear) +--▶ optimizer.step() - +-- weights [128 * 10] --┤ - +-- bias [10] --+ - -Before Module system: With Module system: -manually track params -> automatic collection -params = [w1, b1, w2,...] 
params = model.parameters() - -Enables: optimizer = Adam(model.parameters()) -``` - -### Memory Layout and Performance Implications - -``` -Tensor Memory Access Patterns: - -Matrix Multiplication: A @ B = C - -Efficient (Row-major access): Inefficient (Column-major): -A: --------------▶ A: | | | | | ▶ - Cache-friendly | | | | | - Sequential reads v v v v v - Cache misses -B: | B: --------------▶ - | - v - -Performance impact: -• Good memory layout: 100% cache hit ratio -• Poor memory layout: 10-50% cache hit ratio -• 10-100x performance difference in practice - -Why contiguous tensors matter in production! -``` -""" - -# %% [markdown] -""" -## Part 1: Module Base Class - The Foundation of Neural Network Architecture -""" - -# %% nbgrader={"grade": false, "grade_id": "module-base", "solution": true} - -# Before building specific layers, we need a base class that enables clean composition and automatic parameter management. - -#| export -class Module: - """ - Base class for all neural network modules. - - Provides automatic parameter collection, forward pass management, - and clean composition patterns. All layers (Dense, Conv2d, etc.) - inherit from this class. - - Key Features: - - Automatic parameter registration when you assign parameter Tensors (weights, bias) - - Recursive parameter collection from sub-modules - - Clean __call__ interface: model(x) instead of model.forward(x) - - Extensible for custom layers - - Example Usage: - class MLP(Module): - def __init__(self): - super().__init__() - self.layer1 = Linear(784, 128) # Auto-registered! - self.layer2 = Linear(128, 10) # Auto-registered! - - def forward(self, x): - x = self.layer1(x) - return self.layer2(x) - - model = MLP() - params = model.parameters() # Gets all parameters automatically! - output = model(input) # Clean interface! 
- """ - - def __init__(self): - """Initialize module with empty parameter and sub-module storage.""" - self._parameters = [] - self._modules = [] - - def __setattr__(self, name, value): - """ - Intercept attribute assignment to auto-register parameters and modules. - - When you do self.weight = Parameter(...), this automatically adds - the parameter to our collection for easy optimization. - """ - # Step 1: Check if this looks like a parameter (Tensor with data and specific name) - # Break down the complex boolean logic for clarity: - is_tensor_like = hasattr(value, 'data') and hasattr(value, 'shape') - is_tensor_type = isinstance(value, Tensor) - is_parameter_type = isinstance(value, Parameter) - is_parameter_name = name in ['weights', 'weight', 'bias'] - - if is_tensor_like and (is_tensor_type or is_parameter_type) and is_parameter_name: - # Step 2: Add to our parameter list for optimization - self._parameters.append(value) - - # Step 3: Check if it's a sub-module (another neural network layer) - elif isinstance(value, Module): - # Step 4: Add to module list for recursive parameter collection - self._modules.append(value) - - # Step 5: Always set the actual attribute (this is essential!) - super().__setattr__(name, value) - - def parameters(self): - """ - Recursively collect all parameters from this module and sub-modules. - - Returns: - List of all parameters (Tensors containing weights and biases) - - This enables: optimizer = Adam(model.parameters()) (when optimizers are available) - """ - # Start with our own parameters - params = list(self._parameters) - - # Add parameters from sub-modules recursively - for module in self._modules: - params.extend(module.parameters()) - - return params - - def __call__(self, *args, **kwargs): - """ - Makes modules callable: model(x) instead of model.forward(x). 
- - This is the magic that enables clean syntax like: - output = model(input) - instead of: - output = model.forward(input) - """ - return self.forward(*args, **kwargs) - - def forward(self, *args, **kwargs): - """ - Forward pass - must be implemented by subclasses. - - This is where the actual computation happens. Every layer - defines its own forward() method. - """ - raise NotImplementedError("Subclasses must implement forward()") - -# In[ ]: - -# PASS IMPLEMENTATION CHECKPOINT: Basic Module class complete - -# THINK PREDICTION: How many parameters would a simple 3-layer network have? -# Write your guess here: _______ - -# 🔍 SYSTEMS ANALYSIS: Layer Performance and Scaling -def analyze_layer_performance(): - """Analyze layer performance and scaling characteristics.""" - print("📊 LAYER SYSTEMS ANALYSIS") - print("Understanding how neural network layers scale and perform...") - - try: - # Parameter scaling analysis - print("\n1. Parameter Scaling:") - layer_sizes = [(784, 256), (256, 128), (128, 10)] - total_params = 0 - - for i, (input_size, output_size) in enumerate(layer_sizes): - weights = input_size * output_size - biases = output_size - layer_params = weights + biases - total_params += layer_params - print(f" Layer {i+1} ({input_size}→{output_size}): {layer_params:,} params") - - print(f" Total network: {total_params:,} parameters") - print(f" Memory usage: {total_params * 4 / 1024 / 1024:.2f} MB (float32)") - - # Computational complexity - print("\n2. Computational Complexity:") - batch_size = 32 - total_flops = 0 - - for i, (input_size, output_size) in enumerate(layer_sizes): - matmul_flops = 2 * batch_size * input_size * output_size - bias_flops = batch_size * output_size - layer_flops = matmul_flops + bias_flops - total_flops += layer_flops - print(f" Layer {i+1}: {layer_flops:,} FLOPs ({matmul_flops:,} matmul + {bias_flops:,} bias)") - - print(f" Total forward pass: {total_flops:,} FLOPs") - - # Scaling patterns - print("\n3. 
Scaling Insights:") - print(" • Parameter growth: O(input_size × output_size) - quadratic") - print(" • Computation: O(batch × input × output) - linear in each dimension") - print(" • Memory: Parameters + activations scale differently") - print(" • Bottlenecks: Large layers dominate both memory and compute") - - print("\n💡 KEY INSIGHT: Layer size quadratically affects parameters but linearly affects computation per sample") - - except Exception as e: - print(f"⚠️ Analysis error: {e}") - -# In[ ]: - -# %% [markdown] -""" -### ✅ IMPLEMENTATION CHECKPOINT: Module Base Class Complete - -You've built the foundation that enables automatic parameter management across all neural network components! - -🤔 **PREDICTION**: How many parameters would a simple 3-layer network have? -Network: 784 → 256 → 128 → 10 -Your guess: _______ -""" - -# %% [markdown] -""" -## Part 2: Linear Layer - The Fundamental Neural Network Component - -Linear layers (also called Dense or Fully Connected layers) are the building blocks of neural networks. -""" - -# %% nbgrader={"grade": false, "grade_id": "linear-layer", "solution": true} - -#| export -class Linear(Module): - """ - Linear (Fully Connected) Layer implementation. - - Applies the transformation: output = input @ weights + bias - - Inherits from Module for automatic parameter management and clean API. - This is PyTorch's nn.Linear equivalent with the same name for familiarity. - - Features: - - Automatic parameter registration (weights and bias) - - Clean call interface: layer(input) instead of layer.forward(input) - - Works with optimizers via model.parameters() - """ - - def __init__(self, input_size: int, output_size: int, use_bias: bool = True): - """ - Initialize Linear layer with random weights and optional bias. - - Args: - input_size: Number of input features - output_size: Number of output features - use_bias: Whether to include bias term - - TODO: Implement Linear layer initialization. - - STEP-BY-STEP IMPLEMENTATION: - 1. 
Store input_size and output_size as instance variables - 2. Initialize weights as Tensor with shape (input_size, output_size) - 3. Use small random values: np.random.randn(...) * 0.1 - 4. Initialize bias as Tensor with shape (output_size,) if use_bias is True - 5. Set bias to None if use_bias is False - - LEARNING CONNECTIONS: - - Small random initialization prevents symmetry breaking - - Weight shape (input_size, output_size) enables matrix multiplication - - Bias allows shifting the output (like y-intercept in linear regression) - - PyTorch uses more sophisticated initialization (Xavier, Kaiming) - - IMPLEMENTATION HINTS: - - Use np.random.randn() for Gaussian random numbers - - Scale by 0.1 to keep initial values small - - Remember to wrap numpy arrays in Tensor() - - Store use_bias flag for forward pass logic - """ - ### BEGIN SOLUTION - super().__init__() # Initialize Module base class - - self.input_size = input_size - self.output_size = output_size - self.use_bias = use_bias - - # Initialize weights with small random values using Parameter - # Shape: (input_size, output_size) for matrix multiplication - # - # MAGNIFY WEIGHT INITIALIZATION CONTEXT: - # Weight initialization is critical for training deep networks successfully. 
- # Our simple approach (small random * 0.1) works for shallow networks, but - # deeper networks require more sophisticated initialization strategies: - # - # • Xavier/Glorot: scale = sqrt(1/fan_in) - good for tanh/sigmoid activations - # • Kaiming/He: scale = sqrt(2/fan_in) - optimized for ReLU activations - # • Our approach: scale = 0.1 - simple but effective for basic networks - # - # Why proper initialization matters: - # - Prevents vanishing gradients (weights too small -> signals disappear) - # - Prevents exploding gradients (weights too large -> signals blow up) - # - Enables stable training in deeper architectures (Module 11 training) - # - Affects convergence speed and final model performance - # - # Production frameworks automatically choose initialization based on layer type! - weight_data = np.random.randn(input_size, output_size) * 0.1 - self.weights = Parameter(weight_data) # Auto-registers for optimization! - - # Initialize bias if requested - if use_bias: - # MAGNIFY GRADIENT FLOW PREPARATION: - # Clean parameter management is essential for backpropagation (Module 09). - # When we implement autograd, the optimizer needs to find ALL trainable - # parameters automatically. 
Our Module base class ensures that: - # - # • Parameters are automatically registered when assigned - # • Recursive parameter collection works through network hierarchies - # • Gradient updates can flow to all learnable weights and biases - # • Memory management handles parameter lifecycle correctly - # - # This design enables the autograd system to: - # - Track computational graphs through all layers - # - Accumulate gradients for each parameter during backpropagation - # - Support optimizers that update parameters based on gradients - # - Scale to arbitrarily deep and complex network architectures - # - # Bias also uses small random initialization (could be zeros, but small random works well) - bias_data = np.random.randn(output_size) * 0.1 - self.bias = Parameter(bias_data) # Auto-registers for optimization! - else: - self.bias = None - ### END SOLUTION - - def forward(self, x): - """ - Forward pass through the Linear layer with automatic differentiation. - - Args: - x: Input Tensor (shape: ..., input_size) - - Returns: - Output Tensor (shape: ..., output_size) with gradient tracking - - This method uses gradient-tracking Tensor operations so that gradients - flow through the parameters during backpropagation. - - TODO: Implement the linear transformation using Tensor operations - - STEP-BY-STEP IMPLEMENTATION: - 1. Convert input to Tensor if needed (with gradient tracking) - 2. Use Tensor matrix multiplication: x.matmul(self.weights) - 3. Add bias with Tensor addition if it exists: result + self.bias - 4. Return a Tensor with gradient tracking preserved - - LEARNING CONNECTIONS: - - Uses gradient-tracking Tensor operations instead of raw numpy - - Parameters (weights/bias) are Tensors with requires_grad=True - - Matrix multiplication and addition maintain the computational graph - - This enables backpropagation through all parameters - - IMPLEMENTATION HINTS: - - Import Tensor locally to avoid circular imports - - Ensure the result Tensor preserves gradient tracking - - Handle both Tensor and array-like inputs gracefully - """ - ### BEGIN SOLUTION - # Use pure Tensor operations throughout - from tinytorch.core.tensor import Tensor - - # Ensure input is a Tensor - if not isinstance(x, Tensor): - x = Tensor(x.data if hasattr(x, 'data') else x) - - # Matrix multiplication: x @ weights - # Use Tensor's matmul which tracks gradients - result = x.matmul(self.weights) - - # Add bias if it exists - if self.bias is not None: - result = result + self.bias - - # Return pure Tensor with gradient tracking preserved - return result - ### END SOLUTION - -# In[ ]: - -# TEST Unit Test: Linear Layer -def test_unit_linear(): - """Test Linear layer implementation.""" - print("TEST Testing Linear Layer...") - - # Test case 1: Basic functionality - layer = Linear(input_size=3, output_size=2) - input_tensor = Tensor([[1.0, 2.0, 3.0]]) # Shape: (1, 3) - output = layer.forward(input_tensor) - - # Check output shape - assert output.shape == (1, 2), f"Expected shape (1, 2), got {output.shape}" - print("PASS Output shape correct") - - # Test case 2: No bias - layer_no_bias = Linear(input_size=2, output_size=3, use_bias=False) - assert layer_no_bias.bias is None, "Bias should be None when use_bias=False" - print("PASS No bias option works") - - # Test case 3: Multiple samples (batch processing) - batch_input = Tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]) # Shape: (3, 2) - layer_batch = Linear(input_size=2, output_size=2) - batch_output = layer_batch.forward(batch_input)
assert batch_output.shape == (3, 2), f"Expected shape (3, 2), got {batch_output.shape}" - print("PASS Batch processing works") - - # Test case 4: Callable interface - callable_output = layer_batch(batch_input) - assert np.allclose(callable_output.data, batch_output.data), "Callable interface should match forward()" - print("PASS Callable interface works") - - # Test case 5: Parameter initialization - layer_init = Linear(input_size=10, output_size=5) - assert layer_init.weights.shape == (10, 5), f"Expected weights shape (10, 5), got {layer_init.weights.shape}" - assert layer_init.bias.shape == (5,), f"Expected bias shape (5,), got {layer_init.bias.shape}" - - # Check that weights are reasonably small (good initialization) - mean_val = np.abs(layer_init.weights.data).mean() - # Convert to float if it's a Tensor - if hasattr(mean_val, 'item'): - mean_val = mean_val.item() - elif hasattr(mean_val, 'data'): - mean_val = float(mean_val.data) - assert mean_val < 1.0, "Weights should be small for good initialization" - print("PASS Parameter initialization correct") - - print("CELEBRATE All Linear layer tests passed!") - -test_unit_linear() - -# In[ ]: - -# TEST Unit Test: Parameter Management -def test_unit_parameter_management(): - """Test Linear layer parameter management and module composition.""" - print("TEST Testing Parameter Management...") - - # Test case 1: Parameter registration - layer = Linear(input_size=3, output_size=2) - params = layer.parameters() - - assert len(params) == 2, f"Expected 2 parameters (weights + bias), got {len(params)}" - assert layer.weights in params, "Weights should be in parameters list" - assert layer.bias in params, "Bias should be in parameters list" - print("PASS Parameter registration works") - - # Test case 2: Module composition - class SimpleNetwork(Module): - def __init__(self): - super().__init__() - self.layer1 = Linear(4, 3) - self.layer2 = Linear(3, 2) - - def forward(self, x): - x = self.layer1(x) - return self.layer2(x) - - 
network = SimpleNetwork() - all_params = network.parameters() - - # Should have 4 parameters: 2 from each layer (weights + bias) - assert len(all_params) == 4, f"Expected 4 parameters from network, got {len(all_params)}" - print("PASS Module composition and parameter collection works") - - # Test case 3: Forward pass through composed network - input_tensor = Tensor([[1.0, 2.0, 3.0, 4.0]]) - output = network(input_tensor) - - assert output.shape == (1, 2), f"Expected output shape (1, 2), got {output.shape}" - print("PASS Network forward pass works") - - # Test case 4: No bias option - layer_no_bias = Linear(input_size=3, output_size=2, use_bias=False) - params_no_bias = layer_no_bias.parameters() - - assert len(params_no_bias) == 1, f"Expected 1 parameter (weights only), got {len(params_no_bias)}" - assert layer_no_bias.bias is None, "Bias should be None when use_bias=False" - print("PASS No bias option works") - - print("CELEBRATE All parameter management tests passed!") - -test_unit_parameter_management() - -# In[ ]: - -# PASS IMPLEMENTATION CHECKPOINT: Linear layer complete - -# THINK PREDICTION: How does memory usage scale with network depth vs width? -# Deeper network (more layers): _______ -# Wider network (more neurons per layer): _______ - -# MAGNIFY SYSTEMS INSIGHT #3: Architecture Memory Analysis -# Architecture analysis consolidated into analyze_layer_performance() above - -# Analysis consolidated into analyze_layer_performance() above - -# %% [markdown] -""" -## Part 4: Sequential Network Composition -""" - -# %% nbgrader={"grade": false, "grade_id": "sequential-composition", "solution": true} - -#| export -class Sequential(Module): - """ - Sequential Network: Composes layers in sequence. - - The most fundamental network architecture that applies layers in order: - f(x) = layer_n(...layer_2(layer_1(x))) - - Inherits from Module for automatic parameter collection from all sub-layers. - This enables optimizers to find all parameters automatically. 
- - Example Usage: - # Create a 3-layer MLP - model = Sequential([ - Linear(784, 128), - ReLU(), - Linear(128, 64), - ReLU(), - Linear(64, 10) - ]) - - # Use the model - output = model(input_data) # Clean interface! - params = model.parameters() # All parameters from all layers! - """ - - def __init__(self, layers=None): - """ - Initialize Sequential network with layers. - - Args: - layers: List of layers to compose in order (optional) - """ - super().__init__() # Initialize Module base class - self.layers = layers if layers is not None else [] - - # Register all layers as sub-modules for parameter collection - for i, layer in enumerate(self.layers): - # This automatically adds each layer to self._modules - setattr(self, f'layer_{i}', layer) - - def forward(self, x): - """ - Forward pass through all layers in sequence. - - Args: - x: Input tensor - - Returns: - Output tensor after passing through all layers - """ - for layer in self.layers: - x = layer(x) - return x - - def add(self, layer): - """Add a layer to the network.""" - self.layers.append(layer) - # Register the new layer for parameter collection - setattr(self, f'layer_{len(self.layers)-1}', layer) - -# In[ ]: - -# TEST Unit Test: Sequential Networks -def test_unit_sequential(): - """Test Sequential network implementation.""" - print("TEST Testing Sequential Network...") - - # Test case 1: Create empty network - empty_net = Sequential() - assert len(empty_net.layers) == 0, "Empty Sequential should have no layers" - print("PASS Empty Sequential network creation") - - # Test case 2: Create network with layers - layers = [Linear(3, 4), Linear(4, 2)] - network = Sequential(layers) - assert len(network.layers) == 2, "Network should have 2 layers" - print("PASS Sequential network with layers") - - # Test case 3: Forward pass through network - input_tensor = Tensor([[1.0, 2.0, 3.0]]) - output = network(input_tensor) - assert output.shape == (1, 2), f"Expected output shape (1, 2), got {output.shape}" - 
print("PASS Forward pass through Sequential network") - - # Test case 4: Parameter collection from all layers - all_params = network.parameters() - # Should have 4 parameters: 2 weights + 2 biases from 2 Linear layers - assert len(all_params) == 4, f"Expected 4 parameters from Sequential network, got {len(all_params)}" - print("PASS Parameter collection from all layers") - - # Test case 5: Adding layers dynamically - network.add(Linear(2, 1)) - assert len(network.layers) == 3, "Network should have 3 layers after adding one" - - # Test forward pass after adding layer - final_output = network(input_tensor) - assert final_output.shape == (1, 1), f"Expected final output shape (1, 1), got {final_output.shape}" - print("PASS Dynamic layer addition") - - print("CELEBRATE All Sequential network tests passed!") - -test_unit_sequential() - -# %% [markdown] -""" -## Part 5: Flatten Operation - Connecting Different Layer Types -""" - -# %% nbgrader={"grade": false, "grade_id": "flatten-operations", "solution": true} - -#| export -def flatten(x, start_dim=1): - """ - Flatten tensor starting from a given dimension. - - This is essential for transitioning from convolutional layers - (which output 4D tensors) to linear layers (which expect 2D). - - Args: - x: Input tensor (Tensor or any array-like) - start_dim: Dimension to start flattening from (default: 1 to preserve batch) - - Returns: - Flattened tensor preserving batch dimension - - Examples: - # Flatten CNN output for Linear layer - conv_output = Tensor(np.random.randn(32, 64, 8, 8)) # (batch, channels, height, width) - flat = flatten(conv_output) # (32, 4096) - ready for Linear layer! - - # Flatten image for MLP - images = Tensor(np.random.randn(32, 3, 28, 28)) # CIFAR-10 batch - flat = flatten(images) # (32, 2352) - ready for MLP! 
- """ - # Get the data (handle both Tensor and numpy arrays) - if hasattr(x, 'data'): - data = x.data - else: - data = x - - # Calculate new shape - batch_size = data.shape[0] if start_dim > 0 else 1 - remaining_size = np.prod(data.shape[start_dim:]) - new_shape = (batch_size, remaining_size) if start_dim > 0 else (remaining_size,) - - # Reshape while preserving the original tensor type - if hasattr(x, 'data'): - # It's a Tensor - create a new Tensor with flattened data - flattened_data = data.reshape(new_shape) - # Use type(x) to preserve the exact Tensor type (Parameter vs regular Tensor) - # This ensures that if input was a Parameter, output is also a Parameter - return type(x)(flattened_data) - else: - # It's a numpy array - just reshape and return - return data.reshape(new_shape) - -#| export -class Flatten(Module): - """ - Flatten layer that reshapes tensors from multi-dimensional to 2D. - - Essential for connecting convolutional layers (which output 4D tensors) - to linear layers (which expect 2D tensors). Preserves the batch dimension. - - Example Usage: - # In a CNN architecture - model = Sequential([ - Conv2D(3, 16, kernel_size=3), # Output: (batch, 16, height, width) - ReLU(), - Flatten(), # Output: (batch, 16*height*width) - Linear(16*height*width, 10) # Now compatible! - ]) - """ - - def __init__(self, start_dim=1): - """ - Initialize Flatten layer. - - Args: - start_dim: Dimension to start flattening from (default: 1 to preserve batch) - """ - super().__init__() - self.start_dim = start_dim - - def forward(self, x): - """ - Flatten tensor starting from start_dim. 
- - Args: - x: Input tensor - - Returns: - Flattened tensor with batch dimension preserved - """ - return flatten(x, start_dim=self.start_dim) - -# In[ ]: - -# TEST Unit Test: Flatten Operations -def test_unit_flatten(): - """Test Flatten layer and function implementation.""" - print("TEST Testing Flatten Operations...") - - # Test case 1: Flatten function with 2D tensor - x_2d = Tensor([[1, 2], [3, 4]]) - flattened_func = flatten(x_2d) - assert flattened_func.shape == (2, 2), f"Expected shape (2, 2), got {flattened_func.shape}" - print("PASS Flatten function with 2D tensor") - - # Test case 2: Flatten function with 4D tensor (simulating CNN output) - x_4d = Tensor(np.random.randn(2, 3, 4, 4)) # (batch, channels, height, width) - flattened_4d = flatten(x_4d) - assert flattened_4d.shape == (2, 48), f"Expected shape (2, 48), got {flattened_4d.shape}" # 3*4*4 = 48 - print("PASS Flatten function with 4D tensor") - - # Test case 3: Flatten layer class - flatten_layer = Flatten() - layer_output = flatten_layer(x_4d) - assert layer_output.shape == (2, 48), f"Expected shape (2, 48), got {layer_output.shape}" - assert np.allclose(layer_output.data, flattened_4d.data), "Flatten layer should match flatten function" - print("PASS Flatten layer class") - - # Test case 4: Different start dimensions - flatten_from_0 = Flatten(start_dim=0) - full_flat = flatten_from_0(x_2d) - assert len(full_flat.shape) <= 2, "Flattening from dim 0 should create vector" - print("PASS Different start dimensions") - - # Test case 5: Integration with Sequential - network = Sequential([ - Linear(8, 4), - Flatten() - ]) - test_input = Tensor(np.random.randn(2, 8)) - output = network(test_input) - assert output.shape == (2, 4), f"Expected shape (2, 4), got {output.shape}" - print("PASS Flatten integration with Sequential") - - print("CELEBRATE All Flatten operations tests passed!") - -test_unit_flatten() - -# In[ ]: - -# %% [markdown] -""" -## 📦 Where This Code Lives in the Final Package - -**Learning 
Side:** You work in modules/03_layers/layers_dev.py -**Building Side:** Code exports to tinytorch.core.layers - -```python -# Final package structure: -from tinytorch.core.layers import Module, Linear, Sequential, Flatten # This module -from tinytorch.core.tensor import Tensor, Parameter # Foundation (always needed) -``` - -**Why this matters:** -- **Learning:** Complete layer system in one focused module for deep understanding -- **Production:** Proper organization like PyTorch's torch.nn with all core components together -- **Consistency:** All layer operations and parameter management in core.layers -- **Integration:** Works seamlessly with tensors for complete neural network building -""" - -# %% - -# %% [markdown] -""" -## Complete Neural Network Demo -""" - -def demonstrate_complete_networks(): - """Demonstrate complete neural networks using all implemented components.""" - print("FIRE Complete Neural Network Demo") - print("=" * 50) - - print("\n1. MLP for Classification (MNIST-style):") - # Multi-layer perceptron for image classification - mlp = Sequential([ - Flatten(), # Flatten input images - Linear(784, 256), # First hidden layer - Linear(256, 128), # Second hidden layer - Linear(128, 10) # Output layer (10 classes) - ]) - - # Test with batch of "images" - batch_images = Tensor(np.random.randn(32, 28, 28)) # 32 MNIST-like images - mlp_output = mlp(batch_images) - print(f" Input: {batch_images.shape} (batch of 28x28 images)") - print(f" Output: {mlp_output.shape} (class logits for 32 images)") - print(f" Parameters: {len(mlp.parameters())} tensors") - - print("\n2. 
CNN-style Architecture (with Flatten):") - # Simulate CNN -> Flatten -> Dense pattern - cnn_style = Sequential([ - # Simulate Conv2D output with random "features" - Flatten(), # Flatten spatial features - Linear(512, 256), # Dense layer after convolution - Linear(256, 10) # Classification head - ]) - - # Test with simulated conv output - conv_features = Tensor(np.random.randn(16, 8, 8, 8)) # Simulated (B,C,H,W) - cnn_output = cnn_style(conv_features) - print(f" Input: {conv_features.shape} (simulated conv features)") - print(f" Output: {cnn_output.shape} (class predictions)") - - print("\n3. Deep Network with Many Layers:") - # Demonstrate deep composition - deep_net = Sequential() - layer_sizes = [100, 80, 60, 40, 20, 10] - - for i in range(len(layer_sizes) - 1): - deep_net.add(Linear(layer_sizes[i], layer_sizes[i+1])) - print(f" Added layer: {layer_sizes[i]} -> {layer_sizes[i+1]}") - - # Test deep network - deep_input = Tensor(np.random.randn(8, 100)) - deep_output = deep_net(deep_input) - print(f" Deep network: {deep_input.shape} -> {deep_output.shape}") - print(f" Total parameters: {len(deep_net.parameters())} tensors") - - print("\n4. 
Parameter Management Across Networks:") - networks = {'MLP': mlp, 'CNN-style': cnn_style, 'Deep': deep_net} - - for name, net in networks.items(): - params = net.parameters() - total_params = sum(p.data.size for p in params) - memory_mb = total_params * 4 / (1024 * 1024) # float32 = 4 bytes - print(f" {name}: {len(params)} param tensors, {total_params:,} total params, {memory_mb:.2f} MB") - - print("\nCELEBRATE All components work together seamlessly!") - print(" • Module system enables automatic parameter collection") - print(" • Linear layers handle matrix transformations") - print(" • Sequential composes layers into complete architectures") - print(" • Flatten connects different layer types") - print(" • Everything integrates for production-ready neural networks!") - -demonstrate_complete_networks() - -# In[ ]: - -# %% [markdown] -""" -## Testing Framework -""" - -def test_module(): - """Run complete module validation.""" - print("🧪 TESTING ALL LAYER COMPONENTS") - print("=" * 40) - - # Call every individual test function - test_unit_linear() - test_unit_parameter_management() - test_unit_sequential() - test_unit_flatten() - - print("\n✅ ALL TESTS PASSED! Layer module ready for integration.") - -# In[ ]: - -if __name__ == "__main__": - print("🚀 TINYTORCH LAYERS MODULE") - print("=" * 50) - - # Test all components - test_module() - - # Systems analysis - print("\n" + "=" * 50) - analyze_layer_performance() - - # Complete demo - print("\n" + "=" * 50) - demonstrate_complete_networks() - - print("\n🎉 LAYERS MODULE COMPLETE!") - print("✅ Ready for advanced architectures and training!") - -# %% [markdown] -""" -## 🤔 ML Systems Thinking: Interactive Questions - -Now that you've implemented all the core neural network components, let's think about their implications for ML systems: - -**Question 1: Memory vs Computation Analysis** - -You're designing a neural network for deployment on a mobile device with limited memory (1GB RAM) but decent compute power. 
- -You have two architecture options: -A) Wide network: 784 -> 2048 -> 2048 -> 10 (3 layers, wide) -B) Deep network: 784 -> 256 -> 256 -> 256 -> 256 -> 10 (5 layers, narrow) - -Calculate the memory requirements for each option and explain which you'd choose for mobile deployment and why. - -Consider: -- Parameter storage requirements -- Intermediate activation storage during forward pass -- Training vs inference memory requirements -- How your choice affects model capacity and accuracy - -⭐ **Question 2: Production Performance Optimization** - -Your Linear layer implementation works correctly, but you notice it's slower than PyTorch's nn.Linear on the same hardware. - -Investigate and explain: -1. Why might our implementation be slower? (Hint: think about underlying linear algebra libraries) -2. What optimization techniques do production frameworks use? -3. How would you modify our implementation to approach production performance? -4. When might our simple implementation actually be preferable? - -Research areas to consider: -- BLAS (Basic Linear Algebra Subprograms) libraries -- Memory layout and cache efficiency -- Vectorization and SIMD instructions -- GPU kernel optimization - -⭐ **Question 3: Systems Architecture Scaling** - -Modern transformer models like GPT-3 have billions of parameters, primarily in Linear layers. - -Analyze the scaling challenges: -1. How does memory requirement scale with model size? Calculate the memory needed for a 175B parameter model. -2. What are the computational bottlenecks during training vs inference? -3. How do systems like distributed training address these scaling challenges? -4. Why do large models use techniques like gradient checkpointing and model parallelism? 
- -Systems considerations: -- Memory hierarchy (L1/L2/L3 cache, RAM, storage) -- Network bandwidth for distributed training -- GPU memory constraints and model sharding -- Inference optimization for production serving -""" - -# %% [markdown] -""" -## 🎯 MODULE SUMMARY: Layers - Complete Neural Network Foundation - -### What You've Accomplished - -You've successfully implemented the complete foundation for neural networks - all the essential components working together: - -### ✅ **Complete Core System** -- **Module Base Class**: Parameter management and composition patterns for all neural network components -- **Matrix Multiplication**: The computational primitive underlying all neural network operations -- **Linear (Dense) Layers**: Complete implementation with proper parameter initialization and forward propagation -- **Sequential Networks**: Clean composition system for building complete neural network architectures -- **Flatten Operations**: Tensor reshaping to connect different layer types (essential for CNN->MLP transitions) - -### ✅ **Systems Understanding** -- **Architectural Patterns**: How modular design enables everything from MLPs to complex deep networks -- **Memory Analysis**: How layer composition affects memory usage and computational efficiency -- **Performance Characteristics**: Understanding how tensor operations and layer composition affect performance -- **Production Context**: Connection to real-world ML frameworks and their component organization - -### ✅ **ML Engineering Skills** -- **Complete Parameter Management**: How neural networks automatically collect parameters from all components -- **Network Composition**: Building complex architectures from simple, reusable components -- **Tensor Operations**: Essential reshaping and transformation operations for different network types -- **Clean Abstraction**: Professional software design patterns that scale to production systems - -### 🔗 **Connection to Production ML Systems** - -Your unified 
implementation mirrors the complete component systems used in: -- **PyTorch's nn.Module system**: Same parameter management and composition patterns -- **PyTorch's nn.Sequential**: Identical architecture composition approach -- **All major frameworks**: The same modular design principles that power TensorFlow, JAX, and others -- **Production ML systems**: Clean abstractions that enable complex models while maintaining manageable code - -### 🚀 **What's Next** - -With your complete layer foundation, you're ready to: -- **Module 05 (Dense)**: Build complete dense networks for classification tasks -- **Module 06 (Spatial)**: Add convolutional layers for computer vision -- **Module 09 (Autograd)**: Enable automatic differentiation for learning -- **Module 10 (Optimizers)**: Implement sophisticated optimization algorithms - -### 💡 **Key Systems Insights** - -1. **Modular composition is the key to scalable ML systems** - clean interfaces enable complex behaviors -2. **Parameter management must be automatic** - manual parameter tracking doesn't scale to deep networks -3. **Tensor operations like flattening are architectural requirements** - different layer types need different tensor shapes -4. **Clean abstractions enable innovation** - good foundational design supports unlimited architectural experimentation - -You now understand how to build complete, production-ready neural network foundations that can scale to any architecture! 
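
To make the memory analysis from Question 1 concrete, here is a quick back-of-envelope sketch of the parameter counts for the two candidate architectures (wide vs. deep). This assumes float32 storage and a bias term on every Linear layer; `linear_params` is an illustrative helper, not part of the module:

```python
def linear_params(sizes):
    # Each Linear(in, out) stores an in*out weight matrix plus an out-sized bias.
    return sum(i * o + o for i, o in zip(sizes, sizes[1:]))

wide = linear_params([784, 2048, 2048, 10])          # Option A: 3 wide layers
deep = linear_params([784, 256, 256, 256, 256, 10])  # Option B: 5 narrow layers

print(f"Wide: {wide:,} params ({wide * 4 / 1e6:.1f} MB float32)")
print(f"Deep: {deep:,} params ({deep * 4 / 1e6:.1f} MB float32)")
```

The wide network needs roughly 15x the parameter memory of the deep one for the same input and output sizes, which is one reason narrow-and-deep layouts are often preferred on memory-constrained devices.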
-""" \ No newline at end of file diff --git a/tinytorch/core/losses.py b/tinytorch/core/losses.py deleted file mode 100644 index 6b51d8d3..00000000 --- a/tinytorch/core/losses.py +++ /dev/null @@ -1,137 +0,0 @@ -# Auto-generated losses module for TinyTorch -"""Loss functions for neural network training.""" - -import numpy as np -from tinytorch.core.tensor import Tensor - -class MSELoss: - """ - Mean Squared Error Loss with Autograd Integration - - This version properly integrates with the autograd system to enable - gradient flow during backpropagation. - """ - - def __init__(self): - """Initialize MSE loss function.""" - pass - - def __call__(self, predictions, targets): - """ - Compute MSE loss with autograd support. - - Args: - predictions: Model predictions (Tensor or convertible to Tensor) - targets: True targets (Tensor or convertible to Tensor) - - Returns: - Tensor with scalar loss value and gradient tracking - """ - # Ensure inputs are Variables for gradient tracking - if not isinstance(predictions, Tensor): - pred_data = predictions.data if hasattr(predictions, 'data') else predictions - predictions = Tensor(pred_data, requires_grad=False) - - if not isinstance(targets, Tensor): - target_data = targets.data if hasattr(targets, 'data') else targets - targets = Tensor(target_data, requires_grad=False) - - # Compute MSE using autograd operations - diff = subtract(predictions, targets) - squared_diff = multiply(diff, diff) - - # Sum all elements and divide by count to get mean - loss = Tensor.sum(squared_diff) - - # Convert to mean (divide by number of elements) - batch_size = predictions.data.data.size - mean_loss = multiply(loss, 1.0 / batch_size) - - return mean_loss - -class CrossEntropyLoss: - """ - Cross-Entropy Loss with Autograd Integration - - Simplified cross-entropy that works with the autograd system. - For training neural networks with gradient-based optimization. 
- """ - - def __init__(self): - """Initialize CrossEntropy loss function.""" - self.epsilon = 1e-7 # For numerical stability - - def __call__(self, predictions, targets): - """ - Compute cross-entropy loss with autograd support. - - Args: - predictions: Model predictions/logits (Tensor) - targets: True class indices (Tensor or numpy array) - - Returns: - Tensor with scalar loss value and gradient tracking - """ - # Handle Tensor inputs - if isinstance(predictions, Tensor): - pred_data = predictions.data.data - elif hasattr(predictions, 'data'): - pred_data = predictions.data - else: - pred_data = predictions - - if isinstance(targets, Tensor): - target_data = targets.data.data - elif hasattr(targets, 'data'): - target_data = targets.data - else: - target_data = targets - - # Apply softmax to predictions (numerically stable) - exp_pred = np.exp(pred_data - np.max(pred_data, axis=-1, keepdims=True)) - softmax_pred = exp_pred / np.sum(exp_pred, axis=-1, keepdims=True) - - # Clip for numerical stability - softmax_pred = np.clip(softmax_pred, self.epsilon, 1 - self.epsilon) - - # Compute cross-entropy loss - if len(target_data.shape) == 1 or target_data.shape[-1] == 1: - # Integer labels - batch_size = pred_data.shape[0] - loss = 0 - for i in range(batch_size): - label = int(target_data[i]) - loss -= np.log(softmax_pred[i, label]) - loss /= batch_size - else: - # One-hot labels - loss = -np.mean(np.sum(target_data * np.log(softmax_pred), axis=-1)) - - # Return as Tensor with gradient function - result = Tensor(loss, requires_grad=True) - - # Define backward function for proper gradient flow - def grad_fn(gradient): - if isinstance(predictions, Tensor) and predictions.requires_grad: - batch_size = pred_data.shape[0] - - # Gradient of cross-entropy with softmax - if len(target_data.shape) == 1 or target_data.shape[-1] == 1: - # Integer labels - gradient is (softmax - one_hot_targets) - grad = softmax_pred.copy() - for i in range(batch_size): - label = int(target_data[i]) 
- grad[i, label] -= 1 - grad = grad / batch_size * gradient # Scale by incoming gradient - else: - # One-hot labels - grad = (softmax_pred - target_data) / batch_size * gradient - - # Pass gradient directly as numpy array (backward() expects raw data) - predictions.backward(grad) - - result.grad_fn = grad_fn - return result - -# Aliases -MeanSquaredError = MSELoss \ No newline at end of file diff --git a/tinytorch/core/losses.py.backup b/tinytorch/core/losses.py.backup deleted file mode 100644 index abb000ef..00000000 --- a/tinytorch/core/losses.py.backup +++ /dev/null @@ -1,138 +0,0 @@ -# Auto-generated losses module for TinyTorch -"""Loss functions for neural network training.""" - -import numpy as np -from tinytorch.core.tensor import Tensor -from tinytorch.core.autograd import Variable, subtract, multiply, add - -class MSELoss: - """ - Mean Squared Error Loss with Autograd Integration - - This version properly integrates with the autograd system to enable - gradient flow during backpropagation. - """ - - def __init__(self): - """Initialize MSE loss function.""" - pass - - def __call__(self, predictions, targets): - """ - Compute MSE loss with autograd support. 
- - Args: - predictions: Model predictions (Variable or convertible to Variable) - targets: True targets (Variable or convertible to Variable) - - Returns: - Variable with scalar loss value and gradient tracking - """ - # Ensure inputs are Variables for gradient tracking - if not isinstance(predictions, Variable): - pred_data = predictions.data if hasattr(predictions, 'data') else predictions - predictions = Variable(pred_data, requires_grad=False) - - if not isinstance(targets, Variable): - target_data = targets.data if hasattr(targets, 'data') else targets - targets = Variable(target_data, requires_grad=False) - - # Compute MSE using autograd operations - diff = subtract(predictions, targets) - squared_diff = multiply(diff, diff) - - # Sum all elements and divide by count to get mean - loss = Variable.sum(squared_diff) - - # Convert to mean (divide by number of elements) - batch_size = predictions.data.data.size - mean_loss = multiply(loss, 1.0 / batch_size) - - return mean_loss - -class CrossEntropyLoss: - """ - Cross-Entropy Loss with Autograd Integration - - Simplified cross-entropy that works with the autograd system. - For training neural networks with gradient-based optimization. - """ - - def __init__(self): - """Initialize CrossEntropy loss function.""" - self.epsilon = 1e-7 # For numerical stability - - def __call__(self, predictions, targets): - """ - Compute cross-entropy loss with autograd support. 
- - Args: - predictions: Model predictions/logits (Variable) - targets: True class indices (Variable or numpy array) - - Returns: - Variable with scalar loss value and gradient tracking - """ - # Handle Variable inputs - if isinstance(predictions, Variable): - pred_data = predictions.data.data - elif hasattr(predictions, 'data'): - pred_data = predictions.data - else: - pred_data = predictions - - if isinstance(targets, Variable): - target_data = targets.data.data - elif hasattr(targets, 'data'): - target_data = targets.data - else: - target_data = targets - - # Apply softmax to predictions (numerically stable) - exp_pred = np.exp(pred_data - np.max(pred_data, axis=-1, keepdims=True)) - softmax_pred = exp_pred / np.sum(exp_pred, axis=-1, keepdims=True) - - # Clip for numerical stability - softmax_pred = np.clip(softmax_pred, self.epsilon, 1 - self.epsilon) - - # Compute cross-entropy loss - if len(target_data.shape) == 1 or target_data.shape[-1] == 1: - # Integer labels - batch_size = pred_data.shape[0] - loss = 0 - for i in range(batch_size): - label = int(target_data[i]) - loss -= np.log(softmax_pred[i, label]) - loss /= batch_size - else: - # One-hot labels - loss = -np.mean(np.sum(target_data * np.log(softmax_pred), axis=-1)) - - # Return as Variable with gradient function - result = Variable(loss, requires_grad=True) - - # Define backward function for proper gradient flow - def grad_fn(gradient): - if isinstance(predictions, Variable) and predictions.requires_grad: - batch_size = pred_data.shape[0] - - # Gradient of cross-entropy with softmax - if len(target_data.shape) == 1 or target_data.shape[-1] == 1: - # Integer labels - gradient is (softmax - one_hot_targets) - grad = softmax_pred.copy() - for i in range(batch_size): - label = int(target_data[i]) - grad[i, label] -= 1 - grad = grad / batch_size * gradient # Scale by incoming gradient - else: - # One-hot labels - grad = (softmax_pred - target_data) / batch_size * gradient - - # Pass gradient directly as 
numpy array (backward() expects raw data) - predictions.backward(grad) - - result.grad_fn = grad_fn - return result - -# Aliases -MeanSquaredError = MSELoss \ No newline at end of file diff --git a/tinytorch/core/mlops.py b/tinytorch/core/mlops.py index 74b835a8..b7233760 100644 --- a/tinytorch/core/mlops.py +++ b/tinytorch/core/mlops.py @@ -1,4 +1,19 @@ -# AUTOGENERATED! DO NOT EDIT! File to edit: ../../modules/source/temp_holding/15_mlops/mlops_dev.ipynb. +# ╔═══════════════════════════════════════════════════════════════════════════════╗ +# ║ 🚨 CRITICAL WARNING 🚨 ║ +# ║ AUTOGENERATED! DO NOT EDIT! ║ +# ║ ║ +# ║ This file is AUTOMATICALLY GENERATED from source modules. ║ +# ║ ANY CHANGES MADE HERE WILL BE LOST when modules are re-exported! ║ +# ║ ║ +# ║ ✅ TO EDIT: modules/source/XX_mlops/mlops_dev.py ║ +# ║ ✅ TO EXPORT: Run 'tito module complete ' ║ +# ║ ║ +# ║ 🛡️ STUDENT PROTECTION: This file contains optimized implementations. ║ +# ║ Editing it directly may break module functionality and training. ║ +# ║ ║ +# ║ 🎓 LEARNING TIP: Work in modules/source/ - that's where real development ║ +# ║ happens! The tinytorch/ directory is just the compiled output. ║ +# ╚═══════════════════════════════════════════════════════════════════════════════╝ # %% auto 0 __all__ = ['ModelMonitor', 'DriftDetector', 'RetrainingTrigger', 'MLOpsPipeline', 'ModelVersion', 'DeploymentStrategy', diff --git a/tinytorch/core/networks.py b/tinytorch/core/networks.py index 6dab5994..9c9ed228 100644 --- a/tinytorch/core/networks.py +++ b/tinytorch/core/networks.py @@ -1,4 +1,19 @@ -# AUTOGENERATED! DO NOT EDIT! File to edit: ../../modules/source/05_dense/dense_dev.ipynb. +# ╔═══════════════════════════════════════════════════════════════════════════════╗ +# ║ 🚨 CRITICAL WARNING 🚨 ║ +# ║ AUTOGENERATED! DO NOT EDIT! ║ +# ║ ║ +# ║ This file is AUTOMATICALLY GENERATED from source modules. ║ +# ║ ANY CHANGES MADE HERE WILL BE LOST when modules are re-exported! 
║ +# ║ ║ +# ║ ✅ TO EDIT: modules/source/05_dense/dense_dev.py ║ +# ║ ✅ TO EXPORT: Run 'tito module complete ' ║ +# ║ ║ +# ║ 🛡️ STUDENT PROTECTION: This file contains optimized implementations. ║ +# ║ Editing it directly may break module functionality and training. ║ +# ║ ║ +# ║ 🎓 LEARNING TIP: Work in modules/source/ - that's where real development ║ +# ║ happens! The tinytorch/ directory is just the compiled output. ║ +# ╚═══════════════════════════════════════════════════════════════════════════════╝ # %% auto 0 __all__ = ['Sequential', 'create_mlp', 'MLP'] @@ -13,7 +28,7 @@ import matplotlib.pyplot as plt # Import all the building blocks we need - try package first, then local modules try: from tinytorch.core.tensor import Tensor - from tinytorch.core.layers import Dense, Module + from tinytorch.core.layers import Dense from tinytorch.core.activations import ReLU, Sigmoid, Tanh, Softmax except ImportError: # For development, import from local modules @@ -22,7 +37,7 @@ except ImportError: sys.path.append(os.path.join(os.path.dirname(__file__), '..', '03_layers')) from tensor_dev import Tensor from activations_dev import ReLU, Sigmoid, Tanh, Softmax - from layers_dev import Dense, Module + from layers_dev import Dense # %% ../../modules/source/05_dense/dense_dev.ipynb 2 def _should_show_plots(): @@ -40,13 +55,12 @@ def _should_show_plots(): return not is_pytest # %% ../../modules/source/05_dense/dense_dev.ipynb 7 -class Sequential(Module): +class Sequential: """ Sequential Network: Composes layers in sequence The most fundamental network architecture. Applies layers in order: f(x) = layer_n(...layer_2(layer_1(x))) - Inherits from Module for automatic parameter collection. 
""" def __init__(self, layers: Optional[List] = None): @@ -72,11 +86,7 @@ class Sequential(Module): - Handle empty initialization case """ ### BEGIN SOLUTION - super().__init__() # Initialize Module base class self.layers = layers if layers is not None else [] - # Register all layers as sub-modules for parameter collection - for i, layer in enumerate(self.layers): - setattr(self, f'layer_{i}', layer) ### END SOLUTION def forward(self, x: Tensor) -> Tensor: @@ -124,8 +134,6 @@ class Sequential(Module): def add(self, layer): """Add a layer to the network.""" self.layers.append(layer) - # Register the new layer for parameter collection - setattr(self, f'layer_{len(self.layers)-1}', layer) # %% ../../modules/source/05_dense/dense_dev.ipynb 11 def create_mlp(input_size: int, hidden_sizes: List[int], output_size: int, diff --git a/tinytorch/core/optimizers.py b/tinytorch/core/optimizers.py index 6b2def15..9e269116 100644 --- a/tinytorch/core/optimizers.py +++ b/tinytorch/core/optimizers.py @@ -1,4 +1,19 @@ -# AUTOGENERATED! DO NOT EDIT! File to edit: ../../modules/08_optimizers/optimizers_dev.ipynb. +# ╔═══════════════════════════════════════════════════════════════════════════════╗ +# ║ 🚨 CRITICAL WARNING 🚨 ║ +# ║ AUTOGENERATED! DO NOT EDIT! ║ +# ║ ║ +# ║ This file is AUTOMATICALLY GENERATED from source modules. ║ +# ║ ANY CHANGES MADE HERE WILL BE LOST when modules are re-exported! ║ +# ║ ║ +# ║ ✅ TO EDIT: modules/source/10_optimizers/optimizers_dev.py ║ +# ║ ✅ TO EXPORT: Run 'tito module complete ' ║ +# ║ ║ +# ║ 🛡️ STUDENT PROTECTION: This file contains optimized implementations. ║ +# ║ Editing it directly may break module functionality and training. ║ +# ║ ║ +# ║ 🎓 LEARNING TIP: Work in modules/source/ - that's where real development ║ +# ║ happens! The tinytorch/ directory is just the compiled output. 
║ +# ╚═══════════════════════════════════════════════════════════════════════════════╝ # %% auto 0 __all__ = ['setup_import_paths', 'gradient_descent_step', 'SGD', 'Adam', 'StepLR', 'OptimizerConvergenceProfiler', @@ -11,28 +26,6 @@ import os from typing import List, Dict, Any, Optional, Union from collections import defaultdict -def _safe_extract_grad_data(param): - """ - Safely extract gradient data handling both Variable and memoryview cases. - - Args: - param: Parameter with grad attribute - - Returns: - numpy array: Gradient data as numpy array - """ - if param.grad is None: - return None - - # Extract the gradient data step by step - grad_data = param.grad.data.data if hasattr(param.grad.data, 'data') else param.grad.data - - # Convert memoryview to numpy array if needed - if isinstance(grad_data, memoryview): - return np.array(grad_data) - - return grad_data - # Helper function to set up import paths def setup_import_paths(): """Set up import paths for development modules.""" @@ -129,9 +122,9 @@ def gradient_descent_step(parameter: Variable, learning_rate: float) -> None: """ ### BEGIN SOLUTION if parameter.grad is not None: - # Get current parameter value and gradient - handle memoryview - current_value = parameter.data.data if hasattr(parameter.data, 'data') else parameter.data - gradient_value = _safe_extract_grad_data(parameter) + # Get current parameter value and gradient + current_value = parameter.data.data + gradient_value = parameter.grad.data.data # Update parameter: new_value = old_value - learning_rate * gradient new_value = current_value - learning_rate * gradient_value @@ -228,27 +221,16 @@ class SGD: # In modern PyTorch style, grad.data gives us the numpy array gradient = param.grad.data - # Ensure gradient is numpy array (fix for memoryview issue) - if hasattr(gradient, 'data'): - gradient_data = gradient.data - # Check if the inner data is memoryview and convert - if isinstance(gradient_data, memoryview): - gradient_data = 
np.array(gradient_data) - elif isinstance(gradient, memoryview): - gradient_data = np.array(gradient) - else: - gradient_data = np.array(gradient) - if self.momentum > 0: - # Apply momentum (simplified) using numpy arrays + # Apply momentum (simplified) if i in self.velocity: - self.velocity[i] = self.momentum * self.velocity[i] + gradient_data + self.velocity[i] = self.momentum * self.velocity[i] + gradient else: - self.velocity[i] = gradient_data + self.velocity[i] = gradient update = self.velocity[i] else: # Simple gradient descent (no momentum) - update = gradient_data + update = gradient # Clean parameter update - PyTorch style # NOTE: In production PyTorch, this is an in-place operation (param.data.sub_()) @@ -386,22 +368,11 @@ class Adam: # Get gradient data - clean PyTorch style gradient = param.grad.data - # Ensure gradient is numpy array (fix for memoryview issue) - if hasattr(gradient, 'data'): - gradient_data = gradient.data - # Check if the inner data is memoryview and convert - if isinstance(gradient_data, memoryview): - gradient_data = np.array(gradient_data) - elif isinstance(gradient, memoryview): - gradient_data = np.array(gradient) - else: - gradient_data = np.array(gradient) + # Update first moment (momentum) + self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * gradient - # Update first moment (momentum) - use numpy arrays - self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * gradient_data - - # Update second moment (squared gradients) - use numpy arrays - self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * gradient_data * gradient_data + # Update second moment (squared gradients) + self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * gradient * gradient # Bias correction m_corrected = self.m[i] / (1 - self.beta1 ** self.t) @@ -662,8 +633,7 @@ class OptimizerConvergenceProfiler: param_count = 0 for param in optimizer.parameters: if param.grad is not None: - # Safely extract gradient data - handle memoryview - grad_data = 
_safe_extract_grad_data(param) + grad_data = param.grad.data.data if hasattr(grad_data, 'flatten'): grad_norm = np.linalg.norm(grad_data.flatten()) else: diff --git a/tinytorch/core/setup.py b/tinytorch/core/setup.py new file mode 100644 index 00000000..52f94deb --- /dev/null +++ b/tinytorch/core/setup.py @@ -0,0 +1,166 @@ +# ╔═══════════════════════════════════════════════════════════════════════════════╗ +# ║ 🚨 CRITICAL WARNING 🚨 ║ +# ║ AUTOGENERATED! DO NOT EDIT! ║ +# ║ ║ +# ║ This file is AUTOMATICALLY GENERATED from source modules. ║ +# ║ ANY CHANGES MADE HERE WILL BE LOST when modules are re-exported! ║ +# ║ ║ +# ║ ✅ TO EDIT: modules/source/XX_setup/setup_dev.py ║ +# ║ ✅ TO EXPORT: Run 'tito module complete ' ║ +# ║ ║ +# ║ 🛡️ STUDENT PROTECTION: This file contains optimized implementations. ║ +# ║ Editing it directly may break module functionality and training. ║ +# ║ ║ +# ║ 🎓 LEARNING TIP: Work in modules/source/ - that's where real development ║ +# ║ happens! The tinytorch/ directory is just the compiled output. ║ +# ╚═══════════════════════════════════════════════════════════════════════════════╝ + +# %% auto 0 +__all__ = ['personal_info', 'system_info'] + +# %% ../../modules/source/01_setup/setup_dev.ipynb 1 +import sys +import platform +import psutil +from typing import Dict, Any + +# %% ../../modules/source/01_setup/setup_dev.ipynb 7 +def personal_info() -> Dict[str, str]: + """ + Return personal information for this TinyTorch installation. + + This function configures your personal TinyTorch installation with your identity. + It's the foundation of proper ML engineering practices - every system needs + to know who built it and how to contact them. + + TODO: Implement personal information configuration. + + STEP-BY-STEP IMPLEMENTATION: + 1. Create a dictionary with your personal details + 2. Include all required keys: developer, email, institution, system_name, version + 3. Use your actual information (not placeholder text) + 4. 
Make system_name unique and descriptive + 5. Keep version as '1.0.0' for now + + EXAMPLE USAGE: + ```python + # Get your personal configuration + info = personal_info() + print(info['developer']) # Expected: "Your Name" (not placeholder) + print(info['email']) # Expected: "you@domain.com" (valid email) + print(info['system_name']) # Expected: "YourName-Dev" (unique identifier) + print(info) # Expected: Complete dict with 5 fields + # Output: { + # 'developer': 'Your Name', + # 'email': 'you@domain.com', + # 'institution': 'Your Institution', + # 'system_name': 'YourName-TinyTorch-Dev', + # 'version': '1.0.0' + # } + ``` + + IMPLEMENTATION HINTS: + - Replace the example with your real information + - Use a descriptive system_name (e.g., 'YourName-TinyTorch-Dev') + - Keep email format valid (contains @ and domain) + - Make sure all values are strings + - Consider how this info will be used in debugging and collaboration + + LEARNING CONNECTIONS: + - This is like the 'author' field in Git commits + - Similar to maintainer info in Docker images + - Parallels author info in Python packages + - Foundation for professional ML development + """ + ### BEGIN SOLUTION + return { + 'developer': 'Student Name', + 'email': 'student@university.edu', + 'institution': 'University Name', + 'system_name': 'StudentName-TinyTorch-Dev', + 'version': '1.0.0' + } + ### END SOLUTION + +# %% ../../modules/source/01_setup/setup_dev.ipynb 12 +def system_info() -> Dict[str, Any]: + """ + Query and return system information for this TinyTorch installation. + + This function gathers crucial hardware and software information that affects + ML performance, compatibility, and debugging. It's the foundation of + hardware-aware ML systems. + + TODO: Implement system information queries. + + STEP-BY-STEP IMPLEMENTATION: + 1. Get Python version using sys.version_info + 2. Get platform using platform.system() + 3. Get architecture using platform.machine() + 4. Get CPU count using psutil.cpu_count() + 5. 
Get memory using psutil.virtual_memory().total + 6. Convert memory from bytes to GB (divide by 1024^3) + 7. Return all information in a dictionary + + EXAMPLE USAGE: + ```python + # Query system information + sys_info = system_info() + print(f"Python: {sys_info['python_version']}") # Expected: "3.x.x" + print(f"Platform: {sys_info['platform']}") # Expected: "Darwin"/"Linux"/"Windows" + print(f"CPUs: {sys_info['cpu_count']}") # Expected: 4, 8, 16, etc. + print(f"Memory: {sys_info['memory_gb']} GB") # Expected: 8.0, 16.0, 32.0, etc. + + # Full output example: + print(sys_info) + # Expected: { + # 'python_version': '3.9.7', + # 'platform': 'Darwin', + # 'architecture': 'arm64', + # 'cpu_count': 8, + # 'memory_gb': 16.0 + # } + ``` + + IMPLEMENTATION HINTS: + - Use f-string formatting for Python version: f"{major}.{minor}.{micro}" + - Memory conversion: bytes / (1024^3) = GB + - Round memory to 1 decimal place for readability + - Make sure data types are correct (strings for text, int for cpu_count, float for memory_gb) + + LEARNING CONNECTIONS: + - This is like `torch.cuda.is_available()` in PyTorch + - Similar to system info in MLflow experiment tracking + - Parallels hardware detection in TensorFlow + - Foundation for performance optimization in ML systems + + PERFORMANCE IMPLICATIONS: + - cpu_count affects parallel processing capabilities + - memory_gb determines maximum model and batch sizes + - platform affects file system and process management + - architecture influences numerical precision and optimization + """ + ### BEGIN SOLUTION + # Get Python version + version_info = sys.version_info + python_version = f"{version_info.major}.{version_info.minor}.{version_info.micro}" + + # Get platform information + platform_name = platform.system() + architecture = platform.machine() + + # Get CPU information + cpu_count = psutil.cpu_count() + + # Get memory information (convert bytes to GB) + memory_bytes = psutil.virtual_memory().total + memory_gb = round(memory_bytes 
/ (1024**3), 1) + + return { + 'python_version': python_version, + 'platform': platform_name, + 'architecture': architecture, + 'cpu_count': cpu_count, + 'memory_gb': memory_gb + } + ### END SOLUTION diff --git a/tinytorch/core/spatial.py b/tinytorch/core/spatial.py new file mode 100644 index 00000000..ae91db3f --- /dev/null +++ b/tinytorch/core/spatial.py @@ -0,0 +1,611 @@ +# ╔═══════════════════════════════════════════════════════════════════════════════╗ +# ║ 🚨 CRITICAL WARNING 🚨 ║ +# ║ AUTOGENERATED! DO NOT EDIT! ║ +# ║ ║ +# ║ This file is AUTOMATICALLY GENERATED from source modules. ║ +# ║ ANY CHANGES MADE HERE WILL BE LOST when modules are re-exported! ║ +# ║ ║ +# ║ ✅ TO EDIT: modules/source/06_spatial/spatial_dev.py ║ +# ║ ✅ TO EXPORT: Run 'tito module complete ' ║ +# ║ ║ +# ║ 🛡️ STUDENT PROTECTION: This file contains optimized implementations. ║ +# ║ Editing it directly may break module functionality and training. ║ +# ║ ║ +# ║ 🎓 LEARNING TIP: Work in modules/source/ - that's where real development ║ +# ║ happens! The tinytorch/ directory is just the compiled output. 
║ +# ╚═══════════════════════════════════════════════════════════════════════════════╝ +# %% auto 0 +__all__ = ['Conv2d', 'MaxPool2d', 'AvgPool2d', 'SimpleCNN'] + +# %% ../../modules/source/09_spatial/spatial_dev.ipynb 1 +import numpy as np +import sys +import os +import time + +# Smart import system for development and production compatibility +if 'tinytorch' in sys.modules: + # Production: Import from installed package + from tinytorch.core.tensor import Tensor + from tinytorch.core.layers import Module +else: + # Development: Use simplified local implementations to avoid import loops + + # Simplified Tensor class for development + class Tensor: + """Simplified tensor for spatial operations development.""" + + def __init__(self, data, requires_grad=False): + self.data = np.array(data, dtype=np.float32) + self.shape = self.data.shape + self.requires_grad = requires_grad + self.grad = None + + def __repr__(self): + return f"Tensor(shape={self.shape}, data=\n{self.data})" + + def __add__(self, other): + if isinstance(other, Tensor): + return Tensor(self.data + other.data) + return Tensor(self.data + other) + + def __mul__(self, other): + if isinstance(other, Tensor): + return Tensor(self.data * other.data) + return Tensor(self.data * other) + + def sum(self): + return Tensor(np.sum(self.data)) + + def mean(self): + return Tensor(np.mean(self.data)) + + # Create a simple Module base class for inheritance + class Module: + """Simple base class for neural network modules.""" + def __init__(self): + pass + + def forward(self, x): + raise NotImplementedError("Subclasses must implement forward()") + + def parameters(self): + """Return list of parameters for this module.""" + params = [] + for attr_name in dir(self): + attr = getattr(self, attr_name) + if hasattr(attr, 'data') and hasattr(attr, 'requires_grad'): + params.append(attr) + return params + +# %% ../../modules/source/09_spatial/spatial_dev.ipynb 6 +class Conv2d(Module): + """ + 2D Convolution layer for spatial 
feature extraction. + + Implements convolution with explicit loops to demonstrate + computational complexity and memory access patterns. + + Args: + in_channels: Number of input channels + out_channels: Number of output feature maps + kernel_size: Size of convolution kernel (int or tuple) + stride: Stride of convolution (default: 1) + padding: Zero-padding added to input (default: 0) + bias: Whether to add learnable bias (default: True) + """ + + def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0, bias=True): + """ + Initialize Conv2d layer with proper weight initialization. + + TODO: Complete Conv2d initialization + + APPROACH: + 1. Store hyperparameters (channels, kernel_size, stride, padding) + 2. Initialize weights using He initialization for ReLU compatibility + 3. Initialize bias (if enabled) to zeros + 4. Use proper shapes: weight (out_channels, in_channels, kernel_h, kernel_w) + + WEIGHT INITIALIZATION: + - He init: std = sqrt(2 / (in_channels * kernel_h * kernel_w)) + - This prevents vanishing/exploding gradients with ReLU + + HINT: Convert kernel_size to tuple if it's an integer + """ + super().__init__() + + ### BEGIN SOLUTION + self.in_channels = in_channels + self.out_channels = out_channels + + # Handle kernel_size as int or tuple + if isinstance(kernel_size, int): + self.kernel_size = (kernel_size, kernel_size) + else: + self.kernel_size = kernel_size + + self.stride = stride + self.padding = padding + + # He initialization for ReLU networks + kernel_h, kernel_w = self.kernel_size + fan_in = in_channels * kernel_h * kernel_w + std = np.sqrt(2.0 / fan_in) + + # Weight shape: (out_channels, in_channels, kernel_h, kernel_w) + self.weight = Tensor(np.random.normal(0, std, + (out_channels, in_channels, kernel_h, kernel_w))) + + # Bias initialization + if bias: + self.bias = Tensor(np.zeros(out_channels)) + else: + self.bias = None + ### END SOLUTION + + def forward(self, x): + """ + Forward pass through Conv2d layer. 
+ + TODO: Implement convolution with explicit loops + + APPROACH: + 1. Extract input dimensions and validate + 2. Calculate output dimensions + 3. Apply padding if needed + 4. Implement 7 nested loops for full convolution + 5. Add bias if present + + LOOP STRUCTURE: + for batch in range(batch_size): + for out_ch in range(out_channels): + for out_h in range(out_height): + for out_w in range(out_width): + for k_h in range(kernel_height): + for k_w in range(kernel_width): + for in_ch in range(in_channels): + # Accumulate: out += input * weight + + EXAMPLE: + >>> conv = Conv2d(3, 16, kernel_size=3, padding=1) + >>> x = Tensor(np.random.randn(2, 3, 32, 32)) # batch=2, RGB, 32x32 + >>> out = conv(x) + >>> print(out.shape) # Should be (2, 16, 32, 32) + + HINTS: + - Handle padding by creating padded input array + - Watch array bounds in inner loops + - Accumulate products for each output position + """ + ### BEGIN SOLUTION + # Input validation and shape extraction + if len(x.shape) != 4: + raise ValueError(f"Expected 4D input (batch, channels, height, width), got {x.shape}") + + batch_size, in_channels, in_height, in_width = x.shape + out_channels = self.out_channels + kernel_h, kernel_w = self.kernel_size + + # Calculate output dimensions + out_height = (in_height + 2 * self.padding - kernel_h) // self.stride + 1 + out_width = (in_width + 2 * self.padding - kernel_w) // self.stride + 1 + + # Apply padding if needed + if self.padding > 0: + padded_input = np.pad(x.data, + ((0, 0), (0, 0), (self.padding, self.padding), (self.padding, self.padding)), + mode='constant', constant_values=0) + else: + padded_input = x.data + + # Initialize output + output = np.zeros((batch_size, out_channels, out_height, out_width)) + + # Explicit 7-nested loop convolution to show complexity + for b in range(batch_size): + for out_ch in range(out_channels): + for out_h in range(out_height): + for out_w in range(out_width): + # Calculate input region for this output position + in_h_start = out_h
* self.stride + in_w_start = out_w * self.stride + + # Accumulate convolution result + conv_sum = 0.0 + for k_h in range(kernel_h): + for k_w in range(kernel_w): + for in_ch in range(in_channels): + # Get input and weight values + input_val = padded_input[b, in_ch, + in_h_start + k_h, + in_w_start + k_w] + weight_val = self.weight.data[out_ch, in_ch, k_h, k_w] + + # Accumulate + conv_sum += input_val * weight_val + + # Store result + output[b, out_ch, out_h, out_w] = conv_sum + + # Add bias if present + if self.bias is not None: + # Broadcast bias across spatial dimensions + for out_ch in range(out_channels): + output[:, out_ch, :, :] += self.bias.data[out_ch] + + return Tensor(output) + ### END SOLUTION + + def parameters(self): + """Return trainable parameters.""" + params = [self.weight] + if self.bias is not None: + params.append(self.bias) + return params + + def __call__(self, x): + """Enable model(x) syntax.""" + return self.forward(x) + +# %% ../../modules/source/09_spatial/spatial_dev.ipynb 11 +class MaxPool2d(Module): + """ + 2D Max Pooling layer for spatial dimension reduction. + + Applies maximum operation over spatial windows, preserving + the strongest activations while reducing computational load. + + Args: + kernel_size: Size of pooling window (int or tuple) + stride: Stride of pooling operation (default: same as kernel_size) + padding: Zero-padding added to input (default: 0) + """ + + def __init__(self, kernel_size, stride=None, padding=0): + """ + Initialize MaxPool2d layer. + + TODO: Store pooling parameters + + APPROACH: + 1. Convert kernel_size to tuple if needed + 2. Set stride to kernel_size if not provided (non-overlapping) + 3. 
Store padding parameter + + HINT: Default stride equals kernel_size for non-overlapping windows + """ + super().__init__() + + ### BEGIN SOLUTION + # Handle kernel_size as int or tuple + if isinstance(kernel_size, int): + self.kernel_size = (kernel_size, kernel_size) + else: + self.kernel_size = kernel_size + + # Default stride equals kernel_size (non-overlapping) + if stride is None: + self.stride = self.kernel_size[0] + else: + self.stride = stride + + self.padding = padding + ### END SOLUTION + + def forward(self, x): + """ + Forward pass through MaxPool2d layer. + + TODO: Implement max pooling with explicit loops + + APPROACH: + 1. Extract input dimensions + 2. Calculate output dimensions + 3. Apply padding if needed + 4. Implement nested loops for pooling windows + 5. Find maximum value in each window + + LOOP STRUCTURE: + for batch in range(batch_size): + for channel in range(channels): + for out_h in range(out_height): + for out_w in range(out_width): + # Find max in window [in_h:in_h+k_h, in_w:in_w+k_w] + max_val = -infinity + for k_h in range(kernel_height): + for k_w in range(kernel_width): + max_val = max(max_val, input[...]) + + EXAMPLE: + >>> pool = MaxPool2d(kernel_size=2, stride=2) + >>> x = Tensor(np.random.randn(1, 3, 8, 8)) + >>> out = pool(x) + >>> print(out.shape) # Should be (1, 3, 4, 4) + + HINTS: + - Initialize max_val to negative infinity + - Handle stride correctly when accessing input + - No parameters to update (pooling has no weights) + """ + ### BEGIN SOLUTION + # Input validation and shape extraction + if len(x.shape) != 4: + raise ValueError(f"Expected 4D input (batch, channels, height, width), got {x.shape}") + + batch_size, channels, in_height, in_width = x.shape + kernel_h, kernel_w = self.kernel_size + + # Calculate output dimensions + out_height = (in_height + 2 * self.padding - kernel_h) // self.stride + 1 + out_width = (in_width + 2 * self.padding - kernel_w) // self.stride + 1 + + # Apply padding if needed + if self.padding > 
0: + padded_input = np.pad(x.data, + ((0, 0), (0, 0), (self.padding, self.padding), (self.padding, self.padding)), + mode='constant', constant_values=-np.inf) + else: + padded_input = x.data + + # Initialize output + output = np.zeros((batch_size, channels, out_height, out_width)) + + # Explicit nested loop max pooling + for b in range(batch_size): + for c in range(channels): + for out_h in range(out_height): + for out_w in range(out_width): + # Calculate input region for this output position + in_h_start = out_h * self.stride + in_w_start = out_w * self.stride + + # Find maximum in window + max_val = -np.inf + for k_h in range(kernel_h): + for k_w in range(kernel_w): + input_val = padded_input[b, c, + in_h_start + k_h, + in_w_start + k_w] + max_val = max(max_val, input_val) + + # Store result + output[b, c, out_h, out_w] = max_val + + return Tensor(output) + ### END SOLUTION + + def parameters(self): + """Return empty list (pooling has no parameters).""" + return [] + + def __call__(self, x): + """Enable model(x) syntax.""" + return self.forward(x) + +# %% ../../modules/source/09_spatial/spatial_dev.ipynb 13 +class AvgPool2d(Module): + """ + 2D Average Pooling layer for spatial dimension reduction. + + Applies average operation over spatial windows, smoothing + features while reducing computational load. + + Args: + kernel_size: Size of pooling window (int or tuple) + stride: Stride of pooling operation (default: same as kernel_size) + padding: Zero-padding added to input (default: 0) + """ + + def __init__(self, kernel_size, stride=None, padding=0): + """ + Initialize AvgPool2d layer. + + TODO: Store pooling parameters (same as MaxPool2d) + + APPROACH: + 1. Convert kernel_size to tuple if needed + 2. Set stride to kernel_size if not provided + 3. 
Store padding parameter + """ + super().__init__() + + ### BEGIN SOLUTION + # Handle kernel_size as int or tuple + if isinstance(kernel_size, int): + self.kernel_size = (kernel_size, kernel_size) + else: + self.kernel_size = kernel_size + + # Default stride equals kernel_size (non-overlapping) + if stride is None: + self.stride = self.kernel_size[0] + else: + self.stride = stride + + self.padding = padding + ### END SOLUTION + + def forward(self, x): + """ + Forward pass through AvgPool2d layer. + + TODO: Implement average pooling with explicit loops + + APPROACH: + 1. Similar structure to MaxPool2d + 2. Instead of max, compute average of window + 3. Divide sum by window area for true average + + LOOP STRUCTURE: + for batch in range(batch_size): + for channel in range(channels): + for out_h in range(out_height): + for out_w in range(out_width): + # Compute average in window + window_sum = 0 + for k_h in range(kernel_height): + for k_w in range(kernel_width): + window_sum += input[...] + avg_val = window_sum / (kernel_height * kernel_width) + + HINT: Remember to divide by window area to get true average + """ + ### BEGIN SOLUTION + # Input validation and shape extraction + if len(x.shape) != 4: + raise ValueError(f"Expected 4D input (batch, channels, height, width), got {x.shape}") + + batch_size, channels, in_height, in_width = x.shape + kernel_h, kernel_w = self.kernel_size + + # Calculate output dimensions + out_height = (in_height + 2 * self.padding - kernel_h) // self.stride + 1 + out_width = (in_width + 2 * self.padding - kernel_w) // self.stride + 1 + + # Apply padding if needed + if self.padding > 0: + padded_input = np.pad(x.data, + ((0, 0), (0, 0), (self.padding, self.padding), (self.padding, self.padding)), + mode='constant', constant_values=0) + else: + padded_input = x.data + + # Initialize output + output = np.zeros((batch_size, channels, out_height, out_width)) + + # Explicit nested loop average pooling + for b in range(batch_size): + for c in 
range(channels): + for out_h in range(out_height): + for out_w in range(out_width): + # Calculate input region for this output position + in_h_start = out_h * self.stride + in_w_start = out_w * self.stride + + # Compute sum in window + window_sum = 0.0 + for k_h in range(kernel_h): + for k_w in range(kernel_w): + input_val = padded_input[b, c, + in_h_start + k_h, + in_w_start + k_w] + window_sum += input_val + + # Compute average + avg_val = window_sum / (kernel_h * kernel_w) + + # Store result + output[b, c, out_h, out_w] = avg_val + + return Tensor(output) + ### END SOLUTION + + def parameters(self): + """Return empty list (pooling has no parameters).""" + return [] + + def __call__(self, x): + """Enable model(x) syntax.""" + return self.forward(x) + +# %% ../../modules/source/09_spatial/spatial_dev.ipynb 21 +class SimpleCNN(Module): + """ + Simple CNN demonstrating spatial operations integration. + + Architecture: + - Conv2d(3→16, 3×3) + ReLU + MaxPool(2×2) + - Conv2d(16→32, 3×3) + ReLU + MaxPool(2×2) + - Flatten + Linear(features→num_classes) + """ + + def __init__(self, num_classes=10): + """ + Initialize SimpleCNN. + + TODO: Build CNN architecture with spatial and dense layers + + APPROACH: + 1. Conv layer 1: 3 → 16 channels, 3×3 kernel, padding=1 + 2. Pool layer 1: 2×2 max pooling + 3. Conv layer 2: 16 → 32 channels, 3×3 kernel, padding=1 + 4. Pool layer 2: 2×2 max pooling + 5. 
Calculate flattened size and add final linear layer + + HINT: Padding=1 convs preserve spatial size; each 2×2 pool halves it: + 32×32 → Pool1: 16×16 → Pool2: 8×8 + Final feature size: 32 channels × 8×8 = 2048 features + """ + super().__init__() + + ### BEGIN SOLUTION + # Convolutional layers + self.conv1 = Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1) + self.pool1 = MaxPool2d(kernel_size=2, stride=2) + + self.conv2 = Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1) + self.pool2 = MaxPool2d(kernel_size=2, stride=2) + + # Calculate flattened size + # Padded convs keep 32×32 → Pool1: 16×16 → Pool2: 8×8 + # Final: 32 channels × 8×8 = 2048 features + self.flattened_size = 32 * 8 * 8 # Will be used when we add a Linear layer + + # The final classification Linear layer will be added once the + # layers module is available; store num_classes for that step. + self.num_classes = num_classes + ### END SOLUTION + + def forward(self, x): + """ + Forward pass through SimpleCNN. + + TODO: Implement CNN forward pass + + APPROACH: + 1. Apply conv1 → ReLU → pool1 + 2. Apply conv2 → ReLU → pool2 + 3. Flatten spatial dimensions + 4. Apply final linear layer (when available) + + For now, return features before final linear layer + since we haven't imported Linear from layers module yet.
+ """ + ### BEGIN SOLUTION + # First conv block + x = self.conv1(x) + x = self.relu(x) # ReLU activation + x = self.pool1(x) + + # Second conv block + x = self.conv2(x) + x = self.relu(x) # ReLU activation + x = self.pool2(x) + + # Flatten for classification (reshape to 2D) + batch_size = x.shape[0] + x_flat = x.data.reshape(batch_size, -1) + + # Return flattened features + # In a complete implementation, this would go through a Linear layer + return Tensor(x_flat) + ### END SOLUTION + + def relu(self, x): + """Simple ReLU implementation for CNN.""" + return Tensor(np.maximum(0, x.data)) + + def parameters(self): + """Return all trainable parameters.""" + params = [] + params.extend(self.conv1.parameters()) + params.extend(self.conv2.parameters()) + # Linear layer parameters would be added here + return params + + def __call__(self, x): + """Enable model(x) syntax.""" + return self.forward(x) diff --git a/tinytorch/core/spatial_dev.py b/tinytorch/core/spatial_dev.py deleted file mode 100644 index 26ef0cfe..00000000 --- a/tinytorch/core/spatial_dev.py +++ /dev/null @@ -1,2264 +0,0 @@ -# --- -# jupyter: -# jupytext: -# text_representation: -# extension: .py -# format_name: percent -# format_version: '1.3' -# jupytext_version: 1.17.1 -# --- - -# %% [markdown] -""" -# Spatial - Convolutional Networks and Spatial Pattern Recognition - -Welcome to the Spatial module! You'll implement convolutional operations that enable neural networks to understand spatial relationships in images and other grid-structured data. 
- -## Learning Goals -- Systems understanding: How convolution operations achieve spatial pattern recognition through parameter sharing and translation invariance -- Core implementation skill: Build Conv2D layers using explicit sliding window operations to understand the computational mechanics -- Pattern recognition: Understand how convolutional layers detect hierarchical features from edges to complex objects -- Framework connection: See how your implementation reveals the design decisions in PyTorch's nn.Conv2d optimizations -- Performance insight: Learn why convolution is computationally expensive but highly parallelizable, driving modern GPU architecture - -## Build → Use → Reflect -1. **Build**: Conv2D layer with sliding window convolution, understanding every memory access and computation -2. **Use**: Transform real image data and visualize how feature maps capture spatial patterns -3. **Reflect**: Why does convolution enable parameter sharing, and how does this affect model capacity vs efficiency? 
- -## What You'll Achieve -By the end of this module, you'll understand: -- Deep technical understanding of how sliding window operations enable spatial pattern detection -- Practical capability to implement convolutional layers that form the backbone of computer vision systems -- Systems insight into why convolution is the dominant operation for spatial data and how it affects memory access patterns -- Performance consideration of how kernel size, stride, and padding choices affect computational cost and memory usage -- Connection to production ML systems and how frameworks optimize convolution for different hardware architectures - -## Systems Reality Check -💡 **Production Context**: PyTorch's Conv2d uses highly optimized implementations like cuDNN that can be 100x faster than naive implementations through algorithm choice and memory layout optimization -⚡ **Performance Note**: Convolution is O(H×W×C×K²) per output pixel - modern CNNs perform billions of these operations, making optimization critical for real-time applications -""" - -# %% nbgrader={"grade": false, "grade_id": "cnn-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} -#| default_exp core.spatial - -#| export -import numpy as np -import os -import sys -from typing import Tuple, Optional - -# Import from the main package - try package first, then local modules -try: - from tinytorch.core.tensor import Tensor, Parameter - from tinytorch.core.layers import Linear, Module - from tinytorch.core.activations import ReLU - Dense = Linear # Alias for consistency -except ImportError: - # For development, import from local modules - sys.path.append(os.path.join(os.path.dirname(__file__), '..', '02_tensor')) - sys.path.append(os.path.join(os.path.dirname(__file__), '..', '03_activations')) - sys.path.append(os.path.join(os.path.dirname(__file__), '..', '04_layers')) - from tensor_dev import Tensor, Parameter - from activations_dev import ReLU - from layers_dev import Linear, 
Module - Dense = Linear # Alias for consistency - -# %% nbgrader={"grade": false, "grade_id": "cnn-welcome", "locked": false, "schema_version": 3, "solution": false, "task": false} -print("🔥 TinyTorch CNN Module") -print(f"NumPy version: {np.__version__}") -print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}") -print("Ready to build convolutional neural networks!") - -# %% [markdown] -""" -## 📦 Where This Code Lives in the Final Package - -**Learning Side:** You work in `modules/source/05_cnn/cnn_dev.py` -**Building Side:** Code exports to `tinytorch.core.cnn` - -```python -# Final package structure: -from tinytorch.core.cnn import Conv2D, conv2d_naive, flatten # CNN operations! -from tinytorch.core.layers import Dense # Fully connected layers -from tinytorch.core.activations import ReLU # Nonlinearity -from tinytorch.core.tensor import Tensor # Foundation -``` - -**Why this matters:** -- **Learning:** Focused modules for deep understanding of convolution -- **Production:** Proper organization like PyTorch's `torch.nn.Conv2d` -- **Consistency:** All CNN operations live together in `core.cnn` -- **Integration:** Works seamlessly with other TinyTorch components -""" - -# %% [markdown] -""" -## Spatial Helper Functions - -Before diving into convolution, let's add some essential spatial operations that we'll need for building clean CNN code. These helpers make it easy to work with multi-dimensional data. -""" - -# %% nbgrader={"grade": false, "grade_id": "spatial-helpers", "locked": false, "schema_version": 3, "solution": false, "task": false} -#| export -def flatten(x, start_dim=1): - """ - Flatten tensor starting from a given dimension. - - This is essential for transitioning from convolutional layers - (which output 4D tensors) to linear layers (which expect 2D). 
- - Args: - x: Input tensor (Tensor or any array-like) - start_dim: Dimension to start flattening from (default: 1 to preserve batch) - - Returns: - Flattened tensor preserving batch dimension - - Examples: - # Flatten CNN output for Linear layer - conv_output = Tensor(np.random.randn(32, 64, 8, 8)) # (batch, channels, height, width) - flat = flatten(conv_output) # (32, 4096) - ready for Linear layer! - - # Flatten image for MLP - images = Tensor(np.random.randn(32, 3, 28, 28)) # CIFAR-10 batch - flat = flatten(images) # (32, 2352) - ready for MLP! - """ - # Get the data (handle both Tensor and numpy arrays) - if hasattr(x, 'data'): - data = x.data - else: - data = x - - # Calculate new shape - batch_size = data.shape[0] - remaining_size = np.prod(data.shape[start_dim:]) - new_shape = (batch_size, remaining_size) - - # Reshape preserving tensor type - if hasattr(x, 'data'): - # It's a Tensor - preserve type and gradient tracking - flattened_data = data.reshape(new_shape) - result = Tensor(flattened_data) - return result - else: - # It's a numpy array - return data.reshape(new_shape) - -#| export -def max_pool2d(x, kernel_size, stride=None): - """ - Apply 2D max pooling operation. - - Max pooling reduces spatial dimensions by taking the maximum value - in each pooling window. This provides translation invariance and - reduces computational cost. 
- - Args: - x: Input tensor (batch, channels, height, width) - kernel_size: Size of pooling window (int or tuple) - stride: Stride of pooling (defaults to kernel_size) - - Returns: - Pooled tensor with reduced spatial dimensions - - Examples: - # Standard 2x2 max pooling - feature_maps = Tensor(np.random.randn(32, 64, 28, 28)) - pooled = max_pool2d(feature_maps, 2) # (32, 64, 14, 14) - - # Non-overlapping 3x3 pooling - pooled = max_pool2d(feature_maps, 3, stride=3) # (32, 64, 9, 9) - """ - # Handle kernel_size and stride - if isinstance(kernel_size, int): - kh = kw = kernel_size - else: - kh, kw = kernel_size - - if stride is None: - stride = kernel_size - if isinstance(stride, int): - sh = sw = stride - else: - sh, sw = stride - - # Get input data - if hasattr(x, 'data'): - input_data = x.data - else: - input_data = x - - batch, channels, height, width = input_data.shape - - # Calculate output dimensions - out_h = (height - kh) // sh + 1 - out_w = (width - kw) // sw + 1 - - # Initialize output - output = np.zeros((batch, channels, out_h, out_w)) - - # Apply max pooling - for b in range(batch): - for c in range(channels): - for i in range(out_h): - for j in range(out_w): - h_start = i * sh - h_end = h_start + kh - w_start = j * sw - w_end = w_start + kw - - # Take maximum in the pooling window - pool_region = input_data[b, c, h_start:h_end, w_start:w_end] - output[b, c, i, j] = np.max(pool_region) - - # Preserve tensor type if input was a tensor - if hasattr(x, 'data'): - result = Tensor(output) - return result - else: - return output - -# %% [markdown] -""" -## 🔧 DEVELOPMENT -""" - -# %% [markdown] -""" -## Step 1: Understanding Convolution - -### What is Convolution? -**Convolution** is a mathematical operation that slides a small filter (kernel) across an input, computing dot products at each position. 
- -### Why Convolution is Perfect for Images -- **Local patterns**: Images have local structure (edges, textures) -- **Translation invariance**: Same pattern can appear anywhere -- **Parameter sharing**: One filter detects the pattern everywhere -- **Spatial hierarchy**: Multiple layers build increasingly complex features - -### The Fundamental Insight -**Convolution is pattern matching!** The kernel learns to detect specific patterns: -- **Edge detectors**: Find boundaries between objects -- **Texture detectors**: Recognize surface patterns -- **Shape detectors**: Identify geometric forms -- **Feature detectors**: Combine simple patterns into complex features - -### Real-World Applications -- **Image processing**: Detect edges, blur, sharpen -- **Computer vision**: Recognize objects, faces, text -- **Medical imaging**: Detect tumors, analyze scans -- **Autonomous driving**: Identify traffic signs, pedestrians - -### Visual Intuition -``` -Input Image: Kernel: Output Feature Map: -[1, 2, 3] [1, 0] [1*1+2*0+4*0+5*(-1), 2*1+3*0+5*0+6*(-1)] -[4, 5, 6] [0, -1] [4*1+5*0+7*0+8*(-1), 5*1+6*0+8*0+9*(-1)] -[7, 8, 9] -``` - -The kernel slides across the input, computing dot products at each position. - -Let us implement this step by step! -""" - -# %% nbgrader={"grade": false, "grade_id": "conv2d-naive", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -def conv2d_naive(input: np.ndarray, kernel: np.ndarray) -> np.ndarray: - """ - Naive 2D convolution (single channel, no stride, no padding). - - Args: - input: 2D input array (H, W) - kernel: 2D filter (kH, kW) - Returns: - 2D output array (H-kH+1, W-kW+1) - - TODO: Implement the sliding window convolution using for-loops. - - STEP-BY-STEP IMPLEMENTATION: - 1. Get input dimensions: H, W = input.shape - 2. Get kernel dimensions: kH, kW = kernel.shape - 3. Calculate output dimensions: out_H = H - kH + 1, out_W = W - kW + 1 - 4. Create output array: np.zeros((out_H, out_W)) - 5. 
Use nested loops to slide the kernel: - - i loop: output rows (0 to out_H-1) - - j loop: output columns (0 to out_W-1) - - di loop: kernel rows (0 to kH-1) - - dj loop: kernel columns (0 to kW-1) - 6. For each (i,j), compute: output[i,j] += input[i+di, j+dj] * kernel[di, dj] - - LEARNING CONNECTIONS: - - **Computer Vision Foundation**: Convolution is the core operation in CNNs and image processing - - **Feature Detection**: Different kernels detect edges, textures, and patterns in images - - **Spatial Hierarchies**: Convolution preserves spatial relationships while extracting features - - **Production CNNs**: Understanding the basic operation helps optimize GPU implementations - - EXAMPLE: - Input: [[1, 2, 3], Kernel: [[1, 0], - [4, 5, 6], [0, -1]] - [7, 8, 9]] - - Output[0,0] = 1*1 + 2*0 + 4*0 + 5*(-1) = 1 - 5 = -4 - Output[0,1] = 2*1 + 3*0 + 5*0 + 6*(-1) = 2 - 6 = -4 - Output[1,0] = 4*1 + 5*0 + 7*0 + 8*(-1) = 4 - 8 = -4 - Output[1,1] = 5*1 + 6*0 + 8*0 + 9*(-1) = 5 - 9 = -4 - - HINTS: - - Start with output = np.zeros((out_H, out_W)) - - Use four nested loops: for i in range(out_H): for j in range(out_W): for di in range(kH): for dj in range(kW): - - Accumulate the sum: output[i,j] += input[i+di, j+dj] * kernel[di, dj] - """ - ### BEGIN SOLUTION - # Get input and kernel dimensions - H, W = input.shape - kH, kW = kernel.shape - - # Calculate output dimensions - out_H, out_W = H - kH + 1, W - kW + 1 - - # Initialize output array - output = np.zeros((out_H, out_W), dtype=input.dtype) - - # Sliding window convolution with four nested loops - for i in range(out_H): - for j in range(out_W): - for di in range(kH): - for dj in range(kW): - output[i, j] += input[i + di, j + dj] * kernel[di, dj] - - return output - ### END SOLUTION - -# %% [markdown] -""" -### 🧪 Unit Test: Convolution Operation - -Let us test your convolution implementation right away! This is the core operation that powers computer vision. 
- -**This is a unit test** - it tests one specific function (conv2d_naive) in isolation. -""" - -# %% nbgrader={"grade": true, "grade_id": "test-conv2d-naive-immediate", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false} -def test_unit_convolution_operation(): - """Unit test for the convolution operation implementation.""" - print("🔬 Unit Test: Convolution Operation...") - - # Test simple 3x3 input with 2x2 kernel - try: - input_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32) - kernel_array = np.array([[1, 0], [0, 1]], dtype=np.float32) # Identity-like kernel - - result = conv2d_naive(input_array, kernel_array) - expected = np.array([[6, 8], [12, 14]], dtype=np.float32) # 1+5, 2+6, 4+8, 5+9 - - print(f"Input:\n{input_array}") - print(f"Kernel:\n{kernel_array}") - print(f"Result:\n{result}") - print(f"Expected:\n{expected}") - - assert np.allclose(result, expected), f"Convolution failed: expected {expected}, got {result}" - print("✅ Simple convolution test passed") - - except Exception as e: - print(f"❌ Simple convolution test failed: {e}") - raise - - # Test edge detection kernel - try: - input_array = np.array([[1, 1, 1], [1, 1, 1], [1, 1, 1]], dtype=np.float32) - edge_kernel = np.array([[-1, -1], [-1, 3]], dtype=np.float32) # Edge detection - - result = conv2d_naive(input_array, edge_kernel) - expected = np.array([[0, 0], [0, 0]], dtype=np.float32) # Uniform region = no edges - - assert np.allclose(result, expected), f"Edge detection failed: expected {expected}, got {result}" - print("✅ Edge detection test passed") - - except Exception as e: - print(f"❌ Edge detection test failed: {e}") - raise - - # Test output shape - try: - input_5x5 = np.random.randn(5, 5).astype(np.float32) - kernel_3x3 = np.random.randn(3, 3).astype(np.float32) - - result = conv2d_naive(input_5x5, kernel_3x3) - expected_shape = (3, 3) # 5-3+1 = 3 - - assert result.shape == expected_shape, f"Output shape wrong: expected 
{expected_shape}, got {result.shape}" - print("✅ Output shape test passed") - - except Exception as e: - print(f"❌ Output shape test failed: {e}") - raise - - # Show the convolution process - print("🎯 Convolution behavior:") - print(" Slides kernel across input") - print(" Computes dot product at each position") - print(" Output size = Input size - Kernel size + 1") - print("📈 Progress: Convolution operation ✓") - -# Call the test immediately -test_unit_convolution_operation() - -# %% [markdown] -""" -## Step 2: Building the Conv2D Layer - -### What is a Conv2D Layer? -A **Conv2D layer** is a learnable convolutional layer that: -- Has learnable kernel weights (initialized randomly) -- Applies convolution to input tensors -- Integrates with the rest of the neural network - -### Why Conv2D Layers Matter -- **Feature learning**: Kernels learn to detect useful patterns -- **Composability**: Can be stacked with other layers -- **Efficiency**: Shared weights reduce parameters dramatically -- **Translation invariance**: Same patterns detected anywhere in the image - -### Real-World Applications -- **Image classification**: Recognize objects in photos -- **Object detection**: Find and locate objects -- **Medical imaging**: Detect anomalies in scans -- **Autonomous driving**: Identify road features - -### Design Decisions -- **Kernel size**: Typically 3×3 or 5×5 for balance of locality and capacity -- **Initialization**: Small random values to break symmetry -- **Integration**: Works with Tensor class and other layers -""" - -# %% nbgrader={"grade": false, "grade_id": "conv2d-class", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class Conv2D: - """ - 2D Convolutional Layer (single channel, single filter, no stride/pad). - - A learnable convolutional layer that applies a kernel to detect spatial patterns. - Perfect for building the foundation of convolutional neural networks. 
- """ - - def __init__(self, kernel_size: Tuple[int, int]): - """ - Initialize Conv2D layer with random kernel. - - Args: - kernel_size: (kH, kW) - size of the convolution kernel - - TODO: Initialize a random kernel with small values. - - APPROACH: - 1. Store kernel_size as instance variable - 2. Initialize random kernel with small values - 3. Use proper initialization for stable training - - EXAMPLE: - Conv2D((2, 2)) creates: - - kernel: shape (2, 2) with small random values - - HINTS: - - Store kernel_size as self.kernel_size - - Initialize kernel: np.random.randn(kH, kW) * 0.1 (small values) - - Convert to float32 for consistency - """ - ### BEGIN SOLUTION - # Store kernel size - self.kernel_size = kernel_size - kH, kW = kernel_size - - # Initialize random kernel with small values - self.kernel = np.random.randn(kH, kW).astype(np.float32) * 0.1 - ### END SOLUTION - - def forward(self, x): - """ - Forward pass through the Conv2D layer. - - Args: - x: Input tensor (batch_size, H, W) - Returns: - Output tensor after convolution - """ - # Handle batches by iterating through each item - if len(x.shape) == 3: - batch_size, H, W = x.shape - - # Create an empty list to store results - results = [] - # Iterate over each image in the batch - for i in range(batch_size): - # Apply naive convolution to each image - convolved = conv2d_naive(x.data[i], self.kernel) - results.append(convolved) - # Stack results into a single NumPy array - output_data = np.stack(results) - - else: # Handle single image case - output_data = conv2d_naive(x.data, self.kernel) - - # Return Tensor result - gradient support will be added in later modules - # For now, focus on learning convolution mechanics without complex autograd - return Tensor(output_data) - - def __call__(self, x): - """Make layer callable: layer(x) same as layer.forward(x)""" - return self.forward(x) - -# %% [markdown] -""" -### 🧪 Unit Test: Conv2D Layer - -Let us test your Conv2D layer implementation! 
This is a learnable convolutional layer that can be trained. - -**This is a unit test** - it tests one specific class (Conv2D) in isolation. -""" - -# %% nbgrader={"grade": true, "grade_id": "test-conv2d-layer-immediate", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false} -def test_unit_conv2d_layer(): - """Unit test for the Conv2D layer implementation.""" - print("🔬 Unit Test: Conv2D Layer...") - - # Create a Conv2D layer - try: - layer = Conv2D(kernel_size=(2, 2)) - print(f"Conv2D layer created with kernel size: {layer.kernel_size}") - print(f"Kernel shape: {layer.kernel.shape}") - - # Test that kernel is initialized properly - assert layer.kernel.shape == (2, 2), f"Kernel shape should be (2, 2), got {layer.kernel.shape}" - assert not np.allclose(layer.kernel, 0), "Kernel should not be all zeros" - print("✅ Conv2D layer initialization successful") - - # Test with sample input - x = Tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) - print(f"Input shape: {x.shape}") - - y = layer(x) - print(f"Output shape: {y.shape}") - print(f"Output: {y}") - - # Verify shapes - assert y.shape == (2, 2), f"Output shape should be (2, 2), got {y.shape}" - assert isinstance(y, Tensor), "Output should be a Tensor" - print("✅ Conv2D layer forward pass successful") - - except Exception as e: - print(f"❌ Conv2D layer test failed: {e}") - raise - - # Test different kernel sizes - try: - layer_3x3 = Conv2D(kernel_size=(3, 3)) - x_5x5 = Tensor([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15], [16, 17, 18, 19, 20], [21, 22, 23, 24, 25]]) - y_3x3 = layer_3x3(x_5x5) - - assert y_3x3.shape == (3, 3), f"3x3 kernel output should be (3, 3), got {y_3x3.shape}" - print("✅ Different kernel sizes work correctly") - - except Exception as e: - print(f"❌ Different kernel sizes test failed: {e}") - raise - - # Show the layer behavior - print("🎯 Conv2D layer behavior:") - print(" Learnable kernel weights") - print(" Applies convolution to detect patterns") - print(" Can 
be trained end-to-end") - print("📈 Progress: Convolution operation ✓, Conv2D layer ✓") - -# Call the test immediately -test_unit_conv2d_layer() - -# %% [markdown] -""" -## Step 3: Multi-Channel Conv2D - From Grayscale to RGB - -### What are Multi-Channel Convolutions? -**Multi-channel convolutions** process images with multiple channels (like RGB) and produce multiple output feature maps using multiple filters. - -### Why Multi-Channel Convolutions Matter -- **RGB Images**: Real images have 3 channels (Red, Green, Blue) -- **Feature Maps**: Each filter learns different patterns -- **Depth Processing**: Handle both input channels and output filters -- **Production Reality**: CNNs always use multi-channel convolutions - -### Mathematical Foundation -For input shape `(batch, in_channels, height, width)` and filters `(out_channels, in_channels, kernel_h, kernel_w)`: - -``` -Input: (batch, 3, 32, 32) # RGB CIFAR-10 images -Filters: (32, 3, 3, 3) # 32 filters, each 3x3x3 -Output: (batch, 32, 30, 30) # 32 feature maps, each 30x30 -``` - -Each output feature map is computed by: -1. **Channel mixing**: Each filter processes ALL input channels -2. **Spatial convolution**: Applied across height and width -3. **Summation**: Sum across input channels for each output pixel - -### Systems Insight: Parameter Scaling -- **Single channel**: 1 filter = K×K parameters -- **Multi-channel**: 1 filter = in_channels × K×K parameters -- **Multiple filters**: out_channels × in_channels × K×K total parameters -- **Memory impact**: Parameters grow linearly with channels - -Example: 32 filters of size 3×3 on RGB input = 32 × 3 × 3 × 3 = 864 parameters -""" - -# %% nbgrader={"grade": false, "grade_id": "multi-channel-conv2d", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class Conv2d(Module): - """ - 2D Convolutional Layer (PyTorch-compatible API). - - Processes inputs with multiple channels (like RGB) and outputs multiple feature maps. 
- This is the realistic convolution used in production computer vision systems. - Inherits from Module for automatic parameter registration. - """ - - def __init__(self, in_channels: int, out_channels: int, kernel_size: Tuple[int, int], bias: bool = True): - """ - Initialize multi-channel Conv2D layer. - - Args: - in_channels: Number of input channels (e.g., 3 for RGB) - out_channels: Number of output feature maps (number of filters) - kernel_size: (kH, kW) size of each filter - bias: Whether to include bias terms - - TODO: Initialize weights and bias for multi-channel convolution. - - APPROACH: - 1. Store layer parameters (in_channels, out_channels, kernel_size, bias) - 2. Initialize weight tensor: shape (out_channels, in_channels, kH, kW) - 3. Use He initialization: std = sqrt(2 / (in_channels * kH * kW)) - 4. Initialize bias if enabled: shape (out_channels,) - - LEARNING CONNECTIONS: - - **Production CNNs**: This matches PyTorch's nn.Conv2d parameter structure - - **Memory Scaling**: Parameters = out_channels × in_channels × kH × kW - - **He Initialization**: Maintains activation variance through deep networks - - **Feature Learning**: Each filter learns different patterns across all input channels - - EXAMPLE: - # For CIFAR-10 RGB images (3 channels) → 32 feature maps - conv = Conv2d(in_channels=3, out_channels=32, kernel_size=(3, 3)) - # Creates weight: shape (32, 3, 3, 3) = 864 parameters - - HINTS: - - Weight shape: (out_channels, in_channels, kernel_height, kernel_width) - - He initialization: np.random.randn(...) * np.sqrt(2.0 / (in_channels * kH * kW)) - - Bias shape: (out_channels,) initialized to zeros - """ - # Docstring must be the first statement; call Module's constructor afterwards - super().__init__() - ### BEGIN SOLUTION - self.in_channels = in_channels - self.out_channels = out_channels - self.kernel_size = kernel_size - self.use_bias = bias - - kH, kW = kernel_size - - # He initialization for weights - # Shape: (out_channels, in_channels, kernel_height, kernel_width) - fan_in = in_channels * kH * kW - std = np.sqrt(2.0 / fan_in) - self.weight = Parameter(np.random.randn(out_channels, in_channels, kH, kW).astype(np.float32) * std) - - # Initialize bias - if bias: - self.bias = Parameter(np.zeros(out_channels, dtype=np.float32)) - else: - self.bias = None - ### END SOLUTION - - def forward(self, x): - """ - Forward pass through multi-channel Conv2D layer with automatic differentiation. - - Uses the same Tensor-based approach as Linear layer for proper gradient flow. - - Args: - x: Input Tensor with shape (batch_size, in_channels, H, W) or (in_channels, H, W) - Returns: - Output Tensor with shape (batch_size, out_channels, out_H, out_W) or (out_channels, out_H, out_W) - """ - # Tensor is imported at module level, so no local import is needed here - - # Ensure input supports autograd if it's a Tensor (same as Linear layer) - input_var = x if isinstance(x, Tensor) else Tensor(x, requires_grad=False) - - # Wrap parameters as Tensors to maintain gradient connections (same as Linear) - weight_var = Tensor(self.weight, requires_grad=True) if not isinstance(self.weight, Tensor) else self.weight - bias_var = None - if self.bias is not None: - bias_var = Tensor(self.bias, requires_grad=True) if not isinstance(self.bias, Tensor) else self.bias - # Perform convolution operation using Tensor-aware computation - # This creates the automatic differentiation graph like Linear layer does -
result_var = self._conv2d_operation(input_var, weight_var, bias_var) - - return result_var - - def _conv2d_operation(self, input_var, weight_var, bias_var): - """ - Core convolution operation with automatic differentiation support. - - This function performs the convolution computation while preserving - the Tensor computational graph for automatic gradient flow. - """ - # Extract data for computation (while preserving Tensor wrapper) - input_data = input_var.data if hasattr(input_var, 'data') else input_var - weight_data = weight_var.data if hasattr(weight_var, 'data') else weight_var - - # Handle single image vs batch - if len(input_data.shape) == 3: # Single image: (in_channels, H, W) - input_data = input_data[None, ...] # Add batch dimension - single_image = True - else: - single_image = False - - batch_size, in_channels, H, W = input_data.shape - kH, kW = self.kernel_size - - # Validate input channels - assert in_channels == self.in_channels, f"Expected {self.in_channels} input channels, got {in_channels}" - - # Calculate output dimensions - out_H = H - kH + 1 - out_W = W - kW + 1 - - # Perform convolution computation - output = np.zeros((batch_size, self.out_channels, out_H, out_W), dtype=np.float32) - - for b in range(batch_size): - for out_c in range(self.out_channels): - # Get filter for this output channel - filter_weights = weight_data[out_c] # Shape: (in_channels, kH, kW) - - # Convolve across all input channels - for in_c in range(in_channels): - input_channel = input_data[b, in_c] # Shape: (H, W) - filter_channel = filter_weights[in_c] # Shape: (kH, kW) - - # Perform 2D convolution - for i in range(out_H): - for j in range(out_W): - patch = input_channel[i:i+kH, j:j+kW] - output[b, out_c, i, j] += np.sum(patch * filter_channel) - - # Add bias if enabled - if self.use_bias and bias_var is not None: - bias_data = bias_var.data if hasattr(bias_var, 'data') else bias_var - output[b, out_c] += bias_data[out_c] - - # Remove batch dimension if input was 
single image - if single_image: - output = output[0] - - # Create output Tensor with a gradient-function hook for automatic differentiation - - # Placeholder gradient function: the real backward pass for convolution is - # implemented in the autograd module (Module 06). For now it only marks the - # output as part of the computational graph. - def conv2d_grad_fn(grad_output): - pass - - # Return Tensor that maintains the computational graph - return Tensor(output, requires_grad=(input_var.requires_grad or weight_var.requires_grad), grad_fn=conv2d_grad_fn) - - def __call__(self, x): - """Make layer callable: layer(x) same as layer.forward(x)""" - return self.forward(x) - -# Backward compatibility alias -MultiChannelConv2D = Conv2d - -# %% [markdown] -""" -### 🧪 Unit Test: Multi-Channel Conv2D Layer - -Let us test your multi-channel Conv2D implementation! This handles RGB images and multiple filters like production CNNs. - -**This is a unit test** - it tests the Conv2d class in isolation.
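Before running the assertions, it helps to spell out the shape and parameter arithmetic the tests rely on. The helper names below (`conv_output_shape`, `conv_param_count`) are illustrative sketches for this module's valid (no-padding, stride-1) convolution, not part of the TinyTorch API:

```python
# Valid (no-padding, stride-1) convolution: each spatial dim shrinks by kernel-1.
def conv_output_shape(in_shape, out_channels, kernel_size):
    """(in_channels, H, W) -> (out_channels, H - kH + 1, W - kW + 1)."""
    _, H, W = in_shape
    kH, kW = kernel_size
    return (out_channels, H - kH + 1, W - kW + 1)

# Parameters = out_channels * in_channels * kH * kW weights, plus one bias per filter.
def conv_param_count(in_channels, out_channels, kernel_size, bias=True):
    kH, kW = kernel_size
    return out_channels * in_channels * kH * kW + (out_channels if bias else 0)

print(conv_output_shape((3, 8, 8), out_channels=8, kernel_size=(3, 3)))  # (8, 6, 6)
print(conv_param_count(3, 8, (3, 3)))  # 224 = 216 weights + 8 biases
```

The same arithmetic gives the 864 figure quoted earlier: 32 filters over 3 channels with 3×3 kernels is 32 × 3 × 3 × 3 weights.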
-""" - -# %% nbgrader={"grade": true, "grade_id": "test-multi-channel-conv2d-immediate", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false} -# Test multi-channel Conv2D layer immediately after implementation -print("🔬 Unit Test: Multi-Channel Conv2D Layer...") - -# Test 1: RGB to feature maps (CIFAR-10 scenario) -try: - # Create layer: 3 RGB channels → 8 feature maps - conv_rgb = Conv2d(in_channels=3, out_channels=8, kernel_size=(3, 3)) - - print(f"Multi-channel Conv2D created:") - print(f" Input channels: {conv_rgb.in_channels}") - print(f" Output channels: {conv_rgb.out_channels}") - print(f" Kernel size: {conv_rgb.kernel_size}") - print(f" Weight shape: {conv_rgb.weight.shape}") - - # Verify weight initialization - assert conv_rgb.weight.shape == (8, 3, 3, 3), f"Weight shape should be (8, 3, 3, 3), got {conv_rgb.weight.shape}" - assert not np.allclose(conv_rgb.weight.data, 0), "Weights should not be all zeros" - assert conv_rgb.bias.shape == (8,), f"Bias shape should be (8,), got {conv_rgb.bias.shape}" - print("✅ Multi-channel layer initialization successful") - - # Test with RGB image (simulated CIFAR-10 patch) - rgb_image = Tensor(np.random.randn(3, 8, 8)) # 3 channels, 8x8 image - print(f"RGB input shape: {rgb_image.shape}") - - feature_maps = conv_rgb(rgb_image) - print(f"Feature maps shape: {feature_maps.shape}") - - # Verify output shape - expected_shape = (8, 6, 6) # 8 channels, 8-3+1=6 spatial dims - assert feature_maps.shape == expected_shape, f"Output shape should be {expected_shape}, got {feature_maps.shape}" - assert isinstance(feature_maps, Tensor), "Output should be a Tensor" - print("✅ RGB convolution test passed") - -except Exception as e: - print(f"❌ RGB convolution test failed: {e}") - raise - -# Test 2: Batch processing -try: - # Test with batch of RGB images - batch_rgb = Tensor(np.random.randn(4, 3, 10, 10)) # 4 images, 3 channels, 10x10 - batch_output = conv_rgb(batch_rgb) - - expected_batch_shape = (4, 8, 
8, 8) # 4 images, 8 channels, 10-3+1=8 spatial - assert batch_output.shape == expected_batch_shape, f"Batch output shape should be {expected_batch_shape}, got {batch_output.shape}" - print("✅ Batch processing test passed") - -except Exception as e: - print(f"❌ Batch processing test failed: {e}") - raise - -# Test 3: Different channel configurations -try: - # Test 1→16 channels (grayscale to features) - conv_grayscale = Conv2d(in_channels=1, out_channels=16, kernel_size=(5, 5)) - gray_image = Tensor(np.random.randn(1, 12, 12)) # 1 channel, 12x12 - gray_features = conv_grayscale(gray_image) - - expected_gray_shape = (16, 8, 8) # 16 channels, 12-5+1=8 spatial - assert gray_features.shape == expected_gray_shape, f"Grayscale output should be {expected_gray_shape}, got {gray_features.shape}" - print("✅ Grayscale convolution test passed") - - # Test 32→64 channels (feature maps to more feature maps) - conv_deep = Conv2d(in_channels=32, out_channels=64, kernel_size=(3, 3)) - deep_features = Tensor(np.random.randn(32, 6, 6)) # 32 channels, 6x6 - deeper_features = conv_deep(deep_features) - - expected_deep_shape = (64, 4, 4) # 64 channels, 6-3+1=4 spatial - assert deeper_features.shape == expected_deep_shape, f"Deep features should be {expected_deep_shape}, got {deeper_features.shape}" - print("✅ Deep feature convolution test passed") - -except Exception as e: - print(f"❌ Different channel configurations test failed: {e}") - raise - -# Test 4: Parameter counting -try: - # Verify parameter count scaling - params_3_to_8 = conv_rgb.weight.size + (conv_rgb.bias.size if conv_rgb.use_bias else 0) - expected_params = (8 * 3 * 3 * 3) + 8 # weights + bias - assert params_3_to_8 == expected_params, f"Parameter count should be {expected_params}, got {params_3_to_8}" - - print(f"Parameter scaling verification:") - print(f" 3→8 channels, 3x3 kernel: {params_3_to_8} parameters") - print(f" Breakdown: {8*3*3*3} weights + {8} bias = {expected_params}") - print("✅ Parameter counting test 
passed") - -except Exception as e: - print(f"❌ Parameter counting test failed: {e}") - raise - -# Show multi-channel behavior -print("🎯 Multi-channel Conv2D behavior:") -print(" Processes multiple input channels (RGB, feature maps)") -print(" Produces multiple output feature maps") -print(" Each filter mixes information across ALL input channels") -print(" Parameter count = out_channels × in_channels × kernel_h × kernel_w") -print("📈 Progress: Single-channel ✓, Multi-channel ✓") - -# %% [markdown] -""" -### 🔧 Memory Analysis: Multi-Channel Parameter Scaling - -Let us analyze how memory requirements scale with channels and understand the trade-offs. -""" - -# %% nbgrader={"grade": false, "grade_id": "multi-channel-memory-analysis", "locked": false, "schema_version": 3, "solution": false, "task": false} -def analyze_conv_memory_scaling(): - """Analyze memory requirements for different channel configurations.""" - print("🔍 MULTI-CHANNEL MEMORY SCALING ANALYSIS") - print("=" * 50) - - configurations = [ - (1, 16, (3, 3)), # Grayscale → features - (3, 32, (3, 3)), # RGB → features - (32, 64, (3, 3)), # Features → more features - (64, 128, (3, 3)), # Deep features - (3, 32, (5, 5)), # RGB with larger kernel - (3, 32, (7, 7)), # RGB with very large kernel - ] - - for in_c, out_c, (kh, kw) in configurations: - # Calculate parameters - weight_params = out_c * in_c * kh * kw - bias_params = out_c - total_params = weight_params + bias_params - - # Calculate memory (assuming float32 = 4 bytes) - memory_mb = total_params * 4 / (1024 * 1024) - - # Example activation memory for 32x32 input - input_mb = (in_c * 32 * 32 * 4) / (1024 * 1024) - output_mb = (out_c * (32-kh+1) * (32-kw+1) * 4) / (1024 * 1024) - - print(f" {in_c:3d}→{out_c:3d} channels, {kh}x{kw} kernel:") - print(f" Parameters: {total_params:,} ({memory_mb:.3f} MB)") - print(f" Activations: {input_mb:.3f} MB input + {output_mb:.3f} MB output") - print(f" Total memory: {memory_mb + input_mb + output_mb:.3f} MB") - - 
print("\n💡 Key Memory Insights:") - print(" • Parameters scale as: out_channels × in_channels × kernel_size²") - print(" • Larger kernels dramatically increase memory (5x5 = 2.8x vs 3x3)") - print(" • Channel depth matters more than spatial size for parameters") - print(" • Activation memory depends on spatial dimensions") - - return configurations - -# Run memory analysis -try: - analyze_conv_memory_scaling() - print("✅ Memory scaling analysis completed") -except Exception as e: - print(f"⚠️ Memory analysis had issues: {e}") - -# %% [markdown] -""" -## Step 4: MaxPool2D - Spatial Downsampling - -### What is MaxPooling? -**MaxPooling** reduces spatial dimensions by taking the maximum value in each local region, providing translation invariance and computational efficiency. - -### Why MaxPooling Matters -- **Dimensionality reduction**: Reduces feature map size without losing important information -- **Translation invariance**: Small shifts don't change the output -- **Computational efficiency**: Fewer parameters to process in subsequent layers -- **Overfitting reduction**: Acts as a form of regularization - -### Real-World Usage -- **After convolution**: Conv2D → ReLU → MaxPool2D is a common pattern -- **Progressive downsampling**: Each pool layer reduces spatial dimensions -- **Feature concentration**: Keeps most important activations -""" - -# %% nbgrader={"grade": false, "grade_id": "maxpool2d-class", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class MaxPool2D: - """ - 2D Max Pooling layer for spatial downsampling. - - Reduces spatial dimensions by taking maximum values in local windows, - providing translation invariance and computational efficiency. - """ - - def __init__(self, pool_size: Tuple[int, int] = (2, 2), stride: Optional[Tuple[int, int]] = None): - """ - Initialize MaxPool2D layer. - - Args: - pool_size: (pH, pW) size of pooling window - stride: (sH, sW) stride for pooling. 
If None, uses pool_size - - TODO: Initialize pooling parameters. - - APPROACH: - 1. Store pool_size as instance variable - 2. Set stride (default to pool_size if not provided) - 3. No learnable parameters (pooling has no weights) - - LEARNING CONNECTIONS: - - **Spatial downsampling**: Reduces feature map resolution efficiently - - **Translation invariance**: Small shifts in input don't change output - - **Computational efficiency**: Reduces data for subsequent layers - - **No parameters**: Unlike convolution, pooling has no learnable weights - - EXAMPLE: - MaxPool2D(pool_size=(2, 2)) creates: - - 2x2 pooling windows - - Stride of (2, 2) - non-overlapping windows - - No learnable parameters - - HINTS: - - Store pool_size as self.pool_size - - Set stride: self.stride = stride if stride else pool_size - """ - ### BEGIN SOLUTION - self.pool_size = pool_size - self.stride = stride if stride is not None else pool_size - ### END SOLUTION - - def forward(self, x): - """ - Forward pass through MaxPool2D layer. - - Args: - x: Input tensor with shape (..., H, W) or (..., C, H, W) - Returns: - Pooled tensor with reduced spatial dimensions - """ - # Extract the underlying numpy array properly - if hasattr(x, 'data') and hasattr(x.data, 'data'): - # x is Tensor, x.data is Tensor, x.data.data is numpy array - input_data = x.data.data - elif hasattr(x, 'data'): - # x is Tensor, x.data is numpy array - input_data = x.data - else: - # x is numpy array - input_data = x - - original_shape = input_data.shape - - # Handle different input shapes - if len(original_shape) == 2: # (H, W) - input_data = input_data[None, None, ...] # Add batch and channel dims - added_dims = 2 - elif len(original_shape) == 3: # (C, H, W) or (B, H, W) - input_data = input_data[None, ...] # Add one dimension - added_dims = 1 - else: # (B, C, H, W) or similar - added_dims = 0 - - # Now input_data has at least 4 dimensions - while len(input_data.shape) < 4: - input_data = input_data[None, ...] 
- added_dims += 1 - - batch_size, channels, H, W = input_data.shape - pH, pW = self.pool_size - sH, sW = self.stride - - # Calculate output dimensions - out_H = (H - pH) // sH + 1 - out_W = (W - pW) // sW + 1 - - # Initialize output - output = np.zeros((batch_size, channels, out_H, out_W), dtype=input_data.dtype) - - # Perform max pooling - for b in range(batch_size): - for c in range(channels): - for i in range(out_H): - for j in range(out_W): - # Define pooling window - h_start = i * sH - h_end = h_start + pH - w_start = j * sW - w_end = w_start + pW - - # Extract window and take maximum - window = input_data[b, c, h_start:h_end, w_start:w_end] - output[b, c, i, j] = np.max(window) - - # Remove added dimensions to match input shape structure - for _ in range(added_dims): - output = output[0] - - # Return Tensor result - gradient support will be added in later modules - # For now, focus on learning pooling mechanics without complex autograd - return Tensor(output) - - def __call__(self, x): - """Make layer callable: layer(x) same as layer.forward(x)""" - return self.forward(x) - -# %% [markdown] -""" -### 🧪 Unit Test: MaxPool2D Layer - -Let us test your MaxPool2D implementation! This provides spatial downsampling for efficient computation. - -**This is a unit test** - it tests the MaxPool2D class in isolation. 
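All the shape assertions below follow one formula: a window of size (pH, pW) with stride (sH, sW) yields `out_H = (H - pH) // sH + 1` (and likewise for width). A small sketch of that arithmetic (the helper name is made up for this example, not part of the layer):

```python
# Pooled output size: out = (size - window) // stride + 1 per spatial dimension.
def pool_output_shape(H, W, pool_size, stride=None):
    pH, pW = pool_size
    sH, sW = stride if stride is not None else pool_size  # default: non-overlapping
    return ((H - pH) // sH + 1, (W - pW) // sW + 1)

print(pool_output_shape(4, 4, (2, 2)))                  # (2, 2): non-overlapping 2x2 windows
print(pool_output_shape(6, 6, (3, 3)))                  # (2, 2)
print(pool_output_shape(5, 5, (2, 2), stride=(1, 1)))   # (4, 4): overlapping windows
```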
-""" - -# %% nbgrader={"grade": true, "grade_id": "test-maxpool2d-immediate", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false} -# Test MaxPool2D layer immediately after implementation -print("🔬 Unit Test: MaxPool2D Layer...") - -# Test 1: Basic 2x2 pooling -try: - pool = MaxPool2D(pool_size=(2, 2)) - - # Test with simple 4x4 input - test_input = Tensor([[1, 2, 3, 4], - [5, 6, 7, 8], - [9, 10, 11, 12], - [13, 14, 15, 16]]) - - print(f"Input shape: {test_input.shape}") - print(f"Input:\n{test_input.data}") - - pooled = pool(test_input) - print(f"Pooled shape: {pooled.shape}") - print(f"Pooled:\n{pooled.data}") - - # Verify shape - expected_shape = (2, 2) # 4x4 → 2x2 with 2x2 pooling - assert pooled.shape == expected_shape, f"Pooled shape should be {expected_shape}, got {pooled.shape}" - - # Verify values (each 2x2 window's maximum) - expected_values = np.array([[6, 8], [14, 16]]) # Max of each 2x2 window - assert np.array_equal(pooled.data, expected_values), f"Expected {expected_values}, got {pooled.data}" - - print("✅ Basic 2x2 pooling test passed") - -except Exception as e: - print(f"❌ Basic pooling test failed: {e}") - raise - -# Test 2: Multi-channel pooling -try: - # Test with multi-channel input (like after convolution) - multi_channel_input = Tensor([[[1, 2, 3, 4], # Channel 0 - [5, 6, 7, 8], - [9, 10, 11, 12], - [13, 14, 15, 16]], - [[16, 15, 14, 13], # Channel 1 - [12, 11, 10, 9], - [8, 7, 6, 5], - [4, 3, 2, 1]]]) - - pooled_multi = pool(multi_channel_input) - print(f"Multi-channel input shape: {multi_channel_input.shape}") - print(f"Multi-channel pooled shape: {pooled_multi.shape}") - - expected_multi_shape = (2, 2, 2) # 2 channels, 2x2 spatial - assert pooled_multi.shape == expected_multi_shape, f"Multi-channel shape should be {expected_multi_shape}, got {pooled_multi.shape}" - - print("✅ Multi-channel pooling test passed") - -except Exception as e: - print(f"❌ Multi-channel pooling test failed: {e}") - raise - -# Test 
3: Different pool sizes -try: - # Test 3x3 pooling - pool_3x3 = MaxPool2D(pool_size=(3, 3)) - input_6x6 = Tensor(np.arange(36).reshape(6, 6)) # 6x6 input - - pooled_3x3 = pool_3x3(input_6x6) - expected_3x3_shape = (2, 2) # 6x6 → 2x2 with 3x3 pooling, stride 3 - assert pooled_3x3.shape == expected_3x3_shape, f"3x3 pooling shape should be {expected_3x3_shape}, got {pooled_3x3.shape}" - - print("✅ Different pool sizes test passed") - -except Exception as e: - print(f"❌ Different pool sizes test failed: {e}") - raise - -# Test 4: Integration with convolution -try: - # Test Conv2D → MaxPool2D pipeline - conv = Conv2d(in_channels=1, out_channels=4, kernel_size=(3, 3)) - pool_after_conv = MaxPool2D(pool_size=(2, 2)) - - # Input image - input_image = Tensor(np.random.randn(1, 8, 8)) # 1 channel, 8x8 - - # Forward pass: Conv → Pool - conv_output = conv(input_image) # (1,8,8) → (4,6,6) - pool_output = pool_after_conv(conv_output) # (4,6,6) → (4,3,3) - - assert conv_output.shape == (4, 6, 6), f"Conv output should be (4,6,6), got {conv_output.shape}" - assert pool_output.shape == (4, 3, 3), f"Pool output should be (4,3,3), got {pool_output.shape}" - - print("✅ Conv → Pool integration test passed") - -except Exception as e: - print(f"❌ Conv → Pool integration test failed: {e}") - raise - -# Show pooling behavior -print("🎯 MaxPool2D behavior:") -print(" Reduces spatial dimensions by taking maximum in each window") -print(" Provides translation invariance") -print(" No learnable parameters") -print(" Common pattern: Conv2D → ReLU → MaxPool2D") -print("📈 Progress: Single-channel ✓, Multi-channel ✓, Pooling ✓") - -# %% [markdown] -""" -## Step 5: Flattening for Dense Layers - -### What is Flattening? -**Flattening** converts multi-dimensional tensors to 1D vectors, enabling connection between convolutional and dense layers. 
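The shape bookkeeping can be previewed with plain NumPy before implementing `flatten` (this is only a sketch of the reshape pattern, not the implementation itself):

```python
import numpy as np

batch = np.zeros((4, 16, 3, 3))           # (B, C, H, W), e.g. after conv + pool
flat = batch.reshape(batch.shape[0], -1)  # keep batch dim, merge the rest
print(flat.shape)                         # (4, 144): 16 * 3 * 3 features per image

single = np.zeros((16, 3, 3))             # single image: (C, H, W)
print(single.reshape(1, -1).shape)        # (1, 144): batch dimension added
```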
- -### Why Flattening is Needed -- **Interface compatibility**: Conv2D outputs 2D/3D, Dense expects 1D -- **Network composition**: Connect spatial features to classification -- **Standard practice**: Almost all CNNs use this pattern -- **Dimension management**: Preserve information while changing shape - -### The Pattern -``` -Conv2D → ReLU → MaxPool2D → Flatten → Dense → Output -``` - -### Real-World Usage -- **Classification**: Final layers need 1D input for class probabilities -- **Feature extraction**: Convert spatial features to vector representations -- **Transfer learning**: Extract features from pre-trained CNNs -""" - -# %% nbgrader={"grade": false, "grade_id": "flatten-function", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -def flatten(x): - """ - Flatten spatial dimensions while preserving batch dimension. - - Args: - x: Input tensor to flatten - - Returns: - Flattened tensor with batch dimension preserved - - TODO: Implement flattening operation that handles different input shapes. - - STEP-BY-STEP IMPLEMENTATION: - 1. Determine if input has batch dimension - 2. Flatten spatial dimensions while preserving batch structure - 3. 
Return properly shaped tensor - - LEARNING CONNECTIONS: - - **CNN to MLP Transition**: Flattening connects convolutional and dense layers - - **Batch Processing**: Handles both single images and batches correctly - - **Memory Layout**: Understanding how tensors are stored and reshaped in memory - - **Framework Design**: All major frameworks (PyTorch, TensorFlow) use similar patterns - - EXAMPLES: - Single image: (C, H, W) → (1, C*H*W) - Batch: (B, C, H, W) → (B, C*H*W) - 2D: (H, W) → (1, H*W) - - HINTS: - - Check input shape to determine batch vs single image - - Use reshape to flatten spatial dimensions - - Preserve batch dimension for proper Dense layer input - """ - ### BEGIN SOLUTION - # Clean PyTorch-style flatten implementation - input_shape = x.shape - x_data = x.data - - # Handle different input dimensions - if len(input_shape) == 2: # (H, W) - add batch dimension - result_data = x_data.reshape(1, -1) # Add batch, flatten rest - elif len(input_shape) == 3: # (C, H, W) - add batch dimension - result_data = x_data.reshape(1, -1) # Add batch, flatten rest - elif len(input_shape) == 4: # (B, C, H, W) - keep batch - batch_size = input_shape[0] - result_data = x_data.reshape(batch_size, -1) - else: - # Default: keep first dimension, flatten rest - result_data = x_data.reshape(input_shape[0], -1) - - return type(x)(result_data) - ### END SOLUTION - -# %% [markdown] -""" -### 🧪 Unit Test: Flatten Function - -Let us test your flatten function! This connects convolutional layers to dense layers. - -**This is a unit test** - it tests one specific function (flatten) in isolation. 
-""" - -# %% nbgrader={"grade": true, "grade_id": "test-flatten-immediate", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false} -# Test flatten function immediately after implementation -print("🔬 Unit Test: Flatten Function...") - -# Test case 1: 2x2 tensor -try: - x = Tensor([[1, 2], [3, 4]]) - flattened = flatten(x) - - print(f"Input: {x}") - print(f"Flattened: {flattened}") - print(f"Flattened shape: {flattened.shape}") - - # Verify shape and content - assert flattened.shape == (1, 4), f"Flattened shape should be (1, 4), got {flattened.shape}" - expected_data = np.array([[1, 2, 3, 4]]) - assert np.array_equal(flattened.data, expected_data), f"Flattened data should be {expected_data}, got {flattened.data}" - print("✅ 2x2 flatten test passed") - -except Exception as e: - print(f"❌ 2x2 flatten test failed: {e}") - raise - -# Test case 2: 3x3 tensor -try: - x2 = Tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) - flattened2 = flatten(x2) - - assert flattened2.shape == (1, 9), f"Flattened shape should be (1, 9), got {flattened2.shape}" - expected_data2 = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9]]) - assert np.array_equal(flattened2.data, expected_data2), f"Flattened data should be {expected_data2}, got {flattened2.data}" - print("✅ 3x3 flatten test passed") - -except Exception as e: - print(f"❌ 3x3 flatten test failed: {e}") - raise - -# Test case 3: Different shapes -try: - x3 = Tensor([[1, 2, 3, 4], [5, 6, 7, 8]]) # 2x4 - flattened3 = flatten(x3) - - assert flattened3.shape == (1, 8), f"Flattened shape should be (1, 8), got {flattened3.shape}" - expected_data3 = np.array([[1, 2, 3, 4, 5, 6, 7, 8]]) - assert np.array_equal(flattened3.data, expected_data3), f"Flattened data should be {expected_data3}, got {flattened3.data}" - print("✅ Different shapes flatten test passed") - -except Exception as e: - print(f"❌ Different shapes flatten test failed: {e}") - raise - -# Show the flattening behavior -print("🎯 Flatten behavior:") -print(" Converts 2D 
tensor to 1D") -print(" Preserves batch dimension") -print(" Enables connection to Dense layers") -print("📈 Progress: Convolution operation ✓, Conv2D layer ✓, Flatten ✓") - -# %% [markdown] -""" -## Step 6: Comprehensive Test - Multi-Channel CNN Pipeline - -### Real-World CNN Applications -Let us test our complete CNN system with realistic multi-channel scenarios: - -#### **CIFAR-10 Style CNN** -```python -# RGB images to classification -RGB Input → Multi-Channel Conv2D → ReLU → MaxPool2D → Flatten → Dense → Output -``` - -#### **Deep Multi-Channel CNN** -```python -# Progressive feature extraction -RGB → Conv2D(3→32) → ReLU → Pool → Conv2D(32→64) → ReLU → Pool → Flatten → Dense -``` - -#### **Production CNN Pattern** -```python -# Full computer vision pipeline -RGB images → Feature extraction layers → Spatial downsampling → Classification head -``` - -This comprehensive test ensures our multi-channel CNN components work together for real computer vision applications like CIFAR-10! -""" - -# %% nbgrader={"grade": true, "grade_id": "test-comprehensive-multichannel", "locked": true, "points": 20, "schema_version": 3, "solution": false, "task": false} -# Comprehensive test - complete multi-channel CNN applications -print("🔬 Comprehensive Test: Multi-Channel CNN Applications...") - -try: - # Test 1: CIFAR-10 Style RGB CNN Pipeline - print("\n1. 
CIFAR-10 Style RGB CNN Pipeline:") - - # Create pipeline: RGB → Conv2D(3→16) → ReLU → MaxPool2D → Flatten → Dense - rgb_conv = Conv2d(in_channels=3, out_channels=16, kernel_size=(3, 3)) - relu = ReLU() - pool = MaxPool2D(pool_size=(2, 2)) - dense = Dense(input_size=16 * 3 * 3, output_size=10) # 16 channels, 3x3 spatial = 144 features - - # Simulated CIFAR-10 image (3 channels, 8x8 for testing) - rgb_image = Tensor(np.random.randn(3, 8, 8)) # RGB 8x8 image - print(f"RGB input shape: {rgb_image.shape}") - - # Forward pass through complete pipeline - conv_features = rgb_conv(rgb_image) # (3,8,8) → (16,6,6) - activated = relu(conv_features) # (16,6,6) → (16,6,6) - pooled = pool(activated) # (16,6,6) → (16,3,3) - flattened = flatten(pooled) # (16,3,3) → (1,144) - predictions = dense(flattened) # (1,144) → (1,10) - - assert conv_features.shape == (16, 6, 6), f"Conv features wrong: {conv_features.shape}" - assert activated.shape == (16, 6, 6), f"Activated features wrong: {activated.shape}" - assert pooled.shape == (16, 3, 3), f"Pooled features wrong: {pooled.shape}" - assert flattened.shape == (1, 144), f"Flattened features wrong: {flattened.shape}" - assert predictions.shape == (1, 10), f"Predictions wrong: {predictions.shape}" - - print("✅ CIFAR-10 style RGB pipeline works correctly") - - # Test 2: Deep Multi-Channel CNN - print("\n2. 
Deep Multi-Channel CNN:") - - # Create deeper pipeline: RGB → Conv1(3→32) → ReLU → Pool → Conv2(32→64) → ReLU → Pool → Dense - conv1_deep = Conv2d(in_channels=3, out_channels=32, kernel_size=(3, 3)) - relu1 = ReLU() - pool1 = MaxPool2D(pool_size=(2, 2)) - conv2_deep = Conv2d(in_channels=32, out_channels=64, kernel_size=(3, 3)) - relu2 = ReLU() - pool2 = MaxPool2D(pool_size=(2, 2)) - classifier_deep = Dense(input_size=64 * 1 * 1, output_size=5) # 64 channels, 1x1 spatial - - # Larger RGB input for deep processing - large_rgb = Tensor(np.random.randn(3, 12, 12)) # RGB 12x12 image - print(f"Large RGB input shape: {large_rgb.shape}") - - # Forward pass through deep network - h1 = conv1_deep(large_rgb) # (3,12,12) → (32,10,10) - h2 = relu1(h1) # (32,10,10) → (32,10,10) - h3 = pool1(h2) # (32,10,10) → (32,5,5) - h4 = conv2_deep(h3) # (32,5,5) → (64,3,3) - h5 = relu2(h4) # (64,3,3) → (64,3,3) - h6 = pool2(h5) # (64,3,3) → (64,1,1) - h7 = flatten(h6) # (64,1,1) → (1,64) - output_deep = classifier_deep(h7) # (1,64) → (1,5) - - assert h1.shape == (32, 10, 10), f"Conv1 output wrong: {h1.shape}" - assert h3.shape == (32, 5, 5), f"Pool1 output wrong: {h3.shape}" - assert h4.shape == (64, 3, 3), f"Conv2 output wrong: {h4.shape}" - assert h6.shape == (64, 1, 1), f"Pool2 output wrong: {h6.shape}" - assert h7.shape == (1, 64), f"Final flatten wrong: {h7.shape}" - assert output_deep.shape == (1, 5), f"Final prediction wrong: {output_deep.shape}" - - print("✅ Deep multi-channel CNN works correctly") - - # Test 3: Batch Processing with Multi-Channel - print("\n3. 
Batch Processing Test:") - - # Test batch of RGB images - batch_conv = Conv2d(in_channels=3, out_channels=8, kernel_size=(3, 3)) - batch_pool = MaxPool2D(pool_size=(2, 2)) - - # Batch of 4 RGB images - rgb_batch = Tensor(np.random.randn(4, 3, 6, 6)) # 4 images, 3 channels, 6x6 - print(f"Batch RGB input shape: {rgb_batch.shape}") - - # Forward pass to determine correct feature size - batch_conv_out = batch_conv(rgb_batch) # (4,3,6,6) → (4,8,4,4) - batch_pool_out = batch_pool(batch_conv_out) # (4,8,4,4) → (4,8,2,2) - batch_flat = flatten(batch_pool_out) # (4,8,2,2) → (4,32) - - # Create classifier with correct input size - feature_size = batch_flat.shape[1] # 32 features - batch_classifier = Dense(input_size=feature_size, output_size=3) - batch_pred = batch_classifier(batch_flat) # (4,32) → (4,3) - - assert batch_conv_out.shape == (4, 8, 4, 4), f"Batch conv wrong: {batch_conv_out.shape}" - assert batch_pool_out.shape == (4, 8, 2, 2), f"Batch pool wrong: {batch_pool_out.shape}" - assert batch_flat.shape == (4, 32), f"Batch flatten wrong: {batch_flat.shape}" - assert batch_pred.shape == (4, 3), f"Batch prediction wrong: {batch_pred.shape}" - - print("✅ Batch processing with multi-channel works correctly") - - # Test 4: Backward Compatibility with Single Channel - print("\n4. Backward Compatibility Test:") - - # Test that Conv2d works for single-channel (grayscale) - gray_conv = Conv2d(in_channels=1, out_channels=8, kernel_size=(3, 3)) - gray_image = Tensor(np.random.randn(1, 6, 6)) # 1 channel, 6x6 - gray_features = gray_conv(gray_image) - - assert gray_features.shape == (8, 4, 4), f"Grayscale features wrong: {gray_features.shape}" - print("✅ Single-channel compatibility works correctly") - - # Test 5: Memory and Parameter Analysis - print("\n5. 
Memory and Parameter Analysis:") - - # Analyze different configurations - configs = [ - (Conv2d(1, 8, (3, 3)), "1→8 channels"), - (Conv2d(3, 16, (3, 3)), "3→16 channels (RGB)"), - (Conv2d(16, 32, (3, 3)), "16→32 channels"), - (Conv2d(32, 64, (3, 3)), "32→64 channels"), - ] - - for conv_layer, desc in configs: - params = conv_layer.weight.size + (conv_layer.bias.size if conv_layer.use_bias else 0) - memory_mb = params * 4 / (1024 * 1024) # float32 = 4 bytes - print(f" {desc}: {params:,} parameters ({memory_mb:.3f} MB)") - - print("✅ Memory analysis completed") - - print("\n🎉 Comprehensive multi-channel test passed! Your CNN system supports:") - print(" • RGB image processing (CIFAR-10 ready)") - print(" • Deep multi-channel architectures") - print(" • Batch processing with multiple channels") - print(" • Backward compatibility with single-channel") - print(" • Production-ready parameter scaling") - print(" • Complete Conv → Pool → Dense pipelines") - print("📈 Progress: Production-ready multi-channel CNN system!") - -except Exception as e: - print(f"❌ Comprehensive multi-channel test failed: {e}") - raise - -print("📈 Final Progress: Production-ready multi-channel CNN system for real computer vision!") - -# %% [markdown] -""" -### 🧪 Unit Test: Convolution Operation Implementation - -This test validates the `conv2d_naive` function, ensuring it correctly performs 2D convolution operations with proper kernel sliding, dot product computation, and output shape calculation for spatial feature detection. 
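The expected values in this test can be reproduced with a plain NumPy sliding-window reference (a sketch of the same valid cross-correlation; `conv2d_reference` is an illustrative name, not part of the module):

```python
import numpy as np

def conv2d_reference(x, k):
    """Valid cross-correlation: slide the kernel, take a dot product per window."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
k = np.array([[1, 0], [0, 1]])  # sums the main diagonal of each 2x2 window
print(conv2d_reference(x, k))   # [[ 6.  8.] [12. 14.]]
```

For example, the top-left window `[[1, 2], [4, 5]]` gives `1*1 + 5*1 = 6`, matching the first expected output value.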
-""" - -# %% -def test_unit_convolution_operation(): - """Unit test for the convolution operation implementation.""" - print("🔬 Unit Test: Convolution Operation...") - - # Test basic convolution - input_data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) - kernel = np.array([[1, 0], [0, 1]]) - result = conv2d_naive(input_data, kernel) - - assert result.shape == (2, 2), "Convolution should produce correct output shape" - expected = np.array([[6, 8], [12, 14]]) - assert np.array_equal(result, expected), "Convolution should produce correct values" - - print("✅ Convolution operation works correctly") - -# Test function defined (called in main block) - -# %% [markdown] -""" -### 🧪 Unit Test: Conv2D Layer Implementation - -This test validates the Conv2D layer class, ensuring proper kernel initialization, forward pass functionality, and integration with the tensor framework for convolutional neural network construction. -""" - -# %% -def test_unit_conv2d_layer(): - """Unit test for the Conv2D layer implementation.""" - print("🔬 Unit Test: Conv2D Layer...") - - # Test Conv2D layer - conv = Conv2D(kernel_size=(3, 3)) - input_tensor = Tensor(np.random.randn(6, 6)) - output = conv(input_tensor) - - assert output.shape == (4, 4), "Conv2D should produce correct output shape" - assert hasattr(conv, 'kernel'), "Conv2D should have kernel attribute" - assert conv.kernel.shape == (3, 3), "Kernel should have correct shape" - - print("✅ Conv2D layer works correctly") - -# Test function defined (called in main block) - -# %% [markdown] -""" -### 🧪 Unit Test: Flatten Function Implementation - -This test validates the flatten function, ensuring it correctly converts 2D spatial tensors to 1D vectors for connecting convolutional layers to dense layers in CNN architectures. 
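Getting the Dense layer's `input_size` right depends on this flatten step. A small helper (hypothetical, not part of the module) makes the shape bookkeeping used throughout these pipelines explicit:

```python
def conv_pool_feature_count(h, w, channels_out, kernel, pool):
    """Features reaching Dense after one valid conv + one non-overlapping max-pool."""
    h = (h - kernel + 1) // pool   # valid convolution shrinks, pooling downsamples
    w = (w - kernel + 1) // pool
    return channels_out * h * w

# CIFAR-10 style pipeline above: 8x8 RGB -> Conv2d(3->16, 3x3) -> 2x2 pool
print(conv_pool_feature_count(8, 8, 16, kernel=3, pool=2))  # 144

# Chaining two stages reproduces the deep pipeline (12x12 input):
h = w = (12 - 3 + 1) // 2          # stage 1 -> 5x5 spatial
print(conv_pool_feature_count(h, w, 64, kernel=3, pool=2))  # 64
```

These match the `Dense(input_size=16 * 3 * 3, ...)` and `Dense(input_size=64 * 1 * 1, ...)` classifiers in the comprehensive test.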
-""" - -# %% -def test_unit_flatten_function(): - """Unit test for the flatten function implementation.""" - print("🔬 Unit Test: Flatten Function...") - - # Test flatten function - input_2d = Tensor([[1, 2], [3, 4]]) - flattened = flatten(input_2d) - - assert flattened.shape == (1, 4), "Flatten should produce output with batch dimension" - expected = np.array([[1, 2, 3, 4]]) - assert np.array_equal(flattened.data, expected), "Flatten should preserve values" - - print("✅ Flatten function works correctly") - -# Test function defined (called in main block) - -# CNN pipeline integration test moved to tests/integration/test_cnn_pipeline.py - -# %% [markdown] -""" -## 🧪 Module Testing - -Time to test your implementation! This section uses TinyTorch's standardized testing framework to ensure your implementation works correctly. - -**This testing section is locked** - it provides consistent feedback across all modules and cannot be modified. -""" - -# %% nbgrader={"grade": false, "grade_id": "standardized-testing", "locked": true, "schema_version": 3, "solution": false, "task": false} -# ============================================================================= -# STANDARDIZED MODULE TESTING - DO NOT MODIFY -# This cell is locked to ensure consistent testing across all TinyTorch modules -# ============================================================================= - -# %% [markdown] -""" -## 🔬 Integration Test: Conv2D Layer with Tensors -""" - -# %% -def test_module_conv2d_tensor_compatibility(): - """ - Integration test for the Conv2D layer and the Tensor class. - - Tests that the Conv2D layer correctly processes a batch of image-like Tensors. - """ - print("🔬 Running Integration Test: Conv2D with Tensors...") - - # 1. Define a Conv2D layer - # Kernel of size 3x3 - conv_layer = Conv2D((3, 3)) - - # 2. 
Create a batch of 5 grayscale images (10x10) - # Shape: (batch_size, height, width) - input_images = np.random.randn(5, 10, 10) - input_tensor = Tensor(input_images) - - # 3. Perform a forward pass - output_tensor = conv_layer(input_tensor) - - # 4. Assert the output shape is correct - # Output height = 10 - 3 + 1 = 8 - # Output width = 10 - 3 + 1 = 8 - expected_shape = (5, 8, 8) - assert isinstance(output_tensor, Tensor), "Conv2D output must be a Tensor" - assert output_tensor.shape == expected_shape, f"Expected output shape {expected_shape}, but got {output_tensor.shape}" - print("✅ Integration Test Passed: Conv2D layer correctly transformed image tensor.") - - -# %% [markdown] -""" -## Step 4: ML Systems Thinking - Convolution Optimization & Memory Patterns - -### 🏗️ Spatial Computation at Scale - -Your convolution implementation provides the foundation for understanding how production computer vision systems optimize spatial operations for massive image processing workloads. - -#### **Convolution Memory Patterns** -```python -class ConvolutionMemoryAnalyzer: - def __init__(self): - # Memory access patterns in convolution operations - self.spatial_locality = SpatialLocalityTracker() - self.cache_efficiency = CacheEfficiencyMonitor() - self.memory_bandwidth = BandwidthAnalyzer() -``` - -Real convolution systems must handle: -- **Spatial locality**: Adjacent pixels accessed together optimize cache performance -- **Memory bandwidth**: Large feature maps require efficient memory access patterns -- **Tiling strategies**: Breaking large convolutions into cache-friendly chunks -- **Hardware acceleration**: Specialized convolution units in modern GPUs and TPUs -""" - -# %% nbgrader={"grade": false, "grade_id": "convolution-profiler", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -import time -from collections import defaultdict - -class ConvolutionProfiler: - """ - Production Convolution Performance Analysis and Optimization - - 
Analyzes spatial computation efficiency, memory patterns, and optimization - opportunities for production computer vision systems. - """ - - def __init__(self): - """Initialize convolution profiler for spatial operations analysis.""" - self.profiling_data = defaultdict(list) - self.memory_analysis = defaultdict(list) - self.optimization_recommendations = [] - - def profile_convolution_operation(self, conv_layer, input_tensor, kernel_sizes=[(3,3), (5,5), (7,7)]): - """ - Profile convolution operations across different kernel sizes. - - TODO: Implement convolution operation profiling. - - STEP-BY-STEP IMPLEMENTATION: - 1. Profile different kernel sizes and their computational costs - 2. Measure memory usage patterns for spatial operations - 3. Analyze cache efficiency and memory access patterns - 4. Identify optimization opportunities for production systems - - LEARNING CONNECTIONS: - - **Performance Optimization**: Understanding computational costs of different kernel sizes - - **Memory Efficiency**: Cache-friendly access patterns improve performance significantly - - **Production Scaling**: Profiling guides hardware selection and deployment strategies - - **GPU Optimization**: Spatial operations are ideal for parallel processing - - APPROACH: - 1. Time convolution operations with different kernel sizes - 2. Analyze memory usage patterns for spatial operations - 3. Calculate computational intensity (FLOPs per operation) - 4. Identify memory bandwidth vs compute bottlenecks - 5. 
Generate optimization recommendations - - EXAMPLE: - profiler = ConvolutionProfiler() - conv = Conv2D(kernel_size=(3, 3)) - input_img = Tensor(np.random.randn(32, 32)) # 32x32 image - analysis = profiler.profile_convolution_operation(conv, input_img) - result_3x3 = analysis['detailed_results']['3x3'] - print(f"Convolution throughput: {result_3x3['throughput_mflops']:.1f} MFLOPS") - - HINTS: - - Use time.time() for timing measurements - - Calculate memory footprint of input and output tensors - - Estimate FLOPs: output_height * output_width * kernel_height * kernel_width - - Compare performance across kernel sizes - """ - ### BEGIN SOLUTION - print("🔧 Profiling Convolution Operations...") - - results = {} - - for kernel_size in kernel_sizes: - print(f" Testing kernel size: {kernel_size}") - - # Create convolution layer with specified kernel size - # Note: Using the provided conv_layer or creating new one - try: - if hasattr(conv_layer, 'kernel_size'): - # Use existing layer if compatible, otherwise create new - if conv_layer.kernel_size == kernel_size: - test_conv = conv_layer - else: - test_conv = Conv2D(kernel_size=kernel_size) - else: - test_conv = Conv2D(kernel_size=kernel_size) - except: - # Fallback for testing - create mock convolution - test_conv = conv_layer - - # Measure timing - iterations = 10 - start_time = time.time() - - for _ in range(iterations): - try: - output = test_conv(input_tensor) - except: - # Fallback: simulate convolution operation - # Calculate expected output size - input_h, input_w = input_tensor.shape[-2:] - kernel_h, kernel_w = kernel_size - output_h = input_h - kernel_h + 1 - output_w = input_w - kernel_w + 1 - output = Tensor(np.random.randn(output_h, output_w)) - - end_time = time.time() - avg_time = (end_time - start_time) / iterations - - # Calculate computational metrics - input_h, input_w = input_tensor.shape[-2:] - kernel_h, kernel_w = kernel_size - output_h = max(1, input_h - kernel_h + 1) - output_w = max(1, input_w - kernel_w + 1) - - # Estimate FLOPs (floating point
operations) - flops = output_h * output_w * kernel_h * kernel_w - mflops = flops / 1e6 - throughput_mflops = mflops / avg_time if avg_time > 0 else 0 - - # Memory analysis - input_memory_mb = input_tensor.data.nbytes / (1024 * 1024) - output_memory_mb = (output_h * output_w * 4) / (1024 * 1024) # Assuming float32 - kernel_memory_mb = (kernel_h * kernel_w * 4) / (1024 * 1024) - total_memory_mb = input_memory_mb + output_memory_mb + kernel_memory_mb - - # Calculate computational intensity (FLOPs per byte) - computational_intensity = flops / max(input_tensor.data.nbytes, 1) - - result = { - 'kernel_size': kernel_size, - 'time_ms': avg_time * 1000, - 'throughput_mflops': throughput_mflops, - 'flops': flops, - 'input_memory_mb': input_memory_mb, - 'output_memory_mb': output_memory_mb, - 'total_memory_mb': total_memory_mb, - 'computational_intensity': computational_intensity, - 'output_size': (output_h, output_w) - } - - results[f"{kernel_size[0]}x{kernel_size[1]}"] = result - - print(f" Time: {avg_time*1000:.3f}ms, Throughput: {throughput_mflops:.1f} MFLOPS") - - # Store profiling data - self.profiling_data['convolution_results'] = results - - # Generate analysis - analysis = self._analyze_convolution_performance(results) - - return { - 'detailed_results': results, - 'analysis': analysis, - 'recommendations': self._generate_optimization_recommendations(results) - } - ### END SOLUTION - - def _analyze_convolution_performance(self, results): - """Analyze convolution performance patterns.""" - analysis = [] - - # Find fastest and slowest configurations - times = [(k, v['time_ms']) for k, v in results.items()] - fastest = min(times, key=lambda x: x[1]) - slowest = max(times, key=lambda x: x[1]) - - analysis.append(f"🚀 Fastest kernel: {fastest[0]} ({fastest[1]:.3f}ms)") - analysis.append(f"🐌 Slowest kernel: {slowest[0]} ({slowest[1]:.3f}ms)") - - # Performance scaling analysis - if len(results) > 1: - small_kernel = min(results.keys(), key=lambda k: results[k]['flops']) - 
large_kernel = max(results.keys(), key=lambda k: results[k]['flops']) - - flops_ratio = results[large_kernel]['flops'] / results[small_kernel]['flops'] - time_ratio = results[large_kernel]['time_ms'] / results[small_kernel]['time_ms'] - - analysis.append(f"📈 FLOPS scaling: {small_kernel} → {large_kernel} = {flops_ratio:.1f}x more computation") - analysis.append(f"⏱️ Time scaling: {time_ratio:.1f}x slower") - - if time_ratio < flops_ratio: - analysis.append("✅ Good computational efficiency - time scales better than FLOPs") - else: - analysis.append("⚠️ Computational bottleneck - time scales worse than FLOPs") - - # Memory analysis - memory_usage = [(k, v['total_memory_mb']) for k, v in results.items()] - max_memory = max(memory_usage, key=lambda x: x[1]) - analysis.append(f"💾 Peak memory usage: {max_memory[0]} ({max_memory[1]:.2f} MB)") - - return analysis - - def _generate_optimization_recommendations(self, results): - """Generate optimization recommendations based on profiling results.""" - recommendations = [] - - # Analyze computational intensity - intensities = [v['computational_intensity'] for v in results.values()] - avg_intensity = sum(intensities) / len(intensities) - - if avg_intensity < 1.0: - recommendations.append("🔧 Memory-bound operation: Consider memory layout optimization") - recommendations.append("💡 Try: Tensor tiling, cache-friendly access patterns") - else: - recommendations.append("🔧 Compute-bound operation: Focus on computational optimization") - recommendations.append("💡 Try: SIMD instructions, hardware acceleration") - - # Kernel size recommendations - best_throughput = max(results.values(), key=lambda x: x['throughput_mflops']) - recommendations.append(f"⚡ Optimal kernel size for throughput: {best_throughput['kernel_size']}") - - # Memory efficiency recommendations - memory_efficiency = {k: v['throughput_mflops'] / v['total_memory_mb'] - for k, v in results.items() if v['total_memory_mb'] > 0} - if memory_efficiency: - 
best_memory_efficiency = max(memory_efficiency.items(), key=lambda x: x[1]) - recommendations.append(f"💾 Most memory-efficient: {best_memory_efficiency[0]}") - - return recommendations - - def analyze_memory_patterns(self, input_sizes=[(64, 64), (128, 128), (256, 256)]): - """ - Analyze memory access patterns for different image sizes. - - This function is PROVIDED to demonstrate memory scaling analysis. - Students use it to understand spatial computation memory requirements. - """ - print("🔍 MEMORY PATTERN ANALYSIS") - print("=" * 40) - - conv_3x3 = Conv2D(kernel_size=(3, 3)) - - memory_results = [] - - for height, width in input_sizes: - # Create test tensor - test_tensor = Tensor(np.random.randn(height, width)) - - # Calculate memory requirements - input_memory = test_tensor.data.nbytes / (1024 * 1024) # MB - - # Estimate output size - output_h = height - 3 + 1 - output_w = width - 3 + 1 - output_memory = (output_h * output_w * 4) / (1024 * 1024) # MB, float32 - - # Kernel memory - kernel_memory = (3 * 3 * 4) / (1024 * 1024) # MB - - total_memory = input_memory + output_memory + kernel_memory - memory_efficiency = (output_h * output_w) / total_memory # operations per MB - - result = { - 'input_size': (height, width), - 'input_memory_mb': input_memory, - 'output_memory_mb': output_memory, - 'total_memory_mb': total_memory, - 'memory_efficiency': memory_efficiency - } - memory_results.append(result) - - print(f" {height}x{width}: {total_memory:.2f} MB total, {memory_efficiency:.0f} ops/MB") - - # Analyze scaling - if len(memory_results) >= 2: - small = memory_results[0] - large = memory_results[-1] - - size_ratio = (large['input_size'][0] / small['input_size'][0]) ** 2 - memory_ratio = large['total_memory_mb'] / small['total_memory_mb'] - - print(f"\n📈 Memory Scaling Analysis:") - print(f" Input size increased {size_ratio:.1f}x") - print(f" Memory usage increased {memory_ratio:.1f}x") - print(f" Scaling efficiency: {(memory_ratio/size_ratio)*100:.1f}% (lower is 
better)") - - return memory_results - -# %% [markdown] -""" -### 🧪 Test: Convolution Performance Profiling - -Let us test our convolution profiler with realistic computer vision scenarios. -""" - -# %% nbgrader={"grade": false, "grade_id": "test-convolution-profiler", "locked": false, "schema_version": 3, "solution": false, "task": false} -def test_convolution_profiler(): - """Test convolution profiler with comprehensive scenarios.""" - print("🔬 Unit Test: Convolution Performance Profiler...") - - profiler = ConvolutionProfiler() - - # Create test components - conv = Conv2D(kernel_size=(3, 3)) - test_image = Tensor(np.random.randn(64, 64)) # 64x64 test image - - # Test convolution profiling - try: - analysis = profiler.profile_convolution_operation(conv, test_image, - kernel_sizes=[(3,3), (5,5)]) - - # Verify analysis structure - assert 'detailed_results' in analysis, "Should provide detailed results" - assert 'analysis' in analysis, "Should provide performance analysis" - assert 'recommendations' in analysis, "Should provide optimization recommendations" - - # Verify detailed results - results = analysis['detailed_results'] - assert len(results) == 2, "Should test both kernel sizes" - - for kernel_name, result in results.items(): - assert 'time_ms' in result, f"Should include timing for {kernel_name}" - assert 'throughput_mflops' in result, f"Should calculate throughput for {kernel_name}" - assert 'total_memory_mb' in result, f"Should analyze memory for {kernel_name}" - assert result['time_ms'] > 0, f"Time should be positive for {kernel_name}" - - print("✅ Convolution profiling test passed") - - # Test memory pattern analysis - memory_analysis = profiler.analyze_memory_patterns(input_sizes=[(32, 32), (64, 64)]) - - assert isinstance(memory_analysis, list), "Should return memory analysis results" - assert len(memory_analysis) == 2, "Should analyze both input sizes" - - for result in memory_analysis: - assert 'input_size' in result, "Should include input size" - 
assert 'total_memory_mb' in result, "Should calculate total memory" - assert result['total_memory_mb'] > 0, "Memory usage should be positive" - - print("✅ Memory pattern analysis test passed") - - except Exception as e: - print(f"⚠️ Convolution profiling test had issues: {e}") - print("✅ Basic structure test passed (graceful degradation)") - - print("🎯 Convolution Profiler: All tests passed!") - -# Test function defined (called in main block) - -def test_unit_multichannel_conv2d(): - """Unit test for the multi-channel Conv2D implementation.""" - print("🔬 Unit Test: Multi-Channel Conv2D...") - - # Test multi-channel convolution - conv = Conv2d(in_channels=3, out_channels=8, kernel_size=(3, 3)) - input_rgb = Tensor(np.random.randn(3, 6, 6)) - output = conv(input_rgb) - - assert output.shape == (8, 4, 4), "Multi-channel Conv2D should produce correct output shape" - assert hasattr(conv, 'weight'), "Multi-channel Conv2D should have weights attribute" - assert conv.weight.shape == (8, 3, 3, 3), "Weights should have correct multi-channel shape" - - print("✅ Multi-channel Conv2D works correctly") - -def test_unit_maxpool2d(): - """Unit test for the MaxPool2D implementation.""" - print("🔬 Unit Test: MaxPool2D...") - - # Test MaxPool2D - pool = MaxPool2D(pool_size=(2, 2)) - input_4x4 = Tensor(np.arange(16).reshape(4, 4)) - pooled = pool(input_4x4) - - assert pooled.shape == (2, 2), "MaxPool2D should produce correct output shape" - expected = np.array([[5, 7], [13, 15]]) # Max of each 2x2 window - assert np.array_equal(pooled.data, expected), "MaxPool2D should compute correct max values" - - print("✅ MaxPool2D works correctly") - -if __name__ == "__main__": - # Run all tests - test_unit_convolution_operation() - test_unit_conv2d_layer() - test_unit_multichannel_conv2d() - test_unit_maxpool2d() - test_unit_flatten_function() - test_module_conv2d_tensor_compatibility() - test_convolution_profiler() - - print("All tests passed!") - print("spatial_dev module complete with 
multi-channel support!") - -# %% [markdown] -""" -## 🤔 ML Systems Thinking: Interactive Questions - -Now that you've built convolution operations and spatial processing capabilities, let's connect this foundational work to broader ML systems challenges. These questions help you think critically about how spatial computation patterns scale to production computer vision environments. - -Take time to reflect thoughtfully on each question - your insights will help you understand how the spatial processing concepts you've implemented connect to real-world ML systems engineering. -""" - -# %% [markdown] -""" -### Question 1: Convolution Optimization and Memory Access Patterns - -**Context**: Your convolution implementation processes images by sliding kernels across spatial dimensions, accessing nearby pixels repeatedly. Production computer vision systems must optimize these memory access patterns for cache efficiency, especially when processing high-resolution images that exceed cache capacity. - -**Reflection Question**: Design an optimized convolution system for production computer vision that maximizes cache efficiency and memory bandwidth utilization. How would you implement spatial data layout optimization for different image sizes, optimize kernel access patterns for cache locality, and handle memory hierarchies from L1 cache to main memory? Consider scenarios where you need to process 4K video streams in real-time while maintaining memory efficiency. - -Think about: spatial data layouts (NCHW vs NHWC), cache-blocking strategies, memory prefetching, and bandwidth optimization techniques. 
- -*Target length: 150-300 words* -""" - -# %% nbgrader={"grade": true, "grade_id": "question-1-convolution-optimization", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} -""" -YOUR REFLECTION ON CONVOLUTION OPTIMIZATION AND MEMORY ACCESS PATTERNS: - -TODO: Replace this text with your thoughtful response about optimized convolution system design. - -Consider addressing: -- How would you optimize spatial data layouts for different image processing scenarios? -- What strategies would you use to maximize cache locality in convolution operations? -- How would you handle memory bandwidth bottlenecks in high-resolution image processing? -- What role would cache-blocking and prefetching play in your optimization approach? -- How would you adapt memory access patterns for different hardware architectures? - -Write a technical analysis connecting your convolution implementations to real memory optimization challenges. - -GRADING RUBRIC (Instructor Use): -- Demonstrates understanding of spatial memory access optimization (3 points) -- Addresses cache efficiency and bandwidth utilization strategies (3 points) -- Shows practical knowledge of data layout and access pattern optimization (2 points) -- Demonstrates systems thinking about memory hierarchy optimization (2 points) -- Clear technical reasoning and practical considerations (bonus points for innovative approaches) -""" - -### BEGIN SOLUTION -# Student response area - instructor will replace this section during grading setup -# This is a manually graded question requiring technical analysis of convolution optimization -# Students should demonstrate understanding of spatial memory access patterns and cache optimization -### END SOLUTION - -# %% [markdown] -""" -### Question 2: GPU Parallelization and Hardware Acceleration - -**Context**: Your convolution processes pixels sequentially, but production computer vision systems leverage thousands of GPU cores for parallel computation. 
Different hardware platforms (GPUs, TPUs, mobile processors) have distinct optimization opportunities and constraints for spatial operations. - -**Reflection Question**: Architect a hardware-aware convolution system that optimally utilizes parallel computing resources across different platforms. How would you implement data parallelism strategies for GPU convolution kernels, optimize for specialized AI accelerators like TPUs, and adapt convolution algorithms for mobile and edge devices with limited resources? Consider scenarios where the same model needs efficient deployment across cloud GPUs, mobile phones, and embedded vision systems. - -Think about: parallel algorithm design, hardware-specific optimization, work distribution strategies, and cross-platform efficiency considerations. - -*Target length: 150-300 words* -""" - -# %% nbgrader={"grade": true, "grade_id": "question-2-gpu-parallelization", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} -""" -YOUR REFLECTION ON GPU PARALLELIZATION AND HARDWARE ACCELERATION: - -TODO: Replace this text with your thoughtful response about hardware-aware convolution system design. - -Consider addressing: -- How would you design parallel convolution algorithms for different hardware platforms? -- What strategies would you use to optimize convolution for GPU, TPU, and mobile processors? -- How would you implement work distribution and load balancing for parallel convolution? -- What role would hardware-specific optimizations play in your design? -- How would you maintain efficiency across diverse deployment platforms? - -Write an architectural analysis connecting your spatial processing to real hardware acceleration challenges. 
- -GRADING RUBRIC (Instructor Use): -- Shows understanding of parallel computing and hardware acceleration (3 points) -- Designs practical approaches to multi-platform convolution optimization (3 points) -- Addresses work distribution and platform-specific optimization (2 points) -- Demonstrates systems thinking about hardware-software co-optimization (2 points) -- Clear architectural reasoning with hardware insights (bonus points for comprehensive understanding) -""" - -### BEGIN SOLUTION -# Student response area - instructor will replace this section during grading setup -# This is a manually graded question requiring understanding of parallel computing and hardware optimization -# Students should demonstrate knowledge of GPU acceleration and multi-platform optimization -### END SOLUTION - -# %% [markdown] -""" -### Question 3: Production Computer Vision Pipeline Integration - -**Context**: Your convolution operates on individual images, but production computer vision systems must handle continuous streams of images, video processing, and real-time inference with strict latency requirements. Integration with broader ML pipelines becomes critical for system performance. - -**Reflection Question**: Design a production computer vision pipeline that integrates convolution operations with real-time processing requirements and system-wide optimization. How would you implement batching strategies for video streams, optimize pipeline throughput while maintaining low latency, and integrate convolution with preprocessing and postprocessing stages? Consider scenarios where you need to process security camera feeds, autonomous vehicle vision, or real-time medical imaging with reliability and performance guarantees. - -Think about: pipeline optimization, batching strategies, latency vs throughput trade-offs, and system integration patterns. 
- -*Target length: 150-300 words* -""" - -# %% nbgrader={"grade": true, "grade_id": "question-3-production-pipeline", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} -""" -YOUR REFLECTION ON PRODUCTION COMPUTER VISION PIPELINE INTEGRATION: - -TODO: Replace this text with your thoughtful response about production vision pipeline design. - -Consider addressing: -- How would you design computer vision pipelines that integrate convolution with real-time processing? -- What strategies would you use to optimize batching and throughput for video streams? -- How would you balance latency requirements with computational efficiency? -- What role would pipeline integration and optimization play in your system? -- How would you ensure reliability and performance guarantees for critical applications? - -Write a systems analysis connecting your convolution operations to real production pipeline challenges. - -GRADING RUBRIC (Instructor Use): -- Understands production computer vision pipeline requirements (3 points) -- Designs practical approaches to real-time processing and batching (3 points) -- Addresses latency vs throughput optimization challenges (2 points) -- Shows systems thinking about integration and reliability (2 points) -- Clear systems reasoning with production deployment insights (bonus points for deep understanding) -""" - -### BEGIN SOLUTION -# Student response area - instructor will replace this section during grading setup -# This is a manually graded question requiring understanding of production computer vision pipelines -# Students should demonstrate knowledge of real-time processing and system integration -### END SOLUTION - -# %% [markdown] -""" -## 🎯 MODULE SUMMARY: Multi-Channel Convolutional Networks - -Congratulations! 
You have successfully implemented a complete multi-channel CNN system ready for real computer vision applications: - -### What You Have Accomplished -✅ **Convolution Operation**: Implemented the sliding window mechanism from scratch -✅ **Single-Channel Conv2D**: Built learnable convolutional layers with random initialization -✅ **Multi-Channel Conv2D**: Added support for RGB images and multiple output feature maps -✅ **MaxPool2D**: Implemented spatial downsampling for computational efficiency -✅ **Flatten Function**: Created the bridge between convolutional and dense layers -✅ **Complete CNN Pipelines**: Built CIFAR-10 ready architectures with proper parameter scaling -✅ **Memory Analysis**: Profiled parameter scaling and computational complexity -✅ **Production Patterns**: Tested batch processing and deep multi-channel architectures - -### Key Concepts You Have Learned -- **Multi-channel convolution**: How RGB images are processed through multiple filters -- **Parameter scaling**: How memory requirements grow with channels and kernel sizes -- **Spatial downsampling**: MaxPooling for translation invariance and efficiency -- **Feature hierarchy**: Progressive extraction from RGB → edges → objects → concepts -- **Production architectures**: Conv → ReLU → Pool → Conv → ReLU → Pool → Dense patterns -- **He initialization**: Proper weight initialization for stable multi-layer training - -### Mathematical Foundations -- **Multi-channel convolution**: Each filter processes ALL input channels, summing results -- **Parameter calculation**: out_channels × in_channels × kernel_h × kernel_w + bias_terms -- **Spatial size reduction**: Convolution and pooling progressively reduce spatial dimensions -- **Channel expansion**: Typical pattern increases channels while reducing spatial size -- **Memory complexity**: O(batch × channels × height × width) for activations - -### Systems Engineering Insights -- **Memory scaling**: Parameters grow quadratically with channels, linearly with
filters -- **Computational intensity**: CIFAR-10 CNN requires millions of multiply-accumulate operations -- **Cache efficiency**: Spatial locality in convolution enables hardware optimization -- **Parallelization**: Each filter and spatial position can be computed independently -- **Production trade-offs**: More channels = better accuracy but higher memory/compute cost - -### Real-World Applications -- **CIFAR-10 classification**: Your CNN can handle 32×32 RGB images → 10 classes -- **Image recognition**: Object detection, medical imaging, autonomous driving -- **Transfer learning**: Pre-trained features for downstream tasks -- **Computer vision**: Face recognition, document analysis, quality inspection - -### CNN Architecture Patterns -- **Basic CNN**: RGB → Conv(3→32) → ReLU → Pool → Conv(32→64) → ReLU → Pool → Dense -- **Parameter efficiency**: 32×3×3×3 = 864 parameters vs 32×32×32 = 32,768 for dense layer -- **Spatial hierarchy**: Early layers detect edges, later layers detect objects -- **Translation invariance**: Same features detected regardless of position in image - -### Performance Characteristics -- **Memory efficiency**: Shared parameters across spatial locations -- **Computational complexity**: O(batch × out_channels × in_channels × kernel_size² × output_spatial) -- **Hardware acceleration**: Highly parallelizable operations ideal for GPUs -- **Scaling behavior**: Memory grows with channels, computation grows with spatial size - -### Production-Ready Features -```python -from tinytorch.core.spatial import Conv2d, MaxPool2D, flatten -from tinytorch.core.layers import Dense -from tinytorch.core.activations import ReLU - -# CIFAR-10 CNN architecture -conv1 = Conv2d(in_channels=3, out_channels=32, kernel_size=(3, 3)) -pool1 = MaxPool2D(pool_size=(2, 2)) -conv2 = Conv2d(in_channels=32, out_channels=64, kernel_size=(3, 3)) -pool2 = MaxPool2D(pool_size=(2, 2)) -classifier = Dense(input_size=64*6*6, output_size=10) - -# Process RGB image -rgb_image = 
Tensor(np.random.randn(3, 32, 32)) # CIFAR-10 format -features1 = pool1(ReLU()(conv1(rgb_image))) # (3,32,32) → (32,15,15) -features2 = pool2(ReLU()(conv2(features1))) # (32,15,15) → (64,6,6) -predictions = classifier(flatten(features2)) # (64,6,6) → (1,10) -``` - -### Next Steps -1. **Export to package**: Use `tito module complete 10_spatial` to export your implementation -2. **Test with real data**: Load CIFAR-10 dataset and train your CNN -3. **Experiment with architectures**: Try different channel numbers and kernel sizes -4. **Optimize performance**: Profile memory usage and computational bottlenecks -5. **Build deeper networks**: Add more layers and advanced techniques - -**Ready for the next challenge?** Let us add attention mechanisms to understand sequence relationships! -""" \ No newline at end of file diff --git a/tinytorch/core/tensor.py b/tinytorch/core/tensor.py index f3c6299e..871ef8f3 100644 --- a/tinytorch/core/tensor.py +++ b/tinytorch/core/tensor.py @@ -1,4 +1,19 @@ -# AUTOGENERATED! DO NOT EDIT! File to edit: ../../modules/02_tensor/tensor_dev.ipynb. +# ╔═══════════════════════════════════════════════════════════════════════════════╗ +# ║ 🚨 CRITICAL WARNING 🚨 ║ +# ║ AUTOGENERATED! DO NOT EDIT! ║ +# ║ ║ +# ║ This file is AUTOMATICALLY GENERATED from source modules. ║ +# ║ ANY CHANGES MADE HERE WILL BE LOST when modules are re-exported! ║ +# ║ ║ +# ║ ✅ TO EDIT: modules/source/02_tensor/tensor_dev.py ║ +# ║ ✅ TO EXPORT: Run 'tito module complete ' ║ +# ║ ║ +# ║ 🛡️ STUDENT PROTECTION: This file contains optimized implementations. ║ +# ║ Editing it directly may break module functionality and training. ║ +# ║ ║ +# ║ 🎓 LEARNING TIP: Work in modules/source/ - that's where real development ║ +# ║ happens! The tinytorch/ directory is just the compiled output. 
║ +# ╚═══════════════════════════════════════════════════════════════════════════════╝ # %% auto 0 __all__ = ['Tensor', 'Parameter'] @@ -463,40 +478,21 @@ class Tensor: return Tensor(result) ### END SOLUTION - def mean(self, axis=None, dtype=None, out=None, keepdims=False) -> 'Tensor': - """ - Computes the mean of the tensor's elements. - - Args: - axis: Axis or axes along which the means are computed. - dtype: Type to use in computing the mean. - out: Alternative output array (not supported in TinyTorch). - keepdims: If True, the axes which are reduced are left as dimensions with size one. - - Returns: - New tensor with computed means. - """ - if out is not None: - raise NotImplementedError("out parameter not supported in TinyTorch") - result = np.mean(self.data, axis=axis, dtype=dtype, keepdims=keepdims) - return Tensor(result) + def mean(self) -> 'Tensor': + """Computes the mean of the tensor's elements.""" + return Tensor(np.mean(self.data)) def matmul(self, other: 'Tensor') -> 'Tensor': """ - Perform matrix multiplication between two tensors using explicit loops. - - This implementation uses triple-nested loops for educational understanding - of the fundamental operations. Module 15 will show the optimization progression - from loops → blocking → vectorized operations. + Perform matrix multiplication between two tensors. TODO: Implement matrix multiplication. STEP-BY-STEP IMPLEMENTATION: 1. Extract numpy arrays from both tensors - 2. Check tensor shapes for compatibility - 3. Use triple-nested loops for educational understanding - 4. Create new Tensor object with the result - 5. Return the new tensor + 2. Use np.matmul() for proper matrix multiplication + 3. Create new Tensor object with the result + 4. Return the new tensor LEARNING CONNECTIONS: Real-world relevance: @@ -505,49 +501,21 @@ class Tensor: - CNN convolutions: Implemented as matrix multiplications - Batch processing: Matrix ops enable parallel computation - EDUCATIONAL APPROACH: - 1. 
Show every operation explicitly with loops - 2. Build understanding before optimizing in Module 15 - 3. Connect mathematical operations to computational patterns + APPROACH: + 1. Use np.matmul() to perform matrix multiplication + 2. Return a new Tensor with the result + 3. Handle broadcasting automatically EXAMPLE: Tensor([[1, 2], [3, 4]]) @ Tensor([[5, 6], [7, 8]]) → Tensor([[19, 22], [43, 50]]) HINTS: - - This is intentionally simple for education, not optimized - - Module 15 will show the progression to high-performance implementations - - Understanding loops helps appreciate vectorization benefits + - Use np.matmul(self._data, other._data) + - Return Tensor(result) + - This is matrix multiplication, not element-wise multiplication """ ### BEGIN SOLUTION - # Matrix multiplication using explicit loops for educational understanding - a_data = self._data - b_data = other._data - - # Get dimensions and validate compatibility - if len(a_data.shape) != 2 or len(b_data.shape) != 2: - raise ValueError("matmul requires 2D tensors") - - m, k = a_data.shape - k2, n = b_data.shape - - if k != k2: - raise ValueError(f"Inner dimensions must match: {k} != {k2}") - - # Initialize result matrix - result = np.zeros((m, n), dtype=a_data.dtype) - - # Triple nested loops - educational, shows every operation - # This is intentionally simple to understand the fundamental computation - # Module 15 will show the optimization journey: - # Step 1 (here): Educational loops - slow but clear - # Step 2: Loop blocking for cache efficiency - # Step 3: Vectorized operations with NumPy - # Step 4: GPU acceleration and BLAS libraries - for i in range(m): # For each row in result - for j in range(n): # For each column in result - for k_idx in range(k): # Dot product: sum over inner dimension - result[i, j] += a_data[i, k_idx] * b_data[k_idx, j] - + result = np.matmul(self._data, other._data) return Tensor(result) ### END SOLUTION @@ -560,37 +528,6 @@ class Tensor: """ return self.matmul(other) - 
def __getitem__(self, key): - """ - Access tensor elements using subscript notation: tensor[key] - - Supports all NumPy indexing patterns: - - Single index: tensor[0] - - Multiple indices: tensor[0, 1] - - Slices: tensor[0:2, 1:3] - - Fancy indexing: tensor[[0, 2], [1, 3]] - - Args: - key: Index or slice specification - - Returns: - Scalar, array value, or new Tensor with subset of data - - Examples: - tensor = Tensor([[1, 2], [3, 4]]) - tensor[0, 0] # Returns 1 (scalar) - tensor[0] # Returns Tensor([1, 2]) - tensor[0:1, 0:1] # Returns Tensor([[1]]) - """ - result = self._data[key] - - # If result is a scalar, return the scalar value directly - if np.isscalar(result): - return result - - # If result is an array, wrap it in a Tensor - return Tensor(result) - def backward(self, gradient=None): """ Compute gradients for this tensor and propagate backward. @@ -640,80 +577,6 @@ class Tensor: reshaped_data = self._data.reshape(*shape) return Tensor(reshaped_data) - def numpy(self) -> np.ndarray: - """ - Convert tensor to NumPy array. - - This is the PyTorch-inspired method for tensor-to-numpy conversion. - Provides clean interface for interoperability with NumPy operations. - - Returns: - NumPy array containing the tensor's data - - Example: - tensor = Tensor([1, 2, 3]) - array = tensor.numpy() # Get NumPy array for scientific computing - """ - return self._data - - def __array__(self, dtype=None) -> np.ndarray: - """ - NumPy array protocol implementation. - - This enables NumPy functions to work directly with Tensor objects - by automatically converting them to arrays when needed. - - This is the key method that fixes np.allclose() compatibility! - - Args: - dtype: Optional dtype to cast to (NumPy may request this) - - Returns: - The underlying NumPy array, optionally cast to requested dtype - - Examples: - tensor = Tensor([1, 2, 3]) - np.sum(tensor) # Works automatically - np.allclose(tensor, [1, 2, 3]) # Now works! 
- """ - if dtype is not None: - return self._data.astype(dtype) - return self._data - - def __array_ufunc__(self, ufunc, method, *inputs, **kwargs): - """ - NumPy universal function protocol implementation. - - This enables NumPy ufuncs to work with Tensor objects by converting - them to arrays first, then wrapping results back in Tensor objects. - - This fixes advanced NumPy operations like np.maximum, np.minimum, etc. - """ - # Convert Tensor inputs to NumPy arrays - args = [] - for input_ in inputs: - if isinstance(input_, Tensor): - args.append(input_._data) - else: - args.append(input_) - - # Call the ufunc on NumPy arrays - outputs = getattr(ufunc, method)(*args, **kwargs) - - # If method returns NotImplemented, let NumPy handle it - if outputs is NotImplemented: - return NotImplemented - - # Wrap result back in Tensor if appropriate - if method == '__call__': - if isinstance(outputs, np.ndarray): - return Tensor(outputs) - elif isinstance(outputs, tuple): - return tuple(Tensor(output) if isinstance(output, np.ndarray) else output - for output in outputs) - - return outputs - # # Testing Your Implementation # diff --git a/tinytorch/core/training.py b/tinytorch/core/training.py index b9a5eac5..1223cbf0 100644 --- a/tinytorch/core/training.py +++ b/tinytorch/core/training.py @@ -1,4 +1,19 @@ -# AUTOGENERATED! DO NOT EDIT! File to edit: ../../modules/source/10_training/training_dev.ipynb. +# ╔═══════════════════════════════════════════════════════════════════════════════╗ +# ║ 🚨 CRITICAL WARNING 🚨 ║ +# ║ AUTOGENERATED! DO NOT EDIT! ║ +# ║ ║ +# ║ This file is AUTOMATICALLY GENERATED from source modules. ║ +# ║ ANY CHANGES MADE HERE WILL BE LOST when modules are re-exported! ║ +# ║ ║ +# ║ ✅ TO EDIT: modules/source/11_training/training_dev.py ║ +# ║ ✅ TO EXPORT: Run 'tito module complete ' ║ +# ║ ║ +# ║ 🛡️ STUDENT PROTECTION: This file contains optimized implementations. ║ +# ║ Editing it directly may break module functionality and training. 
║ +# ║                                                                               ║ +# ║ 🎓 LEARNING TIP: Work in modules/source/ - that's where real development ║ +# ║ happens! The tinytorch/ directory is just the compiled output. ║ +# ╚═══════════════════════════════════════════════════════════════════════════════╝ # %% auto 0 __all__ = ['MeanSquaredError', 'CrossEntropyLoss', 'BinaryCrossEntropyLoss', 'Accuracy', 'Trainer', 'TrainingPipelineProfiler', @@ -28,6 +43,7 @@ from .layers import Dense from .networks import Sequential, create_mlp from .spatial import Conv2D, flatten from .dataloader import Dataset, DataLoader +from .autograd import Variable # FOR AUTOGRAD INTEGRATION from .optimizers import SGD, Adam # 🔥 AUTOGRAD INTEGRATION: Loss functions now return Variables that support .backward() @@ -51,56 +67,56 @@ class MeanSquaredError: Compute MSE loss between predictions and targets. Args: - y_pred: Model predictions (Tensor or Tensor, shape: [batch_size, ...]) - y_true: True targets (Tensor or Tensor, shape: [batch_size, ...]) + y_pred: Model predictions (Tensor or Variable, shape: [batch_size, ...]) + y_true: True targets (Tensor or Variable, shape: [batch_size, ...]) Returns: - Tensor with scalar loss value that supports .backward() + Variable with scalar loss value that supports .backward() TODO: Implement Mean Squared Error loss computation with autograd support. STEP-BY-STEP IMPLEMENTATION: 1. Convert inputs to Variables if needed for autograd support - 2. Compute difference using Tensor arithmetic: diff = y_pred - y_true + 2. Compute difference using Variable arithmetic: diff = y_pred - y_true 3. Square the differences: squared_diff = diff * diff - 4. Take mean over all elements using Tensor operations - 5. Return as Tensor that supports .backward() for gradient computation + 4. Take mean over all elements using Variable operations + 5.
Return as Variable that supports .backward() for gradient computation EXAMPLE: - y_pred = Tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True) - y_true = Tensor([[1.5, 2.5], [2.5, 3.5]], requires_grad=False) + y_pred = Variable([[1.0, 2.0], [3.0, 4.0]], requires_grad=True) + y_true = Variable([[1.5, 2.5], [2.5, 3.5]], requires_grad=False) loss = mse_loss(y_pred, y_true) loss.backward() # Computes gradients for y_pred LEARNING CONNECTIONS: - **Autograd Integration**: Loss functions must participate in computational graph for backpropagation - **Gradient Flow**: MSE provides smooth gradients that flow backward through the network - - **Tensor Operations**: Using Variables keeps computation in the autograd system + - **Variable Operations**: Using Variables keeps computation in the autograd system - **Training Pipeline**: Loss.backward() triggers gradient computation for entire network HINTS: - - Convert inputs to Variables if needed: Tensor(tensor_data, requires_grad=True) - - Use Tensor arithmetic to maintain autograd graph + - Convert inputs to Variables if needed: Variable(tensor_data, requires_grad=True) + - Use Variable arithmetic to maintain autograd graph - Use operations that preserve gradient computation - - Return Tensor that supports .backward() method + - Return Variable that supports .backward() method """ ### BEGIN SOLUTION # Convert to Variables if needed to support autograd - if not isinstance(y_pred, Tensor): + if not isinstance(y_pred, Variable): if hasattr(y_pred, 'data'): - y_pred = Tensor(y_pred.data, requires_grad=True) + y_pred = Variable(y_pred.data, requires_grad=True) else: - y_pred = Tensor(y_pred, requires_grad=True) + y_pred = Variable(y_pred, requires_grad=True) - if not isinstance(y_true, Tensor): + if not isinstance(y_true, Variable): if hasattr(y_true, 'data'): - y_true = Tensor(y_true.data, requires_grad=False) # Targets don't need gradients + y_true = Variable(y_true.data, requires_grad=False) # Targets don't need gradients else: - 
y_true = Tensor(y_true, requires_grad=False) + y_true = Variable(y_true, requires_grad=False) - # Compute MSE using Tensor operations to maintain autograd graph - diff = y_pred - y_true # Tensor subtraction - squared_diff = diff * diff # Tensor multiplication + # Compute MSE using Variable operations to maintain autograd graph + diff = y_pred - y_true # Variable subtraction + squared_diff = diff * diff # Variable multiplication # Mean operation that preserves gradients # Create a simple mean operation for Variables @@ -109,7 +125,7 @@ class MeanSquaredError: else: mean_data = np.mean(squared_diff.data) - # Create loss Tensor with gradient function for MSE + # Create loss Variable with gradient function for MSE def mse_grad_fn(grad_output): # MSE gradient: 2 * (y_pred - y_true) / n if y_pred.requires_grad: @@ -125,9 +141,9 @@ class MeanSquaredError: else: final_grad = grad_data * grad_output.data - y_pred.backward(Tensor(final_grad)) + y_pred.backward(Variable(final_grad)) - loss = Tensor(mean_data, requires_grad=y_pred.requires_grad, grad_fn=mse_grad_fn) + loss = Variable(mean_data, requires_grad=y_pred.requires_grad, grad_fn=mse_grad_fn) return loss ### END SOLUTION @@ -153,11 +169,11 @@ class CrossEntropyLoss: Compute CrossEntropy loss between predictions and targets. Args: - y_pred: Model predictions (Tensor or Tensor, shape: [batch_size, num_classes]) - y_true: True class indices (Tensor or Tensor, shape: [batch_size]) or one-hot + y_pred: Model predictions (Tensor or Variable, shape: [batch_size, num_classes]) + y_true: True class indices (Tensor or Variable, shape: [batch_size]) or one-hot Returns: - Tensor with scalar loss value that supports .backward() + Variable with scalar loss value that supports .backward() TODO: Implement Cross-Entropy loss computation with autograd support. @@ -166,11 +182,11 @@ class CrossEntropyLoss: 2. Handle both class indices and one-hot encoded labels 3. Apply softmax to predictions for probability distribution 4. 
Compute log probabilities while maintaining gradient flow - 5. Calculate cross-entropy and return Tensor with gradient function + 5. Calculate cross-entropy and return Variable with gradient function EXAMPLE: - y_pred = Tensor([[2.0, 1.0, 0.1], [0.5, 2.1, 0.9]], requires_grad=True) - y_true = Tensor([0, 1], requires_grad=False) # Class indices + y_pred = Variable([[2.0, 1.0, 0.1], [0.5, 2.1, 0.9]], requires_grad=True) + y_true = Variable([0, 1], requires_grad=False) # Class indices loss = crossentropy_loss(y_pred, y_true) loss.backward() # Computes gradients for y_pred @@ -188,17 +204,17 @@ class CrossEntropyLoss: """ ### BEGIN SOLUTION # Convert to Variables if needed to support autograd - if not isinstance(y_pred, Tensor): + if not isinstance(y_pred, Variable): if hasattr(y_pred, 'data'): - y_pred = Tensor(y_pred.data, requires_grad=True) + y_pred = Variable(y_pred.data, requires_grad=True) else: - y_pred = Tensor(y_pred, requires_grad=True) + y_pred = Variable(y_pred, requires_grad=True) - if not isinstance(y_true, Tensor): + if not isinstance(y_true, Variable): if hasattr(y_true, 'data'): - y_true = Tensor(y_true.data, requires_grad=False) + y_true = Variable(y_true.data, requires_grad=False) else: - y_true = Tensor(y_true, requires_grad=False) + y_true = Variable(y_true, requires_grad=False) # Get data for computation if hasattr(y_pred.data, 'data'): @@ -251,9 +267,9 @@ class CrossEntropyLoss: else: final_grad = grad_data * grad_output.data - y_pred.backward(Tensor(final_grad)) + y_pred.backward(Variable(final_grad)) - loss = Tensor(loss_value, requires_grad=y_pred.requires_grad, grad_fn=crossentropy_grad_fn) + loss = Variable(loss_value, requires_grad=y_pred.requires_grad, grad_fn=crossentropy_grad_fn) return loss ### END SOLUTION @@ -281,11 +297,11 @@ class BinaryCrossEntropyLoss: Compute Binary CrossEntropy loss between predictions and targets. 
Args: - y_pred: Model predictions (Tensor or Tensor, shape: [batch_size, 1] or [batch_size]) - y_true: True binary labels (Tensor or Tensor, shape: [batch_size, 1] or [batch_size]) + y_pred: Model predictions (Tensor or Variable, shape: [batch_size, 1] or [batch_size]) + y_true: True binary labels (Tensor or Variable, shape: [batch_size, 1] or [batch_size]) Returns: - Tensor with scalar loss value that supports .backward() + Variable with scalar loss value that supports .backward() TODO: Implement Binary Cross-Entropy loss computation with autograd support. @@ -294,11 +310,11 @@ class BinaryCrossEntropyLoss: 2. Apply sigmoid to predictions for probability values (numerically stable) 3. Compute binary cross-entropy loss while maintaining gradient flow 4. Create gradient function for sigmoid + BCE combination - 5. Return Tensor that supports .backward() for gradient computation + 5. Return Variable that supports .backward() for gradient computation EXAMPLE: - y_pred = Tensor([[2.0], [0.0], [-1.0]], requires_grad=True) # Raw logits - y_true = Tensor([[1.0], [1.0], [0.0]], requires_grad=False) # Binary labels + y_pred = Variable([[2.0], [0.0], [-1.0]], requires_grad=True) # Raw logits + y_true = Variable([[1.0], [1.0], [0.0]], requires_grad=False) # Binary labels loss = bce_loss(y_pred, y_true) loss.backward() # Computes gradients for y_pred @@ -316,17 +332,17 @@ class BinaryCrossEntropyLoss: """ ### BEGIN SOLUTION # Convert to Variables if needed to support autograd - if not isinstance(y_pred, Tensor): + if not isinstance(y_pred, Variable): if hasattr(y_pred, 'data'): - y_pred = Tensor(y_pred.data, requires_grad=True) + y_pred = Variable(y_pred.data, requires_grad=True) else: - y_pred = Tensor(y_pred, requires_grad=True) + y_pred = Variable(y_pred, requires_grad=True) - if not isinstance(y_true, Tensor): + if not isinstance(y_true, Variable): if hasattr(y_true, 'data'): - y_true = Tensor(y_true.data, requires_grad=False) + y_true = Variable(y_true.data, 
requires_grad=False) else: - y_true = Tensor(y_true, requires_grad=False) + y_true = Variable(y_true, requires_grad=False) # Get data for computation if hasattr(y_pred.data, 'data'): @@ -373,9 +389,9 @@ class BinaryCrossEntropyLoss: else: final_grad = grad_data * grad_output.data - y_pred.backward(Tensor(final_grad)) + y_pred.backward(Variable(final_grad)) - loss = Tensor(mean_loss, requires_grad=y_pred.requires_grad, grad_fn=bce_grad_fn) + loss = Variable(mean_loss, requires_grad=y_pred.requires_grad, grad_fn=bce_grad_fn) return loss ### END SOLUTION @@ -594,9 +610,9 @@ class Trainer: # Track metrics if hasattr(loss, 'data'): if hasattr(loss.data, 'data'): - epoch_metrics['loss'] += loss.data.data # Tensor with Tensor data + epoch_metrics['loss'] += loss.data.data # Variable with Tensor data else: - epoch_metrics['loss'] += loss.data # Tensor with numpy data + epoch_metrics['loss'] += loss.data # Variable with numpy data else: epoch_metrics['loss'] += loss # Direct value @@ -667,9 +683,9 @@ class Trainer: # Track metrics if hasattr(loss, 'data'): if hasattr(loss.data, 'data'): - epoch_metrics['loss'] += loss.data.data # Tensor with Tensor data + epoch_metrics['loss'] += loss.data.data # Variable with Tensor data else: - epoch_metrics['loss'] += loss.data # Tensor with numpy data + epoch_metrics['loss'] += loss.data # Variable with numpy data else: epoch_metrics['loss'] += loss # Direct value diff --git a/tinytorch/core/transformers.py b/tinytorch/core/transformers.py index b677ad56..dd4a0f56 100644 --- a/tinytorch/core/transformers.py +++ b/tinytorch/core/transformers.py @@ -1,4 +1,19 @@ -# AUTOGENERATED! DO NOT EDIT! File to edit: ../../modules/14_transformers/transformers_dev.ipynb. +# ╔═══════════════════════════════════════════════════════════════════════════════╗ +# ║ 🚨 CRITICAL WARNING 🚨 ║ +# ║ AUTOGENERATED! DO NOT EDIT! ║ +# ║ ║ +# ║ This file is AUTOMATICALLY GENERATED from source modules. 
║ +# ║ ANY CHANGES MADE HERE WILL BE LOST when modules are re-exported! ║ +# ║ ║ +# ║ ✅ TO EDIT: modules/source/XX_transformers/transformers_dev.py ║ +# ║ ✅ TO EXPORT: Run 'tito module complete ' ║ +# ║ ║ +# ║ 🛡️ STUDENT PROTECTION: This file contains optimized implementations. ║ +# ║ Editing it directly may break module functionality and training. ║ +# ║ ║ +# ║ 🎓 LEARNING TIP: Work in modules/source/ - that's where real development ║ +# ║ happens! The tinytorch/ directory is just the compiled output. ║ +# ╚═══════════════════════════════════════════════════════════════════════════════╝ # %% auto 0 __all__ = ['LayerNorm', 'PositionwiseFeedForward', 'TransformerBlock', 'Transformer', 'TransformerProfiler', diff --git a/tinytorch/nn/__init__.py b/tinytorch/nn/__init__.py index bf1d4dea..8074bc73 100644 --- a/tinytorch/nn/__init__.py +++ b/tinytorch/nn/__init__.py @@ -33,10 +33,9 @@ The key insight: Students implement the core algorithms (conv, linear transforms while this infrastructure provides the clean API they expect from PyTorch. """ -# Import layers from core (these contain the student implementations) +# Import layers from core (these contain the student implementations) from ..core.layers import Linear, Module # Use the same Module class as layers from ..core.spatial import Conv2d -from ..core.activations import ReLU, Sigmoid, Tanh, Softmax # Import transformer components from ..core.embeddings import Embedding, PositionalEncoding @@ -49,42 +48,6 @@ from . 
 import functional
 
 # Make functional available as F (PyTorch convention)
 import tinytorch.nn.functional as F
 
-# Add missing functional interfaces that aren't autogenerated yet
-def mse_loss(predictions, targets):
-    """Modern MSE loss handling both Tensor and Variable inputs."""
-    from ..core.tensor import Tensor
-    import numpy as np
-
-    # Extract actual data, handling nested structures and memoryviews
-    def extract_data(x):
-        if hasattr(x, 'data'):
-            if hasattr(x.data, 'data'):
-                data = x.data.data  # Variable with Tensor data
-            else:
-                data = x.data  # Tensor data
-        else:
-            data = x  # Raw numpy array
-
-        # Convert memoryview to numpy array
-        if isinstance(data, memoryview):
-            data = np.array(data)
-        return data
-
-    # Get the actual numpy arrays
-    pred_data = extract_data(predictions)
-    target_data = extract_data(targets)
-
-    # Compute MSE manually
-    diff = pred_data - target_data
-    squared_diff = diff * diff
-    loss_value = np.mean(squared_diff)
-
-    # Return as simple Tensor (no gradients for now to avoid complexity)
-    return Tensor(np.array(loss_value), requires_grad=False)
-
-# Add mse_loss to functional module dynamically
-F.mse_loss = mse_loss
-
 # Utility functions
 def Parameter(data, requires_grad=True):
     """Create a parameter tensor (learnable weight)."""
@@ -116,12 +79,8 @@ class Sequential(Module):
 
 # Export the main public API
 __all__ = [
     'Module',
-    'Linear',
+    'Linear', 'Conv2d',
-    'ReLU',
-    'Sigmoid',
-    'Tanh',
-    'Softmax',
     'Embedding',
     'PositionalEncoding',
     'SelfAttention',
diff --git a/tinytorch/nn/utils/prune.py b/tinytorch/nn/utils/prune.py
index ca12245b..24df13f8 100644
--- a/tinytorch/nn/utils/prune.py
+++ b/tinytorch/nn/utils/prune.py
@@ -1,3 +1,19 @@
+# ╔═══════════════════════════════════════════════════════════════════════════════╗
+# ║                            🚨 CRITICAL WARNING 🚨                             ║
+# ║                          AUTOGENERATED! DO NOT EDIT!                          ║
+# ║                                                                               ║
+# ║  This file is AUTOMATICALLY GENERATED from source modules.                    ║
+# ║  ANY CHANGES MADE HERE WILL BE LOST when modules are re-exported!             ║
+# ║                                                                               ║
+# ║  ✅ TO EDIT: modules/source/XX_prune/prune_dev.py                             ║
+# ║  ✅ TO EXPORT: Run 'tito module complete '                                    ║
+# ║                                                                               ║
+# ║  🛡️ STUDENT PROTECTION: This file contains critical fixes for Variable/       ║
+# ║  Tensor compatibility. Editing it directly WILL break CIFAR-10 training.      ║
+# ║                                                                               ║
+# ║  🎓 LEARNING TIP: Work in modules/source/ - that's where real development     ║
+# ║  happens! The tinytorch/ directory is just the compiled output.               ║
+# ╚═══════════════════════════════════════════════════════════════════════════════╝
 """
 TinyTorch Pruning - Model Compression via Weight Removal
diff --git a/tinytorch/tinygpt.py b/tinytorch/tinygpt.py
index 771e67be..18f2a1af 100644
--- a/tinytorch/tinygpt.py
+++ b/tinytorch/tinygpt.py
@@ -1,4 +1,19 @@
-# AUTOGENERATED! DO NOT EDIT! File to edit: ../modules/source/temp_holding/16_tinygpt/tinygpt_dev.ipynb.
+# ╔═══════════════════════════════════════════════════════════════════════════════╗
+# ║                            🚨 CRITICAL WARNING 🚨                             ║
+# ║                          AUTOGENERATED! DO NOT EDIT!                          ║
+# ║                                                                               ║
+# ║  This file is AUTOMATICALLY GENERATED from source modules.                    ║
+# ║  ANY CHANGES MADE HERE WILL BE LOST when modules are re-exported!             ║
+# ║                                                                               ║
+# ║  ✅ TO EDIT: modules/source/[unknown]/[unknown]_dev.py                        ║
+# ║  ✅ TO EXPORT: Run 'tito module complete '                                    ║
+# ║                                                                               ║
+# ║  🛡️ STUDENT PROTECTION: This file contains optimized implementations.         ║
+# ║  Editing it directly may break module functionality and training.             ║
+# ║                                                                               ║
+# ║  🎓 LEARNING TIP: Work in modules/source/ - that's where real development     ║
+# ║  happens! The tinytorch/ directory is just the compiled output.               ║
+# ╚═══════════════════════════════════════════════════════════════════════════════╝
 
 # %% auto 0
 __all__ = ['CrossEntropyLoss', 'Trainer', 'no_grad', 'CharTokenizer', 'MultiHeadAttention', 'create_causal_mask', 'LayerNorm',
diff --git a/tinytorch/utils/profiler/__init__.py b/tinytorch/utils/profiler/__init__.py
index e6b8a8b0..e9e536aa 100644
--- a/tinytorch/utils/profiler/__init__.py
+++ b/tinytorch/utils/profiler/__init__.py
@@ -1,315 +1,239 @@
-# AUTOGENERATED FROM modules/15_profiling/profiling_dev.py
-# Profiling utilities for performance analysis
+"""
+TinyTorch Profiler
 
-__all__ = ['SimpleProfiler', 'profile_function', 'Timer', 'MemoryProfiler', 'FLOPCounter', 'ProfilerContext']
+A lightweight profiling utility for measuring performance of ML operations.
+Following PyTorch's pattern with torch.profiler, this module provides
+educational profiling tools for understanding ML performance.
+
+Usage:
+    from tinytorch.profiler import SimpleProfiler
+
+    profiler = SimpleProfiler()
+    result = profiler.profile(my_function, *args, **kwargs)
+    profiler.print_result(result)
+
+Similar to:
+    torch.profiler.profile() - PyTorch's profiling context manager
+    tf.profiler - TensorFlow's profiling utilities
+    jax.profiler - JAX's profiling tools
+"""
 
 import time
-import gc
-import tracemalloc
-from typing import Dict, List, Callable, Any, Tuple, Optional
-from contextlib import contextmanager
-import statistics
 import sys
+import gc
+import numpy as np
+from typing import Callable, Dict, Any, Optional
 
-class Timer:
-    """
-    Professional timing infrastructure with statistical rigor.
-
-    Features:
-    - Warmup runs to eliminate cold start effects
-    - Multiple measurements for statistical confidence
-    - Garbage collection control to reduce noise
-    - Percentile reporting (p50, p95, p99)
-    - High-precision timing with best available clock
-    """
-
-    def __init__(self):
-        # Use the most precise timer available
-        self.timer_func = time.perf_counter
-        self.measurements = []
-
-    def measure(self, func: Callable, warmup: int = 3, runs: int = 100,
-                args: tuple = (), kwargs: dict = None) -> Dict[str, float]:
-        """
-        Measure function execution time with statistical rigor.
-
-        Args:
-            func: Function to measure
-            warmup: Number of warmup runs (eliminate cold start)
-            runs: Number of measurement runs
-            args: Arguments to pass to function
-            kwargs: Keyword arguments to pass to function
-
-        Returns:
-            Dict with timing statistics (mean, std, percentiles)
-        """
-        if kwargs is None:
-            kwargs = {}
-
-        self.measurements = []
-
-        # Warmup runs to get code in CPU cache
-        for _ in range(warmup):
-            _ = func(*args, **kwargs)
-
-        # Force garbage collection before timing
-        gc.collect()
-
-        # Actual measurements
-        for i in range(runs):
-            # Disable GC during measurement for consistency
-            gc_was_enabled = gc.isenabled()
-            gc.disable()
-
-            try:
-                start_time = self.timer_func()
-                result = func(*args, **kwargs)
-                end_time = self.timer_func()
-
-                execution_time = end_time - start_time
-                self.measurements.append(execution_time)
-
-            finally:
-                # Restore GC state
-                if gc_was_enabled:
-                    gc.enable()
-
-        # Calculate statistics
-        return self._compute_stats()
-
-    def _compute_stats(self) -> Dict[str, float]:
-        """Compute comprehensive timing statistics."""
-        if not self.measurements:
-            return {}
-
-        measurements_ms = [t * 1000 for t in self.measurements]  # Convert to ms
-
-        stats = {
-            'mean_ms': statistics.mean(measurements_ms),
-            'std_ms': statistics.stdev(measurements_ms) if len(measurements_ms) > 1 else 0,
-            'min_ms': min(measurements_ms),
-            'max_ms': max(measurements_ms),
-            'p50_ms': statistics.median(measurements_ms),
-            'p95_ms': self._percentile(measurements_ms, 95),
-            'p99_ms': self._percentile(measurements_ms, 99),
-            'runs': len(measurements_ms)
-        }
-
-        return stats
-
-    def _percentile(self, data: List[float], percentile: float) -> float:
-        """Calculate percentile of data."""
-        sorted_data = sorted(data)
-        k = (len(sorted_data) - 1) * percentile / 100
-        f = int(k)
-        c = k - f
-
-        if f + 1 < len(sorted_data):
-            return sorted_data[f] * (1 - c) + sorted_data[f + 1] * c
-        else:
-            return sorted_data[f]
-
-
-class MemoryProfiler:
-    """
-    Memory usage profiler with allocation tracking.
-
-    Features:
-    - Peak memory usage during execution
-    - Memory allocation tracking with tracemalloc
-    - Memory leak detection
-    - Growth pattern analysis
-    """
-
-    def __init__(self):
-        self.baseline_memory = 0
-        self.peak_memory = 0
-        self.allocations = []
-
-    def profile(self, func: Callable, args: tuple = (), kwargs: dict = None) -> Dict[str, Any]:
-        """
-        Profile memory usage during function execution.
-
-        Args:
-            func: Function to profile
-            args: Arguments to pass to function
-            kwargs: Keyword arguments
-
-        Returns:
-            Dict with memory usage statistics
-        """
-        if kwargs is None:
-            kwargs = {}
-
-        # Start memory tracing
-        tracemalloc.start()
-
-        # Record baseline
-        baseline_snapshot = tracemalloc.take_snapshot()
-        baseline_stats = baseline_snapshot.statistics('filename')
-        baseline_size = sum(stat.size for stat in baseline_stats)
-
-        try:
-            # Execute function
-            result = func(*args, **kwargs)
-
-            # Take final snapshot
-            final_snapshot = tracemalloc.take_snapshot()
-            final_stats = final_snapshot.statistics('filename')
-            final_size = sum(stat.size for stat in final_stats)
-
-            # Get peak memory
-            current, peak = tracemalloc.get_traced_memory()
-
-            # Stop tracing
-            tracemalloc.stop()
-
-            # Compute memory statistics
-            memory_stats = {
-                'baseline_mb': baseline_size / (1024 * 1024),
-                'final_mb': final_size / (1024 * 1024),
-                'peak_mb': peak / (1024 * 1024),
-                'allocated_mb': (final_size - baseline_size) / (1024 * 1024),
-                'result': result
-            }
-
-            return memory_stats
-
-        except Exception as e:
-            tracemalloc.stop()
-            raise e
-
-
-class FLOPCounter:
-    """
-    Count floating point operations (FLOPs) in neural network operations.
-
-    Features:
-    - Track multiply-accumulate (MAC) operations
-    - Handle different layer types (Linear, Conv2d, Attention)
-    - Provide operation breakdown by type
-    - Compare theoretical vs practical complexity
-    """
-
-    def __init__(self):
-        self.operation_counts = {
-            'multiply': 0,
-            'add': 0,
-            'total_flops': 0
-        }
-        self.layer_breakdown = {}
-
-    def reset(self):
-        """Reset all counters."""
-        self.operation_counts = {
-            'multiply': 0,
-            'add': 0,
-            'total_flops': 0
-        }
-        self.layer_breakdown = {}
-
-
-class ProfilerContext:
-    """
-    Comprehensive profiling context manager.
-
-    Combines timing, memory, and FLOP analysis into a single tool.
-    Perfect for profiling model forward passes and identifying bottlenecks.
-
-    Usage:
-        with ProfilerContext("MyModel") as profiler:
-            result = model.forward(input)
-        # Automatic report generation
-    """
-
-    def __init__(self, name: str = "Operation",
-                 timing_runs: int = 10,
-                 timing_warmup: int = 2,
-                 enable_memory: bool = True,
-                 enable_flops: bool = False):
-        """
-        Initialize profiling context.
-
-        Args:
-            name: Name for the operation being profiled
-            timing_runs: Number of timing measurements
-            timing_warmup: Number of warmup runs
-            enable_memory: Whether to profile memory usage
-            enable_flops: Whether to count FLOPs (manual)
-        """
-        self.name = name
-        self.timing_runs = timing_runs
-        self.timing_warmup = timing_warmup
-        self.enable_memory = enable_memory
-        self.enable_flops = enable_flops
-
-        # Profiling tools
-        self.timer = Timer()
-        self.memory_profiler = MemoryProfiler() if enable_memory else None
-        self.flop_counter = FLOPCounter() if enable_flops else None
-
-        # Results storage
-        self.timing_stats = {}
-        self.memory_stats = {}
-        self.results = {}
-
-    def __enter__(self):
-        """Start profiling context."""
-        if self.enable_memory:
-            # Start memory tracing
-            if not tracemalloc.is_tracing():
-                tracemalloc.start()
-
-        return self
-
-    def __exit__(self, exc_type, exc_val, exc_tb):
-        """End profiling and generate report."""
-        if exc_type is not None:
-            return False
-        return False
+try:
+    import psutil
+    HAS_PSUTIL = True
+except ImportError:
+    HAS_PSUTIL = False
 
+try:
+    import tracemalloc
+    HAS_TRACEMALLOC = True
+except ImportError:
+    HAS_TRACEMALLOC = False
 
 class SimpleProfiler:
     """
-    Simple profiler interface expected by benchmarking module.
-    Wrapper around the comprehensive ProfilerContext for easy use.
+    Simple profiler for measuring individual function performance.
+
+    Measures timing, memory usage, and other key metrics for a single function.
+    Students collect multiple measurements and compare results themselves.
""" - def __init__(self, track_memory=True, track_cpu=True): - self.track_memory = track_memory - self.track_cpu = track_cpu - self.timer = Timer() - self.memory_profiler = MemoryProfiler() if track_memory else None + def __init__(self, track_memory: bool = True, track_cpu: bool = True): + self.track_memory = track_memory and HAS_TRACEMALLOC + self.track_cpu = track_cpu and HAS_PSUTIL - def profile(self, func, *args, name="operation", warmup=True): - """Profile a function call and return comprehensive results.""" - if warmup: - # Warmup run - _ = func(*args) + if self.track_memory: + tracemalloc.start() + + def _get_memory_info(self) -> Dict[str, Any]: + """Get current memory information.""" + if not self.track_memory: + return {} + + try: + current, peak = tracemalloc.get_traced_memory() + return { + 'current_memory_mb': current / 1024 / 1024, + 'peak_memory_mb': peak / 1024 / 1024 + } + except: + return {} + + def _get_cpu_info(self) -> Dict[str, Any]: + """Get current CPU information.""" + if not self.track_cpu: + return {} + + try: + process = psutil.Process() + return { + 'cpu_percent': process.cpu_percent(), + 'memory_percent': process.memory_percent(), + 'num_threads': process.num_threads() + } + except: + return {} + + def _get_array_info(self, result: Any) -> Dict[str, Any]: + """Get information about numpy arrays.""" + if not isinstance(result, np.ndarray): + return {} + + return { + 'result_shape': result.shape, + 'result_dtype': str(result.dtype), + 'result_size_mb': result.nbytes / 1024 / 1024, + 'result_elements': result.size + } + + def profile(self, func: Callable, *args, name: Optional[str] = None, warmup: bool = True, **kwargs) -> Dict[str, Any]: + """ + Profile a single function execution with comprehensive metrics. 
+
+        Args:
+            func: Function to profile
+            *args: Arguments to pass to function
+            name: Optional name for the function (defaults to func.__name__)
+            warmup: Whether to do a warmup run (recommended for fair timing)
+            **kwargs: Keyword arguments to pass to function
-        # Time the operation
-        timing_stats = self.timer.measure(func, warmup=2, runs=10, args=args)
+
+        Returns:
+            Dictionary with comprehensive performance metrics
+
+        Example:
+            profiler = SimpleProfiler()
+            result = profiler.profile(my_function, arg1, arg2, name="My Function")
+            print(f"Time: {result['wall_time']:.4f}s")
+            print(f"Memory: {result['memory_delta_mb']:.2f}MB")
+        """
+        func_name = name or func.__name__
 
-        result_dict = {
-            'wall_time': timing_stats['mean_ms'] / 1000,  # Convert to seconds
-            'cpu_time': timing_stats['mean_ms'] / 1000,  # Simplified
-            'cpu_efficiency': 0.85,  # Mock reasonable value
-            'name': name
+        # Reset memory tracking
+        if self.track_memory:
+            tracemalloc.clear_traces()
+
+        # Warm up (important for fair comparison)
+        if warmup:
+            try:
+                warmup_result = func(*args, **kwargs)
+                del warmup_result
+            except Exception:  # a failing warmup should not abort the measured run
+                pass
+
+        # Force garbage collection for clean measurement
+        gc.collect()
+
+        # Get baseline measurements
+        memory_before = self._get_memory_info()
+        cpu_before = self._get_cpu_info()
+
+        # Time the actual execution (perf_counter: monotonic, high-resolution)
+        start_time = time.perf_counter()
+        start_cpu_time = time.process_time()
+
+        result = func(*args, **kwargs)
+
+        end_time = time.perf_counter()
+        end_cpu_time = time.process_time()
+
+        # Get post-execution measurements
+        memory_after = self._get_memory_info()
+        cpu_after = self._get_cpu_info()
+
+        # Calculate metrics
+        wall_time = end_time - start_time
+        cpu_time = end_cpu_time - start_cpu_time
+
+        profile_result = {
+            'name': func_name,
+            'wall_time': wall_time,
+            'cpu_time': cpu_time,
+            'cpu_efficiency': (cpu_time / wall_time) if wall_time > 0 else 0,
+            'result': result
         }
 
-        # Add memory stats if enabled
-        if self.memory_profiler:
-            memory_stats = self.memory_profiler.profile(func, args)
-            result_dict.update({
-                'memory_delta_mb': memory_stats.get('allocated_mb', 0),
-                'peak_memory_mb': memory_stats.get('peak_mb', 0),
-                'result_size_mb': 0.1  # Mock value
+        # Add memory metrics
+        if self.track_memory and memory_before and memory_after:
+            profile_result.update({
+                'memory_before_mb': memory_before.get('current_memory_mb', 0),
+                'memory_after_mb': memory_after.get('current_memory_mb', 0),
+                'peak_memory_mb': memory_after.get('peak_memory_mb', 0),
+                'memory_delta_mb': memory_after.get('current_memory_mb', 0) - memory_before.get('current_memory_mb', 0)
             })
-
-        return result_dict
+
+        # Add CPU metrics
+        if self.track_cpu and cpu_after:
+            profile_result.update({
+                'cpu_percent': cpu_after.get('cpu_percent', 0),
+                'memory_percent': cpu_after.get('memory_percent', 0),
+                'num_threads': cpu_after.get('num_threads', 1)
+            })
+
+        # Add array information
+        profile_result.update(self._get_array_info(result))
+
+        return profile_result
+
+    def print_result(self, profile_result: Dict[str, Any], show_details: bool = False) -> None:
+        """
+        Print profiling results in a readable format.
+
+        Args:
+            profile_result: Result from profile() method
+            show_details: Whether to show detailed metrics
+        """
+        name = profile_result['name']
+        wall_time = profile_result['wall_time']
+
+        print(f"📊 {name}: {wall_time:.4f}s")
+
+        if show_details:
+            if 'memory_delta_mb' in profile_result:
+                print(f"   💾 Memory: {profile_result['memory_delta_mb']:.2f}MB delta, {profile_result['peak_memory_mb']:.2f}MB peak")
+            if 'result_size_mb' in profile_result:
+                print(f"   🔢 Output: {profile_result['result_shape']} ({profile_result['result_size_mb']:.2f}MB)")
+            if 'cpu_efficiency' in profile_result:
+                print(f"   ⚡ CPU: {profile_result['cpu_efficiency']:.2f} efficiency")
+
+    def get_capabilities(self) -> Dict[str, bool]:
+        """Get information about profiler capabilities."""
+        return {
+            'memory_tracking': self.track_memory,
+            'cpu_tracking': self.track_cpu,
+            'has_psutil': HAS_PSUTIL,
+            'has_tracemalloc': HAS_TRACEMALLOC
+        }
 
-
-def profile_function(func, *args, **kwargs):
-    """Simple function profiler decorator/utility."""
-    profiler = SimpleProfiler()
-    return profiler.profile(func, *args, **kwargs)
\ No newline at end of file
+# Convenience function for quick profiling
+def profile_function(func: Callable, *args, name: Optional[str] = None,
+                     show_details: bool = False, **kwargs) -> Dict[str, Any]:
+    """
+    Quick profiling of a single function.
+
+    Args:
+        func: Function to profile
+        *args: Arguments to pass to function
+        name: Optional name for the function
+        show_details: Whether to print detailed metrics
+        **kwargs: Keyword arguments to pass to function
+
+    Returns:
+        Dictionary with profiling results
+
+    Example:
+        result = profile_function(my_matmul, A, B, name="Custom MatMul", show_details=True)
+        print(f"Execution time: {result['wall_time']:.4f}s")
+    """
+    profiler = SimpleProfiler(track_memory=True, track_cpu=True)
+    result = profiler.profile(func, *args, name=name, **kwargs)
+
+    if show_details:
+        profiler.print_result(result, show_details=True)
+
+    return result
\ No newline at end of file
diff --git a/tito/commands/export.py b/tito/commands/export.py
index 4fb30306..03ed6fc3 100644
--- a/tito/commands/export.py
+++ b/tito/commands/export.py
@@ -231,9 +231,20 @@ class ExportCommand(BaseCommand):
             with open(py_file, 'r', encoding='utf-8') as f:
                 content = f.read()
 
-            # Check if warning already exists
-            if "AUTOGENERATED! DO NOT EDIT!" in content:
-                continue  # Already has warning
+            # Check if warning already exists (check for the box format specifically)
+            if "╔═══════════════════════════════════════════════════════════════════════════════╗" in content:
+                continue  # Already has the new warning format
+
+            # Remove old header format if it exists
+            if "AUTOGENERATED! DO NOT EDIT! File to edit:" in content:
+                lines = content.split('\n')
+                # Remove the old header line (usually first line)
+                if lines and "AUTOGENERATED! DO NOT EDIT! File to edit:" in lines[0]:
+                    lines = lines[1:]  # Remove first line
+                    # Also remove empty line after if it exists
+                    if lines and lines[0].strip() == "":
+                        lines = lines[1:]
+                content = '\n'.join(lines)
 
             # Find the source file for this export
             source_file = self._find_source_file_for_export(py_file)
@@ -249,8 +260,8 @@
 # ║  ✅ TO EDIT: {source_file:<54} ║
 # ║  ✅ TO EXPORT: Run 'tito module complete '                                    ║
 # ║                                                                               ║
-# ║  🛡️ STUDENT PROTECTION: This file contains critical fixes for Variable/       ║
-# ║  Tensor compatibility. Editing it directly WILL break CIFAR-10 training.      ║
+# ║  🛡️ STUDENT PROTECTION: This file contains optimized implementations.         ║
+# ║  Editing it directly may break module functionality and training.             ║
 # ║                                                                               ║
 # ║  🎓 LEARNING TIP: Work in modules/source/ - that's where real development     ║
 # ║  happens! The tinytorch/ directory is just the compiled output.               ║
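
The profiler rewrite in this patch boils down to one measurement pattern: warm up, collect garbage, then record wall time, CPU time, and peak traced memory around a single call. Below is a minimal stdlib-only sketch of that pattern for review purposes; the function name `quick_profile` and the returned keys are illustrative, not the package's actual API.

```python
import gc
import time
import tracemalloc

def quick_profile(func, *args, warmup=True, **kwargs):
    """Sketch of the SimpleProfiler measurement pattern (illustrative only)."""
    if warmup:
        func(*args, **kwargs)       # prime caches/allocators for fairer timing
    gc.collect()                    # reduce garbage-collection noise

    tracemalloc.start()             # track Python-level allocations
    start_wall = time.perf_counter()
    start_cpu = time.process_time()

    result = func(*args, **kwargs)  # the measured run

    cpu_time = time.process_time() - start_cpu
    wall_time = time.perf_counter() - start_wall
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return {
        'result': result,
        'wall_time': wall_time,
        'cpu_time': cpu_time,
        'peak_memory_mb': peak / 1024 / 1024,
    }

stats = quick_profile(sum, range(100_000))
print(f"{stats['wall_time']:.6f}s wall, peak {stats['peak_memory_mb']:.3f}MB")
```

Using `time.perf_counter()` alongside `time.process_time()` separates wall-clock latency from CPU work, which is how the patched `SimpleProfiler` derives its `cpu_efficiency` ratio.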