TinyTorch/modules/source/16_acceleration/acceleration_dev.ipynb

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6a0bea02",
   "metadata": {},
   "outputs": [],
   "source": [
    "#| default_exp optimization.acceleration\n",
    "#| export"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a9ac4364",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "# Module 16: Acceleration - Making Models Run Faster\n",
    "\n",
    "Welcome to Module 16! You're about to master the art of neural network acceleration through vectorization, kernel fusion, and mixed precision training.\n",
    "\n",
    "## 🔗 Prerequisites & Progress\n",
    "**You've Built**: Complete training pipeline with profiling capabilities\n",
    "**You'll Build**: Acceleration techniques including vectorization, operation fusion, and mixed precision\n",
    "**You'll Enable**: Production-ready optimization for real-world deployment\n",
    "\n",
    "**Connection Map**:\n",
    "```\n",
    "Profiling (Module 15) → Acceleration (Module 16) → Quantization (Module 17)\n",
    "(measurement)         (optimization)             (precision reduction)\n",
    "```\n",
    "\n",
    "## Learning Objectives\n",
    "By the end of this module, you will:\n",
    "1. Implement vectorized operations for maximum throughput\n",
    "2. Create fused operations to reduce memory bandwidth\n",
    "3. Build mixed precision training for memory efficiency\n",
    "4. Understand the relationship between compute and memory bandwidth\n",
    "5. Analyze acceleration trade-offs in production systems\n",
    "\n",
    "Let's optimize for speed!\n",
    "\n",
    "## 📦 Where This Code Lives in the Final Package\n",
    "\n",
    "**Learning Side:** You work in `modules/16_acceleration/acceleration_dev.py`  \n",
    "**Building Side:** Code exports to `tinytorch.optimization.acceleration`\n",
    "\n",
    "```python\n",
    "# How to use this module:\n",
    "from tinytorch.optimization.acceleration import vectorized_matmul, fused_gelu, MixedPrecisionTrainer\n",
    "```\n",
    "\n",
    "**Why this matters:**\n",
    "- **Learning:** Complete acceleration system in one focused module for deep understanding\n",
    "- **Production:** Proper organization like PyTorch's torch.amp and torch.jit with optimization components\n",
    "- **Consistency:** All acceleration operations and mixed precision training in optimization.acceleration\n",
    "- **Integration:** Works seamlessly with profiling for complete performance optimization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "59fd81f7",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import time\n",
    "from typing import Dict, List, Tuple, Optional, Any, Union\n",
    "import warnings"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e350bf3e",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## 1. Introduction - The Performance Challenge\n",
    "\n",
    "Modern neural networks face two fundamental bottlenecks that limit their speed:\n",
    "\n",
    "### The Two Enemies of Performance\n",
    "\n",
    "**1. Compute Bound Operations:**\n",
    "```\n",
    "CPU/GPU Cores: [====BUSY====] [====BUSY====] [====BUSY====]\n",
    "Memory Bus:    [---idle---] [---idle---] [---idle---]\n",
    "\n",
    "When: Matrix multiplication, convolutions\n",
    "Solution: Vectorization, better algorithms\n",
    "```\n",
    "\n",
    "**2. Memory Bound Operations:**\n",
    "```\n",
    "CPU/GPU Cores: [--idle--] [--idle--] [--idle--]\n",
    "Memory Bus:    [========SATURATED========]\n",
    "\n",
    "When: Element-wise operations, small tensors\n",
    "Solution: Kernel fusion, memory layout optimization\n",
    "```\n",
    "\n",
    "### The Roofline Model - Your Performance Compass\n",
    "\n",
    "Every processor has fundamental limits:\n",
    "\n",
    "```\n",
    "Performance    │   Compute Bound Region\n",
    "(GFLOPS)      │  ┌─────────────────────\n",
    "              │  │ Peak Performance\n",
    "              │  │\n",
    "              │ ╱│ Memory Bound Region\n",
    "              │╱ │\n",
    "             ╱│  │\n",
    "            ╱ │  │\n",
    "           ╱  │  │\n",
    "          ╱───│──│───────────────────────\n",
    "         ╱    │  │\n",
    "        ╱     │  │\n",
    "       ╱──────│──│────────────────── Arithmetic Intensity\n",
    "              │  │        (FLOPs/Byte)\n",
    "           Low│  │High\n",
    "```\n",
    "\n",
    "**Key Insight**: Understand where your operations live on this graph to optimize effectively.\n",
    "\n",
    "### Why This Module Matters\n",
    "\n",
    "Real-world performance wins:\n",
    "- **2-5× speedup** from vectorization\n",
    "- **30-50% memory reduction** from mixed precision\n",
    "- **2-3× throughput** from kernel fusion\n",
    "- **10× scaling improvement** for large models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8c8b7618",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "tensor-import",
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "# Import required dependencies\n",
    "### BEGIN SOLUTION\n",
    "# Import tensor from our implementation\n",
    "import sys\n",
    "import os\n",
    "sys.path.append('/Users/VJ/GitHub/TinyTorch')\n",
    "\n",
    "try:\n",
    "    # Import from the modules directory structure\n",
    "    import importlib.util\n",
    "    spec = importlib.util.spec_from_file_location(\"tensor_dev\", \"/Users/VJ/GitHub/TinyTorch/modules/01_tensor/tensor_dev.py\")\n",
    "    tensor_module = importlib.util.module_from_spec(spec)\n",
    "    spec.loader.exec_module(tensor_module)\n",
    "    Tensor = tensor_module.Tensor\n",
    "except ImportError:\n",
    "    # Fallback for testing\n",
    "    class Tensor:\n",
    "        def __init__(self, data, requires_grad=False):\n",
    "            self.data = np.array(data, dtype=np.float32)\n",
    "            self.shape = self.data.shape\n",
    "            self.requires_grad = requires_grad\n",
    "            self.grad = None\n",
    "\n",
    "        def __add__(self, other):\n",
    "            return Tensor(self.data + other.data)\n",
    "\n",
    "        def __mul__(self, other):\n",
    "            return Tensor(self.data * other.data)\n",
    "\n",
    "        def matmul(self, other):\n",
    "            return Tensor(np.dot(self.data, other.data))\n",
    "\n",
    "        def reshape(self, *shape):\n",
    "            return Tensor(self.data.reshape(shape))\n",
    "\n",
    "        def sum(self, axis=None):\n",
    "            return Tensor(self.data.sum(axis=axis))\n",
    "\n",
    "        def backward(self):\n",
    "            pass\n",
    "### END SOLUTION"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9a445584",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## 2. Foundations - Vectorization: From Loops to Lightning\n",
    "\n",
    "### The SIMD Revolution\n",
    "\n",
    "Modern processors can execute **Single Instruction, Multiple Data** operations:\n",
    "\n",
    "```\n",
    "Traditional Loop (Scalar):               SIMD Vectorized:\n",
    "for i in range(4):        ┌─────┐      ┌─────┬─────┬─────┬─────┐\n",
    "    c[i] = a[i] + b[i]    │ ALU │  →   │ALU 0│ALU 1│ALU 2│ALU 3│\n",
    "                          └─────┘      └─────┴─────┴─────┴─────┘\n",
    "                          1 element     4 elements per cycle\n",
    "                          per cycle\n",
    "```\n",
    "\n",
    "### Memory Access Patterns: The Hidden Performance Killer\n",
    "\n",
    "```\n",
    "Sequential Access (FAST):\n",
    "Memory: [A][B][C][D][E][F][G][H]\n",
    "Access:  ↓  ↓  ↓  ↓  → Cache friendly\n",
    "\n",
    "Strided Access (SLOWER):\n",
    "Memory: [A][ ][B][ ][C][ ][D][ ]\n",
    "Access:  ↓     ↓     ↓     ↓   → Cache misses\n",
    "\n",
    "Random Access (SLOWEST):\n",
    "Memory: [A][B][C][D][E][F][G][H]\n",
    "Access:  ↓     ↑  ↓     ↑       → Cache chaos\n",
    "```\n",
    "\n",
    "### Matrix Multiplication: The King of Vectorization\n",
    "\n",
    "Matrix multiplication is **perfectly suited** for vectorization:\n",
    "\n",
    "```\n",
    "Matrix A (M×K) × Matrix B (K×N) = Matrix C (M×N)\n",
    "\n",
    "Computation Pattern:\n",
    "┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐\n",
    "│ a₁₁ a₁₂ a₁₃ a₁₄│ × │ b₁₁ b₁₂ b₁₃ b₁₄│ = │ c₁₁ c₁₂ c₁₃ c₁₄│\n",
    "│ a₂₁ a₂₂ a₂₃ a₂₄│   │ b₂₁ b₂₂ b₂₃ b₂₄│   │ c₂₁ c₂₂ c₂₃ c₂₄│\n",
    "│ a₃₁ a₃₂ a₃₃ a₃₄│   │ b₃₁ b₃₂ b₃₃ b₃₄│   │ c₃₁ c₃₂ c₃₃ c₃₄│\n",
    "│ a₄₁ a₄₂ a₄₃ a₄₄│   │ b₄₁ b₄₂ b₄₃ b₄₄│   │ c₄₁ c₄₂ c₄₃ c₄₄│\n",
    "└─────────────────┘   └─────────────────┘   └─────────────────┘\n",
    "\n",
    "For c₁₁: Row₁ · Column₁ = a₁₁×b₁₁ + a₁₂×b₂₁ + a₁₃×b₃₁ + a₁₄×b₄₁\n",
    "                                    ↑\n",
    "                              VECTORIZABLE!\n",
    "```\n",
    "\n",
    "**Why vectorization wins:**\n",
    "- **High arithmetic intensity**: 2N³ FLOPs for N³ data\n",
    "- **Predictable memory access**: Sequential row/column reads\n",
    "- **Parallelizable**: Independent dot products\n",
    "- **Cache-friendly**: Data reuse in inner loops"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "01b0e1a7",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "vectorized-matmul",
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "def vectorized_matmul(a: Tensor, b: Tensor) -> Tensor:\n",
    "    \"\"\"\n",
    "    High-performance matrix multiplication using vectorized operations.\n",
    "\n",
    "    This implementation leverages optimized BLAS libraries that use:\n",
    "    - SIMD instructions for parallel computation\n",
    "    - Cache-blocking for memory efficiency\n",
    "    - Multi-threading for CPU parallelization\n",
    "\n",
    "    TODO: Implement production-grade matrix multiplication\n",
    "\n",
    "    APPROACH:\n",
    "    1. Validate shapes are compatible for matrix multiplication\n",
    "    2. Use NumPy's optimized dot product (calls BLAS GEMM)\n",
    "    3. Return result wrapped in Tensor\n",
    "\n",
    "    EXAMPLE:\n",
    "    Matrix multiplication visualization:\n",
    "    >>> a = Tensor([[1, 2], [3, 4]])  # 2×2\n",
    "    >>> b = Tensor([[5, 6], [7, 8]])  # 2×2\n",
    "    >>> result = vectorized_matmul(a, b)\n",
    "    >>> print(result.data)\n",
    "    [[19 22]    # [1×5+2×7, 1×6+2×8] = [19, 22]\n",
    "     [43 50]]   # [3×5+4×7, 3×6+4×8] = [43, 50]\n",
    "\n",
    "    PERFORMANCE CHARACTERISTICS:\n",
    "    - Time Complexity: O(N³) but highly optimized\n",
    "    - Space Complexity: O(N²) for result\n",
    "    - Arithmetic Intensity: 2N³ FLOPs / 3N² bytes = 2N/3 (good for large N)\n",
    "\n",
    "    HINTS:\n",
    "    - Check a.shape[-1] == b.shape[-2] for inner dimension match\n",
    "    - Use np.matmul() for batch support and optimization\n",
    "    - Trust BLAS to handle the vectorization magic\n",
    "    \"\"\"\n",
    "    ### BEGIN SOLUTION\n",
    "    # Input validation for matrix multiplication\n",
    "    if len(a.shape) < 2 or len(b.shape) < 2:\n",
    "        raise ValueError(\n",
    "            f\"Matrix multiplication requires 2D+ tensors, got shapes {a.shape} and {b.shape}. \"\n",
    "            f\"💡 HINT: Use reshape() to add dimensions if needed.\"\n",
    "        )\n",
    "\n",
    "    if a.shape[-1] != b.shape[-2]:\n",
    "        raise ValueError(\n",
    "            f\"Matrix multiplication shape mismatch: {a.shape} @ {b.shape}. \"\n",
    "            f\"Inner dimensions must match: a.shape[-1]={a.shape[-1]} != b.shape[-2]={b.shape[-2]}. \"\n",
    "            f\"💡 HINT: For A@B, A's columns must equal B's rows.\"\n",
    "        )\n",
    "\n",
    "    # Use NumPy's highly optimized matrix multiplication\n",
    "    # This calls BLAS GEMM (General Matrix Multiply), which uses:\n",
    "    # - SIMD vectorization for parallel arithmetic\n",
    "    # - Cache blocking for memory efficiency\n",
    "    # - Multi-threading on multi-core systems\n",
    "    result_data = np.matmul(a.data, b.data)\n",
    "\n",
    "    return Tensor(result_data)\n",
    "    ### END SOLUTION"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ae44b17e",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "test-vectorized-matmul",
     "locked": true,
     "points": 10
    }
   },
   "outputs": [],
   "source": [
    "def test_unit_vectorized_matmul():\n",
    "    \"\"\"🔬 Test vectorized matrix multiplication implementation.\"\"\"\n",
    "    print(\"🔬 Unit Test: Vectorized Matrix Multiplication...\")\n",
    "\n",
    "    # Test basic 2D multiplication\n",
    "    a = Tensor([[1, 2], [3, 4]])\n",
    "    b = Tensor([[5, 6], [7, 8]])\n",
    "    result = vectorized_matmul(a, b)\n",
    "\n",
    "    expected = np.array([[19, 22], [43, 50]])\n",
    "    assert np.allclose(result.data, expected), f\"Basic matmul failed: expected {expected}, got {result.data}\"\n",
    "\n",
    "    # Test batch multiplication (3D tensors)\n",
    "    batch_size, m, k, n = 2, 3, 4, 5\n",
    "    a_batch = Tensor(np.random.randn(batch_size, m, k))\n",
    "    b_batch = Tensor(np.random.randn(batch_size, k, n))\n",
    "    result_batch = vectorized_matmul(a_batch, b_batch)\n",
    "\n",
    "    assert result_batch.shape == (batch_size, m, n), f\"Wrong batch shape: {result_batch.shape}\"\n",
    "\n",
    "    # Test broadcasting (different batch dimensions)\n",
    "    a_single = Tensor(np.random.randn(m, k))\n",
    "    b_batch = Tensor(np.random.randn(batch_size, k, n))\n",
    "    result_broadcast = vectorized_matmul(a_single, b_batch)\n",
    "\n",
    "    assert result_broadcast.shape == (batch_size, m, n), f\"Broadcasting failed: {result_broadcast.shape}\"\n",
    "\n",
    "    # Test error cases\n",
    "    try:\n",
    "        vectorized_matmul(Tensor([1, 2, 3]), Tensor([4, 5]))  # 1D tensors\n",
    "        assert False, \"Should reject 1D tensors\"\n",
    "    except ValueError as e:\n",
    "        assert \"2D+\" in str(e)\n",
    "\n",
    "    try:\n",
    "        vectorized_matmul(Tensor([[1, 2]]), Tensor([[1], [2], [3]]))  # Shape mismatch\n",
    "        assert False, \"Should reject incompatible shapes\"\n",
    "    except ValueError as e:\n",
    "        assert \"shape mismatch\" in str(e).lower()\n",
    "\n",
    "    print(\"✅ vectorized_matmul works correctly!\")\n",
    "\n",
    "test_unit_vectorized_matmul()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "85cd07f9",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## 3. Implementation - Kernel Fusion: Eliminating Memory Bottlenecks\n",
    "\n",
    "### The Memory Bandwidth Crisis\n",
    "\n",
    "Consider this innocent-looking computation: `y = gelu(x * weight + bias)`\n",
    "\n",
    "**Naive Implementation (Memory Intensive):**\n",
    "```\n",
    "Step 1: temp1 = x * weight     → Write 4GB to memory\n",
    "Step 2: temp2 = temp1 + bias   → Read 4GB, Write 4GB\n",
    "Step 3: y = gelu(temp2)        → Read 4GB, Write 4GB\n",
    "                                 Total: 20GB memory traffic!\n",
    "```\n",
    "\n",
    "**Fused Implementation (Memory Efficient):**\n",
    "```\n",
    "Single Step: y = gelu(x * weight + bias)  → Read 8GB, Write 4GB\n",
    "                                            Total: 12GB memory traffic!\n",
    "                                            60% memory bandwidth reduction!\n",
    "```\n",
    "\n",
    "### Understanding GELU: The Smooth Activation\n",
    "\n",
    "GELU (Gaussian Error Linear Unit) is used in transformers because it's **smooth** (differentiable everywhere):\n",
    "\n",
    "```\n",
    "Activation Functions Compared:\n",
    "\n",
    "ReLU:           GELU:           Sigmoid:\n",
    "     |               |                 1 ┌─────\n",
    "     |               |               ╱   │\n",
    "     |           ╱───│───            ╱    │\n",
    "─────┘       ╱───    │         ───╱     │\n",
    " Discontinuous   Smooth Curve    │ Smooth but saturates\n",
    " gradient at 0   everywhere      │\n",
    "```\n",
    "\n",
    "**GELU Formula**: `GELU(x) = x * Φ(x)` where Φ is the standard normal CDF\n",
    "\n",
    "**Fast Approximation**: `GELU(x) ≈ 0.5 * x * (1 + tanh(√(2/π) * (x + 0.044715 * x³)))`\n",
    "\n",
    "### Kernel Fusion Strategy\n",
    "\n",
    "```\n",
    "Unfused Operations:                    Fused Operation:\n",
    "┌─────────────────┐                   ┌─────────────────┐\n",
    "│ x³ computation  │ → temp1           │                 │\n",
    "└─────────────────┘                   │                 │\n",
    "┌─────────────────┐                   │                 │\n",
    "│ polynomial part │ → temp2           │   All operations│\n",
    "└─────────────────┘                   │   combined in   │\n",
    "┌─────────────────┐                   │   single kernel │\n",
    "│ tanh computation│ → temp3           │                 │\n",
    "└─────────────────┘                   │                 │\n",
    "┌─────────────────┐                   │                 │\n",
    "│ final multiply  │ → result          │                 │\n",
    "└─────────────────┘                   └─────────────────┘\n",
    "\n",
    "5 memory round-trips                   1 memory round-trip\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "085b3c2b",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "fused-gelu",
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "def fused_gelu(x: Tensor) -> Tensor:\n",
    "    \"\"\"\n",
    "    Fused GELU activation that combines all operations in a single kernel.\n",
    "\n",
    "    GELU combines the benefits of ReLU and sigmoid:\n",
    "    - Smooth everywhere (unlike ReLU's discontinuity at 0)\n",
    "    - Non-saturating for positive values (unlike sigmoid)\n",
    "    - Probabilistic interpretation: x * P(X ≤ x) where X ~ N(0,1)\n",
    "\n",
    "    Mathematical Definition:\n",
    "    GELU(x) = x * Φ(x) where Φ(x) is the standard normal CDF\n",
    "\n",
    "    Fast Approximation (used here):\n",
    "    GELU(x) ≈ 0.5 * x * (1 + tanh(√(2/π) * (x + 0.044715 * x³)))\n",
    "\n",
    "    TODO: Implement fused GELU to minimize memory bandwidth\n",
    "\n",
    "    APPROACH:\n",
    "    1. Compute all intermediate values in a single expression\n",
    "    2. Avoid creating temporary arrays\n",
    "    3. Let NumPy's broadcasting handle vectorization\n",
    "\n",
    "    EXAMPLE:\n",
    "    >>> x = Tensor([-2, -1, 0, 1, 2])\n",
    "    >>> result = fused_gelu(x)\n",
    "    >>> print(result.data)\n",
    "    [-0.04550026 -0.15865526  0.          0.8413447   1.9544997 ]\n",
    "    # Notice: smooth transition through 0, positive bias\n",
    "\n",
    "    MEMORY EFFICIENCY:\n",
    "    - Unfused: 5 temporary arrays × input_size × 4 bytes\n",
    "    - Fused: 0 temporary arrays, direct computation\n",
    "    - Bandwidth reduction: ~80% for memory-bound operations\n",
    "\n",
    "    HINTS:\n",
    "    - Use np.sqrt(2.0 / np.pi) for the constant\n",
    "    - Keep entire expression in one line for maximum fusion\n",
    "    - NumPy will optimize the expression tree automatically\n",
    "    \"\"\"\n",
    "    ### BEGIN SOLUTION\n",
    "    # Mathematical constant for GELU approximation\n",
    "    sqrt_2_over_pi = np.sqrt(2.0 / np.pi)\n",
    "\n",
    "    # Fused GELU computation - all operations in single expression\n",
    "    # This minimizes memory bandwidth by avoiding intermediate arrays\n",
    "    # NumPy's expression evaluator will optimize this into efficient machine code\n",
    "    result_data = 0.5 * x.data * (\n",
    "        1.0 + np.tanh(sqrt_2_over_pi * (x.data + 0.044715 * x.data**3))\n",
    "    )\n",
    "\n",
    "    return Tensor(result_data)\n",
    "    ### END SOLUTION"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b205cb72",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "test-fused-gelu",
     "locked": true,
     "points": 10
    }
   },
   "outputs": [],
   "source": [
    "def test_unit_fused_gelu():\n",
    "    \"\"\"🔬 Test fused GELU activation implementation.\"\"\"\n",
    "    print(\"🔬 Unit Test: Fused GELU...\")\n",
    "\n",
    "    # Test basic properties\n",
    "    x = Tensor([-3, -1, 0, 1, 3])\n",
    "    result = fused_gelu(x)\n",
    "\n",
    "    # GELU(0) = 0 (exact property)\n",
    "    assert abs(result.data[2]) < 1e-6, f\"GELU(0) should be 0, got {result.data[2]}\"\n",
    "\n",
    "    # GELU is smooth and increasing\n",
    "    assert result.data[4] > result.data[3] > result.data[2], \"GELU should be increasing\"\n",
    "\n",
    "    # GELU has positive bias (unlike ReLU)\n",
    "    assert result.data[3] > 0.8, \"GELU(1) should be close to 1\"\n",
    "    assert result.data[1] > -0.2, \"GELU(-1) should be slightly negative\"\n",
    "\n",
    "    # Test numerical stability with extreme values\n",
    "    x_extreme = Tensor([-10, -5, 0, 5, 10])\n",
    "    result_extreme = fused_gelu(x_extreme)\n",
    "\n",
    "    assert not np.any(np.isnan(result_extreme.data)), \"No NaN values allowed\"\n",
    "    assert not np.any(np.isinf(result_extreme.data)), \"No infinite values allowed\"\n",
    "\n",
    "    # Test large tensor processing\n",
    "    x_large = Tensor(np.random.randn(1000, 1000).astype(np.float32))\n",
    "    result_large = fused_gelu(x_large)\n",
    "\n",
    "    assert result_large.shape == x_large.shape, \"Shape preservation failed\"\n",
    "    assert result_large.data.dtype == np.float32, \"Data type preservation failed\"\n",
    "\n",
    "    # Test that positive inputs are mostly preserved (GELU ≈ x for large positive x)\n",
    "    x_positive = Tensor([5.0])\n",
    "    result_positive = fused_gelu(x_positive)\n",
    "    assert result_positive.data[0] > 4.9, \"Large positive values should be nearly preserved\"\n",
    "\n",
    "    print(\"✅ fused_gelu works correctly!\")\n",
    "\n",
    "test_unit_fused_gelu()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cb075d6f",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "### 🔬 Performance Analysis: Measuring Fusion Benefits\n",
    "\n",
    "Let's quantify the impact of kernel fusion by comparing fused vs unfused implementations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "89558452",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "unfused-gelu",
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "def unfused_gelu(x: Tensor) -> Tensor:\n",
    "    \"\"\"\n",
    "    Deliberately unfused GELU implementation for performance comparison.\n",
    "\n",
    "    This version creates multiple intermediate tensors to simulate\n",
    "    the memory bandwidth overhead of unfused operations.\n",
    "\n",
    "    TODO: Implement GELU with explicit intermediate steps\n",
    "\n",
    "    APPROACH:\n",
    "    1. Break computation into individual steps\n",
    "    2. Create temporary Tensor objects for each step\n",
    "    3. This simulates real memory allocation overhead\n",
    "\n",
    "    PERFORMANCE IMPACT:\n",
    "    - Creates 7 temporary arrays\n",
    "    - Each array allocation/deallocation has overhead\n",
    "    - More memory bandwidth usage\n",
    "    - Potential cache misses between operations\n",
    "    \"\"\"\n",
    "    ### BEGIN SOLUTION\n",
    "    # Unfused version - creates many intermediate arrays\n",
    "    sqrt_2_over_pi = np.sqrt(2.0 / np.pi)\n",
    "\n",
    "    # Each operation creates a temporary array (simulating kernel launches)\n",
    "    temp1 = Tensor(x.data**3)  # x³\n",
    "    temp2 = Tensor(0.044715 * temp1.data)  # 0.044715 * x³\n",
    "    temp3 = Tensor(x.data + temp2.data)  # x + 0.044715 * x³\n",
    "    temp4 = Tensor(sqrt_2_over_pi * temp3.data)  # √(2/π) * (...)\n",
    "    temp5 = Tensor(np.tanh(temp4.data))  # tanh(...)\n",
    "    temp6 = Tensor(1.0 + temp5.data)  # 1 + tanh(...)\n",
    "    temp7 = Tensor(x.data * temp6.data)  # x * (1 + tanh(...))\n",
    "    result = Tensor(0.5 * temp7.data)  # 0.5 * x * (...)\n",
    "\n",
    "    return result\n",
    "    ### END SOLUTION"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6a50536a",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "test-fusion-speedup",
     "locked": true,
     "points": 10
    }
   },
   "outputs": [],
   "source": [
    "def test_unit_fusion_speedup():\n",
    "    \"\"\"🔬 Measure the performance impact of kernel fusion.\"\"\"\n",
    "    print(\"🔬 Unit Test: Kernel Fusion Performance Impact...\")\n",
    "\n",
    "    # Create moderately large tensor for meaningful timing\n",
    "    size = 2000\n",
    "    x = Tensor(np.random.randn(size, size).astype(np.float32))\n",
    "    warmup_iterations = 2\n",
    "    timing_iterations = 5\n",
    "\n",
    "    # Warmup both implementations\n",
    "    for _ in range(warmup_iterations):\n",
    "        _ = unfused_gelu(x)\n",
    "        _ = fused_gelu(x)\n",
    "\n",
    "    # Time unfused version\n",
    "    start = time.time()\n",
    "    for _ in range(timing_iterations):\n",
    "        result_unfused = unfused_gelu(x)\n",
    "    unfused_time = time.time() - start\n",
    "\n",
    "    # Time fused version\n",
    "    start = time.time()\n",
    "    for _ in range(timing_iterations):\n",
    "        result_fused = fused_gelu(x)\n",
    "    fused_time = time.time() - start\n",
    "\n",
    "    # Verify numerical correctness\n",
    "    assert np.allclose(result_unfused.data, result_fused.data, atol=1e-6), \\\n",
    "        \"Fused and unfused implementations must be numerically equivalent\"\n",
    "\n",
    "    # Calculate performance metrics\n",
    "    speedup = unfused_time / fused_time if fused_time > 0 else 1.0\n",
    "    unfused_per_elem = (unfused_time / timing_iterations) / (size * size) * 1e9  # ns per element\n",
    "    fused_per_elem = (fused_time / timing_iterations) / (size * size) * 1e9\n",
    "\n",
    "    print(f\"📊 Kernel Fusion Performance Analysis:\")\n",
    "    print(f\"   Tensor size: {size}×{size} = {size*size:,} elements\")\n",
    "    print(f\"   Unfused time: {unfused_time/timing_iterations*1000:.2f} ms\")\n",
    "    print(f\"   Fused time:   {fused_time/timing_iterations*1000:.2f} ms\")\n",
    "    print(f\"   Speedup: {speedup:.2f}× faster\")\n",
    "    print(f\"   Per-element: {unfused_per_elem:.1f} ns → {fused_per_elem:.1f} ns\")\n",
    "\n",
    "    # Memory bandwidth estimate\n",
    "    bytes_per_elem = 4  # float32\n",
    "    unfused_memory_ops = 7  # 7 intermediate arrays\n",
    "    fused_memory_ops = 2   # read input, write output\n",
    "\n",
    "    unfused_bandwidth = (unfused_memory_ops * size * size * bytes_per_elem) / (unfused_time / timing_iterations) / 1e9\n",
    "    fused_bandwidth = (fused_memory_ops * size * size * bytes_per_elem) / (fused_time / timing_iterations) / 1e9\n",
    "\n",
    "    print(f\"   Memory efficiency: {unfused_memory_ops}→{fused_memory_ops} memory ops\")\n",
    "    print(f\"   Effective bandwidth: {unfused_bandwidth:.1f}→{fused_bandwidth:.1f} GB/s\")\n",
    "\n",
    "    # Interpret results\n",
    "    if speedup > 1.5:\n",
    "        print(\"🚀 Excellent! Kernel fusion providing significant speedup\")\n",
    "    elif speedup > 1.1:\n",
    "        print(\"✅ Good! Kernel fusion providing measurable benefit\")\n",
    "    else:\n",
    "        print(\"⚠️  Limited speedup - may be compute-bound or small tensor size\")\n",
    "\n",
    "    print(\"✅ Fusion performance analysis completed!\")\n",
    "\n",
    "test_unit_fusion_speedup()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "adb97e5a",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## 4. Integration - Mixed Precision Training: Memory and Speed\n",
    "\n",
    "### The Mixed Precision Revolution\n",
    "\n",
    "Modern GPUs (like V100, A100) have specialized **Tensor Cores** that can perform FP16 operations much faster than FP32:\n",
    "\n",
    "```\n",
    "Performance Comparison (Theoretical Peak):\n",
    "┌─────────────────┬────────────────┬────────────────┐\n",
    "│   Precision     │   V100 TFLOPS  │   A100 TFLOPS  │\n",
    "├─────────────────┼────────────────┼────────────────┤\n",
    "│   FP32 (float)  │      15.7      │      19.5      │\n",
    "│   FP16 (half)   │     125.0      │     312.0      │\n",
    "│   Speedup       │      8×        │      16×       │\n",
    "└─────────────────┴────────────────┴────────────────┘\n",
    "```\n",
    "\n",
    "### The Challenge: FP16 Precision Limitations\n",
    "\n",
    "FP16 has a much smaller range than FP32:\n",
    "\n",
    "```\n",
    "FP32 (32-bit):                    FP16 (16-bit):\n",
    "┌─────────────────────────────┐   ┌───────────────┐\n",
    "│ Sign │ 8-bit │   23-bit     │   │Sign│5-bit│10-bit│\n",
    "│  bit │ Exp   │  Mantissa    │   │bit │ Exp │Mant. │\n",
    "└─────────────────────────────┘   └───────────────┘\n",
    "Range: ±3.4 × 10³⁸              Range: ±6.5 × 10⁴\n",
    "Precision: ~7 decimal digits     Precision: ~3 decimal digits\n",
    "\n",
    "Problem: Small gradients (< 6e-5) become ZERO in FP16!\n",
    "```\n",
    "\n",
    "### The Solution: Automatic Loss Scaling\n",
    "\n",
    "```\n",
    "Training Step Without Scaling:       Training Step With Scaling:\n",
    "\n",
    "Loss = 0.0001                       Loss = 0.0001\n",
    "    ↓                                   ↓\n",
    "Gradients = 0.00001                 Scale × 1024\n",
    "    ↓                                   ↓\n",
    "Convert to FP16                     Loss = 0.1024\n",
    "    ↓                                   ↓\n",
    "Gradients = 0.0 (UNDERFLOW!)        Gradients = 0.01024\n",
    "    ↓                                   ↓\n",
    "No learning!                        Convert to FP16: 0.01024 ✓\n",
    "                                        ↓\n",
    "                                    Unscale: 0.01024 / 1024 = 0.00001\n",
    "                                        ↓\n",
    "                                    Successful learning!\n",
    "```\n",
    "\n",
    "### Mixed Precision Memory Benefits\n",
    "\n",
    "```\n",
    "Model Component Breakdown:\n",
    "┌─────────────────┬─────────────┬─────────────┬─────────────┐\n",
    "│   Component     │ FP32 Memory │ FP16 Memory │   Savings   │\n",
    "├─────────────────┼─────────────┼─────────────┼─────────────┤\n",
    "│ Parameters      │    4N       │     4N      │     0%      │\n",
    "│ Gradients       │    4N       │     2N      │    50%      │\n",
    "│ Activations     │    4A       │     2A      │    50%      │\n",
    "│ Optimizer State │    8N       │     8N      │     0%      │\n",
    "├─────────────────┼─────────────┼─────────────┼─────────────┤\n",
    "│ Total Typical   │   ~20N      │    ~16N     │    20%      │\n",
    "│ Activation-Heavy│   ~40N      │    ~24N     │    40%      │\n",
    "└─────────────────┴─────────────┴─────────────┴─────────────┘\n",
    "\n",
    "N = parameter count, A = activation memory\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7a19b2a6",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "mixed-precision-trainer",
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "class MixedPrecisionTrainer:\n",
    "    \"\"\"\n",
    "    Mixed precision trainer with automatic loss scaling.\n",
    "\n",
    "    Implements the same pattern as PyTorch's Automatic Mixed Precision (AMP):\n",
    "    1. Forward pass in FP16 for speed and memory efficiency\n",
    "    2. Loss scaling to prevent gradient underflow\n",
    "    3. Gradient computation and unscaling\n",
    "    4. Parameter updates in FP32 for numerical stability\n",
    "\n",
    "    The key insight: keep different parts of training in optimal precision.\n",
    "    \"\"\"\n",
    "\n",
    "    def __init__(self, model, optimizer, loss_scale: float = 1024.0, max_loss_scale: float = 65536.0):\n",
    "        \"\"\"\n",
    "        Initialize mixed precision training infrastructure.\n",
    "\n",
    "        TODO: Set up automatic loss scaling and overflow detection\n",
    "\n",
    "        APPROACH:\n",
    "        1. Store model and optimizer references\n",
    "        2. Initialize dynamic loss scaling parameters\n",
    "        3. Set up overflow detection and scale adjustment logic\n",
    "\n",
    "        Args:\n",
    "            model: Neural network model\n",
    "            optimizer: Parameter optimizer (SGD, Adam, etc.)\n",
    "            loss_scale: Initial scaling factor for gradients\n",
    "            max_loss_scale: Maximum allowed loss scale\n",
    "\n",
    "        LOSS SCALING STRATEGY:\n",
    "        - Start with reasonable scale (1024)\n",
    "        - Increase gradually if no overflow (better precision)\n",
    "        - Decrease immediately on overflow (stability)\n",
    "        - This balances numerical precision with training stability\n",
    "\n",
    "        HINTS:\n",
    "        - Track consecutive successful steps for scale increases\n",
    "        - Use exponential backoff on overflow detection\n",
    "        - Keep scale within reasonable bounds [1, 65536]\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        self.model = model\n",
    "        self.optimizer = optimizer\n",
    "\n",
    "        # Loss scaling parameters\n",
    "        self.loss_scale = loss_scale\n",
    "        self.max_loss_scale = max_loss_scale\n",
    "        self.min_loss_scale = 1.0\n",
    "\n",
    "        # Dynamic scaling parameters\n",
    "        self.scale_growth_factor = 2.0      # Multiply by 2 when increasing\n",
    "        self.scale_backoff_factor = 0.5     # Divide by 2 when decreasing\n",
    "        self.growth_interval = 2000         # Steps between scale increases\n",
    "        self.steps_since_last_scale_update = 0\n",
    "\n",
    "        # Overflow tracking\n",
    "        self.overflow_detected = False\n",
    "        ### END SOLUTION\n",
    "\n",
    "    def scale_loss(self, loss: Tensor) -> Tensor:\n",
    "        \"\"\"\n",
    "        Scale loss to prevent gradient underflow in FP16.\n",
    "\n",
    "        The fundamental challenge: FP16 can only represent values ≥ 6e-5.\n",
    "        Small gradients (common in deep networks) become zero without scaling.\n",
    "\n",
    "        TODO: Apply loss scaling for mixed precision stability\n",
    "\n",
    "        APPROACH:\n",
    "        1. Multiply loss by current scale factor\n",
    "        2. This amplifies gradients proportionally\n",
    "        3. Return scaled loss for backward pass\n",
    "\n",
    "        MATHEMATICAL INSIGHT:\n",
    "        If loss = 1e-6 and scale = 1024:\n",
    "        scaled_loss = 1e-6 × 1024 = 1.024e-3\n",
    "\n",
    "        After backward pass:\n",
    "        scaled_gradients = 1.024e-3 × dloss/dparam = 1024 × gradients\n",
    "\n",
    "        These larger gradients survive FP16 conversion!\n",
    "\n",
    "        EXAMPLE:\n",
    "        >>> trainer = MixedPrecisionTrainer(model, optimizer)\n",
    "        >>> loss = Tensor([0.0001])  # Small loss\n",
    "        >>> scaled = trainer.scale_loss(loss)\n",
    "        >>> print(scaled.data)  # [0.1024] (0.0001 × 1024)\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        # Scale the loss to amplify gradients\n",
    "        # This prevents gradient underflow in FP16 arithmetic\n",
    "        scaled_data = loss.data * self.loss_scale\n",
    "        return Tensor(scaled_data)\n",
    "        ### END SOLUTION\n",
    "\n",
    "    def unscale_gradients(self, parameters: List[Tensor]) -> bool:\n",
    "        \"\"\"\n",
    "        Unscale gradients and detect overflow from FP16 conversion.\n",
    "\n",
    "        After backward pass on scaled loss, gradients are scaled too.\n",
    "        We must unscale them AND check for overflow/underflow.\n",
    "\n",
    "        TODO: Implement gradient unscaling with overflow detection\n",
    "\n",
    "        APPROACH:\n",
    "        1. Divide all gradients by loss scale (restore original magnitude)\n",
    "        2. Check for inf/nan values (indicates FP16 overflow)\n",
    "        3. Return True if gradients are valid, False if overflow detected\n",
    "\n",
    "        OVERFLOW DETECTION:\n",
    "        inf/nan in gradients indicates:\n",
    "        - Gradient magnitude too large for FP16\n",
    "        - Numerical instability in computation\n",
    "        - Loss scale too aggressive\n",
    "\n",
    "        When overflow occurs:\n",
    "        - Skip parameter update (unstable gradients)\n",
    "        - Reduce loss scale for next iteration\n",
    "        - Continue training with lower scale\n",
    "\n",
    "        HINTS:\n",
    "        - Use np.isfinite() to detect inf/nan efficiently\n",
    "        - Process all parameters even if overflow found\n",
    "        - Set self.overflow_detected flag for scale adjustment\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        self.overflow_detected = False\n",
    "\n",
    "        # Unscale all gradients and check for overflow\n",
    "        for param in parameters:\n",
    "            if param.grad is not None:\n",
    "                # Unscale gradients to original magnitude\n",
    "                param.grad.data = param.grad.data / self.loss_scale\n",
    "\n",
    "                # Check for overflow/underflow (inf/nan values)\n",
    "                if not np.all(np.isfinite(param.grad.data)):\n",
    "                    self.overflow_detected = True\n",
    "                    # Continue processing to unscale all gradients\n",
    "\n",
    "        return not self.overflow_detected\n",
    "        ### END SOLUTION\n",
    "\n",
    "    def update_loss_scale(self):\n",
    "        \"\"\"\n",
    "        Dynamically adjust loss scale based on training stability.\n",
    "\n",
    "        Implements the \"Goldilocks\" principle for loss scaling:\n",
    "        - Too low: precision loss from small gradients\n",
    "        - Too high: overflow and instability\n",
    "        - Just right: maximum precision without overflow\n",
    "\n",
    "        TODO: Implement adaptive loss scale adjustment\n",
    "\n",
    "        APPROACH:\n",
    "        1. If overflow detected: reduce scale immediately (stability)\n",
    "        2. If no overflow for many steps: increase scale (precision)\n",
    "        3. Keep scale within reasonable bounds\n",
    "\n",
    "        SCALING STRATEGY:\n",
    "        - Aggressive reduction on overflow (×0.5)\n",
    "        - Conservative growth during stability (×2 every 2000 steps)\n",
    "        - This favors stability over maximum precision\n",
    "\n",
    "        WHY THIS WORKS:\n",
    "        - Most training is stable (gradual scale increase)\n",
    "        - Occasional instability (rapid scale decrease)\n",
    "        - Converges to optimal scale for current training phase\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        if self.overflow_detected:\n",
    "            # Immediately reduce scale on overflow\n",
    "            self.loss_scale = max(\n",
    "                self.min_loss_scale,\n",
    "                self.loss_scale * self.scale_backoff_factor\n",
    "            )\n",
    "            self.steps_since_last_scale_update = 0\n",
    "        else:\n",
    "            # Gradually increase scale if stable\n",
    "            self.steps_since_last_scale_update += 1\n",
    "            if self.steps_since_last_scale_update >= self.growth_interval:\n",
    "                self.loss_scale = min(\n",
    "                    self.max_loss_scale,\n",
    "                    self.loss_scale * self.scale_growth_factor\n",
    "                )\n",
    "                self.steps_since_last_scale_update = 0\n",
    "        ### END SOLUTION\n",
    "\n",
    "    def train_step(self, batch: Tuple[Tensor, Tensor]) -> Dict[str, float]:\n",
    "        \"\"\"\n",
    "        Execute complete mixed precision training step.\n",
    "\n",
    "        Orchestrates the entire mixed precision training process:\n",
    "        1. Forward pass (FP16 in real implementation)\n",
    "        2. Loss computation and scaling\n",
    "        3. Backward pass on scaled loss\n",
    "        4. Gradient unscaling and overflow detection\n",
    "        5. Conditional parameter update\n",
    "        6. Loss scale adjustment\n",
    "\n",
    "        TODO: Implement end-to-end mixed precision training step\n",
    "\n",
    "        APPROACH:\n",
    "        1. Clear gradients from previous step\n",
    "        2. Forward pass through model\n",
    "        3. Compute and scale loss\n",
    "        4. Backward pass to compute scaled gradients\n",
    "        5. Unscale gradients and check for overflow\n",
    "        6. Update parameters only if no overflow\n",
    "        7. Adjust loss scale based on stability\n",
    "\n",
    "        CRITICAL INSIGHT:\n",
    "        Skip parameter updates on overflow! Unstable gradients\n",
    "        would move parameters in wrong direction.\n",
    "\n",
    "        RETURN FORMAT:\n",
    "        Dictionary with training metrics:\n",
    "        - loss: unscaled loss value\n",
    "        - loss_scale: current scaling factor\n",
    "        - overflow: whether overflow occurred\n",
    "        - gradients_valid: whether update was applied\n",
    "\n",
    "        HINTS:\n",
    "        - Use self.optimizer.zero_grad() to clear gradients\n",
    "        - Get parameters with gradients for unscaling\n",
    "        - Only call optimizer.step() if gradients are valid\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        inputs, targets = batch\n",
    "\n",
    "        # Clear gradients from previous step\n",
    "        self.optimizer.zero_grad()\n",
    "\n",
    "        # Forward pass (would use FP16 autocast in real implementation)\n",
    "        # For simulation, we work in FP32 but apply scaling principles\n",
    "        outputs = self.model(inputs)\n",
    "\n",
    "        # Compute loss (unscaled)\n",
    "        loss = self._compute_loss(outputs, targets)\n",
    "\n",
    "        # Scale loss for mixed precision\n",
    "        scaled_loss = self.scale_loss(loss)\n",
    "\n",
    "        # Backward pass on scaled loss\n",
    "        scaled_loss.backward()\n",
    "\n",
    "        # Get all parameters with gradients\n",
    "        parameters = [p for p in self.model.parameters() if p.grad is not None]\n",
    "\n",
    "        # Unscale gradients and detect overflow\n",
    "        gradients_valid = self.unscale_gradients(parameters)\n",
    "\n",
    "        # Update parameters only if no overflow\n",
    "        if gradients_valid:\n",
    "            self.optimizer.step()\n",
    "\n",
    "        # Adjust loss scale based on stability\n",
    "        self.update_loss_scale()\n",
    "\n",
    "        # Return training metrics\n",
    "        return {\n",
    "            'loss': loss.data.item() if hasattr(loss.data, 'item') else float(loss.data),\n",
    "            'loss_scale': self.loss_scale,\n",
    "            'overflow': self.overflow_detected,\n",
    "            'gradients_valid': gradients_valid\n",
    "        }\n",
    "        ### END SOLUTION\n",
    "\n",
    "    def _compute_loss(self, outputs: Tensor, targets: Tensor) -> Tensor:\n",
    "        \"\"\"Simple MSE loss for demonstration purposes.\"\"\"\n",
    "        diff = Tensor(outputs.data - targets.data)\n",
    "        return Tensor(np.mean(diff.data**2))"
   ]
  },
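  {
   "cell_type": "markdown",
   "id": "e8b25d13",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "The scaling logic in `scale_loss` and `unscale_gradients` rests on the linearity of differentiation. As a sketch, with $s$ the loss scale, $L$ the loss, $\\theta$ a parameter, and $g = \\nabla_\\theta L$ the true gradient (symbols introduced here only for illustration):\n",
    "\n",
    "$$\\nabla_\\theta (s L) = s \\, \\nabla_\\theta L = s \\, g, \\qquad g = \\frac{1}{s} \\nabla_\\theta (s L)$$\n",
    "\n",
    "In FP16 the smallest positive subnormal is $2^{-24} \\approx 6.0 \\times 10^{-8}$, the smallest normal value is $2^{-14} \\approx 6.1 \\times 10^{-5}$, and the largest finite value is $65504$. A scaled gradient therefore survives the FP16 round trip roughly when\n",
    "\n",
    "$$2^{-24} \\le s \\, |g| \\le 65504$$\n",
    "\n",
    "For example, $g = 10^{-8}$ rounds to zero in FP16, while with $s = 1024$ the scaled value $1.024 \\times 10^{-5}$ is representable (as a subnormal, with reduced precision). Dividing by $s$ in FP32 then recovers approximately $10^{-8}$. If $s \\, |g|$ exceeds $65504$ instead, the result is `inf`, which is the overflow case that `update_loss_scale` responds to by halving $s$."
   ]
  },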
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "650bf77c",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "test-mixed-precision",
     "locked": true,
     "points": 15
    }
   },
   "outputs": [],
   "source": [
    "def test_unit_mixed_precision():\n",
    "    \"\"\"🔬 Test mixed precision training components comprehensively.\"\"\"\n",
    "    print(\"🔬 Unit Test: Mixed Precision Training...\")\n",
    "\n",
    "    # Create mock model and optimizer for testing\n",
    "    class MockModel:\n",
    "        def __init__(self):\n",
    "            self.weight = Tensor(np.random.randn(10, 5).astype(np.float32))\n",
    "            self.weight.grad = None\n",
    "\n",
    "        def __call__(self, x):\n",
    "            return x.matmul(self.weight)\n",
    "\n",
    "        def parameters(self):\n",
    "            return [self.weight]\n",
    "\n",
    "    class MockOptimizer:\n",
    "        def __init__(self, params):\n",
    "            self.params = params\n",
    "            self.updates_applied = 0\n",
    "\n",
    "        def zero_grad(self):\n",
    "            for p in self.params:\n",
    "                p.grad = None\n",
    "\n",
    "        def step(self):\n",
    "            for p in self.params:\n",
    "                if p.grad is not None:\n",
    "                    p.data = p.data - 0.01 * p.grad.data\n",
    "                    self.updates_applied += 1\n",
    "\n",
    "    # Initialize mixed precision trainer\n",
    "    model = MockModel()\n",
    "    optimizer = MockOptimizer(model.parameters())\n",
    "    trainer = MixedPrecisionTrainer(model, optimizer, loss_scale=1024.0)\n",
    "\n",
    "    # Test 1: Loss scaling\n",
    "    print(\"   Testing loss scaling...\")\n",
    "    loss = Tensor([0.001])\n",
    "    scaled_loss = trainer.scale_loss(loss)\n",
    "    expected_scaled = 0.001 * 1024.0\n",
    "    assert np.isclose(scaled_loss.data[0], expected_scaled), \\\n",
    "        f\"Loss scaling failed: expected {expected_scaled}, got {scaled_loss.data[0]}\"\n",
    "\n",
    "    # Test 2: Gradient unscaling (normal case)\n",
    "    print(\"   Testing gradient unscaling...\")\n",
    "    model.weight.grad = Tensor(np.full((10, 5), 1024.0))  # Simulate scaled gradients\n",
    "    valid = trainer.unscale_gradients([model.weight])\n",
    "    assert valid, \"Should detect valid gradients\"\n",
    "    assert np.allclose(model.weight.grad.data, 1.0), \"Gradient unscaling failed\"\n",
    "\n",
    "    # Test 3: Overflow detection\n",
    "    print(\"   Testing overflow detection...\")\n",
    "    model.weight.grad = Tensor(np.full((10, 5), np.inf))  # Simulate overflow\n",
    "    valid = trainer.unscale_gradients([model.weight])\n",
    "    assert not valid, \"Should detect overflow\"\n",
    "    assert trainer.overflow_detected, \"Overflow flag not set\"\n",
    "\n",
    "    # Test 4: Loss scale adjustment after overflow\n",
    "    print(\"   Testing loss scale adjustment...\")\n",
    "    initial_scale = trainer.loss_scale\n",
    "    trainer.update_loss_scale()  # Should reduce scale due to overflow\n",
    "    assert trainer.loss_scale < initial_scale, \\\n",
    "        f\"Scale should decrease after overflow: {initial_scale} → {trainer.loss_scale}\"\n",
    "\n",
    "    # Test 5: Loss scale increase during stability\n",
    "    print(\"   Testing loss scale increase...\")\n",
    "    trainer.overflow_detected = False\n",
    "    trainer.steps_since_last_scale_update = 2000  # Simulate stable training\n",
    "    scale_before = trainer.loss_scale\n",
    "    trainer.update_loss_scale()\n",
    "    assert trainer.loss_scale > scale_before, \"Scale should increase during stability\"\n",
    "\n",
    "    # Test 6: End-to-end training step\n",
    "    print(\"   Testing complete training step...\")\n",
    "    inputs = Tensor(np.random.randn(8, 10).astype(np.float32))\n",
    "    targets = Tensor(np.random.randn(8, 5).astype(np.float32))\n",
    "\n",
    "    initial_updates = optimizer.updates_applied\n",
    "    metrics = trainer.train_step((inputs, targets))\n",
    "\n",
    "    # Verify metrics structure\n",
    "    required_keys = ['loss', 'loss_scale', 'overflow', 'gradients_valid']\n",
    "    for key in required_keys:\n",
    "        assert key in metrics, f\"Missing metric: {key}\"\n",
    "\n",
    "    # Verify loss is reasonable\n",
    "    assert isinstance(metrics['loss'], (int, float)), \"Loss should be numeric\"\n",
    "    assert metrics['loss'] >= 0, \"Loss should be non-negative\"\n",
    "\n",
    "    # Verify loss scale is positive\n",
    "    assert metrics['loss_scale'] > 0, \"Loss scale should be positive\"\n",
    "\n",
    "    print(\"✅ Mixed precision training works correctly!\")\n",
    "\n",
    "test_unit_mixed_precision()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "de9e4b44",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## 5. Systems Analysis - Performance Scaling Patterns\n",
    "\n",
    "Let's analyze how our acceleration techniques perform across different scenarios and understand their scaling characteristics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2f7edfee",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "analyze-vectorization",
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "def analyze_vectorization_scaling():\n",
    "    \"\"\"📊 Analyze vectorization performance across different tensor sizes.\"\"\"\n",
    "    print(\"📊 Analyzing vectorization scaling behavior...\")\n",
    "\n",
    "    # Test sizes spanning different cache regimes\n",
    "    sizes = [64, 128, 256, 512, 1024, 2048]\n",
    "\n",
    "    print(\"\\n🔍 Vectorization Scaling Analysis:\")\n",
    "    print(\"┌─────────┬─────────────┬─────────────┬─────────────┬─────────────┐\")\n",
    "    print(\"│  Size   │ Time (ms)   │ GFLOPS      │ Bandwidth   │ Efficiency  │\")\n",
    "    print(\"│         │             │             │ (GB/s)      │ (% of peak) │\")\n",
    "    print(\"├─────────┼─────────────┼─────────────┼─────────────┼─────────────┤\")\n",
    "\n",
    "    for size in sizes:\n",
    "        # Create test matrices\n",
    "        a = Tensor(np.random.randn(size, size).astype(np.float32))\n",
    "        b = Tensor(np.random.randn(size, size).astype(np.float32))\n",
    "\n",
    "        # Warm up\n",
    "        for _ in range(2):\n",
    "            _ = vectorized_matmul(a, b)\n",
    "\n",
    "        # Time vectorized implementation\n",
    "        iterations = max(1, 100 // (size // 64))  # Fewer iterations for larger sizes\n",
    "        start = time.time()\n",
    "        for _ in range(iterations):\n",
    "            result = vectorized_matmul(a, b)\n",
    "        elapsed = (time.time() - start) / iterations\n",
    "\n",
    "        # Calculate performance metrics\n",
    "        flops = 2 * size**3  # 2N³ FLOPs for matrix multiplication\n",
    "        gflops = flops / (elapsed * 1e9)\n",
    "\n",
    "        bytes_accessed = 3 * size * size * 4  # 3 matrices × size² × 4 bytes\n",
    "        bandwidth = bytes_accessed / (elapsed * 1e9)\n",
    "\n",
    "        # Estimate efficiency (rough baseline: modern CPU ~100-500 GFLOPS peak)\n",
    "        estimated_peak_gflops = 200  # Conservative estimate\n",
    "        efficiency = min(100, gflops / estimated_peak_gflops * 100)\n",
    "\n",
    "        print(f\"│ {size:6d}  │ {elapsed*1000:9.2f}   │ {gflops:9.1f}   │ {bandwidth:9.1f}   │ {efficiency:9.1f}   │\")\n",
    "\n",
    "    print(\"└─────────┴─────────────┴─────────────┴─────────────┴─────────────┘\")\n",
    "\n",
    "    print(f\"\\n💡 Vectorization insights:\")\n",
    "    print(f\"   • Small matrices: Limited by overhead and cache effects\")\n",
    "    print(f\"   • Medium matrices: Sweet spot for cache reuse\")\n",
    "    print(f\"   • Large matrices: Memory bandwidth becomes limiting factor\")\n",
    "    print(f\"   • BLAS libraries automatically optimize for each size regime\")\n",
    "    print(\"🚀 Vectorization effectiveness depends on problem size and hardware\")\n",
    "\n",
    "analyze_vectorization_scaling()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5972a039",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "analyze-arithmetic-intensity",
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "def analyze_arithmetic_intensity():\n",
    "    \"\"\"📊 Demonstrate the roofline model with different operations.\"\"\"\n",
    "    print(\"📊 Analyzing arithmetic intensity patterns...\")\n",
    "\n",
    "    size = 1024\n",
    "    iterations = 10\n",
    "\n",
    "    operations = []\n",
    "\n",
    "    # Create test data\n",
    "    x = Tensor(np.random.randn(size, size).astype(np.float32))\n",
    "    y = Tensor(np.random.randn(size, size).astype(np.float32))\n",
    "\n",
    "    print(\"\\n🎯 Arithmetic Intensity Analysis:\")\n",
    "    print(\"┌─────────────────────┬─────────┬─────────────┬─────────────┬─────────────┐\")\n",
    "    print(\"│ Operation           │ AI      │ Time (ms)   │ GFLOPS      │ GB/s        │\")\n",
    "    print(\"│                     │(FLOPs/B)│             │             │             │\")\n",
    "    print(\"├─────────────────────┼─────────┼─────────────┼─────────────┼─────────────┤\")\n",
    "\n",
    "    # 1. Element-wise addition (very low arithmetic intensity)\n",
    "    start = time.time()\n",
    "    for _ in range(iterations):\n",
    "        _ = Tensor(x.data + y.data)\n",
    "    add_time = (time.time() - start) / iterations\n",
    "\n",
    "    add_flops = size * size  # One addition per element\n",
    "    add_bytes = 3 * size * size * 4  # Read x, read y, write result\n",
    "    add_ai = add_flops / add_bytes\n",
    "    add_gflops = add_flops / (add_time * 1e9)\n",
    "    add_bandwidth = add_bytes / (add_time * 1e9)\n",
    "\n",
    "    print(f\"│ Element-wise Add    │ {add_ai:6.3f}  │ {add_time*1000:9.2f}   │ {add_gflops:9.1f}   │ {add_bandwidth:9.1f}   │\")\n",
    "\n",
    "    # 2. Element-wise multiply (still low, but slightly higher)\n",
    "    start = time.time()\n",
    "    for _ in range(iterations):\n",
    "        _ = Tensor(x.data * y.data)\n",
    "    mul_time = (time.time() - start) / iterations\n",
    "\n",
    "    mul_flops = size * size\n",
    "    mul_bytes = 3 * size * size * 4\n",
    "    mul_ai = mul_flops / mul_bytes\n",
    "    mul_gflops = mul_flops / (mul_time * 1e9)\n",
    "    mul_bandwidth = mul_bytes / (mul_time * 1e9)\n",
    "\n",
    "    print(f\"│ Element-wise Mult   │ {mul_ai:6.3f}  │ {mul_time*1000:9.2f}   │ {mul_gflops:9.1f}   │ {mul_bandwidth:9.1f}   │\")\n",
    "\n",
    "    # 3. GELU (medium arithmetic intensity)\n",
    "    start = time.time()\n",
    "    for _ in range(iterations):\n",
    "        _ = fused_gelu(x)\n",
    "    gelu_time = (time.time() - start) / iterations\n",
    "\n",
    "    gelu_flops = size * size * 8  # Approximate: x³, add, mul, tanh, etc.\n",
    "    gelu_bytes = 2 * size * size * 4  # Read x, write result\n",
    "    gelu_ai = gelu_flops / gelu_bytes\n",
    "    gelu_gflops = gelu_flops / (gelu_time * 1e9)\n",
    "    gelu_bandwidth = gelu_bytes / (gelu_time * 1e9)\n",
    "\n",
    "    print(f\"│ Fused GELU          │ {gelu_ai:6.3f}  │ {gelu_time*1000:9.2f}   │ {gelu_gflops:9.1f}   │ {gelu_bandwidth:9.1f}   │\")\n",
    "\n",
    "    # 4. Matrix multiplication (high arithmetic intensity)\n",
    "    start = time.time()\n",
    "    for _ in range(iterations):\n",
    "        _ = vectorized_matmul(x, y)\n",
    "    matmul_time = (time.time() - start) / iterations\n",
    "\n",
    "    matmul_flops = 2 * size**3  # 2N³ FLOPs\n",
    "    matmul_bytes = 3 * size * size * 4  # 3 matrices\n",
    "    matmul_ai = matmul_flops / matmul_bytes\n",
    "    matmul_gflops = matmul_flops / (matmul_time * 1e9)\n",
    "    matmul_bandwidth = matmul_bytes / (matmul_time * 1e9)\n",
    "\n",
    "    print(f\"│ Matrix Multiply     │ {matmul_ai:6.3f}  │ {matmul_time*1000:9.2f}   │ {matmul_gflops:9.1f}   │ {matmul_bandwidth:9.1f}   │\")\n",
    "\n",
    "    print(\"└─────────────────────┴─────────┴─────────────┴─────────────┴─────────────┘\")\n",
    "\n",
    "    print(f\"\\n💡 Roofline Model Insights:\")\n",
    "    print(f\"   📊 Low AI (< 1): Memory bound - limited by bandwidth\")\n",
    "    print(f\"   📊 Med AI (1-10): Transitional - depends on implementation\")\n",
    "    print(f\"   📊 High AI (> 10): Compute bound - limited by ALU throughput\")\n",
    "    print(f\"   🎯 Matrix multiplication ({matmul_ai:.1f} AI) is ideal for GPUs/TPUs\")\n",
    "    print(f\"   ⚡ Element-wise ops ({add_ai:.3f} AI) need memory optimization\")\n",
    "    print(\"🚀 Design algorithms with high arithmetic intensity for performance\")\n",
    "\n",
    "analyze_arithmetic_intensity()"
   ]
  },
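  {
   "cell_type": "markdown",
   "id": "9f3a6c21",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "The arithmetic intensity (AI) column printed by `analyze_arithmetic_intensity` follows from simple counting. As a sketch for square FP32 matrices of side $n$, counting only compulsory traffic (each matrix read or written exactly once, the same assumption the function above makes):\n",
    "\n",
    "$$\\text{AI}_{\\text{add}} = \\frac{n^2}{3 \\cdot 4 n^2} = \\frac{1}{12} \\approx 0.083 \\ \\text{FLOPs/byte}$$\n",
    "\n",
    "$$\\text{AI}_{\\text{matmul}} = \\frac{2 n^3}{3 \\cdot 4 n^2} = \\frac{n}{6} \\approx 170.7 \\ \\text{FLOPs/byte for } n = 1024$$\n",
    "\n",
    "Element-wise AI is constant in $n$, while matmul AI grows linearly with $n$. The roofline model then bounds attainable throughput by\n",
    "\n",
    "$$P \\le \\min\\left(P_{\\text{peak}}, \\ \\text{AI} \\times B\\right)$$\n",
    "\n",
    "where $P_{\\text{peak}}$ is peak compute and $B$ is memory bandwidth. Both are hardware-specific values that this notebook does not measure, so the AI thresholds printed above are rules of thumb rather than exact boundaries."
   ]
  },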
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7a539cd5",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "analyze-mixed-precision-benefits",
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "def analyze_mixed_precision_benefits():\n",
    "    \"\"\"📊 Quantify mixed precision memory and performance benefits.\"\"\"\n",
    "    print(\"📊 Analyzing mixed precision benefits across model sizes...\")\n",
    "\n",
    "    # Define representative model configurations\n",
    "    model_configs = [\n",
    "        (\"Tiny CNN\", {\"params\": 50_000, \"activations\": 100_000}),\n",
    "        (\"Small BERT\", {\"params\": 10_000_000, \"activations\": 5_000_000}),\n",
    "        (\"Medium GPT\", {\"params\": 100_000_000, \"activations\": 50_000_000}),\n",
    "        (\"Large Transformer\", {\"params\": 1_000_000_000, \"activations\": 500_000_000}),\n",
    "    ]\n",
    "\n",
    "    print(\"\\n🧮 Mixed Precision Memory Analysis:\")\n",
    "    print(\"┌─────────────────┬─────────────┬─────────────┬─────────────┬─────────────┐\")\n",
    "    print(\"│ Model Type      │ Parameters  │ FP32 Memory │ FP16 Memory │ Savings     │\")\n",
    "    print(\"│                 │             │ (GB)        │ (GB)        │ (%)         │\")\n",
    "    print(\"├─────────────────┼─────────────┼─────────────┼─────────────┼─────────────┤\")\n",
    "\n",
    "    for name, config in model_configs:\n",
    "        param_count = config[\"params\"]\n",
    "        activation_count = config[\"activations\"]\n",
    "\n",
    "        # Memory calculation (bytes)\n",
    "        # Parameters: always FP32 for stability\n",
    "        param_memory = param_count * 4\n",
    "\n",
    "        # FP32 training memory\n",
    "        fp32_activations = activation_count * 4\n",
    "        fp32_gradients = param_count * 4\n",
    "        fp32_optimizer = param_count * 8  # Adam: momentum + velocity\n",
    "        fp32_total = param_memory + fp32_activations + fp32_gradients + fp32_optimizer\n",
    "\n",
    "        # Mixed precision memory\n",
    "        fp16_activations = activation_count * 2  # FP16 activations\n",
    "        fp16_gradients = param_count * 2  # FP16 gradients during backward\n",
    "        mixed_total = param_memory + fp16_activations + fp16_gradients + fp32_optimizer\n",
    "\n",
    "        # Calculate savings\n",
    "        savings_gb = (fp32_total - mixed_total) / 1e9\n",
    "        savings_pct = (fp32_total - mixed_total) / fp32_total * 100\n",
    "\n",
    "        print(f\"│ {name:14s}  │ {param_count:10,d}  │ {fp32_total/1e9:9.1f}   │ {mixed_total/1e9:9.1f}   │ {savings_pct:9.1f}   │\")\n",
    "\n",
    "    print(\"└─────────────────┴─────────────┴─────────────┴─────────────┴─────────────┘\")\n",
    "\n",
    "    # Performance simulation\n",
    "    print(f\"\\n⚡ Mixed Precision Performance Simulation:\")\n",
    "\n",
    "    # Simulate different batch sizes to show memory pressure\n",
    "    batch_sizes = [8, 16, 32, 64]\n",
    "    hidden_size = 1024\n",
    "    seq_length = 512\n",
    "\n",
    "    print(\"┌─────────────┬─────────────┬─────────────┬─────────────┬─────────────┐\")\n",
    "    print(\"│ Batch Size  │ FP32 Mem    │ FP16 Mem    │ Throughput  │ Efficiency  │\")\n",
    "    print(\"│             │ (GB)        │ (GB)        │ Gain        │ Gain        │\")\n",
    "    print(\"├─────────────┼─────────────┼─────────────┼─────────────┼─────────────┤\")\n",
    "\n",
    "    for batch_size in batch_sizes:\n",
    "        # Memory for activations (dominant for large models)\n",
    "        elements = batch_size * seq_length * hidden_size\n",
    "\n",
    "        fp32_mem = elements * 4 / 1e9  # 4 bytes per FP32\n",
    "        fp16_mem = elements * 2 / 1e9  # 2 bytes per FP16\n",
    "\n",
    "        # Simulate throughput gains (based on Tensor Core speedups)\n",
    "        # Real speedups depend on hardware and operation mix\n",
    "        throughput_gain = 1.4  # Conservative estimate for mixed workloads\n",
    "\n",
    "        # Memory efficiency enables larger batch sizes\n",
    "        max_fp32_batch = 32  # Assume memory limit\n",
    "        max_fp16_batch = 64   # Double capacity with FP16\n",
    "\n",
    "        efficiency_gain = max_fp16_batch / max_fp32_batch if batch_size <= max_fp32_batch else \"OOM\"\n",
    "        efficiency_str = f\"{efficiency_gain:.1f}×\" if isinstance(efficiency_gain, float) else efficiency_gain\n",
    "\n",
    "        print(f\"│ {batch_size:10d}  │ {fp32_mem:9.2f}   │ {fp16_mem:9.2f}   │ {throughput_gain:9.1f}×  │ {efficiency_str:9s}   │\")\n",
    "\n",
    "    print(\"└─────────────┴─────────────┴─────────────┴─────────────┴─────────────┘\")\n",
    "\n",
    "    print(f\"\\n💡 Mixed Precision Key Benefits:\")\n",
    "    print(f\"   🎯 Memory: 20-40% reduction enables larger models/batches\")\n",
    "    print(f\"   ⚡ Speed: 1.3-2× throughput on modern hardware (V100+)\")\n",
    "    print(f\"   📈 Scale: Essential for billion-parameter models\")\n",
    "    print(f\"   ⚠️  Complexity: Requires careful loss scaling and overflow handling\")\n",
    "    print(\"🚀 Mixed precision is crucial for competitive ML training\")\n",
    "\n",
    "analyze_mixed_precision_benefits()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d42aa6ff",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## 6. Optimization Insights - Production Acceleration Strategy\n",
    "\n",
    "Understanding when and how to apply different acceleration techniques in real-world scenarios."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "133b1f71",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "acceleration-decision-framework",
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "def analyze_acceleration_decision_framework():\n",
    "    \"\"\"📊 Decision framework for choosing acceleration techniques.\"\"\"\n",
    "    print(\"📊 Acceleration Technique Decision Framework...\")\n",
    "\n",
    "    # Define workload characteristics\n",
    "    workloads = [\n",
    "        (\"Research Training\", {\n",
    "            \"memory_pressure\": \"medium\",\n",
    "            \"latency_sensitive\": False,\n",
    "            \"stability_critical\": False,\n",
    "            \"development_speed\": \"high\",\n",
    "            \"hardware_variety\": \"high\"\n",
    "        }),\n",
    "        (\"Production Training\", {\n",
    "            \"memory_pressure\": \"high\",\n",
    "            \"latency_sensitive\": False,\n",
    "            \"stability_critical\": True,\n",
    "            \"development_speed\": \"medium\",\n",
    "            \"hardware_variety\": \"low\"\n",
    "        }),\n",
    "        (\"Real-time Inference\", {\n",
    "            \"memory_pressure\": \"medium\",\n",
    "            \"latency_sensitive\": True,\n",
    "            \"stability_critical\": True,\n",
    "            \"development_speed\": \"low\",\n",
    "            \"hardware_variety\": \"medium\"\n",
    "        }),\n",
    "        (\"Edge Deployment\", {\n",
    "            \"memory_pressure\": \"very_high\",\n",
    "            \"latency_sensitive\": True,\n",
    "            \"stability_critical\": True,\n",
    "            \"development_speed\": \"low\",\n",
    "            \"hardware_variety\": \"very_high\"\n",
    "        }),\n",
    "        (\"Batch Inference\", {\n",
    "            \"memory_pressure\": \"low\",\n",
    "            \"latency_sensitive\": False,\n",
    "            \"stability_critical\": True,\n",
    "            \"development_speed\": \"medium\",\n",
    "            \"hardware_variety\": \"low\"\n",
    "        })\n",
    "    ]\n",
    "\n",
    "    # Define technique characteristics\n",
    "    techniques = {\n",
    "        \"Vectorization\": {\n",
    "            \"implementation_cost\": \"low\",\n",
    "            \"memory_benefit\": \"none\",\n",
    "            \"latency_benefit\": \"high\",\n",
    "            \"stability_risk\": \"none\",\n",
    "            \"hardware_dependency\": \"low\"\n",
    "        },\n",
    "        \"Kernel Fusion\": {\n",
    "            \"implementation_cost\": \"medium\",\n",
    "            \"memory_benefit\": \"medium\",\n",
    "            \"latency_benefit\": \"medium\",\n",
    "            \"stability_risk\": \"low\",\n",
    "            \"hardware_dependency\": \"medium\"\n",
    "        },\n",
    "        \"Mixed Precision\": {\n",
    "            \"implementation_cost\": \"high\",\n",
    "            \"memory_benefit\": \"high\",\n",
    "            \"latency_benefit\": \"high\",\n",
    "            \"stability_risk\": \"medium\",\n",
    "            \"hardware_dependency\": \"high\"\n",
    "        },\n",
    "        \"Graph Optimization\": {\n",
    "            \"implementation_cost\": \"very_high\",\n",
    "            \"memory_benefit\": \"medium\",\n",
    "            \"latency_benefit\": \"very_high\",\n",
    "            \"stability_risk\": \"low\",\n",
    "            \"hardware_dependency\": \"very_high\"\n",
    "        }\n",
    "    }\n",
    "\n",
    "    print(\"\\n🎯 Acceleration Technique Recommendations:\")\n",
    "    print(\"┌─────────────────────┬─────────────┬─────────────┬─────────────┬─────────────┐\")\n",
    "    print(\"│ Workload            │ Vectorize   │ Fuse Kernels│ Mixed Prec  │ Graph Opt   │\")\n",
    "    print(\"├─────────────────────┼─────────────┼─────────────┼─────────────┼─────────────┤\")\n",
    "\n",
    "    for workload_name, workload_chars in workloads:\n",
    "        recommendations = []\n",
    "\n",
    "        for technique_name in [\"Vectorization\", \"Kernel Fusion\", \"Mixed Precision\", \"Graph Optimization\"]:\n",
    "            tech_chars = techniques[technique_name]\n",
    "            score = 0\n",
    "\n",
    "            # Benefit vs requirement matching\n",
    "            if workload_chars[\"memory_pressure\"] in [\"high\", \"very_high\"]:\n",
    "                if tech_chars[\"memory_benefit\"] in [\"medium\", \"high\"]:\n",
    "                    score += 2\n",
    "\n",
    "            if workload_chars[\"latency_sensitive\"]:\n",
    "                if tech_chars[\"latency_benefit\"] in [\"medium\", \"high\", \"very_high\"]:\n",
    "                    score += 2\n",
    "\n",
    "            # Risk vs tolerance matching\n",
    "            if workload_chars[\"stability_critical\"]:\n",
    "                if tech_chars[\"stability_risk\"] in [\"none\", \"low\"]:\n",
    "                    score += 1\n",
    "                elif tech_chars[\"stability_risk\"] == \"medium\":\n",
    "                    score -= 1\n",
    "\n",
    "            # Implementation cost vs development speed\n",
    "            if workload_chars[\"development_speed\"] == \"high\":\n",
    "                if tech_chars[\"implementation_cost\"] in [\"low\", \"medium\"]:\n",
    "                    score += 1\n",
    "                elif tech_chars[\"implementation_cost\"] in [\"high\", \"very_high\"]:\n",
    "                    score -= 1\n",
    "\n",
    "            # Hardware dependency vs variety\n",
    "            if workload_chars[\"hardware_variety\"] in [\"high\", \"very_high\"]:\n",
    "                if tech_chars[\"hardware_dependency\"] in [\"low\", \"medium\"]:\n",
    "                    score += 1\n",
    "                elif tech_chars[\"hardware_dependency\"] in [\"high\", \"very_high\"]:\n",
    "                    score -= 2\n",
    "\n",
    "            # Convert score to recommendation\n",
    "            if score >= 3:\n",
    "                rec = \"✅ High\"\n",
    "            elif score >= 1:\n",
    "                rec = \"⚡ Medium\"\n",
    "            elif score >= 0:\n",
    "                rec = \"⚠️  Low\"\n",
    "            else:\n",
    "                rec = \"❌ Skip\"\n",
    "\n",
    "            recommendations.append(rec)\n",
    "\n",
    "        rec_line = \" │ \".join(f\"{rec:10s}\" for rec in recommendations)\n",
    "        print(f\"│ {workload_name:18s}  │ {rec_line} │\")\n",
    "\n",
    "    print(\"└─────────────────────┴─────────────┴─────────────┴─────────────┴─────────────┘\")\n",
    "\n",
    "    # Implementation priority framework\n",
    "    print(f\"\\n🛠️  Implementation Priority Framework:\")\n",
    "    print(f\"   📊 Phase 1 (Always): Vectorization\")\n",
    "    print(f\"      • Low risk, high reward\")\n",
    "    print(f\"      • Works on any hardware\")\n",
    "    print(f\"      • Foundation for other optimizations\")\n",
    "    print(f\"   \")\n",
    "    print(f\"   📊 Phase 2 (Memory constrained): Kernel Fusion\")\n",
    "    print(f\"      • Targets memory-bound operations\")\n",
    "    print(f\"      • Moderate complexity\")\n",
    "    print(f\"      • Significant wins on element-wise ops\")\n",
    "    print(f\"   \")\n",
    "    print(f\"   📊 Phase 3 (Large models): Mixed Precision\")\n",
    "    print(f\"      • Essential for large model training\")\n",
    "    print(f\"      • Requires careful validation\")\n",
    "    print(f\"      • Hardware-dependent benefits\")\n",
    "    print(f\"   \")\n",
    "    print(f\"   📊 Phase 4 (Production): Graph Optimization\")\n",
    "    print(f\"      • Maximum performance extraction\")\n",
    "    print(f\"      • High implementation cost\")\n",
    "    print(f\"      • Deployment-specific tuning\")\n",
    "\n",
    "    print(f\"\\n💡 Key Decision Factors:\")\n",
    "    print(f\"   🎯 Start simple: Vectorization first, always\")\n",
    "    print(f\"   📈 Scale up: Add complexity only when needed\")\n",
    "    print(f\"   ⚡ Measure impact: Profile before and after each optimization\")\n",
    "    print(f\"   🔄 Iterate: Optimization is an ongoing process, not one-time\")\n",
    "    print(\"🚀 Systematic acceleration beats random optimization\")\n",
    "\n",
    "analyze_acceleration_decision_framework()"
   ]
  },
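  {
   "cell_type": "markdown",
   "id": "profile-before-after-sketch",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "The advice to profile before and after each optimization can be sketched as a small timing helper. This is a minimal illustration, not part of the TinyTorch API: the name `median_time_ms`, its parameters, and the `loop_sum` baseline are hypothetical. It runs a warm-up call, then reports the median of several `time.perf_counter` measurements, which is less sensitive to one-off stalls than a single run.\n",
    "\n",
    "```python\n",
    "import time\n",
    "import statistics\n",
    "import numpy as np\n",
    "\n",
    "def median_time_ms(fn, *args, warmup=1, repeats=5):\n",
    "    '''Median wall-clock time of fn(*args) in milliseconds (hypothetical helper).'''\n",
    "    for _ in range(warmup):\n",
    "        fn(*args)  # warm caches and any lazy initialization\n",
    "    samples = []\n",
    "    for _ in range(repeats):\n",
    "        start = time.perf_counter()\n",
    "        fn(*args)\n",
    "        samples.append((time.perf_counter() - start) * 1000.0)\n",
    "    return statistics.median(samples)\n",
    "\n",
    "def loop_sum(values):\n",
    "    '''Baseline: accumulate one element at a time in Python.'''\n",
    "    total = 0.0\n",
    "    for v in values:\n",
    "        total += float(v)\n",
    "    return total\n",
    "\n",
    "data = np.random.randn(100_000).astype(np.float32)\n",
    "before = median_time_ms(loop_sum, data)   # unoptimized baseline\n",
    "after = median_time_ms(np.sum, data)      # vectorized replacement\n",
    "print(f'loop: {before:.2f} ms, vectorized: {after:.3f} ms')\n",
    "```\n",
    "\n",
    "The same helper could wrap `vectorized_matmul` or `fused_gelu` to compare each technique against its unoptimized counterpart on your own hardware."
   ]
  },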
  {
   "cell_type": "markdown",
   "id": "541be4f4",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## 7. Module Integration Test\n",
    "\n",
    "Final validation that all acceleration components work together correctly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "05244210",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "test-module",
     "locked": true,
     "points": 20
    }
   },
   "outputs": [],
   "source": [
    "def test_module():\n",
    "    \"\"\"\n",
    "    Comprehensive test of entire acceleration module functionality.\n",
    "\n",
    "    This final test ensures:\n",
    "    - All acceleration techniques work correctly\n",
    "    - Performance improvements are measurable\n",
    "    - Mixed precision training is stable\n",
    "    - Components integrate seamlessly\n",
    "    - Module is ready for export\n",
    "    \"\"\"\n",
    "    print(\"🧪 RUNNING MODULE INTEGRATION TEST\")\n",
    "    print(\"=\" * 50)\n",
    "\n",
    "    # Run all unit tests\n",
    "    print(\"Running unit tests...\")\n",
    "    test_unit_vectorized_matmul()\n",
    "    test_unit_fused_gelu()\n",
    "    test_unit_fusion_speedup()\n",
    "    test_unit_mixed_precision()\n",
    "\n",
    "    print(\"\\nRunning integration scenarios...\")\n",
    "\n",
    "    # Test realistic acceleration pipeline\n",
    "    print(\"🔬 Integration Test: Complete acceleration pipeline...\")\n",
    "\n",
    "    # Create realistic model scenario\n",
    "    batch_size, seq_len, hidden_dim = 16, 64, 256\n",
    "    print(f\"   Model config: batch={batch_size}, seq_len={seq_len}, hidden={hidden_dim}\")\n",
    "\n",
    "    # Test data\n",
    "    x = Tensor(np.random.randn(batch_size, seq_len, hidden_dim).astype(np.float32))\n",
    "    weight = Tensor(np.random.randn(hidden_dim, hidden_dim).astype(np.float32))\n",
    "    print(f\"   Input tensor: {x.shape}, Weight tensor: {weight.shape}\")\n",
    "\n",
    "    # Test complete pipeline: reshape → matmul → activation → mixed precision\n",
    "    print(\"   Testing vectorized operations...\")\n",
    "\n",
    "    # Reshape for matrix multiplication (flatten batch and sequence)\n",
    "    x_reshaped = Tensor(x.data.reshape(-1, hidden_dim))\n",
    "    assert x_reshaped.shape == (batch_size * seq_len, hidden_dim)\n",
    "\n",
    "    # Vectorized matrix multiplication\n",
    "    linear_output = vectorized_matmul(x_reshaped, weight)\n",
    "    assert linear_output.shape == (batch_size * seq_len, hidden_dim)\n",
    "    print(f\"   ✅ Matrix multiplication: {x_reshaped.shape} @ {weight.shape} → {linear_output.shape}\")\n",
    "\n",
    "    # Fused activation\n",
    "    activated = fused_gelu(linear_output)\n",
    "    assert activated.shape == linear_output.shape\n",
    "    print(f\"   ✅ Fused GELU activation: {linear_output.shape} → {activated.shape}\")\n",
    "\n",
    "    # Reshape back to original structure\n",
    "    final_output = Tensor(activated.data.reshape(batch_size, seq_len, hidden_dim))\n",
    "    assert final_output.shape == x.shape\n",
    "    print(f\"   ✅ Output reshape: {activated.shape} → {final_output.shape}\")\n",
    "\n",
    "    print(\"   Testing mixed precision training integration...\")\n",
    "\n",
    "    # Create complete model for mixed precision testing\n",
    "    class TransformerBlock:\n",
    "        def __init__(self, hidden_dim):\n",
    "            self.hidden_dim = hidden_dim\n",
    "            self.weight1 = Tensor(np.random.randn(hidden_dim, hidden_dim).astype(np.float32))\n",
    "            self.weight2 = Tensor(np.random.randn(hidden_dim, hidden_dim).astype(np.float32))\n",
    "            self.weight1.grad = None\n",
    "            self.weight2.grad = None\n",
    "\n",
    "        def __call__(self, x):\n",
    "            # Simulate transformer block: linear → activation → linear\n",
    "            batch_size, seq_len, hidden_dim = x.shape\n",
    "            x_flat = Tensor(x.data.reshape(-1, hidden_dim))\n",
    "\n",
    "            # First linear layer\n",
    "            h1 = vectorized_matmul(x_flat, self.weight1)\n",
    "            h1_activated = fused_gelu(h1)\n",
    "\n",
    "            # Second linear layer\n",
    "            h2 = vectorized_matmul(h1_activated, self.weight2)\n",
    "\n",
    "            # Reshape back\n",
    "            output = Tensor(h2.data.reshape(batch_size, seq_len, hidden_dim))\n",
    "            return output\n",
    "\n",
    "        def parameters(self):\n",
    "            return [self.weight1, self.weight2]\n",
    "\n",
    "    class SimpleOptimizer:\n",
    "        def __init__(self, params):\n",
    "            self.params = params\n",
    "\n",
    "        def zero_grad(self):\n",
    "            for p in self.params:\n",
    "                p.grad = None\n",
    "\n",
    "        def step(self):\n",
    "            for p in self.params:\n",
    "                if p.grad is not None:\n",
    "                    p.data = p.data - 0.001 * p.grad.data\n",
    "\n",
    "    # Initialize model and optimizer\n",
    "    model = TransformerBlock(hidden_dim)\n",
    "    optimizer = SimpleOptimizer(model.parameters())\n",
    "    trainer = MixedPrecisionTrainer(model, optimizer, loss_scale=512.0)\n",
    "\n",
    "    print(f\"   Model parameters: {len(model.parameters())}\")\n",
    "    print(f\"   Initial loss scale: {trainer.loss_scale}\")\n",
    "\n",
    "    # Simulate training steps\n",
    "    print(\"   Running training steps...\")\n",
    "    targets = Tensor(np.random.randn(batch_size, seq_len, hidden_dim).astype(np.float32))\n",
    "\n",
    "    training_metrics = []\n",
    "    for step in range(5):\n",
    "        metrics = trainer.train_step((x, targets))\n",
    "        training_metrics.append(metrics)\n",
    "\n",
    "        # Verify metrics are reasonable\n",
    "        # Accept NumPy scalars too: np.float32 is not a subclass of Python float\n",
    "        assert isinstance(metrics['loss'], (int, float, np.floating))\n",
    "        assert metrics['loss'] >= 0\n",
    "        assert metrics['loss_scale'] > 0\n",
    "        assert isinstance(metrics['overflow'], bool)\n",
    "        assert isinstance(metrics['gradients_valid'], bool)\n",
    "\n",
    "    print(f\"   ✅ Completed {len(training_metrics)} training steps\")\n",
    "\n",
    "    # Analyze training stability\n",
    "    losses = [m['loss'] for m in training_metrics]\n",
    "    overflows = [m['overflow'] for m in training_metrics]\n",
    "\n",
    "    print(f\"   Loss range: {min(losses):.6f} - {max(losses):.6f}\")\n",
    "    print(f\"   Overflow rate: {sum(overflows)}/{len(overflows)} steps\")\n",
    "\n",
    "    print(\"   Testing performance characteristics...\")\n",
    "\n",
    "    # Verify acceleration provides measurable benefits\n",
    "    test_sizes = [128, 256]\n",
    "    for size in test_sizes:\n",
    "        test_x = Tensor(np.random.randn(size, size).astype(np.float32))\n",
    "        test_y = Tensor(np.random.randn(size, size).astype(np.float32))\n",
    "\n",
    "        # Time operations with a monotonic, high-resolution clock\n",
    "        start = time.perf_counter()\n",
    "        _ = vectorized_matmul(test_x, test_y)\n",
    "        matmul_time = time.perf_counter() - start\n",
    "\n",
    "        start = time.perf_counter()\n",
    "        _ = fused_gelu(test_x)\n",
    "        gelu_time = time.perf_counter() - start\n",
    "\n",
    "        # Verify operations complete in reasonable time\n",
    "        assert matmul_time < 1.0, f\"Matrix multiplication too slow: {matmul_time:.3f}s\"\n",
    "        assert gelu_time < 0.1, f\"GELU activation too slow: {gelu_time:.3f}s\"\n",
    "\n",
    "        print(f\"   ✅ Size {size}: matmul={matmul_time*1000:.1f}ms, gelu={gelu_time*1000:.1f}ms\")\n",
    "\n",
    "    print(\"   Testing memory efficiency...\")\n",
    "\n",
    "    # Verify mixed precision reduces memory usage conceptually\n",
    "    param_count = sum(p.data.size for p in model.parameters())\n",
    "    activation_count = batch_size * seq_len * hidden_dim\n",
    "\n",
    "    fp32_memory = (param_count + activation_count) * 4  # 4 bytes per FP32\n",
    "    mixed_memory = param_count * 4 + activation_count * 2  # FP32 params + FP16 activations\n",
    "    memory_savings = (fp32_memory - mixed_memory) / fp32_memory * 100\n",
    "\n",
    "    print(f\"   Memory analysis: {memory_savings:.1f}% savings from mixed precision\")\n",
    "    assert memory_savings > 0, \"Mixed precision should reduce memory usage\"\n",
    "\n",
    "    print(\"✅ End-to-end acceleration pipeline works!\")\n",
    "\n",
    "    print(\"\\n\" + \"=\" * 50)\n",
    "    print(\"🎉 ALL TESTS PASSED! Module ready for export.\")\n",
    "    print(\"Run: tito module complete 16\")\n",
    "\n",
    "# Call the module test\n",
    "test_module()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6531eb00",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "main-execution",
     "solution": false
    }
   },
   "outputs": [],
   "source": [
    "# Main execution block\n",
    "if __name__ == \"__main__\":\n",
    "    print(\"🚀 Running Acceleration module...\")\n",
    "    test_module()\n",
    "    print(\"✅ Module validation complete!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e1054af9",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## 🤔 ML Systems Thinking: Acceleration and Performance\n",
    "\n",
    "### Question 1: Arithmetic Intensity Analysis\n",
    "You implemented vectorized matrix multiplication and fused GELU.\n",
    "- Matrix multiplication (1024×1024, FP32): Performs ~2.1 billion FLOPs, moves ~12 MB of data (reads two 4 MB inputs, writes one 4 MB output)\n",
    "- Arithmetic intensity: _____ FLOPs/byte\n",
    "- Compared to element-wise addition (0.33 FLOPs per element moved, i.e. ~0.083 FLOPs/byte in FP32): _____× higher intensity\n",
    "- Why does this make matrix multiplication ideal for GPUs? _____\n",
    "\n",
    "### Question 2: Kernel Fusion Memory Benefits\n",
    "Your fused_gelu combines 7 operations into a single expression.\n",
    "- Unfused version memory accesses: 7 reads + 7 writes = _____ per element\n",
    "- Fused version memory accesses: 1 read + 1 write = _____ per element\n",
    "- Memory bandwidth reduction: _____%\n",
    "- Why is this critical for transformer inference? _____\n",
    "\n",
    "### Question 3: Mixed Precision Memory Calculation\n",
    "Your MixedPrecisionTrainer uses FP16 activations, FP32 parameters.\n",
    "For a 100M parameter model with 50M activation elements:\n",
    "- FP32 memory: (100M + 50M) × 4 bytes = _____ MB\n",
    "- Mixed precision memory: 100M × 4 + 50M × 2 = _____ MB\n",
    "- Memory reduction: _____%\n",
    "\n",
    "### Question 4: Loss Scaling Strategy\n",
    "Your trainer starts with loss_scale=1024, grows by 2×, shrinks by 0.5×.\n",
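    "- Smallest normal FP16 value: ~6e-5 (smaller values lose precision as subnormals and underflow to zero below ~6e-8)\n",
    "- Without scaling, gradients < _____ fall out of the normal FP16 range\n",
    "- With 1024× scaling, gradients down to _____ stay in the normal range\n",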
    "- Why increase scale gradually but decrease immediately? _____\n",
    "\n",
    "### Question 5: Production Optimization Strategy\n",
    "Based on your decision framework analysis:\n",
    "For edge deployment (memory critical, stability required, hardware diverse):\n",
    "- Priority 1 technique: _____ (low risk, universal)\n",
    "- Priority 2 technique: _____ (memory benefits)\n",
    "- Skip technique: _____ (why: _____)\n",
    "- What's the primary constraint: memory, compute, or power? _____"
   ]
  },
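  {
   "cell_type": "markdown",
   "id": "arithmetic-intensity-worked",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "As a worked illustration of the arithmetic-intensity calculation in Question 1, consider a smaller case than the one asked about. The derivation assumes an $N \\times N$ FP32 matrix multiplication in which each matrix crosses the memory bus exactly once (two inputs read, one output written). That is an idealization that ignores cache effects:\n",
    "\n",
    "$$\n",
    "\\text{FLOPs} = 2N^3, \\qquad \\text{Bytes} = 3 \\cdot N^2 \\cdot 4 = 12N^2, \\qquad \\text{AI} = \\frac{2N^3}{12N^2} = \\frac{N}{6}\\ \\text{FLOPs/byte}\n",
    "$$\n",
    "\n",
    "For $N = 512$ this gives $512 / 6 \\approx 85$ FLOPs/byte. Under these assumptions intensity grows linearly with $N$, which is why large matrix multiplications tend to be compute-bound while element-wise operations stay memory-bound."
   ]
  },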
  {
   "cell_type": "markdown",
   "id": "2fcecfae",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## 🎯 MODULE SUMMARY: Acceleration\n",
    "\n",
    "Congratulations! You've mastered the fundamental techniques for accelerating neural networks!\n",
    "\n",
    "### Key Accomplishments\n",
    "- Built **vectorized operations** leveraging SIMD and optimized BLAS for 2-5× speedups\n",
    "- Implemented **kernel fusion** reducing memory bandwidth by 60-80% for element-wise operations\n",
    "- Created **mixed precision training** with automatic loss scaling for 20-40% memory savings\n",
    "- Analyzed **arithmetic intensity patterns** and their impact on the roofline model\n",
    "- Developed **production decision framework** for systematic optimization\n",
    "- All tests pass ✅ (validated by `test_module()`)\n",
    "\n",
    "### Systems Insights Discovered\n",
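    "- **Roofline Model**: Operations with high arithmetic intensity (FLOPs/byte) are compute-bound and benefit most from extra compute throughput; low-intensity operations are capped by memory bandwidth\n",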
    "- **Memory Bandwidth**: Often the limiting factor for modern accelerators\n",
    "- **Kernel Fusion**: Critical for memory-bound workloads, reduces intermediate storage overhead\n",
    "- **Mixed Precision**: Essential for large model training, requires careful gradient scaling\n",
    "- **Optimization Strategy**: Start simple (vectorization), add complexity as needed\n",
    "\n",
    "### Production Impact\n",
    "Your acceleration techniques enable:\n",
    "- **Training larger models** within memory constraints\n",
    "- **Faster iteration cycles** during research and development\n",
    "- **Better hardware utilization** across different deployment targets\n",
    "- **Cost reduction** through improved efficiency\n",
    "\n",
    "### Ready for Next Steps\n",
    "Your acceleration implementations provide the foundation for quantization techniques in Module 17.\n",
    "The performance analysis skills transfer directly to production optimization workflows.\n",
    "\n",
    "Export with: `tito module complete 16`\n",
    "\n",
    "**Next**: Module 17 will add quantization to further reduce memory and increase throughput while maintaining accuracy!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}