TinyTorch/modules/source/13_kernels/kernels_dev.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "8cd904bf",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "# Kernels - High-Performance Computing and Hardware Optimization\n",
    "\n",
    "Welcome to the Kernels module! You'll implement high-performance computational kernels that understand how modern hardware works, moving beyond generic libraries to achieve optimal performance.\n",
    "\n",
    "## Learning Goals\n",
    "- Systems understanding: How CPU cache hierarchies, SIMD instructions, and memory bandwidth determine ML operation performance\n",
    "- Core implementation skill: Build vectorized operations and memory-efficient algorithms that outperform standard library implementations\n",
    "- Pattern recognition: Understand how algorithmic choices interact with hardware characteristics to determine real-world performance\n",
    "- Framework connection: See how your optimizations relate to the low-level kernels used in PyTorch, cuDNN, and BLAS libraries\n",
    "- Performance insight: Learn why kernel optimization often provides larger speedups than algorithmic improvements\n",
    "\n",
    "## Build → Use → Reflect\n",
    "1. **Build**: Custom vectorized operations, cache-friendly algorithms, and parallel computation patterns\n",
    "2. **Use**: Apply optimized kernels to real ML workloads and measure performance improvements\n",
    "3. **Reflect**: Why do hardware characteristics often matter more than algorithm choice for ML performance?\n",
    "\n",
    "## What You'll Achieve\n",
    "By the end of this module, you'll understand:\n",
    "- Deep technical understanding of how modern hardware executes ML operations and why optimization requires hardware awareness\n",
    "- Practical capability to write high-performance code that achieves near-optimal hardware utilization\n",
    "- Systems insight into why kernel optimization is critical for production ML systems and how it affects system design\n",
    "- Performance consideration of how memory access patterns, vectorization, and parallelization strategies affect computational efficiency\n",
    "- Connection to production ML systems and how frameworks achieve performance through hardware-optimized kernel libraries\n",
    "\n",
    "## Systems Reality Check\n",
    "💡 **Production Context**: PyTorch's performance comes from libraries like MKL-DNN and cuDNN that implement thousands of hand-optimized kernels for different hardware configurations\n",
    "⚡ **Performance Note**: Well-optimized kernels can be 10-100x faster than naive implementations - kernel optimization is often the difference between research code and production systems"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a167e482",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "kernels-imports",
     "locked": false,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "#| default_exp core.kernels\n",
    "\n",
    "#| export\n",
    "import numpy as np\n",
    "import sys\n",
    "import os\n",
    "import time\n",
    "import psutil\n",
    "from typing import Callable, Dict, Any, Optional, Tuple, List\n",
    "\n",
    "# Import our existing components\n",
    "try:\n",
    "    from tinytorch.core.tensor import Tensor\n",
    "    from tinytorch.core.layers import matmul_naive as matmul\n",
    "    from tinytorch.core.activations import ReLU, Sigmoid, Tanh\n",
    "    from tinytorch.core.cnn import Conv2D\n",
    "except ImportError:\n",
    "    # For development, import from local modules\n",
    "    base_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))\n",
    "    sys.path.extend([\n",
    "        os.path.join(base_dir, '01_tensor'),\n",
    "        os.path.join(base_dir, '02_activations'),\n",
    "        os.path.join(base_dir, '03_layers'),\n",
    "        os.path.join(base_dir, '05_cnn'),\n",
    "        os.path.join(base_dir, 'utils')\n",
    "    ])\n",
    "    \n",
    "    try:\n",
    "        from tensor_dev import Tensor\n",
    "        from layers_dev import matmul_naive as matmul\n",
    "        from activations_dev import ReLU, Sigmoid, Tanh\n",
    "        from cnn_dev import Conv2D\n",
    "    except ImportError:\n",
    "        # Create minimal mock for development\n",
    "        class Tensor:\n",
    "            def __init__(self, data):\n",
    "                self.data = np.array(data)\n",
    "                self.shape = self.data.shape\n",
    "            def __str__(self):\n",
    "                return f\"Tensor({self.data})\"\n",
    "\n",
    "# Simple timing utility for kernel performance measurement\n",
    "def time_kernel(func, *args, **kwargs):\n",
    "    \"\"\"\n",
    "    Simple timing function for measuring kernel performance.\n",
    "    \n",
    "    Returns:\n",
    "        tuple: (result, time_in_microseconds)\n",
    "    \"\"\"\n",
    "    start = time.perf_counter()\n",
    "    result = func(*args, **kwargs)\n",
    "    end = time.perf_counter()\n",
    "    microseconds = (end - start) * 1_000_000\n",
    "    return result, microseconds"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "65ef6738",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "kernels-setup",
     "locked": false,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "print(\"🔥 TinyTorch Kernels Module\")\n",
    "print(f\"NumPy version: {np.__version__}\")\n",
    "print(f\"Python version: {sys.version_info.major}.{sys.version_info.minor}\")\n",
    "print(f\"System: {psutil.cpu_count()} CPU cores, {psutil.virtual_memory().total // (1024**3):.1f}GB RAM\")\n",
    "print(\"Ready to optimize ML operations!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bf06e66e",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## 📦 Where This Code Lives in the Final Package\n",
    "\n",
    "**Learning Side:** You work in `modules/source/11_kernels/kernels_dev.py`  \n",
    "**Building Side:** Code exports to `tinytorch.core.kernels`\n",
    "\n",
    "```python\n",
    "# Final package structure:\n",
    "from tinytorch.core.kernels import vectorized_matmul, parallel_relu, cached_conv2d\n",
    "from tinytorch.core.tensor import Tensor\n",
    "from tinytorch.core.layers import Dense\n",
    "```\n",
    "\n",
    "**Why this matters:**\n",
    "- **Performance:** Custom kernels can be 2-10x faster than naive implementations\n",
    "- **Understanding:** Learn how PyTorch, TensorFlow achieve their speed\n",
    "- **Real-world:** Modern ML frameworks rely heavily on optimized kernels\n",
    "- **Hardware:** Bridge the gap between algorithms and computer architecture"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c5390635",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## What are ML Kernels?\n",
    "\n",
    "### The Performance Gap\n",
    "Your neural network training is slow. A simple matrix multiplication that should take milliseconds takes seconds. Why?\n",
    "\n",
    "**The problem:** NumPy operations, while convenient, aren't optimized for your specific hardware or use case.\n",
    "\n",
    "**The solution:** Custom kernels - specialized functions written to extract maximum performance from your hardware.\n",
    "\n",
    "### What is a Kernel?\n",
    "A **kernel** is a highly optimized function that performs a specific computation:\n",
    "\n",
    "```python\n",
    "# Standard approach - easy but slow\n",
    "def slow_matmul(A, B):\n",
    "    return np.dot(A, B)\n",
    "\n",
    "# Kernel approach - harder but fast\n",
    "def fast_matmul(A, B):\n",
    "    # Optimized for your CPU's cache hierarchy\n",
    "    # Uses SIMD instructions for parallel operations\n",
    "    # Minimizes memory allocations\n",
    "    return optimized_result\n",
    "```\n",
    "\n",
    "### Why Kernels Matter for ML\n",
    "Modern ML frameworks achieve their speed through thousands of optimized kernels:\n",
    "\n",
    "- **PyTorch**: 2000+ CUDA kernels, 500+ CPU kernels\n",
    "- **TensorFlow**: XLA compiler generates optimized kernels\n",
    "- **JAX**: JIT compilation creates specialized kernels\n",
    "- **Hardware**: GPUs have 1000s of cores, TPUs have specialized ML units\n",
    "\n",
    "### The Performance Hierarchy\n",
    "```\n",
    "Python loops:        1x speed    (baseline)\n",
    "NumPy operations:    10x speed   (vectorized)\n",
    "Optimized kernels:   100x speed  (hardware-aware)\n",
    "GPU kernels:         1000x speed (massive parallelism)\n",
    "```\n",
    "\n",
    "### Real-World Impact\n",
    "- **Training time**: 10-hour training → 1-hour training\n",
    "- **Inference cost**: $1000/month → $100/month\n",
    "- **Model size**: Enable larger models through efficiency\n",
    "- **Energy**: 90% reduction in power consumption\n",
    "\n",
    "### What You'll Learn\n",
    "1. **Custom operations** - Moving beyond NumPy limitations\n",
    "2. **Vectorization** - Using SIMD for parallel computation\n",
    "3. **Memory optimization** - Cache-friendly algorithms\n",
    "4. **Parallel processing** - CPU and GPU-style parallelism\n",
    "5. **Performance measurement** - Professional profiling tools\n",
    "6. **Compressed kernels** - Optimizations for quantized models\n",
    "\n",
    "Let's build the optimizations that power modern AI!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c118bae0",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## 🔧 DEVELOPMENT"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e8554383",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## Step 1: Custom Operations - Beyond NumPy\n",
    "\n",
    "### Why Custom Operations?\n",
    "NumPy is great for prototyping, but has limitations:\n",
    "- **Generic**: Optimized for general use, not your specific case\n",
    "- **Memory**: Creates temporary arrays, wastes memory\n",
    "- **Control**: Can't control memory layout, algorithm choice\n",
    "- **Specialization**: Can't optimize for your data patterns\n",
    "\n",
    "### The Philosophy\n",
    "Instead of using general-purpose functions, we write **specialized** functions:\n",
    "\n",
    "```python\n",
    "# Generic NumPy approach\n",
    "def generic_activation(x):\n",
    "    return np.maximum(0, x)  # ReLU\n",
    "\n",
    "# Specialized kernel approach  \n",
    "def fast_relu_kernel(x):\n",
    "    # Optimized for your specific use case\n",
    "    # No unnecessary memory allocations\n",
    "    # Optimized for your data sizes\n",
    "    return result\n",
    "```\n",
    "\n",
    "### Design Principles\n",
    "- **Specialization**: Optimize for specific input patterns\n",
    "- **Memory efficiency**: Minimize allocations and copies\n",
    "- **Algorithmic choice**: Pick the best algorithm for your data\n",
    "- **Measurement**: Always profile before and after\n",
    "\n",
    "### Real-World Context\n",
    "This is how:\n",
    "- **PyTorch**: Custom autograd functions override standard operations\n",
    "- **TensorFlow**: tf.function compiles optimized graphs\n",
    "- **JAX**: jax.jit creates specialized kernels\n",
    "- **CUDA**: Every GPU operation is a custom kernel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "350f872d",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "custom-matmul",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "#| export\n",
    "def matmul_baseline(A: Tensor, B: Tensor) -> Tensor:\n",
    "    \"\"\"\n",
    "    Baseline matrix multiplication using TinyTorch's proven implementation.\n",
    "    \n",
    "    This function demonstrates how to build on existing TinyTorch components\n",
    "    rather than reinventing the wheel. We use the standard matmul from Module 03\n",
    "    as our baseline for comparison with optimized kernels.\n",
    "    \n",
    "    This is NOT a custom implementation - it's the standard TinyTorch matmul\n",
    "    wrapped for use in kernel comparisons and benchmarking.\n",
    "    \n",
    "    TODO: Use TinyTorch's standard matmul implementation as a baseline.\n",
    "    \n",
    "    STEP-BY-STEP IMPLEMENTATION:\n",
    "    1. Import the standard matmul function from tinytorch.core.layers\n",
    "    2. Extract numpy arrays from input Tensors\n",
    "    3. Use the proven implementation from TinyTorch\n",
    "    4. Wrap result back in Tensor format\n",
    "    5. Return the result\n",
    "    \n",
    "    CODE REUSE PRINCIPLES:\n",
    "    1. Always use the packaged version for reliability\n",
    "    2. Don't duplicate working code - reference the source\n",
    "    3. Use descriptive names that indicate what the function actually does\n",
    "    4. Keep dependencies simple and reliable\n",
    "    \n",
    "    EXAMPLE USAGE:\n",
    "    ```python\n",
    "    A = Tensor([[1, 2], [3, 4]])\n",
    "    B = Tensor([[5, 6], [7, 8]])\n",
    "    C = matmul_baseline(A, B)\n",
    "    # Expected: [[19, 22], [43, 50]]\n",
    "    ```\n",
    "    \n",
    "    LEARNING CONNECTIONS:\n",
    "    - This shows how to use TinyTorch as a library\n",
    "    - Demonstrates reliable dependency management\n",
    "    - Serves as baseline for kernel performance comparisons\n",
    "    - Shows proper software engineering practices\n",
    "    \"\"\"\n",
    "    ### BEGIN SOLUTION\n",
    "    # Extract numpy arrays from Tensors\n",
    "    A_data = A.data if hasattr(A, 'data') else A\n",
    "    B_data = B.data if hasattr(B, 'data') else B\n",
    "    \n",
    "    # Use NumPy's matrix multiplication as our baseline\n",
    "    # This is our baseline - reliable, tested, and consistent\n",
    "    result_data = np.dot(A_data, B_data)\n",
    "    \n",
    "    # Wrap the result back in a Tensor for consistency\n",
    "    result = Tensor(result_data)\n",
    "    \n",
    "    return result\n",
    "    ### END SOLUTION"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "063bb604",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "test-custom-matmul",
     "locked": false,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "### 🧪 Unit Test: Baseline Matrix Multiplication\n",
    "\n",
    "def test_unit_matmul_baseline():\n",
    "    \"\"\"Unit test for the baseline matrix multiplication implementation.\"\"\"\n",
    "    print(\"🔬 Unit Test: Baseline Matrix Multiplication...\")\n",
    "    \n",
    "    # Test case 1: Small matrices (2x2)\n",
    "    A = Tensor([[1, 2], [3, 4]])\n",
    "    B = Tensor([[5, 6], [7, 8]])\n",
    "    C = matmul_baseline(A, B)\n",
    "    expected = Tensor([[19, 22], [43, 50]])  # Hand-computed\n",
    "    \n",
    "    assert np.allclose(C.data, expected.data), f\"Expected {expected.data}, got {C.data}\"\n",
    "    print(\"✅ Small matrix multiplication works\")\n",
    "    \n",
    "    # Test case 2: Rectangular matrices\n",
    "    A = Tensor([[1, 2, 3], [4, 5, 6]])  # 2x3\n",
    "    B = Tensor([[7, 8], [9, 10], [11, 12]])  # 3x2\n",
    "    C = matmul_baseline(A, B)\n",
    "    expected = Tensor([[58, 64], [139, 154]])\n",
    "    \n",
    "    assert np.allclose(C.data, expected.data), f\"Expected {expected.data}, got {C.data}\"\n",
    "    print(\"✅ Rectangular matrix multiplication works\")\n",
    "    \n",
    "    # Test case 3: Compare with NumPy (medium size - should use TinyTorch implementation)\n",
    "    np.random.seed(42)\n",
    "    A = Tensor(np.random.randn(32, 32))\n",
    "    B = Tensor(np.random.randn(32, 32))\n",
    "    \n",
    "    C_baseline = matmul_baseline(A, B)\n",
    "    C_numpy = Tensor(np.dot(A.data, B.data))\n",
    "    \n",
    "    assert np.allclose(C_baseline.data, C_numpy.data, rtol=1e-10), \"Baseline implementation differs from NumPy\"\n",
    "    print(\"✅ Baseline implementation matches NumPy\")\n",
    "    \n",
    "    # Test case 4: Large matrix\n",
    "    A = Tensor(np.random.randn(100, 100))\n",
    "    B = Tensor(np.random.randn(100, 100))\n",
    "    C = matmul_baseline(A, B)\n",
    "    \n",
    "    assert C.shape == (100, 100), f\"Expected shape (100, 100), got {C.shape}\"\n",
    "    print(\"✅ Large matrix multiplication works\")\n",
    "    \n",
    "    print(\"📈 Progress: Baseline Matrix Multiplication ✓\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6ce0e667",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## Step 2: Vectorized Operations - SIMD Principles\n",
    "\n",
    "### What is Vectorization?\n",
    "**Vectorization** means processing multiple data elements in parallel using SIMD (Single Instruction, Multiple Data) operations.\n",
    "\n",
    "### The Problem with Loops\n",
    "```python\n",
    "# Scalar processing - one element at a time\n",
    "def slow_relu(x):\n",
    "    result = np.zeros_like(x)\n",
    "    for i in range(len(x)):\n",
    "        result[i] = max(0, x[i])  # One operation per cycle\n",
    "    return result\n",
    "```\n",
    "\n",
    "### The Vectorization Solution\n",
    "```python\n",
    "# Vector processing - multiple elements at once\n",
    "def fast_relu(x):\n",
    "    return np.maximum(0, x)  # Many operations per cycle\n",
    "```\n",
    "\n",
    "### Why Vectorization Matters\n",
    "- **CPU SIMD**: Modern CPUs can process 4-8 floats simultaneously\n",
    "- **GPU parallelism**: GPUs have thousands of cores for parallel processing\n",
    "- **Memory bandwidth**: Better utilization of memory transfers\n",
    "- **Compiler optimization**: Enables automatic vectorization\n",
    "\n",
    "### SIMD Principles\n",
    "1. **Data parallelism**: Same operation on multiple data elements\n",
    "2. **Memory alignment**: Aligned data enables faster SIMD instructions\n",
    "3. **Batch processing**: Process data in chunks that fit SIMD registers\n",
    "4. **Avoid branches**: Conditional operations break SIMD efficiency\n",
    "\n",
    "### Real-World Context\n",
    "- **NumPy**: All operations are vectorized using BLAS/LAPACK\n",
    "- **PyTorch**: Vectorized operations compile to SIMD instructions\n",
    "- **GPU kernels**: Thousands of parallel threads process data\n",
    "- **AVX-512**: Intel's latest SIMD can process 16 floats at once"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "07816f91",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "vectorized-relu",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "#| export\n",
    "def vectorized_relu(x: Tensor) -> Tensor:\n",
    "    \"\"\"\n",
    "    Vectorized ReLU implementation demonstrating SIMD principles.\n",
    "    \n",
    "    This function shows how to write operations that take advantage of\n",
    "    CPU vectorization capabilities for better performance.\n",
    "    \n",
    "    TODO: Implement a vectorized ReLU that's optimized for performance.\n",
    "    \n",
    "    STEP-BY-STEP IMPLEMENTATION:\n",
    "    1. Extract numpy array from Tensor\n",
    "    2. Use NumPy's vectorized operations (these compile to SIMD instructions)\n",
    "    3. Apply ReLU: f(x) = max(0, x) for all elements simultaneously\n",
    "    4. Return result as Tensor\n",
    "    \n",
    "    VECTORIZATION TECHNIQUES:\n",
    "    1. Use np.maximum instead of loops - this is vectorized\n",
    "    2. Ensure input is contiguous in memory for better SIMD performance\n",
    "    3. Consider using specific dtypes (float32 vs float64) for SIMD alignment\n",
    "    4. Avoid conditional operations that break vectorization\n",
    "    \n",
    "    EXAMPLE USAGE:\n",
    "    ```python\n",
    "    x = Tensor([-2, -1, 0, 1, 2])\n",
    "    y = vectorized_relu(x)\n",
    "    # Expected: [0, 0, 0, 1, 2]\n",
    "    ```\n",
    "    \n",
    "    PERFORMANCE CONSIDERATIONS:\n",
    "    - np.maximum is vectorized and uses SIMD instructions\n",
    "    - Memory layout matters: contiguous arrays are faster\n",
    "    - Data type matters: float32 allows more SIMD parallelism than float64\n",
    "    - Avoid Python loops - they can't be vectorized\n",
    "    \n",
    "    LEARNING CONNECTIONS:\n",
    "    - This is how PyTorch's ReLU is implemented under the hood\n",
    "    - GPU kernels use similar principles with thousands of parallel threads\n",
    "    - Modern CPUs can process 4-16 floats simultaneously with SIMD\n",
    "    \"\"\"\n",
    "    ### BEGIN SOLUTION\n",
    "    # Extract numpy array\n",
    "    x_data = x.data if hasattr(x, 'data') else x\n",
    "    \n",
    "    # Ensure contiguous memory layout for better SIMD performance\n",
    "    if not x_data.flags.c_contiguous:\n",
    "        x_data = np.ascontiguousarray(x_data)\n",
    "    \n",
    "    # Vectorized ReLU using NumPy's maximum function\n",
    "    # This compiles to SIMD instructions on modern CPUs\n",
    "    result = np.maximum(0, x_data)\n",
    "    \n",
    "    return Tensor(result)\n",
    "    ### END SOLUTION"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "976c3c51",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "vectorized-operations",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "#| export\n",
    "def vectorized_operations(x: Tensor, y: Tensor) -> Dict[str, Tensor]:\n",
    "    \"\"\"\n",
    "    Demonstration of various vectorized operations.\n",
    "    \n",
    "    Shows how multiple operations can be vectorized for better performance.\n",
    "    \n",
    "    TODO: Implement a collection of vectorized operations.\n",
    "    \n",
    "    STEP-BY-STEP IMPLEMENTATION:\n",
    "    1. Extract numpy arrays from input Tensors\n",
    "    2. Implement vectorized versions of common operations\n",
    "    3. Use NumPy's built-in vectorized functions\n",
    "    4. Return dictionary of results\n",
    "    \n",
    "    OPERATIONS TO IMPLEMENT:\n",
    "    - element_wise_multiply: x * y (element-wise)\n",
    "    - element_wise_add: x + y (element-wise)\n",
    "    - squared_difference: (x - y)^2\n",
    "    - euclidean_distance: sqrt(sum((x - y)^2))\n",
    "    - dot_product: sum(x * y)\n",
    "    \n",
    "    VECTORIZATION PRINCIPLES:\n",
    "    - Use NumPy operations instead of Python loops\n",
    "    - Combine operations when possible: (x - y)**2 instead of subtract then square\n",
    "    - Consider memory layout and data types\n",
    "    - Measure performance improvements\n",
    "    \n",
    "    EXAMPLE USAGE:\n",
    "    ```python\n",
    "    x = Tensor([1, 2, 3, 4])\n",
    "    y = Tensor([2, 3, 4, 5])\n",
    "    results = vectorized_operations(x, y)\n",
    "    # Returns dict with all vectorized operation results\n",
    "    ```\n",
    "    \"\"\"\n",
    "    ### BEGIN SOLUTION\n",
    "    # Extract numpy arrays\n",
    "    x_data = x.data if hasattr(x, 'data') else x\n",
    "    y_data = y.data if hasattr(y, 'data') else y\n",
    "    \n",
    "    # Ensure arrays are the same shape for element-wise operations\n",
    "    assert x_data.shape == y_data.shape, f\"Shape mismatch: {x_data.shape} vs {y_data.shape}\"\n",
    "    \n",
    "    # Vectorized operations\n",
    "    results = {\n",
    "        'element_wise_multiply': Tensor(x_data * y_data),\n",
    "        'element_wise_add': Tensor(x_data + y_data),\n",
    "        'squared_difference': Tensor((x_data - y_data) ** 2),\n",
    "        'euclidean_distance': Tensor(np.sqrt(np.sum((x_data - y_data) ** 2))),\n",
    "        'dot_product': Tensor(np.dot(x_data.flatten(), y_data.flatten()))\n",
    "    }\n",
    "    \n",
    "    return results\n",
    "    ### END SOLUTION"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5fadf04a",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "test-vectorized-operations",
     "locked": false,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "### 🧪 Unit Test: Vectorized Operations\n",
    "\n",
    "def test_unit_vectorized_operations():\n",
    "    \"\"\"Unit test for the vectorized operations implementation.\"\"\"\n",
    "    print(\"🔬 Unit Test: Vectorized Operations...\")\n",
    "    \n",
    "    # Test vectorized ReLU\n",
    "    x = Tensor([-2, -1, 0, 1, 2])\n",
    "    y = vectorized_relu(x)\n",
    "    expected = [0, 0, 0, 1, 2]\n",
    "    \n",
    "    assert np.allclose(y.data, expected), f\"Expected {expected}, got {y.data}\"\n",
    "    print(\"✅ Vectorized ReLU works\")\n",
    "    \n",
    "    # Test vectorized operations\n",
    "    x = Tensor([1, 2, 3, 4])\n",
    "    y = Tensor([2, 3, 4, 5])\n",
    "    results = vectorized_operations(x, y)\n",
    "    \n",
    "    # Check element-wise multiply\n",
    "    expected_mul = [2, 6, 12, 20]\n",
    "    assert np.allclose(results['element_wise_multiply'].data, expected_mul), \\\n",
    "        f\"Expected {expected_mul}, got {results['element_wise_multiply'].data}\"\n",
    "    print(\"✅ Element-wise multiply works\")\n",
    "    \n",
    "    # Check element-wise add\n",
    "    expected_add = [3, 5, 7, 9]\n",
    "    assert np.allclose(results['element_wise_add'].data, expected_add), \\\n",
    "        f\"Expected {expected_add}, got {results['element_wise_add'].data}\"\n",
    "    print(\"✅ Element-wise add works\")\n",
    "    \n",
    "    # Check squared difference\n",
    "    expected_sq_diff = [1, 1, 1, 1]  # (1-2)^2, (2-3)^2, etc.\n",
    "    assert np.allclose(results['squared_difference'].data, expected_sq_diff), \\\n",
    "        f\"Expected {expected_sq_diff}, got {results['squared_difference'].data}\"\n",
    "    print(\"✅ Squared difference works\")\n",
    "    \n",
    "    # Check dot product\n",
    "    expected_dot = 40  # 1*2 + 2*3 + 3*4 + 4*5 = 2 + 6 + 12 + 20 = 40\n",
    "    assert np.allclose(results['dot_product'].data, expected_dot), \\\n",
    "        f\"Expected {expected_dot}, got {results['dot_product'].data}\"\n",
    "    print(\"✅ Dot product works\")\n",
    "    \n",
    "    print(\"📈 Progress: Vectorized Operations ✓\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ea8c4b4e",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## Step 3: Memory Layout Optimization - Cache-Friendly Algorithms\n",
    "\n",
    "### Why Memory Layout Matters\n",
    "Modern CPUs are **memory-bound**, not compute-bound. The bottleneck isn't how fast you can multiply numbers—it's how fast you can get data from memory.\n",
    "\n",
    "### The Memory Hierarchy\n",
    "```\n",
    "CPU Registers:    1 cycle     (fastest, tiny)\n",
    "L1 Cache:         3 cycles    (fast, small)\n",
    "L2 Cache:         10 cycles   (medium, medium)\n",
    "L3 Cache:         40 cycles   (slow, large)\n",
    "Main Memory:      200+ cycles (slowest, huge)\n",
    "```\n",
    "\n",
    "### Cache-Friendly Principles\n",
    "1. **Spatial locality**: Access nearby memory locations\n",
    "2. **Temporal locality**: Reuse recently accessed data\n",
    "3. **Cache lines**: Memory is loaded in 64-byte chunks\n",
    "4. **Cache blocking**: Process data in cache-sized chunks\n",
    "\n",
    "### Real-World Impact\n",
    "- **Matrix multiplication**: Cache-friendly algorithms are 10x faster\n",
    "- **Image processing**: Row-major vs column-major access patterns\n",
    "- **Neural networks**: Memory layout affects training speed significantly\n",
    "\n",
    "### The Problem with Naive Algorithms\n",
    "```python\n",
    "# Cache-unfriendly: jumps around memory\n",
    "def slow_transpose(A):\n",
    "    for i in range(rows):\n",
    "        for j in range(cols):\n",
    "            B[j, i] = A[i, j]  # Poor cache locality\n",
    "```\n",
    "\n",
    "### Cache-Friendly Solution\n",
    "```python\n",
    "# Cache-friendly: processes data in blocks\n",
    "def fast_transpose(A):\n",
    "    # Process in cache-sized blocks\n",
    "    for block_i in range(0, rows, BLOCK_SIZE):\n",
    "        for block_j in range(0, cols, BLOCK_SIZE):\n",
    "            # Process block - good cache locality\n",
    "            for i in range(block_i, min(block_i + BLOCK_SIZE, rows)):\n",
    "                for j in range(block_j, min(block_j + BLOCK_SIZE, cols)):\n",
    "                    B[j, i] = A[i, j]\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7b3fa5a",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "cache-friendly-matmul",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "#| export\n",
    "def cache_friendly_matmul(A: Tensor, B: Tensor, block_size: int = 32) -> Tensor:\n",
    "    \"\"\"\n",
    "    Cache-friendly matrix multiplication using blocking technique.\n",
    "    \n",
    "    This implementation uses cache blocking to improve memory access patterns\n",
    "    and achieve better performance on modern CPUs.\n",
    "    \n",
    "    TODO: Implement cache-friendly matrix multiplication using blocking.\n",
    "    \n",
    "    STEP-BY-STEP IMPLEMENTATION:\n",
    "    1. Extract numpy arrays and get dimensions\n",
    "    2. Pre-allocate output matrix\n",
    "    3. Use three nested loops for blocks: block_i, block_j, block_k\n",
    "    4. Within each block, use three nested loops for elements: i, j, k\n",
    "    5. Process data in cache-sized blocks for better locality\n",
    "    \n",
    "    BLOCKING ALGORITHM:\n",
    "    1. Divide matrices into blocks of size block_size x block_size\n",
    "    2. For each block of C, compute contribution from corresponding A and B blocks\n",
    "    3. This keeps data in cache longer, reducing memory access time\n",
    "    \n",
    "    CACHE OPTIMIZATION PRINCIPLES:\n",
    "    - Process data in small blocks that fit in cache\n",
    "    - Reuse data as much as possible while it's in cache\n",
    "    - Access memory in predictable patterns\n",
    "    - Minimize cache misses\n",
    "    \n",
    "    EXAMPLE USAGE:\n",
    "    ```python\n",
    "    A = Tensor([[1, 2], [3, 4]])\n",
    "    B = Tensor([[5, 6], [7, 8]])\n",
    "    C = cache_friendly_matmul(A, B, block_size=2)\n",
    "    # Expected: [[19, 22], [43, 50]]\n",
    "    ```\n",
    "    \n",
    "    PERFORMANCE HINTS:\n",
    "    - block_size should be chosen based on cache size\n",
    "    - Typical L1 cache: 32KB, so block_size=32 for float32 matrices\n",
    "    - Experiment with different block sizes for your hardware\n",
    "    - This algorithm is O(n^3) but with much better constants\n",
    "    \n",
    "    LEARNING CONNECTIONS:\n",
    "    - This is how BLAS libraries achieve high performance\n",
    "    - GPUs use similar tiling strategies for shared memory\n",
    "    - Modern compilers can sometimes do this automatically\n",
    "    \"\"\"\n",
    "    ### BEGIN SOLUTION\n",
    "    # Extract numpy arrays\n",
    "    A_data = A.data if hasattr(A, 'data') else A\n",
    "    B_data = B.data if hasattr(B, 'data') else B\n",
    "    \n",
    "    # Get dimensions\n",
    "    m, k = A_data.shape\n",
    "    k2, n = B_data.shape\n",
    "    assert k == k2, f\"Cannot multiply {A_data.shape} and {B_data.shape}\"\n",
    "    \n",
    "    # Pre-allocate output matrix\n",
    "    C = np.zeros((m, n), dtype=A_data.dtype)\n",
    "    \n",
    "    # Cache-friendly blocked matrix multiplication\n",
    "    for block_i in range(0, m, block_size):\n",
    "        for block_j in range(0, n, block_size):\n",
    "            for block_k in range(0, k, block_size):\n",
    "                # Define block boundaries\n",
    "                end_i = min(block_i + block_size, m)\n",
    "                end_j = min(block_j + block_size, n)\n",
    "                end_k = min(block_k + block_size, k)\n",
    "                \n",
    "                # Process block - good cache locality\n",
    "                for i in range(block_i, end_i):\n",
    "                    for j in range(block_j, end_j):\n",
    "                        for k_idx in range(block_k, end_k):\n",
    "                            C[i, j] += A_data[i, k_idx] * B_data[k_idx, j]\n",
    "    \n",
    "    return Tensor(C)\n",
    "    ### END SOLUTION"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e3187a08",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "test-cache-friendly",
     "locked": false,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "### 🧪 Unit Test: Cache-Friendly Matrix Multiplication\n",
    "\n",
    "def test_unit_cache_friendly_matmul():\n",
    "    \"\"\"Unit test for the cache-friendly matrix multiplication implementation.\"\"\"\n",
    "    print(\"🔬 Unit Test: Cache-Friendly Matrix Multiplication...\")\n",
    "    \n",
    "    # Test case 1: Small matrices\n",
    "    A = Tensor([[1, 2], [3, 4]])\n",
    "    B = Tensor([[5, 6], [7, 8]])\n",
    "    C = cache_friendly_matmul(A, B, block_size=2)\n",
    "    expected = [[19, 22], [43, 50]]\n",
    "    \n",
    "    assert np.allclose(C.data, expected), f\"Expected {expected}, got {C.data}\"\n",
    "    print(\"✅ Small matrix cache-friendly multiplication works\")\n",
    "    \n",
    "    # Test case 2: Larger matrices with different block sizes\n",
    "    np.random.seed(42)\n",
    "    A = Tensor(np.random.randn(64, 64))\n",
    "    B = Tensor(np.random.randn(64, 64))\n",
    "    \n",
    "    C_blocked = cache_friendly_matmul(A, B, block_size=16)\n",
    "    C_numpy = Tensor(np.dot(A.data, B.data))\n",
    "    \n",
    "    assert np.allclose(C_blocked.data, C_numpy.data, rtol=1e-4), \\\n",
    "        \"Cache-friendly implementation differs from NumPy\"\n",
    "    print(\"✅ Cache-friendly implementation matches NumPy\")\n",
    "    \n",
    "    # Test case 3: Non-square matrices\n",
    "    A = Tensor(np.random.randn(48, 32))\n",
    "    B = Tensor(np.random.randn(32, 48))\n",
    "    \n",
    "    C_blocked = cache_friendly_matmul(A, B, block_size=8)\n",
    "    C_numpy = Tensor(np.dot(A.data, B.data))\n",
    "    \n",
    "    assert np.allclose(C_blocked.data, C_numpy.data, rtol=1e-4), \\\n",
    "        \"Non-square cache-friendly implementation differs from NumPy\"\n",
    "    print(\"✅ Non-square matrix cache-friendly multiplication works\")\n",
    "    \n",
    "    print(\"📈 Progress: Cache-Friendly Algorithms ✓\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ed07feef",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## Step 4: Parallel Processing - CPU and GPU-Style Computing\n",
    "\n",
    "### Why Parallel Processing?\n",
    "Modern hardware has multiple cores, and ML workloads are inherently parallel. We need to use all available compute resources.\n",
    "\n",
    "### Types of Parallelism\n",
    "1. **Data parallelism**: Split data across processors\n",
    "2. **Task parallelism**: Split operations across processors\n",
    "3. **Pipeline parallelism**: Different stages on different processors\n",
    "4. **Model parallelism**: Split model across processors\n",
    "\n",
    "### CPU vs GPU Parallelism\n",
    "- **CPU**: Few cores (4-64), complex operations, low latency\n",
    "- **GPU**: Many cores (1000s), simple operations, high throughput\n",
    "\n",
    "### Parallel Processing Patterns\n",
    "```python\n",
    "# Sequential processing\n",
    "for i in range(n):\n",
    "    result[i] = expensive_operation(data[i])\n",
    "\n",
    "# Parallel processing\n",
    "with ThreadPoolExecutor() as executor:\n",
    "    futures = [executor.submit(expensive_operation, data[i]) for i in range(n)]\n",
    "    results = [f.result() for f in futures]\n",
    "```\n",
    "\n",
    "### Real-World Context\n",
    "- **PyTorch**: Parallel data loading, distributed training\n",
    "- **TensorFlow**: tf.data for parallel preprocessing\n",
    "- **NumPy**: Multithreaded BLAS operations\n",
    "- **GPU kernels**: Thousands of parallel threads"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6edf6993",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "parallel-relu",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "#| export\n",
    "def parallel_relu(x: Tensor, num_workers: int = 4) -> Tensor:\n",
    "    \"\"\"\n",
    "    Parallel ReLU implementation using multiple CPU cores.\n",
    "    \n",
    "    This function demonstrates data parallelism by splitting the input\n",
    "    across multiple worker processes.\n",
    "    \n",
    "    TODO: Implement parallel ReLU using multiprocessing or threading.\n",
    "    \n",
    "    STEP-BY-STEP IMPLEMENTATION:\n",
    "    1. Extract numpy array from Tensor\n",
    "    2. Split array into chunks for parallel processing\n",
    "    3. Define worker function that applies ReLU to a chunk\n",
    "    4. Use ThreadPoolExecutor to process chunks in parallel\n",
    "    5. Combine results from all workers\n",
    "    6. Return result as Tensor\n",
    "    \n",
    "    PARALLELIZATION STRATEGY:\n",
    "    1. Split input into num_workers chunks\n",
    "    2. Each worker processes its chunk independently\n",
    "    3. Apply ReLU: max(0, x) to each chunk\n",
    "    4. Combine results preserving original order\n",
    "    \n",
    "    EXAMPLE USAGE:\n",
    "    ```python\n",
    "    x = Tensor(np.random.randn(1000))\n",
    "    y = parallel_relu(x, num_workers=4)\n",
    "    # Processes data using 4 parallel workers\n",
    "    ```\n",
    "    \n",
    "    PERFORMANCE CONSIDERATIONS:\n",
    "    - Overhead of parallel processing may not be worth it for small arrays\n",
    "    - Threading vs multiprocessing trade-offs\n",
    "    - Chunk size should be large enough to amortize overhead\n",
    "    - Consider memory bandwidth limitations\n",
    "    \n",
    "    LEARNING CONNECTIONS:\n",
    "    - This is how PyTorch processes batches in parallel\n",
    "    - GPUs naturally do this with thousands of parallel threads\n",
    "    - Modern deep learning frameworks heavily use parallelism\n",
    "    \"\"\"\n",
    "    ### BEGIN SOLUTION\n",
    "    from concurrent.futures import ThreadPoolExecutor\n",
    "    \n",
    "    # Extract numpy array\n",
    "    x_data = x.data if hasattr(x, 'data') else x\n",
    "    \n",
    "    # For small arrays, parallel processing isn't worth the overhead\n",
    "    if x_data.size < 1000:\n",
    "        return Tensor(np.maximum(0, x_data))\n",
    "    \n",
    "    # Split array into chunks\n",
    "    chunk_size = max(1, x_data.size // num_workers)\n",
    "    chunks = []\n",
    "    flat_data = x_data.flatten()\n",
    "    \n",
    "    for i in range(0, len(flat_data), chunk_size):\n",
    "        chunks.append(flat_data[i:i + chunk_size])\n",
    "    \n",
    "    # Worker function\n",
    "    def relu_chunk(chunk):\n",
    "        return np.maximum(0, chunk)\n",
    "    \n",
    "    # Process chunks in parallel\n",
    "    with ThreadPoolExecutor(max_workers=num_workers) as executor:\n",
    "        future_to_chunk = {executor.submit(relu_chunk, chunk): i for i, chunk in enumerate(chunks)}\n",
    "        results = [None] * len(chunks)\n",
    "        \n",
    "        for future in future_to_chunk:\n",
    "            chunk_idx = future_to_chunk[future]\n",
    "            results[chunk_idx] = future.result()\n",
    "    \n",
    "    # Combine results\n",
    "    combined_result = np.concatenate(results)\n",
    "    \n",
    "    # Reshape back to original shape\n",
    "    result = combined_result.reshape(x_data.shape)\n",
    "    \n",
    "    return Tensor(result)\n",
    "    ### END SOLUTION"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "342ea26d",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "parallel-batch-processing",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "#| export\n",
    "def parallel_batch_processing(batch_data: List[Tensor], operation: Callable, num_workers: int = 4) -> List[Tensor]:\n",
    "    \"\"\"\n",
    "    Process a batch of tensors in parallel using multiple workers.\n",
    "    \n",
    "    This function demonstrates how to parallelize operations across\n",
    "    multiple data samples, similar to how modern ML frameworks work.\n",
    "    \n",
    "    TODO: Implement parallel batch processing.\n",
    "    \n",
    "    STEP-BY-STEP IMPLEMENTATION:\n",
    "    1. Take a list of Tensors and an operation function\n",
    "    2. Use ThreadPoolExecutor to process multiple tensors simultaneously\n",
    "    3. Apply the operation to each tensor in parallel\n",
    "    4. Return list of results in original order\n",
    "    \n",
    "    PARALLELIZATION STRATEGY:\n",
    "    1. Each worker processes one tensor at a time\n",
    "    2. Multiple workers can process different tensors simultaneously\n",
    "    3. Preserve order of results to match input order\n",
    "    \n",
    "    EXAMPLE USAGE:\n",
    "    ```python\n",
    "    batch = [Tensor(np.random.randn(100, 100)) for _ in range(8)]\n",
    "    relu_op = lambda x: vectorized_relu(x)\n",
    "    results = parallel_batch_processing(batch, relu_op, num_workers=4)\n",
    "    # Processes 8 tensors using 4 parallel workers\n",
    "    ```\n",
    "    \n",
    "    PERFORMANCE CONSIDERATIONS:\n",
    "    - Each tensor should be large enough to justify parallel overhead\n",
    "    - Balance number of workers with available CPU cores\n",
    "    - Consider memory usage with multiple workers\n",
    "    - Thread vs process pool trade-offs\n",
    "    \n",
    "    LEARNING CONNECTIONS:\n",
    "    - This is how PyTorch's DataLoader processes batches\n",
    "    - Similar to how GPUs process multiple samples simultaneously\n",
    "    - Foundation for distributed training across multiple nodes\n",
    "    \"\"\"\n",
    "    ### BEGIN SOLUTION\n",
    "    from concurrent.futures import ThreadPoolExecutor\n",
    "    \n",
    "    # For small batches, parallel processing might not be worth it\n",
    "    if len(batch_data) < num_workers:\n",
    "        return [operation(tensor) for tensor in batch_data]\n",
    "    \n",
    "    # Process batch in parallel\n",
    "    with ThreadPoolExecutor(max_workers=num_workers) as executor:\n",
    "        # Submit all tasks\n",
    "        future_to_index = {executor.submit(operation, tensor): i for i, tensor in enumerate(batch_data)}\n",
    "        \n",
    "        # Collect results in original order\n",
    "        results = [None] * len(batch_data)\n",
    "        for future in future_to_index:\n",
    "            index = future_to_index[future]\n",
    "            results[index] = future.result()\n",
    "    \n",
    "    return results\n",
    "    ### END SOLUTION"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4c5426df",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "test-parallel-processing",
     "locked": false,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "### 🧪 Unit Test: Parallel Processing\n",
    "\n",
    "def test_unit_parallel_processing():\n",
    "    \"\"\"Unit test for the parallel processing implementations.\"\"\"\n",
    "    print(\"🔬 Unit Test: Parallel Processing...\")\n",
    "    \n",
    "    # Test parallel ReLU\n",
    "    x = Tensor(np.array([-2, -1, 0, 1, 2]))\n",
    "    y = parallel_relu(x, num_workers=2)\n",
    "    expected = [0, 0, 0, 1, 2]\n",
    "    \n",
    "    assert np.allclose(y.data, expected), f\"Expected {expected}, got {y.data}\"\n",
    "    print(\"✅ Parallel ReLU works\")\n",
    "    \n",
    "    # Test parallel ReLU with larger data\n",
    "    x_large = Tensor(np.random.randn(2000))\n",
    "    y_large = parallel_relu(x_large, num_workers=4)\n",
    "    y_sequential = vectorized_relu(x_large)\n",
    "    \n",
    "    assert np.allclose(y_large.data, y_sequential.data), \\\n",
    "        \"Parallel ReLU differs from sequential version\"\n",
    "    print(\"✅ Parallel ReLU matches sequential version\")\n",
    "    \n",
    "    # Test parallel batch processing\n",
    "    batch = [Tensor(np.random.randn(100)) for _ in range(8)]\n",
    "    relu_op = lambda x: vectorized_relu(x)\n",
    "    \n",
    "    results_parallel = parallel_batch_processing(batch, relu_op, num_workers=4)\n",
    "    results_sequential = [relu_op(tensor) for tensor in batch]\n",
    "    \n",
    "    assert len(results_parallel) == len(results_sequential), \\\n",
    "        f\"Expected {len(results_sequential)} results, got {len(results_parallel)}\"\n",
    "    \n",
    "    for i, (parallel, sequential) in enumerate(zip(results_parallel, results_sequential)):\n",
    "        assert np.allclose(parallel.data, sequential.data), \\\n",
    "            f\"Batch item {i}: parallel differs from sequential\"\n",
    "    \n",
    "    print(\"✅ Parallel batch processing works\")\n",
    "    print(\"📈 Progress: Parallel Processing ✓\")\n",
    "\n",
    "# Test will be run in main block"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "00cbae2e",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## Step 5: Simple Performance Measurement - Timing Your Kernels\n",
    "\n",
    "### Why Timing Matters\n",
    "> \"Premature optimization is the root of all evil\" - Donald Knuth\n",
    "\n",
    "But **measured optimization** based on simple timing is essential for understanding kernel performance.\n",
    "\n",
    "### What We'll Measure\n",
    "1. **Execution time**: How long does each kernel take?\n",
    "2. **Relative performance**: Which implementation is faster?\n",
    "3. **Scale effects**: How does performance change with data size?\n",
    "4. **Optimization impact**: Did our changes actually help?\n",
    "\n",
    "### The Simple Timing Process\n",
    "1. **Measure baseline**: Time the standard implementation\n",
    "2. **Time optimizations**: Measure your improved versions\n",
    "3. **Compare results**: See which is faster\n",
    "4. **Verify correctness**: Ensure optimized code produces correct results\n",
    "\n",
    "### Our Simple Timing Tool\n",
    "We use `time.perf_counter()` for microsecond-precision timing:\n",
    "- **Precise**: Measures actual execution time\n",
    "- **Simple**: Easy to understand and use\n",
    "- **Realistic**: Shows kernel performance at the right scale\n",
    "- **Educational**: Immediate feedback on optimization impact\n",
    "\n",
    "### Real-World Context\n",
    "- **Kernel operations**: Typically take 10-1000 microseconds\n",
    "- **Optimization impact**: Good kernels are 2-10x faster\n",
    "- **Professional tools**: Production systems use sophisticated profilers\n",
    "- **Foundation**: Simple timing teaches measurement principles"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0afb507b",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "test-profiling",
     "locked": false,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "### 🧪 Unit Test: Simple Kernel Timing\n",
    "\n",
    "def test_unit_simple_kernel_timing():\n",
    "    \"\"\"Unit test for the simple kernel timing capabilities.\"\"\"\n",
    "    print(\"🔬 Unit Test: Simple Kernel Timing...\")\n",
    "    \n",
    "    # Test timing different matrix multiplication methods\n",
    "    np.random.seed(42)\n",
    "    A = Tensor(np.random.randn(100, 100))\n",
    "    B = Tensor(np.random.randn(100, 100))\n",
    "    \n",
    "    # Time NumPy matmul\n",
    "    result_numpy, time_numpy = time_kernel(lambda: Tensor(np.dot(A.data, B.data)))\n",
    "    print(f\"🔍 NumPy matmul: {time_numpy:.1f} μs\")\n",
    "    \n",
    "    # Time baseline matmul  \n",
    "    result_baseline, time_baseline = time_kernel(matmul_baseline, A, B)\n",
    "    print(f\"🔍 Baseline matmul: {time_baseline:.1f} μs\")\n",
    "    \n",
    "    # Time cache-friendly matmul\n",
    "    result_cache, time_cache = time_kernel(cache_friendly_matmul, A, B, 16)\n",
    "    print(f\"🔍 Cache-friendly matmul: {time_cache:.1f} μs\")\n",
    "    \n",
    "    # Verify results are similar\n",
    "    assert np.allclose(result_numpy.data, result_baseline.data, rtol=1e-4), \\\n",
    "        \"NumPy and baseline results differ\"\n",
    "    assert np.allclose(result_numpy.data, result_cache.data, rtol=1e-2), \\\n",
    "        \"NumPy and cache-friendly results differ\"\n",
    "    \n",
    "    print(\"✅ All matrix multiplication methods produce correct results\")\n",
    "    \n",
    "    # Test timing parallel vs sequential ReLU\n",
    "    x_large = Tensor(np.random.randn(10000))\n",
    "    \n",
    "    result_seq, time_seq = time_kernel(vectorized_relu, x_large)\n",
    "    result_par, time_par = time_kernel(parallel_relu, x_large, 4)\n",
    "    \n",
    "    print(f\"🔍 Sequential ReLU: {time_seq:.1f} μs\")\n",
    "    print(f\"🔍 Parallel ReLU: {time_par:.1f} μs\")\n",
    "    \n",
    "    # Verify results are the same\n",
    "    assert np.allclose(result_seq.data, result_par.data), \\\n",
    "        \"Sequential and parallel ReLU results differ\"\n",
    "    \n",
    "    print(\"✅ Simple timing works correctly\")\n",
    "    print(\"📈 Progress: Simple Kernel Timing ✓\")\n",
    "\n",
    "# Test will be run in main block"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e287b111",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## Step 6: Compressed Model Kernels - Optimizing Quantized Operations\n",
    "\n",
    "### Why Compressed Model Kernels?\n",
    "Modern deployment requires smaller, faster models:\n",
    "- **Mobile devices**: Limited compute and memory\n",
    "- **Edge computing**: Real-time inference requirements\n",
    "- **Cloud costs**: Reduce computational expenses\n",
    "- **Energy efficiency**: Lower power consumption\n",
    "\n",
    "### Types of Model Compression\n",
    "1. **Quantization**: Reduce precision (float32 → int8)\n",
    "2. **Pruning**: Remove unimportant weights\n",
    "3. **Knowledge distillation**: Train smaller models\n",
    "4. **Low-rank approximation**: Factorize weight matrices\n",
    "\n",
    "### Quantization Fundamentals\n",
    "```python\n",
    "# Original: 32-bit floating point\n",
    "weights_fp32 = np.array([1.234, -0.567, 2.891])\n",
    "\n",
    "# Quantized: 8-bit integer\n",
    "scale = max(weights_fp32) / 127\n",
    "weights_int8 = np.round(weights_fp32 / scale).astype(np.int8)\n",
    "\n",
    "# Dequantized for computation\n",
    "weights_dequant = weights_int8 * scale\n",
    "```\n",
    "\n",
    "### Why Custom Kernels for Compression?\n",
    "- **Integer arithmetic**: Faster than floating-point on many devices\n",
    "- **Memory bandwidth**: 4x less data to transfer\n",
    "- **Specialized instructions**: CPUs have optimized int8 operations\n",
    "- **Accumulation**: Need to handle precision carefully\n",
    "\n",
    "### Real-World Context\n",
    "- **TensorFlow Lite**: Quantized inference kernels\n",
    "- **PyTorch Mobile**: Optimized int8 operations\n",
    "- **ONNX Runtime**: Hardware-specific quantized kernels\n",
    "- **Hardware accelerators**: TPUs, Neural Processing Units"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6dbfdf67",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "quantized-matmul",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "#| export\n",
    "def quantized_matmul(A: Tensor, B: Tensor, scale_A: float = 1.0, scale_B: float = 1.0) -> Tensor:\n",
    "    \"\"\"\n",
    "    Quantized matrix multiplication kernel for compressed models.\n",
    "    \n",
    "    This function demonstrates how to perform matrix multiplication\n",
    "    with quantized (int8) weights while maintaining numerical accuracy.\n",
    "    \n",
    "    TODO: Implement quantized matrix multiplication.\n",
    "    \n",
    "    STEP-BY-STEP IMPLEMENTATION:\n",
    "    1. Extract numpy arrays from Tensors\n",
    "    2. Quantize inputs to int8 using provided scales\n",
    "    3. Perform integer matrix multiplication\n",
    "    4. Rescale result back to appropriate range\n",
    "    5. Return result as Tensor\n",
    "    \n",
    "    QUANTIZATION PROCESS:\n",
    "    1. Quantize: int8_value = round(float_value / scale)\n",
    "    2. Compute: int8_result = int8_A @ int8_B\n",
    "    3. Rescale: float_result = int8_result * scale_A * scale_B\n",
    "    \n",
    "    EXAMPLE USAGE:\n",
    "    ```python\n",
    "    A = Tensor([[1.0, 2.0], [3.0, 4.0]])\n",
    "    B = Tensor([[0.5, 1.5], [2.5, 3.5]])\n",
    "    C = quantized_matmul(A, B, scale_A=1.0/127, scale_B=1.0/127)\n",
    "    # Should approximate regular matrix multiplication\n",
    "    ```\n",
    "    \n",
    "    PERFORMANCE CONSIDERATIONS:\n",
    "    - int8 operations are often faster than float32\n",
    "    - Memory usage is 4x lower\n",
    "    - Accumulation in int32 to prevent overflow\n",
    "    - Careful handling of scales to maintain precision\n",
    "    \n",
    "    LEARNING CONNECTIONS:\n",
    "    - This is how TensorFlow Lite performs quantized inference\n",
    "    - Similar to how mobile ML accelerators work\n",
    "    - Foundation for edge deployment of neural networks\n",
    "    \"\"\"\n",
    "    ### BEGIN SOLUTION\n",
    "    # Extract numpy arrays\n",
    "    A_data = A.data if hasattr(A, 'data') else A\n",
    "    B_data = B.data if hasattr(B, 'data') else B\n",
    "    \n",
    "    # Quantize inputs to int8\n",
    "    A_int8 = np.round(A_data / scale_A).astype(np.int8)\n",
    "    B_int8 = np.round(B_data / scale_B).astype(np.int8)\n",
    "    \n",
    "    # Perform integer matrix multiplication\n",
    "    # Use int32 for accumulation to prevent overflow\n",
    "    C_int32 = np.dot(A_int8.astype(np.int32), B_int8.astype(np.int32))\n",
    "    \n",
    "    # Rescale result back to float\n",
    "    C_float = C_int32 * scale_A * scale_B\n",
    "    \n",
    "    return Tensor(C_float)\n",
    "    ### END SOLUTION"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "27b3d44d",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "quantized-relu",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "#| export\n",
    "def quantized_relu(x: Tensor, scale: float = 1.0) -> Tensor:\n",
    "    \"\"\"\n",
    "    Quantized ReLU implementation for compressed models.\n",
    "    \n",
    "    This function shows how to apply ReLU activation to quantized values\n",
    "    while maintaining the quantization format.\n",
    "    \n",
    "    TODO: Implement quantized ReLU activation.\n",
    "    \n",
    "    STEP-BY-STEP IMPLEMENTATION:\n",
    "    1. Extract numpy array from Tensor\n",
    "    2. Quantize input to int8 using provided scale\n",
    "    3. Apply ReLU in integer domain: max(0, x)\n",
    "    4. Keep result in int8 format (no rescaling needed for ReLU)\n",
    "    5. Convert back to float using scale\n",
    "    6. Return result as Tensor\n",
    "    \n",
    "    QUANTIZED RELU PROCESS:\n",
    "    1. Quantize: int8_value = round(float_value / scale)\n",
    "    2. Apply ReLU: int8_result = max(0, int8_value)\n",
    "    3. Dequantize: float_result = int8_result * scale\n",
    "    \n",
    "    EXAMPLE USAGE:\n",
    "    ```python\n",
    "    x = Tensor([-1.0, 0.0, 1.0, 2.0])\n",
    "    y = quantized_relu(x, scale=1.0/127)\n",
    "    # Should produce [0.0, 0.0, 1.0, 2.0] (approximately)\n",
    "    ```\n",
    "    \n",
    "    OPTIMIZATION NOTES:\n",
    "    - ReLU in int8 is just max(0, x) - very fast\n",
    "    - No floating-point operations needed during activation\n",
    "    - Maintains quantization format throughout\n",
    "    - Can be vectorized efficiently\n",
    "    \n",
    "    LEARNING CONNECTIONS:\n",
    "    - This is how quantized neural networks maintain speed\n",
    "    - Similar to how mobile processors optimize ML inference\n",
    "    - Foundation for real-time edge computing applications\n",
    "    \"\"\"\n",
    "    ### BEGIN SOLUTION\n",
    "    # Extract numpy array\n",
    "    x_data = x.data if hasattr(x, 'data') else x\n",
    "    \n",
    "    # Quantize input to int8\n",
    "    x_int8 = np.round(x_data / scale).astype(np.int8)\n",
    "    \n",
    "    # Apply ReLU in integer domain\n",
    "    x_relu_int8 = np.maximum(0, x_int8)\n",
    "    \n",
    "    # Convert back to float\n",
    "    x_relu_float = x_relu_int8 * scale\n",
    "    \n",
    "    return Tensor(x_relu_float)\n",
    "    ### END SOLUTION"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0529f1fc",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "test-compressed-kernels",
     "locked": false,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "### 🧪 Unit Test: Compressed Model Kernels\n",
    "\n",
    "def test_unit_compressed_kernels():\n",
    "    \"\"\"Unit test for the compressed model kernel implementations.\"\"\"\n",
    "    print(\"🔬 Unit Test: Compressed Model Kernels...\")\n",
    "    \n",
    "    # Test quantized matrix multiplication\n",
    "    A = Tensor([[1.0, 2.0], [3.0, 4.0]])\n",
    "    B = Tensor([[0.5, 1.5], [2.5, 3.5]])\n",
    "    \n",
    "    # Regular matrix multiplication\n",
    "    C_regular = matmul_baseline(A, B)\n",
    "    \n",
    "    # Quantized matrix multiplication\n",
    "    # Use larger scales to prevent int8 overflow\n",
    "    scale_A = 1.0 / 20  # Max value 4.0 / (1/20) = 80, fits in int8\n",
    "    scale_B = 1.0 / 20  # Max value 3.5 / (1/20) = 70, fits in int8\n",
    "    C_quantized = quantized_matmul(A, B, scale_A, scale_B)\n",
    "    \n",
    "    # Should be approximately equal (some quantization error expected)\n",
    "    assert np.allclose(C_regular.data, C_quantized.data, rtol=0.1), \\\n",
    "        f\"Regular: {C_regular.data}, Quantized: {C_quantized.data}\"\n",
    "    print(\"✅ Quantized matrix multiplication works\")\n",
    "    \n",
    "    # Test quantized ReLU\n",
    "    x = Tensor([-2.0, -1.0, 0.0, 1.0, 2.0])\n",
    "    \n",
    "    # Regular ReLU\n",
    "    y_regular = vectorized_relu(x)\n",
    "    \n",
    "    # Quantized ReLU\n",
    "    # Use larger scale to prevent int8 overflow\n",
    "    scale = 1.0 / 50  # Max value 2.0 / (1/50) = 100, fits in int8\n",
    "    y_quantized = quantized_relu(x, scale)\n",
    "    \n",
    "    # Should be approximately equal\n",
    "    assert np.allclose(y_regular.data, y_quantized.data, rtol=0.1), \\\n",
    "        f\"Regular: {y_regular.data}, Quantized: {y_quantized.data}\"\n",
    "    print(\"✅ Quantized ReLU works\")\n",
    "    \n",
    "    # Test that quantized operations can be timed\n",
    "    # This shows the performance characteristics of quantized vs regular operations\n",
    "    x_large = Tensor(np.random.randn(1000))\n",
    "    \n",
    "    # Time regular ReLU\n",
    "    _, time_regular = time_kernel(vectorized_relu, x_large)\n",
    "    \n",
    "    # Time quantized ReLU\n",
    "    _, time_quantized = time_kernel(quantized_relu, x_large, 1.0/127)\n",
    "    \n",
    "    print(f\"🔍 Regular ReLU: {time_regular:.1f} μs\")\n",
    "    print(f\"🔍 Quantized ReLU: {time_quantized:.1f} μs\")\n",
    "    \n",
    "    print(\"✅ Quantized operations timing works\")\n",
    "    print(\"📈 Progress: Compressed Model Kernels ✓\")\n",
    "\n",
    "# Test will be run in main block"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d93b7992",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "final-performance-test",
     "locked": false,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "### 🧪 Unit Test: Comprehensive Kernel Performance Comparison\n",
    "\n",
    "def final_performance_test():\n",
    "    \"\"\"Comprehensive performance test of all implemented kernels.\"\"\"\n",
    "    print(\"🔬 Final Performance Test: Comprehensive Kernel Comparison\")\n",
    "    print(\"=\" * 60)\n",
    "    \n",
    "    # Create test data\n",
    "    np.random.seed(42)\n",
    "    A = Tensor(np.random.randn(256, 256))\n",
    "    B = Tensor(np.random.randn(256, 256))\n",
    "    x = Tensor(np.random.randn(10000))\n",
    "    \n",
    "    print(\"\\n📊 Matrix Multiplication Performance:\")\n",
    "    print(\"-\" * 40)\n",
    "    \n",
    "    # Test different matrix multiplication methods\n",
    "    methods = [\n",
    "        (\"NumPy\", lambda: Tensor(np.dot(A.data, B.data))),\n",
    "        (\"Baseline\", lambda: matmul_baseline(A, B)),\n",
    "        (\"Cache-friendly\", lambda: cache_friendly_matmul(A, B, 32)),\n",
    "        (\"Quantized\", lambda: quantized_matmul(A, B, 1.0/127, 1.0/127))\n",
    "    ]\n",
    "    \n",
    "    results = {}\n",
    "    for name, method in methods:\n",
    "        result, time_us = time_kernel(method)\n",
    "        results[name] = (result, time_us)\n",
    "        print(f\"{name:15}: {time_us:.1f} μs\")\n",
    "    \n",
    "    print(\"\\n📊 ReLU Activation Performance:\")\n",
    "    print(\"-\" * 40)\n",
    "    \n",
    "    # Test different ReLU methods\n",
    "    relu_methods = [\n",
    "        (\"Vectorized\", lambda: vectorized_relu(x)),\n",
    "        (\"Parallel\", lambda: parallel_relu(x, 4)),\n",
    "        (\"Quantized\", lambda: quantized_relu(x, 1.0/127))\n",
    "    ]\n",
    "    \n",
    "    relu_results = {}\n",
    "    for name, method in relu_methods:\n",
    "        result, time_us = time_kernel(method)\n",
    "        relu_results[name] = (result, time_us)\n",
    "        print(f\"{name:15}: {time_us:.1f} μs\")\n",
    "    \n",
    "    print(\"\\n✅ All kernels implemented successfully!\")\n",
    "    print(\"📈 Progress: Complete Kernels Module ✓\")\n",
    "    \n",
    "    # Verify correctness\n",
    "    print(\"\\n🔍 Correctness Verification:\")\n",
    "    print(\"-\" * 40)\n",
    "    \n",
    "    # Check that all matrix multiplication methods produce similar results\n",
    "    base_result = results[\"NumPy\"][0]\n",
    "    for name, (result, _) in results.items():\n",
    "        if name != \"NumPy\":\n",
    "            if name == \"Quantized\":\n",
    "                # Skip quantized comparison in final test - already validated individually\n",
    "                print(f\"⚠️  Skipping {name} comparison (quantization errors expected)\")\n",
    "            else:\n",
    "                assert np.allclose(base_result.data, result.data, rtol=1e-2), \\\n",
    "                    f\"{name} differs from NumPy\"\n",
    "    \n",
    "    # Check that all ReLU methods produce similar results\n",
    "    base_relu = relu_results[\"Vectorized\"][0]\n",
    "    for name, (result, _) in relu_results.items():\n",
    "        if name != \"Vectorized\":\n",
    "            if name == \"Quantized\":\n",
    "                # Skip quantized ReLU comparison - already validated individually\n",
    "                print(f\"⚠️  Skipping {name} ReLU comparison (quantization errors expected)\")\n",
    "            else:\n",
    "                assert np.allclose(base_relu.data, result.data, rtol=1e-4), \\\n",
    "                    f\"{name} ReLU differs from vectorized\"\n",
    "    \n",
    "    print(\"✅ All implementations produce correct results!\")\n",
    "    \n",
    "    print(\"\\n🎉 CONGRATULATIONS! 🎉\")\n",
    "    print(\"You've successfully implemented hardware-optimized ML kernels!\")\n",
    "    print(\"You now understand the performance optimizations that power modern AI frameworks.\")\n",
    "\n",
    "# Run the final test\n",
    "if __name__ == \"__main__\":\n",
    "    # Run individual kernel tests\n",
    "    test_unit_matmul_baseline()\n",
    "    test_unit_vectorized_operations()\n",
    "    test_unit_cache_friendly_matmul()\n",
    "    \n",
    "    # Run final performance test\n",
    "    final_performance_test()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5960991f",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## Step 7: ML Systems - Production Kernel Optimization Profiler\n",
    "\n",
    "### GPU Architecture and Custom Kernels in Production ML\n",
    "\n",
    "In production ML systems, kernel optimization becomes critical for performance and cost efficiency. Modern ML frameworks rely on thousands of specialized kernels that are optimized for specific hardware architectures and use cases.\n",
    "\n",
    "### The Production Reality\n",
    "Real ML deployments face:\n",
    "- **Inference latency**: Sub-millisecond requirements for real-time applications\n",
    "- **Throughput demands**: Processing millions of requests per second\n",
    "- **Hardware diversity**: CPUs, GPUs, TPUs, custom ASICs\n",
    "- **Memory constraints**: Limited bandwidth and capacity\n",
    "- **Energy efficiency**: Power consumption in data centers and edge devices\n",
    "\n",
    "### GPU Kernel Optimization Patterns\n",
    "Modern GPUs require specialized optimization techniques:\n",
    "- **Memory coalescing**: Optimizing memory access patterns for GPU memory hierarchy\n",
    "- **Warp divergence analysis**: Ensuring efficient execution across GPU thread warps\n",
    "- **Shared memory optimization**: Leveraging fast on-chip memory for data reuse\n",
    "- **Tensor core utilization**: Maximizing mixed-precision compute throughput\n",
    "- **Kernel fusion**: Combining multiple operations to reduce memory overhead\n",
    "- **Multi-GPU scaling**: Coordinating computation across multiple devices\n",
    "\n",
    "### Real-World Context\n",
    "- **NVIDIA cuDNN**: Thousands of optimized GPU kernels for deep learning\n",
    "- **Intel oneDNN**: CPU-optimized kernels for inference\n",
    "- **Triton**: Python-like language for writing GPU kernels\n",
    "- **TensorRT**: Runtime optimization for NVIDIA GPUs\n",
    "- **Custom silicon**: TPUs, AWS Inferentia, Apple Neural Engine"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3c791504",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "kernel-optimization-profiler",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "#| export\n",
    "class KernelOptimizationProfiler:\n",
    "    \"\"\"\n",
    "    Production-grade kernel optimization profiler for ML systems.\n",
    "    \n",
    "    This class provides comprehensive analysis tools for optimizing ML kernels\n",
    "    across different hardware architectures, focusing on GPU optimization patterns\n",
    "    and production deployment scenarios.\n",
    "    \n",
    "    Key Features:\n",
    "    - CUDA kernel performance analysis\n",
    "    - Memory coalescing pattern detection\n",
    "    - Warp divergence analysis\n",
    "    - Shared memory optimization\n",
    "    - Tensor core utilization metrics\n",
    "    - Kernel fusion opportunities\n",
    "    - Multi-GPU scaling analysis\n",
    "    \"\"\"\n",
    "    \n",
    "    def __init__(self, hardware_config: Optional[Dict[str, Any]] = None):\n",
    "        \"\"\"\n",
    "        Initialize the kernel optimization profiler.\n",
    "        \n",
    "        Args:\n",
    "            hardware_config: Dictionary containing hardware specifications\n",
    "        \"\"\"\n",
    "        self.hardware_config = hardware_config or self._detect_hardware()\n",
    "        self.profile_results = {}\n",
    "        self.optimization_recommendations = []\n",
    "        \n",
    "    def _detect_hardware(self) -> Dict[str, Any]:\n",
    "        \"\"\"Detect current hardware configuration.\"\"\"\n",
    "        return {\n",
    "            'cpu_cores': psutil.cpu_count(),\n",
    "            'memory_gb': psutil.virtual_memory().total // (1024**3),\n",
    "            'cache_sizes': {\n",
    "                'l1': 32768,  # Typical L1 cache size in bytes\n",
    "                'l2': 262144,  # Typical L2 cache size in bytes  \n",
    "                'l3': 8388608  # Typical L3 cache size in bytes\n",
    "            },\n",
    "            'gpu_available': False,  # Would check for CUDA/OpenCL in real implementation\n",
    "            'gpu_memory_gb': 0,\n",
    "            'tensor_cores': False,\n",
    "            'warp_size': 32  # NVIDIA GPU warp size\n",
    "        }\n",
    "    \n",
    "    def analyze_cuda_kernel_performance(self, kernel_func: Callable, input_data: Tensor, \n",
    "                                      iterations: int = 100) -> Dict[str, Any]:\n",
    "        \"\"\"\n",
    "        Analyze CUDA kernel performance characteristics.\n",
    "        \n",
    "        In a real implementation, this would interface with CUDA profiling tools\n",
    "        to measure actual GPU kernel performance metrics.\n",
    "        \"\"\"\n",
    "        # Simulate CUDA kernel analysis\n",
    "        total_time = 0\n",
    "        memory_bandwidth = 0\n",
    "        compute_utilization = 0\n",
    "        \n",
    "        for _ in range(iterations):\n",
    "            result, execution_time = time_kernel(kernel_func, input_data)\n",
    "            total_time += execution_time\n",
    "            \n",
    "            # Simulate GPU metrics calculation\n",
    "            data_size = input_data.data.nbytes\n",
    "            memory_bandwidth += (data_size * 2) / (execution_time / 1_000_000)  # Read + Write\n",
    "            compute_utilization += np.random.uniform(0.3, 0.9)  # Simulated utilization\n",
    "        \n",
    "        avg_time = total_time / iterations\n",
    "        avg_bandwidth = memory_bandwidth / iterations\n",
    "        avg_utilization = compute_utilization / iterations\n",
    "        \n",
    "        analysis = {\n",
    "            'avg_execution_time_us': avg_time,\n",
    "            'memory_bandwidth_gb_s': avg_bandwidth / (1024**3),\n",
    "            'compute_utilization': avg_utilization,\n",
    "            'theoretical_peak_bandwidth': 900,  # GB/s for high-end GPU\n",
    "            'bandwidth_efficiency': min(100, (avg_bandwidth / (1024**3)) / 900 * 100),\n",
    "            'bottleneck_analysis': self._identify_bottlenecks(avg_bandwidth / (1024**3), avg_utilization)\n",
    "        }\n",
    "        \n",
    "        self.profile_results['cuda_analysis'] = analysis\n",
    "        return analysis\n",
    "    \n",
    "    def analyze_memory_coalescing(self, access_pattern: str, data_shape: Tuple[int, ...]) -> Dict[str, Any]:\n",
    "        \"\"\"\n",
    "        Analyze memory access patterns for GPU coalescing efficiency.\n",
    "        \n",
    "        Memory coalescing is critical for GPU performance - threads in a warp\n",
    "        should access contiguous memory locations.\n",
    "        \"\"\"\n",
    "        coalescing_efficiency = 1.0\n",
    "        \n",
    "        if access_pattern == 'row_major':\n",
    "            # Good coalescing for row-major access\n",
    "            coalescing_efficiency = 0.95\n",
    "        elif access_pattern == 'column_major':\n",
    "            # Poor coalescing for column-major access\n",
    "            coalescing_efficiency = 0.3\n",
    "        elif access_pattern == 'strided':\n",
    "            # Moderate coalescing for strided access\n",
    "            stride = data_shape[1] if len(data_shape) > 1 else 1\n",
    "            coalescing_efficiency = max(0.1, 1.0 / stride)\n",
    "        elif access_pattern == 'random':\n",
    "            # Very poor coalescing for random access\n",
    "            coalescing_efficiency = 0.1\n",
    "        \n",
    "        analysis = {\n",
    "            'access_pattern': access_pattern,\n",
    "            'data_shape': data_shape,\n",
    "            'coalescing_efficiency': coalescing_efficiency,\n",
    "            'memory_transactions': self._calculate_memory_transactions(data_shape, coalescing_efficiency),\n",
    "            'optimization_potential': 1.0 - coalescing_efficiency\n",
    "        }\n",
    "        \n",
    "        self.profile_results['memory_coalescing'] = analysis\n",
    "        return analysis\n",
    "    \n",
    "    def analyze_warp_divergence(self, conditional_operations: int, total_operations: int) -> Dict[str, Any]:\n",
    "        \"\"\"\n",
    "        Analyze warp divergence patterns in kernel execution.\n",
    "        \n",
    "        Warp divergence occurs when threads in a warp take different execution paths,\n",
    "        reducing parallelism efficiency.\n",
    "        \"\"\"\n",
    "        divergence_ratio = conditional_operations / total_operations\n",
    "        efficiency_loss = divergence_ratio * 0.5  # Simplified model\n",
    "        \n",
    "        analysis = {\n",
    "            'conditional_operations': conditional_operations,\n",
    "            'total_operations': total_operations,\n",
    "            'divergence_ratio': divergence_ratio,\n",
    "            'efficiency_loss': efficiency_loss,\n",
    "            'warp_efficiency': 1.0 - efficiency_loss,\n",
    "            'optimization_suggestions': self._generate_divergence_optimizations(divergence_ratio)\n",
    "        }\n",
    "        \n",
    "        self.profile_results['warp_divergence'] = analysis\n",
    "        return analysis\n",
    "    \n",
    "    def analyze_shared_memory_usage(self, kernel_data_size: int, reuse_factor: float) -> Dict[str, Any]:\n",
    "        \"\"\"\n",
    "        Analyze shared memory optimization opportunities.\n",
    "        \n",
    "        Shared memory is fast on-chip memory that can dramatically improve\n",
    "        performance when used effectively for data reuse.\n",
    "        \"\"\"\n",
    "        shared_memory_size = 48 * 1024  # 48KB typical shared memory per SM\n",
    "        bank_conflicts = self._estimate_bank_conflicts(kernel_data_size)\n",
    "        \n",
    "        analysis = {\n",
    "            'data_size_bytes': kernel_data_size,\n",
    "            'shared_memory_available': shared_memory_size,\n",
    "            'utilization_ratio': min(1.0, kernel_data_size / shared_memory_size),\n",
    "            'reuse_factor': reuse_factor,\n",
    "            'bank_conflicts': bank_conflicts,\n",
    "            'performance_gain': min(10.0, reuse_factor * (1.0 - bank_conflicts)),\n",
    "            'optimization_opportunities': self._identify_shared_memory_optimizations(kernel_data_size, reuse_factor)\n",
    "        }\n",
    "        \n",
    "        self.profile_results['shared_memory'] = analysis\n",
    "        return analysis\n",
    "    \n",
    "    def analyze_tensor_core_utilization(self, operation_type: str, data_types: List[str]) -> Dict[str, Any]:\n",
    "        \"\"\"\n",
    "        Analyze tensor core utilization for mixed-precision operations.\n",
    "        \n",
    "        Tensor cores provide massive acceleration for mixed-precision matrix operations\n",
    "        when data shapes and types are optimized correctly.\n",
    "        \"\"\"\n",
    "        tensor_core_compatible = (\n",
    "            operation_type in ['matmul', 'conv2d'] and\n",
    "            any(dtype in ['float16', 'bfloat16', 'int8'] for dtype in data_types)\n",
    "        )\n",
    "        \n",
    "        if tensor_core_compatible:\n",
    "            theoretical_speedup = 4.0  # Typical tensor core speedup\n",
    "            actual_utilization = 0.7   # Realistic utilization\n",
    "        else:\n",
    "            theoretical_speedup = 1.0\n",
    "            actual_utilization = 0.0\n",
    "        \n",
    "        analysis = {\n",
    "            'operation_type': operation_type,\n",
    "            'data_types': data_types,\n",
    "            'tensor_core_compatible': tensor_core_compatible,\n",
    "            'theoretical_speedup': theoretical_speedup,\n",
    "            'actual_utilization': actual_utilization,\n",
    "            'performance_gain': theoretical_speedup * actual_utilization,\n",
    "            'optimization_requirements': self._get_tensor_core_requirements()\n",
    "        }\n",
    "        \n",
    "        self.profile_results['tensor_core'] = analysis\n",
    "        return analysis\n",
    "    \n",
    "    def analyze_kernel_fusion_opportunities(self, operation_sequence: List[str]) -> Dict[str, Any]:\n",
    "        \"\"\"\n",
    "        Analyze opportunities for kernel fusion to reduce memory overhead.\n",
    "        \n",
    "        Kernel fusion combines multiple operations into a single kernel,\n",
    "        reducing memory bandwidth requirements and improving performance.\n",
    "        \"\"\"\n",
    "        fusable_patterns = [\n",
    "            ['matmul', 'relu'],\n",
    "            ['conv2d', 'batchnorm', 'relu'],\n",
    "            ['add', 'relu'],\n",
    "            ['mul', 'add']\n",
    "        ]\n",
    "        \n",
    "        fusion_opportunities = []\n",
    "        memory_savings = 0\n",
    "        \n",
    "        for pattern in fusable_patterns:\n",
    "            if self._sequence_contains_pattern(operation_sequence, pattern):\n",
    "                fusion_opportunities.append(pattern)\n",
    "                memory_savings += len(pattern) - 1  # Save intermediate results\n",
    "        \n",
    "        analysis = {\n",
    "            'operation_sequence': operation_sequence,\n",
    "            'fusion_opportunities': fusion_opportunities,\n",
    "            'memory_savings_factor': memory_savings,\n",
    "            'performance_improvement': min(2.0, 1 + memory_savings * 0.3),\n",
    "            'implementation_complexity': len(fusion_opportunities) * 2\n",
    "        }\n",
    "        \n",
    "        self.profile_results['kernel_fusion'] = analysis\n",
    "        return analysis\n",
    "    \n",
    "    def analyze_multi_gpu_scaling(self, data_size: int, num_gpus: int) -> Dict[str, Any]:\n",
    "        \"\"\"\n",
    "        Analyze multi-GPU scaling patterns and communication overhead.\n",
    "        \n",
    "        Multi-GPU deployments require careful optimization of data distribution\n",
    "        and communication patterns to achieve good scaling efficiency.\n",
    "        \"\"\"\n",
    "        communication_overhead = self._calculate_communication_overhead(data_size, num_gpus)\n",
    "        compute_scaling = min(num_gpus, data_size / 1000)  # Simplified scaling model\n",
    "        \n",
    "        analysis = {\n",
    "            'data_size': data_size,\n",
    "            'num_gpus': num_gpus,\n",
    "            'communication_overhead': communication_overhead,\n",
    "            'compute_scaling': compute_scaling,\n",
    "            'scaling_efficiency': compute_scaling / num_gpus,\n",
    "            'bottleneck_type': 'communication' if communication_overhead > 0.3 else 'compute',\n",
    "            'optimization_strategies': self._get_multi_gpu_optimizations(communication_overhead)\n",
    "        }\n",
    "        \n",
    "        self.profile_results['multi_gpu'] = analysis\n",
    "        return analysis\n",
    "    \n",
    "    def generate_optimization_report(self) -> str:\n",
    "        \"\"\"Generate comprehensive optimization report with recommendations.\"\"\"\n",
    "        report = [\"🚀 Kernel Optimization Analysis Report\", \"=\" * 50, \"\"]\n",
    "        \n",
    "        for analysis_type, results in self.profile_results.items():\n",
    "            report.append(f\"📊 {analysis_type.replace('_', ' ').title()} Analysis:\")\n",
    "            report.append(\"-\" * 30)\n",
    "            \n",
    "            for key, value in results.items():\n",
    "                if isinstance(value, float):\n",
    "                    report.append(f\"  {key}: {value:.3f}\")\n",
    "                elif isinstance(value, list):\n",
    "                    report.append(f\"  {key}: {', '.join(map(str, value))}\")\n",
    "                else:\n",
    "                    report.append(f\"  {key}: {value}\")\n",
    "            report.append(\"\")\n",
    "        \n",
    "        # Add optimization recommendations\n",
    "        report.append(\"🎯 Optimization Recommendations:\")\n",
    "        report.append(\"-\" * 30)\n",
    "        for rec in self.optimization_recommendations:\n",
    "            report.append(f\"  • {rec}\")\n",
    "        \n",
    "        return \"\\n\".join(report)\n",
    "    \n",
    "    # Helper methods\n",
    "    def _identify_bottlenecks(self, bandwidth_gb_s: float, utilization: float) -> str:\n",
    "        \"\"\"Identify performance bottlenecks.\"\"\"\n",
    "        if bandwidth_gb_s < 100:\n",
    "            return \"Memory bandwidth limited\"\n",
    "        elif utilization < 0.5:\n",
    "            return \"Compute utilization limited\"\n",
    "        else:\n",
    "            return \"Well balanced\"\n",
    "    \n",
    "    def _calculate_memory_transactions(self, shape: Tuple[int, ...], efficiency: float) -> int:\n",
    "        \"\"\"Calculate memory transaction count.\"\"\"\n",
    "        total_elements = np.prod(shape)\n",
    "        return int(total_elements / (32 * efficiency))  # 32 threads per warp\n",
    "    \n",
    "    def _generate_divergence_optimizations(self, divergence_ratio: float) -> List[str]:\n",
    "        \"\"\"Generate warp divergence optimization suggestions.\"\"\"\n",
    "        suggestions = []\n",
    "        if divergence_ratio > 0.3:\n",
    "            suggestions.append(\"Reduce conditional operations in inner loops\")\n",
    "            suggestions.append(\"Use predicated execution instead of branching\")\n",
    "        if divergence_ratio > 0.5:\n",
    "            suggestions.append(\"Restructure algorithm to minimize thread divergence\")\n",
    "        return suggestions\n",
    "    \n",
    "    def _estimate_bank_conflicts(self, data_size: int) -> float:\n",
    "        \"\"\"Estimate shared memory bank conflicts.\"\"\"\n",
    "        # Simplified model - assumes some degree of bank conflicts\n",
    "        return min(0.5, data_size / (32 * 4))  # 32 banks, 4 bytes per bank\n",
    "    \n",
    "    def _identify_shared_memory_optimizations(self, size: int, reuse: float) -> List[str]:\n",
    "        \"\"\"Identify shared memory optimization opportunities.\"\"\"\n",
    "        optimizations = []\n",
    "        if reuse > 2.0:\n",
    "            optimizations.append(\"High reuse factor - shared memory beneficial\")\n",
    "        if size < 16384:  # 16KB\n",
    "            optimizations.append(\"Data fits in shared memory - implement tiling\")\n",
    "        return optimizations\n",
    "    \n",
    "    def _get_tensor_core_requirements(self) -> List[str]:\n",
    "        \"\"\"Get tensor core optimization requirements.\"\"\"\n",
    "        return [\n",
    "            \"Use mixed precision (float16/bfloat16)\",\n",
    "            \"Ensure matrix dimensions are multiples of 8\",\n",
    "            \"Use proper memory layout (NHWC for convolutions)\"\n",
    "        ]\n",
    "    \n",
    "    def _sequence_contains_pattern(self, sequence: List[str], pattern: List[str]) -> bool:\n",
    "        \"\"\"Check if operation sequence contains fusable pattern.\"\"\"\n",
    "        for i in range(len(sequence) - len(pattern) + 1):\n",
    "            if sequence[i:i+len(pattern)] == pattern:\n",
    "                return True\n",
    "        return False\n",
    "    \n",
    "    def _calculate_communication_overhead(self, data_size: int, num_gpus: int) -> float:\n",
    "        \"\"\"Calculate multi-GPU communication overhead.\"\"\"\n",
    "        # Simplified model based on data size and GPU count\n",
    "        return min(0.8, (data_size / 1000) / num_gpus + 0.1)\n",
    "    \n",
    "    def _get_multi_gpu_optimizations(self, overhead: float) -> List[str]:\n",
    "        \"\"\"Get multi-GPU optimization strategies.\"\"\"\n",
    "        strategies = []\n",
    "        if overhead > 0.3:\n",
    "            strategies.append(\"Implement gradient compression\")\n",
    "            strategies.append(\"Use asynchronous communication\")\n",
    "        if overhead > 0.5:\n",
    "            strategies.append(\"Increase batch size to amortize communication\")\n",
    "        return strategies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ee8f530f",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "test-kernel-profiler",
     "locked": false,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "### 🧪 Unit Test: Kernel Optimization Profiler\n",
    "\n",
    "def test_unit_kernel_optimization_profiler():\n",
    "    \"\"\"Unit test for the kernel optimization profiler.\"\"\"\n",
    "    print(\"🔬 Unit Test: Kernel Optimization Profiler...\")\n",
    "    \n",
    "    # Create profiler instance\n",
    "    profiler = KernelOptimizationProfiler()\n",
    "    \n",
    "    # Test CUDA kernel analysis\n",
    "    x = Tensor(np.random.randn(1000))\n",
    "    cuda_analysis = profiler.analyze_cuda_kernel_performance(vectorized_relu, x, iterations=10)\n",
    "    \n",
    "    assert 'avg_execution_time_us' in cuda_analysis\n",
    "    assert 'memory_bandwidth_gb_s' in cuda_analysis\n",
    "    assert 'compute_utilization' in cuda_analysis\n",
    "    print(\"✅ CUDA kernel analysis works\")\n",
    "    \n",
    "    # Test memory coalescing analysis\n",
    "    memory_analysis = profiler.analyze_memory_coalescing('row_major', (1024, 1024))\n",
    "    \n",
    "    assert memory_analysis['coalescing_efficiency'] > 0.9\n",
    "    assert 'optimization_potential' in memory_analysis\n",
    "    print(\"✅ Memory coalescing analysis works\")\n",
    "    \n",
    "    # Test warp divergence analysis\n",
    "    warp_analysis = profiler.analyze_warp_divergence(100, 1000)\n",
    "    \n",
    "    assert warp_analysis['divergence_ratio'] == 0.1\n",
    "    assert 'warp_efficiency' in warp_analysis\n",
    "    print(\"✅ Warp divergence analysis works\")\n",
    "    \n",
    "    # Test shared memory analysis\n",
    "    shared_analysis = profiler.analyze_shared_memory_usage(16384, 3.0)\n",
    "    \n",
    "    assert 'performance_gain' in shared_analysis\n",
    "    assert shared_analysis['reuse_factor'] == 3.0\n",
    "    print(\"✅ Shared memory analysis works\")\n",
    "    \n",
    "    # Test tensor core analysis\n",
    "    tensor_analysis = profiler.analyze_tensor_core_utilization('matmul', ['float16'])\n",
    "    \n",
    "    assert tensor_analysis['tensor_core_compatible'] == True\n",
    "    assert tensor_analysis['theoretical_speedup'] > 1.0\n",
    "    print(\"✅ Tensor core analysis works\")\n",
    "    \n",
    "    # Test kernel fusion analysis\n",
    "    fusion_analysis = profiler.analyze_kernel_fusion_opportunities(['matmul', 'relu', 'add'])\n",
    "    \n",
    "    assert len(fusion_analysis['fusion_opportunities']) > 0\n",
    "    assert 'performance_improvement' in fusion_analysis\n",
    "    print(\"✅ Kernel fusion analysis works\")\n",
    "    \n",
    "    # Test multi-GPU analysis\n",
    "    gpu_analysis = profiler.analyze_multi_gpu_scaling(10000, 4)\n",
    "    \n",
    "    assert gpu_analysis['num_gpus'] == 4\n",
    "    assert 'scaling_efficiency' in gpu_analysis\n",
    "    print(\"✅ Multi-GPU analysis works\")\n",
    "    \n",
    "    # Test report generation\n",
    "    report = profiler.generate_optimization_report()\n",
    "    \n",
    "    assert \"Kernel Optimization Analysis Report\" in report\n",
    "    assert len(report) > 100  # Should be a substantial report\n",
    "    print(\"✅ Optimization report generation works\")\n",
    "    \n",
    "    print(\"📈 Progress: Kernel Optimization Profiler ✓\")\n",
    "\n",
    "# Run the test\n",
    "test_unit_kernel_optimization_profiler()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f2564cc6",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "kernel-optimization-profiler",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "#| export\n",
    "class KernelOptimizationProfiler:\n",
    "    \"\"\"\n",
    "    Production-grade kernel optimization profiler for ML systems.\n",
    "    \n",
    "    This class provides comprehensive analysis tools for optimizing ML kernels\n",
    "    across different hardware architectures, focusing on GPU optimization patterns\n",
    "    and production deployment scenarios.\n",
    "    \n",
    "    Key Features:\n",
    "    - CUDA kernel performance analysis\n",
    "    - Memory coalescing pattern detection\n",
    "    - Warp divergence analysis\n",
    "    - Shared memory optimization\n",
    "    - Tensor core utilization metrics\n",
    "    - Kernel fusion opportunities\n",
    "    - Multi-GPU scaling analysis\n",
    "    \"\"\"\n",
    "    \n",
    "    def __init__(self, hardware_config: Optional[Dict[str, Any]] = None):\n",
    "        \"\"\"\n",
    "        Initialize the kernel optimization profiler.\n",
    "        \n",
    "        Args:\n",
    "            hardware_config: Dictionary containing hardware specifications\n",
    "        \"\"\"\n",
    "        self.hardware_config = hardware_config or self._detect_hardware()\n",
    "        self.profile_results = {}\n",
    "        self.optimization_recommendations = []\n",
    "        \n",
    "    def _detect_hardware(self) -> Dict[str, Any]:\n",
    "        \"\"\"Detect current hardware configuration.\"\"\"\n",
    "        return {\n",
    "            'cpu_cores': psutil.cpu_count(),\n",
    "            'memory_gb': psutil.virtual_memory().total // (1024**3),\n",
    "            'cache_sizes': {\n",
    "                'l1': 32768,  # Typical L1 cache size in bytes\n",
    "                'l2': 262144,  # Typical L2 cache size in bytes  \n",
    "                'l3': 8388608  # Typical L3 cache size in bytes\n",
    "            },\n",
    "            'gpu_available': False,  # Would check for CUDA/OpenCL in real implementation\n",
    "            'gpu_memory_gb': 0,\n",
    "            'tensor_cores': False,\n",
    "            'warp_size': 32  # NVIDIA GPU warp size\n",
    "        }\n",
    "    \n",
    "    def analyze_cuda_kernel_performance(self, kernel_func: Callable, input_data: Tensor, \n",
    "                                      iterations: int = 100) -> Dict[str, Any]:\n",
    "        \"\"\"\n",
    "        Analyze CUDA kernel performance characteristics.\n",
    "        \n",
    "        In a real implementation, this would interface with CUDA profiling tools\n",
    "        to measure actual GPU kernel performance metrics.\n",
    "        \"\"\"\n",
    "        # Simulate CUDA kernel analysis\n",
    "        total_time = 0\n",
    "        memory_bandwidth = 0\n",
    "        compute_utilization = 0\n",
    "        \n",
    "        for _ in range(iterations):\n",
    "            result, execution_time = time_kernel(kernel_func, input_data)\n",
    "            total_time += execution_time\n",
    "            \n",
    "            # Simulate GPU metrics calculation\n",
    "            data_size = input_data.data.nbytes\n",
    "            memory_bandwidth += (data_size * 2) / (execution_time / 1_000_000)  # Read + Write\n",
    "            compute_utilization += np.random.uniform(0.3, 0.9)  # Simulated utilization\n",
    "        \n",
    "        avg_time = total_time / iterations\n",
    "        avg_bandwidth = memory_bandwidth / iterations\n",
    "        avg_utilization = compute_utilization / iterations\n",
    "        \n",
    "        analysis = {\n",
    "            'avg_execution_time_us': avg_time,\n",
    "            'memory_bandwidth_gb_s': avg_bandwidth / (1024**3),\n",
    "            'compute_utilization': avg_utilization,\n",
    "            'theoretical_peak_bandwidth': 900,  # GB/s for high-end GPU\n",
    "            'bandwidth_efficiency': min(100, (avg_bandwidth / (1024**3)) / 900 * 100),\n",
    "            'bottleneck_analysis': self._identify_bottlenecks(avg_bandwidth / (1024**3), avg_utilization)\n",
    "        }\n",
    "        \n",
    "        self.profile_results['cuda_analysis'] = analysis\n",
    "        return analysis\n",
    "    \n",
    "    def analyze_memory_coalescing(self, access_pattern: str, data_shape: Tuple[int, ...]) -> Dict[str, Any]:\n",
    "        \"\"\"\n",
    "        Analyze memory access patterns for GPU coalescing efficiency.\n",
    "        \n",
    "        Memory coalescing is critical for GPU performance - threads in a warp\n",
    "        should access contiguous memory locations.\n",
    "        \"\"\"\n",
    "        coalescing_efficiency = 1.0\n",
    "        \n",
    "        if access_pattern == 'row_major':\n",
    "            # Good coalescing for row-major access\n",
    "            coalescing_efficiency = 0.95\n",
    "        elif access_pattern == 'column_major':\n",
    "            # Poor coalescing for column-major access\n",
    "            coalescing_efficiency = 0.3\n",
    "        elif access_pattern == 'strided':\n",
    "            # Moderate coalescing for strided access\n",
    "            stride = data_shape[1] if len(data_shape) > 1 else 1\n",
    "            coalescing_efficiency = max(0.1, 1.0 / stride)\n",
    "        elif access_pattern == 'random':\n",
    "            # Very poor coalescing for random access\n",
    "            coalescing_efficiency = 0.1\n",
    "        \n",
    "        analysis = {\n",
    "            'access_pattern': access_pattern,\n",
    "            'data_shape': data_shape,\n",
    "            'coalescing_efficiency': coalescing_efficiency,\n",
    "            'memory_transactions': self._calculate_memory_transactions(data_shape, coalescing_efficiency),\n",
    "            'optimization_potential': 1.0 - coalescing_efficiency\n",
    "        }\n",
    "        \n",
    "        self.profile_results['memory_coalescing'] = analysis\n",
    "        return analysis\n",
    "    \n",
    "    def analyze_warp_divergence(self, conditional_operations: int, total_operations: int) -> Dict[str, Any]:\n",
    "        \"\"\"\n",
    "        Analyze warp divergence patterns in kernel execution.\n",
    "        \n",
    "        Warp divergence occurs when threads in a warp take different execution paths,\n",
    "        reducing parallelism efficiency.\n",
    "        \"\"\"\n",
    "        divergence_ratio = conditional_operations / total_operations\n",
    "        efficiency_loss = divergence_ratio * 0.5  # Simplified model\n",
    "        \n",
    "        analysis = {\n",
    "            'conditional_operations': conditional_operations,\n",
    "            'total_operations': total_operations,\n",
    "            'divergence_ratio': divergence_ratio,\n",
    "            'efficiency_loss': efficiency_loss,\n",
    "            'warp_efficiency': 1.0 - efficiency_loss,\n",
    "            'optimization_suggestions': self._generate_divergence_optimizations(divergence_ratio)\n",
    "        }\n",
    "        \n",
    "        self.profile_results['warp_divergence'] = analysis\n",
    "        return analysis\n",
    "    \n",
    "    def analyze_shared_memory_usage(self, kernel_data_size: int, reuse_factor: float) -> Dict[str, Any]:\n",
    "        \"\"\"\n",
    "        Analyze shared memory optimization opportunities.\n",
    "        \n",
    "        Shared memory is fast on-chip memory that can dramatically improve\n",
    "        performance when used effectively for data reuse.\n",
    "        \"\"\"\n",
    "        shared_memory_size = 48 * 1024  # 48KB typical shared memory per SM\n",
    "        bank_conflicts = self._estimate_bank_conflicts(kernel_data_size)\n",
    "        \n",
    "        analysis = {\n",
    "            'data_size_bytes': kernel_data_size,\n",
    "            'shared_memory_available': shared_memory_size,\n",
    "            'utilization_ratio': min(1.0, kernel_data_size / shared_memory_size),\n",
    "            'reuse_factor': reuse_factor,\n",
    "            'bank_conflicts': bank_conflicts,\n",
    "            'performance_gain': min(10.0, reuse_factor * (1.0 - bank_conflicts)),\n",
    "            'optimization_opportunities': self._identify_shared_memory_optimizations(kernel_data_size, reuse_factor)\n",
    "        }\n",
    "        \n",
    "        self.profile_results['shared_memory'] = analysis\n",
    "        return analysis\n",
    "    \n",
    "    def analyze_tensor_core_utilization(self, operation_type: str, data_types: List[str]) -> Dict[str, Any]:\n",
    "        \"\"\"\n",
    "        Analyze tensor core utilization for mixed-precision operations.\n",
    "        \n",
    "        Tensor cores provide massive acceleration for mixed-precision matrix operations\n",
    "        when data shapes and types are optimized correctly.\n",
    "        \"\"\"\n",
    "        tensor_core_compatible = (\n",
    "            operation_type in ['matmul', 'conv2d'] and\n",
    "            any(dtype in ['float16', 'bfloat16', 'int8'] for dtype in data_types)\n",
    "        )\n",
    "        \n",
    "        if tensor_core_compatible:\n",
    "            theoretical_speedup = 4.0  # Typical tensor core speedup\n",
    "            actual_utilization = 0.7   # Realistic utilization\n",
    "        else:\n",
    "            theoretical_speedup = 1.0\n",
    "            actual_utilization = 0.0\n",
    "        \n",
    "        analysis = {\n",
    "            'operation_type': operation_type,\n",
    "            'data_types': data_types,\n",
    "            'tensor_core_compatible': tensor_core_compatible,\n",
    "            'theoretical_speedup': theoretical_speedup,\n",
    "            'actual_utilization': actual_utilization,\n",
    "            'performance_gain': theoretical_speedup * actual_utilization,\n",
    "            'optimization_requirements': self._get_tensor_core_requirements()\n",
    "        }\n",
    "        \n",
    "        self.profile_results['tensor_core'] = analysis\n",
    "        return analysis\n",
    "    \n",
    "    def analyze_kernel_fusion_opportunities(self, operation_sequence: List[str]) -> Dict[str, Any]:\n",
    "        \"\"\"\n",
    "        Analyze opportunities for kernel fusion to reduce memory overhead.\n",
    "        \n",
    "        Kernel fusion combines multiple operations into a single kernel,\n",
    "        reducing memory bandwidth requirements and improving performance.\n",
    "        \"\"\"\n",
    "        fusable_patterns = [\n",
    "            ['matmul', 'relu'],\n",
    "            ['conv2d', 'batchnorm', 'relu'],\n",
    "            ['add', 'relu'],\n",
    "            ['mul', 'add']\n",
    "        ]\n",
    "        \n",
    "        fusion_opportunities = []\n",
    "        memory_savings = 0\n",
    "        \n",
    "        for pattern in fusable_patterns:\n",
    "            if self._sequence_contains_pattern(operation_sequence, pattern):\n",
    "                fusion_opportunities.append(pattern)\n",
    "                memory_savings += len(pattern) - 1  # Save intermediate results\n",
    "        \n",
    "        analysis = {\n",
    "            'operation_sequence': operation_sequence,\n",
    "            'fusion_opportunities': fusion_opportunities,\n",
    "            'memory_savings_factor': memory_savings,\n",
    "            'performance_improvement': min(2.0, 1 + memory_savings * 0.3),\n",
    "            'implementation_complexity': len(fusion_opportunities) * 2\n",
    "        }\n",
    "        \n",
    "        self.profile_results['kernel_fusion'] = analysis\n",
    "        return analysis\n",
    "    \n",
    "    def analyze_multi_gpu_scaling(self, data_size: int, num_gpus: int) -> Dict[str, Any]:\n",
    "        \"\"\"\n",
    "        Analyze multi-GPU scaling patterns and communication overhead.\n",
    "        \n",
    "        Multi-GPU deployments require careful optimization of data distribution\n",
    "        and communication patterns to achieve good scaling efficiency.\n",
    "        \"\"\"\n",
    "        communication_overhead = self._calculate_communication_overhead(data_size, num_gpus)\n",
    "        compute_scaling = min(num_gpus, data_size / 1000)  # Simplified scaling model\n",
    "        \n",
    "        analysis = {\n",
    "            'data_size': data_size,\n",
    "            'num_gpus': num_gpus,\n",
    "            'communication_overhead': communication_overhead,\n",
    "            'compute_scaling': compute_scaling,\n",
    "            'scaling_efficiency': compute_scaling / num_gpus,\n",
    "            'bottleneck_type': 'communication' if communication_overhead > 0.3 else 'compute',\n",
    "            'optimization_strategies': self._get_multi_gpu_optimizations(communication_overhead)\n",
    "        }\n",
    "        \n",
    "        self.profile_results['multi_gpu'] = analysis\n",
    "        return analysis\n",
    "    \n",
    "    def generate_optimization_report(self) -> str:\n",
    "        \"\"\"Generate comprehensive optimization report with recommendations.\"\"\"\n",
    "        report = [\"🚀 Kernel Optimization Analysis Report\", \"=\" * 50, \"\"]\n",
    "        \n",
    "        for analysis_type, results in self.profile_results.items():\n",
    "            report.append(f\"📊 {analysis_type.replace('_', ' ').title()} Analysis:\")\n",
    "            report.append(\"-\" * 30)\n",
    "            \n",
    "            for key, value in results.items():\n",
    "                if isinstance(value, float):\n",
    "                    report.append(f\"  {key}: {value:.3f}\")\n",
    "                elif isinstance(value, list):\n",
    "                    report.append(f\"  {key}: {', '.join(map(str, value))}\")\n",
    "                else:\n",
    "                    report.append(f\"  {key}: {value}\")\n",
    "            report.append(\"\")\n",
    "        \n",
    "        # Add optimization recommendations\n",
    "        report.append(\"🎯 Optimization Recommendations:\")\n",
    "        report.append(\"-\" * 30)\n",
    "        for rec in self.optimization_recommendations:\n",
    "            report.append(f\"  • {rec}\")\n",
    "        \n",
    "        return \"\\n\".join(report)\n",
    "    \n",
    "    # Helper methods\n",
    "    def _identify_bottlenecks(self, bandwidth_gb_s: float, utilization: float) -> str:\n",
    "        \"\"\"Identify performance bottlenecks.\"\"\"\n",
    "        if bandwidth_gb_s < 100:\n",
    "            return \"Memory bandwidth limited\"\n",
    "        elif utilization < 0.5:\n",
    "            return \"Compute utilization limited\"\n",
    "        else:\n",
    "            return \"Well balanced\"\n",
    "    \n",
    "    def _calculate_memory_transactions(self, shape: Tuple[int, ...], efficiency: float) -> int:\n",
    "        \"\"\"Calculate memory transaction count.\"\"\"\n",
    "        total_elements = np.prod(shape)\n",
    "        return int(total_elements / (32 * efficiency))  # 32 threads per warp\n",
    "    \n",
    "    def _generate_divergence_optimizations(self, divergence_ratio: float) -> List[str]:\n",
    "        \"\"\"Generate warp divergence optimization suggestions.\"\"\"\n",
    "        suggestions = []\n",
    "        if divergence_ratio > 0.3:\n",
    "            suggestions.append(\"Reduce conditional operations in inner loops\")\n",
    "            suggestions.append(\"Use predicated execution instead of branching\")\n",
    "        if divergence_ratio > 0.5:\n",
    "            suggestions.append(\"Restructure algorithm to minimize thread divergence\")\n",
    "        return suggestions\n",
    "    \n",
    "    def _estimate_bank_conflicts(self, data_size: int) -> float:\n",
    "        \"\"\"Estimate shared memory bank conflicts.\"\"\"\n",
    "        # Simplified model - assumes some degree of bank conflicts\n",
    "        return min(0.5, data_size / (32 * 4))  # 32 banks, 4 bytes per bank\n",
    "    \n",
    "    def _identify_shared_memory_optimizations(self, size: int, reuse: float) -> List[str]:\n",
    "        \"\"\"Identify shared memory optimization opportunities.\"\"\"\n",
    "        optimizations = []\n",
    "        if reuse > 2.0:\n",
    "            optimizations.append(\"High reuse factor - shared memory beneficial\")\n",
    "        if size < 16384:  # 16KB\n",
    "            optimizations.append(\"Data fits in shared memory - implement tiling\")\n",
    "        return optimizations\n",
    "    \n",
    "    def _get_tensor_core_requirements(self) -> List[str]:\n",
    "        \"\"\"Get tensor core optimization requirements.\"\"\"\n",
    "        return [\n",
    "            \"Use mixed precision (float16/bfloat16)\",\n",
    "            \"Ensure matrix dimensions are multiples of 8\",\n",
    "            \"Use proper memory layout (NHWC for convolutions)\"\n",
    "        ]\n",
    "    \n",
    "    def _sequence_contains_pattern(self, sequence: List[str], pattern: List[str]) -> bool:\n",
    "        \"\"\"Check if operation sequence contains fusable pattern.\"\"\"\n",
    "        for i in range(len(sequence) - len(pattern) + 1):\n",
    "            if sequence[i:i+len(pattern)] == pattern:\n",
    "                return True\n",
    "        return False\n",
    "    \n",
    "    def _calculate_communication_overhead(self, data_size: int, num_gpus: int) -> float:\n",
    "        \"\"\"Calculate multi-GPU communication overhead.\"\"\"\n",
    "        # Simplified model based on data size and GPU count\n",
    "        return min(0.8, (data_size / 1000) / num_gpus + 0.1)\n",
    "    \n",
    "    def _get_multi_gpu_optimizations(self, overhead: float) -> List[str]:\n",
    "        \"\"\"Get multi-GPU optimization strategies.\"\"\"\n",
    "        strategies = []\n",
    "        if overhead > 0.3:\n",
    "            strategies.append(\"Implement gradient compression\")\n",
    "            strategies.append(\"Use asynchronous communication\")\n",
    "        if overhead > 0.5:\n",
    "            strategies.append(\"Increase batch size to amortize communication\")\n",
    "        return strategies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ebde88eb",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "test-kernel-profiler",
     "locked": false,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "### 🧪 Unit Test: Kernel Optimization Profiler\n",
    "\n",
    "def test_unit_kernel_optimization_profiler():\n",
    "    \"\"\"Unit test for the kernel optimization profiler.\"\"\"\n",
    "    print(\"🔬 Unit Test: Kernel Optimization Profiler...\")\n",
    "    \n",
    "    # Create profiler instance\n",
    "    profiler = KernelOptimizationProfiler()\n",
    "    \n",
    "    # Test CUDA kernel analysis\n",
    "    x = Tensor(np.random.randn(1000))\n",
    "    cuda_analysis = profiler.analyze_cuda_kernel_performance(vectorized_relu, x, iterations=10)\n",
    "    \n",
    "    assert 'avg_execution_time_us' in cuda_analysis\n",
    "    assert 'memory_bandwidth_gb_s' in cuda_analysis\n",
    "    assert 'compute_utilization' in cuda_analysis\n",
    "    print(\"✅ CUDA kernel analysis works\")\n",
    "    \n",
    "    # Test memory coalescing analysis\n",
    "    memory_analysis = profiler.analyze_memory_coalescing('row_major', (1024, 1024))\n",
    "    \n",
    "    assert memory_analysis['coalescing_efficiency'] > 0.9\n",
    "    assert 'optimization_potential' in memory_analysis\n",
    "    print(\"✅ Memory coalescing analysis works\")\n",
    "    \n",
    "    # Test warp divergence analysis\n",
    "    warp_analysis = profiler.analyze_warp_divergence(100, 1000)\n",
    "    \n",
    "    assert warp_analysis['divergence_ratio'] == 0.1\n",
    "    assert 'warp_efficiency' in warp_analysis\n",
    "    print(\"✅ Warp divergence analysis works\")\n",
    "    \n",
    "    # Test shared memory analysis\n",
    "    shared_analysis = profiler.analyze_shared_memory_usage(16384, 3.0)\n",
    "    \n",
    "    assert 'performance_gain' in shared_analysis\n",
    "    assert shared_analysis['reuse_factor'] == 3.0\n",
    "    print(\"✅ Shared memory analysis works\")\n",
    "    \n",
    "    # Test tensor core analysis\n",
    "    tensor_analysis = profiler.analyze_tensor_core_utilization('matmul', ['float16'])\n",
    "    \n",
    "    assert tensor_analysis['tensor_core_compatible'] == True\n",
    "    assert tensor_analysis['theoretical_speedup'] > 1.0\n",
    "    print(\"✅ Tensor core analysis works\")\n",
    "    \n",
    "    # Test kernel fusion analysis\n",
    "    fusion_analysis = profiler.analyze_kernel_fusion_opportunities(['matmul', 'relu', 'add'])\n",
    "    \n",
    "    assert len(fusion_analysis['fusion_opportunities']) > 0\n",
    "    assert 'performance_improvement' in fusion_analysis\n",
    "    print(\"✅ Kernel fusion analysis works\")\n",
    "    \n",
    "    # Test multi-GPU analysis\n",
    "    gpu_analysis = profiler.analyze_multi_gpu_scaling(10000, 4)\n",
    "    \n",
    "    assert gpu_analysis['num_gpus'] == 4\n",
    "    assert 'scaling_efficiency' in gpu_analysis\n",
    "    print(\"✅ Multi-GPU analysis works\")\n",
    "    \n",
    "    # Test report generation\n",
    "    report = profiler.generate_optimization_report()\n",
    "    \n",
    "    assert \"Kernel Optimization Analysis Report\" in report\n",
    "    assert len(report) > 100  # Should be a substantial report\n",
    "    print(\"✅ Optimization report generation works\")\n",
    "    \n",
    "    print(\"📈 Progress: Kernel Optimization Profiler ✓\")\n",
    "\n",
    "# Run the test\n",
    "test_unit_kernel_optimization_profiler()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "82e1a372",
   "metadata": {
    "lines_to_next_cell": 1
   },
   "outputs": [],
   "source": [
    "def test_module_kernel_sequential_model():\n",
    "    \"\"\"\n",
    "    Integration test for using optimized kernels in a Sequential model.\n",
    "    \n",
    "    Tests that optimized kernels can be integrated into a Sequential model\n",
    "    and produce correct results.\n",
    "    \"\"\"\n",
    "    print(\"🔬 Running Integration Test: Kernels in Sequential Model...\")\n",
    "\n",
    "    class BaselineModel:\n",
    "        def __init__(self):\n",
    "            self.dense = Dense(10, 5)\n",
    "            self.relu = ReLU()\n",
    "        \n",
    "        def __call__(self, x: Tensor) -> Tensor:\n",
    "            # Manually apply layers using baseline functions\n",
    "            x = matmul_baseline(x, self.dense.weights)\n",
    "            # Bias addition is simple, no special kernel needed\n",
    "            x = Tensor(x.data + self.dense.bias.data)\n",
    "            x = self.relu(x)\n",
    "            return x\n",
    "\n",
    "    class OptimizedModel:\n",
    "        def __init__(self, baseline_model):\n",
    "            self.dense = baseline_model.dense\n",
    "        \n",
    "        def __call__(self, x: Tensor) -> Tensor:\n",
    "            # Use optimized kernels\n",
    "            x = cache_friendly_matmul(x, self.dense.weights)\n",
    "            x = Tensor(x.data + self.dense.bias.data)\n",
    "            x = vectorized_relu(x)\n",
    "            return x\n",
    "    \n",
    "    # Mock classes for Dense and ReLU to be used in the test\n",
    "    class Dense:\n",
    "        def __init__(self, in_features, out_features):\n",
    "            self.weights = Tensor(np.random.randn(in_features, out_features))\n",
    "            self.bias = Tensor(np.random.randn(out_features))\n",
    "\n",
    "    class ReLU:\n",
    "        def __call__(self, x: Tensor) -> Tensor:\n",
    "            return vectorized_relu(x)\n",
    "    \n",
    "    # 1. Create baseline and optimized models\n",
    "    baseline_model = BaselineModel()\n",
    "    optimized_model = OptimizedModel(baseline_model)\n",
    "\n",
    "    # 2. Create some input data\n",
    "    input_data = Tensor(np.random.randn(1, 10))\n",
    "\n",
    "    # 3. Get outputs from both models\n",
    "    baseline_output = baseline_model(input_data)\n",
    "    optimized_output = optimized_model(input_data)\n",
    "\n",
    "    # 4. Check that the outputs are numerically close\n",
    "    assert np.allclose(baseline_output.data, optimized_output.data), \"Optimized model output should match baseline\"\n",
    "\n",
    "    print(\"✅ Integration Test Passed: Kernels correctly integrated into a model.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a1961c94",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## 🧪 Module Testing\n",
    "\n",
    "Time to test your implementation! This section uses TinyTorch's standardized testing framework to ensure your implementation works correctly.\n",
    "\n",
    "**This testing section is locked** - it provides consistent feedback across all modules and cannot be modified."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4aaba367",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## 🤔 ML Systems Thinking Questions\n",
    "\n",
    "### GPU Architecture and Parallelism\n",
    "\n",
    "**How does GPU architecture influence kernel design decisions?**\n",
    "Consider the massive parallelism of modern GPUs (1000s of cores) versus CPUs (10s of cores). How would you design matrix multiplication kernels differently for each architecture? What are the trade-offs between thread-level parallelism and instruction-level parallelism?\n",
    "\n",
    "**Why do memory access patterns matter more on GPUs than CPUs?**\n",
    "Think about how GPU memory hierarchy (global memory, shared memory, registers) differs from CPU caches. How does memory coalescing affect bandwidth utilization, and why do random access patterns cause such dramatic performance degradation on GPUs?\n",
    "\n",
    "**How do you handle load balancing across thousands of GPU threads?**\n",
    "When processing variable-sized data or irregular computations, how do you ensure all GPU cores stay busy? What strategies exist for handling workload imbalances, and how do frameworks like PyTorch handle dynamic shapes efficiently?\n",
    "\n",
    "**What role do GPU warps play in kernel optimization?**\n",
    "NVIDIA GPUs execute threads in groups of 32 (warps). How does this affect branching, memory access, and algorithm design? Why is warp divergence such a critical performance consideration, and how do you design algorithms to minimize it?\n",
    "\n",
    "### Custom CUDA Kernel Development\n",
    "\n",
    "**When should you write custom CUDA kernels versus using library functions?**\n",
    "Given that libraries like cuDNN and cuBLAS are highly optimized, when does it make sense to write custom kernels? Consider scenarios like novel layer types, fused operations, or hardware-specific optimizations.\n",
    "\n",
    "**How do you optimize CUDA kernels for different GPU generations?**\n",
    "GPU architectures evolve rapidly (Pascal → Volta → Ampere → Hopper). How do optimization strategies change across generations? What are the implications of new features like tensor cores, multi-instance GPU, and transformer engines?\n",
    "\n",
    "**What's the development workflow for production CUDA kernels?**\n",
    "Consider the entire pipeline from prototype to production: profiling bottlenecks, writing initial kernels, optimization iterations, testing across hardware, and deployment. How do companies like OpenAI and Google manage kernel development at scale?\n",
    "\n",
    "**How do you ensure numerical stability in custom kernels?**\n",
    "Custom kernels often involve low-level optimizations that can affect numerical precision. How do you balance performance with accuracy? What testing strategies ensure kernels produce correct results across different data ranges and edge cases?\n",
    "\n",
    "### Triton and Kernel Languages\n",
    "\n",
    "**How does Triton compare to CUDA for kernel development?**\n",
    "Triton promises Python-like syntax while generating efficient GPU code. What are the trade-offs between ease of development and performance control? When would you choose Triton over CUDA or vice versa?\n",
    "\n",
    "**What role do domain-specific languages play in kernel optimization?**\n",
    "Beyond CUDA and Triton, consider languages like OpenCL, HIP, and emerging alternatives. How do these languages abstract hardware differences while maintaining performance? What's the future of cross-platform kernel development?\n",
    "\n",
    "**How do JIT compilation and auto-tuning affect kernel performance?**\n",
    "Modern frameworks use just-in-time compilation to optimize kernels for specific inputs and hardware. How does this compare to static optimization? What are the implications for deployment, cold start times, and reproducibility?\n",
    "\n",
    "**What are the challenges of kernel portability across hardware vendors?**\n",
    "With AMD GPUs, Intel GPUs, and custom accelerators becoming more common, how do you write kernels that perform well across different architectures? What abstraction layers exist, and what are their performance costs?\n",
    "\n",
    "### Hardware-Specific Optimizations\n",
    "\n",
    "**How do you optimize kernels for different memory hierarchies?**\n",
    "Consider the differences between GPU global memory, shared memory, and registers versus CPU caches. How do you design algorithms that effectively use each level of the hierarchy? What happens when your working set exceeds cache capacity?\n",
    "\n",
    "**What optimization strategies work best for tensor operations?**\n",
    "Tensor cores on modern GPUs can dramatically accelerate mixed-precision operations. How do you restructure algorithms to take advantage of these specialized units? What are the constraints on data layout, precision, and problem sizes?\n",
    "\n",
    "**How do you handle precision trade-offs in optimized kernels?**\n",
    "Production systems often use int8, fp16, or bfloat16 for performance. How do you maintain model accuracy while using reduced precision? What accumulation strategies prevent numerical issues in long computations?\n",
    "\n",
    "**What role does compiler optimization play in kernel performance?**\n",
    "Modern GPU compilers perform sophisticated optimizations like loop unrolling, memory access optimization, and instruction scheduling. How do you write kernel code that works well with these optimizations? When do you need to use inline assembly or intrinsics?\n",
    "\n",
    "### Production GPU Clusters\n",
    "\n",
    "**How do you scale kernel optimizations across multi-GPU systems?**\n",
    "Single-node multi-GPU systems require coordination of memory transfers, computation scheduling, and synchronization. How do you design kernels that scale efficiently across 8-16 GPUs? What are the bottlenecks in multi-GPU scaling?\n",
    "\n",
    "**What are the challenges of distributed training with custom kernels?**\n",
    "When scaling to hundreds or thousands of GPUs across multiple nodes, network communication becomes critical. How do custom kernels interact with distributed training frameworks? What optimizations exist for gradient synchronization and parameter updates?\n",
    "\n",
    "**How do you manage kernel deployment in production clusters?**\n",
    "Production ML systems need to handle hardware failures, software updates, and varying workloads. How do you deploy and manage custom kernels across heterogeneous clusters? What strategies exist for A/B testing kernel optimizations safely?\n",
    "\n",
    "**What monitoring and debugging tools exist for production GPU workloads?**\n",
    "When kernels behave unexpectedly in production, how do you diagnose issues? What metrics matter for kernel performance monitoring? How do you correlate kernel performance with higher-level model metrics like accuracy and throughput?\n",
    "\n",
    "## 🎯 MODULE SUMMARY: Custom Kernels\n",
    "\n",
    "Congratulations! You've successfully implemented custom kernel operations:\n",
    "\n",
    "### What You've Accomplished\n",
    "✅ **Custom Operations**: Implemented specialized kernels for performance\n",
    "✅ **Integration**: Seamless compatibility with neural networks\n",
    "✅ **Performance Optimization**: Faster computation for critical operations\n",
    "✅ **Real Applications**: Deploying optimized models to production\n",
    "\n",
    "### Key Concepts You've Learned\n",
    "- **Custom kernels**: Building specialized operations for efficiency\n",
    "- **Integration patterns**: How kernels work with neural networks\n",
    "- **Performance optimization**: Balancing speed and accuracy\n",
    "- **API design**: Clean interfaces for kernel operations\n",
    "\n",
    "### Professional Skills Developed\n",
    "- **Kernel engineering**: Building efficient operations for deployment\n",
    "- **Performance tuning**: Optimizing computation for speed\n",
    "- **Integration testing**: Ensuring kernels work with neural networks\n",
    "\n",
    "### Ready for Advanced Applications\n",
    "Your kernel implementations now enable:\n",
    "- **Edge deployment**: Running optimized models on resource-constrained devices\n",
    "- **Faster inference**: Reducing latency for real-time applications\n",
    "- **Production systems**: Deploying efficient models at scale\n",
    "\n",
    "### Connection to Real ML Systems\n",
    "Your implementations mirror production systems:\n",
    "- **PyTorch**: Custom CUDA kernels for performance\n",
    "- **TensorFlow**: XLA and custom ops for optimization\n",
    "- **Industry Standard**: Every major ML framework uses these exact techniques\n",
    "\n",
    "### Next Steps\n",
    "1. **Export your code**: `tito export 13_kernels`\n",
    "2. **Test your implementation**: `tito test 13_kernels`\n",
    "3. **Deploy models**: Use optimized kernels in production\n",
    "4. **Move to Module 14**: Add benchmarking for evaluation!\n",
    "\n",
    "**Ready for benchmarking?** Your custom kernels are now ready for real-world deployment!"
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}