TinyTorch/modules/14_transformers/transformers_dev.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "8e332345",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "# Transformers - Complete Transformer Architecture Implementation\n",
    "\n",
    "Welcome to the Transformers module! You'll implement complete transformer blocks with LayerNorm, residual connections, and feed-forward networks, building the architecture that powers modern language models like GPT and BERT.\n",
    "\n",
    "## Learning Goals\n",
    "- Systems understanding: How transformer blocks scale memory and computation with model depth\n",
    "- Core implementation skill: Build complete transformer architectures with proper normalization\n",
    "- Pattern recognition: Understand how residual connections enable training of deep transformer models\n",
    "- Framework connection: See how your implementations match production transformer systems\n",
    "- Performance insight: Learn how transformer layer memory accumulation affects model deployment\n",
    "\n",
    "## Build → Use → Reflect\n",
    "1. **Build**: LayerNorm, transformer blocks, and complete transformer models\n",
    "2. **Use**: Process sequences through multi-layer transformer architectures\n",
    "3. **Reflect**: How do transformer design choices affect scalability and training dynamics?\n",
    "\n",
    "## What You'll Achieve\n",
    "By the end of this module, you'll have:\n",
    "- A deep technical understanding of how transformer blocks enable powerful sequence modeling\n",
    "- The practical capability to implement complete transformer architectures with proper layer organization\n",
    "- Systems insight into how transformer depth affects memory usage and training efficiency\n",
    "- An understanding of how layer normalization and residual connections affect convergence\n",
    "- A connection to production systems like GPT's transformer blocks and their optimization techniques\n",
    "\n",
    "## Systems Reality Check\n",
    "💡 **Production Context**: GPT-3 (175B) stacks 96 transformer layers with 12,288-dimensional representations, so memory must be managed carefully  \n",
    "⚡ **Performance Note**: Activation memory grows linearly with depth, so deep models often rely on activation checkpointing"
   ]
  },
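  {
   "cell_type": "markdown",
   "id": "worked-depth-memory",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "### Worked Example: Activation Memory vs. Depth\n",
    "\n",
    "To make the performance note above concrete, here is a rough back-of-the-envelope sketch rather than a real profiler. `estimate_activation_memory_mb` is a hypothetical helper (not part of TinyTorch), and it assumes each layer saves exactly one float32 tensor of shape `(batch, seq_len, embed_dim)` for the backward pass. Real transformer blocks keep several intermediates per layer, so actual usage is higher.\n",
    "\n",
    "```python\n",
    "def estimate_activation_memory_mb(num_layers, batch_size, seq_len, embed_dim, bytes_per_value=4):\n",
    "    # Hypothetical helper: assumes one (batch, seq_len, embed_dim) tensor is saved per layer.\n",
    "    per_layer_bytes = batch_size * seq_len * embed_dim * bytes_per_value\n",
    "    return num_layers * per_layer_bytes / (1024 * 1024)\n",
    "\n",
    "# Same batch and sequence settings, two different depths.\n",
    "shallow = estimate_activation_memory_mb(num_layers=12, batch_size=8, seq_len=512, embed_dim=768)\n",
    "deep = estimate_activation_memory_mb(num_layers=24, batch_size=8, seq_len=512, embed_dim=768)\n",
    "print(f'12 layers: {shallow:.0f} MB, 24 layers: {deep:.0f} MB')\n",
    "```\n",
    "\n",
    "Under these assumptions the estimate doubles when the depth doubles, which is why activation checkpointing becomes attractive for deep stacks."
   ]
  },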
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aaaa5ad1",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "transformers-imports",
     "locked": false,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "#| default_exp core.transformers\n",
    "\n",
    "#| export\n",
    "import math\n",
    "import numpy as np\n",
    "import os\n",
    "import sys\n",
    "from typing import Union, List, Optional, Tuple, Dict\n",
    "\n",
    "# Import our Tensor class - try from package first, then from local module\n",
    "try:\n",
    "    from tinytorch.core.tensor import Tensor\n",
    "except ImportError:\n",
    "    # For development, import from local tensor module\n",
    "    sys.path.append(os.path.join(os.path.dirname(__file__), '..', '02_tensor'))\n",
    "    from tensor_dev import Tensor\n",
    "\n",
    "# Try to import attention classes\n",
    "try:\n",
    "    from tinytorch.core.attention import ScaledDotProductAttention, MultiHeadAttention, KVCache\n",
    "except ImportError:\n",
    "    # For development, import from local module\n",
    "    sys.path.append(os.path.join(os.path.dirname(__file__), '..', '13_attention'))\n",
    "    try:\n",
    "        from attention_dev import ScaledDotProductAttention, MultiHeadAttention, KVCache\n",
    "    except ImportError:\n",
    "        # Create minimal mock classes if not available\n",
    "        class MultiHeadAttention:\n",
    "            def __init__(self, embed_dim, num_heads):\n",
    "                self.embed_dim = embed_dim\n",
    "                self.num_heads = num_heads\n",
    "            def forward(self, q, k, v, mask=None):\n",
    "                return q  # Mock implementation\n",
    "        class ScaledDotProductAttention:\n",
    "            def __init__(self):\n",
    "                pass\n",
    "        class KVCache:\n",
    "            def __init__(self, *args, **kwargs):\n",
    "                pass\n",
    "\n",
    "# Try to import embedding classes\n",
    "try:\n",
    "    from tinytorch.core.embeddings import Embedding, PositionalEncoding\n",
    "except ImportError:\n",
    "    # For development, import from local module\n",
    "    sys.path.append(os.path.join(os.path.dirname(__file__), '..', '12_embeddings'))\n",
    "    try:\n",
    "        from embeddings_dev import Embedding, PositionalEncoding\n",
    "    except ImportError:\n",
    "        # Create minimal mock classes if not available\n",
    "        class Embedding:\n",
    "            def __init__(self, vocab_size, embedding_dim):\n",
    "                self.vocab_size = vocab_size\n",
    "                self.embedding_dim = embedding_dim\n",
    "        class PositionalEncoding:\n",
    "            def __init__(self, embedding_dim, max_seq_length=5000):\n",
    "                self.embedding_dim = embedding_dim"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8d54a97a",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "transformers-welcome",
     "locked": false,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "print(\"🏗️ TinyTorch Transformers Module\")\n",
    "print(f\"NumPy version: {np.__version__}\")\n",
    "print(\"Ready to build complete transformer architectures!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e684830c",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## 📦 Where This Code Lives in the Final Package\n",
    "\n",
    "**Learning Side:** You work in `modules/source/14_transformers/transformers_dev.py`  \n",
    "**Building Side:** Code exports to `tinytorch.core.transformers`\n",
    "\n",
    "```python\n",
    "# Final package structure:\n",
    "from tinytorch.core.transformers import LayerNorm, TransformerBlock, Transformer\n",
    "from tinytorch.core.attention import MultiHeadAttention  # Previous module\n",
    "from tinytorch.core.embeddings import Embedding, PositionalEncoding  # Foundation\n",
    "```\n",
    "\n",
    "**Why this matters:**\n",
    "- **Learning:** Focused modules for deep understanding\n",
    "- **Production:** Proper organization like PyTorch's transformer implementations\n",
    "- **Consistency:** All transformer components live together in `core.transformers`\n",
    "- **Integration:** Works seamlessly with attention, embeddings, and tokenization systems"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "be87d30f",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## What are Transformers?\n",
    "\n",
    "### The Architecture Revolution\n",
    "Transformers revolutionized AI by replacing recurrent connections with attention mechanisms:\n",
    "\n",
    "**Traditional RNN/LSTM:**\n",
    "```\n",
    "h₁ → h₂ → h₃ → h₄  (Sequential processing)\n",
    "```\n",
    "\n",
    "**Transformer:**\n",
    "```\n",
    "All positions attend to all positions simultaneously (Parallel processing)\n",
    "```\n",
    "\n",
    "### Transformer Block Components\n",
    "Each transformer block contains:\n",
    "\n",
    "1. **Multi-Head Self-Attention**: Captures sequence relationships\n",
    "2. **Layer Normalization**: Stabilizes training of deep networks\n",
    "3. **Residual Connections**: Enables gradient flow through many layers\n",
    "4. **Position-wise Feed-Forward**: Applies non-linear transformations\n",
    "\n",
    "### The Complete Architecture\n",
    "```\n",
    "Input Embeddings + Positional Encoding\n",
    "    ↓\n",
    "[Transformer Block] × N layers\n",
    "    ↓\n",
    "Output Layer (Language Modeling Head)\n",
    "```\n",
    "\n",
    "### Systems Trade-offs\n",
    "- **Layer depth**: More layers = more capacity, more memory\n",
    "- **Attention heads**: More heads = richer representations, more computation\n",
    "- **Feed-forward size**: Larger FFN = more parameters, better performance\n",
    "- **Layer normalization**: Pre-norm vs post-norm affects training dynamics"
   ]
  },
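  {
   "cell_type": "markdown",
   "id": "sketch-norm-order",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "### Sketch: Pre-norm vs. Post-norm Ordering\n",
    "\n",
    "The last trade-off above comes down to where normalization sits relative to the residual addition. The following is a minimal NumPy sketch, not the TinyTorch implementation: `layer_norm`, `pre_norm_step`, `post_norm_step`, and `toy_sublayer` are illustrative names, the normalization has no learnable scale or shift, and the sublayer is a stand-in for attention or the feed-forward network.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def layer_norm(x, eps=1e-5):\n",
    "    # Normalize over the last (feature) axis; no learnable parameters in this sketch.\n",
    "    mean = x.mean(axis=-1, keepdims=True)\n",
    "    var = x.var(axis=-1, keepdims=True)\n",
    "    return (x - mean) / np.sqrt(var + eps)\n",
    "\n",
    "def pre_norm_step(x, sublayer):\n",
    "    # Pre-norm: normalize the sublayer input; the residual path is left untouched.\n",
    "    return x + sublayer(layer_norm(x))\n",
    "\n",
    "def post_norm_step(x, sublayer):\n",
    "    # Post-norm: add the residual first, then normalize the sum.\n",
    "    return layer_norm(x + sublayer(x))\n",
    "\n",
    "def toy_sublayer(h):\n",
    "    # Stand-in for attention or the feed-forward network.\n",
    "    return 0.5 * h\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "x = rng.standard_normal((2, 4, 8))  # (batch, seq_len, embed_dim)\n",
    "\n",
    "pre_out = pre_norm_step(x, toy_sublayer)\n",
    "post_out = post_norm_step(x, toy_sublayer)\n",
    "print(pre_out.shape, post_out.shape)\n",
    "```\n",
    "\n",
    "In the pre-norm variant the residual path carries `x` forward unchanged, which is commonly credited with making deep stacks easier to train. The post-norm variant normalizes the sum, so its output has zero mean across features at every position."
   ]
  },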
  {
   "cell_type": "markdown",
   "id": "b1081f61",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## Layer Normalization Implementation\n",
    "\n",
    "Layer normalization is crucial for training stable transformers. Unlike batch normalization, it normalizes across the feature dimension for each sample independently."
   ]
  },
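  {
   "cell_type": "markdown",
   "id": "worked-layernorm-axes",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "### Worked Example: Which Axis Gets Normalized?\n",
    "\n",
    "A small NumPy comparison can make the difference from batch normalization tangible. This is only an illustrative sketch: it omits the learnable scale and shift, uses training-mode batch statistics without running averages, and the variable names `ln` and `bn` are chosen here for the example.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "x = np.array([[1.0, 2.0, 3.0, 4.0],\n",
    "              [10.0, 20.0, 30.0, 40.0]])  # (batch=2, features=4)\n",
    "eps = 1e-5\n",
    "\n",
    "# Layer norm: one mean/variance per sample (reduce over the feature axis).\n",
    "ln = (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)\n",
    "\n",
    "# Batch norm: one mean/variance per feature (reduce over the batch axis).\n",
    "bn = (x - x.mean(axis=0, keepdims=True)) / np.sqrt(x.var(axis=0, keepdims=True) + eps)\n",
    "\n",
    "print(ln.round(3))\n",
    "print(bn.round(3))\n",
    "```\n",
    "\n",
    "Both rows of `ln` come out almost identical even though the second sample is ten times larger, because each sample is normalized with its own statistics. The `bn` result depends on which samples share the batch."
   ]
  },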
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2166849c",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "layer-norm",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "#| export\n",
    "class LayerNorm:\n",
    "    \"\"\"\n",
    "    Layer Normalization for transformers.\n",
    "    \n",
    "    Normalizes across the feature dimension (last axis) for each sample,\n",
    "    making training more stable and enabling deeper networks.\n",
    "    \"\"\"\n",
    "    \n",
    "    def __init__(self, normalized_shape: Union[int, Tuple[int]], eps: float = 1e-5):\n",
    "        \"\"\"\n",
    "        Initialize layer normalization with learnable parameters.\n",
    "        \n",
    "        TODO: Implement layer normalization initialization.\n",
    "        \n",
    "        STEP-BY-STEP IMPLEMENTATION:\n",
    "        1. Store normalization configuration\n",
    "        2. Initialize learnable scale (gamma) and shift (beta) parameters\n",
    "        3. Set epsilon for numerical stability\n",
    "        4. Set up parameter tracking for optimization\n",
    "        \n",
    "        MATHEMATICAL FOUNDATION:\n",
    "        LayerNorm(x) = γ * (x - μ) / sqrt(σ² + ε) + β\n",
    "        \n",
    "        Where:\n",
    "        - μ = mean across feature dimensions\n",
    "        - σ² = variance across feature dimensions\n",
    "        - ε = eps, a small constant for numerical stability\n",
    "        - γ = learnable scale parameter\n",
    "        - β = learnable shift parameter\n",
    "        \n",
    "        Args:\n",
    "            normalized_shape: Shape of features to normalize (e.g., embedding_dim)\n",
    "            eps: Small value for numerical stability\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        if isinstance(normalized_shape, int):\n",
    "            self.normalized_shape = (normalized_shape,)\n",
    "        else:\n",
    "            self.normalized_shape = normalized_shape\n",
    "        \n",
    "        self.eps = eps\n",
    "        \n",
    "        # Initialize learnable parameters\n",
    "        # Gamma (scale): initialized to ones\n",
    "        # Beta (bias): initialized to zeros\n",
    "        self.gamma = Tensor(np.ones(self.normalized_shape))\n",
    "        self.beta = Tensor(np.zeros(self.normalized_shape))\n",
    "        \n",
    "        # Track parameters for optimization\n",
    "        self.parameters = [self.gamma, self.beta]\n",
    "        ### END SOLUTION\n",
    "    \n",
    "    def forward(self, x: Tensor) -> Tensor:\n",
    "        \"\"\"\n",
    "        Apply layer normalization to input tensor.\n",
    "        \n",
    "        TODO: Implement layer normalization forward pass.\n",
    "        \n",
    "        STEP-BY-STEP IMPLEMENTATION:\n",
    "        1. Calculate mean across feature dimensions\n",
    "        2. Calculate variance across feature dimensions\n",
    "        3. Normalize: (x - mean) / sqrt(variance + eps)\n",
    "        4. Apply learnable scale and shift: gamma * normalized + beta\n",
    "        \n",
    "        NUMERICAL STABILITY:\n",
    "        - Add eps to variance before taking sqrt\n",
    "        - Use the biased (population) variance, which is the np.var default (ddof=0)\n",
    "        \n",
    "        EXAMPLE:\n",
    "        layer_norm = LayerNorm(256)\n",
    "        x = Tensor(np.random.randn(32, 128, 256))  # (batch, seq, features)\n",
    "        normalized = layer_norm.forward(x)  # Same shape as input\n",
    "        \n",
    "        Args:\n",
    "            x: Input tensor with shape (..., *normalized_shape)\n",
    "            \n",
    "        Returns:\n",
    "            Normalized tensor with same shape as input\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        # Calculate mean and variance across the feature dimensions (last axes)\n",
    "        # For shape (..., *normalized_shape), we want to normalize over the last len(normalized_shape) axes\n",
    "        \n",
    "        # Determine axes to normalize over\n",
    "        axes_to_normalize = tuple(range(len(x.shape) - len(self.normalized_shape), len(x.shape)))\n",
    "        \n",
    "        # Calculate mean\n",
    "        mean = np.mean(x.data, axis=axes_to_normalize, keepdims=True)\n",
    "        \n",
    "        # Calculate variance\n",
    "        variance = np.var(x.data, axis=axes_to_normalize, keepdims=True)\n",
    "        \n",
    "        # Normalize\n",
    "        normalized = (x.data - mean) / np.sqrt(variance + self.eps)\n",
    "        \n",
    "        # Apply learnable scale and shift\n",
    "        # Reshape gamma and beta to be broadcastable\n",
    "        gamma_broadcasted = self.gamma.data.reshape([1] * (len(x.shape) - len(self.normalized_shape)) + list(self.normalized_shape))\n",
    "        beta_broadcasted = self.beta.data.reshape([1] * (len(x.shape) - len(self.normalized_shape)) + list(self.normalized_shape))\n",
    "        \n",
    "        output = gamma_broadcasted * normalized + beta_broadcasted\n",
    "        \n",
    "        return Tensor(output)\n",
    "        ### END SOLUTION\n",
    "    \n",
    "    def __call__(self, x: Tensor) -> Tensor:\n",
    "        \"\"\"Make the class callable.\"\"\"\n",
    "        return self.forward(x)\n",
    "    \n",
    "    def get_memory_usage(self) -> Dict[str, float]:\n",
    "        \"\"\"\n",
    "        Calculate memory usage of layer normalization parameters.\n",
    "        \n",
    "        This function is PROVIDED to show memory analysis.\n",
    "        \"\"\"\n",
    "        # Parameter memory\n",
    "        param_memory_mb = sum(param.data.nbytes for param in self.parameters) / (1024 * 1024)\n",
    "        \n",
    "        return {\n",
    "            'parameter_memory_mb': param_memory_mb,\n",
    "            'total_parameters': sum(param.data.size for param in self.parameters),\n",
    "            'normalized_shape': self.normalized_shape\n",
    "        }"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba9e1251",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "### 🧪 Test Your Layer Normalization Implementation\n",
    "\n",
    "Once you implement the LayerNorm methods above, run this cell to test it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7349865c",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": true,
     "grade_id": "test-layer-norm-immediate",
     "locked": true,
     "points": 15,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "def test_unit_layer_norm():\n",
    "    \"\"\"Unit test for layer normalization.\"\"\"\n",
    "    print(\"🔬 Unit Test: Layer Normalization...\")\n",
    "    \n",
    "    # Test 1: Basic functionality\n",
    "    embed_dim = 256\n",
    "    layer_norm = LayerNorm(embed_dim)\n",
    "    \n",
    "    # Verify initialization\n",
    "    assert layer_norm.normalized_shape == (embed_dim,), \"Should store normalized shape\"\n",
    "    assert len(layer_norm.parameters) == 2, \"Should have gamma and beta parameters\"\n",
    "    assert layer_norm.gamma.shape == (embed_dim,), \"Gamma should match normalized shape\"\n",
    "    assert layer_norm.beta.shape == (embed_dim,), \"Beta should match normalized shape\"\n",
    "    \n",
    "    # Verify parameter initialization\n",
    "    assert np.allclose(layer_norm.gamma.data, 1.0), \"Gamma should be initialized to ones\"\n",
    "    assert np.allclose(layer_norm.beta.data, 0.0), \"Beta should be initialized to zeros\"\n",
    "    \n",
    "    # Test 2: Forward pass with 2D input\n",
    "    batch_size = 16\n",
    "    x_2d = Tensor(np.random.randn(batch_size, embed_dim))\n",
    "    output_2d = layer_norm.forward(x_2d)\n",
    "    \n",
    "    assert output_2d.shape == x_2d.shape, \"Output shape should match input shape\"\n",
    "    \n",
    "    # Test 3: Forward pass with 3D input (typical transformer use)\n",
    "    seq_length = 32\n",
    "    x_3d = Tensor(np.random.randn(batch_size, seq_length, embed_dim))\n",
    "    output_3d = layer_norm.forward(x_3d)\n",
    "    \n",
    "    assert output_3d.shape == x_3d.shape, \"3D output shape should match input shape\"\n",
    "    \n",
    "    # Test 4: Normalization properties\n",
    "    # For each sample, the normalized features should have ~zero mean and ~unit variance\n",
    "    for i in range(batch_size):\n",
    "        for j in range(seq_length):\n",
    "            sample_output = output_3d.data[i, j, :]\n",
    "            sample_mean = np.mean(sample_output)\n",
    "            sample_var = np.var(sample_output)\n",
    "            \n",
    "            assert abs(sample_mean) < 1e-6, f\"Normalized mean should be ~0, got {sample_mean}\"\n",
    "            # eps in the denominator makes the output variance var / (var + eps), slightly below 1,\n",
    "            # so the tolerance must be looser than eps (1e-5)\n",
    "            assert abs(sample_var - 1.0) < 1e-4, f\"Normalized variance should be ~1, got {sample_var}\"\n",
    "    \n",
    "    # Test 5: Different normalized shapes\n",
    "    multi_dim_shape = (64, 4)  # Multi-dimensional normalization\n",
    "    layer_norm_multi = LayerNorm(multi_dim_shape)\n",
    "    \n",
    "    x_multi = Tensor(np.random.randn(8, 32, 64, 4))\n",
    "    output_multi = layer_norm_multi.forward(x_multi)\n",
    "    \n",
    "    assert output_multi.shape == x_multi.shape, \"Multi-dim normalization should preserve shape\"\n",
    "    \n",
    "    # Test 6: Callable interface\n",
    "    output_callable = layer_norm(x_3d)\n",
    "    assert np.allclose(output_callable.data, output_3d.data), \"Callable interface should work\"\n",
    "    \n",
    "    # Test 7: Numerical stability with extreme values\n",
    "    extreme_x = Tensor(np.ones((4, embed_dim)) * 1e6)  # Very large values\n",
    "    extreme_output = layer_norm.forward(extreme_x)\n",
    "    \n",
    "    assert not np.any(np.isnan(extreme_output.data)), \"Should handle extreme values without NaN\"\n",
    "    assert not np.any(np.isinf(extreme_output.data)), \"Should handle extreme values without inf\"\n",
    "    \n",
    "    # Test 8: Memory usage calculation\n",
    "    memory_stats = layer_norm.get_memory_usage()\n",
    "    assert 'parameter_memory_mb' in memory_stats, \"Should provide memory statistics\"\n",
    "    assert memory_stats['total_parameters'] == 2 * embed_dim, \"Should count gamma and beta parameters\"\n",
    "    \n",
    "    print(\"✅ Layer normalization tests passed!\")\n",
    "    print(f\"✅ Properly normalizes across feature dimensions\")\n",
    "    print(f\"✅ Handles 2D and 3D inputs correctly\")\n",
    "    print(f\"✅ Maintains ~0 mean and ~1 variance after normalization\")\n",
    "    print(f\"✅ Parameter memory: {memory_stats['parameter_memory_mb']:.4f}MB\")\n",
    "\n",
    "# Test function defined (called in main block)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b484efe6",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## Position-wise Feed-Forward Network\n",
    "\n",
    "Each transformer block contains a position-wise feed-forward network that applies the same transformation to each position independently."
   ]
  },
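  {
   "cell_type": "markdown",
   "id": "worked-ffn-params",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "### Worked Example: Feed-Forward Parameter Count\n",
    "\n",
    "Because the same two linear layers are reused at every position, the parameter count depends only on the layer widths, not on the sequence length. The sketch below works through the arithmetic for the common 4x expansion. `ffn_parameter_count` is a hypothetical helper written for this example, and the memory figure assumes float32 storage.\n",
    "\n",
    "```python\n",
    "def ffn_parameter_count(embed_dim, hidden_dim):\n",
    "    # Two weight matrices plus two bias vectors: FFN(x) = max(0, xW1 + b1)W2 + b2.\n",
    "    weights = embed_dim * hidden_dim + hidden_dim * embed_dim\n",
    "    biases = hidden_dim + embed_dim\n",
    "    return weights + biases\n",
    "\n",
    "embed_dim = 256\n",
    "hidden_dim = 4 * embed_dim  # Common 4x expansion\n",
    "total = ffn_parameter_count(embed_dim, hidden_dim)\n",
    "print(f'{total:,} parameters, {total * 4 / (1024 * 1024):.2f} MB if stored as float32')\n",
    "```\n",
    "\n",
    "With `hidden_dim = 4 * embed_dim` the total simplifies to `8 * embed_dim**2 + 5 * embed_dim`, so the weight matrices dominate and the cost grows quadratically with the embedding width."
   ]
  },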
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b1aaebc9",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "feed-forward",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "#| export\n",
    "class PositionwiseFeedForward:\n",
    "    \"\"\"\n",
    "    Position-wise feed-forward network used in transformer blocks.\n",
    "    \n",
    "    Applies the same feed-forward network to each position in the sequence:\n",
    "    FFN(x) = max(0, xW₁ + b₁)W₂ + b₂\n",
    "    \"\"\"\n",
    "    \n",
    "    def __init__(self, embed_dim: int, hidden_dim: int, dropout: float = 0.0):\n",
    "        \"\"\"\n",
    "        Initialize position-wise feed-forward network.\n",
    "        \n",
    "        TODO: Implement feed-forward network initialization.\n",
    "        \n",
    "        STEP-BY-STEP IMPLEMENTATION:\n",
    "        1. Store network configuration\n",
    "        2. Initialize weight matrices and bias vectors for two linear layers\n",
    "        3. Set up parameter tracking for optimization\n",
    "        4. Store dropout rate for training\n",
    "        \n",
    "        ARCHITECTURE:\n",
    "        - Input: (batch, seq_len, embed_dim)\n",
    "        - Linear 1: embed_dim → hidden_dim\n",
    "        - ReLU activation\n",
    "        - Linear 2: hidden_dim → embed_dim\n",
    "        - Output: (batch, seq_len, embed_dim)\n",
    "        \n",
    "        PARAMETER INITIALIZATION:\n",
    "        Use Xavier/Glorot initialization for stable training\n",
    "        \n",
    "        Args:\n",
    "            embed_dim: Embedding dimension (input and output size)\n",
    "            hidden_dim: Hidden layer dimension (typically 4 * embed_dim)\n",
    "            dropout: Dropout rate for regularization\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        self.embed_dim = embed_dim\n",
    "        self.hidden_dim = hidden_dim\n",
    "        self.dropout = dropout\n",
    "        \n",
    "        # Initialize weights using Xavier initialization\n",
    "        # W1: embed_dim → hidden_dim\n",
    "        xavier_bound_1 = math.sqrt(6.0 / (embed_dim + hidden_dim))\n",
    "        self.w1 = Tensor(np.random.uniform(-xavier_bound_1, xavier_bound_1, (embed_dim, hidden_dim)))\n",
    "        self.b1 = Tensor(np.zeros(hidden_dim))\n",
    "        \n",
    "        # W2: hidden_dim → embed_dim\n",
    "        xavier_bound_2 = math.sqrt(6.0 / (hidden_dim + embed_dim))\n",
    "        self.w2 = Tensor(np.random.uniform(-xavier_bound_2, xavier_bound_2, (hidden_dim, embed_dim)))\n",
    "        self.b2 = Tensor(np.zeros(embed_dim))\n",
    "        \n",
    "        # Track parameters for optimization\n",
    "        self.parameters = [self.w1, self.b1, self.w2, self.b2]\n",
    "        ### END SOLUTION\n",
    "    \n",
    "    def forward(self, x: Tensor) -> Tensor:\n",
    "        \"\"\"\n",
    "        Apply position-wise feed-forward transformation.\n",
    "        \n",
    "        TODO: Implement feed-forward forward pass.\n",
    "        \n",
    "        STEP-BY-STEP IMPLEMENTATION:\n",
    "        1. Apply first linear transformation: x @ W1 + b1\n",
    "        2. Apply ReLU activation: max(0, linear1)\n",
    "        3. Apply second linear transformation: relu @ W2 + b2\n",
    "        4. Return result with same shape as input\n",
    "        \n",
    "        MATHEMATICAL FORMULATION:\n",
    "        hidden = ReLU(x @ W1 + b1)\n",
    "        output = hidden @ W2 + b2\n",
    "        \n",
    "        Args:\n",
    "            x: Input tensor with shape (batch_size, seq_len, embed_dim)\n",
    "            \n",
    "        Returns:\n",
    "            Output tensor with shape (batch_size, seq_len, embed_dim)\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        # Reshape input for matrix multiplication if needed\n",
    "        original_shape = x.shape\n",
    "        if len(x.shape) == 3:\n",
    "            batch_size, seq_len, embed_dim = x.shape\n",
    "            # Reshape to (batch_size * seq_len, embed_dim) for efficient computation\n",
    "            x_reshaped = x.data.reshape(-1, embed_dim)\n",
    "        else:\n",
    "            x_reshaped = x.data\n",
    "        \n",
    "        # First linear transformation: x @ W1 + b1\n",
    "        hidden = np.matmul(x_reshaped, self.w1.data) + self.b1.data\n",
    "        \n",
    "        # ReLU activation\n",
    "        hidden_relu = np.maximum(0, hidden)\n",
    "        \n",
    "        # Second linear transformation: hidden @ W2 + b2\n",
    "        output = np.matmul(hidden_relu, self.w2.data) + self.b2.data\n",
    "        \n",
    "        # Reshape back to original shape\n",
    "        if len(original_shape) == 3:\n",
    "            output = output.reshape(original_shape)\n",
    "        \n",
    "        return Tensor(output)\n",
    "        ### END SOLUTION\n",
    "    \n",
    "    def __call__(self, x: Tensor) -> Tensor:\n",
    "        \"\"\"Make the class callable.\"\"\"\n",
    "        return self.forward(x)\n",
    "    \n",
    "    def get_memory_usage(self) -> Dict[str, float]:\n",
    "        \"\"\"\n",
    "        Calculate memory usage of feed-forward parameters.\n",
    "        \n",
    "        This function is PROVIDED to show memory analysis.\n",
    "        \"\"\"\n",
    "        # Parameter memory\n",
    "        param_memory_mb = sum(param.data.nbytes for param in self.parameters) / (1024 * 1024)\n",
    "        \n",
    "        # Calculate parameter counts\n",
    "        w1_params = self.embed_dim * self.hidden_dim\n",
    "        w2_params = self.hidden_dim * self.embed_dim\n",
    "        bias_params = self.hidden_dim + self.embed_dim\n",
    "        total_params = w1_params + w2_params + bias_params\n",
    "        \n",
    "        return {\n",
    "            'parameter_memory_mb': param_memory_mb,\n",
    "            'total_parameters': total_params,\n",
    "            'w1_parameters': w1_params,\n",
    "            'w2_parameters': w2_params,\n",
    "            'bias_parameters': bias_params,\n",
    "            'embed_dim': self.embed_dim,\n",
    "            'hidden_dim': self.hidden_dim\n",
    "        }"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e555b646",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "### 🧪 Test Your Feed-Forward Network Implementation\n",
    "\n",
    "Once you implement the PositionwiseFeedForward methods above, run this cell to test it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "95b8fd0e",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": true,
     "grade_id": "test-feed-forward-immediate",
     "locked": true,
     "points": 15,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "def test_unit_feed_forward():\n",
    "    \"\"\"Unit test for position-wise feed-forward network.\"\"\"\n",
    "    print(\"🔬 Unit Test: Position-wise Feed-Forward Network...\")\n",
    "    \n",
    "    # Test configuration\n",
    "    embed_dim = 256\n",
    "    hidden_dim = 1024  # Typical 4x expansion\n",
    "    ffn = PositionwiseFeedForward(embed_dim=embed_dim, hidden_dim=hidden_dim)\n",
    "    \n",
    "    # Verify initialization\n",
    "    assert ffn.embed_dim == embed_dim, \"Should store embedding dimension\"\n",
    "    assert ffn.hidden_dim == hidden_dim, \"Should store hidden dimension\"\n",
    "    assert len(ffn.parameters) == 4, \"Should have W1, b1, W2, b2 parameters\"\n",
    "    \n",
    "    # Verify parameter shapes\n",
    "    assert ffn.w1.shape == (embed_dim, hidden_dim), f\"W1 should be ({embed_dim}, {hidden_dim})\"\n",
    "    assert ffn.b1.shape == (hidden_dim,), f\"b1 should be ({hidden_dim},)\"\n",
    "    assert ffn.w2.shape == (hidden_dim, embed_dim), f\"W2 should be ({hidden_dim}, {embed_dim})\"\n",
    "    assert ffn.b2.shape == (embed_dim,), f\"b2 should be ({embed_dim},)\"\n",
    "    \n",
    "    # Test forward pass with 3D input (typical transformer use)\n",
    "    batch_size = 8\n",
    "    seq_len = 32\n",
    "    x_3d = Tensor(np.random.randn(batch_size, seq_len, embed_dim))\n",
    "    output_3d = ffn.forward(x_3d)\n",
    "    \n",
    "    expected_shape = (batch_size, seq_len, embed_dim)\n",
    "    assert output_3d.shape == expected_shape, f\"Expected shape {expected_shape}, got {output_3d.shape}\"\n",
    "    \n",
    "    # Test forward pass with 2D input\n",
    "    x_2d = Tensor(np.random.randn(batch_size, embed_dim))\n",
    "    output_2d = ffn.forward(x_2d)\n",
    "    \n",
    "    expected_2d_shape = (batch_size, embed_dim)\n",
    "    assert output_2d.shape == expected_2d_shape, f\"Expected 2D shape {expected_2d_shape}, got {output_2d.shape}\"\n",
    "    \n",
    "    # Test that FFN is applied position-wise (same transformation at each position)\n",
    "    # Extract two positions from the sequence\n",
    "    pos_1_input = Tensor(x_3d.data[:, 0, :])  # First position\n",
    "    pos_2_input = Tensor(x_3d.data[:, 1, :])  # Second position\n",
    "    \n",
    "    pos_1_output = ffn.forward(pos_1_input)\n",
    "    pos_2_output = ffn.forward(pos_2_input)\n",
    "    \n",
    "    # Compare with full sequence output\n",
    "    assert np.allclose(pos_1_output.data, output_3d.data[:, 0, :]), \"Position 0 should match individual processing\"\n",
    "    assert np.allclose(pos_2_output.data, output_3d.data[:, 1, :]), \"Position 1 should match individual processing\"\n",
    "    \n",
    "    # Test ReLU activation against a NumPy reference computation\n",
    "    # A strongly negative input drives many pre-activation values below zero, so ReLU must clip them\n",
    "    negative_input = Tensor(-np.ones((4, embed_dim)) * 10)  # Very negative input\n",
    "    negative_output = ffn.forward(negative_input)\n",
    "    \n",
    "    # ReLU acts on the hidden layer, so the final output may still contain negative values\n",
    "    pre_activation = negative_input.data @ ffn.w1.data + ffn.b1.data\n",
    "    assert np.any(pre_activation < 0), \"Test input should produce negative pre-activations\"\n",
    "    expected_output = np.maximum(0, pre_activation) @ ffn.w2.data + ffn.b2.data\n",
    "    assert np.allclose(negative_output.data, expected_output), \"Output should match ReLU(x @ W1 + b1) @ W2 + b2\"\n",
    "    \n",
    "    # Test callable interface\n",
    "    output_callable = ffn(x_3d)\n",
    "    assert np.allclose(output_callable.data, output_3d.data), \"Callable interface should work\"\n",
    "    \n",
    "    # Test different hidden dimensions\n",
    "    for test_hidden_dim in [512, 2048]:\n",
    "        test_ffn = PositionwiseFeedForward(embed_dim=embed_dim, hidden_dim=test_hidden_dim)\n",
    "        test_output = test_ffn.forward(x_3d)\n",
    "        assert test_output.shape == expected_shape, f\"Should work with hidden_dim={test_hidden_dim}\"\n",
    "    \n",
    "    # Test memory usage calculation\n",
    "    memory_stats = ffn.get_memory_usage()\n",
    "    assert 'parameter_memory_mb' in memory_stats, \"Should provide memory statistics\"\n",
    "    \n",
    "    # Verify parameter counts\n",
    "    expected_w1_params = embed_dim * hidden_dim\n",
    "    expected_w2_params = hidden_dim * embed_dim\n",
    "    expected_total = expected_w1_params + expected_w2_params + hidden_dim + embed_dim\n",
    "    \n",
    "    assert memory_stats['w1_parameters'] == expected_w1_params, \"Should count W1 parameters correctly\"\n",
    "    assert memory_stats['w2_parameters'] == expected_w2_params, \"Should count W2 parameters correctly\"\n",
    "    assert memory_stats['total_parameters'] == expected_total, \"Should count total parameters correctly\"\n",
    "    \n",
    "    print(\"✅ Position-wise feed-forward tests passed!\")\n",
    "    print(f\"✅ Handles 2D and 3D inputs correctly\")\n",
    "    print(f\"✅ Position-wise processing verified\")\n",
    "    print(f\"✅ ReLU activation working properly\")\n",
    "    print(f\"✅ Total parameters: {memory_stats['total_parameters']:,}\")\n",
    "    print(f\"✅ Parameter memory: {memory_stats['parameter_memory_mb']:.2f}MB\")\n",
    "\n",
    "# Test function defined (called in main block)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d97703d2",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## Transformer Block Implementation\n",
    "\n",
    "Now let's build the complete transformer block that combines multi-head attention, layer normalization, and position-wise feed-forward networks with residual connections."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e5677022",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "transformer-block",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "#| export\n",
    "class TransformerBlock:\n",
    "    \"\"\"\n",
    "    Complete transformer block with self-attention and feed-forward layers.\n",
    "    \n",
    "    Combines multi-head self-attention, layer normalization, residual connections,\n",
    "    and position-wise feed-forward networks into the standard transformer architecture.\n",
    "    \"\"\"\n",
    "    \n",
    "    def __init__(self, embed_dim: int, num_heads: int, hidden_dim: int, \n",
    "                 dropout: float = 0.0, pre_norm: bool = True):\n",
    "        \"\"\"\n",
    "        Initialize transformer block with all components.\n",
    "        \n",
    "        TODO: Implement transformer block initialization.\n",
    "        \n",
    "        STEP-BY-STEP IMPLEMENTATION:\n",
    "        1. Store block configuration\n",
    "        2. Create multi-head attention layer\n",
    "        3. Create two layer normalization layers (for attention and FFN)\n",
    "        4. Create position-wise feed-forward network\n",
    "        5. Set up parameter tracking from all sub-components\n",
    "        \n",
    "        ARCHITECTURE CHOICE: Pre-norm vs Post-norm\n",
    "        - Pre-norm: LayerNorm → Attention → Residual (more stable)\n",
    "        - Post-norm: Attention → LayerNorm → Residual (original paper)\n",
    "        \n",
    "        Args:\n",
    "            embed_dim: Embedding dimension\n",
    "            num_heads: Number of attention heads\n",
    "            hidden_dim: Feed-forward hidden dimension (typically 4 * embed_dim)\n",
    "            dropout: Dropout rate for regularization\n",
    "            pre_norm: Whether to use pre-normalization (recommended)\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        self.embed_dim = embed_dim\n",
    "        self.num_heads = num_heads\n",
    "        self.hidden_dim = hidden_dim\n",
    "        self.dropout = dropout\n",
    "        self.pre_norm = pre_norm\n",
    "        \n",
    "        # Multi-head self-attention\n",
    "        self.attention = MultiHeadAttention(embed_dim=embed_dim, num_heads=num_heads, dropout=dropout)\n",
    "        \n",
    "        # Layer normalization layers\n",
    "        self.norm1 = LayerNorm(embed_dim)  # For attention\n",
    "        self.norm2 = LayerNorm(embed_dim)  # For feed-forward\n",
    "        \n",
    "        # Position-wise feed-forward network\n",
    "        self.ffn = PositionwiseFeedForward(embed_dim=embed_dim, hidden_dim=hidden_dim, dropout=dropout)\n",
    "        \n",
    "        # Collect all parameters from sub-components\n",
    "        self.parameters = []\n",
    "        if hasattr(self.attention, 'parameters'):\n",
    "            self.parameters.extend(self.attention.parameters)\n",
    "        self.parameters.extend(self.norm1.parameters)\n",
    "        self.parameters.extend(self.norm2.parameters)\n",
    "        self.parameters.extend(self.ffn.parameters)\n",
    "        ### END SOLUTION\n",
    "    \n",
    "    def forward(self, x: Tensor, mask: Optional[Tensor] = None,\n",
    "                return_attention_weights: bool = False) -> Union[Tensor, Tuple[Tensor, Tensor]]:\n",
    "        \"\"\"\n",
    "        Process input through complete transformer block.\n",
    "        \n",
    "        TODO: Implement transformer block forward pass.\n",
    "        \n",
    "        STEP-BY-STEP IMPLEMENTATION (Pre-norm):\n",
    "        1. Self-attention with residual: x + attention(norm1(x))\n",
    "        2. Feed-forward with residual: attn_out + ffn(norm2(attn_out))\n",
    "        3. Return final output (and optionally attention weights)\n",
    "        \n",
    "        RESIDUAL CONNECTIONS:\n",
    "        Essential for training deep networks - allow gradients to flow directly\n",
    "        \n",
    "        Args:\n",
    "            x: Input tensor with shape (batch_size, seq_len, embed_dim)\n",
    "            mask: Optional attention mask\n",
    "            return_attention_weights: Whether to return attention weights\n",
    "            \n",
    "        Returns:\n",
    "            Transformer block output with same shape as input\n",
    "            Optionally also attention weights\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        if self.pre_norm:\n",
    "            # Pre-normalization: LayerNorm before attention/FFN\n",
    "            \n",
    "            # Self-attention with residual connection\n",
    "            norm1_x = self.norm1(x)\n",
    "            if return_attention_weights:\n",
    "                attn_output, attn_weights = self.attention.forward(\n",
    "                    norm1_x, norm1_x, norm1_x, mask=mask, return_attention_weights=True\n",
    "                )\n",
    "            else:\n",
    "                attn_output = self.attention.forward(norm1_x, norm1_x, norm1_x, mask=mask)\n",
    "            \n",
    "            # Residual connection\n",
    "            x = Tensor(x.data + attn_output.data)\n",
    "            \n",
    "            # Feed-forward with residual connection\n",
    "            norm2_x = self.norm2(x)\n",
    "            ffn_output = self.ffn.forward(norm2_x)\n",
    "            \n",
    "            # Residual connection\n",
    "            output = Tensor(x.data + ffn_output.data)\n",
    "            \n",
    "        else:\n",
    "            # Post-normalization: LayerNorm after attention/FFN (original transformer)\n",
    "            \n",
    "            # Self-attention with residual connection\n",
    "            if return_attention_weights:\n",
    "                attn_output, attn_weights = self.attention.forward(\n",
    "                    x, x, x, mask=mask, return_attention_weights=True\n",
    "                )\n",
    "            else:\n",
    "                attn_output = self.attention.forward(x, x, x, mask=mask)\n",
    "            \n",
    "            # Residual + LayerNorm\n",
    "            attn_residual = Tensor(x.data + attn_output.data)\n",
    "            norm1_output = self.norm1(attn_residual)\n",
    "            \n",
    "            # Feed-forward with residual connection\n",
    "            ffn_output = self.ffn.forward(norm1_output)\n",
    "            \n",
    "            # Residual + LayerNorm\n",
    "            ffn_residual = Tensor(norm1_output.data + ffn_output.data)\n",
    "            output = self.norm2(ffn_residual)\n",
    "        \n",
    "        if return_attention_weights:\n",
    "            return output, attn_weights\n",
    "        else:\n",
    "            return output\n",
    "        ### END SOLUTION\n",
    "    \n",
    "    def __call__(self, x: Tensor, mask: Optional[Tensor] = None,\n",
    "                 return_attention_weights: bool = False) -> Union[Tensor, Tuple[Tensor, Tensor]]:\n",
    "        \"\"\"Make the class callable.\"\"\"\n",
    "        return self.forward(x, mask, return_attention_weights)\n",
    "    \n",
    "    def get_memory_usage(self) -> Dict[str, float]:\n",
    "        \"\"\"\n",
    "        Calculate memory usage of transformer block components.\n",
    "        \n",
    "        This function is PROVIDED to show memory analysis.\n",
    "        \"\"\"\n",
    "        # Get memory usage from components\n",
    "        if hasattr(self.attention, 'get_memory_usage'):\n",
    "            attention_memory = self.attention.get_memory_usage()['total_parameter_memory_mb']\n",
    "        else:\n",
    "            attention_memory = 0.0\n",
    "        \n",
    "        norm1_memory = self.norm1.get_memory_usage()['parameter_memory_mb']\n",
    "        norm2_memory = self.norm2.get_memory_usage()['parameter_memory_mb']\n",
    "        ffn_memory = self.ffn.get_memory_usage()['parameter_memory_mb']\n",
    "        \n",
    "        total_memory = attention_memory + norm1_memory + norm2_memory + ffn_memory\n",
    "        total_params = len(self.parameters) if hasattr(self, 'parameters') else 0\n",
    "        \n",
    "        return {\n",
    "            'total_memory_mb': total_memory,\n",
    "            'attention_memory_mb': attention_memory,\n",
    "            'norm_memory_mb': norm1_memory + norm2_memory,\n",
    "            'ffn_memory_mb': ffn_memory,\n",
    "            'total_parameters': sum(p.data.size for p in self.parameters) if hasattr(self, 'parameters') else 0,\n",
    "            'embed_dim': self.embed_dim,\n",
    "            'num_heads': self.num_heads,\n",
    "            'hidden_dim': self.hidden_dim,\n",
    "            'pre_norm': self.pre_norm\n",
    "        }"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f786ca8b",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "### 🧪 Test Your Transformer Block Implementation\n",
    "\n",
    "Once you implement the TransformerBlock methods above, run this cell to test it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b5c44e59",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": true,
     "grade_id": "test-transformer-block-immediate",
     "locked": true,
     "points": 20,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "def test_unit_transformer_block():\n",
    "    \"\"\"Unit test for transformer block.\"\"\"\n",
    "    print(\"🔬 Unit Test: Transformer Block...\")\n",
    "    \n",
    "    # Test configuration\n",
    "    embed_dim = 256\n",
    "    num_heads = 8\n",
    "    hidden_dim = 1024\n",
    "    transformer_block = TransformerBlock(\n",
    "        embed_dim=embed_dim, \n",
    "        num_heads=num_heads, \n",
    "        hidden_dim=hidden_dim,\n",
    "        pre_norm=True\n",
    "    )\n",
    "    \n",
    "    # Verify initialization\n",
    "    assert transformer_block.embed_dim == embed_dim, \"Should store embedding dimension\"\n",
    "    assert transformer_block.num_heads == num_heads, \"Should store number of heads\"\n",
    "    assert transformer_block.hidden_dim == hidden_dim, \"Should store hidden dimension\"\n",
    "    assert transformer_block.pre_norm == True, \"Should store normalization type\"\n",
    "    \n",
    "    # Verify components exist\n",
    "    assert hasattr(transformer_block, 'attention'), \"Should have attention layer\"\n",
    "    assert hasattr(transformer_block, 'norm1'), \"Should have first norm layer\"\n",
    "    assert hasattr(transformer_block, 'norm2'), \"Should have second norm layer\"\n",
    "    assert hasattr(transformer_block, 'ffn'), \"Should have feed-forward network\"\n",
    "    \n",
    "    # Test forward pass\n",
    "    batch_size = 4\n",
    "    seq_len = 16\n",
    "    x = Tensor(np.random.randn(batch_size, seq_len, embed_dim))\n",
    "    \n",
    "    output = transformer_block.forward(x)\n",
    "    expected_shape = (batch_size, seq_len, embed_dim)\n",
    "    assert output.shape == expected_shape, f\"Expected shape {expected_shape}, got {output.shape}\"\n",
    "    \n",
    "    # Test with attention weights return\n",
    "    output_with_attn, attn_weights = transformer_block.forward(x, return_attention_weights=True)\n",
    "    \n",
    "    assert output_with_attn.shape == expected_shape, \"Output with attention should have correct shape\"\n",
    "    expected_attn_shape = (batch_size, num_heads, seq_len, seq_len)\n",
    "    assert attn_weights.shape == expected_attn_shape, f\"Expected attention shape {expected_attn_shape}, got {attn_weights.shape}\"\n",
    "    \n",
    "    # Test with causal mask\n",
    "    causal_mask = np.triu(np.ones((seq_len, seq_len)), k=1)\n",
    "    causal_mask = 1 - causal_mask  # Convert to attention mask\n",
    "    \n",
    "    masked_output, masked_attn = transformer_block.forward(\n",
    "        x, mask=Tensor(causal_mask), return_attention_weights=True\n",
    "    )\n",
    "    \n",
    "    assert masked_output.shape == expected_shape, \"Masked output should have correct shape\"\n",
    "    \n",
    "    # Verify causal masking works\n",
    "    for head in range(num_heads):\n",
    "        for i in range(seq_len):\n",
    "            for j in range(i+1, seq_len):\n",
    "                assert np.all(masked_attn.data[:, head, i, j] < 1e-5), \\\n",
    "                    f\"Position ({i},{j}) should be masked in head {head}\"\n",
    "    \n",
    "    # Test residual connections by checking that output is different from pure attention\n",
    "    # If we zero out the input, residual connections should preserve some information\n",
    "    zero_input = Tensor(np.zeros((batch_size, seq_len, embed_dim)))\n",
    "    zero_output = transformer_block.forward(zero_input)\n",
    "    \n",
    "    # Output should not be exactly zero due to biases and layer norm parameters\n",
    "    assert not np.allclose(zero_output.data, 0), \"Residual connections should prevent zero output\"\n",
    "    \n",
    "    # Test post-normalization variant\n",
    "    post_norm_block = TransformerBlock(\n",
    "        embed_dim=embed_dim, \n",
    "        num_heads=num_heads, \n",
    "        hidden_dim=hidden_dim,\n",
    "        pre_norm=False\n",
    "    )\n",
    "    \n",
    "    post_norm_output = post_norm_block.forward(x)\n",
    "    assert post_norm_output.shape == expected_shape, \"Post-norm should produce correct shape\"\n",
    "    \n",
    "    # Pre-norm and post-norm should produce different outputs\n",
    "    pre_norm_output = transformer_block.forward(x)\n",
    "    assert not np.allclose(pre_norm_output.data, post_norm_output.data), \\\n",
    "        \"Pre-norm and post-norm should produce different outputs\"\n",
    "    \n",
    "    # Test callable interface\n",
    "    output_callable = transformer_block(x)\n",
    "    assert np.allclose(output_callable.data, output.data), \"Callable interface should work\"\n",
    "    \n",
    "    # Test different configurations\n",
    "    for test_heads in [4, 16]:\n",
    "        if embed_dim % test_heads == 0:\n",
    "            test_block = TransformerBlock(embed_dim=embed_dim, num_heads=test_heads, hidden_dim=hidden_dim)\n",
    "            test_output = test_block.forward(x)\n",
    "            assert test_output.shape == expected_shape, f\"Should work with {test_heads} heads\"\n",
    "    \n",
    "    # Test memory usage calculation\n",
    "    memory_stats = transformer_block.get_memory_usage()\n",
    "    assert 'total_memory_mb' in memory_stats, \"Should provide memory statistics\"\n",
    "    assert memory_stats['total_memory_mb'] > 0, \"Should have positive memory usage\"\n",
    "    assert memory_stats['total_parameters'] > 0, \"Should count parameters\"\n",
    "    \n",
    "    print(\"✅ Transformer block tests passed!\")\n",
    "    print(f\"✅ Pre-norm and post-norm architectures work correctly\")\n",
    "    print(f\"✅ Residual connections preserve information flow\")\n",
    "    print(f\"✅ Causal masking works across all attention heads\")\n",
    "    print(f\"✅ Total parameters: {memory_stats['total_parameters']:,}\")\n",
    "    print(f\"✅ Total memory: {memory_stats['total_memory_mb']:.2f}MB\")\n",
    "\n",
    "# Test function defined (called in main block)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d8c231b1",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## Complete Transformer Model\n",
    "\n",
    "Finally, let's build a complete transformer model that can be used for language modeling tasks like text generation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6364ce7e",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "transformer-model",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "#| export\n",
    "class Transformer:\n",
    "    \"\"\"\n",
    "    Complete transformer model for language processing.\n",
    "    \n",
    "    Stacks multiple transformer blocks with token embeddings and positional\n",
    "    encoding to create a complete language model architecture.\n",
    "    \"\"\"\n",
    "    \n",
    "    def __init__(self, vocab_size: int, embed_dim: int, num_heads: int, \n",
    "                 num_layers: int, hidden_dim: int, max_seq_length: int = 1024,\n",
    "                 dropout: float = 0.0, pre_norm: bool = True):\n",
    "        \"\"\"\n",
    "        Initialize complete transformer model.\n",
    "        \n",
    "        TODO: Implement transformer model initialization.\n",
    "        \n",
    "        STEP-BY-STEP IMPLEMENTATION:\n",
    "        1. Store model configuration\n",
    "        2. Create token embedding layer\n",
    "        3. Create positional encoding\n",
    "        4. Create stack of transformer blocks\n",
    "        5. Create output projection layer (for language modeling)\n",
    "        6. Set up parameter tracking from all components\n",
    "        \n",
    "        LANGUAGE MODELING HEAD:\n",
    "        Final linear layer that projects hidden states to vocabulary logits\n",
    "        \n",
    "        Args:\n",
    "            vocab_size: Size of vocabulary\n",
    "            embed_dim: Embedding dimension\n",
    "            num_heads: Number of attention heads per layer\n",
    "            num_layers: Number of transformer blocks\n",
    "            hidden_dim: Feed-forward hidden dimension\n",
    "            max_seq_length: Maximum sequence length for positional encoding\n",
    "            dropout: Dropout rate\n",
    "            pre_norm: Whether to use pre-normalization\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        self.vocab_size = vocab_size\n",
    "        self.embed_dim = embed_dim\n",
    "        self.num_heads = num_heads\n",
    "        self.num_layers = num_layers\n",
    "        self.hidden_dim = hidden_dim\n",
    "        self.max_seq_length = max_seq_length\n",
    "        self.dropout = dropout\n",
    "        self.pre_norm = pre_norm\n",
    "        \n",
    "        # Token embedding layer\n",
    "        self.token_embedding = Embedding(vocab_size=vocab_size, embedding_dim=embed_dim)\n",
    "        \n",
    "        # Positional encoding\n",
    "        self.pos_encoding = PositionalEncoding(embedding_dim=embed_dim, max_seq_length=max_seq_length)\n",
    "        \n",
    "        # Stack of transformer blocks\n",
    "        self.transformer_blocks = []\n",
    "        for _ in range(num_layers):\n",
    "            block = TransformerBlock(\n",
    "                embed_dim=embed_dim,\n",
    "                num_heads=num_heads,\n",
    "                hidden_dim=hidden_dim,\n",
    "                dropout=dropout,\n",
    "                pre_norm=pre_norm\n",
    "            )\n",
    "            self.transformer_blocks.append(block)\n",
    "        \n",
    "        # Final layer normalization (for pre-norm architecture)\n",
    "        if pre_norm:\n",
    "            self.final_norm = LayerNorm(embed_dim)\n",
    "        else:\n",
    "            self.final_norm = None\n",
    "        \n",
    "        # Language modeling head (projects to vocabulary)\n",
    "        xavier_bound = math.sqrt(6.0 / (embed_dim + vocab_size))\n",
    "        self.lm_head = Tensor(np.random.uniform(-xavier_bound, xavier_bound, (embed_dim, vocab_size)))\n",
    "        \n",
    "        # Collect all parameters\n",
    "        self.parameters = []\n",
    "        if hasattr(self.token_embedding, 'parameters'):\n",
    "            self.parameters.extend(self.token_embedding.parameters)\n",
    "        \n",
    "        for block in self.transformer_blocks:\n",
    "            if hasattr(block, 'parameters'):\n",
    "                self.parameters.extend(block.parameters)\n",
    "        \n",
    "        if self.final_norm:\n",
    "            self.parameters.extend(self.final_norm.parameters)\n",
    "        \n",
    "        self.parameters.append(self.lm_head)\n",
    "        ### END SOLUTION\n",
    "    \n",
    "    def forward(self, input_ids: Tensor, mask: Optional[Tensor] = None,\n",
    "                return_attention_weights: bool = False) -> Union[Tensor, Tuple[Tensor, List[Tensor]]]:\n",
    "        \"\"\"\n",
    "        Process input through complete transformer model.\n",
    "        \n",
    "        TODO: Implement transformer model forward pass.\n",
    "        \n",
    "        STEP-BY-STEP IMPLEMENTATION:\n",
    "        1. Convert token IDs to embeddings\n",
    "        2. Add positional encoding\n",
    "        3. Process through all transformer blocks\n",
    "        4. Apply final normalization (if pre-norm)\n",
    "        5. Apply language modeling head\n",
    "        6. Return logits (and optionally attention weights)\n",
    "        \n",
    "        Args:\n",
    "            input_ids: Token indices with shape (batch_size, seq_len)\n",
    "            mask: Optional attention mask\n",
    "            return_attention_weights: Whether to return all attention weights\n",
    "            \n",
    "        Returns:\n",
    "            Logits with shape (batch_size, seq_len, vocab_size)\n",
    "            Optionally also list of attention weights from each layer\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        # Token embeddings\n",
    "        embeddings = self.token_embedding.forward(input_ids)\n",
    "        \n",
    "        # Add positional encoding\n",
    "        x = self.pos_encoding.forward(embeddings)\n",
    "        \n",
    "        # Process through transformer blocks\n",
    "        all_attention_weights = []\n",
    "        \n",
    "        for block in self.transformer_blocks:\n",
    "            if return_attention_weights:\n",
    "                x, attn_weights = block.forward(x, mask=mask, return_attention_weights=True)\n",
    "                all_attention_weights.append(attn_weights)\n",
    "            else:\n",
    "                x = block.forward(x, mask=mask)\n",
    "        \n",
    "        # Final layer normalization (for pre-norm)\n",
    "        if self.final_norm:\n",
    "            x = self.final_norm.forward(x)\n",
    "        \n",
    "        # Language modeling head\n",
    "        # x: (batch_size, seq_len, embed_dim)\n",
    "        # lm_head: (embed_dim, vocab_size)\n",
    "        # output: (batch_size, seq_len, vocab_size)\n",
    "        \n",
    "        batch_size, seq_len, embed_dim = x.shape\n",
    "        x_reshaped = x.data.reshape(-1, embed_dim)  # (batch_size * seq_len, embed_dim)\n",
    "        logits_reshaped = np.matmul(x_reshaped, self.lm_head.data)  # (batch_size * seq_len, vocab_size)\n",
    "        logits = logits_reshaped.reshape(batch_size, seq_len, self.vocab_size)\n",
    "        \n",
    "        if return_attention_weights:\n",
    "            return Tensor(logits), all_attention_weights\n",
    "        else:\n",
    "            return Tensor(logits)\n",
    "        ### END SOLUTION\n",
    "    \n",
    "    def __call__(self, input_ids: Tensor, mask: Optional[Tensor] = None,\n",
    "                 return_attention_weights: bool = False) -> Union[Tensor, Tuple[Tensor, List[Tensor]]]:\n",
    "        \"\"\"Make the class callable.\"\"\"\n",
    "        return self.forward(input_ids, mask, return_attention_weights)\n",
    "    \n",
    "    def generate(self, input_ids: Tensor, max_new_tokens: int = 50, \n",
    "                temperature: float = 1.0) -> Tensor:\n",
    "        \"\"\"\n",
    "        Generate text autoregressively.\n",
    "        \n",
    "        This function is PROVIDED to show text generation capability.\n",
    "        \"\"\"\n",
    "        batch_size, current_seq_len = input_ids.shape\n",
    "        \n",
    "        if current_seq_len >= self.max_seq_length:\n",
    "            raise ValueError(f\"Input sequence length {current_seq_len} exceeds max {self.max_seq_length}\")\n",
    "        \n",
    "        generated_ids = input_ids.data.copy()\n",
    "        \n",
    "        for _ in range(max_new_tokens):\n",
    "            # Create causal mask\n",
    "            seq_len = generated_ids.shape[1]\n",
    "            causal_mask = np.triu(np.ones((seq_len, seq_len)), k=1)\n",
    "            causal_mask = 1 - causal_mask\n",
    "            \n",
    "            # Forward pass\n",
    "            logits = self.forward(Tensor(generated_ids), mask=Tensor(causal_mask))\n",
    "            \n",
    "            # Get logits for last position\n",
    "            last_logits = logits.data[:, -1, :]  # (batch_size, vocab_size)\n",
    "            \n",
    "            # Apply temperature\n",
    "            last_logits = last_logits / temperature\n",
    "            \n",
    "            # Sample next token (using simple sampling)\n",
    "            # Convert to probabilities\n",
    "            exp_logits = np.exp(last_logits - np.max(last_logits, axis=-1, keepdims=True))\n",
    "            probs = exp_logits / np.sum(exp_logits, axis=-1, keepdims=True)\n",
    "            \n",
    "            # Sample from distribution\n",
    "            next_tokens = []\n",
    "            for i in range(batch_size):\n",
    "                next_token = np.random.choice(self.vocab_size, p=probs[i])\n",
    "                next_tokens.append(next_token)\n",
    "            \n",
    "            next_tokens = np.array(next_tokens).reshape(batch_size, 1)\n",
    "            \n",
    "            # Append to sequence\n",
    "            generated_ids = np.concatenate([generated_ids, next_tokens], axis=1)\n",
    "            \n",
    "            # Stop if we reach max sequence length\n",
    "            if generated_ids.shape[1] >= self.max_seq_length:\n",
    "                break\n",
    "        \n",
    "        return Tensor(generated_ids)\n",
    "    \n",
    "    def get_memory_usage(self) -> Dict[str, float]:\n",
    "        \"\"\"\n",
    "        Calculate memory usage of complete transformer model.\n",
    "        \n",
    "        This function is PROVIDED to show memory analysis.\n",
    "        \"\"\"\n",
    "        # Token embedding memory\n",
    "        if hasattr(self.token_embedding, 'get_memory_usage'):\n",
    "            embedding_memory = self.token_embedding.get_memory_usage()['total_memory_mb']\n",
    "        else:\n",
    "            embedding_memory = self.vocab_size * self.embed_dim * 4 / (1024 * 1024)\n",
    "        \n",
    "        # Transformer blocks memory\n",
    "        block_memory = 0\n",
    "        if self.transformer_blocks:\n",
    "            single_block_memory = self.transformer_blocks[0].get_memory_usage()['total_memory_mb']\n",
    "            block_memory = single_block_memory * self.num_layers\n",
    "        \n",
    "        # Final norm memory\n",
    "        final_norm_memory = 0\n",
    "        if self.final_norm:\n",
    "            final_norm_memory = self.final_norm.get_memory_usage()['parameter_memory_mb']\n",
    "        \n",
    "        # Language modeling head memory\n",
    "        lm_head_memory = self.lm_head.data.nbytes / (1024 * 1024)\n",
    "        \n",
    "        total_memory = embedding_memory + block_memory + final_norm_memory + lm_head_memory\n",
    "        total_params = sum(p.data.size for p in self.parameters) if hasattr(self, 'parameters') else 0\n",
    "        \n",
    "        return {\n",
    "            'total_memory_mb': total_memory,\n",
    "            'embedding_memory_mb': embedding_memory,\n",
    "            'transformer_blocks_memory_mb': block_memory,\n",
    "            'lm_head_memory_mb': lm_head_memory,\n",
    "            'total_parameters': total_params,\n",
    "            'vocab_size': self.vocab_size,\n",
    "            'embed_dim': self.embed_dim,\n",
    "            'num_layers': self.num_layers,\n",
    "            'num_heads': self.num_heads,\n",
    "            'hidden_dim': self.hidden_dim\n",
    "        }"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cba6bfc5",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "### 🧪 Test Your Complete Transformer Implementation\n",
    "\n",
    "Once you implement the Transformer methods above, run this cell to test it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "751b3b4c",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": true,
     "grade_id": "test-transformer-model-immediate",
     "locked": true,
     "points": 25,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "def test_unit_transformer_model():\n",
    "    \"\"\"Unit test for complete transformer model.\"\"\"\n",
    "    print(\"🔬 Unit Test: Complete Transformer Model...\")\n",
    "    \n",
    "    # Test configuration\n",
    "    vocab_size = 1000\n",
    "    embed_dim = 256\n",
    "    num_heads = 8\n",
    "    num_layers = 4\n",
    "    hidden_dim = 512\n",
    "    max_seq_length = 128\n",
    "    \n",
    "    transformer = Transformer(\n",
    "        vocab_size=vocab_size,\n",
    "        embed_dim=embed_dim,\n",
    "        num_heads=num_heads,\n",
    "        num_layers=num_layers,\n",
    "        hidden_dim=hidden_dim,\n",
    "        max_seq_length=max_seq_length,\n",
    "        pre_norm=True\n",
    "    )\n",
    "    \n",
    "    # Verify initialization\n",
    "    assert transformer.vocab_size == vocab_size, \"Should store vocabulary size\"\n",
    "    assert transformer.embed_dim == embed_dim, \"Should store embedding dimension\"\n",
    "    assert transformer.num_layers == num_layers, \"Should store number of layers\"\n",
    "    assert len(transformer.transformer_blocks) == num_layers, \"Should create correct number of blocks\"\n",
    "    \n",
    "    # Verify components exist\n",
    "    assert hasattr(transformer, 'token_embedding'), \"Should have token embedding\"\n",
    "    assert hasattr(transformer, 'pos_encoding'), \"Should have positional encoding\"\n",
    "    assert hasattr(transformer, 'lm_head'), \"Should have language modeling head\"\n",
    "    \n",
    "    # Test forward pass with token IDs\n",
    "    batch_size = 4\n",
    "    seq_len = 32\n",
    "    input_ids = np.random.randint(0, vocab_size, (batch_size, seq_len))\n",
    "    input_tensor = Tensor(input_ids)\n",
    "    \n",
    "    logits = transformer.forward(input_tensor)\n",
    "    expected_shape = (batch_size, seq_len, vocab_size)\n",
    "    assert logits.shape == expected_shape, f\"Expected shape {expected_shape}, got {logits.shape}\"\n",
    "    \n",
    "    # Test with attention weights return\n",
    "    logits_with_attn, all_attention_weights = transformer.forward(input_tensor, return_attention_weights=True)\n",
    "    \n",
    "    assert logits_with_attn.shape == expected_shape, \"Logits with attention should have correct shape\"\n",
    "    assert len(all_attention_weights) == num_layers, f\"Should return attention weights from {num_layers} layers\"\n",
    "    \n",
    "    for i, attn_weights in enumerate(all_attention_weights):\n",
    "        expected_attn_shape = (batch_size, num_heads, seq_len, seq_len)\n",
    "        assert attn_weights.shape == expected_attn_shape, \\\n",
    "            f\"Layer {i} attention should have shape {expected_attn_shape}, got {attn_weights.shape}\"\n",
    "    \n",
    "    # Test with causal mask\n",
    "    causal_mask = np.triu(np.ones((seq_len, seq_len)), k=1)\n",
    "    causal_mask = 1 - causal_mask  # Convert to attention mask\n",
    "    \n",
    "    masked_logits, masked_attention = transformer.forward(\n",
    "        input_tensor, mask=Tensor(causal_mask), return_attention_weights=True\n",
    "    )\n",
    "    \n",
    "    assert masked_logits.shape == expected_shape, \"Masked logits should have correct shape\"\n",
    "    \n",
    "    # Verify causal masking propagates through all layers\n",
    "    for layer_idx, attn_weights in enumerate(masked_attention):\n",
    "        for head in range(num_heads):\n",
    "            for i in range(seq_len):\n",
    "                for j in range(i+1, seq_len):\n",
    "                    assert np.all(attn_weights.data[:, head, i, j] < 1e-5), \\\n",
    "                        f\"Layer {layer_idx}, head {head}: position ({i},{j}) should be masked\"\n",
    "    \n",
    "    # Test callable interface\n",
    "    logits_callable = transformer(input_tensor)\n",
    "    assert np.allclose(logits_callable.data, logits.data), \"Callable interface should work\"\n",
    "    \n",
    "    # Test text generation capability\n",
    "    print(\"  Testing text generation...\")\n",
    "    start_tokens = Tensor(np.random.randint(0, vocab_size, (2, 8)))  # 2 sequences, 8 tokens each\n",
    "    generated = transformer.generate(start_tokens, max_new_tokens=10, temperature=1.0)\n",
    "    \n",
    "    expected_gen_shape = (2, 18)  # 8 original + 10 new tokens\n",
    "    assert generated.shape == expected_gen_shape, f\"Generated shape should be {expected_gen_shape}, got {generated.shape}\"\n",
    "    \n",
    "    # Verify original tokens are preserved\n",
    "    assert np.array_equal(generated.data[:, :8], start_tokens.data), \"Original tokens should be preserved\"\n",
    "    \n",
    "    # Test different model configurations\n",
    "    small_transformer = Transformer(\n",
    "        vocab_size=500, embed_dim=128, num_heads=4, num_layers=2, hidden_dim=256\n",
    "    )\n",
    "    \n",
    "    small_input = Tensor(np.random.randint(0, 500, (2, 16)))\n",
    "    small_logits = small_transformer.forward(small_input)\n",
    "    expected_small_shape = (2, 16, 500)\n",
    "    assert small_logits.shape == expected_small_shape, \"Small transformer should work\"\n",
    "    \n",
    "    # Test pre-norm vs post-norm\n",
    "    post_norm_transformer = Transformer(\n",
    "        vocab_size=vocab_size, embed_dim=embed_dim, num_heads=num_heads,\n",
    "        num_layers=2, hidden_dim=hidden_dim, pre_norm=False\n",
    "    )\n",
    "    \n",
    "    post_norm_logits = post_norm_transformer.forward(input_tensor)\n",
    "    pre_norm_logits = Transformer(\n",
    "        vocab_size=vocab_size, embed_dim=embed_dim, num_heads=num_heads,\n",
    "        num_layers=2, hidden_dim=hidden_dim, pre_norm=True\n",
    "    ).forward(input_tensor)\n",
    "    \n",
    "    assert not np.allclose(post_norm_logits.data, pre_norm_logits.data), \\\n",
    "        \"Pre-norm and post-norm should produce different outputs\"\n",
    "    \n",
    "    # Test memory usage calculation\n",
    "    memory_stats = transformer.get_memory_usage()\n",
    "    assert 'total_memory_mb' in memory_stats, \"Should provide memory statistics\"\n",
    "    assert memory_stats['total_memory_mb'] > 0, \"Should have positive memory usage\"\n",
    "    assert memory_stats['total_parameters'] > 0, \"Should count parameters\"\n",
    "    \n",
    "    # Verify memory breakdown\n",
    "    assert memory_stats['embedding_memory_mb'] > 0, \"Should have embedding memory\"\n",
    "    assert memory_stats['transformer_blocks_memory_mb'] > 0, \"Should have transformer block memory\"\n",
    "    assert memory_stats['lm_head_memory_mb'] > 0, \"Should have language modeling head memory\"\n",
    "    \n",
    "    print(\"✅ Complete transformer model tests passed!\")\n",
    "    print(f\"✅ Forward pass produces correct logit shapes\")\n",
    "    print(f\"✅ Causal masking works across all {num_layers} layers\")\n",
    "    print(f\"✅ Text generation capability verified\")\n",
    "    print(f\"✅ Total parameters: {memory_stats['total_parameters']:,}\")\n",
    "    print(f\"✅ Total memory: {memory_stats['total_memory_mb']:.2f}MB\")\n",
    "    print(f\"✅ Pre-norm and post-norm architectures work correctly\")\n",
    "\n",
    "# Test function defined (called in main block)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fda9a7bd",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## 🎯 ML Systems: Performance Analysis & Transformer Scaling\n",
    "\n",
    "Now let's develop systems engineering skills by analyzing transformer performance and understanding how model depth and width affect memory usage and computational requirements.\n",
    "\n",
    "### **Learning Outcome**: *\"I understand how transformer architecture choices affect scalability, memory usage, and production deployment constraints\"*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ff32bb95",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "transformer-profiler",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "#| export\n",
    "import time\n",
    "\n",
    "class TransformerProfiler:\n",
    "    \"\"\"\n",
    "    Performance profiling toolkit for transformer architectures.\n",
    "    \n",
    "    Helps ML engineers understand computational costs, memory scaling,\n",
    "    and architectural trade-offs in transformer-based models.\n",
    "    \"\"\"\n",
    "    \n",
    "    def __init__(self):\n",
    "        self.results = {}\n",
    "    \n",
    "    def measure_scaling_with_depth(self, base_config: Dict, layer_counts: List[int]) -> Dict:\n",
    "        \"\"\"\n",
    "        Measure how transformer performance scales with number of layers.\n",
    "        \n",
    "        TODO: Implement transformer depth scaling measurement.\n",
    "        \n",
    "        STEP-BY-STEP IMPLEMENTATION:\n",
    "        1. Create transformers with different layer counts\n",
    "        2. Measure memory usage and computation time for each\n",
    "        3. Calculate scaling patterns (should be linear with depth)\n",
    "        4. Analyze parameter growth and memory requirements\n",
    "        5. Return comprehensive scaling analysis\n",
    "        \n",
    "        EXPECTED SCALING:\n",
    "        - Parameters: Linear with depth\n",
    "        - Memory: Linear with depth  \n",
    "        - Computation: Linear with depth\n",
    "        - Quality: Generally improves with depth (to a point)\n",
    "        \n",
    "        Args:\n",
    "            base_config: Base transformer configuration\n",
    "            layer_counts: List of layer counts to test\n",
    "            \n",
    "        Returns:\n",
    "            Dictionary with scaling analysis results\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        scaling_results = {}\n",
    "        \n",
    "        # Test input\n",
    "        batch_size = 4\n",
    "        seq_len = 32\n",
    "        vocab_size = base_config['vocab_size']\n",
    "        test_input = Tensor(np.random.randint(0, vocab_size, (batch_size, seq_len)))\n",
    "        \n",
    "        for num_layers in layer_counts:\n",
    "            # Create transformer with this depth\n",
    "            transformer = Transformer(\n",
    "                vocab_size=base_config['vocab_size'],\n",
    "                embed_dim=base_config['embed_dim'],\n",
    "                num_heads=base_config['num_heads'],\n",
    "                num_layers=num_layers,\n",
    "                hidden_dim=base_config['hidden_dim'],\n",
    "                max_seq_length=base_config.get('max_seq_length', 128)\n",
    "            )\n",
    "            \n",
    "            # Measure memory usage\n",
    "            memory_stats = transformer.get_memory_usage()\n",
    "            \n",
    "            # Measure computation time\n",
    "            start_time = time.time()\n",
    "            logits = transformer.forward(test_input)\n",
    "            end_time = time.time()\n",
    "            \n",
    "            computation_time_ms = (end_time - start_time) * 1000\n",
    "            \n",
    "            # Calculate throughput\n",
    "            total_tokens = batch_size * seq_len\n",
    "            tokens_per_second = total_tokens / (end_time - start_time) if end_time > start_time else 0\n",
    "            \n",
    "            scaling_results[num_layers] = {\n",
    "                'num_layers': num_layers,\n",
    "                'total_parameters': memory_stats['total_parameters'],\n",
    "                'total_memory_mb': memory_stats['total_memory_mb'],\n",
    "                'computation_time_ms': computation_time_ms,\n",
    "                'tokens_per_second': tokens_per_second,\n",
    "                'memory_per_layer_mb': memory_stats['transformer_blocks_memory_mb'] / num_layers if num_layers > 0 else 0,\n",
    "                'parameters_per_layer': (memory_stats['total_parameters'] - \n",
    "                                       base_config['vocab_size'] * base_config['embed_dim'] * 2) // num_layers if num_layers > 0 else 0\n",
    "            }\n",
    "        \n",
    "        return scaling_results\n",
    "        ### END SOLUTION\n",
    "    \n",
    "    def analyze_width_vs_depth_tradeoffs(self, base_params: int, configurations: List[Dict]) -> Dict:\n",
    "        \"\"\"\n",
    "        Compare different ways to allocate a fixed parameter budget.\n",
    "        \n",
    "        This function is PROVIDED to show parameter allocation analysis.\n",
    "        \"\"\"\n",
    "        print(f\"📊 WIDTH vs DEPTH TRADE-OFF ANALYSIS\")\n",
    "        print(f\"Target parameter budget: ~{base_params:,} parameters\")\n",
    "        print(\"=\" * 70)\n",
    "        \n",
    "        results = {}\n",
    "        \n",
    "        # Test input\n",
    "        batch_size = 4\n",
    "        seq_len = 32\n",
    "        test_input = Tensor(np.random.randint(0, 1000, (batch_size, seq_len)))\n",
    "        \n",
    "        print(f\"{'Config':<15} {'Layers':<7} {'Embed':<6} {'Heads':<6} {'Hidden':<7} {'Params':<12} {'Time (ms)':<10} {'Memory'}\")\n",
    "        print(\"-\" * 80)\n",
    "        \n",
    "        for i, config in enumerate(configurations):\n",
    "            try:\n",
    "                # Create transformer\n",
    "                transformer = Transformer(\n",
    "                    vocab_size=1000,  # Fixed vocab size\n",
    "                    embed_dim=config['embed_dim'],\n",
    "                    num_heads=config['num_heads'],\n",
    "                    num_layers=config['num_layers'],\n",
    "                    hidden_dim=config['hidden_dim'],\n",
    "                    max_seq_length=128\n",
    "                )\n",
    "                \n",
    "                # Get actual parameter count\n",
    "                memory_stats = transformer.get_memory_usage()\n",
    "                actual_params = memory_stats['total_parameters']\n",
    "                \n",
    "                # Measure performance\n",
    "                start_time = time.time()\n",
    "                logits = transformer.forward(test_input)\n",
    "                computation_time = (time.time() - start_time) * 1000\n",
    "                \n",
    "                config_name = f\"Config_{i+1}\"\n",
    "                results[config_name] = {\n",
    "                    'config': config,\n",
    "                    'actual_parameters': actual_params,\n",
    "                    'computation_time_ms': computation_time,\n",
    "                    'memory_mb': memory_stats['total_memory_mb'],\n",
    "                    'parameter_efficiency': abs(actual_params - base_params) / base_params\n",
    "                }\n",
    "                \n",
    "                print(f\"{config_name:<15} {config['num_layers']:<7} {config['embed_dim']:<6} \"\n",
    "                      f\"{config['num_heads']:<6} {config['hidden_dim']:<7} {actual_params:<12,} \"\n",
    "                      f\"{computation_time:<10.2f} {memory_stats['total_memory_mb']:.1f}MB\")\n",
    "                \n",
    "            except Exception as e:\n",
    "                print(f\"{config_name:<15} ERROR: {str(e)[:50]}\")\n",
    "        \n",
    "        # Analysis\n",
    "        print(f\"\\n💡 TRADE-OFF INSIGHTS:\")\n",
    "        print(f\"   - Deeper models: Better at learning complex patterns, more sequential\")\n",
    "        print(f\"   - Wider models: More parallelizable, can capture diverse features\")\n",
    "        print(f\"   - More heads: Richer attention patterns, more computation\")\n",
    "        print(f\"   - Hidden dimension: Affects FFN capacity, major parameter contributor\")\n",
    "        \n",
    "        return results\n",
    "    \n",
    "    def simulate_production_scaling(self, model_sizes: List[str]) -> Dict:\n",
    "        \"\"\"\n",
    "        Simulate memory and computation requirements for production model sizes.\n",
    "        \n",
    "        This function is PROVIDED to show production scaling analysis.\n",
    "        \"\"\"\n",
    "        print(f\"\\n🏭 PRODUCTION MODEL SCALING SIMULATION\")\n",
    "        print(\"=\" * 60)\n",
    "        \n",
    "        # Production model configurations (simplified)\n",
    "        size_configs = {\n",
    "            'Small': {'vocab_size': 50000, 'embed_dim': 512, 'num_heads': 8, 'num_layers': 6, 'hidden_dim': 2048},\n",
    "            'Medium': {'vocab_size': 50000, 'embed_dim': 768, 'num_heads': 12, 'num_layers': 12, 'hidden_dim': 3072},\n",
    "            'Large': {'vocab_size': 50000, 'embed_dim': 1024, 'num_heads': 16, 'num_layers': 24, 'hidden_dim': 4096},\n",
    "            'XL': {'vocab_size': 50000, 'embed_dim': 1280, 'num_heads': 20, 'num_layers': 36, 'hidden_dim': 5120}\n",
    "        }\n",
    "        \n",
    "        results = {}\n",
    "        \n",
    "        print(f\"{'Model Size':<12} {'Parameters':<12} {'Memory (GB)':<12} {'Training GPU':<12} {'Inference'}\")\n",
    "        print(\"-\" * 70)\n",
    "        \n",
    "        for size in model_sizes:\n",
    "            if size not in size_configs:\n",
    "                continue\n",
    "                \n",
    "            config = size_configs[size]\n",
    "            \n",
    "            # Estimate parameters\n",
    "            # Embedding: vocab_size * embed_dim * 2 (input + output)\n",
    "            embedding_params = config['vocab_size'] * config['embed_dim'] * 2\n",
    "            \n",
    "            # Per layer: \n",
    "            # - Attention: 4 * embed_dim^2 (Q, K, V, O projections)\n",
    "            # - FFN: 2 * embed_dim * hidden_dim + embed_dim + hidden_dim (weights + biases)\n",
    "            # - LayerNorm: 2 * embed_dim * 2 (two norms per layer)\n",
    "            attention_params_per_layer = 4 * config['embed_dim'] ** 2\n",
    "            ffn_params_per_layer = 2 * config['embed_dim'] * config['hidden_dim'] + config['embed_dim'] + config['hidden_dim']\n",
    "            norm_params_per_layer = 4 * config['embed_dim']\n",
    "            \n",
    "            layer_params = attention_params_per_layer + ffn_params_per_layer + norm_params_per_layer\n",
    "            total_params = embedding_params + layer_params * config['num_layers']\n",
    "            \n",
    "            # Estimate memory (parameters + activations + gradients for training)\n",
    "            param_memory_gb = total_params * 4 / (1024**3)  # 4 bytes per float32\n",
    "            \n",
    "            # Training memory: parameters + gradients + optimizer states + activations\n",
    "            training_memory_gb = param_memory_gb * 4  # Rough estimate (param + grad + 2x optimizer states)\n",
    "            \n",
    "            # Inference memory: just parameters + activations\n",
    "            inference_memory_gb = param_memory_gb * 1.5  # Parameters + activation memory\n",
    "            \n",
    "            # GPU requirements (very rough estimates)\n",
    "            if training_memory_gb < 24:\n",
    "                training_gpu = \"Single RTX 4090\"\n",
    "            elif training_memory_gb < 80:\n",
    "                training_gpu = \"Single A100\"\n",
    "            else:\n",
    "                training_gpu = \"Multi-GPU\"\n",
    "            \n",
    "            if inference_memory_gb < 12:\n",
    "                inference_req = \"RTX 4060 Ti\"\n",
    "            elif inference_memory_gb < 24:\n",
    "                inference_req = \"RTX 4090\"\n",
    "            else:\n",
    "                inference_req = \"A100+\"\n",
    "            \n",
    "            results[size] = {\n",
    "                'config': config,\n",
    "                'total_parameters': total_params,\n",
    "                'training_memory_gb': training_memory_gb,\n",
    "                'inference_memory_gb': inference_memory_gb,\n",
    "                'training_gpu_req': training_gpu,\n",
    "                'inference_gpu_req': inference_req\n",
    "            }\n",
    "            \n",
    "            print(f\"{size:<12} {total_params/1e6:.1f}M {training_memory_gb:.1f} {training_gpu:<12} {inference_req}\")\n",
    "        \n",
    "        print(f\"\\n📈 SCALING OBSERVATIONS:\")\n",
    "        print(f\"   - Model size grows super-linearly with dimension increases\")\n",
    "        print(f\"   - Memory requirements dominate deployment decisions\")\n",
    "        print(f\"   - Training requires 3-4x more memory than inference\")\n",
    "        print(f\"   - Multi-GPU becomes necessary for large models\")\n",
    "        \n",
    "        return results\n",
    "\n",
    "def analyze_transformer_system_design():\n",
    "    \"\"\"\n",
    "    Comprehensive analysis of transformer system design choices and trade-offs.\n",
    "    \n",
    "    This function is PROVIDED to show systems-level design thinking.\n",
    "    \"\"\"\n",
    "    print(\"🏗️ TRANSFORMER SYSTEM DESIGN ANALYSIS\")\n",
    "    print(\"=\" * 60)\n",
    "    \n",
    "    # Architecture decision analysis\n",
    "    design_choices = {\n",
    "        'Layer Normalization': {\n",
    "            'Pre-norm': {'stability': 'High', 'training': 'Easier', 'performance': 'Good'},\n",
    "            'Post-norm': {'stability': 'Lower', 'training': 'Harder', 'performance': 'Potentially better'}\n",
    "        },\n",
    "        'Attention Patterns': {\n",
    "            'Full attention': {'complexity': 'O(N²)', 'quality': 'Best', 'scalability': 'Limited'},\n",
    "            'Sparse attention': {'complexity': 'O(N√N)', 'quality': 'Good', 'scalability': 'Better'},\n",
    "            'Linear attention': {'complexity': 'O(N)', 'quality': 'Reduced', 'scalability': 'Excellent'}\n",
    "        },\n",
    "        'Feed-Forward Size': {\n",
    "            '2x embed_dim': {'parameters': 'Low', 'capacity': 'Limited', 'speed': 'Fast'},\n",
    "            '4x embed_dim': {'parameters': 'Standard', 'capacity': 'Good', 'speed': 'Medium'},\n",
    "            '8x embed_dim': {'parameters': 'High', 'capacity': 'High', 'speed': 'Slow'}\n",
    "        }\n",
    "    }\n",
    "    \n",
    "    print(\"🎯 ARCHITECTURAL DESIGN CHOICES:\")\n",
    "    for category, choices in design_choices.items():\n",
    "        print(f\"\\n{category}:\")\n",
    "        for choice, properties in choices.items():\n",
    "            prop_str = \", \".join([f\"{k}: {v}\" for k, v in properties.items()])\n",
    "            print(f\"   - {choice}: {prop_str}\")\n",
    "    \n",
    "    # Memory scaling analysis\n",
    "    print(f\"\\n📊 MEMORY SCALING PATTERNS:\")\n",
    "    print(f\"Component breakdown for typical transformer:\")\n",
    "    print(f\"   - Token embeddings: vocab_size × embed_dim parameters\")\n",
    "    print(f\"   - Position encodings: 0 parameters (sinusoidal) or seq_len × embed_dim (learned)\")\n",
    "    print(f\"   - Attention layers: 4 × embed_dim² parameters per layer\")\n",
    "    print(f\"   - Feed-forward: 2 × embed_dim × hidden_dim parameters per layer\")\n",
    "    print(f\"   - Layer normalization: 2 × embed_dim parameters per layer\")\n",
    "    print(f\"   - Output projection: embed_dim × vocab_size parameters\")\n",
    "    \n",
    "    print(f\"\\n🔧 OPTIMIZATION STRATEGIES:\")\n",
    "    optimization_techniques = [\n",
    "        \"Gradient checkpointing: Trade computation for memory\",\n",
    "        \"Mixed precision training: Use FP16 for 2x memory reduction\",\n",
    "        \"Parameter sharing: Share weights across layers\",\n",
    "        \"Sparse attention: Reduce quadratic scaling\",\n",
    "        \"Model parallelism: Distribute layers across GPUs\",\n",
    "        \"Pipeline parallelism: Process different batch elements on different GPUs\",\n",
    "        \"Activation checkpointing: Recompute activations instead of storing\"\n",
    "    ]\n",
    "    \n",
    "    for technique in optimization_techniques:\n",
    "        print(f\"   - {technique}\")\n",
    "    \n",
    "    print(f\"\\n🎯 PRODUCTION DEPLOYMENT CONSIDERATIONS:\")\n",
    "    deployment_factors = [\n",
    "        \"Batch size: Larger batches improve GPU utilization but increase memory\",\n",
    "        \"Sequence length: Quadratic impact on attention memory\",\n",
    "        \"Model depth: Linear impact on memory and computation\",\n",
    "        \"Model width: Quadratic impact on attention parameters\",\n",
    "        \"Precision: FP32 vs FP16 vs INT8 trade-offs\",\n",
    "        \"Hardware: GPU memory and compute capabilities\",\n",
    "        \"Latency requirements: Real-time vs batch processing\",\n",
    "        \"Throughput requirements: Tokens per second targets\"\n",
    "    ]\n",
    "    \n",
    "    for factor in deployment_factors:\n",
    "        print(f\"   - {factor}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0050718c",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "### 🧪 Test: Transformer Performance Analysis\n",
    "\n",
    "Let's test our transformer profiler with realistic scaling scenarios."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "45818c11",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "test-transformer-profiler",
     "locked": false,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "def test_transformer_profiler():\n",
    "    \"\"\"Test transformer profiler with various scenarios.\"\"\"\n",
    "    print(\"🔬 Unit Test: Transformer Performance Profiler...\")\n",
    "    \n",
    "    profiler = TransformerProfiler()\n",
    "    \n",
    "    # Test depth scaling measurement\n",
    "    base_config = {\n",
    "        'vocab_size': 500,\n",
    "        'embed_dim': 128,\n",
    "        'num_heads': 4,\n",
    "        'hidden_dim': 256\n",
    "    }\n",
    "    \n",
    "    layer_counts = [1, 2, 4]\n",
    "    depth_results = profiler.measure_scaling_with_depth(base_config, layer_counts)\n",
    "    \n",
    "    # Verify depth scaling results\n",
    "    assert len(depth_results) == len(layer_counts), f\"Should test {len(layer_counts)} layer counts\"\n",
    "    \n",
    "    for num_layers in layer_counts:\n",
    "        assert num_layers in depth_results, f\"Should include results for {num_layers} layers\"\n",
    "        result = depth_results[num_layers]\n",
    "        \n",
    "        # Verify required metrics\n",
    "        required_keys = ['num_layers', 'total_parameters', 'total_memory_mb', \n",
    "                        'computation_time_ms', 'tokens_per_second']\n",
    "        for key in required_keys:\n",
    "            assert key in result, f\"Missing metric: {key} for {num_layers} layers\"\n",
    "            assert isinstance(result[key], (int, float)), f\"Invalid type for {key}\"\n",
    "        \n",
    "        # Verify reasonable values\n",
    "        assert result['num_layers'] == num_layers, \"Should store correct layer count\"\n",
    "        assert result['total_parameters'] > 0, \"Should have positive parameter count\"\n",
    "        assert result['total_memory_mb'] > 0, \"Should have positive memory usage\"\n",
    "    \n",
    "    # Test that parameters and memory scale roughly linearly with depth\n",
    "    if len(layer_counts) >= 2:\n",
    "        shallow = depth_results[layer_counts[0]]\n",
    "        deep = depth_results[layer_counts[-1]]\n",
    "        \n",
    "        layer_ratio = deep['num_layers'] / shallow['num_layers']\n",
    "        param_ratio = deep['total_parameters'] / shallow['total_parameters']\n",
    "        memory_ratio = deep['total_memory_mb'] / shallow['total_memory_mb']\n",
    "        \n",
    "        # Allow some deviation due to fixed costs (embeddings, etc.)\n",
    "        assert 1.0 < param_ratio < layer_ratio * 2, f\"Parameters should scale sub-linearly, got {param_ratio:.2f}\"\n",
    "        assert 1.0 < memory_ratio < layer_ratio * 2, f\"Memory should scale sub-linearly, got {memory_ratio:.2f}\"\n",
    "    \n",
    "    print(\"✅ Depth scaling measurement test passed\")\n",
    "    \n",
    "    # Test width vs depth analysis\n",
    "    configurations = [\n",
    "        {'embed_dim': 128, 'num_heads': 4, 'num_layers': 4, 'hidden_dim': 256},\n",
    "        {'embed_dim': 256, 'num_heads': 8, 'num_layers': 2, 'hidden_dim': 512},\n",
    "    ]\n",
    "    \n",
    "    width_depth_results = profiler.analyze_width_vs_depth_tradeoffs(100000, configurations)\n",
    "    \n",
    "    # Verify width vs depth results\n",
    "    assert len(width_depth_results) > 0, \"Should analyze at least one configuration\"\n",
    "    \n",
    "    for config_name, result in width_depth_results.items():\n",
    "        assert 'config' in result, \"Should include configuration\"\n",
    "        assert 'actual_parameters' in result, \"Should count actual parameters\"\n",
    "        assert 'computation_time_ms' in result, \"Should measure computation time\"\n",
    "        assert result['actual_parameters'] > 0, \"Should have positive parameter count\"\n",
    "    \n",
    "    print(\"✅ Width vs depth analysis test passed\")\n",
    "    \n",
    "    # Test production scaling simulation\n",
    "    production_results = profiler.simulate_production_scaling(['Small', 'Medium'])\n",
    "    \n",
    "    # Verify production scaling results\n",
    "    for size, result in production_results.items():\n",
    "        assert 'config' in result, \"Should include model configuration\"\n",
    "        assert 'total_parameters' in result, \"Should estimate total parameters\"\n",
    "        assert 'training_memory_gb' in result, \"Should estimate training memory\"\n",
    "        assert 'inference_memory_gb' in result, \"Should estimate inference memory\"\n",
    "        \n",
    "        # Verify reasonable scaling\n",
    "        assert result['total_parameters'] > 1e6, \"Should have millions of parameters\"\n",
    "        assert result['training_memory_gb'] > result['inference_memory_gb'], \"Training should require more memory\"\n",
    "    \n",
    "    print(\"✅ Production scaling simulation test passed\")\n",
    "    print(\"🎯 Transformer Profiler: All tests passed!\")\n",
    "\n",
    "# Test function defined (called in main block)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6abd8ab2",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## Integration Testing: Complete Language Model Pipeline\n",
    "\n",
    "Let's test the complete pipeline from tokenization through transformer processing:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dbf45be4",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "test-transformer-integration",
     "locked": false,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "def test_complete_language_model_pipeline():\n",
    "    \"\"\"Test complete language model pipeline integration.\"\"\"\n",
    "    print(\"🧪 Integration Test: Complete Language Model Pipeline...\")\n",
    "    \n",
    "    # Create a small but complete language model\n",
    "    vocab_size = 1000\n",
    "    embed_dim = 256\n",
    "    num_heads = 8\n",
    "    num_layers = 4\n",
    "    hidden_dim = 512\n",
    "    max_seq_length = 64\n",
    "    \n",
    "    print(f\"  Creating transformer with {num_layers} layers, {embed_dim} dimensions...\")\n",
    "    transformer = Transformer(\n",
    "        vocab_size=vocab_size,\n",
    "        embed_dim=embed_dim,\n",
    "        num_heads=num_heads,\n",
    "        num_layers=num_layers,\n",
    "        hidden_dim=hidden_dim,\n",
    "        max_seq_length=max_seq_length\n",
    "    )\n",
    "    \n",
    "    # Test 1: Basic text processing pipeline\n",
    "    print(\"  Testing basic text processing pipeline...\")\n",
    "    batch_size = 4\n",
    "    seq_len = 32\n",
    "    \n",
    "    # Simulate tokenized input\n",
    "    input_ids = np.random.randint(0, vocab_size, (batch_size, seq_len))\n",
    "    input_tensor = Tensor(input_ids)\n",
    "    \n",
    "    # Forward pass\n",
    "    logits = transformer.forward(input_tensor)\n",
    "    expected_shape = (batch_size, seq_len, vocab_size)\n",
    "    assert logits.shape == expected_shape, f\"Expected {expected_shape}, got {logits.shape}\"\n",
    "    \n",
    "    # Test that logits are reasonable (not all zeros/inf/nan)\n",
    "    assert not np.all(logits.data == 0), \"Logits should not all be zero\"\n",
    "    assert not np.any(np.isinf(logits.data)), \"Logits should not contain inf\"\n",
    "    assert not np.any(np.isnan(logits.data)), \"Logits should not contain nan\"\n",
    "    \n",
    "    print(f\"    Forward pass successful: {logits.shape}\")\n",
    "    \n",
    "    # Test 2: Language modeling with causal mask\n",
    "    print(\"  Testing language modeling with causal attention...\")\n",
    "    causal_mask = np.triu(np.ones((seq_len, seq_len)), k=1)\n",
    "    causal_mask = 1 - causal_mask  # Convert to attention mask\n",
    "    \n",
    "    masked_logits, all_attention = transformer.forward(\n",
    "        input_tensor, mask=Tensor(causal_mask), return_attention_weights=True\n",
    "    )\n",
    "    \n",
    "    assert len(all_attention) == num_layers, f\"Should return attention from {num_layers} layers\"\n",
    "    \n",
    "    # Verify causal masking works across all layers\n",
    "    for layer_idx, attn_weights in enumerate(all_attention):\n",
    "        # Check a few positions to ensure masking works\n",
    "        for i in range(min(5, seq_len)):\n",
    "            for j in range(i+1, min(i+5, seq_len)):\n",
    "                future_attention = attn_weights.data[:, :, i, j]  # All heads, all batches\n",
    "                assert np.all(future_attention < 1e-5), \\\n",
    "                    f\"Layer {layer_idx}: future attention at ({i},{j}) should be ~0\"\n",
    "    \n",
    "    print(f\"    Causal masking verified across all layers\")\n",
    "    \n",
    "    # Test 3: Text generation\n",
    "    print(\"  Testing autoregressive text generation...\")\n",
    "    # Start with a shorter sequence for generation\n",
    "    gen_start = Tensor(np.random.randint(0, vocab_size, (2, 8)))\n",
    "    generated = transformer.generate(gen_start, max_new_tokens=8, temperature=1.0)\n",
    "    \n",
    "    expected_gen_shape = (2, 16)  # 8 start + 8 generated\n",
    "    assert generated.shape == expected_gen_shape, f\"Expected {expected_gen_shape}, got {generated.shape}\"\n",
    "    \n",
    "    # Verify original tokens preserved\n",
    "    assert np.array_equal(generated.data[:, :8], gen_start.data), \"Should preserve original tokens\"\n",
    "    \n",
    "    # Verify new tokens are valid\n",
    "    new_tokens = generated.data[:, 8:]\n",
    "    assert np.all(new_tokens >= 0), \"Generated tokens should be >= 0\"\n",
    "    assert np.all(new_tokens < vocab_size), f\"Generated tokens should be < {vocab_size}\"\n",
    "    \n",
    "    print(f\"    Generated {new_tokens.shape[1]} new tokens successfully\")\n",
    "    \n",
    "    # Test 4: Different sequence lengths\n",
    "    print(\"  Testing variable sequence lengths...\")\n",
    "    for test_seq_len in [16, 32, 48]:\n",
    "        if test_seq_len > max_seq_length:\n",
    "            continue\n",
    "            \n",
    "        test_input = Tensor(np.random.randint(0, vocab_size, (2, test_seq_len)))\n",
    "        test_logits = transformer.forward(test_input)\n",
    "        \n",
    "        expected_test_shape = (2, test_seq_len, vocab_size)\n",
    "        assert test_logits.shape == expected_test_shape, f\"Failed for seq_len {test_seq_len}\"\n",
    "    \n",
    "    print(f\"    Variable sequence lengths work correctly\")\n",
    "    \n",
    "    # Test 5: Memory usage analysis\n",
    "    print(\"  Analyzing memory usage...\")\n",
    "    memory_stats = transformer.get_memory_usage()\n",
    "    \n",
    "    print(f\"    Model parameters: {memory_stats['total_parameters']:,}\")\n",
    "    print(f\"    Model memory: {memory_stats['total_memory_mb']:.1f}MB\")\n",
    "    print(f\"    Embedding memory: {memory_stats['embedding_memory_mb']:.1f}MB\")\n",
    "    print(f\"    Transformer blocks: {memory_stats['transformer_blocks_memory_mb']:.1f}MB\")\n",
    "    print(f\"    LM head: {memory_stats['lm_head_memory_mb']:.1f}MB\")\n",
    "    \n",
    "    # Verify memory breakdown makes sense\n",
    "    component_memory = (memory_stats['embedding_memory_mb'] + \n",
    "                       memory_stats['transformer_blocks_memory_mb'] + \n",
    "                       memory_stats['lm_head_memory_mb'])\n",
    "    \n",
    "    # Allow small difference due to final norm layer\n",
    "    memory_diff = abs(memory_stats['total_memory_mb'] - component_memory)\n",
    "    assert memory_diff < 1.0, f\"Memory breakdown doesn't add up: {memory_diff:.2f}MB difference\"\n",
    "    \n",
    "    # Test 6: Performance characteristics\n",
    "    print(\"  Testing performance characteristics...\")\n",
    "    \n",
    "    # Time multiple forward passes\n",
    "    num_iterations = 5\n",
    "    start_time = time.time()\n",
    "    \n",
    "    for _ in range(num_iterations):\n",
    "        _ = transformer.forward(input_tensor)\n",
    "    \n",
    "    total_time = time.time() - start_time\n",
    "    avg_time_per_forward = total_time / num_iterations\n",
    "    tokens_per_second = (batch_size * seq_len) / avg_time_per_forward\n",
    "    \n",
    "    print(f\"    Average forward pass: {avg_time_per_forward*1000:.2f}ms\")\n",
    "    print(f\"    Processing speed: {tokens_per_second:.0f} tokens/second\")\n",
    "    \n",
    "    # Verify reasonable performance\n",
    "    assert avg_time_per_forward < 1.0, \"Forward pass should be < 1 second\"\n",
    "    assert tokens_per_second > 50, \"Should process > 50 tokens/second\"\n",
    "    \n",
    "    # Test 7: Gradient flow (simulated)\n",
    "    print(\"  Testing gradient flow through layers...\")\n",
    "    \n",
    "    # Create slightly different inputs to test sensitivity\n",
    "    input_1 = Tensor(input_ids.copy())\n",
    "    input_2 = Tensor(input_ids.copy())\n",
    "    input_2.data[0, 0] = (input_2.data[0, 0] + 1) % vocab_size  # Change one token\n",
    "    \n",
    "    logits_1 = transformer.forward(input_1)\n",
    "    logits_2 = transformer.forward(input_2)\n",
    "    \n",
    "    # Outputs should be different (model is sensitive to input changes)\n",
    "    output_diff = np.mean(np.abs(logits_1.data - logits_2.data))\n",
    "    assert output_diff > 1e-6, f\"Model should be sensitive to input changes, diff: {output_diff}\"\n",
    "    \n",
    "    # But not too different (model should be stable)\n",
    "    assert output_diff < 100, f\"Model should be stable, large diff: {output_diff}\"\n",
    "    \n",
    "    print(f\"    Model shows appropriate sensitivity to input changes\")\n",
    "    \n",
    "    print(\"✅ Complete language model pipeline integration test passed!\")\n",
    "    print(f\"✅ Forward pass, masking, generation, and performance verified\")\n",
    "    print(f\"✅ Model processes {tokens_per_second:.0f} tokens/second\")\n",
    "    print(f\"✅ Memory footprint: {memory_stats['total_memory_mb']:.1f}MB\")\n",
    "\n",
    "# Test function defined (called in main block)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bd6e7970",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## Main Execution Block\n",
    "\n",
    "All transformer tests and demonstrations are run from here when the module is executed directly:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c6f54ff9",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "transformers-main",
     "locked": false,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "if __name__ == \"__main__\":\n",
    "    # Run all unit tests\n",
    "    test_unit_layer_norm()\n",
    "    test_unit_feed_forward()\n",
    "    test_unit_transformer_block()\n",
    "    test_unit_transformer_model()\n",
    "    test_transformer_profiler()\n",
    "    test_complete_language_model_pipeline()\n",
    "    \n",
    "    print(\"\\n\" + \"=\"*60)\n",
    "    print(\"🔍 TRANSFORMER SYSTEMS ANALYSIS\")\n",
    "    print(\"=\"*60)\n",
    "    \n",
    "    # Performance analysis\n",
    "    profiler = TransformerProfiler()\n",
    "    \n",
    "    # Test transformer scaling with different depths\n",
    "    print(\"📈 TRANSFORMER DEPTH SCALING ANALYSIS:\")\n",
    "    base_config = {\n",
    "        'vocab_size': 1000,\n",
    "        'embed_dim': 256,\n",
    "        'num_heads': 8,\n",
    "        'hidden_dim': 1024\n",
    "    }\n",
    "    \n",
    "    layer_counts = [2, 4, 8, 12]\n",
    "    depth_results = profiler.measure_scaling_with_depth(base_config, layer_counts)\n",
    "    \n",
    "    # Analyze scaling patterns\n",
    "    print(f\"\\n{'Layers':<7} {'Parameters':<12} {'Memory (MB)':<12} {'Time (ms)':<10} {'Tokens/sec':<10}\")\n",
    "    print(\"-\" * 60)\n",
    "    \n",
    "    for num_layers in layer_counts:\n",
    "        result = depth_results[num_layers]\n",
    "        print(f\"{num_layers:<7} {result['total_parameters']:<12,} {result['total_memory_mb']:<12.1f} \"\n",
    "              f\"{result['computation_time_ms']:<10.2f} {result['tokens_per_second']:<10.0f}\")\n",
    "    \n",
    "    # Width vs depth trade-off analysis\n",
    "    print(\"\\n\" + \"=\"*60)\n",
    "    configurations = [\n",
    "        {'embed_dim': 256, 'num_heads': 8, 'num_layers': 8, 'hidden_dim': 1024},  # Deep & narrow\n",
    "        {'embed_dim': 512, 'num_heads': 16, 'num_layers': 4, 'hidden_dim': 2048}, # Wide & shallow\n",
    "        {'embed_dim': 384, 'num_heads': 12, 'num_layers': 6, 'hidden_dim': 1536}, # Balanced\n",
    "    ]\n",
    "    \n",
    "    width_depth_results = profiler.analyze_width_vs_depth_tradeoffs(2000000, configurations)\n",
    "    \n",
    "    # Production scaling simulation\n",
    "    print(\"\\n\" + \"=\"*60)\n",
    "    production_results = profiler.simulate_production_scaling(['Small', 'Medium', 'Large'])\n",
    "    \n",
    "    # Systems design analysis\n",
    "    print(\"\\n\" + \"=\"*60)\n",
    "    analyze_transformer_system_design()\n",
    "    \n",
    "    # Demonstrate realistic language model setup\n",
    "    print(\"\\n\" + \"=\"*60)\n",
    "    print(\"🏗️ REALISTIC LANGUAGE MODEL DEMONSTRATION\")\n",
    "    print(\"=\"*60)\n",
    "    \n",
    "    # Create a realistic small language model\n",
    "    vocab_size = 5000\n",
    "    embed_dim = 512\n",
    "    num_heads = 8\n",
    "    num_layers = 6\n",
    "    hidden_dim = 2048\n",
    "    max_seq_length = 256\n",
    "    \n",
    "    print(f\"Language model configuration:\")\n",
    "    print(f\"  Vocabulary: {vocab_size:,} tokens\")\n",
    "    print(f\"  Embedding dimension: {embed_dim}\")\n",
    "    print(f\"  Attention heads: {num_heads}\")\n",
    "    print(f\"  Transformer layers: {num_layers}\")\n",
    "    print(f\"  Feed-forward dimension: {hidden_dim}\")\n",
    "    print(f\"  Max sequence length: {max_seq_length}\")\n",
    "    \n",
    "    # Create the model\n",
    "    language_model = Transformer(\n",
    "        vocab_size=vocab_size,\n",
    "        embed_dim=embed_dim,\n",
    "        num_heads=num_heads,\n",
    "        num_layers=num_layers,\n",
    "        hidden_dim=hidden_dim,\n",
    "        max_seq_length=max_seq_length,\n",
    "        pre_norm=True\n",
    "    )\n",
    "    \n",
    "    # Analyze model characteristics\n",
    "    memory_stats = language_model.get_memory_usage()\n",
    "    \n",
    "    print(f\"\\nModel characteristics:\")\n",
    "    print(f\"  Total parameters: {memory_stats['total_parameters']:,}\")\n",
    "    print(f\"  Model size: {memory_stats['total_memory_mb']:.1f}MB\")\n",
    "    print(f\"  Embedding table: {memory_stats['embedding_memory_mb']:.1f}MB ({memory_stats['embedding_memory_mb']/memory_stats['total_memory_mb']*100:.1f}%)\")\n",
    "    print(f\"  Transformer layers: {memory_stats['transformer_blocks_memory_mb']:.1f}MB ({memory_stats['transformer_blocks_memory_mb']/memory_stats['total_memory_mb']*100:.1f}%)\")\n",
    "    print(f\"  Output projection: {memory_stats['lm_head_memory_mb']:.1f}MB ({memory_stats['lm_head_memory_mb']/memory_stats['total_memory_mb']*100:.1f}%)\")\n",
    "    \n",
    "    # Performance simulation\n",
    "    batch_size = 8\n",
    "    seq_len = 128\n",
    "    test_input = Tensor(np.random.randint(0, vocab_size, (batch_size, seq_len)))\n",
    "    \n",
    "    start_time = time.time()\n",
    "    logits = language_model.forward(test_input)\n",
    "    forward_time = time.time() - start_time\n",
    "    \n",
    "    tokens_per_second = (batch_size * seq_len) / forward_time\n",
    "    \n",
    "    print(f\"\\nPerformance simulation:\")\n",
    "    print(f\"  Batch size: {batch_size}, Sequence length: {seq_len}\")\n",
    "    print(f\"  Forward pass time: {forward_time*1000:.2f}ms\")\n",
    "    print(f\"  Throughput: {tokens_per_second:.0f} tokens/second\")\n",
    "    print(f\"  Memory for batch: {logits.data.nbytes/(1024*1024):.1f}MB\")\n",
    "    \n",
    "    # Text generation example\n",
    "    print(f\"\\nText generation example:\")\n",
    "    start_sequence = Tensor(np.random.randint(0, vocab_size, (1, 10)))\n",
    "    generated = language_model.generate(start_sequence, max_new_tokens=20, temperature=0.8)\n",
    "    \n",
    "    print(f\"  Input sequence: {start_sequence.data[0].tolist()}\")\n",
    "    print(f\"  Generated tokens: {generated.data[0, 10:].tolist()}\")\n",
    "    print(f\"  Generation completed successfully\")\n",
    "    \n",
    "    # Scaling predictions\n",
    "    print(f\"\\nScaling analysis:\")\n",
    "    current_params = memory_stats['total_parameters']\n",
    "    \n",
    "    # Estimate for different scales\n",
    "    scaling_factors = [2, 5, 10]\n",
    "    for factor in scaling_factors:\n",
    "        scaled_params = current_params * factor\n",
    "        scaled_memory_gb = memory_stats['total_memory_mb'] * factor / 1024\n",
    "        \n",
    "        print(f\"  {factor}x scale: {scaled_params/1e6:.0f}M params, ~{scaled_memory_gb:.1f}GB memory\")\n",
    "    \n",
    "    print(\"\\n\" + \"=\"*60)\n",
    "    print(\"🎯 TRANSFORMERS MODULE COMPLETE!\")\n",
    "    print(\"=\"*60)\n",
    "    print(\"All transformer tests passed!\")\n",
    "    print(\"Complete language model architecture implemented!\")\n",
    "    print(\"Ready for production deployment and optimization!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "390254a0",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## 🤔 ML Systems Thinking: Interactive Questions\n",
    "\n",
    "Now that you've built complete transformer architectures, let's connect this work to broader ML systems challenges. These questions help you think critically about how transformer design choices affect production deployment and system performance.\n",
    "\n",
    "Take time to reflect thoughtfully on each question - your insights will help you understand how transformer architectures connect to real-world ML systems engineering."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "709877be",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "### Question 1: Transformer Architecture Optimization and Resource Allocation\n",
    "\n",
    "**Context**: Your transformer implementations demonstrate how layer depth, attention heads, and hidden dimensions affect model capacity and computational requirements. Production transformer systems must optimize these architectural choices within hardware constraints while maximizing model performance for specific tasks and deployment scenarios.\n",
    "\n",
    "**Reflection Question**: Design a transformer architecture optimization strategy for deploying language models across diverse production scenarios: real-time chat (low latency), document processing (high throughput), and mobile inference (resource-constrained). How would you allocate a fixed parameter budget across depth, width, and attention heads to optimize for each scenario, implement architecture search strategies that consider hardware constraints, and design adaptive model scaling that adjusts to available computational resources? Consider the challenges of maintaining consistent model quality while optimizing for different performance metrics and deployment environments.\n",
    "\n",
    "Think about: parameter budget allocation, architecture search strategies, hardware-aware optimization, and adaptive model scaling techniques.\n",
    "\n",
    "*Target length: 150-300 words*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bf1aa9a6",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "question-1-architecture-optimization",
     "locked": false,
     "points": 10,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "YOUR REFLECTION ON TRANSFORMER ARCHITECTURE OPTIMIZATION:\n",
    "\n",
    "TODO: Replace this text with your thoughtful response about transformer architecture optimization for diverse deployment scenarios.\n",
    "\n",
    "Consider addressing:\n",
    "- How would you allocate parameter budgets across depth, width, and attention heads for different scenarios?\n",
    "- What architecture search strategies would you use to optimize within hardware constraints?\n",
    "- How would you implement adaptive model scaling that adjusts to available resources?\n",
    "- What approaches would you use to maintain model quality across different deployment environments?\n",
    "- How would you balance latency, throughput, and resource constraints in architectural decisions?\n",
    "\n",
    "Write a strategic analysis connecting your transformer implementations to real architecture optimization challenges.\n",
    "\n",
    "GRADING RUBRIC (Instructor Use):\n",
    "- Demonstrates understanding of transformer architecture trade-offs and optimization (3 points)\n",
    "- Designs practical approaches to parameter allocation and architecture search (3 points)\n",
    "- Addresses adaptive scaling and hardware-aware optimization (2 points)\n",
    "- Shows systems thinking about production deployment optimization (2 points)\n",
    "- Clear strategic reasoning with architecture optimization insights (bonus points for innovative approaches)\n",
    "\"\"\"\n",
    "\n",
    "### BEGIN SOLUTION\n",
    "# Student response area - instructor will replace this section during grading setup\n",
    "# This is a manually graded question requiring strategic analysis of transformer architecture optimization\n",
    "# Students should demonstrate understanding of architecture design and production deployment challenges\n",
    "### END SOLUTION"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "32bb5968",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "### Question 2: Transformer Training and Inference System Design\n",
    "\n",
    "**Context**: Your transformer implementation shows how layer normalization, residual connections, and feed-forward networks work together to enable training of deep models. Production transformer systems must optimize the training pipeline for efficiency while designing inference systems that handle diverse workloads with different latency and throughput requirements.\n",
    "\n",
    "**Reflection Question**: Architect a transformer training and inference system that efficiently trains models with billions of parameters while serving diverse inference workloads with millisecond latency requirements. How would you design distributed training strategies that handle memory constraints and communication bottlenecks, implement efficient inference serving that optimizes for both batch and real-time processing, and manage model deployment across heterogeneous hardware environments? Consider the challenges of maintaining numerical stability during distributed training while achieving consistent inference performance across different deployment targets.\n",
    "\n",
    "Think about: distributed training optimization, inference serving strategies, heterogeneous deployment, and training-inference consistency.\n",
    "\n",
    "*Target length: 150-300 words*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c11dcf55",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "question-2-training-inference-systems",
     "locked": false,
     "points": 10,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "YOUR REFLECTION ON TRANSFORMER TRAINING AND INFERENCE SYSTEM DESIGN:\n",
    "\n",
    "TODO: Replace this text with your thoughtful response about transformer training and inference system architecture.\n",
    "\n",
    "Consider addressing:\n",
    "- How would you design distributed training for billion-parameter transformers with memory constraints?\n",
    "- What strategies would you use for efficient inference serving with millisecond latency requirements?\n",
    "- How would you manage model deployment across heterogeneous hardware environments?\n",
    "- What approaches would you use to maintain numerical stability during distributed training?\n",
    "- How would you ensure consistent inference performance across different deployment targets?\n",
    "\n",
    "Write a system design analysis connecting your transformer implementation to large-scale training and serving challenges.\n",
    "\n",
    "GRADING RUBRIC (Instructor Use):\n",
    "- Shows understanding of distributed training and inference serving challenges (3 points)\n",
    "- Designs practical approaches to memory management and latency optimization (3 points)\n",
    "- Addresses heterogeneous deployment and numerical stability considerations (2 points)\n",
    "- Demonstrates systems thinking about training-inference system coordination (2 points)\n",
    "- Clear system design reasoning with scalability insights (bonus points for comprehensive system architecture)\n",
    "\"\"\"\n",
    "\n",
    "### BEGIN SOLUTION\n",
    "# Student response area - instructor will replace this section during grading setup\n",
    "# This is a manually graded question requiring system design for transformer training and inference\n",
    "# Students should demonstrate knowledge of distributed systems and production deployment architecture\n",
    "### END SOLUTION"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3dab76f7",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "### Question 3: Transformer Optimization and Production Deployment\n",
    "\n",
    "**Context**: Your complete transformer model demonstrates the integration of tokenization, embeddings, attention, and feed-forward components into a unified language processing system. Production transformer deployments must optimize the entire pipeline for efficiency while maintaining model quality and enabling continuous improvement through model updates and fine-tuning.\n",
    "\n",
    "**Reflection Question**: Design a production transformer deployment system that optimizes the complete language processing pipeline while enabling continuous model improvement and adaptation. How would you implement end-to-end optimization that spans from tokenization through generation, design efficient model serving infrastructure that handles dynamic batching and request routing, and enable seamless model updates without service interruption? Consider the challenges of optimizing the entire pipeline holistically while maintaining modularity for individual component improvements and supporting diverse model variants and fine-tuned versions.\n",
    "\n",
    "Think about: end-to-end pipeline optimization, model serving infrastructure, continuous deployment strategies, and modular system design.\n",
    "\n",
    "*Target length: 150-300 words*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e30dbecb",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "question-3-production-deployment",
     "locked": false,
     "points": 10,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "YOUR REFLECTION ON TRANSFORMER OPTIMIZATION AND PRODUCTION DEPLOYMENT:\n",
    "\n",
    "TODO: Replace this text with your thoughtful response about transformer production deployment system design.\n",
    "\n",
    "Consider addressing:\n",
    "- How would you implement end-to-end optimization spanning tokenization through generation?\n",
    "- What strategies would you use for efficient model serving with dynamic batching and request routing?\n",
    "- How would you enable seamless model updates without service interruption?\n",
    "- What approaches would you use to maintain pipeline modularity while optimizing holistically?\n",
    "- How would you support diverse model variants and fine-tuned versions in production?\n",
    "\n",
    "Write a deployment analysis connecting your transformer implementation to complete production system optimization.\n",
    "\n",
    "GRADING RUBRIC (Instructor Use):\n",
    "- Understands end-to-end optimization and production deployment challenges (3 points)\n",
    "- Designs practical approaches to model serving and continuous deployment (3 points)\n",
    "- Addresses modularity and system integration considerations (2 points)\n",
    "- Shows systems thinking about holistic pipeline optimization (2 points)\n",
    "- Clear deployment reasoning with production optimization insights (bonus points for innovative system design)\n",
    "\"\"\"\n",
    "\n",
    "### BEGIN SOLUTION\n",
    "# Student response area - instructor will replace this section during grading setup\n",
    "# This is a manually graded question requiring understanding of production transformer deployment optimization\n",
    "# Students should demonstrate knowledge of end-to-end system design and continuous deployment strategies\n",
    "### END SOLUTION"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5b61d666",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## 🎯 MODULE SUMMARY: Transformers\n",
    "\n",
    "Congratulations! You have successfully implemented complete transformer architectures that power modern language models:\n",
    "\n",
    "### ✅ What You Have Built\n",
    "- **Layer Normalization**: Stable normalization for deep transformer training\n",
    "- **Position-wise Feed-Forward**: Non-linear transformations applied to each sequence position\n",
    "- **Transformer Blocks**: Complete transformer layers with attention, normalization, and residual connections\n",
    "- **Complete Transformer**: Full language model with embeddings, multiple layers, and generation capability\n",
    "- **Text Generation**: Autoregressive generation with proper causal masking\n",
    "- **🆕 Performance Analysis**: Comprehensive scaling analysis and architectural optimization tools\n",
    "- **🆕 Production Insights**: Understanding of real-world transformer deployment challenges\n",
    "\n",
    "### ✅ Key Learning Outcomes\n",
    "- **Understanding**: How transformer blocks enable powerful sequence modeling through attention and feed-forward layers\n",
    "- **Implementation**: Built complete transformer architectures with proper layer organization and residual connections\n",
    "- **Systems Insight**: How transformer depth affects memory usage, training efficiency, and model capacity\n",
    "- **Performance Engineering**: Measured and analyzed transformer scaling characteristics and optimization opportunities\n",
    "- **Production Context**: Understanding transformer deployment challenges and architectural trade-offs\n",
    "\n",
    "### ✅ Technical Mastery\n",
    "- **Layer Normalization**: Stabilizing deep network training with proper feature normalization\n",
    "- **Residual Connections**: Enabling gradient flow through deep transformer architectures\n",
    "- **Pre-norm vs Post-norm**: Understanding normalization placement effects on training stability\n",
    "- **Parameter Scaling**: Understanding how transformer parameters scale with architectural choices\n",
    "- **🆕 Generation Systems**: Autoregressive text generation with causal attention patterns\n",
    "\n",
    "### ✅ Professional Skills Developed\n",
    "- **Systems Architecture**: Designing complete transformer systems for production scale\n",
    "- **Memory Engineering**: Understanding transformer memory scaling and optimization techniques\n",
    "- **Performance Analysis**: Measuring and improving transformer computation and memory efficiency\n",
    "- **Integration Design**: Building complete language processing pipelines from tokenization to generation\n",
    "\n",
    "### ✅ Ready for Next Steps\n",
    "Your transformer implementations provide the foundation for:\n",
    "- **Advanced Language Models**: GPT, BERT, and other transformer-based architectures\n",
    "- **Multi-modal Models**: Extending transformers to vision, audio, and other modalities\n",
    "- **Production Optimization**: Memory optimization, distributed training, and efficient inference\n",
    "- **🧠 AI Applications**: Real-world language processing applications and services\n",
    "\n",
    "### 🔗 Connection to Real ML Systems\n",
    "Your implementations mirror production systems:\n",
    "- **GPT Architecture**: Your transformer matches GPT's decoder-only architecture\n",
    "- **BERT Components**: Layer normalization and attention mechanisms used in BERT\n",
    "- **Production Optimization**: Understanding of memory scaling, batching, and generation optimization\n",
    "- **Industry Applications**: Foundation for all modern language model deployments\n",
    "\n",
    "### 🎯 The Complete Language Model\n",
    "You have built the architecture that transformed AI:\n",
    "- **Before**: RNNs and CNNs limited by sequential processing and local dependencies\n",
    "- **After**: Transformers enable parallel processing and global attention across entire sequences\n",
    "\n",
    "**Achievement Unlocked**: You now understand every component of modern language models from tokenization through generation!\n",
    "\n",
    "Your complete transformer implementation provides the foundation for understanding and building modern AI systems. You've mastered the architecture that powers ChatGPT, GPT-4, BERT, and countless other AI applications.\n",
    "\n",
    "From discrete tokens to continuous embeddings, from attention mechanisms to complete language generation - you've built the entire pipeline that enables machines to understand and generate human language.\n",
    "\n",
    "**🏆 Congratulations on completing the complete transformer architecture implementation!**"
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}