{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "763d8283",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "# Module 13: Transformers - Complete Transformer Architecture\n",
    "\n",
    "Welcome to Module 13! You're about to build the complete transformer architecture that powers modern language models like GPT, Claude, and ChatGPT.\n",
    "\n",
    "## 🔗 Prerequisites & Progress\n",
    "**You've Built**: Tokenization, embeddings, attention mechanisms, and all foundational components\n",
    "**You'll Build**: TransformerBlock, the complete GPT architecture, and autoregressive generation\n",
    "**You'll Enable**: Full language model training and text generation capabilities\n",
    "\n",
    "**Connection Map**:\n",
    "```\n",
    "Tokenization + Embeddings + Attention → Transformers → Language Generation\n",
    "(text→numbers)  (learnable vectors)  (sequence modeling)  (complete models)\n",
    "```\n",
    "\n",
    "## Learning Objectives\n",
    "By the end of this module, you will:\n",
    "1. Implement a complete TransformerBlock with attention, MLP, and layer normalization\n",
    "2. Build a full GPT architecture with multiple transformer blocks\n",
    "3. Add autoregressive text generation capability\n",
    "4. Understand parameter scaling in large language models\n",
    "5. Test the transformer components and generation pipeline\n",
    "\n",
    "Let's get started!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0857efbe",
   "metadata": {},
   "outputs": [],
   "source": [
    "#| default_exp models.transformer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1b58c4de",
   "metadata": {},
   "outputs": [],
   "source": [
    "#| export\n",
    "import numpy as np\n",
    "from tinytorch.core.tensor import Tensor\n",
    "from tinytorch.core.layers import Linear\n",
    "from tinytorch.core.attention import MultiHeadAttention\n",
    "from tinytorch.core.activations import GELU"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b35ba8b8",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## 📦 Where This Code Lives in the Final Package\n",
    "\n",
    "**Learning Side:** You work in `modules/13_transformers/transformers_dev.py`\n",
    "**Building Side:** Code exports to `tinytorch.models.transformer`\n",
    "\n",
    "```python\n",
    "# How to use this module:\n",
    "from tinytorch.models.transformer import TransformerBlock, GPT, LayerNorm, MLP\n",
    "```\n",
    "\n",
    "**Why this matters:**\n",
    "- **Learning:** A complete transformer system showcasing how all components work together\n",
    "- **Production:** Mirrors how PyTorch organizes its transformer implementation into model modules\n",
    "- **Consistency:** All transformer components and generation logic live in `models.transformer`\n",
    "- **Integration:** Demonstrates the power of modular design by combining all previous modules"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e36e4f2c",
   "metadata": {
    "lines_to_next_cell": 1
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import math\n",
    "from typing import Optional, List\n",
    "\n",
    "# Import from previous modules - following the proper dependency chain.\n",
    "# Note: all tinytorch imports happen inside the try/except blocks below;\n",
    "# unconditional imports here would raise ImportError before the fallback\n",
    "# implementations could take over.\n",
    "\n",
    "# For development, we'll use minimal implementations if the imports fail\n",
    "try:\n",
    "    from tinytorch.core.tensor import Tensor\n",
    "except ImportError:\n",
    "    print(\"Warning: Using minimal Tensor implementation for development\")\n",
    "    class Tensor:\n",
    "        \"\"\"Minimal Tensor class for transformer development.\"\"\"\n",
    "        def __init__(self, data, requires_grad=False):\n",
    "            self.data = np.array(data)\n",
    "            self.shape = self.data.shape\n",
    "            self.size = self.data.size\n",
    "            self.requires_grad = requires_grad\n",
    "            self.grad = None\n",
    "\n",
    "        def __add__(self, other):\n",
    "            if isinstance(other, Tensor):\n",
    "                return Tensor(self.data + other.data)\n",
    "            return Tensor(self.data + other)\n",
    "\n",
    "        def __sub__(self, other):\n",
    "            # Needed by LayerNorm.forward (x - mean) below\n",
    "            if isinstance(other, Tensor):\n",
    "                return Tensor(self.data - other.data)\n",
    "            return Tensor(self.data - other)\n",
    "\n",
    "        def __mul__(self, other):\n",
    "            if isinstance(other, Tensor):\n",
    "                return Tensor(self.data * other.data)\n",
    "            return Tensor(self.data * other)\n",
    "\n",
    "        def matmul(self, other):\n",
    "            return Tensor(np.dot(self.data, other.data))\n",
    "\n",
    "        def sum(self, axis=None, keepdims=False):\n",
    "            return Tensor(self.data.sum(axis=axis, keepdims=keepdims))\n",
    "\n",
    "        def mean(self, axis=None, keepdims=False):\n",
    "            return Tensor(self.data.mean(axis=axis, keepdims=keepdims))\n",
    "\n",
    "        def reshape(self, *shape):\n",
    "            return Tensor(self.data.reshape(shape))\n",
    "\n",
    "        def __repr__(self):\n",
    "            return f\"Tensor(data={self.data}, shape={self.shape})\"\n",
    "\n",
    "try:\n",
    "    from tinytorch.core.layers import Linear\n",
    "except ImportError:\n",
    "    class Linear:\n",
    "        \"\"\"Minimal Linear layer for development.\"\"\"\n",
    "        def __init__(self, in_features, out_features, bias=True):\n",
    "            std = math.sqrt(2.0 / (in_features + out_features))\n",
    "            self.weight = Tensor(np.random.normal(0, std, (in_features, out_features)))\n",
    "            self.bias = Tensor(np.zeros(out_features)) if bias else None\n",
    "\n",
    "        def forward(self, x):\n",
    "            output = x.matmul(self.weight)\n",
    "            if self.bias is not None:\n",
    "                output = output + self.bias\n",
    "            return output\n",
    "\n",
    "        def parameters(self):\n",
    "            params = [self.weight]\n",
    "            if self.bias is not None:\n",
    "                params.append(self.bias)\n",
    "            return params\n",
    "\n",
    "try:\n",
    "    from tinytorch.core.attention import MultiHeadAttention\n",
    "except ImportError:\n",
    "    class MultiHeadAttention:\n",
    "        \"\"\"Minimal MultiHeadAttention for development.\"\"\"\n",
    "        def __init__(self, embed_dim, num_heads):\n",
    "            assert embed_dim % num_heads == 0\n",
    "            self.embed_dim = embed_dim\n",
    "            self.num_heads = num_heads\n",
    "            self.head_dim = embed_dim // num_heads\n",
    "\n",
    "            self.q_proj = Linear(embed_dim, embed_dim)\n",
    "            self.k_proj = Linear(embed_dim, embed_dim)\n",
    "            self.v_proj = Linear(embed_dim, embed_dim)\n",
    "            self.out_proj = Linear(embed_dim, embed_dim)\n",
    "\n",
    "        def forward(self, query, key, value, mask=None):\n",
    "            batch_size, seq_len, embed_dim = query.shape\n",
    "\n",
    "            # Linear projections\n",
    "            Q = self.q_proj.forward(query)\n",
    "            K = self.k_proj.forward(key)\n",
    "            V = self.v_proj.forward(value)\n",
    "\n",
    "            # Reshape for multi-head attention\n",
    "            Q = Q.reshape(batch_size, seq_len, self.num_heads, self.head_dim)\n",
    "            K = K.reshape(batch_size, seq_len, self.num_heads, self.head_dim)\n",
    "            V = V.reshape(batch_size, seq_len, self.num_heads, self.head_dim)\n",
    "\n",
    "            # Transpose to (batch_size, num_heads, seq_len, head_dim)\n",
    "            Q = Tensor(np.transpose(Q.data, (0, 2, 1, 3)))\n",
    "            K = Tensor(np.transpose(K.data, (0, 2, 1, 3)))\n",
    "            V = Tensor(np.transpose(V.data, (0, 2, 1, 3)))\n",
    "\n",
    "            # Scaled dot-product attention\n",
    "            scores = Tensor(np.matmul(Q.data, np.transpose(K.data, (0, 1, 3, 2))))\n",
    "            scores = scores * (1.0 / math.sqrt(self.head_dim))\n",
    "\n",
    "            # Apply causal mask for autoregressive generation\n",
    "            if mask is not None:\n",
    "                scores = Tensor(scores.data + mask.data)\n",
    "\n",
    "            # Softmax\n",
    "            attention_weights = self._softmax(scores)\n",
    "\n",
    "            # Apply attention to values\n",
    "            out = Tensor(np.matmul(attention_weights.data, V.data))\n",
    "\n",
    "            # Transpose back and reshape\n",
    "            out = Tensor(np.transpose(out.data, (0, 2, 1, 3)))\n",
    "            out = out.reshape(batch_size, seq_len, embed_dim)\n",
    "\n",
    "            # Final linear projection\n",
    "            return self.out_proj.forward(out)\n",
    "\n",
    "        def _softmax(self, x):\n",
    "            \"\"\"Numerically stable softmax.\"\"\"\n",
    "            exp_x = Tensor(np.exp(x.data - np.max(x.data, axis=-1, keepdims=True)))\n",
    "            return Tensor(exp_x.data / np.sum(exp_x.data, axis=-1, keepdims=True))\n",
    "\n",
    "        def parameters(self):\n",
    "            params = []\n",
    "            params.extend(self.q_proj.parameters())\n",
    "            params.extend(self.k_proj.parameters())\n",
    "            params.extend(self.v_proj.parameters())\n",
    "            params.extend(self.out_proj.parameters())\n",
    "            return params\n",
    "\n",
    "try:\n",
    "    from tinytorch.core.embeddings import Embedding\n",
    "except ImportError:\n",
    "    class Embedding:\n",
    "        \"\"\"Minimal Embedding layer for development.\"\"\"\n",
    "        def __init__(self, vocab_size, embed_dim):\n",
    "            self.vocab_size = vocab_size\n",
    "            self.embed_dim = embed_dim\n",
    "            self.weight = Tensor(np.random.normal(0, 0.02, (vocab_size, embed_dim)))\n",
    "\n",
    "        def forward(self, indices):\n",
    "            return Tensor(self.weight.data[indices.data.astype(int)])\n",
    "\n",
    "        def parameters(self):\n",
    "            return [self.weight]\n",
    "\n",
    "def gelu(x):\n",
    "    \"\"\"GELU activation function (tanh approximation).\"\"\"\n",
    "    return Tensor(0.5 * x.data * (1 + np.tanh(np.sqrt(2 / np.pi) * (x.data + 0.044715 * x.data**3))))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "77ba5604",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## 1. Introduction: What are Transformers?\n",
    "\n",
    "Transformers are the revolutionary architecture that powers modern AI language models like GPT, ChatGPT, and Claude. The key breakthrough is **self-attention**, which allows every token in a sequence to directly interact with every other token, creating rich contextual understanding.\n",
    "\n",
    "### The Transformer Revolution\n",
    "\n",
    "Before transformers, language models used RNNs or CNNs that processed text sequentially or locally. Transformers changed everything by processing all positions in parallel while maintaining global context.\n",
    "\n",
    "### Complete GPT Architecture Overview\n",
    "\n",
    "```\n",
    "┌─────────────────────────────────────────────────────────────────┐\n",
    "│ COMPLETE GPT ARCHITECTURE: From Text to Generation │\n",
    "├─────────────────────────────────────────────────────────────────┤\n",
    "│ │\n",
    "│ INPUT: \"Hello world\" → Token IDs: [15496, 1917] │\n",
    "│ ↓ │\n",
    "│ ┌───────────────────────────────────────────────────────────┐ │\n",
    "│ │ EMBEDDING LAYER │ │\n",
    "│ │ │ │\n",
    "│ │ ┌─────────────┐ ┌─────────────────────────────┐ │ │\n",
    "│ │ │Token Embed │ + │ Positional Embedding │ │ │\n",
    "│ │ │15496→[0.1, │ │ pos_0→[0.05, -0.02, ...] │ │ │\n",
    "│ │ │ 0.3,..]│ │ pos_1→[0.12, 0.08, ...] │ │ │\n",
    "│ │ │1917→[0.2, │ │ │ │ │\n",
    "│ │ │ -0.1,..]│ │ │ │ │\n",
    "│ │ └─────────────┘ └─────────────────────────────┘ │ │\n",
    "│ └───────────────────────────────────────────────────────────┘ │\n",
    "│ ↓ │\n",
    "│ ┌───────────────────────────────────────────────────────────┐ │\n",
    "│ │ TRANSFORMER BLOCK 1 │ │\n",
    "│ │ │ │\n",
    "│ │ x → LayerNorm → MultiHeadAttention → + x → result │ │\n",
    "│ │ │ ↑ │ │\n",
    "│ │ │ residual connection │ │ │\n",
    "│ │ └──────────────────────────────────────┘ │ │\n",
    "│ │ │ │ │\n",
    "│ │ result → LayerNorm → MLP (Feed Forward) → + result │ │\n",
    "│ │ │ ↑ │ │\n",
    "│ │ │ residual connection │ │ │\n",
    "│ │ └───────────────────────────────────────────┘ │ │\n",
    "│ └───────────────────────────────────────────────────────────┘ │\n",
    "│ ↓ │\n",
    "│ TRANSFORMER BLOCK 2 (same pattern) │\n",
    "│ ↓ │\n",
    "│ ... (more blocks) ... │\n",
    "│ ↓ │\n",
    "│ ┌───────────────────────────────────────────────────────────┐ │\n",
    "│ │ OUTPUT HEAD │ │\n",
    "│ │ │ │\n",
    "│ │ final_hidden → LayerNorm → Linear(embed_dim, vocab_size) │ │\n",
    "│ │ ↓ │ │\n",
    "│ │ Vocabulary Logits: [0.1, 0.05, 0.8, ...] │ │\n",
    "│ └───────────────────────────────────────────────────────────┘ │\n",
    "│ ↓ │\n",
    "│ OUTPUT: Next Token Probabilities │\n",
    "│ \"Hello\" → 10%, \"world\" → 5%, \"!\" → 80%, ... │\n",
    "│ │\n",
    "└─────────────────────────────────────────────────────────────────┘\n",
    "```\n",
    "\n",
    "### Why Transformers Dominate\n",
    "\n",
    "**Parallel Processing**: Unlike RNNs that process tokens one by one, transformers process all positions simultaneously. This makes training much faster.\n",
    "\n",
    "**Global Context**: Every token can directly attend to every other token in the sequence, capturing long-range dependencies that RNNs struggle with.\n",
    "\n",
    "**Scalability**: Performance predictably improves with more parameters and data. This enabled the scaling laws that led to GPT-3, GPT-4, and beyond.\n",
    "\n",
    "**Residual Connections**: Allow training very deep networks (100+ layers) by providing gradient highways.\n",
    "\n",
    "### The Building Blocks We'll Implement\n",
    "\n",
    "1. **LayerNorm**: Stabilizes training by normalizing activations\n",
    "2. **Multi-Layer Perceptron (MLP)**: Provides non-linear transformation\n",
    "3. **TransformerBlock**: Combines attention + MLP with residuals\n",
    "4. **GPT**: Complete model with embeddings and generation capability"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b4f69559",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## 2. Foundations: Essential Transformer Mathematics\n",
    "\n",
    "### Layer Normalization: The Stability Engine\n",
    "\n",
    "Layer Normalization is crucial for training deep transformer networks. Unlike batch normalization (which normalizes across the batch), layer norm normalizes across the feature dimension for each individual sample.\n",
    "\n",
    "```\n",
    "Mathematical Formula:\n",
    "output = (x - μ) / σ * γ + β\n",
    "\n",
    "where:\n",
    "  μ = mean(x, axis=features)     # Mean across feature dimension\n",
    "  σ = sqrt(var(x) + ε)           # Standard deviation + small epsilon\n",
    "  γ = learnable scale parameter  # Initialized to 1.0\n",
    "  β = learnable shift parameter  # Initialized to 0.0\n",
    "```\n",
    "\n",
    "**Why Layer Norm Works:**\n",
    "- **Independence**: Each sample normalized independently (good for variable batch sizes)\n",
    "- **Stability**: Prevents internal covariate shift that breaks training\n",
    "- **Gradient Flow**: Helps gradients flow better through deep networks\n",
    "\n",
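    "A quick NumPy sanity check of the formula (illustrative only, not exported to the package):\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "x = np.array([1.0, 2.0, 3.0, 4.0])          # one position's feature vector\n",
    "mu, var, eps = x.mean(), x.var(), 1e-5       # μ = 2.5, σ² = 1.25\n",
    "normalized = (x - mu) / np.sqrt(var + eps)   # with γ = 1, β = 0 (their initial values)\n",
    "print(normalized)  # ≈ [-1.34, -0.45, 0.45, 1.34] → mean ≈ 0, std ≈ 1\n",
    "```\n",
    "\n",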
    "### Residual Connections: The Gradient Highway\n",
    "\n",
    "Residual connections are the secret to training deep networks. They create \"gradient highways\" that allow information to flow directly through the network.\n",
    "\n",
    "```\n",
    "┌─────────────────────────────────────────────────────────────────┐\n",
    "│ RESIDUAL CONNECTIONS: The Gradient Highway System │\n",
    "├─────────────────────────────────────────────────────────────────┤\n",
    "│ │\n",
    "│ PRE-NORM ARCHITECTURE (Modern Standard): │\n",
    "│ │\n",
    "│ ┌───────────────────────────────────────────────────────────┐ │\n",
    "│ │ ATTENTION SUB-LAYER │ │\n",
    "│ │ │ │\n",
    "│ │ Input (x) ────┬─→ LayerNorm ─→ MultiHeadAttention ─┐ │ │\n",
    "│ │ │ │ │ │\n",
    "│ │ │ ┌─────────────────────────────┘ │ │\n",
    "│ │ │ ▼ │ │\n",
    "│ │ └────→ ADD ─→ Output to next sub-layer │ │\n",
    "│ │ (x + attention_output) │ │\n",
    "│ └───────────────────────────────────────────────────────────┘ │\n",
    "│ ↓ │\n",
    "│ ┌───────────────────────────────────────────────────────────┐ │\n",
    "│ │ MLP SUB-LAYER │ │\n",
    "│ │ │ │\n",
    "│ │ Input (x) ────┬─→ LayerNorm ─→ MLP (Feed Forward) ─┐ │ │\n",
    "│ │ │ │ │ │\n",
    "│ │ │ ┌─────────────────────────────┘ │ │\n",
    "│ │ │ ▼ │ │\n",
    "│ │ └────→ ADD ─→ Final Output │ │\n",
    "│ │ (x + mlp_output) │ │\n",
    "│ └───────────────────────────────────────────────────────────┘ │\n",
    "│ │\n",
    "│ KEY INSIGHT: Each sub-layer ADDS to the residual stream │\n",
    "│ rather than replacing it, preserving information flow! │\n",
    "│ │\n",
    "└─────────────────────────────────────────────────────────────────┘\n",
    "```\n",
    "\n",
    "**Gradient Flow Visualization:**\n",
    "```\n",
    "Backward Pass Without Residuals:     With Residuals:\n",
    "Loss                                 Loss\n",
    "  │ gradients get smaller              │ gradients stay strong\n",
    "  ↓ at each layer                      ↓ via residual paths\n",
    "Layer N ← tiny gradients             Layer N ← strong gradients\n",
    "  │                                    │ ↗ (direct path)\n",
    "  ↓                                    ↓ ↗\n",
    "Layer 2 ← vanishing                  Layer 2 ← strong gradients\n",
    "  │                                    │ ↗\n",
    "  ↓                                    ↓ ↗\n",
    "Layer 1 ← gone!                      Layer 1 ← strong gradients\n",
    "```\n",
    "\n",
    "### Feed-Forward Network (MLP): The Thinking Layer\n",
    "\n",
    "The MLP provides the actual \"thinking\" in each transformer block. It's a simple two-layer network with a specific expansion pattern.\n",
    "\n",
    "```\n",
    "MLP Architecture:\n",
    "Input (embed_dim) → Linear → GELU → Linear → Output (embed_dim)\n",
    "      512            2048            2048          512\n",
    "                       (4x expansion)\n",
    "\n",
    "Mathematical Formula:\n",
    "FFN(x) = Linear₂(GELU(Linear₁(x)))\n",
    "       = W₂ · GELU(W₁ · x + b₁) + b₂\n",
    "\n",
    "where:\n",
    "  W₁: (embed_dim, 4*embed_dim)  # Expansion matrix\n",
    "  W₂: (4*embed_dim, embed_dim)  # Contraction matrix\n",
    "  GELU: smooth activation function (better than ReLU for language)\n",
    "```\n",
    "\n",
    "**Why 4x Expansion?**\n",
    "- **Capacity**: More parameters = more representation power\n",
    "- **Non-linearity**: GELU activation creates complex transformations\n",
    "- **Information Bottleneck**: Forces the model to compress useful information\n",
    "\n",
    "### The Complete Transformer Block Data Flow\n",
    "\n",
    "```\n",
    "Input Tensor (batch, seq_len, embed_dim)\n",
    "    ↓\n",
    "  ┌─────────────────────────────────────┐\n",
    "  │ ATTENTION SUB-LAYER │\n",
    "  │ │\n",
    "  │ x₁ = LayerNorm(x₀) │\n",
    "  │ attention_out = MultiHeadAttn(x₁) │\n",
    "  │ x₂ = x₀ + attention_out (residual) │\n",
    "  └─────────────────────────────────────┘\n",
    "    ↓\n",
    "  ┌─────────────────────────────────────┐\n",
    "  │ MLP SUB-LAYER │\n",
    "  │ │\n",
    "  │ x₃ = LayerNorm(x₂) │\n",
    "  │ mlp_out = MLP(x₃) │\n",
    "  │ x₄ = x₂ + mlp_out (residual) │\n",
    "  └─────────────────────────────────────┘\n",
    "    ↓\n",
    "Output Tensor (batch, seq_len, embed_dim)\n",
    "```\n",
    "\n",
    "**Key Insight**: Each sub-layer (attention and MLP) gets a \"clean\" normalized input but adds its contribution to the residual stream. This creates a stable training dynamic."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9a837896",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## 3. Implementation: Building Transformer Components\n",
    "\n",
    "Now we'll implement each transformer component with a clear understanding of its role in the overall architecture. We'll follow the pattern: **Explanation → Implementation → Test** for each component.\n",
    "\n",
    "Each component serves a specific purpose:\n",
    "- **LayerNorm**: Stabilizes training and normalizes activations\n",
    "- **MLP**: Provides non-linear transformation and \"thinking\" capacity\n",
    "- **TransformerBlock**: Combines attention with MLP using residual connections\n",
    "- **GPT**: Complete autoregressive language model for text generation"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "76f36a18",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "### Understanding Layer Normalization\n",
    "\n",
    "Layer Normalization is the foundation of stable transformer training. Unlike batch normalization, it normalizes each sample independently across its feature dimensions.\n",
    "\n",
    "#### Why Layer Norm is Essential\n",
    "\n",
    "Without normalization, deep networks suffer from \"internal covariate shift\" - the distribution of inputs to each layer changes during training, making learning unstable.\n",
    "\n",
    "#### Layer Norm Visualization\n",
    "\n",
    "```\n",
    "┌─────────────────────────────────────────────────────────────────┐\n",
    "│ LAYER NORMALIZATION: Stabilizing Deep Networks │\n",
    "├─────────────────────────────────────────────────────────────────┤\n",
    "│ │\n",
    "│ INPUT TENSOR: (batch=2, seq=3, features=4) │\n",
    "│ ┌───────────────────────────────────────────────────────────┐ │\n",
    "│ │ Sample 1: [[1.0, 2.0, 3.0, 4.0], ← Position 0 │ │\n",
    "│ │ [5.0, 6.0, 7.0, 8.0], ← Position 1 │ │\n",
    "│ │ [9.0, 10.0, 11.0, 12.0]] ← Position 2 │ │\n",
    "│ │ │ │\n",
    "│ │ Sample 2: [[13., 14., 15., 16.], ← Position 0 │ │\n",
    "│ │ [17., 18., 19., 20.], ← Position 1 │ │\n",
    "│ │ [21., 22., 23., 24.]] ← Position 2 │ │\n",
    "│ └───────────────────────────────────────────────────────────┘ │\n",
    "│ ↓ │\n",
    "│ NORMALIZE ACROSS FEATURES (per position) │\n",
    "│ ↓ │\n",
    "│ ┌───────────────────────────────────────────────────────────┐ │\n",
    "│ │ AFTER NORMALIZATION: Each position → mean=0, std=1 │ │\n",
    "│ │ │ │\n",
    "│ │ Sample 1: [[-1.34, -0.45, 0.45, 1.34], │ │\n",
    "│ │ [-1.34, -0.45, 0.45, 1.34], │ │\n",
    "│ │ [-1.34, -0.45, 0.45, 1.34]] │ │\n",
    "│ │ │ │\n",
    "│ │ Sample 2: [[-1.34, -0.45, 0.45, 1.34], │ │\n",
    "│ │ [-1.34, -0.45, 0.45, 1.34], │ │\n",
    "│ │ [-1.34, -0.45, 0.45, 1.34]] │ │\n",
    "│ └───────────────────────────────────────────────────────────┘ │\n",
    "│ ↓ │\n",
    "│ APPLY LEARNABLE PARAMETERS: γ * norm + β │\n",
    "│ ↓ │\n",
    "│ ┌───────────────────────────────────────────────────────────┐ │\n",
    "│ │ FINAL OUTPUT: Model can learn any desired distribution │ │\n",
    "│ │ γ (scale) and β (shift) are learned during training │ │\n",
    "│ └───────────────────────────────────────────────────────────┘ │\n",
    "│ │\n",
    "│ KEY INSIGHT: Unlike batch norm, each sample normalized │\n",
    "│ independently - perfect for variable-length sequences! │\n",
    "│ │\n",
    "└─────────────────────────────────────────────────────────────────┘\n",
    "```\n",
    "\n",
    "#### Key Properties\n",
    "- **Per-sample normalization**: Each sequence position normalized independently\n",
    "- **Learnable parameters**: γ (scale) and β (shift) allow the model to recover any desired distribution\n",
    "- **Gradient friendly**: Helps gradients flow smoothly through deep networks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6878edf0",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "layer-norm",
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "#| export\n",
    "class LayerNorm:\n",
    "    \"\"\"\n",
    "    Layer Normalization for transformer blocks.\n",
    "\n",
    "    Normalizes across the feature dimension (last axis) for each sample independently,\n",
    "    unlike batch normalization which normalizes across the batch dimension.\n",
    "    \"\"\"\n",
    "\n",
    "    def __init__(self, normalized_shape, eps=1e-5):\n",
    "        \"\"\"\n",
    "        Initialize LayerNorm with learnable parameters.\n",
    "\n",
    "        TODO: Set up normalization parameters\n",
    "\n",
    "        APPROACH:\n",
    "        1. Store the shape to normalize over (usually embed_dim)\n",
    "        2. Initialize learnable scale (gamma) and shift (beta) parameters\n",
    "        3. Set small epsilon for numerical stability\n",
    "\n",
    "        EXAMPLE:\n",
    "        >>> ln = LayerNorm(512)  # For 512-dimensional embeddings\n",
    "        >>> x = Tensor(np.random.randn(2, 10, 512))  # (batch, seq, features)\n",
    "        >>> normalized = ln.forward(x)\n",
    "        >>> # Each of the 2×10 positions is normalized independently across its 512 features\n",
    "\n",
    "        HINTS:\n",
    "        - gamma should start at 1.0 (identity scaling)\n",
    "        - beta should start at 0.0 (no shift)\n",
    "        - eps prevents division by zero in the variance calculation\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        self.normalized_shape = normalized_shape\n",
    "        self.eps = eps\n",
    "\n",
    "        # Learnable parameters: scale and shift\n",
    "        # CRITICAL: requires_grad=True so the optimizer can train these!\n",
    "        self.gamma = Tensor(np.ones(normalized_shape), requires_grad=True)   # Scale parameter\n",
    "        self.beta = Tensor(np.zeros(normalized_shape), requires_grad=True)   # Shift parameter\n",
    "        ### END SOLUTION\n",
    "\n",
    "    def forward(self, x):\n",
    "        \"\"\"\n",
    "        Apply layer normalization.\n",
    "\n",
    "        TODO: Implement layer normalization formula\n",
    "\n",
    "        APPROACH:\n",
    "        1. Compute mean and variance across the last dimension\n",
    "        2. Normalize: (x - mean) / sqrt(variance + eps)\n",
    "        3. Apply learnable scale and shift: gamma * normalized + beta\n",
    "\n",
    "        MATHEMATICAL FORMULA:\n",
    "        y = (x - μ) / σ * γ + β\n",
    "        where μ = mean(x), σ = sqrt(var(x) + ε)\n",
    "\n",
    "        HINT: Use keepdims=True to maintain tensor dimensions for broadcasting\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        # CRITICAL: Use Tensor operations (not .data) to maintain gradient flow!\n",
    "        # Compute statistics across last dimension (features)\n",
    "        mean = x.mean(axis=-1, keepdims=True)\n",
    "\n",
    "        # Compute variance: E[(x - μ)²]\n",
    "        diff = x - mean  # Tensor subtraction maintains gradient\n",
    "        variance = (diff * diff).mean(axis=-1, keepdims=True)  # Tensor ops maintain gradient\n",
    "\n",
    "        # Normalize: (x - mean) / sqrt(variance + eps)\n",
    "        # Note: sqrt and division need to preserve gradient flow\n",
    "        std_data = np.sqrt(variance.data + self.eps)\n",
    "        normalized = diff * Tensor(1.0 / std_data)  # Scale by reciprocal to maintain gradient\n",
    "\n",
    "        # Apply learnable transformation\n",
    "        output = normalized * self.gamma + self.beta\n",
    "        return output\n",
    "        ### END SOLUTION\n",
    "\n",
    "    def parameters(self):\n",
    "        \"\"\"Return learnable parameters.\"\"\"\n",
    "        return [self.gamma, self.beta]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b57594b0",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "### 🔬 Unit Test: Layer Normalization\n",
    "This test validates that our LayerNorm implementation works correctly.\n",
    "**What we're testing**: Normalization statistics and parameter learning\n",
    "**Why it matters**: Essential for transformer stability and training\n",
    "**Expected**: Mean ≈ 0, std ≈ 1 after normalization, learnable parameters work"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f187ea71",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "test-layer-norm",
     "locked": true,
     "points": 10
    }
   },
   "outputs": [],
   "source": [
    "def test_unit_layer_norm():\n",
    "    \"\"\"🔬 Test LayerNorm implementation.\"\"\"\n",
    "    print(\"🔬 Unit Test: Layer Normalization...\")\n",
    "\n",
    "    # Test basic normalization\n",
    "    ln = LayerNorm(4)\n",
    "    x = Tensor([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]])  # (2, 4)\n",
    "\n",
    "    normalized = ln.forward(x)\n",
    "\n",
    "    # Check output shape\n",
    "    assert normalized.shape == (2, 4)\n",
    "\n",
    "    # Check normalization properties (approximately)\n",
    "    # For each sample, mean should be close to 0, std close to 1\n",
    "    for i in range(2):\n",
    "        sample_mean = np.mean(normalized.data[i])\n",
    "        sample_std = np.std(normalized.data[i])\n",
    "        assert abs(sample_mean) < 1e-5, f\"Mean should be ~0, got {sample_mean}\"\n",
    "        assert abs(sample_std - 1.0) < 1e-4, f\"Std should be ~1, got {sample_std}\"\n",
    "\n",
    "    # Test parameter shapes\n",
    "    params = ln.parameters()\n",
    "    assert len(params) == 2\n",
    "    assert params[0].shape == (4,)  # gamma\n",
    "    assert params[1].shape == (4,)  # beta\n",
    "\n",
    "    print(\"✅ LayerNorm works correctly!\")\n",
    "\n",
    "# Run test immediately when developing this module\n",
    "if __name__ == \"__main__\":\n",
    "    test_unit_layer_norm()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "20fa9a45",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "### Understanding the Multi-Layer Perceptron (MLP)\n",
    "\n",
    "The MLP is where the \"thinking\" happens in each transformer block. It's a simple feed-forward network that provides non-linear transformation capacity.\n",
    "\n",
    "#### The Role of MLP in Transformers\n",
    "\n",
    "While attention handles relationships between tokens, the MLP processes each position independently, adding computational depth and non-linearity.\n",
    "\n",
    "#### MLP Architecture and Information Flow\n",
    "\n",
    "```\n",
    "Information Flow Through MLP:\n",
    "\n",
    "Input: (batch, seq_len, embed_dim=512)\n",
    "    ↓\n",
    "┌─────────────────────────────────────────────┐\n",
    "│ Linear Layer 1: Expansion │\n",
    "│ Weight: (512, 2048) Bias: (2048,) │\n",
    "│ Output: (batch, seq_len, 2048) │\n",
    "└─────────────────────────────────────────────┘\n",
    "    ↓\n",
    "┌─────────────────────────────────────────────┐\n",
    "│ GELU Activation │\n",
    "│ Smooth, differentiable activation │\n",
    "│ Better than ReLU for language modeling │\n",
    "└─────────────────────────────────────────────┘\n",
    "    ↓\n",
    "┌─────────────────────────────────────────────┐\n",
    "│ Linear Layer 2: Contraction │\n",
    "│ Weight: (2048, 512) Bias: (512,) │\n",
    "│ Output: (batch, seq_len, 512) │\n",
    "└─────────────────────────────────────────────┘\n",
    "    ↓\n",
    "Output: (batch, seq_len, embed_dim=512)\n",
    "```\n",
    "\n",
    "#### Why 4x Expansion?\n",
    "\n",
    "```\n",
    "Parameter Count Analysis:\n",
    "\n",
    "Embed Dim: 512\n",
    "MLP Hidden: 2048 (4x expansion)\n",
    "\n",
    "Parameters:\n",
    "- Linear1: 512 × 2048 + 2048 = 1,050,624\n",
    "- Linear2: 2048 × 512 + 512 = 1,049,088\n",
    "- Total MLP: ~2.1M parameters\n",
    "\n",
    "For comparison:\n",
    "- Attention (same embed_dim): ~1.05M parameters (4 projections of 512×512 + biases)\n",
    "- MLP has MORE parameters → more computational capacity\n",
    "```\n",
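    "\n",
    "These counts are easy to verify with a few lines of arithmetic (a quick sanity check, not part of the module):\n",
    "\n",
    "```python\n",
    "embed_dim, hidden_dim = 512, 2048\n",
    "linear1 = embed_dim * hidden_dim + hidden_dim   # weights + bias = 1,050,624\n",
    "linear2 = hidden_dim * embed_dim + embed_dim    # weights + bias = 1,049,088\n",
    "print(linear1 + linear2)                        # 2,099,712 ≈ 2.1M parameters\n",
    "```\n",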
    "\n",
    "#### GELU vs ReLU\n",
    "\n",
    "```\n",
    "Activation Function Comparison:\n",
    "\n",
    "ReLU(x) = max(0, x)  # Hard cutoff at 0\n",
    "      ┌────\n",
    "      │\n",
    " ─────┘\n",
    "      0\n",
    "\n",
    "GELU(x) ≈ x * Φ(x)   # Smooth, probabilistic\n",
    "      ╭────\n",
    "     ╱\n",
    " ───╱\n",
    "   ╱\n",
    "   0\n",
    "\n",
    "GELU is smoother and provides better gradients for language modeling.\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "36edc347",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "mlp",
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "#| export\n",
    "class MLP:\n",
    "    \"\"\"\n",
    "    Multi-Layer Perceptron (Feed-Forward Network) for transformer blocks.\n",
    "\n",
    "    Standard pattern: Linear -> GELU -> Linear with expansion ratio of 4:1.\n",
    "    This provides the non-linear transformation in each transformer block.\n",
    "    \"\"\"\n",
    "\n",
    "    def __init__(self, embed_dim, hidden_dim=None, dropout_prob=0.1):\n",
    "        \"\"\"\n",
    "        Initialize MLP with two linear layers.\n",
    "\n",
    "        TODO: Set up the feed-forward network layers\n",
    "\n",
    "        APPROACH:\n",
    "        1. First layer expands from embed_dim to hidden_dim (usually 4x larger)\n",
    "        2. Second layer projects back to embed_dim\n",
    "        3. Use GELU activation (smoother than ReLU, preferred in transformers)\n",
    "\n",
    "        EXAMPLE:\n",
    "        >>> mlp = MLP(512)  # Will create a 512 -> 2048 -> 512 network\n",
    "        >>> x = Tensor(np.random.randn(2, 10, 512))\n",
    "        >>> output = mlp.forward(x)\n",
    "        >>> assert output.shape == (2, 10, 512)\n",
    "\n",
    "        HINT: Standard transformer MLP uses 4x expansion (hidden_dim = 4 * embed_dim)\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        if hidden_dim is None:\n",
    "            hidden_dim = 4 * embed_dim  # Standard 4x expansion\n",
    "\n",
    "        self.embed_dim = embed_dim\n",
    "        self.hidden_dim = hidden_dim\n",
    "\n",
    "        # Two-layer feed-forward network\n",
    "        self.linear1 = Linear(embed_dim, hidden_dim)\n",
    "        self.linear2 = Linear(hidden_dim, embed_dim)\n",
    "        ### END SOLUTION\n",
    "\n",
    "    def forward(self, x):\n",
    "        \"\"\"\n",
    "        Forward pass through MLP.\n",
    "\n",
    "        TODO: Implement the feed-forward computation\n",
    "\n",
    "        APPROACH:\n",
    "        1. First linear transformation: embed_dim -> hidden_dim\n",
    "        2. Apply GELU activation (smooth, differentiable)\n",
    "        3. Second linear transformation: hidden_dim -> embed_dim\n",
    "\n",
    "        COMPUTATION FLOW:\n",
    "        x -> Linear -> GELU -> Linear -> output\n",
    "\n",
    "        HINT: The gelu activation function is implemented above\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        # First linear layer with expansion\n",
    "        hidden = self.linear1.forward(x)\n",
    "\n",
    "        # GELU activation\n",
    "        hidden = gelu(hidden)\n",
    "\n",
    "        # Second linear layer back to original size\n",
    "        output = self.linear2.forward(hidden)\n",
    "\n",
    "        return output\n",
    "        ### END SOLUTION\n",
    "\n",
    "    def parameters(self):\n",
    "        \"\"\"Return all learnable parameters.\"\"\"\n",
    "        params = []\n",
    "        params.extend(self.linear1.parameters())\n",
    "        params.extend(self.linear2.parameters())\n",
    "        return params"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "51e920ba",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "### 🔬 Unit Test: MLP (Feed-Forward Network)\n",
    "This test validates that our MLP implementation works correctly.\n",
    "**What we're testing**: Shape preservation and parameter counting\n",
    "**Why it matters**: The MLP provides the non-linear transformation in transformers\n",
    "**Expected**: Input/output shapes match, correct parameter count"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "daa33cf0",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "test-mlp",
     "locked": true,
     "points": 10
    }
   },
   "outputs": [],
   "source": [
    "def test_unit_mlp():\n",
    "    \"\"\"🔬 Test MLP implementation.\"\"\"\n",
    "    print(\"🔬 Unit Test: MLP (Feed-Forward Network)...\")\n",
    "\n",
    "    # Test MLP with standard 4x expansion\n",
    "    embed_dim = 64\n",
    "    mlp = MLP(embed_dim)\n",
    "\n",
    "    # Test forward pass\n",
    "    batch_size, seq_len = 2, 10\n",
    "    x = Tensor(np.random.randn(batch_size, seq_len, embed_dim))\n",
    "    output = mlp.forward(x)\n",
    "\n",
    "    # Check shape preservation\n",
    "    assert output.shape == (batch_size, seq_len, embed_dim)\n",
    "\n",
    "    # Check hidden dimension is 4x\n",
    "    assert mlp.hidden_dim == 4 * embed_dim\n",
    "\n",
    "    # Test parameter counting\n",
    "    params = mlp.parameters()\n",
    "    expected_params = 4  # 2 weights + 2 biases\n",
    "    assert len(params) == expected_params\n",
    "\n",
    "    # Test custom hidden dimension\n",
    "    custom_mlp = MLP(embed_dim, hidden_dim=128)\n",
    "    assert custom_mlp.hidden_dim == 128\n",
    "\n",
    "    print(\"✅ MLP works correctly!\")\n",
    "\n",
    "# Run test immediately when developing this module\n",
    "if __name__ == \"__main__\":\n",
    "    test_unit_mlp()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0f7a5449",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "### Understanding the Complete Transformer Block\n",
    "\n",
    "The TransformerBlock is the core building unit of GPT and other transformer models. It combines self-attention with feed-forward processing using a carefully designed residual architecture.\n",
    "\n",
    "#### Pre-Norm vs Post-Norm Architecture\n",
    "\n",
    "Modern transformers use \"pre-norm\" architecture where LayerNorm comes BEFORE the sub-layers, not after. This provides better training stability.\n",
    "\n",
    "```\n",
    "Pre-Norm Architecture (What We Implement):\n",
    "┌─────────────────────────────────────────────────────────┐\n",
    "│ INPUT (x) │\n",
    "│ │ │\n",
    "│ ┌───────────────┴───────────────┐ │\n",
    "│ │ │ │\n",
    "│ ▼ │ │\n",
    "│ LayerNorm │ │\n",
    "│ │ │ │\n",
    "│ ▼ │ │\n",
    "│ MultiHeadAttention │ │\n",
    "│ │ │ │\n",
    "│ └───────────────┬───────────────┘ │\n",
    "│ │ (residual connection) │\n",
    "│ ▼ │\n",
    "│ x + attention │\n",
    "│ │ │\n",
    "│ ┌───────────────┴───────────────┐ │\n",
    "│ │ │ │\n",
    "│ ▼ │ │\n",
    "│ LayerNorm │ │\n",
    "│ │ │ │\n",
    "│ ▼ │ │\n",
    "│ MLP │ │\n",
    "│ │ │ │\n",
    "│ └───────────────┬───────────────┘ │\n",
    "│ │ (residual connection) │\n",
    "│ ▼ │\n",
    "│ x + mlp │\n",
    "│ │ │\n",
    "│ ▼ │\n",
    "│ OUTPUT │\n",
    "└─────────────────────────────────────────────────────────┘\n",
    "```\n",
    "\n",
    "#### Why Pre-Norm Works Better\n",
    "\n",
    "**Training Stability**: LayerNorm before operations provides clean, normalized inputs to attention and MLP layers.\n",
    "\n",
    "**Gradient Flow**: Residual connections carry gradients directly from output to input, bypassing the normalized operations.\n",
    "\n",
    "**Deeper Networks**: Pre-norm enables training much deeper networks (100+ layers) compared to post-norm.\n",
    "\n",
    "#### Information Processing in Transformer Block\n",
    "\n",
    "```\n",
    "Step-by-Step Data Transformation:\n",
    "\n",
    "1. Input Processing:\n",
    "   x₀: (batch, seq_len, embed_dim)   # Original input\n",
    "\n",
    "2. Attention Sub-layer:\n",
    "   x₁ = LayerNorm(x₀)                # Normalize input\n",
    "   attn_out = MultiHeadAttn(x₁)      # Self-attention\n",
    "   x₂ = x₀ + attn_out                # Residual connection\n",
    "\n",
    "3. MLP Sub-layer:\n",
    "   x₃ = LayerNorm(x₂)                # Normalize again\n",
    "   mlp_out = MLP(x₃)                 # Feed-forward\n",
    "   x₄ = x₂ + mlp_out                 # Final residual\n",
    "\n",
    "4. Output:\n",
    "   return x₄                         # Ready for next block\n",
    "```\n",
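    "\n",
    "In code, this whole flow collapses to a few lines; a schematic sketch of the forward pass we implement below (using the LayerNorm, MultiHeadAttention, and MLP classes from this module):\n",
    "\n",
    "```python\n",
    "def transformer_block_forward(x, ln1, attn, ln2, mlp, mask=None):\n",
    "    normed = ln1.forward(x)\n",
    "    x = x + attn.forward(normed, normed, normed, mask)  # attention sub-layer + residual\n",
    "    x = x + mlp.forward(ln2.forward(x))                 # MLP sub-layer + residual\n",
    "    return x\n",
    "```\n",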
    "\n",
    "#### Residual Stream Concept\n",
    "\n",
    "Think of the residual connections as a \"stream\" that carries information through the network:\n",
    "\n",
    "```\n",
    "Residual Stream Flow:\n",
    "\n",
    "Layer 1: [original embeddings] ─┐\n",
    "                                ├─→ + attention info ─┐\n",
    "Attention adds information ─────┘                     │\n",
    "                                                      ├─→ + MLP info ─┐\n",
    "MLP adds information ─────────────────────────────────┘               │\n",
    "                                                                      │\n",
    "Layer 2: carries accumulated information ─────────────────────────────┘\n",
    "```\n",
    "\n",
    "Each layer adds information to this stream rather than replacing it, creating a rich representation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3b54f39c",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "transformer-block",
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "#| export\n",
    "class TransformerBlock:\n",
    "    \"\"\"\n",
    "    Complete Transformer Block with self-attention, MLP, and residual connections.\n",
    "\n",
    "    This is the core building block of GPT and other transformer models.\n",
    "    Each block processes the input sequence and passes it to the next block.\n",
    "    \"\"\"\n",
    "\n",
    "    def __init__(self, embed_dim, num_heads, mlp_ratio=4, dropout_prob=0.1):\n",
    "        \"\"\"\n",
    "        Initialize a complete transformer block.\n",
    "\n",
    "        TODO: Set up all components of the transformer block\n",
    "\n",
    "        APPROACH:\n",
    "        1. Multi-head self-attention for sequence modeling\n",
    "        2. First layer normalization (pre-norm architecture)\n",
    "        3. MLP with specified expansion ratio\n",
    "        4. Second layer normalization\n",
    "\n",
    "        TRANSFORMER BLOCK ARCHITECTURE:\n",
    "        x → LayerNorm → MultiHeadAttention → + (residual) →\n",
    "            LayerNorm → MLP → + (residual) → output\n",
    "\n",
    "        EXAMPLE:\n",
    "        >>> block = TransformerBlock(embed_dim=512, num_heads=8)\n",
    "        >>> x = Tensor(np.random.randn(2, 10, 512))  # (batch, seq, embed)\n",
    "        >>> output = block.forward(x)\n",
    "        >>> assert output.shape == (2, 10, 512)\n",
    "\n",
    "        HINT: We use pre-norm architecture (LayerNorm before attention/MLP)\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        self.embed_dim = embed_dim\n",
    "        self.num_heads = num_heads\n",
    "\n",
    "        # Multi-head self-attention\n",
    "        self.attention = MultiHeadAttention(embed_dim, num_heads)\n",
    "\n",
    "        # Layer normalizations (pre-norm architecture)\n",
    "        self.ln1 = LayerNorm(embed_dim)  # Before attention\n",
    "        self.ln2 = LayerNorm(embed_dim)  # Before MLP\n",
    "\n",
    "        # Feed-forward network\n",
    "        hidden_dim = int(embed_dim * mlp_ratio)\n",
    "        self.mlp = MLP(embed_dim, hidden_dim)\n",
    "        ### END SOLUTION\n",
    "\n",
    "    def forward(self, x, mask=None):\n",
    "        \"\"\"\n",
    "        Forward pass through transformer block.\n",
    "\n",
    "        TODO: Implement the complete transformer block computation\n",
    "\n",
    "        APPROACH:\n",
    "        1. Apply layer norm, then self-attention, then add residual\n",
    "        2. Apply layer norm, then MLP, then add residual\n",
    "        3. Return the transformed sequence\n",
    "\n",
    "        COMPUTATION FLOW:\n",
    "        x → ln1 → attention → + x → ln2 → mlp → + → output\n",
    "\n",
    "        RESIDUAL CONNECTIONS:\n",
    "        These are crucial for training deep networks - they allow gradients\n",
    "        to flow directly through the network during backpropagation.\n",
    "\n",
    "        HINT: Store intermediate results to add residual connections properly\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        # First sub-layer: Multi-head self-attention with residual connection\n",
    "        # Pre-norm: LayerNorm before attention\n",
    "        normed1 = self.ln1.forward(x)\n",
    "        # Self-attention: query, key, value are all the same (normed1)\n",
    "        attention_out = self.attention.forward(normed1, normed1, normed1, mask)\n",
    "\n",
    "        # Residual connection\n",
    "        x = x + attention_out\n",
    "\n",
    "        # Second sub-layer: MLP with residual connection\n",
    "        # Pre-norm: LayerNorm before MLP\n",
    "        normed2 = self.ln2.forward(x)\n",
    "        mlp_out = self.mlp.forward(normed2)\n",
    "\n",
    "        # Residual connection\n",
    "        output = x + mlp_out\n",
    "\n",
    "        return output\n",
    "        ### END SOLUTION\n",
    "\n",
    "    def parameters(self):\n",
    "        \"\"\"Return all learnable parameters.\"\"\"\n",
    "        params = []\n",
    "        params.extend(self.attention.parameters())\n",
    "        params.extend(self.ln1.parameters())\n",
    "        params.extend(self.ln2.parameters())\n",
    "        params.extend(self.mlp.parameters())\n",
    "        return params"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "78bc4bf0",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "### 🔬 Unit Test: Transformer Block\n",
    "This test validates our complete TransformerBlock implementation.\n",
    "**What we're testing**: Shape preservation, residual connections, parameter counting\n",
    "**Why it matters**: This is the core component that will be stacked to create GPT\n",
    "**Expected**: Input/output shapes match, all components work together"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2f8fa7e8",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "test-transformer-block",
     "locked": true,
     "points": 15
    }
   },
   "outputs": [],
   "source": [
    "def test_unit_transformer_block():\n",
    "    \"\"\"🔬 Test TransformerBlock implementation.\"\"\"\n",
    "    print(\"🔬 Unit Test: Transformer Block...\")\n",
    "\n",
    "    # Test transformer block\n",
    "    embed_dim = 64\n",
    "    num_heads = 4\n",
    "    block = TransformerBlock(embed_dim, num_heads)\n",
    "\n",
    "    # Test forward pass\n",
    "    batch_size, seq_len = 2, 8\n",
    "    x = Tensor(np.random.randn(batch_size, seq_len, embed_dim))\n",
    "    output = block.forward(x)\n",
    "\n",
    "    # Check shape preservation\n",
    "    assert output.shape == (batch_size, seq_len, embed_dim)\n",
    "\n",
    "    # Test with causal mask (for autoregressive generation)\n",
    "    mask = Tensor(np.triu(np.ones((seq_len, seq_len)) * -np.inf, k=1))\n",
    "    masked_output = block.forward(x, mask)\n",
    "    assert masked_output.shape == (batch_size, seq_len, embed_dim)\n",
    "\n",
    "    # Test parameter counting\n",
    "    params = block.parameters()\n",
    "    expected_components = 4  # attention, ln1, ln2, mlp parameters\n",
    "    assert len(params) > expected_components  # Should have parameters from all components\n",
    "\n",
    "    # Test different configurations\n",
    "    large_block = TransformerBlock(embed_dim=128, num_heads=8, mlp_ratio=2)\n",
    "    assert large_block.mlp.hidden_dim == 256  # 128 * 2\n",
    "\n",
    "    print(\"✅ TransformerBlock works correctly!\")\n",
    "\n",
    "# Run test immediately when developing this module\n",
    "if __name__ == \"__main__\":\n",
    "    test_unit_transformer_block()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d30f17d2",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "### Understanding the Complete GPT Architecture\n",
    "\n",
    "GPT (Generative Pre-trained Transformer) is the complete language model that combines all our components into a text generation system. It's designed for **autoregressive** generation - predicting the next token based on all previous tokens.\n",
    "\n",
    "#### GPT's Autoregressive Nature\n",
    "\n",
    "GPT generates text one token at a time, using all previously generated tokens as context:\n",
    "\n",
    "```\n",
    "Autoregressive Generation Process:\n",
    "\n",
    "Step 1: \"The cat\" → model predicts → \"sat\"\n",
    "Step 2: \"The cat sat\" → model predicts → \"on\"\n",
    "Step 3: \"The cat sat on\" → model predicts → \"the\"\n",
    "Step 4: \"The cat sat on the\" → model predicts → \"mat\"\n",
    "\n",
    "Result: \"The cat sat on the mat\"\n",
    "```\n",
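    "\n",
    "As a preview of the `generate` method below, here is a minimal greedy-decoding sketch (assuming a `forward` that returns logits of shape (batch, seq, vocab), as ours does):\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def generate_greedy(model, tokens, max_new_tokens):\n",
    "    # tokens: np.ndarray of shape (1, seq_len) holding token IDs\n",
    "    for _ in range(max_new_tokens):\n",
    "        logits = model.forward(Tensor(tokens))        # (1, seq, vocab)\n",
    "        next_id = int(np.argmax(logits.data[0, -1]))  # most likely next token\n",
    "        tokens = np.concatenate([tokens, [[next_id]]], axis=1)\n",
    "    return tokens\n",
    "```\n",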
    "\n",
    "#### Complete GPT Architecture\n",
    "\n",
    "```\n",
    "┌─────────────────────────────────────────────────────────────┐\n",
    "│ GPT ARCHITECTURE │\n",
    "│ │\n",
    "│ Input: Token IDs [15496, 1917, ...] │\n",
    "│ │ │\n",
    "│ ┌──────────────────┴──────────────────┐ │\n",
    "│ │ EMBEDDING LAYER │ │\n",
    "│ │ ┌─────────────┐ ┌─────────────────┐│ │\n",
    "│ │ │Token Embed │+│Position Embed ││ │\n",
    "│ │ │vocab→vector ││ │sequence→vector ││ │\n",
    "│ │ └─────────────┘ └─────────────────┘│ │\n",
    "│ └──────────────────┬──────────────────┘ │\n",
    "│ │ │\n",
    "│ ┌──────────────────┴──────────────────┐ │\n",
    "│ │ TRANSFORMER BLOCK 1 │ │\n",
    "│ │ ┌─────────┐ ┌─────────┐ ┌───────┐ │ │\n",
    "│ │ │LayerNorm│→│Attention│→│ +x │ │ │\n",
    "│ │ └─────────┘ └─────────┘ └───┬───┘ │ │\n",
    "│ │ │ │ │\n",
    "│ │ ┌─────────┐ ┌─────────┐ ┌───▼───┐ │ │\n",
    "│ │ │LayerNorm│→│ MLP │→│ +x │ │ │\n",
    "│ │ └─────────┘ └─────────┘ └───────┘ │ │\n",
    "│ └──────────────────┬──────────────────┘ │\n",
    "│ │ │\n",
    "│ ... (more transformer blocks) ... │\n",
    "│ │ │\n",
    "│ ┌──────────────────┴──────────────────┐ │\n",
    "│ │ OUTPUT HEAD │ │\n",
    "│ │ ┌─────────┐ ┌─────────────────────┐ │ │\n",
    "│ │ │LayerNorm│→│Linear(embed→vocab) │ │ │\n",
    "│ │ └─────────┘ └─────────────────────┘ │ │\n",
    "│ └──────────────────┬──────────────────┘ │\n",
    "│ │ │\n",
    "│ Output: Vocabulary Logits [0.1, 0.05, 0.8, ...] │\n",
    "└─────────────────────────────────────────────────────────────┘\n",
    "```\n",
    "\n",
    "#### Causal Masking for Autoregressive Training\n",
    "\n",
    "During training, GPT sees the entire sequence but must not \"cheat\" by looking at future tokens:\n",
    "\n",
    "```\n",
    "┌─────────────────────────────────────────────────────────────────┐\n",
    "│ CAUSAL MASKING: Preventing Future Information Leakage │\n",
    "├─────────────────────────────────────────────────────────────────┤\n",
    "│ │\n",
    "│ SEQUENCE: [\"The\", \"cat\", \"sat\", \"on\"] │\n",
    "│ POSITIONS: 0 1 2 3 │\n",
    "│ │\n",
    "│ ATTENTION MATRIX (what each position can see): │\n",
    "│ ┌──────────────────────────────────────────────────────────┐ │\n",
    "│ │ Pos: 0 1 2 3 │ │\n",
    "│ │ Pos 0: [ ✓ ✗ ✗ ✗ ] ← \"The\" only sees itself │ │\n",
    "│ │ Pos 1: [ ✓ ✓ ✗ ✗ ] ← \"cat\" sees \"The\" + self │ │\n",
    "│ │ Pos 2: [ ✓ ✓ ✓ ✗ ] ← \"sat\" sees all previous │ │\n",
    "│ │ Pos 3: [ ✓ ✓ ✓ ✓ ] ← \"on\" sees everything │ │\n",
    "│ └──────────────────────────────────────────────────────────┘ │\n",
    "│ │\n",
    "│ IMPLEMENTATION: Upper triangular matrix with -∞ │\n",
    "│ ┌──────────────────────────────────────────────────────────┐ │\n",
    "│ │ [[ 0, -∞, -∞, -∞], │ │\n",
    "│ │ [ 0, 0, -∞, -∞], │ │\n",
    "│ │ [ 0, 0, 0, -∞], │ │\n",
    "│ │ [ 0, 0, 0, 0]] │ │\n",
    "│ │ │ │\n",
    "│ │ After softmax: -∞ becomes 0 probability │ │\n",
    "│ └──────────────────────────────────────────────────────────┘ │\n",
    "│ │\n",
    "│ WHY THIS WORKS: During training, model sees entire sequence │\n",
    "│ but mask ensures position i only attends to positions ≤ i │\n",
    "│ │\n",
    "└─────────────────────────────────────────────────────────────────┘\n",
    "```\n",
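    "\n",
    "The mask itself is a one-liner in NumPy; a quick demo of the matrix above:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "seq_len = 4\n",
    "mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)\n",
    "print(mask)\n",
    "# [[  0. -inf -inf -inf]\n",
    "#  [  0.   0. -inf -inf]\n",
    "#  [  0.   0.   0. -inf]\n",
    "#  [  0.   0.   0.   0.]]\n",
    "```\n",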
    "\n",
    "#### Generation Temperature Control\n",
    "\n",
    "Temperature controls the randomness of generation:\n",
    "\n",
    "```\n",
    "Temperature Effects:\n",
    "\n",
    "Original logits: [1.0, 2.0, 3.0]\n",
    "\n",
    "Temperature = 0.1 (Conservative):\n",
    "Scaled: [10.0, 20.0, 30.0] → Sharp distribution\n",
    "Probs:  [0.00, 0.00, 1.00] → Always picks highest\n",
    "\n",
    "Temperature = 1.0 (Balanced):\n",
    "Scaled: [1.0, 2.0, 3.0]    → Moderate distribution\n",
    "Probs:  [0.09, 0.24, 0.67] → Weighted sampling\n",
    "\n",
    "Temperature = 2.0 (Creative):\n",
    "Scaled: [0.5, 1.0, 1.5]    → Flatter distribution\n",
    "Probs:  [0.19, 0.31, 0.51] → More random\n",
    "```\n",
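    "\n",
    "These probabilities are just a temperature-scaled softmax; a small NumPy check of the numbers above:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def softmax_with_temperature(logits, temperature=1.0):\n",
    "    scaled = np.asarray(logits) / temperature\n",
    "    exp = np.exp(scaled - scaled.max())  # numerically stable\n",
    "    return exp / exp.sum()\n",
    "\n",
    "for t in (0.1, 1.0, 2.0):\n",
    "    print(t, np.round(softmax_with_temperature([1.0, 2.0, 3.0], t), 2))\n",
    "# 0.1 → [0.   0.   1.  ],  1.0 → [0.09 0.24 0.67],  2.0 → [0.19 0.31 0.51]\n",
    "```\n",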
||
"\n",
|
||
"#### Model Scaling and Parameters\n",
|
||
"\n",
|
||
"```\n",
|
||
"GPT Model Size Scaling:\n",
|
||
"\n",
|
||
"Tiny GPT (our implementation):\n",
|
||
"- embed_dim: 64, layers: 2, heads: 4\n",
|
||
"- Parameters: ~50K\n",
|
||
"- Use case: Learning and experimentation\n",
|
||
"\n",
|
||
"GPT-2 Small:\n",
|
||
"- embed_dim: 768, layers: 12, heads: 12\n",
|
||
"- Parameters: 117M\n",
|
||
"- Use case: Basic text generation\n",
|
||
"\n",
|
||
"GPT-3:\n",
|
||
"- embed_dim: 12,288, layers: 96, heads: 96\n",
|
||
"- Parameters: 175B\n",
|
||
"- Use case: Advanced language understanding\n",
|
||
"\n",
|
||
"GPT-4 (estimated):\n",
|
||
"- embed_dim: ~16,384, layers: ~120, heads: ~128\n",
|
||
"- Parameters: ~1.7T\n",
|
||
"- Use case: Reasoning and multimodal tasks\n",
|
||
"```"
|
||
]
|
||
},
|
||
{
"cell_type": "code",
"execution_count": null,
"id": "1d86de25",
"metadata": {
"lines_to_next_cell": 1,
"nbgrader": {
"grade": false,
"grade_id": "gpt",
"solution": true
}
},
"outputs": [],
"source": [
"#| export\n",
"class GPT:\n",
"    \"\"\"\n",
"    Complete GPT (Generative Pre-trained Transformer) model.\n",
"\n",
"    This combines embeddings, positional encoding, multiple transformer blocks,\n",
"    and a language modeling head for text generation.\n",
"    \"\"\"\n",
"\n",
"    def __init__(self, vocab_size, embed_dim, num_layers, num_heads, max_seq_len=1024):\n",
"        \"\"\"\n",
"        Initialize complete GPT model.\n",
"\n",
"        TODO: Set up all components of the GPT architecture\n",
"\n",
"        APPROACH:\n",
"        1. Token embedding layer to convert tokens to vectors\n",
"        2. Positional embedding to add position information\n",
"        3. Stack of transformer blocks (the main computation)\n",
"        4. Final layer norm and language modeling head\n",
"\n",
"        GPT ARCHITECTURE:\n",
"        tokens → embedding → + pos_embedding →\n",
"        transformer_blocks → layer_norm → lm_head → logits\n",
"\n",
"        EXAMPLE:\n",
"        >>> model = GPT(vocab_size=1000, embed_dim=256, num_layers=6, num_heads=8)\n",
"        >>> tokens = Tensor(np.random.randint(0, 1000, (2, 10)))  # (batch, seq)\n",
"        >>> logits = model.forward(tokens)\n",
"        >>> assert logits.shape == (2, 10, 1000)  # (batch, seq, vocab)\n",
"\n",
"        HINTS:\n",
"        - Positional embeddings are learned, not fixed sinusoidal\n",
"        - Final layer norm stabilizes training\n",
"        - The LM head can share weights with the token embedding (weight tying);\n",
"          we use a separate Linear head here for clarity\n",
"        \"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        self.vocab_size = vocab_size\n",
"        self.embed_dim = embed_dim\n",
"        self.num_layers = num_layers\n",
"        self.num_heads = num_heads\n",
"        self.max_seq_len = max_seq_len\n",
"\n",
"        # Token and positional embeddings\n",
"        self.token_embedding = Embedding(vocab_size, embed_dim)\n",
"        self.position_embedding = Embedding(max_seq_len, embed_dim)\n",
"\n",
"        # Stack of transformer blocks\n",
"        self.blocks = []\n",
"        for _ in range(num_layers):\n",
"            block = TransformerBlock(embed_dim, num_heads)\n",
"            self.blocks.append(block)\n",
"\n",
"        # Final layer normalization\n",
"        self.ln_f = LayerNorm(embed_dim)\n",
"\n",
"        # Language modeling head (projects to vocabulary)\n",
"        self.lm_head = Linear(embed_dim, vocab_size, bias=False)\n",
"        ### END SOLUTION\n",
"\n",
"    def forward(self, tokens):\n",
"        \"\"\"\n",
"        Forward pass through GPT model.\n",
"\n",
"        TODO: Implement the complete GPT forward pass\n",
"\n",
"        APPROACH:\n",
"        1. Get token embeddings and positional embeddings\n",
"        2. Add them together (broadcasting handles different shapes)\n",
"        3. Pass through all transformer blocks sequentially\n",
"        4. Apply final layer norm and language modeling head\n",
"\n",
"        COMPUTATION FLOW:\n",
"        tokens → embed + pos_embed → blocks → ln_f → lm_head → logits\n",
"\n",
"        CAUSAL MASKING:\n",
"        For autoregressive generation, we need to prevent tokens from\n",
"        seeing future tokens. This is handled by the attention mask.\n",
"\n",
"        HINT: Create position indices as range(seq_len) for positional embedding\n",
"        \"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        batch_size, seq_len = tokens.shape\n",
"\n",
"        # Token embeddings\n",
"        token_emb = self.token_embedding.forward(tokens)\n",
"\n",
"        # Positional embeddings\n",
"        positions = Tensor(np.arange(seq_len).reshape(1, seq_len))\n",
"        pos_emb = self.position_embedding.forward(positions)\n",
"\n",
"        # Combine embeddings\n",
"        x = token_emb + pos_emb\n",
"\n",
"        # Create causal mask for autoregressive generation\n",
"        mask = self._create_causal_mask(seq_len)\n",
"\n",
"        # Pass through transformer blocks\n",
"        for block in self.blocks:\n",
"            x = block.forward(x, mask)\n",
"\n",
"        # Final layer normalization\n",
"        x = self.ln_f.forward(x)\n",
"\n",
"        # Language modeling head\n",
"        logits = self.lm_head.forward(x)\n",
"\n",
"        return logits\n",
"        ### END SOLUTION\n",
"\n",
"    def _create_causal_mask(self, seq_len):\n",
"        \"\"\"Create causal mask to prevent attending to future positions.\"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        # Positions above the diagonal get -inf (blocked); the rest stay 0\n",
"        mask = np.triu(np.ones((seq_len, seq_len)) * -np.inf, k=1)\n",
"        return Tensor(mask)\n",
"        ### END SOLUTION\n",
"\n",
"    def generate(self, prompt_tokens, max_new_tokens=50, temperature=1.0):\n",
"        \"\"\"\n",
"        Generate text autoregressively.\n",
"\n",
"        TODO: Implement autoregressive text generation\n",
"\n",
"        APPROACH:\n",
"        1. Start with prompt tokens\n",
"        2. For each new position:\n",
"           - Run forward pass to get logits\n",
"           - Sample next token from logits\n",
"           - Append to sequence\n",
"        3. Return generated sequence\n",
"\n",
"        AUTOREGRESSIVE GENERATION:\n",
"        At each step, the model predicts the next token based on all\n",
"        previous tokens. This is how GPT generates coherent text.\n",
"\n",
"        EXAMPLE:\n",
"        >>> model = GPT(vocab_size=100, embed_dim=64, num_layers=2, num_heads=4)\n",
"        >>> prompt = Tensor([[1, 2, 3]])  # Some token sequence\n",
"        >>> generated = model.generate(prompt, max_new_tokens=5)\n",
"        >>> assert generated.shape[1] == 3 + 5  # original + new tokens\n",
"\n",
"        HINT: Use np.random.choice with temperature for sampling\n",
"        \"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        current_tokens = Tensor(prompt_tokens.data.copy())\n",
"\n",
"        for _ in range(max_new_tokens):\n",
"            # Crop context so positional embeddings never run past max_seq_len\n",
"            if current_tokens.shape[1] > self.max_seq_len:\n",
"                current_tokens = Tensor(current_tokens.data[:, -self.max_seq_len:])\n",
"\n",
"            # Get logits for current sequence\n",
"            logits = self.forward(current_tokens)\n",
"\n",
"            # Get logits for last position (next token prediction)\n",
"            last_logits = logits.data[:, -1, :]  # (batch_size, vocab_size)\n",
"\n",
"            # Apply temperature scaling\n",
"            scaled_logits = last_logits / temperature\n",
"\n",
"            # Convert to probabilities (numerically stable softmax)\n",
"            exp_logits = np.exp(scaled_logits - np.max(scaled_logits, axis=-1, keepdims=True))\n",
"            probs = exp_logits / np.sum(exp_logits, axis=-1, keepdims=True)\n",
"\n",
"            # Sample next token (assumes batch size 1)\n",
"            next_token = np.array([[np.random.choice(self.vocab_size, p=probs[0])]])\n",
"\n",
"            # Append to sequence\n",
"            current_tokens = Tensor(np.concatenate([current_tokens.data, next_token], axis=1))\n",
"\n",
"        return current_tokens\n",
"        ### END SOLUTION\n",
"\n",
"    def parameters(self):\n",
"        \"\"\"Return all learnable parameters.\"\"\"\n",
"        params = []\n",
"        params.extend(self.token_embedding.parameters())\n",
"        params.extend(self.position_embedding.parameters())\n",
"\n",
"        for block in self.blocks:\n",
"            params.extend(block.parameters())\n",
"\n",
"        params.extend(self.ln_f.parameters())\n",
"        params.extend(self.lm_head.parameters())\n",
"\n",
"        return params"
]
},
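{
"cell_type": "markdown",
"id": "b7d41e09",
"metadata": {},
"source": [
"#### Sanity Check: What the Causal Mask Looks Like\n",
"\n",
"A quick standalone illustration (plain NumPy, same construction as `_create_causal_mask`) for a length-4 sequence. Adding `-inf` above the diagonal before softmax drives those attention weights to exactly zero, so position i can only attend to positions ≤ i:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"seq_len = 4\n",
"mask = np.triu(np.ones((seq_len, seq_len)) * -np.inf, k=1)\n",
"print(mask)\n",
"# [[  0. -inf -inf -inf]\n",
"#  [  0.   0. -inf -inf]\n",
"#  [  0.   0.   0. -inf]\n",
"#  [  0.   0.   0.   0.]]\n",
"```"
]
},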
{
"cell_type": "markdown",
"id": "6994ec05",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"### 🔬 Unit Test: GPT Model\n",
"This test validates our complete GPT implementation.\n",
"**What we're testing**: Model forward pass, shape consistency, generation capability\n",
"**Why it matters**: This is the complete language model that ties everything together\n",
"**Expected**: Correct output shapes, generation works, parameter counting"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "377dc692",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test-gpt",
"locked": true,
"points": 20
}
},
"outputs": [],
"source": [
"def test_unit_gpt():\n",
"    \"\"\"🔬 Test GPT model implementation.\"\"\"\n",
"    print(\"🔬 Unit Test: GPT Model...\")\n",
"\n",
"    # Test small GPT model\n",
"    vocab_size = 100\n",
"    embed_dim = 64\n",
"    num_layers = 2\n",
"    num_heads = 4\n",
"\n",
"    model = GPT(vocab_size, embed_dim, num_layers, num_heads)\n",
"\n",
"    # Test forward pass\n",
"    batch_size, seq_len = 2, 8\n",
"    tokens = Tensor(np.random.randint(0, vocab_size, (batch_size, seq_len)))\n",
"    logits = model.forward(tokens)\n",
"\n",
"    # Check output shape\n",
"    expected_shape = (batch_size, seq_len, vocab_size)\n",
"    assert logits.shape == expected_shape\n",
"\n",
"    # Test generation\n",
"    prompt = Tensor(np.random.randint(0, vocab_size, (1, 5)))\n",
"    generated = model.generate(prompt, max_new_tokens=3)\n",
"\n",
"    # Check generation shape\n",
"    assert generated.shape == (1, 8)  # 5 prompt + 3 new tokens\n",
"\n",
"    # Test parameter counting\n",
"    params = model.parameters()\n",
"    assert len(params) > 10  # Should have many parameters from all components\n",
"\n",
"    # Test different model sizes\n",
"    larger_model = GPT(vocab_size=200, embed_dim=128, num_layers=4, num_heads=8)\n",
"    test_tokens = Tensor(np.random.randint(0, 200, (1, 10)))\n",
"    larger_logits = larger_model.forward(test_tokens)\n",
"    assert larger_logits.shape == (1, 10, 200)\n",
"\n",
"    print(\"✅ GPT model works correctly!\")\n",
"\n",
"# Run test immediately when developing this module\n",
"if __name__ == \"__main__\":\n",
"    test_unit_gpt()"
]
},
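{
"cell_type": "markdown",
"id": "c8e25a17",
"metadata": {},
"source": [
"#### Aside: Temperature in Action\n",
"\n",
"A tiny standalone check of the temperature scaling used in `generate()` (plain NumPy with toy logits, not model output). Lower temperature sharpens the distribution toward the argmax; higher temperature flattens it toward uniform:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def softmax_with_temperature(logits, temperature):\n",
"    scaled = logits / temperature\n",
"    exp = np.exp(scaled - np.max(scaled))  # numerically stable softmax\n",
"    return exp / exp.sum()\n",
"\n",
"logits = np.array([2.0, 1.0, 0.0])\n",
"print(softmax_with_temperature(logits, 0.5))  # ≈ [0.87, 0.12, 0.02] → near-greedy\n",
"print(softmax_with_temperature(logits, 1.0))  # ≈ [0.67, 0.24, 0.09]\n",
"print(softmax_with_temperature(logits, 2.0))  # ≈ [0.51, 0.31, 0.19] → more random\n",
"```"
]
},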
{
"cell_type": "markdown",
"id": "66fa0b98",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"## 4. Integration: Complete Transformer Workflow\n",
"\n",
"Now that we've built all the components, let's see how they work together in a complete language modeling pipeline. This demonstrates the full power of the transformer architecture.\n",
"\n",
"### The Language Modeling Pipeline\n",
"\n",
"```\n",
"Complete Workflow Visualization:\n",
"\n",
"1. Text Input:\n",
"   \"hello world\" → Tokenization → [15496, 1917]\n",
"\n",
"2. Model Processing:\n",
"   [15496, 1917]\n",
"   ↓ Token Embedding\n",
"   [[0.1, 0.5, ...], [0.3, -0.2, ...]]   # Vector representations\n",
"   ↓ + Position Embedding\n",
"   [[0.2, 0.7, ...], [0.1, -0.4, ...]]   # With position info\n",
"   ↓ Transformer Block 1\n",
"   [[0.3, 0.2, ...], [0.5, -0.1, ...]]   # After attention + MLP\n",
"   ↓ Transformer Block 2\n",
"   [[0.1, 0.9, ...], [0.7, 0.3, ...]]    # Further processed\n",
"   ↓ Final LayerNorm + LM Head\n",
"   [[0.1, 0.05, 0.8, ...], [...]]        # Logits over vocab (softmax → probabilities)\n",
"\n",
"3. Generation:\n",
"   Model predicts next token: \"!\" (token 33)\n",
"   New sequence: \"hello world!\"\n",
"```\n",
"\n",
"This integration demo will show:\n",
"- **Character-level tokenization** for simplicity\n",
"- **Forward pass** through all components\n",
"- **Autoregressive generation** in action\n",
"- **Temperature effects** on creativity"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6381a082",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "integration-demo",
"solution": true
}
},
"outputs": [],
"source": [
"def demonstrate_transformer_integration():\n",
"    \"\"\"\n",
"    Demonstrate complete transformer pipeline.\n",
"\n",
"    This simulates training a small language model on a simple vocabulary.\n",
"    \"\"\"\n",
"    print(\"🔗 Integration Demo: Complete Language Model Pipeline\")\n",
"    print(\"Building a mini-GPT for character-level text generation\")\n",
"\n",
"    # Create a small vocabulary (character-level)\n",
"    vocab = list(\"abcdefghijklmnopqrstuvwxyz .\")\n",
"    vocab_size = len(vocab)\n",
"    char_to_idx = {char: i for i, char in enumerate(vocab)}\n",
"    idx_to_char = {i: char for i, char in enumerate(vocab)}\n",
"\n",
"    print(f\"Vocabulary size: {vocab_size}\")\n",
"    print(f\"Characters: {''.join(vocab)}\")\n",
"\n",
"    # Create model\n",
"    model = GPT(\n",
"        vocab_size=vocab_size,\n",
"        embed_dim=64,\n",
"        num_layers=2,\n",
"        num_heads=4,\n",
"        max_seq_len=32\n",
"    )\n",
"\n",
"    # Sample text encoding\n",
"    text = \"hello world.\"\n",
"    tokens = [char_to_idx[char] for char in text]\n",
"    input_tokens = Tensor(np.array([tokens]))\n",
"\n",
"    print(f\"\\nOriginal text: '{text}'\")\n",
"    print(f\"Tokenized: {tokens}\")\n",
"    print(f\"Input shape: {input_tokens.shape}\")\n",
"\n",
"    # Forward pass\n",
"    logits = model.forward(input_tokens)\n",
"    print(f\"Output logits shape: {logits.shape}\")\n",
"    print(f\"Each position predicts next token from {vocab_size} possibilities\")\n",
"\n",
"    # Generation demo\n",
"    prompt_text = \"hello\"\n",
"    prompt_tokens = [char_to_idx[char] for char in prompt_text]\n",
"    prompt = Tensor(np.array([prompt_tokens]))\n",
"\n",
"    print(f\"\\nGeneration demo:\")\n",
"    print(f\"Prompt: '{prompt_text}'\")\n",
"\n",
"    generated = model.generate(prompt, max_new_tokens=8, temperature=1.0)\n",
"    generated_text = ''.join([idx_to_char[idx] for idx in generated.data[0]])\n",
"\n",
"    print(f\"Generated: '{generated_text}'\")\n",
"    print(\"(Note: Untrained model produces random text)\")\n",
"\n",
"    return model\n",
"\n",
"demonstrate_transformer_integration()"
]
},
{
"cell_type": "markdown",
"id": "540a7b4d",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"## 5. Systems Analysis: Parameter Scaling and Memory\n",
"\n",
"Transformer models scale dramatically with size, leading to both opportunities and challenges. Let's analyze the computational and memory requirements to understand why training large language models requires massive infrastructure.\n",
"\n",
"### The Scaling Laws Revolution\n",
"\n",
"One of the key discoveries in modern AI is that transformer performance follows predictable scaling laws:\n",
"\n",
"```\n",
"Scaling Laws Pattern:\n",
"Performance ∝ Parameters^α × Data^β × Compute^γ\n",
"\n",
"where α ≈ 0.7, β ≈ 0.8, γ ≈ 0.5\n",
"\n",
"This means:\n",
"- 10× more parameters → ~5× better performance\n",
"- 10× more data → ~6× better performance\n",
"- 10× more compute → ~3× better performance\n",
"```\n",
"\n",
"(The exponents above are illustrative. The empirical scaling laws of Kaplan et al., 2020 are stated as test loss falling as a power law in parameters, data, and compute; the qualitative lesson is the same: bigger, better-fed models improve predictably.)\n",
"\n",
"### Memory Scaling Analysis\n",
"\n",
"Memory requirements grow in different ways for different components:\n",
"\n",
"```\n",
"Memory Scaling by Component:\n",
"\n",
"1. Parameter Memory (Linear with model size):\n",
"   - Embeddings: vocab_size × embed_dim\n",
"   - Transformer blocks: ~12 × embed_dim² per block (4·d² attention + 8·d² MLP)\n",
"   - Total: O(embed_dim²)\n",
"\n",
"2. Attention Memory (Quadratic with sequence length):\n",
"   - Attention matrices: batch × heads × seq_len²\n",
"   - This is why long context is expensive!\n",
"   - Total: O(seq_len²)\n",
"\n",
"3. Activation Memory (Linear with batch size):\n",
"   - Forward pass activations for backprop\n",
"   - Scales with: batch × seq_len × embed_dim\n",
"   - Total: O(batch_size)\n",
"```\n",
"\n",
"### The Attention Memory Wall\n",
"\n",
"```\n",
"┌──────────────────────────────────────────────────────────────┐\n",
"│ ATTENTION MEMORY WALL: Why Long Context is Expensive         │\n",
"├──────────────────────────────────────────────────────────────┤\n",
"│                                                              │\n",
"│ MEMORY USAGE BY SEQUENCE LENGTH (Quadratic Growth):          │\n",
"│                                                              │\n",
"│  1K tokens: [▓] 16 MB ← Manageable                           │\n",
"│  2K tokens: [▓▓▓▓] 64 MB ← 4× memory (quadratic!)            │\n",
"│  4K tokens: [▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓] 256 MB ← 16× memory           │\n",
"│  8K tokens: [████████████████████████████████] 1 GB          │\n",
"│ 16K tokens: [████████████████████████████████████████] 4 GB  │\n",
"│ 32K tokens: [██████████████████████████████████████…] 16 GB  │\n",
"│                                                              │\n",
"│ REAL-WORLD CONTEXT LIMITS:                                   │\n",
"│   GPT-3:     2K tokens (limited by memory)                   │\n",
"│   GPT-4:     8K tokens (32K with optimizations)              │\n",
"│   Claude-3:  200K tokens (special techniques required!)      │\n",
"│   GPT-4o:    128K tokens (efficient attention)               │\n",
"│                                                              │\n",
"│ MATHEMATICAL SCALING:                                        │\n",
"│ Memory = batch_size × num_heads × seq_len² × 4 bytes         │\n",
"│ The seq_len² factor is the killer!                           │\n",
"│                                                              │\n",
"└──────────────────────────────────────────────────────────────┘\n",
"```"
]
},
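{
"cell_type": "markdown",
"id": "d91b6f3c",
"metadata": {},
"source": [
"The numbers in the diagram above are easy to reproduce. This assumes batch_size=1, num_heads=4, and float32, the combination the diagram uses:\n",
"\n",
"```python\n",
"for seq_len in [1024, 2048, 4096, 8192, 16384, 32768]:\n",
"    attention_bytes = 1 * 4 * seq_len**2 * 4  # batch × heads × seq_len² × 4 bytes\n",
"    print(f\"{seq_len:6d} tokens → {attention_bytes / 2**20:6.0f} MB\")\n",
"```\n",
"\n",
"Output: 16, 64, 256, 1024, 4096, 16384 MB. Each doubling of context quadruples attention memory."
]
},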
{
"cell_type": "code",
"execution_count": null,
"id": "0849dfd0",
"metadata": {
"lines_to_next_cell": 1,
"nbgrader": {
"grade": false,
"grade_id": "analyze-scaling",
"solution": true
}
},
"outputs": [],
"source": [
"def analyze_parameter_scaling():\n",
"    \"\"\"📊 Analyze how parameter count scales with model dimensions.\"\"\"\n",
"    print(\"📊 Analyzing Parameter Scaling in Transformers...\")\n",
"    print(\"Understanding why model size affects performance and cost\\n\")\n",
"\n",
"    # Test different model sizes\n",
"    configs = [\n",
"        {\"name\": \"Tiny\", \"embed_dim\": 64, \"num_layers\": 2, \"num_heads\": 4},\n",
"        {\"name\": \"Small\", \"embed_dim\": 128, \"num_layers\": 4, \"num_heads\": 8},\n",
"        {\"name\": \"Medium\", \"embed_dim\": 256, \"num_layers\": 8, \"num_heads\": 16},\n",
"        {\"name\": \"Large\", \"embed_dim\": 512, \"num_layers\": 12, \"num_heads\": 16},\n",
"    ]\n",
"\n",
"    vocab_size = 50000  # Typical vocabulary size\n",
"\n",
"    for config in configs:\n",
"        model = GPT(\n",
"            vocab_size=vocab_size,\n",
"            embed_dim=config[\"embed_dim\"],\n",
"            num_layers=config[\"num_layers\"],\n",
"            num_heads=config[\"num_heads\"]\n",
"        )\n",
"\n",
"        # Count parameters\n",
"        total_params = 0\n",
"        for param in model.parameters():\n",
"            total_params += param.size\n",
"\n",
"        # Calculate memory requirements (4 bytes per float32 parameter)\n",
"        memory_mb = (total_params * 4) / (1024 * 1024)\n",
"\n",
"        print(f\"{config['name']} Model:\")\n",
"        print(f\"  Parameters: {total_params:,}\")\n",
"        print(f\"  Memory: {memory_mb:.1f} MB\")\n",
"        print(f\"  Embed dim: {config['embed_dim']}, Layers: {config['num_layers']}\")\n",
"        print()\n",
"\n",
"    print(\"💡 Parameter scaling is roughly quadratic with embedding dimension\")\n",
"    print(\"🚀 Real GPT-3 has 175B parameters: ~350GB in float16, ~700GB in float32!\")\n",
"\n",
"analyze_parameter_scaling()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3d83a8fb",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "analyze-attention-memory",
"solution": true
}
},
"outputs": [],
"source": [
"def analyze_attention_memory():\n",
"    \"\"\"📊 Analyze attention memory complexity with sequence length.\"\"\"\n",
"    print(\"📊 Analyzing Attention Memory Complexity...\")\n",
"    print(\"Why long context is expensive and how it scales\\n\")\n",
"\n",
"    embed_dim = 512\n",
"    num_heads = 8\n",
"    batch_size = 4\n",
"\n",
"    # Test different sequence lengths\n",
"    sequence_lengths = [128, 256, 512, 1024, 2048]\n",
"\n",
"    print(\"Attention Matrix Memory Usage:\")\n",
"    print(\"Seq Len | Attention Matrix Size | Memory (MB)\")\n",
"    print(\"-\" * 45)\n",
"\n",
"    for seq_len in sequence_lengths:\n",
"        # Attention matrix is (batch_size, num_heads, seq_len, seq_len)\n",
"        attention_elements = batch_size * num_heads * seq_len * seq_len\n",
"\n",
"        # 4 bytes per float32\n",
"        memory_bytes = attention_elements * 4\n",
"        memory_mb = memory_bytes / (1024 * 1024)\n",
"\n",
"        print(f\"{seq_len:6d} | {seq_len}×{seq_len} × {batch_size}×{num_heads} | {memory_mb:8.1f}\")\n",
"\n",
"    print()\n",
"    print(\"💡 Attention memory grows quadratically with sequence length\")\n",
"    print(\"🚀 This is why techniques like FlashAttention are crucial for long sequences\")\n",
"\n",
"analyze_attention_memory()"
]
},
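{
"cell_type": "markdown",
"id": "e4a7d852",
"metadata": {},
"source": [
"The full FlashAttention algorithm is beyond this module, but the simplest version of the memory-saving idea behind it, processing queries in chunks so the full seq_len × seq_len score matrix never materializes, fits in a few lines. A NumPy sketch of that idea (not FlashAttention itself, and not part of TinyTorch's exports):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def chunked_attention(Q, K, V, chunk_size=128):\n",
"    \"\"\"Peak score memory is chunk_size × seq_len instead of seq_len × seq_len;\n",
"    the output is identical to standard (unmasked) attention.\"\"\"\n",
"    seq_len, d = Q.shape\n",
"    out = np.empty_like(Q)\n",
"    for start in range(0, seq_len, chunk_size):\n",
"        q = Q[start:start + chunk_size]               # (chunk, d)\n",
"        scores = q @ K.T / np.sqrt(d)                 # (chunk, seq_len)\n",
"        scores -= scores.max(axis=-1, keepdims=True)  # stable softmax\n",
"        weights = np.exp(scores)\n",
"        weights /= weights.sum(axis=-1, keepdims=True)\n",
"        out[start:start + chunk_size] = weights @ V\n",
"    return out\n",
"\n",
"Q = K = V = np.random.randn(512, 64)\n",
"print(chunked_attention(Q, K, V).shape)  # (512, 64)\n",
"```"
]
},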
{
"cell_type": "markdown",
"id": "61c047e3",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"## 🧪 Module Integration Test\n",
"\n",
"Final validation that everything works together correctly."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1f23223b",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test-module",
"locked": true,
"points": 25
}
},
"outputs": [],
"source": [
"def test_module():\n",
"    \"\"\"\n",
"    Comprehensive test of entire module functionality.\n",
"\n",
"    This final test runs before module summary to ensure:\n",
"    - All unit tests pass\n",
"    - Functions work together correctly\n",
"    - Module is ready for integration with TinyTorch\n",
"    \"\"\"\n",
"    print(\"🧪 RUNNING MODULE INTEGRATION TEST\")\n",
"    print(\"=\" * 50)\n",
"\n",
"    # Run all unit tests\n",
"    print(\"Running unit tests...\")\n",
"    test_unit_layer_norm()\n",
"    test_unit_mlp()\n",
"    test_unit_transformer_block()\n",
"    test_unit_gpt()\n",
"\n",
"    print(\"\\nRunning integration scenarios...\")\n",
"\n",
"    # Test complete transformer training scenario\n",
"    print(\"🔬 Integration Test: Full Training Pipeline...\")\n",
"\n",
"    # Create model and data\n",
"    vocab_size = 50\n",
"    embed_dim = 64\n",
"    num_layers = 2\n",
"    num_heads = 4\n",
"\n",
"    model = GPT(vocab_size, embed_dim, num_layers, num_heads)\n",
"\n",
"    # Test batch processing\n",
"    batch_size = 3\n",
"    seq_len = 16\n",
"    tokens = Tensor(np.random.randint(0, vocab_size, (batch_size, seq_len)))\n",
"\n",
"    # Forward pass\n",
"    logits = model.forward(tokens)\n",
"    assert logits.shape == (batch_size, seq_len, vocab_size)\n",
"\n",
"    # Test generation with different temperatures\n",
"    prompt = Tensor(np.random.randint(0, vocab_size, (1, 8)))\n",
"\n",
"    # Conservative generation\n",
"    conservative = model.generate(prompt, max_new_tokens=5, temperature=0.1)\n",
"    assert conservative.shape == (1, 13)\n",
"\n",
"    # Creative generation\n",
"    creative = model.generate(prompt, max_new_tokens=5, temperature=2.0)\n",
"    assert creative.shape == (1, 13)\n",
"\n",
"    # Test parameter counting consistency\n",
"    total_params = sum(param.size for param in model.parameters())\n",
"    assert total_params > 1000  # Should have substantial parameters\n",
"\n",
"    print(\"✅ Full transformer pipeline works!\")\n",
"\n",
"    print(\"\\n\" + \"=\" * 50)\n",
"    print(\"🎉 ALL TESTS PASSED! Module ready for export.\")\n",
"    print(\"Run: tito module complete 13\")\n",
"\n",
"# Call the comprehensive test\n",
"test_module()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d9c5a7f9",
"metadata": {},
"outputs": [],
"source": [
"if __name__ == \"__main__\":\n",
"    print(\"🚀 Running Transformers module...\")\n",
"    test_module()\n",
"    print(\"✅ Module validation complete!\")"
]
},
{
"cell_type": "markdown",
"id": "203f8df1",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 🤔 ML Systems Thinking: Transformer Architecture Foundations\n",
"\n",
"### Question 1: Attention Memory Complexity\n",
"You implemented multi-head attention that computes attention matrices of size (batch, heads, seq_len, seq_len).\n",
"\n",
"For a model with seq_len=1024, batch_size=4, num_heads=8:\n",
"- How many elements in the attention matrix? _____\n",
"- If each element is 4 bytes (float32), how much memory per layer? _____ MB\n",
"- Why does doubling sequence length quadruple attention memory? _____\n",
"\n",
"### Question 2: Residual Connection Benefits\n",
"Your TransformerBlock uses residual connections (x + attention_output, x + mlp_output).\n",
"\n",
"- What happens to gradients during backpropagation without residual connections? _____\n",
"- How do residual connections help train deeper networks? _____\n",
"- Why is pre-norm (LayerNorm before operations) preferred over post-norm? _____\n",
"\n",
"### Question 3: Parameter Scaling Analysis\n",
"Your GPT model combines embeddings, transformer blocks, and output projection.\n",
"\n",
"For embed_dim=512, vocab_size=10000, num_layers=6:\n",
"- Token embedding parameters: _____ (vocab_size × embed_dim)\n",
"- Approximate parameters per transformer block: _____ (hint: ~12 × embed_dim²)\n",
"- Total model parameters: approximately _____ million\n",
"\n",
"### Question 4: Autoregressive Generation Efficiency\n",
"Your generate() method processes the full sequence for each new token.\n",
"\n",
"- Why is this inefficient for long sequences? _____\n",
"- What optimization caches key-value pairs to avoid recomputation? _____\n",
"- How would this change the computational complexity from O(n²) to O(n)? _____\n",
"\n",
"(A sketch of the key-value caching idea follows below.)"
]
},
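{
"cell_type": "markdown",
"id": "f5c3b964",
"metadata": {},
"source": [
"For Question 4, here is the shape of the key-value caching idea as a conceptual NumPy sketch. The interface is hypothetical (our `MultiHeadAttention` does not expose a cache; names like `cached_K` are made up for illustration):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Without a cache, every generation step re-encodes all t tokens: O(t²) per step.\n",
"# With a cache, we keep K and V for past tokens and process ONE new token per step.\n",
"cached_K, cached_V = [], []  # each grows by one row per generated token\n",
"\n",
"def generate_step(new_token_embedding, Wq, Wk, Wv):\n",
"    q = new_token_embedding @ Wq          # (1, d): query for the new position only\n",
"    cached_K.append(new_token_embedding @ Wk)\n",
"    cached_V.append(new_token_embedding @ Wv)\n",
"    K = np.vstack(cached_K)               # (t, d): keys for all positions so far\n",
"    V = np.vstack(cached_V)\n",
"    scores = q @ K.T / np.sqrt(K.shape[1])  # (1, t): O(t) work, not O(t²)\n",
"    weights = np.exp(scores - scores.max())\n",
"    weights /= weights.sum()\n",
"    return weights @ V                    # attention output for the new token\n",
"\n",
"d = 8\n",
"Wq = Wk = Wv = np.eye(d)\n",
"print(generate_step(np.random.randn(1, d), Wq, Wk, Wv).shape)  # (1, 8)\n",
"```"
]
},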
{
"cell_type": "markdown",
"id": "13761f1f",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 🎯 MODULE SUMMARY: Transformers\n",
"\n",
"Congratulations! You've built the complete transformer architecture that powers modern language models like GPT, Claude, and ChatGPT!\n",
"\n",
"### Key Accomplishments\n",
"- Built LayerNorm for stable training across deep transformer networks\n",
"- Implemented MLP (feed-forward) networks with GELU activation and 4x expansion\n",
"- Created complete TransformerBlock with self-attention, residual connections, and pre-norm architecture\n",
"- Built full GPT model with embeddings, positional encoding, and autoregressive generation\n",
"- Discovered attention memory scaling and parameter distribution patterns\n",
"- All tests pass ✅ (validated by `test_module()`)\n",
"\n",
"### Ready for Next Steps\n",
"Your transformer implementation is the capstone of the language modeling pipeline.\n",
"Export with: `tito module complete 13`\n",
"\n",
"**Next**: Module 14 will add profiling and optimization techniques to make your transformers production-ready!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}