# --- # jupyter: # jupytext: # text_representation: # extension: .py # format_name: percent # format_version: '1.3' # jupytext_version: 1.17.1 # kernelspec: # display_name: Python 3 (ipykernel) # language: python # name: python3 # --- # %% [markdown] """ # Module 13: Transformers - Complete Transformer Architecture Welcome to Module 13! You're about to build the complete transformer architecture that powers modern language models like GPT, Claude, and ChatGPT. ## πŸ”— Prerequisites & Progress **You've Built**: Tokenization, embeddings, attention mechanisms, and all foundational components **You'll Build**: TransformerBlock, complete GPT architecture, and autoregressive generation **You'll Enable**: Full language model training and text generation capabilities **Connection Map**: ``` Tokenization + Embeddings + Attention β†’ Transformers β†’ Language Generation (textβ†’numbers) (learnable vectors) (sequence modeling) (complete models) ``` ## Learning Objectives By the end of this module, you will: 1. Implement complete TransformerBlock with attention, MLP, and layer normalization 2. Build full GPT architecture with multiple transformer blocks 3. Add autoregressive text generation capability 4. Understand parameter scaling in large language models 5. Test transformer components and generation pipeline Let's get started! """ # %% #| default_exp models.transformer # %% #| export import numpy as np import math from typing import Optional, List # Import from previous modules - following proper dependency chain from tinytorch.core.tensor import Tensor from tinytorch.core.layers import Linear from tinytorch.core.attention import MultiHeadAttention from tinytorch.core.activations import GELU from tinytorch.text.embeddings import Embedding, PositionalEncoding # Constants for memory calculations BYTES_PER_FLOAT32 = 4 # Standard float32 size in bytes MB_TO_BYTES = 1024 * 1024 # Megabytes to bytes conversion # %% [markdown] """ ## πŸ“¦ Where This Code Lives in the Final Package **Learning Side:** You work in `modules/13_transformers/transformers_dev.py` **Building Side:** Code exports to `tinytorch.models.transformer` ```python # How to use this module: from tinytorch.models.transformer import TransformerBlock, GPT, LayerNorm, MLP ``` **Why this matters:** - **Learning:** Complete transformer system showcasing how all components work together - **Production:** Matches PyTorch's transformer implementation with proper model organization - **Consistency:** All transformer components and generation logic in models.transformer - **Integration:** Demonstrates the power of modular design by combining all previous modules """ # %% [markdown] """ ## 1. Introduction: What are Transformers? Transformers are the revolutionary architecture that powers modern AI language models like GPT, ChatGPT, and Claude. The key breakthrough is **self-attention**, which allows every token in a sequence to directly interact with every other token, creating rich contextual understanding. ### The Transformer Revolution Before transformers, language models used RNNs or CNNs that processed text sequentially or locally. Transformers changed everything by processing all positions in parallel while maintaining global context. ### Complete GPT Architecture Overview ``` β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ COMPLETE GPT ARCHITECTURE: From Text to Generation β”‚ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”‚ β”‚ β”‚ INPUT: "Hello world" β†’ Token IDs: [15496, 1917] β”‚ β”‚ ↓ β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ EMBEDDING LAYER β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ β”‚ β”‚Token Embed β”‚ + β”‚ Positional Embedding β”‚ β”‚ β”‚ β”‚ β”‚ β”‚15496β†’[0.1, β”‚ β”‚ pos_0β†’[0.05, -0.02, ...] β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ 0.3,..]β”‚ β”‚ pos_1β†’[0.12, 0.08, ...] β”‚ β”‚ β”‚ β”‚ β”‚ β”‚1917β†’[0.2, β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ -0.1,..]β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ ↓ β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ TRANSFORMER BLOCK 1 β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ x β†’ LayerNorm β†’ MultiHeadAttention β†’ + x β†’ result β”‚ β”‚ β”‚ β”‚ β”‚ ↑ β”‚ β”‚ β”‚ β”‚ β”‚ residual connection β”‚ β”‚ β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ result β†’ LayerNorm β†’ MLP (Feed Forward) β†’ + result β”‚ β”‚ β”‚ β”‚ β”‚ ↑ β”‚ β”‚ β”‚ β”‚ β”‚ residual connection β”‚ β”‚ β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ ↓ β”‚ β”‚ TRANSFORMER BLOCK 2 (same pattern) β”‚ β”‚ ↓ β”‚ β”‚ ... (more blocks) ... β”‚ β”‚ ↓ β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ OUTPUT HEAD β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ final_hidden β†’ LayerNorm β†’ Linear(embed_dim, vocab_size) β”‚ β”‚ β”‚ β”‚ ↓ β”‚ β”‚ β”‚ β”‚ Vocabulary Logits: [0.1, 0.05, 0.8, ...] β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ ↓ β”‚ β”‚ OUTPUT: Next Token Probabilities β”‚ β”‚ "Hello" β†’ 10%, "world" β†’ 5%, "!" β†’ 80%, ... β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ ``` ### Why Transformers Dominate **Parallel Processing**: Unlike RNNs that process tokens one by one, transformers process all positions simultaneously. This makes training much faster. **Global Context**: Every token can directly attend to every other token in the sequence, capturing long-range dependencies that RNNs struggle with. **Scalability**: Performance predictably improves with more parameters and data. This enabled the scaling laws that led to GPT-3, GPT-4, and beyond. **Residual Connections**: Allow training very deep networks (100+ layers) by providing gradient highways. ### The Building Blocks We'll Implement 1. **LayerNorm**: Stabilizes training by normalizing activations 2. **Multi-Layer Perceptron (MLP)**: Provides non-linear transformation 3. **TransformerBlock**: Combines attention + MLP with residuals 4. **GPT**: Complete model with embeddings and generation capability """ # %% [markdown] """ ## 2. Foundations: Essential Transformer Mathematics ### Layer Normalization: The Stability Engine Layer Normalization is crucial for training deep transformer networks. Unlike batch normalization (which normalizes across the batch), layer norm normalizes across the feature dimension for each individual sample. ``` Mathematical Formula: output = (x - ΞΌ) / Οƒ * Ξ³ + Ξ² where: ΞΌ = mean(x, axis=features) # Mean across feature dimension Οƒ = sqrt(var(x) + Ξ΅) # Standard deviation + small epsilon Ξ³ = learnable scale parameter # Initialized to 1.0 Ξ² = learnable shift parameter # Initialized to 0.0 ``` **Why Layer Norm Works:** - **Independence**: Each sample normalized independently (good for variable batch sizes) - **Stability**: Prevents internal covariate shift that breaks training - **Gradient Flow**: Helps gradients flow better through deep networks ### Residual Connections: The Gradient Highway Residual connections are the secret to training deep networks. They create "gradient highways" that allow information to flow directly through the network. ``` β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ RESIDUAL CONNECTIONS: The Gradient Highway System β”‚ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”‚ β”‚ β”‚ PRE-NORM ARCHITECTURE (Modern Standard): β”‚ β”‚ β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ ATTENTION SUB-LAYER β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ Input (x) ────┬─→ LayerNorm ─→ MultiHeadAttention ─┐ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ β”‚ β”‚ β–Ό β”‚ β”‚ β”‚ β”‚ └────→ ADD ─→ Output to next sub-layer β”‚ β”‚ β”‚ β”‚ (x + attention_output) β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ ↓ β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ MLP SUB-LAYER β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ Input (x) ────┬─→ LayerNorm ─→ MLP (Feed Forward) ─┐ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ β”‚ β”‚ β–Ό β”‚ β”‚ β”‚ β”‚ └────→ ADD ─→ Final Output β”‚ β”‚ β”‚ β”‚ (x + mlp_output) β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ β”‚ KEY INSIGHT: Each sub-layer ADDS to the residual stream β”‚ β”‚ rather than replacing it, preserving information flow! β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ ``` **Gradient Flow Visualization:** ``` Backward Pass Without Residuals: With Residuals: Loss Loss β”‚ gradients get smaller β”‚ gradients stay strong ↓ at each layer ↓ via residual paths Layer N ← tiny gradients Layer N ← strong gradients β”‚ β”‚ β†— (direct path) ↓ ↓ β†— Layer 2 ← vanishing Layer 2 ← strong gradients β”‚ β”‚ β†— ↓ ↓ β†— Layer 1 ← gone! Layer 1 ← strong gradients ``` ### Feed-Forward Network (MLP): The Thinking Layer The MLP provides the actual "thinking" in each transformer block. It's a simple two-layer network with a specific expansion pattern. ``` MLP Architecture: Input (embed_dim) β†’ Linear β†’ GELU β†’ Linear β†’ Output (embed_dim) 512 2048 2048 512 (4x expansion) Mathematical Formula: FFN(x) = Linearβ‚‚(GELU(Linear₁(x))) = Wβ‚‚ Β· GELU(W₁ Β· x + b₁) + bβ‚‚ where: W₁: (embed_dim, 4*embed_dim) # Expansion matrix Wβ‚‚: (4*embed_dim, embed_dim) # Contraction matrix GELU: smooth activation function (better than ReLU for language) ``` **Why 4x Expansion?** - **Capacity**: More parameters = more representation power - **Non-linearity**: GELU activation creates complex transformations - **Information Bottleneck**: Forces the model to compress useful information ### The Complete Transformer Block Data Flow ``` Input Tensor (batch, seq_len, embed_dim) ↓ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ ATTENTION SUB-LAYER β”‚ β”‚ β”‚ β”‚ x₁ = LayerNorm(xβ‚€) β”‚ β”‚ attention_out = MultiHeadAttn(x₁) β”‚ β”‚ xβ‚‚ = xβ‚€ + attention_out (residual) β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ ↓ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ MLP SUB-LAYER β”‚ β”‚ β”‚ β”‚ x₃ = LayerNorm(xβ‚‚) β”‚ β”‚ mlp_out = MLP(x₃) β”‚ β”‚ xβ‚„ = xβ‚‚ + mlp_out (residual) β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ ↓ Output Tensor (batch, seq_len, embed_dim) ``` **Key Insight**: Each sub-layer (attention and MLP) gets a "clean" normalized input but adds its contribution to the residual stream. This creates a stable training dynamic. """ # %% [markdown] """ ## 3. Implementation: Building Transformer Components Now we'll implement each transformer component with a clear understanding of their role in the overall architecture. We'll follow the pattern: **Explanation β†’ Implementation β†’ Test** for each component. Each component serves a specific purpose: - **LayerNorm**: Stabilizes training and normalizes activations - **MLP**: Provides non-linear transformation and "thinking" capacity - **TransformerBlock**: Combines attention with MLP using residual connections - **GPT**: Complete autoregressive language model for text generation """ # %% [markdown] """ ### Understanding Layer Normalization Layer Normalization is the foundation of stable transformer training. Unlike batch normalization, it normalizes each sample independently across its feature dimensions. #### Why Layer Norm is Essential Without normalization, deep networks suffer from "internal covariate shift" - the distribution of inputs to each layer changes during training, making learning unstable. #### Layer Norm Visualization ``` β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ LAYER NORMALIZATION: Stabilizing Deep Networks β”‚ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”‚ β”‚ β”‚ INPUT TENSOR: (batch=2, seq=3, features=4) β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ Sample 1: [[1.0, 2.0, 3.0, 4.0], ← Position 0 β”‚ β”‚ β”‚ β”‚ [5.0, 6.0, 7.0, 8.0], ← Position 1 β”‚ β”‚ β”‚ β”‚ [9.0, 10.0, 11.0, 12.0]] ← Position 2 β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ Sample 2: [[13., 14., 15., 16.], ← Position 0 β”‚ β”‚ β”‚ β”‚ [17., 18., 19., 20.], ← Position 1 β”‚ β”‚ β”‚ β”‚ [21., 22., 23., 24.]] ← Position 2 β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ ↓ β”‚ β”‚ NORMALIZE ACROSS FEATURES (per position) β”‚ β”‚ ↓ β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ AFTER NORMALIZATION: Each position β†’ mean=0, std=1 β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ Sample 1: [[-1.34, -0.45, 0.45, 1.34], β”‚ β”‚ β”‚ β”‚ [-1.34, -0.45, 0.45, 1.34], β”‚ β”‚ β”‚ β”‚ [-1.34, -0.45, 0.45, 1.34]] β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ Sample 2: [[-1.34, -0.45, 0.45, 1.34], β”‚ β”‚ β”‚ β”‚ [-1.34, -0.45, 0.45, 1.34], β”‚ β”‚ β”‚ β”‚ [-1.34, -0.45, 0.45, 1.34]] β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ ↓ β”‚ β”‚ APPLY LEARNABLE PARAMETERS: Ξ³ * norm + Ξ² β”‚ β”‚ ↓ β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ FINAL OUTPUT: Model can learn any desired distribution β”‚ β”‚ β”‚ β”‚ Ξ³ (scale) and Ξ² (shift) are learned during training β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ β”‚ KEY INSIGHT: Unlike batch norm, each sample normalized β”‚ β”‚ independently - perfect for variable-length sequences! β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ ``` #### Key Properties - **Per-sample normalization**: Each sequence position normalized independently - **Learnable parameters**: Ξ³ (scale) and Ξ² (shift) allow the model to recover any desired distribution - **Gradient friendly**: Helps gradients flow smoothly through deep networks """ # %% nbgrader={"grade": false, "grade_id": "layer-norm", "solution": true} #| export class LayerNorm: """ Layer Normalization for transformer blocks. Normalizes across the feature dimension (last axis) for each sample independently, unlike batch normalization which normalizes across the batch dimension. """ def __init__(self, normalized_shape, eps=1e-5): """ Initialize LayerNorm with learnable parameters. TODO: Set up normalization parameters APPROACH: 1. Store the shape to normalize over (usually embed_dim) 2. Initialize learnable scale (gamma) and shift (beta) parameters 3. Set small epsilon for numerical stability EXAMPLE: >>> ln = LayerNorm(512) # For 512-dimensional embeddings >>> x = Tensor(np.random.randn(2, 10, 512)) # (batch, seq, features) >>> normalized = ln.forward(x) >>> # Each (2, 10) sample normalized independently across 512 features HINTS: - gamma should start at 1.0 (identity scaling) - beta should start at 0.0 (no shift) - eps prevents division by zero in variance calculation """ ### BEGIN SOLUTION self.normalized_shape = normalized_shape self.eps = eps # Learnable parameters: scale and shift self.gamma = Tensor(np.ones(normalized_shape), requires_grad=True) # Scale parameter self.beta = Tensor(np.zeros(normalized_shape), requires_grad=True) # Shift parameter ### END SOLUTION def forward(self, x): """ Apply layer normalization. TODO: Implement layer normalization formula APPROACH: 1. Compute mean and variance across the last dimension 2. Normalize: (x - mean) / sqrt(variance + eps) 3. Apply learnable scale and shift: gamma * normalized + beta MATHEMATICAL FORMULA: y = (x - ΞΌ) / Οƒ * Ξ³ + Ξ² where ΞΌ = mean(x), Οƒ = sqrt(var(x) + Ξ΅) HINT: Use keepdims=True to maintain tensor dimensions for broadcasting """ ### BEGIN SOLUTION # Compute statistics across last dimension (features) mean = x.mean(axis=-1, keepdims=True) # Compute variance: E[(x - ΞΌ)Β²] # Use Tensor operations to preserve computation graph! diff = x - mean variance = (diff * diff).mean(axis=-1, keepdims=True) # Normalize - use Tensor operations to preserve gradients! # Add eps as a Tensor for proper gradient flow eps_tensor = Tensor(np.array(self.eps), requires_grad=False) std = Tensor(np.sqrt(variance.data + self.eps), requires_grad=variance.requires_grad) normalized = (x - mean) / std # Apply learnable transformation output = normalized * self.gamma + self.beta return output ### END SOLUTION def __call__(self, x): """Allows the layer norm to be called like a function.""" return self.forward(x) def parameters(self): """Return learnable parameters.""" return [self.gamma, self.beta] # %% [markdown] """ ### πŸ”¬ Unit Test: Layer Normalization This test validates our LayerNorm implementation works correctly. **What we're testing**: Normalization statistics and parameter learning **Why it matters**: Essential for transformer stability and training **Expected**: Mean β‰ˆ 0, std β‰ˆ 1 after normalization, learnable parameters work """ # %% nbgrader={"grade": true, "grade_id": "test-layer-norm", "locked": true, "points": 10} def test_unit_layer_norm(): """πŸ”¬ Test LayerNorm implementation.""" print("πŸ”¬ Unit Test: Layer Normalization...") # Test basic normalization ln = LayerNorm(4) x = Tensor([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]) # (2, 4) normalized = ln.forward(x) # Check output shape assert normalized.shape == (2, 4) # Check normalization properties (approximately) # For each sample, mean should be close to 0, std close to 1 for i in range(2): sample_mean = np.mean(normalized.data[i]) sample_std = np.std(normalized.data[i]) assert abs(sample_mean) < 1e-5, f"Mean should be ~0, got {sample_mean}" assert abs(sample_std - 1.0) < 1e-4, f"Std should be ~1, got {sample_std}" # Test parameter shapes params = ln.parameters() assert len(params) == 2 assert params[0].shape == (4,) # gamma assert params[1].shape == (4,) # beta print("βœ… LayerNorm works correctly!") # Run test immediately when developing this module if __name__ == "__main__": test_unit_layer_norm() # %% [markdown] """ ### Understanding the Multi-Layer Perceptron (MLP) The MLP is where the "thinking" happens in each transformer block. It's a simple feed-forward network that provides non-linear transformation capacity. #### The Role of MLP in Transformers While attention handles relationships between tokens, the MLP processes each position independently, adding computational depth and non-linearity. #### MLP Architecture and Information Flow ``` Information Flow Through MLP: Input: (batch, seq_len, embed_dim=512) ↓ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ Linear Layer 1: Expansion β”‚ β”‚ Weight: (512, 2048) Bias: (2048,) β”‚ β”‚ Output: (batch, seq_len, 2048) β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ ↓ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ GELU Activation β”‚ β”‚ Smooth, differentiable activation β”‚ β”‚ Better than ReLU for language modeling β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ ↓ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ Linear Layer 2: Contraction β”‚ β”‚ Weight: (2048, 512) Bias: (512,) β”‚ β”‚ Output: (batch, seq_len, 512) β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ ↓ Output: (batch, seq_len, embed_dim=512) ``` #### Why 4x Expansion? ``` Parameter Count Analysis: Embed Dim: 512 MLP Hidden: 2048 (4x expansion) Parameters: - Linear1: 512 Γ— 2048 + 2048 = 1,050,624 - Linear2: 2048 Γ— 512 + 512 = 1,049,088 - Total MLP: ~2.1M parameters For comparison: - Attention (same embed_dim): ~1.5M parameters - MLP has MORE parameters β†’ more computational capacity ``` #### GELU vs ReLU ``` Activation Function Comparison: ReLU(x) = max(0, x) # Hard cutoff at 0 β”Œβ”€β”€β”€β”€ β”‚ β”€β”€β”€β”€β”€β”˜ 0 GELU(x) β‰ˆ x * Ξ¦(x) # Smooth, probabilistic ╭──── β•± ───╱ β•± 0 GELU is smoother and provides better gradients for language modeling. ``` """ # %% nbgrader={"grade": false, "grade_id": "mlp", "solution": true} #| export class MLP: """ Multi-Layer Perceptron (Feed-Forward Network) for transformer blocks. Standard pattern: Linear -> GELU -> Linear with expansion ratio of 4:1. This provides the non-linear transformation in each transformer block. """ def __init__(self, embed_dim, hidden_dim=None, dropout_prob=0.1): """ Initialize MLP with two linear layers. TODO: Set up the feed-forward network layers APPROACH: 1. First layer expands from embed_dim to hidden_dim (usually 4x larger) 2. Second layer projects back to embed_dim 3. Use GELU activation (smoother than ReLU, preferred in transformers) EXAMPLE: >>> mlp = MLP(512) # Will create 512 -> 2048 -> 512 network >>> x = Tensor(np.random.randn(2, 10, 512)) >>> output = mlp.forward(x) >>> assert output.shape == (2, 10, 512) HINT: Standard transformer MLP uses 4x expansion (hidden_dim = 4 * embed_dim) """ ### BEGIN SOLUTION if hidden_dim is None: hidden_dim = 4 * embed_dim # Standard 4x expansion self.embed_dim = embed_dim self.hidden_dim = hidden_dim # Two-layer feed-forward network self.linear1 = Linear(embed_dim, hidden_dim) self.gelu = GELU() # Use GELU activation from activations module self.linear2 = Linear(hidden_dim, embed_dim) ### END SOLUTION def forward(self, x): """ Forward pass through MLP. TODO: Implement the feed-forward computation APPROACH: 1. First linear transformation: embed_dim -> hidden_dim 2. Apply GELU activation (smooth, differentiable) 3. Second linear transformation: hidden_dim -> embed_dim COMPUTATION FLOW: x -> Linear -> GELU -> Linear -> output HINT: GELU activation is implemented above as a function """ ### BEGIN SOLUTION # First linear layer with expansion hidden = self.linear1.forward(x) # GELU activation (YOUR activation from Module 03!) hidden = self.gelu.forward(hidden) # Second linear layer back to original size output = self.linear2.forward(hidden) return output ### END SOLUTION def __call__(self, x): """Allows the MLP to be called like a function.""" return self.forward(x) def parameters(self): """Return all learnable parameters.""" params = [] params.extend(self.linear1.parameters()) params.extend(self.linear2.parameters()) return params # %% [markdown] """ ### πŸ”¬ Unit Test: MLP (Feed-Forward Network) This test validates our MLP implementation works correctly. **What we're testing**: Shape preservation and parameter counting **Why it matters**: MLP provides the non-linear transformation in transformers **Expected**: Input/output shapes match, correct parameter count """ # %% nbgrader={"grade": true, "grade_id": "test-mlp", "locked": true, "points": 10} def test_unit_mlp(): """πŸ”¬ Test MLP implementation.""" print("πŸ”¬ Unit Test: MLP (Feed-Forward Network)...") # Test MLP with standard 4x expansion embed_dim = 64 mlp = MLP(embed_dim) # Test forward pass batch_size, seq_len = 2, 10 x = Tensor(np.random.randn(batch_size, seq_len, embed_dim)) output = mlp.forward(x) # Check shape preservation assert output.shape == (batch_size, seq_len, embed_dim) # Check hidden dimension is 4x assert mlp.hidden_dim == 4 * embed_dim # Test parameter counting params = mlp.parameters() expected_params = 4 # 2 weights + 2 biases assert len(params) == expected_params # Test custom hidden dimension custom_mlp = MLP(embed_dim, hidden_dim=128) assert custom_mlp.hidden_dim == 128 print("βœ… MLP works correctly!") # Run test immediately when developing this module if __name__ == "__main__": test_unit_mlp() # %% [markdown] """ ### Understanding the Complete Transformer Block The TransformerBlock is the core building unit of GPT and other transformer models. It combines self-attention with feed-forward processing using a carefully designed residual architecture. #### Pre-Norm vs Post-Norm Architecture Modern transformers use "pre-norm" architecture where LayerNorm comes BEFORE the sub-layers, not after. This provides better training stability. ``` Pre-Norm Architecture (What We Implement): β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ INPUT (x) β”‚ β”‚ β”‚ β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β–Ό β”‚ β”‚ β”‚ LayerNorm β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β–Ό β”‚ β”‚ β”‚ MultiHeadAttention β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ (residual connection) β”‚ β”‚ β–Ό β”‚ β”‚ x + attention β”‚ β”‚ β”‚ β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β–Ό β”‚ β”‚ β”‚ LayerNorm β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β–Ό β”‚ β”‚ β”‚ MLP β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ (residual connection) β”‚ β”‚ β–Ό β”‚ β”‚ x + mlp β”‚ β”‚ β”‚ β”‚ β”‚ β–Ό β”‚ β”‚ OUTPUT β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ ``` #### Why Pre-Norm Works Better **Training Stability**: LayerNorm before operations provides clean, normalized inputs to attention and MLP layers. **Gradient Flow**: Residual connections carry gradients directly from output to input, bypassing the normalized operations. **Deeper Networks**: Pre-norm enables training much deeper networks (100+ layers) compared to post-norm. #### Information Processing in Transformer Block ``` Step-by-Step Data Transformation: 1. Input Processing: xβ‚€: (batch, seq_len, embed_dim) # Original input 2. Attention Sub-layer: x₁ = LayerNorm(xβ‚€) # Normalize input attn_out = MultiHeadAttn(x₁) # Self-attention xβ‚‚ = xβ‚€ + attn_out # Residual connection 3. MLP Sub-layer: x₃ = LayerNorm(xβ‚‚) # Normalize again mlp_out = MLP(x₃) # Feed-forward xβ‚„ = xβ‚‚ + mlp_out # Final residual 4. Output: return xβ‚„ # Ready for next block ``` #### Residual Stream Concept Think of the residual connections as a "stream" that carries information through the network: ``` Residual Stream Flow: Layer 1: [original embeddings] ─┐ β”œβ”€β†’ + attention info ─┐ Attention adds information β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”œβ”€β†’ + MLP info ─┐ MLP adds information β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ Layer 2: carries accumulated information β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ ``` Each layer adds information to this stream rather than replacing it, creating a rich representation. """ # %% nbgrader={"grade": false, "grade_id": "transformer-block", "solution": true} #| export class TransformerBlock: """ Complete Transformer Block with self-attention, MLP, and residual connections. This is the core building block of GPT and other transformer models. Each block processes the input sequence and passes it to the next block. """ def __init__(self, embed_dim, num_heads, mlp_ratio=4, dropout_prob=0.1): """ Initialize a complete transformer block. TODO: Set up all components of the transformer block APPROACH: 1. Multi-head self-attention for sequence modeling 2. First layer normalization (pre-norm architecture) 3. MLP with specified expansion ratio 4. Second layer normalization TRANSFORMER BLOCK ARCHITECTURE: x β†’ LayerNorm β†’ MultiHeadAttention β†’ + (residual) β†’ LayerNorm β†’ MLP β†’ + (residual) β†’ output EXAMPLE: >>> block = TransformerBlock(embed_dim=512, num_heads=8) >>> x = Tensor(np.random.randn(2, 10, 512)) # (batch, seq, embed) >>> output = block.forward(x) >>> assert output.shape == (2, 10, 512) HINT: We use pre-norm architecture (LayerNorm before attention/MLP) """ ### BEGIN SOLUTION self.embed_dim = embed_dim self.num_heads = num_heads # Multi-head self-attention self.attention = MultiHeadAttention(embed_dim, num_heads) # Layer normalizations (pre-norm architecture) self.ln1 = LayerNorm(embed_dim) # Before attention self.ln2 = LayerNorm(embed_dim) # Before MLP # Feed-forward network hidden_dim = int(embed_dim * mlp_ratio) self.mlp = MLP(embed_dim, hidden_dim) ### END SOLUTION def forward(self, x, mask=None): """ Forward pass through transformer block. TODO: Implement the complete transformer block computation APPROACH: 1. Apply layer norm, then self-attention, then add residual 2. Apply layer norm, then MLP, then add residual 3. Return the transformed sequence COMPUTATION FLOW: x β†’ ln1 β†’ attention β†’ + x β†’ ln2 β†’ mlp β†’ + β†’ output RESIDUAL CONNECTIONS: These are crucial for training deep networks - they allow gradients to flow directly through the network during backpropagation. HINT: Store intermediate results to add residual connections properly """ ### BEGIN SOLUTION # First sub-layer: Multi-head self-attention with residual connection # Pre-norm: LayerNorm before attention normed1 = self.ln1.forward(x) # Self-attention: query, key, value are all the same (normed1) attention_out = self.attention.forward(normed1, mask) # Residual connection x = x + attention_out # Second sub-layer: MLP with residual connection # Pre-norm: LayerNorm before MLP normed2 = self.ln2.forward(x) mlp_out = self.mlp.forward(normed2) # Residual connection output = x + mlp_out return output ### END SOLUTION def __call__(self, x, mask=None): """Allows the transformer block to be called like a function.""" return self.forward(x, mask) def parameters(self): """Return all learnable parameters.""" params = [] params.extend(self.attention.parameters()) params.extend(self.ln1.parameters()) params.extend(self.ln2.parameters()) params.extend(self.mlp.parameters()) return params # %% [markdown] """ ### πŸ”¬ Unit Test: Transformer Block This test validates our complete TransformerBlock implementation. **What we're testing**: Shape preservation, residual connections, parameter counting **Why it matters**: This is the core component that will be stacked to create GPT **Expected**: Input/output shapes match, all components work together """ # %% nbgrader={"grade": true, "grade_id": "test-transformer-block", "locked": true, "points": 15} def test_unit_transformer_block(): """πŸ”¬ Test TransformerBlock implementation.""" print("πŸ”¬ Unit Test: Transformer Block...") # Test transformer block embed_dim = 64 num_heads = 4 block = TransformerBlock(embed_dim, num_heads) # Test forward pass batch_size, seq_len = 2, 8 x = Tensor(np.random.randn(batch_size, seq_len, embed_dim)) output = block.forward(x) # Check shape preservation assert output.shape == (batch_size, seq_len, embed_dim) # Test with causal mask (for autoregressive generation) mask = Tensor(np.triu(np.ones((seq_len, seq_len)) * -np.inf, k=1)) masked_output = block.forward(x, mask) assert masked_output.shape == (batch_size, seq_len, embed_dim) # Test parameter counting params = block.parameters() expected_components = 4 # attention, ln1, ln2, mlp parameters assert len(params) > expected_components # Should have parameters from all components # Test different configurations large_block = TransformerBlock(embed_dim=128, num_heads=8, mlp_ratio=2) assert large_block.mlp.hidden_dim == 256 # 128 * 2 print("βœ… TransformerBlock works correctly!") # Run test immediately when developing this module if __name__ == "__main__": test_unit_transformer_block() # %% [markdown] """ ### Understanding the Complete GPT Architecture GPT (Generative Pre-trained Transformer) is the complete language model that combines all our components into a text generation system. It's designed for **autoregressive** generation - predicting the next token based on all previous tokens. #### GPT's Autoregressive Nature GPT generates text one token at a time, using all previously generated tokens as context: ``` Autoregressive Generation Process: Step 1: "The cat" β†’ model predicts β†’ "sat" Step 2: "The cat sat" β†’ model predicts β†’ "on" Step 3: "The cat sat on" β†’ model predicts β†’ "the" Step 4: "The cat sat on the" β†’ model predicts β†’ "mat" Result: "The cat sat on the mat" ``` #### Complete GPT Architecture ``` β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ GPT ARCHITECTURE β”‚ β”‚ β”‚ β”‚ Input: Token IDs [15496, 1917, ...] β”‚ β”‚ β”‚ β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ EMBEDDING LAYER β”‚ β”‚ β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”‚ β”‚ β”‚ β”‚ β”‚Token Embed β”‚+β”‚Position Embed β”‚β”‚ β”‚ β”‚ β”‚ β”‚vocabβ†’vector β”‚β”‚ β”‚sequenceβ†’vector β”‚β”‚ β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ TRANSFORMER BLOCK 1 β”‚ β”‚ β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ β”‚ β”‚LayerNormβ”‚β†’β”‚Attentionβ”‚β†’β”‚ +x β”‚ β”‚ β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”¬β”€β”€β”€β”˜ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β–Όβ”€β”€β”€β” β”‚ β”‚ β”‚ β”‚ β”‚LayerNormβ”‚β†’β”‚ MLP β”‚β†’β”‚ +x β”‚ β”‚ β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ β”‚ β”‚ ... (more transformer blocks) ... β”‚ β”‚ β”‚ β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ OUTPUT HEAD β”‚ β”‚ β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ β”‚ β”‚LayerNormβ”‚β†’β”‚Linear(embedβ†’vocab) β”‚ β”‚ β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ β”‚ β”‚ Output: Vocabulary Logits [0.1, 0.05, 0.8, ...] β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ ``` #### Causal Masking for Autoregressive Training During training, GPT sees the entire sequence but must not "cheat" by looking at future tokens: ``` β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ CAUSAL MASKING: Preventing Future Information Leakage β”‚ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”‚ β”‚ β”‚ SEQUENCE: ["The", "cat", "sat", "on"] β”‚ β”‚ POSITIONS: 0 1 2 3 β”‚ β”‚ β”‚ β”‚ ATTENTION MATRIX (what each position can see): β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ Pos: 0 1 2 3 β”‚ β”‚ β”‚ β”‚ Pos 0: [ βœ“ βœ— βœ— βœ— ] ← "The" only sees itself β”‚ β”‚ β”‚ β”‚ Pos 1: [ βœ“ βœ“ βœ— βœ— ] ← "cat" sees "The" + self β”‚ β”‚ β”‚ β”‚ Pos 2: [ βœ“ βœ“ βœ“ βœ— ] ← "sat" sees all previous β”‚ β”‚ β”‚ β”‚ Pos 3: [ βœ“ βœ“ βœ“ βœ“ ] ← "on" sees everything β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ β”‚ IMPLEMENTATION: Upper triangular matrix with -∞ β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ [[ 0, -∞, -∞, -∞], β”‚ β”‚ β”‚ β”‚ [ 0, 0, -∞, -∞], β”‚ β”‚ β”‚ β”‚ [ 0, 0, 0, -∞], β”‚ β”‚ β”‚ β”‚ [ 0, 0, 0, 0]] β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ After softmax: -∞ becomes 0 probability β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ β”‚ WHY THIS WORKS: During training, model sees entire sequence β”‚ β”‚ but mask ensures position i only attends to positions ≀ i β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ ``` #### Generation Temperature Control Temperature controls the randomness of generation: ``` Temperature Effects: Original logits: [1.0, 2.0, 3.0] Temperature = 0.1 (Conservative): Scaled: [10.0, 20.0, 30.0] β†’ Sharp distribution Probs: [0.00, 0.00, 1.00] β†’ Always picks highest Temperature = 1.0 (Balanced): Scaled: [1.0, 2.0, 3.0] β†’ Moderate distribution Probs: [0.09, 0.24, 0.67] β†’ Weighted sampling Temperature = 2.0 (Creative): Scaled: [0.5, 1.0, 1.5] β†’ Flatter distribution Probs: [0.18, 0.33, 0.49] β†’ More random ``` #### Model Scaling and Parameters ``` GPT Model Size Scaling: Tiny GPT (our implementation): - embed_dim: 64, layers: 2, heads: 4 - Parameters: ~50K - Use case: Learning and experimentation GPT-2 Small: - embed_dim: 768, layers: 12, heads: 12 - Parameters: 117M - Use case: Basic text generation GPT-3: - embed_dim: 12,288, layers: 96, heads: 96 - Parameters: 175B - Use case: Advanced language understanding GPT-4 (estimated): - embed_dim: ~16,384, layers: ~120, heads: ~128 - Parameters: ~1.7T - Use case: Reasoning and multimodal tasks ``` """ # %% nbgrader={"grade": false, "grade_id": "gpt", "solution": true} #| export class GPT: """ Complete GPT (Generative Pre-trained Transformer) model. This combines embeddings, positional encoding, multiple transformer blocks, and a language modeling head for text generation. """ def __init__(self, vocab_size, embed_dim, num_layers, num_heads, max_seq_len=1024): """ Initialize complete GPT model. TODO: Set up all components of the GPT architecture APPROACH: 1. Token embedding layer to convert tokens to vectors 2. Positional embedding to add position information 3. Stack of transformer blocks (the main computation) 4. Final layer norm and language modeling head GPT ARCHITECTURE: tokens β†’ embedding β†’ + pos_embedding β†’ transformer_blocks β†’ layer_norm β†’ lm_head β†’ logits EXAMPLE: >>> model = GPT(vocab_size=1000, embed_dim=256, num_layers=6, num_heads=8) >>> tokens = Tensor(np.random.randint(0, 1000, (2, 10))) # (batch, seq) >>> logits = model.forward(tokens) >>> assert logits.shape == (2, 10, 1000) # (batch, seq, vocab) HINTS: - Positional embeddings are learned, not fixed sinusoidal - Final layer norm stabilizes training - Language modeling head shares weights with token embedding (tie_weights) """ ### BEGIN SOLUTION self.vocab_size = vocab_size self.embed_dim = embed_dim self.num_layers = num_layers self.num_heads = num_heads self.max_seq_len = max_seq_len # Token and positional embeddings self.token_embedding = Embedding(vocab_size, embed_dim) self.position_embedding = Embedding(max_seq_len, embed_dim) # Stack of transformer blocks self.blocks = [] for _ in range(num_layers): block = TransformerBlock(embed_dim, num_heads) self.blocks.append(block) # Final layer normalization self.ln_f = LayerNorm(embed_dim) # Language modeling head (projects to vocabulary) self.lm_head = Linear(embed_dim, vocab_size, bias=False) ### END SOLUTION def forward(self, tokens): """ Forward pass through GPT model. TODO: Implement the complete GPT forward pass APPROACH: 1. Get token embeddings and positional embeddings 2. Add them together (broadcasting handles different shapes) 3. Pass through all transformer blocks sequentially 4. Apply final layer norm and language modeling head COMPUTATION FLOW: tokens β†’ embed + pos_embed β†’ blocks β†’ ln_f β†’ lm_head β†’ logits CAUSAL MASKING: For autoregressive generation, we need to prevent tokens from seeing future tokens. This is handled by the attention mask. HINT: Create position indices as range(seq_len) for positional embedding """ ### BEGIN SOLUTION batch_size, seq_len = tokens.shape # Token embeddings token_emb = self.token_embedding.forward(tokens) # Positional embeddings positions = Tensor(np.arange(seq_len).reshape(1, seq_len)) pos_emb = self.position_embedding.forward(positions) # Combine embeddings x = token_emb + pos_emb # Create causal mask for autoregressive generation mask = self._create_causal_mask(seq_len) # Pass through transformer blocks for block in self.blocks: x = block.forward(x, mask) # Final layer normalization x = self.ln_f.forward(x) # Language modeling head logits = self.lm_head.forward(x) return logits ### END SOLUTION def __call__(self, tokens): """Allows the GPT model to be called like a function.""" return self.forward(tokens) def _create_causal_mask(self, seq_len): """Create causal mask to prevent attending to future positions.""" ### BEGIN SOLUTION # Upper triangular matrix filled with -inf mask = np.triu(np.ones((seq_len, seq_len)) * -np.inf, k=1) return Tensor(mask) ### END SOLUTION def generate(self, prompt_tokens, max_new_tokens=50, temperature=1.0): """ Generate text autoregressively. TODO: Implement autoregressive text generation APPROACH: 1. Start with prompt tokens 2. For each new position: - Run forward pass to get logits - Sample next token from logits - Append to sequence 3. Return generated sequence AUTOREGRESSIVE GENERATION: At each step, the model predicts the next token based on all previous tokens. This is how GPT generates coherent text. EXAMPLE: >>> model = GPT(vocab_size=100, embed_dim=64, num_layers=2, num_heads=4) >>> prompt = Tensor([[1, 2, 3]]) # Some token sequence >>> generated = model.generate(prompt, max_new_tokens=5) >>> assert generated.shape[1] == 3 + 5 # original + new tokens HINT: Use np.random.choice with temperature for sampling """ ### BEGIN SOLUTION current_tokens = Tensor(prompt_tokens.data.copy()) for _ in range(max_new_tokens): # Get logits for current sequence logits = self.forward(current_tokens) # Get logits for last position (next token prediction) last_logits = logits.data[:, -1, :] # (batch_size, vocab_size) # Apply temperature scaling scaled_logits = last_logits / temperature # Convert to probabilities (softmax) exp_logits = np.exp(scaled_logits - np.max(scaled_logits, axis=-1, keepdims=True)) probs = exp_logits / np.sum(exp_logits, axis=-1, keepdims=True) # Sample next token next_token = np.array([[np.random.choice(self.vocab_size, p=probs[0])]]) # Append to sequence current_tokens = Tensor(np.concatenate([current_tokens.data, next_token], axis=1)) return current_tokens ### END SOLUTION def parameters(self): """Return all learnable parameters.""" params = [] params.extend(self.token_embedding.parameters()) params.extend(self.position_embedding.parameters()) for block in self.blocks: params.extend(block.parameters()) params.extend(self.ln_f.parameters()) params.extend(self.lm_head.parameters()) return params # %% [markdown] """ ### πŸ”¬ Unit Test: GPT Model This test validates our complete GPT implementation. **What we're testing**: Model forward pass, shape consistency, generation capability **Why it matters**: This is the complete language model that ties everything together **Expected**: Correct output shapes, generation works, parameter counting """ # %% nbgrader={"grade": true, "grade_id": "test-gpt", "locked": true, "points": 20} def test_unit_gpt(): """πŸ”¬ Test GPT model implementation.""" print("πŸ”¬ Unit Test: GPT Model...") # Test small GPT model vocab_size = 100 embed_dim = 64 num_layers = 2 num_heads = 4 model = GPT(vocab_size, embed_dim, num_layers, num_heads) # Test forward pass batch_size, seq_len = 2, 8 tokens = Tensor(np.random.randint(0, vocab_size, (batch_size, seq_len))) logits = model.forward(tokens) # Check output shape expected_shape = (batch_size, seq_len, vocab_size) assert logits.shape == expected_shape # Test generation prompt = Tensor(np.random.randint(0, vocab_size, (1, 5))) generated = model.generate(prompt, max_new_tokens=3) # Check generation shape assert generated.shape == (1, 8) # 5 prompt + 3 new tokens # Test parameter counting params = model.parameters() assert len(params) > 10 # Should have many parameters from all components # Test different model sizes larger_model = GPT(vocab_size=200, embed_dim=128, num_layers=4, num_heads=8) test_tokens = Tensor(np.random.randint(0, 200, (1, 10))) larger_logits = larger_model.forward(test_tokens) assert larger_logits.shape == (1, 10, 200) print("βœ… GPT model works correctly!") # Run test immediately when developing this module if __name__ == "__main__": test_unit_gpt() # %% [markdown] """ ## 4. Integration: Complete Transformer Workflow Now that we've built all the components, let's see how they work together in a complete language modeling pipeline. This demonstrates the full power of the transformer architecture. ### The Language Modeling Pipeline ``` Complete Workflow Visualization: 1. Text Input: "hello world" β†’ Tokenization β†’ [15496, 1917] 2. Model Processing: [15496, 1917] ↓ Token Embedding [[0.1, 0.5, ...], [0.3, -0.2, ...]] # Vector representations ↓ + Position Embedding [[0.2, 0.7, ...], [0.1, -0.4, ...]] # With position info ↓ Transformer Block 1 [[0.3, 0.2, ...], [0.5, -0.1, ...]] # After attention + MLP ↓ Transformer Block 2 [[0.1, 0.9, ...], [0.7, 0.3, ...]] # Further processed ↓ Final LayerNorm + LM Head [[0.1, 0.05, 0.8, ...], [...]] # Probability over vocab 3. Generation: Model predicts next token: "!" (token 33) New sequence: "hello world!" ``` This integration demo will show: - **Character-level tokenization** for simplicity - **Forward pass** through all components - **Autoregressive generation** in action - **Temperature effects** on creativity """ # %% nbgrader={"grade": false, "grade_id": "integration-demo", "solution": true} def demonstrate_transformer_integration(): """ Demonstrate complete transformer pipeline. This simulates training a small language model on a simple vocabulary. """ print("πŸ”— Integration Demo: Complete Language Model Pipeline") print("Building a mini-GPT for character-level text generation") # Create a small vocabulary (character-level) vocab = list("abcdefghijklmnopqrstuvwxyz .") vocab_size = len(vocab) char_to_idx = {char: i for i, char in enumerate(vocab)} idx_to_char = {i: char for i, char in enumerate(vocab)} print(f"Vocabulary size: {vocab_size}") print(f"Characters: {''.join(vocab)}") # Create model model = GPT( vocab_size=vocab_size, embed_dim=64, num_layers=2, num_heads=4, max_seq_len=32 ) # Sample text encoding text = "hello world." tokens = [char_to_idx[char] for char in text] input_tokens = Tensor(np.array([tokens])) print(f"\nOriginal text: '{text}'") print(f"Tokenized: {tokens}") print(f"Input shape: {input_tokens.shape}") # Forward pass logits = model.forward(input_tokens) print(f"Output logits shape: {logits.shape}") print(f"Each position predicts next token from {vocab_size} possibilities") # Generation demo prompt_text = "hello" prompt_tokens = [char_to_idx[char] for char in prompt_text] prompt = Tensor(np.array([prompt_tokens])) print(f"\nGeneration demo:") print(f"Prompt: '{prompt_text}'") generated = model.generate(prompt, max_new_tokens=8, temperature=1.0) generated_text = ''.join([idx_to_char[idx] for idx in generated.data[0]]) print(f"Generated: '{generated_text}'") print("(Note: Untrained model produces random text)") return model # demonstrate_transformer_integration() # Moved to __main__ block below # %% [markdown] """ ## 5. Systems Analysis: Parameter Scaling and Memory Transformer models scale dramatically with size, leading to both opportunities and challenges. Let's analyze the computational and memory requirements to understand why training large language models requires massive infrastructure. ### The Scaling Laws Revolution One of the key discoveries in modern AI is that transformer performance follows predictable scaling laws: ``` Scaling Laws Pattern: Performance ∝ Parameters^Ξ± Γ— Data^Ξ² Γ— Compute^Ξ³ where Ξ± β‰ˆ 0.7, Ξ² β‰ˆ 0.8, Ξ³ β‰ˆ 0.5 This means: - 10Γ— more parameters β†’ ~5Γ— better performance - 10Γ— more data β†’ ~6Γ— better performance - 10Γ— more compute β†’ ~3Γ— better performance ``` ### Memory Scaling Analysis Memory requirements grow in different ways for different components: ``` Memory Scaling by Component: 1. Parameter Memory (Linear with model size): - Embeddings: vocab_size Γ— embed_dim - Transformer blocks: ~4 Γ— embed_dimΒ² - Total: O(embed_dimΒ²) 2. Attention Memory (Quadratic with sequence length): - Attention matrices: batch Γ— heads Γ— seq_lenΒ² - This is why long context is expensive! - Total: O(seq_lenΒ²) 3. Activation Memory (Linear with batch size): - Forward pass activations for backprop - Scales with: batch Γ— seq_len Γ— embed_dim - Total: O(batch_size) ``` ### The Attention Memory Wall ``` β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ ATTENTION MEMORY WALL: Why Long Context is Expensive β”‚ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”‚ β”‚ β”‚ MEMORY USAGE BY SEQUENCE LENGTH (Quadratic Growth): β”‚ β”‚ β”‚ β”‚ 1K tokens: [β–“] 16 MB ← Manageable β”‚ β”‚ 2K tokens: [β–“β–“β–“β–“] 64 MB ← 4Γ— memory (quadratic!) β”‚ β”‚ 4K tokens: [β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“] 256 MB ← 16Γ— memory β”‚ β”‚ 8K tokens: [β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ] 1 GB β”‚ β”‚ 16K tokens: β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 4 GB β”‚ β”‚ 32K tokens: β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 16 GB β”‚ β”‚ β”‚ β”‚ REAL-WORLD CONTEXT LIMITS: β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ GPT-3: 2K tokens (limited by memory) β”‚ β”‚ β”‚ β”‚ GPT-4: 8K tokens (32K with optimizations) β”‚ β”‚ β”‚ β”‚ Claude-3: 200K tokens (special techniques required!) β”‚ β”‚ β”‚ β”‚ GPT-4o: 128K tokens (efficient attention) β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ β”‚ MATHEMATICAL SCALING: β”‚ β”‚ Memory = batch_size Γ— num_heads Γ— seq_lenΒ² Γ— 4 bytes β”‚ β”‚ ↑ β”‚ β”‚ This is the killer! β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ ``` """ # %% nbgrader={"grade": false, "grade_id": "analyze-scaling", "solution": true} def analyze_parameter_scaling(): """πŸ“Š Analyze how parameter count scales with model dimensions.""" print("πŸ“Š Analyzing Parameter Scaling in Transformers...") print("Understanding why model size affects performance and cost\n") # Test different model sizes configs = [ {"name": "Tiny", "embed_dim": 64, "num_layers": 2, "num_heads": 4}, {"name": "Small", "embed_dim": 128, "num_layers": 4, "num_heads": 8}, {"name": "Medium", "embed_dim": 256, "num_layers": 8, "num_heads": 16}, {"name": "Large", "embed_dim": 512, "num_layers": 12, "num_heads": 16}, ] vocab_size = 50000 # Typical vocabulary size for config in configs: model = GPT( vocab_size=vocab_size, embed_dim=config["embed_dim"], num_layers=config["num_layers"], num_heads=config["num_heads"] ) # Count parameters total_params = 0 for param in model.parameters(): total_params += param.size # Calculate memory requirements (4 bytes per float32 parameter) memory_mb = (total_params * BYTES_PER_FLOAT32) / MB_TO_BYTES print(f"{config['name']} Model:") print(f" Parameters: {total_params:,}") print(f" Memory: {memory_mb:.1f} MB") print(f" Embed dim: {config['embed_dim']}, Layers: {config['num_layers']}") print() print("πŸ’‘ Parameter scaling is roughly quadratic with embedding dimension") print("πŸš€ Real GPT-3 has 175B parameters, requiring ~350GB memory!") if __name__ == "__main__": analyze_parameter_scaling() # %% nbgrader={"grade": false, "grade_id": "analyze-attention-memory", "solution": true} def analyze_attention_memory(): """πŸ“Š Analyze attention memory complexity with sequence length.""" print("πŸ“Š Analyzing Attention Memory Complexity...") print("Why long context is expensive and how it scales\n") embed_dim = 512 num_heads = 8 batch_size = 4 # Test different sequence lengths sequence_lengths = [128, 256, 512, 1024, 2048] print("Attention Matrix Memory Usage:") print("Seq Len | Attention Matrix Size | Memory (MB)") print("-" * 45) for seq_len in sequence_lengths: # Attention matrix is (batch_size, num_heads, seq_len, seq_len) attention_elements = batch_size * num_heads * seq_len * seq_len # 4 bytes per float32 memory_bytes = attention_elements * BYTES_PER_FLOAT32 memory_mb = memory_bytes / MB_TO_BYTES print(f"{seq_len:6d} | {seq_len}Γ—{seq_len} Γ— {batch_size}Γ—{num_heads} | {memory_mb:8.1f}") print() print("πŸ’‘ Attention memory grows quadratically with sequence length") print("πŸš€ This is why techniques like FlashAttention are crucial for long sequences") if __name__ == "__main__": analyze_attention_memory() # %% [markdown] """ ## πŸ§ͺ Module Integration Test Final validation that everything works together correctly. """ # %% nbgrader={"grade": true, "grade_id": "test-module", "locked": true, "points": 25} def test_module(): """πŸ§ͺ Module Test: Complete Integration Comprehensive test of entire module functionality. This final test runs before module summary to ensure: - All unit tests pass - Functions work together correctly - Module is ready for integration with TinyTorch """ print("πŸ§ͺ RUNNING MODULE INTEGRATION TEST") print("=" * 50) # Run all unit tests print("Running unit tests...") test_unit_layer_norm() test_unit_mlp() test_unit_transformer_block() test_unit_gpt() print("\nRunning integration scenarios...") # Test complete transformer training scenario print("πŸ”¬ Integration Test: Full Training Pipeline...") # Create model and data vocab_size = 50 embed_dim = 64 num_layers = 2 num_heads = 4 model = GPT(vocab_size, embed_dim, num_layers, num_heads) # Test batch processing batch_size = 3 seq_len = 16 tokens = Tensor(np.random.randint(0, vocab_size, (batch_size, seq_len))) # Forward pass logits = model.forward(tokens) assert logits.shape == (batch_size, seq_len, vocab_size) # Test generation with different temperatures prompt = Tensor(np.random.randint(0, vocab_size, (1, 8))) # Conservative generation conservative = model.generate(prompt, max_new_tokens=5, temperature=0.1) assert conservative.shape == (1, 13) # Creative generation creative = model.generate(prompt, max_new_tokens=5, temperature=2.0) assert creative.shape == (1, 13) # Test parameter counting consistency total_params = sum(param.size for param in model.parameters()) assert total_params > 1000 # Should have substantial parameters print("βœ… Full transformer pipeline works!") print("\n" + "=" * 50) print("πŸŽ‰ ALL TESTS PASSED! Module ready for export.") print("Run: tito module complete 13") # Call the comprehensive test # test_module() # Only run in __main__ block below # %% if __name__ == "__main__": print("πŸš€ Running Transformers module...") demonstrate_transformer_integration() test_module() print("βœ… Module validation complete!") # %% [markdown] """ ## πŸ€” ML Systems Thinking: Transformer Architecture Foundations ### Question 1: Attention Memory Complexity You implemented multi-head attention that computes attention matrices of size (batch, heads, seq_len, seq_len). For a model with seq_len=1024, batch_size=4, num_heads=8: - How many elements in the attention matrix? _____ - If each element is 4 bytes (float32), how much memory per layer? _____ MB - Why does doubling sequence length quadruple attention memory? _____ ### Question 2: Residual Connection Benefits Your TransformerBlock uses residual connections (x + attention_output, x + mlp_output). - What happens to gradients during backpropagation without residual connections? _____ - How do residual connections help train deeper networks? _____ - Why is pre-norm (LayerNorm before operations) preferred over post-norm? _____ ### Question 3: Parameter Scaling Analysis Your GPT model combines embeddings, transformer blocks, and output projection. For embed_dim=512, vocab_size=10000, num_layers=6: - Token embedding parameters: _____ (vocab_size Γ— embed_dim) - Approximate parameters per transformer block: _____ (hint: ~4 Γ— embed_dimΒ²) - Total model parameters: approximately _____ million ### Question 4: Autoregressive Generation Efficiency Your generate() method processes the full sequence for each new token. - Why is this inefficient for long sequences? _____ - What optimization caches key-value pairs to avoid recomputation? _____ - How would this change the computational complexity from O(nΒ²) to O(n)? _____ """ # %% [markdown] """ ## 🎯 MODULE SUMMARY: Transformers Congratulations! You've built the complete transformer architecture that powers modern language models like GPT, Claude, and ChatGPT! ### Key Accomplishments - Built LayerNorm for stable training across deep transformer networks - Implemented MLP (feed-forward) networks with GELU activation and 4x expansion - Created complete TransformerBlock with self-attention, residual connections, and pre-norm architecture - Built full GPT model with embeddings, positional encoding, and autoregressive generation - Discovered attention memory scaling and parameter distribution patterns - All tests pass βœ… (validated by `test_module()`) ### Ready for Next Steps Your transformer implementation is the capstone of the language modeling pipeline. Export with: `tito module complete 13` **Next**: Module 14 will add profiling and optimization techniques to make your transformers production-ready! """