mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-06-03 12:30:54 -05:00
Standardize Module 12 (Attention) to professional template
- Add complete YAML frontmatter with metadata
- Add INTELLIGENCE tier badge
- Standardize to exactly 5 learning objectives
- Implement Build → Use → Analyze pedagogical pattern
- Add Why This Matters with GPT-4/BERT/AlphaFold production context
- Add historical context from RNNs to Transformers revolution
- Add comprehensive Implementation Guide with scaled dot-product and multi-head attention code
- Add Systems Thinking Questions on O(n²) complexity and multi-head benefits
- Add Real-World Connections to LLMs, translation, and vision transformers
- Reduce emoji usage for professional tone
- Add clear What's Next navigation to Module 13
@@ -1,196 +1,403 @@
|
||||
---
|
||||
title: "Attention"
|
||||
description: "Core attention mechanism and masking utilities"
|
||||
difficulty: "⭐⭐⭐"
|
||||
time_estimate: "4-5 hours"
|
||||
prerequisites: []
|
||||
next_steps: []
|
||||
learning_objectives: []
|
||||
title: "Attention - The Mechanism That Powers Modern AI"
|
||||
description: "Build scaled dot-product and multi-head attention from scratch"
|
||||
difficulty: 3
|
||||
time_estimate: "5-6 hours"
|
||||
prerequisites: ["Tensor", "Layers", "Embeddings"]
|
||||
next_steps: ["Transformers"]
|
||||
learning_objectives:
|
||||
- "Implement scaled dot-product attention with query, key, and value matrices"
|
||||
- "Design multi-head attention for parallel attention subspaces"
|
||||
- "Understand masking strategies for causal, padding, and bidirectional attention"
|
||||
- "Build self-attention mechanisms for sequence-to-sequence modeling"
|
||||
- "Apply attention patterns that power GPT, BERT, and modern transformers"
|
||||
---
|
||||
|
||||
# Module: Attention
|
||||
# 12. Attention
|
||||
|
||||
```{div} badges
|
||||
⭐⭐⭐ | ⏱️ 4-5 hours
|
||||
```
|
||||
**🧠 INTELLIGENCE TIER** | Difficulty: ⭐⭐⭐ (3/4) | Time: 5-6 hours
|
||||
|
||||
## Overview
|
||||
|
||||
## 📊 Module Info
|
||||
- **Difficulty**: ⭐⭐⭐ Advanced
|
||||
- **Time Estimate**: 4-5 hours
|
||||
- **Prerequisites**: Tensor module
|
||||
- **Next Steps**: Training, Transformers modules
|
||||
Implement the attention mechanism that revolutionized AI. This module builds scaled dot-product attention and multi-head attention—the core components of GPT, BERT, and all modern transformer models.
|
||||
|
||||
Build the core attention mechanism that powers modern AI! This module implements the fundamental scaled dot-product attention that's used in ChatGPT, BERT, GPT-4, and virtually all state-of-the-art AI systems.
|
||||
## Learning Objectives
|
||||
|
||||
## 🎯 Learning Objectives
|
||||
By completing this module, you will be able to:
|
||||
|
||||
By the end of this module, you will be able to:
|
||||
1. **Implement scaled dot-product attention** with query, key, and value matrices following the Transformer paper formula
|
||||
2. **Design multi-head attention** for parallel attention in multiple representation subspaces
|
||||
3. **Understand masking strategies** for causal (GPT-style), padding, and bidirectional (BERT-style) attention
|
||||
4. **Build self-attention mechanisms** for sequence-to-sequence modeling with global context
|
||||
5. **Apply attention patterns** that power all modern transformers from GPT-4 to Claude to Gemini
|
||||
|
||||
- **Master the attention formula**: Understand and implement `Attention(Q,K,V) = softmax(QK^T/√d_k)V`
|
||||
- **Build self-attention**: Create the core component that enables global context understanding
|
||||
- **Control information flow**: Implement masking for causal, padding, and bidirectional attention
|
||||
- **Visualize attention patterns**: See what the model "pays attention to"
|
||||
- **Understand modern AI**: Grasp the mechanism that revolutionized natural language processing
|
||||
## Why This Matters
|
||||
|
||||
## 🧠 Build → Use → Understand
|
||||
### Production Context
|
||||
|
||||
This module follows TinyTorch's **Build → Use → Understand** framework:
|
||||
Attention is the core of modern AI:
|
||||
|
||||
1. **Build**: Implement the core attention mechanism and masking utilities from mathematical foundations
|
||||
2. **Use**: Apply attention to sequence tasks and visualize attention patterns
|
||||
3. **Understand**: How attention enables dynamic, global context modeling that powers modern AI
|
||||
- **GPT-3** (175B) uses 96 transformer layers with 96 attention heads each (GPT-4's architecture is unpublished); attention's share of compute grows with context length
|
||||
- **BERT** pioneered bidirectional attention; powers Google Search ranking
|
||||
- **AlphaFold2** uses attention over protein sequences; reached near-experimental accuracy on the 50-year-old protein structure prediction problem
|
||||
- **Vision Transformers** apply the same attention to image patches and now compete with CNNs on many vision benchmarks
|
||||
|
||||
## 📚 What You'll Build
|
||||
### Historical Context
|
||||
|
||||
### Scaled Dot-Product Attention
|
||||
Attention revolutionized machine learning:
|
||||
|
||||
- **RNN Era (pre-2017)**: Sequential processing; no parallelism across time steps; vanishing gradients over long sequences
|
||||
- **Attention is All You Need (2017)**: Pure attention architecture; parallelizable; global context
|
||||
- **BERT/GPT (2018)**: Transformers come to dominate NLP; attention-based models set new state of the art on many benchmarks
|
||||
- **Beyond NLP (2020+)**: Attention powers vision (ViT), biology (AlphaFold), multimodal (CLIP)
|
||||
|
||||
The attention mechanism you're implementing sparked the current AI revolution.
|
||||
|
||||
## Pedagogical Pattern: Build → Use → Analyze
|
||||
|
||||
### 1. Build
|
||||
|
||||
Implement from first principles:
|
||||
- Scaled dot-product attention: `softmax(QK^T/√d_k)V`
|
||||
- Multi-head attention with parallel heads
|
||||
- Masking for causal and padding patterns
|
||||
- Self-attention wrapper (Q=K=V)
|
||||
- Attention visualization and interpretation
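The masking item in the list above can be sketched in plain NumPy. This is a minimal illustration, assuming the convention used by this module's masking utilities (1 = may attend, 0 = blocked); the helper names `causal_mask` and `padding_mask` are hypothetical stand-ins, not the TinyTorch API. It also shows how the two masks combine through broadcasting.

```python
import numpy as np

def causal_mask(seq_len):
    # Lower-triangular matrix: position i may attend to positions <= i.
    return np.tril(np.ones((seq_len, seq_len), dtype=np.int64))

def padding_mask(lengths, max_length):
    # Vectorized: a column index below the sequence length is a real token.
    lengths = np.asarray(lengths)
    real = np.arange(max_length)[None, :] < lengths[:, None]
    # Shape (batch, 1, 1, max_length) so it broadcasts over heads and queries.
    return real.astype(np.int64)[:, None, None, :]

# A key is visible only if it is neither in the future nor padding.
# Broadcasting (1, 1, seq, seq) * (batch, 1, 1, seq) gives (batch, 1, seq, seq).
lengths = [4, 2]
combined = causal_mask(4)[None, None, :, :] * padding_mask(lengths, 4)
print(combined.shape)
print(combined[1, 0])  # second sequence: only its first two tokens are ever visible
```

Multiplying the masks is equivalent to a logical AND, so a single combined mask can be passed wherever one mask is expected.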
|
||||
|
||||
### 2. Use
|
||||
|
||||
Apply to real problems:
|
||||
- Build language model with causal attention
|
||||
- Implement BERT-style bidirectional attention
|
||||
- Visualize attention patterns on real text
|
||||
- Compare single-head vs multi-head performance
|
||||
- Measure O(n²) computational scaling
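Before measuring the O(n²) scaling empirically, it may help to see the arithmetic. The sketch below only counts entries in the attention weight matrix for one layer; the function name `attention_matrix_cost` is illustrative, and 4 bytes per value assumes FP32 storage.

```python
def attention_matrix_cost(batch, heads, seq_len, bytes_per_value=4):
    """Return (number of attention scores, bytes needed to store them)."""
    scores = batch * heads * seq_len * seq_len
    return scores, scores * bytes_per_value

# Single head, single sequence: doubling n quadruples the score count.
for n in (1024, 2048, 4096):
    scores, nbytes = attention_matrix_cost(batch=1, heads=1, seq_len=n)
    print(f"n={n}: {scores:,} scores, {nbytes / 2**20:.0f} MiB in FP32")

# A training-sized example: batch=32, heads=8, seq_len=1024.
scores, nbytes = attention_matrix_cost(32, 8, 1024)
print(f"{scores:,} scores, {nbytes / 2**30:.0f} GiB in FP32")
```

These figures cover the weight matrix of a single layer only; activations kept for the backward pass and the remaining layers add to them.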
|
||||
|
||||
### 3. Analyze
|
||||
|
||||
Deep-dive into design choices:
|
||||
- Why does attention scale quadratically with sequence length?
|
||||
- How do multiple heads capture different linguistic patterns?
|
||||
- Why is the 1/√d_k scaling factor critical for training?
|
||||
- When would you use causal vs bidirectional attention?
|
||||
- What are the memory vs computation trade-offs?
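The question about the 1/√d_k factor can be explored numerically. The sketch below is an illustration under simple assumptions (queries and keys drawn as independent standard normal vectors, which real projections only approximate); the local `softmax` helper is written here for self-containment and is not the TinyTorch one.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_k, n_keys = 512, 16
q = rng.standard_normal((1000, d_k))    # 1000 random queries
K = rng.standard_normal((n_keys, d_k))  # 16 random keys

raw = q @ K.T                 # each score sums d_k products, so variance is about d_k
scaled = raw / np.sqrt(d_k)   # variance is about 1 after scaling

# Average of the largest attention weight per query: close to 1 means
# softmax has collapsed to a nearly one-hot distribution.
raw_peak = softmax(raw).max(axis=-1).mean()
scaled_peak = softmax(scaled).max(axis=-1).mean()
print("score std   raw vs scaled:", raw.std(), scaled.std())
print("peak weight raw vs scaled:", raw_peak, scaled_peak)
```

When the peak weight approaches 1, the softmax Jacobian entries p_i(δ_ij − p_j) approach zero, which is the gradient-flow problem the question hints at.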
|
||||
|
||||
## Implementation Guide
|
||||
|
||||
### Core Components
|
||||
|
||||
**Scaled Dot-Product Attention - The Heart of Transformers**
|
||||
```python
|
||||
def scaled_dot_product_attention(Q, K, V, mask=None):
|
||||
"""
|
||||
The fundamental attention operation:
|
||||
Attention(Q,K,V) = softmax(QK^T/√d_k)V
|
||||
"""The fundamental attention operation from 'Attention is All You Need'.
|
||||
|
||||
This exact function powers ChatGPT, BERT, and all transformers.
|
||||
Attention(Q, K, V) = softmax(QK^T / √d_k) V
|
||||
|
||||
This exact formula powers GPT, BERT, and all transformers.
|
||||
|
||||
Args:
|
||||
Q: Query matrix (batch, heads, seq_len_q, d_k)
|
||||
K: Key matrix (batch, heads, seq_len_k, d_k)
|
||||
V: Value matrix (batch, heads, seq_len_k, d_v); keys and values share a length
|
||||
mask: Optional mask, broadcastable to (batch, heads, seq_len_q, seq_len_k); 1 = attend, 0 = block
|
||||
|
||||
Returns:
|
||||
output: Attended values (batch, heads, seq_len_q, d_v)
|
||||
attention_weights: Attention probabilities (batch, heads, seq_len_q, seq_len_k)
|
||||
|
||||
Intuition:
|
||||
Q = "What am I looking for?"
|
||||
K = "What information is available?"
|
||||
V = "What is the actual content?"
|
||||
|
||||
Attention computes: for each query, how much should I focus on each key?
|
||||
Then uses those weights to mix the values.
|
||||
"""
|
||||
# d_k = dimension of keys (and queries)
|
||||
d_k = Q.shape[-1]
|
||||
scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
|
||||
if mask is not None:
|
||||
scores = scores.masked_fill(mask == 0, -1e9)
|
||||
attention_weights = softmax(scores)
|
||||
return attention_weights @ V, attention_weights
|
||||
```
|
||||
|
||||
### Self-Attention Wrapper
|
||||
```python
|
||||
class SelfAttention:
|
||||
"""
|
||||
Convenient wrapper for self-attention where Q=K=V.
|
||||
The most common use case in transformer models.
|
||||
"""
|
||||
def __init__(self, d_model):
|
||||
self.d_model = d_model
|
||||
|
||||
def forward(self, x, mask=None):
|
||||
# Self-attention: Q = K = V = x
|
||||
return scaled_dot_product_attention(x, x, x, mask)
|
||||
# Compute attention scores: QK^T
|
||||
# Shape: (batch, heads, seq_len_q, seq_len_k)
|
||||
scores = Q @ K.transpose(-2, -1)
|
||||
|
||||
# Scale by sqrt(d_k) to prevent extreme softmax saturation
|
||||
scores = scores / math.sqrt(d_k)
|
||||
|
||||
# Apply mask if provided (for causal or padding masking)
|
||||
if mask is not None:
|
||||
# Set masked positions to large negative value
|
||||
# After softmax, these become ~0
|
||||
scores = scores.masked_fill(mask == 0, -1e9)
|
||||
|
||||
# Softmax to get attention probabilities
|
||||
# Each row sums to 1: how much attention to pay to each position
|
||||
attention_weights = softmax(scores, dim=-1)
|
||||
|
||||
# Weighted sum of values based on attention
|
||||
output = attention_weights @ V
|
||||
|
||||
return output, attention_weights
|
||||
```
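For checking shapes and the "rows sum to 1" property outside TinyTorch, the same computation can be sketched in plain NumPy. This is a stand-in under assumptions, not the module's implementation: `attention_np` and the local `softmax` are illustrative names, and `np.where` replaces the `masked_fill` call used above.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def attention_np(Q, K, V, mask=None):
    """NumPy stand-in for scaled dot-product attention (mask: 1=keep, 0=block)."""
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -2, -1) / np.sqrt(d_k)
    if mask is not None:
        # Blocked positions get a large negative score, so softmax maps them to ~0.
        scores = np.where(mask == 0, -1e9, scores)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4, 5, 8))  # (batch, heads, seq_len, d_k)
K = rng.standard_normal((2, 4, 5, 8))
V = rng.standard_normal((2, 4, 5, 8))
causal = np.tril(np.ones((5, 5)))      # broadcasts over batch and heads

out, w = attention_np(Q, K, V, mask=causal)
print(out.shape, w.shape)                  # (2, 4, 5, 8) (2, 4, 5, 5)
print(np.allclose(w.sum(axis=-1), 1.0))    # every row is a probability distribution
print(np.allclose(np.triu(w, k=1), 0.0))   # no weight lands on future positions
```

With a causal mask the first query can only see itself, so its weight on position 0 is 1; that makes a convenient sanity check when debugging a mask.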
|
||||
|
||||
### Attention Masking
|
||||
**Multi-Head Attention - Parallel Attention Subspaces**
|
||||
```python
|
||||
# Causal masking (GPT-style: can't see future tokens)
|
||||
causal_mask = create_causal_mask(seq_len)
|
||||
|
||||
# Padding masking (ignore padding tokens)
|
||||
padding_mask = create_padding_mask(lengths, max_length)
|
||||
|
||||
# Bidirectional masking (BERT-style: can see all tokens)
|
||||
bidirectional_mask = create_bidirectional_mask(seq_len)
|
||||
class MultiHeadAttention:
|
||||
"""Multi-head attention from 'Attention is All You Need'.
|
||||
|
||||
Allows model to jointly attend to information from different
|
||||
representation subspaces at different positions.
|
||||
|
||||
Architecture:
|
||||
Input (batch, seq_len, d_model)
|
||||
→ Project to Q, K, V (each batch, seq_len, d_model)
|
||||
→ Split into H heads (batch, H, seq_len, d_model/H)
|
||||
→ Attention for each head in parallel
|
||||
→ Concatenate heads
|
||||
→ Final linear projection
|
||||
Output (batch, seq_len, d_model)
|
||||
|
||||
Example:
|
||||
d_model = 512, num_heads = 8
|
||||
Each head processes 512/8 = 64 dimensions
|
||||
8 heads learn different attention patterns in parallel
|
||||
"""
|
||||
def __init__(self, d_model, num_heads):
|
||||
assert d_model % num_heads == 0
|
||||
|
||||
self.d_model = d_model
|
||||
self.num_heads = num_heads
|
||||
self.d_k = d_model // num_heads # Dimension per head
|
||||
|
||||
# Linear projections for Q, K, V
|
||||
self.W_q = Linear(d_model, d_model)
|
||||
self.W_k = Linear(d_model, d_model)
|
||||
self.W_v = Linear(d_model, d_model)
|
||||
|
||||
# Output projection
|
||||
self.W_o = Linear(d_model, d_model)
|
||||
|
||||
def forward(self, query, key, value, mask=None):
|
||||
"""Multi-head attention forward pass.
|
||||
|
||||
Args:
|
||||
query: (batch, seq_len_q, d_model)
|
||||
key: (batch, seq_len_k, d_model)
|
||||
value: (batch, seq_len_v, d_model)
|
||||
mask: Optional mask
|
||||
|
||||
Returns:
|
||||
output: (batch, seq_len_q, d_model)
|
||||
attention_weights: (batch, num_heads, seq_len_q, seq_len_k)
|
||||
"""
|
||||
batch_size = query.shape[0]
|
||||
|
||||
# 1. Linear projections
|
||||
Q = self.W_q(query) # (batch, seq_len_q, d_model)
|
||||
K = self.W_k(key) # (batch, seq_len_k, d_model)
|
||||
V = self.W_v(value) # (batch, seq_len_v, d_model)
|
||||
|
||||
# 2. Split into multiple heads
|
||||
# Reshape: (batch, seq_len, d_model) → (batch, seq_len, num_heads, d_k)
|
||||
# Transpose: → (batch, num_heads, seq_len, d_k)
|
||||
Q = Q.reshape(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
|
||||
K = K.reshape(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
|
||||
V = V.reshape(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
|
||||
|
||||
# 3. Apply attention for each head in parallel
|
||||
attended, attention_weights = scaled_dot_product_attention(Q, K, V, mask)
|
||||
# attended: (batch, num_heads, seq_len_q, d_k)
|
||||
|
||||
# 4. Concatenate heads
|
||||
# Transpose: (batch, num_heads, seq_len, d_k) → (batch, seq_len, num_heads, d_k)
|
||||
# Reshape: → (batch, seq_len, d_model)
|
||||
attended = attended.transpose(1, 2).reshape(batch_size, -1, self.d_model)
|
||||
|
||||
# 5. Final linear projection
|
||||
output = self.W_o(attended)
|
||||
|
||||
return output, attention_weights
|
||||
```
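Steps 2 and 4 of the forward pass (split into heads, then concatenate) are the easiest places to get shapes wrong. The NumPy sketch below isolates just that round trip; `split_heads` and `merge_heads` are hypothetical helper names. Note that NumPy's `transpose` takes a full axis permutation, whereas the `transpose(1, 2)` call above swaps two axes.

```python
import numpy as np

def split_heads(x, num_heads):
    # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_k)
    batch, seq_len, d_model = x.shape
    d_k = d_model // num_heads
    return x.reshape(batch, seq_len, num_heads, d_k).transpose(0, 2, 1, 3)

def merge_heads(x):
    # (batch, num_heads, seq_len, d_k) -> (batch, seq_len, d_model)
    batch, num_heads, seq_len, d_k = x.shape
    return x.transpose(0, 2, 1, 3).reshape(batch, seq_len, num_heads * d_k)

x = np.arange(2 * 10 * 512, dtype=np.float32).reshape(2, 10, 512)
heads = split_heads(x, num_heads=8)
print(heads.shape)                                    # (2, 8, 10, 64)
print(np.array_equal(merge_heads(heads), x))          # True: lossless round trip
# Head h holds the contiguous feature slice [h*64, (h+1)*64) of each token.
print(np.array_equal(heads[:, 3], x[:, :, 192:256]))  # True
```

Because each head is just a slice of the projected features, the heads differ only through the learned projections that produce those features.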
|
||||
|
||||
## 🔬 Key Concepts
|
||||
**Masking Utilities**
|
||||
```python
|
||||
def create_causal_mask(seq_len):
|
||||
"""Create causal mask for autoregressive (GPT-style) attention.
|
||||
|
||||
Prevents positions from attending to future positions.
|
||||
Position i can only attend to positions <= i.
|
||||
|
||||
Returns:
|
||||
mask: (seq_len, seq_len) lower triangular matrix
|
||||
|
||||
Example (seq_len=4):
|
||||
[[1, 0, 0, 0], # Position 0 sees only position 0
|
||||
[1, 1, 0, 0], # Position 1 sees 0,1
|
||||
[1, 1, 1, 0], # Position 2 sees 0,1,2
|
||||
[1, 1, 1, 1]] # Position 3 sees all
|
||||
"""
|
||||
mask = np.tril(np.ones((seq_len, seq_len)))
|
||||
return Tensor(mask)
|
||||
|
||||
### Why Attention Revolutionized AI
|
||||
- **Global connectivity**: Unlike CNNs, attention connects any two positions directly
|
||||
- **Dynamic weights**: Attention adapts to input content, not fixed like convolution kernels
|
||||
- **Parallel processing**: Unlike RNNs, all positions computed simultaneously
|
||||
- **Interpretability**: You can visualize what the model pays attention to
|
||||
|
||||
### The Attention Formula Explained
|
||||
```
|
||||
Attention(Q,K,V) = softmax(QK^T/√d_k)V
|
||||
|
||||
Where:
|
||||
- Q (Query): "What am I looking for?"
|
||||
- K (Key): "What information is available?"
|
||||
- V (Value): "What is the actual content?"
|
||||
- √d_k scaling: Prevents extreme softmax values
|
||||
def create_padding_mask(lengths, max_length):
|
||||
"""Create padding mask to ignore padding tokens.
|
||||
|
||||
Args:
|
||||
lengths: (batch_size,) actual sequence lengths
|
||||
max_length: maximum sequence length in batch
|
||||
|
||||
Returns:
|
||||
mask: (batch_size, 1, 1, max_length) where 1=real, 0=padding
|
||||
"""
|
||||
batch_size = lengths.shape[0]
|
||||
mask = np.zeros((batch_size, max_length))
|
||||
for i, length in enumerate(lengths):
|
||||
mask[i, :length] = 1
|
||||
return Tensor(mask).reshape(batch_size, 1, 1, max_length)
|
||||
```
|
||||
|
||||
### Attention vs Convolution
|
||||
| Aspect | Convolution | Attention |
|
||||
|--------|-------------|-----------|
|
||||
| **Receptive field** | Local, grows with depth | Global from layer 1 |
|
||||
| **Computation** | O(n·k) for kernel size k | O(n²) in sequence length n |
|
||||
| **Weights** | Fixed learned kernels | Dynamic input-dependent |
|
||||
| **Best for** | Spatial data (images) | Sequential data (text) |
|
||||
### Step-by-Step Implementation
|
||||
|
||||
### Real-World Applications
|
||||
- **Language Models**: GPT, BERT, ChatGPT use self-attention to understand context
|
||||
- **Machine Translation**: Google Translate uses attention to align source and target words
|
||||
- **Image Understanding**: Vision Transformers apply attention to image patches
|
||||
- **Multimodal AI**: CLIP, DALL-E use attention to connect text and images
|
||||
1. **Implement Scaled Dot-Product Attention**
|
||||
- Compute QK^T matmul
|
||||
- Apply 1/√d_k scaling
|
||||
- Add masking support
|
||||
- Apply softmax and value weighting
|
||||
- Verify attention weights sum to 1
|
||||
|
||||
## 🚀 From Attention to Modern AI
|
||||
2. **Build Multi-Head Attention**
|
||||
- Create Q, K, V projection layers
|
||||
- Split embeddings into multiple heads
|
||||
- Apply attention to each head in parallel
|
||||
- Concatenate head outputs
|
||||
- Add final projection layer
|
||||
|
||||
This module teaches the **core building block** of modern AI:
|
||||
3. **Add Masking Utilities**
|
||||
- Implement causal mask for GPT-style models
|
||||
- Create padding mask for variable-length sequences
|
||||
- Test mask shapes and broadcasting
|
||||
- Verify masking prevents information leak
|
||||
|
||||
**What you're building**: The fundamental attention mechanism
|
||||
**What it enables**: Multi-head attention, positional encoding, transformer blocks
|
||||
**What it powers**: ChatGPT, BERT, GPT-4, and contemporary AI systems
|
||||
4. **Create Self-Attention Wrapper**
|
||||
- Build convenience class where Q=K=V
|
||||
- Add optional masking parameter
|
||||
- Test with real embedded sequences
|
||||
- Profile computational cost
|
||||
|
||||
Understanding this module gives you the foundation to understand:
|
||||
- How ChatGPT generates coherent text
|
||||
- How BERT understands language bidirectionally
|
||||
- How Vision Transformers work without convolution
|
||||
- How modern AI achieves human-like language understanding
|
||||
5. **Visualize Attention Patterns**
|
||||
- Extract attention weights from forward pass
|
||||
- Plot heatmaps for different heads
|
||||
- Analyze what patterns each head learns
|
||||
- Interpret attention on real text examples
|
||||
|
||||
## 📈 Module Progression
|
||||
## Testing
|
||||
|
||||
```
|
||||
Tensors → **ATTENTION** → Layers → Networks → CNNs → Training
|
||||
↑ ↑
|
||||
Foundation Modern AI Core
|
||||
### Inline Tests (During Development)
|
||||
|
||||
Run inline tests while building:
|
||||
```bash
|
||||
cd modules/source/12_attention
|
||||
python attention_dev.py
|
||||
```
|
||||
|
||||
After completing this module, you'll understand the mechanism that sparked the AI revolution, making you ready to work with state-of-the-art models and architectures.
|
||||
Expected output:
|
||||
```
|
||||
Unit Test: Scaled dot-product attention...
|
||||
✅ Attention scores computed correctly
|
||||
✅ Softmax normalization verified (sums to 1)
|
||||
✅ Output shape matches expected dimensions
|
||||
Progress: Attention Mechanism ✓
|
||||
|
||||
## 🎯 Success Criteria
|
||||
Unit Test: Multi-head attention...
|
||||
✅ 8 heads process 512 dims in parallel
|
||||
✅ Head splitting and concatenation correct
|
||||
✅ Output projection applied properly
|
||||
Progress: Multi-Head Attention ✓
|
||||
|
||||
You'll know you've mastered this module when you can:
|
||||
- [ ] Implement scaled dot-product attention from scratch
|
||||
- [ ] Explain why the √d_k scaling prevents gradient problems
|
||||
- [ ] Create different types of attention masks for various use cases
|
||||
- [ ] Visualize and interpret attention weights
|
||||
- [ ] Understand why attention enabled the transformer revolution
|
||||
- [ ] Connect this foundation to modern AI systems like ChatGPT
|
||||
|
||||
|
||||
Choose your preferred way to engage with this module:
|
||||
|
||||
````{grid} 1 2 3 3
|
||||
|
||||
```{grid-item-card} 🚀 Launch Binder
|
||||
:link: https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/source/07_attention/attention_dev.ipynb
|
||||
:class-header: bg-light
|
||||
|
||||
Run this module interactively in your browser. No installation required!
|
||||
Unit Test: Causal masking...
|
||||
✅ Future positions blocked correctly
|
||||
✅ Past positions accessible
|
||||
✅ Autoregressive property verified
|
||||
Progress: Masking ✓
|
||||
```
|
||||
|
||||
```{grid-item-card} ⚡ Open in Colab
|
||||
:link: https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/source/07_attention/attention_dev.ipynb
|
||||
:class-header: bg-light
|
||||
### Export and Validate
|
||||
|
||||
Use Google Colab for GPU access and cloud compute power.
|
||||
After completing the module:
|
||||
```bash
|
||||
# Export to tinytorch package
|
||||
tito export 12_attention
|
||||
|
||||
# Run integration tests
|
||||
tito test 12_attention
|
||||
```
|
||||
|
||||
```{grid-item-card} 📖 View Source
|
||||
:link: https://github.com/mlsysbook/TinyTorch/blob/main/modules/source/07_attention/attention_dev.py
|
||||
:class-header: bg-light
|
||||
|
||||
Browse the Python source code and understand the implementation.
|
||||
```
|
||||
|
||||
````
|
||||
|
||||
```{admonition} 💾 Save Your Progress
|
||||
:class: tip
|
||||
**Binder sessions are temporary!** Download your completed notebook when done, or switch to local development for persistent work.
|
||||
## Where This Code Lives
|
||||
|
||||
```
|
||||
tinytorch/
|
||||
├── nn/
|
||||
│ └── attention.py # Your implementation goes here
|
||||
└── __init__.py # Exposes MultiHeadAttention, etc.
|
||||
|
||||
Usage in other modules:
|
||||
>>> from tinytorch.nn import MultiHeadAttention
|
||||
>>> attn = MultiHeadAttention(d_model=512, num_heads=8)
|
||||
>>> output, weights = attn.forward(query, key, value, mask=causal_mask)
|
||||
```
|
||||
|
||||
## Systems Thinking Questions
|
||||
|
||||
1. **Quadratic Complexity**: Attention is O(n²) in sequence length. For n=1024, we compute ~1M attention scores per head. For n=4096 (twice GPT-3's 2,048-token context), how many? Why is this a problem for long documents?
|
||||
|
||||
2. **Multi-Head Benefits**: Why 8 heads of 64 dims each instead of 1 head of 512 dims? What different patterns might different heads learn (syntax vs semantics vs coreference)?
|
||||
|
||||
3. **Scaling Factor Impact**: Without 1/√d_k scaling, softmax saturates toward a nearly one-hot output as d_k grows. Why? (Hint: consider the variance of a dot product of d_k terms.) How does this hurt gradient flow? (Hint: softmax derivative)
|
||||
|
||||
4. **Memory vs Compute**: The attention weight matrix has shape (batch × heads × seq × seq). For batch=32, heads=8, seq=1024, this is about 268M values (2^28) per layer. At FP32, how much memory? Why is this a bottleneck?
|
||||
|
||||
5. **Causal vs Bidirectional**: GPT uses causal masking (can't see future). BERT uses bidirectional (can see all positions). Why does this architectural choice define fundamentally different models?
|
||||
|
||||
## Real-World Connections
|
||||
|
||||
### Industry Applications
|
||||
|
||||
**Large Language Models (OpenAI, Anthropic, Google)**
|
||||
- GPT-3 (175B): 96 layers × 96 heads = 9,216 attention heads, each of dimension 128 (d_model = 12,288)
|
||||
- Attention optimizations (FlashAttention) critical for training at scale
|
||||
- Multi-query attention reduces inference cost in production
|
||||
- Attention is the primary computational bottleneck
|
||||
|
||||
**Machine Translation (Google Translate, DeepL)**
|
||||
- Cross-attention aligns source and target languages
|
||||
- Attention weights show word alignment (interpretability)
|
||||
- Multi-head attention captures different translation patterns
|
||||
- Real-time translation requires optimized attention kernels
|
||||
|
||||
**Vision Models (Google ViT, Meta DINOv2)**
|
||||
- Self-attention over image patches replaces convolution
|
||||
- Global receptive field from layer 1 (vs deep CNN stacks)
|
||||
- Attention cost grows quadratically with the number of patches, so high-resolution inputs call for larger patches or windowed attention
|
||||
- Now a leading architecture for many vision tasks
|
||||
|
||||
### Research Impact
|
||||
|
||||
This module implements patterns from:
|
||||
- Attention is All You Need (Vaswani et al., 2017): The transformer paper
|
||||
- BERT (Devlin et al., 2018): Bidirectional attention for NLP
|
||||
- GPT-2 (Radford et al., 2019) and GPT-3 (Brown et al., 2020): Causal attention for generation
|
||||
- ViT (Dosovitskiy et al., 2020): Attention for computer vision
|
||||
|
||||
## What's Next?
|
||||
|
||||
In **Module 13: Transformers**, you'll compose attention into complete transformer blocks:
|
||||
|
||||
- Stack multi-head attention with feedforward networks
|
||||
- Add layer normalization and residual connections
|
||||
- Build encoder (BERT-style) and decoder (GPT-style) architectures
|
||||
- Train full transformer on text generation tasks
|
||||
|
||||
The attention mechanism you built is the core component of every transformer!
|
||||
|
||||
---
|
||||
|
||||
<div class="prev-next-area">
|
||||
<a class="left-prev" href="../chapters/06_spatial.html" title="previous page">← Previous Module</a>
|
||||
<a class="right-next" href="../chapters/08_dataloader.html" title="next page">Next Module →</a>
|
||||
</div>
|
||||
**Ready to build the AI revolution from scratch?** Open `modules/source/12_attention/attention_dev.py` and start implementing.
|
||||
|
||||