# 🔥 Module: Attention

## 📊 Module Info
- Difficulty: ⭐⭐⭐ Advanced
- Time Estimate: 4-5 hours
- Prerequisites: Tensor module
- Next Steps: Training, Transformers modules
Build the core attention mechanism that powers modern AI! This module implements the fundamental scaled dot-product attention that's used in ChatGPT, BERT, GPT-4, and virtually all state-of-the-art AI systems.
## 🎯 Learning Objectives
By the end of this module, you will be able to:
- Master the attention formula: Understand and implement Attention(Q,K,V) = softmax(QK^T/√d_k)V
- Build self-attention: Create the core component that enables global context understanding
- Control information flow: Implement masking for causal, padding, and bidirectional attention
- Visualize attention patterns: See what the model "pays attention to"
- Understand modern AI: Grasp the mechanism that revolutionized natural language processing
## 🧠 Build → Use → Understand
This module follows TinyTorch's Build → Use → Understand framework:
- Build: Implement the core attention mechanism and masking utilities from mathematical foundations
- Use: Apply attention to sequence tasks and visualize attention patterns
- Understand: How attention enables dynamic, global context modeling that powers modern AI
## 📚 What You'll Build

### Scaled Dot-Product Attention

```python
import math

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    The fundamental attention operation:
    Attention(Q,K,V) = softmax(QK^T/√d_k)V

    This exact function powers ChatGPT, BERT, and all transformers.
    """
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled to keep variance near 1
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # Blocked positions get a large negative score, so softmax sends them to ~0
        scores = scores.masked_fill(mask == 0, -1e9)
    attention_weights = softmax(scores)  # softmax over the last axis (the keys)
    return attention_weights @ V, attention_weights
```
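To see end to end what this computes, here is a self-contained NumPy sketch of the same math (NumPy stands in for the module's Tensor here; the shapes and random seed are illustrative, not part of the module):

```python
import numpy as np

def softmax_np(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_np(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -2, -1) / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)
    weights = softmax_np(scores, axis=-1)
    return weights @ V, weights

# One sequence of 4 tokens with embedding size 8 (self-attention: Q = K = V)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out, weights = attention_np(x, x, x)
print(out.shape)             # (4, 8): one output vector per token
print(weights.sum(axis=-1))  # each row of attention weights sums to 1.0
```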
### Self-Attention Wrapper

```python
class SelfAttention:
    """
    Convenient wrapper for self-attention, where Q = K = V.
    The most common use case in transformer models.
    """
    def __init__(self, d_model):
        self.d_model = d_model

    def forward(self, x, mask=None):
        # Self-attention: the sequence attends to itself (Q = K = V = x)
        return scaled_dot_product_attention(x, x, x, mask)
```
### Attention Masking

```python
# Causal masking (GPT-style: positions can't see future tokens)
causal_mask = create_causal_mask(seq_len)

# Padding masking (ignore padding tokens in variable-length batches)
padding_mask = create_padding_mask(lengths, max_length)

# Bidirectional masking (BERT-style: every position can see all tokens)
bidirectional_mask = create_bidirectional_mask(seq_len)
```
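The exact signatures live in the module itself; as a rough NumPy sketch of what these helpers typically return (1 = may attend, 0 = blocked, matching the `mask == 0` check in the attention function above):

```python
import numpy as np

def create_causal_mask(seq_len):
    # Lower-triangular matrix: position i may attend to positions 0..i only
    return np.tril(np.ones((seq_len, seq_len), dtype=int))

def create_padding_mask(lengths, max_length):
    # One row per sequence in the batch: 1 for real tokens, 0 for padding
    positions = np.arange(max_length)
    return (positions[None, :] < np.asarray(lengths)[:, None]).astype(int)

def create_bidirectional_mask(seq_len):
    # All ones: every position may attend everywhere (BERT-style)
    return np.ones((seq_len, seq_len), dtype=int)

print(create_causal_mask(4))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```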
## 🔬 Key Concepts

### Why Attention Revolutionized AI
- Global connectivity: Unlike CNNs, attention connects any two positions directly
- Dynamic weights: Attention adapts to input content, not fixed like convolution kernels
- Parallel processing: Unlike RNNs, all positions computed simultaneously
- Interpretability: You can visualize what the model pays attention to (see the sketch below)
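For a taste of that interpretability, here is a minimal matplotlib sketch of an attention heatmap (the tokens and weights are random placeholders, not real model output):

```python
import numpy as np
import matplotlib.pyplot as plt

tokens = ["the", "cat", "sat", "down", "."]  # illustrative sentence
rng = np.random.default_rng(1)
scores = rng.normal(size=(5, 5))
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row softmax

plt.imshow(weights, cmap="viridis")
plt.xticks(range(5), tokens)  # columns: keys (what is attended to)
plt.yticks(range(5), tokens)  # rows: queries (which token is attending)
plt.colorbar(label="attention weight")
plt.title("Attention heatmap (random weights for illustration)")
plt.show()
```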
### The Attention Formula Explained
Attention(Q,K,V) = softmax(QK^T/√d_k)V
Where:
- Q (Query): "What am I looking for?"
- K (Key): "What information is available?"
- V (Value): "What is the actual content?"
- √d_k scaling: Keeps score variance near 1 so the softmax doesn't saturate; dot products of random d_k-dimensional vectors have standard deviation ≈ √d_k (demonstrated below)
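A quick NumPy experiment makes the scaling concrete (d_k and the number of keys are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
q = rng.normal(size=d_k)           # one query
keys = rng.normal(size=(10, d_k))  # ten candidate keys

raw = keys @ q                     # unscaled scores: std ≈ sqrt(d_k) ≈ 22.6
scaled = raw / np.sqrt(d_k)        # scaled scores: std ≈ 1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

print(softmax(raw).max())     # ≈ 1.0: one key grabs nearly all the weight
print(softmax(scaled).max())  # well below 1: weight stays spread across keys
```

Without the division, the softmax output is nearly one-hot and its gradients vanish almost everywhere; with it, the weights stay soft and trainable.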
### Attention vs Convolution
| Aspect | Convolution | Attention |
|---|---|---|
| Receptive field | Local, grows with depth | Global from layer 1 |
| Computation | O(n) with kernel size | O(n²) with sequence length |
| Weights | Fixed learned kernels | Dynamic input-dependent |
| Best for | Spatial data (images) | Sequential data (text) |
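To make the O(n²) column concrete, here is a back-of-the-envelope script for the memory of the attention weight matrix alone (float32, one head, batch size 1):

```python
# Each layer materializes an n×n attention matrix: memory grows quadratically
for n in (128, 512, 2048, 8192):
    weight_bytes = n * n * 4  # float32 = 4 bytes per entry
    print(f"seq_len={n:5d} -> attention matrix {weight_bytes / 1e6:8.1f} MB")
```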
### Real-World Applications
- Language Models: GPT, BERT, ChatGPT use self-attention to understand context
- Machine Translation: Google Translate uses attention to align source and target words
- Image Understanding: Vision Transformers apply attention to image patches
- Multimodal AI: CLIP, DALL-E use attention to connect text and images
## 🚀 From Attention to Modern AI

This module teaches the core building block of modern AI:
- What you're building: The fundamental attention mechanism
- What it enables: Multi-head attention, positional encoding, transformer blocks
- What it powers: ChatGPT, BERT, GPT-4, and contemporary AI systems
Understanding this module gives you the foundation to understand:
- How ChatGPT generates coherent text
- How BERT understands language bidirectionally
- How Vision Transformers work without convolution
- How modern AI achieves human-like language understanding
## 📈 Module Progression

```
Tensors → ATTENTION → Layers → Networks → CNNs → Training
   ↑          ↑
Foundation   Modern AI Core
```
After completing this module, you'll understand the mechanism that sparked the AI revolution, making you ready to work with state-of-the-art models and architectures.
## 🎯 Success Criteria
You'll know you've mastered this module when you can:
- Implement scaled dot-product attention from scratch
- Explain why the √d_k scaling prevents gradient problems
- Create different types of attention masks for various use cases
- Visualize and interpret attention weights
- Understand why attention enabled the transformer revolution
- Connect this foundation to modern AI systems like ChatGPT