Standardize Module 11 (Embeddings) to professional template

- Add complete YAML frontmatter with metadata
- Add INTELLIGENCE tier badge
- Standardize to exactly 5 learning objectives
- Implement Build → Use → Analyze pedagogical pattern
- Add Why This Matters with GPT-3/BERT production context
- Add historical evolution from Word2Vec to contextual embeddings
- Add comprehensive Implementation Guide with lookup tables and positional encodings
- Add Systems Thinking Questions on memory scaling and sparse gradients
- Add Real-World Connections to LLMs and recommendation systems
- Reduce emoji usage for professional tone
- Add clear What's Next navigation to Module 12
This commit is contained in:
Vijay Janapa Reddi
2025-11-07 17:19:45 -05:00
parent 0d1727a0c5
commit 0c52d61a5f

View File

@@ -1,85 +1,402 @@
# 12. Embeddings
---
title: "Embeddings - Token to Vector Representations"
description: "Build embedding layers that convert discrete tokens to dense vectors"
difficulty: 2
time_estimate: "4-5 hours"
prerequisites: ["Tensor", "Tokenization"]
next_steps: ["Attention"]
learning_objectives:
- "Implement embedding layers with efficient lookup table operations"
- "Design positional encodings to capture sequence order information"
- "Understand memory scaling with vocabulary size and embedding dimensions"
- "Optimize embedding lookups for cache efficiency and bandwidth"
- "Apply dimensionality principles to semantic vector representations"
---
```{admonition} Module Overview
:class: note
Converting tokens to dense vector representations that capture semantic meaning for language models.
```
# 11. Embeddings
## What You'll Build
**🧠 INTELLIGENCE TIER** | Difficulty: ⭐⭐ (2/4) | Time: 4-5 hours
In this module, you'll implement the systems that transform discrete tokens into rich vector representations:
## Overview
- **Embedding layers** with efficient lookup table operations
- **Positional encoding systems** that enable sequence understanding
- **Embedding optimization** for memory-efficient vocabulary management
- **Performance profiling** for embedding lookup patterns and cache efficiency
Build embedding systems that transform discrete token IDs into dense vector representations. This module implements lookup tables, positional encodings, and optimization techniques that power all modern language models.
## Learning Objectives
```{admonition} ML Systems Focus
:class: tip
This module emphasizes embedding table scaling, memory bandwidth optimization, and efficient vector representations.
```
By completing this module, you will be able to:
1. **Build embedding layers** with lookup tables that efficiently convert token indices to dense vectors
2. **Implement positional encoding** systems that capture sequence information for transformer models
3. **Understand embedding scaling** and how vocabulary size affects model memory and computational requirements
1. **Implement embedding layers** with efficient lookup table operations for token-to-vector conversion
2. **Design positional encodings** (learned and sinusoidal) to capture sequence order information
3. **Understand memory scaling** with vocabulary size and embedding dimensions in production models
4. **Optimize embedding lookups** for cache efficiency and memory bandwidth utilization
5. **Analyze embedding trade-offs** between dimension size, vocabulary size, and model capacity
5. **Apply dimensionality principles** to balance expressiveness and computational efficiency
## Systems Concepts
## Why This Matters
This module covers critical ML systems concepts:
### Production Context
- **Memory scaling** with vocabulary size and embedding dimensions
- **Cache-friendly lookup patterns** for high-throughput embedding access
- **Memory bandwidth bottlenecks** in embedding-heavy language models
- **Parameter sharing strategies** for efficient vocabulary management
- **Vector representation efficiency** and storage optimization
Embeddings are the foundation of all modern NLP:
## Prerequisites
- **GPT-3's embedding table**: 50K vocab × 12K dims = 600M parameters (20% of total model)
- **BERT's embeddings**: Token + position + segment embeddings enable bidirectional understanding
- **Word2Vec/GloVe**: Pioneered semantic embeddings; "king - man + woman ≈ queen"
- **Recommendation systems**: Embedding tables for billions of items (YouTube, Netflix, Spotify)
- **Module 02 (Tensor)**: Understanding of tensor operations and indexing
- **Module 11 (Tokenization)**: Token processing and vocabulary management
### Historical Context
## Time Estimate
Embeddings evolved from sparse to dense representations:
**4-5 hours** - Comprehensive implementation with scaling analysis and performance optimization
- **One-Hot Encoding (pre-2013)**: Vocabulary-sized vectors; no semantic similarity
- **Word2Vec (2013)**: Dense embeddings capture semantic relationships; revolutionized NLP
- **GloVe (2014)**: Global co-occurrence statistics improve quality
- **Contextual Embeddings (2018)**: BERT/GPT embeddings depend on context; same word, different vectors
- **Modern Scale (2020+)**: 100K+ vocabulary embeddings in production language models
## Getting Started
The embeddings you're building are the input layer of transformers and all modern NLP.
Open the embeddings module and begin implementing your vector representation systems:
## Pedagogical Pattern: Build → Use → Analyze
### 1. Build
Implement from first principles:
- Embedding layer with learnable lookup table
- Sinusoidal positional encoding (Transformer-style)
- Learned positional embeddings (GPT-style)
- Combined token + position embeddings
- Gradient flow through embedding lookups
### 2. Use
Apply to real problems:
- Convert token sequences to dense vectors
- Add positional information for sequence order
- Visualize embedding spaces with t-SNE
- Measure semantic similarity with cosine distance
- Integrate with attention mechanisms (Module 12)
### 3. Analyze
Deep-dive into design trade-offs:
- How does embedding dimension affect model capacity?
- Why do transformers need positional encodings?
- What's the memory cost of large vocabularies?
- How do embeddings capture semantic relationships?
- Why sinusoidal vs learned position encodings?
## Implementation Guide
### Core Components
**Embedding Layer - Token Lookup Table**
```python
# Navigate to the module
cd modules/12_embeddings
# Open the development notebook
tito module view 12_embeddings
# Complete the module
tito module complete 12_embeddings
class Embedding:
"""Learnable embedding layer for token-to-vector conversion.
Implements efficient lookup table that maps token IDs to dense vectors.
The core component of all language models.
Args:
vocab_size: Size of vocabulary (e.g., 50,000 for GPT-2)
embedding_dim: Dimension of dense vectors (e.g., 768 for BERT-base)
Memory: vocab_size × embedding_dim parameters
Example: 50K vocab × 768 dim = 38M parameters
"""
def __init__(self, vocab_size, embedding_dim):
self.vocab_size = vocab_size
self.embedding_dim = embedding_dim
# Initialize embedding table randomly
# Shape: (vocab_size, embedding_dim)
self.weight = Tensor.randn(vocab_size, embedding_dim) * 0.02
def forward(self, token_ids):
"""Look up embeddings for token IDs.
Args:
token_ids: (batch_size, seq_len) tensor of token IDs
Returns:
embeddings: (batch_size, seq_len, embedding_dim) dense vectors
"""
batch_size, seq_len = token_ids.shape
# Lookup operation: index into embedding table
embeddings = self.weight[token_ids] # Advanced indexing
return embeddings
def backward(self, grad_output):
"""Gradients accumulate in embedding table.
Only embeddings that were looked up receive gradients.
This is sparse gradient update - critical for efficiency.
"""
batch_size, seq_len, embed_dim = grad_output.shape
# Accumulate gradients for each unique token ID
grad_weight = Tensor.zeros_like(self.weight)
for b in range(batch_size):
for s in range(seq_len):
token_id = token_ids[b, s]
grad_weight[token_id] += grad_output[b, s]
return grad_weight
```
## Next Steps
After completing embeddings, you'll be ready for:
- **Module 13 (Attention)**: Multi-head attention mechanisms for sequence understanding
## Production Context
```{admonition} Scale Reality Check
:class: warning
GPT-3 has embedding tables with 600M+ parameters (50k vocabulary × 12k dimensions). Understanding embedding systems is crucial for building scalable language models.
**Positional Encoding - Sinusoidal (Transformer-Style)**
```python
class SinusoidalPositionalEncoding:
"""Fixed sinusoidal positional encoding.
Used in original Transformer (Vaswani et al., 2017).
Encodes absolute position using sine/cosine functions of different frequencies.
Advantages:
- No learned parameters
- Can generalize to longer sequences than training length
- Mathematically elegant relative position representation
"""
def __init__(self, max_seq_len, embedding_dim):
self.max_seq_len = max_seq_len
self.embedding_dim = embedding_dim
# Pre-compute positional encodings
self.encodings = self._compute_encodings()
def _compute_encodings(self):
"""Compute sinusoidal position encodings.
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
"""
position = np.arange(self.max_seq_len)[:, np.newaxis]
div_term = np.exp(np.arange(0, self.embedding_dim, 2) *
-(np.log(10000.0) / self.embedding_dim))
encodings = np.zeros((self.max_seq_len, self.embedding_dim))
encodings[:, 0::2] = np.sin(position * div_term) # Even indices
encodings[:, 1::2] = np.cos(position * div_term) # Odd indices
return Tensor(encodings)
def forward(self, seq_len):
"""Return positional encodings for sequence length.
Args:
seq_len: Length of input sequence
Returns:
pos_encodings: (seq_len, embedding_dim) positional vectors
"""
return self.encodings[:seq_len]
```
Modern language models rely heavily on efficient embedding systems:
**Learned Positional Embeddings (GPT-Style)**
```python
class LearnedPositionalEmbedding:
"""Learned positional embeddings.
Used in GPT models. Learns absolute position representations during training.
Advantages:
- Can learn task-specific position patterns
- Often performs slightly better than sinusoidal
Disadvantages:
- Cannot generalize beyond max trained sequence length
- Requires additional parameters
"""
def __init__(self, max_seq_len, embedding_dim):
self.max_seq_len = max_seq_len
self.embedding_dim = embedding_dim
# Learnable position embedding table
self.weight = Tensor.randn(max_seq_len, embedding_dim) * 0.02
def forward(self, seq_len):
"""Look up learned position embeddings.
Args:
seq_len: Length of input sequence
Returns:
pos_embeddings: (seq_len, embedding_dim) learned vectors
"""
return self.weight[:seq_len]
```
- **Memory management**: Embedding tables often represent 20-40% of total model parameters
- **Bandwidth optimization**: Embedding lookups are memory-bandwidth bound operations
- **Distributed training**: Large embedding tables require sophisticated parameter sharding strategies
- **Inference efficiency**: Optimized embedding access patterns are critical for real-time language generation
**Combined Token + Position Embeddings**
```python
def get_combined_embeddings(token_ids, token_embeddings, pos_embeddings):
"""Combine token and position embeddings.
Used as input to transformer models.
Args:
token_ids: (batch_size, seq_len) token indices
token_embeddings: Embedding layer for tokens
pos_embeddings: Positional encoding layer
Returns:
combined: (batch_size, seq_len, embedding_dim) token + position
"""
batch_size, seq_len = token_ids.shape
# Get token embeddings
token_vecs = token_embeddings(token_ids) # (B, L, D)
# Get position embeddings
pos_vecs = pos_embeddings(seq_len) # (L, D)
# Add them together (broadcasting handles batch dimension)
combined = token_vecs + pos_vecs # (B, L, D)
return combined
```
Your embedding implementations provide the foundation for all transformer-based language models in TinyTorch.
### Step-by-Step Implementation
1. **Create Embedding Layer**
- Initialize weight matrix (vocab_size × embedding_dim)
- Implement forward pass with indexing
- Add backward pass with sparse gradient accumulation
- Test with small vocabulary
2. **Implement Sinusoidal Positions**
- Compute sine/cosine encodings
- Handle even/odd indices correctly
- Verify periodicity properties
- Test generalization to longer sequences
3. **Add Learned Positions**
- Create learnable position table
- Initialize with small random values
- Implement forward and backward passes
- Compare with sinusoidal encodings
4. **Combine Token + Position**
- Add token and position embeddings
- Handle batch broadcasting correctly
- Verify gradient flow through both
- Test with real tokenized sequences
5. **Analyze Embedding Spaces**
- Visualize embeddings with t-SNE or PCA
- Measure cosine similarity between tokens
- Verify semantic relationships emerge
- Profile memory and lookup efficiency
## Testing
### Inline Tests (During Development)
Run inline tests while building:
```bash
cd modules/source/11_embeddings
python embeddings_dev.py
```
Expected output:
```
Unit Test: Embedding layer...
✅ Lookup table created: 10K vocab × 256 dims = 2.5M parameters
✅ Forward pass shape correct: (32, 20, 256)
✅ Backward pass accumulates gradients correctly
Progress: Embedding Layer ✓
Unit Test: Sinusoidal positional encoding...
✅ Encodings computed for 512 positions
✅ Sine/cosine patterns verified
✅ Generalization to longer sequences works
Progress: Sinusoidal Positions ✓
Unit Test: Combined embeddings...
✅ Token + position addition works
✅ Gradient flows through both components
✅ Batch broadcasting handled correctly
Progress: Combined Embeddings ✓
```
### Export and Validate
After completing the module:
```bash
# Export to tinytorch package
tito export 11_embeddings
# Run integration tests
tito test 11_embeddings
```
## Where This Code Lives
```
tinytorch/
├── nn/
│ └── embeddings.py # Your implementation goes here
└── __init__.py # Exposes Embedding, PositionalEncoding, etc.
Usage in other modules:
>>> from tinytorch.nn import Embedding, SinusoidalPositionalEncoding
>>> token_emb = Embedding(vocab_size=50000, embedding_dim=768)
>>> pos_emb = SinusoidalPositionalEncoding(max_len=512, dim=768)
```
## Systems Thinking Questions
1. **Memory Scaling**: GPT-3 has 50K vocab × 12K dims = 600M embedding parameters. At FP32 (4 bytes), how much memory? At FP16? Why does this matter for training vs inference?
2. **Sparse Gradients**: During training, only ~1% of vocabulary appears in each batch. How does sparse gradient accumulation save computation compared to dense updates?
3. **Embedding Dimension Choice**: BERT-base uses 768 dims, BERT-large uses 1024. How does dimension affect: (a) model capacity, (b) computation, (c) memory bandwidth?
4. **Position Encoding Trade-offs**: Sinusoidal allows generalization to any length. Learned positions are limited to max training length. When would you choose each?
5. **Semantic Geometry**: Why do word embeddings exhibit linear relationships like "king - man + woman ≈ queen"? What property of the training objective causes this?
## Real-World Connections
### Industry Applications
**Large Language Models (OpenAI, Anthropic, Google)**
- GPT-4: 100K+ vocabulary embeddings
- Embedding tables often 20-40% of total model parameters
- Optimized embedding access critical for inference latency
- Mixed-precision (FP16) embeddings save memory
**Recommendation Systems (YouTube, Netflix, Spotify)**
- Billion-scale item embeddings for personalization
- Embedding retrieval systems for fast nearest-neighbor search
- Continuous embedding updates with online learning
- Embedding quantization for serving efficiency
**Multilingual Models (Google Translate, Facebook M2M)**
- Shared embedding spaces across 100+ languages
- Cross-lingual embeddings enable zero-shot transfer
- Vocabulary size optimization for multilingual coverage
- Embedding alignment techniques for language pairs
### Research Impact
This module implements patterns from:
- Word2Vec (2013): Pioneered dense semantic embeddings
- GloVe (2014): Global co-occurrence matrix factorization
- Transformer (2017): Sinusoidal positional encodings
- BERT (2018): Contextual embeddings revolutionized NLP
- GPT (2018): Learned positional embeddings for autoregressive models
## What's Next?
In **Module 12: Attention**, you'll use these embeddings as input to attention mechanisms:
- Query, Key, Value projections from embeddings
- Scaled dot-product attention over embedded sequences
- Multi-head attention for different representation subspaces
- Self-attention that relates all positions in a sequence
The embeddings you built are the foundation input to every transformer!
---
**Ready to build embedding systems from scratch?** Open `modules/source/11_embeddings/embeddings_dev.py` and start implementing.