mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-06-03 00:15:32 -05:00
Standardize Module 11 (Embeddings) to professional template
- Add complete YAML frontmatter with metadata
- Add INTELLIGENCE tier badge
- Standardize to exactly 5 learning objectives
- Implement Build → Use → Analyze pedagogical pattern
- Add Why This Matters with GPT-3/BERT production context
- Add historical evolution from Word2Vec to contextual embeddings
- Add comprehensive Implementation Guide with lookup tables and positional encodings
- Add Systems Thinking Questions on memory scaling and sparse gradients
- Add Real-World Connections to LLMs and recommendation systems
- Reduce emoji usage for professional tone
- Add clear What's Next navigation to Module 12
@@ -1,85 +1,402 @@
|
||||
# 12. Embeddings
|
||||
---
title: "Embeddings - Token to Vector Representations"
description: "Build embedding layers that convert discrete tokens to dense vectors"
difficulty: 2
time_estimate: "4-5 hours"
prerequisites: ["Tensor", "Tokenization"]
next_steps: ["Attention"]
learning_objectives:
  - "Implement embedding layers with efficient lookup table operations"
  - "Design positional encodings to capture sequence order information"
  - "Understand memory scaling with vocabulary size and embedding dimensions"
  - "Optimize embedding lookups for cache efficiency and bandwidth"
  - "Apply dimensionality principles to semantic vector representations"
---
|
||||
|
||||
```{admonition} Module Overview
|
||||
:class: note
|
||||
Converting tokens to dense vector representations that capture semantic meaning for language models.
|
||||
```
|
||||
# 11. Embeddings
|
||||
|
||||
## What You'll Build
|
||||
**🧠 INTELLIGENCE TIER** | Difficulty: ⭐⭐ (2/4) | Time: 4-5 hours
|
||||
|
||||
In this module, you'll implement the systems that transform discrete tokens into rich vector representations:
|
||||
## Overview
|
||||
|
||||
- **Embedding layers** with efficient lookup table operations
|
||||
- **Positional encoding systems** that enable sequence understanding
|
||||
- **Embedding optimization** for memory-efficient vocabulary management
|
||||
- **Performance profiling** for embedding lookup patterns and cache efficiency
|
||||
Build embedding systems that transform discrete token IDs into dense vector representations. This module implements lookup tables, positional encodings, and optimization techniques that power all modern language models.
|
||||
|
||||
## Learning Objectives
|
||||
|
||||
```{admonition} ML Systems Focus
|
||||
:class: tip
|
||||
This module emphasizes embedding table scaling, memory bandwidth optimization, and efficient vector representations.
|
||||
```
|
||||
|
||||
By completing this module, you will be able to:
|
||||
|
||||
1. **Build embedding layers** with lookup tables that efficiently convert token indices to dense vectors
|
||||
2. **Implement positional encoding** systems that capture sequence information for transformer models
|
||||
3. **Understand embedding scaling** and how vocabulary size affects model memory and computational requirements
|
||||
1. **Implement embedding layers** with efficient lookup table operations for token-to-vector conversion
|
||||
2. **Design positional encodings** (learned and sinusoidal) to capture sequence order information
|
||||
3. **Understand memory scaling** with vocabulary size and embedding dimensions in production models
|
||||
4. **Optimize embedding lookups** for cache efficiency and memory bandwidth utilization
|
||||
5. **Analyze embedding trade-offs** between dimension size, vocabulary size, and model capacity
|
||||
5. **Apply dimensionality principles** to balance expressiveness and computational efficiency
|
||||
|
||||
## Systems Concepts
|
||||
## Why This Matters
|
||||
|
||||
This module covers critical ML systems concepts:
|
||||
### Production Context
|
||||
|
||||
- **Memory scaling** with vocabulary size and embedding dimensions
|
||||
- **Cache-friendly lookup patterns** for high-throughput embedding access
|
||||
- **Memory bandwidth bottlenecks** in embedding-heavy language models
|
||||
- **Parameter sharing strategies** for efficient vocabulary management
|
||||
- **Vector representation efficiency** and storage optimization
|
||||
Embeddings are the foundation of all modern NLP:
|
||||
|
||||
## Prerequisites
|
||||
- **GPT-3's embedding table**: 50K vocab × 12K dims ≈ 600M parameters (only about 0.35% of the 175B total, though the share is far larger in small models)
|
||||
- **BERT's embeddings**: Token + position + segment embeddings enable bidirectional understanding
|
||||
- **Word2Vec/GloVe**: Pioneered semantic embeddings; "king - man + woman ≈ queen"
|
||||
- **Recommendation systems**: Embedding tables for billions of items (YouTube, Netflix, Spotify)
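
The GPT-3 figure in the list above is plain arithmetic, and the same arithmetic gives the memory footprint. A minimal sketch, assuming GPT-3's published sizes (50,257 tokens, 12,288 dimensions); the helper name `embedding_table_bytes` is hypothetical and not part of TinyTorch:

```python
def embedding_table_bytes(vocab_size: int, embedding_dim: int, bytes_per_param: int) -> int:
    """Bytes needed to store a (vocab_size, embedding_dim) lookup table."""
    return vocab_size * embedding_dim * bytes_per_param


# Sizes taken from the published GPT-3 configuration (the "50K × 12K" above).
vocab_size, embedding_dim = 50_257, 12_288
num_params = vocab_size * embedding_dim

fp32_gib = embedding_table_bytes(vocab_size, embedding_dim, 4) / 2**30  # 4 bytes per FP32 value
fp16_gib = embedding_table_bytes(vocab_size, embedding_dim, 2) / 2**30  # 2 bytes per FP16 value

print(f"parameters: {num_params:,}")
print(f"FP32: {fp32_gib:.2f} GiB, FP16: {fp16_gib:.2f} GiB")
```

During training the real cost is higher than this, because gradients and optimizer state are usually stored alongside the weights.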
|
||||
|
||||
- **Module 02 (Tensor)**: Understanding of tensor operations and indexing
|
||||
- **Module 11 (Tokenization)**: Token processing and vocabulary management
|
||||
### Historical Context
|
||||
|
||||
## Time Estimate
|
||||
Embeddings evolved from sparse to dense representations:
|
||||
|
||||
**4-5 hours** - Comprehensive implementation with scaling analysis and performance optimization
|
||||
- **One-Hot Encoding (pre-2013)**: Vocabulary-sized vectors; no semantic similarity
|
||||
- **Word2Vec (2013)**: Dense embeddings capture semantic relationships; revolutionized NLP
|
||||
- **GloVe (2014)**: Global co-occurrence statistics improve quality
|
||||
- **Contextual Embeddings (2018)**: BERT/GPT embeddings depend on context; same word, different vectors
|
||||
- **Modern Scale (2020+)**: 100K+ vocabulary embeddings in production language models
|
||||
|
||||
## Getting Started
|
||||
The embeddings you're building are the input layer of transformers and all modern NLP.
|
||||
|
||||
Open the embeddings module and begin implementing your vector representation systems:
|
||||
## Pedagogical Pattern: Build → Use → Analyze
|
||||
|
||||
### 1. Build
|
||||
|
||||
Implement from first principles:
|
||||
- Embedding layer with learnable lookup table
|
||||
- Sinusoidal positional encoding (Transformer-style)
|
||||
- Learned positional embeddings (GPT-style)
|
||||
- Combined token + position embeddings
|
||||
- Gradient flow through embedding lookups
|
||||
|
||||
### 2. Use
|
||||
|
||||
Apply to real problems:
|
||||
- Convert token sequences to dense vectors
|
||||
- Add positional information for sequence order
|
||||
- Visualize embedding spaces with t-SNE
|
||||
- Measure semantic similarity with cosine similarity
|
||||
- Integrate with attention mechanisms (Module 12)
|
||||
|
||||
### 3. Analyze
|
||||
|
||||
Deep-dive into design trade-offs:
|
||||
- How does embedding dimension affect model capacity?
|
||||
- Why do transformers need positional encodings?
|
||||
- What's the memory cost of large vocabularies?
|
||||
- How do embeddings capture semantic relationships?
|
||||
- Why sinusoidal vs learned position encodings?
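
One way to approach the positional-encoding question is to look at what a plain lookup returns. The sketch below uses NumPy only (not the TinyTorch `Tensor` class), with a hypothetical 10-token vocabulary and a helper named `sinusoidal` introduced just for illustration. A token's vector is identical wherever it appears unless position information is added:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 10, 8
table = rng.normal(scale=0.02, size=(vocab_size, embedding_dim))  # toy embedding table


def sinusoidal(seq_len, dim):
    """Sinusoidal encodings: sine on even columns, cosine on odd columns."""
    position = np.arange(seq_len)[:, np.newaxis]
    div_term = np.exp(np.arange(0, dim, 2) * -(np.log(10000.0) / dim))
    enc = np.zeros((seq_len, dim))
    enc[:, 0::2] = np.sin(position * div_term)
    enc[:, 1::2] = np.cos(position * div_term)
    return enc


seq_a = np.array([1, 2, 3])  # token 1 is first
seq_b = np.array([3, 2, 1])  # token 1 is last

# Pure lookup: token 1 maps to the same vector in both sequences.
same_without_pos = np.allclose(table[seq_a][0], table[seq_b][2])

# After adding positions, the two occurrences of token 1 become distinguishable.
pos = sinusoidal(seq_len=3, dim=embedding_dim)
same_with_pos = np.allclose((table[seq_a] + pos)[0], (table[seq_b] + pos)[2])

print(same_without_pos, same_with_pos)
```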
|
||||
|
||||
## Implementation Guide
|
||||
|
||||
### Core Components
|
||||
|
||||
**Embedding Layer - Token Lookup Table**
|
||||
```python
|
||||
# Navigate to the module
|
||||
cd modules/12_embeddings
|
||||
|
||||
# Open the development notebook
|
||||
tito module view 12_embeddings
|
||||
|
||||
# Complete the module
|
||||
tito module complete 12_embeddings
|
||||
class Embedding:
    """Learnable embedding layer for token-to-vector conversion.

    Implements an efficient lookup table that maps token IDs to dense vectors.
    The core component of all language models.

    Args:
        vocab_size: Size of vocabulary (e.g., 50,257 for GPT-2)
        embedding_dim: Dimension of dense vectors (e.g., 768 for BERT-base)

    Memory: vocab_size × embedding_dim parameters
    Example: 50K vocab × 768 dim = 38M parameters
    """
    def __init__(self, vocab_size, embedding_dim):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim

        # Initialize embedding table with small random values
        # Shape: (vocab_size, embedding_dim)
        self.weight = Tensor.randn(vocab_size, embedding_dim) * 0.02

        # Set by forward(); backward() needs it to route gradients to rows
        self.token_ids = None

    def forward(self, token_ids):
        """Look up embeddings for token IDs.

        Args:
            token_ids: (batch_size, seq_len) tensor of token IDs

        Returns:
            embeddings: (batch_size, seq_len, embedding_dim) dense vectors
        """
        # Cache the indices for the backward pass
        self.token_ids = token_ids

        # Lookup operation: index into embedding table
        embeddings = self.weight[token_ids]  # Advanced indexing

        return embeddings

    def backward(self, grad_output):
        """Gradients accumulate in embedding table.

        Only embeddings that were looked up receive gradients.
        This is a sparse gradient update, which is critical for efficiency.
        """
        batch_size, seq_len, embed_dim = grad_output.shape

        # Accumulate gradients per position; repeated token IDs add up
        grad_weight = Tensor.zeros_like(self.weight)
        for b in range(batch_size):
            for s in range(seq_len):
                token_id = self.token_ids[b, s]  # indices cached in forward()
                grad_weight[token_id] += grad_output[b, s]

        return grad_weight
|
||||
```
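
The nested loop in `backward` can be vectorized. The sketch below is a NumPy-only illustration with made-up sizes, not the TinyTorch implementation. It uses `np.add.at`, which accumulates correctly when the same token ID appears more than once, and contrasts it with plain fancy-index `+=`, which silently drops the repeats:

```python
import numpy as np

vocab_size, embedding_dim = 6, 4
token_ids = np.array([[1, 3, 1],
                      [5, 1, 3]])                # (batch=2, seq=3); token 1 appears three times
grad_output = np.ones((2, 3, embedding_dim))     # upstream gradient, all ones for readability

flat_ids = token_ids.reshape(-1)                       # (6,)
flat_grads = grad_output.reshape(-1, embedding_dim)    # (6, 4)

# Unbuffered in-place add: every occurrence of an index contributes.
grad_weight = np.zeros((vocab_size, embedding_dim))
np.add.at(grad_weight, flat_ids, flat_grads)

# Pitfall: buffered fancy-index += keeps only one contribution per repeated index.
buggy = np.zeros((vocab_size, embedding_dim))
buggy[flat_ids] += flat_grads

touched_rows = np.flatnonzero(grad_weight.any(axis=1))
print(touched_rows)        # rows that received any gradient
print(grad_weight[:, 0])   # per-row accumulation count in this toy setup
```

Only the rows listed in `touched_rows` need an optimizer update, which is the sparsity that the docstring refers to.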
|
||||
|
||||
## Next Steps
|
||||
|
||||
After completing embeddings, you'll be ready for:
|
||||
- **Module 13 (Attention)**: Multi-head attention mechanisms for sequence understanding
|
||||
|
||||
## Production Context
|
||||
|
||||
```{admonition} Scale Reality Check
|
||||
:class: warning
|
||||
GPT-3 has embedding tables with 600M+ parameters (50k vocabulary × 12k dimensions). Understanding embedding systems is crucial for building scalable language models.
|
||||
**Positional Encoding - Sinusoidal (Transformer-Style)**
|
||||
```python
|
||||
import numpy as np


class SinusoidalPositionalEncoding:
    """Fixed sinusoidal positional encoding.

    Used in the original Transformer (Vaswani et al., 2017).
    Encodes absolute position using sine/cosine functions of different frequencies.

    Advantages:
    - No learned parameters
    - Can generalize to longer sequences than training length
    - Mathematically elegant relative position representation
    """
    def __init__(self, max_seq_len, embedding_dim):
        # Sine and cosine columns come in pairs, so the dimension must be even
        if embedding_dim % 2 != 0:
            raise ValueError("embedding_dim must be even")

        self.max_seq_len = max_seq_len
        self.embedding_dim = embedding_dim

        # Pre-compute positional encodings
        self.encodings = self._compute_encodings()

    def _compute_encodings(self):
        """Compute sinusoidal position encodings.

        PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
        PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
        """
        position = np.arange(self.max_seq_len)[:, np.newaxis]  # (max_seq_len, 1)
        # 1 / 10000^(2i/d_model), computed in log space for stability
        div_term = np.exp(np.arange(0, self.embedding_dim, 2) *
                          -(np.log(10000.0) / self.embedding_dim))

        encodings = np.zeros((self.max_seq_len, self.embedding_dim))
        encodings[:, 0::2] = np.sin(position * div_term)  # Even indices
        encodings[:, 1::2] = np.cos(position * div_term)  # Odd indices

        return Tensor(encodings)

    def forward(self, seq_len):
        """Return positional encodings for sequence length.

        Args:
            seq_len: Length of input sequence

        Returns:
            pos_encodings: (seq_len, embedding_dim) positional vectors
        """
        # The table is precomputed; construct with a larger max_seq_len
        # to cover longer sequences
        if seq_len > self.max_seq_len:
            raise ValueError("seq_len exceeds max_seq_len")
        return self.encodings[:seq_len]
|
||||
```
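
Two properties of these encodings can be checked numerically, which is what "verify periodicity properties" amounts to in practice. The sketch below recomputes the table with plain NumPy (the function name `sinusoidal_encodings` is hypothetical) and checks that every row has the same norm and that the dot product between two positions depends only on their offset:

```python
import numpy as np


def sinusoidal_encodings(max_seq_len, embedding_dim):
    """Same formula as the class above, returned as a NumPy array."""
    position = np.arange(max_seq_len)[:, np.newaxis]
    div_term = np.exp(np.arange(0, embedding_dim, 2) * -(np.log(10000.0) / embedding_dim))
    enc = np.zeros((max_seq_len, embedding_dim))
    enc[:, 0::2] = np.sin(position * div_term)
    enc[:, 1::2] = np.cos(position * div_term)
    return enc


pe = sinusoidal_encodings(max_seq_len=128, embedding_dim=64)

# sin^2 + cos^2 = 1 for each of the 32 frequency pairs, so each row has squared norm 32.
squared_norms = (pe ** 2).sum(axis=1)

# PE(p) . PE(p + k) = sum_i cos(k * w_i): it depends on the offset k, not on p.
k = 5
dot_early = pe[10] @ pe[10 + k]
dot_late = pe[100] @ pe[100 + k]

print(np.allclose(squared_norms, 32.0), np.isclose(dot_early, dot_late))
```

The second check is the "relative position" property from the docstring: a model can recover how far apart two tokens are without knowing where they sit in the sequence.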
|
||||
|
||||
Modern language models rely heavily on efficient embedding systems:
|
||||
**Learned Positional Embeddings (GPT-Style)**
|
||||
```python
|
||||
class LearnedPositionalEmbedding:
    """Learned positional embeddings.

    Used in GPT models. Learns absolute position representations during training.

    Advantages:
    - Can learn task-specific position patterns
    - Often performs slightly better than sinusoidal

    Disadvantages:
    - Cannot generalize beyond max trained sequence length
    - Requires additional parameters
    """
    def __init__(self, max_seq_len, embedding_dim):
        self.max_seq_len = max_seq_len
        self.embedding_dim = embedding_dim

        # Learnable position embedding table
        # Shape: (max_seq_len, embedding_dim)
        self.weight = Tensor.randn(max_seq_len, embedding_dim) * 0.02

    def forward(self, seq_len):
        """Look up learned position embeddings.

        Args:
            seq_len: Length of input sequence

        Returns:
            pos_embeddings: (seq_len, embedding_dim) learned vectors
        """
        # Positions past max_seq_len have no learned row
        if seq_len > self.max_seq_len:
            raise ValueError("seq_len exceeds max_seq_len")
        return self.weight[:seq_len]
|
||||
```
|
||||
|
||||
- **Memory management**: Embedding tables can represent 20-40% of total parameters in smaller models; the share shrinks as models grow
|
||||
- **Bandwidth optimization**: Embedding lookups are memory-bandwidth bound operations
|
||||
- **Distributed training**: Large embedding tables require sophisticated parameter sharding strategies
|
||||
- **Inference efficiency**: Optimized embedding access patterns are critical for real-time language generation
|
||||
**Combined Token + Position Embeddings**
|
||||
```python
|
||||
def get_combined_embeddings(token_ids, token_embeddings, pos_embeddings):
    """Combine token and position embeddings.

    Used as input to transformer models.

    Args:
        token_ids: (batch_size, seq_len) token indices
        token_embeddings: Embedding layer for tokens
        pos_embeddings: Positional encoding layer

    Returns:
        combined: (batch_size, seq_len, embedding_dim) token + position
    """
    batch_size, seq_len = token_ids.shape

    # Get token embeddings (the layers above define forward(), not __call__)
    token_vecs = token_embeddings.forward(token_ids)  # (B, L, D)

    # Get position embeddings
    pos_vecs = pos_embeddings.forward(seq_len)  # (L, D)

    # Add them together (broadcasting handles batch dimension)
    combined = token_vecs + pos_vecs  # (B, L, D)

    return combined
|
||||
```
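
The broadcast in `token_vecs + pos_vecs` has a matching rule in the backward pass: because the same position vectors are reused for every sequence in the batch, their gradient is the sum over the batch dimension. A small NumPy sketch with toy shapes (the variable names are illustrative only):

```python
import numpy as np

batch_size, seq_len, embedding_dim = 2, 3, 4
token_vecs = np.zeros((batch_size, seq_len, embedding_dim))   # stand-in for token embeddings
pos_vecs = np.arange(seq_len * embedding_dim, dtype=float).reshape(seq_len, embedding_dim)

# Forward: (B, L, D) + (L, D) -> (B, L, D); pos_vecs is reused for every batch element.
combined = token_vecs + pos_vecs

# Backward: the token gradient passes through unchanged, while the position
# gradient sums over the batch axis that broadcasting expanded.
grad_combined = np.ones_like(combined)
grad_token_vecs = grad_combined                # (B, L, D)
grad_pos_vecs = grad_combined.sum(axis=0)      # (L, D)

print(combined.shape, grad_pos_vecs.shape)
```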
|
||||
|
||||
Your embedding implementations provide the foundation for all transformer-based language models in TinyTorch.
|
||||
### Step-by-Step Implementation
|
||||
|
||||
1. **Create Embedding Layer**
   - Initialize weight matrix (vocab_size × embedding_dim)
   - Implement forward pass with indexing
   - Add backward pass with sparse gradient accumulation
   - Test with small vocabulary

2. **Implement Sinusoidal Positions**
   - Compute sine/cosine encodings
   - Handle even/odd indices correctly
   - Verify periodicity properties
   - Test generalization to longer sequences

3. **Add Learned Positions**
   - Create learnable position table
   - Initialize with small random values
   - Implement forward and backward passes
   - Compare with sinusoidal encodings

4. **Combine Token + Position**
   - Add token and position embeddings
   - Handle batch broadcasting correctly
   - Verify gradient flow through both
   - Test with real tokenized sequences

5. **Analyze Embedding Spaces**
   - Visualize embeddings with t-SNE or PCA
   - Measure cosine similarity between tokens
   - Verify semantic relationships emerge
   - Profile memory and lookup efficiency
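
For the cosine-similarity measurement in step 5, a minimal NumPy sketch follows. The table values are hand-picked for illustration rather than learned, and `cosine_similarity` is a hypothetical helper, not a TinyTorch function:

```python
import numpy as np


def cosine_similarity(u, v, eps=1e-12):
    """Cosine of the angle between u and v; eps guards against zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))


# Toy "embedding table" with hand-picked rows.
table = np.array([
    [1.0, 0.0, 0.0],    # token 0
    [2.0, 0.0, 0.0],    # token 1: same direction as token 0, different length
    [0.0, 1.0, 0.0],    # token 2: orthogonal to token 0
    [-1.0, 0.0, 0.0],   # token 3: opposite to token 0
])

sim_same = cosine_similarity(table[0], table[1])
sim_orthogonal = cosine_similarity(table[0], table[2])
sim_opposite = cosine_similarity(table[0], table[3])

# All pairs at once: normalize the rows, then take one matrix product.
normalized = table / np.linalg.norm(table, axis=1, keepdims=True)
pairwise = normalized @ normalized.T   # (4, 4), entry [i, j] compares tokens i and j

print(sim_same, sim_orthogonal, sim_opposite)
```

Cosine similarity ignores vector length, so tokens 0 and 1 count as identical in direction even though their norms differ.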
|
||||
|
||||
## Testing
|
||||
|
||||
### Inline Tests (During Development)
|
||||
|
||||
Run inline tests while building:
|
||||
```bash
|
||||
cd modules/source/11_embeddings
|
||||
python embeddings_dev.py
|
||||
```
|
||||
|
||||
Expected output:
|
||||
```
|
||||
Unit Test: Embedding layer...
|
||||
✅ Lookup table created: 10K vocab × 256 dims = 2.5M parameters
|
||||
✅ Forward pass shape correct: (32, 20, 256)
|
||||
✅ Backward pass accumulates gradients correctly
|
||||
Progress: Embedding Layer ✓
|
||||
|
||||
Unit Test: Sinusoidal positional encoding...
|
||||
✅ Encodings computed for 512 positions
|
||||
✅ Sine/cosine patterns verified
|
||||
✅ Generalization to longer sequences works
|
||||
Progress: Sinusoidal Positions ✓
|
||||
|
||||
Unit Test: Combined embeddings...
|
||||
✅ Token + position addition works
|
||||
✅ Gradient flows through both components
|
||||
✅ Batch broadcasting handled correctly
|
||||
Progress: Combined Embeddings ✓
|
||||
```
|
||||
|
||||
### Export and Validate
|
||||
|
||||
After completing the module:
|
||||
```bash
|
||||
# Export to tinytorch package
|
||||
tito export 11_embeddings
|
||||
|
||||
# Run integration tests
|
||||
tito test 11_embeddings
|
||||
```
|
||||
|
||||
## Where This Code Lives
|
||||
|
||||
```
|
||||
tinytorch/
├── nn/
│   └── embeddings.py  # Your implementation goes here
└── __init__.py        # Exposes Embedding, PositionalEncoding, etc.

Usage in other modules:
>>> from tinytorch.nn import Embedding, SinusoidalPositionalEncoding
>>> token_emb = Embedding(vocab_size=50000, embedding_dim=768)
>>> pos_emb = SinusoidalPositionalEncoding(max_seq_len=512, embedding_dim=768)
|
||||
```
|
||||
|
||||
## Systems Thinking Questions
|
||||
|
||||
1. **Memory Scaling**: GPT-3 has 50K vocab × 12K dims = 600M embedding parameters. At FP32 (4 bytes), how much memory? At FP16? Why does this matter for training vs inference?
|
||||
|
||||
2. **Sparse Gradients**: During training, often only a small fraction of the vocabulary (on the order of 1%) appears in each batch. How does sparse gradient accumulation save computation compared to dense updates?
|
||||
|
||||
3. **Embedding Dimension Choice**: BERT-base uses 768 dims, BERT-large uses 1024. How does dimension affect: (a) model capacity, (b) computation, (c) memory bandwidth?
|
||||
|
||||
4. **Position Encoding Trade-offs**: Sinusoidal allows generalization to any length. Learned positions are limited to max training length. When would you choose each?
|
||||
|
||||
5. **Semantic Geometry**: Why do word embeddings exhibit linear relationships like "king - man + woman ≈ queen"? What property of the training objective causes this?
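
To explore the sparse-gradient question empirically, one can count how many embedding rows a batch actually touches. The sketch below uses hypothetical sizes and uniformly random token IDs as a stand-in for real data. Natural text is heavily skewed toward frequent tokens, so real batches usually touch fewer distinct rows than this:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 50_000, 768
batch_size, seq_len = 32, 128

# Stand-in batch: uniform random IDs (an assumption; real text repeats tokens far more).
token_ids = rng.integers(0, vocab_size, size=(batch_size, seq_len))

unique_ids = np.unique(token_ids)
fraction_touched = unique_ids.size / vocab_size

# Values written per step: every row (dense) versus only the looked-up rows (sparse).
dense_update_values = vocab_size * embedding_dim
sparse_update_values = unique_ids.size * embedding_dim

print(f"rows touched: {unique_ids.size} of {vocab_size} ({fraction_touched:.1%})")
print(f"sparse / dense update size: {sparse_update_values / dense_update_values:.1%}")
```

A batch of 32 × 128 tokens can touch at most 4,096 rows, so the fraction here cannot exceed about 8% regardless of the data.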
|
||||
|
||||
## Real-World Connections
|
||||
|
||||
### Industry Applications
|
||||
|
||||
**Large Language Models (OpenAI, Anthropic, Google)**
|
||||
- GPT-4: 100K+ vocabulary embeddings
|
||||
- Embedding tables are often 20-40% of total parameters in smaller models; the share shrinks at larger scales
|
||||
- Optimized embedding access critical for inference latency
|
||||
- Mixed-precision (FP16) embeddings save memory
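
The FP16 saving in the last bullet is easy to measure. The sketch below casts a toy table to half precision and reports the memory halving and the rounding error introduced. It uses NumPy only, and the table size and 0.02 initialization scale are assumptions borrowed from the examples in this module:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy table: 10K vocab × 256 dims, initialized like the Embedding layer (scale 0.02).
table_fp32 = rng.normal(scale=0.02, size=(10_000, 256)).astype(np.float32)
table_fp16 = table_fp32.astype(np.float16)

# Half the bytes per value means half the memory and half the lookup bandwidth.
memory_ratio = table_fp16.nbytes / table_fp32.nbytes

# Rounding error introduced by the cast, measured back in FP32.
max_abs_error = float(np.abs(table_fp32 - table_fp16.astype(np.float32)).max())

print(f"FP32: {table_fp32.nbytes / 1e6:.2f} MB, FP16: {table_fp16.nbytes / 1e6:.2f} MB")
print(f"max abs rounding error: {max_abs_error:.2e}")
```

Whether that rounding error is acceptable depends on the model and is usually judged by downstream accuracy rather than by the raw error.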
|
||||
|
||||
**Recommendation Systems (YouTube, Netflix, Spotify)**
|
||||
- Billion-scale item embeddings for personalization
|
||||
- Embedding retrieval systems for fast nearest-neighbor search
|
||||
- Continuous embedding updates with online learning
|
||||
- Embedding quantization for serving efficiency
|
||||
|
||||
**Multilingual Models (Google Translate, Facebook M2M)**
|
||||
- Shared embedding spaces across 100+ languages
|
||||
- Cross-lingual embeddings enable zero-shot transfer
|
||||
- Vocabulary size optimization for multilingual coverage
|
||||
- Embedding alignment techniques for language pairs
|
||||
|
||||
### Research Impact
|
||||
|
||||
This module implements patterns from:
|
||||
- Word2Vec (2013): Pioneered dense semantic embeddings
|
||||
- GloVe (2014): Global co-occurrence matrix factorization
|
||||
- Transformer (2017): Sinusoidal positional encodings
|
||||
- BERT (2018): Contextual embeddings revolutionized NLP
|
||||
- GPT (2018): Learned positional embeddings for autoregressive models
|
||||
|
||||
## What's Next?
|
||||
|
||||
In **Module 12: Attention**, you'll use these embeddings as input to attention mechanisms:
|
||||
|
||||
- Query, Key, Value projections from embeddings
|
||||
- Scaled dot-product attention over embedded sequences
|
||||
- Multi-head attention for different representation subspaces
|
||||
- Self-attention that relates all positions in a sequence
|
||||
|
||||
The embeddings you built are the foundation input to every transformer!
|
||||
|
||||
---
|
||||
|
||||
**Ready to build embedding systems from scratch?** Open `modules/source/11_embeddings/embeddings_dev.py` and start implementing.
|
||||
|
||||