# 11. Embeddings - Token to Vector Representations

**ARCHITECTURE TIER** | Difficulty: ⭐⭐ (2/4) | Time: 4-5 hours

## Overview
Build the embedding systems that transform discrete token IDs into dense, learnable vector representations - the bridge between symbolic text and neural computation. This module implements lookup tables, positional encodings, and the optimization techniques that power every modern language model from word2vec to GPT-4's input layers.
You'll discover why embeddings aren't just "lookup tables" but sophisticated parameter spaces where semantic meaning emerges through training. By implementing both token embeddings and positional encodings from scratch, you'll understand the architectural choices that shape how transformers process language and why certain design decisions (sinusoidal vs learned positions, embedding dimensions, initialization strategies) have profound implications for model capacity, memory usage, and inference performance.
## Learning Objectives
By the end of this module, you will be able to:
- **Implement embedding layers**: Build efficient lookup tables for token-to-vector conversion with proper Xavier initialization and gradient flow
- **Design positional encodings**: Create both sinusoidal (Transformer-style) and learned (GPT-style) position representations with different extrapolation capabilities
- **Understand memory scaling**: Analyze how vocabulary size and embedding dimensions impact parameter count, memory bandwidth, and serving costs
- **Optimize embedding lookups**: Implement sparse gradient updates that avoid computing gradients for 99% of the vocabulary during training
- **Apply dimensionality principles**: Balance semantic expressiveness with computational efficiency in vector space design and initialization
## Build → Use → Reflect
This module follows TinyTorch's Build → Use → Reflect framework:
- **Build**: Implement embedding lookup tables with trainable parameters, sinusoidal positional encodings using mathematical patterns, learned position embeddings, and complete token+position combination systems
- **Use**: Convert tokenized text sequences to dense vectors, add positional information for sequence order awareness, and prepare embeddings for attention mechanisms
- **Reflect**: Analyze memory scaling with vocabulary size (why GPT-3's embeddings use 2.4GB), understand sparse gradient efficiency for large vocabularies, and explore semantic geometry in learned embedding spaces
```{tip}
**Production Context**: GPT-3's embedding table contains 50,257 vocabulary entries × 12,288 dimensions = 617M parameters. That is only about 0.35% of the model's 175B total (in smaller models the fraction is far larger: GPT-2 small's embeddings are roughly 30% of all parameters), but every token lookup requires reading 48KB of memory at FP32 - making embedding access a major bandwidth bottleneck during inference, especially for long sequences.

**Performance Note**: During training, only ~1% of the vocabulary typically appears in each batch. Sparse gradient updates avoid computing gradients for the other 99% of embedding parameters, saving massive computation and memory bandwidth. This is why frameworks like PyTorch implement specialized sparse gradient operations for embeddings.
```
## Implementation Guide

### Embedding Layer - The Token Lookup Table

The fundamental building block that maps discrete token IDs to continuous dense vectors. This is where semantic meaning will eventually be learned through training.

**Core Implementation Pattern:**
```python
import math
import numpy as np

class Embedding:
    """Learnable embedding layer for token-to-vector conversion.

    Implements an efficient lookup table that maps token IDs to dense
    vectors. The foundation of all language models and sequence processing.

    Args:
        vocab_size: Size of vocabulary (e.g., 50,000 for GPT-2)
        embedding_dim: Dimension of dense vectors (e.g., 768 for BERT-base)

    Memory cost: vocab_size × embedding_dim parameters.
    Example: 50K vocab × 768 dim = 38.4M parameters (153MB at FP32).
    """
    def __init__(self, vocab_size, embedding_dim):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        # Xavier/Glorot initialization for stable gradients
        limit = math.sqrt(6.0 / (vocab_size + embedding_dim))
        self.weight = Tensor(
            np.random.uniform(-limit, limit, (vocab_size, embedding_dim)),
            requires_grad=True
        )

    def forward(self, indices):
        """Look up embeddings for token IDs.

        Args:
            indices: (batch_size, seq_len) tensor of token IDs

        Returns:
            embeddings: (batch_size, seq_len, embedding_dim) dense vectors
        """
        # Advanced indexing: O(1) per-token lookup
        embedded = self.weight.data[indices.data.astype(int)]
        result = Tensor(embedded, requires_grad=self.weight.requires_grad)
        # Attach gradient computation (sparse updates during backward)
        if self.weight.requires_grad:
            result._grad_fn = EmbeddingBackward(self.weight, indices)
        return result
```
**Why This Design Works:**

- **Xavier initialization** ensures gradients don't explode or vanish during early training
- **Advanced indexing** provides O(1) lookup complexity regardless of vocabulary size
- **Sparse gradients** mean only embeddings for tokens in the current batch receive updates
- **Trainable weights** allow the model to learn semantic relationships through backpropagation
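The forward pass above hands gradient computation to an `EmbeddingBackward` node, which the notebook has you implement. A minimal sketch of the idea, using plain NumPy arrays instead of the Tensor class so it runs standalone (the exact interface here is illustrative, not the module's required API):

```python
import numpy as np

class EmbeddingBackward:
    """Sketch of the sparse backward step for an embedding lookup.

    Only rows of the weight matrix that were actually indexed receive
    gradient; every other row stays zero (and is skipped entirely in a
    truly sparse implementation).
    """
    def __init__(self, weight, indices):
        self.weight = np.asarray(weight)    # (vocab_size, embedding_dim)
        self.indices = np.asarray(indices)  # (batch_size, seq_len) token IDs

    def backward(self, grad_output):
        # grad_output: (batch_size, seq_len, embedding_dim)
        idx = self.indices.astype(int).reshape(-1)
        grads = grad_output.reshape(-1, grad_output.shape[-1])
        grad_weight = np.zeros_like(self.weight)
        # Scatter-add: np.add.at accumulates when the same token ID
        # appears more than once; plain fancy-index assignment
        # (grad_weight[idx] += grads) would silently drop duplicates.
        np.add.at(grad_weight, idx, grads)
        return grad_weight

# Token 1 appears twice in the batch, so its row accumulates two updates
fn = EmbeddingBackward(np.zeros((10, 4)), np.array([[1, 1, 3]]))
grad_w = fn.backward(np.ones((1, 3, 4)))
print(grad_w[1])  # [2. 2. 2. 2.]
print(grad_w[0])  # [0. 0. 0. 0.] - untouched vocabulary rows get no gradient
```

A production version would return only the touched rows plus their indices (as PyTorch's sparse embedding gradients do) rather than materializing a full zero matrix.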
### Sinusoidal Positional Encoding (Transformer-Style)

Fixed mathematical encodings that capture position without learned parameters. The original "Attention Is All You Need" approach, which enables extrapolation to longer sequences.

**Mathematical Foundation:**
```python
import math
import numpy as np

def create_sinusoidal_embeddings(max_seq_len, embedding_dim):
    """Create sinusoidal positional encodings from Vaswani et al. (2017).

    Uses sine/cosine functions of different frequencies to encode position.

    Formula:
        PE(pos, 2i)   = sin(pos / 10000^(2i/embed_dim))  # Even indices
        PE(pos, 2i+1) = cos(pos / 10000^(2i/embed_dim))  # Odd indices

    Where:
        pos   = position in sequence (0, 1, 2, ...)
        i     = dimension pair index
        10000 = base (creates wavelengths from 2π to 10000·2π)

    Advantages:
        - Zero parameters (no memory overhead)
        - Generalizes to sequences longer than training
        - Smooth transitions (nearby positions are similar)
        - Rich frequency spectrum across dimensions
    """
    # Position indices: [0, 1, 2, ..., max_seq_len-1]
    position = np.arange(max_seq_len, dtype=np.float32)[:, np.newaxis]
    # Frequency term for each dimension pair
    div_term = np.exp(
        np.arange(0, embedding_dim, 2, dtype=np.float32) *
        -(math.log(10000.0) / embedding_dim)
    )
    # Initialize positional encoding matrix
    pe = np.zeros((max_seq_len, embedding_dim), dtype=np.float32)
    # Apply sine to even indices (0, 2, 4, ...)
    pe[:, 0::2] = np.sin(position * div_term)
    # Apply cosine to odd indices (1, 3, 5, ...)
    pe[:, 1::2] = np.cos(position * div_term)
    return Tensor(pe)
```
**Why Sinusoidal Patterns Work:**

- **Different frequencies per dimension**: high frequencies change rapidly between positions, low frequencies change slowly
- **Unique signatures**: each position gets a distinct pattern from the combination of frequencies
- **Linear combinations** allow the model to learn relative position offsets through attention
- **No length limit**: encodings can be computed for any sequence length at inference time
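These claims can be checked numerically. A standalone NumPy sketch (independent of the Tensor class) that rebuilds the same encoding and probes its properties:

```python
import math
import numpy as np

def sinusoidal(max_seq_len, embedding_dim):
    # Same formula as create_sinusoidal_embeddings, returning a raw array
    position = np.arange(max_seq_len, dtype=np.float32)[:, np.newaxis]
    div_term = np.exp(np.arange(0, embedding_dim, 2, dtype=np.float32)
                      * -(math.log(10000.0) / embedding_dim))
    pe = np.zeros((max_seq_len, embedding_dim), dtype=np.float32)
    pe[:, 0::2] = np.sin(position * div_term)
    pe[:, 1::2] = np.cos(position * div_term)
    return pe

pe = sinusoidal(128, 64)

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Position 0 is [sin(0), cos(0), sin(0), cos(0), ...] = [0, 1, 0, 1, ...]
print(pe[0, :4])                                          # [0. 1. 0. 1.]
# Nearby positions are more similar than distant ones
print(cos_sim(pe[10], pe[11]) > cos_sim(pe[10], pe[60]))  # True
```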
### Learned Positional Encoding (GPT-Style)

Trainable position embeddings that can adapt to task-specific patterns. Used in GPT models and other architectures where positional patterns may be learnable.

**Implementation Pattern:**
```python
import math
import numpy as np

class PositionalEncoding:
    """Learnable positional encoding layer.

    Trainable position-specific vectors added to token embeddings.

    Args:
        max_seq_len: Maximum sequence length to support
        embedding_dim: Dimension matching token embeddings

    Advantages:
        - Can learn task-specific position patterns
        - May capture regularities like sentence structure
        - Often performs slightly better than sinusoidal

    Disadvantages:
        - Requires additional parameters (max_seq_len × embedding_dim)
        - Cannot extrapolate beyond the training sequence length
        - Needs sufficient training data to learn position patterns
    """
    def __init__(self, max_seq_len, embedding_dim):
        self.max_seq_len = max_seq_len
        self.embedding_dim = embedding_dim
        # Smaller initialization than token embeddings (additive combination)
        limit = math.sqrt(2.0 / embedding_dim)
        self.position_embeddings = Tensor(
            np.random.uniform(-limit, limit, (max_seq_len, embedding_dim)),
            requires_grad=True
        )

    def forward(self, x):
        """Add positional encodings to input embeddings.

        Args:
            x: (batch_size, seq_len, embedding_dim) input embeddings

        Returns:
            Position-aware embeddings of the same shape
        """
        batch_size, seq_len, embedding_dim = x.shape
        # Get position embeddings for this sequence length
        pos_embeddings = self.position_embeddings.data[:seq_len]
        # Broadcast to batch dimension: (1, seq_len, embedding_dim)
        pos_embeddings = pos_embeddings[np.newaxis, :, :]
        # Element-wise addition combines token and position information
        result = x + Tensor(pos_embeddings, requires_grad=True)
        return result
```
**Design Rationale:**

- **Learned parameters** can capture task-specific patterns (e.g., sentence beginnings, clause boundaries)
- **Smaller initialization** because positions add to token embeddings rather than replacing them
- **Fixed max length** is a limitation, but acceptable for many production use cases
- **Element-wise addition** preserves both token semantics and position information
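The fixed-max-length limitation is easy to demonstrate directly: a sinusoidal encoding can be computed for any position on demand, while a learned table simply has no row beyond `max_seq_len`. A standalone NumPy sketch (helper names are illustrative):

```python
import math
import numpy as np

def sinusoidal_row(pos, embedding_dim):
    """Sinusoidal encoding for a single position - computable for ANY pos."""
    i = np.arange(0, embedding_dim, 2, dtype=np.float32)
    div = np.exp(i * -(math.log(10000.0) / embedding_dim))
    row = np.empty(embedding_dim, dtype=np.float32)
    row[0::2] = np.sin(pos * div)
    row[1::2] = np.cos(pos * div)
    return row

max_seq_len, dim = 512, 64
learned_table = np.random.uniform(-0.1, 0.1, (max_seq_len, dim))

print(sinusoidal_row(10_000, dim).shape)  # (64,) - well past any training length
try:
    learned_table[10_000]                 # this row was never allocated (or trained)
except IndexError:
    print("learned table cannot extrapolate")
```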
### Complete Embedding System

Production-ready integration of token and positional embeddings, mirroring real transformer implementations.

**Full Pipeline:**
```python
import math
import numpy as np

class EmbeddingLayer:
    """Complete embedding system combining token and positional embeddings.

    Production-style component matching PyTorch/HuggingFace transformer patterns.
    """
    def __init__(self, vocab_size, embed_dim, max_seq_len=512,
                 pos_encoding='learned', scale_embeddings=False):
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim
        self.scale_embeddings = scale_embeddings
        # Token embedding table
        self.token_embedding = Embedding(vocab_size, embed_dim)
        # Positional encoding strategy
        if pos_encoding == 'learned':
            self.pos_encoding = PositionalEncoding(max_seq_len, embed_dim)
        elif pos_encoding == 'sinusoidal':
            # Fixed Tensor of precomputed encodings (no trainable parameters)
            self.pos_encoding = create_sinusoidal_embeddings(max_seq_len, embed_dim)
        elif pos_encoding is None:
            self.pos_encoding = None
        else:
            raise ValueError(f"Unknown pos_encoding: {pos_encoding!r}")

    def forward(self, tokens):
        """Convert tokens to position-aware embeddings.

        Args:
            tokens: (batch_size, seq_len) token indices

        Returns:
            (batch_size, seq_len, embed_dim) position-aware vectors
        """
        # Token lookup
        token_embeds = self.token_embedding.forward(tokens)
        # Optional scaling (Transformer convention: multiply by √embed_dim)
        if self.scale_embeddings:
            token_embeds = Tensor(token_embeds.data * math.sqrt(self.embed_dim))
        # Add positional information
        if self.pos_encoding is None:
            return token_embeds
        if hasattr(self.pos_encoding, 'forward'):
            # Learned positions: the module adds its trainable table
            return self.pos_encoding.forward(token_embeds)
        # Sinusoidal positions: slice the fixed table and broadcast-add
        seq_len = token_embeds.shape[1]
        return token_embeds + Tensor(self.pos_encoding.data[:seq_len][np.newaxis, :, :])
```
**Integration Benefits:**

- **Flexible positional encoding** supports learned, sinusoidal, or none
- **Embedding scaling** (multiply by √d) is the Transformer convention for gradient stability
- **Batch processing** handles variable sequence lengths efficiently
- **Parameter management** tracks all trainable components for optimization
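To see what each positional strategy costs, a quick parameter-count sketch (plain Python; vocabulary and dimensions are illustrative, chosen near BERT-base's configuration):

```python
def embedding_params(vocab_size, embed_dim, max_seq_len, pos_encoding):
    """Trainable parameter count for the combined embedding system."""
    token_params = vocab_size * embed_dim
    # Sinusoidal and None strategies add no trainable parameters
    pos_params = max_seq_len * embed_dim if pos_encoding == 'learned' else 0
    return token_params + pos_params

# 30K vocab, 768 dims, 512 max length (near BERT-base)
learned = embedding_params(30_000, 768, 512, 'learned')
sinusoidal = embedding_params(30_000, 768, 512, 'sinusoidal')
print(learned - sinusoidal)                                        # 393216
print(f"{(learned - sinusoidal) * 4 / 1e6:.1f} MB extra at FP32")  # 1.6 MB extra at FP32
```

The gap is small next to the token table itself (23M parameters here), which is why the choice between strategies usually hinges on extrapolation behavior, not memory.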
## Getting Started

### Prerequisites

Before starting this module, ensure you have completed:
- Module 01 (Tensor): Provides the foundational Tensor class with gradient tracking and operations
- Module 10 (Tokenization): Required for converting text to token IDs that embeddings consume
Verify your prerequisites:
```bash
# Activate TinyTorch environment
source bin/activate-tinytorch.sh

# Verify prerequisite modules
tito test --module tensor
tito test --module tokenization
```
### Development Workflow

1. **Open the development notebook**: `modules/11_embeddings/embeddings_dev.ipynb`
2. **Implement the Embedding class**: Create the lookup table with Xavier initialization and efficient indexing
3. **Build sinusoidal encodings**: Compute sine/cosine position representations from the mathematical formula
4. **Create learned positions**: Add a trainable position embedding table with proper initialization
5. **Integrate the complete system**: Combine token and position embeddings with flexible encoding strategies
6. **Export and verify**: `tito module complete 11 && tito test --module embeddings`
## Testing

### Comprehensive Test Suite

Run the full test suite to verify embedding functionality:
```bash
# TinyTorch CLI (recommended)
tito test --module embeddings

# Direct pytest execution
python -m pytest tests/ -k embeddings -v
```
### Test Coverage Areas

- ✅ **Embedding lookup correctness**: Verify token IDs map to the correct vector rows in the weight table
- ✅ **Gradient flow**: Ensure sparse gradient updates accumulate properly during backpropagation
- ✅ **Positional encoding math**: Validate the sinusoidal formula implementation with correct frequencies
- ✅ **Shape broadcasting**: Test token + position combination across batch dimensions
- ✅ **Memory efficiency**: Verify parameter count and lookup performance characteristics
### Inline Testing & Validation

The module includes comprehensive unit tests that run during development:

```text
# Example inline test output
🔬 Unit Test: Embedding layer...
✅ Lookup table created: 10K vocab × 256 dims = 2.56M parameters
✅ Forward pass shape correct: (32, 20, 256) for batch of 32 sequences
✅ Backward pass sparse gradients accumulate correctly
✅ Xavier initialization keeps variance stable
📈 Progress: Embedding Layer ✓

🔬 Unit Test: Sinusoidal positional encoding...
✅ Encodings computed for 512 positions × 256 dimensions
✅ Sine/cosine patterns verified (pos 0: [0, 1, 0, 1, ...])
✅ Different positions have unique signatures
✅ Frequency spectrum correct (high to low across dimensions)
📈 Progress: Sinusoidal Positions ✓

🔬 Unit Test: Learned positional encoding...
✅ Trainable position embeddings initialized
✅ Addition with token embeddings preserves gradients
✅ Batch broadcasting handled correctly
📈 Progress: Learned Positions ✓

🔬 Unit Test: Complete embedding system...
✅ Token + position combination works for all strategies
✅ Embedding scaling (√d) applied correctly
✅ Variable sequence lengths handled gracefully
✅ Parameter counting correct for each configuration
📈 Progress: Complete System ✓
```
### Manual Testing Examples

Test your embedding implementation interactively:

```python
import numpy as np
from tinytorch.text.embeddings import Embedding, PositionalEncoding, create_sinusoidal_embeddings
# (Tensor comes from your TinyTorch core module)

# Create embedding layer
vocab_size, embed_dim = 10000, 256
token_emb = Embedding(vocab_size, embed_dim)

# Test token lookup
token_ids = Tensor([[1, 5, 23], [42, 7, 19]])  # (2, 3) - batch of 2 sequences
embeddings = token_emb.forward(token_ids)      # (2, 3, 256)
print(f"Token embeddings shape: {embeddings.shape}")

# Add learned positional encodings
pos_emb = PositionalEncoding(max_seq_len=512, embedding_dim=256)
pos_aware = pos_emb.forward(embeddings)        # embeddings are already (batch, seq, embed)
print(f"Position-aware shape: {pos_aware.shape}")  # (2, 3, 256)

# Try sinusoidal encodings
sin_pe = create_sinusoidal_embeddings(max_seq_len=512, embedding_dim=256)
sin_positions = sin_pe.data[:3][np.newaxis, :, :]  # (1, 3, 256)
combined = Tensor(embeddings.data + sin_positions)
print(f"Sinusoidal combined: {combined.shape}")    # (2, 3, 256)

# Verify position 0 pattern (should be [0, 1, 0, 1, ...])
print(f"Position 0 pattern: {sin_pe.data[0, :8]}")
# Expected: [~0.0, ~1.0, ~0.0, ~1.0, ~0.0, ~1.0, ~0.0, ~1.0]
```
## Systems Thinking Questions

### Real-World Applications
- **Large Language Models (GPT-4, Claude, Llama)**: In smaller models, embedding tables can account for 20-40% of total parameters; in GPT-3, the 50K vocab × 12K dims = 617M embedding parameters (2.4GB at FP32) are a small fraction of the 175B total but still a major memory consumer in serving infrastructure.
- **Recommendation Systems (YouTube, Netflix, Spotify)**: Billion-scale item embeddings for personalized content retrieval. YouTube's embedding space contains hundreds of millions of video embeddings, enabling fast nearest-neighbor search for recommendations in milliseconds.
- **Multilingual Models (Google Translate, mBERT)**: Shared embedding spaces across 100+ languages enable zero-shot cross-lingual transfer. Words with similar meanings across languages cluster together in the learned vector space, allowing translation without parallel data.
- **Search Engines (Google, Bing)**: Query and document embeddings power semantic search beyond keyword matching. BERT-style embeddings capture meaning, letting "how to fix a leaky faucet" match "plumbing repair for dripping tap" even with no shared words.
### Mathematical Foundations

- **Embedding Geometry**: Why do word embeddings exhibit linear relationships like "king - man + woman ≈ queen"? The training objective (predicting context words in word2vec, or next tokens in language models) creates geometric structure where semantic relationships become linear vector operations. This emerges without explicit supervision.
- **Dimensionality Trade-offs**: Higher dimensions increase expressiveness (more capacity to separate distinct concepts) but require more memory and computation. BERT-base uses 768 dimensions and BERT-large uses 1024, chosen from performance-cost trade-off analysis. Doubling dimensions doubles memory but may only improve accuracy by a few percentage points.
- **Positional Encoding Mathematics**: Sinusoidal encodings use different frequencies (wavelengths from 2π to 10,000·2π) so each position gets a unique pattern. Because the dot product between two encodings depends only on their relative offset, attention can learn to exploit relative-position structure.
- **Sparse Gradient Efficiency**: During training with vocabulary size V and a batch containing b unique tokens, dense gradients would update all V embedding rows. Sparse gradients update only b rows; when b ≪ V (typical: 1,000 tokens vs. a 50K vocabulary), this saves ~98% of gradient computation and memory bandwidth.
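The savings arithmetic from the last point, as a few lines of plain Python (batch statistics are illustrative):

```python
vocab_size = 50_000      # V: rows in the embedding table
unique_tokens = 1_000    # b: distinct tokens in one (hypothetical) batch
embed_dim = 768

savings = 1 - unique_tokens / vocab_size
dense_bytes = vocab_size * embed_dim * 4      # full-table gradient at FP32
sparse_bytes = unique_tokens * embed_dim * 4  # only the touched rows
print(f"{savings:.0%} of row updates avoided")  # 98% of row updates avoided
print(f"{dense_bytes / 1e6:.0f} MB vs {sparse_bytes / 1e6:.1f} MB of gradient traffic")
# 154 MB vs 3.1 MB of gradient traffic
```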
### Performance Characteristics

- **Memory Scaling**: Embedding tables grow as O(vocab_size × embedding_dim). At FP32 (4 bytes per parameter): 50K vocab × 768 dims = 153MB, 100K vocab × 1024 dims = 410MB. Mixed precision (FP16) cuts this in half, but vocabulary size dominates scaling for large models.
- **Bandwidth Bottleneck**: Every token lookup reads embedding_dim × sizeof(dtype) bytes from memory. With 768 dims at FP32, that's 3KB per token. Processing a 2048-token context requires reading 6MB from the embedding table, so memory bandwidth, not compute, becomes the bottleneck.
- **Cache Efficiency**: Token access has poor cache locality because consecutive tokens usually have unrelated IDs (token IDs [1, 42, 7, 99] mean random jumps through the weight matrix). Batching improves throughput by amortizing cache misses, but embedding access remains memory-bound rather than compute-bound.
- **Inference Optimization**: Embedding quantization (INT8 or even INT4) reduces memory footprint and bandwidth by 2-4×, which is critical for deployment. With KV-caching, transformers avoid recomputing past positions, so each newly generated token costs exactly one embedding lookup; keeping that lookup fast matters for latency-sensitive applications.
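A short calculator reproducing the figures above (FP32, decimal megabytes; configurations are illustrative):

```python
def table_mb(vocab_size, embed_dim, bytes_per_param=4):
    """Embedding table size in megabytes."""
    return vocab_size * embed_dim * bytes_per_param / 1e6

def context_read_mb(seq_len, embed_dim, bytes_per_param=4):
    """Bytes read from the table to embed one full context, in megabytes."""
    return seq_len * embed_dim * bytes_per_param / 1e6

print(f"{table_mb(50_000, 768):.1f} MB")       # 153.6 MB
print(f"{table_mb(100_000, 1024):.1f} MB")     # 409.6 MB
print(f"{context_read_mb(2048, 768):.1f} MB")  # 6.3 MB
```

Halving `bytes_per_param` to 2 models the FP16 numbers; setting it to 1 models INT8 quantization.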
## Ready to Build?
You're about to implement the embedding systems that power modern AI language understanding. These lookup tables and positional encodings are the bridge between discrete tokens (words, subwords, characters) and the continuous vector spaces where neural networks operate. What seems like a simple "array lookup" is actually the foundation of how language models represent meaning.
What makes this module special is understanding not just how embeddings work, but why certain design choices matter. Why do we need positional encodings when embeddings already contain token information? Why sparse gradients instead of dense updates? How does embedding dimension affect model capacity versus memory footprint? These aren't just implementation details - they're fundamental design principles that shape every production language model's architecture.
By building embeddings from scratch, you'll gain intuition for memory-computation trade-offs in deep learning systems. You'll understand why GPT-3's embedding table consumes 2.4GB of memory, and why that matters for serving costs at scale (more memory = more expensive GPUs = higher operational costs). You'll see how sinusoidal encodings allow transformers to process sequences longer than training data, while learned positions might perform better on specific tasks with known maximum lengths.
This is where theory meets the economic realities of deploying AI at scale. Every architectural choice - vocabulary size, embedding dimension, positional encoding strategy - has both technical implications (accuracy, generalization) and business implications (memory costs, inference latency, serving throughput). Understanding these trade-offs is what separates machine learning researchers from machine learning systems engineers.
Choose your preferred way to engage with this module:
````{grid} 1 1 3 3

```{grid-item-card} 🚀 Launch Binder
:link: https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/11_embeddings/embeddings_dev.ipynb
:class-header: bg-light

Run this module interactively in your browser. No installation required!
```

```{grid-item-card} ⚡ Open in Colab
:link: https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/11_embeddings/embeddings_dev.ipynb
:class-header: bg-light

Use Google Colab for GPU access and cloud compute power.
```

```{grid-item-card} 📖 View Source
:link: https://github.com/mlsysbook/TinyTorch/blob/main/modules/11_embeddings/embeddings_dev.ipynb
:class-header: bg-light

Browse the Jupyter notebook source and understand the implementation.
```
````
```{tip}
**Binder sessions are temporary!** Download your completed notebook when done, or switch to local development for persistent work.
```