mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-05-22 22:55:48 -05:00
- Removed 30 debugging and development artifact files - Kept core system, documentation, and demo files - tests/milestones: 9 clean files (system + docs) - milestones/05_2017_transformer: 5 clean files (demos) - Clear, focused directory structure - Ready for students and developers
1699 lines
81 KiB
Plaintext
{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "8889dadd",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\""
|
||
},
|
||
"source": [
|
||
"# Module 11: Embeddings - Converting Tokens to Learnable Representations\n",
|
||
"\n",
|
"Welcome to Module 11! You're about to build embedding layers that convert discrete tokens into dense, learnable vectors - the foundation of modern neural NLP models.\n",
|
||
"\n",
|
||
"## 🔗 Prerequisites & Progress\n",
|
||
"**You've Built**: Tensors, layers, tokenization (discrete text processing)\n",
|
||
"**You'll Build**: Embedding lookups and positional encodings for sequence modeling\n",
|
||
"**You'll Enable**: Foundation for attention mechanisms and transformer architectures\n",
|
||
"\n",
|
||
"**Connection Map**:\n",
|
||
"```\n",
|
||
"Tokenization → Embeddings → Positional Encoding → Attention (Module 12)\n",
|
"(discrete)     (dense)      (position-aware)      (context-aware)\n",
|
||
"```\n",
|
||
"\n",
|
||
"## Learning Objectives\n",
|
||
"By the end of this module, you will:\n",
|
||
"1. Implement embedding layers for token-to-vector conversion\n",
|
||
"2. Understand learnable vs fixed positional encodings\n",
|
||
"3. Build both sinusoidal and learned position encodings\n",
|
||
"4. Analyze embedding memory requirements and lookup performance\n",
|
||
"\n",
|
||
"Let's transform tokens into intelligence!\n",
|
||
"\n",
|
||
"## 📦 Where This Code Lives in the Final Package\n",
|
||
"\n",
|
||
"**Learning Side:** You work in `modules/11_embeddings/embeddings_dev.py` \n",
|
||
"**Building Side:** Code exports to `tinytorch.text.embeddings`\n",
|
||
"\n",
|
||
"```python\n",
|
||
"# How to use this module:\n",
|
||
"from tinytorch.text.embeddings import Embedding, PositionalEncoding, create_sinusoidal_embeddings\n",
|
||
"```\n",
|
||
"\n",
|
||
"**Why this matters:**\n",
|
||
"- **Learning:** Complete embedding system for converting discrete tokens to continuous representations\n",
|
||
"- **Production:** Essential component matching PyTorch's torch.nn.Embedding with positional encoding patterns\n",
|
||
"- **Consistency:** All embedding operations and positional encodings in text.embeddings\n",
|
||
"- **Integration:** Works seamlessly with tokenizers for complete text processing pipeline"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "dc2a5f01",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"#| default_exp text.embeddings"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "851b8e9a",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"#| export\n",
|
||
"import numpy as np\n",
|
||
"import math\n",
|
||
"from typing import List, Optional, Tuple\n",
|
||
"\n",
|
||
"# Import from previous modules - following dependency chain\n",
|
||
"from tinytorch.core.tensor import Tensor\n",
|
||
"from tinytorch.core.autograd import EmbeddingBackward\n",
|
||
"\n",
|
||
"# Constants for memory calculations\n",
|
||
"BYTES_PER_FLOAT32 = 4 # Standard float32 size in bytes\n",
|
||
"MB_TO_BYTES = 1024 * 1024 # Megabytes to bytes conversion"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "83f99d85",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\""
|
||
},
|
||
"source": [
|
||
"## 1. Introduction - Why Embeddings?\n",
|
||
"\n",
|
||
"Neural networks operate on dense vectors, but language consists of discrete tokens. Embeddings are the crucial bridge that converts discrete tokens into continuous, learnable vector representations that capture semantic meaning.\n",
|
||
"\n",
|
||
"### The Token-to-Vector Challenge\n",
|
||
"\n",
|
||
"Consider the tokens from our tokenizer: [1, 42, 7] - how do we turn these discrete indices into meaningful vectors that capture semantic relationships?\n",
|
||
"\n",
|
||
"```\n",
|
||
"┌─────────────────────────────────────────────────────────────────┐\n",
|
||
"│ EMBEDDING PIPELINE: Discrete Tokens → Dense Vectors │\n",
|
||
"├─────────────────────────────────────────────────────────────────┤\n",
|
||
"│ │\n",
|
||
"│ Input (Token IDs): [1, 42, 7] │\n",
|
||
"│ │ │\n",
|
||
"│ ├─ Step 1: Lookup in embedding table │\n",
|
||
"│ │ Each ID → vector of learned features │\n",
|
||
"│ │ │\n",
|
||
"│ ├─ Step 2: Add positional information │\n",
|
||
"│ │ Same word at different positions → different│\n",
|
||
"│ │ │\n",
|
||
"│ ├─ Step 3: Create position-aware representations │\n",
|
||
"│ │ Ready for attention mechanisms │\n",
|
||
"│ │ │\n",
|
||
"│ └─ Step 4: Enable semantic understanding │\n",
|
||
"│ Similar words → similar vectors │\n",
|
||
"│ │\n",
|
||
"│ Output (Dense Vectors): [[0.1, 0.4, ...], [0.7, -0.2, ...]] │\n",
|
||
"│ │\n",
|
||
"└─────────────────────────────────────────────────────────────────┘\n",
|
||
"```\n",
|
||
"\n",
|
||
"### The Four-Layer Embedding System\n",
|
||
"\n",
|
||
"Modern embedding systems combine multiple components:\n",
|
||
"\n",
|
||
"**1. Token embeddings** - Learn semantic representations for each vocabulary token\n",
|
||
"**2. Positional encoding** - Add information about position in sequence\n",
|
"**3. Optional scaling** - Scale embedding magnitudes, typically by sqrt(embed_dim) (Transformer convention)\n",
|
||
"**4. Integration** - Combine everything into position-aware representations\n",
|
||
"\n",
|
||
"### Why This Matters\n",
|
||
"\n",
|
||
"The choice of embedding strategy dramatically affects:\n",
|
||
"- **Semantic understanding** - How well the model captures word meaning\n",
|
||
"- **Memory requirements** - Embedding tables can be gigabytes in size\n",
|
||
"- **Position awareness** - Whether the model understands word order\n",
|
"- **Extrapolation** - How well the model handles sequences longer than those seen in training"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "076f5a73",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\""
|
||
},
|
||
"source": [
|
||
"## 2. Foundations - Embedding Strategies\n",
|
||
"\n",
|
||
"Different embedding approaches make different trade-offs between memory, semantic understanding, and computational efficiency.\n",
|
||
"\n",
|
||
"### Token Embedding Lookup Process\n",
|
||
"\n",
|
||
"**Approach**: Each token ID maps to a learned dense vector\n",
|
||
"\n",
|
||
"```\n",
|
||
"┌──────────────────────────────────────────────────────────────┐\n",
|
||
"│ TOKEN EMBEDDING LOOKUP PROCESS │\n",
|
||
"├──────────────────────────────────────────────────────────────┤\n",
|
||
"│ │\n",
|
||
"│ Step 1: Build Embedding Table (vocab_size × embed_dim) │\n",
|
||
"│ ┌────────────────────────────────────────────────────────┐ │\n",
|
||
"│ │ Token ID │ Embedding Vector (learned features) │ │\n",
|
||
"│ ├────────────────────────────────────────────────────────┤ │\n",
|
||
"│ │ 0 │ [0.2, -0.1, 0.3, 0.8, ...] (<UNK>) │ │\n",
|
||
"│ │ 1 │ [0.1, 0.4, -0.2, 0.6, ...] (\"the\") │ │\n",
|
||
"│ │ 42 │ [0.7, -0.2, 0.1, 0.4, ...] (\"cat\") │ │\n",
|
||
"│ │ 7 │ [-0.3, 0.1, 0.5, 0.2, ...] (\"sat\") │ │\n",
|
||
"│ │ ... │ ... │ │\n",
|
||
"│ └────────────────────────────────────────────────────────┘ │\n",
|
||
"│ │\n",
|
||
"│ Step 2: Lookup Process (O(1) per token) │\n",
|
||
"│ ┌────────────────────────────────────────────────────────┐ │\n",
|
||
"│ │ Input: Token IDs [1, 42, 7] │ │\n",
|
||
"│ │ │ │\n",
|
||
"│ │ ID 1 → embedding[1] → [0.1, 0.4, -0.2, ...] │ │\n",
|
||
"│ │ ID 42 → embedding[42] → [0.7, -0.2, 0.1, ...] │ │\n",
|
||
"│ │ ID 7 → embedding[7] → [-0.3, 0.1, 0.5, ...] │ │\n",
|
||
"│ │ │ │\n",
|
||
"│ │ Output: Matrix (3 × embed_dim) │ │\n",
|
||
"│ │ [[0.1, 0.4, -0.2, ...], │ │\n",
|
||
"│ │ [0.7, -0.2, 0.1, ...], │ │\n",
|
||
"│ │ [-0.3, 0.1, 0.5, ...]] │ │\n",
|
||
"│ └────────────────────────────────────────────────────────┘ │\n",
|
||
"│ │\n",
|
||
"│ Step 3: Training Updates Embeddings │\n",
|
||
"│ ┌────────────────────────────────────────────────────────┐ │\n",
|
||
"│ │ Gradients flow back to embedding table │ │\n",
|
||
"│ │ │ │\n",
|
||
"│ │ Similar words learn similar vectors: │ │\n",
|
||
"│ │ \"cat\" and \"dog\" → closer in embedding space │ │\n",
|
||
"│ │ \"the\" and \"a\" → closer in embedding space │ │\n",
|
||
"│ │ \"sat\" and \"run\" → farther in embedding space │ │\n",
|
||
"│ └────────────────────────────────────────────────────────┘ │\n",
|
||
"│ │\n",
|
||
"└──────────────────────────────────────────────────────────────┘\n",
|
||
"```\n",
|
||
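"\n",
"A minimal NumPy sketch of the lookup step, assuming a small random table (`table` and `token_ids` are illustrative names, not part of the module). It also checks that indexing gives the same result as multiplying one-hot rows by the table:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"vocab_size, embed_dim = 100, 4\n",
"table = rng.uniform(-0.1, 0.1, (vocab_size, embed_dim))  # illustrative embedding table\n",
"\n",
"token_ids = np.array([1, 42, 7])\n",
"looked_up = table[token_ids]             # advanced indexing: shape (3, 4)\n",
"\n",
"one_hot = np.eye(vocab_size)[token_ids]  # shape (3, 100), a single 1.0 per row\n",
"assert looked_up.shape == (3, embed_dim)\n",
"assert np.allclose(looked_up, one_hot @ table)\n",
"```\n",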
"\n",
|
||
"**Pros**:\n",
|
||
"- Dense representation (every dimension meaningful)\n",
|
||
"- Learnable (captures semantic relationships through training)\n",
|
||
"- Efficient lookup (O(1) time complexity)\n",
|
||
"- Scales to large vocabularies\n",
|
||
"\n",
|
||
"**Cons**:\n",
|
||
"- Memory intensive (vocab_size × embed_dim parameters)\n",
|
||
"- Requires training to develop semantic relationships\n",
|
||
"- Fixed vocabulary (new tokens need special handling)\n",
|
||
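"\n",
"For a rough sense of scale on the memory cost, here is a back-of-the-envelope sketch. The vocabulary size and embedding dimension are assumed example values, not figures from this module:\n",
"\n",
"```python\n",
"BYTES_PER_FLOAT32 = 4\n",
"MB_TO_BYTES = 1024 * 1024\n",
"\n",
"vocab_size, embed_dim = 50_000, 768      # assumed example sizes\n",
"num_params = vocab_size * embed_dim      # 38,400,000 parameters\n",
"table_mb = num_params * BYTES_PER_FLOAT32 / MB_TO_BYTES\n",
"assert num_params == 38_400_000\n",
"print(f'{table_mb:.1f} MB')              # 146.5 MB\n",
"```\n",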
"\n",
|
||
"### Positional Encoding Strategies\n",
|
||
"\n",
|
||
"Since embeddings by themselves have no notion of order, we need positional information:\n",
|
||
"\n",
|
||
"```\n",
|
||
"Position-Aware Embeddings = Token Embeddings + Positional Encoding\n",
|
||
"\n",
|
"Learned Approach:          Fixed Mathematical Approach:\n",
"Position 0 → [learned]     Position 0 → [sin/cos pattern]\n",
"Position 1 → [learned]     Position 1 → [sin/cos pattern]\n",
"Position 2 → [learned]     Position 2 → [sin/cos pattern]\n",
"...                        ...\n",
|
||
"```\n",
|
||
"\n",
|
||
"**Learned Positional Encoding**:\n",
|
||
"- Trainable position embeddings\n",
|
||
"- Can learn task-specific patterns\n",
|
||
"- Limited to maximum training sequence length\n",
|
||
"\n",
|
||
"**Sinusoidal Positional Encoding**:\n",
|
||
"- Mathematical sine/cosine patterns\n",
|
||
"- No additional parameters\n",
|
||
"- Can extrapolate to longer sequences\n",
|
||
"\n",
|
||
"### Strategy Comparison\n",
|
||
"\n",
|
||
"```\n",
|
||
"Text: \"cat sat on mat\" → Token IDs: [42, 7, 15, 99]\n",
|
||
"\n",
|
||
"Token Embeddings: [vec_42, vec_7, vec_15, vec_99] # Same vectors anywhere\n",
|
||
"Position-Aware: [vec_42+pos_0, vec_7+pos_1, vec_15+pos_2, vec_99+pos_3]\n",
|
||
" ↑ Now \"cat\" at position 0 ≠ \"cat\" at position 1\n",
|
||
"```\n",
|
||
"\n",
|
||
"The combination enables transformers to understand both meaning and order!"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "a3d12e84",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"## 3. Implementation - Building Embedding Systems\n",
|
||
"\n",
|
||
"Let's implement embedding systems from basic token lookup to sophisticated position-aware representations. We'll start with the core embedding layer and work up to complete systems."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "cbc321b8",
|
||
"metadata": {
|
||
"lines_to_next_cell": 1,
|
||
"nbgrader": {
|
||
"grade": false,
|
||
"grade_id": "embedding-class",
|
||
"solution": true
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
"#| export\n",
"class Embedding:\n",
"    \"\"\"\n",
"    Learnable embedding layer that maps token indices to dense vectors.\n",
"\n",
"    This is the fundamental building block for converting discrete tokens\n",
"    into continuous representations that neural networks can process.\n",
"\n",
"    TODO: Implement the Embedding class\n",
"\n",
"    APPROACH:\n",
"    1. Initialize embedding matrix with random weights (vocab_size, embed_dim)\n",
"    2. Implement forward pass as matrix lookup using numpy indexing\n",
"    3. Handle batch dimensions correctly\n",
"    4. Return parameters for optimization\n",
"\n",
"    EXAMPLE:\n",
"    >>> embed = Embedding(vocab_size=100, embed_dim=64)\n",
"    >>> tokens = Tensor([[1, 2, 3], [4, 5, 6]])  # batch_size=2, seq_len=3\n",
"    >>> output = embed.forward(tokens)\n",
"    >>> print(output.shape)\n",
"    (2, 3, 64)\n",
"\n",
"    HINTS:\n",
"    - Use numpy advanced indexing for lookup: weight[indices]\n",
"    - Embedding matrix shape: (vocab_size, embed_dim)\n",
"    - Initialize with Xavier/Glorot uniform for stable gradients\n",
"    - Handle multi-dimensional indices correctly\n",
"    \"\"\"\n",
"\n",
"    ### BEGIN SOLUTION\n",
"    def __init__(self, vocab_size: int, embed_dim: int):\n",
"        \"\"\"\n",
"        Initialize embedding layer.\n",
"\n",
"        Args:\n",
"            vocab_size: Size of vocabulary (number of unique tokens)\n",
"            embed_dim: Dimension of embedding vectors\n",
"        \"\"\"\n",
"        self.vocab_size = vocab_size\n",
"        self.embed_dim = embed_dim\n",
"\n",
"        # Xavier initialization for better gradient flow\n",
"        limit = math.sqrt(6.0 / (vocab_size + embed_dim))\n",
"        self.weight = Tensor(\n",
"            np.random.uniform(-limit, limit, (vocab_size, embed_dim)),\n",
"            requires_grad=True\n",
"        )\n",
"\n",
"    def forward(self, indices: Tensor) -> Tensor:\n",
"        \"\"\"\n",
"        Forward pass: lookup embeddings for given indices.\n",
"\n",
"        Args:\n",
"            indices: Token indices of shape (batch_size, seq_len) or (seq_len,)\n",
"\n",
"        Returns:\n",
"            Embedded vectors of shape (*indices.shape, embed_dim)\n",
"        \"\"\"\n",
"        # Handle input validation\n",
"        if np.any(indices.data >= self.vocab_size) or np.any(indices.data < 0):\n",
"            raise ValueError(\n",
"                f\"Index out of range. Expected 0 <= indices < {self.vocab_size}, \"\n",
"                f\"got min={np.min(indices.data)}, max={np.max(indices.data)}\"\n",
"            )\n",
"\n",
"        # Perform embedding lookup using advanced indexing\n",
"        # This is equivalent to one-hot multiplication but much more efficient\n",
"        embedded = self.weight.data[indices.data.astype(int)]\n",
"\n",
"        # Create result tensor with gradient tracking\n",
"        result = Tensor(embedded, requires_grad=self.weight.requires_grad)\n",
"\n",
"        # Attach backward function for gradient computation (following TinyTorch protocol)\n",
"        if result.requires_grad:\n",
"            result._grad_fn = EmbeddingBackward(self.weight, indices)\n",
"\n",
"        return result\n",
"\n",
"    def __call__(self, indices: Tensor) -> Tensor:\n",
"        \"\"\"Allows the embedding to be called like a function.\"\"\"\n",
"        return self.forward(indices)\n",
"\n",
"    def parameters(self) -> List[Tensor]:\n",
"        \"\"\"Return trainable parameters.\"\"\"\n",
"        return [self.weight]\n",
"\n",
"    def __repr__(self):\n",
"        return f\"Embedding(vocab_size={self.vocab_size}, embed_dim={self.embed_dim})\"\n",
"    ### END SOLUTION"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "7c6d9cfb",
|
||
"metadata": {
|
||
"nbgrader": {
|
||
"grade": true,
|
||
"grade_id": "test-embedding",
|
||
"locked": true,
|
||
"points": 10
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
"def test_unit_embedding():\n",
"    \"\"\"🔬 Unit Test: Embedding Layer Implementation\"\"\"\n",
"    print(\"🔬 Unit Test: Embedding Layer...\")\n",
"\n",
"    # Test 1: Basic embedding creation and forward pass\n",
"    embed = Embedding(vocab_size=100, embed_dim=64)\n",
"\n",
"    # Single sequence\n",
"    tokens = Tensor([1, 2, 3])\n",
"    output = embed.forward(tokens)\n",
"\n",
"    assert output.shape == (3, 64), f\"Expected shape (3, 64), got {output.shape}\"\n",
"    assert len(embed.parameters()) == 1, \"Should have 1 parameter (weight matrix)\"\n",
"    assert embed.parameters()[0].shape == (100, 64), \"Weight matrix has wrong shape\"\n",
"\n",
"    # Test 2: Batch processing\n",
"    batch_tokens = Tensor([[1, 2, 3], [4, 5, 6]])\n",
"    batch_output = embed.forward(batch_tokens)\n",
"\n",
"    assert batch_output.shape == (2, 3, 64), f\"Expected batch shape (2, 3, 64), got {batch_output.shape}\"\n",
"\n",
"    # Test 3: Embedding lookup consistency\n",
"    single_lookup = embed.forward(Tensor([1]))\n",
"    batch_lookup = embed.forward(Tensor([[1]]))\n",
"\n",
"    # Should get same embedding for same token\n",
"    assert np.allclose(single_lookup.data[0], batch_lookup.data[0, 0]), \"Inconsistent embedding lookup\"\n",
"\n",
"    # Test 4: Parameter access\n",
"    params = embed.parameters()\n",
"    assert all(p.requires_grad for p in params), \"All parameters should require gradients\"\n",
"\n",
"    print(\"✅ Embedding layer works correctly!\")\n",
"\n",
"# Run test immediately when developing this module\n",
"if __name__ == \"__main__\":\n",
"    test_unit_embedding()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "d9f57aca",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\""
|
||
},
|
||
"source": [
|
||
"### Learned Positional Encoding\n",
|
||
"\n",
|
||
"Trainable position embeddings that can learn position-specific patterns. This approach treats each position as a learnable parameter, similar to token embeddings.\n",
|
||
"\n",
|
||
"```\n",
|
||
"Learned Position Embedding Process:\n",
|
||
"\n",
|
||
"Step 1: Initialize Position Embedding Table\n",
|
||
"┌───────────────────────────────────────────────────────────────┐\n",
|
||
"│ Position │ Learnable Vector (trainable parameters) │\n",
|
||
"├───────────────────────────────────────────────────────────────┤\n",
|
||
"│ 0 │ [0.1, -0.2, 0.4, ...] ← learns \"start\" patterns │\n",
|
||
"│ 1 │ [0.3, 0.1, -0.1, ...] ← learns \"second\" patterns│\n",
|
||
"│ 2 │ [-0.1, 0.5, 0.2, ...] ← learns \"third\" patterns │\n",
|
||
"│ ... │ ... │\n",
|
||
"│ 511 │ [0.4, -0.3, 0.1, ...] ← learns \"late\" patterns │\n",
|
||
"└───────────────────────────────────────────────────────────────┘\n",
|
||
"\n",
|
||
"Step 2: Add to Token Embeddings\n",
|
||
"Input: [\"The\", \"cat\", \"sat\"] → Token IDs: [1, 42, 7]\n",
|
||
"\n",
|
||
"Token embeddings: Position embeddings: Combined:\n",
|
||
"[1] → [0.1, 0.4, ...] + [0.1, -0.2, ...] = [0.2, 0.2, ...]\n",
|
||
"[42] → [0.7, -0.2, ...] + [0.3, 0.1, ...] = [1.0, -0.1, ...]\n",
|
||
"[7] → [-0.3, 0.1, ...] + [-0.1, 0.5, ...] = [-0.4, 0.6, ...]\n",
|
||
"\n",
|
||
"Result: Position-aware embeddings that can learn task-specific patterns!\n",
|
||
"```\n",
|
||
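"\n",
"The addition in Step 2 can be sketched with plain NumPy broadcasting. This is an illustration under assumed shapes, not the module's `PositionalEncoding` class; `pos_table` stands in for the learned position matrix:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"max_seq_len, embed_dim = 512, 8\n",
"pos_table = rng.uniform(-0.1, 0.1, (max_seq_len, embed_dim))  # stand-in for learned positions\n",
"\n",
"token_vecs = rng.normal(size=(2, 3, embed_dim))  # (batch, seq_len, embed_dim)\n",
"seq_len = token_vecs.shape[1]\n",
"\n",
"# Slice the first seq_len rows; (seq_len, embed_dim) broadcasts over the batch axis\n",
"combined = token_vecs + pos_table[:seq_len]\n",
"assert combined.shape == (2, 3, embed_dim)\n",
"assert np.allclose(combined[1, 2], token_vecs[1, 2] + pos_table[2])\n",
"```\n",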
"\n",
|
||
"**Why learned positions work**: The model can discover that certain positions have special meaning (like sentence beginnings, question words, etc.) and learn specific representations for those patterns."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "08efa3db",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"### Implementing Learned Positional Encoding\n",
|
||
"\n",
|
||
"Let's build trainable positional embeddings that can learn position-specific patterns for our specific task."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "8c6621dc",
|
||
"metadata": {
|
||
"lines_to_next_cell": 1,
|
||
"nbgrader": {
|
||
"grade": false,
|
||
"grade_id": "positional-encoding",
|
||
"solution": true
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
"#| export\n",
"class PositionalEncoding:\n",
"    \"\"\"\n",
"    Learnable positional encoding layer.\n",
"\n",
"    Adds trainable position-specific vectors to token embeddings,\n",
"    allowing the model to learn positional patterns specific to the task.\n",
"\n",
"    TODO: Implement learnable positional encoding\n",
"\n",
"    APPROACH:\n",
"    1. Create embedding matrix for positions: (max_seq_len, embed_dim)\n",
"    2. Forward pass: lookup position embeddings and add to input\n",
"    3. Handle different sequence lengths gracefully\n",
"    4. Return parameters for training\n",
"\n",
"    EXAMPLE:\n",
"    >>> pos_enc = PositionalEncoding(max_seq_len=512, embed_dim=64)\n",
"    >>> embeddings = Tensor(np.random.randn(2, 10, 64))  # (batch, seq, embed)\n",
"    >>> output = pos_enc.forward(embeddings)\n",
"    >>> print(output.shape)\n",
"    (2, 10, 64)  # Same shape, but now position-aware\n",
"\n",
"    HINTS:\n",
"    - Position embeddings shape: (max_seq_len, embed_dim)\n",
"    - Use slice [:seq_len] to handle variable lengths\n",
"    - Add position encodings to input embeddings element-wise\n",
"    - Initialize with small values (they are added to the token embeddings)\n",
"    \"\"\"\n",
"\n",
"    ### BEGIN SOLUTION\n",
"    def __init__(self, max_seq_len: int, embed_dim: int):\n",
"        \"\"\"\n",
"        Initialize learnable positional encoding.\n",
"\n",
"        Args:\n",
"            max_seq_len: Maximum sequence length to support\n",
"            embed_dim: Embedding dimension (must match token embeddings)\n",
"        \"\"\"\n",
"        self.max_seq_len = max_seq_len\n",
"        self.embed_dim = embed_dim\n",
"\n",
"        # Initialize position embedding matrix\n",
"        # Small uniform initialization, since these values are added to the token embeddings\n",
"        limit = math.sqrt(2.0 / embed_dim)\n",
"        self.position_embeddings = Tensor(\n",
"            np.random.uniform(-limit, limit, (max_seq_len, embed_dim)),\n",
"            requires_grad=True\n",
"        )\n",
"\n",
"    def forward(self, x: Tensor) -> Tensor:\n",
"        \"\"\"\n",
"        Add positional encodings to input embeddings.\n",
"\n",
"        Args:\n",
"            x: Input embeddings of shape (batch_size, seq_len, embed_dim)\n",
"\n",
"        Returns:\n",
"            Position-encoded embeddings of same shape\n",
"        \"\"\"\n",
"        if len(x.shape) != 3:\n",
"            raise ValueError(f\"Expected 3D input (batch, seq, embed), got shape {x.shape}\")\n",
"\n",
"        batch_size, seq_len, embed_dim = x.shape\n",
"\n",
"        if seq_len > self.max_seq_len:\n",
"            raise ValueError(\n",
"                f\"Sequence length {seq_len} exceeds maximum {self.max_seq_len}\"\n",
"            )\n",
"\n",
"        if embed_dim != self.embed_dim:\n",
"            raise ValueError(\n",
"                f\"Embedding dimension mismatch: expected {self.embed_dim}, got {embed_dim}\"\n",
"            )\n",
"\n",
"        # Slice position embeddings for this sequence length using Tensor slicing\n",
"        # This now preserves gradient flow (as of Module 01 update with __getitem__)\n",
"        pos_embeddings = self.position_embeddings[:seq_len]  # (seq_len, embed_dim) - gradients preserved!\n",
"\n",
"        # Reshape to add batch dimension: (1, seq_len, embed_dim)\n",
"        # Need to use .data for reshaping temporarily, then wrap in Tensor\n",
"        pos_data = pos_embeddings.data[np.newaxis, :, :]\n",
"        pos_embeddings_batched = Tensor(pos_data, requires_grad=pos_embeddings.requires_grad)\n",
"\n",
"        # Copy gradient function if it exists (to preserve backward connection)\n",
"        if hasattr(pos_embeddings, '_grad_fn') and pos_embeddings._grad_fn is not None:\n",
"            pos_embeddings_batched._grad_fn = pos_embeddings._grad_fn\n",
"\n",
"        # Add positional information - gradients flow through both x and pos_embeddings!\n",
"        result = x + pos_embeddings_batched\n",
"\n",
"        return result\n",
"\n",
"    def __call__(self, x: Tensor) -> Tensor:\n",
"        \"\"\"Allows the positional encoding to be called like a function.\"\"\"\n",
"        return self.forward(x)\n",
"\n",
"    def parameters(self) -> List[Tensor]:\n",
"        \"\"\"Return trainable parameters.\"\"\"\n",
"        return [self.position_embeddings]\n",
"\n",
"    def __repr__(self):\n",
"        return f\"PositionalEncoding(max_seq_len={self.max_seq_len}, embed_dim={self.embed_dim})\"\n",
"    ### END SOLUTION"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "5cd9ec68",
|
||
"metadata": {
|
||
"nbgrader": {
|
||
"grade": true,
|
||
"grade_id": "test-positional",
|
||
"locked": true,
|
||
"points": 10
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
"def test_unit_positional_encoding():\n",
"    \"\"\"🔬 Unit Test: Positional Encoding Implementation\"\"\"\n",
"    print(\"🔬 Unit Test: Positional Encoding...\")\n",
"\n",
"    # Test 1: Basic functionality\n",
"    pos_enc = PositionalEncoding(max_seq_len=512, embed_dim=64)\n",
"\n",
"    # Create sample embeddings\n",
"    embeddings = Tensor(np.random.randn(2, 10, 64))\n",
"    output = pos_enc.forward(embeddings)\n",
"\n",
"    assert output.shape == (2, 10, 64), f\"Expected shape (2, 10, 64), got {output.shape}\"\n",
"\n",
"    # Test 2: Position consistency\n",
"    # Same position should always get same encoding\n",
"    emb1 = Tensor(np.zeros((1, 5, 64)))\n",
"    emb2 = Tensor(np.zeros((1, 5, 64)))\n",
"\n",
"    out1 = pos_enc.forward(emb1)\n",
"    out2 = pos_enc.forward(emb2)\n",
"\n",
"    assert np.allclose(out1.data, out2.data), \"Position encodings should be consistent\"\n",
"\n",
"    # Test 3: Different positions get different encodings\n",
"    short_emb = Tensor(np.zeros((1, 3, 64)))\n",
"    long_emb = Tensor(np.zeros((1, 5, 64)))\n",
"\n",
"    short_out = pos_enc.forward(short_emb)\n",
"    long_out = pos_enc.forward(long_emb)\n",
"\n",
"    # First 3 positions should match\n",
"    assert np.allclose(short_out.data, long_out.data[:, :3, :]), \"Position encoding prefix should match\"\n",
"\n",
"    # Test 4: Parameters\n",
"    params = pos_enc.parameters()\n",
"    assert len(params) == 1, \"Should have 1 parameter (position embeddings)\"\n",
"    assert params[0].shape == (512, 64), \"Position embedding matrix has wrong shape\"\n",
"\n",
"    print(\"✅ Positional encoding works correctly!\")\n",
"\n",
"# Run test immediately when developing this module\n",
"if __name__ == \"__main__\":\n",
"    test_unit_positional_encoding()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "cb37c69a",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\""
|
||
},
|
||
"source": [
|
||
"### Sinusoidal Positional Encoding\n",
|
||
"\n",
|
||
"Mathematical position encoding that creates unique signatures for each position using trigonometric functions. This approach requires no additional parameters and can extrapolate to sequences longer than seen during training.\n",
|
||
"\n",
|
||
"```\n",
|
||
"┌───────────────────────────────────────────────────────────────────────────┐\n",
|
||
"│ SINUSOIDAL POSITION ENCODING: Mathematical Position Signatures │\n",
|
||
"├───────────────────────────────────────────────────────────────────────────┤\n",
|
||
"│ │\n",
|
||
"│ MATHEMATICAL FORMULA: │\n",
|
||
"│ ┌──────────────────────────────────────────────────────────────┐ │\n",
|
||
"│ │ PE(pos, 2i) = sin(pos / 10000^(2i/embed_dim)) # Even dims │ │\n",
|
||
"│ │ PE(pos, 2i+1) = cos(pos / 10000^(2i/embed_dim)) # Odd dims │ │\n",
|
||
"│ │ │ │\n",
|
||
"│ │ Where: │ │\n",
|
||
"│ │ pos = position in sequence (0, 1, 2, ...) │ │\n",
|
||
"│ │ i = dimension pair index (0, 1, 2, ...) │ │\n",
|
||
"│ │ 10000 = base frequency (creates different wavelengths) │ │\n",
|
||
"│ └──────────────────────────────────────────────────────────────┘ │\n",
|
||
"│ │\n",
|
"│ FREQUENCY PATTERN ACROSS DIMENSIONS (embed_dim = 8): │\n",
"│ ┌──────────────────────────────────────────────────────────────┐ │\n",
"│ │ Dimension: 0 1 2 3 4 5 6 7 │ │\n",
"│ │ Frequency: High High Med Med Low Low VLow VLow │ │\n",
"│ │ Function: sin cos sin cos sin cos sin cos │ │\n",
"│ │ │ │\n",
"│ │ pos=0: [0.00, 1.00, 0.00, 1.00, 0.00, 1.00, 0.00, 1.00] │ │\n",
"│ │ pos=1: [0.84, 0.54, 0.10, 1.00, 0.01, 1.00, 0.00, 1.00] │ │\n",
"│ │ pos=2: [0.91,-0.42, 0.20, 0.98, 0.02, 1.00, 0.00, 1.00] │ │\n",
"│ │ pos=3: [0.14,-0.99, 0.30, 0.96, 0.03, 1.00, 0.00, 1.00] │ │\n",
|
||
"│ │ │ │\n",
|
||
"│ │ Each position gets a unique mathematical \"fingerprint\"! │ │\n",
|
||
"│ └──────────────────────────────────────────────────────────────┘ │\n",
|
||
"│ │\n",
|
||
"│ WHY THIS WORKS: │\n",
|
||
"│ ┌──────────────────────────────────────────────────────────────┐ │\n",
|
||
"│ │ Wave Pattern Visualization: │ │\n",
|
||
"│ │ │ │\n",
|
||
"│ │ Dim 0: ∿∿∿∿∿∿∿∿∿∿∿∿∿∿∿∿∿∿∿∿ (rapid oscillation) │ │\n",
|
||
"│ │ Dim 2: ∿---∿---∿---∿---∿---∿ (medium frequency) │ │\n",
|
||
"│ │ Dim 4: ∿-----∿-----∿-----∿-- (low frequency) │ │\n",
|
||
"│ │ Dim 6: ∿----------∿---------- (very slow changes) │ │\n",
|
||
"│ │ │ │\n",
|
||
"│ │ • High frequency dims change rapidly between positions │ │\n",
|
||
"│ │ • Low frequency dims change slowly │ │\n",
|
||
"│ │ • Combination creates unique signature for each position │ │\n",
|
||
"│ │ • Similar positions have similar (but distinct) encodings │ │\n",
|
||
"│ └──────────────────────────────────────────────────────────────┘ │\n",
|
||
"│ │\n",
|
||
"│ KEY ADVANTAGES: │\n",
|
"│ • Zero learned parameters (nothing to train) │\n",
"│ • No fixed length limit (can be computed for any position) │\n",
|
||
"│ • Smooth transitions (nearby positions are similar) │\n",
|
||
"│ • Mathematical elegance (interpretable patterns) │\n",
|
||
"│ │\n",
|
||
"└───────────────────────────────────────────────────────────────────────────┘\n",
|
||
"```\n",
|
||
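"\n",
"The formula in the box can be sketched directly in NumPy. `sinusoidal_table` is a hypothetical helper for illustration and assumes an even `embed_dim`:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def sinusoidal_table(max_seq_len, embed_dim):\n",
"    pos = np.arange(max_seq_len)[:, None]        # (max_seq_len, 1)\n",
"    two_i = np.arange(0, embed_dim, 2)[None, :]  # the 2i values: 0, 2, 4, ...\n",
"    angles = pos / np.power(10000.0, two_i / embed_dim)\n",
"    pe = np.zeros((max_seq_len, embed_dim))\n",
"    pe[:, 0::2] = np.sin(angles)                 # even dims\n",
"    pe[:, 1::2] = np.cos(angles)                 # odd dims\n",
"    return pe\n",
"\n",
"pe = sinusoidal_table(4, 8)\n",
"assert np.allclose(pe[0], [0, 1, 0, 1, 0, 1, 0, 1])  # sin(0)=0, cos(0)=1\n",
"assert np.allclose(pe[1, :2], [np.sin(1.0), np.cos(1.0)])\n",
"```\n",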
"\n",
|
||
"**Why transformers use this**: The mathematical structure allows the model to learn relative positions (how far apart tokens are) through simple vector operations, which is crucial for attention mechanisms!"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "2374cd16",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"### Implementing Sinusoidal Positional Encodings\n",
|
||
"\n",
|
||
"Let's implement the mathematical position encoding that creates unique signatures for each position using trigonometric functions."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "cc335811",
|
||
"metadata": {
|
||
"lines_to_next_cell": 1,
|
||
"nbgrader": {
|
||
"grade": false,
|
||
"grade_id": "sinusoidal-function",
|
||
"solution": true
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
"#| export\n",
"def create_sinusoidal_embeddings(max_seq_len: int, embed_dim: int) -> Tensor:\n",
"    \"\"\"\n",
"    Create sinusoidal positional encodings as used in \"Attention Is All You Need\".\n",
"\n",
"    These fixed encodings use sine and cosine functions to create unique\n",
"    positional patterns that don't require training and can extrapolate\n",
"    to longer sequences than seen during training.\n",
"\n",
"    TODO: Implement sinusoidal positional encoding generation\n",
"\n",
"    APPROACH:\n",
"    1. Create position indices: [0, 1, 2, ..., max_seq_len-1]\n",
"    2. Create dimension indices for frequency calculation\n",
"    3. Apply sine to even dimensions, cosine to odd dimensions\n",
"    4. Use the transformer paper formula with 10000 base\n",
"\n",
"    MATHEMATICAL FORMULA:\n",
"    PE(pos, 2i) = sin(pos / 10000^(2i/embed_dim))\n",
"    PE(pos, 2i+1) = cos(pos / 10000^(2i/embed_dim))\n",
"\n",
"    EXAMPLE:\n",
"    >>> pe = create_sinusoidal_embeddings(512, 64)\n",
"    >>> print(pe.shape)\n",
"    (512, 64)\n",
"    >>> # Position 0: [0, 1, 0, 1, 0, 1, ...] (sin(0)=0, cos(0)=1)\n",
"    >>> # Each position gets unique trigonometric signature\n",
"\n",
"    HINTS:\n",
"    - Use np.arange to create position and dimension arrays\n",
"    - Calculate div_term using exponential for frequency scaling\n",
"    - Apply different formulas to even/odd dimensions\n",
"    - The 10000 base creates different frequencies for different dimensions\n",
"    \"\"\"\n",
"\n",
"    ### BEGIN SOLUTION\n",
"    # Create position indices [0, 1, 2, ..., max_seq_len-1]\n",
"    position = np.arange(max_seq_len, dtype=np.float32)[:, np.newaxis]  # (max_seq_len, 1)\n",
"\n",
"    # Create dimension indices for calculating frequencies\n",
"    div_term = np.exp(\n",
"        np.arange(0, embed_dim, 2, dtype=np.float32) *\n",
"        -(math.log(10000.0) / embed_dim)\n",
"    )  # ((embed_dim + 1) // 2,)\n",
"\n",
"    # Initialize the positional encoding matrix\n",
"    pe = np.zeros((max_seq_len, embed_dim), dtype=np.float32)\n",
"\n",
"    # Apply sine to even indices (0, 2, 4, ...)\n",
"    pe[:, 0::2] = np.sin(position * div_term)\n",
"\n",
"    # Apply cosine to odd indices (1, 3, 5, ...)\n",
"    if embed_dim % 2 == 1:\n",
"        # Odd embed_dim: there is one fewer odd column than even, so drop the last frequency\n",
"        pe[:, 1::2] = np.cos(position * div_term[:-1])\n",
"    else:\n",
"        pe[:, 1::2] = np.cos(position * div_term)\n",
"\n",
"    return Tensor(pe)\n",
"    ### END SOLUTION"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "e9524da3",
|
||
"metadata": {
|
||
"nbgrader": {
|
||
"grade": true,
|
||
"grade_id": "test-sinusoidal",
|
||
"locked": true,
|
||
"points": 10
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
"def test_unit_sinusoidal_embeddings():\n",
"    \"\"\"🔬 Unit Test: Sinusoidal Positional Embeddings\"\"\"\n",
"    print(\"🔬 Unit Test: Sinusoidal Embeddings...\")\n",
"\n",
"    # Test 1: Basic shape and properties\n",
"    pe = create_sinusoidal_embeddings(512, 64)\n",
"\n",
"    assert pe.shape == (512, 64), f\"Expected shape (512, 64), got {pe.shape}\"\n",
"\n",
"    # Test 2: Position 0 should alternate between 0 (sin) and 1 (cos)\n",
"    pos_0 = pe.data[0]\n",
"\n",
"    # Even indices should be sin(0) = 0\n",
"    assert np.allclose(pos_0[0::2], 0, atol=1e-6), \"Even indices at position 0 should be ~0\"\n",
"\n",
"    # Odd indices should be cos(0) = 1\n",
"    assert np.allclose(pos_0[1::2], 1, atol=1e-6), \"Odd indices at position 0 should be ~1\"\n",
"\n",
"    # Test 3: Different positions should have different encodings\n",
"    pe_small = create_sinusoidal_embeddings(10, 8)\n",
"\n",
"    # Check that consecutive positions are different\n",
"    for i in range(9):\n",
"        assert not np.allclose(pe_small.data[i], pe_small.data[i+1]), f\"Positions {i} and {i+1} are too similar\"\n",
"\n",
"    # Test 4: Frequency properties\n",
"    # Higher dimensions should have lower frequencies (change more slowly)\n",
"    pe_test = create_sinusoidal_embeddings(100, 16)\n",
"\n",
"    # First dimension should change faster than last dimension\n",
"    first_dim_changes = np.sum(np.abs(np.diff(pe_test.data[:10, 0])))\n",
"    last_dim_changes = np.sum(np.abs(np.diff(pe_test.data[:10, -1])))\n",
"\n",
"    assert first_dim_changes > last_dim_changes, \"Lower dimensions should change faster than higher dimensions\"\n",
"\n",
"    # Test 5: Odd embed_dim handling\n",
"    pe_odd = create_sinusoidal_embeddings(10, 7)\n",
"    assert pe_odd.shape == (10, 7), \"Should handle odd embedding dimensions\"\n",
"\n",
"    print(\"✅ Sinusoidal embeddings work correctly!\")\n",
"\n",
"# Run test immediately when developing this module\n",
"if __name__ == \"__main__\":\n",
"    test_unit_sinusoidal_embeddings()"
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "5aba62c8",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\""
|
||
},
|
||
"source": [
|
||
"## 4. Integration - Bringing It Together\n",
|
||
"\n",
|
||
"Now let's build the complete embedding system that combines token and positional embeddings into a production-ready component used in modern transformers and language models.\n",
|
||
"\n",
|
||
"```\n",
|
||
"Complete Embedding Pipeline:\n",
|
||
"\n",
"1. Token Lookup → 2. Position Encoding → 3. Combination → 4. Ready for Attention\n",
"       ↓                   ↓                    ↓                   ↓\n",
"  sparse IDs         position info        dense vectors      context-aware\n",
|
||
"```"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "5412ea70",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"### Complete Embedding System Architecture\n",
|
||
"\n",
|
||
"The production embedding layer that powers modern transformers combines multiple components into an efficient, flexible pipeline.\n",
|
||
"\n",
|
||
"```\n",
|
||
"┌───────────────────────────────────────────────────────────────────────────┐\n",
|
||
"│ COMPLETE EMBEDDING SYSTEM: Token + Position → Attention-Ready │\n",
|
||
"├───────────────────────────────────────────────────────────────────────────┤\n",
|
||
"│ │\n",
|
||
"│ INPUT: Token IDs [1, 42, 7, 99] │\n",
|
||
"│ │ │\n",
|
||
"│ ├─ STEP 1: TOKEN EMBEDDING LOOKUP │\n",
|
||
"│ │ ┌─────────────────────────────────────────────────────────┐ │\n",
|
||
"│ │ │ Token Embedding Table (vocab_size × embed_dim) │ │\n",
|
||
"│ │ │ │ │\n",
|
||
"│ │ │ ID 1 → [0.1, 0.4, -0.2, ...] (semantic features) │ │\n",
|
||
"│ │ │ ID 42 → [0.7, -0.2, 0.1, ...] (learned meaning) │ │\n",
|
||
"│ │ │ ID 7 → [-0.3, 0.1, 0.5, ...] (dense vector) │ │\n",
|
||
"│ │ │ ID 99 → [0.9, -0.1, 0.3, ...] (context-free) │ │\n",
|
||
"│ │ └─────────────────────────────────────────────────────────┘ │\n",
|
||
"│ │ │\n",
|
||
"│ ├─ STEP 2: POSITIONAL ENCODING (Choose Strategy) │\n",
|
||
"│ │ ┌─────────────────────────────────────────────────────────┐ │\n",
|
||
"│ │ │ Strategy A: Learned PE │ │\n",
|
||
"│ │ │ pos 0 → [trainable vector] (learns patterns) │ │\n",
|
||
"│ │ │ pos 1 → [trainable vector] (task-specific) │ │\n",
|
||
"│ │ │ pos 2 → [trainable vector] (fixed max length) │ │\n",
|
||
"│ │ │ │ │\n",
|
||
"│ │ │ Strategy B: Sinusoidal PE │ │\n",
|
||
"│ │ │ pos 0 → [sin/cos pattern] (mathematical) │ │\n",
|
||
"│ │ │ pos 1 → [sin/cos pattern] (no parameters) │ │\n",
|
||
"│ │ │ pos 2 → [sin/cos pattern] (infinite length) │ │\n",
|
||
"│ │ │ │ │\n",
|
||
"│ │ │ Strategy C: No PE │ │\n",
|
||
"│ │ │ positions ignored (order-agnostic) │ │\n",
|
||
"│ │ └─────────────────────────────────────────────────────────┘ │\n",
|
||
"│ │ │\n",
|
||
"│ ├─ STEP 3: ELEMENT-WISE ADDITION │\n",
|
||
"│ │ ┌─────────────────────────────────────────────────────────┐ │\n",
|
||
"│ │ │ Token + Position = Position-Aware Representation │ │\n",
|
||
"│ │ │ │ │\n",
|
||
"│ │ │ [0.1, 0.4, -0.2] + [pos0] = [0.1+p0, 0.4+p0, ...] │ │\n",
|
||
"│ │ │ [0.7, -0.2, 0.1] + [pos1] = [0.7+p1, -0.2+p1, ...] │ │\n",
|
||
"│ │ │ [-0.3, 0.1, 0.5] + [pos2] = [-0.3+p2, 0.1+p2, ...] │ │\n",
|
||
"│ │ │ [0.9, -0.1, 0.3] + [pos3] = [0.9+p3, -0.1+p3, ...] │ │\n",
|
||
"│ │ └─────────────────────────────────────────────────────────┘ │\n",
|
||
"│ │ │\n",
|
||
"│ ├─ STEP 4: OPTIONAL SCALING (Transformer Convention) │\n",
|
||
"│ │ ┌─────────────────────────────────────────────────────────┐ │\n",
"│ │ │ Scale token embeddings by √embed_dim (applied before Step 3) │ │\n",
|
||
"│ │ │ Helps balance token and position magnitudes │ │\n",
|
||
"│ │ └─────────────────────────────────────────────────────────┘ │\n",
|
||
"│ │ │\n",
|
||
"│ └─ OUTPUT: Position-Aware Dense Vectors │\n",
|
||
"│ Ready for attention mechanisms and transformers! │\n",
|
||
"│ │\n",
|
||
"│ INTEGRATION FEATURES: │\n",
|
||
"│ • Flexible position encoding (learned/sinusoidal/none) │\n",
|
||
"│ • Efficient batch processing with variable sequence lengths │\n",
|
||
"│ • Memory optimization (shared position encodings) │\n",
|
||
"│ • Production patterns (matches PyTorch/HuggingFace) │\n",
|
||
"│ │\n",
|
||
"└───────────────────────────────────────────────────────────────────────────┘\n",
|
||
"```\n",
|
||
"\n",
"**Why this architecture works**: Keeping token semantics and positional information in separate tables lets the model learn meaning and order independently, then combine them with a simple element-wise addition."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "b4c0305c",
|
||
"metadata": {
|
||
"lines_to_next_cell": 1,
|
||
"nbgrader": {
|
||
"grade": false,
|
||
"grade_id": "complete-system",
|
||
"solution": true
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
"#| export\n",
"class EmbeddingLayer:\n",
"    \"\"\"\n",
"    Complete embedding system combining token and positional embeddings.\n",
"\n",
"    This is the production-ready component that handles the full embedding\n",
"    pipeline used in transformers and other sequence models.\n",
"\n",
"    TODO: Implement complete embedding system\n",
"\n",
"    APPROACH:\n",
"    1. Combine token embedding + positional encoding\n",
"    2. Support both learned and sinusoidal position encodings\n",
"    3. Handle variable sequence lengths gracefully\n",
"    4. Add optional embedding scaling (Transformer convention)\n",
"\n",
"    EXAMPLE:\n",
"    >>> embed_layer = EmbeddingLayer(\n",
"    ...     vocab_size=50000,\n",
"    ...     embed_dim=512,\n",
"    ...     max_seq_len=2048,\n",
"    ...     pos_encoding='learned'\n",
"    ... )\n",
"    >>> tokens = Tensor([[1, 2, 3], [4, 5, 6]])\n",
"    >>> output = embed_layer.forward(tokens)\n",
"    >>> print(output.shape)\n",
"    (2, 3, 512)\n",
"\n",
"    HINTS:\n",
"    - First apply token embedding, then add positional encoding\n",
"    - Support 'learned', 'sinusoidal', or None for pos_encoding\n",
"    - Handle both 2D (batch, seq) and 1D (seq) inputs gracefully\n",
"    - Scale embeddings by sqrt(embed_dim) if requested (transformer convention)\n",
"    \"\"\"\n",
"\n",
"    ### BEGIN SOLUTION\n",
"    def __init__(\n",
"        self,\n",
"        vocab_size: int,\n",
"        embed_dim: int,\n",
"        max_seq_len: int = 512,\n",
"        pos_encoding: str = 'learned',\n",
"        scale_embeddings: bool = False\n",
"    ):\n",
"        \"\"\"\n",
"        Initialize complete embedding system.\n",
"\n",
"        Args:\n",
"            vocab_size: Size of vocabulary\n",
"            embed_dim: Embedding dimension\n",
"            max_seq_len: Maximum sequence length for positional encoding\n",
"            pos_encoding: Type of positional encoding ('learned', 'sinusoidal', or None)\n",
"            scale_embeddings: Whether to scale embeddings by sqrt(embed_dim)\n",
"        \"\"\"\n",
"        self.vocab_size = vocab_size\n",
"        self.embed_dim = embed_dim\n",
"        self.max_seq_len = max_seq_len\n",
"        self.pos_encoding_type = pos_encoding\n",
"        self.scale_embeddings = scale_embeddings\n",
"\n",
"        # Token embedding layer\n",
"        self.token_embedding = Embedding(vocab_size, embed_dim)\n",
"\n",
"        # Positional encoding\n",
"        if pos_encoding == 'learned':\n",
"            self.pos_encoding = PositionalEncoding(max_seq_len, embed_dim)\n",
"        elif pos_encoding == 'sinusoidal':\n",
"            # Precompute a fixed sinusoidal table (no trainable parameters)\n",
"            self.pos_encoding = create_sinusoidal_embeddings(max_seq_len, embed_dim)\n",
"        elif pos_encoding is None:\n",
"            self.pos_encoding = None\n",
"        else:\n",
"            raise ValueError(f\"Unknown pos_encoding: {pos_encoding}. Use 'learned', 'sinusoidal', or None\")\n",
"\n",
"    def forward(self, tokens: Tensor) -> Tensor:\n",
"        \"\"\"\n",
"        Forward pass through complete embedding system.\n",
"\n",
"        Args:\n",
"            tokens: Token indices of shape (batch_size, seq_len) or (seq_len,)\n",
"\n",
"        Returns:\n",
"            Embedded tokens with positional information\n",
"        \"\"\"\n",
"        # Handle 1D input by adding batch dimension\n",
"        if len(tokens.shape) == 1:\n",
"            # NOTE: Tensor reshape preserves gradients\n",
"            tokens = tokens.reshape(1, -1)\n",
"            squeeze_batch = True\n",
"        else:\n",
"            squeeze_batch = False\n",
"\n",
"        # Get token embeddings\n",
"        token_embeds = self.token_embedding.forward(tokens)  # (batch, seq, embed)\n",
"\n",
"        # Scale token embeddings if requested (transformer convention);\n",
"        # this happens before the positional encoding is added\n",
"        if self.scale_embeddings:\n",
"            scale_factor = math.sqrt(self.embed_dim)\n",
"            token_embeds = token_embeds * scale_factor  # Use Tensor multiplication to preserve gradients\n",
"\n",
"        # Add positional encoding\n",
"        if self.pos_encoding_type == 'learned':\n",
"            # Use learnable positional encoding\n",
"            output = self.pos_encoding.forward(token_embeds)\n",
"        elif self.pos_encoding_type == 'sinusoidal':\n",
"            # Use fixed sinusoidal encoding (not learnable)\n",
"            batch_size, seq_len, embed_dim = token_embeds.shape\n",
"            pos_embeddings = self.pos_encoding[:seq_len]  # Slice using Tensor slicing\n",
"\n",
"            # Add a leading batch dimension so the table broadcasts over the batch\n",
"            pos_data = pos_embeddings.data[np.newaxis, :, :]\n",
"            pos_embeddings_batched = Tensor(pos_data, requires_grad=False)  # Sinusoidal are fixed\n",
"\n",
"            output = token_embeds + pos_embeddings_batched\n",
"        else:\n",
"            # No positional encoding\n",
"            output = token_embeds\n",
"\n",
"        # Remove batch dimension if it was added\n",
"        if squeeze_batch:\n",
"            # Use Tensor slicing (now supported in Module 01)\n",
"            output = output[0]\n",
"\n",
"        return output\n",
"\n",
"    def __call__(self, tokens: Tensor) -> Tensor:\n",
"        \"\"\"Allows the embedding layer to be called like a function.\"\"\"\n",
"        return self.forward(tokens)\n",
"\n",
"    def parameters(self) -> List[Tensor]:\n",
"        \"\"\"Return all trainable parameters.\"\"\"\n",
"        params = self.token_embedding.parameters()\n",
"\n",
"        if self.pos_encoding_type == 'learned':\n",
"            params.extend(self.pos_encoding.parameters())\n",
"\n",
"        return params\n",
"\n",
"    def __repr__(self):\n",
"        # !r prints 'learned' / 'sinusoidal' with quotes and None without\n",
"        return (f\"EmbeddingLayer(vocab_size={self.vocab_size}, \"\n",
"                f\"embed_dim={self.embed_dim}, \"\n",
"                f\"pos_encoding={self.pos_encoding_type!r})\")\n",
"    ### END SOLUTION"
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "c0957e50",
|
||
"metadata": {
|
||
"nbgrader": {
|
||
"grade": true,
|
||
"grade_id": "test-complete-system",
|
||
"locked": true,
|
||
"points": 15
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
"def test_unit_complete_embedding_system():\n",
"    \"\"\"🔬 Unit Test: Complete Embedding System\"\"\"\n",
"    print(\"🔬 Unit Test: Complete Embedding System...\")\n",
"\n",
"    # Test 1: Learned positional encoding\n",
"    embed_learned = EmbeddingLayer(\n",
"        vocab_size=100,\n",
"        embed_dim=64,\n",
"        max_seq_len=128,\n",
"        pos_encoding='learned'\n",
"    )\n",
"\n",
"    tokens = Tensor([[1, 2, 3], [4, 5, 6]])\n",
"    output_learned = embed_learned.forward(tokens)\n",
"\n",
"    assert output_learned.shape == (2, 3, 64), f\"Expected shape (2, 3, 64), got {output_learned.shape}\"\n",
"\n",
"    # Test 2: Sinusoidal positional encoding\n",
"    embed_sin = EmbeddingLayer(\n",
"        vocab_size=100,\n",
"        embed_dim=64,\n",
"        pos_encoding='sinusoidal'\n",
"    )\n",
"\n",
"    output_sin = embed_sin.forward(tokens)\n",
"    assert output_sin.shape == (2, 3, 64), \"Sinusoidal embedding should have same shape\"\n",
"\n",
"    # Test 3: No positional encoding\n",
"    embed_none = EmbeddingLayer(\n",
"        vocab_size=100,\n",
"        embed_dim=64,\n",
"        pos_encoding=None\n",
"    )\n",
"\n",
"    output_none = embed_none.forward(tokens)\n",
"    assert output_none.shape == (2, 3, 64), \"No pos encoding should have same shape\"\n",
"\n",
"    # Test 4: 1D input handling\n",
"    tokens_1d = Tensor([1, 2, 3])\n",
"    output_1d = embed_learned.forward(tokens_1d)\n",
"\n",
"    assert output_1d.shape == (3, 64), f\"Expected shape (3, 64) for 1D input, got {output_1d.shape}\"\n",
"\n",
"    # Test 5: Embedding scaling\n",
"    embed_scaled = EmbeddingLayer(\n",
"        vocab_size=100,\n",
"        embed_dim=64,\n",
"        pos_encoding=None,\n",
"        scale_embeddings=True\n",
"    )\n",
"\n",
"    # Use same weights to ensure fair comparison\n",
"    embed_scaled.token_embedding.weight = embed_none.token_embedding.weight\n",
"\n",
"    output_scaled = embed_scaled.forward(tokens)\n",
"    output_unscaled = embed_none.forward(tokens)\n",
"\n",
"    # Scaled version should be sqrt(64) times larger\n",
"    scale_factor = math.sqrt(64)\n",
"    expected_scaled = output_unscaled.data * scale_factor\n",
"    assert np.allclose(output_scaled.data, expected_scaled, rtol=1e-5), \"Embedding scaling not working correctly\"\n",
"\n",
"    # Test 6: Parameter counting\n",
"    params_learned = embed_learned.parameters()\n",
"    params_sin = embed_sin.parameters()\n",
"    params_none = embed_none.parameters()\n",
"\n",
"    assert len(params_learned) == 2, \"Learned encoding should have 2 parameter tensors\"\n",
"    assert len(params_sin) == 1, \"Sinusoidal encoding should have 1 parameter tensor\"\n",
"    assert len(params_none) == 1, \"No pos encoding should have 1 parameter tensor\"\n",
"\n",
"    print(\"✅ Complete embedding system works correctly!\")\n",
"\n",
"# Run test immediately when developing this module\n",
"if __name__ == \"__main__\":\n",
"    test_unit_complete_embedding_system()"
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "96851b03",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"## 5. Systems Analysis - Embedding Trade-offs\n",
|
||
"\n",
|
||
"Understanding the performance implications of different embedding strategies is crucial for building efficient NLP systems that scale to production workloads."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "9a051315",
|
||
"metadata": {
|
||
"lines_to_next_cell": 1,
|
||
"nbgrader": {
|
||
"grade": false,
|
||
"grade_id": "memory-analysis",
|
||
"solution": true
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
"def analyze_embedding_memory_scaling():\n",
"    \"\"\"📊 Compare embedding memory requirements across different model scales.\"\"\"\n",
"    print(\"📊 Analyzing Embedding Memory Requirements...\")\n",
"\n",
"    # Vocabulary and embedding dimension scenarios\n",
"    scenarios = [\n",
"        (\"Small Model\", 10_000, 256),\n",
"        (\"Medium Model\", 50_000, 512),\n",
"        (\"Large Model\", 100_000, 1024),\n",
"        (\"GPT-3 Scale\", 50_257, 12_288),\n",
"    ]\n",
"\n",
"    print(f\"{'Model':<15} {'Vocab Size':<12} {'Embed Dim':<12} {'Memory (MB)':<15} {'Parameters (M)':<15}\")\n",
"    print(\"-\" * 80)\n",
"\n",
"    for name, vocab_size, embed_dim in scenarios:\n",
"        # Calculate memory for FP32 (4 bytes per parameter)\n",
"        params = vocab_size * embed_dim\n",
"        memory_mb = params * BYTES_PER_FLOAT32 / MB_TO_BYTES\n",
"        params_m = params / 1_000_000\n",
"\n",
"        print(f\"{name:<15} {vocab_size:<12,} {embed_dim:<12} {memory_mb:<15.1f} {params_m:<15.2f}\")\n",
"\n",
"    print(\"\\n💡 Key Insights:\")\n",
"    print(\"• Embedding tables can dominate memory in small models with large vocabularies\")\n",
"    print(\"• Memory scales linearly with vocab_size × embed_dim\")\n",
"    print(\"• Consider vocabulary pruning for memory-constrained environments\")\n",
"\n",
"    # Positional encoding memory comparison\n",
"    print(\"\\n📊 Positional Encoding Memory Comparison (embed_dim=512, max_seq_len=2048):\")\n",
"\n",
"    learned_params = 2048 * 512\n",
"    learned_memory = learned_params * 4 / (1024 * 1024)\n",
"\n",
"    print(f\"Learned PE: {learned_memory:.1f} MB ({learned_params:,} parameters)\")\n",
"    # The sinusoidal table has no trainable parameters, but caching it\n",
"    # (as EmbeddingLayer does) still stores max_seq_len × embed_dim floats\n",
"    print(f\"Sinusoidal PE: 0 trainable parameters ({learned_memory:.1f} MB only if the table is precomputed and cached)\")\n",
"    print(\"No PE: 0.0 MB (0 parameters)\")\n",
"\n",
"    print(\"\\n🚀 Production Implications:\")\n",
"    print(\"• GPT-3's embedding table: ~2.3 GiB in FP32 (50,257 vocab × 12,288 dims × 4 bytes)\")\n",
"    print(\"• Learned PE adds memory but may improve task-specific performance\")\n",
"    print(\"• Sinusoidal PE adds no trainable parameters and can be generated for any sequence length\")\n",
"\n",
"# Run analysis when developing/testing this module\n",
"if __name__ == \"__main__\":\n",
"    analyze_embedding_memory_scaling()"
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "22a12bed",
|
||
"metadata": {
|
||
"lines_to_next_cell": 1,
|
||
"nbgrader": {
|
||
"grade": false,
|
||
"grade_id": "lookup-performance",
|
||
"solution": true
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
"def analyze_embedding_performance():\n",
"    \"\"\"📊 Compare embedding lookup performance across different configurations.\"\"\"\n",
"    print(\"\\n📊 Analyzing Embedding Lookup Performance...\")\n",
"\n",
"    import time\n",
"\n",
"    # Test different vocabulary sizes and batch configurations\n",
"    vocab_sizes = [1_000, 10_000, 100_000]\n",
"    embed_dim = 512\n",
"    seq_len = 128\n",
"    batch_sizes = [1, 16, 64, 256]\n",
"\n",
"    print(f\"{'Vocab Size':<12} {'Batch Size':<12} {'Lookup Time (ms)':<18} {'Throughput (tokens/s)':<20}\")\n",
"    print(\"-\" * 70)\n",
"\n",
"    for vocab_size in vocab_sizes:\n",
"        # Create embedding layer\n",
"        embed = Embedding(vocab_size, embed_dim)\n",
"\n",
"        for batch_size in batch_sizes:\n",
"            # Create random token batch\n",
"            tokens = Tensor(np.random.randint(0, vocab_size, (batch_size, seq_len)))\n",
"\n",
"            # Warmup\n",
"            for _ in range(5):\n",
"                _ = embed.forward(tokens)\n",
"\n",
"            # Time the lookup (perf_counter is monotonic and high-resolution)\n",
"            start_time = time.perf_counter()\n",
"            iterations = 100\n",
"\n",
"            for _ in range(iterations):\n",
"                output = embed.forward(tokens)\n",
"\n",
"            end_time = time.perf_counter()\n",
"\n",
"            # Calculate metrics\n",
"            total_time = end_time - start_time\n",
"            avg_time_ms = (total_time / iterations) * 1000\n",
"            total_tokens = batch_size * seq_len * iterations\n",
"            throughput = total_tokens / total_time\n",
"\n",
"            print(f\"{vocab_size:<12,} {batch_size:<12} {avg_time_ms:<18.2f} {throughput:<20,.0f}\")\n",
"\n",
"    print(\"\\n💡 Performance Insights:\")\n",
"    print(\"• Lookup time is O(1) per token - vocabulary size doesn't affect individual lookups\")\n",
"    print(\"• Larger batches improve throughput due to vectorization\")\n",
"    print(\"• Memory bandwidth becomes bottleneck for large embedding dimensions\")\n",
"    print(\"• Cache locality important for repeated token patterns\")\n",
"\n",
"# Run analysis when developing/testing this module\n",
"if __name__ == \"__main__\":\n",
"    analyze_embedding_performance()"
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "dd92e601",
|
||
"metadata": {
|
||
"nbgrader": {
|
||
"grade": false,
|
||
"grade_id": "position-encoding-comparison",
|
||
"solution": true
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
"def analyze_positional_encoding_strategies():\n",
"    \"\"\"📊 Compare different positional encoding approaches and trade-offs.\"\"\"\n",
"    print(\"\\n📊 Analyzing Positional Encoding Trade-offs...\")\n",
"\n",
"    max_seq_len = 512\n",
"    embed_dim = 256\n",
"\n",
"    # Create both types of positional encodings\n",
"    learned_pe = PositionalEncoding(max_seq_len, embed_dim)\n",
"    sinusoidal_pe = create_sinusoidal_embeddings(max_seq_len, embed_dim)\n",
"\n",
"    # Analyze memory footprint\n",
"    learned_params = max_seq_len * embed_dim\n",
"    learned_memory = learned_params * 4 / (1024 * 1024)  # MB\n",
"\n",
"    print(f\"📈 Memory Comparison:\")\n",
"    print(f\"Learned PE: {learned_memory:.2f} MB ({learned_params:,} parameters)\")\n",
"    print(f\"Sinusoidal PE: 0.00 MB (0 parameters)\")\n",
"\n",
"    # Analyze encoding patterns\n",
"    print(f\"\\n📈 Encoding Pattern Analysis:\")\n",
"\n",
"    # Test sample sequences\n",
"    test_input = Tensor(np.random.randn(1, 10, embed_dim))\n",
"\n",
"    learned_output = learned_pe.forward(test_input)\n",
"\n",
"    # For sinusoidal, manually add to match learned interface\n",
"    sin_encodings = sinusoidal_pe.data[:10][np.newaxis, :, :]  # (1, 10, embed_dim)\n",
"    sinusoidal_output = Tensor(test_input.data + sin_encodings)\n",
"\n",
"    # Analyze variance across positions\n",
"    learned_var = np.var(learned_output.data, axis=1).mean()  # Variance across positions\n",
"    sin_var = np.var(sinusoidal_output.data, axis=1).mean()\n",
"\n",
"    print(f\"Position variance (learned): {learned_var:.4f}\")\n",
"    print(f\"Position variance (sinusoidal): {sin_var:.4f}\")\n",
"\n",
"    # Check extrapolation capability\n",
"    print(f\"\\n📈 Extrapolation Analysis:\")\n",
"    extended_length = max_seq_len + 100\n",
"\n",
"    try:\n",
"        # A larger learned table can be allocated, but its new rows start untrained\n",
"        extended_learned = PositionalEncoding(extended_length, embed_dim)\n",
"        print(f\"Learned PE: Requires retraining for sequences > {max_seq_len}\")\n",
"    except Exception:\n",
"        print(f\"Learned PE: Cannot handle sequences > {max_seq_len}\")\n",
"\n",
"    # Sinusoidal encodings can be generated for any length\n",
"    extended_sin = create_sinusoidal_embeddings(extended_length, embed_dim)\n",
"    print(f\"Sinusoidal PE: Can extrapolate to length {extended_length} (smooth continuation)\")\n",
"\n",
"    print(f\"\\n🚀 Production Trade-offs:\")\n",
"    print(f\"Learned PE:\")\n",
"    print(f\"  + Can learn task-specific positional patterns\")\n",
"    print(f\"  + May perform better for tasks with specific position dependencies\")\n",
"    print(f\"  - Requires additional memory and parameters\")\n",
"    print(f\"  - Fixed maximum sequence length\")\n",
"    print(f\"  - Needs training data for longer sequences\")\n",
"\n",
"    print(f\"\\nSinusoidal PE:\")\n",
"    print(f\"  + Zero additional parameters\")\n",
"    print(f\"  + Can generate encodings for any sequence length\")\n",
"    print(f\"  + Provides rich, mathematically grounded position signals\")\n",
"    print(f\"  - Cannot adapt to task-specific position patterns\")\n",
"    print(f\"  - May be suboptimal for highly position-dependent tasks\")\n",
"\n",
"# Run analysis when developing/testing this module\n",
"if __name__ == \"__main__\":\n",
"    analyze_positional_encoding_strategies()"
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "3154a2ce",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"## 6. Module Integration Test\n",
|
||
"\n",
|
||
"Let's test our complete embedding system to ensure everything works together correctly."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "8617b5fb",
|
||
"metadata": {
|
||
"lines_to_next_cell": 1,
|
||
"nbgrader": {
|
||
"grade": true,
|
||
"grade_id": "module-test",
|
||
"locked": true,
|
||
"points": 20
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"def test_module():\n",
|
||
" \"\"\"\n",
|
||
" Comprehensive test of entire embeddings module functionality.\n",
|
||
"\n",
|
||
" This final test ensures all components work together and the module\n",
|
||
" is ready for integration with attention mechanisms and transformers.\n",
|
||
" \"\"\"\n",
|
||
" print(\"🧪 RUNNING MODULE INTEGRATION TEST\")\n",
|
||
" print(\"=\" * 50)\n",
|
||
"\n",
|
||
" # Run all unit tests\n",
|
||
" print(\"Running unit tests...\")\n",
|
||
" test_unit_embedding()\n",
|
||
" test_unit_positional_encoding()\n",
|
||
" test_unit_sinusoidal_embeddings()\n",
|
||
" test_unit_complete_embedding_system()\n",
|
||
"\n",
|
||
" print(\"\\nRunning integration scenarios...\")\n",
|
||
"\n",
|
||
" # Integration Test 1: Realistic NLP pipeline\n",
|
||
" print(\"🔬 Integration Test: NLP Pipeline Simulation...\")\n",
|
||
"\n",
|
||
" # Simulate a small transformer setup\n",
|
||
" vocab_size = 1000\n",
|
||
" embed_dim = 128\n",
|
||
" max_seq_len = 64\n",
|
||
"\n",
|
||
" # Create embedding layer\n",
|
||
" embed_layer = EmbeddingLayer(\n",
|
||
" vocab_size=vocab_size,\n",
|
||
" embed_dim=embed_dim,\n",
|
||
" max_seq_len=max_seq_len,\n",
|
||
" pos_encoding='learned',\n",
|
||
" scale_embeddings=True\n",
|
||
" )\n",
|
||
"\n",
|
||
" # Simulate tokenized sentences\n",
|
||
" sentences = [\n",
|
||
" [1, 15, 42, 7, 99], # \"the cat sat on mat\"\n",
|
||
" [23, 7, 15, 88], # \"dog chased the ball\"\n",
|
||
" [1, 67, 15, 42, 7, 99, 34] # \"the big cat sat on mat here\"\n",
|
||
" ]\n",
|
||
"\n",
|
||
" # Process each sentence\n",
|
||
" outputs = []\n",
|
||
" for sentence in sentences:\n",
|
||
" tokens = Tensor(sentence)\n",
|
||
" embedded = embed_layer.forward(tokens)\n",
|
||
" outputs.append(embedded)\n",
|
||
"\n",
|
||
" # Verify output shape\n",
|
||
" expected_shape = (len(sentence), embed_dim)\n",
|
||
" assert embedded.shape == expected_shape, f\"Wrong shape for sentence: {embedded.shape} != {expected_shape}\"\n",
|
||
"\n",
|
||
" print(\"✅ Variable length sentence processing works!\")\n",
|
||
"\n",
|
||
" # Integration Test 2: Batch processing with padding\n",
|
||
" print(\"🔬 Integration Test: Batched Processing...\")\n",
|
||
"\n",
|
||
" # Create padded batch (real-world scenario)\n",
|
||
" max_len = max(len(s) for s in sentences)\n",
|
||
" batch_tokens = []\n",
|
||
"\n",
|
||
" for sentence in sentences:\n",
|
||
" # Pad with zeros (assuming 0 is padding token)\n",
|
||
" padded = sentence + [0] * (max_len - len(sentence))\n",
|
||
" batch_tokens.append(padded)\n",
|
||
"\n",
|
||
" batch_tensor = Tensor(batch_tokens) # (3, 7)\n",
|
||
" batch_output = embed_layer.forward(batch_tensor)\n",
|
||
"\n",
|
||
" assert batch_output.shape == (3, max_len, embed_dim), f\"Batch output shape incorrect: {batch_output.shape}\"\n",
|
||
"\n",
|
||
" print(\"✅ Batch processing with padding works!\")\n",
|
||
"\n",
|
||
" # Integration Test 3: Different positional encoding types\n",
|
||
" print(\"🔬 Integration Test: Position Encoding Variants...\")\n",
|
||
"\n",
|
||
" test_tokens = Tensor([[1, 2, 3, 4, 5]])\n",
|
||
"\n",
|
||
" # Test all position encoding types\n",
|
||
" for pe_type in ['learned', 'sinusoidal', None]:\n",
|
||
" embed_test = EmbeddingLayer(\n",
|
||
" vocab_size=100,\n",
|
||
" embed_dim=64,\n",
|
||
" pos_encoding=pe_type\n",
|
||
" )\n",
|
||
"\n",
|
||
" output = embed_test.forward(test_tokens)\n",
|
||
" assert output.shape == (1, 5, 64), f\"PE type {pe_type} failed shape test\"\n",
|
||
"\n",
|
||
" # Check parameter counts\n",
|
||
" if pe_type == 'learned':\n",
|
||
" assert len(embed_test.parameters()) == 2, f\"Learned PE should have 2 param tensors\"\n",
|
||
" else:\n",
|
||
" assert len(embed_test.parameters()) == 1, f\"PE type {pe_type} should have 1 param tensor\"\n",
|
||
"\n",
|
||
" print(\"✅ All positional encoding variants work!\")\n",
|
||
"\n",
"    # Integration Test 4: Memory efficiency check\n",
"    print(\"🔬 Integration Test: Memory Efficiency...\")\n",
"\n",
"    # Smoke test on a large table and batch (checks output shape only; memory use is not measured)\n",
"    large_embed = EmbeddingLayer(vocab_size=10000, embed_dim=512)\n",
"    test_batch = Tensor(np.random.randint(0, 10000, (32, 128)))\n",
"\n",
"    # Repeated forward passes should keep producing correctly shaped output\n",
"    for _ in range(5):\n",
"        output = large_embed.forward(test_batch)\n",
"        assert output.shape == (32, 128, 512), \"Large batch processing failed\"\n",
"\n",
"    print(\"✅ Memory efficiency check passed!\")\n",
"\n",
"    print(\"\\n\" + \"=\" * 50)\n",
"    print(\"🎉 ALL TESTS PASSED! Module ready for export.\")\n",
"    print(\"📚 Summary of capabilities built:\")\n",
"    print(\" • Token embedding with trainable lookup tables\")\n",
"    print(\" • Learned positional encodings for position awareness\")\n",
"    print(\" • Sinusoidal positional encodings for extrapolation\")\n",
"    print(\" • Complete embedding system for NLP pipelines\")\n",
"    print(\" • Efficient batch processing and memory management\")\n",
"    print(\"\\n🚀 Ready for: Attention mechanisms, transformers, and language models!\")\n",
"    print(\"Export with: tito module complete 11\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "15888c38",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "main-execution",
"solution": true
}
},
"outputs": [],
"source": [
"if __name__ == \"__main__\":\n",
"    # Main execution block for module validation.\n",
"    print(\"🚀 Running Embeddings module...\")\n",
"    test_module()\n",
"    print(\"✅ Module validation complete!\")"
]
},
{
"cell_type": "markdown",
"id": "3abc8acc",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 🤔 ML Systems Thinking: Embedding Foundations\n",
"\n",
"### Question 1: Memory Scaling\n",
"You implemented an embedding layer with vocab_size=50,000 and embed_dim=512.\n",
"- How many parameters does this embedding table contain? _____ million\n",
"- If using FP32 (4 bytes per parameter), how much memory does this use? _____ MB\n",
"- If you double the embedding dimension to 1024, what happens to memory usage? _____ MB\n",
"\n",
"### Question 2: Lookup Complexity\n",
"Your embedding layer performs table lookups for token indices.\n",
"- What is the time complexity of looking up a single token? O(_____)\n",
"- For a batch of 32 sequences, each of length 128, how many lookup operations? _____\n",
"- Why doesn't vocabulary size affect individual lookup performance? _____\n",
"\n",
"### Question 3: Positional Encoding Trade-offs\n",
"You implemented both learned and sinusoidal positional encodings.\n",
"- Learned PE for max_seq_len=2048, embed_dim=512 adds how many parameters? _____\n",
"- What happens if you try to process a sequence longer than max_seq_len with learned PE? _____\n",
"- Which type of PE can handle sequences longer than seen during training? _____\n",
"\n",
"### Question 4: Production Implications\n",
"Your complete EmbeddingLayer combines token and positional embeddings.\n",
"- In GPT-3 (vocab_size≈50K, embed_dim≈12K), approximately what percentage of total parameters are in the embedding table? _____%\n",
"- If you wanted to reduce memory usage by 50%, which would be more effective: halving vocab_size or halving embed_dim? _____\n",
"- Why might sinusoidal PE be preferred for models that need to handle variable sequence lengths? _____"
]
},
{
"cell_type": "markdown",
"id": "9282ff54",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 🎯 MODULE SUMMARY: Embeddings\n",
"\n",
"Congratulations! You've built a complete embedding system that transforms discrete tokens into learnable representations!\n",
"\n",
"### Key Accomplishments\n",
"- Built `Embedding` class with constant-time token-to-vector lookup by index\n",
"- Implemented `PositionalEncoding` for learnable position awareness (one trainable vector per position, up to `max_seq_len`)\n",
"- Created `create_sinusoidal_embeddings` with a fixed mathematical position encoding (can be computed for sequences longer than those seen in training)\n",
"- Developed `EmbeddingLayer` integrating both token and positional embeddings\n",
"- Analyzed embedding memory scaling and lookup performance trade-offs\n",
"- All tests pass ✅ (validated by `test_module()`)\n",
"\n",
"### Technical Achievements\n",
"- **Memory Efficiency**: Optimized embedding table storage and lookup patterns\n",
"- **Flexible Architecture**: Support for learned, sinusoidal, and no positional encoding\n",
"- **Batch Processing**: Efficient handling of variable-length sequences with padding\n",
"- **Systems Analysis**: Deep understanding of memory vs performance trade-offs\n",
"\n",
"### Ready for Next Steps\n",
"Your embeddings implementation enables attention mechanisms and transformer architectures!\n",
"The combination of token and positional embeddings provides the foundation for sequence-to-sequence models.\n",
"\n",
"**Next**: Module 12 will add attention mechanisms for context-aware representations!\n",
"\n",
"### Production Context\n",
"The embedding patterns you've built appear in:\n",
"- **GPT and BERT models**: Token embeddings + learned positional embeddings\n",
"- **Original Transformer (2017)**: Token embeddings + sinusoidal positional encoding\n",
"- **T5 models**: Relative position biases applied inside attention (a different scheme from the absolute encodings built here)\n",
"\n",
"Export with: `tito module complete 11`"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}