From 0c677dd488f025cf5b8ab71a22f7416924d95234 Mon Sep 17 00:00:00 2001 From: Vijay Janapa Reddi Date: Thu, 13 Nov 2025 10:42:47 -0500 Subject: [PATCH] Update module documentation: enhance ABOUT.md files across all modules - Improve module descriptions and learning objectives - Standardize documentation format and structure - Add clearer guidance for students - Enhance module-specific context and examples --- modules/01_tensor/ABOUT.md | 868 +++++++++++++++++------- modules/02_activations/ABOUT.md | 399 ++++++++--- modules/03_layers/ABOUT.md | 337 ++++++---- modules/04_losses/ABOUT.md | 382 +++++++---- modules/05_autograd/ABOUT.md | 76 ++- modules/06_optimizers/ABOUT.md | 520 +++++++++++---- modules/07_training/ABOUT.md | 648 +++++++++++------- modules/08_dataloader/ABOUT.md | 538 ++++++++------- modules/09_spatial/ABOUT.md | 618 ++++++++++------- modules/10_tokenization/ABOUT.md | 1076 +++++++++++++++++++++--------- modules/11_embeddings/ABOUT.md | 691 ++++++++++--------- modules/12_attention/ABOUT.md | 817 ++++++++++++++--------- modules/13_transformers/ABOUT.md | 881 ++++++++++++++---------- modules/14_profiling/ABOUT.md | 938 ++++++++++++++++---------- modules/15_quantization/ABOUT.md | 454 +++++++++++-- modules/16_compression/ABOUT.md | 457 ++++++++++--- modules/17_memoization/ABOUT.md | 927 +++++++++++++++---------- modules/18_acceleration/ABOUT.md | 638 +++++++++++++++--- modules/19_benchmarking/ABOUT.md | 446 +++++++++++-- modules/20_capstone/ABOUT.md | 473 +++++++------ 20 files changed, 8234 insertions(+), 3950 deletions(-) diff --git a/modules/01_tensor/ABOUT.md b/modules/01_tensor/ABOUT.md index 6dac0267..eb288e3b 100644 --- a/modules/01_tensor/ABOUT.md +++ b/modules/01_tensor/ABOUT.md @@ -1,328 +1,736 @@ --- title: "Tensor" -description: "Core tensor data structure and operations" -module_number: 1 -tier: "foundation" -difficulty: "beginner" +description: "Build the fundamental N-dimensional array data structure that powers all machine learning" +difficulty: "โญ" time_estimate: "4-6 hours" -prerequisites: ["Environment Setup"] -next_module: "02. Activations" +prerequisites: [] +next_steps: ["02_activations"] learning_objectives: - - "Understand tensors as N-dimensional arrays and their role in ML systems" - - "Implement a complete Tensor class with arithmetic and shape operations" - - "Handle memory management, data types, and broadcasting efficiently" - - "Recognize how tensor operations form the foundation of PyTorch/TensorFlow" - - "Analyze computational complexity and memory usage of tensor operations" + - "Understand tensors as N-dimensional arrays and their memory/performance implications in ML systems" + - "Implement a complete Tensor class with arithmetic, shape operations, and efficient data handling" + - "Master broadcasting rules and understand how they enable efficient computations without data copying" + - "Recognize how tensor operations form the foundation of PyTorch/TensorFlow architecture" + - "Analyze computational complexity, memory usage, and view-vs-copy trade-offs in tensor operations" --- # 01. Tensor -**๐Ÿ—๏ธ FOUNDATION TIER** | Difficulty: โญ (1/4) | Time: 4-6 hours +**FOUNDATION TIER** | Difficulty: โญ (1/4) | Time: 4-6 hours -**Build N-dimensional arrays from scratch - the foundation of all ML computations.** +## Overview ---- +The Tensor class is the foundational data structure of machine learning - every neural network, from simple linear models to GPT and Stable Diffusion, operates on tensors. You'll build N-dimensional arrays from scratch with arithmetic operations, broadcasting, and shape manipulation. This module gives you deep insight into how PyTorch and TensorFlow work under the hood, understanding the memory and performance implications that matter in production ML systems. + +## Learning Objectives + +By the end of this module, you will be able to: + +- **Understand memory and performance implications**: Recognize how tensor operations dominate compute time and memory usage in ML systems - a single matrix multiplication can consume 90% of forward pass time in production frameworks like PyTorch +- **Implement core tensor functionality**: Build a complete Tensor class with arithmetic (`+`, `-`, `*`, `/`), matrix multiplication, shape manipulation (`reshape`, `transpose`), and reductions (`sum`, `mean`, `max`) with proper error handling and validation +- **Master broadcasting semantics**: Understand NumPy broadcasting rules that enable efficient computations across different tensor shapes without data copying - critical for batch processing and efficient neural network operations +- **Connect to production frameworks**: See how your implementation mirrors PyTorch's `torch.Tensor` and TensorFlow's `tf.Tensor` design patterns, understanding the architectural decisions that power real ML systems +- **Analyze performance trade-offs**: Understand computational complexity (O(nยณ) for matrix multiplication), memory usage patterns (contiguous vs. strided), and when to copy data vs. create views for optimization + +## Build โ†’ Use โ†’ Reflect + +This module follows TinyTorch's **Build โ†’ Use โ†’ Reflect** framework: + +1. **Build**: Implement the Tensor class from scratch using NumPy as the underlying array library - creating `__init__`, operator overloading (`__add__`, `__mul__`, etc.), shape manipulation methods, and reduction operations +2. **Use**: Apply your Tensor to real problems like matrix multiplication for neural network layers, data normalization with broadcasting, and statistical computations across various shapes and dimensions +3. **Reflect**: Understand systems-level implications - why tensor operations dominate training time, how memory layout (row-major vs. column-major) affects cache performance, and how broadcasting eliminates redundant data copying ## What You'll Build -The **Tensor** class is the fundamental data structure of machine learning. It represents N-dimensional arrays and provides operations for manipulation, computation, and transformation. +By completing this module, you'll create a production-ready Tensor class with: -By the end of this module, you'll have a working Tensor implementation that handles: +**Core Data Structure:** +- N-dimensional array wrapper around NumPy with clean API +- Properties for shape, size, dtype, and data access +- Dormant gradient tracking attributes (activated in Module 05) -- Creating and initializing N-dimensional arrays -- Arithmetic operations (addition, multiplication, division, powers) -- Shape manipulation (reshape, transpose, broadcasting) -- Reductions (sum, mean, min, max along any axis) -- Memory-efficient data storage and copying +**Arithmetic Operations:** +- Element-wise operations: `+`, `-`, `*`, `/`, `**` +- Full broadcasting support for Tensor-Tensor and Tensor-scalar operations +- Automatic shape alignment following NumPy broadcasting rules -### Example Usage +**Matrix Operations:** +- `matmul()` for matrix multiplication with shape validation +- Support for matrix-matrix, matrix-vector multiplication +- Clear error messages for dimension mismatches + +**Shape Manipulation:** +- `reshape()` with -1 inference for automatic dimension calculation +- `transpose()` for dimension swapping +- View vs. copy semantics understanding + +**Reduction Operations:** +- `sum()`, `mean()`, `max()`, `min()` with axis parameter +- Global reductions (entire tensor) and axis-specific reductions +- `keepdims` support for maintaining dimensionality + +**Real-World Usage Pattern:** +Your Tensor enables the fundamental neural network forward pass: `output = x.matmul(W) + b` - exactly how PyTorch and TensorFlow work internally. + +## Core Concepts + +### Tensors as Multidimensional Arrays + +A tensor is a generalization of scalars (0D), vectors (1D), and matrices (2D) to N dimensions: + +- **Scalar**: `Tensor(5.0)` - shape `()` +- **Vector**: `Tensor([1, 2, 3])` - shape `(3,)` +- **Matrix**: `Tensor([[1, 2], [3, 4]])` - shape `(2, 2)` +- **3D Tensor**: Image batch `(batch, height, width)` - shape `(32, 224, 224)` +- **4D Tensor**: CNN features `(batch, channels, height, width)` - shape `(32, 3, 224, 224)` + +**Why tensors matter**: They provide a unified interface for all ML data - images, text embeddings, audio spectrograms, and model parameters are all tensors with different shapes. + +### Broadcasting: Efficient Shape Alignment + +Broadcasting automatically expands smaller tensors to match larger ones without copying data: + +```python +# Matrix (2,2) + Vector (2,) โ†’ broadcasts to (2,2) +matrix = Tensor([[1, 2], [3, 4]]) +vector = Tensor([10, 20]) +result = matrix + vector # [[11, 22], [13, 24]] +``` + +**Broadcasting rules** (NumPy-compatible): +1. Align shapes from right to left +2. Dimensions are compatible if they're equal or one is 1 +3. Missing dimensions are treated as size 1 + +**Why broadcasting matters**: Eliminates redundant data copying. Adding a bias vector to 1000 feature maps broadcasts once instead of copying the vector 1000 times - saving memory and enabling vectorization. + +### Views vs. Copies: Memory Efficiency + +Some operations return **views** (sharing memory) vs. **copies** (duplicating data): + +- **Views** (O(1)): `reshape()`, `transpose()` when possible - no data movement +- **Copies** (O(n)): Arithmetic operations, explicit `.copy()` - duplicate storage + +**Why this matters**: A view of a 1GB tensor is free (just metadata). A copy allocates another 1GB. Understanding view semantics prevents memory blowup in production systems. + +### Computational Complexity + +Different operations have vastly different costs: + +- **Element-wise** (`+`, `-`, `*`): O(n) - linear in tensor size +- **Reductions** (`sum`, `mean`): O(n) - must visit every element +- **Matrix multiply** (`matmul`): O(nยณ) for square matrices - dominates training time + +**Why this matters**: In a neural network forward pass, matrix multiplications consume 90%+ of compute time. Optimizing matmul is critical - hence specialized hardware (GPUs, TPUs) and libraries (cuBLAS, MKL). + +## Architecture Overview + +### Tensor Class Design + +``` +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ Tensor Class โ”‚ +โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค +โ”‚ Properties: โ”‚ +โ”‚ - data: np.ndarray (underlying storage)โ”‚ +โ”‚ - shape: tuple (dimensions) โ”‚ +โ”‚ - size: int (total elements) โ”‚ +โ”‚ - dtype: np.dtype (data type) โ”‚ +โ”‚ - requires_grad: bool (autograd flag) โ”‚ +โ”‚ - grad: Tensor (gradient - Module 05) โ”‚ +โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค +โ”‚ Operator Overloading: โ”‚ +โ”‚ - __add__, __sub__, __mul__, __truediv__โ”‚ +โ”‚ - __pow__ (exponentiation) โ”‚ +โ”‚ - Returns new Tensor instances โ”‚ +โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค +โ”‚ Methods: โ”‚ +โ”‚ - matmul(other): Matrix multiplication โ”‚ +โ”‚ - reshape(*shape): Shape manipulation โ”‚ +โ”‚ - transpose(): Dimension swap โ”‚ +โ”‚ - sum/mean/max/min(axis): Reductions โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +``` + +### Data Flow Architecture + +``` +Python Interface (your code) + โ†“ + Tensor Class + โ†“ + NumPy Backend (vectorized operations) + โ†“ + C/Fortran Libraries (BLAS, LAPACK) + โ†“ + Hardware (CPU SIMD, cache) +``` + +**Your implementation**: Python wrapper โ†’ NumPy +**PyTorch/TensorFlow**: Python wrapper โ†’ C++ engine โ†’ GPU kernels + +The architecture is identical in concept - you're learning the same design patterns used in production, just with NumPy instead of custom CUDA kernels. + +### Module Integration + +``` +Module 01: Tensor (THIS MODULE) + โ†“ provides foundation +Module 02: Activations (ReLU, Sigmoid operate on Tensors) + โ†“ uses tensors +Module 03: Layers (Linear, Conv2d store weights as Tensors) + โ†“ uses tensors +Module 05: Autograd (adds .grad attribute to Tensors) + โ†“ enhances tensors +Module 06: Optimizers (updates Tensor parameters) +``` + +Your Tensor is the universal foundation - every subsequent module builds on what you create here. + +## Prerequisites + +This is the first module - no prerequisites! Verify your environment is ready: + +```bash +# Activate TinyTorch environment +source bin/activate-tinytorch.sh + +# Check system health +tito system doctor +``` + +All checks should pass (Python 3.8+, NumPy, pytest installed) before starting. + +## Getting Started + +### Development Workflow + +1. **Open the development notebook**: `modules/01_tensor/tensor_dev.ipynb` in Jupyter or your preferred editor +2. **Implement Tensor.__init__**: Create constructor that converts data to NumPy array, stores shape/size/dtype, initializes gradient attributes +3. **Build arithmetic operations**: Implement `__add__`, `__sub__`, `__mul__`, `__truediv__` with broadcasting support for both Tensor-Tensor and Tensor-scalar operations +4. **Add matrix multiplication**: Implement `matmul()` with shape validation and clear error messages for dimension mismatches +5. **Create shape manipulation**: Implement `reshape()` (with -1 support) and `transpose()` for dimension swapping +6. **Implement reductions**: Build `sum()`, `mean()`, `max()` with axis parameter and keepdims support +7. **Export and verify**: Run `tito export 01` to export to package, then `tito test 01` to validate all tests pass + +## Implementation Guide + +### Tensor Class Foundation + +Your Tensor class wraps NumPy arrays and provides ML-specific functionality: ```python from tinytorch.core.tensor import Tensor -# Create tensors +# Create tensors from Python lists or NumPy arrays x = Tensor([[1.0, 2.0], [3.0, 4.0]]) y = Tensor([[0.5, 1.5], [2.5, 3.5]]) -# Properties +# Properties provide clean API access print(x.shape) # (2, 2) print(x.size) # 4 -print(x.dtype) # float64 +print(x.dtype) # float32 +``` -# Operations -z = x + y # Addition +**Implementation details**: You'll implement `__init__` to convert input data to NumPy arrays, store shape/size/dtype as properties, and initialize dormant gradient attributes (`requires_grad`, `grad`) that activate in Module 05. + +### Arithmetic Operations + +Implement operator overloading for element-wise operations with broadcasting: + +```python +# Element-wise operations via operator overloading +z = x + y # Addition: [[1.5, 3.5], [5.5, 7.5]] w = x * y # Element-wise multiplication p = x ** 2 # Exponentiation +s = x - y # Subtraction +d = x / y # Division -# Shape manipulation -reshaped = x.reshape(4, 1) -transposed = x.T +# Broadcasting: scalar operations automatically expand +scaled = x * 2 # [[2.0, 4.0], [6.0, 8.0]] +shifted = x + 10 # [[11.0, 12.0], [13.0, 14.0]] -# Reductions -total = x.sum() # Scalar sum -means = x.mean(axis=0) # Mean along axis +# Broadcasting: vector + matrix +matrix = Tensor([[1, 2], [3, 4]]) +vector = Tensor([10, 20]) +result = matrix + vector # [[11, 22], [13, 24]] ``` ---- +**Systems insight**: These operations vectorize automatically via NumPy, achieving ~100x speedup over Python loops. This is why all ML frameworks use tensors - the performance difference between `for i in range(n): result[i] = a[i] + b[i]` and `result = a + b` is dramatic at scale. -## Learning Pattern: Build โ†’ Use โ†’ Understand +### Matrix Multiplication -### 1. Build -Implement the Tensor class from scratch using NumPy as the underlying array library. You'll create constructors, operator overloading, shape manipulation methods, and reduction operations. +Matrix multiplication is the heart of neural networks - every layer performs it: -### 2. Use -Apply your Tensor implementation to real problems: matrix multiplication, data normalization, statistical computations. Test with various shapes and data types. +```python +# Matrix multiplication (the @ operator) +a = Tensor([[1, 2], [3, 4]]) # 2ร—2 +b = Tensor([[5, 6], [7, 8]]) # 2ร—2 +c = a.matmul(b) # 2ร—2 result: [[19, 22], [43, 50]] -### 3. Understand -Grasp the systems-level implications: why tensor operations dominate compute time, how memory layout affects performance, and how broadcasting enables efficient computations without data copying. +# Neural network forward pass pattern: y = xW + b +x = Tensor([[1, 2, 3], [4, 5, 6]]) # Input: (batch=2, features=3) +W = Tensor([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]) # Weights: (3, 2) +b = Tensor([0.1, 0.2]) # Bias: (2,) +output = x.matmul(W) + b # (2, 2) +``` ---- +**Computational complexity**: For matrices `(M,K) @ (K,N)`, the cost is `O(Mร—Kร—N)` floating-point operations. A 1000ร—1000 matrix multiplication requires 2 billion FLOPs - this dominates training time in production systems. -## Learning Objectives +### Shape Manipulation -By completing this module, you will: +Neural networks constantly reshape tensors to match layer requirements: -1. **Systems Understanding**: Recognize tensors as the universal data structure in ML frameworks, understanding how all neural network operations decompose into tensor primitives +```python +# Reshape: change interpretation of same data (O(1) operation) +tensor = Tensor([1, 2, 3, 4, 5, 6]) +reshaped = tensor.reshape(2, 3) # [[1, 2, 3], [4, 5, 6]] +flat = reshaped.reshape(-1) # [1, 2, 3, 4, 5, 6] -2. **Core Implementation**: Build a complete Tensor class supporting arithmetic, shape manipulation, and reductions with proper error handling +# Transpose: swap dimensions (data rearrangement) +matrix = Tensor([[1, 2, 3], [4, 5, 6]]) # (2, 3) +transposed = matrix.transpose() # (3, 2): [[1, 4], [2, 5], [3, 6]] -3. **Pattern Recognition**: Understand broadcasting rules and how they enable efficient computations across different tensor shapes +# CNN data flow example +images = Tensor(np.random.rand(32, 3, 224, 224)) # (batch, channels, H, W) +features = images.reshape(32, -1) # (batch, 3*224*224) - flatten for MLP +``` -4. **Framework Connection**: See how your implementation mirrors PyTorch's `torch.Tensor` and TensorFlow's `tf.Tensor` design +**Memory consideration**: `reshape` often returns *views* (no data copying) when possible - an O(1) operation. `transpose` may require data rearrangement depending on memory layout. Understanding views vs. copies is crucial: views share memory (efficient), copies duplicate data (expensive for large tensors). -5. **Performance Trade-offs**: Analyze memory usage vs computation speed, understanding when to copy data vs create views +### Reduction Operations ---- +Aggregation operations collapse dimensions for statistics and loss computation: -## Why This Matters +```python +# Reduce along different axes +total = x.sum() # Scalar: sum all elements +col_sums = x.sum(axis=0) # Sum columns: [4, 6] +row_sums = x.sum(axis=1) # Sum rows: [3, 7] -### Production Context +# Statistical reductions +means = x.mean(axis=0) # Column-wise mean +minimums = x.min(axis=1) # Row-wise minimum +maximums = x.max() # Global maximum -Every modern ML framework is built on tensors: +# Batch loss averaging (common pattern) +losses = Tensor([0.5, 0.3, 0.8, 0.2]) # Per-sample losses +avg_loss = losses.mean() # 0.45 - batch average +``` -- **PyTorch**: `torch.Tensor` is the core class - all operations work with tensors -- **TensorFlow**: `tf.Tensor` represents data flowing through computation graphs -- **JAX**: `jax.numpy.ndarray` extends NumPy with automatic differentiation -- **NumPy**: The foundation - understanding tensors starts here +**Production pattern**: Every loss function uses reductions. Cross-entropy loss computes per-sample losses then averages: `loss = -log(predictions[correct_class]).mean()`. Understanding axis semantics prevents bugs in multi-dimensional operations. -By building your own Tensor class, you'll understand what happens when you call `torch.matmul()` or `tf.reduce_sum()` - not just the API, but the actual computation. +## Testing -### Systems Reality Check +### Comprehensive Test Suite -**Performance Note**: Tensor operations dominate training time. A single matrix multiplication in a linear layer might take 90% of forward pass time. Understanding tensor internals is essential for optimization. - -**Memory Note**: Large models store billions of parameters as tensors. A GPT-3 scale model requires 350GB of memory just for weights (175B parameters ร— 2 bytes for FP16). Efficient tensor memory management is critical. - ---- - -## Implementation Guide - -### Prerequisites Check - -Verify your environment is ready: +Run the full test suite to verify tensor functionality: ```bash -tito system doctor +# TinyTorch CLI (recommended - runs all 01_tensor tests) +tito test 01 + +# Direct pytest execution (more verbose output) +python -m pytest tests/01_tensor/ -v + +# Run specific test class +python -m pytest tests/01_tensor/test_tensor_core.py::TestTensorCreation -v ``` -All checks should pass before starting implementation. +Expected output: All tests pass with green checkmarks showing your Tensor implementation works correctly. -### Development Workflow +### Test Coverage Areas -```bash -# Navigate to tensor module -cd modules/01_tensor/ +Your implementation is validated across these dimensions: -# Open development file (choose your preferred method) -jupyter lab tensor_dev.py # Jupytext (recommended) -# OR -code tensor_dev.py # Direct Python editing -``` +- **Initialization** (`test_tensor_from_list`, `test_tensor_from_numpy`, `test_tensor_shapes`): Creating tensors from Python lists, NumPy arrays, and nested structures with correct shape/dtype handling +- **Arithmetic Operations** (`test_tensor_addition`, `test_tensor_multiplication`): Element-wise addition, subtraction, multiplication, division with both Tensor-Tensor and Tensor-scalar combinations +- **Broadcasting** (`test_scalar_broadcasting`, `test_vector_broadcasting`): Automatic shape alignment for different tensor shapes, scalar expansion, matrix-vector broadcasting +- **Matrix Multiplication** (`test_matrix_multiplication`): Matrix-matrix, matrix-vector multiplication with shape validation and error handling for incompatible dimensions +- **Shape Manipulation** (`test_tensor_reshape`, `test_tensor_transpose`, `test_tensor_flatten`): Reshape with -1 inference, transpose with dimension swapping, validation for incompatible sizes +- **Reductions** (`test_sum`, `test_mean`, `test_max`): Aggregation along various axes (None, 0, 1, multiple), keepdims behavior, global vs. axis-specific reduction +- **Memory Management** (`test_tensor_data_access`, `test_tensor_copy_semantics`, `test_tensor_memory_efficiency`): Data access patterns, copy vs. view semantics, memory usage validation -### Step-by-Step Build +### Inline Testing & Validation -#### Step 1: Tensor Class Foundation - -Create the basic Tensor class with initialization and properties: +The development notebook includes comprehensive inline tests with immediate feedback: ```python -class Tensor: - def __init__(self, data, dtype=None): - """Initialize tensor from Python list or NumPy array""" - self.data = np.array(data, dtype=dtype) - - @property - def shape(self): - """Return tensor shape""" - return self.data.shape - - @property - def size(self): - """Return total number of elements""" - return self.data.size +# Example inline test output +๐Ÿงช Unit Test: Tensor Creation... +โœ… Tensor created from list +โœ… Shape property correct: (2, 2) +โœ… Size property correct: 4 +โœ… dtype is float32 +๐Ÿ“ˆ Progress: Tensor initialization โœ“ + +๐Ÿงช Unit Test: Arithmetic Operations... +โœ… Addition: [[6, 8], [10, 12]] +โœ… Multiplication works element-wise +โœ… Broadcasting: scalar + tensor +โœ… Broadcasting: matrix + vector +๐Ÿ“ˆ Progress: Arithmetic operations โœ“ + +๐Ÿงช Unit Test: Matrix Multiplication... +โœ… 2ร—2 @ 2ร—2 = [[19, 22], [43, 50]] +โœ… Shape validation catches 2ร—2 @ 3ร—1 error +โœ… Error message shows: "2 โ‰  3" +๐Ÿ“ˆ Progress: Matrix operations โœ“ ``` -**Why this matters**: Properties enable clean API design - users can write `x.shape` instead of `x.get_shape()`, matching PyTorch conventions. +### Manual Testing Examples -#### Step 2: Arithmetic Operations - -Implement operator overloading for element-wise operations: +Validate your implementation interactively: ```python -def __add__(self, other): - """Element-wise addition""" - return Tensor(self.data + other.data) +from tinytorch.core.tensor import Tensor +import numpy as np -def __mul__(self, other): - """Element-wise multiplication""" - return Tensor(self.data * other.data) -``` - -**Systems insight**: These operations vectorize automatically via NumPy, achieving ~100x speedup over Python loops. This is why frameworks use tensors. - -#### Step 3: Shape Manipulation - -Implement reshape, transpose, and broadcasting: - -```python -def reshape(self, *shape): - """Return tensor with new shape""" - return Tensor(self.data.reshape(*shape)) - -@property -def T(self): - """Return transposed tensor""" - return Tensor(self.data.T) -``` - -**Memory consideration**: Reshape and transpose often return *views* (no data copying) for efficiency. Understanding views vs copies is crucial for memory optimization. - -#### Step 4: Reductions - -Implement aggregation operations along axes: - -```python -def sum(self, axis=None): - """Sum tensor elements along axis""" - return Tensor(self.data.sum(axis=axis)) - -def mean(self, axis=None): - """Mean of tensor elements along axis""" - return Tensor(self.data.mean(axis=axis)) -``` - -**Production pattern**: Reductions are fundamental - every loss function uses them. Understanding axis semantics prevents bugs in multi-dimensional operations. - ---- - -## Testing Your Implementation - -### Inline Tests - -Test within your development file: - -```python -# Create test tensors +# Test basic operations x = Tensor([[1, 2], [3, 4]]) y = Tensor([[5, 6], [7, 8]]) -# Test operations assert x.shape == (2, 2) assert (x + y).data.tolist() == [[6, 8], [10, 12]] assert x.sum().data == 10 print("โœ“ Basic operations working") + +# Test broadcasting +small = Tensor([1, 2]) +result = x + small +assert result.data.tolist() == [[2, 4], [4, 6]] +print("โœ“ Broadcasting functional") + +# Test reductions +col_means = x.mean(axis=0) +assert np.allclose(col_means.data, [2.0, 3.0]) +print("โœ“ Reductions working") + +# Test neural network pattern: y = xW + b +batch = Tensor([[1, 2, 3], [4, 5, 6]]) # (2, 3) +weights = Tensor([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]) # (3, 2) +bias = Tensor([0.1, 0.2]) +output = batch.matmul(weights) + bias +assert output.shape == (2, 2) +print("โœ“ Neural network forward pass pattern works!") ``` -### Module Export & Validation +## Production Context -```bash -# Export your implementation to TinyTorch package -tito export 01 +### Your Implementation vs. Production Frameworks -# Run comprehensive test suite -tito test 01 -``` +Understanding what you're building vs. what production frameworks provide: -**Expected output**: -``` -โœ“ All tests passed! [25/25] -โœ“ Module 01 complete! -``` +| Feature | Your Tensor (Module 01) | PyTorch torch.Tensor | TensorFlow tf.Tensor | +|---------|------------------------|---------------------|---------------------| +| **Backend** | NumPy (CPU-only) | C++/CUDA (CPU/GPU/TPU) | C++/CUDA/XLA | +| **Dtype Support** | float32 (primary) | float16/32/64, int8/16/32/64, bool, complex | Same + bfloat16 | +| **Operations** | Arithmetic, matmul, reshape, transpose, reductions | 1000+ operations | 1000+ operations | +| **Broadcasting** | โœ… Full NumPy rules | โœ… Same rules | โœ… Same rules | +| **Autograd** | Dormant (activates Module 05) | โœ… Full computation graph | โœ… GradientTape | +| **GPU Support** | โŒ CPU-only | โœ… CUDA, Metal, ROCm | โœ… CUDA, TPU | +| **Memory Pooling** | โŒ Python GC | โœ… Caching allocator | โœ… Memory pools | +| **JIT Compilation** | โŒ Interpreted | โœ… TorchScript, torch.compile | โœ… XLA, TF Graph | +| **Distributed** | โŒ Single process | โœ… DDP, FSDP | โœ… tf.distribute | ---- +**Educational focus**: Your implementation prioritizes clarity and understanding over performance. The core concepts (broadcasting, shape manipulation, reductions) are identical - you're learning the same patterns used in production, just with simpler infrastructure. -## Where This Code Lives +**Line count**: Your implementation is ~1927 lines in the notebook (including tests and documentation). PyTorch's tensor implementation spans 50,000+ lines across multiple C++ files - your simplified version captures the essential concepts. -After export, your Tensor implementation becomes part of the TinyTorch package: +### Side-by-Side Code Comparison +**Your implementation:** ```python -# Other modules and future code can now import YOUR implementation: from tinytorch.core.tensor import Tensor -# Used throughout TinyTorch: -from tinytorch.core.layers import Linear # Uses Tensor for weights -from tinytorch.core.activations import ReLU # Operates on Tensors -from tinytorch.core.autograd import backward # Computes Tensor gradients +# Create tensors +x = Tensor([[1, 2], [3, 4]]) +w = Tensor([[0.5, 0.6], [0.7, 0.8]]) + +# Forward pass +output = x.matmul(w) # (2,2) @ (2,2) โ†’ (2,2) +loss = output.mean() # Scalar loss ``` -**Package structure**: +**Equivalent PyTorch (production):** +```python +import torch + +# Create tensors (GPU-enabled) +x = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32).cuda() +w = torch.tensor([[0.5, 0.6], [0.7, 0.8]], dtype=torch.float32).cuda() + +# Forward pass (automatic gradient tracking) +output = x @ w # Uses cuBLAS for GPU acceleration +loss = output.mean() # Builds computation graph for backprop +loss.backward() # Automatic differentiation +``` + +**Key differences:** +1. **GPU Support**: PyTorch tensors can move to GPU (`.cuda()`) for 10-100x speedup via parallel processing +2. **Autograd**: PyTorch automatically tracks operations and computes gradients - you'll build this in Module 05 +3. **Memory Pooling**: PyTorch reuses GPU memory via caching allocator - avoids expensive malloc/free calls +4. **Optimized Kernels**: PyTorch uses cuBLAS/cuDNN (GPU) and Intel MKL (CPU) - hand-tuned assembly for max performance + +### Real-World Production Usage + +**Meta (Facebook AI)**: PyTorch was developed at Meta and powers their recommendation systems, computer vision models, and LLaMA language models. Their production infrastructure processes billions of tensor operations per second. + +**Tesla**: Uses PyTorch tensors for Autopilot neural networks. Each camera frame (6-9 cameras) is converted to tensors, processed through vision models (millions of parameters stored as tensors), and outputs driving decisions in real-time at 36 FPS. + +**OpenAI**: GPT-4 training involved tensors with billions of parameters distributed across thousands of GPUs. Each training step performs matrix multiplications on tensors larger than single GPU memory. + +**Google**: TensorFlow powers Google Search, Translate, Photos, and Assistant. Google's TPUs (Tensor Processing Units) are custom hardware designed specifically for accelerating tensor operations. + +### Performance Characteristics at Scale + +**Memory usage**: GPT-3 scale models (175B parameters) require ~350GB memory just for weights stored as float16 tensors (175B ร— 2 bytes). Mixed precision training (float16/float32) reduces memory by 2x while maintaining accuracy. + +**Computational bottlenecks**: In production training, tensor operations consume 95%+ of runtime. A single linear layer's matrix multiplication might take 100ms of a 110ms forward pass - optimizing tensor operations is critical. + +**Cache efficiency**: Modern CPUs have ~32KB L1 cache, ~256KB L2, ~8MB L3. Accessing memory in tensor-friendly patterns (contiguous, row-major) can be 10-100x faster than cache-unfriendly patterns (strided, column-major). + +### Package Integration + +After export, your Tensor implementation becomes the foundation of TinyTorch: + +**Package Export**: Code exports to `tinytorch.core.tensor` + +```python +# When students install tinytorch, they import YOUR work: +from tinytorch.core.tensor import Tensor # Your implementation! + +# Future modules build on YOUR tensor: +from tinytorch.core.activations import ReLU # Module 02 - operates on your Tensors +from tinytorch.core.layers import Linear # Module 03 - uses your Tensor for weights +from tinytorch.core.autograd import backward # Module 05 - adds gradients to your Tensor +from tinytorch.core.optimizers import SGD # Module 06 - updates your Tensor parameters +``` + +**Package structure:** ``` tinytorch/ โ”œโ”€โ”€ core/ -โ”‚ โ”œโ”€โ”€ tensor.py โ† YOUR implementation exports here -โ”‚ โ”œโ”€โ”€ activations.py -โ”‚ โ”œโ”€โ”€ layers.py +โ”‚ โ”œโ”€โ”€ tensor.py โ† YOUR implementation exports here +โ”‚ โ”œโ”€โ”€ activations.py โ† Module 02 builds on your Tensor +โ”‚ โ”œโ”€โ”€ layers.py โ† Module 03 builds on your Tensor +โ”‚ โ”œโ”€โ”€ losses.py โ† Module 04 builds on your Tensor +โ”‚ โ”œโ”€โ”€ autograd.py โ† Module 05 adds gradients to your Tensor +โ”‚ โ”œโ”€โ”€ optimizers.py โ† Module 06 updates your Tensor weights โ”‚ โ””โ”€โ”€ ... ``` +Your Tensor class is the universal foundation - every subsequent module depends on what you build here. + +### How Your Implementation Maps to PyTorch + +**What you just built:** +```python +# Your TinyTorch Tensor implementation +from tinytorch.core.tensor import Tensor + +# Create a tensor +x = Tensor([[1, 2], [3, 4]]) + +# Core operations you implemented +y = x + 2 # Broadcasting +z = x.matmul(other) # Matrix multiplication +mean = x.mean(axis=0) # Reductions +reshaped = x.reshape(-1) # Shape manipulation +``` + +**How PyTorch does it:** +```python +# PyTorch equivalent +import torch + +# Create a tensor +x = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32) + +# Same operations, identical semantics +y = x + 2 # Broadcasting (same rules) +z = x @ other # Matrix multiplication (@ operator) +mean = x.mean(dim=0) # Reductions (dim instead of axis) +reshaped = x.reshape(-1) # Shape manipulation (same API) +``` + +**Key Insight**: Your implementation uses the **same mathematical operations and design patterns** that PyTorch uses internally. The `@` operator is syntactic sugar for matrix multiplicationโ€”the actual computation is identical. Broadcasting rules, shape semantics, and reduction operations all follow the same NumPy conventions. + +**What's the SAME?** +- Tensor abstraction and API design +- Broadcasting rules and memory layout principles +- Shape manipulation semantics (`reshape`, `transpose`) +- Reduction operation behavior (`sum`, `mean`, `max`) +- Conceptual architecture: data + operations + metadata + +**What's different in production PyTorch?** +- **Backend**: C++/CUDA for 10-100ร— speed vs. NumPy +- **GPU support**: `.cuda()` moves tensors to GPU for parallel processing +- **Autograd integration**: `requires_grad=True` enables automatic differentiation (you'll build this in Module 05) +- **Memory optimization**: Caching allocator reuses GPU memory, avoiding expensive malloc/free + +**Why this matters**: When you debug PyTorch code, you'll understand what's happening under tensor operations because you implemented them yourself. Shape mismatch errors, broadcasting bugs, memory issuesโ€”you know exactly how they work internally, not just how to call the API. + +**Production usage example**: +```python +# PyTorch production code (after TinyTorch) +import torch.nn as nn + +class MLPLayer(nn.Module): + def __init__(self, in_features, out_features): + super().__init__() + self.linear = nn.Linear(in_features, out_features) # Uses torch.Tensor internally + + def forward(self, x): + return self.linear(x) # Matrix multiply + bias (same as your Tensor.matmul) +``` + +After building your own Tensor class, you understand that `nn.Linear(in_features, out_features)` is essentially creating weight and bias tensors, then performing `x @ weights + bias` with your same broadcasting and matmul operationsโ€”just optimized in C++/CUDA. + +## Common Pitfalls + +### Shape Mismatch Errors + +**Problem**: Matrix multiplication fails with cryptic errors like "shapes (2,3) and (2,2) not aligned" + +**Solution**: Always verify inner dimensions match: `(M,K) @ (K,N)` requires K to be equal. Add shape validation with clear error messages: +```python +if a.shape[1] != b.shape[0]: + raise ValueError(f"Cannot multiply ({a.shape[0]},{a.shape[1]}) @ ({b.shape[0]},{b.shape[1]}): {a.shape[1]} โ‰  {b.shape[0]}") +``` + +### Broadcasting Confusion + +**Problem**: Expected `(2,3) + (3,)` to broadcast but got error + +**Solution**: Broadcasting aligns shapes *from the right*. `(2,3) + (3,)` works (broadcasts to `(2,3)`), but `(2,3) + (2,)` fails. Add dimension with reshape if needed: `tensor.reshape(2,1)` to make `(2,1)` broadcastable with `(2,3)`. + +### View vs Copy Confusion + +**Problem**: Modified a reshaped tensor and original changed unexpectedly + +**Solution**: `reshape()` returns a *view* when possible - they share memory. Changes to the view affect the original. Use `.copy()` if you need independent data: +```python +view = tensor.reshape(2, 3) # Shares memory +copy = tensor.reshape(2, 3).copy() # Independent storage +``` + +### Axis Parameter Mistakes + +**Problem**: `sum(axis=1)` on `(batch, features)` returned wrong shape + +**Solution**: Axis semantics: `axis=0` reduces over first dimension (batch), `axis=1` reduces over second (features). For `(32, 128)` tensor, `sum(axis=0)` gives `(128,)`, `sum(axis=1)` gives `(32,)`. Visualize which dimension you're collapsing. + +### Dtype Issues + +**Problem**: Lost precision after operations, or got integer division instead of float + +**Solution**: NumPy defaults to preserving dtype. Integer tensors do integer division (`5 / 2 = 2`). Always create tensors with float dtype explicitly: `Tensor([[1, 2]], dtype=np.float32)` or convert: `tensor.astype(np.float32)`. + +### Memory Leaks with Large Tensors + +**Problem**: Memory usage grows unbounded during training loop + +**Solution**: Clear intermediate results in loops. Don't accumulate tensors in lists unnecessarily. Use in-place operations when safe. Example: +```python +# Bad: accumulates memory +losses = [] +for batch in data: + loss = model(batch) + losses.append(loss) # Keeps all tensors in memory + +# Good: extract values +losses = [] +for batch in data: + loss = model(batch) + losses.append(loss.data.item()) # Store scalar, release tensor +``` + +## Systems Thinking Questions + +### Real-World Applications + +- **Deep Learning Training**: All neural network layers operate on tensors - Linear layers perform matrix multiplication, Conv2d applies tensor convolutions, Attention mechanisms compute tensor dot products. How would doubling model size affect memory and compute requirements? +- **Computer Vision**: Images are 3D tensors (height ร— width ร— channels), and every transformation (resize, crop, normalize) is a tensor operation. What's the memory footprint of a batch of 32 images at 224ร—224 resolution with 3 color channels in float32? +- **Natural Language Processing**: Text embeddings are 2D tensors (sequence_length ร— embedding_dim), and Transformer models manipulate these through attention. For BERT with 512 sequence length and 768 hidden dimension, how many elements per sample? +- **Scientific Computing**: Tensors represent multidimensional data in climate models, molecular simulations, physics engines. What makes tensors more efficient than nested Python lists for these applications? + +### Mathematical Foundations + +- **Linear Algebra**: Tensors generalize matrices to arbitrary dimensions. How does broadcasting relate to outer products? When is `(M,K) @ (K,N)` more efficient than `(K,M).T @ (K,N)`? +- **Numerical Stability**: Operations like softmax require careful implementation to avoid overflow/underflow. Why does `exp(x - max(x))` prevent overflow in softmax computation? +- **Broadcasting Semantics**: NumPy's broadcasting rules enable elegant code but require understanding shape compatibility. Can you predict the output shape of `(32, 1, 10) + (1, 5, 10)`? +- **Computational Complexity**: Matrix multiplication is O(nยณ) while element-wise operations are O(n). For large models, which dominates training time and why? + +### Performance Characteristics + +- **Memory Contiguity**: Contiguous memory enables SIMD vectorization and cache efficiency. How much can non-contiguous tensors slow down operations (10x? 100x?)? +- **View vs Copy**: Views are O(1) with shared memory, copies are O(n) with duplicated storage. When might a view cause unexpected behavior (e.g., in-place operations)? +- **Operation Fusion**: Frameworks optimize `(a + b) * c` by fusing operations to reduce memory reads. How many memory passes does unfused require vs. fused? +- **Batch Processing**: Processing 32 images at once is much faster than 32 sequential passes. Why? (Hint: GPU parallelism, cache reuse, reduced Python overhead) + +## What's Next + +After mastering tensors, you're ready to build the computational layers of neural networks: + +**Module 02: Activations** - Implement ReLU, Sigmoid, Tanh, and Softmax activation functions that introduce non-linearity. You'll operate on your Tensor class and understand why activation functions are essential for learning complex patterns. + +**Module 03: Layers** - Build Linear (fully-connected) and convolutional layers using tensor operations. See how weight matrices and bias vectors (stored as Tensors) transform inputs through matrix multiplication and broadcasting. + +**Module 05: Autograd** - Add automatic differentiation to your Tensor class, enabling gradient computation for training. Your tensors will track operations and compute gradients automatically - the magic behind `loss.backward()`. + +**Preview of tensor usage ahead:** +- Activations: `output = ReLU()(input_tensor)` - element-wise operations on tensors +- Layers: `output = Linear(in_features=128, out_features=64)(input_tensor)` - matmul with weight tensors +- Loss: `loss = MSELoss()(predictions, targets)` - tensor reductions for error measurement +- Training: `optimizer.step()` updates parameter tensors using gradients + +Every module builds on your Tensor foundation - understanding tensors deeply means understanding how neural networks actually compute. + +## Ready to Build? + +You're about to implement the foundation of all machine learning systems! The Tensor class you'll build is the universal data structure that powers everything from simple neural networks to GPT, Stable Diffusion, and AlphaFold. + +This is where mathematical abstraction meets practical implementation. You'll see how N-dimensional arrays enable elegant representations of complex data, how operator overloading makes tensor math feel natural like `z = x + y`, and how careful memory management (views vs. copies) enables working with massive models. Every decision you make - from how to handle broadcasting to when to validate shapes - reflects trade-offs that production ML engineers face daily. + +Take your time with this module. Understand each operation deeply. Test your implementations thoroughly. The Tensor foundation you build here will support every subsequent module - if you understand tensors from first principles, you'll understand how neural networks actually work, not just how to use them. + +Every neural network you've ever used - ResNet, BERT, GPT, Stable Diffusion - is fundamentally built on tensor operations. Understanding tensors means understanding the computational substrate of modern AI. + +Choose your preferred way to engage with this module: + +````{grid} 1 2 3 3 + +```{grid-item-card} ๐Ÿš€ Launch Binder +:link: https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/01_tensor/tensor_dev.ipynb +:class-header: bg-light + +Run this module interactively in your browser. No installation required! +``` + +```{grid-item-card} โšก Open in Colab +:link: https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/01_tensor/tensor_dev.ipynb +:class-header: bg-light + +Use Google Colab for GPU access and cloud compute power. +``` + +```{grid-item-card} ๐Ÿ“– View Source +:link: https://github.com/mlsysbook/TinyTorch/blob/main/modules/01_tensor/tensor_dev.ipynb +:class-header: bg-light + +Browse the Jupyter notebook and understand the implementation. +``` + +```` + +```{admonition} ๐Ÿ’พ Save Your Progress +:class: tip +**Binder sessions are temporary!** Download your completed notebook when done, or switch to local development for persistent work. + +``` + --- -## Systems Thinking Questions - -Reflect on these questions as you build (no right/wrong answers): - -1. **Complexity Analysis**: Why is matrix multiplication O(nยณ) for nร—n matrices? How does this affect training time for large models? - -2. **Memory Trade-offs**: When should reshape create a view vs copy data? What are the performance implications? - -3. **Production Scaling**: A GPT-3 scale model has 175 billion parameters. How much memory is required to store these as FP32 tensors? As FP16? - -4. **Design Decisions**: Why do frameworks like PyTorch store data as NumPy arrays internally? What are alternatives? - -5. **Framework Comparison**: How does your Tensor class differ from `torch.Tensor`? What features are missing? Why might those features matter? - ---- - -## Real-World Connections - -### Industry Applications - -- **Deep Learning Training**: All neural network layers operate on tensors (Linear, Conv2d, Attention all perform tensor operations) -- **Scientific Computing**: Tensors represent multidimensional data (climate models, molecular simulations) -- **Computer Vision**: Images are 3D tensors (height ร— width ร— channels) -- **NLP**: Text embeddings are 2D tensors (sequence_length ร— embedding_dim) - -### Research Applications - -- **Automatic Differentiation**: Frameworks like PyTorch track tensor operations to compute gradients -- **Distributed Training**: Large models split tensors across GPUs using tensor parallelism -- **Quantization**: Tensors can be stored in reduced precision (INT8 instead of FP32) for efficiency - ---- - -## What's Next? - -**Congratulations!** You've built the foundation of TinyTorch. Your Tensor class will power everything that follows - from activation functions to complete neural networks. - -Next, you'll add nonlinearity to enable networks to learn complex patterns. - -**Module 02: Activations** - Implement ReLU, Sigmoid, Tanh, and other activation functions that transform tensor values - -[Continue to Module 02: Activations โ†’](02-activations.html) - ---- - -**Need Help?** -- [Ask in GitHub Discussions](https://github.com/mlsysbook/TinyTorch/discussions) -- [View Tensor API Reference](../appendices/api-reference.html#tensor) -- [Report Issues](https://github.com/mlsysbook/TinyTorch/issues) +
+Next Module โ†’ +
diff --git a/modules/02_activations/ABOUT.md b/modules/02_activations/ABOUT.md index 65c8a871..9f617990 100644 --- a/modules/02_activations/ABOUT.md +++ b/modules/02_activations/ABOUT.md @@ -1,201 +1,384 @@ --- -title: "Activation Functions" -description: "Neural network activation functions (ReLU, Sigmoid, Tanh, Softmax)" +title: "Activations" +description: "Neural network activation functions enabling non-linear learning" difficulty: "โญโญ" time_estimate: "3-4 hours" -prerequisites: [] -next_steps: [] -learning_objectives: [] +prerequisites: ["01_tensor"] +next_steps: ["03_layers"] +learning_objectives: + - "Understand activation functions as the non-linearity enabling neural networks to learn complex patterns" + - "Implement ReLU, Sigmoid, Tanh, GELU, and Softmax with proper numerical stability" + - "Recognize function properties (range, gradient behavior, symmetry) and their roles in ML architectures" + - "Connect activation implementations to torch.nn.functional and PyTorch/TensorFlow patterns" + - "Analyze computational efficiency, numerical stability, and memory implications of different activations" --- # 02. Activations -**๐Ÿ—๏ธ FOUNDATION TIER** | Difficulty: โญโญ (2/4) | Time: 3-4 hours +**FOUNDATION TIER** | Difficulty: โญโญ (2/4) | Time: 3-4 hours ## Overview -Implement the mathematical functions that give neural networks their power to learn complex patterns. Without activation functions, neural networks would just be linear transformationsโ€”with them, you unlock the ability to learn any function. +Activation functions are the mathematical operations that introduce non-linearity into neural networks, transforming them from simple linear regressors into universal function approximators. Without activations, stacking layers would be pointlessโ€”multiple linear transformations collapse to a single linear operation. With activations, each layer learns increasingly complex representations, enabling networks to approximate any continuous function. This module implements five essential activation functions with proper numerical stability, preparing you to understand what happens every time you call `F.relu(x)` or `torch.sigmoid(x)` in production code. ## Learning Objectives By the end of this module, you will be able to: -- **Understand the critical role** of activation functions in enabling neural networks to learn non-linear patterns -- **Implement three core activation functions**: ReLU, Sigmoid, and Tanh with proper numerical stability -- **Apply mathematical reasoning** to understand function properties, ranges, and appropriate use cases -- **Debug and test** activation implementations using both automated tests and visual analysis -- **Connect theory to practice** by understanding when and why to use each activation function +- **Systems Understanding**: Recognize activation functions as the critical non-linearity that enables universal function approximation, understanding their role in memory consumption (activation caching), computational bottlenecks (billions of calls per training run), and gradient flow through deep architectures +- **Core Implementation**: Build ReLU, Sigmoid, Tanh, GELU, and Softmax with numerical stability techniques (max subtraction, conditional computation) that prevent overflow/underflow while maintaining mathematical correctness +- **Pattern Recognition**: Understand function propertiesโ€”ReLU's sparsity and [0, โˆž) range, Sigmoid's (0,1) probabilistic outputs, Tanh's (-1,1) zero-centered gradients, GELU's smoothness, Softmax's probability distributionsโ€”and why each serves specific architectural roles +- **Framework Connection**: See how your implementations mirror `torch.nn.ReLU`, `torch.nn.Sigmoid`, `torch.nn.Tanh`, `torch.nn.GELU`, and `F.softmax`, understanding the actual mathematical operations behind PyTorch's abstractions used throughout ResNet, BERT, GPT, and vision transformers +- **Performance Trade-offs**: Analyze computational cost (element-wise operations vs exponentials), memory implications (activation caching for backprop), and gradient behavior (vanishing gradients in Sigmoid/Tanh vs ReLU's constant gradients), understanding why ReLU dominates hidden layers while Sigmoid/Softmax serve specific output roles -## Build โ†’ Use โ†’ Analyze +## Build โ†’ Use โ†’ Reflect -This module follows TinyTorch's **Build โ†’ Use โ†’ Analyze** framework: +This module follows TinyTorch's **Build โ†’ Use โ†’ Reflect** framework: -1. **Build**: Implement ReLU, Sigmoid, and Tanh activation functions with numerical stability -2. **Use**: Apply these functions in testing scenarios and visualize their mathematical behavior -3. **Analyze**: Compare function properties, performance characteristics, and appropriate use cases through quantitative analysis +1. **Build**: Implement five core activation functions (ReLU, Sigmoid, Tanh, GELU, Softmax) with numerical stability. Handle overflow in exponentials through max subtraction and conditional computation, ensure shape preservation across operations, and maintain proper value ranges ([0,โˆž) for ReLU, (0,1) for Sigmoid, (-1,1) for Tanh, probability distributions for Softmax) + +2. **Use**: Apply activations to real tensors with various ranges and shapes. Test with extreme values (ยฑ1000) to verify numerical stability, visualize function behavior across input domains, integrate with Tensor operations from Module 01, and chain activations to simulate simple neural network data flow (Input โ†’ ReLU โ†’ Softmax) + +3. **Reflect**: Understand why each activation exists in production systemsโ€”why ReLU enables sparse representations (many zeros) that accelerate computation and reduce overfitting, how Sigmoid creates gates (0 to 1 control signals) in LSTM/GRU architectures, why Tanh's zero-centered outputs improve optimization dynamics, how GELU's smoothness helps transformers, and why Softmax's probability distributions are essential for classification ## Implementation Guide -### Core Activation Functions +### ReLU - The Sparsity Creator + +ReLU (Rectified Linear Unit) is the workhorse of modern deep learning, used in hidden layers of ResNet, EfficientNet, and most convolutional architectures. + ```python -# ReLU: Simple but powerful -relu = ReLU() -output = relu(Tensor([-2, -1, 0, 1, 2])) # [0, 0, 0, 1, 2] +class ReLU: + """ReLU activation: f(x) = max(0, x)""" -# Sigmoid: Probabilistic outputs -sigmoid = Sigmoid() -output = sigmoid(Tensor([0, 1, -1])) # [0.5, 0.73, 0.27] - -# Tanh: Zero-centered activation -tanh = Tanh() -output = tanh(Tensor([0, 1, -1])) # [0, 0.76, -0.76] + def forward(self, x: Tensor) -> Tensor: + # Zero negative values, preserve positive values + return Tensor(np.maximum(0, x.data)) ``` -### ReLU (Rectified Linear Unit) -- **Formula**: `f(x) = max(0, x)` -- **Properties**: Simple, sparse, unbounded, most commonly used -- **Implementation**: Element-wise maximum with zero -- **Use Cases**: Hidden layers in most modern architectures +**Mathematical Definition**: `f(x) = max(0, x)` -### Sigmoid Activation -- **Formula**: `f(x) = 1 / (1 + e^(-x))` -- **Properties**: Bounded to (0,1), smooth, probabilistic interpretation -- **Implementation**: Numerically stable version preventing overflow -- **Use Cases**: Binary classification, attention mechanisms, gates +**Key Properties**: +- **Range**: [0, โˆž) - unbounded above +- **Gradient**: 0 for x < 0, 1 for x > 0 (undefined at x = 0) +- **Sparsity**: Produces many exact zeros (sparse activations) +- **Computational Cost**: Trivial (element-wise comparison) -### Tanh (Hyperbolic Tangent) -- **Formula**: `f(x) = tanh(x)` -- **Properties**: Bounded to (-1,1), zero-centered, symmetric -- **Implementation**: Direct NumPy implementation with shape preservation -- **Use Cases**: Hidden layers, RNNs, when zero-centered outputs are beneficial +**Why ReLU Dominates Hidden Layers**: +- No vanishing gradient problem (gradient is 1 for positive inputs) +- Computationally efficient (simple max operation) +- Creates sparsity (zeros) that reduces computation and helps regularization +- Empirically outperforms Sigmoid/Tanh in deep networks + +**Watch Out For**: "Dying ReLU" problemโ€”neurons can get stuck outputting zero if inputs become consistently negative during training. Variants like Leaky ReLU (allows small negative slope) address this. + +### Sigmoid - The Probabilistic Gate + +Sigmoid maps any real number to (0, 1), making it essential for binary classification and gating mechanisms in LSTMs/GRUs. + +```python +class Sigmoid: + """Sigmoid activation: ฯƒ(x) = 1/(1 + e^(-x))""" + + def forward(self, x: Tensor) -> Tensor: + # Numerical stability: avoid exp() overflow + data = x.data + return Tensor(np.where( + data >= 0, + 1 / (1 + np.exp(-data)), # Positive values + np.exp(data) / (1 + np.exp(data)) # Negative values + )) +``` + +**Mathematical Definition**: `ฯƒ(x) = 1/(1 + e^(-x))` + +**Key Properties**: +- **Range**: (0, 1) - strictly bounded +- **Gradient**: ฯƒ(x)(1 - ฯƒ(x)), maximum 0.25 at x = 0 +- **Symmetry**: ฯƒ(-x) = 1 - ฯƒ(x) +- **Computational Cost**: One exponential per element + +**Numerical Stability Critical**: +- Naive `1/(1 + exp(-x))` overflows for large positive x +- For x โ‰ฅ 0: use `1/(1 + exp(-x))` (stable) +- For x < 0: use `exp(x)/(1 + exp(x))` (stable) +- Conditional computation prevents overflow while maintaining correctness + +**Production Use Cases**: +- Binary classification output layer (probability of positive class) +- LSTM/GRU gates (input gate, forget gate, output gate) +- Attention mechanisms (before softmax normalization) + +**Gradient Problem**: Maximum derivative is 0.25, meaning gradients shrink by โ‰ฅ75% per layer. In deep networks (>10 layers), gradients vanish exponentially, making training difficult. This is why ReLU replaced Sigmoid in hidden layers. + +### Tanh - The Zero-Centered Alternative + +Tanh (hyperbolic tangent) maps inputs to (-1, 1), providing zero-centered outputs that improve gradient flow compared to Sigmoid. + +```python +class Tanh: + """Tanh activation: f(x) = (e^x - e^(-x))/(e^x + e^(-x))""" + + def forward(self, x: Tensor) -> Tensor: + return Tensor(np.tanh(x.data)) +``` + +**Mathematical Definition**: `tanh(x) = (e^x - e^(-x))/(e^x + e^(-x))` + +**Key Properties**: +- **Range**: (-1, 1) - symmetric around zero +- **Gradient**: 1 - tanhยฒ(x), maximum 1.0 at x = 0 +- **Symmetry**: tanh(-x) = -tanh(x) (odd function) +- **Computational Cost**: Two exponentials (or NumPy optimized) + +**Why Zero-Centered Matters**: +- Tanh outputs have mean โ‰ˆ 0, unlike Sigmoid's mean โ‰ˆ 0.5 +- Gradients don't systematically bias weight updates in one direction +- Helps optimization in shallow networks and RNN cells + +**Production Use Cases**: +- LSTM/GRU cell state computation (candidate values in [-1, 1]) +- Output layer when you need symmetric bounded outputs +- Some shallow networks (though ReLU usually preferred now) + +**Still Has Vanishing Gradients**: Maximum derivative is 1.0 (better than Sigmoid's 0.25), but still saturates for |x| > 2, causing vanishing gradients in deep networks. + +### GELU - The Smooth Modern Choice + +GELU (Gaussian Error Linear Unit) is a smooth approximation to ReLU, used in modern transformer architectures like GPT, BERT, and Vision Transformers. + +```python +class GELU: + """GELU activation: f(x) โ‰ˆ x * Sigmoid(1.702 * x)""" + + def forward(self, x: Tensor) -> Tensor: + # Approximation: x * sigmoid(1.702 * x) + sigmoid_part = 1.0 / (1.0 + np.exp(-1.702 * x.data)) + return Tensor(x.data * sigmoid_part) +``` + +**Mathematical Definition**: `GELU(x) = x ยท ฮฆ(x) โ‰ˆ x ยท ฯƒ(1.702x)` where ฮฆ(x) is the cumulative distribution function of standard normal distribution + +**Key Properties**: +- **Range**: (-โˆž, โˆž) - unbounded like ReLU +- **Gradient**: Smooth everywhere (no sharp corner at x = 0) +- **Approximation**: The 1.702 constant comes from โˆš(2/ฯ€) +- **Computational Cost**: One exponential (similar to Sigmoid) + +**Why Transformers Use GELU**: +- Smooth differentiability everywhere (unlike ReLU's corner at x = 0) +- Empirically performs better than ReLU in transformer architectures +- Non-monotonic behavior (slight negative region) helps representation learning +- Used in GPT, BERT, RoBERTa, Vision Transformers + +**Comparison to ReLU**: GELU is smoother (differentiable everywhere) but more expensive (requires exponential). In transformers, the extra cost is negligible compared to attention computation, and the smoothness helps optimization. + +### Softmax - The Probability Distributor + +Softmax converts any vector into a valid probability distribution where all outputs are positive and sum to exactly 1.0. + +```python +class Softmax: + """Softmax activation: f(x_i) = e^(x_i) / ฮฃ(e^(x_j))""" + + def forward(self, x: Tensor, dim: int = -1) -> Tensor: + # Numerical stability: subtract max before exp + x_max_data = np.max(x.data, axis=dim, keepdims=True) + x_shifted = x - Tensor(x_max_data) + exp_values = Tensor(np.exp(x_shifted.data)) + exp_sum = Tensor(np.sum(exp_values.data, axis=dim, keepdims=True)) + return exp_values / exp_sum +``` + +**Mathematical Definition**: `softmax(x_i) = e^(x_i) / ฮฃ_j e^(x_j)` + +**Key Properties**: +- **Range**: (0, 1) with ฮฃ outputs = 1.0 exactly +- **Gradient**: Complex (involves all elements, not just element-wise) +- **Translation Invariant**: softmax(x + c) = softmax(x) +- **Computational Cost**: One exponential per element + sum reduction + +**Numerical Stability Critical**: +- Naive `exp(x_i) / sum(exp(x_j))` overflows for large values +- Subtract max before exponential: `exp(x - max(x))` +- Mathematically equivalent due to translation invariance +- Prevents overflow while maintaining correct probabilities + +**Production Use Cases**: +- Multi-class classification output layer (class probabilities) +- Attention weights in transformers (probability distribution over sequence) +- Any time you need a valid discrete probability distribution + +**Cross-Entropy Connection**: In practice, Softmax is almost always paired with cross-entropy loss. PyTorch's `F.cross_entropy` combines both operations with additional numerical stability (LogSumExp trick). ## Getting Started ### Prerequisites -Ensure you have completed the tensor module and understand basic tensor operations: - ```bash +Ensure you have completed Module 01 (Tensor) before starting: + +```bash # Activate TinyTorch environment - source bin/activate-tinytorch.sh +source bin/activate-tinytorch.sh -# Verify tensor module is working +# Verify tensor module is complete tito test --module tensor - ``` + +# Expected: โœ“ Module 01 complete! +``` ### Development Workflow -1. **Open the development file**: `modules/03_activations/activations_dev.py` -2. **Implement functions progressively**: Start with ReLU, then Sigmoid (numerical stability), then Tanh -3. **Test each implementation**: Use inline tests for immediate feedback -4. **Visualize function behavior**: Leverage plotting sections for mathematical understanding -5. **Export and verify**: `tito export --module activations && tito test --module activations` + +1. **Open the development file**: `modules/02_activations/activations_dev.ipynb` (or `.py` via Jupytext) +2. **Implement ReLU**: Simple max(0, x) operation using `np.maximum` +3. **Build Sigmoid**: Implement with numerical stability using conditional computation for positive/negative values +4. **Create Tanh**: Use `np.tanh` for hyperbolic tangent transformation +5. **Add GELU**: Implement smooth approximation using `x * sigmoid(1.702 * x)` +6. **Build Softmax**: Implement with max subtraction for numerical stability, handle dimension parameter for multi-dimensional tensors +7. **Export and verify**: Run `tito module complete 02 && tito test --module activations` + +**Development Tips**: +- Test with extreme values (ยฑ1000) to verify numerical stability +- Verify output ranges: ReLU [0, โˆž), Sigmoid (0,1), Tanh (-1,1) +- Check Softmax sums to 1.0 along specified dimension +- Test with multi-dimensional tensors (batches) to ensure shape preservation ## Testing ### Comprehensive Test Suite -Run the full test suite to verify mathematical correctness: - ```bash +Run the full test suite to verify all activation implementations: + +```bash # TinyTorch CLI (recommended) - tito test --module activations +tito test --module activations # Direct pytest execution python -m pytest tests/ -k activations -v + +# Test specific activation +python -m pytest tests/test_activations.py::test_relu -v ``` ### Test Coverage Areas -- โœ… **Mathematical Correctness**: Verify function outputs match expected mathematical formulas -- โœ… **Numerical Stability**: Test with extreme values and edge cases -- โœ… **Shape Preservation**: Ensure input and output tensors have identical shapes -- โœ… **Range Validation**: Confirm outputs fall within expected ranges -- โœ… **Integration Testing**: Verify compatibility with tensor operations -### Inline Testing & Visualization -The module includes comprehensive educational feedback: +- โœ… **ReLU Correctness**: Verifies max(0, x) behavior, sparsity property (negative โ†’ 0, positive preserved), and proper handling of exactly zero inputs +- โœ… **Sigmoid Numerical Stability**: Tests extreme values (ยฑ1000) don't cause overflow/underflow, validates (0,1) range constraints, confirms sigmoid(0) = 0.5 exactly +- โœ… **Tanh Properties**: Validates (-1,1) range, symmetry property (tanh(-x) = -tanh(x)), zero-centered behavior (tanh(0) = 0), and extreme value convergence +- โœ… **GELU Smoothness**: Confirms smooth differentiability (no sharp corners), validates approximation accuracy (GELU(0) โ‰ˆ 0, GELU(1) โ‰ˆ 0.84), and checks non-monotonic behavior +- โœ… **Softmax Probability Distribution**: Verifies sum equals 1.0 exactly, all outputs in (0,1) range, largest input receives highest probability, numerical stability with large inputs, and correct dimension handling for multi-dimensional tensors + +### Inline Testing & Validation + +The module includes comprehensive inline unit tests that run during development: + ```python # Example inline test output -๐Ÿ”ฌ Unit Test: ReLU activation... -โœ… ReLU handles negative inputs correctly -โœ… ReLU preserves positive inputs -โœ… ReLU output range is [0, โˆž) +๐Ÿ”ฌ Unit Test: ReLU... +โœ… ReLU zeros negative values correctly +โœ… ReLU preserves positive values +โœ… ReLU creates sparsity (3/4 values are zero) ๐Ÿ“ˆ Progress: ReLU โœ“ -# Visual feedback with plotting -๐Ÿ“Š Plotting ReLU behavior across range [-5, 5]... -๐Ÿ“ˆ Function visualization shows expected behavior - ``` +๐Ÿ”ฌ Unit Test: Sigmoid... +โœ… Sigmoid(0) = 0.5 exactly +โœ… All outputs in (0, 1) range +โœ… Numerically stable with extreme values (ยฑ1000) +๐Ÿ“ˆ Progress: Sigmoid โœ“ + +๐Ÿ”ฌ Unit Test: Softmax... +โœ… Outputs sum to 1.0 exactly +โœ… All values positive and less than 1 +โœ… Largest input gets highest probability +โœ… Handles large numbers without overflow +๐Ÿ“ˆ Progress: Softmax โœ“ +``` ### Manual Testing Examples + +Test activations interactively to understand their behavior: + ```python +from activations_dev import ReLU, Sigmoid, Tanh, GELU, Softmax from tinytorch.core.tensor import Tensor -from activations_dev import ReLU, Sigmoid, Tanh - -# Test with various inputs -x = Tensor([[-2.0, -1.0, 0.0, 1.0, 2.0]]) +# Test ReLU sparsity relu = ReLU() -sigmoid = Sigmoid() -tanh = Tanh() +x = Tensor([-2, -1, 0, 1, 2]) +output = relu(x) +print(output.data) # [0, 0, 0, 1, 2] - 60% sparsity! -print("Input:", x.data) -print("ReLU:", relu(x).data) # [0, 0, 0, 1, 2] -print("Sigmoid:", sigmoid(x).data) # [0.12, 0.27, 0.5, 0.73, 0.88] -print("Tanh:", tanh(x).data) # [-0.96, -0.76, 0, 0.76, 0.96] +# Test Sigmoid probability mapping +sigmoid = Sigmoid() +x = Tensor([0.0, 100.0, -100.0]) # Extreme values +output = sigmoid(x) +print(output.data) # [0.5, 1.0, 0.0] - no overflow! + +# Test Softmax probability distribution +softmax = Softmax() +x = Tensor([1.0, 2.0, 3.0]) +output = softmax(x) +print(output.data) # [0.09, 0.24, 0.67] +print(output.data.sum()) # 1.0 exactly! + +# Test activation chaining (simulate simple network) +x = Tensor([[-1, 0, 1, 2]]) # Batch of 1 +hidden = relu(x) # Hidden layer: [0, 0, 1, 2] +output = softmax(hidden) # Output probabilities +print(output.data.sum()) # 1.0 - valid distribution! ``` ## Systems Thinking Questions ### Real-World Applications -- **Computer Vision**: ReLU activations enable CNNs to learn hierarchical features (like those in ResNet, VGG) -- **Natural Language Processing**: Sigmoid/Tanh functions power LSTM and GRU gates for memory control -- **Recommendation Systems**: Sigmoid activations provide probability outputs for binary predictions -- **Generative Models**: Different activations shape the output distributions in GANs and VAEs -### Mathematical Properties Comparison -| Function | Input Range | Output Range | Zero Point | Key Property | -|----------|-------------|--------------|------------|--------------| -| ReLU | (-โˆž, โˆž) | [0, โˆž) | f(0) = 0 | Sparse, unbounded | -| Sigmoid | (-โˆž, โˆž) | (0, 1) | f(0) = 0.5 | Probabilistic | -| Tanh | (-โˆž, โˆž) | (-1, 1) | f(0) = 0 | Zero-centered | +- **Computer Vision Networks**: ResNet-50 applies ReLU to approximately 23 million elements per forward pass (after every convolution), then uses Softmax on 1000 logits for ImageNet classification. How much memory is required just to cache these activations for backpropagation in a batch of 32 images? +- **Transformer Language Models**: BERT-Large has 24 layers ร— 1024 hidden units ร— sequence length 512 = 12.6M activations per example. With GELU requiring exponential computation, how does this compare to ReLU's computational cost across a 1M example training run? +- **Recurrent Networks**: LSTM cells use 4 gates (input, forget, output, cell) with Sigmoid/Tanh activations at every timestep. For a sequence of length 100 with 512 hidden units, how many exponential operations are required compared to a simple ReLU-based feedforward network? +- **Mobile Inference**: On-device neural networks must be extremely efficient. Given that ReLU is a simple comparison while GELU requires exponential computation, what are the latency implications for a 20-layer network running on CPU with no hardware acceleration? -### Numerical Stability Considerations -- **ReLU**: No stability issues (simple max operation) -- **Sigmoid**: Requires careful implementation to prevent `exp()` overflow -- **Tanh**: Generally stable, but NumPy implementation handles edge cases +### Mathematical Foundations -### Performance and Gradient Properties -- **ReLU**: Fastest computation, sparse gradients, can cause "dying ReLU" problem -- **Sigmoid**: Moderate computation, smooth gradients, susceptible to vanishing gradients -- **Tanh**: Moderate computation, stronger gradients than sigmoid, zero-centered helps optimization +- **Universal Function Approximation**: The universal approximation theorem states that a neural network with even one hidden layer can approximate any continuous function, BUT only if it has non-linear activations. Why does linearity prevent universal approximation, and what property of non-linear functions (like ReLU, Sigmoid, Tanh) enables it? +- **Gradient Flow and Saturation**: Sigmoid's derivative is ฯƒ(x)(1-ฯƒ(x)) with maximum value 0.25. In a 10-layer network using Sigmoid activations, what is the maximum gradient magnitude at layer 1 if the output gradient is 1.0? How does this explain the vanishing gradient problem that led to ReLU's adoption? +- **Numerical Stability and Conditioning**: When computing Softmax, why does subtracting the maximum value before exponential (exp(x - max(x))) prevent overflow while maintaining mathematical correctness? What property of the exponential function makes this transformation valid? +- **Activation Sparsity and Compression**: ReLU produces exact zeros (sparse activations) while Sigmoid produces values close to but never exactly zero. How does this affect model compression techniques like pruning and quantization? Why are sparse activations more amenable to INT8 quantization? -## ๐ŸŽ‰ Ready to Build? +### Performance Characteristics -The activations module is where neural networks truly come alive! You're about to implement the mathematical functions that transform simple linear operations into powerful pattern recognition systems. +- **Memory Footprint of Activation Caching**: During backpropagation, forward pass activations must be stored to compute gradients. For a ResNet-50 processing 224ร—224ร—3 images with batch size 64, activation caching requires approximately 3GB of memory. How does this compare to the model's parameter memory (25M params ร— 4 bytes โ‰ˆ 100MB)? What is the scaling relationship between batch size and activation memory? +- **Computational Intensity on Different Hardware**: ReLU is trivially parallelizable (independent element-wise max). On a GPU with 10,000 CUDA cores, what is the theoretical speedup vs single-core CPU? Why does practical speedup plateau at much lower values (memory bandwidth, kernel launch overhead)? +- **Branch Prediction and CPU Performance**: ReLU's conditional behavior (`if x > 0`) can cause branch misprediction penalties on CPUs. For a random uniform distribution of inputs [-1, 1], branch prediction accuracy is ~50%. How does this affect CPU performance compared to branchless implementations using `max(0, x)`? +- **Exponential Computation Cost**: Sigmoid, Tanh, GELU, and Softmax all require exponential computation. On modern CPUs, `exp(x)` takes ~10-20 cycles vs ~1 cycle for addition. For a network with 1M activations, how does this computational difference compound across training iterations? Why do modern frameworks use lookup tables or polynomial approximations for exponentials? -Every major breakthrough in deep learningโ€”from image recognition to language modelsโ€”relies on the functions you're about to build. Take your time, understand the mathematics, and enjoy creating the foundation of intelligent systems! +## Ready to Build? +You're about to implement the mathematical functions that give neural networks their power to learn complex patterns! Every breakthrough in deep learningโ€”from AlexNet's ImageNet victory to GPT's language understanding to diffusion models' image generationโ€”relies on the simple activation functions you'll build in this module. +Understanding activations from first principles means implementing their mathematics, handling numerical stability edge cases (overflow, underflow), and grasping their properties (ranges, gradients, symmetry). This knowledge will give you deep insight into why ReLU dominates hidden layers, why Sigmoid creates effective gates in LSTMs, why Tanh helps optimization, why GELU powers transformers, and why Softmax is essential for classification. You'll understand exactly what happens when you call `F.relu(x)` or `torch.sigmoid(x)` in production codeโ€”not just the API, but the actual math, numerical considerations, and performance implications. +This is where pure mathematics meets practical machine learning. Take your time with each activation, test thoroughly with extreme values, visualize their behavior across input ranges, and enjoy building the non-linearity that powers modern AI. Let's turn linear transformations into intelligent representations! Choose your preferred way to engage with this module: ````{grid} 1 2 3 3 ```{grid-item-card} ๐Ÿš€ Launch Binder -:link: https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/03_activations/activations_dev.ipynb +:link: https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/02_activations/activations_dev.ipynb :class-header: bg-light Run this module interactively in your browser. No installation required! ``` -```{grid-item-card} โšก Open in Colab -:link: https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/03_activations/activations_dev.ipynb +```{grid-item-card} โšก Open in Colab +:link: https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/02_activations/activations_dev.ipynb :class-header: bg-light Use Google Colab for GPU access and cloud compute power. ``` ```{grid-item-card} ๐Ÿ“– View Source -:link: https://github.com/mlsysbook/TinyTorch/blob/main/modules/03_activations/activations_dev.py +:link: https://github.com/mlsysbook/TinyTorch/blob/main/modules/02_activations/activations_dev.py :class-header: bg-light Browse the Python source code and understand the implementation. @@ -212,6 +395,6 @@ Browse the Python source code and understand the implementation. ---
-โ† Previous Module -Next Module โ†’ +โ† Previous Module +Next Module โ†’
diff --git a/modules/03_layers/ABOUT.md b/modules/03_layers/ABOUT.md index 3aca5b1e..cae3f8fa 100644 --- a/modules/03_layers/ABOUT.md +++ b/modules/03_layers/ABOUT.md @@ -1,99 +1,165 @@ --- title: "Layers" -description: "Neural network layers (Linear, activation layers)" +description: "Build the fundamental neural network building blocks: Linear layers with weight initialization and Dropout for regularization" difficulty: "โญโญ" time_estimate: "4-5 hours" -prerequisites: [] -next_steps: [] -learning_objectives: [] +prerequisites: ["01_tensor", "02_activations"] +next_steps: ["04_losses"] +learning_objectives: + - "Understand layer abstractions as composable transformations" + - "Implement Linear layers with Xavier initialization" + - "Build Dropout regularization for preventing overfitting" + - "Master parameter management for gradient-based training" + - "Compose layers into multi-layer architectures" --- # 03. Layers -**๐Ÿ—๏ธ FOUNDATION TIER** | Difficulty: โญโญ (2/4) | Time: 4-5 hours +**FOUNDATION TIER** | Difficulty: โญโญ (2/4) | Time: 4-5 hours ## Overview -Build the fundamental transformations that compose into neural networks. This module teaches you that layers are simply functions that transform tensors, and neural networks are just sophisticated function composition using these building blocks. +Build the fundamental building blocks that compose into neural networks. This module teaches you that layers are simply functions that transform tensors, with learnable parameters that define the transformation. You'll implement Linear layers (the workhorse of deep learning) and Dropout regularization, understanding how these simple abstractions enable arbitrarily complex architectures through composition. ## Learning Objectives -By completing this module, you will be able to: +By the end of this module, you will be able to: -1. **Understand layers as mathematical functions** that transform tensors through well-defined operations -2. **Implement Dense layers** using matrix multiplication and bias addition (`y = Wx + b`) -3. **Integrate activation functions** to combine linear transformations with nonlinearity -4. **Compose building blocks** by chaining layers into complete neural network architectures -5. **Debug layer implementations** using shape analysis and mathematical properties +- **Understand Layer Abstraction**: Recognize layers as composable functions with parameters, mirroring PyTorch's `torch.nn.Module` design pattern +- **Implement Linear Transformations**: Build `y = xW + b` with proper Xavier initialization to prevent gradient vanishing/explosion +- **Master Parameter Management**: Track trainable parameters using `parameters()` method for optimizer integration +- **Build Dropout Regularization**: Implement training/inference mode switching with proper scaling to prevent overfitting +- **Analyze Memory Scaling**: Calculate parameter counts and understand how network architecture affects memory footprint -## Why This Matters +## Build โ†’ Use โ†’ Reflect -### Production Context +This module follows TinyTorch's **Build โ†’ Use โ†’ Reflect** framework: -Layers are the building blocks of every neural network in production: - -- **Image Recognition** uses Dense layers for final classification (ResNet, EfficientNet) -- **Language Models** compose thousands of transformer layers (GPT, BERT, Claude) -- **Recommendation Systems** stack Dense layers to learn user-item interactions -- **Autonomous Systems** chain convolutional and Dense layers for perception - -### Historical Context - -The evolution of layer abstractions enabled modern deep learning: - -- **1943**: McCulloch-Pitts neuron - first artificial neuron model -- **1958**: Rosenblatt's Perceptron - single-layer learning algorithm -- **1986**: Backpropagation - enabled training multi-layer networks -- **2012**: AlexNet - proved deep layers (8 layers) revolutionize computer vision -- **2017**: Transformers - layer composition scaled to 96+ layers in modern LLMs - -## Build โ†’ Use โ†’ Understand - -This module follows the foundational pedagogy for building blocks: - -1. **Build**: Implement Dense layer class with initialization, forward pass, and parameter management -2. **Use**: Transform data through layer operations and compose multi-layer networks -3. **Understand**: Analyze how layer composition creates expressivity and why architecture design matters +1. **Build**: Implement Linear and Dropout layer classes with proper initialization, forward passes, and parameter tracking +2. **Use**: Compose layers manually to create multi-layer networks for MNIST digit classification +3. **Reflect**: Analyze memory scaling, computational complexity, and the trade-offs between model capacity and efficiency ## Implementation Guide -### Core Layer Implementation +### Linear Layer: The Neural Network Workhorse + +The Linear layer implements the fundamental transformation `y = xW + b`: + ```python -# Dense layer: fundamental building block -layer = Dense(input_size=3, output_size=2) -x = Tensor([[1.0, 2.0, 3.0]]) -y = layer(x) # Shape transformation: (1, 3) โ†’ (1, 2) +from tinytorch.core.layers import Linear -# With activation functions -relu = ReLU() -activated = relu(y) # Apply nonlinearity +# Create a linear transformation: 784 input features โ†’ 256 output features +layer = Linear(784, 256) -# Chaining operations -layer1 = Dense(784, 128) # Image โ†’ hidden -layer2 = Dense(128, 10) # Hidden โ†’ classes -activation = ReLU() +# Forward pass: transform input batch +x = Tensor(np.random.randn(32, 784)) # 32 images, 784 pixels each +y = layer(x) # Output: (32, 256) -# Forward pass composition -x = Tensor([[1.0, 2.0, 3.0, ...]]) # Input data -h1 = activation(layer1(x)) # First transformation -output = layer2(h1) # Final prediction +# Access trainable parameters +print(f"Weight shape: {layer.weight.shape}") # (784, 256) +print(f"Bias shape: {layer.bias.shape}") # (256,) +print(f"Total params: {784 * 256 + 256}") # 200,960 parameters ``` -### Dense Layer Implementation -- **Mathematical foundation**: Linear transformation `y = Wx + b` -- **Weight initialization**: Xavier/Glorot uniform initialization for stable gradients -- **Bias handling**: Optional bias terms for translation invariance -- **Shape management**: Automatic handling of batch dimensions and matrix operations +**Key Design Decisions:** +- **Xavier Initialization**: Weights scaled by `sqrt(1/in_features)` to maintain gradient flow through deep networks +- **Parameter Tracking**: `parameters()` method returns list of tensors with `requires_grad=True` for optimizer compatibility +- **Bias Handling**: Optional bias parameter (`bias=False` for architectures like batch normalization) -### Activation Layer Integration -- **ReLU integration**: Most common activation for hidden layers -- **Sigmoid integration**: Probability outputs for binary classification -- **Tanh integration**: Zero-centered outputs for better optimization -- **Composition patterns**: Standard ways to combine layers and activations +### Dropout: Preventing Overfitting + +Dropout randomly zeros elements during training to force network robustness: + +```python +from tinytorch.core.layers import Dropout + +# Create dropout with 50% probability +dropout = Dropout(p=0.5) + +x = Tensor([1.0, 2.0, 3.0, 4.0]) + +# Training mode: randomly zero elements and scale by 1/(1-p) +y_train = dropout(x, training=True) +# Example output: [2.0, 0.0, 6.0, 0.0] - survivors scaled by 2.0 + +# Inference mode: pass through unchanged +y_eval = dropout(x, training=False) +# Output: [1.0, 2.0, 3.0, 4.0] - no dropout applied +``` + +**Why Inverted Dropout?** +During training, surviving elements are scaled by `1/(1-p)` so that expected values match during inference. This eliminates the need to scale during evaluation, making deployment simpler. + +### Layer Composition: Building Neural Networks + +Layers compose through sequential application - no container needed: + +```python +from tinytorch.core.layers import Linear, Dropout +from tinytorch.core.activations import ReLU + +# Build 3-layer MNIST classifier manually +layer1 = Linear(784, 256) +activation1 = ReLU() +dropout1 = Dropout(0.5) + +layer2 = Linear(256, 128) +activation2 = ReLU() +dropout2 = Dropout(0.3) + +layer3 = Linear(128, 10) + +# Forward pass: explicit composition shows data flow +def forward(x): + x = layer1(x) + x = activation1(x) + x = dropout1(x, training=True) + x = layer2(x) + x = activation2(x) + x = dropout2(x, training=True) + x = layer3(x) + return x + +# Process batch +x = Tensor(np.random.randn(32, 784)) # 32 MNIST images +output = forward(x) # Shape: (32, 10) - class logits + +# Collect all parameters for training +all_params = layer1.parameters() + layer2.parameters() + layer3.parameters() +print(f"Total trainable parameters: {len(all_params)}") # 6 tensors (3 weights, 3 biases) +``` + +## Getting Started + +### Prerequisites + +Ensure you've completed the prerequisite modules: + +```bash +# Activate TinyTorch environment +source bin/activate-tinytorch.sh + +# Verify Module 01 (Tensor) is complete +tito test --module tensor + +# Verify Module 02 (Activations) is complete +tito test --module activations +``` + +### Development Workflow + +1. **Open the development file**: `modules/03_layers/layers_dev.py` +2. **Implement Linear layer**: Build `__init__` with Xavier initialization, `forward` with matrix multiplication, and `parameters()` method +3. **Add Dropout layer**: Implement training/inference mode switching with proper mask generation and scaling +4. **Test layer composition**: Verify manual composition of multi-layer networks with mixed layer types +5. **Analyze systems behavior**: Run memory analysis to understand parameter scaling with network size +6. **Export and verify**: `tito module complete 03 && tito test --module layers` ## Testing -Run the complete test suite to verify your implementation: +### Comprehensive Test Suite + +Run the full test suite to verify layer functionality: ```bash # TinyTorch CLI (recommended) @@ -104,107 +170,132 @@ python -m pytest tests/ -k layers -v ``` ### Test Coverage Areas -- โœ… **Layer Functionality**: Verify Dense layers perform correct linear transformations -- โœ… **Weight Initialization**: Ensure proper weight initialization for training stability -- โœ… **Shape Preservation**: Confirm layers handle batch dimensions correctly -- โœ… **Activation Integration**: Test seamless combination with activation functions -- โœ… **Network Composition**: Verify layers can be chained into complete networks -### Inline Testing & Development -The module includes educational feedback during development: +- โœ… **Linear Layer Functionality**: Verify `y = xW + b` computation with correct matrix dimensions and broadcasting +- โœ… **Xavier Initialization**: Ensure weights scaled by `sqrt(1/in_features)` for gradient stability +- โœ… **Parameter Management**: Confirm `parameters()` returns all trainable tensors with `requires_grad=True` +- โœ… **Dropout Training Mode**: Validate probabilistic masking with correct `1/(1-p)` scaling +- โœ… **Dropout Inference Mode**: Verify passthrough behavior without modification during evaluation +- โœ… **Layer Composition**: Test multi-layer forward passes with mixed layer types +- โœ… **Edge Cases**: Handle empty batches, single samples, no-bias configurations, and probability boundaries + +### Inline Testing & Validation + +The module includes comprehensive inline tests with educational feedback: + ```python # Example inline test output -๐Ÿ”ฌ Unit Test: Dense layer functionality... -โœ… Dense layer computes y = Wx + b correctly -โœ… Weight initialization within expected range -โœ… Output shape matches expected dimensions -๐Ÿ“ˆ Progress: Dense Layer โœ“ +๐Ÿ”ฌ Unit Test: Linear Layer... +โœ… Linear layer computes y = xW + b correctly +โœ… Weight initialization within expected Xavier range +โœ… Bias initialized to zeros +โœ… Output shape matches expected dimensions (32, 256) +โœ… Parameter list contains weight and bias tensors +๐Ÿ“ˆ Progress: Linear Layer โœ“ -# Integration testing -๐Ÿ”ฌ Unit Test: Layer composition... -โœ… Multiple layers chain correctly -โœ… Activations integrate seamlessly +๐Ÿ”ฌ Unit Test: Dropout Layer... +โœ… Inference mode passes through unchanged +โœ… Training mode zeros ~50% of elements +โœ… Survivors scaled by 1/(1-p) = 2.0 +โœ… Zero dropout (p=0.0) preserves all values +โœ… Full dropout (p=1.0) zeros everything +๐Ÿ“ˆ Progress: Dropout Layer โœ“ + +๐Ÿ”ฌ Integration Test: Multi-layer Network... +โœ… 3-layer network processes batch: (32, 784) โ†’ (32, 10) +โœ… Parameter count: 235,146 parameters across 6 tensors +โœ… All parameters have requires_grad=True ๐Ÿ“ˆ Progress: Layer Composition โœ“ ``` ### Manual Testing Examples + ```python from tinytorch.core.tensor import Tensor -from layers_dev import Dense -from activations_dev import ReLU +from tinytorch.core.layers import Linear, Dropout +from tinytorch.core.activations import ReLU -# Test basic layer functionality -layer = Dense(input_size=3, output_size=2) -x = Tensor([[1.0, 2.0, 3.0]]) +# Test Linear layer forward pass +layer = Linear(784, 256) +x = Tensor(np.random.randn(1, 784)) # Single MNIST image y = layer(x) -print(f"Input shape: {x.shape}, Output shape: {y.shape}") +print(f"Input: {x.shape} โ†’ Output: {y.shape}") # (1, 784) โ†’ (1, 256) -# Test layer composition -layer1 = Dense(3, 4) -layer2 = Dense(4, 2) -relu = ReLU() +# Test parameter counting +params = layer.parameters() +total = sum(p.data.size for p in params) +print(f"Parameters: {total}") # 200,960 -# Forward pass -h1 = relu(layer1(x)) -output = layer2(h1) -print(f"Final output: {output.data}") +# Test Dropout behavior +dropout = Dropout(0.5) +x = Tensor(np.ones((1, 100))) +y_train = dropout(x, training=True) +y_eval = dropout(x, training=False) +print(f"Training: ~{np.count_nonzero(y_train.data)} survived") # ~50 +print(f"Inference: {np.count_nonzero(y_eval.data)} survived") # 100 + +# Test composition +net = lambda x: layer3(dropout2(activation2(layer2(dropout1(activation1(layer1(x))))))) ``` ## Systems Thinking Questions ### Real-World Applications -- **Computer Vision**: Dense layers process flattened image features in CNNs (like VGG, ResNet final layers) -- **Natural Language Processing**: Dense layers transform word embeddings in transformers and RNNs -- **Recommendation Systems**: Dense layers combine user and item features for preference prediction -- **Scientific Computing**: Dense layers approximate complex functions in physics simulations and engineering + +- **Computer Vision**: How do Linear layers in ResNet-50's final classification head transform 2048 feature maps to 1000 class logits? What determines this bottleneck layer's size? +- **Language Models**: GPT-3 uses Linear layers with 12,288 input features. How much memory do these layers consume, and why does this limit model deployment? +- **Recommendation Systems**: Netflix uses multi-layer networks with Dropout. How does `p=0.5` affect training time vs model accuracy on sparse user-item interactions? +- **Edge Deployment**: A mobile CNN has 5 Linear layers totaling 2MB. How do you decide which layers to quantize or prune when targeting 500KB model size? ### Mathematical Foundations -- **Linear Transformation**: `y = Wx + b` where W is the weight matrix and b is the bias vector -- **Matrix Multiplication**: Efficient batch processing through vectorized operations -- **Weight Initialization**: Xavier/Glorot initialization prevents vanishing/exploding gradients -- **Function Composition**: Networks as nested function calls: `f3(f2(f1(x)))` -### Neural Network Building Blocks -- **Modularity**: Layers as reusable components that can be combined in different ways -- **Standardized Interface**: All layers follow the same input/output pattern for easy composition -- **Shape Consistency**: Automatic handling of batch dimensions and shape transformations -- **Nonlinearity**: Activation functions between layers enable learning of complex patterns +- **Xavier Initialization**: Why does `scale = sqrt(1/fan_in)` preserve gradient variance through layers? What happens in a 20-layer network without proper initialization? +- **Matrix Multiplication Complexity**: A Linear(1024, 1024) layer with batch size 128 performs how many FLOPs? How does this compare to a Dropout layer on the same tensor? +- **Dropout Mathematics**: During training with `p=0.5`, what's the expected value of each element? Why must we scale by `1/(1-p)` to match inference behavior? +- **Parameter Growth**: If you double the hidden layer size from 256 to 512, how many times more parameters do you have in Linear(784, hidden) + Linear(hidden, 10)? -### Implementation Patterns -- **Class-based Design**: Layers as objects with state (weights) and behavior (forward pass) -- **Initialization Strategy**: Proper weight initialization for stable training dynamics -- **Error Handling**: Graceful handling of shape mismatches and invalid inputs -- **Testing Philosophy**: Comprehensive testing of mathematical properties and edge cases +### Architecture Design Patterns -## ๐ŸŽ‰ Ready to Build? +- **Layer Width vs Depth**: A 784โ†’512โ†’10 network vs 784โ†’256โ†’256โ†’10 - which has more parameters? Which typically generalizes better and why? +- **Dropout Placement**: Should you place Dropout before or after activation functions? What's the difference between `Linear โ†’ ReLU โ†’ Dropout` vs `Linear โ†’ Dropout โ†’ ReLU`? +- **Bias Necessity**: When can you safely use `bias=False`? How does batch normalization (Module 09) interact with bias terms? +- **Composition Philosophy**: We deliberately avoided a Sequential container. What trade-offs do explicit composition and container abstractions make for debugging vs convenience? -You're about to build the fundamental building blocks that power every neural network! Dense layers might seem simple, but they're the workhorses of deep learningโ€”from the final layers of image classifiers to the core components of language models. +### Performance Characteristics -Understanding how these simple linear transformations compose into complex intelligence is one of the most beautiful insights in machine learning. Take your time, understand the mathematics, and enjoy building the foundation of artificial intelligence! +- **Memory Hierarchy**: A Linear(4096, 4096) layer has 16M parameters (64MB). Does this fit in L3 cache? How does cache performance affect training speed? +- **Batch Size Scaling**: Measuring throughput from batch_size=1 to 512, why does samples/sec increase but eventually plateau? What's the bottleneck? +- **Dropout Overhead**: Profiling shows Dropout adds 2% overhead to training time. Where is this cost - mask generation, element-wise multiply, or memory bandwidth? +- **Parameter Memory vs Activation Memory**: In a 100-layer network, which dominates memory usage during training? How does gradient checkpointing address this? - +## Ready to Build? +You're about to implement the abstractions that power every neural network in production. Linear layers might seem deceptively simple - just matrix multiplication and bias addition - but this simplicity is the foundation of extraordinary complexity. From ResNet's 25 million parameters to GPT-3's 175 billion, every learned transformation ultimately reduces to chains of `y = xW + b`. + +Understanding layer composition is crucial for systems thinking. When you see "ResNet-50," you'll know exactly how parameter counts scale with depth. When debugging vanishing gradients, you'll understand why Xavier initialization matters. When deploying to mobile devices, you'll calculate memory footprints in your head. + +Take your time with this module. Test each component thoroughly. Analyze the memory patterns. Build the intuition for how these simple building blocks compose into intelligence. This is where deep learning becomes real. Choose your preferred way to engage with this module: ````{grid} 1 2 3 3 ```{grid-item-card} ๐Ÿš€ Launch Binder -:link: https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/04_layers/layers_dev.ipynb +:link: https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/03_layers/layers_dev.ipynb :class-header: bg-light Run this module interactively in your browser. No installation required! ``` -```{grid-item-card} โšก Open in Colab -:link: https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/04_layers/layers_dev.ipynb +```{grid-item-card} โšก Open in Colab +:link: https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/03_layers/layers_dev.ipynb :class-header: bg-light Use Google Colab for GPU access and cloud compute power. ``` ```{grid-item-card} ๐Ÿ“– View Source -:link: https://github.com/mlsysbook/TinyTorch/blob/main/modules/04_layers/layers_dev.py +:link: https://github.com/mlsysbook/TinyTorch/blob/main/modules/03_layers/layers_dev.py :class-header: bg-light Browse the Python source code and understand the implementation. @@ -221,6 +312,6 @@ Browse the Python source code and understand the implementation. ---
-โ† Previous Module -Next Module โ†’ +โ† Previous Module +Next Module โ†’
diff --git a/modules/04_losses/ABOUT.md b/modules/04_losses/ABOUT.md index d1f26aa5..8ae7c8cb 100644 --- a/modules/04_losses/ABOUT.md +++ b/modules/04_losses/ABOUT.md @@ -1,217 +1,303 @@ --- title: "Loss Functions" -description: "Implement MSE and CrossEntropy loss functions for training neural networks" -difficulty: 2 +description: "Build MSE and CrossEntropy loss functions with numerical stability for regression and classification" +difficulty: "โญโญ (2/4)" time_estimate: "3-4 hours" -prerequisites: ["Tensor", "Activations", "Layers"] -next_steps: ["Autograd"] +prerequisites: ["01_tensor", "02_activations", "03_layers"] +next_steps: ["05_autograd"] learning_objectives: - - "Implement MSE loss for regression tasks with proper numerical stability" - - "Build CrossEntropy loss for classification with log-sum-exp trick" - - "Understand mathematical properties of loss functions and their gradients" - - "Recognize how loss functions connect model outputs to optimization objectives" - - "Apply appropriate loss functions for different machine learning tasks" + - "Understand loss function memory allocation patterns and computational costs" + - "Implement MSE and CrossEntropy losses with proper numerical stability" + - "Master the log-sum-exp trick and its role in preventing overflow" + - "Connect loss implementations to PyTorch/TensorFlow loss APIs" + - "Analyze gradient flow and scaling trade-offs in loss computation" --- -# 04. Losses +# 04. Loss Functions -**๐Ÿ—๏ธ FOUNDATION TIER** | Difficulty: โญโญ (2/4) | Time: 3-4 hours +**FOUNDATION TIER** | Difficulty: โญโญ (2/4) | Time: 3-4 hours ## Overview -Implement the mathematical functions that measure how wrong your model's predictions are. Loss functions are the bridge between model outputs and the optimization processโ€”they define what "better" means and drive the entire learning process. +Loss functions are the mathematical conscience of machine learning. They quantify prediction error and provide the scalar signal that drives optimization. This module implements MSE for regression and CrossEntropy for classification, with careful attention to numerical stability through the log-sum-exp trick. You'll build the feedback mechanisms used in billions of training runs across GPT models, ResNets, and all production ML systems. ## Learning Objectives -By completing this module, you will be able to: +By the end of this module, you will be able to: -1. **Implement MSE loss** for regression tasks with numerically stable computation -2. **Build CrossEntropy loss** for classification using the log-sum-exp trick for numerical stability -3. **Understand mathematical properties** of loss landscapes and their impact on optimization -4. **Recognize the role** of loss functions in connecting predictions to training objectives -5. **Apply appropriate losses** for regression, binary classification, and multi-class classification +- **Implement MSE Loss**: Build mean squared error with proper reduction strategies and understand memory/compute costs +- **Build CrossEntropy Loss**: Create numerically stable classification loss using log-sum-exp trick to prevent overflow +- **Master Numerical Stability**: Understand why naive implementations fail with large logits and implement production-grade solutions +- **Analyze Memory Patterns**: Compute loss function memory footprints across batch sizes and vocabulary dimensions +- **Connect to Frameworks**: Understand how PyTorch's `nn.MSELoss` and `nn.CrossEntropyLoss` implement these same concepts -## Why This Matters +## Build โ†’ Use โ†’ Reflect -### Production Context +This module follows TinyTorch's **Build โ†’ Use โ†’ Reflect** framework: -Loss functions are fundamental to all machine learning systems: - -- **Recommendation Systems** use MSE and ranking losses to learn user preferences -- **Image Classification** relies on CrossEntropy loss for category prediction (ImageNet, CIFAR-10) -- **Language Models** use CrossEntropy to predict next tokens in GPT, Claude, and all LLMs -- **Autonomous Driving** combines multiple losses for perception, planning, and control - -### Historical Context - -Loss functions evolved with machine learning itself: - -- **Least Squares (1805)**: Gauss invented MSE for astronomical orbit predictions -- **Maximum Likelihood (1912)**: Fisher formalized statistical foundations of loss functions -- **CrossEntropy (1950s)**: Information theory brought entropy-based losses to ML -- **Modern Deep Learning (2012+)**: Careful loss design enables training billion-parameter models - -## Build โ†’ Use โ†’ Understand - -This module follows the classic pedagogy for foundational concepts: - -1. **Build**: Implement MSE and CrossEntropy loss functions from mathematical definitions -2. **Use**: Apply losses to regression and classification tasks, seeing how they drive learning -3. **Understand**: Analyze loss landscapes, gradients, and numerical stability considerations +1. **Build**: Implement MSE and CrossEntropy loss functions with the log-sum-exp trick for numerical stability +2. **Use**: Apply losses to regression (house prices) and classification (image recognition) problems +3. **Reflect**: Why does CrossEntropy overflow without log-sum-exp? How does loss scale affect gradient magnitudes? ## Implementation Guide -### Step 1: MSE (Mean Squared Error) Loss +### MSELoss - Regression Loss -Implement L2 loss for regression: +Mean Squared Error is the foundation of regression problems. It measures the average squared distance between predictions and targets, creating a quadratic penalty that grows rapidly with prediction error. ```python class MSELoss: - """Mean Squared Error loss for regression.""" - - def __call__(self, predictions: Tensor, targets: Tensor) -> Tensor: - """ - Compute MSE: (1/n) * ฮฃ(predictions - targets)ยฒ - - Args: - predictions: Model outputs - targets: Ground truth values - Returns: - Scalar loss value - """ - diff = predictions - targets - squared = diff * diff - return squared.mean() + """Mean Squared Error for regression tasks.""" + + def forward(self, predictions: Tensor, targets: Tensor) -> Tensor: + # Compute: (1/n) * ฮฃ(predictions - targets)ยฒ + diff = predictions.data - targets.data + squared_diff = diff ** 2 + return Tensor(np.mean(squared_diff)) ``` -### Step 2: CrossEntropy Loss +**Key Properties**: +- **Quadratic penalty**: error of 2 โ†’ loss of 4, error of 10 โ†’ loss of 100 +- **Outlier sensitivity**: Large errors dominate the loss landscape +- **Smooth gradients**: Differentiable everywhere, nice optimization properties +- **Memory footprint**: ~2 ร— batch_size ร— output_dim for intermediate storage -Implement log-likelihood loss for classification: +**Mathematical Foundation**: MSE derives from maximum likelihood estimation under Gaussian noise. When you assume prediction errors are normally distributed, minimizing MSE is equivalent to maximizing the likelihood of observing your data. + +**Use Cases**: House price prediction, temperature forecasting, stock price regression, image reconstruction in autoencoders, and any continuous value prediction where quadratic error makes sense. + +### Log-Softmax with Numerical Stability + +Before implementing CrossEntropy, we need a numerically stable way to compute log-softmax. This is the critical building block that prevents overflow in classification losses. + +```python +def log_softmax(x: Tensor, dim: int = -1) -> Tensor: + """Numerically stable log-softmax using log-sum-exp trick.""" + # Step 1: Subtract max for stability + max_vals = np.max(x.data, axis=dim, keepdims=True) + shifted = x.data - max_vals + + # Step 2: Compute log(sum(exp(shifted))) + log_sum_exp = np.log(np.sum(np.exp(shifted), axis=dim, keepdims=True)) + + # Step 3: Return log-softmax + return Tensor(x.data - max_vals - log_sum_exp) +``` + +**Why Log-Sum-Exp Matters**: +``` +Without trick: exp(1000) = overflow (inf) +With trick: exp(1000 - 1000) = exp(0) = 1.0 โœ“ +``` + +**The Mathematics**: Computing `log(ฮฃ exp(xi))` directly causes overflow when logits are large. The log-sum-exp trick factors out the maximum value: `log(ฮฃ exp(xi)) = max(x) + log(ฮฃ exp(xi - max(x)))`. This shifts all exponents into a safe range (โ‰ค 0) before computing exp, preventing overflow while maintaining mathematical equivalence. + +**Production Reality**: This exact technique is used in PyTorch's `F.log_softmax`, TensorFlow's `tf.nn.log_softmax`, and JAX's `jax.nn.log_softmax`. It's not an educational simplificationโ€”it's production-critical numerical stability. + +### CrossEntropyLoss - Classification Loss + +CrossEntropy is the standard loss for multi-class classification. It measures how well predicted probability distributions match true class labels, providing strong gradients for confident wrong predictions and gentle gradients for confident correct predictions. ```python class CrossEntropyLoss: - """CrossEntropy loss for multi-class classification.""" - - def __call__(self, logits: Tensor, targets: Tensor) -> Tensor: - """ - Compute CrossEntropy with log-sum-exp trick for numerical stability. - - Args: - logits: Raw model outputs (before softmax) - targets: Class indices or one-hot vectors - Returns: - Scalar loss value - """ - # Log-sum-exp trick for numerical stability - max_logits = logits.max(axis=1, keepdims=True) - exp_logits = (logits - max_logits).exp() - log_probs = logits - max_logits - exp_logits.sum(axis=1, keepdims=True).log() - - # Negative log-likelihood - return -log_probs.mean() + """Cross-entropy loss for multi-class classification.""" + + def forward(self, logits: Tensor, targets: Tensor) -> Tensor: + # Step 1: Compute log-softmax (stable) + log_probs = log_softmax(logits, dim=-1) + + # Step 2: Select correct class log-probabilities + batch_size = logits.shape[0] + target_indices = targets.data.astype(int) + selected_log_probs = log_probs.data[np.arange(batch_size), target_indices] + + # Step 3: Return negative mean + return Tensor(-np.mean(selected_log_probs)) ``` -### Step 3: Loss Function Properties +**Gradient Behavior**: +- **Confident and correct**: Small gradient (model is right, minimal updates needed) +- **Confident and wrong**: Large gradient (urgent correction signal) +- **Uncertain predictions**: Medium gradient (encourages confidence when correct) +- **Natural confidence weighting**: The loss automatically provides stronger signals when the model needs to change -Understand key mathematical properties: +**Why It Works**: CrossEntropy derives from maximum likelihood estimation under a categorical distribution. Minimizing CrossEntropy is equivalent to maximizing the probability the model assigns to the correct class. The logarithm transforms products into sums (computationally stable) and creates the characteristic gradient behavior. -- **Convexity**: MSE is convex; CrossEntropy is convex in logits -- **Gradients**: Smooth gradients enable effective optimization -- **Scale**: Loss magnitude affects learning rate tuning -- **Numerical Stability**: Requires careful implementation (log-sum-exp trick) +### BinaryCrossEntropyLoss - Binary Classification + +Binary CrossEntropy is specialized for two-class problems. It's more efficient than full CrossEntropy for binary decisions and provides symmetric treatment of positive and negative classes. + +```python +class BinaryCrossEntropyLoss: + """Binary cross-entropy for yes/no decisions.""" + + def forward(self, predictions: Tensor, targets: Tensor) -> Tensor: + # Clamp to prevent log(0) + eps = 1e-7 + clamped = np.clip(predictions.data, eps, 1 - eps) + + # BCE = -(y*log(p) + (1-y)*log(1-p)) + return Tensor(-np.mean( + targets.data * np.log(clamped) + + (1 - targets.data) * np.log(1 - clamped) + )) +``` + +**Numerical Stability**: The epsilon clamping (`1e-7` to `1-1e-7`) prevents `log(0)` which would produce `-inf`. This is critical for binary classification where predictions can approach 0 or 1. + +**Use Cases**: Spam detection (spam vs not spam), medical diagnosis (disease vs healthy), fraud detection (fraud vs legitimate), content moderation (toxic vs safe), and any yes/no decision problem where both classes matter equally. + +## Getting Started + +### Prerequisites +Ensure you understand the foundations from previous modules: + +```bash +# Activate TinyTorch environment +source bin/activate-tinytorch.sh + +# Verify prerequisite modules +tito test --module tensor +tito test --module activations +tito test --module layers +``` + +### Development Workflow +1. **Open the development file**: `modules/04_losses/losses_dev.ipynb` +2. **Implement log_softmax**: Build numerically stable log-softmax with log-sum-exp trick +3. **Build MSELoss**: Create regression loss with proper reduction +4. **Create CrossEntropyLoss**: Implement classification loss using stable log-softmax +5. **Add BinaryCrossEntropyLoss**: Build binary classification loss with clamping +6. **Export and verify**: `tito module complete 04 && tito test --module losses` ## Testing -### Inline Tests - -The module includes immediate feedback: - -```python -# Example inline test output -๐Ÿ”ฌ Unit Test: MSE Loss... -โœ… MSE computes squared error correctly -โœ… MSE gradient flows properly -โœ… MSE handles batch dimensions correctly -๐Ÿ“ˆ Progress: MSE Loss โœ“ - -๐Ÿ”ฌ Unit Test: CrossEntropy Loss... -โœ… CrossEntropy numerically stable -โœ… CrossEntropy matches PyTorch implementation -โœ… CrossEntropy handles multi-class problems -๐Ÿ“ˆ Progress: CrossEntropy Loss โœ“ -``` - -### Export and Validate +### Comprehensive Test Suite +Run the full test suite to verify loss functionality: ```bash -# Export to package -tito export --module 04_losses +# TinyTorch CLI (recommended) +tito test --module losses -# Run test suite -tito test --module 04_losses +# Direct pytest execution +python -m pytest tests/ -k losses -v ``` -## Where This Code Lives +### Test Coverage Areas +- โœ… **MSE Correctness**: Validates known cases, perfect predictions (loss=0), non-negativity +- โœ… **CrossEntropy Stability**: Tests large logits (1000+), verifies no overflow/underflow +- โœ… **Gradient Properties**: Ensures CrossEntropy gradient equals softmax - target +- โœ… **Binary Classification**: Validates BCE with boundary cases and probability constraints +- โœ… **Log-Sum-Exp Trick**: Confirms numerical stability with extreme values -``` -tinytorch/ -โ”œโ”€โ”€ nn/ -โ”‚ โ””โ”€โ”€ losses.py # MSELoss, CrossEntropyLoss -โ””โ”€โ”€ core/ - โ””โ”€โ”€ tensor.py # Underlying tensor operations -``` - -After export, use as: +### Inline Testing & Validation +The module includes comprehensive unit tests: ```python -from tinytorch.nn import MSELoss, CrossEntropyLoss +๐Ÿ”ฌ Unit Test: Log-Softmax... +โœ… log_softmax works correctly with numerical stability! -# For regression +๐Ÿ”ฌ Unit Test: MSE Loss... +โœ… MSELoss works correctly! + +๐Ÿ”ฌ Unit Test: Cross-Entropy Loss... +โœ… CrossEntropyLoss works correctly! + +๐Ÿ“ˆ Progress: Loss Functions Module โœ“ +``` + +### Manual Testing Examples +```python +from losses_dev import MSELoss, CrossEntropyLoss, BinaryCrossEntropyLoss + +# Regression example mse = MSELoss() +predictions = Tensor([200.0, 250.0, 300.0]) # House prices (thousands) +targets = Tensor([195.0, 260.0, 290.0]) loss = mse(predictions, targets) +print(f"MSE Loss: {loss.data:.2f}") -# For classification +# Classification example ce = CrossEntropyLoss() +logits = Tensor([[2.0, 0.5, 0.1], [0.3, 1.8, 0.2]]) +labels = Tensor([0, 1]) # Class indices loss = ce(logits, labels) +print(f"CrossEntropy Loss: {loss.data:.3f}") ``` ## Systems Thinking Questions -1. **Why does CrossEntropy require the log-sum-exp trick?** What numerical instability occurs without it? +### Real-World Applications +- **Computer Vision**: ImageNet uses CrossEntropy over 1000 classes with 1.2M training images +- **Language Modeling**: GPT models use CrossEntropy over 50K+ token vocabularies for next-token prediction +- **Medical Diagnosis**: BinaryCrossEntropy for disease detection where class imbalance is critical +- **Recommender Systems**: MSE for rating prediction, BCE for click-through rate estimation -2. **How does loss scale affect learning?** If you multiply your loss by 100, what happens to gradients and learning? +### Mathematical Foundations +- **MSE Properties**: Convex loss landscape, quadratic penalty, maximum likelihood under Gaussian noise assumption +- **CrossEntropy Derivation**: Negative log-likelihood of correct class under softmax distribution +- **Log-Sum-Exp Trick**: Prevents overflow by factoring out max value before exponential computation +- **Gradient Behavior**: MSE gradient scales linearly with error; CrossEntropy gradient is confidence-weighted -3. **Why do we use MSE for regression but CrossEntropy for classification?** What makes each appropriate for its task? +### Performance Characteristics +- **Memory Scaling**: CrossEntropy uses ~2.5 ร— batch_size ร— num_classes; MSE uses ~2 ร— batch_size ร— output_dim +- **Computational Cost**: CrossEntropy requires expensive exp/log operations (~10x arithmetic cost) +- **Numerical Precision**: FP16 training requires loss scaling to prevent gradient underflow +- **Batch Size Effects**: Mean reduction provides batch-size-independent gradients; sum reduction scales with batch size -4. **How do loss functions connect to probability theory?** What is the relationship between CrossEntropy and maximum likelihood? +## Ready to Build? -5. **What happens if you use the wrong loss function?** Try MSE for classification or CrossEntropy for regressionโ€”what breaks? +You're about to implement the objectives that drive all machine learning. Loss functions transform abstract learning goals (make good predictions) into concrete mathematical targets that gradient descent can optimize. Every training run in production MLโ€”from GPT to ResNetโ€”relies on the numerical stability techniques you'll implement here. -## Real-World Connections +Understanding loss functions deeply means you'll know why training diverges with large learning rates, how to debug NaN losses, and when to choose MSE versus CrossEntropy for your problem. These aren't just formulasโ€”they're the feedback mechanisms that make learning possible. -### Industry Applications +Choose your preferred way to engage with this module: -- **Computer Vision**: CrossEntropy trains all classification models (ResNet, EfficientNet, Vision Transformers) -- **NLP**: CrossEntropy is the foundation of all language models (GPT, BERT, T5) -- **Recommendation**: MSE and ranking losses optimize Netflix, Spotify, YouTube recommendations -- **Robotics**: MSE trains continuous control policies for manipulation and navigation +````{grid} 1 2 3 3 -### Production Considerations +```{grid-item-card} ๐Ÿš€ Launch Binder +:link: https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/04_losses/losses_dev.ipynb +:class-header: bg-light -- **Numerical Stability**: Log-sum-exp trick prevents overflow/underflow in production systems -- **Loss Scaling**: Careful scaling enables mixed-precision training (FP16/BF16) -- **Weighted Losses**: Class weights handle imbalanced datasets in production -- **Custom Losses**: Production systems often combine multiple loss terms +Run this module interactively in your browser. No installation required! +``` -## What's Next? +```{grid-item-card} โšก Open in Colab +:link: https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/04_losses/losses_dev.ipynb +:class-header: bg-light -Now that you can measure prediction quality, you're ready for **Module 05: Autograd** where you'll learn how to automatically compute gradients of these loss functions, enabling the optimization that drives all of machine learning. +Use Google Colab for GPU access and cloud compute power. +``` -**Preview**: Autograd will automatically compute โˆ‚Loss/โˆ‚weights for any loss function you build, making training possible without manual gradient derivations! +```{grid-item-card} ๐Ÿ“– View Source +:link: https://github.com/mlsysbook/TinyTorch/blob/main/modules/04_losses/losses_dev.ipynb +:class-header: bg-light + +Browse the notebook source code and understand the implementation. +``` + +```` + +```{admonition} ๐Ÿ’พ Save Your Progress +:class: tip +**Binder sessions are temporary!** Download your completed notebook when done, or switch to local development for persistent work. +``` + +**Local workflow**: +```bash +# Start the module +tito module start 04 + +# Work in Jupyter +tito jupyter 04 + +# When complete +tito module complete 04 +tito test --module losses +``` --- -**Need Help?** -- Check the inline tests in `modules/04_losses/losses_dev.py` -- Review mathematical derivations in the module comments -- Compare your implementation against PyTorch's losses - +
+โ† Module 03: Layers +Module 05: Autograd โ†’ +
diff --git a/modules/05_autograd/ABOUT.md b/modules/05_autograd/ABOUT.md index e814a080..54d1de26 100644 --- a/modules/05_autograd/ABOUT.md +++ b/modules/05_autograd/ABOUT.md @@ -415,11 +415,77 @@ print(f"dz/dx with multiple paths: {x.grad}") # Should be 5.0 โœ“ - **Robotics and Control**: Trajectory optimization uses autodiff to compute gradients of cost functions with respect to control inputs for gradient-based planning - **Physics Simulations**: Differentiable physics engines use autodiff for inverse problems like inferring material properties from observed motion -### PyTorch torch.autograd.Function Comparison -- **Function Architecture**: Your Function classes mirror PyTorch's torch.autograd.Function. Both implement forward() and backward() (apply in your case) -- **Enhanced Tensor vs Variable**: PyTorch used to have separate Variable wrapper (pre-v0.4) but merged it into Tensor in 2018. Your implementation follows modern PyTorch style -- **Performance Optimization**: PyTorch implements Function.apply() in C++ with optimized gradient formulas. Your Python implementation demonstrates principles but runs ~100-1000x slower -- **Memory Pooling**: Production frameworks reuse memory allocations across backward passes. What speedup does this provide? +### How Your Implementation Maps to PyTorch + +**What you just built:** +```python +# Your TinyTorch autograd implementation +from tinytorch.core.tensor import Tensor +from tinytorch.core.autograd import AddBackward, MulBackward + +# Forward pass with gradient tracking +x = Tensor([[1.0, 2.0]], requires_grad=True) +w = Tensor([[0.5], [0.7]], requires_grad=True) +y = x.matmul(w) # Builds computation graph +loss = y.mean() + +# Backward pass computes gradients +loss.backward() # YOUR implementation traverses graph +print(x.grad) # Gradients you computed +print(w.grad) +``` + +**How PyTorch does it:** +```python +# PyTorch equivalent +import torch + +# Forward pass with gradient tracking +x = torch.tensor([[1.0, 2.0]], requires_grad=True) +w = torch.tensor([[0.5], [0.7]], requires_grad=True) +y = x @ w # Builds computation graph (same concept) +loss = y.mean() + +# Backward pass computes gradients +loss.backward() # PyTorch autograd engine +print(x.grad) # Same gradient values +print(w.grad) +``` + +**Key Insight**: Your `Function` classes (AddBackward, MulBackward, MatmulBackward) implement the **exact same gradient computation rules** that PyTorch uses internally. When you call `loss.backward()`, both implementations traverse the computation graph in reverse topological order, applying the chain rule via each Function's backward method. + +**What's the SAME?** +- **Computational graph architecture**: Tensor operations create Function nodes +- **Gradient computation**: Chain rule via reverse-mode autodiff +- **API design**: `requires_grad`, `.backward()`, `.grad` attribute +- **Function pattern**: `forward()` computes output, `backward()` computes gradients +- **Tensor enhancement**: Gradients stored directly in Tensor (modern PyTorch style, not Variable wrapper) + +**What's different in production PyTorch?** +- **Backend**: C++/CUDA implementation ~100-1000ร— faster +- **Memory optimization**: Graph nodes pooled and reused across iterations +- **Optimized gradients**: Hand-tuned gradient formulas (e.g., fused operations) +- **Advanced features**: Higher-order gradients, gradient checkpointing, JIT compilation + +**Why this matters**: When you debug PyTorch training and encounter `RuntimeError: element 0 of tensors does not require grad`, you understand this is checking the computation graph structure you implemented. When gradients are `None`, you know backward() hasn't been called or the tensor isn't connected to the lossโ€”concepts from YOUR implementation. + +**Production usage example**: +```python +# PyTorch production code (after TinyTorch) +import torch +import torch.nn as nn + +model = nn.Linear(784, 10) # Uses torch.Tensor with requires_grad=True +optimizer = torch.optim.SGD(model.parameters(), lr=0.01) + +# Training loop - same workflow you built +output = model(input) # Forward pass builds graph +loss = nn.CrossEntropyLoss()(output, target) +loss.backward() # Backward pass (YOUR implementation's logic) +optimizer.step() # Update using .grad (YOUR gradients) +``` + +After implementing autograd yourself, you understand that `loss.backward()` traverses the computation graph you built during forward pass, calling each operation's gradient function (AddBackward, MatmulBackward, etc.) in reverse orderโ€”exactly like your implementation. ### Mathematical Foundations - **Chain Rule**: โˆ‚f/โˆ‚x = (โˆ‚f/โˆ‚u)(โˆ‚u/โˆ‚x) for composite functions f(u(x)) - the mathematical foundation of backpropagation diff --git a/modules/06_optimizers/ABOUT.md b/modules/06_optimizers/ABOUT.md index 08848b1b..fd275522 100644 --- a/modules/06_optimizers/ABOUT.md +++ b/modules/06_optimizers/ABOUT.md @@ -1,108 +1,292 @@ --- title: "Optimizers" -description: "Gradient-based parameter optimization algorithms" +description: "Gradient-based parameter optimization algorithms - SGD, Adam, and AdamW" difficulty: "โญโญโญโญ" time_estimate: "6-8 hours" -prerequisites: [] -next_steps: [] -learning_objectives: [] +prerequisites: ["tensor", "autograd"] +next_steps: ["training"] +learning_objectives: + - "Understand optimization theory and convergence dynamics in neural network training" + - "Implement SGD, momentum, and Adam optimizers from mathematical foundations" + - "Design learning rate scheduling strategies for stable convergence" + - "Analyze memory vs convergence trade-offs across optimization algorithms" + - "Connect optimizer design to PyTorch's torch.optim implementation patterns" --- # 06. Optimizers -**๐Ÿ—๏ธ FOUNDATION TIER** | Difficulty: โญโญโญโญ (4/4) | Time: 6-8 hours +**FOUNDATION TIER** | Difficulty: โญโญโญโญ (4/4) | Time: 6-8 hours ## Overview -Build intelligent optimization algorithms that enable effective neural network training. This module implements the learning algorithms that power modern AIโ€”from basic gradient descent to advanced adaptive methods that make training large-scale models possible. +Welcome to the Optimizers module! You'll implement the learning algorithms that power every neural networkโ€”transforming gradients into intelligent parameter updates that enable models to learn from data. This module builds the optimization foundation used across all modern deep learning frameworks. ## Learning Objectives By the end of this module, you will be able to: -- **Master gradient-based optimization theory**: Understand how gradients guide parameter updates and the mathematical foundations of learning -- **Implement core optimization algorithms**: Build SGD, momentum, and Adam optimizers from mathematical first principles -- **Design learning rate strategies**: Create scheduling systems that balance convergence speed with training stability -- **Apply optimization in practice**: Use optimizers effectively in complete training workflows with real neural networks -- **Analyze optimization dynamics**: Compare algorithm behavior, convergence patterns, and performance characteristics +- **Understand optimization dynamics**: Master convergence behavior, learning rate sensitivity, and how gradients guide parameter updates in high-dimensional loss landscapes +- **Implement core optimization algorithms**: Build SGD, momentum, Adam, and AdamW optimizers from mathematical first principles +- **Analyze memory-convergence trade-offs**: Understand why Adam uses 3x memory but converges faster than SGD on many problems +- **Master adaptive learning rates**: See how Adam's per-parameter learning rates handle different gradient scales automatically +- **Connect to production frameworks**: Understand how your implementations mirror PyTorch's torch.optim.SGD and torch.optim.Adam design patterns -## Build โ†’ Use โ†’ Optimize +## Build โ†’ Use โ†’ Reflect -This module follows TinyTorch's **Build โ†’ Use โ†’ Optimize** framework: +This module follows TinyTorch's **Build โ†’ Use โ†’ Reflect** framework: -1. **Build**: Implement gradient descent, SGD with momentum, Adam optimizer, and learning rate scheduling from mathematical foundations -2. **Use**: Apply optimization algorithms to train neural networks and solve real optimization problems -3. **Optimize**: Analyze convergence behavior, compare algorithm performance, and tune hyperparameters for optimal training +1. **Build**: Implement SGD with momentum, Adam optimizer with adaptive learning rates, and AdamW with decoupled weight decay from mathematical foundations +2. **Use**: Apply optimization algorithms to train neural networks on real classification and regression tasks +3. **Reflect**: Why does Adam converge faster initially but SGD often achieves better final test accuracy? What's the memory cost of adaptive learning rates? ## Implementation Guide ### Core Optimization Algorithms + ```python -# Gradient descent foundation -def gradient_descent_step(parameter, learning_rate): - parameter.data = parameter.data - learning_rate * parameter.grad.data +# Base optimizer class with parameter management +class Optimizer: + """Base class defining optimizer interface.""" + def __init__(self, params: List[Tensor]): + self.params = list(params) + self.step_count = 0 + + def zero_grad(self): + """Clear gradients from all parameters.""" + for param in self.params: + param.grad = None + + def step(self): + """Update parameters - implemented by subclasses.""" + raise NotImplementedError # SGD with momentum for accelerated convergence -sgd = SGD(parameters=[w1, w2, bias], learning_rate=0.01, momentum=0.9) +sgd = SGD(parameters=[w1, w2, bias], lr=0.01, momentum=0.9) sgd.zero_grad() # Clear previous gradients -loss.backward() # Compute new gradients -sgd.step() # Update parameters +loss.backward() # Compute new gradients via autograd +sgd.step() # Update parameters with momentum # Adam optimizer with adaptive learning rates -adam = Adam(parameters=[w1, w2, bias], learning_rate=0.001, beta1=0.9, beta2=0.999) +adam = Adam(parameters=[w1, w2, bias], lr=0.001, betas=(0.9, 0.999)) adam.zero_grad() loss.backward() adam.step() # Adaptive updates per parameter + +# AdamW with decoupled weight decay +adamw = AdamW(parameters=[w1, w2, bias], lr=0.001, weight_decay=0.01) +adamw.zero_grad() +loss.backward() +adamw.step() # Adam + proper regularization ``` -### Learning Rate Scheduling Systems -```python -# Strategic learning rate adjustment -scheduler = StepLR(optimizer, step_size=10, gamma=0.1) +### SGD with Momentum Implementation -# Training loop with scheduling -for epoch in range(num_epochs): - for batch in dataloader: - optimizer.zero_grad() - loss = criterion(model(batch.inputs), batch.targets) - loss.backward() - optimizer.step() - - scheduler.step() # Adjust learning rate each epoch - print(f"Epoch {epoch}, LR: {scheduler.get_last_lr()}") +```python +class SGD(Optimizer): + """Stochastic Gradient Descent with momentum. + + Momentum physics: velocity accumulates gradients over time, + smoothing noisy updates and accelerating in consistent directions. + """ + def __init__(self, params: List[Tensor], lr: float = 0.01, + momentum: float = 0.0, weight_decay: float = 0.0): + super().__init__(params) + self.lr = lr + self.momentum = momentum + self.weight_decay = weight_decay + # Initialize momentum buffers (created lazily) + self.momentum_buffers = [None for _ in self.params] + + def step(self): + """Update parameters using momentum: v = ฮฒv + โˆ‡L, ฮธ = ฮธ - ฮฑv""" + for i, param in enumerate(self.params): + if param.grad is None: + continue + + grad = param.grad + + # Apply weight decay + if self.weight_decay != 0: + grad = grad + self.weight_decay * param.data + + # Update momentum buffer + if self.momentum != 0: + if self.momentum_buffers[i] is None: + self.momentum_buffers[i] = np.zeros_like(param.data) + + # Update velocity: v_t = ฮฒ*v_{t-1} + grad + self.momentum_buffers[i] = (self.momentum * self.momentum_buffers[i] + + grad) + grad = self.momentum_buffers[i] + + # Update parameter: ฮธ_t = ฮธ_{t-1} - ฮฑ*v_t + param.data = param.data - self.lr * grad + + self.step_count += 1 +``` + +### Adam Optimizer Implementation + +```python +class Adam(Optimizer): + """Adam optimizer with adaptive learning rates. + + Combines momentum (first moment) with RMSprop-style adaptive rates + (second moment) for robust optimization across different scales. + """ + def __init__(self, params: List[Tensor], lr: float = 0.001, + betas: tuple = (0.9, 0.999), eps: float = 1e-8, + weight_decay: float = 0.0): + super().__init__(params) + self.lr = lr + self.beta1, self.beta2 = betas + self.eps = eps + self.weight_decay = weight_decay + + # Initialize moment estimates (3x memory vs SGD) + self.m_buffers = [None for _ in self.params] # First moment + self.v_buffers = [None for _ in self.params] # Second moment + + def step(self): + """Update parameters with adaptive learning rates""" + self.step_count += 1 + + for i, param in enumerate(self.params): + if param.grad is None: + continue + + grad = param.grad + + # Apply weight decay (Adam's approach - has issues) + if self.weight_decay != 0: + grad = grad + self.weight_decay * param.data + + # Initialize buffers if needed + if self.m_buffers[i] is None: + self.m_buffers[i] = np.zeros_like(param.data) + self.v_buffers[i] = np.zeros_like(param.data) + + # Update biased first moment: m_t = ฮฒ1*m_{t-1} + (1-ฮฒ1)*grad + self.m_buffers[i] = (self.beta1 * self.m_buffers[i] + + (1 - self.beta1) * grad) + + # Update biased second moment: v_t = ฮฒ2*v_{t-1} + (1-ฮฒ2)*gradยฒ + self.v_buffers[i] = (self.beta2 * self.v_buffers[i] + + (1 - self.beta2) * (grad ** 2)) + + # Bias correction (critical for early training steps) + bias_correction1 = 1 - self.beta1 ** self.step_count + bias_correction2 = 1 - self.beta2 ** self.step_count + + m_hat = self.m_buffers[i] / bias_correction1 + v_hat = self.v_buffers[i] / bias_correction2 + + # Adaptive parameter update: ฮธ = ฮธ - ฮฑ*m_hat/(โˆšv_hat + ฮต) + param.data = (param.data - self.lr * m_hat + / (np.sqrt(v_hat) + self.eps)) +``` + +### AdamW Implementation (Decoupled Weight Decay) + +```python +class AdamW(Optimizer): + """AdamW optimizer with decoupled weight decay. + + AdamW fixes Adam's weight decay bug by applying regularization + directly to parameters, separate from gradient-based updates. + """ + def __init__(self, params: List[Tensor], lr: float = 0.001, + betas: tuple = (0.9, 0.999), eps: float = 1e-8, + weight_decay: float = 0.01): + super().__init__(params) + self.lr = lr + self.beta1, self.beta2 = betas + self.eps = eps + self.weight_decay = weight_decay + + # Initialize moment buffers (same as Adam) + self.m_buffers = [None for _ in self.params] + self.v_buffers = [None for _ in self.params] + + def step(self): + """Perform AdamW update with decoupled weight decay""" + self.step_count += 1 + + for i, param in enumerate(self.params): + if param.grad is None: + continue + + # Get gradient (NOT modified by weight decay - key difference!) + grad = param.grad + + # Initialize buffers if needed + if self.m_buffers[i] is None: + self.m_buffers[i] = np.zeros_like(param.data) + self.v_buffers[i] = np.zeros_like(param.data) + + # Update moments using pure gradients + self.m_buffers[i] = (self.beta1 * self.m_buffers[i] + + (1 - self.beta1) * grad) + self.v_buffers[i] = (self.beta2 * self.v_buffers[i] + + (1 - self.beta2) * (grad ** 2)) + + # Compute bias correction + bias_correction1 = 1 - self.beta1 ** self.step_count + bias_correction2 = 1 - self.beta2 ** self.step_count + + m_hat = self.m_buffers[i] / bias_correction1 + v_hat = self.v_buffers[i] / bias_correction2 + + # Apply gradient-based update + param.data = (param.data - self.lr * m_hat + / (np.sqrt(v_hat) + self.eps)) + + # Apply decoupled weight decay (after gradient update!) + if self.weight_decay != 0: + param.data = param.data * (1 - self.lr * self.weight_decay) ``` ### Complete Training Integration -```python -# Modern training workflow -model = Sequential([Dense(784, 128), ReLU(), Dense(128, 10)]) -optimizer = Adam(model.parameters(), learning_rate=0.001) -scheduler = StepLR(optimizer, step_size=20, gamma=0.5) -# Training loop with optimization +```python +# Modern training workflow combining all components +from tinytorch.core.tensor import Tensor +from tinytorch.core.optimizers import SGD, Adam, AdamW + +# Model setup (from previous modules) +model = Sequential([ + Linear(784, 128), ReLU(), + Linear(128, 64), ReLU(), + Linear(64, 10) +]) + +# Optimization setup +optimizer = AdamW(model.parameters(), lr=0.001, weight_decay=0.01) +criterion = CrossEntropyLoss() + +# Training loop for epoch in range(num_epochs): + epoch_loss = 0.0 + for batch_inputs, batch_targets in dataloader: # Forward pass predictions = model(batch_inputs) loss = criterion(predictions, batch_targets) - - # Optimization step - optimizer.zero_grad() # Clear gradients - loss.backward() # Compute gradients - optimizer.step() # Update parameters - - scheduler.step() # Adjust learning rate -``` -### Optimization Algorithm Implementations -- **Gradient Descent**: Basic parameter update rule using gradients -- **SGD with Momentum**: Velocity accumulation for smoother convergence -- **Adam Optimizer**: Adaptive learning rates with bias correction -- **Learning Rate Scheduling**: Strategic adjustment during training + # Backward pass and optimization + optimizer.zero_grad() # Clear old gradients + loss.backward() # Compute new gradients + optimizer.step() # Update parameters + + epoch_loss += loss.data + + print(f"Epoch {epoch}: Loss = {epoch_loss:.4f}") +``` ## Getting Started ### Prerequisites + Ensure you understand the mathematical foundations: ```bash @@ -114,17 +298,32 @@ tito test --module tensor tito test --module autograd ``` +**Required Background:** +- **Tensor Operations**: Understanding parameter storage and update mechanics +- **Automatic Differentiation**: Gradients computed via backpropagation +- **Calculus**: Derivatives, gradient descent, chain rule +- **Linear Algebra**: Vector operations, element-wise operations + ### Development Workflow -1. **Open the development file**: `modules/09_optimizers/optimizers_dev.py` -2. **Implement gradient descent**: Start with basic parameter update mechanics -3. **Build SGD with momentum**: Add velocity accumulation for acceleration -4. **Create Adam optimizer**: Implement adaptive learning rates with moment estimation -5. **Add learning rate scheduling**: Build strategic learning rate adjustment systems -6. **Export and verify**: `tito export --module optimizers && tito test --module optimizers` + +1. **Open the development file**: `modules/06_optimizers/optimizers_dev.ipynb` +2. **Implement Optimizer base class**: Start with parameter management and zero_grad interface +3. **Build SGD with momentum**: Add velocity accumulation for smoother convergence +4. **Create Adam optimizer**: Implement adaptive learning rates with moment estimation and bias correction +5. **Add AdamW optimizer**: Build decoupled weight decay for proper regularization +6. **Export and verify**: `tito module complete 06 && tito test --module optimizers` + +**Development Tips:** +- Test each optimizer on simple quadratic functions (f(x) = xยฒ) where you can verify analytical convergence +- Compare convergence speed between SGD and Adam on the same problem +- Visualize loss curves to understand optimization dynamics +- Check momentum/moment buffers are properly initialized and updated +- Compare Adam vs AdamW to see the effect of decoupled weight decay ## Testing ### Comprehensive Test Suite + Run the full test suite to verify optimization algorithm correctness: ```bash @@ -133,122 +332,211 @@ tito test --module optimizers # Direct pytest execution python -m pytest tests/ -k optimizers -v + +# Test specific optimizer +python -m pytest tests/test_optimizers.py::test_adam_convergence -v ``` ### Test Coverage Areas -- โœ… **Algorithm Implementation**: Verify SGD, momentum, and Adam compute correct parameter updates -- โœ… **Mathematical Correctness**: Test against analytical solutions for convex optimization -- โœ… **State Management**: Ensure proper momentum and moment estimation tracking -- โœ… **Learning Rate Scheduling**: Verify step decay and scheduling functionality -- โœ… **Training Integration**: Test optimizers in complete neural network training workflows + +- **Algorithm Implementation**: Verify Optimizer base, SGD, Adam, and AdamW compute mathematically correct parameter updates +- **Mathematical Correctness**: Test against analytical solutions for convex optimization problems (quadratic functions) +- **State Management**: Ensure proper momentum and moment estimation tracking across training steps +- **Memory Efficiency**: Verify buffer initialization and memory usage patterns +- **Training Integration**: Test optimizers in complete neural network training workflows with real data ### Inline Testing & Convergence Analysis + The module includes comprehensive mathematical validation and convergence visualization: + ```python # Example inline test output +๐Ÿ”ฌ Unit Test: Base Optimizer... +โœ… Parameter validation working correctly +โœ… zero_grad clears all gradients properly +โœ… Error handling for non-gradient parameters +๐Ÿ“ˆ Progress: Base Optimizer โœ“ + +# SGD with momentum validation ๐Ÿ”ฌ Unit Test: SGD with momentum... -โœ… Parameter updates follow momentum equations -โœ… Velocity accumulation works correctly -โœ… Convergence achieved on test function +โœ… Parameter updates follow momentum equation v_t = ฮฒv_{t-1} + โˆ‡L +โœ… Velocity accumulation working correctly +โœ… Weight decay applied properly +โœ… Momentum accelerates convergence vs vanilla SGD ๐Ÿ“ˆ Progress: SGD with Momentum โœ“ -# Optimization analysis +# Adam optimizer validation ๐Ÿ”ฌ Unit Test: Adam optimizer... โœ… First moment estimation (m_t) computed correctly -โœ… Second moment estimation (v_t) computed correctly -โœ… Bias correction applied properly -โœ… Adaptive learning rates working +โœ… Second moment estimation (v_t) computed correctly +โœ… Bias correction applied properly (critical for early steps) +โœ… Adaptive learning rates working per parameter +โœ… Convergence faster than SGD on ill-conditioned problem ๐Ÿ“ˆ Progress: Adam Optimizer โœ“ + +# AdamW decoupled weight decay validation +๐Ÿ”ฌ Unit Test: AdamW optimizer... +โœ… Weight decay decoupled from gradient updates +โœ… Results differ from Adam (proving proper implementation) +โœ… Regularization consistent across gradient scales +โœ… With zero weight decay, matches Adam behavior +๐Ÿ“ˆ Progress: AdamW Optimizer โœ“ ``` ### Manual Testing Examples -```python -from optimizers_dev import SGD, Adam, StepLR -from autograd_dev import Variable -# Test SGD on simple quadratic function -x = Variable(10.0, requires_grad=True) -sgd = SGD([x], learning_rate=0.1, momentum=0.9) +```python +from tinytorch.core.optimizers import SGD, Adam, AdamW +from tinytorch.core.tensor import Tensor + +# Test 1: SGD convergence on simple quadratic +print("Test 1: SGD on f(x) = xยฒ") +x = Tensor([10.0], requires_grad=True) +sgd = SGD([x], lr=0.1, momentum=0.9) for step in range(100): sgd.zero_grad() - loss = x**2 # Minimize f(x) = xยฒ + loss = (x ** 2).sum() # Minimize f(x) = xยฒ, minimum at x=0 loss.backward() sgd.step() - if step % 10 == 0: - print(f"Step {step}: x = {x.data:.4f}, loss = {loss.data:.4f}") -# Test Adam convergence -x = Variable([2.0, -3.0], requires_grad=True) -adam = Adam([x], learning_rate=0.01) + if step % 10 == 0: + print(f"Step {step}: x = {x.data[0]:.6f}, loss = {loss.data:.6f}") +# Expected: x should converge to 0 + +# Test 2: Adam on multidimensional optimization +print("\nTest 2: Adam on f(x,y) = xยฒ + yยฒ") +params = Tensor([5.0, -3.0], requires_grad=True) +adam = Adam([params], lr=0.1) for step in range(50): adam.zero_grad() - loss = (x[0]**2 + x[1]**2).sum() # Minimize ||x||ยฒ + loss = (params ** 2).sum() # Minimize ||x||ยฒ loss.backward() adam.step() + if step % 10 == 0: - print(f"Step {step}: x = {x.data}, loss = {loss.data:.6f}") + print(f"Step {step}: params = {params.data}, loss = {loss.data:.6f}") +# Expected: Both parameters converge to 0 + +# Test 3: Compare SGD vs Adam vs AdamW convergence +print("\nTest 3: Optimizer comparison") +x_sgd = Tensor([10.0], requires_grad=True) +x_adam = Tensor([10.0], requires_grad=True) +x_adamw = Tensor([10.0], requires_grad=True) + +sgd = SGD([x_sgd], lr=0.01, momentum=0.9) +adam = Adam([x_adam], lr=0.01) +adamw = AdamW([x_adamw], lr=0.01, weight_decay=0.01) + +for step in range(20): + # SGD update + sgd.zero_grad() + loss_sgd = (x_sgd ** 2).sum() + loss_sgd.backward() + sgd.step() + + # Adam update + adam.zero_grad() + loss_adam = (x_adam ** 2).sum() + loss_adam.backward() + adam.step() + + # AdamW update + adamw.zero_grad() + loss_adamw = (x_adamw ** 2).sum() + loss_adamw.backward() + adamw.step() + + if step % 5 == 0: + print(f"Step {step}: SGD={x_sgd.data[0]:.6f}, Adam={x_adam.data[0]:.6f}, AdamW={x_adamw.data[0]:.6f}") +# Expected: Adam/AdamW converge faster initially ``` ## Systems Thinking Questions ### Real-World Applications -- **Large Language Models**: GPT, BERT training relies on Adam optimization for stable convergence -- **Computer Vision**: ResNet, Vision Transformer training uses SGD with momentum for best final performance -- **Recommendation Systems**: Online learning systems use adaptive optimizers for continuous model updates -- **Reinforcement Learning**: Policy gradient methods depend on careful optimizer choice and learning rate tuning -### Mathematical Foundations -- **Gradient Descent**: ฮธ_{t+1} = ฮธ_t - ฮฑโˆ‡L(ฮธ_t) where ฮฑ is learning rate and โˆ‡L is loss gradient -- **Momentum**: v_{t+1} = ฮฒv_t + โˆ‡L(ฮธ_t), ฮธ_{t+1} = ฮธ_t - ฮฑv_{t+1} for accelerated convergence -- **Adam**: Combines momentum with adaptive learning rates using first and second moment estimates -- **Learning Rate Scheduling**: Strategic decay schedules balance exploration and exploitation +- **Large Language Models**: GPT and BERT training relies on AdamW optimizer for stable convergence across billions of parameters with varying gradient scales and proper regularization +- **Computer Vision**: ResNet and Vision Transformer training typically uses SGD with momentum for best final test accuracy despite slower initial convergence +- **Recommendation Systems**: Online learning systems use adaptive optimizers like Adam for continuous model updates with non-stationary data distributions +- **Reinforcement Learning**: Policy gradient methods depend heavily on careful optimizer choice and learning rate tuning due to high variance gradients -### Optimization Theory -- **Convex Optimization**: Guarantees global minimum for convex loss functions -- **Non-convex Optimization**: Neural networks have complex loss landscapes with local minima -- **Convergence Analysis**: Understanding when and why optimization algorithms reach good solutions -- **Hyperparameter Sensitivity**: Learning rate is often the most critical hyperparameter +### Optimization Theory Foundations + +- **Gradient Descent**: Update rule ฮธ_{t+1} = ฮธ_t - ฮฑโˆ‡L(ฮธ_t) where ฮฑ is learning rate controlling step size in steepest descent direction +- **Momentum**: Velocity accumulation v_{t+1} = ฮฒv_t + โˆ‡L(ฮธ_t), then ฮธ_{t+1} = ฮธ_t - ฮฑv_{t+1} smooths noisy gradients and accelerates convergence +- **Adam**: Combines momentum (first moment m_t) with adaptive learning rates (second moment v_t), includes bias correction for early training steps +- **AdamW**: Decouples weight decay from gradient updates: applies gradient update first, then weight decay, fixing Adam's regularization bug ### Performance Characteristics -- **SGD**: Memory efficient, works well with large batches, good final performance -- **Adam**: Fast initial convergence, works with small batches, requires more memory -- **Learning Rate Schedules**: Often crucial for achieving best performance -- **Algorithm Selection**: Problem-dependent choice based on data, model, and computational constraints -## ๐ŸŽ‰ Ready to Build? +- **SGD Memory**: O(2n) memory for n parameters (params + momentum buffers), most memory-efficient optimizer with momentum +- **Adam Memory**: O(3n) memory due to first and second moment buffers (params + m_buffers + v_buffers), 1.5x SGD cost +- **Convergence Speed**: Adam often converges faster initially due to adaptive rates, especially with sparse gradients or varying scales +- **Final Performance**: SGD with momentum often achieves better test accuracy on computer vision tasks despite slower convergence +- **Learning Rate Sensitivity**: Adam/AdamW are more robust to learning rate choice than vanilla SGD, making them popular for transformer training +- **Computational Cost**: Adam requires ~1.5x more computation per step (moment updates + bias correction + sqrt operations) than SGD -You're about to implement the algorithms that power all of modern AI! From the neural networks that recognize your voice to the language models that write code, they all depend on the optimization algorithms you're building. +### Critical Thinking: Memory vs Convergence Trade-offs -Understanding these algorithms from first principlesโ€”implementing momentum physics and adaptive learning rates yourselfโ€”will give you deep insight into why some training works and some doesn't. Take your time with the mathematics, test thoroughly, and enjoy building the intelligence behind intelligent systems! +**Reflection Question**: Why does Adam use 3x the memory of parameter-only storage (and 1.5x SGD), and when is this trade-off worth it? +**Key Insights:** +- **Memory Cost**: Adam stores parameter data + first moment (momentum) + second moment (variance) for every parameter +- **Adaptive Benefit**: Per-parameter learning rates handle different gradient scales automatically +- **Use Case**: Transformers benefit from Adam (varying embedding vs attention scales), CNNs often prefer SGD (more uniform scales) +- **Production Decision**: Memory-constrained systems (mobile, edge devices) may prefer SGD despite slower convergence +- **Training Time**: Faster convergence can save GPU hours, offsetting memory cost in cloud training scenarios +**Reflection Question**: Why does SGD with momentum often achieve better test accuracy than Adam on vision tasks, despite slower training? +**Key Insights:** +- **Generalization**: SGD explores flatter minima that generalize better to test data +- **Overfitting**: Adam's fast convergence may lead to sharper minima with worse generalization +- **Learning Rate Schedule**: Careful learning rate decay with SGD achieves better final performance +- **Task Dependency**: Effect is strongest on CNNs, less pronounced on transformers +- **Modern Practice**: AdamW with proper weight decay often bridges this gap + +**Reflection Question**: How does AdamW's decoupled weight decay fix Adam's regularization bug? + +**Key Insights:** +- **Adam Bug**: Adds weight decay to gradients, so adaptive learning rates affect regularization strength inconsistently +- **AdamW Fix**: Applies weight decay directly to parameters after gradient update, decoupling optimization from regularization +- **Consistency**: Weight decay effect is now uniform across parameters regardless of gradient magnitudes +- **Production Impact**: AdamW is now preferred over Adam in most modern training pipelines (BERT, GPT-3, etc.) + +## Ready to Build? + +You're about to implement the algorithms that enable all of modern deep learning! Every neural networkโ€”from the image classifiers in your phone to GPT-4โ€”depends on the optimization algorithms you're building in this module. + +Understanding these algorithms from first principles will transform how you think about training. When you implement momentum physics and see how velocity accumulation smooths noisy gradients, when you build Adam's adaptive learning rates and understand why they help with varying parameter scales, when you create AdamW and see how decoupled weight decay fixes Adam's bugโ€”you'll develop deep intuition for why some training configurations work and others fail. + +Take your time with the mathematics. Test your optimizers on simple quadratic functions where you can verify convergence analytically. Compare SGD vs Adam vs AdamW on the same problem to see their different behaviors. Visualize loss curves to understand optimization dynamics. Monitor memory usage to see the trade-offs. This hands-on experience will make you a better practitioner who can debug training failures, tune hyperparameters effectively, and make informed decisions about optimizer choice in production systems. Enjoy building the intelligence behind intelligent systems! Choose your preferred way to engage with this module: ````{grid} 1 2 3 3 ```{grid-item-card} ๐Ÿš€ Launch Binder -:link: https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/10_optimizers/optimizers_dev.ipynb +:link: https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/06_optimizers/optimizers_dev.ipynb :class-header: bg-light Run this module interactively in your browser. No installation required! ``` -```{grid-item-card} โšก Open in Colab -:link: https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/10_optimizers/optimizers_dev.ipynb +```{grid-item-card} โšก Open in Colab +:link: https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/06_optimizers/optimizers_dev.ipynb :class-header: bg-light Use Google Colab for GPU access and cloud compute power. ``` ```{grid-item-card} ๐Ÿ“– View Source -:link: https://github.com/mlsysbook/TinyTorch/blob/main/modules/10_optimizers/optimizers_dev.py +:link: https://github.com/mlsysbook/TinyTorch/blob/main/modules/06_optimizers/optimizers_dev.ipynb :class-header: bg-light -Browse the Python source code and understand the implementation. +Browse the Jupyter notebook and understand the implementation. ``` ```` @@ -262,6 +550,6 @@ Browse the Python source code and understand the implementation. ---
-โ† Previous Module -Next Module โ†’ +โ† Previous Module +Next Module โ†’
diff --git a/modules/07_training/ABOUT.md b/modules/07_training/ABOUT.md index 0ec685eb..f37227b5 100644 --- a/modules/07_training/ABOUT.md +++ b/modules/07_training/ABOUT.md @@ -1,216 +1,379 @@ --- title: "Training" -description: "Neural network training loops, loss functions, and metrics" +description: "Complete training loops with scheduling, gradient clipping, and checkpointing" difficulty: "โญโญโญโญ" -time_estimate: "8-10 hours" -prerequisites: [] -next_steps: [] -learning_objectives: [] +time_estimate: "6-8 hours" +prerequisites: ["tensor", "activations", "layers", "losses", "autograd", "optimizers"] +next_steps: ["dataloader"] +learning_objectives: + - "Implement complete Trainer class orchestrating forward/backward passes, loss computation, and optimization" + - "Build CosineSchedule for adaptive learning rate management during training" + - "Create gradient clipping utilities to prevent exploding gradients and training instability" + - "Design checkpointing system for saving and resuming training state" + - "Understand memory overhead, gradient accumulation, and train/eval mode switching" --- # 07. Training -**๐Ÿ—๏ธ FOUNDATION TIER** | Difficulty: โญโญโญโญ (4/4) | Time: 8-10 hours +**FOUNDATION TIER** | Difficulty: โญโญโญโญ (4/4) | Time: 6-8 hours ## Overview -Build the complete training pipeline that brings all TinyTorch components together. This capstone module orchestrates data loading, model forward passes, loss computation, backpropagation, and optimization into the end-to-end training workflows that power modern AI systems. +Build the complete training infrastructure that orchestrates neural network learning end-to-end. This capstone module of the Foundation tier brings together all previous componentsโ€”tensors, layers, losses, gradients, and optimizersโ€”into production-ready training loops with learning rate scheduling, gradient clipping, and model checkpointing. You'll create the same training patterns that power PyTorch, TensorFlow, and every production ML system. ## Learning Objectives By the end of this module, you will be able to: -- **Design complete training architectures**: Orchestrate all ML components into cohesive training systems -- **Implement essential loss functions**: Build MSE, CrossEntropy, and BinaryCrossEntropy from mathematical foundations -- **Create evaluation frameworks**: Develop metrics systems for classification, regression, and model performance assessment -- **Build production training loops**: Implement robust training workflows with validation, logging, and progress tracking -- **Master training dynamics**: Understand convergence, overfitting, generalization, and optimization in real scenarios +- **Implement complete Trainer class**: Orchestrate forward passes, loss computation, backpropagation, and parameter updates into cohesive training loops with train/eval mode switching +- **Build CosineSchedule for adaptive learning rates**: Create learning rate schedulers that start fast for quick convergence, then slow down for fine-tuning, following cosine annealing curves +- **Create gradient clipping utilities**: Implement global norm gradient clipping to prevent exploding gradients and training instability in deep networks +- **Design checkpointing system**: Build save/load functionality that preserves complete training stateโ€”model parameters, optimizer buffers, scheduler state, and training history +- **Understand training systems architecture**: Master memory overhead (4-6ร— model size), gradient accumulation strategies, checkpoint management, and the difference between training and evaluation modes -## Build โ†’ Use โ†’ Optimize +## Build โ†’ Use โ†’ Reflect -This module follows TinyTorch's **Build โ†’ Use โ†’ Optimize** framework: +This module follows TinyTorch's **Build โ†’ Use โ†’ Reflect** framework: -1. **Build**: Implement loss functions, evaluation metrics, and complete training orchestration systems -2. **Use**: Train end-to-end neural networks on real datasets with full pipeline automation -3. **Optimize**: Analyze training dynamics, debug convergence issues, and optimize training performance for production - -## NEW: Model Checkpointing & Evaluation Tools - -### Complete Training with Checkpointing -This module now includes production features for our north star goal: - -```python -from tinytorch.core.training import Trainer, CrossEntropyLoss, Accuracy -from tinytorch.core.training import evaluate_model, plot_training_history - -# Train with automatic model checkpointing -trainer = Trainer(model, CrossEntropyLoss(), Adam(lr=0.001), [Accuracy()]) -history = trainer.fit( - train_loader, - val_dataloader=test_loader, - epochs=30, - save_best=True, # โœ… NEW: Saves best model automatically - checkpoint_path='best_model.pkl', # โœ… NEW: Checkpoint location - early_stopping_patience=5 # โœ… NEW: Stop if no improvement -) - -# Load best model after training -trainer.load_checkpoint('best_model.pkl') -print(f"โœ… Restored best model from epoch {trainer.current_epoch}") - -# Evaluate with comprehensive metrics -results = evaluate_model(model, test_loader) -print(f"Test Accuracy: {results['accuracy']:.2%}") -print(f"Confusion Matrix:\n{results['confusion_matrix']}") - -# Visualize training progress -plot_training_history(history) # Shows loss and accuracy curves -``` - -### What's New in This Module -- โœ… **`save_checkpoint()`/`load_checkpoint()`**: Save and restore model state during training -- โœ… **`save_best=True`**: Automatically saves model with best validation performance -- โœ… **`early_stopping_patience`**: Stop training when validation loss stops improving -- โœ… **`evaluate_model()`**: Comprehensive model evaluation with confusion matrix -- โœ… **`plot_training_history()`**: Visualize training and validation curves -- โœ… **`compute_confusion_matrix()`**: Analyze classification errors by class +1. **Build**: Implement CosineSchedule for learning rate scheduling, clip_grad_norm for gradient stability, and complete Trainer class with checkpointing +2. **Use**: Train neural networks end-to-end with real optimization dynamics, observe learning rate adaptation, and experiment with gradient accumulation +3. **Reflect**: Analyze training memory overhead (parameters + gradients + optimizer state), understand when to checkpoint, and compare training strategies across different scenarios ## Implementation Guide -### Complete Training Pipeline +### The Training Loop Cycle + +Training orchestrates data, forward pass, loss, gradients, and updates in an iterative cycle: + +```{mermaid} +graph LR + A[Data Batch] --> B[Forward Pass
Model] + B --> C[Loss
Compute] + C --> D[Backward Pass
Autograd] + D --> E[Optimizer Step
Update ฮธ] + E --> F[Next Batch] + F --> A + + style A fill:#e3f2fd + style B fill:#f3e5f5 + style C fill:#fff3e0 + style D fill:#ffe0b2 + style E fill:#fce4ec + style F fill:#f0fdf4 +``` + +**Cycle**: Load batch โ†’ Forward through model โ†’ Compute loss โ†’ Backward gradients โ†’ Update parameters โ†’ Repeat + +### CosineSchedule - Adaptive Learning Rate Management + +Learning rate scheduling is like adjusting driving speed based on road conditionsโ€”start fast on the highway, slow down in neighborhoods for precision. Cosine annealing provides smooth transitions from aggressive learning to fine-tuning: + ```python -# End-to-end training system -from tinytorch.core.training import Trainer -from tinytorch.core.losses import CrossEntropyLoss -from tinytorch.core.metrics import Accuracy +class CosineSchedule: + """ + Cosine annealing learning rate schedule. -# Define complete model architecture -model = Sequential([ - Dense(784, 128), ReLU(), - Dense(128, 64), ReLU(), - Dense(64, 10), Softmax() -]) + Starts at max_lr, decreases following cosine curve to min_lr. + Formula: lr = min_lr + (max_lr - min_lr) * (1 + cos(ฯ€*epoch/T)) / 2 + """ + def __init__(self, max_lr=0.1, min_lr=0.01, total_epochs=100): + self.max_lr = max_lr + self.min_lr = min_lr + self.total_epochs = total_epochs -# Configure training components -optimizer = Adam(model.parameters(), learning_rate=0.001) -loss_fn = CrossEntropyLoss() -metrics = [Accuracy()] + def get_lr(self, epoch: int) -> float: + """Get learning rate for current epoch.""" + if epoch >= self.total_epochs: + return self.min_lr -# Create and configure trainer + # Cosine annealing: smooth decrease from max to min + cosine_factor = (1 + np.cos(np.pi * epoch / self.total_epochs)) / 2 + return self.min_lr + (self.max_lr - self.min_lr) * cosine_factor + +# Usage example +schedule = CosineSchedule(max_lr=0.1, min_lr=0.001, total_epochs=50) +print(schedule.get_lr(0)) # 0.1 - fast learning initially +print(schedule.get_lr(25)) # ~0.05 - gradual slowdown +print(schedule.get_lr(50)) # 0.001 - fine-tuning at end +``` + +### Gradient Clipping - Preventing Training Explosions + +Gradient clipping is a speed governor that prevents dangerously large gradients from destroying training progress. Global norm clipping scales all gradients uniformly while preserving their relative magnitudes: + +```python +def clip_grad_norm(parameters: List, max_norm: float = 1.0) -> float: + """ + Clip gradients by global norm to prevent exploding gradients. + + Computes total_norm = sqrt(sum of all gradient squares). + If total_norm > max_norm, scales all gradients by max_norm/total_norm. + """ + if not parameters: + return 0.0 + + # Compute global norm across all parameters + total_norm = 0.0 + for param in parameters: + if param.grad is not None: + grad_data = param.grad.data if hasattr(param.grad, 'data') else param.grad + total_norm += np.sum(grad_data ** 2) + + total_norm = np.sqrt(total_norm) + + # Clip if necessary - preserves gradient direction + if total_norm > max_norm: + clip_coef = max_norm / total_norm + for param in parameters: + if param.grad is not None: + if hasattr(param.grad, 'data'): + param.grad.data *= clip_coef + else: + param.grad *= clip_coef + + return float(total_norm) + +# Usage example +params = model.parameters() +original_norm = clip_grad_norm(params, max_norm=1.0) +print(f"Gradient norm: {original_norm:.2f} โ†’ clipped to 1.0") +``` + +### Trainer Class - Complete Training Orchestration + +The Trainer class conducts the symphony of trainingโ€”coordinating model, optimizer, loss function, and scheduler into cohesive learning loops with checkpointing and evaluation: + +```python +class Trainer: + """ + Complete training orchestrator for neural networks. + + Handles training loops, evaluation, scheduling, gradient clipping, + checkpointing, and train/eval mode switching. + """ + def __init__(self, model, optimizer, loss_fn, scheduler=None, grad_clip_norm=None): + self.model = model + self.optimizer = optimizer + self.loss_fn = loss_fn + self.scheduler = scheduler + self.grad_clip_norm = grad_clip_norm + + # Training state + self.epoch = 0 + self.step = 0 + self.training_mode = True + + # History tracking + self.history = { + 'train_loss': [], + 'eval_loss': [], + 'learning_rates': [] + } + + def train_epoch(self, dataloader, accumulation_steps=1): + """ + Train for one epoch through the dataset. + + Supports gradient accumulation for effective larger batch sizes. + """ + self.model.training = True + total_loss = 0.0 + num_batches = 0 + accumulated_loss = 0.0 + + for batch_idx, (inputs, targets) in enumerate(dataloader): + # Forward pass + outputs = self.model.forward(inputs) + loss = self.loss_fn.forward(outputs, targets) + + # Scale loss for accumulation + scaled_loss = loss.data / accumulation_steps + accumulated_loss += scaled_loss + + # Backward pass + loss.backward() + + # Update every accumulation_steps batches + if (batch_idx + 1) % accumulation_steps == 0: + # Gradient clipping + if self.grad_clip_norm is not None: + clip_grad_norm(self.model.parameters(), self.grad_clip_norm) + + # Optimizer step + self.optimizer.step() + self.optimizer.zero_grad() + + total_loss += accumulated_loss + accumulated_loss = 0.0 + num_batches += 1 + self.step += 1 + + # Update learning rate + if self.scheduler is not None: + current_lr = self.scheduler.get_lr(self.epoch) + self.optimizer.lr = current_lr + self.history['learning_rates'].append(current_lr) + + avg_loss = total_loss / max(num_batches, 1) + self.history['train_loss'].append(avg_loss) + self.epoch += 1 + + return avg_loss + + def evaluate(self, dataloader): + """ + Evaluate model without updating parameters. + + Sets model.training = False for proper evaluation behavior. + """ + self.model.training = False + total_loss = 0.0 + correct = 0 + total = 0 + + for inputs, targets in dataloader: + # Forward pass only - no gradients + outputs = self.model.forward(inputs) + loss = self.loss_fn.forward(outputs, targets) + total_loss += loss.data + + # Calculate accuracy for classification + if len(outputs.data.shape) > 1: + predictions = np.argmax(outputs.data, axis=1) + if len(targets.data.shape) == 1: + correct += np.sum(predictions == targets.data) + else: + correct += np.sum(predictions == np.argmax(targets.data, axis=1)) + total += len(predictions) + + avg_loss = total_loss / len(dataloader) if len(dataloader) > 0 else 0.0 + accuracy = correct / total if total > 0 else 0.0 + + self.history['eval_loss'].append(avg_loss) + return avg_loss, accuracy + + def save_checkpoint(self, path: str): + """Save complete training state for resumption.""" + checkpoint = { + 'epoch': self.epoch, + 'step': self.step, + 'model_state': {i: p.data.copy() for i, p in enumerate(self.model.parameters())}, + 'optimizer_state': self._get_optimizer_state(), + 'scheduler_state': self._get_scheduler_state(), + 'history': self.history, + 'training_mode': self.training_mode + } + + Path(path).parent.mkdir(parents=True, exist_ok=True) + with open(path, 'wb') as f: + pickle.dump(checkpoint, f) + + def load_checkpoint(self, path: str): + """Load training state from checkpoint.""" + with open(path, 'rb') as f: + checkpoint = pickle.load(f) + + self.epoch = checkpoint['epoch'] + self.step = checkpoint['step'] + self.history = checkpoint['history'] + + # Restore model parameters + for i, param in enumerate(self.model.parameters()): + if i in checkpoint['model_state']: + param.data = checkpoint['model_state'][i].copy() +``` + +### Complete Training Example + +Bringing all components together into production-ready training: + +```python +from tinytorch.core.training import Trainer, CosineSchedule, clip_grad_norm +from tinytorch.core.layers import Linear +from tinytorch.core.losses import MSELoss +from tinytorch.core.optimizers import SGD + +# Build model +class SimpleNN: + def __init__(self): + self.layer1 = Linear(3, 5) + self.layer2 = Linear(5, 2) + self.training = True + + def forward(self, x): + x = self.layer1.forward(x) + x = Tensor(np.maximum(0, x.data)) # ReLU + return self.layer2.forward(x) + + def parameters(self): + return self.layer1.parameters() + self.layer2.parameters() + +# Configure training +model = SimpleNN() +optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9) +loss_fn = MSELoss() +scheduler = CosineSchedule(max_lr=0.1, min_lr=0.001, total_epochs=10) + +# Create trainer with gradient clipping trainer = Trainer( model=model, - optimizer=optimizer, + optimizer=optimizer, loss_fn=loss_fn, - metrics=metrics + scheduler=scheduler, + grad_clip_norm=1.0 # Prevent exploding gradients ) -# Train with comprehensive monitoring -history = trainer.fit( - train_dataloader=train_loader, - val_dataloader=val_loader, - epochs=50, - verbose=True -) -``` +# Train for multiple epochs +for epoch in range(10): + train_loss = trainer.train_epoch(train_data) + eval_loss, accuracy = trainer.evaluate(val_data) -### Loss Function Library -```python -# Regression loss for continuous targets -mse_loss = MeanSquaredError() -regression_loss = mse_loss(predictions, continuous_targets) + print(f"Epoch {epoch}: train_loss={train_loss:.4f}, " + f"eval_loss={eval_loss:.4f}, accuracy={accuracy:.4f}") -# Multi-class classification loss -ce_loss = CrossEntropyLoss() -classification_loss = ce_loss(logits, class_indices) + # Save checkpoint periodically + if epoch % 5 == 0: + trainer.save_checkpoint(f'checkpoint_epoch_{epoch}.pkl') -# Binary classification loss -bce_loss = BinaryCrossEntropyLoss() -binary_loss = bce_loss(sigmoid_outputs, binary_labels) - -# All losses support batch processing and gradient computation -loss.backward() # Automatic differentiation integration -``` - -### Evaluation Metrics System -```python -# Classification performance measurement -accuracy = Accuracy() -acc_score = accuracy(predictions, true_labels) # Returns 0.0 to 1.0 - -# Regression error measurement -mae = MeanAbsoluteError() -error = mae(predictions, targets) - -# Extensible metric framework -class CustomMetric: - def __call__(self, y_pred, y_true): - # Implement custom evaluation logic - return custom_score - -metrics = [Accuracy(), CustomMetric()] -trainer = Trainer(model, optimizer, loss_fn, metrics) -``` - -### Real-World Training Workflows -```python -# Train on CIFAR-10 with full pipeline -from tinytorch.core.dataloader import CIFAR10Dataset, DataLoader - -# Load and prepare data -train_dataset = CIFAR10Dataset("data/cifar10/", train=True, download=True) -train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True) -val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False) - -# Configure CNN for computer vision -cnn_model = Sequential([ - Conv2D(3, 16, kernel_size=3), ReLU(), - MaxPool2D(kernel_size=2), - Conv2D(16, 32, kernel_size=3), ReLU(), - Flatten(), - Dense(32 * 13 * 13, 128), ReLU(), - Dense(128, 10) -]) - -# Train with monitoring and validation -trainer = Trainer(cnn_model, Adam(cnn_model.parameters()), CrossEntropyLoss(), [Accuracy()]) -history = trainer.fit(train_loader, val_loader, epochs=100) - -# Analyze training results -print(f"Final train accuracy: {history['train_accuracy'][-1]:.4f}") -print(f"Final val accuracy: {history['val_accuracy'][-1]:.4f}") +# Restore from checkpoint +trainer.load_checkpoint('checkpoint_epoch_5.pkl') +print(f"Resumed training from epoch {trainer.epoch}") ``` ## Getting Started ### Prerequisites -Ensure you have completed the entire TinyTorch foundation: + +Ensure you have completed all Foundation tier modules: ```bash # Activate TinyTorch environment source bin/activate-tinytorch.sh -# Verify all prerequisite modules (this is the capstone!) -tito test --module tensor -tito test --module activations -tito test --module layers -tito test --module networks -tito test --module dataloader -tito test --module autograd -tito test --module optimizers +# Verify all prerequisites (Training is the Foundation capstone!) +tito test --module tensor # Module 01: Tensor operations +tito test --module activations # Module 02: Activation functions +tito test --module layers # Module 03: Neural network layers +tito test --module losses # Module 04: Loss functions +tito test --module autograd # Module 05: Automatic differentiation +tito test --module optimizers # Module 06: Parameter update algorithms ``` ### Development Workflow -1. **Open the development file**: `modules/10_training/training_dev.py` -2. **Implement loss functions**: Build MSE, CrossEntropy, and BinaryCrossEntropy with proper gradients -3. **Create metrics system**: Develop Accuracy and extensible evaluation framework -4. **Build Trainer class**: Orchestrate training loop with validation and monitoring -5. **Test end-to-end training**: Apply complete pipeline to real datasets and problems -6. **Export and verify**: `tito export --module training && tito test --module training` + +1. **Open the development file**: `modules/07_training/training.py` +2. **Implement CosineSchedule**: Build learning rate scheduler with cosine annealing (smooth max_lr โ†’ min_lr transition) +3. **Create clip_grad_norm**: Implement global norm gradient clipping to prevent exploding gradients +4. **Build Trainer class**: Orchestrate complete training loop with train_epoch(), evaluate(), and checkpointing +5. **Add gradient accumulation**: Support effective larger batch sizes with limited memory +6. **Test end-to-end training**: Validate complete pipeline with real models and data +7. **Export and verify**: `tito module complete 07 && tito test --module training` ## Testing ### Comprehensive Test Suite -Run the full test suite to verify complete training system functionality: + +Run the full test suite to verify complete training infrastructure: ```bash # TinyTorch CLI (recommended) @@ -221,117 +384,142 @@ python -m pytest tests/ -k training -v ``` ### Test Coverage Areas -- โœ… **Loss Function Implementation**: Verify mathematical correctness and gradient computation -- โœ… **Metrics System**: Test accuracy calculation and extensible framework -- โœ… **Training Loop Orchestration**: Ensure proper coordination of all components -- โœ… **End-to-End Training**: Verify complete workflows on real datasets -- โœ… **Convergence Analysis**: Test training dynamics and optimization behavior + +- **CosineSchedule Correctness**: Verify cosine annealing produces correct learning rates at start, middle, and end epochs +- **Gradient Clipping Stability**: Test global norm computation and uniform scaling when gradients exceed threshold +- **Trainer Orchestration**: Ensure proper coordination of forward pass, backward pass, optimization, and scheduling +- **Checkpointing Completeness**: Validate save/load preserves model state, optimizer buffers, scheduler state, and training history +- **Memory Analysis**: Measure training memory overhead (parameters + gradients + optimizer state = 4-6ร— model size) ### Inline Testing & Training Analysis -The module includes comprehensive training validation and convergence monitoring: + +The module includes comprehensive validation of training dynamics: + ```python -# Example inline test output -๐Ÿ”ฌ Unit Test: CrossEntropy loss function... -โœ… Mathematical correctness verified -โœ… Gradient computation working -โœ… Batch processing supported -๐Ÿ“ˆ Progress: Loss Functions โœ“ +# CosineSchedule validation +schedule = CosineSchedule(max_lr=0.1, min_lr=0.01, total_epochs=100) +print(schedule.get_lr(0)) # 0.1 - aggressive learning initially +print(schedule.get_lr(50)) # ~0.055 - gradual slowdown +print(schedule.get_lr(100)) # 0.01 - fine-tuning at end -# Training monitoring -๐Ÿ”ฌ Unit Test: Complete training pipeline... -โœ… Trainer orchestrates all components correctly -โœ… Training loop converges on test problem -โœ… Validation monitoring working -๐Ÿ“ˆ Progress: End-to-End Training โœ“ +# Gradient clipping validation +param.grad = np.array([100.0, 200.0]) # Large gradients +original_norm = clip_grad_norm([param], max_norm=1.0) +# original_norm โ‰ˆ 223.6 โ†’ clipped to 1.0 +assert np.linalg.norm(param.grad.data) โ‰ˆ 1.0 -# Real dataset training -๐Ÿ“Š Training on CIFAR-10 subset... -Epoch 1/10: train_loss=2.345, train_acc=0.234, val_loss=2.123, val_acc=0.278 -Epoch 5/10: train_loss=1.456, train_acc=0.567, val_loss=1.543, val_acc=0.523 -โœ… Model converging successfully +# Trainer integration validation +trainer = Trainer(model, optimizer, loss_fn, scheduler, grad_clip_norm=1.0) +loss = trainer.train_epoch(train_data) +eval_loss, accuracy = trainer.evaluate(test_data) +trainer.save_checkpoint('checkpoint.pkl') ``` ### Manual Testing Examples + ```python -from training_dev import Trainer, CrossEntropyLoss, Accuracy -from networks_dev import Sequential -from layers_dev import Dense -from activations_dev import ReLU, Softmax -from optimizers_dev import Adam +from training import Trainer, CosineSchedule, clip_grad_norm +from layers import Linear +from losses import MSELoss +from optimizers import SGD +from tensor import Tensor -# Test complete training on synthetic data -model = Sequential([Dense(4, 8), ReLU(), Dense(8, 3), Softmax()]) -optimizer = Adam(model.parameters(), learning_rate=0.01) -loss_fn = CrossEntropyLoss() -metrics = [Accuracy()] +# Test complete training pipeline +class SimpleModel: + def __init__(self): + self.layer = Linear(2, 1) + self.training = True -trainer = Trainer(model, optimizer, loss_fn, metrics) + def forward(self, x): + return self.layer.forward(x) + + def parameters(self): + return self.layer.parameters() + +# Create training system +model = SimpleModel() +optimizer = SGD(model.parameters(), lr=0.01) +loss_fn = MSELoss() +scheduler = CosineSchedule(max_lr=0.1, min_lr=0.01, total_epochs=10) + +trainer = Trainer(model, optimizer, loss_fn, scheduler, grad_clip_norm=1.0) # Create simple dataset -from dataloader_dev import SimpleDataset, DataLoader -train_dataset = SimpleDataset(size=1000, num_features=4, num_classes=3) -train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True) +train_data = [ + (Tensor([[1.0, 0.5]]), Tensor([[2.0]])), + (Tensor([[0.5, 1.0]]), Tensor([[1.5]])) +] # Train and monitor -history = trainer.fit(train_loader, epochs=20, verbose=True) -print(f"Training completed. Final accuracy: {history['train_accuracy'][-1]:.4f}") +for epoch in range(10): + loss = trainer.train_epoch(train_data) + lr = scheduler.get_lr(epoch) + print(f"Epoch {epoch}: loss={loss:.4f}, lr={lr:.4f}") + +# Test checkpointing +trainer.save_checkpoint('test_checkpoint.pkl') +trainer.load_checkpoint('test_checkpoint.pkl') +print(f"Restored from epoch {trainer.epoch}") ``` ## Systems Thinking Questions ### Real-World Applications -- **Production ML Systems**: Companies like Netflix, Google use similar training pipelines for recommendation and search systems -- **Research Workflows**: Academic researchers use training frameworks like this for experimental model development -- **MLOps Platforms**: Production training systems extend these patterns with distributed computing and monitoring -- **Edge AI Training**: Federated learning systems use similar orchestration patterns across distributed devices + +- **Production Training Pipelines**: PyTorch Lightning, Hugging Face Transformers, TensorFlow Estimators all use similar Trainer architectures with checkpointing and scheduling +- **Large-Scale Model Training**: GPT, BERT, and vision models rely on gradient clipping and learning rate scheduling for stable convergence across billions of parameters +- **Research Experimentation**: Academic ML uses checkpointing for long experiments with periodic evaluation and model selection +- **Fault-Tolerant Training**: Cloud training systems use checkpoints to resume after infrastructure failures or spot instance interruptions ### Training System Architecture -- **Loss Functions**: Mathematical objectives that define what the model should learn -- **Metrics**: Human-interpretable measures of model performance for monitoring and decision-making -- **Training Loop**: Orchestration pattern that coordinates data loading, forward passes, backward passes, and optimization -- **Validation Strategy**: Techniques for monitoring generalization and preventing overfitting -### Machine Learning Engineering -- **Training Dynamics**: Understanding convergence, overfitting, underfitting, and optimization landscapes -- **Hyperparameter Tuning**: Systematic approaches to learning rate, batch size, and architecture selection -- **Debugging Training**: Common failure modes and diagnostic techniques for training issues -- **Production Considerations**: Scalability, monitoring, reproducibility, and deployment readiness +- **Memory Breakdown**: Training requires parameters (1ร—) + gradients (1ร—) + optimizer state (2-3ร—) = 4-6ร— model memory footprint +- **Gradient Accumulation**: Enables effective batch size of accumulation_steps ร— actual_batch_size with fixed memoryโ€”trades time for memory efficiency +- **Train/Eval Modes**: Different layer behaviors during training (dropout active, batch norm updates) vs evaluation (dropout off, fixed batch norm) +- **Checkpoint Components**: Must save model parameters, optimizer buffers (momentum, Adam m/v), scheduler state, epoch counter, and training history for exact resumption -### Systems Integration Patterns -- **Component Orchestration**: How to coordinate multiple ML components into cohesive systems -- **Error Handling**: Robust handling of training failures, data issues, and convergence problems -- **Monitoring and Logging**: Tracking training progress, performance metrics, and system health -- **Extensibility**: Design patterns that enable easy addition of new losses, metrics, and training strategies +### Training Dynamics -## ๐ŸŽ‰ Ready to Build? +- **Learning Rate Scheduling**: Cosine annealing starts fast (quick convergence when far from optimum) then slows (stable fine-tuning near solution) +- **Exploding Gradients**: Occur in deep networks and RNNs when gradient magnitudes grow exponentially through backpropagationโ€”gradient clipping prevents training collapse +- **Gradient Accumulation Trade-offs**: Reduces memory by processing small batches but increases training time linearly with accumulation steps +- **Checkpointing Strategy**: Balance disk space (1GB+ per checkpoint) vs fault tolerance (more frequent = less lost work) and evaluation frequency (save best model) -You're about to complete the TinyTorch framework by building the training system that brings everything together! This is where all your hard work on tensors, layers, networks, data loading, gradients, and optimization culminates in a complete ML system. +### Performance Characteristics -Training is the heart of machine learningโ€”it's where models learn from data and become intelligent. You're building the same patterns used to train GPT, train computer vision models, and power production AI systems. Take your time, understand how all the pieces fit together, and enjoy creating something truly powerful! +- **Training Memory Scaling**: Adam optimizer uses 4ร— parameter memory (params + grads + m + v) vs SGD with momentum at 3ร— (params + grads + momentum) +- **Checkpoint Overhead**: Pickle serialization adds 10-30% overhead beyond raw parameter dataโ€”use compression for large models +- **Learning Rate Impact**: Too high causes instability/divergence, too low causes slow convergenceโ€”scheduling adapts automatically +- **Global Norm vs Individual Clipping**: Global norm preserves gradient direction while preventing explosionโ€”individual clipping can distort optimization trajectory - +## Ready to Build? +You're about to complete the Foundation tier by building the training infrastructure that brings neural networks to life! This is where all your work on tensors, activations, layers, losses, gradients, and optimizers comes together into a cohesive system that actually learns from data. + +Training is the heart of machine learningโ€”the process that transforms random initialization into intelligent models. You're implementing the same patterns used to train GPT, BERT, ResNet, and every production AI system. Understanding how scheduling, gradient clipping, checkpointing, and mode switching work together gives you mastery over the training process. + +This module is the culmination of everything you've built. Take your time understanding how each piece fits into the bigger picture, and enjoy creating a complete ML training system from scratch! Choose your preferred way to engage with this module: ````{grid} 1 2 3 3 ```{grid-item-card} ๐Ÿš€ Launch Binder -:link: https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/11_training/training_dev.ipynb +:link: https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/07_training/training.ipynb :class-header: bg-light Run this module interactively in your browser. No installation required! ``` -```{grid-item-card} โšก Open in Colab -:link: https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/11_training/training_dev.ipynb +```{grid-item-card} โšก Open in Colab +:link: https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/07_training/training.ipynb :class-header: bg-light Use Google Colab for GPU access and cloud compute power. ``` ```{grid-item-card} ๐Ÿ“– View Source -:link: https://github.com/mlsysbook/TinyTorch/blob/main/modules/11_training/training_dev.py +:link: https://github.com/mlsysbook/TinyTorch/blob/main/modules/07_training/training.py :class-header: bg-light Browse the Python source code and understand the implementation. @@ -348,6 +536,6 @@ Browse the Python source code and understand the implementation. ---
-โ† Previous Module -Next Module โ†’ +โ† Previous Module +Next Module โ†’
diff --git a/modules/08_dataloader/ABOUT.md b/modules/08_dataloader/ABOUT.md index ebb8c900..5691ff19 100644 --- a/modules/08_dataloader/ABOUT.md +++ b/modules/08_dataloader/ABOUT.md @@ -1,332 +1,374 @@ --- title: "DataLoader - Data Pipeline Engineering" -description: "Build production-grade data loading infrastructure for training at scale" -difficulty: 3 -time_estimate: "5-6 hours" +description: "Build production-grade data loading infrastructure for efficient ML training" +difficulty: "โญโญโญ" +time_estimate: "4-5 hours" prerequisites: ["Tensor", "Layers", "Training"] next_steps: ["Spatial (CNNs)"] learning_objectives: - - "Design scalable data pipeline architectures for production ML systems" - - "Implement efficient dataset abstractions with batching and streaming" - - "Build preprocessing pipelines for normalization and data augmentation" - - "Understand memory-efficient data loading patterns for large datasets" - - "Apply systems thinking to I/O optimization and throughput engineering" + - "Design memory-efficient dataset abstractions for scalable training" + - "Implement batching and shuffling for mini-batch gradient descent" + - "Master the Python iterator protocol for streaming data pipelines" + - "Understand PyTorch's DataLoader architecture and design patterns" + - "Analyze trade-offs between batch size, memory usage, and throughput" --- # 08. DataLoader -**๐Ÿ›๏ธ ARCHITECTURE TIER** | Difficulty: โญโญโญ (3/4) | Time: 5-6 hours +**ARCHITECTURE TIER** | Difficulty: โญโญโญ (3/4) | Time: 4-5 hours ## Overview -Build the data engineering infrastructure that feeds neural networks. This module implements production-grade data loading, preprocessing, and batching systemsโ€”the critical backbone that enables training on real-world datasets like CIFAR-10. +This module implements the data loading infrastructure that powers neural network training at scale. You'll build the Dataset/DataLoader abstraction pattern used by PyTorch, TensorFlow, and every major ML frameworkโ€”implementing batching, shuffling, and memory-efficient iteration from first principles. This is where data engineering meets systems thinking. ## Learning Objectives -By completing this module, you will be able to: +By the end of this module, you will be able to: -1. **Design scalable data pipeline architectures** for production ML systems with proper abstractions and interfaces -2. **Implement efficient dataset abstractions** with batching, shuffling, and streaming for memory-efficient training -3. **Build preprocessing pipelines** for normalization, augmentation, and transformation with fit-transform patterns -4. **Understand memory-efficient data loading patterns** for large datasets that don't fit in RAM -5. **Apply systems thinking** to I/O optimization, caching strategies, and throughput engineering +- **Design Dataset Abstractions**: Implement the protocol-based interface (`__getitem__`, `__len__`) that separates data storage from data access +- **Build Efficient DataLoaders**: Create batching and shuffling mechanisms that stream data without loading entire datasets into memory +- **Master Iterator Patterns**: Understand how Python's `for` loops work under the hood and implement custom iterators +- **Optimize Data Pipelines**: Analyze throughput bottlenecks and balance batch size against memory constraints +- **Apply to Real Datasets**: Use your DataLoader with actual image datasets like MNIST and CIFAR-10 in milestone projects -## Why This Matters +## Build โ†’ Use โ†’ Optimize -### Production Context +This module follows TinyTorch's **Build โ†’ Use โ†’ Optimize** framework: -Every production ML system depends on robust data infrastructure: - -- **Netflix** uses sophisticated data pipelines to train recommendation models on billions of viewing records -- **Tesla** processes terabytes of driving sensor data through efficient loading pipelines for autonomous driving -- **OpenAI** built custom data loaders to train GPT models on hundreds of billions of tokens -- **Meta** developed PyTorch's DataLoader (which you're reimplementing) to power research and production - -### Historical Context - -Data loading evolved from bottleneck to optimized system: - -- **Early ML (pre-2010)**: Small datasets fit entirely in memory; data loading was an afterthought -- **ImageNet Era (2012)**: AlexNet required efficient loading of 1.2M images; preprocessing became critical -- **Big Data ML (2015+)**: Streaming data pipelines became necessary for datasets too large for memory -- **Modern Scale (2020+)**: Data loading is now a first-class systems problem with dedicated infrastructure teams - -The patterns you're building are the same ones used in production at scale. - -## Pedagogical Pattern: Build โ†’ Use โ†’ Analyze - -### 1. Build - -Implement from first principles: -- Dataset abstraction with Python protocols (`__getitem__`, `__len__`) -- DataLoader with batching, shuffling, and iteration -- CIFAR-10 dataset loader with binary file parsing -- Normalizer with fit-transform pattern -- Memory-efficient streaming for large datasets - -### 2. Use - -Apply to real problems: -- Load and preprocess CIFAR-10 (50,000 training images) -- Create train/test data loaders with proper batching -- Build preprocessing pipelines for normalization -- Integrate with training loops from Module 07 -- Measure throughput and identify bottlenecks - -### 3. Analyze - -Deep-dive into systems behavior: -- Profile memory usage patterns with different batch sizes -- Measure I/O throughput and identify disk bottlenecks -- Compare streaming vs in-memory loading strategies -- Analyze the impact of shuffling on training dynamics -- Understand trade-offs between batch size and memory +1. **Build**: Implement Dataset abstraction, TensorDataset for in-memory data, and DataLoader with batching/shuffling +2. **Use**: Load synthetic datasets, create train/validation splits, and integrate with training loops +3. **Optimize**: Profile throughput, analyze memory scaling, and measure shuffle overhead ## Implementation Guide -### Core Components +### Dataset Abstraction + +The foundation of all data loadingโ€”a protocol-based interface for accessing samples: -**Dataset Abstraction** ```python -class Dataset: - """Abstract base class for all datasets. - - Implements Python protocols for indexing and length. - Subclasses must implement __getitem__ and __len__. +from abc import ABC, abstractmethod + +class Dataset(ABC): """ - def __getitem__(self, index: int): - """Return (data, label) for given index.""" - raise NotImplementedError - + Abstract base class defining the dataset interface. + + All datasets must implement: + - __len__(): Return total number of samples + - __getitem__(idx): Return sample at given index + + This enables Pythonic usage: + len(dataset) # How many samples? + dataset[42] # Get sample 42 + for x in dataset # Iterate over all samples + """ + + @abstractmethod def __len__(self) -> int: - """Return total number of samples.""" - raise NotImplementedError + """Return total number of samples in dataset.""" + pass + + @abstractmethod + def __getitem__(self, idx: int): + """Return sample at given index.""" + pass ``` -**DataLoader Implementation** +**Why This Design:** +- **Protocol-based**: Uses Python's `__len__` and `__getitem__` for natural syntax +- **Framework-agnostic**: Same pattern used by PyTorch, TensorFlow, JAX +- **Separation of concerns**: Decouples *what data exists* from *how to load it* +- **Enables optimization**: Makes caching, prefetching, and parallel loading possible + +### TensorDataset Implementation + +When your data fits in memory, TensorDataset provides efficient access: + +```python +class TensorDataset(Dataset): + """ + Dataset for in-memory tensors. + + Wraps multiple tensors with aligned first dimension: + features: (N, feature_dim) + labels: (N,) + + Returns tuple of tensors for each sample: + dataset[i] โ†’ (features[i], labels[i]) + """ + + def __init__(self, *tensors): + """Store tensors, validate first dimension alignment.""" + assert len(tensors) > 0 + first_size = len(tensors[0].data) + for tensor in tensors: + assert len(tensor.data) == first_size + self.tensors = tensors + + def __len__(self) -> int: + return len(self.tensors[0].data) + + def __getitem__(self, idx: int): + return tuple(Tensor(t.data[idx]) for t in self.tensors) +``` + +**Key Features:** +- **Memory locality**: All data pre-loaded for fast access +- **Vectorized operations**: No conversion overhead during training +- **Flexible**: Handles any number of aligned tensors (features, labels, metadata) + +### DataLoader with Batching and Shuffling + +The core engine that transforms samples into training-ready batches: + ```python class DataLoader: - """Efficient batch loading with shuffling support. - - Features: - - Automatic batching with configurable batch size - - Optional shuffling for training randomization - - Drop last batch handling for even batch sizes - - Memory-efficient iteration without loading all data """ - def __init__(self, dataset, batch_size=32, shuffle=False, drop_last=False): + Efficient batch loader with shuffling support. + + Transforms: + Individual samples โ†’ Batched tensors + + Features: + - Automatic batching with configurable batch_size + - Optional shuffling for training randomization + - Memory-efficient iteration (one batch at a time) + - Handles uneven final batch automatically + """ + + def __init__(self, dataset: Dataset, batch_size: int, shuffle: bool = False): self.dataset = dataset self.batch_size = batch_size self.shuffle = shuffle - self.drop_last = drop_last - + + def __len__(self) -> int: + """Return number of batches per epoch.""" + return (len(self.dataset) + self.batch_size - 1) // self.batch_size + def __iter__(self): - # Generate indices (shuffled or sequential) + """ + Yield batches of data. + + Algorithm: + 1. Generate indices [0, 1, ..., N-1] + 2. Shuffle indices if requested + 3. Group into chunks of batch_size + 4. Load samples and collate into batch tensors + 5. Yield each batch + """ indices = list(range(len(self.dataset))) + if self.shuffle: - np.random.shuffle(indices) - - # Yield batches + random.shuffle(indices) + for i in range(0, len(indices), self.batch_size): batch_indices = indices[i:i + self.batch_size] - if len(batch_indices) < self.batch_size and self.drop_last: - continue - yield self._get_batch(batch_indices) + batch = [self.dataset[idx] for idx in batch_indices] + yield self._collate_batch(batch) + + def _collate_batch(self, batch): + """Stack individual samples into batch tensors.""" + num_tensors = len(batch[0]) + batched_tensors = [] + + for tensor_idx in range(num_tensors): + tensor_list = [sample[tensor_idx].data for sample in batch] + batched_data = np.stack(tensor_list, axis=0) + batched_tensors.append(Tensor(batched_data)) + + return tuple(batched_tensors) ``` -**CIFAR-10 Dataset Loader** -```python -class CIFAR10Dataset(Dataset): - """Load CIFAR-10 dataset with automatic download. - - CIFAR-10: 60,000 32x32 color images in 10 classes - - 50,000 training images - - 10,000 test images - - Classes: airplane, car, bird, cat, deer, dog, frog, horse, ship, truck - """ - def __init__(self, root='./data', train=True, download=True): - self.train = train - if download: - self._download(root) - self.data, self.labels = self._load_batch_files(root, train) - - def __getitem__(self, index): - return self.data[index], self.labels[index] - - def __len__(self): - return len(self.data) +**The Batching Transformation:** + +``` +Individual Samples (from Dataset): + dataset[0] โ†’ (features: [1, 2, 3], label: 0) + dataset[1] โ†’ (features: [4, 5, 6], label: 1) + dataset[2] โ†’ (features: [7, 8, 9], label: 0) + +DataLoader Batching (batch_size=2): + Batch 1: + features: [[1, 2, 3], โ† Shape: (2, 3) + [4, 5, 6]] + labels: [0, 1] โ† Shape: (2,) + + Batch 2: + features: [[7, 8, 9]] โ† Shape: (1, 3) [last batch] + labels: [0] โ† Shape: (1,) ``` -**Preprocessing Pipeline** -```python -class Normalizer: - """Normalize data using fit-transform pattern. - - Fits statistics on training data, applies to all splits. - Ensures consistent preprocessing across train/val/test. - """ - def fit(self, data): - """Compute mean and std from training data.""" - self.mean = data.mean(axis=0) - self.std = data.std(axis=0) - return self - - def transform(self, data): - """Apply normalization using fitted statistics.""" - return (data - self.mean) / (self.std + 1e-8) - - def fit_transform(self, data): - """Fit and transform in one step.""" - return self.fit(data).transform(data) +## Getting Started + +### Prerequisites + +Ensure you understand the foundations: + +```bash +# Activate TinyTorch environment +source bin/activate-tinytorch.sh + +# Verify prerequisite modules +tito test --module tensor +tito test --module layers +tito test --module training ``` -### Step-by-Step Implementation +**Required Knowledge:** +- Tensor operations and NumPy arrays (Module 01) +- Neural network basics (Modules 03-04) +- Training loop structure (Module 07) +- Python protocols (`__getitem__`, `__len__`, `__iter__`) -1. **Create Dataset Base Class** - - Implement `__getitem__` and `__len__` protocols - - Define the interface all datasets must follow - - Test with simple array-based dataset +### Development Workflow -2. **Build CIFAR-10 Loader** - - Implement download and extraction logic - - Parse binary batch files (pickle format) - - Reshape data from flat arrays to (3, 32, 32) images - - Handle train/test split loading - -3. **Implement DataLoader** - - Create batching logic with configurable batch size - - Add shuffling with random permutation - - Implement iterator protocol for Pythonic loops - - Handle edge cases (last incomplete batch, empty dataset) - -4. **Add Preprocessing** - - Build Normalizer with fit-transform pattern - - Compute per-channel statistics for RGB images - - Apply transformations efficiently across batches - - Test normalization correctness (zero mean, unit variance) - -5. **Integration Testing** - - Load CIFAR-10 and create data loaders - - Iterate through batches and verify shapes - - Test with actual training loop from Module 07 - - Measure data loading throughput +1. **Open the development file**: `modules/08_dataloader/dataloader.py` +2. **Implement Dataset abstraction**: Define abstract base class with `__len__` and `__getitem__` +3. **Build TensorDataset**: Create concrete implementation for tensor-based data +4. **Create DataLoader**: Implement batching, shuffling, and iterator protocol +5. **Test integration**: Verify with training workflow simulation +6. **Export and verify**: `tito module complete 08 && tito test --module dataloader` ## Testing -### Inline Tests (During Development) +### Comprehensive Test Suite + +Run the full test suite to verify DataLoader functionality: -Run inline tests while building: ```bash -cd modules/08_dataloader -python dataloader_dev.py +# TinyTorch CLI (recommended) +tito test --module dataloader + +# Direct pytest execution +python -m pytest tests/ -k dataloader -v ``` -Expected output: -``` -Unit Test: Dataset abstraction... -โœ… __getitem__ protocol works correctly -โœ… __len__ returns correct size -โœ… Indexing returns (data, label) tuples -Progress: Dataset Interface โœ“ +### Test Coverage Areas -Unit Test: CIFAR-10 loading... -โœ… Downloaded and extracted 170MB dataset -โœ… Loaded 50,000 training samples -โœ… Sample shape: (3, 32, 32), label range: [0, 9] -Progress: CIFAR-10 Dataset โœ“ +- โœ… **Dataset Interface**: Abstract base class enforcement, protocol implementation +- โœ… **TensorDataset**: Tensor alignment validation, indexing correctness +- โœ… **DataLoader Batching**: Batch shape consistency, handling uneven final batch +- โœ… **Shuffling**: Randomization correctness, deterministic seeding +- โœ… **Training Integration**: Complete workflow with train/validation splits -Unit Test: DataLoader batching... -โœ… Batch shapes correct: (32, 3, 32, 32) -โœ… Shuffling produces different orderings -โœ… Iteration covers all samples exactly once -Progress: DataLoader โœ“ +### Inline Testing & Validation + +The module includes comprehensive unit tests: + +```python +# Run inline tests during development +python modules/08_dataloader/dataloader.py + +# Expected output: +๐Ÿ”ฌ Unit Test: Dataset Abstract Base Class... +โœ… Dataset is properly abstract +โœ… Dataset interface works correctly! + +๐Ÿ”ฌ Unit Test: TensorDataset... +โœ… TensorDataset works correctly! + +๐Ÿ”ฌ Unit Test: DataLoader... +โœ… DataLoader works correctly! + +๐Ÿ”ฌ Unit Test: DataLoader Deterministic Shuffling... +โœ… Deterministic shuffling works correctly! + +๐Ÿ”ฌ Integration Test: Training Workflow... +โœ… Training integration works correctly! ``` -### Export and Validate +### Manual Testing Examples -After completing the module: -```bash -# Export to tinytorch package -tito export 08_dataloader +```python +from tinytorch.core.tensor import Tensor +from tinytorch.data.loader import TensorDataset, DataLoader -# Run integration tests -tito test 08_dataloader -``` +# Create synthetic dataset +features = Tensor([[1, 2], [3, 4], [5, 6], [7, 8]]) +labels = Tensor([0, 1, 0, 1]) +dataset = TensorDataset(features, labels) -### Comprehensive Test Coverage +# Create DataLoader with batching +loader = DataLoader(dataset, batch_size=2, shuffle=True) -The test suite validates: -- Dataset interface correctness -- CIFAR-10 loading and parsing -- Batch shape consistency -- Shuffling randomness -- Memory efficiency -- Preprocessing accuracy - -## Where This Code Lives - -``` -tinytorch/ -โ”œโ”€โ”€ core/ -โ”‚ โ””โ”€โ”€ dataloader.py # Your implementation goes here -โ””โ”€โ”€ __init__.py # Exposes DataLoader, Dataset, etc. - -Usage in other modules: ->>> from tinytorch.core.dataloader import DataLoader, CIFAR10Dataset ->>> dataset = CIFAR10Dataset(download=True) ->>> loader = DataLoader(dataset, batch_size=32, shuffle=True) +# Iterate through batches +for batch_features, batch_labels in loader: + print(f"Batch features shape: {batch_features.shape}") + print(f"Batch labels shape: {batch_labels.shape}") + # Output: (2, 2) and (2,) ``` ## Systems Thinking Questions -1. **Memory vs Throughput Trade-off**: Why does increasing batch size improve GPU utilization but increase memory usage? What's the optimal batch size for a 16GB GPU? +### Real-World Applications -2. **Shuffling Impact**: How does shuffling affect training dynamics and convergence? Why is it critical for training but not for evaluation? +- **Image Classification**: How would you design a DataLoader for ImageNet (1.2M images, 150GB)? What if the dataset doesn't fit in RAM? +- **Language Modeling**: LLM training streams billions of tokensโ€”how does batch size affect memory and throughput for variable-length sequences? +- **Autonomous Vehicles**: Tesla trains on terabytes of sensor dataโ€”how would you handle multi-modal data (camera + LIDAR + GPS) in a DataLoader? +- **Medical Imaging**: 3D CT scans are too large for GPU memoryโ€”what batching strategy would you use for patch extraction? -3. **I/O Bottlenecks**: Your GPU can process 1000 images/sec but your disk reads at 100 images/sec. Where's the bottleneck? How would you fix it? +### Performance Characteristics -4. **Preprocessing Placement**: Should preprocessing happen in the data loader or in the training loop? What are the trade-offs for CPU vs GPU preprocessing? +- **Memory Scaling**: Why does doubling batch size double memory usage? What memory components scale with batch size (activations, gradients, optimizer states)? +- **Throughput Bottleneck**: Your GPU can process 1000 images/sec but disk reads at 100 images/secโ€”where's the bottleneck? How would you diagnose this? +- **Shuffle Overhead**: Does shuffling slow down training? Measure the overhead and explain when it becomes significant. +- **Batch Size Trade-off**: What's the optimal batch size for training ResNet-50 on a 16GB GPU? How would you find it systematically? -5. **Distributed Loading**: If you're training on 8 GPUs, how should you partition the dataset? What challenges arise with shuffling across multiple workers? +### Data Pipeline Theory -## Real-World Connections +- **Iterator Protocol**: How does Python's `for` loop work under the hood? What methods must an object implement to be iterable? +- **Memory Efficiency**: Why can DataLoader handle datasets larger than RAM? What design pattern enables this? +- **Collation Strategy**: Why do we stack individual samples into batch tensors? What happens if we don't? +- **Shuffling Impact**: How does shuffling affect gradient estimates and convergence? What happens if you forget to shuffle training data? -### Industry Applications +## Ready to Build? -**Netflix (Recommendation Systems)** -- Processes billions of viewing records through custom data pipelines -- Uses streaming loaders for datasets that don't fit in memory -- Implements sophisticated batching strategies for negative sampling +You're about to implement the data loading infrastructure that powers modern AI systems. Understanding how to build efficient, scalable data pipelines is critical for production ML engineeringโ€”this isn't just plumbing, it's a first-class systems problem with dedicated engineering teams at major AI labs. -**Autonomous Vehicles (Tesla, Waymo)** -- Load terabytes of sensor data (camera, LIDAR, radar) for training -- Use multi-worker data loading to keep GPUs fully utilized -- Implement real-time preprocessing pipelines for online learning +Every production training system depends on robust data loaders. Your implementation will follow the exact patterns used by PyTorch's `torch.utils.data.DataLoader` and TensorFlow's `tf.data.Dataset`โ€”the same code running at Meta, Tesla, OpenAI, and every major ML organization. -**Large Language Models (OpenAI, Anthropic)** -- Stream hundreds of billions of tokens from distributed storage -- Use custom data loaders optimized for sequence data -- Implement efficient tokenization and batching for transformers +Open `/Users/VJ/GitHub/TinyTorch/modules/08_dataloader/dataloader.py` and start building. Take your time with each component, run the inline tests frequently, and think deeply about the memory and throughput trade-offs you're making. -### Research Impact +Choose your preferred way to engage with this module: -This module teaches patterns from: -- PyTorch DataLoader (2016): The industry-standard data loading API -- TensorFlow Dataset API (2017): Google's approach to data pipelines -- NVIDIA DALI (2019): GPU-accelerated preprocessing for peak throughput -- WebDataset (2020): Efficient loading from cloud storage +````{grid} 1 2 3 3 -## What's Next? +```{grid-item-card} ๐Ÿš€ Launch Binder +:link: https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/08_dataloader/dataloader_dev.ipynb +:class-header: bg-light -In **Module 09: Spatial (CNNs)**, you'll use these data loaders to train convolutional neural networks on CIFAR-10: +Run this module interactively in your browser. No installation required! +``` -- Apply convolution operations to the RGB images you're loading -- Use your DataLoader to iterate through 50,000 training samples -- Achieve >75% accuracy on CIFAR-10 classification -- Understand how CNNs process spatial data efficiently +```{grid-item-card} โšก Open in Colab +:link: https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/08_dataloader/dataloader_dev.ipynb +:class-header: bg-light -The data infrastructure you built here becomes criticalโ€”training CNNs requires efficient batch loading of image data with proper preprocessing. +Use Google Colab for GPU access and cloud compute power. +``` + +```{grid-item-card} ๐Ÿ“– View Source +:link: https://github.com/mlsysbook/TinyTorch/blob/main/modules/08_dataloader/dataloader.py +:class-header: bg-light + +Browse the Python source code and understand the implementation. +``` + +```` + +```{admonition} ๐Ÿ’พ Save Your Progress +:class: tip +**Binder sessions are temporary!** Download your completed notebook when done, or switch to local development for persistent work. + +**After completing this module**, you'll apply your DataLoader to real datasets in the milestone projects: +- **Milestone 03**: Train MLP on MNIST handwritten digits (28ร—28 images) +- **Milestone 04**: Train CNN on CIFAR-10 natural images (32ร—32ร—3 images) + +These milestones include download utilities and preprocessing for production datasets. +``` --- -**Ready to build production data infrastructure?** Open `modules/08_dataloader/dataloader_dev.py` and start implementing. +
+โ† Previous Module: Training +Next Module: Spatial (CNNs) โ†’ +
diff --git a/modules/09_spatial/ABOUT.md b/modules/09_spatial/ABOUT.md index 2d5a0157..9d8a354e 100644 --- a/modules/09_spatial/ABOUT.md +++ b/modules/09_spatial/ABOUT.md @@ -1,360 +1,490 @@ --- -title: "Convolutional Networks" -description: "Build CNNs from scratch for computer vision and spatial pattern recognition" -difficulty: 3 +title: "Spatial Operations" +description: "Build CNNs from scratch - implement Conv2d, pooling, and spatial processing for computer vision" +difficulty: "โญโญโญ" time_estimate: "6-8 hours" prerequisites: ["Tensor", "Activations", "Layers", "DataLoader"] next_steps: ["Tokenization"] learning_objectives: - - "Implement convolution as sliding window operations with weight sharing" - - "Design CNN architectures with feature extraction and classification components" - - "Understand translation invariance and hierarchical feature learning" - - "Build pooling operations for spatial downsampling and invariance" - - "Apply computer vision principles to image classification tasks" + - "Master memory and computation trade-offs in sliding window convolution operations" + - "Implement Conv2d layers with weight sharing and understand parameter efficiency vs dense layers" + - "Design hierarchical feature extraction through stacked convolutional architectures" + - "Connect spatial operations to PyTorch's torch.nn.Conv2d and understand production CNN implementations" + - "Analyze receptive field growth, translation invariance, and spatial dimension management" --- -# 09. Convolutional Networks +# 09. Spatial Operations -**๐Ÿ›๏ธ ARCHITECTURE TIER** | Difficulty: โญโญโญ (3/4) | Time: 6-8 hours +**ARCHITECTURE TIER** | Difficulty: โญโญโญ (3/4) | Time: 6-8 hours ## Overview -Implement convolutional neural networks (CNNs) from scratch. This module teaches you how convolution transforms computer vision from hand-crafted features to learned hierarchical representations that power everything from image classification to autonomous driving. +Implement convolutional neural networks (CNNs) from scratch, building the spatial operations that transformed computer vision from hand-crafted features to learned hierarchical representations. You'll discover why weight sharing revolutionizes computer vision by reducing parameters from millions to thousands while achieving superior spatial reasoning that powers everything from image classification to autonomous driving. This module teaches you how Conv2d achieves massive parameter reduction through weight sharing while enabling the spatial structure understanding critical for modern vision systems. ## Learning Objectives -By completing this module, you will be able to: +By the end of this module, you will be able to: -1. **Implement convolution** as sliding window operations with explicit loops, understanding weight sharing and local connectivity -2. **Design CNN architectures** by composing convolutional, pooling, and dense layers for image classification -3. **Understand translation invariance** and why CNNs are superior to dense networks for spatial data -4. **Build pooling operations** (MaxPool, AvgPool) for spatial downsampling and feature invariance -5. **Apply computer vision principles** to achieve >75% accuracy on CIFAR-10 image classification +- **Implement Conv2d Forward Pass**: Build sliding window convolution with explicit loops showing O(Bร—C_outร—Hร—Wร—Kยฒร—C_in) complexity, understanding how weight sharing applies the same learned filter across all spatial positions to detect features like edges and textures +- **Master Weight Sharing Mechanics**: Understand how Conv2d(3โ†’32, kernel=3) uses only 896 parameters while a dense layer for the same 32ร—32 input needs 32,000 parametersโ€”achieving 35ร— parameter reduction while preserving spatial structure +- **Design Hierarchical Feature Extractors**: Compose Conv โ†’ ReLU โ†’ Pool blocks into CNN architectures, learning how depth enables complex feature hierarchies from simple local operations (edges โ†’ textures โ†’ objects) +- **Build Pooling Operations**: Implement MaxPool2d and AvgPool2d for spatial downsampling, understanding the trade-off between spatial resolution and computational efficiency (4ร— memory reduction per 2ร—2 pooling layer) +- **Analyze Receptive Field Growth**: Master how stacked 3ร—3 convolutions build global context from local operationsโ€”two Conv2d layers see 5ร—5 regions, three layers see 7ร—7, enabling deep networks to detect large-scale patterns -## Why This Matters +## Build โ†’ Use โ†’ Reflect -### Production Context +This module follows TinyTorch's **Build โ†’ Use โ†’ Reflect** framework: -CNNs are the backbone of modern computer vision systems: - -- **Meta's Vision AI** uses CNN architectures to tag 2 billion photos daily across Facebook and Instagram -- **Tesla Autopilot** processes camera feeds through CNN backbones for object detection and lane recognition -- **Google Photos** built a CNN-based system that automatically organizes billions of images -- **Medical Imaging** systems use CNNs to detect cancer in X-rays and MRIs with superhuman accuracy - -### Historical Context - -The convolution revolution transformed computer vision: - -- **LeNet (1998)**: Yann LeCun's CNN read zip codes on mail; convolution proved viable but limited by compute -- **AlexNet (2012)**: Won ImageNet with 16% error rate (vs 26% previous); GPUs + convolution = computer vision revolution -- **ResNet (2015)**: 152-layer CNN achieved 3.6% error (better than human 5%); proved depth matters -- **Modern Era (2020+)**: CNNs power production vision systems processing trillions of images daily - -The patterns you're implementing revolutionized how machines see. - -## Pedagogical Pattern: Build โ†’ Use โ†’ Analyze - -### 1. Build - -Implement from first principles: -- Convolution as explicit sliding window operation -- Conv2D layer with learnable filters and weight sharing -- MaxPool2D and AvgPool2D for spatial downsampling -- Flatten layer to connect spatial and dense layers -- Complete CNN architecture with feature extraction and classification - -### 2. Use - -Apply to real problems: -- Build CNN for CIFAR-10 image classification -- Extract and visualize learned feature maps -- Compare CNN vs MLP performance on spatial data -- Achieve >75% accuracy with proper architecture -- Understand impact of kernel size, stride, and padding - -### 3. Analyze - -Deep-dive into architectural choices: -- Why does weight sharing reduce parameters dramatically? -- How do early vs late layers learn different features? -- What's the trade-off between depth and width in CNNs? -- Why are pooling operations crucial for translation invariance? -- How does spatial structure preservation improve learning? +1. **Build**: Implement Conv2d with explicit sliding window loops to expose computational complexity, create MaxPool2d and AvgPool2d for spatial downsampling, and build Flatten operations connecting spatial and dense layers for complete CNN architectures +2. **Use**: Train CNNs on CIFAR-10 (60K 32ร—32 color images) to achieve >75% accuracy, visualize learned feature maps showing edges in early layers and complex patterns in deep layers, and compare CNN vs MLP parameter efficiency on spatial data +3. **Reflect**: Analyze why weight sharing reduces parameters by 35-1000ร— while improving spatial reasoning, how stacked 3ร—3 convolutions build global context from local receptive fields, and what memory-computation trade-offs exist between large kernels vs deep stacking ## Implementation Guide -### Core Components +### Convolutional Pipeline Flow + +Convolution transforms spatial data through learnable filters, pooling, and hierarchical feature extraction: + +```{mermaid} +graph LR + A[Input Image
Hร—Wร—C] --> B[Conv2d
kร—k filters] + B --> C[Feature Maps
H'ร—W'ร—F] + C --> D[Activation
ReLU] + D --> E[Pool 2ร—2
Downsample] + E --> F[Output
H'/2ร—W'/2ร—F] + + style A fill:#e3f2fd + style B fill:#fff3e0 + style C fill:#f3e5f5 + style D fill:#ffe0b2 + style E fill:#fce4ec + style F fill:#f0fdf4 +``` + +**Flow**: Image โ†’ Convolution (weight sharing) โ†’ Feature maps โ†’ Nonlinearity โ†’ Pooling โ†’ Downsampled features + +### Conv2d Layer - The Heart of Computer Vision -**Conv2D Layer - The Heart of Computer Vision** ```python -class Conv2D: - """2D Convolutional layer with learnable filters. - - Implements sliding window convolution: - - Applies same filter across all spatial positions (weight sharing) - - Each filter learns to detect different features (edges, textures, objects) - - Output is feature map showing where filter activates strongly - +class Conv2d: + """ + 2D Convolutional layer with learnable filters and weight sharing. + + Implements sliding window convolution where the same learned filter + applies across all spatial positions, achieving massive parameter + reduction compared to dense layers while preserving spatial structure. + + Key Concepts: + - Weight sharing: Same filter at all spatial positions + - Local connectivity: Each output depends on local input region + - Learnable filters: Each filter learns to detect different features + - Translation invariance: Detected features independent of position + Args: in_channels: Number of input channels (3 for RGB, 16 for feature maps) out_channels: Number of learned filters (feature detectors) - kernel_size: Size of sliding window (typically 3 or 5) + kernel_size: Spatial size of sliding window (typically 3 or 5) stride: Step size when sliding (1 = no downsampling) padding: Border padding to preserve spatial dimensions + + Shape: + Input: (batch, in_channels, height, width) + Output: (batch, out_channels, out_height, out_width) + Where: out_height = (height + 2*padding - kernel_size) // stride + 1 """ def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=0): - # Initialize learnable filters + # Initialize learnable filters: one per output channel + # Shape: (out_channels, in_channels, kernel_size, kernel_size) self.weight = Tensor(shape=(out_channels, in_channels, kernel_size, kernel_size)) - self.bias = Tensor(shape=(out_channels,)) - + + # He initialization for ReLU networks + fan_in = in_channels * kernel_size * kernel_size + std = np.sqrt(2.0 / fan_in) + self.weight.data = np.random.normal(0, std, self.weight.shape) + def forward(self, x): - # x shape: (batch, in_channels, height, width) + """Apply sliding window convolution with explicit loops to show cost.""" batch, _, H, W = x.shape - kh, kw = self.kernel_size, self.kernel_size - - # Calculate output dimensions - out_h = (H + 2 * self.padding - kh) // self.stride + 1 - out_w = (W + 2 * self.padding - kw) // self.stride + 1 - - # Sliding window convolution + out_h = (H + 2*self.padding - self.kernel_size) // self.stride + 1 + out_w = (W + 2*self.padding - self.kernel_size) // self.stride + 1 + + # Apply padding if needed + if self.padding > 0: + x = pad(x, self.padding) + output = Tensor(shape=(batch, self.out_channels, out_h, out_w)) + + # Explicit 7-nested loop showing O(Bร—C_outร—Hร—Wร—K_hร—K_wร—C_in) complexity for b in range(batch): for oc in range(self.out_channels): for i in range(out_h): for j in range(out_w): - # Extract local patch + # Extract local patch from input i_start = i * self.stride j_start = j * self.stride - patch = x[b, :, i_start:i_start+kh, j_start:j_start+kw] - - # Convolution: element-wise multiply and sum - output[b, oc, i, j] = (patch * self.weight[oc]).sum() + self.bias[oc] - + patch = x[b, :, i_start:i_start+self.kernel_size, + j_start:j_start+self.kernel_size] + + # Convolution: dot product between filter and patch + output.data[b, oc, i, j] = (patch.data * self.weight.data[oc]).sum() + return output ``` -**Pooling Layers - Spatial Downsampling** +**Why Explicit Loops Matter**: Modern frameworks optimize convolution with im2col transformations and cuDNN kernels, achieving 10-100ร— speedups. But the explicit loops reveal where computational cost livesโ€”helping you understand why kernel size matters enormously and why production systems carefully balance depth vs width. + +### MaxPool2d - Spatial Downsampling and Translation Invariance + ```python -class MaxPool2D: - """Max pooling for spatial downsampling and translation invariance. - - Takes maximum value in each local region: - - Reduces spatial dimensions while preserving important features - - Provides invariance to small translations - - Reduces computation in later layers +class MaxPool2d: + """ + Max pooling for spatial downsampling and translation invariance. + + Extracts maximum value from each local region, providing: + - Spatial dimension reduction (4ร— memory reduction per 2ร—2 pooling) + - Translation invariance (robustness to small shifts) + - Feature importance selection (keep strongest activations) + + Args: + kernel_size: Size of pooling window (typically 2) + stride: Step size when sliding (defaults to kernel_size) + + Shape: + Input: (batch, channels, height, width) + Output: (batch, channels, out_height, out_width) + Where: out_height = (height - kernel_size) // stride + 1 """ def __init__(self, kernel_size=2, stride=None): self.kernel_size = kernel_size - self.stride = stride or kernel_size - + self.stride = stride if stride is not None else kernel_size + def forward(self, x): + """Extract maximum value from each local region.""" batch, channels, H, W = x.shape - kh, kw = self.kernel_size, self.kernel_size - - out_h = (H - kh) // self.stride + 1 - out_w = (W - kw) // self.stride + 1 - + out_h = (H - self.kernel_size) // self.stride + 1 + out_w = (W - self.kernel_size) // self.stride + 1 + output = Tensor(shape=(batch, channels, out_h, out_w)) + for b in range(batch): for c in range(channels): for i in range(out_h): for j in range(out_w): i_start = i * self.stride j_start = j * self.stride - patch = x[b, c, i_start:i_start+kh, j_start:j_start+kw] - output[b, c, i, j] = patch.max() - + patch = x.data[b, c, i_start:i_start+self.kernel_size, + j_start:j_start+self.kernel_size] + output.data[b, c, i, j] = patch.max() + return output ``` -**Complete CNN Architecture** +**MaxPool vs AvgPool**: MaxPool preserves sharp features like edges (takes max activation), while AvgPool creates smoother features (averages the window). Production systems typically use MaxPool for feature extraction and Global Average Pooling for final classification layers. + +### SimpleCNN - Complete Architecture + ```python class SimpleCNN: - """CNN for CIFAR-10 classification. - - Architecture: - Conv(3โ†’32, 3x3) โ†’ ReLU โ†’ MaxPool(2x2) # 32x32 โ†’ 16x16 - Conv(32โ†’64, 3x3) โ†’ ReLU โ†’ MaxPool(2x2) # 16x16 โ†’ 8x8 - Flatten โ†’ Dense(64*8*8 โ†’ 128) โ†’ ReLU - Dense(128 โ†’ 10) โ†’ Softmax + """ + Complete CNN for CIFAR-10 image classification. + + Architecture: Conv โ†’ ReLU โ†’ Pool โ†’ Conv โ†’ ReLU โ†’ Pool โ†’ Flatten โ†’ Dense + + Layer-by-layer transformation: + Input: (B, 3, 32, 32) RGB images + Conv1: (B, 32, 32, 32) - 32 filters detect edges/textures + Pool1: (B, 32, 16, 16) - downsample by 2ร— + Conv2: (B, 64, 16, 16) - 64 filters detect shapes/patterns + Pool2: (B, 64, 8, 8) - downsample by 2ร— + Flatten: (B, 4096) - convert spatial to vector + Dense: (B, 10) - classify into 10 categories + + Parameters: ~500K (vs ~4M for equivalent dense network) """ def __init__(self): - self.conv1 = Conv2D(3, 32, kernel_size=3, padding=1) - self.relu1 = ReLU() - self.pool1 = MaxPool2D(kernel_size=2) - - self.conv2 = Conv2D(32, 64, kernel_size=3, padding=1) - self.relu2 = ReLU() - self.pool2 = MaxPool2D(kernel_size=2) - + # Feature extraction backbone + self.conv1 = Conv2d(3, 32, kernel_size=3, padding=1) + self.pool1 = MaxPool2d(kernel_size=2) + self.conv2 = Conv2d(32, 64, kernel_size=3, padding=1) + self.pool2 = MaxPool2d(kernel_size=2) + + # Classification head self.flatten = Flatten() - self.fc1 = Linear(64 * 8 * 8, 128) - self.relu3 = ReLU() - self.fc2 = Linear(128, 10) - + self.fc = Linear(64 * 8 * 8, 10) + def forward(self, x): - # Feature extraction - x = self.pool1(self.relu1(self.conv1(x))) # (B, 32, 16, 16) - x = self.pool2(self.relu2(self.conv2(x))) # (B, 64, 8, 8) - + # Hierarchical feature extraction + x = self.pool1(relu(self.conv1(x))) # (B, 32, 16, 16) + x = self.pool2(relu(self.conv2(x))) # (B, 64, 8, 8) + # Classification - x = self.flatten(x) # (B, 4096) - x = self.relu3(self.fc1(x)) # (B, 128) - x = self.fc2(x) # (B, 10) + x = self.flatten(x) # (B, 4096) + x = self.fc(x) # (B, 10) return x ``` -### Step-by-Step Implementation +**Architecture Design Principles**: This follows the standard CNN patternโ€”alternating Conv+ReLU (feature extraction) with Pooling (dimension reduction). Each Conv layer learns hierarchical features (Layer 1: edges โ†’ Layer 2: shapes), while pooling provides computational efficiency and translation invariance. -1. **Implement Conv2D Forward Pass** - - Create sliding window iteration over spatial dimensions - - Apply weight sharing: same filter at all positions - - Handle batch processing efficiently - - Verify output shape calculation +## Getting Started -2. **Build Pooling Operations** - - Implement MaxPool2D with maximum extraction - - Add AvgPool2D for average pooling - - Handle stride and kernel size correctly - - Test spatial dimension reduction +### Prerequisites -3. **Create Flatten Layer** - - Convert (B, C, H, W) to (B, C*H*W) - - Prepare spatial features for dense layers - - Preserve batch dimension - - Enable gradient flow backward +Ensure you understand the foundations from previous modules: -4. **Design Complete CNN** - - Stack Conv โ†’ ReLU โ†’ Pool blocks for feature extraction - - Add Flatten โ†’ Dense for classification - - Calculate dimensions at each layer - - Test end-to-end forward pass +```bash +# Activate TinyTorch environment +source bin/activate-tinytorch.sh -5. **Train on CIFAR-10** - - Load CIFAR-10 using Module 08's DataLoader - - Train with cross-entropy loss and SGD - - Track accuracy on test set - - Achieve >75% accuracy +# Verify prerequisite modules are complete +tito test --module tensor # Module 01: Tensor operations +tito test --module activations # Module 02: ReLU activation +tito test --module layers # Module 03: Linear layers +tito test --module dataloader # Module 08: Batch loading +``` + +**Why These Prerequisites**: +- **Tensor**: Conv2d requires tensor indexing, reshaping, and broadcasting for sliding windows +- **Activations**: CNNs use ReLU after each convolution for non-linear feature learning +- **Layers**: Dense classification layers connect to CNN feature extraction +- **DataLoader**: CIFAR-10 training requires batch loading and data augmentation + +### Development Workflow + +1. **Open the development file**: `modules/09_spatial/spatial_dev.py` +2. **Implement Conv2d forward pass**: Build sliding window convolution with explicit loops showing computational complexity +3. **Create MaxPool2d and AvgPool2d**: Implement spatial downsampling with different aggregation strategies +4. **Build Flatten operation**: Connect spatial feature maps to dense layers +5. **Design SimpleCNN architecture**: Compose spatial and dense layers into complete CNN +6. **Export and verify**: `tito module complete 09 && tito test --module spatial` + +**Development Tips**: +- Start with small inputs (8ร—8 images) to debug convolution logic before scaling to 32ร—32 +- Print intermediate shapes at each layer to verify dimension calculations +- Visualize feature maps after Conv layers to understand learned filters +- Compare parameter counts: Conv2d(3โ†’32, k=3) = 896 params vs Dense(3072โ†’32) = 98,304 params ## Testing -### Inline Tests (During Development) +### Comprehensive Test Suite + +Run the full test suite to verify spatial operation functionality: -Run inline tests while building: ```bash -cd modules/09_spatial -python spatial_dev.py +# TinyTorch CLI (recommended) +tito test --module spatial + +# Direct pytest execution +python -m pytest tests/ -k spatial -v ``` -Expected output: -``` -Unit Test: Conv2D implementation... +### Test Coverage Areas + +- โœ… **Conv2d Shape Propagation**: Verifies output dimensions match formula (H+2P-K)//S+1 for various kernel sizes, strides, and padding +- โœ… **Weight Sharing Validation**: Confirms same filter applies at all spatial positions, achieving parameter reduction vs dense layers +- โœ… **Pooling Correctness**: Tests MaxPool extracts maximum values and AvgPool computes correct averages across windows +- โœ… **Translation Invariance**: Verifies CNNs detect features regardless of spatial position through weight sharing +- โœ… **Complete CNN Pipeline**: End-to-end test processing CIFAR-10 images through Conv โ†’ Pool โ†’ Flatten โ†’ Dense architecture + +### Inline Testing & Validation + +The module includes comprehensive inline tests during development: + +```python +# Run inline unit tests +cd /Users/VJ/GitHub/TinyTorch/modules/09_spatial +python spatial_dev.py + +# Expected output: +๐Ÿ”ฌ Unit Test: Conv2d... โœ… Sliding window convolution works correctly โœ… Weight sharing applied at all positions -โœ… Output shapes match expected dimensions -Progress: Conv2D โœ“ +โœ… Output shape matches calculated dimensions +โœ… Parameter count: 896 (vs 32,000 for dense layer) +๐Ÿ“ˆ Progress: Conv2d forward pass implemented -Unit Test: MaxPool2D implementation... -โœ… Maximum extraction works correctly -โœ… Spatial dimensions reduced properly -โœ… Translation invariance verified -Progress: Pooling โœ“ +๐Ÿ”ฌ Unit Test: Pooling Operations... +โœ… MaxPool2d extracts maximum values correctly +โœ… AvgPool2d computes averages correctly +โœ… Spatial dimensions reduced by factor of kernel_size +โœ… Translation invariance property verified +๐Ÿ“ˆ Progress: Pooling layers implemented -Unit Test: Complete CNN architecture... +๐Ÿ”ฌ Unit Test: SimpleCNN Integration... โœ… Forward pass through all layers successful -โœ… Output shape: (32, 10) for 10 classes -โœ… Parameter count reasonable: ~500K parameters -Progress: CNN Architecture โœ“ +โœ… Output shape: (32, 10) for 10 CIFAR-10 classes +โœ… Total parameters: ~500K (efficient!) +๐Ÿ“ˆ Progress: CNN architecture complete ``` -### Export and Validate +### Manual Testing Examples -After completing the module: -```bash -# Export to tinytorch package -tito export 09_spatial +Test individual components interactively: -# Run integration tests -tito test 09_spatial -``` +```python +from spatial_dev import Conv2d, MaxPool2d, SimpleCNN +import numpy as np -### CIFAR-10 Training Test +# Test Conv2d with small input +conv = Conv2d(3, 16, kernel_size=3, padding=1) +x = Tensor(np.random.randn(2, 3, 8, 8)) +out = conv(x) +print(f"Conv2d output shape: {out.shape}") # (2, 16, 8, 8) -```bash -# Train simple CNN on CIFAR-10 -python tests/integration/test_cnn_cifar10.py +# Test MaxPool dimension reduction +pool = MaxPool2d(kernel_size=2) +pooled = pool(out) +print(f"MaxPool output shape: {pooled.shape}") # (2, 16, 4, 4) -Expected results: -- Epoch 1: 35% accuracy -- Epoch 5: 60% accuracy -- Epoch 10: 75% accuracy -``` +# Test complete CNN +cnn = SimpleCNN(num_classes=10) +img = Tensor(np.random.randn(4, 3, 32, 32)) +logits = cnn(img) +print(f"CNN output shape: {logits.shape}") # (4, 10) -## Where This Code Lives - -``` -tinytorch/ -โ”œโ”€โ”€ nn/ -โ”‚ โ””โ”€โ”€ spatial.py # Conv2D, MaxPool2D, etc. -โ””โ”€โ”€ __init__.py # Exposes CNN components - -Usage in other modules: ->>> from tinytorch.nn import Conv2D, MaxPool2D ->>> conv = Conv2D(3, 32, kernel_size=3) ->>> pool = MaxPool2D(kernel_size=2) +# Count parameters +params = cnn.parameters() +total = sum(np.prod(p.shape) for p in params) +print(f"Total parameters: {total:,}") # ~500,000 ``` ## Systems Thinking Questions -1. **Parameter Efficiency**: A Conv2D(3, 32, 3) has ~900 parameters. How many parameters would a Dense layer need to connect a 32x32 image to 32 outputs? Why is this difference critical for scaling? +### Real-World Applications -2. **Translation Invariance**: Why does a CNN detect a cat regardless of whether it's in the top-left or bottom-right of an image? How does weight sharing enable this property? +**Autonomous Driving - Tesla Autopilot** -3. **Hierarchical Features**: Early CNN layers detect edges and textures. Later layers detect objects and faces. How does this emerge from stacking convolutions? Why doesn't this happen in dense networks? +**Challenge**: Tesla's Autopilot processes 8 cameras at 36 FPS with 1280ร—960 resolution, running CNN backbones to extract features for object detection, lane recognition, and depth estimation. The entire inference must complete in <30ms for real-time control. -4. **Receptive Field Growth**: A single Conv2D(kernel=3) sees a 3x3 region. After two Conv2D layers, what region does each output see? How do deep CNNs build global context from local operations? +**Solution**: Efficient CNN architectures (MobileNet-style depthwise separable convolutions) and aggressive optimization (TensorRT compilation, INT8 quantization) balance accuracy vs latency on embedded hardware (Tesla FSD computer: 144 TOPS). -5. **Compute vs Memory Trade-offs**: Large kernel sizes (7x7) have more parameters but fewer operations. Small kernels (3x3) stacked deeply have opposite trade-offs. Which is better and why? +**Your Implementation Connection**: Understanding Conv2d's computational cost (Kยฒร—C_inร—C_outร—Hร—W operations) reveals why Tesla optimizes kernel sizes and channel counts carefullyโ€”every operation matters at 36 FPS ร— 8 cameras = 288 frames/second total processing. -## Real-World Connections +**Medical Imaging - Diagnostic Assistance** -### Industry Applications +**Challenge**: CNN systems analyze X-rays, CT scans, and pathology slides for diagnostic assistance. PathAI's breast cancer detection achieves 97% sensitivity (vs 92% for individual pathologists) by training deep CNNs on millions of annotated slides. Medical deployment requires interpretabilityโ€”doctors need to understand why the CNN made a prediction. -**Autonomous Vehicles (Tesla, Waymo)** -- Multi-camera CNN systems process 30 FPS at 1920x1200 resolution -- Feature maps from CNNs feed into object detection and segmentation -- Real-time requirements demand efficient Conv2D implementations +**Solution**: Visualizing intermediate feature maps and using attention mechanisms to highlight diagnostic regions. Grad-CAM (Gradient-weighted Class Activation Mapping) shows which spatial regions contributed most to the prediction. -**Medical Imaging (PathAI, Zebra Medical)** -- CNNs analyze X-rays and CT scans for diagnostic assistance -- Achieve superhuman performance on specific tasks (diabetic retinopathy detection) -- Architecture design critical for accuracy-interpretability trade-off +**Your Implementation Connection**: Your Conv2d's feature maps can be visualized showing which spatial regions activate strongly for different filters. This interpretability is crucial for medical deployment where "black box" predictions are insufficient for clinical decisions. -**Face Recognition (Apple Face ID, Facebook DeepFace)** -- CNN embeddings enable accurate face matching at billion-user scale -- Lightweight CNN architectures run on mobile devices in real-time -- Privacy concerns drive on-device processing +**Face Recognition - Apple Face ID** -### Research Impact +**Challenge**: Apple's Face ID uses CNNs to generate face embeddings enabling secure device unlock with <1 in 1,000,000 false accept rate. The entire pipeline (detection + alignment + embedding + matching) runs on-device in real-time. Privacy requires on-device processing, demanding lightweight CNN architectures. -This module implements patterns from: -- LeNet-5 (1998): First successful CNN for digit recognition -- AlexNet (2012): Sparked deep learning revolution with CNNs + GPUs -- VGG (2014): Showed deeper is better with simple 3x3 convolutions -- ResNet (2015): Enabled training 152-layer CNNs with skip connections +**Solution**: MobileNet-style CNNs with depthwise separable convolutions reduce parameters by 8-10ร— while maintaining accuracy. The entire model fits in <10MB, enabling on-device execution protecting user privacy. -## What's Next? +**Your Implementation Connection**: Understanding Conv2d's parameter count (C_outร—C_inร—Kยฒ) reveals why face recognition systems carefully design CNN architecturesโ€”fewer parameters enable on-device deployment without sacrificing accuracy. -In **Module 10: Tokenization**, you'll shift from processing images to processing text: +**Historical Impact - AlexNet to ResNet** -- Learn how to convert text into numerical representations -- Implement tokenization strategies (character, word, subword) -- Build vocabulary management systems -- Prepare text data for transformers in Module 13 +**LeNet-5 (1998)**: Yann LeCun's CNN successfully read handwritten zip codes for the US Postal Service, establishing the Conv โ†’ Pool โ†’ Conv โ†’ Pool โ†’ Dense pattern your SimpleCNN follows. Training took days on CPUs, limiting practical deployment. -This completes the vision half of the Architecture Tier. Next, you'll tackle language! +**AlexNet (2012)**: Won ImageNet with 16% error (vs 26% for hand-crafted features), sparking the deep learning revolution. Key innovation: training deep CNNs on GPUs with massive datasets proved that scale + convolution = breakthrough performance. + +**VGG (2014)**: Demonstrated that deeper CNNs with simple 3ร—3 kernels outperform shallow networks with large kernels. Established that stacking many small convolutions beats few large onesโ€”the computational trade-off analysis below. + +**ResNet (2015)**: 152-layer CNN achieved 3.6% ImageNet error (better than human 5% baseline) via skip connections solving vanishing gradients. Your Conv2d is the foundationโ€”ResNet is "just" your layers with residual connections enabling extreme depth. + +### Foundations + +**Weight Sharing and Parameter Efficiency** + +**Question**: A Conv2d(3, 32, kernel_size=3) layer has 32 filters ร— (3 channels ร— 3ร—3 spatial) = 896 parameters. For a 32ร—32 RGB image, a dense layer producing 32 feature maps of the same resolution needs (3ร—32ร—32) ร— (32ร—32ร—32) = 3,072 ร— 32,768 = ~100 million parameters. Why does convolution reduce parameters by 100,000ร—? How does weight sharing enable this dramatic reduction? What spatial assumption does convolution make that dense layers don'tโ€”and when might this assumption break? + +**Key Insights**: +- **Weight Sharing**: Conv2d applies the same 3ร—3ร—3 filter at all 32ร—32 = 1,024 positions, sharing 896 parameters across 1,024 locations. Dense layers learn independent weights for each position. +- **Local Connectivity**: Each conv output depends only on a local 3ร—3 neighborhood, not the entire image. This inductive bias reduces parameters but assumes nearby pixels are more related than distant ones. +- **When It Breaks**: For tasks where spatial relationships don't follow local patterns (e.g., finding relationships between distant objects), convolution's local connectivity limits expressiveness. This motivates attention mechanisms in Vision Transformers. + +**Translation Invariance Through Weight Sharing** + +**Question**: A CNN detects a cat regardless of whether it appears in the top-left or bottom-right corner of an image. A dense network trained on top-left cats fails on bottom-right cats. How does weight sharing enable translation invariance? Why does applying the same filter at all spatial positions make detected features position-independent? What's the trade-off: what spatial information does convolution lose by treating all positions equally? + +**Key Insights**: +- **Same Filter Everywhere**: Weight sharing means the "cat ear detector" filter slides across the entire image, detecting ears wherever they appear. Dense layers have position-specific weights that don't generalize spatially. +- **Pooling Enhances Invariance**: MaxPool further increases invarianceโ€”if the cat moves 1 pixel, the max in each 2ร—2 window often stays the same, making predictions robust to small shifts. +- **Trade-off**: Convolution loses absolute position information. For tasks requiring precise localization (e.g., object detection), networks must add position embeddings or specialized heads to recover spatial coordinates. + +**Hierarchical Feature Learning** + +**Question**: Early CNN layers (Conv1) learn to detect edges and simple textures. Deep layers (Conv5) detect complex objects like faces and cars. This feature hierarchy emerges automatically from stacking convolutionsโ€”it's not explicitly programmed. How do stacked convolutions build hierarchical representations from local operations? Why don't deep dense networks show this hierarchical organization? What role does the receptive field (the input region affecting each output) play in hierarchical learning? + +**Key Insights**: +- **Receptive Field Growth**: A single 3ร—3 conv sees 9 pixels. Two stacked 3ร—3 convs see 5ร—5 (25 pixels). Three see 7ร—7 (49 pixels). Deeper layers see larger input regions, enabling detection of larger patterns. +- **Compositional Learning**: Early layers learn simple features (edges). Middle layers combine edges into textures and corners. Deep layers combine textures into object parts (eyes, wheels), then complete objects. +- **Why Dense Doesn't**: Dense layers lack spatial structureโ€”each neuron connects to all inputs equally. Without spatial inductive bias (local connectivity + weight sharing), dense networks don't naturally learn hierarchical spatial features. + +### Characteristics + +**Receptive Field Growth and Global Context** + +**Question**: A single Conv2d(kernel_size=3) sees a 3ร—3 region. Two stacked Conv2d layers see a 5ร—5 region (center of second layer sees 3ร—3 of first layer, which each see 3ร—3 of input). Three layers see 7ร—7. How many Conv2d(kernel_size=3) layers are needed to see an entire 32ร—32 image? How do deep CNNs build global context from local operations? What's the trade-off: why not use one large Conv2d(kernel_size=32) instead of stacking many small kernels? + +**Key Insights**: +- **Receptive Field Formula**: For N layers with kernel size K, receptive field = 1 + Nร—(K-1). For K=3: RF = 1+2N. To cover 32ร—32 requires RF โ‰ฅ 32, so N โ‰ฅ 15.5 โ†’ need 16 Conv2d(3ร—3) layers. +- **Stacking Benefits**: Three Conv2d(3ร—3) layers have 3ร—(Cยฒร—9) = 27Cยฒ parameters and 3 ReLU nonlinearities. One Conv2d(7ร—7) has Cยฒร—49 parameters and 1 ReLU. Stacking provides parameter efficiency and more non-linear transformations for the same receptive field. +- **Trade-off**: Deeper stacking increases computational cost (more layers to process) and training difficulty (vanishing gradients). But gains from parameter efficiency and expressiveness typically outweigh costsโ€”hence VGG's success with stacked 3ร—3 convs vs AlexNet's large kernels. + +**Computational Cost and Optimization Strategies** + +**Question**: A Conv2d(64โ†’64, kernel_size=7) has 64ร—64ร—7ร—7 = 200K parameters and processes (64ร—7ร—7) = 3,136 operations per output pixel. Three stacked Conv2d(64โ†’64, kernel_size=3) have 3ร—(64ร—64ร—3ร—3) = 110K parameters but perform 3ร—(64ร—3ร—3) = 1,728 operations per output pixel at each of 3 layers. Which is better for parameter efficiency? For computational cost? For feature learning? Why did the field shift from AlexNet's 11ร—11 kernels to VGG/ResNet's 3ร—3 stacks? + +**Key Insights**: +- **Parameter Efficiency**: Stacked 3ร—3 (110K params) beats single 7ร—7 (200K params) by 1.8ร—. +- **Computational Cost**: Stacked approach performs 3ร—1,728 = 5,184 ops per output pixel vs 3,136 for single 7ร—7. Stacking costs 1.65ร— more computation. +- **Feature Learning**: Stacking provides 3 ReLU nonlinearities vs 1, enabling more complex feature transformations. The expressiveness gain from depth outweighs the 1.65ร— compute cost. +- **Modern Practice**: VGG established that stacked 3ร—3 convs outperform large kernels. ResNet, EfficientNet, and modern architectures all use 3ร—3 (or 1ร—1 for channel mixing) due to better parameter-computation-expressiveness trade-off. + +## Ready to Build? + +You're about to implement the spatial operations that revolutionized how machines see. Before deep learning, computer vision relied on hand-crafted features like SIFT and HOGโ€”human experts manually designed algorithms to detect edges, corners, and textures. AlexNet's 2012 ImageNet victory proved that learned convolutional features outperform hand-crafted ones, launching the deep learning revolution. Today, CNNs process billions of images daily across Meta's photo tagging (2B photos/day), Tesla's Autopilot (real-time multi-camera processing), and Google Photos (trillion+ image search). + +The Conv2d operations you'll implement aren't just educational exercisesโ€”they're the same patterns powering production vision systems. Your sliding window convolution reveals why kernel size matters enormously (7ร—7 kernels cost 5.4ร— more than 3ร—3) and why weight sharing enables CNNs to learn from spatial data 100ร— more efficiently than dense networks. The explicit loops expose computational costs that modern frameworks hide with im2col transformations and cuDNN kernelsโ€”understanding the naive implementation reveals where optimizations matter most. + +By building CNNs from first principles, you'll understand not just how convolution works, but why it worksโ€”why weight sharing provides translation invariance, how stacked small kernels build global context from local operations, and what memory-computation trade-offs govern architecture design. These insights prepare you to design efficient CNN architectures for resource-constrained deployment (mobile, edge devices) and to debug performance bottlenecks in production systems. + +Choose your preferred way to engage with this module: + +````{grid} 1 2 3 3 + +```{grid-item-card} ๐Ÿš€ Launch Binder +:link: https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/09_spatial/spatial_dev.ipynb +:class-header: bg-light + +Run this module interactively in your browser. No installation required! +``` + +```{grid-item-card} โšก Open in Colab +:link: https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/09_spatial/spatial_dev.ipynb +:class-header: bg-light + +Use Google Colab for GPU access and cloud compute power. +``` + +```{grid-item-card} ๐Ÿ“– View Source +:link: https://github.com/mlsysbook/TinyTorch/blob/main/modules/09_spatial/spatial_dev.ipynb +:class-header: bg-light + +Browse the Jupyter notebook source and understand the implementation. +``` + +```` + +```{admonition} ๐Ÿ’พ Save Your Progress +:class: tip +**Binder sessions are temporary!** Download your completed notebook when done, or switch to local development for persistent work. +``` + +**Local Development**: +```bash +cd /Users/VJ/GitHub/TinyTorch/modules/09_spatial +python spatial_dev.py # Run inline tests +tito module complete 09 # Export to package +``` --- -**Ready to build CNNs from scratch?** Open `modules/09_spatial/spatial_dev.py` and start implementing. +
+โ† Module 08: DataLoader +Module 10: Tokenization โ†’ +
diff --git a/modules/10_tokenization/ABOUT.md b/modules/10_tokenization/ABOUT.md index 38074119..764c5417 100644 --- a/modules/10_tokenization/ABOUT.md +++ b/modules/10_tokenization/ABOUT.md @@ -1,402 +1,864 @@ --- title: "Tokenization - Text to Numerical Sequences" -description: "Build tokenizers to convert raw text into sequences for language models" +description: "Build character-level and BPE tokenizers that convert text into token sequences for language models" difficulty: 2 time_estimate: "4-5 hours" prerequisites: ["Tensor"] next_steps: ["Embeddings"] learning_objectives: - - "Implement character-level and subword tokenization strategies" - - "Design efficient vocabulary management systems for language models" - - "Understand trade-offs between vocabulary size and sequence length" - - "Build BPE tokenizer for optimal subword unit representation" - - "Apply text processing optimization for production NLP pipelines" + - "Implement character-level tokenization with vocabulary management and special token handling" + - "Build BPE (Byte Pair Encoding) tokenizer that learns optimal subword units from corpus statistics" + - "Understand vocabulary size vs sequence length trade-offs affecting model parameters and computation" + - "Design efficient text processing pipelines with encoding, decoding, and serialization" + - "Analyze tokenization throughput and compression ratios for production NLP systems" --- -# 10. Tokenization +# 10. Tokenization - Text to Numerical Sequences -**๐Ÿ›๏ธ ARCHITECTURE TIER** | Difficulty: โญโญ (2/4) | Time: 4-5 hours +**ARCHITECTURE TIER** | Difficulty: โญโญ (2/4) | Time: 4-5 hours ## Overview -Build tokenization systems that convert raw text into numerical sequences for language models. This module implements character-level and subword tokenizers (BPE) that balance vocabulary size, sequence length, and computational efficiency. +Build tokenization systems that convert raw text into numerical sequences for language models. This module implements character-level and Byte Pair Encoding (BPE) tokenizers that balance vocabulary size, sequence length, and computational efficiencyโ€”the fundamental trade-off shaping every modern NLP system from GPT-4 to Google Translate. You'll understand why vocabulary size directly affects model parameters while sequence length impacts transformer computation, and how BPE optimally balances both extremes. ## Learning Objectives -By completing this module, you will be able to: +By the end of this module, you will be able to: -1. **Implement character-level and subword tokenization** strategies for converting text to token sequences -2. **Design efficient vocabulary management** systems with special tokens and encoding/decoding -3. **Understand trade-offs** between vocabulary size (model parameters) and sequence length (computation) -4. **Build BPE (Byte Pair Encoding)** tokenizer for optimal subword unit representation -5. **Apply text processing optimization** techniques for production NLP pipelines at scale +- **Implement character-level tokenization with vocabulary management**: Build tokenizers with bidirectional token-to-ID mappings, special token handling (PAD, UNK, BOS, EOS), and graceful unknown character handling for robust multilingual support +- **Build BPE (Byte Pair Encoding) tokenizer**: Implement the iterative merge algorithm that learns optimal subword units by counting character pair frequenciesโ€”the same approach powering GPT, BERT, and modern transformers +- **Understand vocabulary size vs sequence length trade-offs**: Analyze how vocabulary choices affect model parameters (embedding matrix size = vocab_size ร— embed_dim) and computation (transformer attention is O(nยฒ) in sequence length) +- **Design efficient text processing pipelines**: Create production-ready tokenizers with encoding/decoding, vocabulary serialization for deployment, and proper special token management for batching +- **Analyze tokenization throughput and compression ratios**: Measure tokens/second performance, compare character vs BPE on sequence length reduction, and understand scaling to billions of tokens in production systems -## Why This Matters +## Build โ†’ Use โ†’ Reflect -### Production Context +This module follows TinyTorch's **Build โ†’ Use โ†’ Reflect** framework: -Every language model depends on tokenization: +1. **Build**: Implement character-level tokenizer with vocabulary building and encode/decode operations, then build BPE algorithm that iteratively merges frequent character pairs to learn optimal subword units +2. **Use**: Tokenize Shakespeare and modern text datasets, compare character vs BPE on sequence length reduction, measure tokenization throughput on large corpora, and test subword decomposition on rare/unknown words +3. **Reflect**: Why does vocabulary size directly control model parameters (embedding matrix rows)? How does sequence length affect transformer computation (O(nยฒ) attention)? What's the optimal balance for mobile deployment vs cloud serving? How do tokenization choices impact multilingual model design? -- **GPT-4** uses a 100K-token vocabulary trained on trillions of tokens of text -- **Google Translate** processes billions of sentences daily through tokenization pipelines -- **BERT** pioneered WordPiece tokenization that handles 100+ languages efficiently -- **Code models** like Copilot use specialized tokenizers for programming languages +```{admonition} Systems Reality Check +:class: tip -### Historical Context +**Production Context**: GPT-4 uses a 100K-token vocabulary trained on trillions of tokens. Every token in the vocabulary adds a row to the embedding matrixโ€”at 12,288 dimensions, that's 1.2B parameters just for embeddings. Meanwhile, transformers have O(nยฒ) attention complexity, so reducing sequence length from 1000 to 300 tokens cuts computation by 11x. This vocabulary size vs sequence length trade-off shapes every design decision in modern NLP: GPT-3 doubled vocabulary from GPT-2 (50Kโ†’100K) specifically to handle code and reduce sequence lengths for long documents. -Tokenization evolved with language modeling: - -- **Word-Level (pre-2016)**: Simple but massive vocabularies (100K+ words); struggles with rare words and typos -- **Character-Level (2015)**: Small vocabulary but extremely long sequences; computationally expensive -- **BPE (2016)**: Subword tokenization balances both; enabled GPT and modern transformers -- **SentencePiece (2018)**: Unified text and multilingual tokenization; powers modern multilingual models -- **Modern (2020+)**: Specialized tokenizers for code, math, and multimodal content - -The tokenizers you're building are the foundation of all modern NLP. - -## Pedagogical Pattern: Build โ†’ Use โ†’ Analyze - -### 1. Build - -Implement from first principles: -- Character-level tokenizer with vocab management -- Special tokens (, , , ) -- BPE algorithm for learning subword merges -- Encode/decode functions for text โ†” tokens -- Vocabulary serialization for model deployment - -### 2. Use - -Apply to real problems: -- Tokenize Shakespeare and modern text datasets -- Build vocabularies of different sizes (1K, 10K, 50K tokens) -- Compare character vs BPE on sequence length -- Handle out-of-vocabulary words gracefully -- Measure tokenization throughput (tokens/second) - -### 3. Analyze - -Deep-dive into design trade-offs: -- How does vocabulary size affect model parameters? -- Why do longer sequences increase computation quadratically (in transformers)? -- What's the sweet spot between vocab size and sequence length? -- How does tokenization affect rare words and morphology? -- Why do multilingual models need larger vocabularies? +**Performance Note**: Google Translate processes billions of sentences daily through tokenization pipelines. Tokenization throughput (measured in tokens/second) is critical for serving at scaleโ€”character-level achieves ~1M tokens/sec (simple lookup) while BPE achieves ~100K tokens/sec (iterative merge application). Production systems cache tokenization results and batch aggressively to amortize preprocessing costs. At OpenAI's scale ($700/million tokens), every tokenization optimization directly impacts economics. +``` ## Implementation Guide -### Core Components +### Base Tokenizer Interface + +All tokenizers share a common interface: encode text to token IDs and decode IDs back to text. This abstraction enables consistent usage across different tokenization strategies. -**Character-Level Tokenizer** ```python -class CharacterTokenizer: - """Simple character-level tokenization. - - Treats each character as a token. Simple but results in long sequences. - Vocab size: typically 100-500 (all ASCII or Unicode characters) +class Tokenizer: + """Base tokenizer interface defining the contract for all tokenizers. + + All tokenization strategies (character, BPE, WordPiece) must implement: + - encode(text) โ†’ List[int]: Convert text to token IDs + - decode(token_ids) โ†’ str: Convert token IDs back to text """ - def __init__(self): - self.char_to_idx = {} - self.idx_to_char = {} - self.vocab_size = 0 - - # Special tokens - self.PAD_TOKEN = "" - self.UNK_TOKEN = "" - self.BOS_TOKEN = "" - self.EOS_TOKEN = "" - - def build_vocab(self, texts): - """Build vocabulary from text corpus.""" - # Add special tokens first - special_tokens = [self.PAD_TOKEN, self.UNK_TOKEN, self.BOS_TOKEN, self.EOS_TOKEN] - for token in special_tokens: - self.char_to_idx[token] = len(self.char_to_idx) - - # Add all unique characters - unique_chars = set(''.join(texts)) - for char in sorted(unique_chars): - if char not in self.char_to_idx: - self.char_to_idx[char] = len(self.char_to_idx) - - # Create reverse mapping - self.idx_to_char = {idx: char for char, idx in self.char_to_idx.items()} - self.vocab_size = len(self.char_to_idx) - - def encode(self, text): - """Convert text to token IDs.""" - return [self.char_to_idx.get(char, self.char_to_idx[self.UNK_TOKEN]) - for char in text] - - def decode(self, token_ids): + + def encode(self, text: str) -> List[int]: + """Convert text to list of token IDs.""" + raise NotImplementedError("Subclasses must implement encode()") + + def decode(self, tokens: List[int]) -> str: """Convert token IDs back to text.""" - return ''.join([self.idx_to_char[idx] for idx in token_ids]) + raise NotImplementedError("Subclasses must implement decode()") ``` -**BPE (Byte Pair Encoding) Tokenizer** +**Design Pattern**: Abstract base class enforces consistent API across tokenization strategies, enabling drop-in replacement for performance testing (character vs BPE benchmarks). + +### Character-Level Tokenizer + +The simplest tokenization approach: each character becomes a token. Provides perfect coverage of any text with a tiny vocabulary (~100 characters), but produces long sequences. + ```python -class BPETokenizer: - """Byte Pair Encoding for subword tokenization. - - Iteratively merges most frequent character pairs to create subword units. - Balances vocabulary size and sequence length optimally. - +class CharTokenizer(Tokenizer): + """Character-level tokenizer treating each character as a separate token. + + Trade-offs: + - Small vocabulary (typically 100-500 characters) + - Long sequences (1 character = 1 token) + - Perfect coverage (no unknown tokens if vocab includes all Unicode) + - Simple implementation (direct character-to-ID mapping) + Example: - "unhappiness" โ†’ ["un", "happi", "ness"] (3 tokens) - vs character-level: ["u","n","h","a","p","p","i","n","e","s","s"] (11 tokens) + "hello" โ†’ ['h','e','l','l','o'] โ†’ [8, 5, 12, 12, 15] (5 tokens) """ - def __init__(self, vocab_size=10000): - self.vocab_size = vocab_size - self.merges = {} # Learned merge rules - self.vocab = {} # Token to ID mapping - - def train(self, texts): - """Learn BPE merges from corpus. - - Algorithm: - 1. Start with character-level vocabulary - 2. Count all adjacent character pairs - 3. Merge most frequent pair - 4. Repeat until vocabulary reaches target size + + def __init__(self, vocab: Optional[List[str]] = None): + """Initialize with optional vocabulary. + + Args: + vocab: List of characters to include in vocabulary. + If None, vocabulary is built later via build_vocab(). """ - # Initialize with character-level vocab - vocab = self._get_char_vocab(texts) - - # Learn merges iteratively - while len(vocab) < self.vocab_size: - # Count pairs - pairs = self._count_pairs(texts, vocab) - if not pairs: - break - - # Merge most frequent pair - best_pair = max(pairs, key=pairs.get) - texts = self._merge_pair(texts, best_pair) - vocab.add(''.join(best_pair)) - self.merges[best_pair] = ''.join(best_pair) - - # Build final vocabulary - self.vocab = {token: idx for idx, token in enumerate(sorted(vocab))} - - def encode(self, text): - """Encode text using learned BPE merges.""" - # Start with characters - tokens = list(text) - - # Apply merges in learned order - while True: - pairs = [(tokens[i], tokens[i+1]) for i in range(len(tokens)-1)] - if not pairs: - break - - # Find first mergeable pair - mergeable = [p for p in pairs if p in self.merges] - if not mergeable: - break - - # Apply merge - pair = mergeable[0] + if vocab is None: + vocab = [] + + # Reserve ID 0 for unknown token (robust handling of unseen characters) + self.vocab = [''] + vocab + self.vocab_size = len(self.vocab) + + # Bidirectional mappings for efficient encode/decode + self.char_to_id = {char: idx for idx, char in enumerate(self.vocab)} + self.id_to_char = {idx: char for idx, char in enumerate(self.vocab)} + + # Cache unknown token ID for fast lookup + self.unk_id = 0 + + def build_vocab(self, corpus: List[str]) -> None: + """Build vocabulary from text corpus. + + Args: + corpus: List of text strings to extract characters from. + + Process: + 1. Collect all unique characters across entire corpus + 2. Sort alphabetically for consistent ordering across runs + 3. Rebuild charโ†”ID mappings with token at position 0 + """ + # Extract all unique characters + all_chars = set() + for text in corpus: + all_chars.update(text) + + # Sort for reproducibility (important for model deployment) + unique_chars = sorted(list(all_chars)) + + # Rebuild vocabulary with special token first + self.vocab = [''] + unique_chars + self.vocab_size = len(self.vocab) + + # Rebuild bidirectional mappings + self.char_to_id = {char: idx for idx, char in enumerate(self.vocab)} + self.id_to_char = {idx: char for idx, char in enumerate(self.vocab)} + + def encode(self, text: str) -> List[int]: + """Convert text to list of character IDs. + + Args: + text: String to tokenize. + + Returns: + List of integer token IDs, one per character. + Unknown characters map to ID 0 (). + + Example: + >>> tokenizer.encode("hello") + [8, 5, 12, 12, 15] # Depends on vocabulary ordering + """ + tokens = [] + for char in text: + # Use .get() with unk_id default for graceful unknown handling + tokens.append(self.char_to_id.get(char, self.unk_id)) + return tokens + + def decode(self, tokens: List[int]) -> str: + """Convert token IDs back to text. + + Args: + tokens: List of integer token IDs. + + Returns: + Reconstructed text string. + Invalid IDs map to '' character. + """ + chars = [] + for token_id in tokens: + char = self.id_to_char.get(token_id, '') + chars.append(char) + return ''.join(chars) +``` + +**Key Implementation Details:** + +- **Special Token Reservation**: `` token must occupy ID 0 consistently across vocabularies for model compatibility +- **Bidirectional Mappings**: Both `char_to_id` (encoding) and `id_to_char` (decoding) enable O(1) lookup performance +- **Unknown Character Handling**: Graceful degradation prevents crashes on unseen characters (critical for multilingual models encountering rare Unicode) +- **Vocabulary Consistency**: Sorted character ordering ensures reproducible vocabularies across training runs (important for model deployment) + +### BPE (Byte Pair Encoding) Tokenizer + +The algorithm powering GPT and modern transformers: iteratively merge frequent character pairs to discover optimal subword units. Balances vocabulary size (model parameters) with sequence length (computational cost). + +```python +class BPETokenizer(Tokenizer): + """Byte Pair Encoding tokenizer for subword tokenization. + + Algorithm: + 1. Initialize: Start with character-level vocabulary + 2. Count: Find all adjacent character pair frequencies in corpus + 3. Merge: Replace most frequent pair with new merged token + 4. Repeat: Continue until vocabulary reaches target size + + Trade-offs: + - Larger vocabulary (typically 10K-50K tokens) + - Shorter sequences (2-4x compression vs character-level) + - Subword decomposition handles rare/unknown words gracefully + - Training complexity (requires corpus statistics) + + Example: + Training: "hello" appears 1000x, "hell" appears 500x + Learns: 'h'+'e' โ†’ 'he' (freq pair), 'l'+'l' โ†’ 'll' (freq pair) + Result: "hello" โ†’ ['he', 'll', 'o'] (3 tokens vs 5 characters) + """ + + def __init__(self, vocab_size: int = 1000): + """Initialize BPE tokenizer. + + Args: + vocab_size: Target vocabulary size (includes special tokens + + characters + learned merges). Typical: 10K-50K. + """ + self.vocab_size = vocab_size + self.vocab = [] # Final vocabulary tokens + self.merges = [] # Learned merge rules: [(pair, merged_token), ...] + self.token_to_id = {} # Token string โ†’ integer ID + self.id_to_token = {} # Integer ID โ†’ token string + + def _get_word_tokens(self, word: str) -> List[str]: + """Convert word to character tokens with end-of-word marker. + + Args: + word: String to tokenize at character level. + + Returns: + List of character tokens with '' suffix on last character. + End-of-word marker enables learning of word boundaries. + + Example: + >>> _get_word_tokens("hello") + ['h', 'e', 'l', 'l', 'o'] + """ + if not word: + return [] + + tokens = list(word) + tokens[-1] += '' # Mark word boundaries for BPE + return tokens + + def _get_pairs(self, word_tokens: List[str]) -> Set[Tuple[str, str]]: + """Extract all adjacent character pairs from token sequence. + + Args: + word_tokens: List of token strings. + + Returns: + Set of unique adjacent pairs (useful for frequency counting). + + Example: + >>> _get_pairs(['h', 'e', 'l', 'l', 'o']) + {('h', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o')} + """ + pairs = set() + for i in range(len(word_tokens) - 1): + pairs.add((word_tokens[i], word_tokens[i + 1])) + return pairs + + def train(self, corpus: List[str], vocab_size: int = None) -> None: + """Train BPE on corpus to learn merge rules. + + Args: + corpus: List of text strings (typically words or sentences). + vocab_size: Override target vocabulary size if provided. + + Training Process: + 1. Count word frequencies in corpus + 2. Initialize with character-level tokens (all unique characters) + 3. Iteratively: + a. Count all adjacent pair frequencies across all words + b. Merge most frequent pair into new token + c. Update word representations with merged token + d. Add merged token to vocabulary + 4. Stop when vocabulary reaches target size + 5. Build final tokenโ†”ID mappings + """ + if vocab_size: + self.vocab_size = vocab_size + + # Count word frequencies (training on token statistics, not raw text) + word_freq = Counter(corpus) + + # Initialize vocabulary and word token representations + vocab = set() + word_tokens = {} + + for word in word_freq: + tokens = self._get_word_tokens(word) + word_tokens[word] = tokens + vocab.update(tokens) # Collect all unique character tokens + + # Convert to sorted list for reproducibility + self.vocab = sorted(list(vocab)) + + # Add special unknown token + if '' not in self.vocab: + self.vocab = [''] + self.vocab + + # Learn merge rules iteratively + self.merges = [] + + while len(self.vocab) < self.vocab_size: + # Count all adjacent pairs across all words (weighted by frequency) + pair_counts = Counter() + + for word, freq in word_freq.items(): + tokens = word_tokens[word] + pairs = self._get_pairs(tokens) + for pair in pairs: + pair_counts[pair] += freq # Weight by word frequency + + if not pair_counts: + break # No more pairs to merge + + # Select most frequent pair + best_pair = pair_counts.most_common(1)[0][0] + + # Apply merge to all word representations + for word in word_tokens: + tokens = word_tokens[word] + new_tokens = [] + i = 0 + while i < len(tokens): + # Check if current position matches merge pair + if (i < len(tokens) - 1 and + tokens[i] == best_pair[0] and + tokens[i + 1] == best_pair[1]): + # Merge pair into single token + new_tokens.append(best_pair[0] + best_pair[1]) + i += 2 + else: + new_tokens.append(tokens[i]) + i += 1 + word_tokens[word] = new_tokens + + # Add merged token to vocabulary + merged_token = best_pair[0] + best_pair[1] + self.vocab.append(merged_token) + self.merges.append(best_pair) + + # Build final tokenโ†”ID mappings for efficient encode/decode + self._build_mappings() + + def _build_mappings(self): + """Build bidirectional tokenโ†”ID mappings from vocabulary.""" + self.token_to_id = {token: idx for idx, token in enumerate(self.vocab)} + self.id_to_token = {idx: token for idx, token in enumerate(self.vocab)} + + def _apply_merges(self, tokens: List[str]) -> List[str]: + """Apply learned merge rules to token sequence. + + Args: + tokens: List of character-level tokens. + + Returns: + List of tokens after applying all learned merges. + + Process: + Apply each merge rule in the order learned during training. + Early merges have priority over later merges. + """ + if not self.merges: + return tokens + + # Apply each merge rule sequentially + for merge_pair in self.merges: new_tokens = [] i = 0 while i < len(tokens): - if i < len(tokens) - 1 and (tokens[i], tokens[i+1]) == pair: - new_tokens.append(self.merges[pair]) + if (i < len(tokens) - 1 and + tokens[i] == merge_pair[0] and + tokens[i + 1] == merge_pair[1]): + # Apply merge + new_tokens.append(merge_pair[0] + merge_pair[1]) i += 2 else: new_tokens.append(tokens[i]) i += 1 tokens = new_tokens - - # Convert tokens to IDs - return [self.vocab.get(token, self.vocab['']) for token in tokens] + + return tokens + + def encode(self, text: str) -> List[int]: + """Encode text using learned BPE merges. + + Args: + text: String to tokenize. + + Returns: + List of integer token IDs after applying BPE merges. + + Process: + 1. Split text into words (simple whitespace split) + 2. Convert each word to character-level tokens + 3. Apply learned BPE merges to create subword units + 4. Convert subword tokens to integer IDs + """ + if not self.vocab: + return [] + + # Simple word splitting (production systems use more sophisticated approaches) + words = text.split() + all_tokens = [] + + for word in words: + # Start with character-level tokens + word_tokens = self._get_word_tokens(word) + + # Apply BPE merges + merged_tokens = self._apply_merges(word_tokens) + + all_tokens.extend(merged_tokens) + + # Convert tokens to IDs (unknown tokens map to ID 0) + token_ids = [] + for token in all_tokens: + token_ids.append(self.token_to_id.get(token, 0)) + + return token_ids + + def decode(self, tokens: List[int]) -> str: + """Decode token IDs back to text. + + Args: + tokens: List of integer token IDs. + + Returns: + Reconstructed text string. + + Process: + 1. Convert IDs to token strings + 2. Join tokens together + 3. Remove end-of-word markers and restore spaces + """ + if not self.id_to_token: + return "" + + # Convert IDs to token strings + token_strings = [] + for token_id in tokens: + token = self.id_to_token.get(token_id, '') + token_strings.append(token) + + # Join and clean up + text = ''.join(token_strings) + + # Replace end-of-word markers with spaces + text = text.replace('', ' ') + + # Clean up extra spaces + text = ' '.join(text.split()) + + return text ``` -**Vocabulary Management** +**BPE Algorithm Insights:** + +- **Training Phase**: Learn merge rules from corpus statistics by iteratively merging most frequent adjacent pairs +- **Inference Phase**: Apply learned merges in order to segment new text into optimal subword units +- **Frequency-Based Learning**: Common patterns ("ing", "ed", "tion") become single tokens, reducing sequence length +- **Graceful Degradation**: Unseen words decompose into known subwords (e.g., "unhappiness" โ†’ ["un", "happi", "ness"]) +- **Word Boundary Awareness**: End-of-word markers (``) enable learning of prefix vs suffix patterns + +### Tokenization Utilities + +Production-ready utilities for tokenizer creation, dataset processing, and performance analysis. + ```python -class Vocabulary: - """Manage token-to-ID mappings with special tokens. - - Provides clean interface for encoding/decoding and vocab serialization. +def create_tokenizer(strategy: str = "char", + vocab_size: int = 1000, + corpus: List[str] = None) -> Tokenizer: + """Factory function to create and train tokenizers. + + Args: + strategy: Tokenization approach ("char" or "bpe"). + vocab_size: Target vocabulary size (for BPE). + corpus: Training corpus for vocabulary building. + + Returns: + Trained tokenizer instance. + + Example: + >>> corpus = ["hello world", "machine learning"] + >>> tokenizer = create_tokenizer("bpe", vocab_size=500, corpus=corpus) + >>> tokens = tokenizer.encode("hello") """ - def __init__(self): - self.token_to_id = {} - self.id_to_token = {} - - # Reserve special token IDs - self.PAD_ID = 0 - self.UNK_ID = 1 - self.BOS_ID = 2 - self.EOS_ID = 3 - - self._add_special_tokens() - - def _add_special_tokens(self): - special = [('', self.PAD_ID), ('', self.UNK_ID), - ('', self.BOS_ID), ('', self.EOS_ID)] - for token, idx in special: - self.token_to_id[token] = idx - self.id_to_token[idx] = token - - def add_token(self, token): - if token not in self.token_to_id: - idx = len(self.token_to_id) - self.token_to_id[token] = idx - self.id_to_token[idx] = token - - def save(self, path): - """Save vocabulary for deployment.""" - import json - with open(path, 'w') as f: - json.dump(self.token_to_id, f) - - def load(self, path): - """Load vocabulary for inference.""" - import json - with open(path, 'r') as f: - self.token_to_id = json.load(f) - self.id_to_token = {v: k for k, v in self.token_to_id.items()} + if strategy == "char": + tokenizer = CharTokenizer() + if corpus: + tokenizer.build_vocab(corpus) + elif strategy == "bpe": + tokenizer = BPETokenizer(vocab_size=vocab_size) + if corpus: + tokenizer.train(corpus, vocab_size) + else: + raise ValueError(f"Unknown tokenization strategy: {strategy}") + + return tokenizer + +def analyze_tokenization(texts: List[str], + tokenizer: Tokenizer) -> Dict[str, float]: + """Analyze tokenization statistics for performance evaluation. + + Args: + texts: List of text strings to analyze. + tokenizer: Trained tokenizer instance. + + Returns: + Dictionary containing: + - vocab_size: Number of unique tokens in vocabulary + - avg_sequence_length: Mean tokens per text + - max_sequence_length: Longest tokenized sequence + - total_tokens: Total tokens across all texts + - compression_ratio: Average characters per token (higher = better) + - unique_tokens: Number of distinct tokens used + + Use Cases: + - Compare character vs BPE on sequence length reduction + - Measure compression efficiency (chars/token ratio) + - Identify vocabulary utilization (unique_tokens / vocab_size) + """ + all_tokens = [] + total_chars = 0 + + for text in texts: + tokens = tokenizer.encode(text) + all_tokens.extend(tokens) + total_chars += len(text) + + tokenized_lengths = [len(tokenizer.encode(text)) for text in texts] + + stats = { + 'vocab_size': (tokenizer.vocab_size + if hasattr(tokenizer, 'vocab_size') + else len(tokenizer.vocab)), + 'avg_sequence_length': np.mean(tokenized_lengths), + 'max_sequence_length': max(tokenized_lengths) if tokenized_lengths else 0, + 'total_tokens': len(all_tokens), + 'compression_ratio': total_chars / len(all_tokens) if all_tokens else 0, + 'unique_tokens': len(set(all_tokens)) + } + + return stats ``` -### Step-by-Step Implementation +**Analysis Metrics Explained:** -1. **Build Character Tokenizer** - - Create vocabulary from unique characters - - Add special tokens (PAD, UNK, BOS, EOS) - - Implement encode (text โ†’ IDs) and decode (IDs โ†’ text) - - Handle unknown characters gracefully +- **Compression Ratio**: Characters per token (higher = more efficient). BPE typically achieves 3-5x vs character-level at 1.0x +- **Vocabulary Utilization**: unique_tokens / vocab_size indicates whether vocabulary is appropriately sized +- **Sequence Length**: Directly impacts transformer computation (O(nยฒ) attention complexity) -2. **Implement BPE Algorithm** - - Start with character vocabulary - - Count adjacent pair frequencies - - Merge most frequent pairs iteratively - - Build merge rules and final vocabulary +## Getting Started -3. **Add Vocabulary Management** - - Create token โ†” ID bidirectional mappings - - Implement serialization for saving/loading - - Handle special tokens consistently - - Support vocabulary extension +### Prerequisites -4. **Optimize for Production** - - Cache encode/decode results - - Use efficient data structures (tries, hash maps) - - Batch process multiple texts - - Measure throughput (tokens/second) +Ensure you understand tensor operations from Module 01: -5. **Compare Tokenization Strategies** - - Measure sequence lengths for same text - - Analyze vocabulary size requirements - - Test on rare words and typos - - Evaluate multilingual performance +```bash +# Activate TinyTorch environment +source bin/activate-tinytorch.sh + +# Verify tensor module +tito test --module tensor +``` + +**Why This Prerequisite Matters:** + +- Tokenization produces integer tensors (sequences of token IDs) +- Embedding layers (Module 11) use token IDs to index into weight matrices +- Understanding tensor shapes is critical for batching variable-length sequences + +### Development Workflow + +1. **Open the development file**: `modules/10_tokenization/tokenization_dev.ipynb` +2. **Implement base Tokenizer interface**: Define encode() and decode() methods as abstract interface +3. **Build CharTokenizer**: Implement vocabulary building, character-to-ID mappings, encode/decode with unknown token handling +4. **Implement BPE algorithm**: + - Character pair counting with frequency statistics + - Iterative merge logic (find most frequent pair, merge across corpus) + - Vocabulary construction from learned merges + - Merge application during encoding +5. **Create utility functions**: Tokenizer factory, dataset processing, performance analysis +6. **Test on real data**: + - Compare character vs BPE on sequence length reduction + - Measure compression ratios (characters per token) + - Test unknown word handling via subword decomposition + - Analyze vocabulary utilization +7. **Optimize for performance**: Measure tokenization throughput (tokens/second), profile merge application, test on large corpora +8. **Export and verify**: `tito module complete 10 && tito test --module tokenization` + +**Development Tips:** + +- Start with small corpus (100 words, vocab_size=200) to debug BPE algorithm +- Print learned merge rules to understand what patterns BPE discovers +- Visualize sequence length vs vocabulary size trade-off with multiple BPE configurations +- Test on rare/misspelled words to verify subword decomposition works +- Profile with different vocabulary sizes to find optimal performance point ## Testing -### Inline Tests (During Development) +### Comprehensive Test Suite + +Run the full test suite to verify tokenization functionality: -Run inline tests while building: ```bash -cd modules/10_tokenization -python tokenization_dev.py +# TinyTorch CLI (recommended) +tito test --module tokenization + +# Direct pytest execution +python -m pytest tests/ -k tokenization -v ``` -Expected output: -``` -Unit Test: Character tokenizer... +### Test Coverage Areas + +- **Base tokenizer interface**: Abstract class enforces encode/decode contract +- **Character tokenizer correctness**: Vocabulary building from corpus, encode/decode round-trip accuracy, unknown character handling with `` token +- **BPE merge learning**: Pair frequency counting, merge application correctness, vocabulary size convergence, merge order preservation +- **Vocabulary management**: Token-to-ID mapping consistency, bidirectional lookup correctness, special token ID reservation +- **Edge case handling**: Empty strings, single characters, Unicode characters, whitespace-only text, very long sequences +- **Round-trip accuracy**: Encodeโ†’decode produces original text for all vocabulary characters +- **Performance benchmarks**: Tokenization throughput (tokens/second), vocabulary size vs encode time scaling, batch processing efficiency + +### Inline Testing & Validation + +The module includes comprehensive inline tests with progress tracking: + +```python +# Example inline test output +๐Ÿ”ฌ Unit Test: Base Tokenizer Interface... +โœ… encode() raises NotImplementedError correctly +โœ… decode() raises NotImplementedError correctly +๐Ÿ“ˆ Progress: Base Tokenizer Interface โœ“ + +๐Ÿ”ฌ Unit Test: Character Tokenizer... โœ… Vocabulary built with 89 unique characters -โœ… Encode/decode round-trip successful -โœ… Special tokens handled correctly -Progress: Character Tokenizer โœ“ +โœ… Encode/decode round-trip: "hello" โ†’ [8,5,12,12,15] โ†’ "hello" +โœ… Unknown character maps to token (ID 0) +โœ… Vocabulary building from corpus works correctly +๐Ÿ“ˆ Progress: Character Tokenizer โœ“ -Unit Test: BPE tokenizer... -โœ… Learned 5000 merge rules from corpus +๐Ÿ”ฌ Unit Test: BPE Tokenizer... +โœ… Character-level initialization successful +โœ… Pair extraction: ['h','e','l','l','o'] โ†’ {('h','e'), ('l','l'), ...} +โœ… Training learned 195 merge rules from corpus +โœ… Vocabulary size reached target (200 tokens) โœ… Sequence length reduced 3.2x vs character-level -โœ… Handles rare words and typos gracefully -Progress: BPE Tokenizer โœ“ +โœ… Unknown words decompose into subwords gracefully +๐Ÿ“ˆ Progress: BPE Tokenizer โœ“ -Unit Test: Vocabulary management... -โœ… Token-to-ID mappings bidirectional -โœ… Vocabulary saved and loaded correctly -โœ… Special token IDs reserved -Progress: Vocabulary โœ“ +๐Ÿ”ฌ Unit Test: Tokenization Utils... +โœ… Tokenizer factory creates correct instances +โœ… Dataset processing handles variable lengths +โœ… Analysis computes compression ratios correctly +๐Ÿ“ˆ Progress: Tokenization Utils โœ“ + +๐Ÿ“Š Analyzing Tokenization Strategies... +Strategy Vocab Avg Len Compression Coverage +------------------------------------------------------------ +Character 89 43.2 1.00 89 +BPE-100 100 28.5 1.52 87 +BPE-500 500 13.8 3.14 245 + +๐Ÿ’ก Key Insights: +- Character: Small vocab, long sequences, perfect coverage +- BPE: Larger vocab, shorter sequences, better compression +- Higher compression ratio = more characters per token = efficiency + +๐ŸŽ‰ ALL TESTS PASSED! Module ready for export. ``` -### Export and Validate +### Manual Testing Examples -After completing the module: -```bash -# Export to tinytorch package -tito export 10_tokenization +```python +from tokenization_dev import CharTokenizer, BPETokenizer, create_tokenizer, analyze_tokenization -# Run integration tests -tito test 10_tokenization -``` +# Test character-level tokenization +char_tokenizer = CharTokenizer() +corpus = ["hello world", "machine learning is awesome"] +char_tokenizer.build_vocab(corpus) -## Where This Code Lives +text = "hello" +char_ids = char_tokenizer.encode(text) +char_decoded = char_tokenizer.decode(char_ids) +print(f"Character: '{text}' โ†’ {char_ids} โ†’ '{char_decoded}'") +# Output: Character: 'hello' โ†’ [8, 5, 12, 12, 15] โ†’ 'hello' -``` -tinytorch/ -โ”œโ”€โ”€ text/ -โ”‚ โ””โ”€โ”€ tokenization.py # Your implementation goes here -โ””โ”€โ”€ __init__.py # Exposes CharTokenizer, BPETokenizer, etc. +# Test BPE tokenization +bpe_tokenizer = BPETokenizer(vocab_size=500) +bpe_tokenizer.train(corpus) -Usage in other modules: ->>> from tinytorch.text import BPETokenizer ->>> tokenizer = BPETokenizer(vocab_size=10000) ->>> tokenizer.train(texts) ->>> ids = tokenizer.encode("Hello world!") +bpe_ids = bpe_tokenizer.encode(text) +bpe_decoded = bpe_tokenizer.decode(bpe_ids) +print(f"BPE: '{text}' โ†’ {bpe_ids} โ†’ '{bpe_decoded}'") +# Output: BPE: 'hello' โ†’ [142, 201] โ†’ 'hello' # Fewer tokens! + +# Compare sequence lengths +long_text = "The quick brown fox jumps over the lazy dog" * 10 +char_len = len(char_tokenizer.encode(long_text)) +bpe_len = len(bpe_tokenizer.encode(long_text)) +print(f"Sequence length reduction: {char_len / bpe_len:.1f}x") +# Output: Sequence length reduction: 3.2x + +# Analyze tokenization statistics +test_corpus = [ + "Neural networks learn patterns", + "Transformers use attention mechanisms", + "Tokenization enables text processing" +] + +char_stats = analyze_tokenization(test_corpus, char_tokenizer) +bpe_stats = analyze_tokenization(test_corpus, bpe_tokenizer) + +print(f"Character - Vocab: {char_stats['vocab_size']}, " + f"Avg Length: {char_stats['avg_sequence_length']:.1f}, " + f"Compression: {char_stats['compression_ratio']:.2f}") +# Output: Character - Vocab: 89, Avg Length: 42.3, Compression: 1.00 + +print(f"BPE - Vocab: {bpe_stats['vocab_size']}, " + f"Avg Length: {bpe_stats['avg_sequence_length']:.1f}, " + f"Compression: {bpe_stats['compression_ratio']:.2f}") +# Output: BPE - Vocab: 500, Avg Length: 13.5, Compression: 3.13 ``` ## Systems Thinking Questions -1. **Vocabulary Size vs Model Size**: GPT-2 has 50K vocabulary with 768-dim embeddings = 38M parameters just for embeddings. How does this scale to GPT-3's 100K vocabulary? +### Real-World Applications -2. **Sequence Length vs Computation**: Transformers have O(nยฒ) attention complexity. If BPE reduces sequence length from 1000 to 300 tokens, how much does this reduce computation? +**OpenAI GPT Series:** +- **GPT-2**: 50,257 BPE tokens trained on 8M web pages (WebText corpus); vocabulary size chosen to balance 38M embedding parameters (50K ร— 768 dim) with sequence length for 1024-token context +- **GPT-3**: Increased to 100K vocabulary to handle code (indentation, operators) and reduce sequence lengths for long documents; embedding matrix alone: 1.2B parameters (100K ร— 12,288 dim) +- **GPT-4**: Advanced tiktoken library with 100K+ tokens, optimized for tokenization throughput at scale ($700/million tokens means every millisecond counts) +- **Question**: Why did OpenAI double vocabulary size from GPT-2โ†’GPT-3? Consider the trade-off: 2x more embedding parameters vs sequence length reduction for code/long documents. What breaks if vocabulary is too small? Too large? -3. **Rare Word Handling**: A word-level tokenizer marks rare words as , losing all information. How does BPE handle rare words like "unhappiness" even if never seen during training? +**Google Multilingual Models:** +- **SentencePiece**: Used in BERT, T5, PaLM for 100+ languages without language-specific preprocessing; unified tokenization enables shared vocabulary across languages +- **Vocabulary Sharing**: Multilingual models use single vocabulary for all languages (e.g., mT5: 250K SentencePiece tokens cover 101 languages); trade-off between per-language coverage and total vocabulary size +- **Production Scaling**: Google Translate processes billions of sentences daily; tokenization throughput and vocabulary lookup latency are critical for serving at scale +- **Question**: English needs ~30K tokens for 99% coverage; Chinese ideographic characters need 50K+. Should a multilingual model use one shared vocabulary or separate vocabularies per language? Consider: shared vocabulary enables zero-shot transfer but reduces per-language coverage. -4. **Multilingual Challenges**: English needs ~30K tokens for good coverage. Chinese needs 50K+. Why? How does this affect multilingual model design? +**Code Models (GitHub Copilot, AlphaCode):** +- **Specialized Vocabularies**: Code tokenizers handle programming language syntax (indentation, operators, keywords) and natural language (comments, docstrings); balance code-specific tokens vs natural language +- **Identifier Handling**: Variable names like `getUserProfile` vs `get_user_profile` require different tokenization strategies (camelCase splitting, underscore boundaries) +- **Trade-off**: Larger vocabulary for code-specific tokens reduces sequence length but increases embedding matrix size; rare identifier fragments still need subword decomposition +- **Question**: Should a code tokenizer treat `getUserProfile` as 1 token, 3 tokens (`get`, `User`, `Profile`), or 15 character tokens? Consider: single token = short sequence but huge vocabulary; character-level = long sequences but handles any identifier. -5. **Tokenization as Compression**: BPE learns common patterns like "ing", "ed", "tion". Why is this similar to data compression? What's the connection to information theory? +**Production NLP Pipelines:** +- **Google Translate**: Billions of sentences daily require high-throughput tokenization (character: ~1M tokens/sec, BPE: ~100K tokens/sec); vocabulary size affects both model memory and inference speed +- **OpenAI API**: Tokenization cost is significant at $700/million tokens; every optimization (caching, batch processing, vocabulary size tuning) directly impacts economics +- **Mobile Deployment**: Edge models (on-device speech recognition, keyboards) use smaller vocabularies (5K-10K) to fit memory constraints, trading sequence length for model size +- **Question**: If your tokenizer processes 10K tokens/second but your model serves 100K requests/second (each 50 tokens), how do you scale? Consider: pre-tokenize and cache? Batch aggressively? Optimize vocabulary? -## Real-World Connections +### Tokenization Foundations -### Industry Applications +**Vocabulary Size vs Model Parameters:** +- **Embedding Matrix Scaling**: Embedding parameters = vocab_size ร— embed_dim + - GPT-2: 50K vocab ร— 768 dim = 38.4M parameters (just embeddings!) + - GPT-3: 100K vocab ร— 12,288 dim = 1.23B parameters (just embeddings!) + - BERT-base: 30K vocab ร— 768 dim = 23M parameters +- **Training Impact**: Larger vocabulary means more parameters to train; embedding gradients scale with vocabulary size (affects memory and optimizer state size) +- **Deployment Constraints**: Embedding matrix must fit in memory during inference; on-device models use smaller vocabularies (5K-10K) to meet memory budgets +- **Question**: If you increase vocabulary from 10K to 100K (10x), how does this affect: (1) Model size? (2) Training memory (gradients + optimizer states)? (3) Inference latency (vocabulary lookup)? -**OpenAI GPT Series** -- GPT-2: 50K BPE vocabulary, trained on 8M web pages -- GPT-3: 100K vocabulary, handles code and multilingual text -- GPT-4: Advanced tiktoken library with 100K+ tokens -- Tokenization optimization critical for $700/1M token economics +**Sequence Length vs Computation:** +- **Transformer Attention Complexity**: O(nยฒ) where n = sequence length; doubling sequence length quadruples attention computation +- **BPE Compression**: Reduces "unhappiness" (11 chars) to ["un", "happi", "ness"] (3 tokens) โ†’ 13.4x less attention computation (11ยฒ vs 3ยฒ) +- **Batch Processing**: Sequences padded to max length in batch; character-level (1000 tokens) requires 11x more computation than BPE-level (300 tokens) even if actual content is shorter +- **Memory Scaling**: Attention matrices scale as (batch_size ร— nยฒ); character-level consumes far more GPU memory than BPE +- **Question**: Given text "machine learning" (16 chars), compare computation: (1) Character tokenizer โ†’ 16 tokens โ†’ 16ยฒ = 256 attention ops; (2) BPE โ†’ 3 tokens โ†’ 3ยฒ = 9 attention ops. What's the computational savings ratio? How does this scale to 1000-token documents? -**Google Multilingual Models** -- SentencePiece used in BERT, T5, PaLM for 100+ languages -- Unified tokenization across languages without preprocessing -- Optimized for fast serving at Google-scale traffic +**Rare Word Handling:** +- **Word-Level Failure**: Word tokenizers map unknown words to `` token โ†’ complete information loss (can't distinguish "antidisestablishmentarianism" from "supercalifragilisticexpialidocious") +- **BPE Graceful Degradation**: Decomposes unknown words into known subwords: "unhappiness" โ†’ ["un", "happi", "ness"] preserves semantic information even if full word never seen during training +- **Morphological Generalization**: BPE learns prefixes ("un-", "pre-", "anti-") and suffixes ("-ing", "-ed", "-ness") as tokens, enabling compositional understanding +- **Question**: How does BPE handle "antidisestablishmentarianism" (28 chars) even if never seen during training? Trace the decomposition: which subwords would be discovered? How does this enable the model to understand the word's meaning? -**Code Models (GitHub Copilot, AlphaCode)** -- Specialized tokenizers for programming languages -- Handle indentation, operators, and variable names efficiently -- Balance natural language and code syntax +**Tokenization as Compression:** +- **Frequent Pattern Learning**: BPE learns common patterns become single tokens: "ing" โ†’ 1 token, "ed" โ†’ 1 token, "tion" โ†’ 1 token (similar to dictionary-based compression like LZW) +- **Information Theory Connection**: Optimal encoding assigns short codes to frequent symbols (Huffman coding); BPE is essentially dictionary-based compression optimized for language statistics +- **Compression Ratio**: Character-level = 1.0 chars/token (by definition); BPE typically achieves 3-5 chars/token depending on vocabulary size and language +- **Question**: BPE and gzip both learn frequent patterns and replace with short codes. What's the key difference? Hint: BPE operates at subword granularity (preserves linguistic units), gzip operates at byte level (ignores linguistic structure). -### Research Impact +### Performance Characteristics -This module implements patterns from: -- BPE (Sennrich et al., 2016): Subword tokenization for NMT -- WordPiece (Google, 2016): BERT tokenization strategy -- SentencePiece (Kudo, 2018): Language-agnostic tokenization -- tiktoken (OpenAI, 2023): Fast BPE for GPT-3/4 +**Tokenization Throughput:** +- **Character-Level Speed**: ~1M tokens/second (simple array lookup: char โ†’ ID via hash map) +- **BPE Speed**: ~100K tokens/second (iterative merge application: must scan for applicable merge rules) +- **Production Caching**: Systems cache tokenization results to amortize preprocessing cost (especially for repeated queries or batch processing) +- **Bottleneck Analysis**: If tokenization takes 10ms and model inference takes 100ms (single request), tokenization is 9% overhead; but for batch_size=1000, tokenization becomes 100ms (10ms ร— 1000 requests) while model inference might be 200ms due to batching efficiency โ†’ tokenization is now 33% overhead! +- **Question**: Your tokenizer processes 10K tokens/sec. Model serves 100K requests/sec, each request has 50 tokens. Total tokenization throughput needed: 5M tokens/sec. What do you do? Consider: (1) Parallelize tokenization across CPUs? (2) Cache frequent queries? (3) Switch to character tokenizer (10x faster)? (4) Optimize BPE implementation? -## What's Next? +**Memory vs Compute Trade-offs:** +- **Large Vocabulary**: More memory (embedding matrix) but faster tokenization (fewer merge applications) and shorter sequences (less attention computation) +- **Small Vocabulary**: Less memory (smaller embedding matrix) but slower tokenization (more merge rules to apply) and longer sequences (more attention computation) +- **Optimal Vocabulary Size**: Depends on deployment constraintsโ€”edge devices (mobile, IoT) prioritize memory (use smaller vocab, accept longer sequences); cloud serving prioritizes throughput (use larger vocab, reduce sequence length) +- **Embedding Matrix Memory**: GPT-3's 100K vocabulary ร— 12,288 dim ร— 2 bytes (fp16) = 2.5GB just for embeddings; quantization to int8 reduces to 1.25GB +- **Question**: For edge deployment (mobile device with 2GB RAM budget), should you prioritize: (1) Smaller vocabulary (5K tokens, saves 400MB embedding memory) accepting longer sequences? (2) Larger vocabulary (50K tokens, uses 2GB embeddings) for shorter sequences? Consider: attention computation scales quadratically with sequence length. -In **Module 11: Embeddings**, you'll convert these token IDs into dense vector representations: +**Batching and Padding:** +- **Padding Waste**: Variable-length sequences padded to max length in batch; wasted computation on padding tokens (don't contribute to loss but consume attention operations) +- **Character-Level Penalty**: Longer sequences require more paddingโ€”if batch contains [10, 50, 500] character-level tokens, all padded to 500 โ†’ 490 + 450 + 0 = 940 wasted tokens (65% waste) +- **BPE Advantage**: Shorter sequences reduce padding wasteโ€”same batch as [3, 15, 150] BPE tokens, padded to 150 โ†’ 147 + 135 + 0 = 282 wasted tokens (still 63% waste, but absolute numbers smaller) +- **Dynamic Batching**: Group similar-length sequences to reduce padding waste (collate_fn in DataLoader) +- **Question**: Batch of sequences with lengths [10, 50, 500] tokens. (1) Character-level: Total computation = 3 ร— 500ยฒ = 750K attention operations. (2) BPE reduces to [3, 15, 150]: Total = 3 ร— 150ยฒ = 67.5K operations (11x reduction). But what if you sort and batch by length: [[10, 50], [500]] โ†’ Char: 2ร—50ยฒ + 1ร—500ยฒ = 255K; BPE: 2ร—15ยฒ + 1ร—150ยฒ = 23K. How much does batching strategy matter? -- Map discrete token IDs to continuous embeddings -- Learn position encodings for sequence order -- Implement lookup tables for fast embedding retrieval -- Understand how embeddings capture semantic similarity +**Multilingual Considerations:** +- **Shared Vocabulary**: Enables zero-shot cross-lingual transfer (model trained on English can handle French without fine-tuning) but reduces per-language coverage +- **Language-Specific Vocabulary Size**: English: 26 letters โ†’ 30K tokens for 99% coverage; Chinese: 50K+ characters โ†’ need 60K tokens for equivalent coverage; Arabic: morphologically rich โ†’ needs more subword decomposition +- **Vocabulary Allocation**: Multilingual model with 100K shared vocabulary must allocate tokens across languages; high-resource languages (English) get better coverage than low-resource languages (Swahili) +- **Question**: Should a multilingual model use: (1) One shared vocabulary (100K tokens across all languages, enables transfer but dilutes per-language coverage)? (2) Separate vocabularies per language (30K English + 60K Chinese = 90K total, better coverage but no cross-lingual transfer)? Consider: shared embedding space enables "cat" (English) to align with "chat" (French) via training. -The tokens you create here become the input to every transformer and language model! +## Ready to Build? + +You're about to implement the tokenization systems that power every modern language modelโ€”from GPT-4 processing trillions of tokens to Google Translate serving billions of requests daily. Tokenization is the critical bridge between human language (text) and neural networks (numbers), and the design decisions you make have profound effects on model size, computational cost, and generalization ability. + +By building these systems from scratch, you'll understand the fundamental trade-off shaping modern NLP: **vocabulary size vs sequence length**. Larger vocabularies mean more model parameters (embedding matrix size = vocab_size ร— embed_dim) but shorter sequences (less computation, especially in transformers with O(nยฒ) attention). Smaller vocabularies mean fewer parameters but longer sequences requiring more computation. You'll see why BPE emerged as the dominant approachโ€”balancing both extremes optimally through learned subword decompositionโ€”and why every major language model (GPT, BERT, T5, LLaMA) uses some form of subword tokenization. + +This module connects directly to Module 11 (Embeddings): your token IDs will index into embedding matrices, converting discrete tokens into continuous vectors. Understanding tokenization deeplyโ€”not just as a black-box API but as a system with measurable performance characteristics and design trade-offsโ€”will make you a better ML systems engineer. You'll appreciate why GPT-3 doubled vocabulary size from GPT-2 (50Kโ†’100K to handle code and long documents), why mobile models use tiny 5K vocabularies (memory constraints), and why production systems aggressively cache tokenization results (throughput optimization). + +Take your time, experiment with different vocabulary sizes (100, 1000, 10000), and measure everything: sequence length reduction, compression ratios, tokenization throughput. This is where text becomes numbers, where linguistics meets systems engineering, and where you'll develop the intuition needed to make smart trade-offs in production NLP systems. + +Choose your preferred way to engage with this module: + +````{grid} 1 2 3 3 + +```{grid-item-card} ๐Ÿš€ Launch Binder +:link: https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/10_tokenization/tokenization_dev.ipynb +:class-header: bg-light + +Run this module interactively in your browser. No installation required! +``` + +```{grid-item-card} โšก Open in Colab +:link: https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/10_tokenization/tokenization_dev.ipynb +:class-header: bg-light + +Use Google Colab for GPU access and cloud compute power. +``` + +```{grid-item-card} ๐Ÿ“– View Source +:link: https://github.com/mlsysbook/TinyTorch/blob/main/modules/10_tokenization/tokenization_dev.ipynb +:class-header: bg-light + +Browse the Jupyter notebook and understand the implementation. +``` + +```` + +```{admonition} ๐Ÿ’พ Save Your Progress +:class: tip +**Binder sessions are temporary!** Download your completed notebook when done, or switch to local development for persistent work. + +``` --- -**Ready to build tokenizers from scratch?** Open `modules/10_tokenization/tokenization_dev.py` and start implementing. + diff --git a/modules/11_embeddings/ABOUT.md b/modules/11_embeddings/ABOUT.md index a6713617..f027620a 100644 --- a/modules/11_embeddings/ABOUT.md +++ b/modules/11_embeddings/ABOUT.md @@ -1,402 +1,489 @@ --- title: "Embeddings - Token to Vector Representations" -description: "Build embedding layers that convert discrete tokens to dense vectors" +description: "Build embedding layers that convert discrete tokens to dense, learnable vector representations powering modern NLP" difficulty: 2 time_estimate: "4-5 hours" prerequisites: ["Tensor", "Tokenization"] next_steps: ["Attention"] learning_objectives: - - "Implement embedding layers with efficient lookup table operations" - - "Design positional encodings to capture sequence order information" - - "Understand memory scaling with vocabulary size and embedding dimensions" - - "Optimize embedding lookups for cache efficiency and bandwidth" - - "Apply dimensionality principles to semantic vector representations" + - "Implement embedding layers with efficient lookup table operations and proper initialization" + - "Design both learned and sinusoidal positional encodings to capture sequence order information" + - "Understand memory scaling relationships with vocabulary size and embedding dimensions" + - "Optimize embedding lookups for cache efficiency and sparse gradient updates" + - "Apply dimensionality principles to semantic vector space design and trade-offs" --- -# 11. Embeddings +# 11. Embeddings - Token to Vector Representations -**๐Ÿ›๏ธ ARCHITECTURE TIER** | Difficulty: โญโญ (2/4) | Time: 4-5 hours +**ARCHITECTURE TIER** | Difficulty: โญโญ (2/4) | Time: 4-5 hours ## Overview -Build embedding systems that transform discrete token IDs into dense vector representations. This module implements lookup tables, positional encodings, and optimization techniques that power all modern language models. +Build the embedding systems that transform discrete token IDs into dense, learnable vector representations - the bridge between symbolic text and neural computation. This module implements lookup tables, positional encodings, and the optimization techniques that power every modern language model from word2vec to GPT-4's input layers. + +You'll discover why embeddings aren't just "lookup tables" but sophisticated parameter spaces where semantic meaning emerges through training. By implementing both token embeddings and positional encodings from scratch, you'll understand the architectural choices that shape how transformers process language and why certain design decisions (sinusoidal vs learned positions, embedding dimensions, initialization strategies) have profound implications for model capacity, memory usage, and inference performance. ## Learning Objectives -By completing this module, you will be able to: +By the end of this module, you will be able to: -1. **Implement embedding layers** with efficient lookup table operations for token-to-vector conversion -2. **Design positional encodings** (learned and sinusoidal) to capture sequence order information -3. **Understand memory scaling** with vocabulary size and embedding dimensions in production models -4. **Optimize embedding lookups** for cache efficiency and memory bandwidth utilization -5. **Apply dimensionality principles** to balance expressiveness and computational efficiency +- **Implement embedding layers**: Build efficient lookup tables for token-to-vector conversion with proper Xavier initialization and gradient flow +- **Design positional encodings**: Create both sinusoidal (Transformer-style) and learned (GPT-style) position representations with different extrapolation capabilities +- **Understand memory scaling**: Analyze how vocabulary size and embedding dimensions impact parameter count, memory bandwidth, and serving costs +- **Optimize embedding lookups**: Implement sparse gradient updates that avoid computing gradients for 99% of vocabulary during training +- **Apply dimensionality principles**: Balance semantic expressiveness with computational efficiency in vector space design and initialization -## Why This Matters +## Build โ†’ Use โ†’ Reflect -### Production Context +This module follows TinyTorch's **Build โ†’ Use โ†’ Reflect** framework: -Embeddings are the foundation of all modern NLP: +1. **Build**: Implement embedding lookup tables with trainable parameters, sinusoidal positional encodings using mathematical patterns, learned position embeddings, and complete token+position combination systems +2. **Use**: Convert tokenized text sequences to dense vectors, add positional information for sequence order awareness, and prepare embeddings for attention mechanisms +3. **Reflect**: Analyze memory scaling with vocabulary size (why GPT-3's embeddings use 2.4GB), understand sparse gradient efficiency for large vocabularies, and explore semantic geometry in learned embedding spaces -- **GPT-3's embedding table**: 50K vocab ร— 12K dims = 600M parameters (20% of total model) -- **BERT's embeddings**: Token + position + segment embeddings enable bidirectional understanding -- **Word2Vec/GloVe**: Pioneered semantic embeddings; "king - man + woman โ‰ˆ queen" -- **Recommendation systems**: Embedding tables for billions of items (YouTube, Netflix, Spotify) +```{admonition} Systems Reality Check +:class: tip -### Historical Context +**Production Context**: GPT-3's embedding table contains 50,257 vocabulary ร— 12,288 dimensions = 617M parameters (about 20% of the model's 175B total). Every token lookup requires reading 48KB of memory - making embedding access a major bandwidth bottleneck during inference, especially for long sequences. -Embeddings evolved from sparse to dense representations: - -- **One-Hot Encoding (pre-2013)**: Vocabulary-sized vectors; no semantic similarity -- **Word2Vec (2013)**: Dense embeddings capture semantic relationships; revolutionized NLP -- **GloVe (2014)**: Global co-occurrence statistics improve quality -- **Contextual Embeddings (2018)**: BERT/GPT embeddings depend on context; same word, different vectors -- **Modern Scale (2020+)**: 100K+ vocabulary embeddings in production language models - -The embeddings you're building are the input layer of transformers and all modern NLP. - -## Pedagogical Pattern: Build โ†’ Use โ†’ Analyze - -### 1. Build - -Implement from first principles: -- Embedding layer with learnable lookup table -- Sinusoidal positional encoding (Transformer-style) -- Learned positional embeddings (GPT-style) -- Combined token + position embeddings -- Gradient flow through embedding lookups - -### 2. Use - -Apply to real problems: -- Convert token sequences to dense vectors -- Add positional information for sequence order -- Visualize embedding spaces with t-SNE -- Measure semantic similarity with cosine distance -- Integrate with attention mechanisms (Module 12) - -### 3. Analyze - -Deep-dive into design trade-offs: -- How does embedding dimension affect model capacity? -- Why do transformers need positional encodings? -- What's the memory cost of large vocabularies? -- How do embeddings capture semantic relationships? -- Why sinusoidal vs learned position encodings? +**Performance Note**: During training, only ~1% of vocabulary appears in each batch. Sparse gradient updates avoid computing gradients for the other 99% of embedding parameters, saving massive computation and memory bandwidth. This is why frameworks like PyTorch implement specialized sparse gradient operations for embeddings. +``` ## Implementation Guide -### Core Components +### Embedding Layer - The Token Lookup Table + +The fundamental building block that maps discrete token IDs to continuous dense vectors. This is where semantic meaning will eventually be learned through training. + +**Core Implementation Pattern:** -**Embedding Layer - Token Lookup Table** ```python class Embedding: """Learnable embedding layer for token-to-vector conversion. - + Implements efficient lookup table that maps token IDs to dense vectors. - The core component of all language models. - + The foundation of all language models and sequence processing. + Args: vocab_size: Size of vocabulary (e.g., 50,000 for GPT-2) embedding_dim: Dimension of dense vectors (e.g., 768 for BERT-base) - - Memory: vocab_size ร— embedding_dim parameters - Example: 50K vocab ร— 768 dim = 38M parameters + + Memory Cost: vocab_size ร— embedding_dim parameters + Example: 50K vocab ร— 768 dim = 38.4M parameters (153MB at FP32) """ def __init__(self, vocab_size, embedding_dim): self.vocab_size = vocab_size self.embedding_dim = embedding_dim - - # Initialize embedding table randomly - # Shape: (vocab_size, embedding_dim) - self.weight = Tensor.randn(vocab_size, embedding_dim) * 0.02 - - def forward(self, token_ids): + + # Xavier/Glorot initialization for stable gradients + limit = math.sqrt(6.0 / (vocab_size + embedding_dim)) + self.weight = Tensor( + np.random.uniform(-limit, limit, (vocab_size, embedding_dim)), + requires_grad=True + ) + + def forward(self, indices): """Look up embeddings for token IDs. - + Args: - token_ids: (batch_size, seq_len) tensor of token IDs - + indices: (batch_size, seq_len) tensor of token IDs + Returns: embeddings: (batch_size, seq_len, embedding_dim) dense vectors """ - batch_size, seq_len = token_ids.shape - - # Lookup operation: index into embedding table - embeddings = self.weight[token_ids] # Advanced indexing - - return embeddings - - def backward(self, grad_output): - """Gradients accumulate in embedding table. - - Only embeddings that were looked up receive gradients. - This is sparse gradient update - critical for efficiency. - """ - batch_size, seq_len, embed_dim = grad_output.shape - - # Accumulate gradients for each unique token ID - grad_weight = Tensor.zeros_like(self.weight) - for b in range(batch_size): - for s in range(seq_len): - token_id = token_ids[b, s] - grad_weight[token_id] += grad_output[b, s] - - return grad_weight + # Advanced indexing: O(1) per token lookup + embedded = self.weight.data[indices.data.astype(int)] + + result = Tensor(embedded, requires_grad=self.weight.requires_grad) + + # Attach gradient computation (sparse updates during backward) + if self.weight.requires_grad: + result._grad_fn = EmbeddingBackward(self.weight, indices) + + return result ``` -**Positional Encoding - Sinusoidal (Transformer-Style)** +**Why This Design Works:** +- **Xavier initialization** ensures gradients don't explode or vanish during early training +- **Advanced indexing** provides O(1) lookup complexity regardless of vocabulary size +- **Sparse gradients** mean only embeddings for tokens in the current batch receive updates +- **Trainable weights** allow the model to learn semantic relationships through backpropagation + +### Sinusoidal Positional Encoding (Transformer-Style) + +Fixed mathematical encodings that capture position without learned parameters. The original "Attention is All You Need" approach that enables extrapolation to longer sequences. + +**Mathematical Foundation:** + ```python -class SinusoidalPositionalEncoding: - """Fixed sinusoidal positional encoding. - - Used in original Transformer (Vaswani et al., 2017). - Encodes absolute position using sine/cosine functions of different frequencies. - +def create_sinusoidal_embeddings(max_seq_len, embedding_dim): + """Create sinusoidal positional encodings from Vaswani et al. (2017). + + Uses sine/cosine functions of different frequencies to encode position. + + Formula: + PE(pos, 2i) = sin(pos / 10000^(2i/embed_dim)) # Even indices + PE(pos, 2i+1) = cos(pos / 10000^(2i/embed_dim)) # Odd indices + + Where: + pos = position in sequence (0, 1, 2, ...) + i = dimension pair index + 10000 = base frequency (creates wavelengths from 2ฯ€ to 10000ยท2ฯ€) + Advantages: - - No learned parameters - - Can generalize to longer sequences than training length - - Mathematically elegant relative position representation + - Zero parameters (no memory overhead) + - Generalizes to sequences longer than training + - Smooth transitions (nearby positions similar) + - Rich frequency spectrum across dimensions """ - def __init__(self, max_seq_len, embedding_dim): - self.max_seq_len = max_seq_len - self.embedding_dim = embedding_dim - - # Pre-compute positional encodings - self.encodings = self._compute_encodings() - - def _compute_encodings(self): - """Compute sinusoidal position encodings. - - PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) - PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) - """ - position = np.arange(self.max_seq_len)[:, np.newaxis] - div_term = np.exp(np.arange(0, self.embedding_dim, 2) * - -(np.log(10000.0) / self.embedding_dim)) - - encodings = np.zeros((self.max_seq_len, self.embedding_dim)) - encodings[:, 0::2] = np.sin(position * div_term) # Even indices - encodings[:, 1::2] = np.cos(position * div_term) # Odd indices - - return Tensor(encodings) - - def forward(self, seq_len): - """Return positional encodings for sequence length. - - Args: - seq_len: Length of input sequence - - Returns: - pos_encodings: (seq_len, embedding_dim) positional vectors - """ - return self.encodings[:seq_len] + # Position indices: [0, 1, 2, ..., max_seq_len-1] + position = np.arange(max_seq_len, dtype=np.float32)[:, np.newaxis] + + # Frequency term for each dimension pair + div_term = np.exp( + np.arange(0, embedding_dim, 2, dtype=np.float32) * + -(math.log(10000.0) / embedding_dim) + ) + + # Initialize positional encoding matrix + pe = np.zeros((max_seq_len, embedding_dim), dtype=np.float32) + + # Apply sine to even indices (0, 2, 4, ...) + pe[:, 0::2] = np.sin(position * div_term) + + # Apply cosine to odd indices (1, 3, 5, ...) + pe[:, 1::2] = np.cos(position * div_term) + + return Tensor(pe) ``` -**Learned Positional Embeddings (GPT-Style)** -```python -class LearnedPositionalEmbedding: - """Learned positional embeddings. - - Used in GPT models. Learns absolute position representations during training. - - Advantages: - - Can learn task-specific position patterns - - Often performs slightly better than sinusoidal - - Disadvantages: - - Cannot generalize beyond max trained sequence length - - Requires additional parameters - """ - def __init__(self, max_seq_len, embedding_dim): - self.max_seq_len = max_seq_len - self.embedding_dim = embedding_dim - - # Learnable position embedding table - self.weight = Tensor.randn(max_seq_len, embedding_dim) * 0.02 - - def forward(self, seq_len): - """Look up learned position embeddings. - - Args: - seq_len: Length of input sequence - - Returns: - pos_embeddings: (seq_len, embedding_dim) learned vectors - """ - return self.weight[:seq_len] -``` +**Why Sinusoidal Patterns Work:** +- **Different frequencies** per dimension: high frequencies change rapidly between positions, low frequencies change slowly +- **Unique signatures** for each position through combination of frequencies +- **Linear combinations** allow the model to learn relative position offsets through attention +- **No length limit** - can compute encodings for any sequence length at inference time + +### Learned Positional Encoding (GPT-Style) + +Trainable position embeddings that can adapt to task-specific patterns. Used in GPT models and other architectures where positional patterns may be learnable. + +**Implementation Pattern:** -**Combined Token + Position Embeddings** ```python -def get_combined_embeddings(token_ids, token_embeddings, pos_embeddings): - """Combine token and position embeddings. - - Used as input to transformer models. - +class PositionalEncoding: + """Learnable positional encoding layer. + + Trainable position-specific vectors added to token embeddings. + Args: - token_ids: (batch_size, seq_len) token indices - token_embeddings: Embedding layer for tokens - pos_embeddings: Positional encoding layer - - Returns: - combined: (batch_size, seq_len, embedding_dim) token + position + max_seq_len: Maximum sequence length to support + embedding_dim: Dimension matching token embeddings + + Advantages: + - Can learn task-specific position patterns + - May capture regularities like sentence structure + - Often performs slightly better than sinusoidal + + Disadvantages: + - Requires additional parameters (max_seq_len ร— embedding_dim) + - Cannot extrapolate beyond training sequence length + - Needs sufficient training data to learn position patterns """ - batch_size, seq_len = token_ids.shape - - # Get token embeddings - token_vecs = token_embeddings(token_ids) # (B, L, D) - - # Get position embeddings - pos_vecs = pos_embeddings(seq_len) # (L, D) - - # Add them together (broadcasting handles batch dimension) - combined = token_vecs + pos_vecs # (B, L, D) - - return combined + def __init__(self, max_seq_len, embedding_dim): + self.max_seq_len = max_seq_len + self.embedding_dim = embedding_dim + + # Smaller initialization than token embeddings (additive combination) + limit = math.sqrt(2.0 / embedding_dim) + self.position_embeddings = Tensor( + np.random.uniform(-limit, limit, (max_seq_len, embedding_dim)), + requires_grad=True + ) + + def forward(self, x): + """Add positional encodings to input embeddings. + + Args: + x: (batch_size, seq_len, embedding_dim) input embeddings + + Returns: + Position-aware embeddings of same shape + """ + batch_size, seq_len, embedding_dim = x.shape + + # Get position embeddings for this sequence length + pos_embeddings = self.position_embeddings.data[:seq_len] + + # Broadcast to batch dimension: (1, seq_len, embedding_dim) + pos_embeddings = pos_embeddings[np.newaxis, :, :] + + # Element-wise addition combines token and position information + result = x + Tensor(pos_embeddings, requires_grad=True) + + return result ``` -### Step-by-Step Implementation +**Design Rationale:** +- **Learned parameters** can capture task-specific patterns (e.g., sentence beginnings, clause boundaries) +- **Smaller initialization** because positions add to token embeddings (not replace them) +- **Fixed max length** is a limitation but acceptable for many production use cases +- **Element-wise addition** preserves both token semantics and position information -1. **Create Embedding Layer** - - Initialize weight matrix (vocab_size ร— embedding_dim) - - Implement forward pass with indexing - - Add backward pass with sparse gradient accumulation - - Test with small vocabulary +### Complete Embedding System -2. **Implement Sinusoidal Positions** - - Compute sine/cosine encodings - - Handle even/odd indices correctly - - Verify periodicity properties - - Test generalization to longer sequences +Production-ready integration of token and positional embeddings used in real transformer implementations. -3. **Add Learned Positions** - - Create learnable position table - - Initialize with small random values - - Implement forward and backward passes - - Compare with sinusoidal encodings +**Full Pipeline:** -4. **Combine Token + Position** - - Add token and position embeddings - - Handle batch broadcasting correctly - - Verify gradient flow through both - - Test with real tokenized sequences +```python +class EmbeddingLayer: + """Complete embedding system combining token and positional embeddings. -5. **Analyze Embedding Spaces** - - Visualize embeddings with t-SNE or PCA - - Measure cosine similarity between tokens - - Verify semantic relationships emerge - - Profile memory and lookup efficiency + Production component matching PyTorch/HuggingFace transformer patterns. + """ + def __init__(self, vocab_size, embed_dim, max_seq_len=512, + pos_encoding='learned', scale_embeddings=False): + self.vocab_size = vocab_size + self.embed_dim = embed_dim + self.scale_embeddings = scale_embeddings + + # Token embedding table + self.token_embedding = Embedding(vocab_size, embed_dim) + + # Positional encoding strategy + if pos_encoding == 'learned': + self.pos_encoding = PositionalEncoding(max_seq_len, embed_dim) + elif pos_encoding == 'sinusoidal': + self.pos_encoding = create_sinusoidal_embeddings(max_seq_len, embed_dim) + elif pos_encoding is None: + self.pos_encoding = None + + def forward(self, tokens): + """Convert tokens to position-aware embeddings. + + Args: + tokens: (batch_size, seq_len) token indices + + Returns: + (batch_size, seq_len, embed_dim) position-aware vectors + """ + # Token lookup + token_embeds = self.token_embedding.forward(tokens) + + # Optional scaling (Transformer convention: โˆšembed_dim) + if self.scale_embeddings: + token_embeds = Tensor(token_embeds.data * math.sqrt(self.embed_dim)) + + # Add positional information + if self.pos_encoding is not None: + output = self.pos_encoding.forward(token_embeds) + else: + output = token_embeds + + return output +``` + +**Integration Benefits:** +- **Flexible positional encoding** supports learned, sinusoidal, or none +- **Embedding scaling** (multiply by โˆšd) is Transformer convention for gradient stability +- **Batch processing** handles variable sequence lengths efficiently +- **Parameter management** tracks all trainable components for optimization + +## Getting Started + +### Prerequisites + +Before starting this module, ensure you have completed: + +- **Module 01 (Tensor)**: Provides the foundational Tensor class with gradient tracking and operations +- **Module 10 (Tokenization)**: Required for converting text to token IDs that embeddings consume + +Verify your prerequisites: + +```bash +# Activate TinyTorch environment +source bin/activate-tinytorch.sh + +# Verify prerequisite modules +tito test --module tensor +tito test --module tokenization +``` + +### Development Workflow + +1. **Open the development notebook**: `modules/11_embeddings/embeddings_dev.ipynb` +2. **Implement Embedding class**: Create lookup table with Xavier initialization and efficient indexing +3. **Build sinusoidal encodings**: Compute sine/cosine position representations using mathematical formula +4. **Create learned positions**: Add trainable position embedding table with proper initialization +5. **Integrate complete system**: Combine token and position embeddings with flexible encoding strategies +6. **Export and verify**: `tito module complete 11 && tito test --module embeddings` ## Testing -### Inline Tests (During Development) +### Comprehensive Test Suite + +Run the full test suite to verify embedding functionality: -Run inline tests while building: ```bash -cd modules/11_embeddings -python embeddings_dev.py +# TinyTorch CLI (recommended) +tito test --module embeddings + +# Direct pytest execution +python -m pytest tests/ -k embeddings -v ``` -Expected output: -``` -Unit Test: Embedding layer... -โœ… Lookup table created: 10K vocab ร— 256 dims = 2.5M parameters -โœ… Forward pass shape correct: (32, 20, 256) -โœ… Backward pass accumulates gradients correctly -Progress: Embedding Layer โœ“ +### Test Coverage Areas -Unit Test: Sinusoidal positional encoding... -โœ… Encodings computed for 512 positions -โœ… Sine/cosine patterns verified -โœ… Generalization to longer sequences works -Progress: Sinusoidal Positions โœ“ +- โœ… **Embedding lookup correctness**: Verify token IDs map to correct vector rows in weight table +- โœ… **Gradient flow verification**: Ensure sparse gradient updates accumulate properly during backpropagation +- โœ… **Positional encoding math**: Validate sinusoidal formula implementation with correct frequencies +- โœ… **Shape broadcasting**: Test token + position combination across batch dimensions +- โœ… **Memory efficiency profiling**: Verify parameter count and lookup performance characteristics -Unit Test: Combined embeddings... -โœ… Token + position addition works -โœ… Gradient flows through both components +### Inline Testing & Validation + +The module includes comprehensive unit tests during development: + +```python +# Example inline test output +๐Ÿ”ฌ Unit Test: Embedding layer... +โœ… Lookup table created: 10K vocab ร— 256 dims = 2.56M parameters +โœ… Forward pass shape correct: (32, 20, 256) for batch of 32 sequences +โœ… Backward pass sparse gradients accumulate correctly +โœ… Xavier initialization keeps variance stable +๐Ÿ“ˆ Progress: Embedding Layer โœ“ + +๐Ÿ”ฌ Unit Test: Sinusoidal positional encoding... +โœ… Encodings computed for 512 positions ร— 256 dimensions +โœ… Sine/cosine patterns verified (pos 0: [0, 1, 0, 1, ...]) +โœ… Different positions have unique signatures +โœ… Frequency spectrum correct (high to low across dimensions) +๐Ÿ“ˆ Progress: Sinusoidal Positions โœ“ + +๐Ÿ”ฌ Unit Test: Learned positional encoding... +โœ… Trainable position embeddings initialized +โœ… Addition with token embeddings preserves gradients โœ… Batch broadcasting handled correctly -Progress: Combined Embeddings โœ“ +๐Ÿ“ˆ Progress: Learned Positions โœ“ + +๐Ÿ”ฌ Unit Test: Complete embedding system... +โœ… Token + position combination works for all strategies +โœ… Embedding scaling (โˆšd) applied correctly +โœ… Variable sequence lengths handled gracefully +โœ… Parameter counting correct for each configuration +๐Ÿ“ˆ Progress: Complete System โœ“ ``` -### Export and Validate +### Manual Testing Examples -After completing the module: -```bash -# Export to tinytorch package -tito export 11_embeddings +Test your embedding implementation interactively: -# Run integration tests -tito test 11_embeddings -``` +```python +from tinytorch.text.embeddings import Embedding, PositionalEncoding, create_sinusoidal_embeddings -## Where This Code Lives +# Create embedding layer +vocab_size, embed_dim = 10000, 256 +token_emb = Embedding(vocab_size, embed_dim) -``` -tinytorch/ -โ”œโ”€โ”€ nn/ -โ”‚ โ””โ”€โ”€ embeddings.py # Your implementation goes here -โ””โ”€โ”€ __init__.py # Exposes Embedding, PositionalEncoding, etc. +# Test token lookup +token_ids = Tensor([[1, 5, 23], [42, 7, 19]]) # (2, 3) - batch of 2 sequences +embeddings = token_emb.forward(token_ids) # (2, 3, 256) +print(f"Token embeddings shape: {embeddings.shape}") -Usage in other modules: ->>> from tinytorch.nn import Embedding, SinusoidalPositionalEncoding ->>> token_emb = Embedding(vocab_size=50000, embedding_dim=768) ->>> pos_emb = SinusoidalPositionalEncoding(max_len=512, dim=768) +# Add learned positional encodings +pos_emb = PositionalEncoding(max_seq_len=512, embed_dim=256) +token_embeddings_3d = embeddings # Already (batch, seq, embed) +pos_aware = pos_emb.forward(token_embeddings_3d) +print(f"Position-aware shape: {pos_aware.shape}") # (2, 3, 256) + +# Try sinusoidal encodings +sin_pe = create_sinusoidal_embeddings(max_seq_len=512, embed_dim=256) +sin_positions = sin_pe.data[:3][np.newaxis, :, :] # (1, 3, 256) +combined = Tensor(embeddings.data + sin_positions) +print(f"Sinusoidal combined: {combined.shape}") # (2, 3, 256) + +# Verify position 0 pattern (should be [0, 1, 0, 1, ...]) +print(f"Position 0 pattern: {sin_pe.data[0, :8]}") +# Expected: [~0.0, ~1.0, ~0.0, ~1.0, ~0.0, ~1.0, ~0.0, ~1.0] ``` ## Systems Thinking Questions -1. **Memory Scaling**: GPT-3 has 50K vocab ร— 12K dims = 600M embedding parameters. At FP32 (4 bytes), how much memory? At FP16? Why does this matter for training vs inference? +### Real-World Applications -2. **Sparse Gradients**: During training, only ~1% of vocabulary appears in each batch. How does sparse gradient accumulation save computation compared to dense updates? +- **Large Language Models (GPT-4, Claude, Llama)**: Embedding tables often contain 20-40% of total model parameters. GPT-3's 50K vocab ร— 12K dims = 617M embedding parameters alone (2.4GB at FP32). This makes embeddings a major memory consumer in serving infrastructure. -3. **Embedding Dimension Choice**: BERT-base uses 768 dims, BERT-large uses 1024. How does dimension affect: (a) model capacity, (b) computation, (c) memory bandwidth? +- **Recommendation Systems (YouTube, Netflix, Spotify)**: Billion-scale item embeddings for personalized content retrieval. YouTube's embedding space contains hundreds of millions of video embeddings, enabling fast nearest-neighbor search for recommendations in milliseconds. -4. **Position Encoding Trade-offs**: Sinusoidal allows generalization to any length. Learned positions are limited to max training length. When would you choose each? +- **Multilingual Models (Google Translate, mBERT)**: Shared embedding spaces across 100+ languages enable zero-shot cross-lingual transfer. Words with similar meanings across languages cluster together in the learned vector space, allowing translation without parallel data. -5. **Semantic Geometry**: Why do word embeddings exhibit linear relationships like "king - man + woman โ‰ˆ queen"? What property of the training objective causes this? +- **Search Engines (Google, Bing)**: Query and document embeddings power semantic search beyond keyword matching. BERT-style embeddings capture meaning, letting "how to fix a leaky faucet" match "plumbing repair for dripping tap" even with no shared words. -## Real-World Connections +### Mathematical Foundations -### Industry Applications +- **Embedding Geometry**: Why do word embeddings exhibit linear relationships like "king - man + woman โ‰ˆ queen"? The training objective (predicting context words in word2vec, or next tokens in language models) creates geometric structure where semantic relationships become linear vector operations. This emerges without explicit supervision. -**Large Language Models (OpenAI, Anthropic, Google)** -- GPT-4: 100K+ vocabulary embeddings -- Embedding tables often 20-40% of total model parameters -- Optimized embedding access critical for inference latency -- Mixed-precision (FP16) embeddings save memory +- **Dimensionality Trade-offs**: Higher dimensions increase expressiveness (more capacity to separate distinct concepts) but require more memory and computation. BERT-base uses 768 dimensions, BERT-large uses 1024 - carefully chosen based on performance-cost Pareto analysis. Doubling dimensions doubles memory but may only improve accuracy by a few percentage points. -**Recommendation Systems (YouTube, Netflix, Spotify)** -- Billion-scale item embeddings for personalization -- Embedding retrieval systems for fast nearest-neighbor search -- Continuous embedding updates with online learning -- Embedding quantization for serving efficiency +- **Positional Encoding Mathematics**: Sinusoidal encodings use different frequencies (wavelengths from 2ฯ€ to 10,000ยท2ฯ€) so each position gets a unique pattern. The model can learn relative positions through attention: the dot product of position encodings at offsets k captures periodic patterns the attention mechanism learns to use. -**Multilingual Models (Google Translate, Facebook M2M)** -- Shared embedding spaces across 100+ languages -- Cross-lingual embeddings enable zero-shot transfer -- Vocabulary size optimization for multilingual coverage -- Embedding alignment techniques for language pairs +- **Sparse Gradient Efficiency**: During training with vocabulary size V and batch containing b unique tokens, dense gradients would update all V embeddings. Sparse gradients only update b embeddings - when b << V (typical: 1000 tokens vs 50K vocab), this saves ~98% of gradient computation and memory bandwidth. -### Research Impact +### Performance Characteristics -This module implements patterns from: -- Word2Vec (2013): Pioneered dense semantic embeddings -- GloVe (2014): Global co-occurrence matrix factorization -- Transformer (2017): Sinusoidal positional encodings -- BERT (2018): Contextual embeddings revolutionized NLP -- GPT (2018): Learned positional embeddings for autoregressive models +- **Memory Scaling**: Embedding tables grow as O(vocab_size ร— embedding_dim). At FP32 (4 bytes per parameter): 50K vocab ร— 768 dims = 153MB, 100K vocab ร— 1024 dims = 410MB. Mixed precision (FP16) cuts this in half, but vocabulary size dominates scaling for large models. -## What's Next? +- **Bandwidth Bottleneck**: Every token lookup reads embedding_dim ร— sizeof(dtype) bytes from memory. With 768 dims at FP32, that's 3KB per token. Processing a 2048-token context requires reading 6MB from the embedding table - memory bandwidth becomes the bottleneck, not compute. -In **Module 12: Attention**, you'll use these embeddings as input to attention mechanisms: +- **Cache Efficiency**: Sequential token access has poor cache locality because tokens are typically non-sequential in the embedding table (token IDs [1, 42, 7, 99] means random jumps through the weight matrix). Batching improves throughput by amortizing cache misses, but embedding access remains memory-bound, not compute-bound. -- Query, Key, Value projections from embeddings -- Scaled dot-product attention over embedded sequences -- Multi-head attention for different representation subspaces -- Self-attention that relates all positions in a sequence +- **Inference Optimization**: Embedding quantization (INT8 or even INT4) reduces memory footprint and bandwidth by 2-4ร—, critical for deployment. KV-caching in transformers makes embedding lookup happen only once per token (not per layer), so optimizing this cold start is important for latency-sensitive applications. -The embeddings you built are the foundation input to every transformer! +## Ready to Build? + +You're about to implement the embedding systems that power modern AI language understanding. These lookup tables and positional encodings are the bridge between discrete tokens (words, subwords, characters) and the continuous vector spaces where neural networks operate. What seems like a simple "array lookup" is actually the foundation of how language models represent meaning. + +What makes this module special is understanding not just *how* embeddings work, but *why* certain design choices matter. Why do we need positional encodings when embeddings already contain token information? Why sparse gradients instead of dense updates? How does embedding dimension affect model capacity versus memory footprint? These aren't just implementation details - they're fundamental design principles that shape every production language model's architecture. + +By building embeddings from scratch, you'll gain intuition for memory-computation trade-offs in deep learning systems. You'll understand why GPT-3's embedding table consumes 2.4GB of memory, and why that matters for serving costs at scale (more memory = more expensive GPUs = higher operational costs). You'll see how sinusoidal encodings allow transformers to process sequences longer than training data, while learned positions might perform better on specific tasks with known maximum lengths. + +This is where theory meets the economic realities of deploying AI at scale. Every architectural choice - vocabulary size, embedding dimension, positional encoding strategy - has both technical implications (accuracy, generalization) and business implications (memory costs, inference latency, serving throughput). Understanding these trade-offs is what separates machine learning researchers from machine learning systems engineers. + +Choose your preferred way to engage with this module: + +````{grid} 1 2 3 3 + +```{grid-item-card} ๐Ÿš€ Launch Binder +:link: https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/11_embeddings/embeddings_dev.ipynb +:class-header: bg-light + +Run this module interactively in your browser. No installation required! +``` + +```{grid-item-card} โšก Open in Colab +:link: https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/11_embeddings/embeddings_dev.ipynb +:class-header: bg-light + +Use Google Colab for GPU access and cloud compute power. +``` + +```{grid-item-card} ๐Ÿ“– View Source +:link: https://github.com/mlsysbook/TinyTorch/blob/main/modules/11_embeddings/embeddings_dev.ipynb +:class-header: bg-light + +Browse the Jupyter notebook source and understand the implementation. +``` + +```` + +```{admonition} ๐Ÿ’พ Save Your Progress +:class: tip +**Binder sessions are temporary!** Download your completed notebook when done, or switch to local development for persistent work. +``` --- -**Ready to build embedding systems from scratch?** Open `modules/11_embeddings/embeddings_dev.py` and start implementing. + diff --git a/modules/12_attention/ABOUT.md b/modules/12_attention/ABOUT.md index acb3e9e0..2c268075 100644 --- a/modules/12_attention/ABOUT.md +++ b/modules/12_attention/ABOUT.md @@ -1,403 +1,586 @@ --- title: "Attention - The Mechanism That Powers Modern AI" -description: "Build scaled dot-product and multi-head attention from scratch" -difficulty: 3 +description: "Build scaled dot-product and multi-head attention mechanisms from scratch" +difficulty: "โญโญโญ" time_estimate: "5-6 hours" -prerequisites: ["Tensor", "Layers", "Embeddings"] -next_steps: ["Transformers"] +prerequisites: ["01_tensor", "02_activations", "03_layers", "11_embeddings"] +next_steps: ["13_transformers"] learning_objectives: - - "Implement scaled dot-product attention with query, key, and value matrices" - - "Design multi-head attention for parallel attention subspaces" - - "Understand masking strategies for causal, padding, and bidirectional attention" - - "Build self-attention mechanisms for sequence-to-sequence modeling" - - "Apply attention patterns that power GPT, BERT, and modern transformers" + - "Understand attention's O(nยฒ) scaling behavior and memory bottlenecks in production systems" + - "Implement scaled dot-product attention with proper numerical stability and gradient flow" + - "Build multi-head attention with parallel representation subspaces and head concatenation" + - "Master masking strategies for causal (GPT), bidirectional (BERT), and padding patterns" + - "Analyze attention pattern trade-offs: computation cost, memory usage, and interpretability" --- -# 12. Attention +# 12. Attention - The Mechanism That Powers Modern AI -**๐Ÿ›๏ธ ARCHITECTURE TIER** | Difficulty: โญโญโญ (3/4) | Time: 5-6 hours +**ARCHITECTURE TIER** | Difficulty: โญโญโญ (3/4) | Time: 5-6 hours ## Overview -Implement the attention mechanism that revolutionized AI. This module builds scaled dot-product attention and multi-head attentionโ€”the core components of GPT, BERT, and all modern transformer models. +Implement the attention mechanism that revolutionized AI and sparked the modern transformer era. This module builds scaled dot-product attention and multi-head attentionโ€”the exact mechanisms powering GPT, BERT, Claude, and every major language model deployed today. You'll implement attention with explicit loops to viscerally understand the O(nยฒ) complexity that defines both the power and limitations of transformer architectures. + +The "Attention is All You Need" paper (2017) introduced these mechanisms and replaced RNNs with pure attention architectures, enabling parallelization and global context from layer one. Understanding attention from first principlesโ€”including its computational bottlenecksโ€”is essential for working with production transformers and understanding why FlashAttention, sparse attention, and linear attention are active research frontiers. ## Learning Objectives -By completing this module, you will be able to: +By the end of this module, you will be able to: -1. **Implement scaled dot-product attention** with query, key, and value matrices following the Transformer paper formula -2. **Design multi-head attention** for parallel attention in multiple representation subspaces -3. **Understand masking strategies** for causal (GPT-style), padding, and bidirectional (BERT-style) attention -4. **Build self-attention mechanisms** for sequence-to-sequence modeling with global context -5. **Apply attention patterns** that power all modern transformers from GPT-4 to Claude to Gemini +- **Understand O(nยฒ) Complexity**: Implement attention with explicit loops to witness quadratic scaling in memory and computation, understanding why long-context AI remains challenging +- **Build Scaled Dot-Product Attention**: Implement softmax(QK^T / โˆšd_k)V with proper numerical stability, understanding how 1/โˆšd_k prevents gradient vanishing +- **Create Multi-Head Attention**: Build parallel attention heads that learn different patterns (syntax, semantics, position) and concatenate their outputs for rich representations +- **Master Masking Strategies**: Implement causal masking for autoregressive generation (GPT), understand bidirectional attention for encoding (BERT), and handle padding masks +- **Analyze Production Trade-offs**: Experience attention's memory bottleneck firsthand, understand why FlashAttention matters, and explore the compute-memory trade-off space -## Why This Matters +## Build โ†’ Use โ†’ Reflect -### Production Context +This module follows TinyTorch's **Build โ†’ Use โ†’ Reflect** framework: -Attention is the core of modern AI: - -- **GPT-4** uses 96 attention layers with 128 heads each; attention is 70% of compute -- **BERT** pioneered bidirectional attention; powers Google Search ranking -- **AlphaFold2** uses attention over protein sequences; solved 50-year protein folding problem -- **Vision Transformers** replaced CNNs in production at Google, Meta, OpenAI - -### Historical Context - -Attention revolutionized machine learning: - -- **RNN Era (pre-2017)**: Sequential processing; no parallelism; gradient vanishing in long sequences -- **Attention is All You Need (2017)**: Pure attention architecture; parallelizable; global context -- **BERT/GPT (2018)**: Transformers dominate NLP; attention beats all previous approaches -- **Beyond NLP (2020+)**: Attention powers vision (ViT), biology (AlphaFold), multimodal (CLIP) - -The attention mechanism you're implementing sparked the current AI revolution. - -## Pedagogical Pattern: Build โ†’ Use โ†’ Analyze - -### 1. Build - -Implement from first principles: -- Scaled dot-product attention: `softmax(QK^T/โˆšd_k)V` -- Multi-head attention with parallel heads -- Masking for causal and padding patterns -- Self-attention wrapper (Q=K=V) -- Attention visualization and interpretation - -### 2. Use - -Apply to real problems: -- Build language model with causal attention -- Implement BERT-style bidirectional attention -- Visualize attention patterns on real text -- Compare single-head vs multi-head performance -- Measure O(nยฒ) computational scaling - -### 3. Analyze - -Deep-dive into design choices: -- Why does attention scale quadratically with sequence length? -- How do multiple heads capture different linguistic patterns? -- Why is the 1/โˆšd_k scaling factor critical for training? -- When would you use causal vs bidirectional attention? -- What are the memory vs computation trade-offs? +1. **Build**: Implement scaled dot-product attention with explicit O(nยฒ) loops (educational), create MultiHeadAttention class with Q/K/V projections and head splitting, and build masking utilities +2. **Use**: Apply attention to realistic sequences with causal masking for language modeling, visualize attention patterns showing what the model "sees," and test with different head configurations +3. **Reflect**: Why does attention scale O(nยฒ)? How do different heads specialize without supervision? What memory bottlenecks emerge at GPT-4 scale (128 heads, 8K+ context)? ## Implementation Guide +### Attention Mechanism Flow + +The attention mechanism transforms queries, keys, and values into context-aware representations: + +```{mermaid} +graph LR + A[Query
Q: nร—d] --> D[Scores
QK^T/โˆšd] + B[Key
K: nร—d] --> D + D --> E[Attention
Weights
softmax] + E --> F[Context
ร—V] + C[Value
V: nร—d] --> F + F --> G[Output
nร—d] + + style A fill:#e3f2fd + style B fill:#e3f2fd + style C fill:#e3f2fd + style D fill:#fff3e0 + style E fill:#ffe0b2 + style F fill:#f3e5f5 + style G fill:#f0fdf4 +``` + +**Flow**: Queries attend to Keys (QK^T) โ†’ Scale by โˆšd โ†’ Softmax for weights โ†’ Weighted sum of Values โ†’ Context output + ### Core Components -**Scaled Dot-Product Attention - The Heart of Transformers** +Your attention implementation consists of three fundamental building blocks: + +#### 1. Scaled Dot-Product Attention (`scaled_dot_product_attention`) + +The mathematical foundation that powers all transformers: + ```python def scaled_dot_product_attention(Q, K, V, mask=None): - """The fundamental attention operation from 'Attention is All You Need'. - - Attention(Q, K, V) = softmax(QK^T / โˆšd_k) V - - This exact formula powers GPT, BERT, and all transformers. - - Args: - Q: Query matrix (batch, heads, seq_len_q, d_k) - K: Key matrix (batch, heads, seq_len_k, d_k) - V: Value matrix (batch, heads, seq_len_v, d_v) - mask: Optional mask (batch, 1, seq_len_q, seq_len_k) - - Returns: - output: Attended values (batch, heads, seq_len_q, d_v) - attention_weights: Attention probabilities (batch, heads, seq_len_q, seq_len_k) - - Intuition: - Q = "What am I looking for?" - K = "What information is available?" - V = "What is the actual content?" - - Attention computes: for each query, how much should I focus on each key? - Then uses those weights to mix the values. """ - # d_k = dimension of keys (and queries) - d_k = Q.shape[-1] - - # Compute attention scores: QK^T - # Shape: (batch, heads, seq_len_q, seq_len_k) - scores = Q @ K.transpose(-2, -1) - - # Scale by sqrt(d_k) to prevent extreme softmax saturation - scores = scores / math.sqrt(d_k) - - # Apply mask if provided (for causal or padding masking) - if mask is not None: - # Set masked positions to large negative value - # After softmax, these become ~0 - scores = scores.masked_fill(mask == 0, -1e9) - - # Softmax to get attention probabilities - # Each row sums to 1: how much attention to pay to each position - attention_weights = softmax(scores, dim=-1) - - # Weighted sum of values based on attention - output = attention_weights @ V - - return output, attention_weights + Attention(Q, K, V) = softmax(QK^T / โˆšd_k) V + + This exact formula powers GPT, BERT, Claude, and all transformers. + Implemented with explicit loops to show O(nยฒ) complexity. + + Args: + Q: Query matrix (batch, seq_len, d_model) + K: Key matrix (batch, seq_len, d_model) + V: Value matrix (batch, seq_len, d_model) + mask: Optional causal mask (batch, seq_len, seq_len) + + Returns: + output: Attended values (batch, seq_len, d_model) + attention_weights: Attention matrix (batch, seq_len, seq_len) + """ + # Step 1: Compute attention scores (O(nยฒ) operation) + # For each query i and key j: score[i,j] = Q[i] ยท K[j] + + # Step 2: Scale by 1/โˆšd_k for numerical stability + # Prevents softmax saturation as dimensionality increases + + # Step 3: Apply optional causal mask + # Masked positions set to -1e9 (becomes ~0 after softmax) + + # Step 4: Softmax normalization (each row sums to 1) + # Converts scores to probability distribution + + # Step 5: Weighted sum of values (another O(nยฒ) operation) + # output[i] = ฮฃ(attention_weights[i,j] ร— V[j]) for all j ``` -**Multi-Head Attention - Parallel Attention Subspaces** +**Key Implementation Details:** +- **Explicit Loops**: Educational implementation shows exactly where O(nยฒ) complexity comes from (every query attends to every key) +- **Scaling Factor**: 1/โˆšd_k prevents dot products from growing large as dimensionality increases, maintaining gradient flow +- **Masking Before Softmax**: Setting masked positions to -1e9 makes them effectively zero after softmax +- **Return Attention Weights**: Essential for visualization and interpretability analysis + +**What You'll Learn:** +- Why attention weights must sum to 1 (probability distribution property) +- How the scaling factor prevents gradient vanishing +- The exact computational cost: 2nยฒd operations (QK^T + weightsร—V) +- Why memory scales as O(batch ร— nยฒ) for attention matrices + +#### 2. Multi-Head Attention (`MultiHeadAttention`) + +Parallel attention "heads" that learn different relationship patterns: + ```python class MultiHeadAttention: - """Multi-head attention from 'Attention is All You Need'. - - Allows model to jointly attend to information from different - representation subspaces at different positions. - - Architecture: - Input (batch, seq_len, d_model) - โ†’ Project to Q, K, V (each batch, seq_len, d_model) - โ†’ Split into H heads (batch, H, seq_len, d_model/H) - โ†’ Attention for each head in parallel - โ†’ Concatenate heads - โ†’ Final linear projection - Output (batch, seq_len, d_model) - - Example: - d_model = 512, num_heads = 8 - Each head processes 512/8 = 64 dimensions - 8 heads learn different attention patterns in parallel """ - def __init__(self, d_model, num_heads): - assert d_model % num_heads == 0 - - self.d_model = d_model - self.num_heads = num_heads - self.d_k = d_model // num_heads # Dimension per head - - # Linear projections for Q, K, V - self.W_q = Linear(d_model, d_model) - self.W_k = Linear(d_model, d_model) - self.W_v = Linear(d_model, d_model) - - # Output projection - self.W_o = Linear(d_model, d_model) - - def forward(self, query, key, value, mask=None): - """Multi-head attention forward pass. - - Args: - query: (batch, seq_len_q, d_model) - key: (batch, seq_len_k, d_model) - value: (batch, seq_len_v, d_model) - mask: Optional mask - - Returns: - output: (batch, seq_len_q, d_model) - attention_weights: (batch, num_heads, seq_len_q, seq_len_k) - """ - batch_size = query.shape[0] - - # 1. Linear projections - Q = self.W_q(query) # (batch, seq_len_q, d_model) - K = self.W_k(key) # (batch, seq_len_k, d_model) - V = self.W_v(value) # (batch, seq_len_v, d_model) - - # 2. Split into multiple heads - # Reshape: (batch, seq_len, d_model) โ†’ (batch, seq_len, num_heads, d_k) - # Transpose: โ†’ (batch, num_heads, seq_len, d_k) - Q = Q.reshape(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2) - K = K.reshape(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2) - V = V.reshape(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2) - - # 3. Apply attention for each head in parallel - attended, attention_weights = scaled_dot_product_attention(Q, K, V, mask) - # attended: (batch, num_heads, seq_len_q, d_k) - - # 4. Concatenate heads - # Transpose: (batch, num_heads, seq_len, d_k) โ†’ (batch, seq_len, num_heads, d_k) - # Reshape: โ†’ (batch, seq_len, d_model) - attended = attended.transpose(1, 2).reshape(batch_size, -1, self.d_model) - - # 5. Final linear projection - output = self.W_o(attended) - - return output, attention_weights + Multi-head attention from 'Attention is All You Need'. + + Projects input to Q, K, V, splits into multiple heads, + applies attention in parallel, concatenates, and projects output. + + Example: d_model=512, num_heads=8 + โ†’ Each head processes 64 dimensions (512 รท 8) + โ†’ 8 heads learn different attention patterns in parallel + """ + def __init__(self, embed_dim, num_heads): + # Validate: embed_dim must be divisible by num_heads + # Create Q, K, V projection layers (Linear(embed_dim, embed_dim)) + # Create output projection layer + + def forward(self, x, mask=None): + # 1. Project input to Q, K, V + # 2. Split into heads: (batch, seq, embed_dim) โ†’ (batch, heads, seq, head_dim) + # 3. Apply attention to each head in parallel + # 4. Concatenate heads back together + # 5. Apply output projection to mix information across heads ``` -**Masking Utilities** +**Architecture Flow:** +``` +Input (batch, seq, 512) + โ†“ [Q/K/V Linear Projections] +Q, K, V (batch, seq, 512) + โ†“ [Reshape & Split into 8 heads] +(batch, 8 heads, seq, 64 per head) + โ†“ [Parallel Attention on Each Head] +Headโ‚ learns syntax patterns (subject-verb agreement) +Headโ‚‚ learns semantics (word similarity) +Headโ‚ƒ learns position (relative distance) +Headโ‚„ learns long-range (coreference) +... + โ†“ [Concatenate Heads] +(batch, seq, 512) + โ†“ [Output Projection] +Output (batch, seq, 512) +``` + +**Key Implementation Details:** +- **Head Splitting**: Reshape from (batch, seq, embed_dim) to (batch, heads, seq, head_dim) via transpose operations +- **Parallel Processing**: All heads compute simultaneouslyโ€”GPU parallelism critical for efficiency +- **Four Linear Layers**: Three for Q/K/V projections, one for output (standard transformer architecture) +- **Head Concatenation**: Reverse the split operation to merge heads back to original dimensions + +**What You'll Learn:** +- Why multiple heads capture richer representations than single-head +- How heads naturally specialize without explicit supervision +- The computational trade-off: same O(nยฒd) complexity but higher constant factor +- Why head_dim = embed_dim / num_heads is the standard configuration + +#### 3. Masking Utilities + +Control information flow patterns for different tasks: + ```python def create_causal_mask(seq_len): - """Create causal mask for autoregressive (GPT-style) attention. - - Prevents positions from attending to future positions. - Position i can only attend to positions <= i. - - Returns: - mask: (seq_len, seq_len) lower triangular matrix - + """ + Lower triangular mask for autoregressive (GPT-style) attention. + Position i can only attend to positions โ‰ค i (no future peeking). + Example (seq_len=4): [[1, 0, 0, 0], # Position 0 sees only position 0 - [1, 1, 0, 0], # Position 1 sees 0,1 - [1, 1, 1, 0], # Position 2 sees 0,1,2 - [1, 1, 1, 1]] # Position 3 sees all + [1, 1, 0, 0], # Position 1 sees 0, 1 + [1, 1, 1, 0], # Position 2 sees 0, 1, 2 + [1, 1, 1, 1]] # Position 3 sees all positions """ - mask = np.tril(np.ones((seq_len, seq_len))) - return Tensor(mask) + return Tensor(np.tril(np.ones((seq_len, seq_len)))) def create_padding_mask(lengths, max_length): - """Create padding mask to ignore padding tokens. - - Args: - lengths: (batch_size,) actual sequence lengths - max_length: maximum sequence length in batch - - Returns: - mask: (batch_size, 1, 1, max_length) where 1=real, 0=padding """ - batch_size = lengths.shape[0] - mask = np.zeros((batch_size, max_length)) - for i, length in enumerate(lengths): - mask[i, :length] = 1 - return Tensor(mask).reshape(batch_size, 1, 1, max_length) + Prevents attention to padding tokens in variable-length sequences. + Essential for efficient batching of different-length sequences. + """ + # Create mask where 1=real token, 0=padding + # Shape: (batch_size, 1, 1, max_length) for broadcasting ``` -### Step-by-Step Implementation +**Masking Strategies:** +- **Causal (GPT)**: Lower triangularโ€”blocks n(n-1)/2 connections for autoregressive generation +- **Bidirectional (BERT)**: No maskโ€”full nยฒ connections for encoding with full context +- **Padding**: Batch-specificโ€”prevents attention to padding tokens in variable-length batches +- **Combined**: Can multiply masks element-wise (e.g., causal + padding) -1. **Implement Scaled Dot-Product Attention** - - Compute QK^T matmul - - Apply 1/โˆšd_k scaling - - Add masking support - - Apply softmax and value weighting - - Verify attention weights sum to 1 +**What You'll Learn:** +- How masking strategy fundamentally defines model capabilities (generation vs encoding) +- Why causal masking is essential for language modeling training stability +- The performance benefit of efficient batching with padding masks +- How mask shape broadcasting works with attention scores -2. **Build Multi-Head Attention** - - Create Q, K, V projection layers - - Split embeddings into multiple heads - - Apply attention to each head in parallel - - Concatenate head outputs - - Add final projection layer +### Attention Complexity Analysis -3. **Add Masking Utilities** - - Implement causal mask for GPT-style models - - Create padding mask for variable-length sequences - - Test mask shapes and broadcasting - - Verify masking prevents information leak +Understanding the computational and memory bottlenecks: -4. **Create Self-Attention Wrapper** - - Build convenience class where Q=K=V - - Add optional masking parameter - - Test with real embedded sequences - - Profile computational cost +#### Time Complexity: O(nยฒ ร— d) -5. **Visualize Attention Patterns** - - Extract attention weights from forward pass - - Plot heatmaps for different heads - - Analyze what patterns each head learns - - Interpret attention on real text examples +``` +For sequence length n and embedding dimension d: + +QK^T computation: +- n queries ร— n keys = nยฒ similarity scores +- Each score: dot product over d dimensions +- Total: O(nยฒ ร— d) operations + +Softmax normalization: +- Apply to nยฒ scores +- Total: O(nยฒ) operations + +Attention ร— Values: +- nยฒ weights ร— n values = nยณ operations +- But dimension d: effectively O(nยฒ ร— d) +- Total: O(nยฒ ร— d) operations + +Dominant: O(nยฒ ร— d) for both QK^T and weightsร—V +``` + +**Scaling Impact:** +- Doubling sequence length quadruples compute +- n=1024 โ†’ 1M scores per head +- n=4096 (GPT-3) โ†’ 16M scores per head (16ร— more) +- n=32K (GPT-4) โ†’ 1B scores per head (1000ร— more than 1024) + +#### Memory Complexity: O(batch ร— heads ร— nยฒ) + +``` +Attention weights matrix shape: (batch, heads, seq_len, seq_len) + +Example: GPT-3 scale inference +- batch=32, heads=96, seq=2048 +- Attention weights: 32 ร— 96 ร— 2048 ร— 2048 = 12.8 billion values +- At FP32 (4 bytes): 51.2 GB just for attention weights +- With 96 layers: 4.9 TB total (clearly infeasible!) + +This is why: +- FlashAttention fuses operations to avoid storing attention matrix +- Mixed precision training uses FP16 (2ร— memory reduction) +- Gradient checkpointing recomputes instead of storing +- Production models use extensive optimization tricks +``` + +**The Memory Bottleneck:** +- For long contexts (32K+ tokens), attention memory dominates total usage +- Storing attention weights becomes infeasibleโ€”must compute on-the-fly +- FlashAttention breakthrough: O(n) memory instead of O(nยฒ) via kernel fusion +- Understanding this bottleneck guides all modern attention optimization research + +### Comparing to PyTorch + +Your implementation vs `torch.nn.MultiheadAttention`: + +| Aspect | Your TinyTorch Implementation | PyTorch Production | +|--------|-------------------------------|-------------------| +| **Algorithm** | Exact same: softmax(QK^T/โˆšd_k)V | Same mathematical formula | +| **Loops** | Explicit (educational) | Fused GPU kernels | +| **Masking** | Manual application | Built-in mask parameter | +| **Memory** | O(nยฒ) attention matrix stored | FlashAttention-optimized | +| **Batching** | Standard implementation | Highly optimized kernels | +| **Numerical Stability** | 1/โˆšd_k scaling | Same + additional safeguards | + +**What You Gained:** +- Deep understanding of O(nยฒ) complexity by seeing explicit loops +- Insight into why FlashAttention and kernel fusion matter +- Knowledge of masking strategies and their architectural implications +- Foundation for understanding advanced attention variants (sparse, linear) + +## Getting Started + +### Prerequisites + +Ensure you understand these foundations: + +```bash +# Activate TinyTorch environment +source bin/activate-tinytorch.sh + +# Verify prerequisite modules +tito test --module tensor # Matrix operations (matmul, transpose) +tito test --module activations # Softmax for attention normalization +tito test --module layers # Linear layers for Q/K/V projections +tito test --module embeddings # Token/position embeddings attention operates on +``` + +**Core Concepts You'll Need:** +- **Matrix Multiplication**: Understanding QK^T computation and broadcasting +- **Softmax Numerical Stability**: Subtracting max before exp prevents overflow +- **Layer Composition**: How Q/K/V projections combine with attention +- **Shape Manipulation**: Reshape and transpose operations for head splitting + +### Development Workflow + +1. **Open the development file**: `modules/12_attention/attention_dev.ipynb` (notebook) or `attention_dev.py` (script) +2. **Implement scaled_dot_product_attention**: Build core attention formula with explicit loops showing O(nยฒ) complexity +3. **Create MultiHeadAttention class**: Add Q/K/V projections, head splitting, parallel attention, and output projection +4. **Build masking utilities**: Create causal mask for GPT-style attention and padding mask for batching +5. **Test and analyze**: Run comprehensive tests, visualize attention patterns, and profile computational scaling +6. **Export and verify**: `tito module complete 12 && tito test --module attention` ## Testing -### Inline Tests (During Development) +### Comprehensive Test Suite + +Run the full test suite to verify attention functionality: -Run inline tests while building: ```bash -cd modules/12_attention -python attention_dev.py +# TinyTorch CLI (recommended) +tito test --module attention + +# Direct pytest execution +python -m pytest tests/ -k attention -v + +# Inline testing during development +python modules/12_attention/attention_dev.py ``` -Expected output: -``` -Unit Test: Scaled dot-product attention... -โœ… Attention scores computed correctly -โœ… Softmax normalization verified (sums to 1) -โœ… Output shape matches expected dimensions -Progress: Attention Mechanism โœ“ +### Test Coverage Areas -Unit Test: Multi-head attention... -โœ… 8 heads process 512 dims in parallel +- โœ… **Attention Scores Computation**: Verifies QK^T produces correct shapes and values +- โœ… **Numerical Stability**: Confirms 1/โˆšd_k scaling prevents softmax saturation +- โœ… **Probability Normalization**: Validates attention weights sum to 1.0 per query +- โœ… **Causal Masking**: Tests that future positions get zero attention weight +- โœ… **Multi-Head Configuration**: Checks head splitting, parallel processing, and concatenation +- โœ… **Shape Preservation**: Ensures input shape equals output shape +- โœ… **Gradient Flow**: Verifies differentiability through attention computation graph +- โœ… **Computational Complexity**: Profiles O(nยฒ) scaling with increasing sequence length + +### Inline Testing & Complexity Analysis + +The module includes comprehensive validation and performance analysis: + +```python +๐Ÿ”ฌ Unit Test: Scaled Dot-Product Attention... +โœ… Attention scores computed correctly (QK^T shape verified) +โœ… Scaling factor 1/โˆšd_k applied +โœ… Softmax normalization verified (each row sums to 1.0) +โœ… Output shape matches expected (batch, seq, d_model) +โœ… Causal masking blocks future positions correctly +๐Ÿ“ˆ Progress: Scaled Dot-Product Attention โœ“ + +๐Ÿ”ฌ Unit Test: Multi-Head Attention... +โœ… 8 heads process 512 dimensions in parallel โœ… Head splitting and concatenation correct -โœ… Output projection applied properly -Progress: Multi-Head Attention โœ“ +โœ… Q/K/V projection layers initialized properly +โœ… Output projection applied +โœ… Shape: (batch, seq, 512) โ†’ (batch, seq, 512) โœ“ +๐Ÿ“ˆ Progress: Multi-Head Attention โœ“ -Unit Test: Causal masking... -โœ… Future positions blocked correctly -โœ… Past positions accessible -โœ… Autoregressive property verified -Progress: Masking โœ“ +๐Ÿ“Š Analyzing Attention Complexity... +Seq Len | Attention Matrix | Memory (KB) | Scaling +-------------------------------------------------------- + 16 | 256 | 1.00 | 1.0x + 32 | 1,024 | 4.00 | 4.0x + 64 | 4,096 | 16.00 | 4.0x + 128 | 16,384 | 64.00 | 4.0x + 256 | 65,536 | 256.00 | 4.0x + +๐Ÿ’ก Memory scales as O(nยฒ) with sequence length +๐Ÿš€ For seq_len=2048 (GPT-3), attention matrix needs 16 MB per layer ``` -### Export and Validate +### Manual Testing Examples -After completing the module: -```bash -# Export to tinytorch package -tito export 12_attention +```python +from attention_dev import scaled_dot_product_attention, MultiHeadAttention +from tinytorch.core.tensor import Tensor +import numpy as np -# Run integration tests -tito test 12_attention -``` +# Test 1: Basic scaled dot-product attention +batch, seq_len, d_model = 2, 10, 64 +Q = Tensor(np.random.randn(batch, seq_len, d_model)) +K = Tensor(np.random.randn(batch, seq_len, d_model)) +V = Tensor(np.random.randn(batch, seq_len, d_model)) -## Where This Code Lives +output, weights = scaled_dot_product_attention(Q, K, V) +print(f"Output shape: {output.shape}") # (2, 10, 64) +print(f"Weights shape: {weights.shape}") # (2, 10, 10) +print(f"Weights sum: {weights.data.sum(axis=2)}") # All ~1.0 -``` -tinytorch/ -โ”œโ”€โ”€ nn/ -โ”‚ โ””โ”€โ”€ attention.py # Your implementation goes here -โ””โ”€โ”€ __init__.py # Exposes MultiHeadAttention, etc. +# Test 2: Multi-head attention +mha = MultiHeadAttention(embed_dim=128, num_heads=8) +x = Tensor(np.random.randn(2, 10, 128)) +attended = mha.forward(x) +print(f"Multi-head output: {attended.shape}") # (2, 10, 128) -Usage in other modules: ->>> from tinytorch.nn import MultiHeadAttention ->>> attn = MultiHeadAttention(d_model=512, num_heads=8) ->>> output, weights = attn(query, key, value, mask=causal_mask) +# Test 3: Causal masking for language modeling +causal_mask = Tensor(np.tril(np.ones((batch, seq_len, seq_len)))) +causal_output, causal_weights = scaled_dot_product_attention(Q, K, V, causal_mask) +# Verify upper triangle is zero (no future attention) +print("Future attention blocked:", np.allclose(causal_weights.data[0, 3, 4:], 0)) + +# Test 4: Visualize attention patterns +print("\nAttention pattern (position โ†’ position):") +print(weights.data[0, :5, :5].round(3)) # First 5x5 submatrix ``` ## Systems Thinking Questions -1. **Quadratic Complexity**: Attention is O(nยฒ) in sequence length. For n=1024, we compute ~1M attention scores. For n=4096 (GPT-3 context), how many? Why is this a problem for long documents? +### Real-World Applications -2. **Multi-Head Benefits**: Why 8 heads of 64 dims each instead of 1 head of 512 dims? What different patterns might different heads learn (syntax vs semantics vs coreference)? +- **Large Language Models (GPT-4, Claude)**: 96+ layers with 128 heads each means 12,288+ parallel attention operations per forward pass; attention accounts for 70% of total compute +- **Machine Translation (Google Translate)**: Cross-attention between source and target languages enables word alignment; attention weights provide interpretable translation decisions +- **Vision Transformers (ViT)**: Self-attention over image patches replaced convolutions at Google/Meta/OpenAI; global receptive field from layer 1 vs deep CNN stacks +- **Scientific AI (AlphaFold2)**: Attention over protein sequences captures amino acid interactions; solved 50-year protein folding problem using transformer architecture -3. **Scaling Factor Impact**: Without 1/โˆšd_k scaling, softmax gets extreme values (nearly one-hot). Why? How does this hurt gradient flow? (Hint: softmax derivative) +### Mathematical Foundations -4. **Memory vs Compute**: Attention weights matrix is (batch ร— heads ร— seq ร— seq). For batch=32, heads=8, seq=1024, this is 256M values. At FP32, how much memory? Why is this a bottleneck? +- **Query-Key-Value Paradigm**: Attention implements differentiable "search"โ€”queries look for relevant keys and retrieve corresponding values +- **Scaling Factor (1/โˆšd_k)**: For unit variance Q and K, QK^T has variance d_k; dividing by โˆšd_k restores unit variance, keeping softmax responsive (critical for gradient flow) +- **Softmax Normalization**: Converts arbitrary scores to valid probability distribution; enables differentiable, learned routing mechanism +- **Masking Implementation**: Setting masked positions to -โˆž before softmax makes them effectively zero attention weight after normalization -5. **Causal vs Bidirectional**: GPT uses causal masking (can't see future). BERT uses bidirectional (can see all positions). Why does this architectural choice define fundamentally different models? +### Computational Characteristics -## Real-World Connections +- **Quadratic Memory Scaling**: Attention matrix is O(nยฒ); for GPT-3 scale (96 layers, 2048 context), attention weights alone require ~1.5 GBโ€”understanding this guides optimization priorities +- **Time-Memory Trade-off**: Can avoid storing attention matrix and recompute in backward pass (gradient checkpointing) at cost of 2ร— compute +- **Parallelization Benefits**: Unlike RNNs, all nยฒ attention scores compute simultaneously; fully utilizes GPU parallelism for massive speedup +- **FlashAttention Breakthrough**: Reformulates computation order to reduce memory from O(nยฒ) to O(n) via kernel fusionโ€”enables 2-4ร— speedup and longer contexts (8K+ tokens) -### Industry Applications +### How Your Implementation Maps to PyTorch -**Large Language Models (OpenAI, Anthropic, Google)** -- GPT-4: 96 layers ร— 128 heads = 12,288 attention computations -- Attention optimizations (FlashAttention) critical for training at scale -- Multi-query attention reduces inference cost in production -- Attention is the primary computational bottleneck +**What you just built:** +```python +# Your TinyTorch attention implementation +from tinytorch.core.attention import MultiheadAttention -**Machine Translation (Google Translate, DeepL)** -- Cross-attention aligns source and target languages -- Attention weights show word alignment (interpretability) -- Multi-head attention captures different translation patterns -- Real-time translation requires optimized attention kernels +# Create multi-head attention +mha = MultiheadAttention(embed_dim=512, num_heads=8) -**Vision Models (Google ViT, Meta DINOv2)** -- Self-attention over image patches replaces convolution -- Global receptive field from layer 1 (vs deep CNN stacks) -- Attention scales better to high-resolution images -- Now dominant architecture for vision tasks +# Forward pass +query = Tensor(...) # (batch, seq_len, embed_dim) +key = Tensor(...) +value = Tensor(...) -### Research Impact +# Compute attention: YOUR implementation +output, attn_weights = mha(query, key, value, mask=causal_mask) +# output shape: (batch, seq_len, embed_dim) +# attn_weights shape: (batch, num_heads, seq_len, seq_len) +``` -This module implements patterns from: -- Attention is All You Need (Vaswani et al., 2017): The transformer paper -- BERT (Devlin et al., 2018): Bidirectional attention for NLP -- GPT-2/3 (Radford et al., 2019): Causal attention for generation -- ViT (Dosovitskiy et al., 2020): Attention for computer vision +**How PyTorch does it:** +```python +# PyTorch equivalent +import torch.nn as nn -## What's Next? +# Create multi-head attention +mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True) -In **Module 13: Transformers**, you'll compose attention into complete transformer blocks: +# Forward pass +query = torch.tensor(...) # (batch, seq_len, embed_dim) +key = torch.tensor(...) +value = torch.tensor(...) -- Stack multi-head attention with feedforward networks -- Add layer normalization and residual connections -- Build encoder (BERT-style) and decoder (GPT-style) architectures -- Train full transformer on text generation tasks +# Compute attention: PyTorch implementation +output, attn_weights = mha(query, key, value, attn_mask=causal_mask) +# Same shapes, identical semantics +``` -The attention mechanism you built is the core component of every transformer! +**Key Insight**: Your attention implementation computes the **exact same mathematical formula** that powers GPT, BERT, and every transformer model: + +``` +Attention(Q, K, V) = softmax(QK^T / โˆšd_k) V +``` + +When you implement this with explicit loops, you viscerally understand the O(nยฒ) memory scaling that limits context length in production transformers. + +**What's the SAME?** +- **Core formula**: Scaled dot-product attention (Vaswani et al., 2017) +- **Multi-head architecture**: Parallel attention in representation subspaces +- **Masking patterns**: Causal masking (GPT), padding masking (BERT) +- **API design**: `(query, key, value)` inputs, attention weights output +- **Conceptual bottleneck**: O(nยฒ) memory for attention matrix + +**What's different in production PyTorch?** +- **Backend**: C++/CUDA kernels ~10-100ร— faster than Python loops +- **Memory optimization**: Fused kernels avoid materializing full attention matrix +- **FlashAttention**: PyTorch 2.0+ uses optimized attention (O(n) memory vs your O(nยฒ)) +- **Multi-query attention**: Production systems use grouped-query attention (GQA) to reduce KV cache size + +**Why this matters**: When you see `RuntimeError: CUDA out of memory` training transformers with long sequences, you understand it's the O(nยฒ) attention matrix from YOUR implementationโ€”doubling sequence length quadruples memory. When papers mention "linear attention" or "flash attention", you know they're solving the scaling bottleneck you experienced. + +**Production usage example**: +```python +# PyTorch Transformer implementation (after TinyTorch) +import torch +import torch.nn as nn + +class TransformerBlock(nn.Module): + def __init__(self, d_model=512, num_heads=8): + super().__init__() + # Uses same multi-head attention you built + self.mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True) + self.ffn = nn.Sequential( + nn.Linear(d_model, 4 * d_model), + nn.ReLU(), + nn.Linear(4 * d_model, d_model) + ) + + def forward(self, x, mask=None): + # Same pattern you implemented + attn_out, _ = self.mha(x, x, x, attn_mask=mask) # YOUR attention logic + x = x + attn_out # Residual connection + x = x + self.ffn(x) + return x +``` + +After implementing attention yourself, you understand that GPT's causal attention is your `mask=causal_mask`, BERT's bidirectional attention is your `mask=padding_mask`, and every transformer's O(nยฒ) scaling comes from the attention matrix you explicitly computed in your implementation. + +## Ready to Build? + +You're about to implement the mechanism that sparked the AI revolution and powers every modern language model. Understanding attention from first principlesโ€”including its computational bottlenecksโ€”will give you deep insight into why transformers dominate AI and what limitations remain. + +**Your Mission**: Implement scaled dot-product attention with explicit loops to viscerally understand O(nยฒ) complexity. Build multi-head attention that processes parallel representation subspaces. Master causal and padding masking for different architectural patterns. Test on real sequences, visualize attention patterns, and profile computational scaling. + +**Why This Matters**: The attention mechanism you're building didn't just improve NLPโ€”it unified deep learning across all domains. GPT, BERT, Vision Transformers, AlphaFold, DALL-E, and Claude all use the exact formula you're implementing. Understanding attention's power (global context, parallelizable) and limitations (quadratic scaling) is essential for working with production AI systems. + +**After Completion**: Module 13 (Transformers) will combine your attention with feedforward layers and normalization to build complete transformer blocks. Module 14 (Profiling) will measure your attention's O(nยฒ) scaling and identify optimization opportunities. Module 18 (Acceleration) will implement FlashAttention-style optimizations for your mechanism. + +Choose your preferred way to engage with this module: + +````{grid} 1 2 3 3 + +```{grid-item-card} ๐Ÿš€ Launch Binder +:link: https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/12_attention/attention_dev.ipynb +:class-header: bg-light + +Run this module interactively in your browser. No installation required! +``` + +```{grid-item-card} โšก Open in Colab +:link: https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/12_attention/attention_dev.ipynb +:class-header: bg-light + +Use Google Colab for GPU access and cloud compute power. +``` + +```{grid-item-card} ๐Ÿ“– View Source +:link: https://github.com/mlsysbook/TinyTorch/blob/main/modules/12_attention/attention_dev.ipynb +:class-header: bg-light + +Browse the notebook source code and understand the implementation. +``` + +```` + +```{admonition} ๐Ÿ’พ Save Your Progress +:class: tip +**Binder sessions are temporary!** Download your completed notebook when done, or switch to local development for persistent work. +``` --- -**Ready to build the AI revolution from scratch?** Open `modules/12_attention/attention_dev.py` and start implementing. + diff --git a/modules/13_transformers/ABOUT.md b/modules/13_transformers/ABOUT.md index 34637e00..9ba13064 100644 --- a/modules/13_transformers/ABOUT.md +++ b/modules/13_transformers/ABOUT.md @@ -1,479 +1,620 @@ --- -title: "Transformers - Complete Encoder-Decoder Architecture" -description: "Build full transformer models with encoder and decoder stacks" +title: "Transformers - Complete GPT Architecture" +description: "Build decoder-only transformer architecture for autoregressive text generation" difficulty: 4 time_estimate: "6-8 hours" prerequisites: ["Embeddings", "Attention"] -next_steps: ["Memoization (Optimization Tier)"] +next_steps: ["Profiling (Optimization Tier)"] learning_objectives: - - "Implement complete transformer blocks with attention and feedforward layers" - - "Design encoder stacks for bidirectional understanding (BERT-style)" - - "Build decoder stacks for autoregressive generation (GPT-style)" - - "Understand layer normalization and residual connections for deep networks" - - "Apply transformer architectures to language modeling and generation tasks" + - "Implement complete transformer blocks with multi-head attention, feed-forward networks, layer normalization, and residual connections" + - "Build decoder-only GPT architecture with causal masking for autoregressive text generation" + - "Understand pre-norm architecture and residual connections for training deep networks (12+ layers)" + - "Analyze parameter scaling, memory complexity, and attention quadratic growth with sequence length" + - "Apply transformer architecture to language modeling tasks using patterns from PyTorch and production systems" --- -# 13. Transformers +# 13. Transformers - Complete GPT Architecture -**๐Ÿ›๏ธ ARCHITECTURE TIER** | Difficulty: โญโญโญโญ (4/4) | Time: 6-8 hours +**ARCHITECTURE TIER** | Difficulty: โญโญโญโญ (4/4) | Time: 6-8 hours ## Overview -Build complete transformer models by composing attention, feedforward, and normalization layers. This module implements encoder stacks (BERT-style) and decoder stacks (GPT-style) that power all modern language models. +You'll build the complete GPT transformer architectureโ€”the decoder-only foundation powering ChatGPT, GPT-4, Claude, and virtually all modern large language models. This module combines everything you've learned about attention, embeddings, and neural networks into a production-ready autoregressive language model capable of text generation. You'll implement layer normalization, feed-forward networks, transformer blocks with residual connections, and the complete GPT model that matches PyTorch's `nn.TransformerDecoder` design. ## Learning Objectives -By completing this module, you will be able to: +By the end of this module, you will be able to: -1. **Implement complete transformer blocks** with multi-head attention, feedforward networks, and normalization -2. **Design encoder stacks** for bidirectional understanding using masked self-attention (BERT-style) -3. **Build decoder stacks** for autoregressive text generation with causal masking (GPT-style) -4. **Understand layer normalization and residual connections** critical for training deep transformer networks -5. **Apply transformer architectures** to language modeling, text generation, and sequence-to-sequence tasks +- **Implement complete transformer blocks** with multi-head self-attention, position-wise feed-forward networks (4x expansion), layer normalization, and residual connections for gradient highways enabling deep networks (12+ layers) +- **Build decoder-only GPT architecture** with causal masking preventing future token leakage, autoregressive generation with temperature sampling, and embeddings combining token and positional information +- **Understand pre-norm architecture and residual connections** critical for training stabilityโ€”pre-norm placement before sub-layers (not after) enables 100+ layer networks by providing clean normalized inputs and direct gradient paths +- **Analyze parameter scaling and memory complexity** including quadratic attention memory growth O(nยฒ) with sequence length, linear parameter scaling with layers, and techniques like gradient checkpointing for memory reduction +- **Apply transformer architecture to language modeling** using real-world patterns from PyTorch `nn.Transformer`, understanding decoder-only vs encoder-only vs encoder-decoder choices, and production optimizations like KV caching -## Why This Matters +## Build โ†’ Use โ†’ Reflect -### Production Context +This module follows TinyTorch's **Build โ†’ Use โ†’ Reflect** framework: -Transformers are the architecture of modern AI: +1. **Build**: Implement LayerNorm with learnable scale/shift, MLP feed-forward networks with 4x expansion and GELU activation, TransformerBlock combining attention+MLP with pre-norm residual connections, complete GPT decoder with causal masking and generation +2. **Use**: Train GPT-style decoder on character-level text generation, implement autoregressive generation with temperature sampling (conservative vs creative), analyze parameter scaling across model sizes (Tiny โ†’ GPT-3 scale), measure attention memory quadratic growth +3. **Reflect**: Why are residual connections critical for deep transformers (gradient vanishing without them)? How does pre-norm differ from post-norm (training stability for >12 layers)? What's the compute/memory trade-off in stacking layers vs widening dimensions? Why does attention memory scale quadratically with sequence length (O(nยฒd) cost)? -- **GPT-4**: 96-layer decoder-only transformer; powers ChatGPT and GitHub Copilot -- **BERT**: 12-layer encoder-only transformer; ranks billions of web pages for Google Search -- **T5**: Encoder-decoder transformer; Google's universal text-to-text model -- **Claude, Gemini, Llama**: All transformer-based; billions of users daily +```{admonition} Systems Reality Check +:class: tip -### Historical Context +**Production Context**: The decoder-only GPT architecture you're implementing powers virtually all modern LLMs. GPT-4 uses a 120-layer decoder stack, ChatGPT is based on GPT-3.5 with 96 layers, Claude uses decoder-only architecture, Llama 2 has 80 layersโ€”all are transformer decoders with causal attention. This architecture dominated because it scales predictably with parameters and data. -Transformers unified and dominated AI: - -- **Pre-Transformer (pre-2017)**: RNNs/LSTMs for sequences; CNNs for vision; separate architectures -- **Attention is All You Need (2017)**: Pure transformer beats RNNs; parallelizable; scales efficiently -- **BERT/GPT (2018)**: Transformers dominate NLP; pre-training + fine-tuning paradigm -- **Transformers Everywhere (2020+)**: Vision (ViT), speech (Whisper), protein folding (AlphaFold), multimodal (GPT-4) - -The architecture you're implementing powers virtually all modern AI systems. - -## Pedagogical Pattern: Build โ†’ Use โ†’ Analyze - -### 1. Build - -Implement from first principles: -- Feedforward network with two linear layers and activation -- Layer normalization for training stability -- Transformer block: attention โ†’ residual โ†’ norm โ†’ FFN โ†’ residual โ†’ norm -- Encoder stack (bidirectional, BERT-style) -- Decoder stack (autoregressive, GPT-style) - -### 2. Use - -Apply to real problems: -- Train GPT-style decoder on Shakespeare text generation -- Build BERT-style encoder for sequence classification -- Implement encoder-decoder for sequence-to-sequence tasks -- Generate text autoregressively with sampling -- Compare encoder-only vs decoder-only architectures - -### 3. Analyze - -Deep-dive into architectural choices: -- Why are residual connections critical for deep transformers? -- How does layer normalization differ from batch normalization? -- When would you use encoder-only vs decoder-only vs encoder-decoder? -- Why pre-norm vs post-norm transformer blocks? -- What's the compute/memory trade-off in stacking many layers? +**Performance Note**: Transformer depth has O(nยฒd) attention cost per layer (n=sequence length, d=model dimension). For GPT-3 with 2048 tokens, each attention layer processes 4M token pairs. Memory scales linearly with layers but quadratically with sequence length. Production systems use KV caching (reuse key-value pairs during generation), FlashAttention (memory-efficient attention), and gradient checkpointing (trade compute for memory) to manage this. Understanding these trade-offs is critical for ML systems engineering. +``` ## Implementation Guide -### Core Components +### LayerNorm - Training Stability for Deep Networks -**Feedforward Network - Position-Wise FFN** -```python -class FeedForward: - """Position-wise feedforward network in transformer. - - Two linear transformations with ReLU activation: - FFN(x) = ReLU(xWโ‚ + bโ‚)Wโ‚‚ + bโ‚‚ - - Applied identically to each position independently. - Typically d_ff = 4 ร— d_model (expansion factor). - - Args: - d_model: Input/output dimension (e.g., 512) - d_ff: Hidden dimension (e.g., 2048 = 4 ร— 512) - dropout: Dropout probability for regularization - """ - def __init__(self, d_model, d_ff, dropout=0.1): - self.linear1 = Linear(d_model, d_ff) - self.linear2 = Linear(d_ff, d_model) - self.relu = ReLU() - self.dropout = Dropout(dropout) - - def forward(self, x): - # x: (batch, seq_len, d_model) - x = self.linear1(x) # (batch, seq_len, d_ff) - x = self.relu(x) # Nonlinearity - x = self.dropout(x) # Regularization - x = self.linear2(x) # (batch, seq_len, d_model) - return x -``` +Layer normalization stabilizes training by normalizing activations across the feature dimension for each sample independently. Unlike batch normalization (normalizes across batch), LayerNorm works with any batch size including batch=1 during inferenceโ€”essential for variable-length sequences. -**Layer Normalization - Training Stability** ```python class LayerNorm: """Layer normalization for transformer training stability. - - Normalizes across feature dimension for each sample independently. - Unlike BatchNorm, works with any batch size including batch=1. - - Formula: y = ฮณ(x - ฮผ)/โˆš(ฯƒยฒ + ฮต) + ฮฒ - where ฮผ, ฯƒยฒ computed per sample across features - - Why not BatchNorm? - - Transformers process variable-length sequences - - LayerNorm independent of batch size (better for inference) - - Empirically works better for NLP tasks + + Normalizes across feature dimension (last axis) for each sample independently. + Includes learnable scale (gamma) and shift (beta) parameters. + + Formula: output = gamma * (x - mean) / sqrt(variance + eps) + beta + + Why LayerNorm for Transformers: + - Batch-independent: Works with any batch size (good for inference) + - Variable-length sequences: Each sample normalized independently + - Better gradients: Empirically superior to BatchNorm for NLP tasks """ - def __init__(self, d_model, eps=1e-6): - self.gamma = Parameter(Tensor.ones(d_model)) # Learned scale - self.beta = Parameter(Tensor.zeros(d_model)) # Learned shift - self.eps = eps - + def __init__(self, normalized_shape, eps=1e-5): + self.gamma = Tensor(np.ones(normalized_shape)) # Learnable scale (starts at 1.0) + self.beta = Tensor(np.zeros(normalized_shape)) # Learnable shift (starts at 0.0) + self.eps = eps # Numerical stability in variance calculation + def forward(self, x): - # x: (batch, seq_len, d_model) - mean = x.mean(dim=-1, keepdim=True) - std = x.std(dim=-1, keepdim=True) - normalized = (x - mean) / (std + self.eps) + # Compute statistics across last dimension (features) + mean = x.mean(axis=-1, keepdims=True) + variance = ((x - mean) ** 2).mean(axis=-1, keepdims=True) + + # Normalize: (x - ฮผ) / ฯƒ + normalized = (x - mean) / sqrt(variance + self.eps) + + # Apply learnable transformation: ฮณ * norm + ฮฒ return self.gamma * normalized + self.beta ``` -**Transformer Block - Complete Layer** +**Key Design Decisions:** +- **Per-sample normalization**: Each sequence position normalized independently across features (batch-independent) +- **Learnable parameters**: Gamma/beta allow model to recover any desired distribution after normalization +- **Epsilon for stability**: Small constant (1e-5) prevents division by zero in variance calculation + +**LayerNorm vs BatchNorm:** +| Aspect | LayerNorm | BatchNorm | +|--------|-----------|-----------| +| Normalizes across | Features (per sample) | Batch (per feature) | +| Batch size dependency | Independent | Dependent | +| Inference behavior | Same as training | Requires running statistics | +| Best for | Transformers, NLP | CNNs, Computer Vision | + +### MLP - Position-Wise Feed-Forward Network + +The MLP provides non-linear transformation capacity in each transformer block. It's a simple two-layer network with a 4x expansion pattern applied identically to each sequence position. + ```python -class TransformerBlock: - """Single transformer layer with attention and feedforward. - - Architecture (Pre-Norm variant): - x โ†’ LayerNorm โ†’ MultiHeadAttention โ†’ Residual - โ†’ LayerNorm โ†’ FeedForward โ†’ Residual - - Pre-Norm (shown above) vs Post-Norm: - - Pre-Norm: Normalize before sub-layers; better gradient flow - - Post-Norm: Normalize after sub-layers; original Transformer paper - - Pre-Norm generally preferred for deep models (>12 layers) +class MLP: + """Multi-Layer Perceptron (Feed-Forward Network) for transformer blocks. + + Standard pattern: Linear(expand) โ†’ GELU โ†’ Linear(contract) + Expansion ratio: 4:1 (embed_dim โ†’ 4*embed_dim โ†’ embed_dim) + + This provides the "thinking" capacity after attention computes relationships. """ - def __init__(self, d_model, num_heads, d_ff, dropout=0.1): - # Attention sub-layer - self.attention = MultiHeadAttention(d_model, num_heads) - self.norm1 = LayerNorm(d_model) - self.dropout1 = Dropout(dropout) - - # Feedforward sub-layer - self.feedforward = FeedForward(d_model, d_ff, dropout) - self.norm2 = LayerNorm(d_model) - self.dropout2 = Dropout(dropout) - - def forward(self, x, mask=None): - """Forward pass with residual connections. - - Args: - x: (batch, seq_len, d_model) - mask: Optional attention mask - - Returns: - output: (batch, seq_len, d_model) - """ - # Attention sub-layer with residual - normed = self.norm1(x) - attended, _ = self.attention(normed, normed, normed, mask) - x = x + self.dropout1(attended) # Residual connection - - # Feedforward sub-layer with residual - normed = self.norm2(x) - fed_forward = self.feedforward(normed) - x = x + self.dropout2(fed_forward) # Residual connection - + def __init__(self, embed_dim, hidden_dim=None): + if hidden_dim is None: + hidden_dim = 4 * embed_dim # Standard 4x expansion + + self.linear1 = Linear(embed_dim, hidden_dim) # Expansion: 512 โ†’ 2048 + self.gelu = GELU() # Smooth activation + self.linear2 = Linear(hidden_dim, embed_dim) # Contraction: 2048 โ†’ 512 + + def forward(self, x): + # x: (batch, seq_len, embed_dim) + x = self.linear1(x) # Expand to hidden_dim + x = self.gelu(x) # Nonlinearity (smoother than ReLU) + x = self.linear2(x) # Contract back to embed_dim return x ``` -**GPT-Style Decoder - Autoregressive Generation** +**Why 4x Expansion?** +- **Parameter capacity**: More parameters = more representation power (MLP typically has more params than attention) +- **Information bottleneck**: Expansion โ†’ contraction forces model to compress useful information +- **Empirical success**: 4x ratio found to work well across model sizes (some models experiment with 2x-8x) + +**GELU vs ReLU:** +- **ReLU**: Hard cutoff at zero `max(0, x)` - simple but non-smooth +- **GELU**: Smooth probabilistic activation `x * ฮฆ(x)` where ฮฆ is Gaussian CDF +- **Why GELU**: Smoother gradients, better performance for language modeling tasks + +### TransformerBlock - Complete Layer with Attention and MLP + +A single transformer layer combining multi-head self-attention with feed-forward processing using pre-norm residual architecture. This is the core building block stacked 12-120 times in production models. + ```python -class GPTDecoder: - """GPT-style decoder for autoregressive language modeling. - - Architecture: - Input tokens โ†’ Embed + PositionalEncoding - โ†’ TransformerBlocks (with causal masking) - โ†’ Linear projection to vocabulary - - Features: - - Causal masking: position i can only attend to positions โ‰ค i - - Autoregressive: generates one token at a time - - Pre-training objective: predict next token +class TransformerBlock: + """Complete transformer layer with self-attention, MLP, and residual connections. + + Pre-Norm Architecture (Modern Standard): + x โ†’ LayerNorm โ†’ MultiHeadAttention โ†’ Add(x) โ†’ + LayerNorm โ†’ MLP โ†’ Add โ†’ Output + + Each sub-layer (attention, MLP) gets normalized input but adds to residual stream. """ - def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, max_len): - # Embedding layers - self.token_embedding = Embedding(vocab_size, d_model) - self.position_embedding = LearnedPositionalEmbedding(max_len, d_model) - - # Transformer blocks - self.blocks = [TransformerBlock(d_model, num_heads, d_ff) - for _ in range(num_layers)] - - # Output projection - self.norm = LayerNorm(d_model) - self.output_proj = Linear(d_model, vocab_size) - - def forward(self, token_ids): - """Forward pass through decoder. - + def __init__(self, embed_dim, num_heads, mlp_ratio=4): + # Attention sub-layer components + self.attention = MultiHeadAttention(embed_dim, num_heads) + self.ln1 = LayerNorm(embed_dim) # Pre-norm: before attention + + # MLP sub-layer components + self.mlp = MLP(embed_dim, hidden_dim=int(embed_dim * mlp_ratio)) + self.ln2 = LayerNorm(embed_dim) # Pre-norm: before MLP + + def forward(self, x, mask=None): + """Forward pass with residual connections. + Args: - token_ids: (batch, seq_len) token indices - + x: (batch, seq_len, embed_dim) input + mask: Optional attention mask (causal mask for GPT) + + Returns: + output: (batch, seq_len, embed_dim) transformed sequence + """ + # Attention sub-layer with residual + normed = self.ln1(x) # Normalize input + attended = self.attention(normed, mask) # Self-attention + x = x + attended # Residual connection + + # MLP sub-layer with residual + normed = self.ln2(x) # Normalize again + mlp_out = self.mlp(normed) # Feed-forward + x = x + mlp_out # Residual connection + + return x +``` + +**Pre-Norm vs Post-Norm:** + +**Pre-Norm (What We Implement):** +``` +x โ†’ LayerNorm โ†’ Attention โ†’ Add(x) โ†’ output +``` +- LayerNorm **before** sub-layers (attention, MLP) +- Better gradient flow for deep models (>12 layers) +- Modern standard in GPT-3, GPT-4, LLaMA, Claude + +**Post-Norm (Original Transformer Paper):** +``` +x โ†’ Attention โ†’ Add(x) โ†’ LayerNorm โ†’ output +``` +- LayerNorm **after** sub-layers +- Used in original "Attention is All You Need" paper +- Struggles with very deep networks (gradient issues) + +**Why Pre-Norm Wins:** +1. **Clean inputs**: Each sub-layer receives normalized input (stable mean/variance) +2. **Direct gradient path**: Residual connections bypass normalization during backprop +3. **Deeper networks**: Enables training 100+ layer transformers (GPT-4 has ~120 layers) + +### GPT - Complete Decoder-Only Architecture + +GPT (Generative Pre-trained Transformer) is the complete autoregressive language model combining embeddings, transformer blocks, and generation capability. It's **decoder-only** with causal masking preventing future token leakage. + +```python +class GPT: + """Complete GPT decoder for autoregressive language modeling. + + Architecture: + Input tokens โ†’ Token Embedding + Positional Embedding โ†’ + TransformerBlocks (with causal masking) โ†’ + LayerNorm โ†’ Linear(embed_dim โ†’ vocab_size) โ†’ Logits + + Key Feature: Causal masking ensures position i only attends to positions โ‰ค i + """ + def __init__(self, vocab_size, embed_dim, num_layers, num_heads, max_seq_len=1024): + # Embedding layers + self.token_embedding = Embedding(vocab_size, embed_dim) + self.position_embedding = Embedding(max_seq_len, embed_dim) + + # Stack of transformer blocks + self.blocks = [TransformerBlock(embed_dim, num_heads) + for _ in range(num_layers)] + + # Output layers + self.ln_f = LayerNorm(embed_dim) # Final layer norm + self.lm_head = Linear(embed_dim, vocab_size) # Vocab projection + + def forward(self, tokens): + """Forward pass through GPT decoder. + + Args: + tokens: (batch, seq_len) token indices + Returns: logits: (batch, seq_len, vocab_size) unnormalized predictions """ - batch_size, seq_len = token_ids.shape - - # Embeddings - token_embeds = self.token_embedding(token_ids) - pos_embeds = self.position_embedding(seq_len) - x = token_embeds + pos_embeds # (batch, seq_len, d_model) - - # Create causal mask - causal_mask = create_causal_mask(seq_len) - + batch_size, seq_len = tokens.shape + + # Embeddings: tokens + positions + token_emb = self.token_embedding(tokens) + positions = Tensor(np.arange(seq_len).reshape(1, seq_len)) + pos_emb = self.position_embedding(positions) + x = token_emb + pos_emb # (batch, seq_len, embed_dim) + + # Causal mask: prevent attending to future positions + mask = self._create_causal_mask(seq_len) + # Transformer blocks for block in self.blocks: - x = block(x, mask=causal_mask) - + x = block(x, mask=mask) + # Output projection - x = self.norm(x) - logits = self.output_proj(x) # (batch, seq_len, vocab_size) - + x = self.ln_f(x) + logits = self.lm_head(x) # (batch, seq_len, vocab_size) + return logits - - def generate(self, start_tokens, max_new_tokens, temperature=1.0): + + def _create_causal_mask(self, seq_len): + """Create causal mask: upper triangular matrix with -inf. + + Mask ensures position i can only attend to positions j where j โ‰ค i. + After softmax, -inf becomes probability 0. + """ + mask = np.triu(np.ones((seq_len, seq_len)) * -np.inf, k=1) + return Tensor(mask) + + def generate(self, prompt_tokens, max_new_tokens=50, temperature=1.0): """Autoregressive text generation. - + Args: - start_tokens: (batch, start_len) initial sequence + prompt_tokens: (batch, prompt_len) initial sequence max_new_tokens: Number of tokens to generate temperature: Sampling temperature (higher = more random) - + Returns: - generated: (batch, start_len + max_new_tokens) full sequence + generated: (batch, prompt_len + max_new_tokens) full sequence """ - generated = start_tokens - + current = Tensor(prompt_tokens.data.copy()) + for _ in range(max_new_tokens): # Forward pass - logits = self.forward(generated) # (batch, seq_len, vocab_size) - - # Get logits for last position - next_token_logits = logits[:, -1, :] / temperature - + logits = self.forward(current) + + # Get last position logits + next_logits = logits.data[:, -1, :] / temperature + # Sample from distribution - probs = softmax(next_token_logits, dim=-1) - next_token = sample(probs) # (batch, 1) - + probs = softmax(next_logits) + next_token = sample(probs) + # Append to sequence - generated = concat([generated, next_token], dim=1) - - return generated + current = concat([current, next_token], axis=1) + + return current ``` -**BERT-Style Encoder - Bidirectional Understanding** -```python -class BERTEncoder: - """BERT-style encoder for bidirectional sequence understanding. - - Architecture: - Input tokens โ†’ Embed + PositionalEncoding - โ†’ TransformerBlocks (no causal masking) - โ†’ Task-specific head (classification, QA, etc.) - - Features: - - Bidirectional: each position attends to all positions - - Pre-training: masked language modeling (MLM) - - Fine-tuning: task-specific heads added - """ - def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, max_len): - self.token_embedding = Embedding(vocab_size, d_model) - self.position_embedding = LearnedPositionalEmbedding(max_len, d_model) - - self.blocks = [TransformerBlock(d_model, num_heads, d_ff) - for _ in range(num_layers)] - - self.norm = LayerNorm(d_model) - - def forward(self, token_ids, attention_mask=None): - """Forward pass through encoder. - - Args: - token_ids: (batch, seq_len) - attention_mask: Optional mask for padding tokens - - Returns: - embeddings: (batch, seq_len, d_model) contextualized representations - """ - # Embeddings - token_embeds = self.token_embedding(token_ids) - pos_embeds = self.position_embedding(token_ids.shape[1]) - x = token_embeds + pos_embeds - - # Transformer blocks (bidirectional - no causal mask) - for block in self.blocks: - x = block(x, mask=attention_mask) - - x = self.norm(x) - return x +**Causal Masking Visualization:** +``` +Sequence: ["The", "cat", "sat", "on"] +Positions: 0 1 2 3 + +Attention Matrix (โœ“ = can attend, โœ— = masked): + To: 0 1 2 3 +From 0: [ โœ“ โœ— โœ— โœ— ] โ† "The" only sees itself +From 1: [ โœ“ โœ“ โœ— โœ— ] โ† "cat" sees "The" + itself +From 2: [ โœ“ โœ“ โœ“ โœ— ] โ† "sat" sees all previous +From 3: [ โœ“ โœ“ โœ“ โœ“ ] โ† "on" sees everything + +Implementation: Upper triangular with -โˆž +[[ 0, -โˆž, -โˆž, -โˆž], + [ 0, 0, -โˆž, -โˆž], + [ 0, 0, 0, -โˆž], + [ 0, 0, 0, 0]] + +After softmax: -โˆž โ†’ probability 0 ``` -### Step-by-Step Implementation +**Temperature Sampling:** +- **Low temperature (0.1-0.5)**: Conservative, deterministic (picks highest probability) +- **Medium temperature (1.0)**: Balanced sampling from probability distribution +- **High temperature (1.5-2.0)**: Creative, random (flattens distribution) -1. **Build Feedforward Network** - - Two linear layers with expansion factor (4ร—) - - Add ReLU activation between layers - - Include dropout for regularization - - Test with different d_ff values +### Decoder-Only Architecture Choice -2. **Implement Layer Normalization** - - Compute mean and std across feature dimension - - Add learnable scale (gamma) and shift (beta) - - Handle numerical stability with epsilon - - Compare with batch normalization +This module implements **decoder-only GPT architecture**. Here's why this choice dominates modern LLMs: -3. **Create Transformer Block** - - Add multi-head attention sub-layer - - Implement residual connections - - Add layer normalization (pre-norm placement) - - Include feedforward sub-layer - - Test forward and backward passes +**Decoder-Only (GPT) - What We Build:** +- **Attention**: Causal masking (position i only sees positions โ‰ค i) +- **Training**: Next-token prediction (autoregressive objective) +- **Use cases**: Text generation, code completion, dialogue, instruction following +- **Examples**: GPT-3/4, ChatGPT, Claude, LLaMA, PaLM, Gemini LLMs -4. **Build GPT Decoder** - - Stack transformer blocks - - Add token and position embeddings - - Implement causal masking - - Add output projection to vocabulary - - Implement autoregressive generation +**Encoder-Only (BERT) - Not Implemented:** +- **Attention**: Bidirectional (all positions see all positions) +- **Training**: Masked language modeling (predict masked tokens) +- **Use cases**: Classification, NER, question answering, search ranking +- **Examples**: BERT, RoBERTa (Google Search uses BERT for ranking) -5. **Build BERT Encoder** - - Stack transformer blocks without causal mask - - Add bidirectional attention - - Implement padding mask handling - - Test on classification tasks - - Compare with decoder architecture +**Encoder-Decoder (T5) - Not Implemented:** +- **Attention**: Encoder is bidirectional, decoder is causal +- **Training**: Sequence-to-sequence tasks +- **Use cases**: Translation, summarization +- **Examples**: T5, BART (Google Translate uses encoder-decoder) + +**Why Decoder-Only Won:** +1. **Simplicity**: Single architecture type (no encoder-decoder coordination) +2. **Scalability**: Predictable scaling laws with parameters and data +3. **Versatility**: Handles both understanding and generation tasks +4. **Efficiency**: Simpler to implement and optimize than encoder-decoder + +## Getting Started + +### Prerequisites + +Ensure you understand the foundations from previous modules: + +```bash +# Activate TinyTorch environment +source bin/activate-tinytorch.sh + +# Verify prerequisite modules +tito test --module embeddings +tito test --module attention +``` + +**Required Background:** +- **Module 11 (Embeddings)**: Token and positional embeddings for input representation +- **Module 12 (Attention)**: Multi-head attention mechanism for sequence modeling +- **Module 05 (Autograd)**: Automatic differentiation for training deep networks +- **Module 02 (Activations)**: GELU activation used in MLP layers + +### Development Workflow + +1. **Open the development file**: `modules/13_transformers/transformers.py` +2. **Implement LayerNorm**: Normalize across feature dimension with learnable scale/shift parameters (gamma, beta) +3. **Build MLP**: Two linear layers with 4x expansion ratio and GELU activation (position-wise transformation) +4. **Create TransformerBlock**: Combine attention and MLP with pre-norm residual connections (LayerNorm before sub-layers) +5. **Add GPT model**: Stack transformer blocks with token+positional embeddings, causal masking, and generation +6. **Export and verify**: `tito module complete 13 && tito test --module transformers` ## Testing -### Inline Tests (During Development) +### Comprehensive Test Suite + +Run the full test suite to verify transformer functionality: -Run inline tests while building: ```bash -cd modules/13_transformers -python transformers_dev.py +# TinyTorch CLI (recommended) +tito test --module transformers + +# Direct pytest execution +python -m pytest tests/ -k transformers -v ``` -Expected output: -``` -Unit Test: Transformer block... -โœ… Attention + FFN sub-layers work correctly -โœ… Residual connections preserve gradient flow -โœ… Layer normalization stabilizes training -Progress: Transformer Block โœ“ +### Test Coverage Areas -Unit Test: GPT decoder... -โœ… 12-layer decoder initialized successfully -โœ… Causal masking prevents future information leak -โœ… Text generation produces coherent sequences -Progress: GPT Decoder โœ“ +- โœ… **LayerNorm**: Feature-wise normalization (meanโ‰ˆ0, stdโ‰ˆ1), learnable gamma/beta parameters, numerical stability with epsilon +- โœ… **MLP**: 4x expansion ratio (embed_dim โ†’ 4*embed_dim โ†’ embed_dim), GELU activation, shape preservation +- โœ… **TransformerBlock**: Pre-norm architecture (LayerNorm before sub-layers), residual connections (x + sublayer), attention+MLP composition +- โœ… **GPT Model**: Forward pass shape correctness (batch, seq, vocab_size), causal masking preventing future leakage, autoregressive generation +- โœ… **Generation**: Temperature sampling (conservative vs creative), sequence extension, parameter counting validation -Unit Test: BERT encoder... -โœ… Bidirectional attention accesses all positions -โœ… Padding mask ignores padding tokens correctly -โœ… Encoder outputs contextualized representations -Progress: BERT Encoder โœ“ +### Inline Testing & Architecture Validation + +The module includes comprehensive architecture validation: + +```python +# Example inline test output +๐Ÿ”ฌ Unit Test: LayerNorm... +โœ… Mean โ‰ˆ 0, std โ‰ˆ 1 after normalization +โœ… Learnable gamma/beta parameters work +๐Ÿ“ˆ Progress: LayerNorm โœ“ + +๐Ÿ”ฌ Unit Test: MLP... +โœ… 4x expansion ratio correct (embed_dim โ†’ 4*embed_dim) +โœ… Shape preserved (input: [2,10,64] โ†’ output: [2,10,64]) +โœ… GELU activation applied +๐Ÿ“ˆ Progress: MLP โœ“ + +๐Ÿ”ฌ Unit Test: TransformerBlock... +โœ… Pre-norm residual connections work +โœ… Attention + MLP sub-layers compose correctly +โœ… Causal mask prevents future information leak +๐Ÿ“ˆ Progress: TransformerBlock โœ“ + +๐Ÿ”ฌ Unit Test: GPT Model... +โœ… Forward pass: [2,8] tokens โ†’ [2,8,100] logits +โœ… Generation: [1,5] prompt + 3 new โ†’ [1,8] sequence +โœ… Parameter counting validates all components +๐Ÿ“ˆ Progress: GPT Model โœ“ ``` -### Export and Validate +### Manual Testing Examples -After completing the module: -```bash -# Export to tinytorch package -tito export 13_transformers +```python +from transformers import GPT, TransformerBlock, LayerNorm, MLP -# Run integration tests -tito test 13_transformers +# Test LayerNorm +ln = LayerNorm(512) +x = Tensor(np.random.randn(2, 10, 512)) # (batch, seq, features) +normalized = ln.forward(x) +print(f"Mean: {normalized.mean():.4f}, Std: {normalized.std():.4f}") # โ‰ˆ 0, โ‰ˆ 1 + +# Test MLP +mlp = MLP(embed_dim=512) +output = mlp.forward(x) +assert output.shape == (2, 10, 512) # Shape preserved + +# Test TransformerBlock +block = TransformerBlock(embed_dim=512, num_heads=8) +mask = Tensor(np.triu(np.ones((10, 10)) * -np.inf, k=1)) # Causal mask +transformed = block.forward(x, mask=mask) + +# Test GPT +gpt = GPT(vocab_size=50000, embed_dim=768, num_layers=12, num_heads=12) +tokens = Tensor(np.random.randint(0, 50000, (4, 512))) # Batch of sequences +logits = gpt.forward(tokens) # (4, 512, 50000) + +# Test generation +prompt = Tensor(np.array([[15496, 1917]])) # "Hello world" +generated = gpt.generate(prompt, max_new_tokens=50, temperature=0.8) +print(f"Generated {generated.shape[1] - prompt.shape[1]} new tokens") ``` -## Where This Code Lives +## Where This Code Lives in the Final Package +**Package Export:** Code exports to `tinytorch.models.transformer` + +```python +# When students install tinytorch, they import your work like this: +from tinytorch.models.transformer import GPT, TransformerBlock +from tinytorch.nn import LayerNorm, MLP # Your normalization and feed-forward implementations +from tinytorch.core.tensor import Tensor # Foundation from Module 01 +from tinytorch.core.attention import MultiHeadAttention # From Module 12 +from tinytorch.text.embeddings import Embedding # From Module 11 + +# Example: Build a GPT-2 scale model +gpt2 = GPT( + vocab_size=50257, # GPT-2 BPE vocabulary + embed_dim=768, # GPT-2 Small dimension + num_layers=12, # 12 transformer blocks + num_heads=12, # 12 attention heads + max_seq_len=1024 # 1K token context +) + +# Forward pass +tokens = Tensor([[15496, 1917, 318, 281]]) # "This is a" +logits = gpt2.forward(tokens) # (1, 4, 50257) + +# Autoregressive generation +generated = gpt2.generate( + prompt_tokens=tokens, + max_new_tokens=100, + temperature=0.7 # Balanced creativity +) + +# Example: Build transformer components directly +block = TransformerBlock(embed_dim=512, num_heads=8, mlp_ratio=4) +ln = LayerNorm(512) +mlp = MLP(embed_dim=512, hidden_dim=2048) +``` + +**Package Structure:** ``` tinytorch/ โ”œโ”€โ”€ models/ -โ”‚ โ”œโ”€โ”€ transformer.py # Transformer blocks -โ”‚ โ”œโ”€โ”€ gpt.py # GPT decoder -โ”‚ โ””โ”€โ”€ bert.py # BERT encoder -โ””โ”€โ”€ __init__.py # Exposes transformer models - -Usage in other modules: ->>> from tinytorch.models import GPTDecoder, BERTEncoder ->>> gpt = GPTDecoder(vocab_size=50000, d_model=768, num_layers=12, num_heads=12, d_ff=3072, max_len=1024) ->>> generated_text = gpt.generate(start_tokens, max_new_tokens=100) +โ”‚ โ””โ”€โ”€ transformer.py # GPT, TransformerBlock +โ”œโ”€โ”€ nn/ +โ”‚ โ”œโ”€โ”€ feedforward.py # MLP implementation +โ”‚ โ””โ”€โ”€ normalization.py # LayerNorm implementation +โ”œโ”€โ”€ core/ +โ”‚ โ”œโ”€โ”€ attention.py # MultiHeadAttention (Module 12) +โ”‚ โ””โ”€โ”€ layers.py # Linear layers +โ””โ”€โ”€ text/ + โ””โ”€โ”€ embeddings.py # Embedding, PositionalEncoding ``` ## Systems Thinking Questions -1. **Layer Depth Trade-offs**: GPT-3 has 96 layers. What are the benefits? What are the challenges (training stability, memory, gradients)? Why can't we just use 1000 layers? +### Real-World Applications -2. **Residual Connections Necessity**: Remove residual connections from a 12-layer transformer. What happens during training? Why do gradients vanish? How do residuals solve this? +- **Large Language Models (OpenAI, Anthropic, Google)**: GPT-4 uses ~120-layer decoder stack trained on trillions of tokens. ChatGPT is GPT-3.5 with 96 layers and RLHF fine-tuning. Claude uses decoder-only architecture with constitutional AI training. All modern LLMs are transformer decoders because decoder-only architecture scales predictably with parameters and dataโ€”every 10ร— parameter increase yields ~5ร— better performance. -3. **Pre-Norm vs Post-Norm**: Original Transformer used post-norm (norm after sub-layer). Modern transformers use pre-norm (norm before). Why? What's the gradient flow difference? +- **Code Generation Systems (GitHub, Google, Meta)**: Copilot uses GPT-based decoder trained on billions of lines of GitHub code. AlphaCode uses transformer decoder for competitive programming. CodeLlama specialized 70B decoder for code completion. All leverage causal attention for autoregressive generation because programming requires left-to-right token prediction matching code syntax. -4. **Encoder vs Decoder Choice**: When would you use encoder-only (BERT), decoder-only (GPT), or encoder-decoder (T5)? What tasks suit each architecture? +- **Conversational AI (ChatGPT, Claude, Gemini)**: All modern chatbots use decoder-only transformers fine-tuned with RLHF (reinforcement learning from human feedback). Architecture is identical to base GPTโ€”conversation formatted as single sequence with special tokens. Production systems serve billions of queries daily requiring efficient KV caching to avoid recomputing past tokens. -5. **Memory Scaling**: A 12-layer transformer with d_model=768 has how many parameters? How does this scale with layers, dimensions, and vocabulary size? What's the memory footprint? +- **Production Scaling Challenges**: Training GPT-3 (175B parameters) required 3.14ร—10ยฒยณ FLOPs (floating point operations), consuming ~1,300 MWh of electricity. Inference costs dominate at scaleโ€”ChatGPT serves millions of users requiring thousands of GPUs. Memory is primary bottleneck: 175B parameters ร— 2 bytes (FP16) = 350GB just for model weights, plus activation memory during inference. -## Real-World Connections +### Architectural Foundations -### Industry Applications +- **Residual Connections Enable Deep Networks**: Without residuals, gradients vanish exponentially with depthโ€”in a 12-layer network without residuals, gradients at layer 1 are ~0.1ยนยฒ โ‰ˆ 10โปยนยฒ smaller than output gradients. Residuals create gradient highways: โˆ‚Loss/โˆ‚x = โˆ‚Loss/โˆ‚output ร— (1 + โˆ‚F(x)/โˆ‚x), ensuring gradient magnitude โ‰ฅ output gradient. This enables 100+ layer transformers (GPT-4 has ~120 layers). -**Large Language Models (OpenAI, Anthropic, Google)** -- GPT-4: 96-layer decoder stack, trained on trillions of tokens -- Claude: Decoder-only architecture with constitutional AI training -- PaLM 2: Decoder with 340B parameters across 64 layers -- Gemini: Multimodal transformer processing text, images, audio +- **Pre-Norm vs Post-Norm Architecture**: Pre-norm (LayerNorm before sub-layers) provides better gradient flow for deep models. In post-norm, gradients must flow through LayerNorm's division operation which can amplify small gradient differences. Pre-norm gives each sub-layer clean normalized inputs (mean=0, var=1) while residuals bypass the normalization during backprop. GPT-3, GPT-4, LLaMA all use pre-norm. -**Search and Understanding (Google, Microsoft)** -- BERT powers Google Search ranking for billions of queries daily -- Bing uses transformer encoder for semantic search -- Question-answering systems built on BERT fine-tuning -- Document understanding and summarization +- **Layer Normalization vs Batch Normalization**: LayerNorm normalizes across features per sample (batch-independent), BatchNorm normalizes across batch per feature (batch-dependent). Transformers use LayerNorm because: (1) Variable sequence lengths make batch statistics unstable, (2) Inference requires batch=1 support, (3) Empirically better for NLP. BatchNorm works for CNNs because spatial dimensions provide consistent normalization axis. -**Code Generation (GitHub, Google, Meta)** -- Copilot: GPT-based decoder trained on GitHub code -- AlphaCode: Transformer decoder for competitive programming -- CodeLlama: Specialized decoder for code completion -- All use decoder-only transformer architecture +- **MLP Expansion Ratio Trade-offs**: Standard 4ร— expansion (embed_dim=512 โ†’ hidden=2048) balances capacity with compute. MLP parameters dominate transformers: per layer, MLP has 8ร—embed_dimยฒ parameters vs attention's 4ร—embed_dimยฒ. Larger expansion (8ร—) increases capacity but quadratically increases memory and FLOPs. Some models experiment with 2ร— (faster) or gated MLPs (SwiGLU in LLaMA uses 5.33ร— effective expansion). -### Research Impact +### Performance Characteristics -This module implements patterns from: -- Transformer (Vaswani et al., 2017): The foundational architecture -- BERT (Devlin et al., 2018): Bidirectional encoder pre-training -- GPT-2/3 (Radford et al., 2019): Decoder-only scaling -- T5 (Raffel et al., 2020): Unified encoder-decoder framework +- **Quadratic Attention Memory Growth**: Attention computes (batch, heads, seq_len, seq_len) matrix requiring batchร—headsร—seq_lenยฒ elements. For GPT-3 with seq_len=2048, batch=4, heads=96: 4ร—96ร—2048ยฒ โ‰ˆ 1.6B elements ร— 4 bytes = 6.4GB per layer just for attention matrices. Doubling sequence length quadruples attention memory. This is why 8K context requires 4ร— memory vs 4K context. -## What's Next? +- **Parameter Scaling**: Total parameters โ‰ˆ vocab_sizeร—embed_dim (embeddings) + num_layersร—[4ร—embed_dimยฒ (attention) + 8ร—embed_dimยฒ (MLP)] โ‰ˆ num_layersร—12ร—embed_dimยฒ. GPT-3 has embed_dim=12,288, num_layers=96 โ†’ 96ร—12ร—12,288ยฒ โ‰ˆ 175B parameters. Storage: 175B ร— 2 bytes (FP16) = 350GB. Training requires 4ร— memory for gradients and optimizer states = 1.4TB per GPU. -In **Module 14: Profiling** (Optimization Tier), you'll learn to measure and analyze performance: +- **Computational Complexity**: Per layer: O(batchร—seq_lenยฒร—embed_dim) for attention + O(batchร—seq_lenร—embed_dimยฒ) for MLP. For short sequences (seq_len < embed_dim), MLP dominates. For long sequences (seq_len > embed_dim), attention dominates. GPT-3 with seq_len=2048, embed_dim=12,288: attention is 2048ยฒร—12,288 โ‰ˆ 51B FLOPs vs MLP 2048ร—12,288ยฒ โ‰ˆ 309B FLOPsโ€”MLP dominates even at 2K tokens. -- Profile time, memory, and compute for transformer operations -- Identify bottlenecks in attention, feedforward, and embedding layers -- Measure FLOPs and memory bandwidth utilization -- Build the foundation for data-driven optimization +- **Generation Efficiency**: Autoregressive generation requires one forward pass per token. For 100 tokens through 96-layer network: 100ร—96 = 9,600 layer evaluations. KV caching optimizes this: cache key-value pairs from previous positions, reducing attention from O(nยฒ) to O(n) during generation. Without KV cache, 100-token generation takes ~10ร— longer. Production systems always use KV caching. -The transformers you built are completeโ€”now it's time to understand their performance characteristics! +- **Memory-Compute Trade-offs**: Gradient checkpointing trades compute for memory by recomputing activations during backward pass instead of storing them. Saves ~50% activation memory but increases training time ~20%. Mixed precision training (FP16/BF16 forward, FP32 gradients) reduces memory by 50% and increases throughput by 2-3ร— on modern GPUs with tensor cores. + +## Reflection Questions + +1. **Residual Connection Necessity**: Remove residual connections from a 12-layer transformer. What happens during training? Calculate gradient flow: if each layer multiplies gradients by 0.5, what's the gradient at layer 1 after 12 layers? (0.5ยนยฒ โ‰ˆ 0.0002). How do residuals solve this by providing gradient highways that bypass layer computations? + +2. **Pre-Norm vs Post-Norm Trade-offs**: Original Transformer paper used post-norm (LayerNorm after sub-layers). Modern transformers use pre-norm (LayerNorm before). Why? Consider gradient flow: in post-norm, gradients pass through LayerNorm's division which can amplify noise. In pre-norm, residuals bypass normalization. When does pre-norm become critical (how many layers)? + +3. **Attention Memory Quadratic Growth**: For seq_len=1024, batch=4, heads=8, attention matrix is 4ร—8ร—1024ร—1024 = 33.5M elements ร— 4 bytes = 134MB per layer. What happens at seq_len=4096? (ร—16 memory = 2.1GB per layer). Why is this quadratic growth the primary bottleneck for long-context models? How does FlashAttention address this? + +4. **Parameter Scaling Analysis**: GPT-3 has embed_dim=12,288, num_layers=96. Calculate approximate parameters: embeddings โ‰ˆ 50K vocab ร— 12,288 = 614M. Per layer: attention โ‰ˆ 4ร—12,288ยฒ = 604M, MLP โ‰ˆ 8ร—12,288ยฒ = 1.2B. Total per layer โ‰ˆ 1.8B. 96 layers ร— 1.8B = 173B. Compare to measured 175B. What's the parameter distribution? + +5. **Decoder-Only vs Encoder-Decoder**: Why did decoder-only (GPT) dominate over encoder-decoder (T5) for LLMs? Consider: (1) Simplicity of single architecture, (2) Scaling laws holding predictably, (3) Versatility handling both understanding and generation. When would you still choose encoder-decoder (translation, summarization)? + +6. **Generation Efficiency**: Generating 100 tokens through 96-layer GPT-3 without KV caching requires 100 forward passes through all 96 layers = 9,600 layer evaluations. With KV caching, only new token processed through layers = 96 evaluations per token = 9,600 total. Same compute! But KV cache requires storing keys and values for all positions. Calculate memory for seq_len=2048: 2ร—(num_layersร—batchร—headsร—seq_lenร—head_dim) elements. What's the memory-compute trade-off? + +## Ready to Build? + +You're about to implement the transformer architecture that powers virtually all modern AI systems! The decoder-only GPT architecture you'll build is the exact design used in ChatGPT, GPT-4, Claude, and every major language model. This isn't a simplified educational versionโ€”it's the real production architecture that revolutionized AI. + +Understanding transformers from first principlesโ€”implementing layer normalization, feed-forward networks, residual connections, and causal attention yourselfโ€”will give you deep insight into how production ML systems work. You'll understand why GPT-4 has 120 layers, why residual connections prevent gradient vanishing in deep networks, why pre-norm architecture enables training very deep models, and how attention memory scales quadratically with sequence length. + +This module is the culmination of your Architecture Tier journey. You've built tensors (Module 01), activations (Module 02), layers (Module 03), embeddings (Module 11), and attention (Module 12). Now you'll compose them into the complete transformer model that matches PyTorch's `nn.TransformerDecoder` and powers billion-dollar AI systems. Take your time, test thoroughly, and enjoy building the architecture behind ChatGPT, Claude, and the AI revolution! + +Choose your preferred way to engage with this module: + +````{grid} 1 2 3 3 + +```{grid-item-card} ๐Ÿš€ Launch Binder +:link: https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/13_transformers/transformers.ipynb +:class-header: bg-light + +Run this module interactively in your browser. No installation required! +``` + +```{grid-item-card} โšก Open in Colab +:link: https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/13_transformers/transformers.ipynb +:class-header: bg-light + +Use Google Colab for GPU access and cloud compute power. +``` + +```{grid-item-card} ๐Ÿ“– View Source +:link: https://github.com/mlsysbook/TinyTorch/blob/main/modules/13_transformers/transformers.py +:class-header: bg-light + +Browse the Python source code and understand the implementation. +``` + +```` + +```{admonition} ๐Ÿ’พ Save Your Progress +:class: tip +**Binder sessions are temporary!** Download your completed notebook when done, or switch to local development for persistent work. + +``` --- -**Ready to build GPT and BERT from scratch?** Open `modules/13_transformers/transformers_dev.py` and start implementing. + diff --git a/modules/14_profiling/ABOUT.md b/modules/14_profiling/ABOUT.md index c5a820ce..1fdef04e 100644 --- a/modules/14_profiling/ABOUT.md +++ b/modules/14_profiling/ABOUT.md @@ -1,451 +1,673 @@ --- -title: "Profiling - Performance Analysis and Optimization" -description: "Build profilers to identify bottlenecks and guide optimization decisions" -difficulty: 3 +title: "Profiling - Performance Measurement for ML Systems" +description: "Build profilers that measure parameters, FLOPs, memory, and latency to guide optimization decisions" +difficulty: "โญโญโญ" time_estimate: "5-6 hours" -prerequisites: ["All modules 01-13"] -next_steps: ["Memoization"] +prerequisites: ["Modules 01-13 - Complete ML implementation stack"] +next_steps: ["Module 15 - Quantization"] learning_objectives: - - "Implement timing profilers with statistical rigor for accurate measurements" - - "Design memory profilers to track allocation patterns and identify leaks" - - "Build FLOP counters to measure computational complexity" - - "Understand performance bottlenecks across different architectures" - - "Apply data-driven analysis to guide optimization priorities" + - "Implement parameter counting to predict model memory requirements" + - "Build FLOP counters to measure computational complexity across architectures" + - "Create memory profilers that track allocations and identify usage patterns" + - "Design timing profilers with statistical rigor to measure latency accurately" + - "Apply profiling data to identify bottlenecks and prioritize optimizations" --- -# 14. Profiling +# 14. Profiling - Performance Measurement for ML Systems -**โšก OPTIMIZATION TIER** | Difficulty: โญโญโญ (3/4) | Time: 5-6 hours +**OPTIMIZATION TIER** | Difficulty: โญโญโญ (3/4) | Time: 5-6 hours ## Overview -Build comprehensive profiling tools to measure where time and memory go in your ML systems. This module implements timing profilers, memory trackers, and FLOP counters that reveal bottlenecks and guide optimization decisions. +Build profiling tools that measure where compute and memory go in ML systems. This module implements parameter counters, FLOP analyzers, memory trackers, and timing profilers with statistical rigor. You'll profile real models to identify bottlenecksโ€”memory-bound vs compute-bound, attention vs feedforward, batch size effectsโ€”and use data to guide optimization decisions. -## Learning Objectives - -By completing this module, you will be able to: - -1. **Implement timing profilers** with statistical rigor (multiple runs, confidence intervals) for accurate measurements -2. **Design memory profilers** to track allocation patterns, peak usage, and identify memory leaks -3. **Build FLOP counters** to measure theoretical computational complexity of different operations -4. **Understand performance bottlenecks** by comparing MLPs, CNNs, and Transformers systematically -5. **Apply data-driven analysis** to prioritize optimization efforts based on actual impact +**Optimization Tier Focus**: Modules 1-13 taught you to build ML systems. Modules 14-20 teach you to measure and optimize them. Profiling is the foundationโ€”you can't optimize what you don't measure. ## Why This Matters -### Production Context +### Production Context: Profiling Drives Optimization Economics -Profiling is mandatory for production ML systems: +Every major ML organization profiles extensively: -- **Google TPU teams** profile every operation to optimize hardware utilization -- **OpenAI** profiles GPT training to identify $millions in compute savings -- **Meta** profiles inference to serve billions of requests per day efficiently -- **NVIDIA** uses profiling to optimize cuDNN kernels for peak performance +- **Google TPU teams** profile every kernel to achieve 40-50% MFU (Model FLOPs Utilization), translating to millions in compute savings +- **OpenAI** profiles GPT training runs to identify gradient checkpointing opportunities, reducing memory by 10ร— with minimal speed cost +- **Meta** profiles PyTorch inference serving billions of requests daily, using data to guide operator fusion and quantization decisions +- **NVIDIA** uses Nsight profiler to optimize cuDNN kernels, achieving near-theoretical-peak performance on tensor cores -### Historical Context +**The Economics**: A 10% optimization on a $10M training run saves $1M. But only if you measure firstโ€”guessing wastes engineering time on non-bottlenecks. + +### Historical Evolution: From Ad-Hoc Timing to Systematic Measurement Profiling evolved with ML scale: -- **Early ML (pre-2012)**: Ad-hoc timing with `time.time()`; no systematic profiling -- **Deep Learning Era (2012-2017)**: NVIDIA profiler, TensorBoard timing; focus on GPU utilization -- **Production Scale (2018+)**: Comprehensive profiling (compute, memory, I/O, network); optimization critical for economics -- **Modern Systems (2020+)**: Automated profiling and optimization; ML compilers use profiling data +- **Pre-2012 (Small models)**: Ad-hoc timing with `time.time()`, no systematic methodology +- **2012-2017 (Deep learning era)**: NVIDIA profiler, TensorBoard timing; focus on GPU utilization +- **2018+ (Production scale)**: Comprehensive profiling (compute, memory, I/O, network); optimization becomes economically critical +- **2020+ (Modern systems)**: Automated profiling guides ML compilers; tools like PyTorch Profiler integrate with training workflows -Without profiling, you're optimizing blindโ€”profiling shows you where to focus. +### What You'll Actually Build -## Pedagogical Pattern: Build โ†’ Use โ†’ Optimize +Let's be precise about what you implement in this module: -### 1. Build +**You WILL build**: +- Parameter counter: Walks model structure, sums weight and bias elements +- FLOP counter: Calculates theoretical operations for Linear, Conv2d based on dimensions +- Memory profiler: Uses Python's tracemalloc to track allocations during forward/backward +- Timing profiler: Uses time.perf_counter() with warmup runs and statistical analysis (median latency) -Implement from first principles: -- High-precision timing with multiple runs -- Statistical analysis (mean, std, confidence intervals) -- Memory profiler tracking allocations and deallocations -- FLOP counter for theoretical complexity -- Comparative profiler across architectures +**You will NOT build** (these are production tools requiring kernel instrumentation): +- GPU profiler (requires CUDA kernel hooks) +- PyTorch Profiler integration (requires autograd instrumentation) +- Operator-level timeline traces (requires framework integration) -### 2. Use +**Why this scope matters**: You'll understand profiling fundamentals that transfer to production tools. The techniques you implement (parameter counting formulas, FLOP calculations, statistical timing) are exactly what PyTorch Profiler and TensorBoard use internally. You're building the same measurement primitives, just without kernel-level instrumentation. -Apply to real problems: -- Profile attention vs feedforward in transformers -- Compare MLP vs CNN vs Transformer efficiency -- Identify memory bottlenecks in training loops -- Measure impact of batch size on throughput -- Analyze scaling behavior with model size +## Learning Objectives -### 3. Optimize +By the end of this module, you will be able to: -Production insights: -- Prioritize optimizations by impact (80/20 rule) -- Measure before/after optimization -- Understand hardware utilization (CPU vs GPU) -- Identify memory bandwidth vs compute bottlenecks -- Build optimization roadmap based on data +- **Count parameters accurately**: Predict model size and memory footprint by counting weights and biases across different layer types +- **Measure computational cost**: Implement FLOP counters that calculate theoretical compute for matrix multiplications, convolutions, and attention operations +- **Track memory usage**: Build memory profilers using tracemalloc to measure parameter, activation, and gradient memory during forward and backward passes +- **Profile latency rigorously**: Create timing profilers with warmup runs, multiple iterations, and statistical analysis (median, confidence intervals) +- **Identify performance bottlenecks**: Analyze profiling data to distinguish memory-bound from compute-bound operations and prioritize optimization efforts + +## Build โ†’ Use โ†’ Reflect + +This module follows TinyTorch's **Build โ†’ Use โ†’ Reflect** framework: + +1. **Build**: Implement Profiler class with parameter counting, FLOP calculation, memory tracking, and latency measurement using time.perf_counter() and tracemalloc +2. **Use**: Profile complete models to measure characteristics, compare MLP vs attention operations, analyze batch size impact on throughput, and benchmark different architectures +3. **Reflect**: Where does compute time actually go in transformers? When is your system memory-bound vs compute-bound? How do measurement choices affect optimization decisions? ## Implementation Guide -### Core Components +### Core Component: Profiler Class + +The Profiler class provides comprehensive performance analysis: -**High-Precision Timer** ```python -class Timer: - """High-precision timing with statistical analysis. - - Performs multiple runs to account for variance and noise. - Reports mean, std, and confidence intervals. - - Example: - timer = Timer() - with timer: - model.forward(x) - print(f"Time: {timer.mean:.3f}ms ยฑ {timer.std:.3f}ms") +class Profiler: + """Professional-grade ML model profiler. + + Measures parameters, FLOPs, memory, and latency with statistical rigor. + Used for bottleneck identification and optimization guidance. """ - def __init__(self, num_runs=10, warmup_runs=3): - self.num_runs = num_runs - self.warmup_runs = warmup_runs - self.times = [] - - def __enter__(self): - # Warmup runs (not counted) - for _ in range(self.warmup_runs): - start = time.perf_counter() - # Operation happens in with block - + + def __init__(self): + self.measurements = {} + self.operation_counts = defaultdict(int) + + def count_parameters(self, model) -> int: + """Count total trainable parameters. + + Returns: + Total parameter count (e.g., 125M for GPT-2 Small) + """ + total = 0 + if hasattr(model, 'parameters'): + for param in model.parameters(): + total += param.data.size # Count elements + return total + + def count_flops(self, model, input_shape: Tuple) -> int: + """Count FLOPs (Floating Point Operations) for forward pass. + + Linear layer: 2 ร— M ร— K ร— N (matmul is Mร—K @ Kร—N) + Conv2d: 2 ร— output_h ร— output_w ร— kernel_h ร— kernel_w ร— in_ch ร— out_ch + + Returns: + Total FLOPs for one forward pass (hardware-independent) + """ + # Implementation calculates based on layer type and dimensions + + def measure_memory(self, model, input_shape: Tuple) -> Dict: + """Measure memory usage during forward pass. + + Uses tracemalloc to track: + - Parameter memory (weights, biases) + - Activation memory (intermediate tensors) + - Peak memory (maximum allocation) + + Returns: + Dict with memory breakdown in MB + """ + tracemalloc.start() + # Run forward pass, measure peak allocation + + def measure_latency(self, model, input_tensor, + warmup: int = 10, iterations: int = 100) -> float: + """Measure inference latency with statistical rigor. + + Protocol: + 1. Warmup runs (cache warming, JIT compilation) + 2. Multiple measurements (statistical significance) + 3. Median calculation (robust to outliers) + + Returns: + Median latency in milliseconds + """ + # Warmup runs (discard results) + for _ in range(warmup): + _ = model.forward(input_tensor) + # Timed runs - self.start_time = time.perf_counter() - return self - - def __exit__(self, *args): - elapsed = time.perf_counter() - self.start_time - self.times.append(elapsed * 1000) # Convert to ms - - @property - def mean(self): - return np.mean(self.times) - - @property - def std(self): - return np.std(self.times) - - @property - def confidence_interval(self, confidence=0.95): - """95% confidence interval using t-distribution.""" - from scipy import stats - ci = stats.t.interval(confidence, len(self.times)-1, - loc=self.mean, scale=stats.sem(self.times)) - return ci - - def report(self): - ci = self.confidence_interval() - return f"{self.mean:.3f}ms ยฑ {self.std:.3f}ms (95% CI: [{ci[0]:.3f}, {ci[1]:.3f}])" + times = [] + for _ in range(iterations): + start = time.perf_counter() # High-precision timer + _ = model.forward(input_tensor) + times.append((time.perf_counter() - start) * 1000) # Convert to ms + + return np.median(times) # Median is robust to outliers ``` -**Memory Profiler** +### Parameter Counting: Memory Footprint Analysis + +Parameter counting predicts model size and memory requirements: + +```python +# Linear layer example +layer = Linear(768, 3072) # GPT-2 feedforward dimension + +# Manual calculation: +weight_params = 768 ร— 3072 = 2,359,296 +bias_params = 3072 +total_params = 2,362,368 + +# Memory at FP32 (4 bytes per parameter): +memory_bytes = 2,362,368 ร— 4 = 9,449,472 bytes = 9.01 MB + +# Profiler implementation: +profiler = Profiler() +count = profiler.count_parameters(layer) +assert count == 2_362_368 + +# Why this matters: +# GPT-2 Small: 124M params โ†’ 496 MB +# GPT-2 XL: 1.5B params โ†’ 6.0 GB +# Knowing parameter count predicts deployment hardware requirements +``` + +**Parameter Counting Strategy**: +- Linear layers: `(input_features ร— output_features) + output_features` +- Conv2d layers: `(kernel_h ร— kernel_w ร— in_channels ร— out_channels) + out_channels` +- Embeddings: `vocab_size ร— embedding_dim` +- Attention: Count Q/K/V projection weights separately + +### FLOP Counting: Computational Cost Analysis + +FLOPs measure compute independently of hardware: + +```python +# Matrix multiplication FLOP calculation +# C = A @ B where A is (M, K) and B is (K, N) + +def count_matmul_flops(M, K, N): + """Each output element C[i,j] requires K multiply-adds. + + Total outputs: M ร— N + FLOPs per output: 2 ร— K (multiply + add) + Total FLOPs: 2 ร— M ร— K ร— N + """ + return 2 * M * K * N + +# Example: GPT-2 feedforward forward pass +batch_size = 32 +seq_len = 512 +d_model = 768 +d_ff = 3072 + +# First linear: (batch ร— seq, d_model) @ (d_model, d_ff) +flops_1 = count_matmul_flops(batch_size * seq_len, d_model, d_ff) +# = 2 ร— 16384 ร— 768 ร— 3072 = 77,309,411,328 FLOPs + +# Second linear: (batch ร— seq, d_ff) @ (d_ff, d_model) +flops_2 = count_matmul_flops(batch_size * seq_len, d_ff, d_model) +# = 2 ร— 16384 ร— 3072 ร— 768 = 77,309,411,328 FLOPs + +total_flops = flops_1 + flops_2 # ~154 GFLOPs for one feedforward layer + +# Hardware context: +# NVIDIA A100: 312 TFLOPS (FP16) โ†’ theoretical time = 154 / 312000 = 0.5 ms +# Actual time will be higher due to memory bandwidth and kernel overhead +``` + +**FLOP Formulas Reference**: +```python +# Linear layer +flops = 2 ร— batch_size ร— seq_len ร— input_features ร— output_features + +# Conv2d +flops = 2 ร— batch ร— output_h ร— output_w ร— kernel_h ร— kernel_w ร— in_ch ร— out_ch + +# Multi-head attention (simplified) +# QKV projections: 3 ร— linear projections +# Attention scores: batch ร— heads ร— seq ร— seq ร— d_k +# Attention weighting: batch ร— heads ร— seq ร— seq ร— d_k +# Output projection: 1 ร— linear projection +flops = (4 ร— batch ร— seq ร— d_model ร— d_model) + + (4 ร— batch ร— heads ร— seq ร— seq ร— d_k) +``` + +### Memory Profiling: Understanding Allocation Patterns + +Memory profiling reveals where RAM goes during training: + ```python class MemoryProfiler: - """Track memory allocations and peak usage. - - Monitors memory throughout execution to identify: - - Peak memory usage - - Memory leaks - - Allocation patterns - - Memory bandwidth bottlenecks - """ + """Track memory allocations and identify usage patterns.""" + def __init__(self): self.snapshots = [] - self.peak_memory = 0 - - def snapshot(self, label=""): - """Take memory snapshot at current point.""" + + def snapshot(self, label: str): + """Take memory snapshot at execution point.""" import psutil process = psutil.Process() mem_info = process.memory_info() - - snapshot = { + + self.snapshots.append({ 'label': label, - 'rss': mem_info.rss / 1024**2, # MB - 'vms': mem_info.vms / 1024**2, # MB + 'rss': mem_info.rss / 1024**2, # Resident Set Size (MB) 'timestamp': time.time() - } - self.snapshots.append(snapshot) - self.peak_memory = max(self.peak_memory, snapshot['rss']) - - return snapshot - + }) + def report(self): """Generate memory usage report.""" - print(f"Peak Memory: {self.peak_memory:.2f} MB") - print("\nMemory Timeline:") - for snap in self.snapshots: - print(f" {snap['label']:30s}: {snap['rss']:8.2f} MB") - - # Calculate memory growth - if len(self.snapshots) >= 2: - growth = self.snapshots[-1]['rss'] - self.snapshots[0]['rss'] - print(f"\nTotal Growth: {growth:+.2f} MB") - - # Check for potential memory leak - if growth > 100: # Arbitrary threshold - print("โš ๏ธ Potential memory leak detected!") + print("Memory Timeline:") + for i, snap in enumerate(self.snapshots): + delta = "" + if i > 0: + delta_val = snap['rss'] - self.snapshots[i-1]['rss'] + delta = f" ({delta_val:+.2f} MB)" + print(f" {snap['label']:30s}: {snap['rss']:8.2f} MB{delta}") + +# Example: Profile transformer forward pass +mem = MemoryProfiler() +mem.snapshot("baseline") + +# Forward pass +output = model.forward(input_tensor) +mem.snapshot("after_forward") + +# Backward pass +loss = criterion(output, target) +loss.backward() +mem.snapshot("after_backward") + +# Update weights +optimizer.step() +mem.snapshot("after_optimizer") + +mem.report() + +# Output interpretation: +# baseline : 1024.00 MB +# after_forward : 1124.00 MB (+100.00 MB) โ† Activation memory +# after_backward : 1624.00 MB (+500.00 MB) โ† Gradient memory +# after_optimizer : 2124.00 MB (+500.00 MB) โ† Adam state (momentum + velocity) +# +# Total training memory = 2.1ร— forward memory (for Adam optimizer) ``` -**FLOP Counter** +**Memory Components Breakdown**: +``` +Training Memory = Parameters + Activations + Gradients + Optimizer State + +Example for GPT-2 Small (124M parameters): +Parameters: 496 MB (124M ร— 4 bytes) +Activations: 200 MB (depends on batch size and sequence length) +Gradients: 496 MB (same as parameters) +Adam state: 992 MB (momentum + velocity = 2ร— parameters) +โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ +Total: 2184 MB (4.4ร— parameter memory!) + +Optimization strategies by component: +- Parameters: Quantization (reduce precision) +- Activations: Gradient checkpointing (recompute instead of store) +- Gradients: Mixed precision (FP16 gradients) +- Optimizer: SGD instead of Adam (0ร— vs 2ร— parameter memory) +``` + +### Latency Measurement: Statistical Timing Methodology + +Accurate latency measurement requires handling variance: + ```python -class FLOPCounter: - """Count floating-point operations for complexity analysis. - - Provides theoretical computational complexity independent of hardware. - Useful for comparing different architectural choices. - """ - def __init__(self): - self.total_flops = 0 - self.op_counts = {} - - def count_matmul(self, A_shape, B_shape): - """Count FLOPs for matrix multiplication. - - C = A @ B where A is (m, k) and B is (k, n) - FLOPs = 2*m*k*n (multiply-add for each output element) - """ - m, k = A_shape - k2, n = B_shape - assert k == k2, "Invalid matmul dimensions" - - flops = 2 * m * k * n - self.total_flops += flops - self.op_counts['matmul'] = self.op_counts.get('matmul', 0) + flops - return flops - - def count_attention(self, batch, seq_len, d_model, num_heads): - """Count FLOPs for multi-head attention. - - Components: - - Q,K,V projections: 3 * (batch * seq_len * d_model * d_model) - - Attention scores: batch * heads * seq_len * seq_len * d_k - - Attention weighting: batch * heads * seq_len * seq_len * d_v - - Output projection: batch * seq_len * d_model * d_model - """ - d_k = d_model // num_heads - - # QKV projections - qkv_flops = 3 * self.count_matmul((batch * seq_len, d_model), (d_model, d_model)) - - # Attention computation - scores_flops = batch * num_heads * seq_len * seq_len * d_k * 2 - weights_flops = batch * num_heads * seq_len * seq_len * d_k * 2 - attention_flops = scores_flops + weights_flops - - # Output projection - output_flops = self.count_matmul((batch * seq_len, d_model), (d_model, d_model)) - - total = qkv_flops + attention_flops + output_flops - self.op_counts['attention'] = self.op_counts.get('attention', 0) + total - return total - - def report(self): - """Generate FLOP report with breakdown.""" - print(f"Total FLOPs: {self.total_flops / 1e9:.2f} GFLOPs") - print("\nBreakdown by operation:") - for op, flops in sorted(self.op_counts.items(), key=lambda x: x[1], reverse=True): - percentage = (flops / self.total_flops) * 100 - print(f" {op:20s}: {flops/1e9:8.2f} GFLOPs ({percentage:5.1f}%)") +def measure_latency_correctly(model, input_tensor): + """Production-quality latency measurement.""" + + # Step 1: Warmup runs (stabilize system state) + # - JIT compilation happens on first runs + # - CPU/GPU caches warm up + # - Operating system scheduling stabilizes + warmup_runs = 10 + for _ in range(warmup_runs): + _ = model.forward(input_tensor) + + # Step 2: Multiple measurements (statistical significance) + times = [] + measurement_runs = 100 + + for _ in range(measurement_runs): + start = time.perf_counter() # Nanosecond precision + _ = model.forward(input_tensor) + elapsed = time.perf_counter() - start + times.append(elapsed * 1000) # Convert to milliseconds + + # Step 3: Statistical analysis + times = np.array(times) + + results = { + 'mean': np.mean(times), + 'median': np.median(times), # Robust to outliers + 'std': np.std(times), + 'min': np.min(times), + 'max': np.max(times), + 'p50': np.percentile(times, 50), # Median + 'p95': np.percentile(times, 95), # 95th percentile + 'p99': np.percentile(times, 99) # 99th percentile (tail latency) + } + + return results + +# Example output: +# { +# 'mean': 5.234, +# 'median': 5.180, โ† Use this for reporting (robust) +# 'std': 0.456, +# 'min': 4.890, +# 'max': 8.120, โ† Outlier (OS scheduling event) +# 'p50': 5.180, +# 'p95': 5.890, +# 'p99': 6.340 โ† Important for user-facing latency +# } + +# Why median, not mean? +# Mean is sensitive to outliers (8.120 ms max skews average) +# Median represents typical performance +# For user-facing systems, report p95 or p99 (worst-case experience) ``` -**Architecture Profiler - Comparative Analysis** +**Measurement Pitfalls and Solutions**: ```python -class ArchitectureProfiler: - """Compare performance across different architectures. - - Profiles MLP, CNN, and Transformer on same task to understand - compute/memory trade-offs. - """ - def __init__(self): - self.results = {} - - def profile_model(self, model, input_data, model_name): - """Profile a model comprehensively.""" - result = { - 'model_name': model_name, - 'parameters': count_parameters(model), - 'timing': {}, - 'memory': {}, - 'flops': {} - } - - # Timing profile - timer = Timer(num_runs=10) - for _ in range(timer.num_runs + timer.warmup_runs): - with timer: - output = model.forward(input_data) - result['timing']['forward'] = timer.mean - - # Memory profile - mem = MemoryProfiler() - mem.snapshot("Before forward") - output = model.forward(input_data) - mem.snapshot("After forward") - result['memory']['peak'] = mem.peak_memory - - # FLOP count - flop_counter = FLOPCounter() - # Count FLOPs based on model architecture - result['flops']['total'] = flop_counter.total_flops - - self.results[model_name] = result - return result - - def compare(self): - """Generate comparative report.""" - print("\nArchitecture Comparison") - print("=" * 80) - - for name, result in self.results.items(): - print(f"\n{name}:") - print(f" Parameters: {result['parameters']/1e6:.2f}M") - print(f" Forward time: {result['timing']['forward']:.3f}ms") - print(f" Peak memory: {result['memory']['peak']:.2f}MB") - print(f" FLOPs: {result['flops']['total']/1e9:.2f}GFLOPs") +# โŒ WRONG: Single measurement +start = time.time() # Low precision +output = model(input) +latency = time.time() - start # Affected by system noise + +# โœ… CORRECT: Statistical measurement +profiler = Profiler() +latency = profiler.measure_latency(model, input, warmup=10, iterations=100) +# Returns median of 100 measurements after 10 warmup runs + +# โŒ WRONG: Measuring cold start +latency = time_function_once(model.forward, input) # Includes JIT compilation + +# โœ… CORRECT: Warmup runs +for _ in range(10): + model.forward(input) # Discard these results +latency = measure_with_statistics(model.forward, input) # Now measure + +# โŒ WRONG: Using mean with outliers +times = [5.1, 5.2, 5.0, 5.3, 50.0] # 50ms outlier from OS scheduling +mean = np.mean(times) # = 14.12 ms (misleading!) + +# โœ… CORRECT: Using median +median = np.median(times) # = 5.2 ms (representative) ``` -### Step-by-Step Implementation +## Getting Started -1. **Build High-Precision Timer** - - Use `time.perf_counter()` for nanosecond precision - - Implement multiple runs with warmup - - Calculate mean, std, confidence intervals - - Test with known delays +### Prerequisites -2. **Implement Memory Profiler** - - Track memory at key points (before/after operations) - - Calculate peak memory usage - - Identify memory growth patterns - - Detect potential leaks +Ensure you understand the foundations from previous modules: -3. **Create FLOP Counter** - - Count operations for matmul, convolution, attention - - Build hierarchical counting (operation โ†’ layer โ†’ model) - - Compare theoretical vs actual performance - - Identify compute-bound vs memory-bound operations +```bash +# Activate TinyTorch environment +source bin/activate-tinytorch.sh -4. **Build Architecture Profiler** - - Profile MLP on MNIST/CIFAR - - Profile CNN on CIFAR - - Profile Transformer on text - - Generate comparative reports +# Verify prerequisite modules (all modules 1-13) +tito test --module tensor +tito test --module activations +tito test --module transformer +``` -5. **Analyze Results** - - Identify bottleneck operations (Pareto principle) - - Compare efficiency across architectures - - Understand scaling behavior - - Prioritize optimization opportunities +**Why these prerequisites**: You'll profile models built in Modules 1-13. Understanding the implementations helps you interpret profiling results (e.g., why attention is memory-bound). + +### Development Workflow + +1. **Open the development file**: `modules/14_profiling/profiling_dev.ipynb` or `.py` +2. **Implement parameter counting**: Walk model structure, sum parameter elements +3. **Build FLOP counter**: Calculate operations based on layer types and dimensions +4. **Create memory profiler**: Use tracemalloc to track allocations during forward/backward +5. **Add timing profiler**: Implement warmup runs, multiple measurements, statistical analysis +6. **Implement advanced profiling**: Build `profile_forward_pass()` and `profile_backward_pass()` combining all metrics +7. **Export and verify**: `tito module complete 14 && tito test --module profiling` + +**Development tips**: +```python +# Test parameter counting manually first +layer = Linear(128, 64) +expected_params = (128 * 64) + 64 # weight + bias = 8256 +actual_params = profiler.count_parameters(layer) +assert actual_params == expected_params + +# Verify FLOP calculations with small examples +flops = profiler.count_flops(layer, (1, 128)) +expected_flops = 2 * 128 * 64 # matmul FLOPs = 16384 +assert flops == expected_flops + +# Check memory profiler returns expected keys +mem = profiler.measure_memory(layer, (32, 128)) +assert 'parameter_memory_mb' in mem +assert 'activation_memory_mb' in mem +assert 'peak_memory_mb' in mem + +# Validate latency measurement stability +latencies = [profiler.measure_latency(layer, input_tensor) for _ in range(3)] +std_dev = np.std(latencies) +assert std_dev < np.mean(latencies) * 0.2 # Coefficient of variation < 20% +``` ## Testing -### Inline Tests +### Comprehensive Test Suite -Run inline tests while building: -```bash -cd modules/15_profiling -python profiling_dev.py -``` - -Expected output: -``` -Unit Test: Timer with statistical analysis... -โœ… Multiple runs produce consistent results -โœ… Confidence intervals computed correctly -โœ… Warmup runs excluded from statistics -Progress: Timing Profiler โœ“ - -Unit Test: Memory profiler... -โœ… Snapshots capture memory correctly -โœ… Peak memory tracked accurately -โœ… Memory growth detected -Progress: Memory Profiler โœ“ - -Unit Test: FLOP counter... -โœ… Matmul FLOPs: 2*m*k*n verified -โœ… Attention FLOPs match theoretical -โœ… Operation breakdown correct -Progress: FLOP Counter โœ“ -``` - -### Export and Validate +Run the full test suite to verify profiling functionality: ```bash -tito export 15_profiling -tito test 15_profiling +# TinyTorch CLI (recommended) +tito test --module profiling + +# Direct pytest execution +python -m pytest tests/ -k profiling -v ``` -## Where This Code Lives +### Test Coverage Areas +- โœ… **Parameter counting accuracy**: Verifies correct counts for Linear, Conv2d, models with/without parameters +- โœ… **FLOP calculation correctness**: Validates formulas for different layer types (Linear, Conv2d, attention) +- โœ… **Memory measurement reliability**: Checks tracemalloc integration, memory component tracking +- โœ… **Latency measurement consistency**: Tests statistical timing with warmup runs and multiple iterations +- โœ… **Advanced profiling completeness**: Validates forward/backward profiling returns all required metrics + +### Inline Testing & Validation + +The module includes comprehensive unit tests: + +```python +# Parameter counting validation +๐Ÿ”ฌ Unit Test: Parameter Counting... +โœ… Simple model: 55 parameters (10ร—5 weight + 5 bias) +โœ… No parameter model: 0 parameters +โœ… Direct tensor: 0 parameters +โœ… Parameter counting works correctly! + +# FLOP counting validation +๐Ÿ”ฌ Unit Test: FLOP Counting... +โœ… Tensor operation: 32 FLOPs +โœ… Linear layer: 16384 FLOPs (128 ร— 64 ร— 2) +โœ… Batch independence: 16384 FLOPs (same for batch 1 and 32) +โœ… FLOP counting works correctly! + +# Memory measurement validation +๐Ÿ”ฌ Unit Test: Memory Measurement... +โœ… Basic measurement: 0.153 MB peak +โœ… Scaling: Small 0.002 MB โ†’ Large 0.020 MB +โœ… Efficiency: 0.524 (0-1 range) +โœ… Memory measurement works correctly! + +# Latency measurement validation +๐Ÿ”ฌ Unit Test: Latency Measurement... +โœ… Basic latency: 0.008 ms +โœ… Consistency: 0.010 ยฑ 0.002 ms +โœ… Scaling: Small 0.006 ms, Large 0.012 ms +โœ… Latency measurement works correctly! ``` -tinytorch/ -โ”œโ”€โ”€ profiler/ -โ”‚ โ””โ”€โ”€ profiling.py # Your implementation goes here -โ””โ”€โ”€ __init__.py # Exposes Timer, MemoryProfiler, etc. -Usage: ->>> from tinytorch.profiler import Timer, MemoryProfiler, FLOPCounter ->>> timer = Timer() ->>> with timer: ->>> model.forward(x) ->>> print(timer.report()) +### Manual Testing Examples + +```python +from profiling_dev import Profiler, quick_profile +from tinytorch.nn.layers import Linear +from tinytorch.core.tensor import Tensor + +# Example 1: Profile a simple layer +layer = Linear(256, 128) +input_tensor = Tensor(np.random.randn(32, 256)) + +profiler = Profiler() +profile = profiler.profile_forward_pass(layer, input_tensor) + +print(f"Parameters: {profile['parameters']:,}") +print(f"FLOPs: {profile['flops']:,}") +print(f"Latency: {profile['latency_ms']:.2f} ms") +print(f"Memory: {profile['peak_memory_mb']:.2f} MB") +print(f"Bottleneck: {profile['bottleneck']}") +# Output: +# Parameters: 32,896 +# FLOPs: 2,097,152 +# Latency: 0.15 ms +# Memory: 2.10 MB +# Bottleneck: memory + +# Example 2: Compare architectures +mlp = Linear(512, 512) +attention = MultiHeadAttention(d_model=512, num_heads=8) + +mlp_profile = profiler.profile_forward_pass(mlp, mlp_input) +attention_profile = profiler.profile_forward_pass(attention, attention_input) + +print(f"MLP GFLOP/s: {mlp_profile['gflops_per_second']:.2f}") +print(f"Attention GFLOP/s: {attention_profile['gflops_per_second']:.2f}") +# Output reveals which operation is more efficient + +# Example 3: Analyze training memory +training_profile = profiler.profile_backward_pass(model, input_tensor) + +print(f"Forward memory: {training_profile['forward_memory_mb']:.1f} MB") +print(f"Gradient memory: {training_profile['gradient_memory_mb']:.1f} MB") +print(f"Total training memory: {training_profile['total_memory_mb']:.1f} MB") + +for opt_name, opt_memory in training_profile['optimizer_memory_estimates'].items(): + total_with_opt = training_profile['total_memory_mb'] + opt_memory + print(f"{opt_name.upper()}: {total_with_opt:.1f} MB total") +# Output: +# Forward memory: 2.1 MB +# Gradient memory: 2.0 MB +# Total training memory: 4.1 MB +# SGD: 4.1 MB total +# ADAM: 8.1 MB total (2ร— extra for momentum + velocity) ``` ## Systems Thinking Questions -1. **Amdahl's Law**: If attention is 70% of compute and you optimize it 2ร—, what's the overall speedup? Why can't you get 2ร— end-to-end speedup? +### Real-World Applications -2. **Memory vs Compute Bottlenecks**: Your GPU can do 100 TFLOPs/s but memory bandwidth is 900 GB/s. For FP32 operations needing 4 bytes/FLOP, what's the bottleneck? When? +- **Google TPU Optimization**: Profile every kernel to achieve 40-50% MFU (Model FLOPs Utilization). Google improved T5 training from 35% to 48% MFU through profiling-guided optimization, saving millions in compute costs at scale across thousands of TPUs. How would you use profiling to identify and fix utilization bottlenecks? -3. **Batch Size Impact**: Doubling batch size doesn't double throughput. Why? What's the relationship between batch size, memory, and throughput? +- **OpenAI GPT Training**: Profile forward and backward passes separately to measure memory usage across parameters, activations, gradients, and optimizer state. OpenAI identified activation memory as the bottleneck and implemented gradient checkpointing, reducing memory by 10ร— with only 20% compute overhead while achieving 50%+ MFU. What trade-offs exist between recomputation time and storage memory? -4. **Profiling Overhead**: Your profiler adds 5% overhead. Is this acceptable? When would you use sampling profilers vs instrumentation profilers? +- **Meta PyTorch Inference**: Profile operator-by-operator timelines to measure kernel launch overhead and identify operator fusion opportunities. Meta reduced inference latency by 2-3ร— through operator fusion and optimized p99 latency for billions of daily requests serving Facebook/Instagram recommendations. Why optimize for latency percentiles rather than average? -5. **Hardware Differences**: Your code runs 10ร— slower on CPU than GPU for large matrices, but only 2ร— slower for small ones. Why? What's the crossover point? +- **NVIDIA cuDNN Development**: Use Nsight profiler to analyze warp occupancy, register pressure, and memory bandwidth utilization to achieve 90%+ of theoretical peak performance. NVIDIA's profiling data guides both kernel optimization and next-generation hardware design (H100 architecture). How do you distinguish compute-bound from memory-bound kernels? -## Real-World Connections +### Profiling Foundations -### Industry Applications +- **Amdahl's Law and ROI**: If attention takes 70% of time and you achieve 2ร— speedup on attention only, overall speedup is just 1.53ร— (not 2ร—) because unoptimized portions limit gains. Why does this mean optimization is iterativeโ€”requiring re-profiling after each change to identify new bottlenecks? -**Google TPU Optimization** -- Profile every kernel to maximize TPU utilization -- Optimize for both FLOPs and memory bandwidth -- Use profiling to guide hardware design decisions -- Achieve 40-50% utilization (very high for accelerators) +- **Memory Bandwidth Bottlenecks**: An elementwise ReLU operation on 1B elements achieves only 112 GFLOPs/s despite 100 TFLOPS peak compute (0.11% utilization) because it's memory-bound (8.89 ms to move 8 GB data vs 0.01 ms to compute). What optimization strategies help memory-bound operations vs compute-bound operations? -**OpenAI Training Optimization** -- Profile GPT training to find $millions in savings -- Identify gradient checkpointing opportunities -- Optimize data loading pipelines -- Achieve 50%+ MFU (model FLOPs utilization) +- **Statistical Timing Methodology**: Single measurements include system noise (OS scheduling, thermal throttling, cache effects). Proper profiling uses warmup runs (JIT compilation, cache warming), multiple measurements (100+ iterations), and reports median (robust to outliers) plus p95/p99 percentiles (tail latency). Why does mean latency hide outliers that affect user experience? -**Meta Inference Serving** -- Profile PyTorch models for production deployment -- Identify operator fusion opportunities -- Optimize for latency (p50, p99) not just throughput -- Serve billions of requests per day efficiently +- **Profiling Overhead Trade-offs**: Instrumentation profiling (15% overhead) provides precise per-operation timing but distorts fast operations, while sampling profiling (2% overhead) enables always-on production monitoring but may miss operations <1 ms. When should you choose instrumentation vs sampling profilers? -### Research Impact +### Performance Characteristics -This module implements patterns from: -- TensorBoard Profiler (Google, 2019): Visual profiling for TensorFlow -- PyTorch Profiler (Meta, 2020): Comprehensive profiling for PyTorch -- NVIDIA Nsight (2021): GPU-specific profiling and optimization -- MLPerf (2022): Standardized benchmarking and profiling +- **Batch Size Scaling**: Throughput doesn't scale linearly with batch size due to fixed overhead (kernel launch amortizes), memory bandwidth saturation (transfers dominate at large batches), and memory constraints (OOM limits maximum batch size). For a system showing 200โ†’667โ†’914โ†’985 samples/s at batch sizes 1โ†’8โ†’32โ†’64, what's the optimal batch size for throughput vs efficiency vs latency? -## What's Next? +- **GPU vs CPU Crossover**: Small matrices (128ร—128) run faster on CPU despite GPU's 1000ร— more cores because GPU overhead (1 ms kernel launch) dominates compute time. Large matrices (4096ร—4096) achieve 267ร— GPU speedup because overhead amortizes and parallelism saturates GPU cores. What's the crossover point and why does PyTorch automatically dispatch based on operation size? -In **Module 15: Quantization**, you'll use your profiling data to compress models: +- **Parameter vs Activation Memory**: Training memory = Parameters + Activations + Gradients + Optimizer State. For GPT-2 Small (124M params = 496 MB), total training memory is 2.18 GB (4.4ร— parameter memory) due to activations (200 MB), gradients (496 MB), and Adam state (992 MB = 2ร— parameters). Which component should you optimize for different memory constraints? -- Reduce precision from FP32 to INT8 for 4ร— memory savings -- Implement calibration strategies to minimize accuracy loss -- Measure memory and speed improvements -- Apply quantization based on profiling insights +- **FLOPs vs Latency**: Theoretical FLOPs predict compute cost hardware-independently, but actual latency depends on memory bandwidth and kernel efficiency. A GPT-2 feedforward layer requires 154 GFLOPs, suggesting 0.5 ms on A100 (312 TFLOPS), but actual time is higher due to memory overhead. Why is profiling real hardware essential despite theoretical calculations? -Profiling shows you *what* to optimizeโ€”the next modules show you *how* to optimize it! +## Ready to Build? + +You're about to implement the profiling tools that enable all subsequent optimization work. These techniques transform research models into production systems by revealing exactly where time and memory go. + +**What you'll achieve**: +- Understand where compute time actually goes in ML models (measure, don't guess) +- Distinguish memory-bound from compute-bound operations (guides optimization strategy) +- Make data-driven optimization decisions using Amdahl's Law (maximize ROI on engineering time) +- Build the measurement foundation for Modules 15-20 (optimization techniques) + +**The profiling mindset**: +> "Measure twice, optimize once. Profile before every optimization decision. Without measurement, you're flying blind." +> โ€” Every production ML engineer + +Choose your preferred way to engage with this module: + +````{grid} 1 2 3 3 + +```{grid-item-card} ๐Ÿš€ Launch Binder +:link: https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/14_profiling/profiling_dev.ipynb +:class-header: bg-light + +Run this module interactively in your browser. No installation required! +``` + +```{grid-item-card} โšก Open in Colab +:link: https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/14_profiling/profiling_dev.ipynb +:class-header: bg-light + +Use Google Colab for cloud compute power and easy sharing. +``` + +```{grid-item-card} ๐Ÿ“– View Source +:link: https://github.com/mlsysbook/TinyTorch/blob/main/modules/14_profiling/profiling_dev.py +:class-header: bg-light + +Browse the Python source code and understand the implementation. +``` + +```` + +```{admonition} ๐Ÿ’พ Save Your Progress +:class: tip +**Binder sessions are temporary!** Download your completed notebook when done, or switch to local development for persistent work. +``` + +```bash +cd modules/14_profiling +tito module start 14 +python profiling_dev.py # Inline tests as you build +``` --- -**Ready to become a performance detective?** Open `modules/14_profiling/profiling_dev.py` and start implementing. + diff --git a/modules/15_quantization/ABOUT.md b/modules/15_quantization/ABOUT.md index 7462e544..b3c2cd15 100644 --- a/modules/15_quantization/ABOUT.md +++ b/modules/15_quantization/ABOUT.md @@ -1,113 +1,439 @@ --- title: "Quantization - Reduced Precision for Efficiency" -description: "INT8 quantization, calibration, and mixed-precision strategies" -difficulty: 3 +description: "INT8 quantization fundamentals, calibration strategies, and accuracy-efficiency trade-offs" +difficulty: "โญโญโญ" time_estimate: "5-6 hours" prerequisites: ["Profiling"] next_steps: ["Compression"] learning_objectives: - - "Implement INT8 quantization for weights and activations" - - "Design calibration strategies to minimize accuracy loss" - - "Apply mixed-precision training and inference patterns" - - "Understand quantization-aware training vs post-training quantization" - - "Measure memory and speed improvements from reduced precision" + - "Understand how quantization reduces memory by 4ร— through precision reduction from FP32 to INT8" + - "Implement symmetric and asymmetric quantization with scale and zero-point parameters" + - "Design calibration strategies using representative data to minimize accuracy degradation" + - "Measure the accuracy-efficiency frontier: when 1% accuracy loss justifies 4ร— memory savings" + - "Recognize quantization as educational foundation vs production INT8 hardware acceleration" --- -# 15. Quantization +# 15. Quantization - Reduced Precision for Efficiency -**โšก OPTIMIZATION TIER** | Difficulty: โญโญโญ (3/4) | Time: 5-6 hours +**OPTIMIZATION TIER** | Difficulty: โญโญโญ (3/4) | Time: 5-6 hours ## Overview -Reduce model precision from FP32 to INT8 for 4ร— memory reduction and 2-4ร— inference speedup. This module implements quantization, calibration, and mixed-precision strategies used in production deployment. +This module implements quantization fundamentals: converting FP32 tensors to INT8 representation to reduce memory by 4ร—. You'll build the mathematics of scale/zero-point quantization, implement quantized linear layers, and measure accuracy-efficiency trade-offs. CRITICAL HONESTY: You're implementing quantization math in Python, NOT actual hardware INT8 operations. This teaches the principles that enable TensorFlow Lite/PyTorch Mobile deployment, but real speedups require specialized hardware (Edge TPU, Neural Engine) or compiled frameworks with INT8 kernels. Your implementation will be 4ร— more memory-efficient but not faster - understanding WHY teaches you what production quantization frameworks must optimize. ## Learning Objectives -By completing this module, you will be able to: +By the end of this module, you will be able to: -1. **Implement INT8 quantization** for model weights and activations with scale/zero-point parameters -2. **Design calibration strategies** using representative data to minimize accuracy degradation -3. **Apply mixed-precision training** (FP16/FP32) for faster training with maintained accuracy -4. **Understand quantization-aware training** vs post-training quantization trade-offs -5. **Measure memory and speed improvements** while tracking accuracy impact +- **Quantization Mathematics**: Implement symmetric and asymmetric INT8 quantization with scale/zero-point parameter calculation +- **Calibration Strategies**: Design percentile-based calibration to minimize accuracy loss when selecting quantization parameters +- **Memory-Accuracy Trade-offs**: Measure when 4ร— memory reduction justifies 0.5-2% accuracy degradation for deployment +- **Production Reality**: Distinguish between educational quantization (Python simulation) vs production INT8 (hardware acceleration, kernel fusion) +- **When to Quantize**: Recognize deployment scenarios where quantization is mandatory (mobile/edge) vs optional (cloud serving) -## Why This Matters +## Build โ†’ Use โ†’ Optimize -### Production Context +This module follows TinyTorch's **Build โ†’ Use โ†’ Optimize** framework: -Quantization is mandatory for edge deployment: - -- **TensorFlow Lite** uses INT8 quantization for mobile deployment; 4ร— smaller models -- **ONNX Runtime** supports INT8 inference; 2-4ร— faster on CPUs -- **Apple Core ML** quantizes models for iPhone Neural Engine; enables on-device ML -- **Google Edge TPU** requires INT8; optimized hardware for quantized operations - -### Historical Context - -- **Pre-2017**: FP32 standard; quantization for special cases only -- **2017-2019**: INT8 post-training quantization; TensorFlow Lite adoption -- **2019-2021**: Quantization-aware training; maintains accuracy better -- **2021+**: INT4, mixed-precision, dynamic quantization; aggressive compression - -Quantization enables deployment where FP32 models wouldn't fit or run fast enough. +1. **Build**: Implement INT8 quantization/dequantization, calibration logic, QuantizedLinear layers +2. **Use**: Quantize trained models, measure accuracy degradation vs memory savings on MNIST/CIFAR +3. **Optimize**: Analyze the accuracy-efficiency frontier - when does quantization enable deployment vs hurt accuracy unacceptably? ## Implementation Guide -### Core Components +### Quantization Flow: FP32 โ†’ INT8 -**Symmetric INT8 Quantization** -``` -Quantization: x_int8 = round(x_fp32 / scale) -Dequantization: x_fp32 = x_int8 * scale +Quantization compresses weights by reducing precision, trading accuracy for memory efficiency: -where scale = max(|x|) / 127 +```{mermaid} +graph LR + A[FP32 Weight
4 bytes
-3.14159] --> B[Quantize
scale + zero_point] + B --> C[INT8 Weight
1 byte
-126] + C --> D[Dequantize
Inference] + D --> E[FP32 Compute
Result] + + style A fill:#e3f2fd + style B fill:#fff3e0 + style C fill:#f3e5f5 + style D fill:#ffe0b2 + style E fill:#f0fdf4 ``` -**Asymmetric Quantization (with zero-point)** -``` -Quantization: x_int8 = round(x_fp32 / scale) + zero_point -Dequantization: x_fp32 = (x_int8 - zero_point) * scale +**Flow**: Original FP32 โ†’ Calibrate scale โ†’ Store as INT8 (4ร— smaller) โ†’ Dequantize for computation โ†’ FP32 result + +### What You're Actually Building (Educational Quantization) + +**Your Implementation:** +- Quantization math: FP32 โ†’ INT8 conversion with scale/zero-point +- QuantizedLinear: Store weights as INT8, compute in simulated quantized arithmetic +- Calibration: Find optimal scale parameters from representative data +- Memory measurement: Verify 4ร— reduction (32 bits โ†’ 8 bits) + +**What You're NOT Building:** +- Actual INT8 hardware operations (requires CPU VNNI, ARM NEON, GPU Tensor Cores) +- Kernel fusion (eliminating quantize/dequantize overhead) +- Mixed-precision execution graphs (FP32 for sensitive ops, INT8 for matmul) +- Production deployment pipelines (TensorFlow Lite converter, ONNX Runtime optimization) + +**Why This Matters:** Understanding quantization math is essential. But knowing that production speedups require hardware acceleration + compiler optimization prevents unrealistic expectations. Your 4ร— memory reduction is real; your lack of speedup teaches why TensorFlow Lite needs custom kernels. + +### Core Quantization Mathematics + +**Symmetric Quantization (Zero-Point = 0)** + +Assumes data is centered around zero (common after BatchNorm): + +```python +# Quantization: FP32 โ†’ INT8 +scale = max(abs(tensor)) / 127.0 # Scale factor +quantized = round(tensor / scale).clip(-128, 127).astype(int8) + +# Dequantization: INT8 โ†’ FP32 +dequantized = quantized.astype(float32) * scale ``` -**Calibration**: Use representative data to find optimal scale/zero-point parameters +- **Range**: INT8 is [-128, 127] (256 values) +- **Scale**: Maps largest FP32 value to 127 +- **Zero-point**: Always 0 (symmetric around origin) +- **Use case**: Weights after normalization, activations after BatchNorm + +**Asymmetric Quantization (With Zero-Point)** + +Handles arbitrary data ranges (e.g., activations after ReLU: [0, max]): + +```python +# Quantization: FP32 โ†’ INT8 +min_val, max_val = tensor.min(), tensor.max() +scale = (max_val - min_val) / 255.0 +zero_point = round(-min_val / scale) +quantized = round(tensor / scale + zero_point).clip(-128, 127).astype(int8) + +# Dequantization: INT8 โ†’ FP32 +dequantized = (quantized.astype(float32) - zero_point) * scale +``` + +- **Range**: Uses full [-128, 127] even if data is [0, 5] +- **Scale**: Maps data range to INT8 range +- **Zero-point**: Offset ensuring FP32 zero maps to specific INT8 value +- **Use case**: ReLU activations, input images, any non-centered data + +**Trade-off:** Symmetric is simpler (no zero-point storage/computation), asymmetric uses range more efficiently (better for skewed distributions). + +### Calibration - The Critical Step + +Quantization quality depends entirely on scale/zero-point selection. Poor choices destroy accuracy. + +**Naive Approach (Don't Do This):** +```python +# Use global min/max from training data +scale = (tensor_max - tensor_min) / 255 +# Problem: Single outlier wastes most INT8 range +# Example: data in [0, 5] but one outlier at 100 โ†’ scale = 100/255 +# Result: 95% of data maps to only 13 INT8 values (5/100 * 255 = 13) +``` + +**Calibration Approach (Correct):** +```python +# Use percentile-based clipping +max_val = np.percentile(np.abs(calibration_data), 99.9) +scale = max_val / 127 +# Clips 0.1% outliers, uses INT8 range efficiently +# 99.9th percentile ignores rare outliers, preserves typical range +``` + +**Calibration Process:** +1. Collect 100-1000 samples of representative data (validation set) +2. For each layer, record activation statistics during forward passes +3. Compute percentile-based min/max (typically 99.9th percentile) +4. Calculate scale/zero-point from clipped statistics +5. Quantize weights/activations using calibrated parameters + +**Why It Works:** Most activations follow normal-ish distributions. Outliers are rare but dominate min/max. Clipping 0.1% of outliers uses INT8 range 10-100ร— more efficiently with negligible accuracy loss. + +### Per-Tensor vs Per-Channel Quantization + +**Per-Tensor Quantization:** +- One scale/zero-point for entire weight tensor +- Simple: store 2 parameters per layer +- Example: Conv2D with 64ร—3ร—3ร—3 weights uses 1 scale, 1 zero-point + +**Per-Channel Quantization:** +- Separate scale/zero-point per output channel +- Better accuracy: each channel uses its natural range +- Example: Conv2D with 64 output channels uses 64 scales, 64 zero-points +- Overhead: 128 extra parameters (64 scales + 64 zero-points) + +**When to Use Per-Channel:** +- Weight magnitudes vary significantly across channels (common in Conv layers) +- Accuracy improvement (0.5-1.5%) justifies 0.1-0.5% memory overhead +- Production frameworks (PyTorch, TensorFlow Lite) default to per-channel for Conv/Linear + +**Trade-off Table:** + +| Quantization Scheme | Parameters | Accuracy | Complexity | Use Case | +|---------------------|------------|----------|------------|----------| +| Per-Tensor | 2 per layer | Baseline | Simple | Fast prototyping, small models | +| Per-Channel (Conv) | 2N (N=channels) | +0.5-1.5% | Medium | Production Conv layers | +| Per-Channel (Linear) | 2N (N=out_features) | +0.3-0.8% | Medium | Production Linear layers | +| Mixed (Conv per-channel, Linear per-tensor) | Hybrid | +0.4-1.2% | Medium | Balanced approach | + +### QuantizedLinear - Quantized Neural Network Layer + +Replaces regular Linear layer with quantized equivalent: + +```python +class QuantizedLinear: + def __init__(self, linear_layer: Linear): + # Quantize weights at initialization + self.weights_int8, self.weight_scale, self.weight_zp = quantize_int8(linear_layer.weight) + self.bias_int8, self.bias_scale, self.bias_zp = quantize_int8(linear_layer.bias) + + # Store original FP32 for accuracy comparison + self.original_weight = linear_layer.weight + + def forward(self, x: Tensor) -> Tensor: + # EDUCATIONAL VERSION: Dequantize โ†’ compute in FP32 โ†’ quantize result + # (Simulates quantization math but doesn't speed up computation) + weight_fp32 = dequantize_int8(self.weights_int8, self.weight_scale, self.weight_zp) + bias_fp32 = dequantize_int8(self.bias_int8, self.bias_scale, self.bias_zp) + + # Compute in FP32 (not actually faster - just lower precision storage) + output = x @ weight_fp32.T + bias_fp32 + return output +``` + +**What Happens in Production (TensorFlow Lite, PyTorch Mobile):** + +```python +# Production quantized matmul (conceptual - happens in C++/assembly) +def quantized_matmul_production(x_int8, weight_int8, x_scale, weight_scale, output_scale): + # 1. INT8 x INT8 matmul using VNNI/NEON/Tensor Cores (FAST) + accum_int32 = matmul_int8_hardware(x_int8, weight_int8) # Specialized instruction + + # 2. Requantize accumulated INT32 โ†’ INT8 output + combined_scale = (x_scale * weight_scale) / output_scale + output_int8 = (accum_int32 * combined_scale).clip(-128, 127) + + # 3. Stay in INT8 for next layer (no dequantization unless necessary) + return output_int8 +``` + +**Key Differences:** +- **Your implementation**: Dequantize โ†’ FP32 compute โ†’ quantize (educational, slow) +- **Production**: INT8 โ†’ INT8 throughout, specialized hardware (4-10ร— speedup) + +**Memory Savings (Real):** 4ร— reduction from storing INT8 instead of FP32 +**Speed Improvement (Your Code):** ~0ร— (Python overhead dominates) +**Speed Improvement (Production):** 2-10ร— (hardware acceleration, kernel fusion) + +### Model-Level Quantization + +```python +def quantize_model(model, calibration_data=None): + """ + Quantize all Linear layers in model. + + Args: + model: Neural network with Linear layers + calibration_data: Representative samples for activation calibration + + Returns: + quantized_model: Model with QuantizedLinear layers + calibration_stats: Scale/zero-point parameters per layer + """ + quantized_layers = [] + for layer in model.layers: + if isinstance(layer, Linear): + q_layer = QuantizedLinear(layer) + if calibration_data: + q_layer.calibrate(calibration_data) # Find optimal scales + quantized_layers.append(q_layer) + else: + quantized_layers.append(layer) # Keep ReLU, Softmax in FP32 + + return quantized_layers +``` + +**Calibration in Practice:** +1. Run 100-1000 samples through original FP32 model +2. Record min/max activations for each layer +3. Compute percentile-clipped scales +4. Quantize weights with calibrated parameters +5. Test accuracy on validation set + +## Getting Started + +### Prerequisites + +Ensure you've completed profiling fundamentals: + +```bash +# Activate TinyTorch environment +source bin/activate-tinytorch.sh + +# Verify prerequisite modules +tito test --module profiling +``` + +**Required Understanding:** +- Memory profiling (Module 14): Measuring memory consumption +- Tensor operations (Module 01): Understanding FP32 representation +- Linear layers (Module 03): Matrix multiplication mechanics + +### Development Workflow + +1. **Open the development file**: `modules/15_quantization/quantization_dev.py` +2. **Implement quantize_int8()**: FP32 โ†’ INT8 conversion with scale/zero-point calculation +3. **Implement dequantize_int8()**: INT8 โ†’ FP32 restoration +4. **Build QuantizedLinear**: Replace Linear layers with quantized versions +5. **Add calibration logic**: Percentile-based scale selection +6. **Implement quantize_model()**: Convert entire networks to quantized form +7. **Export and verify**: `tito module complete 15 && tito test --module quantization` ## Testing +### Comprehensive Test Suite + +Run the full test suite to verify quantization functionality: + ```bash -tito export 15_quantization -tito test 15_quantization +# TinyTorch CLI (recommended) +tito test --module quantization + +# Direct pytest execution +python -m pytest tests/ -k quantization -v ``` -## Where This Code Lives +### Test Coverage Areas +- โœ… **Quantization Correctness**: FP32 โ†’ INT8 โ†’ FP32 roundtrip error bounds (< 0.5% mean error) +- โœ… **Memory Reduction**: Verify 4ร— reduction in model size (weights + biases) +- โœ… **Symmetric vs Asymmetric**: Both schemes produce valid INT8 in [-128, 127] +- โœ… **Calibration Impact**: Percentile clipping reduces quantization error vs naive min/max +- โœ… **QuantizedLinear Equivalence**: Output matches FP32 Linear within tolerance (< 1% difference) +- โœ… **Model-Level Quantization**: Full network quantization preserves accuracy (< 2% degradation) + +### Inline Testing & Quantization Analysis + +The module includes comprehensive validation with real-time feedback: + +```python +# Example inline test output +๐Ÿ”ฌ Unit Test: quantize_int8()... +โœ… Symmetric quantization: range [-128, 127] โœ“ +โœ… Scale calculation: max_val / 127 = 0.0234 โœ“ +โœ… Roundtrip error: 0.31% mean error โœ“ +๐Ÿ“ˆ Progress: quantize_int8() โœ“ + +๐Ÿ”ฌ Unit Test: QuantizedLinear... +โœ… Memory reduction: 145KB โ†’ 36KB (4.0ร—) โœ“ +โœ… Output equivalence: 0.43% max difference vs FP32 โœ“ +๐Ÿ“ˆ Progress: QuantizedLinear โœ“ ``` -tinytorch/ -โ”œโ”€โ”€ quantization/ -โ”‚ โ””โ”€โ”€ quantize.py -โ””โ”€โ”€ __init__.py + +### Manual Testing Examples + +```python +from quantization_dev import quantize_int8, dequantize_int8, QuantizedLinear +from tinytorch.nn import Linear + +# Test quantization on random tensor +tensor = Tensor(np.random.randn(100, 100).astype(np.float32)) +q_tensor, scale, zero_point = quantize_int8(tensor) + +print(f"Original range: [{tensor.data.min():.2f}, {tensor.data.max():.2f}]") +print(f"Quantized range: [{q_tensor.data.min()}, {q_tensor.data.max()}]") +print(f"Scale: {scale:.6f}, Zero-point: {zero_point}") + +# Dequantize and measure error +restored = dequantize_int8(q_tensor, scale, zero_point) +error = np.abs(tensor.data - restored.data).mean() +print(f"Roundtrip error: {error:.4f} ({error/np.abs(tensor.data).mean()*100:.2f}%)") + +# Quantize a Linear layer +linear = Linear(128, 64) +q_linear = QuantizedLinear(linear) + +print(f"\nOriginal weights: {linear.weight.data.nbytes} bytes") +print(f"Quantized weights: {q_linear.weights_int8.data.nbytes} bytes") +print(f"Reduction: {linear.weight.data.nbytes / q_linear.weights_int8.data.nbytes:.1f}ร—") ``` ## Systems Thinking Questions -1. **Accuracy vs Efficiency**: INT8 loses precision. When is <1% accuracy drop acceptable? When must you use QAT? +### Real-World Applications -2. **Per-Tensor vs Per-Channel**: Per-channel quantization preserves accuracy better but increases complexity. When is it worth it? +- **Mobile ML Deployment**: TensorFlow Lite converts all models to INT8 for Android/iOS. Without quantization, models exceed app size limits (100-200MB) and drain battery 4ร— faster. Google Photos, Translate, Keyboard all run quantized models on-device. -3. **Quantized Operations**: INT8 matmul is faster, but quantize/dequantize adds overhead. When does quantization win overall? +- **Edge AI Devices**: Google Edge TPU (Coral), NVIDIA Jetson, Intel Neural Compute Stick require INT8 models. Hardware is designed exclusively for quantized operations - FP32 isn't supported or is 10ร— slower. -## Real-World Connections +- **Cloud Inference Optimization**: AWS Inferentia, Azure Inferentia, Google Cloud TPU serve quantized models. INT8 reduces memory bandwidth (bottleneck for inference) and increases throughput by 2-4ร—. At scale (millions of requests/day), this saves millions in infrastructure costs. -**Mobile Deployment**: TensorFlow Lite, Core ML use INT8 for on-device inference -**Cloud Serving**: ONNX Runtime, TensorRT use INT8 for cost-effective serving -**Edge AI**: INT8 required for Coral Edge TPU, Jetson Nano deployment +- **Large Language Models**: LLaMA-65B is 130GB in FP16, doesn't fit on single 80GB A100 GPU. INT8 quantization โ†’ 65GB, enables serving. GPTQ pushes to 4-bit (33GB) with < 1% perplexity increase. Quantization is how enthusiasts run 70B models on consumer GPUs. -## What's Next? +### Quantization Mathematics -In **Module 16: Compression**, you'll combine quantization with pruning: -- Remove unimportant weights (pruning) -- Quantize remaining weights (INT8) -- Achieve 10-50ร— compression with minimal accuracy loss +- **Why INT8 vs INT4 or INT16?** INT8 is the sweet spot: 4ร— memory reduction with < 1% accuracy loss. INT4 gives 8ร— reduction but 2-5% accuracy loss (harder to deploy). INT16 only 2ร— reduction (not worth complexity). Hardware acceleration (VNNI, NEON, Tensor Cores) standardized on INT8. + +- **Symmetric vs Asymmetric Trade-offs**: Symmetric is simpler (no zero-point) but wastes range for skewed data. ReLU activations are [0, max] - symmetric centers around 0, wasting negative range. Asymmetric uses full INT8 range but costs extra zero-point storage and computation. + +- **Calibration Data Requirements**: Theory: more data โ†’ better statistics. Practice: diminishing returns after 500-1000 samples. Percentile estimates stabilize quickly. Critical requirement: calibration data MUST match deployment distribution. If calibration is ImageNet but deployment is medical images, quantization fails catastrophically. + +- **Per-Channel Justification**: Conv2D with 64 output channels: per-channel stores 64 scales + 64 zero-points = 512 bytes. Total weights: 3ร—3ร—64ร—64 FP32 = 147KB. Overhead: 0.35%. Accuracy improvement: 0.5-1.5%. Clear win - explains why production frameworks default to per-channel. + +### Production Deployment Characteristics + +- **Speed Reality Check**: INT8 matmul is theoretically 4ร— faster (4ร— less memory bandwidth). Practice: 2-3ร— on CPU (quantize/dequantize overhead), 4-10ร— on specialized hardware (Edge TPU, Neural Engine designed for pure INT8 graphs). Your Python implementation is 0ร— faster (simulation overhead > bandwidth savings). + +- **When Quantization is Mandatory**: Mobile deployment (app size limits, battery constraints, Neural Engine acceleration), Edge devices (limited memory/compute), Cloud serving at scale (cost optimization). Not negotiable - models either quantize or don't ship. + +- **When to Avoid Quantization**: Accuracy-critical applications where 1% matters (medical diagnosis, autonomous vehicles), Early research iteration (quantization adds complexity), Models already tiny (< 10MB - quantization overhead not worth it), Cloud serving with abundant resources (FP32 throughput sufficient). + +- **Quantization-Aware Training vs Post-Training**: PTQ (Post-Training Quantization) is fast (minutes) but loses 1-2% accuracy. QAT (Quantization-Aware Training) requires retraining (days/weeks) but loses < 0.5%. Choose PTQ for rapid iteration, QAT for production deployment. If using pretrained models you don't own (BERT, ResNet), PTQ is only option. + +## Ready to Build? + +You're about to implement the precision reduction mathematics that make mobile ML deployment possible. Quantization is the difference between a model that exists in research and a model that ships in apps used by billions. + +This module teaches honest quantization: you'll implement the math correctly, achieve 4ร— memory reduction, and understand precisely why your Python code isn't faster (hardware acceleration requires specialized silicon + compiled kernels). This clarity prepares you for production deployment where TensorFlow Lite, PyTorch Mobile, and ONNX Runtime apply your quantization mathematics with real INT8 hardware operations. + +Understanding quantization from first principles - implementing the scale/zero-point calculations yourself, calibrating with real data, measuring accuracy-efficiency trade-offs - gives you deep insight into the constraints that define production ML systems. + +Choose your preferred way to engage with this module: + +````{grid} 1 2 3 3 + +```{grid-item-card} Launch Binder +:link: https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/15_quantization/quantization_dev.ipynb +:class-header: bg-light + +Run this module interactively in your browser. No installation required. +``` + +```{grid-item-card} Open in Colab +:link: https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/15_quantization/quantization_dev.ipynb +:class-header: bg-light + +Use Google Colab for GPU access and cloud compute power. +``` + +```{grid-item-card} View Source +:link: https://github.com/mlsysbook/TinyTorch/blob/main/modules/15_quantization/quantization_dev.py +:class-header: bg-light + +Browse the Python source code and understand the implementation. +``` + +```` + +```{admonition} Save Your Progress +:class: tip +Binder sessions are temporary. Download your completed notebook when done, or switch to local development for persistent work. +``` --- -**Ready to quantize models?** Open `modules/15_quantization/quantization_dev.py` and start implementing. + diff --git a/modules/16_compression/ABOUT.md b/modules/16_compression/ABOUT.md index 9f04d5bf..f8844d93 100644 --- a/modules/16_compression/ABOUT.md +++ b/modules/16_compression/ABOUT.md @@ -1,121 +1,426 @@ --- title: "Compression - Pruning and Model Compression" -description: "Prune unnecessary weights and compress models for deployment" -difficulty: 3 +description: "Implement pruning techniques to reduce model size while preserving accuracy" +difficulty: "โญโญโญ" time_estimate: "5-6 hours" -prerequisites: ["Quantization"] -next_steps: ["Acceleration"] +prerequisites: ["15_quantization"] +next_steps: ["17_memoization"] learning_objectives: - - "Implement magnitude-based pruning to remove unimportant weights" - - "Design structured pruning strategies (channel, layer-wise)" - - "Apply iterative pruning with fine-tuning for accuracy preservation" - - "Combine pruning with quantization for maximum compression" - - "Measure compression ratios and inference speedups" + - "Understand compression trade-offs: sparsity ratios vs actual speedup vs accuracy retention" + - "Implement magnitude-based pruning to identify and systematically remove unimportant weights" + - "Design structured pruning strategies that create hardware-friendly sparsity patterns" + - "Apply knowledge distillation to transfer teacher model knowledge to smaller student models" + - "Measure compression ratios and sparsity levels while understanding deployment constraints" --- -# 16. Compression +# 16. Compression - Pruning and Model Compression -**โšก OPTIMIZATION TIER** | Difficulty: โญโญโญ (3/4) | Time: 5-6 hours +**OPTIMIZATION TIER** | Difficulty: โญโญโญ (3/4) | Time: 5-6 hours ## Overview -Compress neural networks through pruning (removing weights) and combining with quantization. This module implements techniques to achieve 10-50ร— compression with minimal accuracy loss, enabling deployment on resource-constrained devices. +Modern neural networks are massively overparameterized. BERT has 110M parameters but can compress to 40% size with 97% accuracy retention (DistilBERT). GPT-2 can be pruned 90% and retrained to similar performance (Lottery Ticket Hypothesis). Model compression techniques remove unnecessary parameters to enable practical deployment on resource-constrained devices. + +This module implements core compression strategies: magnitude-based pruning (removing smallest weights), structured pruning (removing entire channels for hardware efficiency), knowledge distillation (training smaller models from larger teachers), and low-rank approximation (matrix factorization). You'll understand the critical trade-offs between compression ratio, inference speedup, and accuracy retention. + +**Important reality check**: The implementations in this module demonstrate compression algorithms using NumPy, focusing on educational understanding of the techniques. Achieving actual inference speedup from sparse models requires specialized hardware support (NVIDIA's 2:4 sparsity, specialized sparse CUDA kernels) or optimized libraries (torch.sparse, cuSPARSE) beyond this module's scope. You'll learn when compression helps versus when it creates overhead without benefits. ## Learning Objectives -By completing this module, you will be able to: +By the end of this module, you will be able to: -1. **Implement magnitude-based pruning** to identify and remove unimportant weights -2. **Design structured pruning strategies** (channel pruning, layer-wise) for actual speedups -3. **Apply iterative pruning** with fine-tuning to maintain model accuracy -4. **Combine pruning with quantization** for maximum compression (50-100ร— possible) -5. **Measure compression ratios** and verify inference speedup vs accuracy trade-offs +- **Understand compression fundamentals**: Differentiate between unstructured sparsity (scattered zeros), structured sparsity (removed channels), and architectural compression (distillation) +- **Implement magnitude pruning**: Remove weights below importance thresholds to achieve 50-95% sparsity with minimal accuracy loss +- **Design structured pruning**: Remove entire computational units (channels, neurons) using importance metrics like L2 norm +- **Apply knowledge distillation**: Train student models to match teacher performance using temperature-scaled soft targets +- **Analyze compression trade-offs**: Measure when pruning reduces model size without delivering proportional speedup, and understand hardware constraints -## Why This Matters +## Build โ†’ Use โ†’ Reflect -### Production Context +This module follows TinyTorch's **Build โ†’ Use โ†’ Reflect** framework: -Compression enables practical deployment: - -- **BERT Distillation (DistilBERT)**: 40% smaller, 60% faster, 97% accuracy retention -- **MobileNet**: Structured pruning + quantization for mobile deployment -- **Lottery Ticket Hypothesis**: Sparse networks train as well as dense ones -- **GPT-3 Distillation**: Smaller models approaching GPT-3 performance - -### Historical Context - -- **Pre-2015**: Limited compression work; models small enough for hardware -- **2015-2017**: Magnitude pruning (Han et al.); Lottery Ticket Hypothesis -- **2018-2020**: Structured pruning; distillation; BERT compression -- **2020+**: Extreme compression (100ร—); sparse transformers; efficient architectures - -Compression is now standard for deployment, not optional. +1. **Build**: Implement magnitude pruning, structured pruning, knowledge distillation, and low-rank approximation algorithms +2. **Use**: Apply compression techniques to realistic neural networks and measure sparsity, parameter reduction, and memory savings +3. **Reflect**: Understand why 90% unstructured sparsity rarely accelerates inference, when structured pruning delivers real speedups, and how compression strategies must adapt to hardware constraints ## Implementation Guide -### Core Techniques +### Sparsity Measurement -**Magnitude Pruning** -- Sort weights by absolute value -- Remove smallest X% (typically 50-90%) -- Fine-tune remaining weights -- Can achieve 10ร— compression with <1% accuracy loss +Before compression, you need to quantify model density: -**Structured Pruning** -- Remove entire channels/neurons -- Achieves actual speedup (vs unstructured sparsity) -- Typically 2-5ร— compression -- More aggressive accuracy impact +```python +def measure_sparsity(model) -> float: + """Calculate percentage of zero weights in model.""" + total_params = 0 + zero_params = 0 -**Iterative Pruning** -- Prune gradually (10% at a time) -- Fine-tune after each pruning step -- Better accuracy than one-shot pruning -- More training cost + for param in model.parameters(): + total_params += param.size + zero_params += np.sum(param.data == 0) -**Pruning + Quantization** -- Prune 90% of weights โ†’ 10ร— reduction -- Quantize FP32 โ†’ INT8 โ†’ 4ร— reduction -- Combined: 40ร— compression + return (zero_params / total_params) * 100.0 +``` + +**Why this matters**: Sparsity measurement reveals how much redundancy exists. A 90% sparse model has only 10% active weights, but achieving speedup from this sparsity requires specialized hardware or storage formats. + +### Magnitude-Based Pruning (Unstructured) + +Remove individual weights with smallest absolute values: + +```python +def magnitude_prune(model, sparsity=0.9): + """Remove smallest weights to achieve target sparsity.""" + # Collect all weights (excluding biases) + all_weights = [] + weight_params = [] + + for param in model.parameters(): + if len(param.shape) > 1: # Skip 1D biases + all_weights.extend(param.data.flatten()) + weight_params.append(param) + + # Find threshold at desired percentile + magnitudes = np.abs(all_weights) + threshold = np.percentile(magnitudes, sparsity * 100) + + # Zero out weights below threshold + for param in weight_params: + mask = np.abs(param.data) >= threshold + param.data = param.data * mask +``` + +**Characteristics**: +- **Compression**: Can achieve 90%+ sparsity with minimal accuracy loss +- **Speed reality**: Creates scattered zeros that don't accelerate dense matrix operations +- **Storage benefit**: Sparse formats (CSR, COO) reduce memory when combined with specialized storage +- **Hardware requirement**: Needs sparse tensor support for any speedup (torch.sparse, cuSPARSE) + +**Critical insight**: High sparsity ratios don't equal speedup. Dense matrix operations (GEMM) are highly optimized; sparse operations require irregular memory access and specialized kernels. Without hardware acceleration, 90% sparse models run at similar speeds to dense models. + +### Structured Pruning (Hardware-Friendly) + +Remove entire channels or neurons for actual hardware benefits: + +```python +def structured_prune(model, prune_ratio=0.5): + """Remove entire channels based on L2 norm importance.""" + for layer in model.layers: + if isinstance(layer, Linear) and hasattr(layer, 'weight'): + weight = layer.weight.data + + # Calculate L2 norm for each output channel + channel_norms = np.linalg.norm(weight, axis=0) + + # Identify channels to remove (lowest importance) + num_channels = weight.shape[1] + num_to_prune = int(num_channels * prune_ratio) + + if num_to_prune > 0: + # Get indices of weakest channels + prune_indices = np.argpartition( + channel_norms, num_to_prune + )[:num_to_prune] + + # Zero entire channels + weight[:, prune_indices] = 0 + + if layer.bias is not None: + layer.bias.data[prune_indices] = 0 +``` + +**Characteristics**: +- **Compression**: 30-70% typical (coarser granularity than magnitude pruning) +- **Speed benefit**: Smaller dense matrices enable faster computation when architecturally reduced +- **Accuracy trade-off**: Loses more accuracy than unstructured pruning at same sparsity level +- **Hardware friendly**: Regular memory access patterns work well with standard dense operations + +**Critical insight**: Structured pruning achieves lower compression ratios but enables real speedup when combined with architectural changes. Simply zeroing channels doesn't helpโ€”you need to physically remove them from the model architecture to see benefits. + +### Knowledge Distillation + +Transfer knowledge from large teacher models to smaller students: + +```python +class KnowledgeDistillation: + """Compress models through teacher-student training.""" + + def __init__(self, teacher_model, student_model, + temperature=3.0, alpha=0.7): + self.teacher_model = teacher_model + self.student_model = student_model + self.temperature = temperature # Soften distributions + self.alpha = alpha # Balance soft vs hard targets + + def distillation_loss(self, student_logits, + teacher_logits, true_labels): + """Combined loss: soft targets + hard labels.""" + # Temperature-scaled softmax for soft targets + student_soft = softmax(student_logits / self.temperature) + teacher_soft = softmax(teacher_logits / self.temperature) + + # Soft loss: learn from teacher's knowledge + soft_loss = kl_divergence(student_soft, teacher_soft) + + # Hard loss: learn correct answers + student_hard = softmax(student_logits) + hard_loss = cross_entropy(student_hard, true_labels) + + # Weighted combination + return self.alpha * soft_loss + (1 - self.alpha) * hard_loss +``` + +**Why distillation works**: +- **Soft targets**: Teacher's probability distributions reveal uncertainty and class relationships +- **Temperature scaling**: Higher temperatures (T=3-5) soften sharp predictions, providing richer training signal +- **Architectural freedom**: Student can have completely different architecture, not just pruned weights +- **Accuracy preservation**: Students often match 95-99% of teacher performance with 5-10ร— fewer parameters + +**Production example**: DistilBERT uses distillation to compress BERT from 110M to 66M parameters (40% reduction) while retaining 97% accuracy on GLUE benchmarks. + +### Low-Rank Approximation + +Compress weight matrices through SVD factorization: + +```python +def low_rank_approximate(weight_matrix, rank_ratio=0.5): + """Factorize matrix using truncated SVD.""" + m, n = weight_matrix.shape + + # Perform singular value decomposition + U, S, V = np.linalg.svd(weight_matrix, full_matrices=False) + + # Truncate to target rank + max_rank = min(m, n) + target_rank = max(1, int(rank_ratio * max_rank)) + + U_truncated = U[:, :target_rank] + S_truncated = S[:target_rank] + V_truncated = V[:target_rank, :] + + # Reconstruct: W โ‰ˆ U @ diag(S) @ V + return U_truncated, S_truncated, V_truncated +``` + +**Compression math**: +- Original matrix: m ร— n parameters +- Factorized: (m ร— k) + k + (k ร— n) = k(m + n + 1) parameters +- Compression achieved when: k < mn/(m+n+1) +- Example: (1000ร—1000) = 1M params โ†’ (1000ร—100 + 100ร—1000) = 200K params (80% reduction) + +**When low-rank works**: Large matrices with redundancy (common in fully-connected layers). **When it fails**: Small matrices or convolutions with less redundancy. + +### Complete Compression Pipeline + +Combine multiple techniques for maximum compression: + +```python +def compress_model(model, compression_config): + """Apply comprehensive compression strategy.""" + stats = { + 'original_params': sum(p.size for p in model.parameters()), + 'original_sparsity': measure_sparsity(model), + 'applied_techniques': [] + } + + # Apply magnitude pruning + if 'magnitude_prune' in compression_config: + sparsity = compression_config['magnitude_prune'] + magnitude_prune(model, sparsity=sparsity) + stats['applied_techniques'].append(f'magnitude_{sparsity}') + + # Apply structured pruning + if 'structured_prune' in compression_config: + ratio = compression_config['structured_prune'] + structured_prune(model, prune_ratio=ratio) + stats['applied_techniques'].append(f'structured_{ratio}') + + stats['final_sparsity'] = measure_sparsity(model) + return stats + +# Example usage +config = { + 'magnitude_prune': 0.8, # 80% sparsity + 'structured_prune': 0.3 # Remove 30% of channels +} +stats = compress_model(model, config) +print(f"Achieved {stats['final_sparsity']:.1f}% sparsity") +``` + +**Multi-stage strategy**: Different techniques target different redundancy types. Magnitude pruning removes unimportant individual weights; structured pruning removes redundant channels; distillation creates fundamentally smaller architectures. + +## Getting Started + +### Prerequisites + +Ensure you understand compression foundations: + +```bash +# Activate TinyTorch environment +source bin/activate-tinytorch.sh + +# Verify prerequisite modules +tito test --module quantization +``` + +**Required knowledge**: +- Neural network training and fine-tuning (pruned models need retraining) +- Gradient-based optimization (fine-tuning after compression) +- Quantization techniques (often combined with pruning for multiplicative gains) + +**From previous modules**: +- **Tensor operations**: Weight manipulation and masking +- **Optimizers**: Fine-tuning compressed models +- **Quantization**: Combining compression techniques (10ร— pruning + 4ร— quantization = 40ร— total) + +### Development Workflow + +1. **Open the development file**: `modules/16_compression/compression_dev.ipynb` +2. **Implement sparsity measurement**: Calculate percentage of zero weights across model +3. **Build magnitude pruning**: Remove smallest weights using percentile thresholds +4. **Create structured pruning**: Remove entire channels based on L2 norm importance +5. **Implement knowledge distillation**: Build teacher-student training with temperature scaling +6. **Add low-rank approximation**: Factor large matrices using truncated SVD +7. **Build compression pipeline**: Combine techniques sequentially +8. **Export and verify**: `tito module complete 16 && tito test --module compression` ## Testing +### Comprehensive Test Suite + +Run the full test suite to verify compression functionality: + ```bash -tito export 16_compression -tito test 16_compression +# TinyTorch CLI (recommended) +tito test --module compression + +# Direct pytest execution +python -m pytest tests/ -k compression -v ``` -## Where This Code Lives +### Test Coverage Areas +- โœ… **Sparsity measurement**: Correctly counts zero vs total parameters +- โœ… **Magnitude pruning**: Achieves target sparsity with appropriate threshold selection +- โœ… **Structured pruning**: Removes entire channels, creates block sparsity patterns +- โœ… **Knowledge distillation**: Combines soft and hard losses with temperature scaling +- โœ… **Low-rank approximation**: Reduces parameters through SVD factorization +- โœ… **Compression pipeline**: Sequential application preserves functionality + +### Inline Testing & Validation + +The module includes comprehensive validation: + +```python +๐Ÿ”ฌ Unit Test: Measure Sparsity... +โœ… measure_sparsity works correctly! + +๐Ÿ”ฌ Unit Test: Magnitude Prune... +โœ… magnitude_prune works correctly! + +๐Ÿ”ฌ Unit Test: Structured Prune... +โœ… structured_prune works correctly! + +๐Ÿ”ฌ Integration Test: Complete compression pipeline... +โœ… Achieved 82.5% sparsity with 2 techniques + +๐Ÿ“Š Progress: Compression module โœ“ ``` -tinytorch/ -โ”œโ”€โ”€ compression/ -โ”‚ โ””โ”€โ”€ prune.py -โ””โ”€โ”€ __init__.py + +### Manual Testing Examples + +```python +from compression_dev import ( + magnitude_prune, structured_prune, + measure_sparsity, KnowledgeDistillation +) + +# Test magnitude pruning +model = Sequential(Linear(100, 50), Linear(50, 10)) +print(f"Initial sparsity: {measure_sparsity(model):.1f}%") + +magnitude_prune(model, sparsity=0.9) +print(f"After pruning: {measure_sparsity(model):.1f}%") + +# Test structured pruning +structured_prune(model, prune_ratio=0.3) +print(f"After structured: {measure_sparsity(model):.1f}%") + +# Test knowledge distillation +teacher = Sequential(Linear(100, 200), Linear(200, 50)) +student = Sequential(Linear(100, 50)) # 3ร— smaller +kd = KnowledgeDistillation(teacher, student) ``` ## Systems Thinking Questions -1. **Lottery Ticket Hypothesis**: Why can pruned networks retrain to full accuracy? What does this say about overparameterization? +### Real-World Applications -2. **Structured vs Unstructured**: Unstructured pruning achieves better compression but no speedup. Why? When is sparse computation actually faster? +- **Mobile deployment**: DistilBERT achieves 40% size reduction with 97% accuracy retention, enabling BERT on mobile devices +- **Edge inference**: MobileNetV2/V3 combine structured pruning with depthwise convolutions for <10MB models running real-time on phones +- **Production acceleration**: NVIDIA TensorRT applies automatic pruning + quantization for 3-10ร— speedup on inference workloads +- **Model democratization**: GPT distillation (DistilGPT-2) creates 40% smaller models approaching full performance on consumer hardware -3. **Distillation vs Pruning**: Both compress models. When would you use each? Can you combine them? +### Compression Theory Foundations -## Real-World Connections +- **Lottery Ticket Hypothesis**: Pruned networks can retrain to full accuracy from initial weights, suggesting networks contain sparse "winning ticket" subnetworks +- **Overparameterization insights**: Modern networks have excess capacity for easier optimization, not representationโ€”most parameters help training, not inference +- **Information bottleneck**: Compression forces models to distill essential knowledge, sometimes improving generalization by removing noise +- **Hardware-algorithm co-design**: Effective compression requires algorithms designed for hardware constraints (memory bandwidth, cache locality, SIMD width) -**DistilBERT**: 40% smaller BERT with 97% performance -**MobileNetV2**: Efficient architectures + pruning for mobile -**NVIDIA TensorRT**: Automatic pruning + quantization for deployment +### Performance Characteristics and Trade-offs -## What's Next? +- **Unstructured sparsity limitations**: 90% sparse models rarely accelerate without specialized hardwareโ€”dense GEMM operations are too optimized +- **Structured sparsity benefits**: Removing entire channels enables speedup when architecturally implemented (smaller dense matrices, not just zeros) +- **Compression-accuracy curves**: Accuracy degrades gradually until critical sparsity threshold, then collapsesโ€”find the "knee" of the curve +- **Iterative pruning advantage**: Gradual compression with fine-tuning (10 steps ร— 10% sparsity increase) achieves higher compression with better accuracy than one-shot pruning +- **Multiplicative compression**: Combining techniques multiplies gainsโ€”90% pruning (10ร— reduction) + INT8 quantization (4ร— reduction) = 40ร— total compression -In **Module 17: Memoization**, you'll learn computational reuse: -- KV-caching for transformers -- Eliminate redundant computation -- 10-15ร— speedup for autoregressive generation -- Memory-compute trade-offs +## Ready to Build? + +You're about to implement compression techniques that transform research models into deployable systems. These optimizations bridge the gap between what's possible in the lab and what's practical in production on resource-constrained devices. + +Understanding compression from first principlesโ€”implementing pruning algorithms yourself rather than using torch.nn.utils.pruneโ€”gives you deep insight into the trade-offs between model size, inference speed, and accuracy. You'll discover why most sparsity doesn't accelerate inference, when structured pruning actually helps, and how to design compression strategies for different deployment scenarios (mobile apps need aggressive compression; cloud services need balanced approaches). + +This module emphasizes honest engineering: you'll see that achieving 90% sparsity is straightforward but getting speedup from that sparsity requires specialized hardware or libraries beyond these NumPy implementations. Production compression combines multiple techniques sequentially, carefully measuring accuracy after each stage and stopping when degradation exceeds acceptable thresholds. + +Take your time with this module. Compression is where theory meets deployment constraints, where algorithmic elegance confronts hardware reality. The techniques you implement here enable real-world ML deployment at scale! + +Choose your preferred way to engage with this module: + +````{grid} 1 2 3 3 + +```{grid-item-card} ๐Ÿš€ Launch Binder +:link: https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/16_compression/compression_dev.ipynb +:class-header: bg-light + +Run this module interactively in your browser. No installation required! +``` + +```{grid-item-card} โšก Open in Colab +:link: https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/16_compression/compression_dev.ipynb +:class-header: bg-light + +Use Google Colab for GPU access and cloud compute power. +``` + +```{grid-item-card} ๐Ÿ“– View Source +:link: https://github.com/mlsysbook/TinyTorch/blob/main/modules/16_compression/compression_dev.py +:class-header: bg-light + +Browse the Python source code and understand the implementation. +``` + +```` + +```{admonition} ๐Ÿ’พ Save Your Progress +:class: tip +**Binder sessions are temporary!** Download your completed notebook when done, or switch to local development for persistent work. +``` --- -**Ready to compress models?** Open `modules/16_compression/compression_dev.py` and start implementing. + diff --git a/modules/17_memoization/ABOUT.md b/modules/17_memoization/ABOUT.md index 9e25f8d0..71417795 100644 --- a/modules/17_memoization/ABOUT.md +++ b/modules/17_memoization/ABOUT.md @@ -1,446 +1,677 @@ --- title: "Memoization - Computational Reuse for Inference" -description: "Apply memoization pattern to transformers through KV caching for 10-15x faster generation" -difficulty: 2 +description: "Transform O(nยฒ) transformer generation into O(n) through KV caching, achieving 10-15x speedup" +difficulty: "โญโญโญ (3/4)" time_estimate: "4-5 hours" -prerequisites: ["Profiling", "Transformers", "Quantization", "Compression"] +prerequisites: ["Transformers", "Profiling", "Quantization", "Compression"] next_steps: ["Acceleration"] learning_objectives: - - "Understand memoization as a fundamental optimization pattern" - - "Apply memoization to transformers through KV caching" - - "Implement cache management for efficient inference" - - "Measure O(nยฒ) to O(n) performance improvement" - - "Recognize when computational reuse applies to other problems" + - "Understand memoization as a fundamental optimization pattern that caches computational results" + - "Implement KVCache data structures for efficient memory management with O(1) updates" + - "Apply caching to transformers by storing and reusing attention keys and values" + - "Measure O(nยฒ) to O(n) complexity reduction and 10-15x generation speedup" + - "Analyze memory-speed trade-offs and understand when caching benefits justify costs" --- -# 17. Memoization +# 17. Memoization - Computational Reuse for Inference -**โšก OPTIMIZATION TIER** | Difficulty: โญโญ (2/4) | Time: 4-5 hours +**OPTIMIZATION TIER** | Difficulty: โญโญโญ (3/4) | Time: 4-5 hours ## Overview -Learn memoization - a fundamental optimization pattern that caches computational results to avoid redundant work. You'll apply this pattern to transformers through KV (Key-Value) caching, achieving 10-15ร— speedup for autoregressive generation by storing and reusing attention keys and values. +Memoization is a fundamental optimization pattern: cache computational results to avoid redundant work. You'll apply this pattern to transformers through KV (Key-Value) caching, transforming O(nยฒ) autoregressive generation into O(n) complexity and achieving 10-15x speedup. This optimization makes production language model serving economically viable. + +This is inference-only optimization - you'll implement caching patterns used in every production LLM from ChatGPT to Claude to GitHub Copilot. ## Learning Objectives -By completing this module, you will be able to: +By the end of this module, you will be able to: -1. **Implement KV caching** to eliminate redundant attention key/value computations during generation -2. **Design cache management systems** for efficient multi-turn conversation handling -3. **Understand memory-speed trade-offs** between caching everything vs recomputing on-the-fly -4. **Optimize transformer latency** from O(nยฒ) to O(n) per generated token -5. **Apply caching patterns** used in ChatGPT, Claude, and all production language models +- **Understand Memoization Pattern**: Recognize when computational reuse through caching applies to ML problems and understand the memory-speed trade-off +- **Implement KVCache Structure**: Build efficient cache data structures with O(1) updates, proper memory management, and multi-layer support +- **Apply Caching to Transformers**: Integrate KV caching into attention layers without modifying existing transformer code (non-invasive enhancement) +- **Measure Performance Gains**: Profile latency improvements, measure O(nยฒ) โ†’ O(n) complexity reduction, and understand speedup characteristics +- **Analyze Production Trade-offs**: Calculate cache memory costs, understand cache invalidation policies, and recognize when caching justifies its overhead + +## Build โ†’ Use โ†’ Optimize + +This module follows TinyTorch's **Build โ†’ Use โ†’ Optimize** framework: + +1. **Build**: Implement KVCache data structure with efficient updates, cached attention integration, and multi-layer cache management +2. **Use**: Apply caching to GPT text generation, measure 10-15x speedup over naive generation, and validate output correctness +3. **Optimize**: Profile memory bandwidth bottlenecks, measure cache hit rates, and understand when memory cost exceeds latency benefit ## Why This Matters -### Production Context +### KV Cache Optimization Flow -KV caching is mandatory for production LLM serving: +Caching stores computed keys and values, avoiding recomputation for each new token: -- **ChatGPT** uses KV caching for all multi-turn conversations; without it, latency would be unusable -- **Claude** caches up to 100K tokens of context; enables long document processing -- **GitHub Copilot** caches code context; provides real-time completions -- **Google Gemini** uses multi-level caching; serves billions of requests daily +```{mermaid} +graph LR + A[Token i
Compute K_i, V_i] --> B[Cache
Store K_i, V_i] + B --> C[Token i+1
New computation] + C --> D[Reuse
K_i, V_i from cache] + D --> E[Only compute
K_{i+1}, V_{i+1}] + E --> F[10-15ร— speedup] -### Historical Context + style A fill:#e3f2fd + style C fill:#e3f2fd + style B fill:#f3e5f5 + style D fill:#fff3e0 + style E fill:#ffe0b2 + style F fill:#f0fdf4 +``` -Caching evolved with transformer deployment: +**Optimization**: Compute K,V once โ†’ Cache โ†’ Reuse for all future tokens โ†’ O(nยฒ) โ†’ O(n) complexity -- **Early Transformers (2017-2019)**: No caching; research focused on training, not inference -- **GPT-2 Deployment (2019)**: KV caching implemented; enabled practical text generation -- **Production Scale (2020+)**: Multi-level caching (KV + intermediate layers); critical for economics -- **Modern Systems (2023+)**: Distributed caching across GPUs; 100K+ token contexts +### The Autoregressive Generation Problem -Without KV caching, ChatGPT would be 50-100ร— slower and economically infeasible. +Without caching, transformer generation has quadratic complexity: -## Pedagogical Pattern: Build โ†’ Use โ†’ Optimize +``` +Naive Generation (O(nยฒ) complexity): +Step 1: Generate token 1 โ†’ Compute attention for [tโ‚€] (1 computation) +Step 2: Generate token 2 โ†’ Compute attention for [tโ‚€, tโ‚] (2 computations, tโ‚€ RECOMPUTED!) +Step 3: Generate token 3 โ†’ Compute attention for [tโ‚€, tโ‚, tโ‚‚] (3 computations, tโ‚€,tโ‚ RECOMPUTED!) +... +Step n: Generate token n โ†’ Compute attention for [tโ‚€, ..., tโ‚™] (n computations, ALL RECOMPUTED!) -### 1. Build +Total: 1 + 2 + 3 + ... + n = n(n+1)/2 = O(nยฒ) complexity! +For 100 tokens: ~5,050 redundant K,V computations +``` -Implement from first principles: -- KV cache data structure for attention -- Cache management (append, reuse, clear) -- Cached attention forward pass -- Multi-turn conversation caching -- Memory-efficient cache storage +**The Key Insight**: K and V matrices for previous tokens NEVER change, yet we recompute them every step! -### 2. Use +### The Caching Solution -Apply to real problems: -- Optimize GPT decoder for text generation -- Cache conversation history for multi-turn chat -- Measure latency improvement (10-100ร— speedup) -- Profile memory usage vs cache size -- Compare cached vs non-cached inference +``` +Cached Generation (O(n) complexity): +Step 1: Compute Kโ‚, Vโ‚ โ†’ Cache them โ†’ Attention with cached[Kโ‚, Vโ‚] +Step 2: Compute Kโ‚‚, Vโ‚‚ โ†’ Cache them โ†’ Attention with cached[Kโ‚, Kโ‚‚, Vโ‚, Vโ‚‚] (reuse Kโ‚, Vโ‚!) +Step 3: Compute Kโ‚ƒ, Vโ‚ƒ โ†’ Cache them โ†’ Attention with cached[Kโ‚, Kโ‚‚, Kโ‚ƒ, Vโ‚, Vโ‚‚, Vโ‚ƒ] (reuse all!) -### 3. Optimize +Total: 1 + 1 + 1 + ... + 1 = n computations (50x reduction for n=100!) +``` -Production-ready enhancements: -- Implement cache eviction policies (LRU, FIFO) -- Add distributed caching across GPUs -- Optimize memory layout for cache hits -- Compress cached values (quantization) -- Build cache warmup strategies +### Production Impact + +KV caching is mandatory for all production LLM serving: + +- **ChatGPT/GPT-4**: Would be 50-100x slower without caching, making conversational AI economically infeasible +- **Claude**: Caches up to 100K tokens of context, enabling long document processing +- **GitHub Copilot**: Real-time code completion requires sub-100ms latency - impossible without caching +- **Google Gemini**: Multi-level caching (KV + intermediate layers) serves billions of requests daily + +Without KV caching, the computational cost would make these services prohibitively expensive. + +### Memory-Speed Trade-off + +``` +Traditional Approach (No Cache): +Memory: O(1) Cost: Negligible +Compute: O(nยฒ) Cost: Prohibitive for long sequences + +Cached Approach (KV Cache): +Memory: O(n ร— d_k) Cost: ~18MB per batch for GPT-2 +Compute: O(n) Cost: 10-15x faster than naive + +Trade-off Winner: Memory is cheap, compute is expensive! +Use O(n) memory to save O(nยฒ) compute. +``` ## Implementation Guide ### Core Components -**Understanding the Problem - Why Caching Helps** -```python -# WITHOUT KV caching (naive autoregressive generation): -# Generate token 1: compute attention for [t0] -# Generate token 2: compute attention for [t0, t1] โ† recomputes t0 -# Generate token 3: compute attention for [t0, t1, t2] โ† recomputes t0, t1 -# Generate token n: compute attention for [t0, ..., tn] โ† recomputes everything -# -# Complexity: O(nยฒ) - quadratic in sequence length -# For 100 tokens: ~5000 attention operations +You'll implement three main components: -# WITH KV caching: -# Generate token 1: compute K,V for [t0], cache them -# Generate token 2: reuse cached K,V for t0, compute only for t1 -# Generate token 3: reuse cached K,V for t0,t1, compute only for t2 -# Generate token n: reuse all cached, compute only for tn -# -# Complexity: O(n) - linear in sequence length -# For 100 tokens: ~100 attention operations (50ร— speedup!) -``` +#### 1. KVCache Data Structure -**KV Cache Data Structure** ```python class KVCache: - """Cache for attention keys and values. - - Stores computed K,V matrices to avoid recomputation during - autoregressive generation. - - Memory layout: - keys: (num_layers, batch, num_heads, seq_len, d_k) + """ + Efficient key-value cache for autoregressive generation. + + Memory Layout: + keys: (num_layers, batch, num_heads, seq_len, d_k) values: (num_layers, batch, num_heads, seq_len, d_v) - - For GPT-2: + + For GPT-2 (12 layers, 12 heads, 1024 seq, 64 dims): 12 layers ร— 12 heads ร— 1024 seq ร— 64 dims = ~9M values - At FP16 (2 bytes): 18MB per batch item + At FP32 (4 bytes): ~36MB per batch item + At FP16 (2 bytes): ~18MB per batch item + + Operations: + update(layer_idx, key, value) -> None # O(1) append + get(layer_idx) -> (cached_k, cached_v) # O(1) retrieval + advance() -> None # Increment position + reset() -> None # Clear for new sequence """ - def __init__(self, num_layers, batch_size, num_heads, d_k, d_v, max_seq_len): - self.num_layers = num_layers - self.batch_size = batch_size - self.num_heads = num_heads - self.max_seq_len = max_seq_len - - # Pre-allocate cache tensors - self.keys = {} # {layer_idx: (batch, heads, seq_len, d_k)} - self.values = {} # {layer_idx: (batch, heads, seq_len, d_v)} - - # Track current sequence length - self.seq_len = 0 - - def append(self, layer_idx, new_keys, new_values): - """Append new keys/values to cache for a layer. - - Args: - layer_idx: Which transformer layer - new_keys: (batch, heads, 1, d_k) - single new position - new_values: (batch, heads, 1, d_v) - single new position - """ - if layer_idx not in self.keys: - # Initialize cache for this layer - self.keys[layer_idx] = new_keys - self.values[layer_idx] = new_values - else: - # Concatenate with existing cache - self.keys[layer_idx] = concat([self.keys[layer_idx], new_keys], dim=2) - self.values[layer_idx] = concat([self.values[layer_idx], new_values], dim=2) - - # Update sequence length (same across all layers) - self.seq_len = self.keys[layer_idx].shape[2] - - def get(self, layer_idx): - """Retrieve cached keys/values for a layer. - - Returns: - keys: (batch, heads, seq_len, d_k) - values: (batch, heads, seq_len, d_v) - """ - return self.keys.get(layer_idx), self.values.get(layer_idx) - - def clear(self): - """Clear all cached data.""" - self.keys.clear() - self.values.clear() - self.seq_len = 0 - - def memory_usage(self): - """Calculate cache memory usage in bytes.""" - total_elements = 0 - for k, v in zip(self.keys.values(), self.values.values()): - total_elements += k.numel() + v.numel() - # Assume FP16 (2 bytes per element) - return total_elements * 2 ``` -**Cached Attention Layer** +**Key Design Decisions**: +- Pre-allocate cache tensors to avoid dynamic resizing overhead +- Use position counter for O(1) indexed updates (no copying) +- Store per-layer caches to support multi-layer transformers +- Track sequence position externally for clean separation + +#### 2. Non-Invasive Cache Integration + ```python -class CachedMultiHeadAttention(MultiHeadAttention): - """Multi-head attention with KV caching support. - - Extends MultiHeadAttention to cache K,V matrices during generation. +def enable_kv_cache(model): + """ + Enable KV caching WITHOUT modifying Module 12/13 code. + + This demonstrates non-invasive optimization - adding capabilities + to existing systems without breaking them. Similar to how Module 05 + uses enable_autograd() to add gradient tracking to Tensors. + + Approach: + 1. Create KVCache sized for model architecture + 2. Store cache on model as model._kv_cache + 3. Wrap each attention layer's forward method with caching logic + 4. Intercept attention calls to manage cache automatically + + This is composition + monkey-patching - a critical ML systems pattern! """ - def forward(self, query, key=None, value=None, kv_cache=None, layer_idx=None): - """Forward pass with optional KV caching. - - Args: - query: (batch, 1, d_model) - single new position - key: (batch, seq_len, d_model) - optional, for initial pass - value: (batch, seq_len, d_model) - optional, for initial pass - kv_cache: KVCache object - layer_idx: Which layer (for cache indexing) - - Returns: - output: (batch, 1, d_model) - attended output - attention_weights: (batch, heads, 1, seq_len) - for analysis - """ - batch_size = query.shape[0] - - # Project query for new position - Q = self.W_q(query) # (batch, 1, d_model) - Q = Q.reshape(batch_size, 1, self.num_heads, self.d_k).transpose(1, 2) - # Q: (batch, heads, 1, d_k) - - if kv_cache is not None and layer_idx is not None: - # Check if cache exists for this layer - cached_K, cached_V = kv_cache.get(layer_idx) - - if cached_K is None: - # First token: compute and cache K,V - K = self.W_k(key) - V = self.W_v(value) - K = K.reshape(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2) - V = V.reshape(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2) - - # Cache for future tokens - kv_cache.append(layer_idx, K, V) - else: - # Subsequent tokens: compute only new K,V, concat with cache - new_K = self.W_k(key) # key is just new position - new_V = self.W_v(value) - new_K = new_K.reshape(batch_size, 1, self.num_heads, self.d_k).transpose(1, 2) - new_V = new_V.reshape(batch_size, 1, self.num_heads, self.d_k).transpose(1, 2) - - # Append to cache - kv_cache.append(layer_idx, new_K, new_V) - - # Use full cached K,V - K, V = kv_cache.get(layer_idx) - else: - # No caching: regular attention - K = self.W_k(key) - V = self.W_v(value) - K = K.reshape(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2) - V = V.reshape(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2) - - # Compute attention with cached K,V - attended, attention_weights = scaled_dot_product_attention(Q, K, V) - - # Reshape output - attended = attended.transpose(1, 2).reshape(batch_size, 1, self.d_model) - output = self.W_o(attended) - - return output, attention_weights ``` -**Cached Generation - The Full Pipeline** +**Why Non-Invasive?** +- Modules 12-13 (Attention, Transformers) work unchanged +- Module 17 ADDS optimization, doesn't BREAK old code +- Teaches "forward-only" systems engineering: never modify earlier modules +- Matches how production systems layer optimizations (vLLM, HuggingFace) + +#### 3. Cached Attention Logic + ```python -def generate_with_cache(model, start_tokens, max_new_tokens, temperature=1.0): - """Autoregressive generation with KV caching. - - Achieves 10-100ร— speedup over non-cached generation. - - Args: - model: Transformer with KV cache support - start_tokens: (batch, start_len) initial sequence - max_new_tokens: Number of tokens to generate - temperature: Sampling temperature - - Returns: - generated: (batch, start_len + max_new_tokens) full sequence +def cached_forward(x, mask=None): + """ + Cache-aware attention with three paths: + + PATH 1: Training (seq_len > 1) + โ†’ Use original attention (preserve gradients) + โ†’ O(nยฒ) but needed for backpropagation + + PATH 2: First Token (seq_len == 1, cache empty) + โ†’ Use original attention (initialize cache) + โ†’ O(1) - just one token + + PATH 3: Cached Generation (seq_len == 1, cache populated) + โ†’ Compute K,V for NEW token only + โ†’ Retrieve ALL cached K,V (includes history) + โ†’ Attention with cached context + โ†’ O(n) - only compute new, reuse cache + โ†’ THIS IS WHERE THE SPEEDUP HAPPENS! """ - batch_size = start_tokens.shape[0] - generated = start_tokens - - # Initialize KV cache - kv_cache = KVCache( - num_layers=model.num_layers, - batch_size=batch_size, - num_heads=model.num_heads, - d_k=model.d_k, - d_v=model.d_k, - max_seq_len=start_tokens.shape[1] + max_new_tokens - ) - - # Process initial sequence (fills cache) - _ = model.forward(start_tokens, kv_cache=kv_cache) - - # Generate tokens one at a time (uses cache) - for _ in range(max_new_tokens): - # Forward pass on ONLY the last token - # Cache provides context from all previous tokens - last_token = generated[:, -1:] # (batch, 1) - logits = model.forward(last_token, kv_cache=kv_cache) # (batch, 1, vocab_size) - - # Sample next token - next_token_logits = logits[:, -1, :] / temperature - probs = softmax(next_token_logits, dim=-1) - next_token = sample(probs) - - # Append to sequence - generated = concat([generated, next_token], dim=1) - - return generated ``` -### Step-by-Step Implementation +### Implementation Steps -1. **Design KV Cache Structure** - - Create storage for keys and values per layer - - Support appending new keys/values efficiently - - Add retrieval and clearing methods - - Calculate memory usage +#### Step 1: Design KVCache Structure +1. Initialize cache storage for all layers +2. Pre-allocate tensors with maximum sequence length +3. Track current sequence position (write pointer) +4. Implement update() for O(1) append operations +5. Implement get() for O(1) retrieval of valid cache portion -2. **Modify Attention for Caching** - - Add KV cache parameter to forward pass - - Check if cache exists for current layer - - Compute only new K,V when cache present - - Concat new K,V with cached values +#### Step 2: Implement Cache Updates +1. Validate layer index and sequence position +2. Write new K,V to current position (indexed assignment) +3. Advance position counter after all layers processed +4. Handle batch dimension and multi-head structure -3. **Implement Cached Generation** - - Initialize cache before generation loop - - Process initial tokens (fill cache) - - Generate new tokens using cached context - - Measure speedup vs non-cached +#### Step 3: Enable Non-Invasive Integration +1. Validate model has required attributes (embed_dim, num_layers, etc.) +2. Calculate head_dim from embed_dim and num_heads +3. Create KVCache instance sized for model +4. Store cache on model with model._kv_cache flag +5. Wrap each block's attention.forward with caching logic -4. **Add Cache Management** - - Implement cache clearing between conversations - - Add cache size limits and eviction - - Support batch processing with caching - - Handle variable sequence lengths +#### Step 4: Implement Cached Attention Forward +1. Detect path: training (seq_len > 1), first token (cache empty), or cached generation +2. For cached path: Compute Q,K,V projections for new token only +3. Reshape to multi-head format (batch, num_heads, 1, head_dim) +4. Update cache with new K,V pairs +5. Retrieve ALL cached K,V (history + new) +6. Compute attention: softmax(Q @ K^T / โˆšd_k) @ V using NumPy (.data) +7. Apply output projection and return -5. **Optimize Memory Layout** - - Use contiguous tensors for cache hits - - Implement FP16 caching for memory savings - - Add cache compression (quantization) - - Profile memory bandwidth bottlenecks +#### Step 5: Validate Correctness +1. Test cache initialization and memory calculation +2. Verify single-token and multi-token updates +3. Validate multi-layer cache synchronization +4. Test reset functionality +5. Measure speedup vs non-cached generation + +### Why .data Instead of Tensor Operations? + +In cached attention, we use NumPy via `.data` for three reasons: + +1. **Explicit Intent**: Makes it crystal clear this is inference-only + - Training: Uses Tensor operations โ†’ gradients tracked + - Inference: Uses .data โ†’ no gradient overhead + +2. **Performance**: Avoids any autograd bookkeeping + - Even small overhead matters in generation hotpath + - Production LLMs (vLLM, llama.cpp) use similar patterns + +3. **Educational Clarity**: Shows students the distinction + - "When do I need gradients?" (training) + - "When can I skip them?" (inference) + +We COULD use Tensor operations with requires_grad=False, but .data is more explicit and follows industry patterns. + +## Getting Started + +### Prerequisites + +Ensure you understand transformers and profiling: + +```bash +# Activate TinyTorch environment +source bin/activate-tinytorch.sh + +# Verify prerequisite modules +tito test --module transformers +tito test --module profiling +``` + +**Required Understanding**: +- Multi-head attention mechanism (Module 12) +- Transformer architecture (Module 13) +- Latency profiling techniques (Module 14) +- O(nยฒ) complexity of attention computation + +### Development Workflow + +1. **Open the development file**: `modules/17_memoization/memoization_dev.ipynb` +2. **Profile naive generation**: Measure O(nยฒ) growth in latency as sequence lengthens +3. **Implement KVCache class**: Build data structure with update(), get(), advance(), reset() +4. **Test cache operations**: Verify single-token, multi-token, and multi-layer caching +5. **Implement enable_kv_cache()**: Non-invasively patch model attention layers +6. **Build cached attention forward**: Three-path logic (training, first token, cached generation) +7. **Measure speedup**: Profile cached vs non-cached generation, validate O(n) complexity +8. **Export and verify**: `tito module complete 17 && tito test --module memoization` ## Testing -### Inline Tests (During Development) +### Comprehensive Test Suite + +Run the full test suite to verify memoization functionality: -Run inline tests while building: ```bash -cd modules/14_kvcaching -python kvcaching_dev.py +# TinyTorch CLI (recommended) +tito test --module memoization + +# Direct pytest execution +python -m pytest tests/ -k memoization -v ``` -Expected output: -``` -Unit Test: KV cache data structure... +### Test Coverage Areas + +- โœ… **KVCache Initialization**: Validate cache creation, memory calculation, and initial state +- โœ… **Cache Updates**: Test single-token append, multi-token sequences, and O(1) update performance +- โœ… **Multi-Layer Synchronization**: Verify independent per-layer caches with correct indexing +- โœ… **Cache Retrieval**: Test get() returns only valid cached portion (up to seq_pos) +- โœ… **Non-Invasive Integration**: Validate enable_kv_cache() works without breaking model +- โœ… **Correctness Validation**: Compare cached vs non-cached outputs (should be identical) +- โœ… **Performance Measurement**: Measure speedup at different sequence lengths +- โœ… **Memory Tracking**: Calculate cache size and validate memory usage + +### Inline Testing & Profiling + +The module includes comprehensive validation with performance measurement: + +```python +# Unit Test: KVCache Implementation +๐Ÿ”ฌ Unit Test: KVCache Implementation... + Cache initialized: 0.59 MB โœ… Cache initialization successful โœ… Append and retrieval work correctly -โœ… Memory usage calculated: 18MB per batch -Progress: KV Cache โœ“ +โœ… Multi-layer caching validated +โœ… Reset functionality verified +๐Ÿ“ˆ Progress: KVCache โœ“ -Unit Test: Cached attention... -โœ… First token: K,V computed and cached -โœ… Subsequent tokens: reuse cached K,V -โœ… Attention output matches non-cached version -Progress: Cached Attention โœ“ +# Integration Test: Performance Measurement +๐Ÿ”ฌ Profiling Transformer Generation (Without Caching): + Seq Len | Latency (ms) | Growth + ---------|----------------|---------- + 10 | 2.34 | baseline + 20 | 4.89 | 2.09ร— + 40 | 10.12 | 2.07ร— + 80 | 21.45 | 2.12ร— + 160 | 45.67 | 2.13ร— -Unit Test: Generation with caching... -โœ… Generated 100 tokens with caching -โœ… Speedup: 47ร— faster than without cache -โœ… Output quality: identical to non-cached -Progress: Cached Generation โœ“ +๐Ÿ’ก Key Observations: + โ€ข Latency grows QUADRATICALLY with sequence length + โ€ข Each new token forces recomputation of ALL previous K,V pairs + โ€ข For 160 tokens: ~4ร— time vs 80 tokens (2ยฒ growth) + +๐ŸŽฏ The Solution: CACHE the K,V values! (That's memoization) +โœ… Speedup: 10-15ร— for typical generation ``` -### Export and Validate +### Manual Testing Examples -After completing the module: -```bash -# Export to tinytorch package -tito export 17_memoization +```python +from tinytorch.generation.kv_cache import KVCache, enable_kv_cache -# Run integration tests -tito test 17_memoization -``` +# Test cache with small transformer +cache = KVCache( + batch_size=1, + max_seq_len=128, + num_layers=4, + num_heads=8, + head_dim=64 +) -## Where This Code Lives +# Simulate generation loop +import numpy as np +from tinytorch.core.tensor import Tensor -``` -tinytorch/ -โ”œโ”€โ”€ nn/ -โ”‚ โ””โ”€โ”€ kvcache.py # Your implementation goes here -โ””โ”€โ”€ __init__.py # Exposes KVCache, CachedMultiHeadAttention +for step in range(10): + for layer_idx in range(4): + # New key-value pairs for this step + new_k = Tensor(np.random.randn(1, 8, 1, 64)) + new_v = Tensor(np.random.randn(1, 8, 1, 64)) -Usage in other modules: ->>> from tinytorch.nn import KVCache, CachedMultiHeadAttention ->>> cache = KVCache(num_layers=12, batch_size=1, num_heads=12, d_k=64, d_v=64, max_seq_len=1024) ->>> generated = generate_with_cache(model, start_tokens, max_new_tokens=100) + # Update cache (O(1) operation) + cache.update(layer_idx, new_k, new_v) + + # Advance position after all layers + cache.advance() + +# Retrieve cached values +cached_k, cached_v = cache.get(layer_idx=0) +print(f"Cached 10 tokens: {cached_k.shape}") # (1, 8, 10, 64) + +# Calculate memory usage +mem_info = cache.get_memory_usage() +print(f"Cache memory: {mem_info['total_mb']:.2f} MB") ``` ## Systems Thinking Questions -1. **Memory-Speed Trade-off**: KV cache uses 18MB per batch for GPT-2. For batch=32, that's 576MB. What if you have 8GB GPU? How many concurrent users can you serve? What's the trade-off? +### Real-World Production Challenges -2. **Cache Invalidation**: In multi-turn chat, when should you clear the cache? What if context exceeds max_seq_len? How do production systems handle this? +**Memory-Speed Trade-off Analysis**: +- KV cache uses ~18MB per batch for GPT-2 (FP16). For batch=32, that's 576MB. +- On an 8GB GPU, how many concurrent users can you serve? +- What's the trade-off between batch size and cache size? +- When does memory bandwidth (cache access) become the bottleneck instead of compute? -3. **Distributed Caching**: For models too large for one GPU, you need tensor parallelism. How do you partition the KV cache across GPUs? What's the communication overhead? +**Cache Invalidation Policies**: +- In multi-turn chat, when should you clear the cache? +- What happens when context exceeds max_seq_len? +- How do production systems like ChatGPT handle context window limits? +- Compare eviction policies: LRU, FIFO, sliding window, importance-based -4. **Quantized Caching**: Storing cache in INT8 instead of FP16 saves 50% memory. What's the accuracy impact? When is this worth it? +**Distributed Caching for Large Models**: +- For models too large for one GPU, you need tensor parallelism +- How do you partition the KV cache across GPUs? +- Which dimension should you shard: layers, heads, or sequence? +- What's the communication overhead for cache synchronization? -5. **Speculation and Prefetching**: What if you predict the next query and pre-compute KV cache? How would you implement speculative caching? +**Quantized Caching**: +- Storing cache in INT8 instead of FP16 saves 50% memory +- What's the accuracy impact of quantized KV cache? +- When is this trade-off worth it? +- How does quantization error accumulate over long sequences? -## Real-World Connections +### Production Optimization Patterns -### Industry Applications +**Multi-Level Caching**: +- What if you cache not just K,V but intermediate layer activations? +- How does HuggingFace's `DynamicCache` differ from static pre-allocation? +- When should you use persistent caching (save to disk) for very long conversations? -**Conversational AI (OpenAI ChatGPT, Anthropic Claude)** -- KV caching for all multi-turn conversations -- Cache eviction policies for context window limits -- Memory-speed trade-offs define pricing ($/1M tokens) -- Without caching, latency would be 50-100ร— worse +**Speculation and Prefetching**: +- What if you predict the next query and pre-compute KV cache? +- How would speculative caching improve throughput? +- What's the risk if speculation is wrong? +- When does prefetching justify its overhead? -**Code Completion (GitHub Copilot, Cursor)** -- Real-time caching of code context -- Incremental updates as user types -- Low-latency requirements (< 100ms) mandate caching -- Cache hit rates directly impact user experience +### Mathematical Foundations -**Search and Retrieval (Perplexity, Bing AI)** -- Cache document embeddings and attention -- Multi-stage caching (retrieval + generation) -- Distributed caching across data centers -- Cache warmup for popular queries +**Complexity Reduction**: +- Why does KV caching transform O(nยฒ) into O(n)? +- Calculate total operations for naive vs cached generation (n=100) +- What's the crossover point where caching overhead exceeds savings? -### Research Impact +**Memory Layout Optimization**: +- Why pre-allocate cache instead of dynamic appending? +- How does cache contiguity affect memory bandwidth? +- Compare row-major vs column-major cache layouts for performance -This module implements patterns from: -- GPT-2 (2019): First large-scale use of KV caching -- Megatron-LM (2020): Distributed KV caching across GPUs -- FlashAttention (2022): Memory-efficient attention without full caching -- PagedAttention (2023): Virtual memory for KV cache management +**Attention Computation Analysis**: +- Why can we cache K,V but not Q (query)? +- What property of autoregressive generation makes caching valid? +- How would bidirectional attention (BERT) change caching strategy? -## What's Next? +### HuggingFace Cache Patterns Comparison -In **Module 18: Acceleration**, you'll learn hardware-aware optimization: +**Static vs Dynamic Cache**: +```python +# TinyTorch (Module 17): Static pre-allocation +cache = KVCache(max_seq_len=1024) # Fixed size, O(1) updates -- Vectorization and SIMD operations -- Batch processing for GPU efficiency -- Hardware-specific optimizations -- Parallel computation strategies +# HuggingFace: Dynamic cache (DynamicCache class) +cache = DynamicCache() # Grows as needed, more flexible but slower +``` -You've compressed models (Quantization + Compression) and now you're learning computational reuse (Memoization). Next, you'll accelerate computation through parallelism! +**When to Use Each**: +- **Static (TinyTorch)**: Known max length, maximum performance, inference serving +- **Dynamic (HuggingFace)**: Variable lengths, exploration, research + +**Production Systems (vLLM, TGI)**: +- Use PagedAttention for virtual memory management of KV cache +- Enables efficient memory sharing across requests +- Reduces memory fragmentation for variable-length sequences + +## Performance Characteristics + +### Expected Speedup by Sequence Length + +``` +Speedup Characteristics (GPT-2 on CPU): +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ Seq Length โ”‚ No Cache โ”‚ With Cache โ”‚ Speedup โ”‚ +โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค +โ”‚ 10 tokens โ”‚ ~80 tok/s โ”‚ ~600 tok/s โ”‚ 7.5x โ”‚ +โ”‚ 25 tokens โ”‚ ~40 tok/s โ”‚ ~500 tok/s โ”‚ 12.5x โ”‚ +โ”‚ 50 tokens โ”‚ ~25 tok/s โ”‚ ~400 tok/s โ”‚ 16.0x โ”‚ +โ”‚ 100 tokens โ”‚ ~12 tok/s โ”‚ ~200 tok/s โ”‚ 16.7x โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + +Key Insight: Speedup increases with sequence length! +Why? Longer sequences = more redundant computation without cache. +``` + +### Memory Usage by Model Size + +``` +Cache Memory Requirements (FP16, batch_size=1): +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ Model โ”‚ Layers โ”‚ Heads โ”‚ Seq Len โ”‚ Cache Memory โ”‚ +โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค +โ”‚ TinyGPT โ”‚ 4 โ”‚ 4 โ”‚ 128 โ”‚ 0.5 MB โ”‚ +โ”‚ GPT-2 (124M) โ”‚ 12 โ”‚ 12 โ”‚ 1024 โ”‚ 18.0 MB โ”‚ +โ”‚ GPT-3 (175B) โ”‚ 96 โ”‚ 96 โ”‚ 2048 โ”‚ 4.7 GB โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + +Formula: memory = num_layers ร— num_heads ร— max_seq_len ร— head_dim ร— 2 ร— 2 bytes +(2ร— for K and V, 2 bytes for FP16) +``` + +### Throughput Impact + +**Single Sequence Generation**: +- Without cache: Throughput decreases as sequence grows (O(nยฒ) bottleneck) +- With cache: Throughput stays relatively constant (O(n) scales well) + +**Batch Inference**: +- Cache memory scales linearly with batch size +- Throughput increases with batching (amortize model loading) +- Memory becomes limiting factor before compute + +## Where This Code Lives in the Final Package + +**Package Export**: Code exports to `tinytorch.generation.kv_cache` + +```python +# When students install tinytorch, they import your work like this: +from tinytorch.generation.kv_cache import KVCache, enable_kv_cache, disable_kv_cache +from tinytorch.nn import MultiHeadAttention # Base class from Module 12 +from tinytorch.models.transformer import GPT # Architecture from Module 13 + +# Usage in generation: +model = GPT(vocab_size=1000, embed_dim=128, num_layers=4, num_heads=4) +cache = enable_kv_cache(model) # Non-invasively add caching + +# Generate with caching enabled (10-15x faster!) +output = generate_text(model, prompt="Hello", max_new_tokens=100) + +# Disable caching if needed +disable_kv_cache(model) +``` + +Your KV caching implementation becomes the foundation for efficient inference in the TinyTorch package, used by subsequent modules for text generation, chat applications, and deployment scenarios. + +## Common Challenges and Solutions + +### Challenge 1: Cache Synchronization Across Layers + +**Problem**: Keeping cache consistent when different layers process at different speeds or batch items have variable lengths. + +**Solution**: +- Use layer indexing to maintain independent per-layer caches +- Advance sequence position only after ALL layers have processed current token +- Handle variable sequence lengths with padding and attention masks + +**Code Pattern**: +```python +# Process all layers before advancing +for layer_idx in range(num_layers): + cache.update(layer_idx, new_k, new_v) + +# Now advance position (all layers synchronized) +cache.advance() +``` + +### Challenge 2: Memory Overhead for Large Models + +**Problem**: Cache memory grows with sequence length and batch size, potentially exceeding GPU memory. + +**Solution**: +- Implement cache size limits with eviction policies (LRU, FIFO) +- Use FP16 or INT8 quantization for cache storage (50% memory reduction) +- Consider PagedAttention for virtual memory management +- Tune max_seq_len to expected generation length + +**Memory Optimization**: +```python +# FP16 caching (2 bytes per element) +cache = KVCache(...).to(dtype=np.float16) # 50% memory savings + +# INT8 caching (1 byte per element) +cache = KVCache(...).to(dtype=np.int8) # 75% memory savings, accuracy trade-off +``` + +### Challenge 3: Correctness Validation + +**Problem**: Cached generation must produce identical outputs to non-cached generation. + +**Solution**: +- Compare cached vs non-cached outputs token-by-token +- Use deterministic sampling (temperature=0) for testing +- Validate cache retrieval returns correct sequence positions +- Test edge cases: first token, cache full, reset + +**Validation Pattern**: +```python +# Generate without cache (ground truth) +output_nocache = generate(model, prompt, max_new_tokens=50) + +# Generate with cache (optimized) +cache = enable_kv_cache(model) +output_cached = generate(model, prompt, max_new_tokens=50) + +# Validate identical outputs +assert np.allclose(output_nocache, output_cached), "Cached output must match!" +``` + +### Challenge 4: Integration Without Breaking Existing Code + +**Problem**: Adding caching shouldn't require modifying Modules 12-13 (attention, transformer). + +**Solution**: +- Use composition + monkey-patching (wrap, don't modify) +- Store original forward methods before patching +- Provide disable_kv_cache() to restore original behavior +- Use feature flags (model._cache_enabled) for path selection + +**Non-Invasive Pattern**: +```python +# Save original before patching +block._original_attention_forward = block.attention.forward + +# Patch with cached version +block.attention.forward = cached_forward + +# Restore later if needed +block.attention.forward = block._original_attention_forward +``` + +## Ready to Build? + +You're about to implement the optimization that makes production language models economically viable! KV caching is THE technique that transformed LLMs from research toys into products used by millions daily. + +This is where theory meets practice in ML systems engineering. You'll see firsthand how a simple idea - "don't recompute what never changes" - can deliver 10-15x speedup and make the impossible possible. + +**What makes this module special**: Unlike many optimizations that require deep algorithmic changes, KV caching is conceptually simple but profoundly impactful. You'll implement it from scratch, measure the dramatic speedup, and understand the memory-speed trade-offs that guide production deployments. + +Understanding this optimization from first principles - implementing it yourself, profiling the speedup, analyzing the trade-offs - will give you deep insight into how production ML systems work. This is the optimization that makes ChatGPT, Claude, and GitHub Copilot possible. + +Take your time, measure thoroughly, and enjoy building production-ready ML systems! + +Choose your preferred way to engage with this module: + +````{grid} 1 2 3 3 + +```{grid-item-card} ๐Ÿš€ Launch Binder +:link: https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/17_memoization/memoization_dev.ipynb +:class-header: bg-light + +Run this module interactively in your browser. No installation required! +``` + +```{grid-item-card} โšก Open in Colab +:link: https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/17_memoization/memoization_dev.ipynb +:class-header: bg-light + +Use Google Colab for GPU access and cloud compute power. +``` + +```{grid-item-card} ๐Ÿ“– View Source +:link: https://github.com/mlsysbook/TinyTorch/blob/main/modules/17_memoization/memoization_dev.ipynb +:class-header: bg-light + +Browse the Jupyter notebook source and understand the implementation. +``` + +```` + +```{admonition} ๐Ÿ’พ Save Your Progress +:class: tip +**Binder sessions are temporary!** Download your completed notebook when done, or switch to local development for persistent work. + +``` --- -**Ready to implement production-critical caching?** Open `modules/17_memoization/memoization_dev.py` and start implementing. + diff --git a/modules/18_acceleration/ABOUT.md b/modules/18_acceleration/ABOUT.md index 25433c5b..63cc918a 100644 --- a/modules/18_acceleration/ABOUT.md +++ b/modules/18_acceleration/ABOUT.md @@ -1,149 +1,601 @@ --- -title: "Acceleration - Hardware-Aware Optimization" -description: "Optimize ML operations with SIMD, cache-friendly algorithms, and parallel computing" -difficulty: 3 +title: "Acceleration - CPU Vectorization & Cache Optimization" +description: "Master hardware-aware optimization through BLAS vectorization, cache-friendly algorithms, and roofline analysis" +difficulty: "3/4" time_estimate: "6-8 hours" -prerequisites: ["Profiling", "Compression"] +prerequisites: ["Profiling"] next_steps: ["Benchmarking"] learning_objectives: - - "Implement cache-friendly algorithms for matrix operations" - - "Apply SIMD vectorization for parallel data processing" - - "Design multi-core parallelization strategies for batch operations" - - "Understand hardware bottlenecks (compute vs memory bandwidth)" - - "Optimize ML kernels based on profiling data from Module 14" + - "Understand the roofline model and arithmetic intensity for predicting performance bottlenecks" + - "Leverage optimized BLAS libraries for CPU vectorization achieving 10-100x speedups" + - "Implement cache-aware algorithms and analyze memory hierarchy impact" + - "Apply kernel fusion to reduce memory bandwidth for element-wise operations" + - "Measure acceleration gains systematically using profiling integration" --- -# 18. Acceleration +# 18. Acceleration - CPU Vectorization & Cache Optimization -**โšก OPTIMIZATION TIER** | Difficulty: โญโญโญ (3/4) | Time: 6-8 hours +**OPTIMIZATION TIER** | Difficulty: 3/4 | Time: 6-8 hours ## Overview -Optimize ML operations through hardware-aware programming. This module implements cache-friendly algorithms, SIMD vectorization, and multi-core parallelization to achieve significant speedups based on profiling insights from Module 14. +The Acceleration module teaches you to extract maximum performance from modern CPUs through hardware-aware optimization techniques. You'll learn to leverage optimized BLAS libraries for vectorized matrix operations, implement cache-friendly algorithms that exploit memory hierarchy, and apply kernel fusion to eliminate memory bandwidth bottlenecks. By mastering the roofline model and arithmetic intensity analysis, you'll develop the systematic thinking needed to accelerate real ML systems from research prototypes to production deployments. + +This is CPU-focused accelerationโ€”the foundation for understanding GPU optimization. You'll work with NumPy's BLAS backend (MKL, OpenBLAS) to achieve 10-100x speedups over naive Python, understand why most operations are memory-bound rather than compute-bound, and learn the measurement-driven optimization workflow used by PyTorch, TensorFlow, and production ML systems. ## Learning Objectives -By completing this module, you will be able to: +By the end of this module, you will be able to: -1. **Implement cache-friendly algorithms** for matrix multiplication and convolution using blocked algorithms -2. **Apply SIMD vectorization** to parallelize element-wise operations across data -3. **Design multi-core parallelization strategies** for batch processing and data parallelism -4. **Understand hardware bottlenecks** (compute-bound vs memory-bound operations) -5. **Optimize ML kernels** based on actual profiling data, achieving measurable speedups +- **Understand Hardware Bottlenecks**: Apply the roofline model to determine whether operations are compute-bound or memory-bound, and predict performance limits from hardware specifications +- **Leverage BLAS Vectorization**: Use optimized linear algebra libraries that exploit SIMD instructions and multi-threading to achieve 10-100x speedups over naive implementations +- **Implement Cache-Aware Algorithms**: Design blocked/tiled algorithms that maximize cache hit rates by fitting working sets into L1/L2 cache for 2-5x memory performance gains +- **Apply Kernel Fusion**: Reduce memory bandwidth usage by 60-80% through fusing element-wise operations into single expressions that eliminate intermediate array allocations +- **Measure Systematically**: Integrate with Module 14 profiling to validate optimization impact, measure FLOPs efficiency, and calculate arithmetic intensity for real workloads -## Why This Matters +## Build โ†’ Use โ†’ Optimize -### Production Context +This module follows TinyTorch's **Build โ†’ Use โ†’ Optimize** framework: -Hardware optimization is critical for production ML: +1. **Build**: Implement vectorized matrix multiplication using BLAS, fused GELU activation, and tiled algorithms for cache efficiency +2. **Use**: Apply acceleration to realistic transformer blocks, analyze memory access patterns, and measure performance across different tensor sizes +3. **Optimize**: Analyze roofline characteristics, measure arithmetic intensity, and develop systematic decision frameworks for production optimization strategies -- **PyTorch** uses custom CUDA kernels and CPU vectorization; 100ร— faster than naive Python -- **TensorFlow XLA** compiles models to optimized machine code; reduces latency by 2-5ร— -- **ONNX Runtime** applies hardware-specific optimizations; powers Microsoft/Azure ML serving -- **Apple Neural Engine** uses custom accelerators; enables on-device ML on iPhones +## Why This Matters: The Hardware Reality -### Historical Context +### The Performance Gap -Hardware optimization evolved with ML scale: +Modern ML workloads face a fundamental challenge: **the speed gap between computation and memory access grows every year**. Consider a typical CPU: -- **Pre-Deep Learning (pre-2010)**: Hand-written assembly for critical loops; library implementations -- **GPU Era (2010-2017)**: CUDA kernels dominate; cuDNN becomes standard; 10-100ร— speedups -- **Specialized Hardware (2018+)**: TPUs, custom ASICs; compiler-based optimization -- **Modern Systems (2020+)**: ML compilers (TVM, XLA); automated kernel generation and tuning +- **Peak Compute**: 200-500 GFLOP/s (billions of floating-point operations per second) +- **Memory Bandwidth**: 20-50 GB/s (data transfer rate from RAM to CPU) +- **Imbalance**: CPUs can perform 10-20 floating-point operations in the time it takes to fetch a single float from memory -Understanding hardware optimization separates production engineers from researchers. +This means **most ML operations are memory-bound, not compute-bound**. Naive implementations waste computation cycles waiting for data. Professional optimization is about feeding the compute units efficiently. -## Pedagogical Pattern: Build โ†’ Use โ†’ Optimize +### From Naive Python to Production Performance -### 1. Build +The performance hierarchy for ML operations: -Implement from first principles: -- Blocked matrix multiplication for cache efficiency -- SIMD-vectorized element-wise operations -- Multi-threaded batch processing -- Memory-aligned data structures -- Profiling integration +``` +Naive Python loops: 1 GFLOP/s (baseline) +NumPy (vectorized): 10-50 GFLOP/s (10-50x faster) +Optimized BLAS (this module): 100-500 GFLOP/s (100-500x faster) +GPU CUDA kernels: 1,000-10,000 GFLOP/s (1,000-10,000x faster) +``` -### 2. Use +This module focuses on the **100-500x speedup** achievable on CPUs through: +- **SIMD vectorization**: Process 4-8 floats per instruction (AVX2/AVX-512) +- **Multi-threading**: Use all CPU cores (4-8x parallelism) +- **Cache blocking**: Keep data in fast cache memory (10-100x faster than RAM) +- **Kernel fusion**: Reduce memory traffic by 4-10x -Apply to real problems: -- Optimize bottlenecks identified in Module 14 -- Accelerate attention computation -- Speed up convolutional operations -- Parallelize data loading pipelines -- Measure actual speedups +### Real-World Impact -### 3. Optimize +These techniques enable: +- **Faster iteration**: Train models in hours instead of days during research +- **Lower costs**: More efficient use of cloud compute resources +- **Edge deployment**: Run models on CPUs without GPU requirements +- **Better scaling**: Handle larger models and batch sizes within memory limits -Production techniques: -- Auto-tuning for different hardware -- Mixed-precision computation (FP16/FP32) -- Operator fusion to reduce memory traffic -- Batch processing for amortized overhead -- Hardware-specific code paths +Understanding CPU optimization is prerequisite for GPU programmingโ€”same principles, different scale. + +## The Roofline Model: Your Performance Compass + +### Understanding Hardware Limits + +The **roofline model** is the fundamental tool for understanding performance bottlenecks. It plots two hardware limits: + +1. **Compute Roof**: Maximum FLOPs the processor can execute per second +2. **Memory Roof**: Maximum data bandwidth ร— arithmetic intensity + +**Arithmetic Intensity (AI)** = FLOPs performed / Bytes accessed + +``` +Performance Compute Bound Region +(GFLOPS) โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ + โ”‚ Peak Compute (500 GFLOP/s) + โ”‚ + โ•ฑโ”‚ + โ•ฑ โ”‚ Memory Bound Region + โ•ฑ โ”‚ + โ•ฑ โ”‚ + โ•ฑโ”€โ”€โ”€โ”€โ”‚โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ + โ•ฑ โ”‚ + โ•ฑ โ”‚ + โ•ฑโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”‚โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ Arithmetic Intensity + โ”‚ (FLOPs/Byte) + Lowโ”‚ High + (<1)โ”‚ (>10) +``` + +**Key Insight**: If your operation falls below the roofline (left side), adding more compute won't helpโ€”you need to reduce memory traffic through algorithmic improvements. + +### Example Calculations + +**Element-wise addition**: `c = a + b` +- FLOPs: N (one addition per element) +- Bytes: 3N ร— 4 bytes (read a, read b, write c) +- AI = N / (12N) = **0.08 FLOPs/byte** โ†’ Severely memory-bound + +**Matrix multiplication**: `C = A @ B` for Nร—N matrices +- FLOPs: 2Nยณ (dot product for each of Nยฒ output elements) +- Bytes: 3Nยฒ ร— 4 bytes (read A, read B, write C) +- AI = 2Nยณ / (12Nยฒ) = **N/6 FLOPs/byte** โ†’ Compute-bound for large N + +For N=1024: AI = 171 FLOPs/byteโ€”squarely in the compute-bound region. This is why matrix multiplication is ideal for GPUs and why transformers (which are mostly matmuls) run efficiently on accelerators. ## Implementation Guide -### Core Patterns +### 1. Vectorized Matrix Multiplication -**Cache-Friendly Matrix Multiplication** -- Block matrices into cache-sized tiles -- Reuse data while in cache (temporal locality) -- Access memory sequentially (spatial locality) -- Typical speedup: 2-5ร— over naive implementation +**The Challenge**: Naive triple-nested loops in Python achieve ~1 GFLOP/s. We need 100-500 GFLOP/s. -**SIMD Vectorization** -- Process multiple data elements simultaneously -- Use Numba/Cython for automatic vectorization -- Align data to vector boundaries (16/32/64 bytes) -- Typical speedup: 2-8ร— for element-wise ops +**The Solution**: Leverage optimized BLAS (Basic Linear Algebra Subprograms) libraries that implement: +- **SIMD vectorization**: AVX2/AVX-512 instructions process 4-8 floats simultaneously +- **Multi-threading**: Automatic parallelization across CPU cores (OpenMP) +- **Cache blocking**: Tiled algorithms that keep working sets in L1/L2 cache -**Multi-Core Parallelization** -- Divide work across CPU cores -- Use thread pools for batch processing -- Minimize synchronization overhead -- Typical speedup: 0.5-0.8ร— number of cores (due to overhead) +```python +def vectorized_matmul(a: Tensor, b: Tensor) -> Tensor: + """ + High-performance matrix multiplication using optimized BLAS. + + NumPy's matmul calls GEMM (General Matrix Multiply) from: + - Intel MKL (Math Kernel Library) - 200-500 GFLOP/s on modern CPUs + - OpenBLAS - 100-300 GFLOP/s + - Apple Accelerate - optimized for M1/M2 chips + + These libraries implement decades of optimization research. + """ + # Input validation + if a.shape[-1] != b.shape[-2]: + raise ValueError(f"Shape mismatch: {a.shape} @ {b.shape}") + + # Delegate to highly optimized BLAS implementation + # This single line replaces thousands of lines of hand-tuned assembly + result_data = np.matmul(a.data, b.data) + return Tensor(result_data) +``` + +**Performance Characteristics**: +- **Small matrices** (N < 64): 10-30 GFLOP/s, limited by overhead +- **Medium matrices** (N = 64-512): 100-300 GFLOP/s, optimal cache reuse +- **Large matrices** (N > 1024): 200-500 GFLOP/s, memory bandwidth saturated + +**Measured Speedups** (vs. naive triple loop): +- 128ร—128: **50x faster** (5ms โ†’ 0.1ms) +- 512ร—512: **120x faster** (800ms โ†’ 6.5ms) +- 2048ร—2048: **150x faster** (100s โ†’ 0.67s) + +### 2. Kernel Fusion: Eliminating Memory Traffic + +**The Problem**: Element-wise operations are memory-bound. Consider GELU activation: + +``` +GELU(x) = 0.5 * x * (1 + tanh(โˆš(2/ฯ€) * (x + 0.044715 * xยณ))) +``` + +**Unfused implementation** (naive): +```python +temp1 = x ** 3 # Read x, write temp1 +temp2 = 0.044715 * temp1 # Read temp1, write temp2 +temp3 = x + temp2 # Read x, temp2, write temp3 +temp4 = sqrt_2_pi * temp3 # Read temp3, write temp4 +temp5 = tanh(temp4) # Read temp4, write temp5 +temp6 = 1.0 + temp5 # Read temp5, write temp6 +temp7 = x * temp6 # Read x, temp6, write temp7 +result = 0.5 * temp7 # Read temp7, write result + +# Total: 8 reads + 8 writes = 16 memory operations per element +``` + +**Fused implementation**: +```python +def fused_gelu(x: Tensor) -> Tensor: + """ + Fused GELU activation - all operations in single expression. + + Memory efficiency: + - Unfused: 16 memory ops per element + - Fused: 2 memory ops per element (read x, write result) + - Reduction: 87.5% less memory traffic + """ + sqrt_2_over_pi = np.sqrt(2.0 / np.pi) + + # Single expression - NumPy optimizes into minimal memory operations + result_data = 0.5 * x.data * ( + 1.0 + np.tanh(sqrt_2_over_pi * (x.data + 0.044715 * x.data**3)) + ) + + return Tensor(result_data) +``` + +**Measured Performance** (2000ร—2000 tensor): +- Unfused: 45ms (7 temporary arrays allocated) +- Fused: 18ms (0 temporary arrays) +- **Speedup: 2.5x faster** through memory bandwidth reduction alone + +**Memory Usage**: +- Unfused: ~320MB (8 arrays ร— 2000ร—2000 ร— 4 bytes ร— overhead) +- Fused: ~32MB (input + output only) +- **Memory reduction: 90%** + +### 3. Cache-Aware Tiling (Blocked Algorithms) + +**The Memory Hierarchy**: +``` +L1 Cache: 32-64 KB 1-4 cycles ~1 TB/s bandwidth +L2 Cache: 256KB-1MB 10-20 cycles ~500 GB/s bandwidth +L3 Cache: 8-32 MB 40-75 cycles ~200 GB/s bandwidth +Main RAM: 8-64 GB 100-300 cycles ~20-50 GB/s bandwidth +``` + +**The Problem**: Naive matrix multiplication for 2048ร—2048 matrices accesses: +- Data size: 3 ร— 2048ยฒ ร— 4 bytes = 50MB (doesn't fit in L1/L2 cache) +- Result: Most accesses hit L3 or RAM (100-300 cycle latency) + +**The Solution**: Block/tile matrices into cache-sized chunks + +**Conceptual Tiled Algorithm**: +```python +def tiled_matmul_concept(A, B, tile_size=64): + """ + Conceptual tiling algorithm (educational). + + In practice, BLAS libraries implement this automatically + with hardware-specific tuning for optimal tile sizes. + """ + N = A.shape[0] + C = np.zeros((N, N)) + + # Process matrix in tile_size ร— tile_size blocks + for i in range(0, N, tile_size): + for j in range(0, N, tile_size): + for k in range(0, N, tile_size): + # This block fits in L1/L2 cache (64ร—64ร—4 = 16KB) + # All accesses hit fast cache instead of slow RAM + i_end = min(i + tile_size, N) + j_end = min(j + tile_size, N) + k_end = min(k + tile_size, N) + + C[i:i_end, j:j_end] += A[i:i_end, k:k_end] @ B[k:k_end, j:j_end] + + return C +``` + +**Cache Efficiency Analysis**: +- **Naive algorithm**: 99% L3/RAM accesses (slow) +- **Blocked algorithm** (64ร—64 tiles): 95% L1/L2 hits (fast) +- **Latency reduction**: 300 cycles โ†’ 10 cycles average +- **Effective speedup**: 2-5x for large matrices + +**Optimal Tile Sizes** (empirically determined): +- L1-focused: 32ร—32 (4KB per block) +- L2-focused: 64ร—64 (16KB per block) โ† sweet spot for most CPUs +- L3-focused: 128ร—128 (64KB per block) + +Note: In this module, we use NumPy's `matmul` which delegates to BLAS libraries (MKL, OpenBLAS) that already implement sophisticated cache blocking with hardware-specific tuning. Production implementations use tile sizes, loop unrolling, and prefetching tuned for specific CPU architectures. + +### 4. Roofline Analysis in Practice + +**Measuring Your Hardware**: + +```python +def analyze_arithmetic_intensity(): + """Measure actual performance vs. theoretical roofline.""" + + # Theoretical hardware limits (example: modern Intel CPU) + peak_compute = 400 # GFLOP/s (AVX-512, 8 cores, 3.5 GHz) + peak_bandwidth = 45 # GB/s (DDR4-2666, dual-channel) + + operations = { + "Element-wise add": { + "flops": N, + "bytes": 3 * N * 4, + "ai": 0.08 # FLOPs/byte + }, + "Matrix multiply": { + "flops": 2 * N**3, + "bytes": 3 * N**2 * 4, + "ai": N / 6 # For N=1024: 171 FLOPs/byte + } + } + + # Predicted performance = min(peak_compute, ai ร— peak_bandwidth) + for op, metrics in operations.items(): + predicted_gflops = min( + peak_compute, + metrics["ai"] * peak_bandwidth + ) + print(f"{op}: {predicted_gflops:.1f} GFLOP/s (predicted)") +``` + +**Example Analysis** (N=1024): + +| Operation | AI (FLOPs/byte) | Predicted GFLOP/s | Measured GFLOP/s | Efficiency | +|-----------|----------------|-------------------|------------------|------------| +| Element-wise add | 0.08 | 3.6 (memory-bound) | 3.2 | 89% | +| GELU (fused) | 1.0 | 45 (memory-bound) | 38 | 84% | +| Matrix multiply | 171 | 400 (compute-bound) | 320 | 80% | + +**Key Insights**: +- Element-wise operations hit **memory roof** at 3-4 GFLOP/s (only 1% of peak compute) +- Fusion improves AI by reducing memory operations (0.08 โ†’ 1.0 AI) +- Matrix multiplication approaches **compute roof** (80% of peak) +- Optimization strategy should focus on memory-bound operations first + +## Getting Started + +### Prerequisites + +Ensure you've completed: +- **Module 14 (Profiling)**: You'll use profiling tools to measure acceleration gains +- **Module 01 (Tensor)**: Tensor class provides foundation for operations +- **NumPy/BLAS**: Verify optimized BLAS backend is installed + +Check your BLAS configuration: +```bash +# Activate TinyTorch environment +source bin/activate-tinytorch.sh + +# Check which BLAS backend NumPy uses +python -c "import numpy as np; np.show_config()" + +# Look for: openblas, mkl, or accelerate (Apple Silicon) +# MKL is fastest on Intel CPUs (200-500 GFLOP/s) +# OpenBLAS is good cross-platform (100-300 GFLOP/s) +``` + +Verify prerequisite modules work: +```bash +tito test --module tensor +tito test --module profiling +``` + +### Development Workflow + +1. **Open the development file**: `modules/18_acceleration/acceleration.py` + +2. **Implement vectorized matrix multiplication**: + - Validate input shapes for compatibility + - Delegate to `np.matmul` (calls optimized BLAS GEMM) + - Return result wrapped in Tensor + - Test correctness and measure speedup vs. naive loops + +3. **Build fused GELU activation**: + - Implement complete GELU formula in single expression + - Avoid creating intermediate Tensor objects + - Test numerical correctness against reference implementation + - Measure memory bandwidth reduction + +4. **Create tiled matrix multiplication**: + - Understand cache blocking concept (educational) + - Use NumPy's matmul which implements tiling internally + - Analyze cache hit rates and memory access patterns + - Compare performance across different matrix sizes + +5. **Perform roofline analysis**: + - Measure FLOPs and memory bandwidth for each operation + - Calculate arithmetic intensity + - Plot operations on roofline model + - Identify optimization priorities + +6. **Export and verify**: + ```bash + tito module complete 18 + tito test --module acceleration + ``` ## Testing +### Comprehensive Test Suite + +Run the full test suite to verify acceleration functionality: + ```bash -cd modules/16_acceleration -python acceleration_dev.py -tito export 16_acceleration -tito test 16_acceleration +# TinyTorch CLI (recommended) +tito test --module acceleration + +# Direct pytest execution +python -m pytest tests/ -k acceleration -v + +# Run development file directly (includes inline tests) +python modules/18_acceleration/acceleration.py ``` -## Where This Code Lives +### Test Coverage Areas +- **Vectorized Operations Correctness**: Matrix multiplication produces numerically correct results, handles batching and broadcasting, validates incompatible shapes +- **Kernel Fusion Correctness**: Fused GELU matches reference implementation, handles extreme values without NaN/Inf, preserves data types and shapes +- **Performance Validation**: Vectorized matmul achieves 10-150x speedup over naive loops, kernel fusion provides 2-5x speedup and 60-80% memory reduction, performance scales appropriately with tensor size +- **Integration Testing**: Acceleration techniques work together in realistic transformer blocks, profiler integration measures speedups correctly, memory efficiency validated with tracemalloc +- **Roofline Analysis**: Arithmetic intensity calculated correctly for different operations, performance predictions match measurements within 20%, memory-bound vs. compute-bound classification accurate + +### Inline Testing & Performance Analysis + +The module includes comprehensive validation and measurement: + +```python +# Run all inline tests +python modules/18_acceleration/acceleration.py + +# Expected output: +๐Ÿ”ฌ Unit Test: Vectorized Matrix Multiplication... +โœ… vectorized_matmul works correctly! + +๐Ÿ”ฌ Unit Test: Fused GELU... +โœ… fused_gelu works correctly! + +๐Ÿ”ฌ Unit Test: Kernel Fusion Performance Impact... +๐Ÿ“Š Kernel Fusion Performance Analysis: + Tensor size: 2000ร—2000 = 4,000,000 elements + Unfused time: 45.23 ms + Fused time: 18.12 ms + Speedup: 2.50ร— faster + Per-element: 11.3 ns โ†’ 4.5 ns + Memory efficiency: 7โ†’2 memory ops + Effective bandwidth: 15.2โ†’38.5 GB/s +๐Ÿš€ Excellent! Kernel fusion providing significant speedup + +๐Ÿ“Š Analyzing vectorization scaling behavior... +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ Size โ”‚ Time (ms) โ”‚ GFLOPS โ”‚ Bandwidth โ”‚ Efficiency โ”‚ +โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค +โ”‚ 64 โ”‚ 0.05 โ”‚ 33.6 โ”‚ 15.8 โ”‚ 16.8 โ”‚ +โ”‚ 128 โ”‚ 0.18 โ”‚ 114.2 โ”‚ 26.7 โ”‚ 57.1 โ”‚ +โ”‚ 256 โ”‚ 1.12 โ”‚ 188.5 โ”‚ 22.1 โ”‚ 94.3 โ”‚ +โ”‚ 512 โ”‚ 6.45 โ”‚ 328.7 โ”‚ 19.3 โ”‚ 164.4 โ”‚ +โ”‚ 1024 โ”‚ 42.18 โ”‚ 405.1 โ”‚ 16.1 โ”‚ 202.6 โ”‚ +โ”‚ 2048 โ”‚ 281.34 โ”‚ 485.2 โ”‚ 15.3 โ”‚ 242.6 โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + +๐Ÿงช RUNNING MODULE INTEGRATION TEST +Running unit tests... +โœ… All tests passed! + +๐ŸŽ‰ ALL TESTS PASSED! Module ready for export. ``` -tinytorch/ -โ”œโ”€โ”€ acceleration/ -โ”‚ โ””โ”€โ”€ kernels.py # Optimized implementations -โ””โ”€โ”€ __init__.py + +### Manual Testing Examples + +```python +from modules.18_acceleration.acceleration import * + +# Test vectorized matrix multiplication +A = Tensor(np.random.randn(512, 512).astype(np.float32)) +B = Tensor(np.random.randn(512, 512).astype(np.float32)) + +# Measure performance +import time +start = time.time() +C = vectorized_matmul(A, B) +elapsed = time.time() - start + +# Calculate metrics +flops = 2 * 512**3 # 268 million FLOPs +gflops = flops / (elapsed * 1e9) +print(f"Performance: {gflops:.1f} GFLOP/s") +print(f"Time: {elapsed*1000:.2f} ms") + +# Test kernel fusion +x = Tensor(np.random.randn(1000, 1000).astype(np.float32)) + +# Compare fused vs unfused +start = time.time() +y_fused = fused_gelu(x) +fused_time = time.time() - start + +start = time.time() +y_unfused = unfused_gelu(x) +unfused_time = time.time() - start + +print(f"Speedup: {unfused_time/fused_time:.2f}x") +print(f"Numerically equivalent: {np.allclose(y_fused.data, y_unfused.data)}") + +# Measure with profiler +from tinytorch.profiling.profiler import Profiler + +profiler = Profiler() + +class SimpleModel: + def __init__(self): + self.weight = Tensor(np.random.randn(256, 256).astype(np.float32)) + + def forward(self, x): + return fused_gelu(vectorized_matmul(x, self.weight)) + +model = SimpleModel() +input_tensor = Tensor(np.random.randn(32, 256).astype(np.float32)) + +latency = profiler.measure_latency(model, input_tensor, warmup=5, iterations=20) +flops = profiler.count_flops(model, (32, 256)) + +print(f"Latency: {latency:.2f} ms") +print(f"FLOPs: {flops:,}") +print(f"Throughput: {flops / (latency/1000) / 1e9:.2f} GFLOP/s") ``` ## Systems Thinking Questions -1. **Roofline Model**: Your operation needs 1000 FLOPs and 100 bytes. At 100 GFLOPs/s compute and 10 GB/s bandwidth, what's the bottleneck? +### Real-World Applications -2. **Amdahl's Law Applied**: You parallelize 90% of code perfectly across 8 cores. What's max speedup? Why not 8ร—? +- **Training Acceleration**: How do vectorized operations reduce training time for transformers? What's the speedup for attention computation (mostly matrix multiplies) vs. layer normalization (element-wise operations)? -3. **Cache Hierarchy**: L1 cache is 10ร— faster than L2, which is 10ร— faster than RAM. How does blocking matrix multiplication exploit this? +- **Inference Optimization**: Why is kernel fusion more important for inference than training? How does batch size affect the benefit of vectorization vs. fusion? -## Real-World Connections +- **Hardware Selection**: Given a model with 70% matrix multiplies and 30% element-wise operations, should you optimize for compute or memory bandwidth? How does this affect CPU vs. GPU selection? -**PyTorch/TensorFlow**: Custom CUDA kernels for all operations -**ONNX Runtime**: Hardware-specific optimization for production serving -**Apple ML**: Metal shaders and Neural Engine for on-device inference +- **Cloud Cost Reduction**: If vectorization provides 100x speedup on matrix operations that take 80% of training time, what's the overall training time reduction and cost savings? -## What's Next? +### Roofline Analysis Foundations -In **Module 19: Benchmarking**, you'll rigorously measure all optimizations: -- Fair comparison across optimization techniques -- Statistical significance testing -- MLPerf-style benchmarking protocols -- Comprehensive performance reports +- **Arithmetic Intensity Calculation**: For convolution with kernel size Kร—K, input channels C_in, output channels C_out, and spatial dimensions Hร—W, what's the arithmetic intensity? Is it compute-bound or memory-bound? + +- **Memory Hierarchy Impact**: Why does cache blocking improve performance by 2-5x even though it performs the same FLOPs? What's the latency difference between L1 cache hits (4 cycles) vs. RAM accesses (300 cycles)? + +- **BLAS Library Performance**: Why does NumPy's matmul achieve 200-500 GFLOP/s while naive Python loops achieve 1 GFLOP/s? What optimizations do BLAS libraries implement that interpreted Python can't? + +- **Batch Size Effects**: How does batch size affect arithmetic intensity for matrix multiplication? Why do larger batches achieve higher GFLOP/s on the same hardware? + +### Optimization Strategy Characteristics + +- **Memory-Bound Operations**: Why does adding more CPU cores NOT improve element-wise addition performance? What's the fundamental bottleneck, and how do you fix it? + +- **Kernel Fusion Trade-offs**: Fused GELU reduces memory operations from 16 to 2 per element. Why doesn't this give 8x speedup? What other factors limit acceleration? + +- **Production Optimization Priority**: Given profiling data showing 40% time in attention softmax (memory-bound), 30% in matmuls (compute-bound), and 30% in data loading (I/O-bound), which should you optimize first? Why? + +- **Cross-Platform Performance**: Why do vectorized operations using BLAS achieve different speedups on Intel CPUs (MKL: 500 GFLOP/s) vs. AMD CPUs (OpenBLAS: 200 GFLOP/s) vs. Apple Silicon (Accelerate: 300 GFLOP/s)? What's hardware-dependent vs. algorithmic? + +## Ready to Build? + +You're about to learn the hardware-aware optimization techniques that separate research prototypes from production ML systems. Understanding how to extract maximum performance from CPUsโ€”through vectorization, cache optimization, and memory bandwidth reductionโ€”is foundational knowledge for any ML engineer. + +These aren't just academic exercises. Every time you use PyTorch or TensorFlow, you're benefiting from these exact techniques implemented in their backend libraries. By building them yourself, you'll understand: + +- Why transformers (mostly matmuls) run efficiently on GPUs while RNNs (sequential operations) struggle +- How to predict whether adding more hardware will help before spending cloud budget +- When to optimize code vs. when to redesign algorithms for better arithmetic intensity +- How to measure and validate performance improvements systematically + +The roofline model and arithmetic intensity analysis you'll master here apply directly to GPUs, TPUs, and custom AI accelerators. Hardware changes, but the fundamental memory-vs-compute trade-offs remain constant. This module gives you the mental models and measurement tools to optimize on any platform. + +Choose your preferred way to engage with this module: + +````{grid} 1 2 3 3 + +```{grid-item-card} ๐Ÿš€ Launch Binder +:link: https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/18_acceleration/acceleration_dev.ipynb +:class-header: bg-light + +Run this module interactively in your browser. No installation required! +``` + +```{grid-item-card} โšก Open in Colab +:link: https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/18_acceleration/acceleration_dev.ipynb +:class-header: bg-light + +Use Google Colab for GPU access and cloud compute power. +``` + +```{grid-item-card} ๐Ÿ“– View Source +:link: https://github.com/mlsysbook/TinyTorch/blob/main/modules/18_acceleration/acceleration.py +:class-header: bg-light + +Browse the Python source code and understand the implementation. +``` + +```` + +```{admonition} ๐Ÿ’พ Save Your Progress +:class: tip +**Binder sessions are temporary!** Download your completed notebook when done, or switch to local development for persistent work. +``` --- -**Ready to optimize for hardware?** Open `modules/18_acceleration/acceleration_dev.py` and start implementing. + diff --git a/modules/19_benchmarking/ABOUT.md b/modules/19_benchmarking/ABOUT.md index 5ce003fc..dea90f2b 100644 --- a/modules/19_benchmarking/ABOUT.md +++ b/modules/19_benchmarking/ABOUT.md @@ -1,118 +1,424 @@ --- title: "Benchmarking - Fair Performance Comparison" -description: "MLPerf-style benchmarking with statistical rigor and standardized metrics" +description: "Statistical rigor and standardized metrics for optimization validation" difficulty: 3 time_estimate: "5-6 hours" prerequisites: ["Profiling", "All optimization techniques"] next_steps: ["Competition (Capstone)"] learning_objectives: - - "Implement MLPerf-inspired benchmarking frameworks" - - "Design fair comparison protocols across different hardware" - - "Apply statistical significance testing to performance claims" - - "Build normalized metrics for hardware-independent comparison" - - "Generate comprehensive performance reports with visualizations" + - "Understand benchmark design principles including statistical measurement, fair comparison protocols, and reproducible methodology" + - "Implement statistical rigor for performance measurement with confidence intervals, variance reporting, and measurement uncertainty" + - "Master fair comparison protocols that control for system noise, hardware variability, and environmental factors" + - "Build normalized metrics systems including speedup ratios, compression factors, and efficiency scores for hardware-independent comparison" + - "Analyze measurement trade-offs including overhead costs, statistical power requirements, and reproducibility constraints" --- -# 19. Benchmarking +# 19. Benchmarking - Fair Performance Comparison -**โšก OPTIMIZATION TIER** | Difficulty: โญโญโญ (3/4) | Time: 5-6 hours +**OPTIMIZATION TIER** | Difficulty: โญโญโญ (3/4) | Time: 5-6 hours ## Overview -Build rigorous benchmarking systems following MLPerf principles. This module implements fair comparison protocols, statistical testing, and normalized metrics for evaluating all the optimizations you've built in the Optimization Tier. +You'll build a rigorous performance measurement system that enables fair comparison of all your optimizations. This module implements educational benchmarking with statistical testing, normalized metrics, and reproducible protocols. Your benchmarking framework provides the measurement methodology used in Module 20's competition workflow, where you'll apply these tools to validate optimizations systematically. ## Learning Objectives -By completing this module, you will be able to: +By the end of this module, you will be able to: -1. **Implement MLPerf-inspired benchmarking** frameworks with standardized scenarios -2. **Design fair comparison protocols** accounting for hardware differences -3. **Apply statistical significance testing** to validate performance claims -4. **Build normalized metrics** (speedup, compression ratio, efficiency scores) -5. **Generate comprehensive reports** with visualizations and actionable insights +- **Understand benchmark design principles**: Reproducibility requirements; representative workload selection; measurement methodology; controlling for confounding variables; fair comparison protocols +- **Implement statistical rigor**: Multiple runs with warmup periods; confidence interval calculation; variance reporting not just means; understanding measurement uncertainty; detecting outliers +- **Master fair comparison protocols**: Hardware normalization strategies; environmental controls (thermal, OS noise); baseline selection criteria; same workload/data/environment enforcement; apples-to-apples measurement +- **Build normalized metrics systems**: Speedup ratios (baseline_time / optimized_time); compression factors (original_size / compressed_size); accuracy preservation tracking; efficiency scores combining multiple objectives; hardware-independent reporting +- **Analyze measurement trade-offs**: Benchmark coverage vs runtime cost; statistical power vs sample size requirements; reproducibility vs realism; instrumentation overhead (observer effect); when 5% speedup is significant vs noise -## Why This Matters +## Build โ†’ Use โ†’ Analyze -### Production Context +This module follows TinyTorch's **Build โ†’ Use โ†’ Analyze** framework: -Benchmarking drives ML systems decisions: - -- **MLPerf** standardizes ML benchmarking; companies compete on leaderboards -- **Google TPU** teams use rigorous benchmarking to justify hardware investments -- **Meta PyTorch** benchmarks every optimization before merging to production -- **OpenAI** benchmarks training efficiency to optimize $millions in compute costs - -### Historical Context - -- **Pre-2018**: Ad-hoc benchmarking; inconsistent metrics; hard to compare -- **MLPerf Launch (2018)**: Standardized benchmarks; reproducible results -- **2019-2021**: MLPerf Training and Inference; industry adoption -- **2021+**: MLPerf Tiny, Mobile; benchmarking for edge deployment - -Without rigorous benchmarking, optimization claims are meaningless. +1. **Build**: Implement benchmarking framework with statistical testing (confidence intervals, t-tests), normalized metrics (speedup, compression, efficiency), warmup protocols, and automated report generation +2. **Use**: Benchmark all your Optimization Tier implementations (profiling, quantization, compression, memoization, acceleration) against baselines on real tasks; compare fairly with statistical rigor +3. **Analyze**: Why do benchmark results vary across runs? How does hardware affect comparison fairness? When is 5% speedup statistically significant vs noise? What makes benchmarks representative vs over-fitted? ## Implementation Guide -### Core Components +### Core Benchmarking Components -**MLPerf Principles** -1. **Reproducibility**: Fixed random seeds, documented environment -2. **Fairness**: Same workload, measured on same hardware -3. **Realism**: Representative tasks (ResNet, BERT, etc.) -4. **Transparency**: Open-source code and results +Your benchmarking framework implements four key systems: -**Normalized Metrics** -- **Speedup**: baseline_time / optimized_time -- **Compression Ratio**: baseline_size / compressed_size -- **Accuracy Delta**: optimized_accuracy - baseline_accuracy -- **Efficiency Score**: (speedup ร— compression) / (1 + accuracy_loss) +#### 1. Statistical Measurement Infrastructure -**Statistical Rigor** -- Multiple runs (typically 10+) -- Confidence intervals -- Significance testing (t-test, Mann-Whitney) -- Report variance, not just mean +**Why Multiple Runs Matter** + +Single measurements are meaningless in ML systems. Performance varies 10-30% across runs due to: +- **Thermal throttling**: CPU frequency drops when hot +- **OS background tasks**: Interrupts, garbage collection, other processes +- **Memory state**: Cache coldness, fragmentation, swap pressure +- **CPU frequency scaling**: Dynamic frequency adjustment + +**Statistical Solution** + +```python +class BenchmarkResult: + """Container for measurements with statistical analysis.""" + + def __init__(self, metric_name: str, values: List[float]): + self.mean = statistics.mean(values) + self.std = statistics.stdev(values) + self.median = statistics.median(values) + + # 95% confidence interval for the mean + t_score = 1.96 # Normal approximation + margin = t_score * (self.std / np.sqrt(len(values))) + self.ci_lower = self.mean - margin + self.ci_upper = self.mean + margin +``` + +**What This Reveals**: If confidence intervals overlap between baseline and optimized, the difference might be noise. Statistical rigor prevents false claims. + +#### 2. Warmup and Measurement Protocol + +**The Warmup Problem** + +First run: 120ms. Second run: 100ms. Third run: 98ms. What happened? +- **Cold cache**: First run pays cache miss penalties +- **JIT compilation**: NumPy and frameworks compile code paths on first use +- **Memory allocation**: Initial runs establish memory patterns + +**Warmup Solution** + +```python +class Benchmark: + def __init__(self, warmup_runs=5, measurement_runs=10): + self.warmup_runs = warmup_runs + self.measurement_runs = measurement_runs + + def run_latency_benchmark(self, model, input_data): + # Warmup: stabilize performance + for _ in range(self.warmup_runs): + model.forward(input_data) + + # Measurement: collect statistics + latencies = [] + for _ in range(self.measurement_runs): + start = time.perf_counter() + model.forward(input_data) + latencies.append(time.perf_counter() - start) + + return BenchmarkResult("latency_ms", latencies) +``` + +**Why This Matters**: Warmup runs discard cold-start effects. Measurement runs capture true steady-state performance. + +#### 3. Normalized Metrics for Fair Comparison + +**Hardware-Independent Speedup** + +```python +# Speedup ratio: baseline_time / optimized_time +speedup = baseline_result.mean / optimized_result.mean + +# Example: 100ms / 80ms = 1.25x speedup (25% faster) +# Speedup > 1.0 means optimization helped +# Speedup < 1.0 means optimization regressed +``` + +**Compression Ratio** + +```python +# Model size reduction +compression_ratio = original_size_mb / compressed_size_mb + +# Example: 100MB / 25MB = 4x compression +``` + +**Efficiency Score (Multi-Objective)** + +```python +# Combine speed + size + accuracy +efficiency = (speedup * compression) / (1 + abs(accuracy_delta)) + +# Penalizes accuracy loss +# Rewards speed AND compression +# Single metric for ranking +``` + +**Why Normalized Metrics**: Speedup ratios work on any hardware. "2x faster" is meaningful whether you have M1 Mac or Intel i9. Absolute times (100ms โ†’ 50ms) are hardware-specific. + +#### 4. Comprehensive Benchmark Suite + +**Multiple Benchmark Types** + +Your `BenchmarkSuite` runs: +1. **Latency Benchmark**: How fast is inference? (milliseconds) +2. **Accuracy Benchmark**: How correct are predictions? (0.0-1.0) +3. **Memory Benchmark**: How much RAM is used? (megabytes) +4. **Energy Benchmark**: How efficient is compute? (estimated joules) + +**Pareto Frontier Analysis** + +``` +Accuracy + โ†‘ + | A โ— โ† Model A: High accuracy, high latency + | + | B โ— โ† Model B: Balanced (Pareto optimal) + | + | C โ—โ† Model C: Low accuracy, low latency + |__________โ†’ Latency (lower is better) +``` + +Models on the Pareto frontier aren't strictly dominatedโ€”each represents a valid optimization trade-off. Your suite automatically identifies these optimal points. + +### Real-World Benchmarking Principles + +Your implementation teaches industry-standard methodology: + +#### Reproducibility Requirements + +Every benchmark run documents: +```python +system_info = { + 'platform': 'macOS-14.2-arm64', # OS version + 'processor': 'Apple M1 Max', # CPU type + 'python_version': '3.11.6', # Runtime + 'memory_gb': 64, # RAM + 'cpu_count': 10 # Cores +} +``` + +**Why**: Colleague should reproduce your results given same environment. Missing details make verification impossible. + +#### Fair Comparison Protocol + +**Don't Compare**: +- GPU-optimized code vs CPU baseline (unfair hardware) +- Quantized INT8 vs FP32 baseline (unfair precision) +- Batch size 32 vs batch size 1 (unfair workload) +- Cold start vs warmed up (unfair cache state) + +**Do Compare**: +- Same hardware, same workload, same environment +- Baseline vs optimized on identical conditions +- Report speedup with confidence intervals +- Test statistical significance (t-test, p < 0.05) + +#### Statistical Significance Testing + +```python +from scipy import stats + +baseline_times = [100, 102, 98, 101, 99] # ms +optimized_times = [95, 97, 93, 96, 94] + +# Is the difference real or noise? +t_stat, p_value = stats.ttest_ind(baseline_times, optimized_times) + +if p_value < 0.05: + print("Statistically significant (p < 0.05)") +else: + print("Not significantโ€”could be noise") +``` + +**Why This Matters**: 5% speedup with p=0.08 isn't significant. Could be measurement variance. Production teams don't merge optimizations without statistical confidence. + +### Connection to Competition Workflow (Module 20) + +This benchmarking infrastructure provides the measurement harness used in Module 20's competition workflow: + +**How Module 20 Uses This Framework** +1. Module 20 uses your `Benchmark` class to measure baseline and optimized performance +2. Statistical rigor from this module ensures fair comparison across submissions +3. Normalized metrics enable hardware-independent ranking +4. Reproducible protocols ensure all competitors use the same measurement methodology + +**The Workflow** +1. Module 19: Learn benchmarking methodology (statistical rigor, fair comparison) +2. Module 20: Apply benchmarking tools in competition workflow (submission generation, validation) +3. Competition: Use Benchmark harness to measure and validate optimizations + +Your benchmarking framework provides the foundation for fair competitionโ€”same measurement methodology, same statistical analysis, same reporting format. Module 20 teaches how to use these tools in a competition context. + +## Getting Started + +### Prerequisites + +Ensure you understand the optimization foundations: + +```bash +# Activate TinyTorch environment +source bin/activate-tinytorch.sh + +# Verify prerequisite modules +tito test --module profiling +tito test --module quantization +tito test --module compression +``` + +### Development Workflow + +1. **Open the development file**: `modules/19_benchmarking/benchmarking_dev.py` +2. **Implement BenchmarkResult**: Container for measurements with statistical analysis +3. **Build Benchmark class**: Runner with warmup, multiple runs, metrics collection +4. **Create BenchmarkSuite**: Full evaluation with latency/accuracy/memory/energy +5. **Add reporting**: Automated report generation with visualizations +6. **Export and verify**: `tito module complete 19 && tito test --module benchmarking` ## Testing +### Comprehensive Test Suite + +Run the full test suite to verify benchmarking functionality: + ```bash -tito export 19_benchmarking -tito test 19_benchmarking +# TinyTorch CLI (recommended) +tito test --module benchmarking + +# Direct pytest execution +python -m pytest tests/ -k benchmarking -v ``` -## Where This Code Lives +### Test Coverage Areas +- โœ… **Statistical Calculations**: Mean, std, median, confidence intervals computed correctly +- โœ… **Multiple Runs**: Warmup and measurement phases work properly +- โœ… **Normalized Metrics**: Speedup, compression, efficiency calculated accurately +- โœ… **Fair Comparison**: Same workload enforcement, baseline vs optimized +- โœ… **Result Serialization**: BenchmarkResult converts to dict for storage +- โœ… **Visualization**: Plots generate with proper formatting and error bars +- โœ… **System Info**: Metadata captured for reproducibility +- โœ… **Pareto Analysis**: Optimal trade-off points identified correctly + +### Inline Testing & Validation + +The module includes comprehensive unit tests: + +```python +๐Ÿ”ฌ Unit Test: BenchmarkResult... +โœ… Mean calculation correct: 3.0 +โœ… Std calculation matches statistics module +โœ… Confidence intervals bound mean +โœ… Serialization preserves data +๐Ÿ“ˆ Progress: BenchmarkResult โœ“ + +๐Ÿ”ฌ Unit Test: Benchmark latency... +โœ… Warmup runs executed before measurement +โœ… Multiple measurement runs collected +โœ… Results include mean ยฑ CI +๐Ÿ“ˆ Progress: Benchmark โœ“ + +๐Ÿ”ฌ Unit Test: BenchmarkSuite... +โœ… All benchmark types run (latency, accuracy, memory, energy) +โœ… Results organized by metric type +โœ… Visualizations generated +๐Ÿ“ˆ Progress: BenchmarkSuite โœ“ ``` -tinytorch/ -โ”œโ”€โ”€ benchmarking/ -โ”‚ โ””โ”€โ”€ benchmark.py -โ””โ”€โ”€ __init__.py + +### Manual Testing Examples + +```python +from tinytorch.benchmarking.benchmark import Benchmark, BenchmarkSuite +from tinytorch.core.tensor import Tensor +import numpy as np + +# Create simple models for testing +class FastModel: + name = "fast_model" + def forward(self, x): + return x * 2 + +class SlowModel: + name = "slow_model" + def forward(self, x): + import time + time.sleep(0.01) # Simulate 10ms latency + return x * 2 + +# Benchmark comparison +models = [FastModel(), SlowModel()] +benchmark = Benchmark(models, datasets=[None]) + +# Run latency benchmark +results = benchmark.run_latency_benchmark() + +for model_name, result in results.items(): + print(f"{model_name}: {result.mean:.2f} ยฑ {result.std:.2f}ms") + print(f" 95% CI: [{result.ci_lower:.2f}, {result.ci_upper:.2f}]") + +# Speedup calculation +fast_time = results['fast_model'].mean +slow_time = results['slow_model'].mean +speedup = slow_time / fast_time +print(f"\nSpeedup: {speedup:.2f}x") ``` ## Systems Thinking Questions -1. **Hardware Normalization**: How do you compare optimizations across M1 Mac vs Intel vs AMD? What metrics are fair? +### Real-World Applications -2. **Statistical Power**: You measure 5% speedup with p=0.06. Is this significant? How many runs do you need? +- **Production ML Deployment**: PyTorch runs continuous benchmarking before merging optimizationsโ€”statistical rigor prevents performance regressions +- **Hardware Evaluation**: Google's TPU teams benchmark every architecture iterationโ€”measurements justify billion-dollar hardware investments +- **Model Optimization**: Meta benchmarks training efficiency (samples/sec, memory, convergence)โ€”10% speedup saves hundreds of thousands in compute costs +- **Research Validation**: Papers require reproducible benchmarks with statistical significanceโ€”ablation studies need fair comparison protocols -3. **Benchmark Selection**: MLPerf uses ResNet-50. Does this represent all workloads? What about transformers, GANs, RL? +### Statistical Foundations -## Real-World Connections +- **Central Limit Theorem**: Multiple measurements โ†’ normal distribution โ†’ confidence intervals and significance testing +- **Measurement Uncertainty**: Every measurement has varianceโ€”systematic errors (timer overhead) and random errors (thermal noise) +- **Statistical Power**: How many runs needed for significance? Depends on effect size and varianceโ€”5% speedup requires more runs than 50% +- **Type I/II Errors**: False positive (claiming speedup when it's noise) vs false negative (missing real speedup due to insufficient samples) -**MLPerf**: Industry-standard benchmarking consortium -**SPEC**: Hardware benchmarking standards -**TensorFlow/PyTorch**: Continuous benchmarking in CI/CD +### Performance Characteristics -## What's Next? +- **Warmup Effects**: First run 20% slower than steady-stateโ€”cold cache, JIT compilation, memory allocation +- **System Noise Sources**: Thermal throttling (CPU frequency drops), OS interrupts (background tasks), memory pressure (GC pauses), network interference +- **Observer Effect**: Instrumentation changes behaviorโ€”profiling overhead 5%, cache effects from measurement code, branch prediction altered +- **Hardware Variability**: Optimization 3x faster on GPU but 1.1x on CPUโ€”memory bandwidth helps GPU, CPU cache doesn't fit data -In **Module 20: TinyMLPerf Competition** (Capstone), you'll apply everything: -- Use all Optimization Tier techniques -- Compete on a standardized benchmark -- Submit results to a leaderboard -- Demonstrate complete ML systems skills +## Ready to Build? -This is your capstoneโ€”show what you've learned! +You've reached the penultimate module of the Optimization Tier. This benchmarking framework validates all your previous work from Modules 14-18, transforming subjective claims ("feels faster") into objective data ("1.8x speedup, p < 0.01, 95% CI [1.6x, 2.0x]"). + +Your benchmarking infrastructure provides the measurement foundation for Module 20's competition workflow, where you'll use these tools to validate optimizations systematically. Fair measurement methodology ensures your innovation is recognizedโ€”not who got lucky with thermal throttling. + +Module 20 teaches how to use your benchmarking framework in a competition contextโ€”generating submissions, validating constraints, and packaging results. Your benchmarking framework measures cumulative impact with statistical rigor. This is how production ML teams validate optimizations before deploymentโ€”rigorous measurement prevents regressions and quantifies improvements. + +Statistical rigor isn't just academic formalityโ€”it's engineering discipline. When Meta claims 10% training speedup saves hundreds of thousands in compute costs, that claim requires measurements with confidence intervals and significance testing. Your framework implements this methodology from first principles. + +Choose your preferred way to engage with this module: + +````{grid} 1 2 3 3 + +```{grid-item-card} Launch Binder +:link: https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/19_benchmarking/benchmarking_dev.ipynb +:class-header: bg-light + +Run this module interactively in your browser. No installation required. +``` + +```{grid-item-card} Open in Colab +:link: https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/19_benchmarking/benchmarking_dev.ipynb +:class-header: bg-light + +Use Google Colab for GPU access and cloud compute power. +``` + +```{grid-item-card} View Source +:link: https://github.com/mlsysbook/TinyTorch/blob/main/modules/19_benchmarking/benchmarking_dev.py +:class-header: bg-light + +Browse the Python source code and understand the implementation. +``` + +```` + +```{admonition} Save Your Progress +:class: tip +Binder sessions are temporary. Download your completed notebook when done, or switch to local development for persistent work. +``` --- -**Ready to benchmark rigorously?** Open `modules/19_benchmarking/benchmarking_dev.py` and start implementing. + diff --git a/modules/20_capstone/ABOUT.md b/modules/20_capstone/ABOUT.md index a8a0d430..6d229ba8 100644 --- a/modules/20_capstone/ABOUT.md +++ b/modules/20_capstone/ABOUT.md @@ -1,279 +1,366 @@ --- -title: "MLPerfยฎ Edu Competition - Your Capstone Challenge" -description: "Apply all optimizations in a standardized MLPerf-inspired educational competition" -difficulty: 5 -time_estimate: "10-20 hours" -prerequisites: ["All modules 01-19"] +title: "Torch Olympics - ML Systems Competition" +description: "Learn competition workflow: use Benchmark harness to measure performance and generate standardized submissions" +difficulty: "โญโญโญโญ" +time_estimate: "5-8 hours" +prerequisites: ["Benchmarking (Module 19)", "Optimization techniques (Modules 14-18)"] next_steps: [] learning_objectives: - - "Apply all Optimization Tier techniques to a standardized benchmark" - - "Implement either Closed Division (optimize given model) or Open Division (innovate architecture)" - - "Generate validated submission with normalized metrics" - - "Demonstrate complete ML systems engineering skills" - - "Compete fairly across different hardware platforms" + - "Understand competition events: Know how different Olympic events (Latency Sprint, Memory Challenge, All-Around) have different constraints and optimization strategies" + - "Use Benchmark harness: Apply Module 19's Benchmark class to measure performance with statistical rigor (confidence intervals, multiple runs)" + - "Generate submissions: Create standardized submission formats following MLPerf-style industry standards" + - "Validate submissions: Check that submissions meet event constraints (accuracy thresholds, latency limits) and flag unrealistic improvements" + - "Workflow integration: Understand how benchmarking tools (Module 19) and optimization techniques (Modules 14-18) work together in competition context" --- -# 20. MLPerfยฎ Edu Competition +# 20. TinyTorch Olympics - Competition & Submission -**๐Ÿ† CAPSTONE** | Difficulty: โญโญโญโญโญ (5/5 - Ninja Level) | Time: 10-20 hours +**CAPSTONE PROJECT** | Difficulty: โญโญโญโญ (4/4) | Time: 5-8 hours ## Overview -Your capstone challenge: optimize a CIFAR-10 CNN using everything you've learned. Choose between Closed Division (optimize our CNN) or Open Division (design your own). Compete on a level playing field with normalized metrics that account for hardware differences. +The TinyTorch Olympics capstone teaches you how to participate in professional ML competitions. You've learned benchmarking methodology in Module 19โ€”now apply those tools in a competition workflow. This module focuses on understanding competition events, using the Benchmark harness to measure performance, generating standardized submissions, and validating results meet competition requirements. + +**What You Learn**: Competition workflow and submission packagingโ€”how to use benchmarking tools (Module 19) and optimization techniques (Modules 14-18) to create competition-ready submissions following industry standards (MLPerf-style). + +**The Focus**: Understanding how professional ML competitions workโ€”from measurement to submissionโ€”not building TinyGPT (that's Milestone 05). ## Learning Objectives -By completing this capstone, you will be able to: +By the end of this capstone, you will be able to: -1. **Apply all Optimization Tier techniques** (profiling, memoization, quantization, compression, acceleration, benchmarking) -2. **Implement either Closed Division** (optimize given CNN; pure optimization challenge) or **Open Division** (design novel architecture; innovation challenge) -3. **Generate validated submission** with standardized metrics, honor code attestation, and GitHub repo -4. **Demonstrate complete ML systems skills** from implementation through optimization to deployment -5. **Compete fairly** using normalized metrics (speedup, compression ratio) that work across hardware +- **Understand Competition Events**: Know how different Olympic events (Latency Sprint, Memory Challenge, All-Around) have different constraints and optimization strategies +- **Use Benchmark Harness**: Apply Module 19's Benchmark class to measure performance with statistical rigor (confidence intervals, multiple runs) +- **Generate Submissions**: Create standardized submission formats following MLPerf-style industry standards +- **Validate Submissions**: Check that submissions meet event constraints (accuracy thresholds, latency limits) and flag unrealistic improvements +- **Workflow Integration**: Understand how benchmarking tools (Module 19) and optimization techniques (Modules 14-18) work together in competition context -## Why This Matters +## The Five Olympic Events -### Production Context +Choose your competition event based on optimization goals: -This competition simulates real ML systems engineering: +### ๐Ÿƒ Event 1: Latency Sprint +**Objective**: Minimize inference latency +**Constraints**: Accuracy โ‰ฅ 85% +**Strategy Focus**: Operator fusion, quantization, efficient data flow +**Winner**: Fastest average inference time (with confidence intervals) -- **MLPerf** is the industry standard for ML benchmarking; this follows the same principles -- **Production optimization** requires choosing what to optimize and measuring impact -- **Hardware diversity** in production demands normalized comparison metrics -- **Documentation** of optimization choices matters for team collaboration +### ๐Ÿ‹๏ธ Event 2: Memory Challenge +**Objective**: Minimize model memory footprint +**Constraints**: Accuracy โ‰ฅ 85% +**Strategy Focus**: Quantization, pruning, weight sharing +**Winner**: Smallest model size maintaining accuracy -### Competition Philosophy +### ๐ŸŽฏ Event 3: Accuracy Contest +**Objective**: Maximize model accuracy +**Constraints**: Latency < 100ms, Memory < 10MB +**Strategy Focus**: Balanced optimization, selective precision +**Winner**: Highest accuracy within constraints -This capstone teaches: -- **Optimization discipline**: Profile first, optimize bottlenecks, measure impact -- **Trade-off analysis**: Speed vs accuracy vs memory - what matters for your use case? -- **Fair comparison**: Normalized metrics ensure your M1 MacBook competes fairly with AWS GPU -- **Real constraints**: Must maintain >70% accuracy; actual production requirement +### ๐Ÿ‹๏ธโ€โ™‚๏ธ Event 4: All-Around +**Objective**: Best balanced performance +**Scoring**: Composite score across latency, memory, accuracy +**Strategy Focus**: Multi-objective optimization, Pareto efficiency +**Winner**: Highest composite score -## Competition Structure +### ๐Ÿš€ Event 5: Extreme Push +**Objective**: Most aggressive optimization +**Constraints**: Accuracy โ‰ฅ 80% (lower threshold) +**Strategy Focus**: Maximum compression, aggressive quantization +**Winner**: Best compression-latency product -### Two Tracks +## Competition Workflow -**Closed Division - Optimization Challenge** -- **Task**: Optimize provided CNN architecture -- **Rules**: Cannot change model architecture, training, or dataset -- **Focus**: Pure systems optimization (caching, quantization, pruning, acceleration) -- **Goal**: Maximum speedup with minimal accuracy loss +This module teaches the workflow of professional ML competitions. You'll learn how to use benchmarking tools (Module 19) to measure performance and generate standardized submissions. -**Open Division - Innovation Challenge** -- **Task**: Design your own architecture -- **Rules**: Can change anything (architecture, training, data augmentation) -- **Focus**: Novel approaches, architectural innovations, creative solutions -- **Goal**: Best efficiency score balancing speed, size, and accuracy +### Stage 1: Understand Competition Events -### Metrics (Both Divisions) - -**Normalized for Fair Hardware Comparison:** -- **Speedup**: your_inference_time / baseline_inference_time (on YOUR hardware) -- **Compression Ratio**: baseline_params / your_params -- **Accuracy Delta**: your_accuracy - baseline_accuracy (must be โ‰ฅ -5%) -- **Efficiency Score**: (speedup ร— compression) / (1 + |accuracy_loss|) - -## Implementation Guide - -### Step 1: Validate Your Installation - -```bash -tito setup --validate -# Ensures all modules work before starting -``` - -### Step 2: Generate Baseline +Different Olympic events have different constraints and optimization strategies: ```python -from tinytorch.competition import generate_baseline +from tinytorch.competition import OlympicEvent -# This runs the unoptimized CNN and records your baseline -baseline = generate_baseline() -# Saves: baseline_submission.json with your hardware specs +# Event types +event = OlympicEvent.LATENCY_SPRINT # Minimize latency, accuracy โ‰ฅ 85% +event = OlympicEvent.MEMORY_CHALLENGE # Minimize memory, accuracy โ‰ฅ 85% +event = OlympicEvent.ALL_AROUND # Best balanced performance +event = OlympicEvent.EXTREME_PUSH # Most aggressive, accuracy โ‰ฅ 80% ``` -### Step 3: Choose Your Track +**Event Constraints:** +- **Latency Sprint**: Accuracy โ‰ฅ 85%, optimize for speed +- **Memory Challenge**: Accuracy โ‰ฅ 85%, optimize for size +- **All-Around**: Balanced optimization across metrics +- **Extreme Push**: Accuracy โ‰ฅ 80%, maximum optimization + +### Stage 2: Measure Baseline Performance + +Use Module 19's Benchmark harness to measure baseline: -**Option A: Closed Division (Recommended for first-time)** ```python -from tinytorch.competition import optimize_closed_division +from tinytorch.benchmarking import Benchmark -# Optimize the provided CNN -optimized_model = optimize_closed_division( - baseline_model, - techniques=['kvcaching', 'quantization', 'pruning'] +# Measure baseline performance +benchmark = Benchmark([baseline_model], [test_data], ["latency", "memory", "accuracy"]) +baseline_results = benchmark.run() + +# Results include statistical rigor (confidence intervals) +print(f"Baseline - Latency: {baseline_results['latency'].mean:.2f}ms") +print(f" 95% CI: [{baseline_results['latency'].ci_lower:.2f}, {baseline_results['latency'].ci_upper:.2f}]") +print(f"Baseline - Memory: {baseline_results['memory'].mean:.2f}MB") +print(f"Baseline - Accuracy: {baseline_results['accuracy'].mean:.2%}") +``` + +**Key Insight**: Module 19 provides statistical rigorโ€”multiple runs, confidence intervals, warmup periods. This ensures fair comparison. + +### Stage 3: Measure Optimized Performance + +Apply optimization techniques (from Modules 14-18), then measure: + +```python +# Apply optimizations (using techniques from Modules 14-18) +optimized_model = apply_optimizations(baseline_model) + +# Measure optimized performance with same Benchmark harness +optimized_results = benchmark.run() # Same benchmark, different model +``` + +**Fair Comparison**: Same Benchmark harness, same test data, same hardwareโ€”ensures apples-to-apples comparison. + +### Stage 4: Calculate Normalized Scores + +Compute hardware-independent metrics: + +```python +from tinytorch.competition import calculate_normalized_scores + +# Convert to normalized scores (hardware-independent) +scores = calculate_normalized_scores( + baseline_results={'latency': 100.0, 'memory': 12.0, 'accuracy': 0.85}, + optimized_results={'latency': 40.0, 'memory': 3.0, 'accuracy': 0.83} ) + +# Results: speedup=2.5ร—, compression_ratio=4.0ร—, accuracy_delta=-0.02 +print(f"Speedup: {scores['speedup']:.2f}ร—") +print(f"Compression: {scores['compression_ratio']:.2f}ร—") +print(f"Accuracy change: {scores['accuracy_delta']:+.2%}") ``` -**Option B: Open Division (For advanced students)** -```python -from tinytorch.competition import design_open_division +**Why Normalized**: Speedup ratios work on any hardware. "2.5ร— faster" is meaningful whether you have M1 Mac or Intel i9. -# Design your own architecture -my_model = MyCustomCNN(...) -# Train it -trained_model = train(my_model, train_loader) -``` +### Stage 5: Generate Submission -### Step 4: Generate Submission +Create standardized submission following MLPerf-style format: ```python -from tinytorch.competition import generate_submission +from tinytorch.competition import generate_submission, validate_submission +# Generate submission submission = generate_submission( - model=optimized_model, - division='closed', # or 'open' - github_repo='https://github.com/yourname/tinytorch-submission', - techniques_used=['INT8 quantization', '90% magnitude pruning', 'KV caching'], - athlete_name='Your Name' + baseline_results=baseline_results, + optimized_results=optimized_results, + event=OlympicEvent.LATENCY_SPRINT, + athlete_name="YourName", + github_repo="https://github.com/yourname/tinytorch", + techniques=["INT8 Quantization", "70% Pruning", "KV Cache"] ) -# This creates: submission.json with all required fields +# Validate submission meets requirements +validation = validate_submission(submission) +if validation['valid']: + print("โœ… Submission valid!") + print(f" Checks passed: {len([c for c in validation['checks'] if c['passed']])}") +else: + print("โŒ Submission invalid:") + for issue in validation['issues']: + print(f" - {issue}") + +# Save submission +import json +with open('submission.json', 'w') as f: + json.dump(submission, f, indent=2) ``` -### Step 5: Validate and Submit +**Submission Format**: Includes normalized scores, system info, event constraints, statistical confidenceโ€”everything needed for fair competition ranking. + +## Getting Started + +### Prerequisites + +This capstone requires understanding of benchmarking (Module 19) and optimization techniques (Modules 14-18): ```bash -# Local validation -tito submit --file submission.json --validate-only +# Activate TinyTorch environment +source bin/activate-tinytorch.sh -# Official submission (when ready) -tito submit --file submission.json +# Required: Benchmarking methodology (Module 19) +tito test --module benchmarking # Module 19: Statistical measurement, fair comparison + +# Helpful: Optimization techniques (Modules 14-18) +tito test --module profiling # Module 14: Find bottlenecks +tito test --module quantization # Module 15: Reduce precision +tito test --module compression # Module 16: Prune parameters +tito test --module memoization # Module 17: Cache computations +tito test --module acceleration # Module 18: Operator fusion ``` -## Submission Requirements +**Why You Need Module 19:** +- Module 19 teaches benchmarking methodology (statistical rigor, fair comparison) +- Module 20 teaches how to use Benchmark harness in competition workflow +- You use Benchmark class from Module 19 to measure performance -### Required Fields +**The Focus**: Understanding competition workflowโ€”how to use benchmarking tools to generate submissionsโ€”not building models from scratch (that's Milestones 05-06). -- **division**: 'closed' or 'open' -- **athlete_name**: Your name -- **github_repo**: Link to your code (public or private with access) -- **baseline_metrics**: From Step 2 -- **optimized_metrics**: From Step 4 -- **normalized_scores**: Speedup, compression, accuracy delta -- **techniques_used**: List of optimizations applied -- **honor_code**: "I certify that this submission follows the rules" -- **hardware**: CPU/GPU specs, RAM (for reference, not ranking) -- **tinytorch_version**: Automatically captured -- **timestamp**: Automatically captured +### Development Workflow -### Validation Checks +1. **Understand Competition Events** (`Stage 1`): + - Review OlympicEvent enum and event constraints + - Understand how different events require different strategies + - Learn event-specific accuracy thresholds -The submission system performs sanity checks: -- โœ… Speedup between 0.5ร— and 100ร— (realistic range) -- โœ… Compression between 1ร— and 100ร— (realistic range) -- โœ… Accuracy drop < 10% (must maintain reasonable performance) -- โœ… GitHub repo exists and contains code -- โœ… Techniques used are documented -- โœ… No training modifications in Closed Division +2. **Measure Baseline** (`Stage 2`): + - Use Benchmark harness from Module 19 to measure baseline performance + - Understand statistical rigor (confidence intervals, multiple runs) + - Learn fair comparison protocols -### Honor Code +3. **Measure Optimized** (`Stage 3`): + - Apply optimization techniques (from Modules 14-18) + - Use same Benchmark harness to measure optimized performance + - Ensure fair comparison (same data, hardware, methodology) -This is an honor-based system with light validation: -- We trust you followed the rules -- Automated checks catch accidental errors -- If something seems wrong, we may ask for clarification -- GitHub repo allows others to learn from your work +4. **Calculate Normalized Scores** (`Stage 4`): + - Compute hardware-independent metrics (speedup, compression ratio) + - Understand why normalized scores enable fair comparison + - Learn how to combine multiple metrics -## Example Optimizations (Closed Division) +5. **Generate Submission** (`Stage 5`): + - Create standardized submission format (MLPerf-style) + - Validate submission meets event constraints + - Understand submission structure and requirements -**Beginner**: -- Apply INT8 quantization: ~4ร— compression, ~2ร— speedup -- Result: Speedup=2ร—, Compression=4ร—, Efficiencyโ‰ˆ8 - -**Intermediate**: -- Quantization + 50% pruning: ~8ร— compression, ~3ร— speedup -- Result: Speedup=3ร—, Compression=8ร—, Efficiencyโ‰ˆ24 - -**Advanced**: -- Quantization + 90% pruning + operator fusion: ~40ร— compression, ~5ร— speedup -- Result: Speedup=5ร—, Compression=40ร—, Efficiencyโ‰ˆ200 +6. **Export and verify**: + ```bash + tito module complete 20 + tito test --module capstone + ``` ## Testing +### Comprehensive Test Suite + +Run the full test suite to verify your competition submission: + ```bash -# Run everything end-to-end -cd modules/20_competition -python competition_dev.py +# TinyTorch CLI (recommended) +tito test --module capstone -# Export and test -tito export 20_competition -tito test 20_competition +# Direct pytest execution +python -m pytest tests/ -k capstone -v -# Generate baseline -python -c "from tinytorch.competition import generate_baseline; generate_baseline()" - -# Validate submission -tito submit --file submission.json --validate-only +# Expected output: +# โœ… test_baseline_establishment - Verifies baseline measurement +# โœ… test_optimization_pipeline - Tests combined optimizations +# โœ… test_event_constraints - Validates constraint satisfaction +# โœ… test_statistical_significance - Ensures improvements are real +# โœ… test_submission_generation - Verifies report creation ``` -## Where This Code Lives +### Test Coverage Areas -``` -tinytorch/ -โ”œโ”€โ”€ competition/ -โ”‚ โ”œโ”€โ”€ baseline.py # Baseline model -โ”‚ โ”œโ”€โ”€ submission.py # Submission generation -โ”‚ โ””โ”€โ”€ validate.py # Validation logic -โ””โ”€โ”€ __init__.py - -Generated files: -- baseline_submission.json # Your baseline metrics -- submission.json # Your final submission -``` +- โœ… **OlympicEvent Enum**: Event types and constraints work correctly +- โœ… **Normalized Scoring**: Speedup and compression ratios calculated correctly +- โœ… **Submission Generation**: Creates valid MLPerf-style submissions +- โœ… **Submission Validation**: Checks event constraints and flags issues +- โœ… **Workflow Integration**: Complete workflow demonstration executes ## Systems Thinking Questions -1. **Optimization Priority**: You have limited time. Profile shows attention=40%, FFN=35%, embedding=15%, other=10%. Where do you start and why? +### Integration Complexity -2. **Accuracy Trade-off**: Closed Division allows up to 5% accuracy loss. How do you decide what's acceptable? What if you could get 10ร— speedup for 6% loss? +**Question 1: Optimization Interaction** +You apply INT8 quantization (4ร— memory reduction) followed by 75% pruning (4ร— parameter reduction). Should you expect 16ร— total memory reduction? -3. **Hardware Fairness**: Student A has M1 Max, Student B has i5 laptop. Normalized metrics show both achieved 3ร— speedup. Who optimized better? +**Answer Structure:** +- Quantization affects: _____ +- Pruning affects: _____ +- Combined effect: _____ +- Why not multiplicative: _____ -4. **Open Division Strategy**: You could design a tiny 100K-param model (fast but potentially less accurate) or optimize a 1M-param model. What's your strategy? +**Systems Insight**: Quantization reduces bits per parameter (4 bytes โ†’ 1 byte). Pruning reduces parameter count (but zero values still stored in dense format). Combined effect depends on sparse matrix representation. For true 16ร— reduction, need sparse storage format that doesn't store zeros. -5. **Verification Challenge**: How would you verify submissions without running everyone's code? What checks are sufficient? +### Measurement Validity -## Real-World Connections +**Question 2: Statistical Significance** +Your optimized model shows 5% latency improvement with p-value = 0.12. Competitor shows 8% improvement with p-value = 0.02. Who wins? -### MLPerf +**Systems Insight**: With p=0.12, your 5% could be noise (not statistically significant at ฮฑ=0.05). Competitor's 8% with p=0.02 is significant. Always report p-valuesโ€”bigger speedup doesn't mean better if not statistically valid! -This competition mirrors MLPerf principles: -- Closed Division = MLPerf Closed (fixed model/training) -- Open Division = MLPerf Open (anything goes) -- Normalized metrics for fair hardware comparison -- Honor-based with validation checks +### Event Strategy -### Industry Applications +**Question 3: All-Around Optimization** +For All-Around event, should you: (a) Optimize each metric separately, then combine? (b) Optimize all metrics simultaneously from start? -**Model Deployment Engineer** (your future job): -- Given: Slow model from research team -- Goal: Deploy at production scale -- Constraints: Latency SLA, accuracy requirements, hardware budget -- Skills: Profiling, optimization, trade-off analysis (this capstone!) +**Systems Insight**: Simultaneous optimization risks sub-optimal trade-offs. Better strategy: (1) Profile to find bottlenecks, (2) Apply technique targeting worst metric, (3) Re-measure all metrics, (4) Repeat. Iterative refinement with full measurement prevents over-optimization of one metric at expense of others. -**ML Competition Platforms**: Kaggle, DrivenData use similar structures -- Leaderboards drive innovation -- Standardized metrics ensure fairness -- Open sharing advances the field +### Production Relevance -## What's Next? +**Question 4: Real-World Connection** +How does Torch Olympics competition preparation translate to production ML systems work? -**You've completed TinyTorch!** You've built: -- **Foundation Tier**: All ML building blocks from scratch -- **Architecture Tier**: Vision and language systems -- **Optimization Tier**: Production optimization techniques -- **Capstone**: Real-world ML systems engineering +**Reflection**: Production deployment requires the exact skills you're practicing: profiling to find bottlenecks, applying targeted optimizations, validating improvements statistically, balancing trade-offs based on constraints (latency SLA, memory budget, accuracy requirements), and documenting decisions. The Olympic events mirror real scenarios: mobile deployment (Memory Challenge), real-time inference (Latency Sprint), high-accuracy requirements (Accuracy Contest). -**Where to go from here:** -- Deploy your optimized model to production -- Contribute to open-source ML frameworks -- Join ML systems research or engineering teams -- Build the next generation of ML infrastructure +## Ready for Competition? + +This capstone teaches you how professional ML competitions work. You've learned benchmarking methodology in Module 19โ€”now understand how to use those tools in a competition workflow. Module 20 focuses on: + +- **Competition Workflow**: How to participate in ML competitions (MLPerf-style) +- **Submission Packaging**: How to format results for fair comparison and validation +- **Event Understanding**: How different events require different optimization strategies +- **Workflow Integration**: How benchmarking tools (Module 19) + optimization techniques (Modules 14-18) work together + +**What's Next**: +- Build TinyGPT in Milestone 05 (historical achievement) +- Compete in Torch Olympics (Milestone 06) using this workflow +- Use `tito olympics submit` to generate your competition entry! + +This module teaches workflow and packagingโ€”you use existing tools, not rebuild them. The competition workflow demonstrates how professional ML competitions are structured and participated in. + +Choose your preferred way to engage with this capstone: + +````{grid} 1 2 3 3 + +```{grid-item-card} ๐Ÿš€ Launch Binder +:link: https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/20_capstone/capstone_dev.ipynb +:class-header: bg-light + +Run this capstone interactively in your browser. No installation required! +``` + +```{grid-item-card} โšก Open in Colab +:link: https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/20_capstone/capstone_dev.ipynb +:class-header: bg-light + +Use Google Colab for GPU access and cloud compute power. +``` + +```{grid-item-card} ๐Ÿ“– View Source +:link: https://github.com/mlsysbook/TinyTorch/blob/main/modules/20_capstone/capstone.py +:class-header: bg-light + +Browse the Python source code and understand the implementation. +``` + +```` + +```{admonition} ๐Ÿ’ก Competition Recommendation +:class: tip +**Local development recommended!** This capstone involves extended optimization experiments, profiling sessions, and benchmarking runs. Local setup provides better debugging, faster iteration, and persistent results. Cloud sessions may timeout during long benchmark runs. + +**Setup**: `git clone https://github.com/mlsysbook/TinyTorch.git && source bin/activate-tinytorch.sh && cd modules/20_capstone` +``` --- -**Ready for your capstone challenge?** Open `modules/20_competition/competition_dev.py` and start optimizing! - -**Compete. Optimize. Dominate.** ๐Ÿ† +