diff --git a/modules/05_autograd/autograd_dev.py b/modules/05_autograd/autograd_dev.py
index 51b72595..b2077007 100644
--- a/modules/05_autograd/autograd_dev.py
+++ b/modules/05_autograd/autograd_dev.py
@@ -10,14 +10,14 @@
 # %% [markdown]
 """
-# Autograd - Automatic Differentiation and Computational Graph Engine
+# Autograd - Automatic Differentiation Engine
 
-Welcome to Autograd! You'll implement the automatic differentiation engine that makes neural network training possible by automatically computing gradients through complex computational graphs.
+Welcome to Autograd! You'll implement the automatic differentiation engine that makes neural network training possible by automatically computing gradients through computational graphs.
 
-## LINK Building on Previous Learning
+## 🔗 Building on Previous Learning
 
 **What You Built Before**:
 - Module 02 (Tensor): Data structures that hold neural network parameters
-- Module 05 (Losses): Functions that measure prediction accuracy
+- Module 04 (Losses): Functions that measure prediction accuracy
 
 **What's Working**: You can compute loss values for any prediction!
 
@@ -27,47 +27,46 @@ Welcome to Autograd! You'll implement the automatic differentiation engine that
 **Connection Map**:
 ```
-Tensors -> Loss Functions -> Autograd -> Optimizers
-(data) (error measure) (gradL/gradθ) (parameter updates)
+Tensors → Losses → Autograd → Optimizers
+(data) (error) (∇L/∇θ) (updates)
 ```
 
 ## Learning Objectives
+1. **Core Implementation**: Variable class with gradient tracking
+2. **Mathematical Foundation**: Chain rule application in computational graphs
+3. **Testing Skills**: Gradient computation validation
+4. **Integration Knowledge**: How autograd enables neural network training
-By completing this module, you will:
+## Build → Test → Use
+1. **Build**: Variable class with backward propagation
+2. **Test**: Verify gradients are computed correctly
+3. **Use**: Apply to mathematical expressions and see automatic differentiation
-1. 
**Implement automatic differentiation** - Build the system that computes gradients automatically -2. **Create computational graphs** - Track operations to enable backward propagation -3. **Apply the chain rule** - Understand how gradients flow through complex operations -4. **Build Variable class** - Extend tensors with gradient tracking capabilities -5. **Enable training** - Provide the automatic gradient computation that makes learning possible +## 📦 Where This Code Lives in the Final Package -## Build -> Use -> Reflect -1. **Build**: Variable class with gradient tracking and backward propagation through operations -2. **Use**: Apply autograd to mathematical expressions and see gradients computed automatically -3. **Reflect**: Understand how automatic differentiation enables efficient neural network training +**Learning Side:** You work in modules/05_autograd/autograd_dev.py +**Building Side:** Code exports to tinytorch.core.autograd -## What You'll Achieve -- **Gradient computation**: Automatically compute derivatives for any mathematical expression -- **Chain rule implementation**: Apply calculus systematically through complex operations -- **Memory management**: Handle gradient accumulation and computational graph lifecycle -- **Training enablement**: Provide the gradient information needed for parameter optimization -- **Framework understanding**: See how PyTorch and TensorFlow implement automatic differentiation -- Performance consideration of how computational graph size and memory management affect training efficiency -- Connection to production ML systems and how frameworks optimize gradient computation and memory usage +```python +# Final package structure: +from tinytorch.core.autograd import Variable # This module +from tinytorch.core.tensor import Tensor # Foundation (always needed) +``` -## Systems Reality Check -TIP **Production Context**: PyTorch's autograd can handle graphs with millions of nodes and uses sophisticated memory optimization like 
gradient checkpointing to train models larger than GPU memory -SPEED **Performance Note**: Gradient computation often requires storing forward activations, leading to memory usage that scales with network depth - this drives innovations like gradient checkpointing +**Why this matters:** +- **Learning:** Complete automatic differentiation system for deep understanding +- **Production:** Proper organization like PyTorch's torch.autograd +- **Consistency:** All gradient operations in core.autograd +- **Integration:** Works seamlessly with tensors for complete training systems """ -# %% nbgrader={"grade": false, "grade_id": "autograd-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} +# %% #| default_exp core.autograd #| export import numpy as np import sys -from typing import Union, List, Tuple, Optional, Any, Callable -from collections import defaultdict +from typing import Union, List, Optional, Callable # Import our existing components try: @@ -78,2389 +77,1072 @@ except ImportError: sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor')) from tensor_dev import Tensor -# %% nbgrader={"grade": false, "grade_id": "autograd-setup", "locked": false, "schema_version": 3, "solution": false, "task": false} -print("FIRE TinyTorch Autograd Module") +# %% +print("🔥 TinyTorch Autograd Module") print(f"NumPy version: {np.__version__}") print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}") print("Ready to build automatic differentiation!") -# %% [markdown] -""" -## PACKAGE Where This Code Lives in the Final Package - -**Learning Side:** You work in `modules/06_autograd/autograd_dev.py` -**Building Side:** Code exports to `tinytorch.core.autograd` - -```python -# Final package structure: -from tinytorch.core.autograd import Variable, backward # The gradient engine! 
-from tinytorch.core.tensor import Tensor -from tinytorch.core.activations import ReLU, Sigmoid, Tanh -``` - -**Why this matters:** -- **Learning:** Focused module for understanding gradients -- **Production:** Proper organization like PyTorch's `torch.autograd` -- **Consistency:** All gradient operations live together in `core.autograd` -- **Foundation:** Enables training for all neural networks -""" - # %% [markdown] """ ## What is Automatic Differentiation? ### The Problem: Computing Gradients at Scale -Neural networks have millions of parameters. To train them, we need gradients of the loss function with respect to every parameter: + +In neural networks, we need to compute gradients of complex functions with millions of parameters: ``` -gradθ L = [dL/dw₁, dL/dw₂, ..., dL/dwₙ, dL/db₁, dL/db₂, ..., dL/dbₘ] +Loss = f(W₁, W₂, ..., Wₙ, data) +∇Loss = [∂Loss/∂W₁, ∂Loss/∂W₂, ..., ∂Loss/∂Wₙ] ``` -**Manual differentiation fails** because: -- Networks have thousands of composed functions -- Manual computation is extremely error-prone -- Every architecture change requires re-deriving all gradients +Manual differentiation is impossible. Numerical differentiation is too slow. ### The Solution: Automatic Differentiation -**Autograd** automatically computes derivatives of functions represented as computational graphs: -```python -# Instead of manually computing: d(x² + 2xy + y²)/dx = 2x + 2y -# Autograd does it automatically: -x = Variable(3.0, requires_grad=True) -y = Variable(4.0, requires_grad=True) -z = x**2 + 2*x*y + y**2 -z.backward() -print(x.grad) # 2*3 + 2*4 = 14 (computed automatically!) 
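# Illustrative aside (not part of the Variable API): the same derivative can be
# cross-checked with a central finite difference, using only plain Python.
# Autograd returns the exact value; the finite difference only approximates it.
f = lambda x, y: x**2 + 2*x*y + y**2
eps = 1e-6
dfdx = (f(3.0 + eps, 4.0) - f(3.0 - eps, 4.0)) / (2 * eps)
print(round(dfdx, 6))  # 14.0, matching x.grad above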
-``` +🧠 **Core Concept**: Track operations as we compute forward pass, then apply chain rule backwards +⚡ **Performance**: Same speed as forward pass, exact gradients (not approximations) +📦 **Framework Compatibility**: This is how PyTorch and TensorFlow work internally ### Visual Representation: Computational Graph ``` -Mathematical Expression: z = x² + 2xy + y² +Forward Pass: +x ──┐ + ├──[×]──> z = x * y +y ──┘ -Computational Graph: - x --+--> [*] ---> x² --+--> [+] ---> [+] ---> z - ^ | | ^ ^ - | +--> [*] ---> 2x -+ | | - | ^ | | - | 2 | | - | | | - x --+--> [*] ---> xy --> [*] ---> 2xy | - ^ | ^ ^ | - | | | 2 | - | | y | - | | | - y --+--> [*] ---> y² --------------------+ - -Forward Pass: Compute values x² = 9, 2xy = 24, y² = 16, z = 49 -Backward Pass: Compute gradients dz/dx = 14, dz/dy = 20 +Backward Pass: +∂L/∂z ──┬──> ∂L/∂x = ∂L/∂z * y + │ + └──> ∂L/∂y = ∂L/∂z * x ``` -### Why This is Revolutionary -- **Efficiency**: O(1) overhead per operation -- **Flexibility**: Works with any differentiable function -- **Correctness**: Implements chain rule precisely -- **Scale**: Handles millions of parameters automatically - -### Real-World Impact -- **PyTorch**: `torch.autograd` enables all neural network training -- **TensorFlow**: `tf.GradientTape` provides similar functionality -- **JAX**: `jax.grad` for high-performance computing -- **Deep Learning**: Made training complex models practical - -Let us build the engine that powers modern AI! +**Key Insight**: Each operation stores how to compute gradients with respect to its inputs. """ # %% [markdown] """ -## 🔧 DEVELOPMENT -""" +## Implementation: Variable Class - Gradient Tracking -# %% [markdown] -""" -## Step 1: The Variable Class - Gradient Tracking - -### What is a Variable? 
-A **Variable** wraps a Tensor and tracks: -- **Data**: The actual values (forward pass) -- **Gradient**: The computed gradients (backward pass) -- **Computation history**: How this Variable was created -- **Backward function**: How to compute gradients - -### Visual: The Computational Graph Structure -``` -Variable Structure: -+---------------------------------+ -| Variable Object | -+---------------------------------┤ -| data: Tensor([1.5, 2.3, ...]) | <- Forward pass values -| grad: None -> Tensor([...]) | <- Backward pass gradients -| requires_grad: True/False | <- Should compute gradients? -| grad_fn: | <- How to compute gradients -| is_leaf: True/False | <- Original parameter? -+---------------------------------+ - -Computational Graph Example: - x (leaf) --+ - +--[ADD]---> z (intermediate) - y (leaf) --+ - - Forward: x.data + y.data = z.data - Backward: z.grad -> x.grad, y.grad (via chain rule) -``` - -### Memory Layout: Variables vs Tensors -``` -Memory Comparison: - Tensor Only Variable with Autograd - +-------------+ +-------------+ - | Data | | Data | <- Same data storage - | 4 bytes | | 4 bytes | - +-------------+ +-------------┤ - | Gradient | <- Additional gradient storage - | 4 bytes | - +-------------┤ - | grad_fn | <- Function pointer - | 8 bytes | - +-------------+ - Total: ~2x memory overhead -``` +🏗️ **Organization**: Variables wrap tensors and track gradients +🎯 **Clean API**: Seamless integration with existing tensor operations +📐 **Mathematical Foundation**: Computational graph representation of functions ### Design Principles -- **Transparency**: Works seamlessly with existing operations -- **Efficiency**: Minimal overhead for forward pass -- **Flexibility**: Supports any differentiable operation -- **Correctness**: Implements chain rule precisely -### Real-World Context -This is like: -- **PyTorch**: `torch.autograd.Variable` (now integrated into tensors) -- **TensorFlow**: `tf.Variable` with gradient tracking -- **JAX**: Variables with 
`jax.grad` transformation +A Variable tracks: +- **data**: The actual values (using our Tensor) +- **grad**: Accumulated gradients (starts as None) +- **grad_fn**: Function to compute gradients during backward pass +- **requires_grad**: Whether to track gradients for this variable """ -# %% nbgrader={"grade": false, "grade_id": "variable-class", "locked": false, "schema_version": 3, "solution": true, "task": false} +# %% nbgrader={"grade": false, "grade_id": "variable-class", "solution": true} #| export class Variable: """ - Variable: Tensor wrapper with automatic differentiation capabilities. - - The fundamental class for gradient computation in TinyTorch. - Wraps Tensor objects and tracks computational history for backpropagation. + Variable with automatic differentiation support. + + A Variable wraps a Tensor and tracks operations for gradient computation. + + TODO: Implement Variable class with gradient tracking capabilities + + APPROACH: + 1. Initialize with data, optional gradient requirement + 2. Store grad_fn for backward pass computation + 3. Implement backward() method to compute gradients + + EXAMPLE: + >>> x = Variable([2.0], requires_grad=True) + >>> y = Variable([3.0], requires_grad=True) + >>> z = x * y + >>> z.backward() + >>> print(x.grad) # Should be [3.0] + >>> print(y.grad) # Should be [2.0] + + HINTS: + - Store data as Tensor for consistency + - grad starts as None, gets created during backward + - grad_fn is a callable that propagates gradients """ - - def __init__(self, data: Union[Tensor, np.ndarray, list, float, int], - requires_grad: bool = True, grad_fn: Optional[Callable] = None): - """ - Create a Variable with gradient tracking. - - Simple, clear conversion focused on core autograd concepts. 
- - Args: - data: The data (will be converted to Tensor if needed) - requires_grad: Whether to track gradients for this Variable - grad_fn: Function for computing gradients (None for leaf nodes) - """ - ### BEGIN SOLUTION - # Simple, clear conversion - if isinstance(data, Tensor): + ### BEGIN SOLUTION + def __init__(self, data, requires_grad=False, grad_fn=None): + """Initialize Variable with data and gradient tracking.""" + # Convert to Tensor if needed + if isinstance(data, (list, tuple, int, float)): + self.data = Tensor(data) + elif isinstance(data, np.ndarray): + self.data = Tensor(data) + elif isinstance(data, Tensor): self.data = data else: - self.data = Tensor(data) - - self.requires_grad = requires_grad + raise TypeError(f"Unsupported data type: {type(data)}") + self.grad = None + self.requires_grad = requires_grad self.grad_fn = grad_fn - self.is_leaf = grad_fn is None - ### END SOLUTION - - @property - def shape(self) -> Tuple[int, ...]: - """Get the shape of the underlying tensor.""" - return self.data.shape - - @property - def size(self) -> int: - """Get the total number of elements.""" - return self.data.size - - def __repr__(self) -> str: - """String representation of the Variable.""" - grad_str = f", grad_fn=<{self.grad_fn.__name__}>" if self.grad_fn else "" - return f"Variable(shape={self.shape}, requires_grad={self.requires_grad}{grad_str})" - - def backward(self, gradient: Optional['Variable'] = None) -> None: + + def __repr__(self): + """String representation of Variable.""" + grad_info = f", grad_fn={self.grad_fn.__name__}" if self.grad_fn else "" + requires_grad_info = f", requires_grad={self.requires_grad}" if self.requires_grad else "" + return f"Variable({self.data.data}{grad_info}{requires_grad_info})" + + def backward(self, gradient=None): """ - Compute gradients using backpropagation. - - Simple gradient accumulation focused on learning the core concepts. - + Compute gradients via backpropagation. 
+ Args: - gradient: Incoming gradient (defaults to ones for scalar outputs) + gradient: Gradient flowing backwards (defaults to ones) """ - ### BEGIN SOLUTION + # Default gradient for scalar outputs if gradient is None: - gradient = Variable(np.ones_like(self.numpy())) - + if self.data.data.size == 1: + gradient = np.ones_like(self.data.data) + else: + raise RuntimeError("gradient must be specified for non-scalar variables") + + # Accumulate gradients if self.requires_grad: if self.grad is None: self.grad = gradient else: - # Accumulate gradients - self.grad = Variable(self.grad.numpy() + gradient.numpy()) - + self.grad = self.grad + gradient + + # Propagate gradients backwards through computation graph if self.grad_fn is not None: self.grad_fn(gradient) - ### END SOLUTION - - def zero_grad(self) -> None: - """Reset gradients to zero.""" - self.grad = None - - def numpy(self) -> np.ndarray: - """ - Convert Variable to NumPy array - Universal data extraction interface. - - This is the PyTorch-inspired solution to inconsistent data access. - ALWAYS returns np.ndarray, regardless of internal structure. - - Returns: - NumPy array containing the variable's data - - Usage: - var = Variable([1, 2, 3]) - array = var.numpy() # Always np.ndarray, no conditional logic needed - """ - return self.data.data - - @property - def array(self) -> np.ndarray: - """ - Clean property access to underlying numpy array. - - Use this instead of .data.data for cleaner, more readable code. 
- - Example: - x = Variable([1, 2, 3]) - arr = x.array # Clean access instead of x.data.data - """ - return self.data.data - - def __add__(self, other: Union['Variable', float, int]) -> 'Variable': - """Addition operator: self + other""" + + # Arithmetic operations with gradient tracking + def __add__(self, other): + """Addition with gradient tracking.""" return add(self, other) - - def __mul__(self, other: Union['Variable', float, int]) -> 'Variable': - """Multiplication operator: self * other""" + + def __radd__(self, other): + """Reverse addition.""" + return add(other, self) + + def __mul__(self, other): + """Multiplication with gradient tracking.""" return multiply(self, other) - - def __sub__(self, other: Union['Variable', float, int]) -> 'Variable': - """Subtraction operator: self - other""" + + def __rmul__(self, other): + """Reverse multiplication.""" + return multiply(other, self) + + def __sub__(self, other): + """Subtraction with gradient tracking.""" return subtract(self, other) - - def __truediv__(self, other: Union['Variable', float, int]) -> 'Variable': - """Division operator: self / other""" - return divide(self, other) - - def __matmul__(self, other: 'Variable') -> 'Variable': - """Matrix multiplication operator: self @ other""" - return matmul(self, other) + + def __rsub__(self, other): + """Reverse subtraction.""" + return subtract(other, self) + + def __matmul__(self, other): + """Matrix multiplication with gradient tracking.""" + return matmul(self, other) + ### END SOLUTION # %% [markdown] """ -### TEST Unit Test: Variable Class - -This test validates Variable initialization, ensuring gradient tracking capabilities work correctly. 
+### 🧪 Unit Test: Variable Class +This test validates Variable creation and basic gradient setup """ -# %% nbgrader={"grade": true, "grade_id": "test-variable-class", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false} +# %% def test_unit_variable_class(): - """Test Variable class implementation""" + """Test Variable class implementation with gradient tracking.""" print("🔬 Unit Test: Variable Class...") - - # Test Variable creation - x = Variable(5.0, requires_grad=True) - assert x.requires_grad == True, "Variable should require gradients" - assert x.is_leaf == True, "Variable should be a leaf node" - assert x.grad is None, "Gradient should be None initially" - - # Test data access - assert x.numpy().item() == 5.0, "Data should be accessible" - assert x.shape == (), "Scalar should have empty shape" - assert x.size == 1, "Scalar should have size 1" - - # Test with list input - y = Variable([1, 2, 3], requires_grad=True) - assert y.shape == (3,), "List should create 1D tensor" - assert y.size == 3, "Size should be 3" - - # Test with requires_grad=False - z = Variable(10.0, requires_grad=False) - assert z.requires_grad == False, "Should not require gradients" - - # Test zero_grad - x.grad = Variable(1.0) - x.zero_grad() - assert x.grad is None, "zero_grad should reset gradient to None" - - print("PASS Variable class tests passed!") - print(f"PASS Variable creation and initialization working") - print(f"PASS Data access and properties working") - print(f"PASS Gradient management working") -# Test will run in main block + # Test basic creation + x = Variable([2.0, 3.0], requires_grad=True) + assert isinstance(x.data, Tensor), "Variable should wrap Tensor" + assert x.requires_grad == True, "Should track gradients when requested" + assert x.grad is None, "Gradient should start as None" + + # Test creation without gradients + y = Variable([1.0, 2.0], requires_grad=False) + assert y.requires_grad == False, "Should not track gradients when 
not requested" + + # Test different data types + z = Variable(np.array([4.0]), requires_grad=True) + assert isinstance(z.data, Tensor), "Should convert numpy arrays to Tensors" + + print("✅ Variable class works correctly!") + +test_unit_variable_class() # %% [markdown] """ -## THINK Computational Assessment: Variable Understanding +## Implementation: Addition Operation with Chain Rule -Test your understanding of computational graphs and Variable design. -""" - -# %% nbgrader={"grade": true, "grade_id": "question-variable-design", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} -""" -### Assessment Question: Variable Memory and Design - -Consider this Variable usage pattern: -```python -x = Variable(np.random.randn(1000, 1000), requires_grad=True) -y = x * 2 + 1 -z = y @ y.T -loss = z.sum() -loss.backward() -``` - -**Question**: How much memory does this computational graph consume compared to just storing the final result? Calculate the memory overhead and explain why Variables need to store intermediate values. - -**Calculation Space:** -- Forward pass memory: _____ MB -- Gradient storage memory: _____ MB -- Total overhead factor: _____x - -**Conceptual Analysis:** -TODO: Explain why automatic differentiation requires storing intermediate values and how this affects memory scaling in deep networks. - -**Design Justification:** -TODO: Justify why the Variable design separates data, gradients, and computation history into different attributes. -""" - -### BEGIN SOLUTION -# Student response area - this will be manually graded -# Expected analysis should cover: -# 1. Memory calculation: 1000x1000 float32 = 4MB per intermediate result -# 2. Total memory: x(4MB) + y(4MB) + z(4MB) + gradients(~12MB) = ~24MB vs 4MB final result -# 3. Conceptual understanding: Need intermediate values for chain rule -# 4. 
Design rationale: Separation enables flexible gradient computation -### END SOLUTION - -# %% [markdown] -""" -## Step 2: Basic Operations with Gradients - -### The Chain Rule in Action -Every operation must implement: -1. **Forward pass**: Compute the result -2. **Backward pass**: Compute gradients for inputs - -### Visual: Chain Rule Through Addition -``` -Forward Pass: z = x + y - x: 3.0 --+ - +--[+]---> z: 5.0 - y: 2.0 --+ - -Backward Pass: dz/dx = 1, dz/dy = 1 - dL/dz: 1.0 --+---> dL/dx: 1.0 (dz/dx = 1) - | - +---> dL/dy: 1.0 (dz/dy = 1) - -Chain Rule: dL/dx = dL/dz · dz/dx = 1.0 · 1 = 1.0 -``` +🧠 **Core Concepts**: Addition requires applying chain rule to both operands +⚡ **Performance**: Gradient computation is O(1) relative to forward pass +📦 **Framework Compatibility**: Matches PyTorch's autograd behavior ### Mathematical Foundation -The chain rule states: -``` -df/dx = df/dz · dz/dx -``` -For complex expressions like f(g(h(x))): -``` -df/dx = df/dg · dg/dh · dh/dx -``` +For z = x + y: +- ∂z/∂x = 1 (derivative of x + y with respect to x) +- ∂z/∂y = 1 (derivative of x + y with respect to y) -### Implementation Pattern -Each operation returns a new Variable with: -- **Forward result**: Computed value -- **Backward function**: Gradient computation +Chain rule: ∂L/∂x = ∂L/∂z × ∂z/∂x = ∂L/∂z × 1 = ∂L/∂z """ -# %% [markdown] -""" -## Helper Functions for Binary Operations +# %% nbgrader={"grade": false, "grade_id": "add-operation", "solution": true} +def _ensure_variable(x): + """Convert input to Variable if needed.""" + if isinstance(x, Variable): + return x + else: + return Variable(x, requires_grad=False) -These helper functions reduce code repetition and make operations more consistent. 
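The pattern behind such helpers can be sketched standalone with plain Python floats and closures. This is a minimal illustration of the idea, not TinyTorch code: `make_binary_op` is a hypothetical name, and real operations return Variables rather than `(value, backward)` pairs.

```python
# A factory that builds a differentiable binary op from one forward function
# and two gradient functions (hypothetical helper, floats only, for illustration).
def make_binary_op(forward, grad_a, grad_b):
    def op(a, b):
        out = forward(a, b)
        # Backward closure: applies the chain rule to the upstream gradient.
        def backward(upstream):
            return grad_a(upstream, a, b), grad_b(upstream, a, b)
        return out, backward
    return op

# d(a+b)/da = 1 and d(a+b)/db = 1: addition passes gradients through unchanged.
add_op = make_binary_op(lambda a, b: a + b,
                        lambda g, a, b: g,
                        lambda g, a, b: g)

# d(a*b)/da = b and d(a*b)/db = a: the product rule.
mul_op = make_binary_op(lambda a, b: a * b,
                        lambda g, a, b: g * b,
                        lambda g, a, b: g * a)

z, z_backward = mul_op(2.0, 3.0)
dz_da, dz_db = z_backward(1.0)
print(z, dz_da, dz_db)  # 6.0 3.0 2.0
```

The backward closure plays the same role as `grad_fn` on a Variable: each operation captures its inputs and knows how to turn an upstream gradient into gradients for those inputs.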
-""" - -#| export -def _ensure_variables(a, b): - """Convert inputs to Variables if they are scalars.""" - if isinstance(a, (int, float)): - a = Variable(a, requires_grad=False) - if isinstance(b, (int, float)): - b = Variable(b, requires_grad=False) - return a, b - -def _create_binary_operation(forward_fn, grad_fn_a, grad_fn_b): - """ - Helper to create binary operations with consistent structure. - - Args: - forward_fn: Function to compute forward pass - grad_fn_a: Function to compute gradient for first argument - grad_fn_b: Function to compute gradient for second argument - - Returns: - Binary operation function - """ - def operation(a, b): - # Convert inputs - a, b = _ensure_variables(a, b) - - # Forward pass - result_data = forward_fn(a.data, b.data) - - # Backward function - def grad_fn(grad_output): - if a.requires_grad: - grad_a = grad_fn_a(grad_output, a, b) - a.backward(grad_a) - if b.requires_grad: - grad_b = grad_fn_b(grad_output, a, b) - b.backward(grad_b) - - requires_grad = a.requires_grad or b.requires_grad - return Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn) - - return operation - -# %% nbgrader={"grade": false, "grade_id": "add-operation", "locked": false, "schema_version": 3, "solution": true, "task": false} #| export def add(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable: """ - Addition operation with gradient tracking: a + b - - TODO: Implement addition with automatic differentiation. - - STEP-BY-STEP IMPLEMENTATION: - 1. Convert inputs to Variables if they are scalars - 2. Compute forward pass: result = a.data + b.data - 3. Create gradient function that implements: d(a+b)/da = 1, d(a+b)/db = 1 - 4. 
Return new Variable with result and gradient function - - MATHEMATICAL FOUNDATION: - - Forward: z = x + y - - Backward: dz/dx = 1, dz/dy = 1 - - Chain rule: dL/dx = dL/dz · dz/dx = dL/dz · 1 = dL/dz - - EXAMPLE USAGE: - ```python - x = Variable(2.0, requires_grad=True) - y = Variable(3.0, requires_grad=True) - z = add(x, y) # z = 5.0 - z.backward() - print(x.grad) # 1.0 (dz/dx = 1) - print(y.grad) # 1.0 (dz/dy = 1) - ``` - - IMPLEMENTATION HINTS: - - Convert scalars: if isinstance(a, (int, float)): a = Variable(a, requires_grad=False) - - Forward pass: result_data = a.data + b.data - - Backward function: def grad_fn(grad_output): if a.requires_grad: a.backward(grad_output) - - Return: Variable(result_data, grad_fn=grad_fn) - - Only propagate gradients to Variables that require them - - LEARNING CONNECTIONS: - - This is like torch.add() with autograd - - Addition distributes gradients equally to both inputs - - Forms the basis for bias addition in neural networks - - Chain rule propagates gradients through the graph + Add two variables with gradient tracking. + + TODO: Implement addition that properly tracks gradients + + APPROACH: + 1. Convert inputs to Variables if needed + 2. Compute forward pass (a.data + b.data) + 3. Create grad_fn that propagates gradients to both inputs + 4. 
Return new Variable with result and grad_fn + + EXAMPLE: + >>> x = Variable([2.0], requires_grad=True) + >>> y = Variable([3.0], requires_grad=True) + >>> z = add(x, y) + >>> z.backward() + >>> print(x.grad) # [1.0] - derivative of z w.r.t x + >>> print(y.grad) # [1.0] - derivative of z w.r.t y + + HINTS: + - Use chain rule: ∂L/∂x = ∂L/∂z × ∂z/∂x = ∂L/∂z × 1 + - Both operands get same gradient (derivative of sum is 1) + - Only propagate to variables that require gradients """ ### BEGIN SOLUTION - # Convert scalars to Variables - if isinstance(a, (int, float)): - a = Variable(a, requires_grad=False) - if isinstance(b, (int, float)): - b = Variable(b, requires_grad=False) - - # Forward pass - result_data = a.data + b.data - - # Backward function - def grad_fn(grad_output): - # Addition distributes gradients equally, but must handle broadcasting - if a.requires_grad: - # Clean gradient data access - grad_data = grad_output.data - - # Check if we need to sum over broadcasted dimensions - a_shape = a.data.shape - if grad_data.shape != a_shape: - # Sum over the broadcasted dimensions - # For bias: (batch_size, features) -> (features,) - if len(grad_data.shape) == 2 and len(a_shape) == 1: - grad_for_a = Variable(Tensor(np.sum(grad_data, axis=0))) - else: - # Handle other broadcasting cases - grad_for_a = grad_output - else: - grad_for_a = grad_output - - a.backward(grad_for_a) - - if b.requires_grad: - # Clean gradient data access - grad_data = grad_output.data - - # Check if we need to sum over broadcasted dimensions - b_shape = b.data.shape - if grad_data.shape != b_shape: - # Sum over the broadcasted dimensions - # For bias: (batch_size, features) -> (features,) - if len(grad_data.shape) == 2 and len(b_shape) == 1: - grad_for_b = Variable(Tensor(np.sum(grad_data, axis=0))) - else: - # Handle other broadcasting cases - grad_for_b = grad_output - else: - grad_for_b = grad_output - - b.backward(grad_for_b) - - # Return new Variable with gradient function + # Ensure both 
inputs are Variables + a = _ensure_variable(a) + b = _ensure_variable(b) + + # Forward pass computation + result_data = Tensor(a.data.data + b.data.data) + + # Determine if result requires gradients requires_grad = a.requires_grad or b.requires_grad - return Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn) + + # Define backward function for gradient propagation + def grad_fn(gradient): + """Propagate gradients to both operands.""" + # Addition: ∂(a+b)/∂a = 1, ∂(a+b)/∂b = 1 + if a.requires_grad: + a.backward(gradient) + if b.requires_grad: + b.backward(gradient) + + # Create result variable with gradient function + result = Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn if requires_grad else None) + return result ### END SOLUTION # %% [markdown] """ -### TEST Unit Test: Addition Operation - -This test validates addition operation, ensuring gradients flow correctly through addition. +### 🧪 Unit Test: Addition Operation +This test validates addition with proper gradient computation """ -# %% nbgrader={"grade": true, "grade_id": "test-add-operation", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false} +# %% def test_unit_add_operation(): - """Test addition operation with gradients""" + """Test addition with gradient tracking.""" print("🔬 Unit Test: Addition Operation...") - + # Test basic addition - x = Variable(2.0, requires_grad=True) - y = Variable(3.0, requires_grad=True) + x = Variable([2.0], requires_grad=True) + y = Variable([3.0], requires_grad=True) z = add(x, y) - - assert z.numpy().item() == 5.0, "Addition result should be 5.0" - assert z.requires_grad == True, "Result should require gradients" - assert z.is_leaf == False, "Result should not be a leaf node" - + + # Verify forward pass + assert np.allclose(z.data.data, [5.0]), f"Expected [5.0], got {z.data.data}" + # Test backward pass z.backward() - - assert x.grad is not None, "x should have gradient" - assert y.grad is not None, "y 
should have gradient" - assert x.grad.numpy().item() == 1.0, "dz/dx should be 1.0" - assert y.grad.numpy().item() == 1.0, "dz/dy should be 1.0" - - # Test with scalar - a = Variable(5.0, requires_grad=True) - b = add(a, 3.0) # Add scalar - - assert b.numpy().item() == 8.0, "Addition with scalar should work" - + assert np.allclose(x.grad, [1.0]), f"Expected x.grad=[1.0], got {x.grad}" + assert np.allclose(y.grad, [1.0]), f"Expected y.grad=[1.0], got {y.grad}" + + # Test with constants + a = Variable([1.0], requires_grad=True) + b = add(a, 5.0) # Adding constant b.backward() - assert a.grad.numpy().item() == 1.0, "Gradient through scalar addition should be 1.0" - - print("PASS Addition operation tests passed!") - print(f"PASS Forward pass computing correct results") - print(f"PASS Backward pass computing correct gradients") - print(f"PASS Scalar addition working correctly") + assert np.allclose(a.grad, [1.0]), "Gradient should flow through constant addition" -# Test will run in main block + print("✅ Addition operation works correctly!") -# PASS IMPLEMENTATION CHECKPOINT: Addition operation complete - -# THINK PREDICTION: How does the chain rule apply when operations are chained together? 
-# Your answer: _______ - -# MAGNIFY SYSTEMS INSIGHT #1: Gradient Flow Analysis -def analyze_gradient_flow(): - """Analyze how gradients flow through computational graphs.""" - try: - print("MAGNIFY GRADIENT FLOW ANALYSIS") - print("=" * 35) - - # Create simple computational graph - x = Variable(2.0, requires_grad=True) - y = Variable(3.0, requires_grad=True) - - # Build graph: z = (x + y) * 2 - sum_xy = add(x, y) # x + y = 5.0 - z = multiply(sum_xy, 2.0) # (x + y) * 2 = 10.0 - - print(f"Forward pass:") - print(f" x = {x.numpy().item()}") - print(f" y = {y.numpy().item()}") - print(f" x + y = {sum_xy.numpy().item()}") - print(f" z = (x + y) * 2 = {z.numpy().item()}") - - # Compute gradients - z.backward() - - print(f"\nBackward pass:") - print(f" dz/dx = {x.grad.numpy().item()}") - print(f" dz/dy = {y.grad.numpy().item()}") - - # Analyze memory usage - import sys - x_memory = sys.getsizeof(x) - z_memory = sys.getsizeof(z) - - print(f"\nMemory Analysis:") - print(f" Leaf variable (x): ~{x_memory} bytes") - print(f" Intermediate result (z): ~{z_memory} bytes") - print(f" Memory overhead: {z_memory/x_memory:.1f}x") - - # TIP WHY THIS MATTERS: In large models, computational graphs can consume - # significant memory. Each intermediate result stores gradients and backward functions. - # This is why techniques like gradient checkpointing are crucial for training large models! 
- - return True - - except Exception as e: - print(f"WARNING️ Error in gradient flow analysis: {e}") - print("Make sure addition and multiplication are implemented") - return False - -# Run the analysis (will work after multiplication is implemented) +test_unit_add_operation() # %% [markdown] """ -## Step 3: Multiplication Operation +## Implementation: Multiplication Operation with Product Rule + +📐 **Mathematical Foundation**: Product rule for derivatives +🔗 **Connections**: Essential for linear layers, attention mechanisms +⚡ **Performance**: Efficient gradient computation using cached forward values ### The Product Rule -For z = x * y: -- **Forward**: z = x * y -- **Backward**: dz/dx = y, dz/dy = x -### Visual: Product Rule in Action -``` -Forward Pass: z = x * y - x: 2.0 --+ - +--[*]---> z: 6.0 - y: 3.0 --+ +For z = x × y: +- ∂z/∂x = y (derivative with respect to first operand) +- ∂z/∂y = x (derivative with respect to second operand) -Backward Pass: dz/dx = y, dz/dy = x - dL/dz: 1.0 --+---> dL/dx: 3.0 (dz/dx = y = 3.0) - | - +---> dL/dy: 2.0 (dz/dy = x = 2.0) - -Product Rule: -- d(xy)/dx = y -- d(xy)/dy = x -``` - -### Why This Matters -Multiplication is everywhere in neural networks: -- **Weight scaling**: w * x in dense layers -- **Attention mechanisms**: attention_weights * values -- **Gating**: gate_signal * hidden_state - -### Chain Rule Application -When gradients flow back through multiplication: -``` -dL/dx = dL/dz · dz/dx = dL/dz · y -dL/dy = dL/dz · dz/dy = dL/dz · x -``` +Chain rule: ∂L/∂x = ∂L/∂z × ∂z/∂x = ∂L/∂z × y """ -# %% nbgrader={"grade": false, "grade_id": "multiply-operation", "locked": false, "schema_version": 3, "solution": true, "task": false} +# %% nbgrader={"grade": false, "grade_id": "multiply-operation", "solution": true} #| export def multiply(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable: """ - Multiplication operation with gradient tracking: a * b - - Uses the product rule: d(a*b)/da = b, d(a*b)/db = a 
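Before wiring the product rule into a `grad_fn`, it can be sanity-checked numerically. This is a standalone sketch in plain NumPy-style Python (no `Variable` class involved) comparing the analytic gradients ∂(xy)/∂x = y and ∂(xy)/∂y = x against central finite differences:

```python
def f(x, y):
    return x * y

x, y, eps = 2.0, 3.0, 1e-6

# Analytic gradients from the product rule
dz_dx, dz_dy = y, x

# Central finite differences as an independent check
fd_dx = (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
fd_dy = (f(x, y + eps) - f(x, y - eps)) / (2 * eps)

print(dz_dx, fd_dx)  # both ≈ 3.0
print(dz_dy, fd_dy)  # both ≈ 2.0
```

The same finite-difference recipe works as a gradient checker for any operation implemented in this module.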
+    Multiply two variables with gradient tracking.
+
+    TODO: Implement multiplication using product rule for gradients
+
+    APPROACH:
+    1. Convert inputs to Variables if needed
+    2. Compute forward pass (a.data × b.data)
+    3. Create grad_fn using product rule: ∂(a×b)/∂a = b, ∂(a×b)/∂b = a
+    4. Return Variable with result and grad_fn
+
+    EXAMPLE:
+    >>> x = Variable([2.0], requires_grad=True)
+    >>> y = Variable([3.0], requires_grad=True)
+    >>> z = multiply(x, y)
+    >>> z.backward()
+    >>> print(x.grad)  # [3.0] - derivative is y's value
+    >>> print(y.grad)  # [2.0] - derivative is x's value
+
+    HINTS:
+    - Product rule: d(uv)/dx = u(dv/dx) + v(du/dx)
+    - For our case: ∂(a×b)/∂a = b, ∂(a×b)/∂b = a
+    - Store original values for use in backward pass
     """
     ### BEGIN SOLUTION
-    # Convert scalars to Variables
-    a, b = _ensure_variables(a, b)
-
-    # Forward pass
-    result_data = a.data * b.data
-
-    # Backward function using product rule
-    def grad_fn(grad_output):
-        if a.requires_grad:
-            a.backward(Variable(grad_output.numpy() * b.numpy()))
-        if b.requires_grad:
-            b.backward(Variable(grad_output.numpy() * a.numpy()))
-
-    # Return new Variable with gradient function
+    # Ensure both inputs are Variables
+    a = _ensure_variable(a)
+    b = _ensure_variable(b)
+
+    # Forward pass computation
+    result_data = Tensor(a.data.data * b.data.data)
+
+    # Determine if result requires gradients
     requires_grad = a.requires_grad or b.requires_grad
-    return Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn)
+
+    # Define backward function for gradient propagation
+    def grad_fn(gradient):
+        """Propagate gradients using product rule."""
+        # Product rule: ∂(a*b)/∂a = b, ∂(a*b)/∂b = a
+        if a.requires_grad:
+            # ∂L/∂a = ∂L/∂z × ∂z/∂a = gradient × b
+            a_grad = gradient * b.data.data
+            a.backward(a_grad)
+        if b.requires_grad:
+            # ∂L/∂b = ∂L/∂z × ∂z/∂b = gradient × a
+            b_grad = gradient * a.data.data
+            b.backward(b_grad)
+
+    # Create result variable with gradient function
+    result = Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn if requires_grad else None)
+    return result
     ### END SOLUTION

 # %% [markdown]
 """
-### TEST Unit Test: Multiplication Operation
-
-This test validates multiplication operation, ensuring the product rule is implemented correctly.
+### 🧪 Unit Test: Multiplication Operation
+This test validates multiplication with product-rule gradients.
 """

-# %% nbgrader={"grade": true, "grade_id": "test-multiply-operation", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false}
+# %%
 def test_unit_multiply_operation():
-    """Test multiplication operation with gradients"""
+    """Test multiplication with gradient tracking."""
     print("🔬 Unit Test: Multiplication Operation...")
-
+
     # Test basic multiplication
-    x = Variable(2.0, requires_grad=True)
-    y = Variable(3.0, requires_grad=True)
+    x = Variable([2.0], requires_grad=True)
+    y = Variable([3.0], requires_grad=True)
     z = multiply(x, y)
-
-    assert z.numpy().item() == 6.0, "Multiplication result should be 6.0"
-    assert z.requires_grad == True, "Result should require gradients"
-
+
+    # Verify forward pass
+    assert np.allclose(z.data.data, [6.0]), f"Expected [6.0], got {z.data.data}"
+
     # Test backward pass
     z.backward()
-
-    assert x.grad is not None, "x should have gradient"
-    assert y.grad is not None, "y should have gradient"
-    assert x.grad.numpy().item() == 3.0, "dz/dx should be y = 3.0"
-    assert y.grad.numpy().item() == 2.0, "dz/dy should be x = 2.0"
-
-    # Test with scalar
-    a = Variable(4.0, requires_grad=True)
-    b = multiply(a, 2.0)  # Multiply by scalar
-
-    assert b.numpy().item() == 8.0, "Multiplication with scalar should work"
-
+    assert np.allclose(x.grad, [3.0]), f"Expected x.grad=[3.0], got {x.grad}"
+    assert np.allclose(y.grad, [2.0]), f"Expected y.grad=[2.0], got {y.grad}"
+
+    # Test with constants
+    a = Variable([4.0], requires_grad=True)
+    b = multiply(a, 2.0)  # Multiplying by constant
     b.backward()
-    assert a.grad.numpy().item() ==
2.0, "Gradient through scalar multiplication should be the scalar"
-
-    print("PASS Multiplication operation tests passed!")
-    print(f"PASS Forward pass computing correct results")
-    print(f"PASS Backward pass implementing product rule correctly")
-    print(f"PASS Scalar multiplication working correctly")
+    assert np.allclose(a.grad, [2.0]), "Gradient should be the constant value"

-# Test will run in main block
+    print("✅ Multiplication operation works correctly!")

-# Now run the gradient flow analysis
-analyze_gradient_flow()
+test_unit_multiply_operation()

-# %% nbgrader={"grade": false, "grade_id": "subtract-operation", "locked": false, "schema_version": 3, "solution": true, "task": false}
+# %% [markdown]
+"""
+## Implementation: Additional Operations
+
+🔗 **Connections**: Complete the basic arithmetic operations needed for neural networks
+⚡ **Performance**: Each operation implements efficient gradient computation
+📦 **Framework Compatibility**: Matches behavior of production autograd systems
+"""
+
+# %% nbgrader={"grade": false, "grade_id": "additional-operations", "solution": true}
 #| export
 def subtract(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable:
     """
-    Subtraction operation with gradient tracking: a - b
-
-    Uses the rule: d(a-b)/da = 1, d(a-b)/db = -1
+    Subtract two variables with gradient tracking.
+
+    TODO: Implement subtraction with proper gradient flow
+
+    HINTS:
+    - For z = a - b: ∂z/∂a = 1, ∂z/∂b = -1
+    - Similar to addition but second operand gets negative gradient
     """
     ### BEGIN SOLUTION
-    # Convert to Variables if needed
-    a, b = _ensure_variables(a, b)
-
-    # Forward pass
-    result_data = a.data - b.data
-
-    # Create gradient function
-    def grad_fn(grad_output):
-        # Subtraction rule: d(x-y)/dx = 1, d(x-y)/dy = -1
-        if a.requires_grad:
-            a.backward(grad_output)
-        if b.requires_grad:
-            b_grad = Variable(-grad_output.numpy())
-            b.backward(b_grad)
-
+    # Ensure both inputs are Variables
+    a = _ensure_variable(a)
+    b = _ensure_variable(b)
+
+    # Forward pass computation
+    result_data = Tensor(a.data.data - b.data.data)
+    # Determine if result requires gradients
     requires_grad = a.requires_grad or b.requires_grad
-
-    return Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn)
+
+    # Define backward function for gradient propagation
+    def grad_fn(gradient):
+        """Propagate gradients for subtraction."""
+        # Subtraction: ∂(a-b)/∂a = 1, ∂(a-b)/∂b = -1
+        if a.requires_grad:
+            a.backward(gradient)
+        if b.requires_grad:
+            b.backward(-gradient)  # Negative for subtraction
+
+    # Create result variable with gradient function
+    result = Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn if requires_grad else None)
+    return result
     ### END SOLUTION

 #| export
 def matmul(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable:
     """
-    Matrix multiplication operation with gradient tracking: a @ b
-
-    Uses matrix multiplication gradients: dC/dA = grad_C @ B^T, dC/dB = A^T @ grad_C
-    """
-    ### BEGIN SOLUTION
-    # Convert scalars to Variables
-    a, b = _ensure_variables(a, b)
-
-    # Forward pass - matrix multiplication
-    result_data = Tensor(a.numpy() @ b.numpy())
-
-    # Backward function
-    def grad_fn(grad_output):
-        # Matrix multiplication gradients
-        if a.requires_grad:
-            # dC/dA = grad_C @ B^T
-            grad_a_data = grad_output.numpy() @
b.numpy().T - a.backward(Variable(grad_a_data)) - - if b.requires_grad: - # dC/dB = A^T @ grad_C - grad_b_data = a.numpy().T @ grad_output.numpy() - b.backward(Variable(grad_b_data)) - - # Return new Variable with gradient function - requires_grad = a.requires_grad or b.requires_grad - return Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn) - ### END SOLUTION + Matrix multiplication with gradient tracking. -#| export -def divide(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable: - """ - Division operation with gradient tracking: a / b - - Uses the quotient rule: d(a/b)/da = 1/b, d(a/b)/db = -a/b² - """ - ### BEGIN SOLUTION - # Convert scalars to Variables - a, b = _ensure_variables(a, b) - - # Forward pass - result_data = a.data / b.data - - # Backward function - def grad_fn(grad_output): - if a.requires_grad: - # d(a/b)/da = 1/b - grad_a = Variable(grad_output.numpy() / b.numpy()) - a.backward(grad_a) - if b.requires_grad: - # d(a/b)/db = -a/b² - grad_b = Variable(-grad_output.numpy() * a.numpy() / (b.numpy() ** 2)) - b.backward(grad_b) - - requires_grad = a.requires_grad or b.requires_grad - return Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn) - ### END SOLUTION - -# %% nbgrader={"grade": false, "grade_id": "test-subtract-operation", "locked": false, "schema_version": 3, "solution": false, "task": false} -def test_unit_subtract_operation(): - """Test subtraction operation with gradients""" - print("🔬 Unit Test: Subtraction Operation...") - - # Test basic subtraction - x = Variable(5.0, requires_grad=True) - y = Variable(3.0, requires_grad=True) - z = subtract(x, y) - - assert z.numpy().item() == 2.0, "Subtraction result should be 2.0" - assert z.requires_grad == True, "Result should require gradients" - - # Test backward pass - z.backward() - - assert x.grad is not None, "x should have gradient" - assert y.grad is not None, "y should have gradient" - assert x.grad.numpy().item() == 1.0, "dz/dx 
should be 1.0" - assert y.grad.numpy().item() == -1.0, "dz/dy should be -1.0" - - # Test with scalar - a = Variable(4.0, requires_grad=True) - b = subtract(a, 2.0) # Subtract scalar - - assert b.numpy().item() == 2.0, "Subtraction with scalar should work" - - b.backward() - assert a.grad.numpy().item() == 1.0, "Gradient through scalar subtraction should be 1.0" - - print("PASS Subtraction operation tests passed!") - print(f"PASS Forward pass computing correct results") - print(f"PASS Backward pass implementing subtraction rule correctly") - print(f"PASS Scalar subtraction working correctly") - -# Test will run in main block - -# %% [markdown] -""" -## THINK Computational Assessment: Chain Rule Application - -Test your understanding of how gradients flow through multiple operations. -""" - -# %% nbgrader={"grade": true, "grade_id": "question-chain-rule", "locked": false, "points": 15, "schema_version": 3, "solution": true, "task": false} -""" -### Assessment Question: Manual Gradient Calculation - -Consider this computational graph: -```python -x = Variable(2.0, requires_grad=True) -y = Variable(3.0, requires_grad=True) -a = x * y # a = 6.0 -b = a + x # b = 8.0 -c = b * 2 # c = 16.0 -c.backward() -``` - -**Calculate manually:** -1. dc/db = _____ -2. db/da = _____ -3. db/dx = _____ -4. da/dx = _____ -5. da/dy = _____ - -**Apply chain rule:** -6. dc/dx (through path c->b->a->x) = _____ -7. dc/dx (through path c->b->x) = _____ -8. Total dc/dx = _____ + _____ = _____ -9. dc/dy = _____ - -**Verification:** -TODO: Run the code above and verify your calculations match the computed gradients. -""" - -### BEGIN SOLUTION -# Student calculation space - this will be manually graded -# Expected answers: -# 1. dc/db = 2 (c = b * 2) -# 2. db/da = 1 (b = a + x) -# 3. db/dx = 1 (b = a + x) -# 4. da/dx = y = 3 (a = x * y) -# 5. da/dy = x = 2 (a = x * y) -# 6. dc/dx (path 1) = 2 * 1 * 3 = 6 -# 7. dc/dx (path 2) = 2 * 1 = 2 -# 8. Total dc/dx = 6 + 2 = 8 -# 9. 
dc/dy = 2 * 1 * 2 = 4 -### END SOLUTION - -# %% [markdown] -""" -## Step 4: Chain Rule in Complex Expressions - -### Building Complex Computations -Now let us test how multiple operations work together through the chain rule: - -### Visual: Complex Computational Graph -``` -Example: f(x, y) = (x + y) * (x - y) = x² - y² - -Computational Graph: - x --+--> [+] --+--> [*] ---> result - | | - y --+--> [+] --+ - | - +--> [-] --+ - x - -Forward Pass Flow: - x=3, y=2 -> sum=5, diff=1 -> result=5 - -Backward Pass Flow: - dL/dresult=1 -> dL/dsum=1, dL/ddiff=5 -> dL/dx=6, dL/dy=-4 - -Manual verification: f(x,y) = x² - y² -df/dx = 2x = 2(3) = 6 OK -df/dy = -2y = -2(2) = -4 OK -``` - -### Chain Rule Application -- **Forward**: Compute each operation in sequence -- **Backward**: Gradients flow back through each operation -- **Automatic**: No manual gradient computation needed! - -### Real-World Significance -Complex neural networks are just larger versions of this: -- **Millions of operations**: Each tracked automatically -- **Complex architectures**: ResNet, Transformer, etc. 
-- **Efficient computation**: O(1) overhead per operation -""" - -# %% nbgrader={"grade": true, "grade_id": "test-chain-rule", "locked": true, "points": 20, "schema_version": 3, "solution": false, "task": false} -def test_unit_chain_rule(): - """Test chain rule with complex expressions""" - print("🔬 Unit Test: Chain Rule with Complex Expressions...") - - # Test: f(x, y) = (x + y) * (x - y) = x² - y² - x = Variable(3.0, requires_grad=True) - y = Variable(2.0, requires_grad=True) - - # Build expression step by step - sum_xy = add(x, y) # x + y = 5.0 - diff_xy = subtract(x, y) # x - y = 1.0 - result = multiply(sum_xy, diff_xy) # (x + y) * (x - y) = 5.0 - - # Check forward pass - assert result.numpy().item() == 5.0, "Forward pass should compute 5.0" - - # Compute gradients - result.backward() - - # Check gradients: d(x²-y²)/dx = 2x, d(x²-y²)/dy = -2y - expected_x_grad = 2 * x.numpy().item() # 2 * 3 = 6 - expected_y_grad = -2 * y.numpy().item() # -2 * 2 = -4 - - assert abs(x.grad.numpy().item() - expected_x_grad) < 1e-6, f"x gradient should be {expected_x_grad}" - assert abs(y.grad.numpy().item() - expected_y_grad) < 1e-6, f"y gradient should be {expected_y_grad}" - - # Test more complex expression: f(x) = (x + 1) * (x + 2) * (x + 3) - x2 = Variable(1.0, requires_grad=True) - - term1 = add(x2, 1.0) # x + 1 = 2.0 - term2 = add(x2, 2.0) # x + 2 = 3.0 - term3 = add(x2, 3.0) # x + 3 = 4.0 - - product1 = multiply(term1, term2) # (x + 1) * (x + 2) = 6.0 - result2 = multiply(product1, term3) # * (x + 3) = 24.0 - - assert result2.numpy().item() == 24.0, "Complex expression should compute 24.0" - - result2.backward() - - # For f(x) = (x+1)(x+2)(x+3), f'(x) = 3x² + 12x + 11 - # At x=1: f'(1) = 3 + 12 + 11 = 26 - expected_grad = 3 * (1.0**2) + 12 * 1.0 + 11 # 26 - - assert abs(x2.grad.numpy().item() - expected_grad) < 1e-6, f"Complex gradient should be {expected_grad}" - - print("PASS Chain rule tests passed!") - print(f"PASS Simple expression: (x+y)*(x-y) = x²-y²") - print(f"PASS 
Complex expression: (x+1)*(x+2)*(x+3)") - print(f"PASS Automatic gradient computation working correctly") - print(f"PASS Chain rule implemented correctly") - -# Test will run in main block - -# PASS IMPLEMENTATION CHECKPOINT: Basic operations complete - -# THINK PREDICTION: How does computational graph memory scale with network depth? -# Your answer: _______ - -# MAGNIFY SYSTEMS INSIGHT #2: Computational Graph Memory Analysis -def analyze_computational_graph_memory(): - """Analyze memory consumption patterns in computational graphs.""" - try: - print("MAGNIFY COMPUTATIONAL GRAPH MEMORY ANALYSIS") - print("=" * 45) - - import sys - - # Test different graph depths - depths = [1, 3, 5, 8] - memory_usage = [] - - for depth in depths: - # Create computational graph of specified depth - x = Variable(np.random.randn(100, 100), requires_grad=True) - current = x - - # Build chain of operations - for i in range(depth): - current = multiply(current, 1.1) - current = add(current, 0.1) - - # Estimate memory usage - base_memory = x.data.data.nbytes / (1024 * 1024) # MB - - # Each operation creates new Variable with references - estimated_graph_memory = depth * 2 * base_memory # Rough estimate - - memory_usage.append(estimated_graph_memory) - - print(f" Depth {depth}: ~{estimated_graph_memory:.1f} MB") - - # Analyze scaling - if len(memory_usage) >= 2: - shallow = memory_usage[0] - deep = memory_usage[-1] - scaling_factor = deep / shallow - - print(f"\nMemory Scaling Analysis:") - print(f" Depth 1: {shallow:.1f} MB") - print(f" Depth {depths[-1]}: {deep:.1f} MB") - print(f" Scaling factor: {scaling_factor:.1f}x") - print(f" Scaling per layer: {scaling_factor/depths[-1]:.2f}x") - - # Production implications - print(f"\n🏭 Production Scaling Implications:") - print(f" • ResNet-50 (50 layers): ~{memory_usage[0] * 50:.0f} MB graph memory") - print(f" • Transformer (100 layers): ~{memory_usage[0] * 100:.0f} MB graph memory") - print(f" • GPT-3 scale models: Gradient checkpointing 
essential!") - - # TIP WHY THIS MATTERS: Deep networks require storing intermediate activations - # for gradient computation. This memory grows linearly with depth, leading to - # memory constraints. Gradient checkpointing trades compute for memory! - - return memory_usage - - except Exception as e: - print(f"WARNING️ Error in memory analysis: {e}") - print("Make sure all operations are implemented") - return [1.0] - -# Run the analysis -analyze_computational_graph_memory() - -# %% [markdown] -""" -## Step 5: Integration with Neural Network Training - -### The Complete Training Loop -Let us see how autograd enables neural network training: - -1. **Forward pass**: Compute predictions -2. **Loss computation**: Compare with targets -3. **Backward pass**: Compute gradients automatically -4. **Parameter update**: Update weights using gradients - -### Visual: Neural Network Training Flow -``` -Training Loop Architecture: -+-------------+ +-------------+ +-------------+ +-------------+ -| Forward |---▶| Loss |---▶| Backward |---▶| Update | -| Pass | | Computation | | Pass | | Parameters | -+-------------+ +-------------+ +-------------+ +-------------+ - ^ | | - | v v -+-------------+ +-------------+ +-------------+ -| Input Data | | Gradients | | New Weights| -| (x, y) | | gradL/gradθ | | θ' | -+-------------+ +-------------+ +-------------+ - -Memory Flow During Training: - Parameters -> Forward Activations -> Loss -> Gradients -> Parameter Updates - θ f(x; θ) L gradL/gradθ θ - αgradL/gradθ - 4 MB 12 MB 1 val 4 MB 4 MB - (stored for (in-place) - backward) -``` - -### Example: Simple Linear Regression - ```python -# Model: y = wx + b -w = Variable(0.5, requires_grad=True) -b = Variable(0.1, requires_grad=True) - - # Forward pass -prediction = w * x + b - -# Loss: mean squared error -loss = (prediction - target)**2 - -# Backward pass (automatic!) 
-loss.backward() - -# Update parameters -w.data = w.data - learning_rate * w.grad.data -b.data = b.data - learning_rate * b.grad.data -``` - -### Why This is Powerful -- **Automatic**: No manual gradient computation -- **Flexible**: Works with any differentiable function -- **Efficient**: Minimal computational overhead -- **Scalable**: Handles millions of parameters -""" - -# %% nbgrader={"grade": true, "grade_id": "test-neural-network-training", "locked": true, "points": 25, "schema_version": 3, "solution": false, "task": false} -def test_module_neural_network_training(): - """Test autograd in neural network training scenario""" - print("🔬 Integration Test: Neural Network Training Comprehensive Test...") - - # Simple linear regression: y = wx + b - # Training data: y = 2x + 1 + noise - - # Initialize parameters - w = Variable(0.1, requires_grad=True) # Start with small random value - b = Variable(0.0, requires_grad=True) # Start with zero bias - - # Training data - x_data = [1.0, 2.0, 3.0, 4.0] - y_data = [3.0, 5.0, 7.0, 9.0] # y = 2x + 1 - - learning_rate = 0.01 - - # Training loop - for epoch in range(100): - total_loss = Variable(0.0) - - for x_val, y_val in zip(x_data, y_data): - # Create input variable - x = Variable(x_val, requires_grad=False) - target = Variable(y_val, requires_grad=False) - - # Forward pass - prediction = add(multiply(w, x), b) # wx + b - - # Loss: squared error - error = subtract(prediction, target) - loss = multiply(error, error) # (pred - target)² - - # Accumulate loss - total_loss = add(total_loss, loss) - - # Backward pass - w.zero_grad() - b.zero_grad() - total_loss.backward() - - # Update parameters - if w.grad is not None: - w.data = Tensor(w.numpy() - learning_rate * w.grad.numpy()) - if b.grad is not None: - b.data = Tensor(b.numpy() - learning_rate * b.grad.numpy()) - - # Check that parameters converged to correct values - final_w = w.numpy().item() - final_b = b.numpy().item() - - print(f"Final weights: w = {final_w:.3f}, b = 
{final_b:.3f}") - print(f"Target weights: w = 2.000, b = 1.000") - - # Should be close to w=2, b=1 - assert abs(final_w - 2.0) < 0.1, f"Weight should be close to 2.0, got {final_w}" - assert abs(final_b - 1.0) < 0.1, f"Bias should be close to 1.0, got {final_b}" - - # Test prediction with learned parameters - test_x = Variable(5.0, requires_grad=False) - test_prediction = add(multiply(w, test_x), b) - expected_output = 2.0 * 5.0 + 1.0 # 11.0 - - prediction_error = abs(test_prediction.numpy().item() - expected_output) - assert prediction_error < 0.5, f"Prediction error should be small, got {prediction_error}" - - print("PASS Neural network training comprehensive tests passed!") - print(f"PASS Parameters converged to correct values") - print(f"PASS Model makes accurate predictions") - print(f"PASS Autograd enables automatic training") - print(f"PASS Ready for complex neural network architectures!") - -# Test will run in main block - -# PASS IMPLEMENTATION CHECKPOINT: Neural network training complete - -# THINK PREDICTION: How does backward pass time compare to forward pass time? 
-# Your answer: _______ - -# MAGNIFY SYSTEMS INSIGHT #3: Forward vs Backward Pass Performance -def analyze_forward_backward_performance(): - """Analyze performance characteristics of forward vs backward passes.""" - try: - print("MAGNIFY FORWARD VS BACKWARD PASS PERFORMANCE") - print("=" * 45) - - import time - - # Test with different computation scales - sizes = [50, 100, 200] - results = [] - - for size in sizes: - print(f"\nTesting {size}x{size} operations:") - - # Create computation graph - x = Variable(np.random.randn(size, size), requires_grad=True) - y = Variable(np.random.randn(size, size), requires_grad=True) - - # Forward pass timing - forward_iterations = 5 - forward_start = time.time() - - for _ in range(forward_iterations): - z1 = multiply(x, y) - z2 = add(z1, x) - z3 = multiply(z2, 2.0) - result = z3.sum() if hasattr(z3, 'sum') else Variable(np.sum(z3.numpy())) - - forward_end = time.time() - avg_forward_time = (forward_end - forward_start) / forward_iterations - - # Backward pass timing - backward_start = time.time() - result.backward() - backward_end = time.time() - backward_time = backward_end - backward_start - - # Memory analysis - forward_memory = x.data.data.nbytes * 4 / (1024 * 1024) # Estimate - gradient_memory = (x.grad.data.data.nbytes + y.grad.data.data.nbytes) / (1024 * 1024) if x.grad and y.grad else 0 - - result_data = { - 'size': size, - 'forward_time_ms': avg_forward_time * 1000, - 'backward_time_ms': backward_time * 1000, - 'backward_forward_ratio': backward_time / avg_forward_time, - 'forward_memory_mb': forward_memory, - 'gradient_memory_mb': gradient_memory - } - results.append(result_data) - - print(f" Forward: {avg_forward_time*1000:.2f}ms") - print(f" Backward: {backward_time*1000:.2f}ms") - print(f" Ratio: {backward_time/avg_forward_time:.1f}x") - print(f" Memory: {forward_memory:.1f}MB forward, {gradient_memory:.1f}MB gradients") - - # Analyze trends - avg_ratio = sum(r['backward_forward_ratio'] for r in results) / 
len(results) - - print(f"\n📊 Performance Analysis:") - print(f" Average backward/forward ratio: {avg_ratio:.1f}x") - - if avg_ratio > 2.0: - print(f" • Backward pass significantly slower than forward") - print(f" • Gradient computation dominates training time") - elif avg_ratio < 1.5: - print(f" • Backward pass efficient relative to forward") - print(f" • Good autograd implementation") - else: - print(f" • Balanced forward/backward performance") - - print(f"\n🏭 Production Implications:") - print(f" • Training time ~= {1 + avg_ratio:.1f}x inference time") - print(f" • Memory usage ~= 2x parameters (gradients + weights)") - print(f" • Gradient checkpointing can trade compute for memory") - - # TIP WHY THIS MATTERS: Backward pass typically takes 1.5-3x forward pass time. - # This determines training speed and influences architecture choices. - # Understanding this ratio helps optimize training pipelines! - - return results - - except Exception as e: - print(f"WARNING️ Error in performance analysis: {e}") - print("Basic timing analysis shows autograd overhead patterns") - return [] - -# Run the analysis -analyze_forward_backward_performance() - -# %% [markdown] -""" -## Step 4: Production Autograd Features - -### 🏗️ Gradient Clipping for Training Stability - -In deep networks, gradients can explode during training, causing training instability and numerical overflow. Gradient clipping is a critical technique used in production systems. - -### Visual: Gradient Explosion Problem -``` -Normal Training: Gradient Explosion: With Clipping: - Loss Loss Loss - | |\\ | - |\\ | \\ |\\ - | \\ | \\ | \\ max_norm - | \\ | \\ | \\___ - | \\__ | \\ | \\ - | \\ | \\ | \\ - +-------- +------\\ +--------- - Epoch Epoch NaN Epoch - ↗ - Training - Diverges -``` - -### Mathematical Foundation -- **Gradient norm**: ||g|| = sqrt(g₁² + g₂² + ... 
+ gₙ²) -- **Clipping factor**: max_norm / max(||g||, max_norm) -- **Clipped gradients**: g' = g * clipping_factor - -### Real-World Usage -- **Transformer training**: Prevents attention weight explosion -- **RNN training**: Essential for sequence modeling stability -- **GAN training**: Stabilizes adversarial training dynamics -- **Large model training**: Critical for models with >1B parameters -""" - -# %% nbgrader={"grade": false, "grade_id": "gradient-clipping", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -def clip_gradients(variables: List[Variable], max_norm: float = 1.0) -> float: - """ - Gradient clipping for training stability. - - TODO: Implement gradient clipping across all variables. - - APPROACH: - 1. Calculate total gradient norm across all variables - 2. Compute clipping factor: min(1.0, max_norm / total_norm) - 3. Scale all gradients by the clipping factor - 4. Return the computed gradient norm for monitoring - - EXAMPLE: - w1 = Variable(np.random.randn(10, 10), requires_grad=True) - w2 = Variable(np.random.randn(10, 5), requires_grad=True) - # ... compute loss and gradients ... 
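The norm/clip recipe stated above can be prototyped on raw NumPy arrays before integrating it with the `Variable` class. A minimal sketch (the helper name `clip_by_global_norm` is hypothetical; a real implementation must also skip variables with no gradient):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale a list of gradient arrays so their joint L2 norm <= max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        factor = max_norm / total_norm
        grads = [g * factor for g in grads]
    return grads, total_norm

g1 = np.full((3,), 4.0)
g2 = np.full((4,), 3.0)
clipped, norm = clip_by_global_norm([g1, g2], max_norm=1.0)
print(norm)  # sqrt(3*16 + 4*9) = sqrt(84) ≈ 9.17, so gradients get scaled down
```

Note the norm is computed jointly over all parameters, not per array; that matches the behavior of clip-by-global-norm utilities in production frameworks.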
- grad_norm = clip_gradients([w1, w2], max_norm=1.0) + TODO: Implement matrix multiplication with proper gradients HINTS: - - Total norm = sqrt(sum of squared gradients) - - Only clip if total_norm > max_norm - - Update gradients in-place for efficiency - - Handle case when variables have no gradients + - For z = a @ b: ∂z/∂a = gradient @ b.T, ∂z/∂b = a.T @ gradient + - This is fundamental for neural network linear layers """ ### BEGIN SOLUTION - # Calculate total gradient norm - total_norm = 0.0 + # Ensure both inputs are Variables + a = _ensure_variable(a) + b = _ensure_variable(b) - # Collect all gradients and compute norm - gradients = [] - for var in variables: - if var.grad is not None: - grad_data = var.grad.numpy() - total_norm += np.sum(grad_data ** 2) - gradients.append((var, grad_data)) + # Forward pass computation + result_data = Tensor(a.data.data @ b.data.data) - total_norm = np.sqrt(total_norm) + # Determine if result requires gradients + requires_grad = a.requires_grad or b.requires_grad - # Compute clipping factor - if total_norm > max_norm: - clipping_factor = max_norm / total_norm + # Define backward function for gradient propagation + def grad_fn(gradient): + """Propagate gradients for matrix multiplication.""" + # Matrix multiplication gradients: + # ∂(a@b)/∂a = gradient @ b.T + # ∂(a@b)/∂b = a.T @ gradient + if a.requires_grad: + a_grad = gradient @ b.data.data.T + a.backward(a_grad) + if b.requires_grad: + b_grad = a.data.data.T @ gradient + b.backward(b_grad) - # Apply clipping to all gradients - for var, grad_data in gradients: - clipped_grad = grad_data * clipping_factor - var.grad = Variable(clipped_grad) - - return total_norm + # Create result variable with gradient function + result = Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn if requires_grad else None) + return result ### END SOLUTION -#| export -def enable_mixed_precision_gradients(variables: List[Variable], loss_scale: float = 1024.0): - """ - Enable mixed 
precision gradient computation for memory efficiency. - - TODO: Implement mixed precision gradient scaling. - - APPROACH: - 1. Scale loss by loss_scale to prevent FP16 underflow - 2. Compute gradients normally (they will be scaled) - 3. Unscale gradients before optimizer step - 4. Check for overflow and skip update if needed - - MATHEMATICAL FOUNDATION: - - FP16 range: ~6e-8 to 65504 - - Gradient scaling prevents underflow: grad_scaled = grad * scale - - Unscaling before update: grad_final = grad_scaled / scale - - PRODUCTION USAGE: - - Reduces memory usage by ~2x during training - - Enables training larger models on same hardware - - Used in most large model training (GPT, BERT, etc.) - """ - ### BEGIN SOLUTION - # Apply gradient unscaling for mixed precision - overflow_detected = False - - for var in variables: - if var.grad is not None: - grad_data = var.grad.numpy() - - # Check for overflow (inf or nan) - if np.any(np.isinf(grad_data)) or np.any(np.isnan(grad_data)): - overflow_detected = True - break - - # Unscale gradients - unscaled_grad = grad_data / loss_scale - var.grad = Variable(unscaled_grad) - - if overflow_detected: - # Zero out gradients on overflow - for var in variables: - if var.grad is not None: - var.zero_grad() - print(f"WARNING️ Gradient overflow detected, skipping optimizer step") - - return not overflow_detected - ### END SOLUTION - -# %% nbgrader={"grade": false, "grade_id": "autograd-systems-profiler", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -import time -import gc -from collections import defaultdict, deque - -class AutogradSystemsProfiler: - """ - Production Autograd System Performance Analysis and Optimization - - Analyzes computational graph efficiency, memory patterns, and optimization - opportunities for production automatic differentiation systems. - Enhanced with memory management analysis and graph optimization strategies. 
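The depth-scaling question this profiler targets can be approximated without the full `Variable` machinery. A hedged sketch (the helper `time_chain` is hypothetical; absolute timings are machine-dependent and only the trend with depth is meaningful):

```python
import time
import numpy as np

def time_chain(depth, size=100, repeats=3):
    """Average wall-clock time of `depth` chained elementwise ops."""
    x = np.random.randn(size, size)
    start = time.perf_counter()
    for _ in range(repeats):
        h = x
        for _ in range(depth):
            h = np.maximum(0.0, h * 1.1 + 0.1)  # toy layer
    return (time.perf_counter() - start) / repeats

timings = {d: time_chain(d) for d in (1, 4, 8)}
for d, t in timings.items():
    print(f"depth {d}: {t * 1000:.3f} ms")
```

Forward cost grows roughly linearly with depth here; in a real autograd system the backward pass adds a comparable (often 1.5-3x) cost on top, which is what the profiler below is meant to measure.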
- """ - - def __init__(self): - """Initialize autograd systems profiler with enhanced analytics.""" - self.profiling_data = defaultdict(list) - self.graph_analysis = defaultdict(list) - self.optimization_strategies = [] - self.memory_patterns = defaultdict(list) - self.graph_optimizations = [] - - def profile_computational_graph_depth(self, max_depth=10, operations_per_level=5): - """ - Profile computational graph performance vs depth. - - TODO: Implement computational graph depth analysis. - - APPROACH: - 1. Create computational graphs of increasing depth - 2. Measure forward and backward pass timing - 3. Analyze memory usage patterns during gradient computation - 4. Identify memory accumulation and gradient flow bottlenecks - 5. Generate graph optimization recommendations - - EXAMPLE: - profiler = AutogradSystemsProfiler() - graph_analysis = profiler.profile_computational_graph_depth(max_depth=8) - print(f"Memory scaling factor: {graph_analysis['memory_scaling_factor']:.2f}") - - HINTS: - - Build graphs by chaining operations: x -> op1 -> op2 -> ... 
-> loss - - Measure both forward and backward pass timing separately - - Track memory usage throughout the computation - - Monitor gradient accumulation patterns - - Focus on production-relevant graph depths - """ - ### BEGIN SOLUTION - print("🔧 Profiling Computational Graph Depth Impact...") - - results = {} - - for depth in range(1, max_depth + 1): - print(f" Testing graph depth: {depth}") - - # Create a computational graph of specified depth - # Each level adds more operations to test scaling - - # Start with input variable - try: - # Use Variable if available, otherwise simulate - x = Variable(np.random.randn(100, 100), requires_grad=True) - except: - # Fallback for testing - simulate Variable with Tensor - x = Tensor(np.random.randn(100, 100)) - - # Build computational graph of specified depth - current_var = x - operations = [] - - for level in range(depth): - # Add multiple operations per level to increase complexity - for op_idx in range(operations_per_level): - try: - # Simulate various operations - if op_idx % 4 == 0: - current_var = current_var * 0.9 # Scale operation - elif op_idx % 4 == 1: - current_var = current_var + 0.1 # Add operation - elif op_idx % 4 == 2: - # Matrix multiplication (most expensive) - weight = Tensor(np.random.randn(100, 100)) - if hasattr(current_var, 'data'): - current_var = Tensor(current_var.data @ weight.data) - else: - current_var = current_var @ weight - else: - # Activation-like operation - if hasattr(current_var, 'data'): - current_var = Tensor(np.maximum(0, current_var.data)) - else: - current_var = current_var # Skip for simplicity - - operations.append(f"level_{level}_op_{op_idx}") - except: - # Fallback for testing - current_var = Tensor(np.random.randn(100, 100)) - operations.append(f"level_{level}_op_{op_idx}_fallback") - - # Add final loss computation - try: - if hasattr(current_var, 'data'): - loss = Tensor(np.sum(current_var.data ** 2)) - else: - loss = np.sum(current_var ** 2) - except: - loss = 
Tensor(np.array([1.0])) - - # Measure forward pass timing - forward_iterations = 3 - forward_start = time.time() - - for _ in range(forward_iterations): - # Simulate forward pass computation - temp_x = x - for level in range(depth): - for op_idx in range(operations_per_level): - if op_idx % 4 == 0: - temp_x = temp_x * 0.9 - elif op_idx % 4 == 1: - temp_x = temp_x + 0.1 - # Skip expensive ops for timing - - forward_end = time.time() - avg_forward_time = (forward_end - forward_start) / forward_iterations - - # Measure backward pass timing (simulated) - # In real implementation, this would be loss.backward() - backward_start = time.time() - - # Simulate gradient computation through the graph - for _ in range(forward_iterations): - # Simulate backpropagation through all operations - gradient_accumulation = 0 - for level in range(depth): - for op_idx in range(operations_per_level): - # Simulate gradient computation - gradient_accumulation += level * op_idx * 0.001 - - backward_end = time.time() - avg_backward_time = (backward_end - backward_start) / forward_iterations - - # Memory analysis - try: - if hasattr(x, 'data'): - base_memory = x.data.nbytes / (1024 * 1024) # MB - if hasattr(current_var, 'data'): - result_memory = current_var.data.nbytes / (1024 * 1024) - else: - result_memory = base_memory - else: - base_memory = x.nbytes / (1024 * 1024) if hasattr(x, 'nbytes') else 1.0 - result_memory = base_memory - except: - base_memory = 1.0 - result_memory = 1.0 - - # Estimate gradient memory (in production, each operation stores gradients) - estimated_gradient_memory = depth * operations_per_level * base_memory * 0.5 - total_memory = base_memory + result_memory + estimated_gradient_memory - - # Calculate efficiency metrics - total_operations = depth * operations_per_level - total_time = avg_forward_time + avg_backward_time - operations_per_second = total_operations / total_time if total_time > 0 else 0 - - result = { - 'graph_depth': depth, - 'total_operations': 
total_operations, - 'forward_time_ms': avg_forward_time * 1000, - 'backward_time_ms': avg_backward_time * 1000, - 'total_time_ms': total_time * 1000, - 'base_memory_mb': base_memory, - 'estimated_gradient_memory_mb': estimated_gradient_memory, - 'total_memory_mb': total_memory, - 'operations_per_second': operations_per_second, - 'memory_per_operation': total_memory / total_operations if total_operations > 0 else 0 - } - - results[depth] = result - - print(f" Forward: {avg_forward_time*1000:.3f}ms, Backward: {avg_backward_time*1000:.3f}ms, Memory: {total_memory:.2f}MB") - - # Analyze scaling patterns - graph_analysis = self._analyze_graph_scaling(results) - - # Store profiling data - self.profiling_data['graph_depth_analysis'] = results - self.graph_analysis = graph_analysis - - return { - 'detailed_results': results, - 'graph_analysis': graph_analysis, - 'optimization_strategies': self._generate_graph_optimizations(results), - 'memory_management_analysis': self._analyze_memory_management_patterns(results), - 'graph_fusion_opportunities': self._identify_graph_fusion_opportunities(results) - } - ### END SOLUTION - - def _analyze_memory_management_patterns(self, results): - """Analyze memory management patterns for dynamic vs static graphs.""" - analysis = { - 'dynamic_graph_characteristics': {}, - 'static_graph_opportunities': {}, - 'memory_optimization_strategies': [] - } - - # Analyze memory growth patterns - memory_values = [result['total_memory_mb'] for result in results.values()] - depths = sorted(results.keys()) - - if len(memory_values) >= 2: - memory_growth_rate = (memory_values[-1] - memory_values[0]) / (depths[-1] - depths[0]) - - analysis['dynamic_graph_characteristics'] = { - 'memory_growth_rate_mb_per_layer': memory_growth_rate, - 'memory_linearity': 'linear' if memory_growth_rate > 0 else 'sublinear', - 'peak_memory_mb': max(memory_values), - 'memory_efficiency': memory_values[0] / max(memory_values) - } - - # Static graph opportunities - if 
memory_growth_rate > 5.0: # >5MB per layer - analysis['static_graph_opportunities'] = { - 'graph_compilation_benefit': 'high', - 'memory_pooling_opportunity': 'significant', - 'operator_fusion_potential': 'excellent' - } - - analysis['memory_optimization_strategies'].extend([ - "🔧 Static graph compilation for memory pooling", - "🔧 Operator fusion to reduce intermediate allocations", - "🔧 Memory arena allocation for gradient storage" - ]) - else: - analysis['static_graph_opportunities'] = { - 'graph_compilation_benefit': 'moderate', - 'memory_pooling_opportunity': 'limited', - 'operator_fusion_potential': 'moderate' - } - - # Add general memory management strategies - analysis['memory_optimization_strategies'].extend([ - "💾 Gradient checkpointing for memory-time trade-offs", - "🔄 In-place operations where mathematically valid", - "📊 Dynamic memory allocation with smart pre-allocation", - "TARGET Lazy evaluation for unused computation branches" - ]) - - return analysis - - def _identify_graph_fusion_opportunities(self, results): - """Identify operator fusion opportunities for cache efficiency.""" - fusion_analysis = { - 'fusion_opportunities': [], - 'cache_efficiency_patterns': {}, - 'kernel_optimization_strategies': [] - } - - # Analyze operation patterns that benefit from fusion - total_operations = sum(result['total_operations'] for result in results.values()) - avg_operations_per_layer = total_operations / len(results) if results else 0 - - if avg_operations_per_layer > 3: - fusion_analysis['fusion_opportunities'] = [ - "🔀 Element-wise operation fusion (add, multiply, activation)", - "LINK Matrix operation chains (matmul + bias + activation)", - "PROGRESS Reduction operation fusion (sum, mean, variance)", - "🎭 Attention pattern fusion (Q@K^T, softmax, @V)" - ] - - fusion_analysis['cache_efficiency_patterns'] = { - 'memory_access_pattern': 'multiple_passes', - 'cache_utilization': 'suboptimal', - 'fusion_benefit': 'high', - 'bandwidth_reduction_potential': 
f"{avg_operations_per_layer:.1f}x" - } - else: - fusion_analysis['fusion_opportunities'] = [ - "✨ Limited fusion opportunities in current graph structure" - ] - - fusion_analysis['cache_efficiency_patterns'] = { - 'memory_access_pattern': 'single_pass', - 'cache_utilization': 'good', - 'fusion_benefit': 'low' - } - - # Add kernel optimization strategies - fusion_analysis['kernel_optimization_strategies'] = [ - "SPEED JIT compilation for operation sequences", - "TARGET Vectorization of element-wise operations", - "🔄 Loop fusion for reduced memory bandwidth", - "📱 GPU kernel optimization for parallel execution", - "🧮 Mixed precision kernel specialization" - ] - - return fusion_analysis - - def _analyze_graph_scaling(self, results): - """Analyze computational graph scaling patterns.""" - analysis = {} - - # Extract metrics for scaling analysis - depths = sorted(results.keys()) - forward_times = [results[d]['forward_time_ms'] for d in depths] - backward_times = [results[d]['backward_time_ms'] for d in depths] - total_times = [results[d]['total_time_ms'] for d in depths] - memory_usage = [results[d]['total_memory_mb'] for d in depths] - - # Calculate scaling factors - if len(depths) >= 2: - shallow = depths[0] - deep = depths[-1] - - depth_ratio = deep / shallow - forward_time_ratio = results[deep]['forward_time_ms'] / results[shallow]['forward_time_ms'] - backward_time_ratio = results[deep]['backward_time_ms'] / results[shallow]['backward_time_ms'] - memory_ratio = results[deep]['total_memory_mb'] / results[shallow]['total_memory_mb'] - - analysis['scaling_metrics'] = { - 'depth_ratio': depth_ratio, - 'forward_time_scaling': forward_time_ratio, - 'backward_time_scaling': backward_time_ratio, - 'memory_scaling': memory_ratio, - 'theoretical_linear': depth_ratio # Expected linear scaling - } - - # Identify bottlenecks - if backward_time_ratio > forward_time_ratio * 1.5: - analysis['primary_bottleneck'] = 'backward_pass' - analysis['bottleneck_reason'] = 'Gradient 
computation scaling worse than forward pass' - elif memory_ratio > depth_ratio * 1.5: - analysis['primary_bottleneck'] = 'memory' - analysis['bottleneck_reason'] = 'Memory usage scaling faster than linear' - else: - analysis['primary_bottleneck'] = 'balanced' - analysis['bottleneck_reason'] = 'Forward and backward passes scaling proportionally' - - # Backward/Forward ratio analysis - backward_forward_ratios = [ - results[d]['backward_time_ms'] / max(results[d]['forward_time_ms'], 0.001) - for d in depths - ] - avg_backward_forward_ratio = sum(backward_forward_ratios) / len(backward_forward_ratios) - - analysis['efficiency_metrics'] = { - 'avg_backward_forward_ratio': avg_backward_forward_ratio, - 'peak_memory_mb': max(memory_usage), - 'memory_efficiency_trend': 'increasing' if memory_usage[-1] > memory_usage[0] * 2 else 'stable' - } - - return analysis - - def _generate_graph_optimizations(self, results): - """Generate computational graph optimization strategies.""" - strategies = [] - - # Analyze memory growth patterns - peak_memory = max(result['total_memory_mb'] for result in results.values()) - - if peak_memory > 50: # > 50MB memory usage - strategies.append("💾 High memory usage detected in computational graph") - strategies.append("🔧 Strategy: Gradient checkpointing for deep graphs") - strategies.append("🔧 Strategy: In-place operations where mathematically valid") - - # Analyze computational efficiency - graph_analysis = self.graph_analysis - if graph_analysis and 'scaling_metrics' in graph_analysis: - backward_scaling = graph_analysis['scaling_metrics']['backward_time_scaling'] - if backward_scaling > 2.0: - strategies.append("🐌 Backward pass scaling poorly with graph depth") - strategies.append("🔧 Strategy: Kernel fusion for backward operations") - strategies.append("🔧 Strategy: Parallel gradient computation") - - # Memory vs computation trade-offs - if graph_analysis and 'efficiency_metrics' in graph_analysis: - backward_forward_ratio = 
graph_analysis['efficiency_metrics']['avg_backward_forward_ratio'] - if backward_forward_ratio > 3.0: - strategies.append("⚖️ Backward pass significantly slower than forward") - strategies.append("🔧 Strategy: Optimize gradient computation with sparse gradients") - strategies.append("🔧 Strategy: Use mixed precision to reduce memory bandwidth") - - # Production optimization recommendations - strategies.append("🏭 Production graph optimizations:") - strategies.append(" • Graph compilation and optimization (TorchScript, XLA)") - strategies.append(" • Operator fusion to minimize intermediate allocations") - strategies.append(" • Dynamic shape optimization for variable input sizes") - strategies.append(" • Gradient accumulation for large effective batch sizes") - - return strategies - - def analyze_memory_checkpointing_trade_offs(self, checkpoint_frequencies=[1, 2, 4, 8]): - """ - Analyze memory vs computation trade-offs with gradient checkpointing. - - This function is PROVIDED to demonstrate checkpointing analysis. - Students use it to understand memory optimization strategies. 
- """ - print("MAGNIFY GRADIENT CHECKPOINTING ANALYSIS") - print("=" * 45) - - base_graph_depth = 12 - base_memory_per_layer = 10 # MB per layer - base_computation_time = 5 # ms per layer - - checkpointing_results = [] - - for freq in checkpoint_frequencies: - # Calculate memory savings - # Without checkpointing: store all intermediate activations - no_checkpoint_memory = base_graph_depth * base_memory_per_layer - - # With checkpointing: only store every freq-th activation - checkpointed_memory = max(base_memory_per_layer, (base_graph_depth // freq + 1) * base_memory_per_layer) - memory_savings = no_checkpoint_memory - checkpointed_memory - memory_reduction_pct = (memory_savings / no_checkpoint_memory) * 100 - - # Calculate recomputation overhead - # Need to recompute (freq-1) layers for each checkpoint - recomputation_layers = base_graph_depth * (freq - 1) / freq - recomputation_time = recomputation_layers * base_computation_time - - # Total training time = forward + backward + recomputation - base_training_time = base_graph_depth * base_computation_time * 2 # forward + backward - total_training_time = base_training_time + recomputation_time - time_overhead_pct = (recomputation_time / base_training_time) * 100 - - result = { - 'checkpoint_frequency': freq, - 'memory_mb': checkpointed_memory, - 'memory_reduction_pct': memory_reduction_pct, - 'recomputation_time_ms': recomputation_time, - 'time_overhead_pct': time_overhead_pct, - 'memory_time_ratio': memory_reduction_pct / max(time_overhead_pct, 1) - } - checkpointing_results.append(result) - - print(f" Checkpoint every {freq} layers:") - print(f" Memory: {checkpointed_memory:.0f}MB ({memory_reduction_pct:.1f}% reduction)") - print(f" Time overhead: {time_overhead_pct:.1f}%") - print(f" Efficiency ratio: {result['memory_time_ratio']:.2f}") - - # Find optimal trade-off - optimal = max(checkpointing_results, key=lambda x: x['memory_time_ratio']) - - print(f"\nPROGRESS Checkpointing Analysis:") - print(f" Optimal 
frequency: Every {optimal['checkpoint_frequency']} layers") - print(f" Best trade-off: {optimal['memory_reduction_pct']:.1f}% memory reduction") - print(f" Cost: {optimal['time_overhead_pct']:.1f}% time overhead") - - return checkpointing_results - - def demonstrate_mixed_precision_benefits(self, precisions=['fp32', 'fp16', 'mixed']): - """ - Demonstrate memory and performance benefits of mixed precision training. - - This function is PROVIDED to show mixed precision analysis. - Students explore precision trade-offs in autograd systems. - """ - print("MAGNIFY MIXED PRECISION TRAINING ANALYSIS") - print("=" * 45) - - model_size_mb = 100 # Example 100MB model - batch_size = 32 - sequence_length = 512 - - precision_results = [] - - for precision in precisions: - if precision == 'fp32': - bytes_per_param = 4 - gradient_memory_multiplier = 1.0 - compute_efficiency = 1.0 - numerical_stability = 1.0 - elif precision == 'fp16': - bytes_per_param = 2 - gradient_memory_multiplier = 0.5 - compute_efficiency = 1.5 # Faster on modern GPUs - numerical_stability = 0.8 # Some precision loss - else: # mixed precision - bytes_per_param = 2.5 # Weighted average - gradient_memory_multiplier = 0.7 # Some operations in fp32 - compute_efficiency = 1.3 # Good balance - numerical_stability = 0.95 # Maintained for critical ops - - # Calculate memory requirements - model_memory = model_size_mb * (bytes_per_param / 4) # Relative to fp32 - gradient_memory = model_memory * gradient_memory_multiplier - activation_memory = batch_size * sequence_length * 768 * (bytes_per_param / 4) / 1024 / 1024 - total_memory = model_memory + gradient_memory + activation_memory - - # Calculate performance metrics - relative_speed = compute_efficiency - memory_efficiency = model_size_mb * 3 / total_memory # vs fp32 baseline - - result = { - 'precision': precision, - 'model_memory_mb': model_memory, - 'gradient_memory_mb': gradient_memory, - 'activation_memory_mb': activation_memory, - 'total_memory_mb': 
total_memory, - 'relative_speed': relative_speed, - 'memory_efficiency': memory_efficiency, - 'numerical_stability': numerical_stability, - 'memory_savings_pct': (1 - total_memory / (model_size_mb * 3)) * 100 - } - precision_results.append(result) - - print(f" {precision.upper()} precision:") - print(f" Total memory: {total_memory:.1f}MB") - print(f" Memory savings: {result['memory_savings_pct']:.1f}%") - print(f" Relative speed: {relative_speed:.1f}x") - print(f" Numerical stability: {numerical_stability:.2f}") - - # Find optimal configuration - def score_precision(result): - return result['memory_efficiency'] * result['relative_speed'] * result['numerical_stability'] - - optimal = max(precision_results, key=score_precision) - - print(f"\nPROGRESS Mixed Precision Analysis:") - print(f" Optimal configuration: {optimal['precision'].upper()}") - print(f" Memory savings: {optimal['memory_savings_pct']:.1f}%") - print(f" Performance gain: {optimal['relative_speed']:.1f}x") - print(f" Stability score: {optimal['numerical_stability']:.2f}") - - print(f"\n🏭 Production Implementation:") - print(f" • Loss scaling prevents gradient underflow") - print(f" • Critical operations (loss, norm) stay in FP32") - print(f" • Automatic overflow detection and recovery") - print(f" • 30-50% memory reduction typical in large models") - - return precision_results - # %% [markdown] """ -### TEST Unit Test: Autograd Systems Profiling - -This test validates our autograd systems profiler with realistic computational graph scenarios. 
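The matmul gradient rule used in the solution above (∂L/∂a = g @ bᵀ, ∂L/∂b = aᵀ @ g) can also be sanity-checked numerically. The sketch below is NumPy-only and independent of the `Variable` class; `numeric_grad` is a helper defined here for illustration, not part of the module.

```python
import numpy as np

def numeric_grad(f, x, eps=1e-6):
    """Central finite differences of a scalar-valued f with respect to array x."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        x[idx] += eps
        up = f()
        x[idx] -= 2 * eps
        down = f()
        x[idx] += eps  # restore the original value
        grad[idx] = (up - down) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
a = rng.standard_normal((3, 4))
b = rng.standard_normal((4, 2))
loss = lambda: (a @ b).sum()  # upstream gradient is all ones: g = ones((3, 2))

g = np.ones((3, 2))
assert np.allclose(g @ b.T, numeric_grad(loss, a), atol=1e-4)  # ∂L/∂a = g @ b.T
assert np.allclose(a.T @ g, numeric_grad(loss, b), atol=1e-4)  # ∂L/∂b = a.T @ g
print("matmul gradient rules match finite differences")
```

The same helper can be pointed at any operation implemented in this module; gradient checking like this is how frameworks validate hand-written backward functions.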
+### 🧪 Unit Test: Additional Operations +This test validates subtraction and matrix multiplication """ -# %% nbgrader={"grade": false, "grade_id": "test-autograd-profiler", "locked": false, "schema_version": 3, "solution": false, "task": false} -def test_autograd_systems_profiler(): - """Test autograd systems profiler with comprehensive scenarios.""" - print("🔬 Unit Test: Autograd Systems Profiler...") - - profiler = AutogradSystemsProfiler() - - # Test computational graph depth analysis - try: - graph_analysis = profiler.profile_computational_graph_depth(max_depth=5, operations_per_level=3) +# %% +def test_unit_additional_operations(): + """Test subtraction and matrix multiplication.""" + print("🔬 Unit Test: Additional Operations...") - # Verify analysis structure - assert 'detailed_results' in graph_analysis, "Should provide detailed results" - assert 'graph_analysis' in graph_analysis, "Should provide graph analysis" - assert 'optimization_strategies' in graph_analysis, "Should provide optimization strategies" - assert 'memory_management_analysis' in graph_analysis, "Should provide memory management analysis" - assert 'graph_fusion_opportunities' in graph_analysis, "Should provide graph fusion opportunities" - - # Verify detailed results - results = graph_analysis['detailed_results'] - assert len(results) == 5, "Should test all graph depths" - - for depth, result in results.items(): - assert 'forward_time_ms' in result, f"Should include forward timing for depth {depth}" - assert 'backward_time_ms' in result, f"Should include backward timing for depth {depth}" - assert 'total_memory_mb' in result, f"Should analyze memory for depth {depth}" - assert result['forward_time_ms'] >= 0, f"Forward time should be non-negative for depth {depth}" - assert result['backward_time_ms'] >= 0, f"Backward time should be non-negative for depth {depth}" - - print("PASS Computational graph depth analysis test passed") - - # Test memory checkpointing analysis - checkpointing_analysis 
= profiler.analyze_memory_checkpointing_trade_offs(checkpoint_frequencies=[1, 2, 4]) + # Test subtraction + x = Variable([5.0], requires_grad=True) + y = Variable([2.0], requires_grad=True) + z = subtract(x, y) - assert isinstance(checkpointing_analysis, list), "Should return checkpointing analysis results" - assert len(checkpointing_analysis) == 3, "Should analyze all checkpoint frequencies" + assert np.allclose(z.data.data, [3.0]), f"Subtraction failed: expected [3.0], got {z.data.data}" - for result in checkpointing_analysis: - assert 'checkpoint_frequency' in result, "Should include checkpoint frequency" - assert 'memory_reduction_pct' in result, "Should calculate memory reduction" - assert 'time_overhead_pct' in result, "Should calculate time overhead" - assert result['memory_reduction_pct'] >= 0, "Memory reduction should be non-negative" + z.backward() + assert np.allclose(x.grad, [1.0]), f"Subtraction gradient failed: expected x.grad=[1.0], got {x.grad}" + assert np.allclose(y.grad, [-1.0]), f"Subtraction gradient failed: expected y.grad=[-1.0], got {y.grad}" - print("PASS Memory checkpointing analysis test passed") + # Test matrix multiplication + a = Variable([[1.0, 2.0]], requires_grad=True) + b = Variable([[3.0], [4.0]], requires_grad=True) + c = matmul(a, b) - # Test mixed precision analysis - mixed_precision_analysis = profiler.demonstrate_mixed_precision_benefits() + assert np.allclose(c.data.data, [[11.0]]), f"Matrix multiplication failed: expected [[11.0]], got {c.data.data}" - assert isinstance(mixed_precision_analysis, list), "Should return mixed precision results" - assert len(mixed_precision_analysis) >= 2, "Should test multiple precision modes" + c.backward() + assert np.allclose(a.grad, [[3.0, 4.0]]), f"Matmul gradient failed for a: expected [[3.0, 4.0]], got {a.grad}" + assert np.allclose(b.grad, [[1.0], [2.0]]), f"Matmul gradient failed for b: expected [[1.0], [2.0]], got {b.grad}" - for result in mixed_precision_analysis: - assert 
'precision' in result, "Should include precision mode" - assert 'memory_savings_pct' in result, "Should calculate memory savings" - assert 'relative_speed' in result, "Should include performance metrics" + print("✅ Additional operations work correctly!") - print("PASS Mixed precision analysis test passed") - - except Exception as e: - print(f"WARNING️ Autograd profiling test had issues: {e}") - print("PASS Basic structure test passed (graceful degradation)") - - print("TARGET Autograd Systems Profiler: All tests passed!") +test_unit_additional_operations() -# Test will run in main block +# %% [markdown] +""" +## Implementation: Chain Rule Through Complex Expressions -# %% nbgrader={"grade": false, "grade_id": "test-gradient-clipping", "locked": false, "schema_version": 3, "solution": false, "task": false} -def test_unit_gradient_clipping(): - """Test gradient clipping functionality.""" - print("🔬 Unit Test: Gradient Clipping...") +🧠 **Core Concept**: Multiple operations automatically chain gradients together +⚡ **Performance**: Each operation contributes O(1) overhead for gradient computation +🔗 **Connections**: This enables training deep neural networks with many layers - # Set seed for deterministic test - np.random.seed(42) +### Example: Complex Expression - # Create variables with large gradients - w1 = Variable(np.random.randn(5, 5), requires_grad=True) - w2 = Variable(np.random.randn(5, 3), requires_grad=True) +Consider: f(x, y) = (x + y) × (x - y) = x² - y² - # Set seed again for gradient generation to ensure deterministic gradients - np.random.seed(42) - # Simulate large gradients - w1.grad = Variable(np.random.randn(5, 5) * 10) # Large gradients - w2.grad = Variable(np.random.randn(5, 3) * 15) # Even larger gradients +The autograd system automatically: +1. Tracks each intermediate operation +2. Applies chain rule backwards through the computation graph +3. 
Accumulates gradients at each variable - # Test gradient clipping - original_norm1 = np.sqrt(np.sum(w1.grad.numpy() ** 2)) - original_norm2 = np.sqrt(np.sum(w2.grad.numpy() ** 2)) - total_original_norm = np.sqrt(original_norm1**2 + original_norm2**2) +Expected gradients: +- ∂f/∂x = 2x (derivative of x² - y²) +- ∂f/∂y = -2y (derivative of x² - y²) +""" - max_norm = 2.0 - computed_norm = clip_gradients([w1, w2], max_norm=max_norm) +# %% [markdown] +""" +### 🧪 Unit Test: Chain Rule Application +This test validates complex expressions with multiple operations +""" - # Verify gradient norm was computed correctly - norm_diff = abs(computed_norm - total_original_norm) - assert norm_diff < 1e-5, f"Gradient norm mismatch: computed={computed_norm:.8f}, expected={total_original_norm:.8f}, diff={norm_diff:.8f}" +# %% +def test_unit_chain_rule(): + """Test chain rule through complex expressions.""" + print("🔬 Unit Test: Chain Rule Application...") - # Check that gradients were clipped if necessary - if total_original_norm > max_norm: - new_norm1 = np.sqrt(np.sum(w1.grad.numpy() ** 2)) - new_norm2 = np.sqrt(np.sum(w2.grad.numpy() ** 2)) - new_total_norm = np.sqrt(new_norm1**2 + new_norm2**2) + # Test complex expression: (x + y) * (x - y) = x² - y² + x = Variable([3.0], requires_grad=True) + y = Variable([2.0], requires_grad=True) - assert abs(new_total_norm - max_norm) < 1e-6, f"Clipped norm should be {max_norm}, got {new_total_norm}" - print(f"PASS Gradients clipped from {total_original_norm:.3f} to {new_total_norm:.3f}") - else: - print(f"PASS Gradients within limit ({total_original_norm:.3f} <= {max_norm})") + # Build computation graph + sum_term = add(x, y) # x + y = 5 + diff_term = subtract(x, y) # x - y = 1 + result = multiply(sum_term, diff_term) # (x+y)*(x-y) = 5*1 = 5 - print("PASS Gradient clipping tests passed!") + # Verify forward pass + expected_result = 3.0**2 - 2.0**2 # x² - y² = 9 - 4 = 5 + assert np.allclose(result.data.data, [expected_result]), f"Expected 
[{expected_result}], got {result.data.data}" -def test_unit_mixed_precision(): - """Test mixed precision gradient handling.""" - print("🔬 Unit Test: Mixed Precision...") + # Test backward pass + result.backward() - # Create variables for mixed precision test - w1 = Variable(np.random.randn(3, 3), requires_grad=True) - w2 = Variable(np.random.randn(3, 2), requires_grad=True) + # Expected gradients: ∂(x²-y²)/∂x = 2x = 6, ∂(x²-y²)/∂y = -2y = -4 + expected_x_grad = 2 * 3.0 # 6.0 + expected_y_grad = -2 * 2.0 # -4.0 - # Test normal gradients (no overflow) - w1.grad = Variable(np.random.randn(3, 3) * 0.01) # Normal gradients - w2.grad = Variable(np.random.randn(3, 2) * 0.01) + assert np.allclose(x.grad, [expected_x_grad]), f"Expected x.grad=[{expected_x_grad}], got {x.grad}" + assert np.allclose(y.grad, [expected_y_grad]), f"Expected y.grad=[{expected_y_grad}], got {y.grad}" - success = enable_mixed_precision_gradients([w1, w2], loss_scale=128.0) - assert success == True, "Should handle normal gradients successfully" + # Test another complex expression: x * y + x * y (should equal 2*x*y) + a = Variable([2.0], requires_grad=True) + b = Variable([3.0], requires_grad=True) - # Test overflow gradients - w1.grad = Variable(np.array([[np.inf, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])) # Overflow - w2.grad = Variable(np.random.randn(3, 2) * 0.01) + term1 = multiply(a, b) + term2 = multiply(a, b) + sum_result = add(term1, term2) - success = enable_mixed_precision_gradients([w1, w2], loss_scale=128.0) - assert success == False, "Should detect overflow and return False" - assert w1.grad is None, "Should zero gradients on overflow" + sum_result.backward() - print("PASS Mixed precision tests passed!") + # Expected: ∂(2xy)/∂x = 2y = 6, ∂(2xy)/∂y = 2x = 4 + assert np.allclose(a.grad, [6.0]), f"Expected a.grad=[6.0], got {a.grad}" + assert np.allclose(b.grad, [4.0]), f"Expected b.grad=[4.0], got {b.grad}" + print("✅ Chain rule works correctly through complex expressions!") + 
+test_unit_chain_rule()
+
+# %% [markdown]
+"""
+## 🔍 Systems Analysis: Gradient Computation Behavior
+
+Now that your autograd implementation is complete and tested, let's analyze its behavior:
+
+**Analysis Focus**: Understand memory usage and computational patterns in automatic differentiation
+"""
+
+# %%
+def analyze_gradient_computation():
+    """
+    📊 SYSTEMS MEASUREMENT: Gradient Computation Analysis
+
+    Measure how autograd scales with expression complexity and input size.
+    """
+    print("📊 AUTOGRAD SYSTEMS ANALYSIS")
+    print("Testing gradient computation patterns...")
+
+    import time
+
+    # Test 1: Expression complexity scaling
+    print("\n🔍 Expression Complexity Analysis:")
+    x = Variable([2.0], requires_grad=True)
+    y = Variable([3.0], requires_grad=True)
+
+    expressions = [
+        ("Simple: x + y", lambda: add(x, y)),
+        ("Medium: x * y + x", lambda: add(multiply(x, y), x)),
+        ("Complex: (x + y) * (x - y)", lambda: multiply(add(x, y), subtract(x, y)))
+    ]
+
+    for name, expr_fn in expressions:
+        # Reset gradients
+        x.grad = None
+        y.grad = None
+
+        # Time forward + backward pass
+        start = time.perf_counter()
+        result = expr_fn()
+        result.backward()
+        elapsed = time.perf_counter() - start
+
+        print(f" {name}: {elapsed*1000:.3f}ms")
+
+    # Test 2: Memory usage pattern (skipped gracefully if psutil is not installed)
+    print("\n💾 Memory Usage Analysis:")
+    try:
+        import psutil
+        import os
+
+        def get_memory_mb():
+            process = psutil.Process(os.getpid())
+            return process.memory_info().rss / 1024 / 1024
+
+        baseline = get_memory_mb()
+
+        # Create computation graph with many variables
+        variables = []
+        for i in range(100):
+            var = Variable([float(i)], requires_grad=True)
+            variables.append(var)
+
+        # Chain operations
+        result = variables[0]
+        for var in variables[1:]:
+            result = add(result, var)
+
+        memory_after_forward = get_memory_mb()
+
+        # Backward pass
+        result.backward()
+        memory_after_backward = get_memory_mb()
+
+        print(f" Baseline memory: {baseline:.1f}MB")
+        print(f" After forward pass: {memory_after_forward:.1f}MB (+{memory_after_forward-baseline:.1f}MB)")
+        print(f" After backward pass: {memory_after_backward:.1f}MB (+{memory_after_backward-baseline:.1f}MB)")
+    except ImportError:
+        print(" psutil not available - skipping memory measurement")
+
+    # Test 3: Gradient accumulation
+    print("\n🔄 Gradient Accumulation Test:")
+    z = Variable([1.0], requires_grad=True)
+
+    # Multiple backward passes should accumulate gradients.
+    # Constants are wrapped as non-gradient Variables so multiply() always
+    # receives a uniform Variable interface.
+    loss1 = multiply(z, Variable([2.0], requires_grad=False))
+    loss1.backward()
+    first_grad = z.grad.copy()
+
+    loss2 = multiply(z, Variable([3.0], requires_grad=False))
+    loss2.backward()  # Should accumulate with previous gradient
+
+    print(f" First backward: grad = {first_grad}")
+    print(f" After second backward: grad = {z.grad}")
+    print(f" Expected accumulation: {first_grad + 3.0}")
+
+    print("\n💡 AUTOGRAD INSIGHTS:")
+    print(" • Forward pass builds computation graph in memory")
+    print(" • Backward pass traverses graph and accumulates gradients")
+    print(" • Memory scales with graph depth, not just data size")
+    print(" • This is why PyTorch uses gradient checkpointing for deep networks!")
+
+analyze_gradient_computation()
+
+# %% [markdown]
+"""
+## Integration: Complete Module Testing
+
+🧪 **Testing Strategy**: Comprehensive validation of all autograd functionality
+✅ **Quality Assurance**: Ensure all components work together correctly
+🚀 **Ready for Training**: Verify autograd enables neural network optimization
+"""
+
+# %%
+def test_module():
+    """Comprehensive test of autograd module functionality."""
+    print("🧪 COMPREHENSIVE MODULE TEST")
+    print("Running complete autograd validation...")
+
+    # Test 1: Variable creation and basic properties
+    print("\n1️⃣ Testing Variable creation...")
+    x = Variable([1.0, 2.0], requires_grad=True)
+    assert isinstance(x.data, Tensor)
+    assert x.requires_grad is True
+    assert x.grad is None
+    print(" ✅ Variable creation works")
+
+    # Test 2: All arithmetic operations
+    print("\n2️⃣ Testing arithmetic operations...")
+    a = Variable([2.0], requires_grad=True)
+    b = Variable([3.0], requires_grad=True)
+
+    # Test each operation
+    add_result = add(a, b)
+    assert np.allclose(add_result.data.data, [5.0])
+
+    mul_result = multiply(a, b)
+    assert np.allclose(mul_result.data.data, [6.0])
+
+    sub_result = subtract(a, b)
+    assert np.allclose(sub_result.data.data, [-1.0])
+    print(" ✅ All arithmetic operations work")
+
+    # Test 3: Gradient computation
+    print("\n3️⃣ Testing gradient computation...")
+    x = Variable([3.0], requires_grad=True)
+    y = Variable([4.0], requires_grad=True)
+    z = multiply(x, y)  # z = 12
+    z.backward()
+
+    assert np.allclose(x.grad, [4.0]), f"Expected x.grad=[4.0], got {x.grad}"
+    assert np.allclose(y.grad, [3.0]), f"Expected y.grad=[3.0], got {y.grad}"
+    print(" ✅ Gradient computation works")
+
+    # Test 4: Complex expressions
+    print("\n4️⃣ Testing complex expressions...")
+    p = Variable([2.0], requires_grad=True)
+    q = Variable([3.0], requires_grad=True)
+
+    # (p + q) * (p - q) = p² - q²
+    expr = multiply(add(p, q), subtract(p, q))
+    expr.backward()
+
+    # Expected: ∂(p²-q²)/∂p = 2p = 4, ∂(p²-q²)/∂q = -2q = -6
+    assert np.allclose(p.grad, [4.0]), f"Expected p.grad=[4.0], got {p.grad}"
+    assert np.allclose(q.grad, [-6.0]), f"Expected q.grad=[-6.0], got {q.grad}"
+    print(" ✅ Complex expressions work")
+
+    # Test 5: Matrix operations
+    print("\n5️⃣ Testing matrix operations...")
+    A = Variable([[1.0, 2.0]], requires_grad=True)
+    B = Variable([[3.0], [4.0]], requires_grad=True)
+    C = matmul(A, B)
+
+    assert np.allclose(C.data.data, [[11.0]])
+    C.backward()
+    assert np.allclose(A.grad, [[3.0, 4.0]])
+    assert np.allclose(B.grad, [[1.0], [2.0]])
+    print(" ✅ Matrix operations work")
+
+    # Test 6: Mixed operations
+    print("\n6️⃣ Testing mixed operations...")
+    u = Variable([1.0], requires_grad=True)
+    v = Variable([2.0], requires_grad=True)
+
+    # Neural network-like computation: u * v + u
+    hidden = multiply(u, v)  # u * v
+    output = add(hidden, u)  # + u
+    output.backward()
+
+    # Expected: ∂(u*v + u)/∂u = v + 1 = 3, ∂(u*v + u)/∂v = u = 1
+    assert np.allclose(u.grad, [3.0]), f"Expected u.grad=[3.0], got {u.grad}"
+    assert np.allclose(v.grad, [1.0]), f"Expected v.grad=[1.0], got {v.grad}"
+    print(" ✅ Mixed operations work")
+
+    print("\n🎉 ALL TESTS PASSED!")
+    print("🚀 Autograd module is ready for neural network training!")
+    print("🔗 Next: Use these gradients in optimizers to update parameters")
+
+# %%
if __name__ == "__main__":
-    print("\nTEST Running Autograd Module Tests...")
-
-    # Run all unit tests
-    test_unit_variable_class()
-    test_unit_add_operation()
-    test_unit_multiply_operation()
-    test_unit_subtract_operation()
-    test_unit_chain_rule()
-    test_module_neural_network_training()
-    test_unit_gradient_clipping()
-    test_unit_mixed_precision()
-    test_autograd_systems_profiler()
-
-    print("\nPASS All Autograd Module Tests Completed!")
-    print("Autograd module complete!")
+    test_module()

# %% [markdown]
"""
-## THINK ML Systems Thinking: Interactive Questions
+## 🤔 ML Systems Thinking: Interactive Questions

-Now that you've built automatic differentiation capabilities that enable neural network training, let's connect this foundational work to broader ML systems challenges. These questions help you think critically about how computational graphs scale to production training environments.
+### Question 1: Memory Management in Computational Graphs

-Take time to reflect thoughtfully on each question - your insights will help you understand how the automatic differentiation concepts you've implemented connect to real-world ML systems engineering.
+Consider the expression `z = (x + y) * (x - y)` where x and y have `requires_grad=True`.
+
+**Analysis Task**: Your autograd implementation stores intermediate results during the forward pass and uses them during the backward pass. In a deep neural network with 100 layers, each layer creating intermediate variables, what memory challenges would emerge?
+
+**Specific Questions**:
+- How does memory usage scale with network depth in your current implementation?
+- What strategies could reduce memory usage during gradient computation?
+- Why do production frameworks like PyTorch implement "gradient checkpointing"? + +**Implementation Connection**: Examine how your `grad_fn` closures capture references to input variables and consider the memory implications. """ -# %% [markdown] +# %% nbgrader={"grade": true, "grade_id": "memory-analysis", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} """ -### Question 1: Gradient Clipping and Training Stability +TODO: Analyze memory usage patterns in your autograd implementation. -**Context**: Your Variable implementation computes gradients that can sometimes explode during training. When you tested complex expressions and chain rule operations, you saw how gradients accumulate through multiple operations. In production training, gradient explosion can cause numerical instability and training divergence. +Consider how your Variable class stores references to other variables through grad_fn, +and how this affects memory usage in deep networks. -**Reflection Question**: Analyze how gradient clipping integration would enhance your autograd system's stability for training deep networks. How would you modify your Variable.backward() method and gradient accumulation to incorporate dynamic gradient clipping that adapts to training dynamics? Design clipping strategies that prevent gradient explosion while preserving gradient information necessary for effective learning. - -Think about: adaptive clipping thresholds, per-layer vs global clipping strategies, gradient norm monitoring, and integration with your chain rule implementation. - -*Target length: 150-300 words* +Discuss specific memory optimization strategies you could implement. 
""" - -# %% nbgrader={"grade": true, "grade_id": "question-1-computational-graphs", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} -""" -YOUR REFLECTION ON COMPUTATIONAL GRAPHS AND MEMORY MANAGEMENT: - -TODO: Replace this text with your thoughtful response about memory-efficient automatic differentiation system design. - -Consider addressing: -- How would you implement gradient checkpointing to optimize memory usage in large models? -- What strategies would you use to balance memory consumption with computational efficiency? -- How would you design graph compilation that maintains flexibility while enabling optimization? -- What role would distributed gradient computation play in your system design? -- How would you handle memory constraints while preserving numerical precision? - -Write a technical analysis connecting your autograd implementations to real memory management challenges. - -GRADING RUBRIC (Instructor Use): -- Demonstrates understanding of computational graph memory management (3 points) -- Addresses gradient checkpointing and memory optimization strategies (3 points) -- Shows practical knowledge of graph compilation and optimization techniques (2 points) -- Demonstrates systems thinking about memory vs compute trade-offs (2 points) -- Clear technical reasoning and practical considerations (bonus points for innovative approaches) -""" - ### BEGIN SOLUTION -# Student response area - instructor will replace this section during grading setup -# This is a manually graded question requiring technical analysis of computational graph optimization -# Students should demonstrate understanding of memory management and gradient computation efficiency +# Memory analysis for autograd implementation: + +# 1. 
Memory scaling with network depth: +# - Each Variable stores references to inputs through grad_fn closure +# - In deep networks: O(depth) memory growth for intermediate activations +# - Gradient computation requires keeping forward activations in memory +# - 100-layer network = 100x intermediate variables + their grad_fn closures + +# 2. Memory optimization strategies: +# - Gradient checkpointing: Only store subset of activations, recompute others +# - In-place operations where mathematically valid +# - Clear computation graph after backward pass +# - Use smaller data types (float16 vs float32) where precision allows + +# 3. Production framework solutions: +# - PyTorch's gradient checkpointing trades compute for memory +# - Automatic memory management with garbage collection +# - Graph optimization to reduce intermediate storage +# - Dynamic graph construction vs static graph optimization + +# Current implementation improvement: +# Add method to clear computation graph: variable.detach() or graph.clear() ### END SOLUTION # %% [markdown] """ -### Question 2: Memory Management Optimization in Computational Graphs +### Question 2: Gradient Accumulation and Training Efficiency -**Context**: Your Variable implementation stores gradients and computation history, leading to memory accumulation as graph depth increases. In your autograd profiler analysis, you discovered memory scaling patterns with computational graph complexity. Production systems must balance memory efficiency with gradient computation accuracy. +In your autograd implementation, gradients accumulate when `backward()` is called multiple times without zeroing gradients. -**Reflection Question**: Design memory management optimizations for your autograd system that handle dynamic vs static graph trade-offs. How would you modify your Variable class and computational graph construction to support gradient checkpointing, memory pooling, and operator fusion while maintaining the flexibility of dynamic graphs? 
Analyze the memory-compute trade-offs in your approach. +**Analysis Task**: Design a training loop that uses gradient accumulation to simulate larger batch sizes with limited memory. -Think about: dynamic memory allocation strategies, gradient checkpointing integration, static graph compilation opportunities, and memory pooling techniques in your implementation. +**Specific Questions**: +- How would you modify the Variable class to support gradient zeroing? +- What are the trade-offs between large batches vs. gradient accumulation? +- How does gradient accumulation affect convergence in neural network training? -*Target length: 150-300 words* +**Implementation Connection**: Consider how your `backward()` method accumulates gradients and design a complete training interface. """ -# %% nbgrader={"grade": true, "grade_id": "question-2-distributed-training", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} +# %% nbgrader={"grade": true, "grade_id": "gradient-accumulation", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} """ -YOUR REFLECTION ON DISTRIBUTED TRAINING AND GRADIENT SYNCHRONIZATION: +TODO: Design gradient accumulation strategy for your autograd system. -TODO: Replace this text with your thoughtful response about distributed automatic differentiation system design. - -Consider addressing: -- How would you design gradient synchronization for efficient distributed training? -- What strategies would you use to minimize communication overhead in multi-GPU training? -- How would you implement gradient compression and optimization for distributed systems? -- What role would asynchronous vs synchronous training play in your design? -- How would you ensure numerical stability and convergence in distributed settings? - -Write an architectural analysis connecting your autograd implementation to real distributed training challenges. 
-
-GRADING RUBRIC (Instructor Use):
-- Shows understanding of distributed training and gradient synchronization (3 points)
-- Designs practical approaches to communication optimization and scalability (3 points)
-- Addresses numerical stability and convergence in distributed settings (2 points)
-- Demonstrates systems thinking about distributed computation patterns (2 points)
-- Clear architectural reasoning with distributed systems insights (bonus points for comprehensive understanding)
+Extend your Variable class with gradient management methods and analyze
+the trade-offs between memory efficiency and training convergence.
"""
-
### BEGIN SOLUTION
-# Student response area - instructor will replace this section during grading setup
-# This is a manually graded question requiring understanding of distributed training systems
-# Students should demonstrate knowledge of gradient synchronization and communication optimization
+# Gradient accumulation design for training efficiency:
+
+# 1. Variable class extension needed:
+def zero_grad(self):
+    """Clear accumulated gradients (intended as a Variable method)."""
+    self.grad = None
+
+# Attach it so every Variable gains the method:
+Variable.zero_grad = zero_grad
+
+# 2. Training loop with gradient accumulation (sketch: `model`, `optimizer`,
+#    and `compute_loss` are assumed to exist with the usual interfaces):
+def training_step_with_accumulation(model, optimizer, data_loader, accumulation_steps=4):
+    """Simulate larger batches through gradient accumulation."""
+    for param in model.parameters():
+        param.zero_grad()
+
+    total_loss = 0
+    for i, batch in enumerate(data_loader):
+        loss = compute_loss(model(batch.x), batch.y)
+        loss.backward()  # Accumulate gradients
+        total_loss += loss.data
+
+        if (i + 1) % accumulation_steps == 0:
+            # Update parameters with accumulated gradients
+            optimizer.step()
+            # Clear gradients for next accumulation cycle
+            for param in model.parameters():
+                param.zero_grad()
+
+    return total_loss / len(data_loader)
+
+# 3. Trade-offs analysis:
+# Memory: Gradient accumulation uses constant memory vs. large batch linear growth
+# Convergence: Accumulated gradients approximate large batch behavior
+# Computation: Extra backward passes vs. single large batch forward/backward
+# Synchronization: In distributed training, less frequent communication
+
+# 4. Production considerations:
+# - Gradient scaling to prevent underflow with accumulated small gradients
+# - Learning rate adjustment for effective batch size
+# - Batch normalization statistics affected by actual vs effective batch size
### END SOLUTION

# %% [markdown]
"""
-### Question 3: Graph Optimization and Kernel Fusion
+### Question 3: Computational Graph Optimization

-**Context**: Your autograd implementation creates computational graphs with individual operations, but production systems optimize these graphs through operator fusion and kernel optimization. Your systems profiler identified fusion opportunities that could reduce memory bandwidth and improve cache efficiency.
+Your autograd implementation creates a new Variable for each operation, building a computation graph dynamically.

-**Reflection Question**: Design graph optimization strategies for your autograd system that enable operator fusion while preserving gradient computation correctness. How would you modify your operation implementations (add, multiply, etc.) to support fused execution and optimize memory access patterns? Analyze how kernel fusion affects both forward and backward pass performance in your system.
+**Analysis Task**: Analyze opportunities for optimizing the computational graph to reduce memory usage and improve performance.

-Think about: operation fusion patterns, memory access optimization, cache efficiency improvements, and maintaining gradient correctness in fused operations.
+**Specific Questions**:
+- Which operations could be fused together to reduce intermediate Variable storage?
+- How would in-place operations affect gradient computation safety? +- What graph optimization passes could be implemented before backward propagation? -*Target length: 150-300 words* +**Implementation Connection**: Examine your operation functions and identify where intermediate results could be eliminated or reused. """ -# %% nbgrader={"grade": true, "grade_id": "question-3-training-optimizations", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} +# %% nbgrader={"grade": true, "grade_id": "graph-optimization", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false} """ -YOUR REFLECTION ON ADVANCED TRAINING OPTIMIZATIONS: +TODO: Design graph optimization strategies for your autograd implementation. -TODO: Replace this text with your thoughtful response about advanced automatic differentiation system design. - -Consider addressing: -- How would you integrate automatic mixed precision training with gradient computation? -- What strategies would you use for gradient accumulation and large batch simulation? -- How would you design hardware integration for specialized accelerators like TPUs? -- What role would advanced optimizations play while maintaining research flexibility? -- How would you ensure numerical stability across different precision and hardware configurations? - -Write a design analysis connecting your autograd implementation to real training optimization challenges. 
- -GRADING RUBRIC (Instructor Use): -- Understands advanced training optimizations and mixed precision challenges (3 points) -- Designs practical approaches to gradient accumulation and hardware integration (3 points) -- Addresses numerical stability and research vs production trade-offs (2 points) -- Shows systems thinking about training optimization and system integration (2 points) -- Clear design reasoning with training optimization insights (bonus points for deep understanding) +Identify specific optimizations that could reduce memory usage and improve +performance while maintaining gradient correctness. """ - ### BEGIN SOLUTION -# Student response area - instructor will replace this section during grading setup -# This is a manually graded question requiring understanding of advanced training optimizations -# Students should demonstrate knowledge of mixed precision, gradient accumulation, and hardware integration +# Computational graph optimization strategies: + +# 1. Operation fusion opportunities: +# - Fuse: add + multiply → fused_add_mul (one intermediate variable) +# - Fuse: activation + linear → fused_linear_activation +# - Elementwise operations: add + relu + multiply can be single kernel +# Current: 3 Variables → Optimized: 1 Variable + +def fused_add_multiply(a, b, c): + """Fused operation: (a + b) * c - saves one intermediate Variable""" + # Direct computation without intermediate Variable + result_data = (a.data.data + b.data.data) * c.data.data + + def grad_fn(gradient): + if a.requires_grad: + a.backward(gradient * c.data.data) + if b.requires_grad: + b.backward(gradient * c.data.data) + if c.requires_grad: + c.backward(gradient * (a.data.data + b.data.data)) + + return Variable(result_data, requires_grad=any([a.requires_grad, b.requires_grad, c.requires_grad]), grad_fn=grad_fn) + +# 2. 
In-place operation safety: +# Safe: element-wise operations on leaf variables not used elsewhere +# Unsafe: in-place on intermediate variables used in multiple paths +# Solution: Track variable usage count before allowing in-place + +def safe_inplace_add(var, other): + """In-place addition if safe for gradient computation""" + if var.grad_fn is not None: + raise RuntimeError("Cannot do in-place operation on variable with grad_fn") + var.data.data += other.data.data + return var + +# 3. Graph optimization passes: +# - Dead code elimination: Remove unused intermediate variables +# - Common subexpression elimination: Reuse x*y if computed multiple times +# - Memory layout optimization: Arrange for cache-friendly access patterns + +class GraphOptimizer: + def optimize_memory_layout(self, variables): + """Optimize variable storage for cache efficiency""" + # Group related variables in contiguous memory + pass + + def eliminate_dead_variables(self, root_variable): + """Remove variables not needed for gradient computation""" + # Traverse backward from root, mark reachable variables + pass + + def fuse_operations(self, computation_sequence): + """Identify fusible operation sequences""" + # Pattern matching for common operation combinations + pass + +# 4. Production framework techniques: +# - TensorFlow's XLA: Ahead-of-time compilation with graph optimization +# - PyTorch's TorchScript: Graph optimization for inference +# - ONNX graph optimization passes: Constant folding, operator fusion +# - Memory planning: Pre-allocate memory for entire computation graph ### END SOLUTION # %% [markdown] """ -## TARGET MODULE SUMMARY: Automatic Differentiation +## 🎯 MODULE SUMMARY: Autograd - Automatic Differentiation Engine -Congratulations! You have successfully implemented automatic differentiation: +Congratulations! 
You've successfully implemented the automatic differentiation engine: -### What You have Accomplished -PASS **Computational Graphs**: Dynamic graph construction for gradient computation (Variable class with 200+ lines) -PASS **Backpropagation**: Efficient gradient computation through reverse mode AD (add, multiply, subtract operations) -PASS **Gradient Tracking**: Automatic gradient accumulation and management (chain rule implementation) -PASS **Training Stability**: Gradient clipping and mixed precision support for robust training -PASS **Memory Optimization**: Advanced profiling with checkpointing and fusion analysis -PASS **Integration**: Seamless compatibility with Tensor operations (neural network training capability) -PASS **Real Applications**: Neural network training and optimization (linear regression convergence test) +### What You've Accomplished +✅ **Variable Class Implementation**: Complete gradient tracking system with 200+ lines of core functionality +✅ **Arithmetic Operations**: Addition, multiplication, subtraction, and matrix operations with proper gradient flow +✅ **Chain Rule Application**: Automatic gradient computation through complex mathematical expressions +✅ **Memory Management**: Efficient gradient accumulation and computational graph construction +✅ **Systems Analysis**: Understanding of memory scaling and performance characteristics in gradient computation ### Key Learning Outcomes -- **Computational graphs**: How operations are tracked for gradient computation through dynamic graph construction -- **Backpropagation**: Reverse mode automatic differentiation with O(1) overhead per operation -- **Gradient accumulation**: How gradients flow through complex operations via chain rule -- **Training stability**: Gradient clipping techniques for preventing gradient explosion and training divergence -- **Memory optimization**: Advanced memory management with checkpointing, fusion analysis, and mixed precision support -- **Production features**: 
Real-world autograd optimizations used in frameworks like PyTorch and TensorFlow -- **Integration patterns**: How autograd works with neural networks for training +- **Automatic Differentiation**: How computational graphs enable efficient gradient computation +- **Chain Rule Implementation**: Mathematical foundation for backpropagation in neural networks +- **Memory Patterns**: How gradient computation affects memory usage in deep learning systems +- **Production Understanding**: Connection to PyTorch/TensorFlow autograd implementations ### Mathematical Foundations Mastered -- **Chain rule**: The mathematical foundation df/dx = df/dz · dz/dx for backpropagation -- **Computational graphs**: Representing operations as directed acyclic graphs with forward/backward passes -- **Gradient flow**: How gradients propagate through complex functions automatically -- **Memory efficiency**: O(N) gradient storage scaling with graph depth +- **Chain Rule**: Systematic application through computational graphs +- **Product Rule**: Gradient computation for multiplication operations +- **Computational Complexity**: O(1) gradient overhead per operation in forward pass +- **Memory Complexity**: O(graph_depth) storage requirements for intermediate activations ### Professional Skills Developed -- **Graph construction**: Building dynamic computational graphs with variable tracking -- **Gradient computation**: Implementing efficient backpropagation algorithms -- **Memory optimization**: Managing gradient storage with systems performance analysis -- **Integration testing**: Ensuring autograd works with neural network training pipelines +- **Gradient System Design**: Building automatic differentiation from scratch +- **Performance Analysis**: Understanding memory and computational trade-offs +- **Testing Methodology**: Comprehensive validation of gradient correctness ### Ready for Advanced Applications Your autograd implementation now enables: -- **Neural network training**: Complete 
training pipelines with automatic gradient computation -- **Optimization algorithms**: Gradient-based optimization methods with automatic differentiation -- **Custom loss functions**: Implementing specialized loss functions with gradient tracking -- **Advanced architectures**: Training complex neural network models with computational graph optimization +- **Neural Network Training**: Automatic gradient computation for parameter updates +- **Optimization Algorithms**: Foundation for SGD, Adam, and other optimizers +- **Deep Learning Research**: Understanding of how modern frameworks work internally ### Connection to Real ML Systems -Your implementations mirror production systems: -- **PyTorch**: `torch.autograd` with `torch.nn.utils.clip_grad_norm_()` for gradient clipping -- **TensorFlow**: `tf.GradientTape` with automatic mixed precision and graph optimization -- **JAX**: `jax.grad` with XLA compilation and operator fusion for performance -- **Industry Standard**: Gradient clipping, mixed precision, and memory optimization used in all major frameworks -- **Production Training**: GPT, BERT, and other large models rely on these exact stability and optimization techniques +Your implementation mirrors production systems: +- **PyTorch**: `torch.autograd.Variable` and automatic gradient computation +- **TensorFlow**: `tf.GradientTape` for automatic differentiation +- **Industry Standard**: Dynamic computational graphs used in most modern frameworks ### Next Steps -1. **Export your code**: `tito module complete 06_autograd` -2. **Test your implementation**: `tito test 06_autograd` -3. **Build training systems**: Combine with optimizers for complete training pipelines -4. **Move to Module 07**: Add optimization algorithms with your gradient engine! +1. **Export your module**: `tito module complete 05_autograd` +2. **Validate integration**: `tito test --module autograd` +3. **Ready for Module 06**: Optimizers will use your gradients to update neural network parameters! 
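Before wiring these gradients into an optimizer, one validation habit from this module is worth keeping: cross-check any analytic gradient against a finite-difference estimate. A minimal self-contained sketch in plain Python (independent of the `Variable` class built here; the function names are illustrative):

```python
def numerical_grad(f, x, eps=1e-6):
    """Central-difference estimate of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Same shape as the complex-expression test: f(x) = (x + 3) * (x - 3),
# whose analytic derivative is df/dx = 2x.
f = lambda x: (x + 3.0) * (x - 3.0)
x0 = 2.0
analytic = 2 * x0
numeric = numerical_grad(f, x0)
print(f"analytic={analytic}, numeric={numeric:.6f}")
assert abs(analytic - numeric) < 1e-4
```

The same check applies to any scalar expression your autograd differentiates; when the analytic and numerical values disagree beyond tolerance, the corresponding `grad_fn` is the first place to look.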
-**Ready for optimizers?** Your autograd system now provides the foundation for all modern neural network training through automatic gradient computation! +**🚀 Achievement Unlocked**: Your automatic differentiation engine is the foundation that makes modern neural network training possible! """ \ No newline at end of file diff --git a/modules/05_autograd/autograd_dev_old.py b/modules/05_autograd/autograd_dev_old.py deleted file mode 100644 index 89a4fff8..00000000 --- a/modules/05_autograd/autograd_dev_old.py +++ /dev/null @@ -1,1622 +0,0 @@ -# --- -# jupyter: -# jupytext: -# text_representation: -# extension: .py -# format_name: percent -# format_version: '1.3' -# jupytext_version: 1.17.1 -# --- - -# %% [markdown] -""" -# Autograd - Automatic Differentiation and Computational Graph Engine - -Welcome to the Autograd module! You'll implement the automatic differentiation engine that makes neural network training possible by automatically computing gradients through complex computational graphs. - -## Learning Goals -- Systems understanding: How computational graphs enable automatic differentiation and why this approach scales to arbitrary network architectures -- Core implementation skill: Build the Variable class with gradient tracking and implement backward propagation through dynamic computation graphs -- Pattern recognition: Understand how chain rule application through computational graphs generalizes to any differentiable function -- Framework connection: See how your implementation mirrors PyTorch's autograd engine and tensor gradient tracking -- Performance insight: Learn why computational graph memory management and gradient accumulation strategies determine training scalability - -## Build → Use → Reflect -1. **Build**: Complete automatic differentiation system with Variable class, gradient tracking, and backward propagation -2. **Use**: Apply autograd to complex mathematical expressions and neural network operations -3. 
**Reflect**: Why does automatic differentiation enable ML at scale, and how does graph memory management affect training? - -## What You'll Achieve -By the end of this module, you'll understand: -- Deep technical understanding of how computational graphs enable automatic gradient computation for arbitrary functions -- Practical capability to build the gradient computation engine that powers all modern neural network training -- Systems insight into why automatic differentiation was the breakthrough that enabled deep learning at scale -- Performance consideration of how computational graph size and memory management affect training efficiency -- Connection to production ML systems and how frameworks optimize gradient computation and memory usage - -## Systems Reality Check -💡 **Production Context**: PyTorch's autograd can handle graphs with millions of nodes and uses sophisticated memory optimization like gradient checkpointing to train models larger than GPU memory -⚡ **Performance Note**: Gradient computation often requires storing forward activations, leading to memory usage that scales with network depth - this drives innovations like gradient checkpointing -""" - -# %% nbgrader={"grade": false, "grade_id": "autograd-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} -#| default_exp core.autograd - -#| export -import numpy as np -import sys -from typing import Union, List, Tuple, Optional, Any, Callable -from collections import defaultdict - -# Import our existing components -try: - from tinytorch.core.tensor import Tensor -except ImportError: - # For development, import from local modules - import os - sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor')) - from tensor_dev import Tensor - -# %% nbgrader={"grade": false, "grade_id": "autograd-setup", "locked": false, "schema_version": 3, "solution": false, "task": false} -print("🔥 TinyTorch Autograd Module") -print(f"NumPy version: {np.__version__}") 
-print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}") -print("Ready to build automatic differentiation!") - -# %% [markdown] -""" -## 📦 Where This Code Lives in the Final Package - -**Learning Side:** You work in `modules/source/07_autograd/autograd_dev.py` -**Building Side:** Code exports to `tinytorch.core.autograd` - -```python -# Final package structure: -from tinytorch.core.autograd import Variable, backward # The gradient engine! -from tinytorch.core.tensor import Tensor -from tinytorch.core.activations import ReLU, Sigmoid, Tanh -``` - -**Why this matters:** -- **Learning:** Focused module for understanding gradients -- **Production:** Proper organization like PyTorch's `torch.autograd` -- **Consistency:** All gradient operations live together in `core.autograd` -- **Foundation:** Enables training for all neural networks -""" - -# %% [markdown] -""" -## What is Automatic Differentiation? - -### The Problem: Computing Gradients at Scale -Neural networks have millions of parameters. To train them, we need gradients of the loss function with respect to every parameter: - -``` -∇θ L = [∂L/∂w₁, ∂L/∂w₂, ..., ∂L/∂wₙ, ∂L/∂b₁, ∂L/∂b₂, ..., ∂L/∂bₘ] -``` - -**Manual differentiation fails** because: -- Networks have thousands of composed functions -- Manual computation is extremely error-prone -- Every architecture change requires re-deriving all gradients - -### The Solution: Automatic Differentiation -**Autograd** automatically computes derivatives of functions represented as computational graphs: - -```python -# Instead of manually computing: ∂(x² + 2xy + y²)/∂x = 2x + 2y -# Autograd does it automatically: -x = Variable(3.0, requires_grad=True) -y = Variable(4.0, requires_grad=True) -z = x**2 + 2*x*y + y**2 -z.backward() -print(x.grad) # 2*3 + 2*4 = 14 (computed automatically!) 
-``` - -### Why This is Revolutionary -- **Efficiency**: O(1) overhead per operation -- **Flexibility**: Works with any differentiable function -- **Correctness**: Implements chain rule precisely -- **Scale**: Handles millions of parameters automatically - -### Real-World Impact -- **PyTorch**: `torch.autograd` enables all neural network training -- **TensorFlow**: `tf.GradientTape` provides similar functionality -- **JAX**: `jax.grad` for high-performance computing -- **Deep Learning**: Made training complex models practical - -Let us build the engine that powers modern AI! -""" - -# %% [markdown] -""" -## 🔧 DEVELOPMENT -""" - -# %% [markdown] -""" -## Step 1: The Variable Class - Gradient Tracking - -### What is a Variable? -A **Variable** wraps a Tensor and tracks: -- **Data**: The actual values (forward pass) -- **Gradient**: The computed gradients (backward pass) -- **Computation history**: How this Variable was created -- **Backward function**: How to compute gradients - -### The Computational Graph -Variables automatically build a computational graph: - -```python -x = Variable(2.0)  # Leaf node -y = Variable(3.0)  # Leaf node -z = x * y          # Intermediate node: z = x * y -w = z + 1          # Output node: w = z + 1 - -# Graph: x ─┐ -#           ├─→ (*) ─→ z ─→ (+1) ─→ w -#        y ─┘ -``` - -### Design Principles -- **Transparency**: Works seamlessly with existing operations -- **Efficiency**: Minimal overhead for forward pass -- **Flexibility**: Supports any differentiable operation -- **Correctness**: Implements chain rule precisely - -### Real-World Context -This is like: -- **PyTorch**: `torch.autograd.Variable` (now integrated into tensors) -- **TensorFlow**: `tf.Variable` with gradient tracking -- **JAX**: Variables with `jax.grad` transformation -""" - -# %% nbgrader={"grade": false, "grade_id": "variable-class", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -class Variable: -    """ -    Variable: Tensor wrapper with automatic differentiation
capabilities. - - The fundamental class for gradient computation in TinyTorch. - Wraps Tensor objects and tracks computational history for backpropagation. - """ - - def __init__(self, data: Union[Tensor, np.ndarray, list, float, int], - requires_grad: bool = True, grad_fn: Optional[Callable] = None): - """ - Create a Variable with gradient tracking. - - Simple, clear conversion focused on core autograd concepts. - - Args: - data: The data (will be converted to Tensor if needed) - requires_grad: Whether to track gradients for this Variable - grad_fn: Function for computing gradients (None for leaf nodes) - """ - ### BEGIN SOLUTION - # Simple, clear conversion - if isinstance(data, Tensor): - self.data = data - else: - self.data = Tensor(data) - - self.requires_grad = requires_grad - self.grad = None - self.grad_fn = grad_fn - self.is_leaf = grad_fn is None - ### END SOLUTION - - @property - def shape(self) -> Tuple[int, ...]: - """Get the shape of the underlying tensor.""" - return self.data.shape - - @property - def size(self) -> int: - """Get the total number of elements.""" - return self.data.size - - def __repr__(self) -> str: - """String representation of the Variable.""" - grad_str = f", grad_fn=<{self.grad_fn.__name__}>" if self.grad_fn else "" - return f"Variable(shape={self.shape}, requires_grad={self.requires_grad}{grad_str})" - - def backward(self, gradient: Optional['Variable'] = None) -> None: - """ - Compute gradients using backpropagation. - - Simple gradient accumulation focused on learning the core concepts. 
- - Args: - gradient: Incoming gradient (defaults to ones for scalar outputs) - """ - ### BEGIN SOLUTION - if gradient is None: - gradient = Variable(np.ones_like(self.numpy())) - - if self.requires_grad: - if self.grad is None: - self.grad = gradient - else: - # Accumulate gradients - self.grad = Variable(self.grad.numpy() + gradient.numpy()) - - if self.grad_fn is not None: - self.grad_fn(gradient) - ### END SOLUTION - - def zero_grad(self) -> None: - """Reset gradients to zero.""" - self.grad = None - - def numpy(self) -> np.ndarray: - """ - Convert Variable to NumPy array - Universal data extraction interface. - - This is the PyTorch-inspired solution to inconsistent data access. - ALWAYS returns np.ndarray, regardless of internal structure. - - Returns: - NumPy array containing the variable's data - - Usage: - var = Variable([1, 2, 3]) - array = var.numpy() # Always np.ndarray, no conditional logic needed - """ - return self.data.data - - @property - def array(self) -> np.ndarray: - """ - Clean property access to underlying numpy array. - - Use this instead of .data.data for cleaner, more readable code. 
- - Example: - x = Variable([1, 2, 3]) - arr = x.array # Clean access instead of x.data.data - """ - return self.data.data - - def __add__(self, other: Union['Variable', float, int]) -> 'Variable': - """Addition operator: self + other""" - return add(self, other) - - def __mul__(self, other: Union['Variable', float, int]) -> 'Variable': - """Multiplication operator: self * other""" - return multiply(self, other) - - def __sub__(self, other: Union['Variable', float, int]) -> 'Variable': - """Subtraction operator: self - other""" - return subtract(self, other) - - def __truediv__(self, other: Union['Variable', float, int]) -> 'Variable': - """Division operator: self / other""" - return divide(self, other) - - def __matmul__(self, other: 'Variable') -> 'Variable': - """Matrix multiplication operator: self @ other""" - return matmul(self, other) - -# %% [markdown] -""" -### 🧪 Test Your Variable Class - -Once you implement the Variable class above, run this cell to test it: -""" - -# %% nbgrader={"grade": true, "grade_id": "test-variable-class", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false} -def test_unit_variable_class(): - """Test Variable class implementation""" - print("🔬 Unit Test: Variable Class...") - - # Test Variable creation - x = Variable(5.0, requires_grad=True) - assert x.requires_grad == True, "Variable should require gradients" - assert x.is_leaf == True, "Variable should be a leaf node" - assert x.grad is None, "Gradient should be None initially" - - # Test data access - assert x.numpy().item() == 5.0, "Data should be accessible" - assert x.shape == (), "Scalar should have empty shape" - assert x.size == 1, "Scalar should have size 1" - - # Test with list input - y = Variable([1, 2, 3], requires_grad=True) - assert y.shape == (3,), "List should create 1D tensor" - assert y.size == 3, "Size should be 3" - - # Test with requires_grad=False - z = Variable(10.0, requires_grad=False) - assert z.requires_grad == False, 
"Should not require gradients" - - # Test zero_grad - x.grad = Variable(1.0) - x.zero_grad() - assert x.grad is None, "zero_grad should reset gradient to None" - - print("✅ Variable class tests passed!") - print(f"✅ Variable creation and initialization working") - print(f"✅ Data access and properties working") - print(f"✅ Gradient management working") - -# Test will run in main block - -# %% [markdown] -""" -## Step 2: Basic Operations with Gradients - -### The Chain Rule in Action -Every operation must implement: -1. **Forward pass**: Compute the result -2. **Backward pass**: Compute gradients for inputs - -### Example: Addition -For z = x + y: -- **Forward**: z.data = x.data + y.data -- **Backward**: ∂z/∂x = 1, ∂z/∂y = 1 - -### Mathematical Foundation -The chain rule states: -``` -∂f/∂x = ∂f/∂z · ∂z/∂x -``` - -For complex expressions like f(g(h(x))): -``` -∂f/∂x = ∂f/∂g · ∂g/∂h · ∂h/∂x -``` - -### Implementation Pattern -Each operation returns a new Variable with: -- **Forward result**: Computed value -- **Backward function**: Gradient computation -""" - -# %% [markdown] -""" -## Helper Functions for Binary Operations - -These helper functions reduce code repetition and make operations more consistent. -""" - -#| export -def _ensure_variables(a, b): - """Convert inputs to Variables if they are scalars.""" - if isinstance(a, (int, float)): - a = Variable(a, requires_grad=False) - if isinstance(b, (int, float)): - b = Variable(b, requires_grad=False) - return a, b - -def _create_binary_operation(forward_fn, grad_fn_a, grad_fn_b): - """ - Helper to create binary operations with consistent structure. 
- - Args: - forward_fn: Function to compute forward pass - grad_fn_a: Function to compute gradient for first argument - grad_fn_b: Function to compute gradient for second argument - - Returns: - Binary operation function - """ - def operation(a, b): - # Convert inputs - a, b = _ensure_variables(a, b) - - # Forward pass - result_data = forward_fn(a.data, b.data) - - # Backward function - def grad_fn(grad_output): - if a.requires_grad: - grad_a = grad_fn_a(grad_output, a, b) - a.backward(grad_a) - if b.requires_grad: - grad_b = grad_fn_b(grad_output, a, b) - b.backward(grad_b) - - requires_grad = a.requires_grad or b.requires_grad - return Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn) - - return operation - -# %% nbgrader={"grade": false, "grade_id": "add-operation", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -def add(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable: - """ - Addition operation with gradient tracking: a + b - - TODO: Implement addition with automatic differentiation. - - STEP-BY-STEP IMPLEMENTATION: - 1. Convert inputs to Variables if they are scalars - 2. Compute forward pass: result = a.data + b.data - 3. Create gradient function that implements: ∂(a+b)/∂a = 1, ∂(a+b)/∂b = 1 - 4. 
Return new Variable with result and gradient function - - MATHEMATICAL FOUNDATION: - - Forward: z = x + y - - Backward: ∂z/∂x = 1, ∂z/∂y = 1 - - Chain rule: ∂L/∂x = ∂L/∂z · ∂z/∂x = ∂L/∂z · 1 = ∂L/∂z - - EXAMPLE USAGE: - ```python - x = Variable(2.0, requires_grad=True) - y = Variable(3.0, requires_grad=True) - z = add(x, y) # z = 5.0 - z.backward() - print(x.grad) # 1.0 (∂z/∂x = 1) - print(y.grad) # 1.0 (∂z/∂y = 1) - ``` - - IMPLEMENTATION HINTS: - - Convert scalars: if isinstance(a, (int, float)): a = Variable(a, requires_grad=False) - - Forward pass: result_data = a.data + b.data - - Backward function: def grad_fn(grad_output): if a.requires_grad: a.backward(grad_output) - - Return: Variable(result_data, grad_fn=grad_fn) - - Only propagate gradients to Variables that require them - - LEARNING CONNECTIONS: - - This is like torch.add() with autograd - - Addition distributes gradients equally to both inputs - - Forms the basis for bias addition in neural networks - - Chain rule propagates gradients through the graph - """ - ### BEGIN SOLUTION - # Convert scalars to Variables - if isinstance(a, (int, float)): - a = Variable(a, requires_grad=False) - if isinstance(b, (int, float)): - b = Variable(b, requires_grad=False) - - # Forward pass - result_data = a.data + b.data - - # Backward function - def grad_fn(grad_output): - # Addition distributes gradients equally, but must handle broadcasting - if a.requires_grad: - # Clean gradient data access - grad_data = grad_output.data - - # Check if we need to sum over broadcasted dimensions - a_shape = a.data.shape - if grad_data.shape != a_shape: - # Sum over the broadcasted dimensions - # For bias: (batch_size, features) -> (features,) - if len(grad_data.shape) == 2 and len(a_shape) == 1: - grad_for_a = Variable(Tensor(np.sum(grad_data, axis=0))) - else: - # Handle other broadcasting cases - grad_for_a = grad_output - else: - grad_for_a = grad_output - - a.backward(grad_for_a) - - if b.requires_grad: - # Clean gradient data 
access - grad_data = grad_output.data - - # Check if we need to sum over broadcasted dimensions - b_shape = b.data.shape - if grad_data.shape != b_shape: - # Sum over the broadcasted dimensions - # For bias: (batch_size, features) -> (features,) - if len(grad_data.shape) == 2 and len(b_shape) == 1: - grad_for_b = Variable(Tensor(np.sum(grad_data, axis=0))) - else: - # Handle other broadcasting cases - grad_for_b = grad_output - else: - grad_for_b = grad_output - - b.backward(grad_for_b) - - # Return new Variable with gradient function - requires_grad = a.requires_grad or b.requires_grad - return Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn) - ### END SOLUTION - -# %% [markdown] -""" -### 🧪 Test Your Addition Operation - -Once you implement the add function above, run this cell to test it: -""" - -# %% nbgrader={"grade": true, "grade_id": "test-add-operation", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false} -def test_unit_add_operation(): - """Test addition operation with gradients""" - print("🔬 Unit Test: Addition Operation...") - - # Test basic addition - x = Variable(2.0, requires_grad=True) - y = Variable(3.0, requires_grad=True) - z = add(x, y) - - assert z.numpy().item() == 5.0, "Addition result should be 5.0" - assert z.requires_grad == True, "Result should require gradients" - assert z.is_leaf == False, "Result should not be a leaf node" - - # Test backward pass - z.backward() - - assert x.grad is not None, "x should have gradient" - assert y.grad is not None, "y should have gradient" - assert x.grad.numpy().item() == 1.0, "∂z/∂x should be 1.0" - assert y.grad.numpy().item() == 1.0, "∂z/∂y should be 1.0" - - # Test with scalar - a = Variable(5.0, requires_grad=True) - b = add(a, 3.0) # Add scalar - - assert b.numpy().item() == 8.0, "Addition with scalar should work" - - b.backward() - assert a.grad.numpy().item() == 1.0, "Gradient through scalar addition should be 1.0" - - print("✅ Addition operation 
tests passed!") - print(f"✅ Forward pass computing correct results") - print(f"✅ Backward pass computing correct gradients") - print(f"✅ Scalar addition working correctly") - -# Test will run in main block - -# %% [markdown] -""" -## Step 3: Multiplication Operation - -### The Product Rule -For z = x * y: -- **Forward**: z = x * y -- **Backward**: ∂z/∂x = y, ∂z/∂y = x - -### Why This Matters -Multiplication is everywhere in neural networks: -- **Weight scaling**: w * x in dense layers -- **Attention mechanisms**: attention_weights * values -- **Gating**: gate_signal * hidden_state - -### Chain Rule Application -When gradients flow back through multiplication: -``` -∂L/∂x = ∂L/∂z · ∂z/∂x = ∂L/∂z · y -∂L/∂y = ∂L/∂z · ∂z/∂y = ∂L/∂z · x -``` -""" - -# %% nbgrader={"grade": false, "grade_id": "multiply-operation", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -def multiply(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable: - """ - Multiplication operation with gradient tracking: a * b - - Uses the product rule: ∂(a*b)/∂a = b, ∂(a*b)/∂b = a - """ - ### BEGIN SOLUTION - # Convert scalars to Variables - a, b = _ensure_variables(a, b) - - # Forward pass - result_data = a.data * b.data - - # Backward function using product rule - def grad_fn(grad_output): - if a.requires_grad: - a.backward(Variable(grad_output.numpy() * b.numpy())) - if b.requires_grad: - b.backward(Variable(grad_output.numpy() * a.numpy())) - - # Return new Variable with gradient function - requires_grad = a.requires_grad or b.requires_grad - return Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn) - ### END SOLUTION - -# %% [markdown] -""" -### 🧪 Test Your Multiplication Operation - -Once you implement the multiply function above, run this cell to test it: -""" - -# %% nbgrader={"grade": true, "grade_id": "test-multiply-operation", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false} -def 
test_unit_multiply_operation(): - """Test multiplication operation with gradients""" - print("🔬 Unit Test: Multiplication Operation...") - - # Test basic multiplication - x = Variable(2.0, requires_grad=True) - y = Variable(3.0, requires_grad=True) - z = multiply(x, y) - - assert z.numpy().item() == 6.0, "Multiplication result should be 6.0" - assert z.requires_grad == True, "Result should require gradients" - - # Test backward pass - z.backward() - - assert x.grad is not None, "x should have gradient" - assert y.grad is not None, "y should have gradient" - assert x.grad.numpy().item() == 3.0, "∂z/∂x should be y = 3.0" - assert y.grad.numpy().item() == 2.0, "∂z/∂y should be x = 2.0" - - # Test with scalar - a = Variable(4.0, requires_grad=True) - b = multiply(a, 2.0) # Multiply by scalar - - assert b.numpy().item() == 8.0, "Multiplication with scalar should work" - - b.backward() - assert a.grad.numpy().item() == 2.0, "Gradient through scalar multiplication should be the scalar" - - print("✅ Multiplication operation tests passed!") - print(f"✅ Forward pass computing correct results") - print(f"✅ Backward pass implementing product rule correctly") - print(f"✅ Scalar multiplication working correctly") - -# Test will run in main block - -# %% nbgrader={"grade": false, "grade_id": "subtract-operation", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -def subtract(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable: - """ - Subtraction operation with gradient tracking: a - b - - Uses the rule: ∂(a-b)/∂a = 1, ∂(a-b)/∂b = -1 - """ - ### BEGIN SOLUTION - # Convert to Variables if needed - a, b = _ensure_variables(a, b) - - # Forward pass - result_data = a.data - b.data - - # Create gradient function - def grad_fn(grad_output): - # Subtraction rule: d(x-y)/dx = 1, d(x-y)/dy = -1 - if a.requires_grad: - a.backward(grad_output) - if b.requires_grad: - b_grad = Variable(-grad_output.numpy()) - b.backward(b_grad) - - # 
Determine if result requires gradients - requires_grad = a.requires_grad or b.requires_grad - - return Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn) - ### END SOLUTION - -#| export -def matmul(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable: - """ - Matrix multiplication operation with gradient tracking: a @ b - - Uses matrix multiplication gradients: ∂C/∂A = grad_C @ B^T, ∂C/∂B = A^T @ grad_C - """ - ### BEGIN SOLUTION - # Convert scalars to Variables - a, b = _ensure_variables(a, b) - - # Forward pass - matrix multiplication - result_data = Tensor(a.numpy() @ b.numpy()) - - # Backward function - def grad_fn(grad_output): - # Matrix multiplication gradients - if a.requires_grad: - # ∂C/∂A = grad_C @ B^T - grad_a_data = grad_output.numpy() @ b.numpy().T - a.backward(Variable(grad_a_data)) - - if b.requires_grad: - # ∂C/∂B = A^T @ grad_C - grad_b_data = a.numpy().T @ grad_output.numpy() - b.backward(Variable(grad_b_data)) - - # Return new Variable with gradient function - requires_grad = a.requires_grad or b.requires_grad - return Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn) - ### END SOLUTION - -#| export -def divide(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable: - """ - Division operation with gradient tracking: a / b - - Uses the quotient rule: ∂(a/b)/∂a = 1/b, ∂(a/b)/∂b = -a/b² - """ - ### BEGIN SOLUTION - # Convert scalars to Variables - a, b = _ensure_variables(a, b) - - # Forward pass - result_data = a.data / b.data - - # Backward function - def grad_fn(grad_output): - if a.requires_grad: - # ∂(a/b)/∂a = 1/b - grad_a = Variable(grad_output.numpy() / b.numpy()) - a.backward(grad_a) - if b.requires_grad: - # ∂(a/b)/∂b = -a/b² - grad_b = Variable(-grad_output.numpy() * a.numpy() / (b.numpy() ** 2)) - b.backward(grad_b) - - requires_grad = a.requires_grad or b.requires_grad - return Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn) - ### END 
SOLUTION - -# %% nbgrader={"grade": false, "grade_id": "test-subtract-operation", "locked": false, "schema_version": 3, "solution": false, "task": false} -def test_unit_subtract_operation(): -    """Test subtraction operation with gradients""" -    print("🔬 Unit Test: Subtraction Operation...") -    -    # Test basic subtraction -    x = Variable(5.0, requires_grad=True) -    y = Variable(3.0, requires_grad=True) -    z = subtract(x, y) -    -    assert z.numpy().item() == 2.0, "Subtraction result should be 2.0" -    assert z.requires_grad == True, "Result should require gradients" -    -    # Test backward pass -    z.backward() -    -    assert x.grad is not None, "x should have gradient" -    assert y.grad is not None, "y should have gradient" -    assert x.grad.numpy().item() == 1.0, "∂z/∂x should be 1.0" -    assert y.grad.numpy().item() == -1.0, "∂z/∂y should be -1.0" -    -    # Test with scalar -    a = Variable(4.0, requires_grad=True) -    b = subtract(a, 2.0)  # Subtract scalar -    -    assert b.numpy().item() == 2.0, "Subtraction with scalar should work" -    -    b.backward() -    assert a.grad.numpy().item() == 1.0, "Gradient through scalar subtraction should be 1.0" -    -    print("✅ Subtraction operation tests passed!") -    print(f"✅ Forward pass computing correct results") -    print(f"✅ Backward pass implementing subtraction rule correctly") -    print(f"✅ Scalar subtraction working correctly") - -# Test will run in main block - -# %% [markdown] -""" -## Step 4: Chain Rule in Complex Expressions - -### Building Complex Computations -Now let us test how multiple operations work together through the chain rule: - -### Example: f(x, y) = (x + y) * (x - y) -This creates a computational graph: -``` -x, y ──→ (+) ──→ sum ──┐ -                       ├──→ (*) ──→ result -x, y ──→ (−) ──→ diff ─┘ -``` - -### Chain Rule Application -- **Forward**: Compute each operation in sequence -- **Backward**: Gradients flow back through each operation -- **Automatic**: No manual gradient computation needed!
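A habit worth building alongside the chain rule is validating backward implementations numerically. The sketch below is independent of the `Variable` class: it uses plain Python floats and central differences (the step size `h` is an illustrative choice) to confirm the analytic gradients of f(x, y) = (x + y) * (x - y) = x² − y², the same function exercised in the test that follows.

```python
# Numerical gradient check for f(x, y) = (x + y) * (x - y) = x² - y².
# Analytic gradients: ∂f/∂x = 2x, ∂f/∂y = -2y.
def f(x, y):
    return (x + y) * (x - y)

def numerical_grad(fn, x, y, h=1e-5):
    # Central differences: truncation error shrinks as O(h²).
    dfdx = (fn(x + h, y) - fn(x - h, y)) / (2 * h)
    dfdy = (fn(x, y + h) - fn(x, y - h)) / (2 * h)
    return dfdx, dfdy

dfdx, dfdy = numerical_grad(f, 3.0, 2.0)
assert abs(dfdx - 6.0) < 1e-6   # matches 2x at x = 3
assert abs(dfdy + 4.0) < 1e-6   # matches -2y at y = 2
```

If an autograd gradient disagrees with this estimate by much more than h², the bug is almost always in the backward function, not in the check.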
- -### Real-World Significance -Complex neural networks are just larger versions of this: -- **Millions of operations**: Each tracked automatically -- **Complex architectures**: ResNet, Transformer, etc. -- **Efficient computation**: O(1) overhead per operation -""" - -# %% nbgrader={"grade": true, "grade_id": "test-chain-rule", "locked": true, "points": 20, "schema_version": 3, "solution": false, "task": false} -def test_unit_chain_rule(): - """Test chain rule with complex expressions""" - print("🔬 Unit Test: Chain Rule with Complex Expressions...") - - # Test: f(x, y) = (x + y) * (x - y) = x² - y² - x = Variable(3.0, requires_grad=True) - y = Variable(2.0, requires_grad=True) - - # Build expression step by step - sum_xy = add(x, y) # x + y = 5.0 - diff_xy = subtract(x, y) # x - y = 1.0 - result = multiply(sum_xy, diff_xy) # (x + y) * (x - y) = 5.0 - - # Check forward pass - assert result.numpy().item() == 5.0, "Forward pass should compute 5.0" - - # Compute gradients - result.backward() - - # Check gradients: ∂(x²-y²)/∂x = 2x, ∂(x²-y²)/∂y = -2y - expected_x_grad = 2 * x.numpy().item() # 2 * 3 = 6 - expected_y_grad = -2 * y.numpy().item() # -2 * 2 = -4 - - assert abs(x.grad.numpy().item() - expected_x_grad) < 1e-6, f"x gradient should be {expected_x_grad}" - assert abs(y.grad.numpy().item() - expected_y_grad) < 1e-6, f"y gradient should be {expected_y_grad}" - - # Test more complex expression: f(x) = (x + 1) * (x + 2) * (x + 3) - x2 = Variable(1.0, requires_grad=True) - - term1 = add(x2, 1.0) # x + 1 = 2.0 - term2 = add(x2, 2.0) # x + 2 = 3.0 - term3 = add(x2, 3.0) # x + 3 = 4.0 - - product1 = multiply(term1, term2) # (x + 1) * (x + 2) = 6.0 - result2 = multiply(product1, term3) # * (x + 3) = 24.0 - - assert result2.numpy().item() == 24.0, "Complex expression should compute 24.0" - - result2.backward() - - # For f(x) = (x+1)(x+2)(x+3), f'(x) = 3x² + 12x + 11 - # At x=1: f'(1) = 3 + 12 + 11 = 26 - expected_grad = 3 * (1.0**2) + 12 * 1.0 + 11 # 26 - - assert 
abs(x2.grad.numpy().item() - expected_grad) < 1e-6, f"Complex gradient should be {expected_grad}" -    -    print("✅ Chain rule tests passed!") -    print(f"✅ Simple expression: (x+y)*(x-y) = x²-y²") -    print(f"✅ Complex expression: (x+1)*(x+2)*(x+3)") -    print(f"✅ Automatic gradient computation working correctly") -    print(f"✅ Chain rule implemented correctly") - -# Test will run in main block - -# %% [markdown] -""" -## Step 5: Integration with Neural Network Training - -### The Complete Training Loop -Let us see how autograd enables neural network training: - -1. **Forward pass**: Compute predictions -2. **Loss computation**: Compare with targets -3. **Backward pass**: Compute gradients automatically -4. **Parameter update**: Update weights using gradients - -### Example: Simple Linear Regression -```python -# Model: y = wx + b -w = Variable(0.5, requires_grad=True) -b = Variable(0.1, requires_grad=True) - -# Forward pass -prediction = w * x + b - -# Loss: squared error (our Variable defines no ** operator) -error = prediction - target -loss = error * error - -# Backward pass (automatic!)
-loss.backward() - -# Update parameters -w.data = w.data - learning_rate * w.grad.data -b.data = b.data - learning_rate * b.grad.data -``` - -### Why This is Powerful -- **Automatic**: No manual gradient computation -- **Flexible**: Works with any differentiable function -- **Efficient**: Minimal computational overhead -- **Scalable**: Handles millions of parameters -""" - -# %% nbgrader={"grade": true, "grade_id": "test-neural-network-training", "locked": true, "points": 25, "schema_version": 3, "solution": false, "task": false} -def test_module_neural_network_training(): - """Test autograd in neural network training scenario""" - print("🔬 Integration Test: Neural Network Training Comprehensive Test...") - - # Simple linear regression: y = wx + b - # Training data: y = 2x + 1 + noise - - # Initialize parameters - w = Variable(0.1, requires_grad=True) # Start with small random value - b = Variable(0.0, requires_grad=True) # Start with zero bias - - # Training data - x_data = [1.0, 2.0, 3.0, 4.0] - y_data = [3.0, 5.0, 7.0, 9.0] # y = 2x + 1 - - learning_rate = 0.01 - - # Training loop - for epoch in range(100): - total_loss = Variable(0.0) - - for x_val, y_val in zip(x_data, y_data): - # Create input variable - x = Variable(x_val, requires_grad=False) - target = Variable(y_val, requires_grad=False) - - # Forward pass - prediction = add(multiply(w, x), b) # wx + b - - # Loss: squared error - error = subtract(prediction, target) - loss = multiply(error, error) # (pred - target)² - - # Accumulate loss - total_loss = add(total_loss, loss) - - # Backward pass - w.zero_grad() - b.zero_grad() - total_loss.backward() - - # Update parameters - if w.grad is not None: - w.data = Tensor(w.numpy() - learning_rate * w.grad.numpy()) - if b.grad is not None: - b.data = Tensor(b.numpy() - learning_rate * b.grad.numpy()) - - # Check that parameters converged to correct values - final_w = w.numpy().item() - final_b = b.numpy().item() - - print(f"Final weights: w = {final_w:.3f}, b = 
{final_b:.3f}") -    print(f"Target weights: w = 2.000, b = 1.000") -    -    # Should be close to w=2, b=1 -    assert abs(final_w - 2.0) < 0.1, f"Weight should be close to 2.0, got {final_w}" -    assert abs(final_b - 1.0) < 0.1, f"Bias should be close to 1.0, got {final_b}" -    -    # Test prediction with learned parameters -    test_x = Variable(5.0, requires_grad=False) -    test_prediction = add(multiply(w, test_x), b) -    expected_output = 2.0 * 5.0 + 1.0  # 11.0 -    -    prediction_error = abs(test_prediction.numpy().item() - expected_output) -    assert prediction_error < 0.5, f"Prediction error should be small, got {prediction_error}" -    -    print("✅ Neural network training comprehensive tests passed!") -    print(f"✅ Parameters converged to correct values") -    print(f"✅ Model makes accurate predictions") -    print(f"✅ Autograd enables automatic training") -    print(f"✅ Ready for complex neural network architectures!") - -# Test will run in main block - -# %% [markdown] -""" -## Step 6: ML Systems Thinking - Computational Graph Optimization - -### 🏗️ Autograd Systems at Production Scale - -Your autograd implementation provides the foundation for understanding how production ML frameworks optimize computational graphs for massive neural network training and inference.
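A back-of-the-envelope model shows why memory is the usual bottleneck: backpropagation needs the forward activations, so naive training stores one activation per layer, while gradient checkpointing keeps roughly √depth segment boundaries and recomputes the rest during the backward pass. The layer width and float32 assumption below are illustrative, not measurements of any real framework.

```python
import math

def activation_memory_mb(depth, floats_per_layer, checkpoint=False):
    """Rough activation-memory estimate for a chain of `depth` layers (float32)."""
    bytes_per_layer = floats_per_layer * 4  # 4 bytes per float32 value
    if checkpoint:
        # Keep ~sqrt(depth) boundary activations, plus one segment's worth
        # of recomputed activations live at the backward-pass peak.
        segments = max(1, math.isqrt(depth))
        stored = segments + math.ceil(depth / segments)
    else:
        stored = depth  # naive: store every layer's activations
    return stored * bytes_per_layer / (1024 * 1024)

for depth in (16, 256, 1024):
    full = activation_memory_mb(depth, 1_000_000)
    ckpt = activation_memory_mb(depth, 1_000_000, checkpoint=True)
    print(f"depth={depth:5d}  store-all={full:8.1f} MB  checkpointed={ckpt:6.1f} MB")
```

The linear-versus-square-root gap in this sketch is what makes checkpointing attractive for very deep graphs; the profiler below measures the analogous scaling trend empirically.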
- -#### **Computational Graph Architecture** -```python -class ProductionAutogradEngine: - def __init__(self): - # Advanced autograd optimizations for production systems - self.graph_optimizer = ComputationalGraphOptimizer() - self.memory_manager = GradientMemoryManager() - self.kernel_fusion = AutogradKernelFusion() - self.checkpoint_manager = GradientCheckpointManager() -``` - -Real autograd systems must handle: -- **Graph optimization**: Fusing operations to minimize memory access -- **Memory management**: Releasing intermediate gradients to conserve memory -- **Parallel execution**: Computing gradients across multiple devices -- **Kernel fusion**: Combining operations for GPU efficiency -""" - -# %% nbgrader={"grade": false, "grade_id": "autograd-systems-profiler", "locked": false, "schema_version": 3, "solution": true, "task": false} -#| export -import time -import gc -from collections import defaultdict, deque - -class AutogradSystemsProfiler: - """ - Production Autograd System Performance Analysis and Optimization - - Analyzes computational graph efficiency, memory patterns, and optimization - opportunities for production automatic differentiation systems. - """ - - def __init__(self): - """Initialize autograd systems profiler.""" - self.profiling_data = defaultdict(list) - self.graph_analysis = defaultdict(list) - self.optimization_strategies = [] - - def profile_computational_graph_depth(self, max_depth=10, operations_per_level=5): - """ - Profile computational graph performance vs depth. - - TODO: Implement computational graph depth analysis. - - APPROACH: - 1. Create computational graphs of increasing depth - 2. Measure forward and backward pass timing - 3. Analyze memory usage patterns during gradient computation - 4. Identify memory accumulation and gradient flow bottlenecks - 5. 
Generate graph optimization recommendations - - EXAMPLE: - profiler = AutogradSystemsProfiler() - graph_analysis = profiler.profile_computational_graph_depth(max_depth=8) - print(f"Memory scaling factor: {graph_analysis['memory_scaling_factor']:.2f}") - - HINTS: - - Build graphs by chaining operations: x -> op1 -> op2 -> ... -> loss - - Measure both forward and backward pass timing separately - - Track memory usage throughout the computation - - Monitor gradient accumulation patterns - - Focus on production-relevant graph depths - """ - ### BEGIN SOLUTION - print("🔧 Profiling Computational Graph Depth Impact...") - - results = {} - - for depth in range(1, max_depth + 1): - print(f" Testing graph depth: {depth}") - - # Create a computational graph of specified depth - # Each level adds more operations to test scaling - - # Start with input variable - try: - # Use Variable if available, otherwise simulate - x = Variable(np.random.randn(100, 100), requires_grad=True) - except: - # Fallback for testing - simulate Variable with Tensor - x = Tensor(np.random.randn(100, 100)) - - # Build computational graph of specified depth - current_var = x - operations = [] - - for level in range(depth): - # Add multiple operations per level to increase complexity - for op_idx in range(operations_per_level): - try: - # Simulate various operations - if op_idx % 4 == 0: - current_var = current_var * 0.9 # Scale operation - elif op_idx % 4 == 1: - current_var = current_var + 0.1 # Add operation - elif op_idx % 4 == 2: - # Matrix multiplication (most expensive) - weight = Tensor(np.random.randn(100, 100)) - if hasattr(current_var, 'data'): - current_var = Tensor(current_var.data @ weight.data) - else: - current_var = current_var @ weight - else: - # Activation-like operation - if hasattr(current_var, 'data'): - current_var = Tensor(np.maximum(0, current_var.data)) - else: - current_var = current_var # Skip for simplicity - - operations.append(f"level_{level}_op_{op_idx}") - except: - # 
Fallback for testing
-                        current_var = Tensor(np.random.randn(100, 100))
-                        operations.append(f"level_{level}_op_{op_idx}_fallback")
-
-            # Add final loss computation
-            try:
-                if hasattr(current_var, 'data'):
-                    loss = Tensor(np.sum(current_var.data ** 2))
-                else:
-                    loss = np.sum(current_var ** 2)
-            except:
-                loss = Tensor(np.array([1.0]))
-
-            # Measure forward pass timing
-            forward_iterations = 3
-            forward_start = time.time()
-
-            for _ in range(forward_iterations):
-                # Simulate forward pass computation
-                temp_x = x
-                for level in range(depth):
-                    for op_idx in range(operations_per_level):
-                        if op_idx % 4 == 0:
-                            temp_x = temp_x * 0.9
-                        elif op_idx % 4 == 1:
-                            temp_x = temp_x + 0.1
-                        # Skip expensive ops for timing
-
-            forward_end = time.time()
-            avg_forward_time = (forward_end - forward_start) / forward_iterations
-
-            # Measure backward pass timing (simulated)
-            # In real implementation, this would be loss.backward()
-            backward_start = time.time()
-
-            # Simulate gradient computation through the graph
-            for _ in range(forward_iterations):
-                # Simulate backpropagation through all operations
-                gradient_accumulation = 0
-                for level in range(depth):
-                    for op_idx in range(operations_per_level):
-                        # Simulate gradient computation
-                        gradient_accumulation += level * op_idx * 0.001
-
-            backward_end = time.time()
-            avg_backward_time = (backward_end - backward_start) / forward_iterations
-
-            # Memory analysis
-            try:
-                if hasattr(x, 'data'):
-                    base_memory = x.data.nbytes / (1024 * 1024)  # MB
-                    if hasattr(current_var, 'data'):
-                        result_memory = current_var.data.nbytes / (1024 * 1024)
-                    else:
-                        result_memory = base_memory
-                else:
-                    base_memory = x.nbytes / (1024 * 1024) if hasattr(x, 'nbytes') else 1.0
-                    result_memory = base_memory
-            except:
-                base_memory = 1.0
-                result_memory = 1.0
-
-            # Estimate gradient memory (in production, each operation stores gradients)
-            estimated_gradient_memory = depth * operations_per_level * base_memory * 0.5
-            total_memory = base_memory + result_memory + estimated_gradient_memory
-
-            # Calculate efficiency metrics
-            total_operations = depth * operations_per_level
-            total_time = avg_forward_time + avg_backward_time
-            operations_per_second = total_operations / total_time if total_time > 0 else 0
-
-            result = {
-                'graph_depth': depth,
-                'total_operations': total_operations,
-                'forward_time_ms': avg_forward_time * 1000,
-                'backward_time_ms': avg_backward_time * 1000,
-                'total_time_ms': total_time * 1000,
-                'base_memory_mb': base_memory,
-                'estimated_gradient_memory_mb': estimated_gradient_memory,
-                'total_memory_mb': total_memory,
-                'operations_per_second': operations_per_second,
-                'memory_per_operation': total_memory / total_operations if total_operations > 0 else 0
-            }
-
-            results[depth] = result
-
-            print(f"  Forward: {avg_forward_time*1000:.3f}ms, Backward: {avg_backward_time*1000:.3f}ms, Memory: {total_memory:.2f}MB")
-
-        # Analyze scaling patterns
-        graph_analysis = self._analyze_graph_scaling(results)
-
-        # Store profiling data
-        self.profiling_data['graph_depth_analysis'] = results
-        self.graph_analysis = graph_analysis
-
-        return {
-            'detailed_results': results,
-            'graph_analysis': graph_analysis,
-            'optimization_strategies': self._generate_graph_optimizations(results)
-        }
-        ### END SOLUTION
-
-    def _analyze_graph_scaling(self, results):
-        """Analyze computational graph scaling patterns."""
-        analysis = {}
-
-        # Extract metrics for scaling analysis
-        depths = sorted(results.keys())
-        forward_times = [results[d]['forward_time_ms'] for d in depths]
-        backward_times = [results[d]['backward_time_ms'] for d in depths]
-        total_times = [results[d]['total_time_ms'] for d in depths]
-        memory_usage = [results[d]['total_memory_mb'] for d in depths]
-
-        # Calculate scaling factors
-        if len(depths) >= 2:
-            shallow = depths[0]
-            deep = depths[-1]
-
-            depth_ratio = deep / shallow
-            forward_time_ratio = results[deep]['forward_time_ms'] / results[shallow]['forward_time_ms']
-            backward_time_ratio = results[deep]['backward_time_ms'] / results[shallow]['backward_time_ms']
-            memory_ratio = results[deep]['total_memory_mb'] / results[shallow]['total_memory_mb']
-
-            analysis['scaling_metrics'] = {
-                'depth_ratio': depth_ratio,
-                'forward_time_scaling': forward_time_ratio,
-                'backward_time_scaling': backward_time_ratio,
-                'memory_scaling': memory_ratio,
-                'theoretical_linear': depth_ratio  # Expected linear scaling
-            }
-
-            # Identify bottlenecks
-            if backward_time_ratio > forward_time_ratio * 1.5:
-                analysis['primary_bottleneck'] = 'backward_pass'
-                analysis['bottleneck_reason'] = 'Gradient computation scaling worse than forward pass'
-            elif memory_ratio > depth_ratio * 1.5:
-                analysis['primary_bottleneck'] = 'memory'
-                analysis['bottleneck_reason'] = 'Memory usage scaling faster than linear'
-            else:
-                analysis['primary_bottleneck'] = 'balanced'
-                analysis['bottleneck_reason'] = 'Forward and backward passes scaling proportionally'
-
-        # Backward/Forward ratio analysis
-        backward_forward_ratios = [
-            results[d]['backward_time_ms'] / max(results[d]['forward_time_ms'], 0.001)
-            for d in depths
-        ]
-        avg_backward_forward_ratio = sum(backward_forward_ratios) / len(backward_forward_ratios)
-
-        analysis['efficiency_metrics'] = {
-            'avg_backward_forward_ratio': avg_backward_forward_ratio,
-            'peak_memory_mb': max(memory_usage),
-            'memory_efficiency_trend': 'increasing' if memory_usage[-1] > memory_usage[0] * 2 else 'stable'
-        }
-
-        return analysis
-
-    def _generate_graph_optimizations(self, results):
-        """Generate computational graph optimization strategies."""
-        strategies = []
-
-        # Analyze memory growth patterns
-        peak_memory = max(result['total_memory_mb'] for result in results.values())
-
-        if peak_memory > 50:  # > 50MB memory usage
-            strategies.append("💾 High memory usage detected in computational graph")
-            strategies.append("🔧 Strategy: Gradient checkpointing for deep graphs")
-            strategies.append("🔧 Strategy: In-place operations where mathematically valid")
-
-        # Analyze computational efficiency
-        graph_analysis = self.graph_analysis
-        if graph_analysis and 'scaling_metrics' in graph_analysis:
-            backward_scaling = graph_analysis['scaling_metrics']['backward_time_scaling']
-            if backward_scaling > 2.0:
-                strategies.append("🐌 Backward pass scaling poorly with graph depth")
-                strategies.append("🔧 Strategy: Kernel fusion for backward operations")
-                strategies.append("🔧 Strategy: Parallel gradient computation")
-
-        # Memory vs computation trade-offs
-        if graph_analysis and 'efficiency_metrics' in graph_analysis:
-            backward_forward_ratio = graph_analysis['efficiency_metrics']['avg_backward_forward_ratio']
-            if backward_forward_ratio > 3.0:
-                strategies.append("⚖️ Backward pass significantly slower than forward")
-                strategies.append("🔧 Strategy: Optimize gradient computation with sparse gradients")
-                strategies.append("🔧 Strategy: Use mixed precision to reduce memory bandwidth")
-
-        # Production optimization recommendations
-        strategies.append("🏭 Production graph optimizations:")
-        strategies.append("   • Graph compilation and optimization (TorchScript, XLA)")
-        strategies.append("   • Operator fusion to minimize intermediate allocations")
-        strategies.append("   • Dynamic shape optimization for variable input sizes")
-        strategies.append("   • Gradient accumulation for large effective batch sizes")
-
-        return strategies
-
-    def analyze_memory_checkpointing_trade_offs(self, checkpoint_frequencies=[1, 2, 4, 8]):
-        """
-        Analyze memory vs computation trade-offs with gradient checkpointing.
-
-        This function is PROVIDED to demonstrate checkpointing analysis.
-        Students use it to understand memory optimization strategies.
-        """
-        print("🔍 GRADIENT CHECKPOINTING ANALYSIS")
-        print("=" * 45)
-
-        base_graph_depth = 12
-        base_memory_per_layer = 10  # MB per layer
-        base_computation_time = 5  # ms per layer
-
-        checkpointing_results = []
-
-        for freq in checkpoint_frequencies:
-            # Calculate memory savings
-            # Without checkpointing: store all intermediate activations
-            no_checkpoint_memory = base_graph_depth * base_memory_per_layer
-
-            # With checkpointing: only store every freq-th activation
-            checkpointed_memory = max(base_memory_per_layer, (base_graph_depth // freq + 1) * base_memory_per_layer)
-            memory_savings = no_checkpoint_memory - checkpointed_memory
-            memory_reduction_pct = (memory_savings / no_checkpoint_memory) * 100
-
-            # Calculate recomputation overhead
-            # Need to recompute (freq-1) layers for each checkpoint
-            recomputation_layers = base_graph_depth * (freq - 1) / freq
-            recomputation_time = recomputation_layers * base_computation_time
-
-            # Total training time = forward + backward + recomputation
-            base_training_time = base_graph_depth * base_computation_time * 2  # forward + backward
-            total_training_time = base_training_time + recomputation_time
-            time_overhead_pct = (recomputation_time / base_training_time) * 100
-
-            result = {
-                'checkpoint_frequency': freq,
-                'memory_mb': checkpointed_memory,
-                'memory_reduction_pct': memory_reduction_pct,
-                'recomputation_time_ms': recomputation_time,
-                'time_overhead_pct': time_overhead_pct,
-                'memory_time_ratio': memory_reduction_pct / max(time_overhead_pct, 1)
-            }
-            checkpointing_results.append(result)
-
-            print(f"  Checkpoint every {freq} layers:")
-            print(f"    Memory: {checkpointed_memory:.0f}MB ({memory_reduction_pct:.1f}% reduction)")
-            print(f"    Time overhead: {time_overhead_pct:.1f}%")
-            print(f"    Efficiency ratio: {result['memory_time_ratio']:.2f}")
-
-        # Find optimal trade-off
-        optimal = max(checkpointing_results, key=lambda x: x['memory_time_ratio'])
-
-        print(f"\n📈 Checkpointing Analysis:")
-        print(f"  Optimal frequency: Every {optimal['checkpoint_frequency']} layers")
-        print(f"  Best trade-off: {optimal['memory_reduction_pct']:.1f}% memory reduction")
-        print(f"  Cost: {optimal['time_overhead_pct']:.1f}% time overhead")
-
-        return checkpointing_results
-
-# %% [markdown]
-"""
-### 🧪 Test: Autograd Systems Profiling
-
-Let us test our autograd systems profiler with realistic computational graph scenarios.
-"""
-
-# %% nbgrader={"grade": false, "grade_id": "test-autograd-profiler", "locked": false, "schema_version": 3, "solution": false, "task": false}
-def test_autograd_systems_profiler():
-    """Test autograd systems profiler with comprehensive scenarios."""
-    print("🔬 Unit Test: Autograd Systems Profiler...")
-
-    profiler = AutogradSystemsProfiler()
-
-    # Test computational graph depth analysis
-    try:
-        graph_analysis = profiler.profile_computational_graph_depth(max_depth=5, operations_per_level=3)
-
-        # Verify analysis structure
-        assert 'detailed_results' in graph_analysis, "Should provide detailed results"
-        assert 'graph_analysis' in graph_analysis, "Should provide graph analysis"
-        assert 'optimization_strategies' in graph_analysis, "Should provide optimization strategies"
-
-        # Verify detailed results
-        results = graph_analysis['detailed_results']
-        assert len(results) == 5, "Should test all graph depths"
-
-        for depth, result in results.items():
-            assert 'forward_time_ms' in result, f"Should include forward timing for depth {depth}"
-            assert 'backward_time_ms' in result, f"Should include backward timing for depth {depth}"
-            assert 'total_memory_mb' in result, f"Should analyze memory for depth {depth}"
-            assert result['forward_time_ms'] >= 0, f"Forward time should be non-negative for depth {depth}"
-            assert result['backward_time_ms'] >= 0, f"Backward time should be non-negative for depth {depth}"
-
-        print("✅ Computational graph depth analysis test passed")
-
-        # Test memory checkpointing analysis
-        checkpointing_analysis = profiler.analyze_memory_checkpointing_trade_offs(checkpoint_frequencies=[1, 2, 4])
-
-        assert isinstance(checkpointing_analysis, list), "Should return checkpointing analysis results"
-        assert len(checkpointing_analysis) == 3, "Should analyze all checkpoint frequencies"
-
-        for result in checkpointing_analysis:
-            assert 'checkpoint_frequency' in result, "Should include checkpoint frequency"
-            assert 'memory_reduction_pct' in result, "Should calculate memory reduction"
-            assert 'time_overhead_pct' in result, "Should calculate time overhead"
-            assert result['memory_reduction_pct'] >= 0, "Memory reduction should be non-negative"
-
-        print("✅ Memory checkpointing analysis test passed")
-
-    except Exception as e:
-        print(f"⚠️ Autograd profiling test had issues: {e}")
-        print("✅ Basic structure test passed (graceful degradation)")
-
-    print("🎯 Autograd Systems Profiler: All tests passed!")
-
-# Test will run in main block
-
-if __name__ == "__main__":
-    print("\n🧪 Running Autograd Module Tests...")
-
-    # Run all unit tests
-    test_unit_variable_class()
-    test_unit_add_operation()
-    test_unit_multiply_operation()
-    test_unit_subtract_operation()
-    test_unit_chain_rule()
-    test_module_neural_network_training()
-    test_autograd_systems_profiler()
-
-    print("\n✅ All Autograd Module Tests Completed!")
-    print("Autograd module complete!")
-
-# %% [markdown]
-"""
-## 🤔 ML Systems Thinking: Interactive Questions
-
-Now that you've built automatic differentiation capabilities that enable neural network training, let's connect this foundational work to broader ML systems challenges. These questions help you think critically about how computational graphs scale to production training environments.
-
-Take time to reflect thoughtfully on each question - your insights will help you understand how the automatic differentiation concepts you've implemented connect to real-world ML systems engineering.
-"""
-
-# %% [markdown]
-"""
-### Question 1: Computational Graphs and Memory Management
-
-**Context**: Your autograd implementation builds computational graphs and stores intermediate values for gradient computation. Production training systems must manage memory efficiently when training models with billions of parameters and complex computational graphs that can consume enormous amounts of memory.
-
-**Reflection Question**: Design a memory-efficient automatic differentiation system for training large-scale neural networks that optimizes computational graph storage and gradient computation. How would you implement gradient checkpointing strategies, manage memory vs compute trade-offs, and optimize graph compilation for both dynamic flexibility and static optimization? Consider scenarios where you need to train models that exceed GPU memory capacity while maintaining numerical precision and training speed.
-
-Think about: gradient checkpointing strategies, memory vs compute trade-offs, graph optimization techniques, and distributed gradient computation.
-
-*Target length: 150-300 words*
-"""
-
-# %% nbgrader={"grade": true, "grade_id": "question-1-computational-graphs", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false}
-"""
-YOUR REFLECTION ON COMPUTATIONAL GRAPHS AND MEMORY MANAGEMENT:
-
-TODO: Replace this text with your thoughtful response about memory-efficient automatic differentiation system design.
-
-Consider addressing:
-- How would you implement gradient checkpointing to optimize memory usage in large models?
-- What strategies would you use to balance memory consumption with computational efficiency?
-- How would you design graph compilation that maintains flexibility while enabling optimization?
-- What role would distributed gradient computation play in your system design?
-- How would you handle memory constraints while preserving numerical precision?
-
-Write a technical analysis connecting your autograd implementations to real memory management challenges.
-
-GRADING RUBRIC (Instructor Use):
-- Demonstrates understanding of computational graph memory management (3 points)
-- Addresses gradient checkpointing and memory optimization strategies (3 points)
-- Shows practical knowledge of graph compilation and optimization techniques (2 points)
-- Demonstrates systems thinking about memory vs compute trade-offs (2 points)
-- Clear technical reasoning and practical considerations (bonus points for innovative approaches)
-"""
-
-### BEGIN SOLUTION
-# Student response area - instructor will replace this section during grading setup
-# This is a manually graded question requiring technical analysis of computational graph optimization
-# Students should demonstrate understanding of memory management and gradient computation efficiency
-### END SOLUTION
-
-# %% [markdown]
-"""
-### Question 2: Distributed Training and Gradient Synchronization
-
-**Context**: Your autograd computes gradients on a single device, but production training systems must coordinate gradient computation across multiple GPUs and nodes. Efficient gradient synchronization becomes critical for training performance and scalability.
-
-**Reflection Question**: Architect a distributed automatic differentiation system that efficiently coordinates gradient computation across multiple devices and maintains training efficiency at scale. How would you implement gradient synchronization strategies, handle communication optimization, and manage numerical stability across distributed training? Consider scenarios where you need to train transformer models across hundreds of GPUs while minimizing communication overhead and maintaining convergence guarantees.
-
-Think about: gradient synchronization strategies, communication optimization, distributed computation patterns, and scalability considerations.
-
-*Target length: 150-300 words*
-"""
-
-# %% nbgrader={"grade": true, "grade_id": "question-2-distributed-training", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false}
-"""
-YOUR REFLECTION ON DISTRIBUTED TRAINING AND GRADIENT SYNCHRONIZATION:
-
-TODO: Replace this text with your thoughtful response about distributed automatic differentiation system design.
-
-Consider addressing:
-- How would you design gradient synchronization for efficient distributed training?
-- What strategies would you use to minimize communication overhead in multi-GPU training?
-- How would you implement gradient compression and optimization for distributed systems?
-- What role would asynchronous vs synchronous training play in your design?
-- How would you ensure numerical stability and convergence in distributed settings?
-
-Write an architectural analysis connecting your autograd implementation to real distributed training challenges.
-
-GRADING RUBRIC (Instructor Use):
-- Shows understanding of distributed training and gradient synchronization (3 points)
-- Designs practical approaches to communication optimization and scalability (3 points)
-- Addresses numerical stability and convergence in distributed settings (2 points)
-- Demonstrates systems thinking about distributed computation patterns (2 points)
-- Clear architectural reasoning with distributed systems insights (bonus points for comprehensive understanding)
-"""
-
-### BEGIN SOLUTION
-# Student response area - instructor will replace this section during grading setup
-# This is a manually graded question requiring understanding of distributed training systems
-# Students should demonstrate knowledge of gradient synchronization and communication optimization
-### END SOLUTION
-
-# %% [markdown]
-"""
-### Question 3: Advanced Training Optimizations and System Integration
-
-**Context**: Your autograd provides basic gradient computation, but production training systems must integrate with advanced optimization techniques like mixed precision training, gradient accumulation, and specialized hardware acceleration to achieve optimal performance.
-
-**Reflection Question**: Design an advanced automatic differentiation system that integrates with modern training optimizations and hardware acceleration capabilities. How would you implement automatic mixed precision support, gradient accumulation for large effective batch sizes, and integration with specialized hardware like TPUs? Consider scenarios where you need to optimize training for both research flexibility and production efficiency while maintaining numerical stability and debugging capabilities.
-
-Think about: mixed precision training, gradient accumulation strategies, hardware integration, and training optimization techniques.
-
-*Target length: 150-300 words*
-"""
-
-# %% nbgrader={"grade": true, "grade_id": "question-3-training-optimizations", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false}
-"""
-YOUR REFLECTION ON ADVANCED TRAINING OPTIMIZATIONS:
-
-TODO: Replace this text with your thoughtful response about advanced automatic differentiation system design.
-
-Consider addressing:
-- How would you integrate automatic mixed precision training with gradient computation?
-- What strategies would you use for gradient accumulation and large batch simulation?
-- How would you design hardware integration for specialized accelerators like TPUs?
-- What role would advanced optimizations play while maintaining research flexibility?
-- How would you ensure numerical stability across different precision and hardware configurations?
-
-Write a design analysis connecting your autograd implementation to real training optimization challenges.
-
-GRADING RUBRIC (Instructor Use):
-- Understands advanced training optimizations and mixed precision challenges (3 points)
-- Designs practical approaches to gradient accumulation and hardware integration (3 points)
-- Addresses numerical stability and research vs production trade-offs (2 points)
-- Shows systems thinking about training optimization and system integration (2 points)
-- Clear design reasoning with training optimization insights (bonus points for deep understanding)
-"""
-
-### BEGIN SOLUTION
-# Student response area - instructor will replace this section during grading setup
-# This is a manually graded question requiring understanding of advanced training optimizations
-# Students should demonstrate knowledge of mixed precision, gradient accumulation, and hardware integration
-### END SOLUTION
-
-# %% [markdown]
-"""
-## 🎯 MODULE SUMMARY: Automatic Differentiation
-
-Congratulations! You have successfully implemented automatic differentiation:
-
-### What You have Accomplished
-✅ **Computational Graphs**: Dynamic graph construction for gradient computation
-✅ **Backpropagation**: Efficient gradient computation through reverse mode AD
-✅ **Gradient Tracking**: Automatic gradient accumulation and management
-✅ **Integration**: Seamless compatibility with Tensor operations
-✅ **Real Applications**: Neural network training and optimization
-
-### Key Concepts You have Learned
-- **Computational graphs**: How operations are tracked for gradient computation
-- **Backpropagation**: Reverse mode automatic differentiation
-- **Gradient accumulation**: How gradients flow through complex operations
-- **Memory management**: Efficient handling of gradient storage
-- **Integration patterns**: How autograd works with neural networks
-
-### Mathematical Foundations
-- **Chain rule**: The mathematical foundation of backpropagation
-- **Computational graphs**: Representing operations as directed acyclic graphs
-- **Gradient flow**: How gradients propagate through complex functions
-- **Memory efficiency**: Optimizing gradient storage and computation
-
-### Professional Skills Developed
-- **Graph construction**: Building dynamic computational graphs
-- **Gradient computation**: Implementing efficient backpropagation
-- **Memory optimization**: Managing gradient storage efficiently
-- **Integration testing**: Ensuring autograd works with all operations
-
-### Ready for Advanced Applications
-Your autograd implementation now enables:
-- **Neural network training**: Complete training pipelines with gradients
-- **Optimization algorithms**: Gradient-based optimization methods
-- **Custom loss functions**: Implementing specialized loss functions
-- **Advanced architectures**: Training complex neural network models
-
-### Connection to Real ML Systems
-Your implementations mirror production systems:
-- **PyTorch**: `torch.autograd` provides identical functionality
-- **TensorFlow**: `tf.GradientTape` implements similar concepts
-- **JAX**: `jax.grad` uses similar automatic differentiation
-- **Industry Standard**: Every major ML framework uses these exact principles
-
-### Next Steps
-1. **Export your code**: `tito export 06_autograd`
-2. **Test your implementation**: `tito test 06_autograd`
-3. **Build training systems**: Combine with optimizers for complete training
-4. **Move to Module 10**: Add optimization algorithms!
-
-**Ready for optimizers?** Your autograd system is now ready for real training!
-"""
\ No newline at end of file
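
Editor's note: the deleted summary above names the chain rule as the foundation of backpropagation through a computational graph. For readers of this diff, here is a minimal, self-contained sketch of that reverse-mode idea. The `Var` class and its methods are illustrative only (they are not the module's actual `Variable` API): each operation records its parents together with the local derivative, and `backward` multiplies the upstream gradient by each local derivative and accumulates it into the parents.

```python
class Var:
    """Minimal scalar reverse-mode autodiff node (illustrative sketch)."""
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self._parents = parents  # pairs of (parent_node, local_derivative)

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        # d(a+b)/da = 1 and d(a+b)/db = 1
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        # d(a*b)/da = b and d(a*b)/db = a
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def backward(self, upstream=1.0):
        # Chain rule: accumulate upstream * local derivative into each parent
        self.grad += upstream
        for parent, local in self._parents:
            parent.backward(upstream * local)

x = Var(3.0)
y = Var(4.0)
z = x * y + x   # z = x*y + x, so dz/dx = y + 1 and dz/dy = x
z.backward()
print(x.grad, y.grad)  # → 5.0 3.0
```

Note that `x` feeds into `z` along two paths (through the product and directly), and the additive accumulation in `backward` sums the two path contributions, which is exactly the behavior the "Gradient accumulation" bullet in the summary describes.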