mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-06-04 09:07:06 -05:00
✅ Rename all module directories: 00_setup → 01_setup, etc. ✅ Update convert_modules.py mappings for new directory names ✅ Update _toc.yml file paths and titles (1-14 instead of 0-13) ✅ Regenerate all overview pages with new numbering ✅ Fix all broken references in usage-paths and intro ✅ Update chapter references to use natural numbering Benefits: - More intuitive course progression starting from 1 - Matches academic course numbering conventions - Eliminates confusion about 'Module 0' concept - Cleaner mental model for students and instructors - All references and links properly updated Complete transformation: 14 modules now numbered 01-14
1672 lines
58 KiB
Python
1672 lines
58 KiB
Python
# ---
|
|
# jupyter:
|
|
# jupytext:
|
|
# text_representation:
|
|
# extension: .py
|
|
# format_name: percent
|
|
# format_version: '1.3'
|
|
# jupytext_version: 1.17.1
|
|
# ---
|
|
|
|
# %% [markdown]
|
|
"""
|
|
# Module 7: Autograd - Automatic Differentiation Engine
|
|
|
|
Welcome to the Autograd module! This is where TinyTorch becomes truly powerful. You'll implement the automatic differentiation engine that makes neural network training possible.
|
|
|
|
## Learning Goals
|
|
- Understand how automatic differentiation works through computational graphs
|
|
- Implement the Variable class that tracks gradients and operations
|
|
- Build backward propagation for gradient computation
|
|
- Create the foundation for neural network training
|
|
- Master the mathematical concepts behind backpropagation
|
|
|
|
## Build → Use → Analyze
|
|
1. **Build**: Create the Variable class and gradient computation system
|
|
2. **Use**: Perform automatic differentiation on complex expressions
|
|
3. **Analyze**: Understand how gradients flow through computational graphs
|
|
"""
|
|
|
|
# %% nbgrader={"grade": false, "grade_id": "autograd-imports", "locked": false, "schema_version": 3, "solution": false, "task": false}
|
|
#| default_exp core.autograd
|
|
|
|
#| export
|
|
import numpy as np
|
|
import sys
|
|
from typing import Union, List, Tuple, Optional, Any, Callable
|
|
from collections import defaultdict
|
|
|
|
# Import our existing components
|
|
from tinytorch.core.tensor import Tensor
|
|
|
|
# %% nbgrader={"grade": false, "grade_id": "autograd-setup", "locked": false, "schema_version": 3, "solution": false, "task": false}
|
|
print("🔥 TinyTorch Autograd Module")
|
|
print(f"NumPy version: {np.__version__}")
|
|
print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}")
|
|
print("Ready to build automatic differentiation!")
|
|
|
|
# %% [markdown]
|
|
"""
|
|
## 📦 Where This Code Lives in the Final Package
|
|
|
|
**Learning Side:** You work in `modules/source/07_autograd/autograd_dev.py`
|
|
**Building Side:** Code exports to `tinytorch.core.autograd`
|
|
|
|
```python
|
|
# Final package structure:
|
|
from tinytorch.core.autograd import Variable, backward # The gradient engine!
|
|
from tinytorch.core.tensor import Tensor
|
|
from tinytorch.core.activations import ReLU, Sigmoid, Tanh
|
|
```
|
|
|
|
**Why this matters:**
|
|
- **Learning:** Focused module for understanding gradients
|
|
- **Production:** Proper organization like PyTorch's `torch.autograd`
|
|
- **Consistency:** All gradient operations live together in `core.autograd`
|
|
- **Foundation:** Enables training for all neural networks
|
|
"""
|
|
|
|
# %% [markdown]
|
|
"""
|
|
## Step 1: What is Automatic Differentiation?
|
|
|
|
### Definition
|
|
**Automatic differentiation (autograd)** is a technique that automatically computes derivatives of functions represented as computational graphs. It's the magic that makes neural network training possible.
|
|
|
|
### The Fundamental Challenge: Computing Gradients at Scale
|
|
|
|
#### **The Problem**
|
|
Neural networks have millions or billions of parameters. To train them, we need to compute the gradient of the loss function with respect to every single parameter:
|
|
|
|
```python
|
|
# For a neural network with parameters θ = [w1, w2, ..., wn, b1, b2, ..., bm]
|
|
# We need to compute: ∇θ L = [∂L/∂w1, ∂L/∂w2, ..., ∂L/∂wn, ∂L/∂b1, ∂L/∂b2, ..., ∂L/∂bm]
|
|
```
|
|
|
|
#### **Why Manual Differentiation Fails**
|
|
- **Complexity**: Neural networks are compositions of thousands of operations
|
|
- **Error-prone**: Manual computation is extremely difficult and error-prone
|
|
- **Inflexible**: Every architecture change requires re-deriving gradients
|
|
- **Inefficient**: Manual computation doesn't exploit computational structure
|
|
|
|
#### **Why Numerical Differentiation is Inadequate**
|
|
```python
|
|
# Numerical differentiation: f'(x) ≈ (f(x + h) - f(x)) / h
|
|
def numerical_gradient(f, x, h=1e-5):
|
|
return (f(x + h) - f(x)) / h
|
|
```
|
|
|
|
Problems:
|
|
- **Slow**: Requires 2 function evaluations per parameter
|
|
- **Imprecise**: Numerical errors accumulate
|
|
- **Unstable**: Sensitive to choice of h
|
|
- **Expensive**: O(n) cost for n parameters
|
|
|
|
### The Solution: Computational Graphs
|
|
|
|
#### **Key Insight: Every Computation is a Graph**
|
|
Any mathematical expression can be represented as a directed acyclic graph (DAG):
|
|
|
|
```python
|
|
# Expression: f(x, y) = (x + y) * (x - y)
|
|
# Graph representation:
|
|
# x ──┐ ┌── add ──┐
|
|
# │ │ │
|
|
# ├─────┤ ├── multiply ── output
|
|
# │ │ │
|
|
# y ──┘ └── sub ──┘
|
|
```
|
|
|
|
#### **Forward Pass: Computing Values**
|
|
Traverse the graph from inputs to outputs, computing values at each node:
|
|
|
|
```python
|
|
# Forward pass for f(x, y) = (x + y) * (x - y)
|
|
x = 3, y = 2
|
|
add_result = x + y = 5
|
|
sub_result = x - y = 1
|
|
output = add_result * sub_result = 5
|
|
```
|
|
|
|
#### **Backward Pass: Computing Gradients**
|
|
Traverse the graph from outputs to inputs, computing gradients using the chain rule:
|
|
|
|
For $f(x, y) = (x + y) \cdot (x - y)$ with $x = 3, y = 2$:
|
|
|
|
$$\frac{\partial \text{output}}{\partial \text{multiply}} = 1$$
|
|
|
|
$$\frac{\partial \text{output}}{\partial \text{add}} = \frac{\partial \text{output}}{\partial \text{multiply}} \cdot \frac{\partial \text{multiply}}{\partial \text{add}} = 1 \cdot \text{sub\_result} = 1$$
|
|
|
|
$$\frac{\partial \text{output}}{\partial \text{sub}} = \frac{\partial \text{output}}{\partial \text{multiply}} \cdot \frac{\partial \text{multiply}}{\partial \text{sub}} = 1 \cdot \text{add\_result} = 5$$
|
|
|
|
$$\frac{\partial \text{output}}{\partial x} = \frac{\partial \text{output}}{\partial \text{add}} \cdot \frac{\partial \text{add}}{\partial x} + \frac{\partial \text{output}}{\partial \text{sub}} \cdot \frac{\partial \text{sub}}{\partial x} = 1 \cdot 1 + 5 \cdot 1 = 6$$
|
|
|
|
$$\frac{\partial \text{output}}{\partial y} = \frac{\partial \text{output}}{\partial \text{add}} \cdot \frac{\partial \text{add}}{\partial y} + \frac{\partial \text{output}}{\partial \text{sub}} \cdot \frac{\partial \text{sub}}{\partial y} = 1 \cdot 1 + 5 \cdot (-1) = -4$$
|
|
```
|
|
|
|
### Mathematical Foundation: The Chain Rule
|
|
|
|
#### **Single Variable Chain Rule**
|
|
For composite functions: If $z = f(g(x))$, then:
|
|
|
|
$$\frac{dz}{dx} = \frac{dz}{df} \cdot \frac{df}{dx}$$
|
|
|
|
#### **Multivariable Chain Rule**
|
|
For functions of multiple variables: If $z = f(x, y)$ where $x = g(t)$ and $y = h(t)$, then:
|
|
|
|
$$\frac{dz}{dt} = \frac{\partial z}{\partial x} \cdot \frac{dx}{dt} + \frac{\partial z}{\partial y} \cdot \frac{dy}{dt}$$
|
|
|
|
#### **Chain Rule in Computational Graphs**
|
|
For any path from input to output through intermediate nodes:
|
|
|
|
$$\frac{\partial \text{output}}{\partial \text{input}} = \prod_{i} \frac{\partial \text{node}_{i+1}}{\partial \text{node}_i}$$
|
|
|
|
### Automatic Differentiation Modes
|
|
|
|
#### **Forward Mode (Forward Accumulation)**
|
|
- **Process**: Compute derivatives alongside forward pass
|
|
- **Efficiency**: Efficient when #inputs << #outputs
|
|
- **Use case**: Jacobian-vector products, sensitivity analysis
|
|
|
|
#### **Reverse Mode (Backpropagation)**
|
|
- **Process**: Compute derivatives in reverse pass after forward pass
|
|
- **Efficiency**: Efficient when #outputs << #inputs
|
|
- **Use case**: Neural network training (many parameters, few outputs)
|
|
|
|
#### **Why Reverse Mode Dominates ML**
|
|
Neural networks typically have:
|
|
- **Many inputs**: Millions of parameters
|
|
- **Few outputs**: Single loss value or small output vector
|
|
- **Reverse mode**: O(1) cost per parameter vs O(n) for forward mode
|
|
|
|
### The Computational Graph Abstraction
|
|
|
|
#### **Nodes: Operations and Variables**
|
|
- **Variable nodes**: Store values and gradients
|
|
- **Operation nodes**: Define how to compute forward and backward passes
|
|
|
|
#### **Edges: Data Dependencies**
|
|
- **Forward edges**: Data flow from inputs to outputs
|
|
- **Backward edges**: Gradient flow from outputs to inputs
|
|
|
|
#### **Dynamic vs Static Graphs**
|
|
- **Static graphs**: Define once, execute many times (TensorFlow 1.x)
|
|
- **Dynamic graphs**: Build graph during execution (PyTorch, TensorFlow 2.x)
|
|
|
|
### Real-World Impact: What Autograd Enables
|
|
|
|
#### **Deep Learning Revolution**
|
|
```python
|
|
# Before autograd: Manual gradient computation
|
|
def manual_gradient(x, y, w1, w2, b1, b2):
|
|
# Forward pass
|
|
z1 = w1 * x + b1
|
|
a1 = sigmoid(z1)
|
|
z2 = w2 * a1 + b2
|
|
a2 = sigmoid(z2)
|
|
loss = (a2 - y) ** 2
|
|
|
|
# Backward pass (manual)
|
|
dloss_da2 = 2 * (a2 - y)
|
|
da2_dz2 = sigmoid_derivative(z2)
|
|
dz2_dw2 = a1
|
|
dz2_db2 = 1
|
|
dz2_da1 = w2
|
|
da1_dz1 = sigmoid_derivative(z1)
|
|
dz1_dw1 = x
|
|
dz1_db1 = 1
|
|
|
|
# Chain rule application
|
|
dloss_dw2 = dloss_da2 * da2_dz2 * dz2_dw2
|
|
dloss_db2 = dloss_da2 * da2_dz2 * dz2_db2
|
|
dloss_dw1 = dloss_da2 * da2_dz2 * dz2_da1 * da1_dz1 * dz1_dw1
|
|
dloss_db1 = dloss_da2 * da2_dz2 * dz2_da1 * da1_dz1 * dz1_db1
|
|
|
|
return dloss_dw1, dloss_db1, dloss_dw2, dloss_db2
|
|
|
|
# With autograd: Automatic gradient computation
|
|
def autograd_gradient(x, y, w1, w2, b1, b2):
|
|
# Forward pass with gradient tracking
|
|
z1 = w1 * x + b1
|
|
a1 = sigmoid(z1)
|
|
z2 = w2 * a1 + b2
|
|
a2 = sigmoid(z2)
|
|
loss = (a2 - y) ** 2
|
|
|
|
# Backward pass (automatic)
|
|
loss.backward()
|
|
|
|
return w1.grad, b1.grad, w2.grad, b2.grad
|
|
```
|
|
|
|
#### **Scientific Computing**
|
|
- **Optimization**: Gradient-based optimization algorithms
|
|
- **Inverse problems**: Parameter estimation from observations
|
|
- **Sensitivity analysis**: How outputs change with input perturbations
|
|
|
|
#### **Modern AI Applications**
|
|
- **Neural architecture search**: Differentiable architecture optimization
|
|
- **Meta-learning**: Learning to learn with gradient-based meta-algorithms
|
|
- **Differentiable programming**: Entire programs as differentiable functions
|
|
|
|
### Performance Considerations
|
|
|
|
#### **Memory Management**
|
|
- **Intermediate storage**: Must store forward pass results for backward pass
|
|
- **Memory optimization**: Checkpointing, gradient accumulation
|
|
- **Trade-offs**: Memory vs computation time
|
|
|
|
#### **Computational Efficiency**
|
|
- **Graph optimization**: Fuse operations, eliminate redundancy
|
|
- **Parallelization**: Compute independent gradients simultaneously
|
|
- **Hardware acceleration**: Specialized gradient computation on GPUs/TPUs
|
|
|
|
#### **Numerical Stability**
|
|
- **Gradient clipping**: Prevent exploding gradients
|
|
- **Numerical precision**: Balance between float16 and float32
|
|
- **Accumulation order**: Minimize numerical errors
|
|
|
|
### Connection to Neural Network Training
|
|
|
|
#### **The Training Loop**
|
|
```python
|
|
for epoch in range(num_epochs):
|
|
for batch in dataloader:
|
|
# Forward pass
|
|
predictions = model(batch.inputs)
|
|
loss = criterion(predictions, batch.targets)
|
|
|
|
# Backward pass (autograd)
|
|
loss.backward()
|
|
|
|
# Parameter update
|
|
optimizer.step()
|
|
optimizer.zero_grad()
|
|
```
|
|
|
|
#### **Gradient-Based Optimization**
|
|
- **Stochastic Gradient Descent**: Use gradients to update parameters
|
|
- **Adaptive methods**: Adam, RMSprop use gradient statistics
|
|
- **Second-order methods**: Use gradient and Hessian information
|
|
|
|
### Why Autograd is Revolutionary
|
|
|
|
#### **Democratization of Deep Learning**
|
|
- **Research acceleration**: Focus on architecture, not gradient computation
|
|
- **Experimentation**: Easy to try new ideas and architectures
|
|
- **Accessibility**: Researchers don't need to be differentiation experts
|
|
|
|
#### **Scalability**
|
|
- **Large models**: Handle millions/billions of parameters automatically
|
|
- **Complex architectures**: Support arbitrary computational graphs
|
|
- **Distributed training**: Coordinate gradients across multiple devices
|
|
|
|
Let's implement the Variable class that makes this magic possible!
|
|
"""
|
|
|
|
# %% [markdown]
|
|
"""
|
|
## Step 2: The Variable Class
|
|
|
|
### Core Concept
|
|
A **Variable** wraps a Tensor and tracks:
|
|
- **Data**: The actual values (forward pass)
|
|
- **Gradient**: The computed gradients (backward pass)
|
|
- **Computation history**: How this Variable was created
|
|
- **Backward function**: How to compute gradients
|
|
|
|
### Design Principles
|
|
- **Transparency**: Works seamlessly with existing Tensor operations
|
|
- **Efficiency**: Minimal overhead for forward pass
|
|
- **Flexibility**: Supports any differentiable operation
|
|
- **Correctness**: Implements the chain rule precisely
|
|
"""
|
|
|
|
# %% nbgrader={"grade": false, "grade_id": "variable-class", "locked": false, "schema_version": 3, "solution": true, "task": false}
|
|
#| export
|
|
class Variable:
|
|
"""
|
|
Variable: Tensor wrapper with automatic differentiation capabilities.
|
|
|
|
The fundamental class for gradient computation in TinyTorch.
|
|
Wraps Tensor objects and tracks computational history for backpropagation.
|
|
"""
|
|
|
|
def __init__(self, data: Union[Tensor, np.ndarray, list, float, int],
|
|
requires_grad: bool = True, grad_fn: Optional[Callable] = None):
|
|
"""
|
|
Create a Variable with gradient tracking.
|
|
|
|
Args:
|
|
data: The data to wrap (will be converted to Tensor)
|
|
requires_grad: Whether to compute gradients for this Variable
|
|
grad_fn: Function to compute gradients (None for leaf nodes)
|
|
|
|
TODO: Implement Variable initialization with gradient tracking.
|
|
|
|
APPROACH:
|
|
1. Convert data to Tensor if it's not already
|
|
2. Store the tensor data
|
|
3. Set gradient tracking flag
|
|
4. Initialize gradient to None (will be computed later)
|
|
5. Store the gradient function for backward pass
|
|
6. Track if this is a leaf node (no grad_fn)
|
|
|
|
EXAMPLE:
|
|
Variable(5.0) → Variable wrapping Tensor(5.0)
|
|
Variable([1, 2, 3]) → Variable wrapping Tensor([1, 2, 3])
|
|
|
|
HINTS:
|
|
- Use isinstance() to check if data is already a Tensor
|
|
- Store requires_grad, grad_fn, and is_leaf flags
|
|
- Initialize self.grad to None
|
|
- A leaf node has grad_fn=None
|
|
"""
|
|
### BEGIN SOLUTION
|
|
# Convert data to Tensor if needed
|
|
if isinstance(data, Tensor):
|
|
self.data = data
|
|
else:
|
|
self.data = Tensor(data)
|
|
|
|
# Set gradient tracking
|
|
self.requires_grad = requires_grad
|
|
self.grad = None # Will be initialized when needed
|
|
self.grad_fn = grad_fn
|
|
self.is_leaf = grad_fn is None
|
|
|
|
# For computational graph
|
|
self._backward_hooks = []
|
|
### END SOLUTION
|
|
|
|
@property
|
|
def shape(self) -> Tuple[int, ...]:
|
|
"""Get the shape of the underlying tensor."""
|
|
return self.data.shape
|
|
|
|
@property
|
|
def size(self) -> int:
|
|
"""Get the total number of elements."""
|
|
return self.data.size
|
|
|
|
def __repr__(self) -> str:
|
|
"""String representation of the Variable."""
|
|
grad_str = f", grad_fn={self.grad_fn.__name__}" if self.grad_fn else ""
|
|
return f"Variable({self.data.data.tolist()}, requires_grad={self.requires_grad}{grad_str})"
|
|
|
|
def backward(self, gradient: Optional['Variable'] = None) -> None:
|
|
"""
|
|
Compute gradients using backpropagation.
|
|
|
|
Args:
|
|
gradient: The gradient to backpropagate (defaults to ones)
|
|
|
|
TODO: Implement backward propagation.
|
|
|
|
APPROACH:
|
|
1. If gradient is None, create a gradient of ones with same shape
|
|
2. If this Variable doesn't require gradients, return early
|
|
3. If this is a leaf node, accumulate the gradient
|
|
4. If this has a grad_fn, call it to propagate gradients
|
|
|
|
EXAMPLE:
|
|
x = Variable(5.0)
|
|
y = x * 2
|
|
y.backward() # Computes x.grad = 2.0
|
|
|
|
HINTS:
|
|
- Use np.ones_like() to create default gradient
|
|
- Accumulate gradients with += for leaf nodes
|
|
- Call self.grad_fn(gradient) for non-leaf nodes
|
|
"""
|
|
### BEGIN SOLUTION
|
|
# Default gradient is ones
|
|
if gradient is None:
|
|
gradient = Variable(np.ones_like(self.data.data))
|
|
|
|
# Skip if gradients not required
|
|
if not self.requires_grad:
|
|
return
|
|
|
|
# Accumulate gradient for leaf nodes
|
|
if self.is_leaf:
|
|
if self.grad is None:
|
|
self.grad = Variable(np.zeros_like(self.data.data))
|
|
self.grad.data._data += gradient.data.data
|
|
else:
|
|
# Propagate gradients through grad_fn
|
|
if self.grad_fn is not None:
|
|
self.grad_fn(gradient)
|
|
### END SOLUTION
|
|
|
|
def zero_grad(self) -> None:
|
|
"""Zero out the gradient."""
|
|
if self.grad is not None:
|
|
self.grad.data._data.fill(0)
|
|
|
|
# Arithmetic operations with gradient tracking
|
|
def __add__(self, other: Union['Variable', float, int]) -> 'Variable':
|
|
"""Addition with gradient tracking."""
|
|
return add(self, other)
|
|
|
|
def __mul__(self, other: Union['Variable', float, int]) -> 'Variable':
|
|
"""Multiplication with gradient tracking."""
|
|
return multiply(self, other)
|
|
|
|
def __sub__(self, other: Union['Variable', float, int]) -> 'Variable':
|
|
"""Subtraction with gradient tracking."""
|
|
return subtract(self, other)
|
|
|
|
def __truediv__(self, other: Union['Variable', float, int]) -> 'Variable':
|
|
"""Division with gradient tracking."""
|
|
return divide(self, other)
|
|
|
|
# %% [markdown]
|
|
"""
|
|
## Step 3: Basic Operations with Gradients
|
|
|
|
### The Pattern
|
|
Every differentiable operation follows the same pattern:
|
|
1. **Forward pass**: Compute the result
|
|
2. **Create grad_fn**: Function that knows how to compute gradients
|
|
3. **Return Variable**: With the result and grad_fn
|
|
|
|
### Mathematical Rules
|
|
- **Addition**: $\frac{d(x + y)}{dx} = 1$, $\frac{d(x + y)}{dy} = 1$
|
|
- **Multiplication**: $\frac{d(x \cdot y)}{dx} = y$, $\frac{d(x \cdot y)}{dy} = x$
|
|
- **Subtraction**: $\frac{d(x - y)}{dx} = 1$, $\frac{d(x - y)}{dy} = -1$
|
|
- **Division**: $\frac{d(x / y)}{dx} = \frac{1}{y}$, $\frac{d(x / y)}{dy} = -\frac{x}{y^2}$
|
|
|
|
### Implementation Strategy
|
|
Each operation creates a closure that captures the input variables and implements the gradient computation rule.
|
|
"""
|
|
|
|
# %% nbgrader={"grade": false, "grade_id": "add-operation", "locked": false, "schema_version": 3, "solution": true, "task": false}
|
|
#| export
|
|
def add(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable:
|
|
"""
|
|
Addition operation with gradient tracking.
|
|
|
|
Args:
|
|
a: First operand
|
|
b: Second operand
|
|
|
|
Returns:
|
|
Variable with sum and gradient function
|
|
|
|
TODO: Implement addition with gradient computation.
|
|
|
|
APPROACH:
|
|
1. Convert inputs to Variables if needed
|
|
2. Compute forward pass: result = a + b
|
|
3. Create gradient function that distributes gradients
|
|
4. Return Variable with result and grad_fn
|
|
|
|
MATHEMATICAL RULE:
|
|
If z = x + y, then dz/dx = 1, dz/dy = 1
|
|
|
|
EXAMPLE:
|
|
x = Variable(2.0), y = Variable(3.0)
|
|
z = add(x, y) # z.data = 5.0
|
|
z.backward() # x.grad = 1.0, y.grad = 1.0
|
|
|
|
HINTS:
|
|
- Use isinstance() to check if inputs are Variables
|
|
- Create a closure that captures a and b
|
|
- In grad_fn, call a.backward() and b.backward() with appropriate gradients
|
|
"""
|
|
### BEGIN SOLUTION
|
|
# Convert to Variables if needed
|
|
if not isinstance(a, Variable):
|
|
a = Variable(a, requires_grad=False)
|
|
if not isinstance(b, Variable):
|
|
b = Variable(b, requires_grad=False)
|
|
|
|
# Forward pass
|
|
result_data = a.data + b.data
|
|
|
|
# Create gradient function
|
|
def grad_fn(grad_output):
|
|
# Addition distributes gradients equally
|
|
if a.requires_grad:
|
|
a.backward(grad_output)
|
|
if b.requires_grad:
|
|
b.backward(grad_output)
|
|
|
|
# Determine if result requires gradients
|
|
requires_grad = a.requires_grad or b.requires_grad
|
|
|
|
return Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn)
|
|
### END SOLUTION
|
|
|
|
# %% nbgrader={"grade": false, "grade_id": "multiply-operation", "locked": false, "schema_version": 3, "solution": true, "task": false}
|
|
#| export
|
|
def multiply(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable:
|
|
"""
|
|
Multiplication operation with gradient tracking.
|
|
|
|
Args:
|
|
a: First operand
|
|
b: Second operand
|
|
|
|
Returns:
|
|
Variable with product and gradient function
|
|
|
|
TODO: Implement multiplication with gradient computation.
|
|
|
|
APPROACH:
|
|
1. Convert inputs to Variables if needed
|
|
2. Compute forward pass: result = a * b
|
|
3. Create gradient function using product rule
|
|
4. Return Variable with result and grad_fn
|
|
|
|
MATHEMATICAL RULE:
|
|
If z = x * y, then dz/dx = y, dz/dy = x
|
|
|
|
EXAMPLE:
|
|
x = Variable(2.0), y = Variable(3.0)
|
|
z = multiply(x, y) # z.data = 6.0
|
|
z.backward() # x.grad = 3.0, y.grad = 2.0
|
|
|
|
HINTS:
|
|
- Store a.data and b.data for gradient computation
|
|
- In grad_fn, multiply incoming gradient by the other operand
|
|
- Handle broadcasting if shapes are different
|
|
"""
|
|
### BEGIN SOLUTION
|
|
# Convert to Variables if needed
|
|
if not isinstance(a, Variable):
|
|
a = Variable(a, requires_grad=False)
|
|
if not isinstance(b, Variable):
|
|
b = Variable(b, requires_grad=False)
|
|
|
|
# Forward pass
|
|
result_data = a.data * b.data
|
|
|
|
# Create gradient function
|
|
def grad_fn(grad_output):
|
|
# Product rule: d(xy)/dx = y, d(xy)/dy = x
|
|
if a.requires_grad:
|
|
a_grad = Variable(grad_output.data * b.data)
|
|
a.backward(a_grad)
|
|
if b.requires_grad:
|
|
b_grad = Variable(grad_output.data * a.data)
|
|
b.backward(b_grad)
|
|
|
|
# Determine if result requires gradients
|
|
requires_grad = a.requires_grad or b.requires_grad
|
|
|
|
return Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn)
|
|
### END SOLUTION
|
|
|
|
# %% nbgrader={"grade": false, "grade_id": "subtract-operation", "locked": false, "schema_version": 3, "solution": true, "task": false}
|
|
#| export
|
|
def subtract(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable:
|
|
"""
|
|
Subtraction operation with gradient tracking.
|
|
|
|
Args:
|
|
a: First operand (minuend)
|
|
b: Second operand (subtrahend)
|
|
|
|
Returns:
|
|
Variable with difference and gradient function
|
|
|
|
TODO: Implement subtraction with gradient computation.
|
|
|
|
APPROACH:
|
|
1. Convert inputs to Variables if needed
|
|
2. Compute forward pass: result = a - b
|
|
3. Create gradient function with correct signs
|
|
4. Return Variable with result and grad_fn
|
|
|
|
MATHEMATICAL RULE:
|
|
If z = x - y, then dz/dx = 1, dz/dy = -1
|
|
|
|
EXAMPLE:
|
|
x = Variable(5.0), y = Variable(3.0)
|
|
z = subtract(x, y) # z.data = 2.0
|
|
z.backward() # x.grad = 1.0, y.grad = -1.0
|
|
|
|
HINTS:
|
|
- Forward pass is straightforward: a - b
|
|
- Gradient for a is positive, for b is negative
|
|
- Remember to negate the gradient for b
|
|
"""
|
|
### BEGIN SOLUTION
|
|
# Convert to Variables if needed
|
|
if not isinstance(a, Variable):
|
|
a = Variable(a, requires_grad=False)
|
|
if not isinstance(b, Variable):
|
|
b = Variable(b, requires_grad=False)
|
|
|
|
# Forward pass
|
|
result_data = a.data - b.data
|
|
|
|
# Create gradient function
|
|
def grad_fn(grad_output):
|
|
# Subtraction rule: d(x-y)/dx = 1, d(x-y)/dy = -1
|
|
if a.requires_grad:
|
|
a.backward(grad_output)
|
|
if b.requires_grad:
|
|
b_grad = Variable(-grad_output.data.data)
|
|
b.backward(b_grad)
|
|
|
|
# Determine if result requires gradients
|
|
requires_grad = a.requires_grad or b.requires_grad
|
|
|
|
return Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn)
|
|
### END SOLUTION
|
|
|
|
# %% nbgrader={"grade": false, "grade_id": "divide-operation", "locked": false, "schema_version": 3, "solution": true, "task": false}
|
|
#| export
|
|
def divide(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable:
|
|
"""
|
|
Division operation with gradient tracking.
|
|
|
|
Args:
|
|
a: Numerator
|
|
b: Denominator
|
|
|
|
Returns:
|
|
Variable with quotient and gradient function
|
|
|
|
TODO: Implement division with gradient computation.
|
|
|
|
APPROACH:
|
|
1. Convert inputs to Variables if needed
|
|
2. Compute forward pass: result = a / b
|
|
3. Create gradient function using quotient rule
|
|
4. Return Variable with result and grad_fn
|
|
|
|
MATHEMATICAL RULE:
|
|
If z = x / y, then dz/dx = \frac{1}{y}, dz/dy = -\frac{x}{y^2}
|
|
|
|
EXAMPLE:
|
|
x = Variable(6.0), y = Variable(2.0)
|
|
z = divide(x, y) # z.data = 3.0
|
|
z.backward() # x.grad = 0.5, y.grad = -1.5
|
|
|
|
HINTS:
|
|
- Forward pass: a.data / b.data
|
|
- Gradient for a: grad_output / b.data
|
|
- Gradient for b: -grad_output * a.data / (b.data ** 2)
|
|
- Be careful with numerical stability
|
|
"""
|
|
### BEGIN SOLUTION
|
|
# Convert to Variables if needed
|
|
if not isinstance(a, Variable):
|
|
a = Variable(a, requires_grad=False)
|
|
if not isinstance(b, Variable):
|
|
b = Variable(b, requires_grad=False)
|
|
|
|
# Forward pass
|
|
result_data = a.data / b.data
|
|
|
|
# Create gradient function
|
|
def grad_fn(grad_output):
|
|
# Quotient rule: d(x/y)/dx = 1/y, d(x/y)/dy = -x/y²
|
|
if a.requires_grad:
|
|
a_grad = Variable(grad_output.data.data / b.data.data)
|
|
a.backward(a_grad)
|
|
if b.requires_grad:
|
|
b_grad = Variable(-grad_output.data.data * a.data.data / (b.data.data ** 2))
|
|
b.backward(b_grad)
|
|
|
|
# Determine if result requires gradients
|
|
requires_grad = a.requires_grad or b.requires_grad
|
|
|
|
return Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn)
|
|
### END SOLUTION
|
|
|
|
# %% [markdown]
|
|
"""
|
|
## Step 4: Testing Basic Operations
|
|
|
|
Let's test our basic operations to ensure they compute gradients correctly.
|
|
"""
|
|
|
|
# %% nbgrader={"grade": true, "grade_id": "test-basic-operations", "locked": true, "points": 25, "schema_version": 3, "solution": false, "task": false}
|
|
def test_basic_operations():
|
|
"""Test basic operations with gradient computation."""
|
|
print("🔬 Testing basic operations...")
|
|
|
|
# Test addition
|
|
print("📊 Testing addition...")
|
|
x = Variable(2.0, requires_grad=True)
|
|
y = Variable(3.0, requires_grad=True)
|
|
z = add(x, y)
|
|
|
|
assert abs(z.data.data.item() - 5.0) < 1e-6, f"Addition failed: expected 5.0, got {z.data.data.item()}"
|
|
|
|
z.backward()
|
|
assert abs(x.grad.data.data.item() - 1.0) < 1e-6, f"Addition gradient for x failed: expected 1.0, got {x.grad.data.data.item()}"
|
|
assert abs(y.grad.data.data.item() - 1.0) < 1e-6, f"Addition gradient for y failed: expected 1.0, got {y.grad.data.data.item()}"
|
|
print("✅ Addition test passed!")
|
|
|
|
# Test multiplication
|
|
print("📊 Testing multiplication...")
|
|
x = Variable(2.0, requires_grad=True)
|
|
y = Variable(3.0, requires_grad=True)
|
|
z = multiply(x, y)
|
|
|
|
assert abs(z.data.data.item() - 6.0) < 1e-6, f"Multiplication failed: expected 6.0, got {z.data.data.item()}"
|
|
|
|
z.backward()
|
|
assert abs(x.grad.data.data.item() - 3.0) < 1e-6, f"Multiplication gradient for x failed: expected 3.0, got {x.grad.data.data.item()}"
|
|
assert abs(y.grad.data.data.item() - 2.0) < 1e-6, f"Multiplication gradient for y failed: expected 2.0, got {y.grad.data.data.item()}"
|
|
print("✅ Multiplication test passed!")
|
|
|
|
# Test subtraction
|
|
print("📊 Testing subtraction...")
|
|
x = Variable(5.0, requires_grad=True)
|
|
y = Variable(3.0, requires_grad=True)
|
|
z = subtract(x, y)
|
|
|
|
assert abs(z.data.data.item() - 2.0) < 1e-6, f"Subtraction failed: expected 2.0, got {z.data.data.item()}"
|
|
|
|
z.backward()
|
|
assert abs(x.grad.data.data.item() - 1.0) < 1e-6, f"Subtraction gradient for x failed: expected 1.0, got {x.grad.data.data.item()}"
|
|
assert abs(y.grad.data.data.item() - (-1.0)) < 1e-6, f"Subtraction gradient for y failed: expected -1.0, got {y.grad.data.data.item()}"
|
|
print("✅ Subtraction test passed!")
|
|
|
|
# Test division
|
|
print("📊 Testing division...")
|
|
x = Variable(6.0, requires_grad=True)
|
|
y = Variable(2.0, requires_grad=True)
|
|
z = divide(x, y)
|
|
|
|
assert abs(z.data.data.item() - 3.0) < 1e-6, f"Division failed: expected 3.0, got {z.data.data.item()}"
|
|
|
|
z.backward()
|
|
assert abs(x.grad.data.data.item() - 0.5) < 1e-6, f"Division gradient for x failed: expected 0.5, got {x.grad.data.data.item()}"
|
|
assert abs(y.grad.data.data.item() - (-1.5)) < 1e-6, f"Division gradient for y failed: expected -1.5, got {y.grad.data.data.item()}"
|
|
print("✅ Division test passed!")
|
|
|
|
print("🎉 All basic operation tests passed!")
|
|
return True
|
|
|
|
# Run the test
|
|
success = test_basic_operations()
|
|
|
|
# %% [markdown]
|
|
"""
|
|
## Step 5: Chain Rule Testing
|
|
|
|
Let's test more complex expressions to ensure the chain rule works correctly.
|
|
"""
|
|
|
|
# %% nbgrader={"grade": true, "grade_id": "test-chain-rule", "locked": true, "points": 25, "schema_version": 3, "solution": false, "task": false}
|
|
def test_chain_rule():
|
|
"""Test chain rule with complex expressions."""
|
|
print("🔬 Testing chain rule...")
|
|
|
|
# Test: f(x, y) = (x + y) * (x - y) = x² - y²
|
|
print("📊 Testing f(x, y) = (x + y) * (x - y)...")
|
|
x = Variable(3.0, requires_grad=True)
|
|
y = Variable(2.0, requires_grad=True)
|
|
|
|
# Forward pass
|
|
sum_xy = add(x, y) # x + y = 5
|
|
diff_xy = subtract(x, y) # x - y = 1
|
|
result = multiply(sum_xy, diff_xy) # (x + y) * (x - y) = 5
|
|
|
|
assert abs(result.data.data.item() - 5.0) < 1e-6, f"Chain rule forward failed: expected 5.0, got {result.data.data.item()}"
|
|
|
|
# Backward pass
|
|
result.backward()
|
|
|
|
# Analytical gradients: df/dx = 2x = 6, df/dy = -2y = -4
|
|
expected_x_grad = 2 * 3.0 # 6.0
|
|
expected_y_grad = -2 * 2.0 # -4.0
|
|
|
|
assert abs(x.grad.data.data.item() - expected_x_grad) < 1e-6, f"Chain rule x gradient failed: expected {expected_x_grad}, got {x.grad.data.data.item()}"
|
|
assert abs(y.grad.data.data.item() - expected_y_grad) < 1e-6, f"Chain rule y gradient failed: expected {expected_y_grad}, got {y.grad.data.data.item()}"
|
|
print("✅ Chain rule test passed!")
|
|
|
|
# Test: f(x) = x * x * x (x³)
|
|
print("📊 Testing f(x) = x³...")
|
|
x = Variable(2.0, requires_grad=True)
|
|
|
|
# Forward pass
|
|
x_squared = multiply(x, x) # x²
|
|
x_cubed = multiply(x_squared, x) # x³
|
|
|
|
assert abs(x_cubed.data.data.item() - 8.0) < 1e-6, f"x³ forward failed: expected 8.0, got {x_cubed.data.data.item()}"
|
|
|
|
# Backward pass
|
|
x_cubed.backward()
|
|
|
|
# Analytical gradient: df/dx = 3x² = 12
|
|
expected_grad = 3 * (2.0 ** 2) # 12.0
|
|
|
|
assert abs(x.grad.data.data.item() - expected_grad) < 1e-6, f"x³ gradient failed: expected {expected_grad}, got {x.grad.data.data.item()}"
|
|
print("✅ x³ test passed!")
|
|
|
|
print("🎉 All chain rule tests passed!")
|
|
return True
|
|
|
|
# Run the test
|
|
success = test_chain_rule()
|
|
|
|
# %% [markdown]
|
|
"""
|
|
## Step 6: Activation Function Gradients
|
|
|
|
Now let's implement gradients for activation functions to integrate with our existing modules.
|
|
"""
|
|
|
|
# %% nbgrader={"grade": false, "grade_id": "relu-gradient", "locked": false, "schema_version": 3, "solution": true, "task": false}
|
|
#| export
|
|
def relu_with_grad(x: Variable) -> Variable:
|
|
"""
|
|
ReLU activation with gradient tracking.
|
|
|
|
Args:
|
|
x: Input Variable
|
|
|
|
Returns:
|
|
Variable with ReLU applied and gradient function
|
|
|
|
TODO: Implement ReLU with gradient computation.
|
|
|
|
APPROACH:
|
|
1. Compute forward pass: max(0, x)
|
|
2. Create gradient function using ReLU derivative
|
|
3. Return Variable with result and grad_fn
|
|
|
|
MATHEMATICAL RULE:
|
|
f(x) = max(0, x)
|
|
f'(x) = 1 if x > 0, else 0
|
|
|
|
EXAMPLE:
|
|
x = Variable([-1.0, 0.0, 1.0])
|
|
y = relu_with_grad(x) # y.data = [0.0, 0.0, 1.0]
|
|
y.backward() # x.grad = [0.0, 0.0, 1.0]
|
|
|
|
HINTS:
|
|
- Use np.maximum(0, x.data.data) for forward pass
|
|
- Use (x.data.data > 0) for gradient mask
|
|
- Only propagate gradients where input was positive
|
|
"""
|
|
### BEGIN SOLUTION
|
|
# Forward pass
|
|
result_data = Tensor(np.maximum(0, x.data.data))
|
|
|
|
# Create gradient function
|
|
def grad_fn(grad_output):
|
|
if x.requires_grad:
|
|
# ReLU derivative: 1 if x > 0, else 0
|
|
mask = (x.data.data > 0).astype(np.float32)
|
|
x_grad = Variable(grad_output.data.data * mask)
|
|
x.backward(x_grad)
|
|
|
|
return Variable(result_data, requires_grad=x.requires_grad, grad_fn=grad_fn)
|
|
### END SOLUTION
|
|
|
|
# %% nbgrader={"grade": false, "grade_id": "sigmoid-gradient", "locked": false, "schema_version": 3, "solution": true, "task": false}
|
|
#| export
|
|
def sigmoid_with_grad(x: Variable) -> Variable:
|
|
"""
|
|
Sigmoid activation with gradient tracking.
|
|
|
|
Args:
|
|
x: Input Variable
|
|
|
|
Returns:
|
|
Variable with sigmoid applied and gradient function
|
|
|
|
TODO: Implement sigmoid with gradient computation.
|
|
|
|
APPROACH:
|
|
1. Compute forward pass: 1 / (1 + exp(-x))
|
|
2. Create gradient function using sigmoid derivative
|
|
3. Return Variable with result and grad_fn
|
|
|
|
MATHEMATICAL RULE:
|
|
f(x) = 1 / (1 + exp(-x))
|
|
f'(x) = f(x) * (1 - f(x))
|
|
|
|
EXAMPLE:
|
|
x = Variable(0.0)
|
|
y = sigmoid_with_grad(x) # y.data = 0.5
|
|
y.backward() # x.grad = 0.25
|
|
|
|
HINTS:
|
|
- Use np.clip for numerical stability
|
|
- Store sigmoid output for gradient computation
|
|
- Gradient is sigmoid * (1 - sigmoid)
|
|
"""
|
|
### BEGIN SOLUTION
|
|
# Forward pass with numerical stability
|
|
clipped = np.clip(x.data.data, -500, 500)
|
|
sigmoid_output = 1.0 / (1.0 + np.exp(-clipped))
|
|
result_data = Tensor(sigmoid_output)
|
|
|
|
# Create gradient function
|
|
def grad_fn(grad_output):
|
|
if x.requires_grad:
|
|
# Sigmoid derivative: sigmoid * (1 - sigmoid)
|
|
sigmoid_grad = sigmoid_output * (1.0 - sigmoid_output)
|
|
x_grad = Variable(grad_output.data.data * sigmoid_grad)
|
|
x.backward(x_grad)
|
|
|
|
return Variable(result_data, requires_grad=x.requires_grad, grad_fn=grad_fn)
|
|
### END SOLUTION
|
|
|
|
# %% [markdown]
|
|
"""
|
|
## Step 7: Integration Testing
|
|
|
|
Let's test our autograd system with a simple neural network scenario.
|
|
"""
|
|
|
|
# %% nbgrader={"grade": true, "grade_id": "test-integration", "locked": true, "points": 25, "schema_version": 3, "solution": false, "task": false}
|
|
def test_integration():
|
|
"""Test autograd integration with neural network scenario."""
|
|
print("🔬 Testing autograd integration...")
|
|
|
|
# Simple neural network: input -> linear -> ReLU -> output
|
|
print("📊 Testing simple neural network...")
|
|
|
|
# Input
|
|
x = Variable(2.0, requires_grad=True)
|
|
|
|
# Weights and bias
|
|
w1 = Variable(0.5, requires_grad=True)
|
|
b1 = Variable(0.1, requires_grad=True)
|
|
w2 = Variable(1.5, requires_grad=True)
|
|
|
|
# Forward pass
|
|
linear1 = add(multiply(x, w1), b1) # x * w1 + b1 = 2*0.5 + 0.1 = 1.1
|
|
activation1 = relu_with_grad(linear1) # ReLU(1.1) = 1.1
|
|
output = multiply(activation1, w2) # 1.1 * 1.5 = 1.65
|
|
|
|
# Check forward pass
|
|
expected_output = 1.65
|
|
assert abs(output.data.data.item() - expected_output) < 1e-6, f"Integration forward failed: expected {expected_output}, got {output.data.data.item()}"
|
|
|
|
# Backward pass
|
|
output.backward()
|
|
|
|
# Check gradients
|
|
# dL/dx = dL/doutput * doutput/dactivation1 * dactivation1/dlinear1 * dlinear1/dx
|
|
# = 1 * w2 * 1 * w1 = 1.5 * 0.5 = 0.75
|
|
expected_x_grad = 0.75
|
|
assert abs(x.grad.data.data.item() - expected_x_grad) < 1e-6, f"Integration x gradient failed: expected {expected_x_grad}, got {x.grad.data.data.item()}"
|
|
|
|
# dL/dw1 = dL/doutput * doutput/dactivation1 * dactivation1/dlinear1 * dlinear1/dw1
|
|
# = 1 * w2 * 1 * x = 1.5 * 2.0 = 3.0
|
|
expected_w1_grad = 3.0
|
|
assert abs(w1.grad.data.data.item() - expected_w1_grad) < 1e-6, f"Integration w1 gradient failed: expected {expected_w1_grad}, got {w1.grad.data.data.item()}"
|
|
|
|
# dL/db1 = dL/doutput * doutput/dactivation1 * dactivation1/dlinear1 * dlinear1/db1
|
|
# = 1 * w2 * 1 * 1 = 1.5
|
|
expected_b1_grad = 1.5
|
|
assert abs(b1.grad.data.data.item() - expected_b1_grad) < 1e-6, f"Integration b1 gradient failed: expected {expected_b1_grad}, got {b1.grad.data.data.item()}"
|
|
|
|
# dL/dw2 = dL/doutput * doutput/dw2 = 1 * activation1 = 1.1
|
|
expected_w2_grad = 1.1
|
|
assert abs(w2.grad.data.data.item() - expected_w2_grad) < 1e-6, f"Integration w2 gradient failed: expected {expected_w2_grad}, got {w2.grad.data.data.item()}"
|
|
|
|
print("✅ Integration test passed!")
|
|
print("🎉 All autograd tests passed!")
|
|
return True
|
|
|
|
# Run the test
|
|
success = test_integration()
|
|
|
|
# %% [markdown]
|
|
"""
|
|
## 🎯 Module Summary
|
|
|
|
Congratulations! You've successfully implemented automatic differentiation for TinyTorch:
|
|
|
|
### What You've Accomplished
|
|
✅ **Variable Class**: Tensor wrapper with gradient tracking and computational graph
|
|
✅ **Basic Operations**: Addition, multiplication, subtraction, division with gradients
|
|
✅ **Chain Rule**: Automatic gradient computation through complex expressions
|
|
✅ **Activation Functions**: ReLU and Sigmoid with proper gradient computation
|
|
✅ **Integration**: Works seamlessly with neural network scenarios
|
|
|
|
### Key Concepts You've Learned
|
|
- **Computational graphs** represent mathematical expressions as directed graphs
|
|
- **Forward pass** computes function values following the graph
|
|
- **Backward pass** computes gradients using the chain rule in reverse
|
|
- **Gradient functions** capture how to compute gradients for each operation
|
|
- **Variable tracking** enables automatic differentiation of any expression
|
|
|
|
### Mathematical Foundations
|
|
- **Chain rule**: The fundamental principle behind backpropagation
|
|
- **Partial derivatives**: How gradients flow through operations
|
|
- **Computational efficiency**: Reusing forward pass results in backward pass
|
|
- **Numerical stability**: Handling edge cases in gradient computation
|
|
|
|
### Real-World Applications
|
|
- **Neural network training**: Backpropagation through layers
|
|
- **Optimization**: Gradient descent and advanced optimizers
|
|
- **Scientific computing**: Sensitivity analysis and inverse problems
|
|
- **Machine learning**: Any gradient-based learning algorithm
|
|
|
|
### Next Steps
|
|
1. **Export your code**: `tito package nbdev --export 07_autograd`
|
|
2. **Test your implementation**: `tito module test 07_autograd`
|
|
3. **Use your autograd**:
|
|
```python
|
|
from tinytorch.core.autograd import Variable
|
|
|
|
x = Variable(2.0, requires_grad=True)
|
|
y = x**2 + 3*x + 1
|
|
y.backward()
|
|
print(x.grad) # Your gradients in action!
|
|
```
|
|
4. **Move to Module 8**: Start building training loops and optimizers!
|
|
|
|
**Ready for the next challenge?** Let's use your autograd system to build complete training pipelines!
|
|
"""
|
|
|
|
# %% [markdown]
|
|
"""
|
|
## Step 8: Performance Optimizations and Advanced Features
|
|
|
|
### Memory Management
|
|
- **Gradient Accumulation**: Efficient in-place gradient updates
|
|
- **Computational Graph Cleanup**: Release intermediate values when possible
|
|
- **Lazy Evaluation**: Compute gradients only when needed
|
|
|
|
### Numerical Stability
|
|
- **Gradient Clipping**: Prevent exploding gradients
|
|
- **Numerical Precision**: Handle edge cases gracefully
|
|
- **Overflow Protection**: Clip extreme values
|
|
|
|
### Advanced Features
|
|
- **Higher-Order Gradients**: Gradients of gradients
|
|
- **Gradient Checkpointing**: Memory-efficient backpropagation
|
|
- **Custom Operations**: Framework for user-defined differentiable functions
|
|
"""
|
|
|
|
# %% nbgrader={"grade": false, "grade_id": "advanced-features", "locked": false, "schema_version": 3, "solution": true, "task": false}
|
|
#| export
|
|
def power(base: Variable, exponent: Union[float, int]) -> Variable:
|
|
"""
|
|
Power operation with gradient tracking: base^exponent.
|
|
|
|
Args:
|
|
base: Base Variable
|
|
exponent: Exponent (scalar)
|
|
|
|
Returns:
|
|
Variable with power applied and gradient function
|
|
|
|
TODO: Implement power operation with gradient computation.
|
|
|
|
APPROACH:
|
|
1. Compute forward pass: base^exponent
|
|
2. Create gradient function using power rule
|
|
3. Return Variable with result and grad_fn
|
|
|
|
MATHEMATICAL RULE:
|
|
If z = x^n, then dz/dx = n * x^(n-1)
|
|
|
|
EXAMPLE:
|
|
x = Variable(2.0)
|
|
y = power(x, 3) # y.data = 8.0
|
|
y.backward() # x.grad = 3 * 2^2 = 12.0
|
|
|
|
HINTS:
|
|
- Use np.power() for forward pass
|
|
- Power rule: gradient = exponent * base^(exponent-1)
|
|
- Handle edge cases like exponent=0 or base=0
|
|
"""
|
|
### BEGIN SOLUTION
|
|
# Forward pass
|
|
result_data = Tensor(np.power(base.data.data, exponent))
|
|
|
|
# Create gradient function
|
|
def grad_fn(grad_output):
|
|
if base.requires_grad:
|
|
# Power rule: d(x^n)/dx = n * x^(n-1)
|
|
if exponent == 0:
|
|
# Special case: derivative of constant is 0
|
|
base_grad = Variable(np.zeros_like(base.data.data))
|
|
else:
|
|
base_grad_data = exponent * np.power(base.data.data, exponent - 1)
|
|
base_grad = Variable(grad_output.data.data * base_grad_data)
|
|
base.backward(base_grad)
|
|
|
|
return Variable(result_data, requires_grad=base.requires_grad, grad_fn=grad_fn)
|
|
### END SOLUTION
|
|
|
|
# %% nbgrader={"grade": false, "grade_id": "exp-operation", "locked": false, "schema_version": 3, "solution": true, "task": false}
|
|
#| export
|
|
def exp(x: Variable) -> Variable:
|
|
"""
|
|
Exponential operation with gradient tracking: e^x.
|
|
|
|
Args:
|
|
x: Input Variable
|
|
|
|
Returns:
|
|
Variable with exponential applied and gradient function
|
|
|
|
TODO: Implement exponential operation with gradient computation.
|
|
|
|
APPROACH:
|
|
1. Compute forward pass: e^x
|
|
2. Create gradient function using exponential derivative
|
|
3. Return Variable with result and grad_fn
|
|
|
|
MATHEMATICAL RULE:
|
|
If z = e^x, then dz/dx = e^x
|
|
|
|
EXAMPLE:
|
|
x = Variable(1.0)
|
|
y = exp(x) # y.data = e^1 ≈ 2.718
|
|
y.backward() # x.grad = e^1 ≈ 2.718
|
|
|
|
HINTS:
|
|
- Use np.exp() for forward pass
|
|
- Exponential derivative is itself: d(e^x)/dx = e^x
|
|
- Store result for gradient computation
|
|
"""
|
|
### BEGIN SOLUTION
|
|
# Forward pass
|
|
exp_result = np.exp(x.data.data)
|
|
result_data = Tensor(exp_result)
|
|
|
|
# Create gradient function
|
|
def grad_fn(grad_output):
|
|
if x.requires_grad:
|
|
# Exponential derivative: d(e^x)/dx = e^x
|
|
x_grad = Variable(grad_output.data.data * exp_result)
|
|
x.backward(x_grad)
|
|
|
|
return Variable(result_data, requires_grad=x.requires_grad, grad_fn=grad_fn)
|
|
### END SOLUTION
|
|
|
|
# %% nbgrader={"grade": false, "grade_id": "log-operation", "locked": false, "schema_version": 3, "solution": true, "task": false}
|
|
#| export
|
|
def log(x: Variable) -> Variable:
|
|
"""
|
|
Natural logarithm operation with gradient tracking: ln(x).
|
|
|
|
Args:
|
|
x: Input Variable
|
|
|
|
Returns:
|
|
Variable with logarithm applied and gradient function
|
|
|
|
TODO: Implement logarithm operation with gradient computation.
|
|
|
|
APPROACH:
|
|
1. Compute forward pass: ln(x)
|
|
2. Create gradient function using logarithm derivative
|
|
3. Return Variable with result and grad_fn
|
|
|
|
MATHEMATICAL RULE:
|
|
If z = ln(x), then dz/dx = 1/x
|
|
|
|
EXAMPLE:
|
|
x = Variable(2.0)
|
|
y = log(x) # y.data = ln(2) ≈ 0.693
|
|
y.backward() # x.grad = 1/2 = 0.5
|
|
|
|
HINTS:
|
|
- Use np.log() for forward pass
|
|
- Logarithm derivative: d(ln(x))/dx = 1/x
|
|
- Handle numerical stability for small x
|
|
"""
|
|
### BEGIN SOLUTION
|
|
# Forward pass with numerical stability
|
|
clipped_x = np.clip(x.data.data, 1e-8, np.inf) # Avoid log(0)
|
|
result_data = Tensor(np.log(clipped_x))
|
|
|
|
# Create gradient function
|
|
def grad_fn(grad_output):
|
|
if x.requires_grad:
|
|
# Logarithm derivative: d(ln(x))/dx = 1/x
|
|
x_grad = Variable(grad_output.data.data / clipped_x)
|
|
x.backward(x_grad)
|
|
|
|
return Variable(result_data, requires_grad=x.requires_grad, grad_fn=grad_fn)
|
|
### END SOLUTION
|
|
|
|
# %% nbgrader={"grade": false, "grade_id": "sum-operation", "locked": false, "schema_version": 3, "solution": true, "task": false}
|
|
#| export
|
|
def sum_all(x: Variable) -> Variable:
|
|
"""
|
|
Sum all elements operation with gradient tracking.
|
|
|
|
Args:
|
|
x: Input Variable
|
|
|
|
Returns:
|
|
Variable with sum and gradient function
|
|
|
|
TODO: Implement sum operation with gradient computation.
|
|
|
|
APPROACH:
|
|
1. Compute forward pass: sum of all elements
|
|
2. Create gradient function that broadcasts gradient back
|
|
3. Return Variable with result and grad_fn
|
|
|
|
MATHEMATICAL RULE:
|
|
If z = sum(x), then dz/dx_i = 1 for all i
|
|
|
|
EXAMPLE:
|
|
x = Variable([[1, 2], [3, 4]])
|
|
y = sum_all(x) # y.data = 10
|
|
y.backward() # x.grad = [[1, 1], [1, 1]]
|
|
|
|
HINTS:
|
|
- Use np.sum() for forward pass
|
|
- Gradient is ones with same shape as input
|
|
- This is used for loss computation
|
|
"""
|
|
### BEGIN SOLUTION
|
|
# Forward pass
|
|
result_data = Tensor(np.sum(x.data.data))
|
|
|
|
# Create gradient function
|
|
def grad_fn(grad_output):
|
|
if x.requires_grad:
|
|
# Sum gradient: broadcasts to all elements
|
|
x_grad = Variable(grad_output.data.data * np.ones_like(x.data.data))
|
|
x.backward(x_grad)
|
|
|
|
return Variable(result_data, requires_grad=x.requires_grad, grad_fn=grad_fn)
|
|
### END SOLUTION
|
|
|
|
# %% nbgrader={"grade": false, "grade_id": "mean-operation", "locked": false, "schema_version": 3, "solution": true, "task": false}
|
|
#| export
|
|
def mean(x: Variable) -> Variable:
|
|
"""
|
|
Mean operation with gradient tracking.
|
|
|
|
Args:
|
|
x: Input Variable
|
|
|
|
Returns:
|
|
Variable with mean and gradient function
|
|
|
|
TODO: Implement mean operation with gradient computation.
|
|
|
|
APPROACH:
|
|
1. Compute forward pass: mean of all elements
|
|
2. Create gradient function that distributes gradient evenly
|
|
3. Return Variable with result and grad_fn
|
|
|
|
MATHEMATICAL RULE:
|
|
If z = mean(x), then dz/dx_i = 1/n for all i (where n is number of elements)
|
|
|
|
EXAMPLE:
|
|
x = Variable([[1, 2], [3, 4]])
|
|
y = mean(x) # y.data = 2.5
|
|
y.backward() # x.grad = [[0.25, 0.25], [0.25, 0.25]]
|
|
|
|
HINTS:
|
|
- Use np.mean() for forward pass
|
|
- Gradient is 1/n for each element
|
|
- This is commonly used for loss computation
|
|
"""
|
|
### BEGIN SOLUTION
|
|
# Forward pass
|
|
result_data = Tensor(np.mean(x.data.data))
|
|
|
|
# Create gradient function
|
|
def grad_fn(grad_output):
|
|
if x.requires_grad:
|
|
# Mean gradient: 1/n for each element
|
|
n = x.data.size
|
|
x_grad = Variable(grad_output.data.data * np.ones_like(x.data.data) / n)
|
|
x.backward(x_grad)
|
|
|
|
return Variable(result_data, requires_grad=x.requires_grad, grad_fn=grad_fn)
|
|
### END SOLUTION
|
|
|
|
# %% [markdown]
|
|
"""
|
|
## Step 9: Gradient Utilities and Helper Functions
|
|
|
|
### Gradient Management
|
|
- **Gradient Clipping**: Prevent exploding gradients
|
|
- **Gradient Checking**: Verify gradient correctness
|
|
- **Parameter Collection**: Gather all parameters for optimization
|
|
|
|
### Debugging Tools
|
|
- **Gradient Visualization**: Inspect gradient flow
|
|
- **Computational Graph**: Visualize the computation graph
|
|
- **Gradient Statistics**: Monitor gradient magnitudes
|
|
"""
|
|
|
|
# %% nbgrader={"grade": false, "grade_id": "gradient-utilities", "locked": false, "schema_version": 3, "solution": true, "task": false}
|
|
#| export
|
|
def clip_gradients(variables: List[Variable], max_norm: float = 1.0) -> None:
|
|
"""
|
|
Clip gradients to prevent exploding gradients.
|
|
|
|
Args:
|
|
variables: List of Variables to clip gradients for
|
|
max_norm: Maximum gradient norm allowed
|
|
|
|
TODO: Implement gradient clipping.
|
|
|
|
APPROACH:
|
|
1. Compute total gradient norm across all variables
|
|
2. If norm exceeds max_norm, scale all gradients down
|
|
3. Modify gradients in-place
|
|
|
|
MATHEMATICAL RULE:
|
|
If ||g|| > max_norm, then g := g * (max_norm / ||g||)
|
|
|
|
EXAMPLE:
|
|
variables = [w1, w2, b1, b2]
|
|
clip_gradients(variables, max_norm=1.0)
|
|
|
|
HINTS:
|
|
- Compute L2 norm of all gradients combined
|
|
- Scale factor = max_norm / total_norm
|
|
- Only clip if total_norm > max_norm
|
|
"""
|
|
### BEGIN SOLUTION
|
|
# Compute total gradient norm
|
|
total_norm = 0.0
|
|
for var in variables:
|
|
if var.grad is not None:
|
|
total_norm += np.sum(var.grad.data.data ** 2)
|
|
total_norm = np.sqrt(total_norm)
|
|
|
|
# Clip if necessary
|
|
if total_norm > max_norm:
|
|
scale_factor = max_norm / total_norm
|
|
for var in variables:
|
|
if var.grad is not None:
|
|
var.grad.data._data *= scale_factor
|
|
### END SOLUTION
|
|
|
|
# %% nbgrader={"grade": false, "grade_id": "collect-parameters", "locked": false, "schema_version": 3, "solution": true, "task": false}
|
|
#| export
|
|
def collect_parameters(*modules) -> List[Variable]:
|
|
"""
|
|
Collect all parameters from modules for optimization.
|
|
|
|
Args:
|
|
*modules: Variable number of modules/objects with parameters
|
|
|
|
Returns:
|
|
List of all Variables that require gradients
|
|
|
|
TODO: Implement parameter collection.
|
|
|
|
APPROACH:
|
|
1. Iterate through all provided modules
|
|
2. Find all Variable attributes that require gradients
|
|
3. Return list of all such Variables
|
|
|
|
EXAMPLE:
|
|
layer1 = SomeLayer()
|
|
layer2 = SomeLayer()
|
|
params = collect_parameters(layer1, layer2)
|
|
|
|
HINTS:
|
|
- Use hasattr() and getattr() to find Variable attributes
|
|
- Check if attribute is Variable and requires_grad
|
|
- Handle different module types gracefully
|
|
"""
|
|
### BEGIN SOLUTION
|
|
parameters = []
|
|
for module in modules:
|
|
if hasattr(module, '__dict__'):
|
|
for attr_name, attr_value in module.__dict__.items():
|
|
if isinstance(attr_value, Variable) and attr_value.requires_grad:
|
|
parameters.append(attr_value)
|
|
return parameters
|
|
### END SOLUTION
|
|
|
|
# %% nbgrader={"grade": false, "grade_id": "zero-gradients", "locked": false, "schema_version": 3, "solution": true, "task": false}
|
|
#| export
|
|
def zero_gradients(variables: List[Variable]) -> None:
|
|
"""
|
|
Zero out gradients for all variables.
|
|
|
|
Args:
|
|
variables: List of Variables to zero gradients for
|
|
|
|
TODO: Implement gradient zeroing.
|
|
|
|
APPROACH:
|
|
1. Iterate through all variables
|
|
2. Call zero_grad() on each variable
|
|
3. Handle None gradients gracefully
|
|
|
|
EXAMPLE:
|
|
parameters = [w1, w2, b1, b2]
|
|
zero_gradients(parameters)
|
|
|
|
HINTS:
|
|
- Use the zero_grad() method on each Variable
|
|
- Check if variable has gradients before zeroing
|
|
- This is typically called before each training step
|
|
"""
|
|
### BEGIN SOLUTION
|
|
for var in variables:
|
|
if var.grad is not None:
|
|
var.zero_grad()
|
|
### END SOLUTION
|
|
|
|
# %% [markdown]
|
|
"""
|
|
## Step 10: Advanced Testing
|
|
|
|
Let's test our advanced features and optimizations.
|
|
"""
|
|
|
|
# %% nbgrader={"grade": true, "grade_id": "test-advanced-operations", "locked": true, "points": 20, "schema_version": 3, "solution": false, "task": false}
|
|
def test_advanced_operations():
|
|
"""Test advanced mathematical operations."""
|
|
print("🔬 Testing advanced operations...")
|
|
|
|
# Test power operation
|
|
print("📊 Testing power operation...")
|
|
x = Variable(2.0, requires_grad=True)
|
|
y = power(x, 3) # x^3
|
|
|
|
assert abs(y.data.data.item() - 8.0) < 1e-6, f"Power forward failed: expected 8.0, got {y.data.data.item()}"
|
|
|
|
y.backward()
|
|
# Gradient: d(x^3)/dx = 3x^2 = 3 * 4 = 12
|
|
assert abs(x.grad.data.data.item() - 12.0) < 1e-6, f"Power gradient failed: expected 12.0, got {x.grad.data.data.item()}"
|
|
print("✅ Power operation test passed!")
|
|
|
|
# Test exponential operation
|
|
print("📊 Testing exponential operation...")
|
|
x = Variable(1.0, requires_grad=True)
|
|
y = exp(x) # e^x
|
|
|
|
expected_exp = np.exp(1.0)
|
|
assert abs(y.data.data.item() - expected_exp) < 1e-6, f"Exp forward failed: expected {expected_exp}, got {y.data.data.item()}"
|
|
|
|
y.backward()
|
|
# Gradient: d(e^x)/dx = e^x
|
|
assert abs(x.grad.data.data.item() - expected_exp) < 1e-6, f"Exp gradient failed: expected {expected_exp}, got {x.grad.data.data.item()}"
|
|
print("✅ Exponential operation test passed!")
|
|
|
|
# Test logarithm operation
|
|
print("📊 Testing logarithm operation...")
|
|
x = Variable(2.0, requires_grad=True)
|
|
y = log(x) # ln(x)
|
|
|
|
expected_log = np.log(2.0)
|
|
assert abs(y.data.data.item() - expected_log) < 1e-6, f"Log forward failed: expected {expected_log}, got {y.data.data.item()}"
|
|
|
|
y.backward()
|
|
# Gradient: d(ln(x))/dx = 1/x = 1/2 = 0.5
|
|
assert abs(x.grad.data.data.item() - 0.5) < 1e-6, f"Log gradient failed: expected 0.5, got {x.grad.data.data.item()}"
|
|
print("✅ Logarithm operation test passed!")
|
|
|
|
# Test sum operation
|
|
print("📊 Testing sum operation...")
|
|
x = Variable([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
|
|
y = sum_all(x) # sum of all elements
|
|
|
|
assert abs(y.data.data.item() - 10.0) < 1e-6, f"Sum forward failed: expected 10.0, got {y.data.data.item()}"
|
|
|
|
y.backward()
|
|
# Gradient: all elements should be 1
|
|
expected_grad = np.ones((2, 2))
|
|
np.testing.assert_array_almost_equal(x.grad.data.data, expected_grad)
|
|
print("✅ Sum operation test passed!")
|
|
|
|
# Test mean operation
|
|
print("📊 Testing mean operation...")
|
|
x = Variable([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
|
|
y = mean(x) # mean of all elements
|
|
|
|
assert abs(y.data.data.item() - 2.5) < 1e-6, f"Mean forward failed: expected 2.5, got {y.data.data.item()}"
|
|
|
|
y.backward()
|
|
# Gradient: all elements should be 1/4 = 0.25
|
|
expected_grad = np.ones((2, 2)) * 0.25
|
|
np.testing.assert_array_almost_equal(x.grad.data.data, expected_grad)
|
|
print("✅ Mean operation test passed!")
|
|
|
|
print("🎉 All advanced operation tests passed!")
|
|
return True
|
|
|
|
# Run the test
|
|
success = test_advanced_operations()
|
|
|
|
# %% nbgrader={"grade": true, "grade_id": "test-gradient-utilities", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false}
|
|
def test_gradient_utilities():
|
|
"""Test gradient utility functions."""
|
|
print("🔬 Testing gradient utilities...")
|
|
|
|
# Test gradient clipping
|
|
print("📊 Testing gradient clipping...")
|
|
x = Variable(1.0, requires_grad=True)
|
|
y = Variable(1.0, requires_grad=True)
|
|
|
|
# Create large gradients
|
|
z = multiply(x, 10.0) # Large gradient for x
|
|
w = multiply(y, 10.0) # Large gradient for y
|
|
loss = add(z, w)
|
|
loss.backward()
|
|
|
|
# Check gradients are large before clipping
|
|
assert abs(x.grad.data.data.item() - 10.0) < 1e-6
|
|
assert abs(y.grad.data.data.item() - 10.0) < 1e-6
|
|
|
|
# Clip gradients
|
|
clip_gradients([x, y], max_norm=1.0)
|
|
|
|
# Check gradients are clipped
|
|
total_norm = np.sqrt(x.grad.data.data.item()**2 + y.grad.data.data.item()**2)
|
|
assert abs(total_norm - 1.0) < 1e-6, f"Gradient clipping failed: total norm {total_norm}, expected 1.0"
|
|
print("✅ Gradient clipping test passed!")
|
|
|
|
# Test zero gradients
|
|
print("📊 Testing zero gradients...")
|
|
# Gradients should be non-zero before zeroing
|
|
assert abs(x.grad.data.data.item()) > 1e-6
|
|
assert abs(y.grad.data.data.item()) > 1e-6
|
|
|
|
# Zero gradients
|
|
zero_gradients([x, y])
|
|
|
|
# Check gradients are zero
|
|
assert abs(x.grad.data.data.item()) < 1e-6
|
|
assert abs(y.grad.data.data.item()) < 1e-6
|
|
print("✅ Zero gradients test passed!")
|
|
|
|
print("🎉 All gradient utility tests passed!")
|
|
return True
|
|
|
|
# Run the test
|
|
success = test_gradient_utilities()
|
|
|
|
# %% [markdown]
|
|
"""
|
|
## Step 11: Complete ML Pipeline Example
|
|
|
|
Let's demonstrate a complete machine learning pipeline using our autograd system.
|
|
"""
|
|
|
|
# %% nbgrader={"grade": true, "grade_id": "test-complete-pipeline", "locked": true, "points": 20, "schema_version": 3, "solution": false, "task": false}
|
|
def test_complete_ml_pipeline():
|
|
"""Test complete ML pipeline with autograd."""
|
|
print("🔬 Testing complete ML pipeline...")
|
|
|
|
# Create a simple regression problem: y = 2x + 1 + noise
|
|
print("📊 Setting up regression problem...")
|
|
|
|
# Training data
|
|
x_data = [1.0, 2.0, 3.0, 4.0, 5.0]
|
|
y_data = [3.1, 4.9, 7.2, 9.1, 10.8] # Approximately 2x + 1 with noise
|
|
|
|
# Model parameters
|
|
w = Variable(0.1, requires_grad=True) # Weight
|
|
b = Variable(0.0, requires_grad=True) # Bias
|
|
|
|
# Training loop
|
|
learning_rate = 0.01
|
|
num_epochs = 100
|
|
|
|
print("📊 Training model...")
|
|
for epoch in range(num_epochs):
|
|
total_loss = Variable(0.0, requires_grad=False)
|
|
|
|
# Forward pass for all data points
|
|
for x_val, y_val in zip(x_data, y_data):
|
|
x = Variable(x_val, requires_grad=False)
|
|
y_target = Variable(y_val, requires_grad=False)
|
|
|
|
# Prediction: y_pred = w * x + b
|
|
y_pred = add(multiply(w, x), b)
|
|
|
|
# Loss: MSE = (y_pred - y_target)^2
|
|
diff = subtract(y_pred, y_target)
|
|
loss = multiply(diff, diff)
|
|
|
|
# Accumulate loss
|
|
total_loss = add(total_loss, loss)
|
|
|
|
# Backward pass
|
|
total_loss.backward()
|
|
|
|
# Update parameters
|
|
w.data._data -= learning_rate * w.grad.data.data
|
|
b.data._data -= learning_rate * b.grad.data.data
|
|
|
|
# Zero gradients for next iteration
|
|
zero_gradients([w, b])
|
|
|
|
# Print progress
|
|
if epoch % 20 == 0:
|
|
print(f" Epoch {epoch}: Loss = {total_loss.data.data.item():.4f}, w = {w.data.data.item():.4f}, b = {b.data.data.item():.4f}")
|
|
|
|
# Check final parameters
|
|
print("📊 Checking final parameters...")
|
|
final_w = w.data.data.item()
|
|
final_b = b.data.data.item()
|
|
|
|
# Should be close to true values: w=2, b=1
|
|
assert abs(final_w - 2.0) < 0.5, f"Weight not learned correctly: expected ~2.0, got {final_w}"
|
|
assert abs(final_b - 1.0) < 0.5, f"Bias not learned correctly: expected ~1.0, got {final_b}"
|
|
|
|
print(f"✅ Model learned: w = {final_w:.3f}, b = {final_b:.3f}")
|
|
print("✅ Complete ML pipeline test passed!")
|
|
|
|
# Test prediction on new data
|
|
print("📊 Testing prediction on new data...")
|
|
x_test = Variable(6.0, requires_grad=False)
|
|
y_pred = add(multiply(w, x_test), b)
|
|
expected_pred = 2.0 * 6.0 + 1.0 # True function value
|
|
|
|
print(f" Prediction for x=6: {y_pred.data.data.item():.3f} (expected ~{expected_pred})")
|
|
assert abs(y_pred.data.data.item() - expected_pred) < 1.0, "Prediction accuracy insufficient"
|
|
|
|
print("🎉 Complete ML pipeline test passed!")
|
|
return True
|
|
|
|
# Run the test
|
|
success = test_complete_ml_pipeline() |