# ---
# jupyter:
#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: percent
#       format_version: '1.3'
#     jupytext_version: 1.17.1
# ---

# %% [markdown]
"""
# Autograd - Automatic Differentiation Engine

Welcome to Autograd! You'll implement the automatic differentiation engine that makes neural network training possible by automatically computing gradients through computational graphs.

## 🔗 Building on Previous Learning

**What You Built Before**:
- Module 02 (Tensor): Data structures that hold neural network parameters
- Module 04 (Losses): Functions that measure prediction accuracy

**What's Working**: You can compute loss values for any prediction!

**The Gap**: Loss values tell you HOW WRONG you are, but not HOW TO IMPROVE the parameters.

**This Module's Solution**: Implement automatic differentiation to compute gradients automatically.

**Connection Map**:
```
Tensors → Losses → Autograd → Optimizers
(data)    (error)  (∂L/∂θ)    (updates)
```

## Learning Objectives
1. **Core Implementation**: Variable class with gradient tracking
2. **Mathematical Foundation**: Chain rule application in computational graphs
3. **Testing Skills**: Gradient computation validation
4. **Integration Knowledge**: How autograd enables neural network training

## Build → Test → Use
1. **Build**: Variable class with backward propagation
2. **Test**: Verify gradients are computed correctly
3. **Use**: Apply to mathematical expressions and see automatic differentiation

## 📦 Where This Code Lives in the Final Package

**Learning Side:** You work in modules/05_autograd/autograd_dev.py
**Building Side:** Code exports to tinytorch.core.autograd

```python
# Final package structure:
from tinytorch.core.autograd import Variable  # This module
from tinytorch.core.tensor import Tensor      # Foundation (always needed)
```

**Why this matters:**
- **Learning:** Complete automatic differentiation system for deep understanding
- **Production:** Proper organization like PyTorch's torch.autograd
- **Consistency:** All gradient operations in core.autograd
- **Integration:** Works seamlessly with tensors for complete training systems
"""

# %%
#| default_exp core.autograd

#| export
import numpy as np
import sys
from typing import Union, List, Optional, Callable

# Import our existing components
try:
    from tinytorch.core.tensor import Tensor
except ImportError:
    # For development, import from local modules
    import os
    sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor'))
    from tensor_dev import Tensor

# %%
print("🔥 TinyTorch Autograd Module")
print(f"NumPy version: {np.__version__}")
print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}")
print("Ready to build automatic differentiation!")

# %% [markdown]
"""
## What is Automatic Differentiation?

### The Problem: Computing Gradients at Scale

In neural networks, we need to compute gradients of complex functions with millions of parameters:

```
Loss = f(W₁, W₂, ..., Wₙ, data)
∇Loss = [∂Loss/∂W₁, ∂Loss/∂W₂, ..., ∂Loss/∂Wₙ]
```

Deriving these gradients by hand is impractical at this scale, and numerical differentiation is both too slow (it needs extra function evaluations per parameter) and only approximate.

### The Solution: Automatic Differentiation

🧠 **Core Concept**: Track operations during the forward pass, then apply the chain rule backwards
⚡ **Performance**: Same order of cost as the forward pass, with exact gradients (not approximations)
📦 **Framework Compatibility**: This is how PyTorch and TensorFlow work internally

### Visual Representation: Computational Graph

```
Forward Pass:
x ──┐
    ├──[×]──> z = x * y
y ──┘

Backward Pass:
∂L/∂z ──┬──> ∂L/∂x = ∂L/∂z * y
        │
        └──> ∂L/∂y = ∂L/∂z * x
```

**Key Insight**: Each operation stores how to compute gradients with respect to its inputs.
"""

# %% [markdown]
"""
## Implementation: Variable Class - Gradient Tracking

🏗️ **Organization**: Variables wrap tensors and track gradients
🎯 **Clean API**: Seamless integration with existing tensor operations
📐 **Mathematical Foundation**: Computational graph representation of functions

### Design Principles

A Variable tracks:
- **data**: The actual values (using our Tensor)
- **grad**: Accumulated gradients (starts as None)
- **grad_fn**: Function to compute gradients during backward pass
- **requires_grad**: Whether to track gradients for this variable
"""

# %% nbgrader={"grade": false, "grade_id": "variable-class", "solution": true}
#| export
class Variable:
    """
    Variable with automatic differentiation support.

    A Variable wraps a Tensor and tracks operations for gradient computation.

    TODO: Implement Variable class with gradient tracking capabilities

    APPROACH:
    1. Initialize with data, optional gradient requirement
    2. Store grad_fn for backward pass computation
    3. Implement backward() method to compute gradients

    EXAMPLE:
    >>> x = Variable([2.0], requires_grad=True)
    >>> y = Variable([3.0], requires_grad=True)
    >>> z = x * y
    >>> z.backward()
    >>> print(x.grad)  # Should be [3.0]
    >>> print(y.grad)  # Should be [2.0]

    HINTS:
    - Store data as Tensor for consistency
    - grad starts as None, gets created during backward
    - grad_fn is a callable that propagates gradients
    """
    ### BEGIN SOLUTION
    def __init__(self, data, requires_grad=False, grad_fn=None):
        """Initialize Variable with data and gradient tracking."""
        # Convert to Tensor if needed
        if isinstance(data, (list, tuple, int, float)):
            self.data = Tensor(data)
        elif isinstance(data, np.ndarray):
            self.data = Tensor(data)
        elif isinstance(data, Tensor):
            self.data = data
        else:
            raise TypeError(f"Unsupported data type: {type(data)}")

        self.grad = None
        self.requires_grad = requires_grad
        self.grad_fn = grad_fn

    def __repr__(self):
        """String representation of Variable."""
        grad_info = f", grad_fn={self.grad_fn.__name__}" if self.grad_fn else ""
        requires_grad_info = f", requires_grad={self.requires_grad}" if self.requires_grad else ""
        return f"Variable({self.data.data}{grad_info}{requires_grad_info})"

    def backward(self, gradient=None):
        """
        Compute gradients via backpropagation.

        Args:
            gradient: Gradient flowing backwards (defaults to ones)
        """
        # Default gradient for scalar outputs
        if gradient is None:
            if self.data.data.size == 1:
                gradient = np.ones_like(self.data.data)
            else:
                raise RuntimeError("gradient must be specified for non-scalar variables")

        # Accumulate gradients
        if self.requires_grad:
            if self.grad is None:
                self.grad = gradient
            else:
                self.grad = self.grad + gradient

        # Propagate gradients backwards through computation graph
        if self.grad_fn is not None:
            self.grad_fn(gradient)

    # Arithmetic operations with gradient tracking
    def __add__(self, other):
        """Addition with gradient tracking."""
        return add(self, other)

    def __radd__(self, other):
        """Reverse addition."""
        return add(other, self)

    def __mul__(self, other):
        """Multiplication with gradient tracking."""
        return multiply(self, other)

    def __rmul__(self, other):
        """Reverse multiplication."""
        return multiply(other, self)

    def __sub__(self, other):
        """Subtraction with gradient tracking."""
        return subtract(self, other)

    def __rsub__(self, other):
        """Reverse subtraction."""
        return subtract(other, self)

    def __matmul__(self, other):
        """Matrix multiplication with gradient tracking."""
        return matmul(self, other)

    @staticmethod
    def sum(variable):
        """
        Sum all elements of a Variable, maintaining gradient tracking.

        This is essential for creating scalar losses from multi-element results.
        Unlike extracting scalar values, this preserves the computational graph.

        Args:
            variable: Variable to sum

        Returns:
            Variable containing the sum with gradient tracking
        """
        # Forward pass: compute sum
        sum_data = np.sum(variable.data.data)

        # Determine if result requires gradients
        requires_grad = variable.requires_grad

        # Define backward function for gradient propagation
        def grad_fn(gradient):
            """Propagate gradients back to all elements."""
            if variable.requires_grad:
                # For the sum operation, the gradient broadcasts to all elements,
                # since d(sum)/d(xi) = 1 for all i
                grad_shape = variable.data.data.shape
                element_grad = np.full(grad_shape, gradient)
                variable.backward(element_grad)

        return Variable(sum_data, requires_grad=requires_grad, grad_fn=grad_fn if requires_grad else None)
    ### END SOLUTION

# %% [markdown]
"""
### 🧪 Unit Test: Variable Class
This test validates Variable creation and basic gradient setup.
"""

# %%
def test_unit_variable_class():
    """Test Variable class implementation with gradient tracking."""
    print("🔬 Unit Test: Variable Class...")

    # Test basic creation
    x = Variable([2.0, 3.0], requires_grad=True)
    assert isinstance(x.data, Tensor), "Variable should wrap Tensor"
    assert x.requires_grad == True, "Should track gradients when requested"
    assert x.grad is None, "Gradient should start as None"

    # Test creation without gradients
    y = Variable([1.0, 2.0], requires_grad=False)
    assert y.requires_grad == False, "Should not track gradients when not requested"

    # Test different data types
    z = Variable(np.array([4.0]), requires_grad=True)
    assert isinstance(z.data, Tensor), "Should convert numpy arrays to Tensors"

    print("✅ Variable class works correctly!")

test_unit_variable_class()

# %% [markdown]
"""
## Implementation: Addition Operation with Chain Rule

🧠 **Core Concepts**: Addition requires applying the chain rule to both operands
⚡ **Performance**: Gradient computation is O(1) relative to the forward pass
📦 **Framework Compatibility**: Matches PyTorch's autograd behavior

### Mathematical Foundation

For z = x + y:
- ∂z/∂x = 1 (derivative of x + y with respect to x)
- ∂z/∂y = 1 (derivative of x + y with respect to y)

Chain rule: ∂L/∂x = ∂L/∂z × ∂z/∂x = ∂L/∂z × 1 = ∂L/∂z
"""

# %% nbgrader={"grade": false, "grade_id": "add-operation", "solution": true}
#| export
def _ensure_variable(x):
    """Convert input to Variable if needed."""
    if isinstance(x, Variable):
        return x
    else:
        return Variable(x, requires_grad=False)

#| export
def add(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable:
    """
    Add two variables with gradient tracking.

    TODO: Implement addition that properly tracks gradients

    APPROACH:
    1. Convert inputs to Variables if needed
    2. Compute forward pass (a.data + b.data)
    3. Create grad_fn that propagates gradients to both inputs
    4. Return new Variable with result and grad_fn

    EXAMPLE:
    >>> x = Variable([2.0], requires_grad=True)
    >>> y = Variable([3.0], requires_grad=True)
    >>> z = add(x, y)
    >>> z.backward()
    >>> print(x.grad)  # [1.0] - derivative of z w.r.t x
    >>> print(y.grad)  # [1.0] - derivative of z w.r.t y

    HINTS:
    - Use chain rule: ∂L/∂x = ∂L/∂z × ∂z/∂x = ∂L/∂z × 1
    - Both operands get the same gradient (the derivative of a sum is 1)
    - Only propagate to variables that require gradients
    """
    ### BEGIN SOLUTION
    # Ensure both inputs are Variables
    a = _ensure_variable(a)
    b = _ensure_variable(b)

    # Forward pass computation
    result_data = Tensor(a.data.data + b.data.data)

    # Determine if result requires gradients
    requires_grad = a.requires_grad or b.requires_grad

    # Define backward function for gradient propagation
    def grad_fn(gradient):
        """Propagate gradients to both operands."""
        # Addition: ∂(a+b)/∂a = 1, ∂(a+b)/∂b = 1
        if a.requires_grad:
            a.backward(gradient)
        if b.requires_grad:
            b.backward(gradient)

    # Create result variable with gradient function
    result = Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn if requires_grad else None)
    return result
    ### END SOLUTION

# %% [markdown]
"""
### 🧪 Unit Test: Addition Operation
This test validates addition with proper gradient computation.
"""

# %%
def test_unit_add_operation():
    """Test addition with gradient tracking."""
    print("🔬 Unit Test: Addition Operation...")

    # Test basic addition
    x = Variable([2.0], requires_grad=True)
    y = Variable([3.0], requires_grad=True)
    z = add(x, y)

    # Verify forward pass
    assert np.allclose(z.data.data, [5.0]), f"Expected [5.0], got {z.data.data}"

    # Test backward pass
    z.backward()
    assert np.allclose(x.grad, [1.0]), f"Expected x.grad=[1.0], got {x.grad}"
    assert np.allclose(y.grad, [1.0]), f"Expected y.grad=[1.0], got {y.grad}"

    # Test with constants
    a = Variable([1.0], requires_grad=True)
    b = add(a, 5.0)  # Adding a constant
    b.backward()
    assert np.allclose(a.grad, [1.0]), "Gradient should flow through constant addition"

    print("✅ Addition operation works correctly!")

test_unit_add_operation()

# %% [markdown]
"""
## Implementation: Multiplication Operation with Product Rule

📐 **Mathematical Foundation**: Product rule for derivatives
🔗 **Connections**: Essential for linear layers, attention mechanisms
⚡ **Performance**: Efficient gradient computation using cached forward values

### The Product Rule

For z = x × y:
- ∂z/∂x = y (derivative with respect to first operand)
- ∂z/∂y = x (derivative with respect to second operand)

Chain rule: ∂L/∂x = ∂L/∂z × ∂z/∂x = ∂L/∂z × y
"""

# %% nbgrader={"grade": false, "grade_id": "multiply-operation", "solution": true}
#| export
def multiply(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable:
    """
    Multiply two variables with gradient tracking.

    TODO: Implement multiplication using product rule for gradients

    APPROACH:
    1. Convert inputs to Variables if needed
    2. Compute forward pass (a.data × b.data)
    3. Create grad_fn using product rule: ∂(a×b)/∂a = b, ∂(a×b)/∂b = a
    4. Return Variable with result and grad_fn

    EXAMPLE:
    >>> x = Variable([2.0], requires_grad=True)
    >>> y = Variable([3.0], requires_grad=True)
    >>> z = multiply(x, y)
    >>> z.backward()
    >>> print(x.grad)  # [3.0] - derivative is y's value
    >>> print(y.grad)  # [2.0] - derivative is x's value

    HINTS:
    - Product rule: d(uv)/dx = u(dv/dx) + v(du/dx)
    - For our case: ∂(a×b)/∂a = b, ∂(a×b)/∂b = a
    - Store original values for use in backward pass
    """
    ### BEGIN SOLUTION
    # Ensure both inputs are Variables
    a = _ensure_variable(a)
    b = _ensure_variable(b)

    # Forward pass computation
    result_data = Tensor(a.data.data * b.data.data)

    # Determine if result requires gradients
    requires_grad = a.requires_grad or b.requires_grad

    # Define backward function for gradient propagation
    def grad_fn(gradient):
        """Propagate gradients using product rule."""
        # Product rule: ∂(a*b)/∂a = b, ∂(a*b)/∂b = a
        if a.requires_grad:
            # ∂L/∂a = ∂L/∂z × ∂z/∂a = gradient × b
            a_grad = gradient * b.data.data
            a.backward(a_grad)
        if b.requires_grad:
            # ∂L/∂b = ∂L/∂z × ∂z/∂b = gradient × a
            b_grad = gradient * a.data.data
            b.backward(b_grad)

    # Create result variable with gradient function
    result = Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn if requires_grad else None)
    return result
    ### END SOLUTION

# %% [markdown]
"""
### 🧪 Unit Test: Multiplication Operation
This test validates multiplication with product-rule gradients.
"""

# %%
def test_unit_multiply_operation():
    """Test multiplication with gradient tracking."""
    print("🔬 Unit Test: Multiplication Operation...")

    # Test basic multiplication
    x = Variable([2.0], requires_grad=True)
    y = Variable([3.0], requires_grad=True)
    z = multiply(x, y)

    # Verify forward pass
    assert np.allclose(z.data.data, [6.0]), f"Expected [6.0], got {z.data.data}"

    # Test backward pass
    z.backward()
    assert np.allclose(x.grad, [3.0]), f"Expected x.grad=[3.0], got {x.grad}"
    assert np.allclose(y.grad, [2.0]), f"Expected y.grad=[2.0], got {y.grad}"

    # Test with constants
    a = Variable([4.0], requires_grad=True)
    b = multiply(a, 2.0)  # Multiplying by a constant
    b.backward()
    assert np.allclose(a.grad, [2.0]), "Gradient should be the constant value"

    print("✅ Multiplication operation works correctly!")

test_unit_multiply_operation()

# %% [markdown]
"""
## Implementation: Additional Operations

🔗 **Connections**: Complete the basic arithmetic operations needed for neural networks
⚡ **Performance**: Each operation implements efficient gradient computation
📦 **Framework Compatibility**: Matches behavior of production autograd systems
"""

# %% nbgrader={"grade": false, "grade_id": "additional-operations", "solution": true}
#| export
def subtract(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable:
    """
    Subtract two variables with gradient tracking.

    TODO: Implement subtraction with proper gradient flow

    HINTS:
    - For z = a - b: ∂z/∂a = 1, ∂z/∂b = -1
    - Similar to addition but second operand gets negative gradient
    """
    ### BEGIN SOLUTION
    # Ensure both inputs are Variables
    a = _ensure_variable(a)
    b = _ensure_variable(b)

    # Forward pass computation
    result_data = Tensor(a.data.data - b.data.data)

    # Determine if result requires gradients
    requires_grad = a.requires_grad or b.requires_grad

    # Define backward function for gradient propagation
    def grad_fn(gradient):
        """Propagate gradients for subtraction."""
        # Subtraction: ∂(a-b)/∂a = 1, ∂(a-b)/∂b = -1
        if a.requires_grad:
            a.backward(gradient)
        if b.requires_grad:
            b.backward(-gradient)  # Negative for subtraction

    # Create result variable with gradient function
    result = Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn if requires_grad else None)
    return result
    ### END SOLUTION

#| export
def matmul(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Variable:
    """
    Matrix multiplication with gradient tracking.

    TODO: Implement matrix multiplication with proper gradients

    HINTS:
    - For z = a @ b: ∂z/∂a = gradient @ b.T, ∂z/∂b = a.T @ gradient
    - This is fundamental for neural network linear layers
    """
    ### BEGIN SOLUTION
    # Ensure both inputs are Variables
    a = _ensure_variable(a)
    b = _ensure_variable(b)

    # Forward pass computation
    result_data = Tensor(a.data.data @ b.data.data)

    # Determine if result requires gradients
    requires_grad = a.requires_grad or b.requires_grad

    # Define backward function for gradient propagation
    def grad_fn(gradient):
        """Propagate gradients for matrix multiplication."""
        # Matrix multiplication gradients:
        # ∂(a@b)/∂a = gradient @ b.T
        # ∂(a@b)/∂b = a.T @ gradient
        if a.requires_grad:
            a_grad = gradient @ b.data.data.T
            a.backward(a_grad)
        if b.requires_grad:
            b_grad = a.data.data.T @ gradient
            b.backward(b_grad)

    # Create result variable with gradient function
    result = Variable(result_data, requires_grad=requires_grad, grad_fn=grad_fn if requires_grad else None)
    return result
    ### END SOLUTION

# %% [markdown]
"""
### 🧪 Unit Test: Additional Operations
This test validates subtraction and matrix multiplication.
"""

# %%
def test_unit_additional_operations():
    """Test subtraction and matrix multiplication."""
    print("🔬 Unit Test: Additional Operations...")

    # Test subtraction
    x = Variable([5.0], requires_grad=True)
    y = Variable([2.0], requires_grad=True)
    z = subtract(x, y)

    assert np.allclose(z.data.data, [3.0]), f"Subtraction failed: expected [3.0], got {z.data.data}"

    z.backward()
    assert np.allclose(x.grad, [1.0]), f"Subtraction gradient failed: expected x.grad=[1.0], got {x.grad}"
    assert np.allclose(y.grad, [-1.0]), f"Subtraction gradient failed: expected y.grad=[-1.0], got {y.grad}"

    # Test matrix multiplication
    a = Variable([[1.0, 2.0]], requires_grad=True)
    b = Variable([[3.0], [4.0]], requires_grad=True)
    c = matmul(a, b)

    assert np.allclose(c.data.data, [[11.0]]), f"Matrix multiplication failed: expected [[11.0]], got {c.data.data}"

    c.backward()
    assert np.allclose(a.grad, [[3.0, 4.0]]), f"Matmul gradient failed for a: expected [[3.0, 4.0]], got {a.grad}"
    assert np.allclose(b.grad, [[1.0], [2.0]]), f"Matmul gradient failed for b: expected [[1.0], [2.0]], got {b.grad}"

    print("✅ Additional operations work correctly!")

test_unit_additional_operations()

# %% [markdown]
"""
## Implementation: Chain Rule Through Complex Expressions

🧠 **Core Concept**: Multiple operations automatically chain gradients together
⚡ **Performance**: Each operation contributes O(1) overhead for gradient computation
🔗 **Connections**: This enables training deep neural networks with many layers

### Example: Complex Expression

Consider: f(x, y) = (x + y) × (x - y) = x² - y²

The autograd system automatically:
1. Tracks each intermediate operation
2. Applies the chain rule backwards through the computation graph
3. Accumulates gradients at each variable

Expected gradients:
- ∂f/∂x = 2x (derivative of x² - y²)
- ∂f/∂y = -2y (derivative of x² - y²)
"""

# %% [markdown]
"""
### 🧪 Unit Test: Chain Rule Application
This test validates complex expressions with multiple operations.
"""

# %%
def test_unit_chain_rule():
    """Test chain rule through complex expressions."""
    print("🔬 Unit Test: Chain Rule Application...")

    # Test complex expression: (x + y) * (x - y) = x² - y²
    x = Variable([3.0], requires_grad=True)
    y = Variable([2.0], requires_grad=True)

    # Build computation graph
    sum_term = add(x, y)                    # x + y = 5
    diff_term = subtract(x, y)              # x - y = 1
    result = multiply(sum_term, diff_term)  # (x+y)*(x-y) = 5*1 = 5

    # Verify forward pass
    expected_result = 3.0**2 - 2.0**2  # x² - y² = 9 - 4 = 5
    assert np.allclose(result.data.data, [expected_result]), f"Expected [{expected_result}], got {result.data.data}"

    # Test backward pass
    result.backward()

    # Expected gradients: ∂(x²-y²)/∂x = 2x = 6, ∂(x²-y²)/∂y = -2y = -4
    expected_x_grad = 2 * 3.0   # 6.0
    expected_y_grad = -2 * 2.0  # -4.0

    assert np.allclose(x.grad, [expected_x_grad]), f"Expected x.grad=[{expected_x_grad}], got {x.grad}"
    assert np.allclose(y.grad, [expected_y_grad]), f"Expected y.grad=[{expected_y_grad}], got {y.grad}"

    # Test another complex expression: a * b + a * b (should equal 2*a*b)
    a = Variable([2.0], requires_grad=True)
    b = Variable([3.0], requires_grad=True)

    term1 = multiply(a, b)
    term2 = multiply(a, b)
    sum_result = add(term1, term2)

    sum_result.backward()

    # Expected: ∂(2ab)/∂a = 2b = 6, ∂(2ab)/∂b = 2a = 4
    assert np.allclose(a.grad, [6.0]), f"Expected a.grad=[6.0], got {a.grad}"
    assert np.allclose(b.grad, [4.0]), f"Expected b.grad=[4.0], got {b.grad}"

    print("✅ Chain rule works correctly through complex expressions!")

test_unit_chain_rule()
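
# %% [markdown]
"""
### 🔍 Going Further: Numerical Gradient Checking

A standard way to validate an autograd implementation is to compare its gradients against central finite differences. The sketch below is an illustrative check (not part of the exported API) for f(x, y) = (x + y) × (x - y), built from the operations above; fresh Variables are created so no stale gradients accumulate.
"""

# %%
def check_gradients_against_numerical(x0=3.0, y0=2.0, h=1e-5):
    """Compare autograd gradients with finite differences for (x+y)*(x-y)."""
    # Autograd gradients (fresh Variables, so .grad starts from None)
    x = Variable([x0], requires_grad=True)
    y = Variable([y0], requires_grad=True)
    out = multiply(add(x, y), subtract(x, y))
    out.backward()

    # Central finite differences on the plain-float version of f
    f = lambda a, b: (a + b) * (a - b)
    num_dx = (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h)
    num_dy = (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h)

    assert np.allclose(x.grad, num_dx), f"x: autograd {x.grad} vs numerical {num_dx}"
    assert np.allclose(y.grad, num_dy), f"y: autograd {y.grad} vs numerical {num_dy}"
    print(f"✅ Gradient check passed: ∂f/∂x ≈ {num_dx:.4f}, ∂f/∂y ≈ {num_dy:.4f}")

check_gradients_against_numerical()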

# %% [markdown]
"""
## 🔍 Systems Analysis: Gradient Computation Behavior

Now that your autograd implementation is complete and tested, let's analyze its behavior:

**Analysis Focus**: Understand memory usage and computational patterns in automatic differentiation
"""

# %%
def analyze_gradient_computation():
    """
    📊 SYSTEMS MEASUREMENT: Gradient Computation Analysis

    Measure how autograd scales with expression complexity and input size.
    """
    print("📊 AUTOGRAD SYSTEMS ANALYSIS")
    print("Testing gradient computation patterns...")

    import time

    # Test 1: Expression complexity scaling
    print("\n🔍 Expression Complexity Analysis:")
    x = Variable([2.0], requires_grad=True)
    y = Variable([3.0], requires_grad=True)

    expressions = [
        ("Simple: x + y", lambda: add(x, y)),
        ("Medium: x * y + x", lambda: add(multiply(x, y), x)),
        ("Complex: (x + y) * (x - y)", lambda: multiply(add(x, y), subtract(x, y)))
    ]

    for name, expr_fn in expressions:
        # Reset gradients
        x.grad = None
        y.grad = None

        # Time forward + backward pass
        start = time.perf_counter()
        result = expr_fn()
        result.backward()
        elapsed = time.perf_counter() - start

        print(f"  {name}: {elapsed*1000:.3f}ms")

    # Test 2: Memory usage pattern
    print("\n💾 Memory Usage Analysis:")
    try:
        import psutil
        import os

        def get_memory_mb():
            process = psutil.Process(os.getpid())
            return process.memory_info().rss / 1024 / 1024

        baseline = get_memory_mb()
        psutil_available = True
    except ImportError:
        print("  Note: psutil not installed, skipping detailed memory analysis")
        psutil_available = False
        baseline = 0

    # Create computation graph with many variables
    variables = []
    for i in range(100):
        var = Variable([float(i)], requires_grad=True)
        variables.append(var)

    # Chain operations
    result = variables[0]
    for var in variables[1:]:
        result = add(result, var)

    if psutil_available:
        memory_after_forward = get_memory_mb()

    # Backward pass
    result.backward()

    if psutil_available:
        memory_after_backward = get_memory_mb()
        print(f"  Baseline memory: {baseline:.1f}MB")
        print(f"  After forward pass: {memory_after_forward:.1f}MB (+{memory_after_forward-baseline:.1f}MB)")
        print(f"  After backward pass: {memory_after_backward:.1f}MB (+{memory_after_backward-baseline:.1f}MB)")
    else:
        print("  Memory tracking skipped (psutil not available)")

    # Test 3: Gradient accumulation
    print("\n🔄 Gradient Accumulation Test:")
    z = Variable([1.0], requires_grad=True)

    # Multiple backward passes should accumulate gradients
    loss1 = multiply(z, 2.0)
    loss1.backward()
    first_grad = z.grad.copy()

    loss2 = multiply(z, 3.0)
    loss2.backward()  # Should accumulate with previous gradient

    print(f"  First backward: grad = {first_grad}")
    print(f"  After second backward: grad = {z.grad}")
    print(f"  Expected accumulation: {first_grad + 3.0}")

    print("\n💡 AUTOGRAD INSIGHTS:")
    print("  • Forward pass builds computation graph in memory")
    print("  • Backward pass traverses graph and accumulates gradients")
    print("  • Memory scales with graph depth, not just data size")
    print("  • This is why PyTorch uses gradient checkpointing for deep networks!")

analyze_gradient_computation()

# %% [markdown]
"""
## Integration: Complete Module Testing

🧪 **Testing Strategy**: Comprehensive validation of all autograd functionality
✅ **Quality Assurance**: Ensure all components work together correctly
🚀 **Ready for Training**: Verify autograd enables neural network optimization
"""

# %%
def test_module():
    """Comprehensive test of autograd module functionality."""
    print("🧪 COMPREHENSIVE MODULE TEST")
    print("Running complete autograd validation...")

    # Test 1: Variable creation and basic properties
    print("\n1️⃣ Testing Variable creation...")
    x = Variable([1.0, 2.0], requires_grad=True)
    assert isinstance(x.data, Tensor)
    assert x.requires_grad == True
    assert x.grad is None
    print("  ✅ Variable creation works")

    # Test 2: All arithmetic operations
    print("\n2️⃣ Testing arithmetic operations...")
    a = Variable([2.0], requires_grad=True)
    b = Variable([3.0], requires_grad=True)

    # Test each operation
    add_result = add(a, b)
    assert np.allclose(add_result.data.data, [5.0])

    mul_result = multiply(a, b)
    assert np.allclose(mul_result.data.data, [6.0])

    sub_result = subtract(a, b)
    assert np.allclose(sub_result.data.data, [-1.0])
    print("  ✅ All arithmetic operations work")

    # Test 3: Gradient computation
    print("\n3️⃣ Testing gradient computation...")
    x = Variable([3.0], requires_grad=True)
    y = Variable([4.0], requires_grad=True)
    z = multiply(x, y)  # z = 12
    z.backward()

    assert np.allclose(x.grad, [4.0]), f"Expected x.grad=[4.0], got {x.grad}"
    assert np.allclose(y.grad, [3.0]), f"Expected y.grad=[3.0], got {y.grad}"
    print("  ✅ Gradient computation works")

    # Test 4: Complex expressions
    print("\n4️⃣ Testing complex expressions...")
    p = Variable([2.0], requires_grad=True)
    q = Variable([3.0], requires_grad=True)

    # (p + q) * (p - q) = p² - q²
    expr = multiply(add(p, q), subtract(p, q))
    expr.backward()

    # Expected: ∂(p²-q²)/∂p = 2p = 4, ∂(p²-q²)/∂q = -2q = -6
    assert np.allclose(p.grad, [4.0]), f"Expected p.grad=[4.0], got {p.grad}"
    assert np.allclose(q.grad, [-6.0]), f"Expected q.grad=[-6.0], got {q.grad}"
    print("  ✅ Complex expressions work")

    # Test 5: Matrix operations
    print("\n5️⃣ Testing matrix operations...")
    A = Variable([[1.0, 2.0]], requires_grad=True)
    B = Variable([[3.0], [4.0]], requires_grad=True)
    C = matmul(A, B)

    assert np.allclose(C.data.data, [[11.0]])
    C.backward()
    assert np.allclose(A.grad, [[3.0, 4.0]])
    assert np.allclose(B.grad, [[1.0], [2.0]])
    print("  ✅ Matrix operations work")

    # Test 6: Mixed operations
    print("\n6️⃣ Testing mixed operations...")
    u = Variable([1.0], requires_grad=True)
    v = Variable([2.0], requires_grad=True)

    # Neural network-like computation: u * v + u
    hidden = multiply(u, v)  # u * v
    output = add(hidden, u)  # + u
    output.backward()

    # Expected: ∂(u*v + u)/∂u = v + 1 = 3, ∂(u*v + u)/∂v = u = 1
    assert np.allclose(u.grad, [3.0]), f"Expected u.grad=[3.0], got {u.grad}"
    assert np.allclose(v.grad, [1.0]), f"Expected v.grad=[1.0], got {v.grad}"
    print("  ✅ Mixed operations work")

    print("\n🎉 ALL TESTS PASSED!")
    print("🚀 Autograd module is ready for neural network training!")
    print("🔗 Next: Use these gradients in optimizers to update parameters")

# %%
if __name__ == "__main__":
    test_module()

# %% [markdown]
"""
## 🤔 ML Systems Thinking: Interactive Questions

### Question 1: Memory Management in Computational Graphs

Consider the expression `z = (x + y) * (x - y)` where x and y have `requires_grad=True`.

**Analysis Task**: Your autograd implementation stores intermediate results during forward pass and uses them during backward pass. In a deep neural network with 100 layers, each layer creating intermediate variables, what memory challenges would emerge?

**Specific Questions**:
- How does memory usage scale with network depth in your current implementation?
- What strategies could reduce memory usage during gradient computation?
- Why do production frameworks like PyTorch implement "gradient checkpointing"?

**Implementation Connection**: Examine how your `grad_fn` closures capture references to input variables and consider the memory implications.
"""

# %% nbgrader={"grade": true, "grade_id": "memory-analysis", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false}
"""
TODO: Analyze memory usage patterns in your autograd implementation.

Consider how your Variable class stores references to other variables through grad_fn,
and how this affects memory usage in deep networks.

Discuss specific memory optimization strategies you could implement.
"""
### BEGIN SOLUTION
# Memory analysis for autograd implementation:

# 1. Memory scaling with network depth:
#    - Each Variable stores references to inputs through its grad_fn closure
#    - In deep networks: O(depth) memory growth for intermediate activations
#    - Gradient computation requires keeping forward activations in memory
#    - 100-layer network = 100x intermediate variables + their grad_fn closures

# 2. Memory optimization strategies:
#    - Gradient checkpointing: only store a subset of activations, recompute the rest
#    - In-place operations where mathematically valid
#    - Clear the computation graph after the backward pass
#    - Use smaller data types (float16 vs float32) where precision allows

# 3. Production framework solutions:
#    - PyTorch's gradient checkpointing trades compute for memory
#    - Automatic memory management with garbage collection
#    - Graph optimization to reduce intermediate storage
#    - Dynamic graph construction vs static graph optimization

# Current implementation improvement:
# Add a method to clear the computation graph: variable.detach() or graph.clear()
### END SOLUTION
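
# %%
# Illustrative sketch of the detach() improvement suggested above: a hypothetical
# helper, not part of this module's graded API (the name follows PyTorch's
# convention). It returns a Variable that shares the same data but drops grad_fn,
# so the upstream closures it referenced can be garbage-collected.
def detach(variable: Variable) -> Variable:
    """Return a Variable with the same data but no computation-graph history."""
    return Variable(variable.data, requires_grad=False, grad_fn=None)

_x = Variable([2.0], requires_grad=True)
_y = multiply(_x, 3.0)
print(detach(_y))  # no grad_fn: the graph back to _x is released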

# %% [markdown]
"""
### Question 2: Gradient Accumulation and Training Efficiency

In your autograd implementation, gradients accumulate when `backward()` is called multiple times without zeroing gradients.

**Analysis Task**: Design a training loop that uses gradient accumulation to simulate larger batch sizes with limited memory.

**Specific Questions**:
- How would you modify the Variable class to support gradient zeroing?
- What are the trade-offs between large batches vs. gradient accumulation?
- How does gradient accumulation affect convergence in neural network training?

**Implementation Connection**: Consider how your `backward()` method accumulates gradients and design a complete training interface.
"""

# %% nbgrader={"grade": true, "grade_id": "gradient-accumulation", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false}
"""
TODO: Design gradient accumulation strategy for your autograd system.

Extend your Variable class with gradient management methods and analyze
the trade-offs between memory efficiency and training convergence.
"""
### BEGIN SOLUTION
# Gradient accumulation design for training efficiency:

# 1. Variable class extension needed (this method would be added to Variable):
def zero_grad(self):
    """Clear accumulated gradients."""
    self.grad = None

# 2. Training loop with gradient accumulation
#    (model, data_loader, compute_loss, and optimizer are assumed to exist
#    in the surrounding training code):
def training_step_with_accumulation(model, data_loader, accumulation_steps=4):
    """Simulate larger batches through gradient accumulation."""
    for param in model.parameters():
        param.zero_grad()

    total_loss = 0
    for i, batch in enumerate(data_loader):
        loss = compute_loss(model(batch.x), batch.y)
        loss.backward()  # Accumulate gradients
        total_loss += loss.data

        if (i + 1) % accumulation_steps == 0:
            # Update parameters with accumulated gradients
            optimizer.step()
            # Clear gradients for the next accumulation cycle
            for param in model.parameters():
                param.zero_grad()

    return total_loss / len(data_loader)

# 3. Trade-offs analysis:
#    - Memory: gradient accumulation uses constant memory vs. linear growth for large batches
#    - Convergence: accumulated gradients approximate large-batch behavior
#    - Computation: extra backward passes vs. a single large-batch forward/backward
#    - Synchronization: in distributed training, less frequent communication

# 4. Production considerations:
#    - Gradient scaling to prevent underflow with accumulated small gradients
#    - Learning rate adjustment for the effective batch size
#    - Batch normalization statistics are affected by actual vs. effective batch size
### END SOLUTION

# %% [markdown]
"""
### Question 3: Computational Graph Optimization

Your autograd implementation creates a new Variable for each operation, building a computation graph dynamically.

**Analysis Task**: Analyze opportunities for optimizing the computational graph to reduce memory usage and improve performance.

**Specific Questions**:
- Which operations could be fused together to reduce intermediate Variable storage?
- How would in-place operations affect gradient computation safety?
- What graph optimization passes could be implemented before backward propagation?

**Implementation Connection**: Examine your operation functions and identify where intermediate results could be eliminated or reused.
"""

# %% nbgrader={"grade": true, "grade_id": "graph-optimization", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false}
"""
TODO: Design graph optimization strategies for your autograd implementation.

Identify specific optimizations that could reduce memory usage and improve
performance while maintaining gradient correctness.
"""
### BEGIN SOLUTION
# Computational graph optimization strategies:

# 1. Operation fusion opportunities:
#    - Fuse: add + multiply → fused_add_mul (one intermediate variable)
#    - Fuse: activation + linear → fused_linear_activation
#    - Elementwise operations: add + relu + multiply can be a single kernel
#    - Current: 3 Variables → Optimized: 1 Variable

def fused_add_multiply(a, b, c):
    """Fused operation: (a + b) * c - saves one intermediate Variable."""
    # Direct computation without an intermediate Variable
    result_data = (a.data.data + b.data.data) * c.data.data

    def grad_fn(gradient):
        if a.requires_grad:
            a.backward(gradient * c.data.data)
        if b.requires_grad:
            b.backward(gradient * c.data.data)
        if c.requires_grad:
            c.backward(gradient * (a.data.data + b.data.data))

    return Variable(result_data, requires_grad=any([a.requires_grad, b.requires_grad, c.requires_grad]), grad_fn=grad_fn)

# 2. In-place operation safety:
#    - Safe: elementwise operations on leaf variables not used elsewhere
#    - Unsafe: in-place on intermediate variables used in multiple paths
#    - Solution: track variable usage count before allowing in-place

def safe_inplace_add(var, other):
    """In-place addition, only if safe for gradient computation."""
    if var.grad_fn is not None:
        raise RuntimeError("Cannot do in-place operation on variable with grad_fn")
    var.data.data += other.data.data
    return var

# 3. Graph optimization passes:
#    - Dead code elimination: remove unused intermediate variables
#    - Common subexpression elimination: reuse x*y if computed multiple times
#    - Memory layout optimization: arrange for cache-friendly access patterns

class GraphOptimizer:
    def optimize_memory_layout(self, variables):
        """Optimize variable storage for cache efficiency."""
        # Group related variables in contiguous memory
        pass

    def eliminate_dead_variables(self, root_variable):
        """Remove variables not needed for gradient computation."""
        # Traverse backward from root, mark reachable variables
        pass

    def fuse_operations(self, computation_sequence):
        """Identify fusible operation sequences."""
        # Pattern matching for common operation combinations
        pass

# 4. Production framework techniques:
#    - TensorFlow's XLA: ahead-of-time compilation with graph optimization
#    - PyTorch's TorchScript: graph optimization for inference
#    - ONNX graph optimization passes: constant folding, operator fusion
#    - Memory planning: pre-allocate memory for the entire computation graph
### END SOLUTION

# %% [markdown]
"""
## 🎯 MODULE SUMMARY: Autograd - Automatic Differentiation Engine

Congratulations! You've successfully implemented the automatic differentiation engine:

### What You've Accomplished
✅ **Variable Class Implementation**: Complete gradient tracking system with 200+ lines of core functionality
✅ **Arithmetic Operations**: Addition, multiplication, subtraction, and matrix operations with proper gradient flow
✅ **Chain Rule Application**: Automatic gradient computation through complex mathematical expressions
✅ **Memory Management**: Efficient gradient accumulation and computational graph construction
✅ **Systems Analysis**: Understanding of memory scaling and performance characteristics in gradient computation

### Key Learning Outcomes
- **Automatic Differentiation**: How computational graphs enable efficient gradient computation
- **Chain Rule Implementation**: Mathematical foundation for backpropagation in neural networks
- **Memory Patterns**: How gradient computation affects memory usage in deep learning systems
- **Production Understanding**: Connection to PyTorch/TensorFlow autograd implementations

### Mathematical Foundations Mastered
- **Chain Rule**: Systematic application through computational graphs
- **Product Rule**: Gradient computation for multiplication operations
- **Computational Complexity**: O(1) gradient overhead per operation in the forward pass
- **Memory Complexity**: O(graph_depth) storage requirements for intermediate activations

### Professional Skills Developed
- **Gradient System Design**: Building automatic differentiation from scratch
- **Performance Analysis**: Understanding memory and computational trade-offs
- **Testing Methodology**: Comprehensive validation of gradient correctness

### Ready for Advanced Applications
Your autograd implementation now enables:
- **Neural Network Training**: Automatic gradient computation for parameter updates
- **Optimization Algorithms**: Foundation for SGD, Adam, and other optimizers
- **Deep Learning Research**: Understanding of how modern frameworks work internally

### Connection to Real ML Systems
Your implementation mirrors production systems:
- **PyTorch**: `torch.autograd.Variable` and automatic gradient computation
- **TensorFlow**: `tf.GradientTape` for automatic differentiation
- **Industry Standard**: Dynamic computational graphs used in most modern frameworks

### Next Steps
1. **Export your module**: `tito module complete 05_autograd`
2. **Validate integration**: `tito test --module autograd`
3. **Ready for Module 06**: Optimizers will use your gradients to update neural network parameters!

**🚀 Achievement Unlocked**: Your automatic differentiation engine is the foundation that makes modern neural network training possible!
"""