mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-03-09 15:42:00 -05:00
The itemize environment parameters [leftmargin=*, itemsep=1pt, parsep=0pt]
were appearing as visible text in the PDF because the enumitem package
wasn't loaded. This fix adds \usepackage{enumitem} to the preamble.
All itemized lists now format correctly with proper spacing and margins.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
1395 lines
45 KiB
Python
1395 lines
45 KiB
Python
# ---
|
||
# jupyter:
|
||
# jupytext:
|
||
# text_representation:
|
||
# extension: .py
|
||
# format_name: percent
|
||
# format_version: '1.3'
|
||
# jupytext_version: 1.18.1
|
||
# kernelspec:
|
||
# display_name: Python 3 (ipykernel)
|
||
# language: python
|
||
# name: python3
|
||
# ---
|
||
|
||
# %% [markdown]
|
||
"""
|
||
# Module 06: Optimizers - Sophisticated Learning Algorithms
|
||
|
||
Welcome to Module 06! You'll build optimizers that enable neural networks to learn from gradients using sophisticated algorithms.
|
||
|
||
## 🔗 Prerequisites & Progress
|
||
**You've Built**: Tensor with gradients (Modules 01-05)
|
||
**You'll Build**: SGD, Adam, and AdamW optimizers with sophisticated momentum and adaptive learning
|
||
**You'll Enable**: Modern optimization algorithms that power state-of-the-art neural networks
|
||
|
||
**Connection Map**:
|
||
```
|
||
Gradients → Optimizers → Training
|
||
(Module 05) (Module 06) (Module 07)
|
||
```
|
||
|
||
## Learning Objectives
|
||
By the end of this module, you will:
|
||
1. Implement SGD with momentum for stable gradient descent
|
||
2. Build Adam optimizer with adaptive learning rates
|
||
3. Create AdamW optimizer with decoupled weight decay
|
||
4. Understand memory and computational trade-offs in optimization algorithms
|
||
|
||
Let's get started!
|
||
|
||
## 📦 Where This Code Lives in the Final Package
|
||
|
||
**Learning Side:** You work in `modules/06_optimizers/optimizers_dev.py`
|
||
**Building Side:** Code exports to `tinytorch.core.optimizers`
|
||
|
||
```python
|
||
# How to use this module:
|
||
from tinytorch.core.optimizers import SGD, Adam, AdamW
|
||
```
|
||
|
||
**Why this matters:**
|
||
- **Learning:** Complete optimization system for modern neural network training
|
||
- **Production:** Proper organization like PyTorch's torch.optim with all optimization algorithms together
|
||
- **Consistency:** All optimization logic and parameter updating in core.optimizers
|
||
- **Integration:** Works seamlessly with gradients from Module 05 for complete training capability
|
||
"""
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "imports", "solution": true}
|
||
#| default_exp core.optimizers
|
||
#| export
|
||
|
||
import numpy as np
|
||
from typing import List, Union, Optional, Dict, Any
|
||
|
||
# Import Tensor from Module 01 (now with gradient support from Module 05)
|
||
from tinytorch.core.tensor import Tensor
|
||
|
||
# %% [markdown]
|
||
r"""
|
||
## 1. Introduction: What are Optimizers?
|
||
|
||
Optimizers are the engines that drive neural network learning. They take gradients computed from your loss function and use them to update model parameters toward better solutions. Think of optimization as navigating a complex landscape where you're trying to find the lowest valley (minimum loss).
|
||
|
||
### The Optimization Challenge
|
||
|
||
Imagine you're hiking in dense fog, trying to reach the bottom of a valley. You can only feel the slope under your feet (the gradient), but you can't see where you're going. Different optimization strategies are like different hiking approaches:
|
||
|
||
```
|
||
Loss Landscape (2D visualization):
|
||
🏔️
|
||
/ \\
|
||
🚶 / \\
|
||
/ \\
|
||
/ 🎯 \\ ← Global minimum (goal)
|
||
/ \\
|
||
🏔️ 🏔️
|
||
|
||
Challenge: Navigate to 🎯 using only local slope information!
|
||
```
|
||
|
||
### Our Optimizer Toolkit
|
||
|
||
**SGD (Stochastic Gradient Descent)**
|
||
- Strategy: Always step downhill
|
||
- Problem: Can get stuck oscillating in narrow valleys
|
||
- Solution: Add momentum to "coast" through oscillations
|
||
|
||
**Adam (Adaptive Moment Estimation)**
|
||
- Strategy: Adapt step size for each parameter individually
|
||
- Advantage: Different learning rates for different dimensions
|
||
- Key Insight: Some directions need big steps, others need small steps
|
||
|
||
**AdamW (Adam with Weight Decay)**
|
||
- Strategy: Adam + proper regularization
|
||
- Fix: Separates optimization from regularization
|
||
- Result: Better generalization and training stability
|
||
|
||
### The Mathematics Behind Movement
|
||
|
||
At its core, optimization follows: **θ_new = θ_old - α * direction**
|
||
|
||
Where:
|
||
- `θ` = parameters (your position in the landscape)
|
||
- `α` = step size (learning rate)
|
||
- `direction` = where to step (gradient-based)
|
||
|
||
But sophisticated optimizers do much more than basic gradient descent!
|
||
"""
|
||
|
||
# %% [markdown]
|
||
r"""
|
||
## 2. Foundations: Mathematical Background
|
||
|
||
### Understanding Momentum: The Physics of Optimization
|
||
|
||
Momentum in optimization works like momentum in physics. A ball rolling down a hill doesn't immediately change direction when it hits a small bump - it has momentum that carries it forward.
|
||
|
||
```
|
||
Without Momentum (SGD): With Momentum:
|
||
↓ ↘️
|
||
← • → ← oscillation → • → smooth path
|
||
↑ ↙️
|
||
|
||
Narrow valley problem: Momentum solution:
|
||
|\ /| |\ /|
|
||
| \ • / | ← ping-pong | \ •→/ | ← smoother
|
||
| \ / | motion | \ / | descent
|
||
| ● | | ● |
|
||
```
|
||
|
||
**SGD with Momentum Formula:**
|
||
```
|
||
velocity = β * previous_velocity + (1-β) * current_gradient
|
||
parameter = parameter - learning_rate * velocity
|
||
|
||
Where β ≈ 0.9 means "90% memory of previous direction"
|
||
```
|
||
|
||
### Adam: Adaptive Learning for Each Parameter
|
||
|
||
Adam solves a key problem: different parameters need different learning rates. Imagine adjusting the focus and zoom on a camera - you need fine control for focus but coarse control for zoom.
|
||
|
||
```
|
||
Parameter Landscape (2 dimensions):
|
||
|
||
param2
|
||
^
|
||
|
|
||
😞| steep gradient
|
||
| (needs small steps)
|
||
|
|
||
---+--●--→ param1
|
||
| \\
|
||
| \\ gentle gradient
|
||
| \\ (needs big steps)
|
||
|
||
Adam Solution: Automatic step size per parameter!
|
||
```
|
||
|
||
**Adam's Two-Memory System:**
|
||
|
||
1. **First Moment (m)**: "Which direction am I usually going?"
|
||
- `m = β₁ * old_m + (1-β₁) * gradient`
|
||
- Like momentum, but for direction
|
||
|
||
2. **Second Moment (v)**: "How big are my gradients usually?"
|
||
- `v = β₂ * old_v + (1-β₂) * gradient²`
|
||
- Tracks gradient magnitude
|
||
|
||
3. **Adaptive Update**:
|
||
- `step_size = m / √v`
|
||
- Big gradients → smaller steps
|
||
- Small gradients → relatively bigger steps
|
||
|
||
### AdamW: Fixing Weight Decay
|
||
|
||
Adam has a subtle bug in how it applies weight decay (regularization). AdamW fixes this:
|
||
|
||
```
|
||
Adam (incorrect): AdamW (correct):
|
||
gradient += weight_decay * param [compute gradient update]
|
||
update_param_with_gradient() param -= learning_rate * gradient_update
|
||
param *= (1 - weight_decay) ← separate!
|
||
|
||
Why it matters:
|
||
- Adam: Weight decay affected by adaptive learning rates
|
||
- AdamW: Weight decay is consistent regardless of gradients
|
||
```
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## 3. Implementation: Building Optimizers
|
||
|
||
Now we'll implement each optimizer step by step, following the pattern: understand the algorithm → implement it → test it immediately. Each optimizer builds on the foundation of the previous one.
|
||
|
||
### Implementation Strategy
|
||
|
||
```
|
||
Optimizer Base Class
|
||
↓
|
||
SGD (foundation algorithm)
|
||
↓
|
||
SGD + Momentum (reduce oscillations)
|
||
↓
|
||
Adam (adaptive learning rates)
|
||
↓
|
||
AdamW (proper weight decay)
|
||
```
|
||
"""
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "optimizer-base", "solution": true}
|
||
#| export
|
||
class Optimizer:
|
||
"""
|
||
Base class for all optimizers.
|
||
|
||
This class defines the common interface that all optimizers must implement:
|
||
- zero_grad(): Clear gradients from parameters
|
||
- step(): Update parameters based on gradients
|
||
"""
|
||
|
||
def __init__(self, params: List[Tensor]):
|
||
"""
|
||
Initialize optimizer with parameters to optimize.
|
||
|
||
TODO: Set up the parameter list for optimization
|
||
|
||
APPROACH:
|
||
1. Store parameters as a list for iteration
|
||
2. Validate that all parameters require gradients
|
||
3. Initialize step counter for algorithms that need it
|
||
|
||
EXAMPLE:
|
||
>>> linear = Linear(784, 128)
|
||
>>> optimizer = SGD(linear.parameters(), lr=0.01)
|
||
|
||
HINT: Check that each parameter has requires_grad=True
|
||
"""
|
||
### BEGIN SOLUTION
|
||
# Validate and store parameters
|
||
if not isinstance(params, list):
|
||
params = list(params)
|
||
|
||
# Check that parameters require gradients
|
||
for i, param in enumerate(params):
|
||
if not isinstance(param, Tensor):
|
||
raise TypeError(f"Parameter {i} must be a Tensor, got {type(param)}")
|
||
if not param.requires_grad:
|
||
raise ValueError(f"Parameter {i} does not require gradients. Set requires_grad=True.")
|
||
|
||
self.params = params
|
||
self.step_count = 0 # For algorithms that need step counting
|
||
### END SOLUTION
|
||
|
||
def zero_grad(self):
|
||
"""
|
||
Clear gradients from all parameters.
|
||
|
||
TODO: Reset all parameter gradients to None
|
||
|
||
APPROACH:
|
||
1. Iterate through all parameters
|
||
2. Set each parameter's grad to None
|
||
|
||
EXAMPLE:
|
||
>>> optimizer.zero_grad() # Clears all gradients
|
||
>>> assert param.grad is None for param in optimizer.params
|
||
|
||
WHY: Gradients accumulate by default, so we need to clear them between batches
|
||
"""
|
||
### BEGIN SOLUTION
|
||
for param in self.params:
|
||
param.grad = None
|
||
### END SOLUTION
|
||
|
||
def step(self):
|
||
"""
|
||
Update parameters based on gradients.
|
||
|
||
This is abstract - each optimizer implements its own update rule.
|
||
"""
|
||
raise NotImplementedError("Subclasses must implement step()")
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### 🔬 Unit Test: Base Optimizer
|
||
This test validates our base Optimizer class works correctly.
|
||
**What we're testing**: Parameter validation and zero_grad functionality
|
||
**Why it matters**: Foundation for all specific optimizer implementations
|
||
**Expected**: Proper parameter storage and gradient clearing
|
||
"""
|
||
|
||
# %% nbgrader={"grade": true, "grade_id": "test-optimizer-base", "locked": true, "points": 10}
|
||
def test_unit_optimizer_base():
|
||
"""🔬 Test base Optimizer functionality."""
|
||
print("🔬 Unit Test: Base Optimizer...")
|
||
|
||
# Create test parameters
|
||
param1 = Tensor([1.0, 2.0], requires_grad=True)
|
||
param2 = Tensor([[3.0, 4.0], [5.0, 6.0]], requires_grad=True)
|
||
|
||
# Add some gradients
|
||
param1.grad = Tensor([0.1, 0.2])
|
||
param2.grad = Tensor([[0.3, 0.4], [0.5, 0.6]])
|
||
|
||
# Create optimizer
|
||
optimizer = Optimizer([param1, param2])
|
||
|
||
# Test parameter storage
|
||
assert len(optimizer.params) == 2
|
||
assert optimizer.params[0] is param1
|
||
assert optimizer.params[1] is param2
|
||
assert optimizer.step_count == 0
|
||
|
||
# Test zero_grad
|
||
optimizer.zero_grad()
|
||
assert param1.grad is None
|
||
assert param2.grad is None
|
||
|
||
# Test error handling
|
||
try:
|
||
bad_param = Tensor([1.0], requires_grad=False)
|
||
Optimizer([bad_param])
|
||
assert False, "Should have raised ValueError"
|
||
except ValueError as e:
|
||
assert "does not require gradients" in str(e)
|
||
|
||
print("✅ Base Optimizer works correctly!")
|
||
|
||
if __name__ == "__main__":
|
||
test_unit_optimizer_base()
|
||
|
||
# %% [markdown]
|
||
r"""
|
||
## SGD - Stochastic Gradient Descent
|
||
|
||
SGD is the foundation of neural network optimization. It implements the simple but powerful idea: "move in the direction opposite to the gradient."
|
||
|
||
### Why SGD Works
|
||
|
||
Gradients point uphill (toward higher loss). To minimize loss, we go downhill:
|
||
|
||
```
|
||
Loss Surface (side view):
|
||
|
||
Loss
|
||
^
|
||
|
|
||
📈 | current position
|
||
| /
|
||
| • ← you are here
|
||
| / \
|
||
| / \ gradient points uphill
|
||
|/ \
|
||
●-------\--→ parameters
|
||
\ \
|
||
\ ↘️ SGD steps downhill
|
||
\ (opposite to gradient)
|
||
\⭐ ← goal (minimum loss)
|
||
```
|
||
|
||
### The Oscillation Problem
|
||
|
||
Pure SGD can get trapped oscillating in narrow valleys:
|
||
|
||
```
|
||
Narrow valley (top view):
|
||
\ /
|
||
\ / ← steep sides
|
||
\ /
|
||
4← • →2 ← SGD bounces back and forth
|
||
/ \
|
||
1 3 instead of going down the valley
|
||
/ \
|
||
● \
|
||
goal \
|
||
```
|
||
|
||
### Momentum Solution
|
||
|
||
Momentum remembers the direction you were going and continues in that direction:
|
||
|
||
```
|
||
With momentum:
|
||
\ /
|
||
\ /
|
||
\ /
|
||
• ← smooth path down the valley
|
||
/ ↓
|
||
/ ↓
|
||
● ↓ momentum carries us through oscillations
|
||
goal
|
||
```
|
||
|
||
**Implementation:** SGD keeps a "velocity" buffer that accumulates momentum.
|
||
"""
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "sgd-optimizer", "solution": true}
|
||
#| export
|
||
class SGD(Optimizer):
|
||
"""
|
||
Stochastic Gradient Descent with momentum.
|
||
|
||
SGD is the foundational optimization algorithm that moves parameters
|
||
in the direction opposite to gradients. With momentum, it remembers
|
||
previous updates to reduce oscillations and accelerate convergence.
|
||
"""
|
||
|
||
def __init__(self, params: List[Tensor], lr: float = 0.01, momentum: float = 0.0, weight_decay: float = 0.0):
|
||
"""
|
||
Initialize SGD optimizer.
|
||
|
||
TODO: Set up SGD with momentum and weight decay
|
||
|
||
APPROACH:
|
||
1. Call parent constructor to set up parameters
|
||
2. Store learning rate, momentum, and weight decay
|
||
3. Initialize momentum buffers for each parameter
|
||
|
||
EXAMPLE:
|
||
>>> optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)
|
||
|
||
HINTS:
|
||
- Momentum buffers should be initialized as None
|
||
- They'll be created lazily on first step
|
||
"""
|
||
### BEGIN SOLUTION
|
||
super().__init__(params)
|
||
|
||
self.lr = lr
|
||
self.momentum = momentum
|
||
self.weight_decay = weight_decay
|
||
|
||
# Initialize momentum buffers (created lazily)
|
||
self.momentum_buffers = [None for _ in self.params]
|
||
### END SOLUTION
|
||
|
||
def step(self):
|
||
"""
|
||
Perform SGD update step with momentum.
|
||
|
||
TODO: Implement SGD parameter update with momentum
|
||
|
||
APPROACH:
|
||
1. For each parameter with gradients:
|
||
a. Apply weight decay if specified
|
||
b. Update momentum buffer
|
||
c. Update parameter using momentum
|
||
|
||
FORMULA:
|
||
- With weight decay: grad = grad + weight_decay * param
|
||
- Momentum: v = momentum * v_prev + grad
|
||
- Update: param = param - lr * v
|
||
|
||
HINTS:
|
||
- Skip parameters without gradients
|
||
- Initialize momentum buffers on first use
|
||
- Use in-place operations to save memory
|
||
"""
|
||
### BEGIN SOLUTION
|
||
for i, param in enumerate(self.params):
|
||
if param.grad is None:
|
||
continue
|
||
|
||
# Get gradient (param.grad is already a numpy array)
|
||
grad = param.grad
|
||
|
||
# Apply weight decay
|
||
if self.weight_decay != 0:
|
||
grad = grad + self.weight_decay * param.data
|
||
|
||
# Update momentum buffer
|
||
if self.momentum != 0:
|
||
if self.momentum_buffers[i] is None:
|
||
# Initialize momentum buffer
|
||
self.momentum_buffers[i] = np.zeros_like(param.data)
|
||
|
||
# Update momentum: v = momentum * v_prev + grad
|
||
self.momentum_buffers[i] = self.momentum * self.momentum_buffers[i] + grad
|
||
grad = self.momentum_buffers[i]
|
||
|
||
# Update parameter: param = param - lr * grad
|
||
param.data = param.data - self.lr * grad
|
||
|
||
# Increment step counter
|
||
self.step_count += 1
|
||
### END SOLUTION
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### 🔬 Unit Test: SGD Optimizer
|
||
This test validates our SGD implementation works correctly.
|
||
**What we're testing**: SGD updates with and without momentum
|
||
**Why it matters**: Core optimization algorithm used in neural network training
|
||
**Expected**: Correct parameter updates following SGD formulas
|
||
"""
|
||
|
||
# %% nbgrader={"grade": true, "grade_id": "test-sgd", "locked": true, "points": 15}
|
||
def test_unit_sgd_optimizer():
|
||
"""🔬 Test SGD optimizer implementation."""
|
||
print("🔬 Unit Test: SGD Optimizer...")
|
||
|
||
# Test basic SGD without momentum
|
||
param = Tensor([1.0, 2.0], requires_grad=True)
|
||
param.grad = Tensor([0.1, 0.2])
|
||
|
||
optimizer = SGD([param], lr=0.1)
|
||
original_data = param.data.copy()
|
||
|
||
optimizer.step()
|
||
|
||
# Expected: param = param - lr * grad = [1.0, 2.0] - 0.1 * [0.1, 0.2] = [0.99, 1.98]
|
||
expected = original_data - 0.1 * param.grad.data
|
||
assert np.allclose(param.data, expected)
|
||
assert optimizer.step_count == 1
|
||
|
||
# Test SGD with momentum
|
||
param2 = Tensor([1.0, 2.0], requires_grad=True)
|
||
param2.grad = Tensor([0.1, 0.2])
|
||
|
||
optimizer_momentum = SGD([param2], lr=0.1, momentum=0.9)
|
||
|
||
# First step: v = 0.9 * 0 + [0.1, 0.2] = [0.1, 0.2]
|
||
optimizer_momentum.step()
|
||
expected_first = np.array([1.0, 2.0]) - 0.1 * np.array([0.1, 0.2])
|
||
assert np.allclose(param2.data, expected_first)
|
||
|
||
# Second step with same gradient
|
||
param2.grad = Tensor([0.1, 0.2])
|
||
optimizer_momentum.step()
|
||
# v = 0.9 * [0.1, 0.2] + [0.1, 0.2] = [0.19, 0.38]
|
||
expected_momentum = np.array([0.19, 0.38])
|
||
expected_second = expected_first - 0.1 * expected_momentum
|
||
assert np.allclose(param2.data, expected_second, rtol=1e-5)
|
||
|
||
# Test weight decay
|
||
param3 = Tensor([1.0, 2.0], requires_grad=True)
|
||
param3.grad = Tensor([0.1, 0.2])
|
||
|
||
optimizer_wd = SGD([param3], lr=0.1, weight_decay=0.01)
|
||
optimizer_wd.step()
|
||
|
||
# grad_with_decay = [0.1, 0.2] + 0.01 * [1.0, 2.0] = [0.11, 0.22]
|
||
expected_wd = np.array([1.0, 2.0]) - 0.1 * np.array([0.11, 0.22])
|
||
assert np.allclose(param3.data, expected_wd)
|
||
|
||
print("✅ SGD optimizer works correctly!")
|
||
|
||
if __name__ == "__main__":
|
||
test_unit_sgd_optimizer()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## Adam - Adaptive Moment Estimation
|
||
|
||
Adam solves a fundamental problem with SGD: different parameters often need different learning rates. Think of tuning a complex system where some knobs need gentle adjustments and others need bold changes.
|
||
|
||
### The Parameter Scaling Problem
|
||
|
||
Consider a neural network with both embedding weights and output weights:
|
||
|
||
```
|
||
Parameter Sensitivity Landscape:
|
||
|
||
output_weight embedding_weight
|
||
↑ ↑
|
||
| |
|
||
😱 | steep cliff | 🐌 gentle slope
|
||
| (needs tiny steps) | (needs big steps)
|
||
| |
|
||
━━━●━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━●━━━→
|
||
|
||
Same learning rate = disaster!
|
||
• Small LR: output weights learn fast, embeddings crawl
|
||
• Large LR: embeddings learn well, output weights explode
|
||
```
|
||
|
||
### Adam's Adaptive Solution
|
||
|
||
Adam automatically adjusts learning rates by tracking two statistics:
|
||
|
||
```
|
||
1. MOMENTUM (first moment): "Which way am I usually going?"
|
||
m = 0.9 * old_direction + 0.1 * current_gradient
|
||
|
||
Visualization:
|
||
old: →→→→
|
||
new: ↗️
|
||
m: →→→↗️ (weighted average)
|
||
|
||
2. SCALE (second moment): "How big are my steps usually?"
|
||
v = 0.999 * old_scale + 0.001 * (current_gradient)²
|
||
|
||
Big gradients → bigger v → smaller effective steps
|
||
Small gradients → smaller v → bigger effective steps
|
||
|
||
3. ADAPTIVE UPDATE:
|
||
step = momentum / √scale
|
||
param = param - learning_rate * step
|
||
```
|
||
|
||
### Bias Correction: The Cold Start Problem
|
||
|
||
Adam starts with m=0 and v=0, which creates a bias toward zero initially:
|
||
|
||
```
|
||
Without bias correction: With bias correction:
|
||
|
||
Step 1: m = 0.9*0 + 0.1*g Step 1: m̂ = m / (1-0.9¹) = m / 0.1
|
||
= 0.1*g (too small!) = g (correct!)
|
||
|
||
Step 2: m = 0.9*0.1*g + 0.1*g Step 2: m̂ = m / (1-0.9²) = m / 0.19
|
||
= 0.19*g (still small) ≈ g (better!)
|
||
```
|
||
|
||
**Key Insight:** Adam is like having an automatic transmission that adjusts gear ratios for each parameter individually.
|
||
"""
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "adam-optimizer", "solution": true}
|
||
#| export
|
||
class Adam(Optimizer):
|
||
"""
|
||
Adam optimizer with adaptive learning rates.
|
||
|
||
Adam computes individual adaptive learning rates for different parameters
|
||
from estimates of first and second moments of the gradients.
|
||
This makes it effective for problems with sparse gradients or noisy data.
|
||
"""
|
||
|
||
def __init__(self, params: List[Tensor], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-8, weight_decay: float = 0.0):
|
||
"""
|
||
Initialize Adam optimizer.
|
||
|
||
TODO: Set up Adam with adaptive learning rates
|
||
|
||
APPROACH:
|
||
1. Call parent constructor
|
||
2. Store hyperparameters (lr, betas, eps, weight_decay)
|
||
3. Initialize first and second moment buffers
|
||
|
||
PARAMETERS:
|
||
- lr: Learning rate (default: 0.001)
|
||
- betas: Coefficients for computing running averages (default: (0.9, 0.999))
|
||
- eps: Small constant for numerical stability (default: 1e-8)
|
||
- weight_decay: L2 penalty coefficient (default: 0.0)
|
||
|
||
EXAMPLE:
|
||
>>> optimizer = Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
|
||
"""
|
||
### BEGIN SOLUTION
|
||
super().__init__(params)
|
||
|
||
self.lr = lr
|
||
self.beta1, self.beta2 = betas
|
||
self.eps = eps
|
||
self.weight_decay = weight_decay
|
||
|
||
# Initialize moment buffers (created lazily)
|
||
self.m_buffers = [None for _ in self.params] # First moment (mean)
|
||
self.v_buffers = [None for _ in self.params] # Second moment (variance)
|
||
### END SOLUTION
|
||
|
||
def step(self):
|
||
"""
|
||
Perform Adam update step.
|
||
|
||
TODO: Implement Adam parameter update with adaptive learning rates
|
||
|
||
APPROACH:
|
||
1. For each parameter with gradients:
|
||
a. Apply weight decay if specified
|
||
b. Update first moment estimate (momentum of gradient)
|
||
c. Update second moment estimate (momentum of squared gradient)
|
||
d. Compute bias-corrected moments
|
||
e. Update parameter using adaptive learning rate
|
||
|
||
FORMULAS:
|
||
- m_t = β₁ * m_{t-1} + (1-β₁) * g_t
|
||
- v_t = β₂ * v_{t-1} + (1-β₂) * g_t²
|
||
- m̂_t = m_t / (1-β₁^t)
|
||
- v̂_t = v_t / (1-β₂^t)
|
||
- θ_t = θ_{t-1} - lr * m̂_t / (√v̂_t + ε)
|
||
|
||
HINTS:
|
||
- Initialize buffers as zeros on first use
|
||
- Use step_count for bias correction
|
||
- Square gradients element-wise for second moment
|
||
"""
|
||
### BEGIN SOLUTION
|
||
# Increment step counter first (needed for bias correction)
|
||
self.step_count += 1
|
||
|
||
for i, param in enumerate(self.params):
|
||
if param.grad is None:
|
||
continue
|
||
|
||
# Get gradient (param.grad is already a numpy array)
|
||
grad = param.grad
|
||
|
||
# Apply weight decay
|
||
if self.weight_decay != 0:
|
||
grad = grad + self.weight_decay * param.data
|
||
|
||
# Initialize buffers if needed
|
||
if self.m_buffers[i] is None:
|
||
self.m_buffers[i] = np.zeros_like(param.data)
|
||
self.v_buffers[i] = np.zeros_like(param.data)
|
||
|
||
# Update biased first moment estimate
|
||
self.m_buffers[i] = self.beta1 * self.m_buffers[i] + (1 - self.beta1) * grad
|
||
|
||
# Update biased second moment estimate
|
||
self.v_buffers[i] = self.beta2 * self.v_buffers[i] + (1 - self.beta2) * (grad ** 2)
|
||
|
||
# Compute bias correction
|
||
bias_correction1 = 1 - self.beta1 ** self.step_count
|
||
bias_correction2 = 1 - self.beta2 ** self.step_count
|
||
|
||
# Compute bias-corrected moments
|
||
m_hat = self.m_buffers[i] / bias_correction1
|
||
v_hat = self.v_buffers[i] / bias_correction2
|
||
|
||
# Update parameter
|
||
param.data = param.data - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
|
||
### END SOLUTION
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### 🔬 Unit Test: Adam Optimizer
|
||
This test validates our Adam implementation works correctly.
|
||
**What we're testing**: Adam updates with adaptive learning rates and bias correction
|
||
**Why it matters**: Most popular optimizer for modern neural networks
|
||
**Expected**: Correct parameter updates following Adam formulas
|
||
"""
|
||
|
||
# %% nbgrader={"grade": true, "grade_id": "test-adam", "locked": true, "points": 20}
|
||
def test_unit_adam_optimizer():
|
||
"""🔬 Test Adam optimizer implementation."""
|
||
print("🔬 Unit Test: Adam Optimizer...")
|
||
|
||
# Test basic Adam functionality
|
||
param = Tensor([1.0, 2.0], requires_grad=True)
|
||
param.grad = Tensor([0.1, 0.2])
|
||
|
||
optimizer = Adam([param], lr=0.01, betas=(0.9, 0.999), eps=1e-8)
|
||
original_data = param.data.copy()
|
||
|
||
# First step
|
||
optimizer.step()
|
||
|
||
# Manually compute expected values
|
||
grad = np.array([0.1, 0.2])
|
||
|
||
# First moment: m = 0.9 * 0 + 0.1 * grad = 0.1 * grad
|
||
m = 0.1 * grad
|
||
|
||
# Second moment: v = 0.999 * 0 + 0.001 * grad^2 = 0.001 * grad^2
|
||
v = 0.001 * (grad ** 2)
|
||
|
||
# Bias correction
|
||
bias_correction1 = 1 - 0.9 ** 1 # = 0.1
|
||
bias_correction2 = 1 - 0.999 ** 1 # = 0.001
|
||
|
||
m_hat = m / bias_correction1 # = grad
|
||
v_hat = v / bias_correction2 # = grad^2
|
||
|
||
# Update
|
||
expected = original_data - 0.01 * m_hat / (np.sqrt(v_hat) + 1e-8)
|
||
|
||
assert np.allclose(param.data, expected, rtol=1e-6)
|
||
assert optimizer.step_count == 1
|
||
|
||
# Test second step to verify moment accumulation
|
||
param.grad = Tensor([0.1, 0.2])
|
||
optimizer.step()
|
||
|
||
# Should have updated moments
|
||
assert optimizer.m_buffers[0] is not None
|
||
assert optimizer.v_buffers[0] is not None
|
||
assert optimizer.step_count == 2
|
||
|
||
# Test with weight decay
|
||
param2 = Tensor([1.0, 2.0], requires_grad=True)
|
||
param2.grad = Tensor([0.1, 0.2])
|
||
|
||
optimizer_wd = Adam([param2], lr=0.01, weight_decay=0.01)
|
||
optimizer_wd.step()
|
||
|
||
# Weight decay should modify the effective gradient
|
||
# grad_with_decay = [0.1, 0.2] + 0.01 * [1.0, 2.0] = [0.11, 0.22]
|
||
# The exact computation is complex, but we can verify parameter changed
|
||
assert not np.array_equal(param2.data, np.array([1.0, 2.0]))
|
||
|
||
print("✅ Adam optimizer works correctly!")
|
||
|
||
if __name__ == "__main__":
|
||
test_unit_adam_optimizer()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## AdamW - Adam with Decoupled Weight Decay
|
||
|
||
AdamW fixes a subtle but important bug in Adam's weight decay implementation. The bug affects how regularization interacts with adaptive learning rates.
|
||
|
||
### The Adam Weight Decay Bug
|
||
|
||
In standard Adam, weight decay is added to gradients before the adaptive scaling:
|
||
|
||
```
|
||
Adam's approach (problematic):
|
||
1. gradient = computed_gradient + weight_decay * parameter
|
||
2. m = β₁ * m + (1-β₁) * gradient
|
||
3. v = β₂ * v + (1-β₂) * gradient²
|
||
4. step = m / √v
|
||
5. parameter = parameter - learning_rate * step
|
||
|
||
Problem: Weight decay gets "adapted" by the learning rate scaling!
|
||
```
|
||
|
||
### Why This Matters
|
||
|
||
Weight decay should be a consistent regularization force, but Adam makes it inconsistent:
|
||
|
||
```
|
||
Parameter Update Comparison:
|
||
|
||
Large gradients → small adaptive LR → weak weight decay effect
|
||
Small gradients → large adaptive LR → strong weight decay effect
|
||
|
||
This is backwards! We want consistent regularization.
|
||
```
|
||
|
||
### AdamW's Fix: Decoupled Weight Decay
|
||
|
||
AdamW separates gradient-based updates from weight decay:
|
||
|
||
```
|
||
AdamW's approach (correct):
|
||
1. m = β₁ * m + (1-β₁) * pure_gradient ← NO weight decay here
|
||
2. v = β₂ * v + (1-β₂) * pure_gradient²
|
||
3. step = m / √v
|
||
4. parameter = parameter - learning_rate * step ← gradient update
|
||
5. parameter = parameter * (1 - weight_decay_rate) ← separate decay
|
||
|
||
Result: Consistent regularization independent of gradient magnitudes!
|
||
```
|
||
|
||
### Visual Comparison
|
||
|
||
```
|
||
Adam weight decay: AdamW weight decay:
|
||
|
||
gradient ──┐ gradient ──→ adaptive ──→ param
|
||
├─→ adaptive ──→ param update
|
||
weight ────┘ scaling
|
||
decay
|
||
weight ─────────→ param
|
||
decay shrinkage
|
||
|
||
Coupled (inconsistent) Decoupled (consistent)
|
||
```
|
||
|
||
**Key Insight:** AdamW treats optimization and regularization as separate, independent processes, leading to better training dynamics and generalization.
|
||
"""
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "adamw-optimizer", "solution": true}
|
||
#| export
|
||
class AdamW(Optimizer):
|
||
"""
|
||
AdamW optimizer with decoupled weight decay.
|
||
|
||
AdamW fixes a bug in Adam's weight decay implementation by decoupling
|
||
weight decay from the gradient-based update. This leads to better
|
||
regularization and is the preferred version for most applications.
|
||
"""
|
||
|
||
def __init__(self, params: List[Tensor], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-8, weight_decay: float = 0.01):
|
||
"""
|
||
Initialize AdamW optimizer.
|
||
|
||
TODO: Set up AdamW with decoupled weight decay
|
||
|
||
APPROACH:
|
||
1. Call parent constructor
|
||
2. Store hyperparameters (note higher default weight_decay)
|
||
3. Initialize moment buffers like Adam
|
||
|
||
KEY DIFFERENCE from Adam:
|
||
- Weight decay is applied directly to parameters, not added to gradients
|
||
- This provides better regularization behavior
|
||
|
||
EXAMPLE:
|
||
>>> optimizer = AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
|
||
"""
|
||
### BEGIN SOLUTION
|
||
super().__init__(params)
|
||
|
||
self.lr = lr
|
||
self.beta1, self.beta2 = betas
|
||
self.eps = eps
|
||
self.weight_decay = weight_decay
|
||
|
||
# Initialize moment buffers (same as Adam)
|
||
self.m_buffers = [None for _ in self.params]
|
||
self.v_buffers = [None for _ in self.params]
|
||
### END SOLUTION
|
||
|
||
def step(self):
|
||
"""
|
||
Perform AdamW update step with decoupled weight decay.
|
||
|
||
TODO: Implement AdamW parameter update
|
||
|
||
APPROACH:
|
||
1. For each parameter with gradients:
|
||
a. Update moments using gradients (NOT modified by weight decay)
|
||
b. Compute bias-corrected moments
|
||
c. Apply gradient-based update
|
||
d. Apply weight decay directly to parameters
|
||
|
||
KEY DIFFERENCE from Adam:
|
||
- Weight decay: θ_t = θ_t - lr * weight_decay * θ_t (applied after gradient update)
|
||
- NOT: grad = grad + weight_decay * param (Adam's incorrect approach)
|
||
|
||
FORMULAS:
|
||
- Same moment updates as Adam (using unmodified gradients)
|
||
- Gradient update: θ_t = θ_{t-1} - lr * m̂_t / (√v̂_t + ε)
|
||
- Weight decay: θ_t = θ_t * (1 - lr * weight_decay)
|
||
|
||
HINT: Apply weight decay after gradient update for proper decoupling
|
||
"""
|
||
### BEGIN SOLUTION
|
||
# Increment step counter first
|
||
self.step_count += 1
|
||
|
||
for i, param in enumerate(self.params):
|
||
if param.grad is None:
|
||
continue
|
||
|
||
# Get gradient (NOT modified by weight decay) - param.grad is already a numpy array
|
||
grad = param.grad
|
||
|
||
# Initialize buffers if needed
|
||
if self.m_buffers[i] is None:
|
||
self.m_buffers[i] = np.zeros_like(param.data)
|
||
self.v_buffers[i] = np.zeros_like(param.data)
|
||
|
||
# Update moments using pure gradients
|
||
self.m_buffers[i] = self.beta1 * self.m_buffers[i] + (1 - self.beta1) * grad
|
||
self.v_buffers[i] = self.beta2 * self.v_buffers[i] + (1 - self.beta2) * (grad ** 2)
|
||
|
||
# Compute bias correction
|
||
bias_correction1 = 1 - self.beta1 ** self.step_count
|
||
bias_correction2 = 1 - self.beta2 ** self.step_count
|
||
|
||
# Compute bias-corrected moments
|
||
m_hat = self.m_buffers[i] / bias_correction1
|
||
v_hat = self.v_buffers[i] / bias_correction2
|
||
|
||
# Apply gradient-based update
|
||
param.data = param.data - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
|
||
|
||
# Apply decoupled weight decay
|
||
if self.weight_decay != 0:
|
||
param.data = param.data * (1 - self.lr * self.weight_decay)
|
||
### END SOLUTION
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### 🔬 Unit Test: AdamW Optimizer
|
||
This test validates our AdamW implementation with decoupled weight decay.
|
||
**What we're testing**: AdamW updates with proper weight decay decoupling
|
||
**Why it matters**: State-of-the-art optimizer for transformer models
|
||
**Expected**: Correct separation of gradient updates and weight decay
|
||
"""
|
||
|
||
# %% nbgrader={"grade": true, "grade_id": "test-adamw", "locked": true, "points": 20}
|
||
def test_unit_adamw_optimizer():
|
||
"""🔬 Test AdamW optimizer implementation."""
|
||
print("🔬 Unit Test: AdamW Optimizer...")
|
||
|
||
# Test AdamW vs Adam difference in weight decay
|
||
# Create identical parameters for comparison
|
||
param_adam = Tensor([1.0, 2.0], requires_grad=True)
|
||
param_adamw = Tensor([1.0, 2.0], requires_grad=True)
|
||
|
||
param_adam.grad = Tensor([0.1, 0.2])
|
||
param_adamw.grad = Tensor([0.1, 0.2])
|
||
|
||
# Create optimizers with same settings
|
||
adam = Adam([param_adam], lr=0.01, weight_decay=0.01)
|
||
adamw = AdamW([param_adamw], lr=0.01, weight_decay=0.01)
|
||
|
||
# Take one step
|
||
adam.step()
|
||
adamw.step()
|
||
|
||
# Results should be different due to weight decay implementation
|
||
assert not np.allclose(param_adam.data, param_adamw.data, rtol=1e-6)
|
||
|
||
# Test AdamW basic functionality
|
||
param = Tensor([1.0, 2.0], requires_grad=True)
|
||
param.grad = Tensor([0.1, 0.2])
|
||
|
||
optimizer = AdamW([param], lr=0.01, weight_decay=0.01)
|
||
original_data = param.data.copy()
|
||
|
||
optimizer.step()
|
||
|
||
# Parameter should have changed
|
||
assert not np.array_equal(param.data, original_data)
|
||
assert optimizer.step_count == 1
|
||
|
||
# Test that moment buffers are created
|
||
assert optimizer.m_buffers[0] is not None
|
||
assert optimizer.v_buffers[0] is not None
|
||
|
||
# Test zero weight decay behaves like Adam
|
||
param1 = Tensor([1.0, 2.0], requires_grad=True)
|
||
param2 = Tensor([1.0, 2.0], requires_grad=True)
|
||
|
||
param1.grad = Tensor([0.1, 0.2])
|
||
param2.grad = Tensor([0.1, 0.2])
|
||
|
||
adam_no_wd = Adam([param1], lr=0.01, weight_decay=0.0)
|
||
adamw_no_wd = AdamW([param2], lr=0.01, weight_decay=0.0)
|
||
|
||
adam_no_wd.step()
|
||
adamw_no_wd.step()
|
||
|
||
# Should be very similar (within numerical precision)
|
||
assert np.allclose(param1.data, param2.data, rtol=1e-10)
|
||
|
||
print("✅ AdamW optimizer works correctly!")
|
||
|
||
if __name__ == "__main__":
|
||
test_unit_adamw_optimizer()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## 4. Integration: Bringing It Together
|
||
|
||
Now let's see how our optimizers perform in realistic scenarios. We'll compare their behavior on the same optimization problem to understand their different characteristics.
|
||
|
||
### Optimizer Behavior Comparison
|
||
|
||
Each optimizer takes a different approach to the same problem:
|
||
|
||
```
|
||
Optimization Problem: Find minimum of f(x) = x²
|
||
|
||
SGD approach: Adam approach: AdamW approach:
|
||
↓ ↓ ↓
|
||
x ──→ minimize x ──→ minimize x ──→ minimize
|
||
↑ ↑ ↑
|
||
fixed LR adaptive LR adaptive LR + decay
|
||
```
|
||
"""
|
||
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## 5. Systems Analysis: Optimizer Performance and Memory
|
||
|
||
Different optimizers have very different resource requirements. Understanding these trade-offs is crucial for production ML systems.
|
||
|
||
### Memory Usage Patterns
|
||
|
||
```
|
||
Optimizer Memory Requirements (per parameter):
|
||
|
||
SGD: Adam/AdamW:
|
||
┌────────┐ ┌────────┐
|
||
│ param │ │ param │
|
||
├────────┤ ├────────┤
|
||
│momentum│ │ m │ ← first moment
|
||
└────────┘ ├────────┤
|
||
│ v │ ← second moment
|
||
└────────┘
|
||
|
||
2× memory 3× memory
|
||
```
|
||
|
||
### Computational Complexity
|
||
|
||
```
|
||
Per-step Operations:
|
||
|
||
SGD: Adam:
|
||
• 1 multiplication • 3 multiplications
|
||
• 1 addition • 4 additions
|
||
• 1 subtraction • 1 subtraction
|
||
• 1 square root
|
||
• 1 division
|
||
|
||
O(n) simple ops O(n) complex ops
|
||
```
|
||
"""
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "optimizer-analysis", "solution": true}
|
||
def analyze_optimizer_memory_usage():
|
||
"""📊 Analyze memory usage of different optimizers."""
|
||
print("📊 Analyzing Optimizer Memory Usage...")
|
||
|
||
# Create test parameters of different sizes
|
||
param_sizes = [1000, 10000, 100000] # 1K, 10K, 100K parameters
|
||
|
||
print("Optimizer Memory Analysis (per parameter tensor):")
|
||
print("=" * 60)
|
||
print(f"{'Size':<10} {'SGD':<10} {'Adam':<10} {'AdamW':<10} {'Ratio':<10}")
|
||
print("-" * 60)
|
||
|
||
for size in param_sizes:
|
||
# Create parameter
|
||
param = Tensor(np.random.randn(size), requires_grad=True)
|
||
param.grad = Tensor(np.random.randn(size))
|
||
|
||
# SGD memory (parameter + momentum buffer)
|
||
sgd = SGD([param], momentum=0.9)
|
||
sgd.step() # Initialize buffers
|
||
sgd_memory = size * 2 # param + momentum buffer
|
||
|
||
# Adam memory (parameter + 2 moment buffers)
|
||
param_adam = Tensor(np.random.randn(size), requires_grad=True)
|
||
param_adam.grad = Tensor(np.random.randn(size))
|
||
adam = Adam([param_adam])
|
||
adam.step() # Initialize buffers
|
||
adam_memory = size * 3 # param + m_buffer + v_buffer
|
||
|
||
# AdamW memory (same as Adam)
|
||
adamw_memory = adam_memory
|
||
|
||
# Memory ratio (Adam/SGD)
|
||
ratio = adam_memory / sgd_memory
|
||
|
||
print(f"{size:<10} {sgd_memory:<10} {adam_memory:<10} {adamw_memory:<10} {ratio:.1f}x")
|
||
|
||
print("\n💡 Key Insights:")
|
||
print("- SGD: 2× parameter memory (momentum buffer)")
|
||
print("- Adam/AdamW: 3× parameter memory (two moment buffers)")
|
||
print("- Memory scales linearly with model size")
|
||
print("- Trade-off: More memory for better convergence")
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "optimizer-convergence", "solution": true}
|
||
def analyze_optimizer_convergence_behavior():
|
||
"""📊 Analyze convergence behavior of different optimizers."""
|
||
print("📊 Analyzing Optimizer Convergence Behavior...")
|
||
|
||
# Simulate optimization of a quadratic function: f(x) = 0.5 * x^2
|
||
# Optimal solution: x* = 0, gradient = x
|
||
|
||
def quadratic_loss(x):
|
||
"""Simple quadratic function for optimization testing."""
|
||
return 0.5 * (x ** 2).sum()
|
||
|
||
def compute_gradient(x):
|
||
"""Gradient of quadratic function: df/dx = x."""
|
||
return x.copy()
|
||
|
||
# Starting point
|
||
x_start = np.array([5.0, -3.0, 2.0]) # Far from optimum [0, 0, 0]
|
||
|
||
# Test different optimizers
|
||
optimizers_to_test = [
|
||
("SGD", SGD, {"lr": 0.1}),
|
||
("SGD+Momentum", SGD, {"lr": 0.1, "momentum": 0.9}),
|
||
("Adam", Adam, {"lr": 0.1}),
|
||
("AdamW", AdamW, {"lr": 0.1, "weight_decay": 0.01})
|
||
]
|
||
|
||
print("Convergence Analysis (quadratic function f(x) = 0.5 * x²):")
|
||
print("=" * 70)
|
||
print(f"{'Optimizer':<15} {'Step 0':<12} {'Step 5':<12} {'Step 10':<12} {'Final Loss':<12}")
|
||
print("-" * 70)
|
||
|
||
for name, optimizer_class, kwargs in optimizers_to_test:
|
||
# Reset parameter
|
||
param = Tensor(x_start.copy(), requires_grad=True)
|
||
optimizer = optimizer_class([param], **kwargs)
|
||
|
||
losses = []
|
||
|
||
# Run optimization for 10 steps
|
||
for step in range(11):
|
||
# Compute loss and gradient
|
||
loss = quadratic_loss(param.data)
|
||
param.grad = Tensor(compute_gradient(param.data))
|
||
|
||
losses.append(loss)
|
||
|
||
# Update parameters
|
||
if step < 10: # Don't update after last evaluation
|
||
optimizer.step()
|
||
optimizer.zero_grad()
|
||
|
||
# Format results
|
||
step0 = f"{losses[0]:.6f}"
|
||
step5 = f"{losses[5]:.6f}"
|
||
step10 = f"{losses[10]:.6f}"
|
||
final = f"{losses[10]:.6f}"
|
||
|
||
print(f"{name:<15} {step0:<12} {step5:<12} {step10:<12} {final:<12}")
|
||
|
||
print("\n💡 Key Insights:")
|
||
print("- SGD: Steady progress but can be slow")
|
||
print("- SGD+Momentum: Faster convergence, less oscillation")
|
||
print("- Adam: Adaptive rates help with different parameter scales")
|
||
print("- AdamW: Similar to Adam with regularization effects")
|
||
|
||
# %% [markdown]
|
||
# """
|
||
# # 🧪 Module Integration Test
|
||
#
|
||
# Final validation that everything works together correctly.
|
||
# """
|
||
#
|
||
# def import_previous_module(module_name: str, component_name: str):
|
||
# import sys
|
||
# import os
|
||
# module = __import__(f"{module_name.split('_')[1]}_dev")
|
||
# return getattr(module, component_name)
|
||
|
||
# %% nbgrader={"grade": true, "grade_id": "module-integration", "locked": true, "points": 25}
|
||
def test_module():
|
||
"""
|
||
Comprehensive test of entire module functionality.
|
||
|
||
This final test runs before module summary to ensure:
|
||
- All unit tests pass
|
||
- Functions work together correctly
|
||
- Module is ready for integration with TinyTorch
|
||
"""
|
||
print("🧪 RUNNING MODULE INTEGRATION TEST")
|
||
print("=" * 50)
|
||
|
||
# Run all unit tests
|
||
print("Running unit tests...")
|
||
test_unit_optimizer_base()
|
||
test_unit_sgd_optimizer()
|
||
test_unit_adam_optimizer()
|
||
test_unit_adamw_optimizer()
|
||
|
||
print("\nRunning integration scenarios...")
|
||
|
||
# Test realistic neural network optimization scenario
|
||
print("🔬 Integration Test: Multi-layer Network Optimization...")
|
||
|
||
# Import components from previous modules (explicit imports)
|
||
from tinytorch.core.tensor import Tensor # Module 01: Foundation
|
||
from tinytorch.core.layers import Linear # Module 03: Layers
|
||
from tinytorch.core.activations import ReLU # Module 02: Activations
|
||
from tinytorch.core.losses import MSELoss # Module 04: Losses
|
||
|
||
# Create parameters for a 2-layer network
|
||
# Layer 1: 3 inputs -> 4 hidden
|
||
W1 = Tensor(np.random.randn(3, 4) * 0.1, requires_grad=True)
|
||
b1 = Tensor(np.zeros(4), requires_grad=True)
|
||
|
||
# Layer 2: 4 hidden -> 2 outputs
|
||
W2 = Tensor(np.random.randn(4, 2) * 0.1, requires_grad=True)
|
||
b2 = Tensor(np.zeros(2), requires_grad=True)
|
||
|
||
params = [W1, b1, W2, b2]
|
||
|
||
# Add realistic gradients
|
||
W1.grad = Tensor(np.random.randn(3, 4) * 0.01)
|
||
b1.grad = Tensor(np.random.randn(4) * 0.01)
|
||
W2.grad = Tensor(np.random.randn(4, 2) * 0.01)
|
||
b2.grad = Tensor(np.random.randn(2) * 0.01)
|
||
|
||
# Test all optimizers on same network
|
||
optimizers = [
|
||
SGD(params, lr=0.01, momentum=0.9),
|
||
Adam([p for p in params], lr=0.001), # Fresh param list for Adam
|
||
AdamW([p for p in params], lr=0.001, weight_decay=0.01) # Fresh param list for AdamW
|
||
]
|
||
|
||
# Save original parameter values
|
||
original_params = [p.data.copy() for p in params]
|
||
|
||
# Test SGD
|
||
optimizers[0].step()
|
||
sgd_params = [p.data.copy() for p in params]
|
||
|
||
# Restore parameters and test Adam
|
||
for i, p in enumerate(params):
|
||
p.data = original_params[i].copy()
|
||
# Re-add gradients since they may have been modified
|
||
if i == 0:
|
||
p.grad = Tensor(np.random.randn(3, 4) * 0.01)
|
||
elif i == 1:
|
||
p.grad = Tensor(np.random.randn(4) * 0.01)
|
||
elif i == 2:
|
||
p.grad = Tensor(np.random.randn(4, 2) * 0.01)
|
||
else:
|
||
p.grad = Tensor(np.random.randn(2) * 0.01)
|
||
|
||
# Update parameter references for Adam
|
||
optimizers[1].params = params
|
||
optimizers[1].step()
|
||
adam_params = [p.data.copy() for p in params]
|
||
|
||
# Restore parameters and test AdamW
|
||
for i, p in enumerate(params):
|
||
p.data = original_params[i].copy()
|
||
# Re-add gradients
|
||
if i == 0:
|
||
p.grad = Tensor(np.random.randn(3, 4) * 0.01)
|
||
elif i == 1:
|
||
p.grad = Tensor(np.random.randn(4) * 0.01)
|
||
elif i == 2:
|
||
p.grad = Tensor(np.random.randn(4, 2) * 0.01)
|
||
else:
|
||
p.grad = Tensor(np.random.randn(2) * 0.01)
|
||
|
||
# Update parameter references for AdamW
|
||
optimizers[2].params = params
|
||
optimizers[2].step()
|
||
adamw_params = [p.data.copy() for p in params]
|
||
|
||
# Verify parameters changed differently for each optimizer
|
||
for i in range(len(params)):
|
||
# Parameters should be different from original
|
||
assert not np.array_equal(sgd_params[i], original_params[i])
|
||
assert not np.array_equal(adam_params[i], original_params[i])
|
||
assert not np.array_equal(adamw_params[i], original_params[i])
|
||
|
||
# Different optimizers should produce different results
|
||
assert not np.allclose(sgd_params[i], adam_params[i], rtol=1e-6)
|
||
|
||
print("✅ Multi-layer network optimization works!")
|
||
|
||
# Test optimizer state management
|
||
print("🔬 Integration Test: Optimizer State Management...")
|
||
|
||
param = Tensor([1.0, 2.0], requires_grad=True)
|
||
param.grad = Tensor([0.1, 0.2])
|
||
|
||
optimizer = Adam([param], lr=0.001)
|
||
|
||
# First step should initialize buffers
|
||
optimizer.step()
|
||
assert optimizer.m_buffers[0] is not None
|
||
assert optimizer.v_buffers[0] is not None
|
||
assert optimizer.step_count == 1
|
||
|
||
# Zero grad should clear gradients but preserve optimizer state
|
||
optimizer.zero_grad()
|
||
assert param.grad is None
|
||
assert optimizer.m_buffers[0] is not None # State preserved
|
||
assert optimizer.step_count == 1 # Step count preserved
|
||
|
||
print("✅ Optimizer state management works!")
|
||
|
||
print("\n" + "=" * 50)
|
||
print("🎉 ALL TESTS PASSED! Module ready for export.")
|
||
print("Run: tito module complete 06_optimizers")
|
||
|
||
# %%
|
||
# Run comprehensive module test
|
||
if __name__ == "__main__":
|
||
test_module()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## 🎯 MODULE SUMMARY: Optimizers
|
||
|
||
Congratulations! You've built sophisticated optimization algorithms that power modern neural network training!
|
||
|
||
### Key Accomplishments
|
||
- Built SGD optimizer with momentum for stable gradient descent and oscillation reduction
|
||
- Implemented Adam optimizer with adaptive learning rates and bias correction for different parameter scales
|
||
- Created AdamW optimizer with decoupled weight decay for proper regularization
|
||
- Analyzed memory trade-offs: SGD (2×), Adam/AdamW (3× parameter memory)
|
||
- All tests pass ✅ (validated by `test_module()`)
|
||
|
||
### Ready for Next Steps
|
||
Your optimizer implementations enable sophisticated neural network training! With gradients from Module 05 and optimizers from Module 06, you're ready to build complete training loops.
|
||
|
||
Export with: `tito module complete 06_optimizers`
|
||
|
||
**Next**: Module 07 will add training loops, learning rate scheduling, and checkpointing for complete end-to-end neural network training!
|
||
"""
|