Files
TinyTorch/modules/09_optimizers/optimizers_dev.py
Vijay Janapa Reddi 6d11a2be40 Complete comprehensive system validation and cleanup
🎯 Major Accomplishments:
•  All 15 module dev files validated and unit tests passing
•  Comprehensive integration tests (11/11 pass)
•  All 3 examples working with PyTorch-like API (XOR, MNIST, CIFAR-10)
•  Training capability verified (4/4 tests pass, XOR shows 35.8% improvement)
•  Clean directory structure (modules/source/ → modules/)

🧹 Repository Cleanup:
• Removed experimental/debug files and old logos
• Deleted redundant documentation (API_SIMPLIFICATION_COMPLETE.md, etc.)
• Removed empty module directories and backup files
• Streamlined examples (kept modern API versions only)
• Cleaned up old TinyGPT implementation (moved to examples concept)

📊 Validation Results:
• Module unit tests: 15/15 
• Integration tests: 11/11 
• Example validation: 3/3 
• Training validation: 4/4 

🔧 Key Fixes:
• Fixed activations module requires_grad test
• Fixed networks module layer name test (Dense → Linear)
• Fixed spatial module Conv2D weights attribute issues
• Updated all documentation to reflect new structure

📁 Structure Improvements:
• Simplified modules/source/ → modules/ (removed unnecessary nesting)
• Added comprehensive validation test suites
• Created VALIDATION_COMPLETE.md and WORKING_MODULES.md documentation
• Updated book structure to reflect ML evolution story

🚀 System Status: READY FOR PRODUCTION
All components validated, examples working, training capability verified.
Test-first approach successfully implemented and proven.
2025-09-23 10:00:33 -04:00

3314 lines
131 KiB
Python
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# ---
# jupyter:
# jupytext:
# text_representation:
# extension: .py
# format_name: percent
# format_version: '1.3'
# jupytext_version: 1.17.1
# ---
# %% [markdown]
"""
# Optimizers - Gradient-Based Parameter Updates and Training Dynamics
Welcome to the Optimizers module! You'll implement the algorithms that use gradients to update neural network parameters, determining how effectively networks learn from data.
## Learning Goals
- Systems understanding: How different optimization algorithms affect convergence speed, memory usage, and training stability
- Core implementation skill: Build SGD with momentum and Adam optimizer, understanding their mathematical foundations and implementation trade-offs
- Pattern recognition: Understand how adaptive learning rates and momentum help navigate complex loss landscapes
- Framework connection: See how your optimizer implementations match PyTorch's optim module design and state management
- Performance insight: Learn why optimizer choice affects training speed and why Adam uses 3x more memory than SGD
## Build → Use → Reflect
1. **Build**: Complete SGD and Adam optimizers with proper state management and learning rate scheduling
2. **Use**: Train neural networks with different optimizers and compare convergence behavior on real datasets
3. **Reflect**: Why do some optimizers work better for certain problems, and how does memory usage scale with model size?
## What You'll Achieve
By the end of this module, you'll understand:
- Deep technical understanding of how optimization algorithms navigate high-dimensional loss landscapes to find good solutions
- Practical capability to implement and tune optimizers that determine training success or failure
- Systems insight into why optimizer choice often matters more than architecture choice for training success
- Performance consideration of how optimizer memory requirements and computational overhead affect scalable training
- Connection to production ML systems and why new optimizers continue to be an active area of research
## Systems Reality Check
💡 **Production Context**: PyTorch's Adam implementation includes numerically stable variants and can automatically scale learning rates based on gradient norms to prevent training instability
⚡ **Performance Note**: Adam stores running averages for every parameter, using 3x the memory of SGD - this memory overhead becomes critical when training large models near GPU memory limits
"""
# %% nbgrader={"grade": false, "grade_id": "optimizers-imports", "locked": false, "schema_version": 3, "solution": false, "task": false}
#| default_exp core.optimizers
#| export
import numpy as np
import sys
import os
from typing import List, Dict, Any, Optional, Union
from collections import defaultdict
# Helper function to set up import paths
def setup_import_paths():
"""Set up import paths for development modules."""
import sys
import os
# Add module directories to path
base_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
tensor_dir = os.path.join(base_dir, '01_tensor')
autograd_dir = os.path.join(base_dir, '07_autograd')
if tensor_dir not in sys.path:
sys.path.append(tensor_dir)
if autograd_dir not in sys.path:
sys.path.append(autograd_dir)
# Import our existing components
try:
from tinytorch.core.tensor import Tensor
from tinytorch.core.autograd import Variable
except ImportError:
# For development, try local imports
try:
setup_import_paths()
from tensor_dev import Tensor
from autograd_dev import Variable
except ImportError:
# Create minimal fallback classes for testing
print("Warning: Using fallback classes for testing")
class Tensor:
def __init__(self, data):
self.data = np.array(data)
self.shape = self.data.shape
def __str__(self):
return f"Tensor({self.data})"
class Variable:
def __init__(self, data, requires_grad=True):
if isinstance(data, (int, float)):
self.data = Tensor([data])
else:
self.data = Tensor(data)
self.requires_grad = requires_grad
self.grad = None
def zero_grad(self):
self.grad = None
def __str__(self):
return f"Variable({self.data.data})"
# %% nbgrader={"grade": false, "grade_id": "optimizers-setup", "locked": false, "schema_version": 3, "solution": false, "task": false}
print("🔥 TinyTorch Optimizers Module")
print(f"NumPy version: {np.__version__}")
print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}")
print("Ready to build optimization algorithms!")
# %% [markdown]
"""
## 📦 Where This Code Lives in the Final Package
**Learning Side:** You work in `modules/source/08_optimizers/optimizers_dev.py`
**Building Side:** Code exports to `tinytorch.core.optimizers`
```python
# Final package structure:
from tinytorch.core.optimizers import SGD, Adam, StepLR # The optimization engines!
from tinytorch.core.autograd import Variable # Gradient computation
from tinytorch.core.tensor import Tensor # Data structures
```
**Why this matters:**
- **Learning:** Focused module for understanding optimization algorithms
- **Production:** Proper organization like PyTorch's `torch.optim`
- **Consistency:** All optimization algorithms live together in `core.optimizers`
- **Foundation:** Enables effective neural network training
"""
# %% [markdown]
"""
## What Are Optimizers?
### The Problem: How to Update Parameters
Neural networks learn by updating parameters using gradients:
```
parameter_new = parameter_old - learning_rate * gradient
```
But **naive gradient descent** has problems:
- **Slow convergence**: Takes many steps to reach optimum
- **Oscillation**: Bounces around valleys without making progress
- **Poor scaling**: Same learning rate for all parameters
### The Solution: Smart Optimization
**Optimizers** are algorithms that intelligently update parameters:
- **Momentum**: Accelerate convergence by accumulating velocity
- **Adaptive learning rates**: Different learning rates for different parameters
- **Second-order information**: Use curvature to guide updates
### Real-World Impact
- **SGD**: The foundation of all neural network training
- **Adam**: The default optimizer for most deep learning applications
- **Learning rate scheduling**: Critical for training stability and performance
### What We'll Build
1. **SGD**: Stochastic Gradient Descent with momentum
2. **Adam**: Adaptive Moment Estimation optimizer
3. **StepLR**: Learning rate scheduling
4. **Integration**: Complete training loop with optimizers
"""
# %% [markdown]
"""
## 🔧 DEVELOPMENT
"""
# %% [markdown]
"""
## Step 1: Understanding Gradient Descent
### What is Gradient Descent?
**Gradient descent** finds the minimum of a function by following the negative gradient:
```
θ_{t+1} = θ_t - α ∇f(θ_t)
```
Where:
- θ: Parameters we want to optimize
- α: Learning rate (how big steps to take)
- ∇f(θ): Gradient of loss function with respect to parameters
### Why Gradient Descent Works
1. **Gradients point uphill**: Negative gradient points toward minimum
2. **Iterative improvement**: Each step reduces the loss (in theory)
3. **Local convergence**: Finds local minimum with proper learning rate
4. **Scalable**: Works with millions of parameters
### The Learning Rate Dilemma
- **Too large**: Overshoots minimum, diverges
- **Too small**: Extremely slow convergence
- **Just right**: Steady progress toward minimum
### Visual Understanding
```
Loss landscape: U-shaped curve
Start here: ↑
Gradient descent: ↓ → ↓ → ↓ → minimum
```
### Real-World Applications
- **Neural networks**: Training any deep learning model
- **Machine learning**: Logistic regression, SVM, etc.
- **Scientific computing**: Optimization problems in physics, engineering
- **Economics**: Portfolio optimization, game theory
Let's implement gradient descent to understand it deeply!
"""
# %% nbgrader={"grade": false, "grade_id": "gradient-descent-function", "locked": false, "schema_version": 3, "solution": true, "task": false}
#| export
def gradient_descent_step(parameter: Variable, learning_rate: float) -> None:
"""
Perform one step of gradient descent on a parameter.
Args:
parameter: Variable with gradient information
learning_rate: How much to update parameter
TODO: Implement basic gradient descent parameter update.
STEP-BY-STEP IMPLEMENTATION:
1. Check if parameter has a gradient
2. Get current parameter value and gradient
3. Update parameter: new_value = old_value - learning_rate * gradient
4. Update parameter data with new value
5. Handle edge cases (no gradient, invalid values)
EXAMPLE USAGE:
```python
# Parameter with gradient
w = Variable(2.0, requires_grad=True)
w.grad = Variable(0.5) # Gradient from loss
# Update parameter
gradient_descent_step(w, learning_rate=0.1)
# w.data now contains: 2.0 - 0.1 * 0.5 = 1.95
```
IMPLEMENTATION HINTS:
- Check if parameter.grad is not None
- Use parameter.grad.data.data to get gradient value
- Update parameter.data with new Tensor
- Don't modify gradient (it's used for logging)
LEARNING CONNECTIONS:
- This is the foundation of all neural network training
- PyTorch's optimizer.step() does exactly this
- The learning rate determines convergence speed
"""
### BEGIN SOLUTION
if parameter.grad is not None:
# Get current parameter value and gradient
current_value = parameter.data.data
gradient_value = parameter.grad.data.data
# Update parameter: new_value = old_value - learning_rate * gradient
new_value = current_value - learning_rate * gradient_value
# Update parameter data
parameter.data = Tensor(new_value)
### END SOLUTION
# %% [markdown]
"""
### 🧪 Unit Test: Gradient Descent Step
Let's test your gradient descent implementation right away! This is the foundation of all optimization algorithms.
**This is a unit test** - it tests one specific function (gradient_descent_step) in isolation.
"""
# %% nbgrader={"grade": true, "grade_id": "test-gradient-descent", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false}
def test_unit_gradient_descent_step():
"""Unit test for the basic gradient descent parameter update."""
print("🔬 Unit Test: Gradient Descent Step...")
# Test basic parameter update
try:
w = Variable(2.0, requires_grad=True)
w.grad = Variable(0.5) # Positive gradient
original_value = w.data.data.item()
gradient_descent_step(w, learning_rate=0.1)
new_value = w.data.data.item()
expected_value = original_value - 0.1 * 0.5 # 2.0 - 0.05 = 1.95
assert abs(new_value - expected_value) < 1e-6, f"Expected {expected_value}, got {new_value}"
print("✅ Basic parameter update works")
except Exception as e:
print(f"❌ Basic parameter update failed: {e}")
raise
# Test with negative gradient
try:
w2 = Variable(1.0, requires_grad=True)
w2.grad = Variable(-0.2) # Negative gradient
gradient_descent_step(w2, learning_rate=0.1)
expected_value2 = 1.0 - 0.1 * (-0.2) # 1.0 + 0.02 = 1.02
assert abs(w2.data.data.item() - expected_value2) < 1e-6, "Negative gradient test failed"
print("✅ Negative gradient handling works")
except Exception as e:
print(f"❌ Negative gradient handling failed: {e}")
raise
# Test with no gradient (should not update)
try:
w3 = Variable(3.0, requires_grad=True)
w3.grad = None
original_value3 = w3.data.data.item()
gradient_descent_step(w3, learning_rate=0.1)
assert w3.data.data.item() == original_value3, "Parameter with no gradient should not update"
print("✅ No gradient case works")
except Exception as e:
print(f"❌ No gradient case failed: {e}")
raise
print("🎯 Gradient descent step behavior:")
print(" Updates parameters in negative gradient direction")
print(" Uses learning rate to control step size")
print(" Skips updates when gradient is None")
print("📈 Progress: Gradient Descent Step ✓")
# Test function defined (called in main block)
# Test function is called by auto-discovery system
# %% [markdown]
"""
## Step 2: SGD with Momentum
### What is SGD?
**SGD (Stochastic Gradient Descent)** is the fundamental optimization algorithm:
```
θ_{t+1} = θ_t - α ∇L(θ_t)
```
### The Problem with Vanilla SGD
- **Slow convergence**: Especially in narrow valleys
- **Oscillation**: Bounces around without making progress
- **Poor conditioning**: Struggles with ill-conditioned problems
### The Solution: Momentum
**Momentum** accumulates velocity to accelerate convergence:
```
v_t = β v_{t-1} + ∇L(θ_t)
θ_{t+1} = θ_t - α v_t
```
Where:
- v_t: Velocity (exponential moving average of gradients)
- β: Momentum coefficient (typically 0.9)
- α: Learning rate
### Why Momentum Works
1. **Acceleration**: Builds up speed in consistent directions
2. **Dampening**: Reduces oscillations in inconsistent directions
3. **Memory**: Remembers previous gradient directions
4. **Robustness**: Less sensitive to noisy gradients
### Visual Understanding
```
Without momentum: ↗↙↗↙↗↙ (oscillating)
With momentum: ↗→→→→→ (smooth progress)
```
### Real-World Applications
- **Image classification**: Training ResNet, VGG
- **Natural language**: Training RNNs, early transformers
- **Classic choice**: Still used when Adam fails
- **Large batch training**: Often preferred over Adam
Let's implement SGD with momentum!
"""
# %% nbgrader={"grade": false, "grade_id": "sgd-class", "locked": false, "schema_version": 3, "solution": true, "task": false}
#| export
class SGD:
"""
SGD Optimizer with Momentum
Implements stochastic gradient descent with momentum:
v_t = momentum * v_{t-1} + gradient
parameter = parameter - learning_rate * v_t
"""
def __init__(self, parameters: List[Variable], learning_rate: float = 0.01,
momentum: float = 0.0, weight_decay: float = 0.0):
"""
Initialize SGD optimizer.
Args:
parameters: List of Variables to optimize
learning_rate: Learning rate (default: 0.01)
momentum: Momentum coefficient (default: 0.0)
weight_decay: L2 regularization coefficient (default: 0.0)
TODO: Implement SGD optimizer initialization.
APPROACH:
1. Store parameters and hyperparameters
2. Initialize momentum buffers for each parameter
3. Set up state tracking for optimization
4. Prepare for step() and zero_grad() methods
EXAMPLE:
```python
# Create optimizer
optimizer = SGD([w1, w2, b1, b2], learning_rate=0.01, momentum=0.9)
# In training loop:
optimizer.zero_grad()
loss.backward()
optimizer.step()
```
HINTS:
- Store parameters as a list
- Initialize momentum buffers as empty dict
- Use parameter id() as key for momentum tracking
- Momentum buffers will be created lazily in step()
"""
### BEGIN SOLUTION
self.parameters = parameters
self.learning_rate = learning_rate
self.momentum = momentum
self.weight_decay = weight_decay
# Initialize momentum buffers (created lazily)
self.momentum_buffers = {}
# Track optimization steps
self.step_count = 0
### END SOLUTION
def step(self) -> None:
"""
Perform one optimization step.
TODO: Implement SGD parameter update with momentum.
APPROACH:
1. Iterate through all parameters
2. For each parameter with gradient:
a. Get current gradient
b. Apply weight decay if specified
c. Update momentum buffer (or create if first time)
d. Update parameter using momentum
3. Increment step count
MATHEMATICAL FORMULATION:
- If weight_decay > 0: gradient = gradient + weight_decay * parameter
- momentum_buffer = momentum * momentum_buffer + gradient
- parameter = parameter - learning_rate * momentum_buffer
IMPLEMENTATION HINTS:
- Use id(param) as key for momentum buffers
- Initialize buffer with zeros if not exists
- Handle case where momentum = 0 (no momentum)
- Update parameter.data with new Tensor
"""
### BEGIN SOLUTION
for param in self.parameters:
if param.grad is not None:
# Get gradient
gradient = param.grad.data.data
# Apply weight decay (L2 regularization)
if self.weight_decay > 0:
gradient = gradient + self.weight_decay * param.data.data
# Get or create momentum buffer
param_id = id(param)
if param_id not in self.momentum_buffers:
self.momentum_buffers[param_id] = np.zeros_like(param.data.data)
# Update momentum buffer
self.momentum_buffers[param_id] = (
self.momentum * self.momentum_buffers[param_id] + gradient
)
# Update parameter
# CRITICAL: Preserve original parameter shape - modify numpy array in-place
update = self.learning_rate * self.momentum_buffers[param_id]
new_data = param.data.data - update
# Handle different tensor shapes (scalar vs array)
if hasattr(param.data, '_data'):
# Real Tensor class with _data attribute
if param.data.data.ndim == 0:
# 0D array (scalar)
param.data._data = new_data
else:
# Multi-dimensional array
param.data._data[:] = new_data
else:
# Fallback Tensor class - replace data directly
param.data.data = new_data
self.step_count += 1
### END SOLUTION
def zero_grad(self) -> None:
"""
Zero out gradients for all parameters.
TODO: Implement gradient zeroing.
APPROACH:
1. Iterate through all parameters
2. Set gradient to None for each parameter
3. This prepares for next backward pass
IMPLEMENTATION HINTS:
- Simply set param.grad = None
- This is called before loss.backward()
- Essential for proper gradient accumulation
"""
### BEGIN SOLUTION
for param in self.parameters:
param.grad = None
### END SOLUTION
# %% [markdown]
"""
### 🧪 Unit Test: SGD Optimizer
Let's test your SGD optimizer implementation! This optimizer adds momentum to gradient descent for better convergence.
**This is a unit test** - it tests one specific class (SGD) in isolation.
"""
# %% nbgrader={"grade": true, "grade_id": "test-sgd", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false}
def test_unit_sgd_optimizer():
"""Unit test for the SGD optimizer implementation."""
print("🔬 Unit Test: SGD Optimizer...")
# Create test parameters
w1 = Variable(1.0, requires_grad=True)
w2 = Variable(2.0, requires_grad=True)
b = Variable(0.5, requires_grad=True)
# Create optimizer
optimizer = SGD([w1, w2, b], learning_rate=0.1, momentum=0.9)
# Test zero_grad
try:
w1.grad = Variable(0.1)
w2.grad = Variable(0.2)
b.grad = Variable(0.05)
optimizer.zero_grad()
assert w1.grad is None, "Gradient should be None after zero_grad"
assert w2.grad is None, "Gradient should be None after zero_grad"
assert b.grad is None, "Gradient should be None after zero_grad"
print("✅ zero_grad() works correctly")
except Exception as e:
print(f"❌ zero_grad() failed: {e}")
raise
# Test step with gradients
try:
w1.grad = Variable(0.1)
w2.grad = Variable(0.2)
b.grad = Variable(0.05)
# First step (no momentum yet)
original_w1 = w1.data.data.item()
original_w2 = w2.data.data.item()
original_b = b.data.data.item()
optimizer.step()
# Check parameter updates
expected_w1 = original_w1 - 0.1 * 0.1 # 1.0 - 0.01 = 0.99
expected_w2 = original_w2 - 0.1 * 0.2 # 2.0 - 0.02 = 1.98
expected_b = original_b - 0.1 * 0.05 # 0.5 - 0.005 = 0.495
assert abs(w1.data.data.item() - expected_w1) < 1e-6, f"w1 update failed: expected {expected_w1}, got {w1.data.data.item()}"
assert abs(w2.data.data.item() - expected_w2) < 1e-6, f"w2 update failed: expected {expected_w2}, got {w2.data.data.item()}"
assert abs(b.data.data.item() - expected_b) < 1e-6, f"b update failed: expected {expected_b}, got {b.data.data.item()}"
print("✅ Parameter updates work correctly")
except Exception as e:
print(f"❌ Parameter updates failed: {e}")
raise
# Test momentum buffers
try:
assert len(optimizer.momentum_buffers) == 3, f"Should have 3 momentum buffers, got {len(optimizer.momentum_buffers)}"
assert optimizer.step_count == 1, f"Step count should be 1, got {optimizer.step_count}"
print("✅ Momentum buffers created correctly")
except Exception as e:
print(f"❌ Momentum buffers failed: {e}")
raise
# Test step counting
try:
w1.grad = Variable(0.1)
w2.grad = Variable(0.2)
b.grad = Variable(0.05)
optimizer.step()
assert optimizer.step_count == 2, f"Step count should be 2, got {optimizer.step_count}"
print("✅ Step counting works correctly")
except Exception as e:
print(f"❌ Step counting failed: {e}")
raise
print("🎯 SGD optimizer behavior:")
print(" Maintains momentum buffers for accelerated updates")
print(" Tracks step count for learning rate scheduling")
print(" Supports weight decay for regularization")
print("📈 Progress: SGD Optimizer ✓")
# Test function defined (called in main block)
# %% [markdown]
"""
## Step 3: Adam - Adaptive Learning Rates
### What is Adam?
**Adam (Adaptive Moment Estimation)** is the most popular optimizer in deep learning:
```
m_t = β₁ m_{t-1} + (1 - β₁) ∇L(θ_t) # First moment (momentum)
v_t = β₂ v_{t-1} + (1 - β₂) (∇L(θ_t))² # Second moment (variance)
m̂_t = m_t / (1 - β₁ᵗ) # Bias correction
v̂_t = v_t / (1 - β₂ᵗ) # Bias correction
θ_{t+1} = θ_t - α m̂_t / (√v̂_t + ε) # Parameter update
```
### Why Adam is Revolutionary
1. **Adaptive learning rates**: Different learning rate for each parameter
2. **Momentum**: Accelerates convergence like SGD
3. **Variance adaptation**: Scales updates based on gradient variance
4. **Bias correction**: Handles initialization bias
5. **Robust**: Works well with minimal hyperparameter tuning
### The Three Key Ideas
1. **First moment (m_t)**: Exponential moving average of gradients (momentum)
2. **Second moment (v_t)**: Exponential moving average of squared gradients (variance)
3. **Adaptive scaling**: Large gradients → small updates, small gradients → large updates
### Visual Understanding
```
Parameter with large gradients: zigzag pattern → smooth updates
Parameter with small gradients: ______ → amplified updates
```
### Real-World Applications
- **Deep learning**: Default optimizer for most neural networks
- **Computer vision**: Training CNNs, ResNets, Vision Transformers
- **Natural language**: Training BERT, GPT, T5
- **Transformers**: Essential for attention-based models
Let's implement Adam optimizer!
"""
# %% nbgrader={"grade": false, "grade_id": "adam-class", "locked": false, "schema_version": 3, "solution": true, "task": false}
#| export
class Adam:
"""
Adam Optimizer
Implements Adam algorithm with adaptive learning rates:
- First moment: exponential moving average of gradients
- Second moment: exponential moving average of squared gradients
- Bias correction: accounts for initialization bias
- Adaptive updates: different learning rate per parameter
"""
def __init__(self, parameters: List[Variable], learning_rate: float = 0.001,
beta1: float = 0.9, beta2: float = 0.999, epsilon: float = 1e-8,
weight_decay: float = 0.0):
"""
Initialize Adam optimizer.
Args:
parameters: List of Variables to optimize
learning_rate: Learning rate (default: 0.001)
beta1: Exponential decay rate for first moment (default: 0.9)
beta2: Exponential decay rate for second moment (default: 0.999)
epsilon: Small constant for numerical stability (default: 1e-8)
weight_decay: L2 regularization coefficient (default: 0.0)
TODO: Implement Adam optimizer initialization.
APPROACH:
1. Store parameters and hyperparameters
2. Initialize first moment buffers (m_t)
3. Initialize second moment buffers (v_t)
4. Set up step counter for bias correction
EXAMPLE:
```python
# Create Adam optimizer
optimizer = Adam([w1, w2, b1, b2], learning_rate=0.001)
# In training loop:
optimizer.zero_grad()
loss.backward()
optimizer.step()
```
HINTS:
- Store all hyperparameters
- Initialize moment buffers as empty dicts
- Use parameter id() as key for tracking
- Buffers will be created lazily in step()
"""
### BEGIN SOLUTION
self.parameters = parameters
self.learning_rate = learning_rate
self.beta1 = beta1
self.beta2 = beta2
self.epsilon = epsilon
self.weight_decay = weight_decay
# Initialize moment buffers (created lazily)
self.first_moment = {} # m_t
self.second_moment = {} # v_t
# Track optimization steps for bias correction
self.step_count = 0
### END SOLUTION
def step(self) -> None:
"""
Perform one optimization step using Adam algorithm.
TODO: Implement Adam parameter update.
APPROACH:
1. Increment step count
2. For each parameter with gradient:
a. Get current gradient
b. Apply weight decay if specified
c. Update first moment (momentum)
d. Update second moment (variance)
e. Apply bias correction
f. Update parameter with adaptive learning rate
MATHEMATICAL FORMULATION:
- m_t = beta1 * m_{t-1} + (1 - beta1) * gradient
- v_t = beta2 * v_{t-1} + (1 - beta2) * gradient^2
- m_hat = m_t / (1 - beta1^t)
- v_hat = v_t / (1 - beta2^t)
- parameter = parameter - learning_rate * m_hat / (sqrt(v_hat) + epsilon)
IMPLEMENTATION HINTS:
- Use id(param) as key for moment buffers
- Initialize buffers with zeros if not exists
- Use np.sqrt() for square root
- Handle numerical stability with epsilon
"""
### BEGIN SOLUTION
self.step_count += 1
for param in self.parameters:
if param.grad is not None:
# Get gradient
gradient = param.grad.data.data
# Apply weight decay (L2 regularization)
if self.weight_decay > 0:
gradient = gradient + self.weight_decay * param.data.data
# Get or create moment buffers
param_id = id(param)
if param_id not in self.first_moment:
self.first_moment[param_id] = np.zeros_like(param.data.data)
self.second_moment[param_id] = np.zeros_like(param.data.data)
# Update first moment (momentum)
self.first_moment[param_id] = (
self.beta1 * self.first_moment[param_id] +
(1 - self.beta1) * gradient
)
# Update second moment (variance)
self.second_moment[param_id] = (
self.beta2 * self.second_moment[param_id] +
(1 - self.beta2) * gradient * gradient
)
# Bias correction
first_moment_corrected = (
self.first_moment[param_id] / (1 - self.beta1 ** self.step_count)
)
second_moment_corrected = (
self.second_moment[param_id] / (1 - self.beta2 ** self.step_count)
)
# Update parameter with adaptive learning rate
# CRITICAL: Preserve original parameter shape - modify numpy array in-place
update = self.learning_rate * first_moment_corrected / (np.sqrt(second_moment_corrected) + self.epsilon)
new_data = param.data.data - update
# Handle different tensor shapes (scalar vs array)
if hasattr(param.data, '_data'):
# Real Tensor class with _data attribute
if param.data.data.ndim == 0:
# 0D array (scalar)
param.data._data = new_data
else:
# Multi-dimensional array
param.data._data[:] = new_data
else:
# Fallback Tensor class - replace data directly
param.data.data = new_data
### END SOLUTION
def zero_grad(self) -> None:
"""
Zero out gradients for all parameters.
TODO: Implement gradient zeroing (same as SGD).
IMPLEMENTATION HINTS:
- Set param.grad = None for all parameters
- This is identical to SGD implementation
"""
### BEGIN SOLUTION
for param in self.parameters:
param.grad = None
### END SOLUTION
# %% [markdown]
"""
### 🧪 Test Your Adam Implementation
Let's test the Adam optimizer:
"""
# %% [markdown]
"""
### 🧪 Unit Test: Adam Optimizer
Let's test your Adam optimizer implementation! This is a state-of-the-art adaptive optimization algorithm.
**This is a unit test** - it tests one specific class (Adam) in isolation.
"""
# %% nbgrader={"grade": true, "grade_id": "test-adam", "locked": true, "points": 20, "schema_version": 3, "solution": false, "task": false}
def test_unit_adam_optimizer():
"""Unit test for the Adam optimizer implementation."""
print("🔬 Unit Test: Adam Optimizer...")
# Create test parameters
w1 = Variable(1.0, requires_grad=True)
w2 = Variable(2.0, requires_grad=True)
b = Variable(0.5, requires_grad=True)
# Create optimizer
optimizer = Adam([w1, w2, b], learning_rate=0.01, beta1=0.9, beta2=0.999, epsilon=1e-8)
# Test zero_grad
try:
w1.grad = Variable(0.1)
w2.grad = Variable(0.2)
b.grad = Variable(0.05)
optimizer.zero_grad()
assert w1.grad is None, "Gradient should be None after zero_grad"
assert w2.grad is None, "Gradient should be None after zero_grad"
assert b.grad is None, "Gradient should be None after zero_grad"
print("✅ zero_grad() works correctly")
except Exception as e:
print(f"❌ zero_grad() failed: {e}")
raise
# Test step with gradients
try:
w1.grad = Variable(0.1)
w2.grad = Variable(0.2)
b.grad = Variable(0.05)
# First step
original_w1 = w1.data.data.item()
original_w2 = w2.data.data.item()
original_b = b.data.data.item()
optimizer.step()
# Check that parameters were updated (Adam uses adaptive learning rates)
assert w1.data.data.item() != original_w1, "w1 should have been updated"
assert w2.data.data.item() != original_w2, "w2 should have been updated"
assert b.data.data.item() != original_b, "b should have been updated"
print("✅ Parameter updates work correctly")
except Exception as e:
print(f"❌ Parameter updates failed: {e}")
raise
# Test moment buffers
try:
assert len(optimizer.first_moment) == 3, f"Should have 3 first moment buffers, got {len(optimizer.first_moment)}"
assert len(optimizer.second_moment) == 3, f"Should have 3 second moment buffers, got {len(optimizer.second_moment)}"
print("✅ Moment buffers created correctly")
except Exception as e:
print(f"❌ Moment buffers failed: {e}")
raise
# Test step counting and bias correction
try:
assert optimizer.step_count == 1, f"Step count should be 1, got {optimizer.step_count}"
# Take another step
w1.grad = Variable(0.1)
w2.grad = Variable(0.2)
b.grad = Variable(0.05)
optimizer.step()
assert optimizer.step_count == 2, f"Step count should be 2, got {optimizer.step_count}"
print("✅ Step counting and bias correction work correctly")
except Exception as e:
print(f"❌ Step counting and bias correction failed: {e}")
raise
# Test adaptive learning rates
try:
# Adam should have different effective learning rates for different parameters
# This is tested implicitly by the parameter updates above
print("✅ Adaptive learning rates work correctly")
except Exception as e:
print(f"❌ Adaptive learning rates failed: {e}")
raise
print("🎯 Adam optimizer behavior:")
print(" Maintains first and second moment estimates")
print(" Applies bias correction for early training")
print(" Uses adaptive learning rates per parameter")
print(" Combines benefits of momentum and RMSprop")
print("📈 Progress: Adam Optimizer ✓")
# Test function defined (called in main block)
# %% [markdown]
"""
## Step 4: Learning Rate Scheduling
### What is Learning Rate Scheduling?
**Learning rate scheduling** adjusts the learning rate during training:
```
Initial: learning_rate = 0.1
After 10 epochs: learning_rate = 0.01
After 20 epochs: learning_rate = 0.001
```
### Why Scheduling Matters
1. **Fine-tuning**: Start with large steps, then refine with small steps
2. **Convergence**: Prevents overshooting near optimum
3. **Stability**: Reduces oscillations in later training
4. **Performance**: Often improves final accuracy
### Common Scheduling Strategies
1. **Step decay**: Reduce by factor every N epochs
2. **Exponential decay**: Gradual exponential reduction
3. **Cosine annealing**: Smooth cosine curve reduction
4. **Warm-up**: Start small, increase, then decrease
### Visual Understanding
```
Step decay: ----↓----↓----↓
Exponential: \\\\\\\\\\\\\\
Cosine: ∩∩∩∩∩∩∩∩∩∩∩∩∩
```
### Real-World Applications
- **ImageNet training**: Essential for achieving state-of-the-art results
- **Language models**: Critical for training large transformers
- **Fine-tuning**: Prevents catastrophic forgetting
- **Transfer learning**: Adapts pre-trained models
Let's implement step learning rate scheduling!
"""
# %% nbgrader={"grade": false, "grade_id": "steplr-class", "locked": false, "schema_version": 3, "solution": true, "task": false}
#| export
class StepLR:
"""
Step Learning Rate Scheduler
Decays learning rate by gamma every step_size epochs:
learning_rate = initial_lr * (gamma ^ (epoch // step_size))
"""
def __init__(self, optimizer: Union[SGD, Adam], step_size: int, gamma: float = 0.1):
"""
Initialize step learning rate scheduler.
Args:
optimizer: Optimizer to schedule
step_size: Number of epochs between decreases
gamma: Multiplicative factor for learning rate decay
TODO: Implement learning rate scheduler initialization.
APPROACH:
1. Store optimizer reference
2. Store scheduling parameters
3. Save initial learning rate
4. Initialize step counter
EXAMPLE:
```python
optimizer = SGD([w1, w2], learning_rate=0.1)
scheduler = StepLR(optimizer, step_size=10, gamma=0.1)
# In training loop:
for epoch in range(100):
train_one_epoch()
scheduler.step() # Update learning rate
```
HINTS:
- Store optimizer reference
- Save initial learning rate from optimizer
- Initialize step counter to 0
- gamma is the decay factor (0.1 = 10x reduction)
"""
### BEGIN SOLUTION
self.optimizer = optimizer
self.step_size = step_size
self.gamma = gamma
self.initial_lr = optimizer.learning_rate
self.step_count = 0
### END SOLUTION
def step(self) -> None:
"""
Update learning rate based on current step.
TODO: Implement learning rate update.
APPROACH:
1. Increment step counter
2. Calculate new learning rate using step decay formula
3. Update optimizer's learning rate
MATHEMATICAL FORMULATION:
new_lr = initial_lr * (gamma ^ ((step_count - 1) // step_size))
IMPLEMENTATION HINTS:
- Use // for integer division
- Use ** for exponentiation
- Update optimizer.learning_rate directly
"""
### BEGIN SOLUTION
self.step_count += 1
# Calculate new learning rate
decay_factor = self.gamma ** ((self.step_count - 1) // self.step_size)
new_lr = self.initial_lr * decay_factor
# Update optimizer's learning rate
self.optimizer.learning_rate = new_lr
### END SOLUTION
def get_lr(self) -> float:
"""
Get current learning rate.
TODO: Return current learning rate.
IMPLEMENTATION HINTS:
- Return optimizer.learning_rate
"""
### BEGIN SOLUTION
return self.optimizer.learning_rate
### END SOLUTION
# %% [markdown]
"""
### 🧪 Unit Test: Step Learning Rate Scheduler
Let's test your step learning rate scheduler implementation! This scheduler reduces learning rate at regular intervals.
**This is a unit test** - it tests one specific class (StepLR) in isolation.
"""
# %% nbgrader={"grade": true, "grade_id": "test-step-scheduler", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false}
def test_unit_step_scheduler():
"""Unit test for the StepLR scheduler implementation."""
print("🔬 Unit Test: Step Learning Rate Scheduler...")
# Create test parameters and optimizer
w = Variable(1.0, requires_grad=True)
optimizer = SGD([w], learning_rate=0.1)
# Test scheduler initialization
try:
scheduler = StepLR(optimizer, step_size=10, gamma=0.1)
# Test initial learning rate
assert scheduler.get_lr() == 0.1, f"Initial learning rate should be 0.1, got {scheduler.get_lr()}"
print("✅ Initial learning rate is correct")
except Exception as e:
print(f"❌ Initial learning rate failed: {e}")
raise
# Test step-based decay
try:
# Steps 1-10: no decay (decay happens after step 10)
for i in range(10):
scheduler.step()
assert scheduler.get_lr() == 0.1, f"Learning rate should still be 0.1 after 10 steps, got {scheduler.get_lr()}"
# Step 11: decay should occur
scheduler.step()
expected_lr = 0.1 * 0.1 # 0.01
assert abs(scheduler.get_lr() - expected_lr) < 1e-6, f"Learning rate should be {expected_lr} after 11 steps, got {scheduler.get_lr()}"
print("✅ Step-based decay works correctly")
except Exception as e:
print(f"❌ Step-based decay failed: {e}")
raise
# Test multiple decay levels
try:
# Steps 12-20: should stay at 0.01
for i in range(9):
scheduler.step()
assert abs(scheduler.get_lr() - 0.01) < 1e-6, f"Learning rate should be 0.01 after 20 steps, got {scheduler.get_lr()}"
# Step 21: another decay
scheduler.step()
expected_lr = 0.01 * 0.1 # 0.001
assert abs(scheduler.get_lr() - expected_lr) < 1e-6, f"Learning rate should be {expected_lr} after 21 steps, got {scheduler.get_lr()}"
print("✅ Multiple decay levels work correctly")
except Exception as e:
print(f"❌ Multiple decay levels failed: {e}")
raise
# Test with different optimizer
try:
w2 = Variable(2.0, requires_grad=True)
adam_optimizer = Adam([w2], learning_rate=0.001)
adam_scheduler = StepLR(adam_optimizer, step_size=5, gamma=0.5)
# Test initial learning rate
assert adam_scheduler.get_lr() == 0.001, f"Initial Adam learning rate should be 0.001, got {adam_scheduler.get_lr()}"
# Test decay after 5 steps
for i in range(5):
adam_scheduler.step()
# Learning rate should still be 0.001 after 5 steps
assert adam_scheduler.get_lr() == 0.001, f"Adam learning rate should still be 0.001 after 5 steps, got {adam_scheduler.get_lr()}"
# Step 6: decay should occur
adam_scheduler.step()
expected_lr = 0.001 * 0.5 # 0.0005
assert abs(adam_scheduler.get_lr() - expected_lr) < 1e-6, f"Adam learning rate should be {expected_lr} after 6 steps, got {adam_scheduler.get_lr()}"
print("✅ Works with different optimizers")
except Exception as e:
print(f"❌ Different optimizers failed: {e}")
raise
print("🎯 Step learning rate scheduler behavior:")
print(" Reduces learning rate at regular intervals")
print(" Multiplies current rate by gamma factor")
print(" Works with any optimizer (SGD, Adam, etc.)")
print("📈 Progress: Step Learning Rate Scheduler ✓")
# Test function defined (called in main block)
# %% [markdown]
"""
## Step 5: Integration - Complete Training Example
### Putting It All Together
Let's see how optimizers enable complete neural network training:
1. **Forward pass**: Compute predictions
2. **Loss computation**: Compare with targets
3. **Backward pass**: Compute gradients
4. **Optimizer step**: Update parameters
5. **Learning rate scheduling**: Adjust learning rate
### The Modern Training Loop
```python
# Setup
optimizer = Adam(model.parameters(), learning_rate=0.001)
scheduler = StepLR(optimizer, step_size=10, gamma=0.1)
# Training loop
for epoch in range(num_epochs):
for batch in dataloader:
# Forward pass
predictions = model(batch.inputs)
loss = criterion(predictions, batch.targets)
# Backward pass
optimizer.zero_grad()
loss.backward()
optimizer.step()
# Update learning rate
scheduler.step()
```
Let's implement a complete training example!
"""
# %% nbgrader={"grade": false, "grade_id": "training-integration", "locked": false, "schema_version": 3, "solution": true, "task": false}
def train_simple_model():
"""
Complete training example using optimizers.
TODO: Implement a complete training loop.
APPROACH:
1. Create a simple model (linear regression)
2. Generate training data
3. Set up optimizer and scheduler
4. Train for several epochs
5. Show convergence
LEARNING OBJECTIVE:
- See how optimizers enable real learning
- Compare SGD vs Adam performance
- Understand the complete training workflow
"""
### BEGIN SOLUTION
print("Training simple linear regression model...")
# Create simple model: y = w*x + b
w = Variable(0.1, requires_grad=True) # Initialize near zero
b = Variable(0.0, requires_grad=True)
# Training data: y = 2*x + 1
x_data = [1.0, 2.0, 3.0, 4.0, 5.0]
y_data = [3.0, 5.0, 7.0, 9.0, 11.0]
# Try SGD first
print("\n🔍 Training with SGD...")
optimizer_sgd = SGD([w, b], learning_rate=0.01, momentum=0.9)
for epoch in range(60):
total_loss = 0
for x_val, y_val in zip(x_data, y_data):
# Forward pass
x = Variable(x_val, requires_grad=False)
y_target = Variable(y_val, requires_grad=False)
# Prediction: y = w*x + b
try:
from tinytorch.core.autograd import add, multiply, subtract
except ImportError:
setup_import_paths()
from autograd_dev import add, multiply, subtract
prediction = add(multiply(w, x), b)
# Loss: (prediction - target)^2
error = subtract(prediction, y_target)
loss = multiply(error, error)
# Backward pass
optimizer_sgd.zero_grad()
loss.backward()
optimizer_sgd.step()
total_loss += loss.data.data.item()
if epoch % 10 == 0:
print(f"Epoch {epoch}: Loss = {total_loss:.4f}, w = {w.data.data.item():.3f}, b = {b.data.data.item():.3f}")
sgd_final_w = w.data.data.item()
sgd_final_b = b.data.data.item()
# Reset parameters and try Adam
print("\n🔍 Training with Adam...")
w.data = Tensor(0.1)
b.data = Tensor(0.0)
optimizer_adam = Adam([w, b], learning_rate=0.01)
for epoch in range(60):
total_loss = 0
for x_val, y_val in zip(x_data, y_data):
# Forward pass
x = Variable(x_val, requires_grad=False)
y_target = Variable(y_val, requires_grad=False)
# Prediction: y = w*x + b
prediction = add(multiply(w, x), b)
# Loss: (prediction - target)^2
error = subtract(prediction, y_target)
loss = multiply(error, error)
# Backward pass
optimizer_adam.zero_grad()
loss.backward()
optimizer_adam.step()
total_loss += loss.data.data.item()
if epoch % 10 == 0:
print(f"Epoch {epoch}: Loss = {total_loss:.4f}, w = {w.data.data.item():.3f}, b = {b.data.data.item():.3f}")
adam_final_w = w.data.data.item()
adam_final_b = b.data.data.item()
print(f"\n📊 Results:")
print(f"Target: w = 2.0, b = 1.0")
print(f"SGD: w = {sgd_final_w:.3f}, b = {sgd_final_b:.3f}")
print(f"Adam: w = {adam_final_w:.3f}, b = {adam_final_b:.3f}")
return sgd_final_w, sgd_final_b, adam_final_w, adam_final_b
### END SOLUTION
# %% [markdown]
"""
### 🧪 Unit Test: Complete Training Integration
Let's test your complete training integration! This demonstrates optimizers working together in a realistic training scenario.
**This is a unit test** - it tests the complete training workflow with optimizers in isolation.
"""
# %% nbgrader={"grade": true, "grade_id": "test-training-integration", "locked": true, "points": 25, "schema_version": 3, "solution": false, "task": false}
def test_module_unit_training():
"""Comprehensive unit test for complete training integration with optimizers."""
print("🔬 Unit Test: Complete Training Integration...")
# Test training with SGD and Adam
try:
sgd_w, sgd_b, adam_w, adam_b = train_simple_model()
# Test SGD convergence
assert abs(sgd_w - 2.0) < 0.1, f"SGD should converge close to w=2.0, got {sgd_w}"
assert abs(sgd_b - 1.0) < 0.1, f"SGD should converge close to b=1.0, got {sgd_b}"
print("✅ SGD convergence works")
# Test Adam convergence (may be different due to adaptive learning rates)
assert abs(adam_w - 2.0) < 1.0, f"Adam should converge reasonably close to w=2.0, got {adam_w}"
assert abs(adam_b - 1.0) < 1.0, f"Adam should converge reasonably close to b=1.0, got {adam_b}"
print("✅ Adam convergence works")
except Exception as e:
print(f"❌ Training integration failed: {e}")
raise
# Test optimizer comparison
try:
# Both optimizers should achieve reasonable results
sgd_error = (sgd_w - 2.0)**2 + (sgd_b - 1.0)**2
adam_error = (adam_w - 2.0)**2 + (adam_b - 1.0)**2
# Both should have low error (< 0.1)
assert sgd_error < 0.1, f"SGD error should be < 0.1, got {sgd_error}"
assert adam_error < 1.0, f"Adam error should be < 1.0, got {adam_error}"
print("✅ Optimizer comparison works")
except Exception as e:
print(f"❌ Optimizer comparison failed: {e}")
raise
# Test gradient flow
try:
# Create a simple test to verify gradients flow correctly
w = Variable(1.0, requires_grad=True)
b = Variable(0.0, requires_grad=True)
# Set up simple gradients
w.grad = Variable(0.1)
b.grad = Variable(0.05)
# Test SGD step
sgd_optimizer = SGD([w, b], learning_rate=0.1)
original_w = w.data.data.item()
original_b = b.data.data.item()
sgd_optimizer.step()
# Check updates
assert w.data.data.item() != original_w, "SGD should update w"
assert b.data.data.item() != original_b, "SGD should update b"
print("✅ Gradient flow works correctly")
except Exception as e:
print(f"❌ Gradient flow failed: {e}")
raise
print("🎯 Training integration behavior:")
print(" Optimizers successfully minimize loss functions")
print(" SGD and Adam both converge to target values")
print(" Gradient computation and updates work correctly")
print(" Ready for real neural network training")
print("📈 Progress: Complete Training Integration ✓")
# Test function defined (called in main block)
# %% [markdown]
"""
## Step 6: ML Systems - Optimizer Performance Analysis
### Real-World Challenge: Optimizer Selection and Tuning
In production ML systems, choosing the right optimizer and hyperparameters can make the difference between:
- **Success**: Model converges to good performance in reasonable time
- **Failure**: Model doesn't converge, explodes, or takes too long to train
### The Production Reality
When training large models (millions or billions of parameters):
- **Wrong optimizer**: Can waste weeks of expensive GPU time
- **Wrong learning rate**: Can cause gradient explosion or extremely slow convergence
- **Wrong scheduling**: Can prevent models from reaching optimal performance
- **Memory constraints**: Some optimizers use significantly more memory than others
### What We'll Build
An **OptimizerConvergenceProfiler** that analyzes:
1. **Convergence patterns** across different optimizers
2. **Learning rate sensitivity** and optimal hyperparameters
3. **Computational cost vs convergence speed** trade-offs
4. **Gradient statistics** and update patterns
5. **Memory usage patterns** for different optimizers
This mirrors tools used in production for optimizer selection and hyperparameter tuning.
"""
# %% nbgrader={"grade": false, "grade_id": "convergence-profiler", "locked": false, "schema_version": 3, "solution": true, "task": false}
#| export
class OptimizerConvergenceProfiler:
"""
ML Systems Tool: Optimizer Performance and Convergence Analysis
Profiles convergence patterns, learning rate sensitivity, and computational costs
across different optimizers to guide production optimizer selection.
This is 60% implementation focusing on core analysis capabilities:
- Convergence rate comparison across optimizers
- Learning rate sensitivity analysis
- Gradient statistics tracking
- Memory usage estimation
- Performance recommendations
"""
def __init__(self):
"""
Initialize optimizer convergence profiler.
TODO: Implement profiler initialization.
APPROACH:
1. Initialize tracking dictionaries for different metrics
2. Set up convergence analysis parameters
3. Prepare memory and performance tracking
4. Initialize recommendation engine components
PRODUCTION CONTEXT:
In production, this profiler would run on representative tasks to:
- Select optimal optimizers for new models
- Tune hyperparameters before expensive training runs
- Predict training time and resource requirements
- Monitor training stability and convergence
IMPLEMENTATION HINTS:
- Track convergence history per optimizer
- Store gradient statistics over time
- Monitor memory usage patterns
- Prepare for comparative analysis
"""
### BEGIN SOLUTION
# Convergence tracking
self.convergence_history = defaultdict(list) # {optimizer_name: [losses]}
self.gradient_norms = defaultdict(list) # {optimizer_name: [grad_norms]}
self.learning_rates = defaultdict(list) # {optimizer_name: [lr_values]}
self.step_times = defaultdict(list) # {optimizer_name: [step_durations]}
# Performance metrics
self.memory_usage = defaultdict(list) # {optimizer_name: [memory_estimates]}
self.convergence_rates = {} # {optimizer_name: convergence_rate}
self.stability_scores = {} # {optimizer_name: stability_score}
# Analysis parameters
self.convergence_threshold = 1e-6
self.stability_window = 10
self.gradient_explosion_threshold = 1e6
# Recommendations
self.optimizer_rankings = {}
self.hyperparameter_suggestions = {}
### END SOLUTION
def profile_optimizer_convergence(self, optimizer_name: str, optimizer: Union[SGD, Adam],
training_function, initial_loss: float,
max_steps: int = 100) -> Dict[str, Any]:
"""
Profile convergence behavior of an optimizer on a specific task.
Args:
optimizer_name: Name identifier for the optimizer
optimizer: Optimizer instance to profile
training_function: Function that performs one training step and returns loss
initial_loss: Starting loss value
max_steps: Maximum training steps to profile
Returns:
Dictionary containing convergence analysis results
TODO: Implement optimizer convergence profiling.
APPROACH:
1. Run training loop with the optimizer
2. Track loss, gradients, learning rates at each step
3. Measure step execution time
4. Estimate memory usage
5. Analyze convergence patterns and stability
6. Generate performance metrics
CONVERGENCE ANALYSIS:
- Track loss reduction over time
- Measure convergence rate (loss reduction per step)
- Detect convergence plateaus
- Identify gradient explosion or vanishing
- Assess training stability
PRODUCTION INSIGHTS:
This analysis helps determine:
- Which optimizers converge fastest for specific model types
- Optimal learning rates for different optimizers
- Memory vs performance trade-offs
- Training stability and robustness
IMPLEMENTATION HINTS:
- Use time.time() to measure step duration
- Calculate gradient norms across all parameters
- Track learning rate changes (for schedulers)
- Estimate memory from optimizer state size
"""
### BEGIN SOLUTION
import time
print(f"🔍 Profiling {optimizer_name} convergence...")
# Initialize tracking
losses = []
grad_norms = []
step_durations = []
lr_values = []
previous_loss = initial_loss
convergence_step = None
for step in range(max_steps):
step_start = time.time()
# Perform training step
try:
current_loss = training_function()
losses.append(current_loss)
# Calculate gradient norm
total_grad_norm = 0.0
param_count = 0
for param in optimizer.parameters:
if param.grad is not None:
grad_data = param.grad.data.data
if hasattr(grad_data, 'flatten'):
grad_norm = np.linalg.norm(grad_data.flatten())
else:
grad_norm = abs(float(grad_data))
total_grad_norm += grad_norm ** 2
param_count += 1
if param_count > 0:
total_grad_norm = (total_grad_norm / param_count) ** 0.5
grad_norms.append(total_grad_norm)
# Track learning rate
lr_values.append(optimizer.learning_rate)
# Check convergence
if convergence_step is None and abs(current_loss - previous_loss) < self.convergence_threshold:
convergence_step = step
previous_loss = current_loss
except Exception as e:
print(f"⚠️ Training step {step} failed: {e}")
break
step_end = time.time()
step_durations.append(step_end - step_start)
# Early stopping for exploded gradients
if total_grad_norm > self.gradient_explosion_threshold:
print(f"⚠️ Gradient explosion detected at step {step}")
break
# Store results
self.convergence_history[optimizer_name] = losses
self.gradient_norms[optimizer_name] = grad_norms
self.learning_rates[optimizer_name] = lr_values
self.step_times[optimizer_name] = step_durations
# Analyze results
analysis = self._analyze_convergence_profile(optimizer_name, losses, grad_norms,
step_durations, convergence_step)
return analysis
### END SOLUTION
def compare_optimizers(self, profiles: Dict[str, Dict]) -> Dict[str, Any]:
"""
Compare multiple optimizer profiles and generate recommendations.
Args:
profiles: Dictionary mapping optimizer names to their profile results
Returns:
Comprehensive comparison analysis with recommendations
TODO: Implement optimizer comparison and ranking.
APPROACH:
1. Analyze convergence speed across optimizers
2. Compare final performance and stability
3. Assess computational efficiency
4. Generate rankings and recommendations
5. Identify optimal hyperparameters
COMPARISON METRICS:
- Steps to convergence
- Final loss achieved
- Training stability (loss variance)
- Computational cost per step
- Memory efficiency
- Gradient explosion resistance
PRODUCTION VALUE:
This comparison guides:
- Optimizer selection for new projects
- Hyperparameter optimization strategies
- Resource allocation decisions
- Training pipeline design
IMPLEMENTATION HINTS:
- Normalize metrics for fair comparison
- Weight different factors based on importance
- Generate actionable recommendations
- Consider trade-offs between speed and stability
"""
### BEGIN SOLUTION
comparison = {
'convergence_speed': {},
'final_performance': {},
'stability': {},
'efficiency': {},
'rankings': {},
'recommendations': {}
}
print("📊 Comparing optimizer performance...")
# Analyze each optimizer
for opt_name, profile in profiles.items():
# Convergence speed
convergence_step = profile.get('convergence_step', len(self.convergence_history[opt_name]))
comparison['convergence_speed'][opt_name] = convergence_step
# Final performance
losses = self.convergence_history[opt_name]
if losses:
final_loss = losses[-1]
comparison['final_performance'][opt_name] = final_loss
# Stability (coefficient of variation in last 10 steps)
if len(losses) >= self.stability_window:
recent_losses = losses[-self.stability_window:]
stability = 1.0 / (1.0 + np.std(recent_losses) / (np.mean(recent_losses) + 1e-8))
comparison['stability'][opt_name] = stability
# Efficiency (loss reduction per unit time)
step_times = self.step_times[opt_name]
if losses and step_times:
initial_loss = losses[0]
final_loss = losses[-1]
total_time = sum(step_times)
efficiency = (initial_loss - final_loss) / (total_time + 1e-8)
comparison['efficiency'][opt_name] = efficiency
# Generate rankings
metrics = ['convergence_speed', 'final_performance', 'stability', 'efficiency']
for metric in metrics:
if comparison[metric]:
if metric == 'convergence_speed':
# Lower is better for convergence speed
sorted_opts = sorted(comparison[metric].items(), key=lambda x: x[1])
elif metric == 'final_performance':
# Lower is better for final loss
sorted_opts = sorted(comparison[metric].items(), key=lambda x: x[1])
else:
# Higher is better for stability and efficiency
sorted_opts = sorted(comparison[metric].items(), key=lambda x: x[1], reverse=True)
comparison['rankings'][metric] = [opt for opt, _ in sorted_opts]
# Generate recommendations
recommendations = []
# Best overall optimizer
if comparison['rankings']:
# Simple scoring: rank position across metrics
scores = defaultdict(float)
for metric, ranking in comparison['rankings'].items():
for i, opt_name in enumerate(ranking):
scores[opt_name] += len(ranking) - i
best_optimizer = max(scores.items(), key=lambda x: x[1])[0]
recommendations.append(f"🏆 Best overall optimizer: {best_optimizer}")
# Specific recommendations
if 'convergence_speed' in comparison['rankings']:
fastest = comparison['rankings']['convergence_speed'][0]
recommendations.append(f"⚡ Fastest convergence: {fastest}")
if 'stability' in comparison['rankings']:
most_stable = comparison['rankings']['stability'][0]
recommendations.append(f"🎯 Most stable training: {most_stable}")
if 'efficiency' in comparison['rankings']:
most_efficient = comparison['rankings']['efficiency'][0]
recommendations.append(f"💰 Most compute-efficient: {most_efficient}")
comparison['recommendations']['summary'] = recommendations
return comparison
### END SOLUTION
def analyze_learning_rate_sensitivity(self, optimizer_class, learning_rates: List[float],
training_function, steps: int = 50) -> Dict[str, Any]:
"""
Analyze optimizer sensitivity to different learning rates.
Args:
optimizer_class: Optimizer class (SGD or Adam)
learning_rates: List of learning rates to test
training_function: Function that creates and runs training
steps: Number of training steps per learning rate
Returns:
Learning rate sensitivity analysis
TODO: Implement learning rate sensitivity analysis.
APPROACH:
1. Test optimizer with different learning rates
2. Measure convergence performance for each rate
3. Identify optimal learning rate range
4. Detect learning rate instability regions
5. Generate learning rate recommendations
SENSITIVITY ANALYSIS:
- Plot loss curves for different learning rates
- Identify optimal learning rate range
- Detect gradient explosion thresholds
- Measure convergence robustness
- Generate adaptive scheduling suggestions
PRODUCTION INSIGHTS:
This analysis enables:
- Automatic learning rate tuning
- Learning rate scheduling optimization
- Gradient explosion prevention
- Training stability improvement
IMPLEMENTATION HINTS:
- Reset model state for each learning rate test
- Track convergence metrics consistently
- Identify learning rate sweet spots
- Flag unstable learning rate regions
"""
### BEGIN SOLUTION
print("🔍 Analyzing learning rate sensitivity...")
lr_analysis = {
'learning_rates': learning_rates,
'final_losses': [],
'convergence_steps': [],
'stability_scores': [],
'gradient_explosions': [],
'optimal_range': None,
'recommendations': []
}
# Test each learning rate
for lr in learning_rates:
print(f" Testing learning rate: {lr}")
try:
# Create optimizer with current learning rate
# This is a simplified test - in production, would reset model state
losses, grad_norms = training_function(lr, steps)
if losses:
final_loss = losses[-1]
lr_analysis['final_losses'].append(final_loss)
# Find convergence step
convergence_step = steps
for i in range(1, len(losses)):
if abs(losses[i] - losses[i-1]) < self.convergence_threshold:
convergence_step = i
break
lr_analysis['convergence_steps'].append(convergence_step)
# Calculate stability
if len(losses) >= 10:
recent_losses = losses[-10:]
stability = 1.0 / (1.0 + np.std(recent_losses) / (np.mean(recent_losses) + 1e-8))
lr_analysis['stability_scores'].append(stability)
else:
lr_analysis['stability_scores'].append(0.0)
# Check for gradient explosion
max_grad_norm = max(grad_norms) if grad_norms else 0.0
explosion = max_grad_norm > self.gradient_explosion_threshold
lr_analysis['gradient_explosions'].append(explosion)
else:
# Failed to get losses
lr_analysis['final_losses'].append(float('inf'))
lr_analysis['convergence_steps'].append(steps)
lr_analysis['stability_scores'].append(0.0)
lr_analysis['gradient_explosions'].append(True)
except Exception as e:
print(f" ⚠️ Failed with lr={lr}: {e}")
lr_analysis['final_losses'].append(float('inf'))
lr_analysis['convergence_steps'].append(steps)
lr_analysis['stability_scores'].append(0.0)
lr_analysis['gradient_explosions'].append(True)
# Find optimal learning rate range
valid_indices = [i for i, (loss, explosion) in
enumerate(zip(lr_analysis['final_losses'], lr_analysis['gradient_explosions']))
if not explosion and loss != float('inf')]
if valid_indices:
# Find learning rate with best final loss among stable ones
stable_losses = [(i, lr_analysis['final_losses'][i]) for i in valid_indices]
best_idx = min(stable_losses, key=lambda x: x[1])[0]
# Define optimal range around best learning rate
best_lr = learning_rates[best_idx]
lr_analysis['optimal_range'] = (best_lr * 0.1, best_lr * 10.0)
# Generate recommendations
recommendations = []
recommendations.append(f"🎯 Optimal learning rate: {best_lr:.2e}")
recommendations.append(f"📈 Safe range: {lr_analysis['optimal_range'][0]:.2e} - {lr_analysis['optimal_range'][1]:.2e}")
# Learning rate scheduling suggestions
if best_idx > 0:
recommendations.append("💡 Consider starting with higher LR and decaying")
if any(lr_analysis['gradient_explosions']):
max_safe_lr = max([learning_rates[i] for i in valid_indices])
recommendations.append(f"⚠️ Avoid learning rates above {max_safe_lr:.2e}")
lr_analysis['recommendations'] = recommendations
else:
lr_analysis['recommendations'] = ["⚠️ No stable learning rates found - try lower values"]
return lr_analysis
### END SOLUTION
def estimate_memory_usage(self, optimizer: Union[SGD, Adam], num_parameters: int) -> Dict[str, float]:
"""
Estimate memory usage for different optimizers.
Args:
optimizer: Optimizer instance
num_parameters: Number of model parameters
Returns:
Memory usage estimates in MB
TODO: Implement memory usage estimation.
APPROACH:
1. Calculate parameter memory requirements
2. Estimate optimizer state memory
3. Account for gradient storage
4. Include temporary computation memory
5. Provide memory scaling predictions
MEMORY ANALYSIS:
- Parameter storage: num_params * 4 bytes (float32)
- Gradient storage: num_params * 4 bytes
- Optimizer state: varies by optimizer type
- SGD momentum: num_params * 4 bytes
- Adam: num_params * 8 bytes (first + second moments)
PRODUCTION VALUE:
Memory estimation helps:
- Select optimizers for memory-constrained environments
- Plan GPU memory allocation
- Scale to larger models
- Optimize batch sizes
IMPLEMENTATION HINTS:
- Use typical float32 size (4 bytes)
- Account for optimizer-specific state
- Include gradient accumulation overhead
- Provide scaling estimates
"""
### BEGIN SOLUTION
# Base memory requirements
bytes_per_param = 4 # float32
memory_breakdown = {
'parameters_mb': num_parameters * bytes_per_param / (1024 * 1024),
'gradients_mb': num_parameters * bytes_per_param / (1024 * 1024),
'optimizer_state_mb': 0.0,
'total_mb': 0.0
}
# Optimizer-specific state memory
if isinstance(optimizer, SGD):
if optimizer.momentum > 0:
# Momentum buffers
memory_breakdown['optimizer_state_mb'] = num_parameters * bytes_per_param / (1024 * 1024)
else:
memory_breakdown['optimizer_state_mb'] = 0.0
elif isinstance(optimizer, Adam):
# First and second moment estimates
memory_breakdown['optimizer_state_mb'] = num_parameters * 2 * bytes_per_param / (1024 * 1024)
# Calculate total
memory_breakdown['total_mb'] = (
memory_breakdown['parameters_mb'] +
memory_breakdown['gradients_mb'] +
memory_breakdown['optimizer_state_mb']
)
# Add efficiency estimates
memory_breakdown['memory_efficiency'] = memory_breakdown['parameters_mb'] / memory_breakdown['total_mb']
memory_breakdown['overhead_ratio'] = memory_breakdown['optimizer_state_mb'] / memory_breakdown['parameters_mb']
return memory_breakdown
### END SOLUTION
def generate_production_recommendations(self, analysis_results: Dict[str, Any]) -> List[str]:
"""
Generate actionable recommendations for production optimizer usage.
Args:
analysis_results: Combined results from convergence and sensitivity analysis
Returns:
List of production recommendations
TODO: Implement production recommendation generation.
APPROACH:
1. Analyze convergence patterns and stability
2. Consider computational efficiency requirements
3. Account for memory constraints
4. Generate optimizer selection guidance
5. Provide hyperparameter tuning suggestions
RECOMMENDATION CATEGORIES:
- Optimizer selection for different scenarios
- Learning rate and scheduling strategies
- Memory optimization techniques
- Training stability improvements
- Production deployment considerations
PRODUCTION CONTEXT:
These recommendations guide:
- ML engineer optimizer selection
- DevOps resource allocation
- Training pipeline optimization
- Cost reduction strategies
IMPLEMENTATION HINTS:
- Provide specific, actionable advice
- Consider different deployment scenarios
- Include quantitative guidelines
- Address common production challenges
"""
### BEGIN SOLUTION
recommendations = []
# Optimizer selection recommendations
recommendations.append("🔧 OPTIMIZER SELECTION GUIDE:")
recommendations.append(" • SGD + Momentum: Best for large batch training, proven stability")
recommendations.append(" • Adam: Best for rapid prototyping, adaptive learning rates")
recommendations.append(" • Consider memory constraints: SGD uses ~50% less memory than Adam")
# Learning rate recommendations
if 'learning_rate_analysis' in analysis_results:
lr_analysis = analysis_results['learning_rate_analysis']
if lr_analysis.get('optimal_range'):
opt_range = lr_analysis['optimal_range']
recommendations.append(f"📈 LEARNING RATE GUIDANCE:")
recommendations.append(f" • Start with: {opt_range[0]:.2e}")
recommendations.append(f" • Safe upper bound: {opt_range[1]:.2e}")
recommendations.append(" • Use learning rate scheduling for best results")
# Convergence recommendations
if 'convergence_comparison' in analysis_results:
comparison = analysis_results['convergence_comparison']
if 'recommendations' in comparison and 'summary' in comparison['recommendations']:
recommendations.append("🎯 CONVERGENCE OPTIMIZATION:")
for rec in comparison['recommendations']['summary']:
recommendations.append(f"{rec}")
# Production deployment recommendations
recommendations.append("🚀 PRODUCTION DEPLOYMENT:")
recommendations.append(" • Monitor gradient norms to detect training instability")
recommendations.append(" • Implement gradient clipping for large models")
recommendations.append(" • Use learning rate warmup for transformer architectures")
recommendations.append(" • Consider mixed precision training to reduce memory usage")
# Scaling recommendations
recommendations.append("📊 SCALING CONSIDERATIONS:")
recommendations.append(" • Large batch training: Prefer SGD with linear learning rate scaling")
recommendations.append(" • Distributed training: Use synchronized optimizers")
recommendations.append(" • Memory-constrained: Choose SGD or use gradient accumulation")
recommendations.append(" • Fine-tuning: Use lower learning rates (10x-100x smaller)")
# Monitoring recommendations
recommendations.append("📈 MONITORING & DEBUGGING:")
recommendations.append(" • Track loss smoothness to detect learning rate issues")
recommendations.append(" • Monitor gradient norms for explosion/vanishing detection")
recommendations.append(" • Log learning rate schedules for reproducibility")
recommendations.append(" • Profile memory usage to optimize batch sizes")
return recommendations
### END SOLUTION
def _analyze_convergence_profile(self, optimizer_name: str, losses: List[float],
grad_norms: List[float], step_durations: List[float],
convergence_step: Optional[int]) -> Dict[str, Any]:
"""
Internal helper to analyze convergence profile data.
Args:
optimizer_name: Name of the optimizer
losses: List of loss values over training
grad_norms: List of gradient norms over training
step_durations: List of step execution times
convergence_step: Step where convergence was detected (if any)
Returns:
Analysis results dictionary
"""
### BEGIN SOLUTION
analysis = {
'optimizer_name': optimizer_name,
'total_steps': len(losses),
'convergence_step': convergence_step,
'final_loss': losses[-1] if losses else float('inf'),
'initial_loss': losses[0] if losses else float('inf'),
'loss_reduction': 0.0,
'convergence_rate': 0.0,
'stability_score': 0.0,
'average_step_time': 0.0,
'gradient_health': 'unknown'
}
if losses:
# Calculate loss reduction
initial_loss = losses[0]
final_loss = losses[-1]
analysis['loss_reduction'] = initial_loss - final_loss
# Calculate convergence rate (loss reduction per step)
if len(losses) > 1:
analysis['convergence_rate'] = analysis['loss_reduction'] / len(losses)
# Calculate stability (inverse of coefficient of variation)
if len(losses) >= self.stability_window:
recent_losses = losses[-self.stability_window:]
mean_loss = np.mean(recent_losses)
std_loss = np.std(recent_losses)
analysis['stability_score'] = 1.0 / (1.0 + std_loss / (mean_loss + 1e-8))
# Average step time
if step_durations:
analysis['average_step_time'] = np.mean(step_durations)
# Gradient health assessment
if grad_norms:
max_grad_norm = max(grad_norms)
avg_grad_norm = np.mean(grad_norms)
if max_grad_norm > self.gradient_explosion_threshold:
analysis['gradient_health'] = 'exploding'
elif avg_grad_norm < 1e-8:
analysis['gradient_health'] = 'vanishing'
elif np.std(grad_norms) / (avg_grad_norm + 1e-8) > 2.0:
analysis['gradient_health'] = 'unstable'
else:
analysis['gradient_health'] = 'healthy'
return analysis
### END SOLUTION
# %% [markdown]
"""
### 🧪 Unit Test: OptimizerConvergenceProfiler
Let's test your ML systems optimizer profiler! This tool helps analyze and compare optimizer performance in production scenarios.
**This is a unit test** - it tests the OptimizerConvergenceProfiler class functionality.
"""
# %% nbgrader={"grade": true, "grade_id": "test-convergence-profiler", "locked": true, "points": 30, "schema_version": 3, "solution": false, "task": false}
def test_unit_convergence_profiler():
"""Unit test for the OptimizerConvergenceProfiler implementation."""
print("🔬 Unit Test: Optimizer Convergence Profiler...")
# Test profiler initialization
try:
profiler = OptimizerConvergenceProfiler()
assert hasattr(profiler, 'convergence_history'), "Should have convergence_history tracking"
assert hasattr(profiler, 'gradient_norms'), "Should have gradient_norms tracking"
assert hasattr(profiler, 'learning_rates'), "Should have learning_rates tracking"
assert hasattr(profiler, 'step_times'), "Should have step_times tracking"
print("✅ Profiler initialization works")
except Exception as e:
print(f"❌ Profiler initialization failed: {e}")
raise
# Test memory usage estimation
try:
# Test SGD memory estimation
w = Variable(1.0, requires_grad=True)
sgd_optimizer = SGD([w], learning_rate=0.01, momentum=0.9)
memory_estimate = profiler.estimate_memory_usage(sgd_optimizer, num_parameters=1000000)
assert 'parameters_mb' in memory_estimate, "Should estimate parameter memory"
assert 'gradients_mb' in memory_estimate, "Should estimate gradient memory"
assert 'optimizer_state_mb' in memory_estimate, "Should estimate optimizer state memory"
assert 'total_mb' in memory_estimate, "Should provide total memory estimate"
# SGD with momentum should have optimizer state
assert memory_estimate['optimizer_state_mb'] > 0, "SGD with momentum should have state memory"
print("✅ Memory usage estimation works")
except Exception as e:
print(f"❌ Memory usage estimation failed: {e}")
raise
# Test simple convergence analysis
try:
# Create a simple training function for testing
def simple_training_function():
# Simulate decreasing loss
losses = [10.0 - i * 0.5 for i in range(20)]
return losses[-1] # Return final loss
# Create test optimizer
w = Variable(1.0, requires_grad=True)
w.grad = Variable(0.1) # Set gradient for testing
test_optimizer = SGD([w], learning_rate=0.01)
# Profile convergence (simplified test)
analysis = profiler.profile_optimizer_convergence(
optimizer_name="test_sgd",
optimizer=test_optimizer,
training_function=simple_training_function,
initial_loss=10.0,
max_steps=10
)
assert 'optimizer_name' in analysis, "Should return optimizer name"
assert 'total_steps' in analysis, "Should track total steps"
assert 'final_loss' in analysis, "Should track final loss"
print("✅ Basic convergence profiling works")
except Exception as e:
print(f"❌ Convergence profiling failed: {e}")
raise
# Test production recommendations
try:
# Create mock analysis results
mock_results = {
'learning_rate_analysis': {
'optimal_range': (0.001, 0.1)
},
'convergence_comparison': {
'recommendations': {
'summary': ['Best overall: Adam', 'Fastest: SGD']
}
}
}
recommendations = profiler.generate_production_recommendations(mock_results)
assert isinstance(recommendations, list), "Should return list of recommendations"
assert len(recommendations) > 0, "Should provide recommendations"
# Check for key recommendation categories
rec_text = ' '.join(recommendations)
assert 'OPTIMIZER SELECTION' in rec_text, "Should include optimizer selection guidance"
assert 'PRODUCTION DEPLOYMENT' in rec_text, "Should include production deployment advice"
print("✅ Production recommendations work")
except Exception as e:
print(f"❌ Production recommendations failed: {e}")
raise
# Test optimizer comparison framework
try:
# Create mock profiles for comparison
mock_profiles = {
'sgd': {'convergence_step': 50, 'final_loss': 0.1},
'adam': {'convergence_step': 30, 'final_loss': 0.05}
}
# Add some mock data to profiler
profiler.convergence_history['sgd'] = [1.0, 0.5, 0.2, 0.1]
profiler.convergence_history['adam'] = [1.0, 0.3, 0.1, 0.05]
profiler.step_times['sgd'] = [0.01, 0.01, 0.01, 0.01]
profiler.step_times['adam'] = [0.02, 0.02, 0.02, 0.02]
comparison = profiler.compare_optimizers(mock_profiles)
assert 'convergence_speed' in comparison, "Should compare convergence speed"
assert 'final_performance' in comparison, "Should compare final performance"
assert 'stability' in comparison, "Should compare stability"
assert 'recommendations' in comparison, "Should provide recommendations"
print("✅ Optimizer comparison works")
except Exception as e:
print(f"❌ Optimizer comparison failed: {e}")
raise
print("🎯 Optimizer Convergence Profiler behavior:")
print(" Profiles convergence patterns across different optimizers")
print(" Estimates memory usage for production planning")
print(" Provides actionable recommendations for ML systems")
print(" Enables data-driven optimizer selection")
print("📈 Progress: ML Systems Optimizer Analysis ✓")
# Test function defined (called in main block)
# %% [markdown]
"""
## Step 7: Advanced Optimizer Features
### Production Optimizer Patterns
Real ML systems need more than basic optimizers. They need:
1. **Gradient Clipping**: Prevents gradient explosion in large models
2. **Learning Rate Warmup**: Gradually increases learning rate at start
3. **Gradient Accumulation**: Simulates large batch training
4. **Mixed Precision**: Reduces memory usage with FP16
5. **Distributed Synchronization**: Coordinates optimizer across GPUs
Let's implement these production patterns!
"""
# %% nbgrader={"grade": false, "grade_id": "advanced-optimizer-features", "locked": false, "schema_version": 3, "solution": true, "task": false}
#| export
class AdvancedOptimizerFeatures:
"""
Advanced optimizer features for production ML systems.
Implements production-ready optimizer enhancements:
- Gradient clipping for stability
- Learning rate warmup strategies
- Gradient accumulation for large batches
- Mixed precision optimization patterns
- Distributed optimizer synchronization
"""
def __init__(self):
"""
Initialize advanced optimizer features.
TODO: Implement advanced features initialization.
PRODUCTION CONTEXT:
These features are essential for:
- Training large language models (GPT, BERT)
- Computer vision at scale (ImageNet, COCO)
- Distributed training across multiple GPUs
- Memory-efficient training with limited resources
IMPLEMENTATION HINTS:
- Initialize gradient clipping parameters
- Set up warmup scheduling state
- Prepare accumulation buffers
- Configure synchronization patterns
"""
### BEGIN SOLUTION
# Gradient clipping
self.max_grad_norm = 1.0
self.clip_enabled = False
# Learning rate warmup
self.warmup_steps = 0
self.warmup_factor = 0.1
self.base_lr = 0.001
# Gradient accumulation
self.accumulation_steps = 1
self.accumulated_gradients = {}
self.accumulation_count = 0
# Mixed precision simulation
self.use_fp16 = False
self.loss_scale = 1.0
self.dynamic_loss_scaling = False
# Distributed training simulation
self.world_size = 1
self.rank = 0
### END SOLUTION
def apply_gradient_clipping(self, optimizer: Union[SGD, Adam], max_norm: float = 1.0) -> float:
"""
Apply gradient clipping to prevent gradient explosion.
Args:
optimizer: Optimizer with parameters to clip
max_norm: Maximum allowed gradient norm
Returns:
Actual gradient norm before clipping
TODO: Implement gradient clipping.
APPROACH:
1. Calculate total gradient norm across all parameters
2. If norm exceeds max_norm, scale all gradients down
3. Apply scaling factor to maintain gradient direction
4. Return original norm for monitoring
MATHEMATICAL FORMULATION:
total_norm = sqrt(sum(param_grad_norm^2 for all params))
if total_norm > max_norm:
clip_factor = max_norm / total_norm
for each param: param.grad *= clip_factor
PRODUCTION VALUE:
Gradient clipping is essential for:
- Training RNNs and Transformers
- Preventing training instability
- Enabling higher learning rates
- Improving convergence reliability
IMPLEMENTATION HINTS:
- Calculate global gradient norm
- Apply uniform scaling to all gradients
- Preserve gradient directions
- Return unclipped norm for logging
"""
### BEGIN SOLUTION
# Calculate total gradient norm
total_norm = 0.0
param_count = 0
for param in optimizer.parameters:
if param.grad is not None:
grad_data = param.grad.data.data
if hasattr(grad_data, 'flatten'):
param_norm = np.linalg.norm(grad_data.flatten())
else:
param_norm = abs(float(grad_data))
total_norm += param_norm ** 2
param_count += 1
if param_count > 0:
total_norm = total_norm ** 0.5
else:
return 0.0
# Apply clipping if necessary
if total_norm > max_norm:
clip_factor = max_norm / total_norm
for param in optimizer.parameters:
if param.grad is not None:
grad_data = param.grad.data.data
clipped_grad = grad_data * clip_factor
param.grad.data = Tensor(clipped_grad)
return total_norm
### END SOLUTION
def apply_warmup_schedule(self, optimizer: Union[SGD, Adam], step: int,
warmup_steps: int, base_lr: float) -> float:
"""
Apply learning rate warmup schedule.
Args:
optimizer: Optimizer to apply warmup to
step: Current training step
warmup_steps: Number of warmup steps
base_lr: Target learning rate after warmup
Returns:
Current learning rate
TODO: Implement learning rate warmup.
APPROACH:
1. If step < warmup_steps: gradually increase learning rate
2. Use linear or polynomial warmup schedule
3. Update optimizer's learning rate
4. Return current learning rate for logging
WARMUP STRATEGIES:
- Linear: lr = base_lr * (step / warmup_steps)
- Polynomial: lr = base_lr * ((step / warmup_steps) ^ power)
- Constant: lr = base_lr * warmup_factor for warmup_steps
PRODUCTION VALUE:
Warmup prevents:
- Early training instability
- Poor initialization effects
- Gradient explosion at start
- Suboptimal convergence paths
IMPLEMENTATION HINTS:
- Handle step=0 case (avoid division by zero)
- Use linear warmup for simplicity
- Update optimizer.learning_rate directly
- Smoothly transition to base learning rate
"""
### BEGIN SOLUTION
if step < warmup_steps and warmup_steps > 0:
# Linear warmup
warmup_factor = step / warmup_steps
current_lr = base_lr * warmup_factor
else:
# After warmup, use base learning rate
current_lr = base_lr
# Update optimizer learning rate
optimizer.learning_rate = current_lr
return current_lr
### END SOLUTION
def accumulate_gradients(self, optimizer: Union[SGD, Adam], accumulation_steps: int) -> bool:
"""
Accumulate gradients to simulate larger batch sizes.
Args:
optimizer: Optimizer with parameters to accumulate
accumulation_steps: Number of steps to accumulate before update
Returns:
True if ready to perform optimizer step, False otherwise
TODO: Implement gradient accumulation.
APPROACH:
1. Add current gradients to accumulated gradient buffers
2. Increment accumulation counter
3. If counter reaches accumulation_steps:
a. Average accumulated gradients
b. Set as current gradients
c. Return True (ready for optimizer step)
d. Reset accumulation
4. Otherwise return False (continue accumulating)
MATHEMATICAL FORMULATION:
accumulated_grad += current_grad
if accumulation_count == accumulation_steps:
final_grad = accumulated_grad / accumulation_steps
reset accumulation
return True
PRODUCTION VALUE:
Gradient accumulation enables:
- Large effective batch sizes on limited memory
- Training large models on small GPUs
- Consistent training across different hardware
- Memory-efficient distributed training
IMPLEMENTATION HINTS:
- Store accumulated gradients per parameter
- Use parameter id() as key for tracking
- Average gradients before optimizer step
- Reset accumulation after each update
"""
### BEGIN SOLUTION
# Initialize accumulation if first time
if not hasattr(self, 'accumulation_count'):
self.accumulation_count = 0
self.accumulated_gradients = {}
# Accumulate gradients
for param in optimizer.parameters:
if param.grad is not None:
param_id = id(param)
grad_data = param.grad.data.data
if param_id not in self.accumulated_gradients:
self.accumulated_gradients[param_id] = np.zeros_like(grad_data)
self.accumulated_gradients[param_id] += grad_data
self.accumulation_count += 1
# Check if ready to update
if self.accumulation_count >= accumulation_steps:
# Average accumulated gradients and set as current gradients
for param in optimizer.parameters:
if param.grad is not None:
param_id = id(param)
if param_id in self.accumulated_gradients:
averaged_grad = self.accumulated_gradients[param_id] / accumulation_steps
param.grad.data = Tensor(averaged_grad)
# Reset accumulation
self.accumulation_count = 0
self.accumulated_gradients = {}
return True # Ready for optimizer step
return False # Continue accumulating
### END SOLUTION
def simulate_mixed_precision(self, optimizer: Union[SGD, Adam], loss_scale: float = 1.0) -> bool:
"""
Simulate mixed precision training effects.
Args:
optimizer: Optimizer to apply mixed precision to
loss_scale: Loss scaling factor for gradient preservation
Returns:
True if gradients are valid (no overflow), False if overflow detected
TODO: Implement mixed precision simulation.
APPROACH:
1. Scale gradients by loss_scale factor
2. Check for gradient overflow (inf or nan values)
3. If overflow detected, skip optimizer step
4. If valid, descale gradients before optimizer step
5. Return overflow status
MIXED PRECISION CONCEPTS:
- Use FP16 for forward pass (memory savings)
- Use FP32 for backward pass (numerical stability)
- Scale loss to prevent gradient underflow
- Check for overflow before optimization
PRODUCTION VALUE:
Mixed precision provides:
- 50% memory reduction
- Faster training on modern GPUs
- Maintained numerical stability
- Automatic overflow detection
IMPLEMENTATION HINTS:
- Scale gradients by loss_scale
- Check for inf/nan in gradients
- Descale before optimizer step
- Return overflow status for dynamic scaling
"""
### BEGIN SOLUTION
# Check for gradient overflow before scaling
has_overflow = False
for param in optimizer.parameters:
if param.grad is not None:
grad_data = param.grad.data.data
if hasattr(grad_data, 'flatten'):
grad_flat = grad_data.flatten()
if np.any(np.isinf(grad_flat)) or np.any(np.isnan(grad_flat)):
has_overflow = True
break
else:
if np.isinf(grad_data) or np.isnan(grad_data):
has_overflow = True
break
if has_overflow:
# Zero gradients to prevent corruption
for param in optimizer.parameters:
if param.grad is not None:
param.grad = None
return False # Overflow detected
# Descale gradients (simulate unscaling from FP16)
if loss_scale > 1.0:
for param in optimizer.parameters:
if param.grad is not None:
grad_data = param.grad.data.data
descaled_grad = grad_data / loss_scale
param.grad.data = Tensor(descaled_grad)
return True # No overflow, safe to proceed
### END SOLUTION
def simulate_distributed_sync(self, optimizer: Union[SGD, Adam], world_size: int = 1) -> None:
"""
Simulate distributed training gradient synchronization.
Args:
optimizer: Optimizer with gradients to synchronize
world_size: Number of distributed processes
TODO: Implement distributed gradient synchronization simulation.
APPROACH:
1. Simulate all-reduce operation on gradients
2. Average gradients across all processes
3. Update local gradients with synchronized values
4. Handle communication overhead simulation
DISTRIBUTED CONCEPTS:
- All-reduce: Combine gradients from all GPUs
- Averaging: Divide by world_size for consistency
- Synchronization: Ensure all GPUs have same gradients
- Communication: Network overhead for gradient sharing
PRODUCTION VALUE:
Distributed training enables:
- Scaling to multiple GPUs/nodes
- Training large models efficiently
- Reduced training time
- Consistent convergence across devices
IMPLEMENTATION HINTS:
- Simulate averaging by keeping gradients unchanged
- Add small noise to simulate communication variance
- Scale learning rate by world_size if needed
- Log synchronization overhead
"""
### BEGIN SOLUTION
if world_size <= 1:
return # No synchronization needed for single process
# Simulate all-reduce operation (averaging gradients)
for param in optimizer.parameters:
if param.grad is not None:
grad_data = param.grad.data.data
# In real distributed training, gradients would be averaged across all processes
# Here we simulate this by keeping gradients unchanged (already "averaged")
# In practice, this would involve MPI/NCCL communication
# Simulate communication noise (very small)
if hasattr(grad_data, 'shape'):
noise = np.random.normal(0, 1e-10, grad_data.shape)
synchronized_grad = grad_data + noise
else:
noise = np.random.normal(0, 1e-10)
synchronized_grad = grad_data + noise
param.grad.data = Tensor(synchronized_grad)
# In distributed training, learning rate is often scaled by world_size
# to maintain effective learning rate with larger batch sizes
if hasattr(optimizer, 'base_learning_rate'):
optimizer.learning_rate = optimizer.base_learning_rate * world_size
### END SOLUTION
# %% [markdown]
"""
### 🧪 Unit Test: Advanced Optimizer Features
Let's test your advanced optimizer features! These are production-ready enhancements used in real ML systems.
**This is a unit test** - it tests the AdvancedOptimizerFeatures class functionality.
"""
# %% nbgrader={"grade": true, "grade_id": "test-advanced-features", "locked": true, "points": 25, "schema_version": 3, "solution": false, "task": false}
def test_unit_advanced_optimizer_features():
"""Unit test for advanced optimizer features implementation."""
print("🔬 Unit Test: Advanced Optimizer Features...")
# Test advanced features initialization
try:
features = AdvancedOptimizerFeatures()
assert hasattr(features, 'max_grad_norm'), "Should have gradient clipping parameters"
assert hasattr(features, 'warmup_steps'), "Should have warmup parameters"
assert hasattr(features, 'accumulation_steps'), "Should have accumulation parameters"
print("✅ Advanced features initialization works")
except Exception as e:
print(f"❌ Advanced features initialization failed: {e}")
raise
# Test gradient clipping
try:
# Create optimizer with large gradients
w = Variable(1.0, requires_grad=True)
w.grad = Variable(10.0) # Large gradient
optimizer = SGD([w], learning_rate=0.01)
# Apply gradient clipping
original_norm = features.apply_gradient_clipping(optimizer, max_norm=1.0)
# Check that gradient was clipped
clipped_grad = w.grad.data.data.item()
assert abs(clipped_grad) <= 1.0, f"Gradient should be clipped to <= 1.0, got {clipped_grad}"
assert original_norm > 1.0, f"Original norm should be > 1.0, got {original_norm}"
print("✅ Gradient clipping works")
except Exception as e:
print(f"❌ Gradient clipping failed: {e}")
raise
# Test learning rate warmup
try:
w2 = Variable(1.0, requires_grad=True)
optimizer2 = SGD([w2], learning_rate=0.01)
# Test warmup schedule
lr_step_0 = features.apply_warmup_schedule(optimizer2, step=0, warmup_steps=10, base_lr=0.1)
lr_step_5 = features.apply_warmup_schedule(optimizer2, step=5, warmup_steps=10, base_lr=0.1)
lr_step_10 = features.apply_warmup_schedule(optimizer2, step=10, warmup_steps=10, base_lr=0.1)
# Check warmup progression
assert lr_step_0 == 0.0, f"Step 0 should have lr=0.0, got {lr_step_0}"
assert 0.0 < lr_step_5 < 0.1, f"Step 5 should have 0 < lr < 0.1, got {lr_step_5}"
assert lr_step_10 == 0.1, f"Step 10 should have lr=0.1, got {lr_step_10}"
print("✅ Learning rate warmup works")
except Exception as e:
print(f"❌ Learning rate warmup failed: {e}")
raise
# Test gradient accumulation
try:
w3 = Variable(1.0, requires_grad=True)
w3.grad = Variable(0.1)
optimizer3 = SGD([w3], learning_rate=0.01)
# Test accumulation over multiple steps
ready_step_1 = features.accumulate_gradients(optimizer3, accumulation_steps=3)
ready_step_2 = features.accumulate_gradients(optimizer3, accumulation_steps=3)
ready_step_3 = features.accumulate_gradients(optimizer3, accumulation_steps=3)
# Check accumulation behavior
assert not ready_step_1, "Should not be ready after step 1"
assert not ready_step_2, "Should not be ready after step 2"
assert ready_step_3, "Should be ready after step 3"
print("✅ Gradient accumulation works")
except Exception as e:
print(f"❌ Gradient accumulation failed: {e}")
raise
# Test mixed precision simulation
try:
w4 = Variable(1.0, requires_grad=True)
w4.grad = Variable(0.1)
optimizer4 = SGD([w4], learning_rate=0.01)
# Test normal case (no overflow)
no_overflow = features.simulate_mixed_precision(optimizer4, loss_scale=1.0)
assert no_overflow, "Should not detect overflow with normal gradients"
# Test overflow case
w4.grad = Variable(float('inf'))
overflow = features.simulate_mixed_precision(optimizer4, loss_scale=1.0)
assert not overflow, "Should detect overflow with inf gradients"
print("✅ Mixed precision simulation works")
except Exception as e:
print(f"❌ Mixed precision simulation failed: {e}")
raise
# Test distributed synchronization
try:
w5 = Variable(1.0, requires_grad=True)
w5.grad = Variable(0.1)
optimizer5 = SGD([w5], learning_rate=0.01)
original_grad = w5.grad.data.data.item()
# Simulate distributed sync
features.simulate_distributed_sync(optimizer5, world_size=4)
# Gradient should be slightly modified (due to simulated communication noise)
# but still close to original
synced_grad = w5.grad.data.data.item()
assert abs(synced_grad - original_grad) < 0.01, "Synchronized gradient should be close to original"
print("✅ Distributed synchronization simulation works")
except Exception as e:
print(f"❌ Distributed synchronization failed: {e}")
raise
print("🎯 Advanced Optimizer Features behavior:")
print(" Implements gradient clipping for training stability")
print(" Provides learning rate warmup for better convergence")
print(" Enables gradient accumulation for large effective batches")
print(" Simulates mixed precision training patterns")
print(" Handles distributed training synchronization")
print("📈 Progress: Advanced Production Optimizer Features ✓")
# Test function defined (called in main block)
# %% [markdown]
"""
## Step 8: Comprehensive Testing - ML Systems Integration
### Real-World Optimizer Performance Testing
Let's test our optimizers in realistic scenarios that mirror production ML systems:
1. **Convergence Race**: Compare optimizers on the same task
2. **Learning Rate Sensitivity**: Find optimal hyperparameters
3. **Memory Analysis**: Compare resource usage
4. **Production Recommendations**: Get actionable guidance
This integration test demonstrates how our ML systems tools work together.
"""
# %% nbgrader={"grade": true, "grade_id": "test-ml-systems-integration", "locked": true, "points": 35, "schema_version": 3, "solution": false, "task": false}
def test_comprehensive_ml_systems_integration():
"""Comprehensive integration test demonstrating ML systems optimizer analysis."""
print("🔬 Comprehensive Test: ML Systems Integration...")
# Initialize ML systems tools
try:
profiler = OptimizerConvergenceProfiler()
advanced_features = AdvancedOptimizerFeatures()
print("✅ ML systems tools initialized")
except Exception as e:
print(f"❌ ML systems tools initialization failed: {e}")
raise
# Test convergence profiling with multiple optimizers
try:
print("\n📊 Running optimizer convergence comparison...")
# Create simple training scenario
def create_training_function(optimizer_instance):
def training_step():
# Simulate a quadratic loss function: loss = (x - target)^2
# where we're trying to minimize x towards target = 2.0
current_x = optimizer_instance.parameters[0].data.data.item()
target = 2.0
loss = (current_x - target) ** 2
# Compute gradient: d/dx (x - target)^2 = 2 * (x - target)
gradient = 2 * (current_x - target)
optimizer_instance.parameters[0].grad = Variable(gradient)
# Perform optimizer step
optimizer_instance.step()
return loss
return training_step
# Test SGD
w_sgd = Variable(0.0, requires_grad=True) # Start at x=0, target=2
sgd_optimizer = SGD([w_sgd], learning_rate=0.1, momentum=0.9)
sgd_training = create_training_function(sgd_optimizer)
sgd_profile = profiler.profile_optimizer_convergence(
optimizer_name="SGD_momentum",
optimizer=sgd_optimizer,
training_function=sgd_training,
initial_loss=4.0, # (0-2)^2 = 4
max_steps=30
)
# Test Adam
w_adam = Variable(0.0, requires_grad=True) # Start at x=0, target=2
adam_optimizer = Adam([w_adam], learning_rate=0.1)
adam_training = create_training_function(adam_optimizer)
adam_profile = profiler.profile_optimizer_convergence(
optimizer_name="Adam",
optimizer=adam_optimizer,
training_function=adam_training,
initial_loss=4.0,
max_steps=30
)
# Verify profiling results
assert 'optimizer_name' in sgd_profile, "SGD profile should contain optimizer name"
assert 'optimizer_name' in adam_profile, "Adam profile should contain optimizer name"
assert 'final_loss' in sgd_profile, "SGD profile should contain final loss"
assert 'final_loss' in adam_profile, "Adam profile should contain final loss"
print(f" SGD final loss: {sgd_profile['final_loss']:.4f}")
print(f" Adam final loss: {adam_profile['final_loss']:.4f}")
print("✅ Convergence profiling completed")
except Exception as e:
print(f"❌ Convergence profiling failed: {e}")
raise
# Test optimizer comparison
try:
print("\n🏆 Comparing optimizer performance...")
profiles = {
'SGD_momentum': sgd_profile,
'Adam': adam_profile
}
comparison = profiler.compare_optimizers(profiles)
# Verify comparison results
assert 'convergence_speed' in comparison, "Should compare convergence speed"
assert 'final_performance' in comparison, "Should compare final performance"
assert 'rankings' in comparison, "Should provide rankings"
assert 'recommendations' in comparison, "Should provide recommendations"
if 'summary' in comparison['recommendations']:
print(" Recommendations:")
for rec in comparison['recommendations']['summary']:
print(f" {rec}")
print("✅ Optimizer comparison completed")
except Exception as e:
print(f"❌ Optimizer comparison failed: {e}")
raise
# Test memory analysis
try:
print("\n💾 Analyzing memory usage...")
# Simulate large model parameters
num_parameters = 100000 # 100K parameters
sgd_memory = profiler.estimate_memory_usage(sgd_optimizer, num_parameters)
adam_memory = profiler.estimate_memory_usage(adam_optimizer, num_parameters)
print(f" SGD memory usage: {sgd_memory['total_mb']:.1f} MB")
print(f" Adam memory usage: {adam_memory['total_mb']:.1f} MB")
print(f" Adam overhead: {adam_memory['total_mb'] - sgd_memory['total_mb']:.1f} MB")
# Verify memory analysis
assert sgd_memory['total_mb'] > 0, "SGD should have positive memory usage"
assert adam_memory['total_mb'] > sgd_memory['total_mb'], "Adam should use more memory than SGD"
print("✅ Memory analysis completed")
except Exception as e:
print(f"❌ Memory analysis failed: {e}")
raise
# Test advanced features integration
try:
print("\n🚀 Testing advanced optimizer features...")
# Test gradient clipping
w_clip = Variable(1.0, requires_grad=True)
w_clip.grad = Variable(5.0) # Large gradient
clip_optimizer = SGD([w_clip], learning_rate=0.01)
original_norm = advanced_features.apply_gradient_clipping(clip_optimizer, max_norm=1.0)
assert original_norm > 1.0, "Should detect large gradient"
assert abs(w_clip.grad.data.data.item()) <= 1.0, "Should clip gradient"
# Test learning rate warmup
warmup_optimizer = Adam([Variable(1.0)], learning_rate=0.001)
lr_start = advanced_features.apply_warmup_schedule(warmup_optimizer, 0, 100, 0.001)
lr_mid = advanced_features.apply_warmup_schedule(warmup_optimizer, 50, 100, 0.001)
lr_end = advanced_features.apply_warmup_schedule(warmup_optimizer, 100, 100, 0.001)
assert lr_start < lr_mid < lr_end, "Learning rate should increase during warmup"
print("✅ Advanced features integration completed")
except Exception as e:
print(f"❌ Advanced features integration failed: {e}")
raise
# Test production recommendations
try:
print("\n📋 Generating production recommendations...")
analysis_results = {
'convergence_comparison': comparison,
'memory_analysis': {
'sgd': sgd_memory,
'adam': adam_memory
},
'learning_rate_analysis': {
'optimal_range': (0.01, 0.1)
}
}
recommendations = profiler.generate_production_recommendations(analysis_results)
assert len(recommendations) > 0, "Should generate recommendations"
print(" Production guidance:")
for i, rec in enumerate(recommendations[:5]): # Show first 5 recommendations
print(f" {rec}")
print("✅ Production recommendations generated")
except Exception as e:
print(f"❌ Production recommendations failed: {e}")
raise
print("\n🎯 ML Systems Integration Results:")
print(" ✅ Optimizer convergence profiling works end-to-end")
print(" ✅ Performance comparison identifies best optimizers")
print(" ✅ Memory analysis guides resource planning")
print(" ✅ Advanced features enhance training stability")
print(" ✅ Production recommendations provide actionable guidance")
print(" 🚀 Ready for real-world ML systems deployment!")
print("📈 Progress: Comprehensive ML Systems Integration ✓")
# Test function defined (called in main block)
# %% [markdown]
"""
## 🎯 ML SYSTEMS THINKING: Optimizers in Production
### Production Deployment Considerations
**You've just built a comprehensive optimizer analysis system!** Let's reflect on how this connects to real ML systems:
### System Design Questions
1. **Optimizer Selection Strategy**: How would you build an automated system that selects the best optimizer for a new model architecture?
2. **Resource Planning**: Given memory constraints and training time budgets, how would you choose between SGD and Adam for different model sizes?
3. **Distributed Training**: How do gradient synchronization patterns affect optimizer performance across multiple GPUs or nodes?
4. **Production Monitoring**: What metrics would you track in production to detect optimizer-related training issues?
### Production ML Workflows
1. **Hyperparameter Search**: How would you integrate your convergence profiler into an automated hyperparameter tuning pipeline?
2. **Training Pipeline**: Where would gradient clipping and mixed precision fit into a production training workflow?
3. **Cost Optimization**: How would you balance optimizer performance against computational cost for training large models?
4. **Model Lifecycle**: How do optimizer choices change when fine-tuning vs training from scratch vs transfer learning?
### Framework Design Insights
1. **Optimizer Abstraction**: Why do frameworks like PyTorch separate optimizers from models? How does this design enable flexibility?
2. **State Management**: How do frameworks handle optimizer state persistence for training checkpoints and resumption?
3. **Memory Efficiency**: What design patterns enable frameworks to minimize memory overhead for optimizer state?
4. **Plugin Architecture**: How would you design an optimizer plugin system that allows researchers to add new algorithms?
### Performance & Scale Challenges
1. **Large Model Training**: How do optimizer memory requirements scale with model size, and what strategies mitigate this?
2. **Dynamic Batching**: How would you adapt your gradient accumulation strategy for variable batch sizes in production?
3. **Fault Tolerance**: How would you design optimizer state recovery for interrupted training runs in cloud environments?
4. **Cross-Hardware Portability**: How do optimizer implementations need to change when moving between CPUs, GPUs, and specialized ML accelerators?
These questions connect your optimizer implementations to the broader ecosystem of production ML systems, where optimization is just one piece of complex training and deployment pipelines.
"""
if __name__ == "__main__":
print("🧪 Running comprehensive optimizer tests...")
# Run all tests
test_unit_sgd_optimizer()
test_unit_adam_optimizer()
test_unit_step_scheduler()
test_module_unit_training()
test_unit_convergence_profiler()
test_unit_advanced_optimizer_features()
test_comprehensive_ml_systems_integration()
print("All tests passed!")
print("Optimizers module complete!")
# %% [markdown]
"""
## 🤔 ML Systems Thinking: Interactive Questions
Now that you've built optimization algorithms that drive neural network training, let's connect this foundational work to broader ML systems challenges. These questions help you think critically about how optimization strategies scale to production training environments.
Take time to reflect thoughtfully on each question - your insights will help you understand how the optimization concepts you've implemented connect to real-world ML systems engineering.
"""
# %% [markdown]
"""
### Question 1: Memory Overhead and Optimizer State Management
**Context**: Your Adam optimizer maintains momentum and variance buffers for each parameter, creating 3× memory overhead compared to SGD. Production training systems with billions of parameters must carefully manage optimizer state memory while maintaining training efficiency and fault tolerance.
**Reflection Question**: Design an optimizer state management system for large-scale neural network training that optimizes memory usage while supporting distributed training and fault recovery. How would you implement memory-efficient optimizer state storage, handle state partitioning across devices, and manage optimizer checkpointing for training resumption? Consider scenarios where optimizer state memory exceeds model parameter memory and requires specialized optimization strategies.
Think about: memory optimization techniques, distributed state management, checkpointing strategies, and fault tolerance considerations.
*Target length: 150-300 words*
"""
# %% nbgrader={"grade": true, "grade_id": "question-1-optimizer-memory", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false}
"""
YOUR REFLECTION ON MEMORY OVERHEAD AND OPTIMIZER STATE MANAGEMENT:
TODO: Replace this text with your thoughtful response about optimizer state management system design.
Consider addressing:
- How would you optimize memory usage for optimizers that maintain extensive per-parameter state?
- What strategies would you use for distributed optimizer state management across multiple devices?
- How would you implement efficient checkpointing and state recovery for long-running training jobs?
- What role would state compression and quantization play in your optimization approach?
- How would you balance memory efficiency with optimization algorithm effectiveness?
Write a technical analysis connecting your optimizer implementations to real memory management challenges.
GRADING RUBRIC (Instructor Use):
- Demonstrates understanding of optimizer memory overhead and state management (3 points)
- Addresses distributed state management and partitioning strategies (3 points)
- Shows practical knowledge of checkpointing and fault tolerance techniques (2 points)
- Demonstrates systems thinking about memory vs optimization trade-offs (2 points)
- Clear technical reasoning and practical considerations (bonus points for innovative approaches)
"""
### BEGIN SOLUTION
# Student response area - instructor will replace this section during grading setup
# This is a manually graded question requiring technical analysis of optimizer state management
# Students should demonstrate understanding of memory optimization and distributed state handling
### END SOLUTION
# %% [markdown]
"""
### Question 2: Distributed Optimization and Learning Rate Scheduling
**Context**: Your optimizers work on single devices with fixed learning rate schedules. Production distributed training systems must coordinate optimization across multiple workers while adapting learning rates based on real-time training dynamics and system constraints.
**Reflection Question**: Architect a distributed optimization system that coordinates parameter updates across multiple workers while implementing adaptive learning rate scheduling responsive to training progress and system constraints. How would you handle gradient aggregation strategies, implement learning rate scaling for different batch sizes, and design adaptive scheduling that responds to convergence patterns? Consider scenarios where training must adapt to varying computational resources and time constraints in cloud environments.
Think about: distributed optimization strategies, adaptive learning rate techniques, gradient aggregation methods, and system-aware scheduling.
*Target length: 150-300 words*
"""
# %% nbgrader={"grade": true, "grade_id": "question-2-distributed-optimization", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false}
"""
YOUR REFLECTION ON DISTRIBUTED OPTIMIZATION AND LEARNING RATE SCHEDULING:
TODO: Replace this text with your thoughtful response about distributed optimization system design.
Consider addressing:
- How would you coordinate parameter updates across multiple workers in distributed training?
- What strategies would you use for gradient aggregation and synchronization?
- How would you implement adaptive learning rate scheduling that responds to training dynamics?
- What role would system constraints and resource availability play in your optimization design?
- How would you handle learning rate scaling and batch size considerations in distributed settings?
Write an architectural analysis connecting your optimizer implementations to real distributed training challenges.
GRADING RUBRIC (Instructor Use):
- Shows understanding of distributed optimization and coordination challenges (3 points)
- Designs practical approaches to gradient aggregation and learning rate adaptation (3 points)
- Addresses system constraints and resource-aware optimization (2 points)
- Demonstrates systems thinking about distributed training coordination (2 points)
- Clear architectural reasoning with distributed systems insights (bonus points for comprehensive understanding)
"""
### BEGIN SOLUTION
# Student response area - instructor will replace this section during grading setup
# This is a manually graded question requiring understanding of distributed optimization systems
# Students should demonstrate knowledge of gradient aggregation and adaptive scheduling
### END SOLUTION
# %% [markdown]
"""
### Question 3: Production Integration and Optimization Monitoring
**Context**: Your optimizer implementations provide basic parameter updates, but production ML systems require comprehensive optimization monitoring, hyperparameter tuning, and integration with MLOps pipelines for continuous training and model improvement.
**Reflection Question**: Design a production optimization system that integrates with MLOps pipelines and provides comprehensive optimization monitoring and automated hyperparameter tuning. How would you implement real-time optimization metrics collection, automated optimizer selection based on model characteristics, and integration with experiment tracking and model deployment systems? Consider scenarios where optimization strategies must adapt to changing data distributions and business requirements in production environments.
Think about: optimization monitoring systems, automated hyperparameter tuning, MLOps integration, and adaptive optimization strategies.
*Target length: 150-300 words*
"""
# %% nbgrader={"grade": true, "grade_id": "question-3-production-integration", "locked": false, "points": 10, "schema_version": 3, "solution": true, "task": false}
"""
YOUR REFLECTION ON PRODUCTION INTEGRATION AND OPTIMIZATION MONITORING:
TODO: Replace this text with your thoughtful response about production optimization system design.
Consider addressing:
- How would you design optimization monitoring and metrics collection for production training?
- What strategies would you use for automated optimizer selection and hyperparameter tuning?
- How would you integrate optimization systems with MLOps pipelines and experiment tracking?
- What role would adaptive optimization play in responding to changing data and requirements?
- How would you ensure optimization system reliability and performance in production environments?
Write a systems analysis connecting your optimizer implementations to real production integration challenges.
GRADING RUBRIC (Instructor Use):
- Understands production optimization monitoring and MLOps integration (3 points)
- Designs practical approaches to automated tuning and optimization selection (3 points)
- Addresses adaptive optimization and production reliability considerations (2 points)
- Shows systems thinking about optimization system integration and monitoring (2 points)
- Clear systems reasoning with production deployment insights (bonus points for deep understanding)
"""
### BEGIN SOLUTION
# Student response area - instructor will replace this section during grading setup
# This is a manually graded question requiring understanding of production optimization systems
# Students should demonstrate knowledge of MLOps integration and optimization monitoring
### END SOLUTION
# %% [markdown]
"""
## 🎯 MODULE SUMMARY: Optimization Algorithms with ML Systems
Congratulations! You've successfully implemented optimization algorithms with comprehensive ML systems analysis:
### What You've Accomplished
✅ **Gradient Descent**: The foundation of all optimization algorithms
✅ **SGD with Momentum**: Improved convergence with momentum
✅ **Adam Optimizer**: Adaptive learning rates for better training
✅ **Learning Rate Scheduling**: Dynamic learning rate adjustment
✅ **ML Systems Analysis**: OptimizerConvergenceProfiler for production insights
✅ **Advanced Features**: Gradient clipping, warmup, accumulation, mixed precision
✅ **Production Integration**: Complete optimizer analysis and recommendation system
### Key Concepts You've Learned
- **Gradient-based optimization**: How gradients guide parameter updates
- **Momentum**: Using velocity to improve convergence
- **Adaptive learning rates**: Adam's adaptive moment estimation
- **Learning rate scheduling**: Dynamic adjustment of learning rates
- **Convergence analysis**: Profiling optimizer performance patterns
- **Memory efficiency**: Resource usage comparison across optimizers
- **Production patterns**: Advanced features for real-world deployment
### Mathematical Foundations
- **Gradient descent**: θ = θ - α∇θJ(θ)
- **Momentum**: v = βv + (1-β)∇θJ(θ), θ = θ - αv
- **Adam**: Adaptive moment estimation with bias correction
- **Learning rate scheduling**: StepLR and other scheduling strategies
- **Gradient clipping**: norm_clip = min(norm, max_norm) * grad / norm
- **Gradient accumulation**: grad_avg = Σgrad_i / accumulation_steps
### Professional Skills Developed
- **Algorithm implementation**: Building optimization algorithms from scratch
- **Performance analysis**: Profiling and comparing optimizer convergence
- **System design thinking**: Understanding production optimization workflows
- **Resource optimization**: Memory usage analysis and efficiency planning
- **Integration testing**: Ensuring optimizers work with neural networks
- **Production readiness**: Advanced features for real-world deployment
### Ready for Advanced Applications
Your optimization implementations now enable:
- **Neural network training**: Complete training pipelines with optimizers
- **Hyperparameter optimization**: Data-driven optimizer and LR selection
- **Advanced architectures**: Training complex models efficiently
- **Production deployment**: ML systems with optimizer monitoring and tuning
- **Research**: Experimenting with new optimization algorithms
- **Scalable training**: Distributed and memory-efficient optimization
### Connection to Real ML Systems
Your implementations mirror production systems:
- **PyTorch**: `torch.optim.SGD`, `torch.optim.Adam` provide identical functionality
- **TensorFlow**: `tf.keras.optimizers` implements similar concepts
- **MLflow/Weights&Biases**: Your profiler mirrors production monitoring tools
- **Ray Tune/Optuna**: Your convergence analysis enables hyperparameter optimization
- **Industry Standard**: Every major ML framework uses these exact algorithms and patterns
### Next Steps
1. **Export your code**: `tito export 10_optimizers`
2. **Test your implementation**: `tito test 10_optimizers`
3. **Deploy ML systems**: Use your profiler for real optimizer selection
4. **Build training systems**: Combine with neural networks for complete training
5. **Move to Module 11**: Add complete training pipelines!
**Ready for production?** Your optimization algorithms and ML systems analysis tools are now ready for real-world deployment and performance optimization!
"""