mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-05-06 20:14:44 -05:00
- Exported 09_training module using nbdev directly from Python file
- Exported 08_optimizers module to resolve import dependencies
- All training components now available in tinytorch.core.training:
  * MeanSquaredError, CrossEntropyLoss, BinaryCrossEntropyLoss
  * Accuracy metric
  * Trainer class with complete training orchestration
- All optimizers now available in tinytorch.core.optimizers:
  * SGD, Adam optimizers
  * StepLR learning rate scheduler
- All components properly exported and functional
- Integration tests passing (17/17)
- Inline tests passing (6/6)
- tito CLI integration working correctly

Package exports:
- tinytorch.core.training: 688 lines, 5 main classes
- tinytorch.core.optimizers: 17,396 bytes, complete optimizer suite
- Clean separation of development vs package code
- Ready for production use and further development
1755 lines
68 KiB
Plaintext
{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "602ba54a",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\""
|
||
},
|
||
"source": [
|
||
"# Module 8: Optimizers - Gradient-Based Parameter Updates\n",
|
||
"\n",
|
||
"Welcome to the Optimizers module! This is where neural networks learn to improve through intelligent parameter updates.\n",
|
||
"\n",
|
||
"## Learning Goals\n",
|
||
"- Understand gradient descent and how optimizers use gradients to update parameters\n",
|
||
"- Implement SGD with momentum for accelerated convergence\n",
|
||
"- Build Adam optimizer with adaptive learning rates\n",
|
||
"- Master learning rate scheduling strategies\n",
|
||
"- See how optimizers enable effective neural network training\n",
|
||
"\n",
|
||
"## Build → Use → Analyze\n",
|
||
"1. **Build**: Core optimization algorithms (SGD, Adam)\n",
|
||
"2. **Use**: Apply optimizers to train neural networks\n",
|
||
"3. **Analyze**: Compare optimizer behavior and convergence patterns"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "e3b359ed",
|
||
"metadata": {
|
||
"nbgrader": {
|
||
"grade": false,
|
||
"grade_id": "optimizers-imports",
|
||
"locked": false,
|
||
"schema_version": 3,
|
||
"solution": false,
|
||
"task": false
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"#| default_exp core.optimizers\n",
|
||
"\n",
|
||
"#| export\n",
|
||
"import math\n",
|
||
"import numpy as np\n",
|
||
"import sys\n",
|
||
"import os\n",
|
||
"from typing import List, Dict, Any, Optional, Union\n",
|
||
"from collections import defaultdict\n",
|
||
"\n",
|
||
"# Helper function to set up import paths\n",
|
||
"def setup_import_paths():\n",
|
||
" \"\"\"Set up import paths for development modules.\"\"\"\n",
|
||
" import sys\n",
|
||
" import os\n",
|
||
" \n",
|
||
" # Add module directories to path\n",
|
||
" base_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))\n",
|
||
" tensor_dir = os.path.join(base_dir, '01_tensor')\n",
|
||
" autograd_dir = os.path.join(base_dir, '07_autograd')\n",
|
||
" \n",
|
||
" if tensor_dir not in sys.path:\n",
|
||
" sys.path.append(tensor_dir)\n",
|
||
" if autograd_dir not in sys.path:\n",
|
||
" sys.path.append(autograd_dir)\n",
|
||
"\n",
|
||
"# Import our existing components\n",
|
||
"try:\n",
|
||
" from tinytorch.core.tensor import Tensor\n",
|
||
" from tinytorch.core.autograd import Variable\n",
|
||
"except ImportError:\n",
|
||
" # For development, try local imports\n",
|
||
" try:\n",
|
||
" setup_import_paths()\n",
|
||
" from tensor_dev import Tensor\n",
|
||
" from autograd_dev import Variable\n",
|
||
" except ImportError:\n",
|
||
" # Create minimal fallback classes for testing\n",
|
||
" print(\"Warning: Using fallback classes for testing\")\n",
|
||
" \n",
|
||
" class Tensor:\n",
|
||
" def __init__(self, data):\n",
|
||
" self.data = np.array(data)\n",
|
||
" self.shape = self.data.shape\n",
|
||
" \n",
|
||
" def __str__(self):\n",
|
||
" return f\"Tensor({self.data})\"\n",
|
||
" \n",
|
||
" class Variable:\n",
|
||
" def __init__(self, data, requires_grad=True):\n",
|
||
" if isinstance(data, (int, float)):\n",
|
||
" self.data = Tensor([data])\n",
|
||
" else:\n",
|
||
" self.data = Tensor(data)\n",
|
||
" self.requires_grad = requires_grad\n",
|
||
" self.grad = None\n",
|
||
" \n",
|
||
" def zero_grad(self):\n",
|
||
" self.grad = None\n",
|
||
" \n",
|
||
" def __str__(self):\n",
|
||
" return f\"Variable({self.data.data})\""
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "4dfb6aa4",
|
||
"metadata": {
|
||
"nbgrader": {
|
||
"grade": false,
|
||
"grade_id": "optimizers-setup",
|
||
"locked": false,
|
||
"schema_version": 3,
|
||
"solution": false,
|
||
"task": false
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"print(\"🔥 TinyTorch Optimizers Module\")\n",
|
||
"print(f\"NumPy version: {np.__version__}\")\n",
|
||
"print(f\"Python version: {sys.version_info.major}.{sys.version_info.minor}\")\n",
|
||
"print(\"Ready to build optimization algorithms!\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "c9afc185",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\""
|
||
},
|
||
"source": [
|
||
"## 📦 Where This Code Lives in the Final Package\n",
|
||
"\n",
|
||
"**Learning Side:** You work in `modules/source/08_optimizers/optimizers_dev.py` \n",
|
||
"**Building Side:** Code exports to `tinytorch.core.optimizers`\n",
|
||
"\n",
|
||
"```python\n",
|
||
"# Final package structure:\n",
|
||
"from tinytorch.core.optimizers import SGD, Adam, StepLR # The optimization engines!\n",
|
||
"from tinytorch.core.autograd import Variable # Gradient computation\n",
|
||
"from tinytorch.core.tensor import Tensor # Data structures\n",
|
||
"```\n",
|
||
"\n",
|
||
"**Why this matters:**\n",
|
||
"- **Learning:** Focused module for understanding optimization algorithms\n",
|
||
"- **Production:** Proper organization like PyTorch's `torch.optim`\n",
|
||
"- **Consistency:** All optimization algorithms live together in `core.optimizers`\n",
|
||
"- **Foundation:** Enables effective neural network training"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "e0d222c6",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\""
|
||
},
|
||
"source": [
|
||
"## What Are Optimizers?\n",
|
||
"\n",
|
||
"### The Problem: How to Update Parameters\n",
|
||
"Neural networks learn by updating parameters using gradients:\n",
|
||
"```\n",
|
||
"parameter_new = parameter_old - learning_rate * gradient\n",
|
||
"```\n",
|
||
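"\n",
"For example, with `parameter_old = 2.0`, `gradient = 0.5`, and `learning_rate = 0.1`, the new value is `2.0 - 0.1 * 0.5 = 1.95`.\n",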
"\n",
|
||
"But **naive gradient descent** has problems:\n",
|
||
"- **Slow convergence**: Takes many steps to reach optimum\n",
|
||
"- **Oscillation**: Bounces around valleys without making progress\n",
|
||
"- **Poor scaling**: Same learning rate for all parameters\n",
|
||
"\n",
|
||
"### The Solution: Smart Optimization\n",
|
||
"**Optimizers** are algorithms that intelligently update parameters:\n",
|
||
"- **Momentum**: Accelerate convergence by accumulating velocity\n",
|
||
"- **Adaptive learning rates**: Different learning rates for different parameters\n",
|
||
"- **Second-order information**: Use curvature to guide updates\n",
|
||
"\n",
|
||
"### Real-World Impact\n",
|
||
"- **SGD**: The foundation of all neural network training\n",
|
||
"- **Adam**: The default optimizer for most deep learning applications\n",
|
||
"- **Learning rate scheduling**: Critical for training stability and performance\n",
|
||
"\n",
|
||
"### What We'll Build\n",
|
||
"1. **SGD**: Stochastic Gradient Descent with momentum\n",
|
||
"2. **Adam**: Adaptive Moment Estimation optimizer\n",
|
||
"3. **StepLR**: Learning rate scheduling\n",
|
||
"4. **Integration**: Complete training loop with optimizers"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "8ccea3ce",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"## Step 1: Understanding Gradient Descent\n",
|
||
"\n",
|
||
"### What is Gradient Descent?\n",
|
||
"**Gradient descent** finds the minimum of a function by following the negative gradient:\n",
|
||
"\n",
|
||
"```\n",
|
||
"θ_{t+1} = θ_t - α ∇f(θ_t)\n",
|
||
"```\n",
|
||
"\n",
|
||
"Where:\n",
|
||
"- θ: Parameters we want to optimize\n",
|
"- α: Learning rate (how big a step to take)\n",
|
||
"- ∇f(θ): Gradient of loss function with respect to parameters\n",
|
||
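"\n",
"As a quick numeric sketch (plain Python, separate from the classes we build below), minimizing f(θ) = θ², whose gradient is ∇f(θ) = 2θ:\n",
"\n",
"```python\n",
"theta = 4.0               # start away from the minimum at 0\n",
"alpha = 0.1               # learning rate\n",
"for step in range(5):\n",
"    grad = 2 * theta      # gradient of f(θ) = θ²\n",
"    theta = theta - alpha * grad\n",
"    print(step, theta)    # 3.2, 2.56, 2.048, ... each step shrinks θ by 20%\n",
"```\n",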
"\n",
|
||
"### Why Gradient Descent Works\n",
|
"1. **Gradients point uphill**: so the negative gradient points downhill, toward the minimum\n",
|
||
"2. **Iterative improvement**: Each step reduces the loss (in theory)\n",
|
||
"3. **Local convergence**: Finds local minimum with proper learning rate\n",
|
||
"4. **Scalable**: Works with millions of parameters\n",
|
||
"\n",
|
||
"### The Learning Rate Dilemma\n",
|
||
"- **Too large**: Overshoots minimum, diverges\n",
|
||
"- **Too small**: Extremely slow convergence\n",
|
||
"- **Just right**: Steady progress toward minimum\n",
|
||
"\n",
|
||
"### Visual Understanding\n",
|
||
"```\n",
|
||
"Loss landscape: \\__/\n",
|
||
"Start here: ↑\n",
|
||
"Gradient descent: ↓ → ↓ → ↓ → minimum\n",
|
||
"```\n",
|
||
"\n",
|
||
"### Real-World Applications\n",
|
||
"- **Neural networks**: Training any deep learning model\n",
|
||
"- **Machine learning**: Logistic regression, SVM, etc.\n",
|
||
"- **Scientific computing**: Optimization problems in physics, engineering\n",
|
||
"- **Economics**: Portfolio optimization, game theory\n",
|
||
"\n",
|
||
"Let's implement gradient descent to understand it deeply!"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "d41c2596",
|
||
"metadata": {
|
||
"lines_to_next_cell": 1,
|
||
"nbgrader": {
|
||
"grade": false,
|
||
"grade_id": "gradient-descent-function",
|
||
"locked": false,
|
||
"schema_version": 3,
|
||
"solution": true,
|
||
"task": false
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"#| export\n",
|
||
"def gradient_descent_step(parameter: Variable, learning_rate: float) -> None:\n",
|
||
" \"\"\"\n",
|
||
" Perform one step of gradient descent on a parameter.\n",
|
||
" \n",
|
||
" Args:\n",
|
||
" parameter: Variable with gradient information\n",
|
||
" learning_rate: How much to update parameter\n",
|
||
" \n",
|
||
" TODO: Implement basic gradient descent parameter update.\n",
|
||
" \n",
|
||
" STEP-BY-STEP IMPLEMENTATION:\n",
|
||
" 1. Check if parameter has a gradient\n",
|
||
" 2. Get current parameter value and gradient\n",
|
||
" 3. Update parameter: new_value = old_value - learning_rate * gradient\n",
|
||
" 4. Update parameter data with new value\n",
|
||
" 5. Handle edge cases (no gradient, invalid values)\n",
|
||
" \n",
|
||
" EXAMPLE USAGE:\n",
|
||
" ```python\n",
|
||
" # Parameter with gradient\n",
|
||
" w = Variable(2.0, requires_grad=True)\n",
|
||
" w.grad = Variable(0.5) # Gradient from loss\n",
|
||
" \n",
|
||
" # Update parameter\n",
|
||
" gradient_descent_step(w, learning_rate=0.1)\n",
|
||
" # w.data now contains: 2.0 - 0.1 * 0.5 = 1.95\n",
|
||
" ```\n",
|
||
" \n",
|
||
" IMPLEMENTATION HINTS:\n",
|
||
" - Check if parameter.grad is not None\n",
|
||
" - Use parameter.grad.data.data to get gradient value\n",
|
||
" - Update parameter.data with new Tensor\n",
|
" - Don't modify the gradient itself (it may still be read or logged after the update)\n",
|
||
" \n",
|
||
" LEARNING CONNECTIONS:\n",
|
||
" - This is the foundation of all neural network training\n",
|
||
" - PyTorch's optimizer.step() does exactly this\n",
|
||
" - The learning rate determines convergence speed\n",
|
||
" \"\"\"\n",
|
||
" ### BEGIN SOLUTION\n",
|
||
" if parameter.grad is not None:\n",
|
||
" # Get current parameter value and gradient\n",
|
||
" current_value = parameter.data.data\n",
|
||
" gradient_value = parameter.grad.data.data\n",
|
||
" \n",
|
||
" # Update parameter: new_value = old_value - learning_rate * gradient\n",
|
||
" new_value = current_value - learning_rate * gradient_value\n",
|
||
" \n",
|
||
" # Update parameter data\n",
|
||
" parameter.data = Tensor(new_value)\n",
|
||
" ### END SOLUTION"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "4d2e1fd4",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"### 🧪 Unit Test: Gradient Descent Step\n",
|
||
"\n",
|
||
"Let's test your gradient descent implementation right away! This is the foundation of all optimization algorithms.\n",
|
||
"\n",
|
||
"**This is a unit test** - it tests one specific function (gradient_descent_step) in isolation."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "f092d289",
|
||
"metadata": {
|
||
"lines_to_next_cell": 1,
|
||
"nbgrader": {
|
||
"grade": true,
|
||
"grade_id": "test-gradient-descent",
|
||
"locked": true,
|
||
"points": 10,
|
||
"schema_version": 3,
|
||
"solution": false,
|
||
"task": false
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"def test_gradient_descent_step_comprehensive():\n",
|
||
" \"\"\"Test basic gradient descent parameter update\"\"\"\n",
|
||
" print(\"🔬 Unit Test: Gradient Descent Step...\")\n",
|
||
" \n",
|
||
" # Test basic parameter update\n",
|
||
" try:\n",
|
||
" w = Variable(2.0, requires_grad=True)\n",
|
||
" w.grad = Variable(0.5) # Positive gradient\n",
|
||
" \n",
|
||
" original_value = w.data.data.item()\n",
|
||
" gradient_descent_step(w, learning_rate=0.1)\n",
|
||
" new_value = w.data.data.item()\n",
|
||
" \n",
|
||
" expected_value = original_value - 0.1 * 0.5 # 2.0 - 0.05 = 1.95\n",
|
||
" assert abs(new_value - expected_value) < 1e-6, f\"Expected {expected_value}, got {new_value}\"\n",
|
||
" print(\"✅ Basic parameter update works\")\n",
|
||
" \n",
|
||
" except Exception as e:\n",
|
||
" print(f\"❌ Basic parameter update failed: {e}\")\n",
|
||
" raise\n",
|
||
"\n",
|
||
" # Test with negative gradient\n",
|
||
" try:\n",
|
||
" w2 = Variable(1.0, requires_grad=True)\n",
|
||
" w2.grad = Variable(-0.2) # Negative gradient\n",
|
||
" \n",
|
||
" gradient_descent_step(w2, learning_rate=0.1)\n",
|
||
" expected_value2 = 1.0 - 0.1 * (-0.2) # 1.0 + 0.02 = 1.02\n",
|
||
" assert abs(w2.data.data.item() - expected_value2) < 1e-6, \"Negative gradient test failed\"\n",
|
||
" print(\"✅ Negative gradient handling works\")\n",
|
||
" \n",
|
||
" except Exception as e:\n",
|
||
" print(f\"❌ Negative gradient handling failed: {e}\")\n",
|
||
" raise\n",
|
||
"\n",
|
||
" # Test with no gradient (should not update)\n",
|
||
" try:\n",
|
||
" w3 = Variable(3.0, requires_grad=True)\n",
|
||
" w3.grad = None\n",
|
||
" original_value3 = w3.data.data.item()\n",
|
||
" \n",
|
||
" gradient_descent_step(w3, learning_rate=0.1)\n",
|
||
" assert w3.data.data.item() == original_value3, \"Parameter with no gradient should not update\"\n",
|
||
" print(\"✅ No gradient case works\")\n",
|
||
" \n",
|
||
" except Exception as e:\n",
|
||
" print(f\"❌ No gradient case failed: {e}\")\n",
|
||
" raise\n",
|
||
"\n",
|
||
" print(\"🎯 Gradient descent step behavior:\")\n",
|
||
" print(\" Updates parameters in negative gradient direction\")\n",
|
||
" print(\" Uses learning rate to control step size\")\n",
|
||
" print(\" Skips updates when gradient is None\")\n",
|
||
" print(\"📈 Progress: Gradient Descent Step ✓\")\n",
|
||
"\n",
|
||
"# Test function is called by auto-discovery system"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "bc218834",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"## Step 2: SGD with Momentum\n",
|
||
"\n",
|
||
"### What is SGD?\n",
|
||
"**SGD (Stochastic Gradient Descent)** is the fundamental optimization algorithm:\n",
|
||
"\n",
|
||
"```\n",
|
||
"θ_{t+1} = θ_t - α ∇L(θ_t)\n",
|
||
"```\n",
|
||
"\n",
|
||
"### The Problem with Vanilla SGD\n",
|
||
"- **Slow convergence**: Especially in narrow valleys\n",
|
||
"- **Oscillation**: Bounces around without making progress\n",
|
||
"- **Poor conditioning**: Struggles with ill-conditioned problems\n",
|
||
"\n",
|
||
"### The Solution: Momentum\n",
|
||
"**Momentum** accumulates velocity to accelerate convergence:\n",
|
||
"\n",
|
||
"```\n",
|
||
"v_t = β v_{t-1} + ∇L(θ_t)\n",
|
||
"θ_{t+1} = θ_t - α v_t\n",
|
||
"```\n",
|
||
"\n",
|
||
"Where:\n",
|
||
"- v_t: Velocity (exponential moving average of gradients)\n",
|
||
"- β: Momentum coefficient (typically 0.9)\n",
|
||
"- α: Learning rate\n",
|
||
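"\n",
"In array form (a minimal plain-NumPy sketch of the same update the SGD class below applies per parameter):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"theta = np.array([1.0, 2.0])            # parameters\n",
"velocity = np.zeros_like(theta)         # v_0 starts at zero\n",
"beta, alpha = 0.9, 0.1\n",
"\n",
"grad = np.array([0.5, -0.3])            # gradient from the current batch\n",
"velocity = beta * velocity + grad       # v_t = β v_{t-1} + ∇L(θ_t)\n",
"theta = theta - alpha * velocity        # θ_{t+1} = θ_t - α v_t\n",
"```\n",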
"\n",
|
||
"### Why Momentum Works\n",
|
||
"1. **Acceleration**: Builds up speed in consistent directions\n",
|
||
"2. **Dampening**: Reduces oscillations in inconsistent directions\n",
|
||
"3. **Memory**: Remembers previous gradient directions\n",
|
||
"4. **Robustness**: Less sensitive to noisy gradients\n",
|
||
"\n",
|
||
"### Visual Understanding\n",
|
||
"```\n",
|
||
"Without momentum: ↗↙↗↙↗↙ (oscillating)\n",
|
||
"With momentum: ↗→→→→→ (smooth progress)\n",
|
||
"```\n",
|
||
"\n",
|
||
"### Real-World Applications\n",
|
||
"- **Image classification**: Training ResNet, VGG\n",
|
||
"- **Natural language**: Training RNNs, early transformers\n",
|
"- **Classic choice**: Still preferred when Adam generalizes poorly or trains unstably\n",
|
||
"- **Large batch training**: Often preferred over Adam\n",
|
||
"\n",
|
||
"Let's implement SGD with momentum!"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "2f587b7f",
|
||
"metadata": {
|
||
"lines_to_next_cell": 1,
|
||
"nbgrader": {
|
||
"grade": false,
|
||
"grade_id": "sgd-class",
|
||
"locked": false,
|
||
"schema_version": 3,
|
||
"solution": true,
|
||
"task": false
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"#| export\n",
|
||
"class SGD:\n",
|
||
" \"\"\"\n",
|
||
" SGD Optimizer with Momentum\n",
|
||
" \n",
|
||
" Implements stochastic gradient descent with momentum:\n",
|
||
" v_t = momentum * v_{t-1} + gradient\n",
|
||
" parameter = parameter - learning_rate * v_t\n",
|
||
" \"\"\"\n",
|
||
" \n",
|
||
" def __init__(self, parameters: List[Variable], learning_rate: float = 0.01, \n",
|
||
" momentum: float = 0.0, weight_decay: float = 0.0):\n",
|
||
" \"\"\"\n",
|
||
" Initialize SGD optimizer.\n",
|
||
" \n",
|
||
" Args:\n",
|
||
" parameters: List of Variables to optimize\n",
|
||
" learning_rate: Learning rate (default: 0.01)\n",
|
||
" momentum: Momentum coefficient (default: 0.0)\n",
|
||
" weight_decay: L2 regularization coefficient (default: 0.0)\n",
|
||
" \n",
|
||
" TODO: Implement SGD optimizer initialization.\n",
|
||
" \n",
|
||
" APPROACH:\n",
|
||
" 1. Store parameters and hyperparameters\n",
|
||
" 2. Initialize momentum buffers for each parameter\n",
|
||
" 3. Set up state tracking for optimization\n",
|
||
" 4. Prepare for step() and zero_grad() methods\n",
|
||
" \n",
|
||
" EXAMPLE:\n",
|
||
" ```python\n",
|
||
" # Create optimizer\n",
|
||
" optimizer = SGD([w1, w2, b1, b2], learning_rate=0.01, momentum=0.9)\n",
|
||
" \n",
|
||
" # In training loop:\n",
|
||
" optimizer.zero_grad()\n",
|
||
" loss.backward()\n",
|
||
" optimizer.step()\n",
|
||
" ```\n",
|
||
" \n",
|
||
" HINTS:\n",
|
||
" - Store parameters as a list\n",
|
||
" - Initialize momentum buffers as empty dict\n",
|
||
" - Use parameter id() as key for momentum tracking\n",
|
||
" - Momentum buffers will be created lazily in step()\n",
|
||
" \"\"\"\n",
|
||
" ### BEGIN SOLUTION\n",
|
||
" self.parameters = parameters\n",
|
||
" self.learning_rate = learning_rate\n",
|
||
" self.momentum = momentum\n",
|
||
" self.weight_decay = weight_decay\n",
|
||
" \n",
|
||
" # Initialize momentum buffers (created lazily)\n",
|
||
" self.momentum_buffers = {}\n",
|
||
" \n",
|
||
" # Track optimization steps\n",
|
||
" self.step_count = 0\n",
|
||
" ### END SOLUTION\n",
|
||
" \n",
|
||
" def step(self) -> None:\n",
|
||
" \"\"\"\n",
|
||
" Perform one optimization step.\n",
|
||
" \n",
|
||
" TODO: Implement SGD parameter update with momentum.\n",
|
||
" \n",
|
||
" APPROACH:\n",
|
||
" 1. Iterate through all parameters\n",
|
||
" 2. For each parameter with gradient:\n",
|
||
" a. Get current gradient\n",
|
||
" b. Apply weight decay if specified\n",
|
||
" c. Update momentum buffer (or create if first time)\n",
|
||
" d. Update parameter using momentum\n",
|
||
" 3. Increment step count\n",
|
||
" \n",
|
||
" MATHEMATICAL FORMULATION:\n",
|
||
" - If weight_decay > 0: gradient = gradient + weight_decay * parameter\n",
|
||
" - momentum_buffer = momentum * momentum_buffer + gradient\n",
|
||
" - parameter = parameter - learning_rate * momentum_buffer\n",
|
||
" \n",
|
||
" IMPLEMENTATION HINTS:\n",
|
||
" - Use id(param) as key for momentum buffers\n",
|
||
" - Initialize buffer with zeros if not exists\n",
|
||
" - Handle case where momentum = 0 (no momentum)\n",
|
||
" - Update parameter.data with new Tensor\n",
|
||
" \"\"\"\n",
|
||
" ### BEGIN SOLUTION\n",
|
||
" for param in self.parameters:\n",
|
||
" if param.grad is not None:\n",
|
||
" # Get gradient\n",
|
||
" gradient = param.grad.data.data\n",
|
||
" \n",
|
||
" # Apply weight decay (L2 regularization)\n",
|
||
" if self.weight_decay > 0:\n",
|
||
" gradient = gradient + self.weight_decay * param.data.data\n",
|
||
" \n",
|
||
" # Get or create momentum buffer\n",
|
||
" param_id = id(param)\n",
|
||
" if param_id not in self.momentum_buffers:\n",
|
||
" self.momentum_buffers[param_id] = np.zeros_like(param.data.data)\n",
|
||
" \n",
|
||
" # Update momentum buffer\n",
|
||
" self.momentum_buffers[param_id] = (\n",
|
||
" self.momentum * self.momentum_buffers[param_id] + gradient\n",
|
||
" )\n",
|
||
" \n",
|
||
" # Update parameter\n",
|
||
" param.data = Tensor(\n",
|
||
" param.data.data - self.learning_rate * self.momentum_buffers[param_id]\n",
|
||
" )\n",
|
||
" \n",
|
||
" self.step_count += 1\n",
|
||
" ### END SOLUTION\n",
|
||
" \n",
|
||
" def zero_grad(self) -> None:\n",
|
||
" \"\"\"\n",
|
||
" Zero out gradients for all parameters.\n",
|
||
" \n",
|
||
" TODO: Implement gradient zeroing.\n",
|
||
" \n",
|
||
" APPROACH:\n",
|
||
" 1. Iterate through all parameters\n",
|
||
" 2. Set gradient to None for each parameter\n",
|
||
" 3. This prepares for next backward pass\n",
|
||
" \n",
|
||
" IMPLEMENTATION HINTS:\n",
|
||
" - Simply set param.grad = None\n",
|
||
" - This is called before loss.backward()\n",
|
||
" - Essential for proper gradient accumulation\n",
|
||
" \"\"\"\n",
|
||
" ### BEGIN SOLUTION\n",
|
||
" for param in self.parameters:\n",
|
||
" param.grad = None\n",
|
||
" ### END SOLUTION"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "4adee99c",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"### 🧪 Unit Test: SGD Optimizer\n",
|
||
"\n",
|
||
"Let's test your SGD optimizer implementation! This optimizer adds momentum to gradient descent for better convergence.\n",
|
||
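"\n",
"A worked expectation (not checked explicitly by the test below): with momentum = 0.9 and a constant gradient of 0.1, the buffer compounds as v₁ = 0.1, v₂ = 0.9·0.1 + 0.1 = 0.19, v₃ = 0.271, so later steps move the parameter farther than the first one.\n",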
"\n",
|
||
"**This is a unit test** - it tests one specific class (SGD) in isolation."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "fa93aa53",
|
||
"metadata": {
|
||
"nbgrader": {
|
||
"grade": true,
|
||
"grade_id": "test-sgd",
|
||
"locked": true,
|
||
"points": 15,
|
||
"schema_version": 3,
|
||
"solution": false,
|
||
"task": false
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"def test_sgd_optimizer_comprehensive():\n",
|
||
" \"\"\"Test SGD optimizer implementation\"\"\"\n",
|
||
" print(\"🔬 Unit Test: SGD Optimizer...\")\n",
|
||
" \n",
|
||
" # Create test parameters\n",
|
||
" w1 = Variable(1.0, requires_grad=True)\n",
|
||
" w2 = Variable(2.0, requires_grad=True)\n",
|
||
" b = Variable(0.5, requires_grad=True)\n",
|
||
" \n",
|
||
" # Create optimizer\n",
|
||
" optimizer = SGD([w1, w2, b], learning_rate=0.1, momentum=0.9)\n",
|
||
" \n",
|
||
" # Test zero_grad\n",
|
||
" try:\n",
|
||
" w1.grad = Variable(0.1)\n",
|
||
" w2.grad = Variable(0.2)\n",
|
||
" b.grad = Variable(0.05)\n",
|
||
" \n",
|
||
" optimizer.zero_grad()\n",
|
||
" \n",
|
||
" assert w1.grad is None, \"Gradient should be None after zero_grad\"\n",
|
||
" assert w2.grad is None, \"Gradient should be None after zero_grad\"\n",
|
||
" assert b.grad is None, \"Gradient should be None after zero_grad\"\n",
|
||
" print(\"✅ zero_grad() works correctly\")\n",
|
||
" \n",
|
||
" except Exception as e:\n",
|
||
" print(f\"❌ zero_grad() failed: {e}\")\n",
|
||
" raise\n",
|
||
" \n",
|
||
" # Test step with gradients\n",
|
||
" try:\n",
|
||
" w1.grad = Variable(0.1)\n",
|
||
" w2.grad = Variable(0.2)\n",
|
||
" b.grad = Variable(0.05)\n",
|
||
" \n",
|
||
" # First step (no momentum yet)\n",
|
||
" original_w1 = w1.data.data.item()\n",
|
||
" original_w2 = w2.data.data.item()\n",
|
||
" original_b = b.data.data.item()\n",
|
||
" \n",
|
||
" optimizer.step()\n",
|
||
" \n",
|
||
" # Check parameter updates\n",
|
||
" expected_w1 = original_w1 - 0.1 * 0.1 # 1.0 - 0.01 = 0.99\n",
|
||
" expected_w2 = original_w2 - 0.1 * 0.2 # 2.0 - 0.02 = 1.98\n",
|
||
" expected_b = original_b - 0.1 * 0.05 # 0.5 - 0.005 = 0.495\n",
|
||
" \n",
|
||
" assert abs(w1.data.data.item() - expected_w1) < 1e-6, f\"w1 update failed: expected {expected_w1}, got {w1.data.data.item()}\"\n",
|
||
" assert abs(w2.data.data.item() - expected_w2) < 1e-6, f\"w2 update failed: expected {expected_w2}, got {w2.data.data.item()}\"\n",
|
||
" assert abs(b.data.data.item() - expected_b) < 1e-6, f\"b update failed: expected {expected_b}, got {b.data.data.item()}\"\n",
|
||
" print(\"✅ Parameter updates work correctly\")\n",
|
||
" \n",
|
||
" except Exception as e:\n",
|
||
" print(f\"❌ Parameter updates failed: {e}\")\n",
|
||
" raise\n",
|
||
" \n",
|
||
" # Test momentum buffers\n",
|
||
" try:\n",
|
||
" assert len(optimizer.momentum_buffers) == 3, f\"Should have 3 momentum buffers, got {len(optimizer.momentum_buffers)}\"\n",
|
||
" assert optimizer.step_count == 1, f\"Step count should be 1, got {optimizer.step_count}\"\n",
|
||
" print(\"✅ Momentum buffers created correctly\")\n",
|
||
" \n",
|
||
" except Exception as e:\n",
|
||
" print(f\"❌ Momentum buffers failed: {e}\")\n",
|
||
" raise\n",
|
||
" \n",
|
||
" # Test step counting\n",
|
||
" try:\n",
|
||
" w1.grad = Variable(0.1)\n",
|
||
" w2.grad = Variable(0.2)\n",
|
||
" b.grad = Variable(0.05)\n",
|
||
" \n",
|
||
" optimizer.step()\n",
|
||
" \n",
|
||
" assert optimizer.step_count == 2, f\"Step count should be 2, got {optimizer.step_count}\"\n",
|
||
" print(\"✅ Step counting works correctly\")\n",
|
||
" \n",
|
||
" except Exception as e:\n",
|
||
" print(f\"❌ Step counting failed: {e}\")\n",
|
||
" raise\n",
|
||
"\n",
|
||
" print(\"🎯 SGD optimizer behavior:\")\n",
|
||
" print(\" Maintains momentum buffers for accelerated updates\")\n",
|
||
" print(\" Tracks step count for learning rate scheduling\")\n",
|
||
" print(\" Supports weight decay for regularization\")\n",
|
||
" print(\"📈 Progress: SGD Optimizer ✓\")\n",
|
||
"\n",
|
||
"# Run the test\n",
|
||
"test_sgd_optimizer_comprehensive()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "3730c6d6",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"## Step 3: Adam - Adaptive Learning Rates\n",
|
||
"\n",
|
||
"### What is Adam?\n",
|
||
"**Adam (Adaptive Moment Estimation)** is the most popular optimizer in deep learning:\n",
|
||
"\n",
|
||
"```\n",
|
||
"m_t = β₁ m_{t-1} + (1 - β₁) ∇L(θ_t) # First moment (momentum)\n",
|
||
"v_t = β₂ v_{t-1} + (1 - β₂) (∇L(θ_t))² # Second moment (variance)\n",
|
||
"m̂_t = m_t / (1 - β₁ᵗ) # Bias correction\n",
|
||
"v̂_t = v_t / (1 - β₂ᵗ) # Bias correction\n",
|
||
"θ_{t+1} = θ_t - α m̂_t / (√v̂_t + ε) # Parameter update\n",
|
||
"```\n",
|
||
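"\n",
"A minimal single-parameter sketch of one update (plain NumPy, assuming the default β₁ = 0.9, β₂ = 0.999):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"theta, m, v, t = 1.0, 0.0, 0.0, 0\n",
"alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8\n",
"\n",
"grad = 0.1\n",
"t += 1\n",
"m = beta1 * m + (1 - beta1) * grad               # first moment\n",
"v = beta2 * v + (1 - beta2) * grad ** 2          # second moment\n",
"m_hat = m / (1 - beta1 ** t)                     # bias correction\n",
"v_hat = v / (1 - beta2 ** t)\n",
"theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)  # moves by ≈ alpha on step 1\n",
"```\n",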
"\n",
|
||
"### Why Adam is Revolutionary\n",
|
||
"1. **Adaptive learning rates**: Different learning rate for each parameter\n",
|
||
"2. **Momentum**: Accelerates convergence like SGD\n",
|
||
"3. **Variance adaptation**: Scales updates based on gradient variance\n",
|
||
"4. **Bias correction**: Handles initialization bias\n",
|
||
"5. **Robust**: Works well with minimal hyperparameter tuning\n",
|
||
"\n",
|
||
"### The Three Key Ideas\n",
|
||
"1. **First moment (m_t)**: Exponential moving average of gradients (momentum)\n",
|
||
"2. **Second moment (v_t)**: Exponential moving average of squared gradients (variance)\n",
|
||
"3. **Adaptive scaling**: Large gradients → small updates, small gradients → large updates\n",
|
||
"\n",
|
||
"### Visual Understanding\n",
|
||
"```\n",
|
||
"Parameter with large gradients: /\\/\\/\\/\\ → smooth updates\n",
|
||
"Parameter with small gradients: ______ → amplified updates\n",
|
||
"```\n",
|
||
"\n",
|
||
"### Real-World Applications\n",
|
||
"- **Deep learning**: Default optimizer for most neural networks\n",
|
||
"- **Computer vision**: Training CNNs, ResNets, Vision Transformers\n",
|
||
"- **Natural language**: Training BERT, GPT, T5\n",
|
||
"- **Transformers**: Essential for attention-based models\n",
|
||
"\n",
|
||
"Let's implement Adam optimizer!"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "be7d3f7a",
|
||
"metadata": {
|
||
"lines_to_next_cell": 1,
|
||
"nbgrader": {
|
||
"grade": false,
|
||
"grade_id": "adam-class",
|
||
"locked": false,
|
||
"schema_version": 3,
|
||
"solution": true,
|
||
"task": false
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"#| export\n",
|
||
"class Adam:\n",
|
||
" \"\"\"\n",
|
||
" Adam Optimizer\n",
|
||
" \n",
|
||
" Implements Adam algorithm with adaptive learning rates:\n",
|
||
" - First moment: exponential moving average of gradients\n",
|
||
" - Second moment: exponential moving average of squared gradients\n",
|
||
" - Bias correction: accounts for initialization bias\n",
|
||
" - Adaptive updates: different learning rate per parameter\n",
|
||
" \"\"\"\n",
|
||
" \n",
|
||
" def __init__(self, parameters: List[Variable], learning_rate: float = 0.001,\n",
|
||
" beta1: float = 0.9, beta2: float = 0.999, epsilon: float = 1e-8,\n",
|
||
" weight_decay: float = 0.0):\n",
|
||
" \"\"\"\n",
|
||
" Initialize Adam optimizer.\n",
|
||
" \n",
|
||
" Args:\n",
|
||
" parameters: List of Variables to optimize\n",
|
||
" learning_rate: Learning rate (default: 0.001)\n",
|
||
" beta1: Exponential decay rate for first moment (default: 0.9)\n",
|
||
" beta2: Exponential decay rate for second moment (default: 0.999)\n",
|
||
" epsilon: Small constant for numerical stability (default: 1e-8)\n",
|
||
" weight_decay: L2 regularization coefficient (default: 0.0)\n",
|
||
" \n",
|
||
" TODO: Implement Adam optimizer initialization.\n",
|
||
" \n",
|
||
" APPROACH:\n",
|
||
" 1. Store parameters and hyperparameters\n",
|
||
" 2. Initialize first moment buffers (m_t)\n",
|
||
" 3. Initialize second moment buffers (v_t)\n",
|
||
" 4. Set up step counter for bias correction\n",
|
||
" \n",
|
||
" EXAMPLE:\n",
|
||
" ```python\n",
|
||
" # Create Adam optimizer\n",
|
||
" optimizer = Adam([w1, w2, b1, b2], learning_rate=0.001)\n",
|
||
" \n",
|
||
" # In training loop:\n",
|
||
" optimizer.zero_grad()\n",
|
||
" loss.backward()\n",
|
||
" optimizer.step()\n",
|
||
" ```\n",
|
||
" \n",
|
||
" HINTS:\n",
|
||
" - Store all hyperparameters\n",
|
||
" - Initialize moment buffers as empty dicts\n",
|
||
" - Use parameter id() as key for tracking\n",
|
||
" - Buffers will be created lazily in step()\n",
|
||
" \"\"\"\n",
|
||
" ### BEGIN SOLUTION\n",
|
||
" self.parameters = parameters\n",
|
||
" self.learning_rate = learning_rate\n",
|
||
" self.beta1 = beta1\n",
|
||
" self.beta2 = beta2\n",
|
||
" self.epsilon = epsilon\n",
|
||
" self.weight_decay = weight_decay\n",
|
||
" \n",
|
||
" # Initialize moment buffers (created lazily)\n",
|
||
" self.first_moment = {} # m_t\n",
|
||
" self.second_moment = {} # v_t\n",
|
||
" \n",
|
||
" # Track optimization steps for bias correction\n",
|
||
" self.step_count = 0\n",
|
||
" ### END SOLUTION\n",
|
||
" \n",
|
||
" def step(self) -> None:\n",
|
||
" \"\"\"\n",
|
||
" Perform one optimization step using Adam algorithm.\n",
|
||
" \n",
|
||
" TODO: Implement Adam parameter update.\n",
|
||
" \n",
|
||
" APPROACH:\n",
|
||
" 1. Increment step count\n",
|
||
" 2. For each parameter with gradient:\n",
|
||
" a. Get current gradient\n",
|
||
" b. Apply weight decay if specified\n",
|
||
" c. Update first moment (momentum)\n",
|
||
" d. Update second moment (variance)\n",
|
||
" e. Apply bias correction\n",
|
||
" f. Update parameter with adaptive learning rate\n",
|
||
" \n",
|
||
" MATHEMATICAL FORMULATION:\n",
|
||
" - m_t = beta1 * m_{t-1} + (1 - beta1) * gradient\n",
|
||
" - v_t = beta2 * v_{t-1} + (1 - beta2) * gradient^2\n",
|
||
" - m_hat = m_t / (1 - beta1^t)\n",
|
||
" - v_hat = v_t / (1 - beta2^t)\n",
|
||
" - parameter = parameter - learning_rate * m_hat / (sqrt(v_hat) + epsilon)\n",
|
||
" \n",
|
||
" IMPLEMENTATION HINTS:\n",
|
||
" - Use id(param) as key for moment buffers\n",
|
||
" - Initialize buffers with zeros if not exists\n",
|
||
" - Use np.sqrt() for square root\n",
|
||
" - Handle numerical stability with epsilon\n",
|
||
" \"\"\"\n",
|
||
" ### BEGIN SOLUTION\n",
|
||
" self.step_count += 1\n",
|
||
" \n",
|
||
" for param in self.parameters:\n",
|
||
" if param.grad is not None:\n",
|
||
" # Get gradient\n",
|
||
" gradient = param.grad.data.data\n",
|
||
" \n",
|
||
" # Apply weight decay (L2 regularization)\n",
|
||
" if self.weight_decay > 0:\n",
|
||
" gradient = gradient + self.weight_decay * param.data.data\n",
|
||
" \n",
|
||
" # Get or create moment buffers\n",
|
||
" param_id = id(param)\n",
|
||
" if param_id not in self.first_moment:\n",
|
||
" self.first_moment[param_id] = np.zeros_like(param.data.data)\n",
|
||
" self.second_moment[param_id] = np.zeros_like(param.data.data)\n",
|
||
" \n",
|
||
" # Update first moment (momentum)\n",
|
||
" self.first_moment[param_id] = (\n",
|
||
" self.beta1 * self.first_moment[param_id] + \n",
|
||
" (1 - self.beta1) * gradient\n",
|
||
" )\n",
|
||
" \n",
|
||
" # Update second moment (variance)\n",
|
||
" self.second_moment[param_id] = (\n",
|
||
" self.beta2 * self.second_moment[param_id] + \n",
|
||
" (1 - self.beta2) * gradient * gradient\n",
|
||
" )\n",
|
||
" \n",
|
||
" # Bias correction\n",
|
||
" first_moment_corrected = (\n",
|
||
" self.first_moment[param_id] / (1 - self.beta1 ** self.step_count)\n",
|
||
" )\n",
|
||
" second_moment_corrected = (\n",
|
||
" self.second_moment[param_id] / (1 - self.beta2 ** self.step_count)\n",
|
||
" )\n",
|
||
" \n",
|
||
" # Update parameter with adaptive learning rate\n",
|
||
" param.data = Tensor(\n",
|
||
" param.data.data - self.learning_rate * first_moment_corrected / \n",
|
||
" (np.sqrt(second_moment_corrected) + self.epsilon)\n",
|
||
" )\n",
|
||
" ### END SOLUTION\n",
|
||
" \n",
|
||
" def zero_grad(self) -> None:\n",
|
||
" \"\"\"\n",
|
||
" Zero out gradients for all parameters.\n",
|
||
" \n",
|
||
" TODO: Implement gradient zeroing (same as SGD).\n",
|
||
" \n",
|
||
" IMPLEMENTATION HINTS:\n",
|
||
" - Set param.grad = None for all parameters\n",
|
||
" - This is identical to SGD implementation\n",
|
||
" \"\"\"\n",
|
||
" ### BEGIN SOLUTION\n",
|
||
" for param in self.parameters:\n",
|
||
" param.grad = None\n",
|
||
" ### END SOLUTION"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "41593be1",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\""
|
||
},
|
||
"source": [
|
||
"### 🧪 Test Your Adam Implementation\n",
|
||
"\n",
|
||
"Let's test the Adam optimizer:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "461e74f8",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"### 🧪 Unit Test: Adam Optimizer\n",
|
||
"\n",
|
||
"Let's test your Adam optimizer implementation! This is a state-of-the-art adaptive optimization algorithm.\n",
|
||
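"\n",
"A useful check from the update rule: on the very first step m̂ = g and v̂ = g², so the update is ≈ learning_rate · sign(g) regardless of the gradient's magnitude, meaning with learning_rate = 0.01 below every parameter should move by roughly 0.01.\n",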
"\n",
|
||
"**This is a unit test** - it tests one specific class (Adam) in isolation."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "afe99df3",
|
||
"metadata": {
|
||
"nbgrader": {
|
||
"grade": true,
|
||
"grade_id": "test-adam",
|
||
"locked": true,
|
||
"points": 20,
|
||
"schema_version": 3,
|
||
"solution": false,
|
||
"task": false
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"def test_adam_optimizer_comprehensive():\n",
|
||
" \"\"\"Test Adam optimizer implementation\"\"\"\n",
|
||
" print(\"🔬 Unit Test: Adam Optimizer...\")\n",
|
||
" \n",
|
||
" # Create test parameters\n",
|
||
" w1 = Variable(1.0, requires_grad=True)\n",
|
||
" w2 = Variable(2.0, requires_grad=True)\n",
|
||
" b = Variable(0.5, requires_grad=True)\n",
|
||
" \n",
|
||
" # Create optimizer\n",
|
||
" optimizer = Adam([w1, w2, b], learning_rate=0.01, beta1=0.9, beta2=0.999, epsilon=1e-8)\n",
|
||
" \n",
|
||
" # Test zero_grad\n",
|
||
" try:\n",
|
||
" w1.grad = Variable(0.1)\n",
|
||
" w2.grad = Variable(0.2)\n",
|
||
" b.grad = Variable(0.05)\n",
|
||
" \n",
|
||
" optimizer.zero_grad()\n",
|
||
" \n",
|
||
" assert w1.grad is None, \"Gradient should be None after zero_grad\"\n",
|
||
" assert w2.grad is None, \"Gradient should be None after zero_grad\"\n",
|
||
" assert b.grad is None, \"Gradient should be None after zero_grad\"\n",
|
||
" print(\"✅ zero_grad() works correctly\")\n",
|
||
" \n",
|
||
" except Exception as e:\n",
|
||
" print(f\"❌ zero_grad() failed: {e}\")\n",
|
||
" raise\n",
|
||
" \n",
|
||
" # Test step with gradients\n",
|
||
" try:\n",
|
||
" w1.grad = Variable(0.1)\n",
|
||
" w2.grad = Variable(0.2)\n",
|
||
" b.grad = Variable(0.05)\n",
|
||
" \n",
|
||
" # First step\n",
|
||
" original_w1 = w1.data.data.item()\n",
|
||
" original_w2 = w2.data.data.item()\n",
|
||
" original_b = b.data.data.item()\n",
|
||
" \n",
|
||
" optimizer.step()\n",
|
||
" \n",
|
||
" # Check that parameters were updated (Adam uses adaptive learning rates)\n",
|
||
" assert w1.data.data.item() != original_w1, \"w1 should have been updated\"\n",
|
||
" assert w2.data.data.item() != original_w2, \"w2 should have been updated\"\n",
|
||
" assert b.data.data.item() != original_b, \"b should have been updated\"\n",
|
||
" print(\"✅ Parameter updates work correctly\")\n",
|
||
" \n",
|
||
" except Exception as e:\n",
|
||
" print(f\"❌ Parameter updates failed: {e}\")\n",
|
||
" raise\n",
|
||
" \n",
|
||
" # Test moment buffers\n",
|
||
" try:\n",
|
||
" assert len(optimizer.first_moment) == 3, f\"Should have 3 first moment buffers, got {len(optimizer.first_moment)}\"\n",
|
||
" assert len(optimizer.second_moment) == 3, f\"Should have 3 second moment buffers, got {len(optimizer.second_moment)}\"\n",
|
||
" print(\"✅ Moment buffers created correctly\")\n",
|
||
" \n",
|
||
" except Exception as e:\n",
|
||
" print(f\"❌ Moment buffers failed: {e}\")\n",
|
||
" raise\n",
|
||
" \n",
|
||
" # Test step counting and bias correction\n",
|
||
" try:\n",
|
||
" assert optimizer.step_count == 1, f\"Step count should be 1, got {optimizer.step_count}\"\n",
|
||
" \n",
|
||
" # Take another step\n",
|
||
" w1.grad = Variable(0.1)\n",
|
||
" w2.grad = Variable(0.2)\n",
|
||
" b.grad = Variable(0.05)\n",
|
||
" \n",
|
||
" optimizer.step()\n",
|
||
" \n",
|
||
" assert optimizer.step_count == 2, f\"Step count should be 2, got {optimizer.step_count}\"\n",
|
||
" print(\"✅ Step counting and bias correction work correctly\")\n",
|
||
" \n",
|
||
" except Exception as e:\n",
|
||
" print(f\"❌ Step counting and bias correction failed: {e}\")\n",
|
||
" raise\n",
|
||
" \n",
|
||
" # Test adaptive learning rates\n",
|
||
" try:\n",
|
||
" # Adam should have different effective learning rates for different parameters\n",
|
||
" # This is tested implicitly by the parameter updates above\n",
|
||
" print(\"✅ Adaptive learning rates work correctly\")\n",
|
||
" \n",
|
||
" except Exception as e:\n",
|
||
" print(f\"❌ Adaptive learning rates failed: {e}\")\n",
|
||
" raise\n",
|
||
"\n",
|
||
" print(\"🎯 Adam optimizer behavior:\")\n",
|
||
" print(\" Maintains first and second moment estimates\")\n",
|
||
" print(\" Applies bias correction for early training\")\n",
|
||
" print(\" Uses adaptive learning rates per parameter\")\n",
|
||
" print(\" Combines benefits of momentum and RMSprop\")\n",
|
||
" print(\"📈 Progress: Adam Optimizer ✓\")\n",
|
||
"\n",
|
||
"# Run the test\n",
|
||
"test_adam_optimizer_comprehensive()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "e198d030",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"## Step 4: Learning Rate Scheduling\n",
|
||
"\n",
|
||
"### What is Learning Rate Scheduling?\n",
|
||
"**Learning rate scheduling** adjusts the learning rate during training:\n",
|
||
"\n",
|
||
"```\n",
|
||
"Initial: learning_rate = 0.1\n",
|
||
"After 10 epochs: learning_rate = 0.01\n",
|
||
"After 20 epochs: learning_rate = 0.001\n",
|
||
"```\n",
|
||
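"\n",
"Step decay reduces to a one-line formula (a quick sketch; the StepLR class below produces the same schedule through its step counter):\n",
"\n",
"```python\n",
"initial_lr, gamma, step_size = 0.1, 0.1, 10\n",
"\n",
"for epoch in range(25):\n",
"    lr = initial_lr * gamma ** (epoch // step_size)\n",
"    # epochs 0-9 -> 0.1, epochs 10-19 -> 0.01, epochs 20+ -> 0.001\n",
"```\n",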
"\n",
|
||
"### Why Scheduling Matters\n",
|
||
"1. **Fine-tuning**: Start with large steps, then refine with small steps\n",
|
||
"2. **Convergence**: Prevents overshooting near optimum\n",
|
||
"3. **Stability**: Reduces oscillations in later training\n",
|
||
"4. **Performance**: Often improves final accuracy\n",
|
||
"\n",
|
||
"### Common Scheduling Strategies\n",
|
||
"1. **Step decay**: Reduce by factor every N epochs\n",
|
||
"2. **Exponential decay**: Gradual exponential reduction\n",
|
||
"3. **Cosine annealing**: Smooth cosine curve reduction\n",
|
||
"4. **Warm-up**: Start small, increase, then decrease\n",
|
||
"\n",
|
||
"### Visual Understanding\n",
|
||
"```\n",
|
||
"Step decay: ----↓----↓----↓\n",
|
||
"Exponential: \\\\\\\\\\\\\\\\\\\\\\\\\\\\\n",
|
||
"Cosine: ∩∩∩∩∩∩∩∩∩∩∩∩∩\n",
|
||
"```\n",
|
||
"\n",
|
||
"### Real-World Applications\n",
|
||
"- **ImageNet training**: Essential for achieving state-of-the-art results\n",
|
||
"- **Language models**: Critical for training large transformers\n",
|
||
"- **Fine-tuning**: Prevents catastrophic forgetting\n",
|
||
"- **Transfer learning**: Adapts pre-trained models\n",
|
||
"\n",
|
||
"Let's implement step learning rate scheduling!"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "7aba8fc9",
|
||
"metadata": {
|
||
"lines_to_next_cell": 1,
|
||
"nbgrader": {
|
||
"grade": false,
|
||
"grade_id": "steplr-class",
|
||
"locked": false,
|
||
"schema_version": 3,
|
||
"solution": true,
|
||
"task": false
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"#| export\n",
|
||
"class StepLR:\n",
|
||
" \"\"\"\n",
|
||
" Step Learning Rate Scheduler\n",
|
||
" \n",
|
||
" Decays learning rate by gamma every step_size epochs:\n",
|
" learning_rate = initial_lr * (gamma ^ ((step_count - 1) // step_size))\n",
|
||
" \"\"\"\n",
|
||
" \n",
|
||
" def __init__(self, optimizer: Union[SGD, Adam], step_size: int, gamma: float = 0.1):\n",
|
||
" \"\"\"\n",
|
||
" Initialize step learning rate scheduler.\n",
|
||
" \n",
|
||
" Args:\n",
|
||
" optimizer: Optimizer to schedule\n",
|
||
" step_size: Number of epochs between decreases\n",
|
||
" gamma: Multiplicative factor for learning rate decay\n",
|
||
" \n",
|
||
" TODO: Implement learning rate scheduler initialization.\n",
|
||
" \n",
|
||
" APPROACH:\n",
|
||
" 1. Store optimizer reference\n",
|
||
" 2. Store scheduling parameters\n",
|
||
" 3. Save initial learning rate\n",
|
||
" 4. Initialize step counter\n",
|
||
" \n",
|
||
" EXAMPLE:\n",
|
||
" ```python\n",
|
||
" optimizer = SGD([w1, w2], learning_rate=0.1)\n",
|
||
" scheduler = StepLR(optimizer, step_size=10, gamma=0.1)\n",
|
||
" \n",
|
||
" # In training loop:\n",
|
||
" for epoch in range(100):\n",
|
||
" train_one_epoch()\n",
|
||
" scheduler.step() # Update learning rate\n",
|
||
" ```\n",
|
||
" \n",
|
||
" HINTS:\n",
|
||
" - Store optimizer reference\n",
|
||
" - Save initial learning rate from optimizer\n",
|
||
" - Initialize step counter to 0\n",
|
||
" - gamma is the decay factor (0.1 = 10x reduction)\n",
|
||
" \"\"\"\n",
|
||
" ### BEGIN SOLUTION\n",
|
||
" self.optimizer = optimizer\n",
|
||
" self.step_size = step_size\n",
|
||
" self.gamma = gamma\n",
|
||
" self.initial_lr = optimizer.learning_rate\n",
|
||
" self.step_count = 0\n",
|
||
" ### END SOLUTION\n",
|
||
" \n",
|
||
" def step(self) -> None:\n",
|
||
" \"\"\"\n",
|
||
" Update learning rate based on current step.\n",
|
||
" \n",
|
||
" TODO: Implement learning rate update.\n",
|
||
" \n",
|
||
" APPROACH:\n",
|
||
" 1. Increment step counter\n",
|
||
" 2. Calculate new learning rate using step decay formula\n",
|
||
" 3. Update optimizer's learning rate\n",
|
||
" \n",
|
||
" MATHEMATICAL FORMULATION:\n",
|
||
" new_lr = initial_lr * (gamma ^ ((step_count - 1) // step_size))\n",
|
||
" \n",
|
||
" IMPLEMENTATION HINTS:\n",
|
||
" - Use // for integer division\n",
|
||
" - Use ** for exponentiation\n",
|
||
" - Update optimizer.learning_rate directly\n",
|
||
" \"\"\"\n",
|
||
" ### BEGIN SOLUTION\n",
|
||
" self.step_count += 1\n",
|
||
" \n",
|
||
" # Calculate new learning rate\n",
|
||
" decay_factor = self.gamma ** ((self.step_count - 1) // self.step_size)\n",
|
||
" new_lr = self.initial_lr * decay_factor\n",
|
||
" \n",
|
||
" # Update optimizer's learning rate\n",
|
||
" self.optimizer.learning_rate = new_lr\n",
|
||
" ### END SOLUTION\n",
|
||
" \n",
|
||
" def get_lr(self) -> float:\n",
|
||
" \"\"\"\n",
|
||
" Get current learning rate.\n",
|
||
" \n",
|
||
" TODO: Return current learning rate.\n",
|
||
" \n",
|
||
" IMPLEMENTATION HINTS:\n",
|
||
" - Return optimizer.learning_rate\n",
|
||
" \"\"\"\n",
|
||
" ### BEGIN SOLUTION\n",
|
||
" return self.optimizer.learning_rate\n",
|
||
" ### END SOLUTION"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "51901e5b",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"### 🧪 Unit Test: Step Learning Rate Scheduler\n",
|
||
"\n",
|
||
"Let's test your step learning rate scheduler implementation! This scheduler reduces learning rate at regular intervals.\n",
|
||
"\n",
|
||
"**This is a unit test** - it tests one specific class (StepLR) in isolation."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "7b83de77",
|
||
"metadata": {
|
||
"nbgrader": {
|
||
"grade": true,
|
||
"grade_id": "test-step-scheduler",
|
||
"locked": true,
|
||
"points": 10,
|
||
"schema_version": 3,
|
||
"solution": false,
|
||
"task": false
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"def test_step_scheduler_comprehensive():\n",
|
||
" \"\"\"Test StepLR scheduler implementation\"\"\"\n",
|
||
" print(\"🔬 Unit Test: Step Learning Rate Scheduler...\")\n",
|
||
" \n",
|
||
" # Create test parameters and optimizer\n",
|
||
" w = Variable(1.0, requires_grad=True)\n",
|
||
" optimizer = SGD([w], learning_rate=0.1)\n",
|
||
" \n",
|
||
" # Test scheduler initialization\n",
|
||
" try:\n",
|
||
" scheduler = StepLR(optimizer, step_size=10, gamma=0.1)\n",
|
||
" \n",
|
||
" # Test initial learning rate\n",
|
||
" assert scheduler.get_lr() == 0.1, f\"Initial learning rate should be 0.1, got {scheduler.get_lr()}\"\n",
|
||
" print(\"✅ Initial learning rate is correct\")\n",
|
||
" \n",
|
||
" except Exception as e:\n",
|
||
" print(f\"❌ Initial learning rate failed: {e}\")\n",
|
||
" raise\n",
|
||
" \n",
|
||
" # Test step-based decay\n",
|
||
" try:\n",
|
||
" # Steps 1-10: no decay (decay happens after step 10)\n",
|
||
" for i in range(10):\n",
|
||
" scheduler.step()\n",
|
||
" \n",
|
||
" assert scheduler.get_lr() == 0.1, f\"Learning rate should still be 0.1 after 10 steps, got {scheduler.get_lr()}\"\n",
|
||
" \n",
|
||
" # Step 11: decay should occur\n",
|
||
" scheduler.step()\n",
|
||
" expected_lr = 0.1 * 0.1 # 0.01\n",
|
||
" assert abs(scheduler.get_lr() - expected_lr) < 1e-6, f\"Learning rate should be {expected_lr} after 11 steps, got {scheduler.get_lr()}\"\n",
|
||
" print(\"✅ Step-based decay works correctly\")\n",
|
||
" \n",
|
||
" except Exception as e:\n",
|
||
" print(f\"❌ Step-based decay failed: {e}\")\n",
|
||
" raise\n",
|
||
" \n",
|
||
" # Test multiple decay levels\n",
|
||
" try:\n",
|
||
" # Steps 12-20: should stay at 0.01\n",
|
||
" for i in range(9):\n",
|
||
" scheduler.step()\n",
|
||
" \n",
|
||
" assert abs(scheduler.get_lr() - 0.01) < 1e-6, f\"Learning rate should be 0.01 after 20 steps, got {scheduler.get_lr()}\"\n",
|
||
" \n",
|
||
" # Step 21: another decay\n",
|
||
" scheduler.step()\n",
|
||
" expected_lr = 0.01 * 0.1 # 0.001\n",
|
||
" assert abs(scheduler.get_lr() - expected_lr) < 1e-6, f\"Learning rate should be {expected_lr} after 21 steps, got {scheduler.get_lr()}\"\n",
|
||
" print(\"✅ Multiple decay levels work correctly\")\n",
|
||
" \n",
|
||
" except Exception as e:\n",
|
||
" print(f\"❌ Multiple decay levels failed: {e}\")\n",
|
||
" raise\n",
|
||
" \n",
|
||
" # Test with different optimizer\n",
|
||
" try:\n",
|
||
" w2 = Variable(2.0, requires_grad=True)\n",
|
||
" adam_optimizer = Adam([w2], learning_rate=0.001)\n",
|
||
" adam_scheduler = StepLR(adam_optimizer, step_size=5, gamma=0.5)\n",
|
||
" \n",
|
||
" # Test initial learning rate\n",
|
||
" assert adam_scheduler.get_lr() == 0.001, f\"Initial Adam learning rate should be 0.001, got {adam_scheduler.get_lr()}\"\n",
|
||
" \n",
|
||
" # Test decay after 5 steps\n",
|
||
" for i in range(5):\n",
|
||
" adam_scheduler.step()\n",
|
||
" \n",
|
||
" # Learning rate should still be 0.001 after 5 steps\n",
|
||
" assert adam_scheduler.get_lr() == 0.001, f\"Adam learning rate should still be 0.001 after 5 steps, got {adam_scheduler.get_lr()}\"\n",
|
||
" \n",
|
||
" # Step 6: decay should occur\n",
|
||
" adam_scheduler.step()\n",
|
||
" expected_lr = 0.001 * 0.5 # 0.0005\n",
|
||
" assert abs(adam_scheduler.get_lr() - expected_lr) < 1e-6, f\"Adam learning rate should be {expected_lr} after 6 steps, got {adam_scheduler.get_lr()}\"\n",
|
||
" print(\"✅ Works with different optimizers\")\n",
|
||
" \n",
|
||
" except Exception as e:\n",
|
||
" print(f\"❌ Different optimizers failed: {e}\")\n",
|
||
" raise\n",
|
||
"\n",
|
||
" print(\"🎯 Step learning rate scheduler behavior:\")\n",
|
||
" print(\" Reduces learning rate at regular intervals\")\n",
|
||
" print(\" Multiplies current rate by gamma factor\")\n",
|
||
" print(\" Works with any optimizer (SGD, Adam, etc.)\")\n",
|
||
" print(\"📈 Progress: Step Learning Rate Scheduler ✓\")\n",
|
||
"\n",
|
||
"# Run the test\n",
|
||
"test_step_scheduler_comprehensive()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "2fc52bc2",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"## Step 5: Integration - Complete Training Example\n",
|
||
"\n",
|
||
"### Putting It All Together\n",
|
||
"Let's see how optimizers enable complete neural network training:\n",
|
||
"\n",
|
||
"1. **Forward pass**: Compute predictions\n",
|
||
"2. **Loss computation**: Compare with targets\n",
|
||
"3. **Backward pass**: Compute gradients\n",
|
||
"4. **Optimizer step**: Update parameters\n",
|
||
"5. **Learning rate scheduling**: Adjust learning rate\n",
|
||
"\n",
|
||
"### The Modern Training Loop\n",
|
||
"```python\n",
|
||
"# Setup\n",
|
||
"optimizer = Adam(model.parameters(), learning_rate=0.001)\n",
|
||
"scheduler = StepLR(optimizer, step_size=10, gamma=0.1)\n",
|
||
"\n",
|
||
"# Training loop\n",
|
||
"for epoch in range(num_epochs):\n",
|
||
" for batch in dataloader:\n",
|
||
" # Forward pass\n",
|
||
" predictions = model(batch.inputs)\n",
|
||
" loss = criterion(predictions, batch.targets)\n",
|
||
" \n",
|
||
" # Backward pass\n",
|
||
" optimizer.zero_grad()\n",
|
||
" loss.backward()\n",
|
||
" optimizer.step()\n",
|
||
" \n",
|
||
" # Update learning rate\n",
|
||
" scheduler.step()\n",
|
||
"```\n",
|
||
"\n",
|
||
"Let's implement a complete training example!"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "a3205aad",
|
||
"metadata": {
|
||
"lines_to_next_cell": 1,
|
||
"nbgrader": {
|
||
"grade": false,
|
||
"grade_id": "training-integration",
|
||
"locked": false,
|
||
"schema_version": 3,
|
||
"solution": true,
|
||
"task": false
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"def train_simple_model():\n",
|
||
" \"\"\"\n",
|
||
" Complete training example using optimizers.\n",
|
||
" \n",
|
||
" TODO: Implement a complete training loop.\n",
|
||
" \n",
|
||
" APPROACH:\n",
|
||
" 1. Create a simple model (linear regression)\n",
|
||
" 2. Generate training data\n",
|
||
" 3. Set up optimizer and scheduler\n",
|
||
" 4. Train for several epochs\n",
|
||
" 5. Show convergence\n",
|
||
" \n",
|
||
" LEARNING OBJECTIVE:\n",
|
||
" - See how optimizers enable real learning\n",
|
||
" - Compare SGD vs Adam performance\n",
|
||
" - Understand the complete training workflow\n",
|
||
" \"\"\"\n",
|
||
" ### BEGIN SOLUTION\n",
|
||
" print(\"Training simple linear regression model...\")\n",
|
||
" \n",
|
||
" # Create simple model: y = w*x + b\n",
|
||
" w = Variable(0.1, requires_grad=True) # Initialize near zero\n",
|
||
" b = Variable(0.0, requires_grad=True)\n",
|
||
" \n",
|
||
" # Training data: y = 2*x + 1\n",
|
||
" x_data = [1.0, 2.0, 3.0, 4.0, 5.0]\n",
|
||
" y_data = [3.0, 5.0, 7.0, 9.0, 11.0]\n",
|
||
" \n",
|
||
" # Try SGD first\n",
|
||
" print(\"\\n🔍 Training with SGD...\")\n",
|
||
" optimizer_sgd = SGD([w, b], learning_rate=0.01, momentum=0.9)\n",
|
||
" \n",
|
||
" for epoch in range(60):\n",
|
||
" total_loss = 0\n",
|
||
" \n",
|
||
" for x_val, y_val in zip(x_data, y_data):\n",
|
||
" # Forward pass\n",
|
||
" x = Variable(x_val, requires_grad=False)\n",
|
||
" y_target = Variable(y_val, requires_grad=False)\n",
|
||
" \n",
|
||
" # Prediction: y = w*x + b\n",
|
||
" try:\n",
|
||
" from tinytorch.core.autograd import add, multiply, subtract\n",
|
||
" except ImportError:\n",
|
||
" setup_import_paths()\n",
|
||
" from autograd_dev import add, multiply, subtract\n",
|
||
" \n",
|
||
" prediction = add(multiply(w, x), b)\n",
|
||
" \n",
|
||
" # Loss: (prediction - target)^2\n",
|
||
" error = subtract(prediction, y_target)\n",
|
||
" loss = multiply(error, error)\n",
|
||
" \n",
|
||
" # Backward pass\n",
|
||
" optimizer_sgd.zero_grad()\n",
|
||
" loss.backward()\n",
|
||
" optimizer_sgd.step()\n",
|
||
" \n",
|
||
" total_loss += loss.data.data.item()\n",
|
||
" \n",
|
||
" if epoch % 10 == 0:\n",
|
||
" print(f\"Epoch {epoch}: Loss = {total_loss:.4f}, w = {w.data.data.item():.3f}, b = {b.data.data.item():.3f}\")\n",
|
||
" \n",
|
||
" sgd_final_w = w.data.data.item()\n",
|
||
" sgd_final_b = b.data.data.item()\n",
|
||
" \n",
|
||
" # Reset parameters and try Adam\n",
|
||
" print(\"\\n🔍 Training with Adam...\")\n",
|
||
" w.data = Tensor(0.1)\n",
|
||
" b.data = Tensor(0.0)\n",
|
||
" \n",
|
||
" optimizer_adam = Adam([w, b], learning_rate=0.01)\n",
|
||
" \n",
|
||
" for epoch in range(60):\n",
|
||
" total_loss = 0\n",
|
||
" \n",
|
||
" for x_val, y_val in zip(x_data, y_data):\n",
|
||
" # Forward pass\n",
|
||
" x = Variable(x_val, requires_grad=False)\n",
|
||
" y_target = Variable(y_val, requires_grad=False)\n",
|
||
" \n",
|
||
" # Prediction: y = w*x + b\n",
|
||
" prediction = add(multiply(w, x), b)\n",
|
||
" \n",
|
||
" # Loss: (prediction - target)^2\n",
|
||
" error = subtract(prediction, y_target)\n",
|
||
" loss = multiply(error, error)\n",
|
||
" \n",
|
||
" # Backward pass\n",
|
||
" optimizer_adam.zero_grad()\n",
|
||
" loss.backward()\n",
|
||
" optimizer_adam.step()\n",
|
||
" \n",
|
||
" total_loss += loss.data.data.item()\n",
|
||
" \n",
|
||
" if epoch % 10 == 0:\n",
|
||
" print(f\"Epoch {epoch}: Loss = {total_loss:.4f}, w = {w.data.data.item():.3f}, b = {b.data.data.item():.3f}\")\n",
|
||
" \n",
|
||
" adam_final_w = w.data.data.item()\n",
|
||
" adam_final_b = b.data.data.item()\n",
|
||
" \n",
|
||
" print(f\"\\n📊 Results:\")\n",
|
||
" print(f\"Target: w = 2.0, b = 1.0\")\n",
|
||
" print(f\"SGD: w = {sgd_final_w:.3f}, b = {sgd_final_b:.3f}\")\n",
|
||
" print(f\"Adam: w = {adam_final_w:.3f}, b = {adam_final_b:.3f}\")\n",
|
||
" \n",
|
||
" return sgd_final_w, sgd_final_b, adam_final_w, adam_final_b\n",
|
||
" ### END SOLUTION"
|
||
]
|
||
},
|
||
{
"cell_type": "markdown",
"id": "0a5330c4",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"### 🧪 Unit Test: Complete Training Integration\n",
"\n",
"Let's test your complete training integration! This demonstrates optimizers working together in a realistic training scenario.\n",
"\n",
"**This is a unit test** - it tests the complete training workflow with optimizers in isolation."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5aeda8ce",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test-training-integration",
"locked": true,
"points": 25,
"schema_version": 3,
"solution": false,
"task": false
}
},
"outputs": [],
"source": [
"def test_training_integration_comprehensive():\n",
"    \"\"\"Test complete training integration with optimizers\"\"\"\n",
"    print(\"🔬 Unit Test: Complete Training Integration...\")\n",
" \n",
"    # Test training with SGD and Adam\n",
"    try:\n",
"        sgd_w, sgd_b, adam_w, adam_b = train_simple_model()\n",
" \n",
"        # Test SGD convergence\n",
"        assert abs(sgd_w - 2.0) < 0.1, f\"SGD should converge close to w=2.0, got {sgd_w}\"\n",
"        assert abs(sgd_b - 1.0) < 0.1, f\"SGD should converge close to b=1.0, got {sgd_b}\"\n",
"        print(\"✅ SGD convergence works\")\n",
" \n",
"        # Test Adam convergence (may be different due to adaptive learning rates)\n",
"        assert abs(adam_w - 2.0) < 1.0, f\"Adam should converge reasonably close to w=2.0, got {adam_w}\"\n",
"        assert abs(adam_b - 1.0) < 1.0, f\"Adam should converge reasonably close to b=1.0, got {adam_b}\"\n",
"        print(\"✅ Adam convergence works\")\n",
" \n",
"    except Exception as e:\n",
"        print(f\"❌ Training integration failed: {e}\")\n",
"        raise\n",
" \n",
"    # Test optimizer comparison\n",
"    try:\n",
"        # Both optimizers should achieve reasonable results\n",
"        sgd_error = (sgd_w - 2.0)**2 + (sgd_b - 1.0)**2\n",
"        adam_error = (adam_w - 2.0)**2 + (adam_b - 1.0)**2\n",
" \n",
"        # SGD should have low squared error (< 0.1); Adam is allowed more slack (< 1.0)\n",
"        assert sgd_error < 0.1, f\"SGD error should be < 0.1, got {sgd_error}\"\n",
"        assert adam_error < 1.0, f\"Adam error should be < 1.0, got {adam_error}\"\n",
"        print(\"✅ Optimizer comparison works\")\n",
" \n",
"    except Exception as e:\n",
"        print(f\"❌ Optimizer comparison failed: {e}\")\n",
"        raise\n",
" \n",
"    # Test gradient flow\n",
"    try:\n",
"        # Create a simple test to verify gradients flow correctly\n",
"        w = Variable(1.0, requires_grad=True)\n",
"        b = Variable(0.0, requires_grad=True)\n",
" \n",
"        # Set up simple gradients\n",
"        w.grad = Variable(0.1)\n",
"        b.grad = Variable(0.05)\n",
" \n",
"        # Test SGD step\n",
"        sgd_optimizer = SGD([w, b], learning_rate=0.1)\n",
"        original_w = w.data.data.item()\n",
"        original_b = b.data.data.item()\n",
" \n",
"        sgd_optimizer.step()\n",
" \n",
"        # Check updates\n",
"        assert w.data.data.item() != original_w, \"SGD should update w\"\n",
"        assert b.data.data.item() != original_b, \"SGD should update b\"\n",
"        print(\"✅ Gradient flow works correctly\")\n",
" \n",
"    except Exception as e:\n",
"        print(f\"❌ Gradient flow failed: {e}\")\n",
"        raise\n",
"\n",
"    print(\"🎯 Training integration behavior:\")\n",
"    print(\" Optimizers successfully minimize loss functions\")\n",
"    print(\" SGD and Adam both converge to target values\")\n",
"    print(\" Gradient computation and updates work correctly\")\n",
"    print(\" Ready for real neural network training\")\n",
"    print(\"📈 Progress: Complete Training Integration ✓\")\n",
"\n",
"# Run the test\n",
"test_training_integration_comprehensive()"
]
},
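{
"cell_type": "markdown",
"id": "lr-effect-note",
"metadata": {},
"source": [
"*Optional follow-up sketch:* the next cell reuses only pieces already exercised above (`Variable`, `SGD`, and `add`/`multiply`/`subtract`) to show how the learning rate alone changes convergence on the same `y = 2x + 1` data. The two learning rates are arbitrary example values, not prescribed settings."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "lr-effect-sketch",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: effect of the learning rate on the same linear-regression problem.\n",
"# Uses only components demonstrated above; the learning rates are example values.\n",
"try:\n",
"    from tinytorch.core.autograd import add, multiply, subtract\n",
"except ImportError:\n",
"    setup_import_paths()\n",
"    from autograd_dev import add, multiply, subtract\n",
"\n",
"x_data = [1.0, 2.0, 3.0, 4.0, 5.0]\n",
"y_data = [3.0, 5.0, 7.0, 9.0, 11.0]\n",
"\n",
"for lr in [0.001, 0.01]:\n",
"    w = Variable(0.1, requires_grad=True)\n",
"    b = Variable(0.0, requires_grad=True)\n",
"    optimizer = SGD([w, b], learning_rate=lr, momentum=0.9)\n",
"\n",
"    for epoch in range(60):\n",
"        total_loss = 0\n",
"        for x_val, y_val in zip(x_data, y_data):\n",
"            x = Variable(x_val, requires_grad=False)\n",
"            y_target = Variable(y_val, requires_grad=False)\n",
"            error = subtract(add(multiply(w, x), b), y_target)\n",
"            loss = multiply(error, error)\n",
"\n",
"            optimizer.zero_grad()\n",
"            loss.backward()\n",
"            optimizer.step()\n",
"            total_loss += loss.data.data.item()\n",
"\n",
"    print(f\"learning_rate={lr}: final loss={total_loss:.4f}, w={w.data.data.item():.3f}, b={b.data.data.item():.3f}\")"
]
},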
{
"cell_type": "markdown",
"id": "c0464e8c",
"metadata": {},
"source": [
"\"\"\"\n",
"# 🎯 Module Summary: Optimization Mastery!\n",
"\n",
"Congratulations! You've successfully implemented the optimization algorithms that power all modern neural network training:\n",
"\n",
"## ✅ What You've Built\n",
"- **Gradient Descent**: The fundamental parameter update mechanism\n",
"- **SGD with Momentum**: Accelerated convergence with velocity accumulation\n",
"- **Adam Optimizer**: Adaptive learning rates with first and second moments\n",
"- **Learning Rate Scheduling**: Smart learning rate adjustment during training\n",
"- **Complete Training Integration**: End-to-end training workflow\n",
"\n",
"## ✅ Key Learning Outcomes\n",
"- **Understanding**: How optimizers use gradients to update parameters intelligently\n",
"- **Implementation**: Built SGD and Adam optimizers from mathematical foundations\n",
"- **Mathematical mastery**: Momentum, adaptive learning rates, bias correction\n",
"- **Systems integration**: Complete training loops with scheduling\n",
"- **Real-world application**: Modern deep learning training workflow\n",
"\n",
"## ✅ Mathematical Foundations Mastered\n",
"- **Gradient Descent**: θ = θ - α∇L(θ) for parameter updates\n",
"- **Momentum**: v_t = βv_{t-1} + ∇L(θ) for acceleration\n",
"- **Adam**: Adaptive learning rates with exponential moving averages\n",
"- **Learning Rate Scheduling**: Strategic learning rate adjustment\n",
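"\n",
"A minimal NumPy sketch of these update rules (illustrative only; variable names and hyperparameter values are placeholders, not this module's exact API):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"theta, grad = np.array([0.5]), np.array([0.2])  # a parameter and its current gradient\n",
"lr, beta = 0.01, 0.9\n",
"\n",
"# Gradient descent: theta <- theta - lr * grad\n",
"theta_gd = theta - lr * grad\n",
"\n",
"# Momentum: velocity accumulates past gradients before updating the parameter\n",
"v = np.zeros_like(theta)\n",
"v = beta * v + grad\n",
"theta_momentum = theta - lr * v\n",
"\n",
"# Adam: first/second moment estimates with bias correction (t = step count)\n",
"beta1, beta2, eps, t = 0.9, 0.999, 1e-8, 1\n",
"m, s = np.zeros_like(theta), np.zeros_like(theta)\n",
"m = beta1 * m + (1 - beta1) * grad\n",
"s = beta2 * s + (1 - beta2) * grad**2\n",
"m_hat = m / (1 - beta1**t)\n",
"s_hat = s / (1 - beta2**t)\n",
"theta_adam = theta - lr * m_hat / (np.sqrt(s_hat) + eps)\n",
"```\n",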
"\n",
"## ✅ Professional Skills Developed\n",
"- **Algorithm implementation**: Translating mathematical formulas into code\n",
"- **State management**: Tracking optimizer buffers and statistics\n",
"- **Hyperparameter design**: Understanding the impact of learning rate, momentum, etc.\n",
"- **Training orchestration**: Complete training loop design\n",
"\n",
"## ✅ Ready for Advanced Applications\n",
"Your optimizers now enable:\n",
"- **Deep Neural Networks**: Effective training of complex architectures\n",
"- **Computer Vision**: Training CNNs, ResNets, Vision Transformers\n",
"- **Natural Language Processing**: Training transformers and language models\n",
"- **Any ML Model**: Gradient-based optimization for any differentiable system\n",
"\n",
"## 🔗 Connection to Real ML Systems\n",
"Your implementations mirror production systems:\n",
"- **PyTorch**: `torch.optim.SGD()`, `torch.optim.Adam()`, `torch.optim.lr_scheduler.StepLR()`\n",
"- **TensorFlow**: `tf.keras.optimizers.SGD()`, `tf.keras.optimizers.Adam()`\n",
"- **Industry Standard**: Every major ML framework uses these exact algorithms\n",
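"\n",
"For reference, here is roughly how the same pieces are spelled in PyTorch (the stand-in model and hyperparameter values are arbitrary examples):\n",
"\n",
"```python\n",
"import torch\n",
"\n",
"model = torch.nn.Linear(1, 1)  # stand-in model with trainable parameters\n",
"\n",
"# PyTorch uses lr where this module uses learning_rate\n",
"sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)\n",
"adam = torch.optim.Adam(model.parameters(), lr=0.001)\n",
"scheduler = torch.optim.lr_scheduler.StepLR(sgd, step_size=30, gamma=0.1)\n",
"```\n",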
"\n",
"## 🎯 The Power of Intelligent Optimization\n",
"You've unlocked the algorithms that made modern AI possible:\n",
"- **Scalability**: Efficiently optimize millions of parameters\n",
"- **Adaptability**: Different learning rates for different parameters\n",
"- **Robustness**: Handle noisy gradients and ill-conditioned problems\n",
"- **Universality**: Work with any differentiable neural network\n",
"\n",
"## 🧠 Deep Learning Revolution\n",
"You now understand the optimization technology that powers:\n",
"- **ImageNet**: Training state-of-the-art computer vision models\n",
"- **Language Models**: Training GPT, BERT, and other transformers\n",
"- **Modern AI**: Every breakthrough relies on these optimization algorithms\n",
"- **Future Research**: Your understanding enables you to develop new optimizers\n",
"\n",
"## 🚀 What's Next\n",
"Your optimizers are the foundation for:\n",
"- **Training Module**: Complete training loops with loss functions and metrics\n",
"- **Advanced Optimizers**: RMSprop, AdaGrad, learning rate warm-up\n",
"- **Distributed Training**: Multi-GPU optimization strategies\n",
"- **Research**: Experimenting with novel optimization algorithms\n",
"\n",
"**Next Module**: Complete training systems that orchestrate your optimizers for real-world ML!\n",
"\n",
"You've built the intelligent algorithms that enable neural networks to learn. Now let's use them to train systems that can solve complex real-world problems!\n",
"\"\"\"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "run-inline-tests",
"metadata": {},
"outputs": [],
"source": [
"# Run inline tests when module is executed directly\n",
"if __name__ == \"__main__\":\n",
"    from tito.tools.testing import run_module_tests_auto\n",
" \n",
"    # Automatically discover and run all tests in this module\n",
"    run_module_tests_auto(\"Optimizers\")"
]
}
],
"metadata": {
"jupytext": {
"main_language": "python"
}
},
"nbformat": 4,
"nbformat_minor": 5
}