{ "cells": [ { "cell_type": "markdown", "id": "602ba54a", "metadata": { "cell_marker": "\"\"\"" }, "source": [ "# Module 8: Optimizers - Gradient-Based Parameter Updates\n", "\n", "Welcome to the Optimizers module! This is where neural networks learn to improve through intelligent parameter updates.\n", "\n", "## Learning Goals\n", "- Understand gradient descent and how optimizers use gradients to update parameters\n", "- Implement SGD with momentum for accelerated convergence\n", "- Build Adam optimizer with adaptive learning rates\n", "- Master learning rate scheduling strategies\n", "- See how optimizers enable effective neural network training\n", "\n", "## Build → Use → Analyze\n", "1. **Build**: Core optimization algorithms (SGD, Adam)\n", "2. **Use**: Apply optimizers to train neural networks\n", "3. **Analyze**: Compare optimizer behavior and convergence patterns" ] }, { "cell_type": "code", "execution_count": null, "id": "e3b359ed", "metadata": { "nbgrader": { "grade": false, "grade_id": "optimizers-imports", "locked": false, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "#| default_exp core.optimizers\n", "\n", "#| export\n", "import math\n", "import numpy as np\n", "import sys\n", "import os\n", "from typing import List, Dict, Any, Optional, Union\n", "from collections import defaultdict\n", "\n", "# Helper function to set up import paths\n", "def setup_import_paths():\n", "    \"\"\"Set up import paths for development modules.\"\"\"\n", "    import sys\n", "    import os\n", " \n", "    # Add module directories to path\n", "    base_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))\n", "    tensor_dir = os.path.join(base_dir, '01_tensor')\n", "    autograd_dir = os.path.join(base_dir, '07_autograd')\n", " \n", "    if tensor_dir not in sys.path:\n", "        sys.path.append(tensor_dir)\n", "    if autograd_dir not in sys.path:\n", "        sys.path.append(autograd_dir)\n", "\n", "# Import our existing components\n", "try:\n", "    from 
tinytorch.core.tensor import Tensor\n", "    from tinytorch.core.autograd import Variable\n", "except ImportError:\n", "    # For development, try local imports\n", "    try:\n", "        setup_import_paths()\n", "        from tensor_dev import Tensor\n", "        from autograd_dev import Variable\n", "    except ImportError:\n", "        # Create minimal fallback classes for testing\n", "        print(\"Warning: Using fallback classes for testing\")\n", " \n", "        class Tensor:\n", "            def __init__(self, data):\n", "                self.data = np.array(data)\n", "                self.shape = self.data.shape\n", " \n", "            def __str__(self):\n", "                return f\"Tensor({self.data})\"\n", " \n", "        class Variable:\n", "            def __init__(self, data, requires_grad=True):\n", "                if isinstance(data, (int, float)):\n", "                    self.data = Tensor([data])\n", "                else:\n", "                    self.data = Tensor(data)\n", "                self.requires_grad = requires_grad\n", "                self.grad = None\n", " \n", "            def zero_grad(self):\n", "                self.grad = None\n", " \n", "            def __str__(self):\n", "                return f\"Variable({self.data.data})\"" ] }, { "cell_type": "code", "execution_count": null, "id": "4dfb6aa4", "metadata": { "nbgrader": { "grade": false, "grade_id": "optimizers-setup", "locked": false, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "print(\"🔥 TinyTorch Optimizers Module\")\n", "print(f\"NumPy version: {np.__version__}\")\n", "print(f\"Python version: {sys.version_info.major}.{sys.version_info.minor}\")\n", "print(\"Ready to build optimization algorithms!\")" ] }, { "cell_type": "markdown", "id": "c9afc185", "metadata": { "cell_marker": "\"\"\"" }, "source": [ "## 📦 Where This Code Lives in the Final Package\n", "\n", "**Learning Side:** You work in `modules/source/08_optimizers/optimizers_dev.py`  \n", "**Building Side:** Code exports to `tinytorch.core.optimizers`\n", "\n", "```python\n", "# Final package structure:\n", "from tinytorch.core.optimizers import SGD, Adam, StepLR  # The optimization engines!\n", "from tinytorch.core.autograd import Variable  # 
Gradient computation\n", "from tinytorch.core.tensor import Tensor # Data structures\n", "```\n", "\n", "**Why this matters:**\n", "- **Learning:** Focused module for understanding optimization algorithms\n", "- **Production:** Proper organization like PyTorch's `torch.optim`\n", "- **Consistency:** All optimization algorithms live together in `core.optimizers`\n", "- **Foundation:** Enables effective neural network training" ] }, { "cell_type": "markdown", "id": "e0d222c6", "metadata": { "cell_marker": "\"\"\"" }, "source": [ "## What Are Optimizers?\n", "\n", "### The Problem: How to Update Parameters\n", "Neural networks learn by updating parameters using gradients:\n", "```\n", "parameter_new = parameter_old - learning_rate * gradient\n", "```\n", "\n", "But **naive gradient descent** has problems:\n", "- **Slow convergence**: Takes many steps to reach optimum\n", "- **Oscillation**: Bounces around valleys without making progress\n", "- **Poor scaling**: Same learning rate for all parameters\n", "\n", "### The Solution: Smart Optimization\n", "**Optimizers** are algorithms that intelligently update parameters:\n", "- **Momentum**: Accelerate convergence by accumulating velocity\n", "- **Adaptive learning rates**: Different learning rates for different parameters\n", "- **Second-order information**: Use curvature to guide updates\n", "\n", "### Real-World Impact\n", "- **SGD**: The foundation of all neural network training\n", "- **Adam**: The default optimizer for most deep learning applications\n", "- **Learning rate scheduling**: Critical for training stability and performance\n", "\n", "### What We'll Build\n", "1. **SGD**: Stochastic Gradient Descent with momentum\n", "2. **Adam**: Adaptive Moment Estimation optimizer\n", "3. **StepLR**: Learning rate scheduling\n", "4. 
**Integration**: Complete training loop with optimizers" ] }, { "cell_type": "markdown", "id": "8ccea3ce", "metadata": { "cell_marker": "\"\"\"", "lines_to_next_cell": 1 }, "source": [ "## Step 1: Understanding Gradient Descent\n", "\n", "### What is Gradient Descent?\n", "**Gradient descent** searches for a minimum of a function by following the negative gradient:\n", "\n", "```\n", "θ_{t+1} = θ_t - α ∇f(θ_t)\n", "```\n", "\n", "Where:\n", "- θ: Parameters we want to optimize\n", "- α: Learning rate (how big a step to take)\n", "- ∇f(θ): Gradient of the loss function with respect to the parameters\n", "\n", "### Why Gradient Descent Works\n", "1. **Gradients point uphill**: The negative gradient is the direction of steepest local decrease\n", "2. **Iterative improvement**: Each step reduces the loss when the learning rate is small enough\n", "3. **Local convergence**: Finds a local minimum with a proper learning rate\n", "4. **Scalable**: Works with millions of parameters\n", "\n", "### The Learning Rate Dilemma\n", "- **Too large**: Overshoots the minimum and can diverge\n", "- **Too small**: Extremely slow convergence\n", "- **Just right**: Steady progress toward the minimum\n", "\n", "### Visual Understanding\n", "```\n", "Loss landscape:   \\__/\n", "Start here:       ↑\n", "Gradient descent: ↓ → ↓ → ↓ → minimum\n", "```\n", "\n", "### Real-World Applications\n", "- **Neural networks**: Training any deep learning model\n", "- **Machine learning**: Logistic regression, SVM, etc.\n", "- **Scientific computing**: Optimization problems in physics, engineering\n", "- **Economics**: Portfolio optimization, game theory\n", "\n", "Let's implement gradient descent to understand it deeply!" 
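, "\n", "\n", "### Worked Example: The Learning Rate Dilemma by Hand\n", "A minimal sketch, assuming the toy loss f(θ) = θ² (so ∇f(θ) = 2θ) and a starting point θ_0 = 1; neither the loss nor the step sizes come from this module's code. With α = 0.1 the iterates shrink toward the minimum at 0:\n", "\n", "$$\\theta_1 = 1 - 0.1 \\cdot 2(1) = 0.8, \\qquad \\theta_2 = 0.8 - 0.1 \\cdot 2(0.8) = 0.64$$\n", "\n", "With α = 1.1 each step overshoots and the magnitude grows:\n", "\n", "$$\\theta_1 = 1 - 1.1 \\cdot 2(1) = -1.2, \\qquad \\theta_2 = -1.2 - 1.1 \\cdot 2(-1.2) = 1.44$$\n", "\n", "For this particular loss every step multiplies θ by (1 - 2α), so the iteration converges only when |1 - 2α| < 1, that is, 0 < α < 1."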
] }, { "cell_type": "code", "execution_count": null, "id": "d41c2596", "metadata": { "lines_to_next_cell": 1, "nbgrader": { "grade": false, "grade_id": "gradient-descent-function", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "#| export\n", "def gradient_descent_step(parameter: Variable, learning_rate: float) -> None:\n", "    \"\"\"\n", "    Perform one step of gradient descent on a parameter.\n", " \n", "    Args:\n", "        parameter: Variable with gradient information\n", "        learning_rate: How much to update parameter\n", " \n", "    TODO: Implement basic gradient descent parameter update.\n", " \n", "    STEP-BY-STEP IMPLEMENTATION:\n", "    1. Check if parameter has a gradient\n", "    2. Get current parameter value and gradient\n", "    3. Update parameter: new_value = old_value - learning_rate * gradient\n", "    4. Update parameter data with new value\n", "    5. Handle edge cases (no gradient, invalid values)\n", " \n", "    EXAMPLE USAGE:\n", "    ```python\n", "    # Parameter with gradient\n", "    w = Variable(2.0, requires_grad=True)\n", "    w.grad = Variable(0.5)  # Gradient from loss\n", " \n", "    # Update parameter\n", "    gradient_descent_step(w, learning_rate=0.1)\n", "    # w.data now contains: 2.0 - 0.1 * 0.5 = 1.95\n", "    ```\n", " \n", "    IMPLEMENTATION HINTS:\n", "    - Check if parameter.grad is not None\n", "    - Use parameter.grad.data.data to get gradient value\n", "    - Update parameter.data with new Tensor\n", "    - Don't modify gradient (it's used for logging)\n", " \n", "    LEARNING CONNECTIONS:\n", "    - This is the foundation of all neural network training\n", "    - PyTorch's optimizer.step() does exactly this\n", "    - The learning rate determines convergence speed\n", "    \"\"\"\n", "    ### BEGIN SOLUTION\n", "    if parameter.grad is not None:\n", "        # Get current parameter value and gradient\n", "        current_value = parameter.data.data\n", "        gradient_value = parameter.grad.data.data\n", " \n", "        # Update parameter: new_value = old_value - learning_rate * 
gradient\n", "        new_value = current_value - learning_rate * gradient_value\n", " \n", "        # Update parameter data\n", "        parameter.data = Tensor(new_value)\n", "    ### END SOLUTION" ] }, { "cell_type": "markdown", "id": "4d2e1fd4", "metadata": { "cell_marker": "\"\"\"", "lines_to_next_cell": 1 }, "source": [ "### 🧪 Unit Test: Gradient Descent Step\n", "\n", "Let's test your gradient descent implementation right away! This is the foundation of all optimization algorithms.\n", "\n", "**This is a unit test** - it tests one specific function (gradient_descent_step) in isolation." ] }, { "cell_type": "code", "execution_count": null, "id": "f092d289", "metadata": { "lines_to_next_cell": 1, "nbgrader": { "grade": true, "grade_id": "test-gradient-descent", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "def test_gradient_descent_step_comprehensive():\n", "    \"\"\"Test basic gradient descent parameter update\"\"\"\n", "    print(\"🔬 Unit Test: Gradient Descent Step...\")\n", " \n", "    # Test basic parameter update\n", "    try:\n", "        w = Variable(2.0, requires_grad=True)\n", "        w.grad = Variable(0.5)  # Positive gradient\n", " \n", "        original_value = w.data.data.item()\n", "        gradient_descent_step(w, learning_rate=0.1)\n", "        new_value = w.data.data.item()\n", " \n", "        expected_value = original_value - 0.1 * 0.5  # 2.0 - 0.05 = 1.95\n", "        assert abs(new_value - expected_value) < 1e-6, f\"Expected {expected_value}, got {new_value}\"\n", "        print(\"✅ Basic parameter update works\")\n", " \n", "    except Exception as e:\n", "        print(f\"❌ Basic parameter update failed: {e}\")\n", "        raise\n", "\n", "    # Test with negative gradient\n", "    try:\n", "        w2 = Variable(1.0, requires_grad=True)\n", "        w2.grad = Variable(-0.2)  # Negative gradient\n", " \n", "        gradient_descent_step(w2, learning_rate=0.1)\n", "        expected_value2 = 1.0 - 0.1 * (-0.2)  # 1.0 + 0.02 = 1.02\n", "        assert abs(w2.data.data.item() - expected_value2) < 1e-6, 
\"Negative gradient test failed\"\n", "        print(\"✅ Negative gradient handling works\")\n", " \n", "    except Exception as e:\n", "        print(f\"❌ Negative gradient handling failed: {e}\")\n", "        raise\n", "\n", "    # Test with no gradient (should not update)\n", "    try:\n", "        w3 = Variable(3.0, requires_grad=True)\n", "        w3.grad = None\n", "        original_value3 = w3.data.data.item()\n", " \n", "        gradient_descent_step(w3, learning_rate=0.1)\n", "        assert w3.data.data.item() == original_value3, \"Parameter with no gradient should not update\"\n", "        print(\"✅ No gradient case works\")\n", " \n", "    except Exception as e:\n", "        print(f\"❌ No gradient case failed: {e}\")\n", "        raise\n", "\n", "    print(\"🎯 Gradient descent step behavior:\")\n", "    print(\"   Updates parameters in negative gradient direction\")\n", "    print(\"   Uses learning rate to control step size\")\n", "    print(\"   Skips updates when gradient is None\")\n", "    print(\"📈 Progress: Gradient Descent Step ✓\")\n", "\n", "# Test function is called by auto-discovery system" ] }, { "cell_type": "markdown", "id": "bc218834", "metadata": { "cell_marker": "\"\"\"", "lines_to_next_cell": 1 }, "source": [ "## Step 2: SGD with Momentum\n", "\n", "### What is SGD?\n", "**SGD (Stochastic Gradient Descent)** is the fundamental optimization algorithm:\n", "\n", "```\n", "θ_{t+1} = θ_t - α ∇L(θ_t)\n", "```\n", "\n", "### The Problem with Vanilla SGD\n", "- **Slow convergence**: Especially in narrow valleys\n", "- **Oscillation**: Bounces around without making progress\n", "- **Poor conditioning**: Struggles with ill-conditioned problems\n", "\n", "### The Solution: Momentum\n", "**Momentum** accumulates velocity to accelerate convergence:\n", "\n", "```\n", "v_t = β v_{t-1} + ∇L(θ_t)\n", "θ_{t+1} = θ_t - α v_t\n", "```\n", "\n", "Where:\n", "- v_t: Velocity (exponentially decaying sum of past gradients)\n", "- β: Momentum coefficient (typically 0.9)\n", "- α: Learning rate\n", "\n", "### Why Momentum Works\n", "1. 
**Acceleration**: Builds up speed in consistent directions\n", "2. **Dampening**: Reduces oscillations in inconsistent directions\n", "3. **Memory**: Remembers previous gradient directions\n", "4. **Robustness**: Less sensitive to noisy gradients\n", "\n", "### Visual Understanding\n", "```\n", "Without momentum: ↗↙↗↙↗↙ (oscillating)\n", "With momentum:    ↗→→→→→ (smooth progress)\n", "```\n", "\n", "### Real-World Applications\n", "- **Image classification**: Training ResNet, VGG\n", "- **Natural language**: Training RNNs, early transformers\n", "- **Classic choice**: Still used when Adam fails\n", "- **Large batch training**: Often preferred over Adam\n", "\n", "Let's implement SGD with momentum!" ] }, { "cell_type": "code", "execution_count": null, "id": "2f587b7f", "metadata": { "lines_to_next_cell": 1, "nbgrader": { "grade": false, "grade_id": "sgd-class", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "#| export\n", "class SGD:\n", "    \"\"\"\n", "    SGD Optimizer with Momentum\n", " \n", "    Implements stochastic gradient descent with momentum:\n", "    v_t = momentum * v_{t-1} + gradient\n", "    parameter = parameter - learning_rate * v_t\n", "    \"\"\"\n", " \n", "    def __init__(self, parameters: List[Variable], learning_rate: float = 0.01,\n", "                 momentum: float = 0.0, weight_decay: float = 0.0):\n", "        \"\"\"\n", "        Initialize SGD optimizer.\n", " \n", "        Args:\n", "            parameters: List of Variables to optimize\n", "            learning_rate: Learning rate (default: 0.01)\n", "            momentum: Momentum coefficient (default: 0.0)\n", "            weight_decay: L2 regularization coefficient (default: 0.0)\n", " \n", "        TODO: Implement SGD optimizer initialization.\n", " \n", "        APPROACH:\n", "        1. Store parameters and hyperparameters\n", "        2. Initialize momentum buffers for each parameter\n", "        3. Set up state tracking for optimization\n", "        4. 
Prepare for step() and zero_grad() methods\n", " \n", "        EXAMPLE:\n", "        ```python\n", "        # Create optimizer\n", "        optimizer = SGD([w1, w2, b1, b2], learning_rate=0.01, momentum=0.9)\n", " \n", "        # In training loop:\n", "        optimizer.zero_grad()\n", "        loss.backward()\n", "        optimizer.step()\n", "        ```\n", " \n", "        HINTS:\n", "        - Store parameters as a list\n", "        - Initialize momentum buffers as empty dict\n", "        - Use parameter id() as key for momentum tracking\n", "        - Momentum buffers will be created lazily in step()\n", "        \"\"\"\n", "        ### BEGIN SOLUTION\n", "        self.parameters = parameters\n", "        self.learning_rate = learning_rate\n", "        self.momentum = momentum\n", "        self.weight_decay = weight_decay\n", " \n", "        # Initialize momentum buffers (created lazily)\n", "        self.momentum_buffers = {}\n", " \n", "        # Track optimization steps\n", "        self.step_count = 0\n", "        ### END SOLUTION\n", " \n", "    def step(self) -> None:\n", "        \"\"\"\n", "        Perform one optimization step.\n", " \n", "        TODO: Implement SGD parameter update with momentum.\n", " \n", "        APPROACH:\n", "        1. Iterate through all parameters\n", "        2. For each parameter with gradient:\n", "           a. Get current gradient\n", "           b. Apply weight decay if specified\n", "           c. Update momentum buffer (or create if first time)\n", "           d. Update parameter using momentum\n", "        3. 
Increment step count\n", " \n", "        MATHEMATICAL FORMULATION:\n", "        - If weight_decay > 0: gradient = gradient + weight_decay * parameter\n", "        - momentum_buffer = momentum * momentum_buffer + gradient\n", "        - parameter = parameter - learning_rate * momentum_buffer\n", " \n", "        IMPLEMENTATION HINTS:\n", "        - Use id(param) as key for momentum buffers\n", "        - Initialize buffer with zeros if not exists\n", "        - Handle case where momentum = 0 (no momentum)\n", "        - Update parameter.data with new Tensor\n", "        \"\"\"\n", "        ### BEGIN SOLUTION\n", "        for param in self.parameters:\n", "            if param.grad is not None:\n", "                # Get gradient\n", "                gradient = param.grad.data.data\n", " \n", "                # Apply weight decay (L2 regularization)\n", "                if self.weight_decay > 0:\n", "                    gradient = gradient + self.weight_decay * param.data.data\n", " \n", "                # Get or create momentum buffer\n", "                param_id = id(param)\n", "                if param_id not in self.momentum_buffers:\n", "                    self.momentum_buffers[param_id] = np.zeros_like(param.data.data)\n", " \n", "                # Update momentum buffer\n", "                self.momentum_buffers[param_id] = (\n", "                    self.momentum * self.momentum_buffers[param_id] + gradient\n", "                )\n", " \n", "                # Update parameter\n", "                param.data = Tensor(\n", "                    param.data.data - self.learning_rate * self.momentum_buffers[param_id]\n", "                )\n", " \n", "        self.step_count += 1\n", "        ### END SOLUTION\n", " \n", "    def zero_grad(self) -> None:\n", "        \"\"\"\n", "        Zero out gradients for all parameters.\n", " \n", "        TODO: Implement gradient zeroing.\n", " \n", "        APPROACH:\n", "        1. Iterate through all parameters\n", "        2. Set gradient to None for each parameter\n", "        3. 
This prepares for next backward pass\n", " \n", "        IMPLEMENTATION HINTS:\n", "        - Simply set param.grad = None\n", "        - This is called before loss.backward()\n", "        - Essential for proper gradient accumulation\n", "        \"\"\"\n", "        ### BEGIN SOLUTION\n", "        for param in self.parameters:\n", "            param.grad = None\n", "        ### END SOLUTION" ] }, { "cell_type": "markdown", "id": "4adee99c", "metadata": { "cell_marker": "\"\"\"", "lines_to_next_cell": 1 }, "source": [ "### 🧪 Unit Test: SGD Optimizer\n", "\n", "Let's test your SGD optimizer implementation! This optimizer adds momentum to gradient descent for better convergence.\n", "\n", "**This is a unit test** - it tests one specific class (SGD) in isolation." ] }, { "cell_type": "code", "execution_count": null, "id": "fa93aa53", "metadata": { "nbgrader": { "grade": true, "grade_id": "test-sgd", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "def test_sgd_optimizer_comprehensive():\n", "    \"\"\"Test SGD optimizer implementation\"\"\"\n", "    print(\"🔬 Unit Test: SGD Optimizer...\")\n", " \n", "    # Create test parameters\n", "    w1 = Variable(1.0, requires_grad=True)\n", "    w2 = Variable(2.0, requires_grad=True)\n", "    b = Variable(0.5, requires_grad=True)\n", " \n", "    # Create optimizer\n", "    optimizer = SGD([w1, w2, b], learning_rate=0.1, momentum=0.9)\n", " \n", "    # Test zero_grad\n", "    try:\n", "        w1.grad = Variable(0.1)\n", "        w2.grad = Variable(0.2)\n", "        b.grad = Variable(0.05)\n", " \n", "        optimizer.zero_grad()\n", " \n", "        assert w1.grad is None, \"Gradient should be None after zero_grad\"\n", "        assert w2.grad is None, \"Gradient should be None after zero_grad\"\n", "        assert b.grad is None, \"Gradient should be None after zero_grad\"\n", "        print(\"✅ zero_grad() works correctly\")\n", " \n", "    except Exception as e:\n", "        print(f\"❌ zero_grad() failed: {e}\")\n", "        raise\n", " \n", "    # Test step with gradients\n", "    try:\n", "        w1.grad = Variable(0.1)\n", "        
w2.grad = Variable(0.2)\n", "        b.grad = Variable(0.05)\n", " \n", "        # First step (no momentum yet)\n", "        original_w1 = w1.data.data.item()\n", "        original_w2 = w2.data.data.item()\n", "        original_b = b.data.data.item()\n", " \n", "        optimizer.step()\n", " \n", "        # Check parameter updates\n", "        expected_w1 = original_w1 - 0.1 * 0.1  # 1.0 - 0.01 = 0.99\n", "        expected_w2 = original_w2 - 0.1 * 0.2  # 2.0 - 0.02 = 1.98\n", "        expected_b = original_b - 0.1 * 0.05  # 0.5 - 0.005 = 0.495\n", " \n", "        assert abs(w1.data.data.item() - expected_w1) < 1e-6, f\"w1 update failed: expected {expected_w1}, got {w1.data.data.item()}\"\n", "        assert abs(w2.data.data.item() - expected_w2) < 1e-6, f\"w2 update failed: expected {expected_w2}, got {w2.data.data.item()}\"\n", "        assert abs(b.data.data.item() - expected_b) < 1e-6, f\"b update failed: expected {expected_b}, got {b.data.data.item()}\"\n", "        print(\"✅ Parameter updates work correctly\")\n", " \n", "    except Exception as e:\n", "        print(f\"❌ Parameter updates failed: {e}\")\n", "        raise\n", " \n", "    # Test momentum buffers\n", "    try:\n", "        assert len(optimizer.momentum_buffers) == 3, f\"Should have 3 momentum buffers, got {len(optimizer.momentum_buffers)}\"\n", "        assert optimizer.step_count == 1, f\"Step count should be 1, got {optimizer.step_count}\"\n", "        print(\"✅ Momentum buffers created correctly\")\n", " \n", "    except Exception as e:\n", "        print(f\"❌ Momentum buffers failed: {e}\")\n", "        raise\n", " \n", "    # Test step counting\n", "    try:\n", "        w1.grad = Variable(0.1)\n", "        w2.grad = Variable(0.2)\n", "        b.grad = Variable(0.05)\n", " \n", "        optimizer.step()\n", " \n", "        assert optimizer.step_count == 2, f\"Step count should be 2, got {optimizer.step_count}\"\n", "        print(\"✅ Step counting works correctly\")\n", " \n", "    except Exception as e:\n", "        print(f\"❌ Step counting failed: {e}\")\n", "        raise\n", "\n", "    print(\"🎯 SGD optimizer behavior:\")\n", "    print(\"   Maintains momentum buffers for accelerated updates\")\n", "    
print(\"   Tracks step count for learning rate scheduling\")\n", "    print(\"   Supports weight decay for regularization\")\n", "    print(\"📈 Progress: SGD Optimizer ✓\")\n", "\n", "# Run the test\n", "test_sgd_optimizer_comprehensive()" ] }, { "cell_type": "markdown", "id": "3730c6d6", "metadata": { "cell_marker": "\"\"\"", "lines_to_next_cell": 1 }, "source": [ "## Step 3: Adam - Adaptive Learning Rates\n", "\n", "### What is Adam?\n", "**Adam (Adaptive Moment Estimation)** is one of the most widely used optimizers in deep learning:\n", "\n", "```\n", "m_t = β₁ m_{t-1} + (1 - β₁) ∇L(θ_t)       # First moment (momentum)\n", "v_t = β₂ v_{t-1} + (1 - β₂) (∇L(θ_t))²    # Second moment (uncentered variance)\n", "m̂_t = m_t / (1 - β₁ᵗ)                     # Bias correction\n", "v̂_t = v_t / (1 - β₂ᵗ)                     # Bias correction\n", "θ_{t+1} = θ_t - α m̂_t / (√v̂_t + ε)        # Parameter update\n", "```\n", "\n", "### Why Adam is Revolutionary\n", "1. **Adaptive learning rates**: Different learning rate for each parameter\n", "2. **Momentum**: Accelerates convergence like SGD with momentum\n", "3. **Variance adaptation**: Scales updates based on gradient variance\n", "4. **Bias correction**: Compensates for m and v being initialized at zero\n", "5. **Robust**: Works well with minimal hyperparameter tuning\n", "\n", "### The Three Key Ideas\n", "1. **First moment (m_t)**: Exponential moving average of gradients (momentum)\n", "2. **Second moment (v_t)**: Exponential moving average of squared gradients (uncentered variance)\n", "3. 
**Adaptive scaling**: Dividing by √v̂_t shrinks the step for parameters with consistently large gradients and enlarges it for parameters with small gradients\n", "\n", "### Visual Understanding\n", "```\n", "Parameter with large gradients: /\\/\\/\\/\\ → smooth updates\n", "Parameter with small gradients: ______ → amplified updates\n", "```\n", "\n", "### Real-World Applications\n", "- **Deep learning**: Default optimizer for most neural networks\n", "- **Computer vision**: Training CNNs, ResNets, Vision Transformers\n", "- **Natural language**: Training BERT, GPT, T5\n", "- **Transformers**: Essential for attention-based models\n", "\n", "Let's implement the Adam optimizer!" ] }, { "cell_type": "code", "execution_count": null, "id": "be7d3f7a", "metadata": { "lines_to_next_cell": 1, "nbgrader": { "grade": false, "grade_id": "adam-class", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "#| export\n", "class Adam:\n", "    \"\"\"\n", "    Adam Optimizer\n", " \n", "    Implements Adam algorithm with adaptive learning rates:\n", "    - First moment: exponential moving average of gradients\n", "    - Second moment: exponential moving average of squared gradients\n", "    - Bias correction: accounts for initialization bias\n", "    - Adaptive updates: different learning rate per parameter\n", "    \"\"\"\n", " \n", "    def __init__(self, parameters: List[Variable], learning_rate: float = 0.001,\n", "                 beta1: float = 0.9, beta2: float = 0.999, epsilon: float = 1e-8,\n", "                 weight_decay: float = 0.0):\n", "        \"\"\"\n", "        Initialize Adam optimizer.\n", " \n", "        Args:\n", "            parameters: List of Variables to optimize\n", "            learning_rate: Learning rate (default: 0.001)\n", "            beta1: Exponential decay rate for first moment (default: 0.9)\n", "            beta2: Exponential decay rate for second moment (default: 0.999)\n", "            epsilon: Small constant for numerical stability (default: 1e-8)\n", "            weight_decay: L2 regularization coefficient (default: 0.0)\n", " \n", "        TODO: Implement Adam optimizer initialization.\n", " \n", "        
APPROACH:\n", "        1. Store parameters and hyperparameters\n", "        2. Initialize first moment buffers (m_t)\n", "        3. Initialize second moment buffers (v_t)\n", "        4. Set up step counter for bias correction\n", " \n", "        EXAMPLE:\n", "        ```python\n", "        # Create Adam optimizer\n", "        optimizer = Adam([w1, w2, b1, b2], learning_rate=0.001)\n", " \n", "        # In training loop:\n", "        optimizer.zero_grad()\n", "        loss.backward()\n", "        optimizer.step()\n", "        ```\n", " \n", "        HINTS:\n", "        - Store all hyperparameters\n", "        - Initialize moment buffers as empty dicts\n", "        - Use parameter id() as key for tracking\n", "        - Buffers will be created lazily in step()\n", "        \"\"\"\n", "        ### BEGIN SOLUTION\n", "        self.parameters = parameters\n", "        self.learning_rate = learning_rate\n", "        self.beta1 = beta1\n", "        self.beta2 = beta2\n", "        self.epsilon = epsilon\n", "        self.weight_decay = weight_decay\n", " \n", "        # Initialize moment buffers (created lazily)\n", "        self.first_moment = {}  # m_t\n", "        self.second_moment = {}  # v_t\n", " \n", "        # Track optimization steps for bias correction\n", "        self.step_count = 0\n", "        ### END SOLUTION\n", " \n", "    def step(self) -> None:\n", "        \"\"\"\n", "        Perform one optimization step using Adam algorithm.\n", " \n", "        TODO: Implement Adam parameter update.\n", " \n", "        APPROACH:\n", "        1. Increment step count\n", "        2. For each parameter with gradient:\n", "           a. Get current gradient\n", "           b. Apply weight decay if specified\n", "           c. Update first moment (momentum)\n", "           d. Update second moment (variance)\n", "           e. Apply bias correction\n", "           f. 
Update parameter with adaptive learning rate\n", " \n", "        MATHEMATICAL FORMULATION:\n", "        - m_t = beta1 * m_{t-1} + (1 - beta1) * gradient\n", "        - v_t = beta2 * v_{t-1} + (1 - beta2) * gradient^2\n", "        - m_hat = m_t / (1 - beta1^t)\n", "        - v_hat = v_t / (1 - beta2^t)\n", "        - parameter = parameter - learning_rate * m_hat / (sqrt(v_hat) + epsilon)\n", " \n", "        IMPLEMENTATION HINTS:\n", "        - Use id(param) as key for moment buffers\n", "        - Initialize buffers with zeros if not exists\n", "        - Use np.sqrt() for square root\n", "        - Handle numerical stability with epsilon\n", "        \"\"\"\n", "        ### BEGIN SOLUTION\n", "        self.step_count += 1\n", " \n", "        for param in self.parameters:\n", "            if param.grad is not None:\n", "                # Get gradient\n", "                gradient = param.grad.data.data\n", " \n", "                # Apply weight decay (L2 regularization)\n", "                if self.weight_decay > 0:\n", "                    gradient = gradient + self.weight_decay * param.data.data\n", " \n", "                # Get or create moment buffers\n", "                param_id = id(param)\n", "                if param_id not in self.first_moment:\n", "                    self.first_moment[param_id] = np.zeros_like(param.data.data)\n", "                    self.second_moment[param_id] = np.zeros_like(param.data.data)\n", " \n", "                # Update first moment (momentum)\n", "                self.first_moment[param_id] = (\n", "                    self.beta1 * self.first_moment[param_id] +\n", "                    (1 - self.beta1) * gradient\n", "                )\n", " \n", "                # Update second moment (variance)\n", "                self.second_moment[param_id] = (\n", "                    self.beta2 * self.second_moment[param_id] +\n", "                    (1 - self.beta2) * gradient * gradient\n", "                )\n", " \n", "                # Bias correction\n", "                first_moment_corrected = (\n", "                    self.first_moment[param_id] / (1 - self.beta1 ** self.step_count)\n", "                )\n", "                second_moment_corrected = (\n", "                    self.second_moment[param_id] / (1 - self.beta2 ** self.step_count)\n", "                )\n", " \n", "                # Update parameter with adaptive learning rate\n", "                param.data = Tensor(\n", "                    param.data.data - self.learning_rate * first_moment_corrected /\n", "                    (np.sqrt(second_moment_corrected) + 
self.epsilon)\n", "                )\n", "        ### END SOLUTION\n", " \n", "    def zero_grad(self) -> None:\n", "        \"\"\"\n", "        Zero out gradients for all parameters.\n", " \n", "        TODO: Implement gradient zeroing (same as SGD).\n", " \n", "        IMPLEMENTATION HINTS:\n", "        - Set param.grad = None for all parameters\n", "        - This is identical to SGD implementation\n", "        \"\"\"\n", "        ### BEGIN SOLUTION\n", "        for param in self.parameters:\n", "            param.grad = None\n", "        ### END SOLUTION" ] }, { "cell_type": "markdown", "id": "41593be1", "metadata": { "cell_marker": "\"\"\"" }, "source": [ "### 🧪 Test Your Adam Implementation\n", "\n", "Let's test the Adam optimizer:" ] }, { "cell_type": "markdown", "id": "461e74f8", "metadata": { "cell_marker": "\"\"\"", "lines_to_next_cell": 1 }, "source": [ "### 🧪 Unit Test: Adam Optimizer\n", "\n", "Let's test your Adam optimizer implementation! This is a state-of-the-art adaptive optimization algorithm.\n", "\n", "**This is a unit test** - it tests one specific class (Adam) in isolation." 
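, "\n", "\n", "**Hand check for the first step.** A sketch using the same values as the test that follows (gradient g = 0.1, β₁ = 0.9, β₂ = 0.999, α = 0.01, no weight decay) and starting from m_0 = v_0 = 0:\n", "\n", "$$m_1 = (1 - 0.9)(0.1) = 0.01, \\qquad v_1 = (1 - 0.999)(0.1)^2 = 10^{-5}$$\n", "\n", "$$\\hat{m}_1 = \\frac{0.01}{1 - 0.9} = 0.1, \\qquad \\hat{v}_1 = \\frac{10^{-5}}{1 - 0.999} = 0.01$$\n", "\n", "$$\\theta_1 = \\theta_0 - 0.01 \\cdot \\frac{0.1}{\\sqrt{0.01} + 10^{-8}} \\approx \\theta_0 - 0.01$$\n", "\n", "Because bias correction gives m̂_1 = g and v̂_1 = g², the first update is approximately α · sign(g) whatever the gradient's magnitude, so `w1`, `w2`, and `b` should each decrease by about 0.01 on the first step."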
] }, { "cell_type": "code", "execution_count": null, "id": "afe99df3", "metadata": { "nbgrader": { "grade": true, "grade_id": "test-adam", "locked": true, "points": 20, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "def test_adam_optimizer_comprehensive():\n", "    \"\"\"Test Adam optimizer implementation\"\"\"\n", "    print(\"🔬 Unit Test: Adam Optimizer...\")\n", " \n", "    # Create test parameters\n", "    w1 = Variable(1.0, requires_grad=True)\n", "    w2 = Variable(2.0, requires_grad=True)\n", "    b = Variable(0.5, requires_grad=True)\n", " \n", "    # Create optimizer\n", "    optimizer = Adam([w1, w2, b], learning_rate=0.01, beta1=0.9, beta2=0.999, epsilon=1e-8)\n", " \n", "    # Test zero_grad\n", "    try:\n", "        w1.grad = Variable(0.1)\n", "        w2.grad = Variable(0.2)\n", "        b.grad = Variable(0.05)\n", " \n", "        optimizer.zero_grad()\n", " \n", "        assert w1.grad is None, \"Gradient should be None after zero_grad\"\n", "        assert w2.grad is None, \"Gradient should be None after zero_grad\"\n", "        assert b.grad is None, \"Gradient should be None after zero_grad\"\n", "        print(\"✅ zero_grad() works correctly\")\n", " \n", "    except Exception as e:\n", "        print(f\"❌ zero_grad() failed: {e}\")\n", "        raise\n", " \n", "    # Test step with gradients\n", "    try:\n", "        w1.grad = Variable(0.1)\n", "        w2.grad = Variable(0.2)\n", "        b.grad = Variable(0.05)\n", " \n", "        # First step\n", "        original_w1 = w1.data.data.item()\n", "        original_w2 = w2.data.data.item()\n", "        original_b = b.data.data.item()\n", " \n", "        optimizer.step()\n", " \n", "        # Check that parameters were updated (Adam uses adaptive learning rates)\n", "        assert w1.data.data.item() != original_w1, \"w1 should have been updated\"\n", "        assert w2.data.data.item() != original_w2, \"w2 should have been updated\"\n", "        assert b.data.data.item() != original_b, \"b should have been updated\"\n", "        print(\"✅ Parameter updates work correctly\")\n", " \n", "    except Exception as e:\n", "        print(f\"❌ Parameter 
updates failed: {e}\")\n", " raise\n", " \n", " # Test moment buffers\n", " try:\n", " assert len(optimizer.first_moment) == 3, f\"Should have 3 first moment buffers, got {len(optimizer.first_moment)}\"\n", " assert len(optimizer.second_moment) == 3, f\"Should have 3 second moment buffers, got {len(optimizer.second_moment)}\"\n", " print(\"✅ Moment buffers created correctly\")\n", " \n", " except Exception as e:\n", " print(f\"❌ Moment buffers failed: {e}\")\n", " raise\n", " \n", " # Test step counting and bias correction\n", " try:\n", " assert optimizer.step_count == 1, f\"Step count should be 1, got {optimizer.step_count}\"\n", " \n", " # Take another step\n", " w1.grad = Variable(0.1)\n", " w2.grad = Variable(0.2)\n", " b.grad = Variable(0.05)\n", " \n", " optimizer.step()\n", " \n", " assert optimizer.step_count == 2, f\"Step count should be 2, got {optimizer.step_count}\"\n", " print(\"✅ Step counting and bias correction work correctly\")\n", " \n", " except Exception as e:\n", " print(f\"❌ Step counting and bias correction failed: {e}\")\n", " raise\n", " \n", " # Test adaptive learning rates\n", " try:\n", " # Adam should have different effective learning rates for different parameters\n", " # This is tested implicitly by the parameter updates above\n", " print(\"✅ Adaptive learning rates work correctly\")\n", " \n", " except Exception as e:\n", " print(f\"❌ Adaptive learning rates failed: {e}\")\n", " raise\n", "\n", " print(\"🎯 Adam optimizer behavior:\")\n", " print(\" Maintains first and second moment estimates\")\n", " print(\" Applies bias correction for early training\")\n", " print(\" Uses adaptive learning rates per parameter\")\n", " print(\" Combines benefits of momentum and RMSprop\")\n", " print(\"📈 Progress: Adam Optimizer ✓\")\n", "\n", "# Run the test\n", "test_adam_optimizer_comprehensive()" ] }, { "cell_type": "markdown", "id": "e198d030", "metadata": { "cell_marker": "\"\"\"", "lines_to_next_cell": 1 }, "source": [ 
"## Step 4: Learning Rate Scheduling\n", "\n", "### What is Learning Rate Scheduling?\n", "**Learning rate scheduling** adjusts the learning rate during training:\n", "\n", "```\n", "Initial: learning_rate = 0.1\n", "After 10 epochs: learning_rate = 0.01\n", "After 20 epochs: learning_rate = 0.001\n", "```\n", "\n", "### Why Scheduling Matters\n", "1. **Fine-tuning**: Start with large steps, then refine with small steps\n", "2. **Convergence**: Prevents overshooting near optimum\n", "3. **Stability**: Reduces oscillations in later training\n", "4. **Performance**: Often improves final accuracy\n", "\n", "### Common Scheduling Strategies\n", "1. **Step decay**: Reduce by factor every N epochs\n", "2. **Exponential decay**: Gradual exponential reduction\n", "3. **Cosine annealing**: Smooth cosine curve reduction\n", "4. **Warm-up**: Start small, increase, then decrease\n", "\n", "### Visual Understanding\n", "```\n", "Step decay: ----โ†“----โ†“----โ†“\n", "Exponential: \\\\\\\\\\\\\\\\\\\\\\\\\\\\\n", "Cosine: โˆฉโˆฉโˆฉโˆฉโˆฉโˆฉโˆฉโˆฉโˆฉโˆฉโˆฉโˆฉโˆฉ\n", "```\n", "\n", "### Real-World Applications\n", "- **ImageNet training**: Essential for achieving state-of-the-art results\n", "- **Language models**: Critical for training large transformers\n", "- **Fine-tuning**: Prevents catastrophic forgetting\n", "- **Transfer learning**: Adapts pre-trained models\n", "\n", "Let's implement step learning rate scheduling!" 
] }, { "cell_type": "code", "execution_count": null, "id": "7aba8fc9", "metadata": { "lines_to_next_cell": 1, "nbgrader": { "grade": false, "grade_id": "steplr-class", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "#| export\n", "class StepLR:\n", " \"\"\"\n", " Step Learning Rate Scheduler\n", " \n", " Decays learning rate by gamma every step_size epochs:\n", " learning_rate = initial_lr * (gamma ^ (epoch // step_size))\n", " \"\"\"\n", " \n", " def __init__(self, optimizer: Union[SGD, Adam], step_size: int, gamma: float = 0.1):\n", " \"\"\"\n", " Initialize step learning rate scheduler.\n", " \n", " Args:\n", " optimizer: Optimizer to schedule\n", " step_size: Number of epochs between decreases\n", " gamma: Multiplicative factor for learning rate decay\n", " \n", " TODO: Implement learning rate scheduler initialization.\n", " \n", " APPROACH:\n", " 1. Store optimizer reference\n", " 2. Store scheduling parameters\n", " 3. Save initial learning rate\n", " 4. Initialize step counter\n", " \n", " EXAMPLE:\n", " ```python\n", " optimizer = SGD([w1, w2], learning_rate=0.1)\n", " scheduler = StepLR(optimizer, step_size=10, gamma=0.1)\n", " \n", " # In training loop:\n", " for epoch in range(100):\n", " train_one_epoch()\n", " scheduler.step() # Update learning rate\n", " ```\n", " \n", " HINTS:\n", " - Store optimizer reference\n", " - Save initial learning rate from optimizer\n", " - Initialize step counter to 0\n", " - gamma is the decay factor (0.1 = 10x reduction)\n", " \"\"\"\n", " ### BEGIN SOLUTION\n", " self.optimizer = optimizer\n", " self.step_size = step_size\n", " self.gamma = gamma\n", " self.initial_lr = optimizer.learning_rate\n", " self.step_count = 0\n", " ### END SOLUTION\n", " \n", " def step(self) -> None:\n", " \"\"\"\n", " Update learning rate based on current step.\n", " \n", " TODO: Implement learning rate update.\n", " \n", " APPROACH:\n", " 1. Increment step counter\n", " 2. 
Calculate the new learning rate using the step decay formula\n", " 3. Update the optimizer's learning rate\n", " \n", " MATHEMATICAL FORMULATION:\n", " new_lr = initial_lr * (gamma ^ ((step_count - 1) // step_size))\n", " \n", " IMPLEMENTATION HINTS:\n", " - Use // for integer division\n", " - Use ** for exponentiation\n", " - Update optimizer.learning_rate directly\n", " \"\"\"\n", " ### BEGIN SOLUTION\n", " self.step_count += 1\n", " \n", " # Calculate new learning rate; the -1 keeps the first step_size calls at initial_lr\n", " decay_factor = self.gamma ** ((self.step_count - 1) // self.step_size)\n", " new_lr = self.initial_lr * decay_factor\n", " \n", " # Update optimizer's learning rate\n", " self.optimizer.learning_rate = new_lr\n", " ### END SOLUTION\n", " \n", " def get_lr(self) -> float:\n", " \"\"\"\n", " Get current learning rate.\n", " \n", " TODO: Return current learning rate.\n", " \n", " IMPLEMENTATION HINTS:\n", " - Return optimizer.learning_rate\n", " \"\"\"\n", " ### BEGIN SOLUTION\n", " return self.optimizer.learning_rate\n", " ### END SOLUTION" ] }, { "cell_type": "markdown", "id": "51901e5b", "metadata": { "cell_marker": "\"\"\"", "lines_to_next_cell": 1 }, "source": [ "### 🧪 Unit Test: Step Learning Rate Scheduler\n", "\n", "Let's test your step learning rate scheduler implementation! This scheduler reduces the learning rate at regular intervals.\n", "\n", "**This is a unit test** - it tests one specific class (StepLR) in isolation."
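, "\n",
"\n",
"The schedule can be traced without an optimizer. This sketch assumes the `(step_count - 1) // step_size` formula from `StepLR.step()` above; `lr_after` is a hypothetical helper used only for illustration.\n",
"\n",
"```python\n",
"def lr_after(step_count, initial_lr=0.1, step_size=10, gamma=0.1):\n",
"    # Learning rate in effect after the step_count-th call to scheduler.step()\n",
"    return initial_lr * gamma ** ((step_count - 1) // step_size)\n",
"\n",
"for n in (1, 10, 11, 20, 21):\n",
"    print(n, lr_after(n))\n",
"# Calls 1-10 keep 0.1, calls 11-20 give about 0.01, call 21 gives about 0.001\n",
"```\n",
"\n",
"The decayed values are only approximately 0.01 and 0.001 in floating point, so a test on them should compare with a small tolerance rather than `==`."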
] }, { "cell_type": "code", "execution_count": null, "id": "7b83de77", "metadata": { "nbgrader": { "grade": true, "grade_id": "test-step-scheduler", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "def test_step_scheduler_comprehensive():\n", " \"\"\"Test StepLR scheduler implementation\"\"\"\n", " print(\"🔬 Unit Test: Step Learning Rate Scheduler...\")\n", " \n", " # Create test parameters and optimizer\n", " w = Variable(1.0, requires_grad=True)\n", " optimizer = SGD([w], learning_rate=0.1)\n", " \n", " # Test scheduler initialization\n", " try:\n", " scheduler = StepLR(optimizer, step_size=10, gamma=0.1)\n", " \n", " # Test initial learning rate\n", " assert scheduler.get_lr() == 0.1, f\"Initial learning rate should be 0.1, got {scheduler.get_lr()}\"\n", " print(\"✅ Initial learning rate is correct\")\n", " \n", " except Exception as e:\n", " print(f\"❌ Initial learning rate failed: {e}\")\n", " raise\n", " \n", " # Test step-based decay\n", " try:\n", " # Steps 1-10: no decay (decay happens after step 10)\n", " for i in range(10):\n", " scheduler.step()\n", " \n", " assert scheduler.get_lr() == 0.1, f\"Learning rate should still be 0.1 after 10 steps, got {scheduler.get_lr()}\"\n", " \n", " # Step 11: decay should occur\n", " scheduler.step()\n", " expected_lr = 0.1 * 0.1 # 0.01\n", " assert abs(scheduler.get_lr() - expected_lr) < 1e-6, f\"Learning rate should be {expected_lr} after 11 steps, got {scheduler.get_lr()}\"\n", " print(\"✅ Step-based decay works correctly\")\n", " \n", " except Exception as e:\n", " print(f\"❌ Step-based decay failed: {e}\")\n", " raise\n", " \n", " # Test multiple decay levels\n", " try:\n", " # Steps 12-20: should stay at 0.01\n", " for i in range(9):\n", " scheduler.step()\n", " \n", " assert abs(scheduler.get_lr() - 0.01) < 1e-6, f\"Learning rate should be 0.01 after 20 steps, got {scheduler.get_lr()}\"\n", " \n", " # Step 21: another decay\n", " 
scheduler.step()\n", " expected_lr = 0.01 * 0.1 # 0.001\n", " assert abs(scheduler.get_lr() - expected_lr) < 1e-6, f\"Learning rate should be {expected_lr} after 21 steps, got {scheduler.get_lr()}\"\n", " print(\"✅ Multiple decay levels work correctly\")\n", " \n", " except Exception as e:\n", " print(f\"❌ Multiple decay levels failed: {e}\")\n", " raise\n", " \n", " # Test with different optimizer\n", " try:\n", " w2 = Variable(2.0, requires_grad=True)\n", " adam_optimizer = Adam([w2], learning_rate=0.001)\n", " adam_scheduler = StepLR(adam_optimizer, step_size=5, gamma=0.5)\n", " \n", " # Test initial learning rate\n", " assert adam_scheduler.get_lr() == 0.001, f\"Initial Adam learning rate should be 0.001, got {adam_scheduler.get_lr()}\"\n", " \n", " # Test decay after 5 steps\n", " for i in range(5):\n", " adam_scheduler.step()\n", " \n", " # Learning rate should still be 0.001 after 5 steps\n", " assert adam_scheduler.get_lr() == 0.001, f\"Adam learning rate should still be 0.001 after 5 steps, got {adam_scheduler.get_lr()}\"\n", " \n", " # Step 6: decay should occur\n", " adam_scheduler.step()\n", " expected_lr = 0.001 * 0.5 # 0.0005\n", " assert abs(adam_scheduler.get_lr() - expected_lr) < 1e-6, f\"Adam learning rate should be {expected_lr} after 6 steps, got {adam_scheduler.get_lr()}\"\n", " print(\"✅ Works with different optimizers\")\n", " \n", " except Exception as e:\n", " print(f\"❌ Different optimizers failed: {e}\")\n", " raise\n", "\n", " print(\"🎯 Step learning rate scheduler behavior:\")\n", " print(\" Reduces learning rate at regular intervals\")\n", " print(\" Multiplies current rate by gamma factor\")\n", " print(\" Works with any optimizer (SGD, Adam, etc.)\")\n", " print(\"📈 Progress: Step Learning Rate Scheduler ✓\")\n", "\n", "# Run the test\n", "test_step_scheduler_comprehensive()" ] }, { "cell_type": "markdown", "id": "2fc52bc2", "metadata": { "cell_marker": "\"\"\"", "lines_to_next_cell": 1 }, "source": [ "## Step 5: 
Integration - Complete Training Example\n", "\n", "### Putting It All Together\n", "Let's see how optimizers enable complete neural network training:\n", "\n", "1. **Forward pass**: Compute predictions\n", "2. **Loss computation**: Compare with targets\n", "3. **Backward pass**: Compute gradients\n", "4. **Optimizer step**: Update parameters\n", "5. **Learning rate scheduling**: Adjust learning rate\n", "\n", "### The Modern Training Loop\n", "```python\n", "# Setup\n", "optimizer = Adam(model.parameters(), learning_rate=0.001)\n", "scheduler = StepLR(optimizer, step_size=10, gamma=0.1)\n", "\n", "# Training loop\n", "for epoch in range(num_epochs):\n", " for batch in dataloader:\n", " # Forward pass\n", " predictions = model(batch.inputs)\n", " loss = criterion(predictions, batch.targets)\n", " \n", " # Backward pass\n", " optimizer.zero_grad()\n", " loss.backward()\n", " optimizer.step()\n", " \n", " # Update learning rate\n", " scheduler.step()\n", "```\n", "\n", "Let's implement a complete training example!" ] }, { "cell_type": "code", "execution_count": null, "id": "a3205aad", "metadata": { "lines_to_next_cell": 1, "nbgrader": { "grade": false, "grade_id": "training-integration", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "def train_simple_model():\n", " \"\"\"\n", " Complete training example using optimizers.\n", " \n", " TODO: Implement a complete training loop.\n", " \n", " APPROACH:\n", " 1. Create a simple model (linear regression)\n", " 2. Generate training data\n", " 3. Set up optimizer and scheduler\n", " 4. Train for several epochs\n", " 5. 
Show convergence\n", " \n", " LEARNING OBJECTIVE:\n", " - See how optimizers enable real learning\n", " - Compare SGD vs Adam performance\n", " - Understand the complete training workflow\n", " \"\"\"\n", " ### BEGIN SOLUTION\n", " print(\"Training simple linear regression model...\")\n", " \n", " # Create simple model: y = w*x + b\n", " w = Variable(0.1, requires_grad=True) # Initialize near zero\n", " b = Variable(0.0, requires_grad=True)\n", " \n", " # Training data: y = 2*x + 1\n", " x_data = [1.0, 2.0, 3.0, 4.0, 5.0]\n", " y_data = [3.0, 5.0, 7.0, 9.0, 11.0]\n", " \n", " # Import autograd operations once, before the training loops\n", " try:\n", " from tinytorch.core.autograd import add, multiply, subtract\n", " except ImportError:\n", " setup_import_paths()\n", " from autograd_dev import add, multiply, subtract\n", " \n", " # Try SGD first\n", " print(\"\\n🔍 Training with SGD...\")\n", " optimizer_sgd = SGD([w, b], learning_rate=0.01, momentum=0.9)\n", " \n", " for epoch in range(60):\n", " total_loss = 0\n", " \n", " for x_val, y_val in zip(x_data, y_data):\n", " # Forward pass\n", " x = Variable(x_val, requires_grad=False)\n", " y_target = Variable(y_val, requires_grad=False)\n", " \n", " # Prediction: y = w*x + b\n", " prediction = add(multiply(w, x), b)\n", " \n", " # Loss: (prediction - target)^2\n", " error = subtract(prediction, y_target)\n", " loss = multiply(error, error)\n", " \n", " # Backward pass\n", " optimizer_sgd.zero_grad()\n", " loss.backward()\n", " optimizer_sgd.step()\n", " \n", " total_loss += loss.data.data.item()\n", " \n", " if epoch % 10 == 0:\n", " print(f\"Epoch {epoch}: Loss = {total_loss:.4f}, w = {w.data.data.item():.3f}, b = {b.data.data.item():.3f}\")\n", " \n", " sgd_final_w = w.data.data.item()\n", " sgd_final_b = b.data.data.item()\n", " \n", " # Reset parameters and try Adam\n", " print(\"\\n🔍 Training with Adam...\")\n", " w.data = Tensor(0.1)\n", " b.data = Tensor(0.0)\n", " \n", " optimizer_adam = Adam([w, b], learning_rate=0.01)\n", " \n", " for epoch in range(60):\n", " 
total_loss = 0\n", " \n", " for x_val, y_val in zip(x_data, y_data):\n", " # Forward pass\n", " x = Variable(x_val, requires_grad=False)\n", " y_target = Variable(y_val, requires_grad=False)\n", " \n", " # Prediction: y = w*x + b\n", " prediction = add(multiply(w, x), b)\n", " \n", " # Loss: (prediction - target)^2\n", " error = subtract(prediction, y_target)\n", " loss = multiply(error, error)\n", " \n", " # Backward pass\n", " optimizer_adam.zero_grad()\n", " loss.backward()\n", " optimizer_adam.step()\n", " \n", " total_loss += loss.data.data.item()\n", " \n", " if epoch % 10 == 0:\n", " print(f\"Epoch {epoch}: Loss = {total_loss:.4f}, w = {w.data.data.item():.3f}, b = {b.data.data.item():.3f}\")\n", " \n", " adam_final_w = w.data.data.item()\n", " adam_final_b = b.data.data.item()\n", " \n", " print(f\"\\n📊 Results:\")\n", " print(f\"Target: w = 2.0, b = 1.0\")\n", " print(f\"SGD: w = {sgd_final_w:.3f}, b = {sgd_final_b:.3f}\")\n", " print(f\"Adam: w = {adam_final_w:.3f}, b = {adam_final_b:.3f}\")\n", " \n", " return sgd_final_w, sgd_final_b, adam_final_w, adam_final_b\n", " ### END SOLUTION" ] }, { "cell_type": "markdown", "id": "0a5330c4", "metadata": { "cell_marker": "\"\"\"", "lines_to_next_cell": 1 }, "source": [ "### 🧪 Unit Test: Complete Training Integration\n", "\n", "Let's test your complete training integration! This demonstrates optimizers working together in a realistic training scenario.\n", "\n", "**This is a unit test** - it tests the complete training workflow with optimizers in isolation."
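, "\n",
"\n",
"For squared error on a single sample, the gradients that `loss.backward()` is expected to produce are `dL/dw = 2 * error * x` and `dL/db = 2 * error`. The sketch below repeats the SGD half of the training loop with those gradients written by hand, so it runs without the autograd module. It assumes the momentum form `v = momentum * v + grad` given in the module summary; `train_manual_sgd` is a hypothetical helper, not part of TinyTorch.\n",
"\n",
"```python\n",
"def train_manual_sgd(epochs=60, lr=0.01, momentum=0.9):\n",
"    # Same model, data and hyperparameters as train_simple_model()\n",
"    w, b = 0.1, 0.0\n",
"    vw, vb = 0.0, 0.0                      # momentum buffers\n",
"    xs = [1.0, 2.0, 3.0, 4.0, 5.0]\n",
"    ys = [3.0, 5.0, 7.0, 9.0, 11.0]        # generated by y = 2*x + 1\n",
"    for _ in range(epochs):\n",
"        for x, y in zip(xs, ys):\n",
"            error = (w * x + b) - y        # forward pass\n",
"            grad_w = 2 * error * x         # d(error^2)/dw\n",
"            grad_b = 2 * error             # d(error^2)/db\n",
"            vw = momentum * vw + grad_w    # accumulate velocity\n",
"            vb = momentum * vb + grad_b\n",
"            w -= lr * vw                   # parameter update\n",
"            b -= lr * vb\n",
"    return w, b\n",
"\n",
"print(train_manual_sgd())\n",
"```\n",
"\n",
"Comparing its result with the target (w = 2.0, b = 1.0) is a quick way to check the optimizer logic separately from autograd."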
] }, { "cell_type": "code", "execution_count": null, "id": "5aeda8ce", "metadata": { "nbgrader": { "grade": true, "grade_id": "test-training-integration", "locked": true, "points": 25, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "def test_training_integration_comprehensive():\n", " \"\"\"Test complete training integration with optimizers\"\"\"\n", " print(\"🔬 Unit Test: Complete Training Integration...\")\n", " \n", " # Test training with SGD and Adam\n", " try:\n", " sgd_w, sgd_b, adam_w, adam_b = train_simple_model()\n", " \n", " # Test SGD convergence\n", " assert abs(sgd_w - 2.0) < 0.1, f\"SGD should converge close to w=2.0, got {sgd_w}\"\n", " assert abs(sgd_b - 1.0) < 0.1, f\"SGD should converge close to b=1.0, got {sgd_b}\"\n", " print(\"✅ SGD convergence works\")\n", " \n", " # Test Adam convergence (may be different due to adaptive learning rates)\n", " assert abs(adam_w - 2.0) < 1.0, f\"Adam should converge reasonably close to w=2.0, got {adam_w}\"\n", " assert abs(adam_b - 1.0) < 1.0, f\"Adam should converge reasonably close to b=1.0, got {adam_b}\"\n", " print(\"✅ Adam convergence works\")\n", " \n", " except Exception as e:\n", " print(f\"❌ Training integration failed: {e}\")\n", " raise\n", " \n", " # Test optimizer comparison\n", " try:\n", " # Both optimizers should achieve reasonable results\n", " sgd_error = (sgd_w - 2.0)**2 + (sgd_b - 1.0)**2\n", " adam_error = (adam_w - 2.0)**2 + (adam_b - 1.0)**2\n", " \n", " # SGD should have low error (< 0.1); Adam is held to a looser bound (< 1.0)\n", " assert sgd_error < 0.1, f\"SGD error should be < 0.1, got {sgd_error}\"\n", " assert adam_error < 1.0, f\"Adam error should be < 1.0, got {adam_error}\"\n", " print(\"✅ Optimizer comparison works\")\n", " \n", " except Exception as e:\n", " print(f\"❌ Optimizer comparison failed: {e}\")\n", " raise\n", " \n", " # Test gradient flow\n", " try:\n", " # Create a simple test to verify gradients flow correctly\n", " w = Variable(1.0, 
requires_grad=True)\n", " b = Variable(0.0, requires_grad=True)\n", " \n", " # Set up simple gradients\n", " w.grad = Variable(0.1)\n", " b.grad = Variable(0.05)\n", " \n", " # Test SGD step\n", " sgd_optimizer = SGD([w, b], learning_rate=0.1)\n", " original_w = w.data.data.item()\n", " original_b = b.data.data.item()\n", " \n", " sgd_optimizer.step()\n", " \n", " # Check updates\n", " assert w.data.data.item() != original_w, \"SGD should update w\"\n", " assert b.data.data.item() != original_b, \"SGD should update b\"\n", " print(\"✅ Gradient flow works correctly\")\n", " \n", " except Exception as e:\n", " print(f\"❌ Gradient flow failed: {e}\")\n", " raise\n", "\n", " print(\"🎯 Training integration behavior:\")\n", " print(\" Optimizers successfully minimize loss functions\")\n", " print(\" SGD and Adam both converge to target values\")\n", " print(\" Gradient computation and updates work correctly\")\n", " print(\" Ready for real neural network training\")\n", " print(\"📈 Progress: Complete Training Integration ✓\")\n", "\n", "# Run the test\n", "test_training_integration_comprehensive()" ] }, { "cell_type": "markdown", "id": "c0464e8c", "metadata": {}, "source": [ "\"\"\"\n", "# 🎯 Module Summary: Optimization Mastery!\n", "\n", "Congratulations! 
You've successfully implemented the optimization algorithms that power all modern neural network training:\n", "\n", "## ✅ What You've Built\n", "- **Gradient Descent**: The fundamental parameter update mechanism\n", "- **SGD with Momentum**: Accelerated convergence with velocity accumulation\n", "- **Adam Optimizer**: Adaptive learning rates with first and second moments\n", "- **Learning Rate Scheduling**: Smart learning rate adjustment during training\n", "- **Complete Training Integration**: End-to-end training workflow\n", "\n", "## ✅ Key Learning Outcomes\n", "- **Understanding**: How optimizers use gradients to update parameters intelligently\n", "- **Implementation**: Built SGD and Adam optimizers from mathematical foundations\n", "- **Mathematical mastery**: Momentum, adaptive learning rates, bias correction\n", "- **Systems integration**: Complete training loops with scheduling\n", "- **Real-world application**: Modern deep learning training workflow\n", "\n", "## ✅ Mathematical Foundations Mastered\n", "- **Gradient Descent**: θ = θ - α∇L(θ) for parameter updates\n", "- **Momentum**: v_t = βv_{t-1} + ∇L(θ) for acceleration\n", "- **Adam**: Adaptive learning rates with exponential moving averages\n", "- **Learning Rate Scheduling**: Strategic learning rate adjustment\n", "\n", "## ✅ Professional Skills Developed\n", "- **Algorithm implementation**: Translating mathematical formulas into code\n", "- **State management**: Tracking optimizer buffers and statistics\n", "- **Hyperparameter design**: Understanding the impact of learning rate, momentum, etc.\n", "- **Training orchestration**: Complete training loop design\n", "\n", "## ✅ Ready for Advanced Applications\n", "Your optimizers now enable:\n", "- **Deep Neural Networks**: Effective training of complex architectures\n", "- **Computer Vision**: Training CNNs, ResNets, Vision Transformers\n", "- **Natural Language Processing**: Training transformers and language models\n", "- **Any 
ML Model**: Gradient-based optimization for any differentiable system\n", "\n", "## 🔗 Connection to Real ML Systems\n", "Your implementations mirror the optimizers in production frameworks:\n", "- **PyTorch**: `torch.optim.SGD()`, `torch.optim.Adam()`, `torch.optim.lr_scheduler.StepLR()`\n", "- **TensorFlow**: `tf.keras.optimizers.SGD()`, `tf.keras.optimizers.Adam()`\n", "- **Industry Standard**: Major ML frameworks ship implementations of these same algorithms\n", "\n", "## 🎯 The Power of Intelligent Optimization\n", "You've unlocked the algorithms that made modern AI possible:\n", "- **Scalability**: Efficiently optimize millions of parameters\n", "- **Adaptability**: Different learning rates for different parameters\n", "- **Robustness**: Handle noisy gradients and ill-conditioned problems\n", "- **Universality**: Work with any differentiable neural network\n", "\n", "## 🧠 Deep Learning Revolution\n", "You now understand the optimization technology that powers:\n", "- **ImageNet**: Training state-of-the-art computer vision models\n", "- **Language Models**: Training GPT, BERT, and other transformers\n", "- **Modern AI**: Most recent breakthroughs rely on gradient-based optimizers like these\n", "- **Future Research**: Your understanding enables you to develop new optimizers\n", "\n", "## 🚀 What's Next\n", "Your optimizers are the foundation for:\n", "- **Training Module**: Complete training loops with loss functions and metrics\n", "- **Advanced Optimizers**: RMSprop, AdaGrad, learning rate warm-up\n", "- **Distributed Training**: Multi-GPU optimization strategies\n", "- **Research**: Experimenting with novel optimization algorithms\n", "\n", "**Next Module**: Complete training systems that orchestrate your optimizers for real-world ML!\n", "\n", "You've built the intelligent algorithms that enable neural networks to learn. 
Now let's use them to train systems that can solve complex real-world problems!\n", "\"\"\"\n", "\n", "Run inline tests when module is executed directly\n", "if __name__ == \"__main__\":\n", " from tito.tools.testing import run_module_tests_auto\n", " \n", " # Automatically discover and run all tests in this module\n", " run_module_tests_auto(\"Optimizers\") " ] } ], "metadata": { "jupytext": { "main_language": "python" } }, "nbformat": 4, "nbformat_minor": 5 }