mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-05-06 20:14:44 -05:00
- Exported 09_training module using nbdev directly from Python file
- Exported 08_optimizers module to resolve import dependencies
- All training components now available in tinytorch.core.training:
  * MeanSquaredError, CrossEntropyLoss, BinaryCrossEntropyLoss
  * Accuracy metric
  * Trainer class with complete training orchestration
- All optimizers now available in tinytorch.core.optimizers:
  * SGD, Adam optimizers
  * StepLR learning rate scheduler
- All components properly exported and functional
- Integration tests passing (17/17)
- Inline tests passing (6/6)
- tito CLI integration working correctly

Package exports:
- tinytorch.core.training: 688 lines, 5 main classes
- tinytorch.core.optimizers: 17,396 bytes, complete optimizer suite
- Clean separation of development vs package code
- Ready for production use and further development
1755 lines
68 KiB
Plaintext
{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "602ba54a",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\""
|
||
},
|
||
"source": [
|
||
"# Module 8: Optimizers - Gradient-Based Parameter Updates\n",
|
||
"\n",
|
||
"Welcome to the Optimizers module! This is where neural networks learn to improve through intelligent parameter updates.\n",
|
||
"\n",
|
||
"## Learning Goals\n",
|
||
"- Understand gradient descent and how optimizers use gradients to update parameters\n",
|
||
"- Implement SGD with momentum for accelerated convergence\n",
|
||
"- Build Adam optimizer with adaptive learning rates\n",
|
||
"- Master learning rate scheduling strategies\n",
|
||
"- See how optimizers enable effective neural network training\n",
|
||
"\n",
|
||
"## Build → Use → Analyze\n",
|
||
"1. **Build**: Core optimization algorithms (SGD, Adam)\n",
|
||
"2. **Use**: Apply optimizers to train neural networks\n",
|
||
"3. **Analyze**: Compare optimizer behavior and convergence patterns"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "e3b359ed",
|
||
"metadata": {
|
||
"nbgrader": {
|
||
"grade": false,
|
||
"grade_id": "optimizers-imports",
|
||
"locked": false,
|
||
"schema_version": 3,
|
||
"solution": false,
|
||
"task": false
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"#| default_exp core.optimizers\n",
|
||
"\n",
|
||
"#| export\n",
|
||
"import math\n",
|
||
"import numpy as np\n",
|
||
"import sys\n",
|
||
"import os\n",
|
||
"from typing import List, Dict, Any, Optional, Union\n",
|
||
"from collections import defaultdict\n",
|
||
"\n",
|
||
"# Helper function to set up import paths\n",
|
||
"def setup_import_paths():\n",
|
||
" \"\"\"Set up import paths for development modules.\"\"\"\n",
|
||
" import sys\n",
|
||
" import os\n",
|
||
" \n",
|
||
" # Add module directories to path\n",
|
||
" base_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))\n",
|
||
" tensor_dir = os.path.join(base_dir, '01_tensor')\n",
|
||
" autograd_dir = os.path.join(base_dir, '07_autograd')\n",
|
||
" \n",
|
||
" if tensor_dir not in sys.path:\n",
|
||
" sys.path.append(tensor_dir)\n",
|
||
" if autograd_dir not in sys.path:\n",
|
||
" sys.path.append(autograd_dir)\n",
|
||
"\n",
|
||
"# Import our existing components\n",
|
||
"try:\n",
|
||
" from tinytorch.core.tensor import Tensor\n",
|
||
" from tinytorch.core.autograd import Variable\n",
|
||
"except ImportError:\n",
|
||
" # For development, try local imports\n",
|
||
" try:\n",
|
||
" setup_import_paths()\n",
|
||
" from tensor_dev import Tensor\n",
|
||
" from autograd_dev import Variable\n",
|
||
" except ImportError:\n",
|
||
" # Create minimal fallback classes for testing\n",
|
||
" print(\"Warning: Using fallback classes for testing\")\n",
|
||
" \n",
|
||
" class Tensor:\n",
|
||
" def __init__(self, data):\n",
|
||
" self.data = np.array(data)\n",
|
||
" self.shape = self.data.shape\n",
|
||
" \n",
|
||
" def __str__(self):\n",
|
||
" return f\"Tensor({self.data})\"\n",
|
||
" \n",
|
||
" class Variable:\n",
|
||
" def __init__(self, data, requires_grad=True):\n",
|
||
" if isinstance(data, (int, float)):\n",
|
||
" self.data = Tensor([data])\n",
|
||
" else:\n",
|
||
" self.data = Tensor(data)\n",
|
||
" self.requires_grad = requires_grad\n",
|
||
" self.grad = None\n",
|
||
" \n",
|
||
" def zero_grad(self):\n",
|
||
" self.grad = None\n",
|
||
" \n",
|
||
" def __str__(self):\n",
|
||
" return f\"Variable({self.data.data})\""
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "4dfb6aa4",
|
||
"metadata": {
|
||
"nbgrader": {
|
||
"grade": false,
|
||
"grade_id": "optimizers-setup",
|
||
"locked": false,
|
||
"schema_version": 3,
|
||
"solution": false,
|
||
"task": false
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"print(\"🔥 TinyTorch Optimizers Module\")\n",
|
||
"print(f\"NumPy version: {np.__version__}\")\n",
|
||
"print(f\"Python version: {sys.version_info.major}.{sys.version_info.minor}\")\n",
|
||
"print(\"Ready to build optimization algorithms!\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "c9afc185",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\""
|
||
},
|
||
"source": [
|
||
"## 📦 Where This Code Lives in the Final Package\n",
|
||
"\n",
|
||
"**Learning Side:** You work in `modules/source/08_optimizers/optimizers_dev.py` \n",
|
||
"**Building Side:** Code exports to `tinytorch.core.optimizers`\n",
|
||
"\n",
|
||
"```python\n",
|
||
"# Final package structure:\n",
|
||
"from tinytorch.core.optimizers import SGD, Adam, StepLR # The optimization engines!\n",
|
||
"from tinytorch.core.autograd import Variable # Gradient computation\n",
|
||
"from tinytorch.core.tensor import Tensor # Data structures\n",
|
||
"```\n",
|
||
"\n",
|
||
"**Why this matters:**\n",
|
||
"- **Learning:** Focused module for understanding optimization algorithms\n",
|
||
"- **Production:** Proper organization like PyTorch's `torch.optim`\n",
|
||
"- **Consistency:** All optimization algorithms live together in `core.optimizers`\n",
|
||
"- **Foundation:** Enables effective neural network training"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "e0d222c6",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\""
|
||
},
|
||
"source": [
|
||
"## What Are Optimizers?\n",
|
||
"\n",
|
||
"### The Problem: How to Update Parameters\n",
|
||
"Neural networks learn by updating parameters using gradients:\n",
|
||
"```\n",
|
||
"parameter_new = parameter_old - learning_rate * gradient\n",
|
||
"```\n",
|
||
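"\n",
"For example, with `parameter_old = 2.0`, `gradient = 0.5`, and `learning_rate = 0.1`, the new value is `2.0 - 0.1 * 0.5 = 1.95`.\n",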
"\n",
|
||
"But **naive gradient descent** has problems:\n",
|
||
"- **Slow convergence**: Takes many steps to reach optimum\n",
|
||
"- **Oscillation**: Bounces around valleys without making progress\n",
|
||
"- **Poor scaling**: Same learning rate for all parameters\n",
|
||
"\n",
|
||
"### The Solution: Smart Optimization\n",
|
||
"**Optimizers** are algorithms that intelligently update parameters:\n",
|
||
"- **Momentum**: Accelerate convergence by accumulating velocity\n",
|
||
"- **Adaptive learning rates**: Different learning rates for different parameters\n",
|
||
"- **Second-order information**: Use curvature to guide updates\n",
|
||
"\n",
|
||
"### Real-World Impact\n",
|
||
"- **SGD**: The foundation of all neural network training\n",
|
||
"- **Adam**: The default optimizer for most deep learning applications\n",
|
||
"- **Learning rate scheduling**: Critical for training stability and performance\n",
|
||
"\n",
|
||
"### What We'll Build\n",
|
||
"1. **SGD**: Stochastic Gradient Descent with momentum\n",
|
||
"2. **Adam**: Adaptive Moment Estimation optimizer\n",
|
||
"3. **StepLR**: Learning rate scheduling\n",
|
||
"4. **Integration**: Complete training loop with optimizers"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "8ccea3ce",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"## Step 1: Understanding Gradient Descent\n",
|
||
"\n",
|
||
"### What is Gradient Descent?\n",
|
||
"**Gradient descent** finds the minimum of a function by following the negative gradient:\n",
|
||
"\n",
|
||
"```\n",
|
||
"θ_{t+1} = θ_t - α ∇f(θ_t)\n",
|
||
"```\n",
|
||
"\n",
|
||
"Where:\n",
|
||
"- θ: Parameters we want to optimize\n",
|
"- α: Learning rate (how big a step to take)\n",
|
||
"- ∇f(θ): Gradient of loss function with respect to parameters\n",
|
||
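"\n",
"As a quick numeric sketch (plain Python, separate from the classes we build below), minimizing f(θ) = θ², whose gradient is ∇f(θ) = 2θ:\n",
"\n",
"```python\n",
"theta = 4.0               # start away from the minimum at 0\n",
"alpha = 0.1               # learning rate\n",
"for step in range(5):\n",
"    grad = 2 * theta      # gradient of f(θ) = θ²\n",
"    theta = theta - alpha * grad\n",
"    print(step, theta)    # 3.2, 2.56, 2.048, ... each step shrinks θ by 20%\n",
"```\n",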
"\n",
|
||
"### Why Gradient Descent Works\n",
|
"1. **Gradients point uphill**: so the negative gradient points downhill, toward the minimum\n",
|
||
"2. **Iterative improvement**: Each step reduces the loss (in theory)\n",
|
||
"3. **Local convergence**: Finds local minimum with proper learning rate\n",
|
||
"4. **Scalable**: Works with millions of parameters\n",
|
||
"\n",
|
||
"### The Learning Rate Dilemma\n",
|
||
"- **Too large**: Overshoots minimum, diverges\n",
|
||
"- **Too small**: Extremely slow convergence\n",
|
||
"- **Just right**: Steady progress toward minimum\n",
|
||
"\n",
|
||
"### Visual Understanding\n",
|
||
"```\n",
|
||
"Loss landscape: \\__/\n",
|
||
"Start here: ↑\n",
|
||
"Gradient descent: ↓ → ↓ → ↓ → minimum\n",
|
||
"```\n",
|
||
"\n",
|
||
"### Real-World Applications\n",
|
||
"- **Neural networks**: Training any deep learning model\n",
|
||
"- **Machine learning**: Logistic regression, SVM, etc.\n",
|
||
"- **Scientific computing**: Optimization problems in physics, engineering\n",
|
||
"- **Economics**: Portfolio optimization, game theory\n",
|
||
"\n",
|
||
"Let's implement gradient descent to understand it deeply!"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "d41c2596",
|
||
"metadata": {
|
||
"lines_to_next_cell": 1,
|
||
"nbgrader": {
|
||
"grade": false,
|
||
"grade_id": "gradient-descent-function",
|
||
"locked": false,
|
||
"schema_version": 3,
|
||
"solution": true,
|
||
"task": false
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"#| export\n",
|
||
"def gradient_descent_step(parameter: Variable, learning_rate: float) -> None:\n",
|
||
" \"\"\"\n",
|
||
" Perform one step of gradient descent on a parameter.\n",
|
||
" \n",
|
||
" Args:\n",
|
||
" parameter: Variable with gradient information\n",
|
||
" learning_rate: How much to update parameter\n",
|
||
" \n",
|
||
" TODO: Implement basic gradient descent parameter update.\n",
|
||
" \n",
|
||
" STEP-BY-STEP IMPLEMENTATION:\n",
|
||
" 1. Check if parameter has a gradient\n",
|
||
" 2. Get current parameter value and gradient\n",
|
||
" 3. Update parameter: new_value = old_value - learning_rate * gradient\n",
|
||
" 4. Update parameter data with new value\n",
|
||
" 5. Handle edge cases (no gradient, invalid values)\n",
|
||
" \n",
|
||
" EXAMPLE USAGE:\n",
|
||
" ```python\n",
|
||
" # Parameter with gradient\n",
|
||
" w = Variable(2.0, requires_grad=True)\n",
|
||
" w.grad = Variable(0.5) # Gradient from loss\n",
|
||
" \n",
|
||
" # Update parameter\n",
|
||
" gradient_descent_step(w, learning_rate=0.1)\n",
|
||
" # w.data now contains: 2.0 - 0.1 * 0.5 = 1.95\n",
|
||
" ```\n",
|
||
" \n",
|
||
" IMPLEMENTATION HINTS:\n",
|
||
" - Check if parameter.grad is not None\n",
|
||
" - Use parameter.grad.data.data to get gradient value\n",
|
||
" - Update parameter.data with new Tensor\n",
|
" - Don't modify the gradient itself (it may still be read or logged after the update)\n",
|
||
" \n",
|
||
" LEARNING CONNECTIONS:\n",
|
||
" - This is the foundation of all neural network training\n",
|
||
" - PyTorch's optimizer.step() does exactly this\n",
|
||
" - The learning rate determines convergence speed\n",
|
||
" \"\"\"\n",
|
||
" ### BEGIN SOLUTION\n",
|
||
" if parameter.grad is not None:\n",
|
||
" # Get current parameter value and gradient\n",
|
||
" current_value = parameter.data.data\n",
|
||
" gradient_value = parameter.grad.data.data\n",
|
||
" \n",
|
||
" # Update parameter: new_value = old_value - learning_rate * gradient\n",
|
||
" new_value = current_value - learning_rate * gradient_value\n",
|
||
" \n",
|
||
" # Update parameter data\n",
|
||
" parameter.data = Tensor(new_value)\n",
|
||
" ### END SOLUTION"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "4d2e1fd4",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"### 🧪 Unit Test: Gradient Descent Step\n",
|
||
"\n",
|
||
"Let's test your gradient descent implementation right away! This is the foundation of all optimization algorithms.\n",
|
||
"\n",
|
||
"**This is a unit test** - it tests one specific function (gradient_descent_step) in isolation."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "f092d289",
|
||
"metadata": {
|
||
"lines_to_next_cell": 1,
|
||
"nbgrader": {
|
||
"grade": true,
|
||
"grade_id": "test-gradient-descent",
|
||
"locked": true,
|
||
"points": 10,
|
||
"schema_version": 3,
|
||
"solution": false,
|
||
"task": false
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"def test_gradient_descent_step_comprehensive():\n",
|
||
" \"\"\"Test basic gradient descent parameter update\"\"\"\n",
|
||
" print(\"🔬 Unit Test: Gradient Descent Step...\")\n",
|
||
" \n",
|
||
" # Test basic parameter update\n",
|
||
" try:\n",
|
||
" w = Variable(2.0, requires_grad=True)\n",
|
||
" w.grad = Variable(0.5) # Positive gradient\n",
|
||
" \n",
|
||
" original_value = w.data.data.item()\n",
|
||
" gradient_descent_step(w, learning_rate=0.1)\n",
|
||
" new_value = w.data.data.item()\n",
|
||
" \n",
|
||
" expected_value = original_value - 0.1 * 0.5 # 2.0 - 0.05 = 1.95\n",
|
||
" assert abs(new_value - expected_value) < 1e-6, f\"Expected {expected_value}, got {new_value}\"\n",
|
||
" print(\"✅ Basic parameter update works\")\n",
|
||
" \n",
|
||
" except Exception as e:\n",
|
||
" print(f\"❌ Basic parameter update failed: {e}\")\n",
|
||
" raise\n",
|
||
"\n",
|
||
" # Test with negative gradient\n",
|
||
" try:\n",
|
||
" w2 = Variable(1.0, requires_grad=True)\n",
|
||
" w2.grad = Variable(-0.2) # Negative gradient\n",
|
||
" \n",
|
||
" gradient_descent_step(w2, learning_rate=0.1)\n",
|
||
" expected_value2 = 1.0 - 0.1 * (-0.2) # 1.0 + 0.02 = 1.02\n",
|
||
" assert abs(w2.data.data.item() - expected_value2) < 1e-6, \"Negative gradient test failed\"\n",
|
||
" print(\"✅ Negative gradient handling works\")\n",
|
||
" \n",
|
||
" except Exception as e:\n",
|
||
" print(f\"❌ Negative gradient handling failed: {e}\")\n",
|
||
" raise\n",
|
||
"\n",
|
||
" # Test with no gradient (should not update)\n",
|
||
" try:\n",
|
||
" w3 = Variable(3.0, requires_grad=True)\n",
|
||
" w3.grad = None\n",
|
||
" original_value3 = w3.data.data.item()\n",
|
||
" \n",
|
||
" gradient_descent_step(w3, learning_rate=0.1)\n",
|
||
" assert w3.data.data.item() == original_value3, \"Parameter with no gradient should not update\"\n",
|
||
" print(\"✅ No gradient case works\")\n",
|
||
" \n",
|
||
" except Exception as e:\n",
|
||
" print(f\"❌ No gradient case failed: {e}\")\n",
|
||
" raise\n",
|
||
"\n",
|
||
" print(\"🎯 Gradient descent step behavior:\")\n",
|
||
" print(\" Updates parameters in negative gradient direction\")\n",
|
||
" print(\" Uses learning rate to control step size\")\n",
|
||
" print(\" Skips updates when gradient is None\")\n",
|
||
" print(\"📈 Progress: Gradient Descent Step ✓\")\n",
|
||
"\n",
|
||
"# Test function is called by auto-discovery system"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "bc218834",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"## Step 2: SGD with Momentum\n",
|
||
"\n",
|
||
"### What is SGD?\n",
|
||
"**SGD (Stochastic Gradient Descent)** is the fundamental optimization algorithm:\n",
|
||
"\n",
|
||
"```\n",
|
||
"θ_{t+1} = θ_t - α ∇L(θ_t)\n",
|
||
"```\n",
|
||
"\n",
|
||
"### The Problem with Vanilla SGD\n",
|
||
"- **Slow convergence**: Especially in narrow valleys\n",
|
||
"- **Oscillation**: Bounces around without making progress\n",
|
||
"- **Poor conditioning**: Struggles with ill-conditioned problems\n",
|
||
"\n",
|
||
"### The Solution: Momentum\n",
|
||
"**Momentum** accumulates velocity to accelerate convergence:\n",
|
||
"\n",
|
||
"```\n",
|
||
"v_t = β v_{t-1} + ∇L(θ_t)\n",
|
||
"θ_{t+1} = θ_t - α v_t\n",
|
||
"```\n",
|
||
"\n",
|
||
"Where:\n",
|
||
"- v_t: Velocity (exponential moving average of gradients)\n",
|
||
"- β: Momentum coefficient (typically 0.9)\n",
|
||
"- α: Learning rate\n",
|
||
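"\n",
"In array form (a minimal plain-NumPy sketch of the same update the SGD class below applies per parameter):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"theta = np.array([1.0, 2.0])            # parameters\n",
"velocity = np.zeros_like(theta)         # v_0 starts at zero\n",
"beta, alpha = 0.9, 0.1\n",
"\n",
"grad = np.array([0.5, -0.3])            # gradient from the current batch\n",
"velocity = beta * velocity + grad       # v_t = β v_{t-1} + ∇L(θ_t)\n",
"theta = theta - alpha * velocity        # θ_{t+1} = θ_t - α v_t\n",
"```\n",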
"\n",
|
||
"### Why Momentum Works\n",
|
||
"1. **Acceleration**: Builds up speed in consistent directions\n",
|
||
"2. **Dampening**: Reduces oscillations in inconsistent directions\n",
|
||
"3. **Memory**: Remembers previous gradient directions\n",
|
||
"4. **Robustness**: Less sensitive to noisy gradients\n",
|
||
"\n",
|
||
"### Visual Understanding\n",
|
||
"```\n",
|
||
"Without momentum: ↗↙↗↙↗↙ (oscillating)\n",
|
||
"With momentum: ↗→→→→→ (smooth progress)\n",
|
||
"```\n",
|
||
"\n",
|
||
"### Real-World Applications\n",
|
||
"- **Image classification**: Training ResNet, VGG\n",
|
||
"- **Natural language**: Training RNNs, early transformers\n",
|
"- **Classic choice**: Still preferred when Adam generalizes poorly or trains unstably\n",
|
||
"- **Large batch training**: Often preferred over Adam\n",
|
||
"\n",
|
||
"Let's implement SGD with momentum!"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "2f587b7f",
|
||
"metadata": {
|
||
"lines_to_next_cell": 1,
|
||
"nbgrader": {
|
||
"grade": false,
|
||
"grade_id": "sgd-class",
|
||
"locked": false,
|
||
"schema_version": 3,
|
||
"solution": true,
|
||
"task": false
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"#| export\n",
|
||
"class SGD:\n",
|
||
" \"\"\"\n",
|
||
" SGD Optimizer with Momentum\n",
|
||
" \n",
|
||
" Implements stochastic gradient descent with momentum:\n",
|
||
" v_t = momentum * v_{t-1} + gradient\n",
|
||
" parameter = parameter - learning_rate * v_t\n",
|
||
" \"\"\"\n",
|
||
" \n",
|
||
" def __init__(self, parameters: List[Variable], learning_rate: float = 0.01, \n",
|
||
" momentum: float = 0.0, weight_decay: float = 0.0):\n",
|
||
" \"\"\"\n",
|
||
" Initialize SGD optimizer.\n",
|
||
" \n",
|
||
" Args:\n",
|
||
" parameters: List of Variables to optimize\n",
|
||
" learning_rate: Learning rate (default: 0.01)\n",
|
||
" momentum: Momentum coefficient (default: 0.0)\n",
|
||
" weight_decay: L2 regularization coefficient (default: 0.0)\n",
|
||
" \n",
|
||
" TODO: Implement SGD optimizer initialization.\n",
|
||
" \n",
|
||
" APPROACH:\n",
|
||
" 1. Store parameters and hyperparameters\n",
|
||
" 2. Initialize momentum buffers for each parameter\n",
|
||
" 3. Set up state tracking for optimization\n",
|
||
" 4. Prepare for step() and zero_grad() methods\n",
|
||
" \n",
|
||
" EXAMPLE:\n",
|
||
" ```python\n",
|
||
" # Create optimizer\n",
|
||
" optimizer = SGD([w1, w2, b1, b2], learning_rate=0.01, momentum=0.9)\n",
|
||
" \n",
|
||
" # In training loop:\n",
|
||
" optimizer.zero_grad()\n",
|
||
" loss.backward()\n",
|
||
" optimizer.step()\n",
|
||
" ```\n",
|
||
" \n",
|
||
" HINTS:\n",
|
||
" - Store parameters as a list\n",
|
||
" - Initialize momentum buffers as empty dict\n",
|
||
" - Use parameter id() as key for momentum tracking\n",
|
||
" - Momentum buffers will be created lazily in step()\n",
|
||
" \"\"\"\n",
|
||
" ### BEGIN SOLUTION\n",
|
||
" self.parameters = parameters\n",
|
||
" self.learning_rate = learning_rate\n",
|
||
" self.momentum = momentum\n",
|
||
" self.weight_decay = weight_decay\n",
|
||
" \n",
|
||
" # Initialize momentum buffers (created lazily)\n",
|
||
" self.momentum_buffers = {}\n",
|
||
" \n",
|
||
" # Track optimization steps\n",
|
||
" self.step_count = 0\n",
|
||
" ### END SOLUTION\n",
|
||
" \n",
|
||
" def step(self) -> None:\n",
|
||
" \"\"\"\n",
|
||
" Perform one optimization step.\n",
|
||
" \n",
|
||
" TODO: Implement SGD parameter update with momentum.\n",
|
||
" \n",
|
||
" APPROACH:\n",
|
||
" 1. Iterate through all parameters\n",
|
||
" 2. For each parameter with gradient:\n",
|
||
" a. Get current gradient\n",
|
||
" b. Apply weight decay if specified\n",
|
||
" c. Update momentum buffer (or create if first time)\n",
|
||
" d. Update parameter using momentum\n",
|
||
" 3. Increment step count\n",
|
||
" \n",
|
||
" MATHEMATICAL FORMULATION:\n",
|
||
" - If weight_decay > 0: gradient = gradient + weight_decay * parameter\n",
|
||
" - momentum_buffer = momentum * momentum_buffer + gradient\n",
|
||
" - parameter = parameter - learning_rate * momentum_buffer\n",
|
||
" \n",
|
||
" IMPLEMENTATION HINTS:\n",
|
||
" - Use id(param) as key for momentum buffers\n",
|
||
" - Initialize buffer with zeros if not exists\n",
|
||
" - Handle case where momentum = 0 (no momentum)\n",
|
||
" - Update parameter.data with new Tensor\n",
|
||
" \"\"\"\n",
|
||
" ### BEGIN SOLUTION\n",
|
||
" for param in self.parameters:\n",
|
||
" if param.grad is not None:\n",
|
||
" # Get gradient\n",
|
||
" gradient = param.grad.data.data\n",
|
||
" \n",
|
||
" # Apply weight decay (L2 regularization)\n",
|
||
" if self.weight_decay > 0:\n",
|
||
" gradient = gradient + self.weight_decay * param.data.data\n",
|
||
" \n",
|
||
" # Get or create momentum buffer\n",
|
||
" param_id = id(param)\n",
|
||
" if param_id not in self.momentum_buffers:\n",
|
||
" self.momentum_buffers[param_id] = np.zeros_like(param.data.data)\n",
|
||
" \n",
|
||
" # Update momentum buffer\n",
|
||
" self.momentum_buffers[param_id] = (\n",
|
||
" self.momentum * self.momentum_buffers[param_id] + gradient\n",
|
||
" )\n",
|
||
" \n",
|
||
" # Update parameter\n",
|
||
" param.data = Tensor(\n",
|
||
" param.data.data - self.learning_rate * self.momentum_buffers[param_id]\n",
|
||
" )\n",
|
||
" \n",
|
||
" self.step_count += 1\n",
|
||
" ### END SOLUTION\n",
|
||
" \n",
|
||
" def zero_grad(self) -> None:\n",
|
||
" \"\"\"\n",
|
||
" Zero out gradients for all parameters.\n",
|
||
" \n",
|
||
" TODO: Implement gradient zeroing.\n",
|
||
" \n",
|
||
" APPROACH:\n",
|
||
" 1. Iterate through all parameters\n",
|
||
" 2. Set gradient to None for each parameter\n",
|
||
" 3. This prepares for next backward pass\n",
|
||
" \n",
|
||
" IMPLEMENTATION HINTS:\n",
|
||
" - Simply set param.grad = None\n",
|
||
" - This is called before loss.backward()\n",
|
||
" - Essential for proper gradient accumulation\n",
|
||
" \"\"\"\n",
|
||
" ### BEGIN SOLUTION\n",
|
||
" for param in self.parameters:\n",
|
||
" param.grad = None\n",
|
||
" ### END SOLUTION"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "4adee99c",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"### 🧪 Unit Test: SGD Optimizer\n",
|
||
"\n",
|
||
"Let's test your SGD optimizer implementation! This optimizer adds momentum to gradient descent for better convergence.\n",
|
||
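"\n",
"A worked expectation (not checked explicitly by the test below): with momentum = 0.9 and a constant gradient of 0.1, the buffer compounds as v₁ = 0.1, v₂ = 0.9·0.1 + 0.1 = 0.19, v₃ = 0.271, so later steps move the parameter farther than the first one.\n",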
"\n",
|
||
"**This is a unit test** - it tests one specific class (SGD) in isolation."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "fa93aa53",
|
||
"metadata": {
|
||
"nbgrader": {
|
||
"grade": true,
|
||
"grade_id": "test-sgd",
|
||
"locked": true,
|
||
"points": 15,
|
||
"schema_version": 3,
|
||
"solution": false,
|
||
"task": false
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"def test_sgd_optimizer_comprehensive():\n",
|
||
" \"\"\"Test SGD optimizer implementation\"\"\"\n",
|
||
" print(\"🔬 Unit Test: SGD Optimizer...\")\n",
|
||
" \n",
|
||
" # Create test parameters\n",
|
||
" w1 = Variable(1.0, requires_grad=True)\n",
|
||
" w2 = Variable(2.0, requires_grad=True)\n",
|
||
" b = Variable(0.5, requires_grad=True)\n",
|
||
" \n",
|
||
" # Create optimizer\n",
|
||
" optimizer = SGD([w1, w2, b], learning_rate=0.1, momentum=0.9)\n",
|
||
" \n",
|
||
" # Test zero_grad\n",
|
||
" try:\n",
|
||
" w1.grad = Variable(0.1)\n",
|
||
" w2.grad = Variable(0.2)\n",
|
||
" b.grad = Variable(0.05)\n",
|
||
" \n",
|
||
" optimizer.zero_grad()\n",
|
||
" \n",
|
||
" assert w1.grad is None, \"Gradient should be None after zero_grad\"\n",
|
||
" assert w2.grad is None, \"Gradient should be None after zero_grad\"\n",
|
||
" assert b.grad is None, \"Gradient should be None after zero_grad\"\n",
|
||
" print(\"✅ zero_grad() works correctly\")\n",
|
||
" \n",
|
||
" except Exception as e:\n",
|
||
" print(f\"❌ zero_grad() failed: {e}\")\n",
|
||
" raise\n",
|
||
" \n",
|
||
" # Test step with gradients\n",
|
||
" try:\n",
|
||
" w1.grad = Variable(0.1)\n",
|
||
" w2.grad = Variable(0.2)\n",
|
||
" b.grad = Variable(0.05)\n",
|
||
" \n",
|
||
" # First step (no momentum yet)\n",
|
||
" original_w1 = w1.data.data.item()\n",
|
||
" original_w2 = w2.data.data.item()\n",
|
||
" original_b = b.data.data.item()\n",
|
||
" \n",
|
||
" optimizer.step()\n",
|
||
" \n",
|
||
" # Check parameter updates\n",
|
||
" expected_w1 = original_w1 - 0.1 * 0.1 # 1.0 - 0.01 = 0.99\n",
|
||
" expected_w2 = original_w2 - 0.1 * 0.2 # 2.0 - 0.02 = 1.98\n",
|
||
" expected_b = original_b - 0.1 * 0.05 # 0.5 - 0.005 = 0.495\n",
|
||
" \n",
|
||
" assert abs(w1.data.data.item() - expected_w1) < 1e-6, f\"w1 update failed: expected {expected_w1}, got {w1.data.data.item()}\"\n",
|
||
" assert abs(w2.data.data.item() - expected_w2) < 1e-6, f\"w2 update failed: expected {expected_w2}, got {w2.data.data.item()}\"\n",
|
||
" assert abs(b.data.data.item() - expected_b) < 1e-6, f\"b update failed: expected {expected_b}, got {b.data.data.item()}\"\n",
|
||
" print(\"✅ Parameter updates work correctly\")\n",
|
||
" \n",
|
||
" except Exception as e:\n",
|
||
" print(f\"❌ Parameter updates failed: {e}\")\n",
|
||
" raise\n",
|
||
" \n",
|
||
" # Test momentum buffers\n",
|
||
" try:\n",
|
||
" assert len(optimizer.momentum_buffers) == 3, f\"Should have 3 momentum buffers, got {len(optimizer.momentum_buffers)}\"\n",
|
||
" assert optimizer.step_count == 1, f\"Step count should be 1, got {optimizer.step_count}\"\n",
|
||
" print(\"✅ Momentum buffers created correctly\")\n",
|
||
" \n",
|
||
" except Exception as e:\n",
|
||
" print(f\"❌ Momentum buffers failed: {e}\")\n",
|
||
" raise\n",
|
||
" \n",
|
||
" # Test step counting\n",
|
||
" try:\n",
|
||
" w1.grad = Variable(0.1)\n",
|
||
" w2.grad = Variable(0.2)\n",
|
||
" b.grad = Variable(0.05)\n",
|
||
" \n",
|
||
" optimizer.step()\n",
|
||
" \n",
|
||
" assert optimizer.step_count == 2, f\"Step count should be 2, got {optimizer.step_count}\"\n",
|
||
" print(\"✅ Step counting works correctly\")\n",
|
||
" \n",
|
||
" except Exception as e:\n",
|
||
" print(f\"❌ Step counting failed: {e}\")\n",
|
||
" raise\n",
|
||
"\n",
|
||
" print(\"🎯 SGD optimizer behavior:\")\n",
|
||
" print(\" Maintains momentum buffers for accelerated updates\")\n",
|
||
" print(\" Tracks step count for learning rate scheduling\")\n",
|
||
" print(\" Supports weight decay for regularization\")\n",
|
||
" print(\"📈 Progress: SGD Optimizer ✓\")\n",
|
||
"\n",
|
||
"# Run the test\n",
|
||
"test_sgd_optimizer_comprehensive()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "3730c6d6",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"## Step 3: Adam - Adaptive Learning Rates\n",
|
||
"\n",
|
||
"### What is Adam?\n",
|
||
"**Adam (Adaptive Moment Estimation)** is the most popular optimizer in deep learning:\n",
|
||
"\n",
|
||
"```\n",
|
||
"m_t = β₁ m_{t-1} + (1 - β₁) ∇L(θ_t) # First moment (momentum)\n",
|
||
"v_t = β₂ v_{t-1} + (1 - β₂) (∇L(θ_t))² # Second moment (variance)\n",
|
||
"m̂_t = m_t / (1 - β₁ᵗ) # Bias correction\n",
|
||
"v̂_t = v_t / (1 - β₂ᵗ) # Bias correction\n",
|
||
"θ_{t+1} = θ_t - α m̂_t / (√v̂_t + ε) # Parameter update\n",
|
||
"```\n",
|
||
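"\n",
"A minimal single-parameter sketch of one update (plain NumPy, assuming the default β₁ = 0.9, β₂ = 0.999):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"theta, m, v, t = 1.0, 0.0, 0.0, 0\n",
"alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8\n",
"\n",
"grad = 0.1\n",
"t += 1\n",
"m = beta1 * m + (1 - beta1) * grad               # first moment\n",
"v = beta2 * v + (1 - beta2) * grad ** 2          # second moment\n",
"m_hat = m / (1 - beta1 ** t)                     # bias correction\n",
"v_hat = v / (1 - beta2 ** t)\n",
"theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)  # moves by ≈ alpha on step 1\n",
"```\n",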
"\n",
|
||
"### Why Adam is Revolutionary\n",
|
||
"1. **Adaptive learning rates**: Different learning rate for each parameter\n",
|
||
"2. **Momentum**: Accelerates convergence like SGD\n",
|
||
"3. **Variance adaptation**: Scales updates based on gradient variance\n",
|
||
"4. **Bias correction**: Handles initialization bias\n",
|
||
"5. **Robust**: Works well with minimal hyperparameter tuning\n",
|
||
"\n",
|
||
"### The Three Key Ideas\n",
|
||
"1. **First moment (m_t)**: Exponential moving average of gradients (momentum)\n",
|
||
"2. **Second moment (v_t)**: Exponential moving average of squared gradients (variance)\n",
|
||
"3. **Adaptive scaling**: Large gradients → small updates, small gradients → large updates\n",
|
||
"\n",
|
||
"### Visual Understanding\n",
|
||
"```\n",
|
||
"Parameter with large gradients: /\\/\\/\\/\\ → smooth updates\n",
|
||
"Parameter with small gradients: ______ → amplified updates\n",
|
||
"```\n",
|
||
"\n",
|
||
"### Real-World Applications\n",
|
||
"- **Deep learning**: Default optimizer for most neural networks\n",
|
||
"- **Computer vision**: Training CNNs, ResNets, Vision Transformers\n",
|
||
"- **Natural language**: Training BERT, GPT, T5\n",
|
||
"- **Transformers**: Essential for attention-based models\n",
|
||
"\n",
|
||
"Let's implement Adam optimizer!"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "be7d3f7a",
|
||
"metadata": {
|
||
"lines_to_next_cell": 1,
|
||
"nbgrader": {
|
||
"grade": false,
|
||
"grade_id": "adam-class",
|
||
"locked": false,
|
||
"schema_version": 3,
|
||
"solution": true,
|
||
"task": false
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"#| export\n",
|
||
"class Adam:\n",
|
||
" \"\"\"\n",
|
||
" Adam Optimizer\n",
|
||
" \n",
|
||
" Implements Adam algorithm with adaptive learning rates:\n",
|
||
" - First moment: exponential moving average of gradients\n",
|
||
" - Second moment: exponential moving average of squared gradients\n",
|
||
" - Bias correction: accounts for initialization bias\n",
|
||
" - Adaptive updates: different learning rate per parameter\n",
|
||
" \"\"\"\n",
|
||
" \n",
|
||
" def __init__(self, parameters: List[Variable], learning_rate: float = 0.001,\n",
|
||
" beta1: float = 0.9, beta2: float = 0.999, epsilon: float = 1e-8,\n",
|
||
" weight_decay: float = 0.0):\n",
|
||
" \"\"\"\n",
|
||
" Initialize Adam optimizer.\n",
|
||
" \n",
|
||
" Args:\n",
|
||
" parameters: List of Variables to optimize\n",
|
||
" learning_rate: Learning rate (default: 0.001)\n",
|
||
" beta1: Exponential decay rate for first moment (default: 0.9)\n",
|
||
" beta2: Exponential decay rate for second moment (default: 0.999)\n",
|
||
" epsilon: Small constant for numerical stability (default: 1e-8)\n",
|
||
" weight_decay: L2 regularization coefficient (default: 0.0)\n",
|
||
" \n",
|
||
" TODO: Implement Adam optimizer initialization.\n",
|
||
" \n",
|
||
" APPROACH:\n",
|
||
" 1. Store parameters and hyperparameters\n",
|
||
" 2. Initialize first moment buffers (m_t)\n",
|
||
" 3. Initialize second moment buffers (v_t)\n",
|
||
" 4. Set up step counter for bias correction\n",
|
||
" \n",
|
||
" EXAMPLE:\n",
|
||
" ```python\n",
|
||
" # Create Adam optimizer\n",
|
||
" optimizer = Adam([w1, w2, b1, b2], learning_rate=0.001)\n",
|
||
" \n",
|
||
" # In training loop:\n",
|
||
" optimizer.zero_grad()\n",
|
||
" loss.backward()\n",
|
||
" optimizer.step()\n",
|
||
" ```\n",
|
||
" \n",
|
||
" HINTS:\n",
|
||
" - Store all hyperparameters\n",
|
||
" - Initialize moment buffers as empty dicts\n",
|
||
" - Use parameter id() as key for tracking\n",
|
||
" - Buffers will be created lazily in step()\n",
|
||
" \"\"\"\n",
|
||
" ### BEGIN SOLUTION\n",
|
||
" self.parameters = parameters\n",
|
||
" self.learning_rate = learning_rate\n",
|
||
" self.beta1 = beta1\n",
|
||
" self.beta2 = beta2\n",
|
||
" self.epsilon = epsilon\n",
|
||
" self.weight_decay = weight_decay\n",
|
||
" \n",
|
||
" # Initialize moment buffers (created lazily)\n",
|
||
" self.first_moment = {} # m_t\n",
|
||
" self.second_moment = {} # v_t\n",
|
||
" \n",
|
||
" # Track optimization steps for bias correction\n",
|
||
" self.step_count = 0\n",
|
||
" ### END SOLUTION\n",
|
||
" \n",
|
||
" def step(self) -> None:\n",
|
||
" \"\"\"\n",
|
||
" Perform one optimization step using Adam algorithm.\n",
|
||
" \n",
|
||
" TODO: Implement Adam parameter update.\n",
|
||
" \n",
|
||
" APPROACH:\n",
|
||
" 1. Increment step count\n",
|
||
" 2. For each parameter with gradient:\n",
|
||
" a. Get current gradient\n",
|
||
" b. Apply weight decay if specified\n",
|
||
" c. Update first moment (momentum)\n",
|
||
" d. Update second moment (variance)\n",
|
||
" e. Apply bias correction\n",
|
||
" f. Update parameter with adaptive learning rate\n",
|
||
" \n",
|
||
" MATHEMATICAL FORMULATION:\n",
|
||
" - m_t = beta1 * m_{t-1} + (1 - beta1) * gradient\n",
|
||
" - v_t = beta2 * v_{t-1} + (1 - beta2) * gradient^2\n",
|
||
" - m_hat = m_t / (1 - beta1^t)\n",
|
||
" - v_hat = v_t / (1 - beta2^t)\n",
|
||
" - parameter = parameter - learning_rate * m_hat / (sqrt(v_hat) + epsilon)\n",
|
||
" \n",
|
||
" IMPLEMENTATION HINTS:\n",
|
||
" - Use id(param) as key for moment buffers\n",
|
||
" - Initialize buffers with zeros if not exists\n",
|
||
" - Use np.sqrt() for square root\n",
|
||
" - Handle numerical stability with epsilon\n",
|
||
" \"\"\"\n",
|
||
" ### BEGIN SOLUTION\n",
|
||
" self.step_count += 1\n",
|
||
" \n",
|
||
" for param in self.parameters:\n",
|
||
" if param.grad is not None:\n",
|
||
" # Get gradient\n",
|
||
" gradient = param.grad.data.data\n",
|
||
" \n",
|
||
" # Apply weight decay (L2 regularization)\n",
|
||
" if self.weight_decay > 0:\n",
|
||
" gradient = gradient + self.weight_decay * param.data.data\n",
|
||
" \n",
|
||
" # Get or create moment buffers\n",
|
||
" param_id = id(param)\n",
|
||
" if param_id not in self.first_moment:\n",
|
||
" self.first_moment[param_id] = np.zeros_like(param.data.data)\n",
|
||
" self.second_moment[param_id] = np.zeros_like(param.data.data)\n",
|
||
" \n",
|
||
" # Update first moment (momentum)\n",
|
||
" self.first_moment[param_id] = (\n",
|
||
" self.beta1 * self.first_moment[param_id] + \n",
|
||
" (1 - self.beta1) * gradient\n",
|
||
" )\n",
|
||
" \n",
|
||
" # Update second moment (variance)\n",
|
||
" self.second_moment[param_id] = (\n",
|
||
" self.beta2 * self.second_moment[param_id] + \n",
|
||
" (1 - self.beta2) * gradient * gradient\n",
|
||
" )\n",
|
||
" \n",
|
||
" # Bias correction\n",
|
||
" first_moment_corrected = (\n",
|
||
" self.first_moment[param_id] / (1 - self.beta1 ** self.step_count)\n",
|
||
" )\n",
|
||
" second_moment_corrected = (\n",
|
||
" self.second_moment[param_id] / (1 - self.beta2 ** self.step_count)\n",
|
||
" )\n",
|
||
" \n",
|
||
" # Update parameter with adaptive learning rate\n",
|
||
" param.data = Tensor(\n",
|
||
" param.data.data - self.learning_rate * first_moment_corrected / \n",
|
||
" (np.sqrt(second_moment_corrected) + self.epsilon)\n",
|
||
" )\n",
|
||
" ### END SOLUTION\n",
|
||
" \n",
|
||
" def zero_grad(self) -> None:\n",
|
||
" \"\"\"\n",
|
||
" Zero out gradients for all parameters.\n",
|
||
" \n",
|
||
" TODO: Implement gradient zeroing (same as SGD).\n",
|
||
" \n",
|
||
" IMPLEMENTATION HINTS:\n",
|
||
" - Set param.grad = None for all parameters\n",
|
||
" - This is identical to SGD implementation\n",
|
||
" \"\"\"\n",
|
||
" ### BEGIN SOLUTION\n",
|
||
" for param in self.parameters:\n",
|
||
" param.grad = None\n",
|
||
" ### END SOLUTION"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "41593be1",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\""
|
||
},
|
||
"source": [
|
||
"### 🧪 Test Your Adam Implementation\n",
|
||
"\n",
|
||
"Let's test the Adam optimizer:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "461e74f8",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"### 🧪 Unit Test: Adam Optimizer\n",
|
||
"\n",
|
||
"Let's test your Adam optimizer implementation! This is a state-of-the-art adaptive optimization algorithm.\n",
|
||
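"\n",
"A useful check from the update rule: on the very first step m̂ = g and v̂ = g², so the update is ≈ learning_rate · sign(g) regardless of the gradient's magnitude, meaning with learning_rate = 0.01 below every parameter should move by roughly 0.01.\n",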
"\n",
|
||
"**This is a unit test** - it tests one specific class (Adam) in isolation."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "afe99df3",
|
||
"metadata": {
|
||
"nbgrader": {
|
||
"grade": true,
|
||
"grade_id": "test-adam",
|
||
"locked": true,
|
||
"points": 20,
|
||
"schema_version": 3,
|
||
"solution": false,
|
||
"task": false
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"def test_adam_optimizer_comprehensive():\n",
|
||
" \"\"\"Test Adam optimizer implementation\"\"\"\n",
|
||
" print(\"🔬 Unit Test: Adam Optimizer...\")\n",
|
||
" \n",
|
||
" # Create test parameters\n",
|
||
" w1 = Variable(1.0, requires_grad=True)\n",
|
||
" w2 = Variable(2.0, requires_grad=True)\n",
|
||
" b = Variable(0.5, requires_grad=True)\n",
|
||
" \n",
|
||
" # Create optimizer\n",
|
||
" optimizer = Adam([w1, w2, b], learning_rate=0.01, beta1=0.9, beta2=0.999, epsilon=1e-8)\n",
|
||
" \n",
|
||
" # Test zero_grad\n",
|
||
" try:\n",
|
||
" w1.grad = Variable(0.1)\n",
|
||
" w2.grad = Variable(0.2)\n",
|
||
" b.grad = Variable(0.05)\n",
|
||
" \n",
|
||
" optimizer.zero_grad()\n",
|
||
" \n",
|
||
" assert w1.grad is None, \"Gradient should be None after zero_grad\"\n",
|
||
" assert w2.grad is None, \"Gradient should be None after zero_grad\"\n",
|
||
" assert b.grad is None, \"Gradient should be None after zero_grad\"\n",
|
||
" print(\"✅ zero_grad() works correctly\")\n",
|
||
" \n",
|
||
" except Exception as e:\n",
|
||
" print(f\"❌ zero_grad() failed: {e}\")\n",
|
||
" raise\n",
|
||
" \n",
|
||
" # Test step with gradients\n",
|
||
" try:\n",
|
||
" w1.grad = Variable(0.1)\n",
|
||
" w2.grad = Variable(0.2)\n",
|
||
" b.grad = Variable(0.05)\n",
|
||
" \n",
|
||
" # First step\n",
|
||
" original_w1 = w1.data.data.item()\n",
|
||
" original_w2 = w2.data.data.item()\n",
|
||
" original_b = b.data.data.item()\n",
|
||
" \n",
|
||
" optimizer.step()\n",
|
||
" \n",
|
||
" # Check that parameters were updated (Adam uses adaptive learning rates)\n",
|
||
" assert w1.data.data.item() != original_w1, \"w1 should have been updated\"\n",
|
||
" assert w2.data.data.item() != original_w2, \"w2 should have been updated\"\n",
|
||
" assert b.data.data.item() != original_b, \"b should have been updated\"\n",
|
||
" print(\"✅ Parameter updates work correctly\")\n",
|
||
" \n",
|
||
" except Exception as e:\n",
|
||
" print(f\"❌ Parameter updates failed: {e}\")\n",
|
||
" raise\n",
|
||
" \n",
|
||
" # Test moment buffers\n",
|
||
" try:\n",
|
||
" assert len(optimizer.first_moment) == 3, f\"Should have 3 first moment buffers, got {len(optimizer.first_moment)}\"\n",
|
||
" assert len(optimizer.second_moment) == 3, f\"Should have 3 second moment buffers, got {len(optimizer.second_moment)}\"\n",
|
||
" print(\"✅ Moment buffers created correctly\")\n",
|
||
" \n",
|
||
" except Exception as e:\n",
|
||
" print(f\"❌ Moment buffers failed: {e}\")\n",
|
||
" raise\n",
|
||
" \n",
|
||
" # Test step counting and bias correction\n",
|
||
" try:\n",
|
||
" assert optimizer.step_count == 1, f\"Step count should be 1, got {optimizer.step_count}\"\n",
|
||
" \n",
|
||
" # Take another step\n",
|
||
" w1.grad = Variable(0.1)\n",
|
||
" w2.grad = Variable(0.2)\n",
|
||
" b.grad = Variable(0.05)\n",
|
||
" \n",
|
||
" optimizer.step()\n",
|
||
" \n",
|
||
" assert optimizer.step_count == 2, f\"Step count should be 2, got {optimizer.step_count}\"\n",
|
||
" print(\"✅ Step counting and bias correction work correctly\")\n",
|
||
" \n",
|
||
" except Exception as e:\n",
|
||
" print(f\"❌ Step counting and bias correction failed: {e}\")\n",
|
||
" raise\n",
|
||
" \n",
|
||
" # Test adaptive learning rates\n",
|
||
" try:\n",
|
||
" # Adam should have different effective learning rates for different parameters\n",
|
||
" # This is tested implicitly by the parameter updates above\n",
|
||
" print(\"✅ Adaptive learning rates work correctly\")\n",
|
||
" \n",
|
||
" except Exception as e:\n",
|
||
" print(f\"❌ Adaptive learning rates failed: {e}\")\n",
|
||
" raise\n",
|
||
"\n",
|
||
" print(\"🎯 Adam optimizer behavior:\")\n",
|
||
" print(\" Maintains first and second moment estimates\")\n",
|
||
" print(\" Applies bias correction for early training\")\n",
|
||
" print(\" Uses adaptive learning rates per parameter\")\n",
|
||
" print(\" Combines benefits of momentum and RMSprop\")\n",
|
||
" print(\"📈 Progress: Adam Optimizer ✓\")\n",
|
||
"\n",
|
||
"# Run the test\n",
|
||
"test_adam_optimizer_comprehensive()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "e198d030",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"## Step 4: Learning Rate Scheduling\n",
|
||
"\n",
|
||
"### What is Learning Rate Scheduling?\n",
|
||
"**Learning rate scheduling** adjusts the learning rate during training:\n",
|
||
"\n",
|
||
"```\n",
|
||
"Initial: learning_rate = 0.1\n",
|
||
"After 10 epochs: learning_rate = 0.01\n",
|
||
"After 20 epochs: learning_rate = 0.001\n",
|
||
"```\n",
|
||
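"\n",
"Step decay reduces to a one-line formula (a quick sketch; the StepLR class below produces the same schedule through its step counter):\n",
"\n",
"```python\n",
"initial_lr, gamma, step_size = 0.1, 0.1, 10\n",
"\n",
"for epoch in range(25):\n",
"    lr = initial_lr * gamma ** (epoch // step_size)\n",
"    # epochs 0-9 -> 0.1, epochs 10-19 -> 0.01, epochs 20+ -> 0.001\n",
"```\n",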
"\n",
|
||
"### Why Scheduling Matters\n",
|
||
"1. **Fine-tuning**: Start with large steps, then refine with small steps\n",
|
||
"2. **Convergence**: Prevents overshooting near optimum\n",
|
||
"3. **Stability**: Reduces oscillations in later training\n",
|
||
"4. **Performance**: Often improves final accuracy\n",
|
||
"\n",
|
||
"### Common Scheduling Strategies\n",
|
||
"1. **Step decay**: Reduce by factor every N epochs\n",
|
||
"2. **Exponential decay**: Gradual exponential reduction\n",
|
||
"3. **Cosine annealing**: Smooth cosine curve reduction\n",
|
||
"4. **Warm-up**: Start small, increase, then decrease\n",
|
||
"\n",
|
||
"### Visual Understanding\n",
|
||
"```\n",
|
||
"Step decay: ----↓----↓----↓\n",
|
||
"Exponential: \\\\\\\\\\\\\\\\\\\\\\\\\\\\\n",
|
||
"Cosine: ∩∩∩∩∩∩∩∩∩∩∩∩∩\n",
|
||
"```\n",
|
||
"\n",
|
||
"### Real-World Applications\n",
|
||
"- **ImageNet training**: Essential for achieving state-of-the-art results\n",
|
||
"- **Language models**: Critical for training large transformers\n",
|
||
"- **Fine-tuning**: Prevents catastrophic forgetting\n",
|
||
"- **Transfer learning**: Adapts pre-trained models\n",
|
||
"\n",
|
||
"Let's implement step learning rate scheduling!"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "7aba8fc9",
|
||
"metadata": {
|
||
"lines_to_next_cell": 1,
|
||
"nbgrader": {
|
||
"grade": false,
|
||
"grade_id": "steplr-class",
|
||
"locked": false,
|
||
"schema_version": 3,
|
||
"solution": true,
|
||
"task": false
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"#| export\n",
|
||
"class StepLR:\n",
|
||
" \"\"\"\n",
|
||
" Step Learning Rate Scheduler\n",
|
||
" \n",
|
||
" Decays learning rate by gamma every step_size epochs:\n",
|
" learning_rate = initial_lr * (gamma ^ ((step_count - 1) // step_size))\n",
|
||
" \"\"\"\n",
|
||
" \n",
|
||
" def __init__(self, optimizer: Union[SGD, Adam], step_size: int, gamma: float = 0.1):\n",
|
||
" \"\"\"\n",
|
||
" Initialize step learning rate scheduler.\n",
|
||
" \n",
|
||
" Args:\n",
|
||
" optimizer: Optimizer to schedule\n",
|
||
" step_size: Number of epochs between decreases\n",
|
||
" gamma: Multiplicative factor for learning rate decay\n",
|
||
" \n",
|
||
" TODO: Implement learning rate scheduler initialization.\n",
|
||
" \n",
|
||
" APPROACH:\n",
|
||
" 1. Store optimizer reference\n",
|
||
" 2. Store scheduling parameters\n",
|
||
" 3. Save initial learning rate\n",
|
||
" 4. Initialize step counter\n",
|
||
" \n",
|
||
" EXAMPLE:\n",
|
||
" ```python\n",
|
||
" optimizer = SGD([w1, w2], learning_rate=0.1)\n",
|
||
" scheduler = StepLR(optimizer, step_size=10, gamma=0.1)\n",
|
||
" \n",
|
||
" # In training loop:\n",
|
||
" for epoch in range(100):\n",
|
||
" train_one_epoch()\n",
|
||
" scheduler.step() # Update learning rate\n",
|
||
" ```\n",
|
||
" \n",
|
||
" HINTS:\n",
|
||
" - Store optimizer reference\n",
|
||
" - Save initial learning rate from optimizer\n",
|
||
" - Initialize step counter to 0\n",
|
||
" - gamma is the decay factor (0.1 = 10x reduction)\n",
|
||
" \"\"\"\n",
|
||
" ### BEGIN SOLUTION\n",
|
||
" self.optimizer = optimizer\n",
|
||
" self.step_size = step_size\n",
|
||
" self.gamma = gamma\n",
|
||
" self.initial_lr = optimizer.learning_rate\n",
|
||
" self.step_count = 0\n",
|
||
" ### END SOLUTION\n",
|
||
" \n",
|
||
" def step(self) -> None:\n",
|
||
" \"\"\"\n",
|
||
" Update learning rate based on current step.\n",
|
||
" \n",
|
||
" TODO: Implement learning rate update.\n",
|
||
" \n",
|
||
" APPROACH:\n",
|
||
" 1. Increment step counter\n",
|
||
" 2. Calculate new learning rate using step decay formula\n",
|
||
" 3. Update optimizer's learning rate\n",
|
||
" \n",
|
||
" MATHEMATICAL FORMULATION:\n",
|
||
" new_lr = initial_lr * (gamma ^ ((step_count - 1) // step_size))\n",
|
||
" \n",
|
||
" IMPLEMENTATION HINTS:\n",
|
||
" - Use // for integer division\n",
|
||
" - Use ** for exponentiation\n",
|
||
" - Update optimizer.learning_rate directly\n",
|
||
" \"\"\"\n",
|
||
" ### BEGIN SOLUTION\n",
|
||
" self.step_count += 1\n",
|
||
" \n",
|
||
" # Calculate new learning rate\n",
|
||
" decay_factor = self.gamma ** ((self.step_count - 1) // self.step_size)\n",
|
||
" new_lr = self.initial_lr * decay_factor\n",
|
||
" \n",
|
||
" # Update optimizer's learning rate\n",
|
||
" self.optimizer.learning_rate = new_lr\n",
|
||
" ### END SOLUTION\n",
|
||
" \n",
|
||
" def get_lr(self) -> float:\n",
|
||
" \"\"\"\n",
|
||
" Get current learning rate.\n",
|
||
" \n",
|
||
" TODO: Return current learning rate.\n",
|
||
" \n",
|
||
" IMPLEMENTATION HINTS:\n",
|
||
" - Return optimizer.learning_rate\n",
|
||
" \"\"\"\n",
|
||
" ### BEGIN SOLUTION\n",
|
||
" return self.optimizer.learning_rate\n",
|
||
" ### END SOLUTION"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "51901e5b",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"### 🧪 Unit Test: Step Learning Rate Scheduler\n",
|
||
"\n",
|
||
"Let's test your step learning rate scheduler implementation! This scheduler reduces learning rate at regular intervals.\n",
|
||
"\n",
|
||
"**This is a unit test** - it tests one specific class (StepLR) in isolation."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "7b83de77",
|
||
"metadata": {
|
||
"nbgrader": {
|
||
"grade": true,
|
||
"grade_id": "test-step-scheduler",
|
||
"locked": true,
|
||
"points": 10,
|
||
"schema_version": 3,
|
||
"solution": false,
|
||
"task": false
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"def test_step_scheduler_comprehensive():\n",
|
||
" \"\"\"Test StepLR scheduler implementation\"\"\"\n",
|
||
" print(\"🔬 Unit Test: Step Learning Rate Scheduler...\")\n",
|
||
" \n",
|
||
" # Create test parameters and optimizer\n",
|
||
" w = Variable(1.0, requires_grad=True)\n",
|
||
" optimizer = SGD([w], learning_rate=0.1)\n",
|
||
" \n",
|
||
" # Test scheduler initialization\n",
|
||
" try:\n",
|
||
" scheduler = StepLR(optimizer, step_size=10, gamma=0.1)\n",
|
||
" \n",
|
||
" # Test initial learning rate\n",
|
||
" assert scheduler.get_lr() == 0.1, f\"Initial learning rate should be 0.1, got {scheduler.get_lr()}\"\n",
|
||
" print(\"✅ Initial learning rate is correct\")\n",
|
||
" \n",
|
||
" except Exception as e:\n",
|
||
" print(f\"❌ Initial learning rate failed: {e}\")\n",
|
||
" raise\n",
|
||
" \n",
|
||
" # Test step-based decay\n",
|
||
" try:\n",
|
||
" # Steps 1-10: no decay (decay happens after step 10)\n",
|
||
" for i in range(10):\n",
|
||
" scheduler.step()\n",
|
||
" \n",
|
||
" assert scheduler.get_lr() == 0.1, f\"Learning rate should still be 0.1 after 10 steps, got {scheduler.get_lr()}\"\n",
|
||
" \n",
|
||
" # Step 11: decay should occur\n",
|
||
" scheduler.step()\n",
|
||
" expected_lr = 0.1 * 0.1 # 0.01\n",
|
||
" assert abs(scheduler.get_lr() - expected_lr) < 1e-6, f\"Learning rate should be {expected_lr} after 11 steps, got {scheduler.get_lr()}\"\n",
|
||
" print(\"✅ Step-based decay works correctly\")\n",
|
||
" \n",
|
||
" except Exception as e:\n",
|
||
" print(f\"❌ Step-based decay failed: {e}\")\n",
|
||
" raise\n",
|
||
" \n",
|
||
" # Test multiple decay levels\n",
|
||
" try:\n",
|
||
" # Steps 12-20: should stay at 0.01\n",
|
||
" for i in range(9):\n",
|
||
" scheduler.step()\n",
|
||
" \n",
|
||
" assert abs(scheduler.get_lr() - 0.01) < 1e-6, f\"Learning rate should be 0.01 after 20 steps, got {scheduler.get_lr()}\"\n",
|
||
" \n",
|
||
" # Step 21: another decay\n",
|
||
" scheduler.step()\n",
|
||
" expected_lr = 0.01 * 0.1 # 0.001\n",
|
||
" assert abs(scheduler.get_lr() - expected_lr) < 1e-6, f\"Learning rate should be {expected_lr} after 21 steps, got {scheduler.get_lr()}\"\n",
|
||
" print(\"✅ Multiple decay levels work correctly\")\n",
|
||
" \n",
|
||
" except Exception as e:\n",
|
||
" print(f\"❌ Multiple decay levels failed: {e}\")\n",
|
||
" raise\n",
|
||
" \n",
|
||
" # Test with different optimizer\n",
|
||
" try:\n",
|
||
" w2 = Variable(2.0, requires_grad=True)\n",
|
||
" adam_optimizer = Adam([w2], learning_rate=0.001)\n",
|
||
" adam_scheduler = StepLR(adam_optimizer, step_size=5, gamma=0.5)\n",
|
||
" \n",
|
||
" # Test initial learning rate\n",
|
||
" assert adam_scheduler.get_lr() == 0.001, f\"Initial Adam learning rate should be 0.001, got {adam_scheduler.get_lr()}\"\n",
|
||
" \n",
|
||
" # Test decay after 5 steps\n",
|
||
" for i in range(5):\n",
|
||
" adam_scheduler.step()\n",
|
||
" \n",
|
||
" # Learning rate should still be 0.001 after 5 steps\n",
|
||
" assert adam_scheduler.get_lr() == 0.001, f\"Adam learning rate should still be 0.001 after 5 steps, got {adam_scheduler.get_lr()}\"\n",
|
||
" \n",
|
||
" # Step 6: decay should occur\n",
|
||
" adam_scheduler.step()\n",
|
||
" expected_lr = 0.001 * 0.5 # 0.0005\n",
|
||
" assert abs(adam_scheduler.get_lr() - expected_lr) < 1e-6, f\"Adam learning rate should be {expected_lr} after 6 steps, got {adam_scheduler.get_lr()}\"\n",
|
||
" print(\"✅ Works with different optimizers\")\n",
|
||
" \n",
|
||
" except Exception as e:\n",
|
||
" print(f\"❌ Different optimizers failed: {e}\")\n",
|
||
" raise\n",
|
||
"\n",
|
||
" print(\"🎯 Step learning rate scheduler behavior:\")\n",
|
||
" print(\" Reduces learning rate at regular intervals\")\n",
|
||
" print(\" Multiplies current rate by gamma factor\")\n",
|
||
" print(\" Works with any optimizer (SGD, Adam, etc.)\")\n",
|
||
" print(\"📈 Progress: Step Learning Rate Scheduler ✓\")\n",
|
||
"\n",
|
||
"# Run the test\n",
|
||
"test_step_scheduler_comprehensive()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "2fc52bc2",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"## Step 5: Integration - Complete Training Example\n",
|
||
"\n",
|
||
"### Putting It All Together\n",
|
||
"Let's see how optimizers enable complete neural network training:\n",
|
||
"\n",
|
||
"1. **Forward pass**: Compute predictions\n",
|
||
"2. **Loss computation**: Compare with targets\n",
|
||
"3. **Backward pass**: Compute gradients\n",
|
||
"4. **Optimizer step**: Update parameters\n",
|
||
"5. **Learning rate scheduling**: Adjust learning rate\n",
|
||
"\n",
|
||
"### The Modern Training Loop\n",
|
||
"```python\n",
|
||
"# Setup\n",
|
||
"optimizer = Adam(model.parameters(), learning_rate=0.001)\n",
|
||
"scheduler = StepLR(optimizer, step_size=10, gamma=0.1)\n",
|
||
"\n",
|
||
"# Training loop\n",
|
||
"for epoch in range(num_epochs):\n",
|
||
" for batch in dataloader:\n",
|
||
" # Forward pass\n",
|
||
" predictions = model(batch.inputs)\n",
|
||
" loss = criterion(predictions, batch.targets)\n",
|
||
" \n",
|
||
" # Backward pass\n",
|
||
" optimizer.zero_grad()\n",
|
||
" loss.backward()\n",
|
||
" optimizer.step()\n",
|
||
" \n",
|
||
" # Update learning rate\n",
|
||
" scheduler.step()\n",
|
||
"```\n",
|
||
"\n",
|
||
"Let's implement a complete training example!"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "a3205aad",
|
||
"metadata": {
|
||
"lines_to_next_cell": 1,
|
||
"nbgrader": {
|
||
"grade": false,
|
||
"grade_id": "training-integration",
|
||
"locked": false,
|
||
"schema_version": 3,
|
||
"solution": true,
|
||
"task": false
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"def train_simple_model():\n",
|
||
" \"\"\"\n",
|
||
" Complete training example using optimizers.\n",
|
||
" \n",
|
||
" TODO: Implement a complete training loop.\n",
|
||
" \n",
|
||
" APPROACH:\n",
|
||
" 1. Create a simple model (linear regression)\n",
|
||
" 2. Generate training data\n",
|
||
" 3. Set up optimizer and scheduler\n",
|
||
" 4. Train for several epochs\n",
|
||
" 5. Show convergence\n",
|
||
" \n",
|
||
" LEARNING OBJECTIVE:\n",
|
||
" - See how optimizers enable real learning\n",
|
||
" - Compare SGD vs Adam performance\n",
|
||
" - Understand the complete training workflow\n",
|
||
" \"\"\"\n",
|
||
" ### BEGIN SOLUTION\n",
|
||
" print(\"Training simple linear regression model...\")\n",
|
||
" \n",
|
||
" # Create simple model: y = w*x + b\n",
|
||
" w = Variable(0.1, requires_grad=True) # Initialize near zero\n",
|
||
" b = Variable(0.0, requires_grad=True)\n",
|
||
" \n",
|
||
" # Training data: y = 2*x + 1\n",
|
||
" x_data = [1.0, 2.0, 3.0, 4.0, 5.0]\n",
|
||
" y_data = [3.0, 5.0, 7.0, 9.0, 11.0]\n",
|
||
" \n",
|
||
" # Try SGD first\n",
|
||
" print(\"\\n🔍 Training with SGD...\")\n",
|
||
" optimizer_sgd = SGD([w, b], learning_rate=0.01, momentum=0.9)\n",
|
||
" \n",
|
||
" for epoch in range(60):\n",
|
||
" total_loss = 0\n",
|
||
" \n",
|
||
" for x_val, y_val in zip(x_data, y_data):\n",
|
||
" # Forward pass\n",
|
||
" x = Variable(x_val, requires_grad=False)\n",
|
||
" y_target = Variable(y_val, requires_grad=False)\n",
|
||
" \n",
|
||
" # Prediction: y = w*x + b\n",
|
||
" try:\n",
|
||
" from tinytorch.core.autograd import add, multiply, subtract\n",
|
||
" except ImportError:\n",
|
||
" setup_import_paths()\n",
|
||
" from autograd_dev import add, multiply, subtract\n",
|
||
" \n",
|
||
" prediction = add(multiply(w, x), b)\n",
|
||
" \n",
|
||
" # Loss: (prediction - target)^2\n",
|
||
" error = subtract(prediction, y_target)\n",
|
||
" loss = multiply(error, error)\n",
|
||
" \n",
|
||
" # Backward pass\n",
|
||
" optimizer_sgd.zero_grad()\n",
|
||
" loss.backward()\n",
|
||
" optimizer_sgd.step()\n",
|
||
" \n",
|
||
" total_loss += loss.data.data.item()\n",
|
||
" \n",
|
||
" if epoch % 10 == 0:\n",
|
||
" print(f\"Epoch {epoch}: Loss = {total_loss:.4f}, w = {w.data.data.item():.3f}, b = {b.data.data.item():.3f}\")\n",
|
||
" \n",
|
||
" sgd_final_w = w.data.data.item()\n",
|
||
" sgd_final_b = b.data.data.item()\n",
|
||
" \n",
|
||
" # Reset parameters and try Adam\n",
|
||
" print(\"\\n🔍 Training with Adam...\")\n",
|
||
" w.data = Tensor(0.1)\n",
|
||
" b.data = Tensor(0.0)\n",
|
||
" \n",
|
||
" optimizer_adam = Adam([w, b], learning_rate=0.01)\n",
|
||
" \n",
|
||
" for epoch in range(60):\n",
|
||
" total_loss = 0\n",
|
||
" \n",
|
||
" for x_val, y_val in zip(x_data, y_data):\n",
|
||
" # Forward pass\n",
|
||
" x = Variable(x_val, requires_grad=False)\n",
|
||
" y_target = Variable(y_val, requires_grad=False)\n",
|
||
" \n",
|
||
" # Prediction: y = w*x + b\n",
|
||
" prediction = add(multiply(w, x), b)\n",
|
||
" \n",
|
||
" # Loss: (prediction - target)^2\n",
|
||
" error = subtract(prediction, y_target)\n",
|
||
" loss = multiply(error, error)\n",
|
||
" \n",
|
||
" # Backward pass\n",
|
||
" optimizer_adam.zero_grad()\n",
|
||
" loss.backward()\n",
|
||
" optimizer_adam.step()\n",
|
||
" \n",
|
||
" total_loss += loss.data.data.item()\n",
|
||
" \n",
|
||
" if epoch % 10 == 0:\n",
|
||
" print(f\"Epoch {epoch}: Loss = {total_loss:.4f}, w = {w.data.data.item():.3f}, b = {b.data.data.item():.3f}\")\n",
|
||
" \n",
|
||
" adam_final_w = w.data.data.item()\n",
|
||
" adam_final_b = b.data.data.item()\n",
|
||
" \n",
|
||
" print(f\"\\n📊 Results:\")\n",
|
||
" print(f\"Target: w = 2.0, b = 1.0\")\n",
|
||
" print(f\"SGD: w = {sgd_final_w:.3f}, b = {sgd_final_b:.3f}\")\n",
|
||
" print(f\"Adam: w = {adam_final_w:.3f}, b = {adam_final_b:.3f}\")\n",
|
||
" \n",
|
||
" return sgd_final_w, sgd_final_b, adam_final_w, adam_final_b\n",
|
||
" ### END SOLUTION"
|
||
]
|
||
},
|
||
{
"cell_type": "markdown",
"id": "0a5330c4",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"### 🧪 Unit Test: Complete Training Integration\n",
"\n",
"Let's test your complete training integration! This demonstrates optimizers working together in a realistic training scenario.\n",
"\n",
"**This is a unit test** - it tests the complete training workflow with optimizers in isolation."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5aeda8ce",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test-training-integration",
"locked": true,
"points": 25,
"schema_version": 3,
"solution": false,
"task": false
}
},
"outputs": [],
"source": [
"def test_training_integration_comprehensive():\n",
"    \"\"\"Test complete training integration with optimizers\"\"\"\n",
"    print(\"🔬 Unit Test: Complete Training Integration...\")\n",
" \n",
"    # Test training with SGD and Adam\n",
"    try:\n",
"        sgd_w, sgd_b, adam_w, adam_b = train_simple_model()\n",
" \n",
"        # Test SGD convergence\n",
"        assert abs(sgd_w - 2.0) < 0.1, f\"SGD should converge close to w=2.0, got {sgd_w}\"\n",
"        assert abs(sgd_b - 1.0) < 0.1, f\"SGD should converge close to b=1.0, got {sgd_b}\"\n",
"        print(\"✅ SGD convergence works\")\n",
" \n",
"        # Test Adam convergence (may be different due to adaptive learning rates)\n",
"        assert abs(adam_w - 2.0) < 1.0, f\"Adam should converge reasonably close to w=2.0, got {adam_w}\"\n",
"        assert abs(adam_b - 1.0) < 1.0, f\"Adam should converge reasonably close to b=1.0, got {adam_b}\"\n",
"        print(\"✅ Adam convergence works\")\n",
" \n",
"    except Exception as e:\n",
"        print(f\"❌ Training integration failed: {e}\")\n",
"        raise\n",
" \n",
"    # Test optimizer comparison\n",
"    try:\n",
"        # Both optimizers should achieve reasonable results\n",
"        sgd_error = (sgd_w - 2.0)**2 + (sgd_b - 1.0)**2\n",
"        adam_error = (adam_w - 2.0)**2 + (adam_b - 1.0)**2\n",
" \n",
"        # SGD should have low squared error (< 0.1); Adam is allowed more slack (< 1.0)\n",
"        assert sgd_error < 0.1, f\"SGD error should be < 0.1, got {sgd_error}\"\n",
"        assert adam_error < 1.0, f\"Adam error should be < 1.0, got {adam_error}\"\n",
"        print(\"✅ Optimizer comparison works\")\n",
" \n",
"    except Exception as e:\n",
"        print(f\"❌ Optimizer comparison failed: {e}\")\n",
"        raise\n",
" \n",
"    # Test gradient flow\n",
"    try:\n",
"        # Create a simple test to verify gradients flow correctly\n",
"        w = Variable(1.0, requires_grad=True)\n",
"        b = Variable(0.0, requires_grad=True)\n",
" \n",
"        # Set up simple gradients\n",
"        w.grad = Variable(0.1)\n",
"        b.grad = Variable(0.05)\n",
" \n",
"        # Test SGD step\n",
"        sgd_optimizer = SGD([w, b], learning_rate=0.1)\n",
"        original_w = w.data.data.item()\n",
"        original_b = b.data.data.item()\n",
" \n",
"        sgd_optimizer.step()\n",
" \n",
"        # Check updates\n",
"        assert w.data.data.item() != original_w, \"SGD should update w\"\n",
"        assert b.data.data.item() != original_b, \"SGD should update b\"\n",
"        print(\"✅ Gradient flow works correctly\")\n",
" \n",
"    except Exception as e:\n",
"        print(f\"❌ Gradient flow failed: {e}\")\n",
"        raise\n",
"\n",
"    print(\"🎯 Training integration behavior:\")\n",
"    print(\" Optimizers successfully minimize loss functions\")\n",
"    print(\" SGD and Adam both converge to target values\")\n",
"    print(\" Gradient computation and updates work correctly\")\n",
"    print(\" Ready for real neural network training\")\n",
"    print(\"📈 Progress: Complete Training Integration ✓\")\n",
"\n",
"# Run the test\n",
"test_training_integration_comprehensive()"
]
},
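{
"cell_type": "markdown",
"id": "lr-effect-note",
"metadata": {},
"source": [
"*Optional follow-up sketch:* the next cell reuses only pieces already exercised above (`Variable`, `SGD`, and `add`/`multiply`/`subtract`) to show how the learning rate alone changes convergence on the same `y = 2x + 1` data. The two learning rates are arbitrary example values, not prescribed settings."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "lr-effect-sketch",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: effect of the learning rate on the same linear-regression problem.\n",
"# Uses only components demonstrated above; the learning rates are example values.\n",
"try:\n",
"    from tinytorch.core.autograd import add, multiply, subtract\n",
"except ImportError:\n",
"    setup_import_paths()\n",
"    from autograd_dev import add, multiply, subtract\n",
"\n",
"x_data = [1.0, 2.0, 3.0, 4.0, 5.0]\n",
"y_data = [3.0, 5.0, 7.0, 9.0, 11.0]\n",
"\n",
"for lr in [0.001, 0.01]:\n",
"    w = Variable(0.1, requires_grad=True)\n",
"    b = Variable(0.0, requires_grad=True)\n",
"    optimizer = SGD([w, b], learning_rate=lr, momentum=0.9)\n",
"\n",
"    for epoch in range(60):\n",
"        total_loss = 0\n",
"        for x_val, y_val in zip(x_data, y_data):\n",
"            x = Variable(x_val, requires_grad=False)\n",
"            y_target = Variable(y_val, requires_grad=False)\n",
"            error = subtract(add(multiply(w, x), b), y_target)\n",
"            loss = multiply(error, error)\n",
"\n",
"            optimizer.zero_grad()\n",
"            loss.backward()\n",
"            optimizer.step()\n",
"            total_loss += loss.data.data.item()\n",
"\n",
"    print(f\"learning_rate={lr}: final loss={total_loss:.4f}, w={w.data.data.item():.3f}, b={b.data.data.item():.3f}\")"
]
},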
{
"cell_type": "markdown",
"id": "c0464e8c",
"metadata": {},
"source": [
"\"\"\"\n",
"# 🎯 Module Summary: Optimization Mastery!\n",
"\n",
"Congratulations! You've successfully implemented the optimization algorithms that power all modern neural network training:\n",
"\n",
"## ✅ What You've Built\n",
"- **Gradient Descent**: The fundamental parameter update mechanism\n",
"- **SGD with Momentum**: Accelerated convergence with velocity accumulation\n",
"- **Adam Optimizer**: Adaptive learning rates with first and second moments\n",
"- **Learning Rate Scheduling**: Smart learning rate adjustment during training\n",
"- **Complete Training Integration**: End-to-end training workflow\n",
"\n",
"## ✅ Key Learning Outcomes\n",
"- **Understanding**: How optimizers use gradients to update parameters intelligently\n",
"- **Implementation**: Built SGD and Adam optimizers from mathematical foundations\n",
"- **Mathematical mastery**: Momentum, adaptive learning rates, bias correction\n",
"- **Systems integration**: Complete training loops with scheduling\n",
"- **Real-world application**: Modern deep learning training workflow\n",
"\n",
"## ✅ Mathematical Foundations Mastered\n",
"- **Gradient Descent**: θ = θ - α∇L(θ) for parameter updates\n",
"- **Momentum**: v_t = βv_{t-1} + ∇L(θ) for acceleration\n",
"- **Adam**: Adaptive learning rates with exponential moving averages\n",
"- **Learning Rate Scheduling**: Strategic learning rate adjustment\n",
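"\n",
"A minimal NumPy sketch of these update rules (illustrative only; variable names and hyperparameter values are placeholders, not this module's exact API):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"theta, grad = np.array([0.5]), np.array([0.2])  # a parameter and its current gradient\n",
"lr, beta = 0.01, 0.9\n",
"\n",
"# Gradient descent: theta <- theta - lr * grad\n",
"theta_gd = theta - lr * grad\n",
"\n",
"# Momentum: velocity accumulates past gradients before updating the parameter\n",
"v = np.zeros_like(theta)\n",
"v = beta * v + grad\n",
"theta_momentum = theta - lr * v\n",
"\n",
"# Adam: first/second moment estimates with bias correction (t = step count)\n",
"beta1, beta2, eps, t = 0.9, 0.999, 1e-8, 1\n",
"m, s = np.zeros_like(theta), np.zeros_like(theta)\n",
"m = beta1 * m + (1 - beta1) * grad\n",
"s = beta2 * s + (1 - beta2) * grad**2\n",
"m_hat = m / (1 - beta1**t)\n",
"s_hat = s / (1 - beta2**t)\n",
"theta_adam = theta - lr * m_hat / (np.sqrt(s_hat) + eps)\n",
"```\n",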
"\n",
"## ✅ Professional Skills Developed\n",
"- **Algorithm implementation**: Translating mathematical formulas into code\n",
"- **State management**: Tracking optimizer buffers and statistics\n",
"- **Hyperparameter design**: Understanding the impact of learning rate, momentum, etc.\n",
"- **Training orchestration**: Complete training loop design\n",
"\n",
"## ✅ Ready for Advanced Applications\n",
"Your optimizers now enable:\n",
"- **Deep Neural Networks**: Effective training of complex architectures\n",
"- **Computer Vision**: Training CNNs, ResNets, Vision Transformers\n",
"- **Natural Language Processing**: Training transformers and language models\n",
"- **Any ML Model**: Gradient-based optimization for any differentiable system\n",
"\n",
"## 🔗 Connection to Real ML Systems\n",
"Your implementations mirror production systems:\n",
"- **PyTorch**: `torch.optim.SGD()`, `torch.optim.Adam()`, `torch.optim.lr_scheduler.StepLR()`\n",
"- **TensorFlow**: `tf.keras.optimizers.SGD()`, `tf.keras.optimizers.Adam()`\n",
"- **Industry Standard**: Every major ML framework uses these exact algorithms\n",
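"\n",
"For reference, here is roughly how the same pieces are spelled in PyTorch (the stand-in model and hyperparameter values are arbitrary examples):\n",
"\n",
"```python\n",
"import torch\n",
"\n",
"model = torch.nn.Linear(1, 1)  # stand-in model with trainable parameters\n",
"\n",
"# PyTorch uses lr where this module uses learning_rate\n",
"sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)\n",
"adam = torch.optim.Adam(model.parameters(), lr=0.001)\n",
"scheduler = torch.optim.lr_scheduler.StepLR(sgd, step_size=30, gamma=0.1)\n",
"```\n",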
"\n",
"## 🎯 The Power of Intelligent Optimization\n",
"You've unlocked the algorithms that made modern AI possible:\n",
"- **Scalability**: Efficiently optimize millions of parameters\n",
"- **Adaptability**: Different learning rates for different parameters\n",
"- **Robustness**: Handle noisy gradients and ill-conditioned problems\n",
"- **Universality**: Work with any differentiable neural network\n",
"\n",
"## 🧠 Deep Learning Revolution\n",
"You now understand the optimization technology that powers:\n",
"- **ImageNet**: Training state-of-the-art computer vision models\n",
"- **Language Models**: Training GPT, BERT, and other transformers\n",
"- **Modern AI**: Every breakthrough relies on these optimization algorithms\n",
"- **Future Research**: Your understanding enables you to develop new optimizers\n",
"\n",
"## 🚀 What's Next\n",
"Your optimizers are the foundation for:\n",
"- **Training Module**: Complete training loops with loss functions and metrics\n",
"- **Advanced Optimizers**: RMSprop, AdaGrad, learning rate warm-up\n",
"- **Distributed Training**: Multi-GPU optimization strategies\n",
"- **Research**: Experimenting with novel optimization algorithms\n",
"\n",
"**Next Module**: Complete training systems that orchestrate your optimizers for real-world ML!\n",
"\n",
"You've built the intelligent algorithms that enable neural networks to learn. Now let's use them to train systems that can solve complex real-world problems!\n",
"\"\"\"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "run-inline-tests",
"metadata": {},
"outputs": [],
"source": [
"# Run inline tests when module is executed directly\n",
"if __name__ == \"__main__\":\n",
"    from tito.tools.testing import run_module_tests_auto\n",
" \n",
"    # Automatically discover and run all tests in this module\n",
"    run_module_tests_auto(\"Optimizers\")"
]
}
],
"metadata": {
"jupytext": {
"main_language": "python"
}
},
"nbformat": 4,
"nbformat_minor": 5
}