TinyTorch/modules/source/02_activations/activations_dev.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a65f03ef",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "# Activations - Intelligence Through Nonlinearity\n",
    "\n",
    "Welcome to Activations! Today you'll add the secret ingredient that makes neural networks intelligent: **nonlinearity**.\n",
    "\n",
    "## 🔗 Prerequisites & Progress\n",
    "**You've Built**: Tensor with data manipulation and basic operations\n",
    "**You'll Build**: Activation functions that add nonlinearity to transformations\n",
    "**You'll Enable**: Neural networks with the ability to learn complex patterns\n",
    "\n",
    "**Connection Map**:\n",
    "```\n",
    "Tensor → Activations → Layers\n",
    "(data)   (intelligence) (architecture)\n",
    "```\n",
    "\n",
    "## Learning Objectives\n",
    "By the end of this module, you will:\n",
    "1. Implement 5 core activation functions (Sigmoid, ReLU, Tanh, GELU, Softmax)\n",
    "2. Understand how nonlinearity enables neural network intelligence\n",
    "3. Test activation behaviors and output ranges\n",
    "4. Connect activations to real neural network components\n",
    "\n",
    "Let's add intelligence to your tensors!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2d2bde70",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## 📦 Where This Code Lives in the Final Package\n",
    "\n",
    "**Learning Side:** You work in modules/02_activations/activations_dev.py\n",
    "**Building Side:** Code exports to tinytorch.core.activations\n",
    "\n",
    "```python\n",
    "# Final package structure:\n",
    "from tinytorch.core.activations import Sigmoid, ReLU, Tanh, GELU, Softmax  # This module\n",
    "from tinytorch.core.tensor import Tensor  # Foundation (Module 01)\n",
    "```\n",
    "\n",
    "**Why this matters:**\n",
    "- **Learning:** Complete activation system in one focused module for deep understanding\n",
    "- **Production:** Proper organization like PyTorch's torch.nn.functional with all activation operations together\n",
    "- **Consistency:** All activation functions and behaviors in core.activations\n",
    "- **Integration:** Works seamlessly with Tensor for complete nonlinear transformations"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fc87ae92",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## 📋 Module Prerequisites & Setup\n",
    "\n",
    "This module builds on previous TinyTorch components. Here's what we need and why:\n",
    "\n",
    "**Required Components:**\n",
    "- **Tensor** (Module 01): Foundation for all activation computations and data flow\n",
    "\n",
    "**Integration Helper:**\n",
    "The `import_previous_module()` function below helps us cleanly import components from previous modules during development and testing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7797ec62",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "setup",
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "#| default_exp core.activations\n",
    "#| export\n",
    "\n",
    "import numpy as np\n",
    "from typing import Optional\n",
    "import sys\n",
    "import os\n",
    "\n",
    "\n",
    "# Import will be in export cell"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4cf71245",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## 1. Introduction - What Makes Neural Networks Intelligent?\n",
    "\n",
    "Consider two scenarios:\n",
    "\n",
    "**Without Activations (Linear Only):**\n",
    "```\n",
    "Input → Linear Transform → Output\n",
    "[1, 2] → [3, 4] → [11]  # Just weighted sum\n",
    "```\n",
    "\n",
    "**With Activations (Nonlinear):**\n",
    "```\n",
    "Input → Linear → Activation → Linear → Activation → Output\n",
    "[1, 2] → [3, 4] → [3, 4] → [7] → [7] → Complex Pattern!\n",
    "```\n",
    "\n",
    "The magic happens in those activation functions. They introduce **nonlinearity** - the ability to curve, bend, and create complex decision boundaries instead of just straight lines.\n",
    "\n",
    "### Why Nonlinearity Matters\n",
    "\n",
    "Without activation functions, stacking multiple linear layers is pointless:\n",
    "```\n",
    "Linear(Linear(x)) = Linear(x)  # Same as single layer!\n",
    "```\n",
    "\n",
    "With activation functions, each layer can learn increasingly complex patterns:\n",
    "```\n",
    "Layer 1: Simple edges and lines\n",
    "Layer 2: Curves and shapes\n",
    "Layer 3: Complex objects and concepts\n",
    "```\n",
    "\n",
    "This is how deep networks build intelligence from simple mathematical operations."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1a42e702",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## 2. Mathematical Foundations\n",
    "\n",
    "Each activation function serves a different purpose in neural networks:\n",
    "\n",
    "### The Five Essential Activations\n",
    "\n",
    "1. **Sigmoid**: Maps to (0, 1) - perfect for probabilities\n",
    "2. **ReLU**: Removes negatives - creates sparsity and efficiency\n",
    "3. **Tanh**: Maps to (-1, 1) - zero-centered for better training\n",
    "4. **GELU**: Smooth ReLU - modern choice for transformers\n",
    "5. **Softmax**: Creates probability distributions - essential for classification\n",
    "\n",
    "Let's implement each one with clear explanations and immediate testing!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a08f91f1",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## 3. Implementation - Building Activation Functions\n",
    "\n",
    "### 🏗️ Implementation Pattern\n",
    "\n",
    "Each activation follows this structure:\n",
    "```python\n",
    "class ActivationName:\n",
    "    def forward(self, x: Tensor) -> Tensor:\n",
    "        # Apply mathematical transformation\n",
    "        # Return new Tensor with result\n",
    "\n",
    "    def backward(self, grad: Tensor) -> Tensor:\n",
    "        # Stub for Module 05 - gradient computation\n",
    "        pass\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bb7e11b8",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## Sigmoid - The Probability Gatekeeper\n",
    "\n",
    "Sigmoid maps any real number to the range (0, 1), making it perfect for probabilities and binary decisions.\n",
    "\n",
    "### Mathematical Definition\n",
    "```\n",
    "σ(x) = 1/(1 + e^(-x))\n",
    "```\n",
    "\n",
    "### Visual Behavior\n",
    "```\n",
    "Input:  [-3, -1,  0,  1,  3]\n",
    "         ↓   ↓   ↓   ↓   ↓  Sigmoid Function\n",
    "Output: [0.05, 0.27, 0.5, 0.73, 0.95]\n",
    "```\n",
    "\n",
    "### ASCII Visualization\n",
    "```\n",
    "Sigmoid Curve:\n",
    "    1.0 ┤     ╭─────\n",
    "        │    ╱\n",
    "    0.5 ┤   ╱\n",
    "        │  ╱\n",
    "    0.0 ┤─╱─────────\n",
    "       -3  0  3\n",
    "```\n",
    "\n",
    "**Why Sigmoid matters**: In binary classification, we need outputs between 0 and 1 to represent probabilities. Sigmoid gives us exactly that!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b90730ab",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "sigmoid-impl",
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "#| export\n",
    "from tinytorch.core.tensor import Tensor\n",
    "\n",
    "class Sigmoid:\n",
    "    \"\"\"\n",
    "    Sigmoid activation: σ(x) = 1/(1 + e^(-x))\n",
    "\n",
    "    Maps any real number to (0, 1) range.\n",
    "    Perfect for probabilities and binary classification.\n",
    "    \"\"\"\n",
    "\n",
    "    def forward(self, x: Tensor) -> Tensor:\n",
    "        \"\"\"\n",
    "        Apply sigmoid activation element-wise.\n",
    "\n",
    "        TODO: Implement sigmoid function\n",
    "\n",
    "        APPROACH:\n",
    "        1. Apply sigmoid formula: 1 / (1 + exp(-x))\n",
    "        2. Use np.exp for exponential\n",
    "        3. Return result wrapped in new Tensor\n",
    "\n",
    "        EXAMPLE:\n",
    "        >>> sigmoid = Sigmoid()\n",
    "        >>> x = Tensor([-2, 0, 2])\n",
    "        >>> result = sigmoid(x)\n",
    "        >>> print(result.data)\n",
    "        [0.119, 0.5, 0.881]  # All values between 0 and 1\n",
    "\n",
    "        HINT: Use np.exp(-x.data) for numerical stability\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        # Apply sigmoid: 1 / (1 + exp(-x))\n",
    "        result_data = 1.0 / (1.0 + np.exp(-x.data))\n",
    "        result = Tensor(result_data)\n",
    "        \n",
    "        # Track gradients if autograd is enabled and input requires_grad\n",
    "        if SigmoidBackward is not None and x.requires_grad:\n",
    "            result.requires_grad = True\n",
    "            result._grad_fn = SigmoidBackward(x, result)\n",
    "        \n",
    "        return result\n",
    "        ### END SOLUTION\n",
    "\n",
    "    def __call__(self, x: Tensor) -> Tensor:\n",
    "        \"\"\"Allows the activation to be called like a function.\"\"\"\n",
    "        return self.forward(x)\n",
    "\n",
    "    def backward(self, grad: Tensor) -> Tensor:\n",
    "        \"\"\"Compute gradient (implemented in Module 05).\"\"\"\n",
    "        pass  # Will implement backward pass in Module 05"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "27a57cf3",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "### 🔬 Unit Test: Sigmoid\n",
    "This test validates sigmoid activation behavior.\n",
    "**What we're testing**: Sigmoid maps inputs to (0, 1) range\n",
    "**Why it matters**: Ensures proper probability-like outputs\n",
    "**Expected**: All outputs between 0 and 1, sigmoid(0) = 0.5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "91296689",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "test-sigmoid",
     "locked": true,
     "points": 10
    }
   },
   "outputs": [],
   "source": [
    "def test_unit_sigmoid():\n",
    "    \"\"\"🔬 Test Sigmoid implementation.\"\"\"\n",
    "    print(\"🔬 Unit Test: Sigmoid...\")\n",
    "\n",
    "    sigmoid = Sigmoid()\n",
    "\n",
    "    # Test basic cases\n",
    "    x = Tensor([0.0])\n",
    "    result = sigmoid.forward(x)\n",
    "    assert np.allclose(result.data, [0.5]), f\"sigmoid(0) should be 0.5, got {result.data}\"\n",
    "\n",
    "    # Test range property - all outputs should be in (0, 1)\n",
    "    x = Tensor([-10, -1, 0, 1, 10])\n",
    "    result = sigmoid.forward(x)\n",
    "    assert np.all(result.data > 0) and np.all(result.data < 1), \"All sigmoid outputs should be in (0, 1)\"\n",
    "\n",
    "    # Test specific values\n",
    "    x = Tensor([-1000, 1000])  # Extreme values\n",
    "    result = sigmoid.forward(x)\n",
    "    assert np.allclose(result.data[0], 0, atol=1e-10), \"sigmoid(-∞) should approach 0\"\n",
    "    assert np.allclose(result.data[1], 1, atol=1e-10), \"sigmoid(+∞) should approach 1\"\n",
    "\n",
    "    print(\"✅ Sigmoid works correctly!\")\n",
    "\n",
    "if __name__ == \"__main__\":\n",
    "    test_unit_sigmoid()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "41ae8ed4",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## ReLU - The Sparsity Creator\n",
    "\n",
    "ReLU (Rectified Linear Unit) is the most popular activation function. It simply removes negative values, creating sparsity that makes neural networks more efficient.\n",
    "\n",
    "### Mathematical Definition\n",
    "```\n",
    "f(x) = max(0, x)\n",
    "```\n",
    "\n",
    "### Visual Behavior\n",
    "```\n",
    "Input:  [-2, -1,  0,  1,  2]\n",
    "         ↓   ↓   ↓   ↓   ↓  ReLU Function\n",
    "Output: [ 0,  0,  0,  1,  2]\n",
    "```\n",
    "\n",
    "### ASCII Visualization\n",
    "```\n",
    "ReLU Function:\n",
    "        ╱\n",
    "    2  ╱\n",
    "      ╱\n",
    "    1╱\n",
    "    ╱\n",
    "   ╱\n",
    "  ╱\n",
    "─┴─────\n",
    "-2  0  2\n",
    "```\n",
    "\n",
    "**Why ReLU matters**: By zeroing negative values, ReLU creates sparsity (many zeros) which makes computation faster and helps prevent overfitting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c3438519",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "relu-impl",
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "#| export\n",
    "class ReLU:\n",
    "    \"\"\"\n",
    "    ReLU activation: f(x) = max(0, x)\n",
    "\n",
    "    Sets negative values to zero, keeps positive values unchanged.\n",
    "    Most popular activation for hidden layers.\n",
    "    \"\"\"\n",
    "\n",
    "    def forward(self, x: Tensor) -> Tensor:\n",
    "        \"\"\"\n",
    "        Apply ReLU activation element-wise.\n",
    "\n",
    "        TODO: Implement ReLU function\n",
    "\n",
    "        APPROACH:\n",
    "        1. Use np.maximum(0, x.data) for element-wise max with zero\n",
    "        2. Return result wrapped in new Tensor\n",
    "\n",
    "        EXAMPLE:\n",
    "        >>> relu = ReLU()\n",
    "        >>> x = Tensor([-2, -1, 0, 1, 2])\n",
    "        >>> result = relu(x)\n",
    "        >>> print(result.data)\n",
    "        [0, 0, 0, 1, 2]  # Negative values become 0, positive unchanged\n",
    "\n",
    "        HINT: np.maximum handles element-wise maximum automatically\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        # Apply ReLU: max(0, x)\n",
    "        result = np.maximum(0, x.data)\n",
    "        return Tensor(result)\n",
    "        ### END SOLUTION\n",
    "\n",
    "    def __call__(self, x: Tensor) -> Tensor:\n",
    "        \"\"\"Allows the activation to be called like a function.\"\"\"\n",
    "        return self.forward(x)\n",
    "\n",
    "    def backward(self, grad: Tensor) -> Tensor:\n",
    "        \"\"\"Compute gradient (implemented in Module 05).\"\"\"\n",
    "        pass  # Will implement backward pass in Module 05"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b038349a",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "### 🔬 Unit Test: ReLU\n",
    "This test validates ReLU activation behavior.\n",
    "**What we're testing**: ReLU zeros negative values, preserves positive\n",
    "**Why it matters**: ReLU's sparsity helps neural networks train efficiently\n",
    "**Expected**: Negative → 0, positive unchanged, zero → 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "710535c5",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "test-relu",
     "locked": true,
     "points": 10
    }
   },
   "outputs": [],
   "source": [
    "def test_unit_relu():\n",
    "    \"\"\"🔬 Test ReLU implementation.\"\"\"\n",
    "    print(\"🔬 Unit Test: ReLU...\")\n",
    "\n",
    "    relu = ReLU()\n",
    "\n",
    "    # Test mixed positive/negative values\n",
    "    x = Tensor([-2, -1, 0, 1, 2])\n",
    "    result = relu.forward(x)\n",
    "    expected = [0, 0, 0, 1, 2]\n",
    "    assert np.allclose(result.data, expected), f\"ReLU failed, expected {expected}, got {result.data}\"\n",
    "\n",
    "    # Test all negative\n",
    "    x = Tensor([-5, -3, -1])\n",
    "    result = relu.forward(x)\n",
    "    assert np.allclose(result.data, [0, 0, 0]), \"ReLU should zero all negative values\"\n",
    "\n",
    "    # Test all positive\n",
    "    x = Tensor([1, 3, 5])\n",
    "    result = relu.forward(x)\n",
    "    assert np.allclose(result.data, [1, 3, 5]), \"ReLU should preserve all positive values\"\n",
    "\n",
    "    # Test sparsity property\n",
    "    x = Tensor([-1, -2, -3, 1])\n",
    "    result = relu.forward(x)\n",
    "    zeros = np.sum(result.data == 0)\n",
    "    assert zeros == 3, f\"ReLU should create sparsity, got {zeros} zeros out of 4\"\n",
    "\n",
    "    print(\"✅ ReLU works correctly!\")\n",
    "\n",
    "if __name__ == \"__main__\":\n",
    "    test_unit_relu()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "25c9a414",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## Tanh - The Zero-Centered Alternative\n",
    "\n",
    "Tanh (hyperbolic tangent) is like sigmoid but centered around zero, mapping inputs to (-1, 1). This zero-centering helps with gradient flow during training.\n",
    "\n",
    "### Mathematical Definition\n",
    "```\n",
    "f(x) = (e^x - e^(-x))/(e^x + e^(-x))\n",
    "```\n",
    "\n",
    "### Visual Behavior\n",
    "```\n",
    "Input:  [-2,  0,  2]\n",
    "         ↓   ↓   ↓  Tanh Function\n",
    "Output: [-0.96, 0, 0.96]\n",
    "```\n",
    "\n",
    "### ASCII Visualization\n",
    "```\n",
    "Tanh Curve:\n",
    "    1 ┤     ╭─────\n",
    "      │    ╱\n",
    "    0 ┤───╱─────\n",
    "      │  ╱\n",
    "   -1 ┤─╱───────\n",
    "     -3  0  3\n",
    "```\n",
    "\n",
    "**Why Tanh matters**: Unlike sigmoid, tanh outputs are centered around zero, which can help gradients flow better through deep networks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2e428827",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "tanh-impl",
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "#| export\n",
    "class Tanh:\n",
    "    \"\"\"\n",
    "    Tanh activation: f(x) = (e^x - e^(-x))/(e^x + e^(-x))\n",
    "\n",
    "    Maps any real number to (-1, 1) range.\n",
    "    Zero-centered alternative to sigmoid.\n",
    "    \"\"\"\n",
    "\n",
    "    def forward(self, x: Tensor) -> Tensor:\n",
    "        \"\"\"\n",
    "        Apply tanh activation element-wise.\n",
    "\n",
    "        TODO: Implement tanh function\n",
    "\n",
    "        APPROACH:\n",
    "        1. Use np.tanh(x.data) for hyperbolic tangent\n",
    "        2. Return result wrapped in new Tensor\n",
    "\n",
    "        EXAMPLE:\n",
    "        >>> tanh = Tanh()\n",
    "        >>> x = Tensor([-2, 0, 2])\n",
    "        >>> result = tanh(x)\n",
    "        >>> print(result.data)\n",
    "        [-0.964, 0.0, 0.964]  # Range (-1, 1), symmetric around 0\n",
    "\n",
    "        HINT: NumPy provides np.tanh function\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        # Apply tanh using NumPy\n",
    "        result = np.tanh(x.data)\n",
    "        return Tensor(result)\n",
    "        ### END SOLUTION\n",
    "\n",
    "    def __call__(self, x: Tensor) -> Tensor:\n",
    "        \"\"\"Allows the activation to be called like a function.\"\"\"\n",
    "        return self.forward(x)\n",
    "\n",
    "    def backward(self, grad: Tensor) -> Tensor:\n",
    "        \"\"\"Compute gradient (implemented in Module 05).\"\"\"\n",
    "        pass  # Will implement backward pass in Module 05"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "045af2f1",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "### 🔬 Unit Test: Tanh\n",
    "This test validates tanh activation behavior.\n",
    "**What we're testing**: Tanh maps inputs to (-1, 1) range, zero-centered\n",
    "**Why it matters**: Zero-centered activations can help with gradient flow\n",
    "**Expected**: All outputs in (-1, 1), tanh(0) = 0, symmetric behavior"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "287a3c73",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "test-tanh",
     "locked": true,
     "points": 10
    }
   },
   "outputs": [],
   "source": [
    "def test_unit_tanh():\n",
    "    \"\"\"🔬 Test Tanh implementation.\"\"\"\n",
    "    print(\"🔬 Unit Test: Tanh...\")\n",
    "\n",
    "    tanh = Tanh()\n",
    "\n",
    "    # Test zero\n",
    "    x = Tensor([0.0])\n",
    "    result = tanh.forward(x)\n",
    "    assert np.allclose(result.data, [0.0]), f\"tanh(0) should be 0, got {result.data}\"\n",
    "\n",
    "    # Test range property - all outputs should be in (-1, 1)\n",
    "    x = Tensor([-10, -1, 0, 1, 10])\n",
    "    result = tanh.forward(x)\n",
    "    assert np.all(result.data >= -1) and np.all(result.data <= 1), \"All tanh outputs should be in [-1, 1]\"\n",
    "\n",
    "    # Test symmetry: tanh(-x) = -tanh(x)\n",
    "    x = Tensor([2.0])\n",
    "    pos_result = tanh.forward(x)\n",
    "    x_neg = Tensor([-2.0])\n",
    "    neg_result = tanh.forward(x_neg)\n",
    "    assert np.allclose(pos_result.data, -neg_result.data), \"tanh should be symmetric: tanh(-x) = -tanh(x)\"\n",
    "\n",
    "    # Test extreme values\n",
    "    x = Tensor([-1000, 1000])\n",
    "    result = tanh.forward(x)\n",
    "    assert np.allclose(result.data[0], -1, atol=1e-10), \"tanh(-∞) should approach -1\"\n",
    "    assert np.allclose(result.data[1], 1, atol=1e-10), \"tanh(+∞) should approach 1\"\n",
    "\n",
    "    print(\"✅ Tanh works correctly!\")\n",
    "\n",
    "if __name__ == \"__main__\":\n",
    "    test_unit_tanh()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7be7b936",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## GELU - The Smooth Modern Choice\n",
    "\n",
    "GELU (Gaussian Error Linear Unit) is a smooth approximation to ReLU that's become popular in modern architectures like transformers. Unlike ReLU's sharp corner, GELU is smooth everywhere.\n",
    "\n",
    "### Mathematical Definition\n",
    "```\n",
    "f(x) = x * Φ(x) ≈ x * Sigmoid(1.702 * x)\n",
    "```\n",
    "Where Φ(x) is the cumulative distribution function of standard normal distribution.\n",
    "\n",
    "### Visual Behavior\n",
    "```\n",
    "Input:  [-1,  0,  1]\n",
    "         ↓   ↓   ↓  GELU Function\n",
    "Output: [-0.16, 0, 0.84]\n",
    "```\n",
    "\n",
    "### ASCII Visualization\n",
    "```\n",
    "GELU Function:\n",
    "        ╱\n",
    "    1  ╱\n",
    "      ╱\n",
    "     ╱\n",
    "    ╱\n",
    "   ╱ ↙ (smooth curve, no sharp corner)\n",
    "  ╱\n",
    "─┴─────\n",
    "-2  0  2\n",
    "```\n",
    "\n",
    "**Why GELU matters**: Used in GPT, BERT, and other transformers. The smoothness helps with optimization compared to ReLU's sharp corner."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "faa72fc8",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "gelu-impl",
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "#| export\n",
    "class GELU:\n",
    "    \"\"\"\n",
    "    GELU activation: f(x) = x * Φ(x) ≈ x * Sigmoid(1.702 * x)\n",
    "\n",
    "    Smooth approximation to ReLU, used in modern transformers.\n",
    "    Where Φ(x) is the cumulative distribution function of standard normal.\n",
    "    \"\"\"\n",
    "\n",
    "    def forward(self, x: Tensor) -> Tensor:\n",
    "        \"\"\"\n",
    "        Apply GELU activation element-wise.\n",
    "\n",
    "        TODO: Implement GELU approximation\n",
    "\n",
    "        APPROACH:\n",
    "        1. Use approximation: x * sigmoid(1.702 * x)\n",
    "        2. Compute sigmoid part: 1 / (1 + exp(-1.702 * x))\n",
    "        3. Multiply by x element-wise\n",
    "        4. Return result wrapped in new Tensor\n",
    "\n",
    "        EXAMPLE:\n",
    "        >>> gelu = GELU()\n",
    "        >>> x = Tensor([-1, 0, 1])\n",
    "        >>> result = gelu(x)\n",
    "        >>> print(result.data)\n",
    "        [-0.159, 0.0, 0.841]  # Smooth, like ReLU but differentiable everywhere\n",
    "\n",
    "        HINT: The 1.702 constant comes from √(2/π) approximation\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        # GELU approximation: x * sigmoid(1.702 * x)\n",
    "        # First compute sigmoid part\n",
    "        sigmoid_part = 1.0 / (1.0 + np.exp(-1.702 * x.data))\n",
    "        # Then multiply by x\n",
    "        result = x.data * sigmoid_part\n",
    "        return Tensor(result)\n",
    "        ### END SOLUTION\n",
    "\n",
    "    def __call__(self, x: Tensor) -> Tensor:\n",
    "        \"\"\"Allows the activation to be called like a function.\"\"\"\n",
    "        return self.forward(x)\n",
    "\n",
    "    def backward(self, grad: Tensor) -> Tensor:\n",
    "        \"\"\"Compute gradient (implemented in Module 05).\"\"\"\n",
    "        pass  # Will implement backward pass in Module 05"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aca7e16d",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "### 🔬 Unit Test: GELU\n",
    "This test validates GELU activation behavior.\n",
    "**What we're testing**: GELU provides smooth ReLU-like behavior\n",
    "**Why it matters**: GELU is used in modern transformers like GPT and BERT\n",
    "**Expected**: Smooth curve, GELU(0) ≈ 0, positive values preserved roughly"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d66fad33",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "test-gelu",
     "locked": true,
     "points": 10
    }
   },
   "outputs": [],
   "source": [
    "def test_unit_gelu():\n",
    "    \"\"\"🔬 Test GELU implementation.\"\"\"\n",
    "    print(\"🔬 Unit Test: GELU...\")\n",
    "\n",
    "    gelu = GELU()\n",
    "\n",
    "    # Test zero (should be approximately 0)\n",
    "    x = Tensor([0.0])\n",
    "    result = gelu.forward(x)\n",
    "    assert np.allclose(result.data, [0.0], atol=1e-10), f\"GELU(0) should be ≈0, got {result.data}\"\n",
    "\n",
    "    # Test positive values (should be roughly preserved)\n",
    "    x = Tensor([1.0])\n",
    "    result = gelu.forward(x)\n",
    "    assert result.data[0] > 0.8, f\"GELU(1) should be ≈0.84, got {result.data[0]}\"\n",
    "\n",
    "    # Test negative values (should be small but not zero)\n",
    "    x = Tensor([-1.0])\n",
    "    result = gelu.forward(x)\n",
    "    assert result.data[0] < 0 and result.data[0] > -0.2, f\"GELU(-1) should be ≈-0.16, got {result.data[0]}\"\n",
    "\n",
    "    # Test smoothness property (no sharp corners like ReLU)\n",
    "    x = Tensor([-0.001, 0.0, 0.001])\n",
    "    result = gelu.forward(x)\n",
    "    # Values should be close to each other (smooth)\n",
    "    diff1 = abs(result.data[1] - result.data[0])\n",
    "    diff2 = abs(result.data[2] - result.data[1])\n",
    "    assert diff1 < 0.01 and diff2 < 0.01, \"GELU should be smooth around zero\"\n",
    "\n",
    "    print(\"✅ GELU works correctly!\")\n",
    "\n",
    "if __name__ == \"__main__\":\n",
    "    test_unit_gelu()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13a2312e",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## Softmax - The Probability Distributor\n",
    "\n",
    "Softmax converts any vector into a valid probability distribution. All outputs are positive and sum to exactly 1.0, making it essential for multi-class classification.\n",
    "\n",
    "### Mathematical Definition\n",
    "```\n",
    "f(x_i) = e^(x_i) / Σ(e^(x_j))\n",
    "```\n",
    "\n",
    "### Visual Behavior\n",
    "```\n",
    "Input:  [1, 2, 3]\n",
    "         ↓  ↓  ↓  Softmax Function\n",
    "Output: [0.09, 0.24, 0.67]  # Sum = 1.0\n",
    "```\n",
    "\n",
    "### ASCII Visualization\n",
    "```\n",
    "Softmax Transform:\n",
    "Raw scores: [1, 2, 3, 4]\n",
    "           ↓ Exponential ↓\n",
    "          [2.7, 7.4, 20.1, 54.6]\n",
    "           ↓ Normalize ↓\n",
    "          [0.03, 0.09, 0.24, 0.64]  ← Sum = 1.0\n",
    "```\n",
    "\n",
    "**Why Softmax matters**: In multi-class classification, we need outputs that represent probabilities for each class. Softmax guarantees valid probabilities."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a5fbaab2",
   "metadata": {
    "lines_to_next_cell": 1,
    "nbgrader": {
     "grade": false,
     "grade_id": "softmax-impl",
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "#| export\n",
    "class Softmax:\n",
    "    \"\"\"\n",
    "    Softmax activation: f(x_i) = e^(x_i) / Σ(e^(x_j))\n",
    "\n",
    "    Converts any vector to a probability distribution.\n",
    "    Sum of all outputs equals 1.0.\n",
    "    \"\"\"\n",
    "\n",
    "    def forward(self, x: Tensor, dim: int = -1) -> Tensor:\n",
    "        \"\"\"\n",
    "        Apply softmax activation along specified dimension.\n",
    "\n",
    "        TODO: Implement numerically stable softmax\n",
    "\n",
    "        APPROACH:\n",
    "        1. Subtract max for numerical stability: x - max(x)\n",
    "        2. Compute exponentials: exp(x - max(x))\n",
    "        3. Sum along dimension: sum(exp_values)\n",
    "        4. Divide: exp_values / sum\n",
    "        5. Return result wrapped in new Tensor\n",
    "\n",
    "        EXAMPLE:\n",
    "        >>> softmax = Softmax()\n",
    "        >>> x = Tensor([1, 2, 3])\n",
    "        >>> result = softmax(x)\n",
    "        >>> print(result.data)\n",
    "        [0.090, 0.245, 0.665]  # Sums to 1.0, larger inputs get higher probability\n",
    "\n",
    "        HINTS:\n",
    "        - Use np.max(x.data, axis=dim, keepdims=True) for max\n",
    "        - Use np.sum(exp_values, axis=dim, keepdims=True) for sum\n",
    "        - The max subtraction prevents overflow in exponentials\n",
    "        \"\"\"\n",
    "        ### BEGIN SOLUTION\n",
    "        # Numerical stability: subtract max to prevent overflow\n",
    "        # Use Tensor operations to preserve gradient flow!\n",
    "        x_max_data = np.max(x.data, axis=dim, keepdims=True)\n",
    "        x_max = Tensor(x_max_data, requires_grad=False)  # max is not differentiable in this context\n",
    "        x_shifted = x - x_max  # Tensor subtraction!\n",
    "\n",
    "        # Compute exponentials (NumPy operation, but wrapped in Tensor)\n",
    "        exp_values = Tensor(np.exp(x_shifted.data), requires_grad=x_shifted.requires_grad)\n",
    "\n",
    "        # Sum along dimension (Tensor operation)\n",
    "        exp_sum_data = np.sum(exp_values.data, axis=dim, keepdims=True)\n",
    "        exp_sum = Tensor(exp_sum_data, requires_grad=exp_values.requires_grad)\n",
    "\n",
    "        # Normalize to get probabilities (Tensor division!)\n",
    "        result = exp_values / exp_sum\n",
    "        return result\n",
    "        ### END SOLUTION\n",
    "\n",
    "    def __call__(self, x: Tensor, dim: int = -1) -> Tensor:\n",
    "        \"\"\"Allows the activation to be called like a function.\"\"\"\n",
    "        return self.forward(x, dim)\n",
    "\n",
    "    def backward(self, grad: Tensor) -> Tensor:\n",
    "        \"\"\"Compute gradient (implemented in Module 05).\"\"\"\n",
    "        pass  # Will implement backward pass in Module 05"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b7f6d4a6",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "### 🔬 Unit Test: Softmax\n",
    "This test validates softmax activation behavior.\n",
    "**What we're testing**: Softmax creates valid probability distributions\n",
    "**Why it matters**: Essential for multi-class classification outputs\n",
    "**Expected**: Outputs sum to 1.0, all values in (0, 1), largest input gets highest probability"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a68dea4a",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "test-softmax",
     "locked": true,
     "points": 10
    }
   },
   "outputs": [],
   "source": [
    "def test_unit_softmax():\n",
    "    \"\"\"🔬 Test Softmax implementation.\"\"\"\n",
    "    print(\"🔬 Unit Test: Softmax...\")\n",
    "\n",
    "    softmax = Softmax()\n",
    "\n",
    "    # Test basic probability properties\n",
    "    x = Tensor([1, 2, 3])\n",
    "    result = softmax.forward(x)\n",
    "\n",
    "    # Should sum to 1\n",
    "    assert np.allclose(np.sum(result.data), 1.0), f\"Softmax should sum to 1, got {np.sum(result.data)}\"\n",
    "\n",
    "    # All values should be positive\n",
    "    assert np.all(result.data > 0), \"All softmax values should be positive\"\n",
    "\n",
    "    # All values should be less than 1\n",
    "    assert np.all(result.data < 1), \"All softmax values should be less than 1\"\n",
    "\n",
    "    # Largest input should get largest output\n",
    "    max_input_idx = np.argmax(x.data)\n",
    "    max_output_idx = np.argmax(result.data)\n",
    "    assert max_input_idx == max_output_idx, \"Largest input should get largest softmax output\"\n",
    "\n",
    "    # Test numerical stability with large numbers\n",
    "    x = Tensor([1000, 1001, 1002])  # Would overflow without max subtraction\n",
    "    result = softmax.forward(x)\n",
    "    assert np.allclose(np.sum(result.data), 1.0), \"Softmax should handle large numbers\"\n",
    "    assert not np.any(np.isnan(result.data)), \"Softmax should not produce NaN\"\n",
    "    assert not np.any(np.isinf(result.data)), \"Softmax should not produce infinity\"\n",
    "\n",
    "    # Test with 2D tensor (batch dimension)\n",
    "    x = Tensor([[1, 2], [3, 4]])\n",
    "    result = softmax.forward(x, dim=-1)  # Softmax along last dimension\n",
    "    assert result.shape == (2, 2), \"Softmax should preserve input shape\"\n",
    "    # Each row should sum to 1\n",
    "    row_sums = np.sum(result.data, axis=-1)\n",
    "    assert np.allclose(row_sums, [1.0, 1.0]), \"Each row should sum to 1\"\n",
    "\n",
    "    print(\"✅ Softmax works correctly!\")\n",
    "\n",
    "if __name__ == \"__main__\":\n",
    "    test_unit_softmax()"
   ]
  },
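  {
   "cell_type": "markdown",
   "id": "c41e7a92",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "The large-number check in the test above is easier to appreciate with a side-by-side comparison. The sketch below is a standalone illustration in plain NumPy, not part of the module's API: `naive_softmax` and `stable_softmax` are hypothetical helper names, and the only assumption is that NumPy is installed.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def naive_softmax(x):\n",
    "    # Softmax without the max shift (hypothetical helper, for comparison only).\n",
    "    exps = np.exp(x)\n",
    "    return exps / np.sum(exps, axis=-1, keepdims=True)\n",
    "\n",
    "def stable_softmax(x):\n",
    "    # Softmax with the max shift, the same idea the module's forward pass uses.\n",
    "    shifted = x - np.max(x, axis=-1, keepdims=True)\n",
    "    exps = np.exp(shifted)\n",
    "    return exps / np.sum(exps, axis=-1, keepdims=True)\n",
    "\n",
    "large = np.array([1000.0, 1001.0, 1002.0])\n",
    "small = np.array([1.0, 2.0, 3.0])\n",
    "\n",
    "# np.exp(1000.0) overflows to inf in float64, so inf / inf yields NaN.\n",
    "# The warnings are silenced here only to keep the demo output readable.\n",
    "with np.errstate(over='ignore', invalid='ignore'):\n",
    "    naive_large = naive_softmax(large)\n",
    "\n",
    "stable_large = stable_softmax(large)\n",
    "\n",
    "print('naive :', naive_large)\n",
    "print('stable:', stable_large)\n",
    "```\n",
    "\n",
    "The two versions agree on small inputs because softmax is unchanged when the same constant is added to every input. Subtracting the maximum only changes the intermediate exponentials, which then never exceed `exp(0) = 1`."
   ]
  },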
  {
   "cell_type": "markdown",
   "id": "936779e1",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 2
   },
   "source": [
    "## 4. Integration - Bringing It Together\n",
    "\n",
    "Now let's test how all our activation functions work together and understand their different behaviors."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5ecfa064",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "### Understanding the Output Patterns\n",
    "\n",
    "From the demonstration above, notice how each activation serves a different purpose:\n",
    "\n",
    "**Sigmoid**: Squashes everything to (0, 1) - good for probabilities\n",
    "**ReLU**: Zeros negatives, keeps positives - creates sparsity\n",
    "**Tanh**: Like sigmoid but centered at zero (-1, 1) - better gradient flow\n",
    "**GELU**: Smooth ReLU-like behavior - modern choice for transformers\n",
    "**Softmax**: Converts to probability distribution - sum equals 1\n",
    "\n",
    "These different behaviors make each activation suitable for different parts of neural networks."
   ]
  },
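  {
   "cell_type": "markdown",
   "id": "7b0d3f5e",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "The contrasts described above can be seen side by side by running all five functions on the same input. The following is a minimal sketch in plain NumPy rather than the module's `Tensor` classes: the lowercase function names are hypothetical stand-ins, and the GELU shown is the sigmoid-based approximation `x * sigmoid(1.702 * x)`, which is an assumption and may differ from the formula your implementation uses.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# Plain-NumPy stand-ins for the module's classes (hypothetical helpers).\n",
    "def sigmoid(x):\n",
    "    return 1.0 / (1.0 + np.exp(-x))\n",
    "\n",
    "def relu(x):\n",
    "    return np.maximum(0.0, x)\n",
    "\n",
    "def gelu(x):\n",
    "    # Assumption: sigmoid-based approximation of GELU.\n",
    "    return x * sigmoid(1.702 * x)\n",
    "\n",
    "def softmax(x):\n",
    "    # Subtract the max before exponentiating for numerical stability.\n",
    "    exps = np.exp(x - np.max(x, axis=-1, keepdims=True))\n",
    "    return exps / np.sum(exps, axis=-1, keepdims=True)\n",
    "\n",
    "x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])\n",
    "\n",
    "outputs = {\n",
    "    'sigmoid': sigmoid(x),\n",
    "    'relu': relu(x),\n",
    "    'tanh': np.tanh(x),\n",
    "    'gelu': gelu(x),\n",
    "    'softmax': softmax(x),\n",
    "}\n",
    "\n",
    "# Print one row per activation so the output ranges are easy to compare.\n",
    "for name, values in outputs.items():\n",
    "    print(f'{name:>8}: {np.round(values, 3)}')\n",
    "```\n",
    "\n",
    "Reading the rows together shows the pattern: sigmoid and softmax stay strictly between 0 and 1, ReLU is exactly zero for the negative inputs, tanh is symmetric around zero, and GELU dips slightly below zero for small negative inputs before tracking the input for positive values."
   ]
  },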
  {
   "cell_type": "markdown",
   "id": "e6d4f14d",
   "metadata": {
    "cell_marker": "\"\"\"",
    "lines_to_next_cell": 1
   },
   "source": [
    "## 🧪 Module Integration Test\n",
    "\n",
    "Final validation that everything works together correctly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8d3e00f4",
   "metadata": {
    "lines_to_next_cell": 2,
    "nbgrader": {
     "grade": true,
     "grade_id": "module-test",
     "locked": true,
     "points": 20
    }
   },
   "outputs": [],
   "source": [
    "def import_previous_module(module_name: str, component_name: str):\n",
    "    import sys\n",
    "    import os\n",
    "    sys.path.append(os.path.join(os.path.dirname(__file__), '..', module_name))\n",
    "    module = __import__(f\"{module_name.split('_')[1]}_dev\")\n",
    "    return getattr(module, component_name)\n",
    "\n",
    "def test_module():\n",
    "    \"\"\"\n",
    "    Comprehensive test of entire module functionality.\n",
    "\n",
    "    This final test runs before module summary to ensure:\n",
    "    - All unit tests pass\n",
    "    - Functions work together correctly\n",
    "    - Module is ready for integration with TinyTorch\n",
    "    \"\"\"\n",
    "    print(\"🧪 RUNNING MODULE INTEGRATION TEST\")\n",
    "    print(\"=\" * 50)\n",
    "\n",
    "    # Run all unit tests\n",
    "    print(\"Running unit tests...\")\n",
    "    test_unit_sigmoid()\n",
    "    test_unit_relu()\n",
    "    test_unit_tanh()\n",
    "    test_unit_gelu()\n",
    "    test_unit_softmax()\n",
    "\n",
    "    print(\"\\nRunning integration scenarios...\")\n",
    "\n",
    "    # Test 1: All activations preserve tensor properties\n",
    "    print(\"🔬 Integration Test: Tensor property preservation...\")\n",
    "    test_data = Tensor([[1, -1], [2, -2]])  # 2D tensor\n",
    "\n",
    "    activations = [Sigmoid(), ReLU(), Tanh(), GELU()]\n",
    "    for activation in activations:\n",
    "        result = activation.forward(test_data)\n",
    "        assert result.shape == test_data.shape, f\"Shape not preserved by {activation.__class__.__name__}\"\n",
    "        assert isinstance(result, Tensor), f\"Output not Tensor from {activation.__class__.__name__}\"\n",
    "\n",
    "    print(\"✅ All activations preserve tensor properties!\")\n",
    "\n",
    "    # Test 2: Softmax works with different dimensions\n",
    "    print(\"🔬 Integration Test: Softmax dimension handling...\")\n",
    "    data_3d = Tensor([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])  # (2, 2, 3)\n",
    "    softmax = Softmax()\n",
    "\n",
    "    # Test different dimensions\n",
    "    result_last = softmax(data_3d, dim=-1)\n",
    "    assert result_last.shape == (2, 2, 3), \"Softmax should preserve shape\"\n",
    "\n",
    "    # Check that last dimension sums to 1\n",
    "    last_dim_sums = np.sum(result_last.data, axis=-1)\n",
    "    assert np.allclose(last_dim_sums, 1.0), \"Last dimension should sum to 1\"\n",
    "\n",
    "    print(\"✅ Softmax handles different dimensions correctly!\")\n",
    "\n",
    "    # Test 3: Activation chaining (simulating neural network)\n",
    "    print(\"🔬 Integration Test: Activation chaining...\")\n",
    "\n",
    "    # Simulate: Input → Linear → ReLU → Linear → Softmax (like a simple network)\n",
    "    x = Tensor([[-1, 0, 1, 2]])  # Batch of 1, 4 features\n",
    "\n",
    "    # Apply ReLU (hidden layer activation)\n",
    "    relu = ReLU()\n",
    "    hidden = relu.forward(x)\n",
    "\n",
    "    # Apply Softmax (output layer activation)\n",
    "    softmax = Softmax()\n",
    "    output = softmax.forward(hidden)\n",
    "\n",
    "    # Verify the chain\n",
    "    assert hidden.data[0, 0] == 0, \"ReLU should zero negative input\"\n",
    "    assert np.allclose(np.sum(output.data), 1.0), \"Final output should be probability distribution\"\n",
    "\n",
    "    print(\"✅ Activation chaining works correctly!\")\n",
    "\n",
    "    print(\"\\n\" + \"=\" * 50)\n",
    "    print(\"🎉 ALL TESTS PASSED! Module ready for export.\")\n",
    "    print(\"Run: tito module complete 02\")\n",
    "\n",
    "# Run comprehensive module test\n",
    "if __name__ == \"__main__\":\n",
    "    test_module()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "df17a734",
   "metadata": {
    "cell_marker": "\"\"\""
   },
   "source": [
    "## 🎯 MODULE SUMMARY: Activations\n",
    "\n",
    "Congratulations! You've built the intelligence engine of neural networks!\n",
    "\n",
    "### Key Accomplishments\n",
    "- Built 5 core activation functions with distinct behaviors and use cases\n",
    "- Implemented forward passes for Sigmoid, ReLU, Tanh, GELU, and Softmax\n",
    "- Discovered how nonlinearity enables complex pattern learning\n",
    "- All tests pass ✅ (validated by `test_module()`)\n",
    "\n",
    "### Ready for Next Steps\n",
    "Your activation implementations enable neural network layers to learn complex, nonlinear patterns instead of just linear transformations.\n",
    "\n",
    "Export with: `tito module complete 02`\n",
    "\n",
    "**Next**: Module 03 will combine your Tensors and Activations to build complete neural network Layers!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}