diff --git a/docs/_static/demos/02-build-test-ship.gif b/docs/_static/demos/02-build-test-ship.gif index 70d6b157..415503f3 100644 Binary files a/docs/_static/demos/02-build-test-ship.gif and b/docs/_static/demos/02-build-test-ship.gif differ diff --git a/docs/_static/demos/03-milestone-unlocked.gif b/docs/_static/demos/03-milestone-unlocked.gif index 2ddad865..80fa821a 100644 Binary files a/docs/_static/demos/03-milestone-unlocked.gif and b/docs/_static/demos/03-milestone-unlocked.gif differ diff --git a/src/02_activations/02_activations.ipynb b/src/02_activations/02_activations.ipynb new file mode 100644 index 00000000..d215e06f --- /dev/null +++ b/src/02_activations/02_activations.ipynb @@ -0,0 +1,1355 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "7d81c336", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "# Activations - Intelligence Through Nonlinearity\n", + "\n", + "Welcome to Activations! Today you'll add the secret ingredient that makes neural networks intelligent: **nonlinearity**.\n", + "\n", + "## πŸ”— Prerequisites & Progress\n", + "**You've Built**: Tensor with data manipulation and basic operations\n", + "**You'll Build**: Activation functions that add nonlinearity to transformations\n", + "**You'll Enable**: Neural networks with the ability to learn complex patterns\n", + "\n", + "**Connection Map**:\n", + "```\n", + "Tensor β†’ Activations β†’ Layers\n", + "(data) (intelligence) (architecture)\n", + "```\n", + "\n", + "## Learning Objectives\n", + "By the end of this module, you will:\n", + "1. Implement 5 core activation functions (Sigmoid, ReLU, Tanh, GELU, Softmax)\n", + "2. Understand how nonlinearity enables neural network intelligence\n", + "3. Test activation behaviors and output ranges\n", + "4. Connect activations to real neural network components\n", + "\n", + "Let's add intelligence to your tensors!" 
+ ] + }, + { + "cell_type": "markdown", + "id": "4d420d81", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## πŸ“¦ Where This Code Lives in the Final Package\n", + "\n", + "**Learning Side:** You work in modules/02_activations/activations_dev.py\n", + "**Building Side:** Code exports to tinytorch.core.activations\n", + "\n", + "```python\n", + "# Final package structure:\n", + "from tinytorch.core.activations import Sigmoid, ReLU, Tanh, GELU, Softmax # This module\n", + "from tinytorch.core.tensor import Tensor # Foundation (Module 01)\n", + "```\n", + "\n", + "**Why this matters:**\n", + "- **Learning:** Complete activation system in one focused module for deep understanding\n", + "- **Production:** Proper organization like PyTorch's torch.nn.functional with all activation operations together\n", + "- **Consistency:** All activation functions and behaviors in core.activations\n", + "- **Integration:** Works seamlessly with Tensor for complete nonlinear transformations" + ] + }, + { + "cell_type": "markdown", + "id": "ef737812", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## πŸ“‹ Module Dependencies\n", + "\n", + "**Prerequisites**: Module 01 (Tensor) must be completed\n", + "\n", + "**External Dependencies**:\n", + "- `numpy` (for numerical operations)\n", + "\n", + "**TinyTorch Dependencies**:\n", + "- **Module 01 (Tensor)**: Foundation for all activation computations and data flow\n", + " - Used for: Input/output data structures, shape operations, element-wise operations\n", + " - Required: Yes - activations operate on Tensor objects\n", + "\n", + "**Dependency Flow**:\n", + "```\n", + "Module 01 (Tensor) β†’ Module 02 (Activations) β†’ Module 03 (Layers)\n", + " ↓ ↓ ↓\n", + " Foundation Nonlinearity Architecture\n", + "```\n", + "\n", + "**Import Strategy**:\n", + "This module imports directly from the TinyTorch package (`from tinytorch.core.*`).\n", + "**Assumption**: Module 01 (Tensor) has been completed and exported to 
the package.\n", + "If you see import errors, ensure you've run `tito export` after completing Module 01." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2066641f", + "metadata": { + "nbgrader": { + "grade": false, + "grade_id": "setup", + "solution": true + } + }, + "outputs": [], + "source": [ + "#| default_exp core.activations\n", + "#| export\n", + "\n", + "import numpy as np\n", + "from typing import Optional\n", + "\n", + "# Import from TinyTorch package (previous modules must be completed and exported)\n", + "from tinytorch.core.tensor import Tensor\n", + "\n", + "# Constants for numerical comparisons\n", + "TOLERANCE = 1e-10 # Small tolerance for floating-point comparisons in tests" + ] + }, + { + "cell_type": "markdown", + "id": "c47833e7", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 1. Introduction - What Makes Neural Networks Intelligent?\n", + "\n", + "Consider two scenarios:\n", + "\n", + "**Without Activations (Linear Only):**\n", + "```\n", + "Input β†’ Linear Transform β†’ Output\n", + "[1, 2] Β· weights [3, 4] β†’ 11 # Just a weighted sum: 3Β·1 + 4Β·2 = 11\n", + "```\n", + "\n", + "**With Activations (Nonlinear):**\n", + "```\n", + "Input β†’ Linear β†’ Activation β†’ Linear β†’ Activation β†’ Output\n", + "[1, 2] β†’ weighted sums β†’ bend β†’ weighted sums β†’ bend β†’ Complex Pattern!\n", + "```\n", + "\n", + "The magic happens in those activation functions. 
They introduce **nonlinearity** - the ability to curve, bend, and create complex decision boundaries instead of just straight lines.\n", + "\n", + "### Why Nonlinearity Matters\n", + "\n", + "Without activation functions, stacking multiple linear layers is pointless:\n", + "```\n", + "Linear(Linear(x)) = Linear(x) # Same as single layer!\n", + "```\n", + "\n", + "With activation functions, each layer can learn increasingly complex patterns:\n", + "```\n", + "Layer 1: Simple edges and lines\n", + "Layer 2: Curves and shapes\n", + "Layer 3: Complex objects and concepts\n", + "```\n", + "\n", + "This is how deep networks build intelligence from simple mathematical operations." + ] + }, + { + "cell_type": "markdown", + "id": "ca836d90", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 2. Mathematical Foundations\n", + "\n", + "Each activation function serves a different purpose in neural networks:\n", + "\n", + "### The Five Essential Activations\n", + "\n", + "1. **Sigmoid**: Maps to (0, 1) - perfect for probabilities\n", + "2. **ReLU**: Removes negatives - creates sparsity and efficiency\n", + "3. **Tanh**: Maps to (-1, 1) - zero-centered for better training\n", + "4. **GELU**: Smooth ReLU - modern choice for transformers\n", + "5. **Softmax**: Creates probability distributions - essential for classification\n", + "\n", + "Let's implement each one with clear explanations and immediate testing!" + ] + }, + { + "cell_type": "markdown", + "id": "5f73cf7e", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 3. 
Implementation - Building Activation Functions\n", + "\n", + "### πŸ—οΈ Implementation Pattern\n", + "\n", + "Each activation follows this structure:\n", + "```python\n", + "class ActivationName:\n", + " def forward(self, x: Tensor) -> Tensor:\n", + " # Apply mathematical transformation\n", + " # Return new Tensor with result\n", + "\n", + " def backward(self, grad: Tensor) -> Tensor:\n", + " # Stub for Module 05 - gradient computation\n", + " pass\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "79ef7336", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## Sigmoid - The Probability Gatekeeper\n", + "\n", + "Sigmoid maps any real number to the range (0, 1), making it perfect for probabilities and binary decisions.\n", + "\n", + "### Mathematical Definition\n", + "```\n", + "Οƒ(x) = 1/(1 + e^(-x))\n", + "```\n", + "\n", + "### Visual Behavior\n", + "```\n", + "Input: [-3, -1, 0, 1, 3]\n", + " ↓ ↓ ↓ ↓ ↓ Sigmoid Function\n", + "Output: [0.05, 0.27, 0.5, 0.73, 0.95]\n", + "```\n", + "\n", + "### ASCII Visualization\n", + "```\n", + "Sigmoid Curve:\n", + " 1.0 ─ ╭─────\n", + " β”‚ β•±\n", + " 0.5 ─ β•±\n", + " β”‚ β•±\n", + " 0.0 ──╱─────────\n", + " -3 0 3\n", + "```\n", + "\n", + "**Why Sigmoid matters**: In binary classification, we need outputs between 0 and 1 to represent probabilities. Sigmoid gives us exactly that!" 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0e0285e2", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "sigmoid-impl", + "solution": true + } + }, + "outputs": [], + "source": [ + "#| export\n", + "from tinytorch.core.tensor import Tensor\n", + "\n", + "class Sigmoid:\n", + " \"\"\"\n", + " Sigmoid activation: Οƒ(x) = 1/(1 + e^(-x))\n", + "\n", + " Maps any real number to (0, 1) range.\n", + " Perfect for probabilities and binary classification.\n", + " \"\"\"\n", + "\n", + " def forward(self, x: Tensor) -> Tensor:\n", + " \"\"\"\n", + " Apply sigmoid activation element-wise.\n", + "\n", + " TODO: Implement sigmoid function\n", + "\n", + " APPROACH:\n", + " 1. Apply sigmoid formula: 1 / (1 + exp(-x))\n", + " 2. Use np.exp for exponential\n", + " 3. Return result wrapped in new Tensor\n", + "\n", + " EXAMPLE:\n", + " >>> sigmoid = Sigmoid()\n", + " >>> x = Tensor([-2, 0, 2])\n", + " >>> result = sigmoid(x)\n", + " >>> print(result.data)\n", + " [0.119, 0.5, 0.881] # All values between 0 and 1\n", + "\n", + " HINT: Clip inputs (e.g. to Β±500) before calling np.exp to avoid overflow\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # Apply sigmoid: 1 / (1 + exp(-x))\n", + " # Clip extreme values to prevent overflow (sigmoid(-500) β‰ˆ 0, sigmoid(500) β‰ˆ 1)\n", + " # Clipping at Β±500 ensures exp() stays within float64 range\n", + " z = np.clip(x.data, -500, 500)\n", + "\n", + " # Use numerically stable sigmoid\n", + " # For positive values: 1 / (1 + exp(-x))\n", + " # For negative values: exp(x) / (1 + exp(x)), which equals 1 / (1 + exp(-x))\n", + " # but never exponentiates a large positive number\n", + " result_data = np.zeros_like(z, dtype=np.float64) # float output even for integer inputs\n", + "\n", + " # Positive values (including zero)\n", + " pos_mask = z >= 0\n", + " result_data[pos_mask] = 1.0 / (1.0 + np.exp(-z[pos_mask]))\n", + "\n", + " # Negative values\n", + " neg_mask = z < 0\n", + " exp_z = np.exp(z[neg_mask])\n", + " result_data[neg_mask] = exp_z / (1.0 + exp_z)\n", + "\n", + " return Tensor(result_data)\n", + " 
### END SOLUTION\n", + "\n", + " def __call__(self, x: Tensor) -> Tensor:\n", + " \"\"\"Allows the activation to be called like a function.\"\"\"\n", + " return self.forward(x)\n", + "\n", + " def backward(self, grad: Tensor) -> Tensor:\n", + " \"\"\"Compute gradient (implemented in Module 05).\"\"\"\n", + " pass # Will implement backward pass in Module 05" + ] + }, + { + "cell_type": "markdown", + "id": "064c45b4", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### πŸ”¬ Unit Test: Sigmoid\n", + "This test validates sigmoid activation behavior.\n", + "**What we're testing**: Sigmoid maps inputs to (0, 1) range\n", + "**Why it matters**: Ensures proper probability-like outputs\n", + "**Expected**: All outputs between 0 and 1, sigmoid(0) = 0.5" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "622abc9a", + "metadata": { + "nbgrader": { + "grade": true, + "grade_id": "test-sigmoid", + "locked": true, + "points": 10 + } + }, + "outputs": [], + "source": [ + "def test_unit_sigmoid():\n", + " \"\"\"πŸ”¬ Test Sigmoid implementation.\"\"\"\n", + " print(\"πŸ”¬ Unit Test: Sigmoid...\")\n", + "\n", + " sigmoid = Sigmoid()\n", + "\n", + " # Test basic cases\n", + " x = Tensor([0.0])\n", + " result = sigmoid.forward(x)\n", + " assert np.allclose(result.data, [0.5]), f\"sigmoid(0) should be 0.5, got {result.data}\"\n", + "\n", + " # Test range property - all outputs should be in (0, 1)\n", + " x = Tensor([-10, -1, 0, 1, 10])\n", + " result = sigmoid.forward(x)\n", + " assert np.all(result.data > 0) and np.all(result.data < 1), \"All sigmoid outputs should be in (0, 1)\"\n", + "\n", + " # Test specific values\n", + " x = Tensor([-1000, 1000]) # Extreme values\n", + " result = sigmoid.forward(x)\n", + " assert np.allclose(result.data[0], 0, atol=TOLERANCE), \"sigmoid(-∞) should approach 0\"\n", + " assert np.allclose(result.data[1], 1, atol=TOLERANCE), \"sigmoid(+∞) should approach 1\"\n", + "\n", + " print(\"βœ… 
Sigmoid works correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_sigmoid()" + ] + }, + { + "cell_type": "markdown", + "id": "edb8b018", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "## ReLU - The Sparsity Creator\n", + "\n", + "ReLU (Rectified Linear Unit) is the most popular activation function. It simply removes negative values, creating sparsity that makes neural networks more efficient.\n", + "\n", + "### Mathematical Definition\n", + "```\n", + "f(x) = max(0, x)\n", + "```\n", + "\n", + "### Visual Behavior\n", + "```\n", + "Input: [-2, -1, 0, 1, 2]\n", + " ↓ ↓ ↓ ↓ ↓ ReLU Function\n", + "Output: [ 0, 0, 0, 1, 2]\n", + "```\n", + "\n", + "### ASCII Visualization\n", + "```\n", + "ReLU Function:\n", + " β•±\n", + " 2 β•±\n", + " β•±\n", + " 1β•±\n", + " β•±\n", + " β•±\n", + " β•±\n", + "─┴─────\n", + "-2 0 2\n", + "```\n", + "\n", + "**Why ReLU matters**: By zeroing negative values, ReLU creates sparsity (many zeros) which makes computation faster and helps prevent overfitting." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "492c0f67", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "relu-impl", + "solution": true + } + }, + "outputs": [], + "source": [ + "#| export\n", + "class ReLU:\n", + " \"\"\"\n", + " ReLU activation: f(x) = max(0, x)\n", + "\n", + " Sets negative values to zero, keeps positive values unchanged.\n", + " Most popular activation for hidden layers.\n", + " \"\"\"\n", + "\n", + " def forward(self, x: Tensor) -> Tensor:\n", + " \"\"\"\n", + " Apply ReLU activation element-wise.\n", + "\n", + " TODO: Implement ReLU function\n", + "\n", + " APPROACH:\n", + " 1. Use np.maximum(0, x.data) for element-wise max with zero\n", + " 2. 
Return result wrapped in new Tensor\n", + "\n", + " EXAMPLE:\n", + " >>> relu = ReLU()\n", + " >>> x = Tensor([-2, -1, 0, 1, 2])\n", + " >>> result = relu(x)\n", + " >>> print(result.data)\n", + " [0, 0, 0, 1, 2] # Negative values become 0, positive unchanged\n", + "\n", + " HINT: np.maximum handles element-wise maximum automatically\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # Apply ReLU: max(0, x)\n", + " result = np.maximum(0, x.data)\n", + " return Tensor(result)\n", + " ### END SOLUTION\n", + "\n", + " def __call__(self, x: Tensor) -> Tensor:\n", + " \"\"\"Allows the activation to be called like a function.\"\"\"\n", + " return self.forward(x)\n", + "\n", + " def backward(self, grad: Tensor) -> Tensor:\n", + " \"\"\"Compute gradient (implemented in Module 05).\"\"\"\n", + " pass # Will implement backward pass in Module 05" + ] + }, + { + "cell_type": "markdown", + "id": "12eff51b", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### πŸ”¬ Unit Test: ReLU\n", + "This test validates ReLU activation behavior.\n", + "**What we're testing**: ReLU zeros negative values, preserves positive\n", + "**Why it matters**: ReLU's sparsity helps neural networks train efficiently\n", + "**Expected**: Negative β†’ 0, positive unchanged, zero β†’ 0" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2f82fe9d", + "metadata": { + "nbgrader": { + "grade": true, + "grade_id": "test-relu", + "locked": true, + "points": 10 + } + }, + "outputs": [], + "source": [ + "def test_unit_relu():\n", + " \"\"\"πŸ”¬ Test ReLU implementation.\"\"\"\n", + " print(\"πŸ”¬ Unit Test: ReLU...\")\n", + "\n", + " relu = ReLU()\n", + "\n", + " # Test mixed positive/negative values\n", + " x = Tensor([-2, -1, 0, 1, 2])\n", + " result = relu.forward(x)\n", + " expected = [0, 0, 0, 1, 2]\n", + " assert np.allclose(result.data, expected), f\"ReLU failed, expected {expected}, got {result.data}\"\n", + "\n", + " # Test all negative\n", + " 
x = Tensor([-5, -3, -1])\n", + " result = relu.forward(x)\n", + " assert np.allclose(result.data, [0, 0, 0]), \"ReLU should zero all negative values\"\n", + "\n", + " # Test all positive\n", + " x = Tensor([1, 3, 5])\n", + " result = relu.forward(x)\n", + " assert np.allclose(result.data, [1, 3, 5]), \"ReLU should preserve all positive values\"\n", + "\n", + " # Test sparsity property\n", + " x = Tensor([-1, -2, -3, 1])\n", + " result = relu.forward(x)\n", + " zeros = np.sum(result.data == 0)\n", + " assert zeros == 3, f\"ReLU should create sparsity, got {zeros} zeros out of 4\"\n", + "\n", + " print(\"βœ… ReLU works correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_relu()" + ] + }, + { + "cell_type": "markdown", + "id": "e337e334", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "## Tanh - The Zero-Centered Alternative\n", + "\n", + "Tanh (hyperbolic tangent) is like sigmoid but centered around zero, mapping inputs to (-1, 1). This zero-centering helps with gradient flow during training.\n", + "\n", + "### Mathematical Definition\n", + "```\n", + "f(x) = (e^x - e^(-x))/(e^x + e^(-x))\n", + "```\n", + "\n", + "### Visual Behavior\n", + "```\n", + "Input: [-2, 0, 2]\n", + " ↓ ↓ ↓ Tanh Function\n", + "Output: [-0.96, 0, 0.96]\n", + "```\n", + "\n", + "### ASCII Visualization\n", + "```\n", + "Tanh Curve:\n", + " 1 ─ ╭─────\n", + " β”‚ β•±\n", + " 0 ────╱─────\n", + " β”‚ β•±\n", + " -1 ──╱───────\n", + " -3 0 3\n", + "```\n", + "\n", + "**Why Tanh matters**: Unlike sigmoid, tanh outputs are centered around zero, which can help gradients flow better through deep networks." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5097cffc", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "tanh-impl", + "solution": true + } + }, + "outputs": [], + "source": [ + "#| export\n", + "class Tanh:\n", + " \"\"\"\n", + " Tanh activation: f(x) = (e^x - e^(-x))/(e^x + e^(-x))\n", + "\n", + " Maps any real number to (-1, 1) range.\n", + " Zero-centered alternative to sigmoid.\n", + " \"\"\"\n", + "\n", + " def forward(self, x: Tensor) -> Tensor:\n", + " \"\"\"\n", + " Apply tanh activation element-wise.\n", + "\n", + " TODO: Implement tanh function\n", + "\n", + " APPROACH:\n", + " 1. Use np.tanh(x.data) for hyperbolic tangent\n", + " 2. Return result wrapped in new Tensor\n", + "\n", + " EXAMPLE:\n", + " >>> tanh = Tanh()\n", + " >>> x = Tensor([-2, 0, 2])\n", + " >>> result = tanh(x)\n", + " >>> print(result.data)\n", + " [-0.964, 0.0, 0.964] # Range (-1, 1), symmetric around 0\n", + "\n", + " HINT: NumPy provides np.tanh function\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # Apply tanh using NumPy\n", + " result = np.tanh(x.data)\n", + " return Tensor(result)\n", + " ### END SOLUTION\n", + "\n", + " def __call__(self, x: Tensor) -> Tensor:\n", + " \"\"\"Allows the activation to be called like a function.\"\"\"\n", + " return self.forward(x)\n", + "\n", + " def backward(self, grad: Tensor) -> Tensor:\n", + " \"\"\"Compute gradient (implemented in Module 05).\"\"\"\n", + " pass # Will implement backward pass in Module 05" + ] + }, + { + "cell_type": "markdown", + "id": "83e4b892", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### πŸ”¬ Unit Test: Tanh\n", + "This test validates tanh activation behavior.\n", + "**What we're testing**: Tanh maps inputs to (-1, 1) range, zero-centered\n", + "**Why it matters**: Zero-centered activations can help with gradient flow\n", + "**Expected**: All outputs in (-1, 1), tanh(0) = 0, symmetric behavior" + 
] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f55159ca", + "metadata": { + "nbgrader": { + "grade": true, + "grade_id": "test-tanh", + "locked": true, + "points": 10 + } + }, + "outputs": [], + "source": [ + "def test_unit_tanh():\n", + " \"\"\"πŸ”¬ Test Tanh implementation.\"\"\"\n", + " print(\"πŸ”¬ Unit Test: Tanh...\")\n", + "\n", + " tanh = Tanh()\n", + "\n", + " # Test zero\n", + " x = Tensor([0.0])\n", + " result = tanh.forward(x)\n", + " assert np.allclose(result.data, [0.0]), f\"tanh(0) should be 0, got {result.data}\"\n", + "\n", + " # Test range property - all outputs should be in (-1, 1)\n", + " x = Tensor([-10, -1, 0, 1, 10])\n", + " result = tanh.forward(x)\n", + " assert np.all(result.data >= -1) and np.all(result.data <= 1), \"All tanh outputs should be in [-1, 1]\"\n", + "\n", + " # Test symmetry: tanh(-x) = -tanh(x)\n", + " x = Tensor([2.0])\n", + " pos_result = tanh.forward(x)\n", + " x_neg = Tensor([-2.0])\n", + " neg_result = tanh.forward(x_neg)\n", + " assert np.allclose(pos_result.data, -neg_result.data), \"tanh should be symmetric: tanh(-x) = -tanh(x)\"\n", + "\n", + " # Test extreme values\n", + " x = Tensor([-1000, 1000])\n", + " result = tanh.forward(x)\n", + " assert np.allclose(result.data[0], -1, atol=TOLERANCE), \"tanh(-∞) should approach -1\"\n", + " assert np.allclose(result.data[1], 1, atol=TOLERANCE), \"tanh(+∞) should approach 1\"\n", + "\n", + " print(\"βœ… Tanh works correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_tanh()" + ] + }, + { + "cell_type": "markdown", + "id": "3a2be663", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "## GELU - The Smooth Modern Choice\n", + "\n", + "GELU (Gaussian Error Linear Unit) is a smooth approximation to ReLU that's become popular in modern architectures like transformers. 
Unlike ReLU's sharp corner, GELU is smooth everywhere.\n", + "\n", + "### Mathematical Definition\n", + "```\n", + "f(x) = x * Ξ¦(x) β‰ˆ x * Sigmoid(1.702 * x)\n", + "```\n", + "Where Ξ¦(x) is the cumulative distribution function of standard normal distribution.\n", + "\n", + "### Visual Behavior\n", + "```\n", + "Input: [-1, 0, 1]\n", + " ↓ ↓ ↓ GELU Function\n", + "Output: [-0.16, 0, 0.84]\n", + "```\n", + "\n", + "### ASCII Visualization\n", + "```\n", + "GELU Function:\n", + " β•±\n", + " 1 β•±\n", + " β•±\n", + " β•±\n", + " β•±\n", + " β•± ↙ (smooth curve, no sharp corner)\n", + " β•±\n", + "─┴─────\n", + "-2 0 2\n", + "```\n", + "\n", + "**Why GELU matters**: Used in GPT, BERT, and other transformers. The smoothness helps with optimization compared to ReLU's sharp corner." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "702988e0", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "gelu-impl", + "solution": true + } + }, + "outputs": [], + "source": [ + "#| export\n", + "class GELU:\n", + " \"\"\"\n", + " GELU activation: f(x) = x * Ξ¦(x) β‰ˆ x * Sigmoid(1.702 * x)\n", + "\n", + " Smooth approximation to ReLU, used in modern transformers.\n", + " Where Ξ¦(x) is the cumulative distribution function of standard normal.\n", + " \"\"\"\n", + "\n", + " def forward(self, x: Tensor) -> Tensor:\n", + " \"\"\"\n", + " Apply GELU activation element-wise.\n", + "\n", + " TODO: Implement GELU approximation\n", + "\n", + " APPROACH:\n", + " 1. Use approximation: x * sigmoid(1.702 * x)\n", + " 2. Compute sigmoid part: 1 / (1 + exp(-1.702 * x))\n", + " 3. Multiply by x element-wise\n", + " 4. 
Return result wrapped in new Tensor\n", + "\n", + " EXAMPLE:\n", + " >>> gelu = GELU()\n", + " >>> x = Tensor([-1, 0, 1])\n", + " >>> result = gelu(x)\n", + " >>> print(result.data)\n", + " [-0.159, 0.0, 0.841] # Smooth, like ReLU but differentiable everywhere\n", + "\n", + " HINT: 1.702 is an empirically fitted constant for the sigmoid form of GELU; the √(2/Ο€) constant belongs to the alternative tanh-based approximation\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # GELU approximation: x * sigmoid(1.702 * x)\n", + " # Clip the sigmoid argument so exp() cannot overflow for large negative x\n", + " z = np.clip(1.702 * x.data, -500, 500)\n", + " sigmoid_part = 1.0 / (1.0 + np.exp(-z))\n", + " # Then multiply by x\n", + " result = x.data * sigmoid_part\n", + " return Tensor(result)\n", + " ### END SOLUTION\n", + "\n", + " def __call__(self, x: Tensor) -> Tensor:\n", + " \"\"\"Allows the activation to be called like a function.\"\"\"\n", + " return self.forward(x)\n", + "\n", + " def backward(self, grad: Tensor) -> Tensor:\n", + " \"\"\"Compute gradient (implemented in Module 05).\"\"\"\n", + " pass # Will implement backward pass in Module 05" + ] + }, + { + "cell_type": "markdown", + "id": "5c9142d2", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### πŸ”¬ Unit Test: GELU\n", + "This test validates GELU activation behavior.\n", + "**What we're testing**: GELU provides smooth ReLU-like behavior\n", + "**Why it matters**: GELU is used in modern transformers like GPT and BERT\n", + "**Expected**: Smooth curve, GELU(0) β‰ˆ 0, positive values preserved roughly" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e9f917b3", + "metadata": { + "nbgrader": { + "grade": true, + "grade_id": "test-gelu", + "locked": true, + "points": 10 + } + }, + "outputs": [], + "source": [ + "def test_unit_gelu():\n", + " \"\"\"πŸ”¬ Test GELU implementation.\"\"\"\n", + " print(\"πŸ”¬ Unit Test: GELU...\")\n", + "\n", + " gelu = GELU()\n", + "\n", + " # Test zero (should be approximately 0)\n", + " x = Tensor([0.0])\n", + " result = gelu.forward(x)\n", + " assert np.allclose(result.data, 
[0.0], atol=TOLERANCE), f\"GELU(0) should be β‰ˆ0, got {result.data}\"\n", + "\n", + " # Test positive values (should be roughly preserved)\n", + " x = Tensor([1.0])\n", + " result = gelu.forward(x)\n", + " assert result.data[0] > 0.8, f\"GELU(1) should be β‰ˆ0.84, got {result.data[0]}\"\n", + "\n", + " # Test negative values (should be small but not zero)\n", + " x = Tensor([-1.0])\n", + " result = gelu.forward(x)\n", + " assert result.data[0] < 0 and result.data[0] > -0.2, f\"GELU(-1) should be β‰ˆ-0.16, got {result.data[0]}\"\n", + "\n", + " # Test smoothness property (no sharp corners like ReLU)\n", + " x = Tensor([-0.001, 0.0, 0.001])\n", + " result = gelu.forward(x)\n", + " # Values should be close to each other (smooth)\n", + " diff1 = abs(result.data[1] - result.data[0])\n", + " diff2 = abs(result.data[2] - result.data[1])\n", + " assert diff1 < 0.01 and diff2 < 0.01, \"GELU should be smooth around zero\"\n", + "\n", + " print(\"βœ… GELU works correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + " test_unit_gelu()" + ] + }, + { + "cell_type": "markdown", + "id": "770d4997", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "## Softmax - The Probability Distributor\n", + "\n", + "Softmax converts any vector into a valid probability distribution. 
All outputs are positive and sum to exactly 1.0, making it essential for multi-class classification.\n", + "\n", + "### Mathematical Definition\n", + "```\n", + "f(x_i) = e^(x_i) / Ξ£(e^(x_j))\n", + "```\n", + "\n", + "### Visual Behavior\n", + "```\n", + "Input: [1, 2, 3]\n", + " ↓ ↓ ↓ Softmax Function\n", + "Output: [0.09, 0.24, 0.67] # Sum = 1.0\n", + "```\n", + "\n", + "### ASCII Visualization\n", + "```\n", + "Softmax Transform:\n", + "Raw scores: [1, 2, 3, 4]\n", + " ↓ Exponential ↓\n", + " [2.7, 7.4, 20.1, 54.6]\n", + " ↓ Normalize ↓\n", + " [0.03, 0.09, 0.24, 0.64] ← Sum = 1.0\n", + "```\n", + "\n", + "**Why Softmax matters**: In multi-class classification, we need outputs that represent probabilities for each class. Softmax guarantees valid probabilities." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b8a39ebe", + "metadata": { + "lines_to_next_cell": 1, + "nbgrader": { + "grade": false, + "grade_id": "softmax-impl", + "solution": true + } + }, + "outputs": [], + "source": [ + "#| export\n", + "class Softmax:\n", + " \"\"\"\n", + " Softmax activation: f(x_i) = e^(x_i) / Ξ£(e^(x_j))\n", + "\n", + " Converts any vector to a probability distribution.\n", + " Sum of all outputs equals 1.0.\n", + " \"\"\"\n", + "\n", + " def forward(self, x: Tensor, dim: int = -1) -> Tensor:\n", + " \"\"\"\n", + " Apply softmax activation along specified dimension.\n", + "\n", + " TODO: Implement numerically stable softmax\n", + "\n", + " APPROACH:\n", + " 1. Subtract max for numerical stability: x - max(x)\n", + " 2. Compute exponentials: exp(x - max(x))\n", + " 3. Sum along dimension: sum(exp_values)\n", + " 4. Divide: exp_values / sum\n", + " 5. 
Return result wrapped in new Tensor\n", + "\n", + " EXAMPLE:\n", + " >>> softmax = Softmax()\n", + " >>> x = Tensor([1, 2, 3])\n", + " >>> result = softmax(x)\n", + " >>> print(result.data)\n", + " [0.090, 0.245, 0.665] # Sums to 1.0, larger inputs get higher probability\n", + "\n", + " HINTS:\n", + " - Use np.max(x.data, axis=dim, keepdims=True) for max\n", + " - Use np.sum(exp_values, axis=dim, keepdims=True) for sum\n", + " - The max subtraction prevents overflow in exponentials\n", + " \"\"\"\n", + " ### BEGIN SOLUTION\n", + " # Numerical stability: subtract max to prevent overflow\n", + " # Use Tensor operations to preserve gradient flow!\n", + " x_max_data = np.max(x.data, axis=dim, keepdims=True)\n", + " x_max = Tensor(x_max_data, requires_grad=False) # max is not differentiable in this context\n", + " x_shifted = x - x_max # Tensor subtraction!\n", + "\n", + " # Compute exponentials (NumPy operation, but wrapped in Tensor)\n", + " exp_values = Tensor(np.exp(x_shifted.data), requires_grad=x_shifted.requires_grad)\n", + "\n", + " # Sum along dimension (Tensor operation)\n", + " exp_sum_data = np.sum(exp_values.data, axis=dim, keepdims=True)\n", + " exp_sum = Tensor(exp_sum_data, requires_grad=exp_values.requires_grad)\n", + "\n", + " # Normalize to get probabilities (Tensor division!)\n", + " result = exp_values / exp_sum\n", + " return result\n", + " ### END SOLUTION\n", + "\n", + " def __call__(self, x: Tensor, dim: int = -1) -> Tensor:\n", + " \"\"\"Allows the activation to be called like a function.\"\"\"\n", + " return self.forward(x, dim)\n", + "\n", + " def backward(self, grad: Tensor) -> Tensor:\n", + " \"\"\"Compute gradient (implemented in Module 05).\"\"\"\n", + " pass # Will implement backward pass in Module 05" + ] + }, + { + "cell_type": "markdown", + "id": "7b9d8ff4", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "### πŸ”¬ Unit Test: Softmax\n", + "This test validates softmax activation behavior.\n", 
+ "**What we're testing**: Softmax creates valid probability distributions\n", + "**Why it matters**: Essential for multi-class classification outputs\n", + "**Expected**: Outputs sum to 1.0, all values in (0, 1), largest input gets highest probability" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ba0c1c6e", + "metadata": { + "nbgrader": { + "grade": true, + "grade_id": "test-softmax", + "locked": true, + "points": 10 + } + }, + "outputs": [], + "source": [ + "def test_unit_softmax():\n", + " \"\"\"πŸ”¬ Test Softmax implementation.\"\"\"\n", + " print(\"πŸ”¬ Unit Test: Softmax...\")\n", + "\n", + " softmax = Softmax()\n", + "\n", + " # Test basic probability properties\n", + " x = Tensor([1, 2, 3])\n", + " result = softmax.forward(x)\n", + "\n", + " # Should sum to 1\n", + " assert np.allclose(np.sum(result.data), 1.0), f\"Softmax should sum to 1, got {np.sum(result.data)}\"\n", + "\n", + " # All values should be positive\n", + " assert np.all(result.data > 0), \"All softmax values should be positive\"\n", + "\n", + " # All values should be less than 1\n", + " assert np.all(result.data < 1), \"All softmax values should be less than 1\"\n", + "\n", + " # Largest input should get largest output\n", + " max_input_idx = np.argmax(x.data)\n", + " max_output_idx = np.argmax(result.data)\n", + " assert max_input_idx == max_output_idx, \"Largest input should get largest softmax output\"\n", + "\n", + " # Test numerical stability with large numbers\n", + " x = Tensor([1000, 1001, 1002]) # Would overflow without max subtraction\n", + " result = softmax.forward(x)\n", + " assert np.allclose(np.sum(result.data), 1.0), \"Softmax should handle large numbers\"\n", + " assert not np.any(np.isnan(result.data)), \"Softmax should not produce NaN\"\n", + " assert not np.any(np.isinf(result.data)), \"Softmax should not produce infinity\"\n", + "\n", + " # Test with 2D tensor (batch dimension)\n", + " x = Tensor([[1, 2], [3, 4]])\n", + " result = 
softmax.forward(x, dim=-1)  # Softmax along last dimension\n", + "    assert result.shape == (2, 2), \"Softmax should preserve input shape\"\n", + "    # Each row should sum to 1\n", + "    row_sums = np.sum(result.data, axis=-1)\n", + "    assert np.allclose(row_sums, [1.0, 1.0]), \"Each row should sum to 1\"\n", + "\n", + "    print(\"✅ Softmax works correctly!\")\n", + "\n", + "if __name__ == \"__main__\":\n", + "    test_unit_softmax()" + ] + }, + { + "cell_type": "markdown", + "id": "6e51cf5d", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 2 + }, + "source": [ + "## 4. Integration - Bringing It Together\n", + "\n", + "Now let's test how all our activation functions work together and understand their different behaviors." + ] + }, + { + "cell_type": "markdown", + "id": "3c74a5d8", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "### Understanding the Output Patterns\n", + "\n", + "From the demonstration above, notice how each activation serves a different purpose:\n", + "\n", + "**Sigmoid**: Squashes everything to (0, 1) - good for probabilities\n", + "**ReLU**: Zeros negatives, keeps positives - creates sparsity\n", + "**Tanh**: Like sigmoid but centered at zero (-1, 1) - better gradient flow\n", + "**GELU**: Smooth ReLU-like behavior - modern choice for transformers\n", + "**Softmax**: Converts to probability distribution - sum equals 1\n", + "\n", + "These different behaviors make each activation suitable for different parts of neural networks." + ] + }, + { + "cell_type": "markdown", + "id": "e4a784c4", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 1 + }, + "source": [ + "## 🧪 Module Integration Test\n", + "\n", + "Final validation that everything works together correctly." 
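The contrasting output patterns summarized above are easy to reproduce outside the module in plain NumPy — a standalone sketch that bypasses the Tensor class and uses the common tanh approximation of GELU:

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

sigmoid = 1 / (1 + np.exp(-x))   # every value lands in (0, 1)
relu = np.maximum(0.0, x)        # negatives become exactly 0 (sparsity)
tanh = np.tanh(x)                # zero-centered, range (-1, 1)
# Tanh approximation of GELU; 0.044715 is the standard constant for this form
gelu = 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))
shifted = np.exp(x - x.max())    # max subtraction keeps exp() finite
softmax = shifted / shifted.sum()  # normalized: sums to 1

print("sigmoid:", sigmoid.round(3))
print("relu:   ", relu)
print("softmax:", softmax.round(3), "sum =", softmax.sum())
```

Running this makes the differences concrete: ReLU zeroes the first two entries, sigmoid and tanh saturate at the extremes, and only softmax couples the outputs into a single distribution.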
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c9f61aa8", + "metadata": { + "lines_to_next_cell": 2, + "nbgrader": { + "grade": true, + "grade_id": "module-test", + "locked": true, + "points": 20 + } + }, + "outputs": [], + "source": [ + "\n", + "def test_module():\n", + "    \"\"\"🧪 Module Test: Complete Integration\n", + "\n", + "    Comprehensive test of entire module functionality.\n", + "\n", + "    This final test runs before module summary to ensure:\n", + "    - All unit tests pass\n", + "    - Functions work together correctly\n", + "    - Module is ready for integration with TinyTorch\n", + "    \"\"\"\n", + "    print(\"🧪 RUNNING MODULE INTEGRATION TEST\")\n", + "    print(\"=\" * 50)\n", + "\n", + "    # Run all unit tests\n", + "    print(\"Running unit tests...\")\n", + "    test_unit_sigmoid()\n", + "    test_unit_relu()\n", + "    test_unit_tanh()\n", + "    test_unit_gelu()\n", + "    test_unit_softmax()\n", + "\n", + "    print(\"\\nRunning integration scenarios...\")\n", + "\n", + "    # Test 1: All activations preserve tensor properties\n", + "    print(\"🔬 Integration Test: Tensor property preservation...\")\n", + "    test_data = Tensor([[1, -1], [2, -2]])  # 2D tensor\n", + "\n", + "    activations = [Sigmoid(), ReLU(), Tanh(), GELU()]\n", + "    for activation in activations:\n", + "        result = activation.forward(test_data)\n", + "        assert result.shape == test_data.shape, f\"Shape not preserved by {activation.__class__.__name__}\"\n", + "        assert isinstance(result, Tensor), f\"Output not Tensor from {activation.__class__.__name__}\"\n", + "\n", + "    print(\"✅ All activations preserve tensor properties!\")\n", + "\n", + "    # Test 2: Softmax works with different dimensions\n", + "    print(\"🔬 Integration Test: Softmax dimension handling...\")\n", + "    data_3d = Tensor([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])  # (2, 2, 3)\n", + "    softmax = Softmax()\n", + "\n", + "    # Test different dimensions\n", + "    result_last = softmax(data_3d, dim=-1)\n", + "    assert 
result_last.shape == (2, 2, 3), \"Softmax should preserve shape\"\n", + "\n", + "    # Check that last dimension sums to 1\n", + "    last_dim_sums = np.sum(result_last.data, axis=-1)\n", + "    assert np.allclose(last_dim_sums, 1.0), \"Last dimension should sum to 1\"\n", + "\n", + "    print(\"✅ Softmax handles different dimensions correctly!\")\n", + "\n", + "    # Test 3: Activation chaining (simulating neural network)\n", + "    print(\"🔬 Integration Test: Activation chaining...\")\n", + "\n", + "    # Simulate: Input → Linear → ReLU → Linear → Softmax (like a simple network)\n", + "    x = Tensor([[-1, 0, 1, 2]])  # Batch of 1, 4 features\n", + "\n", + "    # Apply ReLU (hidden layer activation)\n", + "    relu = ReLU()\n", + "    hidden = relu.forward(x)\n", + "\n", + "    # Apply Softmax (output layer activation)\n", + "    softmax = Softmax()\n", + "    output = softmax.forward(hidden)\n", + "\n", + "    # Verify the chain\n", + "    assert hidden.data[0, 0] == 0, \"ReLU should zero negative input\"\n", + "    assert np.allclose(np.sum(output.data), 1.0), \"Final output should be probability distribution\"\n", + "\n", + "    print(\"✅ Activation chaining works correctly!\")\n", + "\n", + "    print(\"\\n\" + \"=\" * 50)\n", + "    print(\"🎉 ALL TESTS PASSED! Module ready for export.\")\n", + "    print(\"Run: tito module complete 02\")\n", + "\n", + "# Run comprehensive module test\n", + "if __name__ == \"__main__\":\n", + "    test_module()" + ] + }, + { + "cell_type": "markdown", + "id": "d1a45630", + "metadata": { + "cell_marker": "\"\"\"", + "lines_to_next_cell": 2 + }, + "source": [ + "## 5. 
Real-World Production Context\n", + "\n", + "Now that you've implemented these activations, let's understand how they're used in real ML systems.\n", + "\n", + "### Activation Selection Guide\n", + "\n", + "**When to Use Each Activation:**\n", + "\n", + "**Sigmoid**\n", + "- **Use case**: Binary classification output layers, gates in LSTMs/GRUs\n", + "- **Production example**: Spam detection (output: probability of spam)\n", + "- **Why**: Outputs valid probabilities in (0, 1)\n", + "- **Avoid**: Hidden layers in deep networks (vanishing gradients)\n", + "\n", + "**ReLU**\n", + "- **Use case**: Hidden layers in CNNs, feedforward networks\n", + "- **Production example**: Image classification networks (ResNet, VGG)\n", + "- **Why**: Fast computation, prevents vanishing gradients, creates sparsity\n", + "- **Avoid**: Output layers (can't output negative values or probabilities)\n", + "\n", + "**Tanh**\n", + "- **Use case**: RNN hidden states, when zero-centered outputs matter\n", + "- **Production example**: Sentiment analysis RNNs, time series prediction\n", + "- **Why**: Zero-centered helps with gradient flow in recurrent networks\n", + "- **Avoid**: Very deep networks (still suffers from vanishing gradients)\n", + "\n", + "**GELU**\n", + "- **Use case**: Transformer models, modern architectures\n", + "- **Production example**: GPT, BERT, modern language models\n", + "- **Why**: Smooth approximation of ReLU, better gradient flow, state-of-the-art results\n", + "- **Avoid**: When computational efficiency is critical (slightly slower than ReLU)\n", + "\n", + "**Softmax**\n", + "- **Use case**: Multi-class classification output layers\n", + "- **Production example**: ImageNet classification (1000 classes), NLP token prediction\n", + "- **Why**: Converts logits to valid probability distribution (sums to 1)\n", + "- **Avoid**: Hidden layers (loses information through normalization)\n", + "\n", + "### Common Production Patterns\n", + "\n", + "**Pattern 1: CNN Image 
Classification**\n", + "```\n", + "Input → Conv+ReLU → Conv+ReLU → ... → Linear → Softmax → Class Probabilities\n", + "```\n", + "\n", + "**Pattern 2: Binary Classifier**\n", + "```\n", + "Input → Linear+ReLU → Linear+ReLU → Linear → Sigmoid → Binary Probability\n", + "```\n", + "\n", + "**Pattern 3: Modern Transformer**\n", + "```\n", + "Input → Attention → Linear+GELU → Linear+GELU → Output\n", + "```\n", + "\n", + "### Common Pitfalls and Debugging\n", + "\n", + "**Sigmoid/Tanh Pitfalls:**\n", + "- **Vanishing gradients**: Gradients near 0 for extreme inputs\n", + "- **Saturation**: Outputs plateau, learning slows\n", + "- **Debug tip**: Check activation distribution - avoid all values near 0 or 1\n", + "\n", + "**ReLU Pitfalls:**\n", + "- **Dying ReLU**: Neurons output 0 forever after large negative gradient\n", + "- **No negative outputs**: Can't represent negative relationships\n", + "- **Debug tip**: Monitor % of dead neurons (always output 0)\n", + "\n", + "**Softmax Pitfalls:**\n", + "- **Numerical overflow**: exp(x) explodes for large x (solved by max subtraction)\n", + "- **Dimension confusion**: Must apply along correct axis for batched data\n", + "- **Debug tip**: Verify outputs sum to 1.0 along correct dimension\n", + "\n", + "**GELU Pitfalls:**\n", + "- **Approximation error**: Using wrong approximation constant\n", + "- **Speed**: Slightly slower than ReLU\n", + "- **Debug tip**: Compare outputs to reference implementation\n", + "\n", + "### Performance Characteristics\n", + "\n", + "**Computational Cost (relative to ReLU = 1.0):**\n", + "- ReLU: 1.0× (fastest - just comparison and max)\n", + "- Sigmoid: ~3×-4× (exponential computation)\n", + "- Tanh: ~3×-4× (two exponentials)\n", + "- GELU: ~4×-5× (exponential in approximation)\n", + "- Softmax: ~5×+ (exponentials + division across all elements)\n", + "\n", + "**Memory Impact:**\n", + "- All activations: Minimal memory overhead (output same size as input)\n", + 
"- Softmax: Slightly higher (temporary buffers for exp and sum)\n", + "- For autograd (Module 05): Must cache inputs for backward pass\n", + "\n", + "### Integration with TinyTorch\n", + "\n", + "Your activation functions integrate seamlessly with other modules:\n", + "\n", + "**Module 03 (Layers)**: Will use these activations\n", + "```python\n", + "# Coming in Module 03\n", + "class Linear:\n", + " def __init__(self, in_features, out_features, activation=None):\n", + " self.activation = activation # Your ReLU, Sigmoid, etc.\n", + "\n", + " def forward(self, x):\n", + " out = self.compute_linear(x)\n", + " if self.activation:\n", + " out = self.activation(out) # Uses your forward()\n", + " return out\n", + "```\n", + "\n", + "**Module 05 (Autograd)**: Will add gradient computation\n", + "```python\n", + "# Coming in Module 05\n", + "class Sigmoid:\n", + " def backward(self, grad):\n", + " # βˆ‚sigmoid/βˆ‚x = sigmoid(x) * (1 - sigmoid(x))\n", + " return grad * self.output * (1 - self.output)\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "904d4f89", + "metadata": { + "cell_marker": "\"\"\"" + }, + "source": [ + "## 🎯 MODULE SUMMARY: Activations\n", + "\n", + "Congratulations! You've built the intelligence engine of neural networks!\n", + "\n", + "### Key Accomplishments\n", + "- Built 5 core activation functions with distinct behaviors and use cases\n", + "- Implemented forward passes for Sigmoid, ReLU, Tanh, GELU, and Softmax\n", + "- Discovered how nonlinearity enables complex pattern learning\n", + "- All tests pass βœ… (validated by `test_module()`)\n", + "\n", + "### Ready for Next Steps\n", + "Your activation implementations enable neural network layers to learn complex, nonlinear patterns instead of just linear transformations.\n", + "\n", + "Export with: `tito module complete 02`\n", + "\n", + "**Next**: Module 03 will combine your Tensors and Activations to build complete neural network Layers!" 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/tito/main.py b/tito/main.py index 02e2c6ee..fe279511 100644 --- a/tito/main.py +++ b/tito/main.py @@ -216,37 +216,60 @@ Tracking Progress: if not parsed_args.command: # Show ASCII logo first print_ascii_logo() - - # Show enhanced help with command groups + + # Dynamically build help based on registered commands + # Categorize commands by role + essential = ['setup'] + student_workflow = ['module', 'checkpoint', 'milestones'] + community = ['leaderboard', 'olympics', 'community'] + developer = ['system', 'package', 'nbgrader', 'src'] + shortcuts = ['export', 'test', 'book', 'demo'] + other = ['benchmark', 'grade', 'logo'] + + help_text = "[bold]Essential Commands:[/bold]\n" + for cmd in essential: + if cmd in self.commands: + desc = self.commands[cmd](self.config).description + help_text += f" [bold cyan]{cmd}[/bold cyan] - {desc}\n" + + help_text += "\n[bold]Student Workflow:[/bold]\n" + for cmd in student_workflow: + if cmd in self.commands: + desc = self.commands[cmd](self.config).description + help_text += f" [bold green]{cmd}[/bold green] - {desc}\n" + + help_text += "\n[bold]Community:[/bold]\n" + for cmd in community: + if cmd in self.commands: + desc = self.commands[cmd](self.config).description + help_text += f" [bold bright_blue]{cmd}[/bold bright_blue] - {desc}\n" + + help_text += "\n[bold]Developer Tools:[/bold]\n" + for cmd in developer: + if cmd in self.commands: + desc = self.commands[cmd](self.config).description + help_text += f" [dim]{cmd}[/dim] - {desc}\n" + + help_text += "\n[bold]Shortcuts:[/bold]\n" + for cmd in shortcuts: + if cmd in self.commands: + desc = self.commands[cmd](self.config).description + help_text += f" [bold yellow]{cmd}[/bold yellow] - {desc}\n" + + help_text += "\n[bold]Quick Start:[/bold]\n" + help_text += " [dim]tito setup[/dim] - 
First-time setup (run once)\n" + help_text += " [dim]tito module start 01[/dim] - Start Module 01 (tensors)\n" + help_text += " [dim]tito module complete 01[/dim] - Complete it (test + export + track)\n" + help_text += " [dim]tito module status[/dim] - View all progress\n" + help_text += "\n[bold]Track Progress:[/bold]\n" + help_text += " [dim]tito checkpoint status[/dim] - Capabilities unlocked\n" + help_text += " [dim]tito leaderboard profile[/dim] - Your achievement journey\n" + help_text += "\n[bold]Get Help:[/bold]\n" + help_text += " [dim]tito <command>[/dim] - Show command subcommands\n" + help_text += " [dim]tito --help[/dim] - Show full help" + self.console.print(Panel( - "[bold]Essential Commands:[/bold]\n" - " [bold cyan]setup[/bold cyan] - First-time environment setup\n\n" - "[bold]Command Groups:[/bold]\n" - " [bold green]system[/bold green] - System environment and configuration\n" - " [bold green]module[/bold green] - Module workflow (start, complete, resume)\n" - " [bold green]package[/bold green] - Package management and nbdev integration\n" - " [bold green]nbgrader[/bold green] - Assignment management and auto-grading\n" - " [bold cyan]checkpoint[/bold cyan] - Progress tracking (capabilities unlocked)\n" - " [bold magenta]milestones[/bold magenta] - Epic achievements (major unlocks)\n" - " [bold bright_blue]leaderboard[/bold bright_blue] - Community showcase (share progress)\n" - " [bold bright_yellow]olympics[/bold bright_yellow] - Competition events (challenges)\n\n" - "[bold]Convenience Shortcuts:[/bold]\n" - " [bold yellow]export[/bold yellow] - Quick export (→ module export)\n" - " [bold yellow]test[/bold yellow] - Quick test (→ module test)\n" - " [bold green]book[/bold green] - Build Jupyter Book documentation\n" - " [bold green]logo[/bold green] - Learn about Tiny🔥Torch philosophy\n" - "[bold]Quick Start:[/bold]\n" - " [dim]tito setup[/dim] - First-time setup (run once)\n" - 
" [dim]tito module complete 01[/dim] - Complete it (test + export + track)\n" - " [dim]tito module start 02[/dim] - Continue to Module 02\n" - " [dim]tito module status[/dim] - View all progress\n\n" - "[bold]Track Progress:[/bold]\n" - " [dim]tito checkpoint status[/dim] - Capabilities unlocked\n" - " [dim]tito leaderboard profile[/dim] - Your achievement journey\n\n" - "[bold]Get Help:[/bold]\n" - " [dim]tito [/dim] - Show command subcommands\n" - " [dim]tito --help[/dim] - Show full help", + help_text, title="Welcome to TinyπŸ”₯Torch!", border_style="bright_green" ))