feat: add ABOUT.md generation system with agent instructions

- Create about-generator.md agent with YAML frontmatter
- Create about-guidelines.md with 13-section template
- Update Module 01 ABOUT.md as canonical reference
- Add embedded code snippets from .py files
- Add Q5 (views vs copies) and BLAS paper reference
- Use 'Your Tiny🔥Torch' branding in tab-sets
- Remove chapter-level TOC (master TOC handles navigation)
- Configure PDF TOC depth to 3 in _config.yml
Author: Vijay Janapa Reddi
Date: 2025-12-13 11:14:48 -05:00
parent 6355145879
commit 77a40ffea8
7 changed files with 979 additions and 1029 deletions

tinytorch/settings.json (new file, 1 line added)

@@ -0,0 +1 @@
{}


@@ -72,6 +72,9 @@ html:
latex:
latex_documents:
targetname: tinytorch-course.tex
latex_elements:
# Show 3 levels in PDF table of contents (chapter, section, subsection)
tocdepth: 3
# Bibliography support
bibtex_bibfiles:


@@ -2,157 +2,207 @@
This template defines the standardized structure for all module ABOUT.md files used in the Jupyter Book site.
## Key Design Decisions
Based on expert review (David Patterson, Education Reviewer, Website Manager):
1. **No YAML frontmatter** - Was not used by build system, created maintenance burden
2. **Learning Objectives before Overview** - Pre-organizes cognitive schema for better learning
3. **Time estimates in Build → Use → Reflect** - Enables self-pacing
4. **Checkpoints in Implementation Guide** - Progress validation every 90-120 minutes
5. **No manual prev/next navigation** - Auto-generated by Jupyter Book
## Standard Structure
```markdown
---
title: "[Module Title]"
description: "[Brief description]"
difficulty: "[⭐⭐⭐⭐]"
time_estimate: "[X-Y hours]"
prerequisites: []
next_steps: []
learning_objectives: []
---
# Module Name
# [NN]. [Module Title]
**[TIER]** | Difficulty: ⭐⭐⭐⭐ (X/4) | Time: X-Y hours
## Overview
[2-3 sentence overview explaining what this module builds and why it matters]
**[TIER]** | Difficulty: ●●○○ (2/4) | Time: X-Y hours | Prerequisites: Module XX
## Learning Objectives
By the end of this module, you will be able to:
- **[Objective 1]**: [Description]
- **[Objective 2]**: [Description]
- **[Objective 3]**: [Description]
- **[Objective 4]**: [Description]
- **[Objective 5]**: [Description]
- **[Action verb]**: [Specific, measurable outcome with systems context]
- **[Action verb]**: [Outcome connecting to production frameworks]
- **[Action verb]**: [Outcome with performance/complexity awareness]
- **[Action verb]**: [Outcome linking to real-world applications]
- **[Action verb]**: [Outcome demonstrating systems thinking]
## Build → Use → [Analyze/Optimize/Reflect]
## Overview
This module follows TinyTorch's **Build → Use → [Third Stage]** framework:
[2-3 paragraphs covering:]
- What you're building (concrete deliverable)
- Why it matters (motivation and real-world relevance)
- Where it fits in ML systems (context within TinyTorch and production frameworks)
1. **Build**: [What students implement]
2. **Use**: [How they apply it]
3. **[Third Stage]**: [Deeper engagement - varies by module]
## Build → Use → Reflect
This module follows TinyTorch's **Build → Use → Reflect** framework:
1. **Build** (~X hours): [What students implement - specific components]
2. **Use** (~Y hours): [How they apply it - concrete scenarios]
3. **Reflect** (~Z hours): [Systems thinking - quantitative questions]
## Implementation Guide
### [Main Component Name]
### [Component 1 Name]
[Explanation of what this component does and why]
```python
# Example code showing key functionality
# Code example with clear comments
from tinytorch.core.module import Component
# Usage pattern
example = Component(...)
result = example.method(...)
```
### [Additional Components]
[More implementation examples]
**Systems insight**: [Connection to production frameworks, performance implications, or architectural decisions]
### [Component 2 Name]
[Continue with additional components...]
---
**CHECKPOINT 1**: [What capability is now unlocked]
What you can do now: [Concrete demonstration of progress]
Progress: ~X% through module | Time invested: ~Y hours | Remaining: ~Z hours
---
### [Component 3 Name]
[Continue implementation...]
---
**CHECKPOINT 2**: [Next capability unlocked]
Progress: ~X% through module | Remaining: ~Z hours
---
## Getting Started
### Prerequisites
Ensure you understand the [foundations]:
Complete these modules first:
```bash
# Activate TinyTorch environment
source scripts/activate-tinytorch
# Verify prerequisite modules
# Verify prerequisites
tito test [prerequisite1]
tito test [prerequisite2]
```
### Development Workflow
1. **Open the development file**: `modules/[NN]_[modulename]/[modulename]_dev.py`
2. **Implement [component 1]**: [Description]
3. **Build [component 2]**: [Description]
4. **Create [component 3]**: [Description]
5. **Add [component 4]**: [Description]
6. **Export and verify**: `tito module complete [NN] && tito test [modulename]`
1. **Open the development file**: `src/[NN]_[modulename]/[NN]_[modulename].py`
2. **Implement [component 1]**: [Brief description]
3. **Build [component 2]**: [Brief description]
4. **Add [component 3]**: [Brief description]
5. **Export and verify**: `tito test [NN]`
## Testing
### Comprehensive Test Suite
Run the full test suite to verify [module] functionality:
```bash
# TinyTorch CLI (recommended)
tito test [modulename]
tito test [NN]
# Direct pytest execution
python -m pytest tests/ -k [modulename] -v
python -m pytest tests/[NN]_[modulename]/ -v
```
### Test Coverage Areas
- **[Test area 1]**: [Description]
- **[Test area 2]**: [Description]
- **[Test area 3]**: [Description]
- **[Test area 4]**: [Description]
- **[Test area 5]**: [Description]
### Inline Testing & [Analysis Type]
The module includes comprehensive [validation type]:
```python
# Example inline test output
🔬 Unit Test: [Component]...
[Test result 1]
[Test result 2]
Progress: [Component]
```
- **[Area 1]**: [What is tested and why it matters]
- **[Area 2]**: [What is tested]
- **[Area 3]**: [What is tested]
### Manual Testing Examples
```python
from [modulename]_dev import [Component]
# Example usage
from tinytorch.core.[module] import [Component]
# Test basic functionality
component = Component(...)
result = component.method(...)
assert result.shape == expected_shape
print("Basic functionality working")
```
## Production Context
### Your Implementation vs. PyTorch
| Feature | Your Implementation | PyTorch |
|---------|--------------------| --------|
| [Feature 1] | [Your approach] | [Production approach] |
| [Feature 2] | [Your approach] | [Production approach] |
**What's the same**: [Core concepts that transfer directly]
**What's different**: [Production optimizations you're not implementing]
### Real-World Applications
- **[Company/Domain]**: [How this concept is used at scale]
- **[Company/Domain]**: [Another production example]
## Systems Thinking Questions
### Real-World Applications
- **[Application 1]**: [Description]
- **[Application 2]**: [Description]
- **[Application 3]**: [Description]
- **[Application 4]**: [Description]
### Computational Analysis
### [Mathematical/Technical] Foundations
- **[Concept 1]**: [Description]
- **[Concept 2]**: [Description]
- **[Concept 3]**: [Description]
- **[Concept 4]**: [Description]
- [Quantitative question with concrete parameters, e.g., "For a batch size of 32 with 512-dimensional embeddings, calculate the memory required for..."]
- [Scaling question, e.g., "How does computational cost change when doubling sequence length?"]
### [Theory/Performance] Characteristics
- **[Characteristic 1]**: [Description]
- **[Characteristic 2]**: [Description]
- **[Characteristic 3]**: [Description]
- **[Characteristic 4]**: [Description]
### Architectural Trade-offs
- [Design decision question, e.g., "Why does PyTorch use X approach instead of Y?"]
- [Trade-off analysis, e.g., "When would you choose dense vs. sparse representation?"]
### Performance Characteristics
- [Measurement question, e.g., "Profile the forward pass - which operation dominates?"]
- [Optimization question, e.g., "What would you change to reduce memory by 50%?"]
## What's Next
**Module [NN+1]: [Name]** - [Brief description of what comes next and how it builds on this module]
## Ready to Build?
[2-3 paragraph motivational conclusion explaining why this module matters and what students will achieve]
[1-2 paragraphs: Action-oriented closing that emphasizes what students will accomplish and connects to the broader ML systems journey]
Choose your preferred way to engage with this module:
````{grid} 1 2 3 3
```{grid-item-card} 🚀 Launch Binder
:link: https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/[NN]_[modulename]/[modulename]_dev.ipynb
```{grid-item-card} Launch Binder
:link: https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=src/[NN]_[modulename]/[NN]_[modulename].py
:class-header: bg-light
Run this module interactively in your browser. No installation required!
```
```{grid-item-card} Open in Colab
:link: https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/[NN]_[modulename]/[modulename]_dev.ipynb
```{grid-item-card} Open in Colab
:link: https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/src/[NN]_[modulename]/[NN]_[modulename].py
:class-header: bg-light
Use Google Colab for GPU access and cloud compute power.
```
```{grid-item-card} 📖 View Source
:link: https://github.com/mlsysbook/TinyTorch/blob/main/modules/[NN]_[modulename]/[modulename]_dev.py
```{grid-item-card} View Source
:link: https://github.com/mlsysbook/TinyTorch/blob/main/src/[NN]_[modulename]/[NN]_[modulename].py
:class-header: bg-light
Browse the Python source code and understand the implementation.
@@ -160,47 +210,49 @@ Browse the Python source code and understand the implementation.
````
```{admonition} 💾 Save Your Progress
```{admonition} Save Your Progress
:class: tip
**Binder sessions are temporary!** Download your completed notebook when done, or switch to local development for persistent work.
```
---
<div class="prev-next-area">
<a class="left-prev" href="../chapters/[prev_module].html" title="previous page">← Previous Module</a>
<a class="right-next" href="../chapters/[next_module].html" title="next page">Next Module →</a>
</div>
```
## Required Sections
All modules MUST include:
1. Frontmatter (YAML metadata)
2. Title with tier/difficulty/time
1. Title with tier/difficulty/time/prerequisites (inline metadata)
2. Learning Objectives (BEFORE Overview)
3. Overview
4. Learning Objectives
5. Build → Use → [Third Stage]
6. Implementation Guide
7. Getting Started (Prerequisites + Development Workflow)
8. Testing (Comprehensive Test Suite + Test Coverage + Inline Testing + Manual Examples)
9. Systems Thinking Questions (Real-World Applications + Foundations + Characteristics)
10. Ready to Build? (Motivational conclusion)
11. Launch Binder/Colab/Source grid
12. Save Your Progress admonition
13. Previous/Next navigation
## Optional Sections
- "Why This Matters" (can be integrated into Overview or Systems Thinking)
- Additional implementation examples
- Extended mathematical foundations
4. Build → Use → Reflect (with time estimates)
5. Implementation Guide (with checkpoints every 90-120 min)
6. Getting Started (Prerequisites + Development Workflow)
7. Testing (Test Suite + Coverage Areas + Manual Examples)
8. Production Context (Comparison table + Real-world applications)
9. Systems Thinking Questions (Quantitative, not open-ended)
10. What's Next (Forward reference)
11. Ready to Build? (Action-oriented closing)
12. Launch grid (Binder/Colab/Source)
13. Save Your Progress admonition
## Removed Elements
These were removed based on expert review:
- **YAML frontmatter** - Not used by Jupyter Book, duplicated body content
- **Manual prev/next navigation** - Auto-generated by theme
- **Separate "What You'll Build" section** - Merged into Implementation Guide
- **Separate "Core Concepts" section** - Merged into Implementation Guide
- **Open-ended reflection questions** - Replaced with quantitative questions
## Tier Definitions
- **FOUNDATION TIER**: Core building blocks (Tensor, Activations, Layers, Losses, Autograd, Optimizers, Training)
- **ARCHITECTURE TIER**: Model architectures (DataLoader, Spatial, Tokenization, Embeddings, Attention, Transformers)
- **OPTIMIZATION TIER**: Performance optimization (Profiling, Quantization, Compression, Memoization, Acceleration, Benchmarking, Capstone)
## Difficulty Scale
- ● (1/4): Introductory, minimal prerequisites
- ●● (2/4): Intermediate, builds on foundations
- ●●● (3/4): Advanced, requires solid understanding
- ●●●● (4/4): Expert, integrates multiple concepts


@@ -81,7 +81,7 @@ Build a complete machine learning (ML) framework from tensors to systems—under
<p class="vision-title">The "AI Bricks" Approach</p>
<div class="vision-grid">
<div class="vision-item">
<span class="vision-icon">🧱</span>
<span class="vision-icon">🔧</span>
<span class="vision-text"><strong>Build each piece</strong> — Tensors, autograd, optimizers, attention. No magic imports.</span>
</div>
<div class="vision-item">


@@ -33,7 +33,7 @@ Students cannot learn this from production code. PyTorch is too large, too compl
They also cannot learn it from toy scripts. A hundred-line neural network does not reveal the architecture of a framework. It hides it.
## The Solution: AI Bricks 🧱
## The Solution: AI Bricks
TinyTorch teaches you the **AI bricks**—the stable engineering foundations you can use to build any AI system. Small enough to learn from: bite-sized code that runs even on a Raspberry Pi. Big enough to matter: showing the real architecture of how frameworks are built.

File diff suppressed because it is too large.


@@ -1,400 +1,383 @@
---
title: "Activations"
description: "Neural network activation functions enabling non-linear learning"
difficulty: "●●"
time_estimate: "3-4 hours"
prerequisites: ["01_tensor"]
next_steps: ["03_layers"]
learning_objectives:
- "Understand activation functions as the non-linearity enabling neural networks to learn complex patterns"
- "Implement ReLU, Sigmoid, Tanh, GELU, and Softmax with proper numerical stability"
- "Recognize function properties (range, gradient behavior, symmetry) and their roles in ML architectures"
- "Connect activation implementations to torch.nn.functional and PyTorch/TensorFlow patterns"
- "Analyze computational efficiency, numerical stability, and memory implications of different activations"
---
# Activations
**FOUNDATION TIER** | Difficulty: ● (2/4) | Time: 3-4 hours
**FOUNDATION TIER** | Difficulty: ● (1/4) | Time: 3-4 hours | Prerequisites: Module 01
This module builds directly on Module 01 (Tensor). You should be comfortable with:
- Creating and manipulating Tensors
- Broadcasting semantics
- Element-wise operations
If you can create a Tensor and perform arithmetic on it, you're ready.
## Overview
Activation functions are the mathematical operations that introduce non-linearity into neural networks, transforming them from simple linear regressors into universal function approximators. Without activations, stacking layers would be pointless—multiple linear transformations collapse to a single linear operation. With activations, each layer learns increasingly complex representations, enabling networks to approximate any continuous function. This module implements five essential activation functions with proper numerical stability, preparing you to understand what happens every time you call `F.relu(x)` or `torch.sigmoid(x)` in production code.
Every neural network needs activation functions to learn complex patterns. Without them, stacking layers would be mathematically equivalent to having a single layer. No matter how deep your network, linear transformations composed together are still just linear transformations. Activation functions introduce the non-linearity that lets networks curve, bend, and approximate any function.
In this module, you'll build five essential activation functions: Sigmoid, ReLU, Tanh, GELU, and Softmax. Each serves a different purpose in neural networks, from gating probabilities to creating sparsity to producing probability distributions. By implementing them yourself, you'll understand exactly what happens when you call `torch.relu()` or `torch.softmax()` in production code.
These activations operate on the Tensor class you built in Module 01. They take a Tensor as input and return a new Tensor with the transformed values. This pattern of Tensor-in, Tensor-out is fundamental to how neural networks work.
## Learning Objectives
By the end of this module, you will be able to:
```{admonition} By completing this module, you will:
:class: tip
- **Systems Understanding**: Recognize activation functions as the critical non-linearity that enables universal function approximation, understanding their role in memory consumption (activation caching), computational bottlenecks (billions of calls per training run), and gradient flow through deep architectures
- **Core Implementation**: Build ReLU, Sigmoid, Tanh, GELU, and Softmax with numerical stability techniques (max subtraction, conditional computation) that prevent overflow/underflow while maintaining mathematical correctness
- **Pattern Recognition**: Understand function properties—ReLU's sparsity and [0, ∞) range, Sigmoid's (0,1) probabilistic outputs, Tanh's (-1,1) zero-centered gradients, GELU's smoothness, Softmax's probability distributions—and why each serves specific architectural roles
- **Framework Connection**: See how your implementations mirror `torch.nn.ReLU`, `torch.nn.Sigmoid`, `torch.nn.Tanh`, `torch.nn.GELU`, and `F.softmax`, understanding the actual mathematical operations behind PyTorch's abstractions used throughout ResNet, BERT, GPT, and vision transformers
- **Performance Trade-offs**: Analyze computational cost (element-wise operations vs exponentials), memory implications (activation caching for backprop), and gradient behavior (vanishing gradients in Sigmoid/Tanh vs ReLU's constant gradients), understanding why ReLU dominates hidden layers while Sigmoid/Softmax serve specific output roles
- **Implement** five core activation functions (Sigmoid, ReLU, Tanh, GELU, Softmax) with proper numerical stability
- **Understand** why non-linearity is essential for neural network expressiveness
- **Analyze** computational costs and numerical stability considerations for each activation
- **Connect** your implementations to production usage patterns in PyTorch and modern architectures
```
## Build → Use → Reflect
## What You'll Build
This module follows TinyTorch's **Build → Use → Reflect** framework:
```{mermaid}
flowchart LR
subgraph "Your Activation Functions"
A["Sigmoid<br/>(0, 1) range"]
B["ReLU<br/>zeros negatives"]
C["Tanh<br/>(-1, 1) range"]
D["GELU<br/>smooth ReLU"]
E["Softmax<br/>probabilities"]
end
1. **Build**: Implement five core activation functions (ReLU, Sigmoid, Tanh, GELU, Softmax) with numerical stability. Handle overflow in exponentials through max subtraction and conditional computation, ensure shape preservation across operations, and maintain proper value ranges ([0,∞) for ReLU, (0,1) for Sigmoid, (-1,1) for Tanh, probability distributions for Softmax)
T["Input Tensor"] --> A & B & C & D & E
A & B & C & D & E --> O["Output Tensor"]
2. **Use**: Apply activations to real tensors with various ranges and shapes. Test with extreme values (±1000) to verify numerical stability, visualize function behavior across input domains, integrate with Tensor operations from Module 01, and chain activations to simulate simple neural network data flow (Input → ReLU → Softmax)
style A fill:#e1f5ff
style B fill:#fff3cd
style C fill:#d4edda
style D fill:#f8d7da
style E fill:#e2d5f1
```
3. **Reflect**: Understand why each activation exists in production systems—why ReLU enables sparse representations (many zeros) that accelerate computation and reduce overfitting, how Sigmoid creates gates (0 to 1 control signals) in LSTM/GRU architectures, why Tanh's zero-centered outputs improve optimization dynamics, how GELU's smoothness helps transformers, and why Softmax's probability distributions are essential for classification
**Implementation roadmap:**
## Implementation Guide
### ReLU - The Sparsity Creator
ReLU (Rectified Linear Unit) is the workhorse of modern deep learning, used in hidden layers of ResNet, EfficientNet, and most convolutional architectures.
| Part | What You'll Implement | Key Concept |
|------|----------------------|-------------|
| 1 | `Sigmoid.forward()` | Squashing to (0, 1) for probabilities |
| 2 | `ReLU.forward()` | Zeroing negatives for sparsity |
| 3 | `Tanh.forward()` | Zero-centered outputs in (-1, 1) |
| 4 | `GELU.forward()` | Smooth approximation for transformers |
| 5 | `Softmax.forward()` | Converting to probability distribution |
**The pattern you'll enable:**
```python
class ReLU:
"""ReLU activation: f(x) = max(0, x)"""
def forward(self, x: Tensor) -> Tensor:
# Zero negative values, preserve positive values
return Tensor(np.maximum(0, x.data))
```
**Mathematical Definition**: `f(x) = max(0, x)`
**Key Properties**:
- **Range**: [0, ∞) - unbounded above
- **Gradient**: 0 for x < 0, 1 for x > 0 (undefined at x = 0)
- **Sparsity**: Produces many exact zeros (sparse activations)
- **Computational Cost**: Trivial (element-wise comparison)
**Why ReLU Dominates Hidden Layers**:
- No vanishing gradient problem (gradient is 1 for positive inputs)
- Computationally efficient (simple max operation)
- Creates sparsity (zeros) that reduces computation and helps regularization
- Empirically outperforms Sigmoid/Tanh in deep networks
**Watch Out For**: "Dying ReLU" problem—neurons can get stuck outputting zero if inputs become consistently negative during training. Variants like Leaky ReLU (allows small negative slope) address this.
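Leaky ReLU is outside this module's five core activations, but a minimal plain-NumPy sketch (the function name and slope default here are illustrative, not part of the module) shows how the small negative slope keeps "dead" neurons recoverable:

```python
import numpy as np

def leaky_relu(x: np.ndarray, negative_slope: float = 0.01) -> np.ndarray:
    """Leaky ReLU: negatives are scaled, not zeroed, so gradients stay alive."""
    return np.where(x >= 0, x, negative_slope * x)

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
out = leaky_relu(x)
print(out)  # [-0.02 -0.01  0.    1.    2.  ]
```

Because the negative branch has slope 0.01 rather than 0, a neuron that drifts into the negative regime still receives a (small) gradient and can recover.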
### Sigmoid - The Probabilistic Gate
Sigmoid maps any real number to (0, 1), making it essential for binary classification and gating mechanisms in LSTMs/GRUs.
```python
class Sigmoid:
"""Sigmoid activation: σ(x) = 1/(1 + e^(-x))"""
def forward(self, x: Tensor) -> Tensor:
# Numerical stability: avoid exp() overflow
data = x.data
return Tensor(np.where(
data >= 0,
1 / (1 + np.exp(-data)), # Positive values
np.exp(data) / (1 + np.exp(data)) # Negative values
))
```
**Mathematical Definition**: `σ(x) = 1/(1 + e^(-x))`
**Key Properties**:
- **Range**: (0, 1) - strictly bounded
- **Gradient**: σ(x)(1 - σ(x)), maximum 0.25 at x = 0
- **Symmetry**: σ(-x) = 1 - σ(x)
- **Computational Cost**: One exponential per element
**Numerical Stability Critical**:
- Naive `1/(1 + exp(-x))` overflows for large negative x (`exp(-x)` blows up)
- For x ≥ 0: use `1/(1 + exp(-x))` (stable)
- For x < 0: use `exp(x)/(1 + exp(x))` (stable)
- Conditional computation prevents overflow while maintaining correctness
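A quick NumPy sketch of the piecewise scheme (using boolean masks rather than `np.where`, so `exp` is never even evaluated on an unsafe argument; function name is illustrative):

```python
import numpy as np

def sigmoid_stable(x: np.ndarray) -> np.ndarray:
    """Piecewise-stable sigmoid: exp() only ever sees non-positive arguments."""
    out = np.empty_like(x, dtype=float)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))   # exp of values <= 0: safe
    ex = np.exp(x[~pos])                        # exp of negative values: safe
    out[~pos] = ex / (1.0 + ex)
    return out

x = np.array([-1000.0, 0.0, 1000.0])
print(sigmoid_stable(x))  # [0.  0.5 1. ]  -- no overflow warning
```

Note that an `np.where`-based version still *evaluates* both branches for every element, which can emit overflow warnings even when the final result is correct; masking avoids that entirely.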
**Production Use Cases**:
- Binary classification output layer (probability of positive class)
- LSTM/GRU gates (input gate, forget gate, output gate)
- Attention mechanisms (before softmax normalization)
**Gradient Problem**: Maximum derivative is 0.25, meaning gradients shrink by ≥75% per layer. In deep networks (>10 layers), gradients vanish exponentially, making training difficult. This is why ReLU replaced Sigmoid in hidden layers.
### Tanh - The Zero-Centered Alternative
Tanh (hyperbolic tangent) maps inputs to (-1, 1), providing zero-centered outputs that improve gradient flow compared to Sigmoid.
```python
class Tanh:
"""Tanh activation: f(x) = (e^x - e^(-x))/(e^x + e^(-x))"""
def forward(self, x: Tensor) -> Tensor:
return Tensor(np.tanh(x.data))
```
**Mathematical Definition**: `tanh(x) = (e^x - e^(-x))/(e^x + e^(-x))`
**Key Properties**:
- **Range**: (-1, 1) - symmetric around zero
- **Gradient**: 1 - tanh²(x), maximum 1.0 at x = 0
- **Symmetry**: tanh(-x) = -tanh(x) (odd function)
- **Computational Cost**: Two exponentials (or NumPy optimized)
**Why Zero-Centered Matters**:
- Tanh outputs have mean ≈ 0, unlike Sigmoid's mean ≈ 0.5
- Gradients don't systematically bias weight updates in one direction
- Helps optimization in shallow networks and RNN cells
**Production Use Cases**:
- LSTM/GRU cell state computation (candidate values in [-1, 1])
- Output layer when you need symmetric bounded outputs
- Some shallow networks (though ReLU usually preferred now)
**Still Has Vanishing Gradients**: Maximum derivative is 1.0 (better than Sigmoid's 0.25), but still saturates for |x| > 2, causing vanishing gradients in deep networks.
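The zero-centering claim is easy to verify empirically. A small NumPy sketch (illustrative, not the module's Tensor-based code) compares mean outputs of Tanh and Sigmoid over zero-mean inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)            # zero-mean inputs

tanh_out = np.tanh(x)
sigmoid_out = 1.0 / (1.0 + np.exp(-x))

# Tanh outputs stay centered near 0; Sigmoid outputs cluster around 0.5,
# which systematically biases downstream weight updates.
print(round(tanh_out.mean(), 3), round(sigmoid_out.mean(), 3))
```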
### GELU - The Smooth Modern Choice
GELU (Gaussian Error Linear Unit) is a smooth approximation to ReLU, used in modern transformer architectures like GPT, BERT, and Vision Transformers.
```python
class GELU:
"""GELU activation: f(x) ≈ x * Sigmoid(1.702 * x)"""
def forward(self, x: Tensor) -> Tensor:
# Approximation: x * sigmoid(1.702 * x)
sigmoid_part = 1.0 / (1.0 + np.exp(-1.702 * x.data))
return Tensor(x.data * sigmoid_part)
```
**Mathematical Definition**: `GELU(x) = x · Φ(x) ≈ x · σ(1.702x)` where Φ(x) is the cumulative distribution function of the standard normal distribution
**Key Properties**:
- **Range**: (-∞, ∞) - unbounded like ReLU
- **Gradient**: Smooth everywhere (no sharp corner at x = 0)
- **Approximation**: The 1.702 constant is fitted so that σ(1.702x) closely tracks Φ(x); the √(2/π) constant appears in the alternative tanh-based approximation
- **Computational Cost**: One exponential (similar to Sigmoid)
**Why Transformers Use GELU**:
- Smooth differentiability everywhere (unlike ReLU's corner at x = 0)
- Empirically performs better than ReLU in transformer architectures
- Non-monotonic behavior (slight negative region) helps representation learning
- Used in GPT, BERT, RoBERTa, Vision Transformers
**Comparison to ReLU**: GELU is smoother (differentiable everywhere) but more expensive (requires an exponential). In transformers, the extra cost is negligible compared to attention computation, and the smoothness helps performance.
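To see how close the sigmoid approximation is, here is a sketch comparing it against the exact erf-based definition (using Python's `math.erf`; the ~0.02 error figure is an observation over this input range, not a guarantee):

```python
import math
import numpy as np

def gelu_exact(x: np.ndarray) -> np.ndarray:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF (via erf)."""
    return x * 0.5 * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))

def gelu_sigmoid(x: np.ndarray) -> np.ndarray:
    """Sigmoid approximation used in this module: x * sigmoid(1.702 * x)."""
    return x / (1.0 + np.exp(-1.702 * x))

x = np.linspace(-4.0, 4.0, 81)
err = np.max(np.abs(gelu_exact(x) - gelu_sigmoid(x)))
# Over [-4, 4] the approximation stays within roughly 0.02 of the exact curve,
# with the largest deviation near |x| = 2.
```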
### Softmax - The Probability Distributor
Softmax converts any vector into a valid probability distribution where all outputs are positive and sum to exactly 1.0.
```python
class Softmax:
"""Softmax activation: f(x_i) = e^(x_i) / Σ(e^(x_j))"""
def forward(self, x: Tensor, dim: int = -1) -> Tensor:
# Numerical stability: subtract max before exp
x_max_data = np.max(x.data, axis=dim, keepdims=True)
x_shifted = x - Tensor(x_max_data)
exp_values = Tensor(np.exp(x_shifted.data))
exp_sum = Tensor(np.sum(exp_values.data, axis=dim, keepdims=True))
return exp_values / exp_sum
```
**Mathematical Definition**: `softmax(x_i) = e^(x_i) / Σ_j e^(x_j)`
**Key Properties**:
- **Range**: (0, 1) with Σ outputs = 1.0 exactly
- **Gradient**: Complex (involves all elements, not just element-wise)
- **Translation Invariant**: softmax(x + c) = softmax(x)
- **Computational Cost**: One exponential per element + sum reduction
**Numerical Stability Critical**:
- Naive `exp(x_i) / sum(exp(x_j))` overflows for large values
- Subtract max before exponential: `exp(x - max(x))`
- Mathematically equivalent due to translation invariance
- Prevents overflow while maintaining correct probabilities
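A minimal NumPy sketch of the max-subtraction trick (1-D only; function name is illustrative), showing that shifted and unshifted logits yield the same distribution while huge logits stay finite:

```python
import numpy as np

def softmax_stable(x: np.ndarray) -> np.ndarray:
    """Shift by the max first, so exp() only sees values <= 0."""
    shifted = x - np.max(x)
    e = np.exp(shifted)
    return e / e.sum()

logits = np.array([1000.0, 1001.0, 1002.0])
probs = softmax_stable(logits)
# Same distribution as softmax([0, 1, 2]): translation invariance in action.
print(probs)           # approximately [0.090, 0.245, 0.665]
print(probs.sum())     # sums to 1 (up to float rounding)
```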
**Production Use Cases**:
- Multi-class classification output layer (class probabilities)
- Attention weights in transformers (probability distribution over sequence)
- Any time you need a valid discrete probability distribution
**Cross-Entropy Connection**: In practice, Softmax is almost always paired with cross-entropy loss. PyTorch's `F.cross_entropy` combines both operations with additional numerical stability (LogSumExp trick).
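The LogSumExp idea behind that pairing can be sketched directly: computing `log(softmax(x))` in one shifted step avoids the exp-then-log round trip that overflows for large logits (plain NumPy, 1-D only; this is an illustration, not PyTorch's actual implementation):

```python
import numpy as np

def log_softmax(x: np.ndarray) -> np.ndarray:
    """log(softmax(x)) via LogSumExp: stable even for huge logits."""
    shifted = x - np.max(x)
    return shifted - np.log(np.sum(np.exp(shifted)))

logits = np.array([1000.0, 1001.0, 1002.0])
ls = log_softmax(logits)
# The naive route log(exp(1000) / sum(...)) would overflow to inf/inf;
# this version returns finite log-probabilities that exponentiate to sum 1.
print(ls)
```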
## Getting Started
### Prerequisites
Ensure you have completed Module 01 (Tensor) before starting:
```bash
# Activate TinyTorch environment
source scripts/activate-tinytorch
# Verify tensor module is complete
tito test tensor
# Expected: ✓ Module 01 complete!
```
### Development Workflow
1. **Open the development file**: `modules/02_activations/activations_dev.ipynb` (or `.py` via Jupytext)
2. **Implement ReLU**: Simple max(0, x) operation using `np.maximum`
3. **Build Sigmoid**: Implement with numerical stability using conditional computation for positive/negative values
4. **Create Tanh**: Use `np.tanh` for hyperbolic tangent transformation
5. **Add GELU**: Implement smooth approximation using `x * sigmoid(1.702 * x)`
6. **Build Softmax**: Implement with max subtraction for numerical stability, handle dimension parameter for multi-dimensional tensors
7. **Export and verify**: Run `tito module complete 02 && tito test activations`
**Development Tips**:
- Test with extreme values (±1000) to verify numerical stability
- Verify output ranges: ReLU [0, ∞), Sigmoid (0,1), Tanh (-1,1)
- Check Softmax sums to 1.0 along specified dimension
- Test with multi-dimensional tensors (batches) to ensure shape preservation
## Testing
### Comprehensive Test Suite
Run the full test suite to verify all activation implementations:
```bash
# TinyTorch CLI (recommended)
tito test activations
# Direct pytest execution
python -m pytest tests/ -k activations -v
# Test specific activation
python -m pytest tests/test_activations.py::test_relu -v
```
### Test Coverage Areas
- **ReLU Correctness**: Verifies max(0, x) behavior, sparsity property (negative → 0, positive preserved), and proper handling of exactly zero inputs
- **Sigmoid Numerical Stability**: Tests extreme values (±1000) don't cause overflow/underflow, validates (0,1) range constraints, confirms sigmoid(0) = 0.5 exactly
- **Tanh Properties**: Validates (-1,1) range, symmetry property (tanh(-x) = -tanh(x)), zero-centered behavior (tanh(0) = 0), and extreme value convergence
- **GELU Smoothness**: Confirms smooth differentiability (no sharp corners), validates approximation accuracy (GELU(0) ≈ 0, GELU(1) ≈ 0.84), and checks non-monotonic behavior
- **Softmax Probability Distribution**: Verifies sum equals 1.0 exactly, all outputs in (0,1) range, largest input receives highest probability, numerical stability with large inputs, and correct dimension handling for multi-dimensional tensors
### Inline Testing & Validation
The module includes comprehensive inline unit tests that run during development:
```python
# Example inline test output
Unit Test: ReLU...
ReLU zeros negative values correctly
ReLU preserves positive values
ReLU creates sparsity (3/4 values are zero)
Progress: ReLU
Unit Test: Sigmoid...
Sigmoid(0) = 0.5 exactly
All outputs in (0, 1) range
Numerically stable with extreme values (±1000)
Progress: Sigmoid
Unit Test: Softmax...
Outputs sum to 1.0 exactly
All values positive and less than 1
Largest input gets highest probability
Handles large numbers without overflow
Progress: Softmax
```
### Manual Testing Examples
Test activations interactively to understand their behavior:
```python
from activations_dev import ReLU, Sigmoid, Tanh, GELU, Softmax
from tinytorch.core.tensor import Tensor
# Test ReLU sparsity
# Applying non-linearity to tensor data
relu = ReLU()
x = Tensor([-2, -1, 0, 1, 2])
output = relu(x)
print(output.data) # [0, 0, 0, 1, 2] - 60% sparsity!
# Test Sigmoid probability mapping
sigmoid = Sigmoid()
x = Tensor([0.0, 100.0, -100.0]) # Extreme values
output = sigmoid(x)
print(output.data) # [0.5, 1.0, 0.0] - no overflow!
# Test Softmax probability distribution
softmax = Softmax()
x = Tensor([1.0, 2.0, 3.0])
output = softmax(x)
print(output.data) # [0.09, 0.24, 0.67]
print(output.data.sum()) # 1.0 exactly!
# Test activation chaining (simulate simple network)
x = Tensor([[-1, 0, 1, 2]]) # Batch of 1
hidden = relu(x) # Hidden layer: [0, 0, 1, 2]
output = softmax(hidden) # Output probabilities
print(output.data.sum()) # 1.0 - valid distribution!
activated = relu(x) # x is a Tensor, activated is a new Tensor
```
## Systems Thinking Questions

### Real-World Applications

- **Computer Vision Networks**: ResNet-50 applies ReLU to approximately 23 million elements per forward pass (after every convolution), then uses Softmax on 1000 logits for ImageNet classification. How much memory is required just to cache these activations for backpropagation in a batch of 32 images?
- **Transformer Language Models**: BERT-Large has 24 layers × 1024 hidden units × sequence length 512 = 12.6M activations per example. With GELU requiring exponential computation, how does this compare to ReLU's computational cost across a 1M example training run?
- **Recurrent Networks**: LSTM cells use 4 gates (input, forget, output, cell) with Sigmoid/Tanh activations at every timestep. For a sequence of length 100 with 512 hidden units, how many exponential operations are required compared to a simple ReLU-based feedforward network?
- **Mobile Inference**: On-device neural networks must be extremely efficient. Given that ReLU is a simple comparison while GELU requires exponential computation, what are the latency implications for a 20-layer network running on CPU with no hardware acceleration?

### Mathematical Foundations

- **Universal Function Approximation**: The universal approximation theorem states that a neural network with even one hidden layer can approximate any continuous function, but only if it has non-linear activations. Why does linearity prevent universal approximation, and what property of non-linear functions (like ReLU, Sigmoid, Tanh) enables it?
- **Gradient Flow and Saturation**: Sigmoid's derivative is σ(x)(1-σ(x)) with maximum value 0.25. In a 10-layer network using Sigmoid activations, what is the maximum gradient magnitude at layer 1 if the output gradient is 1.0? How does this explain the vanishing gradient problem that led to ReLU's adoption?
- **Numerical Stability and Conditioning**: When computing Softmax, why does subtracting the maximum value before the exponential (exp(x - max(x))) prevent overflow while maintaining mathematical correctness? What property of the exponential function makes this transformation valid?
- **Activation Sparsity and Compression**: ReLU produces exact zeros (sparse activations) while Sigmoid produces values close to but never exactly zero. How does this affect model compression techniques like pruning and quantization? Why are sparse activations more amenable to INT8 quantization?

### Performance Characteristics

- **Memory Footprint of Activation Caching**: During backpropagation, forward pass activations must be stored to compute gradients. For a ResNet-50 processing 224×224×3 images with batch size 32, activation caching requires approximately 3GB of memory. How does this compare to the model's parameter memory (25M params × 4 bytes ≈ 100MB)? What is the scaling relationship between batch size and activation memory?
- **Computational Intensity on Different Hardware**: ReLU is trivially parallelizable (independent element-wise max). On a GPU with 10,000 CUDA cores, what is the theoretical speedup vs single-core CPU? Why does practical speedup plateau at much lower values (memory bandwidth, kernel launch overhead)?
- **Branch Prediction and CPU Performance**: ReLU's conditional behavior (`if x > 0`) can cause branch misprediction penalties on CPUs. For a random uniform distribution of inputs [-1, 1], branch prediction accuracy is ~50%. How does this affect CPU performance compared to branchless implementations using `max(0, x)`?
- **Exponential Computation Cost**: Sigmoid, Tanh, GELU, and Softmax all require exponential computation. On modern CPUs, `exp(x)` takes ~10-20 cycles vs ~1 cycle for addition. For a network with 1M activations, how does this computational difference compound across training iterations? Why do modern frameworks use lookup tables or polynomial approximations for exponentials?

## What You're NOT Building (Yet)

To keep this module focused, you will **not** implement:

- Backward pass (that's Module 05: Autograd)
- Learnable parameters (activations are parameter-free)
- GPU-optimized kernels (PyTorch uses CUDA implementations)
- Variants like LeakyReLU, ELU, Swish (we focus on the core five)

**You are building the forward transformations.** Gradient computation comes in Module 05.

## API Reference

This section provides a quick reference for the activation classes you'll build. Each activation follows the same interface: instantiate, then call with a Tensor.

### Common Interface

All activations implement:

```python
class Activation:
    def forward(self, x: Tensor) -> Tensor: ...
    def __call__(self, x: Tensor) -> Tensor: ...  # Delegates to forward()
    def parameters(self) -> list: ...  # Returns [] (no learnable parameters)
```

### Activation Functions

| Class | Formula | Output Range | Primary Use |
|-------|---------|--------------|-------------|
| `Sigmoid` | 1/(1 + e^(-x)) | (0, 1) | Binary classification, gates |
| `ReLU` | max(0, x) | [0, +inf) | Hidden layers (default) |
| `Tanh` | (e^x - e^(-x))/(e^x + e^(-x)) | (-1, 1) | Hidden layers, RNNs |
| `GELU` | x * sigmoid(1.702x) | (-inf, +inf) | Transformers |
| `Softmax` | e^(x_i) / sum(e^(x_j)) | (0, 1), sum=1 | Multi-class output |

## Ready to Build?

You're about to implement the mathematical functions that give neural networks their power to learn complex patterns! Every breakthrough in deep learning—from AlexNet's ImageNet victory to GPT's language understanding to diffusion models' image generation—relies on the simple activation functions you'll build in this module.

Understanding activations from first principles means implementing their mathematics, handling numerical stability edge cases (overflow, underflow), and grasping their properties (ranges, gradients, symmetry). This knowledge will give you deep insight into why ReLU dominates hidden layers, why Sigmoid creates effective gates in LSTMs, why Tanh helps optimization, why GELU powers transformers, and why Softmax is essential for classification. You'll understand exactly what happens when you call `F.relu(x)` or `torch.sigmoid(x)` in production code—not just the API, but the actual math, numerical considerations, and performance implications.

This is where pure mathematics meets practical machine learning. Take your time with each activation, test thoroughly with extreme values, visualize their behavior across input ranges, and enjoy building the non-linearity that powers modern AI. Let's turn linear transformations into intelligent representations!

Choose your preferred way to engage with this module in the Get Started section at the end of this page.
## Core Concepts
This section explains the fundamental ideas behind activation functions. Understanding these concepts will help you make informed choices when building neural networks.
### Why Non-linearity Matters
Consider what happens when you stack linear transformations. If layer 1 computes `W1 @ x + b1` and layer 2 computes `W2 @ y + b2`, the combined result is `W2 @ (W1 @ x + b1) + b2 = (W2 @ W1) @ x + (W2 @ b1 + b2)`. This is still just a linear transformation! You could replace both layers with a single layer and get the same result.
Activation functions break this equivalence. When you apply ReLU between layers, you're introducing non-linear behavior that can't be collapsed. Each layer can now learn something that the previous layers couldn't represent. This is why deep networks can approximate complex functions: they're not just stacking linear maps, they're building up increasingly sophisticated non-linear transformations.
The mathematical proof is elegant: without non-linearity, an N-layer network has the same representational power as a 1-layer network. With non-linearity, representational power grows with depth.
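The collapse argument can be checked numerically. The sketch below uses plain NumPy with made-up weights (not the module's API): two stacked linear layers match a single combined layer exactly, until a ReLU between them breaks the equivalence.

```python
import numpy as np

# Two stacked linear "layers" with no activation in between
W1 = np.array([[1.0, -1.0],
               [0.5,  2.0]])
b1 = np.array([0.0, 1.0])
W2 = np.array([[2.0, 1.0]])
b2 = np.array([-1.0])

x = np.array([1.0, -2.0])

two_layer = W2 @ (W1 @ x + b1) + b2

# The same mapping collapsed into one layer: W2@W1 and W2@b1 + b2
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(two_layer, one_layer)  # [2.5] [2.5] - stacking added nothing

# A ReLU between the layers breaks the collapse
h = np.maximum(0, W1 @ x + b1)  # zeroes the negative hidden unit
relu_out = W2 @ h + b2
print(relu_out)  # [5.] - no single (W, b) reproduces this mapping
```

For this input the second hidden pre-activation is negative, so ReLU changes the result; that is exactly the non-linearity a single linear layer cannot express.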
### Output Ranges and Use Cases
Each activation maps inputs to a specific output range, and this determines where you use it.
Sigmoid outputs values in (0, 1), making it perfect for probabilities. When you need to predict "spam or not spam," sigmoid gives you a probability like 0.87. But sigmoid saturates for large inputs (both positive and negative), which can slow down learning in hidden layers.
ReLU outputs values in [0, +inf), zeroing out negatives. This creates sparsity: many neurons output exactly zero, which makes computation more efficient and can help prevent overfitting. ReLU is the default choice for hidden layers in most networks.
Tanh outputs values in (-1, 1), similar to sigmoid but centered around zero. This zero-centering can help with the flow of information through networks, especially recurrent networks where the same weights are applied repeatedly.
GELU is like a smooth version of ReLU. Instead of the sharp corner at zero, it curves gently. This smoothness helps with optimization in transformers, which is why GPT and BERT use GELU.
Softmax converts any vector to a probability distribution where all values are positive and sum to 1. It's essential for multi-class classification: given 1000 ImageNet classes, softmax gives you probabilities for each.
### Numerical Stability
Naive implementations of activations can fail catastrophically with extreme inputs. Consider sigmoid with x = 1000: computing e^(-1000) underflows to 0.0, and computing e^(1000) overflows to infinity. Your implementation handles this by clipping inputs and using numerically stable formulas.
Softmax is particularly tricky. Computing e^(1000) for one element while others are e^(1) would give infinity. The standard trick is to subtract the maximum value first: `softmax(x) = softmax(x - max(x))`. This keeps all exponentials in a safe range without changing the result.
These stability tricks are essential for production code. A model that works on small test inputs but crashes on real data is useless.
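The max-subtraction trick is easy to see in a small NumPy sketch (the function names here are illustrative, not this module's API):

```python
import numpy as np

def softmax_naive(x):
    e = np.exp(x)  # overflows to inf for large inputs -> nan results
    return e / e.sum()

def softmax_stable(x):
    # Subtracting max(x) shifts every exponent into a safe range;
    # the shift cancels in the ratio, so the result is unchanged.
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([1000.0, 1001.0, 1002.0])
# softmax_naive(x) would produce [nan, nan, nan] with overflow warnings
probs = softmax_stable(x)
print(probs)        # ~[0.090, 0.245, 0.665]
print(probs.sum())  # 1.0 (within float rounding)
```

Note the stable version gives the same answer as softmax of `[0, 1, 2]`: only the differences between logits matter, which is why the shift is valid.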
### Computational Cost
Not all activations are equal in speed. Understanding these costs matters when you're processing billions of activations per training step.
| Activation | Operations | Relative Cost |
|------------|------------|---------------|
| ReLU | 1 comparison per element | 1x (baseline) |
| Sigmoid | 1 exp + 1 division per element | 3-4x |
| Tanh | 2 exp + 1 division per element | 3-4x |
| GELU | 1 exp + 2 multiplications per element | 4-5x |
| Softmax | n exp + n-1 additions + n divisions | 5x+ |
ReLU's simplicity is one reason it became the default. When you apply an activation billions of times per training step, a 4x speedup matters. GELU's extra cost is worth it in transformers because the improved optimization outweighs the computational overhead.
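You can get a feel for the gap with a rough micro-benchmark like the sketch below. Absolute timings depend heavily on your CPU, NumPy build, and SIMD support, so treat the ratio as indicative only, not as a reproduction of the table above.

```python
import time
import numpy as np

x = np.random.randn(1_000_000).astype(np.float32)

def bench(fn, reps=20):
    fn(x)  # warm-up call so allocation/JIT effects don't skew timing
    t0 = time.perf_counter()
    for _ in range(reps):
        fn(x)
    return (time.perf_counter() - t0) / reps

t_relu = bench(lambda v: np.maximum(0, v))           # 1 comparison/element
t_sigmoid = bench(lambda v: 1.0 / (1.0 + np.exp(-v)))  # exp + divide/element

print(f"ReLU:    {t_relu * 1e3:.3f} ms")
print(f"Sigmoid: {t_sigmoid * 1e3:.3f} ms (~{t_sigmoid / t_relu:.1f}x)")
```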
## Common Errors
### Numerical Overflow in Sigmoid
**Symptom**: `RuntimeWarning: overflow encountered in exp`
**Cause**: Computing `exp(-x)` for very large positive x, or `exp(x)` for very large negative x.
**Fix**: Clip inputs to a safe range (approximately -500 to 500), or use the numerically stable formulation that handles positive and negative inputs separately.
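One common stable formulation splits the input by sign so the exponential argument is never large and positive. This is an illustrative sketch of that approach, not necessarily the exact formulation used in this module's reference solution:

```python
import numpy as np

def sigmoid_stable(x):
    """Piecewise-stable sigmoid: never exponentiates a large positive number."""
    out = np.empty_like(x, dtype=np.float64)
    pos = x >= 0
    # For x >= 0: 1 / (1 + e^-x); here e^-x <= 1, so no overflow
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    # For x < 0: e^x / (1 + e^x); here e^x < 1, so no overflow
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1.0 + ex)
    return out

x = np.array([-1000.0, -10.0, 0.0, 10.0, 1000.0])
print(sigmoid_stable(x))  # [0.0, ~4.54e-05, 0.5, ~0.99995, 1.0]
```

Both branches compute the same mathematical function; the rewrite only changes which exponential is evaluated, keeping it in a safe range.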
### Softmax Dimension Confusion
**Symptom**: Probabilities don't sum to 1, or shape is wrong
**Cause**: Applying softmax along the wrong axis in multi-dimensional tensors
**Fix**: Always specify the `dim` parameter explicitly. For classification with shape `(batch, classes)`, use `dim=-1` to normalize across classes.
```python
# Wrong: softmax over entire tensor
probs = softmax(logits) # Sum might not be 1 per sample
# Right: softmax over class dimension
probs = softmax(logits, dim=-1) # Each row sums to 1
```
### Dead ReLU Neurons
**Symptom**: Some neurons always output 0 during training
**Cause**: Large negative input causes ReLU to output 0, gradient is also 0, so weights never update
**Fix**: This is a known issue with ReLU. Monitor the percentage of dead neurons. If too many die, consider using LeakyReLU (not implemented in this module) or reducing learning rate.
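A simple way to monitor this is to check, per batch, which units never produce a nonzero output. The helper below is a hypothetical monitoring utility (not part of the TinyTorch API), assuming activations with shape `(batch, units)`:

```python
import numpy as np

def dead_fraction(activations, eps=0.0):
    """Fraction of units that were zero for every example in the batch."""
    # A unit is "alive" if at least one example drove it above eps
    alive = (activations > eps).any(axis=0)
    return 1.0 - alive.mean()

# Hypothetical post-ReLU activations: batch of 4 examples, 5 units
acts = np.array([
    [0.0, 1.2, 0.0, 0.3, 0.0],
    [0.0, 0.0, 0.0, 0.9, 0.0],
    [0.0, 2.1, 0.0, 0.0, 0.0],
    [0.0, 0.7, 0.0, 0.1, 0.0],
])
print(dead_fraction(acts))  # 0.6 -> units 0, 2, and 4 never fired
```

A single batch only gives an estimate; a unit dead across many batches of diverse inputs is a stronger signal that its weights can no longer recover.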
## Production Context
Your TinyTorch activations and PyTorch's `torch.nn.functional` activations share the same mathematical definitions. The differences are in implementation: PyTorch uses C++/CUDA kernels optimized for specific hardware, while yours use NumPy operations.
### Your Implementation vs. PyTorch
| Feature | Your Implementation | PyTorch |
|---------|---------------------|---------|
| **Backend** | NumPy (Python) | C++/CUDA kernels |
| **Speed** | 1x (baseline) | 5-20x faster |
| **GPU** | CPU only | CUDA, Metal, ROCm |
| **Fused ops** | Separate operations | Fused kernels (e.g., bias + ReLU) |
| **Autograd** | forward() only | Full backward support |
### Code Comparison
The following comparison shows equivalent operations in TinyTorch and PyTorch. Notice that the API and behavior are identical; only the import changes.
`````{tab-set}
````{tab-item} 🔥 Your TinyTorch
```python
from tinytorch import Tensor
from tinytorch.core.activations import ReLU, Softmax
x = Tensor([[-1, 0, 1], [2, -2, 3]])
relu = ReLU()
activated = relu(x) # [0, 0, 1], [2, 0, 3]
logits = Tensor([[1, 2, 3]])
softmax = Softmax()
probs = softmax(logits) # [0.09, 0.24, 0.67]
```
````
````{tab-item} ⚡ PyTorch
```python
import torch
import torch.nn.functional as F
x = torch.tensor([[-1, 0, 1], [2, -2, 3]], dtype=torch.float32)
activated = F.relu(x) # Same result!
logits = torch.tensor([[1, 2, 3]], dtype=torch.float32)
probs = F.softmax(logits, dim=-1) # Same result!

```
````
`````
Let's walk through each operation to understand the comparison:
- **Imports**: TinyTorch organizes activations in `core.activations`; PyTorch provides them both as `torch.nn` modules and as `torch.nn.functional` functions. The functional interface (`F.relu`) is more common for stateless activations.
- **ReLU**: TinyTorch instantiates a class and then calls it; PyTorch's functional interface is a single function call. Both produce identical output: negative values become 0, positive values pass through unchanged.
- **Softmax**: Both normalize the input to a probability distribution. Note that PyTorch requires an explicit `dim` argument, which is good practice for avoiding bugs.
```{admonition} What's Identical
:class: tip
The mathematical transformations, output ranges, and numerical stability considerations. When you understand how your ReLU handles negative values, you understand exactly what PyTorch's ReLU does. The only difference is speed.
```
### Why Activations Matter at Scale
Activation functions are applied to every neuron in every layer. Consider the scale:
- **GPT-3**: 175 billion parameters means billions of activation function calls per forward pass
- **ResNet-152**: 60 million activations per image, multiplied by batch size
- **Real-time inference**: 30+ frames per second requires activations to complete in microseconds
ReLU's simplicity is a competitive advantage. In a network that produces 1 billion activation values per forward pass, using ReLU instead of GELU saves roughly 3 billion floating-point operations per pass. At scale, this translates to measurable time and energy savings.
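The savings figure is back-of-envelope arithmetic; the per-element operation counts below are rough assumptions consistent with the cost table earlier in this page:

```python
# Back-of-envelope estimate of ReLU vs GELU cost at scale
activations_per_pass = 1_000_000_000  # network producing ~1B activations

ops_relu = 1  # one comparison per element
ops_gelu = 4  # ~1 exp + a few multiplies per element (approximation-dependent)

extra_ops = (ops_gelu - ops_relu) * activations_per_pass
print(f"Extra operations per forward pass: {extra_ops:,}")  # 3,000,000,000
```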
## Check Your Understanding
Test yourself with these questions. They're designed to build intuition for activation behavior and computational characteristics.
**Q1: Output Range Prediction**
What is the output range of `Sigmoid(Tensor([-1000, 0, 1000]))`?
```{admonition} Answer
:class: dropdown
Output: approximately `[0, 0.5, 1]`
Sigmoid(-1000) approaches 0 (but never reaches it)
Sigmoid(0) = exactly 0.5
Sigmoid(1000) approaches 1 (but never reaches it)
The output is always in the open interval (0, 1), never exactly 0 or 1.
```
**Q2: Computational Cost**
A network has 10 hidden layers, each with 1 million neurons. How many more floating-point operations does GELU require compared to ReLU for a single forward pass?
```{admonition} Answer
:class: dropdown
Total elements: 10 layers × 1 million neurons = 10 million elements
ReLU: 1 operation per element = 10 million operations
GELU: ~4-5 operations per element = 40-50 million operations
Difference: **30-40 million extra operations per forward pass**
```
**Q3: Softmax Properties**
Given input `[10, 10, 10]`, what is the softmax output?
```{admonition} Answer
:class: dropdown
Output: `[0.333..., 0.333..., 0.333...]` (equal probabilities)
When all inputs are equal, softmax produces a uniform distribution regardless of the actual values. This is because `e^10 / (3 * e^10) = 1/3` for each element.
The same result would occur for `[0, 0, 0]` or `[100, 100, 100]`.
```
**Q4: Memory for Activation Caching**
ResNet-50 applies ReLU to approximately 23 million elements per forward pass for a single image. During training with batch size 32, how much memory is required just to cache these activations for backpropagation (assuming float32)?
```{admonition} Answer
:class: dropdown
23 million elements × 32 batch × 4 bytes = **2.94 GB**
This is why activation memory often dominates GPU memory usage during training, and why techniques like gradient checkpointing (recomputing activations instead of storing them) are used for very large models.
```
## Further Reading
For students who want to understand the academic foundations and evolution of activation functions:
### Seminal Papers
- **Deep Sparse Rectifier Neural Networks** - Glorot, Bordes, Bengio (2011). The paper that established ReLU as the default activation for deep networks, showing how its sparsity and constant gradient enable training of very deep networks. [AISTATS](http://proceedings.mlr.press/v15/glorot11a.html)
- **Gaussian Error Linear Units (GELUs)** - Hendrycks & Gimpel (2016). Introduced the smooth activation that powers modern transformers like GPT and BERT. Explains the probabilistic interpretation and why smoothness helps optimization. [arXiv](https://arxiv.org/abs/1606.08415)
- **Attention Is All You Need** - Vaswani et al. (2017). While primarily about transformers, this paper's use of specific activations (ReLU in position-wise FFN, Softmax in attention) established patterns still used today. [NeurIPS](https://arxiv.org/abs/1706.03762)
### Additional Resources
- **Textbook**: "Deep Learning" by Goodfellow, Bengio, and Courville - Chapter 6.3 covers activation functions with mathematical rigor
- **Blog**: [Understanding Activation Functions](https://mlu-explain.github.io/relu/) - Amazon's MLU visual explanation of ReLU
## What's Next
```{admonition} Coming Up: Module 03 - Layers
:class: seealso
Implement Linear layers that combine your Tensor operations with your activation functions. You'll build the building blocks that stack to form neural networks: weights, biases, and the forward pass that transforms inputs to outputs.
```
**Preview - How Your Activations Get Used in Future Modules:**
| Module | What It Does | Your Activations In Action |
|--------|--------------|---------------------------|
| **03: Layers** | Neural network building blocks | `Linear(x)` followed by `ReLU()(output)` |
| **04: Losses** | Training objectives | Softmax + cross-entropy for classification |
| **05: Autograd** | Automatic gradients | `ReLU.backward()` computes gradients |
## Get Started
````{grid} 1 2 3 3

```{grid-item-card} 🚀 Launch Binder
:link: https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=src/02_activations/02_activations.py
:class-header: bg-light
Run interactively in browser - no setup required
```

```{grid-item-card} ☁️ Open in Colab
:link: https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/src/02_activations/02_activations.py
:class-header: bg-light
Use Google Colab for cloud compute
```

```{grid-item-card} 📄 View Source
:link: https://github.com/mlsysbook/TinyTorch/blob/main/src/02_activations/02_activations.py
:class-header: bg-light
Browse the implementation code
```

````

```{admonition} Save Your Progress
:class: warning
Binder and Colab sessions are temporary. Download your completed notebook when done, or clone the repository for persistent local work.
```