# ---
# jupyter:
#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: percent
#       format_version: '1.3'
#       jupytext_version: 1.17.1
#   kernelspec:
#     display_name: Python 3 (ipykernel)
#     language: python
#     name: python3
# ---
# %% [markdown]
"""
# Activations - Intelligence Through Nonlinearity

Welcome to Activations! Today you'll add the secret ingredient that makes neural networks intelligent: **nonlinearity**.

## 🔗 Prerequisites & Progress
**You've Built**: Tensor with data manipulation and basic operations
**You'll Build**: Activation functions that add nonlinearity to transformations
**You'll Enable**: Neural networks with the ability to learn complex patterns

**Connection Map**:
```
Tensor   →   Activations    →   Layers
(data)       (intelligence)     (architecture)
```

## Learning Objectives
By the end of this module, you will:
1. Implement 5 core activation functions (Sigmoid, ReLU, Tanh, GELU, Softmax)
2. Understand how nonlinearity enables neural network intelligence
3. Test activation behaviors and output ranges
4. Connect activations to real neural network components

Let's add intelligence to your tensors!
"""

# %% [markdown]
"""
## 📦 Where This Code Lives in the Final Package

**Learning Side:** You work in `modules/02_activations/activations_dev.py`
**Building Side:** Code exports to `tinytorch.core.activations`

```python
# Final package structure:
from tinytorch.core.activations import Sigmoid, ReLU, Tanh, GELU, Softmax  # This module
from tinytorch.core.tensor import Tensor  # Foundation (Module 01)
```

**Why this matters:**
- **Learning:** Complete activation system in one focused module for deep understanding
- **Production:** Proper organization like PyTorch's `torch.nn.functional` with all activation operations together
- **Consistency:** All activation functions and behaviors in `core.activations`
- **Integration:** Works seamlessly with Tensor for complete nonlinear transformations
"""

# %% [markdown]
"""
## 📋 Module Dependencies

**Prerequisites**: Module 01 (Tensor) must be completed

**External Dependencies**:
- `numpy` (for numerical operations)

**TinyTorch Dependencies**:
- **Module 01 (Tensor)**: Foundation for all activation computations and data flow
  - Used for: Input/output data structures, shape operations, element-wise operations
  - Required: Yes - activations operate on Tensor objects

**Dependency Flow**:
```
Module 01 (Tensor) → Module 02 (Activations) → Module 03 (Layers)
        ↓                      ↓                        ↓
   Foundation             Nonlinearity             Architecture
```

**Import Strategy**:
This module imports directly from the TinyTorch package (`from tinytorch.core.*`).
**Assumption**: Module 01 (Tensor) has been completed and exported to the package.
If you see import errors, ensure you've run `tito export` after completing Module 01.
"""

# %% nbgrader={"grade": false, "grade_id": "setup", "solution": true}
#| default_exp core.activations
#| export

import numpy as np
from typing import Optional

# Import from TinyTorch package (previous modules must be completed and exported)
from tinytorch.core.tensor import Tensor

# Constants for numerical comparisons
TOLERANCE = 1e-10  # Small tolerance for floating-point comparisons in tests

# %% [markdown]
"""
## 1. Introduction - What Makes Neural Networks Intelligent?

Consider two scenarios:

**Without Activations (Linear Only):**
```
Input → Linear Transform → Output
[1, 2] → [3, 4] → [11]   # Just weighted sum
```

**With Activations (Nonlinear):**
```
Input → Linear → Activation → Linear → Activation → Output
[1, 2] → [3, 4] → [3, 4] → [7] → [7] → Complex Pattern!
```

The magic happens in those activation functions. They introduce **nonlinearity** - the ability to curve, bend, and create complex decision boundaries instead of just straight lines.

### Why Nonlinearity Matters

Without activation functions, stacking multiple linear layers is pointless:
```
Linear(Linear(x)) = Linear(x)  # Same as single layer!
```

With activation functions, each layer can learn increasingly complex patterns:
```
Layer 1: Simple edges and lines
Layer 2: Curves and shapes
Layer 3: Complex objects and concepts
```

This is how deep networks build intelligence from simple mathematical operations.
"""
# %% [markdown]
"""
## 2. Mathematical Foundations

Each activation function serves a different purpose in neural networks:

### The Five Essential Activations

1. **Sigmoid**: Maps to (0, 1) - perfect for probabilities
2. **ReLU**: Removes negatives - creates sparsity and efficiency
3. **Tanh**: Maps to (-1, 1) - zero-centered for better training
4. **GELU**: Smooth ReLU - modern choice for transformers
5. **Softmax**: Creates probability distributions - essential for classification

Let's implement each one with clear explanations and immediate testing!
"""
# %% [markdown]
"""
## 3. Implementation - Building Activation Functions

### 🏗️ Implementation Pattern

Each activation follows this structure:
```python
class ActivationName:
    def forward(self, x: Tensor) -> Tensor:
        # Apply mathematical transformation
        # Return new Tensor with result
        ...

    def backward(self, grad: Tensor) -> Tensor:
        # Stub for Module 05 - gradient computation
        pass
```
"""
# %% [markdown]
"""
## Sigmoid - The Probability Gatekeeper

Sigmoid maps any real number to the range (0, 1), making it perfect for probabilities and binary decisions.

### Mathematical Definition
```
σ(x) = 1 / (1 + e^(-x))
```

### Visual Behavior
```
Input:  [-3, -1, 0, 1, 3]
          ↓   ↓  ↓  ↓  ↓     Sigmoid Function
Output: [0.05, 0.27, 0.5, 0.73, 0.95]
```

### ASCII Visualization
```
Sigmoid Curve:
1.0 ┤        ╭─────
    │      ╱
0.5 ┤    ╱
    │  ╱
0.0 ┤─╱─────────
     -3    0    3
```

**Why Sigmoid matters**: In binary classification, we need outputs between 0 and 1 to represent probabilities. Sigmoid gives us exactly that!
"""
# %% nbgrader={"grade": false, "grade_id": "sigmoid-impl", "solution": true}
#| export
from tinytorch.core.tensor import Tensor

class Sigmoid:
    """
    Sigmoid activation: σ(x) = 1/(1 + e^(-x))

    Maps any real number to (0, 1) range.
    Perfect for probabilities and binary classification.
    """

    def forward(self, x: Tensor) -> Tensor:
        """
        Apply sigmoid activation element-wise.

        TODO: Implement sigmoid function

        APPROACH:
        1. Apply sigmoid formula: 1 / (1 + exp(-x))
        2. Use np.exp for exponential
        3. Return result wrapped in new Tensor

        EXAMPLE:
        >>> sigmoid = Sigmoid()
        >>> x = Tensor([-2, 0, 2])
        >>> result = sigmoid(x)
        >>> print(result.data)
        [0.119, 0.5, 0.881]  # All values between 0 and 1

        HINT: Use np.exp(-x.data) for numerical stability
        """
        ### BEGIN SOLUTION
        # Apply sigmoid: 1 / (1 + exp(-x))
        # Clip extreme values to prevent overflow (sigmoid(-500) ≈ 0, sigmoid(500) ≈ 1)
        # Clipping at ±500 ensures exp() stays within float64 range
        z = np.clip(x.data, -500, 500)

        # Use numerically stable sigmoid
        # For positive values: 1 / (1 + exp(-x))
        # For negative values: exp(x) / (1 + exp(x)) = 1 / (1 + exp(-x)) after clipping
        result_data = np.zeros_like(z, dtype=np.float64)

        # Positive values (including zero)
        pos_mask = z >= 0
        result_data[pos_mask] = 1.0 / (1.0 + np.exp(-z[pos_mask]))

        # Negative values
        neg_mask = z < 0
        exp_z = np.exp(z[neg_mask])
        result_data[neg_mask] = exp_z / (1.0 + exp_z)

        return Tensor(result_data)
        ### END SOLUTION

    def __call__(self, x: Tensor) -> Tensor:
        """Allows the activation to be called like a function."""
        return self.forward(x)

    def backward(self, grad: Tensor) -> Tensor:
        """Compute gradient (implemented in Module 05)."""
        pass  # Will implement backward pass in Module 05

# %% [markdown]
"""
### 🔬 Unit Test: Sigmoid
This test validates sigmoid activation behavior.
**What we're testing**: Sigmoid maps inputs to (0, 1) range
**Why it matters**: Ensures proper probability-like outputs
**Expected**: All outputs between 0 and 1, sigmoid(0) = 0.5
"""

# %% nbgrader={"grade": true, "grade_id": "test-sigmoid", "locked": true, "points": 10}
def test_unit_sigmoid():
    """🔬 Test Sigmoid implementation."""
    print("🔬 Unit Test: Sigmoid...")

    sigmoid = Sigmoid()

    # Test basic cases
    x = Tensor([0.0])
    result = sigmoid.forward(x)
    assert np.allclose(result.data, [0.5]), f"sigmoid(0) should be 0.5, got {result.data}"

    # Test range property - all outputs should be in (0, 1)
    x = Tensor([-10, -1, 0, 1, 10])
    result = sigmoid.forward(x)
    assert np.all(result.data > 0) and np.all(result.data < 1), "All sigmoid outputs should be in (0, 1)"

    # Test extreme values
    x = Tensor([-1000, 1000])  # Extreme values
    result = sigmoid.forward(x)
    assert np.allclose(result.data[0], 0, atol=TOLERANCE), "sigmoid(-∞) should approach 0"
    assert np.allclose(result.data[1], 1, atol=TOLERANCE), "sigmoid(+∞) should approach 1"

    print("✅ Sigmoid works correctly!")

if __name__ == "__main__":
    test_unit_sigmoid()
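# %% [markdown]
"""
### 💡 Why the Solution Clips and Branches

A short NumPy sketch (illustrative, assumes nothing beyond NumPy) of the numerical-stability issue the solution guards against: the naive formula computes `exp(1000)` for very negative inputs, which overflows float64, while the clip-and-branch version only ever exponentiates non-positive numbers.

```python
import numpy as np

x = np.array([-1000.0, 0.0, 1000.0])

# Naive formula: exp(1000) overflows to inf (normally emits a RuntimeWarning)
with np.errstate(over="ignore"):
    naive = 1.0 / (1.0 + np.exp(-x))

# Stable version: exp() only ever sees arguments <= 0, so it can never overflow
z = np.clip(x, -500, 500)
e = np.exp(-np.abs(z))
stable = np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

print(naive)   # [0.  0.5 1. ] - lands on the right limits, but only via an overflowed inf
print(stable)  # [~7e-218  0.5  1.0] - same behavior, with no overflow anywhere
```
"""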
# %% [markdown]
"""
## ReLU - The Sparsity Creator

ReLU (Rectified Linear Unit) is the most popular activation function. It simply removes negative values, creating sparsity that makes neural networks more efficient.

### Mathematical Definition
```
f(x) = max(0, x)
```

### Visual Behavior
```
Input:  [-2, -1, 0, 1, 2]
          ↓   ↓  ↓  ↓  ↓     ReLU Function
Output: [ 0,  0, 0, 1, 2]
```

### ASCII Visualization
```
ReLU Function:
 2 ┤         ╱
   │        ╱
 1 ┤       ╱
   │      ╱
 0 ┤──────╱
    -2    0    2
```

**Why ReLU matters**: By zeroing negative values, ReLU creates sparsity (many zeros) which makes computation faster and helps prevent overfitting.
"""

# %% nbgrader={"grade": false, "grade_id": "relu-impl", "solution": true}
#| export
class ReLU:
    """
    ReLU activation: f(x) = max(0, x)

    Sets negative values to zero, keeps positive values unchanged.
    Most popular activation for hidden layers.
    """

    def forward(self, x: Tensor) -> Tensor:
        """
        Apply ReLU activation element-wise.

        TODO: Implement ReLU function

        APPROACH:
        1. Use np.maximum(0, x.data) for element-wise max with zero
        2. Return result wrapped in new Tensor

        EXAMPLE:
        >>> relu = ReLU()
        >>> x = Tensor([-2, -1, 0, 1, 2])
        >>> result = relu(x)
        >>> print(result.data)
        [0, 0, 0, 1, 2]  # Negative values become 0, positive unchanged

        HINT: np.maximum handles element-wise maximum automatically
        """
        ### BEGIN SOLUTION
        # Apply ReLU: max(0, x)
        result = np.maximum(0, x.data)
        return Tensor(result)
        ### END SOLUTION

    def __call__(self, x: Tensor) -> Tensor:
        """Allows the activation to be called like a function."""
        return self.forward(x)

    def backward(self, grad: Tensor) -> Tensor:
        """Compute gradient (implemented in Module 05)."""
        pass  # Will implement backward pass in Module 05

# %% [markdown]
"""
### 🔬 Unit Test: ReLU
This test validates ReLU activation behavior.
**What we're testing**: ReLU zeros negative values, preserves positive
**Why it matters**: ReLU's sparsity helps neural networks train efficiently
**Expected**: Negative → 0, positive unchanged, zero → 0
"""

# %% nbgrader={"grade": true, "grade_id": "test-relu", "locked": true, "points": 10}
def test_unit_relu():
    """🔬 Test ReLU implementation."""
    print("🔬 Unit Test: ReLU...")

    relu = ReLU()

    # Test mixed positive/negative values
    x = Tensor([-2, -1, 0, 1, 2])
    result = relu.forward(x)
    expected = [0, 0, 0, 1, 2]
    assert np.allclose(result.data, expected), f"ReLU failed, expected {expected}, got {result.data}"

    # Test all negative
    x = Tensor([-5, -3, -1])
    result = relu.forward(x)
    assert np.allclose(result.data, [0, 0, 0]), "ReLU should zero all negative values"

    # Test all positive
    x = Tensor([1, 3, 5])
    result = relu.forward(x)
    assert np.allclose(result.data, [1, 3, 5]), "ReLU should preserve all positive values"

    # Test sparsity property
    x = Tensor([-1, -2, -3, 1])
    result = relu.forward(x)
    zeros = np.sum(result.data == 0)
    assert zeros == 3, f"ReLU should create sparsity, got {zeros} zeros out of 4"

    print("✅ ReLU works correctly!")

if __name__ == "__main__":
    test_unit_relu()
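# %% [markdown]
"""
A tiny NumPy sketch (illustrative only) of the sparsity claim: on zero-mean random inputs, roughly half of ReLU's outputs are exactly zero.

```python
import numpy as np

rng = np.random.default_rng(0)
pre_activations = rng.standard_normal(10_000)   # zero-mean inputs, ~half negative

post = np.maximum(0, pre_activations)
print(f"{np.mean(post == 0):.1%} of outputs are exactly zero")   # ~50%
```
"""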
# %% [markdown]
"""
## Tanh - The Zero-Centered Alternative

Tanh (hyperbolic tangent) is like sigmoid but centered around zero, mapping inputs to (-1, 1). This zero-centering helps with gradient flow during training.

### Mathematical Definition
```
f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
```

### Visual Behavior
```
Input:  [-2, 0, 2]
          ↓  ↓  ↓    Tanh Function
Output: [-0.96, 0, 0.96]
```

### ASCII Visualization
```
Tanh Curve:
 1 ┤      ╭─────
   │    ╱
 0 ┤───╱───────
   │  ╱
-1 ┤─╱─────────
    -3    0    3
```

**Why Tanh matters**: Unlike sigmoid, tanh outputs are centered around zero, which can help gradients flow better through deep networks.
"""

# %% nbgrader={"grade": false, "grade_id": "tanh-impl", "solution": true}
#| export
class Tanh:
    """
    Tanh activation: f(x) = (e^x - e^(-x))/(e^x + e^(-x))

    Maps any real number to (-1, 1) range.
    Zero-centered alternative to sigmoid.
    """

    def forward(self, x: Tensor) -> Tensor:
        """
        Apply tanh activation element-wise.

        TODO: Implement tanh function

        APPROACH:
        1. Use np.tanh(x.data) for hyperbolic tangent
        2. Return result wrapped in new Tensor

        EXAMPLE:
        >>> tanh = Tanh()
        >>> x = Tensor([-2, 0, 2])
        >>> result = tanh(x)
        >>> print(result.data)
        [-0.964, 0.0, 0.964]  # Range (-1, 1), symmetric around 0

        HINT: NumPy provides np.tanh function
        """
        ### BEGIN SOLUTION
        # Apply tanh using NumPy
        result = np.tanh(x.data)
        return Tensor(result)
        ### END SOLUTION

    def __call__(self, x: Tensor) -> Tensor:
        """Allows the activation to be called like a function."""
        return self.forward(x)

    def backward(self, grad: Tensor) -> Tensor:
        """Compute gradient (implemented in Module 05)."""
        pass  # Will implement backward pass in Module 05

# %% [markdown]
"""
### 🔬 Unit Test: Tanh
This test validates tanh activation behavior.
**What we're testing**: Tanh maps inputs to (-1, 1) range, zero-centered
**Why it matters**: Zero-centered activations can help with gradient flow
**Expected**: All outputs in (-1, 1), tanh(0) = 0, symmetric behavior
"""

# %% nbgrader={"grade": true, "grade_id": "test-tanh", "locked": true, "points": 10}
def test_unit_tanh():
    """🔬 Test Tanh implementation."""
    print("🔬 Unit Test: Tanh...")

    tanh = Tanh()

    # Test zero
    x = Tensor([0.0])
    result = tanh.forward(x)
    assert np.allclose(result.data, [0.0]), f"tanh(0) should be 0, got {result.data}"

    # Test range property - all outputs should be in [-1, 1]
    x = Tensor([-10, -1, 0, 1, 10])
    result = tanh.forward(x)
    assert np.all(result.data >= -1) and np.all(result.data <= 1), "All tanh outputs should be in [-1, 1]"

    # Test symmetry: tanh(-x) = -tanh(x)
    x = Tensor([2.0])
    pos_result = tanh.forward(x)
    x_neg = Tensor([-2.0])
    neg_result = tanh.forward(x_neg)
    assert np.allclose(pos_result.data, -neg_result.data), "tanh should be symmetric: tanh(-x) = -tanh(x)"

    # Test extreme values
    x = Tensor([-1000, 1000])
    result = tanh.forward(x)
    assert np.allclose(result.data[0], -1, atol=TOLERANCE), "tanh(-∞) should approach -1"
    assert np.allclose(result.data[1], 1, atol=TOLERANCE), "tanh(+∞) should approach 1"

    print("✅ Tanh works correctly!")

if __name__ == "__main__":
    test_unit_tanh()
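# %% [markdown]
"""
A small NumPy sketch (illustrative only) of the "zero-centered" point: on inputs symmetric around zero, tanh outputs average to ~0 while sigmoid outputs average to ~0.5.

```python
import numpy as np

x = np.linspace(-3, 3, 101)              # inputs symmetric around zero
sigmoid_out = 1.0 / (1.0 + np.exp(-x))
tanh_out = np.tanh(x)

print(f"mean(sigmoid) = {sigmoid_out.mean():.3f}")   # ~0.5  (always positive)
print(f"mean(tanh)    = {tanh_out.mean():.3f}")      # ~0.0  (zero-centered)
```
"""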
# %% [markdown]
"""
## GELU - The Smooth Modern Choice

GELU (Gaussian Error Linear Unit) is a smooth approximation to ReLU that's become popular in modern architectures like transformers. Unlike ReLU's sharp corner, GELU is smooth everywhere.

### Mathematical Definition
```
f(x) = x * Φ(x) ≈ x * Sigmoid(1.702 * x)
```
Where Φ(x) is the cumulative distribution function of the standard normal distribution.

### Visual Behavior
```
Input:  [-1, 0, 1]
          ↓  ↓  ↓    GELU Function
Output: [-0.16, 0, 0.84]
```

### ASCII Visualization
```
GELU Function:
            ╱
 1         ╱
          ╱
         ╱
 0 ─────╱   ← smooth curve, no sharp corner
       ╱
    -2    0    2
```

**Why GELU matters**: Used in GPT, BERT, and other transformers. The smoothness helps with optimization compared to ReLU's sharp corner.
"""

# %% nbgrader={"grade": false, "grade_id": "gelu-impl", "solution": true}
#| export
class GELU:
    """
    GELU activation: f(x) = x * Φ(x) ≈ x * Sigmoid(1.702 * x)

    Smooth approximation to ReLU, used in modern transformers.
    Where Φ(x) is the cumulative distribution function of the standard normal.
    """

    def forward(self, x: Tensor) -> Tensor:
        """
        Apply GELU activation element-wise.

        TODO: Implement GELU approximation

        APPROACH:
        1. Use approximation: x * sigmoid(1.702 * x)
        2. Compute sigmoid part: 1 / (1 + exp(-1.702 * x))
        3. Multiply by x element-wise
        4. Return result wrapped in new Tensor

        EXAMPLE:
        >>> gelu = GELU()
        >>> x = Tensor([-1, 0, 1])
        >>> result = gelu(x)
        >>> print(result.data)
        [-0.154, 0.0, 0.846]  # Smooth, like ReLU but differentiable everywhere
                              # (exact GELU gives -0.159, 0.0, 0.841)

        HINT: The constant 1.702 is chosen so that sigmoid(1.702 * x) closely
        approximates Φ(x), the standard normal CDF
        """
        ### BEGIN SOLUTION
        # GELU approximation: x * sigmoid(1.702 * x)
        # First compute sigmoid part
        sigmoid_part = 1.0 / (1.0 + np.exp(-1.702 * x.data))
        # Then multiply by x
        result = x.data * sigmoid_part
        return Tensor(result)
        ### END SOLUTION

    def __call__(self, x: Tensor) -> Tensor:
        """Allows the activation to be called like a function."""
        return self.forward(x)

    def backward(self, grad: Tensor) -> Tensor:
        """Compute gradient (implemented in Module 05)."""
        pass  # Will implement backward pass in Module 05

# %% [markdown]
"""
### 🔬 Unit Test: GELU
This test validates GELU activation behavior.
**What we're testing**: GELU provides smooth ReLU-like behavior
**Why it matters**: GELU is used in modern transformers like GPT and BERT
**Expected**: Smooth curve, GELU(0) ≈ 0, positive values roughly preserved
"""

# %% nbgrader={"grade": true, "grade_id": "test-gelu", "locked": true, "points": 10}
def test_unit_gelu():
    """🔬 Test GELU implementation."""
    print("🔬 Unit Test: GELU...")

    gelu = GELU()

    # Test zero (should be approximately 0)
    x = Tensor([0.0])
    result = gelu.forward(x)
    assert np.allclose(result.data, [0.0], atol=TOLERANCE), f"GELU(0) should be ≈0, got {result.data}"

    # Test positive values (should be roughly preserved)
    x = Tensor([1.0])
    result = gelu.forward(x)
    assert result.data[0] > 0.8, f"GELU(1) should be ≈0.84, got {result.data[0]}"

    # Test negative values (should be small but not zero)
    x = Tensor([-1.0])
    result = gelu.forward(x)
    assert result.data[0] < 0 and result.data[0] > -0.2, f"GELU(-1) should be ≈-0.16, got {result.data[0]}"

    # Test smoothness property (no sharp corners like ReLU)
    x = Tensor([-0.001, 0.0, 0.001])
    result = gelu.forward(x)
    # Values should be close to each other (smooth)
    diff1 = abs(result.data[1] - result.data[0])
    diff2 = abs(result.data[2] - result.data[1])
    assert diff1 < 0.01 and diff2 < 0.01, "GELU should be smooth around zero"

    print("✅ GELU works correctly!")

if __name__ == "__main__":
    test_unit_gelu()
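# %% [markdown]
"""
### 💡 How Good Is the 1.702 Approximation?

A short sketch (illustrative only, plain NumPy plus `math.erf`) comparing the sigmoid approximation used above with the exact definition x·Φ(x). The maximum gap over [-3, 3] is only about 0.02.

```python
import math
import numpy as np

x = np.linspace(-3, 3, 61)

# Exact GELU: x * Φ(x), with Φ(x) = 0.5 * (1 + erf(x / sqrt(2)))
phi = np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x])
gelu_exact = x * phi

# Sigmoid approximation used in this module
gelu_approx = x * (1.0 / (1.0 + np.exp(-1.702 * x)))

print(f"max |exact - approx| = {np.max(np.abs(gelu_exact - gelu_approx)):.4f}")   # ~0.02
```
"""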
# %% [markdown]
"""
## Softmax - The Probability Distributor

Softmax converts any vector into a valid probability distribution. All outputs are positive and sum to exactly 1.0, making it essential for multi-class classification.

### Mathematical Definition
```
f(x_i) = e^(x_i) / Σ(e^(x_j))
```

### Visual Behavior
```
Input:  [1, 2, 3]
         ↓  ↓  ↓    Softmax Function
Output: [0.09, 0.24, 0.67]   # Sum = 1.0
```

### ASCII Visualization
```
Softmax Transform:
Raw scores:   [1,    2,    3,    4]
                  ↓ Exponential ↓
              [2.7,  7.4,  20.1, 54.6]
                  ↓ Normalize ↓
              [0.03, 0.09, 0.24, 0.64]   ← Sum = 1.0
```

**Why Softmax matters**: In multi-class classification, we need outputs that represent probabilities for each class. Softmax guarantees valid probabilities.
"""
# %% nbgrader={"grade": false, "grade_id": "softmax-impl", "solution": true}
#| export
class Softmax:
    """
    Softmax activation: f(x_i) = e^(x_i) / Σ(e^(x_j))

    Converts any vector to a probability distribution.
    Sum of all outputs equals 1.0.
    """

    def forward(self, x: Tensor, dim: int = -1) -> Tensor:
        """
        Apply softmax activation along specified dimension.

        TODO: Implement numerically stable softmax

        APPROACH:
        1. Subtract max for numerical stability: x - max(x)
        2. Compute exponentials: exp(x - max(x))
        3. Sum along dimension: sum(exp_values)
        4. Divide: exp_values / sum
        5. Return result wrapped in new Tensor

        EXAMPLE:
        >>> softmax = Softmax()
        >>> x = Tensor([1, 2, 3])
        >>> result = softmax(x)
        >>> print(result.data)
        [0.090, 0.245, 0.665]  # Sums to 1.0, larger inputs get higher probability

        HINTS:
        - Use np.max(x.data, axis=dim, keepdims=True) for max
        - Use np.sum(exp_values, axis=dim, keepdims=True) for sum
        - The max subtraction prevents overflow in exponentials
        """
        ### BEGIN SOLUTION
        # Numerical stability: subtract max to prevent overflow
        # Use Tensor operations to preserve gradient flow!
        x_max_data = np.max(x.data, axis=dim, keepdims=True)
        x_max = Tensor(x_max_data, requires_grad=False)  # max is not differentiable in this context
        x_shifted = x - x_max  # Tensor subtraction!

        # Compute exponentials (NumPy operation, but wrapped in Tensor)
        exp_values = Tensor(np.exp(x_shifted.data), requires_grad=x_shifted.requires_grad)

        # Sum along dimension (Tensor operation)
        exp_sum_data = np.sum(exp_values.data, axis=dim, keepdims=True)
        exp_sum = Tensor(exp_sum_data, requires_grad=exp_values.requires_grad)

        # Normalize to get probabilities (Tensor division!)
        result = exp_values / exp_sum
        return result
        ### END SOLUTION

    def __call__(self, x: Tensor, dim: int = -1) -> Tensor:
        """Allows the activation to be called like a function."""
        return self.forward(x, dim)

    def backward(self, grad: Tensor) -> Tensor:
        """Compute gradient (implemented in Module 05)."""
        pass  # Will implement backward pass in Module 05

# %% [markdown]
"""
### 🔬 Unit Test: Softmax
This test validates softmax activation behavior.
**What we're testing**: Softmax creates valid probability distributions
**Why it matters**: Essential for multi-class classification outputs
**Expected**: Outputs sum to 1.0, all values in (0, 1), largest input gets highest probability
"""

# %% nbgrader={"grade": true, "grade_id": "test-softmax", "locked": true, "points": 10}
def test_unit_softmax():
    """🔬 Test Softmax implementation."""
    print("🔬 Unit Test: Softmax...")

    softmax = Softmax()

    # Test basic probability properties
    x = Tensor([1, 2, 3])
    result = softmax.forward(x)

    # Should sum to 1
    assert np.allclose(np.sum(result.data), 1.0), f"Softmax should sum to 1, got {np.sum(result.data)}"

    # All values should be positive
    assert np.all(result.data > 0), "All softmax values should be positive"

    # All values should be less than 1
    assert np.all(result.data < 1), "All softmax values should be less than 1"

    # Largest input should get largest output
    max_input_idx = np.argmax(x.data)
    max_output_idx = np.argmax(result.data)
    assert max_input_idx == max_output_idx, "Largest input should get largest softmax output"

    # Test numerical stability with large numbers
    x = Tensor([1000, 1001, 1002])  # Would overflow without max subtraction
    result = softmax.forward(x)
    assert np.allclose(np.sum(result.data), 1.0), "Softmax should handle large numbers"
    assert not np.any(np.isnan(result.data)), "Softmax should not produce NaN"
    assert not np.any(np.isinf(result.data)), "Softmax should not produce infinity"

    # Test with 2D tensor (batch dimension)
    x = Tensor([[1, 2], [3, 4]])
    result = softmax.forward(x, dim=-1)  # Softmax along last dimension
    assert result.shape == (2, 2), "Softmax should preserve input shape"
    # Each row should sum to 1
    row_sums = np.sum(result.data, axis=-1)
    assert np.allclose(row_sums, [1.0, 1.0]), "Each row should sum to 1"

    print("✅ Softmax works correctly!")

if __name__ == "__main__":
    test_unit_softmax()
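# %% [markdown]
"""
### 💡 Why Subtracting the Max Is Safe

A plain-NumPy sketch (illustrative only) of the property the stable implementation relies on: softmax is shift-invariant, so subtracting the per-row maximum changes nothing mathematically while keeping every exponent ≤ 0.

```python
import numpy as np

def softmax_np(v):
    # Reference softmax mirroring the max-subtraction trick used above
    shifted = v - np.max(v)
    e = np.exp(shifted)
    return e / e.sum()

x = np.array([1.0, 2.0, 3.0])
print(np.allclose(softmax_np(x), softmax_np(x + 1000.0)))    # True: shift-invariant
print(softmax_np(np.array([1000.0, 1001.0, 1002.0])))        # finite, sums to 1 - no overflow
```
"""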
# %% [markdown]
"""
## 4. Integration - Bringing It Together

Now let's test how all our activation functions work together and understand their different behaviors.
"""
# %% [markdown]
"""
### Understanding the Output Patterns

From the demonstration above, notice how each activation serves a different purpose:

**Sigmoid**: Squashes everything to (0, 1) - good for probabilities
**ReLU**: Zeros negatives, keeps positives - creates sparsity
**Tanh**: Like sigmoid but centered at zero (-1, 1) - better gradient flow
**GELU**: Smooth ReLU-like behavior - modern choice for transformers
**Softmax**: Converts to probability distribution - sum equals 1

These different behaviors make each activation suitable for different parts of neural networks.
"""

# %% [markdown]
"""
## 🧪 Module Integration Test

Final validation that everything works together correctly.
"""

# %% nbgrader={"grade": true, "grade_id": "module-test", "locked": true, "points": 20}

def test_module():
    """🧪 Module Test: Complete Integration

    Comprehensive test of entire module functionality.

    This final test runs before the module summary to ensure:
    - All unit tests pass
    - Functions work together correctly
    - Module is ready for integration with TinyTorch
    """
    print("🧪 RUNNING MODULE INTEGRATION TEST")
    print("=" * 50)

    # Run all unit tests
    print("Running unit tests...")
    test_unit_sigmoid()
    test_unit_relu()
    test_unit_tanh()
    test_unit_gelu()
    test_unit_softmax()

    print("\nRunning integration scenarios...")

    # Test 1: All activations preserve tensor properties
    print("🔬 Integration Test: Tensor property preservation...")
    test_data = Tensor([[1, -1], [2, -2]])  # 2D tensor

    activations = [Sigmoid(), ReLU(), Tanh(), GELU()]
    for activation in activations:
        result = activation.forward(test_data)
        assert result.shape == test_data.shape, f"Shape not preserved by {activation.__class__.__name__}"
        assert isinstance(result, Tensor), f"Output not Tensor from {activation.__class__.__name__}"

    print("✅ All activations preserve tensor properties!")

    # Test 2: Softmax works with different dimensions
    print("🔬 Integration Test: Softmax dimension handling...")
    data_3d = Tensor([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])  # (2, 2, 3)
    softmax = Softmax()

    # Test different dimensions
    result_last = softmax(data_3d, dim=-1)
    assert result_last.shape == (2, 2, 3), "Softmax should preserve shape"

    # Check that last dimension sums to 1
    last_dim_sums = np.sum(result_last.data, axis=-1)
    assert np.allclose(last_dim_sums, 1.0), "Last dimension should sum to 1"

    print("✅ Softmax handles different dimensions correctly!")

    # Test 3: Activation chaining (simulating a neural network)
    print("🔬 Integration Test: Activation chaining...")

    # Simulate: Input → ReLU (hidden activation) → Softmax (output activation);
    # the Linear layers between them arrive in Module 03
    x = Tensor([[-1, 0, 1, 2]])  # Batch of 1, 4 features

    # Apply ReLU (hidden layer activation)
    relu = ReLU()
    hidden = relu.forward(x)

    # Apply Softmax (output layer activation)
    softmax = Softmax()
    output = softmax.forward(hidden)

    # Verify the chain
    assert hidden.data[0, 0] == 0, "ReLU should zero negative input"
    assert np.allclose(np.sum(output.data), 1.0), "Final output should be probability distribution"

    print("✅ Activation chaining works correctly!")

    print("\n" + "=" * 50)
    print("🎉 ALL TESTS PASSED! Module ready for export.")
    print("Run: tito module complete 02")

# Run comprehensive module test
if __name__ == "__main__":
    test_module()
# %% [markdown]
"""
## 5. Real-World Production Context

Now that you've implemented these activations, let's understand how they're used in real ML systems.

### Activation Selection Guide

**When to Use Each Activation:**

**Sigmoid**
- **Use case**: Binary classification output layers, gates in LSTMs/GRUs
- **Production example**: Spam detection (output: probability of spam)
- **Why**: Outputs valid probabilities in (0, 1)
- **Avoid**: Hidden layers in deep networks (vanishing gradients)

**ReLU**
- **Use case**: Hidden layers in CNNs, feedforward networks
- **Production example**: Image classification networks (ResNet, VGG)
- **Why**: Fast computation, prevents vanishing gradients, creates sparsity
- **Avoid**: Output layers (can't output negative values or probabilities)

**Tanh**
- **Use case**: RNN hidden states, when zero-centered outputs matter
- **Production example**: Sentiment analysis RNNs, time series prediction
- **Why**: Zero-centered outputs help with gradient flow in recurrent networks
- **Avoid**: Very deep networks (still suffers from vanishing gradients)

**GELU**
- **Use case**: Transformer models, modern architectures
- **Production example**: GPT, BERT, modern language models
- **Why**: Smooth approximation of ReLU, better gradient flow, state-of-the-art results
- **Avoid**: When computational efficiency is critical (slightly slower than ReLU)

**Softmax**
- **Use case**: Multi-class classification output layers
- **Production example**: ImageNet classification (1000 classes), NLP token prediction
- **Why**: Converts logits to a valid probability distribution (sums to 1)
- **Avoid**: Hidden layers (loses information through normalization)

### Common Production Patterns

**Pattern 1: CNN Image Classification**
```
Input → Conv+ReLU → Conv+ReLU → ... → Linear → Softmax → Class Probabilities
```

**Pattern 2: Binary Classifier**
```
Input → Linear+ReLU → Linear+ReLU → Linear → Sigmoid → Binary Probability
```

**Pattern 3: Modern Transformer**
```
Input → Attention → Linear+GELU → Linear+GELU → Output
```

### Common Pitfalls and Debugging

**Sigmoid/Tanh Pitfalls:**
- **Vanishing gradients**: Gradients near 0 for extreme inputs
- **Saturation**: Outputs plateau, learning slows
- **Debug tip**: Check activation distribution - avoid all values near 0 or 1

**ReLU Pitfalls:**
- **Dying ReLU**: Neurons output 0 forever after a large negative gradient update
- **No negative outputs**: Can't represent negative relationships
- **Debug tip**: Monitor % of dead neurons (always output 0)

**Softmax Pitfalls:**
- **Numerical overflow**: exp(x) explodes for large x (solved by max subtraction)
- **Dimension confusion**: Must apply along correct axis for batched data
- **Debug tip**: Verify outputs sum to 1.0 along correct dimension

**GELU Pitfalls:**
- **Approximation error**: Using wrong approximation constant
- **Speed**: Slightly slower than ReLU
- **Debug tip**: Compare outputs to reference implementation

### Performance Characteristics

**Computational Cost (relative to ReLU = 1.0):**
- ReLU: 1.0× (fastest - just comparison and max)
- Sigmoid: ~3×-4× (exponential computation)
- Tanh: ~3×-4× (two exponentials)
- GELU: ~4×-5× (exponential in approximation)
- Softmax: ~5×+ (exponentials + division across all elements)

**Memory Impact:**
- All activations: Minimal memory overhead (output same size as input)
- Softmax: Slightly higher (temporary buffers for exp and sum)
- For autograd (Module 05): Must cache inputs for backward pass

### Integration with TinyTorch

Your activation functions integrate seamlessly with other modules:

**Module 03 (Layers)**: Will use these activations
```python
# Coming in Module 03
class Linear:
    def __init__(self, in_features, out_features, activation=None):
        self.activation = activation  # Your ReLU, Sigmoid, etc.

    def forward(self, x):
        out = self.compute_linear(x)
        if self.activation:
            out = self.activation(out)  # Uses your forward()
        return out
```

**Module 05 (Autograd)**: Will add gradient computation
```python
# Coming in Module 05
class Sigmoid:
    def backward(self, grad):
        # ∂sigmoid/∂x = sigmoid(x) * (1 - sigmoid(x))
        return grad * self.output * (1 - self.output)
```
"""
# %% [markdown]
"""
## 🎯 MODULE SUMMARY: Activations

Congratulations! You've built the intelligence engine of neural networks!

### Key Accomplishments
- Built 5 core activation functions with distinct behaviors and use cases
- Implemented forward passes for Sigmoid, ReLU, Tanh, GELU, and Softmax
- Discovered how nonlinearity enables complex pattern learning
- All tests pass ✅ (validated by `test_module()`)

### Ready for Next Steps
Your activation implementations enable neural network layers to learn complex, nonlinear patterns instead of just linear transformations.

Export with: `tito module complete 02`

**Next**: Module 03 will combine your Tensors and Activations to build complete neural network Layers!
"""