# ---
# jupyter:
#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: percent
#       format_version: '1.3'
#     jupytext_version: 1.17.1
# ---
# %% [markdown]
"""
# Module 2: Activations - Nonlinearity in Neural Networks

Welcome to the Activations module! This is where neural networks get their power through nonlinearity.

## Learning Goals
- Understand why activation functions are essential for neural networks
- Implement the four most important activation functions: ReLU, Sigmoid, Tanh, and Softmax
- Visualize how activations transform data and enable complex learning
- See how activations work with layers to build powerful networks
- Master the NBGrader workflow with comprehensive testing

## Build → Use → Understand
1. **Build**: Activation functions that add nonlinearity
2. **Use**: Transform tensors and see immediate results
3. **Understand**: How nonlinearity enables complex pattern learning
"""
# %% nbgrader={"grade": false, "grade_id": "activations-imports", "locked": false, "schema_version": 3, "solution": false, "task": false}
#| default_exp core.activations

#| export
import math
import numpy as np
import matplotlib.pyplot as plt
import os
import sys
from typing import Union, List

# Import our Tensor class - try from package first, then from local module
try:
    from tinytorch.core.tensor import Tensor
except ImportError:
    # For development, import from local tensor module
    sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor'))
    from tensor_dev import Tensor
# %% nbgrader={"grade": false, "grade_id": "activations-setup", "locked": false, "schema_version": 3, "solution": false, "task": false}
#| hide
#| export
def _should_show_plots():
    """Check if we should show plots (disable during testing)"""
    # Check multiple conditions that indicate we're in test mode
    is_pytest = (
        'pytest' in sys.modules or
        'test' in sys.argv or
        os.environ.get('PYTEST_CURRENT_TEST') is not None or
        any('test' in arg for arg in sys.argv) or
        any('pytest' in arg for arg in sys.argv)
    )

    # Show plots in development mode (when not in test mode)
    return not is_pytest
# %% nbgrader={"grade": false, "grade_id": "activations-welcome", "locked": false, "schema_version": 3, "solution": false, "task": false}
print("🔥 TinyTorch Activations Module")
print(f"NumPy version: {np.__version__}")
print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}")
print("Ready to build activation functions!")
# %% [markdown]
"""
## 📦 Where This Code Lives in the Final Package

**Learning Side:** You work in `modules/source/02_activations/activations_dev.py`
**Building Side:** Code exports to `tinytorch.core.activations`

```python
# Final package structure:
from tinytorch.core.activations import ReLU, Sigmoid, Tanh, Softmax
from tinytorch.core.tensor import Tensor  # Foundation
from tinytorch.core.layers import Dense   # Uses activations
```

**Why this matters:**
- **Learning:** Focused modules for deep understanding
- **Production:** Proper organization like PyTorch's `torch.nn.ReLU`
- **Consistency:** All activation functions live together in `core.activations`
- **Integration:** Works seamlessly with tensors and layers
"""
# %% [markdown]
"""
## What Are Activation Functions?

### The Problem: Linear Limitations
Without activation functions, neural networks can only learn linear relationships:
```
y = W₁ · (W₂ · (W₃ · x + b₃) + b₂) + b₁
```

This simplifies to just:
```
y = W_combined · x + b_combined
```

**A single linear function!** No matter how many layers you add, you can't learn complex patterns like:
- Image recognition (nonlinear pixel relationships)
- Language understanding (nonlinear word relationships)
- Game playing (nonlinear strategy relationships)

### The Solution: Nonlinearity
Activation functions add nonlinearity between layers:
```
y = W₁ · f(W₂ · f(W₃ · x + b₃) + b₂) + b₁
```

Now each layer can learn complex transformations!

### Real-World Impact
- **Before activations**: Only linear classifiers (logistic regression)
- **After activations**: Complex pattern recognition (deep learning revolution)

### What We'll Build
1. **ReLU**: The foundation of modern deep learning
2. **Sigmoid**: Classic activation for binary classification
3. **Tanh**: Centered activation for better gradients
4. **Softmax**: Probability distributions for multi-class classification
"""
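# %% [markdown]
"""
### 🔍 Quick Check: Why Stacked Linear Layers Collapse

Here is a small, standalone NumPy sketch (illustrative only, not part of the graded
module) that verifies the claim above: composing several linear layers without an
activation is equivalent to a single linear layer. The matrix sizes and random seed
are arbitrary choices for the demo.
"""

# %%
# Illustrative sketch: three stacked linear layers with no activation
# collapse into one equivalent linear layer (W_combined, b_combined).
rng = np.random.default_rng(0)
x = rng.normal(size=(4,))                                    # arbitrary input vector
W3, b3 = rng.normal(size=(5, 4)), rng.normal(size=(5,))
W2, b2 = rng.normal(size=(6, 5)), rng.normal(size=(6,))
W1, b1 = rng.normal(size=(3, 6)), rng.normal(size=(3,))

# Layer-by-layer computation: y = W1 @ (W2 @ (W3 @ x + b3) + b2) + b1
y_stacked = W1 @ (W2 @ (W3 @ x + b3) + b2) + b1

# The same mapping expressed as ONE linear layer
W_combined = W1 @ W2 @ W3
b_combined = W1 @ (W2 @ b3 + b2) + b1
y_single = W_combined @ x + b_combined

print("Stacked layers equal one linear layer:", np.allclose(y_stacked, y_single))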
# %% [markdown]
"""
## Step 1: ReLU - The Foundation of Deep Learning

### What is ReLU?
**ReLU (Rectified Linear Unit)** is the most important activation function in deep learning:

```
f(x) = max(0, x)
```

- **Positive inputs**: Pass through unchanged
- **Negative inputs**: Become zero
- **Zero**: Stays zero

### Why ReLU Revolutionized Deep Learning
1. **Computational efficiency**: Just a max operation
2. **No vanishing gradients**: Derivative is 1 for positive values
3. **Sparsity**: Many neurons output exactly 0
4. **Empirical success**: Works well in practice

### Visual Understanding
```
Input: [-2, -1, 0, 1, 2]
ReLU:  [ 0,  0, 0, 1, 2]
```

### Real-World Applications
- **Image classification**: ResNet, VGG, AlexNet
- **Object detection**: YOLO, R-CNN
- **Language models**: Transformer feedforward layers
- **Recommendation**: Deep collaborative filtering

### Mathematical Properties
- **Derivative**: f'(x) = 1 if x > 0, else 0
- **Range**: [0, ∞)
- **Sparsity**: Outputs exactly 0 for negative inputs
"""
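# %% [markdown]
"""
### 🔍 Quick Check: ReLU's Derivative and Sparsity

A small standalone NumPy sketch (illustrative only) of the two properties listed
above: the derivative is 1 for positive inputs and 0 otherwise, and the output is
sparse because every negative input maps to exactly 0. The class below wraps the
same idea for Tensor objects.
"""

# %%
# Illustrative sketch of ReLU and its derivative in plain NumPy.
z = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
relu_out = np.maximum(0, z)                  # f(x) = max(0, x)
relu_grad = (z > 0).astype(float)            # f'(x) = 1 if x > 0 else 0

print("ReLU output:  ", relu_out)            # [0. 0. 0. 1. 2.]
print("ReLU gradient:", relu_grad)           # [0. 0. 0. 1. 1.]
print("Fraction of zeros (sparsity):", np.mean(relu_out == 0))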
# %% nbgrader={"grade": false, "grade_id": "relu-class", "locked": false, "schema_version": 3, "solution": true, "task": false}
#| export
class ReLU:
    """
    ReLU Activation Function: f(x) = max(0, x)

    The most popular activation function in deep learning.
    Simple, fast, and effective for most applications.
    """

    def forward(self, x: Tensor) -> Tensor:
        """
        Apply ReLU activation: f(x) = max(0, x)

        TODO: Implement ReLU activation function.

        STEP-BY-STEP IMPLEMENTATION:
        1. For each element in the input tensor, apply max(0, element)
        2. Use NumPy's maximum function for an efficient element-wise operation
        3. Return a new Tensor with the results
        4. Preserve the input tensor's shape

        EXAMPLE USAGE:
        ```python
        relu = ReLU()
        input_tensor = Tensor([[-2, -1, 0, 1, 2]])
        output = relu(input_tensor)
        print(output.data)  # [[0, 0, 0, 1, 2]]
        ```

        IMPLEMENTATION HINTS:
        - Use np.maximum(0, x.data) for element-wise max with 0
        - Remember to return a new Tensor object: return Tensor(result)
        - The shape should remain the same as the input
        - Don't modify the input tensor (immutable operations)

        LEARNING CONNECTIONS:
        - This is like torch.nn.ReLU() in PyTorch
        - Used in virtually every modern neural network
        - Enables deep networks by preventing vanishing gradients
        - Creates sparse representations (many zeros)
        """
        ### BEGIN SOLUTION
        result = np.maximum(0, x.data)
        return Tensor(result)
        ### END SOLUTION

    def __call__(self, x: Tensor) -> Tensor:
        """Make the class callable: relu(x) instead of relu.forward(x)"""
        return self.forward(x)
# %% [markdown]
"""
### 🧪 Test Your ReLU Implementation

Once you implement the ReLU forward method above, run this cell to test it:
"""
# %% nbgrader={"grade": true, "grade_id": "test-relu-immediate", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false}
def test_relu_activation():
    """Test ReLU activation function"""
    print("Testing ReLU activation...")

    # Create ReLU instance
    relu = ReLU()

    # Test with mixed positive/negative values
    test_input = Tensor([[-2, -1, 0, 1, 2]])
    result = relu(test_input)
    expected = np.array([[0, 0, 0, 1, 2]])

    assert np.array_equal(result.data, expected), f"ReLU failed: expected {expected}, got {result.data}"

    # Test that negative values become zero
    assert np.all(result.data >= 0), "ReLU should make all negative values zero"

    # Test that positive values remain unchanged
    positive_input = Tensor([[1, 2, 3, 4, 5]])
    positive_result = relu(positive_input)
    assert np.array_equal(positive_result.data, positive_input.data), "ReLU should preserve positive values"

    # Test with 2D tensor
    matrix_input = Tensor([[-1, 2], [3, -4]])
    matrix_result = relu(matrix_input)
    matrix_expected = np.array([[0, 2], [3, 0]])
    assert np.array_equal(matrix_result.data, matrix_expected), "ReLU should work with 2D tensors"

    # Test shape preservation
    assert matrix_result.shape == matrix_input.shape, "ReLU should preserve input shape"

    print("✅ ReLU activation tests passed!")
    print("✅ Negative values correctly zeroed")
    print("✅ Positive values preserved")
    print("✅ Shape preservation working")
    print("✅ Works with multi-dimensional tensors")

# Run the test
test_relu_activation()
# %% [markdown]
"""
## Step 2: Sigmoid - Classic Binary Classification

### What is Sigmoid?
**Sigmoid** is the classic activation function that maps any real number to (0, 1):

```
f(x) = 1 / (1 + e^(-x))
```

### Why Sigmoid Matters
1. **Probability interpretation**: Outputs between 0 and 1
2. **Smooth gradients**: Differentiable everywhere
3. **Historical importance**: Enabled early neural networks
4. **Binary classification**: Perfect for yes/no decisions

### Visual Understanding
```
Input:   [-∞,   -2,   -1,   0,    1,    2,  ∞]
Sigmoid: [ 0, 0.12, 0.27, 0.5, 0.73, 0.88,  1]
```

### Real-World Applications
- **Binary classification**: Spam detection, medical diagnosis
- **Gating mechanisms**: LSTM and GRU cells
- **Output layers**: When you need probabilities
- **Attention mechanisms**: Where to focus attention

### Mathematical Properties
- **Range**: (0, 1)
- **Derivative**: f'(x) = f(x) · (1 - f(x))
- **Centered**: f(0) = 0.5
- **Symmetric**: f(-x) = 1 - f(x)
"""
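# %% [markdown]
"""
### 🔍 Quick Check: Sigmoid Stability and Its Derivative

A small standalone NumPy sketch (illustrative only) showing two points mentioned
above: clipping the argument of `exp` avoids overflow for extreme inputs, and the
derivative can be computed from the output itself, f'(x) = f(x)·(1 - f(x)). The
helper name `_sigmoid_demo` is just for this demo.
"""

# %%
# Illustrative sketch of a numerically safe sigmoid in plain NumPy.
def _sigmoid_demo(z):
    # Clip so np.exp never sees a huge argument (prevents overflow warnings)
    return 1 / (1 + np.exp(np.clip(-z, -500, 500)))

z = np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0])
s = _sigmoid_demo(z)
print("sigmoid:", s)                           # ~[0, 0.119, 0.5, 0.881, 1]
print("derivative s*(1-s):", s * (1 - s))      # peaks at 0.25 when s = 0.5
print("symmetry check:", np.allclose(_sigmoid_demo(-z), 1 - s))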
# %% nbgrader={"grade": false, "grade_id": "sigmoid-class", "locked": false, "schema_version": 3, "solution": true, "task": false}
#| export
class Sigmoid:
    """
    Sigmoid Activation Function: f(x) = 1 / (1 + e^(-x))

    Maps any real number to the range (0, 1).
    Useful for binary classification and probability outputs.
    """

    def forward(self, x: Tensor) -> Tensor:
        """
        Apply Sigmoid activation: f(x) = 1 / (1 + e^(-x))

        TODO: Implement Sigmoid activation function.

        STEP-BY-STEP IMPLEMENTATION:
        1. Compute the negative of the input: -x.data
        2. Compute the exponential: np.exp(-x.data)
        3. Add 1 to the exponential: 1 + np.exp(-x.data)
        4. Take the reciprocal: 1 / (1 + np.exp(-x.data))
        5. Return as a new Tensor

        EXAMPLE USAGE:
        ```python
        sigmoid = Sigmoid()
        input_tensor = Tensor([[-2, -1, 0, 1, 2]])
        output = sigmoid(input_tensor)
        print(output.data)  # [[0.119, 0.269, 0.5, 0.731, 0.881]]
        ```

        IMPLEMENTATION HINTS:
        - Use np.exp() for the exponential function
        - Formula: 1 / (1 + np.exp(-x.data))
        - Handle potential overflow with np.clip(-x.data, -500, 500)
        - Return Tensor(result)

        LEARNING CONNECTIONS:
        - This is like torch.nn.Sigmoid() in PyTorch
        - Used in binary classification output layers
        - Key component in LSTM and GRU gating mechanisms
        - Historically important for early neural networks
        """
        ### BEGIN SOLUTION
        # Clip to prevent overflow
        clipped_input = np.clip(-x.data, -500, 500)
        result = 1 / (1 + np.exp(clipped_input))
        return Tensor(result)
        ### END SOLUTION

    def __call__(self, x: Tensor) -> Tensor:
        """Make the class callable: sigmoid(x) instead of sigmoid.forward(x)"""
        return self.forward(x)
# %% [markdown]
"""
### 🧪 Test Your Sigmoid Implementation

Once you implement the Sigmoid forward method above, run this cell to test it:
"""
# %% nbgrader={"grade": true, "grade_id": "test-sigmoid-immediate", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false}
def test_sigmoid_activation():
    """Test Sigmoid activation function"""
    print("Testing Sigmoid activation...")

    # Create Sigmoid instance
    sigmoid = Sigmoid()

    # Test with known values
    test_input = Tensor([[0]])
    result = sigmoid(test_input)
    expected = 0.5

    assert abs(result.data[0][0] - expected) < 1e-6, f"Sigmoid(0) should be 0.5, got {result.data[0][0]}"

    # Test with positive and negative values
    test_input = Tensor([[-2, -1, 0, 1, 2]])
    result = sigmoid(test_input)

    # Check that all values are between 0 and 1
    assert np.all(result.data > 0), "Sigmoid output should be > 0"
    assert np.all(result.data < 1), "Sigmoid output should be < 1"

    # Test symmetry: sigmoid(-x) = 1 - sigmoid(x)
    x_val = 1.0
    pos_result = sigmoid(Tensor([[x_val]]))
    neg_result = sigmoid(Tensor([[-x_val]]))
    symmetry_check = abs(pos_result.data[0][0] + neg_result.data[0][0] - 1.0)
    assert symmetry_check < 1e-6, "Sigmoid should be symmetric around 0.5"

    # Test with 2D tensor
    matrix_input = Tensor([[-1, 1], [0, 2]])
    matrix_result = sigmoid(matrix_input)
    assert matrix_result.shape == matrix_input.shape, "Sigmoid should preserve shape"

    # Test extreme values (should not overflow)
    extreme_input = Tensor([[-100, 100]])
    extreme_result = sigmoid(extreme_input)
    assert not np.any(np.isnan(extreme_result.data)), "Sigmoid should handle extreme values"
    assert not np.any(np.isinf(extreme_result.data)), "Sigmoid should not produce inf values"

    print("✅ Sigmoid activation tests passed!")
    print("✅ Outputs correctly bounded between 0 and 1")
    print("✅ Symmetric property verified")
    print("✅ Handles extreme values without overflow")
    print("✅ Shape preservation working")

# Run the test
test_sigmoid_activation()
# %% [markdown]
"""
## Step 3: Tanh - Centered Activation

### What is Tanh?
**Tanh (Hyperbolic Tangent)** is similar to sigmoid but centered around zero:

```
f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
```

### Why Tanh is Better Than Sigmoid
1. **Zero-centered**: Outputs range from -1 to 1
2. **Better gradients**: Helps with gradient flow in deep networks
3. **Faster convergence**: Less bias shift during training
4. **Stronger gradients**: Maximum gradient is 1 vs 0.25 for sigmoid

### Visual Understanding
```
Input: [-∞,    -2,    -1,  0,    1,    2,  ∞]
Tanh:  [-1, -0.96, -0.76,  0, 0.76, 0.96,  1]
```

### Real-World Applications
- **Hidden layers**: Better than sigmoid for internal activations
- **RNN cells**: Classic RNN and LSTM use tanh
- **Normalization**: When you need zero-centered outputs
- **Feature scaling**: Maps inputs to the [-1, 1] range

### Mathematical Properties
- **Range**: (-1, 1)
- **Derivative**: f'(x) = 1 - f(x)²
- **Zero-centered**: f(0) = 0
- **Antisymmetric**: f(-x) = -f(x)
"""
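# %% [markdown]
"""
### 🔍 Quick Check: Tanh vs. Sigmoid Gradients

A small standalone NumPy sketch (illustrative only) backing up the "stronger
gradients" point above: the maximum slope of tanh is 1 (at x = 0), while the
maximum slope of sigmoid is 0.25.
"""

# %%
# Illustrative comparison of maximum derivative magnitudes.
z = np.linspace(-5, 5, 1001)

tanh_grad = 1 - np.tanh(z) ** 2                # d/dx tanh(x) = 1 - tanh(x)^2
sig = 1 / (1 + np.exp(-z))
sigmoid_grad = sig * (1 - sig)                 # d/dx sigmoid(x) = s(x)(1 - s(x))

print("max tanh gradient:   ", tanh_grad.max())      # 1.0 at x = 0
print("max sigmoid gradient:", sigmoid_grad.max())    # 0.25 at x = 0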
# %% nbgrader={"grade": false, "grade_id": "tanh-class", "locked": false, "schema_version": 3, "solution": true, "task": false}
#| export
class Tanh:
    """
    Tanh Activation Function: f(x) = (e^x - e^(-x)) / (e^x + e^(-x))

    Zero-centered activation function with range (-1, 1).
    Better gradient properties than sigmoid.
    """

    def forward(self, x: Tensor) -> Tensor:
        """
        Apply Tanh activation: f(x) = (e^x - e^(-x)) / (e^x + e^(-x))

        TODO: Implement Tanh activation function.

        STEP-BY-STEP IMPLEMENTATION:
        1. Use NumPy's built-in tanh function: np.tanh(x.data)
        2. Alternatively, implement manually:
           - Compute e^x and e^(-x)
           - Calculate (e^x - e^(-x)) / (e^x + e^(-x))
        3. Return as a new Tensor

        EXAMPLE USAGE:
        ```python
        tanh = Tanh()
        input_tensor = Tensor([[-2, -1, 0, 1, 2]])
        output = tanh(input_tensor)
        print(output.data)  # [[-0.964, -0.762, 0, 0.762, 0.964]]
        ```

        IMPLEMENTATION HINTS:
        - Use np.tanh(x.data) for simplicity
        - Manual implementation: (np.exp(x.data) - np.exp(-x.data)) / (np.exp(x.data) + np.exp(-x.data))
        - If you implement it manually, clip inputs to avoid overflow: np.clip(x.data, -500, 500)
        - Return Tensor(result)

        LEARNING CONNECTIONS:
        - This is like torch.nn.Tanh() in PyTorch
        - Used in RNN, LSTM, and GRU cells
        - Better than sigmoid for hidden layers
        - Zero-centered outputs help with gradient flow
        """
        ### BEGIN SOLUTION
        # Use NumPy's built-in tanh function
        result = np.tanh(x.data)
        return Tensor(result)
        ### END SOLUTION

    def __call__(self, x: Tensor) -> Tensor:
        """Make the class callable: tanh(x) instead of tanh.forward(x)"""
        return self.forward(x)
# %% [markdown]
"""
### 🧪 Test Your Tanh Implementation

Once you implement the Tanh forward method above, run this cell to test it:
"""
# %% nbgrader={"grade": true, "grade_id": "test-tanh-immediate", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false}
def test_tanh_activation():
    """Test Tanh activation function"""
    print("Testing Tanh activation...")

    # Create Tanh instance
    tanh = Tanh()

    # Test with zero (should be 0)
    test_input = Tensor([[0]])
    result = tanh(test_input)
    expected = 0.0

    assert abs(result.data[0][0] - expected) < 1e-6, f"Tanh(0) should be 0, got {result.data[0][0]}"

    # Test with positive and negative values
    test_input = Tensor([[-2, -1, 0, 1, 2]])
    result = tanh(test_input)

    # Check that all values are between -1 and 1
    assert np.all(result.data > -1), "Tanh output should be > -1"
    assert np.all(result.data < 1), "Tanh output should be < 1"

    # Test antisymmetry: tanh(-x) = -tanh(x)
    x_val = 1.5
    pos_result = tanh(Tensor([[x_val]]))
    neg_result = tanh(Tensor([[-x_val]]))
    antisymmetry_check = abs(pos_result.data[0][0] + neg_result.data[0][0])
    assert antisymmetry_check < 1e-6, "Tanh should be antisymmetric"

    # Test with 2D tensor
    matrix_input = Tensor([[-1, 1], [0, 2]])
    matrix_result = tanh(matrix_input)
    assert matrix_result.shape == matrix_input.shape, "Tanh should preserve shape"

    # Test extreme values (should not overflow)
    extreme_input = Tensor([[-100, 100]])
    extreme_result = tanh(extreme_input)
    assert not np.any(np.isnan(extreme_result.data)), "Tanh should handle extreme values"
    assert not np.any(np.isinf(extreme_result.data)), "Tanh should not produce inf values"

    # Test that extreme values approach ±1
    assert abs(extreme_result.data[0][0] - (-1)) < 1e-6, "Tanh(-∞) should approach -1"
    assert abs(extreme_result.data[0][1] - 1) < 1e-6, "Tanh(∞) should approach 1"

    print("✅ Tanh activation tests passed!")
    print("✅ Outputs correctly bounded between -1 and 1")
    print("✅ Antisymmetric property verified")
    print("✅ Zero-centered (tanh(0) = 0)")
    print("✅ Handles extreme values correctly")

# Run the test
test_tanh_activation()
# %% [markdown]
"""
## Step 4: Softmax - Probability Distributions

### What is Softmax?
**Softmax** converts a vector of real numbers into a probability distribution:

```
f(x_i) = e^(x_i) / Σ(e^(x_j))
```

### Why Softmax is Essential
1. **Probability distribution**: Outputs sum to 1
2. **Multi-class classification**: Choose one class from many
3. **Interpretable**: Each output is a probability
4. **Differentiable**: Enables gradient-based learning

### Visual Understanding
```
Input:   [1, 2, 3]
Softmax: [0.09, 0.24, 0.67]  # Sums to 1.0
```

### Real-World Applications
- **Classification**: Image classification, text classification
- **Language models**: Next word prediction
- **Attention mechanisms**: Where to focus attention
- **Reinforcement learning**: Action selection probabilities

### Mathematical Properties
- **Range**: (0, 1) for each output
- **Constraint**: Σ(f(x_i)) = 1
- **Argmax preservation**: Doesn't change relative ordering
- **Temperature scaling**: Can be made sharper or softer
"""
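# %% [markdown]
"""
### 🔍 Quick Check: Stable Softmax and Temperature

A small standalone NumPy sketch (illustrative only) of two points from above:
subtracting the per-row maximum before exponentiating keeps softmax stable for
large logits, and dividing the logits by a temperature makes the distribution
sharper (T < 1) or softer (T > 1). The helper name `_softmax_demo` is just for
this demo.
"""

# %%
# Illustrative stable softmax with an optional temperature parameter.
def _softmax_demo(logits, temperature=1.0):
    scaled = logits / temperature
    shifted = scaled - np.max(scaled, axis=-1, keepdims=True)   # numerical stability
    exp_values = np.exp(shifted)
    return exp_values / np.sum(exp_values, axis=-1, keepdims=True)

logits = np.array([[1000.0, 1001.0, 1002.0]])    # huge values: naive exp would overflow
print("stable softmax: ", _softmax_demo(logits))             # ≈ [[0.09, 0.24, 0.67]]
print("sharper (T=0.5):", _softmax_demo(logits, 0.5))
print("softer  (T=5.0):", _softmax_demo(logits, 5.0))
print("rows sum to 1:", np.allclose(_softmax_demo(logits).sum(axis=-1), 1.0))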
# %% nbgrader={"grade": false, "grade_id": "softmax-class", "locked": false, "schema_version": 3, "solution": true, "task": false}
#| export
class Softmax:
    """
    Softmax Activation Function: f(x_i) = e^(x_i) / Σ(e^(x_j))

    Converts a vector of real numbers into a probability distribution.
    Essential for multi-class classification.
    """

    def forward(self, x: Tensor) -> Tensor:
        """
        Apply Softmax activation: f(x_i) = e^(x_i) / Σ(e^(x_j))

        TODO: Implement Softmax activation function.

        STEP-BY-STEP IMPLEMENTATION:
        1. Subtract the max value for numerical stability: x - max(x)
        2. Compute exponentials: np.exp(x - max(x))
        3. Compute the sum of exponentials: np.sum(exp_values)
        4. Divide each exponential by the sum: exp_values / sum
        5. Return as a new Tensor

        EXAMPLE USAGE:
        ```python
        softmax = Softmax()
        input_tensor = Tensor([[1, 2, 3]])
        output = softmax(input_tensor)
        print(output.data)         # [[0.09, 0.24, 0.67]]
        print(np.sum(output.data)) # 1.0
        ```

        IMPLEMENTATION HINTS:
        - Subtract max for numerical stability: x_shifted = x.data - np.max(x.data, axis=-1, keepdims=True)
        - Compute exponentials: exp_values = np.exp(x_shifted)
        - Sum along the last axis: sum_exp = np.sum(exp_values, axis=-1, keepdims=True)
        - Divide: result = exp_values / sum_exp
        - Return Tensor(result)

        LEARNING CONNECTIONS:
        - This is like torch.nn.Softmax() in PyTorch
        - Used in classification output layers
        - Key component in attention mechanisms
        - Enables probability-based decision making
        """
        ### BEGIN SOLUTION
        # Subtract max for numerical stability
        x_shifted = x.data - np.max(x.data, axis=-1, keepdims=True)

        # Compute exponentials
        exp_values = np.exp(x_shifted)

        # Sum along last axis
        sum_exp = np.sum(exp_values, axis=-1, keepdims=True)

        # Divide to get probabilities
        result = exp_values / sum_exp

        return Tensor(result)
        ### END SOLUTION

    def __call__(self, x: Tensor) -> Tensor:
        """Make the class callable: softmax(x) instead of softmax.forward(x)"""
        return self.forward(x)
# %% [markdown]
"""
### 🧪 Test Your Softmax Implementation

Once you implement the Softmax forward method above, run this cell to test it:
"""
# %% nbgrader={"grade": true, "grade_id": "test-softmax-immediate", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false}
def test_softmax_activation():
    """Test Softmax activation function"""
    print("Testing Softmax activation...")

    # Create Softmax instance
    softmax = Softmax()

    # Test with simple input
    test_input = Tensor([[1, 2, 3]])
    result = softmax(test_input)

    # Check that outputs sum to 1
    output_sum = np.sum(result.data)
    assert abs(output_sum - 1.0) < 1e-6, f"Softmax outputs should sum to 1, got {output_sum}"

    # Check that all outputs are positive
    assert np.all(result.data > 0), "Softmax outputs should be positive"
    assert np.all(result.data < 1), "Softmax outputs should be less than 1"

    # Test with uniform input (should give equal probabilities)
    uniform_input = Tensor([[1, 1, 1]])
    uniform_result = softmax(uniform_input)
    expected_prob = 1.0 / 3.0

    for prob in uniform_result.data[0]:
        assert abs(prob - expected_prob) < 1e-6, "Uniform input should give equal probabilities"

    # Test with batch input (multiple samples)
    batch_input = Tensor([[1, 2, 3], [4, 5, 6]])
    batch_result = softmax(batch_input)

    # Check that each row sums to 1
    for i in range(batch_input.shape[0]):
        row_sum = np.sum(batch_result.data[i])
        assert abs(row_sum - 1.0) < 1e-6, f"Each row should sum to 1, row {i} sums to {row_sum}"

    # Test numerical stability with large values
    large_input = Tensor([[1000, 1001, 1002]])
    large_result = softmax(large_input)

    assert not np.any(np.isnan(large_result.data)), "Softmax should handle large values"
    assert not np.any(np.isinf(large_result.data)), "Softmax should not produce inf values"

    large_sum = np.sum(large_result.data)
    assert abs(large_sum - 1.0) < 1e-6, "Large values should still sum to 1"

    # Test shape preservation
    assert batch_result.shape == batch_input.shape, "Softmax should preserve shape"

    print("✅ Softmax activation tests passed!")
    print("✅ Outputs sum to 1 (probability distribution)")
    print("✅ All outputs are positive")
    print("✅ Handles uniform inputs correctly")
    print("✅ Works with batch inputs")
    print("✅ Numerically stable with large values")

# Run the test
test_softmax_activation()
# %% [markdown]
"""
## 🎯 Integration Test: All Activations Working Together

### Real-World Scenario
Let's test how all activation functions work together in a realistic neural network scenario:

- **Input processing**: Raw data transformation
- **Hidden layers**: ReLU for internal processing
- **Output layer**: Softmax for classification
- **Comparison**: See how different activations transform the same data
"""
# %% nbgrader={"grade": true, "grade_id": "test-activations-integration", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false}
def test_activations_integration():
    """Test all activation functions working together"""
    print("Testing activation functions integration...")

    # Create instances of all activation functions
    relu = ReLU()
    sigmoid = Sigmoid()
    tanh = Tanh()
    softmax = Softmax()

    # Test data: simulating neural network layer outputs
    test_data = Tensor([[-2, -1, 0, 1, 2]])

    # Apply each activation function
    relu_result = relu(test_data)
    sigmoid_result = sigmoid(test_data)
    tanh_result = tanh(test_data)
    softmax_result = softmax(test_data)

    # Test that all functions preserve input shape
    assert relu_result.shape == test_data.shape, "ReLU should preserve shape"
    assert sigmoid_result.shape == test_data.shape, "Sigmoid should preserve shape"
    assert tanh_result.shape == test_data.shape, "Tanh should preserve shape"
    assert softmax_result.shape == test_data.shape, "Softmax should preserve shape"

    # Test that all functions return Tensor objects
    assert isinstance(relu_result, Tensor), "ReLU should return Tensor"
    assert isinstance(sigmoid_result, Tensor), "Sigmoid should return Tensor"
    assert isinstance(tanh_result, Tensor), "Tanh should return Tensor"
    assert isinstance(softmax_result, Tensor), "Softmax should return Tensor"

    # Test ReLU properties
    assert np.all(relu_result.data >= 0), "ReLU output should be non-negative"

    # Test Sigmoid properties
    assert np.all(sigmoid_result.data > 0), "Sigmoid output should be positive"
    assert np.all(sigmoid_result.data < 1), "Sigmoid output should be less than 1"

    # Test Tanh properties
    assert np.all(tanh_result.data > -1), "Tanh output should be > -1"
    assert np.all(tanh_result.data < 1), "Tanh output should be < 1"

    # Test Softmax properties
    softmax_sum = np.sum(softmax_result.data)
    assert abs(softmax_sum - 1.0) < 1e-6, "Softmax outputs should sum to 1"

    # Test chaining activations (realistic neural network scenario)
    # Hidden layer with ReLU
    hidden_output = relu(test_data)

    # Add some weights simulation (element-wise multiplication)
    weights = Tensor([[0.5, 0.3, 0.8, 0.2, 0.7]])
    weighted_output = hidden_output * weights

    # Final layer with Softmax
    final_output = softmax(weighted_output)

    # Test that chained operations work
    assert isinstance(final_output, Tensor), "Chained operations should return Tensor"
    assert abs(np.sum(final_output.data) - 1.0) < 1e-6, "Final output should be valid probability"

    # Test with batch data (multiple samples)
    batch_data = Tensor([
        [-2, -1, 0, 1, 2],
        [1, 2, 3, 4, 5],
        [-1, 0, 1, 2, 3]
    ])

    batch_softmax = softmax(batch_data)

    # Each row should sum to 1
    for i in range(batch_data.shape[0]):
        row_sum = np.sum(batch_softmax.data[i])
        assert abs(row_sum - 1.0) < 1e-6, f"Batch row {i} should sum to 1"

    print("✅ Activation functions integration tests passed!")
    print("✅ All functions work together seamlessly")
    print("✅ Shape preservation across all activations")
    print("✅ Chained operations work correctly")
    print("✅ Batch processing works for all activations")
    print("✅ Ready for neural network integration!")

# Run the integration test
test_activations_integration()
# %% [markdown]
"""
## 🎯 Module Summary: Activation Functions Mastery!

Congratulations! You've successfully implemented all four essential activation functions:

### ✅ What You've Built
- **ReLU**: The foundation of modern deep learning with sparsity and efficiency
- **Sigmoid**: Classic activation for binary classification and probability outputs
- **Tanh**: Zero-centered activation with better gradient properties
- **Softmax**: Probability distribution for multi-class classification

### ✅ Key Learning Outcomes
- **Understanding**: Why nonlinearity is essential for neural networks
- **Implementation**: Built activation functions from scratch using NumPy
- **Testing**: Progressive validation with immediate feedback after each function
- **Integration**: Saw how activations work together in neural networks
- **Real-world context**: Understanding where each activation is used

### ✅ Mathematical Mastery
- **ReLU**: f(x) = max(0, x) - Simple but powerful
- **Sigmoid**: f(x) = 1/(1 + e^(-x)) - Maps to (0, 1)
- **Tanh**: f(x) = tanh(x) - Zero-centered, maps to (-1, 1)
- **Softmax**: f(x_i) = e^(x_i)/Σ(e^(x_j)) - Probability distribution

### ✅ Professional Skills Developed
- **Numerical stability**: Handling overflow and underflow
- **API design**: Consistent interfaces across all functions
- **Testing discipline**: Immediate validation after each implementation
- **Integration thinking**: Understanding how components work together

### ✅ Ready for Next Steps
Your activation functions are now ready to power:
- **Dense layers**: Linear transformations with nonlinear activations
- **Convolutional layers**: Spatial feature extraction with ReLU
- **Network architectures**: Complete neural networks with proper activations
- **Training**: Gradient computation through activation functions (see the derivative sketch below)

### 🔗 Connection to Real ML Systems
Your implementations mirror production systems:
- **PyTorch**: `torch.nn.ReLU()`, `torch.nn.Sigmoid()`, `torch.nn.Tanh()`, `torch.nn.Softmax()`
- **TensorFlow**: `tf.nn.relu()`, `tf.nn.sigmoid()`, `tf.nn.tanh()`, `tf.nn.softmax()`
- **Industry applications**: Every major deep learning model uses these functions

### 🎯 The Power of Nonlinearity
You've unlocked the key to deep learning:
- **Before**: Linear models limited to simple patterns
- **After**: Nonlinear models can learn any pattern (universal approximation)

**Next Module**: Layers - Building blocks that combine your tensors and activations into powerful transformations!

Your activation functions are the key to neural network intelligence. Now let's build the layers that use them!
"""