mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-04-28 00:33:04 -05:00
- Fixed test functions to only run when modules executed directly - Added proper __name__ == '__main__' guards to all test calls - Fixed syntax errors from incorrect replacements in Module 13 and 15 - Modules now import properly without executing tests - ProductionBenchmarkingProfiler (Module 14) and ProductionMLSystemProfiler (Module 16) fully working - Other profiler classes present but require full numpy environment to test completely
1321 lines
50 KiB
Python
1321 lines
50 KiB
Python
# ---
|
||
# jupyter:
|
||
# jupytext:
|
||
# text_representation:
|
||
# extension: .py
|
||
# format_name: percent
|
||
# format_version: '1.3'
|
||
# jupytext_version: 1.17.1
|
||
# ---
|
||
|
||
# %% [markdown]
|
||
"""
|
||
# Activations - Nonlinearity in Neural Networks
|
||
|
||
Welcome to the Activations module! This is where neural networks get their power through nonlinearity.
|
||
|
||
## Learning Goals
|
||
- Understand why activation functions are essential for neural networks
|
||
- Implement the four most important activation functions: ReLU, Sigmoid, Tanh, and Softmax
|
||
- Visualize how activations transform data and enable complex learning
|
||
- See how activations work with layers to build powerful networks
|
||
- Master the NBGrader workflow with comprehensive testing
|
||
|
||
## Build → Use → Understand
|
||
1. **Build**: Activation functions that add nonlinearity
|
||
2. **Use**: Transform tensors and see immediate results
|
||
3. **Understand**: How nonlinearity enables complex pattern learning
|
||
"""
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "activations-imports", "locked": false, "schema_version": 3, "solution": false, "task": false}
|
||
#| default_exp core.activations
|
||
|
||
#| export
|
||
import math
|
||
import numpy as np
|
||
import matplotlib.pyplot as plt
|
||
import os
|
||
import sys
|
||
from typing import Union, List
|
||
|
||
# Import our Tensor class - try from package first, then from local module
|
||
try:
|
||
from tinytorch.core.tensor import Tensor
|
||
except ImportError:
|
||
# For development, import from local tensor module
|
||
sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor'))
|
||
from tensor_dev import Tensor
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "activations-welcome", "locked": false, "schema_version": 3, "solution": false, "task": false}
|
||
print("🔥 TinyTorch Activations Module")
|
||
print(f"NumPy version: {np.__version__}")
|
||
print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}")
|
||
print("Ready to build activation functions!")
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## 📦 Where This Code Lives in the Final Package
|
||
|
||
**Learning Side:** You work in `modules/source/02_activations/activations_dev.py`
|
||
**Building Side:** Code exports to `tinytorch.core.activations`
|
||
|
||
```python
|
||
# Final package structure:
|
||
from tinytorch.core.activations import ReLU, Sigmoid, Tanh, Softmax
|
||
from tinytorch.core.tensor import Tensor # Foundation
|
||
from tinytorch.core.layers import Dense # Uses activations
|
||
```
|
||
|
||
**Why this matters:**
|
||
- **Learning:** Focused modules for deep understanding
|
||
- **Production:** Proper organization like PyTorch's `torch.nn.ReLU`
|
||
- **Consistency:** All activation functions live together in `core.activations`
|
||
- **Integration:** Works seamlessly with tensors and layers
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## What Are Activation Functions?
|
||
|
||
### The Problem: Linear Limitations
|
||
Without activation functions, neural networks can only learn linear relationships:
|
||
```
|
||
y = W₁ · (W₂ · (W₃ · x + b₃) + b₂) + b₁
|
||
```
|
||
|
||
This simplifies to just:
|
||
```
|
||
y = W_combined · x + b_combined
|
||
```
|
||
|
||
**A single linear function!** No matter how many layers you add, you can't learn complex patterns like:
|
||
- Image recognition (nonlinear pixel relationships)
|
||
- Language understanding (nonlinear word relationships)
|
||
- Game playing (nonlinear strategy relationships)
|
||
|
||
### The Solution: Nonlinearity
|
||
Activation functions add nonlinearity between layers:
|
||
```
|
||
y = W₁ · f(W₂ · f(W₃ · x + b₃) + b₂) + b₁
|
||
```
|
||
|
||
Now each layer can learn complex transformations!
|
||
|
||
### Real-World Impact
|
||
- **Before activations**: Only linear classifiers (logistic regression)
|
||
- **After activations**: Complex pattern recognition (deep learning revolution)
|
||
|
||
### What We'll Build
|
||
1. **ReLU**: The foundation of modern deep learning
|
||
2. **Sigmoid**: Classic activation for binary classification
|
||
3. **Tanh**: Centered activation for better gradients
|
||
4. **Softmax**: Probability distributions for multi-class classification
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## 🔧 DEVELOPMENT
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## Step 1: ReLU - The Foundation of Deep Learning
|
||
|
||
### What is ReLU?
|
||
**ReLU (Rectified Linear Unit)** is the most important activation function in deep learning:
|
||
|
||
```
|
||
f(x) = max(0, x)
|
||
```
|
||
|
||
- **Positive inputs**: Pass through unchanged
|
||
- **Negative inputs**: Become zero
|
||
- **Zero**: Stays zero
|
||
|
||
### Why ReLU Revolutionized Deep Learning
|
||
1. **Computational efficiency**: Just a max operation
|
||
2. **No vanishing gradients**: Derivative is 1 for positive values
|
||
3. **Sparsity**: Many neurons output exactly 0
|
||
4. **Empirical success**: Works well in practice
|
||
|
||
### Visual Understanding
|
||
```
|
||
Input: [-2, -1, 0, 1, 2]
|
||
ReLU: [ 0, 0, 0, 1, 2]
|
||
```
|
||
|
||
### Real-World Applications
|
||
- **Image classification**: ResNet, VGG, AlexNet
|
||
- **Object detection**: YOLO, R-CNN
|
||
- **Language models**: Transformer feedforward layers
|
||
- **Recommendation**: Deep collaborative filtering
|
||
|
||
### Mathematical Properties
|
||
- **Derivative**: f'(x) = 1 if x > 0, else 0
|
||
- **Range**: [0, ∞)
|
||
- **Sparsity**: Outputs exactly 0 for negative inputs
|
||
"""
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "relu-class", "locked": false, "schema_version": 3, "solution": true, "task": false}
|
||
#| export
|
||
class ReLU:
|
||
"""
|
||
ReLU Activation Function: f(x) = max(0, x)
|
||
|
||
The most popular activation function in deep learning.
|
||
Simple, fast, and effective for most applications.
|
||
"""
|
||
|
||
def forward(self, x):
|
||
"""
|
||
Apply ReLU activation: f(x) = max(0, x)
|
||
|
||
TODO: Implement ReLU activation function.
|
||
|
||
STEP-BY-STEP IMPLEMENTATION:
|
||
1. For each element in the input tensor, apply max(0, element)
|
||
2. Use NumPy's maximum function for efficient element-wise operation
|
||
3. Return a new tensor of the same type with the results
|
||
4. Preserve the input tensor's shape
|
||
|
||
EXAMPLE USAGE:
|
||
```python
|
||
relu = ReLU()
|
||
input_tensor = Tensor([[-2, -1, 0, 1, 2]])
|
||
output = relu(input_tensor)
|
||
print(output.data) # [[0, 0, 0, 1, 2]]
|
||
```
|
||
|
||
IMPLEMENTATION HINTS:
|
||
- Use np.maximum(0, x.data) for element-wise max with 0
|
||
- Return the same type as input: return type(x)(result)
|
||
- The shape should remain the same as input
|
||
- Don't modify the input tensor (immutable operations)
|
||
|
||
LEARNING CONNECTIONS:
|
||
- This is like torch.nn.ReLU() in PyTorch
|
||
- Used in virtually every modern neural network
|
||
- Enables deep networks by preventing vanishing gradients
|
||
- Creates sparse representations (many zeros)
|
||
"""
|
||
### BEGIN SOLUTION
|
||
result = np.maximum(0, x.data)
|
||
return type(x)(result)
|
||
### END SOLUTION
|
||
|
||
def __call__(self, x):
|
||
"""Make the class callable: relu(x) instead of relu.forward(x)"""
|
||
return self.forward(x)
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### 🧪 Test Your ReLU Implementation
|
||
|
||
Once you implement the ReLU forward method above, run this cell to test it:
|
||
"""
|
||
|
||
# %% nbgrader={"grade": true, "grade_id": "test-relu-immediate", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false}
|
||
def test_unit_relu_activation():
|
||
"""Unit test for the ReLU activation function."""
|
||
print("🔬 Unit Test: ReLU Activation...")
|
||
|
||
# Create ReLU instance
|
||
relu = ReLU()
|
||
|
||
# Test with mixed positive/negative values
|
||
test_input = Tensor([[-2, -1, 0, 1, 2]])
|
||
result = relu(test_input)
|
||
expected = np.array([[0, 0, 0, 1, 2]])
|
||
|
||
assert np.array_equal(result.data, expected), f"ReLU failed: expected {expected}, got {result.data}"
|
||
|
||
# Test that negative values become zero
|
||
assert np.all(result.data >= 0), "ReLU should make all negative values zero"
|
||
|
||
# Test that positive values remain unchanged
|
||
positive_input = Tensor([[1, 2, 3, 4, 5]])
|
||
positive_result = relu(positive_input)
|
||
assert np.array_equal(positive_result.data, positive_input.data), "ReLU should preserve positive values"
|
||
|
||
# Test with 2D tensor
|
||
matrix_input = Tensor([[-1, 2], [3, -4]])
|
||
matrix_result = relu(matrix_input)
|
||
matrix_expected = np.array([[0, 2], [3, 0]])
|
||
assert np.array_equal(matrix_result.data, matrix_expected), "ReLU should work with 2D tensors"
|
||
|
||
# Test shape preservation
|
||
assert matrix_result.shape == matrix_input.shape, "ReLU should preserve input shape"
|
||
|
||
print("✅ ReLU activation tests passed!")
|
||
print(f"✅ Negative values correctly zeroed")
|
||
print(f"✅ Positive values preserved")
|
||
print(f"✅ Shape preservation working")
|
||
print(f"✅ Works with multi-dimensional tensors")
|
||
|
||
# Run the test
|
||
test_unit_relu_activation()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## Step 2: Sigmoid - Classic Binary Classification
|
||
|
||
### What is Sigmoid?
|
||
**Sigmoid** is the classic activation function that maps any real number to (0, 1):
|
||
|
||
```
|
||
f(x) = 1 / (1 + e^(-x))
|
||
```
|
||
|
||
### Why Sigmoid Matters
|
||
1. **Probability interpretation**: Outputs between 0 and 1
|
||
2. **Smooth gradients**: Differentiable everywhere
|
||
3. **Historical importance**: Enabled early neural networks
|
||
4. **Binary classification**: Perfect for yes/no decisions
|
||
|
||
### Visual Understanding
|
||
```
|
||
Input: [-∞, -2, -1, 0, 1, 2, ∞]
|
||
Sigmoid:[0, 0.12, 0.27, 0.5, 0.73, 0.88, 1]
|
||
```
|
||
|
||
### Real-World Applications
|
||
- **Binary classification**: Spam detection, medical diagnosis
|
||
- **Gating mechanisms**: LSTM and GRU cells
|
||
- **Output layers**: When you need probabilities
|
||
- **Attention mechanisms**: Where to focus attention
|
||
|
||
### Mathematical Properties
|
||
- **Range**: (0, 1)
|
||
- **Derivative**: f'(x) = f(x) · (1 - f(x))
|
||
- **Centered**: f(0) = 0.5
|
||
- **Symmetric**: f(-x) = 1 - f(x)
|
||
"""
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "sigmoid-class", "locked": false, "schema_version": 3, "solution": true, "task": false}
|
||
#| export
|
||
class Sigmoid:
|
||
"""
|
||
Sigmoid Activation Function: f(x) = 1 / (1 + e^(-x))
|
||
|
||
Maps any real number to the range (0, 1).
|
||
Useful for binary classification and probability outputs.
|
||
"""
|
||
|
||
def forward(self, x):
|
||
"""
|
||
Apply Sigmoid activation: f(x) = 1 / (1 + e^(-x))
|
||
|
||
TODO: Implement Sigmoid activation function.
|
||
|
||
STEP-BY-STEP IMPLEMENTATION:
|
||
1. Compute the negative of input: -x.data
|
||
2. Compute the exponential: np.exp(-x.data)
|
||
3. Add 1 to the exponential: 1 + np.exp(-x.data)
|
||
4. Take the reciprocal: 1 / (1 + np.exp(-x.data))
|
||
5. Return as new Tensor
|
||
|
||
EXAMPLE USAGE:
|
||
```python
|
||
sigmoid = Sigmoid()
|
||
input_tensor = Tensor([[-2, -1, 0, 1, 2]])
|
||
output = sigmoid(input_tensor)
|
||
print(output.data) # [[0.119, 0.269, 0.5, 0.731, 0.881]]
|
||
```
|
||
|
||
IMPLEMENTATION HINTS:
|
||
- Use np.exp() for exponential function
|
||
- Formula: 1 / (1 + np.exp(-x.data))
|
||
- Handle potential overflow with np.clip(-x.data, -500, 500)
|
||
- Return Tensor(result)
|
||
|
||
LEARNING CONNECTIONS:
|
||
- This is like torch.nn.Sigmoid() in PyTorch
|
||
- Used in binary classification output layers
|
||
- Key component in LSTM and GRU gating mechanisms
|
||
- Historically important for early neural networks
|
||
"""
|
||
### BEGIN SOLUTION
|
||
# Clip to prevent overflow
|
||
clipped_input = np.clip(-x.data, -500, 500)
|
||
result = 1 / (1 + np.exp(clipped_input))
|
||
return type(x)(result)
|
||
### END SOLUTION
|
||
|
||
def __call__(self, x):
|
||
"""Make the class callable: sigmoid(x) instead of sigmoid.forward(x)"""
|
||
return self.forward(x)
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### 🧪 Test Your Sigmoid Implementation
|
||
|
||
Once you implement the Sigmoid forward method above, run this cell to test it:
|
||
"""
|
||
|
||
# %% nbgrader={"grade": true, "grade_id": "test-sigmoid-immediate", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false}
|
||
def test_unit_sigmoid_activation():
|
||
"""Unit test for the Sigmoid activation function."""
|
||
print("🔬 Unit Test: Sigmoid Activation...")
|
||
|
||
# Create Sigmoid instance
|
||
sigmoid = Sigmoid()
|
||
|
||
# Test with known values
|
||
test_input = Tensor([[0]])
|
||
result = sigmoid(test_input)
|
||
expected = 0.5
|
||
|
||
assert abs(result.data[0][0] - expected) < 1e-6, f"Sigmoid(0) should be 0.5, got {result.data[0][0]}"
|
||
|
||
# Test with positive and negative values
|
||
test_input = Tensor([[-2, -1, 0, 1, 2]])
|
||
result = sigmoid(test_input)
|
||
|
||
# Check that all values are between 0 and 1
|
||
assert np.all(result.data > 0), "Sigmoid output should be > 0"
|
||
assert np.all(result.data < 1), "Sigmoid output should be < 1"
|
||
|
||
# Test symmetry: sigmoid(-x) = 1 - sigmoid(x)
|
||
x_val = 1.0
|
||
pos_result = sigmoid(Tensor([[x_val]]))
|
||
neg_result = sigmoid(Tensor([[-x_val]]))
|
||
symmetry_check = abs(pos_result.data[0][0] + neg_result.data[0][0] - 1.0)
|
||
assert symmetry_check < 1e-6, "Sigmoid should be symmetric around 0.5"
|
||
|
||
# Test with 2D tensor
|
||
matrix_input = Tensor([[-1, 1], [0, 2]])
|
||
matrix_result = sigmoid(matrix_input)
|
||
assert matrix_result.shape == matrix_input.shape, "Sigmoid should preserve shape"
|
||
|
||
# Test extreme values (should not overflow)
|
||
extreme_input = Tensor([[-100, 100]])
|
||
extreme_result = sigmoid(extreme_input)
|
||
assert not np.any(np.isnan(extreme_result.data)), "Sigmoid should handle extreme values"
|
||
assert not np.any(np.isinf(extreme_result.data)), "Sigmoid should not produce inf values"
|
||
|
||
print("✅ Sigmoid activation tests passed!")
|
||
print(f"✅ Outputs correctly bounded between 0 and 1")
|
||
print(f"✅ Symmetric property verified")
|
||
print(f"✅ Handles extreme values without overflow")
|
||
print(f"✅ Shape preservation working")
|
||
|
||
# Run the test
|
||
test_unit_sigmoid_activation()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## Step 3: Tanh - Centered Activation
|
||
|
||
### What is Tanh?
|
||
**Tanh (Hyperbolic Tangent)** is similar to sigmoid but centered around zero:
|
||
|
||
```
|
||
f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
|
||
```
|
||
|
||
### Why Tanh is Better Than Sigmoid
|
||
1. **Zero-centered**: Outputs range from -1 to 1
|
||
2. **Better gradients**: Helps with gradient flow in deep networks
|
||
3. **Faster convergence**: Less bias shift during training
|
||
4. **Stronger gradients**: Maximum gradient is 1 vs 0.25 for sigmoid
|
||
|
||
### Visual Understanding
|
||
```
|
||
Input: [-∞, -2, -1, 0, 1, 2, ∞]
|
||
Tanh: [-1, -0.96, -0.76, 0, 0.76, 0.96, 1]
|
||
```
|
||
|
||
### Real-World Applications
|
||
- **Hidden layers**: Better than sigmoid for internal activations
|
||
- **RNN cells**: Classic RNN and LSTM use tanh
|
||
- **Normalization**: When you need zero-centered outputs
|
||
- **Feature scaling**: Maps inputs to [-1, 1] range
|
||
|
||
### Mathematical Properties
|
||
- **Range**: (-1, 1)
|
||
- **Derivative**: f'(x) = 1 - f(x)²
|
||
- **Zero-centered**: f(0) = 0
|
||
- **Antisymmetric**: f(-x) = -f(x)
|
||
"""
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "tanh-class", "locked": false, "schema_version": 3, "solution": true, "task": false}
|
||
#| export
|
||
class Tanh:
|
||
"""
|
||
Tanh Activation Function: f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
|
||
|
||
Zero-centered activation function with range (-1, 1).
|
||
Better gradient properties than sigmoid.
|
||
"""
|
||
|
||
def forward(self, x: Tensor) -> Tensor:
|
||
"""
|
||
Apply Tanh activation: f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
|
||
|
||
TODO: Implement Tanh activation function.
|
||
|
||
STEP-BY-STEP IMPLEMENTATION:
|
||
1. Use NumPy's built-in tanh function: np.tanh(x.data)
|
||
2. Alternatively, implement manually:
|
||
- Compute e^x and e^(-x)
|
||
- Calculate (e^x - e^(-x)) / (e^x + e^(-x))
|
||
3. Return as new Tensor
|
||
|
||
EXAMPLE USAGE:
|
||
```python
|
||
tanh = Tanh()
|
||
input_tensor = Tensor([[-2, -1, 0, 1, 2]])
|
||
output = tanh(input_tensor)
|
||
print(output.data) # [[-0.964, -0.762, 0, 0.762, 0.964]]
|
||
```
|
||
|
||
IMPLEMENTATION HINTS:
|
||
- Use np.tanh(x.data) for simplicity
|
||
- Manual implementation: (np.exp(x.data) - np.exp(-x.data)) / (np.exp(x.data) + np.exp(-x.data))
|
||
- Handle overflow by clipping inputs: np.clip(x.data, -500, 500)
|
||
- Return Tensor(result)
|
||
|
||
LEARNING CONNECTIONS:
|
||
- This is like torch.nn.Tanh() in PyTorch
|
||
- Used in RNN, LSTM, and GRU cells
|
||
- Better than sigmoid for hidden layers
|
||
- Zero-centered outputs help with gradient flow
|
||
"""
|
||
### BEGIN SOLUTION
|
||
# Use NumPy's built-in tanh function
|
||
result = np.tanh(x.data)
|
||
return type(x)(result)
|
||
### END SOLUTION
|
||
|
||
def __call__(self, x: Tensor) -> Tensor:
|
||
"""Make the class callable: tanh(x) instead of tanh.forward(x)"""
|
||
return self.forward(x)
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### 🧪 Test Your Tanh Implementation
|
||
|
||
Once you implement the Tanh forward method above, run this cell to test it:
|
||
"""
|
||
|
||
# %% nbgrader={"grade": true, "grade_id": "test-tanh-immediate", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false}
|
||
def test_unit_tanh_activation():
|
||
"""Unit test for the Tanh activation function."""
|
||
print("🔬 Unit Test: Tanh Activation...")
|
||
|
||
# Create Tanh instance
|
||
tanh = Tanh()
|
||
|
||
# Test with zero (should be 0)
|
||
test_input = Tensor([[0]])
|
||
result = tanh(test_input)
|
||
expected = 0.0
|
||
|
||
assert abs(result.data[0][0] - expected) < 1e-6, f"Tanh(0) should be 0, got {result.data[0][0]}"
|
||
|
||
# Test with positive and negative values
|
||
test_input = Tensor([[-2, -1, 0, 1, 2]])
|
||
result = tanh(test_input)
|
||
|
||
# Check that all values are between -1 and 1
|
||
assert np.all(result.data > -1), "Tanh output should be > -1"
|
||
assert np.all(result.data < 1), "Tanh output should be < 1"
|
||
|
||
# Test antisymmetry: tanh(-x) = -tanh(x)
|
||
x_val = 1.5
|
||
pos_result = tanh(Tensor([[x_val]]))
|
||
neg_result = tanh(Tensor([[-x_val]]))
|
||
antisymmetry_check = abs(pos_result.data[0][0] + neg_result.data[0][0])
|
||
assert antisymmetry_check < 1e-6, "Tanh should be antisymmetric"
|
||
|
||
# Test with 2D tensor
|
||
matrix_input = Tensor([[-1, 1], [0, 2]])
|
||
matrix_result = tanh(matrix_input)
|
||
assert matrix_result.shape == matrix_input.shape, "Tanh should preserve shape"
|
||
|
||
# Test extreme values (should not overflow)
|
||
extreme_input = Tensor([[-100, 100]])
|
||
extreme_result = tanh(extreme_input)
|
||
assert not np.any(np.isnan(extreme_result.data)), "Tanh should handle extreme values"
|
||
assert not np.any(np.isinf(extreme_result.data)), "Tanh should not produce inf values"
|
||
|
||
# Test that extreme values approach ±1
|
||
assert abs(extreme_result.data[0][0] - (-1)) < 1e-6, "Tanh(-∞) should approach -1"
|
||
assert abs(extreme_result.data[0][1] - 1) < 1e-6, "Tanh(∞) should approach 1"
|
||
|
||
print("✅ Tanh activation tests passed!")
|
||
print(f"✅ Outputs correctly bounded between -1 and 1")
|
||
print(f"✅ Antisymmetric property verified")
|
||
print(f"✅ Zero-centered (tanh(0) = 0)")
|
||
print(f"✅ Handles extreme values correctly")
|
||
|
||
# Run the test
|
||
test_unit_tanh_activation()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## Step 4: Softmax - Probability Distributions
|
||
|
||
### What is Softmax?
|
||
**Softmax** converts a vector of real numbers into a probability distribution:
|
||
|
||
```
|
||
f(x_i) = e^(x_i) / Σ(e^(x_j))
|
||
```
|
||
|
||
### Why Softmax is Essential
|
||
1. **Probability distribution**: Outputs sum to 1
|
||
2. **Multi-class classification**: Choose one class from many
|
||
3. **Interpretable**: Each output is a probability
|
||
4. **Differentiable**: Enables gradient-based learning
|
||
|
||
### Visual Understanding
|
||
```
|
||
Input: [1, 2, 3]
|
||
Softmax:[0.09, 0.24, 0.67] # Sums to 1.0
|
||
```
|
||
|
||
### Real-World Applications
|
||
- **Classification**: Image classification, text classification
|
||
- **Language models**: Next word prediction
|
||
- **Attention mechanisms**: Where to focus attention
|
||
- **Reinforcement learning**: Action selection probabilities
|
||
|
||
### Mathematical Properties
|
||
- **Range**: (0, 1) for each output
|
||
- **Constraint**: Σ(f(x_i)) = 1
|
||
- **Argmax preservation**: Doesn't change relative ordering
|
||
- **Temperature scaling**: Can be made sharper or softer
|
||
"""
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "softmax-class", "locked": false, "schema_version": 3, "solution": true, "task": false}
|
||
#| export
|
||
class Softmax:
|
||
"""
|
||
Softmax Activation Function: f(x_i) = e^(x_i) / Σ(e^(x_j))
|
||
|
||
Converts a vector of real numbers into a probability distribution.
|
||
Essential for multi-class classification.
|
||
"""
|
||
|
||
def forward(self, x):
|
||
"""
|
||
Apply Softmax activation: f(x_i) = e^(x_i) / Σ(e^(x_j))
|
||
|
||
TODO: Implement Softmax activation function.
|
||
|
||
STEP-BY-STEP IMPLEMENTATION:
|
||
1. Handle empty input case
|
||
2. Subtract max value for numerical stability: x - max(x)
|
||
3. Compute exponentials: np.exp(x - max(x))
|
||
4. Compute sum of exponentials: np.sum(exp_values)
|
||
5. Divide each exponential by the sum: exp_values / sum
|
||
6. Return as same tensor type as input
|
||
|
||
EXAMPLE USAGE:
|
||
```python
|
||
softmax = Softmax()
|
||
input_tensor = Tensor([[1, 2, 3]])
|
||
output = softmax(input_tensor)
|
||
print(output.data) # [[0.09, 0.24, 0.67]]
|
||
print(np.sum(output.data)) # 1.0
|
||
```
|
||
|
||
IMPLEMENTATION HINTS:
|
||
- Handle empty case: if x.data.size == 0: return type(x)(x.data.copy())
|
||
- Subtract max for numerical stability: x_shifted = x.data - np.max(x.data, axis=-1, keepdims=True)
|
||
- Compute exponentials: exp_values = np.exp(x_shifted)
|
||
- Sum along last axis: sum_exp = np.sum(exp_values, axis=-1, keepdims=True)
|
||
- Divide: result = exp_values / sum_exp
|
||
- Return same type as input: return type(x)(result)
|
||
|
||
LEARNING CONNECTIONS:
|
||
- This is like torch.nn.Softmax() in PyTorch
|
||
- Used in classification output layers
|
||
- Key component in attention mechanisms
|
||
- Enables probability-based decision making
|
||
"""
|
||
### BEGIN SOLUTION
|
||
# Handle empty input
|
||
if x.data.size == 0:
|
||
return type(x)(x.data.copy())
|
||
|
||
# Subtract max for numerical stability
|
||
x_shifted = x.data - np.max(x.data, axis=-1, keepdims=True)
|
||
|
||
# Compute exponentials
|
||
exp_values = np.exp(x_shifted)
|
||
|
||
# Sum along last axis
|
||
sum_exp = np.sum(exp_values, axis=-1, keepdims=True)
|
||
|
||
# Divide to get probabilities
|
||
result = exp_values / sum_exp
|
||
|
||
return type(x)(result)
|
||
### END SOLUTION
|
||
|
||
def __call__(self, x):
|
||
"""Make the class callable: softmax(x) instead of softmax.forward(x)"""
|
||
return self.forward(x)
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### 🧪 Test Your Softmax Implementation
|
||
|
||
Once you implement the Softmax forward method above, run this cell to test it:
|
||
"""
|
||
|
||
# %% nbgrader={"grade": true, "grade_id": "test-softmax-immediate", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false}
|
||
def test_unit_softmax_activation():
|
||
"""Unit test for the Softmax activation function."""
|
||
print("🔬 Unit Test: Softmax Activation...")
|
||
|
||
# Create Softmax instance
|
||
softmax = Softmax()
|
||
|
||
# Test with simple input
|
||
test_input = Tensor([[1, 2, 3]])
|
||
result = softmax(test_input)
|
||
|
||
# Check that outputs sum to 1
|
||
output_sum = np.sum(result.data)
|
||
assert abs(output_sum - 1.0) < 1e-6, f"Softmax outputs should sum to 1, got {output_sum}"
|
||
|
||
# Check that all outputs are positive
|
||
assert np.all(result.data > 0), "Softmax outputs should be positive"
|
||
assert np.all(result.data < 1), "Softmax outputs should be less than 1"
|
||
|
||
# Test with uniform input (should give equal probabilities)
|
||
uniform_input = Tensor([[1, 1, 1]])
|
||
uniform_result = softmax(uniform_input)
|
||
expected_prob = 1.0 / 3.0
|
||
|
||
for prob in uniform_result.data[0]:
|
||
assert abs(prob - expected_prob) < 1e-6, f"Uniform input should give equal probabilities"
|
||
|
||
# Test with batch input (multiple samples)
|
||
batch_input = Tensor([[1, 2, 3], [4, 5, 6]])
|
||
batch_result = softmax(batch_input)
|
||
|
||
# Check that each row sums to 1
|
||
for i in range(batch_input.shape[0]):
|
||
row_sum = np.sum(batch_result.data[i])
|
||
assert abs(row_sum - 1.0) < 1e-6, f"Each row should sum to 1, row {i} sums to {row_sum}"
|
||
|
||
# Test numerical stability with large values
|
||
large_input = Tensor([[1000, 1001, 1002]])
|
||
large_result = softmax(large_input)
|
||
|
||
assert not np.any(np.isnan(large_result.data)), "Softmax should handle large values"
|
||
assert not np.any(np.isinf(large_result.data)), "Softmax should not produce inf values"
|
||
|
||
large_sum = np.sum(large_result.data)
|
||
assert abs(large_sum - 1.0) < 1e-6, "Large values should still sum to 1"
|
||
|
||
# Test shape preservation
|
||
assert batch_result.shape == batch_input.shape, "Softmax should preserve shape"
|
||
|
||
print("✅ Softmax activation tests passed!")
|
||
print(f"✅ Outputs sum to 1 (probability distribution)")
|
||
print(f"✅ All outputs are positive")
|
||
print(f"✅ Handles uniform inputs correctly")
|
||
print(f"✅ Works with batch inputs")
|
||
print(f"✅ Numerically stable with large values")
|
||
|
||
# Run the test
|
||
test_unit_softmax_activation()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## 🎯 Comprehensive Test: All Activations Working Together
|
||
|
||
### Real-World Scenario
|
||
Let's test how all activation functions work together in a realistic neural network scenario:
|
||
|
||
- **Input processing**: Raw data transformation
|
||
- **Hidden layers**: ReLU for internal processing
|
||
- **Output layer**: Softmax for classification
|
||
- **Comparison**: See how different activations transform the same data
|
||
"""
|
||
|
||
# %% nbgrader={"grade": true, "grade_id": "test-activations-comprehensive", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false}
|
||
def test_unit_activations_comprehensive():
|
||
"""Comprehensive unit test for all activation functions working together."""
|
||
print("🔬 Unit Test: Activation Functions Comprehensive Test...")
|
||
|
||
# Create instances of all activation functions
|
||
relu = ReLU()
|
||
sigmoid = Sigmoid()
|
||
tanh = Tanh()
|
||
softmax = Softmax()
|
||
|
||
# Test data: simulating neural network layer outputs
|
||
test_data = Tensor([[-2, -1, 0, 1, 2]])
|
||
|
||
# Apply each activation function
|
||
relu_result = relu(test_data)
|
||
sigmoid_result = sigmoid(test_data)
|
||
tanh_result = tanh(test_data)
|
||
softmax_result = softmax(test_data)
|
||
|
||
# Test that all functions preserve input shape
|
||
assert relu_result.shape == test_data.shape, "ReLU should preserve shape"
|
||
assert sigmoid_result.shape == test_data.shape, "Sigmoid should preserve shape"
|
||
assert tanh_result.shape == test_data.shape, "Tanh should preserve shape"
|
||
assert softmax_result.shape == test_data.shape, "Softmax should preserve shape"
|
||
|
||
# Test that all functions return Tensor objects
|
||
assert isinstance(relu_result, Tensor), "ReLU should return Tensor"
|
||
assert isinstance(sigmoid_result, Tensor), "Sigmoid should return Tensor"
|
||
assert isinstance(tanh_result, Tensor), "Tanh should return Tensor"
|
||
assert isinstance(softmax_result, Tensor), "Softmax should return Tensor"
|
||
|
||
# Test ReLU properties
|
||
assert np.all(relu_result.data >= 0), "ReLU output should be non-negative"
|
||
|
||
# Test Sigmoid properties
|
||
assert np.all(sigmoid_result.data > 0), "Sigmoid output should be positive"
|
||
assert np.all(sigmoid_result.data < 1), "Sigmoid output should be less than 1"
|
||
|
||
# Test Tanh properties
|
||
assert np.all(tanh_result.data > -1), "Tanh output should be > -1"
|
||
assert np.all(tanh_result.data < 1), "Tanh output should be < 1"
|
||
|
||
# Test Softmax properties
|
||
softmax_sum = np.sum(softmax_result.data)
|
||
assert abs(softmax_sum - 1.0) < 1e-6, "Softmax outputs should sum to 1"
|
||
|
||
# Test chaining activations (realistic neural network scenario)
|
||
# Hidden layer with ReLU
|
||
hidden_output = relu(test_data)
|
||
|
||
# Add some weights simulation (element-wise multiplication)
|
||
weights = Tensor([[0.5, 0.3, 0.8, 0.2, 0.7]])
|
||
weighted_output = hidden_output * weights
|
||
|
||
# Final layer with Softmax
|
||
final_output = softmax(weighted_output)
|
||
|
||
# Test that chained operations work
|
||
assert isinstance(final_output, Tensor), "Chained operations should return Tensor"
|
||
assert abs(np.sum(final_output.data) - 1.0) < 1e-6, "Final output should be valid probability"
|
||
|
||
# Test with batch data (multiple samples)
|
||
batch_data = Tensor([
|
||
[-2, -1, 0, 1, 2],
|
||
[1, 2, 3, 4, 5],
|
||
[-1, 0, 1, 2, 3]
|
||
])
|
||
|
||
batch_softmax = softmax(batch_data)
|
||
|
||
# Each row should sum to 1
|
||
for i in range(batch_data.shape[0]):
|
||
row_sum = np.sum(batch_softmax.data[i])
|
||
assert abs(row_sum - 1.0) < 1e-6, f"Batch row {i} should sum to 1"
|
||
|
||
print("✅ Activation functions comprehensive tests passed!")
|
||
print(f"✅ All functions work together seamlessly")
|
||
print(f"✅ Shape preservation across all activations")
|
||
print(f"✅ Chained operations work correctly")
|
||
print(f"✅ Batch processing works for all activations")
|
||
print(f"✅ Ready for neural network integration!")
|
||
|
||
# Run the comprehensive test
|
||
test_unit_activations_comprehensive()
|
||
|
||
# %%
|
||
def test_module_activation_tensor_integration():
|
||
"""
|
||
Integration test for activation functions with Tensor operations.
|
||
|
||
Tests that activation functions properly integrate with the Tensor class
|
||
and maintain compatibility for neural network operations.
|
||
"""
|
||
print("🔬 Running Integration Test: Activation-Tensor Integration...")
|
||
|
||
# Test 1: Activation functions preserve Tensor types
|
||
input_tensor = Tensor([-2.0, -1.0, 0.0, 1.0, 2.0])
|
||
|
||
relu_fn = ReLU()
|
||
sigmoid_fn = Sigmoid()
|
||
tanh_fn = Tanh()
|
||
|
||
relu_result = relu_fn(input_tensor)
|
||
sigmoid_result = sigmoid_fn(input_tensor)
|
||
tanh_result = tanh_fn(input_tensor)
|
||
|
||
assert isinstance(relu_result, Tensor), "ReLU should return Tensor"
|
||
assert isinstance(sigmoid_result, Tensor), "Sigmoid should return Tensor"
|
||
assert isinstance(tanh_result, Tensor), "Tanh should return Tensor"
|
||
|
||
# Test 2: Activations work with matrix Tensors (neural network layers)
|
||
layer_output = Tensor([[1.0, -2.0, 3.0],
|
||
[-1.0, 2.0, -3.0]]) # Simulating dense layer output
|
||
|
||
relu_fn = ReLU()
|
||
activated = relu_fn(layer_output)
|
||
expected = np.array([[1.0, 0.0, 3.0],
|
||
[0.0, 2.0, 0.0]])
|
||
|
||
assert isinstance(activated, Tensor), "Matrix activation should return Tensor"
|
||
assert np.array_equal(activated.data, expected), "Matrix ReLU should work correctly"
|
||
|
||
# Test 3: Softmax with classification scenario
|
||
logits = Tensor([[2.0, 1.0, 0.1], # Batch of 2 samples
|
||
[1.0, 3.0, 0.2]]) # Each with 3 classes
|
||
|
||
softmax_fn = Softmax()
|
||
probabilities = softmax_fn(logits)
|
||
|
||
assert isinstance(probabilities, Tensor), "Softmax should return Tensor"
|
||
assert probabilities.shape == logits.shape, "Softmax should preserve shape"
|
||
|
||
# Each row should sum to 1 (probability distribution)
|
||
for i in range(logits.shape[0]):
|
||
row_sum = np.sum(probabilities.data[i])
|
||
assert abs(row_sum - 1.0) < 1e-6, f"Probability row {i} should sum to 1"
|
||
|
||
# Test 4: Chaining tensor operations with activations
|
||
x = Tensor([1.0, 2.0, 3.0])
|
||
y = Tensor([4.0, 5.0, 6.0])
|
||
|
||
# Simulate: dense layer output -> activation -> more operations
|
||
dense_sim = x * y # Element-wise multiplication (simulating dense layer)
|
||
relu_fn = ReLU()
|
||
activated = relu_fn(dense_sim) # Apply activation
|
||
final = activated + Tensor([1.0, 1.0, 1.0]) # More tensor operations
|
||
|
||
expected_final = np.array([5.0, 11.0, 19.0]) # [4,10,18] -> relu -> +1 = [5,11,19]
|
||
|
||
assert isinstance(final, Tensor), "Chained operations should maintain Tensor type"
|
||
assert np.array_equal(final.data, expected_final), "Chained operations should work correctly"
|
||
|
||
print("✅ Integration Test Passed: Activation-Tensor integration works correctly.")
|
||
|
||
# Run the integration test
|
||
test_module_activation_tensor_integration()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## 🎯 MODULE SUMMARY: Activation Functions
|
||
|
||
Congratulations! You've successfully implemented all four essential activation functions:
|
||
|
||
### ✅ What You've Built
|
||
- **ReLU**: The foundation of modern deep learning with sparsity and efficiency
|
||
- **Sigmoid**: Classic activation for binary classification and probability outputs
|
||
- **Tanh**: Zero-centered activation with better gradient properties
|
||
- **Softmax**: Probability distribution for multi-class classification
|
||
|
||
### ✅ Key Learning Outcomes
|
||
- **Understanding**: Why nonlinearity is essential for neural networks
|
||
- **Implementation**: Built activation functions from scratch using NumPy
|
||
- **Testing**: Progressive validation with immediate feedback after each function
|
||
- **Integration**: Saw how activations work together in neural networks
|
||
- **Real-world context**: Understanding where each activation is used
|
||
|
||
### ✅ Mathematical Mastery
|
||
- **ReLU**: f(x) = max(0, x) - Simple but powerful
|
||
- **Sigmoid**: f(x) = 1/(1 + e^(-x)) - Maps to (0,1)
|
||
- **Tanh**: f(x) = tanh(x) - Zero-centered, maps to (-1,1)
|
||
- **Softmax**: f(x_i) = e^(x_i)/Σ(e^(x_j)) - Probability distribution
|
||
|
||
### ✅ Professional Skills Developed
|
||
- **Numerical stability**: Handling overflow and underflow
|
||
- **API design**: Consistent interfaces across all functions
|
||
- **Testing discipline**: Immediate validation after each implementation
|
||
- **Integration thinking**: Understanding how components work together
|
||
|
||
### ✅ Ready for Next Steps
|
||
Your activation functions are now ready to power:
|
||
- **Dense layers**: Linear transformations with nonlinear activations
|
||
- **Convolutional layers**: Spatial feature extraction with ReLU
|
||
- **Network architectures**: Complete neural networks with proper activations
|
||
- **Training**: Gradient computation through activation functions
|
||
|
||
### 🔗 Connection to Real ML Systems
|
||
Your implementations mirror production systems:
|
||
- **PyTorch**: `torch.nn.ReLU()`, `torch.nn.Sigmoid()`, `torch.nn.Tanh()`, `torch.nn.Softmax()`
|
||
- **TensorFlow**: `tf.nn.relu()`, `tf.nn.sigmoid()`, `tf.nn.tanh()`, `tf.nn.softmax()`
|
||
- **Industry applications**: Every major deep learning model uses these functions
|
||
|
||
### 🎯 The Power of Nonlinearity
|
||
You've unlocked the key to deep learning:
|
||
- **Before**: Linear models limited to simple patterns
|
||
- **After**: Nonlinear models can learn any pattern (universal approximation)
|
||
|
||
**Next Module**: Layers - Building blocks that combine your tensors and activations into powerful transformations!
|
||
|
||
Your activation functions are the key to neural network intelligence. Now let's build the layers that use them!
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## 🤔 ML Systems Thinking: Reflection Questions
|
||
|
||
Now that you've built the nonlinear functions that enable neural network intelligence, reflect on how these simple mathematical operations power the AI revolution:
|
||
|
||
### System Design - How does this fit into larger systems?
|
||
1. **The Nonlinearity Bottleneck**: Your implementations show how ReLU, Sigmoid, Tanh, and Softmax each solve different problems. When OpenAI designs GPT architectures, how do they decide which activation goes where? What happens when you choose the wrong activation for a specific layer?
|
||
|
||
2. **Activation Memory**: Your ReLU creates sparse outputs (many zeros). How does this sparsity enable frameworks to optimize memory usage in massive models? Why might this be crucial when training models with billions of parameters?
|
||
|
||
3. **Numerical Stability**: Your Softmax includes overflow protection with clipping. How do production systems handle numerical stability at scale? What happens when a single unstable activation cascades through a 100-layer network?
|
||
|
||
### Production ML - How is this used in real ML workflows?
|
||
4. **Hardware Acceleration**: Your simple element-wise operations translate to GPU kernels. How do cloud providers optimize activation function compute for millions of concurrent neural network inferences? Why might ReLU be preferred on certain hardware?
|
||
|
||
5. **Model Serving Latency**: Your activation functions add computational overhead. When serving real-time recommendations or autonomous vehicle decisions, how do activation choices affect response times? Which activations are fastest and why?
|
||
|
||
6. **Gradient Behavior**: Your functions enable backpropagation (coming in Module 7). In production training of large language models, how do activation choices affect training stability and convergence speed? Why did ReLU revolutionize deep learning?
|
||
|
||
### Framework Design - Why do frameworks make certain choices?
|
||
7. **API Consistency**: Your callable classes (`relu(x)`) mirror PyTorch's design. How does consistent activation APIs enable researchers to experiment rapidly with different nonlinearities? What would happen if each activation had a different interface?
|
||
|
||
8. **Automatic Differentiation**: Your forward-only implementations will connect to gradient computation. How do frameworks like PyTorch automatically track operations through activations to compute gradients? Why is this harder than it looks?
|
||
|
||
9. **Activation Fusion**: Your separate activation calls could be optimized. How do production frameworks combine linear layers with activations into single GPU operations for speed? What trade-offs does this optimization create?
|
||
|
||
### Performance & Scale - What happens when systems get large?
|
||
10. **Activation Bottlenecks**: Your functions process tensors sequentially. When training transformer models with billions of parameters, how do activation computations get parallelized across multiple GPUs? What new challenges emerge at scale?
|
||
|
||
11. **Memory vs Computation**: Your implementations store all intermediate results. How do techniques like gradient checkpointing trade activation memory for recomputation time? When do these trade-offs become critical?
|
||
|
||
12. **Dynamic Activations**: Your static functions are fixed at definition. How might learned activations (like Swish or GELU) that adapt during training change the computational requirements of large-scale systems?
|
||
|
||
**💡 Systems Insight**: The activation functions you built are the "biological neurons" of artificial intelligence—each simple nonlinear transformation enables networks to learn complex patterns. Your ReLU implementation, despite being just `max(0, x)`, is literally computing in millions of deployed models right now, powering everything from photo recognition to language translation.
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## ⚡ ML Systems: Performance Analysis & Optimization
|
||
|
||
Now that you have working activation functions, let's develop **performance engineering skills**. This section teaches you to measure computational costs, understand scaling patterns, and think about production optimization.
|
||
|
||
### **Learning Outcome**: *"I understand performance trade-offs between different activation functions"*
|
||
|
||
---
|
||
|
||
## Performance Profiling Tools (Light Implementation)
|
||
|
||
As an ML systems engineer, you need to understand which activation functions are fast vs slow, and why. Let's build simple tools to measure and compare performance.
|
||
"""
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "activation-profiler", "locked": false, "schema_version": 3, "solution": true, "task": false}
|
||
#| export
|
||
import time
|
||
|
||
class ActivationProfiler:
|
||
"""
|
||
Performance profiling toolkit for activation functions.
|
||
|
||
Helps ML engineers understand computational costs and optimize
|
||
neural network performance for production deployment.
|
||
"""
|
||
|
||
def __init__(self):
|
||
self.results = {}
|
||
|
||
def time_activation(self, activation_fn, tensor, activation_name, iterations=100):
|
||
"""
|
||
Time how long an activation function takes to run.
|
||
|
||
TODO: Implement activation timing.
|
||
|
||
STEP-BY-STEP IMPLEMENTATION:
|
||
1. Record start time using time.time()
|
||
2. Run the activation function for specified iterations
|
||
3. Record end time
|
||
4. Calculate average time per iteration
|
||
5. Return the average time in milliseconds
|
||
|
||
EXAMPLE:
|
||
profiler = ActivationProfiler()
|
||
relu = ReLU()
|
||
test_tensor = Tensor(np.random.randn(1000, 1000))
|
||
avg_time = profiler.time_activation(relu, test_tensor, "ReLU")
|
||
print(f"ReLU took {avg_time:.3f} ms on average")
|
||
|
||
HINTS:
|
||
- Use time.time() for timing
|
||
- Run multiple iterations for better accuracy
|
||
- Calculate: (end_time - start_time) / iterations * 1000 for ms
|
||
- Return the average time per call in milliseconds
|
||
"""
|
||
### BEGIN SOLUTION
|
||
start_time = time.time()
|
||
|
||
for _ in range(iterations):
|
||
result = activation_fn(tensor)
|
||
|
||
end_time = time.time()
|
||
avg_time_ms = (end_time - start_time) / iterations * 1000
|
||
|
||
return avg_time_ms
|
||
### END SOLUTION
|
||
|
||
def compare_activations(self, tensor_size=(1000, 1000), iterations=50):
|
||
"""
|
||
Compare performance of all activation functions.
|
||
|
||
This function is PROVIDED to show systems analysis.
|
||
Students run it to understand performance differences.
|
||
"""
|
||
print(f"⚡ ACTIVATION PERFORMANCE COMPARISON")
|
||
print(f"=" * 50)
|
||
print(f"Tensor size: {tensor_size}, Iterations: {iterations}")
|
||
|
||
# Create test tensor
|
||
test_tensor = Tensor(np.random.randn(*tensor_size))
|
||
tensor_mb = test_tensor.data.nbytes / (1024 * 1024)
|
||
print(f"Test tensor: {tensor_mb:.2f} MB")
|
||
|
||
# Test all activation functions
|
||
activations = {
|
||
'ReLU': ReLU(),
|
||
'Sigmoid': Sigmoid(),
|
||
'Tanh': Tanh(),
|
||
'Softmax': Softmax()
|
||
}
|
||
|
||
results = {}
|
||
for name, activation_fn in activations.items():
|
||
avg_time = self.time_activation(activation_fn, test_tensor, name, iterations)
|
||
results[name] = avg_time
|
||
print(f" {name:8}: {avg_time:.3f} ms")
|
||
|
||
# Calculate speed ratios relative to fastest
|
||
fastest_time = min(results.values())
|
||
fastest_name = min(results, key=results.get)
|
||
|
||
print(f"\n📊 SPEED ANALYSIS:")
|
||
for name, time_ms in sorted(results.items(), key=lambda x: x[1]):
|
||
speed_ratio = time_ms / fastest_time
|
||
if name == fastest_name:
|
||
print(f" {name:8}: {speed_ratio:.1f}x (fastest)")
|
||
else:
|
||
print(f" {name:8}: {speed_ratio:.1f}x slower than {fastest_name}")
|
||
|
||
return results
|
||
|
||
def analyze_scaling(self, activation_fn, activation_name, sizes=[100, 500, 1000]):
|
||
"""
|
||
Analyze how activation performance scales with tensor size.
|
||
|
||
This function is PROVIDED to demonstrate scaling patterns.
|
||
Students use it to understand computational complexity.
|
||
"""
|
||
print(f"\n🔍 SCALING ANALYSIS: {activation_name}")
|
||
print(f"=" * 40)
|
||
|
||
scaling_results = []
|
||
|
||
for size in sizes:
|
||
test_tensor = Tensor(np.random.randn(size, size))
|
||
avg_time = self.time_activation(activation_fn, test_tensor, activation_name, iterations=20)
|
||
|
||
elements = size * size
|
||
time_per_element = avg_time / elements * 1e6 # microseconds per element
|
||
|
||
result = {
|
||
'size': size,
|
||
'elements': elements,
|
||
'time_ms': avg_time,
|
||
'time_per_element_us': time_per_element
|
||
}
|
||
scaling_results.append(result)
|
||
|
||
print(f" {size}x{size}: {avg_time:.3f}ms ({time_per_element:.3f}μs/element)")
|
||
|
||
# Analyze scaling pattern
|
||
if len(scaling_results) >= 2:
|
||
small = scaling_results[0]
|
||
large = scaling_results[-1]
|
||
|
||
size_ratio = large['size'] / small['size']
|
||
time_ratio = large['time_ms'] / small['time_ms']
|
||
|
||
print(f"\n📈 Scaling Pattern:")
|
||
print(f" Size increased {size_ratio:.1f}x ({small['size']} → {large['size']})")
|
||
print(f" Time increased {time_ratio:.1f}x")
|
||
|
||
if abs(time_ratio - size_ratio**2) < abs(time_ratio - size_ratio):
|
||
print(f" Pattern: O(n²) - linear in tensor size")
|
||
else:
|
||
print(f" Pattern: ~O(n) - very efficient scaling")
|
||
|
||
return scaling_results
|
||
|
||
def benchmark_activation_suite():
|
||
"""
|
||
Comprehensive benchmark of all activation functions.
|
||
|
||
This function is PROVIDED to show complete systems analysis.
|
||
Students run it to understand production performance implications.
|
||
"""
|
||
profiler = ActivationProfiler()
|
||
|
||
print("🏆 COMPREHENSIVE ACTIVATION BENCHMARK")
|
||
print("=" * 60)
|
||
|
||
# Test 1: Performance comparison
|
||
comparison_results = profiler.compare_activations(tensor_size=(800, 800), iterations=30)
|
||
|
||
# Test 2: Scaling analysis for each activation
|
||
activations_to_test = [
|
||
(ReLU(), "ReLU"),
|
||
(Sigmoid(), "Sigmoid"),
|
||
(Tanh(), "Tanh")
|
||
]
|
||
|
||
for activation_fn, name in activations_to_test:
|
||
profiler.analyze_scaling(activation_fn, name, sizes=[200, 400, 600])
|
||
|
||
# Test 3: Memory vs Performance trade-offs
|
||
print(f"\n💾 MEMORY vs PERFORMANCE ANALYSIS:")
|
||
print(f"=" * 40)
|
||
|
||
test_tensor = Tensor(np.random.randn(500, 500))
|
||
original_memory = test_tensor.data.nbytes / (1024 * 1024)
|
||
|
||
for name, activation_fn in [("ReLU", ReLU()), ("Sigmoid", Sigmoid())]:
|
||
start_time = time.time()
|
||
result = activation_fn(test_tensor)
|
||
end_time = time.time()
|
||
|
||
result_memory = result.data.nbytes / (1024 * 1024)
|
||
time_ms = (end_time - start_time) * 1000
|
||
|
||
print(f" {name}:")
|
||
print(f" Input: {original_memory:.2f} MB")
|
||
print(f" Output: {result_memory:.2f} MB")
|
||
print(f" Memory overhead: {result_memory - original_memory:.2f} MB")
|
||
print(f" Time: {time_ms:.3f} ms")
|
||
|
||
print(f"\n🎯 PRODUCTION INSIGHTS:")
|
||
print(f" - ReLU is typically fastest (simple max operation)")
|
||
print(f" - Sigmoid/Tanh slower due to exponential calculations")
|
||
print(f" - All operations scale linearly with tensor size")
|
||
print(f" - Memory usage doubles (input + output tensors)")
|
||
print(f" - Choose activation based on accuracy vs speed trade-offs")
|
||
|
||
return comparison_results
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### 🧪 Test: Activation Performance Profiling
|
||
|
||
Let's test our activation profiler with realistic performance analysis.
|
||
"""
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "test-activation-profiler", "locked": false, "schema_version": 3, "solution": false, "task": false}
|
||
def test_activation_profiler():
|
||
"""Test activation profiler with comprehensive scenarios."""
|
||
print("🔬 Unit Test: Activation Performance Profiler...")
|
||
|
||
profiler = ActivationProfiler()
|
||
|
||
# Create test tensor
|
||
test_tensor = Tensor(np.random.randn(100, 100))
|
||
relu = ReLU()
|
||
|
||
# Test timing functionality
|
||
avg_time = profiler.time_activation(relu, test_tensor, "ReLU", iterations=10)
|
||
|
||
# Verify timing results
|
||
assert isinstance(avg_time, (int, float)), "Should return numeric time"
|
||
assert avg_time > 0, "Time should be positive"
|
||
assert avg_time < 1000, "Time should be reasonable (< 1000ms)"
|
||
|
||
print("✅ Basic timing functionality test passed")
|
||
|
||
# Test comparison functionality
|
||
comparison_results = profiler.compare_activations(tensor_size=(50, 50), iterations=5)
|
||
|
||
# Verify comparison results
|
||
assert isinstance(comparison_results, dict), "Should return dictionary of results"
|
||
assert len(comparison_results) == 4, "Should test all 4 activation functions"
|
||
|
||
expected_activations = ['ReLU', 'Sigmoid', 'Tanh', 'Softmax']
|
||
for activation in expected_activations:
|
||
assert activation in comparison_results, f"Should include {activation}"
|
||
assert comparison_results[activation] > 0, f"{activation} time should be positive"
|
||
|
||
print("✅ Activation comparison test passed")
|
||
|
||
# Test scaling analysis
|
||
scaling_results = profiler.analyze_scaling(relu, "ReLU", sizes=[50, 100])
|
||
|
||
# Verify scaling results
|
||
assert isinstance(scaling_results, list), "Should return list of scaling results"
|
||
assert len(scaling_results) == 2, "Should test both sizes"
|
||
|
||
for result in scaling_results:
|
||
assert 'size' in result, "Should include size"
|
||
assert 'time_ms' in result, "Should include timing"
|
||
assert result['time_ms'] > 0, "Time should be positive"
|
||
|
||
print("✅ Scaling analysis test passed")
|
||
|
||
print("🎯 Activation Profiler: All tests passed!")
|
||
|
||
# Run the test
|
||
test_activation_profiler()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### 🎯 Learning Activity: Activation Performance Analysis
|
||
|
||
**Goal**: Learn to measure activation function performance and understand which operations are fast vs slow in production ML systems.
|
||
"""
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "activation-performance-analysis", "locked": false, "schema_version": 3, "solution": false, "task": false}
|
||
# Initialize the activation profiler
|
||
profiler = ActivationProfiler()
|
||
|
||
print("⚡ ACTIVATION PERFORMANCE ANALYSIS")
|
||
print("=" * 50)
|
||
|
||
# Create test data
|
||
test_tensor = Tensor(np.random.randn(500, 500)) # Medium-sized tensor for testing
|
||
print(f"Test tensor size: {test_tensor.shape}")
|
||
print(f"Memory footprint: {test_tensor.data.nbytes/(1024*1024):.2f} MB")
|
||
|
||
# Test individual activation timing
|
||
print(f"\n🎯 Individual Activation Timing:")
|
||
activations_to_test = [
|
||
(ReLU(), "ReLU"),
|
||
(Sigmoid(), "Sigmoid"),
|
||
(Tanh(), "Tanh"),
|
||
(Softmax(), "Softmax")
|
||
]
|
||
|
||
individual_results = {}
|
||
for activation_fn, name in activations_to_test:
|
||
# Students implement this timing call
|
||
avg_time = profiler.time_activation(activation_fn, test_tensor, name, iterations=50)
|
||
individual_results[name] = avg_time
|
||
print(f" {name:8}: {avg_time:.3f} ms average")
|
||
|
||
# Analyze the results
|
||
fastest = min(individual_results, key=individual_results.get)
|
||
slowest = max(individual_results, key=individual_results.get)
|
||
speed_ratio = individual_results[slowest] / individual_results[fastest]
|
||
|
||
print(f"\n📊 PERFORMANCE INSIGHTS:")
|
||
print(f" Fastest: {fastest} ({individual_results[fastest]:.3f} ms)")
|
||
print(f" Slowest: {slowest} ({individual_results[slowest]:.3f} ms)")
|
||
print(f" Speed difference: {speed_ratio:.1f}x")
|
||
|
||
print(f"\n💡 WHY THE DIFFERENCE?")
|
||
print(f" - ReLU: Just max(0, x) - simple comparison")
|
||
print(f" - Sigmoid: Requires exponential calculation")
|
||
print(f" - Tanh: Also exponential, but often optimized")
|
||
print(f" - Softmax: Exponentials + division")
|
||
|
||
print(f"\n🏭 PRODUCTION IMPLICATIONS:")
|
||
print(f" - ReLU dominates modern deep learning (speed + effectiveness)")
|
||
print(f" - Sigmoid/Tanh used where probability interpretation needed")
|
||
print(f" - Speed matters: 1000 layers × speed difference = major impact")
|
||
""" |