# ---
# jupyter:
#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: percent
#       format_version: '1.3'
#       jupytext_version: 1.17.1
# ---

# %% [markdown]
"""
# Loss Functions - Essential Training Objectives for Neural Networks

Welcome to the Loss Functions module! You'll implement the essential loss functions that define learning objectives and enable neural networks to learn from data through gradient-based optimization.

## Learning Goals
- Systems understanding: How loss functions define learning objectives and drive gradient-based optimization
- Core implementation skill: Build MSE, CrossEntropy, and BinaryCrossEntropy with proper numerical stability
- Pattern recognition: Understand how different loss functions shape learning dynamics and convergence behavior
- Framework connection: See how your loss implementations mirror PyTorch's loss functions and autograd integration
- Performance insight: Learn why numerically stable loss computation affects training reliability and convergence speed

## Build → Use → Reflect
1. **Build**: Complete loss function implementations with numerical stability and gradient support
2. **Use**: Apply loss functions to regression and classification problems with real neural networks
3. **Reflect**: Why do different loss functions lead to different learning behaviors, and when does numerical stability matter?

## What You'll Achieve
By the end of this module, you'll gain:
- A deep technical understanding of how loss functions translate learning problems into optimization objectives
- The practical capability to implement production-quality loss functions with proper numerical stability
- Systems insight into why loss function design affects training stability and convergence characteristics
- An appreciation of how numerical precision in loss computation affects training reliability
- A connection to production ML systems and how frameworks implement robust loss computation

## Systems Reality Check
💡 **Production Context**: PyTorch's loss functions use numerically stable formulations so they behave correctly even under automatic mixed precision and extreme gradient values
⚡ **Performance Note**: Numerically unstable loss functions can cause training to fail catastrophically - proper implementation is critical for reliable ML systems
"""

# %% nbgrader={"grade": false, "grade_id": "losses-imports", "locked": false, "schema_version": 3, "solution": false, "task": false}
#| default_exp core.losses

#| export
import numpy as np
import sys
import os

# Import our building blocks - try the installed package first, then local modules
try:
    from tinytorch.core.tensor import Tensor
    # Note: For now, we'll use simplified implementations without full autograd
    # In a complete system, these would integrate with the autograd Variable system
except ImportError:
    # For development, import from local modules
    sys.path.append(os.path.join(os.path.dirname(__file__), '..', '02_tensor'))
    from tensor_dev import Tensor

# %% nbgrader={"grade": false, "grade_id": "losses-setup", "locked": false, "schema_version": 3, "solution": false, "task": false}
print("🔥 TinyTorch Loss Functions Module")
print(f"NumPy version: {np.__version__}")
print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}")
print("Ready to build loss functions for neural network training!")

# %% [markdown]
"""
## Where This Code Lives in the Final Package

**Learning Side:** You work in `modules/05_losses/losses_dev.py`
**Building Side:** Code exports to `tinytorch.core.losses`

```python
# Final package structure:
from tinytorch.core.losses import MeanSquaredError, CrossEntropyLoss, BinaryCrossEntropyLoss  # All loss functions!
from tinytorch.core.tensor import Tensor  # The foundation
from tinytorch.core.layers import Linear, Sequential  # Network components
```

**Why this matters:**
- **Learning:** Focused module for understanding loss functions and training objectives
- **Production:** Proper organization like PyTorch's torch.nn with all loss functions together
- **Consistency:** All loss functions live together in core.losses for easy access
- **Integration:** Works seamlessly with tensors and neural networks for complete training systems
"""

# %% [markdown]
"""
# Understanding Loss Functions in Neural Networks

## What are Loss Functions?

Loss functions (also called cost functions or objective functions) quantify how far your model's predictions are from the true targets. They provide:

🎯 **Learning Objectives**: Define what "good" performance means for your specific problem
📈 **Gradient Signal**: Provide gradients that guide parameter updates during training
🔍 **Progress Measurement**: Enable monitoring training progress and convergence
⚖️ **Trade-off Control**: Balance different aspects of model performance (accuracy vs regularization)

## Why Loss Functions Matter for ML Systems

### The Learning Loop
```
1. Forward Pass: Input → Network → Predictions
2. Loss Computation: Loss = loss_function(predictions, targets)
3. Backward Pass: Compute gradients of loss w.r.t. parameters
4. Parameter Update: parameters -= learning_rate * gradients
5. Repeat until convergence
```
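
Here's that loop as a minimal, self-contained sketch - plain NumPy rather than the TinyTorch API we're building, with the gradient derived by hand for a one-parameter linear model trained with MSE:

```python
import numpy as np

x, y_true = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
w, lr = 0.0, 0.1  # single weight, learning rate

for step in range(50):
    y_pred = w * x                               # 1. Forward pass
    loss = np.mean((y_pred - y_true) ** 2)       # 2. Loss computation (MSE)
    grad_w = np.mean(2 * (y_pred - y_true) * x)  # 3. Backward pass: dLoss/dw
    w -= lr * grad_w                             # 4. Parameter update

print(w)  # ≈ 2.0, the true slope - the loss drove w to the right value
```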

### Different Problems Need Different Loss Functions

🔢 **Regression Problems**: Mean Squared Error (MSE)
- Predicting continuous values (house prices, temperatures, stock prices)
- Penalizes large errors more than small errors (quadratic penalty)

🏷️ **Multi-Class Classification**: Cross-Entropy Loss
- Predicting one class from many options (image classification, text categorization)
- Works with probability distributions over classes

⚪ **Binary Classification**: Binary Cross-Entropy Loss
- Predicting yes/no, positive/negative (spam detection, medical diagnosis)
- Optimized for two-class problems

Let's implement these essential loss functions!
"""

# %% [markdown]
"""
# Mean Squared Error - Foundation for Regression

MSE measures the average squared difference between predictions and targets. It's the most fundamental loss function for regression problems.

## Why MSE Matters

📊 **Regression Standard**: The go-to loss function for predicting continuous values
🎯 **Quadratic Penalty**: Penalizes large errors more than small errors
📈 **Smooth Gradients**: Provides smooth gradients for stable optimization
🔢 **Interpretable**: The loss is expressed in the squared units of your target variable

## Mathematical Foundation

For a batch of predictions and targets:
```
MSE = (1/n) × Σ(y_pred - y_true)²
```
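
To make the formula concrete, here's a tiny hand-checkable example (the same numbers reappear in the tests below):

```python
import numpy as np

y_pred = np.array([1.0, 2.0])
y_true = np.array([0.0, 1.0])

# MSE = [(1-0)² + (2-1)²] / 2 = (1 + 1) / 2 = 1.0
print(np.mean((y_pred - y_true) ** 2))  # 1.0
```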

## Learning Objectives
By implementing MSE, you'll understand:
- How regression loss functions quantify prediction quality
- Why squared error creates smooth gradients for optimization
- How batch processing enables efficient training on multiple samples
- The connection between mathematical loss functions and practical ML training
"""

# %% nbgrader={"grade": false, "grade_id": "mse-loss-implementation", "locked": false, "schema_version": 3, "solution": true, "task": false}
#| export
class MeanSquaredError:
    """
    Mean Squared Error Loss for Regression Problems

    Computes the average squared difference between predictions and targets:
        MSE = (1/n) × Σ(y_pred - y_true)²

    Features:
    - Numerically stable computation
    - Efficient batch processing
    - Clean gradient properties for optimization
    - Compatible with tensor operations

    Example Usage:
        mse = MeanSquaredError()
        loss = mse(predictions, targets)  # Returns scalar loss value
    """

    def __init__(self):
        """Initialize MSE loss function."""
        pass

    def __call__(self, y_pred, y_true):
        """
        Compute MSE loss between predictions and targets.

        Args:
            y_pred: Model predictions (Tensor, shape: [batch_size, ...])
            y_true: True targets (Tensor, shape: [batch_size, ...])

        Returns:
            Tensor with scalar loss value
        """
        # Convert to tensors if needed
        if not isinstance(y_pred, Tensor):
            y_pred = Tensor(y_pred)
        if not isinstance(y_true, Tensor):
            y_true = Tensor(y_true)

        # Compute mean squared error
        diff = y_pred.data - y_true.data
        squared_diff = diff * diff
        mean_loss = np.mean(squared_diff)

        return Tensor(mean_loss)

    def forward(self, y_pred, y_true):
        """Alternative interface for forward pass."""
        return self.__call__(y_pred, y_true)
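
# %% [markdown]
"""
A quick sanity check on the "smooth gradients" claim: the analytic gradient of MSE with respect to each prediction is 2 × (y_pred - y_true) / n, and we can confirm it numerically with a central finite difference. This is an illustrative aside in plain NumPy, not part of the exported module.
"""

# %%
def mse_gradient_check():
    """Compare the analytic MSE gradient against a central finite difference."""
    y_pred = np.array([1.0, 2.0, 3.0])
    y_true = np.array([0.5, 2.5, 2.0])
    n = y_pred.size

    analytic = 2 * (y_pred - y_true) / n  # dMSE/dy_pred, derived by hand

    eps = 1e-6
    numeric = np.zeros_like(y_pred)
    for i in range(n):
        plus, minus = y_pred.copy(), y_pred.copy()
        plus[i] += eps
        minus[i] -= eps
        numeric[i] = (np.mean((plus - y_true) ** 2) - np.mean((minus - y_true) ** 2)) / (2 * eps)

    assert np.allclose(analytic, numeric, atol=1e-6), "Analytic and numeric gradients should match"
    print(f"✅ Analytic MSE gradient matches finite differences: {analytic}")

mse_gradient_check()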

# %% [markdown]
"""
## Testing Mean Squared Error

Let's verify our MSE implementation works correctly with various test cases.
"""

# %% nbgrader={"grade": true, "grade_id": "test-mse-loss", "locked": true, "points": 3, "schema_version": 3, "solution": false, "task": false}
def test_mse_loss():
    """Test MSE loss implementation."""
    print("🧪 Testing Mean Squared Error Loss...")

    mse = MeanSquaredError()

    # Test case 1: Perfect predictions (loss should be 0)
    y_pred = Tensor([[1.0, 2.0], [3.0, 4.0]])
    y_true = Tensor([[1.0, 2.0], [3.0, 4.0]])
    loss = mse(y_pred, y_true)
    assert abs(loss.data) < 1e-6, f"Perfect predictions should have loss ≈ 0, got {loss.data}"
    print("✅ Perfect predictions test passed")

    # Test case 2: Known loss computation
    y_pred = Tensor([[1.0, 2.0]])
    y_true = Tensor([[0.0, 1.0]])
    loss = mse(y_pred, y_true)
    expected = 1.0  # [(1-0)² + (2-1)²] / 2 = [1 + 1] / 2 = 1.0
    assert abs(loss.data - expected) < 1e-6, f"Expected loss {expected}, got {loss.data}"
    print("✅ Known loss computation test passed")

    # Test case 3: Batch processing
    y_pred = Tensor([[1.0, 2.0], [3.0, 4.0]])
    y_true = Tensor([[1.5, 2.5], [2.5, 3.5]])
    loss = mse(y_pred, y_true)
    expected = 0.25  # All squared differences are 0.25
    assert abs(loss.data - expected) < 1e-6, f"Expected batch loss {expected}, got {loss.data}"
    print("✅ Batch processing test passed")

    # Test case 4: Single value
    y_pred = Tensor([5.0])
    y_true = Tensor([3.0])
    loss = mse(y_pred, y_true)
    expected = 4.0  # (5-3)² = 4
    assert abs(loss.data - expected) < 1e-6, f"Expected single value loss {expected}, got {loss.data}"
    print("✅ Single value test passed")

    print("🎉 All MSE loss tests passed!")

test_mse_loss()

# %% [markdown]
"""
# Cross-Entropy Loss - Foundation for Multi-Class Classification

Cross-Entropy Loss measures the difference between predicted probability distributions and true class labels. It's the standard loss function for multi-class classification problems.

## Why Cross-Entropy Matters

🎯 **Classification Standard**: The go-to loss function for multi-class problems
📊 **Probability Interpretation**: Works naturally with softmax probability outputs
🔄 **Information Theory**: Measures information distance between distributions
⚖️ **Class Balance**: Handles multiple classes in a principled way

## Mathematical Foundation

For predictions (logits) and true labels:
```
CrossEntropy = -Σ y_true × log(softmax(y_pred))
```

With softmax normalization:
```
softmax(x_i) = exp(x_i) / Σ exp(x_j)
```
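
A small worked example ties the two formulas together (hand-checkable numbers, plain NumPy):

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.0])  # raw model outputs for 3 classes
true_class = 0

exp_logits = np.exp(logits - logits.max())  # subtract max for stability
probs = exp_logits / exp_logits.sum()       # softmax ≈ [0.665, 0.245, 0.090]
loss = -np.log(probs[true_class])           # cross-entropy ≈ 0.408
print(probs, loss)
```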

## Learning Objectives
By implementing Cross-Entropy, you'll understand:
- How classification losses work with probability distributions
- Why softmax normalization is essential for multi-class problems
- The importance of numerical stability in log computations
- How cross-entropy encourages confident, correct predictions
"""

# %% nbgrader={"grade": false, "grade_id": "crossentropy-loss-implementation", "locked": false, "schema_version": 3, "solution": true, "task": false}
#| export
class CrossEntropyLoss:
    """
    Cross-Entropy Loss for Multi-Class Classification Problems

    Computes the cross-entropy between predicted probability distributions
    and true class labels with a numerically stable implementation.

    Features:
    - Numerically stable softmax computation
    - Support for both class indices and one-hot encoding
    - Efficient batch processing
    - Automatic handling of edge cases

    Example Usage:
        ce_loss = CrossEntropyLoss()
        loss = ce_loss(logits, class_indices)  # Returns scalar loss value
    """

    def __init__(self):
        """Initialize CrossEntropy loss function."""
        pass

    def __call__(self, y_pred, y_true):
        """
        Compute CrossEntropy loss between predictions and targets.

        Args:
            y_pred: Model predictions/logits (Tensor, shape: [batch_size, num_classes])
            y_true: True class indices (Tensor, shape: [batch_size]) or one-hot encoding

        Returns:
            Tensor with scalar loss value
        """
        # Convert to tensors if needed
        if not isinstance(y_pred, Tensor):
            y_pred = Tensor(y_pred)
        if not isinstance(y_true, Tensor):
            y_true = Tensor(y_true)

        # Get data arrays
        pred_data = y_pred.data
        true_data = y_true.data

        # Handle both 1D and 2D prediction arrays
        if pred_data.ndim == 1:
            pred_data = pred_data.reshape(1, -1)

        # Apply softmax to get probability distribution (numerically stable)
        exp_pred = np.exp(pred_data - np.max(pred_data, axis=1, keepdims=True))
        softmax_pred = exp_pred / np.sum(exp_pred, axis=1, keepdims=True)

        # Add small epsilon to avoid log(0)
        epsilon = 1e-15
        softmax_pred = np.clip(softmax_pred, epsilon, 1.0 - epsilon)

        # Handle class indices vs one-hot encoding
        if len(true_data.shape) == 1:
            # y_true contains class indices
            batch_size = true_data.shape[0]
            log_probs = np.log(softmax_pred[np.arange(batch_size), true_data.astype(int)])
            loss_value = -np.mean(log_probs)
        else:
            # y_true is one-hot encoded
            log_probs = np.log(softmax_pred)
            loss_value = -np.mean(np.sum(true_data * log_probs, axis=1))

        return Tensor(loss_value)

    def forward(self, y_pred, y_true):
        """Alternative interface for forward pass."""
        return self.__call__(y_pred, y_true)
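
# %% [markdown]
"""
Why does the implementation subtract `np.max(...)` before exponentiating? Without that shift, `exp()` overflows for large logits and softmax collapses to NaN. A short demonstration (illustrative only, not part of the exported module):
"""

# %%
def softmax_stability_demo():
    """Show why subtracting the max logit before exp() matters."""
    logits = np.array([[1000.0, 999.0, 998.0]])  # large but perfectly valid logits

    # Naive softmax: exp(1000) overflows to inf, and inf/inf = nan
    with np.errstate(over='ignore', invalid='ignore'):
        naive = np.exp(logits) / np.sum(np.exp(logits), axis=1, keepdims=True)

    # Stable softmax: shift by the row max first (mathematically identical)
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    stable = np.exp(shifted) / np.sum(np.exp(shifted), axis=1, keepdims=True)

    print(f"Naive softmax:  {naive}")   # [[nan nan nan]]
    print(f"Stable softmax: {stable}")  # ≈ [[0.665 0.245 0.090]]

softmax_stability_demo()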

# %% [markdown]
"""
## Testing Cross-Entropy Loss

Let's verify our Cross-Entropy implementation handles various classification scenarios.
"""

# %% nbgrader={"grade": true, "grade_id": "test-crossentropy-loss", "locked": true, "points": 4, "schema_version": 3, "solution": false, "task": false}
def test_crossentropy_loss():
    """Test CrossEntropy loss implementation."""
    print("🧪 Testing Cross-Entropy Loss...")

    ce = CrossEntropyLoss()

    # Test case 1: Perfect predictions
    y_pred = Tensor([[10.0, 0.0, 0.0], [0.0, 10.0, 0.0]])  # Very confident correct predictions
    y_true = Tensor([0, 1])  # Class indices
    loss = ce(y_pred, y_true)
    assert loss.data < 0.1, f"Perfect predictions should have low loss, got {loss.data}"
    print("✅ Perfect predictions test passed")

    # Test case 2: Random predictions (should have higher loss)
    y_pred = Tensor([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])  # Uniform after softmax
    y_true = Tensor([0, 1])
    loss = ce(y_pred, y_true)
    expected_random = -np.log(1.0 / 3.0)  # -log(1/num_classes) for a uniform distribution
    assert abs(loss.data - expected_random) < 0.1, f"Random predictions should have loss ≈ {expected_random}, got {loss.data}"
    print("✅ Random predictions test passed")

    # Test case 3: Binary classification
    y_pred = Tensor([[2.0, 1.0], [1.0, 2.0]])
    y_true = Tensor([0, 1])
    loss = ce(y_pred, y_true)
    assert 0.0 < loss.data < 2.0, f"Binary classification loss should be reasonable, got {loss.data}"
    print("✅ Binary classification test passed")

    # Test case 4: One-hot encoded labels
    y_pred = Tensor([[2.0, 1.0, 0.0], [0.0, 2.0, 1.0]])
    y_true = Tensor([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # One-hot encoded
    loss = ce(y_pred, y_true)
    assert 0.0 < loss.data < 2.0, f"One-hot encoded loss should be reasonable, got {loss.data}"
    print("✅ One-hot encoded labels test passed")

    print("🎉 All Cross-Entropy loss tests passed!")

test_crossentropy_loss()

# %% [markdown]
"""
# Binary Cross-Entropy Loss - Optimized for Binary Classification

Binary Cross-Entropy Loss is specifically designed for binary classification problems. While you could use regular Cross-Entropy with 2 classes, BCE is more efficient and numerically stable for binary problems.

## Why Binary Cross-Entropy Matters

⚪ **Binary Optimization**: Specifically designed for two-class problems
🔢 **Efficiency**: More efficient than multi-class cross-entropy for binary cases
🎯 **Stability**: Better numerical stability with sigmoid outputs
📈 **Standard Practice**: Industry standard for binary classification

## Mathematical Foundation

For binary predictions and labels:
```
BCE = -y_true × log(σ(y_pred)) - (1-y_true) × log(1-σ(y_pred))
```

Where σ(x) is the sigmoid function:
```
σ(x) = 1 / (1 + exp(-x))
```
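
Our implementation below never evaluates the sigmoid directly. Substituting σ(x) = 1/(1+exp(-x)) into the BCE formula and simplifying gives an equivalent "from logits" form:

```
BCE(x, y) = x - x×y + log(1 + exp(-x))            (substitute σ and simplify)
          = max(x, 0) - x×y + log(1 + exp(-|x|))  (rearranged for stability)
```

The second line is what the code uses: exp(-|x|) is always ≤ 1, so neither the exp nor the log can blow up, even for extreme logits like ±100.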

## Learning Objectives
By implementing Binary Cross-Entropy, you'll understand:
- How binary classification differs from multi-class problems
- Why sigmoid activation is natural for binary problems
- The importance of numerical stability in sigmoid + log computations
- How BCE loss shapes binary decision boundaries
"""

# %% nbgrader={"grade": false, "grade_id": "binary-crossentropy-implementation", "locked": false, "schema_version": 3, "solution": true, "task": false}
#| export
class BinaryCrossEntropyLoss:
    """
    Binary Cross-Entropy Loss for Binary Classification Problems

    Computes binary cross-entropy between predictions and binary labels
    with a numerically stable sigmoid + BCE implementation.

    Features:
    - Numerically stable computation from logits
    - Efficient batch processing
    - Automatic sigmoid application
    - Robust to extreme input values

    Example Usage:
        bce_loss = BinaryCrossEntropyLoss()
        loss = bce_loss(logits, binary_labels)  # Returns scalar loss value
    """

    def __init__(self):
        """Initialize Binary CrossEntropy loss function."""
        pass

    def __call__(self, y_pred, y_true):
        """
        Compute Binary CrossEntropy loss between predictions and targets.

        Args:
            y_pred: Model predictions/logits (Tensor, shape: [batch_size, 1] or [batch_size])
            y_true: True binary labels (Tensor, shape: [batch_size, 1] or [batch_size])

        Returns:
            Tensor with scalar loss value
        """
        # Convert to tensors if needed
        if not isinstance(y_pred, Tensor):
            y_pred = Tensor(y_pred)
        if not isinstance(y_true, Tensor):
            y_true = Tensor(y_true)

        # Get flat arrays for computation
        logits = y_pred.data.flatten()
        labels = y_true.data.flatten()

        # Numerically stable binary cross-entropy from logits
        def stable_bce_with_logits(logits, labels):
            # Use the stable formulation: max(x, 0) - x * y + log(1 + exp(-abs(x)))
            stable_loss = np.maximum(logits, 0) - logits * labels + np.log(1 + np.exp(-np.abs(logits)))
            return stable_loss

        # Compute loss for each sample
        losses = stable_bce_with_logits(logits, labels)
        mean_loss = np.mean(losses)

        return Tensor(mean_loss)

    def forward(self, y_pred, y_true):
        """Alternative interface for forward pass."""
        return self.__call__(y_pred, y_true)
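
# %% [markdown]
"""
To see what the stable formulation buys us, compare it against a naive sigmoid-then-log implementation at extreme logits: the sigmoid saturates to exactly 0.0 or 1.0 in float64, and `log(0)` follows. (Illustrative only, not part of the exported module.)
"""

# %%
def bce_stability_demo():
    """Naive sigmoid + log BCE vs. the stable from-logits formulation."""
    x = np.array([100.0, -100.0])  # extreme logits, both correctly classified
    y = np.array([1.0, 0.0])

    # Naive: sigmoid(100) rounds to exactly 1.0, so log(1 - p) = log(0) = -inf
    with np.errstate(divide='ignore', invalid='ignore'):
        p = 1 / (1 + np.exp(-x))
        naive = -y * np.log(p) - (1 - y) * np.log(1 - p)

    # Stable from-logits formulation (same math, rearranged)
    stable = np.maximum(x, 0) - x * y + np.log(1 + np.exp(-np.abs(x)))

    print(f"Naive BCE:  {naive}")   # first entry is nan - training would crash
    print(f"Stable BCE: {stable}")  # ≈ [0. 0.] - tiny loss, as it should be

bce_stability_demo()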

# %% [markdown]
"""
## Testing Binary Cross-Entropy Loss

Let's verify our Binary Cross-Entropy implementation handles binary classification correctly.
"""

# %% nbgrader={"grade": true, "grade_id": "test-binary-crossentropy", "locked": true, "points": 4, "schema_version": 3, "solution": false, "task": false}
def test_binary_crossentropy_loss():
    """Test Binary CrossEntropy loss implementation."""
    print("🧪 Testing Binary Cross-Entropy Loss...")

    bce = BinaryCrossEntropyLoss()

    # Test case 1: Perfect predictions
    y_pred = Tensor([[10.0], [-10.0]])  # Very confident correct predictions
    y_true = Tensor([[1.0], [0.0]])
    loss = bce(y_pred, y_true)
    assert loss.data < 0.1, f"Perfect predictions should have low loss, got {loss.data}"
    print("✅ Perfect predictions test passed")

    # Test case 2: Random predictions (should have higher loss)
    y_pred = Tensor([[0.0], [0.0]])  # 0.5 probability after sigmoid
    y_true = Tensor([[1.0], [0.0]])
    loss = bce(y_pred, y_true)
    expected_random = -np.log(0.5)  # -log(0.5) ≈ 0.693 for a 50/50 guess
    assert abs(loss.data - expected_random) < 0.1, f"Random predictions should have loss ≈ {expected_random}, got {loss.data}"
    print("✅ Random predictions test passed")

    # Test case 3: Batch processing
    y_pred = Tensor([[1.0], [2.0], [-1.0]])
    y_true = Tensor([[1.0], [1.0], [0.0]])
    loss = bce(y_pred, y_true)
    assert 0.0 < loss.data < 2.0, f"Batch processing loss should be reasonable, got {loss.data}"
    print("✅ Batch processing test passed")

    # Test case 4: Extreme values (test numerical stability)
    y_pred = Tensor([[100.0], [-100.0]])  # Extreme logits
    y_true = Tensor([[1.0], [0.0]])
    loss = bce(y_pred, y_true)
    assert not np.isnan(loss.data) and not np.isinf(loss.data), f"Extreme values should not cause NaN/Inf, got {loss.data}"
    assert loss.data < 1.0, f"Extreme correct predictions should have low loss, got {loss.data}"
    print("✅ Extreme values test passed")

    print("🎉 All Binary Cross-Entropy loss tests passed!")

test_binary_crossentropy_loss()

# %% [markdown]
"""
# Loss Function Comparison and Usage Guide

## When to Use Each Loss Function

### Mean Squared Error (MSE)
**Best for:** Regression problems where you predict continuous values
- **Examples:** Predicting house prices, temperature, stock values, ages
- **Characteristics:** Penalizes large errors more than small ones
- **Output:** Any real number
- **Activation:** Usually none (linear output)

### Cross-Entropy Loss
**Best for:** Multi-class classification (3+ classes)
- **Examples:** Image classification (cats/dogs/birds), text categorization, medical diagnosis
- **Characteristics:** Works with probability distributions over classes
- **Output:** Class probabilities (sums to 1)
- **Activation:** Softmax

### Binary Cross-Entropy Loss
**Best for:** Binary classification (2 classes)
- **Examples:** Spam detection, medical positive/negative, fraud detection
- **Characteristics:** Optimized for binary decisions
- **Output:** Single probability (0 to 1)
- **Activation:** Sigmoid

## Numerical Stability Considerations

All our implementations include numerical stability features:

🔢 **MSE**: Straightforward - no special numerical concerns
📊 **Cross-Entropy**: Subtracts the max logit before exponentiating (the log-sum-exp trick) and clips probabilities away from 0 and 1
⚪ **Binary CE**: Computes directly from logits with the stable max(x, 0) - x×y + log(1 + exp(-|x|)) formulation
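
The epsilon clipping in Cross-Entropy is doing real work. If a probability underflows to exactly zero, `log(0)` returns `-inf`, and a single `-inf` poisons the mean loss for the whole batch:

```python
import numpy as np

prob = 0.0  # a class probability that underflowed to zero
print(np.log(prob))  # -inf (plus a divide-by-zero warning)

clipped = np.clip(prob, 1e-15, 1 - 1e-15)
print(-np.log(clipped))  # ≈ 34.5 - a huge but finite loss, so training continues
```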

## Integration with Neural Networks

```python
# Example usage in training loop
model = Sequential([
    Linear(784, 128),
    ReLU(),
    Linear(128, 10)  # raw logits - CrossEntropyLoss applies softmax internally
])

# Choose appropriate loss for your problem
loss_fn = CrossEntropyLoss()  # For 10-class classification

# In training loop
predictions = model(inputs)
loss = loss_fn(predictions, targets)
# loss.backward()  # Would trigger gradient computation (when autograd is integrated)
```
"""

# %% [markdown]
"""
# Comprehensive Loss Function Testing

Let's verify all our loss functions work correctly together and can be used interchangeably.
"""

# %% nbgrader={"grade": false, "grade_id": "comprehensive-loss-tests", "locked": false, "schema_version": 3, "solution": false, "task": false}
def test_all_loss_functions():
    """Test all loss functions work correctly together."""
    print("🔬 Comprehensive Loss Function Testing")
    print("=" * 45)

    # Test 1: All losses can be instantiated
    print("\n1. Loss Function Instantiation:")
    mse = MeanSquaredError()
    ce = CrossEntropyLoss()
    bce = BinaryCrossEntropyLoss()
    print("   ✅ All loss functions created successfully")

    # Test 2: Loss functions return appropriate types
    print("\n2. Return Type Verification:")

    # MSE test
    pred = Tensor([[1.0, 2.0]])
    target = Tensor([[1.0, 2.0]])
    loss = mse(pred, target)
    assert isinstance(loss, Tensor), "MSE should return Tensor"
    assert loss.data.shape == (), "MSE should return scalar"

    # Cross-entropy test
    pred = Tensor([[1.0, 2.0], [2.0, 1.0]])
    target = Tensor([1, 0])
    loss = ce(pred, target)
    assert isinstance(loss, Tensor), "CrossEntropy should return Tensor"
    assert loss.data.shape == (), "CrossEntropy should return scalar"

    # Binary cross-entropy test
    pred = Tensor([[1.0], [-1.0]])
    target = Tensor([[1.0], [0.0]])
    loss = bce(pred, target)
    assert isinstance(loss, Tensor), "Binary CrossEntropy should return Tensor"
    assert loss.data.shape == (), "Binary CrossEntropy should return scalar"

    print("   ✅ All loss functions return correct types")

    # Test 3: Loss values are reasonable
    print("\n3. Loss Value Sanity Checks:")

    # All losses should be non-negative
    assert mse.forward(Tensor([1.0]), Tensor([2.0])).data >= 0, "MSE should be non-negative"
    assert ce.forward(Tensor([[1.0, 0.0]]), Tensor([0])).data >= 0, "CrossEntropy should be non-negative"
    assert bce.forward(Tensor([1.0]), Tensor([1.0])).data >= 0, "Binary CrossEntropy should be non-negative"

    print("   ✅ All loss functions produce reasonable values")

    # Test 4: Perfect predictions give low loss
    print("\n4. Perfect Prediction Tests:")

    perfect_mse = mse(Tensor([5.0]), Tensor([5.0]))
    perfect_ce = ce(Tensor([[10.0, 0.0]]), Tensor([0]))
    perfect_bce = bce(Tensor([10.0]), Tensor([1.0]))

    assert perfect_mse.data < 1e-10, f"Perfect MSE should be ~0, got {perfect_mse.data}"
    assert perfect_ce.data < 0.1, f"Perfect CE should be low, got {perfect_ce.data}"
    assert perfect_bce.data < 0.1, f"Perfect BCE should be low, got {perfect_bce.data}"

    print("   ✅ Perfect predictions produce low loss")

    print("\n🎉 All comprehensive tests passed!")
    print("   • Loss functions instantiate correctly")
    print("   • Return types are consistent (Tensor scalars)")
    print("   • Loss values are mathematically sound")
    print("   • Perfect predictions are handled correctly")
    print("   • Ready for integration with neural network training!")

test_all_loss_functions()

# %% [markdown]
"""
## 🤔 ML Systems Thinking: Interactive Questions

Now that you've implemented all the core loss functions, let's think about their implications for ML systems:
"""

# %% nbgrader={"grade": false, "grade_id": "question-1", "locked": false, "schema_version": 3, "solution": false, "task": false}
# Question 1: Loss Function Selection and System Performance
"""
🤔 **Question 1: Loss Function Selection Impact**

You're building a production recommendation system that needs to predict user ratings (1-5 stars) for movies.

You have three options:
A) Treat as regression: Use MSE loss with continuous outputs (1.0-5.0)
B) Treat as classification: Use CrossEntropy loss with 5 classes
C) Use a custom loss that penalizes being off by multiple stars more heavily

Analyze each approach considering:
- Training speed and convergence behavior
- Model interpretability and debugging
- Production inference speed
- How well each loss function matches the business objective
- Edge case handling (what happens with ratings like 3.7?)

Which approach would you choose and why? Consider both technical and business factors.
"""

# %% nbgrader={"grade": false, "grade_id": "question-2", "locked": false, "schema_version": 3, "solution": false, "task": false}
# Question 2: Numerical Stability in Production
"""
🤔 **Question 2: Numerical Stability Analysis**

Your cross-entropy loss function works perfectly in development, but in production you start seeing NaN losses that crash training.

Investigate the numerical stability issues:
1. What specific computations in cross-entropy can produce NaN or infinity values?
2. How do our implementations handle these edge cases?
3. What would happen if you removed the epsilon clipping in the softmax computation?
4. How would you debug this in a production system with millions of training examples?

Research areas to consider:
- Floating point precision and representation limits
- Log of very small numbers and exp of very large numbers
- Batch processing effects on numerical stability
- How PyTorch handles these same numerical challenges
"""

# %% nbgrader={"grade": false, "grade_id": "question-3", "locked": false, "schema_version": 3, "solution": false, "task": false}
# Question 3: Loss Function Innovation
"""
🤔 **Question 3: Custom Loss Functions for Real Problems**

Standard loss functions don't always match real-world objectives. Consider these scenarios:

**Scenario A**: Medical diagnosis where false negatives are 10x more costly than false positives
**Scenario B**: Search ranking where being wrong about the top result is much worse than being wrong about result #50
**Scenario C**: Financial trading where large losses should be penalized exponentially more than small losses

For each scenario:
1. Why would standard loss functions (MSE, CrossEntropy, BCE) be suboptimal?
2. How would you modify the loss function to better match the business objective?
3. What are the implementation challenges of custom loss functions?
4. How would you validate that your custom loss actually improves business outcomes?

Design principles to consider:
- Asymmetric penalties for different types of errors
- Position-aware losses for ranking problems
- Risk-adjusted losses for financial applications
- How custom losses affect gradient flow and training dynamics
"""

# %% [markdown]
"""
## 🎯 MODULE SUMMARY: Loss Functions - Learning Objectives Made Concrete

## 🎯 What You've Accomplished

You've successfully implemented the complete foundation for neural network training objectives:

### ✅ **Complete Loss Function Library**
- **Mean Squared Error**: Robust regression loss with smooth gradients for continuous value prediction
- **Cross-Entropy Loss**: Multi-class classification loss with numerically stable softmax integration
- **Binary Cross-Entropy Loss**: Optimized binary classification loss with stable sigmoid computation
- **Numerical Stability**: All implementations handle edge cases and extreme values gracefully

### ✅ **Systems Understanding**
- **Training Objectives**: How loss functions translate business problems into mathematical optimization objectives
- **Numerical Stability**: Why proper implementation prevents catastrophic training failures in production
- **Performance Characteristics**: Understanding computational complexity and batch processing efficiency
- **Problem Matching**: When to use each loss function based on problem structure and data characteristics

### ✅ **ML Engineering Skills**
- **Production-Ready Implementation**: Robust loss functions that handle real-world data edge cases
- **Batch Processing**: Efficient computation across multiple samples for scalable training
- **Error Handling**: Proper numerical stability measures for reliable production deployment
- **Integration Ready**: Clean interfaces that work seamlessly with neural network training loops

## 🔗 **Connection to Production ML Systems**

Your implementations mirror the essential patterns used in:
- **PyTorch's loss functions**: Same mathematical formulations with production-grade numerical stability
- **TensorFlow's losses**: Identical computational patterns and stability measures
- **Production ML pipelines**: The same loss functions that power real ML systems at scale
- **Research frameworks**: Foundation for experimenting with custom loss functions and training objectives

## 🚀 **What's Next**

With solid loss function implementations, you're ready to:
- **Build complete training loops** that optimize neural networks on real data
- **Implement optimizers** that use loss gradients to update model parameters
- **Create training infrastructure** with proper monitoring and convergence detection
- **Experiment with custom losses** for specialized business objectives and research problems

## 💡 **Key Systems Insights**

1. **Loss functions are the interface between business objectives and mathematical optimization** - they translate "what we want" into "what the computer can optimize"
2. **Numerical stability is not optional in production** - unstable loss computation causes catastrophic training failures
3. **Different problem types require different loss functions** - the choice affects both convergence speed and final model behavior
4. **Batch processing efficiency determines training speed** - loss computation must scale to handle large datasets efficiently

You now understand how to implement the mathematical foundation that enables neural networks to learn from data and solve real-world problems!
"""

# %% nbgrader={"grade": false, "grade_id": "final-demo", "locked": false, "schema_version": 3, "solution": false, "task": false}
if __name__ == "__main__":
    print("🔥 TinyTorch Loss Functions Module - Complete Demo")
    print("=" * 55)

    # Test all core implementations
    print("\n🧪 Testing All Loss Functions:")
    test_mse_loss()
    test_crossentropy_loss()
    test_binary_crossentropy_loss()
    test_all_loss_functions()

    print("\n" + "=" * 60)
    print("📊 Loss Function Usage Examples")
    print("=" * 35)

    # Example 1: Regression with MSE
    print("\n1. Regression Example (Predicting House Prices):")
    mse = MeanSquaredError()
    house_predictions = Tensor([[250000, 180000, 320000]])  # Predicted prices
    house_actual = Tensor([[240000, 175000, 315000]])       # Actual prices
    regression_loss = mse(house_predictions, house_actual)
    print(f"   House price MSE: {regression_loss.data:,.0f} (in squared dollars)")

    # Example 2: Multi-class classification with CrossEntropy
    print("\n2. Multi-Class Classification Example (Image Recognition):")
    ce = CrossEntropyLoss()
    image_logits = Tensor([[2.1, 0.5, -0.3, 1.8, 0.1],    # Model outputs for 5 classes
                           [-0.2, 3.1, 0.8, -1.0, 0.4]])  # (cat, dog, bird, fish, rabbit)
    true_classes = Tensor([0, 1])  # First image = cat, second = dog
    classification_loss = ce(image_logits, true_classes)
    print(f"   Image classification loss: {classification_loss.data:.4f}")

    # Example 3: Binary classification with BCE
    print("\n3. Binary Classification Example (Spam Detection):")
    bce = BinaryCrossEntropyLoss()
    spam_logits = Tensor([[1.2], [-0.8], [2.1], [-1.5]])  # Spam prediction logits
    spam_labels = Tensor([[1.0], [0.0], [1.0], [0.0]])    # 1=spam, 0=not spam
    spam_loss = bce(spam_logits, spam_labels)
    print(f"   Spam detection loss: {spam_loss.data:.4f}")

    print("\n" + "=" * 60)
    print("🎯 Loss Function Characteristics")
    print("=" * 35)

    # Compare perfect vs imperfect predictions
    print("\n📊 Perfect vs Random Predictions:")

    # Perfect predictions
    perfect_mse = mse(Tensor([5.0]), Tensor([5.0]))
    perfect_ce = ce(Tensor([[10.0, 0.0, 0.0]]), Tensor([0]))
    perfect_bce = bce(Tensor([10.0]), Tensor([1.0]))

    print(f"   Perfect MSE loss: {perfect_mse.data:.6f}")
    print(f"   Perfect CE loss: {perfect_ce.data:.6f}")
    print(f"   Perfect BCE loss: {perfect_bce.data:.6f}")

    # Random predictions
    random_mse = mse(Tensor([3.0]), Tensor([5.0]))          # Off by 2
    random_ce = ce(Tensor([[0.0, 0.0, 0.0]]), Tensor([0]))  # Uniform distribution
    random_bce = bce(Tensor([0.0]), Tensor([1.0]))          # 50% confidence

    print(f"   Random MSE loss: {random_mse.data:.6f}")
    print(f"   Random CE loss: {random_ce.data:.6f}")
    print(f"   Random BCE loss: {random_bce.data:.6f}")

    print("\n🎉 Complete loss function foundation ready!")
    print("   ✅ MSE for regression problems")
    print("   ✅ CrossEntropy for multi-class classification")
    print("   ✅ Binary CrossEntropy for binary classification")
    print("   ✅ Numerically stable implementations")
    print("   ✅ Production-ready batch processing")
    print("   ✅ Ready for neural network training!")