TinyTorch/modules/source/04_losses/losses_dev.py
Vijay Janapa Reddi 828c3d9081 feat: Add CrossEntropyLoss autograd support + Milestone 03 MLP on digits
Key Changes:
- Implemented CrossEntropyBackward for gradient computation
- Integrated CrossEntropyLoss into enable_autograd() patching
- Created comprehensive loss gradient test suite
- Milestone 03: MLP digits classifier (77.5% accuracy)
- Shipped tiny 8x8 digits dataset (67KB) for instant demos
- Updated DataLoader module with ASCII visualizations

Tests:
- All 3 losses (MSE, BCE, CrossEntropy) now have gradient flow
- MLP successfully learns digit classification (6.9% → 77.5%)
- Integration tests pass

Technical:
- CrossEntropyBackward: softmax - one_hot gradient
- Numerically stable via log-softmax
- Works with raw class labels (no one-hot needed)
2025-09-30 16:22:09 -04:00

1367 lines
50 KiB
Python
# ---
# jupyter:
# jupytext:
# text_representation:
# extension: .py
# format_name: percent
# format_version: '1.3'
# jupytext_version: 1.17.1
# kernelspec:
# display_name: Python 3 (ipykernel)
# language: python
# name: python3
# ---
# %% [markdown]
"""
# Module 04: Losses - Measuring How Wrong We Are
Welcome to Module 04! Today you'll implement the mathematical functions that measure how wrong your model's predictions are - the essential feedback signal that enables all machine learning.
## 🔗 Prerequisites & Progress
**You've Built**: Tensors (data), Activations (intelligence), Layers (architecture)
**You'll Build**: Loss functions that measure prediction quality
**You'll Enable**: The feedback signal needed for training (Module 05: Autograd)
**Connection Map**:
```
Layers → Losses → Autograd
(predictions) (error measurement) (learning signals)
```
## Learning Objectives
By the end of this module, you will:
1. Implement MSELoss for regression problems
2. Implement CrossEntropyLoss for classification problems
3. Implement BinaryCrossEntropyLoss for binary classification
4. Understand numerical stability in loss computation
5. Test all loss functions with realistic examples
Let's measure prediction quality!
"""
# %% [markdown]
"""
## 📦 Where This Code Lives in the Final Package
**Learning Side:** You work in modules/04_losses/losses_dev.py
**Building Side:** Code exports to tinytorch.core.losses
```python
# Final package structure:
from tinytorch.core.losses import MSELoss, CrossEntropyLoss, BinaryCrossEntropyLoss, log_softmax # This module
```
**Why this matters:**
- **Learning:** Complete loss function system in one focused module
- **Production:** Proper organization, like PyTorch's loss classes in torch.nn
- **Consistency:** All loss computations and numerical stability in core.losses
- **Integration:** Works seamlessly with layers for complete prediction-to-error workflow
"""
# %% [markdown]
"""
## 📋 Module Prerequisites & Setup
This module builds on previous TinyTorch components. Here's what we need and why:
**Required Components:**
- **Tensor** (Module 01): Foundation for all loss computations
- **Linear** (Module 03): For testing loss functions with realistic predictions
- **ReLU** (Module 02): For building test networks that generate realistic outputs
**Integration Helper:**
The `import_previous_module()` function below helps us cleanly import components from previous modules during development and testing.
"""
# %% nbgrader={"grade": false, "grade_id": "setup", "solution": true}
#| default_exp core.losses
#| export
import numpy as np
from typing import Optional
def import_previous_module(module_name: str, component_name: str):
    import sys
    import os
    sys.path.append(os.path.join(os.path.dirname(__file__), '..', module_name))
    module = __import__(f"{module_name.split('_')[1]}_dev")
    return getattr(module, component_name)
# Import from tinytorch package
from tinytorch.core.tensor import Tensor
from tinytorch.core.layers import Linear
from tinytorch.core.activations import ReLU
# %% [markdown]
"""
# Part 1: Introduction - What Are Loss Functions?
Loss functions are the mathematical conscience of machine learning. They measure the distance between what your model predicts and what actually happened. Without loss functions, models have no way to improve - they're like athletes training without knowing their score.
## The Three Essential Loss Functions
Think of loss functions as different ways to measure "wrongness" - each optimized for different types of problems:
**MSELoss (Mean Squared Error)**: "How far off are my continuous predictions?"
- Used for: Regression (predicting house prices, temperature, stock values)
- Calculation: Average of squared differences between predictions and targets
- Properties: Heavily penalizes large errors, smooth gradients
```
Loss Landscape for MSE:
Loss
^
|
4 | *
| / \
2 | / \
| / \
0 |_/_______\\____> Prediction Error
0 -2 0 +2
Quadratic growth: small errors → small penalty, large errors → huge penalty
```
**CrossEntropyLoss**: "How confident am I in the wrong class?"
- Used for: Multi-class classification (image recognition, text classification)
- Calculation: Negative log-likelihood of correct class probability
- Properties: Encourages confident correct predictions, punishes confident wrong ones
```
Cross-Entropy Penalty Curve:
Loss
^
10 |*
||
5 | \
| \
2 | \
| \
0 |_____\\____> Predicted Probability of Correct Class
0 0.5 1.0
Logarithmic: wrong confident predictions get severe penalty
```
**BinaryCrossEntropyLoss**: "How wrong am I about yes/no decisions?"
- Used for: Binary classification (spam detection, medical diagnosis)
- Calculation: Cross-entropy specialized for two classes
- Properties: Symmetric penalty for false positives and false negatives
```
Binary Decision Boundary:
Target=1 (Positive) Target=0 (Negative)
┌─────────────────┬─────────────────┐
│ Pred → 1.0 │ Pred → 1.0 │
│ Loss → 0 │ Loss → ∞ │
├─────────────────┼─────────────────┤
│ Pred → 0.0 │ Pred → 0.0 │
│ Loss → ∞ │ Loss → 0 │
└─────────────────┴─────────────────┘
```
Each loss function creates a different "error landscape" that guides learning in different ways.
"""
# %% [markdown]
"""
# Part 2: Mathematical Foundations
## Mean Squared Error (MSE)
The foundation of regression, MSE measures the average squared distance between predictions and targets:
```
MSE = (1/N) * Σ(prediction_i - target_i)²
```
**Why square the differences?**
- Makes all errors positive (no cancellation between positive/negative errors)
- Heavily penalizes large errors (error of 2 becomes 4, error of 10 becomes 100)
- Creates smooth gradients for optimization
## Cross-Entropy Loss
For classification, we need to measure how wrong our probability distributions are:
```
CrossEntropy = -Σ target_i * log(prediction_i)
```
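When each target is a one-hot vector, the sum collapses to a single term: the negative log-probability of the true class. A minimal NumPy sketch of that equivalence follows; the values and variable names are illustrative and not part of TinyTorch.

```python
import numpy as np

# Illustrative values only: one sample, three classes.
probs = np.array([0.7, 0.2, 0.1])      # predicted distribution (sums to 1)
one_hot = np.array([1.0, 0.0, 0.0])    # true class is index 0

# Full definition: -sum(target_i * log(prediction_i))
ce_full = -np.sum(one_hot * np.log(probs))

# Shortcut for one-hot targets: -log(probability of the true class)
ce_short = -np.log(probs[0])

print(np.isclose(ce_full, ce_short))
```

This is why the implementation later in the module can index the log-probabilities with class labels instead of building one-hot vectors.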
**The Log-Sum-Exp Trick**:
Computing softmax directly can cause numerical overflow. The log-sum-exp trick provides stability:
```
log_softmax(x) = x - log(Σ exp(x_i))
= x - max(x) - log(Σ exp(x_i - max(x)))
```
This prevents exp(large_number) from exploding to infinity.
## Binary Cross-Entropy
A specialized case where we have only two classes:
```
BCE = -(target * log(prediction) + (1-target) * log(1-prediction))
```
The mathematics naturally handles both "positive" and "negative" cases in a single formula.
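One way to see BCE as a special case: with probability p for the positive class, the two-class distribution is [1-p, p], and general cross-entropy over it reduces to the BCE formula. The sketch below checks this with made-up values; `bce` and `two_class_ce` are hypothetical helpers, not module functions.

```python
import numpy as np

def bce(p, y):
    # Binary cross-entropy for one probability p and one label y in {0, 1}
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def two_class_ce(p, y):
    # General cross-entropy over the two-class distribution [1 - p, p]
    probs = np.array([1 - p, p])
    one_hot = np.array([1 - y, y])
    return -np.sum(one_hot * np.log(probs))

for p, y in [(0.9, 1), (0.9, 0), (0.3, 1)]:
    print(p, y, bce(p, y), two_class_ce(p, y))
```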
"""
# %% [markdown]
"""
# Part 3: Implementation - Building Loss Functions
Let's implement our loss functions with proper numerical stability and clear educational structure.
"""
# %% [markdown]
"""
## Log-Softmax - The Numerically Stable Foundation
Before implementing loss functions, we need a reliable way to compute log-softmax. This function is the numerically stable backbone of classification losses.
### Why Log-Softmax Matters
Naive softmax can explode with large numbers:
```
Naive approach:
logits = [100, 200, 300]
exp(300) ≈ 1.94 × 10^130 ← Far beyond float32's max (≈ 3.4 × 10^38); float64 overflows near exp(710)
Stable approach:
max_logit = 300
shifted = [-200, -100, 0] ← Subtract max
exp(0) = 1.0 ← Manageable numbers
```
### The Log-Sum-Exp Trick Visualization
```
Original Computation: Stable Computation:
logits: [a, b, c] logits: [a, b, c]
↓ ↓
exp(logits) max_val = max(a,b,c)
↓ ↓
sum(exp(logits)) shifted = [a-max, b-max, c-max]
↓ ↓
log(sum) exp(shifted) ← All ≤ 1.0
↓ ↓
logits - log(sum) sum(exp(shifted))
log(sum) + max_val
logits - (log(sum) + max_val)
```
Both give the same result, but the stable version never overflows!
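The difference is easy to observe directly. The sketch below uses plain NumPy arrays rather than `Tensor`, with logits chosen (as an assumption for illustration) to be large enough that `exp` overflows even in float64.

```python
import numpy as np

logits = np.array([1000.0, 1001.0, 1002.0])

# Naive: exp(1000) overflows float64 to inf, so every entry ends up non-finite.
with np.errstate(over="ignore", invalid="ignore"):
    naive = logits - np.log(np.sum(np.exp(logits)))

# Stable: shift by the max so every exponent is <= 0.
max_val = np.max(logits)
shifted = logits - max_val
stable = shifted - np.log(np.sum(np.exp(shifted)))

print(naive)   # non-finite entries
print(stable)  # finite log-probabilities
```

Exponentiating the stable result gives probabilities that sum to 1, which is the same property the unit test below checks.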
"""
# %% nbgrader={"grade": false, "grade_id": "log_softmax", "solution": true}
#| export
def log_softmax(x: Tensor, dim: int = -1) -> Tensor:
    """
    Compute log-softmax with numerical stability.

    TODO: Implement numerically stable log-softmax using the log-sum-exp trick

    APPROACH:
    1. Find maximum along dimension (for stability)
    2. Subtract max from input (prevents overflow)
    3. Compute log(sum(exp(shifted_input)))
    4. Return input - max - log_sum_exp

    EXAMPLE:
    >>> logits = Tensor([[1.0, 2.0, 3.0], [0.1, 0.2, 0.9]])
    >>> result = log_softmax(logits, dim=-1)
    >>> print(result.shape)
    (2, 3)

    HINT: Use np.max(x.data, axis=dim, keepdims=True) to preserve dimensions
    """
    ### BEGIN SOLUTION
    # Step 1: Find max along dimension for numerical stability
    max_vals = np.max(x.data, axis=dim, keepdims=True)
    # Step 2: Subtract max to prevent overflow
    shifted = x.data - max_vals
    # Step 3: Compute log(sum(exp(shifted)))
    log_sum_exp = np.log(np.sum(np.exp(shifted), axis=dim, keepdims=True))
    # Step 4: Return log_softmax = input - max - log_sum_exp
    result = x.data - max_vals - log_sum_exp
    return Tensor(result)
    ### END SOLUTION
# %% nbgrader={"grade": true, "grade_id": "test_log_softmax", "locked": true, "points": 10}
def test_unit_log_softmax():
    """🔬 Test log_softmax numerical stability and correctness."""
    print("🔬 Unit Test: Log-Softmax...")
    # Test basic functionality
    x = Tensor([[1.0, 2.0, 3.0], [0.1, 0.2, 0.9]])
    result = log_softmax(x, dim=-1)
    # Verify shape preservation
    assert result.shape == x.shape, f"Shape mismatch: expected {x.shape}, got {result.shape}"
    # Verify log-softmax properties: exp(log_softmax) should sum to 1
    softmax_result = np.exp(result.data)
    row_sums = np.sum(softmax_result, axis=-1)
    assert np.allclose(row_sums, 1.0, atol=1e-6), f"Softmax doesn't sum to 1: {row_sums}"
    # Test numerical stability with large values
    large_x = Tensor([[100.0, 101.0, 102.0]])
    large_result = log_softmax(large_x, dim=-1)
    assert not np.any(np.isnan(large_result.data)), "NaN values in result with large inputs"
    assert not np.any(np.isinf(large_result.data)), "Inf values in result with large inputs"
    print("✅ log_softmax works correctly with numerical stability!")

if __name__ == "__main__":
    test_unit_log_softmax()
# %% [markdown]
"""
## MSELoss - Measuring Continuous Prediction Quality
Mean Squared Error is the workhorse of regression problems. It measures how far your continuous predictions are from the true values.
### When to Use MSE
**Perfect for:**
- House price prediction ($200k vs $195k)
- Temperature forecasting (25°C vs 23°C)
- Stock price prediction ($150 vs $148)
- Any continuous value where "distance" matters
### How MSE Shapes Learning
```
Prediction vs Target Visualization:
Target = 100
Prediction: 80 90 95 100 105 110 120
Error: -20 -10 -5 0 +5 +10 +20
MSE: 400 100 25 0 25 100 400
Loss Curve:
MSE
^
400 |* *
|
100 | * *
| \
25 | * *
| \\ /
0 |_____*_____> Prediction
80 100 120
Quadratic penalty: Large errors are MUCH more costly than small errors
```
### Why Square the Errors?
1. **Positive penalties**: (-10)² = 100, same as (+10)² = 100
2. **Heavy punishment for large errors**: Error of 20 → penalty of 400
3. **Smooth gradients**: Quadratic function has nice derivatives for optimization
4. **Statistical foundation**: Maximum likelihood for Gaussian noise
### MSE vs Other Regression Losses
```
Error Sensitivity Comparison:
Error: -10 -5 0 +5 +10
MSE: 100 25 0 25 100 ← Quadratic growth
MAE: 10 5 0 5 10 ← Linear growth
Huber: 37.5 12.5 0 12.5 37.5 ← Hybrid approach (δ = 5)
MSE: More sensitive to outliers
MAE: More robust to outliers
Huber: Quadratic for small errors, linear for large ones
```
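The three penalties can be computed per error with a few lines of NumPy. This is a sketch, not part of the module: the `huber` helper and its threshold `delta=5.0` are assumptions chosen for illustration, so that an error of 10 falls in Huber's linear region.

```python
import numpy as np

def huber(error, delta=5.0):
    # Quadratic for |error| <= delta, linear beyond it (delta is an assumed threshold).
    abs_err = np.abs(error)
    return np.where(abs_err <= delta,
                    0.5 * error ** 2,
                    delta * (abs_err - 0.5 * delta))

errors = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])
squared = errors ** 2          # per-element squared-error penalty
absolute = np.abs(errors)      # per-element absolute-error penalty
hub = huber(errors)

for e, s, a, h in zip(errors, squared, absolute, hub):
    print(f"error={e:+5.1f}  squared={s:6.1f}  absolute={a:5.1f}  huber={h:5.1f}")
```

Changing `delta` moves the point where Huber switches from quadratic to linear growth.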
"""
# %% nbgrader={"grade": false, "grade_id": "mse_loss", "solution": true}
#| export
class MSELoss:
    """Mean Squared Error loss for regression tasks."""

    def __init__(self):
        """Initialize MSE loss function."""
        pass

    def forward(self, predictions: Tensor, targets: Tensor) -> Tensor:
        """
        Compute mean squared error between predictions and targets.

        TODO: Implement MSE loss calculation

        APPROACH:
        1. Compute difference: predictions - targets
        2. Square the differences: diff²
        3. Take mean across all elements

        EXAMPLE:
        >>> loss_fn = MSELoss()
        >>> predictions = Tensor([1.0, 2.0, 3.0])
        >>> targets = Tensor([1.5, 2.5, 2.8])
        >>> loss = loss_fn(predictions, targets)
        >>> print(f"MSE Loss: {loss.data:.4f}")
        MSE Loss: 0.1800

        HINTS:
        - Use (predictions.data - targets.data) for element-wise difference
        - Square with **2 or np.power(diff, 2)
        - Use np.mean() to average over all elements
        """
        ### BEGIN SOLUTION
        # Step 1: Compute element-wise difference
        diff = predictions.data - targets.data
        # Step 2: Square the differences
        squared_diff = diff ** 2
        # Step 3: Take mean across all elements
        mse = np.mean(squared_diff)
        return Tensor(mse)
        ### END SOLUTION

    def __call__(self, predictions: Tensor, targets: Tensor) -> Tensor:
        """Allows the loss function to be called like a function."""
        return self.forward(predictions, targets)

    def backward(self) -> Tensor:
        """
        Compute gradients (implemented in Module 05: Autograd).
        For now, this is a stub that students can ignore.
        """
        pass
# %% nbgrader={"grade": true, "grade_id": "test_mse_loss", "locked": true, "points": 10}
def test_unit_mse_loss():
    """🔬 Test MSELoss implementation and properties."""
    print("🔬 Unit Test: MSE Loss...")
    loss_fn = MSELoss()
    # Test perfect predictions (loss should be 0)
    predictions = Tensor([1.0, 2.0, 3.0])
    targets = Tensor([1.0, 2.0, 3.0])
    perfect_loss = loss_fn.forward(predictions, targets)
    assert np.allclose(perfect_loss.data, 0.0, atol=1e-7), f"Perfect predictions should have 0 loss, got {perfect_loss.data}"
    # Test known case
    predictions = Tensor([1.0, 2.0, 3.0])
    targets = Tensor([1.5, 2.5, 2.8])
    loss = loss_fn.forward(predictions, targets)
    # Manual calculation: ((1-1.5)² + (2-2.5)² + (3-2.8)²) / 3 = (0.25 + 0.25 + 0.04) / 3 = 0.18
    expected_loss = (0.25 + 0.25 + 0.04) / 3
    assert np.allclose(loss.data, expected_loss, atol=1e-6), f"Expected {expected_loss}, got {loss.data}"
    # Test that loss is always non-negative
    random_pred = Tensor(np.random.randn(10))
    random_target = Tensor(np.random.randn(10))
    random_loss = loss_fn.forward(random_pred, random_target)
    assert random_loss.data >= 0, f"MSE loss should be non-negative, got {random_loss.data}"
    print("✅ MSELoss works correctly!")

if __name__ == "__main__":
    test_unit_mse_loss()
# %% [markdown]
"""
## CrossEntropyLoss - Measuring Classification Confidence
Cross-entropy loss is the gold standard for multi-class classification. It measures how wrong your probability predictions are and heavily penalizes confident mistakes.
### When to Use Cross-Entropy
**Perfect for:**
- Image classification (cat, dog, bird)
- Text classification (spam, ham, promotion)
- Language modeling (next word prediction)
- Any problem with mutually exclusive classes
### Understanding Cross-Entropy Through Examples
```
Scenario: Image Classification (3 classes: cat, dog, bird)
Case 1: Correct and Confident
Model Output (logits): [5.0, 1.0, 0.1] ← Very confident about "cat"
After Softmax: [0.975, 0.018, 0.007]
True Label: cat (class 0)
Loss: -log(0.975) = 0.025 ← Very low loss ✅
Case 2: Correct but Uncertain
Model Output: [1.1, 1.0, 0.9] ← Uncertain between classes
After Softmax: [0.367, 0.332, 0.301]
True Label: cat (class 0)
Loss: -log(0.367) = 1.00 ← Higher loss (uncertainty penalized)
Case 3: Wrong and Confident
Model Output: [0.1, 5.0, 1.0] ← Very confident about "dog"
After Softmax: [0.007, 0.975, 0.018]
True Label: cat (class 0)
Loss: -log(0.0073) = 4.9 ← Very high loss ❌
```
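The three scenarios can be checked numerically. The sketch below uses plain NumPy instead of the module's classes; `cross_entropy_for` is a hypothetical helper that applies a stable log-softmax to one sample and picks out the true class.

```python
import numpy as np

def cross_entropy_for(logits, target):
    # Stable log-softmax for a single sample, then select the true class.
    logits = np.asarray(logits, dtype=float)
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[target]

cases = {
    "correct and confident": [5.0, 1.0, 0.1],
    "correct but uncertain": [1.1, 1.0, 0.9],
    "wrong and confident": [0.1, 5.0, 1.0],
}
# The true label is "cat" (class 0) in every case.
losses = {name: cross_entropy_for(l, target=0) for name, l in cases.items()}
for name, loss in losses.items():
    print(f"{name}: loss = {loss:.3f}")
```

The ordering of the three losses is the point: the penalty rises from confident-and-correct to uncertain to confident-and-wrong.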
### Cross-Entropy's Learning Signal
```
What Cross-Entropy Teaches the Model:
┌─────────────────┬─────────────────┬───────────────────────────┐
│ Prediction      │ True Label      │ Learning Signal           │
├─────────────────┼─────────────────┼───────────────────────────┤
│ Confident ✅    │ Correct ✅      │ "Keep doing this"         │
│ Uncertain ⚠️    │ Correct ✅      │ "Be more confident"       │
│ Confident ❌    │ Wrong ❌        │ "STOP! Change everything" │
│ Uncertain ⚠️    │ Wrong ❌        │ "Learn the right answer"  │
└─────────────────┴─────────────────┴───────────────────────────┘
Loss Landscape by Confidence:
Loss
^
5 |*
||
3 | *
| \
1 | *
| \\
0 |______**____> Predicted Probability (correct class)
0 0.5 1.0
Message: "Be confident when you're right!"
```
### Why Cross-Entropy Works So Well
1. **Probabilistic interpretation**: Measures quality of probability distributions
2. **Strong gradients**: Large penalty for confident mistakes drives fast learning
3. **Smooth optimization**: Log function provides nice gradients
4. **Information theory**: Minimizes "surprise" about correct answers
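The "strong gradients" point can be made concrete. For a single sample, the gradient of cross-entropy with respect to the logits is softmax(logits) minus the one-hot target. The sketch below checks that identity against central finite differences. It uses plain NumPy and hypothetical helper names (`softmax`, `ce_loss`), and is not the module's autograd implementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def ce_loss(z, target):
    # Cross-entropy for one sample via a stable log-softmax.
    shifted = z - np.max(z)
    return -(shifted[target] - np.log(np.sum(np.exp(shifted))))

logits = np.array([0.1, 5.0, 1.0])   # confident about class 1
target = 0                           # but the true class is 0

# Analytic gradient: softmax minus one-hot.
one_hot = np.zeros_like(logits)
one_hot[target] = 1.0
analytic = softmax(logits) - one_hot

# Numerical gradient via central differences.
eps = 1e-6
numeric = np.zeros_like(logits)
for i in range(logits.size):
    step = np.zeros_like(logits)
    step[i] = eps
    numeric[i] = (ce_loss(logits + step, target) - ce_loss(logits - step, target)) / (2 * eps)

print(analytic)
print(numeric)
```

For this confident mistake the gradient on the true class is close to -1 and the gradient on the wrongly favored class is close to +1, which is what drives the fast correction.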
### Multi-Class vs Binary Classification
```
Multi-Class (3+ classes): Binary (2 classes):
Classes: [cat, dog, bird] Classes: [spam, not_spam]
Output: [0.7, 0.2, 0.1] Output: 0.8 (spam probability)
Must sum to 1.0 ✅ Must be between 0 and 1 ✅
Uses: CrossEntropyLoss Uses: BinaryCrossEntropyLoss
```
"""
# %% nbgrader={"grade": false, "grade_id": "cross_entropy_loss", "solution": true}
#| export
class CrossEntropyLoss:
    """Cross-entropy loss for multi-class classification."""

    def __init__(self):
        """Initialize cross-entropy loss function."""
        pass

    def forward(self, logits: Tensor, targets: Tensor) -> Tensor:
        """
        Compute cross-entropy loss between logits and target class indices.

        TODO: Implement cross-entropy loss with numerical stability

        APPROACH:
        1. Compute log-softmax of logits (numerically stable)
        2. Select log-probabilities for correct classes
        3. Return negative mean of selected log-probabilities

        EXAMPLE:
        >>> loss_fn = CrossEntropyLoss()
        >>> logits = Tensor([[2.0, 1.0, 0.1], [0.5, 1.5, 0.8]])  # 2 samples, 3 classes
        >>> targets = Tensor([0, 1])  # First sample is class 0, second is class 1
        >>> loss = loss_fn(logits, targets)
        >>> print(f"Cross-Entropy Loss: {loss.data:.4f}")

        HINTS:
        - Use log_softmax() for numerical stability
        - targets.data.astype(int) ensures integer indices
        - Use np.arange(batch_size) for row indexing: log_probs[np.arange(batch_size), targets]
        - Return negative mean: -np.mean(selected_log_probs)
        """
        ### BEGIN SOLUTION
        # Step 1: Compute log-softmax for numerical stability
        log_probs = log_softmax(logits, dim=-1)
        # Step 2: Select log-probabilities for correct classes
        batch_size = logits.shape[0]
        target_indices = targets.data.astype(int)
        # Select correct class log-probabilities using advanced indexing
        selected_log_probs = log_probs.data[np.arange(batch_size), target_indices]
        # Step 3: Return negative mean (cross-entropy is negative log-likelihood)
        cross_entropy = -np.mean(selected_log_probs)
        return Tensor(cross_entropy)
        ### END SOLUTION

    def __call__(self, logits: Tensor, targets: Tensor) -> Tensor:
        """Allows the loss function to be called like a function."""
        return self.forward(logits, targets)

    def backward(self) -> Tensor:
        """
        Compute gradients (implemented in Module 05: Autograd).
        For now, this is a stub that students can ignore.
        """
        pass
# %% nbgrader={"grade": true, "grade_id": "test_cross_entropy_loss", "locked": true, "points": 10}
def test_unit_cross_entropy_loss():
    """🔬 Test CrossEntropyLoss implementation and properties."""
    print("🔬 Unit Test: Cross-Entropy Loss...")
    loss_fn = CrossEntropyLoss()
    # Test perfect predictions (should have very low loss)
    perfect_logits = Tensor([[10.0, -10.0, -10.0], [-10.0, 10.0, -10.0]])  # Very confident predictions
    targets = Tensor([0, 1])  # Matches the confident predictions
    perfect_loss = loss_fn.forward(perfect_logits, targets)
    assert perfect_loss.data < 0.01, f"Perfect predictions should have very low loss, got {perfect_loss.data}"
    # Test uniform predictions (should have loss ≈ log(num_classes))
    uniform_logits = Tensor([[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]])  # Equal probabilities
    uniform_targets = Tensor([0, 1])
    uniform_loss = loss_fn.forward(uniform_logits, uniform_targets)
    expected_uniform_loss = np.log(3)  # log(3) ≈ 1.099 for 3 classes
    assert np.allclose(uniform_loss.data, expected_uniform_loss, atol=0.1), f"Uniform predictions should have loss ≈ log(3) = {expected_uniform_loss:.3f}, got {uniform_loss.data:.3f}"
    # Test that wrong confident predictions have high loss
    wrong_logits = Tensor([[10.0, -10.0, -10.0], [-10.0, -10.0, 10.0]])  # Confident but wrong
    wrong_targets = Tensor([1, 1])  # Opposite of confident predictions
    wrong_loss = loss_fn.forward(wrong_logits, wrong_targets)
    assert wrong_loss.data > 5.0, f"Wrong confident predictions should have high loss, got {wrong_loss.data}"
    # Test numerical stability with large logits
    large_logits = Tensor([[100.0, 50.0, 25.0]])
    large_targets = Tensor([0])
    large_loss = loss_fn.forward(large_logits, large_targets)
    assert not np.isnan(large_loss.data), "Loss should not be NaN with large logits"
    assert not np.isinf(large_loss.data), "Loss should not be infinite with large logits"
    print("✅ CrossEntropyLoss works correctly!")

if __name__ == "__main__":
    test_unit_cross_entropy_loss()
# %% [markdown]
"""
## BinaryCrossEntropyLoss - Measuring Yes/No Decision Quality
Binary Cross-Entropy is specialized for yes/no decisions. It's like regular cross-entropy but optimized for the special case of exactly two classes.
### When to Use Binary Cross-Entropy
**Perfect for:**
- Spam detection (spam vs not spam)
- Medical diagnosis (disease vs healthy)
- Fraud detection (fraud vs legitimate)
- Content moderation (toxic vs safe)
- Any two-class decision problem
### Understanding Binary Cross-Entropy
```
Binary Classification Decision Matrix:
TRUE LABEL
Positive Negative
PREDICTED P TP FP ← Model says "Yes"
N FN TN ← Model says "No"
BCE Loss for each quadrant:
- True Positive (TP): -log(prediction) ← Reward confident correct "Yes"
- False Positive (FP): -log(1-prediction) ← Punish confident wrong "Yes"
- False Negative (FN): -log(prediction) ← Punish confident wrong "No"
- True Negative (TN): -log(1-prediction) ← Reward confident correct "No"
```
### Binary Cross-Entropy Behavior Examples
```
Scenario: Spam Detection
Case 1: Perfect Spam Detection
Email: "Buy now! 50% off! Limited time!"
Model Prediction: 0.99 (99% spam probability)
True Label: 1 (actually spam)
Loss: -log(0.99) = 0.01 ← Very low loss ✅
Case 2: Uncertain About Spam
Email: "Meeting rescheduled to 2pm"
Model Prediction: 0.51 (slightly thinks spam)
True Label: 0 (actually not spam)
Loss: -log(1-0.51) = -log(0.49) = 0.71 ← Moderate loss
Case 3: Confident Wrong Prediction
Email: "Hi mom, how are you?"
Model Prediction: 0.95 (very confident spam)
True Label: 0 (actually not spam)
Loss: -log(1-0.95) = -log(0.05) = 3.0 ← High loss ❌
```
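The three spam cases can be reproduced with a short NumPy sketch. `bce_single` is a hypothetical helper for one prediction and one label; the clamp with `eps` is an assumption added here to avoid log(0).

```python
import numpy as np

def bce_single(p, y, eps=1e-7):
    # Clamp the probability away from 0 and 1, then apply the BCE formula.
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# (prediction, true label) for the three emails above
cases = [(0.99, 1), (0.51, 0), (0.95, 0)]
spam_losses = [bce_single(p, y) for p, y in cases]
for (p, y), loss in zip(cases, spam_losses):
    print(f"prediction={p:.2f} label={y} loss={loss:.2f}")
```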
### Binary vs Multi-Class Cross-Entropy
```
Binary Cross-Entropy: Regular Cross-Entropy:
Single probability output Probability distribution output
Predict: 0.8 (spam prob) Predict: [0.1, 0.8, 0.1] (3 classes)
Target: 1.0 (is spam) Target: 1 (class index)
Formula: Formula:
-[y*log(p) + (1-y)*log(1-p)] -log(p[target_class])
Usually paired with sigmoid Usually paired with softmax
Optimized for 2-class case General for N classes
```
### Why Binary Cross-Entropy is Special
1. **Symmetric penalties**: False positives and false negatives treated equally
2. **Probability calibration**: Output directly interpretable as probability
3. **Efficient computation**: Simpler than full softmax for binary cases
4. **Steep penalty for confident errors**: Loss grows without bound as a wrong prediction approaches certainty, which matters in safety-critical binary decisions
### Loss Landscape Visualization
```
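Binary Cross-Entropy Loss Curve (Target = 1.0, positive class):
 Loss
  ^
4.6 |*                      ← Confident and wrong (prediction = 0.01)
    | *
1.6 |   *                   ← prediction = 0.2
    |     *
0.7 |        *              ← Uncertain (prediction = 0.5)
    |            *
0.2 |                *      ← prediction = 0.8
  0 |___________________*_> Prediction
    0    0.2   0.5   0.8  1.0
For Target = 0.0 the curve is mirrored: loss rises as the prediction approaches 1.0.
Message: "Be confident about the correct class; uncertainty costs a little,
but confidence in the wrong class costs a lot!"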
```
"""
# %% nbgrader={"grade": false, "grade_id": "binary_cross_entropy_loss", "solution": true}
#| export
class BinaryCrossEntropyLoss:
    """Binary cross-entropy loss for binary classification."""

    def __init__(self):
        """Initialize binary cross-entropy loss function."""
        pass

    def forward(self, predictions: Tensor, targets: Tensor) -> Tensor:
        """
        Compute binary cross-entropy loss.

        TODO: Implement binary cross-entropy with numerical stability

        APPROACH:
        1. Clamp predictions to avoid log(0) and log(1)
        2. Compute: -(targets * log(predictions) + (1-targets) * log(1-predictions))
        3. Return mean across all samples

        EXAMPLE:
        >>> loss_fn = BinaryCrossEntropyLoss()
        >>> predictions = Tensor([0.9, 0.1, 0.7, 0.3])  # Probabilities between 0 and 1
        >>> targets = Tensor([1.0, 0.0, 1.0, 0.0])  # Binary labels
        >>> loss = loss_fn(predictions, targets)
        >>> print(f"Binary Cross-Entropy Loss: {loss.data:.4f}")

        HINTS:
        - Use np.clip(predictions.data, 1e-7, 1-1e-7) to prevent log(0)
        - Binary cross-entropy: -(targets * log(preds) + (1-targets) * log(1-preds))
        - Use np.mean() to average over all samples
        """
        ### BEGIN SOLUTION
        # Step 1: Clamp predictions to avoid numerical issues with log(0) and log(1)
        eps = 1e-7
        clamped_preds = np.clip(predictions.data, eps, 1 - eps)
        # Step 2: Compute binary cross-entropy
        # BCE = -(targets * log(preds) + (1-targets) * log(1-preds))
        log_preds = np.log(clamped_preds)
        log_one_minus_preds = np.log(1 - clamped_preds)
        bce_per_sample = -(targets.data * log_preds + (1 - targets.data) * log_one_minus_preds)
        # Step 3: Return mean across all samples
        bce_loss = np.mean(bce_per_sample)
        return Tensor(bce_loss)
        ### END SOLUTION

    def __call__(self, predictions: Tensor, targets: Tensor) -> Tensor:
        """Allows the loss function to be called like a function."""
        return self.forward(predictions, targets)

    def backward(self) -> Tensor:
        """
        Compute gradients (implemented in Module 05: Autograd).
        For now, this is a stub that students can ignore.
        """
        pass
# %% nbgrader={"grade": true, "grade_id": "test_binary_cross_entropy_loss", "locked": true, "points": 10}
def test_unit_binary_cross_entropy_loss():
    """🔬 Test BinaryCrossEntropyLoss implementation and properties."""
    print("🔬 Unit Test: Binary Cross-Entropy Loss...")
    loss_fn = BinaryCrossEntropyLoss()
    # Test perfect predictions
    perfect_predictions = Tensor([0.9999, 0.0001, 0.9999, 0.0001])
    targets = Tensor([1.0, 0.0, 1.0, 0.0])
    perfect_loss = loss_fn.forward(perfect_predictions, targets)
    assert perfect_loss.data < 0.01, f"Perfect predictions should have very low loss, got {perfect_loss.data}"
    # Test worst predictions
    worst_predictions = Tensor([0.0001, 0.9999, 0.0001, 0.9999])
    worst_targets = Tensor([1.0, 0.0, 1.0, 0.0])
    worst_loss = loss_fn.forward(worst_predictions, worst_targets)
    assert worst_loss.data > 5.0, f"Worst predictions should have high loss, got {worst_loss.data}"
    # Test uniform predictions (probability = 0.5)
    uniform_predictions = Tensor([0.5, 0.5, 0.5, 0.5])
    uniform_targets = Tensor([1.0, 0.0, 1.0, 0.0])
    uniform_loss = loss_fn.forward(uniform_predictions, uniform_targets)
    expected_uniform = -np.log(0.5)  # Should be about 0.693
    assert np.allclose(uniform_loss.data, expected_uniform, atol=0.01), f"Uniform predictions should have loss ≈ {expected_uniform:.3f}, got {uniform_loss.data:.3f}"
    # Test numerical stability at boundaries
    boundary_predictions = Tensor([0.0, 1.0, 0.0, 1.0])
    boundary_targets = Tensor([0.0, 1.0, 1.0, 0.0])
    boundary_loss = loss_fn.forward(boundary_predictions, boundary_targets)
    assert not np.isnan(boundary_loss.data), "Loss should not be NaN at boundaries"
    assert not np.isinf(boundary_loss.data), "Loss should not be infinite at boundaries"
    print("✅ BinaryCrossEntropyLoss works correctly!")

if __name__ == "__main__":
    test_unit_binary_cross_entropy_loss()
# %% [markdown]
"""
# Part 4: Integration - Bringing It Together
Now let's test how our loss functions work together with real data scenarios and explore their behavior with different types of predictions.
## Real-World Loss Function Usage Patterns
Understanding when and why to use each loss function is crucial for ML engineering success:
```
Problem Type Decision Tree:
What are you predicting?
┌────┼────┐
│ │
Continuous Categorical
Values Classes
│ │
│ ┌───┼───┐
│ │ │
│ 2 Classes 3+ Classes
│ │ │
MSELoss BCE Loss CE Loss
Examples:
MSE: House prices, temperature, stock values
BCE: Spam detection, fraud detection, medical diagnosis
CE: Image classification, language modeling, multiclass text classification
```
## Loss Function Behavior Comparison
Each loss function creates different learning pressures on your model:
```
Error Sensitivity Comparison:
Small Error (0.1):    Medium Error (0.5):    Large Error (2.0):
MSE: 0.01             MSE: 0.25              MSE: 4.0
BCE: 0.11             BCE: 0.69              BCE: ∞ (clips to large)
CE:  0.11             CE:  0.69              CE:  ∞ (clips to large)
MSE: Quadratic growth; the penalty stays finite for any finite error, but outliers dominate the average
BCE/CE: -log(p) growth; the penalty is unbounded as the true-class probability approaches 0
```
"""
# %% nbgrader={"grade": false, "grade_id": "loss_comparison", "solution": true}
def compare_loss_behaviors():
    """
    🔬 Compare how different loss functions behave with various prediction patterns.
    This helps students understand when to use each loss function.
    """
    print("🔬 Integration Test: Loss Function Behavior Comparison...")
    # Initialize loss functions
    mse_loss = MSELoss()
    ce_loss = CrossEntropyLoss()
    bce_loss = BinaryCrossEntropyLoss()

    print("\n1. Regression Scenario (House Price Prediction)")
    print(" Predictions: [200k, 250k, 300k], Targets: [195k, 260k, 290k]")
    house_pred = Tensor([200.0, 250.0, 300.0])  # In thousands
    house_target = Tensor([195.0, 260.0, 290.0])
    mse = mse_loss.forward(house_pred, house_target)
    print(f" MSE Loss: {mse.data:.2f} (thousand²)")

    print("\n2. Multi-Class Classification (Image Recognition)")
    print(" Classes: [cat, dog, bird], Predicted: sample 1 favors cat, sample 2 favors dog")
    # Row 1 logits [2.0, 0.5, 0.1] favor class 0 (cat); row 2 logits [0.3, 1.8, 0.2] favor class 1 (dog)
    image_logits = Tensor([[2.0, 0.5, 0.1], [0.3, 1.8, 0.2]])  # Two samples
    image_targets = Tensor([0, 1])  # First is cat (0), second is dog (1)
    ce = ce_loss.forward(image_logits, image_targets)
    print(f" Cross-Entropy Loss: {ce.data:.3f}")

    print("\n3. Binary Classification (Spam Detection)")
    print(" Predictions: [0.9, 0.1, 0.7, 0.3] (spam probabilities)")
    spam_pred = Tensor([0.9, 0.1, 0.7, 0.3])
    spam_target = Tensor([1.0, 0.0, 1.0, 0.0])  # 1=spam, 0=not spam
    bce = bce_loss.forward(spam_pred, spam_target)
    print(f" Binary Cross-Entropy Loss: {bce.data:.3f}")

    print("\n💡 Key Insights:")
    print(" - MSE penalizes large errors heavily (good for continuous values)")
    print(" - Cross-Entropy encourages confident correct predictions")
    print(" - Binary Cross-Entropy balances false positives and negatives")
    return mse.data, ce.data, bce.data
# %% nbgrader={"grade": false, "grade_id": "loss_sensitivity", "solution": true}
def analyze_loss_sensitivity():
    """
    📊 Analyze how sensitive each loss function is to prediction errors.

    This demonstrates the different error landscapes created by each loss.
    """
    print("\n📊 Analysis: Loss Function Sensitivity to Errors...")

    # Create a range of predictions around the true value
    true_value = 1.0
    predictions = np.linspace(0.1, 1.9, 50)  # From 0.1 to 1.9

    # Initialize loss functions
    mse_loss = MSELoss()
    bce_loss = BinaryCrossEntropyLoss()

    mse_losses = []
    bce_losses = []

    for pred in predictions:
        # MSE analysis
        pred_tensor = Tensor([pred])
        target_tensor = Tensor([true_value])
        mse = mse_loss.forward(pred_tensor, target_tensor)
        mse_losses.append(mse.data)

        # BCE analysis (clamp prediction to valid probability range).
        # Every prediction above 0.99 is clamped, so the BCE curve is flat there.
        clamped_pred = max(0.01, min(0.99, pred))
        bce_pred_tensor = Tensor([clamped_pred])
        bce_target_tensor = Tensor([1.0])  # Target is "positive class"
        bce = bce_loss.forward(bce_pred_tensor, bce_target_tensor)
        bce_losses.append(bce.data)

    # Find minimum losses
    min_mse_idx = np.argmin(mse_losses)
    min_bce_idx = np.argmin(bce_losses)

    # The grid does not contain 0.5 exactly (index 24 is about 0.98),
    # so look up the grid point closest to 0.5 instead of hard-coding an index.
    half_idx = int(np.argmin(np.abs(predictions - 0.5)))

    print("MSE Loss:")
    print(f" Minimum at prediction = {predictions[min_mse_idx]:.2f}, loss = {mse_losses[min_mse_idx]:.4f}")
    print(f" At prediction = {predictions[half_idx]:.2f}: loss = {mse_losses[half_idx]:.4f}")
    print(f" At prediction = {predictions[0]:.2f}: loss = {mse_losses[0]:.4f}")

    print("\nBinary Cross-Entropy Loss:")
    print(f" Minimum at prediction = {predictions[min_bce_idx]:.2f}, loss = {bce_losses[min_bce_idx]:.4f}")
    print(f" At prediction = {predictions[half_idx]:.2f}: loss = {bce_losses[half_idx]:.4f}")
    print(f" At prediction = {predictions[0]:.2f}: loss = {bce_losses[0]:.4f}")

    print("\n💡 Sensitivity Insights:")
    print(" - MSE grows quadratically with error distance")
    print(" - BCE follows -log(p): it grows without bound as a confident prediction approaches the wrong label")
    print(" - Both encourage correct predictions but with different curvatures")
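# %% [markdown]
"""
To make the curvature difference concrete, the sketch below evaluates both per-sample penalties by hand for a positive target (label 1). It is a NumPy-only illustration, not the module's `MSELoss` or `BinaryCrossEntropyLoss`; the helper names `squared_error` and `neg_log_likelihood` are hypothetical and exist only for this example.

```python
import numpy as np

def squared_error(p, target=1.0):
    # Per-sample MSE term: quadratic in the distance from the target
    return (p - target) ** 2

def neg_log_likelihood(p):
    # Per-sample BCE term when the true label is 1: -log(p)
    return -np.log(p)

probs = np.array([0.9, 0.5, 0.1, 0.01])
for p, se, nll in zip(probs, squared_error(probs), neg_log_likelihood(probs)):
    print(f"p={p:<5} squared error={se:.4f}  -log(p)={nll:.4f}")
```

On the interval [0, 1] the squared error never exceeds 1, while `-log(p)` keeps growing as `p` approaches 0. That is why BCE punishes confident wrong predictions much harder than MSE does.
"""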
# %% [markdown]
"""
# Part 5: Systems Analysis - Understanding Loss Function Performance
Loss functions seem simple, but they have important computational and numerical properties that affect training performance. Let's analyze the systems aspects.
## Computational Complexity Analysis
Different loss functions have different computational costs, especially at scale:
```
Computational Cost Comparison (Batch Size B, Classes C):

MSELoss (one output per sample):
┌─────────────────┬────────────┐
│ Operation       │ Complexity │
├─────────────────┼────────────┤
│ Subtraction     │ O(B)       │
│ Squaring        │ O(B)       │
│ Mean            │ O(B)       │
│ Total           │ O(B)       │
└─────────────────┴────────────┘

CrossEntropyLoss:
┌─────────────────┬────────────┐
│ Operation       │ Complexity │
├─────────────────┼────────────┤
│ Max (stability) │ O(B*C)     │
│ Exponential     │ O(B*C)     │
│ Sum             │ O(B*C)     │
│ Log             │ O(B)       │
│ Indexing        │ O(B)       │
│ Total           │ O(B*C)     │
└─────────────────┴────────────┘

Cross-entropy does roughly C times more work than scalar-output MSE.
For ImageNet (C=1000), that is on the order of 1000x more arithmetic in the loss.
```
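
The rows of the CrossEntropyLoss table map one-to-one onto array operations. The following is a minimal NumPy sketch of that mapping, assuming `[B, C]` float logits and `[B]` integer targets. It is not the module's `CrossEntropyLoss`, and the function name `cross_entropy_steps` is hypothetical.

```python
import numpy as np

def cross_entropy_steps(logits, targets):
    # logits: [B, C] float array, targets: [B] integer class indices
    batch = np.arange(logits.shape[0])
    row_max = logits.max(axis=1, keepdims=True)   # Max (stability): O(B*C)
    shifted = logits - row_max                    # Subtract max:    O(B*C)
    exp_vals = np.exp(shifted)                    # Exponential:     O(B*C)
    sums = exp_vals.sum(axis=1)                   # Sum:             O(B*C)
    log_sums = np.log(sums)                       # Log:             O(B)
    picked = shifted[batch, targets]              # Indexing:        O(B)
    # -log_softmax at the target class, averaged over the batch
    return float(np.mean(log_sums - picked))

logits = np.array([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]])
targets = np.array([0, 1])
print(f"loss = {cross_entropy_steps(logits, targets):.4f}")
```

Only the first four steps touch all `B*C` entries. Once the per-row sums exist, everything else is `O(B)`.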
## Memory Layout and Access Patterns
```
Memory Usage Patterns:

MSE Forward Pass:                 CE Forward Pass:

Input:  [B] predictions           Input:  [B, C] logits
   │                                 │
   │ subtract                        │ subtract max
   v                                 v
Temp:   [B] differences           Temp1:  [B, C] shifted
   │                                 │
   │ square                          │ exponential
   v                                 v
Temp:   [B] squared               Temp2:  [B, C] exp_vals
   │                                 │
   │ mean                            │ sum along C
   v                                 v
Output: [1] scalar                Temp3:  [B] sums
                                     │
                                     │ log + index
                                     v
                                  Output: [1] scalar

Memory: 3*B*sizeof(float)         Memory: (3*B*C + 2*B)*sizeof(float)
```
```
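
The two memory formulas can be checked by allocating the intermediates and summing their sizes. This is a rough sketch under stated assumptions: float32 arrays, targets not counted (as in the diagram), and the `2*B` term read as the `[B]` row sums plus a `[B]` vector of per-sample losses. The function names are hypothetical.

```python
import numpy as np

def mse_forward_bytes(batch_size, dtype=np.float32):
    preds = np.zeros(batch_size, dtype=dtype)      # Input: [B]
    targets = np.ones(batch_size, dtype=dtype)     # not counted, as in the diagram
    diffs = preds - targets                        # Temp:  [B]
    squared = diffs ** 2                           # Temp:  [B]
    return preds.nbytes + diffs.nbytes + squared.nbytes

def ce_forward_bytes(batch_size, num_classes, dtype=np.float32):
    logits = np.zeros((batch_size, num_classes), dtype=dtype)  # Input: [B, C]
    shifted = logits - logits.max(axis=1, keepdims=True)       # Temp1: [B, C]
    exp_vals = np.exp(shifted)                                 # Temp2: [B, C]
    sums = exp_vals.sum(axis=1)                                # Temp3: [B]
    per_sample = np.log(sums) - shifted[:, 0]                  # [B], class 0 as a stand-in target
    return sum(a.nbytes for a in (logits, shifted, exp_vals, sums, per_sample))

B, C = 32, 1000
print(f"MSE: {mse_forward_bytes(B)} bytes, CE: {ce_forward_bytes(B, C)} bytes")
```

An implementation that reuses buffers in place would need less than this. The sketch counts every intermediate as a separate allocation, which is the worst case the diagram describes.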
"""
# %% nbgrader={"grade": false, "grade_id": "analyze_numerical_stability", "solution": true}
def analyze_numerical_stability():
    """
    📊 Demonstrate why numerical stability matters in loss computation.

    Runs the stable log-softmax on increasingly large logits and checks
    that the results stay finite.
    """
    print("📊 Analysis: Numerical Stability in Loss Functions...")

    # Test with increasingly large logits
    test_cases = [
        ("Small logits", [1.0, 2.0, 3.0]),
        ("Medium logits", [10.0, 20.0, 30.0]),
        ("Large logits", [100.0, 200.0, 300.0]),
        ("Very large logits", [500.0, 600.0, 700.0])
    ]

    print("\nLog-Softmax Stability Test:")
    print("Case | Max Input | Log-Softmax Min | Numerically Stable?")
    print("-" * 70)

    for case_name, logits in test_cases:
        x = Tensor([logits])

        # Our stable implementation
        stable_result = log_softmax(x, dim=-1)

        max_input = np.max(logits)
        min_output = np.min(stable_result.data)
        is_stable = not (np.any(np.isnan(stable_result.data)) or np.any(np.isinf(stable_result.data)))

        print(f"{case_name:20} | {max_input:8.0f} | {min_output:15.3f} | {'✅ Yes' if is_stable else '❌ No'}")

    print("\n💡 Key Insight: Log-sum-exp trick prevents overflow")
    print(" Without it: exp(700) overflows float32 (limit is about exp(88.7)) and sits near the float64 limit (about exp(709.8))")
    print(" With it: the largest exponent is exp(0) = 1, so large logits stay finite")
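# %% [markdown]
"""
The function above only exercises the stable path. For contrast, here is a small NumPy sketch that puts a naive log-softmax next to the max-subtraction version. Both helpers are hypothetical stand-ins written for this comparison; they are not the module's `log_softmax`.

```python
import numpy as np

def naive_log_softmax(x):
    # Direct translation of log(exp(x) / sum(exp(x))); overflows for large x
    with np.errstate(over="ignore", invalid="ignore"):
        e = np.exp(x)
        return np.log(e / e.sum())

def stable_log_softmax(x):
    # Subtract the max first so the largest exponent is exp(0) = 1
    shifted = x - x.max()
    return shifted - np.log(np.exp(shifted).sum())

x = np.array([1000.0, 1001.0, 1002.0])  # float64; exp(1000) exceeds the float64 range
print("naive: ", naive_log_softmax(x))
print("stable:", stable_log_softmax(x))
```

The naive version evaluates `inf / inf` and returns `nan` for every entry. The stable version only ever exponentiates values that are zero or negative, so it stays finite.
"""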
# %% nbgrader={"grade": false, "grade_id": "analyze_loss_memory", "solution": true}
def analyze_loss_memory():
    """
    📊 Analyze memory usage patterns of different loss functions.

    Understanding memory helps with batch size decisions.
    """
    print("\n📊 Analysis: Loss Function Memory Usage...")

    batch_sizes = [32, 128, 512, 1024]
    num_classes = 1000  # Like ImageNet

    print("\nMemory Usage by Batch Size:")
    print("Batch Size | MSE (MB) | CrossEntropy (MB) | BCE (MB) | Notes")
    print("-" * 75)

    for batch_size in batch_sizes:
        # Memory calculations (assuming float32 = 4 bytes)
        bytes_per_float = 4

        # MSE: predictions + targets (both same size as output)
        mse_elements = batch_size * 1  # Regression usually has 1 output
        mse_memory = mse_elements * bytes_per_float * 2 / 1e6  # Convert to MB

        # CrossEntropy: logits + targets + one intermediate softmax buffer
        ce_logits = batch_size * num_classes
        ce_targets = batch_size * 1  # Target indices
        ce_softmax = batch_size * num_classes  # Intermediate softmax
        ce_total_elements = ce_logits + ce_targets + ce_softmax
        ce_memory = ce_total_elements * bytes_per_float / 1e6

        # BCE: predictions + targets (one probability per sample)
        bce_elements = batch_size * 1
        bce_memory = bce_elements * bytes_per_float * 2 / 1e6

        notes = "Linear scaling" if batch_size == 32 else f"{batch_size//32}× first"

        # Column widths match the header above
        print(f"{batch_size:10} | {mse_memory:8.2f} | {ce_memory:17.2f} | {bce_memory:8.2f} | {notes}")

    print("\n💡 Memory Insights:")
    print(" - CrossEntropy dominates due to the large number of classes (num_classes)")
    print(" - Memory scales linearly with batch size")
    print(" - Intermediate activations (softmax) double CE memory")
    # ce_memory still holds the value from the last (largest) batch size
    print(f" - For batch={batch_sizes[-1]}, CE needs {ce_memory:.1f}MB just for loss computation")
# %% [markdown]
"""
# Part 6: Production Context - How Loss Functions Scale
Understanding how loss functions behave in production helps make informed engineering decisions about model architecture and training strategies.
## Loss Function Scaling Challenges
As models grow larger, loss function bottlenecks become critical:
```
Scaling Challenge Matrix:

                    │ Small Model     │ Large Model      │ Production Scale
                    │ (MNIST)         │ (ImageNet)       │ (GPT/BERT)
────────────────────┼─────────────────┼──────────────────┼──────────────────
Classes (C)         │ 10              │ 1,000            │ 50,000+
Batch Size (B)      │ 64              │ 256              │ 2,048
Memory (CE)         │ 2.5 KB          │ 1 MB             │ 400 MB
Memory (MSE)        │ 0.25 KB         │ 1 KB             │ 8 KB
Bottleneck          │ None            │ Softmax compute  │ Vocabulary memory

Memory figures count one float32 copy of the loss input:
B*C*4 bytes for cross-entropy, B*4 bytes for MSE.

Memory grows as B*C for cross-entropy!
At scale, vocabulary (C) dominates everything.
```
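
The memory rows in the matrix follow from simple arithmetic. The sketch below recomputes them, assuming 4-byte floats and counting only a single copy of the loss input; `loss_input_bytes` is a hypothetical helper, not part of the module.

```python
def loss_input_bytes(batch_size, num_classes, bytes_per_float=4):
    ce_bytes = batch_size * num_classes * bytes_per_float   # [B, C] logits
    mse_bytes = batch_size * bytes_per_float                # [B] predictions
    return ce_bytes, mse_bytes

configs = {"MNIST": (64, 10), "ImageNet": (256, 1_000), "GPT/BERT": (2_048, 50_000)}
for name, (B, C) in configs.items():
    ce, mse = loss_input_bytes(B, C)
    print(f"{name:9} CE: {ce / 1e6:10.4f} MB   MSE: {mse / 1e3:6.3f} KB")
```

The table rounds these values (for example 409.6 MB becomes 400 MB). Intermediate buffers such as the shifted logits and exponentials multiply the cross-entropy figure by a small constant.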
## Engineering Optimizations in Production
```
Common Production Optimizations:

1. Hierarchical Softmax:
   ┌─────────────────────┐      ┌─────────────────────┐
   │ Full Softmax:       │      │ Hierarchical:       │
   │ O(V) per sample     │      │ O(log V) per sample │
   │ 50k classes = 50k   │      │ 50k classes = 16    │
   │ operations          │      │ operations          │
   └─────────────────────┘      └─────────────────────┘

2. Sampled Softmax:
   Instead of computing over all 50k classes,
   sample 1k negative classes + correct class.
   Roughly 50× fewer logits per training step.

3. Label Smoothing:
   Instead of hard targets [0, 0, 1, 0],
   use soft targets [0.1, 0.1, 0.7, 0.1].
   Discourages overconfidence and often improves generalization.

4. Mixed Precision:
   Use FP16 for forward pass, FP32 for loss.
   Up to 2× less activation memory, usually with comparable accuracy.
```
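
As an illustration of item 3, the sketch below builds smoothed targets with the common formula `(1 - eps) * one_hot + eps / num_classes` and scores them with a cross-entropy that accepts soft targets. It is a NumPy-only sketch; `smooth_labels` and `soft_cross_entropy` are hypothetical names, and the module's `CrossEntropyLoss` takes class indices rather than soft targets. With 4 classes, `eps = 0.4` gives the `[0.1, 0.1, 0.7, 0.1]` example above; values around 0.1 are more typical in practice.

```python
import numpy as np

def smooth_labels(targets, num_classes, eps):
    # (1 - eps) on the true class, eps spread uniformly over all classes
    one_hot = np.eye(num_classes)[targets]
    return (1.0 - eps) * one_hot + eps / num_classes

def soft_cross_entropy(logits, soft_targets):
    # Stable log-softmax followed by the expected negative log-probability
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(soft_targets * log_probs).sum(axis=1).mean())

targets = np.array([2])
soft = smooth_labels(targets, num_classes=4, eps=0.4)
logits = np.array([[0.2, 0.1, 3.0, 0.4]])
print("soft targets:", soft)
print(f"loss vs soft targets: {soft_cross_entropy(logits, soft):.4f}")
```

Because every class keeps some target mass, the loss can no longer be driven to zero by pushing one logit toward infinity, which is how smoothing limits overconfidence.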
"""
# %% nbgrader={"grade": false, "grade_id": "analyze_production_patterns", "solution": true}
def analyze_production_patterns():
    """
    🚀 Analyze loss function patterns in production ML systems.

    Insights from a systems perspective.
    """
    print("🚀 Production Analysis: Loss Function Engineering Patterns...")

    print("\n1. Loss Function Choice by Problem Type:")

    scenarios = [
        ("Recommender Systems", "BCE/MSE", "User preference prediction", "Billions of interactions"),
        ("Computer Vision", "CrossEntropy", "Image classification", "1000+ classes, large batches"),
        ("NLP Translation", "CrossEntropy", "Next token prediction", "50k+ vocabulary"),
        ("Medical Diagnosis", "BCE", "Disease probability", "Class imbalance critical"),
        ("Financial Trading", "MSE/Huber", "Price prediction", "Outlier robustness needed")
    ]

    # Column widths are sized to the longest entry in each column
    print(f"{'System Type':20} | {'Loss Type':12} | {'Use Case':26} | Scale Challenge")
    print("-" * 90)
    for system, loss_type, use_case, challenge in scenarios:
        print(f"{system:20} | {loss_type:12} | {use_case:26} | {challenge}")

    print("\n2. Engineering Trade-offs:")

    trade_offs = [
        ("CrossEntropy vs Label Smoothing", "Stability vs Confidence", "Label smoothing prevents overconfident predictions"),
        ("MSE vs Huber Loss", "Sensitivity vs Robustness", "Huber is less sensitive to outliers"),
        ("Full Softmax vs Sampled", "Accuracy vs Speed", "Sampled or hierarchical softmax for large vocabularies"),
        ("Per-Sample vs Batch Loss", "Granularity vs Throughput", "Vectorized batch computation amortizes per-call overhead")
    ]

    print(f"\n{'Trade-off':31} | {'Spectrum':25} | Production Decision")
    print("-" * 100)
    for trade_off, spectrum, decision in trade_offs:
        print(f"{trade_off:31} | {spectrum:25} | {decision}")

    print("\n💡 Production Insights:")
    print(" - Large vocabularies (50k+ tokens) dominate memory in CrossEntropy")
    print(" - Vectorized batch computation is typically far faster than looping per sample")
    print(" - Numerical stability becomes critical at scale (FP16 training)")
    print(" - Loss computation is usually a small fraction of total training time")
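# %% [markdown]
"""
The "MSE vs Huber Loss" row is easier to judge with numbers. Huber loss is not implemented in this module, so the following is only a sketch using the standard definition: quadratic for residuals up to `delta`, linear beyond it. `huber_loss` and `mse` are hypothetical helpers for this comparison.

```python
import numpy as np

def huber_loss(pred, target, delta=1.0):
    r = np.abs(pred - target)
    quadratic = 0.5 * r ** 2               # used where |residual| <= delta
    linear = delta * (r - 0.5 * delta)     # used where |residual| > delta
    return float(np.mean(np.where(r <= delta, quadratic, linear)))

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

target = np.array([1.0, 2.0, 3.0, 4.0])
clean = np.array([1.1, 1.9, 3.2, 3.8])
outlier = np.array([1.1, 1.9, 3.2, 14.0])  # one prediction is off by 10

print(f"clean:   MSE={mse(clean, target):.3f}  Huber={huber_loss(clean, target):.3f}")
print(f"outlier: MSE={mse(outlier, target):.3f}  Huber={huber_loss(outlier, target):.3f}")
```

A single residual of 10 contributes 100 to the squared-error sum but only 9.5 to the Huber sum with `delta = 1`, so one bad sample moves the Huber average far less.
"""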
# %% [markdown]
"""
## 🧪 Module Integration Test
Final validation that everything works together correctly.
"""
# %% nbgrader={"grade": true, "grade_id": "test_module", "locked": true, "points": 20}
def test_module():
    """
    Comprehensive test of entire losses module functionality.

    This final test runs before module summary to ensure:
    - All unit tests pass
    - Functions work together correctly
    - Module is ready for integration with TinyTorch
    """
    print("🧪 RUNNING MODULE INTEGRATION TEST")
    print("=" * 50)

    # Run all unit tests
    print("Running unit tests...")
    test_unit_log_softmax()
    test_unit_mse_loss()
    test_unit_cross_entropy_loss()
    test_unit_binary_cross_entropy_loss()

    print("\nRunning integration scenarios...")

    # Test realistic end-to-end scenario with previous modules
    print("🔬 Integration Test: Realistic training scenario...")

    # Simulate a complete prediction -> loss computation pipeline

    # 1. MSE for regression (house price prediction)
    house_predictions = Tensor([250.0, 180.0, 320.0, 400.0])  # Predicted prices in thousands
    house_actual = Tensor([245.0, 190.0, 310.0, 420.0])  # Actual prices

    mse_loss = MSELoss()
    house_loss = mse_loss.forward(house_predictions, house_actual)

    assert house_loss.data > 0, "House price loss should be positive"
    assert house_loss.data < 1000, "House price loss should be reasonable"

    # 2. CrossEntropy for classification (image recognition)
    image_logits = Tensor([[2.1, 0.5, 0.3], [0.2, 2.8, 0.1], [0.4, 0.3, 2.2]])  # 3 images, 3 classes
    image_labels = Tensor([0, 1, 2])  # Correct class for each image

    ce_loss = CrossEntropyLoss()
    image_loss = ce_loss.forward(image_logits, image_labels)

    assert image_loss.data > 0, "Image classification loss should be positive"
    assert image_loss.data < 5.0, "Image classification loss should be reasonable"

    # 3. BCE for binary classification (spam detection)
    spam_probabilities = Tensor([0.85, 0.12, 0.78, 0.23, 0.91])
    spam_labels = Tensor([1.0, 0.0, 1.0, 0.0, 1.0])  # True spam labels

    bce_loss = BinaryCrossEntropyLoss()
    spam_loss = bce_loss.forward(spam_probabilities, spam_labels)

    assert spam_loss.data > 0, "Spam detection loss should be positive"
    assert spam_loss.data < 5.0, "Spam detection loss should be reasonable"

    # 4. Test numerical stability with extreme values
    extreme_logits = Tensor([[100.0, -100.0, 0.0]])
    extreme_targets = Tensor([0])

    extreme_loss = ce_loss.forward(extreme_logits, extreme_targets)
    assert not np.isnan(extreme_loss.data), "Loss should handle extreme values"
    assert not np.isinf(extreme_loss.data), "Loss should not be infinite"

    print("✅ End-to-end loss computation works!")
    print("✅ All loss functions handle edge cases!")
    print("✅ Numerical stability verified!")

    print("\n" + "=" * 50)
    print("🎉 ALL TESTS PASSED! Module ready for export.")
    print("Run: tito module complete 04")
# %%
# Run comprehensive module test
if __name__ == "__main__":
    test_module()
# %% [markdown]
"""
## 🎯 MODULE SUMMARY: Losses
Congratulations! You've built the measurement system that enables all machine learning!
### Key Accomplishments
- Built 3 essential loss functions: MSE, CrossEntropy, and BinaryCrossEntropy ✅
- Implemented numerical stability with log-sum-exp trick ✅
- Discovered memory scaling patterns with batch size and vocabulary ✅
- Analyzed production trade-offs between different loss function choices ✅
- All tests pass ✅ (validated by `test_module()`)
### Ready for Next Steps
Your loss functions provide the essential feedback signal for learning. These "error measurements" will become the starting point for backpropagation in Module 05!
Export with: `tito module complete 04`
**Next**: Module 05 will add automatic differentiation - the magic that computes how to improve predictions!
"""