mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-05-08 07:45:06 -05:00
- Added progressive complexity guidelines (Foundation/Intermediate/Advanced) - Added measurement function consolidation to prevent information overload - Fixed all diagnostic issues in losses_dev.py - Fixed markdown formatting across all modules - Consolidated redundant analysis functions in foundation modules - Fixed syntax errors and unused variables - Ensured all educational content is in proper markdown cells for Jupyter
1564 lines
62 KiB
Python
1564 lines
62 KiB
Python
# %% [markdown]
|
||
"""
|
||
# Normalization - Stabilizing Deep Network Training
|
||
|
||
Welcome to Normalization! You'll implement the normalization techniques that make deep neural networks trainable and stable.
|
||
|
||
## LINK Building on Previous Learning
|
||
**What You Built Before**:
|
||
- Module 02 (Tensor): Data structures with gradient tracking
|
||
- Module 04 (Layers): Neural network layer primitives
|
||
- Module 06 (Autograd): Automatic gradient computation
|
||
- Module 07 (Optimizers): Parameter update algorithms
|
||
|
||
**What's Working**: You can build multi-layer networks and train them with optimizers!
|
||
|
||
**The Gap**: Deep networks suffer from internal covariate shift - activations drift during training, making learning unstable and slow.
|
||
|
||
**This Module's Solution**: Implement BatchNorm, LayerNorm, and GroupNorm to stabilize training by normalizing intermediate activations.
|
||
|
||
**Connection Map**:
|
||
```
|
||
Layers -> Normalization -> Stable Training
|
||
(unstable) (stabilized) (convergence)
|
||
```
|
||
|
||
## Learning Goals (5-Point Framework)
|
||
- **Systems understanding**: Memory and computation patterns of different normalization schemes
|
||
- **Core implementation skill**: Build BatchNorm, LayerNorm, and GroupNorm from mathematical foundations
|
||
- **Pattern/abstraction mastery**: Understand when to use each normalization technique
|
||
- **Framework connections**: Connect to PyTorch's nn.BatchNorm2d, nn.LayerNorm, nn.GroupNorm
|
||
- **Optimization trade-offs**: Analyze memory vs stability vs computation trade-offs
|
||
|
||
## Build -> Use -> Reflect
|
||
1. **Build**: Implementation of BatchNorm, LayerNorm, and GroupNorm with running statistics
|
||
2. **Use**: Apply normalization to stabilize training of deep networks
|
||
3. **Reflect**: How do different normalization schemes affect memory, computation, and training dynamics?
|
||
|
||
## Systems Reality Check
|
||
TIP **Production Context**: Normalization is critical in all modern deep learning - ResNet uses BatchNorm, Transformers use LayerNorm, modern ConvNets use GroupNorm
|
||
SPEED **Performance Insight**: BatchNorm adds 2* parameters per layer but often enables 10* larger learning rates, dramatically accelerating training
|
||
|
||
## What You'll Achieve
|
||
By the end of this module, you'll have implemented the normalization arsenal that makes modern deep learning possible, with complete understanding of their memory characteristics and performance trade-offs.
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## Mathematical Foundation: Why Normalization Works
|
||
|
||
Internal covariate shift occurs when the distribution of inputs to each layer changes during training. This makes learning slow and unstable.
|
||
|
||
### The Core Problem:
|
||
```
|
||
Layer 1: x₁ -> f₁(x₁) -> y₁ (distribution D₁)
|
||
Layer 2: y₁ -> f₂(y₁) -> y₂ (distribution changes as f₁ changes!)
|
||
Layer 3: y₂ -> f₃(y₂) -> y₃ (distribution keeps shifting!)
|
||
```
|
||
|
||
### The Normalization Solution:
|
||
Normalize activations to have stable statistics (mean=0, variance=1):
|
||
|
||
**Mathematical Form:**
|
||
```
|
||
ŷ = γ * (x - μ) / σ + β
|
||
|
||
Where:
|
||
- μ = E[x] (mean)
|
||
- σ = sqrt(Var[x] + ε) (standard deviation)
|
||
- γ = learnable scale parameter
|
||
- β = learnable shift parameter
|
||
- ε = numerical stability constant (usually 1e-5)
|
||
```
|
||
|
||
**Key Insight**: γ and β allow the network to recover the original representation if normalization hurts performance.
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## Context: Why Normalization Matters
|
||
|
||
### Historical Context
|
||
- **2015**: BatchNorm revolutionizes training, enables much deeper networks
|
||
- **2016**: LayerNorm enables stable transformer training
|
||
- **2018**: GroupNorm provides batch-independent normalization for object detection
|
||
|
||
### Production Impact
|
||
- **ImageNet Training**: BatchNorm reduces training time from weeks to days
|
||
- **Language Models**: LayerNorm enables training of billion-parameter transformers
|
||
- **Object Detection**: GroupNorm enables small-batch training with stable results
|
||
|
||
### Memory vs Performance Trade-offs
|
||
- **BatchNorm**: 2* parameters, but enables 5-10* larger learning rates
|
||
- **LayerNorm**: No batch dimension dependence, consistent across batch sizes
|
||
- **GroupNorm**: Balance between batch and layer normalization benefits
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## Connections: Production Normalization Systems
|
||
|
||
### PyTorch Implementation Patterns
|
||
```python
|
||
# Production patterns you'll implement
|
||
torch.nn.BatchNorm2d(channels, eps=1e-5, momentum=0.1)
|
||
torch.nn.LayerNorm(normalized_shape, eps=1e-5)
|
||
torch.nn.GroupNorm(num_groups, num_channels, eps=1e-5)
|
||
|
||
# Your implementation will match these interfaces
|
||
```
|
||
|
||
### Real-World Usage
|
||
- **ResNet**: Uses BatchNorm after every convolution layer
|
||
- **BERT/GPT**: Uses LayerNorm in transformer blocks
|
||
- **YOLO**: Uses BatchNorm for training stability with large images
|
||
- **Modern ConvNets**: Often use GroupNorm for object detection tasks
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## Design: Why Build Normalization From Scratch?
|
||
|
||
### Learning Justification
|
||
Building normalization layers teaches:
|
||
1. **Statistical Computing**: How to compute mean/variance efficiently across different dimensions
|
||
2. **Memory Management**: Understanding running statistics and their memory implications
|
||
3. **Training vs Inference**: How normalization behaves differently during training and evaluation
|
||
4. **Gradient Flow**: How normalization affects backpropagation through learnable parameters
|
||
|
||
### Systems Understanding Goals
|
||
- **Dimension Analysis**: How normalization axes affect memory and computation
|
||
- **Batch Dependencies**: Understanding when normalization depends on batch statistics
|
||
- **Parameter Sharing**: How γ and β parameters are organized in memory
|
||
- **Numerical Stability**: Why ε is critical for avoiding division by zero
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## Architecture: Normalization Design Decisions
|
||
|
||
### Key Design Choices
|
||
|
||
1. **Normalization Axis Selection**:
|
||
```
|
||
BatchNorm: Normalize across batch dimension (N, C, H, W) -> across N
|
||
LayerNorm: Normalize across feature dimensions -> across C, H, W
|
||
GroupNorm: Normalize across channel groups -> within groups of C
|
||
```
|
||
|
||
2. **Parameter Organization**:
|
||
```
|
||
γ (scale) and β (bias) parameters:
|
||
- BatchNorm: Shape (C,) - one per channel
|
||
- LayerNorm: Shape of normalized dimensions
|
||
- GroupNorm: Shape (C,) - one per channel
|
||
```
|
||
|
||
3. **Training vs Inference**:
|
||
```
|
||
Training: Use batch statistics (mean, var computed from current batch)
|
||
Inference: Use running statistics (exponential moving average from training)
|
||
```
|
||
|
||
4. **Memory Layout Optimization**:
|
||
```
|
||
Running statistics stored separately from learnable parameters
|
||
Efficient computation using vectorized operations across normalization axes
|
||
```
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## Implementation: Building Normalization Classes
|
||
|
||
Let's implement the three essential normalization techniques used in modern deep learning.
|
||
"""
|
||
|
||
# %%
|
||
#| default_exp tinytorch.core.normalization
|
||
import numpy as np
|
||
from typing import Optional, Union, Tuple, Dict, List
|
||
import warnings
|
||
|
||
# Import our tensor and layer base classes
|
||
try:
|
||
from tinytorch.core.tensor import Tensor
|
||
from tinytorch.core.layers import Module
|
||
except ImportError:
|
||
# Fallback for development environment
|
||
import sys
|
||
import os
|
||
sys.path.append(os.path.join(os.path.dirname(__file__), '..', '..', '..'))
|
||
from tinytorch.core.tensor import Tensor
|
||
from tinytorch.core.layers import Module
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### Batch Normalization Implementation
|
||
|
||
Batch Normalization normalizes activations across the batch dimension, making training more stable and allowing higher learning rates.
|
||
|
||
**Key Insight**: BatchNorm computes statistics across the batch dimension, so it requires batch_size > 1 during training.
|
||
"""
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "batch-norm", "locked": false, "schema_version": 3, "solution": true, "task": false}
|
||
#| export
|
||
class BatchNorm2d(Module):
|
||
"""
|
||
Batch Normalization for 2D convolutions (4D tensors: N*C*H*W).
|
||
|
||
Normalizes across the batch dimension, computing μ and σ² across N, H, W
|
||
for each channel C independently.
|
||
|
||
MATHEMATICAL FOUNDATION:
|
||
BN(x) = γ * (x - μ_batch) / sqrt(σ²_batch + ε) + β
|
||
|
||
Where μ_batch and σ²_batch are computed across (N, H, W) dimensions.
|
||
"""
|
||
|
||
def __init__(self, num_features: int, eps: float = 1e-5, momentum: float = 0.1):
|
||
"""
|
||
Initialize Batch Normalization layer.
|
||
|
||
TODO: Implement BatchNorm initialization with running statistics.
|
||
|
||
APPROACH (4-Step BatchNorm Setup):
|
||
1. Store configuration parameters (num_features, eps, momentum)
|
||
2. Initialize learnable parameters (γ=1, β=0) with proper shapes
|
||
3. Initialize running statistics (running_mean=0, running_var=1)
|
||
4. Set training mode flag for different train/eval behavior
|
||
|
||
MEMORY ANALYSIS:
|
||
- Learnable parameters: 2 * num_features (γ and β)
|
||
- Running statistics: 2 * num_features (running_mean and running_var)
|
||
- Total memory: 4 * num_features parameters
|
||
|
||
EXAMPLE (BatchNorm Usage):
|
||
>>> bn = BatchNorm2d(64) # For 64 channels
|
||
>>> x = Tensor(np.random.randn(32, 64, 28, 28)) # batch * channels * height * width
|
||
>>> normalized = bn(x)
|
||
>>> print(f"Normalized shape: {normalized.shape}") # (32, 64, 28, 28)
|
||
|
||
HINTS:
|
||
- Use np.ones() for γ initialization (multiplicative identity)
|
||
- Use np.zeros() for β initialization (additive identity)
|
||
- Running statistics are numpy arrays (not Tensors - no gradients needed)
|
||
- momentum controls exponential moving average: new_running = (1-momentum)*old + momentum*batch
|
||
|
||
Args:
|
||
num_features: Number of channels (C dimension)
|
||
eps: Small constant for numerical stability
|
||
momentum: Momentum for running statistics update
|
||
"""
|
||
### BEGIN SOLUTION
|
||
super().__init__()
|
||
self.num_features = num_features
|
||
self.eps = eps
|
||
self.momentum = momentum
|
||
self.training = True
|
||
|
||
# Learnable parameters - shape (num_features,)
|
||
self.gamma = Tensor(np.ones((num_features,))) # Scale parameter
|
||
self.beta = Tensor(np.zeros((num_features,))) # Shift parameter
|
||
|
||
# Running statistics for inference - numpy arrays (no gradients needed)
|
||
self.running_mean = np.zeros((num_features,))
|
||
self.running_var = np.ones((num_features,))
|
||
|
||
# Track parameters for optimization
|
||
self.parameters = [self.gamma, self.beta]
|
||
### END SOLUTION
|
||
|
||
def forward(self, x: Tensor) -> Tensor:
|
||
"""
|
||
Apply batch normalization to input tensor.
|
||
|
||
TODO: Implement batch normalization forward pass with proper training/eval modes.
|
||
|
||
STEP-BY-STEP IMPLEMENTATION:
|
||
1. Determine which statistics to use (batch vs running)
|
||
2. Compute mean and variance across appropriate dimensions
|
||
3. Normalize: (x - mean) / sqrt(var + eps)
|
||
4. Scale and shift: γ * normalized + β
|
||
5. Update running statistics during training
|
||
|
||
DIMENSION ANALYSIS for 4D input (N, C, H, W):
|
||
- Batch statistics computed across dims (0, 2, 3) -> shape (C,)
|
||
- γ and β broadcasted to match input: (1, C, 1, 1)
|
||
- Output has same shape as input
|
||
|
||
TRAINING vs INFERENCE:
|
||
- Training: Use batch statistics, update running statistics
|
||
- Inference: Use running statistics, no updates
|
||
|
||
EXAMPLE:
|
||
>>> bn = BatchNorm2d(3)
|
||
>>> x = Tensor(np.random.randn(16, 3, 32, 32))
|
||
>>> bn.training = True # Training mode
|
||
>>> out_train = bn.forward(x)
|
||
>>> bn.training = False # Inference mode
|
||
>>> out_eval = bn.forward(x)
|
||
|
||
Args:
|
||
x: Input tensor of shape (N, C, H, W)
|
||
|
||
Returns:
|
||
Normalized tensor of shape (N, C, H, W)
|
||
"""
|
||
### BEGIN SOLUTION
|
||
if self.training:
|
||
# Training mode: compute batch statistics
|
||
# Compute mean and variance across batch, height, width (dims 0, 2, 3)
|
||
batch_mean = np.mean(x.data, axis=(0, 2, 3), keepdims=False) # Shape: (C,)
|
||
batch_var = np.var(x.data, axis=(0, 2, 3), keepdims=False) # Shape: (C,)
|
||
|
||
# Update running statistics using exponential moving average
|
||
self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * batch_mean
|
||
self.running_var = (1 - self.momentum) * self.running_var + self.momentum * batch_var
|
||
|
||
# Use batch statistics for normalization
|
||
mean = batch_mean
|
||
var = batch_var
|
||
else:
|
||
# Inference mode: use running statistics
|
||
mean = self.running_mean
|
||
var = self.running_var
|
||
|
||
# Reshape statistics for broadcasting: (1, C, 1, 1)
|
||
mean = mean.reshape(1, -1, 1, 1)
|
||
var = var.reshape(1, -1, 1, 1)
|
||
gamma = self.gamma.data.reshape(1, -1, 1, 1)
|
||
beta = self.beta.data.reshape(1, -1, 1, 1)
|
||
|
||
# Apply normalization: γ * (x - μ) / σ + β
|
||
normalized = (x.data - mean) / np.sqrt(var + self.eps)
|
||
output = gamma * normalized + beta
|
||
|
||
return Tensor(output)
|
||
### END SOLUTION
|
||
|
||
def train(self, mode: bool = True) -> 'BatchNorm2d':
|
||
"""Set training mode."""
|
||
self.training = mode
|
||
return self
|
||
|
||
def eval(self) -> 'BatchNorm2d':
|
||
"""Set evaluation mode."""
|
||
self.training = False
|
||
return self
|
||
|
||
# MAGNIFY SYSTEMS INSIGHT: Batch Normalization Memory Analysis
|
||
def analyze_batchnorm_memory():
|
||
"""Let's analyze BatchNorm memory usage and batch dependency!"""
|
||
try:
|
||
print("MAGNIFY SYSTEMS INSIGHT: Batch Normalization Analysis")
|
||
print("=" * 50)
|
||
|
||
# Different channel sizes to show scaling
|
||
channel_sizes = [64, 256, 512, 1024]
|
||
|
||
for channels in channel_sizes:
|
||
bn = BatchNorm2d(channels)
|
||
|
||
# Parameter memory calculation
|
||
param_memory = 4 * channels * 4 # 4 params per channel * 4 bytes (float32)
|
||
print(f"Channels: {channels:4d} | Parameters: {4 * channels:4d} | Memory: {param_memory / 1024:.2f} KB")
|
||
|
||
print("\nTIP KEY INSIGHTS:")
|
||
print("• BatchNorm memory scales linearly with channel count")
|
||
print("• Only 4 parameters per channel (γ, β, running_mean, running_var)")
|
||
print("• Memory overhead is typically < 1% of layer weights")
|
||
|
||
# Batch size dependency demonstration
|
||
print("\nTARGET BATCH SIZE DEPENDENCY:")
|
||
bn = BatchNorm2d(64)
|
||
|
||
for batch_size in [1, 8, 32, 128]:
|
||
x = Tensor(np.random.randn(batch_size, 64, 32, 32))
|
||
|
||
if batch_size == 1:
|
||
print(f"Batch size {batch_size:3d}: WARNING️ May be unstable (poor statistics)")
|
||
else:
|
||
print(f"Batch size {batch_size:3d}: PASS Good statistics")
|
||
|
||
print("\n🚨 CRITICAL: BatchNorm needs batch_size > 1 for stable training!")
|
||
print(" Single-sample batches have undefined variance")
|
||
|
||
except Exception as e:
|
||
print(f"WARNING️ Error in BatchNorm analysis: {e}")
|
||
|
||
# Run the analysis
|
||
analyze_batchnorm_memory()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### TEST Unit Test: Batch Normalization
|
||
|
||
This test validates BatchNorm2d implementation, ensuring proper normalization across batch dimension and correct running statistics updates.
|
||
"""
|
||
|
||
# %% nbgrader={"grade": true, "grade_id": "test-batch-norm", "locked": true, "points": 20, "schema_version": 3, "solution": false, "task": false}
|
||
def test_unit_batch_norm():
|
||
"""Unit test for batch normalization."""
|
||
print("🔬 Unit Test: Batch Normalization...")
|
||
|
||
# Test 1: Basic functionality
|
||
num_features = 32
|
||
bn = BatchNorm2d(num_features)
|
||
|
||
# Verify initialization
|
||
assert bn.num_features == num_features, "Should store number of features"
|
||
assert bn.eps == 1e-5, "Should use default epsilon"
|
||
assert bn.momentum == 0.1, "Should use default momentum"
|
||
assert bn.training == True, "Should start in training mode"
|
||
|
||
# Check parameter shapes
|
||
assert bn.gamma.shape == (num_features,), f"Gamma shape should be ({num_features},)"
|
||
assert bn.beta.shape == (num_features,), f"Beta shape should be ({num_features},)"
|
||
assert np.allclose(bn.gamma.data, 1.0), "Gamma should be initialized to 1"
|
||
assert np.allclose(bn.beta.data, 0.0), "Beta should be initialized to 0"
|
||
|
||
# Test 2: Forward pass in training mode
|
||
batch_size, height, width = 16, 8, 8
|
||
x = Tensor(np.random.randn(batch_size, num_features, height, width))
|
||
|
||
output = bn.forward(x)
|
||
|
||
# Check output shape
|
||
assert output.shape == x.shape, "Output should have same shape as input"
|
||
|
||
# Check normalization (approximately zero mean, unit variance per channel)
|
||
for c in range(num_features):
|
||
channel_data = output.data[:, c, :, :]
|
||
channel_mean = np.mean(channel_data)
|
||
channel_var = np.var(channel_data)
|
||
|
||
assert abs(channel_mean) < 1e-6, f"Channel {c} should have ~0 mean, got {channel_mean}"
|
||
assert abs(channel_var - 1.0) < 1e-4, f"Channel {c} should have ~1 variance, got {channel_var}"
|
||
|
||
# Test 3: Running statistics update
|
||
initial_running_mean = bn.running_mean.copy()
|
||
initial_running_var = bn.running_var.copy()
|
||
|
||
# Process another batch
|
||
x2 = Tensor(np.random.randn(batch_size, num_features, height, width) * 2 + 1)
|
||
_ = bn.forward(x2)
|
||
|
||
# Running statistics should have changed
|
||
assert not np.allclose(bn.running_mean, initial_running_mean), "Running mean should update"
|
||
assert not np.allclose(bn.running_var, initial_running_var), "Running variance should update"
|
||
|
||
# Test 4: Evaluation mode
|
||
bn.eval()
|
||
assert bn.training == False, "Should be in eval mode"
|
||
|
||
running_mean_before = bn.running_mean.copy()
|
||
running_var_before = bn.running_var.copy()
|
||
|
||
# Forward pass in eval mode
|
||
output_eval = bn.forward(x)
|
||
|
||
# Running statistics should not change in eval mode
|
||
assert np.allclose(bn.running_mean, running_mean_before), "Running mean should not change in eval mode"
|
||
assert np.allclose(bn.running_var, running_var_before), "Running variance should not change in eval mode"
|
||
|
||
# Test 5: Gradient flow (basic check)
|
||
bn.train()
|
||
x_grad = Tensor(np.random.randn(batch_size, num_features, height, width))
|
||
output_grad = bn.forward(x_grad)
|
||
|
||
# Should be able to access gamma and beta for gradient computation
|
||
assert hasattr(bn, 'gamma'), "Should have gamma parameter"
|
||
assert hasattr(bn, 'beta'), "Should have beta parameter"
|
||
assert len(bn.parameters) == 2, "Should have 2 learnable parameters"
|
||
|
||
print("PASS Batch normalization tests passed!")
|
||
print(f"PASS Properly normalizes across batch dimension")
|
||
print(f"PASS Updates running statistics during training")
|
||
print(f"PASS Uses running statistics during evaluation")
|
||
print(f"PASS Maintains gradient flow through learnable parameters")
|
||
|
||
# Test function defined (called in main block)
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### Layer Normalization Implementation
|
||
|
||
Layer Normalization normalizes across the feature dimensions for each sample independently, making it batch-size independent.
|
||
|
||
**Key Insight**: LayerNorm is crucial for transformers because it doesn't depend on batch statistics, enabling consistent behavior across different batch sizes.
|
||
"""
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "layer-norm", "locked": false, "schema_version": 3, "solution": true, "task": false}
|
||
#| export
|
||
class LayerNorm(Module):
|
||
"""
|
||
Layer Normalization for any-dimensional tensors.
|
||
|
||
Normalizes across specified feature dimensions for each sample independently.
|
||
Unlike BatchNorm, LayerNorm doesn't depend on batch statistics.
|
||
|
||
MATHEMATICAL FOUNDATION:
|
||
LN(x) = γ * (x - μ) / sqrt(σ² + ε) + β
|
||
|
||
Where μ and σ² are computed across feature dimensions for each sample.
|
||
"""
|
||
|
||
def __init__(self, normalized_shape: Union[int, Tuple[int, ...]], eps: float = 1e-5):
|
||
"""
|
||
Initialize Layer Normalization.
|
||
|
||
TODO: Implement LayerNorm initialization with proper shape handling.
|
||
|
||
APPROACH (3-Step LayerNorm Setup):
|
||
1. Store normalization configuration (shape and eps)
|
||
2. Initialize learnable parameters γ and β with correct shapes
|
||
3. Set up parameter tracking for optimization
|
||
|
||
SHAPE ANALYSIS:
|
||
- If normalized_shape is int: treat as last dimension only
|
||
- If normalized_shape is tuple: treat as multiple dimensions
|
||
- γ and β have shape matching normalized_shape
|
||
|
||
EXAMPLE (LayerNorm Shapes):
|
||
>>> ln1 = LayerNorm(512) # For last dim: (..., 512)
|
||
>>> ln2 = LayerNorm((64, 64)) # For last 2 dims: (..., 64, 64)
|
||
>>> ln3 = LayerNorm((256, 4, 4)) # For 3D features: (..., 256, 4, 4)
|
||
|
||
HINTS:
|
||
- Convert int to tuple for consistent handling
|
||
- Parameter shapes should match normalized_shape exactly
|
||
- No running statistics needed (computed fresh each time)
|
||
|
||
Args:
|
||
normalized_shape: Shape of features to normalize over
|
||
eps: Small constant for numerical stability
|
||
"""
|
||
### BEGIN SOLUTION
|
||
super().__init__()
|
||
|
||
# Handle both int and tuple inputs
|
||
if isinstance(normalized_shape, int):
|
||
self.normalized_shape = (normalized_shape,)
|
||
else:
|
||
self.normalized_shape = tuple(normalized_shape)
|
||
|
||
self.eps = eps
|
||
|
||
# Learnable parameters with shape matching normalized dimensions
|
||
self.gamma = Tensor(np.ones(self.normalized_shape)) # Scale parameter
|
||
self.beta = Tensor(np.zeros(self.normalized_shape)) # Shift parameter
|
||
|
||
# Track parameters for optimization
|
||
self.parameters = [self.gamma, self.beta]
|
||
### END SOLUTION
|
||
|
||
def forward(self, x: Tensor) -> Tensor:
|
||
"""
|
||
Apply layer normalization to input tensor.
|
||
|
||
TODO: Implement layer normalization forward pass.
|
||
|
||
STEP-BY-STEP IMPLEMENTATION:
|
||
1. Determine normalization axes based on normalized_shape
|
||
2. Compute mean and variance across those axes (keepdims=True)
|
||
3. Normalize: (x - mean) / sqrt(var + eps)
|
||
4. Apply learnable parameters: γ * normalized + β
|
||
|
||
AXIS CALCULATION:
|
||
For input shape (N, ..., D1, D2, ..., Dk) and normalized_shape (D1, D2, ..., Dk):
|
||
- Normalize over last len(normalized_shape) dimensions
|
||
- Keep dimensions for proper broadcasting
|
||
|
||
EXAMPLE:
|
||
>>> ln = LayerNorm(256)
|
||
>>> x = Tensor(np.random.randn(32, 128, 256)) # (batch, seq, features)
|
||
>>> out = ln.forward(x) # Normalize over last dim (256)
|
||
|
||
Args:
|
||
x: Input tensor
|
||
|
||
Returns:
|
||
Normalized tensor (same shape as input)
|
||
"""
|
||
### BEGIN SOLUTION
|
||
# Calculate which axes to normalize over (last len(normalized_shape) dimensions)
|
||
num_dims_to_normalize = len(self.normalized_shape)
|
||
axes = tuple(range(-num_dims_to_normalize, 0)) # Last N dimensions
|
||
|
||
# Compute mean and variance over normalization axes
|
||
mean = np.mean(x.data, axis=axes, keepdims=True)
|
||
var = np.var(x.data, axis=axes, keepdims=True)
|
||
|
||
# Normalize
|
||
normalized = (x.data - mean) / np.sqrt(var + self.eps)
|
||
|
||
# Apply learnable parameters (broadcasting automatically handles shapes)
|
||
output = self.gamma.data * normalized + self.beta.data
|
||
|
||
return Tensor(output)
|
||
### END SOLUTION
|
||
|
||
def __call__(self, x: Tensor) -> Tensor:
|
||
"""Allow LayerNorm to be called directly."""
|
||
return self.forward(x)
|
||
|
||
# PASS IMPLEMENTATION CHECKPOINT: Basic LayerNorm complete
|
||
|
||
# THINK PREDICTION: How does LayerNorm memory scale compared to BatchNorm?
|
||
# Your guess: LayerNorm uses _____ memory than BatchNorm for the same feature size
|
||
|
||
# MAGNIFY SYSTEMS INSIGHT: LayerNorm vs BatchNorm Memory Comparison
|
||
def compare_normalization_memory():
|
||
"""Compare memory usage between different normalization techniques."""
|
||
try:
|
||
print("MAGNIFY SYSTEMS INSIGHT: Normalization Memory Comparison")
|
||
print("=" * 60)
|
||
|
||
# Test different feature configurations
|
||
configs = [
|
||
(64, "Small ConvNet channel"),
|
||
(256, "ResNet channel"),
|
||
(512, "Transformer embedding"),
|
||
(1024, "Large transformer")
|
||
]
|
||
|
||
print(f"{'Features':<8} {'BatchNorm':<12} {'LayerNorm':<12} {'Ratio':<8} {'Context'}")
|
||
print("-" * 60)
|
||
|
||
for features, context in configs:
|
||
# BatchNorm memory: 4 parameters per channel (γ, β, running_mean, running_var)
|
||
bn_memory = 4 * features * 4 # 4 bytes per float32
|
||
|
||
# LayerNorm memory: 2 parameters per feature (γ, β only)
|
||
ln_memory = 2 * features * 4 # 4 bytes per float32
|
||
|
||
ratio = bn_memory / ln_memory
|
||
|
||
print(f"{features:<8} {bn_memory/1024:.2f} KB {ln_memory/1024:.2f} KB {ratio:.1f}x {context}")
|
||
|
||
print(f"\nTIP KEY INSIGHTS:")
|
||
print("• BatchNorm uses 2* more memory than LayerNorm")
|
||
print("• BatchNorm stores running statistics (inference requirements)")
|
||
print("• LayerNorm has no running state (batch-independent)")
|
||
|
||
# Batch size independence demonstration
|
||
print(f"\nTARGET BATCH SIZE INDEPENDENCE:")
|
||
ln = LayerNorm(256)
|
||
|
||
for batch_size in [1, 8, 32, 128]:
|
||
x = Tensor(np.random.randn(batch_size, 64, 256))
|
||
output = ln.forward(x)
|
||
|
||
# Check normalization quality
|
||
sample_mean = np.mean(output.data[0, :, :]) # First sample mean
|
||
sample_var = np.var(output.data[0, :, :]) # First sample variance
|
||
|
||
print(f"Batch size {batch_size:3d}: Mean={sample_mean:.6f}, Var={sample_var:.6f} PASS")
|
||
|
||
print(f"\n✨ LayerNorm gives consistent results regardless of batch size!")
|
||
|
||
except Exception as e:
|
||
print(f"WARNING️ Error in normalization comparison: {e}")
|
||
|
||
# Run the comparison
|
||
compare_normalization_memory()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### TEST Unit Test: Layer Normalization
|
||
|
||
This test validates LayerNorm implementation, ensuring proper normalization across feature dimensions and batch-size independence.
|
||
"""
|
||
|
||
# %% nbgrader={"grade": true, "grade_id": "test-layer-norm", "locked": true, "points": 20, "schema_version": 3, "solution": false, "task": false}
|
||
def test_unit_layer_norm():
|
||
"""Unit test for layer normalization."""
|
||
print("🔬 Unit Test: Layer Normalization...")
|
||
|
||
# Test 1: Basic 1D normalization
|
||
embed_dim = 256
|
||
ln = LayerNorm(embed_dim)
|
||
|
||
# Verify initialization
|
||
assert ln.normalized_shape == (embed_dim,), "Should store normalized shape as tuple"
|
||
assert ln.eps == 1e-5, "Should use default epsilon"
|
||
|
||
# Check parameter shapes
|
||
assert ln.gamma.shape == (embed_dim,), f"Gamma shape should be ({embed_dim},)"
|
||
assert ln.beta.shape == (embed_dim,), f"Beta shape should be ({embed_dim},)"
|
||
assert np.allclose(ln.gamma.data, 1.0), "Gamma should be initialized to 1"
|
||
assert np.allclose(ln.beta.data, 0.0), "Beta should be initialized to 0"
|
||
|
||
# Test 2: Forward pass with 3D input (batch, seq, features)
|
||
batch_size, seq_len = 16, 64
|
||
x = Tensor(np.random.randn(batch_size, seq_len, embed_dim) * 2 + 3) # Non-standard distribution
|
||
|
||
output = ln.forward(x)
|
||
|
||
# Check output shape
|
||
assert output.shape == x.shape, "Output should have same shape as input"
|
||
|
||
# Check normalization for each sample independently
|
||
for b in range(batch_size):
|
||
for s in range(seq_len):
|
||
sample_data = output.data[b, s, :]
|
||
sample_mean = np.mean(sample_data)
|
||
sample_var = np.var(sample_data)
|
||
|
||
assert abs(sample_mean) < 1e-6, f"Sample [{b},{s}] should have ~0 mean, got {sample_mean}"
|
||
assert abs(sample_var - 1.0) < 1e-4, f"Sample [{b},{s}] should have ~1 variance, got {sample_var}"
|
||
|
||
# Test 3: Multi-dimensional normalization
|
||
multi_dim_shape = (64, 4) # Normalize over 2D features
|
||
ln_multi = LayerNorm(multi_dim_shape)
|
||
|
||
x_multi = Tensor(np.random.randn(8, 32, 64, 4))
|
||
output_multi = ln_multi.forward(x_multi)
|
||
|
||
assert output_multi.shape == x_multi.shape, "Multi-dim normalization should preserve shape"
|
||
|
||
# Check normalization across last 2 dimensions for each sample
|
||
for b in range(8):
|
||
for s in range(32):
|
||
sample_data = output_multi.data[b, s, :, :].flatten()
|
||
sample_mean = np.mean(sample_data)
|
||
sample_var = np.var(sample_data)
|
||
|
||
assert abs(sample_mean) < 1e-6, f"Multi-dim sample should have ~0 mean"
|
||
assert abs(sample_var - 1.0) < 1e-4, f"Multi-dim sample should have ~1 variance"
|
||
|
||
# Test 4: Callable interface
|
||
output_callable = ln(x)
|
||
assert np.allclose(output.data, output_callable.data), "Callable interface should work"
|
||
|
||
# Test 5: Batch size independence
|
||
x_small = Tensor(np.random.randn(1, seq_len, embed_dim))
|
||
x_large = Tensor(np.random.randn(64, seq_len, embed_dim))
|
||
|
||
output_small = ln.forward(x_small)
|
||
output_large = ln.forward(x_large)
|
||
|
||
# Both should be properly normalized regardless of batch size
|
||
small_mean = np.mean(output_small.data[0, 0, :])
|
||
large_mean = np.mean(output_large.data[0, 0, :]) # Same position
|
||
|
||
assert abs(small_mean) < 1e-6, "Small batch should be normalized"
|
||
assert abs(large_mean) < 1e-6, "Large batch should be normalized"
|
||
|
||
# Test 6: Parameter tracking
|
||
assert len(ln.parameters) == 2, "Should have 2 learnable parameters"
|
||
assert ln.gamma in ln.parameters, "Gamma should be tracked"
|
||
assert ln.beta in ln.parameters, "Beta should be tracked"
|
||
|
||
print("PASS Layer normalization tests passed!")
|
||
print(f"PASS Properly normalizes across feature dimensions")
|
||
print(f"PASS Works with any input shape")
|
||
print(f"PASS Batch-size independent behavior")
|
||
print(f"PASS Supports multi-dimensional normalization")
|
||
|
||
# Test function defined (called in main block)
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### Group Normalization Implementation
|
||
|
||
Group Normalization divides channels into groups and normalizes within each group, providing a middle ground between batch and layer normalization.
|
||
|
||
**Key Insight**: GroupNorm is particularly useful for object detection and when batch sizes are small, as it doesn't depend on batch statistics but provides channel-wise organization.
|
||
"""
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "group-norm", "locked": false, "schema_version": 3, "solution": true, "task": false}
|
||
#| export
|
||
class GroupNorm(Module):
|
||
"""
|
||
Group Normalization for convolutional layers.
|
||
|
||
Divides channels into groups and normalizes within each group.
|
||
Provides benefits of both batch and layer normalization.
|
||
|
||
MATHEMATICAL FOUNDATION:
|
||
For input (N, C, H, W) with G groups:
|
||
1. Reshape to (N, G, C//G, H, W)
|
||
2. Normalize within each group: GN(x) = γ * (x - μ_group) / sqrt(σ²_group + ε) + β
|
||
3. Reshape back to (N, C, H, W)
|
||
"""
|
||
|
||
def __init__(self, num_groups: int, num_channels: int, eps: float = 1e-5):
|
||
"""
|
||
Initialize Group Normalization.
|
||
|
||
TODO: Implement GroupNorm initialization with group configuration.
|
||
|
||
APPROACH (4-Step GroupNorm Setup):
|
||
1. Validate group configuration (num_channels must be divisible by num_groups)
|
||
2. Store configuration parameters
|
||
3. Initialize learnable parameters γ and β for each channel
|
||
4. Set up parameter tracking
|
||
|
||
GROUP ORGANIZATION:
|
||
- Each group contains num_channels // num_groups channels
|
||
- Normalization computed independently within each group
|
||
- Parameters γ and β have shape (num_channels,) for per-channel scaling
|
||
|
||
EXAMPLE (GroupNorm Configurations):
|
||
>>> gn1 = GroupNorm(32, 64) # 32 groups, 64 channels -> 2 channels per group
|
||
>>> gn2 = GroupNorm(8, 256) # 8 groups, 256 channels -> 32 channels per group
|
||
>>> gn3 = GroupNorm(1, 128) # 1 group, 128 channels -> LayerNorm equivalent
|
||
|
||
HINTS:
|
||
- Use assert to validate num_channels % num_groups == 0
|
||
- Special case: num_groups = num_channels -> InstanceNorm (each channel is a group)
|
||
- Special case: num_groups = 1 -> LayerNorm for spatial data
|
||
|
||
Args:
|
||
num_groups: Number of groups to divide channels into
|
||
num_channels: Total number of channels
|
||
eps: Small constant for numerical stability
|
||
"""
|
||
### BEGIN SOLUTION
|
||
super().__init__()
|
||
|
||
# Validate configuration
|
||
assert num_channels % num_groups == 0, f"num_channels ({num_channels}) must be divisible by num_groups ({num_groups})"
|
||
assert num_groups > 0, "num_groups must be positive"
|
||
assert num_channels > 0, "num_channels must be positive"
|
||
|
||
self.num_groups = num_groups
|
||
self.num_channels = num_channels
|
||
self.eps = eps
|
||
|
||
# Calculate channels per group
|
||
self.channels_per_group = num_channels // num_groups
|
||
|
||
# Learnable parameters - one per channel
|
||
self.gamma = Tensor(np.ones((num_channels,))) # Scale parameter
|
||
self.beta = Tensor(np.zeros((num_channels,))) # Shift parameter
|
||
|
||
# Track parameters for optimization
|
||
self.parameters = [self.gamma, self.beta]
|
||
### END SOLUTION
|
||
|
||
def forward(self, x: Tensor) -> Tensor:
|
||
"""
|
||
Apply group normalization to input tensor.
|
||
|
||
TODO: Implement group normalization forward pass.
|
||
|
||
STEP-BY-STEP IMPLEMENTATION:
|
||
1. Reshape input to separate groups: (N, C, H, W) -> (N, G, C//G, H, W)
|
||
2. Compute mean and variance within each group
|
||
3. Normalize within groups
|
||
4. Reshape back to original shape
|
||
5. Apply per-channel γ and β parameters
|
||
|
||
SHAPE TRANSFORMATIONS:
|
||
Input: (N, C, H, W)
|
||
Groups: (N, G, C//G, H, W) # Separate groups for normalization
|
||
Norm: (N, G, C//G, H, W) # Normalized within groups
|
||
Output: (N, C, H, W) # Back to original shape with γ/β applied
|
||
|
||
EXAMPLE:
|
||
>>> gn = GroupNorm(8, 64) # 8 groups, 64 channels
|
||
>>> x = Tensor(np.random.randn(16, 64, 32, 32))
|
||
>>> out = gn.forward(x) # Normalized within 8 groups
|
||
|
||
Args:
|
||
x: Input tensor of shape (N, C, H, W)
|
||
|
||
Returns:
|
||
Normalized tensor of shape (N, C, H, W)
|
||
"""
|
||
### BEGIN SOLUTION
|
||
N, C, H, W = x.shape
|
||
assert C == self.num_channels, f"Expected {self.num_channels} channels, got {C}"
|
||
|
||
# Reshape to separate groups: (N, C, H, W) -> (N, G, C//G, H, W)
|
||
x_grouped = x.data.reshape(N, self.num_groups, self.channels_per_group, H, W)
|
||
|
||
# Compute mean and variance within each group
|
||
# Normalize over dimensions (2, 3, 4) which are (channels_per_group, H, W)
|
||
mean = np.mean(x_grouped, axis=(2, 3, 4), keepdims=True) # Shape: (N, G, 1, 1, 1)
|
||
var = np.var(x_grouped, axis=(2, 3, 4), keepdims=True) # Shape: (N, G, 1, 1, 1)
|
||
|
||
# Normalize within groups
|
||
normalized = (x_grouped - mean) / np.sqrt(var + self.eps)
|
||
|
||
# Reshape back to original shape: (N, G, C//G, H, W) -> (N, C, H, W)
|
||
normalized = normalized.reshape(N, C, H, W)
|
||
|
||
# Apply per-channel learnable parameters
|
||
gamma = self.gamma.data.reshape(1, C, 1, 1) # Broadcast shape
|
||
beta = self.beta.data.reshape(1, C, 1, 1) # Broadcast shape
|
||
|
||
output = gamma * normalized + beta
|
||
|
||
return Tensor(output)
|
||
### END SOLUTION
|
||
|
||
# PASS IMPLEMENTATION CHECKPOINT: All normalization techniques complete
|
||
|
||
# THINK PREDICTION: Which normalization uses the most memory - Batch, Layer, or Group?
|
||
# Your answer: _______ because _______
|
||
|
||
# MAGNIFY SYSTEMS INSIGHT: Complete Normalization Scaling Analysis
|
||
def analyze_normalization_scaling():
|
||
"""Analyze how different normalization techniques scale with architecture size."""
|
||
try:
|
||
print("MAGNIFY SYSTEMS INSIGHT: Normalization Scaling Analysis")
|
||
print("=" * 70)
|
||
|
||
# Different model scales to analyze
|
||
model_configs = [
|
||
(64, "Small CNN"),
|
||
(256, "ResNet-50 layer"),
|
||
(512, "Large CNN"),
|
||
(1024, "Vision Transformer")
|
||
]
|
||
|
||
print(f"{'Channels':<8} {'BatchNorm':<12} {'LayerNorm':<12} {'GroupNorm':<12} {'Context'}")
|
||
print("-" * 70)
|
||
|
||
for channels, context in model_configs:
|
||
# Memory calculations (in bytes, float32 = 4 bytes)
|
||
bn_memory = 4 * channels * 4 # γ, β, running_mean, running_var
|
||
ln_memory = 2 * channels * 4 # γ, β only
|
||
gn_memory = 2 * channels * 4 # γ, β only (same as LayerNorm)
|
||
|
||
print(f"{channels:<8} {bn_memory/1024:.2f} KB {ln_memory/1024:.2f} KB {gn_memory/1024:.2f} KB {context}")
|
||
|
||
print(f"\nTIP MEMORY INSIGHTS:")
|
||
print("• BatchNorm: Highest memory (stores running statistics)")
|
||
print("• LayerNorm: 50% less memory than BatchNorm")
|
||
print("• GroupNorm: Same memory as LayerNorm")
|
||
|
||
# Computational complexity analysis
|
||
print(f"\nSPEED COMPUTATIONAL COMPLEXITY:")
|
||
batch_size, height, width = 32, 64, 64
|
||
channels = 256
|
||
|
||
# Calculate FLOPs for each normalization type
|
||
spatial_size = height * width
|
||
total_elements = batch_size * channels * spatial_size
|
||
|
||
# All normalizations require: mean, variance, normalize, scale, shift
|
||
base_flops = 5 * total_elements # 5 operations per element
|
||
|
||
print(f"Input: ({batch_size}, {channels}, {height}, {width})")
|
||
print(f"BatchNorm FLOPs: ~{base_flops/1e6:.1f}M (batch statistics)")
|
||
print(f"LayerNorm FLOPs: ~{base_flops/1e6:.1f}M (per-sample statistics)")
|
||
print(f"GroupNorm FLOPs: ~{base_flops/1e6:.1f}M (group statistics)")
|
||
|
||
print(f"\nTARGET WHEN TO USE EACH:")
|
||
print("• BatchNorm: Large batches, CNNs, stable batch sizes")
|
||
print("• LayerNorm: Transformers, variable batch sizes, RNNs")
|
||
print("• GroupNorm: Small batches, object detection, fine-tuning")
|
||
|
||
# Demonstrate batch size effects
|
||
print(f"\n📊 BATCH SIZE EFFECTS:")
|
||
test_channels = 128
|
||
bn = BatchNorm2d(test_channels)
|
||
ln = LayerNorm((test_channels, 32, 32))
|
||
gn = GroupNorm(32, test_channels)
|
||
|
||
for batch_size in [1, 4, 16, 64]:
|
||
x = Tensor(np.random.randn(batch_size, test_channels, 32, 32))
|
||
|
||
# Only test mean for first sample to see consistency
|
||
if batch_size > 1: # BatchNorm needs batch_size > 1
|
||
bn_out = bn.forward(x)
|
||
bn_mean = np.mean(bn_out.data[0])
|
||
else:
|
||
bn_mean = "unstable"
|
||
|
||
ln_out = ln.forward(x)
|
||
ln_mean = np.mean(ln_out.data[0])
|
||
|
||
gn_out = gn.forward(x)
|
||
gn_mean = np.mean(gn_out.data[0])
|
||
|
||
print(f"Batch {batch_size:2d}: BN={bn_mean if isinstance(bn_mean, str) else f'{bn_mean:.6f}':<10} "
|
||
f"LN={ln_mean:.6f} GN={gn_mean:.6f}")
|
||
|
||
except Exception as e:
|
||
print(f"WARNING️ Error in scaling analysis: {e}")
|
||
|
||
# Run the scaling analysis
|
||
analyze_normalization_scaling()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### TEST Unit Test: Group Normalization
|
||
|
||
This test validates GroupNorm implementation, ensuring proper grouping and normalization within channel groups.
|
||
"""
|
||
|
||
# %% nbgrader={"grade": true, "grade_id": "test-group-norm", "locked": true, "points": 20, "schema_version": 3, "solution": false, "task": false}
|
||
def test_unit_group_norm():
|
||
"""Unit test for group normalization."""
|
||
print("🔬 Unit Test: Group Normalization...")
|
||
|
||
# Test 1: Basic configuration
|
||
num_groups = 8
|
||
num_channels = 64
|
||
gn = GroupNorm(num_groups, num_channels)
|
||
|
||
# Verify initialization
|
||
assert gn.num_groups == num_groups, "Should store number of groups"
|
||
assert gn.num_channels == num_channels, "Should store number of channels"
|
||
assert gn.channels_per_group == 8, "Should calculate channels per group correctly"
|
||
|
||
# Check parameter shapes
|
||
assert gn.gamma.shape == (num_channels,), f"Gamma shape should be ({num_channels},)"
|
||
assert gn.beta.shape == (num_channels,), f"Beta shape should be ({num_channels},)"
|
||
|
||
# Test 2: Configuration validation
|
||
try:
|
||
GroupNorm(7, 64) # Should fail: 64 % 7 != 0
|
||
assert False, "Should raise error for invalid group configuration"
|
||
except AssertionError as e:
|
||
if "divisible" in str(e):
|
||
pass # Expected error
|
||
else:
|
||
raise e
|
||
|
||
# Test 3: Forward pass
|
||
batch_size, height, width = 16, 32, 32
|
||
x = Tensor(np.random.randn(batch_size, num_channels, height, width) * 3 + 2)
|
||
|
||
output = gn.forward(x)
|
||
|
||
# Check output shape
|
||
assert output.shape == x.shape, "Output should have same shape as input"
|
||
|
||
# Test 4: Verify group normalization properties
|
||
# Each group should have approximately normalized statistics
|
||
channels_per_group = num_channels // num_groups
|
||
|
||
for group_idx in range(num_groups):
|
||
start_channel = group_idx * channels_per_group
|
||
end_channel = start_channel + channels_per_group
|
||
|
||
# Extract group data for first sample
|
||
group_data = output.data[0, start_channel:end_channel, :, :].flatten()
|
||
group_mean = np.mean(group_data)
|
||
group_var = np.var(group_data)
|
||
|
||
assert abs(group_mean) < 1e-5, f"Group {group_idx} should have ~0 mean, got {group_mean}"
|
||
assert abs(group_var - 1.0) < 1e-3, f"Group {group_idx} should have ~1 variance, got {group_var}"
|
||
|
||
# Test 5: Special cases
|
||
# Case 1: num_groups = num_channels (Instance Normalization)
|
||
instance_norm = GroupNorm(num_channels, num_channels)
|
||
assert instance_norm.channels_per_group == 1, "Instance norm should have 1 channel per group"
|
||
|
||
# Case 2: num_groups = 1 (Layer Normalization for spatial data)
|
||
layer_norm_like = GroupNorm(1, num_channels)
|
||
assert layer_norm_like.channels_per_group == num_channels, "Single group should contain all channels"
|
||
|
||
# Test 6: Different group sizes
|
||
configs_to_test = [
|
||
(1, 32), # LayerNorm-like
|
||
(4, 32), # 8 channels per group
|
||
(32, 32), # InstanceNorm-like
|
||
]
|
||
|
||
for groups, channels in configs_to_test:
|
||
gn_test = GroupNorm(groups, channels)
|
||
x_test = Tensor(np.random.randn(8, channels, 16, 16))
|
||
output_test = gn_test.forward(x_test)
|
||
|
||
assert output_test.shape == x_test.shape, f"Config ({groups}, {channels}) should preserve shape"
|
||
|
||
# Basic normalization check
|
||
sample_data = output_test.data[0, :, :, :].flatten()
|
||
overall_mean = np.mean(sample_data)
|
||
# Note: overall variance might not be exactly 1 due to grouping
|
||
|
||
# Test 7: Parameter tracking
|
||
assert len(gn.parameters) == 2, "Should have 2 learnable parameters"
|
||
assert gn.gamma in gn.parameters, "Gamma should be tracked"
|
||
assert gn.beta in gn.parameters, "Beta should be tracked"
|
||
|
||
print("PASS Group normalization tests passed!")
|
||
print(f"PASS Properly groups channels and normalizes within groups")
|
||
print(f"PASS Validates configuration constraints")
|
||
print(f"PASS Supports special cases (Instance/Layer norm variants)")
|
||
print(f"PASS Maintains gradient flow through learnable parameters")
|
||
|
||
# Test function defined (called in main block)
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## Integration: Normalization in Neural Networks
|
||
|
||
Now let's see how normalization techniques integrate with neural network layers to stabilize training and improve performance.
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### Normalization Layer Integration Example
|
||
|
||
Here's how normalization layers are typically used in different architectures:
|
||
|
||
**ConvNet with BatchNorm:**
|
||
```
|
||
Conv2d -> BatchNorm2d -> ReLU -> Conv2d -> BatchNorm2d -> ReLU -> ...
|
||
```
|
||
|
||
**Transformer with LayerNorm:**
|
||
```
|
||
Embedding -> LayerNorm -> Attention -> Add & Norm -> FFN -> Add & Norm -> ...
|
||
```
|
||
|
||
**ResNet Block with GroupNorm:**
|
||
```
|
||
Conv2d -> GroupNorm -> ReLU -> Conv2d -> GroupNorm -> Add -> ReLU
|
||
```
|
||
"""
|
||
|
||
# %% nbgrader={"grade": false, "grade_id": "normalization-example", "locked": false, "schema_version": 3, "solution": true, "task": false}
|
||
def demonstrate_normalization_usage():
|
||
"""
|
||
Demonstrate how different normalization techniques are used in practice.
|
||
|
||
TODO: Implement a simple example showing normalization in a mini-network.
|
||
|
||
APPROACH:
|
||
1. Create sample activations that would be unstable without normalization
|
||
2. Apply different normalization techniques
|
||
3. Show how they stabilize the activations
|
||
4. Demonstrate the effect on gradient flow
|
||
|
||
This function is PROVIDED as an educational example.
|
||
"""
|
||
### BEGIN SOLUTION
|
||
print("🔬 Normalization Integration Example")
|
||
print("=" * 40)
|
||
|
||
# Simulate unstable activations (high variance, non-zero mean)
|
||
batch_size, channels, height, width = 16, 64, 32, 32
|
||
unstable_activations = Tensor(np.random.randn(batch_size, channels, height, width) * 5 + 3)
|
||
|
||
print(f"Original activations:")
|
||
print(f" Mean: {np.mean(unstable_activations.data):.3f}")
|
||
print(f" Std: {np.std(unstable_activations.data):.3f}")
|
||
print(f" Range: [{np.min(unstable_activations.data):.2f}, {np.max(unstable_activations.data):.2f}]")
|
||
|
||
# Apply different normalizations
|
||
bn = BatchNorm2d(channels)
|
||
ln = LayerNorm((channels, height, width))
|
||
gn = GroupNorm(8, channels)
|
||
|
||
bn.train() # Ensure BatchNorm is in training mode
|
||
|
||
bn_output = bn.forward(unstable_activations)
|
||
ln_output = ln.forward(unstable_activations)
|
||
gn_output = gn.forward(unstable_activations)
|
||
|
||
print(f"\nAfter BatchNorm:")
|
||
print(f" Mean: {np.mean(bn_output.data):.6f}")
|
||
print(f" Std: {np.std(bn_output.data):.3f}")
|
||
|
||
print(f"\nAfter LayerNorm:")
|
||
print(f" Mean: {np.mean(ln_output.data):.6f}")
|
||
print(f" Std: {np.std(ln_output.data):.3f}")
|
||
|
||
print(f"\nAfter GroupNorm:")
|
||
print(f" Mean: {np.mean(gn_output.data):.6f}")
|
||
print(f" Std: {np.std(gn_output.data):.3f}")
|
||
|
||
print(f"\nPASS All normalization techniques stabilize activations!")
|
||
print(f"PASS Mean ~= 0, Std ~= 1 for all methods")
|
||
### END SOLUTION
|
||
|
||
# Run the demonstration
|
||
demonstrate_normalization_usage()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### Performance Comparison: Training Stability
|
||
|
||
Let's compare how different normalization techniques affect training stability by simulating gradient updates.
|
||
"""
|
||
|
||
# PASS IMPLEMENTATION CHECKPOINT: All normalization implementations complete
|
||
|
||
# THINK PREDICTION: Which normalization technique will be most stable for very small batch sizes?
|
||
# Your answer: _______ because _______
|
||
|
||
# MAGNIFY SYSTEMS INSIGHT: Training Stability Analysis
|
||
def analyze_training_stability():
|
||
"""Analyze how normalization affects training stability across different scenarios."""
|
||
try:
|
||
print("MAGNIFY SYSTEMS INSIGHT: Training Stability Analysis")
|
||
print("=" * 60)
|
||
|
||
# Test stability across different batch sizes
|
||
channels = 128
|
||
scenarios = [
|
||
(1, "Single sample (inference)"),
|
||
(2, "Tiny batch (edge case)"),
|
||
(8, "Small batch (mobile/edge)"),
|
||
(32, "Standard batch"),
|
||
(128, "Large batch")
|
||
]
|
||
|
||
bn = BatchNorm2d(channels)
|
||
ln = LayerNorm((channels, 16, 16))
|
||
gn = GroupNorm(16, channels)
|
||
|
||
print(f"{'Batch Size':<12} {'BatchNorm':<12} {'LayerNorm':<12} {'GroupNorm':<12} {'Scenario'}")
|
||
print("-" * 70)
|
||
|
||
for batch_size, scenario in scenarios:
|
||
x = Tensor(np.random.randn(batch_size, channels, 16, 16) * 2 + 1)
|
||
|
||
# BatchNorm stability
|
||
if batch_size == 1:
|
||
bn_stability = "UNSTABLE" # Can't compute batch stats with N=1
|
||
else:
|
||
bn.train()
|
||
bn_out = bn.forward(x)
|
||
bn_var = np.var(bn_out.data)
|
||
bn_stability = f"{bn_var:.4f}"
|
||
|
||
# LayerNorm stability
|
||
ln_out = ln.forward(x)
|
||
ln_var = np.var(ln_out.data[0]) # Per sample variance
|
||
ln_stability = f"{ln_var:.4f}"
|
||
|
||
# GroupNorm stability
|
||
gn_out = gn.forward(x)
|
||
gn_var = np.var(gn_out.data[0]) # Per sample variance
|
||
gn_stability = f"{gn_var:.4f}"
|
||
|
||
print(f"{batch_size:<12} {bn_stability:<12} {ln_stability:<12} {gn_stability:<12} {scenario}")
|
||
|
||
print(f"\nTIP STABILITY INSIGHTS:")
|
||
print("• BatchNorm: Unstable with batch_size=1, best with large batches")
|
||
print("• LayerNorm: Consistent across all batch sizes")
|
||
print("• GroupNorm: Consistent across all batch sizes")
|
||
|
||
# Gradient flow analysis
|
||
print(f"\n🌊 GRADIENT FLOW ANALYSIS:")
|
||
|
||
# Simulate deep network gradients
|
||
x = Tensor(np.random.randn(16, channels, 16, 16))
|
||
|
||
# Test gradient magnitude after normalization
|
||
original_grad_norm = np.linalg.norm(x.data)
|
||
|
||
bn_out = bn.forward(x)
|
||
ln_out = ln.forward(x)
|
||
gn_out = gn.forward(x)
|
||
|
||
print(f"Original gradient norm: {original_grad_norm:.3f}")
|
||
print(f"After BatchNorm: ~{np.linalg.norm(bn_out.data):.3f} (normalized)")
|
||
print(f"After LayerNorm: ~{np.linalg.norm(ln_out.data):.3f} (normalized)")
|
||
print(f"After GroupNorm: ~{np.linalg.norm(gn_out.data):.3f} (normalized)")
|
||
|
||
print(f"\nTARGET PRACTICAL RECOMMENDATIONS:")
|
||
print("• Use BatchNorm for: CNNs with batch_size >= 8, stable training")
|
||
print("• Use LayerNorm for: Transformers, RNNs, variable batch sizes")
|
||
print("• Use GroupNorm for: Object detection, fine-tuning, small batches")
|
||
|
||
except Exception as e:
|
||
print(f"WARNING️ Error in stability analysis: {e}")
|
||
|
||
# Run the stability analysis
|
||
analyze_training_stability()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### TEST Integration Test: Complete Normalization Suite
|
||
|
||
This test validates that all normalization techniques work together and can be used interchangeably in neural network architectures.
|
||
"""
|
||
|
||
# %% nbgrader={"grade": true, "grade_id": "test-normalization-integration", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false}
|
||
def test_unit_normalization_integration():
|
||
"""Integration test for all normalization techniques."""
|
||
print("🔬 Integration Test: Complete Normalization Suite...")
|
||
|
||
# Test configuration
|
||
batch_size, channels, height, width = 8, 32, 16, 16
|
||
x = Tensor(np.random.randn(batch_size, channels, height, width) * 3 + 2)
|
||
|
||
# Initialize all normalization types
|
||
bn = BatchNorm2d(channels)
|
||
ln = LayerNorm((channels, height, width))
|
||
gn = GroupNorm(8, channels) # 4 channels per group
|
||
|
||
# Test 1: All normalizations work with same input
|
||
bn.train()
|
||
bn_output = bn.forward(x)
|
||
ln_output = ln.forward(x)
|
||
gn_output = gn.forward(x)
|
||
|
||
# All should have same output shape
|
||
assert bn_output.shape == x.shape, "BatchNorm should preserve shape"
|
||
assert ln_output.shape == x.shape, "LayerNorm should preserve shape"
|
||
assert gn_output.shape == x.shape, "GroupNorm should preserve shape"
|
||
|
||
# Test 2: All produce normalized outputs
|
||
for name, output in [("BatchNorm", bn_output), ("LayerNorm", ln_output), ("GroupNorm", gn_output)]:
|
||
# Check that outputs are normalized (approximately)
|
||
output_mean = np.mean(output.data)
|
||
output_std = np.std(output.data)
|
||
|
||
# Normalization should reduce extreme values
|
||
assert abs(output_mean) < 2.0, f"{name} should reduce mean magnitude"
|
||
assert 0.5 < output_std < 2.0, f"{name} should normalize standard deviation"
|
||
|
||
# Test 3: Parameter count comparison
|
||
bn_params = len(bn.parameters)
|
||
ln_params = len(ln.parameters)
|
||
gn_params = len(gn.parameters)
|
||
|
||
assert bn_params == 2, "BatchNorm should have 2 learnable parameters"
|
||
assert ln_params == 2, "LayerNorm should have 2 learnable parameters"
|
||
assert gn_params == 2, "GroupNorm should have 2 learnable parameters"
|
||
|
||
# Test 4: Training vs evaluation mode (BatchNorm only)
|
||
bn.train()
|
||
bn_train_out = bn.forward(x)
|
||
|
||
bn.eval()
|
||
bn_eval_out = bn.forward(x)
|
||
|
||
# Outputs should be different (training uses batch stats, eval uses running stats)
|
||
# Note: might be similar if running stats are close to batch stats
|
||
assert bn_train_out.shape == bn_eval_out.shape, "Train/eval should have same shape"
|
||
|
||
# Test 5: Batch size independence (LayerNorm and GroupNorm)
|
||
x_single = Tensor(np.random.randn(1, channels, height, width))
|
||
|
||
ln_single = ln.forward(x_single)
|
||
gn_single = gn.forward(x_single)
|
||
|
||
assert ln_single.shape == x_single.shape, "LayerNorm should work with batch_size=1"
|
||
assert gn_single.shape == x_single.shape, "GroupNorm should work with batch_size=1"
|
||
|
||
# Test 6: Memory efficiency check
|
||
# All should use similar parameter memory (2 * channels * 4 bytes for γ and β)
|
||
expected_param_memory = 2 * channels * 4 # γ and β parameters
|
||
|
||
# BatchNorm has additional running statistics
|
||
bn_total_memory = 4 * channels * 4 # γ, β, running_mean, running_var
|
||
ln_total_memory = 2 * channels * 4 # γ, β only
|
||
gn_total_memory = 2 * channels * 4 # γ, β only
|
||
|
||
assert bn_total_memory > ln_total_memory, "BatchNorm should use more memory (running stats)"
|
||
assert ln_total_memory == gn_total_memory, "LayerNorm and GroupNorm should use same memory"
|
||
|
||
print("PASS Normalization integration tests passed!")
|
||
print(f"PASS All techniques work with same input format")
|
||
print(f"PASS All produce appropriately normalized outputs")
|
||
print(f"PASS Memory usage patterns are as expected")
|
||
print(f"PASS Batch size independence works correctly")
|
||
|
||
# Test function defined (called in main block)
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## Testing: Comprehensive Validation
|
||
|
||
Let's run comprehensive tests to ensure all normalization implementations work correctly.
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### Performance Benchmarking
|
||
|
||
Let's benchmark the performance characteristics of our normalization implementations.
|
||
"""
|
||
|
||
def benchmark_normalization_performance():
|
||
"""
|
||
Benchmark performance of different normalization techniques.
|
||
|
||
This function is PROVIDED for educational analysis.
|
||
"""
|
||
print("SPEED Performance Benchmark: Normalization Techniques")
|
||
print("=" * 55)
|
||
|
||
import time
|
||
|
||
# Test configuration
|
||
batch_size, channels, height, width = 32, 256, 64, 64
|
||
num_iterations = 100
|
||
|
||
# Create test data
|
||
x = Tensor(np.random.randn(batch_size, channels, height, width))
|
||
|
||
# Initialize normalization layers
|
||
bn = BatchNorm2d(channels)
|
||
ln = LayerNorm((channels, height, width))
|
||
gn = GroupNorm(32, channels) # 8 channels per group
|
||
|
||
# Benchmark each technique
|
||
techniques = [
|
||
("BatchNorm2d", bn),
|
||
("LayerNorm", ln),
|
||
("GroupNorm", gn)
|
||
]
|
||
|
||
results = {}
|
||
|
||
for name, norm_layer in techniques:
|
||
if name == "BatchNorm2d":
|
||
norm_layer.train() # Ensure training mode
|
||
|
||
# Warmup
|
||
for _ in range(10):
|
||
_ = norm_layer.forward(x)
|
||
|
||
# Benchmark
|
||
start_time = time.perf_counter()
|
||
for _ in range(num_iterations):
|
||
output = norm_layer.forward(x)
|
||
end_time = time.perf_counter()
|
||
|
||
avg_time_ms = (end_time - start_time) * 1000 / num_iterations
|
||
results[name] = avg_time_ms
|
||
|
||
print(f"{name:<12}: {avg_time_ms:.3f} ms/forward")
|
||
|
||
# Analysis
|
||
print(f"\n📊 Performance Analysis:")
|
||
baseline = results["BatchNorm2d"]
|
||
for name, time_ms in results.items():
|
||
speedup = baseline / time_ms
|
||
print(f" {name}: {speedup:.2f}x relative to BatchNorm")
|
||
|
||
print(f"\nTIP Performance Insights:")
|
||
print(f" • All normalizations have similar computational complexity")
|
||
print(f" • Differences mainly due to memory access patterns")
|
||
print(f" • BatchNorm may be slightly faster due to batch parallelization")
|
||
|
||
# Run performance benchmark
|
||
benchmark_normalization_performance()
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## Main Execution Block
|
||
|
||
Run all tests to validate our normalization implementations.
|
||
"""
|
||
|
||
if __name__ == "__main__":
|
||
"""Main execution block - runs all normalization tests."""
|
||
print("TEST Running Complete Normalization Test Suite")
|
||
print("=" * 50)
|
||
|
||
# Run all unit tests
|
||
test_unit_batch_norm()
|
||
print()
|
||
|
||
test_unit_layer_norm()
|
||
print()
|
||
|
||
test_unit_group_norm()
|
||
print()
|
||
|
||
test_unit_normalization_integration()
|
||
print()
|
||
|
||
print("PASS All normalization tests passed!")
|
||
print("\nTARGET NORMALIZATION SUITE COMPLETE")
|
||
print("Your normalization implementations are ready for use in neural networks!")
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## THINK ML Systems Thinking: Interactive Questions
|
||
|
||
Now that you've implemented all three major normalization techniques, let's reflect on their systems implications and design trade-offs.
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### Question 1: Memory and Batch Size Trade-offs
|
||
|
||
**Context**: In your BatchNorm2d implementation, you saw that running statistics require additional memory (4* parameters vs 2* for LayerNorm/GroupNorm), but BatchNorm fails completely with batch_size=1. Your memory analysis showed that BatchNorm needs 2* the memory of other techniques, while your stability analysis revealed batch size dependencies.
|
||
|
||
**Reflection Question**: Analyze the memory vs batch size trade-offs in your normalization implementations. When you tested different batch sizes, you discovered BatchNorm becomes unstable with small batches while LayerNorm/GroupNorm remain consistent. For a production system that needs to handle both training (large batches) and inference (single samples), how would you modify your current normalization implementations to optimize memory usage while maintaining stability? Consider the running statistics storage in your BatchNorm class and the per-sample computation in your LayerNorm class.
|
||
|
||
Think about: running statistics memory optimization, batch size adaptation strategies, inference mode memory requirements, and hybrid normalization approaches.
|
||
|
||
*Target length: 150-300 words*
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### Question 2: Computational Scaling and Group Organization
|
||
|
||
**Context**: Your GroupNorm implementation divides channels into groups and normalizes within each group, providing a middle ground between BatchNorm and LayerNorm. Your scaling analysis showed that all normalization techniques have similar computational complexity, but different memory access patterns. The group organization in your implementation affects both memory layout and computational efficiency.
|
||
|
||
**Reflection Question**: Examine the computational scaling patterns in your normalization implementations. Your GroupNorm.forward() method reshapes tensors to separate groups, computes statistics within groups, then reshapes back. How does this grouping strategy affect memory access patterns and cache efficiency compared to your BatchNorm (batch-wise) and LayerNorm (sample-wise) approaches? If you needed to optimize your GroupNorm implementation for very large channel counts (1024+ channels), what modifications to your group organization and computation order would improve performance while maintaining mathematical correctness?
|
||
|
||
Think about: memory access patterns, cache locality, vectorization opportunities, and group size optimization strategies.
|
||
|
||
*Target length: 150-300 words*
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
### Question 3: Production Deployment and Architecture Selection
|
||
|
||
**Context**: Your normalization implementations mirror production systems - BatchNorm for CNNs like ResNet, LayerNorm for Transformers like BERT/GPT, and GroupNorm for object detection models. Your training stability analysis revealed when each technique works best, and your performance benchmarks showed similar computational costs but different memory characteristics.
|
||
|
||
**Reflection Question**: Based on your implementation experience and performance analysis, design a normalization selection strategy for a production ML system that needs to support multiple model architectures (CNNs, Transformers, and detection models). Your BatchNorm implementation works well for large-batch training but fails at batch_size=1, while your LayerNorm provides consistent behavior but lacks the batch parallelization benefits. How would you extend your current normalization classes to create an adaptive normalization system that automatically selects the optimal technique based on input characteristics (batch size, model architecture, deployment constraints)?
|
||
|
||
Think about: automatic technique selection, runtime adaptation, memory budget constraints, and deployment environment requirements.
|
||
|
||
*Target length: 150-300 words*
|
||
"""
|
||
|
||
# %% [markdown]
|
||
"""
|
||
## TARGET MODULE SUMMARY: Normalization
|
||
|
||
Congratulations! You have successfully implemented the complete normalization toolkit that makes modern deep learning possible:
|
||
|
||
### PASS What You Have Built
|
||
- **BatchNorm2d**: Complete batch normalization with running statistics and train/eval modes
|
||
- **LayerNorm**: Batch-independent normalization for any tensor dimensions
|
||
- **GroupNorm**: Channel group normalization balancing batch and layer norm benefits
|
||
- **🆕 Comprehensive Analysis**: Memory scaling, training stability, and performance benchmarking
|
||
- **🆕 Integration Examples**: How normalization fits into different network architectures
|
||
|
||
### PASS Technical Mastery
|
||
- **Statistical Computing**: Efficient mean/variance computation across different tensor dimensions
|
||
- **Memory Management**: Understanding parameter storage vs running statistics trade-offs
|
||
- **Training Dynamics**: How normalization affects gradient flow and training stability
|
||
- **Batch Dependencies**: When and why batch size affects normalization behavior
|
||
- **🆕 Production Patterns**: Architecture-specific normalization choices and deployment considerations
|
||
|
||
### PASS Systems Understanding
|
||
- **Memory Scaling**: BatchNorm uses 2* memory of LayerNorm/GroupNorm due to running statistics
|
||
- **Computational Complexity**: All techniques have similar O(N) complexity but different access patterns
|
||
- **Batch Size Effects**: BatchNorm requires batch_size > 1, others work with any batch size
|
||
- **Cache Efficiency**: How normalization axes affect memory access patterns and vectorization
|
||
- **🆕 Training Stability**: Why normalization enables higher learning rates and deeper networks
|
||
|
||
### LINK Connection to Real ML Systems
|
||
Your implementations mirror production systems:
|
||
- **PyTorch nn.BatchNorm2d**: Your BatchNorm2d matches PyTorch's interface and behavior
|
||
- **BERT LayerNorm**: Your LayerNorm enables transformer training stability
|
||
- **Object Detection GroupNorm**: Your GroupNorm provides batch-independent normalization
|
||
- **Production Deployment**: Understanding of when to use each technique in real systems
|
||
|
||
### ROCKET What You Can Build Now
|
||
- **Stable CNNs**: Use BatchNorm for ResNet-style architectures with large batches
|
||
- **Transformer Models**: Use LayerNorm for attention-based architectures
|
||
- **Detection Systems**: Use GroupNorm for models with variable batch sizes
|
||
- **Adaptive Networks**: Combine techniques for optimal performance across scenarios
|
||
|
||
### Next Steps
|
||
1. **Export your module**: `tito module complete 08_normalization`
|
||
2. **Integration ready**: Your normalization layers integrate with any neural network architecture
|
||
3. **Ready for Module 09**: Spatial operations will use your normalization for CNN stability
|
||
|
||
**CELEBRATE Achievement Unlocked**: You've mastered the normalization techniques that enable modern deep learning, with complete understanding of their memory characteristics and performance trade-offs!
|
||
""" |