# ---
# jupyter:
#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: percent
#       format_version: '1.3'
#       jupytext_version: 1.17.1
#   kernelspec:
#     display_name: Python 3 (ipykernel)
#     language: python
#     name: python3
# ---

# %% [markdown]
"""
# Module 14: Profiling - Measuring What Matters in ML Systems

Welcome to Module 14! You'll build professional profiling tools to measure model performance and uncover optimization opportunities.

## 🔗 Prerequisites & Progress
**You've Built**: Complete ML stack from tensors to transformers
**You'll Build**: Comprehensive profiling system for parameters, FLOPs, memory, and latency
**You'll Enable**: Data-driven optimization decisions and performance analysis

**Connection Map**:
```
All Modules (01-13) → Profiling (14) → Optimization Techniques (15-18)
  (implementations)    (measurement)      (targeted fixes)
```

**Before starting this module, verify:**
- [ ] Module 01 (Tensor): Core tensor operations
- [ ] Module 03 (Layers): Linear layer implementation
- [ ] Module 08 (Spatial): Convolutional operations

This module can work standalone with a minimal Tensor implementation, but
full functionality requires previous modules for realistic profiling scenarios.

## Learning Objectives
By the end of this module, you will:
1. Implement a complete Profiler class for model analysis
2. Count parameters and FLOPs accurately for different architectures
3. Measure memory usage and latency with statistical rigor
4. Create production-quality performance analysis tools

Let's build the measurement foundation for ML systems optimization!

## 📦 Where This Code Lives in the Final Package

**Learning Side:** You work in `modules/14_profiling/profiling_dev.py`
**Building Side:** Code exports to `tinytorch.profiling.profiler`

```python
# How to use this module:
from tinytorch.profiling.profiler import Profiler, profile_forward_pass, profile_backward_pass
```

**Why this matters:**
- **Learning:** Complete profiling system for understanding model performance characteristics
- **Production:** Professional measurement tools like those used in PyTorch, TensorFlow
- **Consistency:** All profiling and measurement tools in profiling.profiler
- **Integration:** Works with any model built using TinyTorch components
"""

# %% nbgrader={"grade": false, "grade_id": "imports", "solution": true}
#| default_exp profiling.profiler
#| export

import sys
import os
import time
import numpy as np
import tracemalloc
from typing import Dict, List, Any, Optional, Tuple
from collections import defaultdict
import gc

# Import from TinyTorch package (previous modules must be completed and exported)
from tinytorch.core.tensor import Tensor
from tinytorch.core.layers import Linear
from tinytorch.core.spatial import Conv2d

# Constants for memory and performance measurement
BYTES_PER_FLOAT32 = 4      # Standard float32 size in bytes
KB_TO_BYTES = 1024         # Kilobytes to bytes conversion
MB_TO_BYTES = 1024 * 1024  # Megabytes to bytes conversion

# %% [markdown]
"""
## 1. Introduction: Why Profiling Matters in ML Systems

Imagine you're a detective investigating a performance crime. Your model is running slowly, using too much memory, or burning through compute budgets. Without profiling, you're flying blind - making guesses about what to optimize. With profiling, you have evidence.

**The Performance Investigation Process:**
```
Suspect Model → Profile Evidence → Identify Bottleneck → Target Optimization
      ↓                ↓                   ↓                     ↓
  "Too slow"      "200 GFLOP/s"      "Memory bound"      "Reduce transfers"
```

**Questions Profiling Answers:**
- **How many parameters?** (Memory footprint, model size)
- **How many FLOPs?** (Computational cost, energy usage)
- **Where are bottlenecks?** (Memory vs compute bound)
- **What's actual latency?** (Real-world performance)

**Production Importance:**
In production ML systems, profiling isn't optional - it's survival. A model that's 10% more accurate but 100× slower often can't be deployed. Teams use profiling daily to make data-driven optimization decisions, not guesses.

### The Profiling Workflow Visualization
```
Model → Profiler → Measurements → Analysis → Optimization Decision
  ↓        ↓            ↓            ↓               ↓
 GPT    Parameter   125M params   Memory      Use quantization
        Counter     2.5B FLOPs    bound       Reduce precision
```
"""

# %% [markdown]
"""
### 🔗 From Implementation to Optimization: The Profiling Foundation

**In this module (14)**, you'll build the measurement tools to discover optimization opportunities.
**In later modules (15+)**, you'll use these profiling insights to implement optimizations like KV caching.

**The Real ML Engineering Workflow**:
```
Step 1: Measure (This Module!)           Step 2: Analyze
        ↓                                       ↓
Profile baseline  →  Find bottleneck  →  Understand cause
   40 tok/s          80% in attention     O(n²) recomputation
                                                ↓
Step 4: Validate                         Step 3: Optimize (Future Modules)
        ↓                                       ↓
Profile optimized ←  Verify speedup   ←  Implement optimization
 500 tok/s (12.5x)   Measure impact       Design solution
```

**Without profiling**: You'd never know WHERE to optimize!
**Without measurement**: You couldn't verify improvements!

This module teaches the measurement and analysis skills that enable
optimization breakthroughs. You'll profile real models and discover
bottlenecks just like production ML teams do.
"""

# %% [markdown]
"""
## 2. Foundations: Performance Measurement Principles

Before we build our profiler, let's understand what we're measuring and why each metric matters.

### Parameter Counting - Model Size Detective Work

Parameters determine your model's memory footprint and storage requirements. Every parameter is typically a 32-bit float (4 bytes), so counting them precisely predicts memory usage.

**Parameter Counting Formula:**
```
Linear Layer: (input_features × output_features) + output_features
                           ↑                             ↑
                     Weight matrix                  Bias vector
              (weights + biases = total parameters)

Example: Linear(768, 3072) → (768 × 3072) + 3072 = 2,362,368 parameters
Memory:  2,362,368 × 4 bytes = 9.45 MB
```
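
The same arithmetic is easy to script. Here is a minimal sketch of the formula above (plain Python, no TinyTorch required):

```python
def linear_param_count(in_features: int, out_features: int) -> int:
    # weight matrix (in x out) plus one bias per output
    return in_features * out_features + out_features

params = linear_param_count(768, 3072)
print(params)                                   # 2362368
print(f"{params * 4 / 1e6:.2f} MB at float32")  # 9.45 MB
```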

### FLOP Counting - Computational Cost Analysis

FLOPs (Floating Point Operations) measure computational work. Unlike wall-clock time, FLOPs are hardware-independent and predict compute costs across different systems.

**FLOP Formulas for Key Operations:**
```
Matrix Multiplication (M,K) @ (K,N):
FLOPs = M × N × K × 2
        ↑   ↑   ↑   ↑
      Rows Cols Inner Multiply+Add

Linear Layer Forward:
FLOPs = batch_size × input_features × output_features × 2
        (the matmul dominates; the bias add contributes batch_size × output_features more)

Convolution (simplified):
FLOPs = output_H × output_W × kernel_H × kernel_W × in_channels × out_channels × 2
```
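
These formulas drop straight into code. A minimal sketch, using the same per-sample convention as the Profiler implemented below (batch size excluded, bias add ignored for Linear):

```python
def matmul_flops(m: int, n: int, k: int) -> int:
    # each of the m*n outputs needs k multiplies and k adds
    return m * n * k * 2

def linear_flops(in_features: int, out_features: int) -> int:
    # per-sample matmul cost; the bias add would contribute out_features more
    return in_features * out_features * 2

def conv2d_flops(out_h, out_w, k_h, k_w, c_in, c_out) -> int:
    return out_h * out_w * k_h * k_w * c_in * c_out * 2

print(linear_flops(128, 64))                # 16384
print(conv2d_flops(112, 112, 7, 7, 3, 64))  # 236027904
```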

### Memory Profiling - The Three Types of Memory

ML models use memory in three distinct ways, each with different optimization strategies:

**Memory Type Breakdown:**
```
Total Training Memory = Parameters + Activations + Gradients + Optimizer State
                            ↓            ↓            ↓             ↓
                          Model       Forward      Backward    Adam: 2×params
                         weights     pass cache   gradients    SGD:  0×params

Example for 125M parameter model:
Parameters:   500 MB  (125M × 4 bytes)
Activations:  200 MB  (depends on batch size)
Gradients:    500 MB  (same as parameters)
Adam state: 1,000 MB  (momentum + velocity)
Total:      2,200 MB  (4.4× parameter memory!)
```
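
That breakdown is easy to sanity-check in code. A minimal sketch, assuming float32 parameters and Adam's two extra buffers (momentum and velocity):

```python
def training_memory_mb(param_count: int, activation_mb: float, optimizer: str = "adam") -> float:
    # weights and gradients each take one float32 copy; Adam adds two more
    param_mb = param_count * 4 / 1e6
    optimizer_mb = {"sgd": 0.0, "adam": 2 * param_mb}[optimizer]
    return param_mb + activation_mb + param_mb + optimizer_mb

print(training_memory_mb(125_000_000, 200.0))  # 2200.0 MB, i.e. 4.4x the 500 MB of weights
```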

### Latency Measurement - Dealing with Reality

Latency measurement is tricky because systems have variance, warmup effects, and measurement overhead. Professional profiling requires statistical rigor.

**Latency Measurement Best Practices:**
```
Measurement Protocol:
1. Warmup runs (10+)  → CPU/GPU caches warm up
2. Timed runs (100+)  → Statistical significance
3. Outlier handling   → Use median, not mean
4. Memory cleanup     → Prevent contamination

Timeline:
Warmup: [run][run][run]...[run]       ← Don't time these
Timing: [⏱run⏱][⏱run⏱]...[⏱run⏱]     ← Time these
Result: median(all_times)             ← Robust to outliers
```
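
The whole protocol fits in a few lines. A minimal sketch of the warmup-then-median recipe (the full version, handling model interfaces, becomes `Profiler.measure_latency` below):

```python
import time
import numpy as np

def time_callable_ms(fn, warmup: int = 10, iterations: int = 100) -> float:
    for _ in range(warmup):            # stabilize caches and frequency scaling
        fn()
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return float(np.median(samples))   # median resists scheduler spikes

print(f"{time_callable_ms(lambda: sum(range(10_000))):.3f} ms")
```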
"""

# %% [markdown]
"""
## 3. Implementation: Building the Core Profiler Class

Now let's implement our profiler step by step. We'll start with the foundation and build up to comprehensive analysis.

### The Profiler Architecture
```
Profiler Class
├── count_parameters()      → Model size analysis
├── count_flops()           → Computational cost estimation
├── measure_memory()        → Memory usage tracking
├── measure_latency()       → Performance timing
├── profile_layer()         → Layer-wise analysis
├── profile_forward_pass()  → Complete forward analysis
└── profile_backward_pass() → Training analysis

Integration:
All methods work together to provide comprehensive performance insights
```
"""

# %% nbgrader={"grade": false, "grade_id": "profiler_class", "solution": true}
#| export
class Profiler:
    """
    Professional-grade ML model profiler for performance analysis.

    Measures parameters, FLOPs, memory usage, and latency with statistical rigor.
    Used for optimization guidance and deployment planning.
    """

    def __init__(self):
        """
        Initialize profiler with measurement state.

        TODO: Set up profiler tracking structures

        APPROACH:
        1. Create empty measurements dictionary
        2. Initialize operation counters
        3. Set up memory tracking state

        EXAMPLE:
        >>> profiler = Profiler()
        >>> profiler.measurements
        {}

        HINTS:
        - Use defaultdict(int) for operation counters
        - measurements dict will store timing results
        """
        ### BEGIN SOLUTION
        self.measurements = {}
        self.operation_counts = defaultdict(int)
        self.memory_tracker = None
        ### END SOLUTION

    def count_parameters(self, model) -> int:
        """
        Count total trainable parameters in a model.

        TODO: Implement parameter counting for any model with parameters() method

        APPROACH:
        1. Get all parameters from model.parameters() if available
        2. For single layers, count weight and bias directly
        3. Sum total element count across all parameter tensors

        EXAMPLE:
        >>> linear = Linear(128, 64)  # 128*64 + 64 = 8256 parameters
        >>> profiler = Profiler()
        >>> count = profiler.count_parameters(linear)
        >>> print(count)
        8256

        HINTS:
        - Use parameter.data.size for tensor element count
        - Handle models with and without parameters() method
        - Don't forget bias terms when present
        """
        ### BEGIN SOLUTION
        total_params = 0

        # Handle SimpleModel pattern (has .layers attribute)
        if hasattr(model, 'layers'):
            # SimpleModel: iterate through layers
            for layer in model.layers:
                for param in layer.parameters():
                    total_params += param.data.size
        elif hasattr(model, 'parameters'):
            # Model with direct parameters() method
            for param in model.parameters():
                total_params += param.data.size
        elif hasattr(model, 'weight'):
            # Single layer (Linear, Conv2d) - all have .weight
            total_params += model.weight.data.size
            # Check for bias (may be None)
            if hasattr(model, 'bias') and model.bias is not None:
                total_params += model.bias.data.size
        else:
            # No parameters (activations, etc.)
            total_params = 0

        return total_params
        ### END SOLUTION

    def count_flops(self, model, input_shape: Tuple[int, ...]) -> int:
        """
        Count FLOPs (Floating Point Operations) for one forward pass.

        TODO: Implement FLOP counting for different layer types

        APPROACH:
        1. Create dummy input with given shape
        2. Calculate FLOPs based on layer type and dimensions
        3. Handle different model architectures (Linear, Conv2d, Sequential)

        LAYER-SPECIFIC FLOP FORMULAS:
        - Linear: input_features × output_features × 2 (matmul + bias)
        - Conv2d: output_h × output_w × kernel_h × kernel_w × in_channels × out_channels × 2
        - Activation: Usually 1 FLOP per element (ReLU, Sigmoid)

        EXAMPLE:
        >>> linear = Linear(128, 64)
        >>> profiler = Profiler()
        >>> flops = profiler.count_flops(linear, (1, 128))
        >>> print(flops)  # 128 * 64 * 2 = 16384
        16384

        HINTS:
        - Batch dimension doesn't affect per-sample FLOPs
        - Focus on major operations (matmul, conv) first
        - For Sequential models, sum FLOPs of all layers
        """
        ### BEGIN SOLUTION
        # Create dummy input (unused but kept for interface consistency)
        _dummy_input = Tensor(np.random.randn(*input_shape))
        total_flops = 0

        # Handle different model types
        if hasattr(model, '__class__'):
            model_name = model.__class__.__name__

            if model_name == 'Linear':
                # Linear layer: input_features × output_features × 2
                in_features = input_shape[-1]
                out_features = model.weight.shape[1] if hasattr(model, 'weight') else 1
                total_flops = in_features * out_features * 2

            elif model_name == 'Conv2d':
                # Conv2d layer: complex calculation based on output size
                # Simplified: assume we know the output dimensions
                if hasattr(model, 'kernel_size') and hasattr(model, 'in_channels'):
                    _batch_size = input_shape[0] if len(input_shape) > 3 else 1
                    in_channels = model.in_channels
                    out_channels = model.out_channels
                    kernel_h = kernel_w = model.kernel_size

                    # Estimate output size (simplified)
                    input_h, input_w = input_shape[-2], input_shape[-1]
                    output_h = input_h // (model.stride if hasattr(model, 'stride') else 1)
                    output_w = input_w // (model.stride if hasattr(model, 'stride') else 1)

                    total_flops = (output_h * output_w * kernel_h * kernel_w *
                                   in_channels * out_channels * 2)

            elif model_name == 'Sequential':
                # Sequential model: sum FLOPs of all layers
                current_shape = input_shape
                for layer in model.layers:
                    layer_flops = self.count_flops(layer, current_shape)
                    total_flops += layer_flops
                    # Update shape for next layer (simplified)
                    if hasattr(layer, 'weight'):
                        current_shape = current_shape[:-1] + (layer.weight.shape[1],)

            else:
                # Activation or other: assume 1 FLOP per element
                total_flops = np.prod(input_shape)

        return total_flops
        ### END SOLUTION

    def measure_memory(self, model, input_shape: Tuple[int, ...]) -> Dict[str, float]:
        """
        Measure memory usage during forward pass.

        TODO: Implement memory tracking for model execution

        APPROACH:
        1. Use tracemalloc to track memory allocation
        2. Measure baseline memory before model execution
        3. Run forward pass and track peak usage
        4. Calculate different memory components

        RETURN DICTIONARY:
        - 'parameter_memory_mb': Memory for model parameters
        - 'activation_memory_mb': Memory for activations
        - 'peak_memory_mb': Maximum memory usage
        - 'memory_efficiency': Ratio of useful to total memory

        EXAMPLE:
        >>> linear = Linear(1024, 512)
        >>> profiler = Profiler()
        >>> memory = profiler.measure_memory(linear, (32, 1024))
        >>> print(f"Parameters: {memory['parameter_memory_mb']:.1f} MB")
        Parameters: 2.1 MB

        HINTS:
        - Use tracemalloc.start() and tracemalloc.get_traced_memory()
        - Account for float32 = 4 bytes per parameter
        - Activation memory scales with batch size
        """
        ### BEGIN SOLUTION
        # Start memory tracking
        tracemalloc.start()

        # Measure baseline memory before allocating anything
        _baseline_memory = tracemalloc.get_traced_memory()[0]

        # Calculate parameter memory
        param_count = self.count_parameters(model)
        parameter_memory_bytes = param_count * BYTES_PER_FLOAT32
        parameter_memory_mb = parameter_memory_bytes / MB_TO_BYTES

        # Create input and measure activation memory
        dummy_input = Tensor(np.random.randn(*input_shape))
        input_memory_bytes = dummy_input.data.nbytes

        # Estimate activation memory (simplified)
        activation_memory_bytes = input_memory_bytes * 2  # Rough estimate
        activation_memory_mb = activation_memory_bytes / MB_TO_BYTES

        # Run forward pass to measure peak memory usage
        _ = model.forward(dummy_input)

        # Get peak memory
        _current_memory, peak_memory = tracemalloc.get_traced_memory()
        peak_memory_mb = (peak_memory - _baseline_memory) / MB_TO_BYTES

        tracemalloc.stop()

        # Calculate efficiency
        useful_memory = parameter_memory_mb + activation_memory_mb
        memory_efficiency = useful_memory / max(peak_memory_mb, 0.001)  # Avoid division by zero

        return {
            'parameter_memory_mb': parameter_memory_mb,
            'activation_memory_mb': activation_memory_mb,
            'peak_memory_mb': max(peak_memory_mb, useful_memory),
            'memory_efficiency': min(memory_efficiency, 1.0)
        }
        ### END SOLUTION

    def measure_latency(self, model, input_tensor, warmup: int = 10, iterations: int = 100) -> float:
        """
        Measure model inference latency with statistical rigor.

        TODO: Implement accurate latency measurement

        APPROACH:
        1. Run warmup iterations to stabilize performance
        2. Measure multiple iterations for statistical accuracy
        3. Calculate median latency to handle outliers
        4. Return latency in milliseconds

        PARAMETERS:
        - warmup: Number of warmup runs (default 10)
        - iterations: Number of measurement runs (default 100)

        EXAMPLE:
        >>> linear = Linear(128, 64)
        >>> input_tensor = Tensor(np.random.randn(1, 128))
        >>> profiler = Profiler()
        >>> latency = profiler.measure_latency(linear, input_tensor)
        >>> print(f"Latency: {latency:.2f} ms")
        Latency: 0.15 ms

        HINTS:
        - Use time.perf_counter() for high precision
        - Use median instead of mean for robustness against outliers
        - Handle different model interfaces (forward, __call__)
        """
        ### BEGIN SOLUTION
        # Warmup runs to stabilize performance
        for _ in range(warmup):
            _ = model.forward(input_tensor)

        # Measurement runs
        times = []
        for _ in range(iterations):
            start_time = time.perf_counter()
            _ = model.forward(input_tensor)
            end_time = time.perf_counter()
            times.append((end_time - start_time) * 1000)  # Convert to milliseconds

        # Calculate statistics - use median for robustness
        times = np.array(times)
        median_latency = np.median(times)

        return float(median_latency)
        ### END SOLUTION

    def profile_layer(self, layer, input_shape: Tuple[int, ...]) -> Dict[str, Any]:
        """
        Profile a single layer comprehensively.

        TODO: Implement layer-wise profiling

        APPROACH:
        1. Count parameters for this layer
        2. Count FLOPs for this layer
        3. Measure memory usage
        4. Measure latency
        5. Return comprehensive layer profile

        EXAMPLE:
        >>> linear = Linear(256, 128)
        >>> profiler = Profiler()
        >>> profile = profiler.profile_layer(linear, (32, 256))
        >>> print(f"Layer uses {profile['parameters']} parameters")
        Layer uses 32896 parameters

        HINTS:
        - Use existing profiler methods (count_parameters, count_flops, etc.)
        - Create dummy input for latency measurement
        - Include layer type information in profile
        """
        ### BEGIN SOLUTION
        # Create dummy input for latency measurement
        dummy_input = Tensor(np.random.randn(*input_shape))

        # Gather all measurements
        params = self.count_parameters(layer)
        flops = self.count_flops(layer, input_shape)
        memory = self.measure_memory(layer, input_shape)
        latency = self.measure_latency(layer, dummy_input, warmup=3, iterations=10)

        # Compute derived metrics
        gflops_per_second = (flops / 1e9) / max(latency / 1000, 1e-6)

        return {
            'layer_type': layer.__class__.__name__,
            'parameters': params,
            'flops': flops,
            'latency_ms': latency,
            'gflops_per_second': gflops_per_second,
            **memory
        }
        ### END SOLUTION

    def profile_forward_pass(self, model, input_tensor) -> Dict[str, Any]:
        """
        Comprehensive profiling of a model's forward pass.

        TODO: Implement complete forward pass analysis

        APPROACH:
        1. Use Profiler class to gather all measurements
        2. Create comprehensive performance profile
        3. Add derived metrics and insights
        4. Return structured analysis results

        RETURN METRICS:
        - All basic profiler measurements
        - FLOPs per second (computational efficiency)
        - Memory bandwidth utilization
        - Performance bottleneck identification

        EXAMPLE:
        >>> model = Linear(256, 128)
        >>> input_data = Tensor(np.random.randn(32, 256))
        >>> profiler = Profiler()
        >>> profile = profiler.profile_forward_pass(model, input_data)
        >>> print(f"Throughput: {profile['gflops_per_second']:.2f} GFLOP/s")
        Throughput: 2.45 GFLOP/s

        HINTS:
        - GFLOP/s = (FLOPs / 1e9) / (latency_ms / 1000)
        - Memory bandwidth = memory_mb / (latency_ms / 1000)
        - Consider realistic hardware limits for efficiency calculations
        """
        ### BEGIN SOLUTION
        # Basic measurements
        param_count = self.count_parameters(model)
        flops = self.count_flops(model, input_tensor.shape)
        memory_stats = self.measure_memory(model, input_tensor.shape)
        latency_ms = self.measure_latency(model, input_tensor, warmup=5, iterations=20)

        # Derived metrics
        latency_seconds = latency_ms / 1000.0
        gflops_per_second = (flops / 1e9) / max(latency_seconds, 1e-6)

        # Memory bandwidth (MB/s)
        memory_bandwidth = memory_stats['peak_memory_mb'] / max(latency_seconds, 1e-6)

        # Efficiency metrics
        theoretical_peak_gflops = 100.0  # Assume 100 GFLOP/s theoretical peak for CPU
        computational_efficiency = min(gflops_per_second / theoretical_peak_gflops, 1.0)

        # Bottleneck analysis
        is_memory_bound = memory_bandwidth > gflops_per_second * 100  # Rough heuristic
        is_compute_bound = not is_memory_bound

        return {
            # Basic measurements
            'parameters': param_count,
            'flops': flops,
            'latency_ms': latency_ms,
            **memory_stats,

            # Derived metrics
            'gflops_per_second': gflops_per_second,
            'memory_bandwidth_mbs': memory_bandwidth,
            'computational_efficiency': computational_efficiency,

            # Bottleneck analysis
            'is_memory_bound': is_memory_bound,
            'is_compute_bound': is_compute_bound,
            'bottleneck': 'memory' if is_memory_bound else 'compute'
        }
        ### END SOLUTION

    def profile_backward_pass(self, model, input_tensor, _loss_fn=None) -> Dict[str, Any]:
        """
        Profile both forward and backward passes for training analysis.

        TODO: Implement training-focused profiling

        APPROACH:
        1. Profile forward pass first
        2. Estimate backward pass costs (typically 2× forward)
        3. Calculate total training iteration metrics
        4. Analyze memory requirements for gradients and optimizers

        BACKWARD PASS ESTIMATES:
        - FLOPs: ~2× forward pass (gradient computation)
        - Memory: +1× parameters (gradient storage)
        - Latency: ~2× forward pass (more complex operations)

        EXAMPLE:
        >>> model = Linear(128, 64)
        >>> input_data = Tensor(np.random.randn(16, 128))
        >>> profiler = Profiler()
        >>> profile = profiler.profile_backward_pass(model, input_data)
        >>> print(f"Training iteration: {profile['total_latency_ms']:.2f} ms")
        Training iteration: 0.45 ms

        HINTS:
        - Total memory = parameters + activations + gradients
        - Optimizer memory depends on algorithm (SGD: 0×, Adam: 2×)
        - Consider gradient accumulation effects
        """
        ### BEGIN SOLUTION
        # Get forward pass profile
        forward_profile = self.profile_forward_pass(model, input_tensor)

        # Estimate backward pass (typically 2× forward)
        backward_flops = forward_profile['flops'] * 2
        backward_latency_ms = forward_profile['latency_ms'] * 2

        # Gradient memory (equal to parameter memory)
        gradient_memory_mb = forward_profile['parameter_memory_mb']

        # Total training iteration
        total_flops = forward_profile['flops'] + backward_flops
        total_latency_ms = forward_profile['latency_ms'] + backward_latency_ms
        total_memory_mb = (forward_profile['parameter_memory_mb'] +
                           forward_profile['activation_memory_mb'] +
                           gradient_memory_mb)

        # Training efficiency
        total_gflops_per_second = (total_flops / 1e9) / (total_latency_ms / 1000.0)

        # Optimizer memory estimates
        optimizer_memory_estimates = {
            'sgd': 0,                         # No extra memory
            'adam': gradient_memory_mb * 2,   # Momentum + velocity
            'adamw': gradient_memory_mb * 2,  # Same as Adam
        }

        return {
            # Forward pass
            'forward_flops': forward_profile['flops'],
            'forward_latency_ms': forward_profile['latency_ms'],
            'forward_memory_mb': forward_profile['peak_memory_mb'],

            # Backward pass estimates
            'backward_flops': backward_flops,
            'backward_latency_ms': backward_latency_ms,
            'gradient_memory_mb': gradient_memory_mb,

            # Total training iteration
            'total_flops': total_flops,
            'total_latency_ms': total_latency_ms,
            'total_memory_mb': total_memory_mb,
            'total_gflops_per_second': total_gflops_per_second,

            # Optimizer memory requirements
            'optimizer_memory_estimates': optimizer_memory_estimates,

            # Training insights
            'memory_efficiency': forward_profile['memory_efficiency'],
            'bottleneck': forward_profile['bottleneck']
        }
        ### END SOLUTION

# %% [markdown]
"""
## Helper Functions - Quick Profiling Utilities

These helper functions provide simplified interfaces for common profiling tasks.
They make it easy to quickly profile models and analyze characteristics without
manually calling multiple profiler methods.

### Why Helper Functions Matter

In production ML engineering, you often need quick insights without setting up
full profiling workflows. These utilities provide:
- **Quick profiling**: One-line model analysis with formatted output
- **Weight analysis**: Understanding parameter distributions for compression
- **Student-friendly output**: Clear, formatted results for learning

These functions wrap our core Profiler class with convenience interfaces used
in real ML workflows for rapid iteration and debugging.
"""

# %% nbgrader={"grade": false, "grade_id": "helper_quick_profile", "solution": true}
#| export
def quick_profile(model, input_tensor, profiler=None):
    """
    Quick profiling function for immediate insights.

    Provides a simplified interface for profiling that displays key metrics
    in a student-friendly format.

    Args:
        model: Model to profile
        input_tensor: Input data for profiling
        profiler: Optional Profiler instance (creates new one if None)

    Returns:
        dict: Profile results with key metrics

    Example:
        >>> model = Linear(128, 64)
        >>> input_data = Tensor(np.random.randn(16, 128))
        >>> results = quick_profile(model, input_data)
        >>> # Displays formatted output automatically
    """
    if profiler is None:
        profiler = Profiler()

    profile = profiler.profile_forward_pass(model, input_tensor)

    # Display formatted results
    print("🔬 Quick Profile Results:")
    print(f"   Parameters: {profile['parameters']:,}")
    print(f"   FLOPs: {profile['flops']:,}")
    print(f"   Latency: {profile['latency_ms']:.2f} ms")
    print(f"   Memory: {profile['peak_memory_mb']:.2f} MB")
    print(f"   Bottleneck: {profile['bottleneck']}")
    print(f"   Efficiency: {profile['computational_efficiency']*100:.1f}%")

    return profile

# %% nbgrader={"grade": false, "grade_id": "helper_weight_distribution", "solution": true}
#| export
def analyze_weight_distribution(model, percentiles=[10, 25, 50, 75, 90]):
    """
    Analyze weight distribution for compression insights.

    Helps understand which weights are small and might be prunable.
    Used by Module 17 (Compression) to motivate pruning.

    Args:
        model: Model to analyze
        percentiles: List of percentiles to compute

    Returns:
        dict: Weight distribution statistics

    Example:
        >>> model = Linear(512, 512)
        >>> stats = analyze_weight_distribution(model)
        >>> print(f"Weights < 0.01: {stats['below_threshold_001']:.1f}%")
    """
    # Collect all weights
    weights = []
    if hasattr(model, 'parameters'):
        for param in model.parameters():
            weights.extend(param.data.flatten().tolist())
    elif hasattr(model, 'weight'):
        weights.extend(model.weight.data.flatten().tolist())
    else:
        return {'error': 'No weights found'}

    weights = np.array(weights)
    abs_weights = np.abs(weights)

    # Calculate statistics
    stats = {
        'total_weights': len(weights),
        'mean': float(np.mean(abs_weights)),
        'std': float(np.std(abs_weights)),
        'min': float(np.min(abs_weights)),
        'max': float(np.max(abs_weights)),
    }

    # Percentile analysis
    for p in percentiles:
        stats[f'percentile_{p}'] = float(np.percentile(abs_weights, p))

    # Threshold analysis (useful for pruning)
    for threshold in [0.001, 0.01, 0.1]:
        below = np.sum(abs_weights < threshold) / len(weights) * 100
        stats[f'below_threshold_{str(threshold).replace(".", "")}'] = below

    return stats

# %% [markdown]
"""
### 🧪 Unit Test: Helper Functions
This test validates our helper utilities work correctly and provide useful output.
**What we're testing**: Quick profiling and weight distribution analysis
**Why it matters**: These utilities are used daily in production ML workflows
**Expected**: Correct profiles with formatted output
"""

# %% nbgrader={"grade": true, "grade_id": "test_helper_functions", "locked": true, "points": 5}
def test_unit_helper_functions():
    """🔬 Test helper function implementations."""
    print("🔬 Unit Test: Helper Functions...")

    # Test 1: Quick profile function
    from tinytorch.core.layers import Linear
    test_model = Linear(16, 8)
    test_input = Tensor(np.random.randn(8, 16))
    profile = quick_profile(test_model, test_input, profiler=Profiler())

    # Validate profile contains expected keys
    assert 'parameters' in profile, "Quick profile should include parameters"
    assert 'flops' in profile, "Quick profile should include FLOPs"
    assert 'latency_ms' in profile, "Quick profile should include latency"
    print("✅ Quick profile provides comprehensive metrics")

    # Test 2: Weight distribution analysis
    class SimpleModel:
        def __init__(self):
            self.weight = Tensor(np.random.randn(10, 5) * 0.1)  # Small weights

    model = SimpleModel()
    stats = analyze_weight_distribution(model)

    # Validate statistics structure
    assert 'total_weights' in stats, "Should count total weights"
    assert 'mean' in stats, "Should compute mean"
    assert 'std' in stats, "Should compute standard deviation"
    assert stats['total_weights'] == 50, f"Expected 50 weights, got {stats['total_weights']}"
    print(f"✅ Weight distribution analysis: {stats['total_weights']} weights analyzed")

    # Test 3: Weight distribution with no weights
    class NoWeightModel:
        pass

    no_weight_model = NoWeightModel()
    stats = analyze_weight_distribution(no_weight_model)
    assert 'error' in stats, "Should handle models without weights"
    print("✅ Handles models without weights gracefully")

    print("✅ Helper functions work correctly!")

if __name__ == "__main__":
    test_unit_helper_functions()

# %% [markdown]
"""
## Parameter Counting - Model Size Analysis

Parameter counting is the foundation of model profiling. Every parameter contributes to memory usage, training time, and model complexity. Let's validate our implementation.

### Why Parameter Counting Matters
```
Model Deployment Pipeline:
Parameters → Memory → Hardware → Cost
    ↓          ↓         ↓         ↓
  125M       500MB    8GB GPU   $200/month

Parameter Growth Examples:
Small:  GPT-2 Small  (124M parameters)  → 500MB memory
Medium: GPT-2 Medium (350M parameters)  → 1.4GB memory
Large:  GPT-2 Large  (774M parameters)  → 3.1GB memory
XL:     GPT-2 XL     (1.5B parameters)  → 6.0GB memory
```
"""

# %% [markdown]
"""
### 🧪 Unit Test: Parameter Counting
This test validates our parameter counting works correctly for different model types.
**What we're testing**: Parameter counting accuracy for various architectures
**Why it matters**: Accurate parameter counts predict memory usage and model complexity
**Expected**: Correct counts for known model configurations
"""

# %% nbgrader={"grade": true, "grade_id": "test_parameter_counting", "locked": true, "points": 10}
def test_unit_parameter_counting():
    """🔬 Test parameter counting implementation."""
    print("🔬 Unit Test: Parameter Counting...")

    profiler = Profiler()

    # Test 1: Simple model with known parameters
    class SimpleModel:
        def __init__(self):
            self.weight = Tensor(np.random.randn(10, 5))
            self.bias = Tensor(np.random.randn(5))

        def parameters(self):
            return [self.weight, self.bias]

    simple_model = SimpleModel()
    param_count = profiler.count_parameters(simple_model)
    expected_count = 10 * 5 + 5  # weight + bias
    assert param_count == expected_count, f"Expected {expected_count} parameters, got {param_count}"
    print(f"✅ Simple model: {param_count} parameters")

    # Test 2: Model without parameters
    class NoParamModel:
        def __init__(self):
            pass

    no_param_model = NoParamModel()
    param_count = profiler.count_parameters(no_param_model)
    assert param_count == 0, f"Expected 0 parameters, got {param_count}"
    print(f"✅ No parameter model: {param_count} parameters")

    # Test 3: Direct tensor (no parameters)
    test_tensor = Tensor(np.random.randn(2, 3))
    param_count = profiler.count_parameters(test_tensor)
    assert param_count == 0, f"Expected 0 parameters for tensor, got {param_count}"
    print(f"✅ Direct tensor: {param_count} parameters")

    print("✅ Parameter counting works correctly!")

if __name__ == "__main__":
    test_unit_parameter_counting()

# %% [markdown]
"""
## FLOP Counting - Computational Cost Estimation

FLOPs measure the computational work required for model operations. Unlike latency, FLOPs are hardware-independent and help predict compute costs across different systems.

### FLOP Counting Visualization
```
Linear Layer FLOP Breakdown:
Input (batch=32, features=768) × Weight (768, 3072) + Bias (3072)
                              ↓
Matrix Multiplication: 32 × 768 × 3072 × 2 = 150,994,944 FLOPs
Bias Addition:         32 × 3072 × 1      =      98,304 FLOPs
                              ↓
Total FLOPs: 151,093,248 FLOPs

Convolution FLOP Breakdown:
Input (batch=1, channels=3, H=224, W=224)
Kernel (out=64, in=3, kH=7, kW=7)
                              ↓
Output size: (224×224) → (112×112) with stride=2
FLOPs = 112 × 112 × 7 × 7 × 3 × 64 × 2 = 236,027,904 FLOPs
```

### FLOP Counting Strategy
Different operations require different FLOP calculations:
- **Matrix operations**: M × N × K × 2 (multiply + add)
- **Convolutions**: Output spatial × kernel spatial × channels
- **Activations**: Usually 1 FLOP per element
"""

# %% [markdown]
"""
### 🧪 Unit Test: FLOP Counting
This test validates our FLOP counting for different operations and architectures.
**What we're testing**: FLOP calculation accuracy for various layer types
**Why it matters**: FLOPs predict computational cost and energy usage
**Expected**: Correct FLOP counts for known operation types
"""

# %% nbgrader={"grade": true, "grade_id": "test_flop_counting", "locked": true, "points": 10}
def test_unit_flop_counting():
    """🔬 Test FLOP counting implementation."""
    print("🔬 Unit Test: FLOP Counting...")

    profiler = Profiler()

    # Test 1: Simple tensor operations
    test_tensor = Tensor(np.random.randn(4, 8))
    flops = profiler.count_flops(test_tensor, (4, 8))
    expected_flops = 4 * 8  # 1 FLOP per element for generic operation
    assert flops == expected_flops, f"Expected {expected_flops} FLOPs, got {flops}"
    print(f"✅ Tensor operation: {flops} FLOPs")

    # Test 2: Simulated Linear layer
    class MockLinear:
        def __init__(self, in_features, out_features):
            self.weight = Tensor(np.random.randn(in_features, out_features))
            self.__class__.__name__ = 'Linear'

    mock_linear = MockLinear(128, 64)
    flops = profiler.count_flops(mock_linear, (1, 128))
    expected_flops = 128 * 64 * 2  # matmul FLOPs
    assert flops == expected_flops, f"Expected {expected_flops} FLOPs, got {flops}"
    print(f"✅ Linear layer: {flops} FLOPs")

    # Test 3: Batch size independence
    flops_batch1 = profiler.count_flops(mock_linear, (1, 128))
    flops_batch32 = profiler.count_flops(mock_linear, (32, 128))
    assert flops_batch1 == flops_batch32, "FLOPs should be independent of batch size"
    print(f"✅ Batch independence: {flops_batch1} FLOPs (same for batch 1 and 32)")

    print("✅ FLOP counting works correctly!")

if __name__ == "__main__":
    test_unit_flop_counting()

# %% [markdown]
"""
## Memory Profiling - Understanding Memory Usage Patterns

Memory profiling reveals how much RAM your model consumes during training and inference. This is critical for deployment planning and optimization.

### Memory Usage Breakdown
```
ML Model Memory Components:
┌───────────────────────────────────────────────────┐
│                   Total Memory                    │
├─────────────────┬─────────────────┬───────────────┤
│   Parameters    │   Activations   │   Gradients   │
│  (persistent)   │  (per forward)  │ (per backward)│
├─────────────────┼─────────────────┼───────────────┤
│ Linear weights  │  Hidden states  │     ∂L/∂W     │
│  Conv filters   │ Attention maps  │     ∂L/∂b     │
│   Embeddings    │ Residual cache  │   Optimizer   │
└─────────────────┴─────────────────┴───────────────┘

Memory Scaling:
Batch Size      → Activation Memory (linear scaling)
Model Size      → Parameter + Gradient Memory (linear scaling)
Sequence Length → Attention Memory (quadratic scaling!)
```
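
The quadratic term is the one that surprises people. A minimal sketch, assuming one float32 (seq_len × seq_len) attention score matrix per head and a hypothetical 12-head model:

```python
def attention_matrix_mb(seq_len: int, heads: int = 12) -> float:
    # one (seq_len x seq_len) float32 score matrix per head
    return heads * seq_len * seq_len * 4 / 1e6

for n in (512, 1024, 2048, 4096):
    print(n, f"{attention_matrix_mb(n):.0f} MB")
# doubling seq_len quadruples attention memory: 13 -> 50 -> 201 -> 805 MB
```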

### Memory Measurement Strategy
We use Python's `tracemalloc` to track memory allocations during model execution. This gives us precise measurements of memory usage patterns.
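
For reference, the core `tracemalloc` pattern is just a few calls. A minimal sketch of the standard-library API that `measure_memory` builds on:

```python
import tracemalloc
import numpy as np

tracemalloc.start()
baseline, _ = tracemalloc.get_traced_memory()   # (current, peak) in bytes

buf = np.zeros((1024, 1024), dtype=np.float32)  # ~4 MB allocation to observe
buf += 1.0

_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak above baseline: {(peak - baseline) / 1e6:.1f} MB")
```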
"""

# %% [markdown]
"""
### 🧪 Unit Test: Memory Measurement
This test validates our memory tracking works correctly and provides useful metrics.
**What we're testing**: Memory usage measurement and calculation accuracy
**Why it matters**: Memory constraints often limit model deployment
**Expected**: Reasonable memory measurements with proper components
"""

# %% nbgrader={"grade": true, "grade_id": "test_memory_measurement", "locked": true, "points": 10}
def test_unit_memory_measurement():
    """🔬 Test memory measurement implementation."""
    print("🔬 Unit Test: Memory Measurement...")

    profiler = Profiler()

    # Test 1: Basic memory measurement
    from tinytorch.core.layers import Linear
    test_tensor = Tensor(np.random.randn(10, 20))
    test_model = Linear(20, 10)
    memory_stats = profiler.measure_memory(test_model, (10, 20))

    # Validate dictionary structure
    required_keys = ['parameter_memory_mb', 'activation_memory_mb', 'peak_memory_mb', 'memory_efficiency']
    for key in required_keys:
        assert key in memory_stats, f"Missing key: {key}"

    # Validate non-negative values
    for key in required_keys:
        assert memory_stats[key] >= 0, f"{key} should be non-negative, got {memory_stats[key]}"

    print(f"✅ Basic measurement: {memory_stats['peak_memory_mb']:.3f} MB peak")

    # Test 2: Memory scaling with size
    small_model = Linear(5, 5)
    large_model = Linear(50, 50)

    small_memory = profiler.measure_memory(small_model, (5, 5))
    large_memory = profiler.measure_memory(large_model, (50, 50))

    # The larger model and input should use more activation memory
    assert large_memory['activation_memory_mb'] >= small_memory['activation_memory_mb'], \
        "Larger input should use more activation memory"

    print(f"✅ Scaling: Small {small_memory['activation_memory_mb']:.3f} MB → Large {large_memory['activation_memory_mb']:.3f} MB")

    # Test 3: Efficiency bounds
    assert 0 <= memory_stats['memory_efficiency'] <= 1.0, \
        f"Memory efficiency should be between 0 and 1, got {memory_stats['memory_efficiency']}"

    print(f"✅ Efficiency: {memory_stats['memory_efficiency']:.3f} (0-1 range)")

    print("✅ Memory measurement works correctly!")

if __name__ == "__main__":
    test_unit_memory_measurement()

# %% [markdown]
"""
## Latency Measurement - Accurate Performance Timing

Latency measurement is the most challenging part of profiling because it's affected by system state, caching, and measurement overhead. We need statistical rigor to get reliable results.

### Latency Measurement Challenges
```
Timing Challenges:
┌─────────────────────────────────────────────────┐
│                  Time Variance                  │
├─────────────────┬─────────────────┬─────────────┤
│  System Noise   │  Cache Effects  │   Thermal   │
│                 │                 │  Throttling │
├─────────────────┼─────────────────┼─────────────┤
│ Background      │ Cold start vs   │ CPU slows   │
│ processes       │ warm caches     │ when hot    │
│ OS scheduling   │ Memory locality │ GPU thermal │
│ Network I/O     │ Branch predict  │ limits      │
└─────────────────┴─────────────────┴─────────────┘

Solution: Statistical Approach
Warmup → Multiple measurements → Robust statistics (median)
```

### Measurement Protocol
Our latency measurement follows professional benchmarking practices:
1. **Warmup runs** to stabilize system state
2. **Multiple measurements** for statistical significance
3. **Median calculation** to handle outliers
4. **Memory cleanup** to prevent contamination
"""

# %% [markdown]
"""
### 🧪 Unit Test: Latency Measurement
This test validates our latency measurement provides consistent and reasonable results.
**What we're testing**: Timing accuracy and statistical robustness
**Why it matters**: Latency determines real-world deployment feasibility
**Expected**: Consistent timing measurements with proper statistical handling
"""

# %% nbgrader={"grade": true, "grade_id": "test_latency_measurement", "locked": true, "points": 10}
def test_unit_latency_measurement():
    """🔬 Test latency measurement implementation."""
    print("🔬 Unit Test: Latency Measurement...")

    profiler = Profiler()

    # Test 1: Basic latency measurement
    from tinytorch.core.layers import Linear
    test_model = Linear(8, 4)
    test_input = Tensor(np.random.randn(4, 8))
    latency = profiler.measure_latency(test_model, test_input, warmup=2, iterations=5)

    assert latency >= 0, f"Latency should be non-negative, got {latency}"
    assert latency < 1000, f"Latency seems too high for simple operation: {latency} ms"
    print(f"✅ Basic latency: {latency:.3f} ms")

    # Test 2: Measurement consistency
    latencies = []
    for _ in range(3):
        lat = profiler.measure_latency(test_model, test_input, warmup=1, iterations=3)
        latencies.append(lat)

    # Measurements should be in reasonable range
    avg_latency = np.mean(latencies)
    std_latency = np.std(latencies)
    assert std_latency < avg_latency, "Standard deviation shouldn't exceed mean for simple operations"
    print(f"✅ Consistency: {avg_latency:.3f} ± {std_latency:.3f} ms")

    # Test 3: Size scaling
    small_model = Linear(2, 2)
    large_model = Linear(20, 20)
    small_input = Tensor(np.random.randn(2, 2))
    large_input = Tensor(np.random.randn(20, 20))

    small_latency = profiler.measure_latency(small_model, small_input, warmup=1, iterations=3)
    large_latency = profiler.measure_latency(large_model, large_input, warmup=1, iterations=3)

    # Larger operations might take longer (though not guaranteed for simple operations)
    print(f"✅ Scaling: Small {small_latency:.3f} ms, Large {large_latency:.3f} ms")

    print("✅ Latency measurement works correctly!")

if __name__ == "__main__":
    test_unit_latency_measurement()

# %% [markdown]
"""
## 4. Integration: Advanced Profiling Functions

Now let's validate our higher-level profiling functions that combine core measurements into comprehensive analysis tools.

### Advanced Profiling Architecture
```
Core Profiler Methods → Advanced Analysis Functions → Optimization Insights
         ↓                         ↓                           ↓
count_parameters()       profile_forward_pass()     "Memory-bound workload"
count_flops()            profile_backward_pass()    "Optimize data movement"
measure_memory()         profile_layer()            "Focus on bandwidth"
measure_latency()        benchmark_efficiency()     "Use quantization"
```

### Forward Pass Profiling - Complete Performance Picture

A forward pass profile combines all our measurements to understand model behavior comprehensively. This is essential for optimization decisions.
"""

# %% [markdown]
"""
### Backward Pass Profiling - Training Analysis

Training requires both forward and backward passes. The backward pass typically uses 2× the compute and adds gradient memory. Understanding this is crucial for training optimization.

### Training Memory Visualization
```
Training Memory Timeline:
Forward Pass:  [Parameters] + [Activations]
                     ↓
Backward Pass: [Parameters] + [Activations] + [Gradients]
                     ↓
Optimizer:     [Parameters] + [Gradients] + [Optimizer State]

Memory Examples:
Model: 125M parameters (500MB)
Forward:  500MB params + 100MB activations = 600MB
Backward: 500MB params + 100MB activations + 500MB gradients = 1,100MB
Adam:     500MB params + 500MB gradients + 1,000MB momentum/velocity = 2,000MB

Total Training Memory: 4× parameter memory!
```
"""

# %% [markdown]
"""
### 🧪 Unit Test: Advanced Profiling Functions
This test validates our advanced profiling functions provide comprehensive analysis.
**What we're testing**: Forward and backward pass profiling completeness
**Why it matters**: Training optimization requires understanding both passes
**Expected**: Complete profiles with all required metrics and relationships
"""

# %% nbgrader={"grade": true, "grade_id": "test_advanced_profiling", "locked": true, "points": 15}
def test_unit_advanced_profiling():
    """🔬 Test advanced profiling functions."""
    print("🔬 Unit Test: Advanced Profiling Functions...")

    # Create profiler and test model
    profiler = Profiler()
    from tinytorch.core.layers import Linear
    test_model = Linear(8, 4)
    test_input = Tensor(np.random.randn(4, 8))

    # Test forward pass profiling
    forward_profile = profiler.profile_forward_pass(test_model, test_input)

    # Validate forward profile structure
    required_forward_keys = [
        'parameters', 'flops', 'latency_ms', 'gflops_per_second',
        'memory_bandwidth_mbs', 'bottleneck'
    ]

    for key in required_forward_keys:
        assert key in forward_profile, f"Missing key: {key}"

    assert forward_profile['parameters'] >= 0
    assert forward_profile['flops'] >= 0
    assert forward_profile['latency_ms'] >= 0
    assert forward_profile['gflops_per_second'] >= 0

    print(f"✅ Forward profiling: {forward_profile['gflops_per_second']:.2f} GFLOP/s")

    # Test backward pass profiling
    backward_profile = profiler.profile_backward_pass(test_model, test_input)

    # Validate backward profile structure
    required_backward_keys = [
        'forward_flops', 'backward_flops', 'total_flops',
        'total_latency_ms', 'total_memory_mb', 'optimizer_memory_estimates'
    ]

    for key in required_backward_keys:
        assert key in backward_profile, f"Missing key: {key}"

    # Validate relationships
    assert backward_profile['total_flops'] >= backward_profile['forward_flops']
    assert backward_profile['total_latency_ms'] >= backward_profile['forward_latency_ms']
    assert 'sgd' in backward_profile['optimizer_memory_estimates']
    assert 'adam' in backward_profile['optimizer_memory_estimates']

    # Check backward pass estimates are reasonable
    assert backward_profile['backward_flops'] >= backward_profile['forward_flops'], \
        "Backward pass should have at least as many FLOPs as forward"
    assert backward_profile['gradient_memory_mb'] >= 0, \
        "Gradient memory should be non-negative"

    print(f"✅ Backward profiling: {backward_profile['total_latency_ms']:.2f} ms total")
    print(f"✅ Memory breakdown: {backward_profile['total_memory_mb']:.2f} MB training")
    print("✅ Advanced profiling functions work correctly!")

if __name__ == "__main__":
    test_unit_advanced_profiling()

# %% [markdown]
"""
## 5. Systems Analysis: Understanding Performance Characteristics

Let's analyze how different model characteristics affect performance. This analysis guides optimization decisions and helps identify bottlenecks.

### Performance Analysis Workflow
```
Model Scaling Analysis:
Size → Memory → Latency → Throughput → Bottleneck Identification
 ↓       ↓        ↓           ↓                 ↓
 64     1MB     0.1ms     10K ops/s        Memory bound
128     4MB     0.2ms      8K ops/s        Memory bound
256    16MB     0.5ms      4K ops/s        Memory bound
512    64MB     2.0ms      1K ops/s        Memory bound

Insight: This workload is memory-bound → Optimize data movement, not compute!
```
"""

# %% nbgrader={"grade": false, "grade_id": "performance_analysis", "solution": true}
def analyze_model_scaling():
    """📊 Analyze how model performance scales with size."""
    print("📊 Analyzing Model Scaling Characteristics...")

    from tinytorch.core.layers import Linear

    profiler = Profiler()
    results = []

    # Test different model sizes
    sizes = [64, 128, 256, 512]

    print("\nModel Scaling Analysis:")
    print("Size\tParams\t\tFLOPs\t\tLatency(ms)\tMemory(MB)\tGFLOP/s")
    print("-" * 80)

    for size in sizes:
        # Create models of different sizes for comparison
        test_model = Linear(size, size)
        input_shape = (32, size)  # Batch of 32
        dummy_input = Tensor(np.random.randn(*input_shape))

        # Simulate linear layer characteristics
        linear_params = size * size + size  # W + b
        linear_flops = size * size * 2  # matmul

        # Measure actual performance
        latency = profiler.measure_latency(test_model, dummy_input, warmup=3, iterations=10)
        memory = profiler.measure_memory(test_model, input_shape)

        gflops_per_second = (linear_flops / 1e9) / (latency / 1000)

        results.append({
            'size': size,
            'parameters': linear_params,
            'flops': linear_flops,
            'latency_ms': latency,
            'memory_mb': memory['peak_memory_mb'],
            'gflops_per_second': gflops_per_second
        })

        print(f"{size}\t{linear_params:,}\t\t{linear_flops:,}\t\t"
              f"{latency:.2f}\t\t{memory['peak_memory_mb']:.2f}\t\t"
              f"{gflops_per_second:.2f}")

    # Analysis insights
    print("\n💡 Scaling Analysis Insights:")

    # Memory scaling
    memory_growth = results[-1]['memory_mb'] / max(results[0]['memory_mb'], 0.001)
    print(f"Memory grows {memory_growth:.1f}× from {sizes[0]} to {sizes[-1]} size")

    # Compute scaling
    compute_growth = results[-1]['gflops_per_second'] / max(results[0]['gflops_per_second'], 0.001)
    print(f"Compute efficiency changes {compute_growth:.1f}× with size")

    # Performance characteristics
    avg_efficiency = np.mean([r['gflops_per_second'] for r in results])
    if avg_efficiency < 10:  # Arbitrary threshold for "low" efficiency
        print("🚀 Low compute efficiency suggests memory-bound workload")
    else:
        print("🚀 High compute efficiency suggests compute-bound workload")

def analyze_batch_size_effects():
    """📊 Analyze how batch size affects performance and efficiency."""
    print("\n📊 Analyzing Batch Size Effects...")

    from tinytorch.core.layers import Linear

    profiler = Profiler()
    batch_sizes = [1, 8, 32, 128]
    feature_size = 256

    print("\nBatch Size Effects Analysis:")
    print("Batch\tLatency(ms)\tThroughput(samples/s)\tMemory(MB)\tMemory Efficiency")
    print("-" * 85)

    for batch_size in batch_sizes:
        test_model = Linear(feature_size, feature_size)
        input_shape = (batch_size, feature_size)
        dummy_input = Tensor(np.random.randn(*input_shape))

        # Measure performance
        latency = profiler.measure_latency(test_model, dummy_input, warmup=3, iterations=10)
        memory = profiler.measure_memory(test_model, input_shape)

        # Calculate throughput
        samples_per_second = (batch_size * 1000) / latency  # samples/second

        # Calculate efficiency (samples per unit memory)
        efficiency = samples_per_second / max(memory['peak_memory_mb'], 0.001)

        print(f"{batch_size}\t{latency:.2f}\t\t{samples_per_second:.0f}\t\t\t"
              f"{memory['peak_memory_mb']:.2f}\t\t{efficiency:.1f}")

    print("\n💡 Batch Size Insights:")
    print("Larger batches typically improve throughput but increase memory usage")

# Run the analysis
if __name__ == "__main__":
    analyze_model_scaling()
    analyze_batch_size_effects()

# %% [markdown]
|
||
"""
|
||
## 6. Optimization Insights: Production Performance Patterns
|
||
|
||
Understanding profiling results helps guide optimization decisions. Let's analyze different operation types and measurement overhead.
|
||
|
||
### Operation Efficiency Analysis
|
||
```
|
||
Operation Types and Their Characteristics:
|
||
┌─────────────────┬──────────────────┬──────────────────┬─────────────────┐
|
||
│ Operation │ Compute/Memory │ Optimization │ Priority │
|
||
├─────────────────┼──────────────────┼──────────────────┼─────────────────┤
|
||
│ Matrix Multiply │ Compute-bound │ BLAS libraries │ High │
|
||
│ Elementwise │ Memory-bound │ Data locality │ Medium │
|
||
│ Reductions │ Memory-bound │ Parallelization│ Medium │
|
||
│ Attention │ Memory-bound │ FlashAttention │ High │
|
||
└─────────────────┴──────────────────┴──────────────────┴─────────────────┘
|
||
|
||
Optimization Strategy:
|
||
1. Profile first → Identify bottlenecks
|
||
2. Focus on compute-bound ops → Algorithmic improvements
|
||
3. Focus on memory-bound ops → Data movement optimization
|
||
4. Measure again → Verify improvements
|
||
```
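
The compute- vs memory-bound split can be estimated before running any benchmark by comparing an operation's arithmetic intensity (FLOPs per byte of data moved) against the machine's balance point (peak FLOP/s ÷ memory bandwidth). A back-of-envelope sketch, using illustrative hardware numbers rather than measurements:

```python
# Roofline-style estimate (assumed hardware numbers, for illustration only)
peak_gflops = 100.0   # assumed peak compute, GFLOP/s
bandwidth_gbs = 20.0  # assumed memory bandwidth, GB/s
balance = peak_gflops / bandwidth_gbs  # FLOPs per byte needed to stay compute-bound (5.0)

# Elementwise add on a (32, 256) float64 tensor: 1 FLOP per element,
# ~3 tensor traversals of traffic (2 reads + 1 write) at 8 bytes each
n = 32 * 256
ai_elementwise = n / (3 * n * 8)   # ≈ 0.04 FLOP/byte → far below balance

# Linear layer (32, 256) @ (256, 256): 2·batch·in·out FLOPs,
# traffic ≈ input + weights + output
flops_mm = 2 * 32 * 256 * 256
bytes_mm = (32 * 256 + 256 * 256 + 32 * 256) * 8
ai_matmul = flops_mm / bytes_mm    # ≈ 6.4 FLOP/byte → near/above balance

print(f"Balance: {balance:.1f}, elementwise AI: {ai_elementwise:.3f}, matmul AI: {ai_matmul:.1f}")
```

Operations whose intensity falls below the balance point stay memory-bound no matter how well-tuned the kernel is, which is why the table above prescribes data-movement fixes for them.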
"""

# %% nbgrader={"grade": false, "grade_id": "optimization_insights", "solution": true}
def benchmark_operation_efficiency():
    """📊 Compare efficiency of different operations for optimization guidance."""
    print("📊 Benchmarking Operation Efficiency...")

    profiler = Profiler()
    operations = []

    # Test different operation types
    size = 256
    batch = 32
    input_tensor = Tensor(np.random.randn(batch, size))

    # Elementwise operations (memory-bound)
    # Create a simple model wrapper for elementwise operations
    class ElementwiseModel:
        def forward(self, x):
            return x + x  # Simple elementwise operation

    elementwise_model = ElementwiseModel()
    elementwise_latency = profiler.measure_latency(elementwise_model, input_tensor, iterations=20)
    elementwise_flops = batch * size  # One operation per element of the (batch, size) tensor

    operations.append({
        'operation': 'Elementwise',
        'latency_ms': elementwise_latency,
        'flops': elementwise_flops,
        'gflops_per_second': (elementwise_flops / 1e9) / (elementwise_latency / 1000),
        'efficiency_class': 'memory-bound',
        'optimization_focus': 'data_locality'
    })

    # Matrix operations (compute-bound)
    matrix_model = Linear(size, size)
    matrix_latency = profiler.measure_latency(matrix_model, input_tensor, iterations=10)
    matrix_flops = batch * size * size * 2  # batch × in × out multiply-adds

    operations.append({
        'operation': 'Matrix Multiply',
        'latency_ms': matrix_latency,
        'flops': matrix_flops,
        'gflops_per_second': (matrix_flops / 1e9) / (matrix_latency / 1000),
        'efficiency_class': 'compute-bound',
        'optimization_focus': 'algorithms'
    })

    # Reduction operations (memory-bound)
    class ReductionModel:
        def forward(self, x):
            return x.sum()  # Sum reduction operation

    reduction_model = ReductionModel()
    reduction_latency = profiler.measure_latency(reduction_model, input_tensor, iterations=20)
    reduction_flops = batch * size  # One addition per element summed

    operations.append({
        'operation': 'Reduction',
        'latency_ms': reduction_latency,
        'flops': reduction_flops,
        'gflops_per_second': (reduction_flops / 1e9) / (reduction_latency / 1000),
        'efficiency_class': 'memory-bound',
        'optimization_focus': 'parallelization'
    })

    print("\nOperation Efficiency Comparison:")
    print("Operation\t\tLatency(ms)\tGFLOP/s\t\tEfficiency Class\tOptimization Focus")
    print("-" * 95)

    for op in operations:
        print(f"{op['operation']:<15}\t{op['latency_ms']:.3f}\t\t"
              f"{op['gflops_per_second']:.2f}\t\t{op['efficiency_class']:<15}\t{op['optimization_focus']}")

    print("\n💡 Operation Optimization Insights:")

    # Find most and least efficient
    best_op = max(operations, key=lambda x: x['gflops_per_second'])
    worst_op = min(operations, key=lambda x: x['gflops_per_second'])

    print(f"Most efficient: {best_op['operation']} ({best_op['gflops_per_second']:.2f} GFLOP/s)")
    print(f"Least efficient: {worst_op['operation']} ({worst_op['gflops_per_second']:.2f} GFLOP/s)")

    # Count operation types
    memory_bound_ops = [op for op in operations if op['efficiency_class'] == 'memory-bound']
    compute_bound_ops = [op for op in operations if op['efficiency_class'] == 'compute-bound']

    print("\n🚀 Optimization Priority:")
    if len(memory_bound_ops) > len(compute_bound_ops):
        print("Focus on memory optimization: data locality, bandwidth, caching")
    else:
        print("Focus on compute optimization: better algorithms, vectorization")

def analyze_profiling_overhead():
    """📊 Measure the overhead of profiling itself."""
    print("\n📊 Analyzing Profiling Overhead...")

    # Test with and without profiling
    test_tensor = Tensor(np.random.randn(100, 100))
    iterations = 50

    # Without profiling - baseline measurement
    start_time = time.perf_counter()
    for _ in range(iterations):
        _ = test_tensor.data.copy()  # Simple operation
    end_time = time.perf_counter()
    baseline_ms = (end_time - start_time) * 1000

    # With profiling - includes measurement overhead
    profiler = Profiler()

    # Create a simple model for profiling overhead measurement
    class TestModel:
        def forward(self, x):
            return x + 1.0

    test_model = TestModel()
    start_time = time.perf_counter()
    for _ in range(iterations):
        _ = profiler.measure_latency(test_model, test_tensor, warmup=1, iterations=1)
    end_time = time.perf_counter()
    profiled_ms = (end_time - start_time) * 1000

    overhead_factor = profiled_ms / max(baseline_ms, 0.001)

    print("\nProfiling Overhead Analysis:")
    print(f"Baseline execution: {baseline_ms:.2f} ms")
    print(f"With profiling: {profiled_ms:.2f} ms")
    print(f"Profiling overhead: {overhead_factor:.1f}× slower")

    print("\n💡 Profiling Overhead Insights:")
    if overhead_factor < 2:
        print("Low overhead - suitable for frequent profiling")
    elif overhead_factor < 10:
        print("Moderate overhead - use for development and debugging")
    else:
        print("High overhead - use sparingly in production")

# Run optimization analysis
if __name__ == "__main__":
    benchmark_operation_efficiency()
    analyze_profiling_overhead()
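
# %% [markdown]
"""
When the overhead factor is high, a standard mitigation is sampling: measure
only every N-th call and amortize the cost. The wrapper below is a
hypothetical sketch (`SampledProfiler` is not part of the exported module)
built on the `Profiler.measure_latency` API used above — steady-state
overhead drops roughly by a factor of `sample_every`, at the price of fewer
data points.
"""

# %%
class SampledProfiler:
    """Hypothetical wrapper that profiles only every `sample_every`-th call."""

    def __init__(self, sample_every: int = 100):
        self.profiler = Profiler()
        self.sample_every = sample_every
        self.calls = 0
        self.latencies_ms: List[float] = []

    def maybe_profile(self, model, input_tensor):
        """Run the model; pay measurement cost only on sampled calls."""
        self.calls += 1
        if self.calls % self.sample_every == 0:
            # Sampled call: time it with the full profiler
            latency = self.profiler.measure_latency(model, input_tensor, warmup=1, iterations=1)
            self.latencies_ms.append(latency)
            return latency
        model.forward(input_tensor)  # Unsampled call: execute without timing
        return None

if __name__ == "__main__":
    sampled = SampledProfiler(sample_every=10)
    demo_model = Linear(64, 64)
    demo_input = Tensor(np.random.randn(8, 64))
    for _ in range(30):
        sampled.maybe_profile(demo_model, demo_input)
    print(f"Profiled {len(sampled.latencies_ms)} of 30 calls (≈1/{sampled.sample_every})")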

# %% [markdown]
"""
## 🧪 Module Integration Test

Final validation that everything works together correctly.
"""

# %% nbgrader={"grade": true, "grade_id": "test_module", "locked": true, "points": 20}
def test_module():
    """🧪 Module Test: Complete Integration

    Comprehensive test of entire profiling module functionality.

    This final test runs before module summary to ensure:
    - All unit tests pass
    - Functions work together correctly
    - Module is ready for integration with TinyTorch
    """
    print("🧪 RUNNING MODULE INTEGRATION TEST")
    print("=" * 50)

    # Run all unit tests
    print("Running unit tests...")
    test_unit_helper_functions()
    test_unit_parameter_counting()
    test_unit_flop_counting()
    test_unit_memory_measurement()
    test_unit_latency_measurement()
    test_unit_advanced_profiling()

    print("\nRunning integration scenarios...")

    # Test realistic usage patterns
    print("🔬 Integration Test: Complete Profiling Workflow...")

    # Create profiler
    profiler = Profiler()

    # Create test model and data
    test_model = Linear(16, 32)
    test_input = Tensor(np.random.randn(8, 16))

    # Run complete profiling workflow
    print("1. Measuring model characteristics...")
    params = profiler.count_parameters(test_model)
    flops = profiler.count_flops(test_model, test_input.shape)
    memory = profiler.measure_memory(test_model, test_input.shape)
    latency = profiler.measure_latency(test_model, test_input, warmup=2, iterations=5)

    print(f" Parameters: {params}")
    print(f" FLOPs: {flops}")
    print(f" Memory: {memory['peak_memory_mb']:.2f} MB")
    print(f" Latency: {latency:.2f} ms")

    # Test advanced profiling
    print("2. Running advanced profiling...")
    forward_profile = profiler.profile_forward_pass(test_model, test_input)
    backward_profile = profiler.profile_backward_pass(test_model, test_input)

    assert 'gflops_per_second' in forward_profile
    assert 'total_latency_ms' in backward_profile
    print(f" Forward GFLOP/s: {forward_profile['gflops_per_second']:.2f}")
    print(f" Training latency: {backward_profile['total_latency_ms']:.2f} ms")

    # Test bottleneck analysis
    print("3. Analyzing performance bottlenecks...")
    bottleneck = forward_profile['bottleneck']
    efficiency = forward_profile['computational_efficiency']
    print(f" Bottleneck: {bottleneck}")
    print(f" Compute efficiency: {efficiency:.3f}")

    # Validate end-to-end workflow
    assert params >= 0, "Parameter count should be non-negative"
    assert flops >= 0, "FLOP count should be non-negative"
    assert memory['peak_memory_mb'] >= 0, "Memory usage should be non-negative"
    assert latency >= 0, "Latency should be non-negative"
    assert forward_profile['gflops_per_second'] >= 0, "GFLOP/s should be non-negative"
    assert backward_profile['total_latency_ms'] >= 0, "Total latency should be non-negative"
    assert bottleneck in ['memory', 'compute'], "Bottleneck should be memory or compute"
    assert 0 <= efficiency <= 1, "Efficiency should be between 0 and 1"

    print("✅ End-to-end profiling workflow works!")

    # Test production-like scenario
    print("4. Testing production profiling scenario...")

    # Simulate larger model analysis
    large_model = Linear(512, 256)
    large_input = Tensor(np.random.randn(32, 512))  # Larger model input
    large_profile = profiler.profile_forward_pass(large_model, large_input)

    # Verify profile contains optimization insights
    assert 'bottleneck' in large_profile, "Profile should identify bottlenecks"
    assert 'memory_bandwidth_mbs' in large_profile, "Profile should measure memory bandwidth"

    print(f" Large model analysis: {large_profile['bottleneck']} bottleneck")
    print(f" Memory bandwidth: {large_profile['memory_bandwidth_mbs']:.1f} MB/s")

    print("✅ Production profiling scenario works!")

    print("\n" + "=" * 50)
    print("🎉 ALL TESTS PASSED! Module ready for export.")
    print("Run: tito module complete 14")

# Call before module summary
if __name__ == "__main__":
    test_module()

# %%
if __name__ == "__main__":
    print("🚀 Running Profiling module...")
    test_module()
    print("✅ Module validation complete!")

# %% [markdown]
"""
## 🤔 ML Systems Thinking: Performance Measurement

### Question 1: FLOP Analysis
You implemented a profiler that counts FLOPs for different operations.
For a Linear layer with 1000 input features and 500 output features:
- How many FLOPs are required for one forward pass? _____ FLOPs
- If you process a batch of 32 samples, how does this change the per-sample FLOPs? _____

### Question 2: Memory Scaling
Your profiler measures memory usage for models and activations.
A transformer model has 125M parameters (500MB at FP32).
During training with batch size 16:
- What's the minimum memory for gradients? _____ MB
- With Adam optimizer, what's the total memory requirement? _____ MB

### Question 3: Performance Bottlenecks
You built tools to identify compute vs memory bottlenecks.
A model achieves 10 GFLOP/s on hardware with 100 GFLOP/s peak:
- What's the computational efficiency? _____%
- If doubling batch size doesn't improve GFLOP/s, the bottleneck is likely _____

### Question 4: Profiling Trade-offs
Your profiler adds measurement overhead to understand performance.
If profiling adds 5× overhead but reveals a 50% speedup opportunity:
- Is the profiling cost justified for development? _____
- When should you disable profiling in production? _____
"""

# %% [markdown]
"""
## 🎯 MODULE SUMMARY: Profiling

Congratulations! You've built a comprehensive profiling system for ML performance analysis!

### Key Accomplishments
- Built complete Profiler class with parameter, FLOP, memory, and latency measurement
- Implemented advanced profiling functions for forward and backward pass analysis
- Discovered performance characteristics through scaling and efficiency analysis
- Created production-quality measurement tools for optimization guidance
- All tests pass ✅ (validated by `test_module()`)

### Systems Insights Gained
- **FLOPs vs Reality**: Theoretical operations don't always predict actual performance
- **Memory Bottlenecks**: Many ML operations are limited by memory bandwidth, not compute
- **Batch Size Effects**: Larger batches improve throughput but increase memory requirements
- **Profiling Overhead**: Measurement tools have costs but enable data-driven optimization

### Production Skills Developed
- **Performance Detective Work**: Use data, not guesses, to identify bottlenecks
- **Optimization Prioritization**: Focus efforts on actual bottlenecks, not assumptions
- **Resource Planning**: Predict memory and compute requirements for deployment
- **Statistical Rigor**: Handle measurement variance with proper methodology

### Ready for Next Steps
Your profiling implementation gives the optimization modules (15-18) the measurements they need to make data-driven decisions.
Export with: `tito module complete 14`

**Next**: Module 15 (Memoization) will use profiling to discover transformer bottlenecks and fix them!
"""