TinyTorch/modules/14_profiling/profiling.py
Vijay Janapa Reddi 4f06392de5 Apply formatting fixes to achieve 10/10 consistency
- Add 🧪 emoji to all test_module() docstrings (20 modules)
- Fix Module 16 (compression): Add if __name__ guards to 6 test functions
- Fix Module 08 (dataloader): Add if __name__ guard to test_training_integration

All modules now follow consistent formatting standards for release.
2025-11-24 15:07:32 -05:00

1809 lines
70 KiB
Python
# ---
# jupyter:
# jupytext:
# text_representation:
# extension: .py
# format_name: percent
# format_version: '1.3'
# jupytext_version: 1.17.1
# kernelspec:
# display_name: Python 3 (ipykernel)
# language: python
# name: python3
# ---
# %% [markdown]
"""
# Module 14: Profiling - Measuring What Matters in ML Systems
Welcome to Module 14! You'll build professional profiling tools to measure model performance and uncover optimization opportunities.
## 🔗 Prerequisites & Progress
**You've Built**: Complete ML stack from tensors to transformers
**You'll Build**: Comprehensive profiling system for parameters, FLOPs, memory, and latency
**You'll Enable**: Data-driven optimization decisions and performance analysis
**Connection Map**:
```
All Modules (01-13) → Profiling (14) → Optimization Techniques (15-18)
 (implementations)     (measurement)          (targeted fixes)
```
**Before starting this module, verify:**
- [ ] Module 01 (Tensor): Core tensor operations
- [ ] Module 03 (Layers): Linear layer implementation
- [ ] Module 08 (Spatial): Convolutional operations
This module can work standalone with a minimal Tensor implementation, but
full functionality requires the previous modules for realistic profiling scenarios.
## Learning Objectives
By the end of this module, you will:
1. Implement a complete Profiler class for model analysis
2. Count parameters and FLOPs accurately for different architectures
3. Measure memory usage and latency with statistical rigor
4. Create production-quality performance analysis tools
Let's build the measurement foundation for ML systems optimization!
## 📦 Where This Code Lives in the Final Package
**Learning Side:** You work in `modules/14_profiling/profiling_dev.py`
**Building Side:** Code exports to `tinytorch.profiling.profiler`
```python
# How to use this module:
from tinytorch.profiling.profiler import Profiler, quick_profile, analyze_weight_distribution
```
**Why this matters:**
- **Learning:** Complete profiling system for understanding model performance characteristics
- **Production:** Professional measurement tools like those used in PyTorch, TensorFlow
- **Consistency:** All profiling and measurement tools in profiling.profiler
- **Integration:** Works with any model built using TinyTorch components
"""
# %% nbgrader={"grade": false, "grade_id": "imports", "solution": true}
#| default_exp profiling.profiler
#| export
import sys
import os
import time
import numpy as np
import tracemalloc
from typing import Dict, List, Any, Optional, Tuple
from collections import defaultdict
import gc
# Import from TinyTorch package (previous modules must be completed and exported)
from tinytorch.core.tensor import Tensor
from tinytorch.core.layers import Linear
from tinytorch.core.spatial import Conv2d
# Constants for memory and performance measurement
BYTES_PER_FLOAT32 = 4 # Standard float32 size in bytes
KB_TO_BYTES = 1024 # Kilobytes to bytes conversion
MB_TO_BYTES = 1024 * 1024 # Megabytes to bytes conversion
# %% [markdown]
"""
## 1. Introduction: Why Profiling Matters in ML Systems
Imagine you're a detective investigating a performance crime. Your model is running slowly, using too much memory, or burning through compute budgets. Without profiling, you're flying blind - making guesses about what to optimize. With profiling, you have evidence.
**The Performance Investigation Process:**
```
Suspect Model → Profile Evidence → Identify Bottleneck → Target Optimization
      ↓                ↓                   ↓                     ↓
  "Too slow"     "200 GFLOP/s"       "Memory bound"      "Reduce transfers"
```
**Questions Profiling Answers:**
- **How many parameters?** (Memory footprint, model size)
- **How many FLOPs?** (Computational cost, energy usage)
- **Where are bottlenecks?** (Memory vs compute bound)
- **What's actual latency?** (Real-world performance)
**Production Importance:**
In production ML systems, profiling isn't optional - it's survival. A model that's 10% more accurate but 100× slower often can't be deployed. Teams use profiling daily to make data-driven optimization decisions, not guesses.
### The Profiling Workflow Visualization
```
Model → Profiler → Measurements → Analysis → Optimization Decision
  ↓        ↓            ↓            ↓               ↓
 GPT   Parameter   125M params    Memory      Use quantization
        Counter    2.5B FLOPs      bound      Reduce precision
```
"""
# %% [markdown]
"""
### 🔗 From Implementation to Optimization: The Profiling Foundation
**In this module (14)**, you'll build the measurement tools to discover optimization opportunities.
**In later modules (15+)**, you'll use these profiling insights to implement optimizations like KV caching.
**The Real ML Engineering Workflow**:
```
Step 1: Measure (This Module!)          Step 2: Analyze
        ↓                                       ↓
Profile baseline  →  Find bottleneck   →  Understand cause
40 tok/s             80% in attention     O(n²) recomputation

Step 4: Validate                        Step 3: Optimize (Future Modules)
        ↓                                       ↓
Profile optimized ←  Verify speedup    ←  Implement optimization
500 tok/s (12.5x)    Measure impact       Design solution
```
**Without profiling**: You'd never know WHERE to optimize!
**Without measurement**: You couldn't verify improvements!
This module teaches the measurement and analysis skills that enable
optimization breakthroughs. You'll profile real models and discover
bottlenecks just like production ML teams do.
"""
# %% [markdown]
"""
## 2. Foundations: Performance Measurement Principles
Before we build our profiler, let's understand what we're measuring and why each metric matters.
### Parameter Counting - Model Size Detective Work
Parameters determine your model's memory footprint and storage requirements. Every parameter is typically a 32-bit float (4 bytes), so counting them precisely predicts memory usage.
**Parameter Counting Formula:**
```
Linear Layer: (input_features × output_features) + output_features = total parameters
                            ↑                            ↑
                      Weight matrix                 Bias vector

Example: Linear(768, 3072) → (768 × 3072) + 3072 = 2,362,368 parameters
Memory:  2,362,368 × 4 bytes = 9,449,472 bytes ≈ 9.45 MB
         (≈ 9.01 MB under the 1024 × 1024 convention used by MB_TO_BYTES)
```
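The formula above can be checked with a few lines of plain Python. This is a minimal sketch: `linear_param_count` and `param_memory_bytes` are hypothetical helper names (not part of TinyTorch), and float32 storage is assumed.

```python
# Hypothetical helpers for illustration; not part of the TinyTorch API.
def linear_param_count(in_features: int, out_features: int, bias: bool = True) -> int:
    # Weight matrix entries plus an optional bias vector
    return in_features * out_features + (out_features if bias else 0)

def param_memory_bytes(num_params: int, bytes_per_param: int = 4) -> int:
    # Assumes every parameter is stored as float32 (4 bytes)
    return num_params * bytes_per_param

params = linear_param_count(768, 3072)
size_bytes = param_memory_bytes(params)
print(params)                           # 2362368
print(f"{size_bytes / 1e6:.2f} MB")     # 9.45 MB  (decimal megabytes)
print(f"{size_bytes / 2**20:.2f} MiB")  # 9.01 MiB (1024 * 1024 bytes)
```

The two printed sizes differ only in the unit convention, which is worth keeping in mind when comparing hand calculations with profiler output.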
### FLOP Counting - Computational Cost Analysis
FLOPs (Floating Point Operations) measure computational work. Unlike wall-clock time, FLOPs are hardware-independent and predict compute costs across different systems.
**FLOP Formulas for Key Operations:**
```
Matrix Multiplication (M,K) @ (K,N):
    FLOPs = M × N × K × 2
            ↑   ↑   ↑   ↑
         Rows Cols Inner Multiply+Add

Linear Layer Forward:
    FLOPs = batch_size × input_features × output_features × 2
    (matmul cost only; the bias add contributes batch_size × output_features more.
     Profiler.count_flops reports the per-sample figure, i.e. batch_size = 1.)

Convolution (simplified):
    FLOPs = output_H × output_W × kernel_H × kernel_W × in_channels × out_channels × 2
```
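As a rough illustration, the three formulas translate directly into small functions. The names below are hypothetical and exist only for this sketch; each multiply-accumulate is counted as 2 FLOPs, and the bias add is left out.

```python
# Hypothetical helpers mirroring the formulas above (2 FLOPs = one multiply + one add).
def matmul_flops(m: int, k: int, n: int) -> int:
    # (M, K) @ (K, N): every one of the M*N outputs needs K multiply-adds
    return 2 * m * n * k

def linear_flops(batch_size: int, in_features: int, out_features: int) -> int:
    # Matmul cost only; the bias add would be batch_size * out_features extra
    return matmul_flops(batch_size, in_features, out_features)

def conv2d_flops(out_h: int, out_w: int, kernel_h: int, kernel_w: int,
                 in_channels: int, out_channels: int) -> int:
    # Simplified: ignores batch size, bias, and padding effects
    return 2 * out_h * out_w * kernel_h * kernel_w * in_channels * out_channels

print(matmul_flops(4, 8, 16))               # 1024
print(linear_flops(32, 768, 3072))          # 150994944
print(conv2d_flops(112, 112, 7, 7, 3, 64))  # 236027904
```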
### Memory Profiling - The Four Components of Training Memory
ML models use memory in four distinct ways during training, each with different optimization strategies:
**Memory Type Breakdown:**
```
Total Training Memory = Parameters + Activations + Gradients + Optimizer State
                            ↓            ↓            ↓              ↓
                          Model       Forward      Backward     Adam: 2×params
                         weights     pass cache    gradients    SGD:  0×params
Example for 125M parameter model:
Parameters: 500 MB (125M × 4 bytes)
Activations: 200 MB (depends on batch size)
Gradients: 500 MB (same as parameters)
Adam state: 1,000 MB (momentum + velocity)
Total: 2,200 MB (4.4× parameter memory!)
```
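The breakdown above can be reproduced with a short calculation. This is a sketch under stated assumptions: `training_memory_mb` is a hypothetical helper, parameters are float32, megabytes are decimal (1e6 bytes), and the 200 MB activation figure is simply taken from the example rather than measured.

```python
# Hypothetical helper; activation memory is supplied by the caller, not measured.
def training_memory_mb(num_params, activation_mb, optimizer_multiplier, bytes_per_param=4):
    param_mb = num_params * bytes_per_param / 1e6
    gradient_mb = param_mb                          # one gradient per parameter
    optimizer_mb = optimizer_multiplier * param_mb  # Adam: 2x, plain SGD: 0x
    total_mb = param_mb + activation_mb + gradient_mb + optimizer_mb
    return {
        "parameters": param_mb,
        "activations": activation_mb,
        "gradients": gradient_mb,
        "optimizer": optimizer_mb,
        "total": total_mb,
    }

adam = training_memory_mb(125_000_000, activation_mb=200, optimizer_multiplier=2)
sgd = training_memory_mb(125_000_000, activation_mb=200, optimizer_multiplier=0)
print(adam["total"])                       # 2200.0
print(sgd["total"])                        # 1200.0
print(adam["total"] / adam["parameters"])  # 4.4
```

Switching the optimizer multiplier from 2 to 0 shows how much of the training footprint comes from optimizer state alone.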
### Latency Measurement - Dealing with Reality
Latency measurement is tricky because systems have variance, warmup effects, and measurement overhead. Professional profiling requires statistical rigor.
**Latency Measurement Best Practices:**
```
Measurement Protocol:
1. Warmup runs (10+) → CPU/GPU caches warm up
2. Timed runs (100+) → Statistical significance
3. Outlier handling → Use median, not mean
4. Memory cleanup → Prevent contamination
Timeline:
Warmup: [run][run][run]...[run] ← Don't time these
Timing: [⏱run⏱][⏱run⏱]...[⏱run⏱] ← Time these
Result: median(all_times) ← Robust to outliers
```
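The protocol above can be sketched with the standard library alone. `measure_median_latency_ms` is a hypothetical name used for illustration, and the workload is a stand-in for a model's forward pass.

```python
import statistics
import time

# Hypothetical helper illustrating the warmup / timed-runs / median protocol.
def measure_median_latency_ms(fn, warmup=10, iterations=100):
    # Warmup runs: executed but not timed
    for _ in range(warmup):
        fn()
    # Timed runs: collect one sample per call, in milliseconds
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

# Stand-in workload; a real use would call model.forward(input_tensor)
latency_ms = measure_median_latency_ms(lambda: sum(range(10_000)), warmup=3, iterations=20)
print(f"median latency: {latency_ms:.4f} ms")

# Why median: a single outlier barely moves it, while the mean shifts a lot
timings = [1.0, 1.1, 0.9, 1.0, 50.0]
print(statistics.median(timings))  # 1.0
print(statistics.mean(timings))    # roughly 10.8
```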
"""
# %% [markdown]
"""
## 3. Implementation: Building the Core Profiler Class
Now let's implement our profiler step by step. We'll start with the foundation and build up to comprehensive analysis.
### The Profiler Architecture
```
Profiler Class
├── count_parameters() → Model size analysis
├── count_flops() → Computational cost estimation
├── measure_memory() → Memory usage tracking
├── measure_latency() → Performance timing
├── profile_layer() → Layer-wise analysis
├── profile_forward_pass() → Complete forward analysis
└── profile_backward_pass() → Training analysis
Integration:
All methods work together to provide comprehensive performance insights
```
"""
# %% nbgrader={"grade": false, "grade_id": "profiler_class", "solution": true}
#| export
class Profiler:
    """
    Professional-grade ML model profiler for performance analysis.

    Measures parameters, FLOPs, memory usage, and latency with statistical rigor.
    Used for optimization guidance and deployment planning.
    """

    def __init__(self):
        """
        Initialize profiler with measurement state.

        TODO: Set up profiler tracking structures

        APPROACH:
        1. Create empty measurements dictionary
        2. Initialize operation counters
        3. Set up memory tracking state

        EXAMPLE:
        >>> profiler = Profiler()
        >>> profiler.measurements
        {}

        HINTS:
        - Use defaultdict(int) for operation counters
        - measurements dict will store timing results
        """
        ### BEGIN SOLUTION
        self.measurements = {}
        self.operation_counts = defaultdict(int)
        self.memory_tracker = None
        ### END SOLUTION

    def count_parameters(self, model) -> int:
        """
        Count total trainable parameters in a model.

        TODO: Implement parameter counting for any model with parameters() method

        APPROACH:
        1. Get all parameters from model.parameters() if available
        2. For single layers, count weight and bias directly
        3. Sum total element count across all parameter tensors

        EXAMPLE:
        >>> linear = Linear(128, 64)  # 128*64 + 64 = 8256 parameters
        >>> profiler = Profiler()
        >>> count = profiler.count_parameters(linear)
        >>> print(count)
        8256

        HINTS:
        - Use parameter.data.size for tensor element count
        - Handle models with and without parameters() method
        - Don't forget bias terms when present
        """
        ### BEGIN SOLUTION
        total_params = 0
        # Handle SimpleModel pattern (has .layers attribute)
        if hasattr(model, 'layers'):
            # SimpleModel: iterate through layers
            for layer in model.layers:
                for param in layer.parameters():
                    total_params += param.data.size
        elif hasattr(model, 'parameters'):
            # Model with direct parameters() method
            for param in model.parameters():
                total_params += param.data.size
        elif hasattr(model, 'weight'):
            # Single layer (Linear, Conv2d) - all have .weight
            total_params += model.weight.data.size
            # Check for bias (may be None)
            if hasattr(model, 'bias') and model.bias is not None:
                total_params += model.bias.data.size
        else:
            # No parameters (activations, etc.)
            total_params = 0
        return total_params
        ### END SOLUTION

    def count_flops(self, model, input_shape: Tuple[int, ...]) -> int:
        """
        Count FLOPs (Floating Point Operations) for one forward pass.

        TODO: Implement FLOP counting for different layer types

        APPROACH:
        1. Create dummy input with given shape
        2. Calculate FLOPs based on layer type and dimensions
        3. Handle different model architectures (Linear, Conv2d, Sequential)

        LAYER-SPECIFIC FLOP FORMULAS:
        - Linear: input_features × output_features × 2 (one multiply + one add per weight; bias add not counted)
        - Conv2d: output_h × output_w × kernel_h × kernel_w × in_channels × out_channels × 2
        - Activation: Usually 1 FLOP per element (ReLU, Sigmoid)

        EXAMPLE:
        >>> linear = Linear(128, 64)
        >>> profiler = Profiler()
        >>> flops = profiler.count_flops(linear, (1, 128))
        >>> print(flops)  # 128 * 64 * 2 = 16384
        16384

        HINTS:
        - Batch dimension doesn't affect per-sample FLOPs
        - Focus on major operations (matmul, conv) first
        - For Sequential models, sum FLOPs of all layers
        """
        ### BEGIN SOLUTION
        # Create dummy input (unused but kept for interface consistency)
        _dummy_input = Tensor(np.random.randn(*input_shape))
        total_flops = 0
        # Dispatch on the class name so that mock layers can be profiled too
        model_name = model.__class__.__name__
        if model_name == 'Linear':
            # Linear layer: input_features × output_features × 2
            in_features = input_shape[-1]
            out_features = model.weight.shape[1] if hasattr(model, 'weight') else 1
            total_flops = in_features * out_features * 2
        elif model_name == 'Conv2d':
            # Conv2d layer: complex calculation based on output size
            # Simplified: assume we know the output dimensions
            if hasattr(model, 'kernel_size') and hasattr(model, 'in_channels'):
                _batch_size = input_shape[0] if len(input_shape) > 3 else 1
                in_channels = model.in_channels
                out_channels = model.out_channels
                # kernel_size and stride may be stored as an int or an (h, w) tuple
                kernel_size = model.kernel_size
                kernel_h, kernel_w = (kernel_size if isinstance(kernel_size, tuple)
                                      else (kernel_size, kernel_size))
                stride = getattr(model, 'stride', 1)
                stride_h, stride_w = stride if isinstance(stride, tuple) else (stride, stride)
                # Estimate output size (simplified: ignores padding and kernel extent)
                input_h, input_w = input_shape[-2], input_shape[-1]
                output_h = input_h // stride_h
                output_w = input_w // stride_w
                total_flops = (output_h * output_w * kernel_h * kernel_w *
                               in_channels * out_channels * 2)
        elif model_name == 'Sequential':
            # Sequential model: sum FLOPs of all layers
            current_shape = input_shape
            for layer in model.layers:
                layer_flops = self.count_flops(layer, current_shape)
                total_flops += layer_flops
                # Update shape for next layer (simplified)
                if hasattr(layer, 'weight'):
                    current_shape = current_shape[:-1] + (layer.weight.shape[1],)
        else:
            # Activation or other: assume 1 FLOP per element
            # int() keeps the return type a plain Python int rather than np.int64
            total_flops = int(np.prod(input_shape))
        return total_flops
        ### END SOLUTION

    def measure_memory(self, model, input_shape: Tuple[int, ...]) -> Dict[str, float]:
        """
        Measure memory usage during forward pass.

        TODO: Implement memory tracking for model execution

        APPROACH:
        1. Use tracemalloc to track memory allocation
        2. Measure baseline memory before model execution
        3. Run forward pass and track peak usage
        4. Calculate different memory components

        RETURN DICTIONARY:
        - 'parameter_memory_mb': Memory for model parameters
        - 'activation_memory_mb': Memory for activations
        - 'peak_memory_mb': Maximum memory usage
        - 'memory_efficiency': Ratio of useful to total memory

        EXAMPLE:
        >>> linear = Linear(1024, 512)
        >>> profiler = Profiler()
        >>> memory = profiler.measure_memory(linear, (32, 1024))
        >>> print(f"Parameters: {memory['parameter_memory_mb']:.1f} MB")
        Parameters: 2.0 MB

        HINTS:
        - Use tracemalloc.start() and tracemalloc.get_traced_memory()
        - Account for float32 = 4 bytes per parameter
        - Activation memory scales with batch size
        """
        ### BEGIN SOLUTION
        # Start memory tracking
        tracemalloc.start()
        try:
            # Baseline: memory already traced before the profiled work starts
            _baseline_memory = tracemalloc.get_traced_memory()[0]
            # Calculate parameter memory
            param_count = self.count_parameters(model)
            parameter_memory_bytes = param_count * BYTES_PER_FLOAT32
            parameter_memory_mb = parameter_memory_bytes / MB_TO_BYTES
            # Create input and measure activation memory
            dummy_input = Tensor(np.random.randn(*input_shape))
            input_memory_bytes = dummy_input.data.nbytes
            # Estimate activation memory (simplified)
            activation_memory_bytes = input_memory_bytes * 2  # Rough estimate
            activation_memory_mb = activation_memory_bytes / MB_TO_BYTES
            # Run forward pass to measure peak memory usage
            _ = model.forward(dummy_input)
            # Get peak memory
            _current_memory, peak_memory = tracemalloc.get_traced_memory()
        finally:
            # Always stop tracing, even if the forward pass raises
            tracemalloc.stop()
        peak_memory_mb = (peak_memory - _baseline_memory) / MB_TO_BYTES
        # Calculate efficiency
        useful_memory = parameter_memory_mb + activation_memory_mb
        memory_efficiency = useful_memory / max(peak_memory_mb, 0.001)  # Avoid division by zero
        return {
            'parameter_memory_mb': parameter_memory_mb,
            'activation_memory_mb': activation_memory_mb,
            'peak_memory_mb': max(peak_memory_mb, useful_memory),
            'memory_efficiency': min(memory_efficiency, 1.0)
        }
        ### END SOLUTION

    def measure_latency(self, model, input_tensor, warmup: int = 10, iterations: int = 100) -> float:
        """
        Measure model inference latency with statistical rigor.

        TODO: Implement accurate latency measurement

        APPROACH:
        1. Run warmup iterations to stabilize performance
        2. Measure multiple iterations for statistical accuracy
        3. Calculate median latency to handle outliers
        4. Return latency in milliseconds

        PARAMETERS:
        - warmup: Number of warmup runs (default 10)
        - iterations: Number of measurement runs (default 100)

        EXAMPLE:
        >>> linear = Linear(128, 64)
        >>> input_tensor = Tensor(np.random.randn(1, 128))
        >>> profiler = Profiler()
        >>> latency = profiler.measure_latency(linear, input_tensor)
        >>> print(f"Latency: {latency:.2f} ms")
        Latency: 0.15 ms

        HINTS:
        - Use time.perf_counter() for high precision
        - Use median instead of mean for robustness against outliers
        - Handle different model interfaces (forward, __call__)
        """
        ### BEGIN SOLUTION
        # Warmup runs to stabilize performance
        for _ in range(warmup):
            _ = model.forward(input_tensor)
        # Measurement runs
        times = []
        for _ in range(iterations):
            start_time = time.perf_counter()
            _ = model.forward(input_tensor)
            end_time = time.perf_counter()
            times.append((end_time - start_time) * 1000)  # Convert to milliseconds
        # Calculate statistics - use median for robustness
        times = np.array(times)
        median_latency = np.median(times)
        return float(median_latency)
        ### END SOLUTION

    def profile_layer(self, layer, input_shape: Tuple[int, ...]) -> Dict[str, Any]:
        """
        Profile a single layer comprehensively.

        TODO: Implement layer-wise profiling

        APPROACH:
        1. Count parameters for this layer
        2. Count FLOPs for this layer
        3. Measure memory usage
        4. Measure latency
        5. Return comprehensive layer profile

        EXAMPLE:
        >>> linear = Linear(256, 128)
        >>> profiler = Profiler()
        >>> profile = profiler.profile_layer(linear, (32, 256))
        >>> print(f"Layer uses {profile['parameters']} parameters")
        Layer uses 32896 parameters

        HINTS:
        - Use existing profiler methods (count_parameters, count_flops, etc.)
        - Create dummy input for latency measurement
        - Include layer type information in profile
        """
        ### BEGIN SOLUTION
        # Create dummy input for latency measurement
        dummy_input = Tensor(np.random.randn(*input_shape))
        # Gather all measurements
        params = self.count_parameters(layer)
        flops = self.count_flops(layer, input_shape)
        memory = self.measure_memory(layer, input_shape)
        latency = self.measure_latency(layer, dummy_input, warmup=3, iterations=10)
        # Compute derived metrics
        gflops_per_second = (flops / 1e9) / max(latency / 1000, 1e-6)
        return {
            'layer_type': layer.__class__.__name__,
            'parameters': params,
            'flops': flops,
            'latency_ms': latency,
            'gflops_per_second': gflops_per_second,
            **memory
        }
        ### END SOLUTION

    def profile_forward_pass(self, model, input_tensor) -> Dict[str, Any]:
        """
        Comprehensive profiling of a model's forward pass.

        TODO: Implement complete forward pass analysis

        APPROACH:
        1. Use Profiler class to gather all measurements
        2. Create comprehensive performance profile
        3. Add derived metrics and insights
        4. Return structured analysis results

        RETURN METRICS:
        - All basic profiler measurements
        - FLOPs per second (computational efficiency)
        - Memory bandwidth utilization
        - Performance bottleneck identification

        EXAMPLE:
        >>> model = Linear(256, 128)
        >>> input_data = Tensor(np.random.randn(32, 256))
        >>> profiler = Profiler()
        >>> profile = profiler.profile_forward_pass(model, input_data)
        >>> print(f"Throughput: {profile['gflops_per_second']:.2f} GFLOP/s")
        Throughput: 2.45 GFLOP/s

        HINTS:
        - GFLOP/s = (FLOPs / 1e9) / (latency_ms / 1000)
        - Memory bandwidth = memory_mb / (latency_ms / 1000)
        - Consider realistic hardware limits for efficiency calculations
        """
        ### BEGIN SOLUTION
        # Basic measurements
        param_count = self.count_parameters(model)
        flops = self.count_flops(model, input_tensor.shape)
        memory_stats = self.measure_memory(model, input_tensor.shape)
        latency_ms = self.measure_latency(model, input_tensor, warmup=5, iterations=20)
        # Derived metrics
        # Note: count_flops() reports per-sample FLOPs for Linear and Conv2d, while
        # the latency covers the whole batch, so for batch sizes > 1 this throughput
        # understates the true figure.
        latency_seconds = latency_ms / 1000.0
        gflops_per_second = (flops / 1e9) / max(latency_seconds, 1e-6)
        # Memory bandwidth (MB/s)
        memory_bandwidth = memory_stats['peak_memory_mb'] / max(latency_seconds, 1e-6)
        # Efficiency metrics
        theoretical_peak_gflops = 100.0  # Assume 100 GFLOP/s theoretical peak for CPU
        computational_efficiency = min(gflops_per_second / theoretical_peak_gflops, 1.0)
        # Bottleneck analysis
        is_memory_bound = memory_bandwidth > gflops_per_second * 100  # Rough heuristic
        is_compute_bound = not is_memory_bound
        return {
            # Basic measurements
            'parameters': param_count,
            'flops': flops,
            'latency_ms': latency_ms,
            **memory_stats,
            # Derived metrics
            'gflops_per_second': gflops_per_second,
            'memory_bandwidth_mbs': memory_bandwidth,
            'computational_efficiency': computational_efficiency,
            # Bottleneck analysis
            'is_memory_bound': is_memory_bound,
            'is_compute_bound': is_compute_bound,
            'bottleneck': 'memory' if is_memory_bound else 'compute'
        }
        ### END SOLUTION

    def profile_backward_pass(self, model, input_tensor, _loss_fn=None) -> Dict[str, Any]:
        """
        Profile both forward and backward passes for training analysis.

        TODO: Implement training-focused profiling

        APPROACH:
        1. Profile forward pass first
        2. Estimate backward pass costs (typically 2× forward)
        3. Calculate total training iteration metrics
        4. Analyze memory requirements for gradients and optimizers

        BACKWARD PASS ESTIMATES:
        - FLOPs: ~2× forward pass (gradient computation)
        - Memory: +1× parameters (gradient storage)
        - Latency: ~2× forward pass (more complex operations)

        EXAMPLE:
        >>> model = Linear(128, 64)
        >>> input_data = Tensor(np.random.randn(16, 128))
        >>> profiler = Profiler()
        >>> profile = profiler.profile_backward_pass(model, input_data)
        >>> print(f"Training iteration: {profile['total_latency_ms']:.2f} ms")
        Training iteration: 0.45 ms

        HINTS:
        - Total memory = parameters + activations + gradients
        - Optimizer memory depends on algorithm (SGD: 0×, Adam: 2×)
        - Consider gradient accumulation effects
        """
        ### BEGIN SOLUTION
        # Get forward pass profile
        forward_profile = self.profile_forward_pass(model, input_tensor)
        # Estimate backward pass (typically 2× forward)
        backward_flops = forward_profile['flops'] * 2
        backward_latency_ms = forward_profile['latency_ms'] * 2
        # Gradient memory (equal to parameter memory)
        gradient_memory_mb = forward_profile['parameter_memory_mb']
        # Total training iteration
        total_flops = forward_profile['flops'] + backward_flops
        total_latency_ms = forward_profile['latency_ms'] + backward_latency_ms
        total_memory_mb = (forward_profile['parameter_memory_mb'] +
                           forward_profile['activation_memory_mb'] +
                           gradient_memory_mb)
        # Training efficiency (guard against a zero latency, as in profile_forward_pass)
        total_gflops_per_second = (total_flops / 1e9) / max(total_latency_ms / 1000.0, 1e-6)
        # Optimizer memory estimates
        optimizer_memory_estimates = {
            'sgd': 0,                         # No extra memory
            'adam': gradient_memory_mb * 2,   # Momentum + velocity
            'adamw': gradient_memory_mb * 2,  # Same as Adam
        }
        return {
            # Forward pass
            'forward_flops': forward_profile['flops'],
            'forward_latency_ms': forward_profile['latency_ms'],
            'forward_memory_mb': forward_profile['peak_memory_mb'],
            # Backward pass estimates
            'backward_flops': backward_flops,
            'backward_latency_ms': backward_latency_ms,
            'gradient_memory_mb': gradient_memory_mb,
            # Total training iteration
            'total_flops': total_flops,
            'total_latency_ms': total_latency_ms,
            'total_memory_mb': total_memory_mb,
            'total_gflops_per_second': total_gflops_per_second,
            # Optimizer memory requirements
            'optimizer_memory_estimates': optimizer_memory_estimates,
            # Training insights
            'memory_efficiency': forward_profile['memory_efficiency'],
            'bottleneck': forward_profile['bottleneck']
        }
        ### END SOLUTION

# %% [markdown]
"""
## Helper Functions - Quick Profiling Utilities
These helper functions provide simplified interfaces for common profiling tasks.
They make it easy to quickly profile models and analyze characteristics without
manually calling multiple profiler methods.
### Why Helper Functions Matter
In production ML engineering, you often need quick insights without setting up
full profiling workflows. These utilities provide:
- **Quick profiling**: One-line model analysis with formatted output
- **Weight analysis**: Understanding parameter distributions for compression
- **Student-friendly output**: Clear, formatted results for learning
These functions wrap our core Profiler class with convenience interfaces used
in real ML workflows for rapid iteration and debugging.
"""
# %% nbgrader={"grade": false, "grade_id": "helper_quick_profile", "solution": true}
#| export
def quick_profile(model, input_tensor, profiler=None):
    """
    Quick profiling function for immediate insights.

    Provides a simplified interface for profiling that displays key metrics
    in a student-friendly format.

    Args:
        model: Model to profile
        input_tensor: Input data for profiling
        profiler: Optional Profiler instance (creates new one if None)

    Returns:
        dict: Profile results with key metrics

    Example:
        >>> model = Linear(128, 64)
        >>> input_data = Tensor(np.random.randn(16, 128))
        >>> results = quick_profile(model, input_data)
        >>> # Displays formatted output automatically
    """
    if profiler is None:
        profiler = Profiler()
    profile = profiler.profile_forward_pass(model, input_tensor)
    # Display formatted results
    print("🔬 Quick Profile Results:")
    print(f"   Parameters: {profile['parameters']:,}")
    print(f"   FLOPs: {profile['flops']:,}")
    print(f"   Latency: {profile['latency_ms']:.2f} ms")
    print(f"   Memory: {profile['peak_memory_mb']:.2f} MB")
    print(f"   Bottleneck: {profile['bottleneck']}")
    print(f"   Efficiency: {profile['computational_efficiency']*100:.1f}%")
    return profile

# %% nbgrader={"grade": false, "grade_id": "helper_weight_distribution", "solution": true}
#| export
def analyze_weight_distribution(model, percentiles=(10, 25, 50, 75, 90)):
    """
    Analyze weight distribution for compression insights.

    Helps understand which weights are small and might be prunable.
    Used by Module 16 (Compression) to motivate pruning.

    Args:
        model: Model to analyze
        percentiles: Sequence of percentiles to compute

    Returns:
        dict: Weight distribution statistics

    Example:
        >>> model = Linear(512, 512)
        >>> stats = analyze_weight_distribution(model)
        >>> print(f"Weights < 0.01: {stats['below_threshold_001']:.1f}%")
    """
    # Collect all weights
    weights = []
    if hasattr(model, 'parameters'):
        for param in model.parameters():
            weights.extend(param.data.flatten().tolist())
    elif hasattr(model, 'weight'):
        weights.extend(model.weight.data.flatten().tolist())
    else:
        return {'error': 'No weights found'}
    weights = np.array(weights)
    # A model whose parameters() is empty has nothing to analyze
    if weights.size == 0:
        return {'error': 'No weights found'}
    abs_weights = np.abs(weights)
    # Calculate statistics
    stats = {
        'total_weights': len(weights),
        'mean': float(np.mean(abs_weights)),
        'std': float(np.std(abs_weights)),
        'min': float(np.min(abs_weights)),
        'max': float(np.max(abs_weights)),
    }
    # Percentile analysis
    for p in percentiles:
        stats[f'percentile_{p}'] = float(np.percentile(abs_weights, p))
    # Threshold analysis (useful for pruning)
    for threshold in [0.001, 0.01, 0.1]:
        below = np.sum(abs_weights < threshold) / len(weights) * 100
        stats[f'below_threshold_{str(threshold).replace(".", "")}'] = float(below)
    return stats

# %% [markdown]
"""
### 🧪 Unit Test: Helper Functions
This test validates our helper utilities work correctly and provide useful output.
**What we're testing**: Quick profiling and weight distribution analysis
**Why it matters**: These utilities are used daily in production ML workflows
**Expected**: Correct profiles with formatted output
"""
# %% nbgrader={"grade": true, "grade_id": "test_helper_functions", "locked": true, "points": 5}
def test_unit_helper_functions():
    """🔬 Test helper function implementations."""
    print("🔬 Unit Test: Helper Functions...")

    # Test 1: Quick profile function
    from tinytorch.core.layers import Linear
    test_model = Linear(16, 8)
    test_input = Tensor(np.random.randn(8, 16))
    profile = quick_profile(test_model, test_input, profiler=Profiler())
    # Validate profile contains expected keys
    assert 'parameters' in profile, "Quick profile should include parameters"
    assert 'flops' in profile, "Quick profile should include FLOPs"
    assert 'latency_ms' in profile, "Quick profile should include latency"
    print("✅ Quick profile provides comprehensive metrics")

    # Test 2: Weight distribution analysis
    class SimpleModel:
        def __init__(self):
            self.weight = Tensor(np.random.randn(10, 5) * 0.1)  # Small weights

    model = SimpleModel()
    stats = analyze_weight_distribution(model)
    # Validate statistics structure
    assert 'total_weights' in stats, "Should count total weights"
    assert 'mean' in stats, "Should compute mean"
    assert 'std' in stats, "Should compute standard deviation"
    assert stats['total_weights'] == 50, f"Expected 50 weights, got {stats['total_weights']}"
    print(f"✅ Weight distribution analysis: {stats['total_weights']} weights analyzed")

    # Test 3: Weight distribution with no weights
    class NoWeightModel:
        pass

    no_weight_model = NoWeightModel()
    stats = analyze_weight_distribution(no_weight_model)
    assert 'error' in stats, "Should handle models without weights"
    print("✅ Handles models without weights gracefully")
    print("✅ Helper functions work correctly!")

if __name__ == "__main__":
    test_unit_helper_functions()

# %% [markdown]
"""
## Parameter Counting - Model Size Analysis
Parameter counting is the foundation of model profiling. Every parameter contributes to memory usage, training time, and model complexity. Let's validate our implementation.
### Why Parameter Counting Matters
```
Model Deployment Pipeline:
Parameters → Memory → Hardware → Cost
    ↓          ↓         ↓         ↓
   125M      500MB    8GB GPU  $200/month

Parameter Growth Examples:
Small:  GPT-2 Small  (124M parameters) → 500MB memory
Medium: GPT-2 Medium (350M parameters) → 1.4GB memory
Large:  GPT-2 Large  (774M parameters) → 3.1GB memory
XL:     GPT-2 XL     (1.5B parameters) → 6.0GB memory
```
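The memory column above follows from multiplying each parameter count by 4 bytes. A small sketch, assuming float32 weights, decimal gigabytes, and the rounded parameter counts listed in the table:

```python
# Rounded parameter counts taken from the table above; float32 (4 bytes) assumed.
gpt2_params = {
    "GPT-2 Small": 124_000_000,
    "GPT-2 Medium": 350_000_000,
    "GPT-2 Large": 774_000_000,
    "GPT-2 XL": 1_500_000_000,
}

memory_gb = {name: count * 4 / 1e9 for name, count in gpt2_params.items()}
for name, gb in memory_gb.items():
    print(f"{name}: {gb:.1f} GB")
# GPT-2 Small: 0.5 GB
# GPT-2 Medium: 1.4 GB
# GPT-2 Large: 3.1 GB
# GPT-2 XL: 6.0 GB
```

These figures cover weights only; activations, gradients, and optimizer state come on top during training.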
"""
# %% [markdown]
"""
### 🧪 Unit Test: Parameter Counting
This test validates our parameter counting works correctly for different model types.
**What we're testing**: Parameter counting accuracy for various architectures
**Why it matters**: Accurate parameter counts predict memory usage and model complexity
**Expected**: Correct counts for known model configurations
"""
# %% nbgrader={"grade": true, "grade_id": "test_parameter_counting", "locked": true, "points": 10}
def test_unit_parameter_counting():
    """🔬 Test parameter counting implementation."""
    print("🔬 Unit Test: Parameter Counting...")
    profiler = Profiler()

    # Test 1: Simple model with known parameters
    class SimpleModel:
        def __init__(self):
            self.weight = Tensor(np.random.randn(10, 5))
            self.bias = Tensor(np.random.randn(5))

        def parameters(self):
            return [self.weight, self.bias]

    simple_model = SimpleModel()
    param_count = profiler.count_parameters(simple_model)
    expected_count = 10 * 5 + 5  # weight + bias
    assert param_count == expected_count, f"Expected {expected_count} parameters, got {param_count}"
    print(f"✅ Simple model: {param_count} parameters")

    # Test 2: Model without parameters
    class NoParamModel:
        def __init__(self):
            pass

    no_param_model = NoParamModel()
    param_count = profiler.count_parameters(no_param_model)
    assert param_count == 0, f"Expected 0 parameters, got {param_count}"
    print(f"✅ No parameter model: {param_count} parameters")

    # Test 3: Direct tensor (no parameters)
    test_tensor = Tensor(np.random.randn(2, 3))
    param_count = profiler.count_parameters(test_tensor)
    assert param_count == 0, f"Expected 0 parameters for tensor, got {param_count}"
    print(f"✅ Direct tensor: {param_count} parameters")
    print("✅ Parameter counting works correctly!")

if __name__ == "__main__":
    test_unit_parameter_counting()

# %% [markdown]
"""
## FLOP Counting - Computational Cost Estimation
FLOPs measure the computational work required for model operations. Unlike latency, FLOPs are hardware-independent and help predict compute costs across different systems.
### FLOP Counting Visualization
```
Linear Layer FLOP Breakdown:
  Input (batch=32, features=768) × Weight (768, 3072) + Bias (3072)

  Matrix Multiplication: 32 × 768 × 3072 × 2 = 150,994,944 FLOPs
  Bias Addition:         32 × 3072 × 1       =      98,304 FLOPs
  Total FLOPs:                                 151,093,248 FLOPs

Convolution FLOP Breakdown:
  Input (batch=1, channels=3, H=224, W=224)
  Kernel (out=64, in=3, kH=7, kW=7)
  Output size: (224×224) → (112×112) with stride=2

  FLOPs = 112 × 112 × 7 × 7 × 3 × 64 × 2 = 236,027,904 FLOPs
```
### FLOP Counting Strategy
Different operations require different FLOP calculations:
- **Matrix operations**: M × N × K × 2 (one multiply and one add per term)
- **Convolutions**: Output spatial × kernel spatial × input channels × output channels × 2
- **Activations**: Usually 1 FLOP per element
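
The formulas above can be sketched as plain Python helpers. This is a minimal illustration, not the `Profiler.count_flops` implementation: the function names `linear_flops` and `conv2d_flops` are hypothetical, and the convolution count assumes a dense convolution without bias, with 2 FLOPs per multiply-add.

```python
def linear_flops(batch, in_features, out_features, bias=True):
    # Each output element needs in_features multiply-adds (2 FLOPs each)
    matmul = batch * in_features * out_features * 2
    # The bias adds one FLOP per output element
    bias_add = batch * out_features if bias else 0
    return matmul + bias_add


def conv2d_flops(out_h, out_w, kernel_h, kernel_w, in_channels, out_channels, batch=1):
    # One multiply-add per kernel weight, per output position, per output channel
    macs = batch * out_h * out_w * kernel_h * kernel_w * in_channels * out_channels
    return macs * 2


print(linear_flops(32, 768, 3072))          # 151093248
print(conv2d_flops(112, 112, 7, 7, 3, 64))  # 236027904
```

Passing `batch=1` gives the per-sample cost, which is convenient when comparing layers independently of how they are batched.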
"""
# %% [markdown]
"""
### 🧪 Unit Test: FLOP Counting
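This test validates our FLOP counting for different operations and architectures. Note that the test expects per-sample FLOPs: the count for a Linear layer is the same for batch sizes 1 and 32, whereas the breakdown above multiplies by the batch size.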
**What we're testing**: FLOP calculation accuracy for various layer types
**Why it matters**: FLOPs predict computational cost and energy usage
**Expected**: Correct FLOP counts for known operation types
"""
# %% nbgrader={"grade": true, "grade_id": "test_flop_counting", "locked": true, "points": 10}
def test_unit_flop_counting():
    """🔬 Test FLOP counting implementation."""
    print("🔬 Unit Test: FLOP Counting...")

    profiler = Profiler()

    # Test 1: Simple tensor operations
    test_tensor = Tensor(np.random.randn(4, 8))
    flops = profiler.count_flops(test_tensor, (4, 8))
    expected_flops = 4 * 8  # 1 FLOP per element for generic operation
    assert flops == expected_flops, f"Expected {expected_flops} FLOPs, got {flops}"
    print(f"✅ Tensor operation: {flops} FLOPs")

    # Test 2: Simulated Linear layer
    class MockLinear:
        def __init__(self, in_features, out_features):
            self.weight = Tensor(np.random.randn(in_features, out_features))
            self.__class__.__name__ = 'Linear'

    mock_linear = MockLinear(128, 64)
    flops = profiler.count_flops(mock_linear, (1, 128))
    expected_flops = 128 * 64 * 2  # matmul FLOPs
    assert flops == expected_flops, f"Expected {expected_flops} FLOPs, got {flops}"
    print(f"✅ Linear layer: {flops} FLOPs")

    # Test 3: Batch size independence
    flops_batch1 = profiler.count_flops(mock_linear, (1, 128))
    flops_batch32 = profiler.count_flops(mock_linear, (32, 128))
    assert flops_batch1 == flops_batch32, "FLOPs should be independent of batch size"
    print(f"✅ Batch independence: {flops_batch1} FLOPs (same for batch 1 and 32)")

    print("✅ FLOP counting works correctly!")


if __name__ == "__main__":
    test_unit_flop_counting()
# %% [markdown]
"""
## Memory Profiling - Understanding Memory Usage Patterns
Memory profiling reveals how much RAM your model consumes during training and inference. This is critical for deployment planning and optimization.
### Memory Usage Breakdown
```
ML Model Memory Components:
┌───────────────────────────────────────────────────┐
│                   Total Memory                    │
├─────────────────┬─────────────────┬───────────────┤
│   Parameters    │   Activations   │   Gradients   │
│   (persistent)  │  (per forward)  │ (per backward)│
├─────────────────┼─────────────────┼───────────────┤
│ Linear weights  │ Hidden states   │ ∂L/∂W         │
│ Conv filters    │ Attention maps  │ ∂L/∂b         │
│ Embeddings      │ Residual cache  │ Optimizer     │
└─────────────────┴─────────────────┴───────────────┘

Memory Scaling:
Batch Size      → Activation Memory (linear scaling)
Model Size      → Parameter + Gradient Memory (linear scaling)
Sequence Length → Attention Memory (quadratic scaling!)
```
### Memory Measurement Strategy
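We use Python's `tracemalloc` to track memory allocations during model execution. It records allocations made through Python's memory allocators and reports both current and peak usage for the traced code, rather than the footprint of the whole process.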
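
As a rough sketch of this strategy (not the `Profiler.measure_memory` implementation), the helper below wraps a workload in `tracemalloc` and reads back the peak. The name `peak_memory_mb` is hypothetical, and a plain `bytearray` stands in for a model's activation buffer so the example needs only the standard library.

```python
import tracemalloc


def peak_memory_mb(fn):
    # Start tracing, run the workload, then read the peak traced size in bytes
    tracemalloc.start()
    try:
        fn()
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        # Always stop tracing so later measurements start from a clean state
        tracemalloc.stop()
    return peak_bytes / (1024 * 1024)


def workload():
    # Stand-in for a forward pass: allocate a 4 MiB buffer
    return bytearray(4 * 1024 * 1024)


print(f"Peak: {peak_memory_mb(workload):.3f} MB")
```

Because the peak is tracked while tracing is active, the buffer still counts even though it is freed as soon as `workload` returns.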
"""
# %% [markdown]
"""
### 🧪 Unit Test: Memory Measurement
This test validates our memory tracking works correctly and provides useful metrics.
**What we're testing**: Memory usage measurement and calculation accuracy
**Why it matters**: Memory constraints often limit model deployment
**Expected**: Reasonable memory measurements with proper components
"""
# %% nbgrader={"grade": true, "grade_id": "test_memory_measurement", "locked": true, "points": 10}
def test_unit_memory_measurement():
    """🔬 Test memory measurement implementation."""
    print("🔬 Unit Test: Memory Measurement...")

    from tinytorch.core.layers import Linear

    profiler = Profiler()

    # Test 1: Basic memory measurement
    test_model = Linear(20, 10)
    memory_stats = profiler.measure_memory(test_model, (10, 20))

    # Validate dictionary structure
    required_keys = ['parameter_memory_mb', 'activation_memory_mb', 'peak_memory_mb', 'memory_efficiency']
    for key in required_keys:
        assert key in memory_stats, f"Missing key: {key}"

    # Validate non-negative values
    for key in required_keys:
        assert memory_stats[key] >= 0, f"{key} should be non-negative, got {memory_stats[key]}"
    print(f"✅ Basic measurement: {memory_stats['peak_memory_mb']:.3f} MB peak")

    # Test 2: Memory scaling with size
    small_model = Linear(5, 5)
    large_model = Linear(50, 50)
    small_memory = profiler.measure_memory(small_model, (5, 5))
    large_memory = profiler.measure_memory(large_model, (50, 50))

    # Larger tensor should use more activation memory
    assert large_memory['activation_memory_mb'] >= small_memory['activation_memory_mb'], \
        "Larger tensor should use more activation memory"
    print(f"✅ Scaling: Small {small_memory['activation_memory_mb']:.3f} MB → Large {large_memory['activation_memory_mb']:.3f} MB")

    # Test 3: Efficiency bounds
    assert 0 <= memory_stats['memory_efficiency'] <= 1.0, \
        f"Memory efficiency should be between 0 and 1, got {memory_stats['memory_efficiency']}"
    print(f"✅ Efficiency: {memory_stats['memory_efficiency']:.3f} (0-1 range)")

    print("✅ Memory measurement works correctly!")


if __name__ == "__main__":
    test_unit_memory_measurement()
# %% [markdown]
"""
## Latency Measurement - Accurate Performance Timing
Latency measurement is the most challenging part of profiling because it's affected by system state, caching, and measurement overhead. We need statistical rigor to get reliable results.
### Latency Measurement Challenges
```
Timing Challenges:
┌─────────────────────────────────────────────────┐
│                  Time Variance                  │
├─────────────────┬─────────────────┬─────────────┤
│  System Noise   │  Cache Effects  │   Thermal   │
│                 │                 │ Throttling  │
├─────────────────┼─────────────────┼─────────────┤
│ Background      │ Cold start vs   │ CPU slows   │
│ processes       │ warm caches     │ when hot    │
│ OS scheduling   │ Memory locality │ GPU thermal │
│ Network I/O     │ Branch predict  │ limits      │
└─────────────────┴─────────────────┴─────────────┘

Solution: Statistical Approach
Warmup → Multiple measurements → Robust statistics (median)
```
### Measurement Protocol
Our latency measurement follows professional benchmarking practices:
1. **Warmup runs** to stabilize system state
2. **Multiple measurements** for statistical significance
3. **Median calculation** to handle outliers
4. **Memory cleanup** to prevent contamination
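
The four steps above can be sketched with the standard library alone. This is an illustrative helper under stated assumptions, not the `Profiler.measure_latency` implementation: the name `measure_latency_ms` is hypothetical, and the workload is an arbitrary pure-Python loop.

```python
import gc
import statistics
import time


def measure_latency_ms(fn, warmup=3, iterations=10):
    # Step 1: warmup runs let caches and lazy initialization settle
    for _ in range(warmup):
        fn()

    # Step 4: collect garbage up front so a GC pause is less likely mid-measurement
    gc.collect()

    # Step 2: take several measurements with a high-resolution clock
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)

    # Step 3: the median is robust to occasional slow outliers
    return statistics.median(samples)


latency = measure_latency_ms(lambda: sum(i * i for i in range(10_000)))
print(f"Median latency: {latency:.3f} ms")
```

The median is used instead of the mean because a single run interrupted by the OS scheduler can inflate the mean without saying anything about typical latency.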
"""
# %% [markdown]
"""
### 🧪 Unit Test: Latency Measurement
This test validates our latency measurement provides consistent and reasonable results.
**What we're testing**: Timing accuracy and statistical robustness
**Why it matters**: Latency determines real-world deployment feasibility
**Expected**: Consistent timing measurements with proper statistical handling
"""
# %% nbgrader={"grade": true, "grade_id": "test_latency_measurement", "locked": true, "points": 10}
def test_unit_latency_measurement():
    """🔬 Test latency measurement implementation."""
    print("🔬 Unit Test: Latency Measurement...")

    profiler = Profiler()

    # Test 1: Basic latency measurement
    from tinytorch.core.layers import Linear
    test_model = Linear(8, 4)
    test_input = Tensor(np.random.randn(4, 8))
    latency = profiler.measure_latency(test_model, test_input, warmup=2, iterations=5)
    assert latency >= 0, f"Latency should be non-negative, got {latency}"
    assert latency < 1000, f"Latency seems too high for simple operation: {latency} ms"
    print(f"✅ Basic latency: {latency:.3f} ms")

    # Test 2: Measurement consistency
    latencies = []
    for _ in range(3):
        lat = profiler.measure_latency(test_model, test_input, warmup=1, iterations=3)
        latencies.append(lat)

    # Measurements should be in reasonable range
    avg_latency = np.mean(latencies)
    std_latency = np.std(latencies)
    assert std_latency < avg_latency, "Standard deviation shouldn't exceed mean for simple operations"
    print(f"✅ Consistency: {avg_latency:.3f} ± {std_latency:.3f} ms")

    # Test 3: Size scaling
    small_model = Linear(2, 2)
    large_model = Linear(20, 20)
    small_input = Tensor(np.random.randn(2, 2))
    large_input = Tensor(np.random.randn(20, 20))
    small_latency = profiler.measure_latency(small_model, small_input, warmup=1, iterations=3)
    large_latency = profiler.measure_latency(large_model, large_input, warmup=1, iterations=3)

    # Larger operations might take longer (though not guaranteed for simple operations)
    print(f"✅ Scaling: Small {small_latency:.3f} ms, Large {large_latency:.3f} ms")

    print("✅ Latency measurement works correctly!")


if __name__ == "__main__":
    test_unit_latency_measurement()
# %% [markdown]
"""
## 4. Integration: Advanced Profiling Functions
Now let's validate our higher-level profiling functions that combine core measurements into comprehensive analysis tools.
### Advanced Profiling Architecture
```
Core Profiler Methods → Advanced Analysis Functions → Optimization Insights
        ↓                          ↓                           ↓
count_parameters()      profile_forward_pass()      "Memory-bound workload"
count_flops()           profile_backward_pass()     "Optimize data movement"
measure_memory()        profile_layer()             "Focus on bandwidth"
measure_latency()       benchmark_efficiency()      "Use quantization"
```
### Forward Pass Profiling - Complete Performance Picture
A forward pass profile combines all our measurements to understand model behavior comprehensively. This is essential for optimization decisions.
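
One derived metric in such a profile is achieved throughput, which combines a FLOP count with a measured latency. The helper below is a minimal sketch of that arithmetic; the name `gflops_per_second` mirrors the profile key used in this module, but the function itself is hypothetical, and the inputs are made-up round numbers.

```python
def gflops_per_second(flops, latency_ms):
    # Convert FLOPs to GFLOPs and milliseconds to seconds, then divide
    return (flops / 1e9) / (latency_ms / 1000.0)


# Made-up example: 2 GFLOPs of work finishing in 500 ms
print(gflops_per_second(2e9, 500.0))  # 4.0
```

The FLOP count and the latency must describe the same amount of work: if latency is measured for a whole batch, the FLOPs should also be for the whole batch.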
"""
# %% [markdown]
"""
### Backward Pass Profiling - Training Analysis
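Training requires both forward and backward passes. The backward pass typically costs about 2× the compute of the forward pass, because it computes gradients with respect to both the inputs and the weights, and it adds gradient memory. Understanding this is crucial for training optimization.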
### Training Memory Visualization
```
Training Memory Timeline:
Forward Pass:   [Parameters] + [Activations]
Backward Pass:  [Parameters] + [Activations] + [Gradients]
Optimizer:      [Parameters] + [Gradients] + [Optimizer State]

Memory Examples:
Model: 125M parameters (500MB at FP32)
Forward:  500MB params + 100MB activations = 600MB
Backward: 500MB params + 100MB activations + 500MB gradients = 1,100MB
Adam:     500MB params + 500MB gradients + 1,000MB momentum/velocity = 2,000MB

Total Training Memory with Adam: 4× parameter memory, before activations!
```
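
The arithmetic in this example can be reproduced with a short helper. This is a sketch under stated assumptions rather than the logic behind `optimizer_memory_estimates`: the name `training_memory_mb` is hypothetical, activations are left out, MB means 10^6 bytes as in the example above, and the optimizer multipliers assume SGD without momentum (no extra state) and Adam (two extra buffers per parameter).

```python
def training_memory_mb(num_params, bytes_per_param=4, optimizer="adam"):
    # Parameters and gradients each store one value per parameter
    params_mb = num_params * bytes_per_param / 1e6
    grads_mb = params_mb

    # Assumed optimizer state: none for plain SGD, momentum + velocity for Adam
    state_multiplier = {"sgd": 0, "adam": 2}[optimizer]
    optimizer_mb = state_multiplier * params_mb

    return {
        "parameters_mb": params_mb,
        "gradients_mb": grads_mb,
        "optimizer_mb": optimizer_mb,
        "total_mb": params_mb + grads_mb + optimizer_mb,
    }


print(training_memory_mb(125_000_000))
```

For 125M FP32 parameters this yields 500 MB each for parameters and gradients and 1,000 MB of Adam state, matching the 2,000 MB figure above.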
"""
# %% [markdown]
"""
### 🧪 Unit Test: Advanced Profiling Functions
This test validates our advanced profiling functions provide comprehensive analysis.
**What we're testing**: Forward and backward pass profiling completeness
**Why it matters**: Training optimization requires understanding both passes
**Expected**: Complete profiles with all required metrics and relationships
"""
# %% nbgrader={"grade": true, "grade_id": "test_advanced_profiling", "locked": true, "points": 15}
def test_unit_advanced_profiling():
    """🔬 Test advanced profiling functions."""
    print("🔬 Unit Test: Advanced Profiling Functions...")

    # Create profiler and test model
    profiler = Profiler()
    from tinytorch.core.layers import Linear
    test_model = Linear(8, 4)
    test_input = Tensor(np.random.randn(4, 8))

    # Test forward pass profiling
    forward_profile = profiler.profile_forward_pass(test_model, test_input)

    # Validate forward profile structure
    required_forward_keys = [
        'parameters', 'flops', 'latency_ms', 'gflops_per_second',
        'memory_bandwidth_mbs', 'bottleneck'
    ]
    for key in required_forward_keys:
        assert key in forward_profile, f"Missing key: {key}"

    assert forward_profile['parameters'] >= 0
    assert forward_profile['flops'] >= 0
    assert forward_profile['latency_ms'] >= 0
    assert forward_profile['gflops_per_second'] >= 0
    print(f"✅ Forward profiling: {forward_profile['gflops_per_second']:.2f} GFLOP/s")

    # Test backward pass profiling
    backward_profile = profiler.profile_backward_pass(test_model, test_input)

    # Validate backward profile structure
    required_backward_keys = [
        'forward_flops', 'backward_flops', 'total_flops',
        'total_latency_ms', 'total_memory_mb', 'optimizer_memory_estimates'
    ]
    for key in required_backward_keys:
        assert key in backward_profile, f"Missing key: {key}"

    # Validate relationships
    assert backward_profile['total_flops'] >= backward_profile['forward_flops']
    assert backward_profile['total_latency_ms'] >= backward_profile['forward_latency_ms']
    assert 'sgd' in backward_profile['optimizer_memory_estimates']
    assert 'adam' in backward_profile['optimizer_memory_estimates']

    # Check backward pass estimates are reasonable
    assert backward_profile['backward_flops'] >= backward_profile['forward_flops'], \
        "Backward pass should have at least as many FLOPs as forward"
    assert backward_profile['gradient_memory_mb'] >= 0, \
        "Gradient memory should be non-negative"
    print(f"✅ Backward profiling: {backward_profile['total_latency_ms']:.2f} ms total")
    print(f"✅ Memory breakdown: {backward_profile['total_memory_mb']:.2f} MB training")

    print("✅ Advanced profiling functions work correctly!")


if __name__ == "__main__":
    test_unit_advanced_profiling()
# %% [markdown]
"""
## 5. Systems Analysis: Understanding Performance Characteristics
Let's analyze how different model characteristics affect performance. This analysis guides optimization decisions and helps identify bottlenecks.
### Performance Analysis Workflow
```
Model Scaling Analysis:
Size → Memory → Latency → Throughput → Bottleneck Identification
 ↓       ↓         ↓          ↓                  ↓
 64     1MB      0.1ms    10K ops/s         Memory bound
128     4MB      0.2ms     8K ops/s         Memory bound
256    16MB      0.5ms     4K ops/s         Memory bound
512    64MB      2.0ms     1K ops/s         Memory bound

Insight: This workload is memory-bound → Optimize data movement, not compute!
```
"""
# %% nbgrader={"grade": false, "grade_id": "performance_analysis", "solution": true}
def analyze_model_scaling():
    """📊 Analyze how model performance scales with size."""
    print("📊 Analyzing Model Scaling Characteristics...")

    from tinytorch.core.layers import Linear

    profiler = Profiler()
    results = []

    # Test different model sizes
    sizes = [64, 128, 256, 512]
    batch_size = 32

    print("\nModel Scaling Analysis:")
    print("Size\tParams\t\tFLOPs\t\tLatency(ms)\tMemory(MB)\tGFLOP/s")
    print("-" * 80)

    for size in sizes:
        # Create models of different sizes for comparison
        test_model = Linear(size, size)
        input_shape = (batch_size, size)  # Batch of 32
        dummy_input = Tensor(np.random.randn(*input_shape))

        # Simulate linear layer characteristics
        linear_params = size * size + size  # W + b
        linear_flops = size * size * 2  # per-sample matmul FLOPs

        # Measure actual performance
        latency = profiler.measure_latency(test_model, dummy_input, warmup=3, iterations=10)
        memory = profiler.measure_memory(test_model, input_shape)

        # Latency covers the whole batch, so throughput must use whole-batch FLOPs
        gflops_per_second = (batch_size * linear_flops / 1e9) / (latency / 1000)

        results.append({
            'size': size,
            'parameters': linear_params,
            'flops': linear_flops,
            'latency_ms': latency,
            'memory_mb': memory['peak_memory_mb'],
            'gflops_per_second': gflops_per_second
        })

        print(f"{size}\t{linear_params:,}\t\t{linear_flops:,}\t\t"
              f"{latency:.2f}\t\t{memory['peak_memory_mb']:.2f}\t\t"
              f"{gflops_per_second:.2f}")

    # Analysis insights
    print("\n💡 Scaling Analysis Insights:")

    # Memory scaling
    memory_growth = results[-1]['memory_mb'] / max(results[0]['memory_mb'], 0.001)
    print(f"Memory grows {memory_growth:.1f}× from {sizes[0]} to {sizes[-1]} size")

    # Compute scaling
    compute_growth = results[-1]['gflops_per_second'] / max(results[0]['gflops_per_second'], 0.001)
    print(f"Compute efficiency changes {compute_growth:.1f}× with size")

    # Performance characteristics
    avg_efficiency = np.mean([r['gflops_per_second'] for r in results])
    if avg_efficiency < 10:  # Arbitrary threshold for "low" efficiency
        print("🚀 Low compute efficiency suggests memory-bound workload")
    else:
        print("🚀 High compute efficiency suggests compute-bound workload")


def analyze_batch_size_effects():
    """📊 Analyze how batch size affects performance and efficiency."""
    print("\n📊 Analyzing Batch Size Effects...")

    from tinytorch.core.layers import Linear

    profiler = Profiler()
    batch_sizes = [1, 8, 32, 128]
    feature_size = 256

    print("\nBatch Size Effects Analysis:")
    print("Batch\tLatency(ms)\tThroughput(samples/s)\tMemory(MB)\tMemory Efficiency")
    print("-" * 85)

    for batch_size in batch_sizes:
        test_model = Linear(feature_size, feature_size)
        input_shape = (batch_size, feature_size)
        dummy_input = Tensor(np.random.randn(*input_shape))

        # Measure performance
        latency = profiler.measure_latency(test_model, dummy_input, warmup=3, iterations=10)
        memory = profiler.measure_memory(test_model, input_shape)

        # Calculate throughput
        samples_per_second = (batch_size * 1000) / latency  # samples/second

        # Calculate efficiency (samples per unit memory)
        efficiency = samples_per_second / max(memory['peak_memory_mb'], 0.001)

        print(f"{batch_size}\t{latency:.2f}\t\t{samples_per_second:.0f}\t\t\t"
              f"{memory['peak_memory_mb']:.2f}\t\t{efficiency:.1f}")

    print("\n💡 Batch Size Insights:")
    print("Larger batches typically improve throughput but increase memory usage")


# Run the analysis
if __name__ == "__main__":
    analyze_model_scaling()
    analyze_batch_size_effects()
# %% [markdown]
"""
## 6. Optimization Insights: Production Performance Patterns
Understanding profiling results helps guide optimization decisions. Let's analyze different operation types and measurement overhead.
### Operation Efficiency Analysis
```
Operation Types and Their Characteristics:
┌─────────────────┬──────────────────┬──────────────────┬─────────────────┐
│ Operation       │ Compute/Memory   │ Optimization     │ Priority        │
├─────────────────┼──────────────────┼──────────────────┼─────────────────┤
│ Matrix Multiply │ Compute-bound    │ BLAS libraries   │ High            │
│ Elementwise     │ Memory-bound     │ Data locality    │ Medium          │
│ Reductions      │ Memory-bound     │ Parallelization  │ Medium          │
│ Attention       │ Memory-bound     │ FlashAttention   │ High            │
└─────────────────┴──────────────────┴──────────────────┴─────────────────┘

Optimization Strategy:
1. Profile first → Identify bottlenecks
2. Focus on compute-bound ops → Algorithmic improvements
3. Focus on memory-bound ops → Data movement optimization
4. Measure again → Verify improvements
```
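
One common way to reason about the compute-bound versus memory-bound split is arithmetic intensity: FLOPs performed per byte moved. The sketch below is illustrative only. It assumes FP32 data and that every operand is read or written exactly once, and the `ridge_flops_per_byte` threshold is a made-up placeholder rather than a measured hardware value; the function names are hypothetical.

```python
def arithmetic_intensity(flops, bytes_moved):
    # FLOPs performed per byte of memory traffic
    return flops / bytes_moved


def classify(flops, bytes_moved, ridge_flops_per_byte=10.0):
    # ridge_flops_per_byte is a hypothetical stand-in for
    # (peak FLOP/s) / (memory bandwidth) of the target hardware
    if arithmetic_intensity(flops, bytes_moved) >= ridge_flops_per_byte:
        return "compute-bound"
    return "memory-bound"


batch, size, fp32 = 32, 256, 4

# Elementwise x + x: one FLOP per element; read x once, write the result once
elementwise = classify(batch * size, 2 * batch * size * fp32)

# Matmul (batch, size) @ (size, size): read input and weights, write output
matmul = classify(2 * batch * size * size,
                  (batch * size + size * size + batch * size) * fp32)

print(elementwise, matmul)
```

Under these assumptions the elementwise add performs 0.125 FLOPs per byte while the matmul performs 12.8, which is why the two land on opposite sides of the placeholder threshold.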
"""
# %% nbgrader={"grade": false, "grade_id": "optimization_insights", "solution": true}
def benchmark_operation_efficiency():
    """📊 Compare efficiency of different operations for optimization guidance."""
    print("📊 Benchmarking Operation Efficiency...")

    profiler = Profiler()
    operations = []

    # Test different operation types on the same (batch_size, size) input
    size = 256
    batch_size = 32
    input_tensor = Tensor(np.random.randn(batch_size, size))

    # Elementwise operations (memory-bound)
    # Create a simple model wrapper for elementwise operations
    class ElementwiseModel:
        def forward(self, x):
            return x + x  # Simple elementwise operation

    elementwise_model = ElementwiseModel()
    elementwise_latency = profiler.measure_latency(elementwise_model, input_tensor, iterations=20)
    elementwise_flops = size * batch_size  # One operation per element

    operations.append({
        'operation': 'Elementwise',
        'latency_ms': elementwise_latency,
        'flops': elementwise_flops,
        'gflops_per_second': (elementwise_flops / 1e9) / (elementwise_latency / 1000),
        'efficiency_class': 'memory-bound',
        'optimization_focus': 'data_locality'
    })

    # Matrix operations (compute-bound)
    from tinytorch.core.layers import Linear
    matrix_model = Linear(size, size)
    matrix_latency = profiler.measure_latency(matrix_model, input_tensor, iterations=10)
    # Whole-batch matmul FLOPs, so the count matches the measured batch latency
    matrix_flops = batch_size * size * size * 2

    operations.append({
        'operation': 'Matrix Multiply',
        'latency_ms': matrix_latency,
        'flops': matrix_flops,
        'gflops_per_second': (matrix_flops / 1e9) / (matrix_latency / 1000),
        'efficiency_class': 'compute-bound',
        'optimization_focus': 'algorithms'
    })

    # Reduction operations (memory-bound)
    class ReductionModel:
        def forward(self, x):
            return x.sum()  # Sum reduction operation

    reduction_model = ReductionModel()
    reduction_latency = profiler.measure_latency(reduction_model, input_tensor, iterations=20)
    reduction_flops = size * batch_size  # Roughly one add per element

    operations.append({
        'operation': 'Reduction',
        'latency_ms': reduction_latency,
        'flops': reduction_flops,
        'gflops_per_second': (reduction_flops / 1e9) / (reduction_latency / 1000),
        'efficiency_class': 'memory-bound',
        'optimization_focus': 'parallelization'
    })

    print("\nOperation Efficiency Comparison:")
    print("Operation\t\tLatency(ms)\tGFLOP/s\t\tEfficiency Class\tOptimization Focus")
    print("-" * 95)

    for op in operations:
        print(f"{op['operation']:<15}\t{op['latency_ms']:.3f}\t\t"
              f"{op['gflops_per_second']:.2f}\t\t{op['efficiency_class']:<15}\t{op['optimization_focus']}")

    print("\n💡 Operation Optimization Insights:")

    # Find most and least efficient
    best_op = max(operations, key=lambda x: x['gflops_per_second'])
    worst_op = min(operations, key=lambda x: x['gflops_per_second'])

    print(f"Most efficient: {best_op['operation']} ({best_op['gflops_per_second']:.2f} GFLOP/s)")
    print(f"Least efficient: {worst_op['operation']} ({worst_op['gflops_per_second']:.2f} GFLOP/s)")

    # Count operation types
    memory_bound_ops = [op for op in operations if op['efficiency_class'] == 'memory-bound']
    compute_bound_ops = [op for op in operations if op['efficiency_class'] == 'compute-bound']

    print("\n🚀 Optimization Priority:")
    if len(memory_bound_ops) > len(compute_bound_ops):
        print("Focus on memory optimization: data locality, bandwidth, caching")
    else:
        print("Focus on compute optimization: better algorithms, vectorization")


def analyze_profiling_overhead():
    """📊 Measure the overhead of profiling itself."""
    print("\n📊 Analyzing Profiling Overhead...")

    # Test with and without profiling
    test_tensor = Tensor(np.random.randn(100, 100))
    iterations = 50

    # Simple model used for both runs, so the comparison measures the same operation
    class TestModel:
        def forward(self, x):
            return x + 1.0

    test_model = TestModel()

    # Without profiling - baseline measurement
    start_time = time.perf_counter()
    for _ in range(iterations):
        _ = test_model.forward(test_tensor)
    end_time = time.perf_counter()
    baseline_ms = (end_time - start_time) * 1000

    # With profiling - includes measurement overhead
    # (warmup=1 plus iterations=1 means each call runs the model twice)
    profiler = Profiler()
    start_time = time.perf_counter()
    for _ in range(iterations):
        _ = profiler.measure_latency(test_model, test_tensor, warmup=1, iterations=1)
    end_time = time.perf_counter()
    profiled_ms = (end_time - start_time) * 1000

    overhead_factor = profiled_ms / max(baseline_ms, 0.001)

    print("\nProfiling Overhead Analysis:")
    print(f"Baseline execution: {baseline_ms:.2f} ms")
    print(f"With profiling: {profiled_ms:.2f} ms")
    print(f"Profiling overhead: {overhead_factor:.1f}× slower")

    print("\n💡 Profiling Overhead Insights:")
    if overhead_factor < 2:
        print("Low overhead - suitable for frequent profiling")
    elif overhead_factor < 10:
        print("Moderate overhead - use for development and debugging")
    else:
        print("High overhead - use sparingly in production")


# Run optimization analysis
if __name__ == "__main__":
    benchmark_operation_efficiency()
    analyze_profiling_overhead()
# %% [markdown]
"""
## 🧪 Module Integration Test
Final validation that everything works together correctly.
"""
# %% nbgrader={"grade": true, "grade_id": "test_module", "locked": true, "points": 20}
def test_module():
    """🧪 Module Test: Complete Integration

    Comprehensive test of entire profiling module functionality.

    This final test runs before module summary to ensure:
    - All unit tests pass
    - Functions work together correctly
    - Module is ready for integration with TinyTorch
    """
    print("🧪 RUNNING MODULE INTEGRATION TEST")
    print("=" * 50)

    # Run all unit tests
    print("Running unit tests...")
    test_unit_helper_functions()
    test_unit_parameter_counting()
    test_unit_flop_counting()
    test_unit_memory_measurement()
    test_unit_latency_measurement()
    test_unit_advanced_profiling()

    print("\nRunning integration scenarios...")

    # Test realistic usage patterns
    print("🔬 Integration Test: Complete Profiling Workflow...")

    # Create profiler
    profiler = Profiler()

    # Create test model and data
    from tinytorch.core.layers import Linear
    test_model = Linear(16, 32)
    test_input = Tensor(np.random.randn(8, 16))

    # Run complete profiling workflow
    print("1. Measuring model characteristics...")
    params = profiler.count_parameters(test_model)
    flops = profiler.count_flops(test_model, test_input.shape)
    memory = profiler.measure_memory(test_model, test_input.shape)
    latency = profiler.measure_latency(test_model, test_input, warmup=2, iterations=5)

    print(f"   Parameters: {params}")
    print(f"   FLOPs: {flops}")
    print(f"   Memory: {memory['peak_memory_mb']:.2f} MB")
    print(f"   Latency: {latency:.2f} ms")

    # Test advanced profiling
    print("2. Running advanced profiling...")
    forward_profile = profiler.profile_forward_pass(test_model, test_input)
    backward_profile = profiler.profile_backward_pass(test_model, test_input)

    assert 'gflops_per_second' in forward_profile
    assert 'total_latency_ms' in backward_profile
    print(f"   Forward GFLOP/s: {forward_profile['gflops_per_second']:.2f}")
    print(f"   Training latency: {backward_profile['total_latency_ms']:.2f} ms")

    # Test bottleneck analysis
    print("3. Analyzing performance bottlenecks...")
    bottleneck = forward_profile['bottleneck']
    efficiency = forward_profile['computational_efficiency']
    print(f"   Bottleneck: {bottleneck}")
    print(f"   Compute efficiency: {efficiency:.3f}")

    # Validate end-to-end workflow
    assert params >= 0, "Parameter count should be non-negative"
    assert flops >= 0, "FLOP count should be non-negative"
    assert memory['peak_memory_mb'] >= 0, "Memory usage should be non-negative"
    assert latency >= 0, "Latency should be non-negative"
    assert forward_profile['gflops_per_second'] >= 0, "GFLOP/s should be non-negative"
    assert backward_profile['total_latency_ms'] >= 0, "Total latency should be non-negative"
    assert bottleneck in ['memory', 'compute'], "Bottleneck should be memory or compute"
    assert 0 <= efficiency <= 1, "Efficiency should be between 0 and 1"

    print("✅ End-to-end profiling workflow works!")

    # Test production-like scenario
    print("4. Testing production profiling scenario...")

    # Simulate larger model analysis
    large_model = Linear(512, 256)
    large_input = Tensor(np.random.randn(32, 512))  # Larger model input
    large_profile = profiler.profile_forward_pass(large_model, large_input)

    # Verify profile contains optimization insights
    assert 'bottleneck' in large_profile, "Profile should identify bottlenecks"
    assert 'memory_bandwidth_mbs' in large_profile, "Profile should measure memory bandwidth"

    print(f"   Large model analysis: {large_profile['bottleneck']} bottleneck")
    print(f"   Memory bandwidth: {large_profile['memory_bandwidth_mbs']:.1f} MB/s")

    print("✅ Production profiling scenario works!")

    print("\n" + "=" * 50)
    print("🎉 ALL TESTS PASSED! Module ready for export.")
    print("Run: tito module complete 14")


# Call before module summary
if __name__ == "__main__":
    test_module()

# %%
if __name__ == "__main__":
    print("🚀 Running Profiling module...")
    test_module()
    print("✅ Module validation complete!")
# %% [markdown]
"""
## 🤔 ML Systems Thinking: Performance Measurement
### Question 1: FLOP Analysis
You implemented a profiler that counts FLOPs for different operations.
For a Linear layer with 1000 input features and 500 output features:
- How many FLOPs are required for one forward pass? _____ FLOPs
- If you process a batch of 32 samples, how does this change the per-sample FLOPs? _____
### Question 2: Memory Scaling
Your profiler measures memory usage for models and activations.
A transformer model has 125M parameters (500MB at FP32).
During training with batch size 16:
- What's the minimum memory for gradients? _____ MB
- With Adam optimizer, what's the total memory requirement? _____ MB
### Question 3: Performance Bottlenecks
You built tools to identify compute vs memory bottlenecks.
A model achieves 10 GFLOP/s on hardware with 100 GFLOP/s peak:
- What's the computational efficiency? _____%
- If doubling batch size doesn't improve GFLOP/s, the bottleneck is likely _____
### Question 4: Profiling Trade-offs
Your profiler adds measurement overhead to understand performance.
If profiling adds 5× overhead but reveals a 50% speedup opportunity:
- Is the profiling cost justified for development? _____
- When should you disable profiling in production? _____
"""
# %% [markdown]
"""
## 🎯 MODULE SUMMARY: Profiling
Congratulations! You've built a comprehensive profiling system for ML performance analysis!
### Key Accomplishments
- Built complete Profiler class with parameter, FLOP, memory, and latency measurement
- Implemented advanced profiling functions for forward and backward pass analysis
- Discovered performance characteristics through scaling and efficiency analysis
- Created production-quality measurement tools for optimization guidance
- All tests pass ✅ (validated by `test_module()`)
### Systems Insights Gained
- **FLOPs vs Reality**: Theoretical operations don't always predict actual performance
- **Memory Bottlenecks**: Many ML operations are limited by memory bandwidth, not compute
- **Batch Size Effects**: Larger batches improve throughput but increase memory requirements
- **Profiling Overhead**: Measurement tools have costs but enable data-driven optimization
### Production Skills Developed
- **Performance Detective Work**: Use data, not guesses, to identify bottlenecks
- **Optimization Prioritization**: Focus efforts on actual bottlenecks, not assumptions
- **Resource Planning**: Predict memory and compute requirements for deployment
- **Statistical Rigor**: Handle measurement variance with proper methodology
### Ready for Next Steps
Your profiling implementation gives the optimization modules (15-18) the measurements they need to make data-driven decisions.
Export with: `tito module complete 14`
**Next**: Module 15 (Memoization) will use profiling to discover transformer bottlenecks and fix them!
"""