| title | description | difficulty | time_estimate | prerequisites | next_steps | learning_objectives |
|---|---|---|---|---|---|---|
| Profiling - Performance Analysis and Optimization | Build profilers to identify bottlenecks and guide optimization decisions | 3 | 5-6 hours | | | |
14. Profiling
⚡ OPTIMIZATION TIER | Difficulty: ⭐⭐⭐ (3/4) | Time: 5-6 hours
Overview
Build comprehensive profiling tools to measure where time and memory go in your ML systems. This module implements timing profilers, memory trackers, and FLOP counters that reveal bottlenecks and guide optimization decisions.
Learning Objectives
By completing this module, you will be able to:
- Implement timing profilers with statistical rigor (multiple runs, confidence intervals) for accurate measurements
- Design memory profilers to track allocation patterns, peak usage, and identify memory leaks
- Build FLOP counters to measure theoretical computational complexity of different operations
- Understand performance bottlenecks by comparing MLPs, CNNs, and Transformers systematically
- Apply data-driven analysis to prioritize optimization efforts based on actual impact
Why This Matters
Production Context
Profiling is mandatory for production ML systems:
- Google TPU teams profile every operation to optimize hardware utilization
- OpenAI profiles GPT training to identify millions of dollars in compute savings
- Meta profiles inference to serve billions of requests per day efficiently
- NVIDIA uses profiling to optimize cuDNN kernels for peak performance
Historical Context
Profiling evolved with ML scale:
- Early ML (pre-2012): Ad-hoc timing with `time.time()`; no systematic profiling
- Deep Learning Era (2012-2017): NVIDIA profiler, TensorBoard timing; focus on GPU utilization
- Production Scale (2018+): Comprehensive profiling (compute, memory, I/O, network); optimization critical for economics
- Modern Systems (2020+): Automated profiling and optimization; ML compilers use profiling data
Without profiling, you're optimizing blind—profiling shows you where to focus.
Pedagogical Pattern: Build → Use → Optimize
1. Build
Implement from first principles:
- High-precision timing with multiple runs
- Statistical analysis (mean, std, confidence intervals)
- Memory profiler tracking allocations and deallocations
- FLOP counter for theoretical complexity
- Comparative profiler across architectures
2. Use
Apply to real problems:
- Profile attention vs feedforward in transformers
- Compare MLP vs CNN vs Transformer efficiency
- Identify memory bottlenecks in training loops
- Measure impact of batch size on throughput
- Analyze scaling behavior with model size
3. Optimize
Production insights:
- Prioritize optimizations by impact (80/20 rule); see the Amdahl's law sketch after this list
- Measure before/after optimization
- Understand hardware utilization (CPU vs GPU)
- Identify memory bandwidth vs compute bottlenecks
- Build optimization roadmap based on data
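The 80/20 bullet above is essentially Amdahl's law: the end-to-end speedup from optimizing one component is capped by the fraction of total runtime that component accounts for. A minimal sketch (the fractions and speedups below are illustrative, not measured):

```python
def overall_speedup(fraction, component_speedup):
    """Amdahl's law: whole-program speedup when a component that takes
    `fraction` of total runtime is accelerated by `component_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)

# Illustrative: attention accounts for 70% of runtime and we make it 2x faster.
print(f"{overall_speedup(0.70, 2.0):.2f}x")   # ~1.54x end-to-end, not 2x
# Even an infinite speedup of that one component caps out near 1 / (1 - 0.70) ≈ 3.33x.
print(f"{overall_speedup(0.70, 1e9):.2f}x")
```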
Implementation Guide
Core Components
High-Precision Timer
```python
import time
import numpy as np


class Timer:
    """High-precision timing with statistical analysis.

    Each `with timer:` block times one run. Run the block
    `num_runs + warmup_runs` times; the first `warmup_runs`
    measurements are discarded so cold caches and other startup
    effects do not skew the statistics. Reports mean, std, and
    confidence intervals.

    Example:
        timer = Timer()
        for _ in range(timer.num_runs + timer.warmup_runs):
            with timer:
                model.forward(x)
        print(f"Time: {timer.mean:.3f}ms ± {timer.std:.3f}ms")
    """

    def __init__(self, num_runs=10, warmup_runs=3):
        self.num_runs = num_runs
        self.warmup_runs = warmup_runs
        self.times = []
        self._completed_runs = 0

    def __enter__(self):
        self.start_time = time.perf_counter()
        return self

    def __exit__(self, *args):
        elapsed = time.perf_counter() - self.start_time
        self._completed_runs += 1
        # Warmup runs are timed but not recorded
        if self._completed_runs > self.warmup_runs:
            self.times.append(elapsed * 1000)  # convert to ms

    @property
    def mean(self):
        return np.mean(self.times)

    @property
    def std(self):
        return np.std(self.times)

    def confidence_interval(self, confidence=0.95):
        """Confidence interval for the mean using the t-distribution."""
        from scipy import stats
        return stats.t.interval(confidence, len(self.times) - 1,
                                loc=self.mean, scale=stats.sem(self.times))

    def report(self):
        ci = self.confidence_interval()
        return (f"{self.mean:.3f}ms ± {self.std:.3f}ms "
                f"(95% CI: [{ci[0]:.3f}, {ci[1]:.3f}])")
```
Memory Profiler
```python
import time


class MemoryProfiler:
    """Track memory allocations and peak usage.

    Monitors memory throughout execution to identify:
    - Peak memory usage
    - Memory leaks
    - Allocation patterns
    - Memory bandwidth bottlenecks
    """

    def __init__(self):
        self.snapshots = []
        self.peak_memory = 0

    def snapshot(self, label=""):
        """Take a memory snapshot at the current point."""
        import psutil
        process = psutil.Process()
        mem_info = process.memory_info()
        snapshot = {
            'label': label,
            'rss': mem_info.rss / 1024**2,  # resident set size, MB
            'vms': mem_info.vms / 1024**2,  # virtual memory size, MB
            'timestamp': time.time()
        }
        self.snapshots.append(snapshot)
        self.peak_memory = max(self.peak_memory, snapshot['rss'])
        return snapshot

    def report(self):
        """Generate a memory usage report."""
        print(f"Peak Memory: {self.peak_memory:.2f} MB")
        print("\nMemory Timeline:")
        for snap in self.snapshots:
            print(f"  {snap['label']:30s}: {snap['rss']:8.2f} MB")

        # Memory growth between first and last snapshot
        if len(self.snapshots) >= 2:
            growth = self.snapshots[-1]['rss'] - self.snapshots[0]['rss']
            print(f"\nTotal Growth: {growth:+.2f} MB")
            # Arbitrary threshold; tune for your workload
            if growth > 100:
                print("⚠️ Potential memory leak detected!")
```
FLOP Counter
```python
class FLOPCounter:
    """Count floating-point operations for complexity analysis.

    Provides theoretical computational complexity independent of hardware.
    Useful for comparing different architectural choices.
    """

    def __init__(self):
        self.total_flops = 0
        self.op_counts = {}

    def count_matmul(self, A_shape, B_shape):
        """Count FLOPs for matrix multiplication.

        C = A @ B where A is (m, k) and B is (k, n).
        FLOPs = 2*m*k*n (one multiply and one add per inner-product term).
        """
        m, k = A_shape
        k2, n = B_shape
        assert k == k2, "Invalid matmul dimensions"
        flops = 2 * m * k * n
        self.total_flops += flops
        self.op_counts['matmul'] = self.op_counts.get('matmul', 0) + flops
        return flops

    def count_attention(self, batch, seq_len, d_model, num_heads):
        """Count FLOPs for multi-head attention.

        Components:
        - Q, K, V projections: 3 * (batch * seq_len * d_model * d_model) matmuls
        - Attention scores:    batch * heads * seq_len * seq_len * d_k
        - Attention weighting: batch * heads * seq_len * seq_len * d_v
        - Output projection:   batch * seq_len * d_model * d_model matmul
        """
        d_k = d_model // num_heads

        # Q, K, V projections (counted via count_matmul, so they also show up
        # under the 'matmul' key in the breakdown)
        qkv_flops = sum(
            self.count_matmul((batch * seq_len, d_model), (d_model, d_model))
            for _ in range(3)
        )

        # Attention score and weighting computation
        scores_flops = batch * num_heads * seq_len * seq_len * d_k * 2
        weights_flops = batch * num_heads * seq_len * seq_len * d_k * 2
        attention_flops = scores_flops + weights_flops
        self.total_flops += attention_flops

        # Output projection
        output_flops = self.count_matmul((batch * seq_len, d_model), (d_model, d_model))

        total = qkv_flops + attention_flops + output_flops
        self.op_counts['attention'] = self.op_counts.get('attention', 0) + total
        return total

    def report(self):
        """Generate FLOP report with breakdown."""
        print(f"Total FLOPs: {self.total_flops / 1e9:.2f} GFLOPs")
        print("\nBreakdown by operation:")
        for op, flops in sorted(self.op_counts.items(), key=lambda x: x[1], reverse=True):
            percentage = (flops / self.total_flops) * 100
            print(f"  {op:20s}: {flops/1e9:8.2f} GFLOPs ({percentage:5.1f}%)")
```
Architecture Profiler - Comparative Analysis
```python
class ArchitectureProfiler:
    """Compare performance across different architectures.

    Profiles MLP, CNN, and Transformer on the same task to understand
    compute/memory trade-offs.
    """

    def __init__(self):
        self.results = {}

    def profile_model(self, model, input_data, model_name):
        """Profile a model comprehensively."""
        result = {
            'model_name': model_name,
            'parameters': count_parameters(model),  # helper that sums parameter sizes
            'timing': {},
            'memory': {},
            'flops': {}
        }

        # Timing profile (warmup runs are discarded by Timer)
        timer = Timer(num_runs=10)
        for _ in range(timer.num_runs + timer.warmup_runs):
            with timer:
                output = model.forward(input_data)
        result['timing']['forward'] = timer.mean

        # Memory profile
        mem = MemoryProfiler()
        mem.snapshot("Before forward")
        output = model.forward(input_data)
        mem.snapshot("After forward")
        result['memory']['peak'] = mem.peak_memory

        # FLOP count: walk the model and count FLOPs per layer (your implementation)
        flop_counter = FLOPCounter()
        result['flops']['total'] = flop_counter.total_flops

        self.results[model_name] = result
        return result

    def compare(self):
        """Generate comparative report."""
        print("\nArchitecture Comparison")
        print("=" * 80)
        for name, result in self.results.items():
            print(f"\n{name}:")
            print(f"  Parameters:   {result['parameters']/1e6:.2f}M")
            print(f"  Forward time: {result['timing']['forward']:.3f}ms")
            print(f"  Peak memory:  {result['memory']['peak']:.2f}MB")
            print(f"  FLOPs:        {result['flops']['total']/1e9:.2f}GFLOPs")
```
Step-by-Step Implementation
1. Build High-Precision Timer
   - Use `time.perf_counter()` for high-resolution timing
   - Implement multiple runs with warmup
   - Calculate mean, std, confidence intervals
   - Test with known delays
2. Implement Memory Profiler
   - Track memory at key points (before/after operations)
   - Calculate peak memory usage
   - Identify memory growth patterns
   - Detect potential leaks
3. Create FLOP Counter
   - Count operations for matmul, convolution, attention
   - Build hierarchical counting (operation → layer → model)
   - Compare theoretical vs actual performance
   - Identify compute-bound vs memory-bound operations (see the arithmetic-intensity sketch after this list)
4. Build Architecture Profiler
   - Profile MLP on MNIST/CIFAR
   - Profile CNN on CIFAR
   - Profile Transformer on text
   - Generate comparative reports
5. Analyze Results
   - Identify bottleneck operations (Pareto principle)
   - Compare efficiency across architectures
   - Understand scaling behavior
   - Prioritize optimization opportunities
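The arithmetic-intensity sketch referenced in step 3: compare an operation's FLOPs per byte of memory traffic against the hardware's FLOP-to-bandwidth ratio (the "ridge point" in roofline terminology). The hardware numbers here are illustrative, not measured:

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs per byte of memory traffic."""
    return flops / bytes_moved

# Illustrative hardware: 100 TFLOP/s of compute, 900 GB/s of memory bandwidth.
peak_flops = 100e12
peak_bandwidth = 900e9
ridge_point = peak_flops / peak_bandwidth  # ~111 FLOPs/byte

# Example: square matmul C = A @ B with n = 4096 in FP32 (4 bytes per element).
n = 4096
flops = 2 * n**3
bytes_moved = 3 * n * n * 4  # read A, read B, write C (ignoring cache reuse)
intensity = arithmetic_intensity(flops, bytes_moved)

print(f"Intensity: {intensity:.0f} FLOPs/byte, ridge point: {ridge_point:.0f}")
print("Compute-bound" if intensity > ridge_point else "Memory-bound")
```

An element-wise op like ReLU, by contrast, does roughly 1 FLOP per 8 bytes moved (one read plus one write), far below the ridge point, so it is firmly memory-bound.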
Testing
Inline Tests
Run inline tests while building:
cd modules/14_profiling
python profiling_dev.py
Expected output:
Unit Test: Timer with statistical analysis...
✅ Multiple runs produce consistent results
✅ Confidence intervals computed correctly
✅ Warmup runs excluded from statistics
Progress: Timing Profiler ✓
Unit Test: Memory profiler...
✅ Snapshots capture memory correctly
✅ Peak memory tracked accurately
✅ Memory growth detected
Progress: Memory Profiler ✓
Unit Test: FLOP counter...
✅ Matmul FLOPs: 2*m*k*n verified
✅ Attention FLOPs match theoretical
✅ Operation breakdown correct
Progress: FLOP Counter ✓
Export and Validate
tito export 14_profiling
tito test 14_profiling
Where This Code Lives
tinytorch/
├── profiler/
│ └── profiling.py # Your implementation goes here
└── __init__.py # Exposes Timer, MemoryProfiler, etc.
Usage:
```python
>>> from tinytorch.profiler import Timer, MemoryProfiler, FLOPCounter
>>> timer = Timer()
>>> for _ in range(timer.num_runs + timer.warmup_runs):
...     with timer:
...         model.forward(x)
>>> print(timer.report())
```
Systems Thinking Questions
1. Amdahl's Law: If attention is 70% of compute and you optimize it 2×, what's the overall speedup? Why can't you get a 2× end-to-end speedup?
2. Memory vs Compute Bottlenecks: Your GPU can do 100 TFLOP/s but memory bandwidth is 900 GB/s. For FP32 operations needing 4 bytes per FLOP, what's the bottleneck? When?
3. Batch Size Impact: Doubling batch size doesn't double throughput. Why? What's the relationship between batch size, memory, and throughput?
4. Profiling Overhead: Your profiler adds 5% overhead. Is this acceptable? When would you use sampling profilers vs instrumentation profilers?
5. Hardware Differences: Your code runs 10× slower on CPU than GPU for large matrices, but only 2× slower for small ones. Why? What's the crossover point?
Real-World Connections
Industry Applications
Google TPU Optimization
- Profile every kernel to maximize TPU utilization
- Optimize for both FLOPs and memory bandwidth
- Use profiling to guide hardware design decisions
- Achieve 40-50% utilization (very high for accelerators)
OpenAI Training Optimization
- Profile GPT training to find millions of dollars in savings
- Identify gradient checkpointing opportunities
- Optimize data loading pipelines
- Achieve 50%+ MFU (model FLOPs utilization)
Meta Inference Serving
- Profile PyTorch models for production deployment
- Identify operator fusion opportunities
- Optimize for latency (p50, p99) not just throughput
- Serve billions of requests per day efficiently
Research Impact
This module implements patterns from:
- TensorBoard Profiler (Google, 2019): Visual profiling for TensorFlow
- PyTorch Profiler (Meta, 2020): Comprehensive profiling for PyTorch
- NVIDIA Nsight (2021): GPU-specific profiling and optimization
- MLPerf (2022): Standardized benchmarking and profiling
What's Next?
In Module 15: Quantization, you'll use your profiling data to compress models:
- Reduce precision from FP32 to INT8 for 4× memory savings
- Implement calibration strategies to minimize accuracy loss
- Measure memory and speed improvements
- Apply quantization based on profiling insights
Profiling shows you what to optimize—the next modules show you how to optimize it!
Ready to become a performance detective? Open modules/14_profiling/profiling_dev.py and start implementing.