mirror of https://github.com/MLSysBook/TinyTorch.git synced 2026-04-29 01:29:21 -05:00

Vijay Janapa Reddi a5679de141 Update documentation after module reordering

All module references updated to reflect new ordering:
- Module 15: Quantization (was 16)
- Module 16: Compression (was 17)
- Module 17: Memoization (was 15)

Updated by module-developer and website-manager agents:
- Module ABOUT files with correct numbers and prerequisites
- Cross-references and "What's Next" chains
- Website navigation (_toc.yml) and content
- Learning path progression in LEARNING_PATH.md
- Profile milestone completion message (Module 17)

Pedagogical flow now: Profile → Quantize → Prune → Cache → Accelerate

2025-11-10 19:37:41 -05:00

title: Profiling - Performance Analysis and Optimization
description: Build profilers to identify bottlenecks and guide optimization decisions
time_estimate: 5-6 hours
prerequisites: All modules 01-13
next_steps: Memoization
learning_objectives:
- Implement timing profilers with statistical rigor for accurate measurements
- Design memory profilers to track allocation patterns and identify leaks
- Build FLOP counters to measure computational complexity
- Understand performance bottlenecks across different architectures
- Apply data-driven analysis to guide optimization priorities

14. Profiling

⚡ OPTIMIZATION TIER | Difficulty: ⭐⭐⭐ (3/4) | Time: 5-6 hours

Overview

Build comprehensive profiling tools to measure where time and memory go in your ML systems. This module implements timing profilers, memory trackers, and FLOP counters that reveal bottlenecks and guide optimization decisions.

Learning Objectives

By completing this module, you will be able to:

Implement timing profilers with statistical rigor (multiple runs, confidence intervals) for accurate measurements
Design memory profilers to track allocation patterns, peak usage, and identify memory leaks
Build FLOP counters to measure theoretical computational complexity of different operations
Understand performance bottlenecks by comparing MLPs, CNNs, and Transformers systematically
Apply data-driven analysis to prioritize optimization efforts based on actual impact

Why This Matters

Production Context

Profiling is mandatory for production ML systems:

Google TPU teams profile every operation to optimize hardware utilization
OpenAI profiles GPT training to identify millions of dollars in compute savings
Meta profiles inference to serve billions of requests per day efficiently
NVIDIA uses profiling to optimize cuDNN kernels for peak performance

Historical Context

Profiling evolved with ML scale:

Early ML (pre-2012): Ad-hoc timing with time.time(); no systematic profiling
Deep Learning Era (2012-2017): NVIDIA profiler, TensorBoard timing; focus on GPU utilization
Production Scale (2018+): Comprehensive profiling (compute, memory, I/O, network); optimization critical for economics
Modern Systems (2020+): Automated profiling and optimization; ML compilers use profiling data

Without profiling, you're optimizing blind—profiling shows you where to focus.

Pedagogical Pattern: Build → Use → Optimize

1. Build

Implement from first principles:

High-precision timing with multiple runs
Statistical analysis (mean, std, confidence intervals)
Memory profiler tracking allocations and deallocations
FLOP counter for theoretical complexity
Comparative profiler across architectures

2. Use

Apply to real problems:

Profile attention vs feedforward in transformers
Compare MLP vs CNN vs Transformer efficiency
Identify memory bottlenecks in training loops
Measure impact of batch size on throughput
Analyze scaling behavior with model size
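
One of the items above, the impact of batch size on throughput, can be sketched without any model at all. The snippet below is a minimal illustration, not the module's implementation: `measure_throughput` and `toy_step` are hypothetical names, and `toy_step` is a pure-Python stand-in for a forward pass with a fixed per-call cost plus per-sample work.

```python
import time


def measure_throughput(step, batch_sizes, repeats=5):
    """Return {batch_size: samples_per_second}, using the fastest of `repeats` runs."""
    results = {}
    for batch_size in batch_sizes:
        step(batch_size)                      # warmup run, not recorded
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            step(batch_size)
            best = min(best, time.perf_counter() - start)
        results[batch_size] = batch_size / max(best, 1e-9)  # guard against a zero reading
    return results


def toy_step(batch_size, features=256):
    """Hypothetical stand-in for a forward pass: fixed setup plus per-sample work."""
    weights = [0.5] * features                # fixed cost, independent of batch size
    total = 0.0
    for _ in range(batch_size):
        total += sum(w * 1.0 for w in weights)
    return total


for batch_size, rate in measure_throughput(toy_step, (1, 4, 16, 64)).items():
    print(f"batch {batch_size:3d}: {rate:12.0f} samples/s")
```

Replacing `toy_step` with a real forward pass turns this into the batch-size sweep described above; the fixed per-call cost is what usually keeps throughput from scaling linearly at small batch sizes.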

3. Optimize

Production insights:

Prioritize optimizations by impact (80/20 rule)
Measure before/after optimization
Understand hardware utilization (CPU vs GPU)
Identify memory bandwidth vs compute bottlenecks
Build optimization roadmap based on data
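
Prioritizing by impact can be made concrete by ranking operations by their share of total measured time. The following is a minimal sketch under stated assumptions: `rank_hotspots` is a hypothetical helper, and the timings in `example` are made-up numbers used only for illustration.

```python
def rank_hotspots(timings_ms):
    """Sort operations by measured time and attach share and cumulative share.

    timings_ms: dict mapping operation name -> total time in milliseconds.
    Returns (name, ms, share, cumulative_share) tuples, slowest first.
    """
    total = sum(timings_ms.values())
    if total <= 0:
        return []
    ranked = sorted(timings_ms.items(), key=lambda kv: kv[1], reverse=True)
    rows, running = [], 0.0
    for name, ms in ranked:
        running += ms
        rows.append((name, ms, ms / total, running / total))
    return rows


# Hypothetical measurements for one forward pass (illustrative values only).
example = {"attention": 42.0, "feedforward": 30.0, "layernorm": 5.0, "embedding": 3.0}
for name, ms, share, cumulative in rank_hotspots(example):
    print(f"{name:12s} {ms:6.1f} ms  {share:6.1%}  cumulative {cumulative:6.1%}")
```

With these illustrative numbers, the top two operations account for 90% of the time, which is the kind of evidence an optimization roadmap should start from.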

Implementation Guide

Core Components

High-Precision Timer

import time

import numpy as np
from scipy import stats


class Timer:
    """High-precision timing with statistical analysis.

    Each `with timer:` block records one run. Run the block
    num_runs + warmup_runs times; the first warmup_runs recordings are
    excluded from the statistics so cold caches do not skew the results.
    Reports mean, std, and confidence intervals.

    Example:
        timer = Timer(num_runs=10, warmup_runs=3)
        for _ in range(timer.num_runs + timer.warmup_runs):
            with timer:
                model.forward(x)
        print(f"Time: {timer.mean:.3f}ms ± {timer.std:.3f}ms")
    """
    def __init__(self, num_runs=10, warmup_runs=3):
        self.num_runs = num_runs
        self.warmup_runs = warmup_runs
        self.times = []  # every recorded run, in ms (warmup included)

    def __enter__(self):
        self.start_time = time.perf_counter()
        return self

    def __exit__(self, *args):
        elapsed = time.perf_counter() - self.start_time
        self.times.append(elapsed * 1000)  # Convert to ms

    @property
    def measured(self):
        """Recorded times with warmup runs dropped (all runs if too few exist)."""
        if len(self.times) > self.warmup_runs:
            return self.times[self.warmup_runs:]
        return self.times

    @property
    def mean(self):
        return np.mean(self.measured)

    @property
    def std(self):
        return np.std(self.measured)

    def confidence_interval(self, confidence=0.95):
        """Confidence interval using the t-distribution (needs >= 2 measured runs)."""
        samples = self.measured
        return stats.t.interval(confidence, len(samples) - 1,
                                loc=np.mean(samples), scale=stats.sem(samples))

    def report(self):
        ci = self.confidence_interval()
        return f"{self.mean:.3f}ms ± {self.std:.3f}ms (95% CI: [{ci[0]:.3f}, {ci[1]:.3f}])"
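
A quick way to sanity-check any timer is to measure a known delay. The sketch below is standalone and uses only the standard library, so it is not the `Timer` class itself: `time_callable` is a hypothetical function, and its interval uses a normal approximation (z ≈ 1.96) instead of the t-distribution, which makes it slightly too narrow for small run counts.

```python
import statistics
import time


def time_callable(fn, num_runs=10, warmup_runs=3):
    """Time fn() and return (mean_ms, std_ms, (ci_low, ci_high)).

    The 95% interval uses a normal approximation, not the t-distribution.
    """
    if num_runs < 2:
        raise ValueError("need at least 2 timed runs to estimate spread")
    for _ in range(warmup_runs):              # warmup: executed but not recorded
        fn()
    samples = []
    for _ in range(num_runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)  # ms
    mean = statistics.mean(samples)
    std = statistics.stdev(samples)           # sample standard deviation (n - 1)
    z = statistics.NormalDist().inv_cdf(0.975)
    half_width = z * std / len(samples) ** 0.5
    return mean, std, (mean - half_width, mean + half_width)


# Known delay: a 20 ms sleep should measure roughly 20 ms or a little more.
mean_ms, std_ms, (ci_low, ci_high) = time_callable(
    lambda: time.sleep(0.02), num_runs=5, warmup_runs=1
)
print(f"{mean_ms:.2f} ms ± {std_ms:.2f} ms (95% CI: [{ci_low:.2f}, {ci_high:.2f}])")
```

If the reported mean falls well below the known delay, or the interval is implausibly wide, the measurement loop is the first thing to suspect.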

Memory Profiler

import time

import psutil


class MemoryProfiler:
    """Track process memory at labeled points and report peak usage.

    Takes resident set size (RSS) snapshots throughout execution to identify:
    - Peak memory usage (highest snapshot; spikes between snapshots are missed)
    - Memory leaks
    - Allocation patterns
    """
    def __init__(self):
        self.snapshots = []
        self.peak_memory = 0

    def snapshot(self, label=""):
        """Take memory snapshot at current point."""
        process = psutil.Process()
        mem_info = process.memory_info()

        snapshot = {
            'label': label,
            'rss': mem_info.rss / 1024**2,  # MB
            'vms': mem_info.vms / 1024**2,  # MB
            'timestamp': time.time()
        }
        self.snapshots.append(snapshot)
        self.peak_memory = max(self.peak_memory, snapshot['rss'])

        return snapshot
    
    def report(self):
        """Generate memory usage report."""
        print(f"Peak Memory: {self.peak_memory:.2f} MB")
        print("\nMemory Timeline:")
        for snap in self.snapshots:
            print(f"  {snap['label']:30s}: {snap['rss']:8.2f} MB")
        
        # Calculate memory growth
        if len(self.snapshots) >= 2:
            growth = self.snapshots[-1]['rss'] - self.snapshots[0]['rss']
            print(f"\nTotal Growth: {growth:+.2f} MB")
            
            # Check for potential memory leak
            if growth > 100:  # Arbitrary threshold
                print("⚠️  Potential memory leak detected!")

FLOP Counter

class FLOPCounter:
    """Count floating-point operations for complexity analysis.
    
    Provides theoretical computational complexity independent of hardware.
    Useful for comparing different architectural choices.
    """
    def __init__(self):
        self.total_flops = 0
        self.op_counts = {}
    
    def count_matmul(self, A_shape, B_shape):
        """Count FLOPs for matrix multiplication.
        
        C = A @ B where A is (m, k) and B is (k, n)
        FLOPs = 2*m*k*n (multiply-add for each output element)
        """
        m, k = A_shape
        k2, n = B_shape
        assert k == k2, "Invalid matmul dimensions"
        
        flops = 2 * m * k * n
        self.total_flops += flops
        self.op_counts['matmul'] = self.op_counts.get('matmul', 0) + flops
        return flops
    
    def count_attention(self, batch, seq_len, d_model, num_heads):
        """Count FLOPs for multi-head attention (softmax and scaling ignored).

        Components (the factor 2 counts one multiply and one add):
        - Q,K,V projections: 3 * 2 * (batch * seq_len) * d_model * d_model
        - Attention scores (Q @ K^T): 2 * batch * heads * seq_len * seq_len * d_k
        - Attention weighting (scores @ V): 2 * batch * heads * seq_len * seq_len * d_k
        - Output projection: 2 * (batch * seq_len) * d_model * d_model
        """
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        d_k = d_model // num_heads
        tokens = batch * seq_len

        # Projections are computed directly rather than through count_matmul,
        # so the same FLOPs are not recorded under both 'matmul' and 'attention'.
        projection_flops = 2 * tokens * d_model * d_model
        qkv_flops = 3 * projection_flops

        # Attention computation
        scores_flops = 2 * batch * num_heads * seq_len * seq_len * d_k
        weights_flops = 2 * batch * num_heads * seq_len * seq_len * d_k
        attention_flops = scores_flops + weights_flops

        # Output projection
        output_flops = projection_flops

        total = qkv_flops + attention_flops + output_flops
        self.total_flops += total
        self.op_counts['attention'] = self.op_counts.get('attention', 0) + total
        return total

    def report(self):
        """Generate FLOP report with breakdown."""
        print(f"Total FLOPs: {self.total_flops / 1e9:.2f} GFLOPs")
        if self.total_flops == 0:
            return  # nothing counted yet; avoid dividing by zero below
        print("\nBreakdown by operation:")
        for op, flops in sorted(self.op_counts.items(), key=lambda x: x[1], reverse=True):
            percentage = (flops / self.total_flops) * 100
            print(f"  {op:20s}: {flops/1e9:8.2f} GFLOPs ({percentage:5.1f}%)")
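
The attention formulas above imply that the sequence-mixing terms grow quadratically in sequence length while the projections grow linearly. The standalone sketch below works through that arithmetic; `matmul_flops` and `attention_flops` are hypothetical free functions that mirror the counting rules, not part of the class, and the configuration (d_model=512, 8 heads) is an arbitrary example.

```python
def matmul_flops(m, k, n):
    """FLOPs for (m, k) @ (k, n): one multiply and one add per inner-product term."""
    return 2 * m * k * n


def attention_flops(batch, seq_len, d_model, num_heads):
    """Split multi-head attention FLOPs into projection and sequence-mixing parts."""
    d_k = d_model // num_heads
    tokens = batch * seq_len
    projections = 4 * matmul_flops(tokens, d_model, d_model)            # Q, K, V, output
    scores = batch * num_heads * matmul_flops(seq_len, d_k, seq_len)    # Q @ K^T
    weighting = batch * num_heads * matmul_flops(seq_len, seq_len, d_k) # scores @ V
    return {"projections": projections, "seq_mixing": scores + weighting}


for seq_len in (128, 512, 2048):
    parts = attention_flops(batch=1, seq_len=seq_len, d_model=512, num_heads=8)
    total = sum(parts.values())
    share = parts["seq_mixing"] / total
    print(f"seq_len={seq_len:5d}: {total / 1e9:6.2f} GFLOPs, seq_mixing share {share:5.1%}")
```

Under these counting rules the two parts are equal when seq_len = 2 * d_model (1024 in this example); below that the projections dominate, above it the quadratic terms do.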

Architecture Profiler - Comparative Analysis

class ArchitectureProfiler:
    """Compare performance across different architectures.
    
    Profiles MLP, CNN, and Transformer on same task to understand
    compute/memory trade-offs.
    """
    def __init__(self):
        self.results = {}
    
    def profile_model(self, model, input_data, model_name):
        """Profile a model comprehensively."""
        result = {
            'model_name': model_name,
            'parameters': count_parameters(model),  # helper assumed to be defined elsewhere
            'timing': {},
            'memory': {},
            'flops': {}
        }
        
        # Timing profile
        timer = Timer(num_runs=10)
        for _ in range(timer.num_runs + timer.warmup_runs):
            with timer:
                output = model.forward(input_data)
        result['timing']['forward'] = timer.mean
        
        # Memory profile
        mem = MemoryProfiler()
        mem.snapshot("Before forward")
        output = model.forward(input_data)
        mem.snapshot("After forward")
        result['memory']['peak'] = mem.peak_memory
        
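        # FLOP count
        flop_counter = FLOPCounter()
        # Placeholder: call flop_counter.count_matmul / count_attention for each
        # layer of this architecture; until then the reported total stays 0.
        result['flops']['total'] = flop_counter.total_flops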
        
        self.results[model_name] = result
        return result
    
    def compare(self):
        """Generate comparative report."""
        print("\nArchitecture Comparison")
        print("=" * 80)
        
        for name, result in self.results.items():
            print(f"\n{name}:")
            print(f"  Parameters: {result['parameters']/1e6:.2f}M")
            print(f"  Forward time: {result['timing']['forward']:.3f}ms")
            print(f"  Peak memory: {result['memory']['peak']:.2f}MB")
            print(f"  FLOPs: {result['flops']['total']/1e9:.2f}GFLOPs")

Step-by-Step Implementation

1. Build High-Precision Timer
- Use time.perf_counter() for the highest-resolution clock available (time.perf_counter_ns() returns integer nanoseconds)
- Implement multiple runs with warmup
- Calculate mean, std, confidence intervals
- Test with known delays

2. Implement Memory Profiler
- Track memory at key points (before/after operations)
- Calculate peak memory usage
- Identify memory growth patterns
- Detect potential leaks

3. Create FLOP Counter
- Count operations for matmul, convolution, attention
- Build hierarchical counting (operation → layer → model)
- Compare theoretical vs actual performance
- Identify compute-bound vs memory-bound operations

4. Build Architecture Profiler
- Profile MLP on MNIST/CIFAR
- Profile CNN on CIFAR
- Profile Transformer on text
- Generate comparative reports

5. Analyze Results
- Identify bottleneck operations (Pareto principle)
- Compare efficiency across architectures
- Understand scaling behavior
- Prioritize optimization opportunities
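
The FLOP counter step lists convolution and the compute-bound vs memory-bound distinction, neither of which has code in this section. The sketch below shows one common way to count them; `conv2d_flops` and `matmul_intensity` are hypothetical helpers, the convolution is assumed dense with a square kernel and no bias, and the intensity estimate assumes each matrix moves through memory exactly once.

```python
def conv2d_flops(batch, c_in, c_out, kernel, h_out, w_out):
    """FLOPs for a dense 2D convolution (square kernel, no bias)."""
    macs_per_output = c_in * kernel * kernel      # multiply-adds per output element
    outputs = batch * c_out * h_out * w_out       # number of output elements
    return 2 * macs_per_output * outputs


def matmul_intensity(m, k, n, bytes_per_element=4):
    """FLOPs per byte for C = A @ B, assuming A, B and C each move once."""
    flops = 2 * m * k * n
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


# 3x3 convolution, 3 -> 16 channels, 32x32 output: 2 * 27 * 16384 FLOPs
print(conv2d_flops(batch=1, c_in=3, c_out=16, kernel=3, h_out=32, w_out=32))

# Matrix-vector products do little work per byte; square matmuls do much more.
print(f"matrix-vector: {matmul_intensity(1, 1024, 1024):.2f} FLOPs/byte")
print(f"square matmul: {matmul_intensity(1024, 1024, 1024):.2f} FLOPs/byte")
```

Low FLOPs-per-byte operations tend to be limited by memory bandwidth, while high-intensity operations tend to be limited by compute, which is why the same FLOP count can run at very different speeds.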

Testing

Inline Tests

Run inline tests while building:

cd modules/14_profiling
python profiling_dev.py

Expected output:

Unit Test: Timer with statistical analysis...
✅ Multiple runs produce consistent results
✅ Confidence intervals computed correctly
✅ Warmup runs excluded from statistics
Progress: Timing Profiler ✓

Unit Test: Memory profiler...
✅ Snapshots capture memory correctly
✅ Peak memory tracked accurately
✅ Memory growth detected
Progress: Memory Profiler ✓

Unit Test: FLOP counter...
✅ Matmul FLOPs: 2*m*k*n verified
✅ Attention FLOPs match theoretical
✅ Operation breakdown correct
Progress: FLOP Counter ✓

Export and Validate

tito export 14_profiling
tito test 14_profiling

Where This Code Lives

tinytorch/
├── profiler/
│   └── profiling.py            # Your implementation goes here
└── __init__.py                 # Exposes Timer, MemoryProfiler, etc.

Usage:
>>> from tinytorch.profiler import Timer, MemoryProfiler, FLOPCounter
>>> timer = Timer()
>>> for _ in range(timer.num_runs + timer.warmup_runs):
...     with timer:
...         model.forward(x)
>>> print(timer.report())

Systems Thinking Questions

Amdahl's Law: If attention is 70% of compute and you optimize it 2×, what's the overall speedup? Why can't you get 2× end-to-end speedup?
Memory vs Compute Bottlenecks: Your GPU can do 100 TFLOP/s but its memory bandwidth is 900 GB/s. For FP32 operations needing 4 bytes/FLOP, which is the bottleneck? When does that change?
Batch Size Impact: Doubling batch size doesn't double throughput. Why? What's the relationship between batch size, memory, and throughput?
Profiling Overhead: Your profiler adds 5% overhead. Is this acceptable? When would you use sampling profilers vs instrumentation profilers?
Hardware Differences: Your code runs 10× slower on CPU than GPU for large matrices, but only 2× slower for small ones. Why? What's the crossover point?
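
The first two questions can be checked numerically. This is a minimal sketch of the standard Amdahl's Law formula and a simple roofline-style bound, using the numbers from the questions; `amdahl_speedup` and `attainable_flops` are hypothetical helper names.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the runtime is made `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


def attainable_flops(peak_flops, bandwidth_bytes, flops_per_byte):
    """Roofline-style bound: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_flops, bandwidth_bytes * flops_per_byte)


# Attention is 70% of compute and becomes 2x faster: 1 / (0.3 + 0.35)
print(f"overall speedup: {amdahl_speedup(0.7, 2.0):.2f}x")
# Even an infinitely fast attention leaves the other 30% untouched: 1 / 0.3
print(f"upper bound:     {amdahl_speedup(0.7, float('inf')):.2f}x")

peak, bandwidth = 100e12, 900e9        # 100 TFLOP/s, 900 GB/s
ridge = peak / bandwidth               # FLOPs per byte needed to become compute-bound
bound = attainable_flops(peak, bandwidth, flops_per_byte=1 / 4)  # 4 bytes per FLOP
print(f"ridge point: {ridge:.1f} FLOPs/byte")
print(f"attainable at 4 bytes/FLOP: {bound / 1e12:.3f} TFLOP/s")
```

With these numbers the end-to-end speedup is about 1.54×, and an operation that moves 4 bytes per FLOP is limited by memory bandwidth to a small fraction of peak compute until its arithmetic intensity exceeds the ridge point.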

Real-World Connections

Industry Applications

Google TPU Optimization

Profile every kernel to maximize TPU utilization
Optimize for both FLOPs and memory bandwidth
Use profiling to guide hardware design decisions
Achieve 40-50% utilization (very high for accelerators)

OpenAI Training Optimization

Profile GPT training to find millions of dollars in savings
Identify gradient checkpointing opportunities
Optimize data loading pipelines
Achieve 50%+ MFU (model FLOPs utilization)

Meta Inference Serving

Profile PyTorch models for production deployment
Identify operator fusion opportunities
Optimize for latency (p50, p99) not just throughput
Serve billions of requests per day efficiently

Research Impact

This module implements patterns from:

TensorBoard Profiler (Google, 2019): Visual profiling for TensorFlow
PyTorch Profiler (Meta, 2020): Comprehensive profiling for PyTorch
NVIDIA Nsight (2021): GPU-specific profiling and optimization
MLPerf (2018): Standardized benchmarking of ML training and inference

What's Next?

In Module 15: Quantization, you'll use your profiling data to compress models:

Reduce precision from FP32 to INT8 for 4× memory savings
Implement calibration strategies to minimize accuracy loss
Measure memory and speed improvements
Apply quantization based on profiling insights

Profiling shows you what to optimize—the next modules show you how to optimize it!

Ready to become a performance detective? Open modules/14_profiling/profiling_dev.py and start implementing.
