--- title: "Benchmarking - Fair Performance Comparison" description: "Statistical rigor and standardized metrics for optimization validation" difficulty: "⭐⭐⭐" time_estimate: "5-6 hours" prerequisites: ["Profiling", "All optimization techniques"] next_steps: ["Competition (Capstone)"] learning_objectives: - "Understand benchmark design principles including statistical measurement, fair comparison protocols, and reproducible methodology" - "Implement statistical rigor for performance measurement with confidence intervals, variance reporting, and measurement uncertainty" - "Master fair comparison protocols that control for system noise, hardware variability, and environmental factors" - "Build normalized metrics systems including speedup ratios, compression factors, and efficiency scores for hardware-independent comparison" - "Analyze measurement trade-offs including overhead costs, statistical power requirements, and reproducibility constraints" --- # 19. Benchmarking - Fair Performance Comparison **OPTIMIZATION TIER** | Difficulty: ⭐⭐⭐ (3/4) | Time: 5-6 hours ## Overview You'll build a rigorous performance measurement system that enables fair comparison of all your optimizations. This module implements educational benchmarking with statistical testing, normalized metrics, and reproducible protocols. Your benchmarking framework provides the measurement methodology used in Module 20's competition workflow, where you'll apply these tools to validate optimizations systematically. ## Learning Objectives By the end of this module, you will be able to: - **Understand benchmark design principles**: Reproducibility requirements; representative workload selection; measurement methodology; controlling for confounding variables; fair comparison protocols - **Implement statistical rigor**: Multiple runs with warmup periods; confidence interval calculation; variance reporting not just means; understanding measurement uncertainty; detecting outliers - **Master fair comparison protocols**: Hardware normalization strategies; environmental controls (thermal, OS noise); baseline selection criteria; same workload/data/environment enforcement; apples-to-apples measurement - **Build normalized metrics systems**: Speedup ratios (baseline_time / optimized_time); compression factors (original_size / compressed_size); accuracy preservation tracking; efficiency scores combining multiple objectives; hardware-independent reporting - **Analyze measurement trade-offs**: Benchmark coverage vs runtime cost; statistical power vs sample size requirements; reproducibility vs realism; instrumentation overhead (observer effect); when 5% speedup is significant vs noise ## Build → Use → Analyze This module follows TinyTorch's **Build → Use → Analyze** framework: 1. **Build**: Implement benchmarking framework with statistical testing (confidence intervals, t-tests), normalized metrics (speedup, compression, efficiency), warmup protocols, and automated report generation 2. **Use**: Benchmark all your Optimization Tier implementations (profiling, quantization, compression, memoization, acceleration) against baselines on real tasks; compare fairly with statistical rigor 3. **Analyze**: Why do benchmark results vary across runs? How does hardware affect comparison fairness? When is 5% speedup statistically significant vs noise? What makes benchmarks representative vs over-fitted? ## Implementation Guide ### Core Benchmarking Components Your benchmarking framework implements four key systems: #### 1. Statistical Measurement Infrastructure **Why Multiple Runs Matter** Single measurements are meaningless in ML systems. Performance varies 10-30% across runs due to: - **Thermal throttling**: CPU frequency drops when hot - **OS background tasks**: Interrupts, garbage collection, other processes - **Memory state**: Cache coldness, fragmentation, swap pressure - **CPU frequency scaling**: Dynamic frequency adjustment **Statistical Solution** ```python class BenchmarkResult: """Container for measurements with statistical analysis.""" def __init__(self, metric_name: str, values: List[float]): self.mean = statistics.mean(values) self.std = statistics.stdev(values) self.median = statistics.median(values) # 95% confidence interval for the mean t_score = 1.96 # Normal approximation margin = t_score * (self.std / np.sqrt(len(values))) self.ci_lower = self.mean - margin self.ci_upper = self.mean + margin ``` **What This Reveals**: If confidence intervals overlap between baseline and optimized, the difference might be noise. Statistical rigor prevents false claims. #### 2. Warmup and Measurement Protocol **The Warmup Problem** First run: 120ms. Second run: 100ms. Third run: 98ms. What happened? - **Cold cache**: First run pays cache miss penalties - **JIT compilation**: NumPy and frameworks compile code paths on first use - **Memory allocation**: Initial runs establish memory patterns **Warmup Solution** ```python class Benchmark: def __init__(self, warmup_runs=5, measurement_runs=10): self.warmup_runs = warmup_runs self.measurement_runs = measurement_runs def run_latency_benchmark(self, model, input_data): # Warmup: stabilize performance for _ in range(self.warmup_runs): model.forward(input_data) # Measurement: collect statistics latencies = [] for _ in range(self.measurement_runs): start = time.perf_counter() model.forward(input_data) latencies.append(time.perf_counter() - start) return BenchmarkResult("latency_ms", latencies) ``` **Why This Matters**: Warmup runs discard cold-start effects. Measurement runs capture true steady-state performance. #### 3. Normalized Metrics for Fair Comparison **Hardware-Independent Speedup** ```python # Speedup ratio: baseline_time / optimized_time speedup = baseline_result.mean / optimized_result.mean # Example: 100ms / 80ms = 1.25x speedup (25% faster) # Speedup > 1.0 means optimization helped # Speedup < 1.0 means optimization regressed ``` **Compression Ratio** ```python # Model size reduction compression_ratio = original_size_mb / compressed_size_mb # Example: 100MB / 25MB = 4x compression ``` **Efficiency Score (Multi-Objective)** ```python # Combine speed + size + accuracy efficiency = (speedup * compression) / (1 + abs(accuracy_delta)) # Penalizes accuracy loss # Rewards speed AND compression # Single metric for ranking ``` **Why Normalized Metrics**: Speedup ratios work on any hardware. "2x faster" is meaningful whether you have M1 Mac or Intel i9. Absolute times (100ms → 50ms) are hardware-specific. #### 4. Comprehensive Benchmark Suite **Multiple Benchmark Types** Your `BenchmarkSuite` runs: 1. **Latency Benchmark**: How fast is inference? (milliseconds) 2. **Accuracy Benchmark**: How correct are predictions? (0.0-1.0) 3. **Memory Benchmark**: How much RAM is used? (megabytes) 4. **Energy Benchmark**: How efficient is compute? (estimated joules) **Pareto Frontier Analysis** ``` Accuracy ↑ | A ● ← Model A: High accuracy, high latency | | B ● ← Model B: Balanced (Pareto optimal) | | C ●← Model C: Low accuracy, low latency |__________→ Latency (lower is better) ``` Models on the Pareto frontier aren't strictly dominated—each represents a valid optimization trade-off. Your suite automatically identifies these optimal points. ### Real-World Benchmarking Principles Your implementation teaches industry-standard methodology: #### Reproducibility Requirements Every benchmark run documents: ```python system_info = { 'platform': 'macOS-14.2-arm64', # OS version 'processor': 'Apple M1 Max', # CPU type 'python_version': '3.11.6', # Runtime 'memory_gb': 64, # RAM 'cpu_count': 10 # Cores } ``` **Why**: Colleague should reproduce your results given same environment. Missing details make verification impossible. #### Fair Comparison Protocol **Don't Compare**: - GPU-optimized code vs CPU baseline (unfair hardware) - Quantized INT8 vs FP32 baseline (unfair precision) - Batch size 32 vs batch size 1 (unfair workload) - Cold start vs warmed up (unfair cache state) **Do Compare**: - Same hardware, same workload, same environment - Baseline vs optimized on identical conditions - Report speedup with confidence intervals - Test statistical significance (t-test, p < 0.05) #### Statistical Significance Testing ```python from scipy import stats baseline_times = [100, 102, 98, 101, 99] # ms optimized_times = [95, 97, 93, 96, 94] # Is the difference real or noise? t_stat, p_value = stats.ttest_ind(baseline_times, optimized_times) if p_value < 0.05: print("Statistically significant (p < 0.05)") else: print("Not significant—could be noise") ``` **Why This Matters**: 5% speedup with p=0.08 isn't significant. Could be measurement variance. Production teams don't merge optimizations without statistical confidence. ### Connection to Competition Workflow (Module 20) This benchmarking infrastructure provides the measurement harness used in Module 20's competition workflow: **How Module 20 Uses This Framework** 1. Module 20 uses your `Benchmark` class to measure baseline and optimized performance 2. Statistical rigor from this module ensures fair comparison across submissions 3. Normalized metrics enable hardware-independent ranking 4. Reproducible protocols ensure all competitors use the same measurement methodology **The Workflow** 1. Module 19: Learn benchmarking methodology (statistical rigor, fair comparison) 2. Module 20: Apply benchmarking tools in competition workflow (submission generation, validation) 3. Competition: Use Benchmark harness to measure and validate optimizations Your benchmarking framework provides the foundation for fair competition—same measurement methodology, same statistical analysis, same reporting format. Module 20 teaches how to use these tools in a competition context. ## Getting Started ### Prerequisites Ensure you understand the optimization foundations: ```bash # Activate TinyTorch environment source bin/activate-tinytorch.sh # Verify prerequisite modules tito test --module profiling tito test --module quantization tito test --module compression ``` ### Development Workflow 1. **Open the development file**: `modules/19_benchmarking/benchmarking_dev.py` 2. **Implement BenchmarkResult**: Container for measurements with statistical analysis 3. **Build Benchmark class**: Runner with warmup, multiple runs, metrics collection 4. **Create BenchmarkSuite**: Full evaluation with latency/accuracy/memory/energy 5. **Add reporting**: Automated report generation with visualizations 6. **Export and verify**: `tito module complete 19 && tito test --module benchmarking` ## Testing ### Comprehensive Test Suite Run the full test suite to verify benchmarking functionality: ```bash # TinyTorch CLI (recommended) tito test --module benchmarking # Direct pytest execution python -m pytest tests/ -k benchmarking -v ``` ### Test Coverage Areas - ✅ **Statistical Calculations**: Mean, std, median, confidence intervals computed correctly - ✅ **Multiple Runs**: Warmup and measurement phases work properly - ✅ **Normalized Metrics**: Speedup, compression, efficiency calculated accurately - ✅ **Fair Comparison**: Same workload enforcement, baseline vs optimized - ✅ **Result Serialization**: BenchmarkResult converts to dict for storage - ✅ **Visualization**: Plots generate with proper formatting and error bars - ✅ **System Info**: Metadata captured for reproducibility - ✅ **Pareto Analysis**: Optimal trade-off points identified correctly ### Inline Testing & Validation The module includes comprehensive unit tests: ```python 🔬 Unit Test: BenchmarkResult... ✅ Mean calculation correct: 3.0 ✅ Std calculation matches statistics module ✅ Confidence intervals bound mean ✅ Serialization preserves data 📈 Progress: BenchmarkResult ✓ 🔬 Unit Test: Benchmark latency... ✅ Warmup runs executed before measurement ✅ Multiple measurement runs collected ✅ Results include mean ± CI 📈 Progress: Benchmark ✓ 🔬 Unit Test: BenchmarkSuite... ✅ All benchmark types run (latency, accuracy, memory, energy) ✅ Results organized by metric type ✅ Visualizations generated 📈 Progress: BenchmarkSuite ✓ ``` ### Manual Testing Examples ```python from tinytorch.benchmarking.benchmark import Benchmark, BenchmarkSuite from tinytorch.core.tensor import Tensor import numpy as np # Create simple models for testing class FastModel: name = "fast_model" def forward(self, x): return x * 2 class SlowModel: name = "slow_model" def forward(self, x): import time time.sleep(0.01) # Simulate 10ms latency return x * 2 # Benchmark comparison models = [FastModel(), SlowModel()] benchmark = Benchmark(models, datasets=[None]) # Run latency benchmark results = benchmark.run_latency_benchmark() for model_name, result in results.items(): print(f"{model_name}: {result.mean:.2f} ± {result.std:.2f}ms") print(f" 95% CI: [{result.ci_lower:.2f}, {result.ci_upper:.2f}]") # Speedup calculation fast_time = results['fast_model'].mean slow_time = results['slow_model'].mean speedup = slow_time / fast_time print(f"\nSpeedup: {speedup:.2f}x") ``` ## Systems Thinking Questions ### Real-World Applications - **Production ML Deployment**: PyTorch runs continuous benchmarking before merging optimizations—statistical rigor prevents performance regressions - **Hardware Evaluation**: Google's TPU teams benchmark every architecture iteration—measurements justify billion-dollar hardware investments - **Model Optimization**: Meta benchmarks training efficiency (samples/sec, memory, convergence)—10% speedup saves hundreds of thousands in compute costs - **Research Validation**: Papers require reproducible benchmarks with statistical significance—ablation studies need fair comparison protocols ### Statistical Foundations - **Central Limit Theorem**: Multiple measurements → normal distribution → confidence intervals and significance testing - **Measurement Uncertainty**: Every measurement has variance—systematic errors (timer overhead) and random errors (thermal noise) - **Statistical Power**: How many runs needed for significance? Depends on effect size and variance—5% speedup requires more runs than 50% - **Type I/II Errors**: False positive (claiming speedup when it's noise) vs false negative (missing real speedup due to insufficient samples) ### Performance Characteristics - **Warmup Effects**: First run 20% slower than steady-state—cold cache, JIT compilation, memory allocation - **System Noise Sources**: Thermal throttling (CPU frequency drops), OS interrupts (background tasks), memory pressure (GC pauses), network interference - **Observer Effect**: Instrumentation changes behavior—profiling overhead 5%, cache effects from measurement code, branch prediction altered - **Hardware Variability**: Optimization 3x faster on GPU but 1.1x on CPU—memory bandwidth helps GPU, CPU cache doesn't fit data ## Ready to Build? You've reached the penultimate module of the Optimization Tier. This benchmarking framework validates all your previous work from Modules 14-18, transforming subjective claims ("feels faster") into objective data ("1.8x speedup, p < 0.01, 95% CI [1.6x, 2.0x]"). Your benchmarking infrastructure provides the measurement foundation for Module 20's competition workflow, where you'll use these tools to validate optimizations systematically. Fair measurement methodology ensures your innovation is recognized—not who got lucky with thermal throttling. Module 20 teaches how to use your benchmarking framework in a competition context—generating submissions, validating constraints, and packaging results. Your benchmarking framework measures cumulative impact with statistical rigor. This is how production ML teams validate optimizations before deployment—rigorous measurement prevents regressions and quantifies improvements. Statistical rigor isn't just academic formality—it's engineering discipline. When Meta claims 10% training speedup saves hundreds of thousands in compute costs, that claim requires measurements with confidence intervals and significance testing. Your framework implements this methodology from first principles. Choose your preferred way to engage with this module: ````{grid} 1 2 3 3 ```{grid-item-card} Launch Binder :link: https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/19_benchmarking/benchmarking_dev.ipynb :class-header: bg-light Run this module interactively in your browser. No installation required. ``` ```{grid-item-card} Open in Colab :link: https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/19_benchmarking/benchmarking_dev.ipynb :class-header: bg-light Use Google Colab for GPU access and cloud compute power. ``` ```{grid-item-card} View Source :link: https://github.com/mlsysbook/TinyTorch/blob/main/modules/19_benchmarking/benchmarking_dev.py :class-header: bg-light Browse the Python source code and understand the implementation. ``` ```` ```{admonition} Save Your Progress :class: tip Binder sessions are temporary. Download your completed notebook when done, or switch to local development for persistent work. ``` ---