Refactor Module 19 to TorchPerf Olympics framework

- Updated module title to TorchPerf Olympics Preparation
- Added OlympicEvent enum with 5 competition categories
- Removed meta-analysis sections (532 lines)
- Added section 4.5 on combination strategies and ablation studies
- Updated documentation to explain Olympic events and optimization order
- Module teaches benchmarking principles while preparing students for capstone
Vijay Janapa Reddi
2025-11-06 21:53:36 -05:00
parent 3dfaca0f19
commit 803ac39b07
3 changed files with 214 additions and 576 deletions


@@ -9,21 +9,23 @@
TinyTorch is a comprehensive educational ML framework designed for a Machine Learning Systems course. Students build every component from scratch, progressing from basic tensors through modern transformer architectures.
### Current Status: **Core Complete, Optimization Modules In Progress**
### Current Status: **Core Complete, Ready for TorchPerf Olympics Capstone!**
- **16/19 modules** fully implemented and exported ✅
- **19/19 modules** fully implemented and exported ✅
- **All 5 historical milestones** functional and tested ✅
- **Transformer module** with complete gradient flow ✅
- **KV Caching module** with 10-15x speedup ✅
- **Profiling module** with scientific performance measurement ✅
- **Quantization module** with INT8 compression ✅ NEW!
- **3 advanced modules** ready for implementation (16, 18-19)
- **Acceleration module** with vectorization and kernel fusion ✅
- **Quantization module** with INT8 compression ✅
- **Compression module** with pruning and distillation ✅
- **Benchmarking module (TorchPerf Olympics)** with standardized evaluation framework ✅ NEW!
---
## 📊 Module Implementation Status
### ✅ Fully Implemented (Modules 01-17)
### ✅ Fully Implemented (All 19 Modules!)
These modules are complete, tested, and exported to `tinytorch/`:
@@ -44,23 +46,23 @@ These modules are complete, tested, and exported to `tinytorch/`:
| 13 | **Transformers** | `tinytorch/models/transformer.py` | ✅ Complete | 1,726 |
| 14 | **KV Caching** | `tinytorch/generation/kv_cache.py` | ✅ Complete | 805 |
| 15 | **Profiling** | `tinytorch/profiling/profiler.py` | ✅ Complete | 155 |
| 16 | **Acceleration** | `tinytorch/acceleration/` | ✅ Complete | ~800 |
| 17 | **Quantization** | `tinytorch/optimization/quantization.py` | ✅ Complete | 289 |
| 18 | **Compression** | `tinytorch/optimization/compression.py` | ✅ Complete | ~600 |
| 19 | **Benchmarking** | `tinytorch/benchmarking/benchmark.py` | ✅ Complete | 1,100 |
**Total:** 18,699+ lines of educational ML code (including tests)
**Total:** 21,000+ lines of educational ML code (including tests)
### 🔧 Ready for Implementation (Modules 16, 18-19)
### 🏅 TorchPerf Olympics Capstone
These modules have source files created but need export:
**TorchPerf Olympics**: The capstone competition where students combine all optimization techniques (M14-18) and use the benchmarking framework (M19) to compete in 5 Olympic events (a minimal workflow sketch follows the list):
- 🏃 **Latency Sprint**: Fastest inference
- 🏋️ **Memory Challenge**: Smallest footprint
- 🎯 **Accuracy Contest**: Highest precision
- 🏋️‍♂️ **All-Around**: Best balance
- 🚀 **Extreme Push**: Most aggressive optimization
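A minimal sketch of what an event entry might look like, using the Module 19 `Benchmark` and `OlympicEvent` API from this commit (constructor arguments and result fields follow the examples shown later on this page and may differ slightly in the exported code; `baseline_model` and `optimized_model` stand in for the student's own models):

```python
from tinytorch.benchmarking.benchmark import Benchmark, OlympicEvent

# Enter an event, then compare the baseline against the optimized submission
event = OlympicEvent.LATENCY_SPRINT
benchmark = Benchmark([baseline_model, optimized_model],
                      [{"name": "baseline"}, {"name": "optimized"}])

results = benchmark.run_latency_benchmark()  # latency is what Latency Sprint scores
for name, result in results.items():
    print(f"{name}: {result.mean * 1000:.2f} ms")
```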
| Module | Name | Purpose | Priority |
|--------|------|---------|----------|
| 16 | **Acceleration** | Vectorization and fusion | 🔴 High |
| 18 | **Compression** | Pruning and distillation | 🟡 Medium |
| 19 | **Benchmarking** | Fair performance comparison | 🟡 Medium |
### 📚 Capstone (Module 20)
**TinyGPT**: Complete end-to-end language model project integrating all 19 modules.
🔥 Carry the torch. Optimize the model. Win the gold! 🏅
---
@@ -134,34 +136,35 @@ Modules 14-19: Production ML (Optimization, Profiling, Benchmarking)
---
## 🚀 Next Steps: Implementing Modules 14-19
## 🚀 Next Steps: TorchPerf Olympics Launch! 🏅
### Immediate Priority: Module 14 (KV Caching)
### All 19 Modules Complete! ✅
**Why Critical:**
- Makes generation 10x+ faster
- Essential for production transformers
- Unlocks interactive chatbot experiences
- Natural extension of Module 13
The TinyTorch educational framework is now complete with all core and optimization modules implemented:
- ✅ Modules 01-13: Core ML system (tensors through transformers)
- ✅ Modules 14-18: Optimization techniques (KV cache, profiling, acceleration, quantization, compression)
- ✅ Module 19: Benchmarking framework (TorchPerf Olympics)
**Implementation Plan:**
1. Edit `modules/source/14_kvcaching/kvcaching_dev.py`
2. Implement key-value cache data structure
3. Modify attention to reuse cached keys/values
4. Add cache-aware generation loop
5. Run `tito export` to export to `tinytorch/generation/`
6. Test with transformer generation benchmarks
### Ready for Capstone: TorchPerf Olympics
### Medium Priority: Modules 15-17
Students now have everything they need to:
1. **Build** their own ML models using M01-13
2. **Optimize** them using techniques from M14-18
3. **Benchmark** and **compete** using M19 TorchPerf Olympics framework
- **Module 15 (Profiling):** Measure what matters - timing, memory, FLOPs
- **Module 16 (Acceleration):** Operator fusion, kernel optimization
- **Module 17 (Quantization):** INT8/FP16 for smaller, faster models
**Olympic Events:**
- 🏃 Latency Sprint
- 🏋️ Memory Challenge
- 🎯 Accuracy Contest
- 🏋️‍♂️ All-Around Champion
- 🚀 Extreme Push
### Lower Priority: Modules 18-19
### Potential Future Enhancements
- **Module 18 (Compression):** Pruning, distillation techniques
- **Module 19 (Benchmarking):** Fair apples-to-apples comparisons
- **MLPerf-style Benchmark Suite**: Standardized competition baseline models
- **Cloud Leaderboard**: Real-time competition results and rankings
- **Advanced Optimizations**: Mixed precision training, distributed inference
- **Production Deployment**: Module 20 on serving and monitoring
---


@@ -17,29 +17,38 @@
# %% [markdown]
"""
# Module 19: Benchmarking - Fair Performance Comparison Systems
# Module 19: Benchmarking - TorchPerf Olympics Preparation
Welcome to the final implementation module! Today you'll build a comprehensive benchmarking system that can fairly compare different ML approaches across multiple dimensions.
Welcome to the final implementation module! You've learned individual optimization techniques in Modules 14-18. Now you'll build the benchmarking infrastructure that powers **TorchPerf Olympics** - the capstone competition framework.
## 🔗 Prerequisites & Progress
**You've Built**: Complete ML framework with profiling, acceleration, quantization, and compression
**You'll Build**: Professional benchmarking suite with statistical rigor and automated reporting
**You'll Enable**: Data-driven optimization decisions and performance regression detection
**You'll Build**: TorchPerf benchmarking system for fair model comparison and capstone submission
**You'll Enable**: Systematic optimization combination and competitive performance evaluation
**Connection Map**:
```
Profiling (Module 15) → Benchmarking (Module 19) → Systems Capstone (Milestone 5)
     (measurement)           (comparison)               (optimization)
Individual Optimizations (M14-18) → Benchmarking (M19) → TorchPerf Olympics (Capstone)
          (techniques)                 (evaluation)            (competition)
```
## 🏅 TorchPerf Olympics: The Capstone Framework
The TorchPerf Olympics is your capstone competition! Choose your event:
- 🏃 **Latency Sprint**: Minimize inference time (fastest model wins)
- 🏋️ **Memory Challenge**: Minimize model size (smallest footprint wins)
- 🎯 **Accuracy Contest**: Maximize accuracy within constraints
- 🏋️‍♂️ **All-Around**: Best balanced performance across all metrics
- 🚀 **Extreme Push**: Most aggressive optimization while staying viable
## Learning Objectives
By the end of this module, you will:
1. Implement comprehensive benchmarking infrastructure with statistical analysis
2. Build automated comparison systems across accuracy, latency, memory, and energy
3. Create professional reporting with visualization and recommendations
4. Integrate TinyMLPerf-style standardized benchmarks for reproducible results
1. Implement professional benchmarking infrastructure with statistical rigor
2. Learn to combine optimization techniques strategically (order matters!)
3. Build the `Benchmark` class - your standardized capstone submission framework
4. Understand ablation studies and systematic performance evaluation
Let's build the foundation for data-driven ML systems optimization!
🔥 Carry the torch. Optimize the model. Win the gold! 🏅
"""
# %% [markdown]
@@ -51,14 +60,19 @@ Let's build the foundation for data-driven ML systems optimization!
```python
# How to use this module:
from tinytorch.benchmarking.benchmark import Benchmark, BenchmarkSuite, TinyMLPerf
from tinytorch.benchmarking.benchmark import Benchmark, OlympicEvent
# For capstone submission:
benchmark = Benchmark([baseline_model, optimized_model],
                      [{"name": "baseline"}, {"name": "optimized"}])
results = benchmark.run_latency_benchmark()
```
**Why this matters:**
- **Learning:** Complete benchmarking ecosystem in one focused module for rigorous evaluation
- **Production:** Proper organization like MLPerf and TensorBoard profiling with all analysis tools together
- **TorchPerf Olympics:** The Benchmark class provides the standardized framework for capstone submissions
- **Consistency:** All benchmarking operations and reporting in benchmarking.benchmark
- **Integration:** Works seamlessly with optimization modules for complete systems evaluation
- **Integration:** Works seamlessly with optimization modules (M14-18) for complete systems evaluation
"""
# %% [markdown]
@@ -157,6 +171,23 @@ import warnings
# Import Profiler from Module 15 for measurement reuse
from tinytorch.profiling.profiler import Profiler
# %%
#| export
from enum import Enum
class OlympicEvent(Enum):
    """
    TorchPerf Olympics event categories.
    Each event optimizes for different objectives with specific constraints.
    Students choose their event and compete for medals!
    """
    LATENCY_SPRINT = "latency_sprint"      # Minimize latency (accuracy >= 85%)
    MEMORY_CHALLENGE = "memory_challenge"  # Minimize memory (accuracy >= 85%)
    ACCURACY_CONTEST = "accuracy_contest"  # Maximize accuracy (latency < 100ms, memory < 10MB)
    ALL_AROUND = "all_around"              # Best balanced score across all metrics
    EXTREME_PUSH = "extreme_push"          # Most aggressive optimization (accuracy >= 80%)
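# %% [markdown]
"""
A quick sketch of how an event's hard constraints (the thresholds in the comments above) might be checked against a submission's measured metrics. `EVENT_CONSTRAINTS` and `meets_constraints` are illustrative helpers for this discussion, not part of the exported API:

```python
EVENT_CONSTRAINTS = {
    OlympicEvent.LATENCY_SPRINT:   {"min_accuracy": 0.85},
    OlympicEvent.MEMORY_CHALLENGE: {"min_accuracy": 0.85},
    OlympicEvent.ACCURACY_CONTEST: {"max_latency_ms": 100, "max_memory_mb": 10},
    OlympicEvent.ALL_AROUND:       {},  # scored on balance across all metrics
    OlympicEvent.EXTREME_PUSH:     {"min_accuracy": 0.80},
}

def meets_constraints(event, accuracy, latency_ms, memory_mb):
    # Return True if a submission satisfies its event's hard constraints (sketch only)
    c = EVENT_CONSTRAINTS[event]
    return (accuracy >= c.get("min_accuracy", 0.0)
            and latency_ms <= c.get("max_latency_ms", float("inf"))
            and memory_mb <= c.get("max_memory_mb", float("inf")))

print(meets_constraints(OlympicEvent.LATENCY_SPRINT, accuracy=0.87, latency_ms=12, memory_mb=8))  # True
```
"""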
# %% [markdown]
"""
# 3. Implementation - Building Professional Benchmarking Infrastructure
@@ -1907,539 +1938,99 @@ test_unit_optimization_comparison()
# %% [markdown]
"""
# 5. Systems Analysis - Performance Engineering Insights
## 4.5 Combination Strategies - Preparing for TorchPerf Olympics
Let's analyze how our benchmarking system behaves under different conditions and reveal insights about measurement accuracy, system variability, and scalability patterns.
You've learned individual optimizations (M14-18). Now it's time to combine them strategically! The order and parameters matter significantly for final performance.
This analysis section demonstrates a key principle: **benchmark the benchmarking system itself**. Understanding how your measurement tools behave is crucial for interpreting results correctly.
### Why Combination Order Matters
## Why Analyze Measurement Systems?
Consider these two strategies:
- **Strategy A**: Quantize INT8 → Prune 70% → Fuse kernels
- **Strategy B**: Prune 70% → Quantize INT8 → Fuse kernels
Consider two scenarios:
- **Scenario A**: Your measurements show Model B is 10% faster than Model A
- **Scenario B**: Your measurements show Model B is 10% faster, but measurement uncertainty is ±15%
Strategy A might preserve more accuracy because quantization happens first (on the full network), while Strategy B might be faster because pruning reduces what needs to be quantized. The "best" depends on your Olympic event!
In Scenario A, you might deploy Model B. In Scenario B, the difference isn't statistically significant - you can't trust the comparison.
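A minimal sketch of the two orderings (Strategy A and Strategy B) above, reusing the `quantize_model` and `magnitude_prune` helpers that appear in the example later in this section (exact signatures are assumed; a kernel-fusion step would chain on in the same way):

```python
from tinytorch.optimization.quantization import quantize_model
from tinytorch.optimization.compression import magnitude_prune

# Strategy A: quantize the full network first, then prune
model_a = magnitude_prune(quantize_model(baseline_model, bits=8), sparsity=0.7)

# Strategy B: prune first, then quantize what remains
model_b = quantize_model(magnitude_prune(baseline_model, sparsity=0.7), bits=8)

# Benchmark both - which one "wins" depends on the event you entered
```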
### Ablation Studies: Understanding Individual Contributions
Professional benchmarking requires understanding and quantifying measurement uncertainty.
Professional ML engineers use **ablation studies** to understand what each optimization contributes:
```
Baseline:         Accuracy: 89%, Latency: 45ms, Memory: 12MB
+ Quantization:   Accuracy: 88%, Latency: 30ms, Memory: 3MB    (Δ: -1%, -33%, -75%)
+ Pruning:        Accuracy: 87%, Latency: 22ms, Memory: 2MB    (Δ: -1%, -27%, -33%)
+ Kernel Fusion:  Accuracy: 87%, Latency: 18ms, Memory: 2MB    (Δ: 0%, -18%, 0%)
Conclusion: Quantization provides biggest memory reduction, fusion provides latency boost
```
This systematic analysis tells you what to prioritize for each Olympic event!
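One way to produce such a table is to apply the techniques cumulatively and benchmark after each step. A sketch using the same helpers and `Benchmark` API as the example below (the accuracy measurement depends on your dataset and is omitted here):

```python
steps = [
    ("+ Quantization", lambda m: quantize_model(m, bits=8)),
    ("+ Pruning",      lambda m: magnitude_prune(m, sparsity=0.6)),
]

model = baseline_model
for label, apply_step in steps:
    model = apply_step(model)
    bench = Benchmark([model], [{"name": label}])
    latency_s = list(bench.run_latency_benchmark().values())[0].mean
    print(f"{label}: {latency_s * 1000:.1f} ms")  # record the delta against the previous step
```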
### Olympic Event Strategies
**🏃 Latency Sprint**: Minimize inference time
- Priority: Kernel fusion > KV caching > Quantization > Pruning
- Risk: Aggressive optimizations may hurt accuracy
- Tip: Start with proven speed techniques, then add memory techniques if needed
**🏋️ Memory Challenge**: Minimize model footprint
- Priority: Quantization > Pruning > Compression
- Risk: Model quality degradation
- Tip: Quantize first (4x memory reduction), then prune to meet target
**🎯 Accuracy Contest**: Maximize accuracy within constraints
- Priority: Minimal optimizations, careful tuning
- Risk: Not enough optimization to meet constraints
- Tip: Use high-bit quantization (8-bit), light pruning (30-50%)
**🏋️‍♂️ All-Around**: Best balanced performance
- Priority: Balanced application of all techniques
- Risk: Jack of all trades, master of none
- Tip: Use moderate settings for each technique (INT8, 60% pruning, selective fusion)
**🚀 Extreme Push**: Most aggressive optimization
- Priority: Maximum of everything
- Risk: Significant accuracy loss
- Tip: Start with 4-bit quantization + 90% pruning, verify accuracy threshold
### Example: Combining for All-Around Event
```python
from tinytorch.optimization.quantization import quantize_model
from tinytorch.optimization.compression import magnitude_prune
from tinytorch.generation.kv_cache import enable_kv_cache
# Load baseline
baseline_model = load_baseline("cifar10_cnn")
# Apply balanced optimization strategy
optimized = baseline_model
# Step 1: Quantize to INT8 (moderate precision)
optimized = quantize_model(optimized, bits=8)
# Step 2: Prune 60% (moderate sparsity)
optimized = magnitude_prune(optimized, sparsity=0.6)
# Step 3: Enable KV cache for transformers (if applicable)
if hasattr(optimized, 'transformer_blocks'):
    enable_kv_cache(optimized)
# Benchmark using TorchPerf
from tinytorch.benchmarking.benchmark import Benchmark, OlympicEvent
benchmark = Benchmark([baseline_model, optimized],
                      [{"name": "baseline"}, {"name": "optimized"}])
results = benchmark.run_latency_benchmark()
# Compare and iterate!
```
The key: **Start with one technique, measure impact, add next technique, repeat!**
"""
# %% [markdown]
"""
## Measurement Variance Analysis
Understanding measurement variance is fundamental to statistical significance. This analysis reveals how sample size affects measurement reliability and helps determine optimal benchmark configurations.
### Statistical Significance in Practice
When you measure a model's latency multiple times, you get a distribution of values. The key insight: **more measurements reduce uncertainty about the true mean, but with diminishing returns**.
```
Measurement Variance Relationship:
Standard Error = σ / √n
Where:
- σ = underlying measurement noise
- n = number of samples
- Standard Error = uncertainty in the estimated mean
Doubling samples reduces uncertainty by √2 ≈ 1.41x
10x samples reduces uncertainty by √10 ≈ 3.16x
```
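For example, a quick numpy check (the simulated latencies are illustrative):

```python
import numpy as np

latencies_ms = np.random.normal(loc=45.0, scale=3.0, size=100)  # simulated noisy measurements

for n in (10, 50, 100):
    sample = latencies_ms[:n]
    se = sample.std(ddof=1) / np.sqrt(n)  # Standard Error = σ / √n
    print(f"n={n:3d}: mean={sample.mean():.2f} ms, standard error={se:.2f} ms")
```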
### Variance Sources in ML Benchmarking
**System-Level Variance**:
- CPU frequency scaling (thermal throttling)
- Background processes (OS scheduling)
- Memory pressure (garbage collection)
- Network traffic (for distributed models)
**Algorithm-Level Variance**:
- Input-dependent computation paths
- Random initialization effects
- Numerical precision variations
**Measurement-Level Variance**:
- Timer resolution and overhead
- Function call overhead
- Memory allocation patterns
This analysis quantifies these effects and determines optimal measurement protocols.
"""
# %% nbgrader={"grade": false, "grade_id": "analyze-measurement-variance", "solution": true}
def analyze_measurement_variance():
    """📊 Analyze how measurement variance affects benchmark reliability."""
    print("📊 Analyzing measurement variance and statistical significance...")

    # Create a simple test model for consistent analysis
    class TestModel:
        def __init__(self, base_latency=0.001):
            self.base_latency = base_latency
            self.name = "test_model"

        def forward(self, x):
            # Add realistic variance sources
            system_noise = np.random.normal(0, 0.0001)       # System noise
            thermal_variance = np.random.normal(0, 0.00005)  # CPU frequency variation
            time.sleep(max(0, self.base_latency + system_noise + thermal_variance))
            return x

    model = TestModel()

    # Test different numbers of measurement runs
    run_counts = [3, 5, 10, 20, 50, 100]
    variance_results = []

    for num_runs in run_counts:
        benchmark = Benchmark([model], [{"data": "test"}],
                              warmup_runs=2, measurement_runs=num_runs)

        # Run multiple benchmark sessions to see variance between sessions
        session_means = []
        session_stds = []
        for session in range(5):  # 5 different benchmark sessions
            results = benchmark.run_latency_benchmark()
            result = list(results.values())[0]
            session_means.append(result.mean)
            session_stds.append(result.std)

        # Calculate variance across sessions
        mean_of_means = np.mean(session_means)
        std_of_means = np.std(session_means)
        mean_of_stds = np.mean(session_stds)

        variance_results.append({
            'num_runs': num_runs,
            'mean_latency': mean_of_means,
            'std_between_sessions': std_of_means,
            'mean_std_within_session': mean_of_stds,
            'coefficient_of_variation': std_of_means / mean_of_means if mean_of_means > 0 else 0
        })

    # Plot results
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

    # Plot 1: Standard deviation vs number of runs
    num_runs_list = [r['num_runs'] for r in variance_results]
    between_session_std = [r['std_between_sessions'] * 1000 for r in variance_results]  # Convert to ms
    within_session_std = [r['mean_std_within_session'] * 1000 for r in variance_results]

    ax1.plot(num_runs_list, between_session_std, 'o-', label='Between Sessions', linewidth=2)
    ax1.plot(num_runs_list, within_session_std, 's-', label='Within Session', linewidth=2)
    ax1.set_xlabel('Number of Measurement Runs')
    ax1.set_ylabel('Standard Deviation (ms)')
    ax1.set_title('Measurement Variance vs Sample Size')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    ax1.set_xscale('log')

    # Plot 2: Coefficient of variation
    cv_values = [r['coefficient_of_variation'] * 100 for r in variance_results]
    ax2.plot(num_runs_list, cv_values, 'o-', color='red', linewidth=2)
    ax2.set_xlabel('Number of Measurement Runs')
    ax2.set_ylabel('Coefficient of Variation (%)')
    ax2.set_title('Measurement Reliability vs Sample Size')
    ax2.grid(True, alpha=0.3)
    ax2.set_xscale('log')

    plt.tight_layout()
    plt.show()

    # Key insights
    print("\n💡 Measurement Variance Analysis:")
    print(f"With 10 runs: CV = {variance_results[2]['coefficient_of_variation']:.1%}")
    print(f"With 50 runs: CV = {variance_results[4]['coefficient_of_variation']:.1%}")
    print(f"With 100 runs: CV = {variance_results[5]['coefficient_of_variation']:.1%}")

    if variance_results[4]['coefficient_of_variation'] < 0.05:
        print("🚀 50+ runs provide stable measurements (CV < 5%)")
    else:
        print("⚠️ High variance detected - consider longer warmup or controlled environment")

analyze_measurement_variance()
# %% [markdown]
"""
## Benchmark Scaling Analysis
Understanding how benchmark overhead scales with model complexity helps optimize measurement protocols and interpret results correctly.
### Why Benchmark Overhead Matters
Every measurement tool adds overhead. For benchmarking to be meaningful, this overhead must be:
1. **Consistent**: Same overhead across different models
2. **Minimal**: Small compared to what you're measuring
3. **Predictable**: Understood so you can account for it
### Overhead Analysis Framework
```
Total Measured Time = True Model Time + Benchmark Overhead
Benchmark Overhead includes:
├── Framework setup (model loading, input preparation)
├── Timing infrastructure (context managers, precision counters)
├── Result collection (statistics, metadata gathering)
└── System interactions (memory allocation, Python overhead)
```
### Scaling Behavior Patterns
**Good Scaling**: Overhead decreases as percentage of total time
- Simple models: 20% overhead (still usable)
- Complex models: 2% overhead (negligible)
**Bad Scaling**: Overhead increases with model complexity
- Indicates benchmark framework bottlenecks
- Makes results unreliable for optimization decisions
**Optimal Configuration**: Overhead < 5% for target model complexity range
This analysis identifies the optimal benchmark configuration for different model types and deployment scenarios.
"""
# %% nbgrader={"grade": false, "grade_id": "analyze-scaling-behavior", "solution": true}
def analyze_scaling_behavior():
    """📊 Analyze how benchmark overhead scales with model and input complexity."""
    print("📊 Analyzing benchmark overhead and scaling behavior...")

    # Create models with different computational complexity
    class ScalingTestModel:
        def __init__(self, complexity_factor, name):
            self.complexity_factor = complexity_factor
            self.name = name

        def forward(self, x):
            # Simulate computational work proportional to complexity
            base_time = 0.001  # 1ms base
            compute_time = base_time * self.complexity_factor

            # Simulate actual computation with matrix operations
            if hasattr(x, 'shape'):
                size = np.prod(x.shape)
            else:
                size = len(x) if hasattr(x, '__len__') else 100

            # Simulate memory allocation and computation
            temp_data = np.random.randn(int(size * self.complexity_factor))
            _ = np.sum(temp_data * temp_data)  # Some computation
            time.sleep(compute_time)
            return x

    # Models with different complexity
    models = [
        ScalingTestModel(1, "simple_model"),
        ScalingTestModel(5, "medium_model"),
        ScalingTestModel(20, "complex_model"),
        ScalingTestModel(100, "very_complex_model")
    ]

    # Test different input sizes
    input_sizes = [(1, 28, 28), (1, 64, 64), (1, 128, 128), (1, 256, 256)]
    scaling_results = []

    for input_shape in input_sizes:
        print(f"Testing input shape: {input_shape}")
        for model in models:
            # Measure pure model time (without benchmark overhead)
            dummy_input = np.random.randn(*input_shape).astype(np.float32)
            pure_times = []
            for _ in range(10):
                with precise_timer() as timer:
                    model.forward(dummy_input)
                pure_times.append(timer.elapsed * 1000)
            pure_mean = np.mean(pure_times)

            # Measure with benchmark framework
            benchmark = Benchmark([model], [{"data": "test"}],
                                  warmup_runs=3, measurement_runs=10)
            bench_results = benchmark.run_latency_benchmark(input_shape)
            bench_mean = list(bench_results.values())[0].mean * 1000  # convert seconds to ms to match pure_mean

            # Calculate overhead
            overhead_ms = bench_mean - pure_mean
            overhead_percent = (overhead_ms / pure_mean) * 100 if pure_mean > 0 else 0

            scaling_results.append({
                'input_size': np.prod(input_shape),
                'model_complexity': model.complexity_factor,
                'model_name': model.name,
                'pure_latency_ms': pure_mean,
                'benchmark_latency_ms': bench_mean,
                'overhead_ms': overhead_ms,
                'overhead_percent': overhead_percent
            })

    # Create DataFrame for analysis
    df = pd.DataFrame(scaling_results)

    # Plot results
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

    # Plot 1: Overhead vs model complexity
    for input_size in [784, 4096, 16384, 65536]:  # Representative sizes
        subset = df[df['input_size'] == input_size]
        if not subset.empty:
            ax1.plot(subset['model_complexity'], subset['overhead_percent'],
                     'o-', label=f'Input size: {input_size}', linewidth=2)
    ax1.set_xlabel('Model Complexity Factor')
    ax1.set_ylabel('Benchmark Overhead (%)')
    ax1.set_title('Benchmark Overhead vs Model Complexity')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    ax1.set_xscale('log')

    # Plot 2: Absolute overhead vs input size
    for complexity in [1, 5, 20, 100]:
        subset = df[df['model_complexity'] == complexity]
        if not subset.empty:
            ax2.plot(subset['input_size'], subset['overhead_ms'],
                     'o-', label=f'Complexity: {complexity}x', linewidth=2)
    ax2.set_xlabel('Input Size (elements)')
    ax2.set_ylabel('Benchmark Overhead (ms)')
    ax2.set_title('Benchmark Overhead vs Input Size')
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    ax2.set_xscale('log')

    plt.tight_layout()
    plt.show()

    # Analysis insights
    print("\n💡 Scaling Behavior Analysis:")

    # Find overhead patterns
    high_complexity_overhead = df[df['model_complexity'] >= 20]['overhead_percent'].mean()
    low_complexity_overhead = df[df['model_complexity'] <= 5]['overhead_percent'].mean()
    print(f"Low complexity models: {low_complexity_overhead:.1f}% overhead")
    print(f"High complexity models: {high_complexity_overhead:.1f}% overhead")

    if high_complexity_overhead < 5:
        print("🚀 Benchmark overhead is negligible for complex models")
    elif low_complexity_overhead > 20:
        print("⚠️ High overhead for simple models - consider optimization")
    else:
        print("✅ Benchmark scaling is appropriate for intended use cases")

analyze_scaling_behavior()
# %% [markdown]
"""
# 6. Optimization Insights - Trade-offs and Production Patterns
Understanding the real-world implications of benchmarking decisions and how to optimize the measurement process itself for different use cases.
This section addresses a meta-question: **How do you optimize the optimization process?** Different use cases need different measurement trade-offs.
## Benchmarking Configuration Optimization
Professional ML teams face a fundamental trade-off in benchmarking:
- **More accurate measurements** require more time and resources
- **Faster measurements** enable more iteration but with less precision
- **Different development phases** need different measurement fidelity
The goal: Find the minimum measurement overhead that provides sufficient confidence for decision-making.
"""
# %% [markdown]
"""
## Optimal Benchmark Configuration Analysis
This analysis helps determine the right benchmark configuration for different development scenarios. It's a practical application of statistics to engineering workflow optimization.
### The Measurement Fidelity Spectrum
```
Development Phase          Accuracy Need    Speed Need    Optimal Config
─────────────────────────────────────────────────────────────────────
Rapid prototyping          Low              High          Fast (5 runs)
Feature development        Medium           Medium        Standard (20 runs)
Performance optimization   High             Low           Accurate (50 runs)
Production validation      Very High        Very Low      Research (100+ runs)
Regression testing         Medium           High          Automated (15 runs)
```
### Multi-Objective Optimization for Benchmarking
We optimize across three competing objectives:
1. **Accuracy**: How close to the true performance value
2. **Precision**: How consistent are repeated measurements
3. **Speed**: How quickly we get results
```
Benchmark Configuration Optimization:
minimize:    w₁×(accuracy_error) + w₂×(precision_error) + w₃×(time_cost)
subject to:  measurement_runs ≥ min_statistical_power
             total_time ≤ max_allowed_time
Where weights w₁, w₂, w₃ depend on use case
```
This analysis empirically determines optimal configurations for different scenarios.
"""
# %% nbgrader={"grade": false, "grade_id": "benchmark-optimization", "solution": true}
def optimize_benchmark_configuration():
    """📊 Find optimal benchmark configuration for different accuracy vs speed needs."""
    print("📊 Optimizing benchmark configuration for different use cases...")

    # Test model for configuration optimization
    class ConfigTestModel:
        def __init__(self):
            self.name = "config_test_model"

        def forward(self, x):
            # Consistent baseline with small variance
            time.sleep(0.002 + np.random.normal(0, 0.0001))
            return x

    model = ConfigTestModel()

    # Test different configuration combinations
    configurations = [
        {'warmup': 1, 'runs': 5, 'name': 'fast'},
        {'warmup': 3, 'runs': 10, 'name': 'standard'},
        {'warmup': 5, 'runs': 20, 'name': 'accurate'},
        {'warmup': 10, 'runs': 50, 'name': 'precise'},
        {'warmup': 15, 'runs': 100, 'name': 'research'}
    ]
    config_results = []

    # Ground truth: run very long benchmark to get "true" value
    true_benchmark = Benchmark([model], [{"data": "test"}],
                               warmup_runs=20, measurement_runs=200)
    true_results = true_benchmark.run_latency_benchmark()
    true_latency = list(true_results.values())[0].mean
    print(f"Ground truth latency: {true_latency:.4f}s")

    for config in configurations:
        print(f"\nTesting {config['name']} configuration...")

        # Run multiple trials with this configuration
        trial_results = []
        total_time_spent = []
        for trial in range(8):  # 8 trials per configuration
            start_time = time.time()
            benchmark = Benchmark([model], [{"data": "test"}],
                                  warmup_runs=config['warmup'],
                                  measurement_runs=config['runs'])
            results = benchmark.run_latency_benchmark()
            measured_latency = list(results.values())[0].mean
            end_time = time.time()
            trial_results.append(measured_latency)
            total_time_spent.append(end_time - start_time)

        # Calculate accuracy and efficiency metrics
        trial_mean = np.mean(trial_results)
        trial_std = np.std(trial_results)
        accuracy_error = abs(trial_mean - true_latency) / true_latency * 100
        precision_cv = trial_std / trial_mean * 100 if trial_mean > 0 else 0
        avg_benchmark_time = np.mean(total_time_spent)

        config_results.append({
            'name': config['name'],
            'warmup_runs': config['warmup'],
            'measurement_runs': config['runs'],
            'total_runs': config['warmup'] + config['runs'],
            'accuracy_error_percent': accuracy_error,
            'precision_cv_percent': precision_cv,
            'benchmark_time_s': avg_benchmark_time,
            'efficiency_score': 100 / (accuracy_error + precision_cv + avg_benchmark_time * 10)  # Combined score
        })

    # Create comparison DataFrame
    df = pd.DataFrame(config_results)

    # Visualize trade-offs
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))

    # Plot 1: Accuracy vs Speed
    ax1.scatter(df['benchmark_time_s'], df['accuracy_error_percent'],
                s=100, alpha=0.7, c=df['total_runs'], cmap='viridis')
    for i, name in enumerate(df['name']):
        ax1.annotate(name, (df['benchmark_time_s'].iloc[i], df['accuracy_error_percent'].iloc[i]),
                     xytext=(5, 5), textcoords='offset points')
    ax1.set_xlabel('Benchmark Time (seconds)')
    ax1.set_ylabel('Accuracy Error (%)')
    ax1.set_title('Accuracy vs Speed Trade-off')
    ax1.grid(True, alpha=0.3)

    # Plot 2: Precision vs Speed
    ax2.scatter(df['benchmark_time_s'], df['precision_cv_percent'],
                s=100, alpha=0.7, c=df['total_runs'], cmap='viridis')
    for i, name in enumerate(df['name']):
        ax2.annotate(name, (df['benchmark_time_s'].iloc[i], df['precision_cv_percent'].iloc[i]),
                     xytext=(5, 5), textcoords='offset points')
    ax2.set_xlabel('Benchmark Time (seconds)')
    ax2.set_ylabel('Precision CV (%)')
    ax2.set_title('Precision vs Speed Trade-off')
    ax2.grid(True, alpha=0.3)

    # Plot 3: Efficiency comparison
    ax3.bar(df['name'], df['efficiency_score'], alpha=0.7)
    ax3.set_ylabel('Efficiency Score (higher = better)')
    ax3.set_title('Overall Benchmark Efficiency')
    ax3.tick_params(axis='x', rotation=45)

    # Plot 4: Configuration breakdown
    width = 0.35
    x = np.arange(len(df))
    ax4.bar(x - width/2, df['warmup_runs'], width, label='Warmup Runs', alpha=0.7)
    ax4.bar(x + width/2, df['measurement_runs'], width, label='Measurement Runs', alpha=0.7)
    ax4.set_xlabel('Configuration')
    ax4.set_ylabel('Number of Runs')
    ax4.set_title('Configuration Breakdown')
    ax4.set_xticks(x)
    ax4.set_xticklabels(df['name'])
    ax4.legend()

    plt.tight_layout()
    plt.show()

    # Generate recommendations
    print("\n💡 Benchmark Configuration Recommendations:")

    # Find best configurations for different use cases
    best_fast = df.loc[df['benchmark_time_s'].idxmin()]
    best_accurate = df.loc[df['accuracy_error_percent'].idxmin()]
    best_precise = df.loc[df['precision_cv_percent'].idxmin()]
    best_balanced = df.loc[df['efficiency_score'].idxmax()]

    print(f"🚀 Fastest: {best_fast['name']} - {best_fast['benchmark_time_s']:.1f}s, {best_fast['accuracy_error_percent']:.1f}% error")
    print(f"🎯 Most Accurate: {best_accurate['name']} - {best_accurate['accuracy_error_percent']:.1f}% error")
    print(f"📊 Most Precise: {best_precise['name']} - {best_precise['precision_cv_percent']:.1f}% CV")
    print(f"⚖️ Best Balanced: {best_balanced['name']} - efficiency score {best_balanced['efficiency_score']:.1f}")

    print("\n🎯 Use Case Recommendations:")
    print("- Development/debugging: Use 'fast' config for quick feedback")
    print("- CI/CD pipelines: Use 'standard' config for reasonable accuracy/speed balance")
    print("- Performance optimization: Use 'accurate' config for reliable comparisons")
    print("- Research papers: Use 'precise' or 'research' config for publication-quality results")

optimize_benchmark_configuration()
# %% [markdown]
"""
# 7. Module Integration Test
# 5. Module Integration Test
Final validation that our complete benchmarking system works correctly and integrates properly with all TinyTorch components.

tinytorch/_modidx.py (generated)

@@ -21,7 +21,51 @@ d = { 'settings': { 'branch': 'main',
'doc_host': 'https://tinytorch.github.io',
'git_url': 'https://github.com/tinytorch/TinyTorch/',
'lib_path': 'tinytorch'},
'syms': { 'tinytorch.core.activations': { 'tinytorch.core.activations.GELU': ( '02_activations/activations_dev.html#gelu',
'syms': { 'tinytorch.benchmarking.benchmark': { 'tinytorch.benchmarking.benchmark.Benchmark': ( '19_benchmarking/benchmarking_dev.html#benchmark',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.Benchmark.__init__': ( '19_benchmarking/benchmarking_dev.html#benchmark.__init__',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.Benchmark.compare_models': ( '19_benchmarking/benchmarking_dev.html#benchmark.compare_models',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.Benchmark.run_accuracy_benchmark': ( '19_benchmarking/benchmarking_dev.html#benchmark.run_accuracy_benchmark',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.Benchmark.run_latency_benchmark': ( '19_benchmarking/benchmarking_dev.html#benchmark.run_latency_benchmark',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.Benchmark.run_memory_benchmark': ( '19_benchmarking/benchmarking_dev.html#benchmark.run_memory_benchmark',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.BenchmarkSuite': ( '19_benchmarking/benchmarking_dev.html#benchmarksuite',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.BenchmarkSuite.__init__': ( '19_benchmarking/benchmarking_dev.html#benchmarksuite.__init__',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.BenchmarkSuite._estimate_energy_efficiency': ( '19_benchmarking/benchmarking_dev.html#benchmarksuite._estimate_energy_efficiency',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.BenchmarkSuite.generate_report': ( '19_benchmarking/benchmarking_dev.html#benchmarksuite.generate_report',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.BenchmarkSuite.plot_pareto_frontier': ( '19_benchmarking/benchmarking_dev.html#benchmarksuite.plot_pareto_frontier',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.BenchmarkSuite.plot_results': ( '19_benchmarking/benchmarking_dev.html#benchmarksuite.plot_results',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.BenchmarkSuite.run_full_benchmark': ( '19_benchmarking/benchmarking_dev.html#benchmarksuite.run_full_benchmark',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.OlympicEvent': ( '19_benchmarking/benchmarking_dev.html#olympicevent',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.TinyMLPerf': ( '19_benchmarking/benchmarking_dev.html#tinymlperf',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.TinyMLPerf.__init__': ( '19_benchmarking/benchmarking_dev.html#tinymlperf.__init__',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.TinyMLPerf.generate_compliance_report': ( '19_benchmarking/benchmarking_dev.html#tinymlperf.generate_compliance_report',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.TinyMLPerf.run_all_benchmarks': ( '19_benchmarking/benchmarking_dev.html#tinymlperf.run_all_benchmarks',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.TinyMLPerf.run_standard_benchmark': ( '19_benchmarking/benchmarking_dev.html#tinymlperf.run_standard_benchmark',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.test_unit_benchmark': ( '19_benchmarking/benchmarking_dev.html#test_unit_benchmark',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.test_unit_benchmark_suite': ( '19_benchmarking/benchmarking_dev.html#test_unit_benchmark_suite',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.test_unit_tinymlperf': ( '19_benchmarking/benchmarking_dev.html#test_unit_tinymlperf',
'tinytorch/benchmarking/benchmark.py')},
'tinytorch.core.activations': { 'tinytorch.core.activations.GELU': ( '02_activations/activations_dev.html#gelu',
'tinytorch/core/activations.py'),
'tinytorch.core.activations.GELU.__call__': ( '02_activations/activations_dev.html#gelu.__call__',
'tinytorch/core/activations.py'),