Refactor Module 19 to TorchPerf Olympics framework

- Updated module title to TorchPerf Olympics Preparation
- Added OlympicEvent enum with 5 competition categories
- Removed meta-analysis sections (532 lines)
- Added section 4.5 on combination strategies and ablation studies
- Updated documentation to explain Olympic events and optimization order
- Module teaches benchmarking principles while preparing students for capstone
Vijay Janapa Reddi
2025-11-06 21:53:36 -05:00
parent 3dfaca0f19
commit 803ac39b07
3 changed files with 214 additions and 576 deletions


@@ -9,21 +9,23 @@
TinyTorch is a comprehensive educational ML framework designed for a Machine Learning Systems course. Students build every component from scratch, progressing from basic tensors through modern transformer architectures.
### Current Status: **Core Complete, Optimization Modules In Progress**
### Current Status: **Core Complete, Ready for TorchPerf Olympics Capstone!**
- **16/19 modules** fully implemented and exported ✅
- **19/19 modules** fully implemented and exported ✅
- **All 5 historical milestones** functional and tested ✅
- **Transformer module** with complete gradient flow ✅
- **KV Caching module** with 10-15x speedup ✅
- **Profiling module** with scientific performance measurement ✅
- **Quantization module** with INT8 compression ✅ NEW!
- **3 advanced modules** ready for implementation (16, 18-19)
- **Acceleration module** with vectorization and kernel fusion ✅
- **Quantization module** with INT8 compression ✅
- **Compression module** with pruning and distillation ✅
- **Benchmarking module (TorchPerf Olympics)** with standardized evaluation framework ✅ NEW!
---
## 📊 Module Implementation Status
### ✅ Fully Implemented (Modules 01-17)
### ✅ Fully Implemented (All 19 Modules!)
These modules are complete, tested, and exported to `tinytorch/`:
@@ -44,23 +46,23 @@ These modules are complete, tested, and exported to `tinytorch/`:
| 13 | **Transformers** | `tinytorch/models/transformer.py` | ✅ Complete | 1,726 |
| 14 | **KV Caching** | `tinytorch/generation/kv_cache.py` | ✅ Complete | 805 |
| 15 | **Profiling** | `tinytorch/profiling/profiler.py` | ✅ Complete | 155 |
| 16 | **Acceleration** | `tinytorch/acceleration/` | ✅ Complete | ~800 |
| 17 | **Quantization** | `tinytorch/optimization/quantization.py` | ✅ Complete | 289 |
| 18 | **Compression** | `tinytorch/optimization/compression.py` | ✅ Complete | ~600 |
| 19 | **Benchmarking** | `tinytorch/benchmarking/benchmark.py` | ✅ Complete | 1,100 |
**Total:** 18,699+ lines of educational ML code (including tests)
**Total:** 21,000+ lines of educational ML code (including tests)
### 🔧 Ready for Implementation (Modules 16, 18-19)
### 🏅 TorchPerf Olympics Capstone
These modules have source files created but need export:
**TorchPerf Olympics**: The capstone competition where students combine all optimization techniques (M14-18) and use the benchmarking framework (M19) to compete in 5 Olympic events (a minimal workflow sketch follows the list):
- 🏃 **Latency Sprint**: Fastest inference
- 🏋️ **Memory Challenge**: Smallest footprint
- 🎯 **Accuracy Contest**: Highest precision
- 🏋️‍♂️ **All-Around**: Best balance
- 🚀 **Extreme Push**: Most aggressive optimization
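A minimal sketch of what an event entry might look like, using the Module 19 `Benchmark` and `OlympicEvent` API from this commit (constructor arguments and result fields follow the examples shown later on this page and may differ slightly in the exported code; `baseline_model` and `optimized_model` stand in for the student's own models):

```python
from tinytorch.benchmarking.benchmark import Benchmark, OlympicEvent

# Enter an event, then compare the baseline against the optimized submission
event = OlympicEvent.LATENCY_SPRINT
benchmark = Benchmark([baseline_model, optimized_model],
                      [{"name": "baseline"}, {"name": "optimized"}])

results = benchmark.run_latency_benchmark()  # latency is what Latency Sprint scores
for name, result in results.items():
    print(f"{name}: {result.mean * 1000:.2f} ms")
```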
| Module | Name | Purpose | Priority |
|--------|------|---------|----------|
| 16 | **Acceleration** | Vectorization and fusion | 🔴 High |
| 18 | **Compression** | Pruning and distillation | 🟡 Medium |
| 19 | **Benchmarking** | Fair performance comparison | 🟡 Medium |
### 📚 Capstone (Module 20)
**TinyGPT**: Complete end-to-end language model project integrating all 19 modules.
🔥 Carry the torch. Optimize the model. Win the gold! 🏅
---
@@ -134,34 +136,35 @@ Modules 14-19: Production ML (Optimization, Profiling, Benchmarking)
---
## 🚀 Next Steps: Implementing Modules 14-19
## 🚀 Next Steps: TorchPerf Olympics Launch! 🏅
### Immediate Priority: Module 14 (KV Caching)
### All 19 Modules Complete! ✅
**Why Critical:**
- Makes generation 10x+ faster
- Essential for production transformers
- Unlocks interactive chatbot experiences
- Natural extension of Module 13
The TinyTorch educational framework is now complete with all core and optimization modules implemented:
- ✅ Modules 01-13: Core ML system (tensors through transformers)
- ✅ Modules 14-18: Optimization techniques (KV cache, profiling, acceleration, quantization, compression)
- ✅ Module 19: Benchmarking framework (TorchPerf Olympics)
**Implementation Plan:**
1. Edit `modules/source/14_kvcaching/kvcaching_dev.py`
2. Implement key-value cache data structure
3. Modify attention to reuse cached keys/values
4. Add cache-aware generation loop
5. Run `tito export` to export to `tinytorch/generation/`
6. Test with transformer generation benchmarks
### Ready for Capstone: TorchPerf Olympics
### Medium Priority: Modules 15-17
Students now have everything they need to:
1. **Build** their own ML models using M01-13
2. **Optimize** them using techniques from M14-18
3. **Benchmark** and **compete** using M19 TorchPerf Olympics framework
- **Module 15 (Profiling):** Measure what matters - timing, memory, FLOPs
- **Module 16 (Acceleration):** Operator fusion, kernel optimization
- **Module 17 (Quantization):** INT8/FP16 for smaller, faster models
**Olympic Events:**
- 🏃 Latency Sprint
- 🏋️ Memory Challenge
- 🎯 Accuracy Contest
- 🏋️‍♂️ All-Around Champion
- 🚀 Extreme Push
### Lower Priority: Modules 18-19
### Potential Future Enhancements
- **Module 18 (Compression):** Pruning, distillation techniques
- **Module 19 (Benchmarking):** Fair apples-to-apples comparisons
- **MLPerf-style Benchmark Suite**: Standardized competition baseline models
- **Cloud Leaderboard**: Real-time competition results and rankings
- **Advanced Optimizations**: Mixed precision training, distributed inference
- **Production Deployment**: Module 20 on serving and monitoring
---


@@ -17,29 +17,38 @@
# %% [markdown]
"""
# Module 19: Benchmarking - Fair Performance Comparison Systems
# Module 19: Benchmarking - TorchPerf Olympics Preparation
Welcome to the final implementation module! Today you'll build a comprehensive benchmarking system that can fairly compare different ML approaches across multiple dimensions.
Welcome to the final implementation module! You've learned individual optimization techniques in Modules 14-18. Now you'll build the benchmarking infrastructure that powers **TorchPerf Olympics** - the capstone competition framework.
## 🔗 Prerequisites & Progress
**You've Built**: Complete ML framework with profiling, acceleration, quantization, and compression
**You'll Build**: Professional benchmarking suite with statistical rigor and automated reporting
**You'll Enable**: Data-driven optimization decisions and performance regression detection
**You'll Build**: TorchPerf benchmarking system for fair model comparison and capstone submission
**You'll Enable**: Systematic optimization combination and competitive performance evaluation
**Connection Map**:
```
Profiling (Module 15) → Benchmarking (Module 19) → Systems Capstone (Milestone 5)
     (measurement)           (comparison)               (optimization)
Individual Optimizations (M14-18) → Benchmarking (M19) → TorchPerf Olympics (Capstone)
          (techniques)                 (evaluation)            (competition)
```
## 🏅 TorchPerf Olympics: The Capstone Framework
The TorchPerf Olympics is your capstone competition! Choose your event:
- 🏃 **Latency Sprint**: Minimize inference time (fastest model wins)
- 🏋️ **Memory Challenge**: Minimize model size (smallest footprint wins)
- 🎯 **Accuracy Contest**: Maximize accuracy within constraints
- 🏋️‍♂️ **All-Around**: Best balanced performance across all metrics
- 🚀 **Extreme Push**: Most aggressive optimization while staying viable
## Learning Objectives
By the end of this module, you will:
1. Implement comprehensive benchmarking infrastructure with statistical analysis
2. Build automated comparison systems across accuracy, latency, memory, and energy
3. Create professional reporting with visualization and recommendations
4. Integrate TinyMLPerf-style standardized benchmarks for reproducible results
1. Implement professional benchmarking infrastructure with statistical rigor
2. Learn to combine optimization techniques strategically (order matters!)
3. Build the `Benchmark` class - your standardized capstone submission framework
4. Understand ablation studies and systematic performance evaluation
Let's build the foundation for data-driven ML systems optimization!
🔥 Carry the torch. Optimize the model. Win the gold! 🏅
"""
# %% [markdown]
@@ -51,14 +60,19 @@ Let's build the foundation for data-driven ML systems optimization!
```python
# How to use this module:
from tinytorch.benchmarking.benchmark import Benchmark, BenchmarkSuite, TinyMLPerf
from tinytorch.benchmarking.benchmark import Benchmark, OlympicEvent
# For capstone submission:
benchmark = Benchmark([baseline_model, optimized_model],
                      [{"name": "baseline"}, {"name": "optimized"}])
results = benchmark.run_latency_benchmark()
```
**Why this matters:**
- **Learning:** Complete benchmarking ecosystem in one focused module for rigorous evaluation
- **Production:** Proper organization like MLPerf and TensorBoard profiling with all analysis tools together
- **TorchPerf Olympics:** The Benchmark class provides the standardized framework for capstone submissions
- **Consistency:** All benchmarking operations and reporting in benchmarking.benchmark
- **Integration:** Works seamlessly with optimization modules for complete systems evaluation
- **Integration:** Works seamlessly with optimization modules (M14-18) for complete systems evaluation
"""
# %% [markdown]
@@ -157,6 +171,23 @@ import warnings
# Import Profiler from Module 15 for measurement reuse
from tinytorch.profiling.profiler import Profiler
# %%
#| export
from enum import Enum
class OlympicEvent(Enum):
    """
    TorchPerf Olympics event categories.
    Each event optimizes for different objectives with specific constraints.
    Students choose their event and compete for medals!
    """
    LATENCY_SPRINT = "latency_sprint"      # Minimize latency (accuracy >= 85%)
    MEMORY_CHALLENGE = "memory_challenge"  # Minimize memory (accuracy >= 85%)
    ACCURACY_CONTEST = "accuracy_contest"  # Maximize accuracy (latency < 100ms, memory < 10MB)
    ALL_AROUND = "all_around"              # Best balanced score across all metrics
    EXTREME_PUSH = "extreme_push"          # Most aggressive optimization (accuracy >= 80%)
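# %% [markdown]
"""
A quick sketch of how an event's hard constraints (the thresholds in the comments above) might be checked against a submission's measured metrics. `EVENT_CONSTRAINTS` and `meets_constraints` are illustrative helpers for this discussion, not part of the exported API:

```python
EVENT_CONSTRAINTS = {
    OlympicEvent.LATENCY_SPRINT:   {"min_accuracy": 0.85},
    OlympicEvent.MEMORY_CHALLENGE: {"min_accuracy": 0.85},
    OlympicEvent.ACCURACY_CONTEST: {"max_latency_ms": 100, "max_memory_mb": 10},
    OlympicEvent.ALL_AROUND:       {},  # scored on balance across all metrics
    OlympicEvent.EXTREME_PUSH:     {"min_accuracy": 0.80},
}

def meets_constraints(event, accuracy, latency_ms, memory_mb):
    # Return True if a submission satisfies its event's hard constraints (sketch only)
    c = EVENT_CONSTRAINTS[event]
    return (accuracy >= c.get("min_accuracy", 0.0)
            and latency_ms <= c.get("max_latency_ms", float("inf"))
            and memory_mb <= c.get("max_memory_mb", float("inf")))

print(meets_constraints(OlympicEvent.LATENCY_SPRINT, accuracy=0.87, latency_ms=12, memory_mb=8))  # True
```
"""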
# %% [markdown]
"""
# 3. Implementation - Building Professional Benchmarking Infrastructure
@@ -1907,539 +1938,99 @@ test_unit_optimization_comparison()
# %% [markdown]
"""
# 5. Systems Analysis - Performance Engineering Insights
## 4.5 Combination Strategies - Preparing for TorchPerf Olympics
Let's analyze how our benchmarking system behaves under different conditions and reveal insights about measurement accuracy, system variability, and scalability patterns.
You've learned individual optimizations (M14-18). Now it's time to combine them strategically! The order and parameters matter significantly for final performance.
This analysis section demonstrates a key principle: **benchmark the benchmarking system itself**. Understanding how your measurement tools behave is crucial for interpreting results correctly.
### Why Combination Order Matters
## Why Analyze Measurement Systems?
Consider these two strategies:
- **Strategy A**: Quantize INT8 → Prune 70% → Fuse kernels
- **Strategy B**: Prune 70% → Quantize INT8 → Fuse kernels
Consider two scenarios:
- **Scenario A**: Your measurements show Model B is 10% faster than Model A
- **Scenario B**: Your measurements show Model B is 10% faster, but measurement uncertainty is ±15%
Strategy A might preserve more accuracy because quantization happens first (on the full network), while Strategy B might be faster because pruning reduces what needs to be quantized. The "best" depends on your Olympic event!
In Scenario A, you might deploy Model B. In Scenario B, the difference isn't statistically significant - you can't trust the comparison.
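A minimal sketch of the two orderings (Strategy A and Strategy B) above, reusing the `quantize_model` and `magnitude_prune` helpers that appear in the example later in this section (exact signatures are assumed; a kernel-fusion step would chain on in the same way):

```python
from tinytorch.optimization.quantization import quantize_model
from tinytorch.optimization.compression import magnitude_prune

# Strategy A: quantize the full network first, then prune
model_a = magnitude_prune(quantize_model(baseline_model, bits=8), sparsity=0.7)

# Strategy B: prune first, then quantize what remains
model_b = quantize_model(magnitude_prune(baseline_model, sparsity=0.7), bits=8)

# Benchmark both - which one "wins" depends on the event you entered
```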
### Ablation Studies: Understanding Individual Contributions
Professional benchmarking requires understanding and quantifying measurement uncertainty.
Professional ML engineers use **ablation studies** to understand what each optimization contributes:
```
Baseline:         Accuracy: 89%, Latency: 45ms, Memory: 12MB
+ Quantization:   Accuracy: 88%, Latency: 30ms, Memory: 3MB    (Δ: -1%, -33%, -75%)
+ Pruning:        Accuracy: 87%, Latency: 22ms, Memory: 2MB    (Δ: -1%, -27%, -33%)
+ Kernel Fusion:  Accuracy: 87%, Latency: 18ms, Memory: 2MB    (Δ: 0%, -18%, 0%)
Conclusion: Quantization provides biggest memory reduction, fusion provides latency boost
```
This systematic analysis tells you what to prioritize for each Olympic event!
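One way to produce such a table is to apply the techniques cumulatively and benchmark after each step. A sketch using the same helpers and `Benchmark` API as the example below (the accuracy measurement depends on your dataset and is omitted here):

```python
steps = [
    ("+ Quantization", lambda m: quantize_model(m, bits=8)),
    ("+ Pruning",      lambda m: magnitude_prune(m, sparsity=0.6)),
]

model = baseline_model
for label, apply_step in steps:
    model = apply_step(model)
    bench = Benchmark([model], [{"name": label}])
    latency_s = list(bench.run_latency_benchmark().values())[0].mean
    print(f"{label}: {latency_s * 1000:.1f} ms")  # record the delta against the previous step
```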
### Olympic Event Strategies
**🏃 Latency Sprint**: Minimize inference time
- Priority: Kernel fusion > KV caching > Quantization > Pruning
- Risk: Aggressive optimizations may hurt accuracy
- Tip: Start with proven speed techniques, then add memory techniques if needed
**🏋️ Memory Challenge**: Minimize model footprint
- Priority: Quantization > Pruning > Compression
- Risk: Model quality degradation
- Tip: Quantize first (4x memory reduction), then prune to meet target
**🎯 Accuracy Contest**: Maximize accuracy within constraints
- Priority: Minimal optimizations, careful tuning
- Risk: Not enough optimization to meet constraints
- Tip: Use high-bit quantization (8-bit), light pruning (30-50%)
**🏋️‍♂️ All-Around**: Best balanced performance
- Priority: Balanced application of all techniques
- Risk: Jack of all trades, master of none
- Tip: Use moderate settings for each technique (INT8, 60% pruning, selective fusion)
**🚀 Extreme Push**: Most aggressive optimization
- Priority: Maximum of everything
- Risk: Significant accuracy loss
- Tip: Start with 4-bit quantization + 90% pruning, verify accuracy threshold
### Example: Combining for All-Around Event
```python
from tinytorch.optimization.quantization import quantize_model
from tinytorch.optimization.compression import magnitude_prune
from tinytorch.generation.kv_cache import enable_kv_cache
# Load baseline
baseline_model = load_baseline("cifar10_cnn")
# Apply balanced optimization strategy
optimized = baseline_model
# Step 1: Quantize to INT8 (moderate precision)
optimized = quantize_model(optimized, bits=8)
# Step 2: Prune 60% (moderate sparsity)
optimized = magnitude_prune(optimized, sparsity=0.6)
# Step 3: Enable KV cache for transformers (if applicable)
if hasattr(optimized, 'transformer_blocks'):
    enable_kv_cache(optimized)
# Benchmark using TorchPerf
from tinytorch.benchmarking.benchmark import Benchmark, OlympicEvent
benchmark = Benchmark([baseline_model, optimized],
                      [{"name": "baseline"}, {"name": "optimized"}])
results = benchmark.run_latency_benchmark()
# Compare and iterate!
```
The key: **Start with one technique, measure impact, add next technique, repeat!**
"""
# %% [markdown]
"""
## Measurement Variance Analysis
Understanding measurement variance is fundamental to statistical significance. This analysis reveals how sample size affects measurement reliability and helps determine optimal benchmark configurations.
### Statistical Significance in Practice
When you measure a model's latency multiple times, you get a distribution of values. The key insight: **more measurements reduce uncertainty about the true mean, but with diminishing returns**.
```
Measurement Variance Relationship:
Standard Error = σ / √n
Where:
- σ = underlying measurement noise
- n = number of samples
- Standard Error = uncertainty in the estimated mean
Doubling samples reduces uncertainty by √2 ≈ 1.41x
10x samples reduces uncertainty by √10 ≈ 3.16x
```
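For example, a quick numpy check (the simulated latencies are illustrative):

```python
import numpy as np

latencies_ms = np.random.normal(loc=45.0, scale=3.0, size=100)  # simulated noisy measurements

for n in (10, 50, 100):
    sample = latencies_ms[:n]
    se = sample.std(ddof=1) / np.sqrt(n)  # Standard Error = σ / √n
    print(f"n={n:3d}: mean={sample.mean():.2f} ms, standard error={se:.2f} ms")
```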
### Variance Sources in ML Benchmarking
**System-Level Variance**:
- CPU frequency scaling (thermal throttling)
- Background processes (OS scheduling)
- Memory pressure (garbage collection)
- Network traffic (for distributed models)
**Algorithm-Level Variance**:
- Input-dependent computation paths
- Random initialization effects
- Numerical precision variations
**Measurement-Level Variance**:
- Timer resolution and overhead
- Function call overhead
- Memory allocation patterns
This analysis quantifies these effects and determines optimal measurement protocols.
"""
# %% nbgrader={"grade": false, "grade_id": "analyze-measurement-variance", "solution": true}
def analyze_measurement_variance():
    """📊 Analyze how measurement variance affects benchmark reliability."""
    print("📊 Analyzing measurement variance and statistical significance...")

    # Create a simple test model for consistent analysis
    class TestModel:
        def __init__(self, base_latency=0.001):
            self.base_latency = base_latency
            self.name = "test_model"

        def forward(self, x):
            # Add realistic variance sources
            system_noise = np.random.normal(0, 0.0001)       # System noise
            thermal_variance = np.random.normal(0, 0.00005)  # CPU frequency variation
            time.sleep(max(0, self.base_latency + system_noise + thermal_variance))
            return x

    model = TestModel()

    # Test different numbers of measurement runs
    run_counts = [3, 5, 10, 20, 50, 100]
    variance_results = []

    for num_runs in run_counts:
        benchmark = Benchmark([model], [{"data": "test"}],
                              warmup_runs=2, measurement_runs=num_runs)

        # Run multiple benchmark sessions to see variance between sessions
        session_means = []
        session_stds = []
        for session in range(5):  # 5 different benchmark sessions
            results = benchmark.run_latency_benchmark()
            result = list(results.values())[0]
            session_means.append(result.mean)
            session_stds.append(result.std)

        # Calculate variance across sessions
        mean_of_means = np.mean(session_means)
        std_of_means = np.std(session_means)
        mean_of_stds = np.mean(session_stds)

        variance_results.append({
            'num_runs': num_runs,
            'mean_latency': mean_of_means,
            'std_between_sessions': std_of_means,
            'mean_std_within_session': mean_of_stds,
            'coefficient_of_variation': std_of_means / mean_of_means if mean_of_means > 0 else 0
        })

    # Plot results
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

    # Plot 1: Standard deviation vs number of runs
    num_runs_list = [r['num_runs'] for r in variance_results]
    between_session_std = [r['std_between_sessions'] * 1000 for r in variance_results]  # Convert to ms
    within_session_std = [r['mean_std_within_session'] * 1000 for r in variance_results]

    ax1.plot(num_runs_list, between_session_std, 'o-', label='Between Sessions', linewidth=2)
    ax1.plot(num_runs_list, within_session_std, 's-', label='Within Session', linewidth=2)
    ax1.set_xlabel('Number of Measurement Runs')
    ax1.set_ylabel('Standard Deviation (ms)')
    ax1.set_title('Measurement Variance vs Sample Size')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    ax1.set_xscale('log')

    # Plot 2: Coefficient of variation
    cv_values = [r['coefficient_of_variation'] * 100 for r in variance_results]
    ax2.plot(num_runs_list, cv_values, 'o-', color='red', linewidth=2)
    ax2.set_xlabel('Number of Measurement Runs')
    ax2.set_ylabel('Coefficient of Variation (%)')
    ax2.set_title('Measurement Reliability vs Sample Size')
    ax2.grid(True, alpha=0.3)
    ax2.set_xscale('log')

    plt.tight_layout()
    plt.show()

    # Key insights
    print("\n💡 Measurement Variance Analysis:")
    print(f"With 10 runs: CV = {variance_results[2]['coefficient_of_variation']:.1%}")
    print(f"With 50 runs: CV = {variance_results[4]['coefficient_of_variation']:.1%}")
    print(f"With 100 runs: CV = {variance_results[5]['coefficient_of_variation']:.1%}")

    if variance_results[4]['coefficient_of_variation'] < 0.05:
        print("🚀 50+ runs provide stable measurements (CV < 5%)")
    else:
        print("⚠️ High variance detected - consider longer warmup or controlled environment")

analyze_measurement_variance()
# %% [markdown]
"""
## Benchmark Scaling Analysis
Understanding how benchmark overhead scales with model complexity helps optimize measurement protocols and interpret results correctly.
### Why Benchmark Overhead Matters
Every measurement tool adds overhead. For benchmarking to be meaningful, this overhead must be:
1. **Consistent**: Same overhead across different models
2. **Minimal**: Small compared to what you're measuring
3. **Predictable**: Understood so you can account for it
### Overhead Analysis Framework
```
Total Measured Time = True Model Time + Benchmark Overhead
Benchmark Overhead includes:
├── Framework setup (model loading, input preparation)
├── Timing infrastructure (context managers, precision counters)
├── Result collection (statistics, metadata gathering)
└── System interactions (memory allocation, Python overhead)
```
### Scaling Behavior Patterns
**Good Scaling**: Overhead decreases as percentage of total time
- Simple models: 20% overhead (still usable)
- Complex models: 2% overhead (negligible)
**Bad Scaling**: Overhead increases with model complexity
- Indicates benchmark framework bottlenecks
- Makes results unreliable for optimization decisions
**Optimal Configuration**: Overhead < 5% for target model complexity range
This analysis identifies the optimal benchmark configuration for different model types and deployment scenarios.
"""
# %% nbgrader={"grade": false, "grade_id": "analyze-scaling-behavior", "solution": true}
def analyze_scaling_behavior():
    """📊 Analyze how benchmark overhead scales with model and input complexity."""
    print("📊 Analyzing benchmark overhead and scaling behavior...")

    # Create models with different computational complexity
    class ScalingTestModel:
        def __init__(self, complexity_factor, name):
            self.complexity_factor = complexity_factor
            self.name = name

        def forward(self, x):
            # Simulate computational work proportional to complexity
            base_time = 0.001  # 1ms base
            compute_time = base_time * self.complexity_factor

            # Simulate actual computation with matrix operations
            if hasattr(x, 'shape'):
                size = np.prod(x.shape)
            else:
                size = len(x) if hasattr(x, '__len__') else 100

            # Simulate memory allocation and computation
            temp_data = np.random.randn(int(size * self.complexity_factor))
            _ = np.sum(temp_data * temp_data)  # Some computation
            time.sleep(compute_time)
            return x

    # Models with different complexity
    models = [
        ScalingTestModel(1, "simple_model"),
        ScalingTestModel(5, "medium_model"),
        ScalingTestModel(20, "complex_model"),
        ScalingTestModel(100, "very_complex_model")
    ]

    # Test different input sizes
    input_sizes = [(1, 28, 28), (1, 64, 64), (1, 128, 128), (1, 256, 256)]
    scaling_results = []

    for input_shape in input_sizes:
        print(f"Testing input shape: {input_shape}")
        for model in models:
            # Measure pure model time (without benchmark overhead)
            dummy_input = np.random.randn(*input_shape).astype(np.float32)
            pure_times = []
            for _ in range(10):
                with precise_timer() as timer:
                    model.forward(dummy_input)
                pure_times.append(timer.elapsed * 1000)
            pure_mean = np.mean(pure_times)

            # Measure with benchmark framework
            benchmark = Benchmark([model], [{"data": "test"}],
                                  warmup_runs=3, measurement_runs=10)
            bench_results = benchmark.run_latency_benchmark(input_shape)
            bench_mean = list(bench_results.values())[0].mean * 1000  # convert seconds to ms to match pure_mean

            # Calculate overhead
            overhead_ms = bench_mean - pure_mean
            overhead_percent = (overhead_ms / pure_mean) * 100 if pure_mean > 0 else 0

            scaling_results.append({
                'input_size': np.prod(input_shape),
                'model_complexity': model.complexity_factor,
                'model_name': model.name,
                'pure_latency_ms': pure_mean,
                'benchmark_latency_ms': bench_mean,
                'overhead_ms': overhead_ms,
                'overhead_percent': overhead_percent
            })

    # Create DataFrame for analysis
    df = pd.DataFrame(scaling_results)

    # Plot results
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

    # Plot 1: Overhead vs model complexity
    for input_size in [784, 4096, 16384, 65536]:  # Representative sizes
        subset = df[df['input_size'] == input_size]
        if not subset.empty:
            ax1.plot(subset['model_complexity'], subset['overhead_percent'],
                     'o-', label=f'Input size: {input_size}', linewidth=2)
    ax1.set_xlabel('Model Complexity Factor')
    ax1.set_ylabel('Benchmark Overhead (%)')
    ax1.set_title('Benchmark Overhead vs Model Complexity')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    ax1.set_xscale('log')

    # Plot 2: Absolute overhead vs input size
    for complexity in [1, 5, 20, 100]:
        subset = df[df['model_complexity'] == complexity]
        if not subset.empty:
            ax2.plot(subset['input_size'], subset['overhead_ms'],
                     'o-', label=f'Complexity: {complexity}x', linewidth=2)
    ax2.set_xlabel('Input Size (elements)')
    ax2.set_ylabel('Benchmark Overhead (ms)')
    ax2.set_title('Benchmark Overhead vs Input Size')
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    ax2.set_xscale('log')

    plt.tight_layout()
    plt.show()

    # Analysis insights
    print("\n💡 Scaling Behavior Analysis:")

    # Find overhead patterns
    high_complexity_overhead = df[df['model_complexity'] >= 20]['overhead_percent'].mean()
    low_complexity_overhead = df[df['model_complexity'] <= 5]['overhead_percent'].mean()
    print(f"Low complexity models: {low_complexity_overhead:.1f}% overhead")
    print(f"High complexity models: {high_complexity_overhead:.1f}% overhead")

    if high_complexity_overhead < 5:
        print("🚀 Benchmark overhead is negligible for complex models")
    elif low_complexity_overhead > 20:
        print("⚠️ High overhead for simple models - consider optimization")
    else:
        print("✅ Benchmark scaling is appropriate for intended use cases")

analyze_scaling_behavior()
# %% [markdown]
"""
# 6. Optimization Insights - Trade-offs and Production Patterns
Understanding the real-world implications of benchmarking decisions and how to optimize the measurement process itself for different use cases.
This section addresses a meta-question: **How do you optimize the optimization process?** Different use cases need different measurement trade-offs.
## Benchmarking Configuration Optimization
Professional ML teams face a fundamental trade-off in benchmarking:
- **More accurate measurements** require more time and resources
- **Faster measurements** enable more iteration but with less precision
- **Different development phases** need different measurement fidelity
The goal: Find the minimum measurement overhead that provides sufficient confidence for decision-making.
"""
# %% [markdown]
"""
## Optimal Benchmark Configuration Analysis
This analysis helps determine the right benchmark configuration for different development scenarios. It's a practical application of statistics to engineering workflow optimization.
### The Measurement Fidelity Spectrum
```
Development Phase          Accuracy Need    Speed Need    Optimal Config
─────────────────────────────────────────────────────────────────────
Rapid prototyping          Low              High          Fast (5 runs)
Feature development        Medium           Medium        Standard (20 runs)
Performance optimization   High             Low           Accurate (50 runs)
Production validation      Very High        Very Low      Research (100+ runs)
Regression testing         Medium           High          Automated (15 runs)
```
### Multi-Objective Optimization for Benchmarking
We optimize across three competing objectives:
1. **Accuracy**: How close to the true performance value
2. **Precision**: How consistent are repeated measurements
3. **Speed**: How quickly we get results
```
Benchmark Configuration Optimization:
minimize:    w₁×(accuracy_error) + w₂×(precision_error) + w₃×(time_cost)
subject to:  measurement_runs ≥ min_statistical_power
             total_time ≤ max_allowed_time
Where weights w₁, w₂, w₃ depend on use case
```
This analysis empirically determines optimal configurations for different scenarios.
"""
# %% nbgrader={"grade": false, "grade_id": "benchmark-optimization", "solution": true}
def optimize_benchmark_configuration():
    """📊 Find optimal benchmark configuration for different accuracy vs speed needs."""
    print("📊 Optimizing benchmark configuration for different use cases...")

    # Test model for configuration optimization
    class ConfigTestModel:
        def __init__(self):
            self.name = "config_test_model"

        def forward(self, x):
            # Consistent baseline with small variance
            time.sleep(0.002 + np.random.normal(0, 0.0001))
            return x

    model = ConfigTestModel()

    # Test different configuration combinations
    configurations = [
        {'warmup': 1, 'runs': 5, 'name': 'fast'},
        {'warmup': 3, 'runs': 10, 'name': 'standard'},
        {'warmup': 5, 'runs': 20, 'name': 'accurate'},
        {'warmup': 10, 'runs': 50, 'name': 'precise'},
        {'warmup': 15, 'runs': 100, 'name': 'research'}
    ]
    config_results = []

    # Ground truth: run very long benchmark to get "true" value
    true_benchmark = Benchmark([model], [{"data": "test"}],
                               warmup_runs=20, measurement_runs=200)
    true_results = true_benchmark.run_latency_benchmark()
    true_latency = list(true_results.values())[0].mean
    print(f"Ground truth latency: {true_latency:.4f}s")

    for config in configurations:
        print(f"\nTesting {config['name']} configuration...")

        # Run multiple trials with this configuration
        trial_results = []
        total_time_spent = []
        for trial in range(8):  # 8 trials per configuration
            start_time = time.time()
            benchmark = Benchmark([model], [{"data": "test"}],
                                  warmup_runs=config['warmup'],
                                  measurement_runs=config['runs'])
            results = benchmark.run_latency_benchmark()
            measured_latency = list(results.values())[0].mean
            end_time = time.time()
            trial_results.append(measured_latency)
            total_time_spent.append(end_time - start_time)

        # Calculate accuracy and efficiency metrics
        trial_mean = np.mean(trial_results)
        trial_std = np.std(trial_results)
        accuracy_error = abs(trial_mean - true_latency) / true_latency * 100
        precision_cv = trial_std / trial_mean * 100 if trial_mean > 0 else 0
        avg_benchmark_time = np.mean(total_time_spent)

        config_results.append({
            'name': config['name'],
            'warmup_runs': config['warmup'],
            'measurement_runs': config['runs'],
            'total_runs': config['warmup'] + config['runs'],
            'accuracy_error_percent': accuracy_error,
            'precision_cv_percent': precision_cv,
            'benchmark_time_s': avg_benchmark_time,
            'efficiency_score': 100 / (accuracy_error + precision_cv + avg_benchmark_time * 10)  # Combined score
        })

    # Create comparison DataFrame
    df = pd.DataFrame(config_results)

    # Visualize trade-offs
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))

    # Plot 1: Accuracy vs Speed
    ax1.scatter(df['benchmark_time_s'], df['accuracy_error_percent'],
                s=100, alpha=0.7, c=df['total_runs'], cmap='viridis')
    for i, name in enumerate(df['name']):
        ax1.annotate(name, (df['benchmark_time_s'].iloc[i], df['accuracy_error_percent'].iloc[i]),
                     xytext=(5, 5), textcoords='offset points')
    ax1.set_xlabel('Benchmark Time (seconds)')
    ax1.set_ylabel('Accuracy Error (%)')
    ax1.set_title('Accuracy vs Speed Trade-off')
    ax1.grid(True, alpha=0.3)

    # Plot 2: Precision vs Speed
    ax2.scatter(df['benchmark_time_s'], df['precision_cv_percent'],
                s=100, alpha=0.7, c=df['total_runs'], cmap='viridis')
    for i, name in enumerate(df['name']):
        ax2.annotate(name, (df['benchmark_time_s'].iloc[i], df['precision_cv_percent'].iloc[i]),
                     xytext=(5, 5), textcoords='offset points')
    ax2.set_xlabel('Benchmark Time (seconds)')
    ax2.set_ylabel('Precision CV (%)')
    ax2.set_title('Precision vs Speed Trade-off')
    ax2.grid(True, alpha=0.3)

    # Plot 3: Efficiency comparison
    ax3.bar(df['name'], df['efficiency_score'], alpha=0.7)
    ax3.set_ylabel('Efficiency Score (higher = better)')
    ax3.set_title('Overall Benchmark Efficiency')
    ax3.tick_params(axis='x', rotation=45)

    # Plot 4: Configuration breakdown
    width = 0.35
    x = np.arange(len(df))
    ax4.bar(x - width/2, df['warmup_runs'], width, label='Warmup Runs', alpha=0.7)
    ax4.bar(x + width/2, df['measurement_runs'], width, label='Measurement Runs', alpha=0.7)
    ax4.set_xlabel('Configuration')
    ax4.set_ylabel('Number of Runs')
    ax4.set_title('Configuration Breakdown')
    ax4.set_xticks(x)
    ax4.set_xticklabels(df['name'])
    ax4.legend()

    plt.tight_layout()
    plt.show()

    # Generate recommendations
    print("\n💡 Benchmark Configuration Recommendations:")

    # Find best configurations for different use cases
    best_fast = df.loc[df['benchmark_time_s'].idxmin()]
    best_accurate = df.loc[df['accuracy_error_percent'].idxmin()]
    best_precise = df.loc[df['precision_cv_percent'].idxmin()]
    best_balanced = df.loc[df['efficiency_score'].idxmax()]

    print(f"🚀 Fastest: {best_fast['name']} - {best_fast['benchmark_time_s']:.1f}s, {best_fast['accuracy_error_percent']:.1f}% error")
    print(f"🎯 Most Accurate: {best_accurate['name']} - {best_accurate['accuracy_error_percent']:.1f}% error")
    print(f"📊 Most Precise: {best_precise['name']} - {best_precise['precision_cv_percent']:.1f}% CV")
    print(f"⚖️ Best Balanced: {best_balanced['name']} - efficiency score {best_balanced['efficiency_score']:.1f}")

    print("\n🎯 Use Case Recommendations:")
    print("- Development/debugging: Use 'fast' config for quick feedback")
    print("- CI/CD pipelines: Use 'standard' config for reasonable accuracy/speed balance")
    print("- Performance optimization: Use 'accurate' config for reliable comparisons")
    print("- Research papers: Use 'precise' or 'research' config for publication-quality results")

optimize_benchmark_configuration()
# %% [markdown]
"""
# 7. Module Integration Test
# 5. Module Integration Test
Final validation that our complete benchmarking system works correctly and integrates properly with all TinyTorch components.

tinytorch/_modidx.py (generated)

@@ -21,7 +21,51 @@ d = { 'settings': { 'branch': 'main',
'doc_host': 'https://tinytorch.github.io',
'git_url': 'https://github.com/tinytorch/TinyTorch/',
'lib_path': 'tinytorch'},
'syms': { 'tinytorch.core.activations': { 'tinytorch.core.activations.GELU': ( '02_activations/activations_dev.html#gelu',
'syms': { 'tinytorch.benchmarking.benchmark': { 'tinytorch.benchmarking.benchmark.Benchmark': ( '19_benchmarking/benchmarking_dev.html#benchmark',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.Benchmark.__init__': ( '19_benchmarking/benchmarking_dev.html#benchmark.__init__',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.Benchmark.compare_models': ( '19_benchmarking/benchmarking_dev.html#benchmark.compare_models',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.Benchmark.run_accuracy_benchmark': ( '19_benchmarking/benchmarking_dev.html#benchmark.run_accuracy_benchmark',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.Benchmark.run_latency_benchmark': ( '19_benchmarking/benchmarking_dev.html#benchmark.run_latency_benchmark',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.Benchmark.run_memory_benchmark': ( '19_benchmarking/benchmarking_dev.html#benchmark.run_memory_benchmark',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.BenchmarkSuite': ( '19_benchmarking/benchmarking_dev.html#benchmarksuite',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.BenchmarkSuite.__init__': ( '19_benchmarking/benchmarking_dev.html#benchmarksuite.__init__',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.BenchmarkSuite._estimate_energy_efficiency': ( '19_benchmarking/benchmarking_dev.html#benchmarksuite._estimate_energy_efficiency',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.BenchmarkSuite.generate_report': ( '19_benchmarking/benchmarking_dev.html#benchmarksuite.generate_report',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.BenchmarkSuite.plot_pareto_frontier': ( '19_benchmarking/benchmarking_dev.html#benchmarksuite.plot_pareto_frontier',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.BenchmarkSuite.plot_results': ( '19_benchmarking/benchmarking_dev.html#benchmarksuite.plot_results',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.BenchmarkSuite.run_full_benchmark': ( '19_benchmarking/benchmarking_dev.html#benchmarksuite.run_full_benchmark',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.OlympicEvent': ( '19_benchmarking/benchmarking_dev.html#olympicevent',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.TinyMLPerf': ( '19_benchmarking/benchmarking_dev.html#tinymlperf',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.TinyMLPerf.__init__': ( '19_benchmarking/benchmarking_dev.html#tinymlperf.__init__',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.TinyMLPerf.generate_compliance_report': ( '19_benchmarking/benchmarking_dev.html#tinymlperf.generate_compliance_report',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.TinyMLPerf.run_all_benchmarks': ( '19_benchmarking/benchmarking_dev.html#tinymlperf.run_all_benchmarks',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.TinyMLPerf.run_standard_benchmark': ( '19_benchmarking/benchmarking_dev.html#tinymlperf.run_standard_benchmark',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.test_unit_benchmark': ( '19_benchmarking/benchmarking_dev.html#test_unit_benchmark',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.test_unit_benchmark_suite': ( '19_benchmarking/benchmarking_dev.html#test_unit_benchmark_suite',
'tinytorch/benchmarking/benchmark.py'),
'tinytorch.benchmarking.benchmark.test_unit_tinymlperf': ( '19_benchmarking/benchmarking_dev.html#test_unit_tinymlperf',
'tinytorch/benchmarking/benchmark.py')},
'tinytorch.core.activations': { 'tinytorch.core.activations.GELU': ( '02_activations/activations_dev.html#gelu',
'tinytorch/core/activations.py'),
'tinytorch.core.activations.GELU.__call__': ( '02_activations/activations_dev.html#gelu.__call__',
'tinytorch/core/activations.py'),