mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-06-04 14:05:50 -05:00
- Updated all module references to start from 01 instead of 00 - Changed tagline to 'Build your own ML framework. Start small. Go deep.' - Added educational foundation section linking to ML Systems book - Updated README, documentation, CLI examples, and prerequisites - Regenerated book content with consistent numbering throughout - Maintains 14 modules total but with natural numbering (01-14)
157 lines
5.8 KiB
Markdown
# 📊 Module 12: Benchmarking - Systematic ML Performance Evaluation

## 📊 Module Info
- **Difficulty**: ⭐⭐⭐⭐ Advanced
- **Time Estimate**: 6-8 hours
- **Prerequisites**: All previous modules (01-11), especially Kernels
- **Next Steps**: MLOps module (13)

**Learn to systematically evaluate ML systems using industry-standard benchmarking methodology**

## 🎯 Learning Objectives

After completing this module, you will:
- Design systematic benchmarking experiments for ML systems
- Apply MLPerf-inspired patterns to evaluate model performance
- Implement statistical validation for benchmark results
- Create professional performance reports and comparisons
- Apply systematic evaluation to real ML projects

## 🔗 Connection to Previous Modules

### What You Already Know
- **Kernels (Module 11)**: *How* to optimize individual operations
- **Training (Module 09)**: End-to-end model training workflows
- **Compression (Module 10)**: Model optimization techniques
- **Networks (Module 04)**: Model architectures and complexity

### The Evaluation Gap
By this point you know **how to build** ML systems but not yet **how to evaluate** them systematically:
- ✅ **Implementation**: You can build tensors, layers, networks, and optimizers
- ❌ **Evaluation**: You don't yet know how to measure performance reliably
- ✅ **Optimization**: You can implement kernels and compression
- ❌ **Validation**: You can't yet prove that your optimizations actually work

## 🧠 Build → Use → Analyze

This module follows the **"Build → Use → Analyze"** pedagogical framework:

### 1. **Build**: Benchmarking Framework
- Understand the four-component MLPerf architecture
- Learn the different benchmark scenarios (single-stream, server, offline)
- Implement statistical validation for meaningful results

### 2. **Use**: Systematic Evaluation
- Apply benchmarking to your TinyTorch models
- Compare different approaches systematically
- Validate optimization claims with proper methodology

### 3. **Analyze**: Professional Reporting
- Generate industry-standard performance reports
- Present results with statistical confidence
- Prepare for capstone project presentations

## 🎓 Why This Matters

### **Industry Reality**
Real ML engineers spend significant time on:
- **A/B testing**: Comparing model variants in production
- **Performance optimization**: Proving optimizations actually work
- **Research validation**: Demonstrating improvements over baselines
- **System design**: Choosing between architectural alternatives

### **Professional Applications**
This module prepares you for:
- **ML project evaluation**: Systematic comparison against baselines
- **Performance presentations**: Professional reporting of results
- **Statistical validation**: Proving your improvements are significant
- **Research methodology**: Reproducible evaluation practices

## 🚀 Key Concepts

### **MLPerf-Inspired Architecture**
- **System Under Test (SUT)**: Your ML model/system
- **Dataset**: Standardized evaluation data
- **Model**: The specific architecture being tested
- **Load Generator**: Controls how evaluation queries are sent

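To make the four roles concrete, here is a minimal sketch of how they might fit together. The class and function names (`SystemUnderTest`, `LoadGenerator`, `toy_model`) are hypothetical stand-ins chosen for illustration, not the TinyTorch API, and the "model" is a plain function rather than a network.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Sequence


def toy_model(x: float) -> float:
    """Model: the specific function being tested (a stand-in for a network)."""
    return 2.0 * x + 1.0


@dataclass
class SystemUnderTest:
    """SUT: wraps the model behind a single predict() entry point (hypothetical)."""
    model: Callable[[float], float]

    def predict(self, sample: float) -> float:
        return self.model(sample)


@dataclass
class LoadGenerator:
    """Load generator: decides how queries are sent and times each one (hypothetical)."""
    sut: SystemUnderTest
    dataset: Sequence[float]

    def run(self) -> List[float]:
        latencies = []
        for sample in self.dataset:  # one query at a time
            start = time.perf_counter()
            self.sut.predict(sample)
            latencies.append(time.perf_counter() - start)
        return latencies


# Dataset: fixed evaluation inputs (made-up values for illustration)
dataset = [0.0, 1.0, 2.0, 3.0]
sut = SystemUnderTest(model=toy_model)
latencies = LoadGenerator(sut=sut, dataset=dataset).run()
print(f"{len(latencies)} queries, mean latency {sum(latencies) / len(latencies):.2e} s")
```

Keeping the load generator separate from the SUT means the same model can be measured under different query patterns without touching the model code.
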
### **Benchmark Scenarios**
- **Single-Stream**: Measures per-query latency, one query at a time (mobile/edge use cases)
- **Server**: Measures sustained throughput as queries arrive over time (production server use cases)
- **Offline**: Measures batch-processing throughput (data center use cases)

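The difference between the single-stream and offline scenarios can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions: the function names, the warm-up count, and the nearest-rank percentile are choices made here, not part of TinyTorch or MLPerf. The server scenario is omitted because it also needs a model of query arrival times.

```python
import math
import time
from typing import Callable, List, Sequence


def single_stream_latencies(predict: Callable, samples: Sequence, warmup: int = 3) -> List[float]:
    """Time one query at a time, after a few untimed warm-up calls."""
    for sample in samples[:warmup]:
        predict(sample)  # warm-up: results and timings are discarded
    latencies = []
    for sample in samples:
        start = time.perf_counter()
        predict(sample)
        latencies.append(time.perf_counter() - start)
    return latencies


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def offline_throughput(predict_batch: Callable, samples: Sequence, batch_size: int) -> float:
    """Process every sample in batches and return samples per second."""
    start = time.perf_counter()
    for i in range(0, len(samples), batch_size):
        predict_batch(samples[i:i + batch_size])
    elapsed = max(time.perf_counter() - start, 1e-12)  # guard against a zero reading
    return len(samples) / elapsed


samples = list(range(100))  # made-up inputs for illustration
latencies = single_stream_latencies(lambda x: x * 2, samples)
print(f"p90 latency: {percentile(latencies, 90):.2e} s")
qps = offline_throughput(lambda batch: [x * 2 for x in batch], samples, batch_size=10)
print(f"offline throughput: {qps:.0f} samples/s")
```

Reporting a tail percentile for latency rather than the mean reflects that a few slow queries matter most to an interactive user, while the offline number only cares about total work per unit time.
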
### **Statistical Validation**
- **Confidence intervals**: Ensuring results are meaningful
- **Multiple runs**: Accounting for variability
- **Significance testing**: Proving improvements are real
- **Pitfall detection**: Avoiding common benchmarking mistakes

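As a rough illustration of the first three ideas, the sketch below computes a confidence interval over repeated runs and a permutation-test p-value for the difference between two configurations. The helper names and the latency numbers are made up for illustration, and the interval uses a normal approximation, which is somewhat too narrow when there are only a handful of runs.

```python
import random
import statistics
from statistics import NormalDist
from typing import Sequence, Tuple


def confidence_interval(samples: Sequence[float], confidence: float = 0.95) -> Tuple[float, float]:
    """Normal-approximation confidence interval for the mean of repeated runs."""
    mean = statistics.fmean(samples)
    sem = statistics.stdev(samples) / len(samples) ** 0.5  # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * sem, mean + z * sem


def permutation_p_value(a: Sequence[float], b: Sequence[float],
                        trials: int = 2000, seed: int = 0) -> float:
    """Two-sided permutation test on the difference of means."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)  # randomly reassign runs to the two groups
        diff = abs(statistics.fmean(pooled[:len(a)]) - statistics.fmean(pooled[len(a):]))
        if diff >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (trials + 1)  # add-one correction avoids reporting p = 0


# Made-up latencies in milliseconds from six runs of each configuration
baseline_ms = [10.2, 10.5, 9.9, 10.1, 10.4, 10.3]
optimized_ms = [8.1, 8.4, 7.9, 8.2, 8.0, 8.3]

low, high = confidence_interval(baseline_ms)
p_value = permutation_p_value(baseline_ms, optimized_ms)
print(f"baseline mean in [{low:.2f}, {high:.2f}] ms, p-value = {p_value:.4f}")
```

A permutation test is used here because it makes no assumption about the shape of the latency distribution; a t-test would be a reasonable alternative when the runs are roughly normal.
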
## 🔧 What You'll Build

### **1. TinyTorchPerf Framework**
```python
from tinytorch.benchmarking import TinyTorchPerf

# Professional ML benchmarking
benchmark = TinyTorchPerf()
benchmark.set_model(your_model)
benchmark.set_dataset('cifar10')

# Run different scenarios
results = benchmark.run_all_scenarios()
```

### **2. Statistical Validator**
```python
# Ensure statistically valid results
# (`results` comes from benchmark.run_all_scenarios() above)
validator = StatisticalValidator()
validation = validator.validate_results(results)
if validation.significant:
    print("✅ Improvement is statistically significant")
```

### **3. Performance Reporter**
```python
# Generate professional reports
reporter = PerformanceReporter()
report = reporter.generate_report(results)
report.save_as_html("my_capstone_results.html")
```

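To give a feel for what a reporter does internally, here is a minimal sketch that renders a dictionary of metrics as an HTML table. It is not the `PerformanceReporter` implementation: the function name, the metric names, and the values are hypothetical and chosen only for illustration.

```python
import html
from typing import Dict


def render_html_report(title: str, results: Dict[str, float]) -> str:
    """Render metric names and values as a small standalone HTML page."""
    rows = "\n".join(
        f"<tr><td>{html.escape(name)}</td><td>{value:.3f}</td></tr>"
        for name, value in results.items()
    )
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{html.escape(title)}</title></head><body>\n"
        f"<h1>{html.escape(title)}</h1>\n"
        "<table>\n<tr><th>Metric</th><th>Value</th></tr>\n"
        f"{rows}\n</table>\n</body></html>\n"
    )


# Made-up metrics for illustration only
results = {"single_stream_p90_latency_ms": 12.5, "offline_throughput_qps": 480.0}
page = render_html_report("TinyTorch benchmark <demo>", results)
print(page)
```

The returned string can be written to disk with `pathlib.Path("report.html").write_text(page)`. Escaping the title and metric names keeps the page valid even when they contain characters such as `<` or `&`.
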
## 📈 Real-World Applications

### **Immediate Use Cases**
- **ML projects**: Systematic evaluation of your model implementations
- **Module integration**: Validate that your TinyTorch components work together
- **Performance optimization**: Prove your kernels actually improve performance

### **Career Applications**
- **Research**: Proper experimental methodology for papers
- **Industry**: A/B testing and performance optimization
- **Open source**: Contributing benchmarks to ML libraries

## 🎯 Success Metrics

By the end of this module, you should be able to:
- [ ] Design a systematic benchmark for any ML system
- [ ] Apply MLPerf principles to your own evaluations
- [ ] Generate statistically valid performance comparisons
- [ ] Create professional reports suitable for presentations
- [ ] Identify and avoid common benchmarking pitfalls

## 🔄 Connection to Module 13 (MLOps)

**Benchmarking** → **Production Monitoring**
- Benchmarking establishes baselines for production systems
- Monitoring detects when production deviates from benchmarks
- Both use similar metrics and statistical validation

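The baseline-versus-production relationship can be sketched as a simple threshold check. This is only one possible rule, assumed here for illustration: the function name, the `k = 3` threshold, and the latency values are hypothetical, and a real monitoring system would likely use more robust statistics.

```python
import statistics
from typing import Sequence


def deviates_from_baseline(baseline: Sequence[float],
                           production: Sequence[float],
                           k: float = 3.0) -> bool:
    """Flag a production window whose mean is more than k baseline std devs away."""
    baseline_mean = statistics.fmean(baseline)
    baseline_std = statistics.stdev(baseline)
    production_mean = statistics.fmean(production)
    return abs(production_mean - baseline_mean) > k * baseline_std


# Made-up latencies in milliseconds
baseline_ms = [10.2, 10.5, 9.9, 10.1, 10.4, 10.3]  # recorded during benchmarking
healthy_ms = [10.0, 10.4, 10.3]                    # recent production window
degraded_ms = [12.8, 13.1, 12.9]                   # window after a regression

print(deviates_from_baseline(baseline_ms, healthy_ms))
print(deviates_from_baseline(baseline_ms, degraded_ms))
```

The benchmark supplies both the expected value and the expected spread, which is why repeated runs matter: without a measured spread there is no principled way to decide how large a deviation is worth an alert.
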
## 📚 Resources

- [MLPerf Inference Rules](https://github.com/mlcommons/inference_policies)
- [Statistical Methods for ML Evaluation](https://machinelearningmastery.com/statistical-significance-tests-for-comparing-machine-learning-algorithms/)
- [A/B Testing for ML Systems](https://netflixtechblog.com/its-all-a-bout-testing-the-netflix-experimentation-platform-4e1ca458c15)

---

**🎉 Ready to become a systematic ML evaluator? Let's build professional benchmarking skills!**