mirror of https://github.com/MLSysBook/TinyTorch.git synced 2026-03-12 14:43:34 -05:00

Files

Vijay Janapa Reddi 9b3c4958e7 Add ML systems content to Module 14 (Benchmarking) - 75% implementation

- Added ProductionBenchmarkingProfiler class with end-to-end profiling
- Implemented resource utilization monitoring and bottleneck detection
- Added A/B testing framework with statistical significance
- Included performance regression detection and capacity planning
- Added comprehensive ML systems thinking questions

2025-09-16 01:02:20 -04:00

benchmarking_dev.py | Add ML systems content to Module 14 (Benchmarking) - 75% implementation | 2025-09-16 01:02:20 -04:00
module.yaml | refactor: Implement learner-focused module progression with better naming | 2025-07-18 00:12:50 -04:00
README.md | refactor: Implement learner-focused module progression with better naming | 2025-07-18 00:12:50 -04:00
test_report.md | refactor: Implement learner-focused module progression with better naming | 2025-07-18 00:12:50 -04:00

README.md

🔥 Module: Benchmarking

📊 Module Info

Difficulty: ⭐⭐⭐⭐ Advanced
Time Estimate: 6-8 hours
Prerequisites: All previous modules (01-12), especially Kernels
Next Steps: MLOps module (14)

Learn to systematically evaluate ML systems using industry-standard benchmarking methodology. This module teaches you to measure performance reliably, validate optimization claims, and create professional evaluation reports that meet research and industry standards.

🎯 Learning Objectives

By the end of this module, you will be able to:

Design systematic benchmarking experiments: Apply MLPerf-inspired methodology to evaluate ML system performance
Implement statistical validation: Ensure benchmark results are statistically significant and reproducible
Create professional performance reports: Generate industry-standard documentation for optimization claims
Apply evaluation methodology: Systematically compare models, optimizations, and architectural choices
Debug performance systematically: Use benchmarking to identify bottlenecks and validate improvements

🧠 Build → Use → Analyze

This module follows TinyTorch's Build → Use → Analyze framework:

Build: Implement a comprehensive benchmarking framework with an MLPerf-inspired architecture and statistical validation
Use: Apply systematic evaluation to TinyTorch models, optimizations, and performance claims
Analyze: Generate professional reports, validate optimization effectiveness, and prepare results for presentations

📚 What You'll Build

MLPerf-Inspired Benchmarking Framework

# Professional ML system evaluation
from tinytorch.core.benchmarking import TinyTorchPerf, StatisticalValidator

# Configure benchmark system
benchmark = TinyTorchPerf()
benchmark.set_model(your_trained_model)
benchmark.set_dataset('cifar10', subset_size=1000)
benchmark.set_metrics(['latency', 'throughput', 'accuracy'])

# Run comprehensive evaluation
results = benchmark.run_all_scenarios([
    'single_stream',    # Latency-focused (mobile/edge)
    'server',          # Throughput-focused (production)
    'offline'          # Batch processing (data center)
])

print(f"Single-stream latency: {results['single_stream']['latency']:.2f}ms")
print(f"Server throughput: {results['server']['throughput']:.0f} samples/sec")
print(f"Offline batch time: {results['offline']['batch_time']:.2f}s")
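The latency and throughput numbers above have to come from a timing loop somewhere. The following is a minimal sketch of one way to take them, using only the standard library. The names `measure_latency_ms`, `measure_throughput`, and `dummy_model` are hypothetical and are not part of the TinyTorch API; the warm-up count and run count are assumptions chosen for illustration.

```python
import statistics
import time

def measure_latency_ms(model, sample, warmup=5, runs=30):
    """Time one-sample-at-a-time inference; return per-run latencies in ms."""
    for _ in range(warmup):
        model(sample)                      # warm-up runs are discarded
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        model(sample)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies

def measure_throughput(model, samples):
    """Process all samples back to back; return samples per second."""
    start = time.perf_counter()
    for s in samples:
        model(s)
    elapsed = time.perf_counter() - start
    return len(samples) / elapsed if elapsed > 0 else float("inf")

def dummy_model(x):
    # Stand-in for a trained model: a small, fixed amount of work.
    return sum(v * v for v in x)

sample = [0.1] * 784
latencies = measure_latency_ms(dummy_model, sample)
throughput = measure_throughput(dummy_model, [sample] * 200)
print(f"median latency: {statistics.median(latencies):.4f} ms")
print(f"throughput: {throughput:.0f} samples/sec")
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock; reporting the median rather than the mean keeps a single slow run from dominating the result.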

Statistical Validation System

# Ensure statistically valid results
validator = StatisticalValidator(confidence_level=0.95, min_runs=30)

# Compare two models with statistical rigor
# (load_model and test_dataset are placeholders for your own loader and evaluation data)
baseline_model = load_model("baseline_v1")
optimized_model = load_model("optimized_v2")

comparison = validator.compare_models(
    baseline_model, 
    optimized_model, 
    test_dataset,
    metrics=['latency', 'accuracy']
)

if comparison['latency']['significant']:
    speedup = comparison['latency']['improvement']
    confidence = comparison['latency']['confidence_interval']
    print(f"✅ Speedup: {speedup:.2f}x (95% CI: {confidence[0]:.2f}-{confidence[1]:.2f})")
else:
    print("❌ Performance difference not statistically significant")
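One way a speedup and its confidence interval could be computed from two sets of latency measurements is a percentile bootstrap. The sketch below is an illustration under stated assumptions, not how `StatisticalValidator` is implemented: `bootstrap_speedup_ci` is a hypothetical helper, the latencies are synthetic, and the returned keys simply mirror the ones used in the example above.

```python
import random
import statistics

def bootstrap_speedup_ci(baseline_ms, optimized_ms, confidence=0.95,
                         resamples=2000, seed=0):
    """Percentile-bootstrap interval for mean(baseline) / mean(optimized)."""
    rng = random.Random(seed)              # fixed seed for reproducibility
    ratios = []
    for _ in range(resamples):
        b = rng.choices(baseline_ms, k=len(baseline_ms))
        o = rng.choices(optimized_ms, k=len(optimized_ms))
        ratios.append(statistics.fmean(b) / statistics.fmean(o))
    ratios.sort()
    alpha = 1.0 - confidence
    lower = ratios[int(alpha / 2 * resamples)]
    upper = ratios[int((1 - alpha / 2) * resamples) - 1]
    point = statistics.fmean(baseline_ms) / statistics.fmean(optimized_ms)
    # Call the speedup significant when the interval excludes 1.0 (no change).
    return {
        "improvement": point,
        "confidence_interval": (lower, upper),
        "significant": lower > 1.0 or upper < 1.0,
    }

# Synthetic latencies (ms) for illustration only, not real measurements.
data_rng = random.Random(42)
baseline_ms = [10.0 + data_rng.gauss(0, 0.5) for _ in range(30)]
optimized_ms = [8.0 + data_rng.gauss(0, 0.5) for _ in range(30)]

result = bootstrap_speedup_ci(baseline_ms, optimized_ms)
low, high = result["confidence_interval"]
print(f"Speedup: {result['improvement']:.2f}x (95% CI: {low:.2f}-{high:.2f})")
print(f"Significant: {result['significant']}")
```

The bootstrap makes no normality assumption about latencies, which are often skewed; a t-test on the two samples is a common alternative when the distributions are well behaved.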

Comprehensive Performance Reporter

# Generate professional evaluation reports
from tinytorch.core.benchmarking import PerformanceReporter

# baseline_model, optimized_model, and compressed_model are models you have already built
reporter = PerformanceReporter()
report = reporter.generate_comprehensive_report({
    'models': [baseline_model, optimized_model, compressed_model],
    'datasets': ['cifar10', 'imagenet_subset'],
    'scenarios': ['mobile', 'server', 'edge'],
    'optimizations': ['baseline', 'quantized', 'pruned', 'kernels']
})

# Export professional documentation
report.save_as_html("performance_evaluation.html")
report.save_as_pdf("performance_evaluation.pdf")
report.save_summary_table("results_summary.csv")

# Generate presentation slides
report.create_presentation_slides("optimization_results.pptx")
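For a sense of what a summary table export involves, here is a small sketch built on the standard `csv` module. It is not the `PerformanceReporter` implementation: `summary_table_csv` is a hypothetical helper, the column names are assumptions, and the numbers are placeholders rather than measurements.

```python
import csv
import io

def summary_table_csv(rows, fieldnames):
    """Render a list of result dicts as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Placeholder values for illustration only.
rows = [
    {"model": "baseline", "latency_ms": 12.5, "throughput": 80, "accuracy": 0.91},
    {"model": "quantized", "latency_ms": 7.0, "throughput": 140, "accuracy": 0.90},
]
fields = ["model", "latency_ms", "throughput", "accuracy"]
csv_text = summary_table_csv(rows, fields)
print(csv_text)
```

Writing to an in-memory buffer first keeps the formatting logic testable; saving to disk is then a single `open(path, "w").write(csv_text)` call.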

Real-World Evaluation Scenarios

# Mobile deployment evaluation
mobile_benchmark = TinyTorchPerf()
mobile_benchmark.configure_mobile_scenario(
    max_latency_ms=100,
    battery_constraints=True,
    memory_limit_mb=50
)

mobile_results = mobile_benchmark.evaluate_model(compressed_model)
mobile_feasible = mobile_results['meets_constraints']

# Production server evaluation
server_benchmark = TinyTorchPerf()
server_benchmark.configure_server_scenario(
    target_throughput=1000,  # requests/second
    max_latency_p99=50,      # 99th percentile latency
    concurrent_users=100
)

server_results = server_benchmark.evaluate_model(optimized_model)
production_ready = server_results['meets_sla']
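The server scenario above hinges on a 99th-percentile latency bound and a throughput target. The sketch below shows how such a check could be computed from raw measurements. `percentile` and `meets_sla` are hypothetical helpers, the latency unit is assumed to be milliseconds, and the sample values are placeholders.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def meets_sla(latencies_ms, wall_seconds, target_throughput, max_latency_p99):
    """Check a throughput target and a p99 latency bound together."""
    p99 = percentile(latencies_ms, 99)
    throughput = len(latencies_ms) / wall_seconds
    return {
        "p99_ms": p99,
        "throughput": throughput,
        "meets_sla": p99 <= max_latency_p99 and throughput >= target_throughput,
    }

# Placeholder data: 100 requests served concurrently within 0.08 s of wall time.
latencies_ms = [10.0] * 98 + [40.0, 120.0]
report = meets_sla(latencies_ms, wall_seconds=0.08,
                   target_throughput=1000, max_latency_p99=50)
print(report)
```

Note that the single 120 ms outlier does not break the SLA here: with 100 samples, p99 is the 99th-smallest value (40 ms), so only the worst 1% of requests is ignored. This is why tail percentiles, not averages or maxima, are the usual basis for latency SLAs.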

🚀 Getting Started

Prerequisites

Ensure you have built the complete TinyTorch system:

# Activate TinyTorch environment
source bin/activate-tinytorch.sh

# Verify prerequisite modules (comprehensive system needed)
tito test --module kernels      # Performance optimization
tito test --module compression  # Model optimization
tito test --module training     # End-to-end training

Development Workflow

Open the development file: modules/source/13_benchmarking/benchmarking_dev.py
Implement benchmarking framework: Build MLPerf-inspired evaluation system
Add statistical validation: Ensure reproducible and significant results
Create performance reporters: Generate professional documentation
Test evaluation scenarios: Apply to real models and optimization claims
Export and verify: tito export --module benchmarking && tito test --module benchmarking

🧪 Testing Your Implementation

Comprehensive Test Suite

Run the full test suite to verify benchmarking system functionality:

# TinyTorch CLI (recommended)
tito test --module benchmarking

# Direct pytest execution
python -m pytest tests/ -k benchmarking -v

Test Coverage Areas

✅ Benchmarking Framework: Verify MLPerf-inspired evaluation system works correctly
✅ Statistical Validation: Test confidence intervals, significance testing, and reproducibility
✅ Performance Reporting: Ensure professional report generation and data visualization
✅ Scenario Testing: Validate single-stream (mobile/edge), server, and offline evaluation scenarios
✅ Integration Testing: Test with real TinyTorch models and optimizations

Inline Testing & Evaluation Validation

The module includes inline tests that validate the benchmarking framework, the statistical checks, and report generation:

# Example inline test output
🔬 Unit Test: MLPerf-inspired benchmark framework...
✅ Single-stream scenario working correctly
✅ Server scenario measures throughput accurately
✅ Offline scenario handles batch processing
📈 Progress: Benchmarking Framework ✓

# Statistical validation testing
🔬 Unit Test: Statistical significance testing...
✅ Confidence intervals computed correctly
✅ Multiple comparison correction applied
✅ Minimum sample size requirements enforced
📈 Progress: Statistical Validation ✓

# Report generation testing
🔬 Unit Test: Performance report generation...
✅ HTML reports generated with proper formatting
✅ Summary tables include all required metrics
✅ Visualization charts display correctly
📈 Progress: Professional Reporting ✓
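The statistical tests above mention a multiple comparison correction. When several metrics are tested at once, the chance of at least one false positive grows, so the per-test threshold has to be tightened. The source does not say which correction the module uses; the sketch below shows the Holm-Bonferroni procedure as one common choice, with a hypothetical `holm_bonferroni` helper and illustrative p-values.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # smallest p first
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is compared against alpha / (m - k).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # all remaining (larger) p-values are not rejected
    return reject

# Illustrative p-values for three metrics, not real results.
metrics = ["latency", "throughput", "accuracy"]
p_values = [0.004, 0.02, 0.20]
decisions = holm_bonferroni(p_values)
for name, p, rejected in zip(metrics, p_values, decisions):
    print(f"{name}: p={p} significant={rejected}")
```

With these values, plain Bonferroni (threshold 0.05 / 3 for every test) would flag only latency, while Holm also flags throughput because its threshold relaxes to 0.05 / 2 after the first rejection. Both control the family-wise error rate; Holm is never less powerful.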

Manual Testing Examples

from benchmarking_dev import TinyTorchPerf, StatisticalValidator
from networks_dev import Sequential
from layers_dev import Dense
from activations_dev import ReLU

# Create test models
# (compress_model is not imported above; it is assumed to come from your compression module)
baseline_model = Sequential([Dense(784, 128), ReLU(), Dense(128, 10)])
optimized_model = compress_model(baseline_model, compression_ratio=0.5)

# Set up benchmarking
benchmark = TinyTorchPerf()
benchmark.set_dataset('synthetic', size=1000, input_shape=(784,), num_classes=10)

# Run evaluation
baseline_results = benchmark.evaluate_model(baseline_model)
optimized_results = benchmark.evaluate_model(optimized_model)

print(f"Baseline latency: {baseline_results['latency']:.2f}ms")
print(f"Optimized latency: {optimized_results['latency']:.2f}ms")
print(f"Speedup: {baseline_results['latency']/optimized_results['latency']:.2f}x")

# Statistical validation
# (test_data is a placeholder for your held-out evaluation data)
validator = StatisticalValidator()
comparison = validator.compare_models(
    baseline_model, optimized_model, test_data, metrics=['latency']
)
print(f"Statistically significant: {comparison['latency']['significant']}")

🎯 Key Concepts

Real-World Applications

MLPerf Benchmarks: Industry-standard evaluation methodology for ML systems and hardware
Production A/B Testing: Statistical validation of model improvements in live systems
Research Paper Evaluation: Rigorous experimental methodology for academic publication
Hardware Evaluation: Systematic comparison of ML accelerators and deployment platforms

Evaluation Methodology

Systematic Experimentation: Controlled variables, multiple runs, and statistical validation
Scenario-Based Testing: Mobile, server, and edge deployment evaluation patterns
Performance Metrics: Latency, throughput, accuracy, memory usage, and energy consumption
Statistical Rigor: Confidence intervals, significance testing, and reproducibility requirements

Professional Reporting

Industry Standards: MLPerf-style reporting with comprehensive metrics and statistical validation
Visual Communication: Charts, tables, and graphs that clearly communicate performance results
Executive Summaries: High-level findings suitable for technical and business stakeholders
Reproducibility: Complete methodology documentation for result verification

Benchmarking Best Practices

Baseline Establishment: Proper reference points for meaningful comparisons
Environment Control: Consistent hardware, software, and data conditions
Statistical Power: Sufficient sample sizes for reliable conclusions
Bias Avoidance: Careful experimental design to prevent misleading results
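"Sufficient sample sizes" can be made concrete with a power calculation. The sketch below uses the standard normal approximation for comparing two means, n = 2 * ((z_(1-alpha/2) + z_power) * sigma / delta)^2 runs per model. The helper name `runs_needed` and the example numbers are assumptions for illustration; the formula also assumes both models have the same latency standard deviation.

```python
import math
from statistics import NormalDist

def runs_needed(std_ms, min_diff_ms, alpha=0.05, power=0.8):
    """Runs per model needed to detect a mean difference of min_diff_ms.

    Two-sided, two-sample z-test approximation with equal variances.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_power) * std_ms / min_diff_ms) ** 2
    return math.ceil(n)

# Example: latency noise of 2 ms, and we care about differences of 1 ms or more.
n_runs = runs_needed(std_ms=2.0, min_diff_ms=1.0)
print(f"runs per model: {n_runs}")

# Halving the detectable difference roughly quadruples the required runs.
print(f"runs per model for 0.5 ms: {runs_needed(2.0, 0.5)}")
```

The quadratic dependence on `std_ms / min_diff_ms` is the practical takeaway: reducing measurement noise (environment control) is usually cheaper than adding runs.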

🎉 Ready to Build?

You're about to master the evaluation methodology that separates rigorous engineering from wishful thinking! This module teaches you to validate claims, measure improvements systematically, and communicate results professionally.

Every major breakthrough in ML, from ImageNet winners to production systems, depends on systematic evaluation like the framework you are building here. You'll learn to think like a performance scientist, ensuring your optimizations actually work and proving it with statistical rigor. Take your time, be thorough, and enjoy building the foundation of evidence-based ML engineering!

🚀 Launch Builder: Interactive development environment
https://mybinder.org/v2/gh/VJProductions/TinyTorch/main?filepath=modules/source/13_benchmarking/benchmarking_dev.py

📓 Open in Colab: Google Colab notebook
https://colab.research.google.com/github/VJProductions/TinyTorch/blob/main/modules/source/13_benchmarking/benchmarking_dev.ipynb

👀 View Source: Browse the code on GitHub
https://github.com/VJProductions/TinyTorch/blob/main/modules/source/13_benchmarking/benchmarking_dev.py