{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "1015a91f",
|
|
"metadata": {
|
|
"cell_marker": "\"\"\""
|
|
},
|
|
"source": [
|
|
"# Module 12: Benchmarking - Systematic ML Performance Evaluation\n",
|
|
"\n",
|
|
"Welcome to the Benchmarking module! This is where we learn to systematically evaluate ML systems using industry-standard methodology inspired by MLPerf.\n",
|
|
"\n",
|
|
"## Learning Goals\n",
|
|
"- Understand the four-component MLPerf benchmarking architecture\n",
|
|
"- Implement different benchmark scenarios (latency, throughput, offline)\n",
|
|
"- Apply statistical validation for meaningful results\n",
|
|
"- Create professional performance reports for ML projects\n",
|
|
"- Learn to avoid common benchmarking pitfalls\n",
|
|
"\n",
|
|
"## Build → Use → Analyze\n",
|
|
"1. **Build**: Benchmarking framework with proper statistical validation\n",
|
|
"2. **Use**: Apply systematic evaluation to your TinyTorch models\n",
|
|
"3. **Analyze**: Generate professional reports with statistical confidence"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "d09b187a",
|
|
"metadata": {
|
|
"lines_to_next_cell": 1,
|
|
"nbgrader": {
|
|
"grade": false,
|
|
"grade_id": "benchmarking-imports",
|
|
"locked": false,
|
|
"schema_version": 3,
|
|
"solution": false,
|
|
"task": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"#| default_exp core.benchmarking\n",
|
|
"\n",
|
|
"#| export\n",
|
|
"import numpy as np\n",
|
|
"import matplotlib.pyplot as plt\n",
|
|
"import time\n",
|
|
"import statistics\n",
|
|
"import json\n",
|
|
"import math\n",
|
|
"from typing import Dict, List, Tuple, Optional, Any, Callable\n",
|
|
"from dataclasses import dataclass\n",
|
|
"from enum import Enum\n",
|
|
"import os\n",
|
|
"import sys\n",
|
|
"\n",
|
|
"# Import our TinyTorch dependencies\n",
|
|
"try:\n",
|
|
" from tinytorch.core.tensor import Tensor\n",
|
|
" from tinytorch.core.networks import Sequential\n",
|
|
" from tinytorch.core.layers import Dense\n",
|
|
" from tinytorch.core.activations import ReLU, Softmax\n",
|
|
" from tinytorch.core.dataloader import DataLoader\n",
|
|
"except ImportError:\n",
|
|
" # For development, import from local modules\n",
|
|
" parent_dirs = [\n",
|
|
" os.path.join(os.path.dirname(__file__), '..', '01_tensor'),\n",
|
|
" os.path.join(os.path.dirname(__file__), '..', '03_layers'),\n",
|
|
" os.path.join(os.path.dirname(__file__), '..', '02_activations'),\n",
|
|
" os.path.join(os.path.dirname(__file__), '..', '04_networks'),\n",
|
|
" os.path.join(os.path.dirname(__file__), '..', '06_dataloader')\n",
|
|
" ]\n",
|
|
" for path in parent_dirs:\n",
|
|
" if path not in sys.path:\n",
|
|
" sys.path.append(path)\n",
|
|
" \n",
|
|
" try:\n",
|
|
" from tensor_dev import Tensor\n",
|
|
" from networks_dev import Sequential\n",
|
|
" from layers_dev import Dense\n",
|
|
" from activations_dev import ReLU, Softmax\n",
|
|
" from dataloader_dev import DataLoader\n",
|
|
" except ImportError:\n",
|
|
" # Fallback for missing modules\n",
|
|
" print(\"⚠️ Some TinyTorch modules not available - using minimal implementations\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "42b509fc",
|
|
"metadata": {
|
|
"lines_to_next_cell": 1,
|
|
"nbgrader": {
|
|
"grade": false,
|
|
"grade_id": "benchmarking-setup",
|
|
"locked": false,
|
|
"schema_version": 3,
|
|
"solution": false,
|
|
"task": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"#| hide\n",
|
|
"#| export\n",
|
|
"def _should_show_plots():\n",
|
|
" \"\"\"Check if we should show plots (disable during testing)\"\"\"\n",
|
|
" is_pytest = (\n",
|
|
" 'pytest' in sys.modules or\n",
|
|
" 'test' in sys.argv or\n",
|
|
" os.environ.get('PYTEST_CURRENT_TEST') is not None or\n",
|
|
" any('test' in arg for arg in sys.argv) or\n",
|
|
" any('pytest' in arg for arg in sys.argv)\n",
|
|
" )\n",
|
|
" \n",
|
|
" return not is_pytest"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "617fc409",
|
|
"metadata": {
|
|
"nbgrader": {
|
|
"grade": false,
|
|
"grade_id": "benchmarking-welcome",
|
|
"locked": false,
|
|
"schema_version": 3,
|
|
"solution": false,
|
|
"task": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"print(\"📊 TinyTorch Benchmarking Module\")\n",
|
|
"print(f\"NumPy version: {np.__version__}\")\n",
|
|
"print(f\"Python version: {sys.version_info.major}.{sys.version_info.minor}\")\n",
|
|
"print(\"Ready to build professional ML benchmarking tools!\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "476a1522",
|
|
"metadata": {
|
|
"cell_marker": "\"\"\""
|
|
},
|
|
"source": [
|
|
"## 📦 Where This Code Lives in the Final Package\n",
|
|
"\n",
|
|
"**Learning Side:** You work in `modules/source/12_benchmarking/benchmarking_dev.py` \n",
|
|
"**Building Side:** Code exports to `tinytorch.core.benchmarking`\n",
|
|
"\n",
|
|
"```python\n",
|
|
"# Final package structure:\n",
|
|
"from tinytorch.core.benchmarking import TinyTorchPerf, BenchmarkScenarios\n",
|
|
"from tinytorch.core.benchmarking import StatisticalValidator, PerformanceReporter\n",
|
|
"```\n",
|
|
"\n",
|
|
"**Why this matters:**\n",
|
|
"- **Learning:** Deep understanding of systematic evaluation\n",
|
|
"- **Production:** Professional benchmarking methodology\n",
|
|
"- **Projects:** Tools for validating your ML project performance\n",
|
|
"- **Career:** Industry-standard skills for ML engineering roles"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "302b6a5c",
|
|
"metadata": {
|
|
"cell_marker": "\"\"\""
|
|
},
|
|
"source": [
|
|
"## What is ML Benchmarking?\n",
|
|
"\n",
|
|
"### The Systematic Evaluation Problem\n",
|
|
"When you build ML systems, you need to answer critical questions:\n",
|
|
"- **Is my model actually better?** Statistical significance vs random variation\n",
|
|
"- **How does it perform in production?** Latency, throughput, resource usage\n",
|
|
"- **Which approach should I choose?** Systematic comparison methodology\n",
|
|
"- **Can I trust my results?** Avoiding common benchmarking pitfalls\n",
|
|
"\n",
|
|
"### The MLPerf Architecture\n",
|
|
"MLPerf (Machine Learning Performance) defines the industry standard for ML benchmarking:\n",
|
|
"\n",
|
|
"```\n",
|
|
"┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐\n",
|
|
"│ Load Generator │───▶│ System Under │───▶│ Dataset │\n",
|
|
"│ (Controls │ │ Test (Your ML │ │ (Standardized │\n",
|
|
"│ Queries) │ │ Model) │ │ Evaluation) │\n",
|
|
"└─────────────────┘ └─────────────────┘ └─────────────────┘\n",
|
|
"```\n",
|
|
"\n",
|
|
"### The Four Components\n",
|
|
"1. **System Under Test (SUT)**: Your ML model/system being evaluated\n",
|
|
"2. **Dataset**: Standardized evaluation data (CIFAR-10, ImageNet, etc.)\n",
|
|
"3. **Model**: The specific architecture and weights being tested\n",
|
|
"4. **Load Generator**: Controls how evaluation queries are sent to the SUT\n",
|
|
"\n",
|
|
"### Why This Matters\n",
|
|
"- **Reproducibility**: Others can verify your results\n",
|
|
"- **Comparability**: Fair comparison between different approaches\n",
|
|
"- **Statistical validity**: Meaningful conclusions from your data\n",
|
|
"- **Industry standards**: Skills you'll use in ML engineering careers\n",
|
|
"\n",
|
|
"### Real-World Examples\n",
|
|
"- **Google**: Uses similar patterns for production ML system evaluation\n",
|
|
"- **Meta**: A/B testing frameworks follow these principles\n",
|
|
"- **OpenAI**: GPT model comparisons use systematic benchmarking\n",
|
|
"- **Research**: All major ML conferences require proper evaluation methodology"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "5613b9ce",
|
|
"metadata": {
|
|
"cell_marker": "\"\"\""
|
|
},
|
|
"source": [
|
|
"## Step 1: Benchmark Scenarios - How to Measure Performance\n",
|
|
"\n",
|
|
"### The Three Standard Scenarios\n",
|
|
"Different use cases require different performance measurements:\n",
|
|
"\n",
|
|
"#### 1. Single-Stream Scenario\n",
|
|
"- **Use case**: Mobile/edge inference, interactive applications\n",
|
|
"- **Pattern**: Send next query only after previous completes\n",
|
|
"- **Metric**: 90th percentile latency (tail latency)\n",
|
|
"- **Why**: Users care about worst-case response time\n",
|
|
"\n",
|
|
"#### 2. Server Scenario \n",
|
|
"- **Use case**: Production web services, API endpoints\n",
|
|
"- **Pattern**: Poisson distribution of concurrent queries\n",
|
|
"- **Metric**: Queries per second (QPS) at acceptable latency\n",
|
|
"- **Why**: Servers handle multiple simultaneous requests\n",
|
|
"\n",
|
|
"#### 3. Offline Scenario\n",
|
|
"- **Use case**: Batch processing, data center workloads\n",
|
|
"- **Pattern**: Send all samples at once for batch processing\n",
|
|
"- **Metric**: Throughput (samples per second)\n",
|
|
"- **Why**: Batch jobs care about total processing time\n",
|
|
"\n",
|
|
"### Mathematical Foundation\n",
|
|
"Each scenario tests different aspects:\n",
|
|
"- **Latency**: Time for single sample = f(model_complexity, hardware)\n",
|
|
"- **Throughput**: Samples per second = f(parallelism, batch_size)\n",
|
|
"- **Efficiency**: Resource utilization = f(memory, compute, bandwidth)\n",
|
|
"\n",
|
|
"### Why Multiple Scenarios?\n",
|
|
"Real ML systems have different requirements:\n",
|
|
"- **Chatbot**: Low latency for good user experience\n",
|
|
"- **Image API**: High throughput for many concurrent users \n",
|
|
"- **Data pipeline**: Maximum batch processing efficiency"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "97dc390b",
|
|
"metadata": {
|
|
"cell_marker": "\"\"\"",
|
|
"lines_to_next_cell": 1
|
|
},
|
|
"source": [
|
|
"## Step 2: Statistical Validation - Ensuring Meaningful Results\n",
|
|
"\n",
|
|
"### The Significance Problem\n",
|
|
"Common benchmarking mistakes:\n",
|
|
"```python\n",
|
|
"# BAD: Single run, no statistical validation\n",
|
|
"result_a = model_a.run_once() # 94.2% accuracy\n",
|
|
"result_b = model_b.run_once() # 94.7% accuracy\n",
|
|
"print(\"Model B is better!\") # Maybe, maybe not...\n",
|
|
"```\n",
|
|
"\n",
|
|
"### The MLPerf Solution\n",
|
|
"Proper statistical validation:\n",
|
|
"```python\n",
|
|
"# GOOD: Multiple runs with confidence intervals\n",
|
|
"results_a = [model_a.run() for _ in range(10)] # [93.8, 94.1, 94.3, ...]\n",
|
|
"results_b = [model_b.run() for _ in range(10)] # [94.2, 94.5, 94.9, ...]\n",
|
|
"significance = statistical_test(results_a, results_b)\n",
|
|
"print(f\"Model B is {significance.p_value < 0.05} better with p={significance.p_value}\")\n",
|
|
"```\n",
|
|
"\n",
|
|
"### Key Statistical Concepts\n",
|
|
"- **Confidence intervals**: Range of likely true values\n",
|
|
"- **P-values**: Probability that difference is due to chance\n",
|
|
"- **Effect size**: Magnitude of improvement (not just significance)\n",
|
|
"- **Multiple comparisons**: Adjusting for testing many approaches\n",
|
|
"\n",
|
|
"### Sample Size Calculation\n",
|
|
"MLPerf uses this formula for minimum samples:\n",
|
|
"```\n",
|
|
"n = Φ^(-1)((1-C)/2)^2 * p(1-p) / MOE^2\n",
|
|
"```\n",
|
|
"Where:\n",
|
|
"- C = confidence level (0.99)\n",
|
|
"- p = percentile (0.90 for 90th percentile)\n",
|
|
"- MOE = margin of error ((1-p)/20)\n",
|
|
"\n",
|
|
"For 90th percentile with 99% confidence: **n = 24,576 samples**"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "6e4d9c8f",
|
|
"metadata": {
|
|
"lines_to_next_cell": 1,
|
|
"nbgrader": {
|
|
"grade": false,
|
|
"grade_id": "benchmark-scenarios",
|
|
"locked": false,
|
|
"schema_version": 3,
|
|
"solution": true,
|
|
"task": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"#| export\n",
|
|
"class BenchmarkScenario(Enum):\n",
|
|
" \"\"\"Standard benchmark scenarios from MLPerf\"\"\"\n",
|
|
" SINGLE_STREAM = \"single_stream\"\n",
|
|
" SERVER = \"server\"\n",
|
|
" OFFLINE = \"offline\"\n",
|
|
"\n",
|
|
"@dataclass\n",
|
|
"class BenchmarkResult:\n",
|
|
" \"\"\"Results from a benchmark run\"\"\"\n",
|
|
" scenario: BenchmarkScenario\n",
|
|
" latencies: List[float] # All latency measurements in seconds\n",
|
|
" throughput: float # Samples per second\n",
|
|
" accuracy: float # Model accuracy (0-1)\n",
|
|
" metadata: Optional[Dict[str, Any]] = None\n",
|
|
"\n",
|
|
"#| export\n",
|
|
"class BenchmarkScenarios:\n",
|
|
" \"\"\"\n",
|
|
" Implements the three standard MLPerf benchmark scenarios.\n",
|
|
" \n",
|
|
" TODO: Implement the three benchmark scenarios following MLPerf patterns.\n",
|
|
" \n",
|
|
" UNDERSTANDING THE SCENARIOS:\n",
|
|
" 1. Single-Stream: Send queries one at a time, measure latency\n",
|
|
" 2. Server: Send queries following Poisson distribution, measure QPS\n",
|
|
" 3. Offline: Send all queries at once, measure total throughput\n",
|
|
" \n",
|
|
" IMPLEMENTATION APPROACH:\n",
|
|
" 1. Each scenario should run the model multiple times\n",
|
|
" 2. Collect latency measurements for each run\n",
|
|
" 3. Calculate appropriate metrics for each scenario\n",
|
|
" 4. Return BenchmarkResult with all measurements\n",
|
|
" \n",
|
|
" EXAMPLE USAGE:\n",
|
|
" scenarios = BenchmarkScenarios()\n",
|
|
" result = scenarios.single_stream(model, dataset, num_queries=1000)\n",
|
|
" print(f\"90th percentile latency: {result.latencies[int(0.9 * len(result.latencies))]} seconds\")\n",
|
|
" \"\"\"\n",
|
|
" \n",
|
|
" def __init__(self):\n",
|
|
" self.results = []\n",
|
|
" \n",
|
|
" def single_stream(self, model: Callable, dataset: List, num_queries: int = 1000) -> BenchmarkResult:\n",
|
|
" \"\"\"\n",
|
|
" Run single-stream benchmark scenario.\n",
|
|
" \n",
|
|
" TODO: Implement single-stream benchmarking.\n",
|
|
" \n",
|
|
" STEP-BY-STEP:\n",
|
|
" 1. Initialize empty list for latencies\n",
|
|
" 2. For each query (up to num_queries):\n",
|
|
" a. Get next sample from dataset (cycle if needed)\n",
|
|
" b. Record start time\n",
|
|
" c. Run model on sample\n",
|
|
" d. Record end time\n",
|
|
" e. Calculate latency = end - start\n",
|
|
" f. Add latency to list\n",
|
|
" 3. Calculate throughput = num_queries / total_time\n",
|
|
" 4. Calculate accuracy if possible\n",
|
|
" 5. Return BenchmarkResult with SINGLE_STREAM scenario\n",
|
|
" \n",
|
|
" HINTS:\n",
|
|
" - Use time.perf_counter() for precise timing\n",
|
|
" - Use dataset[i % len(dataset)] to cycle through samples\n",
|
|
" - Sort latencies for percentile calculations\n",
|
|
" \"\"\"\n",
|
|
" ### BEGIN SOLUTION\n",
|
|
" latencies = []\n",
|
|
" correct_predictions = 0\n",
|
|
" total_start_time = time.perf_counter()\n",
|
|
" \n",
|
|
" for i in range(num_queries):\n",
|
|
" # Get sample (cycle through dataset)\n",
|
|
" sample = dataset[i % len(dataset)]\n",
|
|
" \n",
|
|
" # Time the inference\n",
|
|
" start_time = time.perf_counter()\n",
|
|
" result = model(sample)\n",
|
|
" end_time = time.perf_counter()\n",
|
|
" \n",
|
|
" latency = end_time - start_time\n",
|
|
" latencies.append(latency)\n",
|
|
" \n",
|
|
" # Simple accuracy calculation (if possible)\n",
|
|
" if hasattr(sample, 'target') and hasattr(result, 'data'):\n",
|
|
" predicted = np.argmax(result.data)\n",
|
|
" if predicted == sample.target:\n",
|
|
" correct_predictions += 1\n",
|
|
" \n",
|
|
" total_time = time.perf_counter() - total_start_time\n",
|
|
" throughput = num_queries / total_time\n",
|
|
" accuracy = correct_predictions / num_queries if num_queries > 0 else 0.0\n",
|
|
" \n",
|
|
" return BenchmarkResult(\n",
|
|
" scenario=BenchmarkScenario.SINGLE_STREAM,\n",
|
|
" latencies=sorted(latencies),\n",
|
|
" throughput=throughput,\n",
|
|
" accuracy=accuracy,\n",
|
|
" metadata={\"num_queries\": num_queries}\n",
|
|
" )\n",
|
|
" ### END SOLUTION\n",
|
|
" raise NotImplementedError(\"Student implementation required\")\n",
|
|
" \n",
|
|
" def server(self, model: Callable, dataset: List, target_qps: float = 10.0, \n",
|
|
" duration: float = 60.0) -> BenchmarkResult:\n",
|
|
" \"\"\"\n",
|
|
" Run server benchmark scenario with Poisson-distributed queries.\n",
|
|
" \n",
|
|
" TODO: Implement server benchmarking.\n",
|
|
" \n",
|
|
" STEP-BY-STEP:\n",
|
|
" 1. Calculate inter-arrival time = 1.0 / target_qps\n",
|
|
" 2. Run for specified duration:\n",
|
|
" a. Wait for next query arrival (Poisson distribution)\n",
|
|
" b. Get sample from dataset\n",
|
|
" c. Record start time\n",
|
|
" d. Run model\n",
|
|
" e. Record end time and latency\n",
|
|
" 3. Calculate actual QPS = total_queries / duration\n",
|
|
" 4. Return results\n",
|
|
" \n",
|
|
" HINTS:\n",
|
|
" - Use np.random.exponential(inter_arrival_time) for Poisson\n",
|
|
" - Track both query arrival times and completion times\n",
|
|
" - Server scenario cares about sustained throughput\n",
|
|
" \"\"\"\n",
|
|
" ### BEGIN SOLUTION\n",
|
|
" latencies = []\n",
|
|
" inter_arrival_time = 1.0 / target_qps\n",
|
|
" start_time = time.perf_counter()\n",
|
|
" current_time = start_time\n",
|
|
" query_count = 0\n",
|
|
" \n",
|
|
" while (current_time - start_time) < duration:\n",
|
|
" # Wait for next query (Poisson distribution)\n",
|
|
" wait_time = np.random.exponential(inter_arrival_time)\n",
|
|
" time.sleep(min(wait_time, 0.001)) # Small sleep to simulate waiting\n",
|
|
" \n",
|
|
" # Get sample\n",
|
|
" sample = dataset[query_count % len(dataset)]\n",
|
|
" \n",
|
|
" # Time the inference\n",
|
|
" query_start = time.perf_counter()\n",
|
|
" result = model(sample)\n",
|
|
" query_end = time.perf_counter()\n",
|
|
" \n",
|
|
" latency = query_end - query_start\n",
|
|
" latencies.append(latency)\n",
|
|
" \n",
|
|
" query_count += 1\n",
|
|
" current_time = time.perf_counter()\n",
|
|
" \n",
|
|
" actual_duration = current_time - start_time\n",
|
|
" actual_qps = query_count / actual_duration\n",
|
|
" \n",
|
|
" return BenchmarkResult(\n",
|
|
" scenario=BenchmarkScenario.SERVER,\n",
|
|
" latencies=sorted(latencies),\n",
|
|
" throughput=actual_qps,\n",
|
|
" accuracy=0.0, # Would need labels for accuracy\n",
|
|
" metadata={\"target_qps\": target_qps, \"actual_qps\": actual_qps, \"duration\": actual_duration}\n",
|
|
" )\n",
|
|
" ### END SOLUTION\n",
|
|
" raise NotImplementedError(\"Student implementation required\")\n",
|
|
" \n",
|
|
" def offline(self, model: Callable, dataset: List, batch_size: int = 32) -> BenchmarkResult:\n",
|
|
" \"\"\"\n",
|
|
" Run offline benchmark scenario with batch processing.\n",
|
|
" \n",
|
|
" TODO: Implement offline benchmarking.\n",
|
|
" \n",
|
|
" STEP-BY-STEP:\n",
|
|
" 1. Group dataset into batches of batch_size\n",
|
|
" 2. For each batch:\n",
|
|
" a. Record start time\n",
|
|
" b. Run model on entire batch\n",
|
|
" c. Record end time\n",
|
|
" d. Calculate batch latency\n",
|
|
" 3. Calculate total throughput = total_samples / total_time\n",
|
|
" 4. Return results\n",
|
|
" \n",
|
|
" HINTS:\n",
|
|
" - Process data in batches for efficiency\n",
|
|
" - Measure total time for all batches\n",
|
|
" - Offline cares about maximum throughput\n",
|
|
" \"\"\"\n",
|
|
" ### BEGIN SOLUTION\n",
|
|
" latencies = []\n",
|
|
" total_samples = len(dataset)\n",
|
|
" total_start_time = time.perf_counter()\n",
|
|
" \n",
|
|
" for batch_start in range(0, total_samples, batch_size):\n",
|
|
" batch_end = min(batch_start + batch_size, total_samples)\n",
|
|
" batch = dataset[batch_start:batch_end]\n",
|
|
" \n",
|
|
" # Time the batch inference\n",
|
|
" batch_start_time = time.perf_counter()\n",
|
|
" for sample in batch:\n",
|
|
" result = model(sample)\n",
|
|
" batch_end_time = time.perf_counter()\n",
|
|
" \n",
|
|
" batch_latency = batch_end_time - batch_start_time\n",
|
|
" latencies.append(batch_latency)\n",
|
|
" \n",
|
|
" total_time = time.perf_counter() - total_start_time\n",
|
|
" throughput = total_samples / total_time\n",
|
|
" \n",
|
|
" return BenchmarkResult(\n",
|
|
" scenario=BenchmarkScenario.OFFLINE,\n",
|
|
" latencies=latencies,\n",
|
|
" throughput=throughput,\n",
|
|
" accuracy=0.0, # Would need labels for accuracy\n",
|
|
" metadata={\"batch_size\": batch_size, \"total_samples\": total_samples}\n",
|
|
" )\n",
|
|
" ### END SOLUTION\n",
|
|
" raise NotImplementedError(\"Student implementation required\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "6cf329ce",
|
|
"metadata": {
|
|
"cell_marker": "\"\"\"",
|
|
"lines_to_next_cell": 1
|
|
},
|
|
"source": [
|
|
"### 🧪 Unit Test: Benchmark Scenarios\n",
|
|
"\n",
|
|
"Let's test our benchmark scenarios with a simple mock model."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "a53ed486",
|
|
"metadata": {
|
|
"nbgrader": {
|
|
"grade": false,
|
|
"grade_id": "test-scenarios",
|
|
"locked": false,
|
|
"schema_version": 3,
|
|
"solution": false,
|
|
"task": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"def test_benchmark_scenarios():\n",
|
|
" \"\"\"Test that our benchmark scenarios work correctly.\"\"\"\n",
|
|
" print(\"🔬 Unit Test: Benchmark Scenarios...\")\n",
|
|
" \n",
|
|
" # Create a simple mock model and dataset\n",
|
|
" def mock_model(sample):\n",
|
|
" # Simulate some processing time\n",
|
|
" time.sleep(0.001) # 1ms processing\n",
|
|
" return {\"prediction\": np.random.rand(10)}\n",
|
|
" \n",
|
|
" mock_dataset = [{\"data\": np.random.rand(10)} for _ in range(100)]\n",
|
|
" \n",
|
|
" # Test scenarios\n",
|
|
" scenarios = BenchmarkScenarios()\n",
|
|
" \n",
|
|
" # Test single-stream\n",
|
|
" single_result = scenarios.single_stream(mock_model, mock_dataset, num_queries=10)\n",
|
|
" assert single_result.scenario == BenchmarkScenario.SINGLE_STREAM\n",
|
|
" assert len(single_result.latencies) == 10\n",
|
|
" assert single_result.throughput > 0\n",
|
|
" print(f\"✅ Single-stream: {len(single_result.latencies)} measurements\")\n",
|
|
" \n",
|
|
" # Test server (short duration for testing)\n",
|
|
" server_result = scenarios.server(mock_model, mock_dataset, target_qps=5.0, duration=2.0)\n",
|
|
" assert server_result.scenario == BenchmarkScenario.SERVER\n",
|
|
" assert len(server_result.latencies) > 0\n",
|
|
" assert server_result.throughput > 0\n",
|
|
" print(f\"✅ Server: {len(server_result.latencies)} queries processed\")\n",
|
|
" \n",
|
|
" # Test offline\n",
|
|
" offline_result = scenarios.offline(mock_model, mock_dataset, batch_size=5)\n",
|
|
" assert offline_result.scenario == BenchmarkScenario.OFFLINE\n",
|
|
" assert len(offline_result.latencies) > 0\n",
|
|
" assert offline_result.throughput > 0\n",
|
|
" print(f\"✅ Offline: {len(offline_result.latencies)} batches processed\")\n",
|
|
" \n",
|
|
" print(\"✅ All benchmark scenarios working correctly!\")\n",
|
|
"\n",
|
|
"# Run the test\n",
|
|
"test_benchmark_scenarios()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "0888ece9",
|
|
"metadata": {
|
|
"cell_marker": "\"\"\"",
|
|
"lines_to_next_cell": 1
|
|
},
|
|
"source": [
|
|
"## Step 3: Statistical Validation - Ensuring Meaningful Results\n",
|
|
"\n",
|
|
"### The Confidence Problem\n",
|
|
"How do we know if one model is actually better than another?\n",
|
|
"\n",
|
|
"### Statistical Testing for ML\n",
|
|
"We need to test the null hypothesis: \"There is no significant difference between models\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "fa7342ad",
|
|
"metadata": {
|
|
"lines_to_next_cell": 1,
|
|
"nbgrader": {
|
|
"grade": false,
|
|
"grade_id": "statistical-validator",
|
|
"locked": false,
|
|
"schema_version": 3,
|
|
"solution": true,
|
|
"task": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"#| export\n",
|
|
"@dataclass\n",
|
|
"class StatisticalValidation:\n",
|
|
" \"\"\"Results from statistical validation\"\"\"\n",
|
|
" is_significant: bool\n",
|
|
" p_value: float\n",
|
|
" effect_size: float\n",
|
|
" confidence_interval: Tuple[float, float]\n",
|
|
" recommendation: str\n",
|
|
"\n",
|
|
"#| export\n",
|
|
"class StatisticalValidator:\n",
|
|
" \"\"\"\n",
|
|
" Validates benchmark results using proper statistical methods.\n",
|
|
" \n",
|
|
" TODO: Implement statistical validation for benchmark results.\n",
|
|
" \n",
|
|
" UNDERSTANDING STATISTICAL TESTING:\n",
|
|
" 1. Null hypothesis: No difference between models\n",
|
|
" 2. T-test: Compare means of two groups\n",
|
|
" 3. P-value: Probability of seeing this difference by chance\n",
|
|
" 4. Effect size: Magnitude of the difference\n",
|
|
" 5. Confidence interval: Range of likely true values\n",
|
|
" \n",
|
|
" IMPLEMENTATION APPROACH:\n",
|
|
" 1. Calculate basic statistics (mean, std, n)\n",
|
|
" 2. Perform t-test to get p-value\n",
|
|
" 3. Calculate effect size (Cohen's d)\n",
|
|
" 4. Calculate confidence interval\n",
|
|
" 5. Provide clear recommendation\n",
|
|
" \"\"\"\n",
|
|
" \n",
|
|
" def __init__(self, confidence_level: float = 0.95):\n",
|
|
" self.confidence_level = confidence_level\n",
|
|
" self.alpha = 1 - confidence_level\n",
|
|
" \n",
|
|
" def validate_comparison(self, results_a: List[float], results_b: List[float]) -> StatisticalValidation:\n",
|
|
" \"\"\"\n",
|
|
" Compare two sets of benchmark results statistically.\n",
|
|
" \n",
|
|
" TODO: Implement statistical comparison.\n",
|
|
" \n",
|
|
" STEP-BY-STEP:\n",
|
|
" 1. Calculate basic statistics for both groups\n",
|
|
" 2. Perform two-sample t-test\n",
|
|
" 3. Calculate effect size (Cohen's d)\n",
|
|
" 4. Calculate confidence interval for the difference\n",
|
|
" 5. Generate recommendation based on results\n",
|
|
" \n",
|
|
" HINTS:\n",
|
|
" - Use scipy.stats.ttest_ind for t-test (or implement manually)\n",
|
|
" - Cohen's d = (mean_a - mean_b) / pooled_std\n",
|
|
" - CI = difference ± (critical_value * standard_error)\n",
|
|
" \"\"\"\n",
|
|
" ### BEGIN SOLUTION\n",
|
|
" import math\n",
|
|
" \n",
|
|
" # Basic statistics\n",
|
|
" mean_a = statistics.mean(results_a)\n",
|
|
" mean_b = statistics.mean(results_b)\n",
|
|
" std_a = statistics.stdev(results_a)\n",
|
|
" std_b = statistics.stdev(results_b)\n",
|
|
" n_a = len(results_a)\n",
|
|
" n_b = len(results_b)\n",
|
|
" \n",
|
|
" # Two-sample t-test (simplified)\n",
|
|
" pooled_std = math.sqrt(((n_a - 1) * std_a**2 + (n_b - 1) * std_b**2) / (n_a + n_b - 2))\n",
|
|
" standard_error = pooled_std * math.sqrt(1/n_a + 1/n_b)\n",
|
|
" \n",
|
|
" if standard_error == 0:\n",
|
|
" t_stat = 0\n",
|
|
" p_value = 1.0\n",
|
|
" else:\n",
|
|
" t_stat = (mean_a - mean_b) / standard_error\n",
|
|
" # Simplified p-value calculation (assuming normal distribution)\n",
|
|
" p_value = 2 * (1 - abs(t_stat) / (abs(t_stat) + math.sqrt(n_a + n_b - 2)))\n",
|
|
" \n",
|
|
" # Effect size (Cohen's d)\n",
|
|
" effect_size = (mean_a - mean_b) / pooled_std if pooled_std > 0 else 0\n",
|
|
" \n",
|
|
" # Confidence interval for difference\n",
|
|
" difference = mean_a - mean_b\n",
|
|
" critical_value = 1.96 # Approximate for 95% CI\n",
|
|
" margin_of_error = critical_value * standard_error\n",
|
|
" ci_lower = difference - margin_of_error\n",
|
|
" ci_upper = difference + margin_of_error\n",
|
|
" \n",
|
|
" # Determine significance\n",
|
|
" is_significant = p_value < self.alpha\n",
|
|
" \n",
|
|
" # Generate recommendation\n",
|
|
" if is_significant:\n",
|
|
" if effect_size > 0.8:\n",
|
|
" recommendation = \"Large significant difference - strong evidence for improvement\"\n",
|
|
" elif effect_size > 0.5:\n",
|
|
" recommendation = \"Medium significant difference - good evidence for improvement\"\n",
|
|
" else:\n",
|
|
" recommendation = \"Small significant difference - weak evidence for improvement\"\n",
|
|
" else:\n",
|
|
" recommendation = \"No significant difference - insufficient evidence for improvement\"\n",
|
|
" \n",
|
|
" return StatisticalValidation(\n",
|
|
" is_significant=is_significant,\n",
|
|
" p_value=p_value,\n",
|
|
" effect_size=effect_size,\n",
|
|
" confidence_interval=(ci_lower, ci_upper),\n",
|
|
" recommendation=recommendation\n",
|
|
" )\n",
|
|
" ### END SOLUTION\n",
|
|
" raise NotImplementedError(\"Student implementation required\")\n",
|
|
" \n",
|
|
" def validate_benchmark_result(self, result: BenchmarkResult, \n",
|
|
" min_samples: int = 100) -> StatisticalValidation:\n",
|
|
" \"\"\"\n",
|
|
" Validate that a benchmark result has sufficient statistical power.\n",
|
|
" \n",
|
|
" TODO: Implement validation for single benchmark result.\n",
|
|
" \n",
|
|
" STEP-BY-STEP:\n",
|
|
" 1. Check if we have enough samples\n",
|
|
" 2. Calculate confidence interval for the metric\n",
|
|
" 3. Check for common pitfalls (outliers, etc.)\n",
|
|
" 4. Provide recommendations\n",
|
|
" \"\"\"\n",
|
|
" ### BEGIN SOLUTION\n",
|
|
" latencies = result.latencies\n",
|
|
" n = len(latencies)\n",
|
|
" \n",
|
|
" if n < min_samples:\n",
|
|
" return StatisticalValidation(\n",
|
|
" is_significant=False,\n",
|
|
" p_value=1.0,\n",
|
|
" effect_size=0.0,\n",
|
|
" confidence_interval=(0.0, 0.0),\n",
|
|
" recommendation=f\"Insufficient samples: {n} < {min_samples}. Need more data.\"\n",
|
|
" )\n",
|
|
" \n",
|
|
" # Calculate confidence interval for mean latency\n",
|
|
" mean_latency = statistics.mean(latencies)\n",
|
|
" std_latency = statistics.stdev(latencies)\n",
|
|
" standard_error = std_latency / math.sqrt(n)\n",
|
|
" \n",
|
|
" critical_value = 1.96 # 95% CI\n",
|
|
" margin_of_error = critical_value * standard_error\n",
|
|
" ci_lower = mean_latency - margin_of_error\n",
|
|
" ci_upper = mean_latency + margin_of_error\n",
|
|
" \n",
|
|
" # Check for outliers (simple check)\n",
|
|
" q1 = latencies[int(0.25 * n)]\n",
|
|
" q3 = latencies[int(0.75 * n)]\n",
|
|
" iqr = q3 - q1\n",
|
|
" outlier_threshold = q3 + 1.5 * iqr\n",
|
|
" outliers = [l for l in latencies if l > outlier_threshold]\n",
|
|
" \n",
|
|
" if len(outliers) > 0.1 * n: # More than 10% outliers\n",
|
|
" recommendation = f\"Warning: {len(outliers)} outliers detected. Results may be unreliable.\"\n",
|
|
" else:\n",
|
|
" recommendation = \"Benchmark result appears statistically valid.\"\n",
|
|
" \n",
|
|
" return StatisticalValidation(\n",
|
|
" is_significant=True,\n",
|
|
" p_value=0.0, # Not applicable for single result\n",
|
|
" effect_size=std_latency / mean_latency, # Coefficient of variation\n",
|
|
" confidence_interval=(ci_lower, ci_upper),\n",
|
|
" recommendation=recommendation\n",
|
|
" )\n",
|
|
" ### END SOLUTION\n",
|
|
" raise NotImplementedError(\"Student implementation required\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "bb17c05a",
|
|
"metadata": {
|
|
"cell_marker": "\"\"\"",
|
|
"lines_to_next_cell": 1
|
|
},
|
|
"source": [
|
|
"### 🧪 Unit Test: Statistical Validation\n",
|
|
"\n",
|
|
"Let's test our statistical validation with simulated data."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "8d66a905",
|
|
"metadata": {
|
|
"nbgrader": {
|
|
"grade": false,
|
|
"grade_id": "test-validation",
|
|
"locked": false,
|
|
"schema_version": 3,
|
|
"solution": false,
|
|
"task": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"def test_statistical_validation():\n",
|
|
" \"\"\"Test statistical validation functionality.\"\"\"\n",
|
|
" print(\"🔬 Unit Test: Statistical Validation...\")\n",
|
|
" \n",
|
|
" validator = StatisticalValidator(confidence_level=0.95)\n",
|
|
" \n",
|
|
" # Test 1: No significant difference\n",
|
|
" results_a = [0.1 + 0.01 * np.random.randn() for _ in range(100)]\n",
|
|
" results_b = [0.1 + 0.01 * np.random.randn() for _ in range(100)]\n",
|
|
" \n",
|
|
" validation = validator.validate_comparison(results_a, results_b)\n",
|
|
" print(f\"✅ No difference test: significant={validation.is_significant}, p={validation.p_value:.4f}\")\n",
|
|
" \n",
|
|
" # Test 2: Clear significant difference\n",
|
|
" results_a = [0.1 + 0.01 * np.random.randn() for _ in range(100)]\n",
|
|
" results_b = [0.2 + 0.01 * np.random.randn() for _ in range(100)]\n",
|
|
" \n",
|
|
" validation = validator.validate_comparison(results_a, results_b)\n",
|
|
" print(f\"✅ Clear difference test: significant={validation.is_significant}, p={validation.p_value:.4f}\")\n",
|
|
" print(f\" Effect size: {validation.effect_size:.3f}\")\n",
|
|
" print(f\" Recommendation: {validation.recommendation}\")\n",
|
|
" \n",
|
|
" # Test 3: Single result validation\n",
|
|
" mock_result = BenchmarkResult(\n",
|
|
" scenario=BenchmarkScenario.SINGLE_STREAM,\n",
|
|
" latencies=[0.1 + 0.01 * np.random.randn() for _ in range(200)],\n",
|
|
" throughput=1000,\n",
|
|
" accuracy=0.95\n",
|
|
" )\n",
|
|
" \n",
|
|
" validation = validator.validate_benchmark_result(mock_result)\n",
|
|
" print(f\"✅ Single result validation: {validation.recommendation}\")\n",
|
|
" print(f\" Confidence interval: ({validation.confidence_interval[0]:.4f}, {validation.confidence_interval[1]:.4f})\")\n",
|
|
" \n",
|
|
" print(\"✅ Statistical validation tests passed!\")\n",
|
|
"\n",
|
|
"# Run the test\n",
|
|
"test_statistical_validation()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "42c283a3",
|
|
"metadata": {
|
|
"cell_marker": "\"\"\"",
|
|
"lines_to_next_cell": 1
|
|
},
|
|
"source": [
|
|
"## Step 4: The TinyTorchPerf Framework - Putting It All Together\n",
|
|
"\n",
|
|
"### The Complete MLPerf-Inspired Framework\n",
|
|
"Now we combine all components into a professional benchmarking framework."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "eb8d0fe2",
|
|
"metadata": {
|
|
"lines_to_next_cell": 1,
|
|
"nbgrader": {
|
|
"grade": false,
|
|
"grade_id": "tinytorch-perf",
|
|
"locked": false,
|
|
"schema_version": 3,
|
|
"solution": true,
|
|
"task": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"#| export\n",
|
|
"class TinyTorchPerf:\n",
|
|
" \"\"\"\n",
|
|
" Complete MLPerf-inspired benchmarking framework for TinyTorch.\n",
|
|
" \n",
|
|
" TODO: Implement the complete benchmarking framework.\n",
|
|
" \n",
|
|
" UNDERSTANDING THE FRAMEWORK:\n",
|
|
" 1. Combines all benchmark scenarios\n",
|
|
" 2. Integrates statistical validation\n",
|
|
" 3. Provides easy-to-use API\n",
|
|
" 4. Generates professional reports\n",
|
|
" \n",
|
|
" IMPLEMENTATION APPROACH:\n",
|
|
" 1. Initialize with model and dataset\n",
|
|
" 2. Provide methods for each scenario\n",
|
|
" 3. Include statistical validation\n",
|
|
" 4. Generate comprehensive reports\n",
|
|
" \"\"\"\n",
|
|
" \n",
|
|
" def __init__(self):\n",
|
|
" self.scenarios = BenchmarkScenarios()\n",
|
|
" self.validator = StatisticalValidator()\n",
|
|
" self.model = None\n",
|
|
" self.dataset = None\n",
|
|
" self.results = {}\n",
|
|
" \n",
|
|
" def set_model(self, model: Callable):\n",
|
|
" \"\"\"Set the model to benchmark.\"\"\"\n",
|
|
" self.model = model\n",
|
|
" \n",
|
|
" def set_dataset(self, dataset: List):\n",
|
|
" \"\"\"Set the dataset for benchmarking.\"\"\"\n",
|
|
" self.dataset = dataset\n",
|
|
" \n",
|
|
" def run_single_stream(self, num_queries: int = 1000) -> BenchmarkResult:\n",
|
|
" \"\"\"\n",
|
|
" Run single-stream benchmark.\n",
|
|
" \n",
|
|
" TODO: Implement single-stream benchmark with validation.\n",
|
|
" \n",
|
|
" STEP-BY-STEP:\n",
|
|
" 1. Check that model and dataset are set\n",
|
|
" 2. Run single-stream scenario\n",
|
|
" 3. Validate results statistically\n",
|
|
" 4. Store results\n",
|
|
" 5. Return result\n",
|
|
" \"\"\"\n",
|
|
" ### BEGIN SOLUTION\n",
|
|
" if self.model is None or self.dataset is None:\n",
|
|
" raise ValueError(\"Model and dataset must be set before running benchmarks\")\n",
|
|
" \n",
|
|
" result = self.scenarios.single_stream(self.model, self.dataset, num_queries)\n",
|
|
" validation = self.validator.validate_benchmark_result(result)\n",
|
|
" \n",
|
|
" self.results['single_stream'] = {\n",
|
|
" 'result': result,\n",
|
|
" 'validation': validation\n",
|
|
" }\n",
|
|
" \n",
|
|
" return result\n",
|
|
" ### END SOLUTION\n",
|
|
" raise NotImplementedError(\"Student implementation required\")\n",
|
|
" \n",
|
|
" def run_server(self, target_qps: float = 10.0, duration: float = 60.0) -> BenchmarkResult:\n",
|
|
" \"\"\"\n",
|
|
" Run server benchmark.\n",
|
|
" \n",
|
|
" TODO: Implement server benchmark with validation.\n",
|
|
" \"\"\"\n",
|
|
" ### BEGIN SOLUTION\n",
|
|
" if self.model is None or self.dataset is None:\n",
|
|
" raise ValueError(\"Model and dataset must be set before running benchmarks\")\n",
|
|
" \n",
|
|
" result = self.scenarios.server(self.model, self.dataset, target_qps, duration)\n",
|
|
" validation = self.validator.validate_benchmark_result(result)\n",
|
|
" \n",
|
|
" self.results['server'] = {\n",
|
|
" 'result': result,\n",
|
|
" 'validation': validation\n",
|
|
" }\n",
|
|
" \n",
|
|
" return result\n",
|
|
" ### END SOLUTION\n",
|
|
" raise NotImplementedError(\"Student implementation required\")\n",
|
|
" \n",
|
|
" def run_offline(self, batch_size: int = 32) -> BenchmarkResult:\n",
|
|
" \"\"\"\n",
|
|
" Run offline benchmark.\n",
|
|
" \n",
|
|
" TODO: Implement offline benchmark with validation.\n",
|
|
" \"\"\"\n",
|
|
" ### BEGIN SOLUTION\n",
|
|
" if self.model is None or self.dataset is None:\n",
|
|
" raise ValueError(\"Model and dataset must be set before running benchmarks\")\n",
|
|
" \n",
|
|
" result = self.scenarios.offline(self.model, self.dataset, batch_size)\n",
|
|
" validation = self.validator.validate_benchmark_result(result)\n",
|
|
" \n",
|
|
" self.results['offline'] = {\n",
|
|
" 'result': result,\n",
|
|
" 'validation': validation\n",
|
|
" }\n",
|
|
" \n",
|
|
" return result\n",
|
|
" ### END SOLUTION\n",
|
|
" raise NotImplementedError(\"Student implementation required\")\n",
|
|
" \n",
|
|
" def run_all_scenarios(self, quick_test: bool = False) -> Dict[str, BenchmarkResult]:\n",
|
|
" \"\"\"\n",
|
|
" Run all benchmark scenarios.\n",
|
|
" \n",
|
|
" TODO: Implement comprehensive benchmarking.\n",
|
|
" \"\"\"\n",
|
|
" ### BEGIN SOLUTION\n",
|
|
" if quick_test:\n",
|
|
" # Quick test with smaller parameters\n",
|
|
" single_result = self.run_single_stream(num_queries=100)\n",
|
|
" server_result = self.run_server(target_qps=5.0, duration=10.0)\n",
|
|
" offline_result = self.run_offline(batch_size=16)\n",
|
|
" else:\n",
|
|
" # Full benchmarking\n",
|
|
" single_result = self.run_single_stream(num_queries=1000)\n",
|
|
" server_result = self.run_server(target_qps=10.0, duration=60.0)\n",
|
|
" offline_result = self.run_offline(batch_size=32)\n",
|
|
" \n",
|
|
" return {\n",
|
|
" 'single_stream': single_result,\n",
|
|
" 'server': server_result,\n",
|
|
" 'offline': offline_result\n",
|
|
" }\n",
|
|
" ### END SOLUTION\n",
|
|
" raise NotImplementedError(\"Student implementation required\")\n",
|
|
" \n",
|
|
" def compare_models(self, model_a: Callable, model_b: Callable, \n",
|
|
" scenario: str = 'single_stream') -> StatisticalValidation:\n",
|
|
" \"\"\"\n",
|
|
" Compare two models statistically.\n",
|
|
" \n",
|
|
" TODO: Implement model comparison.\n",
|
|
" \"\"\"\n",
|
|
" ### BEGIN SOLUTION\n",
|
|
" # Run both models on the same scenario\n",
|
|
" self.set_model(model_a)\n",
|
|
" if scenario == 'single_stream':\n",
|
|
" result_a = self.run_single_stream(num_queries=100)\n",
|
|
" elif scenario == 'server':\n",
|
|
" result_a = self.run_server(target_qps=5.0, duration=10.0)\n",
|
|
" else: # offline\n",
|
|
" result_a = self.run_offline(batch_size=16)\n",
|
|
" \n",
|
|
" self.set_model(model_b)\n",
|
|
" if scenario == 'single_stream':\n",
|
|
" result_b = self.run_single_stream(num_queries=100)\n",
|
|
" elif scenario == 'server':\n",
|
|
" result_b = self.run_server(target_qps=5.0, duration=10.0)\n",
|
|
" else: # offline\n",
|
|
" result_b = self.run_offline(batch_size=16)\n",
|
|
" \n",
|
|
" # Compare latencies\n",
|
|
" return self.validator.validate_comparison(result_a.latencies, result_b.latencies)\n",
|
|
" ### END SOLUTION\n",
|
|
" raise NotImplementedError(\"Student implementation required\")\n",
|
|
" \n",
|
|
" def generate_report(self) -> str:\n",
|
|
" \"\"\"\n",
|
|
" Generate a comprehensive benchmark report.\n",
|
|
" \n",
|
|
" TODO: Implement professional report generation.\n",
|
|
" \"\"\"\n",
|
|
" ### BEGIN SOLUTION\n",
|
|
" report = \"# TinyTorch Benchmark Report\\n\\n\"\n",
|
|
" \n",
|
|
" for scenario_name, scenario_data in self.results.items():\n",
|
|
" result = scenario_data['result']\n",
|
|
" validation = scenario_data['validation']\n",
|
|
" \n",
|
|
" report += f\"## {scenario_name.replace('_', ' ').title()} Scenario\\n\\n\"\n",
|
|
" report += f\"- **Throughput**: {result.throughput:.2f} samples/second\\n\"\n",
|
|
" report += f\"- **Mean Latency**: {statistics.mean(result.latencies)*1000:.2f} ms\\n\"\n",
|
|
" report += f\"- **90th Percentile**: {result.latencies[int(0.9*len(result.latencies))]*1000:.2f} ms\\n\"\n",
|
|
" report += f\"- **95th Percentile**: {result.latencies[int(0.95*len(result.latencies))]*1000:.2f} ms\\n\"\n",
|
|
" report += f\"- **Statistical Validation**: {validation.recommendation}\\n\\n\"\n",
|
|
" \n",
|
|
" return report\n",
|
|
" ### END SOLUTION\n",
|
|
" raise NotImplementedError(\"Student implementation required\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "c27eb526",
|
|
"metadata": {
|
|
"cell_marker": "\"\"\"",
|
|
"lines_to_next_cell": 1
|
|
},
|
|
"source": [
|
|
"### 🧪 Unit Test: TinyTorchPerf Framework\n",
|
|
"\n",
|
|
"Let's test our complete benchmarking framework."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "469576f9",
|
|
"metadata": {
|
|
"nbgrader": {
|
|
"grade": false,
|
|
"grade_id": "test-framework",
|
|
"locked": false,
|
|
"schema_version": 3,
|
|
"solution": false,
|
|
"task": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"def test_tinytorch_perf():\n",
|
|
" \"\"\"Test the complete TinyTorchPerf framework.\"\"\"\n",
|
|
" print(\"🔬 Unit Test: TinyTorchPerf Framework...\")\n",
|
|
" \n",
|
|
" # Create test model and dataset\n",
|
|
" def test_model(sample):\n",
|
|
" time.sleep(0.001) # Simulate processing\n",
|
|
" return {\"prediction\": np.random.rand(5)}\n",
|
|
" \n",
|
|
" test_dataset = [{\"data\": np.random.rand(10)} for _ in range(50)]\n",
|
|
" \n",
|
|
" # Test the framework\n",
|
|
" benchmark = TinyTorchPerf()\n",
|
|
" benchmark.set_model(test_model)\n",
|
|
" benchmark.set_dataset(test_dataset)\n",
|
|
" \n",
|
|
" # Test individual scenarios\n",
|
|
" single_result = benchmark.run_single_stream(num_queries=20)\n",
|
|
" assert single_result.scenario == BenchmarkScenario.SINGLE_STREAM\n",
|
|
" print(f\"✅ Single-stream: {single_result.throughput:.2f} samples/sec\")\n",
|
|
" \n",
|
|
" server_result = benchmark.run_server(target_qps=5.0, duration=2.0)\n",
|
|
" assert server_result.scenario == BenchmarkScenario.SERVER\n",
|
|
" print(f\"✅ Server: {server_result.throughput:.2f} QPS\")\n",
|
|
" \n",
|
|
" offline_result = benchmark.run_offline(batch_size=10)\n",
|
|
" assert offline_result.scenario == BenchmarkScenario.OFFLINE\n",
|
|
" print(f\"✅ Offline: {offline_result.throughput:.2f} samples/sec\")\n",
|
|
" \n",
|
|
" # Test comprehensive benchmarking\n",
|
|
" all_results = benchmark.run_all_scenarios(quick_test=True)\n",
|
|
" assert len(all_results) == 3\n",
|
|
" print(f\"✅ All scenarios: {list(all_results.keys())}\")\n",
|
|
" \n",
|
|
" # Test model comparison\n",
|
|
" def slower_model(sample):\n",
|
|
" time.sleep(0.002) # Twice as slow\n",
|
|
" return {\"prediction\": np.random.rand(5)}\n",
|
|
" \n",
|
|
" comparison = benchmark.compare_models(test_model, slower_model)\n",
|
|
" print(f\"✅ Model comparison: {comparison.recommendation}\")\n",
|
|
" \n",
|
|
" # Test report generation\n",
|
|
" report = benchmark.generate_report()\n",
|
|
" assert \"TinyTorch Benchmark Report\" in report\n",
|
|
" print(\"✅ Report generation working\")\n",
|
|
" \n",
|
|
" print(\"✅ Complete TinyTorchPerf framework working!\")\n",
|
|
"\n",
|
|
"# Run the test\n",
|
|
"test_tinytorch_perf()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "eb9212b3",
|
|
"metadata": {
|
|
"cell_marker": "\"\"\"",
|
|
"lines_to_next_cell": 1
|
|
},
|
|
"source": [
|
|
"## Step 5: Professional Reporting - Project-Ready Results\n",
|
|
"\n",
|
|
"### Why Professional Reports Matter\n",
|
|
"Your ML projects need:\n",
|
|
"- **Clear performance metrics** for presentations\n",
|
|
"- **Statistical validation** for credibility\n",
|
|
"- **Comparison baselines** for context\n",
|
|
"- **Professional formatting** for academic/industry standards"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "1f60ffb3",
|
|
"metadata": {
|
|
"lines_to_next_cell": 1,
|
|
"nbgrader": {
|
|
"grade": false,
|
|
"grade_id": "performance-reporter",
|
|
"locked": false,
|
|
"schema_version": 3,
|
|
"solution": true,
|
|
"task": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"#| export\n",
|
|
"class PerformanceReporter:\n",
|
|
" \"\"\"\n",
|
|
" Generates professional performance reports for ML projects.\n",
|
|
" \n",
|
|
" TODO: Implement professional report generation.\n",
|
|
" \n",
|
|
" UNDERSTANDING PROFESSIONAL REPORTS:\n",
|
|
" 1. Executive summary with key metrics\n",
|
|
" 2. Detailed methodology section\n",
|
|
" 3. Statistical validation results\n",
|
|
" 4. Comparison with baselines\n",
|
|
" 5. Recommendations for improvement\n",
|
|
" \"\"\"\n",
|
|
" \n",
|
|
" def __init__(self):\n",
|
|
" self.reports = []\n",
|
|
" \n",
|
|
" def generate_project_report(self, benchmark_results: Dict[str, BenchmarkResult], \n",
|
|
" model_name: str = \"TinyTorch Model\") -> str:\n",
|
|
" \"\"\"\n",
|
|
" Generate a professional performance report for ML projects.\n",
|
|
" \n",
|
|
" TODO: Implement project report generation.\n",
|
|
" \n",
|
|
" STEP-BY-STEP:\n",
|
|
" 1. Create executive summary\n",
|
|
" 2. Add methodology section\n",
|
|
" 3. Present detailed results\n",
|
|
" 4. Include statistical validation\n",
|
|
" 5. Add recommendations\n",
|
|
" \"\"\"\n",
|
|
" ### BEGIN SOLUTION\n",
|
|
" report = f\"\"\"# {model_name} Performance Report\n",
|
|
"\n",
|
|
"## Executive Summary\n",
|
|
"\n",
|
|
"This report presents comprehensive performance benchmarking results for {model_name} using MLPerf-inspired methodology. The evaluation covers three standard scenarios: single-stream (latency), server (throughput), and offline (batch processing).\n",
|
|
"\n",
|
|
"### Key Findings\n",
|
|
"\"\"\"\n",
|
|
" \n",
|
|
" # Add key metrics\n",
|
|
" for scenario_name, result in benchmark_results.items():\n",
|
|
" mean_latency = statistics.mean(result.latencies) * 1000\n",
|
|
" p90_latency = result.latencies[int(0.9 * len(result.latencies))] * 1000\n",
|
|
" \n",
|
|
" report += f\"- **{scenario_name.replace('_', ' ').title()}**: {result.throughput:.2f} samples/sec, \"\n",
|
|
" report += f\"{mean_latency:.2f}ms mean latency, {p90_latency:.2f}ms 90th percentile\\n\"\n",
|
|
" \n",
|
|
" report += \"\"\"\n",
|
|
"## Methodology\n",
|
|
"\n",
|
|
"### Benchmark Framework\n",
|
|
"- **Architecture**: MLPerf-inspired four-component system\n",
|
|
"- **Scenarios**: Single-stream, server, and offline evaluation\n",
|
|
"- **Statistical Validation**: Multiple runs with confidence intervals\n",
|
|
"- **Metrics**: Latency distribution, throughput, accuracy\n",
|
|
"\n",
|
|
"### Test Environment\n",
|
|
"- **Hardware**: Standard development machine\n",
|
|
"- **Software**: TinyTorch framework\n",
|
|
"- **Dataset**: Standardized evaluation dataset\n",
|
|
"- **Validation**: Statistical significance testing\n",
|
|
"\n",
|
|
"## Detailed Results\n",
|
|
"\n",
|
|
"\"\"\"\n",
|
|
" \n",
|
|
" # Add detailed results for each scenario\n",
|
|
" for scenario_name, result in benchmark_results.items():\n",
|
|
" report += f\"### {scenario_name.replace('_', ' ').title()} Scenario\\n\\n\"\n",
|
|
" \n",
|
|
" latencies_ms = [l * 1000 for l in result.latencies]\n",
|
|
" \n",
|
|
" report += f\"- **Sample Count**: {len(result.latencies)}\\n\"\n",
|
|
" report += f\"- **Mean Latency**: {statistics.mean(latencies_ms):.2f} ms\\n\"\n",
|
|
" report += f\"- **Median Latency**: {statistics.median(latencies_ms):.2f} ms\\n\"\n",
|
|
" report += f\"- **90th Percentile**: {latencies_ms[int(0.9 * len(latencies_ms))]:.2f} ms\\n\"\n",
|
|
" report += f\"- **95th Percentile**: {latencies_ms[int(0.95 * len(latencies_ms))]:.2f} ms\\n\"\n",
|
|
" report += f\"- **Standard Deviation**: {statistics.stdev(latencies_ms):.2f} ms\\n\"\n",
|
|
" report += f\"- **Throughput**: {result.throughput:.2f} samples/second\\n\"\n",
|
|
" \n",
|
|
" if result.accuracy > 0:\n",
|
|
" report += f\"- **Accuracy**: {result.accuracy:.4f}\\n\"\n",
|
|
" \n",
|
|
" report += \"\\n\"\n",
|
|
" \n",
|
|
" report += \"\"\"## Statistical Validation\n",
|
|
"\n",
|
|
"All results include proper statistical validation:\n",
|
|
"- Multiple independent runs for reliability\n",
|
|
"- Confidence intervals for key metrics\n",
|
|
"- Outlier detection and handling\n",
|
|
"- Significance testing for comparisons\n",
|
|
"\n",
|
|
"## Recommendations\n",
|
|
"\n",
|
|
"Based on the benchmark results:\n",
|
|
"1. **Performance Characteristics**: Model shows consistent performance across scenarios\n",
|
|
"2. **Optimization Opportunities**: Focus on reducing tail latency for production deployment\n",
|
|
"3. **Scalability**: Server scenario results indicate good potential for production scaling\n",
|
|
"4. **Further Testing**: Consider testing with larger datasets and different hardware configurations\n",
|
|
"\n",
|
|
"## Conclusion\n",
|
|
"\n",
|
|
"This comprehensive benchmarking demonstrates {model_name}'s performance characteristics using industry-standard methodology. The results provide a solid foundation for production deployment decisions and further optimization efforts.\n",
|
|
"\"\"\"\n",
|
|
" \n",
|
|
" return report\n",
|
|
" ### END SOLUTION\n",
|
|
" raise NotImplementedError(\"Student implementation required\")\n",
|
|
" \n",
|
|
" def save_report(self, report: str, filename: str = \"benchmark_report.md\"):\n",
|
|
" \"\"\"Save report to file.\"\"\"\n",
|
|
" with open(filename, 'w') as f:\n",
|
|
" f.write(report)\n",
|
|
" print(f\"📄 Report saved to {filename}\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "5c16121e",
|
|
"metadata": {
|
|
"cell_marker": "\"\"\"",
|
|
"lines_to_next_cell": 1
|
|
},
|
|
"source": [
|
|
"### 🧪 Unit Test: Performance Reporter\n",
|
|
"\n",
|
|
"Let's test our professional reporting system."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "6bb183d2",
|
|
"metadata": {
|
|
"nbgrader": {
|
|
"grade": false,
|
|
"grade_id": "test-reporter",
|
|
"locked": false,
|
|
"schema_version": 3,
|
|
"solution": false,
|
|
"task": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"def test_performance_reporter():\n",
|
|
" \"\"\"Test the performance reporter.\"\"\"\n",
|
|
" print(\"🔬 Unit Test: Performance Reporter...\")\n",
|
|
" \n",
|
|
" # Create mock benchmark results\n",
|
|
" mock_results = {\n",
|
|
" 'single_stream': BenchmarkResult(\n",
|
|
" scenario=BenchmarkScenario.SINGLE_STREAM,\n",
|
|
" latencies=[0.01 + 0.002 * np.random.randn() for _ in range(100)],\n",
|
|
" throughput=95.0,\n",
|
|
" accuracy=0.942\n",
|
|
" ),\n",
|
|
" 'server': BenchmarkResult(\n",
|
|
" scenario=BenchmarkScenario.SERVER,\n",
|
|
" latencies=[0.012 + 0.003 * np.random.randn() for _ in range(150)],\n",
|
|
" throughput=87.0,\n",
|
|
" accuracy=0.938\n",
|
|
" ),\n",
|
|
" 'offline': BenchmarkResult(\n",
|
|
" scenario=BenchmarkScenario.OFFLINE,\n",
|
|
" latencies=[0.008 + 0.001 * np.random.randn() for _ in range(50)],\n",
|
|
" throughput=120.0,\n",
|
|
" accuracy=0.945\n",
|
|
" )\n",
|
|
" }\n",
|
|
" \n",
|
|
" # Test report generation\n",
|
|
" reporter = PerformanceReporter()\n",
|
|
" report = reporter.generate_project_report(mock_results, \"My Project Model\")\n",
|
|
" \n",
|
|
" # Verify report content\n",
|
|
" assert \"Performance Report\" in report\n",
|
|
" assert \"Executive Summary\" in report\n",
|
|
" assert \"Methodology\" in report\n",
|
|
" assert \"Detailed Results\" in report\n",
|
|
" assert \"Statistical Validation\" in report\n",
|
|
" assert \"Recommendations\" in report\n",
|
|
" \n",
|
|
" print(\"✅ Report generated successfully\")\n",
|
|
" print(f\"✅ Report length: {len(report)} characters\")\n",
|
|
" print(f\"✅ Contains all required sections\")\n",
|
|
" \n",
|
|
" # Test saving\n",
|
|
" reporter.save_report(report, \"test_report.md\")\n",
|
|
" print(\"✅ Report saving working\")\n",
|
|
" \n",
|
|
" print(\"✅ Performance reporter tests passed!\")\n",
|
|
"\n",
|
|
"# Run the test\n",
|
|
"test_performance_reporter()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "b2f20c6c",
|
|
"metadata": {
|
|
"cell_marker": "\"\"\"",
|
|
"lines_to_next_cell": 1
|
|
},
|
|
"source": [
|
|
"## Comprehensive Integration Test\n",
|
|
"\n",
|
|
"Let's test everything together with a realistic TinyTorch model."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "c2755c20",
|
|
"metadata": {
|
|
"nbgrader": {
|
|
"grade": false,
|
|
"grade_id": "integration-test",
|
|
"locked": false,
|
|
"schema_version": 3,
|
|
"solution": false,
|
|
"task": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"def test_comprehensive_benchmarking():\n",
|
|
" \"\"\"Test the complete benchmarking system with a realistic model.\"\"\"\n",
|
|
" print(\"🔬 Comprehensive Integration Test...\")\n",
|
|
" \n",
|
|
" # Create a realistic TinyTorch model\n",
|
|
" def create_simple_model():\n",
|
|
" \"\"\"Create a simple classification model for testing.\"\"\"\n",
|
|
" def model(sample):\n",
|
|
" # Simulate a simple neural network\n",
|
|
" x = np.array(sample['data'])\n",
|
|
" \n",
|
|
" # Layer 1: 10 -> 5\n",
|
|
" W1 = np.random.randn(10, 5) * 0.1\n",
|
|
" b1 = np.zeros(5)\n",
|
|
" h1 = np.maximum(0, x @ W1 + b1) # ReLU\n",
|
|
" \n",
|
|
" # Layer 2: 5 -> 3\n",
|
|
" W2 = np.random.randn(5, 3) * 0.1\n",
|
|
" b2 = np.zeros(3)\n",
|
|
" output = h1 @ W2 + b2\n",
|
|
" \n",
|
|
" # Simulate some processing time\n",
|
|
" time.sleep(0.001)\n",
|
|
" \n",
|
|
" return {\"prediction\": output}\n",
|
|
" \n",
|
|
" return model\n",
|
|
" \n",
|
|
" # Create test dataset\n",
|
|
" test_dataset = []\n",
|
|
" for i in range(100):\n",
|
|
" sample = {\n",
|
|
" 'data': np.random.randn(10),\n",
|
|
" 'target': np.random.randint(0, 3)\n",
|
|
" }\n",
|
|
" test_dataset.append(sample)\n",
|
|
" \n",
|
|
" # Test complete workflow\n",
|
|
" model = create_simple_model()\n",
|
|
" \n",
|
|
" # 1. Run comprehensive benchmarking\n",
|
|
" benchmark = TinyTorchPerf()\n",
|
|
" benchmark.set_model(model)\n",
|
|
" benchmark.set_dataset(test_dataset)\n",
|
|
" \n",
|
|
" print(\"📊 Running comprehensive benchmarking...\")\n",
|
|
" all_results = benchmark.run_all_scenarios(quick_test=True)\n",
|
|
" \n",
|
|
" # 2. Generate professional report\n",
|
|
" reporter = PerformanceReporter()\n",
|
|
" report = reporter.generate_project_report(all_results, \"TinyTorch CNN Model\")\n",
|
|
" \n",
|
|
" # 3. Validate results\n",
|
|
" for scenario_name, result in all_results.items():\n",
|
|
" assert result.throughput > 0, f\"{scenario_name} should have positive throughput\"\n",
|
|
" assert len(result.latencies) > 0, f\"{scenario_name} should have latency measurements\"\n",
|
|
" print(f\"✅ {scenario_name}: {result.throughput:.2f} samples/sec\")\n",
|
|
" \n",
|
|
" # 4. Test model comparison\n",
|
|
" def create_slower_model():\n",
|
|
" \"\"\"Create a slower model for comparison.\"\"\"\n",
|
|
" def model(sample):\n",
|
|
" x = np.array(sample['data'])\n",
|
|
" W1 = np.random.randn(10, 5) * 0.1\n",
|
|
" b1 = np.zeros(5)\n",
|
|
" h1 = np.maximum(0, x @ W1 + b1)\n",
|
|
" \n",
|
|
" W2 = np.random.randn(5, 3) * 0.1\n",
|
|
" b2 = np.zeros(3)\n",
|
|
" output = h1 @ W2 + b2\n",
|
|
" \n",
|
|
" time.sleep(0.002) # Slower\n",
|
|
" return {\"prediction\": output}\n",
|
|
" \n",
|
|
" return model\n",
|
|
" \n",
|
|
" slower_model = create_slower_model()\n",
|
|
" comparison = benchmark.compare_models(model, slower_model)\n",
|
|
" print(f\"✅ Model comparison: {comparison.recommendation}\")\n",
|
|
" \n",
|
|
" # 5. Test report quality\n",
|
|
" assert len(report) > 1000, \"Report should be comprehensive\"\n",
|
|
" print(f\"✅ Generated {len(report)} character report\")\n",
|
|
" \n",
|
|
" print(\"✅ Comprehensive integration test passed!\")\n",
|
|
" print(\"🎉 Complete benchmarking system working!\")\n",
|
|
"\n",
|
|
"# Run the comprehensive test\n",
|
|
"test_comprehensive_benchmarking()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "d7e7df72",
|
|
"metadata": {
|
|
"cell_marker": "\"\"\""
|
|
},
|
|
"source": [
|
|
"## 🧪 Module Testing\n",
|
|
"\n",
|
|
"Time to test your implementation! This section uses TinyTorch's standardized testing framework to ensure your implementation works correctly.\n",
|
|
"\n",
|
|
"**This testing section is locked** - it provides consistent feedback across all modules and cannot be modified."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "730159c8",
|
|
"metadata": {
|
|
"nbgrader": {
|
|
"grade": false,
|
|
"grade_id": "standardized-testing",
|
|
"locked": true,
|
|
"schema_version": 3,
|
|
"solution": false,
|
|
"task": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# =============================================================================\n",
|
|
"# STANDARDIZED MODULE TESTING - DO NOT MODIFY\n",
|
|
"# This cell is locked to ensure consistent testing across all TinyTorch modules\n",
|
|
"# =============================================================================\n",
|
|
"\n",
|
|
"if __name__ == \"__main__\":\n",
|
|
" from tito.tools.testing import run_module_tests_auto\n",
|
|
" \n",
|
|
" # Automatically discover and run all tests in this module\n",
|
|
" success = run_module_tests_auto(\"Benchmarking\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "05e49926",
|
|
"metadata": {
|
|
"cell_marker": "\"\"\""
|
|
},
|
|
"source": [
|
|
"## 🎯 Module Summary: Systematic ML Performance Evaluation\n",
|
|
"\n",
|
|
"### What You've Built\n",
|
|
"You've implemented a comprehensive MLPerf-inspired benchmarking framework:\n",
|
|
"\n",
|
|
"1. **Benchmark Scenarios**: Single-stream (latency), server (throughput), and offline (batch processing)\n",
|
|
"2. **Statistical Validation**: Confidence intervals, significance testing, and effect size calculation\n",
|
|
"3. **MLPerf Architecture**: Four-component system with load generator, model, dataset, and evaluation\n",
|
|
"4. **Professional Reporting**: Generate conference-quality performance reports with proper methodology\n",
|
|
"5. **Model Comparison**: Systematic comparison framework with statistical validation\n",
|
|
"\n",
|
|
"### Key Insights\n",
|
|
"- **Systematic evaluation beats intuition**: Proper benchmarking reveals true performance characteristics\n",
|
|
"- **Statistics matter**: Single measurements are meaningless; confidence intervals provide real insights\n",
|
|
"- **Scenarios capture reality**: Different use cases (mobile, server, batch) require different metrics\n",
|
|
"- **Reproducibility is crucial**: Others must be able to verify your results\n",
|
|
"- **Professional presentation**: Clear methodology and statistical validation build credibility\n",
|
|
"\n",
|
|
"### Real-World Connections\n",
|
|
"- **MLPerf**: Uses identical four-component architecture and scenario patterns\n",
|
|
"- **Production systems**: A/B testing frameworks follow these statistical principles\n",
|
|
"- **Research papers**: Proper experimental methodology is required for publication\n",
|
|
"- **ML engineering**: Systematic evaluation prevents costly production mistakes\n",
|
|
"- **Open source**: Contributing benchmarks to libraries like PyTorch and TensorFlow\n",
|
|
"\n",
|
|
"### Next Steps\n",
|
|
"In real ML systems, you'd:\n",
|
|
"1. **GPU benchmarking**: Extend to CUDA/OpenCL performance measurement\n",
|
|
"2. **Distributed evaluation**: Scale benchmarking across multiple machines\n",
|
|
"3. **Continuous monitoring**: Integrate with CI/CD pipelines for regression detection\n",
|
|
"4. **Domain-specific metrics**: Develop specialized benchmarks for your problem domain\n",
|
|
"5. **Hardware optimization**: Evaluate performance across different architectures\n",
|
|
"\n",
|
|
"### 🏆 Achievement Unlocked\n",
|
|
"You've mastered systematic ML evaluation using industry-standard methodology. You understand how to design proper experiments, validate results statistically, and present findings professionally!\n",
|
|
"\n",
|
|
"**You've completed the TinyTorch Benchmarking module!** 🎉"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"jupytext": {
|
|
"main_language": "python"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 5
|
|
}
|